-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path2_data-structures.qmd
788 lines (548 loc) · 19.7 KB
/
2_data-structures.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
---
title: "R's data structures and data types"
author: "Software Carpentry / Jelmer Poelstra"
date: 2025-02-11
editor_options:
chunk_output_type: console
---
-----
<br>
## Introduction
#### What we'll cover
In this session, we will learn about R's **data structures** and **data types**.
- **Data structures** are the kinds of objects that R can store data in.
Here, we will cover the two most common ones: _vectors_ and _data frames_.
- **Data types** are how R distinguishes between different kinds of data like numbers
and character strings.
Here, we'll talk about the 4 main data types:
`character`, `integer`, `double`, and `logical.`
We'll also cover `factor`s, a construct related to the data types.
#### Setting up
To make it easier to keep track of what we do,
we'll write our code in a script (and send it to the console from there) --
here is how to create and save a new R script:
1. _Open a new R script_ (Click the **`+`** symbol in toolbar at the top, then click `R Script`)^[
Or Click `File` => `New file` => `R Script`.].
2. _Save the script_ straight away as `data-structures.R` --
you can save it anywhere you like, though it is probably best to save it in a
folder specifically for this workshop.
3. If you want the section headers as comments in your script,
as in the script I am showing you now,
then copy-and-paste the following into your script:
<details><summary>Section headers for your script _(Click to expand)_</summary>
```{r}
# 2 - Vectors ------------------------------------------------------------------
# 2.1 - Single-element vectors and quoting
# 2.2 - Multi-element vectors
# 2.3 - Vectorization
# Challenge 1
# A. Start by making a vector x with the whole numbers 1 through 26.
# Then, subtract 0.5 from each element in the vector and save the result in vector y.
# Check your results by printing both vectors.
# B. What do you think will be the result of the following operation?
# 1:5 * 1:5
# 2.4 - Exploring vectors
# 2.5 - Extracting element from vectors
# 3 - Data frames --------------------------------------------------------------
# 3.1 - Data frame intro
# 4 - Data types ---------------------------------------------------------------
# 4.1 - R's main data types
# 4.2 - Factors
# 4.3 - A vector can only contain one data type
# Challenge 2
# What type of vector (if any) do you think each of the following will produce?
# Try it out and see if you were right.
# typeof("TRUE")
# typeof(banana)
# typeof(c(2, 6, "3"))
# Bonus / trick question:
# typeof(18, 3)
# 4.4 - Automatic type coercion
# 4.5 - Manual type conversion
```
</details>
<br>
## Data structure 1: Vectors
The first data structure we will explore is the simplest: the vector.
A vector in R is essentially **a collection of one or more items**.
Moving forward, we'll call such individual items "elements".
### Single-element vectors and quoting
Vectors can consist of just a single element,
so each of the two lines of code below creates a vector:
```{r}
vector1 <- 8
vector2 <- "panda"
```
Two things are worth noting about the `"panda"` example,
which is a so-called **character string** (or _string_ for short):
- `"panda"` constitutes _one element_, not 5 (its number of letters).
- Unlike when dealing with numbers, we have to _quote the string_.^[
Either double quotes (`"..."`) or single quotes (`'...'`) work,
but the former are most commonly used by convention.]
Character strings need to be quoted because they are otherwise interpreted as
R objects -- for example, because our vectors `vector1` and `vector2` are objects,
we refer to them without quotes:
```{r}
# [Note that R will show auto-complete options after you type 3 characters]
vector1
vector2
```
Therefore, the code below doesn't work, because there is no _object called `panda`_:
```{r, error=TRUE}
vector_fail <- panda
```
<hr style="height:1pt; visibility:hidden;" />
### Multi-element vectors
A common way to make vectors with **multiple elements** is
by using the `c` (combine) function:
```{r}
c(2, 6, 3)
```
::: {.callout-note appearance='minimal'}
_Unlike in the first couple of vector examples,_
_we didn't save the above vector to an object:_
_now the vector simply printed to the console -- but it is created all the same._
:::
`c()` can also **append** elements to an existing vector:
```{r}
# First we create a vector:
vector_to_append <- c("vhagar", "meleys")
vector_to_append
# Then we append another element to it:
c(vector_to_append, "balerion the dread")
```
<hr style="height:1pt; visibility:hidden;" />
To create vectors with **series of numbers**, a couple of shortcuts are available.
First, you can make series of whole numbers with the `:` operator:
```{r}
1:10
```
Second, you can use a function like `seq()` for fine control over the sequence:
```{r}
myseq <- seq(from = 6, to = 8, by = 0.2)
myseq
```
<hr style="height:1pt; visibility:hidden;" />
### Vectorization
Consider the output of this command:
```{r}
myseq * 2
```
Above, **every individual element in `myseq` was multiplied by 2**.
We call this behavior "vectorization" and this is a key feature of the R language.
(Alternatively, you may have expected this code to _repeat_ `myseq` twice,
but this did not happen!)
::: {.callout-note appearance='minimal'}
For more about vectorization, see
[episode 9](https://swcarpentry.github.io/r-novice-gapminder/instructor/09-vectorization.html)
from the Carpentries lesson that this material is based on.
:::
<hr style="height:1pt; visibility:hidden;" />
::: exercise
### {{< fa user-edit >}} Challenge 1 {-}
<hr style="height:1pt; visibility:hidden;" />
**A.**
Start by making a vector `x` with the whole numbers 1 through 26.
Then, subtract 0.5 from each element in the vector and save the result in vector `y`.
Check your results by printing both vectors.
<details><summary>Click for the solution</summary>
```{r}
x <- 1:26
x
y <- x - 0.5
y
```
</details>
<hr style="height:1pt; visibility:hidden;" />
**B.**
What do you think will be the result of the following operation?
Try it out and see if you were right.
```{r, eval=FALSE}
1:5 * 1:5
```
<details><summary>Click for the solution</summary>
```{r}
1:5 * 1:5
```
Both vectors are of length 5 which will lead to "element-wise matching":
the first element in the first vector will be multiplied with the first element
in the second vector,
the second element in the first vector will be multiplied with the second element
in the second vector, and so on.
</details>
:::
<hr style="height:1pt; visibility:hidden;" />
### Exploring vectors
R has many built-in functions to get information about vectors and other types of
objects, such as:
Get the **first and last few elements**, respectively, with `head()` and `tail()`:
```{r}
# Print the first 6 elements:
head(myseq)
# Both head and tail take an argument `n` to specify the number of elements to print:
head(myseq, n = 2)
# Print the last 6 elements:
tail(myseq)
```
<hr style="height:1pt; visibility:hidden;" />
Get the **number of elements** with `length()`:
```{r}
length(myseq)
```
<hr style="height:1pt; visibility:hidden;" />
Get **arithmetic summaries** like `sum()` and `mean()` for vectors with numbers:
```{r}
# sum() will sum the values of all elements
sum(myseq)
# mean() will compute the mean (average) across all elements
mean(myseq)
```
<br>
### Extracting elements from vectors
Extracting element from objects like vectors is often referred to as **"indexing"**.
In R, we can do this using bracket notation -- for example:
- Get the second element:
```{r}
myseq[2]
```
- Get the second through the fifth elements:
```{r}
myseq[2:5]
```
- Get the first and eight elements:
```{r}
myseq[c(1, 8)]
```
To put this in a general way:
we can extract elements from a vector by using another vector,
whose values are the positional indices of the elements in the original vector.
<br>
## Data structure 2: Data frames
### R stores tabular data in "data frames"
One of R's most powerful features is its **built-in ability to deal with tabular data** --
i.e., data with rows and columns like you are familiar with from spreadsheets
like those you create with Excel.
In R, tabular data is stored in objects that are called "**data frames**",
the second R data structure we'll cover in some depth.
Let's start by making a toy data frame with information about 3 cats:
```{r}
cats <- data.frame(
name = c("Luna", "Thomas", "Daisy"),
coat = c("calico", "black", "tabby"),
weight = c(2.1, 5.0, 3.2)
)
cats
```
Above:
- We created 3 vectors and pasted them side-by-side to create a data frame
in which _each vector constitutes a column_.
- We gave each vector a name (e.g., `coat`), and those names became the _column names_.
- The resulting data frame has 3 rows (one for each cat) and 3 columns
(each with a type of info about the cats, like coat color).
Data frames are typically (and best) organized like above, where:
- Each column contains a different **"variable"** (e.g. coat color, weight)
- Each row contains a different **"observation"** (data on e.g. one cat/person/sample)
That's all we'll say about data frames for now,
but in today's remaining sessions we will explore this key R data structure more!
<br>
## Data types
### R's main Data Types
R distinguishes different kinds of data, such as character strings and numbers,
in a formal way, using several pre-defined "data types".
The behavior of R in various operations will depend heavily on the data type --
for example, the below fails:
```{r, error=TRUE}
"valerion" * 5
```
We can ask what type of data something is in R using the `typeof()` function:
```{r}
typeof("valerion")
```
R sets the data type of `"valerion"` to `character`,
which we commonly refer to as character strings or strings.
In formal terms, the failed command did not work because R will not allow us to
perform mathematical functions on vectors of type `character`.
The `character` data type most commonly contains letters,
but anything that is placed between quotes (`"..."`) will be interpreted as the
`character` data type --- even plain numbers:
```{r}
typeof("5")
```
<hr style="height:1pt; visibility:hidden;" />
Besides `character`, the other 3 **common data types** are:
- `double` / `numeric` -- numbers that can have decimal points:
```{r}
typeof(3.14)
```
- `integer` -- whole numbers only:
```{r}
typeof(1:3)
```
- `logical` (either `TRUE` or `FALSE` -- unquoted!):
```{r}
typeof(TRUE)
```
<hr style="height:1pt; visibility:hidden;" />
### Factors
**Categorical** data, like treatments in an experiment,
can be stored as "factors" in R.
Factors are useful for statistical analyses and for plotting,
e.g. because they allow you to specify a custom order.
```{r}
diet_vec <- c("high", "medium", "low", "low", "medium")
diet_vec
factor(diet_vec)
```
In the example above, we turned a character vector into a factor.
Its "levels" (low, medium, high) are sorted alphabetically by default,
but we can manually specify an order that makes more sense:
```{r}
diet_fct <- factor(diet_vec, levels = c("low", "medium", "high"))
diet_fct
```
This ordering would be automatically respected in plots and statistical analyses.
<hr style="height:1pt; visibility:hidden;" />
::: {.callout-warning collapse='true'}
### Oddly, factors are technically not a data type _(Click to expand)_
For most intents and purposes,
it makes sense to think of factors as another data type, even though technically,
they are a kind of data structure build on the `integer` data type:
```{r}
typeof(diet_fct)
```
:::
<hr style="height:1pt; visibility:hidden;" />
### A vector can only contain one data type
Individual vectors, and therefore also individual columns in data frames,
can only **be composed of a single data type**.
R will silently pick the "best-fitting" data type when you enter or read data into
a data frame.
So let's see what the data types are in our `cats` data frame:
```{r}
str(cats)
```
- The `name` and `coat` columns are `character`, abbreviated `chr`.
- The `weight` column is `double`/`numeric`, abbreviated `num`.
<hr style="height:1pt; visibility:hidden;" />
::: exercise
### {{< fa user-edit >}} Challenge 2 {-}
What type of vector (if any) do you think each of the following will produce?
Try it out and see if you were right.
```{r, eval=FALSE}
typeof("TRUE")
typeof(banana)
typeof(c(2, 6, "3"))
```
Bonus / trick question:
```{r, eval=FALSE}
typeof(18, 3)
```
<details><summary>Click for the solutions</summary>
```{r}
typeof("TRUE")
```
1. `"TRUE"` is `character` (and not `logical`) because of the quotes around it.
```{r, error=TRUE}
typeof(banana)
```
2. Recall the earlier example:
this returns an error because the object `banana` does not exist.
Any unquoted string (that is not a special keyword like `TRUE` and `FALSE`)
is interpreted as a reference to an object in R.
```{r}
typeof(c(2, 6, "3"))
```
3. We'll talk about why this produces a `character` vector in the next section.
```{r, error=TRUE}
typeof(18, 3)
```
4. This produces an error because the `typeof()` only accepts a single argument,
which is an R object like a vector.
Because we did not wrap `18, 3` within `c()` (i.e. we did not use `c(18, 3)`),
we ended up passing **two** arguments to the function, and this resulted in
an error.
If you guessed that it would have TWICE returned `integer` (or `double`),
you were on the right track: you couldn't have known that the function does
not accept multiple objects.
</details>
:::
<hr style="height:1pt; visibility:hidden;" />
### Automatic Type Coercion
That a character vector was returned by `c(2, 6, "3")` in the challenge above
is due to something called **type coercion**.
When R encounters a _mix of types_ (here, numbers and characters)
to be combined into a single vector, it will force them all to be the same type.
It "must" do this because, as pointed out above,
a vector can consist of only a single data type.
Type coercion can be the source of many surprises,
and is one reason we need to be aware of the basic data types and how R will
interpret them.
<hr style="height:1pt; visibility:hidden;" />
### Manual Type Conversion
Luckily, you are not simply at the mercy of whatever R decides to do automatically,
but can convert vectors at will using the `as.` group of functions:
:::callout-tip
### Try to use RStudio's auto-complete functionality here: type "`as.`" and then press the <kbd>Tab</kbd> key.
:::
```{r}
as.integer(c("0", "2", "4"))
as.character(c(0, 2, 4))
```
As you may have guessed, though, not all type conversions are possible ---
for example:
```{r}
as.double("kiwi")
```
(`NA` is R's way of denoting _missing data_ --
see [this bonus section](#missing-values-na) for more.)
<br>
------
------
<br>
## Bonus material for self-study
### Changing vector elements using indexing
Above, we saw how we can _extract_ elements of a vector using indexing.
To _change_ elements in a vector,
simply use the bracket on the other side of the arrow -- for example:
- Change the first element to `30`:
```{r}
myseq[1] <- 30
myseq
```
- Change the last element to `0`:
```{r}
myseq[length(myseq)] <- 0
myseq
```
- Change the second element to the mean value of the vector:
```{r}
myseq[2] <- mean(myseq)
myseq
```
<hr style="height:1pt; visibility:hidden;" />
### Extracting columns from a data frame
We can extract individual columns from a data frame using the `$` operator:
```{r}
cats$weight
cats$coat
```
This kind of operation will return a vector -- and can be indexed as well:
```{r}
cats$weight[2]
```
<hr style="height:1pt; visibility:hidden;" />
### More on the `logical` data type
Let's add a column to our `cats` data frame indicating whether each cat does
or does not like string:
```{r}
cats$likes_string <- c(1, 0, 1)
cats
```
So, `likes_string` is numeric,
but the `1`s and `0`s actually represent `TRUE` and `FALSE`.
We could instead use the `logical` data type here,
by converting this column with the `as.logical()` function,
which will turn 0's into `FALSE` and everything else, including 1, to `TRUE`:
```{r}
as.logical(cats$likes_string)
```
And to actually modify this column in the dataframe itself, we would do this:
```{r}
cats$likes_string <- as.logical(cats$likes_string)
cats
```
<hr style="height:1pt; visibility:hidden;" />
You might think that `1`/`0` could be a handier coding than `TRUE`/`FALSE`
because it may make it easier, for exmaple,
to count the number of times something is true or false.
But consider the following:
```{r}
TRUE + TRUE
```
So, logicals can be used as if they were numbers,
in which case `FALSE` represents 0 and `TRUE` represents 1.
<hr style="height:1pt; visibility:hidden;" />
### Missing values (`NA`)
R has a concept of missing data, which is important in statistical computing,
as not all information/measurements are always available for each sample.
In R, missing values are coded as `NA`
(and like `TRUE`/`FALSE`, this is not a character string, so it is not quoted):
```{r}
# This vector will contain one missing value
vector_NA <- c(1, 3, NA, 7)
vector_NA
```
A key thing to be aware of with `NA`s is that many functions that operate on vectors
will return `NA` if **any** element in the vector is `NA`:
```{r}
sum(vector_NA)
```
The way to get around this is by setting `na.rm = TRUE` in such functions,
for example:
```{r}
sum(vector_NA, na.rm = TRUE)
```
<hr style="height:1pt; visibility:hidden;" />
### A few other data structures in R
We did not go into details about R's other data structures,
which are less common than vectors and data frames.
Two that are worth mentioning briefly, though, are:
- **Matrix**, which can be convenient when you have tabular data that is exclusively
numeric (excluding names/labels).
- **List**, which is more flexible (and complicated) than vectors:
it can contain multiple data types, and can also be hierarchically structured.
<hr style="height:1pt; visibility:hidden;" />
::: exercise
### {{< fa user-edit >}} Bonus Challenge {-}
An important part of every data analysis is cleaning input data.
Here, you will clean a cat data set that has an added observation with a
problematic data entry.
Start by creating the new data frame:
```{r}
cats_v2 <- data.frame(
name = c("Luna", "Thomas", "Daisy", "Oliver"),
coat = c("calico", "black", "tabby", "tabby"),
weight = c(2.1, 5.0, 3.2, "2.3 or 2.4")
)
```
Then move on to the tasks below,
filling in the blanks (`_____`) and running the code:
```{r, eval=FALSE}
# 1. Explore the data frame,
# including with an overview that shows the columns' data types:
cats_v2
_____(cats_v2)
# 2. The "weight" column has the incorrect data type _____.
# The correct data type is: _____.
# 3. Correct the 4th weight with the mean of the two given values,
# then print the data frame to see the effect:
cats_v2$weight[4] <- 2.35
cats_v2
# 4. Convert the weight column to the right data type:
cats_v2$weight <- _____(cats_v2$weight)
# 5. Calculate the mean weight of the cats:
_____
```
<details><summary>Click for the solution</summary>
```{r}
# 1. Explore the data frame,
# including with an overview that shows the columns' data types:
cats_v2
str(cats_v2)
# 2. The "weight" column has the incorrect data type CHARACTER.
# The correct data type is: DOUBLE.
# 3. Correct the 4th weight data point with the mean of the two given values,
# then print the data frame to see the effect:
cats_v2$weight[4] <- 2.35
cats_v2
# 4. Convert the weight column to the right data type:
cats_v2$weight <- as.double(cats_v2$weight)
# 5. Calculate the mean weight of the cats:
mean(cats_v2$weight)
```
</details>
:::
<hr style="height:1pt; visibility:hidden;" />
### Learn more
To learn more about data types and data structures, see
[this episode from a separate Carpentries lesson](https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures.html).