---
format:
  html:
    code-fold: false
    code-line-numbers: false
---
# Just Enough R {.unnumbered}
The purpose of this section is to get you up to speed with `R`.
If you're completely unfamiliar with `R` and RStudio, this should provide you with enough to get started and understand what's going on in the code (and you can always refer back to this page if you understandably get a little lost).
If you have some experience, then it should provide a sufficient description of the packages and functions that we use in this workshop.
Now that you have `R` installed and are familiar with RStudio, it's time to learn some of the core features of the language.
<a id="suggested-reading"/>
::: {.callout-tip}
We'd strongly recommend you read [Hands-On Programming With R](https://rstudio-education.github.io/hopr) by Garrett Grolemund and [R for Data Science](https://r4ds.hadley.nz/) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund for a deeper understanding of the following concepts (and many more).
:::
## Objects & types introduction
An object is anything you can create in R using code, whether that is a table you import from a **csv** file (that will get converted to a **dataframe**), or a **vector** you create within a script.
Each object you create has a **type**.
We've already mentioned two (**dataframes** and **vectors**), but there are plenty more.
But before we get into object types, let's take a step back and look at types in general, thinking about individual elements and the fundamentals.
## Element types
Generally in programming, we have two broad types of numbers: **floating point** and **integer** numbers, i.e., numbers with decimals, and whole numbers, respectively.
In `R`, we have these number types, but a **floating point** number is called a **double**.
The **floating point** number is the default type `R` assigns to numbers: look at the types assigned when we write a plain number vs. when we specify type integer by ending the number with an `L`.
```{r}
typeof(1)
typeof(1L)
```
::: {.callout-note collapse=true}
Technically type **double** is a subset of type **numeric**, so you will often see people convert numbers to floating points using `as.numeric()`, rather than `as.double()`, but the difference is semantics.
You can confirm this using the command `typeof(as.numeric(10)) == typeof(as.double(10))`.
:::
Integer types are not commonly used in `R`, but there are occasions when you will want to use them, e.g., when you need whole numbers of people in a simulation you may want to use integers to enforce this.
Integers are also stored exactly (within the range `R` can represent, roughly plus or minus 2.1 billion), whereas doubles are subject to floating point rounding, so when exactness in whole numbers is required, you may want to use integers.
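To make the point about exactness a little more concrete, here is a minimal sketch comparing double and integer arithmetic.
The object names are purely illustrative and are not used elsewhere in the workshop.

```{r}
# Doubles are stored in binary floating point, so some decimal
# arithmetic is only approximately correct
dbl_check <- (0.1 + 0.2) == 0.3
dbl_check

# all.equal() compares doubles using a small tolerance instead
dbl_tolerant_check <- isTRUE(all.equal(0.1 + 0.2, 0.3))
dbl_tolerant_check

# Integer arithmetic is exact, and the result stays type integer
int_sum <- 1L + 2L
typeof(int_sum)
int_sum == 3L

# The largest integer R can store
.Machine$integer.max
```

In practice, this is why comparing doubles with `==` is usually avoided in favor of a tolerance-based check.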
::: {.callout-note collapse=true}
`R` has some idiosyncrasies when it comes to numbers.
For the most part, **doubles** are produced, but occasionally an **integer** will be produced when you are expecting a **double**.
For example:
```{r}
typeof(1)
typeof(1:10)
typeof(seq(1, 10))
typeof(seq(1, 10, by = 1))
```
:::
Outside of numbers, we have **characters** (**strings**) and **boolean** types.
A **boolean** (also known as a **logical** in `R`) is a `TRUE/FALSE` statement.
In `R`, as in many programming languages, `TRUE` is equal to a value of 1, and `FALSE` equals `0`.
There are times when this comes in handy, e.g., if you need to calculate the number of people that responded to a question, and their responses are coded as `TRUE/FALSE`, you can just sum the **vector** of responses (more on **vectors** shortly).
```{r}
TRUE == 1
FALSE == 0
```
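As a small sketch of the counting trick described above, assume we have a made-up set of survey responses (the data and object names here are hypothetical).

```{r}
# Hypothetical survey responses: TRUE = responded, FALSE = did not
example_responses <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Summing a logical vector counts the TRUE values
n_responded <- sum(example_responses)
n_responded

# The mean gives the proportion that responded
prop_responded <- mean(example_responses)
prop_responded
```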
::: {.callout-tip title="Question" appearance="minimal"}
Can you figure out what value will be returned for the command `(TRUE == 0) == FALSE`?
:::
A **character** is anything in quotation marks.
This would typically be letters, but is occasionally a number, or other symbol.
Other languages make a distinction between **characters** and **strings**, but not `R`.
```{r}
typeof("a")
typeof("1")
```
It is important to note that characters are not **parsed** i.e., they are not interpreted by `R` as anything other than a **character**.
This means that despite `"1"` looking like the number `1`, it behaves like a **character** in `R`, not a **double**, so we can't do addition etc. with it.
```{r}
#| error: true
"1" + 1
```
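If we do need to do arithmetic with a number stored as a **character**, we have to convert it explicitly first.
A minimal sketch, using illustrative object names:

```{r}
# Convert the character "1" to a double before doing arithmetic
example_chr <- "1"
example_num <- as.numeric(example_chr)
typeof(example_num)
example_num + 1
```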
## Object types
### Vectors
As mentioned, anything you can create in `R` is an object.
For example, we can create a character object with the assignment operator (`<-`).
```{r}
my_char_obj <- "a"
```
::: {.callout-note collapse=true}
In other languages, `=` is used for assignment.
In `R`, this is generally avoided to distinguish between creating objects (assignment), and specifying argument values (see the [section on functions](#functions)).
However, despite what some purists may say, it really doesn't matter which one you use, from a practical standpoint.
:::
You will note that when we created our object, it did not return a value (unlike the previous examples, a value was not printed).
To retrieve the value of the object (in this case, just print it), we just type out the object name.
```{r}
my_char_obj
```
In this case, we just created an object with only one element.
We can check this using the `length()` function.
```{r}
length(my_char_obj)
```
We could also create an **atomic vector** (commonly just called a **vector**, which we'll use from here on in).
In fact, `my_char_obj` is actually a **vector**, i.e., it is a vector of length 1, as we've just seen.
Generally, a **vector** is an object that contains multiple elements that each have the same type.
```{r}
my_char_vec <- c("a", "b", "c")
```
As we'll see in the example below, we can give each element in a **vector** a name, and to highlight that vectors must contain elements of the same type, watch what happens here.
```{r}
my_named_char_vec <- c(a = "a", b = "b", c = "c", d = 1)
names(my_named_char_vec)
my_named_char_vec
```
Because a **vector** can only hold one type, `R` **coerced** the number to a **character**, the most flexible of the types present (the order, from least to most flexible, is logical, integer, double, character), regardless of how many elements of each type there are.
This is super important to be aware of, as it can cause errors, particularly when coercion goes in the other direction, i.e., when trying to create a **numeric vector**.
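The sketch below illustrates coercion in both directions, using made-up values and illustrative object names.
Note that `suppressWarnings()` is only used here to keep the output clean; without it, `R` prints a warning that `NA`s were introduced.

```{r}
# Mixing types in c() coerces every element to the most flexible type
type_lgl_int <- typeof(c(TRUE, 1L)) # logical + integer
type_int_dbl <- typeof(c(1L, 2.5))  # integer + double
type_dbl_chr <- typeof(c(2.5, "a")) # double + character
c(type_lgl_int, type_int_dbl, type_dbl_chr)

# Going the other way: values that can't be read as numbers become NA
example_coerced <- suppressWarnings(as.numeric(c("1", "2", "three")))
example_coerced
```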
#### Factors
All the **vector** types we've mentioned so far map nicely to their corresponding **element** types.
But there is an extension of the **character** vector used frequently: the **factor** (and, correspondingly, the **ordered** vector).
A **factor** is a **vector** whose values fall into distinct groups, i.e., they are *nominal categorical data*.
For example, we often include gender as a covariate in epidemiological analysis.
There is no intrinsic order, but we would want to account for the groups in the analysis.
An **ordered vector** is when there *is* an intrinsic order to the grouping i.e., we have *ordinal categorical data*.
If, for example, we were interested in how the frequency of cigarette smoking is related to an outcome, and we wanted to use *binned* groups, rather than treating it as a continuous value, we would want to create an **ordered vector** as the ordering of the different groupings is important.
Let's use the `mtcars` dataset (that comes installed with `R`), and turn the number of cylinders (`cyl`) into an **ordered vector**, as there are discrete numbers of cylinders a car engine can have, *and* the ordering matters.
Don't worry about what `$` is doing; we'll come to that [later](#indexing-objects).
```{r}
my_mtcars <- mtcars
my_mtcars$cyl
my_mtcars$cyl <- ordered(my_mtcars$cyl)
my_mtcars$cyl
```
If we wanted to directly specify the ordering of the groups, we can do this using the `levels` argument:
```{r}
my_mtcars$cyl <- ordered(my_mtcars$cyl, levels = c(8, 6, 4))
my_mtcars$cyl
```
To create a **factor**, just replace the `ordered()` call with `factor()`.
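To see the difference between the two side by side, here is a minimal sketch using made-up smoking-frequency groups (the data and object names are hypothetical).

```{r}
# Hypothetical smoking-frequency groups (illustrative data only)
example_smoking <- c("never", "daily", "weekly", "never", "daily")

# Nominal categories: levels default to alphabetical order, with no ranking
example_factor <- factor(example_smoking)
levels(example_factor)

# Ordinal categories: we state the order of the levels explicitly
example_ordered <- ordered(
  example_smoking,
  levels = c("never", "weekly", "daily")
)
example_ordered

# Comparisons like < are only meaningful for ordered vectors
example_is_below_daily <- example_ordered < "daily"
example_is_below_daily
```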
### Lists
There is another type of **vector**: the **list**.
Most people do not refer to **lists** as a type of **vector**, so we will only refer to them as **lists**, and **atomic vectors** will just be referred to as **vectors**.
Unlike **vectors** there are no requirements about the form of **lists** i.e., each element of the **list** can be completely different.
One element could store a **vector** of numbers, another a model object, another a **dataframe**, and another a **list** (i.e. a nested **list**).
```{r}
my_list <- list(
  c(1, 2, 3, 4, 5),
  glm(mpg ~ ordered(cyl) + disp + hp, data = mtcars),
  data.frame(column_1 = 1:5, column_2 = 6:10)
)
my_named_list <- list(
  my_vec = c(1, 2, 3, 4, 5),
  my_model = glm(mpg ~ ordered(cyl) + disp + hp, data = my_mtcars),
  my_dataframe = data.frame(column_1 = 1:5, column_2 = 6:10)
)
my_list
my_named_list
```
Similar to **vectors**, **lists** can be named or unnamed, and they display in slightly different ways: when unnamed, we get the notation `[[1]] ... [[3]]` to denote the different **list** elements, and with the **named list** we get `$my_vec ... $my_dataframe`.
It is often useful to name them, though, as it gives you some useful options when it comes to indexing and extracting values later.
::: {.callout-note collapse=true}
If you're wondering why we are creating our list elements with the `=` operator, that's because we can think of this as an argument in the `list()` function, where the argument name is the name we want the element to have, and the argument value is the element itself.
:::
### Dataframes
**Dataframes** are the last key object type to learn about.
A **dataframe** is technically a special type of list.
Effectively, it is a 2-D table where every column has to have elements of the same type (i.e., is a **vector**), but the columns can be different types to each other.
The other important restriction is that all columns must be the same length, i.e. we have a rectangular **dataframe**.
As we've seen before, we can create a dataframe using this code, where `1:5` is shorthand for a vector that contains the sequence of numbers from 1 to 5, inclusive (i.e., `c(1, 2, 3, 4, 5)`).
We could also write this sequence as `seq(1, 5, by = 1)`, allowing us more control over the steps in the sequence.
```{r}
my_dataframe <- data.frame(
  column_int = 1:5,
  column_dbl = seq(6, 10, 1),
  column_3 = letters[1:5]
)
```
Like with every other object type, we can just type in the **dataframe's** name to return its value, but this time, let's explore the *structure* of the **dataframe** using the `str()` function.
This function can be used on any of the objects we've seen so far, and is particularly helpful when exploring **lists**.
One nice feature when it is used on **dataframes** is that it explicitly prints each column's type.
```{r}
str(my_dataframe)
```
### Matrices
**Matrices** are crucial to many scientific fields, including epidemiology, as they are the basis of linear algebra.
This course will use **matrix** multiplication extensively (notably [R Session 2](r-session-02.qmd)), so it is worth knowing how to create matrices.
Much like vectors, all elements in a **matrix** should be the same type (or they will be coerced if possible, resulting in `NA` if not).
It is unusual to have a **non-numeric matrix** e.g., a **character matrix**, but it is possible.
When we create our **matrix**, notice that it fills column-first, much like how we think of **matrices** in math (i.e., `i` then `j`).
```{r}
my_matrix <- matrix(1:8, nrow = 2)
my_matrix
```
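The sketch below, using illustrative object names, shows how the fill order can be changed with the `byrow` argument, how `[i, j]` indexing works, and what a simple **matrix** multiplication with `%*%` looks like.

```{r}
# Default: values fill down each column in turn
example_matrix <- matrix(1:6, nrow = 2)
example_matrix

# byrow = TRUE fills across each row instead
example_matrix_byrow <- matrix(1:6, nrow = 2, byrow = TRUE)
example_matrix_byrow

# Index with [row, column], i.e., i then j
example_matrix[2, 3]

# %*% performs matrix multiplication (a 2x3 matrix times a length-3 vector)
example_product <- example_matrix %*% c(1, 1, 1)
example_product
```

Multiplying by a vector of ones simply sums each row, which makes the result easy to check by eye.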
## Indexing objects
### Indexing operators
We've got our objects, but now we want to do stuff with them.
Without getting into too much detail about *Object-Oriented Programming* (e.g., the `S3` class system in `R`), there are three main ways of indexing in `R`:
- The single bracket `[]`
- The double bracket `[[]]`
- The dollar sign `$`
Which method we use depends on the type of object we have.
Handily, `[]` will work for pretty much everything, and we typically only use `[[]]` for **lists**.
### Indexing vectors
With both `[]` and `[[]]`, we can use the *indices* i.e., the numbered position of the specific values/elements we want to extract, but if we have named objects, we can pass the names to the `[]` in a **vector**.
```{r}
# Extract elements 1 through 3 inclusively
my_char_vec[1:3]
# Extract the same elements but using their names in a vector
my_named_char_vec[c("a", "b", "c")]
```
Notice that when we index the named **vector** we get *both* the name *and* the value returned.
Many times this is OK, but if we only want the value, we index with `[[]]`.
It is important to note that you can only pass *one* value to the double brackets.
```{r}
#| error: true
my_named_char_vec[[c("a", "b")]]
my_named_char_vec[["a"]]
```
If you're wondering why go through the hassle, it's because values can change position in the **vector** when we update inputs, such as **csv** data files, or when we restructure code to make something else work.
If we only index with the numeric indices, we run the risk of a silent error, i.e., a value is returned to us, but we don't know that it's referring to the wrong thing.
Indexing with names means that the element's position in the **vector** doesn't matter, and if it's accidentally been removed when we updated code, indexing with `[[]]` will explicitly throw an error as it won't be able to find the name (note that `[]` returns `NA` for a missing name instead).
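Here is a minimal sketch of that risk, using a hypothetical pair of named parameters whose order changes between two versions of some code.
The names and values are made up, and `tryCatch()` is only used so that the error is captured rather than stopping the script.

```{r}
# Hypothetical named parameters (illustrative values only)
example_params <- c(beta = 0.3, gamma = 0.1)

# Suppose the inputs are updated and the order changes
example_params_reordered <- c(gamma = 0.1, beta = 0.3)

# Position-based indexing silently returns a different parameter
example_params[[1]]
example_params_reordered[[1]]

# Name-based indexing returns the same value regardless of position
example_params[["beta"]]
example_params_reordered[["beta"]]

# If the name does not exist, [[ ]] throws an error, which we capture here
example_missing_result <- tryCatch(
  example_params_reordered[["sigma"]],
  error = function(e) "error: name not found"
)
example_missing_result
```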
### Lists and Dataframes
When it comes to indexing **lists** and **dataframes** (remember, **dataframes** are just special **lists**, so the same methods are available to us), it is more common to use `[[]]` and `$`, though there are obviously occasions when `[]` is useful.
Let's look at `my_named_list` first.
```{r}
my_named_list[1]
my_named_list["my_vec"]
my_named_list[[1]]
my_named_list[["my_vec"]]
my_named_list$my_vec
```
::: {.callout-note}
In the examples above, notice how both `[]` methods returned the name of the element as well as the values (as it did before with the named **vector**).
This is important as it means we need to extract the values from what is returned before we can do any further indexing i.e., to get the value `3` from the **list** element `my_vec`.
:::
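To make that concrete, the sketch below uses a small stand-in **list** (hypothetical, but mirroring the structure of `my_named_list`) to extract the value `3` in each of the ways described.

```{r}
# A small stand-in list (hypothetical names and values)
example_list <- list(my_vec = c(1, 2, 3, 4, 5), my_label = "a")

# [[ ]] and $ return the element itself, so we can index into it directly
example_list[["my_vec"]][3]
example_list$my_vec[3]

# [ ] returns a list of length 1, so it must be unwrapped first
class(example_list["my_vec"])
example_list["my_vec"][[1]][3]
```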
We can do the same with the unnamed **list**, except the last two methods are not available as we do not have a name to use.
```{r}
my_list[1]
my_list[[1]]
```
Because a **dataframe** is a type of list where the column headers are the element names, we can use `[[]]` and `$` as with the named list.
```{r}
my_dataframe[1]
my_dataframe[[1]]
my_dataframe["column_int"]
my_dataframe$column_int
```
If we wanted to extract a particular value from a column, we can use the following methods.
```{r}
# indexes i then j, just like in math
my_dataframe[2, 1]
# Extract the second element from the first column
my_dataframe[[1]][2]
# Extract the second element from column_int, using the i, j procedure as before
my_dataframe[2, "column_int"]
# Extract the second element from column_int
my_dataframe$column_int[2]
```
## Packages
Up until now, we've been getting to grips with the core concepts of objects, and indexing them.
But when you're writing code, you'll want to do things that are relatively complicated to implement, such as solve a set of differential equations.
Fortunately, for many areas of computing (and, indeed, epidemiology and statistics), many others have also struggled with the same issues and some have gone on to document their solutions in a way that others can re-use.
This is the basis for **packages**.
Someone has *packaged up* a set of functions for others to re-use.
We've mentioned the word **function** a number of times so far, and we haven't defined it, but that's [coming soon](#functions).
For the moment, let's just look at how we can find, install, and load **packages**.
### Finding packages
As [mentioned previously](install-r.qmd#r), CRAN is a place where many pieces of `R` code are documented and stored for others to download and use.
Not only are the `R` programming language executables stored in CRAN, but so are user-defined **functions** that have been turned into **packages**.
To find packages, you can go to the CRAN website and search by name, but there are far too many for that to be worthwhile - just Google what you want to do and add "r" to the end of your search query, and you'll likely find what you're looking for.
Once you've found a package you want to download, next you need to install it.
### Installing packages
Barring any super-niche packages, you should be able to use the following command(s):
```{r}
#| eval: false
install.packages("package to download")
# Download multiple by passing a vector of package names
install.packages(c("package 1", "package 2"))
```
If for some reason you get an error message saying the package isn't available on CRAN, first, check for typos, and if you still get an error, you may need to download it directly from GitHub.
Read [here](https://pak.r-lib.org/dev/reference/get-started.html#install-a-package-from-github) for more information about using the `{pak}` package to download packages from other sources.
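Before installing anything, it can be handy to check whether a package is already present.
One way to do this, sketched below, is with `requireNamespace()`, which returns `TRUE` or `FALSE` rather than throwing an error.
The made-up package name is an assumption used purely to show the `FALSE` case.

```{r}
# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
stats_installed <- requireNamespace("stats", quietly = TRUE)
stats_installed

# A made-up package name, used here only to show the FALSE case
fake_installed <- requireNamespace("notARealPackage123", quietly = TRUE)
fake_installed

# Only attempt the install when the package is missing
if (!stats_installed) {
  install.packages("stats")
}
```

Because `{stats}` ships with `R`, the `install.packages()` call in this sketch is never actually run.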
### Loading packages
Now that you have your packages installed, you just need to load them to get any of their functionality.
The easiest way is to place this code at the top of your script.
```{r}
#| eval: false
# Quotations are not required, but can be used
library(package to download)
```
<a id="namespace-conflict"/>
Most of the time, this is fine, but occasionally you will run into an issue where a function doesn't work as expected.
Sometimes this is because of what's called a *namespace conflict*, i.e., you have two functions with the same name loaded, and potentially you're using the wrong version.
For example, in base `R` (i.e., the functions that come pre-installed when you set up `R`), there is a `filter()` function from the `{stats}` package (we'll denote this as `stats::filter()`, using the `package::function()` notation).
Throughout this workshop, you will see `library(tidyverse)` at the top of the pages to indicate the `{tidyverse}` set of packages are being loaded (this is actually a package that installs a bunch of related and useful packages for us).
In `{dplyr}` (one of the packages loaded by `{tidyverse}`) there is also a function called `filter()`.
Because `{dplyr}` was loaded after `{stats}` was loaded (because `{stats}` is automatically loaded when `R` is started), the `dplyr::filter()` function will take precedence.
If we wanted to specifically use the `{stats}` version, we could write this:
```{r}
#| column: body
#| out-width: 100%
# Set the seed for the document so we get the same random numbers sampled
# each time we run the script (assuming it's run in its entirety from start
# to finish)
set.seed(1234)
# Create a cosine wave with random noise
raw_timeseries <- cos(pi * seq(-2, 2, length.out = 1000)) + rnorm(1000, sd = 0.5)
# Calculate 20 day moving average using stats::filter()
smooth_timeseries <- stats::filter(raw_timeseries, filter = rep(1/20, 20), sides = 1)
# Plot raw data
plot(raw_timeseries, col = "grey80")
# Overlay smoothed data
lines(smooth_timeseries, col = "red", lwd = 2)
```
## Functions
As we've alluded to, **functions** are core to gaining *functionality* in `R`.
We can always hand-write the code to complete a task, but if we have to repeat a task more than once, it can be tiresome to repeat the same code, especially if it is a complex task that requires many lines of code.
This is where **functions** come in: they provide us with a mechanism to wrap up code into something that can be re-used.
Not only does this reduce the amount of code we need to write, but by minimizing code duplication, debugging becomes a lot easier as we only need to make changes and corrections in one section of our codebase.
Say, for example, you want to take a vector of numbers and calculate the cumulative sum:
```{r}
my_dbl_vec <- 1:10
cumulative_sum <- 0
for (i in seq_along(my_dbl_vec)) {
  # Add the i-th value of the vector (not the index i itself)
  cumulative_sum <- cumulative_sum + my_dbl_vec[i]
}
cumulative_sum
```
This is OK if we only do this calculation once, but it's easy to imagine us wanting to repeat this calculation; for example, we might calculate the cumulative sum of daily cases to get a weekly incidence for every week of a year.
In this situation, we would want to create a function.
```{r}
my_cumsum <- function(vector) {
  cumulative_sum <- 0
  # Loop over the function's argument, not an object from outside the function
  for (i in seq_along(vector)) {
    cumulative_sum <- cumulative_sum + vector[i]
  }
  cumulative_sum
}
my_cumsum(my_dbl_vec)
```
::: {.callout-note collapse=true}
This is obviously a contrived example because, as with many basic operations in `R`, there are already functions written to perform this calculation in a much more performant and safer manner: `sum()` returns the final total, and `cumsum()` returns the running total at every element.
:::
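The loop above keeps only the final total.
As a minimal sketch, a function that keeps every intermediate total (which is what `cumsum()` returns) might look like the following; the function name and the daily case counts are hypothetical.

```{r}
# Hypothetical function that keeps every running total
example_running_totals <- function(vector) {
  # Pre-allocate a numeric vector to hold one total per element
  totals <- numeric(length(vector))
  running_total <- 0
  for (i in seq_along(vector)) {
    running_total <- running_total + vector[i]
    totals[i] <- running_total
  }
  totals
}

# Made-up daily case counts
example_daily_cases <- c(3, 0, 5, 2)
example_running_totals(example_daily_cases)

# Matches the built-in function
cumsum(example_daily_cases)
```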
For many of the manipulations we will want to perform, a **function** has already been written by someone else and put into a **package** that we can download, as we've [already seen](#packages).
### Anonymous functions
There is a special class of functions called anonymous functions that are worth being aware of, as we will use them quite extensively throughout this workshop.
As the name might suggest, **anonymous functions** are functions that are not named, and therefore, not saved for re-use.
You may, understandably, be wondering why we would want to use them, given we just made the case for functions replacing repeatable blocks of code.
In some instances, we want to be able to perform multiple computations that require creating intermediate objects, but because we only need to use them once, we don't want to save them to our environment, where they could cause conflicts (e.g., accidentally using an object we didn't mean to, or overwriting existing ones by re-using the same object name).
This gets into the broader concept of local vs global scopes, but that is too far beyond the scope of this workshop: see [Hands-On Programming with R](https://rstudio-education.github.io/hopr/environments.html#scoping-rules) and [Advanced R](https://adv-r.hadley.nz/functions.html?q=lexical#lexical-scoping) for more information.
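As a very small sketch of what local scope means in practice, the hypothetical function below creates an intermediate object that never appears in our environment.

```{r}
# A hypothetical function that creates an intermediate object
example_double_then_add <- function(x) {
  example_intermediate <- x * 2 # exists only while the function runs
  example_intermediate + 1
}

example_double_then_add(10)

# The intermediate object was never saved to our (global) environment
exists("example_intermediate")
```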
Let's look at an example to see when we might want to use an anonymous function.
Throughout this workshop, we will make use of the `map_*()` series of functions from the `{purrr}` package.
We'll go into more detail about `purrr::map()` [shortly](#sec-map-functions), but for now, imagine we have a **vector** of numbers, and we want to add `5` to each value before multiplying by `10`.
The `map_dbl()` function takes a **vector** and a function, and outputs a **double vector**.
We could write a function to perform this multiplication, but if we're only going to do this operation once, it seems unnecessary.
```{r}
#| error: true
purrr::map_dbl(
  .x = my_dbl_vec,
  .f = function(.x) {
    add_five_val <- .x + 5
    add_five_val * 10
  }
)
# only exists within the function
add_five_val
```
Here, we've specified the anonymous function to take the input `.x`, add 5 to each value, and then multiply the result by 10, and we did it without saving the function.
This would be equivalent to writing this:
```{r}
#| error: true
add_five_multiply_ten <- function(x) {
  add_five_val <- x + 5
  add_five_val * 10
}
purrr::map_dbl(
  .x = my_dbl_vec,
  .f = ~add_five_multiply_ten(.x)
)
# only exists within the function
add_five_val
```
::: {.callout-warning}
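Notice the `~` used: this is the `{purrr}` formula shorthand for an anonymous function, in which `.x` refers to the current element being passed into our named function.
Without it, `R` would try to evaluate `add_five_multiply_ten(.x)` immediately, and we would get an error about `.x` not being found.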
:::
::: {.callout-note collapse=true}
In this example, because arithmetic operators in `R` are **vectorized**, our function is automatically applied to each element of the object, so this example was merely to illustrate the point.
```{r}
add_five_multiply_ten(my_dbl_vec)
```
:::
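The same idea of passing either a named or an anonymous function exists in base `R`.
As a rough sketch, and assuming base `R`'s `vapply()` as a stand-in for `map_dbl()` (it is not what the workshop uses), the two approaches give identical results; the object names are illustrative.

```{r}
example_vec <- c(1, 2, 3)
example_add_five_times_ten <- function(x) (x + 5) * 10

# 1. Pass the named function directly
result_named <- vapply(example_vec, example_add_five_times_ten, numeric(1))

# 2. Pass an anonymous function defined in place
result_anonymous <- vapply(example_vec, function(x) (x + 5) * 10, numeric(1))

result_named
identical(result_named, result_anonymous)
```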
## Tidy data
Before we look at the common packages and functions we use throughout this workshop, let's take a second to talk about how our data is structured.
For much of what we do, it is convenient to work with **dataframes**, and many functions we will use are designed to work with *long* **dataframes**.
What this means is that each *column* represents a variable, and each row is a unique observation.
Let's first look at a **wide dataframe** to see how data may be represented.
Here, we have one column representing a number for each of the states in the US, and then we have two columns representing some random incidence: one for July and one for August.
```{r}
wide_df <- data.frame(
  state_id = 1:52,
  july_inc = rbinom(52, 1000, 0.4),
  aug_inc = rbinom(52, 1000, 0.6)
)
wide_df
```
Instead, we can reshape this into a **long dataframe** so that there is a column for the state ID, a column for the month, and a column for the incidence (that is associated with *both* the state *and* the month).
Using the `{tidyr}` package, we could reshape this **wide dataframe** to be a **long dataframe** (see [this section](#sec-pivot-functions) for more information about the `pivot_*()` functions).
```{r}
long_df <- tidyr::pivot_longer(
  wide_df,
  cols = c(july_inc, aug_inc),
  names_to = "month",
  values_to = "incidence",
  # Extract only the month using regex
  names_pattern = "(.*)_inc"
)
long_df
```
You will notice that our new dataframe still contains three columns, but is longer than previously; twice as long, in fact.
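To see why the long version has twice as many rows, here is a sketch that builds a long **dataframe** by hand from a tiny, made-up wide table.
This is only to illustrate the structure (the names and values are hypothetical, and the row order differs from what `pivot_longer()` produces).

```{r}
# A tiny hypothetical wide table: 2 states, 2 months (made-up values)
example_wide <- data.frame(
  state_id = 1:2,
  july_inc = c(400, 410),
  aug_inc = c(600, 590)
)

# Building the long version by hand: one row per state-month combination
example_long <- data.frame(
  state_id = rep(example_wide$state_id, times = 2),
  month = rep(c("july", "aug"), each = nrow(example_wide)),
  incidence = c(example_wide$july_inc, example_wide$aug_inc)
)
example_long

# Twice as many rows, same number of columns
dim(example_wide)
dim(example_long)
```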
::: {.callout-note collapse=true}
Particularly keen-eyed readers may also notice that `long_df` has class **tibble**, not **data.frame**.
A **tibble** effectively is a **data.frame**, but is an object commonly used and output by `{tidyverse}` functions, as it has a few extra safety features over the base **data.frame**.
:::
## Core code used
We're finally ready to talk about the functions that are used throughout this workshop.
The first package to mention is the `{tidyverse}` package, which is actually a collection of packages: the core packages can be found [here](https://www.tidyverse.org/packages/).
The reason why we are using the `{tidyverse}` packages throughout this workshop is that they are relatively easy to learn, compared to base `R` and `{data.table}` (not that they are mutually exclusive), and are what most people are familiar with.
They are also well designed and powerful, so you should be able to do most things you need using these packages.
You can find a list of cheatsheets for all of these packages (and more) [here](https://posit.co/resources/cheatsheets/?type=posit-cheatsheets&_page=1/).
Let's load the `{tidyverse}` packages and then go through the key functions used.
Unless stated explicitly, these packages will be available to you after loading the `{tidyverse}` with the following command.
```{r}
library(tidyverse)
```
### `tibble()`
The **tibble** is a modern reimagining of the **dataframe** that is slightly safer, i.e., it is more restricted in what you can do with it, and will throw errors more frequently, but very rarely for anything other than a bug.
We will use the terms interchangeably, as most people will just talk about **dataframes**, as for the most part, they can be treated identically.
Use the same syntax as the `data.frame()` function to create the **tibble**.
### `dplyr::filter()`
If we wanted to take a subset of rows of a **dataframe**, we would use the `dplyr::filter()` function.
Here, we're listing the package it's coming from, as there are some other packages that also export their own version of the `filter()` function.
However, for all the code in this workshop, there aren't any concerns about [**namespace conflicts**](#namespace-conflict), so we won't use the `dplyr::` prefix from here on in.
The `filter()` function is relatively simple to work with: you specify the **dataframe** you want to subset, then the filtering criteria (written in terms of its columns), and that's it.
If we include multiple arguments, they get treated as *AND* statements (`&`), so all conditions need to be met.
```{r}
filter(
  long_df,
  month == "july",
  incidence > 410
  # equivalent to: month == "july" & incidence > 410
)
```
We can filter using *OR* statements (`|`), so if either condition returns `TRUE`, then it will be included in the subset.
```{r}
filter(
long_df,
month == "july" | incidence > 600
)
```
### `select()`
If, instead, we wanted a subset of the columns of a **dataframe**, we would use the `dplyr::select()` function.
Let's say, from our wide incidence data, we only want the state's ID and their August incidence.
We can directly select the columns this way.
```{r}
select(
wide_df,
state_id, aug_inc
)
```
But in this case, it would be more efficient (for us) to tell `R` the columns we *don't* want.
We can do that using the `-` sign.
```{r}
select(
wide_df,
-july_inc
)
```
If there were multiple columns we didn't want, we would pass them in a vector.
```{r}
select(
wide_df,
-c(july_inc, aug_inc)
)
```
When it comes to selecting columns, the `{tidyselect}` package has a few very handy functions for us.
To understand when they are most useful, let's first look at the `mutate()` function, and then we'll highlight how to use the different column selection functions available to us through `{tidyselect}`.
### `mutate()`
If we have a **dataframe** and want to add or edit a column, we use the `mutate()` function.
Usually the `mutate()` function is used to add a column that is related to the existing data, but it is not necessary.
Below are examples of both.
```{r}
# add September incidence that is based on August incidence
mutate(
wide_df,
sep_inc = round(aug_inc * 1.2 + rnorm(52, 0, 10), digits = 0)
)
# add random September incidence
mutate(
wide_df,
sep_inc = rbinom(52, 1000, 0.7)
)
```
If we wanted to update a column, we can do that by specifying the column on both sides of the equals sign.
```{r}
# Update the August incidence to add random noise
mutate(
wide_df,
aug_inc = aug_inc + round(rnorm(52, 0, 10), digits = 0)
)
```
One crucial thing to note is that `mutate()` applies our function/operation to each row simultaneously, so the new column's value only depends on the *row's* original values (or the vector in the case of the second example that didn't use the values from the data).
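A small sketch of this behaviour, using a hypothetical `row_demo` tibble (the column names and values are invented for illustration): each row's new value is calculated from that same row's existing values.

```{r}
library(dplyr)

# `row_demo` is a hypothetical tibble, for illustration only
row_demo <- tibble::tibble(
  cases = c(10, 20, 30),
  population = c(100, 400, 300)
)

# Each row's `rate` is computed from that same row's `cases` and `population`
row_demo_rate <- mutate(row_demo, rate = cases / population)
row_demo_rate
```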
### `paste0()`
The `paste0()` function is useful for manipulating objects and coercing them into strings, allowing us to do *string interpolation*.
It comes with base `R`, so there's nothing to install, and because `mutate()` applies functions to each row simultaneously, we can modify whole columns at once, depending on each row's original values.
It *squishes* all the values together, without any separator.
If you want spaces between your words, for example, you can use `paste(..., sep = " ")` instead, which takes the `sep` argument.
```{r}
char_df <- mutate(
long_df,
# Notice that text is in quotes, and object values being passed to paste0()
# are unquoted.
state_id = paste0("state_", state_id)
)
char_df
```
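To see the difference between `paste0()` and `paste()` outside of a **dataframe**, here is a minimal sketch using a plain vector (the object names are made up for illustration).

```{r}
# paste0() puts nothing between the pieces
no_sep <- paste0("state_", 1:3)
no_sep

# paste() inserts `sep` between the pieces (a single space by default)
with_sep <- paste("state", 1:3, sep = " ")
with_sep
```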
### `glue::glue()` {#sec-glue}
`glue()` is a function that comes installed with `{tidyverse}`, but is not loaded automatically, so you have to either load it with `library(glue)` or use the `::` notation shown below.
It serves the same purpose as the base `paste0()`, but in a slightly different syntax.
Instead of using a mix of quotations and unquoted object names, `glue()` requires everything to be in quotation marks, with any value being passed to the *string interpolation* being *enclosed* in `{ }`.
It is worth learning `glue()` as it is used throughout the `{tidyverse}` packages, such as in the `pivot_wider()` function.
```{r}
char_df <- mutate(
long_df,
state_id = glue::glue("state_{state_id}")
)
char_df
```
### `str_replace_all()`
If we want to replace characters throughout the whole of a string vector, we can do that with the `str_replace_all()` function.
And because **dataframes** are made up of individual **vectors**, we can use this to modify columns.
```{r}
mutate(
char_df,
# pass in the vector (a column, here), the pattern to match, and the replacement
clean_state_id = str_replace_all(state_id, "state_", "")
)
```
### `across()`
Above, we were only **mutating** a single column at a time, which is what we often do.
But, sometimes we want to apply the exact same transformation to multiple columns.
For example, say we wanted to turn our monthly incidence data into the average weekly incidence.
We could write out each transformation by hand, but when there are more than two columns, this gets rather tedious and introduces the opportunity for mistakes when copying code (one of our motivations for using functions).
The `dplyr::across()` function allows us to specify the columns we want to transform and the function to apply (which can be named or anonymous), and that's it.
There are a couple of points to understand about the code below:
- Note the `.` preceding `cols`, `fns`, and `x`
- Each column is passed to the function as `.x`
- `~` creates an anonymous function in which `.x` stands for the current column. This is the same shorthand used by the [`map_*()` functions](#sec-map-functions).
```{r}
mutate(
wide_df,
across(
.cols = c(july_inc, aug_inc),
.fns = ~.x * 7 / 30
)
)
```
### `everything()`
If we wanted to select every column in a dataframe, we would use the `everything()` function.
This may not seem helpful initially, but there are occasions when it's very useful.
For instance, in the previous example we still specified the exact columns we wanted to transform.
However, if there were five times as many, we wouldn't want to do that.
Do note that if we replaced the list of columns with `everything()`, we would also `mutate()` our `state_id` column, which we probably don't want to do, so we could combine it with the `-` selection seen previously.
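As a sketch of that combination, the code below selects every column and then drops `state_id` from the selection before applying the transformation.
The `everything_demo` tibble is a hypothetical stand-in for the wide incidence data, with invented values.

```{r}
library(dplyr)

# `everything_demo` is a hypothetical stand-in for the wide incidence data
everything_demo <- tibble::tibble(
  state_id = 1:3,
  july_inc = c(300, 450, 600),
  aug_inc = c(330, 420, 660)
)

# Select every column, then remove `state_id` from the selection
everything_weekly <- mutate(
  everything_demo,
  across(
    .cols = c(everything(), -state_id),
    .fns = ~.x * 7 / 30
  )
)
everything_weekly
```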
### `contains()`
Another very handy function is the `tidyselect::contains()` function.
This allows us to specify a string that the column names must *contain* for them to be selected.
We could change the above example to look like this:
```{r}
mutate(
wide_df,
across(
.cols = contains("_inc"),
.fns = ~.x * 7 / 30
)
)
```
### `rename_with()`
If we wanted to rename columns of a **dataframe**, we can use the `rename()` function.
However, like the previous `{tidyselect}` examples, sometimes we want to apply the same renaming scheme (function) to the columns.
`rename_with()` allows us to pass a function to multiple columns at once, achieving what we want with minimal effort, and without needing to use `across()`.
```{r}
rename_with(
wide_df,
.cols = contains("_inc"),
.fn = ~str_replace_all(.x, "_inc", "_incidence")
)
```
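For comparison, renaming a single column with plain `rename()` might look like the sketch below, where the new name goes on the left of the equals sign and the old name on the right.
The `rename_demo` tibble is hypothetical.

```{r}
library(dplyr)

# `rename_demo` is a hypothetical tibble, for illustration only
rename_demo <- tibble::tibble(
  state_id = 1:3,
  july_inc = c(300, 450, 600)
)

# rename() uses new_name = old_name
renamed_demo <- rename(rename_demo, july_incidence = july_inc)
renamed_demo
```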
::: {.callout-important}
Hopefully you are noticing a pattern between the `{tidyselect}`-type functions.
When you need to apply a function to multiple columns in a **dataframe**, you will select the columns with the `.cols` argument, and pass the function to the `.fn(s)` argument with the `~` symbol indicating you are using the `.x` to represent the column in the function (yes, there is a touch of ambiguity between `.fns` and `.fn`, but the general pattern holds).
This will be useful when we look at the `map_*()` family of functions.
:::
### `magrittr::%>%`
The `%>%` operator is an interesting and very useful function that comes installed (and loaded) with the `{tidyverse}` package (technically, it comes from the `{magrittr}` package, which is part of the `{tidyverse}`).
It allows us to chain together operations without needing to create intermediate objects.
Say, for example, we have our wide incidence data and want to add data for September before turning it into a **long dataframe**: we could create an intermediate object before using the `pivot_longer()` function from before, but we might not want to create another object that we don't really care about.
This is when we would want to use a pipe, as it takes the output of one operation and *pipes* it into the next one.
```{r}
mutate(
wide_df,
sep_inc = round(aug_inc * 1.2 + rnorm(52, 0, 10), digits = 0)
) %>%
pivot_longer(
cols = c(july_inc, aug_inc, sep_inc),
names_to = "month",
values_to = "incidence",
names_pattern = "(.*)_inc",
data = .
)
```
By default, the previous object is passed to the first argument of the next function, but here we've shown that you can control where the object is *piped* by specifying the argument using the `.` syntax.
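Both behaviours can be seen in a smaller sketch that uses base functions only (the object names are made up for illustration).

```{r}
library(magrittr)

pipe_demo <- c(1.234, 5.678)

# The left-hand side becomes the first argument, so this is the same as
# round(pipe_demo, digits = 1)
rounded_demo <- pipe_demo %>% round(digits = 1)
rounded_demo

# With `.`, the left-hand side goes wherever the dot is, so this is the
# same as paste("a", "b")
pasted_demo <- "b" %>% paste("a", .)
pasted_demo
```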
### `|>`
In `R` version 4.1.0, `|>` was added as the base pipe operator.
It works slightly differently to `%>%`, and frankly, is less powerful and less common (at the moment), so we won't use it in this workshop.
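For reference only, a minimal sketch of the base pipe is shown below; it needs `R` 4.1.0 or later to run, and the object name is made up for illustration.

```{r}
# The left-hand side becomes the first argument, just as with %>%
base_pipe_demo <- c(1.234, 5.678) |> round(digits = 1)
base_pipe_demo

# Unlike %>%, the base pipe does not understand the `.` placeholder.
# R 4.2.0 added `_` as a placeholder, but it can only be used with a
# named argument.
```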
### `group_by()`
If we have groups in our **dataframe** and want to apply some function to each group's data, we can use the `group_by()` function.
For example, we might want to calculate the mean and median incidence in our fake data from earlier, grouped by month.
```{r}
group_by(long_df, month) %>%
summarize(mean = mean(incidence), median = median(incidence))
```
### `pivot_*()` {#sec-pivot-functions}
We've [already seen](#tidy-data) the purpose of the `pivot_longer()` function: taking wide data and reshaping it to be long.
There is an equivalent to go from long to wide: `pivot_wider()`.
Occasionally this is useful (though it is less common than creating long data).
```{r}
pivot_wider(
long_df,
names_from = month,
values_from = incidence,
names_glue = "{month}_inc"
)
```
Here, the `names_glue` argument is making use of the `glue::glue()` function ([see above](#sec-glue)) that is installed with `{tidyverse}`, but not loaded automatically.
### `map_*()` {#sec-map-functions}
The `map_*()` functions come from the `{purrr}` package (a core part of the `{tidyverse}`), and are incredibly useful.
They are relatively complicated, so there isn't enough space to go into full detail, but here we'll just outline enough so you can read more and understand what's going on.
We've [already seen](#anonymous-functions) we can apply functions to each element of a vector (**atomic** or **list** vectors).
The key points to note are the `.` preceding the `x` and `f` arguments.
If we use `map()` we get a **list** returned, `map_dbl()` a **double vector**, `map_chr()` a **character vector**, `map_dfr()` a **dataframe** etc.
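A quick sketch of how the return type changes with the variant, using a hypothetical vector `map_demo_vec`:

```{r}
library(purrr)

# `map_demo_vec` is a hypothetical vector, for illustration only
map_demo_vec <- c(1, 4, 9)

# map() returns a list
map_list_demo <- map(.x = map_demo_vec, .f = sqrt)
map_list_demo

# map_dbl() returns a double vector
map_dbl_demo <- map_dbl(.x = map_demo_vec, .f = sqrt)
map_dbl_demo

# map_chr() returns a character vector
map_chr_demo <- map_chr(.x = map_demo_vec, .f = ~paste0("value_", .x))
map_chr_demo
```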
In the example below, we'll walk through `map_dfr()` as it's one of the more confusing variants due to the **return** requirements.
```{r}
map_dfr_example <- map_dfr(
.x = my_dbl_vec,
.f = function(.x) {
# Note we don't use , at the end of each line - it's as if we were
# running the code in the console
times_ten <- .x * 10
divide_ten <- .x / 10
# construct a tibble as normal (requires , between arguments)
tibble(
original_val = .x,
times_ten = times_ten,
divide_ten = divide_ten
)
}
)
map_dfr_example
```
What's happening under the hood is that `map_dfr()` is applying the [anonymous function](#anonymous-functions) we defined to each element in our vector and returning a **list** of **dataframes**, each containing one row and three columns, i.e. for the first element, we would get this:
```{r}
list(map_dfr_example[1, ])
```
It then calls the `bind_rows()` function to *squash* all of those **dataframes** together, one row stacked on top of the next, to create one large **dataframe**.
We could write the equivalent code like this:
```{r}
bind_rows(
map(
.x = my_dbl_vec,
.f = function(.x) {
# Note we don't use , at the end of each line - it's as if we were
# running the code in the console
times_ten <- .x * 10
divide_ten <- .x / 10
# construct a tibble as normal (requires , between arguments)
tibble(
original_val = .x,
times_ten = times_ten,
divide_ten = divide_ten
)
}
)
)
```
`map_dfc()` does exactly the same thing, but calls `bind_cols()` instead, to place the columns next to each other.
There is one more important variant to go through: `pmap_*()`.
If `map_*()` takes one vector as an argument, `pmap_*()` takes a **list** of arguments.
What this means is that we can iterate through the elements of as many arguments as we'd like, *in sequence*.
For example, let's multiply the elements of two **double** vectors together.
```{r}
# Create a second vector of numbers
my_second_dbl_vec <- rnorm(length(my_dbl_vec), 20, 20)
my_second_dbl_vec
# Remind ourselves what our original vector looks like
my_dbl_vec
pmap_dbl(
.l = list(first_num = my_dbl_vec, sec_num = my_second_dbl_vec),
.f = function(first_num, sec_num) {
first_num * sec_num
}
)
```
There are a couple of important points to note here:
- All vectors need to be the same length
- The function is applied to each element index of the input vectors, i.e., the first elements of the vectors are multiplied together, the second elements of the vectors are multiplied together, and so on, until the last elements are reached.
- We use `.l` instead of `.x` to denote we are passing a `list()` of vectors.
- Our function specifies the names of the vectors in the `list()`, which are then used within the function itself (similar to how we used `.x` in our `map_*()` functions)
::: {.callout-note collapse=true}
As before, this is an unnecessary approach as `R` would vectorize the operation, but it is useful to demonstrate the principle.
```{r}
my_dbl_vec * my_second_dbl_vec
```
:::
### `nest()`
Nesting is a relatively complex, but powerful, concept, particularly when combined with the `map_*()` functions.
Commonly, as in this workshop, it is used to apply a model function to multiple different datasets, and store them all in one **dataframe** for ease of manipulation.
What it effectively does is group your existing **dataframe** by a variable, and then shrink all the columns (except the grouping column), into a single list column, leaving you with as many rows as there are distinct groups.
Each element of the new list column is itself a small **dataframe** that contains all the original variables and data, but only those that are relevant for the group.
Hopefully this example will make it clearer.
Here, we'll take the `mtcars` dataset, and like before, we'll group by the `cyl` variable, but this time we'll nest the rest of the data.
```{r}
nested_mtcars <- nest(mtcars, data = -cyl)
nested_mtcars
```
We can see we've nested all columns, *except* `cyl`.
Looking at the `data` column for just the first row (`cyl == 6`), we see we have a list with one item: the rest of the data that's relevant to the rows where `cyl == 6` (notice the `[[1]]` above the **tibble**).
```{r}
nested_mtcars[1, ]$data
```
Now we can use `map()` to fit a model to each of these subsets of the data.
```{r}
mutate(
nested_mtcars,
model_fit = map(data, ~glm(mpg ~ hp + wt + ordered(carb), data = .x))
)
```
This creates a **list** column (because we used the `map()` function, which returns a list) that contains the relevant model fits.
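Once the fits are stored in a list column, the other `map_*()` variants can pull values back out of them.
The sketch below is self-contained and uses a deliberately simpler, hypothetical model (`lm(mpg ~ wt)`) rather than the one above; the names `nest_demo` and `wt_slope` are made up for illustration.

```{r}
library(dplyr)
library(tidyr)
library(purrr)

nest_demo <- nest(mtcars, data = -cyl) %>%
  mutate(
    # A simpler, hypothetical model than the one used above
    model_fit = map(data, ~lm(mpg ~ wt, data = .x)),
    # map_dbl() pulls one number out of each model: the slope for `wt`
    wt_slope = map_dbl(model_fit, ~coef(.x)[["wt"]])
  )
nest_demo
```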
::: {.callout-important}
It is important to note that there is also a function called `nest_by()`.
However, it returns a `rowwise` **tibble**, i.e., any later manipulations will be applied on a row-by-row basis, unlike a standard **tibble** that applies the manipulation to every row all at once, so we would need to use normal `mutate()` syntax (and explicitly return a list column) to get the same effect as before.
```{r}
nest_by(mtcars, cyl) %>%
mutate(model_fit = list(glm(mpg ~ hp + wt + ordered(carb), data = data)))
```
:::
### `ggplot()`
To create our plots, we can use the base `plot()` functions, but the `{ggplot2}` package provides a clean and consistent interface to plotting that has many benefits.
In essence, plots are built up in layers, with each stacking on top of the previous.
To initialize a plot, we simply use the `ggplot()` function call, which creates the background of a figure.
Now we need to add data, and **geoms** to interpret that data.
Let's use the `mtcars` dataset again.
```{r}