---
pagetitle: "CSPS with R"
title: "The CSPS with R and RStudio"
subtitle: "An introduction"
author: "The CSPS team, Cabinet Office"
date: "18 February 2018"
output:
html_document:
govdown::govdown_document:
css: "style/style.css"
highlight: tango
number_sections: yes
toc: yes
toc_depth: 3
toc_float: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = NA, width = 80)
library(countdown) # remotes::install_github("gadenbuie/countdown")
```
---
# This session
## Summary
This session is a *very* brief introduction to R and RStudio for beginners, with reference to Civil Service People Survey (CSPS) data.
There's a lot of material about R and about the CSPS data that we aren't going to have time to cover today.
We'll be developing this guidance and making it freely available on the web. It will include more information on tidying, analysing, plotting and reporting.
**Important note on the data**: the data used in this document **is not real data**. Instead, it's a 'synthetic' version created using [the {synthpop} package](https://cran.r-project.org/web/packages/synthpop/vignettes/synthpop.pdf). This preserves data distributions without any response being sampled from a real individual.
## Code along
This session is designed as a 'code-along'. You'll be asked to type what you see on screen as we progress.
Ideally you have R and RStudio installed already and you are able to download packages. If so, you can do everything outlined in this document from your computer.
Don't worry if you don't have R and RStudio downloaded, or you can't download packages. Instead, we've set up an instance of RStudio in the cloud for this training session, using a non-profit service called Binder.
Click the button below to launch RStudio in your browser with Binder. It may take a few moments to load; retry or switch browser if it fails.
<!-- badges: start -->
[![Launch Rstudio Binder](http://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/matt-dray/csps-with-r-demo/master?urlpath=rstudio){target='_blank'}
<!-- badges: end -->
In this RStudio instance, the folder structure is set up and the packages and data are pre-installed for you, so you **won't** need to follow the steps in:
* section 2.5: Project folders
* section 3.6: Install and load packages (don't use `install.packages()`, but do use `library()`)
* section 4.2: Get the data set
Note that this is not how you would normally access RStudio; it has been set up so you are able to follow along with the demonstrations in the session.
After a period of inactivity, your instance of Binder will shut down. Note that any code you write won't be preserved. You will need to copy, paste and save anything you write into a file on your computer instead.
## Background
The annual CSPS produces a lot of data. Departments are provided with summary reports, but can access response-level data ('microdata') to perform their own in-depth analyses.
Many tools like Excel, SPSS and Stata are used across government to analyse the microdata. Many of these tools are proprietary and require expensive licenses. This variety can make it tricky for analysts to share approaches between departments and even within them.
<br>
<div class="activity"> `r icon::fa("edit")` <b>Activity</b>
* What program do you use for analysis?
* Have you used R before?
* What makes you want to use R?
</div>
<br>
Of course, every analyst and every department is welcome to use the tools that are available to them, that they understand and that get the job done.
Having said this, we're advocating for the [statistical programming language R](https://www.r-project.org/about.html) and the [RStudio](https://rstudio.com/) code editor.
# R and RStudio
## Why?
Why R? It:
* is free and open source
* has a strong community of users across government and the world
* handles all steps of an analysis from reading data to writing reports
* makes your work reproducible because the code can be re-run by you in future with the same or different data
* has highly configurable graphics, including interactives
* has tools for building apps and interactive 'dashboards' to let users explore your data
RStudio is a popular and well-supported piece of software for editing and running R code for both beginners and advanced users. It's also free of charge and the company behind it is [a public benefit corporation](https://blog.rstudio.com/2020/01/29/rstudio-pbc) with a commitment to producing open source software.
## {cspstools}
In particular, the CSPS team are developing some R-based tools for analysing CSPS data specifically. You will be able to download a package called {cspstools} that contains common functions for analysing CSPS data. This will help provide consistency in analysis and reporting and make tasks easier to perform and more reproducible. The tools will be shared in the open for anyone to use and so that anyone can help to improve them.
## Download
Before starting, you should download:
* [R](https://cran.ma.imperial.ac.uk/)
* [RStudio Desktop](https://rstudio.com/products/rstudio/download/) (open-source license)
Both are free, but you might need to get in touch with your IT team to get them installed to your computer.
## RStudio layout
Open RStudio -- its icon is a white letter 'R' in a blue circle: <img src="img/rstudio-icon.png" width="5%">
When you open RStudio for the first time, you'll see the window is split into three 'panes', which are numbered below:
<img src="img/rstudio-panes.png" width="100%"/>
Your window may not look *exactly* like this one, depending on your operating system.
Labelled in the image are:
1. The console pane -- code is executed here
2. The environment pane -- stored values, tables, plots, etc, are displayed here
3. The files pane -- navigate your folder structure (also has tabs for showing plots and help)
We don't need to concern ourselves with every button and tab for now.
## Project folders
<br>
<div class="binder"> `r icon::fa("address-book")` <b>Binder users</b>
Binder users: you don't need to run this section because you are already working in a Project folder with the correct folder structure.
</div>
<br>
There are many benefits to having one folder per analytical project. It means your work is more:
* organised -- all the code, data, outputs, etc, are stored in one place (a single project folder)
* reproducible -- your code can be re-run from scratch to produce the same outputs every time
* transferable -- you can pass the entire project folder to someone else and they'll be able to run it on their own machine; the filepaths you specify in your code assume the home folder is the project folder, so you can write something like `data/dataset.csv` rather than `file/path/on/my/personal/machine/that/you/cannot/access.csv`
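As a rough illustration of the 'transferable' point, the sketch below builds a file path relative to the project folder. The file name `dataset.csv` is only a placeholder and nothing is read from disk.

```{r sketch-relative-path}
# Build a path relative to the project folder (placeholder file name)
rel_path <- file.path("data", "dataset.csv")
print(rel_path)

# Show the current working directory; this is the project folder
# when RStudio has been opened via the .Rproj file
getwd()
```

Because the path doesn't mention anything specific to your machine, the same line should work for anyone who opens the project folder on their own computer.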
RStudio has a system that helps you set this up. You can create an 'RStudio Project' like this:
1. Open RStudio (the icon is a white R inside a blue circle)
2. File > New Project...
3. New Directory > New Project
4. Give your project a meaningful name in the 'Directory Name' box (e.g. `csps-r` for this session)
5. Browse for the filepath where your R Project folder will be placed
6. Click 'Create Project' and RStudio will open your project (note the project name in the top right)
<img src="img/rstudio-new-project.png" width="100%"/>
This creates a folder in the location you specified, containing an RStudio Project file (extension '.Rproj'). This folder is the 'home' of your project and this is where you should house all the files and code that you need.
For now, create two new folders -- `data` and `output` -- in your Project folder (we'll be using these later).
We haven't created any R script files yet, but they'll go in the project folder too.
This means we'll get a folder structure like this:
```
csps-r/ # the project folder
├── csps-r.Rproj # R Project file
├── data/ # read-only raw data
├── output/ # processed data
└── training.R # R script files
```
To access your RStudio Project in future, navigate to the project folder and double-click your R Project file, which has the .Rproj extension (e.g. `your-project.Rproj`). It will open RStudio in the same state that you left it when you last closed it.
## Start a new script
You'll write your code into a special text file called an R script, which has the extension `.R`.
Having opened the R Project (.Rproj) file for your analysis, open a new script by clicking File > New File > R script. A new blank script will appear in a new pane in the upper left of the RStudio window.
You can type or copy and paste code into this document. This serves as a record of the actions you used to analyse the data step-by-step.
<img src="img/rstudio-panes2.png" width="100%"/>
<br>
<div class="tip"> `r icon::fa("lightbulb")` <b>Tip</b>
* You can have multiple scripts open at once
* They will appear as separate tabs in the scripts pane
</div>
<br>
# R fundamentals
## Comments
In an R script, any characters prefixed with a hash (`#`) will be recognised as a comment. R will ignore these when you run your code.
Comments are really helpful for letting people understand what your code is doing. Try to keep a narrative going throughout your code to explain what it's doing. Be explicit -- it might be obvious to you right now why a certain line of code is being written, but you might come back in a few months' time and forget.
It's also good to use comments to explain what each block of code is doing and to explain particular lines of code. Don't worry about the code itself, but here's an example of comments in use:
```{r comments, eval=FALSE}
# Subset the data
data_set %>%
  filter(column_a > 10) %>%  # 10 is the cut-off point
  mutate(column_c = column_a + column_b)
```
It's also good to add the title, your name, date, etc, as comments at the top of your script so people know what the script is for when they open it.
## Run code
How do you actually run some R code? Let's start with a small calculation.
First, we'll add two numbers together. Type the calculation `1 + 1` into your script:
```{r add-notrun, eval=FALSE}
# Add two numbers together
1 + 1
```
To execute it, make sure your cursor is on the same line as the code and press <kbd>Command</kbd>+<kbd>Enter</kbd> on a Mac or <kbd>Control</kbd>+<kbd>Enter</kbd> on a PC (there's also a 'Run' button in the upper right of the script pane). You can also run multiple lines of code by highlighting them and then running them.
What happened when you ran the code? The following was printed to the console in the lower-left pane of RStudio:
```{r add-run, echo=FALSE}
# Add two numbers together
1 + 1
```
Great, we got the answer `2`, as expected. (The `[1]` in square brackets shows the position of the first item printed on that line and doesn't concern us right now.)
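If you're curious about the square brackets, it can help to print something with more than one item. This sketch isn't part of the code-along, and the exact line breaks you see will depend on the width of your console.

```{r sketch-index}
# Print a longer sequence of numbers; each printed line starts with
# the position of its first item in square brackets
seq(from = 10, to = 300, by = 10)
```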
<br>
<div class="tip"> `r icon::fa("lightbulb")` <b>Tip</b>
* Don't forget to save your script file
* Go to `File` > `Save` or use the <kbd>Control</kbd>+<kbd>S</kbd> or <kbd>Command</kbd>+<kbd>S</kbd> shortcuts
</div>
<br>
## Store a result
This is good, but ideally we want to store objects (values, tables, plots, etc), so we can refer to them in other pieces of code later.
You do this in R with a special operator: the 'assignment arrow', which is written as `<-`. The shortcut for it is <kbd>Alt</kbd>+<kbd>-</kbd> (hyphen).
For example, we can assign `1 + 1` to the name `my_num` with `<-`. Execute the following code:
```{r define-object}
my_num <- 1 + 1
```
Hm. Nothing printed out in the console. Instead the object is now stored in your 'environment' -- see the top right pane in RStudio:
<img src="img/rstudio-object.png" width="100%"/>
You can now refer to this object by name in your script. For example, you can print it:
```{r print-object}
print(my_num)
```
<br>
<div class="tip"> `r icon::fa("lightbulb")` <b>Tip</b>
* You can actually print an object by running its name alone
* Running `my_num` is equivalent to `print(my_num)`
* I'll be using `print()` throughout to be more explicit
</div>
<br>
The real benefit to this is that you don't have to repeat yourself every time you want to use that particular calculation. For example, you can refer to the object in new expressions:
```{r use-object}
my_num * my_num + 6
```
<br>
<div class="tip"> `r icon::fa("lightbulb")` <b>Tip</b>
* Use names that are meaningful, relatively short and use consistent naming conventions (like 'snake_case', which is all lowercase and separates words with underscores)
* For example, `var_mean` and `var_median`
* Names are case sensitive and can't contain spaces or start with a number
</div>
<br>
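Here's a small sketch of what 'case sensitive' means in practice. The object name `var_mean` is just an example, and the check assumes you haven't separately created something called `Var_Mean`.

```{r sketch-names}
# Store a value under a snake_case name
var_mean <- 10

# Names are case sensitive: 'Var_Mean' is a different name,
# and nothing has been stored under it
exists("var_mean")
exists("Var_Mean")
```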
<div class="activity"> `r icon::fa("edit")` <b>Activity</b>
* Make an object called `val1` that stores the value 543
* Make another object called `val2` that stores the value 612
* Make a third object called `calc` that is the multiplication (`*`) of `val1` and `val2`
* What value do you get when you print `calc`?
`r countdown(minutes = 1, seconds = 0, style = "position: relative; width: min-content;")`
</div>
<br>
## Object classes
We stored a numeric value in the last section. We can do more than just store one item of data at a time though.
This next chunk of code _combines_ multiple elements with the `c()` command. This kind of multi-element object is called a 'vector'.
Here's a vector that contains text rather than numbers. You put character strings inside quotation marks (`""`), which isn't needed for numbers.
```{r vectors}
# Create an example vector
dept_names <- c("DfE", "DHSC", "DfT") # combine some values
print(dept_names) # have a look at what the object contains
```
So each of the elements of the object was returned.
You can see what 'class' your object is at any time with the `class()` function.
```{r class-vector}
class(my_num)
class(dept_names)
```
<br>
<div class="tip"> `r icon::fa("lightbulb")` <b>Tip</b>
* To create a vector of numbers, you could write `c(1, 2, 3)`
* There's a shortcut for this: `1:3`
* This means 'all the numbers from 1 to 3 including 1 and 3'
</div>
<br>
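As a quick check of the shortcut in the tip, this sketch builds the same numbers both ways and compares them. The object names are made up for illustration.

```{r sketch-sequence}
# Two ways to make the same set of numbers
typed_out <- c(1, 2, 3)
shortcut <- 1:3

# Compare the two vectors element by element
typed_out == shortcut

# Pull out a single element by its position with square brackets
typed_out[2]
```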
So we've created objects composed of a single value (`my_num`) and a vector of values (`dept_names`).
The next step would be to combine a number of vectors together to create a table with rows and columns. Tables of data with rows and columns are called 'data frames' in R and are effectively a bunch of vectors of the same length stuck together.
Here's an example of a data frame built from scratch:
```{r create-df}
# Create a data frame of selected departments
dept_info <- data.frame(
  dept = dept_names,  # use vector from earlier
  headcount = c(6900, 8300, 15000),
  responsibility = c("Education", "Health", "Transport")
)
print(dept_info) # see the data frame
```
Can you see how the data frame is three vectors (`dept`, `headcount` and `responsibility`) of the same length (3 values) arranged into columns? The function `data.frame()` bound these together into a table format. Let's check the class:
```{r class-df}
class(dept_info)
```
R is capable of building very complex objects, but tabular data with rows and columns is ubiquitous and it's how the CSPS data is stored. We'll be focusing on data frames for now.
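Since a data frame is a set of vectors stuck together, you can pull those vectors back out again. The sketch below uses a small made-up data frame (rebuilt here so the chunk runs on its own) and a few base R functions for looking at its shape and columns.

```{r sketch-df-access}
# A small made-up data frame
example_df <- data.frame(
  dept = c("DfE", "DHSC", "DfT"),
  headcount = c(6900, 8300, 15000)
)

nrow(example_df)  # number of rows
ncol(example_df)  # number of columns

# Pull out one column as a vector with the dollar symbol
example_df$headcount

# Use that vector in a calculation
mean(example_df$headcount)
```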
## Use a function
You've been using functions already: `print()`, `c()`, `data.frame()`, `class()`.
A function is a reusable unit of code that performs a given task, such as reading a data file or fitting a model. There are many of these built into R already, but you can also download 'packages' of functions and you can also create your own.
Functions are written as the function name followed by brackets. The brackets contain the 'arguments', which are like the settings for the function. One argument might be a filepath to some data, another might describe the colour of points to be plotted. They're separated by commas.
So a generic function might look like this:
```{r generic-function, eval=FALSE}
# This isn't a real function; don't run it
function_name(
  data = my_data,
  colour = "red",
  option = 5
)
```
Note that you can break the function over several lines to improve readability and so you can comment on individual arguments. You can put your cursor on any of these lines and run it. You don't have to highlight the whole thing.
You can type a question mark followed by a function name to learn about its arguments. This will appear in a help file in the bottom right pane. For example, `?plot()`.
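Here's a short sketch of arguments in action with a real function, `round()`, which is built into R. It shows that you can see a function's arguments with `args()` and that arguments can be supplied by name or by position.

```{r sketch-args}
# List the arguments that round() accepts
args(round)

# Arguments can be named explicitly...
round(x = 3.14159, digits = 2)

# ...or matched by the order in which they're supplied
round(3.14159, 2)
```

Naming the arguments takes more typing, but it makes it clearer to a reader what each value is for.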
<br>
<div class="tip"> `r icon::fa("lightbulb")` <b>Tip</b>
* You can create your own functions, but we're not going to spend time on that now
* This is a good way to stop repeating code
* Here's an example that adds two provided numbers together:
```{r example-function}
# Define a function that adds two numbers
add_nums <- function(val_a, val_b) {
  val_a + val_b
}
add_nums(val_a = 3, val_b = 4) # use the function
```
</div>
<br>
## Install and load packages
<br>
<div class="binder"> `r icon::fa("address-book")` <b>Binder users</b>
Binder users: you don't need to use `install.packages()` because the packages have already been installed for you; you will need to use `library()` though.
</div>
<br>
Functions can be bundled into packages. A bunch of packages are pre-installed with R, but there are thousands more available for download. These packages extend the basic capabilities of R or improve them.
Packages can be installed to your computer using the `install.packages()` function. This automatically fetches and downloads packages from a centralised package database on the internet called [CRAN](https://cran.r-project.org), which only accepts packages that meet strict quality criteria.
<br>
<div class="tip"> `r icon::fa("lightbulb")` <b>Tip</b>
* Packages _can_ be downloaded from places other than CRAN
* You can download packages from open online repositories like GitHub
* You may need your IT team to authorise or install packages for you
</div>
<br>
We're going to use a few packages to help us:
* [{haven}](https://haven.tidyverse.org/) for reading a range of file types, like Stata's (.dta) and SPSS's (.sav)
* [{dplyr}](https://dplyr.tidyverse.org/) for data cleaning and preparation
* [{ggplot2}](https://ggplot2.tidyverse.org/) for plotting
<br>
<div class="tip"> `r icon::fa("lightbulb")` <b>Tip</b>
* {haven}, {dplyr} and {ggplot2} are part of a larger suite of packages called ['the tidyverse'](https://www.tidyverse.org/)
* This is a set of packages that share a common design philosophy and seek to make analysis more intuitive
* The RStudio company produces the majority of the packages in the tidyverse
* You can download all the packages of the tidyverse at once with `install.packages("tidyverse")`
</div>
<br>
Typically you would type `install.packages("packagename")` to download the package, but we can use the following to install the packages from the tidyverse all at once:
```{r install-packages, eval=FALSE}
install.packages("tidyverse")
```
You only need to run the installation function once per package on your machine.
Each time you start a new session you'll need to run `library("package_name")` to tell R to make the functions from that package available so you can use them in your script.
So now we have the `tidyverse` packages installed we can call the packages we need with `library()`.
```{r libraries}
library("haven") # data reading
library("ggplot2") # plotting
library("dplyr") # data wrangling
```
Sometimes a message will be printed to tell you a bit more about the package, which is what happens for {dplyr}.
We can start using functions from these packages now that they're loaded.
It's good practice to write the `library()` lines near the top of your script file so that others know which packages are being used in the script.
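If you're not sure whether a package has been installed on your machine, you can check before calling `library()`. This is a minimal sketch using base R's `requireNamespace()`; the object names `pkgs` and `is_installed` are made up for illustration, and nothing is installed by this code.

```{r sketch-check-packages}
# The packages used in this session
pkgs <- c("haven", "dplyr", "ggplot2")

# TRUE for each package that is installed, FALSE otherwise
# (as a side effect this loads, but does not attach, installed packages)
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(is_installed)

# Names of any packages that still need install.packages()
names(is_installed)[!is_installed]
```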
# Data
## Synthetic data
We aren't using real CSPS data for these exercises. Instead, we'll be using a 'synthetic' version that mimics the 2019 data.
In short, this means that the data distributions within the variables are preserved, but no response represents a real individual. This means we can get some realistic-looking outputs without any response being from a real individual.
We've also restricted the number of variables (columns) and rows (responses) to keep the data set relatively small, and have added a fake unique ID value.
The variables in the synthetic data set are:
* ResponseID (a faked unique respondent identifier) and OverallDeptCode (the department name)
* B01 to B05 (questions about 'my work') and B47 to B51 (engagement questions)
* E03 and E03_GRP (have you been bullied or harassed in the past 12 months?) and E03A_1 to E03A_16 (a column for each possible response to the bullying, harassment or discrimination experienced)
* W01 to W04 (wellbeing questions)
* J03 (ethnicity)
* Z02 (civil service grades)
* ees (a derived variable that translates each engagement question score to a percentage and takes the mean)
* mw_p (a derived variable that gives the proportion of responses to the 'my work' theme questions that were positive, i.e. 'agree' or 'strongly agree')
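The two derived variables can be illustrated with made-up numbers. This is a sketch only: it assumes that the 1 to 5 responses are rescaled so that 1 becomes 0% and 5 becomes 100%, and that a 'positive' response is a 4 or a 5. The scoring used to create the actual `ees` and `mw_p` columns may differ in detail.

```{r sketch-derived}
# One hypothetical respondent's answers to the five engagement questions
engagement <- c(4, 5, 3, 4, 2)

# Assumed rescaling: 1 -> 0%, 2 -> 25%, 3 -> 50%, 4 -> 75%, 5 -> 100%
engagement_pct <- (engagement - 1) / 4 * 100

# Engagement score for this respondent: the mean of the percentages
mean(engagement_pct)

# The same respondent's hypothetical answers to the five 'my work' questions
my_work <- c(5, 4, 2, 4, 3)

# Proportion of answers that are 'agree' (4) or 'strongly agree' (5)
mean(my_work >= 4)
```

Taking the `mean()` of a set of `TRUE` and `FALSE` values gives the proportion that are `TRUE`, which is why it works for the second calculation.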
## Get the data set
<br>
<div class="binder"> `r icon::fa("address-book")` <b>Binder users</b>
Binder users: You don't need to run this section because the data set is already in your `data/` folder.
</div>
<br>
Ordinarily we would send you the data for your organisation on request. For this session, we've prepared the synthetic data set as a Stata-format (.dta) file.
You can [download the data from the Cabinet Office GitHub page](https://github.com/co-analysis/csps-synth/blob/master/output/csps_synth.dta). Visit the link, click the 'download' button and save the downloaded file to the `data/` folder of your project.
<br>
<div class="tip"> `r icon::fa("lightbulb")` <b>Tip</b>
You could also download the file to your machine with the `download.file()` function. The first argument is `url`: the location of the data on the internet. The `destfile` argument is where you want to save the file on your computer; we want to put it in `data/`.
```{r download-file, eval=FALSE}
# The location of the data on the internet
path <-
  "https://github.com/co-analysis/csps-with-r/blob/master/data/csps_synth.dta?raw=true"
# Download from the internet to your computer
download.file(
  url = path,  # the path to the file on the internet
  destfile = "data/csps_synth.dta",  # where to save on your computer
  mode = "wb"  # save as a 'binary file'
)
```
</div>
<br>
Now take a look at the 'Files' pane in RStudio and navigate into the `data/` folder. The `csps_synth.dta` file should now be in there.
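You can make the same check with code rather than by clicking. This sketch assumes the file was saved as `data/csps_synth.dta`, as described above; it will simply report `FALSE` if the file isn't there.

```{r sketch-file-check}
# TRUE if the file is where we expect it, FALSE if not
file.exists("data/csps_synth.dta")

# List everything in the data folder
list.files("data")
```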
# Read
There are a number of functions for reading data into R. A common one is `read_csv()` from the tidyverse's {readr} package.
The {haven} package has a function called `read_stata()` that you can use to read in a .dta file. Let's read in the data with this function and name the object 'data'.
```{r haven-read}
data <- read_stata(file = "data/csps_synth.dta")
```
This will read the data in as a 'tibble', a fancier type of data frame that's used by the tidyverse packages. For example, when printed to the console, tibbles use colour coding and are truncated to fit.
<br>
<div class="activity"> `r icon::fa("edit")` <b>Activity</b>
How do you know that the data has been successfully read into R?
</div>
<br>
## Inspect
It's good to preview the data and check it looks like what we expected.
### Glimpse the structure
The {dplyr} package that we loaded earlier has a function called `glimpse()`, which tells you about the structure of the data.
```{r glimpse}
glimpse(data)
```
The top of the output tells us there are 11,555 observations (rows) and 38 variables (columns).
Column names are then listed with the data type and the first few examples. For example, 'OverallDeptCode' contains character class (`<chr>`) data in the form of strings. Column names starting with 'B', 'E', 'J' and 'Z' are question codes and they contain responses expressed in numeric form, so they're of class 'double' (`<dbl>`).
The numbers encode certain responses. For example, 1 means 'strongly disagree' and 5 means 'strongly agree' for the 'B' series of questions.
How do we know what all the numeric values mean? You'll see that a number of the columns have the label class (`<lbl>`) too. This means that the column carries additional 'attributes' that give the corresponding labels for the values.
Labels aren't used that frequently in R data frames, but are used in programs like Stata and SPSS. Since we've read in a Stata file, we've got these labels available to us.
You can also see that there are lots of `NA` values. R uses `NA` to mean 'not available' -- the data are missing. In this case, it means that the respondent didn't supply an answer for that question.
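Missing values affect calculations, so it's worth knowing how to spot and handle them. The sketch below uses a short made-up vector of responses rather than the CSPS data.

```{r sketch-na}
# Hypothetical responses to one question, with two missing answers
responses <- c(4, NA, 5, 3, NA, 2)

# TRUE where the answer is missing
is.na(responses)

# Count the missing answers
sum(is.na(responses))

# The mean is NA if any value is missing...
mean(responses)

# ...unless you ask for missing values to be removed first
mean(responses, na.rm = TRUE)
```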
### See the full data
Another way of inspecting the data is to `print()` it to the console.
```{r print}
print(data)
```
The output is displayed in table format, but is truncated to fit the console window (this prevents you from printing millions of rows to the console!). You can see the labels are printed alongside the values in this view.
If you want to see the whole dataset you could use the `View()` function:
```{r View, eval=FALSE}
View(data) # note the capital 'V'
```
This opens up a read-only tab in the script pane that displays your data in full. You can scroll around and order the columns by clicking the headers. This doesn't affect the underlying data at all.
You can also access this by clicking the little image of a table to the right of the object in the environment pane (upper-right).
# Wrangle
We're going to use a number of functions from the {dplyr} package, which we loaded earlier, to practise some data manipulation.
Functions in the tidyverse suite of packages are usually verbs that describe what they're doing, like `select()` and `filter()`.
We won't have time to go through all of the functions and their variants, but you should get a flavour of what's possible.
## Select columns
Firstly, we can `select()` columns of interest. This means we can return a version of the data set composed of a smaller number of columns. This can be helpful for a number of reasons, but in particular it lets us focus on specific variables of interest.
The {dplyr} functions take the data frame as their first argument, so the first thing we'll supply the function is our `data` object. Then we can supply the names of columns that we want to keep. Note that we can also rename columns as we select them with the format `new_name = old_name`. (Alternatively there is a `rename()` function that only renames columns.)
```{r select}
# Return specific columns
select(
  data,                 # the first argument is the data
  Z02, ethnicity = J03  # then the columns to keep
)
```
See that the order in which we selected the columns is the order in which they appeared when printed.
Instead of naming columns to keep, you can also specify columns to remove by prefixing the column name with a `-` (minus).
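Here's a minimal sketch of removing a column with a minus sign. It uses a tiny made-up data frame called `toy` (not the CSPS data) so that the chunk runs on its own; the column names are borrowed from the data set purely for illustration.

```{r sketch-select-drop}
library("dplyr")

# A tiny made-up data frame standing in for the real data
toy <- data.frame(
  ResponseID = 1:3,
  B01 = c(4, 5, 2),
  B02 = c(3, 3, 1)
)

# Keep everything except the ResponseID column
select(toy, -ResponseID)
```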
<br>
<div class="tip"> `r icon::fa("lightbulb")` <b>Tip</b>
* It's worth noting that our original data set (`data`) remains unchanged, despite us having selected some columns
* Usually you will wrangle a data set and save it with a different object name so the original is always available to you
* You can overwrite an original variable by using its name again, but it's usually best to avoid this
* For example, `data <- select(data, B01)` would overwrite our original `data` object
</div>
<br>
To save time you can use some special `select()` helper functions. For example, you can select a column that `contains()` or `starts_with()` certain strings. This is useful if you have lots of columns that share a similarity in their names, like in the CSPS (e.g. B01, B02, etc, all start with "B").
```{r select-3}
# Use select helpers to make code concise
select(data, contains("ID"), starts_with("W"))
```
<br>
<div class="activity"> `r icon::fa("edit")` <b>Activity</b>
1. Use `select()` to return all the 'B' series columns (B01, B02, etc)
1. How many columns are in this selection?
1. Do the same, but *without* the B02 column
`r countdown(minutes = 2, seconds = 0, style = "position: relative; width: min-content;")`
</div>
<br>
## Filter rows
Now to filter the rows of the data set based on certain criteria.
We're going to make use of some [logical operators](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html) for filtering our data. These return `TRUE` or `FALSE` depending on the statement's validity.
| Symbol | Meaning | Example |
| :-- | :-- | :-- |
| `==` | Equal to | `5 == 2 + 3` returns `TRUE` |
| `!=` | Not equal to | `5 != 3 + 3` returns `TRUE` |
| `%in%` | Match to a vector (shortcut for multiple logical tests) | `4 %in% c(2, 4, 6)` returns `TRUE` |
| `>`, `<` | Greater than, less than | `2 < 3` returns `TRUE` |
| `>=`, `<=` | Greater than or equal to, less than or equal to | `5 <= 5` returns `TRUE` |
| `&` | And (helps string together multiple filters) | `1 < 2 & 5 == 5` returns `TRUE` |
| `|` | Or (helps string together multiple filters) | `1 < 2 | 5 == 6` returns `TRUE` (only one of them needs to be true) |
R also has some special shortcut functions for some logical checks. For example:
| Function | Meaning | Example |
| :-- | :-- | :-- |
| `is.numeric()` | Is the content numeric class? | `is.numeric(10)` returns `TRUE` |
| `is.character()` | Is the content character class? | `is.character("Downing Street")` returns `TRUE` |
| `is.na()` | Is the content an `NA`? | `is.na(NA)` returns `TRUE` |
You can negate these functions by preceding them with a `!`, so `is.na(NA)` returns `TRUE` but `!is.na(NA)` returns `FALSE`.
Let's start by creating an object that contains the data filtered for senior civil servants (where variable Z02 equals 5) from two of the organisations.
```{r filter}
scs_orgs <- filter(
  data,
  Z02 == 5 & OverallDeptCode %in% c("ORGB", "ORGC")
)
```
See how there are two filter statements: `Z02 == 5` and `OverallDeptCode %in% c("ORGB", "ORGC")`? We're asking for both of these things to be true by using the `&` operator between them.
Notice that we used `%in%` to match to a vector of department codes (this is more concise than writing `OverallDeptCode == "ORGB" | OverallDeptCode == "ORGC"`). The codes are stored as character strings, so we put them in quotation marks.
We could print the columns of interest to see if it worked, but a better method would be to return only the 'distinct' (unique) values in these columns:
```{r filter-distinct}
distinct(scs_orgs, OverallDeptCode, Z02)
```
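To see that the `%in%` shortcut used in the filter above gives the same rows as the longhand `|` version, here is a small sketch on made-up data. The object `toy_orgs` and its values are invented for illustration; the column names simply mirror the CSPS ones.

```{r sketch-in-operator}
library(dplyr)

# A made-up stand-in for the survey data (hypothetical values)
toy_orgs <- tibble(
  OverallDeptCode = c("ORGA", "ORGB", "ORGC", "ORGB"),
  Z02 = c(5, 5, 3, 5)
)

# The %in% shortcut
with_in <- filter(
  toy_orgs,
  Z02 == 5 & OverallDeptCode %in% c("ORGB", "ORGC")
)

# The longhand version with | (note the brackets around the 'or' part)
with_or <- filter(
  toy_orgs,
  Z02 == 5 & (OverallDeptCode == "ORGB" | OverallDeptCode == "ORGC")
)

# Do both approaches return the same rows?
all.equal(with_in, with_or)
```

The brackets matter in the longhand version: R evaluates `&` before `|`, so without them the test would read as 'grade 5 in ORGB, or anyone in ORGC'.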
<br>
<div class="activity"> `r icon::fa("edit")` <b>Activity</b>
1. Use `filter()` to return senior civil servants in Org A only
2. Use `distinct()` to make sure it's worked
3. How many rows are in this filtered data set?
`r countdown(minutes = 2, seconds = 0, style = "position: relative; width: min-content;")`
</div>
<br>
## Add new columns
Now to create new columns. The function name is `mutate()`; we're 'mutating' our dataframe by budding a new column where there wasn't one before. Often you'll be creating new columns based on the content of columns that already exist, like adding the contents of one to another.
One relevant use of this for the CSPS is to create dummy columns. If certain conditions are met in other columns, we can put a '1' in the dummy column, else we can put '0' if it's not met.
So we could create a dummy column that flags when a respondent is a SEO/HEO grade (`Z02 == 3`) and also has a J03 value between 1 and 4. This example uses an `ifelse()` statement that fills the column with one value if the logical test is `TRUE` and another if it's `FALSE`.
```{r mutate}
# Add a column that gets a 1 when the condition is true
data_dummy <- mutate(
  data,
  dummy = ifelse(                    # create a new column called 'dummy'
    test = Z02 == 3 & J03 %in% 1:4,  # test this condition
    yes = 1,                         # if TRUE, put a 1 in the dummy column
    no = 0                           # otherwise put a 0 in the column
  )
)
# See if it worked
select(data_dummy, Z02, J03, dummy)
```
<br>
<div class="activity"> `r icon::fa("edit")` <b>Activity</b>
Use `mutate()` to create a dummy column where:
* people who responded with 'strongly agree' (`5`) to both B01 *and* B02 get a `1`
* everyone else gets a `0`
`r countdown(minutes = 2, seconds = 0, style = "position: relative; width: min-content;")`
</div>
<br>
`mutate()` is particularly useful for the CSPS data if we want to overwrite our numeric values with their corresponding text labels. Fortunately, the {haven} package that we loaded earlier has a function that replaces the numeric values with their labels: `as_factor()`.
We want to apply this only to the columns that are numeric. Fortunately there's a variant of `mutate()` called `mutate_if()`, which lets you use logical statements to select columns. This means we don't have to write out all their names.
```{r as-factor, eval=FALSE}
# Replace the numeric values with their text labels
data_lbl <- mutate_if(
  data,
  is.numeric,       # if the column is numeric
  haven::as_factor  # then apply the as_factor function
)
# Convert every column to character so the labels are stored as plain text
data_lbl_chr <- mutate_all(data_lbl, as.character)
glimpse(data_lbl_chr)
```
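`mutate_if()` still works, but versions of {dplyr} from 1.0.0 onwards also offer `across()` for the same job. The sketch below is an illustration on a made-up labelled column rather than the workshop data: `toy_labelled` and its labels are invented, and it assumes you'd rather pick out columns with `haven::is.labelled` than `is.numeric`, so that only labelled columns are converted.

```{r sketch-across}
library(dplyr)

# A made-up data set with one labelled column (hypothetical values and labels)
toy_labelled <- tibble(
  ResponseID = c("a", "b", "c"),
  B01 = haven::labelled(c(1, 2, 1), labels = c(Disagree = 1, Agree = 2))
)

# Apply as_factor() to every column that carries {haven} labels
toy_lbl <- mutate(
  toy_labelled,
  across(where(haven::is.labelled), haven::as_factor)
)

# The numbers have been replaced by their labels
as.character(toy_lbl$B01)
```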
<br>
<div class="tip"> `r icon::fa("lightbulb")` <b>Tip</b>
* There's more than one function called `as_factor()` -- how can we resolve this?
* We can specify that we mean the one from the {haven} package by writing it in the form `package::function()`
</div>
<br>
## Join
We can use the family of `*_join()` functions (`left_join()`, `inner_join()`, etc) to merge two data frames together on a common column.
Let's create a small trivial data frame that provides a lookup from department codes to full department names and merge it into our CSPS data.
We'll use the `tibble()` function to build the data frame. It comes from the {tibble} package and is made available when you load {dplyr}. Remember: tibbles are data frames with nice defaults and printing properties; we've seen them already in the outputs from our earlier wrangling with {dplyr}.
```{r dept-lookup}
lookup <- tibble(
  OverallDeptCode = c("ORGA", "ORGB", "ORGC"),
  dept_full_name = c("Dept for A", "Ministry of B", "C Agency")
)
print(lookup)
```
We want what is perhaps the most common join: `left_join()`. It gives you all the rows from the 'left' data set (in our case, `data`) and merges on the columns from the 'right' (our new `lookup`).
Here's what we'll be doing (gif by [Garrick Aden-Buie](https://www.garrickadenbuie.com/project/tidyexplain/)):
<img src="img/left-join.gif">
To do this, we pass two data frames to arguments `x` ('left') and `y` ('right') and provide the column name to join by.
```{r left-join}
data_join <- left_join(
  x = data,               # the original data set
  y = lookup,             # the data to merge to it
  by = "OverallDeptCode"  # the common column between them
)
```
You might get a message saying that the attributes for our joining column aren't the same. That's okay; it's because the column in `data` (the data set on the 'LHS', or 'left-hand side', of the join) has attributes, but the one in `lookup` (on the right-hand side) doesn't.
Let's check to see if columns from both data frames are present in the joined data set:
```{r inspect-join}
select(data_join, ResponseID, B01, OverallDeptCode, dept_full_name)
```
Success: the output has all the rows of the `data` data frame, plus the new column (`dept_full_name`) from the `lookup` data frame.
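It's worth knowing what a left join does when a row on the left has no match on the right. Here's a small sketch with made-up data: `toy_left`, `toy_lookup` and the code "ORGD" are invented for illustration and don't appear in the CSPS data.

```{r sketch-left-join}
library(dplyr)

# Made-up 'left' data: "ORGD" has no entry in the lookup
toy_left <- tibble(
  ResponseID = 1:3,
  OverallDeptCode = c("ORGA", "ORGD", "ORGA")
)

# Made-up 'right' data: a lookup with only one department
toy_lookup <- tibble(
  OverallDeptCode = "ORGA",
  dept_full_name = "Dept for A"
)

toy_joined <- left_join(toy_left, toy_lookup, by = "OverallDeptCode")
toy_joined
```

All three rows from the left are kept. The row with no match isn't dropped; it gets an `NA` in the new column instead.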
## Pipes
We've seen how to manipulate our data frame a bit. But we've been doing it one discrete step at a time, so your script might end up looking something like this:
```{r workflow-no-pipe}
data_select <- select(data, ResponseID, OverallDeptCode, B01, Z02)
data_filter <- filter(data_select, OverallDeptCode == "ORGA" & Z02 != 5)
data_mutate <- mutate(
  data_filter,
  positive = ifelse(B01 %in% c(4, 5), "Positive", "Not positive")
)
print(data_mutate)
```
This is fine, but you will be creating a lot of intermediate objects to get to the final data frame that you want. This clutters up your environment and can fill up your computer's memory if the data are large enough. You're also in danger of accidentally referring to the wrong object if you don't name them well.
Instead, you could create one object that is built by chaining all the functions together in order.
We'll use a special pipe operator -- `%>%` -- that will read as 'take what's on the left of the operator and pass it through to the next function'. In pseudocode:
```{r eval=FALSE}
dataset %>%
  do something %>%
  do something %>%
  do something
```
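To make the 'pass it through' idea concrete, here is a minimal sketch showing that a piped call and an ordinary call do the same thing. The object `toy_pipe` is made up for illustration.

```{r sketch-pipe-equivalence}
library(dplyr)

# A made-up data frame (hypothetical values)
toy_pipe <- tibble(B01 = c(5, 2, 4))

# Without the pipe: the data frame is written as the first argument
no_pipe <- filter(toy_pipe, B01 >= 4)

# With the pipe: whatever is on the left becomes the first argument
with_pipe <- toy_pipe %>% filter(B01 >= 4)

# Do the two approaches give the same result?
all.equal(no_pipe, with_pipe)
```

This works neatly with {dplyr} because its functions all take the data frame as their first argument.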
A real example with our data might look like this:
```{r pipe}
data_piped <- data %>%
  select(ResponseID, OverallDeptCode, B01, Z02) %>%
  filter(OverallDeptCode == "ORGA" & Z02 != 5) %>%
  mutate(positive = ifelse(B01 %in% c(4, 5), "Positive", "Not positive"))
print(data_piped)
```
So the steps for creating the `data_piped` object are:
* take the `data` object
* *then* select out some columns
* *then* filter on a variable
* *then* add a column
This is a bit like a recipe. And it's easier to read.
You also repeat yourself less. We only need to name the `data` object once, at the very start. This minimises the chance that you'll accidentally refer to the wrong object.
## Save your wrangled data
There are a number of ways and formats in which to save our wrangled data.
For example, to save the output as a Stata file, an RDS file or a CSV, we can use one of these:
```{r write-files, eval=FALSE}
write_dta(data_piped, "output/data_piped.dta") # Stata-format
# Other options
write_rds(data_piped, "output/data_piped.rds") # R-specific format
write_csv(data_piped, "output/data_piped.csv") # comma-separated values
```
You pass to the function the object name and the filepath for where you want it to be saved.
Note that the labels will be lost if you save as CSV, but they're retained in .dta and .rds format.
Check in your `output/` folder to make sure they've been saved.
You can then read these back in as we did earlier in this document (you don't have to do this now):
```{r read-files, eval=FALSE}
data_rds <- read_rds("output/data_piped.rds") # R-specific format
data_dta <- read_stata("output/data_piped.dta") # Stata-format
data_csv <- read_csv("output/data_piped.csv") # comma-separated values
```
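If you'd like to convince yourself that a file really has been written and can be read back, here's a small round-trip sketch. It's an illustration only: `toy_out` is made up, and the file goes to a temporary folder (via `tempdir()`) rather than `output/` so that nothing in your project is touched.

```{r sketch-round-trip}
library(readr)

# A made-up data frame to save (hypothetical values)
toy_out <- data.frame(id = c(1, 2), answer = c("yes", "no"))

# Build a filepath inside R's temporary folder
toy_path <- file.path(tempdir(), "toy_out.rds")

write_rds(toy_out, toy_path)    # save the object
file.exists(toy_path)           # confirm the file was written
toy_back <- read_rds(toy_path)  # read it back in

# Is the object we read back the same as the one we saved?
identical(toy_back, toy_out)
```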
# Summarise
So far we've been wrangling but not analysing data. Let's look at the `summarise()` function for some quick summaries.
A simple example might be to get the total count of responses in the data set and the mean of the engagement scores.
```{r summarise}
data %>%
  summarise(
    total_count = n(),
    ees_mean = round(mean(ees, na.rm = TRUE), 2)
  )
```
That's good, but we can extend the summary so we get results grouped by some other variables.
This is what the `group_by()` function does. You give `group_by()` the variables within which to summarise and you finish by calling `ungroup()` so that the subsequent functions don't get applied to the groups.
So here's a more comprehensive example that gets the total count and mean EES grouped within departments and the Z02 variable (grade). It then filters out people who didn't answer Z02 and uses `mutate()` to suppress any mean EES values based on fewer than 10 responses.
```{r group-by}
data %>%
  group_by(OverallDeptCode, Z02) %>%
  summarise(
    total_count = n(),
    ees_mean = round(mean(ees, na.rm = TRUE), 2)
  ) %>%
  ungroup() %>%
  filter(!is.na(Z02)) %>%
  mutate(
    ees_mean_supp = ifelse(
      test = total_count < 10, yes = NA, no = ees_mean
    )
  )
```
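To see what the grouping and ungrouping steps do on something small enough to check by eye, here's a sketch with made-up data. The object `toy_scores` and its values are invented for illustration. It also assumes {dplyr} 1.0.0 or later, where `summarise()` has a `.groups` argument that can be used in place of a separate `ungroup()` call.

```{r sketch-group-by}
library(dplyr)

# A made-up data set: four respondents across two departments and two grades
toy_scores <- tibble(
  dept  = c("ORGA", "ORGA", "ORGA", "ORGB"),
  grade = c(1, 1, 2, 1),
  score = c(2, 4, 5, 3)
)

toy_summary <- toy_scores %>%
  group_by(dept, grade) %>%
  summarise(
    total_count = n(),          # number of rows in each group
    score_mean  = mean(score),  # mean score in each group
    .groups = "drop"            # remove the grouping, like ungroup()
  )

toy_summary

# Check that the result is no longer grouped
is_grouped_df(toy_summary)
```

There's one row per combination of `dept` and `grade` that appears in the data, so three rows here rather than four.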
# Plot
We could have a whole separate session on visualising data.
The tidyverse package for plotting is called {ggplot2}. The 'gg' stands for 'grammar of graphics'. It's a system to build up a graphic using common components including:
* data
* 'geoms' (marks that represent the data)
* a coordinate system
You also supply aesthetic properties like size, colour, x and y locations.
<center><img src="img/ggplot2-cheat.png" width="70%"></center><br>
These elements are built up with the `+` operator. Imagine you've created a blank canvas and you're adding each layer. (This is different to using the pipe, `%>%`, which is passing information from the left-hand side to the right-hand side.)
The great thing about building plots with code is that you can produce them with the same styles very quickly without all the manual adjustments that might be required in some other programs.
{ggplot2} is a very powerful graphics package that can create all sorts of charts. Check out [the R Graph Gallery](https://www.r-graph-gallery.com/) for some more examples.
## Simple plots
For now, let's look at a simple bar chart of the answers to question B01 using the `ggplot()` function from {ggplot2}.
```{r plot-simple}
# Prepare the data
plot_data <- data %>%
  filter(!is.na(B01)) %>%          # remove NAs
  count(OverallDeptCode, B01) %>%  # count() is a shortcut for summarising
  mutate(
    B01 = haven::as_factor(B01),   # add the text labels
    Department = OverallDeptCode
  )
# Plot the data
plot_data %>%                    # with the plot data
  ggplot(aes(x = B01, y = n)) +  # create a canvas with these coords
  geom_col()                     # apply columns to the canvas given the coords
```
What just happened? We:
* gave `ggplot()` the dataset, `plot_data`
* gave it some aesthetic mappings, `aes()` (in this case, the x and y variables)
* added the column geom, `geom_col()`, to make a bar chart
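Because each element is added with `+`, a plot can be stored part-built and then extended later. Here's a minimal sketch of that idea with made-up counts; `toy_counts`, `toy_canvas` and `toy_plot` are invented names for illustration.

```{r sketch-layers}
library(ggplot2)

# Made-up counts to plot (hypothetical values)
toy_counts <- data.frame(answer = c("Agree", "Disagree"), n = c(8, 2))

# Start with a blank canvas that knows the data and the aesthetic mappings
toy_canvas <- ggplot(toy_counts, aes(x = answer, y = n))

# Add to the stored canvas one piece at a time with +
toy_plot <- toy_canvas +
  geom_col() +    # draw the columns
  theme_light()   # apply a theme

toy_plot
```

Printing `toy_canvas` on its own should give you the axes with nothing drawn on them, which is a handy way to check your mappings before adding geoms.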
We can spruce this up a little by adding on additional things like a theme or labels.
```{r plot-stacked}
ggplot(plot_data, aes(x = B01, y = n)) +
  geom_col(aes(fill = Department)) +
  coord_flip() +                            # flip the axes
  theme_light() +                           # apply a theme
  scale_fill_brewer(palette = "Blues") +    # set the bar colours
  labs(                                     # provide overall labels
    title = "Most people say they're interested in their work",
    subtitle = "This is true across all organisations",
    caption = "Source: B01, synthetic CSPS data"
  ) +
  xlab(NULL) +                              # remove the x axis title
  ylab("Count of responses")                # y axis title
```
But we could also split each department's results into a grid of small multiples, or 'facets', with `facet_grid()`.
```{r plot-facet}
ggplot(plot_data, aes(x = B01, y = n)) +
geom_col() +
coord_flip() +
theme_light() +
labs(
title = "Most people say they're interested in their work",
subtitle = "This is true across all organisations",
caption = "Source: B01, synthetic CSPS data"
) +
xlab(NULL) +
ylab("Count of responses") +
facet_grid(
cols = vars(OverallDeptCode), # one column per department