-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathtemporal-trends.qmd
848 lines (586 loc) · 39.2 KB
/
temporal-trends.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
# Temporal patterns
In this part of the workshop, we explore the temporal patterns displayed by human mobility data from Facebook Meta. The data was collected for Chile in the time period spanning from end of March of 2020 to the end of March 2022.
```{r}
#clean environment
rm(list=ls())
```
::: callout-note
Before getting started, run the code below to avoid getting warnings and messages when executing the rest of the workbook.
:::
```{r}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
## Dependencies
To analyse the data, we need R and the following R packages (see installation instructions on \[R packages\]). Once installed, you can use the R packages by loading them as libraries:
```{r}
#| warning: false
#| message: false
# Load the tidyverse package for data manipulation and visualization
library(tidyverse)
# Load the zoo package for working with time series data
library(zoo)
# Load the mice package for multiple imputation of missing data
library(mice)
# Load the ggthemes package for additional plot themes
library(ggthemes)
```
## Data wrangling {#sec-loading}
We start by looking at the temporal evolution of internal human mobility in Chile in the first few months after the first outbreak of COVID-19. To do so, we read the movement data corresponding to the months of May, June, July, August and September 2020. The data is stored as .rds files and can be read using the `readRDS()` function:
```{r}
df_5 <- readRDS("./data/fb/movement_grid/2020_05_mov.rds")
df_6 <- readRDS("./data/fb/movement_grid/2020_06_mov.rds")
df_7 <- readRDS("./data/fb/movement_grid/2020_07_mov.rds")
df_8 <- readRDS("./data/fb/movement_grid/2020_08_mov.rds")
df_9 <- readRDS("./data/fb/movement_grid/2020_09_mov.rds")
```
Since we are interested in the evolution of trends in internal human mobility over this five-month period, we bind the data sets together using the function `rbind()`. This function appends datasets that share the same columns. We call the resulting data set `df_mov`.
```{r}
# Combine the data frames df_5, df_6, df_7, df_8, and df_9 into a single data frame
df_mov <- rbind(df_5, df_6, df_7, df_8, df_9)
```
But, what are the columns of `df1`, `df2`, `df3`, `df4`, `df5` and `df_mov`? We can easily visualise a dataset and its contents with the `glimpse()` function. For example, for `df_mov`:
```{r}
# Display a concise summary of the data frame df_mov
glimpse(df_mov)
```
In the output, we first see the number of rows and columns in the data set. Then we see the name of each column, followed by the type of data stored in that column (e.g. `<chr>` for character), and a sample of the data values. Alternatively, if we only want to know the names of the columns, we can use the `names()` function:
```{r}
# Retrieve the names of the variables (columns) in the data frame df_mov
names(df_mov)
```
As you might have guessed, each column represents a variable or a characteristic of the data stored in the dataset. Each row or data point in the resulting dataset `df_mov` represents a population flow between an origin and a destination location, taking place during a given time window, in this case, within the five-month period starting in May 2020. The information about the origin and destination of the flow is fully determined by the `geometry` column. Other columns also contain information about the start and end locations of the movements, such as `start_polygon_id`, `start_polygon_name`, `end_polygon_id`, `end_polygon_name`, `start_lat`, `start_lon`, `end_lat`, `end_lon`, `start_quadkey` and `end_quadkey`. The information about the time window in which each population flow takes place is contained in the `date_time` column.
In the next few sections, we will focus only on three variables, namely, `date_time`, `n_crisis` and `n_baseline`. But first, let's talk a bit more about the datetime format.
### Variable selection
Of all the columns in the dataset `df_mov`, three of them are particularly relevant for the analysis in this section of the workshop, `n_crisis`, `n_baseline` and `date_time`. Firstly, the variable `n_crisis` gives us the size of a given flow, i.e. the number of people moving from an origin to a destination. For the first flow, this would be:
```{r}
# Retrieve the value of the variable 'n_crisis' from the first row of the data frame 'df_mov'
df_mov$n_crisis[[1]]
```
But is this number large or small? In order to establish a benchmark, we can look at `n_baseline`. This variable tells us about the number of people that we would expect travelling from origin to destination on the same weekday at the same time of the day in normal conditions, i.e. pre-pandemic.
```{r}
# Retrieve the value of the variable 'n_baseline' from the first row of the data frame 'df_mov'
df_mov$n_baseline[[1]]
```
The fact that the value of `n_crisis` is smaller than `n_baseline` for the first flow suggests that during the pandemic people were moving less between the particular origin-destination pair of locations corresponding to the first flow in the dataset.
However, in order to draw conclusions at the population level, looking at a data point in isolation is not enough. We can gain much more meaningful insights if we look at much larger amounts of data as well as the temporal evolution of the observed patterns. To do this, we need to look into the `date_time` variable, that tells us the 8-hour time window in which a flow takes place. For example, we can look up the value of `date_time` for the first flow recorded in `df_mov`:
```{r}
# Accessing the value of the first element in the "date_time" column of the "df_mov" data frame
df_mov$date_time[[1]]
```
Therefore, the first flow in the dataset `df_mov` took place between 00:00 and 08:00 of the 1st of May 2020. However, for the analysis in this workbook, we are interested only in the patterns taking place at the daily scale, so the information regarding the time window within the day is irrelevant.
### Extracting calendar date
To explore the variation in the number of flows at the daily level, we need to first extract only the date from each movement. This can be done by running the code below, where we create a new column in `df_mov` named `date`. This new column stores only the information corresponding to date which can be obtained by applying the `as.Date()` function to the data from the `date_time` column. This function allows us to convert character data into objects of class `Date` representing calendar dates and following a specified format (in this case, year-month-day).
```{r}
# Converting the "date_time" column in the "df_mov" data frame to Date format
# and assigning the result to a new column called "date"
df_mov$date <- as.Date(df_mov$date_time, format = "%Y-%m-%d")
```
### Temporally grouping data
Since we are interested in the number of movements taking place on a daily basis, we need to aggregate the information from flows that take place on the same day. Specifically, we need to sum all the values of `n_crisis` corresponding to flows from the same day. This can be done, firstly, by passing the `df_mov` dataset to the `group_by()` function and specifying that we want to group the data according to the information stored in the `df_mov$date` column that we just created. This action can be achieved via the pipeline operator `%>%`, which simply feeds the results of one operation into the next operation. For those familiar with the mathematical language, the pipe operator is equivalent to the function composition operation. But the `group_by()` function alone will not give the desired output, so we follow it by the `summarise()` function with an appropriate action to perform, in this case the sum of the values of `n_crisis` for all the entries that share a date. The output is a dataset containing two columns, named `total` and `df_mov$date`. We rename the second column of the resulting dataset to `date` instead of `df_mov$date`. The final dataset is `df_crisis`.
```{r}
# Creating a new data frame called "df_crisis" by performing the following operations:
# Grouping the rows of the "df_mov" data frame by the "date" column
df_crisis <- df_mov %>%
group_by(df_mov$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_crisis" column
# and creating a new column called "total" to store the sum
summarise(total = sum(n_crisis)) %>%
# Renaming the "df_mov$date" column as "date" in the resulting data frame
rename("date" = "df_mov$date")
```
We apply a very similar process to sum the values of `n_baseline` corresponding to flows from the same day. The resulting dataset is `df_baseline`.
```{r}
# Creating a new data frame called "df_baseline" by performing the following operations:
# Grouping the rows of the "df_mov" data frame by the "date" column
df_baseline <- df_mov %>%
group_by(df_mov$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_baseline" column
# and creating a new column called "total" to store the sum
summarise(total = sum(n_baseline)) %>%
# Renaming the "df_mov$date" column as "date" in the resulting data frame
rename("date" = "df_mov$date")
```
## System-wide analysis
With `df_crisis` and `df_baseline`, we are ready to create our first figure. The goal is to show the daily evolution of internal movements at the nation-wide scale in Chile during the five-month period starting in May 2020. Therefore, the figure will include information about the date on the *x*-axis and the number of movements on the *y*-axis. Furthermore, the figures should allow us to compare the trends in the period of interest with the baseline levels of internal mobility. Hence, our plot will include two components: a line showing the evolution of internal movements from May to September 2020 and a line representing the baseline.
To generate the figure, we use the `ggplot2` package. First, `ggplot()` is used to construct the initial plot object, and is followed by a plus sign ( + ) to add components to the plot. Note that we add a legend to specify the component of the plot that refers to the "crisis" or pandemic period and the component that refers to the baseline.
```{r}
# Creating a line plot using ggplot
ggplot() +
# Adding a line layer for the "df_baseline" data
geom_line(mapping = aes(x = date, y = total, color = "Baseline"),
data = df_baseline, linewidth = 1.5) +
# Adding a line layer for the "df_crisis" data
geom_line(mapping = aes(x = date, y = total, color = "Crisis"),
data = df_crisis, linewidth = 1.5) +
# Customizing the color scale manually
scale_color_manual(values = c("Crisis" = "red", "Baseline" = "blue"),
labels = c("Baseline", "Crisis")) +
# Customizing plot labels
labs(color = "Legend",
title = "Figure 1: Temporal patterns in the number of movements in Chile",
x = "Date",
y = "Number of movements") +
# Applying the "minimal" theme
theme_minimal()
```
We can also plot the difference between the levels of internal migration during the pandemic period and the baseline. First we create a new dataset `df_difference` with a column named `difference` as follows:
```{r}
# Creating a new data frame called "df_difference" as a copy of "df_crisis"
df_difference <- df_crisis
# Adding a new column called "difference" to "df_difference"
# The values in this column are calculated as the difference between the "total" column of "df_crisis" and "df_baseline"
df_difference$difference <- df_crisis$total - df_baseline$total
```
Then we generate a plot following a procedure as above:
```{r}
# Creating a line plot using ggplot
ggplot() +
# Adding a line layer for the "df_difference" data
geom_line(mapping = aes(x = date, y = difference, color = "Difference"),
data = df_difference, linewidth = 1.5) +
# Customizing the color scale manually
scale_color_manual(values = c("Difference" = "darkgreen"),
labels = c("Difference between crisis \nand baseline")) +
# Customizing plot labels
labs(color = "Legend",
title = "Figure 2: Temporal patterns in the number of movements in Chile",
x = "Date",
y = "Difference in number of movements") +
# Applying the "minimal" theme
theme_minimal() +
# Modifying y-axis labels to display in full form with no space for thousands separator
scale_y_continuous(labels = scales::number_format(big.mark = ""))
```
::: {.callout-note icon="false" appearance="simple"}
**Question 1**
The patterns during the crisis period, display an overall trend and some shorter-term oscillations. What do these oscillations correspond to?
:::
::: {.callout-note icon="false" appearance="simple"}
**Question 2**
Does the observed overall trend in the 5-month period starting in May 2020 agree with what you expected? You may answer this question based on your knowledge on the evolution of the COVID-19 pandemic in Chile.
:::
## Santiago: Internal movements
Up to this point, we have focused on the temporal trends of internal mobility at the nation-wide scale. Thanks to the fine level of detail in our original datasets, we can focus the analysis on more specific areas or regions. In the remainder of this sections, we will analyse movements that involve Santiago, Chile's capital city. As we pointed out in section @sec-loading, there are several columns in the dataset `df_mov` that contain information about the start and end locations of the movements. The spatial variability of internal population flows will be explored in more detail in the next section of the workshop. But for now, we will focus on the columns names `start_polygon_name` and `end_polygon_name`, which contain the name of the province where a flow starts or ends respectively.
### Santiago as destination
Hence, to obtain only the movements that have the province of Santiago as their destination, we need to filter the data set `df_mov` so that it only contains rows where the value of `end_polygon_name` is Santiago. In R, this can be done very simply by running the lines of code below. The first line, keeps only the flows that start outside of Santiago and stores the result in the `df_mov_to` dataset. The second line, keeps only the flows that end in Santiago and stores the result in the `df_mov_to` dataset. We call the resulting data set `df_mov_to`.
```{r}
# Filtering the "df_mov" data frame to select rows where "start_polygon_name" is not 'Santiago'}
df_mov_to <- filter(df_mov, start_polygon_name != 'Santiago')
# Further filtering the "df_mov_to" data frame to select rows where "end_polygon_name" is 'Santiago'
df_mov_to <- filter(df_mov_to, end_polygon_name == 'Santiago')
```
Then, like before, we create a new column for `df_mov_to` containing only information about the date (and not the time) in year-month-day format:
```{r}
# Converting the "date_time" column in the "df_mov_to" data frame to Date format
# and assigning the result to a new column called "date"
df_mov_to$date <- as.Date(df_mov_to$date_time, format = "%Y-%m-%d")
```
and we follow this action by grouping the data by date. We generate the datasets `df_to_crisis` and `df_to_baseline`, which contain the sum of the values of `n_crisis` and `n_baseline` for flows that happen on the same day:
```{r}
# Creating a new data frame called "df_to_crisis" by performing the following operations:
# Grouping the rows of the "df_mov_to" data frame by the "date" column
df_to_crisis <- df_mov_to %>%
group_by(df_mov_to$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_crisis" column
# and creating a new column called "total" to store the sum
summarise(total = sum(n_crisis)) %>%
# Renaming the "df_mov_to$date" column as "date" in the resulting data frame
rename("date" = "df_mov_to$date") %>%
# Ungrouping the data frame to remove the grouping structure
ungroup()
```
```{r}
# Grouping the rows of the "df_mov_to" data frame by the "date" column
df_to_baseline <- df_mov_to %>%
group_by(df_mov_to$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_baseline" column
# and creating a new column called "total" to store the sum
summarise(total = sum(n_baseline)) %>%
# Renaming the "df_mov_to$date" column as "date" in the resulting data frame
rename("date" = "df_mov_to$date") %>%
# Ungrouping the data frame to remove the grouping structure
ungroup()
```
With all these elements, we can generate a figure showing the trends in the number of movements from any province outside of Santiago to Santiago, both during the 5-month crisis period during the pandemic and during the baseline period:
```{r}
# Creating a line plot using ggplot
ggplot() +
# Adding a line layer for the "df_to_baseline" data
geom_line(mapping = aes(x = date, y = total, color = "Baseline"),
data = df_to_baseline, linewidth = 1.5) +
# Adding a line layer for the "df_to_crisis" data
geom_line(mapping = aes(x = date, y = total, color = "Crisis"),
data = df_to_crisis, linewidth = 1.5) +
# Customizing the color scale manually
scale_color_manual(values = c("Crisis" = "magenta", "Baseline" = "blue"),
labels = c("Baseline", "Crisis")) +
# Customizing plot labels
labs(color = "Legend",
title = "Figure 3: Temporal patterns in the number of movements to Santiago",
x = "Date",
y = "Number of movements") +
# Applying the "minimal" theme
theme_minimal()
```
### Santiago as origin
An analogous analysis can be carried out for movements with origin in the province of Santiago and destination elsewhere. To do this, we follow exactly the same steps as before but modifying the way in which we filter `df_mov`, and replacing the names of new variables where appropriate (e.g. `df_mov_from` instead of `df_mov_to`):
```{r}
# Filtering the "df_mov" data frame to select rows where "start_polygon_name" is 'Santiago'
df_mov_from <- filter(df_mov, start_polygon_name == 'Santiago')
# Further filtering the "df_mov_from" data frame to select rows where "end_polygon_name" is not 'Santiago'
df_mov_from <- filter(df_mov_from, end_polygon_name != 'Santiago')
```
```{r}
# Converting the "date_time" column in the "df_mov_from" data frame to Date format
# and assigning the result to a new column called "date"
df_mov_from$date <- as.Date(df_mov_from$date_time, format = "%Y-%m-%d")
```
```{r}
# Creating a new data frame called "df_from_crisis" by performing the following operations:
# Grouping the rows of the "df_mov_from" data frame by the "date" column
df_from_crisis <- df_mov_from %>%
group_by(df_mov_from$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_crisis" column
# and creating a new column called "total" to store the sum
summarise(total = sum(n_crisis)) %>%
# Renaming the "df_mov_from$date" column as "date" in the resulting data frame
rename("date" = "df_mov_from$date") %>%
# Ungrouping the data frame to remove the grouping structure
ungroup()
```
```{r}
# Creating a new data frame called "df_from_baseline" by performing the following operations:
# Grouping the rows of the "df_mov_from" data frame by the "date" column
df_from_baseline <- df_mov_from %>%
group_by(df_mov_from$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_baseline" column
# and creating a new column called "total" to store the sum
summarise(total = sum(n_baseline)) %>%
# Renaming the "df_mov_from$date" column as "date" in the resulting data frame
rename("date" = "df_mov_from$date") %>%
# Ungrouping the data frame to remove the grouping structure
ungroup()
```
```{r}
# Creating a line plot using ggplot
ggplot() +
# Adding a line layer for the "df_from_baseline" data
geom_line(mapping = aes(x = date, y = total, color = "Baseline"),
data = df_from_baseline, linewidth = 1.5) +
# Adding a line layer for the "df_from_crisis" data
geom_line(mapping = aes(x = date, y = total, color = "Crisis"),
data = df_from_crisis, linewidth = 1.5) +
# Customizing the color scale manually
scale_color_manual(values = c("Crisis" = "darkcyan", "Baseline" = "blue"),
labels = c("Baseline", "Crisis")) +
# Customizing plot labels
labs(color = "Legend",
title = "Figure 4: Temporal patterns in the number of movements from Santiago",
x = "Date",
y = "Number of movements") +
# Applying the "minimal" theme
theme_minimal()
```
We can also compare the movements to and from Santiago province, this time omitting the baseline trends for clarity:
```{r}
# Creating a line plot using ggplot
ggplot() +
# Adding a line layer for the "df_to_crisis" data
geom_line(mapping = aes(x = date, y = total, color = "To"),
data = df_to_crisis, size = 1.5) +
# Adding a line layer for the "df_from_crisis" data
geom_line(mapping = aes(x = date, y = total, color = "From"),
data = df_from_crisis, size = 1.5) +
# Customizing the color scale manually
scale_color_manual(values = c("To" = "magenta", "From" = "darkcyan"),
labels = c("From Santiago", "To Santiago")) +
# Customizing plot labels
labs(color = "Legend",
title = "Figure 5: Temporal patterns in the number of movements to and from Santiago",
x = "Date",
y = "Number of movements") +
# Applying the "minimal" theme
theme_minimal()
```
::: {.callout-note icon="false" appearance="simple"}
**Question 3**
Does the overall trend in the internal movements involving Santiago province agree with your intuition?
:::
::: {.callout-note icon="false" appearance="simple"}
**Question 4**
Can you observe any differences in the shorter-term oscillations corresponding to movements to and from Santiago province? To answer this question, it might be helpful to complete the Exercise below.
:::
::: {.callout-warning icon="false" appearance="simple"}
**Exercise 1**
Create a plot analogous to Figure 5, but comparing the baseline movements instead of the movements during the crisis period.
:::
### Santiago-Valparaíso movements
It is possible to look at movements beetween a specified origin and destination. Here, we focus on the flow of people between Santiago and a popular destination province for Santiaguinos, Valparaíso. To explore the movements from Santiago to Valparaíso, you can run the code below:
```{r}
# Filtering the "df_mov" data frame to select rows where "start_polygon_name" is 'Santiago'
df_mov_between1 <- filter(df_mov, start_polygon_name == 'Santiago')
# Further filtering the "df_mov_between1" data frame to select rows where "end_polygon_name" is 'Valparaíso'
df_mov_between1 <- filter(df_mov_between1, end_polygon_name == 'Valparaíso')
```
```{r}
# Converting the "date_time" column in the "df_mov_between1" data frame to Date format
# and assigning the result to a new column called "date"
df_mov_between1$date <- as.Date(df_mov_between1$date_time, format = "%Y-%m-%d")
```
```{r}
# Creating a new data frame called "df_between1_crisis" by performing the following operations:
# Grouping the rows of the "df_mov_between1" data frame by the "date" column
df_between1_crisis <- df_mov_between1 %>%
group_by(df_mov_between1$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_crisis" column
# and creating a new column called "total" to store the sum
summarise(total = sum(n_crisis)) %>%
# Renaming the "df_mov_between1$date" column as "date" in the resulting data frame
rename("date" = "df_mov_between1$date") %>%
# Ungrouping the data frame to remove the grouping structure
ungroup()
```
```{r}
# Creating a new data frame called "df_between1_baseline" by performing the following operations:
# Grouping the rows of the "df_mov_between1" data frame by the "date" column
df_between1_baseline <- df_mov_between1 %>%
group_by(df_mov_between1$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_baseline" column
# and creating a new column called "total" to store the sum
summarise(total = sum(n_baseline)) %>%
# Renaming the "df_mov_between1$date" column as "date" in the resulting data frame
rename("date" = "df_mov_between1$date") %>%
# Ungrouping the data frame to remove the grouping structure
ungroup()
```
```{r}
# Creating a line plot using ggplot
ggplot() +
# Adding a line layer for the "df_between1_baseline" data
geom_line(mapping = aes(x = date, y = total, color = "Baseline"),
data = df_between1_baseline, size = 1.5) +
# Adding a line layer for the "df_between1_crisis" data
geom_line(mapping = aes(x = date, y = total, color = "Crisis"),
data = df_between1_crisis, size = 1.5) +
# Customizing the color scale manually
scale_color_manual(values = c("Crisis" = "magenta", "Baseline" = "blue"),
labels = c("Baseline", "Crisis")) +
# Customizing plot labels
labs(color = "Legend",
title = "Figure 6: Temporal patterns in the number of movements from Santiago to Valparaíso",
x = "Date",
y = "Number of movements") +
# Applying the "minimal" theme
theme_minimal()
```
::: {.callout-note icon="false" appearance="simple"}
**Question 5**
Does the overall trend in the internal movements from Santiago to Valparaíso in this time period agree with your intuition?
:::
::: {.callout-warning icon="false" appearance="simple"}
**Exercise 2**
Create a plot analogous to Figure 6, but comparing the movements from Valparaíso to Santiago during the crisis and baseline period.
:::
::: {.callout-warning icon="false" appearance="simple"}
**Exercise 3**
Compare the trends in movements from Santiago to Valparaíso and from Valparaíso to Santiago. Are the overall trends similar? Are there differentces in the shorter-term fluctuations?
:::
## Relationship between movements and active Facebook user population
So far, we have looked at the 5-month period starting in May 2020 and spanning throughout the Chilean winter. But, as mentioned at the beginning of the workbook, Facebook Meta collected data for Chile in the two-year period spanning from the end of March of 2020 to the end of March 2022. As we saw in @sec-loading, the data is organised in files by month, and we can load them individually. However, since we are going to look at many months, we can automate the data-reading process. This is done by first, specifying the path to the folder where the movement data files are stored, and then, using a for loop to read each of the files in the folder and bind them together. Getting a full understanding of the code below is out of the scope of this workshop, but we have included comments starting with the ( \# ) symbol to indicate what each line of code does in case you are curious.
```{r}
# Get a list of file paths in the folder
folder_path <- "./data/fb/movement_grid/"
file_paths <- list.files(path = folder_path, pattern = "\\.rds$", full.names = TRUE)
# Create an empty list to store the data frames
dfs <- list()
# Loop through the file paths
for (file_path in file_paths) {
# Read each file as a data frame
df <- readRDS(file_path)
# Store the data frame in the list
dfs <- append(dfs, list(df))
}
# Bind all the data frames into a single data frame
df_mov_full <- bind_rows(dfs)
```
In the previous sections, we have focused on analysing movement data during the pandemic and comparing it to a baseline period. While some insights can be gained by looking only at this data, it is important to consider the number of active Facebook users. The reason for this can be illustrated through the following examples. Suppose that the number of Facebook users drops to zero, then no movements would be recorded. However, this does not mean there are no movements taking place. Conversely, suppose Facebook becomes very popular and the whole population starts using it, then more movements would probably be recorded but this does not mean that people started moving more. Hence, it can be very useful to look at the number of movements per active user and analyse how this quantity evolves both during the pandemic and the baseline.
Below, we load data for the population of active Facebook users:
```{r}
# Get a list of file paths in the folder
folder_path <- "data/fb/population_grid/"
file_paths <- list.files(path = folder_path, pattern = "\\.rds$", full.names = TRUE)
# Create an empty list to store the data frames
dfs <- list()
# Loop through the file paths
for (file_path in file_paths) {
# Read each file as a data frame
df <- readRDS(file_path)
# Store the data frame in the list
dfs <- append(dfs, list(df))
}
# Bind all the data frames into a single data frame
df_mov_full_pop <- bind_rows(dfs)
```
### Data preparation
Like the movement datasets, the Facebook population data is aggregated into 8-hour time windows but here, we are only interested in the variation at the daily level. Therefore, we group by date the number of movements and the number of active users during the crisis period.
```{r}
# Converting the "date_time" column in the "df_mov_full" data frame to Date format
# and assigning the result to a new column called "date"
df_mov_full$date <- as.Date(df_mov_full$date_time, format = "%Y-%m-%d")
```
```{r}
# Converting the "date_time" column in the "df_mov_full_pop" data frame to Date format
# and assigning the result to a new column called "date"
df_mov_full_pop$date <- as.Date(df_mov_full_pop$date_time, format = "%Y-%m-%d")
```
```{r}
# Creating a new data frame called "df_mov_full_crisis" by performing the following operations:
# Grouping the rows of the "df_mov_full" data frame by the "date" column
df_mov_full_crisis <- df_mov_full %>%
group_by(df_mov_full$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_crisis" column
# and creating a new column called "n_crisis" to store the sum
summarise(n_crisis = sum(n_crisis)) %>%
# Renaming the "df_mov_full$date" column as "date" in the resulting data frame
rename("date" = "df_mov_full$date") %>%
# Ungrouping the data frame to remove the grouping structure
ungroup()
```
```{r}
# Creating a new data frame called "df_mov_full_pop_crisis" by performing the following operations:
# Grouping the rows of the "df_mov_full_pop" data frame by the "date" column
df_mov_full_pop_crisis <- df_mov_full_pop %>%
group_by(df_mov_full_pop$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_crisis" column
# and creating a new column called "pop_n_crisis" to store the sum
summarise(pop_n_crisis = sum(n_crisis)) %>%
# Renaming the "df_mov_full_pop$date" column as "date" in the resulting data frame
rename("date" = "df_mov_full_pop$date") %>%
# Ungrouping the data frame to remove the grouping structure
ungroup()
```
Once the grouping by date is done, we merge the resulting datasets for movements and population of active Facebook users according to the `date` column:
```{r}
# Merging the "df_mov_full_crisis" and "df_mov_full_pop_crisis" data frames based on the "date" column
# and assigning the result to a new data frame called "df_full_crisis"
df_full_crisis <- merge(df_mov_full_crisis, df_mov_full_pop_crisis, by = "date", all.x = TRUE)
```
The original data for the population of active Facebook users is missing some dates, so we estimate the missing data using Multivariate Imputation by Chained Equations (with the mice package).
```{r}
# Create a mice object with only the column you want to impute
df_to_complete <- mice(df_full_crisis[, -which(names(df_full_crisis) == "n_crisis")])
# Perform the imputation
df_complete_crisis <- complete(df_to_complete)
# Assign the imputed values from the completed data frame to the original "df_full_crisis" data frame
df_full_crisis$pop_n_crisis <- df_complete_crisis$pop_n_crisis
```
Similarly, we group by date the number of movements and the number of active users during the baseline. Then, we merge the resulting datasets for movements and population of active Facebook users according to the `date` column and finally, we estimate any missing values using Multivariate Imputation by Chained Equations (with the mice package).
```{r}
# Creating a new data frame called "df_mov_full_baseline" by performing the following operations:
# Grouping the rows of the "df_mov_full" data frame by the "date" column
df_mov_full_baseline <- df_mov_full %>%
group_by(df_mov_full$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_baseline" column
# and creating a new column called "n_baseline" to store the sum
summarise(n_baseline = sum(n_baseline)) %>%
# Renaming the "df_mov_full$date" column as "date" in the resulting data frame
rename("date" = "df_mov_full$date") %>%
# Ungrouping the data frame to remove the grouping structure
ungroup()
```
```{r}
# Creating a new data frame called "df_mov_full_pop_baseline" by performing the following operations:
# Grouping the rows of the "df_mov_full_pop" data frame by the "date" column
df_mov_full_pop_baseline <- df_mov_full_pop %>%
group_by(df_mov_full_pop$date) %>%
# Summarizing the grouped data by calculating the total sum of the "n_baseline" column
# and creating a new column called "pop_n_baseline" to store the sum
summarise(pop_n_baseline = sum(n_baseline)) %>%
# Renaming the "df_mov_full_pop$date" column as "date" in the resulting data frame
rename("date" = "df_mov_full_pop$date") %>%
# Ungrouping the data frame to remove the grouping structure
ungroup()
```
```{r}
# Merging the "df_mov_full_baseline" and "df_mov_full_pop_baseline" data frames based on the "date" column
# and assigning the result to a new data frame called "df_full_baseline"
df_full_baseline <- merge(df_mov_full_baseline, df_mov_full_pop_baseline, by = "date", all.x = TRUE)
```
```{r}
# Create a mice object with only the column you want to impute
df_to_complete <- mice(df_full_baseline[, -which(names(df_full_baseline) == "n_baseline")])
# Perform the imputation
df_complete_baseline <- complete(df_to_complete)
# Assign the imputed values from the completed data frame to the original "df_full_baseline" data frame
df_full_baseline$pop_n_baseline <- df_complete_baseline$pop_n_baseline
```
The merged datasets for the crisis period and the baseline can be merged according to the `date` column:
```{r}
# Merging the "df_full_crisis" and "df_full_baseline" data frames based on the "date" column
# and assigning the result to a new data frame called "df_full"
df_full <- merge(df_full_crisis, df_full_baseline, by = "date", all.x = TRUE)
```
To obtain the number of movements per active Facebook user on a given date, we can divide the number of movements on that date by the total number of active Facebook users on the same date. This can be done for both the crisis and baseline periods and the results can be stored in new columns. Furthermore, we compute the difference in the number of movements per active Facebook user between the crisis and baseline periods and store it in the `difference_div` column:
```{r}
# Calculating the division of "n_crisis" by "pop_n_crisis" and assigning the result to a new column "n_crisis_div"
df_full$n_crisis_div <- df_full$n_crisis / df_full$pop_n_crisis
# Calculating the division of "n_baseline" by "pop_n_baseline" and assigning the result to a new column "n_baseline_div"
df_full$n_baseline_div <- df_full$n_baseline / df_full$pop_n_baseline
# Calculating the difference between "n_crisis_div" and "n_baseline_div" and assigning the result to a new column "n_difference_div"
df_full$n_difference_div <- df_full$n_crisis_div - df_full$n_baseline_div
```
### Analysing trends
Below we plot the baseline number of tracked movements and Facebook active users as well as the same quantities for the crisis period:
```{r}
# Creating a line plot using ggplot
ggplot() +
# Adding a line layer for the "n_baseline" data with color "Baseline"
geom_line(mapping = aes(x = date, y = n_baseline, color = "Baseline"),
data = df_full, size = 1.5) +
# Adding a line layer for the "n_crisis" data with color "Crisis"
geom_line(mapping = aes(x = date, y = n_crisis, color = "Crisis"),
data = df_full, size = 1.5) +
# Adding a line layer for the "pop_n_baseline" data with color "Baseline FB users"
geom_line(mapping = aes(x = date, y = pop_n_baseline, color = "Baseline FB users"),
data = df_full, size = 1.5) +
# Adding a line layer for the "pop_n_crisis" data with color "Crisis FB users"
geom_line(mapping = aes(x = date, y = pop_n_crisis, color = "Crisis FB users"),
data = df_full, size = 1.5) +
# Customizing the color scale manually with the respective colors and labels
scale_color_manual(values = c("Crisis" = "red", "Baseline" = "blue", "Crisis FB users" = "magenta", "Baseline FB users" = "cyan"),
labels = c("Baseline movements", "Baseline FB users", "Crisis movements", "Crisis FB users")) +
# Customizing plot labels
labs(color = "Legend",
title = "Figure 7: Number of tracked movements \nand active Facebook users in Chile",
x = "Date",
y = "Number of movements/users") +
# Applying the "minimal" theme
theme_minimal()
```
Disregarding the spikes which are likely due to anomalies in the data collection process applied by Facebook, we observe that, generally speaking, both the baseline number of movements and the number of active Facebook users remain constant through the plotted period. In the case of the number of movements and the number of active Facebook users during the pandemic, they both experience a decline. But, is the decline observed in the number of movements a result of the decline in the number of active Facebook users? To answer this question, we can plot the number of tracked movements per active Facebook user, both for the baseline and the pandemic period. We can also plot the difference between these two quantities:
```{r}
# Creating a line plot using ggplot
ggplot() +
# Adding a line layer for the "n_baseline_div" data with color "Baseline movements per user"
geom_line(mapping = aes(x = date, y = n_baseline_div, color = "Baseline movements per user"),
data = df_full, size = 1.5) +
# Adding a line layer for the "n_crisis_div" data with color "Crisis movements per user"
geom_line(mapping = aes(x = date, y = n_crisis_div, color = "Crisis movements per user"),
data = df_full, size = 1.5) +
# Adding a line layer for the "n_difference_div" data with color "Difference"
geom_line(mapping = aes(x = date, y = n_difference_div, color = "Difference"),
data = df_full, size = 1.5) +
# Customizing the color scale manually with the respective colors and labels
scale_color_manual(values = c("Crisis movements per user" = "red", "Baseline movements per user" = "blue", "Difference" = "darkgreen"),
labels = c("Baseline movements per user", "Crisis movements per user", "Difference")) +
# Expanding the y-axis limits to include 0
expand_limits(y = 0) +
# Customizing plot labels
labs(color = "Legend",
title = "Figure 8: Number of tracked movements per active Facebook user in Chile",
x = "Date",
y = "Number of movements per active user") +
# Applying the "minimal" theme
theme_minimal()
```
This last figure shows that since the start of the pandemic, the number of movements per active Facebook user has been consistently lower than during the baseline period. However, the difference in trends between the crisis and basline periods is not as large as it might have been suggested by looking at the raw number of movements without taking into account the active population of users (e.g. the line corresponding to the number of movements during the crisis period from Figure 7).