forked from xiayunj/2020_summer_invitational_datathon
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05-Brazil.Rmd
656 lines (528 loc) · 30.5 KB
/
05-Brazil.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
# Brazil - 2016 Rio Olympics
<font size="6"><b> Analysis of Regional Economic Impacts of Hosting Olympics on Brazil </b></font>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(tidyverse)
library(dplyr)
library(naniar)
library(zoo)
library(reactable)
library(scales)
library(ggridges)
library(viridis)
library(gganimate)
library(gifski)
library(magick)
```
```{r include=FALSE}
br_gdp <- read.csv('/Users/xiayunj/Desktop/datathon/datasets_full/Rio/brazil_gdp.csv')
br_ia <- read.csv('/Users/xiayunj/Desktop/datathon/datasets_full/Rio/brazil_international_arrivals.csv')
br_mi <- read.csv('/Users/xiayunj/Desktop/datathon/datasets_full/Rio/brazil_monthly_income.csv')
br_tj <- read.csv('/Users/xiayunj/Desktop/datathon/datasets_full/Rio/brazil_tourism_jobs.csv')
br_un <- read.csv('/Users/xiayunj/Desktop/datathon/datasets_full/Rio/brazil_unemployment.csv')
```
<font size="4"><b> Introduction </b></font>
In this section, we discuss the regional difference of economic impacts on Brazil by the 2016 Olympic Games. To find whether or not there are significant differences on a regional level, we analyze Brazil GDP, monthly income, unemployment rate and tourism over years. The 2016 Olympic Games took place at Rio de Janeiro. Rio was awarded to host the games in 2009.
<font size="4"><b> Region Description </b></font>
We divide Brazil to 5 regions, where Rio de Janeiro, the city hosted Olympics, is in the South-East Region.
1. North region: Acre, Amapá, Amazonas, Pará, Rondônia, Roraima and Tocantins.
2. Northeast region: Alagoas, Bahia, Ceará, Maranhão, Paraíba, Pernambuco, Piauí, Rio Grande do Norte and Sergipe.
3. Midwest region: Goiás, Mato Grosso, Mato Grosso do Sul and Distrito Federal (Federal District).
4. Southeast region: Espírito Santo, Minas Gerais, Rio de Janeiro and São Paulo.
5. South region: Paraná, Rio Grande do Sul and Santa Catarina.
Reference: https://en.wikipedia.org/wiki/Regions_of_Brazil
## Brazil GDP
<font size="5"><b> Brazil Regional GDP </b></font>
```{r include=FALSE}
north <- c('Acre', 'Amapá', 'Amazonas', 'Pará', 'Rondônia', 'Roraima', 'Tocantins')
north_east <- c('Alagoas', 'Bahia', 'Ceará', 'Maranhão', 'Paraíba', 'Pernambuco', 'Piauí',
'Rio Grande do Norte', 'Sergipe')
mid_west <- c('Goiás', 'Mato Grosso', 'Mato Grosso do Sul', 'Distrito Federal')
south_east <- c('Espírito Santo', 'Minas Gerais', 'Rio de Janeiro', 'São Paulo')
south <- c('Paraná', 'Rio Grande do Sul', 'Santa Catarina')
for (i in 1:nrow(br_gdp)){
if (br_gdp[i,'state'] %in% north){
br_gdp[i,'larger_region'] <- 'North Region'
} else if (br_gdp[i,'state'] %in% north_east){
br_gdp[i,'larger_region'] <- 'North East Region'
} else if (br_gdp[i,'state'] %in% mid_west){
br_gdp[i,'larger_region'] <- 'Mid West Region'
} else if (br_gdp[i,'state'] %in% south_east){
br_gdp[i,'larger_region'] <- 'South East Region'
} else {
br_gdp[i,'larger_region'] <- 'South Region'
}
}
```
```{r include=FALSE}
df_gdp <- br_gdp[,c('year','larger_region','value')] %>%
group_by(year, larger_region) %>%
summarise(gdp = sum(value))
df_gdp[,'gdp'] <- df_gdp$gdp/1000000000
```
First, we want to visualize the regional GDP of Brazil, so we make a multiple time series plot. Also, we add two vertical lines to mark the year of final selection (2009) and the year of the Olympics (2016).
```{r echo=FALSE, fig.height=5, fig.width=10}
ggplot(df_gdp, aes(year, gdp, color=larger_region)) +
geom_line() +
geom_vline(xintercept = 2016, linetype="dashed", color = "red", size=1.5) +
geom_text(aes(x = 2016, label = 'Rio Olympics', y = 2), colour = 'black', size = 4) +
geom_vline(xintercept = 2009, linetype="dashed", color = "red", size=1.5) +
geom_text(aes(x = 2009, label = 'Final Selection', y = 1), colour = 'black', size = 4) +
ggtitle('Brazil GDP by Regions', subtitle = 'From 2002 to 2017') +
labs(x = 'Year', y = 'GDP in trillions of Reals', color = 'Regions') +
scale_x_continuous(breaks = seq(2002, 2017, 1)) +
theme_gray(13)
```
From this plot, we find that from year 2002 to 2017, South East region always has a higher GDP than other 4 regions. After year 2019, the GDP for all of the 5 regions seem to increase faster than before, especially the South East region. Since 2009 is the year when Rio was awarded to host the 2016 Olympics, we come up with a hypothesis that the success of the bid for Brazil helps to increase the GDP growth rate, especially for the South East. In other words, the final selection brings different regional GDP growth impact on Brazil.
However, the plot does not show a higher increasing rate for all of the 5 regions after the Olympic year 2016. Also, since our raw data does not provide the statistics after year 2017, our analysis may have some limitation to analyze the regional GDP impact after year 2016. Also, the plot shows that the GDP growth rate of South East region decreases from 2014 to 2016.
Before we get to our conclusion, we want to get a closer look at regional GDP growth rate.
<font size="5"><b> Brazil Regional GDP Increasing Rate from 2003 to 2017 </b></font>
```{r echo=FALSE}
gdp_rate_temp <- df_gdp
gdp_rate <- data.frame('Year' = c('2003','2004','2005','2006','2007','2008','2009','2010',
'2011','2012','2013','2014', '2015', '2016', '2017'))
```
```{r echo=FALSE}
for (i in 2003:2017){
for (j in c('Mid West Region', 'North East Region',
'North Region', 'South East Region', 'South Region')){
gdp_rate[i-2002, j] <- round(as.numeric(
(gdp_rate_temp[which(gdp_rate_temp$year == i & gdp_rate_temp$larger_region == j), 'gdp'] -
gdp_rate_temp[which(gdp_rate_temp$year == i-1 & gdp_rate_temp$larger_region == j), 'gdp']) /
gdp_rate_temp[which(gdp_rate_temp$year == i-1 & gdp_rate_temp$larger_region == j), 'gdp']), 5)
}
}
```
```{r echo=FALSE}
temp_col <- numeric(75)
for (i in 1:15){
for (j in 2:6){
temp_col <- c(temp_col, as.numeric(gdp_rate[i,j]))
}
}
```
```{r echo=FALSE}
orange_pal <- function(x) rgb(colorRamp(c("#ffe4cc", "#ff9500"))(x), maxColorValue = 255)
reactable(
gdp_rate,
defaultColDef = colDef(
align = 'center',
headerStyle = list(background = "#D1E5FC"),
format = colFormat(percent = TRUE, digits = 2),
style = function(value) {
if (!is.numeric(value)) {
color <- "#A9D0FD"
list(background = color)
} else {
normalized <- (value - min(temp_col)) / (max(temp_col) - min(temp_col))
color <- orange_pal(normalized)
list(background = color)
}
}
),
bordered = TRUE,
defaultPageSize = 5
)
```
From this table, we find that the highest GDP growth rate appears at year 2010 in North region, which is 24.60%, whereas the lowest GDP growth rate appears at year 2015 in South East region, which is 2.02%. Also, the regional GDP increase rate peaked at 2010 for almost all regions. Year 2010 is the year after the final selection of Olympics, whereas year 2015 is the year before the Olympic games.
The table also shows that in year 2016, the GDP growth rate for each of the 5 regions has no significant changes compared to 2015. The highest increase in GDP growth rate appears at Mid West region in the Olympic year, but the increasing percent is $9.2\%-6.84\% = 2.36\%$, which is small. The changes of growth rate in year 2017 are also not significant for the 5 regions.
Moreover, we find that before year 2019, the GDP growth rate for all 5 regions stays around 10+ percent, but the previous multiple time series plot shows that South East region may have a higher GDP growth rate than the other 4 regions after 2009. If we have an assumption that "before 2009, all of the 5 regions in Brazil have the same GDP growth rate", we can then test whether the 5 regions still have the same GDP growth rate after 2009. The reason why we want to do this test is to check if the success of Olympic bid will bring a more significant GDP growth for South East region. Therefore, we want to first test our assumption by using a One-Way ANOVA test using the existing data from 2003 to 2009.
<font size="5"><b> First ANOVA test on regional GDP from 2003 to 2009: </b></font>
<font size="3"><b> Hypothesis: </b></font>
$H_0$: The means of GDP increasing rate for the 5 regions from 2003 to 2009 are the same.
$H_1$: Otherwise.
Description: This test is for checking whether our initial assumption holds, where the assumption is "Before the final selection year, the 5 regions in Brazil have the same GDP growth rate".
```{r include=FALSE}
br_gdp_anova1_temp <- gdp_rate[1:7, -1]
br_gdp_anova2_temp <- gdp_rate[8:15, -1]
br_gdp_anova1 <- data.frame('region' = factor(), 'rate' = numeric())
for (i in 1:7){
for (j in c('Mid West Region', 'North East Region',
'North Region', 'South East Region', 'South Region')){
br_gdp_anova1 <- br_gdp_anova1 %>% add_row(region = j,
rate = as.numeric(br_gdp_anova1_temp[i,j]))
}
}
br_gdp_anova2 <- data.frame('region' = factor(), 'rate' = numeric())
for (i in 1:8){
for (j in c('Mid West Region', 'North East Region',
'North Region', 'South East Region', 'South Region')){
br_gdp_anova2 <- br_gdp_anova2 %>% add_row(region = j,
rate = as.numeric(br_gdp_anova2_temp[i,j]))
}
}
```
<font size="3"><b> Summary of the test: </b></font>
```{r echo=FALSE}
model1 <- aov(rate~region, data = br_gdp_anova1)
summary(model1)
```
<font size="3"><b> Conclusion: </b></font>
From the summary of test table, we find that the p-value is 0.93, which shows that we fail to reject the null hypothesis. That is, we cannot say the mean of GDP growth rate for each of the 5 regions have significant difference. Therefore, our initial assumption holds, then we can continue to test whether there are significant differences in the means of GDP growth rate for the 5 regions after 2009. In the next test, we still use the One-Way ANOVA test, but we use the existing data from 2010 to 2017.
<font size="5"><b> Second ANOVA test on regional GDP from 2010 to 2017: </b></font>
<font size="3"><b> Hypothesis: </b></font>
$H_0$: The means of GDP increasing rate for the 5 regions from 2010 to 2017 are the same.
$H_1$: Otherwise.
Description: Based on the assumption from the first ANOVA test, we want to conduct a second test to check whether the means of GDP growth rate for the 5 regions still remain the same from 2010 to 2017. If the test result shows a significant difference in the means, we can conclude that the succuss of 2016 Olympic bid shows a significant GDP growth on regional levels.
<font size="3"><b> Summary of the test: </b></font>
```{r echo=FALSE}
model2 <- aov(rate~region, data = br_gdp_anova2)
summary(model2)
```
<font size="3"><b> Conclusion: </b></font>
From the summary table, we find that the p-value is 0.881, which shows that we fail to reject the null hypothesis. That is, from 2010 to 2017, we cannot say the GDP growth rate for each of the 5 regions has a significant difference.
For both tests, we fail to reject the null hypothesis. Therefore, the year 2009 does not bring a different regional GDP growth rate impact on Brazil. This conclusion is not consistent with our initial hypothesis from the multiple time series plot.
In conclusion, for Brazil regional GDP analysis, the data does not show that the Olympic games results in any significant difference in GDP growth across the regions in Brazil.
## Brazil Monthly Income
<font size="5"><b> Monthly Income Distribution </b></font>
```{r include=FALSE}
for (i in 1:nrow(br_mi)){
if (br_mi[i,'state'] %in% north){
br_mi[i,'larger_region'] <- 'North Region'
} else if (br_mi[i,'state'] %in% north_east){
br_mi[i,'larger_region'] <- 'North East Region'
} else if (br_mi[i,'state'] %in% mid_west){
br_mi[i,'larger_region'] <- 'Mid West Region'
} else if (br_mi[i,'state'] %in% south_east){
br_mi[i,'larger_region'] <- 'South East Region'
} else {
br_mi[i,'larger_region'] <- 'South Region'
}
}
```
```{r include=FALSE}
br_mi1 <- br_mi %>% drop_na()
```
Before we analyze the regional average monthly income, we want to first visualize the distribution of monthly income for the entire country from 2012 to 2020. This will help us to get a better understanding of the overall distribution of Brazil monthly income over years. We first make a multiple boxplot over years to better visualize the robust statistics: medians, quartiles and outliers.
```{r echo=FALSE, fig.height=5, fig.width=10}
ggplot(br_mi1, aes(x = as.factor(year), y = value)) +
geom_boxplot(fill = "#cc9a38", color = "#473e2c") +
ggtitle("Boxplots of Brazil Monthly Income ",
subtitle = "From 2012 to 2020") +
labs(x = "Year", y = "Monthly Income") +
theme_grey(16) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68")) +
scale_x_discrete(labels = c('2016'='2016 (Olympics)'))
```
From the boxplots we find that the median of national monthly income shows an increasing pattern from 2012 to 2020. However, the plot does not show any significant increase in the median of monthly income after the Olympic year 2016 on national level. The Interquartile Range also increases from 2012 to 2020.
Then we want to visualize the density distribution for the national monthly income, but by the boxplots, there are higher outliers for each year. If we keep those outliers, it will be hard to visualize the density plot, so we remove those higher outliers for better visualization.
Note: Higher Outlier definition: points above $Q_3 + (1.5\times \text{IQR})$, where $Q_3$ is the upper quartile and IQR is the Interquatile Range.
```{r include=FALSE}
for (i in 2012:2020){
temp_value <- br_mi1[which(br_mi1$year == i),]$value
Q1 <- quantile(temp_value)[2]
Q3 <- quantile(temp_value)[4]
outlier <- Q3 + 1.5*(Q3-Q1)
br_mi1 <- br_mi1[-which(br_mi1$year == i & br_mi1$value > outlier),]
}
```
```{r echo=FALSE, fig.height=6, fig.width=10, message=FALSE, warning=FALSE}
ggplot(br_mi1, aes(x = value, y = as.factor(year), fill = year))+
geom_density_ridges_gradient(scale = 4, show.legend = FALSE) +
theme_ridges() +
scale_y_discrete(expand = c(0.01, 0), labels = c('2016'='2016 \n (Olympics)')) +
scale_x_continuous(expand = c(0.01, 0)) +
labs(x = "Brazil Monthly Income",y = "Year") +
ggtitle("Density estimation of Brazil Monthly Income",
subtitle = 'From 2012 to 2020 (outlier removed)') +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
scale_fill_viridis()
```
The ridgeline plot helps us to better understand the density estimation. From 2012 to 2020, the peak of density curves increases from 1000 to 1500 approximately. Also, the tail of the density curves becomes much heavier from 2012 to 2020. These observations shows that the monthly income has an increasing pattern over years. However, between 2015 and 2017, the density curves do not show a more significant increasing pattern.
<font size="5"><b> Regional Monthly Income Distribution in the Olympics Year 2016 </b></font>
Next, we want to focus on the regional level. We pick data of the Olympic year 2016 and plot a set of histograms grouped by regions. We want to first check whether there are any significant difference in the distributions of regional monthly income. If hosting the Olympics may result in different regional impact, there may be difference in the regional distribution of monthly income in 2016.
```{r echo=FALSE, fig.height=7, fig.width=10}
ggplot(br_mi1[which(br_mi1$year == 2016),], aes(x = value, y = ..density..)) +
geom_histogram(bins = 20, colour = "#80593D", fill = "#9FC29F", boundary = 0) +
geom_density(color = "#3D6480") +
facet_wrap(~larger_region) +
ggtitle("2016 Brazil Monthly Income Distribution",
subtitle = "by Regions (outlier removed)") +
labs(x = "Monthly Income", y = 'Density') +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
```
In this plot, we also remove the outliers for better visualization. The histograms shows all of the 5 regions have most density in monthly income less than 2000 but greater than 1000. Also, South East region does not show a distribution of higher monthly income than other regions in year 2016.
<font size="5"><b> Regional Average Monthly Income </b></font>
Now we move to visualize the regional average monthly income. Here we keep the outliers because we are calculating the regional average for each year.
```{r include=FALSE}
df_mi <- br_mi[, c('year','value','larger_region')] %>%
drop_na() %>%
group_by(year, larger_region) %>%
summarise(monthly_income = mean(value))
```
```{r echo=FALSE, fig.height=5, fig.width=10}
ggplot(df_mi, aes(year, monthly_income, color=larger_region)) +
geom_line() +
geom_vline(xintercept = 2016, linetype="dashed", color = "red", size=1.5) +
geom_text(aes(x = 2016, label = 'Rio Olympics', y = 3000), colour = 'black', size = 4) +
ggtitle('Brazil Average Monthly Income by Regions',
subtitle = 'From 2012 to 2020') +
labs(x = 'Year', y = 'Average Monthly Income in Reals', color = 'Regions') +
scale_x_continuous(breaks = seq(2012, 2020, 1)) +
theme_gray(13)
```
The plot shows an overall increasing pattern in monthly income for all of the 5 regions, where the income of Mid West region increases most fast from 2017 to 2018, but decreases from 2019 to 2020. The plot does not show a more significant increasing after 2016 than before. Also the patterns for all of the 5 regions do not appear to have significant difference.
```{r echo=FALSE}
mi_rate_temp <- df_mi
mi_rate <- data.frame('Year' = c('2013', '2014', '2015', '2016', '2017',
'2018', '2019', '2020'))
```
```{r echo=FALSE}
for (i in 2013:2020){
for (j in c('Mid West Region', 'North East Region',
'North Region', 'South East Region', 'South Region')){
mi_rate[i-2012, j] <- round(as.numeric(
(mi_rate_temp[which(mi_rate_temp$year == i &
mi_rate_temp$larger_region == j), 'monthly_income'] -
mi_rate_temp[which(mi_rate_temp$year == i-1 &
mi_rate_temp$larger_region == j), 'monthly_income']) /
mi_rate_temp[which(mi_rate_temp$year == i-1 &
mi_rate_temp$larger_region == j), 'monthly_income']), 5)
}
}
```
```{r echo=FALSE}
temp_col_mi <- numeric(40)
for (i in 1:8){
for (j in 2:6){
temp_col_mi <- c(temp_col_mi, as.numeric(mi_rate[i,j]))
}
}
```
<font size="5"><b> Regional Monthly Income Increasing Rate from 2013 to 2020 </b></font>
```{r echo=FALSE}
reactable(
mi_rate,
defaultColDef = colDef(
align = 'center',
headerStyle = list(background = "#D1E5FC"),
format = colFormat(percent = TRUE, digits = 2),
style = function(value) {
if (!is.numeric(value)) {
color <- "#A9D0FD"
list(background = color)
} else {
normalized <- (value - min(temp_col_mi)) / (max(temp_col_mi) - min(temp_col_mi))
color <- orange_pal(normalized)
list(background = color)
}
}
),
bordered = TRUE,
defaultPageSize = 5
)
```
From the income increasing rate table, we find that the highest rate (11.55%) takes place at Mid West region in 2018, whereas the lowest one (-3.22%) takes place at Mid West region in 2020. For South East Region, the increasing rate in 2016 is lower than that in 2015. Together with the multiple time series plot, we find that the regional monthly income increases after the Olympic year, but the increasing rates do not become much more significant comparing with previous ones before 2016.
<font size="5"><b> Relationship between Brazil GDP and Monthly Income </b></font>
Then we create an animated scatter plot to visualize the relationship between GDP and month income by regions over years.
```{r include=FALSE}
Ag <- br_gdp[which(br_gdp$year >= 2012 & br_gdp$year <= 2017),] %>%
group_by(year, state, larger_region) %>%
summarize(gdp = mean(value)/1000000000)
Ai <- br_mi[which(br_mi$year >= 2012 & br_mi$year <= 2017),] %>%
drop_na() %>%
group_by(year, state) %>%
summarize(income = mean(value))
Agi <- merge(Ag, Ai, by=c('year', 'state'))
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
anim <- ggplot(Agi, aes(gdp, income, color = larger_region)) +
geom_point() +
theme_bw() +
labs(title = 'Scatter Plot between GDP and Monthly Income by Regions',
subtitle = 'Year: {frame_time}', x = 'GDP in Trillions of Reals',
y = 'Average Monthly Income', color = 'Region') +
transition_time(year)
animate(
anim, renderer = magick_renderer()
)
```
From this plot, we find that the points are moving up and right. Each of these point represents a state in Brazil and the color shows its region. One state in Mid West region has the highest GDP and monthly income over years. Also, two states in South East region have higher GDP than the majority of states.
## Brazil Unemployment Rate
```{r include=FALSE}
for (i in 1:nrow(br_un)){
if (br_un[i,'state'] %in% north){
br_un[i,'larger_region'] <- 'North Region'
} else if (br_un[i,'state'] %in% north_east){
br_un[i,'larger_region'] <- 'North East Region'
} else if (br_un[i,'state'] %in% mid_west){
br_un[i,'larger_region'] <- 'Mid West Region'
} else if (br_un[i,'state'] %in% south_east){
br_un[i,'larger_region'] <- 'South East Region'
} else {
br_un[i,'larger_region'] <- 'South Region'
}
}
```
```{r include=FALSE}
br_un_bar <- br_un[which(br_un$category != 'Outside the workforce' & br_un$year != 2020),
c('year','quarter','category','value','larger_region')] %>%
group_by(year, quarter, category) %>%
summarise(value = sum(value))
for (i in 1:nrow(br_un_bar)){
y_temp <- br_un_bar$year[i]
q_temp <- br_un_bar$quarter[i]
un_temp <- br_un_bar[which(br_un_bar$year == y_temp & br_un_bar$quarter == q_temp &
br_un_bar$category == 'Workforce - Unemployed'), 'value']
e_temp <- br_un_bar[which(br_un_bar$year == y_temp & br_un_bar$quarter == q_temp &
br_un_bar$category == 'Workforce - Employed'), 'value']
if (br_un_bar$category[i] == 'Workforce - Unemployed'){
br_un_bar[i,'rate'] <- un_temp/(un_temp+e_temp)
} else {
br_un_bar[i,'rate'] <- e_temp/(un_temp+e_temp)
}
br_un_bar[i,'work_force'] <- un_temp+e_temp
}
br_un_bar <- br_un_bar %>%
group_by(year, category) %>%
summarize(avg_rate = mean(rate))
```
Another way we can analyze the potential impact of the Olympic Games on Brazil is through analyzing the unemployment rate over time. First we make a stacked bar chart to visualization the national unemployment rate from 2012 to 2019.
```{r echo=FALSE, fig.height=6, fig.width=10, message=FALSE, warning=FALSE}
ggplot(br_un_bar, aes(x = as.factor(year), y = avg_rate, fill = category)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle('Stacked Bar Chart of Average Brazil Unemployment/Employment Rate',
subtitle = 'From 2012 to 2019') +
labs(x = 'Rate', y = 'Year') +
theme_gray(13) +
scale_x_discrete(labels = c('2016'='2016 \n (Olympics)'))
```
The stacked bar chart shows that year 2017 has the highest unemployment rate, whereas year 2014 has the lowest unemployment rate. Also, the unemployment rate keeps increasing from 2015 to 2017. The above plot confirms our conclusion that the 2016 Rio Olympics did not appreciably benefit Brazil as measured through the unemployment rate. In fact, during the years that follow the Olympics the unemployment rate seems to increase, not decrease, which suggest a possible negative effect that could be the subject of future exploration.
<font size="5"><b> Unemployment Rate by Regions </b></font>
Now let's move to the regional level. We want to check whether the 5 regions follow a same pattern as the national unemployment rate.
```{r include=FALSE}
#Calculate unemployment rate and work force for each quarter of each year (by region)
df_un <- br_un[which(br_un$category != 'Outside the workforce'),
c('year','quarter','category','value','larger_region')] %>%
group_by(year, quarter, category, larger_region) %>%
summarise(value = sum(value))
```
```{r include=FALSE}
for (i in 1:nrow(df_un)){
y_temp <- df_un$year[i]
q_temp <- df_un$quarter[i]
r_temp <- df_un$larger_region[i]
un_temp <- df_un[which(df_un$year == y_temp & df_un$quarter == q_temp & df_un$larger_region == r_temp &
df_un$category == 'Workforce - Unemployed'), 'value']
e_temp <- df_un[which(df_un$year == y_temp & df_un$quarter == q_temp & df_un$larger_region == r_temp &
df_un$category == 'Workforce - Employed'), 'value']
df_un[i,'unemployment_rate'] <- un_temp/(un_temp+e_temp)
df_un[i,'work_force'] <- un_temp+e_temp
}
```
```{r include=FALSE}
df_un <- unique(subset(df_un, select=-c(category,value)))
```
```{r include=FALSE}
#Calculate average unemployment rate for each year (by region)
df_un1 <- df_un %>%
group_by(year, larger_region) %>%
summarise(avg_un_rate = mean(unemployment_rate))
```
```{r echo=FALSE, fig.height=5, fig.width=10, message=FALSE, warning=FALSE}
ggplot(df_un1[1:40,], aes(year, avg_un_rate, color=larger_region)) +
geom_line() +
ggtitle('Brazil Average Yearly Unemployment Rate by Regions',
subtitle = 'From 2012 to 2019') +
labs(x = 'Year', y = 'Average Yearly Unemployment Rate', color = 'Regions') +
scale_x_continuous(breaks = seq(2012, 2019, 1)) +
theme_gray(13) +
geom_vline(xintercept = 2016, linetype="dashed", color = "red", size=1.5) +
geom_text(aes(x = 2016, label = 'Rio Olympics', y = 0.06), colour = 'black', size = 4)
```
From 2012 to 2014, the unemployment rate decreases for the 5 regions. However, during the period between 2014 and 2017, the unemployment rate increases for the 5 regions. The unemployment rate starts to decrease after 2017. The regions follow the same pattern as the national pattern, except Mid West region has an increasing unemployment rate from 2018 to 2019.
For more precise conclusion, we also calculate the exact regional unemployment rate increasing rate.
<font size="5"><b> Unemployment Rate Increasing Rate by Regions from 2013 to 2019 </b></font>
```{r echo=FALSE, message=FALSE, warning=FALSE}
un_rate_temp <- df_un1[1:40,]
un_rate <- data.frame('Year' = c('2013', '2014', '2015', '2016', '2017',
'2018', '2019'))
```
```{r echo=FALSE}
for (i in 2013:2019){
for (j in c('Mid West Region', 'North East Region',
'North Region', 'South East Region', 'South Region')){
un_rate[i-2012, j] <- round(as.numeric(
(un_rate_temp[which(un_rate_temp$year == i &
un_rate_temp$larger_region == j), 'avg_un_rate'] -
un_rate_temp[which(un_rate_temp$year == i-1 &
un_rate_temp$larger_region == j), 'avg_un_rate']) /
un_rate_temp[which(un_rate_temp$year == i-1 &
un_rate_temp$larger_region == j), 'avg_un_rate']), 5)
}
}
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
temp_col_un <- numeric(35)
for (i in 1:7){
for (j in 2:6){
temp_col_un <- c(temp_col_un, as.numeric(un_rate[i,j]))
}
}
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
reactable(
un_rate,
defaultColDef = colDef(
align = 'center',
headerStyle = list(background = "#D1E5FC"),
format = colFormat(percent = TRUE, digits = 2),
style = function(value) {
if (!is.numeric(value)) {
color <- "#A9D0FD"
list(background = color)
} else {
normalized <- (value - min(temp_col_un)) / (max(temp_col_un) - min(temp_col_un))
color <- orange_pal(normalized)
list(background = color)
}
}
),
bordered = TRUE,
defaultPageSize = 5
)
```
Since 2014, the average unemployment rate keeps increasing for all of the 5 regions until 2017. The highest increasing rate appears at year 2016 which is the Olympics year for all of the 5 regions. The unemployment rate drops starting from 2018. Therefore, the 2016 Rio Olympic games did not benifit Brazil's unemployment rate. There are no significant difference in regional unemployment pattern observed as well.
## Brazil Tourism
Now we want to analyze Brazil tourism to check whether the Olympics benifits the tourism, especially for the South East region.
```{r include=FALSE}
for (i in 1:nrow(br_tj)){
if (br_tj[i,'state'] %in% north){
br_tj[i,'larger_region'] <- 'North Region'
} else if (br_tj[i,'state'] %in% north_east){
br_tj[i,'larger_region'] <- 'North East Region'
} else if (br_tj[i,'state'] %in% mid_west){
br_tj[i,'larger_region'] <- 'Mid West Region'
} else if (br_tj[i,'state'] %in% south_east){
br_tj[i,'larger_region'] <- 'South East Region'
} else {
br_tj[i,'larger_region'] <- 'South Region'
}
}
```
```{r include=FALSE}
df_tj <- br_tj %>%
group_by(year, month, larger_region) %>%
summarise(total_jobs = sum(jobs)) %>%
group_by(year, larger_region) %>%
summarise(avg_jobs = mean(total_jobs)/1000)
```
```{r echo=FALSE, fig.height=5, fig.width=10}
ggplot(df_tj, aes(year, avg_jobs, color=larger_region)) +
geom_line() +
ggtitle('Brazil Average Yearly Tourism Jobs by Regions',
subtitle = 'From 2006 to 2018') +
labs(x = 'Year', y = 'Average Yearly Tourism Jobs in Thousands', color = 'Regions') +
scale_x_continuous(breaks = seq(2006, 2018, 1)) +
theme_gray(13) +
geom_vline(xintercept = 2016, linetype="dashed", color = "red", size=1.5) +
geom_text(aes(x = 2016, label = 'Rio Olympics', y = 300), colour = 'black', size = 4) +
geom_vline(xintercept = 2009, linetype="dashed", color = "red", size=1.5)
```
From the multiple time series plot, we find that from 2006 to 2018, South East region alwasys has a much higher number of tourism jobs than the other regions. After 2009, the final selection year, all of the 5 regions seems to increase faster than before in the number of tourism jobs. However, from 2015 to 2017, South East region seems to decreases slightly. This shows that the Olympics does not necessarily increase the number of tourism jobs for South East region. The decreasing in number of tourism jobs in South East region complies with the pattern of Brazil unemployment rate.