05-Brazil.Rmd

# Brazil - 2016 Rio Olympics
<font size="6"><b> Analysis of Regional Economic Impacts of Hosting Olympics on Brazil </b></font>    

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(tidyverse)
library(dplyr)
library(naniar)
library(zoo)
library(reactable)
library(scales)
library(ggridges)
library(viridis)
library(gganimate)
library(gifski)
library(magick)
```


```{r include=FALSE}
br_gdp <- read.csv('/Users/xiayunj/Desktop/datathon/datasets_full/Rio/brazil_gdp.csv')
br_ia <- read.csv('/Users/xiayunj/Desktop/datathon/datasets_full/Rio/brazil_international_arrivals.csv')
br_mi <- read.csv('/Users/xiayunj/Desktop/datathon/datasets_full/Rio/brazil_monthly_income.csv')
br_tj <- read.csv('/Users/xiayunj/Desktop/datathon/datasets_full/Rio/brazil_tourism_jobs.csv')
br_un <- read.csv('/Users/xiayunj/Desktop/datathon/datasets_full/Rio/brazil_unemployment.csv')
```
  
<font size="4"><b> Introduction </b></font>      
In this section, we discuss the regional difference of economic impacts on Brazil by the 2016 Olympic Games. To find whether or not there are significant differences on a regional level, we analyze Brazil GDP, monthly income, unemployment rate and tourism over years. The 2016 Olympic Games took place at Rio de Janeiro. Rio was awarded to host the games in 2009.      
    
<font size="4"><b> Region Description </b></font>    
We divide Brazil to 5 regions, where Rio de Janeiro, the city hosted Olympics, is in the South-East Region.     
1. North region: Acre, Amapá, Amazonas, Pará, Rondônia, Roraima and Tocantins.    
2. Northeast region: Alagoas, Bahia, Ceará, Maranhão, Paraíba, Pernambuco, Piauí, Rio Grande do Norte and Sergipe.    
3. Midwest region: Goiás, Mato Grosso, Mato Grosso do Sul and Distrito Federal (Federal District).   
4. Southeast region: Espírito Santo, Minas Gerais, Rio de Janeiro and São Paulo.   
5. South region: Paraná, Rio Grande do Sul and Santa Catarina.    
      
Reference: https://en.wikipedia.org/wiki/Regions_of_Brazil      
      
     
## Brazil GDP    
     
<font size="5"><b> Brazil Regional GDP </b></font>      
```{r include=FALSE}
north <- c('Acre', 'Amapá', 'Amazonas', 'Pará', 'Rondônia', 'Roraima', 'Tocantins')
north_east <- c('Alagoas', 'Bahia', 'Ceará', 'Maranhão', 'Paraíba', 'Pernambuco', 'Piauí', 
                'Rio Grande do Norte', 'Sergipe')
mid_west <- c('Goiás', 'Mato Grosso', 'Mato Grosso do Sul', 'Distrito Federal')
south_east <- c('Espírito Santo', 'Minas Gerais', 'Rio de Janeiro', 'São Paulo')
south <- c('Paraná', 'Rio Grande do Sul', 'Santa Catarina')

for (i in 1:nrow(br_gdp)){
  if (br_gdp[i,'state'] %in% north){
    br_gdp[i,'larger_region'] <- 'North Region'
  } else if (br_gdp[i,'state'] %in% north_east){
    br_gdp[i,'larger_region'] <- 'North East Region'
  } else if (br_gdp[i,'state'] %in% mid_west){
    br_gdp[i,'larger_region'] <- 'Mid West Region'
  } else if (br_gdp[i,'state'] %in% south_east){
    br_gdp[i,'larger_region'] <- 'South East Region'
  } else {
    br_gdp[i,'larger_region'] <- 'South Region'
  }
}

```

```{r include=FALSE}
df_gdp <- br_gdp[,c('year','larger_region','value')] %>% 
  group_by(year, larger_region) %>% 
  summarise(gdp = sum(value))
df_gdp[,'gdp'] <- df_gdp$gdp/1000000000
```

First, we want to visualize the regional GDP of Brazil, so we make a multiple time series plot. Also, we add two vertical lines to mark the year of final selection (2009) and the year of the Olympics (2016).    

```{r echo=FALSE, fig.height=5, fig.width=10}
ggplot(df_gdp, aes(year, gdp, color=larger_region)) + 
  geom_line() +
  geom_vline(xintercept = 2016, linetype="dashed", color = "red", size=1.5) +
  geom_text(aes(x = 2016, label = 'Rio Olympics', y = 2), colour = 'black', size = 4) +
  geom_vline(xintercept = 2009, linetype="dashed", color = "red", size=1.5) +
  geom_text(aes(x = 2009, label = 'Final Selection', y = 1), colour = 'black', size = 4) +
  ggtitle('Brazil GDP by Regions', subtitle = 'From 2002 to 2017') +
  labs(x = 'Year', y = 'GDP in trillions of Reals', color = 'Regions') +
  scale_x_continuous(breaks = seq(2002, 2017, 1)) +
  theme_gray(13)
```

From this plot, we find that from year 2002 to 2017, South East region always has a higher GDP than other 4 regions. After year 2019, the GDP for all of the 5 regions seem to increase faster than before, especially the South East region. Since 2009 is the year when Rio was awarded to host the 2016 Olympics, we come up with a hypothesis that the success of the bid for Brazil helps to increase the GDP growth rate, especially for the South East. In other words, the final selection brings different regional GDP growth impact on Brazil.      

However, the plot does not show a higher increasing rate for all of the 5 regions after the Olympic year 2016. Also, since our raw data does not provide the statistics after year 2017, our analysis may have some limitation to analyze the regional GDP impact after year 2016. Also, the plot shows that the GDP growth rate of South East region decreases from 2014 to 2016.    

Before we get to our conclusion, we want to get a closer look at regional GDP growth rate.      

<font size="5"><b> Brazil Regional GDP Increasing Rate from 2003 to 2017 </b></font>    

```{r echo=FALSE}
gdp_rate_temp <- df_gdp
gdp_rate <- data.frame('Year' = c('2003','2004','2005','2006','2007','2008','2009','2010',
                                  '2011','2012','2013','2014', '2015', '2016', '2017'))
```

```{r echo=FALSE}
for (i in 2003:2017){
  for (j in c('Mid West Region', 'North East Region',
              'North Region', 'South East Region', 'South Region')){
      gdp_rate[i-2002, j] <- round(as.numeric(
      (gdp_rate_temp[which(gdp_rate_temp$year == i & gdp_rate_temp$larger_region == j), 'gdp'] - 
      gdp_rate_temp[which(gdp_rate_temp$year == i-1 & gdp_rate_temp$larger_region == j), 'gdp']) /
      gdp_rate_temp[which(gdp_rate_temp$year == i-1 & gdp_rate_temp$larger_region == j), 'gdp']), 5)
              }
}
```

```{r echo=FALSE}
temp_col <- numeric(75)
for (i in 1:15){
  for (j in 2:6){
    temp_col <- c(temp_col, as.numeric(gdp_rate[i,j]))
  }
}
```

```{r echo=FALSE}
orange_pal <- function(x) rgb(colorRamp(c("#ffe4cc", "#ff9500"))(x), maxColorValue = 255)
reactable(
  gdp_rate,
  defaultColDef = colDef(
    align = 'center',
    headerStyle = list(background = "#D1E5FC"),
    format = colFormat(percent = TRUE, digits = 2),
    style = function(value) {
    if (!is.numeric(value)) {
      color <- "#A9D0FD"
      list(background = color)
    } else {
    normalized <- (value - min(temp_col)) / (max(temp_col) - min(temp_col))
    color <- orange_pal(normalized)
    list(background = color)
    }
  }
  ),
  bordered = TRUE,
  defaultPageSize = 5
)
```

  
From this table, we find that the highest GDP growth rate appears at year 2010 in North region, which is 24.60%, whereas the lowest GDP growth rate appears at year 2015 in South East region, which is 2.02%. Also, the regional GDP increase rate peaked at 2010 for almost all regions. Year 2010 is the year after the final selection of Olympics, whereas year 2015 is the year before the Olympic games.    

The table also shows that in year 2016, the GDP growth rate for each of the 5 regions has no significant changes compared to 2015. The highest increase in GDP growth rate appears at Mid West region in the Olympic year, but the increasing percent is $9.2\%-6.84\% = 2.36\%$, which is small. The changes of growth rate in year 2017 are also not significant for the 5 regions.    

Moreover, we find that before year 2019, the GDP growth rate for all 5 regions stays around 10+ percent, but the previous multiple time series plot shows that South East region may have a higher GDP growth rate than the other 4 regions after 2009. If we have an assumption that "before 2009, all of the 5 regions in Brazil have the same GDP growth rate", we can then test whether the 5 regions still have the same GDP growth rate after 2009. The reason why we want to do this test is to check if the success of Olympic bid will bring a more significant GDP growth for South East region. Therefore, we want to first test our assumption by using a One-Way ANOVA test using the existing data from 2003 to 2009.   
         
         
<font size="5"><b> First ANOVA test on regional GDP from 2003 to 2009: </b></font>      
<font size="3"><b> Hypothesis: </b></font>       
$H_0$: The means of GDP increasing rate for the 5 regions from 2003 to 2009 are the same.    
$H_1$: Otherwise.     
Description: This test is for checking whether our initial assumption holds, where the assumption is "Before the final selection year, the 5 regions in Brazil have the same GDP growth rate".       


```{r include=FALSE}
br_gdp_anova1_temp <- gdp_rate[1:7, -1]
br_gdp_anova2_temp <- gdp_rate[8:15, -1]

br_gdp_anova1 <- data.frame('region' = factor(), 'rate' = numeric())
for (i in 1:7){
  for (j in c('Mid West Region', 'North East Region',
              'North Region', 'South East Region', 'South Region')){
    br_gdp_anova1 <- br_gdp_anova1 %>% add_row(region = j, 
                                               rate = as.numeric(br_gdp_anova1_temp[i,j]))
  }
}

br_gdp_anova2 <- data.frame('region' = factor(), 'rate' = numeric())
for (i in 1:8){
  for (j in c('Mid West Region', 'North East Region',
              'North Region', 'South East Region', 'South Region')){
    br_gdp_anova2 <- br_gdp_anova2 %>% add_row(region = j, 
                                               rate = as.numeric(br_gdp_anova2_temp[i,j]))
  }
}
```

<font size="3"><b> Summary of the test: </b></font>       
```{r echo=FALSE}
model1 <- aov(rate~region, data = br_gdp_anova1)
summary(model1)
```
     
<font size="3"><b> Conclusion: </b></font>     
From the summary of test table, we find that the p-value is 0.93, which shows that we fail to reject the null hypothesis. That is, we cannot say the mean of GDP growth rate for each of the 5 regions have significant difference. Therefore, our initial assumption holds, then we can continue to test whether there are significant differences in the means of GDP growth rate for the 5 regions after 2009. In the next test, we still use the One-Way ANOVA test, but we use the existing data from 2010 to 2017.         
        
        
<font size="5"><b> Second ANOVA test on regional GDP from 2010 to 2017: </b></font>      
<font size="3"><b> Hypothesis: </b></font>       
$H_0$: The means of GDP increasing rate for the 5 regions from 2010 to 2017 are the same.    
$H_1$: Otherwise.     
Description: Based on the assumption from the first ANOVA test, we want to conduct a second test to check whether the means of GDP growth rate for the 5 regions still remain the same from 2010 to 2017. If the test result shows a significant difference in the means, we can conclude that the succuss of 2016 Olympic bid shows a significant GDP growth on regional levels.      
   
<font size="3"><b> Summary of the test: </b></font>       
```{r echo=FALSE}
model2 <- aov(rate~region, data = br_gdp_anova2)
summary(model2)
```

<font size="3"><b> Conclusion: </b></font>    
From the summary table, we find that the p-value is 0.881, which shows that we fail to reject the null hypothesis. That is, from 2010 to 2017, we cannot say the GDP growth rate for each of the 5 regions has a significant difference.     
       
For both tests, we fail to reject the null hypothesis. Therefore, the year 2009 does not bring a different regional GDP growth rate impact on Brazil. This conclusion is not consistent with our initial hypothesis from the multiple time series plot.     

In conclusion, for Brazil regional GDP analysis, the data does not show that the Olympic games results in any significant difference in GDP growth across the regions in Brazil.    


## Brazil Monthly Income   
       
<font size="5"><b> Monthly Income Distribution </b></font>   
   
    
```{r include=FALSE}
for (i in 1:nrow(br_mi)){
  if (br_mi[i,'state'] %in% north){
    br_mi[i,'larger_region'] <- 'North Region'
  } else if (br_mi[i,'state'] %in% north_east){
    br_mi[i,'larger_region'] <- 'North East Region'
  } else if (br_mi[i,'state'] %in% mid_west){
    br_mi[i,'larger_region'] <- 'Mid West Region'
  } else if (br_mi[i,'state'] %in% south_east){
    br_mi[i,'larger_region'] <- 'South East Region'
  } else {
    br_mi[i,'larger_region'] <- 'South Region'
  }
}
```


```{r include=FALSE}
br_mi1 <- br_mi %>% drop_na()
```

Before we analyze the regional average monthly income, we want to first visualize the distribution of monthly income for the entire country from 2012 to 2020. This will help us to get a better understanding of the overall distribution of Brazil monthly income over years. We first make a multiple boxplot over years to better visualize the robust statistics: medians, quartiles and outliers.     

```{r echo=FALSE, fig.height=5, fig.width=10}
ggplot(br_mi1, aes(x = as.factor(year), y = value)) + 
  geom_boxplot(fill = "#cc9a38", color = "#473e2c") + 
  ggtitle("Boxplots of Brazil Monthly Income ",
          subtitle = "From 2012 to 2020") +
  labs(x = "Year", y = "Monthly Income") +
  theme_grey(16) +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
  theme(plot.caption = element_text(color = "grey68")) +
  scale_x_discrete(labels = c('2016'='2016 (Olympics)'))
```

From the boxplots we find that the median of national monthly income shows an increasing pattern from 2012 to 2020. However, the plot does not show any significant increase in the median of monthly income after the Olympic year 2016 on national level. The Interquartile Range also increases from 2012 to 2020.      

Then we want to visualize the density distribution for the national monthly income, but by the boxplots, there are higher outliers for each year. If we keep those outliers, it will be hard to visualize the density plot, so we remove those higher outliers for better visualization.        

Note: Higher Outlier definition: points above $Q_3 + (1.5\times \text{IQR})$, where $Q_3$ is the upper quartile and IQR is the Interquatile Range.    

```{r include=FALSE}
for (i in 2012:2020){
  temp_value <- br_mi1[which(br_mi1$year == i),]$value
  Q1 <- quantile(temp_value)[2]
  Q3 <- quantile(temp_value)[4]
  outlier <- Q3 + 1.5*(Q3-Q1)
  br_mi1 <- br_mi1[-which(br_mi1$year == i & br_mi1$value > outlier),]
}
```


```{r echo=FALSE, fig.height=6, fig.width=10, message=FALSE, warning=FALSE}
ggplot(br_mi1, aes(x = value, y = as.factor(year), fill = year))+
  geom_density_ridges_gradient(scale = 4, show.legend = FALSE) + 
  theme_ridges() +
  scale_y_discrete(expand = c(0.01, 0), labels = c('2016'='2016 \n (Olympics)')) +
  scale_x_continuous(expand = c(0.01, 0)) +
  labs(x = "Brazil Monthly Income",y = "Year") +
  ggtitle("Density estimation of Brazil Monthly Income", 
  subtitle = 'From 2012 to 2020 (outlier removed)') +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_viridis()
```

The ridgeline plot helps us to better understand the density estimation. From 2012 to 2020, the peak of density curves increases from 1000 to 1500 approximately. Also, the tail of the density curves becomes much heavier from 2012 to 2020. These observations shows that the monthly income has an increasing pattern over years. However, between 2015 and 2017, the density curves do not show a more significant increasing pattern.   

     
<font size="5"><b> Regional Monthly Income Distribution in the Olympics Year 2016 </b></font>    

Next, we want to focus on the regional level. We pick data of the Olympic year 2016 and plot a set of histograms grouped by regions. We want to first check whether there are any significant difference in the distributions of regional monthly income. If hosting the Olympics may result in different regional impact, there may be difference in the regional distribution of monthly income in 2016.     

```{r echo=FALSE, fig.height=7, fig.width=10}
ggplot(br_mi1[which(br_mi1$year == 2016),], aes(x = value, y = ..density..)) + 
  geom_histogram(bins = 20, colour = "#80593D", fill = "#9FC29F", boundary = 0) +
  geom_density(color = "#3D6480") + 
  facet_wrap(~larger_region) +
  ggtitle("2016 Brazil Monthly Income Distribution",
          subtitle = "by Regions (outlier removed)") +
  labs(x = "Monthly Income", y = 'Density') +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
  theme(plot.caption = element_text(color = "grey68"))
```
   
In this plot, we also remove the outliers for better visualization. The histograms shows all of the 5 regions have most density in monthly income less than 2000 but greater than 1000. Also, South East region does not show a distribution of higher monthly income than other regions in year 2016.      


<font size="5"><b> Regional Average Monthly Income </b></font>     

Now we move to visualize the regional average monthly income. Here we keep the outliers because we are calculating the regional average for each year.   

```{r include=FALSE}
df_mi <- br_mi[, c('year','value','larger_region')] %>% 
  drop_na() %>%
  group_by(year, larger_region) %>% 
  summarise(monthly_income = mean(value))
```


```{r echo=FALSE, fig.height=5, fig.width=10}
ggplot(df_mi, aes(year, monthly_income, color=larger_region)) + 
  geom_line() +
  geom_vline(xintercept = 2016, linetype="dashed", color = "red", size=1.5) +
  geom_text(aes(x = 2016, label = 'Rio Olympics', y = 3000), colour = 'black', size = 4) +
  ggtitle('Brazil Average Monthly Income by Regions', 
          subtitle = 'From 2012 to 2020') +
  labs(x = 'Year', y = 'Average Monthly Income in Reals', color = 'Regions') +
  scale_x_continuous(breaks = seq(2012, 2020, 1)) +
  theme_gray(13)
```

The plot shows an overall increasing pattern in monthly income for all of the 5 regions, where the income of Mid West region increases most fast from 2017 to 2018, but decreases from 2019 to 2020. The plot does not show a more significant increasing after 2016 than before. Also the patterns for all of the 5 regions do not appear to have significant difference.      

```{r echo=FALSE}
mi_rate_temp <- df_mi
mi_rate <- data.frame('Year' = c('2013', '2014', '2015', '2016', '2017',
                                 '2018', '2019', '2020'))
```

```{r echo=FALSE}
for (i in 2013:2020){
  for (j in c('Mid West Region', 'North East Region',
              'North Region', 'South East Region', 'South Region')){
      mi_rate[i-2012, j] <- round(as.numeric(
      (mi_rate_temp[which(mi_rate_temp$year == i & 
                            mi_rate_temp$larger_region == j), 'monthly_income'] - 
      mi_rate_temp[which(mi_rate_temp$year == i-1 & 
                            mi_rate_temp$larger_region == j), 'monthly_income']) /
      mi_rate_temp[which(mi_rate_temp$year == i-1 & 
                            mi_rate_temp$larger_region == j), 'monthly_income']), 5)
              }
}
```

```{r echo=FALSE}
temp_col_mi <- numeric(40)
for (i in 1:8){
  for (j in 2:6){
    temp_col_mi <- c(temp_col_mi, as.numeric(mi_rate[i,j]))
  }
}
```


<font size="5"><b> Regional Monthly Income Increasing Rate from 2013 to 2020 </b></font>      
```{r echo=FALSE}
reactable(
  mi_rate,
  defaultColDef = colDef(
    align = 'center',
    headerStyle = list(background = "#D1E5FC"),
    format = colFormat(percent = TRUE, digits = 2),
    style = function(value) {
    if (!is.numeric(value)) {
      color <- "#A9D0FD"
      list(background = color)
    } else {
    normalized <- (value - min(temp_col_mi)) / (max(temp_col_mi) - min(temp_col_mi))
    color <- orange_pal(normalized)
    list(background = color)
    }
  }
  ),
  bordered = TRUE,
  defaultPageSize = 5
)
```  

From the income increasing rate table, we find that the highest rate (11.55%) takes place at Mid West region in 2018, whereas the lowest one (-3.22%) takes place at Mid West region in 2020. For South East Region, the increasing rate in 2016 is lower than that in 2015. Together with the multiple time series plot, we find that the regional monthly income increases after the Olympic year, but the increasing rates do not become much more significant comparing with previous ones before 2016.        


<font size="5"><b> Relationship between Brazil GDP and Monthly Income </b></font>    

Then we create an animated scatter plot to visualize the relationship between GDP and month income by regions over years.   

```{r include=FALSE}
Ag <- br_gdp[which(br_gdp$year >= 2012 & br_gdp$year <= 2017),] %>%
  group_by(year, state, larger_region) %>%
  summarize(gdp = mean(value)/1000000000)
Ai <- br_mi[which(br_mi$year >= 2012 & br_mi$year <= 2017),] %>%
  drop_na() %>%
  group_by(year, state) %>%
  summarize(income = mean(value))

Agi <- merge(Ag, Ai, by=c('year', 'state'))
```

```{r echo=FALSE, message=FALSE, warning=FALSE}
anim <- ggplot(Agi, aes(gdp, income, color = larger_region)) +
  geom_point() +
  theme_bw() +
  labs(title = 'Scatter Plot between GDP and Monthly Income by Regions',
       subtitle = 'Year: {frame_time}', x = 'GDP in Trillions of Reals', 
       y = 'Average Monthly Income', color = 'Region') +
  transition_time(year)

animate(
  anim, renderer = magick_renderer()
)

```
   
From this plot, we find that the points are moving up and right. Each of these point represents a state in Brazil and the color shows its region. One state in Mid West region has the highest GDP and monthly income over years. Also, two states in South East region have higher GDP than the majority of states.  
    
    
## Brazil Unemployment Rate     
 
```{r include=FALSE}
for (i in 1:nrow(br_un)){
  if (br_un[i,'state'] %in% north){
    br_un[i,'larger_region'] <- 'North Region'
  } else if (br_un[i,'state'] %in% north_east){
    br_un[i,'larger_region'] <- 'North East Region'
  } else if (br_un[i,'state'] %in% mid_west){
    br_un[i,'larger_region'] <- 'Mid West Region'
  } else if (br_un[i,'state'] %in% south_east){
    br_un[i,'larger_region'] <- 'South East Region'
  } else {
    br_un[i,'larger_region'] <- 'South Region'
  }
}
```

```{r include=FALSE}
br_un_bar <- br_un[which(br_un$category != 'Outside the workforce' & br_un$year != 2020),
                   c('year','quarter','category','value','larger_region')] %>%
  group_by(year, quarter, category) %>%
  summarise(value = sum(value))

for (i in 1:nrow(br_un_bar)){
  y_temp <- br_un_bar$year[i]
  q_temp <- br_un_bar$quarter[i]
  un_temp <- br_un_bar[which(br_un_bar$year == y_temp & br_un_bar$quarter == q_temp &
                  br_un_bar$category == 'Workforce - Unemployed'), 'value']
  e_temp <- br_un_bar[which(br_un_bar$year == y_temp & br_un_bar$quarter == q_temp &
                  br_un_bar$category == 'Workforce - Employed'), 'value']
  if (br_un_bar$category[i] == 'Workforce - Unemployed'){
    br_un_bar[i,'rate'] <- un_temp/(un_temp+e_temp)
  } else {
    br_un_bar[i,'rate'] <- e_temp/(un_temp+e_temp)
  }
  br_un_bar[i,'work_force'] <- un_temp+e_temp
}

br_un_bar <- br_un_bar %>%
  group_by(year, category) %>%
  summarize(avg_rate = mean(rate)) 
```

Another way we can analyze the potential impact of the Olympic Games on Brazil is through analyzing the unemployment rate over time. First we make a stacked bar chart to visualization the national unemployment rate from 2012 to 2019.     

```{r echo=FALSE, fig.height=6, fig.width=10, message=FALSE, warning=FALSE}
ggplot(br_un_bar, aes(x = as.factor(year), y = avg_rate, fill = category)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle('Stacked Bar Chart of Average Brazil Unemployment/Employment Rate',
          subtitle = 'From 2012 to 2019') +
  labs(x = 'Rate', y = 'Year') +
  theme_gray(13) +
  scale_x_discrete(labels = c('2016'='2016 \n (Olympics)'))
```
      
The stacked bar chart shows that year 2017 has the highest unemployment rate, whereas year 2014 has the lowest unemployment rate. Also, the unemployment rate keeps increasing from 2015 to 2017. The above plot confirms our conclusion that the 2016 Rio Olympics did not appreciably benefit Brazil as measured through the unemployment rate. In fact, during the years that follow the Olympics the unemployment rate seems to increase, not decrease, which suggest a possible negative effect that could be the subject of future exploration.    
      

<font size="5"><b> Unemployment Rate by Regions </b></font>      

Now let's move to the regional level. We want to check whether the 5 regions follow a same pattern as the national unemployment rate.    

```{r include=FALSE}
#Calculate unemployment rate and work force for each quarter of each year (by region)  
df_un <- br_un[which(br_un$category != 'Outside the workforce'),
               c('year','quarter','category','value','larger_region')] %>% 
  group_by(year, quarter, category, larger_region) %>% 
  summarise(value = sum(value)) 
```

```{r include=FALSE}
for (i in 1:nrow(df_un)){
  y_temp <- df_un$year[i]
  q_temp <- df_un$quarter[i]
  r_temp <- df_un$larger_region[i]
  un_temp <- df_un[which(df_un$year == y_temp & df_un$quarter == q_temp & df_un$larger_region == r_temp &
                  df_un$category == 'Workforce - Unemployed'), 'value']
  e_temp <- df_un[which(df_un$year == y_temp & df_un$quarter == q_temp & df_un$larger_region == r_temp &
                  df_un$category == 'Workforce - Employed'), 'value']
  df_un[i,'unemployment_rate'] <- un_temp/(un_temp+e_temp)
  df_un[i,'work_force'] <- un_temp+e_temp
}
```


```{r include=FALSE}
df_un <- unique(subset(df_un, select=-c(category,value)))
```

 
```{r include=FALSE}
#Calculate average unemployment rate for each year (by region)  
df_un1 <- df_un %>%
  group_by(year, larger_region) %>% 
  summarise(avg_un_rate = mean(unemployment_rate))
```

```{r echo=FALSE, fig.height=5, fig.width=10, message=FALSE, warning=FALSE}
ggplot(df_un1[1:40,], aes(year, avg_un_rate, color=larger_region)) + 
  geom_line() +
  ggtitle('Brazil Average Yearly Unemployment Rate by Regions',
          subtitle = 'From 2012 to 2019') +
  labs(x = 'Year', y = 'Average Yearly Unemployment Rate', color = 'Regions') +
  scale_x_continuous(breaks = seq(2012, 2019, 1)) +
  theme_gray(13) +
  geom_vline(xintercept = 2016, linetype="dashed", color = "red", size=1.5) +
  geom_text(aes(x = 2016, label = 'Rio Olympics', y = 0.06), colour = 'black', size = 4)
```

From 2012 to 2014, the unemployment rate decreases for the 5 regions. However, during the period between 2014 and 2017, the unemployment rate increases for the 5 regions. The unemployment rate starts to decrease after 2017. The regions follow the same pattern as the national pattern, except Mid West region has an increasing unemployment rate from 2018 to 2019.   

For more precise conclusion, we also calculate the exact regional unemployment rate increasing rate.   

<font size="5"><b> Unemployment Rate Increasing Rate by Regions from 2013 to 2019 </b></font>    
```{r echo=FALSE, message=FALSE, warning=FALSE}
un_rate_temp <- df_un1[1:40,]
un_rate <- data.frame('Year' = c('2013', '2014', '2015', '2016', '2017',
                                 '2018', '2019'))
```

```{r echo=FALSE}
for (i in 2013:2019){
  for (j in c('Mid West Region', 'North East Region',
              'North Region', 'South East Region', 'South Region')){
      un_rate[i-2012, j] <- round(as.numeric(
      (un_rate_temp[which(un_rate_temp$year == i & 
                            un_rate_temp$larger_region == j), 'avg_un_rate'] - 
      un_rate_temp[which(un_rate_temp$year == i-1 & 
                            un_rate_temp$larger_region == j), 'avg_un_rate']) /
      un_rate_temp[which(un_rate_temp$year == i-1 & 
                            un_rate_temp$larger_region == j), 'avg_un_rate']), 5)
              }
}
```

```{r echo=FALSE, message=FALSE, warning=FALSE}
temp_col_un <- numeric(35)
for (i in 1:7){
  for (j in 2:6){
    temp_col_un <- c(temp_col_un, as.numeric(un_rate[i,j]))
  }
}
```

   
```{r echo=FALSE, message=FALSE, warning=FALSE}
reactable(
  un_rate,
  defaultColDef = colDef(
    align = 'center',
    headerStyle = list(background = "#D1E5FC"),
    format = colFormat(percent = TRUE, digits = 2),
    style = function(value) {
    if (!is.numeric(value)) {
      color <- "#A9D0FD"
      list(background = color)
    } else {
    normalized <- (value - min(temp_col_un)) / (max(temp_col_un) - min(temp_col_un))
    color <- orange_pal(normalized)
    list(background = color)
    }
  }
  ),
  bordered = TRUE,
  defaultPageSize = 5
)
```  

   
Since 2014, the average unemployment rate keeps increasing for all of the 5 regions until 2017. The highest increasing rate appears at year 2016 which is the Olympics year for all of the 5 regions. The unemployment rate drops starting from 2018. Therefore, the 2016 Rio Olympic games did not benifit Brazil's unemployment rate. There are no significant difference in regional unemployment pattern observed as well.     

       
## Brazil Tourism    
   
Now we want to analyze Brazil tourism to check whether the Olympics benifits the tourism, especially for the South East region.     

```{r include=FALSE}
for (i in 1:nrow(br_tj)){
  if (br_tj[i,'state'] %in% north){
    br_tj[i,'larger_region'] <- 'North Region'
  } else if (br_tj[i,'state'] %in% north_east){
    br_tj[i,'larger_region'] <- 'North East Region'
  } else if (br_tj[i,'state'] %in% mid_west){
    br_tj[i,'larger_region'] <- 'Mid West Region'
  } else if (br_tj[i,'state'] %in% south_east){
    br_tj[i,'larger_region'] <- 'South East Region'
  } else {
    br_tj[i,'larger_region'] <- 'South Region'
  }
}
```

```{r include=FALSE}
df_tj <- br_tj %>% 
  group_by(year, month, larger_region) %>% 
  summarise(total_jobs = sum(jobs)) %>%
  group_by(year, larger_region) %>%
  summarise(avg_jobs = mean(total_jobs)/1000)
```


```{r echo=FALSE, fig.height=5, fig.width=10}
ggplot(df_tj, aes(year, avg_jobs, color=larger_region)) + 
  geom_line() +
  ggtitle('Brazil Average Yearly Tourism Jobs by Regions', 
          subtitle = 'From 2006 to 2018') +
  labs(x = 'Year', y = 'Average Yearly Tourism Jobs in Thousands', color = 'Regions') +
  scale_x_continuous(breaks = seq(2006, 2018, 1)) +
  theme_gray(13) +
  geom_vline(xintercept = 2016, linetype="dashed", color = "red", size=1.5) +
  geom_text(aes(x = 2016, label = 'Rio Olympics', y = 300), colour = 'black', size = 4) +
  geom_vline(xintercept = 2009, linetype="dashed", color = "red", size=1.5)
```

From the multiple time series plot, we find that from 2006 to 2018, South East region alwasys has a much higher number of tourism jobs than the other regions. After 2009, the final selection year, all of the 5 regions seems to increase faster than before in the number of tourism jobs. However, from 2015 to 2017, South East region seems to decreases slightly. This shows that the Olympics does not necessarily increase the number of tourism jobs for South East region. The decreasing in number of tourism jobs in South East region complies with the pattern of Brazil unemployment rate.