Project TSA_V3.Rmd

---
title: "Project"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Background
Unemployment is an important yardstick that defines the condition of a country’s economy. The unemployment rate has several consequences on the country’s economy such as loss of productive forces, loss of income, as well as a burden on the state budget. Continued and persistent employment rate helps bolster the country’s social and economic status. 
The unemployment rate has been the primary summary statistic for the health of the labor market for quite some time. Recently, however, forecasts of the unemployment rate have come to the forefront, as monetary policy makers are trying to formulate a way of conditioning expectations in the new and extraordinary policy environment.
The unemployment rate has varied from as low as 1% during World War I to as high as 25% during the Great Depression. More recently, it hit 10.8% in November 1982 and 10.0% in October 2009. Unemployment tends to rise during recessions and fall during expansions. From 1948 to 2015, unemployment averaged about 5.8%. The United States has experienced 11 recessions since the end of the postwar period in 1948.


## Purpose

The purpose of this report is to identify the most appropriate model to forecast future unemployment rate in the US using the historical data. We present an in-depth study of the forecasts for the monthly U.S. unemployment rate using various time series models and comparing them to further our understanding of the strengths and deficiencies of these methods.

## About the Data

The data used for the purposes of this report represents the US Unemployment Statistics from January 1990 - December 2016, broken down by state and month. The formatted version of the data in CSV format for the purposes of this analysis was obtained from Kaggle. The raw unformatted data is available at the United States Bureau of Labor Statistics Website.
Note: These unemployment rates are monthly U-3 rates, and are NOT seasonally adjusted or categorized by age, gender, level of education, etc.

```{r  , echo=FALSE, message=FALSE, warning=FALSE}
library(dplyr)
library(forecast)
library(urca)
library(fpp2)

unemp_data = read.csv('D:/4th Qtr Study Material/Project/output.csv', header = TRUE, stringsAsFactors = FALSE)
```

## Including Plots

Exploratory Data Analysis
Overview Of Data: 
To see how these rates varied from state to state, we plotted a map shown below. We see that states like California, Arizona and Michigan have fairly high average rates of unemployment. We can also see that Central America has a fairly lower unemployment rate than the east & west coast, suggesting that coasts are rather more volatile in jobs.

```{r  , echo=FALSE, message=FALSE, warning=FALSE}
str(unemp_data)
```

```{r  , echo=FALSE, message=FALSE, warning=FALSE}
state_wise_avg <- unemp_data %>%
  select(State,Rate) %>%
  group_by(State) %>%
  summarise('Average'= mean(Rate)) %>%
  mutate(State = tolower(State))

colnames(state_wise_avg)[1]<-"region"
colnames(state_wise_avg)[2]<-"value"

require(choroplethr)
require(choroplethrMaps)
state_choropleth(state_wise_avg, title="Average Unemployment across USA", num_colors=8, legend="Avg unemp rate")

```

States like California, Arizona and Michigan have fairly high average rates of unemployment.

## Analysis 

```{r  , echo=FALSE, message=FALSE, warning=FALSE}
byYear <- unemp_data %>%
  select(-State, -County) %>% 
  group_by(Year, Month) %>% 
  summarise(Rate = mean(Rate)) %>%
  arrange(Year, match(Month, month.name))

df <- ts(byYear[,3], start = c(1990,1), freq = 12)
plot(stl(df[, 1], s.window = "periodic"))
```

There seems to be some seasonality/cyclicity. Also the data is not stationary. SO we take a log and do a first order difference. 

```{r  , echo=FALSE, message=FALSE, warning=FALSE}
p1 = autoplot(df)+
  ggtitle("US Monthly Unemployment Rate") + xlab("Year") + ylab("Unemployment Rate")
df_bc = BoxCox(df,lambda=BoxCox.lambda(df))
p2 = autoplot(df_bc)+
  xlab("Year") + ylab("BoxCox of Unemployment Rate")
gridExtra::grid.arrange(p1,p2, nrow=2)

train <- window(df_bc, end = c(2010,12))
test <- window(df_bc, start = c(2011,1), end = c(2016,12))

```

## ETS Models

```{r  , echo=FALSE, message=FALSE, warning=FALSE}
ets_fit0 <- ets(train)
checkresiduals(ets_fit0)

#a2 <- accuracy(ets_fit0)
#a2[,c("RMSE","MAE","MAPE","MASE")]

```

## ARIMA Models

```{r  , echo=FALSE, message=FALSE, warning=FALSE}
df_bc %>%  nsdiffs()
df_bc %>% diff(lag=12) %>% nsdiffs()
df_bc %>% diff(lag=12) %>% ndiffs()
df_bc %>% diff(lag=12) %>% ur.kpss() %>% summary() #since t-stat is < 5pct, we'll call this series stationary
df_bc %>% diff(lag=12) %>% ggtsdisplay() 
```


```{r  , echo=FALSE, message=FALSE, warning=FALSE}
#--Data Modeling. This takes a few minutes to run
#auto.arima(log(df), stepwise=FALSE, max.order = 9,
#           approximation=FALSE)
```

```{r  , echo=FALSE, message=FALSE, warning=FALSE}
fit1 <- Arima(train, order=c(4,0,3), seasonal = c(1,1,1))
fit1 %>% forecast %>% autoplot
checkresiduals(fit1)
```

```{r  , echo=FALSE, message=FALSE, warning=FALSE}
fit2 <- Arima(train,order=c(4,0,2), seasonal = c(1,1,1))
fit2 %>% forecast %>% autoplot
checkresiduals(fit2)
```

```{r  , echo=FALSE, message=FALSE, warning=FALSE}
fit3 <- Arima(train,order=c(4,0,3), seasonal = c(2,1,1))
fit3 %>% forecast %>% autoplot
checkresiduals(fit3)

```

## Check which Arima Model fits better

Based on the AICc values, we see that Model 1 fits the best with lowest AICc of -1252.247.
```{r echo=FALSE, message=FALSE, warning=FALSE}

fit1$aicc
fit2$aicc
fit3$aicc

```

## Comparing ETS vs ARIMA
```{r}

a1 = ets_fit0 %>% forecast(h=72) %>% accuracy(test)
a2 = fit1 %>% forecast(h=72) %>% accuracy(test)

a1[,c("RMSE","MAE","MAPE","MASE")]
a2[,c("RMSE","MAE","MAPE","MASE")]
```


## Forecasting next 2 years

```{r echo=FALSE, warning=FALSE}
fit = Arima(df_bc, order=c(4,0,3), seasonal = c(1,1,1))
fc <- forecast(fit, h=48)
summary(fc)
plot(fc)
```