---
title: "Basic Time Series Constructions"
subtitle: "Moving Average and Auto Regression"
output: html_document
---
```{r,message=FALSE}
source('library.R')
```
A time series is a set of data points measured over time. Generally the measurements are taken at equally
spaced time intervals such as days, weeks, months, or years. One example is the increase or
decrease in the value of an investment after purchase. The measurement starts at zero and then changes each day.
The mathematical notation uses $X_t$ to index an observation $X$ at time $t$.

t (days) | $X_t$: change from buying price | Concrete value
----------|--------------------------------|---------------
0 | 0 | 0
1 | $\epsilon_1$ | 1.2
2 | ? | ?

The stock is bought and increases in value by 1.2 on the first day.
Short-term gain or loss in the commodities market is thought to be due to random variation
and previous market values.
The random variation is represented by $\epsilon_t$ and is called the error term at time $t$.
It is a random variable drawn from a normal distribution with mean zero and unknown
variance.
In the ARIMA model, there are three groups of values that go into each day's commodity price:
the current time period's error term $\epsilon_t$, all the previous time periods' valuations $X_0 \dots X_{t-1}$,
and all the previous time periods' error terms $\epsilon_0 \dots \epsilon_{t-1}$. ARIMA is the specification for parameterizing
a function of the latter two groups of values to predict new observations or fit current observations.
Future error terms $\epsilon_{t+i}$ cannot be predicted, and interestingly enough, current and
past error terms cannot be observed directly.
We can, however, measure the actual commodity value $X_t$, which includes the error term. Writing the
model's prediction as $\hat{X_t}$, subtracting the prediction from the observed value gives an estimate
of the error term: $\hat{\epsilon}_t = X_t - \hat{X_t}$.
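```{r}
library(forecast)   # auto.arima()
library(magrittr)   # %$% exposition pipe
library(ggplot2)
library(ggfortify)
library(xts)
library(tidyverse)  # tibble, dplyr (lag, mutate, select), %>%
library(lubridate)  # today()
library(broom)      # glance(), tidy()
library(tseries)    # adf.test()
library(tidyquant)  # tq_get()
```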
ARIMA time series
At each time period a new random normal realization is added, along with a portion of
the previous random effects.
But what is the model for $t=2$? We know that the value it closes at
after the first day has to be the value it starts at the next. But how will that
information be incorporated into a mathematical model? For starters, we can just use a
portion of the $\epsilon_1$ value.
One method of modeling time series data is called ARIMA.
This method combines three ways to introduce parameters into a model used to forecast
future observations or describe past observations. The proportion of the previous error
carried forward can be represented by $\theta$. Here is an example for the case where $\theta_1=0.4$.

t (days) | $X_t$ Change from buying price | Notation | $\theta_1$ | $\epsilon$ | Concrete value
----------|------------------------------------|-----------|------------|-----------|---------
0 | 0 | $X_0$ | N/A | N/A | 0
1 | $\epsilon_1$ | $X_1$ | N/A | 1.2 | 1.2
2 | $\theta_1 \epsilon_1 + \epsilon_2$ | $X_2$ | 0.4 | -0.5 | `r 0.4*1.2 -0.5`

The model shown above is $X_t = 0.4 \epsilon_{t-1} + \epsilon_t$.
Each observation is just a portion of the error in the previous observation plus the new error amount.
The problem is that the actual error cannot be measured. Only the resulting loss or gain in the stock
can be measured.
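
The table values can be reproduced with a few lines of R. This is only an illustrative sketch: the variable names are made up for the example, and it assumes the mean-zero model with $\theta_1 = 0.4$ from the table. It also shows how, if $\theta_1$ were known, the unobserved errors could be backed out of the observed values one step at a time.

```r
# Illustrative values taken from the table (variable names are hypothetical)
theta_1 <- 0.4
epsilon <- c(1.2, -0.5)          # errors for t = 1 and t = 2

# Build the observations: X_1 = epsilon_1, X_2 = theta_1 * epsilon_1 + epsilon_2
X <- numeric(2)
X[1] <- epsilon[1]
X[2] <- theta_1 * epsilon[1] + epsilon[2]
X

# Going the other way: estimate the errors from X alone, assuming theta_1 is known.
# The prediction for time t is theta_1 times the previous estimated error.
epsilon_hat <- numeric(2)
epsilon_hat[1] <- X[1]                              # no earlier error to subtract
epsilon_hat[2] <- X[2] - theta_1 * epsilon_hat[1]   # observed minus predicted
epsilon_hat
```

In practice $\theta_1$ is not known and has to be estimated from the observed series.
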

ARIMA combines three components:

1. **A**uto **R**egressive
2. **I**ntegrated (Differences)
3. **M**oving **A**verage

Let $X_i$ measure the distance a point is from where it starts.
Time Series
: An ordered set of responses equally spaced in time. For example, observations recorded
every day, week, or month.
$Y_i = \text{A function of the previous observations and previous errors} + \text{current error}$
Auto Regressive
: Future observations are modeled as a function of previous observations.
Integrated
: The difference between an observation and a previous observation is modeled.
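
Differencing can be sketched in base R with `diff()`. The series below is simulated purely for illustration (the slope of 0.5, the seed, and the variable names are assumptions, not part of these notes). It has a steady upward trend, and taking the first difference removes that trend, leaving values that vary around the slope.

```r
set.seed(10)
n_points <- 100

# A series with a linear trend (slope 0.5) plus standard normal noise
trend_series <- 0.5 * seq_len(n_points) + rnorm(n_points)

# First difference: each observation minus the previous observation
differenced <- diff(trend_series, lag = 1)

length(differenced)   # one fewer value than the original series
mean(differenced)     # close to the slope of the trend
```
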
# Manually construct a first order moving average series.
In statistical modeling there is usually a deterministic part and a random part.
In linear regression, there is the equation of a line plus a random error.
Here is the equation for a linear regression model predicting observation
$Y_i$ from the predictor $X_i$.
Linear regression is a linear function of $X$ plus a random error.

$$Y_i = m X_i + b + \epsilon_i$$

Change the index of the observation from $i$ to $t$ to represent time.
Also use $\omega$ as the random error. Replace the equation of the line, $m X_i + b$,
with a constant $\mu$ to represent the mean model.
This next change is where the terminology *time series* becomes applicable.
Change both the $X_i$ and $Y_i$ to $x_t$. The same symbol is used for both
the predictor and the predicted. A piece of yesterday is used to predict today:
a part of yesterday's error is added to the new error for today.
This is the first order moving average model, MA(1):

$$x_t = \mu + \omega_t + \theta_1\omega_{t-1}$$

The second order model, MA(2), also includes a part of the error from two periods back:

$$x_t = \mu + \omega_t + \theta_1\omega_{t-1} + \theta_2\omega_{t-2}$$

The term *moving* in *moving average* can be explained as follows. In time period 1, $t$ is 1 and
$t-1$ is 0. In time period 2, $t$ is 2 and $t-1$ is 1. The index is constantly *moving* as
the time periods increase, so the *averaging* (weighting) of the error $\omega$ by $\theta$ always applies to the
most recent previous error term. The concept of the value
at the current time being a function of the previous time period plus a random error
is what makes time series analysis different from linear regression. The observations in time
are correlated, whereas in a sequence of non-time-related data, as used in linear regression, the observations
are assumed to be uncorrelated. Next we will look at an example sequence by
manually simulating an MA(1) process.
## Characteristics.
Time series models account for characteristics in the data. The MA(1) model
accounts for the moving average and an overall mean. It does not account for
any upward or downward trend. Time series models add the ability to incorporate correlation
from one time point to the next, but you have to consider all the characteristics in the data:

1. Trend up or down
2. Seasonality: week, month, year
3. Other cycles
4. Model changes
5. Change in variability
6. Outliers or extreme values

All of this needs to be taken into consideration when fitting models.
```{r}
# dplyr::lag() shifts a vector back by one position and pads with NA
dplyr::lag(1:10)

# Simulate an MA(1) process.
N <- 1000          # number of observations
average <- 14      # overall mean (mu)
theta <- .8        # moving average parameter (theta_1)
omega <- rnorm(N)  # vector of random normal errors

# One date per observation, ending today
dates <- seq.Date(from = today("EST") - N + 1, to = today("EST"), by = "days")

x <- tibble(date = dates,
            average = average,
            omega = omega,   # reuse the errors generated above
            theta = theta) %>%
  mutate(omega_previous = dplyr::lag(omega, n = 1L),
         x = average + omega + theta * omega_previous) %>%
  select(date, x, average, omega, theta, omega_previous)
```
Now, let's run an analysis of this data.
The ACF below has a significant autocorrelation only at lag 1, so this is an MA(1) series as expected.
The formula for the lag-1 autocorrelation $\phi_1$ of an MA(1) process is:

$$\phi_1 = \frac{\theta_1}{1 + \theta_1^2}$$

With $\theta_1 = 0.8$:

$$\frac{0.8}{1+0.8^2} = \frac{0.8}{1 + 0.64} = `r 0.8/(1 + 0.64)`$$

This should be very close to the lag-1 value in the ACF plot below.
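
The autocorrelation formula follows from the MA(1) definition. As a short derivation, assume the errors $\omega_t$ are independent with mean zero and a common variance, written here as $\sigma^2$ (a symbol introduced only for this derivation):

$$
\begin{aligned}
\text{Var}(x_t) &= \text{Var}(\omega_t + \theta_1\omega_{t-1}) = \sigma^2 + \theta_1^2\sigma^2 = (1+\theta_1^2)\sigma^2 \\
\text{Cov}(x_t, x_{t-1}) &= \text{Cov}(\omega_t + \theta_1\omega_{t-1},\; \omega_{t-1} + \theta_1\omega_{t-2}) = \theta_1\sigma^2 \\
\phi_1 &= \frac{\text{Cov}(x_t, x_{t-1})}{\text{Var}(x_t)} = \frac{\theta_1}{1+\theta_1^2}
\end{aligned}
$$

At lag 2 and beyond, $x_t$ and $x_{t-k}$ share no error terms, so the autocorrelation is zero. That is why the ACF of an MA(1) series cuts off after lag 1.
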
```{r}
x %>%
na.omit() %>%
select(x) %>%
acf()
```
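
As a rough numerical check, the sample lag-1 autocorrelation can be compared with the theoretical value. This sketch simulates its own series in base R so that it runs on its own; the seed, series length, and variable names are arbitrary choices for illustration.

```r
set.seed(123)
n_sim <- 5000
theta_1 <- 0.8

# n_sim + 1 errors so that every observation has a previous error available
omega_sim <- rnorm(n_sim + 1)
x_sim <- 14 + omega_sim[-1] + theta_1 * omega_sim[-(n_sim + 1)]

# Theoretical lag-1 autocorrelation of an MA(1) process
rho_theory <- theta_1 / (1 + theta_1^2)

# Sample autocorrelations; element 1 is lag 0, element 2 is lag 1
rho_sample <- acf(x_sim, lag.max = 1, plot = FALSE)$acf[2]

c(theory = rho_theory, sample = rho_sample)
```

The two numbers will not match exactly because the sample value is an estimate, but they should agree more closely as the series gets longer.
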
```{r}
fit <- arima(x$x,order=c(0,0,1))
glance(fit)
```
```{r}
#library(forecast)
fit_2 <- auto.arima(x$x)
broom::tidy(fit_2)
```
```{r}
summary(fit)
```
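Dickey-Fuller test

H0: The series is not stationary, $\rho = 1$ (equivalently $\delta = 0$).

H1: The series is stationary, $\rho < 1$ (equivalently $\delta < 0$).

The test starts from a first order autoregressive model:

$$x_t = \alpha + \rho x_{t-1} + \epsilon_t$$

Subtracting $x_{t-1}$ from both sides:

$$x_t - x_{t-1} = \alpha + (\rho - 1) x_{t-1} + \epsilon_t$$

Writing $\Delta x_t = x_t - x_{t-1}$ and $\delta = \rho - 1$:

$$\Delta{x_t} = \alpha + \delta x_{t-1} + \epsilon_t$$

Augmented Dickey-Fuller test for stationarity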
```{r}
x %>% na.omit() %$%
adf.test(x)
```
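
The regression form of the test can be sketched directly with `lm()`. This is a simplified illustration, not a replacement for `adf.test()`: it simulates two series in base R (the names, seed, and the helper `df_delta()` are hypothetical), fits $\Delta x_t = \alpha + \delta x_{t-1} + \epsilon_t$, and reports the estimate of $\delta$.

```r
set.seed(1)
n_obs <- 1000
white_noise <- rnorm(n_obs)          # stationary series
random_walk <- cumsum(rnorm(n_obs))  # non-stationary series (rho = 1)

# Hypothetical helper: estimate delta in  diff(x_t) = alpha + delta * x_{t-1} + error
df_delta <- function(series) {
  d_x <- diff(series)                 # x_t - x_{t-1}
  x_lag <- series[-length(series)]    # x_{t-1}
  unname(coef(lm(d_x ~ x_lag))["x_lag"])
}

delta_wn <- df_delta(white_noise)   # expected to be near -1 (rho near 0)
delta_rw <- df_delta(random_walk)   # expected to be near 0 (rho near 1)
c(white_noise = delta_wn, random_walk = delta_rw)
```

Judging whether $\delta$ is significantly below zero requires Dickey-Fuller critical values rather than the usual t table, which is what a dedicated test function takes care of.
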
Ljung-Box test for serial correlation in the residuals of the fitted model
```{r}
Box.test(resid(fit),type="Ljung",lag=6,fitdf=1)
```
## Exercise: Try to simulate a MA(2) series.
The MA(2) series will have $\theta_1$ and $\theta_2$.
These are the two moving average parameters.
The formula for MA(2) is:

$$x_t = \mu + \omega_t + \theta_1\omega_{t-1} + \theta_2\omega_{t-2}$$

To simulate this series, add a term that *averages* the error at the
second lag by $\theta_2$. Enter `?dplyr::lag` to see how to get the second lag.
Try that now.
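
One possible approach is sketched here; it is not the only way to do the exercise. The value $\theta_2 = 0.3$, the seed, and the object names are assumptions chosen for illustration. The second lag comes from `dplyr::lag()` with `n = 2L`.

```r
library(dplyr)

set.seed(7)
n_ma2 <- 1000
mu <- 14
theta_1 <- 0.8
theta_2 <- 0.3   # hypothetical value for the second moving average parameter

ma2 <- tibble(omega = rnorm(n_ma2)) %>%
  mutate(omega_lag1 = dplyr::lag(omega, n = 1L),   # error one period back
         omega_lag2 = dplyr::lag(omega, n = 2L),   # error two periods back
         x = mu + omega + theta_1 * omega_lag1 + theta_2 * omega_lag2)

# The first two values of x are NA because no earlier errors exist for them
head(ma2, 4)
```
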
# Manually construct a first order auto regression series.
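
A first order autoregressive series uses the previous observation itself, rather than the previous error, to build the next value: $x_t = \alpha + \rho x_{t-1} + \epsilon_t$. The following is a minimal base R sketch under assumed values ($\alpha = 0$, $\rho = 0.6$, standard normal errors); none of these values or names come from the notes.

```r
set.seed(2024)
n_ar <- 1000
alpha <- 0     # assumed intercept
rho <- 0.6     # assumed autoregressive parameter
eps <- rnorm(n_ar)

# Build the series one step at a time from the previous observation
x_ar <- numeric(n_ar)
x_ar[1] <- alpha + eps[1]
for (t in 2:n_ar) {
  x_ar[t] <- alpha + rho * x_ar[t - 1] + eps[t]
}

# For an AR(1) series the partial autocorrelation at lag 1 estimates rho,
# and the partial autocorrelations at higher lags should be near zero
pacf_vals <- pacf(x_ar, lag.max = 5, plot = FALSE)$acf
round(pacf_vals[, 1, 1], 3)
```

Where the ACF is the natural diagnostic for a moving average series, the PACF plays the same role for an autoregressive series.
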
```{r}
tidyquant::tq_get_options()
```
```{r}
apl_stock_prices <- tidyquant::tq_get("AAPL")
```
```{r}
```
# tq_get
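```{r}
# West Texas Intermediate crude oil price from FRED
oil <- tq_get("DCOILWTICO", get = "economic.data")
names(oil)
dim(oil)
str(oil)
head(oil)
# pacf() needs a numeric vector, so pull out the price column
oil %>% na.omit() %>% pull(price) %>% pacf()

# Gold price series from FRED (assumes this series code is still available)
gold <- tq_get("GOLDAMGBD228NLBM", get = "economic.data")
plot(gold$date, gold$price, type = "l", xlab = "date", ylab = "price")
gold %>% na.omit() %>% pull(price) %>% pacf()
```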
```{r}
```