---
title: "Bootstrapping with Markov Chain Extension"
author: "Ho Huu Binh"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  rmdformats::robobook:
    # code_folding: false
    fig_width: 7
    fig_height: 4.5
    thumbnails: false
    lightbox: true
    highlight: true
    code_download: false
    use_bookdown: false
pkgdown:
  as_is: true
---
```{r setup, warning=FALSE, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, error=FALSE)
knitr::opts_chunk$set(out.width="100%", fig.height = 4.5, split=FALSE, fig.align = 'center')
options(dplyr.summarise.inform = FALSE)
```
# Bootstrapping with an Extension to Markov Chains
## Motivation and Algorithm Sketch
In the context of forecasting intermittent demand, many bootstrapping approaches have been proposed to overcome the obstacles posed by this specific pattern of time series, which is not handled efficiently by many parametric models (Shenstone and Hyndman, 2003). The bootstrapping approach proposed by Willemain et al. (2004) is among the most prominent and well-validated in both the academic literature and real-world applications. To improve forecast accuracy for products exhibiting highly intermittent demand, Willemain extended the simple bootstrapping method with two features: a discrete-time Markov chain and a "jittering" process.
Conventional bootstrapping, developed by Efron (1979), involves repeatedly sampling with replacement from the original data set, thereby constructing independent bootstrap replications in order to estimate the empirical distribution of the demand (or, in our case, sales volumes). Nevertheless, this method has two significant drawbacks. First, any autocorrelation present in the data violates the assumption of independence among observations. Second, the resampled distribution can only contain values that were actually observed, whereas future demand may take values never seen before yet still compatible with the underlying distribution.
To overcome these limitations, Willemain's WSS method uses a Markov chain to explicitly capture the autocorrelation between occurrences and non-occurrences of demand by estimating the transition matrix empirically. We first summarize how the dependency of the current demand state on the previous period's demand status is incorporated (a short sketch in R follows the list):
+ Estimate the transition probabilities (in this context, a $2 \times 2$ matrix) between the two states defined above.
+ Based on these estimated transition probabilities and the **last** observed demand state, generate a sequence of zero/non-zero values for the entire forecast horizon.
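As a minimal sketch of the first step, the $2 \times 2$ matrix can be estimated by counting consecutive state pairs and normalizing each row (the toy sequence and names here are illustrative, not from Willemain's paper):
```{r}
# Toy sequence of demand states (illustrative labels "Zero"/"Positive")
toy_states <- c("Zero", "Zero", "Positive", "Zero", "Zero",
                "Positive", "Positive", "Zero", "Zero", "Zero")
# Count transitions from the state at time t to the state at time t + 1
trans_counts <- table(head(toy_states, -1), tail(toy_states, -1))
# Normalize each row so it sums to one (this is the MLE; see the Appendix)
prop.table(trans_counts, margin = 1)
```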
Hence, within the process above, we encompass the conditional structure of occurrence (or non-occurrence) with respect to the previous period. After generating the future sequence of demand states, we resample demand sizes only from the periods where demand actually occurred, drawing randomly from these historical non-zero demands. As noted earlier, values that have not been observed in the past data set might still appear in the future and therefore need to be accounted for in the bootstrapped resamples. Willemain addressed this issue with a "jittering" process, which permits greater variation around the selected positive values. The jittering procedure flows as below:
+ Replace all periods where demand occurs with values resampled from the set of positive historical demand sizes.
+ Jitter each value with a standard normal deviate $Z$. If $X^{\ast}$ denotes the resampled historical demand and $X$ the jittered forecast, then:
\begin{equation}
X =
\begin{cases}
1 + \text{INT}\left(X^{\ast} + Z\sqrt{X^{\ast}}\right), & \text{if } 1 + \text{INT}\left(X^{\ast} + Z\sqrt{X^{\ast}}\right) > 0 \\
X^{\ast}, & \text{otherwise}
\end{cases}
\end{equation}
We repeat this process many times (Willemain et al., 2004 used $1{,}000$ replications) to obtain the empirical distribution. By design, the jittering process shifts the primary values up or down by a random quantity equal to $Z\sqrt{X^{\ast}}$. Since this quantity could push the original value to zero or below, we fall back to $X^{\ast}$ whenever a non-positive value is returned. Finally, for each future time step $h$, we average the values obtained across replications to get the expected value over the forecast horizon.
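A direct transcription of the jittering rule above might look as follows (a sketch; the function name `jitter_demand` is ours, and we fall back to $X^{\ast}$ when the jittered value is non-positive, as the rule prescribes):
```{r}
# Sketch of the jittering step: x_star is a resampled positive
# historical demand size, Z a standard normal deviate
jitter_demand <- function(x_star) {
  z <- rnorm(1)
  x <- 1 + as.integer(x_star + z * sqrt(x_star))
  # If jittering pushes the value to zero or below, keep x_star itself
  if (x <= 0) x <- x_star
  x
}
set.seed(1)
jitter_demand(4)  # jitter a resampled demand of size 4
```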
## Implementation
We will use the package [markovchain](https://cran.r-project.org/web/packages/markovchain/vignettes/an_introduction_to_markovchain_package.pdf) by Spedicato (2022) to estimate the transition probabilities via maximum likelihood. Then, we follow the steps of Willemain's method to forecast future demand based on resampling and **jittering**.
```{r, message = FALSE, warning = FALSE}
library(dplyr)
# Loading RData file
load(file = "boostrap_dat.RData")
# Discrete Time Markov Chain
observed_df <- data_final %>%
  filter(id == "HOBBIES_1_192_WI_2" & dates < "2016-04-18") %>%
  mutate(State = case_when(sales == 0 ~ "Zero",
                           sales > 0 ~ "Positive")) %>%
  select(sales, State)
# Estimating the Markov chain
state <- observed_df %>% pull(State) %>%
  as.character()
library(markovchain)
# With the default laplacian = 0, "laplace" smoothing coincides with the MLE
mc <- markovchainFit(state, method = "laplace")$estimate
# Examine the properties of the fitted Markov chain
verifyMarkovProperty(state)
assessStationarity(state, 10)
# Predict 7 days ahead (predict() returns the most likely state sequence)
pred <- predict(object = mc, newdata = "Zero", n.ahead = 7)
# Extracting df with positive sales only
positive_df <- data_final %>%
  filter(id == "HOBBIES_1_192_WI_2" & dates < "2016-04-18") %>%
  filter(sales > 0)
# Pool of historical positive sales (note: unique() weights each observed
# size equally rather than by frequency)
sample_positive_sale <- positive_df %>% pull(sales) %>%
  unique()
# Retrieving the last observation and its state
lastobs <- data_final %>%
  filter(id == "HOBBIES_1_192_WI_2" & dates == "2016-04-17") %>% pull(sales)
lastobs <- if_else(lastobs == 0, "Zero", "Positive")
mc.sim <- function(trans_mat, num.iters = 1000, n_ahead = 7,
                   method = c("modified", "simple"),
                   last_state = lastobs,
                   pos_sales = sample_positive_sale) {
  type <- match.arg(method)
  result <- matrix(0, nrow = num.iters, ncol = n_ahead)
  for (i in 1:num.iters) {
    # Simulate a random zero/positive state sequence starting from the
    # last observed state (predict() gives only the single most likely path)
    fc_seq <- markovchainSequence(n_ahead, trans_mat, t0 = last_state)
    fc <- numeric(n_ahead)
    for (j in 1:n_ahead) {
      if (fc_seq[j] == "Positive") {
        # Resample a historical positive demand size, then jitter it
        x <- sample(pos_sales, 1, replace = TRUE)
        if (type == "simple") {
          fc[j] <- 1 + as.integer(x + sqrt(x) * rnorm(1))
          if (fc[j] <= 0) fc[j] <- x
        } else {
          fc[j] <- as.integer(0.5 + x + sqrt(x) * rnorm(1))
          if (fc[j] <= 0) fc[j] <- 1
        }
      }
    }
    result[i, ] <- fc
  }
  # Average across replications for each step of the horizon
  tibble(Forecast = colMeans(result))
}
# Given a positive sale at 2016-04-17, simulate 7 days ahead
result <- mc.sim(mc, num.iters = 1000, n_ahead = 7, method = "modified") %>%
  bind_cols(data_final %>%
              filter(id == "HOBBIES_1_192_WI_2" & dates > "2016-04-17") %>%
              select(sales))
# Zero-Adjusted Percentage Error: the absolute forecast when the actual
# is zero, the absolute percentage error otherwise, averaged and scaled
ZAPE <- function(actual, pred) {
  loss <- 0
  y <- actual
  yhat <- pred
  for (i in seq_along(y)) {
    if (y[i] == 0) {
      loss <- loss + yhat[i]
    } else {
      loss <- loss + abs((y[i] - yhat[i]) / y[i])
    }
  }
  return((loss / length(y)) * 100)
}
print(result)
ZAPE(result$sales, result$Forecast)
```
As we can see from the results above, the method works well under intermittency, since the estimated probability of transitioning from positive demand to zero demand, and from zero demand to itself, is very high (nearly $1$). Given that transition matrix, if we observe no demand on the last day, the next day is very likely to be the same; and likewise, if we observe a demand on the last day, we are still more likely to end up with zero sales the next day.
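For a quick check of this claim, the fitted probabilities and the implied long-run share of zero versus positive days can be inspected directly (assuming the `mc` object fitted above; `transitionMatrix` is the slot of the S4 `markovchain` class holding the estimates):
```{r}
# Estimated one-step transition probabilities (rows sum to one)
mc@transitionMatrix
# Long-run proportion of time spent in each state
steadyStates(mc)
```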
# Appendix: Derivation of the MLE for a Markov Chain
This section is supplementary, for anyone interested in a formal derivation of the Markov chain MLE. We follow the proof given in the notes of [Cosma Shalizi](https://www.stat.cmu.edu/~cshalizi/) (one of my favorite statisticians). The proof assumes familiarity with the basic structure of Markov chains and their properties.
Suppose we have observed a sample from the chain, a realization of the random vector $X_{1}^{n}$, denoted by $x_{1}^{n} = (x_1, x_2, \dots, x_{n})$. We would like to obtain the probability of observing such data points, that is, the joint probability of the sequence $x_{1}^{n}$:
$$
\begin{aligned}
P\left(X_{1}^{n} = x_{1}^{n}\right) &= P\left(X_1 = x_1, X_2 = x_2, \dots,X_n = x_n\right) \\
& = P\left(X_1 = x_1\right) \prod_{t = 2}^{n} P\left(X_t = x_t \mid X_{1}^{t-1} = x_{1}^{t-1}\right) \\
& = P\left(X_1 = x_1\right) \prod_{t = 2}^{n} P\left(X_{t} = x_{t} \mid X_{t-1} = x_{t-1}\right)
\end{aligned}
$$
The second expression is the conditional probability of the process being at state $x_t$ given all past observations. The third expression follows from one of the defining properties of a discrete-time Markov chain, the first-order Markov assumption. In other words, the model is assumed to satisfy the *Markov property*:
$$
P\left(X_{t} = j \mid X_{t-1} = i, X_{t-2} = i_{t-2}, \ldots, X_1 = i_1\right) = P\left(X_{t} = j \mid X_{t-1} = i\right)
$$
As our target is the MLE of the transition probabilities, we rewrite the above expression in terms of $p_{ij}$ to arrive at the likelihood function:
$$
L(p) = P(X_1 = x_1) \prod_{t = 2}^{n} p_{x_{t-1} x_{t}}
$$
Let $n_{ij}$ be the number of times the observed state transits from $i$ to $j$ in our sample $X_{1}^{n}$; then we rewrite the likelihood above in terms of $n_{ij}$:
$$
L(p) = P(X_1 = x_1) \prod_{i=1}^{m} \prod_{j=1}^{m} p_{ij}^{n_{ij}}
$$
We maximize this expression *w.r.t.* $p_{ij}$, where $m$ denotes the number of states, by taking logarithms (sums are easier to handle than products), differentiating, and setting the result equal to $0$. However, one special property of the Markov chain is crucial here: its transition matrix is row-stochastic. That is,
$$
p \cdot \mathbf{1} = \mathbf{1}
\quad \text{or equivalently} \quad
\sum_{j} p_{ij} = 1, \quad \text{ for each } i
$$
where $\mathbf{1}$ denotes the column vector with all entries equal to one.
It is clear that we have an equality constraint in our formulation, and the formal way to tackle it would be to recast the problem as a standard constrained optimization with Lagrange multipliers. In our case, however, there is a less laborious route: eliminating parameters.
The first step is to take the log of the likelihood function:
$$
\mathcal{L}(p) =\log\left(P(X_1 = x_1)\right) + \sum_{i,j} n_{ij} \log p_{ij}
$$
then pick one transition probability in each row and express it in terms of the others. For convenience, we choose the probability of transiting to state $1$: for each $i$, $p_{i1} = 1 - \sum_{j=2}^{m} p_{ij}$. Substituting this in and differentiating the log-likelihood with respect to $p_{ij}$ for $j \neq 1$ (the constraint makes $p_{i1}$ depend on $p_{ij}$, which contributes the second term below) gives
$$
\frac{\partial \mathcal{L}}{\partial p_{ij}} = \frac{n_{ij}}{p_{ij}} - \frac{n_{i1}}{p_{i1}}
$$
Setting this equal to zero, we have the following relations:
$$
\begin{aligned}
\frac{n_{ij}}{\hat{p}_{ij}} &= \frac{n_{i1}}{\hat{p}_{i1}} \\
& \Updownarrow \\
\frac{n_{ij}}{n_{i1}} &= \frac{\hat{p}_{ij}}{\hat{p}_{i1}}
\end{aligned}
$$
The above relations hold for all $j \neq 1$. Summing $\hat{p}_{ij} = \hat{p}_{i1}\, n_{ij} / n_{i1}$ over all $j$ and using $\sum_{j} \hat{p}_{ij} = 1$ gives $n_{i1}/\hat{p}_{i1} = \sum_{j=1}^{m} n_{ij} = n_{i}$; we can therefore conclude that
$$
\begin{aligned}
\hat{p}_{ij} &= \frac{n_{ij}}{n_{i1}/\hat{p}_{i1}} \\
& = \frac{n_{ij}}{n_{i}} \\
& = \frac{n_{ij}}{\sum_{j'=1}^{m} n_{ij'}}
\end{aligned}
$$
Intuitively, $n_{i1}$ is the number of observed consecutive transitions from state $i$ to state $1$ in the sample $X_{1}^{n}$, so in probability terms $n_{i1} \propto P(X_{t} = i, X_{t+1} = 1)$, while $\hat{p}_{i1}$ estimates the chance of moving to state $1$ starting from state $i$. Hence, applying Bayes' rule in proportional form, the ratio of the former to the latter is proportional to $P(X_{t} = i)$, that is, to the number of times state $i$ is visited, $n_{i} = \sum_{j=1}^{m} n_{ij}$. It is noteworthy that if $n_{i} = 0$, that is, state $i$ does not appear in the chain except possibly in the last position, we formally set all transition probabilities from state $i$ to any other state $j$ ($j \neq i$) to zero, $\hat{p}_{ij} = 0$; it then follows that $\hat{p}_{ii} = 1$.
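To close the loop, the closed-form MLE $\hat{p}_{ij} = n_{ij}/n_{i}$ can be checked numerically against the package fit (a sketch on a toy sequence; with `method = "mle"` the two should agree):
```{r}
# Verify the derived MLE against markovchainFit on a toy sequence
toy <- c("Zero", "Zero", "Positive", "Zero", "Positive", "Positive", "Zero")
# Closed-form MLE: transition counts normalized by row totals
counts <- table(head(toy, -1), tail(toy, -1))
prop.table(counts, margin = 1)
# Package estimate for comparison
markovchainFit(toy, method = "mle")$estimate
```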
# References
1. Willemain, T. R., Smart, C. N., & Schwarz, H. F. (2004). A new approach to forecasting intermittent demand for service parts inventories. International Journal of Forecasting, 20(3), 375-387. doi:10.1016/S0169-2070(03)00013-X
2. Spedicato, G. A., & Signorelli, M. (2014). The markovchain package: A package for easily handling discrete Markov chains in R.
3. Shalizi, C. R. Lecture notes on MLE for Markov chains: https://www.stat.cmu.edu/~cshalizi/462/lectures/06/markov-mle.pdf