-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathindex.Rmd
260 lines (194 loc) · 8.33 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
---
title: "Model reproduction checklist"
author: "Breck Baldwin"
date: "2/24/2020"
output:
html_document:
includes:
in_header: _html/ga.html
---
<!--Easiest way to render as html is open this file in RStudio and select 'Knit' option at top of window with this content.-->
# Model overview
This document exists in the repository at [https://github.com/codatmo/Simple_SIR](https://github.com/codatmo/Simple_SIR) as `index.Rmd` and rendered in html as `index.html`.
### Validation checklist items completed
* Model description: Research goals, references, supplementary description as necessary.
* Data description and explicit data munging description.
* Small data set to validate model execution
* Divergent transitions report
* Rhat check
* Compare marginal posterior densities across chains
* Posterior predictive check
* Prior predictive check
* Simulated data parameter recovery
### Validation steps planned
* SBC (Simulation Based Calibration)
## Model description
Based on case study at [https://mc-stan.org/users/documentation/case-studies/boarding_school_case_study.html](https://mc-stan.org/users/documentation/case-studies/boarding_school_case_study.html). The case study is a detailed introduction to SIR models (Susceptible, Infectious, Resolved) and Bayesian modeling with Stan.
## Data
The data are freely available in the R package outbreaks, maintained as part of the [R Epidemics Consortium](http://www.repidemicsconsortium.org).
```{r echo=TRUE, message=FALSE, comment=NA}
library(cmdstanr)
library(outbreaks)
library(tidyverse)
library(stringr)
print_file <- function(file) {
cat(paste(readLines(file), "\n", sep=""), sep="")
}
head(influenza_england_1978_school)
```
Data tracks on a per day basis the number of boarding schools students `in_bed` which are considered Infected and `convalescent` which are considered Resolved in the SIR model, total N = 763. The mapping to data is The initial state is I = 1, S = 762, R=0.
### Data munging
The conversion to Stan input given the above data is as follows:
```{r echo=TRUE, message=FALSE, comment=NA}
# time series of cases
cases <- influenza_england_1978_school$in_bed # Number of students in bed
# total count
N <- 763;
# times
n_days <- length(cases)
ts <- seq(1, n_days, by = 1)
t0 = 0
#initial conditions
i0 <- 1 # Infected
s0 <- N - i0 # Susceptible
r0 <- 0 # Resolved
y0 = c(S = s0, I = i0, R = r0)
compute_likelihood = 1
# data for Stan
data_sir <- list(n_days = n_days, i0 = i0, y0 = y0, s0 = s0, r0 = r0, t0 = t0, ts = ts,
N = N, cases = cases, compute_likelihood = compute_likelihood)
```
## Stan program
The Stan model is located at `Simple_SIR/stan/sir_negbin.stan`:
```{r echo=FALSE, message=FALSE, comment=NA}
stan_file <- "stan/sir_negbin.stan"
lines <- readLines(stan_file)
cat(paste(lines, "\n", sep=""), sep="")
get_data_block <- function(lines) {
data_lines <- c()
accumulate_line <- FALSE
for (line in lines) {
if (str_detect(line,"^data\\s*\\{")) {
accumulate_line <- TRUE
}
if (str_detect(line,"^parameters\\s*\\{")) {
accumulate_line <- FALSE
break
}
if (accumulate_line) {
data_lines <- c(data_lines,line)
}
}
return(data_lines)
}
```
## Running model
```{r echo=TRUE, message=FALSE, comment=NA}
model <- cmdstan_model(file.path("stan","sir_negbin.stan"))
fit_sir_negbin <- model$sample(
data = data_sir)
fit_sir_negbin$cmdstan_summary()
```
## Model validation
Below are the diagnostics used to help validate the model.
### Rhat check
Rhat values are below 1.1, see below:
```{r echo==TRUE, comment=NA}
library(rstan)
r_stan_sir_negbin <- rstan::read_stan_csv(fit_sir_negbin$output_files())
stan_rhat(r_stan_sir_negbin)
```
### Divergent transition check
```{r comment=NA}
stan_diag(r_stan_sir_negbin,information="divergence")
```
### Compare marginal posterier densities across chains
These graphs show that posteriors are similar across all 4 chains.
```{r echo=TRUE}
pars=c('beta', 'gamma', "R0", "recovery_time")
stan_dens(r_stan_sir_negbin, pars = pars, separate_chains = TRUE)
```
### Posterior predictive check
Predict new data based on estimates--prediction is done in the `generated quantities` block of the Stan program--repeated below.
```{r echo=FALSE, comment=NA}
get_generated_quantities_block <- function(lines) {
data_lines <- c()
accumulate_line <- FALSE
for (line in lines) {
if (str_detect(line,"^generated quantities\\s*\\{")) {
accumulate_line <- TRUE
}
if (accumulate_line) {
data_lines <- c(data_lines,line)
}
}
return(data_lines)
}
cat(paste(get_generated_quantities_block(lines), "\n", sep=""), sep="")
```
For each draw of parameter values the `generated quantities` block is executed and the variables `R0`, `recovery_time` and `pred_cases` are computed and stored with the corresponding parameter values. Examining these values can increase our confidence that the model and fit are reasonable.
```{r}
smr_pred <- cbind(as.data.frame(summary(
r_stan_sir_negbin, pars = "pred_cases", probs = c(0.05, 0.5, 0.95))$summary), ts, cases)
colnames(smr_pred) <- make.names(colnames(smr_pred))
fit.df <- as.data.frame(summary(
r_stan_sir_negbin, pars = "pred_cases", probs = c(0.05, 0.5, 0.95))$summary)
smr_pred <- cbind(fit.df, ts, cases)
colnames(smr_pred) <- make.names(colnames(smr_pred)) # to remove % in the col names
ggplot(smr_pred, mapping = aes(x = ts)) +
geom_ribbon(aes(ymin = X5., ymax = X95.), fill = "orange", alpha = 0.6) +
geom_line(mapping = aes(x = ts, y = X50.)) +
geom_point(mapping = aes(y = cases)) +
labs(x = "Day", y = "Number of students in bed")
```
Actual data values for I `in_bed` are dots, mean estimates of I condition with 95% confidence interval shaded.
### Prior predictive check
The prior predictive check estimates the model parameters without data or likelihood being used. The resulting draws are then used to predict new data via predictive application of the likelihood without having seen any data. Note that `compute_likelihood = 0` prevents the likelihood being computed in the model.
```{r echo=TRUE, comment=NA}
library(rstan)
compute_likelihood = 1 # will not compute likelihood, allows for prior predictive check
# data for Stan
data_sir2 <- list(n_days = n_days, y0 = y0, t0 = t0, ts = ts, N = N, cases = cases, compute_likelihood = compute_likelihood)
data_sir <- list(n_days = n_days, i0 = i0, y0 = y0, s0 = s0, r0 = r0, t0 = t0, ts = ts,
N = N, cases = cases, compute_likelihood = compute_likelihood)
niter <- 2000
model <- cmdstan_model(file.path("stan","sir_negbin.stan"))
fit_sir_negbin_ppc <- model$sample(
data = data_sir,
seed=1590291619)
r_stan_sir_negbin_ppc <- rstan::read_stan_csv(fit_sir_negbin$output_files())
# fit_sir_negbin_prior_pred_chck <- model$sample(
# data = data_sir,
# #iter = niter,
# #chains = 4,
# seed=1590291619
# ) #this seed works, not all will
pars=c('beta', 'gamma', "R0", "recovery_time")
print(r_stan_sir_negbin_ppc, pars = pars)
```
The above summary shows values for our predicted quantities in the context of the parameters with intervals. The 95% interval of possible `recovery_time` spans reasonable values of .71 to 39.3 days as well as `R0` spanning values of 0.56 to 71.3.
### Parameter recovery with simulated data
```{r echo=TRUE, message=FALSE, warning=FALSE, comment=NA}
#Pick one arbitrary draw from the prior distribution
library(rstan)
draw <- 12
s_prior <- rstan::extract(r_stan_sir_negbin_ppc)
cases_simu <- s_prior$pred_cases[draw,]
compute_likelihood = 1
data_simu <- list (n_days = n_days, y0 = y0, t0 = t0, ts = ts, N=N, cases=cases_simu, compute_likelihood = compute_likelihood)
fit_simulated <- model$sample(
data = data_simu,
seed=1316072718)
fit_simu <- sampling(model, data=data_simu, chains=4, seed=1316072718)
fit_simu <- rstan::read_stan_csv(fit_simulated$output_files())
params = c("beta", "gamma", "phi")
paste("true beta :", toString(s_prior$beta[draw]),
", true gamma :", toString(s_prior$gamma[draw]), ", true phi :", toString(s_prior$phi[draw]))
```
Knowing the true generating parameters allows comparison to estimated values shown below:
```{r, comment=NA}
print(fit_simu, pars = params)
```
All the parameters fit within the 2.5% to to 97.5% interval.
### SBC validation
-->