index.Rmd

---
title: "Model reproduction checklist"
author: "Breck Baldwin"
date: "2/24/2020"
output: 
  html_document:
    includes:
       in_header: _html/ga.html
---
<!--Easiest way to render as html is open this file in RStudio and select 'Knit' option at top of window with this content.-->

# Model overview

This document exists in the repository at [https://github.com/codatmo/Simple_SIR](https://github.com/codatmo/Simple_SIR) as `index.Rmd` and rendered in html as `index.html`.

### Validation checklist items completed

* Model description: Research goals, references, supplementary description as necessary.
* Data description and explicit data munging description.
* Small data set to validate model execution
* Divergent transitions report
* Rhat check
* Compare marginal posterior densities across chains
* Posterior predictive check
* Prior predictive check
* Simulated data parameter recovery

### Validation steps planned

* SBC (Simulation Based Calibration)

## Model description

Based on case study at [https://mc-stan.org/users/documentation/case-studies/boarding_school_case_study.html](https://mc-stan.org/users/documentation/case-studies/boarding_school_case_study.html). The case study is a detailed introduction to SIR models (Susceptible, Infectious, Resolved) and Bayesian modeling with Stan.  

## Data 

The data are freely available in the R package outbreaks, maintained as part of the [R Epidemics Consortium](http://www.repidemicsconsortium.org). 

```{r echo=TRUE, message=FALSE, comment=NA}
library(cmdstanr)
library(outbreaks)
library(tidyverse)
library(stringr)
print_file <- function(file) {
  cat(paste(readLines(file), "\n", sep=""), sep="")
}

head(influenza_england_1978_school)
```
Data tracks on a per day basis the number of boarding schools students `in_bed` which are considered Infected and `convalescent` which are considered Resolved in the SIR model, total N = 763. The mapping to data is The initial state is I = 1, S = 762, R=0. 

### Data munging

The conversion to Stan input given the above data is as follows:
```{r echo=TRUE, message=FALSE, comment=NA}
# time series of cases
cases <- influenza_england_1978_school$in_bed  # Number of students in bed

# total count
N <- 763;

# times
n_days <- length(cases) 
ts <- seq(1, n_days, by = 1)
t0 = 0 

#initial conditions
i0 <- 1 # Infected
s0 <- N - i0 # Susceptible
r0 <- 0 # Resolved
y0 = c(S = s0, I = i0, R = r0)

compute_likelihood = 1

# data for Stan
data_sir <- list(n_days = n_days, i0 = i0, y0 = y0, s0 = s0, r0 = r0, t0 = t0, ts = ts, 
                 N = N, cases = cases, compute_likelihood = compute_likelihood)

```

## Stan program

The Stan model is located at `Simple_SIR/stan/sir_negbin.stan`:

```{r echo=FALSE, message=FALSE, comment=NA}
stan_file <- "stan/sir_negbin.stan"
lines <- readLines(stan_file)
cat(paste(lines, "\n", sep=""), sep="")

get_data_block <- function(lines) {
  data_lines <- c()
  accumulate_line <- FALSE
  for (line in lines) {
    if (str_detect(line,"^data\\s*\\{")) {
      accumulate_line <- TRUE
    }
    if (str_detect(line,"^parameters\\s*\\{")) {
      accumulate_line <- FALSE
      break
    }
    if (accumulate_line) {
      data_lines <- c(data_lines,line)
    }
  }
  return(data_lines)
}
```
## Running model

```{r echo=TRUE, message=FALSE, comment=NA}
model <- cmdstan_model(file.path("stan","sir_negbin.stan"))
fit_sir_negbin <- model$sample(
                data = data_sir)

fit_sir_negbin$cmdstan_summary()
```
## Model validation

Below are the diagnostics used to help validate the model. 

### Rhat check

Rhat values are below 1.1, see below:

```{r echo==TRUE, comment=NA}
library(rstan)
r_stan_sir_negbin <- rstan::read_stan_csv(fit_sir_negbin$output_files())
stan_rhat(r_stan_sir_negbin)
```


### Divergent transition check

```{r comment=NA}
stan_diag(r_stan_sir_negbin,information="divergence")
```


### Compare marginal posterier densities across chains

These graphs show that posteriors are similar across all 4 chains.

```{r echo=TRUE}
pars=c('beta', 'gamma', "R0", "recovery_time")
stan_dens(r_stan_sir_negbin, pars = pars, separate_chains = TRUE)

```
### Posterior predictive check

Predict new data based on estimates--prediction is done in the `generated quantities` block of the Stan program--repeated below. 

```{r echo=FALSE, comment=NA}

get_generated_quantities_block <- function(lines) {
  data_lines <- c()
  accumulate_line <- FALSE
  for (line in lines) {
    if (str_detect(line,"^generated quantities\\s*\\{")) {
      accumulate_line <- TRUE
    }
    
    if (accumulate_line) {
      data_lines <- c(data_lines,line)
    }
  }
  return(data_lines)
}

cat(paste(get_generated_quantities_block(lines), "\n", sep=""), sep="")
```
For each draw of parameter values the `generated quantities` block is executed and the variables `R0`, `recovery_time` and `pred_cases` are computed and stored with the corresponding parameter values. Examining these values can increase our confidence that the model and fit are reasonable. 
```{r}

smr_pred <- cbind(as.data.frame(summary(
  r_stan_sir_negbin, pars = "pred_cases", probs = c(0.05, 0.5, 0.95))$summary), ts, cases)

colnames(smr_pred) <- make.names(colnames(smr_pred))


fit.df <- as.data.frame(summary(
  r_stan_sir_negbin, pars = "pred_cases", probs = c(0.05, 0.5, 0.95))$summary)
smr_pred <- cbind(fit.df, ts, cases)
colnames(smr_pred) <- make.names(colnames(smr_pred)) # to remove % in the col names

ggplot(smr_pred, mapping = aes(x = ts)) +
  geom_ribbon(aes(ymin = X5., ymax = X95.), fill = "orange", alpha = 0.6) +
  geom_line(mapping = aes(x = ts, y = X50.)) +
  geom_point(mapping = aes(y = cases)) +
  labs(x = "Day", y = "Number of students in bed")
```
Actual data values for I `in_bed` are dots, mean estimates of I condition with 95% confidence interval shaded. 

### Prior predictive check

The prior predictive check estimates the model parameters without data or likelihood being used. The resulting draws are then used to predict new data via predictive application of the likelihood without having seen any data. Note that `compute_likelihood = 0` prevents the likelihood being computed in the model. 

```{r echo=TRUE, comment=NA}
library(rstan)
compute_likelihood = 1 # will not compute likelihood, allows for prior predictive check
# data for Stan
data_sir2 <- list(n_days = n_days, y0 = y0, t0 = t0, ts = ts, N = N, cases = cases, compute_likelihood = compute_likelihood)

data_sir <- list(n_days = n_days, i0 = i0, y0 = y0, s0 = s0, r0 = r0, t0 = t0, ts = ts, 
                 N = N, cases = cases, compute_likelihood = compute_likelihood)
niter <- 2000
model <- cmdstan_model(file.path("stan","sir_negbin.stan"))

fit_sir_negbin_ppc <- model$sample(
                data = data_sir,
                seed=1590291619)

r_stan_sir_negbin_ppc <- rstan::read_stan_csv(fit_sir_negbin$output_files())
# fit_sir_negbin_prior_pred_chck <- model$sample(
#                 data = data_sir,
#                 #iter = niter,
#                 #chains = 4,
#                 seed=1590291619
#                 ) #this seed works, not all will

pars=c('beta', 'gamma', "R0", "recovery_time")
print(r_stan_sir_negbin_ppc, pars = pars)

```
The above summary shows values for our predicted quantities in the context of the parameters with intervals. The 95% interval of possible `recovery_time` spans reasonable values of .71 to 39.3 days as well as `R0` spanning values of 0.56 to 71.3. 

### Parameter recovery with simulated data

```{r echo=TRUE, message=FALSE, warning=FALSE, comment=NA}
#Pick one arbitrary draw from the prior distribution
library(rstan)
draw <- 12 

s_prior <- rstan::extract(r_stan_sir_negbin_ppc)
cases_simu <- s_prior$pred_cases[draw,] 

compute_likelihood = 1
data_simu <-  list (n_days  = n_days, y0 = y0, t0 = t0, ts = ts, N=N, cases=cases_simu, compute_likelihood = compute_likelihood)
fit_simulated <- model$sample(
                data = data_simu,
                seed=1316072718)

fit_simu <- sampling(model, data=data_simu, chains=4, seed=1316072718)
fit_simu <- rstan::read_stan_csv(fit_simulated$output_files())
params = c("beta", "gamma", "phi")
paste("true beta :", toString(s_prior$beta[draw]), 
      ", true gamma :", toString(s_prior$gamma[draw]), ", true phi :", toString(s_prior$phi[draw]))
```
Knowing the true generating parameters allows comparison to estimated values shown below:
```{r, comment=NA}
print(fit_simu, pars = params)
```

All the parameters fit within the 2.5% to to 97.5% interval.

### SBC validation

-->