
# Case tracking data

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##"
)
```

```{r setup, message=FALSE}
library(sars2pack)
library(dplyr)
library(ggplot2)
```

Datasets accessed from the internet come in many forms. We reformat
the data into "tidy" data.frames that are described here.

Each case-tracking dataset contains at least a `date` column and a
`count` column that together describe some quantity of people over
time. A third column, `subset`, is often (but not always) present and
describes what is being counted in each record. For instance, common
values for `subset` are "confirmed" and "deaths". Additional columns,
if present, usually give geographic detail for each record, such as
the name of the city, state, or country.
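
To make this layout concrete, here is a minimal sketch of a data frame
in the same shape. The values and the `state` column are invented for
illustration and do not come from any of the package's datasets.

```{r}
library(dplyr)

# Toy data (made-up values) in the "tidy" layout: one row per
# date, location, and type of count.
toy_cases <- data.frame(
  date   = as.Date(c("2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02")),
  state  = "XX",  # hypothetical location column
  subset = c("confirmed", "confirmed", "deaths", "deaths"),
  count  = c(10, 15, 0, 1)
)

# Because each row is a single observation, selecting one kind of
# count is a plain filter on `subset`.
toy_confirmed <- toy_cases %>%
  dplyr::filter(subset == "confirmed")
toy_confirmed
```

The same `filter()` pattern should carry over to the real datasets
whenever a `subset` column is present.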

The dataset from Johns Hopkins University is one example that we can
look at to get familiar with the data.

```{r}
jhu = jhu_data()
colnames(jhu)
head(jhu)
dplyr::glimpse(jhu)
```

```{r}
table(jhu$subset)
```

## Compare US datasets

Several case-tracking datasets report cases at the US state level.
Comparing them gives a sense of systematic biases in the data and
shows where the datasets agree. One convenience function,
`combined_us_cases_data()`, yields a stacked data frame with an
identifier for each dataset.


```{r}
us_states = combined_us_cases_data()
head(us_states)
table(us_states$dataset)
```
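
For readers unfamiliar with the "stacked" layout, the following sketch
shows one way such a data frame could be built from separate sources.
It uses `dplyr::bind_rows()` on two made-up data frames. The names
`source_a` and `source_b` are hypothetical, and this is not necessarily
how `combined_us_cases_data()` works internally.

```{r}
library(dplyr)

# Two toy sources (made-up values) reporting the same quantity.
source_a <- data.frame(
  date  = as.Date(c("2020-03-01", "2020-03-02")),
  state = "XX",
  count = c(10, 15)
)
source_b <- data.frame(
  date  = as.Date(c("2020-03-01", "2020-03-02")),
  state = "XX",
  count = c(9, 16)
)

# Stack the sources; `.id` stores the list name of each source
# in a new `dataset` column.
stacked <- dplyr::bind_rows(
  list(source_a = source_a, source_b = source_b),
  .id = "dataset"
)
table(stacked$dataset)
```

Keeping all sources in one long data frame means a single `color` or
facet mapping on `dataset` is enough to compare them in a plot.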

To get a sense of the data and their meaning, consider the following
series of graphs based on three "randomly" chosen states.

```{r}
interesting_states = c('PA','CA','GA')
pd = position_dodge(width=0.2)
```

The `position_dodge()` call nudges the lines apart slightly so that
they do not overlap and hide each other. By adding a few ggplot2
layers to the output of `plot_epicurve()`, we can quickly construct
faceted, stratified curves showing the behavior of three datasets
across three states over time. The "confirmed cases" plot below shows
that, over time, the datasets agree quite well.

```{r fig.cap='Confirmed cases from combined US states datasets for three states'}
plot_epicurve(us_states,
              filter_expression = state %in% interesting_states & count>10,
              case_column = 'count', color='dataset') + 
    facet_grid(rows=vars(state)) + geom_line(position=pd) +
    ggtitle('Cumulative cases')
```

However, the infection **rate** is more easily visualized with daily
incidence curves. 

```{r fig.cap='Daily incidence for three states from multiple data sources'}
plot_epicurve(us_states,
              filter_expression = state %in% interesting_states & incidence>10,
              case_column = 'incidence', color='dataset', log=FALSE) + 
    facet_grid(cols=vars(state)) + geom_line(position=pd) +
    geom_smooth(alpha=0.4) + ggtitle('Daily reported cases')
```
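
As a reminder of what a daily incidence value represents, the sketch
below derives day-over-day differences from cumulative counts within
each dataset. The data and the `daily_new` column name are invented
for illustration. This is an assumption about the general technique,
not a description of how the package computes its own incidence
values.

```{r}
library(dplyr)

# Toy cumulative counts (made-up values) for two hypothetical sources.
cumulative <- data.frame(
  date    = rep(as.Date("2020-03-01") + 0:3, times = 2),
  dataset = rep(c("source_a", "source_b"), each = 4),
  count   = c(10, 15, 15, 22, 9, 16, 18, 21)
)

# Within each source, order by date and subtract the previous day's
# cumulative count. The first day has no predecessor, so it is NA.
daily <- cumulative %>%
  dplyr::group_by(dataset) %>%
  dplyr::arrange(date, .by_group = TRUE) %>%
  dplyr::mutate(daily_new = count - dplyr::lag(count)) %>%
  dplyr::ungroup()
daily
```

Grouping before differencing matters: without it, the first row of one
source would be subtracted from the last row of another.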


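The same approach extends to worldwide data. The next chunk sums the
ECDC death counts by continent and date, smooths them with a seven-day
moving average, and plots the resulting daily incidence.

```{r fig.cap='Worldwide daily reported deaths by continent (seven day moving average)'}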
library(zoo)
ecdc_data() %>%
    dplyr::filter(subset=='deaths') %>%
    dplyr::group_by(date,continent) %>%
    dplyr::summarize(count=sum(count)) %>%
    dplyr::group_by(continent) %>%
    dplyr::mutate(roll_mean = zoo::rollmean(count, 7, fill=NA)) %>%
    add_incidence_column(count_column='roll_mean', grouping_columns = c('continent')) %>%
    dplyr::filter(inc>0) %>%
    plot_epicurve(case_column='inc',color='continent', log=FALSE) +
    ylab('Daily reported deaths') +
    ggtitle('Daily reported deaths over time', subtitle='by continent (7-day moving average)')
```
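
To see what the seven-day moving average does to a series, here is a
small sketch on a made-up vector. It uses `fill = NA`, which the zoo
documentation gives as the replacement for the older `na.pad = TRUE`
argument, and relies on the default centered alignment of
`zoo::rollmean()`.

```{r}
library(zoo)

# Toy series (made-up values).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Centered 7-point moving average. Positions without a full window
# (the first three and last three here) are filled with NA so that
# the result has the same length as the input.
smoothed <- zoo::rollmean(x, k = 7, fill = NA)
smoothed
```

Because the padded positions are `NA`, any quantity derived from the
smoothed series is also missing at the start and end of each group,
which is worth keeping in mind when filtering rows before plotting.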