---
title: "Helpful R code"
output: html_notebook
---
Code that we often use.
# How to use this file
1) Get the GitHub repository. You can do this in RStudio (if you have git installed) by going to File/New Project/Version Control/Git, then paste the URL https://github.com/weinbergerlab/helpful_r_code and decide where you want to save it on your computer. This should download all of the files you need with the correct directory structure; then just open this .Rmd file and run through it. Or you can visit https://github.com/weinbergerlab/helpful_r_code, download the ZIP file, unzip it on your computer, and then open the .Rmd file and run through it.
2) Make sure the packages in the following chunk are all installed (use install.packages()).
```{r setup}
library(rgdal) #for importing shape files
#library(readr)
library(ggplot2) #Plotting
library(RCurl) #to read files from github
library(RColorBrewer) #Pick a nice color palette
library(htmlTable) #make nice publication-quality tables
library(htmltools)
library(lubridate) #helpful packages for dealing with dates
library(reshape2) #super helpful for data wrangling
library(mgcv) #smoothing splines
library(tidymv) #plot output from smoothing splines
```
# Import data from GitHub
Go to GitHub, click the 'Raw' option to get the unformatted data, then copy and paste the URL into the code below.
```{r}
paho.url<-getURL("https://raw.githubusercontent.com/weinbergerlab/paho-pneumonia-mortality/master/Data/PAHO%20all%20age%20cuts_SubChapters.csv")
paho.ds<-read.csv(text=paho.url)
```
## Some basic data management
Things you could do in Excel, but shouldn't...
### Fix Date formatting
You need to declare the date variable as a date using as.Date. Use a capital Y for a 4-digit year, m for the month, and d for the day; a 2-digit year uses a lowercase y, and if the month is written as a 3-letter abbreviation, use b instead of m. You can rearrange the format string to match what is in your data.
```{r}
paho.ds$monthdate<-as.Date(paho.ds$monthdate, format='%Y-%m-%d')
```
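As a quick illustration of those format codes (a minimal sketch with made-up date strings, not from the PAHO data):
```{r}
#hypothetical examples of other date formats
as.Date('03Jan2019', format='%d%b%Y')  #3-letter month abbreviation -> %b (assumes an English locale)
as.Date('01/03/19', format='%m/%d/%y') #2-digit year -> lowercase %y
```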
### Subset data
Just pull out Ecuador 12-23m, all HDI groups
```{r}
ec1<- paho.ds[paho.ds$age_group=='ec 12-23m A',]
```
You might want to pull out any grouping that has 'ec' in the name. To do this, use the grep function. We will search for the string 'ec' in the age_group variable. This returns the row numbers where 'ec' is found.
```{r}
obs.select<- grep('ec',paho.ds$age_group)
obs.select[1:100] #just look at the first 100 indices
```
Now use these indices to subset the dataset, keeping only the rows identified by 'obs.select'. Running unique() tells us which age groups are in the dataset. As we can see, we now only have age groups with 'ec' in the name; all of the other rows have been dropped.
```{r}
ec2<-paho.ds[obs.select, 1:10] #also only keep variables 1:10
#ec2<-paho.ds[grep('ec',paho.ds$age_group),] #or can do it all in 1 step
print('subset')
unique(ec2$age_group)
print("Original")
unique(paho.ds$age_group)
```
### Reshaping data using melt/cast
Let's read in some data from a Chilean hospitalization dataset, which has been restricted to people with a respiratory complaint.
```{r}
d1<-readRDS('chile_j_chapter.rds')
d1$date<- as.Date(d1$date, "%d%b%Y") #format the date variable
```
Then create a new variable called 'week.date' that has the date of the Sunday of the week in which the admission occurred. The lubridate package has some functions that make this easier: you want to round the date down to the nearest Sunday, and the floor_date function accomplishes this: https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf
```{r date_format2}
#"round" the date down to the nearest Sunday
d1$week.date<-floor_date(d1$date, unit='week')
head(d1)
```
Then do some basic explorations. What is the distribution of ages? Of dates? (Make a histogram for each.)
```{r hist1}
hist(d1$EDAD, xlab='Age (years)' )
hist(d1$date, breaks=10)
```
Frequency of the diagnosis codes:
```{r freq.codes, echo=FALSE}
sort(table(d1$diag1),decreasing=T)
```
Define the outcome variables based on ICD-10 codes
```{r}
icd10.3digits<-substr(d1$diag1,1,3) #extract the first 3 characters from the ICD-10 code
#icd10.3digits[1:10] #view first 10
#Initialize variables
d1$j09_j18<-rep(0, nrow(d1))
#You could either list out J09, J10, J11... or use >= and <= as here--just be careful with it and check your work
d1$j09_j18[icd10.3digits >= c('J09') & icd10.3digits <= c('J18') ]<-1
d1$j22<-0
d1$j22[icd10.3digits == 'J22' ]<-1
table(d1$j09_j18, d1$diag1) #Check your work
```
Create age groups
```{r agegrp.set}
d1$agegrp <-NA #Initialize variable
d1$agegrp[d1$EDAD>=0 &d1$EDAD<5] <-1
d1$agegrp[d1$EDAD>=5 &d1$EDAD<18] <-2
d1$agegrp[d1$EDAD>=18 &d1$EDAD<40] <-3
d1$agegrp[d1$EDAD>=40 &d1$EDAD<65] <-4
d1$agegrp[d1$EDAD>=65 &d1$EDAD<115] <-5
agelabs<-c('<5y','5-17y','18-39y', '40-64y', '65+y') #create labels for the age groups
```
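An equivalent, more compact way to define the same groups (a sketch using base R's cut(), shown only as an alternative to the approach above):
```{r}
#cut() assigns each age to a bin; labels 1-5 match the agegrp codes defined above
d1$agegrp2 <- cut(d1$EDAD, breaks=c(0, 5, 18, 40, 65, 115),
                  right=FALSE, labels=1:5)
table(d1$agegrp, d1$agegrp2) #check that the two versions agree
```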
Let's aggregate now by week.date AND age group. The melt and dcast functions will be used to reshape the data so that we have 1 row per date and 1 column per age group.
```{r}
d2<-d1[,c('agegrp','week.date','j09_j18')] #Just keep the relevant variables
d2.m<-melt(d2, id.vars=c('week.date','agegrp') )
d2.c<-dcast(d2.m, week.date~agegrp, fun.aggregate =sum)
names(d2.c)<-c('date','ag1','ag2','ag3','ag4','ag5')
```
Alternatively, we could create a 'long' dataset by modifying the dcast call, so that we have 1 row for each date/age group combination.
```{r}
d2.c.long<-dcast(d2.m, week.date+agegrp~., fun.aggregate =sum)
names(d2.c.long)<-c('week.date', 'age.grp', 'count')
```
Or, if we wanted, we could have a 3rd variable that we are aggregating on (e.g., state or municipality), and we would use acast to get a 3D array rather than a matrix or data frame. Or we could do it based on another outcome variable. Here dimension 1 is date, dimension 2 is age group, and dimension 3 is the outcome variable (J09-J18 or J22).
```{r}
d2<-d1[,c('agegrp','week.date','j09_j18','j22')] #Just keep the relevant variables
d2.m<-melt(d2, id.vars=c('week.date','agegrp') )
d2.c<-acast(d2.m, week.date~agegrp~variable, fun.aggregate =sum)
str(d2.c) #see the structure of the array
```
Take a 'slice' of the array to look at just J09-J18 or just J22. This gives us 2 matrices, each with the number of rows equal to the number of dates, the number of columns equal to the number of age groups, and values equal to the number of cases in each time period.
```{r}
j09.18.ds<-d2.c[,,'j09_j18']
j22.ds<-d2.c[,,'j22']
str(j22.ds)
str(j09.18.ds)
```
## Basic plotting
I like to use the base plot functions. Some people prefer ggplot2
```{r}
x<-rnorm(n=100)
y<- 0.1 + 1.5*x +rnorm(n=100)
plot(x,y,
bty='l', #turn off top and side axis
pch=16, #controls shape of the markers
col='red', #color of the markers
ylab='Y Variable',
xlab='X Variable',
xpd=NA #allows points to go off side of plotting area
)
```
Here is a function that makes a ggplot with multiple panels (one panel per group). It expects a data frame with a date column, an outcome column, and a grouping column; the commented call at the end uses a hypothetical data frame d3 with columns date, J12_18, and agec, so adjust the names to match your data.
```{r}
ts.plot.fun <- function(ds, time.var, yvar, group.var){
  p1 <- ggplot(ds, aes_string(x=time.var, y=yvar)) +
    geom_line() +
    ylab("Number of pneumonia deaths") +
    xlab("Date") +
    theme_classic() +
    geom_hline(yintercept=0, col='gray', lty=2) +
    facet_wrap(group.var, scales='free') +
    theme(panel.spacing=unit(2,'lines'), axis.text.x=element_text(angle=90))
  return(p1)
}
#example call, assuming a data frame d3 with columns date, J12_18, and agec:
#ts.plot.fun(ds=d3, time.var='date', yvar='J12_18', group.var='agec')
```
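For instance, with the long weekly dataset created above (d2.c.long), a call might look like this (a sketch that assumes that chunk has been run):
```{r}
ts.plot.fun(ds=d2.c.long, time.var='week.date', yvar='count', group.var='age.grp')
```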
### Nice colors with ColorBrewer
ColorBrewer has a bunch of palettes that look nice. More info:
```{r}
?RColorBrewer
```
See the available palettes
```{r}
display.brewer.all()
```
Let's say we want a sequential palette. Our dataset has 100 observations, but the palettes only go up to 11 colors, so we need to extend the palette. We will use the 'Spectral' palette, which normally has 11 colors, and use the colorRampPalette function to expand it to 100 colors.
```{r}
n.cols=100
nice.cols <- colorRampPalette(brewer.pal(11, "Spectral"))(n.cols)
```
Let's then use these colors in the XY plot from above. You need to sort the data by the variable you want it colored by.
```{r}
ds<-cbind.data.frame(x,y) #combine x and y into a dataframe
ds<-ds[order(ds$y),] #sort the data frame by y
plot(ds$x,ds$y,
bty='l', #turn off top and side axis
pch=16, #controls shape of the markers
col=nice.cols, #color of the markers
ylab='Y Variable',
xlab='X Variable',
xpd=NA #allows points to go off side of plotting area
)
```
## Heatmaps
The heatmap call below is a template; comb.mat (the matrix to plot) and date.labs (the column labels) are placeholder objects that you would replace with your own data, so the chunk is not evaluated here.
```{r, eval=FALSE}
heatmap(comb.mat, Rowv=NA, Colv=NA, scale='none',
        labRow = c('Importance for transmission','Vaccine-derived Protection (carriage)','IPD incidence','Vaccine-derived Protection (IPD)'),
        labCol=date.labs,
        col=nice.cols, cexRow=0.75,
        margins=c(3,5),
        xlab='Age (months)',
        # IMPORTANT BIT HERE: add.expr is evaluated after the heatmap is drawn, so you can add reference lines and a custom axis
        add.expr = {
          abline(v=c(12,24,36,48), col=rgb(1,1,1,alpha=0.2), lty=3)
          axis(side=1, at=c(0,12,24,36,48,60), labels=c(NA, 12,24,36,48,NA))
        }
)
```
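A minimal runnable sketch of the same pattern, using a small made-up matrix (demo.mat is invented here purely to show the call):
```{r}
set.seed(123)
demo.mat <- matrix(rnorm(40), nrow=4,
                   dimnames=list(paste('Series', 1:4), 1:10)) #small example matrix
heatmap(demo.mat, Rowv=NA, Colv=NA, scale='none',
        col=nice.cols, margins=c(3,5), xlab='Column index')
```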
## Choropleth maps
These are maps that have areas (ZIP/county/state, etc.) that are colored in based on some characteristic. In this example, we will make a choropleth map using 2-digit ZIP areas in Germany. (Annie also recommends this tutorial if you want more depth: https://cran.r-project.org/doc/contrib/intro-spatial-rl.pdf)
```{r}
#Import the shape file
german.shp2<-readOGR('./german_shp_files/plz-2stellig.shp')
german.shp2$plz<-as.character(german.shp2$plz)
german.shp.df<- fortify(german.shp2, region = "plz") #converts shape file into a data frame
#You have some data with an attribute (e.g., a rate ratio) for each spatial unit
#store this as a data frame, with 1 column having the spatial identifier ('plz' in this example and the other having the value for that spatial unit)
ds1 <-cbind.data.frame('plz'=german.shp2$plz, 'Var1'=rnorm(n=length(german.shp2$plz) ))
#Now make your map
ggplot() + geom_map(data = ds1, #use dataset ds1
aes(map_id = plz, #the variable 'plz' links the map file and the data file
fill = Var1 #use variable Var1 to color the areas
),
map = german.shp.df )+
#the rest of this code just controls the axis limits and how the background looks
expand_limits(x = german.shp.df$long, y = german.shp.df$lat) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),panel.background = element_blank(), axis.line = element_line(colour = "black"))
```
## GLM over multiple groups
We often want to run some function, like a regression model, over multiple groups. An easy way to do this is to split the dataset into smaller datasets, which are stored in a list, and then run the analysis on each element of the list using lapply. Let's use dataset ec2, which has data for Ecuador by age group.
First, split the data by age group:
```{r}
ec2$age_group<-factor(ec2$age_group) #ensure ec2$age_group is a factor with only the levels present in the data
ec2.split<-split(ec2, ec2$age_group)
str(ec2.split[1:2])
```
Now let's write a simple function that we can use to run the regression
```{r}
reg.func<-function(ds){
mod<-glm(J12_J18_prim ~acm_noj_prim +A00_B99_prim, family='poisson', data=ds)
return(mod)
}
```
Now run the reg.func function for each age group. lapply returns a list of fitted model objects, one per age group.
```{r}
mod1<-lapply(ec2.split,reg.func)
```
We would like to extract useful information from the models. We can do this with a second lapply. Let's first look at the model summary for each group.
```{r}
mod.summary<-lapply(mod1, function(x) summary(x))
mod.summary[1:3] #show for first 3 groups
```
Next let's extract the coefficients
```{r}
mod.coef<-lapply(mod1, function(x) coef(x))
mod.coef #show the coefficients for each group
```
We could instead combine these into a single matrix by using sapply instead of lapply.
```{r}
mod.coef<-t(sapply(mod1, function(x) coef(x)))
mod.coef #coefficients as a single matrix, one row per group
```
## Flexible models
Sometimes we want to be able to run slightly different models on the same dataset. We can again use a function. Here we provide the names of the covariates as a character vector, the name of the outcome variable, and the dataset.
Within the function, we extract the model coefficients, the AIC, and the number of observations.
```{r}
reg.func2<-function(ds, outcome, covars){
covars.combined<- paste(covars, collapse='+')
form1<-as.formula(paste0(outcome, '~', covars.combined ))
mod<-glm(form1, family='poisson', data=ds)
mod.coefs<-coef(mod)
aic.mod<- AIC(mod)
mod.nobs=nobs(mod)
return(list('mod.coefs'=mod.coefs,'aic.mod'=aic.mod,'mod.nobs'=mod.nobs))
}
```
Then we call the function, using 3 different sets of covariates, with results saved as mod1, mod2, mod3
```{r}
mod1<- reg.func2(ds=ec2.split[[1]], outcome="J12_J18_prim", covars=c("acm_noj_prim"))
mod2<- reg.func2(ds=ec2.split[[1]], outcome="J12_J18_prim", covars=c("A00_B99_prim"))
mod3<- reg.func2(ds=ec2.split[[1]], outcome="J12_J18_prim", covars=c("acm_noj_prim","A00_B99_prim"))
```
We could look at the output model by model:
```{r}
mod1
```
We could combine these results into a list, and then pull out the coefficients into a summary table. We extract the element named 'mod.coefs' from each list by using the '[[' operator with sapply.
```{r}
mod.list<-list(mod1, mod2, mod3)
mod.coef<-sapply(mod.list, '[[' , 'mod.coefs' , simplify=F)
mod.coef
```
We want to compare these models, so we extract the AIC score from each. When comparing AIC scores, it is a good idea to confirm that all of the models use the same observations; a quick check is to look at the number of observations used in model fitting.
```{r}
mod.aic<-sapply(mod.list, '[[' , 'aic.mod' )
mod.aic
mod.nobs<-sapply(mod.list, '[[' , 'mod.nobs' )
mod.nobs
```
And to get really fancy, let's apply our function with different covariate sets across different age groups. As above, we can do that with lapply, creating a 'list of lists': the first level is by model, and the second level is by age group.
```{r}
mod1<- lapply(ec2.split, reg.func2, outcome="J12_J18_prim", covars=c("acm_noj_prim"))
mod2<- lapply(ec2.split, reg.func2, outcome="J12_J18_prim", covars=c("A00_B99_prim"))
mod3<- lapply(ec2.split, reg.func2, outcome="J12_J18_prim", covars=c("acm_noj_prim",'A00_B99_prim'))
mod.list2<-list(mod1, mod2, mod3)
```
Pulling out the summary statistics gets a bit trickier here. We have to use nested sapply statements to pull everything out by age group and then by model.
```{r}
mod.nobs.complex<-sapply(mod.list2, function(x) sapply(x, '[[' , 'mod.nobs'))
mod.nobs.complex
```
```{r}
mod.aic.complex<-sapply(mod.list2, function(x) sapply(x, '[[' , 'aic.mod'))
mod.aic.complex
```
## Generic function for the variance and confidence interval of a combination of regression coefficients
Coefs is the vector of coefficients from the regression, Vcov is the variance-covariance matrix from the regression, and parmnames is a vector of the parameter names that you want to add together (a subset of the names of Coefs).
```{r}
compute_CI_comb <- function(Coefs, Vcov, parmnames){
  colnames(Vcov) <- names(Coefs)
  rownames(Vcov) <- names(Coefs)
  comb.effect <- sum(Coefs[parmnames]) #point estimate: the sum of the selected coefficients
  Var <- diag(Vcov)
  names(Var) <- colnames(Vcov)
  if(length(parmnames)>1){
    #variance of a sum: the sum of the variances plus 2x each pairwise covariance
    parm.combos <- combn(parmnames, 2, simplify=F)
    cov.piece <- sapply(parm.combos, function(zz){
      2 * Vcov[zz[1], zz[2]]
    })
    var.comb.effect <- sum(Var[parmnames]) + sum(cov.piece)
  }else{
    var.comb.effect <- Var[parmnames]
  }
  lcl <- comb.effect - 1.96*sqrt(var.comb.effect)
  ucl <- comb.effect + 1.96*sqrt(var.comb.effect)
  parm.combos.ci <- c('mean'=comb.effect, 'lcl'=lcl, 'ucl'=ucl)
  return(parm.combos.ci)
}
```
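A quick usage sketch (the small simulated Poisson glm here is made up purely to illustrate the call):
```{r}
set.seed(42)
demo.df <- data.frame(x1=rnorm(100), x2=rnorm(100))
demo.df$y <- rpois(100, exp(0.5 + 0.3*demo.df$x1 + 0.2*demo.df$x2))
demo.mod <- glm(y ~ x1 + x2, family='poisson', data=demo.df)
#combined effect of x1 and x2 (on the log scale), with a 95% CI
compute_CI_comb(Coefs=coef(demo.mod), Vcov=vcov(demo.mod), parmnames=c('x1','x2'))
```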
## Making nice summary tables for publications
Instead of taking the data and putting it into a table in a Word document, you can make a beautiful HTML table directly in R. Let's take the AIC summary table from above (mod.aic.complex) and turn it into a nice table. Use the htmlTable package to do this. Copy and paste the table into Word (make sure you copy the WHOLE table, otherwise weird things will happen), and keep source formatting when you paste into Word.
You would probably want to fix the row names as well. More details are here: https://cran.r-project.org/web/packages/htmlTable/vignettes/tables.html
```{r}
tab1<- round(mod.aic.complex) #round AIC scores to nearest whole number
tab1<-as.data.frame(tab1) #convert to data frame
names(tab1)<-c('Model 1', 'Model 2','Model 3') #Label columns
htmlTable(tab1, align='l',caption="Table 1. Summary of AIC scores")
#print(table.div,type="html",useViewer=TRUE)
#save_html(table.div, 'table1.html')
```
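If you also want to save the table to a standalone HTML file (a sketch using htmltools; the file name table1.html is arbitrary):
```{r, eval=FALSE}
tab1.out <- htmlTable(tab1, align='l', caption="Table 1. Summary of AIC scores")
save_html(HTML(as.character(tab1.out)), 'table1.html') #writes a file you can open in a browser
```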
## GAM smooths
Generalized Additive Models (GAMs) can be used to flexibly capture the association between 2 variables.
First, prepare some pretend data:
```{r}
t<-1:120
x1<- round(exp(sin(2*pi*t/120)+5))
x2<- round(exp(cos(2*pi*t/120) +5))
mat1<-cbind(x1,t,1)
mat2<-cbind(x2,t,2)
mat.comb<-rbind(mat1, mat2)
mat.comb<-as.data.frame(mat.comb)
names(mat.comb)<-c('y','t2','grp')
mat.comb$grp<-as.factor(mat.comb$grp) #MUST define the group variable as a factor
```
Then fit the model and view the smoothed fit
```{r}
mod1<-mgcv::gam(y~ s(t2,by=grp), data=mat.comb, family='poisson')
plot1<-plot_smooths(model=mod1, series=t2, grp, transform=exp)
plot1+ theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"))
```
Note on plot_smooths: in this example, we exponentiate the results (transform=exp) to get back to the original scale. With a logistic regression, you would instead supply plot_smooths with an inverse-logit function: ilogit<-function(x){ exp(x)/(1+exp(x)) }
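A minimal sketch of that logistic case, re-using the simulated data above with a made-up binary outcome (y.bin is invented here just for illustration):
```{r}
ilogit <- function(x){ exp(x)/(1+exp(x)) } #inverse-logit to get back to the probability scale
mat.comb$y.bin <- as.numeric(mat.comb$y > median(mat.comb$y)) #hypothetical binary outcome
mod.bin <- mgcv::gam(y.bin ~ s(t2, by=grp), data=mat.comb, family='binomial')
plot_smooths(model=mod.bin, series=t2, grp, transform=ilogit)
```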
## INLA: generate posterior predictive samples for Poisson and negative binomial models
This example simulates Poisson counts with a smooth trend, fits a random-walk (rw1) model plus an iid term in INLA, and then draws posterior predictive samples. The chunk is not evaluated by default because the INLA package is not loaded in the setup chunk above.
```{r, eval=FALSE}
library(INLA)
set.seed(123)
n <- 100
alpha <- 3
beta <- 1.5
beta2 <- 1.2
beta3 <- -2
x <- (1:n)/n
#simulate Poisson counts from a cubic trend plus unstructured noise
y <- rpois(n, exp(alpha + x*beta + x^2*beta2 + x^3*beta3 + rnorm(n, 0, 0.2)))
x2 <- x
plot(x, y, type='l')
#rw1 captures the smooth trend; the iid term on x2 captures unstructured variation
r <- inla(y ~ 1 + f(x, model = "rw1", constr = FALSE) + f(x2, model = "iid", constr = FALSE),
          data = data.frame(y, x, x2),
          control.compute = list(config=TRUE), #config=TRUE is needed for inla.posterior.sample
          family = "poisson")
plot(r)
r.samples <- inla.posterior.sample(101, r) #draw 101 samples from the joint posterior
#for each posterior sample, draw replicate counts from the Poisson or negative binomial observation model
pred.interval.func <- function(sample.ds, dist=c('nb','poisson')){
  mu1 <- exp(sample.ds$latent[grep('Predictor', row.names(sample.ds$latent))])
  if(dist=='nb'){
    nb.size1 <- sample.ds$hyperpar
    pred <- replicate(102, rnbinom(n=length(mu1), mu=mu1, size=nb.size1), simplify = 'array')
  }else{
    pred <- replicate(102, rpois(n=length(mu1), lambda=mu1), simplify = 'array')
  }
  return(pred)
}
samps <- sapply(r.samples, pred.interval.func, dist='poisson', simplify='array')
samps.q <- t(apply(samps, 1, quantile, probs=c(0.025, 0.5, 0.975))) #median and 95% prediction interval
#plot the observed data with the posterior predictive intervals
par(mfrow=c(1,1))
plot(x, y)
arrows(x0=x, y0=samps.q[,1], y1=samps.q[,3], length=0)
```