---
title: "Helpful R code"
output: html_notebook
---
Code that we often use.
# How to use this file
1) Get the GitHub repository. You can do this in RStudio (if you have git installed) by going to File/New Project/Version Control/Git, then paste the URL https://github.com/weinbergerlab/helpful_r_code and decide where you want to save it on your computer. This should download all of the files you need with the correct directory structure; then just open this .Rmd file and run through it. Or you can visit https://github.com/weinbergerlab/helpful_r_code, download the ZIP file, unzip it on your computer, and then open the .Rmd file and run through it.
2) Make sure the packages in the following chunk are all installed (use install.packages()).
```{r setup}
library(rgdal) #for importing shape files
#library(readr)
library(ggplot2) #Plotting
library(RCurl) #to read files from github
library(RColorBrewer) #Pick a nice color palette
library(htmlTable) #make nice publication-quality tables
library(htmltools)
library(lubridate) #helpful packages for dealing with dates
library(reshape2) #super helpful for data wrangling
library(mgcv) #smoothing splines
library(tidymv) #plot output from smoothing splines
```
# Import data from GitHub
Go to GitHub, click the 'Raw' option to get the unformatted data, then copy and paste the URL into the code below.
```{r}
paho.url<-getURL("https://raw.githubusercontent.com/weinbergerlab/paho-pneumonia-mortality/master/Data/PAHO%20all%20age%20cuts_SubChapters.csv")
paho.ds<-read.csv(text=paho.url)
```
## Some basic data management
Things you could do in Excel, but shouldn't...
### Fix Date formatting
You need to declare the date variable as a date using as.Date. Use a capital Y for a 4-digit year, m for the month, and d for the day; a 2-digit year uses a lowercase y, and if the month is written as a 3-letter abbreviation, use b instead of m. You can rearrange the format string to match what is in your data.
```{r}
paho.ds$monthdate<-as.Date(paho.ds$monthdate, format='%Y-%m-%d')
```
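As a quick illustration of those format codes (a minimal sketch with made-up date strings, not from the PAHO data):
```{r}
#hypothetical examples of other date formats
as.Date('03Jan2019', format='%d%b%Y')  #3-letter month abbreviation -> %b (assumes an English locale)
as.Date('01/03/19', format='%m/%d/%y') #2-digit year -> lowercase %y
```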
### Subset data
Just pull out Ecuador 12-23m, all HDI groups
```{r}
ec1<- paho.ds[paho.ds$age_group=='ec 12-23m A',]
```
You might want to pull out any grouping that has 'ec' in the name. To do this, use the grep function. We will search for the string 'ec' in the age_group variable. This returns the row numbers where 'ec' is found.
```{r}
obs.select<- grep('ec',paho.ds$age_group)
obs.select[1:100] #just look at the first 100 indices
```
Now use these indices to subset the dataset, keeping only the rows identified by 'obs.select'. Running unique() tells us which age groups are in the dataset. As we can see, we now only have age groups with 'ec' in the name; all of the other rows have been dropped.
```{r}
ec2<-paho.ds[obs.select, 1:10] #also only keep variables 1:10
#ec2<-paho.ds[grep('ec',paho.ds$age_group),] #or can do it all in 1 step
print('subset')
unique(ec2$age_group)
print("Original")
unique(paho.ds$age_group)
```
### Reshaping data using melt/cast
Let's read in some data from a Chilean hospitalization dataset, which has been restricted to people with a respiratory complaint.
```{r}
d1<-readRDS('chile_j_chapter.rds')
d1$date<- as.Date(d1$date, "%d%b%Y") #format the date variable
```
Then create a new variable called 'week.date' that has the date of the Sunday of the week in which the admission occurred. The lubridate package has some functions that make this easier: you want to round the date down to the nearest Sunday, and the floor_date function accomplishes this: https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf
```{r date_format2}
#"round" the date down to the nearest Sunday
d1$week.date<-floor_date(d1$date, unit='week')
head(d1)
```
Then do some basic explorations. What is the distribution of ages? Of dates? (Make a histogram for each.)
```{r hist1}
hist(d1$EDAD, xlab='Age (years)' )
hist(d1$date, breaks=10)
```
Frequency of the diagnosis codes:
```{r freq.codes, echo=FALSE}
sort(table(d1$diag1),decreasing=T)
```
Define the outcome variables based on ICD-10 codes
```{r}
icd10.3digits<-substr(d1$diag1,1,3) #extract the first 3 characters from the ICD-10 code
#icd10.3digits[1:10] #view first 10
#Initialize variables
d1$j09_j18<-rep(0, nrow(d1))
#You could either list out J09, J10, J11... or use >= and <= as here--just be careful with it and check your work
d1$j09_j18[icd10.3digits >= c('J09') & icd10.3digits <= c('J18') ]<-1
d1$j22<-0
d1$j22[icd10.3digits == 'J22' ]<-1
table(d1$j09_j18, d1$diag1) #Check your work
```
Create age groups
```{r agegrp.set}
d1$agegrp <-NA #Initialize variable
d1$agegrp[d1$EDAD>=0 &d1$EDAD<5] <-1
d1$agegrp[d1$EDAD>=5 &d1$EDAD<18] <-2
d1$agegrp[d1$EDAD>=18 &d1$EDAD<40] <-3
d1$agegrp[d1$EDAD>=40 &d1$EDAD<65] <-4
d1$agegrp[d1$EDAD>=65 &d1$EDAD<115] <-5
agelabs<-c('<5y','5-17y','18-39y', '40-64y', '65+y') #create labels for the age groups
```
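An equivalent, more compact way to define the same groups (a sketch using base R's cut(), shown only as an alternative to the approach above):
```{r}
#cut() assigns each age to a bin; labels 1-5 match the agegrp codes defined above
d1$agegrp2 <- cut(d1$EDAD, breaks=c(0, 5, 18, 40, 65, 115),
                  right=FALSE, labels=1:5)
table(d1$agegrp, d1$agegrp2) #check that the two versions agree
```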
Let's aggregate now by week.date AND age group. The melt and dcast functions will be used to reshape the data so that we have 1 row per date and 1 column per age group.
```{r}
d2<-d1[,c('agegrp','week.date','j09_j18')] #Just keep the relevant variables
d2.m<-melt(d2, id.vars=c('week.date','agegrp') )
d2.c<-dcast(d2.m, week.date~agegrp, fun.aggregate =sum)
names(d2.c)<-c('date','ag1','ag2','ag3','ag4','ag5')
```
Alternatively, we could create a 'long' dataset by modifying the dcast call, so that we have 1 row for each date/age group combination.
```{r}
d2.c.long<-dcast(d2.m, week.date+agegrp~., fun.aggregate =sum)
names(d2.c.long)<-c('week.date', 'age.grp', 'count')
```
Or, if we wanted, we could have a 3rd variable that we are aggregating on (e.g., state or municipality), and we would use acast to get a 3D array rather than a matrix or data frame. Or we could do it based on another outcome variable. Here dimension 1 is date, dimension 2 is age group, and dimension 3 is the outcome variable (J09-J18 or J22).
```{r}
d2<-d1[,c('agegrp','week.date','j09_j18','j22')] #Just keep the relevant variables
d2.m<-melt(d2, id.vars=c('week.date','agegrp') )
d2.c<-acast(d2.m, week.date~agegrp~variable, fun.aggregate =sum)
str(d2.c) #see the structure of the array
```
Take a 'slice' of the array to look at just J09-J18 or just J22. This gives us 2 matrices, each with the number of rows equal to the number of dates, the number of columns equal to the number of age groups, and values equal to the number of cases in each time period.
```{r}
j09.18.ds<-d2.c[,,'j09_j18']
j22.ds<-d2.c[,,'j22']
str(j22.ds)
str(j09.18.ds)
```
## Basic plotting
I like to use the base plot functions. Some people prefer ggplot2
```{r}
x<-rnorm(n=100)
y<- 0.1 + 1.5*x +rnorm(n=100)
plot(x,y,
bty='l', #turn off top and side axis
pch=16, #controls shape of the markers
col='red', #color of the markers
ylab='Y Variable',
xlab='X Variable',
xpd=NA #allows points to go off side of plotting area
)
```
Here is a function that makes a ggplot with multiple panels (one panel per group). It expects a data frame with a date column, an outcome column, and a grouping column; the commented call at the end uses a hypothetical data frame d3 with columns date, J12_18, and agec, so adjust the names to match your data.
```{r}
ts.plot.fun <- function(ds, time.var, yvar, group.var){
  p1 <- ggplot(ds, aes_string(x=time.var, y=yvar)) +
    geom_line() +
    ylab("Number of pneumonia deaths") +
    xlab("Date") +
    theme_classic() +
    geom_hline(yintercept=0, col='gray', lty=2) +
    facet_wrap(group.var, scales='free') +
    theme(panel.spacing=unit(2,'lines'), axis.text.x=element_text(angle=90))
  return(p1)
}
#example call, assuming a data frame d3 with columns date, J12_18, and agec:
#ts.plot.fun(ds=d3, time.var='date', yvar='J12_18', group.var='agec')
```
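For instance, with the long weekly dataset created above (d2.c.long), a call might look like this (a sketch that assumes that chunk has been run):
```{r}
ts.plot.fun(ds=d2.c.long, time.var='week.date', yvar='count', group.var='age.grp')
```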
### Nice colors with ColorBrewer
ColorBrewer has a bunch of palettes that look nice. More info:
```{r}
?RColorBrewer
```
See the available palettes
```{r}
display.brewer.all()
```
Let's say we want a sequential palette. Our dataset has 100 observations, but the palettes only go up to 11 colors, so we need to extend the palette. We will use the 'Spectral' palette, which normally has 11 colors, and use the colorRampPalette function to expand it to 100 colors.
```{r}
n.cols=100
nice.cols <- colorRampPalette(brewer.pal(11, "Spectral"))(n.cols)
```
Let's then use these colors in the XY plot from above. You need to sort the data by the variable you want it colored by.
```{r}
ds<-cbind.data.frame(x,y) #combine x and y into a dataframe
ds<-ds[order(ds$y),] #sort the data frame by y
plot(ds$x,ds$y,
bty='l', #turn off top and side axis
pch=16, #controls shape of the markers
col=nice.cols, #color of the markers
ylab='Y Variable',
xlab='X Variable',
xpd=NA #allows points to go off side of plotting area
)
```
## Heatmaps
The heatmap call below is a template; comb.mat (the matrix to plot) and date.labs (the column labels) are placeholder objects that you would replace with your own data, so the chunk is not evaluated here.
```{r, eval=FALSE}
heatmap(comb.mat, Rowv=NA, Colv=NA, scale='none',
        labRow = c('Importance for transmission','Vaccine-derived Protection (carriage)','IPD incidence','Vaccine-derived Protection (IPD)'),
        labCol=date.labs,
        col=nice.cols, cexRow=0.75,
        margins=c(3,5),
        xlab='Age (months)',
        # IMPORTANT BIT HERE: add.expr is evaluated after the heatmap is drawn, so you can add reference lines and a custom axis
        add.expr = {
          abline(v=c(12,24,36,48), col=rgb(1,1,1,alpha=0.2), lty=3)
          axis(side=1, at=c(0,12,24,36,48,60), labels=c(NA, 12,24,36,48,NA))
        }
)
```
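A minimal runnable sketch of the same pattern, using a small made-up matrix (demo.mat is invented here purely to show the call):
```{r}
set.seed(123)
demo.mat <- matrix(rnorm(40), nrow=4,
                   dimnames=list(paste('Series', 1:4), 1:10)) #small example matrix
heatmap(demo.mat, Rowv=NA, Colv=NA, scale='none',
        col=nice.cols, margins=c(3,5), xlab='Column index')
```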
## Choropleth maps
These are maps that have areas (ZIP/county/state, etc.) that are colored in based on some characteristic. In this example, we will make a choropleth map using 2-digit ZIP areas in Germany. (Annie also recommends this tutorial if you want more depth: https://cran.r-project.org/doc/contrib/intro-spatial-rl.pdf)
```{r}
#Import the shape file
german.shp2<-readOGR('./german_shp_files/plz-2stellig.shp')
german.shp2$plz<-as.character(german.shp2$plz)
german.shp.df<- fortify(german.shp2, region = "plz") #converts shape file into a data frame
#You have some data with an attribute (e.g., a rate ratio) for each spatial unit
#store this as a data frame, with 1 column having the spatial identifier ('plz' in this example and the other having the value for that spatial unit)
ds1 <-cbind.data.frame('plz'=german.shp2$plz, 'Var1'=rnorm(n=length(german.shp2$plz) ))
#Now make your map
ggplot() + geom_map(data = ds1, #use dataset ds1
aes(map_id = plz, #the variable 'plz' links the map file and the data file
fill = Var1 #use variable Var1 to color the areas
),
map = german.shp.df )+
#the rest of this code just controls the axis limits and how the background looks
expand_limits(x = german.shp.df$long, y = german.shp.df$lat) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),panel.background = element_blank(), axis.line = element_line(colour = "black"))
```
## GLM over multiple groups
We often want to run some function, like a regression model, over multiple groups. An easy way to do this is to split the dataset into smaller datasets, which are stored in a list, and then run the analysis on each element of the list using lapply. Let's use dataset ec2, which has data for Ecuador by age group.
First, split the data by age group:
```{r}
ec2$age_group<-factor(ec2$age_group) #ensure ec2$age_group is a factor with only the levels present in the data
ec2.split<-split(ec2, ec2$age_group)
str(ec2.split[1:2])
```
Now let's write a simple function that we can use to run the regression
```{r}
reg.func<-function(ds){
mod<-glm(J12_J18_prim ~acm_noj_prim +A00_B99_prim, family='poisson', data=ds)
return(mod)
}
```
Now run the reg.func function for each age group. lapply returns a list of fitted model objects, one per age group.
```{r}
mod1<-lapply(ec2.split,reg.func)
```
We would like to extract useful information from the models. We can do this with a second lapply. Let's first look at the model summary for each group.
```{r}
mod.summary<-lapply(mod1, function(x) summary(x))
mod.summary[1:3] #show for first 3 groups
```
Next let's extract the coefficients
```{r}
mod.coef<-lapply(mod1, function(x) coef(x))
mod.coef #show the coefficients for each group
```
We could instead combine these into a single matrix by using sapply instead of lapply.
```{r}
mod.coef<-t(sapply(mod1, function(x) coef(x)))
mod.coef #coefficients as a single matrix, one row per group
```
## Flexible models
Sometimes we want to be able to run slightly different models on the same dataset. We can again use a function. Here we provide the names of the covariates as a character vector, the name of the outcome variable, and the dataset.
Within the function, we extract the model coefficients, the AIC, and the number of observations.
```{r}
reg.func2<-function(ds, outcome, covars){
covars.combined<- paste(covars, collapse='+')
form1<-as.formula(paste0(outcome, '~', covars.combined ))
mod<-glm(form1, family='poisson', data=ds)
mod.coefs<-coef(mod)
aic.mod<- AIC(mod)
mod.nobs=nobs(mod)
return(list('mod.coefs'=mod.coefs,'aic.mod'=aic.mod,'mod.nobs'=mod.nobs))
}
```
Then we call the function, using 3 different sets of covariates, with results saved as mod1, mod2, mod3
```{r}
mod1<- reg.func2(ds=ec2.split[[1]], outcome="J12_J18_prim", covars=c("acm_noj_prim"))
mod2<- reg.func2(ds=ec2.split[[1]], outcome="J12_J18_prim", covars=c("A00_B99_prim"))
mod3<- reg.func2(ds=ec2.split[[1]], outcome="J12_J18_prim", covars=c("acm_noj_prim","A00_B99_prim"))
```
We could look at the output model by model:
```{r}
mod1
```
We could combine these results into a list, and then pull out the coefficients into a summary table. We extract the element named 'mod.coefs' from each list by using the '[[' operator with sapply.
```{r}
mod.list<-list(mod1, mod2, mod3)
mod.coef<-sapply(mod.list, '[[' , 'mod.coefs' , simplify=F)
mod.coef
```
We want to compare these models, so we extract the AIC score from each. When comparing AIC scores, it is a good idea to confirm that all of the models use the same observations; a quick check is to look at the number of observations used in model fitting.
```{r}
mod.aic<-sapply(mod.list, '[[' , 'aic.mod' )
mod.aic
mod.nobs<-sapply(mod.list, '[[' , 'mod.nobs' )
mod.nobs
```
And to get really fancy, let's apply our function with different covariate sets across different age groups. As above, we can do that with lapply, creating a 'list of lists': the first level is by model, and the second level is by age group.
```{r}
mod1<- lapply(ec2.split, reg.func2, outcome="J12_J18_prim", covars=c("acm_noj_prim"))
mod2<- lapply(ec2.split, reg.func2, outcome="J12_J18_prim", covars=c("A00_B99_prim"))
mod3<- lapply(ec2.split, reg.func2, outcome="J12_J18_prim", covars=c("acm_noj_prim",'A00_B99_prim'))
mod.list2<-list(mod1, mod2, mod3)
```
Pulling out the summary statistics gets a bit trickier here. We have to use nested sapply statements to pull everything out by age group and then by model.
```{r}
mod.nobs.complex<-sapply(mod.list2, function(x) sapply(x, '[[' , 'mod.nobs'))
mod.nobs.complex
```
```{r}
mod.aic.complex<-sapply(mod.list2, function(x) sapply(x, '[[' , 'aic.mod'))
mod.aic.complex
```
## Generic function for the variance and confidence interval of a combination of regression coefficients
Coefs is the vector of coefficients from the regression, Vcov is the variance-covariance matrix from the regression, and parmnames is a vector of the parameter names that you want to add together (a subset of the names of Coefs).
```{r}
compute_CI_comb <- function(Coefs, Vcov, parmnames){
  colnames(Vcov) <- names(Coefs)
  rownames(Vcov) <- names(Coefs)
  comb.effect <- sum(Coefs[parmnames]) #point estimate: the sum of the selected coefficients
  Var <- diag(Vcov)
  names(Var) <- colnames(Vcov)
  if(length(parmnames)>1){
    #variance of a sum: the sum of the variances plus 2x each pairwise covariance
    parm.combos <- combn(parmnames, 2, simplify=F)
    cov.piece <- sapply(parm.combos, function(zz){
      2 * Vcov[zz[1], zz[2]]
    })
    var.comb.effect <- sum(Var[parmnames]) + sum(cov.piece)
  }else{
    var.comb.effect <- Var[parmnames]
  }
  lcl <- comb.effect - 1.96*sqrt(var.comb.effect)
  ucl <- comb.effect + 1.96*sqrt(var.comb.effect)
  parm.combos.ci <- c('mean'=comb.effect, 'lcl'=lcl, 'ucl'=ucl)
  return(parm.combos.ci)
}
```
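A quick usage sketch (the small simulated Poisson glm here is made up purely to illustrate the call):
```{r}
set.seed(42)
demo.df <- data.frame(x1=rnorm(100), x2=rnorm(100))
demo.df$y <- rpois(100, exp(0.5 + 0.3*demo.df$x1 + 0.2*demo.df$x2))
demo.mod <- glm(y ~ x1 + x2, family='poisson', data=demo.df)
#combined effect of x1 and x2 (on the log scale), with a 95% CI
compute_CI_comb(Coefs=coef(demo.mod), Vcov=vcov(demo.mod), parmnames=c('x1','x2'))
```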
## Making nice summary tables for publications
Instead of taking the data and putting it into a table in a Word document, you can make a beautiful HTML table directly in R. Let's take the AIC summary table from above (mod.aic.complex) and turn it into a nice table. Use the htmlTable package to do this. Copy and paste the table into Word (make sure you copy the WHOLE table, otherwise weird things will happen), and keep source formatting when you paste into Word.
You would probably want to fix the row names as well. More details are here: https://cran.r-project.org/web/packages/htmlTable/vignettes/tables.html
```{r}
tab1<- round(mod.aic.complex) #round AIC scores to nearest whole number
tab1<-as.data.frame(tab1) #convert to data frame
names(tab1)<-c('Model 1', 'Model 2','Model 3') #Label columns
htmlTable(tab1, align='l',caption="Table 1. Summary of AIC scores")
#print(table.div,type="html",useViewer=TRUE)
#save_html(table.div, 'table1.html')
```
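If you also want to save the table to a standalone HTML file (a sketch using htmltools; the file name table1.html is arbitrary):
```{r, eval=FALSE}
tab1.out <- htmlTable(tab1, align='l', caption="Table 1. Summary of AIC scores")
save_html(HTML(as.character(tab1.out)), 'table1.html') #writes a file you can open in a browser
```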
## GAM smooths
Generalized Additive Models (GAMs) can be used to flexibly capture the association between 2 variables.
First, prepare some pretend data:
```{r}
t<-1:120
x1<- round(exp(sin(2*pi*t/120)+5))
x2<- round(exp(cos(2*pi*t/120) +5))
mat1<-cbind(x1,t,1)
mat2<-cbind(x2,t,2)
mat.comb<-rbind(mat1, mat2)
mat.comb<-as.data.frame(mat.comb)
names(mat.comb)<-c('y','t2','grp')
mat.comb$grp<-as.factor(mat.comb$grp) #MUST define the group variable as a factor
```
Then fit the model and view the smoothed fit
```{r}
mod1<-mgcv::gam(y~ s(t2,by=grp), data=mat.comb, family='poisson')
plot1<-plot_smooths(model=mod1, series=t2, grp, transform=exp)
plot1+ theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"))
```
Note on plot_smooths: in this example, we exponentiate the results (transform=exp) to get back to the original scale. With a logistic regression, you would instead supply plot_smooths with an inverse-logit function: ilogit<-function(x){ exp(x)/(1+exp(x)) }
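A minimal sketch of that logistic case, re-using the simulated data above with a made-up binary outcome (y.bin is invented here just for illustration):
```{r}
ilogit <- function(x){ exp(x)/(1+exp(x)) } #inverse-logit to get back to the probability scale
mat.comb$y.bin <- as.numeric(mat.comb$y > median(mat.comb$y)) #hypothetical binary outcome
mod.bin <- mgcv::gam(y.bin ~ s(t2, by=grp), data=mat.comb, family='binomial')
plot_smooths(model=mod.bin, series=t2, grp, transform=ilogit)
```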
## INLA: generate posterior predictive samples for Poisson and negative binomial models
This example simulates Poisson counts with a smooth trend, fits a random-walk (rw1) model plus an iid term in INLA, and then draws posterior predictive samples. The chunk is not evaluated by default because the INLA package is not loaded in the setup chunk above.
```{r, eval=FALSE}
library(INLA)
set.seed(123)
n <- 100
alpha <- 3
beta <- 1.5
beta2 <- 1.2
beta3 <- -2
x <- (1:n)/n
#simulate Poisson counts from a cubic trend plus unstructured noise
y <- rpois(n, exp(alpha + x*beta + x^2*beta2 + x^3*beta3 + rnorm(n, 0, 0.2)))
x2 <- x
plot(x, y, type='l')
#rw1 captures the smooth trend; the iid term on x2 captures unstructured variation
r <- inla(y ~ 1 + f(x, model = "rw1", constr = FALSE) + f(x2, model = "iid", constr = FALSE),
          data = data.frame(y, x, x2),
          control.compute = list(config=TRUE), #config=TRUE is needed for inla.posterior.sample
          family = "poisson")
plot(r)
r.samples <- inla.posterior.sample(101, r) #draw 101 samples from the joint posterior
#for each posterior sample, draw replicate counts from the Poisson or negative binomial observation model
pred.interval.func <- function(sample.ds, dist=c('nb','poisson')){
  mu1 <- exp(sample.ds$latent[grep('Predictor', row.names(sample.ds$latent))])
  if(dist=='nb'){
    nb.size1 <- sample.ds$hyperpar
    pred <- replicate(102, rnbinom(n=length(mu1), mu=mu1, size=nb.size1), simplify = 'array')
  }else{
    pred <- replicate(102, rpois(n=length(mu1), lambda=mu1), simplify = 'array')
  }
  return(pred)
}
samps <- sapply(r.samples, pred.interval.func, dist='poisson', simplify='array')
samps.q <- t(apply(samps, 1, quantile, probs=c(0.025, 0.5, 0.975))) #median and 95% prediction interval
#plot the observed data with the posterior predictive intervals
par(mfrow=c(1,1))
plot(x, y)
arrows(x0=x, y0=samps.q[,1], y1=samps.q[,3], length=0)
```