---
title: "Single Race/Ethnicity BMI EWAS"
author: "Kendra Ferrier"
date: "`r format(Sys.time(), '%m/%d/%y')`"
output:
  html_document:
    theme: cosmo
---
```{r Setup, include=TRUE, message = FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "hide")
library(tidyr)
library(tidyverse)
library(haven)
library(dplyr)
library(tibble)
library(readr)
library(data.table)
library(sesame)
library(purrr)
library(furrr)
library(bacon)
library(tictoc)
library(speedglm)
library(parallel)
sessionInfo()
```
## Import Data
Load your data files into the global environment. The sample dataset should have subjects as rows and phenotypes/covariates as columns, with the sample names set as the rownames() rather than stored in a column. For this analysis the phenotype of interest is BMI, and the covariates must include age, sex, smoking status (never/past/current = 0/1/2), genetic-ancestry principal components (PCs) 1-10, six estimated blood cell proportions (Mono, Gran, NK, Bcell, CD8T, and CD4T), and any other covariates required for the specific cohort/study being analysed. The sample dataset should contain only unrelated individuals with no missing data. Samples with phenotype/covariate values that are outliers (more than four standard deviations above or below the mean) should be excluded.
The methylation dataset should have methylation sites (CpGs) as rows and subjects as columns, with M-values in the cells and the CpG IDs set as the rownames rather than stored in a column. Technical variation (site, batch, plate, well position, etc.) should be removed using the ComBat function from the sva R package, or another comparable surrogate-variable method, prior to analysis. Additionally, methylation sites that are outliers (more than four standard deviations above or below the mean), X/Y-chromosome sites, and sites near common polymorphisms should be removed before running the analysis.
Under each import line is a commented line that can be uncommented to move the sample-name/CpG-ID column to the rownames() if the datasets are not already structured this way (replace "rowname" with the name of the ID column in your file).
```{r Import Data}
# Load sample data (fread() returns a data.table; convert to a data.frame so
# the rownames used in the steps below are preserved)
pheno <- as.data.frame(fread("file_name.txt"))
# pheno <- pheno %>% remove_rownames() %>% column_to_rownames(var = "rowname")
# Load methylation data
mvals <- as.data.frame(fread("file_name.txt"))
# mvals <- mvals %>% remove_rownames() %>% column_to_rownames(var = "rowname")
```
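If the inputs have not already been cleaned as described above, the sketch below (not evaluated by default) illustrates the four-standard-deviation outlier exclusion for one phenotype column and a batch adjustment with ComBat. The column names bmi1c and batch are placeholders for this cohort's actual BMI and batch variables, and sva is not loaded in the Setup chunk, so adapt as needed.
```{r Preprocessing Sketch, eval=FALSE}
# Flag values more than four standard deviations above or below the mean
is_outlier <- function(x, n_sd = 4) {
  abs(x - mean(x, na.rm = TRUE)) > n_sd * sd(x, na.rm = TRUE)
}
# Drop samples whose BMI is an outlier ("bmi1c" is a placeholder column name)
pheno <- pheno[!is_outlier(pheno$bmi1c), ]
# Remove technical variation with ComBat from the sva package ("batch" is a
# placeholder for the cohort's batch variable); run this after the two
# datasets have been aligned in the Data checks step below
# mvals <- as.data.frame(sva::ComBat(dat = as.matrix(mvals), batch = pheno$batch))
```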
## Data checks
This step is to ensure that the methylation dataset only includes the subjects in the sample dataset and that the subjects are in the same order in both datasets.
```{r Data Checks}
# Subset methylation data
mvals <- mvals[ ,colnames(mvals) %in% rownames(pheno)]
# Order the sample and methylation data by sample name
pheno <- pheno[order(rownames(pheno)), ]
mvals <- mvals[ ,order(colnames(mvals)) ]
# Check that the order is the same: order.test will be TRUE if the IDs match exactly and FALSE otherwise.
# If it is FALSE, check that rownames(pheno) and colnames(mvals) have the same value type and that the
# import did not leave the IDs double-quoted (see the sketch after this chunk).
order.test <- identical(rownames(pheno), colnames(mvals))
order.test
```
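If order.test comes back FALSE, the sketch below (not evaluated by default) shows one common fix: coerce both sets of IDs to character, strip any stray double quotes left by the import, and re-check.
```{r ID Mismatch Sketch, eval=FALSE}
# Coerce both ID sets to character and remove stray double quotes
rownames(pheno) <- gsub('"', '', as.character(rownames(pheno)))
colnames(mvals) <- gsub('"', '', as.character(colnames(mvals)))
identical(rownames(pheno), colnames(mvals))
```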
## Chunk Data
In this step, the methylation dataset is divided into smaller subsets to reduce memory load and computation time in the analysis step.
```{r Chunk Data}
# Set parameters for creating chunks of n CpGs. This will produce a list of
# n-row subsets plus one subset holding the remainder not divisible by n.
n <- 1000 # number of CpGs per subset
nr <- as.numeric(nrow(mvals)) # total number of CpGs
# Vector of chunk indices: one index per CpG, n CpGs per chunk
chunk <- rep(seq_len(ceiling(nr/n)), each = n, length.out = nr)
# Chunk the methylation data and transpose each subset so samples are rows,
# the orientation required for the linear regressions below
mvals.chunked <- split(mvals, f = chunk)
mvals.chunked <- lapply(mvals.chunked, function(df){
  as.data.frame(t(df))
})
rm(mvals, n, nr, chunk)
gc()
```
## Run Linear Regressions
The following function is optimized to reduce the amount of memory used in the process. To reduce runtime you can do any of the following:

1. Remove the broom::tidy and subsequent steps from the run_ewas function and create a separate function for them, then use the output of run_ewas as the input to that new function.
2. Reduce the size of the dataframe subsets by changing the 'n' variable to a smaller number (~1000 should be sufficient).
3. Uncomment the first four lines of the code chunk and set the number of cores to use for parallel processing, then change the first map in the 'linreg' pipeline to future_map and add the .options argument as used in the second 'linreg' pipeline (the future_map_dfr call); a sketch of this parallel variant appears after the chunk below.
```{r Run Linear Regressions}
# # Set the parallel processing parameters
# future::plan(multisession)
# availableCores(methods = "mc.cores")
# options(mc.cores=2, seed = T)
# Function to perform linear regressions.
run_ewas <- function(df){
  # Regress each CpG's M-values on BMI and the covariates, then keep only the
  # BMI term from each model
  linreg <- df %>%
    dplyr::select(one_of(colnames(df))) %>%
    map(~ speedglm::speedlm(.x ~ bmi1c +
          age1c + gender1 + as.factor(cig1c) + V1 + V2
          + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + CD8T
          + CD4T + NK + Mono + Gran + Bcell, data = pheno)) %>%
    map(~ broom::tidy(.x, conf.int = TRUE)) %>%
    map(~ .x %>% dplyr::filter(term %in% c('bmi1c')))
  gc()
  # Combine the per-CpG results into one dataframe, keyed by CpG ID
  linreg <- linreg %>% future_map_dfr(~.x, .id = 'cpgid', .options = furrr_options(seed = TRUE))
  return(linreg)
}
# Loop through the list of methylation chunks, performing the linear regressions and saving intermediate results after each chunk
tic()
ewas.res <- list()
base::suppressMessages(for(i in 1:length(mvals.chunked)){
ewas.res[[i]] <- run_ewas(mvals.chunked[[i]])
save(ewas.res, file = "ewas_results.RData")
gc()
})
toc()
# Combine list of dataframe subsets into one large dataframe
ewas.res <- rbindlist(ewas.res)
ewas.res <- ewas.res %>% mutate(total = nrow(pheno)) # add a column for the total sample size
rm(pheno, mvals.chunked)
gc()
```
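As a sketch of option 3 above (not evaluated by default): with the parallel-processing lines at the top of the chunk uncommented, the first map in the 'linreg' pipeline is swapped for future_map with the same .options argument already used in the future_map_dfr call. The model terms are unchanged from run_ewas.
```{r Parallel run_ewas Sketch, eval=FALSE}
run_ewas_parallel <- function(df){
  # Same per-CpG regressions as run_ewas, but distributed across workers
  linreg <- df %>%
    dplyr::select(one_of(colnames(df))) %>%
    future_map(~ speedglm::speedlm(.x ~ bmi1c +
          age1c + gender1 + as.factor(cig1c) + V1 + V2
          + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + CD8T
          + CD4T + NK + Mono + Gran + Bcell, data = pheno),
        .options = furrr_options(seed = TRUE)) %>%
    map(~ broom::tidy(.x, conf.int = TRUE)) %>%
    map(~ .x %>% dplyr::filter(term %in% c('bmi1c')))
  gc()
  linreg <- linreg %>% future_map_dfr(~.x, .id = 'cpgid', .options = furrr_options(seed = TRUE))
  return(linreg)
}
```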
## Export Data
```{r Save Files}
# Save files as .csv for input to METAL for meta-analysis
fwrite(ewas.res, file = "study_ancestry_bmi_ewas.csv")
```