-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathgolanov_deseq2.Rmd
225 lines (175 loc) · 10.3 KB
/
golanov_deseq2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
---
title: Identifying Candidates for Master Regulators of Transcriptome Changes in Murine Hippocampi Following Subarachnoid Hemorrhage
bibliography: methodsEugene.bib
#runtime: shiny_prerendered
output:
html_document:
code_folding: hide
theme: space
toc: yes
toc_depth: 3
toc_float: no
BiocStyle::html_document2:
code_folding: hide
toc: yes
toc_float: yes
knitrBootstrap::bootstrap_document:
highlight.chooser: yes
theme.chooser: yes
pdf_document:
toc: yes
always_allow_html: yes
---
```{r setup, bootstrap.show.code = FALSE, results='hide', bootstrap.show.message=FALSE, warning=FALSE, cache=TRUE, echo=FALSE, comment=FALSE}
knitr::opts_chunk$set(bootstrap.show.code = FALSE, message=FALSE, warning=FALSE)
suppressMessages(library(QoRTs))
suppressMessages(library(goseq))
suppressMessages(library(reshape2))
suppressMessages(library(RColorBrewer))
suppressMessages(library(ggplot2))
suppressMessages(library(tidyr))
suppressMessages(library(plyr))
suppressMessages(library(kableExtra))
suppressMessages(library(magrittr))
suppressMessages(library(gtools))
suppressMessages(library(plotly))
suppressMessages(library(gage))
suppressMessages(library(openxlsx))
suppressMessages(library(pathview))
suppressMessages(library(DESeq2))
suppressMessages(library(edgeR))
suppressMessages(library(limma))
suppressMessages(library(openxlsx))
suppressMessages(library(genefilter))
suppressMessages(library(pheatmap))
suppressMessages(library(gplots))
suppressMessages(library(treemap))
suppressMessages(library(scales))
suppressMessages(library(Hmisc))
suppressMessages(library(knitr))
suppressMessages(library(annotate))
suppressMessages(library(data.table))
suppressMessages(library(ggrepel))
suppressMessages(library(ggpubr))
suppressMessages(library(gridExtra))
getOutputFormat <- function() {
output <- rmarkdown:::parse_yaml_front_matter(
readLines(knitr::current_input())
)$output
if (is.list(output)){
return(names(output)[1])
} else {
return(output[1])
}
}
if(getOutputFormat() == 'pdf_document') {
knitr::opts_chunk$set(bootstrap.show.code = FALSE, message=FALSE, warning=FALSE, echo=FALSE, results='hide', plots='all')
}
```
```{r counts, message=FALSE, warning=FALSE, cache=TRUE, echo=FALSE, comment=FALSE, context="data"}
## This is the code for actually reading in the count matrix
counts <- read.table(file = "genecounts.txt", header = TRUE, check.names=FALSE, row.names=1)
decoderFile <- "decoder.txt" # this needs to be set up to match the samples and conditions
decoder.data <- read.table(decoderFile,header=T,stringsAsFactors=F,sep="\t")
table(colnames(counts) == decoder.data$sample.ID)
counts <- counts[,c(decoder.data$sample.ID)]
table(colnames(counts) == decoder.data$sample.ID)
decoder.data$condition <- factor(decoder.data$condition)
```
# Samples
The following samples were part of this analysis:
```{r samples, message=FALSE, warning=FALSE, cache=TRUE, echo=FALSE, comment=FALSE, context="data"}
decoder_df <- decoder.data[,c(-2)]
kable(decoder_df, row.names=FALSE, padding = 0, longtable=TRUE) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
***
# Processing
The sequences were aligned to the mouse reference genome (mm9) using *STAR* [@star], a universal RNA-seq aligner. To improve accuracy of the mapping, the genome was created with a splice junction database based on UCSC mm9 annotation from Illumina's iGenomes. Sequences that mapped to more than one locus were excluded from downstream analysis, since they cannot be confidently assigned.
Uniquely mapped sequences were intersected with composite gene models from Gencode v28 basic annotation using *featureCounts* [@featurecounts], a tool for assigning sequence reads to genomic features. Composite gene models for each gene consisted of the union of exons of all transcript isoforms of that gene. Uniquely mapped reads that unambiguously overlapped with no more than one Gencode composite gene model were counted for that gene model; the remaining reads were discarded. The counts for each gene model correspond to gene expression values, and were used for subsequent analyses.
***
# PCA
Principal components analysis (PCA) was performed to visualize sample-to-sample distances and to test if the samples could be separated based on treatment.
The raw gene expression values were transformed using a variance stabilizing transformation (with the 'vst' function from the DESeq2).
Vst-transformed data becomes approximately homoskedastic (the variance is similar across different ranges of the mean values), which PCA works best with.
Vst-transformed can then be used directly for computing distances between samples and making PCA plots.
The top 500 genes showing the highest variance were selected, and the principal components were computed and plotted.
Below is a PCA plot of all of the samples which is colored by condition:
```{r pca, message=FALSE, warning=FALSE, cache=TRUE, echo=FALSE, comment=FALSE, fig.width=10, fig.height=5.5, context="data"}
deseq2.coldata <- data.frame(decoder.data, row.names = colnames(counts), stringsAsFactors=F)
deseq2.coldata$group <- factor(decoder.data$condition)
deseq2.cds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,colData = deseq2.coldata, design = ~group)
deseq2.cds <- estimateSizeFactors(deseq2.cds)
deseq2.vst <- DESeq2::vst(deseq2.cds, blind=TRUE)
ntop = 500
Pvars <- rowVars(assay(deseq2.vst))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))]
PCA <- prcomp(t(assay(deseq2.vst)[select, ]), scale = F)
percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)
dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2], sampleName = row.names(colData(deseq2.vst)),colData(deseq2.vst))
p <- qplot(PC1, PC2, data = dataGG, color =group,size=I(3), label=sampleName, main = "PC1 vs PC2, top 500 variable genes") + labs(x = paste0("PC1, VarExp:", round(percentVar[1],4)), y = paste0("PC2, VarExp:", round(percentVar[2],4))) + theme_bw() + theme(legend.position="right") + geom_label_repel(aes(label=sampleName))
p
```
# Differential expression analysis {.tabset}
Based on the available samples, the following questions were addressed:
- What's different between SAH and sham?
- What's different between SAH and naive?
- What's different between sham and naive?
Using `DESeq2`, we can address these questions using a Wald test to do all three pairwise tests independently.
```{r de_analysis_simple, message=FALSE, eval=TRUE, warning=FALSE, cache=TRUE, echo=FALSE, comment=FALSE}
decoder.data.sub <- subset(decoder.data) # decoder.data
cnts <- counts[,decoder.data.sub$sample.ID]
conds <- factor(make.names(decoder.data.sub$condition))
library(DESeq2)
deseq2.coldata <- data.frame(condition = conds, row.names = colnames(cnts), decoder.data.sub)
deseq2.dds <- DESeq2::DESeqDataSetFromMatrix(countData = cnts, colData = deseq2.coldata, design = ~ condition)
keep <- rowSums(counts(deseq2.dds)) >= 10
deseq2.dds <- deseq2.dds[keep,]
deseq2.dds <- DESeq(deseq2.dds)
#padj 0.10
deseq2.res <- results(deseq2.dds, contrast=c("condition","SAH", "sham"), alpha=0.10)
deseq2.res.sig.sah_vs_sham_0.10 <- as.data.frame(subset(deseq2.res , padj < 0.10))
deseq2.res.sig.sah_vs_sham.all_0.10<- as.data.frame(deseq2.res)
deseq2.res <- results(deseq2.dds, contrast=c("condition","SAH", "naive"), alpha=0.10)
deseq2.res.sig.sah_vs_naive_0.10 <- as.data.frame(subset(deseq2.res , padj < 0.10))
deseq2.res.sig.sah_vs_naive.all_0.10<- as.data.frame(deseq2.res)
deseq2.res <- results(deseq2.dds, contrast=c("condition","sham", "naive"), alpha=0.10)
deseq2.res.sig.sham_vs_naive_0.10 <- as.data.frame(subset(deseq2.res , padj < 0.10))
deseq2.res.sig.sham_vs_naive.all_0.10<- as.data.frame(deseq2.res)
```
Using the Wald test, the following genes were detected as differentially expressed:
```{r dge_table, message=FALSE, warning=FALSE, cache=TRUE, echo=FALSE, eval=T, comment=FALSE}
df_deg <- data.frame("padj < 0.10" = c(nrow(deseq2.res.sig.sah_vs_sham_0.10), nrow(deseq2.res.sig.sah_vs_naive_0.10), nrow(deseq2.res.sig.sham_vs_naive_0.10)),check.names=F, stringsAsFactors = F)
row.names(df_deg) <- c("SAH vs. sham", "SAH vs. naive", "sham vs. naive")
kable(df_deg, row.names=T) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
Considering the sham and naive samples are similiar (we only detect at most `r nrow(deseq2.res.sig.sham_vs_naive_0.10)` differentially expressed with a adjusted p threshold of 0.10 when comparing sham to naive`), we treated sham and naive samples as one and contrasting SAH to that combined group.
```{r de_analysis_combine_sham_naive, message=FALSE, eval=TRUE, warning=FALSE, cache=TRUE, echo=FALSE, comment=FALSE}
## treat sham and naive as one
decoder.data.sub <- subset(decoder.data) # decoder.data
cnts <- counts[,decoder.data.sub$sample.ID]
decoder.data.sub$condition <- as.character(decoder.data.sub$condition)
decoder.data.sub[decoder.data.sub$condition == "sham",]$condition <- "ctrl"
decoder.data.sub[decoder.data.sub$condition == "naive",]$condition <- "ctrl"
decoder.data.sub$condition <- factor(decoder.data.sub$condition, levels=c("ctrl", "SAH"))
conds <- factor(make.names(decoder.data.sub$condition), levels=c("ctrl", "SAH"))
library(DESeq2)
deseq2.coldata <- data.frame(condition = conds, row.names = colnames(cnts), decoder.data.sub)
deseq2.dds <- DESeq2::DESeqDataSetFromMatrix(countData = cnts, colData = deseq2.coldata, design = ~ condition)
keep <- rowSums(counts(deseq2.dds)) >= 10
deseq2.dds <- deseq2.dds[keep,]
deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds, contrast=c("condition","SAH", "ctrl"), alpha=0.10)
deseq2.res.sig.sah_vs_ctrl_0.10 <- as.data.frame(subset(deseq2.res , padj < 0.10))
deseq2.res.sig.sah_vs_ctrl.all_0.10 <- as.data.frame(deseq2.res)
```
Using this approach, the following genes were detected as differentially expressed:
```{r combined_sham_naive_deg_table, message=FALSE, warning=FALSE, cache=TRUE, echo=FALSE, eval=T, comment=FALSE}
df_deg <- data.frame("padj < 0.10" = c(nrow(deseq2.res.sig.sah_vs_ctrl_0.10)),check.names=F, stringsAsFactors = F)
row.names(df_deg) <- c("SAH vs. control (sham + naive)")
kable(df_deg, row.names=T) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
# Session Info
```{r session, message=FALSE, warning=FALSE, cache=TRUE,comment="",echo=FALSE, fig.width=10, fig.height=5.5, context="data"}
sessionInfo()
```
# References