-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathRNASeq-Pasilla.Rmd
273 lines (172 loc) · 8.82 KB
/
RNASeq-Pasilla.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
---
title: "RNASeq-Pasilla"
author: "Daniel Scheuermann"
date: "2024-01-31"
output:
word_document: default
html_document: default
---
# Intro
This script and analysis is meant to mimic a study done in 2010 ([link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3032923/)). The study consisted of studying the Pasilla gene using RNA-seq data. The Pasilla gene was depleted before the total RNA was isolated to prepare single-end and paired-end libraries for the treated vs untreated samples. From there, the libraries were sequenced to obtain the RNA-seq data for each specific sample. We will be comparing the treated vs untreated samples in this script.
```{r initial_setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
# initial setup
library(DESeq2)
library(pheatmap)
library(dplyr)
library(RColorBrewer)
library(ggplot2)
library(ggrepel)
library(clusterProfiler)
library(goseq)
```
## Data Setup
Here you can see the data set directly split into the treated and untreated samples.
The two separate samples can be compared to show the effects that 'Pasilla' gene depletion has on gene expression.
The factors are single vs paired samples and treated vs untreated samples.
```{r data_setup}
count_data <- read.csv("data/count_matrix.csv", header = TRUE, row.names = 1)
# colnames(count_data)
# head(count_data)
# sample information
sample_info <- read.csv("data/design.csv", header = TRUE, row.names = 1)
# colnames(sample_info)
head(sample_info)
# setting factor levels
sample_info$Treatment <- factor(sample_info$Treatment)
sample_info$Sequencing <- factor(sample_info$Sequencing)
```
## DESeq2 Setup
In this section, we are just creating the DESeq2 object and specifying the treatment factors.
```{r DESeq2_setup}
# creating the deseq object
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = ~Sequencing + Treatment)
# treatment factor reference
dds$Treatment <- factor(dds$Treatment, level = c("untreated","treated"))
```
## Data Manipulation
We make the data easier to work with by removing genes who have a count less than 10. From there, we can perform a DESeq2 analysis to identify the differentially expressed genes. Note the top 10 values. We can tell they are differentially expressed by look at the pvalues.
```{r data_manip}
# filtering the genes
# we are looking at genes whose count is greater than 10
keep <- rowSums(counts(dds)>10) >= min(table(sample_info$Treatment))
dds <- dds[keep,]
# identifying diferentially expressed genes
dds <- DESeq(dds)
deseq_results <- results(dds)
# transfer the deseq results into a data frame
deseq_results <- as.data.frame(deseq_results)
# ordering
deseq_results <- deseq_results[order(deseq_results$pvalue),]
head(deseq_results)
# count(deseq_results)
```
## Looking at specific genes
By grabbing the results of specific genes from the data frame produced, we can clearly see the Pasilla gene is down-regulated by the RNA treatment due to it's low p-value and negative log2FoldChange value.
```{r data_queries}
# looking specifically at FBgn0039155
deseq_results["FBgn0039155",]
# testing whether or not the Pasilla gene is downregulated by the RNA treatment.
deseq_results["FBgn0261552",]
```
# Data Manipulation 2
After filtering the data and only selecting the differentially expressed genes, we can see that there's a total of 234 differentially expressed genes.
```{r data_manip2}
# filtering
# selecting the differentially expressed genes (alpha = 0.05)
filtered <- deseq_results %>% filter(deseq_results$padj < 0.05)
filtered <- filtered %>% filter(abs(filtered$log2FoldChange) > 1)
# head(filtered)
# number of differentially expressed genes
dim(filtered)
```
## Storing Results
Check the data folder for the data frames produce and normalized counts.
```{r queries2}
# save manipulated data in new csv files
write.csv(deseq_results, 'data/deseq_result.all.csv')
write.csv(filtered, 'data/deseq_result.filtered.csv')
normalized_counts <- counts(dds, normalized = TRUE)
head(normalized_counts)
write.csv(normalized_counts, 'data/normalized_counts.csv')
```
# Visualization (Dispersion and PCA)
There's a clear separation between treated and untreated samples. this demonstrated that there's a path in the data showing a difference in treated vs untreated data samples. The same is true for pairwise vs single samples.
```{r visualization}
# dispersion plot
plotDispEsts(dds)
# principal component analysis plot
# variance stabilization
vsd <- vst(dds, blind = FALSE)
# pca plot
plotPCA(vsd, intgroup = c("Sequencing", "Treatment"))
```
# Visualization (HeatMaps and Volcano)
Based on the heatmaps, it seems like GSM461180_treated_paired sample is similar to GSM461181_treated_pairs sample. Meanwhile, the rest of the samples seem to be relatively different.
```{r heatmaps}
# generate distance matrix
sample_dists <- dist(t(assay(vsd)))
sample_dists_matrix <- as.matrix(sample_dists)
# colnames(sample_dists_matrix)
# choose color scheme
colors <- colorRampPalette(rev(brewer.pal(9,"Blues")))(255)
# generated the heat map
pheatmap(sample_dists_matrix, clustering_distance_rows = sample_dists, clustering_distance_cols = sample_dists, color = colors)
# heatmap of top 10 log transformed normalized counts
# sort by the log transformed normalized counts and select first 10 values
top_hits <- deseq_results[order(deseq_results$padj),][1:10,]
top_hits <- row.names(top_hits)
# top_hits
write.csv(top_hits, "data/top_hits.csv")
# log transformation
rld <- rlog(dds, blind = FALSE)
# heatmap without clustering
# pheatmap(assay(rld)[top_hits,], cluster_rows = FALSE, show_rownames = TRUE, cluster_cols = FALSE)
# heatmap with clustering
# pheatmap(assay(rld)[top_hits,])
# adding annotations
annot_info <- as.data.frame(colData(dds)[,c("Sequencing","Treatment")])
# heatmap with annotations
pheatmap(assay(rld)[top_hits,], cluster_rows = FALSE, show_rownames = TRUE, cluster_cols = FALSE, annotation_col = annot_info)
# heatmap of z-scores
# calculating z-scores
cal_z_score <- function(x) {(x-mean(x)) / sd(x)}
z_scores <- t(apply(normalized_counts, 1, cal_z_score))
z_score_subset <- z_scores[top_hits,]
pheatmap(z_score_subset)
# MA plot
# plotMA(dds, ylim = c(-2,2))
# removing noise
resLFC <- lfcShrink(dds, coef = "Treatment_treated_vs_untreated", type = "apeglm")
plotMA(resLFC, ylim = c(-2,2))
# Volcano plot
# changed the resLFC to a data frame to make a plot with little noise
resLFC <- as.data.frame(resLFC)
# labeling genes (diferentially expressed vs otherwise)
resLFC$diffexpressed <- "NO"
resLFC$diffexpressed[resLFC$log2FoldChange > 0.1 & resLFC$padj < 0.05] <- "UP"
resLFC$diffexpressed[resLFC$log2FoldChange < 0.1 & resLFC$padj < 0.05] <- "DOWN"
resLFC$delabel <- NA
# volcano plot (note noise is removed)
ggplot(data = resLFC, aes(x = log2FoldChange, y = log10(pvalue), col = diffexpressed, label = delabel)) +
geom_point() +
theme_minimal() +
geom_text_repel() +
scale_color_manual(values = c("blue","black","red")) +
theme(text = element_text(size = 20))
```
## GO-Analysis
While done off-screen using Galaxy, a GO Analysis showed that 60 GO terms (0.50%) are over-represented and 7 (0.07%) under-represented and that. The Go terms are represented as such: over-represented GO terms, 50 BP, 5 CC and 5 MF and for under-represented, 5 BP, 2 CC and 0 MF.
## Kegg Pathway Analysis
When performing a Kegg pathways analysis, it was noted that around 2% of pathways were over-represented. The KEGG pathways over-represented are 01100 and 00010. From the Kegg database: 01100 corresponds to all metabolic pathways and 00010 to pathway for Glycolysis. It appears that no pathways are under-represented. You may find the pathway image in the data folder.
## Future Goals:
This project was very useful in understanding ways to filter, visualize, and understand differentially expressed genes. I feel the experience gained from using DESeq2 in the project will be invaluable, and performed a Kegg Pathway Analysis alongside the GO-Analysis was a good experience, despite them being a great source of strife. In the future, I wish to understand more about the individual genes I work with, as I felt a bit of a disconnect in matching the genes to what they actually do. For example, I seek to answer questions such as: How can these results be applied?
I also hope to perform similar analyses using UNIX or Python instead of R to gain more experience in different methods of analysis.
## Sources
https://academic.oup.com/bioinformatics/article/32/19/3047/2196507?login=true
https://www.youtube.com/playlist?list=PLe1-kjuYBZ05N8tWd2XVW67C4SJOJIdXD
https://www.youtube.com/watch?v=JPwdqdo_tRg
https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/goenrichment/tutorial.html
https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html#conclusion
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3032923/
https://www.genome.jp/kegg/