forked from statOmics/SGA
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathheartMainInteraction.Rmd
496 lines (362 loc) · 13.4 KB
/
heartMainInteraction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
---
title: "Proteomics data analysis: heart"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
output:
html_document:
code_download: true
theme: cosmo
toc: true
toc_float: true
highlight: tango
number_sections: true
---
# Background
Researchers have assessed the proteome in different regions of the heart for 3 patients (identifiers 3, 4, and 8). For each patient they sampled the left atrium (LA), right atrium (RA), left ventricle (LV) and the right ventricle (RV). The data are a small subset of the public dataset PXD006675 on PRIDE.
Suppose that researchers are mainly interested in comparing the ventricular to the atrial proteome. Particularly, they would like to compare the left atrium to the left ventricle, the right atrium to the right ventricle, the average ventricular vs atrial proteome and if ventricular vs atrial proteome shifts differ between left and right heart region.
# Data
We first import the peptides.txt file. This is the file that contains your peptide-level intensities. For a MaxQuant search [6], this peptides.txt file can be found by default in the "path_to_raw_files/combined/txt/" folder from the MaxQuant output, with "path_to_raw_files" the folder where raw files were saved. In this tutorial, we will use a MaxQuant peptides file from MaxQuant that can be found in the data tree of the SGA2020 github repository https://github.com/statOmics/SGA2020/tree/data/quantification/heart .
To import the data we use the `QFeatures` package.
We generate the object peptideRawFile with the path to the peptideRaws.txt file.
Using the `grepEcols` function, we find the columns that contain the expression
data of the peptideRaws in the peptideRaws.txt file.
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
library(limma)
library(QFeatures)
library(msqrob2)
library(plotly)
peptidesFile <- "https://raw.githubusercontent.com/statOmics/PDA21/data/quantification/heart/peptides.txt"
ecols <- grep("Intensity\\.", names(read.delim(peptidesFile)))
pe <- readQFeatures(
table = peptidesFile,
fnames = 1,
ecol = ecols,
name = "peptideRaw", sep="\t")
pe
pe[["peptideRaw"]]
```
We will make use from data wrangling functionalities from the tidyverse package.
The %>% operator allows us to pipe the output of one function to the next function.
```{r}
colData(pe)$location <- substr(
colnames(pe[["peptideRaw"]]),
11,
11) %>%
unlist %>%
as.factor
colData(pe)$tissue <- substr(
colnames(pe[["peptideRaw"]]),
12,
12) %>%
unlist %>%
as.factor
colData(pe)$patient <- substr(
colnames(pe[["peptideRaw"]]),
13,
13) %>%
unlist %>%
as.factor
```
We calculate how many non zero intensities we have per peptide and this
will be useful for filtering.
```{r}
rowData(pe[["peptideRaw"]])$nNonZero <- rowSums(assay(pe[["peptideRaw"]]) > 0)
```
Peptides with zero intensities are missing peptides and should be represent
with a `NA` value rather than `0`.
```{r}
pe <- zeroIsNA(pe, "peptideRaw") # convert 0 to NA
```
## Data exploration
`r format(mean(is.na(assay(pe[["peptideRaw"]])))*100,digits=2)`% of all peptide
intensities are missing and for some peptides we do not even measure a signal
in any sample. The missingness is similar across samples.
# Preprocessing
This section preforms standard preprocessing for the peptide data. This
include log transformation, filtering and summarisation of the data.
## Log transform the data
```{r}
pe <- logTransform(pe, base = 2, i = "peptideRaw", name = "peptideLog")
limma::plotDensities(assay(pe[["peptideLog"]]))
```
## Filtering
### Handling overlapping protein groups
In our approach a peptide can map to multiple proteins, as long as there is
none of these proteins present in a smaller subgroup.
```{r}
pe <- filterFeatures(pe, ~ Proteins %in% smallestUniqueGroups(rowData(pe[["peptideLog"]])$Proteins))
```
### Remove reverse sequences (decoys) and contaminants
We now remove the contaminants, peptides that map to decoy sequences, and proteins
which were only identified by peptides with modifications.
First look to the names of the variables for the peptide features
```{r}
pe[["peptideLog"]] %>%
rowData %>%
names
```
No information on decoys.
```{r}
pe <- filterFeatures(pe,~ Potential.contaminant != "+")
```
### Drop peptides that were only identified in one sample
We keep peptides that were observed at last twice.
```{r}
pe <- filterFeatures(pe,~nNonZero >= 2)
nrow(pe[["peptideLog"]])
```
We keep `r nrow(pe[["peptideLog"]])` peptides after filtering.
## Normalize the data
```{r}
pe <- normalize(pe,
i = "peptideLog",
name = "peptideNorm",
method = "center.median")
```
## Explore normalized data
After normalisation the density curves for all samples are comparable.
```{r}
limma::plotDensities(assay(pe[["peptideNorm"]]))
```
This is more clearly seen is a boxplot.
```{r,}
boxplot(assay(pe[["peptideNorm"]]), col = palette()[-1],
main = "Peptide distribtutions after normalisation", ylab = "intensity")
```
We can visualize our data using a Multi Dimensional Scaling plot,
eg. as provided by the `limma` package.
```{r}
limma::plotMDS(assay(pe[["peptideNorm"]]),
col = colData(pe)$location:colData(pe)$tissue %>%
as.numeric,
labels = colData(pe) %>%
rownames %>%
substr(start = 11, stop = 13)
)
```
The first axis in the plot is showing the leading log fold changes
(differences on the log scale) between the samples.
## Summarization to protein level
We use robust summarization in aggregateFeatures. This is the default workflow of aggregateFeatures so you do not have to specifiy the argument `fun`.
However, because we compare methods we have included the `fun` argument to show the summarization method explicitely.
```{r,warning=FALSE}
pe <- aggregateFeatures(pe,
i = "peptideNorm",
fcol = "Proteins",
na.rm = TRUE,
name = "proteinRobust",
fun = MsCoreUtils::robustSummary)
```
```{r}
plotMDS(assay(pe[["proteinRobust"]]),
col = colData(pe)$location:colData(pe)$tissue %>%
as.numeric,
labels = colData(pe) %>%
rownames %>%
substr(start = 11, stop = 13)
)
```
# Data Analysis
## Estimation
We model the protein level expression values using `msqrob`.
By default `msqrob2` estimates the model parameters using robust regression.
```{r, warning=FALSE}
pe <- msqrob(
object = pe,
i = "proteinRobust",
formula = ~ location*tissue + patient)
```
## Inference
Explore Design
```{r}
library(ExploreModelMatrix)
VisualizeDesign(colData(pe),~ location*tissue + patient)$plotlist
```
```{r}
design <- model.matrix(~location*tissue + patient, data = colData(pe))
L <- makeContrast(
c(
"tissueV = 0",
"tissueV + locationR:tissueV = 0",
"tissueV + 0.5*locationR:tissueV = 0","locationR:tissueV = 0"),
parameterNames = colnames(design)
)
pe <- hypothesisTest(object = pe, i = "proteinRobust", contrast = L, overwrite=TRUE)
```
## Evaluate results contrast $\log_2 FC_{V-A}^L$
### Volcano-plot
```{r,warning=FALSE}
volcanoLeft <- ggplot(rowData(pe[["proteinRobust"]])$"tissueV",
aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal()
volcanoLeft
```
### Heatmap
We first select the names of the proteins that were declared signficant.
```{r}
sigNamesLeft <- rowData(pe[["proteinRobust"]])$tissueV %>%
rownames_to_column("proteinRobust") %>%
filter(adjPval<0.05) %>%
pull(proteinRobust)
heatmap(assay(pe[["proteinRobust"]])[sigNamesLeft, ])
```
There are `r length(sigNamesLeft)` proteins significantly differentially expressed at the 5% FDR level.
```{r}
rowData(pe[["proteinRobust"]])$tissueV %>%
cbind(.,rowData(pe[["proteinRobust"]])$Protein.names) %>%
na.exclude %>%
filter(adjPval<0.05) %>%
arrange(pval) %>%
knitr::kable(.)
```
## Evaluate results contrast $\log_2 FC_{V-A}^R$
### Volcano-plot
```{r,warning=FALSE}
volcanoRight <- ggplot(rowData(pe[["proteinRobust"]])$"tissueV + locationR:tissueV",
aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal()
volcanoRight
```
### Heatmap
We first select the names of the proteins that were declared signficant.
```{r}
sigNamesRight <- rowData(pe[["proteinRobust"]])$"tissueV + locationR:tissueV" %>%
rownames_to_column("proteinRobust") %>%
filter(adjPval<0.05) %>%
pull(proteinRobust)
heatmap(assay(pe[["proteinRobust"]])[sigNamesRight, ])
```
There are `r length(sigNamesRight)` proteins significantly differentially expressed at the 5% FDR level.
```{r}
rowData(pe[["proteinRobust"]])$"tissueV + locationR:tissueV" %>%
cbind(.,rowData(pe[["proteinRobust"]])$Protein.names) %>%
na.exclude %>%
filter(adjPval<0.05) %>%
arrange(pval) %>%
knitr::kable(.)
```
## Evaluate results average contrast $\log_2 FC_{V-A}$
### Volcano-plot
```{r,warning=FALSE}
volcanoAvg <- ggplot(rowData(pe[["proteinRobust"]])$"tissueV + 0.5 * locationR:tissueV",
aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal()
volcanoAvg
```
### Heatmap
We first select the names of the proteins that were declared signficant.
```{r}
sigNamesAvg <- rowData(pe[["proteinRobust"]])$"tissueV + 0.5 * locationR:tissueV" %>%
rownames_to_column("proteinRobust") %>%
filter(adjPval<0.05) %>%
pull(proteinRobust)
heatmap(assay(pe[["proteinRobust"]])[sigNamesAvg, ])
```
There are `r length(sigNamesAvg)` proteins significantly differentially expressed at the 5% FDR level.
```{r}
rowData(pe[["proteinRobust"]])$"tissueV + 0.5 * locationR:tissueV" %>%
cbind(.,rowData(pe[["proteinRobust"]])$Protein.names) %>%
na.exclude %>%
filter(adjPval<0.05) %>%
arrange(pval) %>%
knitr::kable(.)
```
## Interaction
### Volcano-plot
```{r,warning=FALSE}
volcanoInt <- ggplot(rowData(pe[["proteinRobust"]])$"locationR:tissueV",
aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal()
volcanoInt
```
### Heatmap
There were no genes significant at the 5% FDR level.
We return the top 25 genes.
```{r}
sigNamesInt <- rowData(pe[["proteinRobust"]])$"locationR:tissueV" %>%
rownames_to_column("proteinRobust") %>%
filter(adjPval<0.05) %>%
pull(proteinRobust)
hlp <- order((rowData(pe[["proteinRobust"]])$"locationR:tissueV")[,"adjPval"])[1:25]
heatmap(assay(pe[["proteinRobust"]])[hlp, ])
```
There are `r length(sigNamesInt)` proteins significantly differentially expressed at the 5% FDR level.
```{r}
rowData(pe[["proteinRobust"]])$"locationR:tissueV" %>%
cbind(.,rowData(pe[["proteinRobust"]])$Protein.names) %>%
na.exclude %>%
filter(adjPval<0.05) %>%
arrange(pval)
```
# Large difference in number of proteins that are returned
Note, that much more proteins are returned significant for average contrast ($\log_2 FC_{V-A}$) as compared to contrast for assessing the fold change between ventriculum and atrium left and right. The power for the average contrast is larger than for the contrast left or right because the log2 FC can be estimated with higher precision.
For none of the proteins the interaction was significant (change in log2 FC between ventriculum and atrium in the right vs the left heart region). The power for the interaction is typically low.
## Reason
Part of variance covariance matrix of model parameters due to design:
```{r}
X <- model.matrix(~ location*tissue + patient, colData(pe))
covarUnscaled <- solve(t(X) %*% X)
```
Variance of contrasts (diagonal elements) due to design
```{r}
varContrasts <- t(L)%*%covarUnscaled%*%L %>%
diag
varContrasts
sqrt(varContrasts)
```
So it is clear that the standard error of the log2 FC left and right is the same.
That of the average contrast is a factor $\sqrt{2}$ smaller!
Indeed, we use double the number of samples to estimate it!
```{r}
varContrasts[3]/varContrasts[2]
sqrt(varContrasts)[3]/sqrt(varContrasts)[2]
1/sqrt(2)
```
The standard error of the interaction is a factor $\sqrt{2}$ larger than that of the main effects!
```{r}
varContrasts[4]/varContrasts[2]
sqrt(varContrasts)[4]/sqrt(varContrasts)[2]
sqrt(2)
```
## Msqrob
This is not the case for the standard errors of protein 2???
```{r}
rowData(pe[["proteinRobust"]])$"tissueV"[2,]
rowData(pe[["proteinRobust"]])$"tissueV + locationR:tissueV"[2,]
rowData(pe[["proteinRobust"]])$"tissueV + 0.5 * locationR:tissueV"[2,]
rowData(pe[["proteinRobust"]])$"locationR:tissueV"[2,]
```
Because msqrob is using robust regression to assess DE!
```{r}
pe %>%
colData %>%
as_tibble %>%
mutate(w=getModel(rowData(pe[["proteinRobust"]])$msqrobModels[[2]])$w)
```
For protein 2 the samples at the left and right side have different weights!
### Part of standard error due to design:
```{r}
covUnscaledRobust <- solve(
t(X) %*%
diag(
getModel(rowData(pe[["proteinRobust"]])$msqrobModels[[2]])$w) %*% X)
varContrastsRobust <- t(L)%*%covUnscaledRobust%*%L %>%
diag
varContrastsRobust
sqrt(varContrastsRobust)
```
### Standard errors Contrasts
```{r}
sqrt(varContrastsRobust) * getSigmaPosterior(rowData(pe[["proteinRobust"]])$msqrobModels[[2]])
```
```{r}
rowData(pe[["proteinRobust"]])$"tissueV"[2,]
rowData(pe[["proteinRobust"]])$"tissueV + locationR:tissueV"[2,]
rowData(pe[["proteinRobust"]])$"tissueV + 0.5 * locationR:tissueV"[2,]
rowData(pe[["proteinRobust"]])$"locationR:tissueV"[2,]
```