forked from cdiener/proliferation
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathprotocol.rmd
1095 lines (841 loc) · 37.4 KB
/
protocol.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Protocol S1"
author: "Christian Diener, Osbaldo Resendis-Antonio"
output:
html_document:
toc: true
toc_float: true
theme: paper
pdf_document:
latex_engine: lualatex
toc: true
number_sections: true
---
# Installation
This repository is structured to enable easy installation of the required
dependencies. This is enabled by the `prtools` R package that serves two
major purposes:
1. It includes auxiliary functions and tests that are used during the analysis
2. It depends on all other R packages that are used throughout the analysis. Thus,
installing the `prtools` package will also ensure that all other required
packages are installed.
We also provide a Docker container which has everything pre-installed and can
be used to run the analysis on a server in the cloud.
This repository also includes `<step>.R` files for all steps of the analysis
which you can use to run parts of the analysis in an automated manner. Additional
data file are also contained in this repository. In order to obtain a local
copy you can clone the repository with [git](https://git-scm.org).
```bash
git clone cdiener/proliferation
```
The recommended mode of usage is via the docker images since they guarantee
correct versions for dependencies and include all intermediate data.
## Locally
**Note that this will require a machine with 16+ GB of RAM.**
If you have docker installed you can use the steps provided in the section below
"With a cloud provider".
For a local installation you will need R (http://r-project.org), git
(http://git-scm.org) and Python 3 installed (http://python.org).
On Ubuntu or Debian Stretch these can be installed via
```bash
sudo apt-get install r-base r-base-dev python3 python3-pip git
```
Clone the repository and enter the folder:
```bash
git clone https://github.com/cdiener/proliferation && cd proliferation
```
For Debian Jessie we recommend updating pip on a per-user setting
in order to get a version of pip that is greater than 8.1.
```bash
sudo apt-get install r-base r-base-dev python3 python3-pip git
pip3 install -U pip
```
The python dependencies can be installed with
```bash
pip3 install lxml python-libsbml numpy scipy cobra
```
R dependencies can be installed from within R (type "R" in Terminal) with
```{r, eval=FALSE}
install.packages("devtools")
source("https://bioconductor.org/biocLite.R")
biocLite()
```
After that you can install `prtools` simply via
```{r, eval=FALSE}
devtools::install_github("cdiener/proliferation/prtools")
```
Loading the prtools package will then also load all additional packages
we will use:
```{r}
library(prtools)
```
If you want to download the raw analysis data you will also need the GDC
data transfer tool. Installation instructions can be [found here](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool). Please make
sure that the `gdc-client` executable is in your path. The docker image already
includes the tool.
## With a cloud provider
Using a cloud provider such as [Google Cloud](https://cloud.google.com/) or
[Amazon AWS](https://aws.amazon.com/) you can use the docker image.
1. Create a new virtual machine with more than 16 GB of RAM using the CoreOS
stable image.
2. Login to the machine using SSH as described by your cloud provider.
3. Get the docker image with
```bash
docker pull cdiener/proliferation
```
4. Run the docker image
```bash
docker run -d -p 8000:8787 cdiener/proliferation
```
5. Access the machine at http://your-ip:8000 where "your-ip" is the IP of
your VM or "localhost" when running docker on your own machine. You will
be prompted with for login information where the user and password are
"rstudio".
This will present you with an R studio interface where all additional dependencies
and intermediate data are available.
For a start click the "..." symbol in the file panel on the lower
right and enter "/data/proliferation". Now click on "protocol.rmd" to see or
run the protocol (by clicking "knit HTML") or use any of the *.R
files in the same directory.
# Getting the data
This section describes how to obtain the raw data. Alternatively it also shows
how to download intermediate data sets that cut down the analysis time
significantly.
## Obtain raw data (slow)
We will start by getting all the required data. You will need about 40 GB of disk
space for all the downloads. This step is the longest during this analysis.
First, we will download the HuEx 1.0 ST exon expression data for the NCI-60 cell
lines using the `GEOquery` package (about 9 GB).
```{r, eval=FALSE}
GEOquery::getGEOSuppFiles("GSE29682")
untar("GSE29682/GSE29682_RAW.tar", exdir="GSE29682")
```
We follow this by downloading all relevany TCGA data from the Genomic Data
Commons (GDC) as well (19 GB). For that you will need a file manifest generated
from the [GDC data portal](https://gdc-portal.nci.nih.gov/search/s).
Here, we use the [tcgar](https://github.com/cdiener/tcgar) package which we
wrote specifically for this analysis. We will save the data into the `GDC`
subfolder with another subfolder for each data type. File manifests for the
data used in this analysis are already provided in the `GDC` folder.
```{r, eval=FALSE}
get_data("GDC/manigest_tcga_rnaseq", "GDC/rnaseq")
get_data("GDC/manigest_tcga_huex", "GDC/huex")
get_data("GDC/manigest_tcga_clinical", "GDC/clinical")
```
Note that this will take a long time.
Finally, we will also get annotations that relate all the genes from the NCI-60
and TCGA data sets. Here, we will employ biomart to obtain mappings between the
probes on the HuEx microarrays and several Gene ID systems. This step is optional
since the generated `probemap.rds` file is contained in the repository as well.
```{r, eval=FALSE}
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
attrs <- c("ensembl_gene_id", "description", "external_gene_name", "ucsc",
"entrezgene", "affy_huex_1_0_st_v2")
probemap <- getBM(mart=ensembl, attributes=attrs)
probemap <- data.table(probemap)
names(probemap) <- c("ensgene", "description", "symbol", "ucsc", "entrez", "huex")
saveRDS(probemap, "probemap.rds")
```
## Obtain intermediate data (faster)
The following intermediate data files are available online:
1. NCI-60 gene expression data and proliferation rates - http://dx.doi.org/10.5281/zenodo.61980
2. TCGA data as compressed RDS file - http://dx.doi.org/10.5281/zenodo.61982
3. probe map as compressed RDS file - *Already in the Github repository*
# Preprocessing exon expression data
*Note: This entire section is optional if you downloaded the NCI-60 gene expression
file (#1). Running it will require the prior download of the NCI-60 data as described
in "Obtain raw data".*
## Normalization and summary
We will start by preparing the HuEx exon expression data for the NCI-60 data
set. Here, we will read all of the raw files, calculate the log expression values
and normalize the arrays with RMA. This will take about 16 GB of RAM and take a
while. If you have several cores available on your machine we recommend to use
them here. To use 6 cores during computations you may use the following:
```{r}
registerDoMC(6)
```
Finally, we will save this raw expression set
to a serialized format, so we do not have to repeat this step every time.
This will take about half an hour.
```{r, eval=FALSE}
if (file.exists("eset_raw.rds")) eset <- readRDS("eset_raw.rds") else {
celfiles <- list.celfiles(recursive=T, listGzipped=T)
raw_data <- read.celfiles(celfiles)
eset <- rma(raw_data, target="probeset")
rm(raw_data)
saveRDS(eset, file="eset_raw.rds")
}
```
The expression values we obtain this way are on the level of "probe sets",
sets of spots on the microarray. However, in order to obtain expression values
per gene we have to summarize those probe set expression values. For this we can
use `probemap.rds` mapping that we obtained earlier. First, we will prepare the
probe map to give us unique mappings between genes and probe sets. We will only
consider genes that map to a probe set that is present on our arrays and we will
also only use probe sets that map to a gene with a known ENSEMBL ID.
```{r, eval=FALSE}
probemap <- readRDS("probemap.rds")
probemap <- unique(probemap, by=c("ensgene", "huex"))
probemap[, huex := as.character(huex)]
setkey(probemap, huex)
eset <- eset[rownames(eset) %in% probemap$huex, ]
probemap <- probemap[huex %in% rownames(eset) & !is.na(ensgene)]
```
We now have a mapping of probe sets to genes and vice versa. We will use that
to summarize the log gene expression values into the geometric mean for each
gene. For that we can use the `eset_reduce` function from the `prtools` package
which does that rapidly (because it is implemented in C++).
```{r, eval=FALSE}
eset <- eset_reduce(eset, probemap$huex, probemap$ensgene)
```
## Joining with proliferation data
We now have the per-gene log expression values for each of the 178 microarrays
in our data set. In order to map those arrays to their respective cell line
we can use the annotations contained in `samples.csv`:
```{r}
samples <- fread("samples.csv")
head(samples)
```
So we see that there are 2-3 reptitions for each cell line. The actual growth
rates are contained in `growth_rates.csv`:
```{r}
gcs <- fread("growth_rates.csv")
head(gcs)
```
Now, we will reduce the individual repetitions for each cell line to its mean
log expression values and relate the cell lines in the microarray samples
to the ones in the growth rate data set.
```{r, eval=FALSE}
setkey(gcs, cell_line)
cell_lines <- intersect(gcs$cell_line, samples$cell_line)
eset_summ <- sapply(cell_lines, function(cl) rowMeans(exprs(eset)[,samples$cell_line==cl]))
colnames(eset_summ) <- cell_lines
```
Right, now proliferation is given in doubling time (in hours), so we will convert
those to proliferation rates and also run a quick check whether the proliferation
rates are ordered the same way as the microarray samples.
```{r, eval=FALSE}
rates <- gcs[cell_lines, log(2)/doubling_time]
names(rates) <- cell_lines
if (all(colnames(eset_summ) == names(rates))) cat("Ordering is the same :)")
```
## Exporting the regression problem
Finally, we will attach the proliferation rates as an aditional columns to the
expression values, thus, each row now denotes a cell line, and each column a
gene, where only the last column denotes the proliferation rates. This is
a common format for regression and classification problems and we will save
it in csv format, so that it can be used easily in other software.
```{r, eval=FALSE}
export <- data.table(t(eset_summ))
export[, "rates" := rates]
write.csv(export, "regprob.csv", row.names=F)
```
# Preparing TCGA data
## Reading the data
If you downloaded the intermediate data (#2) you can skip the following steps and
read it directly using:
```{r}
tcga <- readRDS("tcga.rds")
```
After downloading the raw TCGA data from GDC, we will start by reading the TCGA
data into RAM. The `tcgar` package does this efficiently. We will store the
results in a list with one sublist for each technology (this will take around 20
minutes depending on your server).
```{r, eval=FALSE}
library(tcgar)
tcga <- list(
rnaseq = read_rnaseq("GDC/manifest_tcga_rnaseq.tsv", "GDC/rnaseq", progress=F),
huex = read_huex("GDC/manifest_tcga_huex.tsv", "GDC/huex", progress=F),
clinical = read_clinical("GDC/manifest_tcga_clinical.tsv", "GDC/clinical", progress=F)
)
```
Again, we will save the output in a serialized file.
```{r, eval=FALSE}
saveRDS(tcga, "tcga.rds")
```
## Comparison to NCI-60 cell line data
In order to compare gene expression between the TCGA RNA-Seq data and NCI-60
microarrays we will need to create a data set that contains the mean expression
values for all genes that are contained in the TCGA *and* NCI-60 data. For that,
we begin by calculating the mean log expression values across all samples within
the TCGA and NCI-60 data (RNA-Seq abundances are measured in counts, so we have
to add a pseudo count in order to obtain the log). We begin by calculating the
means for the RNA-Seq and HuEx data contained in the TCGA data set.
```{r}
all_rnaseq <- log(rowMeans(tcga$rnaseq$counts)+1, 2)
all_rnaseq <- data.table(ensgene=tcga$rnaseq$features$ensgene,
tcga_rnaseq=all_rnaseq)
all_huex <- data.table(symbol=tcga$huex$features$symbol,
tcga_huex=rowMeans(tcga$huex$assay))
```
We also do the same for the NCI-60 HuEx data.
```{r, results="hide"}
nci60 <- fread("regprob.csv", header=T)
rates <- nci60$rates
nci60[, rates := NULL]
all_nci60 <- data.table(ensgene=colnames(nci60), nci60_huex=colMeans(nci60))
```
We will now join those data sets by mapping their respective IDs. `tcgar` already
comes with a table mapping the genes in TCGA between different IDs, `genemap`.
```{r}
join <- merge(as.data.table(tcgar::genemap), all_nci60, by="ensgene")
join <- merge(join, all_huex, by="symbol")
join <- merge(join, all_rnaseq, by="ensgene")
head(join)
```
Sometimes, those mapping between IDs are not unique, which makes it impossible to
identify unique gene expression values. We will remove those cases.
```{r}
dupes <- join$ensgene[duplicated(join$ensgene)]
print(sum(join$ensgene %in% dupes))
join <- join[!(join$ensgene %in% dupes), ]
saveRDS(join, "join.rds")
```
That gives us our joined gene expression data set.
## Identifying genes with conserved expression {.tabset .tabset-fade .tabset-pills}
In order to use gene expression data from the NCI-60 data set for predictions
on the TCGA data set, we first have to find a set of genes with very similar
expression values between cell lines and human samples. As described in the
manuscript, we consider gene expression to be conserved if the expression values
are correlated, meaning that the gene expression values between the two data sets
follow a linear relationship. We could estimate the slope and intercept directly
from TCGA RNA-Seq and NCI-60 HuEx data, however, a more stable approach is to
estimate both parameters independently from additional data we have.
---
### Estimating the intercept
The intercept describes a constant difference in log expression values between
the two data set. We can estimate is simply by using data obtained with the
same technology between the two data sets. Since we have HuEx expression data
from NCI-60 as well as the TCGA data set, the intercept can be calculated as
the difference in mean between the HuEx data sets.
```{r}
norm_int <- mean(join$nci60_huex) - mean(join$tcga_huex)
```
As we can see, using this correction both HuEx data sets relate well with
each other.
```{r}
ggplot(join, aes(x=tcga_huex, y=nci60_huex)) + geom_point(alpha=0.1, stroke=0) +
geom_abline(intercept=norm_int, size=3/4, col="dodgerblue") +
geom_abline(intercept=sqrt(2)+norm_int, size=3/4, col="dodgerblue", linetype="dashed") +
geom_abline(intercept=-sqrt(2)+norm_int, size=3/4, col="dodgerblue", linetype="dashed") +
xlab("TCGA HuEx") + ylab("NCI60 HuEx") + theme_bw()
```
Here, the dashed blue lines denote the area in which genes have a maximum distance
of 1 from the 1:1 relation denotes by the solid blue line.
### Estimating the slope
The slope of the linear relationship denotes a difference between technologies.
It can be obtained by comparing the two technologies, HuEx and RNA-Seq, within
the same data set.
```{r}
norm <- coef(glm(tcga_huex ~ tcga_rnaseq + 0, data=join))
```
Again, we see that the majority of genes follows this relationship. However,
there also some genes whose expression values are not comparable, even within
the TCGA data set.
```{r}
ggplot(join, aes(x=tcga_rnaseq, y=tcga_huex)) + geom_point(alpha=0.1, stroke=0) +
geom_abline(slope=norm, size=3/4, col="dodgerblue") +
xlab("TCGA RNA-seq") + ylab("TCGA HuEx") + theme_bw()
```
### Validation
Finally, we can validate the two parameters by using the TCGA RNA-Seq data and
the NCI-60 HuEx data.
```{r}
norm <- c(norm, norm_int)
saveRDS(norm, "norm_factor.rds")
ggplot(join, aes(x=tcga_rnaseq, y=nci60_huex)) + geom_point(alpha=0.1, stroke=0) +
geom_abline(slope=norm, intercept=norm_int, size=3/4, col="dodgerblue") +
geom_abline(slope=norm, intercept=sqrt(2)+norm_int, size=3/4,
col="dodgerblue", linetype="dashed") +
geom_abline(slope=norm, intercept=-sqrt(2)+norm_int, size=3/4,
col="dodgerblue", linetype="dashed") +
xlab("TCGA RNA-seq") + ylab("NCI60 HuEx") + theme_bw()
```
Using both parameters we see a good agreement between the data, however, there are
also genes that do not follow the linear relationship. Since, we are only interested
in genes with a conserved expression across the two data sets, we will only consider
genes in the further analysis that do not differ by a log fold change more than 1
between the two normalized data sets (area enclosed by dashed blue lines).
```{r}
cutoff <- 1
good <- join[(tcga_rnaseq*norm[1] + norm[2] - nci60_huex)^2 < cutoff &
(tcga_huex + norm[2] - join$nci60_huex)^2 < cutoff, ensgene]
cor.test(join[good, tcga_rnaseq], join[good, nci60_huex])
```
So we see that this cutoff leads to a good general agreement.
---
# Regression and prediction
We now have a way of normalizing the gene expression data between TCGA and NCI-60
and a criterion to select conserved genes only. Thus, we can now start with the
regression problem. We start by reading the regression problem again.
```{r}
rdata <- fread("regprob.csv", header=T)
```
But this time we will only select the genes that conserved (difference in
normalized log expression smaller than one).
```{r, results="hide"}
rates <- rdata$rates
rdata[, "rates" := NULL]
rdata <- as.matrix(rdata)[, good]
```
Which has dimensions
```{r}
dim(rdata)
```
We will now test a total of 4 different generalized linear models.
## Tested models {.tabset .tabset-fade .tabset-pills}
We will use the LASSO generalized linear models from the `glmnet` package.
We will also initialize two data frames that will contain all the predictions
and goodnes-of-fit metrics. Those will be calculated for each of the models
on the training set and leave-one-out cross validation.
```{r}
library(glmnet)
pred <- data.frame()
m <- data.frame()
# For leave-one-out cross validation
folds <- length(rates)
```
---
### First order model
In the first model gene expression values enter as linear variables into the
model. We will extract the predictions on the training set as well as
predictions from leave-one-out cross validation.
```{r}
mod1 <- cv.glmnet(rdata, rates, nfolds=folds, keep=T, parallel=T,
grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod1, rdata, s="lambda.min")[,1]
pred_test <- mod1$fit.preval[, which.min(mod1$cvm)]
```
We now append the predicted rates and performance measures to our data frames.
```{r}
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="1st"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="1st"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="1st"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="1st"))
```
### Second order model
In the second order model we will only consider interactions between gene expression values,
meaning products between two gene expression values. Because of the large number of
candidate genes it would be feasible to test all combinations. Instead we will
only consider interactions between those genes that non-zero coefficients in the
first order model.
```{r}
nonzero <- abs(coef(mod1, s="lambda.min")[-1]) > 0
data2 <- inter(rdata[,nonzero])
mod2 <- cv.glmnet(data2, rates, nfolds=folds, keep=T, parallel=T,
grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod2, data2, s="lambda.min")[,1]
pred_test <- mod2$fit.preval[, which.min(mod2$cvm)]
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="2nd"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="2nd"))
```
### First and second order model
In this model we will use the same interaction terms as before but also add all
the original gene expression values in order to see whether the model can be improved.
```{r}
data12 <- cbind(rdata[, nonzero], data2)
mod3 <- cv.glmnet(data12, rates, nfolds=folds, keep=T, parallel=T,
grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod3, data12, s="lambda.min")[,1]
pred_test <- mod3$fit.preval[, which.min(mod3$cvm)]
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="1st and 2nd"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="1st and 2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="1st and 2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="1st and 2nd"))
```
As we can see, adding the first order terms does not improve the second order
model.
### Regularized model by cutoff
Finally, we will try to improve the generalization of the model, by removing those
interactions that have only very influences on the model and may thus be related
to overfitting.
```{r}
cf <- as.numeric(coef(mod2, s="lambda.min"))[-1]
names(cf) <- rownames(coef(mod2))[-1]
nonzero <- abs(cf) > quantile(abs(cf[abs(cf) > 0]), 0.25)
data_red <- data2[, nonzero]
mod <- cv.glmnet(data_red, rates, nfolds=folds, keep=T,
grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod, data_red, s="lambda.min")[,1]
pred_test <- mod$fit.preval[, which.min(mod$cvm)]
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="2nd + cutoff"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="2nd + cutoff"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="2nd + cutoff"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="2nd + cutoff"))
print(sum(nonzero))
```
## Model selection
If we compare the goodness-of-fit measures we can see that the cutoff model
shows good good performance with an error of about 4% in predicting proliferation
rates.
```{r, results="asis", echo=FALSE}
knitr::kable(m, caption = "goodnes-of-fit measures")
```
This can also be observed visually by plotting the real proliferation rates
against the predicted ones (a perfect match would create a 1:1 line).
```{r, fig.width=10, fig.heigh=5}
ggplot(pred, aes(x=truth, y=pred, col=order)) + geom_abline() +
geom_point() + facet_grid(set ~ order) + theme_bw() +
xlab("measured proliferation rate [1/h]") +
ylab("predicted proliferation rate [1/h]") +
theme(legend.position="none")
```
We will also save a list of the gene interactions and coefficient values
for the final model.
```{r}
genes <- do.call(rbind, strsplit(colnames(data_red), "x"))
colnames(genes) <- c("gene1", "gene2")
write.csv(data.frame(genes, coef=cf[nonzero]), "best_interactions.csv", row.names=F)
```
## Prediction of proliferation rates
We now have our trained model and will continue predicting proliferation rates
for the samples from TCGA.
### Prediction for HuEx samples
The major problem we have is that in the HuEx data genes are identified by
symbols whereas our interaction terms are identied by ENSEMBL gene ids.
So we will begin by creating a mapping between the two using the gene map
from before and getting the symbols for our genes in the interaction terms.
```{r}
map <- genemap
setkey(map, ensgene)
symbs <- cbind(map[genes[,1], symbol], map[genes[,2], symbol])
```
With that we can get the expression values for those genes and normalize them
with the normalization factor we obtained earlier.
```{r}
huex_ex <- tcga$huex$assay[unique(as.vector(symbs)), ]
huex_ex <- huex_ex + norm[2]
```
We can now calculate the corresponding interaction terms and predict the
proliferation rates for all samples. Here, we will also remove internal controls
that do not map to any TCGA samples.
```{r}
huex_red <- t(huex_ex[symbs[,1], ] * huex_ex[symbs[,2], ])
colnames(huex_red) <- paste0(symbs[,1], "x", symbs[,2])
rates_huex <- predict(mod, huex_red, s="lambda.min")[,1]
controls <- is.na(tcga$huex$samples$tumor)
rates_huex <- rates_huex[!controls]
```
### Prediction for RNA-Seq samples
RNA-Seq samples are identified by ENSEMBL ids as well so we can advance straight
away to calculating the log expression values and applying the normalization.
```{r}
rna_ex <- tcga$rnaseq$counts[unique(as.vector(genes)), ]
rna_ex <- log(rna_ex+1, 2)
rna_ex <- rna_ex * norm[1] + norm[2]
```
With that we can calculate the interaction terms and predict the proliferation
rates.
```{r}
rna_red <- t(rna_ex[genes[,1], ] * rna_ex[genes[,2], ])
colnames(rna_red) <- paste0(genes[,1], "x", genes[,2])
rates_rna <- predict(mod, rna_red, s="lambda.min")[,1]
```
### Combining the data sets
Finally we will first create a prediction data set that combines all proliferation
rate predictions together with the sample annotations.
```{r}
pred <- data.table(
patient_barcode=c(tcga$rnaseq$samples$patient_barcode,
tcga$huex$samples$patient_barcode[!controls]),
panel=c(tcga$rnaseq$samples$panel, tcga$huex$samples$panel[!controls]),
rates=c(rates_rna, rates_huex),
tumor=c(tcga$rnaseq$samples$tumor, tcga$huex$samples$tumor[!controls])
)
```
This way we can easily check what percentage of proliferation rates was predicted
as negative...
```{r}
sum(pred$rates)/nrow(pred)
```
We can merge this data set with the clinical information for each patient now.
```{r}
comb <- merge(pred, tcga$clinical, by=c("patient_barcode", "panel"))
print(comb)
```
Finally we will save those data sets for later usage.
```{r}
saveRDS(comb, "combined.rds")
write.csv(pred, "pred_rates.csv", row.names=F)
```
# Proliferation rates and their association with clinical data
## General properties
Having the combined data set `comb` most of the following analyses are straight
forward. First we can visualize the overall distribution of proliferation rates
across the TCGA cancer panels.
We will start by ordering the panels by their mean proliferation rate,
```{r, results="hide"}
means <- pred[, list(val=median(rates)), by=panel]
setkey(pred, panel)
pred <- pred[means[order(val), panel]]
pred[, panel := factor(panel, levels=unique(panel))]
```
followed by plotting them with `ggplot2`.
```{r, fig.width=10, fig.height=4, results="hide"}
ggplot(pred, aes(x=panel, y=rates, color=tumor, shape=tumor)) +
geom_jitter(alpha=0.2, height=0, size=1) + theme_bw() + theme(axis.text.x =
element_text(angle = 45, vjust = 1, hjust=1), legend.position="none") +
xlab("") + ylab("proliferation rate [1/h]") +
scale_colour_manual(values=c("royalblue", "red3"))
# Return panels to its original state
pred[, panel := as.character(panel)]
```
Here red triangles denote cancer samples and blue circles samples from healthy
tissues.
We can also check how the proliferation rates from tumor and normal tissue samples
behave.
```{r, fig.width=3, fig.height=4}
ggplot(pred, aes(x=c("normal", "tumor")[tumor+1], y=rates, fill=tumor)) +
geom_boxplot() + xlab("") + ylab("proliferation rate [1/h]") + theme_bw() +
theme(legend.position="none")
```
Here, the fold-change between tumor and normal tissue is given by
```{r}
pred[(tumor), mean(rates)] / pred[!(tumor), mean(rates)]
```
And this difference can be tested with a wilcoxon ranked-sum test.
```{r}
wilcox.test(pred[!(tumor), rates], pred[(tumor), rates], conf.int=TRUE)
```
## Survival Analysis
First we will set up the times and censoring. The `comb` data set includes
the required information in the `days_to_death`, `days_to_contact` and `vital`
columns.
```{r}
days_per_year <- 365.25
comb <- comb[!is.na(vital) & tumor]
delta <- c(comb[vital=="Alive", days_to_contact/days_per_year],
comb[vital=="Dead", days_to_death/days_per_year])
status <- comb$vital == "Dead"
```
For visualization we will stratify the survival data into two groups. One
including the top 25% of proliferation rates and the other containing the
bottom 25%.
```{r}
prolif <- vector(length=nrow(comb))
prolif[comb$rates > quantile(comb$rates, .75)] <- "high"
prolif[comb$rates < quantile(comb$rates, .25)] <- "low"
prolif <- factor(prolif, levels=c("low", "high"))
```
We can now generate the survival fit and create the corresponding Kaplan-Meier
plot.
```{r}
surv <- Surv(delta, status)
fit <- survfit(surv ~ prolif)
plot(fit, col=c("blue", "red"), xlab="time [years]", ylab="survival", lwd=2)
```
However, the proliferation rate is a continuous variable and we would like to
evaluate its association to survival without creating artificial strata. For
this we can use a Cox proportional hazards model.
```{r}
coxm <- coxph(surv ~ comb$rates)
print(coxm)
```
We can use the parameter β to calculate the increase in risk when increasing the
proliferation rate by a fixed value, for instance 0.01:
```{r}
exp(coef(coxm) * 0.01)
```
So this would mean an increased risk of about 18%.
## Association with tumor stage
We can also study how the tumor stage relates to the predicted proliferation
rates. We will use the [TNM staging system here](http://www.cancer.gov/about-cancer/diagnosis-staging/staging).
The pathological tumor stages are already contained in `comb`, however they may
have several subtypes. So we will clean them up a bit to only leave the major
tumor stages such as "T1", "N2", "Stage III", etc.
```{r, results="hide"}
comb[, T := gsub("[a-e][0-9]*$", "", T)]
comb[, N := gsub("[^0-4NX]", "", N)]
comb[, M := gsub("[^0-4MX]", "", M)]
comb[stage %in% c("I/II NOS", "IS"), stage := NA]
comb[, stage := gsub("[A-C]*$", "", stage)]
```
This enables us to visualize the proliferation rates across the TNM staging system.
```{r, fig.width=10, fig.height=3}
x <- melt(comb[, .(rates, panel, T, N, M, stage)], id.vars=c("rates", "panel"))
ggplot(x, aes(x=value, y=rates, col=variable)) +
geom_violin(aes(fill=variable), scale="width", alpha=0.3, linetype=0) +
geom_boxplot(outlier.colour=NA, width=0.5) +
facet_wrap(~ variable, scales="free_x", nrow=1) + theme_bw() +
theme_bw() + theme(axis.text.x = element_text(angle = 45, vjust = 1,
hjust=1), legend.position="none", strip.text = element_blank()) +
ylab("proliferation rate [1/h]") + xlab("")
```
As we can see there is again a lot of heterogeneity going on. We can test
for associations by the non-parametric Kruskal-Wallis test.
```{r}
kw_tests <- x[, kruskal.test(rates, factor(value)), by=variable]
print(kw_tests)
```
# Flux predictions and metabolic liabilities
In order to predict fluxes based on the particular predicted proliferation rate
and cancer panel we need a specific metabolic model for the cancer that include
an objective for proliferation (growth). There are only few reconstructions
that fulfill that requirement. Here, we use reconstruction from the Metabolic
Atlas for 9 of the TCGA cancer panels which have been validated before
(http://dx.doi.org/10.1073/pnas.1319196111). The mapping between panels and
cancer models is described in `tissues.csv`:
```{r}
tissues <- fread("tissues.csv")
print(tissues)
```
For each cancer panel we will now use the corresponding model and solve the
linear programming problem posed by parsimonious FBA for every cancer sample
in the panel with a non-zero proliferation rate. Fluxes are denoted by $v_i$,
the stochiometric matrix of the model by $\mathbf{S}$, the proliferation objective
in the model by $v_p$ and the predicted proliferation rate by $r_p$. Models are
converted to the irreversible formulation beforehand.
$$\begin{aligned}
&\min_i \sum_i v_i\\
s.t.\quad& \mathbf{Sv} = 0\\
& v_p = r_p\\
& v_i \geq 0
\end{aligned}$$
For that we use [cobrapy](https://opencobra.github.io/cobrapy/) and warm start
strategy where a previous solution basis is recycled to yield faster convergence
of the solver. All of this is implemented in the `fluxes.py` script which will
use the predicted rates saved earlier.
```{r}
system2("python3", "fluxes.py")
```
Note that you may have to substitute `python3` with `python` depending on your
Python installation.
## Flux analysis
The resulting fluxes are now saved in `fluxes.csv`, for convenience this only
includes fluxes which were non-zero in at least one sample.
First we will create a mapping between patient barcodes and cancer panels.
```{r}
panels <- pred$panel
names(panels) <- pred$patient_barcode
panels <- sort(panels)
```
Now we can read the fluxes and conver them to a matrix.
```{r, results="hide"}
fluxes <- fread("fluxes.csv")
barcodes <- fluxes$V1
fluxes <- as.matrix(fluxes[, V1 := NULL])
rownames(fluxes) <- barcodes
```
For later analysis we would also like to group fluxes by their corresponding
metabolic pathways. The `fluxes.py` script also extracts that information from
the model and saves is as "subsystem" in the `flux_info.csv` file.
```{r}
info <- fread("flux_info.csv")
names(info) <- c("reaction", "subsystem")
head(info)
```
We can visulaize the resulting fluxes with a heatmap.
```{r, dev="png", fig.width=8, fig.height=6}
cols <- viridis::viridis(256)
panels <- panels[!duplicated(names(panels))]
fluxes <- fluxes[order(panels[rownames(fluxes)], decreasing=TRUE), ]
in_fluxes <- names(panels) %in% rownames(fluxes)
annrow <- data.frame(panel=panels[in_fluxes], row.names=names(panels)[in_fluxes])
anncolors <- scales::hue_pal()(9)
names(anncolors) <- levels(annrow$panel)
anncolors <- list(panel=anncolors)
s <- seq(-16, log(max(fluxes)+1e-16,2), length.out = 256)
pheatmap(fluxes, breaks=c(-1e-6,2^s), col=cols, show_rownames=F,
show_colnames=F, annotation_row=annrow, annotation_colors=anncolors,
cluster_rows=FALSE, border_color=NA)
```
In order to see major metabolic proerties of the studied panels we will also
visualize the fluxes across the three major pathways of central carbon
metabolism. We will start by transforming our flux matrix to a data.table that
can be passed to ggplot.
```{r}
dt <- as.data.table(fluxes)
dt$panel <- panels[rownames(fluxes)]
dt <- melt(dt, id.vars = "panel")
names(dt) <- c("panel", "reaction", "flux")
dt <- merge(dt, info, by = "reaction")
```
This will be followed by filtering the subsystem for Glycolysis,
Oxidative Phosphorylation, and TCA cycle and plotting the results.
```{r, dev="png", dpi=200, fig.width=7, fig.height=3}
pathways <- c("Glycolysis / Gluconeogenesis", "Oxidative phosphorylation",
"Tricarboxylic acid cycle and glyoxylate/dicarboxylate metabolism")
ggplot(dt[subsystem %in% pathways], aes(x = panel, y = flux, color = panel)) +
geom_jitter(height = 0, alpha = 0.05, size = 0.5) + facet_wrap(~ subsystem) +
theme_bw() + theme(axis.text.x = element_text(angle = 45, vjust = 1,
hjust = 1), legend.position = "none") + xlab("")
```
## Cancer Panel Specificity
Based on the predicted fluxes we can now calculate a specificity score $s^i_p$ for
each flux $v_i$ and panel p as the log-fold change of the flux within the panel
versus all other panels. Here $\mu^i_p$ denotes the mean value of the flux $v_i$
within the panel p and $\mu^i_o$ outside the panel p.
$$s^i_p = \log_2 \mu^i_p - \log_2 \mu^i_p$$