protocol.rmd

---
title: "Protocol S1"
author: "Christian Diener, Osbaldo Resendis-Antonio"
output:
    html_document:
        toc: true
        toc_float: true
        theme: paper
    pdf_document:
        latex_engine: lualatex
        toc: true
        number_sections: true
---

# Installation

This repository is structured to enable easy installation of the required
dependencies. This is enabled by the `prtools` R package that serves two
major purposes:

1. It includes auxiliary functions and tests that are used during the analysis
2. It depends on all other R packages that are used throughout the analysis. Thus,
installing the `prtools` package will also ensure that all other required
packages are installed.

We also provide a Docker container which has everything pre-installed and can
be used to run the analysis on a server in the cloud.

This repository also includes `<step>.R` files for all steps of the analysis
which you can use to run parts of the analysis in an automated manner. Additional
data file are also contained in this repository. In order to obtain a local
copy you can clone the repository with [git](https://git-scm.org).

```bash
git clone cdiener/proliferation
```

The recommended mode of usage is via the docker images since they guarantee
correct versions for dependencies and include all intermediate data.

## Locally

**Note that this will require a machine with 16+ GB of RAM.**

If you have docker installed you can use the steps provided in the section below
"With a cloud provider".

For a local installation you will need R (http://r-project.org), git
(http://git-scm.org) and Python 3 installed (http://python.org).

On Ubuntu or Debian Stretch these can be installed via

```bash
sudo apt-get install r-base r-base-dev python3 python3-pip git
```

Clone the repository and enter the folder:
```bash
git clone https://github.com/cdiener/proliferation && cd proliferation
```

For Debian Jessie we recommend updating pip on a per-user setting
in order to get a version of pip that is greater than 8.1.

```bash
sudo apt-get install r-base r-base-dev python3 python3-pip git
pip3 install -U pip
```

The python dependencies can be installed with

```bash
pip3 install lxml python-libsbml numpy scipy cobra
```

R dependencies can be installed from within R (type "R" in Terminal) with

```{r, eval=FALSE}
install.packages("devtools")
source("https://bioconductor.org/biocLite.R")
biocLite()
```

After that you can install `prtools` simply via

```{r, eval=FALSE}
devtools::install_github("cdiener/proliferation/prtools")
```

Loading the prtools package will then also load all additional packages
we will use:

```{r}
library(prtools)
```

If you want to download the raw analysis data you will also need the GDC
data transfer tool. Installation instructions can be [found here](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool). Please make
sure that the `gdc-client` executable is in your path. The docker image already
includes the tool.

## With a cloud provider

Using a cloud provider such as [Google Cloud](https://cloud.google.com/) or
[Amazon AWS](https://aws.amazon.com/) you can use the docker image.

1. Create a new virtual machine with more than 16 GB of RAM using the CoreOS
   stable image.
2. Login to the machine using SSH as described by your cloud provider.
3. Get the docker image with

```bash
docker pull cdiener/proliferation
```

4. Run the docker image

```bash
docker run -d -p 8000:8787 cdiener/proliferation
```

5. Access the machine at http://your-ip:8000 where "your-ip" is the IP of
   your VM or "localhost" when running docker on your own machine. You will
   be prompted with for login information where the user and password are
   "rstudio".

This will present you with an R studio interface where all additional dependencies
and intermediate data are available.

For a start click the "..." symbol in the file panel on the lower
right and enter "/data/proliferation". Now click on "protocol.rmd" to see or
run the protocol (by clicking "knit HTML") or use any of the *.R
files in the same directory.

# Getting the data

This section describes how to obtain the raw data. Alternatively it also shows
how to download intermediate data sets that cut down the analysis time
significantly.

## Obtain raw data (slow)

We will start by getting all the required data. You will need about 40 GB of disk
space for all the downloads. This step is the longest during this analysis.

First, we will download the HuEx 1.0 ST exon expression data for the NCI-60 cell
lines using the `GEOquery` package (about 9 GB).

```{r, eval=FALSE}
GEOquery::getGEOSuppFiles("GSE29682")
untar("GSE29682/GSE29682_RAW.tar", exdir="GSE29682")
```

We follow this by downloading all relevany TCGA data from the Genomic Data
Commons (GDC) as well (19 GB). For that you will need a file manifest generated
from the [GDC data portal](https://gdc-portal.nci.nih.gov/search/s).
Here, we use the [tcgar](https://github.com/cdiener/tcgar) package which we
wrote specifically for this analysis. We will save the data into the `GDC`
subfolder with another subfolder for each data type. File manifests for the
data used in this analysis are already provided in the `GDC` folder.

```{r, eval=FALSE}
get_data("GDC/manigest_tcga_rnaseq", "GDC/rnaseq")
get_data("GDC/manigest_tcga_huex", "GDC/huex")
get_data("GDC/manigest_tcga_clinical", "GDC/clinical")
```

Note that this will take a long time.

Finally, we will also get annotations that relate all the genes from the NCI-60
and TCGA data sets. Here, we will employ biomart to obtain mappings between the
probes on the HuEx microarrays and several Gene ID systems. This step is optional
since the generated `probemap.rds` file is contained in the repository as well.

```{r, eval=FALSE}
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
attrs <- c("ensembl_gene_id", "description", "external_gene_name", "ucsc",
    "entrezgene", "affy_huex_1_0_st_v2")
probemap <- getBM(mart=ensembl, attributes=attrs)
probemap <- data.table(probemap)
names(probemap) <- c("ensgene", "description", "symbol", "ucsc", "entrez", "huex")
saveRDS(probemap, "probemap.rds")
```

## Obtain intermediate data (faster)

The following intermediate data files are available online:

1. NCI-60 gene expression data and proliferation rates - http://dx.doi.org/10.5281/zenodo.61980
2. TCGA data as compressed RDS file - http://dx.doi.org/10.5281/zenodo.61982
3. probe map as compressed RDS file - *Already in the Github repository*

# Preprocessing exon expression data

*Note: This entire section is optional if you downloaded the NCI-60 gene expression
file (#1). Running it will require the prior download of the NCI-60 data as described
in "Obtain raw data".*

## Normalization and summary

We will start by preparing the HuEx exon expression data for the NCI-60 data
set. Here, we will read all of the raw files, calculate the log expression values
and normalize the arrays with RMA. This will take about 16 GB of RAM and take a
while. If you have several cores available on your machine we recommend to use
them here. To use 6 cores during computations you may use the following:

```{r}
registerDoMC(6)
```

Finally, we will save this raw expression set
to a serialized format, so we do not have to repeat this step every time.
This will take about half an hour.

```{r, eval=FALSE}
if (file.exists("eset_raw.rds")) eset <- readRDS("eset_raw.rds") else {
    celfiles <- list.celfiles(recursive=T, listGzipped=T)
    raw_data <- read.celfiles(celfiles)
    eset <- rma(raw_data, target="probeset")
    rm(raw_data)
    saveRDS(eset, file="eset_raw.rds")
}
```

The expression values we obtain this way are on the level of "probe sets",
sets of spots on the microarray. However, in order to obtain expression values
per gene we have to summarize those probe set expression values. For this we can
use `probemap.rds` mapping that we obtained earlier. First, we will prepare the
probe map to give us unique mappings between genes and probe sets. We will only
consider genes that map to a probe set that is present on our arrays and we will
also only use probe sets that map to a gene with a known ENSEMBL ID.

```{r, eval=FALSE}
probemap <- readRDS("probemap.rds")
probemap <- unique(probemap, by=c("ensgene", "huex"))
probemap[, huex := as.character(huex)]
setkey(probemap, huex)
eset <- eset[rownames(eset) %in% probemap$huex, ]
probemap <- probemap[huex %in% rownames(eset) & !is.na(ensgene)]
```

We now have a mapping of probe sets to genes and vice versa. We will use that
to summarize the log gene expression values into the geometric mean for each
gene. For that we can use the `eset_reduce` function from the `prtools` package
which does that rapidly (because it is implemented in C++).

```{r, eval=FALSE}
eset <- eset_reduce(eset, probemap$huex, probemap$ensgene)
```

## Joining with proliferation data

We now have the per-gene log expression values for each of the 178 microarrays
in our data set. In order to map those arrays to their respective cell line
we can use the annotations contained in `samples.csv`:

```{r}
samples <- fread("samples.csv")
head(samples)
```

So we see that there are 2-3 reptitions for each cell line. The actual growth
rates are contained in `growth_rates.csv`:

```{r}
gcs <- fread("growth_rates.csv")
head(gcs)
```

Now, we will reduce the individual repetitions for each cell line to its mean
log expression values and relate the cell lines in the microarray samples
to the ones in the growth rate data set.

```{r, eval=FALSE}
setkey(gcs, cell_line)
cell_lines <- intersect(gcs$cell_line, samples$cell_line)
eset_summ <- sapply(cell_lines, function(cl) rowMeans(exprs(eset)[,samples$cell_line==cl]))
colnames(eset_summ) <- cell_lines
```

Right, now proliferation is given in doubling time (in hours), so we will convert
those to proliferation rates and also run a quick check whether the proliferation
rates are ordered the same way as the microarray samples.

```{r, eval=FALSE}
rates <- gcs[cell_lines, log(2)/doubling_time]
names(rates) <- cell_lines
if (all(colnames(eset_summ) == names(rates))) cat("Ordering is the same :)")
```

## Exporting the regression problem

Finally, we will attach the proliferation rates as an aditional columns to the
expression values, thus, each row now denotes a cell line, and each column a
gene, where only the last column denotes the proliferation rates. This is
a common format for regression and classification problems and we will save
it in csv format, so that it can be used easily in other software.

```{r, eval=FALSE}
export <- data.table(t(eset_summ))
export[, "rates" := rates]
write.csv(export, "regprob.csv", row.names=F)
```

# Preparing TCGA data

## Reading the data

If you downloaded the intermediate data (#2) you can skip the following steps and
read it directly using:

```{r}
tcga <- readRDS("tcga.rds")
```

After downloading the raw TCGA data from GDC, we will start by reading the TCGA
data into RAM. The `tcgar` package does this efficiently. We will store the
results in a list with one sublist for each technology (this will take around 20
minutes depending on your server).

```{r, eval=FALSE}
library(tcgar)
tcga <- list(
    rnaseq = read_rnaseq("GDC/manifest_tcga_rnaseq.tsv", "GDC/rnaseq", progress=F),
    huex = read_huex("GDC/manifest_tcga_huex.tsv", "GDC/huex", progress=F),
    clinical = read_clinical("GDC/manifest_tcga_clinical.tsv", "GDC/clinical", progress=F)
)
```

Again, we will save the output in a serialized file.

```{r, eval=FALSE}
saveRDS(tcga, "tcga.rds")
```

## Comparison to NCI-60 cell line data

In order to compare gene expression between the TCGA RNA-Seq data and NCI-60
microarrays we will need to create a data set that contains the mean expression
values for all genes that are contained in the TCGA *and* NCI-60 data. For that,
we begin by calculating the mean log expression values across all samples within
the TCGA and NCI-60 data (RNA-Seq abundances are measured in counts, so we have
to add a pseudo count in order to obtain the log). We begin by calculating the
means for the RNA-Seq and HuEx data contained in the TCGA data set.

```{r}
all_rnaseq <- log(rowMeans(tcga$rnaseq$counts)+1, 2)
all_rnaseq <- data.table(ensgene=tcga$rnaseq$features$ensgene,
    tcga_rnaseq=all_rnaseq)
all_huex <- data.table(symbol=tcga$huex$features$symbol,
    tcga_huex=rowMeans(tcga$huex$assay))
```

We also do the same for the NCI-60 HuEx data.

```{r, results="hide"}
nci60 <- fread("regprob.csv", header=T)
rates <- nci60$rates
nci60[, rates := NULL]
all_nci60 <- data.table(ensgene=colnames(nci60), nci60_huex=colMeans(nci60))
```

We will now join those data sets by mapping their respective IDs. `tcgar` already
comes with a table mapping the genes in TCGA between different IDs, `genemap`.

```{r}
join <- merge(as.data.table(tcgar::genemap), all_nci60, by="ensgene")
join <- merge(join, all_huex, by="symbol")
join <- merge(join, all_rnaseq, by="ensgene")
head(join)
```

Sometimes, those mapping between IDs are not unique, which makes it impossible to
identify unique gene expression values. We will remove those cases.

```{r}
dupes <- join$ensgene[duplicated(join$ensgene)]
print(sum(join$ensgene %in% dupes))
join <- join[!(join$ensgene %in% dupes), ]
saveRDS(join, "join.rds")
```

That gives us our joined gene expression data set.

## Identifying genes with conserved expression {.tabset .tabset-fade .tabset-pills}

In order to use gene expression data from the NCI-60 data set for predictions
on the TCGA data set, we first have to find a set of genes with very similar
expression values between cell lines and human samples. As described in the
manuscript, we consider gene expression to be conserved if the expression values
are correlated, meaning that the gene expression values between the two data sets
follow a linear relationship. We could estimate the slope and intercept directly
from TCGA RNA-Seq and NCI-60 HuEx data, however, a more stable approach is to
estimate both parameters independently from additional data we have.

---

### Estimating the intercept

The intercept describes a constant difference in log expression values between
the two data set. We can estimate is simply by using data obtained with the
same technology between the two data sets. Since we have HuEx expression data
from NCI-60 as well as the TCGA data set, the intercept can be calculated as
the difference in mean between the HuEx data sets.

```{r}
norm_int <- mean(join$nci60_huex) - mean(join$tcga_huex)
```

As we can see, using this correction both HuEx data sets relate well with
each other.

```{r}
ggplot(join, aes(x=tcga_huex, y=nci60_huex)) + geom_point(alpha=0.1, stroke=0) +
    geom_abline(intercept=norm_int, size=3/4, col="dodgerblue") +
    geom_abline(intercept=sqrt(2)+norm_int, size=3/4, col="dodgerblue", linetype="dashed") +
    geom_abline(intercept=-sqrt(2)+norm_int, size=3/4, col="dodgerblue", linetype="dashed") +
    xlab("TCGA HuEx") + ylab("NCI60 HuEx") + theme_bw()
```

Here, the dashed blue lines denote the area in which genes have a maximum distance
of 1 from the 1:1 relation denotes by the solid blue line.

### Estimating the slope

The slope of the linear relationship denotes a difference between technologies.
It can be obtained by comparing the two technologies, HuEx and RNA-Seq, within
the same data set.

```{r}
norm <- coef(glm(tcga_huex ~ tcga_rnaseq + 0, data=join))
```

Again, we see that the majority of genes follows this relationship. However,
there also some genes whose expression values are not comparable, even within
the TCGA data set.

```{r}
ggplot(join, aes(x=tcga_rnaseq, y=tcga_huex)) + geom_point(alpha=0.1, stroke=0) +
    geom_abline(slope=norm, size=3/4, col="dodgerblue") +
    xlab("TCGA RNA-seq") + ylab("TCGA HuEx") + theme_bw()
```

### Validation

Finally, we can validate the two parameters by using the TCGA RNA-Seq data and
the NCI-60 HuEx data.

```{r}
norm <- c(norm, norm_int)
saveRDS(norm, "norm_factor.rds")
ggplot(join, aes(x=tcga_rnaseq, y=nci60_huex)) + geom_point(alpha=0.1, stroke=0) +
    geom_abline(slope=norm, intercept=norm_int, size=3/4, col="dodgerblue") +
    geom_abline(slope=norm, intercept=sqrt(2)+norm_int, size=3/4,
    col="dodgerblue", linetype="dashed") +
    geom_abline(slope=norm, intercept=-sqrt(2)+norm_int, size=3/4,
    col="dodgerblue", linetype="dashed") +
    xlab("TCGA RNA-seq") + ylab("NCI60 HuEx") + theme_bw()
```

Using both parameters we see a good agreement between the data, however, there are
also genes that do not follow the linear relationship. Since, we are only interested
in genes with a conserved expression across the two data sets, we will only consider
genes in the further analysis that do not differ by a log fold change more than 1
between the two normalized data sets (area enclosed by dashed blue lines).

```{r}
cutoff <- 1
good <- join[(tcga_rnaseq*norm[1] + norm[2] - nci60_huex)^2 < cutoff &
    (tcga_huex + norm[2] - join$nci60_huex)^2 < cutoff, ensgene]

cor.test(join[good, tcga_rnaseq], join[good, nci60_huex])
```

So we see that this cutoff leads to a good general agreement.

---

# Regression and prediction

We now have a way of normalizing the gene expression data between TCGA and NCI-60
and a criterion to select conserved genes only. Thus, we can now start with the
regression problem. We start by reading the regression problem again.

```{r}
rdata <- fread("regprob.csv", header=T)
```

But this time we will only select the genes that conserved (difference in
normalized log expression smaller than one).

```{r, results="hide"}
rates <- rdata$rates
rdata[, "rates" := NULL]
rdata <- as.matrix(rdata)[, good]
```

Which has dimensions

```{r}
dim(rdata)
```

We will now test a total of 4 different generalized linear models.

## Tested models {.tabset .tabset-fade .tabset-pills}

We will use the LASSO generalized linear models from the `glmnet` package.
We will also initialize two data frames that will contain all the predictions
and goodnes-of-fit metrics. Those will be calculated for each of the models
on the training set and leave-one-out cross validation.

```{r}
library(glmnet)

pred <- data.frame()
m <- data.frame()

# For leave-one-out cross validation
folds <- length(rates)
```
---

### First order model

In the first model gene expression values enter as linear variables into the
model. We will extract the predictions on the training set as well as
predictions from leave-one-out cross validation.

```{r}
mod1 <- cv.glmnet(rdata, rates, nfolds=folds, keep=T, parallel=T,
    grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod1, rdata, s="lambda.min")[,1]
pred_test <- mod1$fit.preval[, which.min(mod1$cvm)]
```

We now append the predicted rates and performance measures to our data frames.

```{r}
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="1st"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="1st"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="1st"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="1st"))
```

### Second order model

In the second order model we will only consider interactions between gene expression values,
meaning products between two gene expression values. Because of the large number of
candidate genes it would be feasible to test all combinations. Instead we will
only consider interactions between those genes that non-zero coefficients in the
first order model.

```{r}
nonzero <- abs(coef(mod1, s="lambda.min")[-1]) > 0
data2 <- inter(rdata[,nonzero])
mod2 <- cv.glmnet(data2, rates, nfolds=folds, keep=T, parallel=T,
    grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod2, data2, s="lambda.min")[,1]
pred_test <- mod2$fit.preval[, which.min(mod2$cvm)]
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="2nd"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="2nd"))
```

### First and second order model

In this model we will use the same interaction terms as before but also add all
the original gene expression values in order to see whether the model can be improved.

```{r}
data12 <- cbind(rdata[, nonzero], data2)
mod3 <- cv.glmnet(data12, rates, nfolds=folds, keep=T, parallel=T,
    grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod3, data12, s="lambda.min")[,1]
pred_test <- mod3$fit.preval[, which.min(mod3$cvm)]
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="1st and 2nd"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="1st and 2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="1st and 2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="1st and 2nd"))
```

As we can see, adding the first order terms does not improve the second order
model.

### Regularized model by cutoff

Finally, we will try to improve the generalization of the model, by removing those
interactions that have only very influences on the model and may thus be related
to overfitting.

```{r}
cf <- as.numeric(coef(mod2, s="lambda.min"))[-1]
names(cf) <- rownames(coef(mod2))[-1]
nonzero <- abs(cf) > quantile(abs(cf[abs(cf) > 0]), 0.25)
data_red <- data2[, nonzero]
mod <- cv.glmnet(data_red, rates, nfolds=folds, keep=T,
    grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod, data_red, s="lambda.min")[,1]
pred_test <- mod$fit.preval[, which.min(mod$cvm)]
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="2nd + cutoff"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="2nd + cutoff"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="2nd + cutoff"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="2nd + cutoff"))
print(sum(nonzero))
```

## Model selection

If we compare the goodness-of-fit measures we can see that the cutoff model
shows good good performance with an error of about 4% in predicting proliferation
rates.

```{r, results="asis", echo=FALSE}
knitr::kable(m, caption = "goodnes-of-fit measures")
```

This can also be observed visually by plotting the real proliferation rates
against the predicted ones (a perfect match would create a 1:1 line).

```{r, fig.width=10, fig.heigh=5}
ggplot(pred, aes(x=truth, y=pred, col=order)) + geom_abline() +
    geom_point() + facet_grid(set ~ order) + theme_bw() +
    xlab("measured proliferation rate [1/h]") +
    ylab("predicted proliferation rate [1/h]") +
    theme(legend.position="none")
```

We will also save a list of the gene interactions and coefficient values
for the final model.

```{r}
genes <- do.call(rbind, strsplit(colnames(data_red), "x"))
colnames(genes) <- c("gene1", "gene2")
write.csv(data.frame(genes, coef=cf[nonzero]), "best_interactions.csv", row.names=F)
```

## Prediction of proliferation rates

We now have our trained model and will continue predicting proliferation rates
for the samples from TCGA.

### Prediction for HuEx samples

The major problem we have is that in the HuEx data genes are identified by
symbols whereas our interaction terms are identied by ENSEMBL gene ids.
So we will begin by creating a mapping between the two using the gene map
from before and getting the symbols for our genes in the interaction terms.

```{r}
map <- genemap
setkey(map, ensgene)
symbs <- cbind(map[genes[,1], symbol], map[genes[,2], symbol])
```

With that we can get the expression values for those genes and normalize them
with the normalization factor we obtained earlier.

```{r}
huex_ex <- tcga$huex$assay[unique(as.vector(symbs)), ]
huex_ex <- huex_ex + norm[2]
```

We can now calculate the corresponding interaction terms and predict the
proliferation rates for all samples. Here, we will also remove internal controls
that do not map to any TCGA samples.

```{r}
huex_red <- t(huex_ex[symbs[,1], ] * huex_ex[symbs[,2], ])
colnames(huex_red) <- paste0(symbs[,1], "x", symbs[,2])
rates_huex <- predict(mod, huex_red, s="lambda.min")[,1]
controls <- is.na(tcga$huex$samples$tumor)
rates_huex <- rates_huex[!controls]
```

### Prediction for RNA-Seq samples

RNA-Seq samples are identified by ENSEMBL ids as well so we can advance straight
away to calculating the log expression values and applying the normalization.

```{r}
rna_ex <- tcga$rnaseq$counts[unique(as.vector(genes)), ]
rna_ex <- log(rna_ex+1, 2)
rna_ex <- rna_ex * norm[1] + norm[2]
```

With that we can calculate the interaction terms and predict the proliferation
rates.

```{r}
rna_red <- t(rna_ex[genes[,1], ] * rna_ex[genes[,2], ])
colnames(rna_red) <- paste0(genes[,1], "x", genes[,2])
rates_rna <- predict(mod, rna_red, s="lambda.min")[,1]
```

### Combining the data sets

Finally we will first create a prediction data set that combines all proliferation
rate predictions together with the sample annotations.

```{r}
pred <- data.table(
    patient_barcode=c(tcga$rnaseq$samples$patient_barcode,
        tcga$huex$samples$patient_barcode[!controls]),
    panel=c(tcga$rnaseq$samples$panel, tcga$huex$samples$panel[!controls]),
    rates=c(rates_rna, rates_huex),
    tumor=c(tcga$rnaseq$samples$tumor, tcga$huex$samples$tumor[!controls])
    )
```

This way we can easily check what percentage of proliferation rates was predicted
as negative...

```{r}
sum(pred$rates)/nrow(pred)
```

We can merge this data set with the clinical information for each patient now.

```{r}
comb <- merge(pred, tcga$clinical, by=c("patient_barcode", "panel"))
print(comb)
```

Finally we will save those data sets for later usage.

```{r}
saveRDS(comb, "combined.rds")
write.csv(pred, "pred_rates.csv", row.names=F)
```

# Proliferation rates and their association with clinical data

## General properties

Having the combined data set `comb` most of the following analyses are straight
forward. First we can visualize the overall distribution of proliferation rates
across the TCGA cancer panels.

We will start by ordering the panels by their mean proliferation rate,

```{r, results="hide"}
means <- pred[, list(val=median(rates)), by=panel]
setkey(pred, panel)
pred <- pred[means[order(val), panel]]
pred[, panel := factor(panel, levels=unique(panel))]
```

followed by plotting them with `ggplot2`.
```{r, fig.width=10, fig.height=4, results="hide"}
ggplot(pred, aes(x=panel, y=rates, color=tumor, shape=tumor)) +
    geom_jitter(alpha=0.2, height=0, size=1) + theme_bw() + theme(axis.text.x =
    element_text(angle = 45, vjust = 1, hjust=1), legend.position="none") +
    xlab("") + ylab("proliferation rate [1/h]") +
    scale_colour_manual(values=c("royalblue", "red3"))

# Return panels to its original state
pred[, panel := as.character(panel)]
```

Here red triangles denote cancer samples and blue circles samples from healthy
tissues.

We can also check how the proliferation rates from tumor and normal tissue samples
behave.

```{r, fig.width=3, fig.height=4}
ggplot(pred, aes(x=c("normal", "tumor")[tumor+1], y=rates, fill=tumor)) +
    geom_boxplot() + xlab("") + ylab("proliferation rate [1/h]") + theme_bw() +
    theme(legend.position="none")
```

Here, the fold-change between tumor and normal tissue is given by

```{r}
pred[(tumor), mean(rates)] / pred[!(tumor), mean(rates)]
```

And this difference can be tested with a wilcoxon ranked-sum test.

```{r}
wilcox.test(pred[!(tumor), rates], pred[(tumor), rates], conf.int=TRUE)
```

## Survival Analysis

First we will set up the times and censoring. The `comb` data set includes
the required information in the `days_to_death`, `days_to_contact` and `vital`
columns.

```{r}
days_per_year <- 365.25

comb <- comb[!is.na(vital) & tumor]
delta <- c(comb[vital=="Alive", days_to_contact/days_per_year],
    comb[vital=="Dead", days_to_death/days_per_year])
status <- comb$vital == "Dead"
```

For visualization we will stratify the survival data into two groups. One
including the top 25% of proliferation rates and the other containing the
bottom 25%.

```{r}
prolif <- vector(length=nrow(comb))
prolif[comb$rates > quantile(comb$rates, .75)] <- "high"
prolif[comb$rates < quantile(comb$rates, .25)] <- "low"
prolif <- factor(prolif, levels=c("low", "high"))
```

We can now generate the survival fit and create the corresponding Kaplan-Meier
plot.

```{r}
surv <- Surv(delta, status)
fit <- survfit(surv ~ prolif)
plot(fit, col=c("blue", "red"), xlab="time [years]", ylab="survival", lwd=2)
```

However, the proliferation rate is a continuous variable and we would like to
evaluate its association to survival without creating artificial strata. For
this we can use a Cox proportional hazards model.

```{r}
coxm <- coxph(surv ~ comb$rates)
print(coxm)
```

We can use the parameter β to calculate the increase in risk when increasing the
proliferation rate by a fixed value, for instance 0.01:

```{r}
exp(coef(coxm) * 0.01)
```

So this would mean an increased risk of about 18%.

## Association with tumor stage

We can also study how the tumor stage relates to the predicted proliferation
rates. We will use the [TNM staging system here](http://www.cancer.gov/about-cancer/diagnosis-staging/staging).
The pathological tumor stages are already contained in `comb`, however they may
have several subtypes. So we will clean them up a bit to only leave the major
tumor stages such as "T1", "N2", "Stage III", etc.

```{r, results="hide"}
comb[, T := gsub("[a-e][0-9]*$", "", T)]
comb[, N := gsub("[^0-4NX]", "", N)]
comb[, M := gsub("[^0-4MX]", "", M)]
comb[stage %in% c("I/II NOS", "IS"), stage := NA]
comb[, stage := gsub("[A-C]*$", "", stage)]
```

This enables us to visualize the proliferation rates across the TNM staging system.

```{r, fig.width=10, fig.height=3}
x <- melt(comb[, .(rates, panel, T, N, M, stage)], id.vars=c("rates", "panel"))
ggplot(x, aes(x=value, y=rates, col=variable)) +
    geom_violin(aes(fill=variable), scale="width", alpha=0.3, linetype=0) +
    geom_boxplot(outlier.colour=NA, width=0.5) +
    facet_wrap(~ variable, scales="free_x", nrow=1) + theme_bw() +
    theme_bw() + theme(axis.text.x = element_text(angle = 45, vjust = 1,
    hjust=1), legend.position="none", strip.text = element_blank()) +
    ylab("proliferation rate [1/h]") + xlab("")
```

As we can see there is again a lot of heterogeneity going on. We can test
for associations by the non-parametric Kruskal-Wallis test.

```{r}
kw_tests <- x[, kruskal.test(rates, factor(value)), by=variable]
print(kw_tests)
```

# Flux predictions and metabolic liabilities

In order to predict fluxes based on the particular predicted proliferation rate
and cancer panel we need a specific metabolic model for the cancer that include
an objective for proliferation (growth). There are only few reconstructions
that fulfill that requirement. Here, we use reconstruction from the Metabolic
Atlas for 9 of the TCGA cancer panels which have been validated before
(http://dx.doi.org/10.1073/pnas.1319196111). The mapping between panels and
cancer models is described in `tissues.csv`:

```{r}
tissues <- fread("tissues.csv")
print(tissues)
```

For each cancer panel we will now use the corresponding model and solve the
linear programming problem posed by parsimonious FBA for every cancer sample
in the panel with a non-zero proliferation rate. Fluxes are denoted by $v_i$,
the stochiometric matrix of the model by $\mathbf{S}$, the proliferation objective
in the model by $v_p$ and the predicted proliferation rate by $r_p$. Models are
converted to the irreversible formulation beforehand.

$$\begin{aligned}
&\min_i \sum_i v_i\\
s.t.\quad& \mathbf{Sv} = 0\\
& v_p = r_p\\
& v_i \geq 0
\end{aligned}$$

For that we use [cobrapy](https://opencobra.github.io/cobrapy/) and warm start
strategy where a previous solution basis is recycled to yield faster convergence
of the solver. All of this is implemented in the `fluxes.py` script which will
use the predicted rates saved earlier.

```{r}
system2("python3", "fluxes.py")
```

Note that you may have to substitute `python3` with `python` depending on your
Python installation.

## Flux analysis

The resulting fluxes are now saved in `fluxes.csv`, for convenience this only
includes fluxes which were non-zero in at least one sample.

First we will create a mapping between patient barcodes and cancer panels.

```{r}
panels <- pred$panel
names(panels) <- pred$patient_barcode
panels <- sort(panels)
```

Now we can read the fluxes and conver them to a matrix.

```{r, results="hide"}
fluxes <- fread("fluxes.csv")
barcodes <- fluxes$V1
fluxes <- as.matrix(fluxes[, V1 := NULL])
rownames(fluxes) <- barcodes
```

For later analysis we would also like to group fluxes by their corresponding
metabolic pathways. The `fluxes.py` script also extracts that information from
the model and saves is as "subsystem" in the `flux_info.csv` file.

```{r}
info <- fread("flux_info.csv")
names(info) <- c("reaction", "subsystem")
head(info)
```

We can visulaize the resulting fluxes with a heatmap.

```{r, dev="png", fig.width=8, fig.height=6}
cols <- viridis::viridis(256)
panels <- panels[!duplicated(names(panels))]
fluxes <- fluxes[order(panels[rownames(fluxes)], decreasing=TRUE), ]
in_fluxes <- names(panels) %in% rownames(fluxes)
annrow <- data.frame(panel=panels[in_fluxes], row.names=names(panels)[in_fluxes])
anncolors <- scales::hue_pal()(9)
names(anncolors) <- levels(annrow$panel)
anncolors <- list(panel=anncolors)

s <- seq(-16, log(max(fluxes)+1e-16,2), length.out = 256)
pheatmap(fluxes, breaks=c(-1e-6,2^s), col=cols, show_rownames=F,
    show_colnames=F, annotation_row=annrow, annotation_colors=anncolors,
    cluster_rows=FALSE, border_color=NA)
```

In order to see major metabolic proerties of the studied panels we will also
visualize the fluxes across the three major pathways of central carbon
metabolism. We will start by transforming our flux matrix to a data.table that
can be passed to ggplot.

```{r}
dt <- as.data.table(fluxes)
dt$panel <- panels[rownames(fluxes)]
dt <- melt(dt, id.vars = "panel")
names(dt) <- c("panel", "reaction", "flux")
dt <- merge(dt, info, by = "reaction")
```

This will be followed by filtering the subsystem for Glycolysis,
Oxidative Phosphorylation, and TCA cycle and plotting the results.

```{r, dev="png", dpi=200, fig.width=7, fig.height=3}
pathways <- c("Glycolysis / Gluconeogenesis", "Oxidative phosphorylation",
    "Tricarboxylic acid cycle and glyoxylate/dicarboxylate metabolism")
ggplot(dt[subsystem %in% pathways], aes(x = panel, y = flux, color = panel)) +
    geom_jitter(height = 0, alpha = 0.05, size = 0.5) + facet_wrap(~ subsystem) +
    theme_bw() + theme(axis.text.x = element_text(angle = 45, vjust = 1,
    hjust = 1), legend.position = "none") + xlab("")
```

## Cancer Panel Specificity

Based on the predicted fluxes we can now calculate a specificity score $s^i_p$ for
each flux $v_i$ and panel p as the log-fold change of the flux within the panel
versus all other panels. Here $\mu^i_p$ denotes the mean value of the flux $v_i$
within the panel p and $\mu^i_o$ outside the panel p.

$$s^i_p = \log_2 \mu^i_p - \log_2 \mu^i_p$$

A value of zero denotes homogeneous fluxes across all panels, a value larger than
zero fluxes that are up-regulated in a particular panel and a value smaller zero
fluxes that are down-regulated in a particular panel. The calculation is
implemented in the `prtools` package.

```{r}
lfcs <- panel_lfc(fluxes, panels, info[, list(subsystem)])
```

We will order those by their absolute value from largest to smallest.

```{r}
lfcs <- lfcs[order(-abs(lfc))]
lfcs
```

We can now visualize those changes across all metabolic pathways.

```{r, fig.width=10, fig.height=20, results="hide"}
lfcs[, panel := factor(panel)]
lab <- function(x) lapply(x, shorten, n=20)
lfc_plot <- ggplot(lfcs, aes(x=panel, y=lfc, col=panel)) +
    geom_hline(yintercept=0, linetype="dashed") +
    geom_boxplot(outlier.colour=NA) + geom_jitter(alpha=0.5) +
    theme_bw() + theme(axis.text.x=element_text(angle = 45, vjust = 1, hjust=1),
    legend.position="none") + facet_wrap(~subsystem, labeller=lab, nrow=12, ncol=6) +
    ylab("specificity score")

print(lfc_plot)
```

## Enrichment Analysis

Those are a lot of plot and we would like a more quantitative evaluation which
of the pathways are the most specific or or homogeneous ones. This can be obtained
from an enrichment analysis. Here we use a method similar to [GSEA](http://software.broadinstitute.org/gsea)
which calculates normalized enrichment scores (NES) and p-values for each pathway.

Here, a value larger than one denotes association with specific fluxes and a value
smaller than one association with homogeneous fluxes.

```{r, results="hide"}
pws <- lfcs$subsystem
enr <- sapply(unique(pws), NES, w=lfcs$lfc, pws=pws)
enr <- data.table(subsystem=colnames(enr), nes=enr[1,], p=enr[2,])
enr <- enr[order(nes)]
enr[, subsystem := factor(subsystem, levels=subsystem)]
```

For panel-specific pathways we now obtain the following:

```{r, fig.width=6, fig.height=8}
enr_plot <- ggplot(enr, aes(x=nes, y=subsystem, col=p)) +
    geom_vline(xintercept=1, linetype="dashed") +
    geom_point() + scale_colour_continuous(low="red", high="black") +
    theme_bw() + scale_y_discrete(labels=shorten) +
    theme(legend.position=c(0.2,0.8)) + xlab("enrichment score") + ylab("")

print(enr_plot %+% enr[nes > 1])
```

Here we see that there is a group of three pathways that set themselves apart
on the top. We can visualize only those again:

```{r, fig.width=7, fig.height=3}
best <- rev(enr$subsystem)[1:3]
print(lfc_plot %+% lfcs[subsystem %in% best])
```

Finally we can also visualize homogeneous pathways.

```{r, fig.width=6, fig.height=5}
print(enr_plot %+% enr[nes <= 1])
```

# Software versions

## R and packages

```{r}
sessionInfo()
```

## Python and packages

```{bash}
python3 -V
```

Only directly used packages are cobra and pandas

```{bash}
python3 -c "import cobra; print('cobra', cobra.__version__)"
python3 -c "import pandas; print('pandas', pandas.__version__)"
```