---
title: "TCGA survival analysis"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: no
  html_document:
    theme: united
    toc: no
csl: styles.ref/genomebiology.csl
bibliography: data.TCGA/TCGA.bib
editor_options: 
  chunk_output_type: console
---

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
# Set up the environment
library(knitr)
opts_chunk$set(cache.path='cache/', fig.path='img/', cache=FALSE, tidy=TRUE, fig.keep='high', echo=FALSE, dpi=100, message=FALSE, comment=NA, warning=FALSE, results='markup', fig.width = 10, fig.height = 6) #out.width=700, 
library(pander)
panderOptions("table.split.table", Inf)
set.seed(1)
library(dplyr)
options(stringsAsFactors = FALSE)
```

# Settings

```{r settings}
# Settings
#Aggregation of the results generated by the [survival.Rmd](https://github.com/mdozmorov/TCGAsurvival/blob/master/survival.Rmd) file
# Select cancer type
cancer <- "BRCA"
# Select genes of interest
selected_genes   <- "MYBL2" # Search and replace the name of the selected_genes

n_max  <- 10 # Maximum number of rows to output in tables
p_val_cutoff   <- 0.05 # Regular p-value cutoff
p_adj_cutoff   <- 0.1 # FDR cutoff
```

# Global analysis of [survival.Rmd](https://github.com/mdozmorov/TCGAsurvival/blob/master/survival.Rmd)

## Analysis 2: Survival effect of `r selected_genes` in all cancers

The bar plot shows how significantly `r selected_genes` expression affects survival in each cancer. The wider (higher) the bar, the more significant the survival effect of `r selected_genes`. Cancer type abbreviations are listed at [http://www.liuzlab.org/TCGA2STAT/CancerDataChecklist.pdf](http://www.liuzlab.org/TCGA2STAT/CancerDataChecklist.pdf)

```{r echo=FALSE, out.width='60%'}
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis2/", selected_genes, "_all_TCGA_cancers.png"))
```

\pagebreak

The same data in table format. Legend:

- `Cancer`, `Cancer.Name` - cancer abbreviation and description
- `Gene` - gene name for which survival analysis was run
- `p.value` - significance of the survival effect
- `HR`, `HR_left`, `HR_right` - hazard ratio and the left/right (lower/upper) bounds of its confidence interval
- `Min.`, `X1st.Qu.`, `Median`, `Mean`, `X3rd.Qu.`, `Max.` - expression level of the gene in a corresponding cancer
- `Cutoff_type`, `Cutoff_value` - gene expression cutoff best discriminating survival

```{r}
cancers <- openxlsx::read.xlsx("data.TCGA/TCGA_cancers.xlsx")
mtx <- read.table(paste0(selected_genes, ".", cancer, ".Analysis2/global_stats.txt"), sep = "\t", header = TRUE)
mtx <- mtx[order(mtx$p.value), ]
mtx <- mtx %>% mutate(p.adjust = p.adjust(p.value, method = "BH"))
mtx <- left_join(mtx, cancers, by = c("Cancer" = "Acronym"))
mtx <- mtx[, c("Cancer", "Cancer.Name", "p.value", "p.adjust", "HR", "HR_left", "HR_right", "Min.", "X1st.Qu.", "Median", "Mean", "X3rd.Qu.", "Max.", "Cutoff_value")]
colnames(mtx) <- c("Cancer", "Cancer Name", "p-value", "FDR", "Hazard Ratio", "HR left", "HR right", "Min gene expression", "1st Qu", "Median", "Mean", "3rd Quantile", "Max gene expression", "Expression cutoff")
# Save to file
x <- list(mtx)
names(x) <- selected_genes
writexl::write_xlsx(x, paste0(selected_genes, ".", cancer, ".Analysis2/global_stats.xlsx"))
# Modify number of columns in mtx for pdf
mtx <- mtx [,c("Cancer", "Cancer Name", "p-value", "FDR", "Hazard Ratio", "Expression cutoff")]
# DT::datatable(mtx)
pander(mtx[1:min(nrow(mtx), n_max), ])

```

## Analysis 3: Survival effect of `r selected_genes` in specific clinical subtypes, all cancers

Top `r n_max` clinical subtypes across all cancers where `r selected_genes` significantly (adjusted p-value < `r p_adj_cutoff`) affects survival

```{r}
mtx <- read.table(paste0(selected_genes, ".", cancer, ".Analysis3/global_stats.txt"), sep = "\t", header = TRUE, fill = TRUE)
# Correct for multiple testing in individual cancers
# all_cancers = c("ACC", "BLCA", "BRCA", "CESC", "CHOL", "COAD", "COADREAD", "DLBC", "ESCA", "GBM", "GBMLGG", "HNSC", "KICH", "KIPAN", "KIRC", "KIRP", "LGG", "LIHC", "LUAD", "LUSC", "MESO", "OV", "PAAD", "PCPG", "PRAD", "READ", "SARC", "SKCM", "STAD", "STES", "TGCT", "THCA", "THYM", "UCEC", "UCS", "UVM") # "LAML", 
# Get all cancers from the matrix itself
# Cancer acronym of each row: the text before the first "-"
cancer_of_row <- sub("-.*$", "", mtx$Cancer)
all_cancers <- unique(cancer_of_row)
# BH adjustment within each cancer; ave() returns the adjusted values in the
# original row order, so they stay aligned even if a cancer's rows are not contiguous
mtx$p.adjust <- ave(mtx$p.value, cancer_of_row, FUN = function(p) p.adjust(p, method = "BH"))
mtx$Subgroup <- mtx$Cancer
mtx$Cancer <- sapply(mtx$Cancer, function(x) strsplit(x, "-")[[1]][1]) %>% unname 
mtx <- mtx[with(mtx, order(Cancer, p.value)), ]
mtx <- mtx[, c("Cancer", "Subgroup", "p.value", "p.adjust", "HR", "HR_left", "HR_right", "Min.", "X1st.Qu.", "Median", "Mean", "X3rd.Qu.", "Max.", "Cutoff_value")]
colnames(mtx) <- c("Cancer", "Subgroup", "p-value", "FDR", "Hazard Ratio", "HR left", "HR right", "Min gene expression", "1st Qu", "Median", "Mean", "3rd Quantile", "Max gene expression", "Expression cutoff")
rownames(mtx) <- NULL
#Creating subset of mtx for pdf table
mtx_pdf <- mtx [, c("Cancer", "Subgroup", "p-value", "FDR", "Hazard Ratio", "Expression cutoff")]
mtx_pdf[order(mtx_pdf$FDR), ] %>% head(., n = n_max) %>% pander
```
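
Adjusting p-values separately within each cancer can be illustrated on toy data. The sketch below is not part of the pipeline: the subgroup labels and p-values are made up, and only the `<CANCER>-<annotation>` naming pattern is taken from the tables in this report.

```r
# Toy data: made-up subgroup labels following the "<CANCER>-<annotation>" pattern
toy <- data.frame(
  Cancer  = c("BRCA-gender-female", "OV-race-white", "BRCA-race-white",
              "OV-race-asian", "BRCA-race-asian"),
  p.value = c(0.01, 0.04, 0.03, 0.20, 0.50)
)

# Cancer acronym = text before the first "-"
toy$Group <- sub("-.*$", "", toy$Cancer)

# Benjamini-Hochberg adjustment within each cancer;
# ave() returns the adjusted values in the original row order
toy$FDR <- ave(toy$p.value, toy$Group,
               FUN = function(p) p.adjust(p, method = "BH"))
toy
```

Because the correction is applied per cancer, the same raw p-value can receive a different FDR depending on how many subgroups were tested in that cancer.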

## Analysis 3: Cancer-specific top clinical subtypes where `r selected_genes` significantly affects survival, for each cancer

Only cancers with at least one clinical subtype where `r selected_genes` significantly (adjusted p-value < `r p_adj_cutoff`) affects survival are shown. The table is sorted by the "Number of significant" column.

```{r}
clinical_subgroups_stats <- list() # List to collect statistics about clinical subgroups
for (cancer_selected in all_cancers) {
  clinical_subgroups_stats <- c(clinical_subgroups_stats, list(c(
    cancer_selected, # Selected cancer
    sum(grepl(paste0("^", cancer_selected, "-"), mtx$Subgroup, ignore.case = FALSE)), # total clinical annotations
    sum(grepl(paste0("^", cancer_selected, "-"), mtx$Subgroup, ignore.case = FALSE) & mtx$FDR < p_adj_cutoff) # significant at FDR
  )))
}
clinical_subgroups_stats <- do.call(rbind.data.frame, clinical_subgroups_stats)
clinical_subgroups_stats <- clinical_subgroups_stats %>% mutate( Proportion = as.numeric(clinical_subgroups_stats[, 3]) / as.numeric(clinical_subgroups_stats[, 2]) )
colnames(clinical_subgroups_stats) <- c("Cancer", "Number of clinical subgroups", "Number of significant", "Proportion of significant")
clinical_subgroups_stats <- left_join(clinical_subgroups_stats, cancers, by = c("Cancer" = "Acronym"))
clinical_subgroups_stats <- clinical_subgroups_stats[order(as.numeric(clinical_subgroups_stats$`Number of significant`), decreasing = TRUE), ]
clinical_subgroups_stats <- clinical_subgroups_stats[as.numeric(clinical_subgroups_stats$`Number of significant`) > 0, ]
clinical_subgroups_stats$`Proportion of significant` <- round(clinical_subgroups_stats$`Proportion of significant`, digits = 2)
rownames(clinical_subgroups_stats) <- NULL
kable(clinical_subgroups_stats, caption = paste0("Summary of the number of clinical subtypes where the expression of ", selected_genes, " significantly affects survival"))
```

```{r}
# Save to file
x <- list(mtx, clinical_subgroups_stats)
names(x) <- c(selected_genes, paste0(selected_genes, ".summary"))
writexl::write_xlsx(x, paste0(selected_genes, ".", cancer, ".Analysis3/global_stats.xlsx"))
```


```{r results="asis"}
for (cancer_selected in all_cancers) {
  mtx_subset <- mtx[grepl(paste0("^", cancer_selected, "-"), mtx$Subgroup, ignore.case = FALSE) & mtx$FDR < p_adj_cutoff, , drop = FALSE] 
  rownames(mtx_subset) <- NULL
  mtx_subset <- mtx_subset[,c("Cancer","Subgroup","p-value","FDR","Hazard Ratio", "Expression cutoff")]
  if (nrow(mtx_subset) > 0) {
    mtx_subset %>% head(., n=n_max) %>% kable(., caption = cancers$Cancer.Name[cancers$Acronym == cancer_selected]) %>% print()
  }
}
```

## Analysis 3: Survival effect of `r selected_genes` in manually defined clinical subtypes, all cancers

```{r results="asis"}
# Add any subcategories to check if the gene affects survival in them in any cancer
clinical_subcategories <- c("race-black or african american") #, "radiationtherapy-no", "pathologyMstage-m0", "pathologyNstage-n0")
for (selected_clinical_subcategory in clinical_subcategories) {
  mtx_subset <- mtx[grepl(selected_clinical_subcategory, mtx$Subgroup) & mtx$FDR < p_adj_cutoff, ]
  rownames(mtx_subset) <- NULL
  mtx_subset <- mtx_subset[,c("Cancer","Subgroup","p-value","FDR","Hazard Ratio", "Expression cutoff")]
  kable(mtx_subset, caption = selected_clinical_subcategory) %>% print()
}
```

\pagebreak 

# Cancer-specific analysis

## Analysis 2: Survival effect of `r selected_genes` in `r cancer` cancer

```{r echo=FALSE, out.width='60%'}
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis2/", selected_genes, "_", cancer, ".png"))
```

## Analysis 3: Survival effect of `r selected_genes` in specific clinical subtypes, `r cancer` cancer

The table lists clinical subtypes where the expression of `r selected_genes` in `r cancer` most significantly affects survival. The table is sorted by increasing p-values, most significant on top. Description of clinical subtypes can be found at [https://gdc.cancer.gov/about-data/data-harmonization-and-generation/clinical-data-harmonization](https://gdc.cancer.gov/about-data/data-harmonization-and-generation/clinical-data-harmonization)

```{r}
global_stats <- read.table(paste0(selected_genes, ".", cancer, ".Analysis3/global_stats.txt"), sep = "\t", header = TRUE, fill = TRUE)
global_stats <- global_stats[order(global_stats$p.value), ]
# Anchor the match at "<cancer>-" so that, e.g., "LGG" does not also select "GBMLGG" subgroups
global_stats_subset <- global_stats[grepl(paste0("^", cancer, "-"), global_stats$Cancer), ]
rownames(global_stats_subset) <- NULL
#Creating subset of global_stats for pdf
global_stats_subset_pdf <- global_stats_subset [, c("Cancer", "Gene", "p.value", "HR", "Cutoff_value")]
global_stats_subset_pdf %>% head(., n = n_max) %>% pander
```

Top five corresponding survival plots

```{r echo=FALSE, out.width='60%'}
print(global_stats_subset$Cancer[1])
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis3/", selected_genes, "_", global_stats_subset$Cancer[1], ".png"))
```

```{r echo=FALSE, out.width='60%'}
print(global_stats_subset$Cancer[2])
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis3/", selected_genes, "_", global_stats_subset$Cancer[2], ".png"))
```

```{r echo=FALSE, out.width='60%'}
print(global_stats_subset$Cancer[3])
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis3/", selected_genes, "_", global_stats_subset$Cancer[3], ".png"))
```

```{r echo=FALSE, out.width='60%'}
print(global_stats_subset$Cancer[4])
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis3/", selected_genes, "_", global_stats_subset$Cancer[4], ".png"))
```

```{r echo=FALSE, out.width='60%'}
print(global_stats_subset$Cancer[5])
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis3/", selected_genes, "_", global_stats_subset$Cancer[5], ".png"))
```

```{r}
# The remaining analyses are available for BRCA and OV only
continue_analysis <- cancer %in% c("BRCA", "OV")
opts_chunk$set(eval = continue_analysis, include = continue_analysis)
```

## Analysis 5: Clinical-centric analysis. Selected cancer, selected clinical subcategory, `r selected_genes` expression differences across categories

Expression of `r selected_genes` in selected clinical subcategories. Valid only for BRCA or OV cancer types.

```{r out.height='250px'}
# 'PAM50Call_RNAseq' for BRCA, 'subtype' for OV
if (cancer == "BRCA"){
  knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis5/", cancer, "_", selected_genes, "_PAM50Call_RNAseq.png"))
}

if (cancer == "OV"){
  knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis5/", cancer, "_", selected_genes, "_subtype.png"))
}
```

```{r}
global_stats <- read.table(paste0(selected_genes, ".", cancer, ".Analysis5/global_stats.txt"), sep = "\t", header = TRUE, fill = TRUE)
global_stats <- global_stats[order(global_stats$p.value), c("Cancer", "Gene", "p.value", "HR")]
rownames(global_stats) <- NULL
pander(global_stats)
```

### ANOVA

What are the means of log2-expression per clinical subgroup?

```{r}
load(file = paste0(selected_genes, ".", cancer, ".Analysis5/mtx_to_plot.rda"))
tapply(mtx_to_plot$Gene, mtx_to_plot$Clinical, mean)
```

Is the expression of `r selected_genes` significantly different across clinical subgroups? A significant "Pr(>F)" value suggests "Yes".

```{r}
# ANOVA
groupdiff <- lm(mtx_to_plot$Gene ~ mtx_to_plot$Clinical)
# summary(groupdiff)
anova(groupdiff)
```
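
How to read the ANOVA table can be shown on simulated data. This is a minimal sketch, assuming three made-up subgroups `A`, `B`, `C`, of which only `C` has a shifted mean; the column names `Gene` and `Clinical` mirror those of `mtx_to_plot`.

```r
set.seed(1)
# Simulated log2 expression for three hypothetical clinical subgroups
toy <- data.frame(
  Clinical = rep(c("A", "B", "C"), each = 20),
  Gene     = c(rnorm(20, mean = 8), rnorm(20, mean = 8), rnorm(20, mean = 10))
)

# One-way ANOVA: does mean expression differ between subgroups?
fit <- lm(Gene ~ Clinical, data = toy)
tab <- anova(fit)
tab

# The first row of "Pr(>F)" is the p-value of the overall subgroup effect
p_f <- tab[["Pr(>F)"]][1]
p_f < 0.05
```

A small "Pr(>F)" only says that at least one subgroup differs. It does not say which one, which is what the pairwise test in the next section addresses.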

### Tukey HSD (Honest Significant Difference) test
Which pairs of clinical categories differ significantly? A small "p adj" value and a confidence interval that does not cross 0 suggest a significant difference in `r selected_genes` expression between the two subgroups in the corresponding pairwise comparison.

```{r}
# Tukey HSD
a1 <- aov(mtx_to_plot$Gene ~ mtx_to_plot$Clinical)
# summary(a1)
posthoc <- TukeyHSD(x=a1, 'mtx_to_plot$Clinical', conf.level=0.95)
posthoc <- posthoc$`mtx_to_plot$Clinical`
posthoc[order(posthoc[, "p adj"]), ] %>% pander()

# Differences in means plot
# par(las = 2)
# par(mar = c(5, 8, 5, 8))
# plot(posthoc)
```
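
The relationship between "p adj" and the confidence interval can be checked on simulated data. This is a sketch under the same assumptions as a toy ANOVA: three made-up subgroups, with only `C` shifted.

```r
set.seed(1)
# Simulated log2 expression for three hypothetical clinical subgroups
toy <- data.frame(
  Clinical = factor(rep(c("A", "B", "C"), each = 20)),
  Gene     = c(rnorm(20, mean = 8), rnorm(20, mean = 8), rnorm(20, mean = 10))
)

# Pairwise comparisons with family-wise 95% confidence intervals
a_toy <- aov(Gene ~ Clinical, data = toy)
ph <- TukeyHSD(a_toy, "Clinical", conf.level = 0.95)$Clinical
ph <- ph[order(ph[, "p adj"]), ]

# A pair differs at the 5% level exactly when its interval [lwr, upr] excludes 0
excludes_zero <- ph[, "lwr"] > 0 | ph[, "upr"] < 0
cbind(ph, excludes_zero)
```

Each row is named after the pair being compared (for example `C-A`), and `diff` is the difference in mean expression between the two subgroups.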

Is there a survival difference between clinical subgroups?
```{r}
global_stats <- read.table(paste0(selected_genes, ".", cancer, ".Analysis5/global_stats.txt"), sep = "\t", header = TRUE, fill = TRUE)
global_stats <- global_stats[order(global_stats$p.value), c("Cancer", "Gene", "p.value", "HR")]
rownames(global_stats) <- NULL
pander(global_stats)
```

\pagebreak

# Methods

## Survival analysis of gene expression data from TCGA

Gene expression data summarized as RSEM values were obtained using the `TCGA2STAT` R package v 1.2, along with the corresponding clinical annotations. Data for each of the 34 cancers were obtained separately. The data were log2-transformed and analyzed using Kaplan-Meier curves and Cox proportional hazards models. Each gene of interest was analyzed for its effect on survival by separating patients into high/low expression subgroups. A modified approach from [@Mihaly:2013aa] was used to estimate the best gene expression cutoff that separates high/low expression subgroups with differential survival.
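
The general idea of a cutoff scan can be sketched with the `survival` package on simulated data. This is an illustration only, not the implementation used for this report: the scanning range (lower to upper quartile), the selection by smallest log-rank p-value, and all variable names are assumptions.

```r
library(survival)

set.seed(1)
n <- 200
expr <- rnorm(n, mean = 8)  # simulated log2 expression
# Simulated follow-up: higher expression means higher hazard (demo assumption)
os_time  <- rexp(n, rate = 0.05 * exp(0.5 * (expr - 8)))
os_event <- rbinom(n, 1, 0.7)  # 1 = event observed, 0 = censored

# Candidate cutoffs: observed values between the lower and upper quartiles
in_range   <- expr >= quantile(expr, 0.25) & expr <= quantile(expr, 0.75)
candidates <- sort(unique(expr[in_range]))

# Log-rank p-value of the high/low split at each candidate cutoff
p_logrank <- sapply(candidates, function(cut) {
  high <- expr > cut
  lr <- survdiff(Surv(os_time, os_event) ~ high)
  pchisq(lr$chisq, df = 1, lower.tail = FALSE)
})
best_cut <- candidates[which.min(p_logrank)]

# Cox model at the selected cutoff: hazard ratio with its 95% confidence interval
group <- factor(ifelse(expr > best_cut, "high", "low"), levels = c("low", "high"))
fit <- coxph(Surv(os_time, os_event) ~ group)
hr <- summary(fit)$conf.int[1, c("exp(coef)", "lower .95", "upper .95")]
hr
```

Because the cutoff is chosen to minimize the p-value, that p-value is optimistic and is best read together with the multiple-testing-adjusted values.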

We took advantage of the availability of clinical annotations. To determine whether the expression of a gene of interest affects survival in any specific clinical subgroup, subsets of patients annotated with specific clinical annotations were selected (e.g., “males” or “females” in the “gender” clinical annotation). Subgroups with fewer than 40 patients were not considered.
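
Selecting subgroups by size can be sketched as follows. The clinical table, its column names, and the `min_patients` variable are hypothetical; only the threshold of 40 patients comes from the text above.

```r
# Hypothetical clinical table: one row per patient
clin <- data.frame(
  patient = sprintf("P%03d", 1:150),
  gender  = rep(c("male", "female"), times = c(30, 120)),
  stage   = rep(c("stage i", "stage ii", "stage iv"), times = c(80, 55, 15))
)
min_patients <- 40

# For every annotation column, keep the categories with enough patients
# and collect the patient IDs that belong to each of them
subgroups <- lapply(c("gender", "stage"), function(col) {
  counts <- table(clin[[col]])
  keep   <- names(counts)[counts >= min_patients]
  setNames(
    lapply(keep, function(lv) clin$patient[clin[[col]] == lv]),
    paste(col, keep, sep = "-")
  )
})
subgroups <- do.call(c, subgroups)

# Number of patients in each retained subgroup
lengths(subgroups)
```

Here `gender-male` (30 patients) and `stage-stage iv` (15 patients) fall below the threshold and are dropped, so survival analysis would be run on the three remaining subgroups.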

## Differential expression analysis

Samples in the selected cancer cohort were sorted by expression of the selected genes. Differentially expressed genes were detected between samples above the 75th percentile of the expression gradient and samples below the 25th percentile using the `limma` v 3.32.6 R package [@Ritchie:2015aa; @Smyth:2004aa]. P-values were corrected for multiple testing using the False Discovery Rate (FDR) method [@Benjamini:1995aa]. Genes differentially expressed at FDR < 0.01 were selected for further analysis.
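
The sample selection step can be sketched in base R on simulated data. The sample names and expression values are made up, and the model fit itself is not shown; the resulting design matrix is the kind of input that would be passed on to a `limma` fit (for example `lmFit` followed by `eBayes`).

```r
set.seed(1)
# Simulated log2 expression of the selected gene across 100 samples
expr <- setNames(rnorm(100, mean = 8), sprintf("S%03d", 1:100))

# Samples at or below the 25th percentile and at or above the 75th percentile
q    <- quantile(expr, probs = c(0.25, 0.75))
low  <- names(expr)[expr <= q[["25%"]]]
high <- names(expr)[expr >= q[["75%"]]]

# Two-group design: intercept = low group, second column = high vs. low
group  <- factor(rep(c("low", "high"), times = c(length(low), length(high))),
                 levels = c("low", "high"))
design <- model.matrix(~ group)
rownames(design) <- c(low, high)
head(design)
```

Samples between the two percentiles are left out of the comparison, which sharpens the contrast between the groups at the cost of halving the sample size.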

# References