Merge branch 'master' of https://github.com/jakelever/cancermine

jakelever · Mar 6, 2019 · 53c1762 · 53c1762
2 parents c7f53a3 + 987c24f
commit 53c1762
Show file tree

Hide file tree

Showing 20 changed files with 862 additions and 205 deletions.
diff --git a/.gitignore b/.gitignore
@@ -17,3 +17,7 @@ cancermine_unfiltered.tsv
 /paper/_book
 /paper/sideeffects.pdf
 google-analytics.js
+nightly-ClinicalEvidenceSummaries.tsv
+TSgene.txt
+ongene.txt
+/paper/cancermine/intogen_cancer_drivers-2014.12
diff --git a/paper/_bookdown.yml b/paper/_bookdown.yml
@@ -1 +0,0 @@
-delete_merged_file: True

diff --git a/paper/bibliography.bibtex b/paper/bibliography.bibtex
@@ -37,6 +37,18 @@
   publisher={Oxford University Press}
 }
 
+@article{griffith2017civic,
+  title={CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer},
+  author={Griffith, Malachi and Spies, Nicholas C and Krysiak, Kilannin and McMichael, Joshua F and Coffman, Adam C and Danos, Arpad M and Ainscough, Benjamin J and Ramirez, Cody A and Rieke, Damian T and Kujan, Lynzey and others},
+  journal={Nature genetics},
+  volume={49},
+  number={2},
+  pages={170},
+  year={2017},
+  publisher={Nature Publishing Group}
+}
+
+
 
 @article{singhal2016text,
   title={Text mining genotype-phenotype relationships from biomedical literature for database curation and precision medicine},

diff --git a/paper/brief_communication.Rmd b/paper/brief_communication.Rmd
@@ -0,0 +1,67 @@
+---
+title: "CancerMine: A resource for the role of genes in cancer"
+site: bookdown::bookdown_site
+documentclass: book
+output:
+  bookdown::word_document2
+bibliography: bibliography.bibtex
+csl: nature-methods.csl
+---
+
+```{r cancermineSetup, include=FALSE}
+source('cancermine/dependencies.R')
+
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+```{r cancermineFigures, echo=F}
+pdf(file="sideeffects.pdf")
+source('cancermine/fig_annotationAndModel.R')
+source('cancermine/fig_textminedOverview.R')
+source('cancermine/fig_time.R')
+source('cancermine/fig_accrualAndDualRoles.R')
+source('cancermine/fig_comparisons.R')
+source('cancermine/fig_clustersAndTCGA.R')
+source('cancermine/fig_briefcommunications.R')
+invisible(dev.off())
+```
+
+As precision medicine approaches are increasingly integrated into the clinic, tumors from cancer patients are frequently genetically profiled to understand the driving forces behind their disease. We present the CancerMine resource, a text-mined and routinely updated database of drivers, oncogenes and tumor suppressors in different types of cancer.  All data is easily viewable online (http://bionlp.bcgsc.ca/cancermine) and downloadable under a Creative Commons Zero license for easy re-use.
+
+In order to interpret the somatic events present in a patient sample, one must know which genes play important roles in the development of the corresponding cancer type. This normally requires substantial literature review. We have developed a text mining approach that identifies mentions of genes as drivers, oncogenes or tumor suppressors. This approach will be kept up-to-date with monthly releases and provides a valuable resource amenable for data analysis pipelines. Based on this data, we also provide an easy tool for identifying cancer genes from a gene list and interactive plots of cancer type clustering, which are helpful for understanding the somatic landscapes of cancer.
+
+Oncogenes are genes (either normally or in an aberrated form) that promote the development of cancer while tumor suppressors act against carcinogenesis. Drivers refer to genes that are important in cancer development and can be either oncogenes or tumor suppressors. Some genes (e.g. NOTCH1) have been identified as oncogenes in one type of cancer and as tumor suppressors in another type [@radtke2003role]. Furthermore many genes are important in only certain types of cancer and likely irrelevant in others. The type of cancer provides important context when interpreting the relevant of somatic aberrations in a patient sample. 
+
+Different methods exist to identify potential cancer related genes, including statistical analysis of mutation frequency in large genomic cohorts [@kristensen2014principles] and in-vitro studies of gene knockdowns [@zender2008oncogenomics]. Several resources have been built over time to catalog the roles of genes in cancer. The Cancer Gene Census (CGC) [@futreal2004census] uses data from COSMIC to list known oncogenes and tumor suppressors. The Network of Cancer Genes [@ciccarelli2018network] builds upon the CGC and integrates a wide variety of additional contextual data such as frequency of mutations. IntOGen [@gonzalez2013intogen] uses data from large scale sequencing projects (e.g. the Cancer Genome Atlas) to collate the importance of cancer genes. ONGene [@liu2017ongene] and TSGene [@zhao2015tsgene] list oncogenes and tumor suppressors, but do not associate them with specific cancer types. The CIViC database curates clinically-relevant variants and contains a set of genes that are relevant for different types of cancer [@griffith2017civic].
+
+All manually curated databases face the overwhelming curation burden of expert curator time and cost in order to stay up-to-date. Text mined databases offer an automated approach that can provide high quality results with regular updates. To enable this approach, we build on previous text mining approaches for extracting gene-disease relations [@chun2006extraction; @singhal2016text]. We extracted titles, abstracts and available full-text articles from PubMed, PubMed Central Open Access subset and PubMed Central Author Manuscript Collection. We then identified sentences that mentioned a gene name, a cancer type and keywords suggestive of a gene playing a role in cancer (Table S\@ref(tab:suptab_filterterms)) We manually annotated the drivers, oncogenes and tumor suppressors discussed in 1,500 sentences (3 annotators with mean inter-annotator agreement F1-score = 0.77 and Table S\@ref(tab:suptab_interannotator)). These sentences are from throughout the paper and may discuss a new result or a previous result. An example of a sentence describing two oncogene associations is: “KRAS is a known oncogene in lung cancer and pancreatic cancer”. More examples can be found in Table S\@ref(tab:suptab_examples). An individual sentence may contain zero, one or many drivers, oncogenes and tumor suppressor genes associated with different types of cancer.
+
+We trained a Kindred relation classifier[@lever2017painless] that learns the characteristics of these sentences by building a logistic regression classifier on word frequencies and semantic features. The classifier can then predict the pairs of genes and cancer types with their associated roles (Driver, Oncogene and Tumor Suppressor) from a given sentence. A knowledgebase that contains a high number of incorrect associations would quickly lose users and be unusable for other analyses. We therefore control the precision-recall trade-off by applying a high threshold on classifier scores to reduce the number of false positives with the accepted increase in false negatives (Fig 1A and Table S\@ref(tab:suptab_performance) with error analysis in Table S\@ref(tab:suptab_erroranalysis)). This gives an average precision of `r paper.avg_precision`% and recall of `r paper.avg_recall`% across the three gene role types. We are most interested in well-established drivers, oncogenes and tumor suppressors which will be mentioned in many papers, first in the discovery paper(s) and then in other papers. Even with a low recall of `r paper.avg_recall`%, the redundancy in the literature means that the association needs to be mentioned in only `r paper.minPapersNeeded` papers to be identified with over a 90% chance. Important cancer genes associations are identified in hundreds of papers by CancerMine so this approach appears reasonable. However users should be aware that this method cannot provide an exhaustive list of all known cancer-gene associations.
+
+
+```{r briefcommunications1, echo=F, eval=T, fig.width=11, fig.asp=0.8,  fig.cap='(ref:briefcommunications1)'}
+grid.arrange(fig_briefcommunications1)
+```
+
+(ref:briefcommunications1) We train classifiers to extract mentions of drivers, oncogenes and tumor suppressors and use a 75%/25% training-test split to calculate precision-recall curves (a). This allows for the choice of a high threshold (shown with a vertical line in the lower plot) to reduce the false positive rate. The classifier is applied to all appropriate and accessible sentences in PubMed and PubMed Central and identifies the most frequently discussed cancer genes and cancer types (b). This information is valuable for reference by researchers as well as for use in automated pipelines. 
+
+The classifier was applied it to the remaining corpora to identify `r paper.sentenceCount` sentences containing `r paper.driverMentionCount` mentions of drivers, `r paper.oncogeneMentionCount` of oncogenes and `r paper.tumorSuppressorMentionCount` of tumor suppressors. By aggregating the results, we find `r paper.geneCount` genes linked to `r paper.cancerCount` cancer types as drivers, oncogenes or tumor suppressors with well-established cancer genes (e.g. TP53, MYC and KRAS) having the highest number of associations (Fig 1B). `r paper.percMultipleCitations`% of cancer gene associations are discussed in more than one paper. The resulting knowledge base incorporates information from `r paper.pmidCount` papers. 
+
+CancerMine is updated monthly and every new release adds many new cancer gene associations. These may be completely new discoveries or mentions of known gene associations that the system failed to successfully capture from previous papers. As evidence that automatically curated knowledgebases need to be regularly updated, we find that in 2018, ~`r paper.novelGeneRolesPerMonth2018` new cancer gene associations were extracted from the literature each month. Delving deeper into these new cancer gene associations, we find a strong association between their novelty to CancerMine and their location within the paper (Fig S\@ref(fig:supfig_subsectionnovelty)). Specifically, we find that novel associations will more likely appear (Chi-squared test P=7.2x10^-87) in the non-introductory sections of a paper (e.g. Results and Discussion). Examples are shown in Table S\@ref(tab:suptab_novelexamples) which show some of the frequent differences between novel and non-novel mentions. We found that overall `r paper.percFromFullText`% of mentions of cancer gene roles were found in the full text (as opposed to title or abstract). Furthermore `r paper.percOnlyInFullText`% of all cancer gene roles were only found in full text articles (Fig S\@ref(fig:supfig_sections)). This also underlines the need to text-mine full-text articles as cancer-gene associations are found in all subsections of papers (Fig S\@ref(fig:supfig_subsectionnovelty)) and not only their abstracts and highlights the need for greater access to articles for text mining.
+
+CancerMine can also identify genes that serve dual roles as oncogenes in some types of cancer and as tumor suppressors in other types. We identify genes that are mentioned in at least 4 papers that have strong support as an oncogene in at least one cancer type and also as a tumor suppressor in at least one other type. NOTCH1 as well as FOXP1 and several other genes are identified for their dual roles (Table S\@ref(tab:suptab_dualroles)).
+
+```{r briefcommunications2, echo=F, eval=T, fig.width=11, fig.asp=1.5,  fig.cap='(ref:briefcommunications2)'}
+grid.arrange(fig_briefcommunications2)
+```
+
+(ref:briefcommunications2) A comparison with other resources reveals that CancerMine contains vastly more gene-cancer associations but has poor overlap with the Cancer Gene Census and IntOGen (a). CancerMine overlaps substantially with the genes listed in the TSGen, ONGene and CIViC resources. Using the number of papers discussing a gene role with a cancer, a profile can be created for each cancer type. These profiles can then be clustered together to help understand the overall genetic difference between cancer types (b). Colors of increasing intensity correlate with higher importance score.
+
+
+We compared CancerMine with the CGC and several other cancer gene-related resources (Fig 2A) and found that CancerMine contains substantially more cancer gene associations than other resources but does not show substantial overlap with the CGC or IntOGen. The IntOGen project provides inferences of driver genes from large data sets such as the Cancer Genome Atlas and many will not yet have been examined in publications which may explain the poor overlap with CancerMine. The CGC is a manually curated resource based on the COSMIC resource and the poor overlap may also be explained by a lack of publications discussing all the data in COSMIC. Lowering the threshold of the classifier to reduce the false negative rate improves the overlap but not substantially (Fig S\@ref(fig:supfig_comparisons_lowthreshold)), suggesting that many of the gene associations in the CGC and IntOGen are not mentioned in the literature.
+
+The number of papers that discuss a gene as important for a cancer provides a metric of the importance of that gene in cancer development. Using this information we generated profiles for each cancer type.  As expected, similar cancer types cluster together (Fig 2B) providing an excellent summary of how different cancer types differ based on their genomic profiles. We then validated these profiles, specifically for tumor suppressors, using data from the Cancer Genome Atlas (TCGA). We hypothesized that likely deleterious point mutations would often target important tumor suppressors. For each TCGA sample (in seven TCGA cancer type projects), we matched it most closely with the CancerMine profile that overlapped its somatic mutations. We found that five of the seven CancerMine profiles have their highest proportion matches with the corresponding TCGA project (Fig S\@ref(fig:supfig_tcga)).
+
+All data are free to view and download. We hope this will provide a valuable resource to the cancer research community and that the methods will be of interest to others working on biomedical knowledge base curation.
+
+# References
diff --git a/paper/cancermine/cancermineProfilesTCGA.tsv b/paper/cancermine/cancermineProfilesTCGA.tsv
@@ -1,9 +1,9 @@
 profile	COAD	BRCA	LIHC	PRAD	LUAD	LGG	STAD
-colorectal cancer	40.3	18.2	9.9	8.3	11.3	1.8	10.1
-breast cancer	10.1	32.6	10.7	8.2	18.2	5.6	14.6
-hepatocellular carcinoma	9.7	22.2	15.4	9.0	16.3	11.1	16.2
-prostate cancer	4.1	19.4	11.8	33.5	13.5	9.4	8.2
-lung cancer	7.0	21.4	8.9	9.5	28.6	8.9	15.7
-malignant glioma	2.0	10.6	4.0	3.5	6.4	70.5	3.0
-stomach cancer	6.5	33.8	7.8	18.2	14.3	3.9	15.6
-none	0.4	45.5	5.9	33.4	5.7	5.1	4.0
+colorectal cancer	39.1	19.1	10.7	7.7	10.5	1.9	10.9
+breast cancer	9.7	32.8	10.2	8.4	18.9	5.7	14.4
+hepatocellular carcinoma	10.3	21.0	15.5	9.1	17.3	11.6	15.2
+prostate cancer	4.4	20.8	8.2	36.5	12.6	10.1	7.5
+lung cancer	5.9	22.5	10.3	10.1	26.9	8.9	15.5
+malignant glioma	1.0	11.2	3.8	4.3	6.1	70.5	3.1
+stomach cancer	7.8	37.3	6.5	17.0	11.1	3.9	16.3
+none	0.2	44.2	5.6	34.7	6.1	5.3	3.9
diff --git a/paper/cancermine/error_analysis.tsv b/paper/cancermine/error_analysis.tsv
@@ -0,0 +1,5 @@
+	Driver	Oncogene	Tumor Suppressor	negative
+Driver	10	0	0	25
+Oncogene	0	52	0	97
+Tumor Suppressor	0	0	39	112
+negative	1	8	6	509
diff --git a/paper/cancermine/examples.txt b/paper/cancermine/examples.txt
@@ -0,0 +1,8 @@
+Association Type	PubMed ID	Sentence
+Oncogene	3457468	This work suggests that Hu-ets-1 is an unusual oncogene that can help explain the common involvement of chromosome band 11q23 in various subtypes of hematopoietic malignancies.
+Oncogene	29440180	Although MNX1  functions as an oncogene to promote pancreatic islet cell tumors in multiple endocrine neoplasia type 1 (MEN1)  , this particular mutation is common in the population and unlikely to be relevant to predisposition.
+Oncogene and Tumor Suppressor	30344933	Indeed, Gadd45a behaves as a tumor suppressor in H-RAS driven models of breast cancer and as an oncogene in Myc-driven tumors  .
+Driver	24584072	The MYC  oncogene is an established driver in T-ALL and a canonical BRD4 target .
+Driver	27792127	RIN1  was identified as a breast cancer tumor suppressor gene  .
+Tumor Suppressor	29805611	Hence, RUNX3 may serve as a tumor suppressor gene in cervical cancer.
+Tumor Suppressor	26364604	We showed that LZTFL1 could function as a tumor suppressor in gastric cancers  and suppresses gastric cancer cell migration and invasion through regulating nuclear translocation of β-catenin, indicating a WNT pathway connection.
diff --git a/paper/cancermine/fig_annotationAndModel.R b/paper/cancermine/fig_annotationAndModel.R
@@ -78,6 +78,18 @@ paper.lenient_ts_recall <- data[data$threshold==0.5 & data$role=='Tumor_Suppress
 paper.lenient_avg_precision <- round(mean(c(paper.lenient_driver_precision,paper.lenient_oncogene_precision,paper.lenient_ts_precision)),1)
 paper.lenient_avg_recall <- round(mean(c(paper.lenient_driver_recall,paper.lenient_oncogene_recall,paper.lenient_ts_recall)),1)
 
+perfTable <- selectedThresholds
+perfTable <- merge(perfTable,data,by=c('threshold','role'))
+perfTable <- perfTable[,c('role','threshold','precision','recall')]
+
+paper.avg_recall <- round(100*mean(perfTable$recall),1)
+paper.avg_precision <- round(100*mean(perfTable$precision),1)
+
+neededPapers <- data.table(number=1:10)
+neededPapers$chanceOfExtraction <- 1- (1-paper.avg_recall/100) ^ neededPapers$number
+
+paper.minPapersNeeded <- as.list(neededPapers[neededPapers$chanceOfExtraction>=0.9,'number'])$number[1]
+
 #data <- data[order(data$precision),]
 #data[data$precision>0.85,]