C++ R CRAN status DOI Project Status: Active - The project has reached a stable, usable state and is being actively developed.

What is this?

One goal of cancer genomics is to identify DNA variants specific to the cancer tissue within an individual. For example, a researcher may want to identify mutated genes in order to design a treatment or therapy specific to that individual's cancer. These cancer-specific variants are called somatic variants and cannot be inherited. Normal tissue, by contrast, harbors inherited DNA variants, called germline variants, that are present and identical across all normal tissue.

If one sequences an individual's matched normal DNA (e.g. from blood or adjacent tissue) and tumor DNA, one can identify both germline and somatic mutations and, more importantly, distinguish between them. However, without the matched normal DNA serving as a control, the performance of somatic mutation callers (MuTect2, Seurat, Indelocator, Varscan, Strelka, Strelka2, etc.) drops off in terms of recall (sensitivity) and precision (positive predictive value). A matched normal may be missing or unusable because:

  • no matched normal is available (e.g. the patient is unavailable or has leukemia)
  • the normal sample is contaminated or poorly sequenced
  • the budget is insufficient to sequence both samples per patient

A third, unavoidable set of detected variants consists of false positives, or artifacts, which can arise from several sources including poor sequencing, sample storage, and read misalignment to the reference genome. UNMASC attempts to identify somatic variants from tumor samples without an adequate matched normal.

UNMASC workflow for a single tumor sample against Z unmatched normal controls. SB = strand bias, SEG = segmentation, OXOG = oxoG artifacts, FFPE = paraffin artifacts.

Description

This package is designed to filter and annotate tumor-only variant calls through the integration of public database annotations, clustering, and segmentation to provide the user with a clear characterization of each variant when called against a set of unmatched normal controls.

Citation

Little, P., Jo, H., Hoyle, A., Mazul, A., Zhao, X., Salazar, A.H., Farquhar, D., Sheth, S., Masood, M., Hayward, M.C., Parker, J.S., Hoadley, K.A., Zevallos, J. and Hayes, D.N. (2021). UNMASC: tumor-only variant calling with unmatched normal controls. NAR Cancer, 3(4), zcab040. [HTML, PDF, Supplement]

Installation


R/RStudio code to check, install, and load libraries.

# Build vignettes only when RStudio's pandoc is available
pandoc = Sys.getenv("RSTUDIO_PANDOC")
build_vign = nzchar(pandoc) && file.exists(pandoc)

cran_packs = c("devtools","Rcpp","RcppArmadillo","emdbook",
	"scales","BiocManager","parallel","doParallel",
	"data.table","grDevices","foreach")
bioc_packs = c("seqTools","Rsamtools","GenomicRanges",
	"IRanges")
github_packs = c("smarter","UNMASC")
req_packs = c(cran_packs,bioc_packs,github_packs)

for(pack in req_packs){
	
	# Install the package only if it is not already present
	chk_pack = tryCatch(find.package(pack),
		error = function(ee){NULL})
	
	if( is.null(chk_pack) ){
		if( pack %in% cran_packs ){
			install.packages(pack,dependencies = TRUE)
		} else if( pack %in% bioc_packs ){
			BiocManager::install(pkgs = pack,dependencies = TRUE)
		} else if( pack %in% github_packs ){
			# GitHub repositories need install_github(), not install()
			devtools::install_github(sprintf("pllittle/%s",pack),
				build_vignettes = build_vign,
				dependencies = TRUE)
		}
	}
	
	# Load the package whether it was just installed or already present
	library(pack,character.only = TRUE)
	
}
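
After the loop finishes, it can be useful to confirm that nothing failed to install. The snippet below is a minimal sketch of such a check; `find_missing` is a hypothetical helper written for this illustration and is not part of UNMASC.

```r
# Hypothetical helper (not part of UNMASC): return the names of
# packages that cannot be loaded in the current R library paths.
find_missing = function(packs){
	ok = vapply(packs, requireNamespace, logical(1), quietly = TRUE)
	packs[!ok]
}

# Example with base packages that ship with every R installation
missing_base = find_missing(c("stats","utils"))
if( length(missing_base) == 0 ){
	message("All checked packages are available")
} else {
	message("Missing packages: ", paste(missing_base, collapse = ", "))
}
```

In practice one would call `find_missing(req_packs)` with the vector defined above and re-run the installation for anything it reports.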

Inputs

  • annotated variant calls (e.g. Strelka/Strelka2 + VEP)
  • target capture bed file: contains contig, start position, end position columns
  • centromere start/end bed file
  • dict_chrom file: Run samtools view -H tumor.bam and save the output.
  • tumor bam filename

Workflow

UNMASC's benchmark samples were run with Strelka. Assuming the variant caller (Strelka) and the annotator (VEP) are installed along with their dependencies (Perl, HTSlib, etc.), Linux commands are provided below to run this software for variant calling and annotation. Running our customized VEP annotation requires downloading a COSMIC database VCF; for example, the latest release of CosmicCodingMuts.vcf.gz for GRCh37 is available from COSMIC. We have instructed VEP to annotate variants with 1000 Genomes population allele frequencies, ExAC/gnomAD population allele frequencies, variant transcripts, impacts/consequences, and COSMIC counts with stable and legacy IDs.

Refer to our comprehensive documentation for setup, inputs, and execution.

Future directions

  • Workflow containers
  • Develop sample code and pipeline for
    • MuTect/MuTect2
    • ANNOVAR annotation code
  • Applying UNMASC toward circulating plasma tumor cell DNA
  • Identifying somatic mutations that are missed when tumors are called against matched normals

FAQs