diff --git a/README.md b/README.md
index c460e77..7b4b3ec 100755
--- a/README.md
+++ b/README.md
@@ -1,169 +1,184 @@
-## _gvanno_ - workflow for functional and clinical annotation of germline nucleotide variants
+## *gvanno* - generic workflow for functional and clinical annotation of human DNA variants
+
+
### Contents
-- [Overview](#overview)
-- [News](#news)
-- [Annotation resources](#annotation-resources)
-- [Getting started](#getting-started)
-- [Contact](#contact)
+- [Overview](#overview)
+- [News](#news)
+- [Annotation resources](#annotation-resources)
+- [Getting started](#getting-started)
+- [Contact](#contact)
+
+### Overview
+
+The generic variant annotator (*gvanno*) is a software package intended for simple analysis and interpretation of human DNA variants. Variants and genes are annotated with disease-related and functional associations. Technically, the workflow is built with the [Docker](https://www.docker.com) technology, and it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.
+
+*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short insertions or deletions (indels). The workflow relies heavily upon [Ensembl's Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html) and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record **per sample variant**.
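The per-sample expansion can be sketched in a few lines of Python (illustrative only, not gvanno's actual implementation; genotype handling is simplified):

```python
# Sketch: how one multisample VCF record expands to one output row per
# sample variant. Samples with a non-reference genotype each yield a record.
def expand_multisample_record(vcf_line, sample_names):
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
    # genotype columns start at index 9; GT is the first subfield
    genotypes = [gt.split(":")[0] for gt in fields[9:]]
    rows = []
    for sample, gt in zip(sample_names, genotypes):
        # skip samples that are homozygous reference or have missing genotypes
        if gt in ("0/0", "0|0", "./.", "."):
            continue
        rows.append({"sample": sample, "chrom": chrom, "pos": pos,
                     "ref": ref, "alt": alt, "gt": gt})
    return rows

line = "1\t12345\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\t1/1"
print(expand_multisample_record(line, ["S1", "S2", "S3"]))
# two rows: sample S1 (0/1) and sample S3 (1/1)
```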
+
+### News
+
+- April 27th 2023 - **1.6.0 release**
+
+ - Added option `--oncogenicity_annotation` - classifies variants according to oncogenicity ([Horak et al., Genet Med, 2022](https://pubmed.ncbi.nlm.nih.gov/35101336/))
+ - Data updates: ClinVar, GENCODE, GWAS catalog, CancerMine
+ - Excluded extensive disease associations from the Open Targets Platform
-### Overview
+- September 26th 2022 - **1.5.1 release**
-The germline variant annotator (*gvanno*) is a software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with the [Docker](https://www.docker.com) technology, and it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.
+ - Added option `--vep_coding_only` - only report variants that fall into coding regions of transcripts (VEP option `--coding_only`)
-*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__.
+### Annotation resources (v1.6.0)
-### News
-* September 26th 2022 - **1.5.1 release**
- * Added option `--vep_coding_only` - only report variants that fall into coding regions of transcripts (VEP option `--coding_only`)
-* September 24th 2022 - **1.5.0 release**
- * Data updates: ClinVar, GENCODE GWAS catalog, CancerMine, Open Targets Platform
- * Software updates: VEP 107
- * Excluded UniProt KB from annotation tracks
-* December 21st 2021 - **1.4.4 release**
- * Data updates: ClinVar, GWAS catalog, CancerMine, UniProt KB, Open Targets Platform
- * Software updates: VEP (v105)
-* August 25th 2021 - **1.4.3 release**
- * Data updates: ClinVar, GWAS catalog, CancerMine, UniProt, Open Targets Platform
+- [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v109 (GENCODE v43/v19 as the gene reference dataset)
+- [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021)
+- [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
+- [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP
+- [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (April 2023)
+- [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 50, March 2023)
+- [Mutation hotspots](https://cancerhotspots.org) - Database of mutation hotspots in cancer
+- [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (March 27th 2023)
-### Annotation resources (v1.5.1)
+### Getting started
-* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v107 (GENCODE v41/v19 as the gene reference dataset)
-* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021)
-* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
-* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP
-* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
-* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (September 2022)
-* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 47, July 2022)
-* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2022_06, June 2022)
-* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v35.0, November 2021)
-* [Mutation hotspots](cancerhotspots.org) - Database of mutation hotspots in cancer
-* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (August 26th 2022)
+#### STEP 0: Prerequisites
+- *Python*
-### Getting started
+ An installation of Python (version \>=*3.6*) is required to run *gvanno*. Check that Python is installed by typing `python --version` in your terminal window.
-#### STEP 0: Python
+- *Other utilities*
-An installation of Python (version >=_3.6_) is required to run *gvanno*. Check that Python is installed by typing `python --version` in your terminal window.
+ The script that installs the reference data requires that the user has `bgzip` installed. See [here](http://www.htslib.org/download/) for instructions. The script also requires that basic Linux/UNIX commands are available (e.g. `gzip`, `tar`).
+
+ **NOTE**: We strongly recommend that *gvanno* is installed on a macOS or Linux/UNIX operating system.
#### STEP 1: Installation of Docker
-1. [Install the Docker engine](https://docs.docker.com/engine/installation/) on your preferred platform
- - installing [Docker on Linux](https://docs.docker.com/engine/installation/linux/)
- - installing [Docker on Mac OS](https://docs.docker.com/engine/installation/mac/)
- - NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with gvanno (an example being [mounting of data volumes](https://github.com/docker/toolbox/issues/607))
-2. Test that Docker is running, e.g. by typing `docker ps` or `docker images` in the terminal window
-3. Adjust the computing resources dedicated to the Docker, i.e.:
- - Memory: minimum 5GB
- - CPUs: minimum 4
- - [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced)
+1. [Install the Docker engine](https://docs.docker.com/engine/installation/) on your preferred platform
+ - installing [Docker on Linux](https://docs.docker.com/engine/installation/linux/)
+ - installing [Docker on Mac OS](https://docs.docker.com/engine/installation/mac/)
+ - NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with gvanno (an example being [mounting of data volumes](https://github.com/docker/toolbox/issues/607))
+2. Test that Docker is running, e.g. by typing `docker ps` or `docker images` in the terminal window
+3. Adjust the computing resources dedicated to the Docker, i.e.:
+ - Memory: minimum 5GB
+ - CPUs: minimum 4
+ - [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced)
-##### 1.1: Installation of Singularity (optional - in dev)
+##### 1.1: Installation of Singularity (*IN DEVELOPMENT*)
-0. **Note: this works for Singularity version 3.0 and higher**.
-1. [Install Singularity](https://sylabs.io/docs/)
-2. Test that singularity works by running `singularity --version`
-3. If you are in the gvanno directory, build the singularity image like so:
+0. **Note: this works for Singularity version 3.0 and higher**.
- `cd src`
+1. [Install Singularity](https://sylabs.io/docs/)
- `sudo ./buildSingularity.sh`
+2. Test that Singularity works by running `singularity --version`
+
+3. If you are in the gvanno directory, build the singularity image like so:
+
+ `cd src`
+
+ `sudo ./buildSingularity.sh`
#### STEP 2: Download *gvanno* and data bundle
-1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.5.1) (gvanno run script, v1.5.1)
-2. Download (preferably using `wget`) and unpack the latest assembly-specific data bundle in the gvanno directory
- * [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz) (approx 20Gb)
- * [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20220921.tgz) (approx 28Gb)
- * Example commands:
- * `wget http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz`
- * `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
+1. [Download and unpack the latest release](https://github.com/sigven/gvanno/releases/tag/v1.6.0)
+
+2. Install the assembly-specific VEP cache, and gvanno-specific reference data using the `download_gvanno_refdata.py` script, i.e.:
+
+ - `python download_gvanno_refdata.py --download_dir <download_dir> --genome_assembly grch38`
+
+ **NOTE**: This can take a considerable amount of time depending on your local bandwidth (approx. 30 GB per assembly-specific bundle)
+
+3. Pull the [gvanno Docker image (1.6.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx. 4.11 GB):
- A _data/_ folder within the _gvanno-1.5.1_ software folder should now have been produced
-3. Pull the [gvanno Docker image (1.5.1)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.2Gb):
- * `docker pull sigven/gvanno:1.5.1` (gvanno annotation engine)
+ - `docker pull sigven/gvanno:1.6.0` (gvanno annotation engine)
#### STEP 3: Input preprocessing
The *gvanno* workflow accepts a single input file:
- * An unannotated, single-sample VCF file (>= v4.2) with germline variants (SNVs/InDels)
+- An unannotated, single-sample VCF file (\>= v4.2) with germline variants (SNVs/InDels)
-We __strongly__ recommend that the input VCF is compressed and indexed using [bgzip](http://www.htslib.org/doc/tabix.html) and [tabix](http://www.htslib.org/doc/tabix.html). NOTE: If the input VCF contains multi-allelic sites, these will be subject to [decomposition](http://genome.sph.umich.edu/wiki/Vt#Decompose).
+We **strongly** recommend that the input VCF is compressed and indexed using [bgzip](http://www.htslib.org/doc/tabix.html) and [tabix](http://www.htslib.org/doc/tabix.html). NOTE: If the input VCF contains multi-allelic sites, these will be subject to [decomposition](http://genome.sph.umich.edu/wiki/Vt#Decompose).
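What decomposition does can be illustrated with a minimal Python sketch (illustrative only; gvanno delegates this step to vt-style decomposition, as linked above):

```python
# Sketch of multi-allelic decomposition: one VCF record carrying several
# comma-separated ALT alleles is split into one bi-allelic record per allele.
def decompose_multiallelic(chrom, pos, ref, alt_field):
    return [(chrom, pos, ref, alt) for alt in alt_field.split(",")]

print(decompose_multiallelic("1", 12345, "A", "G,T"))
# one bi-allelic record per ALT allele
```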
#### STEP 5: Run example
Run the workflow with **gvanno.py**, which takes the following arguments and options:
- usage:
- gvanno.py -h [options]
- --query_vcf
- --gvanno_dir
- --output_dir
- --genome_assembly
- --sample_id
- --container
-
- gvanno - workflow for functional and clinical annotation of germline nucleotide variants
-
- Required arguments:
- --query_vcf QUERY_VCF
- VCF input file with germline query variants (SNVs/InDels).
- --gvanno_dir GVANNO_DIR
- Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.5.1
- --output_dir OUTPUT_DIR
- Output directory
- --genome_assembly {grch37,grch38}
- Genome assembly build: grch37 or grch38
- --container {docker,singularity}
- Run gvanno with docker or singularity
- --sample_id SAMPLE_ID
- Sample identifier - prefix for output files
-
- VEP optional arguments:
- --vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP).
- --vep_gencode_all Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno).
- --vep_lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False
- --vep_n_forks VEP_N_FORKS
- Number of forks for Variant Effect Predictor (VEP) processing, default: 4
- --vep_buffer_size VEP_BUFFER_SIZE
- Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing
- - set lower to reduce memory usage, higher to increase speed, default: 500
- --vep_pick_order VEP_PICK_ORDER
- Comma-separated string of ordered transcript properties for primary variant pick in
- Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane
- --vep_skip_intergenic
- Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False
- --vep_coding_only
- Only report variants falling into coding regions of transcripts (VEP), default: False
-
- Other optional arguments:
- --force_overwrite By default, the script will fail with an error if any output file already exists.
- You can force the overwrite of existing result files by using this flag, default: False
- --version show program's version number and exit
- --no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator, default: False
- --docker_uid DOCKER_USER_ID
- Docker user ID. default is the host system user ID. If you are experiencing permission errors, try setting this up to root (`--docker-uid root`)
- --vcfanno_n_processes VCFANNO_N_PROCESSES
- Number of processes for vcfanno processing (see https://github.com/brentp/vcfanno#-p), default: 4
-
-The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
-
- python ~/gvanno-1.5.1/gvanno.py
- --query_vcf ~/gvanno-1.5.1/examples/example.grch37.vcf.gz
- --gvanno_dir ~/gvanno-1.5.1
- --output_dir ~/gvanno-1.5.1
- --sample_id example
- --genome_assembly grch37
- --container docker
- --force_overwrite
-
-This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder:
-
- 1. __example_gvanno_pass_grch37.vcf.gz (.tbi)__ - Bgzipped VCF file with rich set of functional/clinical annotations
- 2. __example_gvanno_pass_grch37.tsv.gz__ - Compressed TSV file with rich set of functional/clinical annotations
+```
+usage:
+gvanno.py -h [options]
+--query_vcf
+--gvanno_dir
+--output_dir
+--genome_assembly
+--sample_id
+--container
+
+gvanno - workflow for functional and clinical annotation of germline nucleotide variants
+
+Required arguments:
+--query_vcf QUERY_VCF
+ VCF input file with germline query variants (SNVs/InDels).
+--gvanno_dir GVANNO_DIR
+ Directory that contains the gvanno reference data, e.g. ~/gvanno-1.6.0
+--output_dir OUTPUT_DIR
+ Output directory
+--genome_assembly {grch37,grch38}
+ Genome assembly build: grch37 or grch38
+--container {docker,singularity}
+ Run gvanno with docker or singularity
+--sample_id SAMPLE_ID
+ Sample identifier - prefix for output files
+
+VEP optional arguments:
+--vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP).
+--vep_gencode_all Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno).
+--vep_lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False
+--vep_n_forks VEP_N_FORKS
+ Number of forks for Variant Effect Predictor (VEP) processing, default: 4
+--vep_buffer_size VEP_BUFFER_SIZE
+ Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing
+ - set lower to reduce memory usage, higher to increase speed, default: 500
+--vep_pick_order VEP_PICK_ORDER
+ Comma-separated string of ordered transcript properties for primary variant pick in
+ Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane
+--vep_skip_intergenic
+ Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False
+--vep_coding_only
+ Only report variants falling into coding regions of transcripts (VEP), default: False
+
+Other optional arguments:
+--force_overwrite By default, the script will fail with an error if any output file already exists.
+ You can force the overwrite of existing result files by using this flag, default: False
+--version show program's version number and exit
+--no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator, default: False
+--docker_uid DOCKER_USER_ID
+ Docker user ID. default is the host system user ID. If you are experiencing permission errors, try setting this up to root (`--docker-uid root`)
+--vcfanno_n_processes VCFANNO_N_PROCESSES
+ Number of processes for vcfanno processing (see https://github.com/brentp/vcfanno#-p), default: 4
+--oncogenicity_annotation
+ Classify variants according to oncogenicity (Horak et al., Genet Med, 2022)
+--debug Print full Docker/Singularity commands to log and do not delete intermediate files with warnings etc.
+```
+
+The *examples* folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
+
+```
+python ~/gvanno-1.6.0/gvanno.py \
+--query_vcf ~/gvanno-1.6.0/examples/example.grch37.vcf.gz \
+--gvanno_dir ~/gvanno-1.6.0 \
+--output_dir ~/gvanno-1.6.0 \
+--sample_id example \
+--genome_assembly grch37 \
+--container docker \
+--force_overwrite
+```
+
+This command will run the Docker-based *gvanno* workflow and produce the following output files in the *examples* folder:
+
+1. **example_gvanno_pass_grch37.vcf.gz (.tbi)** - Bgzipped VCF file with rich set of functional/clinical annotations
+2. **example_gvanno_pass_grch37.tsv.gz** - Compressed TSV file with rich set of functional/clinical annotations
Similar files are produced for all variants, not only variants with a *PASS* designation in the VCF FILTER column.
@@ -171,6 +186,6 @@ Similar files are produced for all variants, not only variants with a *PASS* des
Documentation of the various variant and gene annotations should be interrogated from the header of the annotated VCF file. The column names of the tab-separated values (TSV) file will be identical to the INFO tags that are documented in the VCF file.
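The INFO tag documentation can also be pulled out of the header programmatically; a minimal sketch (the header lines shown are illustrative, not an exhaustive list of gvanno's tags):

```python
import re

# Sketch: extract INFO tag descriptions from VCF header meta-lines, since the
# TSV column names are identical to the INFO tags documented in the VCF file.
def info_tag_docs(header_lines):
    docs = {}
    for line in header_lines:
        m = re.match(r'##INFO=<ID=([^,]+),.*Description="([^"]*)"', line)
        if m:
            docs[m.group(1)] = m.group(2)
    return docs

header = [
    '##INFO=<ID=CLINVAR_MSID,Number=.,Type=String,Description="ClinVar Measure Set/Variant ID">',
    '##INFO=<ID=Consequence,Number=.,Type=String,Description="Impact modifier for the consequence type">',
]
print(info_tag_docs(header))
```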
-### Contact
+### Contact
sigven AT ifi.uio.no
diff --git a/download_gvanno_refdata.py b/download_gvanno_refdata.py
new file mode 100755
index 0000000..f0b94bb
--- /dev/null
+++ b/download_gvanno_refdata.py
@@ -0,0 +1,319 @@
+#!/usr/bin/env python
+
+import argparse
+import re
+import os
+import subprocess
+import logging
+import sys
+import locale
+import errno
+import urllib.request as urllib2
+from argparse import RawTextHelpFormatter
+
+GVANNO_VERSION = '1.6.0'
+REFDATA_VERSION = '20230425'
+ENSEMBL_VERSION = '109'
+GENCODE_VERSION = 'v43'
+VEP_ASSEMBLY = "GRCh38"
+HOST_GVANNO_REFDATA_URL = "http://insilico.hpc.uio.no/pcgr/gvanno/"
+
+def __main__():
+
+ program_description = "download_gvanno_refdata - download reference datasets for the gvanno variant annotation workflow"
+ program_options = " --download_dir \n --genome_assembly " + \
+ "\n"
+
+ parser = argparse.ArgumentParser(description = program_description,
+ formatter_class=RawTextHelpFormatter, usage="\n %(prog)s -h [options]\n" + str(program_options))
+ parser._action_groups.pop()
+ required = parser.add_argument_group('Required arguments')
+ optional = parser.add_argument_group('Other optional arguments')
+
+ optional.add_argument('--force_overwrite', action = "store_true", help='By default, the script will fail with an error if the ' + \
+ 'download directory already exists.\nYou can force the overwrite of an existing download directory by using this flag, default: %(default)s')
+ optional.add_argument('--version', action='version', version='%(prog)s ' + str(GVANNO_VERSION))
+ optional.add_argument('--clean_raw_files',action="store_true", help="Delete raw compressed tar files (i.e. VEP) after download and unzip + untar has been conducted successfully")
+ optional.add_argument("--debug", action="store_true", help="Print full commands to log and do not delete intermediate files with warnings etc.")
+ required.add_argument('--download_dir',help='Destination directory for downloaded reference data', required = True)
+ required.add_argument('--genome_assembly',choices = ['grch37','grch38'], help='Choose build-specific reference data for download: grch37 or grch38', required = True)
+
+
+ args = parser.parse_args()
+ arg_dict = vars(args)
+
+ logger = getlogger('check-download-dir')
+
+ download_dir_full = os.path.abspath(arg_dict['download_dir'])
+ if not os.path.isdir(download_dir_full):
+ err_msg = "Download destination directory (" + str(download_dir_full) + ") does not exist"
+ gvanno_error_message(err_msg,logger)
+
+ ## check that 'data' does not exist in download_dir_full
+ db_dir = os.path.join(os.path.abspath(arg_dict['download_dir']),'data')
+ arg_dict['db_assembly_dir'] = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly'])
+ arg_dict['vep_assembly_dir'] = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly'], '.vep')
+
+ if os.path.isdir(db_dir) and arg_dict['force_overwrite'] is False:
+ err_msg = "Data directory (" + str(db_dir) + ") exists (force_overwrite = False)"
+ gvanno_error_message(err_msg,logger)
+
+ ## create the data/<assembly>/.vep directory tree (parent directories included; tolerate pre-existing directories under --force_overwrite)
+ os.makedirs(arg_dict['vep_assembly_dir'], exist_ok = True)
+
+
+ download_gvanno_ref_data(arg_dict = arg_dict)
+
+
+def rm_file(filename):
+ try:
+ os.remove(filename)
+ except OSError as e:
+ if e.errno != errno.ENOENT: # errno.ENOENT = no such file or directory
+ raise # re-raise exception if a different error occurred
+
+def gvanno_error_message(message, logger):
+ logger.error('')
+ logger.error(message)
+ logger.error('')
+ exit(1)
+
+def check_subprocess(command, logger, debug = False):
+ if debug:
+ logger.info(command)
+ try:
+ output = subprocess.check_output(str(command), stderr=subprocess.STDOUT, shell=True)
+ if len(output) > 0:
+ print(str(output.decode()).rstrip())
+ except subprocess.CalledProcessError as e:
+ print(e.output.decode())
+ exit(1)
+
+def getlogger(logger_name):
+ logger = logging.getLogger(logger_name)
+ logger.setLevel(logging.DEBUG)
+
+ # create console handler and set level to debug
+ ch = logging.StreamHandler(sys.stdout)
+ ch.setLevel(logging.DEBUG)
+
+ # add ch to logger
+ logger.addHandler(ch)
+
+ # create formatter
+ formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s", "%Y-%m-%d %H:%M:%S")
+
+ #add formatter to ch
+ ch.setFormatter(formatter)
+
+ return logger
+
+def get_url_num_bytes(url, logger, verbose=True):
+ """
+ Get number of bytes of file at a URL. None if not reported.
+ """
+ # prepend a scheme if the URL lacks one:
+ if not re.match('^[a-zA-Z]*:', url):
+ if os.path.exists(url):
+ url = 'file:' + url
+ else:
+ url = 'http://' + url
+ try:
+ class HeadRequest(urllib2.Request):
+ def get_method(self):
+ return "HEAD"
+ response = urllib2.urlopen(HeadRequest(
+ url, None, {'User-Agent': 'Mozilla/5.0'}))
+
+ num_bytes = response.info().get('content-length')
+ if verbose:
+ # urllib does redirections for me. Report if it's been clever:
+ if response.geturl() != url:
+ logger.info('Reporting size of file at: ' + response.geturl())
+ except Exception:
+ logger.info('Failed to connect to URL ' + str(url))
+ num_bytes = None
+ if num_bytes is not None:
+ num_bytes = int(num_bytes)
+ return num_bytes
+
+def pretty_print(num_bytes, logger):
+ """
+ Output number of bytes according to locale and with IEC binary prefixes
+ """
+ if num_bytes is None:
+ logger.info('File size unavailable.')
+ return 'unavailable' ## avoid TypeError when callers concatenate the result
+ KiB = 1024
+ MiB = KiB * KiB
+ GiB = KiB * MiB
+ TiB = KiB * GiB
+ PiB = KiB * TiB
+ EiB = KiB * PiB
+ ZiB = KiB * EiB
+ YiB = KiB * ZiB
+ locale.setlocale(locale.LC_ALL, '')
+ output = locale.format_string("%d", num_bytes, grouping=True) + ' bytes'
+ if num_bytes > YiB:
+ output += ' (%.3g YiB)' % (num_bytes / YiB)
+ elif num_bytes > ZiB:
+ output += ' (%.3g ZiB)' % (num_bytes / ZiB)
+ elif num_bytes > EiB:
+ output += ' (%.3g EiB)' % (num_bytes / EiB)
+ elif num_bytes > PiB:
+ output += ' (%.3g PiB)' % (num_bytes / PiB)
+ elif num_bytes > TiB:
+ output += ' (%.3g TiB)' % (num_bytes / TiB)
+ elif num_bytes > GiB:
+ output += ' (%.3g GiB)' % (num_bytes / GiB)
+ elif num_bytes > MiB:
+ output += ' (%.3g MiB)' % (num_bytes / MiB)
+ elif num_bytes > KiB:
+ output += ' (%.3g KiB)' % (num_bytes / KiB)
+
+ return output
+
+def download_gvanno_ref_data(arg_dict):
+ """
+ Main function to run the gvanno workflow using Docker
+ """
+ global GENCODE_VERSION, VEP_ASSEMBLY
+ if arg_dict['genome_assembly'] == 'grch37':
+ GENCODE_VERSION = 'v19'
+ VEP_ASSEMBLY = 'GRCh37'
+
+ vep_assembly_dir = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly'], '.vep')
+
+ datasets = {}
+ for db in ['vep_cache','vep_fasta','gvanno_custom']:
+ datasets[db] = {}
+ datasets[db]['remote_url'] = 'NA'
+ datasets[db]['local_path'] = 'NA'
+
+ ## Remote URL - pre-indexed VEP cache
+ datasets['vep_cache']['remote_url'] = (
+ f"ftp://ftp.ensembl.org/pub/current_variation/indexed_vep_cache/"
+ f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz"
+ )
+
+ ## Remote URL - Human genome FASTA
+ datasets['vep_fasta']['remote_url'] = (
+ f"http://ftp.ensembl.org/pub/release-{ENSEMBL_VERSION}/fasta/homo_sapiens/dna/Homo_sapiens."
+ f"{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz"
+ )
+
+ logger = getlogger('download-vep-cache')
+ datasets['vep_cache']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz")
+ datasets['vep_fasta']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f"Homo_sapiens.{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz")
+ datasets['vep_fasta']['local_path_uncompressed'] = re.sub(r'.gz','',datasets['vep_fasta']['local_path'])
+
+ vep_cache_bytes_remote = get_url_num_bytes(url = datasets['vep_cache']['remote_url'], logger = logger)
+
+ if VEP_ASSEMBLY == 'GRCh37':
+ datasets['vep_fasta']['remote_url'] = (
+ f"http://ftp.ensembl.org/pub/grch37/release-{ENSEMBL_VERSION}/fasta/homo_sapiens/dna/Homo_sapiens."
+ f"{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz"
+ )
+
+
+ logger.info('VEP cache - remote target file ' + str(datasets['vep_cache']['remote_url']))
+ logger.info('VEP cache - size: ' + pretty_print(vep_cache_bytes_remote, logger = logger))
+ logger.info('VEP cache - local destination file: ' + str(datasets['vep_cache']['local_path']))
+
+ if os.path.exists(datasets['vep_cache']['local_path']):
+ if os.path.getsize(datasets['vep_cache']['local_path']) == vep_cache_bytes_remote:
+ logger.info('VEP cache already downloaded')
+ else:
+ logger.info('VEP cache - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['vep_cache']['remote_url'], datasets['vep_cache']['local_path'])
+ else:
+ logger.info('VEP cache - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['vep_cache']['remote_url'], datasets['vep_cache']['local_path'])
+
+ logger.info('VEP cache - unzip and untar')
+
+ wdir = os.getcwd()
+
+ os.chdir(vep_assembly_dir)
+
+ command_unzip_cache = f"gzip -dc {datasets['vep_cache']['local_path']} | tar xf -"
+ check_subprocess(command = command_unzip_cache, logger = logger)
+ logger.info('VEP cache - finished unzip + untar ')
+
+ ## change back to working directory
+ os.chdir(wdir)
+
+ if arg_dict['clean_raw_files']:
+ logger.info('VEP cache - removing raw compressed tar ball')
+ rm_file(datasets['vep_cache']['local_path'])
+
+ logger = getlogger('download-vep-fasta')
+ fasta_cache_bytes_remote = get_url_num_bytes(url = datasets['vep_fasta']['remote_url'], logger = logger)
+ logger.info('VEP reference FASTA - remote target file: ' + str(datasets['vep_fasta']['remote_url']))
+ logger.info('VEP reference FASTA - size: ' + pretty_print(fasta_cache_bytes_remote, logger = logger))
+ logger.info('VEP reference FASTA - local destination file: ' + str(datasets['vep_fasta']['local_path']))
+
+ if os.path.exists(datasets['vep_fasta']['local_path']):
+ if os.path.getsize(datasets['vep_fasta']['local_path']) == fasta_cache_bytes_remote:
+ logger.info('VEP reference FASTA already downloaded')
+ else:
+ logger.info('VEP reference FASTA files - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['vep_fasta']['remote_url'], datasets['vep_fasta']['local_path'])
+ else:
+ logger.info('VEP reference FASTA files - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['vep_fasta']['remote_url'], datasets['vep_fasta']['local_path'])
+
+ logger.info('VEP reference FASTA - unzip + bgzip')
+ command_unzip_fasta = f"gzip -d {datasets['vep_fasta']['local_path']}"
+ check_subprocess(command = command_unzip_fasta, logger = logger)
+ command_bgzip_fasta = f"bgzip {datasets['vep_fasta']['local_path_uncompressed']}"
+ check_subprocess(command = command_bgzip_fasta, logger = logger)
+
+
+
+ datasets['gvanno_custom']['remote_url'] = f'{HOST_GVANNO_REFDATA_URL}gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz'
+ datasets['gvanno_custom']['local_path'] = os.path.join(arg_dict['download_dir'], f'gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz')
+
+ logger = getlogger('download-gvanno-custom')
+ custom_cache_bytes_remote = get_url_num_bytes(url = datasets['gvanno_custom']['remote_url'], logger = logger)
+ logger.info("Downloading custom gvanno variant datasets: Clinvar / dbNSFP / ncER / cancerhotspots ++")
+ logger.info('Custom gvanno datasets - remote target file ' + str(datasets['gvanno_custom']['remote_url']))
+ logger.info('Custom gvanno datasets - size: ' + pretty_print(custom_cache_bytes_remote, logger = logger))
+ logger.info('Custom gvanno datasets - local destination file: ' + str(datasets['gvanno_custom']['local_path']))
+
+
+ if os.path.exists(datasets['gvanno_custom']['local_path']):
+ if os.path.getsize(datasets['gvanno_custom']['local_path']) == custom_cache_bytes_remote:
+ logger.info('Custom gvanno datasets already downloaded')
+ else:
+ logger.info('Custom gvanno datasets - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['gvanno_custom']['remote_url'], datasets['gvanno_custom']['local_path'])
+ else:
+ logger.info('Custom gvanno datasets - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['gvanno_custom']['remote_url'], datasets['gvanno_custom']['local_path'])
+
+ logger.info('Custom gvanno datasets - unzip and untar')
+
+ wdir = os.getcwd()
+ os.chdir(arg_dict['download_dir'])
+
+ command_unzip_cache = f"gzip -dc {datasets['gvanno_custom']['local_path']} | tar xf -"
+ check_subprocess(command = command_unzip_cache, logger = logger)
+ logger.info('Custom gvanno datasets - finished unzip and untar ')
+
+ ## change back to working directory
+ os.chdir(wdir)
+
+ if arg_dict['clean_raw_files']:
+ logger.info('Custom gvanno datasets - removing raw compressed tar ball')
+ rm_file(datasets['gvanno_custom']['local_path'])
+
+ logger.info('Finished')
+
+
+if __name__=="__main__": __main__()
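The download hunks above repeat one pattern per dataset: compare the local file size against the remote size and skip the transfer when they match, otherwise fetch with `urlretrieve`. A minimal sketch of that pattern, assuming Python 3's `urllib.request` (the script aliases its downloader as `urllib2`) and approximating the script's `get_url_num_bytes` with a `Content-Length` lookup; `download_if_needed` is an illustrative helper name, not part of the patch:

```python
import os
import urllib.request

def download_if_needed(remote_url, local_path):
    """Download remote_url to local_path, skipping the transfer when a
    complete local copy (matching the remote Content-Length) already
    exists -- the size-check pattern used for the VEP FASTA and the
    gvanno data bundle above."""
    remote_bytes = None
    try:
        req = urllib.request.Request(remote_url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            remote_bytes = int(resp.headers.get("Content-Length", -1))
    except Exception:
        pass  # size unknown - fall through and download unconditionally
    if remote_bytes is not None and os.path.exists(local_path) \
            and os.path.getsize(local_path) == remote_bytes:
        return False  # local copy is already complete
    urllib.request.urlretrieve(remote_url, local_path)
    return True
```

The same helper covers both branches of the `if os.path.exists(...)` checks in the script, so each dataset needs only a single call.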
diff --git a/examples/example.grch38.vcf.gz b/examples/example.grch38.vcf.gz
index 8329f55..bf3cbd8 100755
Binary files a/examples/example.grch38.vcf.gz and b/examples/example.grch38.vcf.gz differ
diff --git a/examples/example.grch38.vcf.gz.tbi b/examples/example.grch38.vcf.gz.tbi
index 7f8e83e..3817608 100755
Binary files a/examples/example.grch38.vcf.gz.tbi and b/examples/example.grch38.vcf.gz.tbi differ
diff --git a/gvanno.py b/gvanno.py
index db02add..34ee0f4 100755
--- a/gvanno.py
+++ b/gvanno.py
@@ -11,17 +11,17 @@
import platform
from argparse import RawTextHelpFormatter
-GVANNO_VERSION = '1.5.1'
-DB_VERSION = 'GVANNO_DB_VERSION = 20220921'
-VEP_VERSION = '107'
-GENCODE_VERSION = 'v41'
+GVANNO_VERSION = '1.6.0'
+DB_VERSION = 'GVANNO_DB_VERSION = 20230425'
+VEP_VERSION = '109'
+GENCODE_VERSION = 'v43'
VEP_ASSEMBLY = "GRCh38"
DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION)
def __main__():
- program_description = "gvanno - workflow for functional and clinical annotation of germline nucleotide variants"
+ program_description = "gvanno - workflow for functional and clinical annotation of human DNA variants"
program_options = " --query_vcf \n --gvanno_dir \n --output_dir \n --genome_assembly " + \
"\n --sample_id \n --container "
@@ -44,13 +44,16 @@ def __main__():
optional_vep.add_argument('--vep_n_forks', default = 4, help="Number of forks for VEP processing, default: %(default)s")
optional_vep.add_argument('--vep_buffer_size', default = 500, help="Variant buffer size (variants read into memory simultaneously) " + \
"for VEP processing\n- set lower to reduce memory usage, higher to increase speed, default: %(default)s")
- optional_vep.add_argument('--vep_pick_order', default = "canonical,appris,biotype,ccds,rank,tsl,length,mane", help="Comma-separated string " + \
+ optional_vep.add_argument('--vep_pick_order', default = "mane_select,mane_plus_clinical,canonical,appris,biotype,ccds,rank,tsl,length", help="Comma-separated string " + \
"of ordered transcript properties for primary variant pick in\nVEP processing, default: %(default)s")
optional_vep.add_argument('--vep_skip_intergenic', action = "store_true", help="Skip intergenic variants (VEP), default: %(default)s")
optional_vep.add_argument('--vep_coding_only', action = "store_true", help="Only return consequences that fall in the coding regions of transcripts (VEP), default: %(default)s")
optional.add_argument('--vcfanno_n_processes', default = 4, help="Number of processes for vcfanno " + \
"processing (see https://github.com/brentp/vcfanno#-p), default: %(default)s")
- required.add_argument('--query_vcf', help='VCF input file with germline query variants (SNVs/InDels).', required = True)
+ optional.add_argument('--oncogenicity_annotation', action ='store_true', help = 'Classify variants according to oncogenicity (Horak et al., Genet Med, 2022)')
+ optional.add_argument("--debug", action="store_true", help="Print full Docker/Singularity commands to log, and keep intermediate files (warnings etc.) for debugging")
+
+ required.add_argument('--query_vcf', help='VCF input file with query variants (SNVs/InDels).', required = True)
required.add_argument('--gvanno_dir',help='Directory that contains the gvanno data bundle, e.g. ~/gvanno-' + str(GVANNO_VERSION), required = True)
required.add_argument('--output_dir',help='Output directory', required = True)
required.add_argument('--genome_assembly',choices = ['grch37','grch38'], help='Genome assembly build: grch37 or grch38', required = True)
@@ -65,21 +68,25 @@ def __main__():
## Check that VEP pick criteria is formatted correctly
if not arg_dict['vep_pick_order'] is None:
values = str(arg_dict['vep_pick_order']).split(',')
- permitted_sources = ['canonical','appris','tsl','biotype','ccds','rank','length','mane']
+ permitted_sources = ['canonical','appris','tsl','biotype','ccds','rank','length','mane_plus_clinical','mane_select']
num_permitted_sources = 0
for v in values:
if v in permitted_sources:
num_permitted_sources += 1
- if num_permitted_sources != 8:
+ if num_permitted_sources != 9:
err_msg = "Option 'vep_pick_order' = " + str(arg_dict['vep_pick_order']) + " is formatted incorrectly, should be " + \
- "a comma-separated string of the following values: canonical,appris,tsl,biotype,ccds,rank,length,mane"
+ "a comma-separated string of the following values: mane_select,mane_plus_clinical,canonical,appris,tsl,biotype,ccds,rank,length"
gvanno_error_message(err_msg, logger)
if arg_dict['container'] is None:
err_msg = 'Please specify whether the gvanno workflow is running through Docker or Singularity (--container )'
gvanno_error_message(err_msg, logger)
+ if arg_dict['oncogenicity_annotation'] is True and arg_dict['vep_lof_prediction'] is False:
+ err_msg = "Option --oncogenicity_annotation requires --vep_lof_prediction to be turned on"
+ gvanno_error_message(err_msg, logger)
+
logger = getlogger('gvanno-check-files')
# check that script and Docker image version correspond
@@ -295,12 +302,13 @@ def run_gvanno(arg_dict, host_directories):
container_command_run2 = container_command_run2 + " -W /workdir/output " + 'src/gvanno.sif' + " sh -c \""
docker_command_run_end = '\"'
- #logger.info(container_command_run1)
- #logger.info(container_command_run2)
+ if arg_dict['debug']:
+ logger.info(container_command_run1)
+ logger.info(container_command_run2)
## GVANNO|start - Log key information about sample, options and assembly
logger = getlogger("gvanno-start")
- logger.info("--- Germline variant annotation (gvanno) workflow ----")
+ logger.info("--- Generic variant annotation (gvanno) workflow ---")
logger.info("Sample name: " + str(arg_dict['sample_id']))
logger.info("Genome assembly: " + str(arg_dict['genome_assembly']))
print()
@@ -310,7 +318,8 @@ def run_gvanno(arg_dict, host_directories):
logger.info("STEP 0: Validate input data")
vcf_validate_command = str(container_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + \
str(vcf_validation) + " " + str(arg_dict['genome_assembly']) + docker_command_run_end
- #logger.info(vcf_validate_command)
+ if arg_dict['debug']:
+ logger.info(vcf_validate_command)
check_subprocess(vcf_validate_command)
logger.info('Finished')
@@ -335,10 +344,10 @@ def run_gvanno(arg_dict, host_directories):
## List all VEP flags used when calling VEP
loftee_dir = '/opt/vep/src/ensembl-vep/modules'
plugins_in_use = "NearestExonJB"
- vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --domains --symbol --protein --ccds " + \
+ vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomade --af_gnomadg --variant_class --domains --symbol --protein --ccds " + \
"--uniprot --appris --biotype --canonical --format vcf --mane --cache --numbers --total_length --allele_number --no_escape " + \
"--xref_refseq --plugin NearestExonJB,max_range=50000"
- vep_options = "--vcf --quiet --check_ref --flag_pick_allele --pick_order " + str(arg_dict['vep_pick_order']) + \
+ vep_options = "--vcf --quiet --check_ref --flag_pick_allele_gene --pick_order " + str(arg_dict['vep_pick_order']) + \
" --force_overwrite --species homo_sapiens --assembly " + str(VEP_ASSEMBLY) + " --offline --fork " + \
str(arg_dict['vep_n_forks']) + " " + str(vep_flags) + " --dir /usr/local/share/vep/data"
@@ -378,6 +387,8 @@ def run_gvanno(arg_dict, host_directories):
logger.info("VEP configuration - loss-of-function prediction: " + str(arg_dict['vep_lof_prediction']))
logger.info("VEP configuration - plugins in use: " + str(plugins_in_use))
+ if arg_dict['debug']:
+ logger.info(vep_main_command)
check_subprocess(vep_main_command)
check_subprocess(vep_bgzip_command)
check_subprocess(vep_tabix_command)
@@ -386,11 +397,14 @@ def run_gvanno(arg_dict, host_directories):
## GVANNO|vcfanno - annotate VCF against a number of variant annotation resources
print()
logger = getlogger('gvanno-vcfanno')
- logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (ClinVar, ncER, dbNSFP, GWAS catalog, cancerhotspots.org)")
+ logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (ClinVar, ncER, dbNSFP, GWAS catalog)")
logger.info('vcfanno configuration - number of processes (-p): ' + str(arg_dict['vcfanno_n_processes']))
gvanno_vcfanno_command = str(container_command_run2) + "gvanno_vcfanno.py --num_processes " + str(arg_dict['vcfanno_n_processes']) + \
- " --dbnsfp --clinvar --ncer --gvanno_xref --gwas --cancer_hotspots " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \
+ " --dbnsfp --clinvar --ncer --gvanno_xref --gwas " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \
" " + os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + docker_command_run_end
+
+ if arg_dict['debug']:
+ logger.info(gvanno_vcfanno_command)
check_subprocess(gvanno_vcfanno_command)
logger.info("Finished")
@@ -398,9 +412,14 @@ def run_gvanno(arg_dict, host_directories):
print()
logger = getlogger("gvanno-summarise")
logger.info("STEP 3: Summarise gene and variant annotations with gvanno-summarise")
+ logger.info("Configuration - oncogenicity classification: " + str(int(arg_dict['oncogenicity_annotation'])))
gvanno_summarise_command = str(container_command_run2) + "gvanno_summarise.py " + str(vep_vcfanno_vcf) + ".gz " + \
os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + " " + str(int(arg_dict['vep_lof_prediction'])) + \
- " " + str(int(arg_dict['vep_regulatory'])) + docker_command_run_end
+ " " + str(int(arg_dict['oncogenicity_annotation'])) + " " + str(int(arg_dict['vep_regulatory'])) + " " + \
+ str(int(arg_dict['debug'])) + docker_command_run_end
+
+ if arg_dict['debug']:
+ logger.info(gvanno_summarise_command)
check_subprocess(gvanno_summarise_command)
logger.info("Finished")
@@ -415,14 +434,15 @@ def run_gvanno(arg_dict, host_directories):
check_subprocess(create_output_vcf_command2)
check_subprocess(create_output_vcf_command3)
check_subprocess(create_output_vcf_command4)
- check_subprocess(clean_command)
+ if not arg_dict['debug']:
+ check_subprocess(clean_command)
print()
## GVANNO|vcf2tsv - convert VCF to TSV with https://github.com/sigven/vcf2tsv
logger = getlogger("gvanno-vcf2tsv")
- logger.info("STEP 4: Converting VCF to TSV with https://github.com/sigven/vcf2tsvpy")
- gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsv.py " + str(output_pass_vcf) + " --compress " + str(output_pass_tsv) + docker_command_run_end
- gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsv.py " + str(output_vcf) + " --compress --keep_rejected " + str(output_tsv) + docker_command_run_end
+ logger.info("STEP 4: Converting genomic VCF to TSV with https://github.com/sigven/vcf2tsvpy")
+ gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_pass_vcf) + " --compress --out_tsv " + str(output_pass_tsv) + docker_command_run_end
+ gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_vcf) + " --compress --keep_rejected --out_tsv " + str(output_tsv) + docker_command_run_end
logger.info("Conversion of VCF variant data to records of tab-separated values - PASS variants only")
check_subprocess(gvanno_vcf2tsv_command_pass)
logger.info("Conversion of VCF variant data to records of tab-separated values - PASS and non-PASS variants")
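The `vep_pick_order` validation in the hunk above counts how many comma-separated entries belong to the permitted set and requires exactly nine. A stricter sketch of that check, as a set comparison that additionally rejects duplicates and omissions; `valid_vep_pick_order` is an illustrative helper, not part of the patch:

```python
# The nine transcript properties accepted by --vep_pick_order in gvanno 1.6.0
PERMITTED_PICK_SOURCES = {
    'mane_select', 'mane_plus_clinical', 'canonical', 'appris',
    'tsl', 'biotype', 'ccds', 'rank', 'length',
}

def valid_vep_pick_order(pick_order: str) -> bool:
    """Return True if pick_order is a comma-separated permutation of all
    nine permitted sources (no duplicates, no unknowns, none missing)."""
    values = pick_order.split(',')
    return (len(values) == len(PERMITTED_PICK_SOURCES)
            and set(values) == PERMITTED_PICK_SOURCES)
```

Unlike the count-based check, this also rejects an order such as `canonical,canonical,...`, where a duplicated permitted value still sums to nine.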
diff --git a/src/Dockerfile b/src/Dockerfile
index b228d05..1a0a2e2 100644
--- a/src/Dockerfile
+++ b/src/Dockerfile
@@ -1,3 +1,4 @@
+
###################################################
# Stage 1 - docker container to build ensembl-vep #
###################################################
@@ -22,13 +23,13 @@ RUN apt-get update && apt-get -y install \
ENV OPT /opt/vep
ENV OPT_SRC $OPT/src
ENV HTSLIB_DIR $OPT_SRC/htslib
-ENV BRANCH release/107
+ENV BRANCH=release/109
# Working directory
WORKDIR $OPT_SRC
# Clone/download repositories/libraries
-RUN if [ "$BRANCH" = "master" ]; \
+RUN if [ "$BRANCH" = "main" ]; \
then export BRANCH_OPT=""; \
else export BRANCH_OPT="-b $BRANCH"; \
fi && \
@@ -166,7 +167,8 @@ ENV PERL5LIB $PERL5LIB_TMP
WORKDIR /
ADD loftee_1.0.3.tgz $OPT/src/ensembl-vep/modules
ADD UTRannotator.tgz $OPT/src/ensembl-vep/modules
-RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/release/107/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm
+ADD NearestExonJB.pm $OPT/src/ensembl-vep/modules
+#RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/$BRANCH/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm
# Final steps
@@ -178,6 +180,24 @@ RUN echo >> $OPT/.profile && \
# Run INSTALL.pl and remove the ensemb-vep tests and travis
./INSTALL.pl -a a -l -n && rm -rf t travisci .travis.yml
+
+# Install dependencies for VEP plugins:
+#USER root
+#ENV PLUGIN_DEPS "https://raw.githubusercontent.com/Ensembl/VEP_plugins/$BRANCH/config"
+## - Ubuntu packages
+#RUN curl -O "$PLUGIN_DEPS/ubuntu-packages.txt" && \
+# apt-get update && apt-get install -y --no-install-recommends \
+# $(sed s/\#.*//g ubuntu-packages.txt) && \
+# rm -rf /var/lib/apt/lists/* ubuntu-packages.txt
+## - Perl modules
+#RUN curl -O "$PLUGIN_DEPS/cpanfile" && \
+# cpanm --installdeps --with-recommends . && \
+# rm -rf /root/.cpanm cpanfile
+## - Python packages
+#RUN curl -O "$PLUGIN_DEPS/requirements.txt" && \
+# python -m pip install --no-cache-dir -r requirements.txt && \
+# rm requirements.txt
+
######################################################
# Stage 3 - adding dependencies for gvanno analysis #
######################################################
@@ -200,7 +220,6 @@ WORKDIR /
ENV PACKAGE_BIO="libhts2 bedtools"
ENV PACKAGE_DEV="gfortran gcc-multilib autoconf liblzma-dev libncurses5-dev libblas-dev liblapack-dev libssh2-1-dev libxml2-dev vim libssl-dev libcairo2-dev libbz2-dev libcurl4-openssl-dev"
-#ENV PYTHON_MODULES="numpy==1.19.2 cython==0.29.21 scipy==1.5.3 pandas==1.1.3 cyvcf2==0.20.9 toml==0.10.1"
ENV PYTHON_MODULES="numpy==1.19.2 cython==0.29.21 scipy==1.5.3 pandas==1.1.3 cyvcf2==0.20.9"
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
@@ -224,9 +243,9 @@ chmod 755 /usr/local/bin/vcf_validator
USER root
WORKDIR /
-RUN wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
-RUN bunzip2 -dc samtools-1.10.tar.bz2 | tar xvf -
-RUN cd samtools-1.10 && ./configure --prefix=/usr/local/bin && make -j && make install
+RUN wget https://github.com/samtools/samtools/releases/download/1.15.1/samtools-1.15.1.tar.bz2
+RUN bunzip2 -dc samtools-1.15.1.tar.bz2 | tar xvf -
+RUN cd samtools-1.15.1 && ./configure --prefix=/usr/local && make -j && make install
WORKDIR /
@@ -241,18 +260,19 @@ RUN apt-get update \
USER root
WORKDIR /
-## vt - variant tool set - use conda version
-## primary use in GVANNO/PCGR/CPSR: decomposition of multiallelic variants in a VCF file
+## vt - variant tool set + vcf2tsvpy - install using Conda
+## primary use of vt in GVANNO: decomposition of multiallelic variants in a VCF file
+
+# install Conda
RUN wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh \
&& chmod 0755 miniconda.sh
RUN ["/bin/bash", "-c", "/miniconda.sh -b -p /conda"]
RUN rm miniconda.sh
-# update conda & install vt
+# update conda & install vt + vcf2tsvpy
RUN /conda/bin/conda update conda
-#RUN /conda/bin/conda update python
-RUN /conda/bin/conda install -c bioconda vt
-#RUN /conda/bin/conda install -c bioconda vcf2tsvpy
+RUN /conda/bin/conda update python
+RUN /conda/bin/conda install -c bioconda vt vcf2tsvpy python=3.6
## Clean Up
RUN apt-get clean autoclean
@@ -268,13 +288,11 @@ VOLUME /data
USER root
WORKDIR /
-#RUN rm -f nlopt-2.4.2.tar.gz
RUN rm -rf $HOME/src/ensembl-vep/t/
RUN rm -f $HOME/src/v335_base.tar.gz
RUN rm -f $HOME/src/release-1-6-924.zip
-RUN rm -rf /samtools-1.10.tar.bz2
-#RUN rm -f /conda/bin/python
+RUN rm -rf /samtools-1.15.1*
ADD gvanno.tgz /
ENV PATH=$PATH:/conda/bin:/gvanno
-ENV PYTHONPATH=:/gvanno/lib:${PYTHONPATH}
\ No newline at end of file
+ENV PYTHONPATH=:/gvanno/lib:${PYTHONPATH}
diff --git a/src/NearestExonJB.pm b/src/NearestExonJB.pm
new file mode 100644
index 0000000..30ffcec
--- /dev/null
+++ b/src/NearestExonJB.pm
@@ -0,0 +1,143 @@
+=head1 LICENSE
+
+Copyright [1999-2015] Wellcome Trust Sanger Institute and the EMBL-European Bioinformatics Institute
+Copyright [2016-2023] EMBL-European Bioinformatics Institute
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
+=head1 CONTACT
+
+ Ensembl
+
+=cut
+
+=head1 NAME
+
+ NearestExonJB
+
+=head1 SYNOPSIS
+
+ mv NearestExonJB.pm ~/.vep/Plugins
+ ./vep -i variations.vcf --cache --plugin NearestExonJB
+
+=head1 DESCRIPTION
+
+ This is a plugin for the Ensembl Variant Effect Predictor (VEP) that
+ finds the nearest exon junction boundary to a coding sequence variant. More than
+ one boundary may be reported if the boundaries are equidistant.
+
+ The plugin will report the Ensembl identifier of the exon, the distance to the
+ exon boundary, the boundary type (start or end of exon) and the total
+ length in nucleotides of the exon.
+
+ Various parameters can be altered by passing them to the plugin command:
+
+ - max_range : maximum search range in bp (default: 10000)
+
+ Parameters are passed e.g.:
+
+ --plugin NearestExonJB,max_range=50000
+
+=cut
+
+package NearestExonJB;
+
+use strict;
+use warnings;
+
+use Bio::EnsEMBL::Variation::Utils::BaseVepPlugin;
+
+use base qw(Bio::EnsEMBL::Variation::Utils::BaseVepPlugin);
+
+my $char_sep = "|";
+
+my %CONFIG = (
+ max_range => 10000,
+);
+
+sub new {
+ my $class = shift;
+
+ my $self = $class->SUPER::new(@_);
+
+ my $params = $self->params;
+
+ # get output format
+ $char_sep = "+" if ($self->{config}->{output_format} eq 'vcf');
+
+ foreach my $param(@$params) {
+ my ($key, $val) = split('=', $param);
+ die("ERROR: Failed to parse parameter $param\n") unless defined($key) && defined($val);
+ $CONFIG{$key} = $val;
+ }
+
+ return $self;
+}
+
+sub feature_types {
+ return ['Transcript'];
+}
+
+sub get_header_info {
+ my $header = 'Nearest Exon Junction Boundary (coding sequence variants only). Format:';
+ $header .= join($char_sep, qw(ExonID distance start/end length) );
+
+ return {
+ NearestExonJB => $header,
+ }
+}
+
+sub run {
+ my ($self, $tva) = @_;
+
+ my $vf = $tva->base_variation_feature;
+ my $trv = $tva->base_transcript_variation;
+
+ my $loc_string = sprintf("%s:%s-%i-%i", $trv->transcript_stable_id, $vf->{chr} || $vf->seq_region_name, $vf->{start}, $vf->{end});
+
+ if(!exists($self->{_cache}) || !exists($self->{_cache}->{$loc_string})) {
+ my $exons = $trv->_overlapped_exons;
+ my %dists;
+ my $min = $CONFIG{max_range};
+ foreach my $exon (@$exons) {
+ my $startD = abs ($vf->start - $exon->seq_region_start);
+ my $endD = abs ($vf->end - $exon->seq_region_end);
+ if ($startD < $endD){
+ $dists{$exon->stable_id}{$startD} = 'start';
+ $dists{$exon->stable_id}{len} = $exon->length;
+ $min = $startD if $min > $startD;
+ } elsif ($startD > $endD){
+ $dists{$exon->stable_id}{$endD} = 'end';
+ $dists{$exon->stable_id}{len} = $exon->length;
+ $min = $endD if $min > $endD;
+ } else {
+ $dists{$exon->stable_id}{$startD} = "start_end";
+ $dists{$exon->stable_id}{len} = $exon->length;
+ $min = $startD if $min > $startD;
+ }
+ }
+
+ my @finalRes;
+ foreach my $exon (keys %dists){
+ if (exists $dists{$exon}{$min}) {
+ push(@finalRes, $exon.$char_sep.$min.$char_sep.$dists{$exon}{$min}.$char_sep.$dists{$exon}{len})
+ }
+ }
+
+ $self->{_cache}->{$loc_string} = scalar @finalRes ? join(",", @finalRes) : undef;
+ }
+ return $self->{_cache}->{$loc_string} ? { NearestExonJB => $self->{_cache}->{$loc_string} } : {};
+}
+
+1;
+
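The plugin above emits one `NearestExonJB` entry per nearest exon in the format `ExonID / distance / boundary / exon length`, joined by `+` in VCF output (`|` otherwise) and comma-separated when several exons are equidistant. A small sketch of a downstream parser for that annotation string, assuming the VCF separator; `parse_nearest_exon_jb` is a hypothetical helper, not part of the patch:

```python
def parse_nearest_exon_jb(annotation, char_sep="+"):
    """Parse a NearestExonJB VCF annotation string into a list of dicts.
    Format per entry: ExonID<sep>distance<sep>boundary<sep>exon_length,
    with entries comma-separated when boundaries are equidistant."""
    results = []
    for entry in annotation.split(","):
        exon_id, distance, boundary, length = entry.split(char_sep)
        results.append({
            "exon_id": exon_id,
            "distance": int(distance),
            "boundary": boundary,       # 'start', 'end' or 'start_end'
            "exon_length": int(length),
        })
    return results
```

For non-VCF VEP output, pass `char_sep="|"` to match the plugin's default separator.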
diff --git a/src/buildDocker.sh b/src/buildDocker.sh
index 9a84197..8962fd6 100755
--- a/src/buildDocker.sh
+++ b/src/buildDocker.sh
@@ -1,6 +1,5 @@
-cp /Users/sigven/research/docker/pcgr/src/loftee_1.0.3.tgz .
-cp /Users/sigven/research/software/vcf2tsv/vcf2tsv.py gvanno/
-#cp /Users/sigven/research/docker/pcgr/src/pcgr/lib/annoutils.py gvanno/lib/
+#cp /Users/sigven/research/docker/pcgr/src/loftee_1.0.3.tgz .
+#cp /Users/sigven/research/software/vcf2tsv/vcf2tsv.py gvanno/
tar czvfh gvanno.tgz gvanno/
echo "Build the Docker Image"
TAG=`date "+%Y%m%d"`
diff --git a/src/gvanno.tgz b/src/gvanno.tgz
index a484a28..0c6ec29 100644
Binary files a/src/gvanno.tgz and b/src/gvanno.tgz differ
diff --git a/src/gvanno/gvanno_summarise.py b/src/gvanno/gvanno_summarise.py
index af87e66..bd57c2e 100755
--- a/src/gvanno/gvanno_summarise.py
+++ b/src/gvanno/gvanno_summarise.py
@@ -7,6 +7,7 @@
import gzip
import os
import annoutils
+import oncogenicity
logger = annoutils.getlogger('gvanno-gene-annotate')
csv.field_size_limit(500 * 1024 * 1024)
@@ -18,17 +19,19 @@ def __main__():
parser.add_argument('vcf_file', help='VCF file with VEP-annotated query variants (SNVs/InDels)')
parser.add_argument('gvanno_db_dir',help='gvanno data directory')
parser.add_argument('lof_prediction',default=0,type=int,help='VEP LoF prediction setting (0/1)')
+ parser.add_argument('oncogenicity_annotation',default=0,type=int,help='Include oncogenicity annotation (0/1)')
parser.add_argument('regulatory_annotation',default=0,type=int,help='Inclusion of VEP regulatory annotations (0/1)')
+ parser.add_argument('debug',default=0,type=int,help='Output extensive debugging information')
args = parser.parse_args()
- extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction, args.regulatory_annotation)
+ extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction, args.oncogenicity_annotation, args.regulatory_annotation, args.debug)
-def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, regulatory_annotation = 0):
+def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, oncogenicity_annotation = 0, regulatory_annotation = 0, debug = 0):
"""
Function that reads VEP/vcfanno-annotated VCF and extends the VCF INFO column with tags from
1. CSQ elements within the primary transcript consequence picked by VEP, e.g. SYMBOL, Feature, Gene, Consequence etc.
- 2. Gene annotations, e.g. known oncogenes/tumor suppressors, curated disease associations (DisGenet), MIM phenotype associations etc
- 3. Protein-relevant annotations, e.g. c functional protein features etc.
+ 2. Gene annotations, e.g. known oncogenes/tumor suppressors
+ 3. Cancer hotspot mutation sites
4. Variant effect predictions
"""
@@ -44,21 +47,27 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r
for tag in vcf_infotags_meta:
if lof_prediction == 0 and regulatory_annotation == 0:
if not tag.startswith('LoF') and not tag.startswith('REGULATORY_'):
- vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+ vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \
+ 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
elif lof_prediction == 1 and regulatory_annotation == 0:
if not tag.startswith('REGULATORY_'):
- vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+ vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \
+ 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
elif lof_prediction == 0 and regulatory_annotation == 1:
if not tag.startswith('LoF'):
- vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+ vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \
+ 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
else:
- vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+ vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \
+ 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
w = Writer(out_vcf, vcf)
current_chrom = None
num_chromosome_records_processed = 0
num_records_filtered = 0
+
+ cancer_hotspots = annoutils.read_cancer_hotspots(logger, os.path.join(gvanno_db_directory,'cancer_hotspots', 'cancer_hotspots.tsv'))
vcf_info_element_types = {}
for e in vcf.header_iter():
@@ -79,9 +88,6 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r
num_chromosome_records_processed = 0
if rec.INFO.get('CSQ') is None:
num_records_filtered = num_records_filtered + 1
- #alt_allele = ','.join(rec.ALT)
- #pos = rec.start + 1
- #variant_id = 'g.' + str(rec.CHROM) + ':' + str(pos) + str(rec.REF) + '>' + alt_allele
#logger.warning('Variant record ' + str(variant_id) + ' does not have CSQ tag from Variant Effect Predictor (--vep_skip_intergenic or --vep_coding_only turned ON?) - variant will be skipped')
continue
num_chromosome_records_processed += 1
@@ -90,30 +96,44 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r
if regulatory_annotation == 1:
csq_record_results_all = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = False, csq_identifier = 'CSQ')
- if 'vep_block' in csq_record_results_all:
- vep_csq_records_all = csq_record_results_all['vep_block']
+ if 'picked_gene_csq' in csq_record_results_all:
+ vep_csq_records_all = csq_record_results_all['picked_gene_csq']
rec.INFO['REGULATORY_ANNOTATION'] = annoutils.map_regulatory_variant_annotations(vep_csq_records_all)
- csq_record_results_pick = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')
-
- if 'vep_all_csq' in csq_record_results_pick:
- rec.INFO['VEP_ALL_CSQ'] = ','.join(csq_record_results_pick['vep_all_csq'])
- if 'vep_block' in csq_record_results_pick:
- vep_csq_records = csq_record_results_pick['vep_block']
+ vep_csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ', debug = debug)
- block_idx = 0
- record = vep_csq_records[block_idx]
- for k in record:
+
+ principal_hgvsp = '.'
+ principal_hgvsc = '.'
+ if 'picked_csq' in vep_csq_record_results:
+ csq_record = vep_csq_record_results['picked_csq']
+ for k in csq_record:
if k in vcf_info_element_types:
- if vcf_info_element_types[k] == "Flag" and record[k] == "1":
- rec.INFO[k] = True
+ if vcf_info_element_types[k] == "Flag":
+ #rec.INFO[k] = False
+ if csq_record[k] == "1":
+ rec.INFO[k] = True
else:
- if not record[k] is None:
- rec.INFO[k] = record[k]
+ if not csq_record[k] is None:
+ rec.INFO[k] = csq_record[k]
+
+ if k == 'HGVSp_short':
+ principal_hgvsp = csq_record[k]
+
+ if k == 'HGVSc':
+ principal_hgvsc = csq_record[k].split(':')[1]
+ #else:
+ # print("missing\t" + str(k))
+
+ if 'all_csq' in vep_csq_record_results:
+ rec.INFO['VEP_ALL_CSQ'] = ','.join(vep_csq_record_results['all_csq'])
+ annoutils.map_cancer_hotspots(vep_csq_record_results['all_csq'], cancer_hotspots, rec, principal_hgvsp, principal_hgvsc)
if not rec.INFO.get('DBNSFP') is None:
annoutils.map_variant_effect_predictors(rec, dbnsfp_prediction_algorithms)
+ if oncogenicity_annotation == 1:
+ oncogenicity.assign_oncogenicity_evidence(rec, tumortype = "Any")
w.write_record(rec)
w.close()
logger.info('Completed summary of functional annotations for ' + str(num_chromosome_records_processed) + ' variants on chromosome ' + str(current_chrom))
diff --git a/src/gvanno/gvanno_vcfanno.py b/src/gvanno/gvanno_vcfanno.py
index 7840a21..d972a5d 100755
--- a/src/gvanno/gvanno_vcfanno.py
+++ b/src/gvanno/gvanno_vcfanno.py
@@ -20,10 +20,8 @@ def __main__():
parser.add_argument("--ncer",action = "store_true", help="Annotate VCF with ranking of variant deleteriousness in non-coding regions (ncER)")
parser.add_argument("--clinvar",action = "store_true", help="Annotate VCF with annotations from ClinVar")
parser.add_argument("--dbnsfp",action = "store_true", help="Annotate VCF with annotations from database of non-synonymous functional predictions")
- #parser.add_argument("--uniprot",action = "store_true", help="Annotate VCF with protein functional features from the UniProt Knowledgebase")
parser.add_argument("--gvanno_xref",action = "store_true", help="Annotate VCF with transcript annotations from gvanno (protein complexes, disease associations, etc)")
parser.add_argument("--gwas",action = "store_true", help="Annotate VCF with against known loci associated with cancer, as identified from genome-wide association studies (GWAS)")
- parser.add_argument("--cancer_hotspots",action = "store_true", help="Annotate VCF with mutation hotspots from cancerhotspots.org")
args = parser.parse_args()
@@ -31,7 +29,7 @@ def __main__():
vcfheader_file = args.out_vcf + '.tmp.' + str(random.randrange(0,10000000)) + '.header.txt'
vcfanno_conf_fname = args.out_vcf + '.tmp.conf.toml'
print_vcf_header(args.query_vcf, vcfheader_file, chromline_only = False)
- run_vcfanno(args.num_processes, args.query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, args.gvanno_db_dir, args.out_vcf, args.ncer, args.clinvar, args.dbnsfp, args.gvanno_xref,args.gwas, args.cancer_hotspots)
+ run_vcfanno(args.num_processes, args.query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, args.gvanno_db_dir, args.out_vcf, args.ncer, args.clinvar, args.dbnsfp, args.gvanno_xref, args.gwas)
def prepare_vcfanno_configuration(vcfanno_data_directory, vcfanno_conf_fname, vcfheader_file, logger, datasource_info_tags, query_info_tags, datasource):
@@ -41,7 +39,7 @@ def prepare_vcfanno_configuration(vcfanno_data_directory, vcfanno_conf_fname, vc
append_to_conf_file(datasource, datasource_info_tags, vcfanno_data_directory, vcfanno_conf_fname)
append_to_vcf_header(vcfanno_data_directory, datasource, vcfheader_file)
-def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, gvanno_db_directory, output_vcf, ncer, clinvar, dbnsfp, gvanno_xref,gwas, cancer_hotspots):
+def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, gvanno_db_directory, output_vcf, ncer, clinvar, dbnsfp, gvanno_xref, gwas):
"""
Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files
"""
@@ -52,10 +50,7 @@ def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, v
gvanno_xref_info_tags = ["GVANNO_XREF"]
gwas_info_tags = ["GWAS_HIT"]
ncer_info_tags = ["NCER_PERCENTILE"]
- cancer_hotspots_info_tags = ["MUTATION_HOTSPOT","MUTATION_HOTSPOT_TRANSCRIPT","MUTATION_HOTSPOT_CANCERTYPE"]
-
- if cancer_hotspots is True:
- prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, cancer_hotspots_info_tags, query_info_tags, "cancer_hotspots")
+
if clinvar is True:
prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, clinvar_info_tags, query_info_tags, "clinvar")
if dbnsfp is True:
diff --git a/src/gvanno/lib/annoutils.py b/src/gvanno/lib/annoutils.py
index 2b8362d..5678c9a 100755
--- a/src/gvanno/lib/annoutils.py
+++ b/src/gvanno/lib/annoutils.py
@@ -9,7 +9,9 @@
csv.field_size_limit(500 * 1024 * 1024)
-threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H','Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W','Tyr':'Y','Val':'V','Ter':'X'}
+threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H',
+ 'Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W',
+ 'Tyr':'Y','Val':'V','Ter':'X'}
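The hunks below rely on a `threeToOneAA` helper (called in `map_cancer_hotspots`) whose body is not part of this patch; a minimal sketch of such a conversion over the same lookup table might look like the following (function name and behaviour are assumptions, not gvanno's actual implementation):

```python
THREE_TO_ONE = {'Ala': 'A', 'Arg': 'R', 'Asn': 'N', 'Asp': 'D', 'Cys': 'C', 'Glu': 'E',
                'Gln': 'Q', 'Gly': 'G', 'His': 'H', 'Ile': 'I', 'Leu': 'L', 'Lys': 'K',
                'Met': 'M', 'Phe': 'F', 'Pro': 'P', 'Ser': 'S', 'Thr': 'T', 'Trp': 'W',
                'Tyr': 'Y', 'Val': 'V', 'Ter': 'X'}

def three_to_one_aa(hgvsp):
    # Replace each three-letter residue code in an HGVSp string with its
    # one-letter equivalent, e.g. 'p.Val600Glu' -> 'p.V600E'
    for three, one in THREE_TO_ONE.items():
        hgvsp = hgvsp.replace(three, one)
    return hgvsp
```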
@@ -23,6 +25,7 @@ def read_genexref_namemap(logger, gene_xref_namemap_tsv):
reader = csv.DictReader(tsvfile, delimiter='\t')
for row in reader:
namemap_xref[row['name']] = int(row['index'])
+ tsvfile.close()
return namemap_xref
@@ -45,12 +48,13 @@ def read_infotag_file(logger, vcf_info_tags_tsv):
for row in reader:
if not row['tag'] in info_tag_xref:
info_tag_xref[row['tag']] = row
+ tsvfile.close()
return info_tag_xref
def check_subprocess(command):
- #if debug:
- #logger.info(command)
+ # if debug:
+ # logger.info(command)
try:
output = subprocess.check_output(str(command), stderr=subprocess.STDOUT, shell=True)
if len(output) > 0:
@@ -79,7 +83,6 @@ def write_pass_vcf(annotated_vcf, logger):
Function prints all PASS variants from a VCF file. New VCF file appends '.pass.' to the filename.
"""
- #out_vcf = re.sub(r'\.annotated\.vcf\.gz$','.annotated.pass.vcf',annotated_vcf)
out_vcf = re.sub(r'\.vcf\.gz$','.pass.vcf',annotated_vcf)
vcf = VCF(annotated_vcf)
w = Writer(out_vcf, vcf)
@@ -268,7 +271,7 @@ def get_correct_cpg_transcript(vep_csq_records):
return csq_idx
- ## some variants iare assigned multiple transcript consequences
+ ## some variants are assigned multiple transcript consequences
## if cancer predisposition genes are in the vicinity of other genes, choose the cancer predisposition gene
   ## if there are neighbouring cancer-predisposition genes, choose the custom gene, preferring coding change (see below, KLLN/PTEN, XPC/TMEM43, NTHL1/TSC2)
csq_idx_dict = {}
@@ -576,15 +579,24 @@ def make_transcript_xref_map(rec, fieldmap, xref_tag = 'PCGR_ONCO_XREF'):
return(transcript_xref_map)
def vep_dbnsfp_meta_vcf(query_vcf, info_tags_wanted):
- vep_to_pcgr_af = {'gnomAD_AMR_AF':'AMR_AF_GNOMAD',
- 'gnomAD_AFR_AF':'AFR_AF_GNOMAD',
- 'gnomAD_EAS_AF':'EAS_AF_GNOMAD',
- 'gnomAD_NFE_AF':'NFE_AF_GNOMAD',
- 'gnomAD_AF':'GLOBAL_AF_GNOMAD',
- 'gnomAD_SAS_AF':'SAS_AF_GNOMAD',
- 'gnomAD_OTH_AF':'OTH_AF_GNOMAD',
- 'gnomAD_ASJ_AF':'ASJ_AF_GNOMAD',
- 'gnomAD_FIN_AF':'FIN_AF_GNOMAD',
+ vep_to_pcgr_af = {'gnomADe_AMR_AF':'AMR_AF_GNOMAD',
+ 'gnomADe_AFR_AF':'AFR_AF_GNOMAD',
+ 'gnomADe_EAS_AF':'EAS_AF_GNOMAD',
+ 'gnomADe_NFE_AF':'NFE_AF_GNOMAD',
+ 'gnomADe_AF':'GLOBAL_AF_GNOMAD',
+ 'gnomADe_SAS_AF':'SAS_AF_GNOMAD',
+ 'gnomADe_OTH_AF':'OTH_AF_GNOMAD',
+ 'gnomADe_ASJ_AF':'ASJ_AF_GNOMAD',
+ 'gnomADe_FIN_AF':'FIN_AF_GNOMAD',
+                     'gnomADg_AMR_AF':'AMR_AF_GNOMADg',
+ 'gnomADg_AFR_AF':'AFR_AF_GNOMADg',
+ 'gnomADg_EAS_AF':'EAS_AF_GNOMADg',
+ 'gnomADg_NFE_AF':'NFE_AF_GNOMADg',
+ 'gnomADg_AF':'GLOBAL_AF_GNOMADg',
+ 'gnomADg_SAS_AF':'SAS_AF_GNOMADg',
+ 'gnomADg_OTH_AF':'OTH_AF_GNOMADg',
+ 'gnomADg_ASJ_AF':'ASJ_AF_GNOMADg',
+ 'gnomADg_FIN_AF':'FIN_AF_GNOMADg',
'AFR_AF':'AFR_AF_1KG',
'AMR_AF':'AMR_AF_1KG',
'SAS_AF':'SAS_AF_1KG',
@@ -628,14 +640,33 @@ def vep_dbnsfp_meta_vcf(query_vcf, info_tags_wanted):
return vep_dbnsfp_meta_info
-def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ'):
+def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ', debug = 0):
all_csq_pick = []
all_transcript_consequences = []
+
+ varkey = str(rec.CHROM) + '_' + str(rec.POS) + '_' + str(rec.REF) + '_' + str(','.join(rec.ALT))
+
for csq in rec.INFO.get(csq_identifier).split(','):
csq_fields = csq.split('|')
+ entrezgene = '.'
+
+      ## The Entrez gene identifier is not provided directly by VEP; pull it out of 'transcript_xref_map' for the given transcript-specific CSQ block
+      ## - used for the 'consequence_entry' objects that are added to the 'all_csq' array
+ k = 0
+ while(k < len(csq_fields)):
+ if k in vep_csq_fields_map['index2field']:
+ if vep_csq_fields_map['index2field'][k] == 'Feature':
+ ensembl_transcript_id = csq_fields[k]
+ if ensembl_transcript_id != '' and ensembl_transcript_id.startswith('ENST'):
+ if ensembl_transcript_id in transcript_xref_map.keys():
+ if 'ENTREZGENE' in transcript_xref_map[ensembl_transcript_id].keys():
+ entrezgene = transcript_xref_map[ensembl_transcript_id]['ENTREZGENE']
+ k = k + 1
+
+
if pick_only is False:
j = 0
csq_record = {}
@@ -649,7 +680,7 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl
else:
## loop over VEP consequence blocks PICK'ed according to VEP's ranking scheme
- if csq_fields[vep_csq_fields_map['field2index']['PICK']] == "1": ## only consider the primary/picked consequence when expanding with annotation tags
+ if csq_fields[vep_csq_fields_map['field2index']['PICK']] == "1": ## only consider the primary/picked consequence(s) when expanding with annotation tags
j = 0
csq_record = {}
@@ -666,7 +697,10 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl
## assign additional gene/transcript annotations from the custom transcript
## xref map (PCGR/CPSR/gvanno) as key,value pairs in the csq_record object
csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation]
+
else:
               ## If VEP does not provide SYMBOL, append the SYMBOL provided by the xref map
if csq_record['SYMBOL'] is None:
csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation]
@@ -706,27 +740,246 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl
assign_cds_exon_intron_annotations(csq_record)
## Append transcript consequence to all_csq_pick
all_csq_pick.append(csq_record)
- symbol = '.'
- hgvsc = ".:."
- hgvsp = ".:."
+ symbol = "."
+ hgvsc = "."
+ hgvsp = "."
if csq_fields[vep_csq_fields_map['field2index']['SYMBOL']] != "":
symbol = str(csq_fields[vep_csq_fields_map['field2index']['SYMBOL']])
if csq_fields[vep_csq_fields_map['field2index']['HGVSc']] != "":
- hgvsc = str(csq_fields[vep_csq_fields_map['field2index']['HGVSc']])
+ hgvsc = str(csq_fields[vep_csq_fields_map['field2index']['HGVSc']].split(':')[1])
if csq_fields[vep_csq_fields_map['field2index']['HGVSp']] != "":
- hgvsp = str(csq_fields[vep_csq_fields_map['field2index']['HGVSp']])
- consequence_entry = (str(csq_fields[vep_csq_fields_map['field2index']['Consequence']]) + ':' +
- str(symbol) + ':' +
- str(hgvsc) + ':' +
- str(hgvsp) + ':' +
- str(csq_fields[vep_csq_fields_map['field2index']['Feature_type']]) + ':' +
- str(csq_fields[vep_csq_fields_map['field2index']['Feature']]) + ':' +
+ hgvsp = str(csq_fields[vep_csq_fields_map['field2index']['HGVSp']].split(':')[1])
+ consequence_entry = (str(csq_fields[vep_csq_fields_map['field2index']['Consequence']]) + ":" +
+ str(symbol) + ":" +
+ str(entrezgene) + ":" +
+ str(hgvsc) + ":" +
+ str(hgvsp) + ":" +
+ str(csq_fields[vep_csq_fields_map['field2index']['Feature_type']]) + ":" +
+ str(csq_fields[vep_csq_fields_map['field2index']['Feature']]) + ":" +
str(csq_fields[vep_csq_fields_map['field2index']['BIOTYPE']]))
all_transcript_consequences.append(consequence_entry)
-
+
+
+ vep_chosen_csq_idx = 0
vep_csq_results = {}
- vep_csq_results['vep_block'] = all_csq_pick
- vep_csq_results['vep_all_csq'] = all_transcript_consequences
+ vep_csq_results['picked_gene_csq'] = all_csq_pick
+ vep_csq_results['all_csq'] = all_transcript_consequences
+ vep_csq_results['picked_csq'] = None
+
+    ## If multiple transcript-specific variant consequences are highlighted by --pick_allele_gene,
+    ## prioritize/choose the block of consequences which has
+    ## - a gene with BIOTYPE equal to 'protein_coding' (the other picked transcript/gene may potentially carry another BIOTYPE)
+    ## - a gene consequence classified as 'exonic' (the other picked transcript/gene likely carries a nonexonic consequence)
+ if len(vep_csq_results['picked_gene_csq']) > 0:
+ vep_selected_idx = {}
+ vep_selected_idx['exonic_status'] = {}
+ vep_selected_idx['consequence'] = {}
+
+
+ i = 0
+ picked_blocks = []
+ for trans_rec in vep_csq_results['picked_gene_csq']:
+
+ if debug:
+ biotype = '.'
+ consequence = '.'
+ exonic_status = '.'
+ genesymbol = '.'
+ distance = '.'
+ feature = '.'
+ if not trans_rec['Consequence'] is None:
+ consequence = trans_rec['Consequence']
+ if not trans_rec['SYMBOL'] is None:
+ genesymbol = trans_rec['SYMBOL']
+ if not trans_rec['BIOTYPE'] is None:
+ biotype = trans_rec['BIOTYPE']
+ if not trans_rec['EXONIC_STATUS'] is None:
+ exonic_status = trans_rec['EXONIC_STATUS']
+ if not trans_rec['DISTANCE'] is None:
+ distance = trans_rec['DISTANCE']
+ if not trans_rec['Feature'] is None:
+ feature = trans_rec['Feature']
+ block = str(genesymbol) + ':' + str(feature) + ':' + str(consequence) + ':' + str(distance) + ':' + str(biotype) + ':' + str(exonic_status)
+ picked_blocks.append(block)
+
+
+ if 'BIOTYPE' in trans_rec and 'Consequence' in trans_rec and 'EXONIC_STATUS' in trans_rec:
+ if not trans_rec['BIOTYPE'] is None and not trans_rec['Consequence'] is None:
+ if trans_rec['BIOTYPE'] == "protein_coding":
+ ## for protein-coding genes - record the exonic variant consequence status (exonic/nonexonic)
+ vep_selected_idx['exonic_status'][i] = trans_rec['EXONIC_STATUS']
+ vep_selected_idx['consequence'][i] = trans_rec['Consequence']
+ i = i + 1
+
+
+ if debug:
+ print(str(varkey) + " - picked CSQ blocks: " + ' /// '.join(picked_blocks))
+
+
+ ## when multiple transcript gene blocks are picked by VEP, prioritize the block with 'exonic' consequence
+ if len(vep_selected_idx['exonic_status'].keys()) > 1:
+ exonic_cons_found = 0
+ for j in vep_selected_idx['exonic_status'].keys():
+ if vep_selected_idx['exonic_status'][j] == 'exonic':
+ exonic_cons_found = 1
+ vep_chosen_csq_idx = j
+
+ ## if multiple non-exonic variants are found, prioritize UTR variants over other nonexonic
+ if exonic_cons_found == 0:
+ for j in vep_selected_idx['consequence'].keys():
+ if vep_selected_idx['consequence'][j] == '5_prime_UTR_variant' or vep_selected_idx['consequence'][j] == '3_prime_UTR_variant':
+ vep_chosen_csq_idx = j
+
+ else:
+ if len(vep_selected_idx['exonic_status'].keys()) == 1:
+ for k in vep_selected_idx['exonic_status'].keys():
+ vep_chosen_csq_idx = k
+
+ vep_csq_results['picked_csq'] = vep_csq_results['picked_gene_csq'][vep_chosen_csq_idx]
+
+ else:
+        logger.error('No VEP block chosen by --pick_allele_gene')
return(vep_csq_results)
+
+
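The prioritization applied above can be illustrated standalone; the consequence blocks below are invented sample data (not gvanno output), and the sketch mirrors the rule of preferring an 'exonic' block, with a UTR consequence as fallback when every picked block is nonexonic:

```python
# Invented sample of two consequence blocks picked by VEP's --pick_allele_gene;
# the second block is exonic and should win.
blocks = [
    {'EXONIC_STATUS': 'nonexonic', 'Consequence': '5_prime_UTR_variant'},
    {'EXONIC_STATUS': 'exonic', 'Consequence': 'missense_variant'},
]

chosen_idx = 0
exonic_found = False
for i, block in enumerate(blocks):
    # prefer any block with an exonic consequence
    if block['EXONIC_STATUS'] == 'exonic':
        chosen_idx = i
        exonic_found = True
if not exonic_found:
    # all blocks nonexonic: fall back to a UTR consequence if one exists
    for i, block in enumerate(blocks):
        if block['Consequence'] in ('5_prime_UTR_variant', '3_prime_UTR_variant'):
            chosen_idx = i
```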
+def read_cancer_hotspots(logger, hotspots_tsv):
+
+ ## load cancer hotspot entries from 'cancer_hotspots.tsv' (provided by github.com/sigven/cancerHotspots)
+
+ hotspots = {} ##dictionary returned
+ mutation_hotspots = {}
+ codon_hotspots = {}
+ splice_hotspots = {}
+ if not os.path.exists(hotspots_tsv):
+ logger.info("ERROR: File '" + str(hotspots_tsv) + "' does not exist - exiting")
+ exit(1)
+ tsvfile = open(hotspots_tsv, 'r')
+ reader = csv.DictReader(tsvfile, delimiter='\t')
+ for row in reader:
+ mutation_hotspots[str(row['entrezgene']) + '-' + row['hgvsp2']] = row
+ codon_hotspots[str(row['entrezgene']) + '-' + row['codon']] = row
+ if row['hgvsc'] != '.':
+ splice_hotspots[str(row['entrezgene']) + '-' + str(row['hgvsc'])] = row
+ tsvfile.close()
+
+ hotspots['mutation'] = mutation_hotspots
+ hotspots['codon'] = codon_hotspots
+ hotspots['splice'] = splice_hotspots
+ return hotspots
+
+
+def map_cancer_hotspots(transcript_csq_elements, cancer_hotspots, rec, principal_hgvsp, principal_hgvsc):
+
+ unique_hotspot_mutations = {}
+ unique_hotspot_codons = {}
+
+ principal_codon = '.'
+    if re.match(r'^(p\.[A-Z]{1}[0-9]{1,}[A-Za-z]{1,})', principal_hgvsp):
+ codon_match = re.findall(r'[A-Z][0-9]{1,}',principal_hgvsp)
+ if len(codon_match) == 1:
+ principal_codon = codon_match[0]
+
+ ## loop through all transcript-specific consequences ('csq_elements') for a given variant, check for the presence of
+ ## 1. Exonic, protein-altering mutations (dictionary key = entrezgene + hgvsp) that overlap known cancer hotspots (https://github.com/sigven/cancerHotspots)
+  ##    - assess whether a potential hit reflects the principal hgvsp ('by_hgvsp_principal') or an alternative hgvsp ('by_hgvsp_nonprincipal')
+ ## 2. Splice site mutations (dictionary key = entrezgene + hgvsc) that overlap known cancer hotspots (NOTE: only a subset have been curated)
+  ##    - assess whether a potential hit reflects the principal hgvsc ('by_hgvsc_principal') or an alternative hgvsc ('by_hgvsc_nonprincipal')
+
+ for csq in transcript_csq_elements:
+ (consequence, symbol, entrezgene, hgvsc, hgvsp, feature_type, feature, biotype) = csq.split(':')
+
+      if not re.search(r'^(missense|stop|start|inframe|splice_donor|splice_acceptor|frameshift)', consequence):
+ continue
+
+ hgvsp_short = threeToOneAA(hgvsp)
+ hotspot_key_mutation = "."
+ codon_match = []
+ if entrezgene != "." and hgvsp != ".":
+ hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsp_short)
+         codon_match = re.findall(r'p\.[A-Z][0-9]{1,}',hgvsp_short)
+
+ if entrezgene != "." and (consequence == 'splice_donor_variant' or consequence == 'splice_acceptor_variant'):
+ hgvsc_key = re.sub(r'>(A|G|C|T)$','',hgvsc)
+ hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsc_key)
+
+ if hotspot_key_mutation == ".":
+ continue
+
+ if hotspot_key_mutation in cancer_hotspots['mutation']:
+ hotspot_info = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT2']
+ hotspot_info_ttype = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE']
+ unique_hotspot_mutations['exonic|' + str(hotspot_info)] = hotspot_info_ttype
+
+ if hotspot_key_mutation in cancer_hotspots['splice']:
+ hotspot_info = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT2']
+ hotspot_info_ttype = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE']
+ unique_hotspot_mutations['splice|' + str(hotspot_info)] = hotspot_info_ttype
+
+
+ if len(codon_match) > 0:
+ hotspot_key_codon = str(entrezgene) + '-' + str(codon_match[0])
+
+ if hotspot_key_codon in cancer_hotspots['codon']:
+ unique_hotspot_codons[str('exonic|') + cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT2']] = \
+ cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT_CANCERTYPE']
+
+ if len(unique_hotspot_mutations.keys()) > 0:
+ if len(unique_hotspot_mutations.keys()) == 1:
+ for gene_mutation_key in unique_hotspot_mutations.keys():
+ rec.INFO['MUTATION_HOTSPOT'] = gene_mutation_key
+ rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[gene_mutation_key]
+
+ if gene_mutation_key.split('|')[0] == 'exonic':
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_nonprincipal'
+ hgvsp_candidate = 'p.' + str(gene_mutation_key.split('|')[3]) + str(gene_mutation_key.split('|')[4])
+ if hgvsp_candidate == principal_hgvsp:
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal'
+ else:
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_nonprincipal'
+ hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(gene_mutation_key.split('|')[4]))
+ if hgvsc_candidate == principal_hgvsc:
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal'
+ else:
+ ## multiple hotspot matches for alternative hgvsp keys
+ ## - keep only match against principal HGVSp
+ for hotspot_info in unique_hotspot_mutations.keys():
+ if hotspot_info.split('|')[0] == 'exonic':
+ hgvsp_candidate = "p." + str(hotspot_info.split('|')[3]) + str(hotspot_info.split('|')[4])
+
+ if hgvsp_candidate == principal_hgvsp:
+ rec.INFO['MUTATION_HOTSPOT'] = hotspot_info
+ rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info]
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal'
+ else:
+ hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(hotspot_info.split('|')[4]))
+
+ if hgvsc_candidate == principal_hgvsc:
+ rec.INFO['MUTATION_HOTSPOT'] = hotspot_info
+ rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info]
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal'
+
+ else:
+ if len(unique_hotspot_codons.keys()) > 0:
+ if len(unique_hotspot_codons.keys()) == 1:
+ for gene_codon_key in unique_hotspot_codons.keys():
+
+ if '|' in gene_codon_key:
+
+ codon = str(gene_codon_key.split('|')[3])
+
+ if codon == principal_codon:
+ rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key
+ rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key]
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_principal'
+ else:
+ rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key
+ rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key]
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_nonprincipal'
+
+
+ return
+
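The three dictionaries returned by `read_cancer_hotspots()` are keyed on the Entrez gene identifier joined with an amino-acid change, a codon, or an HGVSc string. A self-contained sketch of the lookup performed in `map_cancer_hotspots()`, using invented entries (the field values are illustrative, not real cancerhotspots data):

```python
# Invented hotspot entries mimicking the dictionary layout built by
# read_cancer_hotspots(); real values come from github.com/sigven/cancerHotspots.
hotspots = {
    'mutation': {
        '673-p.V600E': {'MUTATION_HOTSPOT2': 'BRAF|V600E',
                        'MUTATION_HOTSPOT_CANCERTYPE': 'Melanoma|100|90'},
    },
    'codon': {
        '673-V600': {'MUTATION_HOTSPOT2': 'BRAF|V600',
                     'MUTATION_HOTSPOT_CANCERTYPE': 'Melanoma|100|'},
    },
    'splice': {},
}

entrezgene = '673'
hgvsp_short = 'p.V600E'

# exact amino-acid-change lookup (entrezgene + HGVSp key)
mutation_hit = hotspots['mutation'].get(str(entrezgene) + '-' + hgvsp_short)

# codon-level lookup (entrezgene + codon key), used when only the residue matches
codon_hit = hotspots['codon'].get(str(entrezgene) + '-' + 'V600')
```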
diff --git a/src/gvanno/lib/oncogenicity.py b/src/gvanno/lib/oncogenicity.py
new file mode 100644
index 0000000..28aa347
--- /dev/null
+++ b/src/gvanno/lib/oncogenicity.py
@@ -0,0 +1,422 @@
+#!/usr/bin/env python
+
+import os,re,sys
+from cyvcf2 import VCF, Writer
+
+def assign_oncogenicity_evidence(rec = None, tumortype = "Any"):
+
+ clingen_vicc_ev_codes = [
+ "CLINGEN_VICC_SBVS1",
+ "CLINGEN_VICC_SBS1",
+ "CLINGEN_VICC_SBP1",
+ "CLINGEN_VICC_SBP2",
+ "CLINGEN_VICC_OS3",
+ "CLINGEN_VICC_OM2",
+ "CLINGEN_VICC_OM3",
+ "CLINGEN_VICC_OM4",
+ "CLINGEN_VICC_OP1",
+ "CLINGEN_VICC_OP3",
+ "CLINGEN_VICC_OP4",
+ "CLINGEN_VICC_OVS1"]
+
+    ### Evidence for benign effects of somatic variants
+
+ # 1) "CLINGEN_VICC_SBVS1"
+ ## Very high MAF: > 0.05 in gnomAD - any 5 general continental populations
+ ## AFR/AMR/EAS/NFE/SAS
+
+ # 2) "CLINGEN_VICC_SBS1"
+ ## High MAF: > 0.01 in gnomAD - any 5 general continental populations
+ ## AFR/AMR/EAS/NFE/SAS
+
+ # 3) "CLINGEN_VICC_SBP1"
+ ## Multiple lines of computational evidence support a benign
+ ## effect on the gene or gene product
+    ## (conservation, evolutionary, splicing impact, etc.) - from dbNSFP
+
+    # 4) "CLINGEN_VICC_SBP2"
+ ## Synonymous (silent) variant for which splicing prediction
+ ## algorithms predict no effect on the splice consensus sequence
+ ## nor the creation of a new splice site and the nucleotide is
+ ## not highly conserved
+
+ # 5) "CLINGEN_VICC_SBS2"
+    ## Well-established in vitro/in vivo functional studies show
+ ## no oncogenic effects
+
+    ## Evidence for oncogenic effects of somatic variants
+
+ # 6) "CLINGEN_VICC_OVS1"
+ ## Null variant - predicted as LoF by LOFTEE
+ ## - Nonsense, frameshift, canonical splice sites, initiation codon,
+ ## single-exon/multi-exon deletion
+ ## - Tumor suppressor gene
+
+ # 7) "CLINGEN_VICC_OS1"
+ ## Same amino acid change as previously established oncogenic variant
+ ## NOTE: Not relevant since current implementation does not consider
+ ## any criteria where nucleotide change is affecting annotation
+
+ # 8) "CLINGEN_VICC_OS2"
+    ## Well-established in vitro/in vivo functional studies show
+ ## oncogenic effect of the variant
+
+ # 9) "CLINGEN_VICC_OS3"
+ ## Located in a mutation hotspot
+ ## - >= 50 samples with a somatic variant at the AA position (cancerhotspots.org)
+ ## - Same amino acid change in >= 10 samples (cancerhotspots.org)
+
+ # 10) "CLINGEN_VICC_OM1"
+ ## Located in a critical and well-established part of a functional domain
+ ## (active site of an enzyme)
+ ## NOTE: Not used since determination/retrieval of critical protein sites are non-trivial to automate
+
+ # 11) "CLINGEN_VICC_OM2"
+ ## Protein length changes as a result of in-frame deletions/insertions in a
+ ## known oncogene/tumor suppressor genes or stop-loss variants in a
+ ## tumor suppressor gene
+
+ # 12) "CLINGEN_VICC_OM3"
+ ## Missense variant at an amino acid residue where a different missense
+ ## variant determined to be oncogenic (using this standard) has been
+ ## documented. Amino acid difference from reference amino acid should
+ ## be greater or at least approximately the same as for missense change
+ ## determined to be oncogenic.
+
+ # 13) "CLINGEN_VICC_OM4"
+ ## Located in a mutation hotspot
+ ## - < 50 samples with a somatic variant at the AA position (cancerhotspots.org)
+ ## - Same amino acid change in >= 10 samples (cancerhotspots.org)
+ ## - Not applicable if OM1 or OM3 is applicable
+
+ # 14) "CLINGEN_VICC_OP1"
+    ## All utilized lines of computational evidence support an oncogenic effect
+ ## of a variant (conservation, evolutionary, splicing effect)
+
+ # 15) "CLINGEN_VICC_OP2"
+ ## Somatic variant in a gene in a malignancy with a single genetic
+    ## etiology. Example: retinoblastoma is caused by bi-allelic
+ ## RB1 inactivation.
+
+ # 16) "CLINGEN_VICC_OP3"
+ ## Located in a mutation hotspot
+ ## - Same amino acid change in < 10 samples (cancerhotspots.org)
+ ## - Not applicable if OM1 or OM3 is applicable
+
+ # 17) "CLINGEN_VICC_OP4"
+ ## Absent from controls (gnomAD)
+ ## - Extremely low MAF
+
+
+ required_oncogenicity_vars = [
+ "Consequence",
+ "MUTATION_HOTSPOT",
+ "MUTATION_HOTSPOT_CANCERTYPE",
+ "SYMBOL",
+ "ONCOGENE",
+ "ONCOGENE_EVIDENCE",
+ "TUMOR_SUPPRESSOR",
+ "TUMOR_SUPPRESSOR_EVIDENCE",
+ "LoF",
+ "INTRON_POSITION",
+ "EXON_POSITION",
+ "EAS_AF_GNOMAD",
+ "NFE_AF_GNOMAD",
+ "AFR_AF_GNOMAD",
+ "AMR_AF_GNOMAD",
+ "SAS_AF_GNOMAD",
+ "DBNSFP_SIFT",
+ "DBNSFP_PROVEAN",
+ "DBNSFP_META_RNN",
+ "DBNSFP_FATHMM",
+ "DBNSFP_MUTATIONTASTER",
+ "DBNSFP_DEOGEN2",
+ "DBNSFP_PRIMATEAI",
+ "DBNSFP_MUTATIONASSESSOR",
+ "DBNSFP_FATHMM_MKL",
+ "DBNSFP_M_CAP",
+ "DBNSFP_LIST_S2",
+ "DBNSFP_BAYESDEL_ADDAF",
+ "DBNSFP_SPLICE_SITE_ADA",
+ "DBNSFP_SPLICE_SITE_RF"]
+
+ variant_data = {}
+ for col in required_oncogenicity_vars:
+ if rec.INFO.get(col) is None:
+ if col == "TUMOR_SUPPRESSOR" or col == "ONCOGENE":
+ variant_data[col] = False
+ elif col == "LoF":
+ variant_data['LOSS_OF_FUNCTION'] = False
+ else:
+ variant_data[col] = None
+ else:
+ if rec.INFO.get(col) == '':
+ variant_data[col] = True
+ else:
+ if col == "LoF":
+ if rec.INFO.get(col) == "HC":
+ variant_data["LOSS_OF_FUNCTION"] = True
+ else:
+ variant_data[col] = rec.INFO.get(col)
+
+ for code in clingen_vicc_ev_codes:
+ variant_data[code] = False
+
+ dbnsfp_minimum_majority = 7
+ dbnsfp_maximum_minority = 2
+ dbnsfp_minimum_algos_called = dbnsfp_minimum_majority
+
+ variant_data['N_INSILICO_CALLED'] = 0
+ variant_data['N_INSILICO_DAMAGING'] = 0
+ variant_data['N_INSILICO_TOLERATED'] = 0
+ variant_data['N_INSILICO_SPLICING_NEUTRAL'] = 0
+ variant_data['N_INSILICO_SPLICING_AFFECTED'] = 0
+
+ for algo in ['SIFT','PROVEAN','META_RNN',
+ 'MUTATIONTASTER','DEOGEN2',
+ 'PRIMATEAI','MUTATIONASSESSOR',
+ 'FATHMM_MKL','M_CAP',
+ 'LIST_S2','BAYESDEL_ADDAF',
+ 'SPLICE_SITE_RF','SPLICE_SITE_ADA']:
+ col = 'DBNSFP_' + str(algo)
+ if col in variant_data.keys():
+ if variant_data[col] != '.':
+ variant_data['N_INSILICO_CALLED'] += 1
+ if variant_data[col] == 'D':
+ variant_data['N_INSILICO_DAMAGING'] += 1
+ if variant_data[col] == 'T':
+ variant_data['N_INSILICO_TOLERATED'] += 1
+            if (algo == 'SPLICE_SITE_RF' or \
+               algo == 'SPLICE_SITE_ADA') and \
+               variant_data[col] == 'SN':
+               variant_data['N_INSILICO_SPLICING_NEUTRAL'] += 1
+            if (algo == 'SPLICE_SITE_RF' or \
+               algo == 'SPLICE_SITE_ADA') and \
+               variant_data[col] == 'AS':
+               variant_data['N_INSILICO_SPLICING_AFFECTED'] += 1
+
+ if (variant_data['N_INSILICO_CALLED'] > dbnsfp_minimum_algos_called and \
+ variant_data['N_INSILICO_DAMAGING'] >= dbnsfp_minimum_majority and \
+ variant_data['N_INSILICO_TOLERATED'] <= dbnsfp_maximum_minority and \
+ variant_data['N_INSILICO_SPLICING_NEUTRAL'] <= 1) or \
+ variant_data['N_INSILICO_SPLICING_AFFECTED'] == 2:
+ variant_data['CLINGEN_VICC_OP1'] = True
+
+ if variant_data['N_INSILICO_CALLED'] > dbnsfp_minimum_algos_called and \
+ variant_data['N_INSILICO_TOLERATED'] >= dbnsfp_minimum_majority and \
+ variant_data['N_INSILICO_DAMAGING'] <= dbnsfp_maximum_minority and \
+ variant_data['N_INSILICO_SPLICING_AFFECTED'] == 0:
+ variant_data['CLINGEN_VICC_SBP1'] = True
+
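The majority-vote rule applied above for OP1/SBP1 can be restated compactly. This sketch keeps the same thresholds as the code (at least 7 of the called dbNSFP algorithms agreeing, at most 2 dissenting) but deliberately omits the splice-prediction terms, and the function name is ours, not gvanno's:

```python
def insilico_consensus(n_called, n_damaging, n_tolerated,
                       minimum_majority=7, maximum_minority=2):
    # 'damaging' roughly corresponds to CLINGEN_VICC_OP1 above,
    # 'benign' to CLINGEN_VICC_SBP1 (splicing terms not modelled here)
    if n_called > minimum_majority and n_damaging >= minimum_majority \
            and n_tolerated <= maximum_minority:
        return 'damaging'
    if n_called > minimum_majority and n_tolerated >= minimum_majority \
            and n_damaging <= maximum_minority:
        return 'benign'
    return 'inconclusive'
```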
+ hotspot_mutated_samples = {}
+ hotspot_mutated_samples['aa_variant'] = {}
+ hotspot_mutated_samples['aa_site'] = {}
+ hotspot_mutated_samples['aa_variant']['Any'] = 0
+ hotspot_mutated_samples['aa_site']['Any'] = 0
+
+ if not variant_data['MUTATION_HOTSPOT_CANCERTYPE'] is None:
+ ttype_samples_in_hotspots = variant_data['MUTATION_HOTSPOT_CANCERTYPE'].split(',')
+ for ttype in ttype_samples_in_hotspots:
+ ttype_stats = ttype.split('|')
+ if len(ttype_stats) == 3:
+ ttype_hotspot = ttype_stats[0]
+ ttype_hotspot_samples_site = int(ttype_stats[1])
+ ttype_hotspot_samples_aa = ttype_hotspot_samples_site
+ if ttype_stats[2] != '':
+ ttype_hotspot_samples_aa = int(ttype_stats[2])
+
+ if ttype_hotspot != "Unknown":
+ hotspot_mutated_samples['aa_variant'][ttype_hotspot] = ttype_hotspot_samples_aa
+ hotspot_mutated_samples['aa_site'][ttype_hotspot] = ttype_hotspot_samples_site
+
+ hotspot_mutated_samples['aa_variant']['Any'] = hotspot_mutated_samples['aa_variant']['Any'] + ttype_hotspot_samples_aa
+ hotspot_mutated_samples['aa_site']['Any'] = hotspot_mutated_samples['aa_site']['Any'] + ttype_hotspot_samples_site
+
+
+ ttype = "Any"
+ if tumortype != "Any":
+ ttype = re.sub(r'/', '@', re.sub(r' ','_', tumortype))
+
+ if ttype in hotspot_mutated_samples['aa_variant'].keys() and \
+ ttype in hotspot_mutated_samples['aa_site'].keys():
+ if hotspot_mutated_samples['aa_variant'][ttype] >= 10 and \
+ hotspot_mutated_samples['aa_site'][ttype] >= 50:
+ variant_data['CLINGEN_VICC_OS3'] = True
+ if hotspot_mutated_samples['aa_variant'][ttype] >= 10 and \
+ hotspot_mutated_samples['aa_site'][ttype] < 50:
+ variant_data['CLINGEN_VICC_OM4'] = True
+ if hotspot_mutated_samples['aa_variant'][ttype] < 10 and \
+ hotspot_mutated_samples['aa_variant'][ttype] > 0:
+ variant_data['CLINGEN_VICC_OP3'] = True
+
+
+ if "EAS_AF_GNOMAD" in variant_data.keys() and \
+ "SAS_AF_GNOMAD" in variant_data.keys() and \
+ "AMR_AF_GNOMAD" in variant_data.keys() and \
+ "AFR_AF_GNOMAD" in variant_data.keys() and \
+ "NFE_AF_GNOMAD" in variant_data.keys():
+
+      ## check if variant has MAF >= 0.05 (SBVS1) or >= 0.01 (SBS1) in any of five major gnomAD subpopulations (exome set)
+ for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']:
+ if not variant_data[pop] is None:
+ ## MAF for this population >= 0.05
+ if float(variant_data[pop]) >= 0.05:
+ variant_data["CLINGEN_VICC_SBVS1"] = True
+ for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']:
+ if not variant_data[pop] is None:
+ ## MAF for this population >= 0.01 (< 0.05)
+ if float(variant_data[pop]) >= 0.01 and variant_data["CLINGEN_VICC_SBVS1"] is False:
+ variant_data["CLINGEN_VICC_SBS1"] = True
+
+ missing_pop_freq = 0
+ approx_zero_pop_freq = 0
+ for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']:
+ ## no MAF recorded in gnomAD for this population
+ if variant_data[pop] is None:
+ missing_pop_freq = missing_pop_freq + 1
+ else:
+ ## Very low MAF for this population
+ if float(variant_data[pop]) < 0.0001:
+ approx_zero_pop_freq = approx_zero_pop_freq + 1
+
+ ## check if variant is missing or with MAF approximately zero in all five major gnomAD subpopulations (exome set)
+ if missing_pop_freq == 5 or approx_zero_pop_freq == 5:
+ variant_data["CLINGEN_VICC_OP4"] = True
+
+ ## check if variant is a loss-of-function variant (LOFTEE) in a tumor suppressor gene (Cancer Gene Census/CancerMine)
+ if "TUMOR_SUPPRESSOR" in variant_data.keys() and \
+ "ONCOGENE" in variant_data.keys() and \
+ "LOSS_OF_FUNCTION" in variant_data.keys() and \
+ "Consequence" in variant_data.keys():
+
+ if variant_data['LOSS_OF_FUNCTION'] is True and variant_data['TUMOR_SUPPRESSOR'] is True:
+ variant_data['CLINGEN_VICC_OVS1'] = True
+
+ ## check if variant is creating a stop-lost or protein-length change in oncogene/tumor suppressor genes
+ if variant_data['CLINGEN_VICC_OVS1'] is False and \
+ ((re.match(r'^(inframe_deletion|inframe_insertion)', variant_data['Consequence']) and \
+ (variant_data['TUMOR_SUPPRESSOR'] is True or variant_data['ONCOGENE'] is True)) or \
+ (re.match(r'^(stop_lost)', variant_data['Consequence']) and \
+ variant_data['TUMOR_SUPPRESSOR'] is True)):
+ variant_data['CLINGEN_VICC_OM2'] = True
+
+ ## check if variant is silent (synonymous|splice) and outside critical splice region
+ if "INTRON_POSITION" in variant_data.keys() and \
+ "EXON_POSITION" in variant_data.keys() and \
+ "DBNSFP_SPLICE_SITE_RF" in variant_data.keys() and \
+ "Consequence" in variant_data.keys():
+
+ if (int(variant_data['INTRON_POSITION']) < 0 and int(variant_data['INTRON_POSITION']) < -3 or \
+ int(variant_data['INTRON_POSITION']) > 0 and int(variant_data['INTRON_POSITION']) > 6 or \
+ int(variant_data['EXON_POSITION']) < 0 and int(variant_data['EXON_POSITION']) < -2 or \
+ int(variant_data['EXON_POSITION']) > 0 and int(variant_data['EXON_POSITION']) > 1) and \
+ variant_data['DBNSFP_SPLICE_SITE_RF'] != "AS" and \
+ re.match(r'^(synonymous_variant|splice_region_variant)', variant_data['Consequence']):
+ variant_data['CLINGEN_VICC_SBP2'] = True
+
+
+ og_score_data = {}
+ og_score_data['code'] = \
+ ['CLINGEN_VICC_SBVS1','CLINGEN_VICC_SBS1',
+ 'CLINGEN_VICC_SBP1','CLINGEN_VICC_SBP2',
+ 'CLINGEN_VICC_OVS1','CLINGEN_VICC_OS3',
+ 'CLINGEN_VICC_OM2','CLINGEN_VICC_OM4',
+ 'CLINGEN_VICC_OP1','CLINGEN_VICC_OP3',
+ 'CLINGEN_VICC_OP4']
+ og_score_data['category'] = \
+ ['clinpop','clinpop',
+ 'funccomp','funcvar',
+ 'funcvar','funcvar',
+ 'funcvar','funcvar',
+ 'funccomp','funcvar',
+ 'clinpop']
+ og_score_data['pole'] = \
+ ['B','B','B','B',
+ 'P','P','P','P',
+ 'P','P','P']
+
+ og_score_data['description'] = \
+ ['Very high MAF (> 0.05 in gnomAD - any five major continental pops)',
+ 'High MAF (> 0.01 in gnomAD - any five major continental pops)',
+ 'Multiple lines (>=7) of computational evidence support a benign effect on the gene or gene product - from dbNSFP',
+ 'Silent and intronic changes outside of the consensus splice site',
+ 'Null variant - predicted as LoF by LOFTEE - in bona fide tumor suppressor gene',
+ 'Located in a mutation hotspot (cancerhotspots.org). >= 50 samples with a variant at AA position, >= 10 samples with same AA change',
+ 'Protein length changes from in-frame dels/ins in known oncogene/tumor suppressor genes or stop-loss variants in a tumor suppressor gene',
+ 'Located in a mutation hotspot (cancerhotspots.org). < 50 samples with a variant at AA position, >= 10 samples with same AA change.',
+ 'Multiple lines (>=7) of computational evidence support a damaging effect on the gene or gene product - from dbNSFP',
+ 'Located in a mutation hotspot (cancerhotspots.org). < 10 samples with the same amino acid change.',
+ 'Absent from controls (gnomAD) / very low MAF ( < 0.0001 in all five major subpopulations)']
+
+ og_score_data['score'] = \
+ [-8, -4, -1, -1, 8, 4, 2, 2, 1, 1, 1]
+
+ i = 0
+ oncogenicity_scores = {}
+ for code in og_score_data['code']:
+ oncogenicity_scores[code] = {}
+ for e in ['category','pole','description','score']:
+ oncogenicity_scores[code][e] = og_score_data[e][i]
+ if e == 'score':
+ oncogenicity_scores[code][e] = float(og_score_data[e][i])
+ i = i + 1
+
+ variant_data['ONCOGENICITY_CLASSIFICATION'] = "VUS"
+ variant_data["ONCOGENICITY_CLASSIFICATION_DOC"] = "."
+ variant_data["ONCOGENICITY_CLASSIFICATION_CODE"] = "."
+ variant_data["ONCOGENICITY_SCORE"] = "."
+ onc_score_pathogenic = 0
+ onc_score_benign = 0
+
+ for code in clingen_vicc_ev_codes:
+
+ if code in variant_data.keys():
+ if variant_data[code] is True:
+ score = oncogenicity_scores[code]['score']
+ pole = oncogenicity_scores[code]['pole']
+ if pole == "P":
+ onc_score_pathogenic = onc_score_pathogenic + score
+ if pole == "B":
+ onc_score_benign = onc_score_benign + score
+
+ for e in ['ONCOGENICITY_CLASSIFICATION_DOC',
+ 'ONCOGENICITY_CLASSIFICATION_CODE']:
+
+ if variant_data[e] == ".":
+ if e == 'ONCOGENICITY_CLASSIFICATION_DOC':
+ variant_data[e] = oncogenicity_scores[code]['description']
+ else:
+ variant_data[e] = code
+ else:
+ if e == 'ONCOGENICITY_CLASSIFICATION_DOC':
+ variant_data[e] = f'{variant_data[e]}|{oncogenicity_scores[code]["description"]}'
+ else:
+ variant_data[e] = f'{variant_data[e]}|{code}'
+
+ lb_upper_limit = -1
+ lb_lower_limit = -6
+ b_upper_limit = -7
+ #vus_lower_limit = 0
+ #vus_upper_limit = 4
+ lo_lower_limit = 6
+ lo_upper_limit = 9
+ o_lower_limit = 10
+
+ variant_data['ONCOGENICITY_SCORE'] = onc_score_benign + onc_score_pathogenic
+ if variant_data['ONCOGENICITY_SCORE'] <= lb_upper_limit and \
+ variant_data['ONCOGENICITY_SCORE'] >= lb_lower_limit:
+ variant_data['ONCOGENICITY_CLASSIFICATION'] = "Likely_Benign"
+ if variant_data['ONCOGENICITY_SCORE'] >= lo_lower_limit and \
+ variant_data['ONCOGENICITY_SCORE'] <= lo_upper_limit:
+ variant_data['ONCOGENICITY_CLASSIFICATION'] = "Likely_Oncogenic"
+ if variant_data['ONCOGENICITY_SCORE'] <= b_upper_limit:
+ variant_data['ONCOGENICITY_CLASSIFICATION'] = "Benign"
+ if variant_data['ONCOGENICITY_SCORE'] >= o_lower_limit:
+ variant_data['ONCOGENICITY_CLASSIFICATION'] = "Oncogenic"
+
+ for e in ['ONCOGENICITY_SCORE',
+ 'ONCOGENICITY_CLASSIFICATION',
+ 'ONCOGENICITY_CLASSIFICATION_CODE']:
+ rec.INFO[e] = variant_data[e]
+
+    return rec
\ No newline at end of file
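Editor's note: the threshold logic in the hunk above can be summarized in a small standalone sketch. This is an illustrative rewrite, not gvanno's own function; the `classify_oncogenicity` name is hypothetical, but the cutoffs mirror the limits in the code above (Benign &lt;= -7, Likely_Benign -6..-1, VUS 0..5, Likely_Oncogenic 6..9, Oncogenic &gt;= 10, per Horak et al. 2022).

```python
# Illustrative sketch (hypothetical helper, not part of gvanno): map a summed
# ClinGen/VICC evidence score to an oncogenicity classification, using the
# same numeric limits as the hunk above.
def classify_oncogenicity(score: float) -> str:
    if score <= -7:                # b_upper_limit
        return "Benign"
    if -6 <= score <= -1:          # lb_lower_limit .. lb_upper_limit
        return "Likely_Benign"
    if 6 <= score <= 9:            # lo_lower_limit .. lo_upper_limit
        return "Likely_Oncogenic"
    if score >= 10:                # o_lower_limit
        return "Oncogenic"
    return "VUS"                   # default: scores 0..5 (and the -7..-6 gap)
```

Note that, as in the original code, scores strictly between the benign and oncogenic bands fall through to the default "VUS".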
diff --git a/src/gvanno/vcf2tsv.py b/src/gvanno/vcf2tsv.py
deleted file mode 100755
index 7d5b544..0000000
--- a/src/gvanno/vcf2tsv.py
+++ /dev/null
@@ -1,302 +0,0 @@
-#!/usr/bin/env python
-
-import argparse
-from cyvcf2 import VCF, Writer
-import numpy as np
-import re
-import math
-import subprocess
-
-version = '0.3.7.1'
-
-
-def __main__():
- parser = argparse.ArgumentParser(description='Convert a VCF file with genomic variants to a file with tab-separated values (TSV). One entry (TSV line) per sample genotype', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
- parser.add_argument('query_vcf', help='Bgzipped input VCF file with query variants (SNVs/InDels)')
- parser.add_argument('out_tsv', help='Output TSV file with one line pr non-rejected sample genotype (Variant, genotype and annotation data as tab-separated values)')
- parser.add_argument("--skip_info_data",action = "store_true", help="Skip printing of data in INFO column")
- parser.add_argument("--skip_genotype_data", action="store_true", help="Skip printing of genotype_data (FORMAT columns)")
- parser.add_argument("--keep_rejected_calls", action="store_true", help="Print data for rejected calls")
- parser.add_argument("--print_data_type_header", action="store_true", help="Print a header line with data types of VCF annotations")
- parser.add_argument("--compress", action="store_true", help="Compress TSV file with gzip")
- parser.add_argument('--version', action='version', version='%(prog)s ' + str(version))
- args = parser.parse_args()
-
- vcf2tsv(args.query_vcf, args.out_tsv, args.skip_info_data, args.skip_genotype_data, args.keep_rejected_calls, args.compress, args.print_data_type_header)
-
-
-def check_subprocess(command):
- try:
- output = subprocess.check_output(str(command), stderr=subprocess.STDOUT, shell=True)
- if len(output) > 0:
- print (str(output.decode()).rstrip())
- except subprocess.CalledProcessError as e:
- print (e.output)
- exit(0)
-
-
-def vcf2tsv(query_vcf, out_tsv, skip_info_data, skip_genotype_data, keep_rejected_calls, compress, print_data_type_header):
-
- vcf = VCF(query_vcf, gts012 = True)
- out = open(out_tsv,'w')
-
- fixed_columns_header = ['CHROM','POS','ID','REF','ALT','QUAL','FILTER']
- fixed_columns_header_type = ['String','Integer','String','String','String','Float','String']
- samples = vcf.samples
- info_columns_header = []
- format_columns_header = []
- sample_columns_header = []
- column_types = {}
- gt_present_header = 0
-
- if len(samples) > 0:
- sample_columns_header.append('VCF_SAMPLE_ID')
-
- for e in vcf.header_iter():
- header_element = e.info()
- if 'ID' in header_element.keys() and 'HeaderType' in header_element.keys():
- if header_element['HeaderType'] == 'INFO' or header_element['HeaderType'] == 'FORMAT':
- column_types[header_element['ID']] = header_element['Type']
- if header_element['HeaderType'] == 'INFO':
- if skip_info_data is False:
- info_columns_header.append(header_element['ID'])
- if header_element['HeaderType'] == 'FORMAT':
- if len(sample_columns_header) > 0 and skip_genotype_data is False:
- if header_element['ID'] != 'GT':
- format_columns_header.append(header_element['ID'])
- else:
- gt_present_header = 1
-
- #header_line = '\t'.join(fixed_columns_header)
- header_tags = fixed_columns_header
- if skip_info_data is False:
- #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sorted(info_columns_header))
- header_tags = fixed_columns_header + sorted(info_columns_header)
- if len(sample_columns_header) > 0:
- if skip_genotype_data is False:
- #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sorted(info_columns_header)) + '\t' + '\t'.join(sample_columns_header) + '\t' + '\t'.join(sorted(format_columns_header)) + '\tGT'
- header_tags = fixed_columns_header + sorted(info_columns_header) + sample_columns_header + sorted(format_columns_header) + ['GT']
- else:
- #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sorted(info_columns_header))
- header_tags = fixed_columns_header + sorted(info_columns_header)
- else:
- if len(sample_columns_header) > 0:
- if skip_genotype_data is False:
- #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sample_columns_header) + '\t' + '\t'.join(sorted(format_columns_header)) + '\tGT'
- header_tags = fixed_columns_header + sample_columns_header + sorted(format_columns_header) + ['GT']
- else:
- #header_line = '\t'.join(fixed_columns_header)
- header_tags = fixed_columns_header
- header_line = '\t'.join(header_tags)
-
- out.write('#https://github.com/sigven/vcf2tsv version=' + str(version) + '\n')
- if print_data_type_header is True:
- #header_tags = header_line.rstrip().split('\t')
- header_types = []
- for h in header_tags:
- if h in column_types:
- header_types.append(str(column_types[h]))
- #header_line_type = '\t'.join(fixed_columns_header_type) + '\t' + '\t'.join(header_types)
- header_line_type = '\t'.join(fixed_columns_header_type + header_types)
- out.write('#' + str(header_line_type) + '\n')
- out.write(str(header_line) + '\n')
- else:
- out.write(str(header_line) + '\n')
-
- for rec in vcf:
- rec_id = '.'
- rec_qual = '.'
- rec_filter = '.'
- alt = ",".join(str(n) for n in rec.ALT)
- if not rec.ID is None:
- rec_id = str(rec.ID)
- if not rec.QUAL is None:
- rec_qual = str("{0:.2f}".format(rec.QUAL))
- rec_filter = str(rec.FILTER)
- if rec.FILTER is None:
- rec_filter = 'PASS'
-
- pos = int(rec.start) + 1
- fixed_fields_string = str(rec.CHROM) + '\t' + str(pos) + '\t' + str(rec_id) + '\t' + str(rec.REF) + '\t' + str(alt) + '\t' + str(rec_qual) + '\t' + str(rec_filter)
-
-
- if not 'PASS' in rec_filter and not keep_rejected_calls:
- continue
-
- variant_info = rec.INFO
- vcf_info_data = []
- if skip_info_data is False:
- for info_field in sorted(info_columns_header):
- if column_types[info_field] == 'Flag':
- if variant_info.get(info_field) is None:
- vcf_info_data.append('False')
- else:
- vcf_info_data.append('True')
- elif column_types[info_field] == 'Float' or column_types[info_field] == 'Integer' or column_types[info_field] == 'String' or column_types[info_field] == 'Character':
- if type(variant_info.get(info_field)) is list or type(variant_info.get(info_field)) is tuple:
- vcf_info_data.append(",".join(str(n) for n in variant_info.get(info_field)))
- else:
- if variant_info.get(info_field) is None:
- vcf_info_data.append('.')
- else:
- if column_types[info_field] == 'Float':
- if not isinstance(variant_info.get(info_field),float):
- print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'Float\', yet parsed as other type:' + str(type(variant_info.get(info_field))))
- if not ',' in str(alt):
- print('Warning: Multiple values in INFO tag for single ALT allele (VCF multiallelic sites not decomposed properly?):' + str(fixed_fields_string) + '\t' + str(info_field) + '=' + str(variant_info.get(info_field)))
- vcf_info_data.append('.')
- else:
- val = str("{0:.7f}".format(variant_info.get(info_field)))
- vcf_info_data.append(val)
- else:
- if column_types[info_field] == 'String' or column_types[info_field] == 'Character':
- if isinstance(variant_info.get(info_field),str):
- #print(str(info_field) + '\t' + variant_info.get(info_field).encode('ascii','ignore').rstrip().decode('ascii'))
- vcf_info_data.append(variant_info.get(info_field).encode('ascii','ignore').decode('ascii'))
- else:
- vcf_info_data.append('.')
- if column_types[info_field] == 'String':
- print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'String\', yet parsed as other type:' + str(type(variant_info.get(info_field))))
- if column_types[info_field] == 'Character':
- print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'Character\', yet parsed as other type:' + str(type(variant_info.get(info_field))))
- else:
- if isinstance(variant_info.get(info_field),int):
- vcf_info_data.append(str(variant_info.get(info_field)))
- else:
- print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'Integer\', yet parsed as other type:' + str(type(variant_info.get(info_field))))
- vcf_info_data.append(re.sub(r'\(|\)', '', variant_info.get(info_field).encode('ascii','ignore').decode('ascii')))
-
- #print(str(vcf_info_data))
- #dictionary, with sample names as keys, values being genotype data (dictionary with format tags as keys)
- vcf_sample_genotype_data = {}
-
- if len(samples) > 0 and skip_genotype_data is False:
- gt_cyvcf = rec.gt_types
- i = 0
- while i < len(samples):
- vcf_sample_genotype_data[samples[i]] = {}
- gt = './.'
- if gt_present_header == 1:
- if gt_cyvcf[i] == 0:
- gt = '0/0'
- if gt_cyvcf[i] == 1:
- gt = '0/1'
- if gt_cyvcf[i] == 2:
- gt = '1/1'
- vcf_sample_genotype_data[samples[i]]['GT'] = gt
- i = i + 1
-
- for format_tag in sorted(format_columns_header):
- if len(samples) > 0 and skip_genotype_data is False:
- sample_dat = rec.format(format_tag)
- if sample_dat is None:
- k = 0
- while k < len(samples):
- if samples[k] in vcf_sample_genotype_data:
- vcf_sample_genotype_data[samples[k]][format_tag] = '.'
- k = k + 1
- continue
- dim = sample_dat.shape
- j = 0
- ## sample-wise
- while j < dim[0]:
- if sample_dat[j].size > 1:
-
- d = ','.join(str(e) for e in np.ndarray.tolist(sample_dat[j]))
- if column_types[format_tag] == 'Float':
- d = ','.join(str(round(e, 4)) for e in np.ndarray.tolist(sample_dat[j]))
- ## undefined/missing value
- if '-2147483648' in d:
- d = d.replace('-2147483648', '.')
- if 'nan' in d.casefold():
- d = d.casefold().replace('nan', '.')
- if samples[j] in vcf_sample_genotype_data:
- vcf_sample_genotype_data[samples[j]][format_tag] = d
- else:
- d = '.'
- if column_types[format_tag] == 'Float':
- if not math.isnan(sample_dat[j]):
- d = str(sample_dat[j][0])
- if column_types[format_tag] == 'String':
- d = str(sample_dat[j])
- if column_types[format_tag] == 'Integer':
- d = str(sample_dat[j][0])
- ## undefined/missing value
- if d == '-2147483648' or d.casefold() == 'nan':
- d = '.'
- if samples[j] in vcf_sample_genotype_data:
- vcf_sample_genotype_data[samples[j]][format_tag] = d
- j = j + 1
-
- tsv_elements = []
- tsv_elements.append(fixed_fields_string)
- if skip_info_data is False:
- if skip_genotype_data is False:
- if len(sample_columns_header) > 0:
- tsv_elements.append("\t".join(str(n) for n in vcf_info_data))
- ## one line per sample variant
- for s in sorted(vcf_sample_genotype_data.keys()):
- sample = s
- line_elements = []
- line_elements.extend(tsv_elements)
- line_elements.append(sample)
- gt_tag = '.'
- for tag in sorted(vcf_sample_genotype_data[sample].keys()):
- if tag != 'GT':
- line_elements.append(vcf_sample_genotype_data[sample][tag].encode('ascii','ignore').decode('ascii'))
- else:
- gt_tag = vcf_sample_genotype_data[sample][tag].encode('ascii','ignore').decode('ascii')
- line_elements.append(gt_tag)
- if gt_tag == './.' or gt_tag == '.':
- if keep_rejected_calls:
- out.write('\t'.join(line_elements) + '\n')
- else:
- out.write("\t".join(str(n) for n in line_elements) + '\n')
-
- else:
- tsv_elements.append("\t".join(str(n) for n in vcf_info_data))
- line_elements = []
- line_elements.extend(tsv_elements)
- out.write('\t'.join(line_elements) + '\n')
- else:
- tsv_elements.append("\t".join(str(n) for n in vcf_info_data))
- line_elements = []
- line_elements.extend(tsv_elements)
- out.write('\t'.join(line_elements) + '\n')
- else:
- if skip_genotype_data is False:
- if len(sample_columns_header) > 0:
- ## one line per sample variant
- for s in sorted(vcf_sample_genotype_data.keys()):
- sample = s
- line_elements = []
- line_elements.extend(tsv_elements)
- line_elements.append(sample)
- gt_tag = '.'
- for tag in sorted(vcf_sample_genotype_data[sample].keys()):
- if tag != 'GT':
- line_elements.append(vcf_sample_genotype_data[sample][tag])
- else:
- gt_tag = vcf_sample_genotype_data[sample][tag]
- line_elements.append(gt_tag)
- if gt_tag == './.' or gt_tag == '.':
- if keep_rejected_calls:
- out.write('\t'.join(line_elements) + '\n')
- else:
- out.write('\t'.join(line_elements) + '\n')
- else:
- line_elements = []
- line_elements.extend(tsv_elements)
- line_elements = tsv_elements
- out.write('\t'.join(line_elements) + '\n')
-
- out.close()
-
- if compress is True:
- command = 'gzip -f ' + str(out_tsv)
- check_subprocess(command)
-
-if __name__=="__main__": __main__()
-
-
-
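Editor's note: the deleted `vcf2tsv.py` above flattened VCF records (fixed columns plus INFO annotations) into one TSV line per record/genotype, using `cyvcf2`. The core idea can be sketched with the standard library alone; the function below is a hypothetical, simplified illustration (no FORMAT/genotype handling, no header-declared types), not a replacement for the removed script.

```python
# Minimal, dependency-free sketch of the VCF -> TSV flattening idea from the
# deleted vcf2tsv.py: emit the seven fixed VCF columns plus one column per
# INFO tag seen in the records. Flag-type INFO tags (key with no '=') are
# rendered as 'True'; tags absent from a record are rendered as '.'.
def vcf_records_to_tsv(vcf_lines):
    info_keys = []
    records = []
    for line in vcf_lines:
        if line.startswith('#'):        # skip meta-lines and the column header
            continue
        chrom, pos, vid, ref, alt, qual, vfilter, info = \
            line.rstrip('\n').split('\t')[:8]
        info_map = {}
        for item in info.split(';'):
            if not item or item == '.':
                continue
            key, _, value = item.partition('=')
            info_map[key] = value if value else 'True'   # Flag tag
            if key not in info_keys:
                info_keys.append(key)
        records.append((chrom, pos, vid, ref, alt, qual, vfilter, info_map))
    header = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER'] \
        + sorted(info_keys)
    rows = ['\t'.join(header)]
    for chrom, pos, vid, ref, alt, qual, vfilter, info_map in records:
        fixed = [chrom, pos, vid, ref, alt, qual, vfilter]
        rows.append('\t'.join(fixed + [info_map.get(k, '.')
                                       for k in sorted(info_keys)]))
    return rows
```

The deleted script additionally handled per-sample FORMAT fields (one output line per sample genotype), header-declared column types, and rejected-call filtering, which this sketch deliberately omits.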
diff --git a/src/gvanno_logo.png b/src/gvanno_logo.png
new file mode 100644
index 0000000..4f6899a
Binary files /dev/null and b/src/gvanno_logo.png differ