diff --git a/README.md b/README.md index c460e77..7b4b3ec 100755 --- a/README.md +++ b/README.md @@ -1,169 +1,184 @@ -## _gvanno_ - workflow for functional and clinical annotation of germline nucleotide variants +## *gvanno* - generic workflow for functional and clinical annotation of human DNA variants + + ### Contents -- [Overview](#overview) -- [News](#news) -- [Annotation resources](#annotation-resources) -- [Getting started](#getting-started) -- [Contact](#contact) +- [Overview](#overview) +- [News](#news) +- [Annotation resources](#annotation-resources) +- [Getting started](#getting-started) +- [Contact](#contact) + +### Overview {#overview} + +The generic variant annotator (*gvanno*) is a software package intended for simple analysis and interpretation of human DNA variants. Variants and genes are annotated with disease-related and functional associations. Technically, the workflow is built with the [Docker](https://www.docker.com) technology, and it can also be installed through the [Singularity](https://sylabs.io/docs/) framework. + +*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short insertions or deletions (indels). The workflow relies heavily upon [Ensembl's Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html) and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record **per sample variant**.
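To make the multisample behaviour concrete, here is a minimal, purely illustrative sketch (plain Python, not gvanno's actual implementation) of how a single variant record in a multisample VCF expands into one TSV record per sample that carries the variant:

```python
# Illustration only: each sample with a non-reference genotype at a site
# becomes its own output record, which is why a multisample input VCF
# yields one TSV line per sample variant. Not gvanno's actual code.

def expand_per_sample(chrom, pos, ref, alt, genotypes):
    """genotypes: dict of sample id -> GT string (e.g. '0/1')."""
    records = []
    for sample, gt in genotypes.items():
        # keep samples carrying at least one alternate allele
        alleles = gt.replace('|', '/').split('/')
        if any(a not in ('0', '.') for a in alleles):
            records.append((sample, chrom, pos, ref, alt, gt))
    return records

rows = expand_per_sample('chr1', 12345, 'G', 'A',
                         {'sampleA': '0/1', 'sampleB': '0/0', 'sampleC': '1/1'})
# Two of the three samples carry the variant -> two output records
```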
+ +### News {#news} + +- April 27th 2023 - **1.6.0 release** + + - Added option `--oncogenicity_annotation` - classifies variants according to oncogenicity ([Horak et al., Genet Med, 2022](https://pubmed.ncbi.nlm.nih.gov/35101336/)) + - Data updates: ClinVar, GENCODE, GWAS catalog, CancerMine + - Excluded extensive disease associations from the Open Targets Platform -### Overview +- September 26th 2022 - **1.5.1 release** -The germline variant annotator (*gvanno*) is a software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with the [Docker](https://www.docker.com) technology, and it can also be installed through the [Singularity](https://sylabs.io/docs/) framework. + - Added option `--vep_coding_only` - only report variants that fall into coding regions of transcripts (VEP option `--coding_only`) -*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__. 
+### Annotation resources (v1.6.0) -### News -* September 26th 2022 - **1.5.1 release** - * Added option `--vep_coding_only` - only report variants that fall into coding regions of transcripts (VEP option `--coding_only`) -* September 24th 2022 - **1.5.0 release** - * Data updates: ClinVar, GENCODE GWAS catalog, CancerMine, Open Targets Platform - * Software updates: VEP 107 - * Excluded UniProt KB from annotation tracks -* December 21st 2021 - **1.4.4 release** - * Data updates: ClinVar, GWAS catalog, CancerMine, UniProt KB, Open Targets Platform - * Software updates: VEP (v105) -* August 25th 2021 - **1.4.3 release** - * Data updates: ClinVar, GWAS catalog, CancerMine, UniProt, Open Targets Platform +- [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v109 (GENCODE v43/v19 as the gene reference dataset) +- [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021) +- [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP +- [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP +- [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (April 2023) +- [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - Literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 50, March 2023) +- [Mutation hotspots](https://www.cancerhotspots.org) - Database of mutation hotspots in cancer +- [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (March 27th 2023) -### Annotation resources (v1.5.1) -* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v107 (GENCODE v41/v19 as the gene reference dataset) -* 
[dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021) -* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP -* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP -* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP -* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (September 2022) -* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 47, July 2022) -* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2022_06, June 2022) -* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v35.0, November 2021) -* [Mutation hotspots](cancerhotspots.org) - Database of mutation hotspots in cancer -* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (August 26th 2022) +#### STEP 0: Prerequisites +- *Python* -### Getting started + An installation of Python (version \>=*3.6*) is required to run *gvanno*. Check that Python is installed by typing `python --version` in your terminal window. -#### STEP 0: Python +- *Other utilities* -An installation of Python (version >=_3.6_) is required to run *gvanno*. Check that Python is installed by typing `python --version` in your terminal window. + The script that installs the reference data requires that the user has `bgzip` installed. See [here](http://www.htslib.org/download/) for instructions. The script also requires that basic Linux/UNIX commands are available (i.e. 
`gzip`, `tar`) + + **NOTE**: We strongly recommend that _gvanno_ is installed on a MacOS or Linux/UNIX operating system #### STEP 1: Installation of Docker -1. [Install the Docker engine](https://docs.docker.com/engine/installation/) on your preferred platform - - installing [Docker on Linux](https://docs.docker.com/engine/installation/linux/) - - installing [Docker on Mac OS](https://docs.docker.com/engine/installation/mac/) - - NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with gvanno (an example being [mounting of data volumes](https://github.com/docker/toolbox/issues/607)) -2. Test that Docker is running, e.g. by typing `docker ps` or `docker images` in the terminal window -3. Adjust the computing resources dedicated to the Docker, i.e.: - - Memory: minimum 5GB - - CPUs: minimum 4 - - [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced) +1. [Install the Docker engine](https://docs.docker.com/engine/installation/) on your preferred platform + - installing [Docker on Linux](https://docs.docker.com/engine/installation/linux/) + - installing [Docker on Mac OS](https://docs.docker.com/engine/installation/mac/) + - NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with gvanno (an example being [mounting of data volumes](https://github.com/docker/toolbox/issues/607)) +2. Test that Docker is running, e.g. by typing `docker ps` or `docker images` in the terminal window +3. Adjust the computing resources dedicated to the Docker, i.e.: + - Memory: minimum 5GB + - CPUs: minimum 4 + - [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced) -##### 1.1: Installation of Singularity (optional - in dev) +##### 1.1: Installation of Singularity (_IN DEVELOPMENT_) -0. 
**Note: this works for Singularity version 3.0 and higher**. -1. [Install Singularity](https://sylabs.io/docs/) -2. Test that singularity works by running `singularity --version` -3. If you are in the gvanno directory, build the singularity image like so: +0. **Note: this works for Singularity version 3.0 and higher**. - `cd src` +1. [Install Singularity](https://sylabs.io/docs/) - `sudo ./buildSingularity.sh` +2. Test that singularity works by running `singularity --version` + +3. If you are in the gvanno directory, build the singularity image like so: + + `cd src` + + `sudo ./buildSingularity.sh` #### STEP 2: Download *gvanno* and data bundle -1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.5.1) (gvanno run script, v1.5.1) -2. Download (preferably using `wget`) and unpack the latest assembly-specific data bundle in the gvanno directory - * [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz) (approx 20Gb) - * [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20220921.tgz) (approx 28Gb) - * Example commands: - * `wget http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz` - * `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -` +1. [Download and unpack the latest release](https://github.com/sigven/gvanno/releases/tag/v1.6.0) + +2. Install the assembly-specific VEP cache and gvanno-specific reference data using the `download_gvanno_refdata.py` script, i.e.: + + - `python download_gvanno_refdata.py --download_dir --genome_assembly grch38` + + **NOTE**: This can take a considerable amount of time depending on your local bandwidth (approx 30Gb per assembly-specific bundle) + +3. Pull the [gvanno Docker image (1.6.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 4.11Gb): - A _data/_ folder within the _gvanno-1.5.1_ software folder should now have been produced -3. 
Pull the [gvanno Docker image (1.5.1)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.2Gb): - * `docker pull sigven/gvanno:1.5.1` (gvanno annotation engine) + - `docker pull sigven/gvanno:1.6.0` (gvanno annotation engine) #### STEP 3: Input preprocessing The *gvanno* workflow accepts a single input file: - * An unannotated, single-sample VCF file (>= v4.2) with germline variants (SNVs/InDels) +- An unannotated, single-sample VCF file (\>= v4.2) with germline variants (SNVs/InDels) -We __strongly__ recommend that the input VCF is compressed and indexed using [bgzip](http://www.htslib.org/doc/tabix.html) and [tabix](http://www.htslib.org/doc/tabix.html). NOTE: If the input VCF contains multi-allelic sites, these will be subject to [decomposition](http://genome.sph.umich.edu/wiki/Vt#Decompose). +We **strongly** recommend that the input VCF is compressed and indexed using [bgzip](http://www.htslib.org/doc/tabix.html) and [tabix](http://www.htslib.org/doc/tabix.html). NOTE: If the input VCF contains multi-allelic sites, these will be subject to [decomposition](http://genome.sph.umich.edu/wiki/Vt#Decompose). #### STEP 4: Run example Run the workflow with **gvanno.py**, which takes the following arguments and options: - usage: - gvanno.py -h [options] - --query_vcf - --gvanno_dir - --output_dir - --genome_assembly - --sample_id - --container - - gvanno - workflow for functional and clinical annotation of germline nucleotide variants - - Required arguments: - --query_vcf QUERY_VCF - VCF input file with germline query variants (SNVs/InDels). - --gvanno_dir GVANNO_DIR - Directory that contains the gvanno data bundle, e.g. 
~/gvanno-1.5.1 - --output_dir OUTPUT_DIR - Output directory - --genome_assembly {grch37,grch38} - Genome assembly build: grch37 or grch38 - --container {docker,singularity} - Run gvanno with docker or singularity - --sample_id SAMPLE_ID - Sample identifier - prefix for output files - - VEP optional arguments: - --vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP). - --vep_gencode_all Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno). - --vep_lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False - --vep_n_forks VEP_N_FORKS - Number of forks for Variant Effect Predictor (VEP) processing, default: 4 - --vep_buffer_size VEP_BUFFER_SIZE - Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing - - set lower to reduce memory usage, higher to increase speed, default: 500 - --vep_pick_order VEP_PICK_ORDER - Comma-separated string of ordered transcript properties for primary variant pick in - Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane - --vep_skip_intergenic - Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False - --vep_coding_only - Only report variants falling into coding regions of transcripts (VEP), default: False - - Other optional arguments: - --force_overwrite By default, the script will fail with an error if any output file already exists. - You can force the overwrite of existing result files by using this flag, default: False - --version show program's version number and exit - --no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator, default: False - --docker_uid DOCKER_USER_ID - Docker user ID. default is the host system user ID. 
If you are experiencing permission errors, try setting this up to root (`--docker-uid root`) - --vcfanno_n_processes VCFANNO_N_PROCESSES - Number of processes for vcfanno processing (see https://github.com/brentp/vcfanno#-p), default: 4 - -The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command: - - python ~/gvanno-1.5.1/gvanno.py - --query_vcf ~/gvanno-1.5.1/examples/example.grch37.vcf.gz - --gvanno_dir ~/gvanno-1.5.1 - --output_dir ~/gvanno-1.5.1 - --sample_id example - --genome_assembly grch37 - --container docker - --force_overwrite - -This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder: - - 1. __example_gvanno_pass_grch37.vcf.gz (.tbi)__ - Bgzipped VCF file with rich set of functional/clinical annotations - 2. __example_gvanno_pass_grch37.tsv.gz__ - Compressed TSV file with rich set of functional/clinical annotations +``` +usage: +gvanno.py -h [options] +--query_vcf +--gvanno_dir +--output_dir +--genome_assembly +--sample_id +--container + +gvanno - workflow for functional and clinical annotation of germline nucleotide variants + +Required arguments: +--query_vcf QUERY_VCF + VCF input file with germline query variants (SNVs/InDels). +--gvanno_dir GVANNO_DIR + Directory that contains the gvanno reference data, e.g. ~/gvanno-1.6.0 +--output_dir OUTPUT_DIR + Output directory +--genome_assembly {grch37,grch38} + Genome assembly build: grch37 or grch38 +--container {docker,singularity} + Run gvanno with docker or singularity +--sample_id SAMPLE_ID + Sample identifier - prefix for output files + +VEP optional arguments: +--vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP). +--vep_gencode_all Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno). 
+--vep_lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False +--vep_n_forks VEP_N_FORKS + Number of forks for Variant Effect Predictor (VEP) processing, default: 4 +--vep_buffer_size VEP_BUFFER_SIZE + Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing + - set lower to reduce memory usage, higher to increase speed, default: 500 +--vep_pick_order VEP_PICK_ORDER + Comma-separated string of ordered transcript properties for primary variant pick in + Variant Effect Predictor (VEP) processing, default: mane_select,mane_plus_clinical,canonical,appris,biotype,ccds,rank,tsl,length +--vep_skip_intergenic + Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False +--vep_coding_only + Only report variants falling into coding regions of transcripts (VEP), default: False + +Other optional arguments: +--force_overwrite By default, the script will fail with an error if any output file already exists. + You can force the overwrite of existing result files by using this flag, default: False +--version show program's version number and exit +--no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator, default: False +--docker_uid DOCKER_USER_ID + Docker user ID. Default is the host system user ID. If you are experiencing permission errors, try setting this to root (`--docker-uid root`) +--vcfanno_n_processes VCFANNO_N_PROCESSES + Number of processes for vcfanno processing (see https://github.com/brentp/vcfanno#-p), default: 4 +--oncogenicity_annotation + Classify variants according to oncogenicity (Horak et al., Genet Med, 2022) +--debug Print full Docker/Singularity commands to log and do not delete intermediate files with warnings etc. +``` + +The *examples* folder contains an example VCF file. 
Analysis of the example VCF can be performed by the following command: + +``` +python ~/gvanno-1.6.0/gvanno.py +--query_vcf ~/gvanno-1.6.0/examples/example.grch37.vcf.gz +--gvanno_dir ~/gvanno-1.6.0 +--output_dir ~/gvanno-1.6.0 +--sample_id example +--genome_assembly grch37 +--container docker +--force_overwrite +``` + +This command will run the Docker-based *gvanno* workflow and produce the following output files in the *examples* folder: + +1. **example_gvanno_pass_grch37.vcf.gz (.tbi)** - Bgzipped VCF file with rich set of functional/clinical annotations +2. **example_gvanno_pass_grch37.tsv.gz** - Compressed TSV file with rich set of functional/clinical annotations Similar files are produced for all variants, not only variants with a *PASS* designation in the VCF FILTER column. @@ -171,6 +186,6 @@ Similar files are produced for all variants, not only variants with a *PASS* des Documentation of the various variant and gene annotations should be interrogated from the header of the annotated VCF file. The column names of the tab-separated values (TSV) file will be identical to the INFO tags that are documented in the VCF file. 
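Since the TSV column names mirror the `##INFO` tags declared in the annotated VCF header, the available annotations can be listed programmatically. A small sketch below, using a made-up three-tag header excerpt (real gvanno output declares many more tags):

```python
import re

# Hypothetical excerpt of an annotated VCF header; the actual gvanno
# output documents each annotation in its own ##INFO line.
header = """##fileformat=VCFv4.2
##INFO=<ID=Consequence,Number=.,Type=String,Description="VEP consequence">
##INFO=<ID=SYMBOL,Number=.,Type=String,Description="Gene symbol">
##INFO=<ID=gnomAD_AF,Number=A,Type=Float,Description="gnomAD allele frequency">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"""

# Every ##INFO ID in the VCF header should reappear as a TSV column name
info_tags = re.findall(r'^##INFO=<ID=([^,]+)', header, flags=re.M)
```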
-### Contact +### Contact {#contact} sigven AT ifi.uio.no diff --git a/download_gvanno_refdata.py b/download_gvanno_refdata.py new file mode 100755 index 0000000..f0b94bb --- /dev/null +++ b/download_gvanno_refdata.py @@ -0,0 +1,319 @@ +#!/usr/bin/env python + +import argparse +import re +import os +import subprocess +import logging +import sys +import locale +import errno +#import wget +import urllib.request as urllib2 +from argparse import RawTextHelpFormatter + +GVANNO_VERSION = '1.6.0' +REFDATA_VERSION = '20230425' +ENSEMBL_VERSION = '109' +GENCODE_VERSION = 'v43' +VEP_ASSEMBLY = "GRCh38" +HOST_GVANNO_REFDATA_URL = "http://insilico.hpc.uio.no/pcgr/gvanno/" + +def __main__(): + + program_description = "download_gvanno_refdata - download reference datasets for the gvanno variant annotation workflow" + program_options = " --download_dir \n --genome_assembly " + \ + "\n" + + parser = argparse.ArgumentParser(description = program_description, + formatter_class=RawTextHelpFormatter, usage="\n %(prog)s -h [options]\n" + str(program_options)) + parser._action_groups.pop() + required = parser.add_argument_group('Required arguments') + optional = parser.add_argument_group('Other optional arguments') + + optional.add_argument('--force_overwrite', action = "store_true", help='By default, the script will fail with an error if any ' + \ + 'download directory already exist.\nYou can force the overwrite of existing download directory by using this flag, default: %(default)s') + optional.add_argument('--version', action='version', version='%(prog)s ' + str(GVANNO_VERSION)) + optional.add_argument('--clean_raw_files',action="store_true", help="Delete raw compressed tar files (i.e. 
VEP) after download and unzip + untar has been conducted successfully") + optional.add_argument("--debug", action="store_true", help="Print full commands to log and do not delete intermediate files with warnings etc.") + required.add_argument('--download_dir',help='Destination directory for downloaded reference data', required = True) + required.add_argument('--genome_assembly',choices = ['grch37','grch38'], help='Choose build-specific reference data for download: grch37 or grch38', required = True) + + + args = parser.parse_args() + arg_dict = vars(args) + + logger = getlogger('check-download-dir') + + download_dir_full = os.path.abspath(arg_dict['download_dir']) + if not os.path.isdir(download_dir_full): + err_msg = "Download destination directory (" + str(download_dir_full) + ") does not exist" + gvanno_error_message(err_msg,logger) + + ## check that 'data' does not exist in download_dir_full + db_dir = os.path.join(os.path.abspath(arg_dict['download_dir']),'data') + arg_dict['db_assembly_dir'] = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly']) + arg_dict['vep_assembly_dir'] = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly'], '.vep') + + if os.path.isdir(db_dir) and arg_dict['force_overwrite'] is False: + err_msg = "Data directory (" + str(db_dir) + ") exists (force_overwrite = False)" + gvanno_error_message(err_msg,logger) + + if not os.path.isdir(db_dir): + os.mkdir(db_dir) + if not os.path.isdir(arg_dict['db_assembly_dir']): + os.mkdir(arg_dict['db_assembly_dir']) + os.mkdir(arg_dict['vep_assembly_dir']) + + + download_gvanno_ref_data(arg_dict = arg_dict) + + +def rm_file(filename): + try: + os.remove(filename) + except OSError as e: + if e.errno != errno.ENOENT: # errno.ENOENT = no such file or directory + raise # re-raise exception if a different error occurred + +def gvanno_error_message(message, logger): + logger.error('') + logger.error(message) + logger.error('') + 
exit(1) + +def check_subprocess(command, logger, debug = False): + if debug: + logger.info(command) + try: + output = subprocess.check_output(str(command), stderr=subprocess.STDOUT, shell=True) + if len(output) > 0: + print(str(output.decode()).rstrip()) + except subprocess.CalledProcessError as e: + print(e.output.decode()) + exit(1) + +def getlogger(logger_name): + logger = logging.getLogger(logger_name) + logger.setLevel(logging.DEBUG) + + # create console handler and set level to debug + ch = logging.StreamHandler(sys.stdout) + ch.setLevel(logging.DEBUG) + + # add ch to logger + logger.addHandler(ch) + + # create formatter + formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s", "20%y-%m-%d %H:%M:%S") + + #add formatter to ch + ch.setFormatter(formatter) + + return logger + +def get_url_num_bytes(url, logger, verbose=True): + """ + Get number of bytes of file at a URL. None if not reported. + """ + # urllib2 could be a bit more intelligent in guessing what I mean: + if not re.match('^[a-zA-Z]*:', url): + if os.path.exists(url): + url = 'file:' + url + else: + url = 'http://' + url + try: + class HeadRequest(urllib2.Request): + def get_method(self): + return "HEAD" + response = urllib2.urlopen(HeadRequest( + url, None, {'User-Agent': 'Mozilla/5.0'})) + + num_bytes = response.info().get('content-length') + if verbose: + # urllib does redirections for me. 
Report if it's been clever: + if response.geturl() != url: + logger.info('Reporting size of file at: ' + response.geturl()) + except Exception: + logger.info('Failed to connect to URL ' + str(url)) + num_bytes = None + if num_bytes is not None: + num_bytes = int(num_bytes) + return num_bytes + +def pretty_print(num_bytes, logger): + """ + Format a number of bytes according to locale and with IEC binary prefixes + """ + if num_bytes is None: + return 'unavailable' + KiB = 1024 + MiB = KiB * KiB + GiB = KiB * MiB + TiB = KiB * GiB + PiB = KiB * TiB + EiB = KiB * PiB + ZiB = KiB * EiB + YiB = KiB * ZiB + locale.setlocale(locale.LC_ALL, '') + output = locale.format_string("%d", num_bytes, grouping=True) + ' bytes' + if num_bytes > YiB: + output += ' (%.3g YiB)' % (num_bytes / YiB) + elif num_bytes > ZiB: + output += ' (%.3g ZiB)' % (num_bytes / ZiB) + elif num_bytes > EiB: + output += ' (%.3g EiB)' % (num_bytes / EiB) + elif num_bytes > PiB: + output += ' (%.3g PiB)' % (num_bytes / PiB) + elif num_bytes > TiB: + output += ' (%.3g TiB)' % (num_bytes / TiB) + elif num_bytes > GiB: + output += ' (%.3g GiB)' % (num_bytes / GiB) + elif num_bytes > MiB: + output += ' (%.3g MiB)' % (num_bytes / MiB) + elif num_bytes > KiB: + output += ' (%.3g KiB)' % (num_bytes / KiB) + + return(output) + +def download_gvanno_ref_data(arg_dict): + """ + Download and unpack reference datasets for the gvanno annotation workflow + """ + global GENCODE_VERSION, VEP_ASSEMBLY + if arg_dict['genome_assembly'] == 'grch37': + GENCODE_VERSION = 'v19' + VEP_ASSEMBLY = 'GRCh37' + + vep_assembly_dir = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly'], '.vep') + + datasets = {} + for db in ['vep_cache','vep_fasta','gvanno_custom']: + datasets[db] = {} + datasets[db]['remote_url'] = 'NA' + datasets[db]['local_path'] = 'NA' + + ## Remote URL - pre-indexed VEP cache + datasets['vep_cache']['remote_url'] = (
f"ftp://ftp.ensembl.org/pub/current_variation/indexed_vep_cache/" + f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz" + ) + + ## Remote URL - Human genome FASTA + datasets['vep_fasta']['remote_url'] = ( + f"http://ftp.ensembl.org/pub/release-{ENSEMBL_VERSION}/fasta/homo_sapiens/dna/Homo_sapiens." + f"{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz" + ) + + logger = getlogger('download-vep-cache') + datasets['vep_cache']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz") + datasets['vep_fasta']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f"Homo_sapiens.{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz") + datasets['vep_fasta']['local_path_uncompressed'] = re.sub(r'.gz','',datasets['vep_fasta']['local_path']) + + vep_cache_bytes_remote = get_url_num_bytes(url = datasets['vep_cache']['remote_url'], logger = logger) + + if VEP_ASSEMBLY == 'GRCh37': + datasets['vep_fasta']['remote_url'] = ( + f"http://ftp.ensembl.org/pub/grch37/release-{ENSEMBL_VERSION}/fasta/homo_sapiens/dna/Homo_sapiens." + f"{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz" + ) + + + logger.info('VEP cache - remote target file ' + str(datasets['vep_cache']['remote_url'])) + logger.info('VEP cache - size: ' + pretty_print(vep_cache_bytes_remote, logger = logger)) + logger.info('VEP cache - local destination file: ' + str(datasets['vep_cache']['local_path'])) + + if os.path.exists(datasets['vep_cache']['local_path']): + if os.path.getsize(datasets['vep_cache']['local_path']) == vep_cache_bytes_remote: + logger.info('VEP cache already downloaded') + else: + logger.info('VEP cache - download in progress - this can take a while ... ') + urllib2.urlretrieve(datasets['vep_cache']['remote_url'], datasets['vep_cache']['local_path']) + else: + logger.info('VEP cache - download in progress - this can take a while ... 
') + urllib2.urlretrieve(datasets['vep_cache']['remote_url'], datasets['vep_cache']['local_path']) + + logger.info('VEP cache - unzip and untar') + + wdir = os.getcwd() + + os.chdir(vep_assembly_dir) + + command_unzip_cache = f"gzip -dc {datasets['vep_cache']['local_path']} | tar xf -" + check_subprocess(command = command_unzip_cache, logger = logger) + logger.info('VEP cache - finished unzip + untar ') + + ## change back to working directory + os.chdir(wdir) + + if arg_dict['clean_raw_files']: + logger.info('VEP cache - removing raw compressed tar ball') + rm_file(datasets['vep_cache']['local_path']) + + logger = getlogger('download-vep-fasta') + fasta_cache_bytes_remote = get_url_num_bytes(url = datasets['vep_fasta']['remote_url'], logger = logger) + logger.info('VEP reference FASTA - remote target file: ' + str(datasets['vep_fasta']['remote_url'])) + logger.info('VEP reference FASTA - size: ' + pretty_print(fasta_cache_bytes_remote, logger = logger)) + logger.info('VEP reference FASTA - local destination file: ' + str(datasets['vep_fasta']['local_path'])) + + if os.path.exists(datasets['vep_fasta']['local_path']): + if os.path.getsize(datasets['vep_fasta']['local_path']) == fasta_cache_bytes_remote: + logger.info('VEP reference FASTA already downloaded') + else: + logger.info('VEP reference FASTA files - download in progress - this can take a while ... ') + urllib2.urlretrieve(datasets['vep_fasta']['remote_url'], datasets['vep_fasta']['local_path']) + else: + logger.info('VEP reference FASTA files - download in progress - this can take a while ... 
') + urllib2.urlretrieve(datasets['vep_fasta']['remote_url'], datasets['vep_fasta']['local_path']) + + logger.info('VEP reference FASTA - unzip + bgzip') + command_unzip_fasta = f"gzip -d {datasets['vep_fasta']['local_path']}" + check_subprocess(command = command_unzip_fasta, logger = logger) + command_bgzip_fasta = f"bgzip {datasets['vep_fasta']['local_path_uncompressed']}" + check_subprocess(command = command_bgzip_fasta, logger = logger) + + + + datasets['gvanno_custom']['remote_url'] = f'{HOST_GVANNO_REFDATA_URL}gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz' + datasets['gvanno_custom']['local_path'] = os.path.join(arg_dict['download_dir'], f'gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz') + + logger = getlogger('download-gvanno-custom') + custom_cache_bytes_remote = get_url_num_bytes(url = datasets['gvanno_custom']['remote_url'], logger = logger) + logger.info("Downloading custom gvanno variant datasets: Clinvar / dbNSFP / ncER / cancerhotspots ++") + logger.info('Custom gvanno datasets - remote target file ' + str(datasets['gvanno_custom']['remote_url'])) + logger.info('Custom gvanno datasets - size: ' + pretty_print(custom_cache_bytes_remote, logger = logger)) + logger.info('Custom gvanno datasets - local destination file: ' + str(datasets['gvanno_custom']['local_path'])) + + + if os.path.exists(datasets['gvanno_custom']['local_path']): + if os.path.getsize(datasets['gvanno_custom']['local_path']) == custom_cache_bytes_remote: + logger.info('Custom gvanno datasets already downloaded') + else: + logger.info('Custom gvanno datasets - download in progress - this can take a while ... ') + urllib2.urlretrieve(datasets['gvanno_custom']['remote_url'], datasets['gvanno_custom']['local_path']) + else: + logger.info('Custom gvanno datasets - download in progress - this can take a while ... 
') + urllib2.urlretrieve(datasets['gvanno_custom']['remote_url'], datasets['gvanno_custom']['local_path']) + + logger.info('Custom gvanno datasets - unzip and untar') + + wdir = os.getcwd() + os.chdir(arg_dict['download_dir']) + + command_unzip_cache = f"gzip -dc {datasets['gvanno_custom']['local_path']} | tar xf -" + check_subprocess(command = command_unzip_cache, logger = logger) + logger.info('Custom gvanno datasets - finished unzip and untar ') + + ## change back to working directory + os.chdir(wdir) + + if arg_dict['clean_raw_files']: + logger.info('Custom gvanno datasets - removing raw compressed tar ball') + rm_file({datasets['gvanno_custom']['local_path']}) + + logger.info('Finished') + + +if __name__=="__main__": __main__() diff --git a/examples/example.grch38.vcf.gz b/examples/example.grch38.vcf.gz index 8329f55..bf3cbd8 100755 Binary files a/examples/example.grch38.vcf.gz and b/examples/example.grch38.vcf.gz differ diff --git a/examples/example.grch38.vcf.gz.tbi b/examples/example.grch38.vcf.gz.tbi index 7f8e83e..3817608 100755 Binary files a/examples/example.grch38.vcf.gz.tbi and b/examples/example.grch38.vcf.gz.tbi differ diff --git a/gvanno.py b/gvanno.py index db02add..34ee0f4 100755 --- a/gvanno.py +++ b/gvanno.py @@ -11,17 +11,17 @@ import platform from argparse import RawTextHelpFormatter -GVANNO_VERSION = '1.5.1' -DB_VERSION = 'GVANNO_DB_VERSION = 20220921' -VEP_VERSION = '107' -GENCODE_VERSION = 'v41' +GVANNO_VERSION = '1.6.0' +DB_VERSION = 'GVANNO_DB_VERSION = 20230425' +VEP_VERSION = '109' +GENCODE_VERSION = 'v43' VEP_ASSEMBLY = "GRCh38" DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION) def __main__(): - program_description = "gvanno - workflow for functional and clinical annotation of germline nucleotide variants" + program_description = "gvanno - workflow for functional and clinical annotation of human DNA variants" program_options = " --query_vcf \n --gvanno_dir \n --output_dir \n --genome_assembly " + \ "\n --sample_id \n 
--container " @@ -44,13 +44,16 @@ def __main__(): optional_vep.add_argument('--vep_n_forks', default = 4, help="Number of forks for VEP processing, default: %(default)s") optional_vep.add_argument('--vep_buffer_size', default = 500, help="Variant buffer size (variants read into memory simultaneously) " + \ "for VEP processing\n- set lower to reduce memory usage, higher to increase speed, default: %(default)s") - optional_vep.add_argument('--vep_pick_order', default = "canonical,appris,biotype,ccds,rank,tsl,length,mane", help="Comma-separated string " + \ + optional_vep.add_argument('--vep_pick_order', default = "mane_select,mane_plus_clinical,canonical,appris,biotype,ccds,rank,tsl,length", help="Comma-separated string " + \ "of ordered transcript properties for primary variant pick in\nVEP processing, default: %(default)s") optional_vep.add_argument('--vep_skip_intergenic', action = "store_true", help="Skip intergenic variants (VEP), default: %(default)s") optional_vep.add_argument('--vep_coding_only', action = "store_true", help="Only return consequences that fall in the coding regions of transcripts (VEP), default: %(default)s") optional.add_argument('--vcfanno_n_processes', default = 4, help="Number of processes for vcfanno " + \ "processing (see https://github.com/brentp/vcfanno#-p), default: %(default)s") - required.add_argument('--query_vcf', help='VCF input file with germline query variants (SNVs/InDels).', required = True) + optional.add_argument('--oncogenicity_annotation', action ='store_true', help = 'Classify variants according to oncogenicity (Horak et al., Genet Med, 2022)') + optional.add_argument("--debug", action="store_true", help="Print full Docker/Singularity commands to log and do not delete intermediate files with warnings etc.") + + required.add_argument('--query_vcf', help='VCF input file with query variants (SNVs/InDels).', required = True) required.add_argument('--gvanno_dir',help='Directory that contains the gvanno data bundle, e.g. 
~/gvanno-' + str(GVANNO_VERSION), required = True) required.add_argument('--output_dir',help='Output directory', required = True) required.add_argument('--genome_assembly',choices = ['grch37','grch38'], help='Genome assembly build: grch37 or grch38', required = True) @@ -65,21 +68,25 @@ def __main__(): ## Check that VEP pick criteria is formatted correctly if not arg_dict['vep_pick_order'] is None: values = str(arg_dict['vep_pick_order']).split(',') - permitted_sources = ['canonical','appris','tsl','biotype','ccds','rank','length','mane'] + permitted_sources = ['canonical','appris','tsl','biotype','ccds','rank','length','mane_plus_clinical','mane_select'] num_permitted_sources = 0 for v in values: if v in permitted_sources: num_permitted_sources += 1 - if num_permitted_sources != 8: + if num_permitted_sources != 9: err_msg = "Option 'vep_pick_order' = " + str(arg_dict['vep_pick_order']) + " is formatted incorrectly, should be " + \ - "a comma-separated string of the following values: canonical,appris,tsl,biotype,ccds,rank,length,mane" + "a comma-separated string of the following values: mane_select,mane_plus_clinical,canonical,appris,tsl,biotype,ccds,rank,length" gvanno_error_message(err_msg, logger) if arg_dict['container'] is None: err_msg = 'Please specify whether the gvanno workflow is running through Docker or Singularity (--container )' gvanno_error_message(err_msg, logger) + if arg_dict['oncogenicity_annotation'] is True and arg_dict['vep_lof_prediction'] is False: + err_msg = "Option --oncogenicity_annotation requires --vep_lof_prediction turned on" + gvanno_error_message(err_msg, logger) + logger = getlogger('gvanno-check-files') # check that script and Docker image version correspond @@ -295,12 +302,13 @@ def run_gvanno(arg_dict, host_directories): container_command_run2 = container_command_run2 + " -W /workdir/output " + 'src/gvanno.sif' + " sh -c \"" docker_command_run_end = '\"' - #logger.info(container_command_run1) - 
#logger.info(container_command_run2) + if arg_dict['debug']: + logger.info(container_command_run1) + logger.info(container_command_run2) ## GVANNO|start - Log key information about sample, options and assembly logger = getlogger("gvanno-start") - logger.info("--- Germline variant annotation (gvanno) workflow ----") + logger.info("--- Generic variant annotation (gvanno) workflow ----") logger.info("Sample name: " + str(arg_dict['sample_id'])) logger.info("Genome assembly: " + str(arg_dict['genome_assembly'])) print() @@ -310,7 +318,8 @@ def run_gvanno(arg_dict, host_directories): logger.info("STEP 0: Validate input data") vcf_validate_command = str(container_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + \ str(vcf_validation) + " " + str(arg_dict['genome_assembly']) + docker_command_run_end - #logger.info(vcf_validate_command) + if arg_dict['debug']: + logger.info(vcf_validate_command) check_subprocess(vcf_validate_command) logger.info('Finished') @@ -335,10 +344,10 @@ def run_gvanno(arg_dict, host_directories): ## List all VEP flags used when calling VEP loftee_dir = '/opt/vep/src/ensembl-vep/modules' plugins_in_use = "NearestExonJB" - vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --domains --symbol --protein --ccds " + \ + vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomade --af_gnomadg --variant_class --domains --symbol --protein --ccds " + \ "--uniprot --appris --biotype --canonical --format vcf --mane --cache --numbers --total_length --allele_number --no_escape " + \ "--xref_refseq --plugin NearestExonJB,max_range=50000" - vep_options = "--vcf --quiet --check_ref --flag_pick_allele --pick_order " + str(arg_dict['vep_pick_order']) + \ + vep_options = "--vcf --quiet --check_ref --flag_pick_allele_gene --pick_order " + str(arg_dict['vep_pick_order']) + \ " --force_overwrite --species homo_sapiens --assembly " + str(VEP_ASSEMBLY) + " --offline --fork " + \ 
str(arg_dict['vep_n_forks']) + " " + str(vep_flags) + " --dir /usr/local/share/vep/data" @@ -378,6 +387,8 @@ def run_gvanno(arg_dict, host_directories): logger.info("VEP configuration - loss-of-function prediction: " + str(arg_dict['vep_lof_prediction'])) logger.info("VEP configuration - plugins in use: " + str(plugins_in_use)) + if arg_dict['debug']: + logger.info(vep_main_command) check_subprocess(vep_main_command) check_subprocess(vep_bgzip_command) check_subprocess(vep_tabix_command) @@ -386,11 +397,14 @@ def run_gvanno(arg_dict, host_directories): ## GVANNO|vcfanno - annotate VCF against a number of variant annotation resources print() logger = getlogger('gvanno-vcfanno') - logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (ClinVar, ncER, dbNSFP, GWAS catalog, cancerhotspots.org)") + logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (Clinvar, ncER, dbNSFP, GWAS catalog)") logger.info('vcfanno configuration - number of processes (-p): ' + str(arg_dict['vcfanno_n_processes'])) gvanno_vcfanno_command = str(container_command_run2) + "gvanno_vcfanno.py --num_processes " + str(arg_dict['vcfanno_n_processes']) + \ - " --dbnsfp --clinvar --ncer --gvanno_xref --gwas --cancer_hotspots " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \ + " --dbnsfp --clinvar --ncer --gvanno_xref --gwas " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \ " " + os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + docker_command_run_end + + if arg_dict['debug']: + logger.info(gvanno_vcfanno_command) check_subprocess(gvanno_vcfanno_command) logger.info("Finished") @@ -398,9 +412,14 @@ def run_gvanno(arg_dict, host_directories): print() logger = getlogger("gvanno-summarise") logger.info("STEP 3: Summarise gene and variant annotations with gvanno-summarise") + logger.info("Configuration - oncogenicity classification: " + str(int(arg_dict['oncogenicity_annotation']))) gvanno_summarise_command = 
str(container_command_run2) + "gvanno_summarise.py " + str(vep_vcfanno_vcf) + ".gz " + \ os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + " " + str(int(arg_dict['vep_lof_prediction'])) + \ - " " + str(int(arg_dict['vep_regulatory'])) + docker_command_run_end + " " + str(int(arg_dict['oncogenicity_annotation'])) + " " + str(int(arg_dict['vep_regulatory'])) + " " + \ + str(int(arg_dict['debug'])) + docker_command_run_end + + if arg_dict['debug']: + logger.info(gvanno_summarise_command) check_subprocess(gvanno_summarise_command) logger.info("Finished") @@ -415,14 +434,15 @@ def run_gvanno(arg_dict, host_directories): check_subprocess(create_output_vcf_command2) check_subprocess(create_output_vcf_command3) check_subprocess(create_output_vcf_command4) - check_subprocess(clean_command) + if not arg_dict['debug']: + check_subprocess(clean_command) print() ## GVANNO|vcf2tsv - convert VCF to TSV with https://github.com/sigven/vcf2tsv logger = getlogger("gvanno-vcf2tsv") - logger.info("STEP 4: Converting VCF to TSV with https://github.com/sigven/vcf2tsvpy") - gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsv.py " + str(output_pass_vcf) + " --compress " + str(output_pass_tsv) + docker_command_run_end - gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsv.py " + str(output_vcf) + " --compress --keep_rejected " + str(output_tsv) + docker_command_run_end + logger.info("STEP 4: Converting genomic VCF to TSV with https://github.com/sigven/vcf2tsvpy") + gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_pass_vcf) + " --compress --out_tsv " + str(output_pass_tsv) + docker_command_run_end + gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_vcf) + " --compress --keep_rejected --out_tsv " + str(output_tsv) + docker_command_run_end logger.info("Conversion of VCF variant data to records of tab-separated values - PASS variants only") 
check_subprocess(gvanno_vcf2tsv_command_pass) logger.info("Conversion of VCF variant data to records of tab-separated values - PASS and non-PASS variants") diff --git a/src/Dockerfile b/src/Dockerfile index b228d05..1a0a2e2 100644 --- a/src/Dockerfile +++ b/src/Dockerfile @@ -1,3 +1,4 @@ + ################################################### # Stage 1 - docker container to build ensembl-vep # ################################################### @@ -22,13 +23,13 @@ RUN apt-get update && apt-get -y install \ ENV OPT /opt/vep ENV OPT_SRC $OPT/src ENV HTSLIB_DIR $OPT_SRC/htslib -ENV BRANCH release/107 +ENV BRANCH=release/109 # Working directory WORKDIR $OPT_SRC # Clone/download repositories/libraries -RUN if [ "$BRANCH" = "master" ]; \ +RUN if [ "$BRANCH" = "main" ]; \ then export BRANCH_OPT=""; \ else export BRANCH_OPT="-b $BRANCH"; \ fi && \ @@ -166,7 +167,8 @@ ENV PERL5LIB $PERL5LIB_TMP WORKDIR / ADD loftee_1.0.3.tgz $OPT/src/ensembl-vep/modules ADD UTRannotator.tgz $OPT/src/ensembl-vep/modules -RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/release/107/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm +ADD NearestExonJB.pm $OPT/src/ensembl-vep/modules +#RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/$BRANCH/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm # Final steps @@ -178,6 +180,24 @@ RUN echo >> $OPT/.profile && \ # Run INSTALL.pl and remove the ensemb-vep tests and travis ./INSTALL.pl -a a -l -n && rm -rf t travisci .travis.yml + +# Install dependencies for VEP plugins: +#USER root +#ENV PLUGIN_DEPS "https://raw.githubusercontent.com/Ensembl/VEP_plugins/$BRANCH/config" +## - Ubuntu packages +#RUN curl -O "$PLUGIN_DEPS/ubuntu-packages.txt" && \ +# apt-get update && apt-get install -y --no-install-recommends \ +# $(sed s/\#.*//g ubuntu-packages.txt) && \ +# rm -rf /var/lib/apt/lists/* ubuntu-packages.txt +## - Perl modules +#RUN curl -O "$PLUGIN_DEPS/cpanfile" && \ +# cpanm --installdeps 
--with-recommends . && \ +# rm -rf /root/.cpanm cpanfile +## - Python packages +#RUN curl -O "$PLUGIN_DEPS/requirements.txt" && \ +# python -m pip install --no-cache-dir -r requirements.txt && \ +# rm requirements.txt + ###################################################### # Stage 3 - adding dependencies for gvanno analysis # ###################################################### @@ -200,7 +220,6 @@ WORKDIR / ENV PACKAGE_BIO="libhts2 bedtools" ENV PACKAGE_DEV="gfortran gcc-multilib autoconf liblzma-dev libncurses5-dev libblas-dev liblapack-dev libssh2-1-dev libxml2-dev vim libssl-dev libcairo2-dev libbz2-dev libcurl4-openssl-dev" -#ENV PYTHON_MODULES="numpy==1.19.2 cython==0.29.21 scipy==1.5.3 pandas==1.1.3 cyvcf2==0.20.9 toml==0.10.1" ENV PYTHON_MODULES="numpy==1.19.2 cython==0.29.21 scipy==1.5.3 pandas==1.1.3 cyvcf2==0.20.9" RUN apt-get update \ && apt-get install -y --no-install-recommends \ @@ -224,9 +243,9 @@ chmod 755 /usr/local/bin/vcf_validator USER root WORKDIR / -RUN wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2 -RUN bunzip2 -dc samtools-1.10.tar.bz2 | tar xvf - -RUN cd samtools-1.10 && ./configure --prefix=/usr/local/bin && make -j && make install +RUN wget https://github.com/samtools/samtools/releases/download/1.15.1/samtools-1.15.1.tar.bz2 +RUN bunzip2 -dc samtools-1.15.1.tar.bz2 | tar xvf - +RUN cd samtools-1.15.1 && ./configure --prefix=/usr/local && make -j && make install WORKDIR / @@ -241,18 +260,19 @@ RUN apt-get update \ USER root WORKDIR / -## vt - variant tool set - use conda version -## primary use in GVANNO/PCGR/CPSR: decomposition of multiallelic variants in a VCF file +## vt - variant tool set + vcf2tsvpy - install using Conda +## primary use of vt in GVANNO: decomposition of multiallelic variants in a VCF file + +# install Conda RUN wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh \ && chmod 0755 miniconda.sh RUN ["/bin/bash", "-c", "/miniconda.sh -b -p 
/conda"] RUN rm miniconda.sh -# update conda & install vt +# update conda & install vt + vcf2tsvpy RUN /conda/bin/conda update conda -#RUN /conda/bin/conda update python -RUN /conda/bin/conda install -c bioconda vt -#RUN /conda/bin/conda install -c bioconda vcf2tsvpy +RUN /conda/bin/conda update python +RUN /conda/bin/conda install -c bioconda vt vcf2tsvpy python=3.6 ## Clean Up RUN apt-get clean autoclean @@ -268,13 +288,11 @@ VOLUME /data USER root WORKDIR / -#RUN rm -f nlopt-2.4.2.tar.gz RUN rm -rf $HOME/src/ensembl-vep/t/ RUN rm -f $HOME/src/v335_base.tar.gz RUN rm -f $HOME/src/release-1-6-924.zip -RUN rm -rf /samtools-1.10.tar.bz2 -#RUN rm -f /conda/bin/python +RUN rm -rf /samtools-1.15.1* ADD gvanno.tgz / ENV PATH=$PATH:/conda/bin:/gvanno -ENV PYTHONPATH=:/gvanno/lib:${PYTHONPATH} \ No newline at end of file +ENV PYTHONPATH=:/gvanno/lib:${PYTHONPATH} diff --git a/src/NearestExonJB.pm b/src/NearestExonJB.pm new file mode 100644 index 0000000..30ffcec --- /dev/null +++ b/src/NearestExonJB.pm @@ -0,0 +1,143 @@ +=head1 LICENSE + +Copyright [1999-2015] Wellcome Trust Sanger Institute and the EMBL-European Bioinformatics Institute +Copyright [2016-2023] EMBL-European Bioinformatics Institute + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+ +=head1 CONTACT + + Ensembl + +=cut + +=head1 NAME + + NearestExonJB + +=head1 SYNOPSIS + + mv NearestExonJB.pm ~/.vep/Plugins + ./vep -i variations.vcf --cache --plugin NearestExonJB + +=head1 DESCRIPTION + + This is a plugin for the Ensembl Variant Effect Predictor (VEP) that + finds the nearest exon junction boundary to a coding sequence variant. More than + one boundary may be reported if the boundaries are equidistant. + + The plugin will report the Ensembl identifier of the exon, the distance to the + exon boundary, the boundary type (start or end of exon) and the total + length in nucleotides of the exon. + + Various parameters can be altered by passing them to the plugin command: + + - max_range : maximum search range in bp (default: 10000) + + Parameters are passed e.g.: + + --plugin NearestExonJB,max_range=50000 + +=cut + +package NearestExonJB; + +use strict; +use warnings; + +use Bio::EnsEMBL::Variation::Utils::BaseVepPlugin; + +use base qw(Bio::EnsEMBL::Variation::Utils::BaseVepPlugin); + +my $char_sep = "|"; + +my %CONFIG = ( + max_range => 10000, +); + +sub new { + my $class = shift; + + my $self = $class->SUPER::new(@_); + + my $params = $self->params; + + # get output format + $char_sep = "+" if ($self->{config}->{output_format} eq 'vcf'); + + foreach my $param(@$params) { + my ($key, $val) = split('=', $param); + die("ERROR: Failed to parse parameter $param\n") unless defined($key) && defined($val); + $CONFIG{$key} = $val; + } + + return $self; +} + +sub feature_types { + return ['Transcript']; +} + +sub get_header_info { + my $header = 'Nearest Exon Junction Boundary (coding sequence variants only). 
Format:'; + $header .= join($char_sep, qw(ExonID distance start/end length) ); + + return { + NearestExonJB => $header, + } +} + +sub run { + my ($self, $tva) = @_; + + my $vf = $tva->base_variation_feature; + my $trv = $tva->base_transcript_variation; + + my $loc_string = sprintf("%s:%s-%i-%i", $trv->transcript_stable_id, $vf->{chr} || $vf->seq_region_name, $vf->{start}, $vf->{end}); + + if(!exists($self->{_cache}) || !exists($self->{_cache}->{$loc_string})) { + my $exons = $trv->_overlapped_exons; + my %dists; + my $min = $CONFIG{max_range}; + foreach my $exon (@$exons) { + my $startD = abs ($vf->start - $exon->seq_region_start); + my $endD = abs ($vf->end - $exon->seq_region_end); + if ($startD < $endD){ + $dists{$exon->stable_id}{$startD} = 'start'; + $dists{$exon->stable_id}{len} = $exon->length; + $min = $startD if $min > $startD; + } elsif ($startD > $endD){ + $dists{$exon->stable_id}{$endD} = 'end'; + $dists{$exon->stable_id}{len} = $exon->length; + $min = $endD if $min > $endD; + } else { + $dists{$exon->stable_id}{$startD} = "start_end"; + $dists{$exon->stable_id}{len} = $exon->length; + $min = $startD if $min > $startD; + } + } + + my @finalRes; + foreach my $exon (keys %dists){ + if (exists $dists{$exon}{$min}) { + push(@finalRes, $exon.$char_sep.$min.$char_sep.$dists{$exon}{$min}.$char_sep.$dists{$exon}{len}) + } + } + + $self->{_cache}->{$loc_string} = scalar @finalRes ? join(",", @finalRes) : undef; + } + return $self->{_cache}->{$loc_string} ? { NearestExonJB => $self->{_cache}->{$loc_string} } : {}; +} + +1; + diff --git a/src/buildDocker.sh b/src/buildDocker.sh index 9a84197..8962fd6 100755 --- a/src/buildDocker.sh +++ b/src/buildDocker.sh @@ -1,6 +1,5 @@ -cp /Users/sigven/research/docker/pcgr/src/loftee_1.0.3.tgz . -cp /Users/sigven/research/software/vcf2tsv/vcf2tsv.py gvanno/ -#cp /Users/sigven/research/docker/pcgr/src/pcgr/lib/annoutils.py gvanno/lib/ +#cp /Users/sigven/research/docker/pcgr/src/loftee_1.0.3.tgz . 
+#cp /Users/sigven/research/software/vcf2tsv/vcf2tsv.py gvanno/ tar czvfh gvanno.tgz gvanno/ echo "Build the Docker Image" TAG=`date "+%Y%m%d"` diff --git a/src/gvanno.tgz b/src/gvanno.tgz index a484a28..0c6ec29 100644 Binary files a/src/gvanno.tgz and b/src/gvanno.tgz differ diff --git a/src/gvanno/gvanno_summarise.py b/src/gvanno/gvanno_summarise.py index af87e66..bd57c2e 100755 --- a/src/gvanno/gvanno_summarise.py +++ b/src/gvanno/gvanno_summarise.py @@ -7,6 +7,7 @@ import gzip import os import annoutils +import oncogenicity logger = annoutils.getlogger('gvanno-gene-annotate') csv.field_size_limit(500 * 1024 * 1024) @@ -18,17 +19,19 @@ def __main__(): parser.add_argument('vcf_file', help='VCF file with VEP-annotated query variants (SNVs/InDels)') parser.add_argument('gvanno_db_dir',help='gvanno data directory') parser.add_argument('lof_prediction',default=0,type=int,help='VEP LoF prediction setting (0/1)') + parser.add_argument('oncogenicity_annotation',default=0,type=int,help='Include oncogenicity annotation (0/1)') parser.add_argument('regulatory_annotation',default=0,type=int,help='Inclusion of VEP regulatory annotations (0/1)') + parser.add_argument('debug',default=0,type=int,help='Output extensive debugging information') args = parser.parse_args() - extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction, args.regulatory_annotation) + extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction, args.oncogenicity_annotation, args.regulatory_annotation, args.debug) -def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, regulatory_annotation = 0): +def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, oncogenicity_annotation = 0, regulatory_annotation = 0, debug = 0): """ Function that reads VEP/vcfanno-annotated VCF and extends the VCF INFO column with tags from 1. CSQ elements within the primary transcript consequence picked by VEP, e.g. 
SYMBOL, Feature, Gene, Consequence etc. - 2. Gene annotations, e.g. known oncogenes/tumor suppressors, curated disease associations (DisGenet), MIM phenotype associations etc - 3. Protein-relevant annotations, e.g. c functional protein features etc. + 2. Gene annotations, e.g. known oncogenes/tumor suppressors + 3. Cancer hotspot mutation sites 4. Variant effect predictions """ @@ -44,21 +47,27 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r for tag in vcf_infotags_meta: if lof_prediction == 0 and regulatory_annotation == 0: if not tag.startswith('LoF') and not tag.startswith('REGULATORY_'): - vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) + vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ + 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) elif lof_prediction == 1 and regulatory_annotation == 0: if not tag.startswith('REGULATORY_'): - vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) + vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ + 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) elif lof_prediction == 0 and regulatory_annotation == 1: if not tag.startswith('LoF'): - vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) + vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ + 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) else: - 
vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) + vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ + 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) w = Writer(out_vcf, vcf) current_chrom = None num_chromosome_records_processed = 0 num_records_filtered = 0 + + cancer_hotspots = annoutils.read_cancer_hotspots(logger, os.path.join(gvanno_db_directory,'cancer_hotspots', 'cancer_hotspots.tsv')) vcf_info_element_types = {} for e in vcf.header_iter(): @@ -79,9 +88,6 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r num_chromosome_records_processed = 0 if rec.INFO.get('CSQ') is None: num_records_filtered = num_records_filtered + 1 - #alt_allele = ','.join(rec.ALT) - #pos = rec.start + 1 - #variant_id = 'g.' + str(rec.CHROM) + ':' + str(pos) + str(rec.REF) + '>' + alt_allele #logger.warning('Variant record ' + str(variant_id) + ' does not have CSQ tag from Variant Effect Predictor (--vep_skip_intergenic or --vep_coding_only turned ON?) 
- variant will be skipped') continue num_chromosome_records_processed += 1 @@ -90,30 +96,44 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r if regulatory_annotation == 1: csq_record_results_all = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = False, csq_identifier = 'CSQ') - if 'vep_block' in csq_record_results_all: - vep_csq_records_all = csq_record_results_all['vep_block'] + if 'picked_gene_csq' in csq_record_results_all: + vep_csq_records_all = csq_record_results_all['picked_gene_csq'] rec.INFO['REGULATORY_ANNOTATION'] = annoutils.map_regulatory_variant_annotations(vep_csq_records_all) - csq_record_results_pick = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ') - - if 'vep_all_csq' in csq_record_results_pick: - rec.INFO['VEP_ALL_CSQ'] = ','.join(csq_record_results_pick['vep_all_csq']) - if 'vep_block' in csq_record_results_pick: - vep_csq_records = csq_record_results_pick['vep_block'] + vep_csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ', debug = debug) - block_idx = 0 - record = vep_csq_records[block_idx] - for k in record: + + principal_hgvsp = '.' + principal_hgvsc = '.' 
+ if 'picked_csq' in vep_csq_record_results: + csq_record = vep_csq_record_results['picked_csq'] + for k in csq_record: if k in vcf_info_element_types: - if vcf_info_element_types[k] == "Flag" and record[k] == "1": - rec.INFO[k] = True + if vcf_info_element_types[k] == "Flag": + #rec.INFO[k] = False + if csq_record[k] == "1": + rec.INFO[k] = True else: - if not record[k] is None: - rec.INFO[k] = record[k] + if not csq_record[k] is None: + rec.INFO[k] = csq_record[k] + + if k == 'HGVSp_short': + principal_hgvsp = csq_record[k] + + if k == 'HGVSc': + principal_hgvsc = csq_record[k].split(':')[1] + #else: + # print("missing\t" + str(k)) + + if 'all_csq' in vep_csq_record_results: + rec.INFO['VEP_ALL_CSQ'] = ','.join(vep_csq_record_results['all_csq']) + annoutils.map_cancer_hotspots(vep_csq_record_results['all_csq'], cancer_hotspots, rec, principal_hgvsp, principal_hgvsc) if not rec.INFO.get('DBNSFP') is None: annoutils.map_variant_effect_predictors(rec, dbnsfp_prediction_algorithms) + if oncogenicity_annotation == 1: + oncogenicity.assign_oncogenicity_evidence(rec, tumortype = "Any") w.write_record(rec) w.close() logger.info('Completed summary of functional annotations for ' + str(num_chromosome_records_processed) + ' variants on chromosome ' + str(current_chrom)) diff --git a/src/gvanno/gvanno_vcfanno.py b/src/gvanno/gvanno_vcfanno.py index 7840a21..d972a5d 100755 --- a/src/gvanno/gvanno_vcfanno.py +++ b/src/gvanno/gvanno_vcfanno.py @@ -20,10 +20,8 @@ def __main__(): parser.add_argument("--ncer",action = "store_true", help="Annotate VCF with ranking of variant deleteriousness in non-coding regions (ncER)") parser.add_argument("--clinvar",action = "store_true", help="Annotate VCF with annotations from ClinVar") parser.add_argument("--dbnsfp",action = "store_true", help="Annotate VCF with annotations from database of non-synonymous functional predictions") - #parser.add_argument("--uniprot",action = "store_true", help="Annotate VCF with protein functional features from 
the UniProt Knowledgebase") parser.add_argument("--gvanno_xref",action = "store_true", help="Annotate VCF with transcript annotations from gvanno (protein complexes, disease associations, etc)") parser.add_argument("--gwas",action = "store_true", help="Annotate VCF with against known loci associated with cancer, as identified from genome-wide association studies (GWAS)") - parser.add_argument("--cancer_hotspots",action = "store_true", help="Annotate VCF with mutation hotspots from cancerhotspots.org") args = parser.parse_args() @@ -31,7 +29,7 @@ def __main__(): vcfheader_file = args.out_vcf + '.tmp.' + str(random.randrange(0,10000000)) + '.header.txt' vcfanno_conf_fname = args.out_vcf + '.tmp.conf.toml' print_vcf_header(args.query_vcf, vcfheader_file, chromline_only = False) - run_vcfanno(args.num_processes, args.query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, args.gvanno_db_dir, args.out_vcf, args.ncer, args.clinvar, args.dbnsfp, args.gvanno_xref,args.gwas, args.cancer_hotspots) + run_vcfanno(args.num_processes, args.query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, args.gvanno_db_dir, args.out_vcf, args.ncer, args.clinvar, args.dbnsfp, args.gvanno_xref,args.gwas) def prepare_vcfanno_configuration(vcfanno_data_directory, vcfanno_conf_fname, vcfheader_file, logger, datasource_info_tags, query_info_tags, datasource): @@ -41,7 +39,7 @@ def prepare_vcfanno_configuration(vcfanno_data_directory, vcfanno_conf_fname, vc append_to_conf_file(datasource, datasource_info_tags, vcfanno_data_directory, vcfanno_conf_fname) append_to_vcf_header(vcfanno_data_directory, datasource, vcfheader_file) -def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, gvanno_db_directory, output_vcf, ncer, clinvar, dbnsfp, gvanno_xref,gwas, cancer_hotspots): +def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, gvanno_db_directory, output_vcf, ncer, clinvar, dbnsfp, gvanno_xref, gwas): 
""" Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files """ @@ -52,10 +50,7 @@ def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, v gvanno_xref_info_tags = ["GVANNO_XREF"] gwas_info_tags = ["GWAS_HIT"] ncer_info_tags = ["NCER_PERCENTILE"] - cancer_hotspots_info_tags = ["MUTATION_HOTSPOT","MUTATION_HOTSPOT_TRANSCRIPT","MUTATION_HOTSPOT_CANCERTYPE"] - - if cancer_hotspots is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, cancer_hotspots_info_tags, query_info_tags, "cancer_hotspots") + if clinvar is True: prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, clinvar_info_tags, query_info_tags, "clinvar") if dbnsfp is True: diff --git a/src/gvanno/lib/annoutils.py b/src/gvanno/lib/annoutils.py index 2b8362d..5678c9a 100755 --- a/src/gvanno/lib/annoutils.py +++ b/src/gvanno/lib/annoutils.py @@ -9,7 +9,9 @@ csv.field_size_limit(500 * 1024 * 1024) -threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H','Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W','Tyr':'Y','Val':'V','Ter':'X'} +threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H', + 'Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W', + 'Tyr':'Y','Val':'V','Ter':'X'} @@ -23,6 +25,7 @@ def read_genexref_namemap(logger, gene_xref_namemap_tsv): reader = csv.DictReader(tsvfile, delimiter='\t') for row in reader: namemap_xref[row['name']] = int(row['index']) + tsvfile.close() return namemap_xref @@ -45,12 +48,13 @@ def read_infotag_file(logger, vcf_info_tags_tsv): for row in reader: if not row['tag'] in info_tag_xref: info_tag_xref[row['tag']] = row + tsvfile.close() return info_tag_xref def check_subprocess(command): - #if debug: - 
#logger.info(command) + # if debug: + # logger.info(command) try: output = subprocess.check_output(str(command), stderr=subprocess.STDOUT, shell=True) if len(output) > 0: @@ -79,7 +83,6 @@ def write_pass_vcf(annotated_vcf, logger): Function prints all PASS variants from a VCF file. New VCF file appends '.pass.' to the filename. """ - #out_vcf = re.sub(r'\.annotated\.vcf\.gz$','.annotated.pass.vcf',annotated_vcf) out_vcf = re.sub(r'\.vcf\.gz$','.pass.vcf',annotated_vcf) vcf = VCF(annotated_vcf) w = Writer(out_vcf, vcf) @@ -268,7 +271,7 @@ def get_correct_cpg_transcript(vep_csq_records): return csq_idx - ## some variants iare assigned multiple transcript consequences + ## some variants are assigned multiple transcript consequences ## if cancer predisposition genes are in the vicinity of other genes, choose the cancer predisposition gene ## if there are neighbouring cancer-predispositon genes, choose custom gene, preferring coding change (see below, KLLN/PTEN, XPC/TMEM43, NTHL1/TSC2) csq_idx_dict = {} @@ -576,15 +579,24 @@ def make_transcript_xref_map(rec, fieldmap, xref_tag = 'PCGR_ONCO_XREF'): return(transcript_xref_map) def vep_dbnsfp_meta_vcf(query_vcf, info_tags_wanted): - vep_to_pcgr_af = {'gnomAD_AMR_AF':'AMR_AF_GNOMAD', - 'gnomAD_AFR_AF':'AFR_AF_GNOMAD', - 'gnomAD_EAS_AF':'EAS_AF_GNOMAD', - 'gnomAD_NFE_AF':'NFE_AF_GNOMAD', - 'gnomAD_AF':'GLOBAL_AF_GNOMAD', - 'gnomAD_SAS_AF':'SAS_AF_GNOMAD', - 'gnomAD_OTH_AF':'OTH_AF_GNOMAD', - 'gnomAD_ASJ_AF':'ASJ_AF_GNOMAD', - 'gnomAD_FIN_AF':'FIN_AF_GNOMAD', + vep_to_pcgr_af = {'gnomADe_AMR_AF':'AMR_AF_GNOMAD', + 'gnomADe_AFR_AF':'AFR_AF_GNOMAD', + 'gnomADe_EAS_AF':'EAS_AF_GNOMAD', + 'gnomADe_NFE_AF':'NFE_AF_GNOMAD', + 'gnomADe_AF':'GLOBAL_AF_GNOMAD', + 'gnomADe_SAS_AF':'SAS_AF_GNOMAD', + 'gnomADe_OTH_AF':'OTH_AF_GNOMAD', + 'gnomADe_ASJ_AF':'ASJ_AF_GNOMAD', + 'gnomADe_FIN_AF':'FIN_AF_GNOMAD', + 'gnomADg_AMR_AF':'AMR_AF_GNOMADg', + 'gnomADg_AFR_AF':'AFR_AF_GNOMADg', + 'gnomADg_EAS_AF':'EAS_AF_GNOMADg', + 
'gnomADg_NFE_AF':'NFE_AF_GNOMADg', + 'gnomADg_AF':'GLOBAL_AF_GNOMADg', + 'gnomADg_SAS_AF':'SAS_AF_GNOMADg', + 'gnomADg_OTH_AF':'OTH_AF_GNOMADg', + 'gnomADg_ASJ_AF':'ASJ_AF_GNOMADg', + 'gnomADg_FIN_AF':'FIN_AF_GNOMADg', 'AFR_AF':'AFR_AF_1KG', 'AMR_AF':'AMR_AF_1KG', 'SAS_AF':'SAS_AF_1KG', @@ -628,14 +640,33 @@ def vep_dbnsfp_meta_vcf(query_vcf, info_tags_wanted): return vep_dbnsfp_meta_info -def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ'): +def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ', debug = 0): all_csq_pick = [] all_transcript_consequences = [] + + varkey = str(rec.CHROM) + '_' + str(rec.POS) + '_' + str(rec.REF) + '_' + str(','.join(rec.ALT)) + for csq in rec.INFO.get(csq_identifier).split(','): csq_fields = csq.split('|') + entrezgene = '.' + + ## Entrez gene identifier is not provided directly by VEP, pull out this from 'transcript_xref_map' for a given transcript-specific CSQ block + ## - used for 'consequence_entry' object that are added to 'vep_all_csq' array + k = 0 + while(k < len(csq_fields)): + if k in vep_csq_fields_map['index2field']: + if vep_csq_fields_map['index2field'][k] == 'Feature': + ensembl_transcript_id = csq_fields[k] + if ensembl_transcript_id != '' and ensembl_transcript_id.startswith('ENST'): + if ensembl_transcript_id in transcript_xref_map.keys(): + if 'ENTREZGENE' in transcript_xref_map[ensembl_transcript_id].keys(): + entrezgene = transcript_xref_map[ensembl_transcript_id]['ENTREZGENE'] + k = k + 1 + + if pick_only is False: j = 0 csq_record = {} @@ -649,7 +680,7 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl else: ## loop over VEP consequence blocks PICK'ed according to VEP's ranking scheme - if csq_fields[vep_csq_fields_map['field2index']['PICK']] == "1": ## only consider the primary/picked consequence when expanding with annotation tags + if 
csq_fields[vep_csq_fields_map['field2index']['PICK']] == "1": ## only consider the primary/picked consequence(s) when expanding with annotation tags j = 0 csq_record = {} @@ -666,7 +697,10 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl ## assign additional gene/transcript annotations from the custom transcript ## xref map (PCGR/CPSR/gvanno) as key,value pairs in the csq_record object csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation] + #print(str(annotation) + "\t" + str(transcript_xref_map[ensembl_transcript_id][annotation])) + else: + #print(str(annotation) + "\t" + str(transcript_xref_map[ensembl_transcript_id][annotation])) ## If not VEP provides SYMBOL, append SYMBOL provided by xref map if csq_record['SYMBOL'] is None: csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation] @@ -706,27 +740,246 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl assign_cds_exon_intron_annotations(csq_record) ## Append transcript consequence to all_csq_pick all_csq_pick.append(csq_record) - symbol = '.' - hgvsc = ".:." - hgvsp = ".:." + symbol = "." + hgvsc = "." + hgvsp = "." 
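The entries collected in `all_transcript_consequences` are colon-delimited strings of eight fields (consequence, symbol, Entrez gene id, HGVSc, HGVSp, feature type, feature, biotype). A minimal standalone sketch of building and unpacking such an entry, using illustrative values for a hypothetical BRAF p.Val600Glu record (not taken from a real VEP run):

```python
# Sketch: the eight colon-delimited fields of a transcript consequence entry.
# All values below are illustrative.
fields = ["missense_variant", "BRAF", "673", "c.1799T>A", "p.Val600Glu",
          "Transcript", "ENST00000288602", "protein_coding"]
consequence_entry = ":".join(fields)

# Unpacking works only because no individual field contains ':' itself;
# this is why the transcript prefix ('ENST...:') is stripped from
# HGVSc/HGVSp before the entry is assembled.
(consequence, symbol, entrezgene, hgvsc, hgvsp,
 feature_type, feature, biotype) = consequence_entry.split(":")

print(symbol, entrezgene, hgvsp)
```

This round trip is exactly what downstream consumers of the entry (e.g. the hotspot mapping) rely on when they split on `':'`.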
if csq_fields[vep_csq_fields_map['field2index']['SYMBOL']] != "": symbol = str(csq_fields[vep_csq_fields_map['field2index']['SYMBOL']]) if csq_fields[vep_csq_fields_map['field2index']['HGVSc']] != "": - hgvsc = str(csq_fields[vep_csq_fields_map['field2index']['HGVSc']]) + hgvsc = str(csq_fields[vep_csq_fields_map['field2index']['HGVSc']].split(':')[1]) if csq_fields[vep_csq_fields_map['field2index']['HGVSp']] != "": - hgvsp = str(csq_fields[vep_csq_fields_map['field2index']['HGVSp']]) - consequence_entry = (str(csq_fields[vep_csq_fields_map['field2index']['Consequence']]) + ':' + - str(symbol) + ':' + - str(hgvsc) + ':' + - str(hgvsp) + ':' + - str(csq_fields[vep_csq_fields_map['field2index']['Feature_type']]) + ':' + - str(csq_fields[vep_csq_fields_map['field2index']['Feature']]) + ':' + + hgvsp = str(csq_fields[vep_csq_fields_map['field2index']['HGVSp']].split(':')[1]) + consequence_entry = (str(csq_fields[vep_csq_fields_map['field2index']['Consequence']]) + ":" + + str(symbol) + ":" + + str(entrezgene) + ":" + + str(hgvsc) + ":" + + str(hgvsp) + ":" + + str(csq_fields[vep_csq_fields_map['field2index']['Feature_type']]) + ":" + + str(csq_fields[vep_csq_fields_map['field2index']['Feature']]) + ":" + str(csq_fields[vep_csq_fields_map['field2index']['BIOTYPE']])) all_transcript_consequences.append(consequence_entry) - + + + vep_chosen_csq_idx = 0 vep_csq_results = {} - vep_csq_results['vep_block'] = all_csq_pick - vep_csq_results['vep_all_csq'] = all_transcript_consequences + vep_csq_results['picked_gene_csq'] = all_csq_pick + vep_csq_results['all_csq'] = all_transcript_consequences + vep_csq_results['picked_csq'] = None + + ## If multiple transcript-specific variant consequences are highlighted by --pick_allele_gene, + ## prioritize/choose the consequence block which has + ## - A gene with BIOTYPE equal to 'protein_coding' (the other picked transcript/gene may potentially carry another BIOTYPE) + ## - A gene consequence classified as 'exonic' (the other picked 
transcript/gene likely carries a nonexonic consequence) + if len(vep_csq_results['picked_gene_csq']) > 0: + vep_selected_idx = {} + vep_selected_idx['exonic_status'] = {} + vep_selected_idx['consequence'] = {} + + #if debug: + #print("Picked VEP blocks: " + str(vep_csq_results['picked_gene_csq'])) + + i = 0 + picked_blocks = [] + for trans_rec in vep_csq_results['picked_gene_csq']: + + if debug: + biotype = '.' + consequence = '.' + exonic_status = '.' + genesymbol = '.' + distance = '.' + feature = '.' + if not trans_rec['Consequence'] is None: + consequence = trans_rec['Consequence'] + if not trans_rec['SYMBOL'] is None: + genesymbol = trans_rec['SYMBOL'] + if not trans_rec['BIOTYPE'] is None: + biotype = trans_rec['BIOTYPE'] + if not trans_rec['EXONIC_STATUS'] is None: + exonic_status = trans_rec['EXONIC_STATUS'] + if not trans_rec['DISTANCE'] is None: + distance = trans_rec['DISTANCE'] + if not trans_rec['Feature'] is None: + feature = trans_rec['Feature'] + block = str(genesymbol) + ':' + str(feature) + ':' + str(consequence) + ':' + str(distance) + ':' + str(biotype) + ':' + str(exonic_status) + picked_blocks.append(block) + + + if 'BIOTYPE' in trans_rec and 'Consequence' in trans_rec and 'EXONIC_STATUS' in trans_rec: + if not trans_rec['BIOTYPE'] is None and not trans_rec['Consequence'] is None: + if trans_rec['BIOTYPE'] == "protein_coding": + ## for protein-coding genes - record the exonic variant consequence status (exonic/nonexonic) + vep_selected_idx['exonic_status'][i] = trans_rec['EXONIC_STATUS'] + vep_selected_idx['consequence'][i] = trans_rec['Consequence'] + i = i + 1 + + + if debug: + print(str(varkey) + " - picked CSQ blocks: " + ' /// '.join(picked_blocks)) + + + ## when multiple transcript gene blocks are picked by VEP, prioritize the block with 'exonic' consequence + if len(vep_selected_idx['exonic_status'].keys()) > 1: + exonic_cons_found = 0 + for j in vep_selected_idx['exonic_status'].keys(): + if vep_selected_idx['exonic_status'][j] == 
'exonic': + exonic_cons_found = 1 + vep_chosen_csq_idx = j + + ## if multiple non-exonic variants are found, prioritize UTR variants over other nonexonic + if exonic_cons_found == 0: + for j in vep_selected_idx['consequence'].keys(): + if vep_selected_idx['consequence'][j] == '5_prime_UTR_variant' or vep_selected_idx['consequence'][j] == '3_prime_UTR_variant': + vep_chosen_csq_idx = j + + else: + if len(vep_selected_idx['exonic_status'].keys()) == 1: + for k in vep_selected_idx['exonic_status'].keys(): + vep_chosen_csq_idx = k + + vep_csq_results['picked_csq'] = vep_csq_results['picked_gene_csq'][vep_chosen_csq_idx] + + else: + print('ERROR: No VEP block chosen by --pick_allele_gene') return(vep_csq_results) + + +def read_cancer_hotspots(logger, hotspots_tsv): + + ## load cancer hotspot entries from 'cancer_hotspots.tsv' (provided by github.com/sigven/cancerHotspots) + + hotspots = {} ##dictionary returned + mutation_hotspots = {} + codon_hotspots = {} + splice_hotspots = {} + if not os.path.exists(hotspots_tsv): + logger.info("ERROR: File '" + str(hotspots_tsv) + "' does not exist - exiting") + exit(1) + tsvfile = open(hotspots_tsv, 'r') + reader = csv.DictReader(tsvfile, delimiter='\t') + for row in reader: + mutation_hotspots[str(row['entrezgene']) + '-' + row['hgvsp2']] = row + codon_hotspots[str(row['entrezgene']) + '-' + row['codon']] = row + if row['hgvsc'] != '.': + splice_hotspots[str(row['entrezgene']) + '-' + str(row['hgvsc'])] = row + tsvfile.close() + + hotspots['mutation'] = mutation_hotspots + hotspots['codon'] = codon_hotspots + hotspots['splice'] = splice_hotspots + return hotspots + + +def map_cancer_hotspots(transcript_csq_elements, cancer_hotspots, rec, principal_hgvsp, principal_hgvsc): + + unique_hotspot_mutations = {} + unique_hotspot_codons = {} + + principal_codon = '.' 
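Codon-level hotspot keys are derived from the one-letter HGVSp notation with a simple regular expression, mirroring the logic in `map_cancer_hotspots`. A standalone sketch with illustrative values (673 is BRAF's Entrez gene id):

```python
import re

# Sketch: build a codon-level hotspot key from a one-letter HGVSp string,
# using the same regex as the hotspot mapping. Values are illustrative.
entrezgene = "673"        # BRAF (Entrez gene id)
hgvsp_short = "p.V600E"   # one-letter HGVSp, after three-to-one AA conversion

codon_match = re.findall(r'p.[A-Z][0-9]{1,}', hgvsp_short)
hotspot_key_codon = str(entrezgene) + '-' + str(codon_match[0])
print(hotspot_key_codon)  # -> 673-p.V600
```

The key deliberately drops the alternate amino acid (`E`), so any substitution at the same codon hits the same dictionary entry.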
+ if re.match(r'^(p.[A-Z]{1}[0-9]{1,}[A-Za-z]{1,})', principal_hgvsp): + codon_match = re.findall(r'[A-Z][0-9]{1,}',principal_hgvsp) + if len(codon_match) == 1: + principal_codon = codon_match[0] + + ## loop through all transcript-specific consequences ('csq_elements') for a given variant, check for the presence of + ## 1. Exonic, protein-altering mutations (dictionary key = entrezgene + hgvsp) that overlap known cancer hotspots (https://github.com/sigven/cancerHotspots) + ## - assert whether a potential hit reflects the principal hgvsp ('by_hgvsp_principal') or an alternative hgvsp ('by_hgvsp_nonprincipal') + ## 2. Splice site mutations (dictionary key = entrezgene + hgvsc) that overlap known cancer hotspots (NOTE: only a subset have been curated) + ## - assert whether a potential hit reflects the principal hgvsc ('by_hgvsc_principal') or an alternative hgvsc ('by_hgvsc_nonprincipal') + + for csq in transcript_csq_elements: + (consequence, symbol, entrezgene, hgvsc, hgvsp, feature_type, feature, biotype) = csq.split(':') + + if not bool(re.search(r'^(missense|stop|start|inframe|splice_donor|splice_acceptor|frameshift)', consequence)) is True: + continue + + hgvsp_short = threeToOneAA(hgvsp) + hotspot_key_mutation = "." + codon_match = [] + if entrezgene != "." and hgvsp != ".": + hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsp_short) + codon_match = re.findall(r'p.[A-Z][0-9]{1,}',hgvsp_short) + + if entrezgene != "." 
and (consequence == 'splice_donor_variant' or consequence == 'splice_acceptor_variant'): + hgvsc_key = re.sub(r'>(A|G|C|T)$','',hgvsc) + hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsc_key) + + if hotspot_key_mutation == ".": + continue + + if hotspot_key_mutation in cancer_hotspots['mutation']: + hotspot_info = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT2'] + hotspot_info_ttype = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE'] + unique_hotspot_mutations['exonic|' + str(hotspot_info)] = hotspot_info_ttype + + if hotspot_key_mutation in cancer_hotspots['splice']: + hotspot_info = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT2'] + hotspot_info_ttype = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE'] + unique_hotspot_mutations['splice|' + str(hotspot_info)] = hotspot_info_ttype + + + if len(codon_match) > 0: + hotspot_key_codon = str(entrezgene) + '-' + str(codon_match[0]) + + if hotspot_key_codon in cancer_hotspots['codon']: + unique_hotspot_codons[str('exonic|') + cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT2']] = \ + cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT_CANCERTYPE'] + + if len(unique_hotspot_mutations.keys()) > 0: + if len(unique_hotspot_mutations.keys()) == 1: + for gene_mutation_key in unique_hotspot_mutations.keys(): + rec.INFO['MUTATION_HOTSPOT'] = gene_mutation_key + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[gene_mutation_key] + + if gene_mutation_key.split('|')[0] == 'exonic': + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_nonprincipal' + hgvsp_candidate = 'p.' 
+ str(gene_mutation_key.split('|')[3]) + str(gene_mutation_key.split('|')[4]) + if hgvsp_candidate == principal_hgvsp: + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal' + else: + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_nonprincipal' + hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(gene_mutation_key.split('|')[4])) + if hgvsc_candidate == principal_hgvsc: + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal' + else: + ## multiple hotspot matches for alternative hgvsp keys + ## - keep only match against principal HGVSp + for hotspot_info in unique_hotspot_mutations.keys(): + if hotspot_info.split('|')[0] == 'exonic': + hgvsp_candidate = "p." + str(hotspot_info.split('|')[3]) + str(hotspot_info.split('|')[4]) + + if hgvsp_candidate == principal_hgvsp: + rec.INFO['MUTATION_HOTSPOT'] = hotspot_info + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal' + else: + hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(hotspot_info.split('|')[4])) + + if hgvsc_candidate == principal_hgvsc: + rec.INFO['MUTATION_HOTSPOT'] = hotspot_info + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal' + + else: + if len(unique_hotspot_codons.keys()) > 0: + if len(unique_hotspot_codons.keys()) == 1: + for gene_codon_key in unique_hotspot_codons.keys(): + + if '|' in gene_codon_key: + + codon = str(gene_codon_key.split('|')[3]) + + if codon == principal_codon: + rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_principal' + else: + rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_nonprincipal' + + + return + diff --git 
a/src/gvanno/lib/oncogenicity.py b/src/gvanno/lib/oncogenicity.py new file mode 100644 index 0000000..28aa347 --- /dev/null +++ b/src/gvanno/lib/oncogenicity.py @@ -0,0 +1,422 @@ +#!/usr/bin/env python + +import os,re,sys +from cyvcf2 import VCF, Writer + +def assign_oncogenicity_evidence(rec = None, tumortype = "Any"): + + clingen_vicc_ev_codes = [ + "CLINGEN_VICC_SBVS1", + "CLINGEN_VICC_SBS1", + "CLINGEN_VICC_SBP1", + "CLINGEN_VICC_SBP2", + "CLINGEN_VICC_OS3", + "CLINGEN_VICC_OM2", + "CLINGEN_VICC_OM3", + "CLINGEN_VICC_OM4", + "CLINGEN_VICC_OP1", + "CLINGEN_VICC_OP3", + "CLINGEN_VICC_OP4", + "CLINGEN_VICC_OVS1"] + + ### Benign oncogenic effects of somatic variants + + # 1) "CLINGEN_VICC_SBVS1" + ## Very high MAF: > 0.05 in gnomAD - any 5 general continental populations + ## AFR/AMR/EAS/NFE/SAS + + # 2) "CLINGEN_VICC_SBS1" + ## High MAF: > 0.01 in gnomAD - any 5 general continental populations + ## AFR/AMR/EAS/NFE/SAS + + # 3) "CLINGEN_VICC_SBP1" + ## Multiple lines of computational evidence support a benign + ## effect on the gene or gene product + ## (conservation, evolutionary, splicing impact, etc. 
- from dbNSFP + + # 4) "CLINGEN_VICC_SBP2" + ## Synonymous (silent) variant for which splicing prediction + ## algorithms predict no effect on the splice consensus sequence + ## nor the creation of a new splice site and the nucleotide is + ## not highly conserved + + # 5) "CLINGEN_VICC_SBS2" + ## Well-established in vitro/in vivo functional studies show + ## no oncogenic effects + + ## Oncogenic effects of somatic variants + + # 6) "CLINGEN_VICC_OVS1" + ## Null variant - predicted as LoF by LOFTEE + ## - Nonsense, frameshift, canonical splice sites, initiation codon, + ## single-exon/multi-exon deletion + ## - Tumor suppressor gene + + # 7) "CLINGEN_VICC_OS1" + ## Same amino acid change as previously established oncogenic variant + ## NOTE: Not relevant since current implementation does not consider + ## any criteria where nucleotide change is affecting annotation + + # 8) "CLINGEN_VICC_OS2" + ## Well-established in vitro/in vivo functional studies show + ## oncogenic effect of the variant + + # 9) "CLINGEN_VICC_OS3" + ## Located in a mutation hotspot + ## - >= 50 samples with a somatic variant at the AA position (cancerhotspots.org) + ## - Same amino acid change in >= 10 samples (cancerhotspots.org) + + # 10) "CLINGEN_VICC_OM1" + ## Located in a critical and well-established part of a functional domain + ## (active site of an enzyme) + ## NOTE: Not used since determination/retrieval of critical protein sites is non-trivial to automate + + # 11) "CLINGEN_VICC_OM2" + ## Protein length changes as a result of in-frame deletions/insertions in a + ## known oncogene/tumor suppressor gene or stop-loss variants in a + ## tumor suppressor gene + + # 12) "CLINGEN_VICC_OM3" + ## Missense variant at an amino acid residue where a different missense + ## variant determined to be oncogenic (using this standard) has been + ## documented. 
Amino acid difference from reference amino acid should + ## be greater or at least approximately the same as for missense change + ## determined to be oncogenic. + + # 13) "CLINGEN_VICC_OM4" + ## Located in a mutation hotspot + ## - < 50 samples with a somatic variant at the AA position (cancerhotspots.org) + ## - Same amino acid change in >= 10 samples (cancerhotspots.org) + ## - Not applicable if OM1 or OM3 is applicable + + # 14) "CLINGEN_VICC_OP1" + ## All utilized lines of computational evidence support an oncogenic effect + ## of a variant (conservation, evolutionary, splicing effect) + + # 15) "CLINGEN_VICC_OP2" + ## Somatic variant in a gene in a malignancy with a single genetic + ## etiology. Example: retinoblastoma is caused by bi-allelic + ## RB1 inactivation. + + # 16) "CLINGEN_VICC_OP3" + ## Located in a mutation hotspot + ## - Same amino acid change in < 10 samples (cancerhotspots.org) + ## - Not applicable if OM1 or OM3 is applicable + + # 17) "CLINGEN_VICC_OP4" + ## Absent from controls (gnomAD) + ## - Extremely low MAF + + + required_oncogenicity_vars = [ + "Consequence", + "MUTATION_HOTSPOT", + "MUTATION_HOTSPOT_CANCERTYPE", + "SYMBOL", + "ONCOGENE", + "ONCOGENE_EVIDENCE", + "TUMOR_SUPPRESSOR", + "TUMOR_SUPPRESSOR_EVIDENCE", + "LoF", + "INTRON_POSITION", + "EXON_POSITION", + "EAS_AF_GNOMAD", + "NFE_AF_GNOMAD", + "AFR_AF_GNOMAD", + "AMR_AF_GNOMAD", + "SAS_AF_GNOMAD", + "DBNSFP_SIFT", + "DBNSFP_PROVEAN", + "DBNSFP_META_RNN", + "DBNSFP_FATHMM", + "DBNSFP_MUTATIONTASTER", + "DBNSFP_DEOGEN2", + "DBNSFP_PRIMATEAI", + "DBNSFP_MUTATIONASSESSOR", + "DBNSFP_FATHMM_MKL", + "DBNSFP_M_CAP", + "DBNSFP_LIST_S2", + "DBNSFP_BAYESDEL_ADDAF", + "DBNSFP_SPLICE_SITE_ADA", + "DBNSFP_SPLICE_SITE_RF"] + + variant_data = {} + for col in required_oncogenicity_vars: + if rec.INFO.get(col) is None: + if col == "TUMOR_SUPPRESSOR" or col == "ONCOGENE": + variant_data[col] = False + elif col == "LoF": + variant_data['LOSS_OF_FUNCTION'] = False + else: + variant_data[col] = None + else: + if 
rec.INFO.get(col) == '': + variant_data[col] = True + else: + if col == "LoF": + if rec.INFO.get(col) == "HC": + variant_data["LOSS_OF_FUNCTION"] = True + else: + variant_data[col] = rec.INFO.get(col) + + for code in clingen_vicc_ev_codes: + variant_data[code] = False + + dbnsfp_minimum_majority = 7 + dbnsfp_maximum_minority = 2 + dbnsfp_minimum_algos_called = dbnsfp_minimum_majority + + variant_data['N_INSILICO_CALLED'] = 0 + variant_data['N_INSILICO_DAMAGING'] = 0 + variant_data['N_INSILICO_TOLERATED'] = 0 + variant_data['N_INSILICO_SPLICING_NEUTRAL'] = 0 + variant_data['N_INSILICO_SPLICING_AFFECTED'] = 0 + + for algo in ['SIFT','PROVEAN','META_RNN', + 'MUTATIONTASTER','DEOGEN2', + 'PRIMATEAI','MUTATIONASSESSOR', + 'FATHMM_MKL','M_CAP', + 'LIST_S2','BAYESDEL_ADDAF', + 'SPLICE_SITE_RF','SPLICE_SITE_ADA']: + col = 'DBNSFP_' + str(algo) + if col in variant_data.keys(): + if variant_data[col] != '.': + variant_data['N_INSILICO_CALLED'] += 1 + if variant_data[col] == 'D': + variant_data['N_INSILICO_DAMAGING'] += 1 + if variant_data[col] == 'T': + variant_data['N_INSILICO_TOLERATED'] += 1 + if (col == 'SPLICE_SITE_RF' or \ + col == 'SPLICE_SITE_ADA') and \ + variant_data[col] == 'SN': + variant_data['N_INSILICO_SPLICING_NEUTRAL'] += 1 + if (col == 'SPLICE_SITE_RF' or \ + col == 'SPLICE_SITE_ADA') and \ + variant_data[col] == 'AS': + variant_data['N_INSILICO_SPLICING_AFFECTED'] += 1 + + if (variant_data['N_INSILICO_CALLED'] > dbnsfp_minimum_algos_called and \ + variant_data['N_INSILICO_DAMAGING'] >= dbnsfp_minimum_majority and \ + variant_data['N_INSILICO_TOLERATED'] <= dbnsfp_maximum_minority and \ + variant_data['N_INSILICO_SPLICING_NEUTRAL'] <= 1) or \ + variant_data['N_INSILICO_SPLICING_AFFECTED'] == 2: + variant_data['CLINGEN_VICC_OP1'] = True + + if variant_data['N_INSILICO_CALLED'] > dbnsfp_minimum_algos_called and \ + variant_data['N_INSILICO_TOLERATED'] >= dbnsfp_minimum_majority and \ + variant_data['N_INSILICO_DAMAGING'] <= dbnsfp_maximum_minority and \ + 
variant_data['N_INSILICO_SPLICING_AFFECTED'] == 0: + variant_data['CLINGEN_VICC_SBP1'] = True + + hotspot_mutated_samples = {} + hotspot_mutated_samples['aa_variant'] = {} + hotspot_mutated_samples['aa_site'] = {} + hotspot_mutated_samples['aa_variant']['Any'] = 0 + hotspot_mutated_samples['aa_site']['Any'] = 0 + + if not variant_data['MUTATION_HOTSPOT_CANCERTYPE'] is None: + ttype_samples_in_hotspots = variant_data['MUTATION_HOTSPOT_CANCERTYPE'].split(',') + for ttype in ttype_samples_in_hotspots: + ttype_stats = ttype.split('|') + if len(ttype_stats) == 3: + ttype_hotspot = ttype_stats[0] + ttype_hotspot_samples_site = int(ttype_stats[1]) + ttype_hotspot_samples_aa = ttype_hotspot_samples_site + if ttype_stats[2] != '': + ttype_hotspot_samples_aa = int(ttype_stats[2]) + + if ttype_hotspot != "Unknown": + hotspot_mutated_samples['aa_variant'][ttype_hotspot] = ttype_hotspot_samples_aa + hotspot_mutated_samples['aa_site'][ttype_hotspot] = ttype_hotspot_samples_site + + hotspot_mutated_samples['aa_variant']['Any'] = hotspot_mutated_samples['aa_variant']['Any'] + ttype_hotspot_samples_aa + hotspot_mutated_samples['aa_site']['Any'] = hotspot_mutated_samples['aa_site']['Any'] + ttype_hotspot_samples_site + + + ttype = "Any" + if tumortype != "Any": + ttype = re.sub(r'/', '@', re.sub(r' ','_', tumortype)) + + if ttype in hotspot_mutated_samples['aa_variant'].keys() and \ + ttype in hotspot_mutated_samples['aa_site'].keys(): + if hotspot_mutated_samples['aa_variant'][ttype] >= 10 and \ + hotspot_mutated_samples['aa_site'][ttype] >= 50: + variant_data['CLINGEN_VICC_OS3'] = True + if hotspot_mutated_samples['aa_variant'][ttype] >= 10 and \ + hotspot_mutated_samples['aa_site'][ttype] < 50: + variant_data['CLINGEN_VICC_OM4'] = True + if hotspot_mutated_samples['aa_variant'][ttype] < 10 and \ + hotspot_mutated_samples['aa_variant'][ttype] > 0: + variant_data['CLINGEN_VICC_OP3'] = True + + + if "EAS_AF_GNOMAD" in variant_data.keys() and \ + "SAS_AF_GNOMAD" in 
variant_data.keys() and \ + "AMR_AF_GNOMAD" in variant_data.keys() and \ + "AFR_AF_GNOMAD" in variant_data.keys() and \ + "NFE_AF_GNOMAD" in variant_data.keys(): + + ## check if variant has MAF > 0.01 (SBVS1) or > 0.05 in any of five major gnomAD subpopulations (exome set) + for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']: + if not variant_data[pop] is None: + ## MAF for this population >= 0.05 + if float(variant_data[pop]) >= 0.05: + variant_data["CLINGEN_VICC_SBVS1"] = True + for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']: + if not variant_data[pop] is None: + ## MAF for this population >= 0.01 (< 0.05) + if float(variant_data[pop]) >= 0.01 and variant_data["CLINGEN_VICC_SBVS1"] is False: + variant_data["CLINGEN_VICC_SBS1"] = True + + missing_pop_freq = 0 + approx_zero_pop_freq = 0 + for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']: + ## no MAF recorded in gnomAD for this population + if variant_data[pop] is None: + missing_pop_freq = missing_pop_freq + 1 + else: + ## Very low MAF for this population + if float(variant_data[pop]) < 0.0001: + approx_zero_pop_freq = approx_zero_pop_freq + 1 + + ## check if variant is missing or with MAF approximately zero in all five major gnomAD subpopulations (exome set) + if missing_pop_freq == 5 or approx_zero_pop_freq == 5: + variant_data["CLINGEN_VICC_OP4"] = True + + ## check if variant is a loss-of-function variant (LOFTEE) in a tumor suppressor gene (Cancer Gene Census/CancerMine) + if "TUMOR_SUPPRESSOR" in variant_data.keys() and \ + "ONCOGENE" in variant_data.keys() and \ + "LOSS_OF_FUNCTION" in variant_data.keys() and \ + "Consequence" in variant_data.keys(): + + if variant_data['LOSS_OF_FUNCTION'] is True and variant_data['TUMOR_SUPPRESSOR'] is True: + variant_data['CLINGEN_VICC_OVS1'] = True + + ## check if variant is creating a stop-lost or protein-length change in oncogene/tumor 
suppressor genes + if variant_data['CLINGEN_VICC_OVS1'] is False and \ + ((re.match(r'^(inframe_deletion|inframe_insertion)', variant_data['Consequence']) and \ + (variant_data['TUMOR_SUPPRESSOR'] is True or variant_data['ONCOGENE'] is True)) or \ + (re.match(r'^(stop_lost)', variant_data['Consequence']) and \ + variant_data['TUMOR_SUPPRESSOR'] is True)): + variant_data['CLINGEN_VICC_OM2'] = True + + ## check if variant is silent (synonymous|splice) and outside critical splice region + if "INTRON_POSITION" in variant_data.keys() and \ + "EXON_POSITION" in variant_data.keys() and \ + "DBNSFP_SPLICE_SITE_RF" in variant_data.keys() and \ + "Consequence" in variant_data.keys(): + + if (int(variant_data['INTRON_POSITION']) < 0 and int(variant_data['INTRON_POSITION']) < -3 or \ + int(variant_data['INTRON_POSITION']) > 0 and int(variant_data['INTRON_POSITION']) > 6 or \ + int(variant_data['EXON_POSITION']) < 0 and int(variant_data['EXON_POSITION']) < -2 or \ + int(variant_data['EXON_POSITION']) > 0 and int(variant_data['EXON_POSITION']) > 1) and \ + variant_data['DBNSFP_SPLICE_SITE_RF'] != "AS" and \ + re.match(r'^(synonymous_variant|splice_region_variant)', variant_data['Consequence']): + variant_data['CLINGEN_VICC_SBP2'] = True + + + og_score_data = {} + og_score_data['code'] = \ + ['CLINGEN_VICC_SBVS1','CLINGEN_VICC_SBS1', + 'CLINGEN_VICC_SBP1','CLINGEN_VICC_SBP2', + 'CLINGEN_VICC_OVS1','CLINGEN_VICC_OS3', + 'CLINGEN_VICC_OM2','CLINGEN_VICC_OM4', + 'CLINGEN_VICC_OP1','CLINGEN_VICC_OP3', + 'CLINGEN_VICC_OP4'] + og_score_data['category'] = \ + ['clinpop','clinpop', + 'funccomp','funcvar', + 'funcvar','funcvar', + 'funcvar','funcvar', + 'funccomp','funcvar', + 'clinpop'] + og_score_data['pole'] = \ + ['B','B','B','B', + 'P','P','P','P', + 'P','P','P'] + + og_score_data['description'] = \ + ['Very high MAF (> 0.05 in gnomAD - any five major continental pops)', + 'High MAF (> 0.01 in gnomAD - any five major continental pops)', + 'Multiple lines (>=7) of computational 
evidence support a benign effect on the gene or gene product - from dbNSFP', + 'Silent and intronic changes outside of the consensus splice site', + 'Null variant - predicted as LoF by LOFTEE - in bona fide tumor suppressor gene', + 'Located in a mutation hotspot (cancerhotspots.org). >= 50 samples with a variant at AA position, >= 10 samples with same AA change', + 'Protein length changes from in-frame dels/ins in known oncogene/tumor suppressor genes or stop-loss variants in a tumor suppressor gene', + 'Located in a mutation hotspot (cancerhotspots.org). < 50 samples with a variant at AA position, >= 10 samples with same AA change.', + 'Multiple lines (>=7) of computational evidence support a damaging effect on the gene or gene product - from dbNSFP', + 'Located in a mutation hotspot (cancerhotspots.org). < 10 samples with the same amino acid change.', + 'Absent from controls (gnomAD) / very low MAF ( < 0.0001 in all five major subpopulations)'] + + og_score_data['score'] = \ + [-8, -4, -1, -1, 8, 4, 2, 2, 1, 1, 1] + + i = 0 + oncogenicity_scores = {} + for code in og_score_data['code']: + oncogenicity_scores[code] = {} + for e in ['category','pole','description','score']: + oncogenicity_scores[code][e] = og_score_data[e][i] + if e == 'score': + oncogenicity_scores[code][e] = float(og_score_data[e][i]) + i = i + 1 + + variant_data['ONCOGENICITY_CLASSIFICATION'] = "VUS" + variant_data["ONCOGENICITY_CLASSIFICATION_DOC"] = "." + variant_data["ONCOGENICITY_CLASSIFICATION_CODE"] = "." + variant_data["ONCOGENICITY_SCORE"] = "." 
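The threshold variables defined a few lines further down translate the summed pathogenic and benign evidence scores into a final classification. A standalone sketch of that mapping (the function name is ours, not part of gvanno):

```python
# Sketch: map an aggregated oncogenicity score to a classification, using the
# same cutoffs as the threshold variables: <= -7 Benign, -6..-1 Likely_Benign,
# 6..9 Likely_Oncogenic, >= 10 Oncogenic, anything else VUS.
def classify_oncogenicity(score):
    if score >= 10:
        return "Oncogenic"
    if 6 <= score <= 9:
        return "Likely_Oncogenic"
    if score <= -7:
        return "Benign"
    if -6 <= score <= -1:
        return "Likely_Benign"
    return "VUS"

print(classify_oncogenicity(8))   # -> Likely_Oncogenic
print(classify_oncogenicity(-4))  # -> Likely_Benign
```

Note that the ranges are disjoint, so the order of the checks does not matter; scores of 0 to 5 (and -6 < score < -1 fractions, were they possible) fall through to "VUS".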
+ onc_score_pathogenic = 0 + onc_score_benign = 0 + + for code in clingen_vicc_ev_codes: + + if code in variant_data.keys(): + if variant_data[code] is True: + score = oncogenicity_scores[code]['score'] + pole = oncogenicity_scores[code]['pole'] + if pole == "P": + onc_score_pathogenic = onc_score_pathogenic + score + if pole == "B": + onc_score_benign = onc_score_benign + score + + for e in ['ONCOGENICITY_CLASSIFICATION_DOC', + 'ONCOGENICITY_CLASSIFICATION_CODE']: + + if variant_data[e] == ".": + if e == 'ONCOGENICITY_CLASSIFICATION_DOC': + variant_data[e] = oncogenicity_scores[code]['description'] + else: + variant_data[e] = code + else: + if e == 'ONCOGENICITY_CLASSIFICATION_DOC': + variant_data[e] = f'{variant_data[e]}|{oncogenicity_scores[code]["description"]}' + else: + variant_data[e] = f'{variant_data[e]}|{code}' + + lb_upper_limit = -1 + lb_lower_limit = -6 + b_upper_limit = -7 + #vus_lower_limit = 0 + #vus_upper_limit = 4 + lo_lower_limit = 6 + lo_upper_limit = 9 + o_lower_limit = 10 + + variant_data['ONCOGENICITY_SCORE'] = onc_score_benign + onc_score_pathogenic + if variant_data['ONCOGENICITY_SCORE'] <= lb_upper_limit and \ + variant_data['ONCOGENICITY_SCORE'] >= lb_lower_limit: + variant_data['ONCOGENICITY_CLASSIFICATION'] = "Likely_Benign" + if variant_data['ONCOGENICITY_SCORE'] >= lo_lower_limit and \ + variant_data['ONCOGENICITY_SCORE'] <= lo_upper_limit: + variant_data['ONCOGENICITY_CLASSIFICATION'] = "Likely_Oncogenic" + if variant_data['ONCOGENICITY_SCORE'] <= b_upper_limit: + variant_data['ONCOGENICITY_CLASSIFICATION'] = "Benign" + if variant_data['ONCOGENICITY_SCORE'] >= o_lower_limit: + variant_data['ONCOGENICITY_CLASSIFICATION'] = "Oncogenic" + + for e in ['ONCOGENICITY_SCORE', + 'ONCOGENICITY_CLASSIFICATION', + 'ONCOGENICITY_CLASSIFICATION_CODE']: + rec.INFO[e] = variant_data[e] + + return(rec) \ No newline at end of file diff --git a/src/gvanno/vcf2tsv.py b/src/gvanno/vcf2tsv.py deleted file mode 100755 index 7d5b544..0000000 --- 
a/src/gvanno/vcf2tsv.py +++ /dev/null @@ -1,302 +0,0 @@ -#!/usr/bin/env python - -import argparse -from cyvcf2 import VCF, Writer -import numpy as np -import re -import math -import subprocess - -version = '0.3.7.1' - - -def __main__(): - parser = argparse.ArgumentParser(description='Convert a VCF file with genomic variants to a file with tab-separated values (TSV). One entry (TSV line) per sample genotype', formatter_class=argparse.ArgumentDefaultsHelpFormatter) - parser.add_argument('query_vcf', help='Bgzipped input VCF file with query variants (SNVs/InDels)') - parser.add_argument('out_tsv', help='Output TSV file with one line pr non-rejected sample genotype (Variant, genotype and annotation data as tab-separated values)') - parser.add_argument("--skip_info_data",action = "store_true", help="Skip printing of data in INFO column") - parser.add_argument("--skip_genotype_data", action="store_true", help="Skip printing of genotype_data (FORMAT columns)") - parser.add_argument("--keep_rejected_calls", action="store_true", help="Print data for rejected calls") - parser.add_argument("--print_data_type_header", action="store_true", help="Print a header line with data types of VCF annotations") - parser.add_argument("--compress", action="store_true", help="Compress TSV file with gzip") - parser.add_argument('--version', action='version', version='%(prog)s ' + str(version)) - args = parser.parse_args() - - vcf2tsv(args.query_vcf, args.out_tsv, args.skip_info_data, args.skip_genotype_data, args.keep_rejected_calls, args.compress, args.print_data_type_header) - - -def check_subprocess(command): - try: - output = subprocess.check_output(str(command), stderr=subprocess.STDOUT, shell=True) - if len(output) > 0: - print (str(output.decode()).rstrip()) - except subprocess.CalledProcessError as e: - print (e.output) - exit(0) - - -def vcf2tsv(query_vcf, out_tsv, skip_info_data, skip_genotype_data, keep_rejected_calls, compress, print_data_type_header): - - vcf = VCF(query_vcf, 
gts012 = True) - out = open(out_tsv,'w') - - fixed_columns_header = ['CHROM','POS','ID','REF','ALT','QUAL','FILTER'] - fixed_columns_header_type = ['String','Integer','String','String','String','Float','String'] - samples = vcf.samples - info_columns_header = [] - format_columns_header = [] - sample_columns_header = [] - column_types = {} - gt_present_header = 0 - - if len(samples) > 0: - sample_columns_header.append('VCF_SAMPLE_ID') - - for e in vcf.header_iter(): - header_element = e.info() - if 'ID' in header_element.keys() and 'HeaderType' in header_element.keys(): - if header_element['HeaderType'] == 'INFO' or header_element['HeaderType'] == 'FORMAT': - column_types[header_element['ID']] = header_element['Type'] - if header_element['HeaderType'] == 'INFO': - if skip_info_data is False: - info_columns_header.append(header_element['ID']) - if header_element['HeaderType'] == 'FORMAT': - if len(sample_columns_header) > 0 and skip_genotype_data is False: - if header_element['ID'] != 'GT': - format_columns_header.append(header_element['ID']) - else: - gt_present_header = 1 - - #header_line = '\t'.join(fixed_columns_header) - header_tags = fixed_columns_header - if skip_info_data is False: - #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sorted(info_columns_header)) - header_tags = fixed_columns_header + sorted(info_columns_header) - if len(sample_columns_header) > 0: - if skip_genotype_data is False: - #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sorted(info_columns_header)) + '\t' + '\t'.join(sample_columns_header) + '\t' + '\t'.join(sorted(format_columns_header)) + '\tGT' - header_tags = fixed_columns_header + sorted(info_columns_header) + sample_columns_header + sorted(format_columns_header) + ['GT'] - else: - #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sorted(info_columns_header)) - header_tags = fixed_columns_header + sorted(info_columns_header) - else: - if len(sample_columns_header) > 0: - if 
skip_genotype_data is False: - #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sample_columns_header) + '\t' + '\t'.join(sorted(format_columns_header)) + '\tGT' - header_tags = fixed_columns_header + sample_columns_header + sorted(format_columns_header) + ['GT'] - else: - #header_line = '\t'.join(fixed_columns_header) - header_tags = fixed_columns_header - header_line = '\t'.join(header_tags) - - out.write('#https://github.com/sigven/vcf2tsv version=' + str(version) + '\n') - if print_data_type_header is True: - #header_tags = header_line.rstrip().split('\t') - header_types = [] - for h in header_tags: - if h in column_types: - header_types.append(str(column_types[h])) - #header_line_type = '\t'.join(fixed_columns_header_type) + '\t' + '\t'.join(header_types) - header_line_type = '\t'.join(fixed_columns_header_type + header_types) - out.write('#' + str(header_line_type) + '\n') - out.write(str(header_line) + '\n') - else: - out.write(str(header_line) + '\n') - - for rec in vcf: - rec_id = '.' - rec_qual = '.' - rec_filter = '.' 
- alt = ",".join(str(n) for n in rec.ALT) - if not rec.ID is None: - rec_id = str(rec.ID) - if not rec.QUAL is None: - rec_qual = str("{0:.2f}".format(rec.QUAL)) - rec_filter = str(rec.FILTER) - if rec.FILTER is None: - rec_filter = 'PASS' - - pos = int(rec.start) + 1 - fixed_fields_string = str(rec.CHROM) + '\t' + str(pos) + '\t' + str(rec_id) + '\t' + str(rec.REF) + '\t' + str(alt) + '\t' + str(rec_qual) + '\t' + str(rec_filter) - - - if not 'PASS' in rec_filter and not keep_rejected_calls: - continue - - variant_info = rec.INFO - vcf_info_data = [] - if skip_info_data is False: - for info_field in sorted(info_columns_header): - if column_types[info_field] == 'Flag': - if variant_info.get(info_field) is None: - vcf_info_data.append('False') - else: - vcf_info_data.append('True') - elif column_types[info_field] == 'Float' or column_types[info_field] == 'Integer' or column_types[info_field] == 'String' or column_types[info_field] == 'Character': - if type(variant_info.get(info_field)) is list or type(variant_info.get(info_field)) is tuple: - vcf_info_data.append(",".join(str(n) for n in variant_info.get(info_field))) - else: - if variant_info.get(info_field) is None: - vcf_info_data.append('.') - else: - if column_types[info_field] == 'Float': - if not isinstance(variant_info.get(info_field),float): - print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'Float\', yet parsed as other type:' + str(type(variant_info.get(info_field)))) - if not ',' in str(alt): - print('Warning: Multiple values in INFO tag for single ALT allele (VCF multiallelic sites not decomposed properly?):' + str(fixed_fields_string) + '\t' + str(info_field) + '=' + str(variant_info.get(info_field))) - vcf_info_data.append('.') - else: - val = str("{0:.7f}".format(variant_info.get(info_field))) - vcf_info_data.append(val) - else: - if column_types[info_field] == 'String' or column_types[info_field] == 'Character': - if 
isinstance(variant_info.get(info_field),str): - #print(str(info_field) + '\t' + variant_info.get(info_field).encode('ascii','ignore').rstrip().decode('ascii')) - vcf_info_data.append(variant_info.get(info_field).encode('ascii','ignore').decode('ascii')) - else: - vcf_info_data.append('.') - if column_types[info_field] == 'String': - print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'String\', yet parsed as other type:' + str(type(variant_info.get(info_field)))) - if column_types[info_field] == 'Character': - print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'Character\', yet parsed as other type:' + str(type(variant_info.get(info_field)))) - else: - if isinstance(variant_info.get(info_field),int): - vcf_info_data.append(str(variant_info.get(info_field))) - else: - print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'Integer\', yet parsed as other type:' + str(type(variant_info.get(info_field)))) - vcf_info_data.append(re.sub(r'\(|\)', '', variant_info.get(info_field).encode('ascii','ignore').decode('ascii'))) - - #print(str(vcf_info_data)) - #dictionary, with sample names as keys, values being genotype data (dictionary with format tags as keys) - vcf_sample_genotype_data = {} - - if len(samples) > 0 and skip_genotype_data is False: - gt_cyvcf = rec.gt_types - i = 0 - while i < len(samples): - vcf_sample_genotype_data[samples[i]] = {} - gt = './.' 
- if gt_present_header == 1: - if gt_cyvcf[i] == 0: - gt = '0/0' - if gt_cyvcf[i] == 1: - gt = '0/1' - if gt_cyvcf[i] == 2: - gt = '1/1' - vcf_sample_genotype_data[samples[i]]['GT'] = gt - i = i + 1 - - for format_tag in sorted(format_columns_header): - if len(samples) > 0 and skip_genotype_data is False: - sample_dat = rec.format(format_tag) - if sample_dat is None: - k = 0 - while k < len(samples): - if samples[k] in vcf_sample_genotype_data: - vcf_sample_genotype_data[samples[k]][format_tag] = '.' - k = k + 1 - continue - dim = sample_dat.shape - j = 0 - ## sample-wise - while j < dim[0]: - if sample_dat[j].size > 1: - - d = ','.join(str(e) for e in np.ndarray.tolist(sample_dat[j])) - if column_types[format_tag] == 'Float': - d = ','.join(str(round(e, 4)) for e in np.ndarray.tolist(sample_dat[j])) - ## undefined/missing value - if '-2147483648' in d: - d = d.replace('-2147483648', '.') - if 'nan' in d.casefold(): - d = d.casefold().replace('nan', '.') - if samples[j] in vcf_sample_genotype_data: - vcf_sample_genotype_data[samples[j]][format_tag] = d - else: - d = '.' - if column_types[format_tag] == 'Float': - if not math.isnan(sample_dat[j]): - d = str(sample_dat[j][0]) - if column_types[format_tag] == 'String': - d = str(sample_dat[j]) - if column_types[format_tag] == 'Integer': - d = str(sample_dat[j][0]) - ## undefined/missing value - if d == '-2147483648' or d.casefold() == 'nan': - d = '.' - if samples[j] in vcf_sample_genotype_data: - vcf_sample_genotype_data[samples[j]][format_tag] = d - j = j + 1 - - tsv_elements = [] - tsv_elements.append(fixed_fields_string) - if skip_info_data is False: - if skip_genotype_data is False: - if len(sample_columns_header) > 0: - tsv_elements.append("\t".join(str(n) for n in vcf_info_data)) - ## one line per sample variant - for s in sorted(vcf_sample_genotype_data.keys()): - sample = s - line_elements = [] - line_elements.extend(tsv_elements) - line_elements.append(sample) - gt_tag = '.' 
- for tag in sorted(vcf_sample_genotype_data[sample].keys()): - if tag != 'GT': - line_elements.append(vcf_sample_genotype_data[sample][tag].encode('ascii','ignore').decode('ascii')) - else: - gt_tag = vcf_sample_genotype_data[sample][tag].encode('ascii','ignore').decode('ascii') - line_elements.append(gt_tag) - if gt_tag == './.' or gt_tag == '.': - if keep_rejected_calls: - out.write('\t'.join(line_elements) + '\n') - else: - out.write("\t".join(str(n) for n in line_elements) + '\n') - - else: - tsv_elements.append("\t".join(str(n) for n in vcf_info_data)) - line_elements = [] - line_elements.extend(tsv_elements) - out.write('\t'.join(line_elements) + '\n') - else: - tsv_elements.append("\t".join(str(n) for n in vcf_info_data)) - line_elements = [] - line_elements.extend(tsv_elements) - out.write('\t'.join(line_elements) + '\n') - else: - if skip_genotype_data is False: - if len(sample_columns_header) > 0: - ## one line per sample variant - for s in sorted(vcf_sample_genotype_data.keys()): - sample = s - line_elements = [] - line_elements.extend(tsv_elements) - line_elements.append(sample) - gt_tag = '.' - for tag in sorted(vcf_sample_genotype_data[sample].keys()): - if tag != 'GT': - line_elements.append(vcf_sample_genotype_data[sample][tag]) - else: - gt_tag = vcf_sample_genotype_data[sample][tag] - line_elements.append(gt_tag) - if gt_tag == './.' or gt_tag == '.': - if keep_rejected_calls: - out.write('\t'.join(line_elements) + '\n') - else: - out.write('\t'.join(line_elements) + '\n') - else: - line_elements = [] - line_elements.extend(tsv_elements) - line_elements = tsv_elements - out.write('\t'.join(line_elements) + '\n') - - out.close() - - if compress is True: - command = 'gzip -f ' + str(out_tsv) - check_subprocess(command) - -if __name__=="__main__": __main__() - - - diff --git a/src/gvanno_logo.png b/src/gvanno_logo.png new file mode 100644 index 0000000..4f6899a Binary files /dev/null and b/src/gvanno_logo.png differ
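For reference, the classification thresholds added in the oncogenicity hunk above can be condensed into a standalone helper. This is an illustrative sketch only; `classify_oncogenicity` is not part of gvanno, and the thresholds simply mirror the limits in the diff (Benign <= -7, Likely_Benign in [-6, -1], Likely_Oncogenic in [6, 9], Oncogenic >= 10, VUS otherwise):

```python
def classify_oncogenicity(score: float) -> str:
    """Map a summed pathogenic + benign evidence score to a label.

    Hypothetical helper mirroring the threshold logic in the diff above;
    gvanno itself stores the result in the ONCOGENICITY_CLASSIFICATION
    INFO tag rather than returning it.
    """
    if score >= 10:
        return "Oncogenic"
    if 6 <= score <= 9:
        return "Likely_Oncogenic"
    if score <= -7:
        return "Benign"
    if -6 <= score <= -1:
        return "Likely_Benign"
    # Scores between -1 (exclusive) and 6 (exclusive) remain uncertain
    return "VUS"
```

Note that the diff applies the Benign/Oncogenic checks after the "Likely" checks, so the extreme ranges take precedence; the helper reproduces that by testing the extremes first.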