diff --git a/README.md b/README.md
index c460e77..7b4b3ec 100755
--- a/README.md
+++ b/README.md
@@ -1,169 +1,184 @@
-## _gvanno_ - workflow for functional and clinical annotation of germline nucleotide variants
+## *gvanno* - generic workflow for functional and clinical annotation of human DNA variants
+
+
### Contents
-- [Overview](#overview)
-- [News](#news)
-- [Annotation resources](#annotation-resources)
-- [Getting started](#getting-started)
-- [Contact](#contact)
+- [Overview](#overview)
+- [News](#news)
+- [Annotation resources](#annotation-resources)
+- [Getting started](#getting-started)
+- [Contact](#contact)
+
+### Overview
+
+The generic variant annotator (*gvanno*) is a software package intended for simple analysis and interpretation of human DNA variants. Variants and genes are annotated with disease-related and functional associations. Technically, the workflow is built with the [Docker](https://www.docker.com) technology, and it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.
+
+*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short insertions or deletions (indels). The workflow relies heavily upon [Ensembl's Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html) and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record **per sample variant**.
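The per-sample expansion can be sketched in a few lines of Python (illustrative only, not gvanno's actual implementation; genotype handling is simplified):

```python
# Sketch: how one multisample VCF record expands to one output row per
# sample variant. Samples with a non-reference genotype each yield a record.
def expand_multisample_record(vcf_line, sample_names):
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
    # genotype columns start at index 9; GT is the first subfield
    genotypes = [gt.split(":")[0] for gt in fields[9:]]
    rows = []
    for sample, gt in zip(sample_names, genotypes):
        # skip samples that are homozygous reference or have missing genotypes
        if gt in ("0/0", "0|0", "./.", "."):
            continue
        rows.append({"sample": sample, "chrom": chrom, "pos": pos,
                     "ref": ref, "alt": alt, "gt": gt})
    return rows

line = "1\t12345\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\t1/1"
print(expand_multisample_record(line, ["S1", "S2", "S3"]))
# two rows: sample S1 (0/1) and sample S3 (1/1)
```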
+
+### News
+
+- April 27th 2023 - **1.6.0 release**
+
+ - Added option `--oncogenicity_annotation` - classifies variants according to oncogenicity ([Horak et al., Genet Med, 2022](https://pubmed.ncbi.nlm.nih.gov/35101336/))
+ - Data updates: ClinVar, GENCODE, GWAS catalog, CancerMine
+ - Excluded extensive disease associations from the Open Targets Platform
-### Overview
+- September 26th 2022 - **1.5.1 release**
-The germline variant annotator (*gvanno*) is a software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with the [Docker](https://www.docker.com) technology, and it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.
+ - Added option `--vep_coding_only` - only report variants that fall into coding regions of transcripts (VEP option `--coding_only`)
-*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__.
+### Annotation resources (v1.6.0)
-### News
-* September 26th 2022 - **1.5.1 release**
- * Added option `--vep_coding_only` - only report variants that fall into coding regions of transcripts (VEP option `--coding_only`)
-* September 24th 2022 - **1.5.0 release**
- * Data updates: ClinVar, GENCODE GWAS catalog, CancerMine, Open Targets Platform
- * Software updates: VEP 107
- * Excluded UniProt KB from annotation tracks
-* December 21st 2021 - **1.4.4 release**
- * Data updates: ClinVar, GWAS catalog, CancerMine, UniProt KB, Open Targets Platform
- * Software updates: VEP (v105)
-* August 25th 2021 - **1.4.3 release**
- * Data updates: ClinVar, GWAS catalog, CancerMine, UniProt, Open Targets Platform
+- [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v109 (GENCODE v43/v19 as the gene reference dataset)
+- [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021)
+- [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
+- [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP
+- [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (April 2023)
+- [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 50, March 2023)
+- [Mutation hotspots](https://cancerhotspots.org) - Database of mutation hotspots in cancer
+- [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (March 27th 2023)
-### Annotation resources (v1.5.1)
+### Getting started
-* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v107 (GENCODE v41/v19 as the gene reference dataset)
-* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021)
-* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
-* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP
-* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
-* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (September 2022)
-* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 47, July 2022)
-* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2022_06, June 2022)
-* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v35.0, November 2021)
-* [Mutation hotspots](cancerhotspots.org) - Database of mutation hotspots in cancer
-* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (August 26th 2022)
+#### STEP 0: Prerequisites
+- *Python*
-### Getting started
+ An installation of Python (version \>=*3.6*) is required to run *gvanno*. Check that Python is installed by typing `python --version` in your terminal window.
-#### STEP 0: Python
+- *Other utilities*
-An installation of Python (version >=_3.6_) is required to run *gvanno*. Check that Python is installed by typing `python --version` in your terminal window.
+ The script that installs the reference data requires that the user has `bgzip` installed. See [here](http://www.htslib.org/download/) for instructions. The script also requires that basic Linux/UNIX commands are available (e.g. `gzip`, `tar`).
+
+ **NOTE**: We strongly recommend that *gvanno* is installed on a macOS or Linux/UNIX operating system.
#### STEP 1: Installation of Docker
-1. [Install the Docker engine](https://docs.docker.com/engine/installation/) on your preferred platform
- - installing [Docker on Linux](https://docs.docker.com/engine/installation/linux/)
- - installing [Docker on Mac OS](https://docs.docker.com/engine/installation/mac/)
- - NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with gvanno (an example being [mounting of data volumes](https://github.com/docker/toolbox/issues/607))
-2. Test that Docker is running, e.g. by typing `docker ps` or `docker images` in the terminal window
-3. Adjust the computing resources dedicated to the Docker, i.e.:
- - Memory: minimum 5GB
- - CPUs: minimum 4
- - [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced)
+1. [Install the Docker engine](https://docs.docker.com/engine/installation/) on your preferred platform
+ - installing [Docker on Linux](https://docs.docker.com/engine/installation/linux/)
+ - installing [Docker on Mac OS](https://docs.docker.com/engine/installation/mac/)
+ - NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with gvanno (an example being [mounting of data volumes](https://github.com/docker/toolbox/issues/607))
+2. Test that Docker is running, e.g. by typing `docker ps` or `docker images` in the terminal window
+3. Adjust the computing resources dedicated to the Docker, i.e.:
+ - Memory: minimum 5GB
+ - CPUs: minimum 4
+ - [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced)
-##### 1.1: Installation of Singularity (optional - in dev)
+##### 1.1: Installation of Singularity (*IN DEVELOPMENT*)
-0. **Note: this works for Singularity version 3.0 and higher**.
-1. [Install Singularity](https://sylabs.io/docs/)
-2. Test that singularity works by running `singularity --version`
-3. If you are in the gvanno directory, build the singularity image like so:
+0. **Note: this works for Singularity version 3.0 and higher**.
- `cd src`
+1. [Install Singularity](https://sylabs.io/docs/)
- `sudo ./buildSingularity.sh`
+2. Test that Singularity works by running `singularity --version`
+
+3. If you are in the gvanno directory, build the singularity image like so:
+
+ `cd src`
+
+ `sudo ./buildSingularity.sh`
#### STEP 2: Download *gvanno* and data bundle
-1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.5.1) (gvanno run script, v1.5.1)
-2. Download (preferably using `wget`) and unpack the latest assembly-specific data bundle in the gvanno directory
- * [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz) (approx 20Gb)
- * [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20220921.tgz) (approx 28Gb)
- * Example commands:
- * `wget http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz`
- * `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
+1. [Download and unpack the latest release](https://github.com/sigven/gvanno/releases/tag/v1.6.0)
+
+2. Install the assembly-specific VEP cache, and gvanno-specific reference data using the `download_gvanno_refdata.py` script, i.e.:
+
+ - `python download_gvanno_refdata.py --download_dir <download_dir> --genome_assembly grch38`
+
+ **NOTE**: This can take a considerable amount of time depending on your local bandwidth (approx. 30 GB per assembly-specific bundle)
+
+3. Pull the [gvanno Docker image (1.6.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx. 4.11 GB):
- A _data/_ folder within the _gvanno-1.5.1_ software folder should now have been produced
-3. Pull the [gvanno Docker image (1.5.1)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.2Gb):
- * `docker pull sigven/gvanno:1.5.1` (gvanno annotation engine)
+ - `docker pull sigven/gvanno:1.6.0` (gvanno annotation engine)
#### STEP 3: Input preprocessing
The *gvanno* workflow accepts a single input file:
- * An unannotated, single-sample VCF file (>= v4.2) with germline variants (SNVs/InDels)
+- An unannotated, single-sample VCF file (\>= v4.2) with germline variants (SNVs/InDels)
-We __strongly__ recommend that the input VCF is compressed and indexed using [bgzip](http://www.htslib.org/doc/tabix.html) and [tabix](http://www.htslib.org/doc/tabix.html). NOTE: If the input VCF contains multi-allelic sites, these will be subject to [decomposition](http://genome.sph.umich.edu/wiki/Vt#Decompose).
+We **strongly** recommend that the input VCF is compressed and indexed using [bgzip](http://www.htslib.org/doc/tabix.html) and [tabix](http://www.htslib.org/doc/tabix.html). NOTE: If the input VCF contains multi-allelic sites, these will be subject to [decomposition](http://genome.sph.umich.edu/wiki/Vt#Decompose).
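What decomposition does can be illustrated with a minimal Python sketch (illustrative only; gvanno delegates this step to vt-style decomposition, as linked above):

```python
# Sketch of multi-allelic decomposition: one VCF record carrying several
# comma-separated ALT alleles is split into one bi-allelic record per allele.
def decompose_multiallelic(chrom, pos, ref, alt_field):
    return [(chrom, pos, ref, alt) for alt in alt_field.split(",")]

print(decompose_multiallelic("1", 12345, "A", "G,T"))
# one bi-allelic record per ALT allele
```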
#### STEP 5: Run example
Run the workflow with **gvanno.py**, which takes the following arguments and options:
- usage:
- gvanno.py -h [options]
- --query_vcf
- --gvanno_dir
- --output_dir
- --genome_assembly
- --sample_id
- --container
-
- gvanno - workflow for functional and clinical annotation of germline nucleotide variants
-
- Required arguments:
- --query_vcf QUERY_VCF
- VCF input file with germline query variants (SNVs/InDels).
- --gvanno_dir GVANNO_DIR
- Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.5.1
- --output_dir OUTPUT_DIR
- Output directory
- --genome_assembly {grch37,grch38}
- Genome assembly build: grch37 or grch38
- --container {docker,singularity}
- Run gvanno with docker or singularity
- --sample_id SAMPLE_ID
- Sample identifier - prefix for output files
-
- VEP optional arguments:
- --vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP).
- --vep_gencode_all Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno).
- --vep_lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False
- --vep_n_forks VEP_N_FORKS
- Number of forks for Variant Effect Predictor (VEP) processing, default: 4
- --vep_buffer_size VEP_BUFFER_SIZE
- Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing
- - set lower to reduce memory usage, higher to increase speed, default: 500
- --vep_pick_order VEP_PICK_ORDER
- Comma-separated string of ordered transcript properties for primary variant pick in
- Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane
- --vep_skip_intergenic
- Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False
- --vep_coding_only
- Only report variants falling into coding regions of transcripts (VEP), default: False
-
- Other optional arguments:
- --force_overwrite By default, the script will fail with an error if any output file already exists.
- You can force the overwrite of existing result files by using this flag, default: False
- --version show program's version number and exit
- --no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator, default: False
- --docker_uid DOCKER_USER_ID
- Docker user ID. default is the host system user ID. If you are experiencing permission errors, try setting this up to root (`--docker-uid root`)
- --vcfanno_n_processes VCFANNO_N_PROCESSES
- Number of processes for vcfanno processing (see https://github.com/brentp/vcfanno#-p), default: 4
-
-The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
-
- python ~/gvanno-1.5.1/gvanno.py
- --query_vcf ~/gvanno-1.5.1/examples/example.grch37.vcf.gz
- --gvanno_dir ~/gvanno-1.5.1
- --output_dir ~/gvanno-1.5.1
- --sample_id example
- --genome_assembly grch37
- --container docker
- --force_overwrite
-
-This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder:
-
- 1. __example_gvanno_pass_grch37.vcf.gz (.tbi)__ - Bgzipped VCF file with rich set of functional/clinical annotations
- 2. __example_gvanno_pass_grch37.tsv.gz__ - Compressed TSV file with rich set of functional/clinical annotations
+```
+usage:
+gvanno.py -h [options]
+--query_vcf
+--gvanno_dir
+--output_dir
+--genome_assembly
+--sample_id
+--container
+
+gvanno - workflow for functional and clinical annotation of germline nucleotide variants
+
+Required arguments:
+--query_vcf QUERY_VCF
+ VCF input file with germline query variants (SNVs/InDels).
+--gvanno_dir GVANNO_DIR
+ Directory that contains the gvanno reference data, e.g. ~/gvanno-1.6.0
+--output_dir OUTPUT_DIR
+ Output directory
+--genome_assembly {grch37,grch38}
+ Genome assembly build: grch37 or grch38
+--container {docker,singularity}
+ Run gvanno with docker or singularity
+--sample_id SAMPLE_ID
+ Sample identifier - prefix for output files
+
+VEP optional arguments:
+--vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP).
+--vep_gencode_all Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno).
+--vep_lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False
+--vep_n_forks VEP_N_FORKS
+ Number of forks for Variant Effect Predictor (VEP) processing, default: 4
+--vep_buffer_size VEP_BUFFER_SIZE
+ Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing
+ - set lower to reduce memory usage, higher to increase speed, default: 500
+--vep_pick_order VEP_PICK_ORDER
+ Comma-separated string of ordered transcript properties for primary variant pick in
+ Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane
+--vep_skip_intergenic
+ Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False
+--vep_coding_only
+ Only report variants falling into coding regions of transcripts (VEP), default: False
+
+Other optional arguments:
+--force_overwrite By default, the script will fail with an error if any output file already exists.
+ You can force the overwrite of existing result files by using this flag, default: False
+--version show program's version number and exit
+--no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator, default: False
+--docker_uid DOCKER_USER_ID
+ Docker user ID. default is the host system user ID. If you are experiencing permission errors, try setting this up to root (`--docker-uid root`)
+--vcfanno_n_processes VCFANNO_N_PROCESSES
+ Number of processes for vcfanno processing (see https://github.com/brentp/vcfanno#-p), default: 4
+--oncogenicity_annotation
+ Classify variants according to oncogenicity (Horak et al., Genet Med, 2022)
+--debug Print full Docker/Singularity commands to log and do not delete intermediate files with warnings etc.
+```
+
+The *examples* folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
+
+```
+python ~/gvanno-1.6.0/gvanno.py \
+--query_vcf ~/gvanno-1.6.0/examples/example.grch37.vcf.gz \
+--gvanno_dir ~/gvanno-1.6.0 \
+--output_dir ~/gvanno-1.6.0 \
+--sample_id example \
+--genome_assembly grch37 \
+--container docker \
+--force_overwrite
+```
+
+This command will run the Docker-based *gvanno* workflow and produce the following output files in the *examples* folder:
+
+1. **example_gvanno_pass_grch37.vcf.gz (.tbi)** - Bgzipped VCF file with rich set of functional/clinical annotations
+2. **example_gvanno_pass_grch37.tsv.gz** - Compressed TSV file with rich set of functional/clinical annotations
Similar files are produced for all variants, not only variants with a *PASS* designation in the VCF FILTER column.
@@ -171,6 +186,6 @@ Similar files are produced for all variants, not only variants with a *PASS* des
Documentation of the various variant and gene annotations should be interrogated from the header of the annotated VCF file. The column names of the tab-separated values (TSV) file will be identical to the INFO tags that are documented in the VCF file.
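The INFO tag documentation can also be pulled out of the header programmatically; a minimal sketch (the header lines shown are illustrative, not an exhaustive list of gvanno's tags):

```python
import re

# Sketch: extract INFO tag descriptions from VCF header meta-lines, since the
# TSV column names are identical to the INFO tags documented in the VCF file.
def info_tag_docs(header_lines):
    docs = {}
    for line in header_lines:
        m = re.match(r'##INFO=<ID=([^,]+),.*Description="([^"]*)"', line)
        if m:
            docs[m.group(1)] = m.group(2)
    return docs

header = [
    '##INFO=<ID=CLINVAR_MSID,Number=.,Type=String,Description="ClinVar Measure Set/Variant ID">',
    '##INFO=<ID=Consequence,Number=.,Type=String,Description="Impact modifier for the consequence type">',
]
print(info_tag_docs(header))
```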
-### Contact
+### Contact
sigven AT ifi.uio.no
diff --git a/download_gvanno_refdata.py b/download_gvanno_refdata.py
new file mode 100755
index 0000000..f0b94bb
--- /dev/null
+++ b/download_gvanno_refdata.py
@@ -0,0 +1,319 @@
+#!/usr/bin/env python
+
+import argparse
+import re
+import os
+import subprocess
+import logging
+import sys
+import locale
+import errno
+import urllib.request as urllib2
+from argparse import RawTextHelpFormatter
+
+GVANNO_VERSION = '1.6.0'
+REFDATA_VERSION = '20230425'
+ENSEMBL_VERSION = '109'
+GENCODE_VERSION = 'v43'
+VEP_ASSEMBLY = "GRCh38"
+HOST_GVANNO_REFDATA_URL = "http://insilico.hpc.uio.no/pcgr/gvanno/"
+
+def __main__():
+
+ program_description = "download_gvanno_refdata - download reference datasets for the gvanno variant annotation workflow"
+ program_options = " --download_dir \n --genome_assembly " + \
+ "\n"
+
+ parser = argparse.ArgumentParser(description = program_description,
+ formatter_class=RawTextHelpFormatter, usage="\n %(prog)s -h [options]\n" + str(program_options))
+ parser._action_groups.pop()
+ required = parser.add_argument_group('Required arguments')
+ optional = parser.add_argument_group('Other optional arguments')
+
+ optional.add_argument('--force_overwrite', action = "store_true", help='By default, the script will fail with an error if the ' + \
+ 'download directory already exists.\nYou can force the overwrite of an existing download directory by using this flag, default: %(default)s')
+ optional.add_argument('--version', action='version', version='%(prog)s ' + str(GVANNO_VERSION))
+ optional.add_argument('--clean_raw_files',action="store_true", help="Delete raw compressed tar files (i.e. VEP) after download and unzip + untar has been conducted successfully")
+ optional.add_argument("--debug", action="store_true", help="Print full commands to log and do not delete intermediate files with warnings etc.")
+ required.add_argument('--download_dir',help='Destination directory for downloaded reference data', required = True)
+ required.add_argument('--genome_assembly',choices = ['grch37','grch38'], help='Choose build-specific reference data for download: grch37 or grch38', required = True)
+
+
+ args = parser.parse_args()
+ arg_dict = vars(args)
+
+ logger = getlogger('check-download-dir')
+
+ download_dir_full = os.path.abspath(arg_dict['download_dir'])
+ if not os.path.isdir(download_dir_full):
+ err_msg = "Download destination directory (" + str(download_dir_full) + ") does not exist"
+ gvanno_error_message(err_msg,logger)
+
+ ## check that 'data' does not exist in download_dir_full
+ db_dir = os.path.join(os.path.abspath(arg_dict['download_dir']),'data')
+ arg_dict['db_assembly_dir'] = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly'])
+ arg_dict['vep_assembly_dir'] = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly'], '.vep')
+
+ if os.path.isdir(db_dir) and arg_dict['force_overwrite'] is False:
+ err_msg = "Data directory (" + str(db_dir) + ") exists (force_overwrite = False)"
+ gvanno_error_message(err_msg,logger)
+
+ ## create the data/<assembly>/.vep directory tree (parent directories included; tolerate pre-existing directories under --force_overwrite)
+ os.makedirs(arg_dict['vep_assembly_dir'], exist_ok = True)
+
+
+ download_gvanno_ref_data(arg_dict = arg_dict)
+
+
+def rm_file(filename):
+ try:
+ os.remove(filename)
+ except OSError as e:
+ if e.errno != errno.ENOENT: # errno.ENOENT = no such file or directory
+ raise # re-raise exception if a different error occurred
+
+def gvanno_error_message(message, logger):
+ logger.error('')
+ logger.error(message)
+ logger.error('')
+ exit(1)
+
+def check_subprocess(command, logger, debug = False):
+ if debug:
+ logger.info(command)
+ try:
+ output = subprocess.check_output(str(command), stderr=subprocess.STDOUT, shell=True)
+ if len(output) > 0:
+ print(str(output.decode()).rstrip())
+ except subprocess.CalledProcessError as e:
+ print(e.output.decode())
+ exit(1)
+
+def getlogger(logger_name):
+ logger = logging.getLogger(logger_name)
+ logger.setLevel(logging.DEBUG)
+
+ # create console handler and set level to debug
+ ch = logging.StreamHandler(sys.stdout)
+ ch.setLevel(logging.DEBUG)
+
+ # add ch to logger
+ logger.addHandler(ch)
+
+ # create formatter
+ formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s", "%Y-%m-%d %H:%M:%S")
+
+ #add formatter to ch
+ ch.setFormatter(formatter)
+
+ return logger
+
+def get_url_num_bytes(url, logger, verbose=True):
+ """
+ Get number of bytes of file at a URL. None if not reported.
+ """
+ # prepend a scheme if the URL lacks one:
+ if not re.match('^[a-zA-Z]*:', url):
+ if os.path.exists(url):
+ url = 'file:' + url
+ else:
+ url = 'http://' + url
+ try:
+ class HeadRequest(urllib2.Request):
+ def get_method(self):
+ return "HEAD"
+ response = urllib2.urlopen(HeadRequest(
+ url, None, {'User-Agent': 'Mozilla/5.0'}))
+
+ num_bytes = response.info().get('content-length')
+ if verbose:
+ # urllib does redirections for me. Report if it's been clever:
+ if response.geturl() != url:
+ logger.info('Reporting size of file at: ' + response.geturl())
+ except Exception:
+ logger.info('Failed to connect to URL ' + str(url))
+ num_bytes = None
+ if num_bytes is not None:
+ num_bytes = int(num_bytes)
+ return num_bytes
+
+def pretty_print(num_bytes, logger):
+ """
+ Output number of bytes according to locale and with IEC binary prefixes
+ """
+ if num_bytes is None:
+ logger.info('File size unavailable.')
+ return 'unavailable' ## avoid TypeError when callers concatenate the result
+ KiB = 1024
+ MiB = KiB * KiB
+ GiB = KiB * MiB
+ TiB = KiB * GiB
+ PiB = KiB * TiB
+ EiB = KiB * PiB
+ ZiB = KiB * EiB
+ YiB = KiB * ZiB
+ locale.setlocale(locale.LC_ALL, '')
+ output = locale.format_string("%d", num_bytes, grouping=True) + ' bytes'
+ if num_bytes > YiB:
+ output += ' (%.3g YiB)' % (num_bytes / YiB)
+ elif num_bytes > ZiB:
+ output += ' (%.3g ZiB)' % (num_bytes / ZiB)
+ elif num_bytes > EiB:
+ output += ' (%.3g EiB)' % (num_bytes / EiB)
+ elif num_bytes > PiB:
+ output += ' (%.3g PiB)' % (num_bytes / PiB)
+ elif num_bytes > TiB:
+ output += ' (%.3g TiB)' % (num_bytes / TiB)
+ elif num_bytes > GiB:
+ output += ' (%.3g GiB)' % (num_bytes / GiB)
+ elif num_bytes > MiB:
+ output += ' (%.3g MiB)' % (num_bytes / MiB)
+ elif num_bytes > KiB:
+ output += ' (%.3g KiB)' % (num_bytes / KiB)
+
+ return output
+
+def download_gvanno_ref_data(arg_dict):
+ """
+ Main function to run the gvanno workflow using Docker
+ """
+ global GENCODE_VERSION, VEP_ASSEMBLY
+ if arg_dict['genome_assembly'] == 'grch37':
+ GENCODE_VERSION = 'v19'
+ VEP_ASSEMBLY = 'GRCh37'
+
+ vep_assembly_dir = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly'], '.vep')
+
+ datasets = {}
+ for db in ['vep_cache','vep_fasta','gvanno_custom']:
+ datasets[db] = {}
+ datasets[db]['remote_url'] = 'NA'
+ datasets[db]['local_path'] = 'NA'
+
+ ## Remote URL - pre-indexed VEP cache
+ datasets['vep_cache']['remote_url'] = (
+ f"ftp://ftp.ensembl.org/pub/current_variation/indexed_vep_cache/"
+ f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz"
+ )
+
+ ## Remote URL - Human genome FASTA
+ datasets['vep_fasta']['remote_url'] = (
+ f"http://ftp.ensembl.org/pub/release-{ENSEMBL_VERSION}/fasta/homo_sapiens/dna/Homo_sapiens."
+ f"{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz"
+ )
+
+ logger = getlogger('download-vep-cache')
+ datasets['vep_cache']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz")
+ datasets['vep_fasta']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f"Homo_sapiens.{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz")
+ datasets['vep_fasta']['local_path_uncompressed'] = re.sub(r'.gz','',datasets['vep_fasta']['local_path'])
+
+ vep_cache_bytes_remote = get_url_num_bytes(url = datasets['vep_cache']['remote_url'], logger = logger)
+
+ if VEP_ASSEMBLY == 'GRCh37':
+ datasets['vep_fasta']['remote_url'] = (
+ f"http://ftp.ensembl.org/pub/grch37/release-{ENSEMBL_VERSION}/fasta/homo_sapiens/dna/Homo_sapiens."
+ f"{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz"
+ )
+
+
+ logger.info('VEP cache - remote target file ' + str(datasets['vep_cache']['remote_url']))
+ logger.info('VEP cache - size: ' + pretty_print(vep_cache_bytes_remote, logger = logger))
+ logger.info('VEP cache - local destination file: ' + str(datasets['vep_cache']['local_path']))
+
+ if os.path.exists(datasets['vep_cache']['local_path']):
+ if os.path.getsize(datasets['vep_cache']['local_path']) == vep_cache_bytes_remote:
+ logger.info('VEP cache already downloaded')
+ else:
+ logger.info('VEP cache - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['vep_cache']['remote_url'], datasets['vep_cache']['local_path'])
+ else:
+ logger.info('VEP cache - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['vep_cache']['remote_url'], datasets['vep_cache']['local_path'])
+
+ logger.info('VEP cache - unzip and untar')
+
+ wdir = os.getcwd()
+
+ os.chdir(vep_assembly_dir)
+
+ command_unzip_cache = f"gzip -dc {datasets['vep_cache']['local_path']} | tar xf -"
+ check_subprocess(command = command_unzip_cache, logger = logger)
+ logger.info('VEP cache - finished unzip + untar ')
+
+ ## change back to working directory
+ os.chdir(wdir)
+
+ if arg_dict['clean_raw_files']:
+ logger.info('VEP cache - removing raw compressed tar ball')
+ rm_file(datasets['vep_cache']['local_path'])
+
+ logger = getlogger('download-vep-fasta')
+ fasta_cache_bytes_remote = get_url_num_bytes(url = datasets['vep_fasta']['remote_url'], logger = logger)
+ logger.info('VEP reference FASTA - remote target file: ' + str(datasets['vep_fasta']['remote_url']))
+ logger.info('VEP reference FASTA - size: ' + pretty_print(fasta_cache_bytes_remote, logger = logger))
+ logger.info('VEP reference FASTA - local destination file: ' + str(datasets['vep_fasta']['local_path']))
+
+ if os.path.exists(datasets['vep_fasta']['local_path']):
+ if os.path.getsize(datasets['vep_fasta']['local_path']) == fasta_cache_bytes_remote:
+ logger.info('VEP reference FASTA already downloaded')
+ else:
+ logger.info('VEP reference FASTA files - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['vep_fasta']['remote_url'], datasets['vep_fasta']['local_path'])
+ else:
+ logger.info('VEP reference FASTA files - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['vep_fasta']['remote_url'], datasets['vep_fasta']['local_path'])
+
+ logger.info('VEP reference FASTA - unzip + bgzip')
+ command_unzip_fasta = f"gzip -d {datasets['vep_fasta']['local_path']}"
+ check_subprocess(command = command_unzip_fasta, logger = logger)
+ command_bgzip_fasta = f"bgzip {datasets['vep_fasta']['local_path_uncompressed']}"
+ check_subprocess(command = command_bgzip_fasta, logger = logger)
+
+
+
+ datasets['gvanno_custom']['remote_url'] = f'{HOST_GVANNO_REFDATA_URL}gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz'
+ datasets['gvanno_custom']['local_path'] = os.path.join(arg_dict['download_dir'], f'gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz')
+
+ logger = getlogger('download-gvanno-custom')
+ custom_cache_bytes_remote = get_url_num_bytes(url = datasets['gvanno_custom']['remote_url'], logger = logger)
+ logger.info("Downloading custom gvanno variant datasets: Clinvar / dbNSFP / ncER / cancerhotspots ++")
+ logger.info('Custom gvanno datasets - remote target file ' + str(datasets['gvanno_custom']['remote_url']))
+ logger.info('Custom gvanno datasets - size: ' + pretty_print(custom_cache_bytes_remote, logger = logger))
+ logger.info('Custom gvanno datasets - local destination file: ' + str(datasets['gvanno_custom']['local_path']))
+
+
+ if os.path.exists(datasets['gvanno_custom']['local_path']):
+ if os.path.getsize(datasets['gvanno_custom']['local_path']) == custom_cache_bytes_remote:
+ logger.info('Custom gvanno datasets already downloaded')
+ else:
+ logger.info('Custom gvanno datasets - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['gvanno_custom']['remote_url'], datasets['gvanno_custom']['local_path'])
+ else:
+ logger.info('Custom gvanno datasets - download in progress - this can take a while ... ')
+ urllib2.urlretrieve(datasets['gvanno_custom']['remote_url'], datasets['gvanno_custom']['local_path'])
+
+ logger.info('Custom gvanno datasets - unzip and untar')
+
+ wdir = os.getcwd()
+ os.chdir(arg_dict['download_dir'])
+
+ command_unzip_cache = f"gzip -dc {datasets['gvanno_custom']['local_path']} | tar xf -"
+ check_subprocess(command = command_unzip_cache, logger = logger)
+ logger.info('Custom gvanno datasets - finished unzip and untar ')
+
+ ## change back to working directory
+ os.chdir(wdir)
+
+ if arg_dict['clean_raw_files']:
+ logger.info('Custom gvanno datasets - removing raw compressed tar ball')
+ rm_file(datasets['gvanno_custom']['local_path'])
+
+ logger.info('Finished')
+
+
+if __name__=="__main__": __main__()
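The download hunks above repeat one pattern per dataset: compare the local file size against the remote size and skip the transfer when they match, otherwise fetch with `urlretrieve`. A minimal sketch of that pattern, assuming Python 3's `urllib.request` (the script aliases its downloader as `urllib2`) and approximating the script's `get_url_num_bytes` with a `Content-Length` lookup; `download_if_needed` is an illustrative helper name, not part of the patch:

```python
import os
import urllib.request

def download_if_needed(remote_url, local_path):
    """Download remote_url to local_path, skipping the transfer when a
    complete local copy (matching the remote Content-Length) already
    exists -- the size-check pattern used for the VEP FASTA and the
    gvanno data bundle above."""
    remote_bytes = None
    try:
        req = urllib.request.Request(remote_url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            remote_bytes = int(resp.headers.get("Content-Length", -1))
    except Exception:
        pass  # size unknown - fall through and download unconditionally
    if remote_bytes is not None and os.path.exists(local_path) \
            and os.path.getsize(local_path) == remote_bytes:
        return False  # local copy is already complete
    urllib.request.urlretrieve(remote_url, local_path)
    return True
```

The same helper covers both branches of the `if os.path.exists(...)` checks in the script, so each dataset needs only a single call.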
diff --git a/examples/example.grch38.vcf.gz b/examples/example.grch38.vcf.gz
index 8329f55..bf3cbd8 100755
Binary files a/examples/example.grch38.vcf.gz and b/examples/example.grch38.vcf.gz differ
diff --git a/examples/example.grch38.vcf.gz.tbi b/examples/example.grch38.vcf.gz.tbi
index 7f8e83e..3817608 100755
Binary files a/examples/example.grch38.vcf.gz.tbi and b/examples/example.grch38.vcf.gz.tbi differ
diff --git a/gvanno.py b/gvanno.py
index db02add..34ee0f4 100755
--- a/gvanno.py
+++ b/gvanno.py
@@ -11,17 +11,17 @@
import platform
from argparse import RawTextHelpFormatter
-GVANNO_VERSION = '1.5.1'
-DB_VERSION = 'GVANNO_DB_VERSION = 20220921'
-VEP_VERSION = '107'
-GENCODE_VERSION = 'v41'
+GVANNO_VERSION = '1.6.0'
+DB_VERSION = 'GVANNO_DB_VERSION = 20230425'
+VEP_VERSION = '109'
+GENCODE_VERSION = 'v43'
VEP_ASSEMBLY = "GRCh38"
DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION)
def __main__():
- program_description = "gvanno - workflow for functional and clinical annotation of germline nucleotide variants"
+ program_description = "gvanno - workflow for functional and clinical annotation of human DNA variants"
program_options = " --query_vcf \n --gvanno_dir \n --output_dir \n --genome_assembly " + \
"\n --sample_id \n --container "
@@ -44,13 +44,16 @@ def __main__():
optional_vep.add_argument('--vep_n_forks', default = 4, help="Number of forks for VEP processing, default: %(default)s")
optional_vep.add_argument('--vep_buffer_size', default = 500, help="Variant buffer size (variants read into memory simultaneously) " + \
"for VEP processing\n- set lower to reduce memory usage, higher to increase speed, default: %(default)s")
- optional_vep.add_argument('--vep_pick_order', default = "canonical,appris,biotype,ccds,rank,tsl,length,mane", help="Comma-separated string " + \
+ optional_vep.add_argument('--vep_pick_order', default = "mane_select,mane_plus_clinical,canonical,appris,biotype,ccds,rank,tsl,length", help="Comma-separated string " + \
"of ordered transcript properties for primary variant pick in\nVEP processing, default: %(default)s")
optional_vep.add_argument('--vep_skip_intergenic', action = "store_true", help="Skip intergenic variants (VEP), default: %(default)s")
optional_vep.add_argument('--vep_coding_only', action = "store_true", help="Only return consequences that fall in the coding regions of transcripts (VEP), default: %(default)s")
optional.add_argument('--vcfanno_n_processes', default = 4, help="Number of processes for vcfanno " + \
"processing (see https://github.com/brentp/vcfanno#-p), default: %(default)s")
- required.add_argument('--query_vcf', help='VCF input file with germline query variants (SNVs/InDels).', required = True)
+ optional.add_argument('--oncogenicity_annotation', action ='store_true', help = 'Classify variants according to oncogenicity (Horak et al., Genet Med, 2022)')
+ optional.add_argument("--debug", action="store_true", help="Print full Docker/Singularity commands to log, and keep intermediate files (warnings etc.) for debugging")
+
+ required.add_argument('--query_vcf', help='VCF input file with query variants (SNVs/InDels).', required = True)
required.add_argument('--gvanno_dir',help='Directory that contains the gvanno data bundle, e.g. ~/gvanno-' + str(GVANNO_VERSION), required = True)
required.add_argument('--output_dir',help='Output directory', required = True)
required.add_argument('--genome_assembly',choices = ['grch37','grch38'], help='Genome assembly build: grch37 or grch38', required = True)
@@ -65,21 +68,25 @@ def __main__():
## Check that VEP pick criteria is formatted correctly
if not arg_dict['vep_pick_order'] is None:
values = str(arg_dict['vep_pick_order']).split(',')
- permitted_sources = ['canonical','appris','tsl','biotype','ccds','rank','length','mane']
+ permitted_sources = ['canonical','appris','tsl','biotype','ccds','rank','length','mane_plus_clinical','mane_select']
num_permitted_sources = 0
for v in values:
if v in permitted_sources:
num_permitted_sources += 1
- if num_permitted_sources != 8:
+ if num_permitted_sources != 9:
err_msg = "Option 'vep_pick_order' = " + str(arg_dict['vep_pick_order']) + " is formatted incorrectly, should be " + \
- "a comma-separated string of the following values: canonical,appris,tsl,biotype,ccds,rank,length,mane"
+ "a comma-separated string of the following values: mane_select,mane_plus_clinical,canonical,appris,tsl,biotype,ccds,rank,length"
gvanno_error_message(err_msg, logger)
if arg_dict['container'] is None:
err_msg = 'Please specify whether the gvanno workflow is running through Docker or Singularity (--container )'
gvanno_error_message(err_msg, logger)
+ if arg_dict['oncogenicity_annotation'] is True and arg_dict['vep_lof_prediction'] is False:
+ err_msg = "Option --oncogenicity_annotation requires --vep_lof_prediction to be turned on"
+ gvanno_error_message(err_msg, logger)
+
logger = getlogger('gvanno-check-files')
# check that script and Docker image version correspond
@@ -295,12 +302,13 @@ def run_gvanno(arg_dict, host_directories):
container_command_run2 = container_command_run2 + " -W /workdir/output " + 'src/gvanno.sif' + " sh -c \""
docker_command_run_end = '\"'
- #logger.info(container_command_run1)
- #logger.info(container_command_run2)
+ if arg_dict['debug']:
+ logger.info(container_command_run1)
+ logger.info(container_command_run2)
## GVANNO|start - Log key information about sample, options and assembly
logger = getlogger("gvanno-start")
- logger.info("--- Germline variant annotation (gvanno) workflow ----")
+ logger.info("--- Generic variant annotation (gvanno) workflow ---")
logger.info("Sample name: " + str(arg_dict['sample_id']))
logger.info("Genome assembly: " + str(arg_dict['genome_assembly']))
print()
@@ -310,7 +318,8 @@ def run_gvanno(arg_dict, host_directories):
logger.info("STEP 0: Validate input data")
vcf_validate_command = str(container_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + \
str(vcf_validation) + " " + str(arg_dict['genome_assembly']) + docker_command_run_end
- #logger.info(vcf_validate_command)
+ if arg_dict['debug']:
+ logger.info(vcf_validate_command)
check_subprocess(vcf_validate_command)
logger.info('Finished')
@@ -335,10 +344,10 @@ def run_gvanno(arg_dict, host_directories):
## List all VEP flags used when calling VEP
loftee_dir = '/opt/vep/src/ensembl-vep/modules'
plugins_in_use = "NearestExonJB"
- vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --domains --symbol --protein --ccds " + \
+ vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomade --af_gnomadg --variant_class --domains --symbol --protein --ccds " + \
"--uniprot --appris --biotype --canonical --format vcf --mane --cache --numbers --total_length --allele_number --no_escape " + \
"--xref_refseq --plugin NearestExonJB,max_range=50000"
- vep_options = "--vcf --quiet --check_ref --flag_pick_allele --pick_order " + str(arg_dict['vep_pick_order']) + \
+ vep_options = "--vcf --quiet --check_ref --flag_pick_allele_gene --pick_order " + str(arg_dict['vep_pick_order']) + \
" --force_overwrite --species homo_sapiens --assembly " + str(VEP_ASSEMBLY) + " --offline --fork " + \
str(arg_dict['vep_n_forks']) + " " + str(vep_flags) + " --dir /usr/local/share/vep/data"
@@ -378,6 +387,8 @@ def run_gvanno(arg_dict, host_directories):
logger.info("VEP configuration - loss-of-function prediction: " + str(arg_dict['vep_lof_prediction']))
logger.info("VEP configuration - plugins in use: " + str(plugins_in_use))
+ if arg_dict['debug']:
+ logger.info(vep_main_command)
check_subprocess(vep_main_command)
check_subprocess(vep_bgzip_command)
check_subprocess(vep_tabix_command)
@@ -386,11 +397,14 @@ def run_gvanno(arg_dict, host_directories):
## GVANNO|vcfanno - annotate VCF against a number of variant annotation resources
print()
logger = getlogger('gvanno-vcfanno')
- logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (ClinVar, ncER, dbNSFP, GWAS catalog, cancerhotspots.org)")
+ logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (ClinVar, ncER, dbNSFP, GWAS catalog)")
logger.info('vcfanno configuration - number of processes (-p): ' + str(arg_dict['vcfanno_n_processes']))
gvanno_vcfanno_command = str(container_command_run2) + "gvanno_vcfanno.py --num_processes " + str(arg_dict['vcfanno_n_processes']) + \
- " --dbnsfp --clinvar --ncer --gvanno_xref --gwas --cancer_hotspots " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \
+ " --dbnsfp --clinvar --ncer --gvanno_xref --gwas " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \
" " + os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + docker_command_run_end
+
+ if arg_dict['debug']:
+ logger.info(gvanno_vcfanno_command)
check_subprocess(gvanno_vcfanno_command)
logger.info("Finished")
@@ -398,9 +412,14 @@ def run_gvanno(arg_dict, host_directories):
print()
logger = getlogger("gvanno-summarise")
logger.info("STEP 3: Summarise gene and variant annotations with gvanno-summarise")
+ logger.info("Configuration - oncogenicity classification: " + str(int(arg_dict['oncogenicity_annotation'])))
gvanno_summarise_command = str(container_command_run2) + "gvanno_summarise.py " + str(vep_vcfanno_vcf) + ".gz " + \
os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + " " + str(int(arg_dict['vep_lof_prediction'])) + \
- " " + str(int(arg_dict['vep_regulatory'])) + docker_command_run_end
+ " " + str(int(arg_dict['oncogenicity_annotation'])) + " " + str(int(arg_dict['vep_regulatory'])) + " " + \
+ str(int(arg_dict['debug'])) + docker_command_run_end
+
+ if arg_dict['debug']:
+ logger.info(gvanno_summarise_command)
check_subprocess(gvanno_summarise_command)
logger.info("Finished")
@@ -415,14 +434,15 @@ def run_gvanno(arg_dict, host_directories):
check_subprocess(create_output_vcf_command2)
check_subprocess(create_output_vcf_command3)
check_subprocess(create_output_vcf_command4)
- check_subprocess(clean_command)
+ if not arg_dict['debug']:
+ check_subprocess(clean_command)
print()
## GVANNO|vcf2tsv - convert VCF to TSV with https://github.com/sigven/vcf2tsv
logger = getlogger("gvanno-vcf2tsv")
- logger.info("STEP 4: Converting VCF to TSV with https://github.com/sigven/vcf2tsvpy")
- gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsv.py " + str(output_pass_vcf) + " --compress " + str(output_pass_tsv) + docker_command_run_end
- gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsv.py " + str(output_vcf) + " --compress --keep_rejected " + str(output_tsv) + docker_command_run_end
+ logger.info("STEP 4: Converting genomic VCF to TSV with https://github.com/sigven/vcf2tsvpy")
+ gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_pass_vcf) + " --compress --out_tsv " + str(output_pass_tsv) + docker_command_run_end
+ gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_vcf) + " --compress --keep_rejected --out_tsv " + str(output_tsv) + docker_command_run_end
logger.info("Conversion of VCF variant data to records of tab-separated values - PASS variants only")
check_subprocess(gvanno_vcf2tsv_command_pass)
logger.info("Conversion of VCF variant data to records of tab-separated values - PASS and non-PASS variants")
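The `vep_pick_order` validation in the hunk above counts how many comma-separated entries belong to the permitted set and requires exactly nine. A stricter sketch of that check, as a set comparison that additionally rejects duplicates and omissions; `valid_vep_pick_order` is an illustrative helper, not part of the patch:

```python
# The nine transcript properties accepted by --vep_pick_order in gvanno 1.6.0
PERMITTED_PICK_SOURCES = {
    'mane_select', 'mane_plus_clinical', 'canonical', 'appris',
    'tsl', 'biotype', 'ccds', 'rank', 'length',
}

def valid_vep_pick_order(pick_order: str) -> bool:
    """Return True if pick_order is a comma-separated permutation of all
    nine permitted sources (no duplicates, no unknowns, none missing)."""
    values = pick_order.split(',')
    return (len(values) == len(PERMITTED_PICK_SOURCES)
            and set(values) == PERMITTED_PICK_SOURCES)
```

Unlike the count-based check, this also rejects an order such as `canonical,canonical,...`, where a duplicated permitted value still sums to nine.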
diff --git a/src/Dockerfile b/src/Dockerfile
index b228d05..1a0a2e2 100644
--- a/src/Dockerfile
+++ b/src/Dockerfile
@@ -1,3 +1,4 @@
+
###################################################
# Stage 1 - docker container to build ensembl-vep #
###################################################
@@ -22,13 +23,13 @@ RUN apt-get update && apt-get -y install \
ENV OPT /opt/vep
ENV OPT_SRC $OPT/src
ENV HTSLIB_DIR $OPT_SRC/htslib
-ENV BRANCH release/107
+ENV BRANCH=release/109
# Working directory
WORKDIR $OPT_SRC
# Clone/download repositories/libraries
-RUN if [ "$BRANCH" = "master" ]; \
+RUN if [ "$BRANCH" = "main" ]; \
then export BRANCH_OPT=""; \
else export BRANCH_OPT="-b $BRANCH"; \
fi && \
@@ -166,7 +167,8 @@ ENV PERL5LIB $PERL5LIB_TMP
WORKDIR /
ADD loftee_1.0.3.tgz $OPT/src/ensembl-vep/modules
ADD UTRannotator.tgz $OPT/src/ensembl-vep/modules
-RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/release/107/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm
+ADD NearestExonJB.pm $OPT/src/ensembl-vep/modules
+#RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/$BRANCH/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm
# Final steps
@@ -178,6 +180,24 @@ RUN echo >> $OPT/.profile && \
# Run INSTALL.pl and remove the ensemb-vep tests and travis
./INSTALL.pl -a a -l -n && rm -rf t travisci .travis.yml
+
+# Install dependencies for VEP plugins:
+#USER root
+#ENV PLUGIN_DEPS "https://raw.githubusercontent.com/Ensembl/VEP_plugins/$BRANCH/config"
+## - Ubuntu packages
+#RUN curl -O "$PLUGIN_DEPS/ubuntu-packages.txt" && \
+# apt-get update && apt-get install -y --no-install-recommends \
+# $(sed s/\#.*//g ubuntu-packages.txt) && \
+# rm -rf /var/lib/apt/lists/* ubuntu-packages.txt
+## - Perl modules
+#RUN curl -O "$PLUGIN_DEPS/cpanfile" && \
+# cpanm --installdeps --with-recommends . && \
+# rm -rf /root/.cpanm cpanfile
+## - Python packages
+#RUN curl -O "$PLUGIN_DEPS/requirements.txt" && \
+# python -m pip install --no-cache-dir -r requirements.txt && \
+# rm requirements.txt
+
######################################################
# Stage 3 - adding dependencies for gvanno analysis #
######################################################
@@ -200,7 +220,6 @@ WORKDIR /
ENV PACKAGE_BIO="libhts2 bedtools"
ENV PACKAGE_DEV="gfortran gcc-multilib autoconf liblzma-dev libncurses5-dev libblas-dev liblapack-dev libssh2-1-dev libxml2-dev vim libssl-dev libcairo2-dev libbz2-dev libcurl4-openssl-dev"
-#ENV PYTHON_MODULES="numpy==1.19.2 cython==0.29.21 scipy==1.5.3 pandas==1.1.3 cyvcf2==0.20.9 toml==0.10.1"
ENV PYTHON_MODULES="numpy==1.19.2 cython==0.29.21 scipy==1.5.3 pandas==1.1.3 cyvcf2==0.20.9"
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
@@ -224,9 +243,9 @@ chmod 755 /usr/local/bin/vcf_validator
USER root
WORKDIR /
-RUN wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
-RUN bunzip2 -dc samtools-1.10.tar.bz2 | tar xvf -
-RUN cd samtools-1.10 && ./configure --prefix=/usr/local/bin && make -j && make install
+RUN wget https://github.com/samtools/samtools/releases/download/1.15.1/samtools-1.15.1.tar.bz2
+RUN bunzip2 -dc samtools-1.15.1.tar.bz2 | tar xvf -
+RUN cd samtools-1.15.1 && ./configure --prefix=/usr/local && make -j && make install
WORKDIR /
@@ -241,18 +260,19 @@ RUN apt-get update \
USER root
WORKDIR /
-## vt - variant tool set - use conda version
-## primary use in GVANNO/PCGR/CPSR: decomposition of multiallelic variants in a VCF file
+## vt - variant tool set + vcf2tsvpy - install using Conda
+## primary use of vt in GVANNO: decomposition of multiallelic variants in a VCF file
+
+# install Conda
RUN wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh \
&& chmod 0755 miniconda.sh
RUN ["/bin/bash", "-c", "/miniconda.sh -b -p /conda"]
RUN rm miniconda.sh
-# update conda & install vt
+# update conda & install vt + vcf2tsvpy
RUN /conda/bin/conda update conda
-#RUN /conda/bin/conda update python
-RUN /conda/bin/conda install -c bioconda vt
-#RUN /conda/bin/conda install -c bioconda vcf2tsvpy
+RUN /conda/bin/conda update python
+RUN /conda/bin/conda install -c bioconda vt vcf2tsvpy python=3.6
## Clean Up
RUN apt-get clean autoclean
@@ -268,13 +288,11 @@ VOLUME /data
USER root
WORKDIR /
-#RUN rm -f nlopt-2.4.2.tar.gz
RUN rm -rf $HOME/src/ensembl-vep/t/
RUN rm -f $HOME/src/v335_base.tar.gz
RUN rm -f $HOME/src/release-1-6-924.zip
-RUN rm -rf /samtools-1.10.tar.bz2
-#RUN rm -f /conda/bin/python
+RUN rm -rf /samtools-1.15.1*
ADD gvanno.tgz /
ENV PATH=$PATH:/conda/bin:/gvanno
-ENV PYTHONPATH=:/gvanno/lib:${PYTHONPATH}
\ No newline at end of file
+ENV PYTHONPATH=:/gvanno/lib:${PYTHONPATH}
diff --git a/src/NearestExonJB.pm b/src/NearestExonJB.pm
new file mode 100644
index 0000000..30ffcec
--- /dev/null
+++ b/src/NearestExonJB.pm
@@ -0,0 +1,143 @@
+=head1 LICENSE
+
+Copyright [1999-2015] Wellcome Trust Sanger Institute and the EMBL-European Bioinformatics Institute
+Copyright [2016-2023] EMBL-European Bioinformatics Institute
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
+=head1 CONTACT
+
+ Ensembl
+
+=cut
+
+=head1 NAME
+
+ NearestExonJB
+
+=head1 SYNOPSIS
+
+ mv NearestExonJB.pm ~/.vep/Plugins
+ ./vep -i variations.vcf --cache --plugin NearestExonJB
+
+=head1 DESCRIPTION
+
+ This is a plugin for the Ensembl Variant Effect Predictor (VEP) that
+ finds the nearest exon junction boundary to a coding sequence variant. More than
+ one boundary may be reported if the boundaries are equidistant.
+
+ The plugin will report the Ensembl identifier of the exon, the distance to the
+ exon boundary, the boundary type (start or end of exon) and the total
+ length in nucleotides of the exon.
+
+ Various parameters can be altered by passing them to the plugin command:
+
+ - max_range : maximum search range in bp (default: 10000)
+
+ Parameters are passed e.g.:
+
+ --plugin NearestExonJB,max_range=50000
+
+=cut
+
+package NearestExonJB;
+
+use strict;
+use warnings;
+
+use Bio::EnsEMBL::Variation::Utils::BaseVepPlugin;
+
+use base qw(Bio::EnsEMBL::Variation::Utils::BaseVepPlugin);
+
+my $char_sep = "|";
+
+my %CONFIG = (
+ max_range => 10000,
+);
+
+sub new {
+ my $class = shift;
+
+ my $self = $class->SUPER::new(@_);
+
+ my $params = $self->params;
+
+ # get output format
+ $char_sep = "+" if ($self->{config}->{output_format} eq 'vcf');
+
+ foreach my $param(@$params) {
+ my ($key, $val) = split('=', $param);
+ die("ERROR: Failed to parse parameter $param\n") unless defined($key) && defined($val);
+ $CONFIG{$key} = $val;
+ }
+
+ return $self;
+}
+
+sub feature_types {
+ return ['Transcript'];
+}
+
+sub get_header_info {
+ my $header = 'Nearest Exon Junction Boundary (coding sequence variants only). Format:';
+ $header .= join($char_sep, qw(ExonID distance start/end length) );
+
+ return {
+ NearestExonJB => $header,
+ }
+}
+
+sub run {
+ my ($self, $tva) = @_;
+
+ my $vf = $tva->base_variation_feature;
+ my $trv = $tva->base_transcript_variation;
+
+ my $loc_string = sprintf("%s:%s-%i-%i", $trv->transcript_stable_id, $vf->{chr} || $vf->seq_region_name, $vf->{start}, $vf->{end});
+
+ if(!exists($self->{_cache}) || !exists($self->{_cache}->{$loc_string})) {
+ my $exons = $trv->_overlapped_exons;
+ my %dists;
+ my $min = $CONFIG{max_range};
+ foreach my $exon (@$exons) {
+ my $startD = abs ($vf->start - $exon->seq_region_start);
+ my $endD = abs ($vf->end - $exon->seq_region_end);
+ if ($startD < $endD){
+ $dists{$exon->stable_id}{$startD} = 'start';
+ $dists{$exon->stable_id}{len} = $exon->length;
+ $min = $startD if $min > $startD;
+ } elsif ($startD > $endD){
+ $dists{$exon->stable_id}{$endD} = 'end';
+ $dists{$exon->stable_id}{len} = $exon->length;
+ $min = $endD if $min > $endD;
+ } else {
+ $dists{$exon->stable_id}{$startD} = "start_end";
+ $dists{$exon->stable_id}{len} = $exon->length;
+ $min = $startD if $min > $startD;
+ }
+ }
+
+ my @finalRes;
+ foreach my $exon (keys %dists){
+ if (exists $dists{$exon}{$min}) {
+ push(@finalRes, $exon.$char_sep.$min.$char_sep.$dists{$exon}{$min}.$char_sep.$dists{$exon}{len})
+ }
+ }
+
+ $self->{_cache}->{$loc_string} = scalar @finalRes ? join(",", @finalRes) : undef;
+ }
+ return $self->{_cache}->{$loc_string} ? { NearestExonJB => $self->{_cache}->{$loc_string} } : {};
+}
+
+1;
+
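The plugin above emits one `NearestExonJB` entry per nearest exon in the format `ExonID / distance / boundary / exon length`, joined by `+` in VCF output (`|` otherwise) and comma-separated when several exons are equidistant. A small sketch of a downstream parser for that annotation string, assuming the VCF separator; `parse_nearest_exon_jb` is a hypothetical helper, not part of the patch:

```python
def parse_nearest_exon_jb(annotation, char_sep="+"):
    """Parse a NearestExonJB VCF annotation string into a list of dicts.
    Format per entry: ExonID<sep>distance<sep>boundary<sep>exon_length,
    with entries comma-separated when boundaries are equidistant."""
    results = []
    for entry in annotation.split(","):
        exon_id, distance, boundary, length = entry.split(char_sep)
        results.append({
            "exon_id": exon_id,
            "distance": int(distance),
            "boundary": boundary,       # 'start', 'end' or 'start_end'
            "exon_length": int(length),
        })
    return results
```

For non-VCF VEP output, pass `char_sep="|"` to match the plugin's default separator.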
diff --git a/src/buildDocker.sh b/src/buildDocker.sh
index 9a84197..8962fd6 100755
--- a/src/buildDocker.sh
+++ b/src/buildDocker.sh
@@ -1,6 +1,5 @@
-cp /Users/sigven/research/docker/pcgr/src/loftee_1.0.3.tgz .
-cp /Users/sigven/research/software/vcf2tsv/vcf2tsv.py gvanno/
-#cp /Users/sigven/research/docker/pcgr/src/pcgr/lib/annoutils.py gvanno/lib/
+#cp /Users/sigven/research/docker/pcgr/src/loftee_1.0.3.tgz .
+#cp /Users/sigven/research/software/vcf2tsv/vcf2tsv.py gvanno/
tar czvfh gvanno.tgz gvanno/
echo "Build the Docker Image"
TAG=`date "+%Y%m%d"`
diff --git a/src/gvanno.tgz b/src/gvanno.tgz
index a484a28..0c6ec29 100644
Binary files a/src/gvanno.tgz and b/src/gvanno.tgz differ
diff --git a/src/gvanno/gvanno_summarise.py b/src/gvanno/gvanno_summarise.py
index af87e66..bd57c2e 100755
--- a/src/gvanno/gvanno_summarise.py
+++ b/src/gvanno/gvanno_summarise.py
@@ -7,6 +7,7 @@
import gzip
import os
import annoutils
+import oncogenicity
logger = annoutils.getlogger('gvanno-gene-annotate')
csv.field_size_limit(500 * 1024 * 1024)
@@ -18,17 +19,19 @@ def __main__():
parser.add_argument('vcf_file', help='VCF file with VEP-annotated query variants (SNVs/InDels)')
parser.add_argument('gvanno_db_dir',help='gvanno data directory')
parser.add_argument('lof_prediction',default=0,type=int,help='VEP LoF prediction setting (0/1)')
+ parser.add_argument('oncogenicity_annotation',default=0,type=int,help='Include oncogenicity annotation (0/1)')
parser.add_argument('regulatory_annotation',default=0,type=int,help='Inclusion of VEP regulatory annotations (0/1)')
+ parser.add_argument('debug',default=0,type=int,help='Output extensive debugging information')
args = parser.parse_args()
- extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction, args.regulatory_annotation)
+ extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction, args.oncogenicity_annotation, args.regulatory_annotation, args.debug)
-def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, regulatory_annotation = 0):
+def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, oncogenicity_annotation = 0, regulatory_annotation = 0, debug = 0):
"""
Function that reads VEP/vcfanno-annotated VCF and extends the VCF INFO column with tags from
1. CSQ elements within the primary transcript consequence picked by VEP, e.g. SYMBOL, Feature, Gene, Consequence etc.
- 2. Gene annotations, e.g. known oncogenes/tumor suppressors, curated disease associations (DisGenet), MIM phenotype associations etc
- 3. Protein-relevant annotations, e.g. c functional protein features etc.
+ 2. Gene annotations, e.g. known oncogenes/tumor suppressors
+ 3. Cancer hotspot mutation sites
4. Variant effect predictions
"""
@@ -44,21 +47,27 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r
for tag in vcf_infotags_meta:
if lof_prediction == 0 and regulatory_annotation == 0:
if not tag.startswith('LoF') and not tag.startswith('REGULATORY_'):
- vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+ vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \
+ 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
elif lof_prediction == 1 and regulatory_annotation == 0:
if not tag.startswith('REGULATORY_'):
- vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+ vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \
+ 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
elif lof_prediction == 0 and regulatory_annotation == 1:
if not tag.startswith('LoF'):
- vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+ vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \
+ 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
else:
- vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+ vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \
+ 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
w = Writer(out_vcf, vcf)
current_chrom = None
num_chromosome_records_processed = 0
num_records_filtered = 0
+
+ cancer_hotspots = annoutils.read_cancer_hotspots(logger, os.path.join(gvanno_db_directory,'cancer_hotspots', 'cancer_hotspots.tsv'))
vcf_info_element_types = {}
for e in vcf.header_iter():
@@ -79,9 +88,6 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r
num_chromosome_records_processed = 0
if rec.INFO.get('CSQ') is None:
num_records_filtered = num_records_filtered + 1
- #alt_allele = ','.join(rec.ALT)
- #pos = rec.start + 1
- #variant_id = 'g.' + str(rec.CHROM) + ':' + str(pos) + str(rec.REF) + '>' + alt_allele
#logger.warning('Variant record ' + str(variant_id) + ' does not have CSQ tag from Variant Effect Predictor (--vep_skip_intergenic or --vep_coding_only turned ON?) - variant will be skipped')
continue
num_chromosome_records_processed += 1
@@ -90,30 +96,44 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r
if regulatory_annotation == 1:
csq_record_results_all = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = False, csq_identifier = 'CSQ')
- if 'vep_block' in csq_record_results_all:
- vep_csq_records_all = csq_record_results_all['vep_block']
+ if 'picked_gene_csq' in csq_record_results_all:
+ vep_csq_records_all = csq_record_results_all['picked_gene_csq']
rec.INFO['REGULATORY_ANNOTATION'] = annoutils.map_regulatory_variant_annotations(vep_csq_records_all)
- csq_record_results_pick = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')
-
- if 'vep_all_csq' in csq_record_results_pick:
- rec.INFO['VEP_ALL_CSQ'] = ','.join(csq_record_results_pick['vep_all_csq'])
- if 'vep_block' in csq_record_results_pick:
- vep_csq_records = csq_record_results_pick['vep_block']
+ vep_csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ', debug = debug)
- block_idx = 0
- record = vep_csq_records[block_idx]
- for k in record:
+
+ principal_hgvsp = '.'
+ principal_hgvsc = '.'
+ if 'picked_csq' in vep_csq_record_results:
+ csq_record = vep_csq_record_results['picked_csq']
+ for k in csq_record:
if k in vcf_info_element_types:
- if vcf_info_element_types[k] == "Flag" and record[k] == "1":
- rec.INFO[k] = True
+ if vcf_info_element_types[k] == "Flag":
+ #rec.INFO[k] = False
+ if csq_record[k] == "1":
+ rec.INFO[k] = True
else:
- if not record[k] is None:
- rec.INFO[k] = record[k]
+ if not csq_record[k] is None:
+ rec.INFO[k] = csq_record[k]
+
+ if k == 'HGVSp_short':
+ principal_hgvsp = csq_record[k]
+
+ if k == 'HGVSc':
+ principal_hgvsc = csq_record[k].split(':')[1]
+ #else:
+ # print("missing\t" + str(k))
+
+ if 'all_csq' in vep_csq_record_results:
+ rec.INFO['VEP_ALL_CSQ'] = ','.join(vep_csq_record_results['all_csq'])
+ annoutils.map_cancer_hotspots(vep_csq_record_results['all_csq'], cancer_hotspots, rec, principal_hgvsp, principal_hgvsc)
if not rec.INFO.get('DBNSFP') is None:
annoutils.map_variant_effect_predictors(rec, dbnsfp_prediction_algorithms)
+ if oncogenicity_annotation == 1:
+ oncogenicity.assign_oncogenicity_evidence(rec, tumortype = "Any")
w.write_record(rec)
w.close()
logger.info('Completed summary of functional annotations for ' + str(num_chromosome_records_processed) + ' variants on chromosome ' + str(current_chrom))
diff --git a/src/gvanno/gvanno_vcfanno.py b/src/gvanno/gvanno_vcfanno.py
index 7840a21..d972a5d 100755
--- a/src/gvanno/gvanno_vcfanno.py
+++ b/src/gvanno/gvanno_vcfanno.py
@@ -20,10 +20,8 @@ def __main__():
parser.add_argument("--ncer",action = "store_true", help="Annotate VCF with ranking of variant deleteriousness in non-coding regions (ncER)")
parser.add_argument("--clinvar",action = "store_true", help="Annotate VCF with annotations from ClinVar")
parser.add_argument("--dbnsfp",action = "store_true", help="Annotate VCF with annotations from database of non-synonymous functional predictions")
- #parser.add_argument("--uniprot",action = "store_true", help="Annotate VCF with protein functional features from the UniProt Knowledgebase")
parser.add_argument("--gvanno_xref",action = "store_true", help="Annotate VCF with transcript annotations from gvanno (protein complexes, disease associations, etc)")
parser.add_argument("--gwas",action = "store_true", help="Annotate VCF with against known loci associated with cancer, as identified from genome-wide association studies (GWAS)")
- parser.add_argument("--cancer_hotspots",action = "store_true", help="Annotate VCF with mutation hotspots from cancerhotspots.org")
args = parser.parse_args()
@@ -31,7 +29,7 @@ def __main__():
vcfheader_file = args.out_vcf + '.tmp.' + str(random.randrange(0,10000000)) + '.header.txt'
vcfanno_conf_fname = args.out_vcf + '.tmp.conf.toml'
print_vcf_header(args.query_vcf, vcfheader_file, chromline_only = False)
- run_vcfanno(args.num_processes, args.query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, args.gvanno_db_dir, args.out_vcf, args.ncer, args.clinvar, args.dbnsfp, args.gvanno_xref,args.gwas, args.cancer_hotspots)
+ run_vcfanno(args.num_processes, args.query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, args.gvanno_db_dir, args.out_vcf, args.ncer, args.clinvar, args.dbnsfp, args.gvanno_xref, args.gwas)
def prepare_vcfanno_configuration(vcfanno_data_directory, vcfanno_conf_fname, vcfheader_file, logger, datasource_info_tags, query_info_tags, datasource):
@@ -41,7 +39,7 @@ def prepare_vcfanno_configuration(vcfanno_data_directory, vcfanno_conf_fname, vc
append_to_conf_file(datasource, datasource_info_tags, vcfanno_data_directory, vcfanno_conf_fname)
append_to_vcf_header(vcfanno_data_directory, datasource, vcfheader_file)
-def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, gvanno_db_directory, output_vcf, ncer, clinvar, dbnsfp, gvanno_xref,gwas, cancer_hotspots):
+def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, gvanno_db_directory, output_vcf, ncer, clinvar, dbnsfp, gvanno_xref, gwas):
"""
Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files
"""
@@ -52,10 +50,7 @@ def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, v
gvanno_xref_info_tags = ["GVANNO_XREF"]
gwas_info_tags = ["GWAS_HIT"]
ncer_info_tags = ["NCER_PERCENTILE"]
- cancer_hotspots_info_tags = ["MUTATION_HOTSPOT","MUTATION_HOTSPOT_TRANSCRIPT","MUTATION_HOTSPOT_CANCERTYPE"]
-
- if cancer_hotspots is True:
- prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, cancer_hotspots_info_tags, query_info_tags, "cancer_hotspots")
+
if clinvar is True:
prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, clinvar_info_tags, query_info_tags, "clinvar")
if dbnsfp is True:
diff --git a/src/gvanno/lib/annoutils.py b/src/gvanno/lib/annoutils.py
index 2b8362d..5678c9a 100755
--- a/src/gvanno/lib/annoutils.py
+++ b/src/gvanno/lib/annoutils.py
@@ -9,7 +9,9 @@
csv.field_size_limit(500 * 1024 * 1024)
-threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H','Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W','Tyr':'Y','Val':'V','Ter':'X'}
+threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H',
+ 'Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W',
+ 'Tyr':'Y','Val':'V','Ter':'X'}
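The hunks below rely on a `threeToOneAA` helper (called in `map_cancer_hotspots`) whose body is not part of this patch; a minimal sketch of such a conversion over the same lookup table might look like the following (function name and behaviour are assumptions, not gvanno's actual implementation):

```python
THREE_TO_ONE = {'Ala': 'A', 'Arg': 'R', 'Asn': 'N', 'Asp': 'D', 'Cys': 'C', 'Glu': 'E',
                'Gln': 'Q', 'Gly': 'G', 'His': 'H', 'Ile': 'I', 'Leu': 'L', 'Lys': 'K',
                'Met': 'M', 'Phe': 'F', 'Pro': 'P', 'Ser': 'S', 'Thr': 'T', 'Trp': 'W',
                'Tyr': 'Y', 'Val': 'V', 'Ter': 'X'}

def three_to_one_aa(hgvsp):
    # Replace each three-letter residue code in an HGVSp string with its
    # one-letter equivalent, e.g. 'p.Val600Glu' -> 'p.V600E'
    for three, one in THREE_TO_ONE.items():
        hgvsp = hgvsp.replace(three, one)
    return hgvsp
```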
@@ -23,6 +25,7 @@ def read_genexref_namemap(logger, gene_xref_namemap_tsv):
reader = csv.DictReader(tsvfile, delimiter='\t')
for row in reader:
namemap_xref[row['name']] = int(row['index'])
+ tsvfile.close()
return namemap_xref
@@ -45,12 +48,13 @@ def read_infotag_file(logger, vcf_info_tags_tsv):
for row in reader:
if not row['tag'] in info_tag_xref:
info_tag_xref[row['tag']] = row
+ tsvfile.close()
return info_tag_xref
def check_subprocess(command):
- #if debug:
- #logger.info(command)
+ # if debug:
+ # logger.info(command)
try:
output = subprocess.check_output(str(command), stderr=subprocess.STDOUT, shell=True)
if len(output) > 0:
@@ -79,7 +83,6 @@ def write_pass_vcf(annotated_vcf, logger):
Function prints all PASS variants from a VCF file. New VCF file appends '.pass.' to the filename.
"""
- #out_vcf = re.sub(r'\.annotated\.vcf\.gz$','.annotated.pass.vcf',annotated_vcf)
out_vcf = re.sub(r'\.vcf\.gz$','.pass.vcf',annotated_vcf)
vcf = VCF(annotated_vcf)
w = Writer(out_vcf, vcf)
@@ -268,7 +271,7 @@ def get_correct_cpg_transcript(vep_csq_records):
return csq_idx
- ## some variants iare assigned multiple transcript consequences
+ ## some variants are assigned multiple transcript consequences
## if cancer predisposition genes are in the vicinity of other genes, choose the cancer predisposition gene
   ## if there are neighbouring cancer-predisposition genes, choose the custom gene, preferring coding change (see below, KLLN/PTEN, XPC/TMEM43, NTHL1/TSC2)
csq_idx_dict = {}
@@ -576,15 +579,24 @@ def make_transcript_xref_map(rec, fieldmap, xref_tag = 'PCGR_ONCO_XREF'):
return(transcript_xref_map)
def vep_dbnsfp_meta_vcf(query_vcf, info_tags_wanted):
- vep_to_pcgr_af = {'gnomAD_AMR_AF':'AMR_AF_GNOMAD',
- 'gnomAD_AFR_AF':'AFR_AF_GNOMAD',
- 'gnomAD_EAS_AF':'EAS_AF_GNOMAD',
- 'gnomAD_NFE_AF':'NFE_AF_GNOMAD',
- 'gnomAD_AF':'GLOBAL_AF_GNOMAD',
- 'gnomAD_SAS_AF':'SAS_AF_GNOMAD',
- 'gnomAD_OTH_AF':'OTH_AF_GNOMAD',
- 'gnomAD_ASJ_AF':'ASJ_AF_GNOMAD',
- 'gnomAD_FIN_AF':'FIN_AF_GNOMAD',
+ vep_to_pcgr_af = {'gnomADe_AMR_AF':'AMR_AF_GNOMAD',
+ 'gnomADe_AFR_AF':'AFR_AF_GNOMAD',
+ 'gnomADe_EAS_AF':'EAS_AF_GNOMAD',
+ 'gnomADe_NFE_AF':'NFE_AF_GNOMAD',
+ 'gnomADe_AF':'GLOBAL_AF_GNOMAD',
+ 'gnomADe_SAS_AF':'SAS_AF_GNOMAD',
+ 'gnomADe_OTH_AF':'OTH_AF_GNOMAD',
+ 'gnomADe_ASJ_AF':'ASJ_AF_GNOMAD',
+ 'gnomADe_FIN_AF':'FIN_AF_GNOMAD',
+                     'gnomADg_AMR_AF':'AMR_AF_GNOMADg',
+ 'gnomADg_AFR_AF':'AFR_AF_GNOMADg',
+ 'gnomADg_EAS_AF':'EAS_AF_GNOMADg',
+ 'gnomADg_NFE_AF':'NFE_AF_GNOMADg',
+ 'gnomADg_AF':'GLOBAL_AF_GNOMADg',
+ 'gnomADg_SAS_AF':'SAS_AF_GNOMADg',
+ 'gnomADg_OTH_AF':'OTH_AF_GNOMADg',
+ 'gnomADg_ASJ_AF':'ASJ_AF_GNOMADg',
+ 'gnomADg_FIN_AF':'FIN_AF_GNOMADg',
'AFR_AF':'AFR_AF_1KG',
'AMR_AF':'AMR_AF_1KG',
'SAS_AF':'SAS_AF_1KG',
@@ -628,14 +640,33 @@ def vep_dbnsfp_meta_vcf(query_vcf, info_tags_wanted):
return vep_dbnsfp_meta_info
-def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ'):
+def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ', debug = 0):
all_csq_pick = []
all_transcript_consequences = []
+
+ varkey = str(rec.CHROM) + '_' + str(rec.POS) + '_' + str(rec.REF) + '_' + str(','.join(rec.ALT))
+
for csq in rec.INFO.get(csq_identifier).split(','):
csq_fields = csq.split('|')
+ entrezgene = '.'
+
+      ## The Entrez gene identifier is not provided directly by VEP; pull it out of 'transcript_xref_map' for the given transcript-specific CSQ block
+      ## - used for the 'consequence_entry' objects that are added to the 'all_csq' array
+ k = 0
+ while(k < len(csq_fields)):
+ if k in vep_csq_fields_map['index2field']:
+ if vep_csq_fields_map['index2field'][k] == 'Feature':
+ ensembl_transcript_id = csq_fields[k]
+ if ensembl_transcript_id != '' and ensembl_transcript_id.startswith('ENST'):
+ if ensembl_transcript_id in transcript_xref_map.keys():
+ if 'ENTREZGENE' in transcript_xref_map[ensembl_transcript_id].keys():
+ entrezgene = transcript_xref_map[ensembl_transcript_id]['ENTREZGENE']
+ k = k + 1
+
+
if pick_only is False:
j = 0
csq_record = {}
@@ -649,7 +680,7 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl
else:
## loop over VEP consequence blocks PICK'ed according to VEP's ranking scheme
- if csq_fields[vep_csq_fields_map['field2index']['PICK']] == "1": ## only consider the primary/picked consequence when expanding with annotation tags
+ if csq_fields[vep_csq_fields_map['field2index']['PICK']] == "1": ## only consider the primary/picked consequence(s) when expanding with annotation tags
j = 0
csq_record = {}
@@ -666,7 +697,10 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl
## assign additional gene/transcript annotations from the custom transcript
## xref map (PCGR/CPSR/gvanno) as key,value pairs in the csq_record object
csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation]
+
else:
               ## If VEP does not provide SYMBOL, append the SYMBOL provided by the xref map
if csq_record['SYMBOL'] is None:
csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation]
@@ -706,27 +740,246 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl
assign_cds_exon_intron_annotations(csq_record)
## Append transcript consequence to all_csq_pick
all_csq_pick.append(csq_record)
- symbol = '.'
- hgvsc = ".:."
- hgvsp = ".:."
+ symbol = "."
+ hgvsc = "."
+ hgvsp = "."
if csq_fields[vep_csq_fields_map['field2index']['SYMBOL']] != "":
symbol = str(csq_fields[vep_csq_fields_map['field2index']['SYMBOL']])
if csq_fields[vep_csq_fields_map['field2index']['HGVSc']] != "":
- hgvsc = str(csq_fields[vep_csq_fields_map['field2index']['HGVSc']])
+ hgvsc = str(csq_fields[vep_csq_fields_map['field2index']['HGVSc']].split(':')[1])
if csq_fields[vep_csq_fields_map['field2index']['HGVSp']] != "":
- hgvsp = str(csq_fields[vep_csq_fields_map['field2index']['HGVSp']])
- consequence_entry = (str(csq_fields[vep_csq_fields_map['field2index']['Consequence']]) + ':' +
- str(symbol) + ':' +
- str(hgvsc) + ':' +
- str(hgvsp) + ':' +
- str(csq_fields[vep_csq_fields_map['field2index']['Feature_type']]) + ':' +
- str(csq_fields[vep_csq_fields_map['field2index']['Feature']]) + ':' +
+ hgvsp = str(csq_fields[vep_csq_fields_map['field2index']['HGVSp']].split(':')[1])
+ consequence_entry = (str(csq_fields[vep_csq_fields_map['field2index']['Consequence']]) + ":" +
+ str(symbol) + ":" +
+ str(entrezgene) + ":" +
+ str(hgvsc) + ":" +
+ str(hgvsp) + ":" +
+ str(csq_fields[vep_csq_fields_map['field2index']['Feature_type']]) + ":" +
+ str(csq_fields[vep_csq_fields_map['field2index']['Feature']]) + ":" +
str(csq_fields[vep_csq_fields_map['field2index']['BIOTYPE']]))
all_transcript_consequences.append(consequence_entry)
-
+
+
+ vep_chosen_csq_idx = 0
vep_csq_results = {}
- vep_csq_results['vep_block'] = all_csq_pick
- vep_csq_results['vep_all_csq'] = all_transcript_consequences
+ vep_csq_results['picked_gene_csq'] = all_csq_pick
+ vep_csq_results['all_csq'] = all_transcript_consequences
+ vep_csq_results['picked_csq'] = None
+
+    ## If multiple transcript-specific variant consequences are highlighted by --pick_allele_gene,
+    ## prioritize/choose the block of consequences which has
+    ## - a gene with BIOTYPE equal to 'protein_coding' (the other picked transcript/gene may potentially carry another BIOTYPE)
+    ## - a gene consequence classified as 'exonic' (the other picked transcript/gene likely carries a nonexonic consequence)
+ if len(vep_csq_results['picked_gene_csq']) > 0:
+ vep_selected_idx = {}
+ vep_selected_idx['exonic_status'] = {}
+ vep_selected_idx['consequence'] = {}
+
+
+ i = 0
+ picked_blocks = []
+ for trans_rec in vep_csq_results['picked_gene_csq']:
+
+ if debug:
+ biotype = '.'
+ consequence = '.'
+ exonic_status = '.'
+ genesymbol = '.'
+ distance = '.'
+ feature = '.'
+ if not trans_rec['Consequence'] is None:
+ consequence = trans_rec['Consequence']
+ if not trans_rec['SYMBOL'] is None:
+ genesymbol = trans_rec['SYMBOL']
+ if not trans_rec['BIOTYPE'] is None:
+ biotype = trans_rec['BIOTYPE']
+ if not trans_rec['EXONIC_STATUS'] is None:
+ exonic_status = trans_rec['EXONIC_STATUS']
+ if not trans_rec['DISTANCE'] is None:
+ distance = trans_rec['DISTANCE']
+ if not trans_rec['Feature'] is None:
+ feature = trans_rec['Feature']
+ block = str(genesymbol) + ':' + str(feature) + ':' + str(consequence) + ':' + str(distance) + ':' + str(biotype) + ':' + str(exonic_status)
+ picked_blocks.append(block)
+
+
+ if 'BIOTYPE' in trans_rec and 'Consequence' in trans_rec and 'EXONIC_STATUS' in trans_rec:
+ if not trans_rec['BIOTYPE'] is None and not trans_rec['Consequence'] is None:
+ if trans_rec['BIOTYPE'] == "protein_coding":
+ ## for protein-coding genes - record the exonic variant consequence status (exonic/nonexonic)
+ vep_selected_idx['exonic_status'][i] = trans_rec['EXONIC_STATUS']
+ vep_selected_idx['consequence'][i] = trans_rec['Consequence']
+ i = i + 1
+
+
+ if debug:
+ print(str(varkey) + " - picked CSQ blocks: " + ' /// '.join(picked_blocks))
+
+
+ ## when multiple transcript gene blocks are picked by VEP, prioritize the block with 'exonic' consequence
+ if len(vep_selected_idx['exonic_status'].keys()) > 1:
+ exonic_cons_found = 0
+ for j in vep_selected_idx['exonic_status'].keys():
+ if vep_selected_idx['exonic_status'][j] == 'exonic':
+ exonic_cons_found = 1
+ vep_chosen_csq_idx = j
+
+ ## if multiple non-exonic variants are found, prioritize UTR variants over other nonexonic
+ if exonic_cons_found == 0:
+ for j in vep_selected_idx['consequence'].keys():
+ if vep_selected_idx['consequence'][j] == '5_prime_UTR_variant' or vep_selected_idx['consequence'][j] == '3_prime_UTR_variant':
+ vep_chosen_csq_idx = j
+
+ else:
+ if len(vep_selected_idx['exonic_status'].keys()) == 1:
+ for k in vep_selected_idx['exonic_status'].keys():
+ vep_chosen_csq_idx = k
+
+ vep_csq_results['picked_csq'] = vep_csq_results['picked_gene_csq'][vep_chosen_csq_idx]
+
+ else:
+        logger.error('No VEP block chosen by --pick_allele_gene')
return(vep_csq_results)
+
+
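The prioritization applied above can be illustrated standalone; the consequence blocks below are invented sample data (not gvanno output), and the sketch mirrors the rule of preferring an 'exonic' block, with a UTR consequence as fallback when every picked block is nonexonic:

```python
# Invented sample of two consequence blocks picked by VEP's --pick_allele_gene;
# the second block is exonic and should win.
blocks = [
    {'EXONIC_STATUS': 'nonexonic', 'Consequence': '5_prime_UTR_variant'},
    {'EXONIC_STATUS': 'exonic', 'Consequence': 'missense_variant'},
]

chosen_idx = 0
exonic_found = False
for i, block in enumerate(blocks):
    # prefer any block with an exonic consequence
    if block['EXONIC_STATUS'] == 'exonic':
        chosen_idx = i
        exonic_found = True
if not exonic_found:
    # all blocks nonexonic: fall back to a UTR consequence if one exists
    for i, block in enumerate(blocks):
        if block['Consequence'] in ('5_prime_UTR_variant', '3_prime_UTR_variant'):
            chosen_idx = i
```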
+def read_cancer_hotspots(logger, hotspots_tsv):
+
+ ## load cancer hotspot entries from 'cancer_hotspots.tsv' (provided by github.com/sigven/cancerHotspots)
+
+ hotspots = {} ##dictionary returned
+ mutation_hotspots = {}
+ codon_hotspots = {}
+ splice_hotspots = {}
+ if not os.path.exists(hotspots_tsv):
+ logger.info("ERROR: File '" + str(hotspots_tsv) + "' does not exist - exiting")
+ exit(1)
+ tsvfile = open(hotspots_tsv, 'r')
+ reader = csv.DictReader(tsvfile, delimiter='\t')
+ for row in reader:
+ mutation_hotspots[str(row['entrezgene']) + '-' + row['hgvsp2']] = row
+ codon_hotspots[str(row['entrezgene']) + '-' + row['codon']] = row
+ if row['hgvsc'] != '.':
+ splice_hotspots[str(row['entrezgene']) + '-' + str(row['hgvsc'])] = row
+ tsvfile.close()
+
+ hotspots['mutation'] = mutation_hotspots
+ hotspots['codon'] = codon_hotspots
+ hotspots['splice'] = splice_hotspots
+ return hotspots
+
+
+def map_cancer_hotspots(transcript_csq_elements, cancer_hotspots, rec, principal_hgvsp, principal_hgvsc):
+
+ unique_hotspot_mutations = {}
+ unique_hotspot_codons = {}
+
+ principal_codon = '.'
+    if re.match(r'^(p\.[A-Z]{1}[0-9]{1,}[A-Za-z]{1,})', principal_hgvsp):
+ codon_match = re.findall(r'[A-Z][0-9]{1,}',principal_hgvsp)
+ if len(codon_match) == 1:
+ principal_codon = codon_match[0]
+
+ ## loop through all transcript-specific consequences ('csq_elements') for a given variant, check for the presence of
+ ## 1. Exonic, protein-altering mutations (dictionary key = entrezgene + hgvsp) that overlap known cancer hotspots (https://github.com/sigven/cancerHotspots)
+  ##    - assess whether a potential hit reflects the principal hgvsp ('by_hgvsp_principal') or an alternative hgvsp ('by_hgvsp_nonprincipal')
+ ## 2. Splice site mutations (dictionary key = entrezgene + hgvsc) that overlap known cancer hotspots (NOTE: only a subset have been curated)
+  ##    - assess whether a potential hit reflects the principal hgvsc ('by_hgvsc_principal') or an alternative hgvsc ('by_hgvsc_nonprincipal')
+
+ for csq in transcript_csq_elements:
+ (consequence, symbol, entrezgene, hgvsc, hgvsp, feature_type, feature, biotype) = csq.split(':')
+
+      if not re.search(r'^(missense|stop|start|inframe|splice_donor|splice_acceptor|frameshift)', consequence):
+ continue
+
+ hgvsp_short = threeToOneAA(hgvsp)
+ hotspot_key_mutation = "."
+ codon_match = []
+ if entrezgene != "." and hgvsp != ".":
+ hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsp_short)
+         codon_match = re.findall(r'p\.[A-Z][0-9]{1,}',hgvsp_short)
+
+ if entrezgene != "." and (consequence == 'splice_donor_variant' or consequence == 'splice_acceptor_variant'):
+ hgvsc_key = re.sub(r'>(A|G|C|T)$','',hgvsc)
+ hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsc_key)
+
+ if hotspot_key_mutation == ".":
+ continue
+
+ if hotspot_key_mutation in cancer_hotspots['mutation']:
+ hotspot_info = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT2']
+ hotspot_info_ttype = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE']
+ unique_hotspot_mutations['exonic|' + str(hotspot_info)] = hotspot_info_ttype
+
+ if hotspot_key_mutation in cancer_hotspots['splice']:
+ hotspot_info = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT2']
+ hotspot_info_ttype = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE']
+ unique_hotspot_mutations['splice|' + str(hotspot_info)] = hotspot_info_ttype
+
+
+ if len(codon_match) > 0:
+ hotspot_key_codon = str(entrezgene) + '-' + str(codon_match[0])
+
+ if hotspot_key_codon in cancer_hotspots['codon']:
+ unique_hotspot_codons[str('exonic|') + cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT2']] = \
+ cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT_CANCERTYPE']
+
+ if len(unique_hotspot_mutations.keys()) > 0:
+ if len(unique_hotspot_mutations.keys()) == 1:
+ for gene_mutation_key in unique_hotspot_mutations.keys():
+ rec.INFO['MUTATION_HOTSPOT'] = gene_mutation_key
+ rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[gene_mutation_key]
+
+ if gene_mutation_key.split('|')[0] == 'exonic':
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_nonprincipal'
+ hgvsp_candidate = 'p.' + str(gene_mutation_key.split('|')[3]) + str(gene_mutation_key.split('|')[4])
+ if hgvsp_candidate == principal_hgvsp:
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal'
+ else:
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_nonprincipal'
+ hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(gene_mutation_key.split('|')[4]))
+ if hgvsc_candidate == principal_hgvsc:
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal'
+ else:
+ ## multiple hotspot matches for alternative hgvsp keys
+ ## - keep only match against principal HGVSp
+ for hotspot_info in unique_hotspot_mutations.keys():
+ if hotspot_info.split('|')[0] == 'exonic':
+ hgvsp_candidate = "p." + str(hotspot_info.split('|')[3]) + str(hotspot_info.split('|')[4])
+
+ if hgvsp_candidate == principal_hgvsp:
+ rec.INFO['MUTATION_HOTSPOT'] = hotspot_info
+ rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info]
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal'
+ else:
+ hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(hotspot_info.split('|')[4]))
+
+ if hgvsc_candidate == principal_hgvsc:
+ rec.INFO['MUTATION_HOTSPOT'] = hotspot_info
+ rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info]
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal'
+
+ else:
+ if len(unique_hotspot_codons.keys()) > 0:
+ if len(unique_hotspot_codons.keys()) == 1:
+ for gene_codon_key in unique_hotspot_codons.keys():
+
+ if '|' in gene_codon_key:
+
+ codon = str(gene_codon_key.split('|')[3])
+
+ if codon == principal_codon:
+ rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key
+ rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key]
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_principal'
+ else:
+ rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key
+ rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key]
+ rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_nonprincipal'
+
+
+ return
+
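The three dictionaries returned by `read_cancer_hotspots()` are keyed on the Entrez gene identifier joined with an amino-acid change, a codon, or an HGVSc string. A self-contained sketch of the lookup performed in `map_cancer_hotspots()`, using invented entries (the field values are illustrative, not real cancerhotspots data):

```python
# Invented hotspot entries mimicking the dictionary layout built by
# read_cancer_hotspots(); real values come from github.com/sigven/cancerHotspots.
hotspots = {
    'mutation': {
        '673-p.V600E': {'MUTATION_HOTSPOT2': 'BRAF|V600E',
                        'MUTATION_HOTSPOT_CANCERTYPE': 'Melanoma|100|90'},
    },
    'codon': {
        '673-V600': {'MUTATION_HOTSPOT2': 'BRAF|V600',
                     'MUTATION_HOTSPOT_CANCERTYPE': 'Melanoma|100|'},
    },
    'splice': {},
}

entrezgene = '673'
hgvsp_short = 'p.V600E'

# exact amino-acid-change lookup (entrezgene + HGVSp key)
mutation_hit = hotspots['mutation'].get(str(entrezgene) + '-' + hgvsp_short)

# codon-level lookup (entrezgene + codon key), used when only the residue matches
codon_hit = hotspots['codon'].get(str(entrezgene) + '-' + 'V600')
```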
diff --git a/src/gvanno/lib/oncogenicity.py b/src/gvanno/lib/oncogenicity.py
new file mode 100644
index 0000000..28aa347
--- /dev/null
+++ b/src/gvanno/lib/oncogenicity.py
@@ -0,0 +1,422 @@
+#!/usr/bin/env python
+
+import os,re,sys
+from cyvcf2 import VCF, Writer
+
+def assign_oncogenicity_evidence(rec = None, tumortype = "Any"):
+
+ clingen_vicc_ev_codes = [
+ "CLINGEN_VICC_SBVS1",
+ "CLINGEN_VICC_SBS1",
+ "CLINGEN_VICC_SBP1",
+ "CLINGEN_VICC_SBP2",
+ "CLINGEN_VICC_OS3",
+ "CLINGEN_VICC_OM2",
+ "CLINGEN_VICC_OM3",
+ "CLINGEN_VICC_OM4",
+ "CLINGEN_VICC_OP1",
+ "CLINGEN_VICC_OP3",
+ "CLINGEN_VICC_OP4",
+ "CLINGEN_VICC_OVS1"]
+
+    ### Evidence for benign effects of somatic variants
+
+ # 1) "CLINGEN_VICC_SBVS1"
+ ## Very high MAF: > 0.05 in gnomAD - any 5 general continental populations
+ ## AFR/AMR/EAS/NFE/SAS
+
+ # 2) "CLINGEN_VICC_SBS1"
+ ## High MAF: > 0.01 in gnomAD - any 5 general continental populations
+ ## AFR/AMR/EAS/NFE/SAS
+
+ # 3) "CLINGEN_VICC_SBP1"
+ ## Multiple lines of computational evidence support a benign
+ ## effect on the gene or gene product
+    ## (conservation, evolutionary, splicing impact, etc.) - from dbNSFP
+
+    # 4) "CLINGEN_VICC_SBP2"
+ ## Synonymous (silent) variant for which splicing prediction
+ ## algorithms predict no effect on the splice consensus sequence
+ ## nor the creation of a new splice site and the nucleotide is
+ ## not highly conserved
+
+ # 5) "CLINGEN_VICC_SBS2"
+    ## Well-established in vitro/in vivo functional studies show
+ ## no oncogenic effects
+
+    ## Evidence for oncogenic effects of somatic variants
+
+ # 6) "CLINGEN_VICC_OVS1"
+ ## Null variant - predicted as LoF by LOFTEE
+ ## - Nonsense, frameshift, canonical splice sites, initiation codon,
+ ## single-exon/multi-exon deletion
+ ## - Tumor suppressor gene
+
+ # 7) "CLINGEN_VICC_OS1"
+ ## Same amino acid change as previously established oncogenic variant
+ ## NOTE: Not relevant since current implementation does not consider
+ ## any criteria where nucleotide change is affecting annotation
+
+ # 8) "CLINGEN_VICC_OS2"
+    ## Well-established in vitro/in vivo functional studies show
+ ## oncogenic effect of the variant
+
+ # 9) "CLINGEN_VICC_OS3"
+ ## Located in a mutation hotspot
+ ## - >= 50 samples with a somatic variant at the AA position (cancerhotspots.org)
+ ## - Same amino acid change in >= 10 samples (cancerhotspots.org)
+
+ # 10) "CLINGEN_VICC_OM1"
+ ## Located in a critical and well-established part of a functional domain
+ ## (active site of an enzyme)
+ ## NOTE: Not used since determination/retrieval of critical protein sites are non-trivial to automate
+
+ # 11) "CLINGEN_VICC_OM2"
+ ## Protein length changes as a result of in-frame deletions/insertions in a
+ ## known oncogene/tumor suppressor genes or stop-loss variants in a
+ ## tumor suppressor gene
+
+ # 12) "CLINGEN_VICC_OM3"
+ ## Missense variant at an amino acid residue where a different missense
+ ## variant determined to be oncogenic (using this standard) has been
+ ## documented. Amino acid difference from reference amino acid should
+ ## be greater or at least approximately the same as for missense change
+ ## determined to be oncogenic.
+
+ # 13) "CLINGEN_VICC_OM4"
+ ## Located in a mutation hotspot
+ ## - < 50 samples with a somatic variant at the AA position (cancerhotspots.org)
+ ## - Same amino acid change in >= 10 samples (cancerhotspots.org)
+ ## - Not applicable if OM1 or OM3 is applicable
+
+ # 14) "CLINGEN_VICC_OP1"
+    ## All utilized lines of computational evidence support an oncogenic effect
+ ## of a variant (conservation, evolutionary, splicing effect)
+
+ # 15) "CLINGEN_VICC_OP2"
+ ## Somatic variant in a gene in a malignancy with a single genetic
+    ## etiology. Example: retinoblastoma is caused by bi-allelic
+ ## RB1 inactivation.
+
+ # 16) "CLINGEN_VICC_OP3"
+ ## Located in a mutation hotspot
+ ## - Same amino acid change in < 10 samples (cancerhotspots.org)
+ ## - Not applicable if OM1 or OM3 is applicable
+
+ # 17) "CLINGEN_VICC_OP4"
+ ## Absent from controls (gnomAD)
+ ## - Extremely low MAF
+
+
+ required_oncogenicity_vars = [
+ "Consequence",
+ "MUTATION_HOTSPOT",
+ "MUTATION_HOTSPOT_CANCERTYPE",
+ "SYMBOL",
+ "ONCOGENE",
+ "ONCOGENE_EVIDENCE",
+ "TUMOR_SUPPRESSOR",
+ "TUMOR_SUPPRESSOR_EVIDENCE",
+ "LoF",
+ "INTRON_POSITION",
+ "EXON_POSITION",
+ "EAS_AF_GNOMAD",
+ "NFE_AF_GNOMAD",
+ "AFR_AF_GNOMAD",
+ "AMR_AF_GNOMAD",
+ "SAS_AF_GNOMAD",
+ "DBNSFP_SIFT",
+ "DBNSFP_PROVEAN",
+ "DBNSFP_META_RNN",
+ "DBNSFP_FATHMM",
+ "DBNSFP_MUTATIONTASTER",
+ "DBNSFP_DEOGEN2",
+ "DBNSFP_PRIMATEAI",
+ "DBNSFP_MUTATIONASSESSOR",
+ "DBNSFP_FATHMM_MKL",
+ "DBNSFP_M_CAP",
+ "DBNSFP_LIST_S2",
+ "DBNSFP_BAYESDEL_ADDAF",
+ "DBNSFP_SPLICE_SITE_ADA",
+ "DBNSFP_SPLICE_SITE_RF"]
+
+ variant_data = {}
+ for col in required_oncogenicity_vars:
+ if rec.INFO.get(col) is None:
+ if col == "TUMOR_SUPPRESSOR" or col == "ONCOGENE":
+ variant_data[col] = False
+ elif col == "LoF":
+ variant_data['LOSS_OF_FUNCTION'] = False
+ else:
+ variant_data[col] = None
+ else:
+ if rec.INFO.get(col) == '':
+ variant_data[col] = True
+ else:
+ if col == "LoF":
+ if rec.INFO.get(col) == "HC":
+ variant_data["LOSS_OF_FUNCTION"] = True
+ else:
+ variant_data[col] = rec.INFO.get(col)
+
+ for code in clingen_vicc_ev_codes:
+ variant_data[code] = False
+
+ dbnsfp_minimum_majority = 7
+ dbnsfp_maximum_minority = 2
+ dbnsfp_minimum_algos_called = dbnsfp_minimum_majority
+
+ variant_data['N_INSILICO_CALLED'] = 0
+ variant_data['N_INSILICO_DAMAGING'] = 0
+ variant_data['N_INSILICO_TOLERATED'] = 0
+ variant_data['N_INSILICO_SPLICING_NEUTRAL'] = 0
+ variant_data['N_INSILICO_SPLICING_AFFECTED'] = 0
+
+ for algo in ['SIFT','PROVEAN','META_RNN',
+ 'MUTATIONTASTER','DEOGEN2',
+ 'PRIMATEAI','MUTATIONASSESSOR',
+ 'FATHMM_MKL','M_CAP',
+ 'LIST_S2','BAYESDEL_ADDAF',
+ 'SPLICE_SITE_RF','SPLICE_SITE_ADA']:
+ col = 'DBNSFP_' + str(algo)
+ if col in variant_data.keys():
+ if variant_data[col] != '.':
+ variant_data['N_INSILICO_CALLED'] += 1
+ if variant_data[col] == 'D':
+ variant_data['N_INSILICO_DAMAGING'] += 1
+ if variant_data[col] == 'T':
+ variant_data['N_INSILICO_TOLERATED'] += 1
+            if (algo == 'SPLICE_SITE_RF' or \
+               algo == 'SPLICE_SITE_ADA') and \
+               variant_data[col] == 'SN':
+               variant_data['N_INSILICO_SPLICING_NEUTRAL'] += 1
+            if (algo == 'SPLICE_SITE_RF' or \
+               algo == 'SPLICE_SITE_ADA') and \
+               variant_data[col] == 'AS':
+               variant_data['N_INSILICO_SPLICING_AFFECTED'] += 1
+
+ if (variant_data['N_INSILICO_CALLED'] > dbnsfp_minimum_algos_called and \
+ variant_data['N_INSILICO_DAMAGING'] >= dbnsfp_minimum_majority and \
+ variant_data['N_INSILICO_TOLERATED'] <= dbnsfp_maximum_minority and \
+ variant_data['N_INSILICO_SPLICING_NEUTRAL'] <= 1) or \
+ variant_data['N_INSILICO_SPLICING_AFFECTED'] == 2:
+ variant_data['CLINGEN_VICC_OP1'] = True
+
+ if variant_data['N_INSILICO_CALLED'] > dbnsfp_minimum_algos_called and \
+ variant_data['N_INSILICO_TOLERATED'] >= dbnsfp_minimum_majority and \
+ variant_data['N_INSILICO_DAMAGING'] <= dbnsfp_maximum_minority and \
+ variant_data['N_INSILICO_SPLICING_AFFECTED'] == 0:
+ variant_data['CLINGEN_VICC_SBP1'] = True
+
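The majority-vote rule applied above for OP1/SBP1 can be restated compactly. This sketch keeps the same thresholds as the code (at least 7 of the called dbNSFP algorithms agreeing, at most 2 dissenting) but deliberately omits the splice-prediction terms, and the function name is ours, not gvanno's:

```python
def insilico_consensus(n_called, n_damaging, n_tolerated,
                       minimum_majority=7, maximum_minority=2):
    # 'damaging' roughly corresponds to CLINGEN_VICC_OP1 above,
    # 'benign' to CLINGEN_VICC_SBP1 (splicing terms not modelled here)
    if n_called > minimum_majority and n_damaging >= minimum_majority \
            and n_tolerated <= maximum_minority:
        return 'damaging'
    if n_called > minimum_majority and n_tolerated >= minimum_majority \
            and n_damaging <= maximum_minority:
        return 'benign'
    return 'inconclusive'
```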
+ hotspot_mutated_samples = {}
+ hotspot_mutated_samples['aa_variant'] = {}
+ hotspot_mutated_samples['aa_site'] = {}
+ hotspot_mutated_samples['aa_variant']['Any'] = 0
+ hotspot_mutated_samples['aa_site']['Any'] = 0
+
+ if not variant_data['MUTATION_HOTSPOT_CANCERTYPE'] is None:
+ ttype_samples_in_hotspots = variant_data['MUTATION_HOTSPOT_CANCERTYPE'].split(',')
+ for ttype in ttype_samples_in_hotspots:
+ ttype_stats = ttype.split('|')
+ if len(ttype_stats) == 3:
+ ttype_hotspot = ttype_stats[0]
+ ttype_hotspot_samples_site = int(ttype_stats[1])
+ ttype_hotspot_samples_aa = ttype_hotspot_samples_site
+ if ttype_stats[2] != '':
+ ttype_hotspot_samples_aa = int(ttype_stats[2])
+
+ if ttype_hotspot != "Unknown":
+ hotspot_mutated_samples['aa_variant'][ttype_hotspot] = ttype_hotspot_samples_aa
+ hotspot_mutated_samples['aa_site'][ttype_hotspot] = ttype_hotspot_samples_site
+
+ hotspot_mutated_samples['aa_variant']['Any'] = hotspot_mutated_samples['aa_variant']['Any'] + ttype_hotspot_samples_aa
+ hotspot_mutated_samples['aa_site']['Any'] = hotspot_mutated_samples['aa_site']['Any'] + ttype_hotspot_samples_site
+
+
+ ttype = "Any"
+ if tumortype != "Any":
+ ttype = re.sub(r'/', '@', re.sub(r' ','_', tumortype))
+
+ if ttype in hotspot_mutated_samples['aa_variant'].keys() and \
+ ttype in hotspot_mutated_samples['aa_site'].keys():
+ if hotspot_mutated_samples['aa_variant'][ttype] >= 10 and \
+ hotspot_mutated_samples['aa_site'][ttype] >= 50:
+ variant_data['CLINGEN_VICC_OS3'] = True
+ if hotspot_mutated_samples['aa_variant'][ttype] >= 10 and \
+ hotspot_mutated_samples['aa_site'][ttype] < 50:
+ variant_data['CLINGEN_VICC_OM4'] = True
+ if hotspot_mutated_samples['aa_variant'][ttype] < 10 and \
+ hotspot_mutated_samples['aa_variant'][ttype] > 0:
+ variant_data['CLINGEN_VICC_OP3'] = True
+
+
+ if "EAS_AF_GNOMAD" in variant_data.keys() and \
+ "SAS_AF_GNOMAD" in variant_data.keys() and \
+ "AMR_AF_GNOMAD" in variant_data.keys() and \
+ "AFR_AF_GNOMAD" in variant_data.keys() and \
+ "NFE_AF_GNOMAD" in variant_data.keys():
+
+      ## check if variant has MAF >= 0.05 (SBVS1) or >= 0.01 (SBS1) in any of five major gnomAD subpopulations (exome set)
+ for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']:
+ if not variant_data[pop] is None:
+ ## MAF for this population >= 0.05
+ if float(variant_data[pop]) >= 0.05:
+ variant_data["CLINGEN_VICC_SBVS1"] = True
+ for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']:
+ if not variant_data[pop] is None:
+ ## MAF for this population >= 0.01 (< 0.05)
+ if float(variant_data[pop]) >= 0.01 and variant_data["CLINGEN_VICC_SBVS1"] is False:
+ variant_data["CLINGEN_VICC_SBS1"] = True
+
+ missing_pop_freq = 0
+ approx_zero_pop_freq = 0
+ for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']:
+ ## no MAF recorded in gnomAD for this population
+ if variant_data[pop] is None:
+ missing_pop_freq = missing_pop_freq + 1
+ else:
+ ## Very low MAF for this population
+ if float(variant_data[pop]) < 0.0001:
+ approx_zero_pop_freq = approx_zero_pop_freq + 1
+
+ ## check if variant is missing or with MAF approximately zero in all five major gnomAD subpopulations (exome set)
+ if missing_pop_freq == 5 or approx_zero_pop_freq == 5:
+ variant_data["CLINGEN_VICC_OP4"] = True
+
+ ## check if variant is a loss-of-function variant (LOFTEE) in a tumor suppressor gene (Cancer Gene Census/CancerMine)
+ if "TUMOR_SUPPRESSOR" in variant_data.keys() and \
+ "ONCOGENE" in variant_data.keys() and \
+ "LOSS_OF_FUNCTION" in variant_data.keys() and \
+ "Consequence" in variant_data.keys():
+
+ if variant_data['LOSS_OF_FUNCTION'] is True and variant_data['TUMOR_SUPPRESSOR'] is True:
+ variant_data['CLINGEN_VICC_OVS1'] = True
+
+ ## check if variant is creating a stop-lost or protein-length change in oncogene/tumor suppressor genes
+ if variant_data['CLINGEN_VICC_OVS1'] is False and \
+ ((re.match(r'^(inframe_deletion|inframe_insertion)', variant_data['Consequence']) and \
+ (variant_data['TUMOR_SUPPRESSOR'] is True or variant_data['ONCOGENE'] is True)) or \
+ (re.match(r'^(stop_lost)', variant_data['Consequence']) and \
+ variant_data['TUMOR_SUPPRESSOR'] is True)):
+ variant_data['CLINGEN_VICC_OM2'] = True
+
+ ## check if variant is silent (synonymous|splice) and outside critical splice region
+ if "INTRON_POSITION" in variant_data.keys() and \
+ "EXON_POSITION" in variant_data.keys() and \
+ "DBNSFP_SPLICE_SITE_RF" in variant_data.keys() and \
+ "Consequence" in variant_data.keys():
+
+ if (int(variant_data['INTRON_POSITION']) < 0 and int(variant_data['INTRON_POSITION']) < -3 or \
+ int(variant_data['INTRON_POSITION']) > 0 and int(variant_data['INTRON_POSITION']) > 6 or \
+ int(variant_data['EXON_POSITION']) < 0 and int(variant_data['EXON_POSITION']) < -2 or \
+ int(variant_data['EXON_POSITION']) > 0 and int(variant_data['EXON_POSITION']) > 1) and \
+ variant_data['DBNSFP_SPLICE_SITE_RF'] != "AS" and \
+ re.match(r'^(synonymous_variant|splice_region_variant)', variant_data['Consequence']):
+ variant_data['CLINGEN_VICC_SBP2'] = True
+
+
+ og_score_data = {}
+ og_score_data['code'] = \
+ ['CLINGEN_VICC_SBVS1','CLINGEN_VICC_SBS1',
+ 'CLINGEN_VICC_SBP1','CLINGEN_VICC_SBP2',
+ 'CLINGEN_VICC_OVS1','CLINGEN_VICC_OS3',
+ 'CLINGEN_VICC_OM2','CLINGEN_VICC_OM4',
+ 'CLINGEN_VICC_OP1','CLINGEN_VICC_OP3',
+ 'CLINGEN_VICC_OP4']
+ og_score_data['category'] = \
+ ['clinpop','clinpop',
+ 'funccomp','funcvar',
+ 'funcvar','funcvar',
+ 'funcvar','funcvar',
+ 'funccomp','funcvar',
+ 'clinpop']
+ og_score_data['pole'] = \
+ ['B','B','B','B',
+ 'P','P','P','P',
+ 'P','P','P']
+
+ og_score_data['description'] = \
+ ['Very high MAF (> 0.05 in gnomAD - any five major continental pops)',
+ 'High MAF (> 0.01 in gnomAD - any five major continental pops)',
+ 'Multiple lines (>=7) of computational evidence support a benign effect on the gene or gene product - from dbNSFP',
+ 'Silent and intronic changes outside of the consensus splice site',
+ 'Null variant - predicted as LoF by LOFTEE - in bona fide tumor suppressor gene',
+ 'Located in a mutation hotspot (cancerhotspots.org). >= 50 samples with a variant at AA position, >= 10 samples with same AA change',
+ 'Protein length changes from in-frame dels/ins in known oncogene/tumor suppressor genes or stop-loss variants in a tumor suppressor gene',
+ 'Located in a mutation hotspot (cancerhotspots.org). < 50 samples with a variant at AA position, >= 10 samples with same AA change.',
+ 'Multiple lines (>=7) of computational evidence support a damaging effect on the gene or gene product - from dbNSFP',
+ 'Located in a mutation hotspot (cancerhotspots.org). < 10 samples with the same amino acid change.',
+ 'Absent from controls (gnomAD) / very low MAF ( < 0.0001 in all five major subpopulations)']
+
+ og_score_data['score'] = \
+ [-8, -4, -1, -1, 8, 4, 2, 2, 1, 1, 1]
+
+ i = 0
+ oncogenicity_scores = {}
+ for code in og_score_data['code']:
+ oncogenicity_scores[code] = {}
+ for e in ['category','pole','description','score']:
+ oncogenicity_scores[code][e] = og_score_data[e][i]
+ if e == 'score':
+ oncogenicity_scores[code][e] = float(og_score_data[e][i])
+ i = i + 1
+
+ variant_data['ONCOGENICITY_CLASSIFICATION'] = "VUS"
+ variant_data["ONCOGENICITY_CLASSIFICATION_DOC"] = "."
+ variant_data["ONCOGENICITY_CLASSIFICATION_CODE"] = "."
+ variant_data["ONCOGENICITY_SCORE"] = "."
+ onc_score_pathogenic = 0
+ onc_score_benign = 0
+
+ for code in clingen_vicc_ev_codes:
+
+ if code in variant_data.keys():
+ if variant_data[code] is True:
+ score = oncogenicity_scores[code]['score']
+ pole = oncogenicity_scores[code]['pole']
+ if pole == "P":
+ onc_score_pathogenic = onc_score_pathogenic + score
+ if pole == "B":
+ onc_score_benign = onc_score_benign + score
+
+ for e in ['ONCOGENICITY_CLASSIFICATION_DOC',
+ 'ONCOGENICITY_CLASSIFICATION_CODE']:
+
+ if variant_data[e] == ".":
+ if e == 'ONCOGENICITY_CLASSIFICATION_DOC':
+ variant_data[e] = oncogenicity_scores[code]['description']
+ else:
+ variant_data[e] = code
+ else:
+ if e == 'ONCOGENICITY_CLASSIFICATION_DOC':
+ variant_data[e] = f'{variant_data[e]}|{oncogenicity_scores[code]["description"]}'
+ else:
+ variant_data[e] = f'{variant_data[e]}|{code}'
+
+ lb_upper_limit = -1
+ lb_lower_limit = -6
+ b_upper_limit = -7
+ #vus_lower_limit = 0
+ #vus_upper_limit = 4
+ lo_lower_limit = 6
+ lo_upper_limit = 9
+ o_lower_limit = 10
+
+ variant_data['ONCOGENICITY_SCORE'] = onc_score_benign + onc_score_pathogenic
+ if variant_data['ONCOGENICITY_SCORE'] <= lb_upper_limit and \
+ variant_data['ONCOGENICITY_SCORE'] >= lb_lower_limit:
+ variant_data['ONCOGENICITY_CLASSIFICATION'] = "Likely_Benign"
+ if variant_data['ONCOGENICITY_SCORE'] >= lo_lower_limit and \
+ variant_data['ONCOGENICITY_SCORE'] <= lo_upper_limit:
+ variant_data['ONCOGENICITY_CLASSIFICATION'] = "Likely_Oncogenic"
+ if variant_data['ONCOGENICITY_SCORE'] <= b_upper_limit:
+ variant_data['ONCOGENICITY_CLASSIFICATION'] = "Benign"
+ if variant_data['ONCOGENICITY_SCORE'] >= o_lower_limit:
+ variant_data['ONCOGENICITY_CLASSIFICATION'] = "Oncogenic"
+
+ for e in ['ONCOGENICITY_SCORE',
+ 'ONCOGENICITY_CLASSIFICATION',
+ 'ONCOGENICITY_CLASSIFICATION_CODE']:
+ rec.INFO[e] = variant_data[e]
+
+    return rec
\ No newline at end of file
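Editor's note: the threshold logic in the hunk above can be summarized in a small standalone sketch. This is an illustrative rewrite, not gvanno's own function; the `classify_oncogenicity` name is hypothetical, but the cutoffs mirror the limits in the code above (Benign &lt;= -7, Likely_Benign -6..-1, VUS 0..5, Likely_Oncogenic 6..9, Oncogenic &gt;= 10, per Horak et al. 2022).

```python
# Illustrative sketch (hypothetical helper, not part of gvanno): map a summed
# ClinGen/VICC evidence score to an oncogenicity classification, using the
# same numeric limits as the hunk above.
def classify_oncogenicity(score: float) -> str:
    if score <= -7:                # b_upper_limit
        return "Benign"
    if -6 <= score <= -1:          # lb_lower_limit .. lb_upper_limit
        return "Likely_Benign"
    if 6 <= score <= 9:            # lo_lower_limit .. lo_upper_limit
        return "Likely_Oncogenic"
    if score >= 10:                # o_lower_limit
        return "Oncogenic"
    return "VUS"                   # default: scores 0..5 (and the -7..-6 gap)
```

Note that, as in the original code, scores strictly between the benign and oncogenic bands fall through to the default "VUS".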
diff --git a/src/gvanno/vcf2tsv.py b/src/gvanno/vcf2tsv.py
deleted file mode 100755
index 7d5b544..0000000
--- a/src/gvanno/vcf2tsv.py
+++ /dev/null
@@ -1,302 +0,0 @@
-#!/usr/bin/env python
-
-import argparse
-from cyvcf2 import VCF, Writer
-import numpy as np
-import re
-import math
-import subprocess
-
-version = '0.3.7.1'
-
-
-def __main__():
- parser = argparse.ArgumentParser(description='Convert a VCF file with genomic variants to a file with tab-separated values (TSV). One entry (TSV line) per sample genotype', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
- parser.add_argument('query_vcf', help='Bgzipped input VCF file with query variants (SNVs/InDels)')
- parser.add_argument('out_tsv', help='Output TSV file with one line pr non-rejected sample genotype (Variant, genotype and annotation data as tab-separated values)')
- parser.add_argument("--skip_info_data",action = "store_true", help="Skip printing of data in INFO column")
- parser.add_argument("--skip_genotype_data", action="store_true", help="Skip printing of genotype_data (FORMAT columns)")
- parser.add_argument("--keep_rejected_calls", action="store_true", help="Print data for rejected calls")
- parser.add_argument("--print_data_type_header", action="store_true", help="Print a header line with data types of VCF annotations")
- parser.add_argument("--compress", action="store_true", help="Compress TSV file with gzip")
- parser.add_argument('--version', action='version', version='%(prog)s ' + str(version))
- args = parser.parse_args()
-
- vcf2tsv(args.query_vcf, args.out_tsv, args.skip_info_data, args.skip_genotype_data, args.keep_rejected_calls, args.compress, args.print_data_type_header)
-
-
-def check_subprocess(command):
- try:
- output = subprocess.check_output(str(command), stderr=subprocess.STDOUT, shell=True)
- if len(output) > 0:
- print (str(output.decode()).rstrip())
- except subprocess.CalledProcessError as e:
- print (e.output)
- exit(0)
-
-
-def vcf2tsv(query_vcf, out_tsv, skip_info_data, skip_genotype_data, keep_rejected_calls, compress, print_data_type_header):
-
- vcf = VCF(query_vcf, gts012 = True)
- out = open(out_tsv,'w')
-
- fixed_columns_header = ['CHROM','POS','ID','REF','ALT','QUAL','FILTER']
- fixed_columns_header_type = ['String','Integer','String','String','String','Float','String']
- samples = vcf.samples
- info_columns_header = []
- format_columns_header = []
- sample_columns_header = []
- column_types = {}
- gt_present_header = 0
-
- if len(samples) > 0:
- sample_columns_header.append('VCF_SAMPLE_ID')
-
- for e in vcf.header_iter():
- header_element = e.info()
- if 'ID' in header_element.keys() and 'HeaderType' in header_element.keys():
- if header_element['HeaderType'] == 'INFO' or header_element['HeaderType'] == 'FORMAT':
- column_types[header_element['ID']] = header_element['Type']
- if header_element['HeaderType'] == 'INFO':
- if skip_info_data is False:
- info_columns_header.append(header_element['ID'])
- if header_element['HeaderType'] == 'FORMAT':
- if len(sample_columns_header) > 0 and skip_genotype_data is False:
- if header_element['ID'] != 'GT':
- format_columns_header.append(header_element['ID'])
- else:
- gt_present_header = 1
-
- #header_line = '\t'.join(fixed_columns_header)
- header_tags = fixed_columns_header
- if skip_info_data is False:
- #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sorted(info_columns_header))
- header_tags = fixed_columns_header + sorted(info_columns_header)
- if len(sample_columns_header) > 0:
- if skip_genotype_data is False:
- #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sorted(info_columns_header)) + '\t' + '\t'.join(sample_columns_header) + '\t' + '\t'.join(sorted(format_columns_header)) + '\tGT'
- header_tags = fixed_columns_header + sorted(info_columns_header) + sample_columns_header + sorted(format_columns_header) + ['GT']
- else:
- #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sorted(info_columns_header))
- header_tags = fixed_columns_header + sorted(info_columns_header)
- else:
- if len(sample_columns_header) > 0:
- if skip_genotype_data is False:
- #header_line = '\t'.join(fixed_columns_header) + '\t' + '\t'.join(sample_columns_header) + '\t' + '\t'.join(sorted(format_columns_header)) + '\tGT'
- header_tags = fixed_columns_header + sample_columns_header + sorted(format_columns_header) + ['GT']
- else:
- #header_line = '\t'.join(fixed_columns_header)
- header_tags = fixed_columns_header
- header_line = '\t'.join(header_tags)
-
- out.write('#https://github.com/sigven/vcf2tsv version=' + str(version) + '\n')
- if print_data_type_header is True:
- #header_tags = header_line.rstrip().split('\t')
- header_types = []
- for h in header_tags:
- if h in column_types:
- header_types.append(str(column_types[h]))
- #header_line_type = '\t'.join(fixed_columns_header_type) + '\t' + '\t'.join(header_types)
- header_line_type = '\t'.join(fixed_columns_header_type + header_types)
- out.write('#' + str(header_line_type) + '\n')
- out.write(str(header_line) + '\n')
- else:
- out.write(str(header_line) + '\n')
-
- for rec in vcf:
- rec_id = '.'
- rec_qual = '.'
- rec_filter = '.'
- alt = ",".join(str(n) for n in rec.ALT)
- if not rec.ID is None:
- rec_id = str(rec.ID)
- if not rec.QUAL is None:
- rec_qual = str("{0:.2f}".format(rec.QUAL))
- rec_filter = str(rec.FILTER)
- if rec.FILTER is None:
- rec_filter = 'PASS'
-
- pos = int(rec.start) + 1
- fixed_fields_string = str(rec.CHROM) + '\t' + str(pos) + '\t' + str(rec_id) + '\t' + str(rec.REF) + '\t' + str(alt) + '\t' + str(rec_qual) + '\t' + str(rec_filter)
-
-
- if not 'PASS' in rec_filter and not keep_rejected_calls:
- continue
-
- variant_info = rec.INFO
- vcf_info_data = []
- if skip_info_data is False:
- for info_field in sorted(info_columns_header):
- if column_types[info_field] == 'Flag':
- if variant_info.get(info_field) is None:
- vcf_info_data.append('False')
- else:
- vcf_info_data.append('True')
- elif column_types[info_field] == 'Float' or column_types[info_field] == 'Integer' or column_types[info_field] == 'String' or column_types[info_field] == 'Character':
- if type(variant_info.get(info_field)) is list or type(variant_info.get(info_field)) is tuple:
- vcf_info_data.append(",".join(str(n) for n in variant_info.get(info_field)))
- else:
- if variant_info.get(info_field) is None:
- vcf_info_data.append('.')
- else:
- if column_types[info_field] == 'Float':
- if not isinstance(variant_info.get(info_field),float):
- print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'Float\', yet parsed as other type:' + str(type(variant_info.get(info_field))))
- if not ',' in str(alt):
- print('Warning: Multiple values in INFO tag for single ALT allele (VCF multiallelic sites not decomposed properly?):' + str(fixed_fields_string) + '\t' + str(info_field) + '=' + str(variant_info.get(info_field)))
- vcf_info_data.append('.')
- else:
- val = str("{0:.7f}".format(variant_info.get(info_field)))
- vcf_info_data.append(val)
- else:
- if column_types[info_field] == 'String' or column_types[info_field] == 'Character':
- if isinstance(variant_info.get(info_field),str):
- #print(str(info_field) + '\t' + variant_info.get(info_field).encode('ascii','ignore').rstrip().decode('ascii'))
- vcf_info_data.append(variant_info.get(info_field).encode('ascii','ignore').decode('ascii'))
- else:
- vcf_info_data.append('.')
- if column_types[info_field] == 'String':
- print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'String\', yet parsed as other type:' + str(type(variant_info.get(info_field))))
- if column_types[info_field] == 'Character':
- print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'Character\', yet parsed as other type:' + str(type(variant_info.get(info_field))))
- else:
- if isinstance(variant_info.get(info_field),int):
- vcf_info_data.append(str(variant_info.get(info_field)))
- else:
- print('vcf2tsv.py WARNING:\tINFO tag ' + str(info_field) + ' is defined in the VCF header as type \'Integer\', yet parsed as other type:' + str(type(variant_info.get(info_field))))
- vcf_info_data.append(re.sub(r'\(|\)', '', variant_info.get(info_field).encode('ascii','ignore').decode('ascii')))
-
- #print(str(vcf_info_data))
- #dictionary, with sample names as keys, values being genotype data (dictionary with format tags as keys)
- vcf_sample_genotype_data = {}
-
- if len(samples) > 0 and skip_genotype_data is False:
- gt_cyvcf = rec.gt_types
- i = 0
- while i < len(samples):
- vcf_sample_genotype_data[samples[i]] = {}
- gt = './.'
- if gt_present_header == 1:
- if gt_cyvcf[i] == 0:
- gt = '0/0'
- if gt_cyvcf[i] == 1:
- gt = '0/1'
- if gt_cyvcf[i] == 2:
- gt = '1/1'
- vcf_sample_genotype_data[samples[i]]['GT'] = gt
- i = i + 1
-
- for format_tag in sorted(format_columns_header):
- if len(samples) > 0 and skip_genotype_data is False:
- sample_dat = rec.format(format_tag)
- if sample_dat is None:
- k = 0
- while k < len(samples):
- if samples[k] in vcf_sample_genotype_data:
- vcf_sample_genotype_data[samples[k]][format_tag] = '.'
- k = k + 1
- continue
- dim = sample_dat.shape
- j = 0
- ## sample-wise
- while j < dim[0]:
- if sample_dat[j].size > 1:
-
- d = ','.join(str(e) for e in np.ndarray.tolist(sample_dat[j]))
- if column_types[format_tag] == 'Float':
- d = ','.join(str(round(e, 4)) for e in np.ndarray.tolist(sample_dat[j]))
- ## undefined/missing value
- if '-2147483648' in d:
- d = d.replace('-2147483648', '.')
- if 'nan' in d.casefold():
- d = d.casefold().replace('nan', '.')
- if samples[j] in vcf_sample_genotype_data:
- vcf_sample_genotype_data[samples[j]][format_tag] = d
- else:
- d = '.'
- if column_types[format_tag] == 'Float':
- if not math.isnan(sample_dat[j]):
- d = str(sample_dat[j][0])
- if column_types[format_tag] == 'String':
- d = str(sample_dat[j])
- if column_types[format_tag] == 'Integer':
- d = str(sample_dat[j][0])
- ## undefined/missing value
- if d == '-2147483648' or d.casefold() == 'nan':
- d = '.'
- if samples[j] in vcf_sample_genotype_data:
- vcf_sample_genotype_data[samples[j]][format_tag] = d
- j = j + 1
-
- tsv_elements = []
- tsv_elements.append(fixed_fields_string)
- if skip_info_data is False:
- if skip_genotype_data is False:
- if len(sample_columns_header) > 0:
- tsv_elements.append("\t".join(str(n) for n in vcf_info_data))
- ## one line per sample variant
- for s in sorted(vcf_sample_genotype_data.keys()):
- sample = s
- line_elements = []
- line_elements.extend(tsv_elements)
- line_elements.append(sample)
- gt_tag = '.'
- for tag in sorted(vcf_sample_genotype_data[sample].keys()):
- if tag != 'GT':
- line_elements.append(vcf_sample_genotype_data[sample][tag].encode('ascii','ignore').decode('ascii'))
- else:
- gt_tag = vcf_sample_genotype_data[sample][tag].encode('ascii','ignore').decode('ascii')
- line_elements.append(gt_tag)
- if gt_tag == './.' or gt_tag == '.':
- if keep_rejected_calls:
- out.write('\t'.join(line_elements) + '\n')
- else:
- out.write("\t".join(str(n) for n in line_elements) + '\n')
-
- else:
- tsv_elements.append("\t".join(str(n) for n in vcf_info_data))
- line_elements = []
- line_elements.extend(tsv_elements)
- out.write('\t'.join(line_elements) + '\n')
- else:
- tsv_elements.append("\t".join(str(n) for n in vcf_info_data))
- line_elements = []
- line_elements.extend(tsv_elements)
- out.write('\t'.join(line_elements) + '\n')
- else:
- if skip_genotype_data is False:
- if len(sample_columns_header) > 0:
- ## one line per sample variant
- for s in sorted(vcf_sample_genotype_data.keys()):
- sample = s
- line_elements = []
- line_elements.extend(tsv_elements)
- line_elements.append(sample)
- gt_tag = '.'
- for tag in sorted(vcf_sample_genotype_data[sample].keys()):
- if tag != 'GT':
- line_elements.append(vcf_sample_genotype_data[sample][tag])
- else:
- gt_tag = vcf_sample_genotype_data[sample][tag]
- line_elements.append(gt_tag)
- if gt_tag == './.' or gt_tag == '.':
- if keep_rejected_calls:
- out.write('\t'.join(line_elements) + '\n')
- else:
- out.write('\t'.join(line_elements) + '\n')
- else:
- line_elements = []
- line_elements.extend(tsv_elements)
- line_elements = tsv_elements
- out.write('\t'.join(line_elements) + '\n')
-
- out.close()
-
- if compress is True:
- command = 'gzip -f ' + str(out_tsv)
- check_subprocess(command)
-
-if __name__=="__main__": __main__()
-
-
-
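Editor's note: the deleted `vcf2tsv.py` above flattened VCF records (fixed columns plus INFO annotations) into one TSV line per record/genotype, using `cyvcf2`. The core idea can be sketched with the standard library alone; the function below is a hypothetical, simplified illustration (no FORMAT/genotype handling, no header-declared types), not a replacement for the removed script.

```python
# Minimal, dependency-free sketch of the VCF -> TSV flattening idea from the
# deleted vcf2tsv.py: emit the seven fixed VCF columns plus one column per
# INFO tag seen in the records. Flag-type INFO tags (key with no '=') are
# rendered as 'True'; tags absent from a record are rendered as '.'.
def vcf_records_to_tsv(vcf_lines):
    info_keys = []
    records = []
    for line in vcf_lines:
        if line.startswith('#'):        # skip meta-lines and the column header
            continue
        chrom, pos, vid, ref, alt, qual, vfilter, info = \
            line.rstrip('\n').split('\t')[:8]
        info_map = {}
        for item in info.split(';'):
            if not item or item == '.':
                continue
            key, _, value = item.partition('=')
            info_map[key] = value if value else 'True'   # Flag tag
            if key not in info_keys:
                info_keys.append(key)
        records.append((chrom, pos, vid, ref, alt, qual, vfilter, info_map))
    header = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER'] \
        + sorted(info_keys)
    rows = ['\t'.join(header)]
    for chrom, pos, vid, ref, alt, qual, vfilter, info_map in records:
        fixed = [chrom, pos, vid, ref, alt, qual, vfilter]
        rows.append('\t'.join(fixed + [info_map.get(k, '.')
                                       for k in sorted(info_keys)]))
    return rows
```

The deleted script additionally handled per-sample FORMAT fields (one output line per sample genotype), header-declared column types, and rejected-call filtering, which this sketch deliberately omits.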
diff --git a/src/gvanno_logo.png b/src/gvanno_logo.png
new file mode 100644
index 0000000..4f6899a
Binary files /dev/null and b/src/gvanno_logo.png differ