1.4.4 release - data updates/VEP update
sigven committed Dec 21, 2021
1 parent 21da28b commit 904160d
Showing 8 changed files with 150 additions and 158 deletions.
48 changes: 26 additions & 22 deletions README.md
Expand Up @@ -15,6 +15,9 @@ The germline variant annotator (*gvanno*) is a software package intended for ana
*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__.

### News
* December 21st 2021 - **1.4.4 release**
* Data updates: ClinVar, GWAS catalog, CancerMine, UniProt KB, Open Targets Platform
* Software updates: VEP (v105)
* August 25th 2021 - **1.4.3 release**
* Data updates: ClinVar, GWAS catalog, CancerMine, UniProt, Open Targets Platform
* May 24th 2021 - **1.4.2 release**
Expand All @@ -27,19 +30,19 @@ The germline variant annotator (*gvanno*) is a software package intended for ana
* _REGULATORY_ANNOTATION_ : A comma-separated list of regulatory annotations from VEP's `--regulatory` option, i.e. __TF_binding_site__, overlap with __enhancer/promoter/open_chromatin__, __CTCF_binding_site__ etc. Included when the `--vep_regulatory` option is turned on in gvanno.
* _NCER_PERCENTILE_: A genome-wide percentile rank score from the ncER algorithm (**n**on-**c**oding **E**ssential **R**egulation), [Wells et al., Nat Comm. (2019)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6868241/).

### Annotation resources
### Annotation resources (v1.4.4)

* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v104 (GENCODE v38/v19 as the gene reference dataset)
* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v105 (GENCODE v39/v19 as the gene reference dataset)
* [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021)
* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP
* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (August 2021)
* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 38, August 2021)
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2021_06, June 2021)
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2021_03, June 2021)
* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v34.0, March 2021)
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (August 16th 2021)
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (December 2021)
* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 41, December 2021)
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2021_11, November 2021)
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2021_04, November 2021)
* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v35.0, November 2021)
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (December 7th 2021)


### Getting started
Expand All @@ -53,14 +56,14 @@ An installation of Python (version >=_3.6_) is required to run *gvanno*. Check t
1. [Install the Docker engine](https://docs.docker.com/engine/installation/) on your preferred platform
- installing [Docker on Linux](https://docs.docker.com/engine/installation/linux/)
- installing [Docker on Mac OS](https://docs.docker.com/engine/installation/mac/)
- NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with PCGR (an example being [mounting of data volumes](https://github.com/docker/toolbox/issues/607))
- NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with gvanno (an example being [mounting of data volumes](https://github.com/docker/toolbox/issues/607))
2. Test that Docker is running, e.g. by typing `docker ps` or `docker images` in the terminal window
3. Adjust the computing resources dedicated to the Docker, i.e.:
- Memory: minimum 5GB
- CPUs: minimum 4
- [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced)

##### 1.1: Installation of Singularity (optional)
##### 1.1: Installation of Singularity (optional - in dev)

0. **Note: this works for Singularity version 3.0 and higher**.
1. [Install Singularity](https://sylabs.io/docs/)
Expand All @@ -73,17 +76,17 @@ An installation of Python (version >=_3.6_) is required to run *gvanno*. Check t

#### STEP 2: Download *gvanno* and data bundle

1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.4.3) (gvanno run script, v1.4.3)
1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.4.4) (gvanno run script, v1.4.4)
2. Download (preferably using `wget`) and unpack the latest assembly-specific data bundle in the gvanno directory
* [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20210825.tgz) (approx 19Gb)
* [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20210825.tgz) (approx 20Gb)
* [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20211221.tgz) (approx 18Gb)
* [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20211221.tgz) (approx 20Gb)
* Example commands:
* `wget http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20210825.tgz`
* `wget http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20211221.tgz`
* `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`

A _data/_ folder within the _gvanno-1.4.3_ software folder should now have been produced
3. Pull the [gvanno Docker image (1.4.3)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.2Gb):
* `docker pull sigven/gvanno:1.4.3` (gvanno annotation engine)
A _data/_ folder within the _gvanno-1.4.4_ software folder should now have been produced
3. Pull the [gvanno Docker image (1.4.4)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.2Gb):
* `docker pull sigven/gvanno:1.4.4` (gvanno annotation engine)

#### STEP 3: Input preprocessing

Expand Down Expand Up @@ -112,7 +115,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
--query_vcf QUERY_VCF
VCF input file with germline query variants (SNVs/InDels).
--gvanno_dir GVANNO_DIR
Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.4.3
Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.4.4
--output_dir OUTPUT_DIR
Output directory
--genome_assembly {grch37,grch38}
Expand All @@ -124,6 +127,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt

VEP optional arguments:
--vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP).
--vep_gencode_all Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno).
--vep_lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False
--vep_n_forks VEP_N_FORKS
Number of forks for Variant Effect Predictor (VEP) processing, default: 4
Expand All @@ -148,10 +152,10 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt

The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:

python ~/gvanno-1.4.3/gvanno.py
--query_vcf ~/gvanno-1.4.3/examples/example.grch37.vcf.gz
--gvanno_dir ~/gvanno-1.4.3
--output_dir ~/gvanno-1.4.3
python ~/gvanno-1.4.4/gvanno.py
--query_vcf ~/gvanno-1.4.4/examples/example.grch37.vcf.gz
--gvanno_dir ~/gvanno-1.4.4
--output_dir ~/gvanno-1.4.4
--sample_id example
--genome_assembly grch37
--container docker
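The example command above is shown across several lines without shell line continuations. As a minimal sketch, the same invocation can be assembled as an argument list in Python and printed as a single copy-pasteable line. The install location `~/gvanno-1.4.4` follows the README example and is an assumption about your setup; the flags are the ones listed above.

```python
import os
import shlex

# Assumption: gvanno 1.4.4 was unpacked in the home directory, as in the README example.
gvanno_dir = os.path.expanduser("~/gvanno-1.4.4")

# Each flag and its value are separate list items, so no shell quoting is needed.
cmd = [
    "python", os.path.join(gvanno_dir, "gvanno.py"),
    "--query_vcf", os.path.join(gvanno_dir, "examples", "example.grch37.vcf.gz"),
    "--gvanno_dir", gvanno_dir,
    "--output_dir", gvanno_dir,
    "--sample_id", "example",
    "--genome_assembly", "grch37",
    "--container", "docker",
]

# Print the command as one shell-safe line.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Once the Docker image and data bundle are in place, the same list could be passed to `subprocess.run(cmd, check=True)` instead of being printed.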
Expand Down
16 changes: 8 additions & 8 deletions data-raw/RELEASE_NOTES
@@ -1,14 +1,14 @@
##GVANNO_SOFTWARE_VERSION = 1.4.3
##GVANNO_DB_VERSION = 20210825
pfam = v34.0 (March 2021)
##GVANNO_SOFTWARE_VERSION = 1.4.4
##GVANNO_DB_VERSION = 20211221
pfam = v35.0 (November 2021)
ncER = v1.0 (March 2019)
uniprot = release 2021_03
uniprot = release 2021_04
corum = release 3.0 (20180903)
onekg = phase 3 (20130502)
dbsnp = build 154/153
dbnsfp = v4.2 (April 2021)
gnomad = r2.1 (October 2018)
gwas = August 2021 (20210816)
clinvar = August 2021 (20210731)
opentargets = 2021_06
gencode = 38/19
gwas = December 2021 (20211207)
clinvar = December 2021 (20211130)
opentargets = 2021_11
gencode = 39/19
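The release notes above use plain `key = value` lines, with a `##` prefix on the software and database version keys. A minimal sketch of reading that layout into a dictionary follows; the function name `parse_release_notes` is hypothetical and is not part of gvanno, and the sample text is copied from the updated lines in this commit.

```python
def parse_release_notes(text):
    """Parse 'key = value' lines; a leading '##' on the key is stripped."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        versions[key.strip().lstrip("#")] = value.strip()
    return versions


sample = """##GVANNO_SOFTWARE_VERSION = 1.4.4
##GVANNO_DB_VERSION = 20211221
pfam = v35.0 (November 2021)
uniprot = release 2021_04
gencode = 39/19
"""

versions = parse_release_notes(sample)
print(versions["GVANNO_DB_VERSION"])
```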
19 changes: 13 additions & 6 deletions gvanno.py
Expand Up @@ -11,10 +11,10 @@
import platform
from argparse import RawTextHelpFormatter

GVANNO_VERSION = '1.4.3'
DB_VERSION = 'GVANNO_DB_VERSION = 20210825'
VEP_VERSION = '104'
GENCODE_VERSION = '38'
GVANNO_VERSION = '1.4.4'
DB_VERSION = 'GVANNO_DB_VERSION = 20211221'
VEP_VERSION = '105'
GENCODE_VERSION = 'v39'
VEP_ASSEMBLY = "GRCh38"
DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION)

Expand All @@ -38,6 +38,7 @@ def __main__():
optional.add_argument('--no_vcf_validate', action = "store_true",help="Skip validation of input VCF with Ensembl's vcf-validator, default: %(default)s")
optional.add_argument('--docker_uid', dest = 'docker_user_id', help = 'Docker user ID. Default is the host system user ID. If you are experiencing permission errors, try setting this to root (`--docker_uid root`)')
optional_vep.add_argument('--vep_regulatory', action='store_true', help = 'Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP).')
optional_vep.add_argument('--vep_gencode_all', action='store_true', help = 'Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno).')
optional_vep.add_argument('--vep_lof_prediction', action = "store_true", help = "Predict loss-of-function variants with Loftee plugin " + \
"in Variant Effect Predictor (VEP), default: %(default)s")
optional_vep.add_argument('--vep_n_forks', default = 4, help="Number of forks for Variant Effect Predictor (VEP) processing, default: %(default)s")
Expand Down Expand Up @@ -236,7 +237,7 @@ def run_gvanno(arg_dict, host_directories):

global GENCODE_VERSION, VEP_ASSEMBLY
if arg_dict['genome_assembly'] == 'grch37':
GENCODE_VERSION = 'release 19'
GENCODE_VERSION = 'v19'
VEP_ASSEMBLY = 'GRCh37'

logger = getlogger('gvanno-get-OS')
Expand Down Expand Up @@ -330,11 +331,16 @@ def run_gvanno(arg_dict, host_directories):
loftee_dir = '/opt/vep/src/ensembl-vep/modules'
plugins_in_use = "NearestExonJB"
vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --domains --symbol --protein --ccds " + \
"--uniprot --appris --biotype --canonical --gencode_basic --mane --cache --numbers --total_length --allele_number --no_escape " + \
"--uniprot --appris --biotype --canonical --format vcf --mane --cache --numbers --total_length --allele_number --no_escape " + \
"--xref_refseq --plugin NearestExonJB,max_range=50000"
vep_options = "--vcf --quiet --check_ref --flag_pick_allele --pick_order " + str(arg_dict['vep_pick_order']) + \
" --force_overwrite --species homo_sapiens --assembly " + str(VEP_ASSEMBLY) + " --offline --fork " + \
str(arg_dict['vep_n_forks']) + " " + str(vep_flags) + " --dir /usr/local/share/vep/data"

gencode_set_in_use = "GENCODE - all transcripts"
if arg_dict['vep_gencode_all'] == 0:
vep_options = vep_options + " --gencode_basic"
gencode_set_in_use = "GENCODE - basic transcript set (--gencode_basic)"
if arg_dict['vep_skip_intergenic'] == 1:
vep_options = vep_options + " --no_intergenic"
if arg_dict['vep_regulatory'] == 1:
Expand All @@ -356,6 +362,7 @@ def run_gvanno(arg_dict, host_directories):
logger.info("VEP configuration - one primary consequence block pr. alternative allele (--flag_pick_allele)")
logger.info("VEP configuration - transcript pick order: " + str(arg_dict['vep_pick_order']))
logger.info("VEP configuration - transcript pick order: See more at https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_options")
logger.info("VEP configuration - GENCODE set: " + str(gencode_set_in_use))
logger.info("VEP configuration - buffer size: " + str(arg_dict['vep_buffer_size']))
logger.info("VEP configuration - skip intergenic: " + str(arg_dict['vep_skip_intergenic']))
logger.info("VEP configuration - look for overlap with regulatory regions: " + str(arg_dict['vep_regulatory']))
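The change above moves `--gencode_basic` out of the fixed VEP flag string and appends it only when `--vep_gencode_all` is not set. A minimal sketch of that toggle in isolation follows, assuming a hypothetical helper `select_gencode_set` (not a function in gvanno.py) and a shortened option string for illustration.

```python
def select_gencode_set(vep_options, vep_gencode_all):
    """Return the VEP option string and a label for the GENCODE set in use.

    Mirrors the logic in the diff: the basic transcript set is the default,
    and all GENCODE transcripts are considered only when the flag is set.
    """
    gencode_set_in_use = "GENCODE - all transcripts"
    if not vep_gencode_all:
        vep_options = vep_options + " --gencode_basic"
        gencode_set_in_use = "GENCODE - basic transcript set (--gencode_basic)"
    return vep_options, gencode_set_in_use


# Shortened option string, for illustration only.
base_options = "--vcf --quiet --offline"

default_opts, default_label = select_gencode_set(base_options, vep_gencode_all=False)
all_opts, all_label = select_gencode_set(base_options, vep_gencode_all=True)
print(default_opts)
print(all_opts)
```

Because argparse's `store_true` action yields `False` when the flag is absent, the default run keeps the previous behaviour of restricting VEP to the basic transcript set.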