From 49da49f2a5233ac851cb9ec9365877d4a597dca7 Mon Sep 17 00:00:00 2001
From: danilotat
Date: Tue, 3 Sep 2024 16:20:50 +0200
Subject: [PATCH] Structure for automatic/manual configuration

---
 docs/index.md     | 10 ++-----
 docs/resources.md | 73 +++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 72 insertions(+), 11 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 4e596c6..22914b7 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -11,16 +11,10 @@ To start, clone the repo using
 git clone https://github.com/ctglab/ENEO.git
 ```
 
-To limit the size of the repository, test files are not provided directly while cloning but downloaded on-fly for CI testing. For proceeding using a *real* sample you could download tumor RNA-seq data from a patient of the NCI surgery branch from Steven Rosenberg group
-
-``` bash
-wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR969/008/SRR9697628/SRR9697628_1.fastq.gz -O test_data/SRR9697628_1.fastq.gz && \
-wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR969/008/SRR9697628/SRR9697628_2.fastq.gz -O test_data/SRR9697628_2.fastq.gz
-```
-
-Then execute the pipeline using
+To execute the pipeline, be sure to have [snakemake](https://snakemake.readthedocs.io/en/stable/) and [singularity](https://docs.sylabs.io/guides/3.1/user-guide/index.html) installed. Then execute the pipeline using
 
 ```
 snakemake --use-singularity --use-conda --cores 4
 ```
+
 If you spot any issue, please report in the github issue section https://github.com/ctglab/ENEO/issues
diff --git a/docs/resources.md b/docs/resources.md
index 5203a5e..85ae2f1 100644
--- a/docs/resources.md
+++ b/docs/resources.md
@@ -1,7 +1,74 @@
 # Setup resources
 
-ENEO heavily depends on public genetic databases for germline probability estimation. It's required to download data from multiple resources and convert them in order to have concordant annotations.
+ENEO heavily depends on public genetic databases for germline probability estimation. It's required to download data from multiple resources and convert them in order to have concordant annotations (i.e. Ensembl).
+
+## Automatic setup
+On working
+
+![](https://imgs.xkcd.com/comics/automation.png)
+
+## Manual setup
+
+To prepare input files, it's required to have an environment with these tools installed
+
+- bedtools
+- bcftools
+- tabix
+- gatk4
+
+The easiest way is to create a single conda environment as
+
+```
+conda create --name eneo_setup -c bioconda -c conda-forge bedtools bcftools tabix gatk4
+```
+
+
+### Genome and Transcriptome files
+
+Given the use of Ensembl annotation, files could be downloaded from the Ensembl ftp site
+
+**Genome**:
+```
+wget https://ftp.ensembl.org/pub/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
+gatk CreateSequenceDictionary -R Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -O Homo_sapiens.GRCh38.dna.primary_assembly.fa.dict
+```
+**Transcriptome**
+```
+wget https://ftp.ensembl.org/pub/current/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
+```
+
+### Genetic population resources
+
+ENEO uses multiple databases to infer germline likelihood of candidate variants. Most of them uses different chromosome naming, so you need to convert them. A conversion table is available at
+
+#### dbSNP
+
+```bash
+wget -c https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All_20180418.vcf.gz
+wget -c https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All_20180418.vcf.gz.tbi
+```
+
+#### 1000G
+
+```bash
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
+```
+#### known indels
+
+```bash
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
+```
+
+### gnomAD
+
+```bash
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/somatic-hg38/af-only-gnomad.hg38.vcf.gz
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/somatic-hg38/af-only-gnomad.hg38.vcf.gz.tbi
+```
+
+
+
 
-We're working on a single-step setup script for multiple environments
 
-![alt text](https://www.pajiba.com/assets_c/2024/07/reddit-worst-work-mistake-own-header-thumb-700x401-264118.png)
\ No newline at end of file