Structure for automatic/manual configuration
danilotat committed Sep 3, 2024
1 parent 1780699 commit 49da49f
Showing 2 changed files with 72 additions and 11 deletions.
10 changes: 2 additions & 8 deletions docs/index.md
@@ -11,16 +11,10 @@ To start, clone the repo using
``` bash
git clone https://github.com/ctglab/ENEO.git
```

To limit the size of the repository, test files are not included in the clone; they are downloaded on the fly for CI testing. To proceed with a *real* sample, you can download tumor RNA-seq data from a patient of the NCI Surgery Branch (Steven Rosenberg's group):

``` bash
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR969/008/SRR9697628/SRR9697628_1.fastq.gz -O test_data/SRR9697628_1.fastq.gz && \
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR969/008/SRR9697628/SRR9697628_2.fastq.gz -O test_data/SRR9697628_2.fastq.gz
```

To execute the pipeline, make sure that [snakemake](https://snakemake.readthedocs.io/en/stable/) and [singularity](https://docs.sylabs.io/guides/3.1/user-guide/index.html) are installed, then run

```
snakemake --use-singularity --use-conda --cores 4
```
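
Before launching, it can help to confirm that both tools are actually on the `PATH`. The following is a minimal sketch, not part of ENEO itself; the function name `missing_tools` is an illustrative assumption.

```bash
# Print the name of every requested tool that cannot be found on PATH.
# Prints nothing when all tools are available.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example (commented out so that sourcing this file has no side effects):
#   missing=$(missing_tools snakemake singularity)
#   [ -z "$missing" ] || { echo "Missing tools: $missing" >&2; exit 1; }
```

The same helper can be reused for any other list of command names.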

If you spot any issue, please report it in the GitHub issues section: https://github.com/ctglab/ENEO/issues
73 changes: 70 additions & 3 deletions docs/resources.md
@@ -1,7 +1,74 @@
# Setup resources

ENEO depends heavily on public genetic databases for germline probability estimation. Data must be downloaded from multiple resources and converted so that all of them share concordant annotations (i.e. Ensembl).

## Automatic setup
Work in progress.

![](https://imgs.xkcd.com/comics/automation.png)

## Manual setup

Preparing the input files requires an environment with the following tools installed:

- bedtools
- bcftools
- tabix
- gatk4

The easiest way is to create a single conda environment:

```
conda create --name eneo_setup -c bioconda -c conda-forge bedtools bcftools tabix gatk4
```


### Genome and Transcriptome files

Because ENEO uses the Ensembl annotation, these files can be downloaded from the Ensembl FTP site.

**Genome**:
```
wget https://ftp.ensembl.org/pub/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gatk CreateSequenceDictionary -R Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -O Homo_sapiens.GRCh38.dna.primary_assembly.fa.dict
```
**Transcriptome**:
```
wget https://ftp.ensembl.org/pub/current/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
```
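
Large downloads can end up truncated without `wget` making it obvious. As a hedged sketch (the helper name `check_gz` is an assumption, not something ENEO ships), the compressed files can be checked with `gzip -t`, which tests the integrity of a gzip stream without writing the decompressed data:

```bash
# Test each given .gz file and report whether it decompresses cleanly.
# Returns 1 if at least one file is broken, 0 otherwise.
check_gz() {
    local f status=0
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            printf 'ok      %s\n' "$f"
        else
            printf 'BROKEN  %s\n' "$f"
            status=1
        fi
    done
    return "$status"
}

# Example (commented out so that sourcing this file has no side effects):
#   check_gz Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
#            Homo_sapiens.GRCh38.cdna.all.fa.gz
```

Because bgzip output is gzip-compatible, the same check should also work on the `.vcf.gz` resources.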

### Genetic population resources

ENEO uses multiple databases to infer the germline likelihood of candidate variants. Most of them use different chromosome naming conventions, so they need to be converted. A conversion table is available at

#### dbSNP

```bash
wget -c https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All_20180418.vcf.gz
wget -c https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All_20180418.vcf.gz.tbi
```

#### 1000G

```bash
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
```
#### Known indels

```bash
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
```

#### gnomAD

```bash
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/somatic-hg38/af-only-gnomad.hg38.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/somatic-hg38/af-only-gnomad.hg38.vcf.gz.tbi
```
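
The chromosome-name conversion mentioned under "Genetic population resources" could be sketched as follows. This assumes that the Broad-hosted files use UCSC-style names (`chr1`, `chrM`) while the Ensembl reference uses `1` and `MT`, and it covers only the primary chromosomes. The function names and the generated table are illustrative stand-ins, not the project's own conversion table. The renaming relies on `bcftools annotate --rename-chrs`, which takes a two-column file of old and new names.

```bash
# Print an "old new" table mapping UCSC-style contig names to Ensembl-style
# names. Assumption: only chr1-22, chrX, chrY and chrM need to be renamed.
make_chr_map() {
    local c
    for c in $(seq 1 22) X Y; do
        printf 'chr%s %s\n' "$c" "$c"
    done
    printf 'chrM MT\n'
}

# Rename the contigs of a bgzipped VCF and index the result.
# Usage: rename_chrs <map.txt> <input.vcf.gz> <output.vcf.gz>
rename_chrs() {
    bcftools annotate --rename-chrs "$1" -O z -o "$3" "$2"
    tabix -p vcf "$3"
}

# Example (commented out; the output file name is hypothetical):
#   make_chr_map > chr_map.txt
#   rename_chrs chr_map.txt af-only-gnomad.hg38.vcf.gz af-only-gnomad.ensembl.vcf.gz
```

Records on contigs that are absent from the table keep their original names, so alternate and decoy contigs would still need separate handling.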




We're working on a single-step setup script for multiple environments.

![alt text](https://www.pajiba.com/assets_c/2024/07/reddit-worst-work-mistake-own-header-thumb-700x401-264118.png)
