From 49da49f2a5233ac851cb9ec9365877d4a597dca7 Mon Sep 17 00:00:00 2001
From: danilotat <danilotatoni@gmail.com>
Date: Tue, 3 Sep 2024 16:20:50 +0200
Subject: [PATCH] Structure for automatic/manual configuration

---
 docs/index.md     | 10 ++-----
 docs/resources.md | 73 +++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 72 insertions(+), 11 deletions(-)
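
The manual setup added to docs/resources.md says that most population databases use a different chromosome naming and must be converted. A minimal sketch of that step follows, assuming the source VCF uses UCSC-style names (`chr1`, `chrM`) and the target is Ensembl-style (`1`, `MT`). The file names `chr_map.txt`, `input.vcf.gz` and `output.ensembl.vcf.gz` are hypothetical, and the map built here is only a stand-in for the conversion table the docs mention: it covers the primary chromosomes and nothing else.

```bash
# Build a two-column map (old name, new name), one contig per line.
# Assumption: UCSC-style input names, Ensembl-style output names.
map=chr_map.txt
: > "$map"
for c in $(seq 1 22) X Y; do
    printf 'chr%s\t%s\n' "$c" "$c" >> "$map"
done
printf 'chrM\tMT\n' >> "$map"

# Rename contigs and re-index. Runs only if bcftools is available and the
# (hypothetical) input file exists; contigs missing from the map keep
# their original names.
if command -v bcftools >/dev/null 2>&1 && [ -f input.vcf.gz ]; then
    bcftools annotate --rename-chrs "$map" -Oz -o output.ensembl.vcf.gz input.vcf.gz
    tabix -p vcf output.ensembl.vcf.gz
fi
```

Keeping the map in a plain text file lets the same table be reused for every resource that needs renaming. Resources that already use Ensembl-style names would not need this step.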

diff --git a/docs/index.md b/docs/index.md
index 4e596c6..22914b7 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -11,16 +11,10 @@ To start, clone the repo using
 git clone https://github.com/ctglab/ENEO.git
 ```
 
-To limit the size of the repository, test files are not provided directly while cloning but downloaded on-fly for CI testing. For proceeding using a *real* sample you could download tumor RNA-seq data from a patient of the NCI surgery branch from Steven Rosenberg group
-
-``` bash
-wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR969/008/SRR9697628/SRR9697628_1.fastq.gz -O test_data/SRR9697628_1.fastq.gz && \
-wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR969/008/SRR9697628/SRR9697628_2.fastq.gz -O test_data/SRR9697628_2.fastq.gz
-```
-
-Then execute the pipeline using 
+Before running the pipeline, make sure [snakemake](https://snakemake.readthedocs.io/en/stable/) and [singularity](https://docs.sylabs.io/guides/3.1/user-guide/index.html) are installed. Then run it with:
 
 ```
 snakemake --use-singularity --use-conda --cores 4
 ```
+
 If you spot any issue, please report in the github issue section https://github.com/ctglab/ENEO/issues
diff --git a/docs/resources.md b/docs/resources.md
index 5203a5e..85ae2f1 100644
--- a/docs/resources.md
+++ b/docs/resources.md
@@ -1,7 +1,74 @@
 # Setup resources
 
-ENEO heavily depends on public genetic databases for germline probability estimation. It's required to download data from multiple resources and convert them in order to have concordant annotations.
+ENEO relies heavily on public genetic databases for germline probability estimation. You need to download data from multiple resources and convert them so that they share a concordant annotation (i.e. Ensembl).
+
+## Automatic setup
+Work in progress.
+
+![xkcd comic on automation](https://imgs.xkcd.com/comics/automation.png)
+
+## Manual setup
+
+Preparing the input files requires an environment with the following tools installed:
+
+- bedtools
+- bcftools
+- tabix
+- gatk4
+
+The easiest way is to create a single conda environment with:
+
+```
+conda create --name eneo_setup -c bioconda -c conda-forge bedtools bcftools tabix gatk4
+```
+
+
+### Genome and Transcriptome files
+
+Because ENEO uses the Ensembl annotation, the genome and transcriptome files can be downloaded from the Ensembl FTP site.
+
+**Genome**:
+```
+wget https://ftp.ensembl.org/pub/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
+gatk CreateSequenceDictionary -R Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -O Homo_sapiens.GRCh38.dna.primary_assembly.fa.dict
+```
+**Transcriptome**:
+```
+wget https://ftp.ensembl.org/pub/current/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
+```
+
+### Genetic population resources
+
+ENEO uses multiple databases to infer the germline likelihood of candidate variants. Most of them use a different chromosome naming convention, so you need to convert them. A conversion table is available at
+
+#### dbSNP
+
+```bash
+wget -c https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All_20180418.vcf.gz
+wget -c https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All_20180418.vcf.gz.tbi
+```
+
+#### 1000G
+
+```bash
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
+```
+#### Known indels
+
+```bash
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
+```
+
+#### gnomAD
+
+```bash
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/somatic-hg38/af-only-gnomad.hg38.vcf.gz
+wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/somatic-hg38/af-only-gnomad.hg38.vcf.gz.tbi
+```
+
+
+
 
-We're working on a single-step setup script for multiple environments
 
-![alt text](https://www.pajiba.com/assets_c/2024/07/reddit-worst-work-mistake-own-header-thumb-700x401-264118.png)
\ No newline at end of file