This repository provides a Snakemake workflow to call CNVs on exome and custom target panel data by integrating multiple tools that use read depth for CNV detection.
The workflow has multiple dependencies, but they're all installed using conda, so you'll need to have Anaconda or conda installed on your computer. We favor Miniconda because it is small and makes it easy to create virtual environments that isolate the installation of tools. You can learn more about managing environments with conda in its user guide.
2. Choose and download a workflow version from the releases page
Even though you could just clone or download this repository, we strongly encourage you to get the latest version from the releases page for better reproducibility. That's because, over time, we change the main branch to improve the workflow and to update the integrated tools when new releases come out.
With the workflow folder as the working directory, create the virtual environment by running:
conda env create --file workflow/envs/environment.yml
Installing all dependencies may take a while. After completion, activate the virtual environment with:
conda activate CNV-detection-pipeline
To check that the installation was successful and that the workflow works on your end, you can do a Snakemake dry run using the -n flag:
snakemake -n
This command builds and prints the jobs that should be executed, but doesn't actually run them. Note that we're currently providing empty mock files to the workflow (resources/mocks). Thus, if you forget the -n flag, you'll encounter an execution error because the files are empty.
To configure this workflow, you need to modify config/config.yaml according to your needs.
There are explanations in the file itself, but here are some considerations as well:
Actual data, such as exomes and reference genomes, are expected to be placed in the resources folder.
However, you don't need to copy or move them there. You can simply create a symbolic link in the resources folder that points to the actual data. For instance, if the exome files are in the directory /home/valengo/data/exomes, we can create a symlink to them in the resources directory with:
ln -s /home/valengo/data/exomes resources/exomes
In case you want to learn more about symbolic links, check out this tutorial on freeCodeCamp.
Add test samples to config/test-samples.tsv and baseline samples to config/baseline-samples.tsv.
Both are tab-delimited files. For each sample, put the sample name in the first column and the respective sex in the second column.
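Before running the workflow, it can help to sanity-check both sample sheets. The following is a minimal sketch, not part of the workflow: the function name read_samples is hypothetical, it assumes the files have no header row, and the sex labels in the example (F, M) are placeholders, since the accepted values are defined by the workflow configuration rather than by this snippet.

```python
import csv
import io


def read_samples(handle):
    """Parse a two-column, tab-delimited sample sheet into (name, sex) pairs.

    Assumption: no header row; column 1 is the sample name, column 2 the sex.
    """
    samples = []
    seen = set()
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or all(not cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, found {len(row)}")
        name, sex = (cell.strip() for cell in row)
        if not name or not sex:
            raise ValueError(f"line {line_no}: empty sample name or sex")
        if name in seen:
            raise ValueError(f"line {line_no}: duplicate sample {name!r}")
        seen.add(name)
        samples.append((name, sex))
    return samples


# Hypothetical content; the sex labels are placeholders.
example = "NA12878\tF\nNA24385\tM\n"
print(read_samples(io.StringIO(example)))
```

To check a real file, pass an open file handle instead, for example open("config/test-samples.tsv", newline="").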
We use wildcards to generalize the jobs, so they work across multiple datasets.
Hopefully, your exome file names (baseline and test) follow some pattern, for example: NA12878, NA24385, NA24631.
To describe that pattern, set the regular expression (regex) parameter named common:sample-regex in the config/config.yaml file.
A regex represents a search pattern, but regexes can be tricky.
Thus, we recommend reading about them on Wikipedia.
In addition, check out the Snakemake docs on wildcards.
To easily test your regex, you can use an online validator, such as regex101.
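A candidate regex can also be tried locally against your sample names. This is a minimal sketch under two assumptions: the pattern NA\d{5} is only an illustration derived from the example names above, and the regex is assumed to apply to the whole sample name (hence fullmatch). The helper matching_samples is hypothetical and not part of the workflow.

```python
import re

# Hypothetical pattern: "NA" followed by exactly five digits. Replace it with
# the value you plan to use for common:sample-regex.
SAMPLE_REGEX = r"NA\d{5}"


def matching_samples(names, pattern=SAMPLE_REGEX):
    """Return the names that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


names = ["NA12878", "NA24385", "NA24631", "sample_01"]
print(matching_samples(names))  # ['NA12878', 'NA24385', 'NA24631']
```

Any sample listed in the TSV files that is missing from the output would not be picked up by a regex like this one, so it is worth checking every name.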
It is best practice to use the same reference build (if possible, even the same FASTA file) that was used to align/map the reads. If the sequencer's workflow provides you with an aligned BAM, it's probably possible to obtain the corresponding FASTA file from the sequencer's server. Otherwise, you can obtain reference genome FASTA files from genome browsers, such as Ensembl or the UCSC Genome Browser.
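One quick way to spot a reference mismatch is to compare contig names and lengths between the FASTA index and the BAM header. The sketch below is not part of the workflow and its function names are hypothetical. It assumes you have the text of a .fai file (as written by samtools faidx) and the text of a BAM header (as printed by samtools view -H). The inputs shown are toy data.

```python
def fai_contigs(fai_text):
    """Return {contig: length} from the text of a FASTA index (.fai) file."""
    contigs = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs[name] = int(length)
    return contigs


def sam_header_contigs(header_text):
    """Return {contig: length} from the @SQ lines of a SAM/BAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
            contigs[tags["SN"]] = int(tags["LN"])
    return contigs


def mismatched_contigs(fai_text, header_text):
    """List contigs whose presence or length differs between the two inputs."""
    fasta = fai_contigs(fai_text)
    bam = sam_header_contigs(header_text)
    return sorted(
        name for name in fasta.keys() | bam.keys() if fasta.get(name) != bam.get(name)
    )


# Toy inputs: the mitochondrial contig is named differently in each file.
fai = "chr1\t248956422\t6\t60\t61\nchrM\t16569\t253105708\t60\t61\n"
header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:MT\tLN:16569\n"
print(mismatched_contigs(fai, header))  # ['MT', 'chrM']
```

An empty list only shows that names and lengths agree. It does not prove the sequences are identical, so using the original FASTA file remains the safer choice.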
After deploying and configuring the workflow, you can execute it with:
snakemake --cores all
- This pipeline is a work in progress, and the current version is for amplicon-based targets defined in a very specific BED file.
- The pipeline works on Linux distributions and macOS.
Having issues with the workflow, or suggestions for it? You can contact us by creating an issue here on GitHub. We’ll be very happy to help you.