The small RNA-Seq description pipeline is a Snakemake pipeline to annotate small RNA loci (miRNAs, phased siRNAs) using one or more reference genomes and based on experimental small RNA-Seq datasets.
This pipeline relies heavily on the ShortStack software, which annotates and quantifies small RNAs using a reference genome.
Upon completion, several outputs will be generated for each sample:
- One ShortStack result file called `Results.txt`. See the description of this file in the ShortStack manual.
- Two fasta files for each sample: one fasta file containing the predicted hairpins and one containing the predicted mature microRNAs.
- Two BLAST result files (in tabular format) from the BLAST of predicted hairpins and mature miRNAs against miRBase (the miRBase version is specified in the config file). See the miRBase website for releases.
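The tabular BLAST files can be post-processed with a few lines of Python. The sketch below assumes the files use the standard 12-column BLAST+ tabular layout (`-outfmt 6`); check the pipeline rules to confirm this. It keeps the highest-scoring miRBase hit for each predicted sequence. The function name `best_hits` and the example records are made up for illustration.

```python
import csv
import io

# Standard 12 columns of BLAST+ tabular output (-outfmt 6).
# Assumption: the pipeline does not customise the column list.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return the highest-bitscore hit per query from tabular BLAST output."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        # Skip comment lines (present when BLAST is run with -outfmt 7).
        if row["qseqid"].startswith("#"):
            continue
        query = row["qseqid"]
        score = float(row["bitscore"])
        if query not in best or score > float(best[query]["bitscore"]):
            best[query] = row
    return best


# Made-up records, for illustration only.
example = io.StringIO(
    "Cluster_1\tath-MIR156a\t100.0\t21\t0\t0\t1\t21\t1\t21\t1e-05\t42.1\n"
    "Cluster_1\tath-MIR156b\t95.2\t21\t1\t0\t1\t21\t1\t21\t1e-03\t34.2\n"
)
hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["pident"])
```

To use it on a real result file, pass an open file handle instead of the `io.StringIO` example.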
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
This Snakemake pipeline makes use of the conda package manager to install software and dependencies.
- First, make sure you have conda installed on your system. Use Miniconda3 and follow the installation instructions.
- Using `conda`, create a virtual environment to install Snakemake (version 5.4.3 or higher) by executing the following code in a Shell window: `conda env create -f environment.yml`. This will install `snakemake version 5.20.0` and `pandas version 0.25.0` in a new environment called `small`.
- Activate this environment using: `conda activate small`
- You can now run the pipeline (see below).
If you have set up `conda` and created the `small` environment, that's all you need to do!
- Snakemake - The Snakemake workflow management system is a tool to create reproducible and scalable data analyses.
- NCBI BLAST+ - A program to perform sequence similarity searches. See the NCBI BLAST webpage for more info.
- ShortStack - Small RNA loci annotation and quantification.
- Trimmomatic - Read trimming for NGS data.
- bioawk - Bioawk is an extension to Brian Kernighan's awk, adding the support of several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names.
A series of custom Python functions are also used and can be found in the `helpers.py` file.
Versions of software and packages can be seen in their respective environment `.YAML` file in the `envs/` folder.
A small dataset is available in `test/` to run some tests rapidly. It will use the genome and miRBase reference fasta files stored in `refs/`.
To run the test, open a new Shell window and:
- Activate your working environment: `conda activate small`
- Type `snakemake -j 1 -np` for a dry run. No analysis is run but it checks that the Directed Acyclic Graph of jobs is OK (input and output from each rule chained to each other).
- For the real run, type `snakemake --cores N` where `N` is the number of CPUs that you want to use (default = 1).
A `samples.tsv` file can be used to specify sample names, their corresponding genomic reference to use and the location of their sequencing file.
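As a rough illustration of how such a sample sheet can be read, the sketch below parses a tab-separated table into a dictionary keyed by sample name. The column names (`sample`, `genome`, `fastq`), the file paths and the function `read_samples` are assumptions made for this example; check the header of the `samples.tsv` shipped with the pipeline for the real ones.

```python
import csv
import io

# Hypothetical column names and paths; check your own samples.tsv.
EXAMPLE_TSV = (
    "sample\tgenome\tfastq\n"
    "sample_A\trefs/genome_A.fasta\ttest/sample_A.fastq\n"
    "sample_B\trefs/genome_B.fasta\ttest/sample_B.fastq\n"
)


def read_samples(handle):
    """Map each sample name to its genome reference and sequencing file."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name = row["sample"]
        # Sample names must be unique, otherwise outputs would collide.
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        samples[name] = {"genome": row["genome"], "fastq": row["fastq"]}
    return samples


samples = read_samples(io.StringIO(EXAMPLE_TSV))
print(samples["sample_A"]["genome"])
```

Each sample carries its own genome path, so two samples can point to different references.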
Configuration settings can be changed in the `config.yaml` file. For instance, one could modify the minimal coverage required by ShortStack to discover sRNA loci.
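To show how a configuration value might reach ShortStack, here is a minimal sketch that turns a configuration dictionary into a command line. The keys `shortstack` and `mincov`, the function `shortstack_command` and the file paths are hypothetical; the real key names are those defined in `config.yaml`, and the actual rule in the pipeline may build the command differently. `--readfile`, `--genomefile`, `--outdir` and `--mincov` are ShortStack options, but check `ShortStack --help` for the version you have installed.

```python
import shlex

# Hypothetical configuration keys; the real names are defined in config.yaml.
config = {"shortstack": {"mincov": 5}}


def shortstack_command(reads, genome, outdir, config):
    """Build a ShortStack command line from configuration values."""
    params = config.get("shortstack", {})
    cmd = [
        "ShortStack",
        "--readfile", reads,
        "--genomefile", genome,
        "--outdir", outdir,
    ]
    # Only pass --mincov when it is set, so ShortStack's default applies otherwise.
    if "mincov" in params:
        cmd += ["--mincov", str(params["mincov"])]
    # Quote each part so paths with spaces survive the shell.
    return " ".join(shlex.quote(part) for part in cmd)


command = shortstack_command(
    "trimmed/sample_A.fastq", "refs/genome_A.fasta", "shortstack/sample_A", config
)
print(command)
```

Leaving the coverage key out of the configuration produces a command without `--mincov`, which falls back to the ShortStack default.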
Different genomic references can be used for each sample. Simply provide a genomic reference corresponding to your sample.
- Marc Galland - Initial work - Github profile
- Michelle van der Gragt - Initial work - Github profile
...as soon as we have published this software!
This project is licensed under the MIT License. See the LICENSE.md file for details.
SemVer is used for versioning. For the versions available, see the releases on this repository.
- Bioawk tutorial: https://isugenomics.github.io/bioinformatics-workbook/Appendix/bioawk-basics
- Vienna RNAfold tutorial: https://www.tbi.univie.ac.at/RNA/tutorial/#sec3
- miRTop: from BAM files to GFF3 files (and conversion to other formats such as Fasta etc.): https://academic.oup.com/bioinformatics/article/36/3/698/5556118