feat: complete, reproducible example workflow
m-jahn committed Nov 29, 2024
1 parent 4d7b3a2 commit 1dfa7ad
Showing 18 changed files with 614 additions and 8 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@ resources/**
logs/**
.snakemake
.snakemake/**
.test/results/*
34 changes: 34 additions & 0 deletions .test/config/config.yml
@@ -0,0 +1,34 @@
samplesheet: "config/samples.tsv"

get_genome:
  database: "ncbi"
  assembly: "GCF_000006785.2"
  fasta: Null
  gff: Null
  gff_source_type:
    [
      "RefSeq": "gene",
      "RefSeq": "pseudogene",
      "RefSeq": "CDS",
      "Protein Homology": "CDS",
    ]

simulate_reads:
  read_length: 100
  read_number: 100000
  random_freq: 0.01

cutadapt:
  threep_adapter: "-a ATCGTAGATCGG"
  fivep_adapter: "-A GATGGCGATAGG"
  default: ["-q 10 ", "-m 25 ", "-M 100", "--overlap=5"]

multiqc:
  config: "config/multiqc_config.yml"

report:
  export_figures: True
  export_dir: "figures/"
  figure_width: 875
  figure_height: 500
  figure_resolution: 125
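
Note on the `cutadapt` section: the adapter entries store the command-line flag together with the sequence, and the `default` entries carry trailing spaces. The rule that consumes these values (`workflow/rules/process_reads.smk`) is not shown in this view, so the following is only a minimal sketch of how such values could be assembled into a command line. The helper `build_cutadapt_args` and the output file names are assumptions made for illustration; `-o` and `-p` are cutadapt's paired-end output options.

```python
import shlex

# Values copied from the `cutadapt` section of the config above.
cutadapt_cfg = {
    "threep_adapter": "-a ATCGTAGATCGG",
    "fivep_adapter": "-A GATGGCGATAGG",
    "default": ["-q 10 ", "-m 25 ", "-M 100", "--overlap=5"],
}


def build_cutadapt_args(cfg, read1, read2, out1, out2):
    """Assemble a cutadapt argument list (hypothetical helper, not from the commit)."""
    args = ["cutadapt"]
    args += shlex.split(cfg["threep_adapter"])
    args += shlex.split(cfg["fivep_adapter"])
    for option in cfg["default"]:
        # shlex.split also discards the trailing spaces in the config values
        args += shlex.split(option)
    # output names are placeholders chosen for this example
    args += ["-o", out1, "-p", out2, read1, read2]
    return args


cmd = build_cutadapt_args(
    cutadapt_cfg,
    "sample1.bwa.read1.fastq.gz",
    "sample1.bwa.read2.fastq.gz",
    "trimmed_1.fastq.gz",
    "trimmed_2.fastq.gz",
)
print(shlex.join(cmd))
```

Building a list instead of concatenating strings keeps each option a separate token, which makes stray whitespace in the config harmless.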
2 changes: 2 additions & 0 deletions .test/config/multiqc_config.yml
@@ -0,0 +1,2 @@
remove_sections:
- samtools-stats
3 changes: 3 additions & 0 deletions .test/config/samples.tsv
@@ -0,0 +1,3 @@
sample condition replicate read1 read2
sample1 wild_type 1 sample1.bwa.read1.fastq.gz sample1.bwa.read2.fastq.gz
sample2 wild_type 2 sample2.bwa.read1.fastq.gz sample2.bwa.read2.fastq.gz
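
Note on the sample sheet: the workflow itself reads this file with `pandas` (see `workflow/rules/common.smk`). As a dependency-free illustration of the layout, the sketch below parses the same tab-separated content with the standard library and indexes the rows by sample name. The helper name `load_samples` is an assumption made for this example.

```python
import csv
import io

# Content copied from .test/config/samples.tsv (tab-separated).
SAMPLE_SHEET = (
    "sample\tcondition\treplicate\tread1\tread2\n"
    "sample1\twild_type\t1\tsample1.bwa.read1.fastq.gz\tsample1.bwa.read2.fastq.gz\n"
    "sample2\twild_type\t2\tsample2.bwa.read1.fastq.gz\tsample2.bwa.read2.fastq.gz\n"
)


def load_samples(handle):
    """Return a dict mapping each sample name to its row (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: row for row in reader}


samples = load_samples(io.StringIO(SAMPLE_SHEET))
for name in sorted(samples):
    print(name, samples[name]["read1"], samples[name]["read2"])
```

All values are read as strings here, so `replicate` would need an explicit conversion before numeric use.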
98 changes: 93 additions & 5 deletions README.md
@@ -1,21 +1,109 @@
# Snakemake workflow: `<name>`

[![Snakemake](https://img.shields.io/badge/snakemake-≥6.3.0-brightgreen.svg)](https://snakemake.github.io)
[![GitHub actions status](https://github.com/<owner>/<repo>/workflows/Tests/badge.svg?branch=main)](https://github.com/<owner>/<repo>/actions?query=branch%3Amain+workflow%3ATests)

[![Snakemake](https://img.shields.io/badge/snakemake-≥8.0.0-brightgreen.svg)](https://snakemake.github.io)
[![GitHub actions status](https://github.com/MPUSP/snakemake-workflow-template/actions/workflows/main.yml/badge.svg?branch=main)](https://github.com/MPUSP/snakemake-workflow-template/actions/workflows/main.yml)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1D355C.svg?labelColor=000000)](https://sylabs.io/docs/)
[![workflow catalog](https://img.shields.io/badge/Snakemake%20workflow%20catalog-darkgreen)](https://snakemake.github.io/snakemake-workflow-catalog)

A Snakemake workflow for `<description>`

- [Snakemake workflow: `<name>`](#snakemake-workflow-name)
- [Usage](#usage)
- [Workflow overview](#workflow-overview)
- [Running the workflow](#running-the-workflow)
- [Input data](#input-data)
- [Execution](#execution)
- [Parameters](#parameters)
- [Authors](#authors)
- [References](#references)
- [TODO](#todo)

## Usage

The usage of this workflow is described in the [Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog/?usage=<owner>%2F<repo>).

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) <repo>sitory and its DOI (see above).
If you use this workflow in a paper, please credit the authors by citing the URL of this repository or its DOI.

## Workflow overview

This workflow is a best-practice workflow for `<detailed description>`.
The workflow is built using [snakemake](https://snakemake.readthedocs.io/en/stable/) and consists of the following steps:

1. Parse sample sheet containing sample metadata (`python`)
2. Simulate short read sequencing data on the fly (`dwgsim`)
3. Check quality of input read data (`FastQC`)
4. Trim adapters from input data (`cutadapt`)
5. Collect statistics from tool output (`MultiQC`)

## Running the workflow

### Input data

This template workflow contains artificial sequencing data in `*.fastq.gz` format.
The test data is located in `.test/data`. Input files are listed in a mandatory sample sheet, whose location is set in the `config.yml` file (default: `.test/config/samples.tsv`). The sample sheet has the following layout:

| sample | condition | replicate | read1 | read2 |
| ------- | --------- | --------- | -------------------------- | -------------------------- |
| sample1 | wild_type | 1 | sample1.bwa.read1.fastq.gz | sample1.bwa.read2.fastq.gz |
| sample2 | wild_type | 2 | sample2.bwa.read1.fastq.gz | sample2.bwa.read2.fastq.gz |

### Execution

To run the workflow from the command line, change to the workflow's directory.

```bash
cd path/to/snakemake-workflow-name
```

Adjust options in the default config file `config/config.yml`.
Before running the entire workflow, you can perform a dry run using:

```bash
snakemake --dry-run
```

To run the complete workflow with test files using **conda**, execute the following command. Specifying the number of compute cores is mandatory.

```bash
snakemake --cores 10 --sdm conda --directory .test
```

To run the workflow with **singularity** / **apptainer**, use:

```bash
snakemake --cores 10 --sdm conda apptainer --directory .test
```

### Parameters

This table lists all parameters that can be used to run the workflow.

| parameter | type | details | default |
| ---------------------- | ---- | ------------------------------------------- | -------------------------------------------- |
| **samplesheet** | | | |
| path | str | path to samplesheet, mandatory | "config/samples.tsv" |
| **cutadapt** | | | |
| fivep_adapter | str | flag and sequence of the 5' adapter | `-A GATGGCGATAGG` |
| threep_adapter | str | flag and sequence of the 3' adapter | `-a ATCGTAGATCGG` |
| default | list | additional options passed to `cutadapt` | [`-q 10 `, `-m 25 `, `-M 100`, `--overlap=5`] |

## Authors

- Firstname Lastname
- Affiliation
- ORCID profile
- home page

## References

> Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. *Sustainable data analysis with Snakemake*. F1000Research, 10:33, **2021**. https://doi.org/10.12688/f1000research.29032.2.
# TODO
## TODO

* Replace `<owner>` and `<repo>` everywhere in the template (also under .github/workflows) with the correct `<repo>` name and owning user or organization.
* Replace `<name>` with the workflow name (can be the same as `<repo>`).
* Replace `<description>` with a description of what the workflow does.
* Update the workflow description, parameters, running options, authors and references in the `README.md`.
* Update the `README.md` badges. Add or remove badges for `conda`/`singularity`/`apptainer` usage depending on the workflow's capability.
* The workflow will appear in the snakemake-workflow-catalog once it has been made public. Then the link under "Usage" will point to the usage instructions if `<owner>` and `<repo>` were correctly set.
34 changes: 34 additions & 0 deletions config/config.yml
@@ -0,0 +1,34 @@
samplesheet: ".test/config/samples.tsv"

get_genome:
  database: "ncbi"
  assembly: "GCF_000006785.2"
  fasta: Null
  gff: Null
  gff_source_type:
    [
      "RefSeq": "gene",
      "RefSeq": "pseudogene",
      "RefSeq": "CDS",
      "Protein Homology": "CDS",
    ]

simulate_reads:
  read_length: 100
  read_number: 100000
  random_freq: 0.01

cutadapt:
  threep_adapter: "-a ATCGTAGATCGG"
  fivep_adapter: "-A GATGGCGATAGG"
  default: ["-q 10 ", "-m 25 ", "-M 100", "--overlap=5"]

multiqc:
  config: "config/multiqc_config.yml"

report:
  export_figures: True
  export_dir: "figures/"
  figure_width: 875
  figure_height: 500
  figure_resolution: 125
2 changes: 2 additions & 0 deletions config/multiqc_config.yml
@@ -0,0 +1,2 @@
remove_sections:
- samtools-stats
44 changes: 44 additions & 0 deletions config/schemas/config.schema.yml
@@ -0,0 +1,44 @@
$schema: "http://json-schema.org/draft-07/schema#"
description: configuration of the workflow
properties:
  samplesheet:
    type: string
    description: path to the sample sheet

  get_genome:
    properties:
      database:
        type: ["string", "null"]
      assembly:
        type: ["string", "null"]
      fasta:
        type: ["string", "null"]
      gff:
        type: ["string", "null"]
      gff_source_type:
        type: array

  simulate_reads:
    properties:
      read_length:
        type: number
      read_number:
        type: number
      random_freq:
        type: number

  cutadapt:
    properties:
      threep_adapter:
        type: string
      fivep_adapter:
        type: string
      default:
        type: array

  multiqc:
    properties:
      config:
        type: string

required: ["samplesheet", "get_genome", "simulate_reads", "cutadapt", "multiqc"]
25 changes: 25 additions & 0 deletions config/schemas/samples.schema.yml
@@ -0,0 +1,25 @@
$schema: "http://json-schema.org/draft-07/schema#"
description: an entry in the sample sheet
properties:
  sample:
    type: string
    description: sample name/identifier
  condition:
    type: string
    description: sample condition that will be compared during differential analysis
  replicate:
    type: number
    default: 1
    description: consecutive numbers representing multiple replicates of one condition
  read1:
    type: string
    description: names of fastq.gz files, read 1
  read2:
    type: string
    description: names of fastq.gz files, read 2 (optional)

required:
  - sample
  - condition
  - replicate
  - read1
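
Note on the sample schema: in the workflow, validation is delegated to `snakemake.utils.validate` (see `workflow/rules/common.smk`). The sketch below is not how that function is implemented; it is a standard-library approximation of the two constraints this schema expresses most directly, namely the four required fields and a numeric `replicate`. The helper `check_row` and the example rows are assumptions made for illustration.

```python
# Hypothetical, stdlib-only check mirroring config/schemas/samples.schema.yml.
REQUIRED = ("sample", "condition", "replicate", "read1")


def check_row(row):
    """Return a list of problems found in one sample-sheet row (a dict)."""
    problems = [
        f"missing required field: {key}" for key in REQUIRED if not row.get(key)
    ]
    replicate = row.get("replicate")
    if replicate:
        try:
            float(replicate)  # the schema declares `replicate` as a number
        except ValueError:
            problems.append(f"replicate is not a number: {replicate!r}")
    return problems


good = {
    "sample": "sample1",
    "condition": "wild_type",
    "replicate": "1",
    "read1": "sample1.bwa.read1.fastq.gz",
}
bad = {"sample": "sample3", "condition": "wild_type", "replicate": "one"}

print(check_row(good))
print(check_row(bad))
```

Because `read2` is not listed under `required`, a row without it passes, which matches the "(optional)" note in the schema.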
47 changes: 44 additions & 3 deletions workflow/Snakefile
@@ -1,4 +1,45 @@
# Main entrypoint of the workflow.
# Please follow the best practices:
# https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html,
# in particular regarding the standardized folder structure mentioned there.


import os
import pandas as pd


# load configuration
# -----------------------------------------------------
configfile: "config/config.yml"


# container definition: uncomment to include a singularity image, e.g. from github's container registry
# container: "oras://ghcr.io/<user>/<repository>:<version>"


# load rules
# -----------------------------------------------------
include: "rules/common.smk"
include: "rules/process_reads.smk"


# optional messages, log and error handling
# -----------------------------------------------------
onstart:
    print("\n--- Analysis started ---\n")


onsuccess:
    print("--- Workflow finished! ---")


onerror:
    print("--- An error occurred! ---")


# target rules
# -----------------------------------------------------
rule all:
    input:
        "results/multiqc/multiqc_report.html",
    default_target: True
6 changes: 6 additions & 0 deletions workflow/envs/cutadapt.yml
@@ -0,0 +1,6 @@
name: cutadapt
channels:
- conda-forge
- bioconda
dependencies:
- cutadapt=4.9
6 changes: 6 additions & 0 deletions workflow/envs/fastqc.yml
@@ -0,0 +1,6 @@
name: fastqc
channels:
- conda-forge
- bioconda
dependencies:
- fastqc=0.12.1
9 changes: 9 additions & 0 deletions workflow/envs/get_genome.yml
@@ -0,0 +1,9 @@
name: get_genome
channels:
- conda-forge
- bioconda
dependencies:
- unzip=6.0
- ncbi-datasets-cli=16.23.0
- bcbio-gff=0.7.1
- samtools=1.20
7 changes: 7 additions & 0 deletions workflow/envs/multiqc.yml
@@ -0,0 +1,7 @@
name: multiqc
channels:
- conda-forge
- bioconda
dependencies:
- python=3.9
- multiqc=1.14
6 changes: 6 additions & 0 deletions workflow/envs/simulate_reads.yml
@@ -0,0 +1,6 @@
name: simulate_reads
channels:
- conda-forge
- bioconda
dependencies:
- dwgsim=1.1.14
17 changes: 17 additions & 0 deletions workflow/rules/common.smk
@@ -0,0 +1,17 @@
# import basic packages
import pandas as pd
from snakemake.utils import validate
from os import path


# read sample sheet
samples = (
    pd.read_csv(config["samplesheet"], sep="\t", dtype={"sample": str})
    .set_index("sample", drop=False)
    .sort_index()
)


# validate sample sheet and config file
validate(samples, schema="../../config/schemas/samples.schema.yml")
validate(config, schema="../../config/schemas/config.schema.yml")