Plastaumatic: An automated pipeline to assemble and annotate plastome sequences

Introduction

Plastaumatic is an automated pipeline for both assembly and annotation of plastomes. It is designed so that a researcher can load whole-genome sequence data with minimal manual input and therefore obtain results faster. The pipeline trims adaptor and low-quality sequences with fastp, assembles the plastome de novo with NOVOPlasty, standardizes and quality-checks the assembled genomes through a custom script that uses BLAST+ and SAMtools, annotates the assembled genomes with AnnoPlast, and finally generates the files required for NCBI GenBank submission.

This repository includes a snakefile that uses the Snakemake workflow manager to perform all of these tasks, as well as a shell executable that does the same.

Pre-requisites:

Snakemake v5.10.0+ [if using the Snakemake pipeline]
fastp v0.23.0
NOVOPlasty v4.3.1
Python v3.8
Biopython v1.79 and pandas v1.4.3 (for AnnoPlast)
BLAST+ v2.12.0
samtools v1.9
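
Before a run, it can help to confirm that the listed tools are reachable from your PATH. The following is a minimal sketch, not part of plastaumatic itself. The helper name `missing_tools` and the list `REQUIRED_TOOLS` are hypothetical, and the choice of `blastn` and `makeblastdb` as representative BLAST+ programs is an assumption, since the exact programs the pipeline calls are not stated here.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Executable names are assumptions: fastp and samtools install under these
# names, and blastn/makeblastdb are two of the programs shipped with BLAST+.
# NOVOPlasty is passed to the pipeline by path, so it is not checked here.
REQUIRED_TOOLS = ["fastp", "samtools", "blastn", "makeblastdb", "python3"]


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH, in input order."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only checks that each executable can be found. It does not compare installed versions against the ones listed above.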

Installation

Clone this repository to your desired location and add it to your PATH:

git clone https://github.com/stromviklab/plastaumatic.git
export PATH=$(pwd)/plastaumatic:$PATH
echo 'export PATH='$(pwd)'/plastaumatic:$PATH' >> ~/.bashrc

How to run (Snakemake)

Before running the Snakemake pipeline, make sure all the pre-requisites are installed and available in your PATH.
Edit the config.yaml file before your run (a template config.yaml file is available in the repository):

seed: Test_dataset/seed.fasta                                       # path to a seed fasta file for NOVOPlasty (see: https://github.com/ndierckx/NOVOPlasty)
ref_gb: Test_dataset/reference.gb                                   # path to a reference GenBank file for annotation (see: https://github.com/SaiReddy-A/AnnoPlast) 
range: 140000-160000                                                # estimated plastome size range 
repo: plastaumatic                                                  # path to the plastaumatic repository
novo_path: /mnt2/software/NOVOPlasty/NOVOPlasty4.3.1.pl             # path to the NOVOPlasty executable
samples:                                                            # prefix and reads of one/many genomes go under samples
  genome1: Test_dataset/test1.fq.gz,Test_dataset/test2.fq.gz            # prefix: forward_read,reverse_read

If you want to run this pipeline on multiple genomes, just add more lines below samples in this format:

samples:
    genome1: forward.fq,reverse.fq
    genome2: forward.fq,reverse.fq
    genome3: forward.fq,reverse.fq
    ...
    ...
    genomeX: forward.fq,reverse.fq
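
When there are many genomes, the samples block shown above can be generated rather than typed by hand. This is a minimal sketch under stated assumptions: the function name `render_samples` is hypothetical, the four-space indent simply mirrors the example above, and the output is plain text meant to be pasted into config.yaml.

```python
from typing import Dict, Tuple


def render_samples(reads: Dict[str, Tuple[str, str]], indent: int = 4) -> str:
    """Format {prefix: (forward, reverse)} as a 'samples:' block for config.yaml."""
    lines = ["samples:"]
    for prefix, (forward, reverse) in reads.items():
        if "," in forward or "," in reverse:
            # A comma separates the two read files, so it cannot appear in a path.
            raise ValueError(f"comma in read path for {prefix}")
        lines.append(f"{' ' * indent}{prefix}: {forward},{reverse}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_samples({
        "genome1": ("genome1_1.fq.gz", "genome1_2.fq.gz"),
        "genome2": ("genome2_1.fq.gz", "genome2_2.fq.gz"),
    }))
```

The read file names in the usage example are placeholders. Entries are written in the order in which they appear in the dictionary.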

Copy the snakefile to your preferred output directory, set the path to the config file in the snakefile, and run:

snakemake

How to run (shell script)

Before running the plastaumatic executable, make sure all the pre-requisites are installed and available in your PATH.
Run plastaumatic -h for help:

Usage: plastaumatic -s seed.fa -g reference.gb -r <140000-160000> -f fof.txt -n NOVOPlasty4.3.1.pl

options:
         -s      Path to the seed file for assembly
         -g      Path to the reference GenBank file
         -r      Plastome assembly size range [140000-160000]
         -f      Path to the file-of-filenames with reads
         -n      Path to NOVOPlasty executable
         -h      Shows this help message
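
The size range given to -r has the form MIN-MAX. A small check like the one below can catch a mistyped range before a long run starts. It is a sketch only: `parse_size_range` is a hypothetical helper, and the rule that the minimum must be smaller than the maximum is an assumption about what a sensible range looks like, not documented plastaumatic behaviour.

```python
from typing import Tuple


def parse_size_range(value: str) -> Tuple[int, int]:
    """Split a 'MIN-MAX' string such as '140000-160000' into two integers."""
    low_text, sep, high_text = value.strip().partition("-")
    if not sep or not low_text.isdigit() or not high_text.isdigit():
        raise ValueError(f"expected MIN-MAX, got {value!r}")
    low, high = int(low_text), int(high_text)
    if low >= high:
        # Assumption: a usable range has a minimum below its maximum.
        raise ValueError(f"minimum {low} must be smaller than maximum {high}")
    return low, high


if __name__ == "__main__":
    print(parse_size_range("140000-160000"))
```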

The file-of-filenames is a plain-text file with one line per genome. Each line holds the prefix, the forward read file, and the reverse read file, separated by commas:

genome1,forward.fq,reverse.fq
genome2,forward.fq,reverse.fq
genome3,forward.fq,reverse.fq
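
The format above can be validated before launching a run. The following is a minimal sketch, assuming three comma-separated fields per line. The function name `parse_fof` is hypothetical, and skipping blank lines and rejecting duplicate prefixes are choices made for this example, not documented plastaumatic behaviour.

```python
from typing import List, Tuple


def parse_fof(text: str) -> List[Tuple[str, str, str]]:
    """Parse 'prefix,forward,reverse' lines; blank lines are skipped."""
    records = []
    seen = set()
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3 or not all(fields):
            raise ValueError(f"line {number}: expected prefix,forward,reverse")
        prefix, forward, reverse = fields
        if prefix in seen:
            # Each prefix names an output directory, so repeats would collide.
            raise ValueError(f"line {number}: duplicate prefix {prefix!r}")
        seen.add(prefix)
        records.append((prefix, forward, reverse))
    return records


if __name__ == "__main__":
    example = "genome1,forward.fq,reverse.fq\ngenome2,forward.fq,reverse.fq\n"
    for record in parse_fof(example):
        print(record)
```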

An example plastaumatic run

plastaumatic -s Test_dataset/seed.fasta -g Test_dataset/reference.gb -r 140000-160000 -f readList.txt -n software/NOVOPlasty/NOVOPlasty4.3.1.pl

$ cat readList.txt
genome1,Test_dataset/test1.fq.gz,Test_dataset/test2.fq.gz

Output files

For each genome, a directory named after its prefix is created in the directory from which the pipeline is run. Each prefix directory contains:

prefix
├── 00-logs
├── 01-trim
├── 02-assemble
├── 03-standardize
├── 04-annotate
├── 05-tbl
├── prefix_config.txt
├── prefix.plastome.fa # a symlink to the final plastome assembly file  
└── prefix.plastome.gb # a symlink to the final plastome annotation file
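
Given the layout above, the final files for each genome can be located from the prefix alone. The sketch below builds the two expected paths and reports whether both exist. The helper names `final_outputs` and `is_complete` are hypothetical, and treating the presence of both files as a sign of a finished run is an assumption rather than a guarantee made by the pipeline.

```python
from pathlib import Path
from typing import Dict


def final_outputs(prefix: str, run_dir: str = ".") -> Dict[str, Path]:
    """Return the expected paths of the final assembly and annotation files."""
    base = Path(run_dir) / prefix
    return {
        "assembly": base / f"{prefix}.plastome.fa",
        "annotation": base / f"{prefix}.plastome.gb",
    }


def is_complete(prefix: str, run_dir: str = ".") -> bool:
    """True when both final files exist; symlinks are followed to their targets."""
    return all(path.exists() for path in final_outputs(prefix, run_dir).values())


if __name__ == "__main__":
    for name, path in final_outputs("genome1").items():
        print(name, path)
    print("complete:", is_complete("genome1"))
```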

Citations

Chen W, Achakkagari SR and Strömvik M (2022) Plastaumatic: Automating plastome assembly and annotation. Front. Plant Sci. 13:1011948. doi: 10.3389/fpls.2022.1011948

Since plastaumatic uses several software tools in its pipeline, publications reporting results obtained from plastaumatic should also cite the following sources:

Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560

Dierckxsens, N., Mardulyn, P., & Smits, G. (2017). NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic acids research, 45(4), e18. https://doi.org/10.1093/nar/gkw955

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421. doi: 10.1186/1471-2105-10-421. PMID: 20003500; PMCID: PMC2803857

Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, 1000 Genome Project Data Processing Subgroup, The Sequence Alignment/Map format and SAMtools, Bioinformatics, Volume 25, Issue 16, 15 August 2009, Pages 2078–2079, https://doi.org/10.1093/bioinformatics/btp352
