Strobealign evaluation

This repository provides runnable workflows for evaluating strobealign.

The following workflows are available:

Snakefile: Downloads reference genomes and generates libraries of simulated reads
minmax/Snakefile: Reproduces the min/max evaluation experiments in the paper "Designing efficient randstrobes for sequence similarity analyses" by Karami et al., 2023. (see minmax/README.md)
simx/Snakefile: Run strobealign and BWA-MEM on all datasets (sim3, sim4, sim5), measure accuarcy, and plot the results.

Datasets

The simulated datasets generated and used here are created in mostly the same way as the datasets for the following paper:

Sahlin, K. Strobealign: flexible seed size enables ultra-fast and accurate read alignment. Genome Biology 23, 260 (2022). https://doi.org/10.1186/s13059-022-02831-7

See the Supplementary Information in that paper, section "Note A: Simulations" for a description.

See also the original snakemake workflow file that documents the evaluation done for the paper: https://github.com/ksahlin/alignment_evaluation/blob/master/evaluation/Snakefile

Some differences:

The paper mentions SIM3 and SIM4 datasets. We added a SIM5 dataset with high variability.
The genomes in the paper are drosophila, maize, CHM13, rye. We added an E. coli pangenome (see below).
The number of reads per library has been reduced to 1 million.

E. coli pangenome

As fifth reference, an E. coli pangenome has been added consisting of 50 randomly selected E. coli assemblies from RefSeq. Because RefSeq changes, querying it is not reproducible. Therefore, a pre-generated list of 100 E. coli assembly accessions can be found in ecoli-accessions.txt. It was generated in the following way:

ncbi-genome-download --dry-run --assembly-levels complete --taxid 562 bacteria | sed 1d | cut -f1 | shuf | head -n 100 > ecoli-accessions.txt

The ecoli50 datasets uses the first 50 accessions from that list.

The number of reads has been reduced

The datasets in the Genome Biology paper contain 10 million simulated reads. Here, we use only 1 million reads.

Note that there is a small difference between the results found by the workflow here and the numbers reported in the Karami et al. paper because two different ways were used to arrive at the smaller dataset:

For the Karami et al. paper, the original 10 million simulated reads in each file were truncated to the first 1 million.
The workflow in this repository simulates 1 million reads.

These datasets are not the same because the output generated by mason depends on all input parameters, and the number of reads it shall generate is one of these parameters.

To get the results to match exactly:

Set N_READS in the Snakefile to 10'000'000
Add a rule to truncate the generated data files to 1 million reads before mapping.

We have not made this modification as it adds hundreds of CPU hours for little gain.

Running the workflow

Makes the required software available:

conda env create --file environment.yml
conda activate strobealign-eval

Download the genomes and simulate reads:

snakemake --cores=all

This generates the folders downloads/, genomes/ and datasets/. The downloads/ folder contains intermediate files and can be deleted.

Issues

mason_variator may fail with this error when running on maize or CHM13:

/home/osboxes/seqan/apps/mason2/mason_variator.cpp:1161
Assertion failed : snpRecord.getPos() != svRecord.getPos()
should be true but was 0 (Should be generated non-overlapping (snp pos = 229421492, sv pos = 229421492).)

We patch mason to work around this. An alternative is to compile mason without assertions.

Name		Name	Last commit message	Last commit date
Latest commit History 79 Commits
minmax		minmax
simx		simx
slurm		slurm
.gitignore		.gitignore
MD5SUM		MD5SUM
README.md		README.md
Snakefile		Snakefile
ecoli-accessions.txt		ecoli-accessions.txt
environment.yml		environment.yml
get_accuracy.py		get_accuracy.py
noplasmids.py		noplasmids.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Strobealign evaluation

Datasets

E. coli pangenome

The number of reads has been reduced

Running the workflow

Issues

About

Releases

Packages

Languages

NBISweden/strobealign-evaluation

Folders and files

Latest commit

History

Repository files navigation

Strobealign evaluation

Datasets

E. coli pangenome

The number of reads has been reduced

Running the workflow

Issues

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages