Skip to content

Latest commit

 

History

History
179 lines (119 loc) · 14.5 KB

README.md

File metadata and controls

179 lines (119 loc) · 14.5 KB

Landscape of plant mitochondrial genome via dis-assembly approach.

Short Description

This repository is for my research on plant mitochondrial genomes, which I call "mitogenome" in this README.md. Shortly and roughly speaking, mitogenomes of plants are different from those of animals or fungus. For example,

  • The sizes are much larger.
  • Genomes contain some repetitive regions, and
  • (repeat-)mediated recombinations create divergent structures called "multipartite structure"s.

I'm trying to reconstruct these multipartite structures in the plant mitogenomes. In contrast to usual assembly workflows, I'm currently developping another way to assemble each structure; clustering.

Although it partly solves the problem of mitogenome reconstruction, it does depend on a 'good' reference genome. I'm working on this repository to adress this issue.

Properties

Data Availability

The following list is the list of accessions/URLs to the data used in this research:

All of these data were generated by Sequel Systems. For more details, e.g., the used chemistory, please visit the links.

The Gen Bank IDs of the reference sequences used in this research are BK010421.1, JF729100, and JF729102. BK010421.1 is the reference sequence I'm using generally.

WGS reads were filtered to enrich reads from mitochondrial genome. Specifically, we run the following command:bash ./script/filter_reads ${WGS} ${BK010421.1} > ${FILTERED}.

Results (Ver 2022)

We re-assembled these datasets by JTK, and put the results here.

Strain Name Master circle (FASTA) Master circle (GFA) Alignments (BAM)
c24 FASTA GFA BAM
pacbio FASTA GFA BAM
sha FASTA GFA BAM
col0 FASTA GFA BAM
cvi FASTA GFA BAM
eri FASTA GFA BAM
kyo FASTA GFA BAM
ler FASTA GFA BAM
an1 FASTA GFA BAM

The dotplots between these strains as follows (or click here).

The red lines are boundaries between strains, and blue lines are boundaries of contigs.

Results

The filtered reads, the outpot of PO-HMM, and supplementary plots are available at the following locations. "Circos," "linear," and "no-marge" represent a circos plot (like in the paper), a dot plot between assembled contig and the reference, and a circos plots without any "aggressive merging," respectively.

Strain Name Reads Result Circos Linear no_merge
pacbio Reads Result Circos Liner NoMerge
ler Reads Result Circos Liner NoMerge
col0_1106_exp2 Reads Result Circos Liner NoMerge
cvi Reads Result Circos Liner NoMerge
an1 Reads Result Circos Liner NoMerge
c24 Reads Result Circos Liner NoMerge
kyo Reads Result Circos Liner NoMerge
sha Reads Result Circos Liner NoMerge
eri Reads Result Circos Liner NoMerge

Reproducibility

Synthetic dataset

First, install the dependencies. Minimum requirements are

  • BadRead: git clone https://github.com/rrwick/Badread.git && pip3 install ./Badread would install BadRead to your local envirnment.
  • Flye
  • Minimap2: git clone https://github.com/lh3/minimap2 && cd minimap2 && make and ln -s ${PWD}/minimap2 ${HOME}/local/bin/minimap2 or anywhere included in the $PATH variable.
  • Samtools
  • HTSlib
  • Last
  • Rust

To obtain baseline result, please install

  • WhatsHap: pip3 install --user whatshap
  • isONclust: pip install isONclust.
  • CARNAC-LR: Exec git clone https://github.com/kamimrcht/CARNAC.git && cd CARNAC && make. Then, you can sim-link ${PWD}/CARNAC-LR ${HOME}/local/bin/CARNAC-LR or anywhere included in the $PATH variable. Also, copy CARNAC/scripts/paf_to_CARNAC.py into ./script/ of this repository.

Then,

git clone --recursive https://github.com/ban-m/reconstruct_mito_genome.git
cd reconstruct_mito_genome
cargo check && cargo build && cargo build --release # Build binaries.
bash ./script/create_mock_genome.sh # create datasets under ${PWD}/data/synthetic_data/ by using `BadRead`
bash ./script/synthetic_dataset.sh
bash ./script/benchmark.sh # create results under ${PWD}/result/benchmark/
bash ./script/posterior_probability.job
bash ./script/baseline_mock_genomes.job # Invoke `isONclust`, `CARNAC-LR`, `WhatsHap`, and `Flye`

would create all the result used in the paper. Note that each script would take long time to be done (from a few hours to a few days).

Real dataset

First, download the dataset and the reference genome.

bash ${PWD}/donwloads_files.sh

Please type ll -ht ${PWD}/data/filtered_reads to confirm that all the dataset are properly donwloaded; md5sum values are as follows:

  • e040ac3f8b047b7e84d38cb70faf8ecc an1.fasta
  • 7b4b5f70ba29891a09a83786db5d8ad9 c24.fasta
  • 00157130264b868bf35dd10828749a3d col0_1106_exp2.fasta
  • 893394f4ef02fe4737fb26b7896ea98f cvi.fasta
  • 26646b904a44bd564c84ccf5bcc6c7e9 eri.fasta
  • 072427917074882e4ad9dbc4b03ad87d kyo.fasta
  • 67b2f656d2eaa91126f36b6b111a665f ler.fasta
  • 4a92ad3d6fe66732e2fcba4d76488b53 pacbio.fasta
  • 69a0007d689ba6a797c0e975a0eb8092 sha.fasta

Then, run the pipeline on each dataset. Here, ${threads} is the number of threads to be used. We used ${threads}=24.

bash ./script/real_dataset.sh ${threads}

It takes several days to complete on a 24 threads 2.0GHz computer.

Instalation

This program depends on following libraries and binaries:

After install these requirements, do

git clone --recursive https://github.com/ban-m/reconstruct_mito_genome.git
cd reconstruct_mito_genome
cargo check && cargo build && cargo build --release

Then, target/release/mmmm --help would print help messages to stderr.

Basic usage would be

mmmm decompose --output ${OUTPUT} \
    --reads ${READ} --contigs ${REFERENCE} \
    --cluster_num ${MIN_CLUSTER} --threads ${CORES}

${READ} and ${REFERENCE} should be fasta files. ${MIN_CLUSTER} is the size of the cluster at each window, not the number of resulting cluster.

Disclaimer1: It requires a reference quality contigs for clutering. For those dataset without very good reference, I'm currently developing this repository.

Disclaimer2: As the clustering algorithm (PO-HMM) is very costly, this program is for very small genome(<1Mbp).

Reference

  • Jiao, W., Schneeberger, K. Chromosome-level assemblies of multiple Arabidopsis genomes reveal hotspots of rearrangements with altered evolutionary dynamics. Nat Commun 11, 989 (2020). https://doi.org/10.1038/s41467-020-14779-y