Home
MetaCherchant is a tool for analysing the genomic environment of a nucleotide sequence within a metagenome. The implementation is based on the MetaFast source code.
It also provides tools for comparing two metagenomes. For more details, please consult the reads classifier description.
MetaCherchant can also build the genomic environment of a nucleotide sequence using paired Hi-C reads. The implementation relies on the MetaCherchant environment-finder tool, BWA and SAMtools. The Hi-C environment finder is described below. For more details, please consult the Detailed Hi-C environment finder description.
The genomic environment with Hi-C links is built in four stages:
- First, build the genomic environment of a target nucleotide sequence using the MetaCherchant environment-finder tool. This yields the metagenomic environment without Hi-C links. The figure below shows a schematic representation of the original genomic environment: the target contig is marked in red, contigs included in the genomic environment are marked in blue, and all other contigs are grey.
- Second, find the Hi-C read pairs that extend the original genomic environment, that is, pairs in which one read lies inside the original environment and the other lies outside it. In the figure below, Hi-C links that extend the original genomic environment are shown as red dotted lines; all other Hi-C links are black.
- Third, build the genomic environment with Hi-C links. The genomic environments around the target genes and around all Hi-C reads found in the second step are constructed and joined into a single graph, which is possible with the MetaCherchant multiple-metagenome mode. The figure below shows a schematic representation of the extended genomic environment. Once the analysis finishes, the resulting metagenomic environment can be visualised as a de Bruijn graph.
- Finally, map all Hi-C reads onto the final graph. The result of this step is the hic_map.txt file, which Bandage can use to visualize Hi-C links in the de Bruijn graph. The figure below shows an example of a de Bruijn graph visualization with Hi-C links.
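The selection rule of the second stage (keep a Hi-C pair only when exactly one of its reads falls inside the original environment) can be sketched in a few lines of Python. This is an illustration only, not the code used by the pipeline: the function name and the representation of mapping results as sets of pair identifiers are assumptions made for the example.

```python
from typing import Iterable, List, Set


def select_extending_pairs(
    pair_ids: Iterable[str],
    r1_inside: Set[str],
    r2_inside: Set[str],
) -> List[str]:
    """Return ids of Hi-C pairs with exactly one read inside the environment.

    r1_inside / r2_inside are hypothetical inputs: the ids of pairs whose
    first / second read mapped to the original genomic environment.
    """
    selected = []
    for pair_id in pair_ids:
        first_in = pair_id in r1_inside
        second_in = pair_id in r2_inside
        # Exactly one read inside: the pair links the environment to the outside.
        if first_in != second_in:
            selected.append(pair_id)
    return selected
```

Pairs with both reads inside add nothing new, and pairs with both reads outside are not connected to the environment, so both groups are dropped.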
The project pipeline is shown in the picture below:
To get the required sources, clone the MetaCherchant GitHub repository:
git clone https://github.com/ctlab/metacherchant.git
or just download three files:
You also need to install the following prerequisites:
- JRE (>= 1.8 version)
conda install -c conda-forge openjdk=11
- Python (>= 3.5 version)
conda install -c anaconda python=3.8
- BWA
conda install -c bioconda bwa
- Samtools
conda install -c bioconda samtools
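Before running the pipeline, it can help to confirm that the prerequisites are reachable on PATH. The sketch below is not part of MetaCherchant; it assumes the tools are installed under their usual executable names (java, python3, bwa, samtools) and it does not check versions.

```python
import shutil

# Executable names are an assumption; adjust them to match your installation.
REQUIRED_TOOLS = ("java", "python3", "bwa", "samtools")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All prerequisites were found on PATH.")
```

The `which` argument exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.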
To visualize the de Bruijn graph with Hi-C crosslinks in Bandage, build Bandage from source code: the version with Hi-C support has not been released in the main Bandage repository, https://github.com/rrwick/Bandage.
- --reads - [Mandatory] list of all input files with metagenomic reads, separated by spaces. FASTA and FASTQ formats are supported.
- --seq - [Mandatory] a FASTA file with the target nucleotide sequences, for each of which a genomic environment will be built.
- --hi-c-r1 and --hi-c-r2 - [Mandatory] two input files with paired Hi-C reads. These parameters assume that the i-th read in hic_R1.fastq and the i-th read in hic_R2.fastq constitute a read pair. FASTA and FASTQ formats are supported.
- --work-dir - [Mandatory] working directory for intermediate files, logs and the output folder.
- --metacherchant - [Mandatory] path to the metacherchant jar file.
- --k - [Optional] the size of the k-mer used in the de Bruijn graph. Default value is 31.
- --coverage - [Optional] the minimum coverage threshold for a k-mer to be included in the graph. Default value is 5.
- --maxradius - [Optional] maximum allowed distance between any k-mer and the target gene. Default value is 100000.
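Because --hi-c-r1 and --hi-c-r2 rely on the i-th records of the two files forming a pair, a quick consistency check before a long run can be useful. The following sketch is not part of the pipeline. It assumes uncompressed FASTQ files with exactly four lines per record, and read names that match once an optional /1 or /2 suffix is removed; FASTA input is not handled.

```python
import itertools


def read_names(path):
    """Yield read names from a FASTQ file with four lines per record."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # only header lines carry the read name
            fields = line[1:].split()
            name = fields[0] if fields else ""
            # Drop a trailing /1 or /2 mate suffix, if present.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def first_mismatch(r1_path, r2_path):
    """Return the 0-based index of the first unpaired record, or None."""
    pairs = itertools.zip_longest(read_names(r1_path), read_names(r2_path))
    for index, (name1, name2) in enumerate(pairs):
        # zip_longest pads the shorter file with None, so a length
        # difference is also reported as a mismatch.
        if name1 != name2:
            return index
    return None
```

A return value of None means every record in the first file has a same-named partner at the same position in the second file.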
- $WORK_DIR/output/1/merged/graph.gfa - de Bruijn graph in GFA format for the original genomic environment
- $WORK_DIR/1/selected_reads.fasta - Hi-C reads that extend the original genomic environment
- $WORK_DIR/output/2/merged/graph.gfa - de Bruijn graph in GFA format for the extended genomic environment
- $WORK_DIR/output/2/hic_map.txt - mapping of Hi-C reads to the contigs in the format: <contig id 1> <contig id 2> <hi-c weight>, where the hi-c weight is equal to the number of Hi-C crosslinks between these two contigs.
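For a quick look at hic_map.txt outside Bandage, the contig pairs can be loaded and ranked by weight. The sketch below rests on an assumption about the file layout that should be checked against a real output file: each line holds two contig ids followed by an integer weight, separated by whitespace, with no header line. The function names are illustrative.

```python
from collections import Counter


def load_hic_links(lines):
    """Sum Hi-C weights per unordered contig pair.

    Assumes whitespace-separated lines: <contig id 1> <contig id 2> <weight>.
    Lines that do not have exactly three fields are skipped.
    """
    links = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        contig_a, contig_b, weight = fields
        # Treat (a, b) and (b, a) as the same link.
        key = tuple(sorted((contig_a, contig_b)))
        links[key] += int(weight)
    return links


def strongest_links(links, n=10):
    """Return the n contig pairs with the highest total weight."""
    return links.most_common(n)
```

With a real run, the input could be an open file handle, for example `load_hic_links(open(path))` where `path` points to the hic_map.txt file in the working directory.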
The extended Hi-C environment construction:
The example folder contains WGS reads and Hi-C read pairs that were generated for a fragment of the Salmonella genome and for the genome of the Salmonella pSLT plasmid. Salmonella carries the pSLT plasmid, so there are Hi-C links between these two genomes.
- The wgs_reads folder contains the generated WGS reads.
- The hic_R1.fastq and hic_R2.fastq files contain the generated Hi-C reads.
- The seq.fasta file contains the target gene from the plasmid genome.
The genomic environment with Hi-C links was constructed with the HiCEnvironmentFinder.sh script:
./HiCEnvironmentFinder.sh --reads "example/wgs_reads/*.fastq" \
--seq "example/seq.fasta" \
--hi-c-r1 "example/hic_R1.fastq" \
--hi-c-r2 "example/hic_R2.fastq" \
--work-dir "example_work_dir" \
--metacherchant "../out/metacherchant.jar" \
--k 31 \
--coverage 5 \
--maxradius 100000
The following parameters were used: k = 31, coverage = 5, maximum radius = 100 000.
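To make the k and coverage parameters concrete, the sketch below counts k-mers in a set of reads and keeps only those seen at least a minimum number of times. It is a simplified illustration, not MetaCherchant's implementation: it ignores reverse complements and memory-efficient counting, which a real tool may handle differently, and the function name is made up for the example.

```python
from collections import Counter


def solid_kmers(reads, k=31, min_coverage=5):
    """Return the set of k-mers that occur at least min_coverage times."""
    counts = Counter()
    for read in reads:
        # Slide a window of length k over the read.
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return {kmer for kmer, count in counts.items() if count >= min_coverage}
```

Raising the coverage threshold removes rare k-mers, which are more likely to come from sequencing errors, at the cost of losing low-abundance sequences.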
Results visualisation:
The resulting genomic environment can be visualized in Bandage. The extended Hi-C context contains both the plasmid genome and the Salmonella genome, because there are Hi-C links between them. The "1/merged/graph.gfa" file contains the de Bruijn graph for the original context, and the "2/merged/graph.gfa" file contains the de Bruijn graph for the extended Hi-C context. The "2/hic_map.txt" file lists the contig pairs that have Hi-C links. The "2/merged/graph.gfa" and "2/hic_map.txt" files were used to visualize the graph with Hi-C crosslinks in Bandage.
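To compare the original and extended graphs without opening Bandage, a short summary of each GFA file is often enough. The sketch below counts segment (S) and link (L) records and sums the segment lengths. It assumes tab-separated GFA 1 records with the sequence in the third field of each S line; the function name is illustrative.

```python
def summarize_gfa(lines):
    """Count segments and links in GFA text and sum the segment lengths."""
    segments = 0
    links = 0
    total_length = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments += 1
            # "*" means the sequence is not stored in the file.
            if fields[2] != "*":
                total_length += len(fields[2])
        elif fields[0] == "L":
            links += 1
    return {"segments": segments, "links": links, "total_length": total_length}
```

Running it on both graph.gfa files should show the extended graph having at least as many segments as the original one, since the Hi-C step only adds sequence.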
If you use MetaCherchant to build a genomic environment with Hi-C crosslinks in your research, please cite the following publication: Ivanov A.B., Shostina A.D. Development of methods for constructing and visualizing genomic context with Hi-C links in metagenomic data // Collection of abstracts of the Congress of Young Scientists. Electronic edition. St. Petersburg: ITMO University, 2022 (Abstracts). Original title: Разработка методов построения и визуализации геномного контекста с учетом Hi-C связей в метагеномных данных.
Please report any problems directly to the GitHub issue tracker.
Also, you can send your feedback to shostina77@gmail.com or abivanov@itmo.ru.
The MIT License (MIT)