Welcome to the MetaFX wiki!
Metagenomic feature extraction and classification toolkit, version 0.1.0.
The idea behind MetaFX is to introduce a feature extraction algorithm specific to metagenomic short-read data. It is capable of processing hundreds of samples of 1-10 Gb each. The distinct property of the suggested approach is the construction of meaningful features, which can not only be used to train classification models, but can also be further annotated and biologically interpreted.
- JRE 1.8 or higher
- Python 3
- Python libraries for the classification task
If you choose to build contigs via the third-party SPAdes software, please follow their installation instructions (not recommended for first-time use).
To run MetaFX you need the scripts from the `bin/` folder. The main script to run is `metafx.sh`.
Scripts have been tested under Ubuntu 18.04 LTS, but should generally work on Linux/MacOS.
The pipeline was executed on a cluster node with an Intel Xeon Gold 6242 CPU (2.80 GHz). We used 32 threads and 128 GB RAM.
Running times for each step are approximate and are given as rough waiting-time estimates. The most time-consuming steps, 1 and 2, are well parallelized, and their running times can be improved by adding more threads.
Initially, the presented feature extraction method was developed for the Metagenomics Diagnosis for Inflammatory bowel disease Challenge (MEDIC). Here we use a subset of a metagenomic dataset from IBD patients for example purposes.
The analysed data was collected as a part of the Integrative Human Microbiome Project and is stored in the Inflammatory Bowel Disease Multi'omics Database. We took samples from the Schirmer et al. 2018 article and further filtered them, resulting in 54 samples. They are distributed into three categories: 23 CD (Crohn's disease), 17 UC (ulcerative colitis), and 14 nonIBD (control samples).
The metadata table for the example dataset is in `data/Class_labels.txt`. The samples can be found and downloaded by this link.
Download the 54 samples from the IBD dataset and unpack them into the `data/` folder.
```
cd data/
for i in `tail -n +2 Class_labels.txt | cut -f2` ; do
    wget https://ibdmdb.org/tunnel/static/HMP2/WGS/1818/${i}.tar
    tar -xf ${i}.tar
done
cd ../
```
As a result, the `data/` folder will contain 54 samples with paired-end reads in the format `<sample>_[R1|R2].fastq.gz`.
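As an optional sanity check (a minimal sketch, assuming all archives unpacked correctly), you can verify that all read files are present:

```
# expect 108: 54 samples x 2 paired-end files
ls data/*_R[12].fastq.gz | wc -l
```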
Split the input reads into k-mers for each sample independently. We set the k-mer length with `-k 31` and filtered out all k-mers found fewer than 5 times per sample with `-b 4`.
```
bin/metafast.sh -t kmer-counter-many -p 32 -m 128G -k 31 -b 4 -i data/*.fastq.gz -w kmer-counter-many
```
As a result, the `kmer-counter-many/kmers/` folder will contain 54 files storing k-mers in binary format.
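Again, a quick optional check (assuming the run finished without errors) is to count the produced binary files:

```
# expect 54: one .kmers.bin file per sample
ls kmer-counter-many/kmers/*.kmers.bin | wc -l
```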
Select unique k-mers present in at least a given number G of samples of one category and absent in all samples of the other categories. By default, G=1; it can be user-defined to iterate over the range [`--min-samples`; `--max-samples`].
Sample names for each category are in the files `data/[cd|uc|nonibd]_filelist.txt`, and we use the `sed` command to turn them into paths to the respective k-mer files.
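To see what this `sed` pipeline does, you can run it on its own (a minimal illustration; the exact sample names depend on your `data/cd_filelist.txt`):

```
# prepend the kmers folder and append the .kmers.bin suffix to each sample name,
# e.g. <sample1>  ->  kmer-counter-many/kmers/<sample1>.kmers.bin
sed -e 's/^/kmer-counter-many\/kmers\//' data/cd_filelist.txt | sed -e 's/$/.kmers.bin/' | head -n 2
```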
First, we submit the CD files as input and the UC and nonIBD files as the other categories (`--filter-kmers`) to obtain unique k-mers for the CD group. We let G iterate from 1 to 10.
```
bin/metafast.sh -t unique-kmers-multi -p 32 -m 128G -k 31 -w unique_kmers_cd \
--min-samples 1 --max-samples 10 \
-i `sed -e 's/^/kmer-counter-many\/kmers\//' data/cd_filelist.txt | sed -e 's/$/.kmers.bin/'` \
--filter-kmers `sed -e 's/^/kmer-counter-many\/kmers\//' data/uc_filelist.txt data/nonibd_filelist.txt | sed -e 's/$/.kmers.bin/'`
```
The resulting unique k-mers are in the files `unique_kmers_cd/kmers/filtered_G.kmers.bin`, one file per value of G.
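Since G iterates from 1 to 10 here, there should be ten such files (a hedged expectation based on the `--min-samples`/`--max-samples` range above):

```
# expect filtered_1.kmers.bin ... filtered_10.kmers.bin
ls unique_kmers_cd/kmers/filtered_*.kmers.bin
```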
Likewise, we repeat the operation twice to obtain unique k-mers for UC and nonIBD groups.
```
bin/metafast.sh -t unique-kmers-multi -p 32 -m 128G -k 31 -w unique_kmers_uc \
--min-samples 1 --max-samples 10 \
-i `sed -e 's/^/kmer-counter-many\/kmers\//' data/uc_filelist.txt | sed -e 's/$/.kmers.bin/'` \
--filter-kmers `sed -e 's/^/kmer-counter-many\/kmers\//' data/cd_filelist.txt data/nonibd_filelist.txt | sed -e 's/$/.kmers.bin/'`
```

```
bin/metafast.sh -t unique-kmers-multi -p 32 -m 128G -k 31 -w unique_kmers_nonibd \
--min-samples 1 --max-samples 10 \
-i `sed -e 's/^/kmer-counter-many\/kmers\//' data/nonibd_filelist.txt | sed -e 's/$/.kmers.bin/'` \
--filter-kmers `sed -e 's/^/kmer-counter-many\/kmers\//' data/cd_filelist.txt data/uc_filelist.txt | sed -e 's/$/.kmers.bin/'`
```
Now we should have three folders containing files with unique k-mers at various thresholds, one per group: `unique_kmers_cd/kmers/`, `unique_kmers_uc/kmers/`, and `unique_kmers_nonibd/kmers/`. To choose the optimal threshold value for each category, we analyse the number of selected unique k-mers. Based on the information from the log files, we can construct the following table:
| # of unique k-mers | CD | UC | nonIBD |
|---|---|---|---|
| G=1 | 290 305 004 | 173 368 403 | 194 372 801 |
| G=2 | 29 361 767 | 8 088 872 | 17 592 216 |
| G=3 | 5 950 400 | 638 053 | 2 343 764 |
| G=4 | 1 738 998 | 59 181 | 618 736 |
| G=5 | 459 717 | 6 579 | 206 860 |
| G=6 | 86 875 | 488 | 63 643 |
| G=7 | 14 892 | 20 | 16 635 |
| G=8 | 2 056 | 0 | 2 977 |
| G=9 | 260 | 0 | 200 |
| G=10 | 46 | 0 | 0 |
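One hedged way to collect these numbers yourself (the exact log location and message wording depend on the MetaFast version, so treat this as a sketch to adapt) is to search the run's output for the reported k-mer counts:

```
# the log layout and message format are assumptions; adjust the pattern to your logs
grep -ri "k-mers" unique_kmers_cd/ | grep -i "filtered"
```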
With the final goal of constructing an approximately equal number of features for each group, the optimal values are G=6 for the CD and nonIBD categories and G=4 for the UC category. With these values we are ready for step 3.
Having selected unique k-mers for the groups of samples, we are ready to construct metagenomic features. There are two approaches, both based on the de Bruijn graph. One of them utilizes the MetaFast routine for extracting the local environment around given k-mers. The other is based on a SPAdes assembly of the reads containing the selected k-mers.
The SPAdes approach takes significantly more time, especially on small and medium-sized datasets. For large datasets, however, it might give slightly better results for classification models. In any case, it is not recommended to try step 3b until you have obtained and analyzed results using step 3a.
Construct a de Bruijn graph from the k-mers of all samples for a given category, and perform a local walk in this graph, starting from the unique k-mers (`--pivot`) for the same category.
```
bin/metafast.sh -t component-extractor -p 32 -m 128G -k 31 -w components_cd \
--pivot unique_kmers_cd/kmers/filtered_6.kmers.bin \
-i `sed -e 's/^/kmer-counter-many\/kmers\//' data/cd_filelist.txt | sed -e 's/$/.kmers.bin/'`
```
The resulting specific components for the CD group are in the file `components_cd/components.bin`.
Likewise, we repeat the operation twice to obtain specific components for UC and nonIBD groups.
```
bin/metafast.sh -t component-extractor -p 32 -m 128G -k 31 -w components_uc \
--pivot unique_kmers_uc/kmers/filtered_4.kmers.bin \
-i `sed -e 's/^/kmer-counter-many\/kmers\//' data/uc_filelist.txt | sed -e 's/$/.kmers.bin/'`
```

```
bin/metafast.sh -t component-extractor -p 32 -m 128G -k 31 -w components_nonibd \
--pivot unique_kmers_nonibd/kmers/filtered_6.kmers.bin \
-i `sed -e 's/^/kmer-counter-many\/kmers\//' data/nonibd_filelist.txt | sed -e 's/$/.kmers.bin/'`
```
After this step, we have 7 621 CD components, 5 557 UC components, and 4 794 nonIBD components.
To use SPAdes for feature extraction, we need to perform several steps:
- Translate the unique k-mers from binary format into plain text and save them into the files `unique_kmers_[cd|uc|nonibd].txt`.
```
bin/metafast.sh -t view -m 128G -p 32 -k 31 -w unique_kmers_cd_txt -kf unique_kmers_cd/kmers/filtered_6.kmers.bin -o unique_kmers_cd.txt
bin/metafast.sh -t view -m 128G -p 32 -k 31 -w unique_kmers_uc_txt -kf unique_kmers_uc/kmers/filtered_4.kmers.bin -o unique_kmers_uc.txt
bin/metafast.sh -t view -m 128G -p 32 -k 31 -w unique_kmers_nonibd_txt -kf unique_kmers_nonibd/kmers/filtered_6.kmers.bin -o unique_kmers_nonibd.txt
```
- For each category, select from all samples only those reads which contain unique k-mers. The reads for each combination of sample and category are placed in the `reads_with_kmers/` folder.
```
mkdir reads_with_kmers
python bin/reads_filter.py -k 31 -p 32 -w reads_with_kmers --reads data/*.fastq.gz --kmers unique_kmers_cd.txt unique_kmers_uc.txt unique_kmers_nonibd.txt
```
- Join the reads from all samples for each category into one file.
```
cat reads_with_kmers/*_class_0.fasta > reads_with_kmers/reads_cd.fasta
cat reads_with_kmers/*_class_1.fasta > reads_with_kmers/reads_uc.fasta
cat reads_with_kmers/*_class_2.fasta > reads_with_kmers/reads_nonibd.fasta
```
- Assemble contigs from the reads of each category independently. The resulting contigs are in the files `spades_contigs_[cd|uc|nonibd]/contigs.fasta`.
```
<path_to_spades>/spades.py -t 32 -m 128G -s reads_with_kmers/reads_cd.fasta --only-assembler -o spades_contigs_cd
<path_to_spades>/spades.py -t 32 -m 128G -s reads_with_kmers/reads_uc.fasta --only-assembler -o spades_contigs_uc
<path_to_spades>/spades.py -t 32 -m 128G -s reads_with_kmers/reads_nonibd.fasta --only-assembler -o spades_contigs_nonibd
```
NB! If you have done this step, you do not need to run step 5. You already have contigs in fasta format to analyze.
- Transform fasta contigs into binary format for feature calculation.
```
bin/metafast.sh -t seq2comp -m 128G -p 32 -k 31 -w components_cd -i spades_contigs_cd/contigs.fasta
bin/metafast.sh -t seq2comp -m 128G -p 32 -k 31 -w components_uc -i spades_contigs_uc/contigs.fasta
bin/metafast.sh -t seq2comp -m 128G -p 32 -k 31 -w components_nonibd -i spades_contigs_nonibd/contigs.fasta
```
That's it! The components for each category in binary format are in the files `components_[cd|uc|nonibd]/components.bin`, and we are ready for step 4.
After step 3, we have components for each category, which will play the role of features in the subsequent analysis. In order to use these features in machine learning algorithms, we need to map each feature to a numerical value for each sample. We take the selected components for a given category and calculate the coverage of each component by k-mers from each sample independently.
```
bin/metafast.sh -t features-calculator -p 32 -m 128G -k 31 -w features_cd -cm components_cd/components.bin \
-ka kmer-counter-many/kmers/*.kmers.bin \
--selected unique_kmers_cd/kmers/filtered_6.kmers.bin
```
As a result, the CD feature values (breadth of coverage) for each sample are stored in the files `features_cd/vectors/<sample>.breadth`.
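To get a feel for the output (a hedged check: it assumes each `.breadth` file stores one coverage value per component per line, which may differ between versions), the line count of each vector should match the number of CD components:

```
# expect 7621 lines per file, one per CD component (assumption about the file format)
wc -l features_cd/vectors/*.breadth | head
```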
Likewise, we repeat the operation twice to obtain feature values for UC and nonIBD groups.
```
bin/metafast.sh -t features-calculator -p 32 -m 128G -k 31 -w features_uc -cm components_uc/components.bin \
-ka kmer-counter-many/kmers/*.kmers.bin \
--selected unique_kmers_uc/kmers/filtered_4.kmers.bin
```

```
bin/metafast.sh -t features-calculator -p 32 -m 128G -k 31 -w features_nonibd -cm components_nonibd/components.bin \
-ka kmer-counter-many/kmers/*.kmers.bin \
--selected unique_kmers_nonibd/kmers/filtered_6.kmers.bin
```
Now, for each sample we have three files with numerical features for the CD, UC, and nonIBD categories. From this point we are ready to use machine learning for the classification task.
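If you want to gather the per-sample vectors into a single table before loading them into your ML tooling, a minimal sketch could look like this (it again assumes one value per line in the `.breadth` files; `features_cd_matrix.tsv` is a hypothetical output name):

```
# build a TSV with one row per sample: sample name, then its CD feature values
for f in features_cd/vectors/*.breadth ; do
    s=$(basename "$f" .breadth)
    printf '%s\t%s\n' "$s" "$(paste -s "$f")"
done > features_cd_matrix.tsv
```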
You can use your favourite algorithm for this purpose. If you are unsure what to do, there are two Jupyter Notebook examples with a detailed description of the training steps and classification results:
It might be relevant to analyse the extracted features from a biological point of view. For taxonomic and functional annotation, we need the nucleotide sequences corresponding to each component.
```
bin/metafast.sh -t comp2seq -p 32 -m 128G -k 31 -w contigs_cd -cf components_cd/components.bin
```
The resulting CD contigs are in `contigs_cd/seq-builder-many/sequences/component.seq.fasta`.
Likewise, we repeat the operation twice to obtain contigs for UC and nonIBD groups.
```
bin/metafast.sh -t comp2seq -p 32 -m 128G -k 31 -w contigs_uc -cf components_uc/components.bin
bin/metafast.sh -t comp2seq -p 32 -m 128G -k 31 -w contigs_nonibd -cf components_nonibd/components.bin
```
NB! If you want to analyse the contigs for a specific component rather than for all components, the `--split` flag might help. Be careful: it saves the contigs for each component into a separate file, so there may be a lot of files afterwards.
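For example, a hedged invocation for the CD components (`contigs_cd_split` is a hypothetical working directory name):

```
# one fasta file per component will be written instead of a single combined file
bin/metafast.sh -t comp2seq -p 32 -m 128G -k 31 -w contigs_cd_split -cf components_cd/components.bin --split
```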
Please report any problems directly to the GitHub issue tracker.
Also, you can send your feedback to abivanov@itmo.ru.
Authors:
- Software: Artem Ivanov (ITMO University)
- Supervisor: Vladimir Ulyantsev (ITMO University)
The MIT License (MIT)
- MetaFast – a toolkit for comparison of metagenomic samples.
- MetaCherchant – a tool for analysing genomic environment within a metagenome.
- RECAST – a tool for sorting reads per their origin in metagenomic time series.