Microbial General Model (MGM) is a large-scale pretrained language model designed for interpretable microbiome data analysis. MGM allows for fine-tuning and evaluation across various microbiome data analysis tasks.
pip install microformer-mgm
Install the MGM package using setup.py
:
python setup.py install
MGM can be utilized via the command line interface (CLI) with different modes. The general syntax is:
mgm <mode> [options]
Converts input abundance data to a count matrix at the Genus level, normalizes it using phylogeny, and constructs a microbiome corpus. The corpus represents each sample as a sentence from high rank genus to low rank genus.
Input: Data in hdf5, csv, or tsv format (features in rows, samples in columns)
Output: A pkl file containing the microbiome corpus
Example:
mgm construct -i infant_data/abundance.csv -o infant_corpus.pkl
For hdf5 files, specify the key using
-k
(default key isgenus
).
Pretrains the MGM model using the microbiome corpus by causal language modeling. Optionally, you can train the generator by providing a label file. If the label file is provided, the tokenized label will be added following the <bos> token, meanwhile, the tokenizer will be updated and the model's embedding layer will be expanded.
Input: Corpus from construct
mode
Output: Pretrained MGM model
Examples:
mgm pretrain -i infant_corpus.pkl -o infant_model
mgm pretrain -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -o infant_model_gen --with-label
Use
--from-scratch
to train the model from scratch instead of loading pretrained weights.
Trains a supervised MGM model from sratch, requiring labeled data.
Input: Corpus from construct
mode, label file (csv)
Output: Supervised MGM model
Example:
mgm train -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -o infant_model_clf
Finetunes the MGM model with pre-trained weight to fit a new task, using labeled data and optionally a customized MGM model.
Input: Corpus from construct
mode, label file (csv), pretrained model (optional)
Output: Finetuned MGM model
Example:
mgm finetune -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -m infant_model -o infant_model_clf_finetune
Predicts labels of input data using a fine-tuned MGM model. If a label file is provided, prediction results will be compared with the ground truth using various metrics.
Input: Corpus from construct
mode, label file (optional), supervised MGM model
Output: Prediction results in csv format
Example:
mgm predict -E -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -m infant_model_clf -o infant_prediction.csv
Generates synthetic microbiome data using the pretrained MGM model. A prompt file is required for generating samples with specific labels.
Input: Pretrained MGM model
Output: Synthetic genus tensors in pickle format
Example:
mgm generate -m infant_model_gen -p infant_data/prompt.txt -n 100 -o infant_synthetic.pkl
Reconstruct abundance from ranked corpus.
Input: Abundance file for train reconstructor or trained model in ckpt; Ranked corpus for reconstruct; Get label's tokenizer in generator if there is; Prompt if there is label in corpus
Output: Reconstructed corpus ; Reconstructor model; Decoded label
mgm reconstruct -a infant_data/abundance.csv -i infant_synthetic.pkl -g infant_model_generate -w True -o reconstructor_file
mgm reconstruct -r reconstructor_file/reconstructor_model.ckpt -i infant_synthetic.pkl -g infant_model_generate -w True -o reconstructor_file
For detailed usage of each mode, refer to the help message:
mgm <mode> --help
Name | Organization | |
---|---|---|
Haohong Zhang | haohongzh@gmail.com | PhD Student, School of Life Science and Technology, Huazhong University of Science & Technology |
Zixin Kang | 29590kang@gmail.com | Undergraduate, School of Life Science and Technology, Huazhong University of Science & Technology |
Kang Ning | ningkang@hust.edu.cn | Professor, School of Life Science and Technology, Huazhong University of Science & Technology |