Deciphering self-resistance genes in microbial biosynthetic gene clusters to combat AMR

Team Nassar Members:

Maaly Nassar (Team Leader), Elsevier
Parul Sharma (Flex Lead), Emory University, Atlanta-GA, USA
Dae-young Kim (Technical Lead), Children's National Hospital, Washington D.C., USA
Madeline Galac (Writer Lead), NIAID BCBB, Washington D.C., USA
Brendan Jeffrey, NIAID BCBB, Corvallis, OR, USA

Project Goals

Efforts to treat antimicrobial resistance (AMR) have mostly focused on discovering new antimicrobial drugs (e.g. by-products of biosynthetic gene clusters, BGCs) or on mutating AMR genes and mechanisms so that they lose their function (i.e. loss of function). Thus, the goal of this project is to:

Identify self-resistance (SR) genes in antimicrobial-producing microorganisms, along with their mechanisms and regulators, in the literature and in the corresponding whole genome sequences using large language models (LLMs).
Fine-tune machine learning models (e.g. transformers or LLMs) to identify AMR and SR genes in whole genome sequences using NCBI Pathogen Detection isolates, MicroBIGG-E and the LLM-derived SR dataset.
Use the AMR detection ML model to identify self-resistance genes, and the SR detection model to identify AMR genes, to check for horizontal gene transfer and the possibility of AMR up-/downregulation by SR regulators.
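The cross-application described in the last goal can be sketched as follows. This is only an illustration under stated assumptions: the `Scorer` type, the `cross_flag` and `cross_check` helpers, and the 0.5 threshold are hypothetical names and values, not part of this repository, and each model is assumed to expose a function that maps a protein sequence to a probability.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a trained model wrapped as "sequence -> probability".
Scorer = Callable[[str], float]


def cross_flag(sequences: Dict[str, str], scorer: Scorer,
               threshold: float = 0.5) -> List[str]:
    """Return the IDs of sequences that the given model scores at or above threshold."""
    return sorted(
        seq_id for seq_id, seq in sequences.items() if scorer(seq) >= threshold
    )


def cross_check(sr_genes: Dict[str, str], amr_genes: Dict[str, str],
                amr_scorer: Scorer, sr_scorer: Scorer,
                threshold: float = 0.5) -> Dict[str, List[str]]:
    """Run each model on the other model's gene set.

    Genes flagged by the "wrong" model are candidates for follow-up,
    e.g. as possible horizontal gene transfer events.
    """
    return {
        "sr_genes_flagged_as_amr": cross_flag(sr_genes, amr_scorer, threshold),
        "amr_genes_flagged_as_sr": cross_flag(amr_genes, sr_scorer, threshold),
    }
```

Any hit returned by such a check would be a candidate for manual review rather than evidence of gene transfer in itself.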

Deliverables

ML classifier for AMR genes trained on the NCBI Pathogen Detection dataset
Self-resistance microbial genes with their related accessions, self-resistance mechanisms, self-resistance compounds and microbes, extracted using few-shot LLM prompting and GraphRAG

Results

LLM-derived self-resistance entities

PMC full texts were retrieved for the 150 self-resistance PubMed abstracts retrieved by Leroy, then split into introduction, methods, results and discussion sections with their corresponding sentences.
All sections were annotated with the EMERALD BGCs pipeline to identify BGC genes, accessions, actions, compounds and classes.
The Llama 70B Instruct model was then used to identify self-resistance genes, proteins, mechanisms, regulators, accessions and organisms in BGC-annotated sentences using few-shot prompting.
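As a rough illustration of the few-shot prompting step, the sketch below assembles a prompt from worked examples followed by the sentence to annotate. The instruction wording, the JSON output format, the `build_prompt` helper and the example sentence (with its placeholder gene, compound and organism names) are all assumptions made for illustration; the prompts actually used by the team are not reproduced here.

```python
import json
from typing import Dict, List

# Entity types named in the text above.
FIELDS = ["genes", "proteins", "mechanisms", "regulators", "accessions", "organisms"]

# Made-up example with placeholder names; not taken from the project data.
FEW_SHOT_EXAMPLES: List[Dict] = [
    {
        "sentence": "In strain X, geneA encodes an efflux pump that exports "
                    "compound Y and is repressed by regR.",
        "entities": {
            "genes": ["geneA"],
            "proteins": ["efflux pump"],
            "mechanisms": ["efflux"],
            "regulators": ["regR"],
            "accessions": [],
            "organisms": ["strain X"],
        },
    },
]


def build_prompt(sentence: str, examples: List[Dict] = FEW_SHOT_EXAMPLES) -> str:
    """Build a few-shot prompt: instructions, worked examples, then the query."""
    lines = [
        "Extract self-resistance entities from the text.",
        "Return JSON with keys: " + ", ".join(FIELDS) + ".",
        "",
    ]
    for ex in examples:
        lines.append("Sentence: " + ex["sentence"])
        lines.append("Entities: " + json.dumps(ex["entities"]))
        lines.append("")
    # The model is expected to continue after the final "Entities:" line.
    lines.append("Sentence: " + sentence)
    lines.append("Entities:")
    return "\n".join(lines)
```

The resulting string would be sent to the instruct model for each BGC-annotated sentence, and the JSON in the reply parsed into the entity fields.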

Identifying genes to use in machine learning models

We selected the bacterial genomes for model building from the NCBI Pathogen Detection Isolates Browser, restricting the selection to ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa and Enterobacter spp.) with complete genomes. This yielded 8,267 unique genomes. The protein sequences in these genomes were then categorized as AMR or non-AMR using the NCBI Pathogen Detection Microbial Browser for Identification of Genetic and Genomic Elements (MicroBIGG-E). We found 180 AMR genes and 118,199 non-AMR genes to use for ML training.
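The resulting dataset is heavily imbalanced (180 AMR versus 118,199 non-AMR genes). The sketch below shows one common way to label proteins and derive class weights from those counts. It is an assumption for illustration only: this README does not state how the imbalance was handled, and the `label_proteins` and `balanced_class_weights` helpers are hypothetical. The weighting formula, n_samples / (n_classes * class_count), is the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def label_proteins(protein_ids: Iterable[str], amr_ids: Set[str]) -> Dict[str, int]:
    """Label each protein 1 (AMR) if its ID is in the AMR set, else 0 (non-AMR)."""
    return {pid: int(pid in amr_ids) for pid in protein_ids}


def balanced_class_weights(labels: Iterable[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


# Class counts reported in the text above: 180 AMR, 118,199 non-AMR.
labels = [1] * 180 + [0] * 118_199
weights = balanced_class_weights(labels)
```

With weights like these, each AMR example contributes several hundred times more to the training loss than a non-AMR example, which is one way to keep a classifier from simply predicting the majority class.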

Future Work

NCBI Codeathon Disclaimer

This software was created as part of an NCBI codeathon, a hackathon-style event focused on rapid innovation. While we encourage you to explore and adapt this code, please be aware that NCBI does not provide ongoing support for it.

For general questions about NCBI software and tools, please visit: NCBI Contact Page

NCBI-Codeathons/amr-2024-team-nassar