Skip to content

Latest commit

 

History

History
136 lines (99 loc) · 6.8 KB

README.md

File metadata and controls

136 lines (99 loc) · 6.8 KB

Medical oncept Annotation Tool

Build Status Latest release pypi Version

MedCAT can be used to extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS. Paper on arXiv.

News

Demo

A demo application is available at MedCAT. This was trained on MIMIC-III and all of SNOMED-CT.

Tutorial

A guide on how to use MedCAT is available in the tutorial folder. Read more about MedCAT on Towards Data Science.

Related Projects

  • MedCATtrainer - an interface for building, improving and customising a given Named Entity Recognition and Linking (NER+L) model (MedCAT) for biomedical domain text.
  • MedCATservice - implements the MedCAT NLP application as a service behind a REST API.
  • iCAT - A docker container for CogStack/MedCAT/HuggingFace development in isolated environments.

Install using PIP (Requires Python 3.6+)

  1. Upgrade pip pip install --upgrade pip
  2. Install MedCAT
  • For macOS/linux: pip install --upgrade medcat
  • For Windows (see PyTorch documentation): pip install --upgrade medcat -f https://download.pytorch.org/whl/torch_stable.html
  1. Get the scispacy models:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_md-0.4.0.tar.gz or pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz

  1. Downlad the Vocabulary and CDB from the Models section bellow

  2. Quickstart:

from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.cat import CAT

# Load the vocab model you downloaded
vocab = Vocab.load(vocab_path)
# Load the cdb model you downloaded
cdb = CDB.load('<path to the cdb file>') 

# Create cat - each cdb comes with a config that was used
#to train it. You can change that config in any way you want, before or after creating cat.
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)

# Test it
text = "My simple document with kidney failure"
doc_spacy = cat(text)
# Print detected entities
print(doc_spacy.ents)

# Or to get an array of entities, this will return much more information
#and usually easier to use unless you know a lot about spaCy
doc = cat.get_entities(text)
print(doc)


# To train on one example
_ = cat(text, do_train=True)

# To train on a iterator over documents
data_iterator = <your iterator>
cat.train(data_iterator)

#Once done, save the new CDB
cat.cdb.save(<save path>)

MetaCAT example

from medcat.meta_cat import MetaCAT
# Assume we have a CDB and Vocab object from before
# Download the mc_status model from the models section below and unzip it

mc_status = MetaCAT.load("<path to the unziped mc_status directory>")
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab, meta_cats=[mc_status])

# Now annotate a document, it will have the meta annotation 'status'
doc = cat.get_entities(text)

Models

A basic trained model is made public for the vocabulary and CDB. It is trained for the ~ 35K concepts available in MedMentions.

Vocabulary Download - Built from MedMentions

CDB Download - Built from MedMentions

MetaCAT Status Download - Built from a sample from MIMIC-III, detects is an annotation Affirmed (Positve) or Other (Negated or Hypothetical)

(Note: This was compiled from MedMentions and does not have any data from NLM as that data is not publicaly available.)

SNOMED-CT and UMLS

If you have access to UMLS or SNOMED-CT and can provide some proof (a screenshot of the UMLS profile page is perfect, feel free to redact all information you do not want to share), contact us - we are happy to share the pre-built CDB and Vocab for those databases.

Acknowledgement

Entity extraction was trained on MedMentions In total it has ~ 35K entites from UMLS

The vocabulary was compiled from Wiktionary In total ~ 800K unique words

Powered By

A big thank you goes to spaCy and Hugging Face - who made life a million times easier.

Citation

@ARTICLE{Kraljevic2021-ln,
  title="Multi-domain clinical natural language processing with {MedCAT}: The Medical Concept Annotation Toolkit",
  author="Kraljevic, Zeljko and Searle, Thomas and Shek, Anthony and Roguski, Lukasz and Noor, Kawsar and Bean, Daniel and Mascio, Aurelie and Zhu, Leilei and Folarin, Amos A and Roberts, Angus and Bendayan, Rebecca and Richardson, Mark P and Stewart, Robert and Shah, Anoop D and Wong, Wai Keong and Ibrahim, Zina and Teo, James T and Dobson, Richard J B",
  journal="Artif. Intell. Med.",
  volume=117,
  pages="102083",
  month=jul,
  year=2021,
  issn="0933-3657",
  doi="10.1016/j.artmed.2021.102083"
}