Project 7: Cross-linking literature and electronic health records

Description

In this project, we will use large language models (LLMs) to develop an open and extensible infrastructure that links the literature and electronic health records (EHRs) for a range of tasks. EHRs contain a wealth of valuable patient-level information embedded within free-text clinical notes, which are increasingly used as a large-scale source of real-world data in epidemiological research and health data science. The biomedical and healthcare literature, in turn, is a collective knowledge base of research and clinical practice. Several healthcare LLMs have been developed to represent and process free-text data, but often without cross-referencing the literature and EHRs. In this hackathon, we propose to build a proof-of-concept infrastructure in which both the literature and EHRs are processed by a series of LLMs to identify, link and contextualise key entities, and are then cross-referenced using a common information model. By integrating the processing of two key sources of healthcare data, this project fits well with the technical priorities and missions of both HDR UK and ELIXIR-UK.

Two use cases will drive the design and implementation of the project. In the first use case, we will use a suite of LLMs to process the literature and an EHR dataset for key clinical variables (e.g. diagnoses, medications, symptoms, procedures) and provide a common information model to represent the findings. This will allow a RAG-based (retrieval-augmented generation) approach to be developed, so that the literature can be used to help answer questions on EHR datasets. The pre-processed, retrieved literature will be used for in-context learning, providing the LLM with a more comprehensive understanding of the clinical context.
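The first use case can be illustrated with a minimal sketch, assuming that the common information model reduces each processed document to a set of linked entities and that retrieval ranks literature passages by entity overlap with an EHR note. The names `Entity`, `Record`, `retrieve` and `build_prompt`, the concept identifiers, and the example texts are all hypothetical placeholders; the project does not prescribe this schema or retrieval method, and a real pipeline would more likely use LLM-based entity linking and embedding search.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A linked clinical entity; both fields are illustrative assumptions."""
    kind: str        # e.g. "diagnosis", "medication", "symptom", "procedure"
    concept_id: str  # hypothetical identifier from a shared vocabulary


@dataclass
class Record:
    """One processed document: a literature passage or an EHR note."""
    source: str  # "literature" or "ehr"
    doc_id: str
    text: str
    entities: frozenset = field(default_factory=frozenset)


def retrieve(note, passages, k=2):
    """Return up to k passages sharing entities with the note, best first."""
    scored = [(len(note.entities & p.entities), p) for p in passages]
    scored = [(s, p) for s, p in scored if s > 0]
    # Sort by overlap (descending), then by doc_id for a stable order.
    scored.sort(key=lambda sp: (-sp[0], sp[1].doc_id))
    return [p for _, p in scored[:k]]


def build_prompt(question, note, passages):
    """Assemble an in-context prompt from retrieved literature and the note."""
    context = "\n".join(f"[{p.doc_id}] {p.text}" for p in passages)
    return (
        "Literature context:\n" + context + "\n\n"
        "Clinical note:\n" + note.text + "\n\n"
        "Question: " + question + "\nAnswer:"
    )


# Toy, made-up records; no real patient data or article text.
T2DM = Entity("diagnosis", "T2DM")
METFORMIN = Entity("medication", "METFORMIN")
ASTHMA = Entity("diagnosis", "ASTHMA")

note = Record("ehr", "note-1",
              "Patient with type 2 diabetes started on metformin.",
              frozenset({T2DM, METFORMIN}))
passages = [
    Record("literature", "art-1",
           "Toy passage about metformin in type 2 diabetes.",
           frozenset({T2DM, METFORMIN})),
    Record("literature", "art-2",
           "Toy passage about asthma inhalers.",
           frozenset({ASTHMA})),
    Record("literature", "art-3",
           "Toy passage about type 2 diabetes screening.",
           frozenset({T2DM})),
]

hits = retrieve(note, passages, k=2)
prompt = build_prompt("Which treatments are discussed for this condition?",
                      note, hits)

if __name__ == "__main__":
    print(prompt)
```

The resulting prompt would then be passed to whichever LLM the infrastructure uses; that call is left out here because the choice of model is not fixed in the project description.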

If time and resources allow, we will also consider a second use case, in which we will use a suite of clinical LLMs to index the health informatics literature so that we can find evidence of text analytics being used for specific EHR tasks (e.g. specific clinical variables or emerging needs). The primary goal is to identify evidence of how text analytics has been applied to EHRs, and to cross-reference this evidence with the existing models that are integrated in the infrastructure. We will contextualise this information with details such as clinical sub-domains, document types and performance indicators, and use it to suggest specific models for extracting clinical variables from EHR data, such as patient diagnoses, medication histories or lab results. The index is intended to stay relevant as new EHR tasks or healthcare challenges arise (e.g. monitoring of a new clinical metric). This will allow the community to generate insights from the wealth of health informatics literature and apply those findings to real-world EHR tasks, with the aim of improving both research productivity and clinical care outcomes.
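As a rough sketch of how indexed evidence might drive model suggestions in the second use case, the example below assumes each piece of evidence records the paper, the model, the clinical variable, the document type and one reported performance indicator. The `Evidence` fields, the `suggest_models` function, the model names and the scores are hypothetical placeholders, not findings from the literature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One indexed finding from a paper; all fields are assumed for illustration."""
    paper_id: str
    model: str
    variable: str       # clinical variable the model extracts
    document_type: str  # e.g. "discharge summary"
    score: float        # reported performance indicator (placeholder values)


def suggest_models(evidence, variable, integrated, document_type=None):
    """Rank integrated models by their best reported score for a variable."""
    best = {}
    for ev in evidence:
        if ev.variable != variable or ev.model not in integrated:
            continue
        if document_type is not None and ev.document_type != document_type:
            continue
        best[ev.model] = max(best.get(ev.model, 0.0), ev.score)
    # Highest score first; model name breaks ties deterministically.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up evidence entries; the scores do not come from any real paper.
evidence = [
    Evidence("paper-1", "model-a", "medication", "discharge summary", 0.80),
    Evidence("paper-2", "model-b", "medication", "discharge summary", 0.90),
    Evidence("paper-3", "model-c", "medication", "discharge summary", 0.95),
    Evidence("paper-4", "model-a", "diagnosis", "radiology report", 0.70),
    Evidence("paper-5", "model-a", "medication", "discharge summary", 0.85),
]
integrated = {"model-a", "model-b"}  # models available in the infrastructure

suggestions = suggest_models(evidence, "medication", integrated)

if __name__ == "__main__":
    for model, score in suggestions:
        print(model, score)
```

Scores reported in different papers are rarely directly comparable (different datasets, metrics and annotation schemes), so a ranking like this is best read as a pointer to relevant evidence rather than a benchmark.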

As part of the hackathon, we will develop a proof of concept using publicly available datasets: MIMIC-IV (for EHRs) and PubMed Central (for literature), with the aim of porting this approach to real-world data (e.g. UK CRIS) in the future, following relevant governance and trusted processing protocols. In preparing for the hackathon, we will need to consider the available compute resources and possibly scope the application to a specific health sub-domain.

Leads

Goran Nenadic, Angus Roberts, Robert Stewart, Simon Thompson
