This project develops a novel methodology for extracting structured knowledge from biomedical images using Large Language Models (LLMs). We focus specifically on understanding the relationship between COVID-19 and neurodegeneration through analysis of scientific figures and graphical abstracts. The project leverages GPT-4o for image analysis and semantic triple extraction, creating a comprehensive knowledge graph of biomedical relationships.
- Automated collection of relevant biomedical images using Google Image Search
- Multi-stage filtering process using GPT-4o for relevance assessment
- Semantic triple extraction (subject-predicate-object) from images using GPT-4V and GPT-4o
- Knowledge graph construction for COVID-19 and neurodegeneration relationships
- Multi-stage filtering and manual verification to improve the precision of image classification and information extraction
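The relevance-filtering stage can be sketched as a yes/no classification call. This is a minimal illustration, not the repository's implementation: the prompt wording and the `gpt-4o` call shape are assumptions; the answer parser is plain Python:

```python
def is_relevant(answer):
    """Interpret a model reply as a yes/no relevance verdict."""
    return answer.strip().lower().startswith("yes")

def assess_relevance(client, image_url):
    """Ask GPT-4o whether a figure is relevant to COVID-19 and neurodegeneration.

    `client` is an openai.OpenAI instance; prompt text is illustrative.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Is this figure relevant to the relationship between "
                          "COVID-19 and neurodegeneration? Answer 'yes' or 'no'.")},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return is_relevant(response.choices[0].message.content)
```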
```
image-based-information-extraction-LLM/
├── src/
│   ├── data_collection/      # Scripts for Google Image Search automation
│   ├── image_processing/     # Image validation and preprocessing
│   ├── triple_extraction/    # GPT-4 based triple extraction
│   └── knowledge_graph/      # Knowledge graph construction
├── notebooks/
│   ├── Triple_Extraction_GPT4o.ipynb  # Main implementation notebook
│   └── analysis/             # Additional analysis notebooks
├── data/
│   ├── raw/                  # Collected image URLs
│   ├── processed/            # Filtered and validated images
│   └── results/              # Extracted triples and knowledge graphs
└── docs/                     # Documentation and methodology details
```
- Python 3.8+
- OpenAI API access (for GPT-4V and GPT-4o)
- Required Python packages (see `requirements.txt`):

```bash
pip install -r requirements.txt
```
- Clone the repository:

```bash
git clone https://github.com/NeginBabaiha/image-based-information-extraction-LLM.git
cd image-based-information-extraction-LLM
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Configure API keys:

```bash
cp .env.example .env
# Edit .env with your OpenAI API key
```
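Once the key is in place, the application can load it at startup. A minimal sketch using only the standard library, assuming a flat `KEY=value` format for the `.env` file (a library such as python-dotenv could be used instead):

```python
import os

def load_env(path=".env"):
    """Load KEY=value pairs from a .env file into os.environ (later keys win)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without an '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()

if os.path.exists(".env"):
    load_env()
    api_key = os.environ.get("OPENAI_API_KEY")
```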
```python
from src.data_collection import ImageCollector
from src.triple_extraction import TripleExtractor

# Collect candidate image URLs via Google Image Search
collector = ImageCollector()
urls = collector.search("COVID-19 and Neurodegeneration")

# Extract semantic triples (subject-predicate-object) from an image
extractor = TripleExtractor()
triples = extractor.process_image(urls[0])
```
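Under the hood, triple extraction amounts to a vision prompt plus response parsing. A minimal sketch, not the repository's implementation: the prompt wording, the `gpt-4o` call shape, and the pipe-separated output format are assumptions; the parsing helper is pure Python:

```python
from typing import List, Tuple

def parse_triples(text):
    """Parse 'subject | predicate | object' lines from a model response."""
    triples: List[Tuple[str, str, str]] = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples

def extract_triples(client, image_url):
    """Ask GPT-4o for semantic triples describing a biomedical figure.

    `client` is an openai.OpenAI instance; prompt text is illustrative.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("List the biomedical relationships shown in this figure, "
                          "one per line, formatted as: subject | predicate | object")},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return parse_triples(response.choices[0].message.content)
```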
Our workflow consists of several key stages:
- Data Collection: Automated collection of 6,319 image URLs using Google Image Search
- URL Processing: Validation and accessibility checking (3,614 valid URLs)
- Relevance Assessment:
  - First run: 626 relevant URLs
  - Second run: 567 refined URLs
  - Manual verification: 289 final images
- Triple Extraction: Using GPT-4o for semantic relationship extraction
- Knowledge Graph Construction: Building structured representations of biomedical relationships
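The final stage can be sketched with the standard library alone: extracted triples accumulate into an adjacency structure keyed by subject. This is an illustration, assuming triples arrive as `(subject, predicate, object)` tuples; the example triples are invented for demonstration, not extracted results, and the repo's `src.knowledge_graph` module may use a dedicated graph library instead:

```python
from collections import defaultdict

def build_graph(triples):
    """Build a knowledge graph as subject -> [(predicate, object)] adjacency lists."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)

# Illustrative triples only (not actual extraction results)
triples = [
    ("SARS-CoV-2", "induces", "neuroinflammation"),
    ("neuroinflammation", "contributes_to", "neurodegeneration"),
    ("SARS-CoV-2", "crosses", "blood-brain barrier"),
]
kg = build_graph(triples)
```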
- Negin Babaiha (negin.babaiha@scai.fraunhofer.de)
- Elizaveta Popova (elizaveta.popova@uni-bonn.de)