This is a clone of the original repository from the ONTOX GitHub page, created for portfolio visibility. Find the original GitHub page here.
This repository employs the seq2rel method from the paper "A sequence-to-sequence approach for document-level relation extraction" to extract relationships between chemicals and adverse outcomes described in scientific literature. Although the project was eventually discontinued, the following outlines the outcomes and current status:
The project's ultimate objective was to utilize the generative relationship extraction method described in the paper to extract and label relationships between chemicals and adverse outcomes. The notable advantage of this method is its ability to express discontinuous mentions, coreferent mentions, and N-ary relationships.
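To make this concrete, a seq2rel-style target string for a single document might look like the example below. It follows the CDR-style schema from the paper, where coreferent mentions of the same entity are separated by semicolons and each entity and relation ends with a special token; the mention text here is purely illustrative.

```
carbamazepine ; CBZ @CHEMICAL@ acute liver failure @DISEASE@ @CID@
```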
The plan for this project was as follows:
- Reproduce the results of the paper by fine-tuning a Huggingface model on the same dataset.
- Determine the relationship annotation schema, potentially using Ensemble Biclustering (EBC). EBC would identify distinct relationship groups within the corpus for manual labeling.
- Create a dataset matching the chosen relationship annotation schema.
- Train the model on this new dataset.
```mermaid
graph LR
A["Reproduce results with<br> Huggingface model"] --> B["Determine relationship<br> annotation schema"]
B --> C["Create own dataset"]
C --> D["Fine-tune on<br> own dataset"]
```
At the time of discontinuation, the first step had been completed: reproducing results with a Huggingface model. Specifically, two training scripts were created to train a google/T5 model on the CDR dataset in seq2rel format. The first training script (run.py) was implemented using Sacred to improve experiment reproducibility, although it does not closely follow Huggingface programming conventions. The second training script (run_ds.py) enables distributed training with DeepSpeed.
Because the models generate linearized text rather than conventional structured predictions, a custom evaluation method was necessary. This method converts text structured according to the annotation schema into relation triples, accounting for coreferent mentions. It works well for relation extraction, but the entity recognition measures are still suboptimal. The evaluation method is defined in this script and is explained and tested in this notebook.
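To illustrate what such an evaluation has to do, the sketch below parses a seq2rel-style prediction into chemical-disease relation pairs while grouping coreferent mentions. It is a minimal illustration assuming CDR-style special tokens and is not the repository's evaluation code; see the linked script and notebook for the actual implementation.

```python
import re

def parse_seq2rel(text: str) -> list[tuple[tuple[str, ...], tuple[str, ...]]]:
    """Turn a seq2rel-style string into (chemical mentions, disease mentions) pairs."""
    relations = []
    # Each relation is terminated by the relation token @CID@.
    for rel_chunk in re.findall(r"(.*?@CID@)", text):
        # Entities end with their type token; capture (mentions, type) pairs.
        entities = re.findall(r"(.*?)@(CHEMICAL|DISEASE)@", rel_chunk)
        chemicals, diseases = (), ()
        for mentions, ent_type in entities:
            # Coreferent mentions are separated by ";"; normalise case and whitespace.
            mention_set = tuple(m.strip().lower() for m in mentions.split(";") if m.strip())
            if ent_type == "CHEMICAL":
                chemicals = mention_set
            else:
                diseases = mention_set
        if chemicals and diseases:
            relations.append((chemicals, diseases))
    return relations

print(parse_seq2rel("carbamazepine ; CBZ @CHEMICAL@ acute liver failure @DISEASE@ @CID@"))
# [(('carbamazepine', 'cbz'), ('acute liver failure',))]
```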
The results after fine-tuning on the CDR dataset:
Date | Script | Model | RE Precision | RE Recall | RE F1-score | Unstructured |
---|---|---|---|---|---|---|
04-03-2024 | run.py | t5-large | 28.86% | 27.54% | 28.18% | 4.2% |
14-04-2024 | run.py | t5-3b | 74.71% | 74.71% | 69.41% | 2.6% |
16-05-2024 | run_ds.py | t5-11b | 77.78% | 38.89% | 51.85% | 98.24% |
Note:
- The results from the model trained on 04-03-2024 are inaccurate due to bugs in the evaluation method at the time of testing, which deflated the scores.
- The results from the model trained on 16-05-2024 are also inaccurate: during evaluation the model could only produce output with a maximum length of 20 tokens (see the snippet below), leading to lower-than-expected scores. The correct scores were noted in a SURF directory, but at the time of writing I am unable to access it.
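For context, 20 tokens is the default generation length in Huggingface transformers, so this kind of truncation typically happens when no explicit limit is passed during evaluation. A hypothetical illustration of raising the limit (not taken from the repository's scripts):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical example: without max_new_tokens (or max_length), transformers
# falls back to a ~20-token limit, which truncates the linearized relations.
tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

inputs = tokenizer("Document describing a chemical and an adverse outcome.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)  # raise the generation limit
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```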
During the project two environments were used: a pip environment and a Conda environment. The Conda environment was created because it was needed to make use of DeepSpeed.
Create the environment:
```
python -m venv venv
```
Install torch manually according to the installation guide:
```
pip install torch
```
Copy the environment:
```
pip install -r requirements.txt
```
Because Conda sometimes installs platform-specific packages, two environment files are available:
- environment.yml
- environment_cross_platoform.yml
Create the environment:
```
conda env create -f environment.yml
```
The repository has two training scripts: run.py and run_ds.py. run.py was the initial script; it has an implementation of Sacred, which makes the experiments more reproducible. The information about these experiments can be found in sacred_runs. At the end of this line of experiments DeepSpeed was being introduced, and run_ds.py was created to make use of it. run_ds.py is a modified version of this example Huggingface training script.
The repository holds the code to fine-tune a Huggingface seq2seq model. You can fine-tune a model with the following command:
```
python run.py
```
This will start the training loop and train according to the config defined in run.py.
You can also pass a config file on the command line:
```
python run.py with path/to/config.yaml
```
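A config passed this way is a YAML file whose keys update the Sacred config defined in run.py. The keys below are purely hypothetical and should be checked against the config section of run.py:

```yaml
# Hypothetical keys for illustration only; use the names actually defined in run.py.
model_name: t5-large
learning_rate: 1.0e-4
num_train_epochs: 10
```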
Used configs:
The code makes use of the Sacred module, which automatically saves information about each run, making the experiments more reproducible. This means run.py has all the features of a Sacred experiment.
You can see some of its functionality with:
```
python run.py --help
```
You can also print the config:
```
python run.py print_config
```
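Sacred additionally allows overriding individual config values on the command line with the same `with` syntax. For example, the random seed that Sacred adds to every experiment can be set directly:

```
python run.py with seed=42
```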
To train with DeepSpeed, launch run_ds.py through the deepspeed launcher:
```
deepspeed run_ds.py path/to/config.yaml
```
Used configs: