Multilingual named entity recognition (NER) for de-identification
- Operating System: Ubuntu 22.04.3 LTS
- Kernel: Linux 4.18.0-477.27.1.el8_8.x86_64
- Architecture: x86_64
- Python: 3.10.12
Ensure openjdk-8-jdk is installed on your system, as it is required for BRATEval evaluation and the pipeline will not function without it.
# Update the Package Index.
sudo apt update
# Install OpenJDK 8.
sudo apt install openjdk-8-jdk
# Verify the Installation.
java -version
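The same check can also be done from Python before starting the pipeline. The following is a minimal sketch, not part of the project: the helper name java_version_banner is hypothetical, and it only reports whether a java executable is on the PATH and what it prints, without checking that the version is 8.

```python
import shutil
import subprocess
from typing import Optional


def java_version_banner(executable: str = "java") -> Optional[str]:
    """Return the first line printed by `java -version`, or None if Java is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    banner = (result.stderr or result.stdout).strip()
    return banner.splitlines()[0] if banner else ""
```

A return value of None means Java was not found, in which case openjdk-8-jdk should be installed as shown above.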
Install the necessary Python libraries using the following steps:
# Create the environment.
conda create -n DeID-NER python=3.10.12
# Activate the environment.
conda activate DeID-NER
# Go to the project root folder.
cd <path to the project root folder>
# Install external libraries.
pip install -r requirements.txt
# Download the punkt tokenizer.
python -m nltk.downloader punkt
├── config
│   ├── ner_training_config.json (configuration required for training)
│   └── ner_inference_config.json (configuration required for inference)
│
├── data
│   └── <lang> (folder for language-specific datasets)
│       └── <dataset> (folder for dataset-specific files)
│           ├── train (standoff training data)
│           ├── val (standoff validation data)
│           ├── test (optional standoff testing data)
│           └── infer (optional standoff inference files)
│
├── src (source code)
│   ├── inference_helpers.py
│   ├── model_helpers.py
│   └── standoff2bio.py
│
├── 0_preprocessing.py (preprocessing script for standoff data)
├── 1_ner_train.py (training script for NER model)
├── 2_ner_infer.py (inference script for trained NER model)
├── BRATEval-0.0.2-SNAPSHOT.jar (BRAT evaluation tool)
└── requirements.txt (Python dependencies)
The pipeline accepts data only in the standoff format. Ensure your data adheres to the following structure:
├── data
│   └── <lang>
│       └── <dataset>
│           ├── train (standoff training data)
│           ├── val (standoff validation data)
│           └── test (optional: standoff testing data)
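Before preprocessing, it can help to check a dataset folder against this layout. The sketch below is illustrative and not part of the project: check_standoff_layout is a hypothetical helper, and it assumes the brat-style convention that each .ann annotation file sits next to a .txt file with the same name.

```python
from pathlib import Path
from typing import List


def check_standoff_layout(dataset_dir: str) -> List[str]:
    """Return a list of problems found under data/<lang>/<dataset>; empty means OK."""
    root = Path(dataset_dir)
    problems: List[str] = []
    # train and val are required; test is optional.
    for split, required in (("train", True), ("val", True), ("test", False)):
        split_dir = root / split
        if not split_dir.is_dir():
            if required:
                problems.append(f"missing required folder: {split}")
            continue
        # Assumption: every .ann file has a .txt file with the same stem.
        for ann in sorted(split_dir.glob("*.ann")):
            if not ann.with_suffix(".txt").is_file():
                problems.append(f"{split}/{ann.name} has no matching .txt file")
    return problems
```

For example, check_standoff_layout("./data/<lang>/<dataset>") with the placeholders filled in returns an empty list when both required folders exist and every annotation file has its text file.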
If the test folder is not provided, set "test_standoff_path" to null in ./config/ner_training_config.json. A synthetic test folder will be created during preprocessing by copying validation files.
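The config change can be made by hand or with a few lines of Python. This is a minimal sketch under the assumption that "test_standoff_path" is a top-level key of the JSON file; the helper name disable_test_split is hypothetical, and all other keys are left as they are.

```python
import json
from pathlib import Path


def disable_test_split(config_path: str) -> dict:
    """Set "test_standoff_path" to null in the training config and save the file."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Python's None is written out as JSON null.
    config["test_standoff_path"] = None
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return config
```

Note that rewriting the file this way re-indents it with two spaces, so the on-disk formatting may differ from the original even though the values are unchanged.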
Edit the file ./config/ner_training_config.json to configure training parameters.
Run the preprocessing script to prepare the data:
python 0_preprocessing.py ./config/ner_training_config.json
Train the NER model using the training script:
python 1_ner_train.py ./config/ner_training_config.json
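Preprocessing and training can also be chained from a small driver so that training starts only if preprocessing succeeded. The sketch below is an illustration, not a script shipped with the project: build_commands and run_training_pipeline are hypothetical names, and the driver is assumed to run from the project root.

```python
import subprocess
import sys
from typing import List

# Script names taken from the project layout.
TRAINING_SCRIPTS = ("0_preprocessing.py", "1_ner_train.py")


def build_commands(config_path: str) -> List[List[str]]:
    """Build one command per script, using the current Python interpreter."""
    return [[sys.executable, script, config_path] for script in TRAINING_SCRIPTS]


def run_training_pipeline(config_path: str, cwd: str = ".") -> None:
    """Run preprocessing, then training; stop at the first failing step."""
    for command in build_commands(config_path):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, cwd=cwd, check=True)
```

Calling run_training_pipeline("./config/ner_training_config.json") inside the activated DeID-NER environment is then equivalent to running the two commands above in order.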
After training, the model and associated files (e.g., logs, evaluation metrics) will be saved in ./logs/<lang>_<dataset>/<model_name>_<date>_<time>.
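Because each training run gets its own folder, a small helper can locate the most recent one. This is a sketch only: latest_run_dir is a hypothetical name, and it picks the folder by modification time because the exact <date>_<time> format is not specified here.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(lang: str, dataset: str, logs_root: str = "./logs") -> Optional[Path]:
    """Return the most recently modified run folder for <lang>_<dataset>, if any."""
    parent = Path(logs_root) / f"{lang}_{dataset}"
    if not parent.is_dir():
        return None
    runs = [p for p in parent.iterdir() if p.is_dir()]
    # Modification time is used instead of parsing the folder name.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```

The function returns None when no run exists yet for the given language and dataset.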
Edit the file ./config/ner_inference_config.json to configure inference parameters.
Run the inference script to apply the trained NER model:
python 2_ner_infer.py ./config/ner_inference_config.json
Predictions will be saved to the output_path specified in the inference configuration file.
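To inspect the results, the output location can be read back from the same configuration file. The sketch below rests on several assumptions that should be checked against the actual config: that output_path is a top-level key, that it points to a folder, and that predictions are written as standoff .ann files. The helper name list_predictions is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def list_predictions(inference_config_path: str) -> List[Path]:
    """List the .ann files found in the output_path named by the inference config."""
    config = json.loads(Path(inference_config_path).read_text(encoding="utf-8"))
    # Assumption: "output_path" is a top-level key that points to a folder.
    output_dir = Path(config["output_path"])
    # Assumption: predictions are stored as standoff .ann files.
    return sorted(output_dir.glob("*.ann"))
```

A relative output_path is resolved against the current working directory, so the helper should be called from the same folder as 2_ner_infer.py.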