abbreviation-detector

Code to train classifiers for abbreviation detection and expansion in context. This repository also contains the evaluation code that accompanies the paper "Dealing with Abbreviations in the Slovenian Biographical Lexicon", to be presented at the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022).

Installation

Clone the repository:

git clone git@github.com:angel-daza/abbreviation-detector.git

Create a new environment:

conda create -n abbr-detector python=3.9
conda activate abbr-detector

Install Requirements:

pip install -r requirements

Paper Results

Abbreviation Detection

Create the Dataset Train/Dev/Test Partitions:

python3 slovene_abbr_preprocess.py
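For readers unfamiliar with the idea of fixed Train/Dev/Test partitions, the following is a minimal sketch of a reproducible split. It is not the logic of `slovene_abbr_preprocess.py`: the function name `split_dataset`, the 80/10/10 ratios, and the seed are all assumptions chosen for illustration.

```python
import random


def split_dataset(items, dev_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle a copy of `items` with a fixed seed and cut it into three parts.

    Hypothetical helper: the ratios and the seed are illustrative defaults,
    not values taken from the repository.
    """
    shuffled = list(items)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_ratio)
    n_test = int(len(shuffled) * test_ratio)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_dataset(range(100))
    print(len(train), len(dev), len(test))
```

Because the seed is fixed, running the split twice yields identical partitions, which is what makes reported results comparable across runs.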

To Reproduce the Baseline Results:

python3 naive_baselines.py
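To give an idea of what a naive baseline for this task can look like, here is a minimal sketch that combines a lexicon memorized from training data with a trailing-period heuristic. It is an assumption for illustration only and does not claim to match what `naive_baselines.py` implements; the function names, the input format (parallel lists of tokens and labels), and the example tokens are hypothetical. The label names `ABBR` and `NO_ABBR` follow the classifier description in this README.

```python
from collections import Counter


def build_lexicon(train_sentences):
    """Collect tokens labeled ABBR more often than NO_ABBR in training data.

    `train_sentences` is assumed to be an iterable of (tokens, labels) pairs.
    """
    counts = Counter()
    for tokens, labels in train_sentences:
        for token, label in zip(tokens, labels):
            counts[token.lower()] += 1 if label == "ABBR" else -1
    return {token for token, score in counts.items() if score > 0}


def predict(tokens, lexicon):
    """Label a token ABBR if it is in the lexicon or ends with a period."""
    labels = []
    for token in tokens:
        in_lexicon = token.lower() in lexicon
        # A lone "." is sentence punctuation, so require more than one character.
        looks_abbreviated = len(token) > 1 and token.endswith(".")
        labels.append("ABBR" if in_lexicon or looks_abbreviated else "NO_ABBR")
    return labels


if __name__ == "__main__":
    lexicon = build_lexicon([(["dr", "Novak"], ["ABBR", "NO_ABBR"])])
    print(predict(["Dr", "Novak", "prof.", "."], lexicon))
```

A baseline of this kind is cheap to run and gives a lower bound against which the BERT classifier can be compared.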

To Reproduce the BERT Abbreviation Classifier Results:

# 1) Train the Binary BERT Classifier [ABBR, NO_ABBR]
python3 bert_token_classifier.py -t data/sbl-51abbr.tok.train.json -d data/sbl-51abbr.tok.dev.json \
     --bert_model 'EMBEDDIA/sloberta' --save_model_dir saved_models/BERT_ABBR_876972 \
     --epochs 5 --batch_size 32 --info_every 10 --seed_val 876972

# 2) Make predictions using the BERT Classifier
python3 bert_token_classifier_predict.py -m saved_models/BERT_ABBR_876972 --bert_model 'EMBEDDIA/sloberta' \
     --epoch 1 --test_path data/sbl-51abbr.tok.test.json --gold_labels True
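When gold labels are available, predictions from a binary `[ABBR, NO_ABBR]` tagger are commonly scored with precision, recall, and F1 on the `ABBR` class. The sketch below shows that computation under the assumption that gold and predicted labels are flat, aligned lists; the function name is hypothetical, and the metrics actually reported by `bert_token_classifier_predict.py` may be computed differently.

```python
def precision_recall_f1(gold, pred, positive="ABBR"):
    """Token-level precision, recall, and F1 for the `positive` label.

    Assumes `gold` and `pred` are aligned lists of label strings.
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["ABBR", "NO_ABBR", "ABBR", "NO_ABBR"]
    pred = ["ABBR", "ABBR", "NO_ABBR", "NO_ABBR"]
    print(precision_recall_f1(gold, pred))
```

Scoring only the `ABBR` class avoids the inflated numbers that plain accuracy would give when most tokens are not abbreviations.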

Abbreviation Expansion

To Reproduce the BERT Abbreviation Expansion Results:

python3 bert_abbrev_expansion.py