This repository contains code to fine-tune BERT-based models on Named Entity Recognition (NER) downstream tasks. Apart from providing the code, the repository also provides results for several biomedical datasets, as well as the fine-tuned models, which I have uploaded to the Hugging Face model hub.
If you are only interested in using the models, you can do so directly from the 🤗/transformers library; the model used in the examples below is available as `fran-martinez/scibert_scivocab_cased_ner_jnlpba`.
Use the pipeline:

```python
from transformers import pipeline

text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."

nlp_ner = pipeline("ner",
                   model='fran-martinez/scibert_scivocab_cased_ner_jnlpba',
                   tokenizer='fran-martinez/scibert_scivocab_cased_ner_jnlpba')

nlp_ner(text)

"""
Output:
---------------------------
[
 {'word': 'glucocorticoid',
  'score': 0.9894881248474121,
  'entity': 'B-protein'},
 {'word': 'receptor',
  'score': 0.989505410194397,
  'entity': 'I-protein'},
 {'word': 'normal',
  'score': 0.7680378556251526,
  'entity': 'B-cell_type'},
 {'word': 'cs',
  'score': 0.5176806449890137,
  'entity': 'I-cell_type'},
 {'word': 'lymphocytes',
  'score': 0.9898491501808167,
  'entity': 'I-cell_type'}
]
"""
```
Or load model and tokenizer as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Example
text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."

# Load model
tokenizer = AutoTokenizer.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
model = AutoModelForTokenClassification.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")

# Get input for BERT
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)

# Predict
with torch.no_grad():
    outputs = model(input_ids)

# From the output, let's take the first element of the tuple.
# Then, let's get rid of the [CLS] and [SEP] tokens (first and last).
predictions = outputs[0].argmax(axis=-1)[0][1:-1]

# Map label class indexes to string labels.
for token, pred in zip(tokenizer.tokenize(text), predictions):
    print(token, '->', model.config.id2label[pred.numpy().item()])

"""
Output:
---------------------------
mouse -> O
thymus -> O
was -> O
used -> O
as -> O
a -> O
source -> O
of -> O
glucocorticoid -> B-protein
receptor -> I-protein
from -> O
normal -> B-cell_type
cs -> I-cell_type
lymphocytes -> I-cell_type
. -> O
"""
```
- The script `train_ner.py` is ready to use to train an end-to-end BERT-based NER model. You just need to download the data and place it in the corresponding directory (`./data/JNLPBA/`), or change the path within `train_ner.py`.
- The script `find_learning_rate.py` can be used to find an initial learning rate based on the range test detailed in Cyclical Learning Rates for Training Neural Networks. It makes use of the PyTorch implementation `pytorch-lr-finder`. The output given by the `BertForTokenClassification` implementation from the `transformers` library is not compatible with `pytorch-lr-finder`. For this reason, I have created the `BertForTokenClassificationCustom` class, which is a subclass of `torch.nn.Module`. It instantiates `BertForTokenClassification` and changes the output of the `forward` method in order to make BERT compatible with `pytorch-lr-finder`. In other words, the output is reshaped to `(batch_size, num_classes, sequence_len)`, which is directly compatible with `torch.nn.CrossEntropyLoss`. I have mimicked the way of loading the model to be the same as for BERT models from `transformers`:
```python
from nn_utils.neural_architectures import BertForTokenClassificationCustom

nerbert = BertForTokenClassificationCustom.from_pretrained('allenai/scibert_scivocab_uncased')
```
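For reference, here is a minimal sketch of the idea behind such a wrapper, assuming standard `transformers`/`torch` APIs; it mirrors the class name but is not the repository's exact implementation:

```python
import torch
from transformers import BertForTokenClassification


class BertForTokenClassificationCustom(torch.nn.Module):
    """Sketch of a wrapper that reshapes token-classification logits so they can be
    fed directly to torch.nn.CrossEntropyLoss, as pytorch-lr-finder expects.
    Illustrative only, not the repository's exact code."""

    def __init__(self, bert: BertForTokenClassification):
        super().__init__()
        self.bert = bert

    @classmethod
    def from_pretrained(cls, model_name_or_path, **kwargs):
        # Mimic the transformers loading interface.
        return cls(BertForTokenClassification.from_pretrained(model_name_or_path, **kwargs))

    def forward(self, inputs):
        # Assumption: pytorch-lr-finder passes the first element of each dataset
        # tuple, i.e. a list with (input_ids, attention_mask, token_type_ids).
        input_ids, attention_mask, token_type_ids = inputs
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        logits = outputs[0]            # (batch_size, sequence_len, num_classes)
        return logits.permute(0, 2, 1)  # (batch_size, num_classes, sequence_len)
```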
The three main elements needed to train a NER model are the dataset, the model and tokenizer, and the trainer.

The dataset is represented through the `NerDataset` class. It is a subclass of `torch.utils.data.dataset.Dataset` and has a `__getitem__` method that returns the BERT input and labels for a given index. `NerDataset` has a boolean argument, `bert_hugging`, which provides two different behaviours. If `True`, the data returned by `__getitem__` is a Python dictionary with `input_ids`, `attention_mask`, `token_type_ids` and `labels` tensors. This format is used during NER training (`train_ner.py`), since it is compatible with `BertForTokenClassification` from the `transformers` library. During training, `BertForTokenClassification` estimates the loss inside the `forward` method, so labels are passed as input; during inference there is no need to provide them.

If `bert_hugging=False`, the returned data is a tuple with two elements. The first one is a list of tensors with BERT's input (`input_ids`, `attention_mask`, `token_type_ids`). The second is the tensor with the labels. This format is compatible with `pytorch-lr-finder` and is used in `find_learning_rate.py`.

Internally, `NerDataset` calls a function `data2tensors` that transforms input examples into tensors. The input examples are represented as a list of a dataclass called `DataSample` that contains words and labels. The output tensors are stored in a list of a dataclass called `InputBert`.
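As an illustration only (the real class lives in the repository and additionally handles padding, truncation and sub-word label alignment), a stripped-down dataset with the `bert_hugging` switch might look like this; the `InputBertExample` dataclass below is a hypothetical stand-in for the repo's `InputBert`:

```python
from dataclasses import dataclass
from typing import List

import torch
from torch.utils.data import Dataset


@dataclass
class InputBertExample:  # hypothetical stand-in for the repo's InputBert dataclass
    input_ids: torch.Tensor
    attention_mask: torch.Tensor
    token_type_ids: torch.Tensor
    labels: torch.Tensor


class SimpleNerDataset(Dataset):
    """Minimal sketch of the NerDataset idea, not the repository's exact code."""

    def __init__(self, examples: List[InputBertExample], bert_hugging: bool = True):
        self.examples = examples
        self.bert_hugging = bert_hugging

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        if self.bert_hugging:
            # Format consumed by BertForTokenClassification (loss computed in forward).
            return {"input_ids": ex.input_ids,
                    "attention_mask": ex.attention_mask,
                    "token_type_ids": ex.token_type_ids,
                    "labels": ex.labels}
        # Format consumed by pytorch-lr-finder: (list of inputs, labels) tuple.
        return [ex.input_ids, ex.attention_mask, ex.token_type_ids], ex.labels
```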
The model and tokenizer are represented through the `BertForTokenClassification` and `AutoTokenizer` classes from the `transformers` library.
The trainer is represented through the `BertTrainer` class. It is responsible for performing a complete training and evaluation loop in PyTorch and is specially designed for BERT-based models from the `transformers` library. It allows saving the tokenizer and the model from the epoch with the best F1-score. The class optionally generates reports and figures with the obtained results, which are automatically stored on disk.

The report for the model from the best epoch is saved in a file called `classification_report.txt` within the `output_dir` directory and contains the following info:
- Classification report at span/entity level (for the validation dataset).
- Classification report at word level (for the validation dataset).
- Epoch where the best model was found (best F1-score on the validation dataset).
- Training loss from the best epoch.
- Validation loss from the best epoch.

Optionally, the class can print the validation examples (sentences) where the model makes at least one mistake. They are printed at the end of each epoch, which is very useful for inspecting the behaviour of your model. This is how the printout looks:
```
TOKEN           LABEL         PRED
immunostaining  O             O
showed          O             O
the             O             O
estrogen        B-cell_type   B-cell_type
receptor        I-cell_type   I-cell_type
cells           I-cell_type   O
·
·
·
synovial        O             O
tissues         O             O
.               O             O
```
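For illustration, a sketch of how such a TOKEN/LABEL/PRED printout can be produced from aligned token, gold-label and prediction lists; the real formatting lives inside `BertTrainer`, and the helper name here is hypothetical:

```python
def print_sentence_mistakes(tokens, gold_labels, predictions):
    """Print a TOKEN/LABEL/PRED table, but only for sentences that contain
    at least one wrong prediction. Illustrative helper, not the repo's API."""
    if all(gold == pred for gold, pred in zip(gold_labels, predictions)):
        return  # skip sentences predicted perfectly
    print(f"{'TOKEN':<20}{'LABEL':<15}{'PRED':<15}")
    for token, gold, pred in zip(tokens, gold_labels, predictions):
        print(f"{token:<20}{gold:<15}{pred:<15}")


# Example usage with (a truncated version of) the sentence shown above:
print_sentence_mistakes(
    ["estrogen", "receptor", "cells"],
    ["B-cell_type", "I-cell_type", "I-cell_type"],
    ["B-cell_type", "I-cell_type", "O"],
)
```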
Another thing worth mentioning about this class is the input argument `accumulate_grad_every`. This parameter sets how often you want to accumulate the gradient, which is useful when memory constraints limit the batch size. Let's say that your GPU only fits the model with a batch size of 8 and you want to try a batch size of 32. Then, you should set this parameter to 4 (8*4=32). Internally, a loop is run 4 times accumulating the gradient at each step, and only then are the network parameters updated. In the end, this is equivalent to training your network with a batch size of 32. The batch size itself is inferred from the `dataloader_train` argument.
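A minimal sketch of this gradient-accumulation scheme, assuming a dataloader that yields the dictionary format described above (`bert_hugging=True`); it is illustrative, not the exact `BertTrainer` loop:

```python
def train_one_epoch(model, dataloader_train, optimizer, accumulate_grad_every=4):
    """Accumulate gradients over several small batches before each optimizer step,
    emulating a larger effective batch size (illustrative sketch)."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader_train, start=1):
        outputs = model(**batch)                    # loss is computed inside forward()
        loss = outputs[0] / accumulate_grad_every   # scale so accumulated gradients average out
        loss.backward()
        if step % accumulate_grad_every == 0:
            optimizer.step()                        # update parameters every N mini-batches
            optimizer.zero_grad()
```

With a per-batch size of 8 and `accumulate_grad_every=4`, the parameters are updated as if the batch size were 32.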
The pre-trained model used is `allenai/scibert_scivocab_cased`.
SciBERT is a pretrained language model based on BERT and trained by the Allen Institute for AI on papers from the corpus of Semantic Scholar. Corpus size is 1.14M papers, 3.1B tokens. SciBERT has its own vocabulary (scivocab) that's built to best match the training corpus.
The corpus used to fine-tune the NER is the BioNLP / JNLPBA shared task.
- Training data consist of 2,000 PubMed abstracts with term/word annotation. This corresponds to 18,546 samples (sentences).
- Evaluation data consist of 404 PubMed abstracts with term/word annotation. This corresponds to 3,856 samples (sentences).
The classes (at word level) and their distribution (number of examples for each class) in the training and evaluation datasets are shown below:
Class Label | # training examples | # evaluation examples |
---|---|---|
O | 382,963 | 81,647 |
B-protein | 30,269 | 5,067 |
I-protein | 24,848 | 4,774 |
B-cell_type | 6,718 | 1,921 |
I-cell_type | 8,748 | 2,991 |
B-DNA | 9,533 | 1,056 |
I-DNA | 15,774 | 1,789 |
B-cell_line | 3,830 | 500 |
I-cell_line | 3,830 | 989 |
B-RNA | 951 | 118 |
I-RNA | 1,530 | 187 |
An exhaustive hyperparameter search was done. The hyperparameters that provided the best results are:
- Max length sequence: 128
- Number of epochs: 6
- Batch size: 32
- Dropout: 0.3
- Optimizer: Adam
The learning rate used was 5e-5 with a linearly decreasing schedule. A warmup was applied at the beginning of training, with a warmup ratio of 0.1 of the total training steps.
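A sketch of how such a schedule can be set up with the `transformers` helper `get_linear_schedule_with_warmup`; this assumes `model` and `dataloader_train` already exist, and the exact wiring in `train_ner.py` may differ:

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

num_epochs = 6
total_steps = num_epochs * len(dataloader_train)  # one optimizer step per batch assumed
warmup_steps = int(0.1 * total_steps)             # warmup ratio of 0.1

optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=warmup_steps,
                                            num_training_steps=total_steps)

# During training, call scheduler.step() after each optimizer.step() so the
# learning rate warms up for 10% of the steps and then decays linearly.
```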
The model from the epoch with the best F1-score was selected; in this case, the model from epoch 5.
The following table shows the evaluation metrics calculated at span/entity level:
| | precision | recall | f1-score |
|---|---|---|---|
| cell_line | 0.5205 | 0.7100 | 0.6007 |
| cell_type | 0.7736 | 0.7422 | 0.7576 |
| protein | 0.6953 | 0.8459 | 0.7633 |
| DNA | 0.6997 | 0.7894 | 0.7419 |
| RNA | 0.6985 | 0.8051 | 0.7480 |
| micro avg | 0.6984 | 0.8076 | 0.7490 |
| macro avg | 0.7032 | 0.8076 | 0.7498 |
The macro F1-score is 0.7498, compared to the value of 0.7728 reported by the Allen Institute for AI in their paper. This drop in performance could be due to several reasons, but one hypothesis is that the authors used an additional conditional random field on top of SciBERT, while this model uses a regular classification layer with softmax activation.
At word level, this model achieves a precision of 0.7742, a recall of 0.8536 and an F1-score of 0.8093.
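For reference, span/entity-level metrics like the ones above are commonly computed with the `seqeval` library; the following is a minimal, illustrative example, not necessarily the exact evaluation code used in this repository:

```python
from seqeval.metrics import classification_report, f1_score

# One list of IOB2 labels per validation sentence (toy example).
y_true = [["O", "B-protein", "I-protein", "O", "B-cell_type", "I-cell_type"]]
y_pred = [["O", "B-protein", "I-protein", "O", "B-cell_type", "O"]]

print(classification_report(y_true, y_pred, digits=4))  # per-entity precision/recall/F1
print("Entity-level F1:", f1_score(y_true, y_pred))
```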