NLP-project

This is a PyTorch/GPU implementation of the final project Hierarchical Multi-label Classification in Natural Language Processing and Deep Learning class. Feel free to contact guyuntian@stu.pku.edu.cn if you have any issues.

Overview

Since the problem is hierarchical classification, it is not appropriate to simply do a softmax on all labels, ignoring the hierarchical relationship between labels. In this project, I adopt a simple procedure suited to the problem: predict the label one by one, from general label to specific label.
Instead of giving each classification task a classification head, I adopt a more efficient method: concatenate the label and the text, and let the model outputs a score to decide whether they are the same.
The max length of input text is 35092. However, I naively truncate them at max length 512, hopping the model can distinguish between classes after seeing the beginning.
I adopt post-pretraining to get "bert-base-chinese" familiar with datasets.

Procedure of getting scores for labels

Procedure of predicting

Code Usage

To separate validation set from training set, run the following code:

bash scripts/split.sh

To pre-train , run the following code on GPU:

bash scripts/pretrain.sh

To fine-tune, run the following code:

bash scripts/run.sh

To get the threshold score for choosing which level of label the model predicts, run the following code:

bash scripts/threshold.sh

Make sure to set the threshold in "scripts/predict.sh" to the best threshold you get. To get the final prediction, run the following code:

bash scripts/predict.sh

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
img		img
scripts		scripts
.gitignore		.gitignore
README.md		README.md
get_threshold.py		get_threshold.py
main.py		main.py
predict.py		predict.py
pretrain.py		pretrain.py
split_train_val.py		split_train_val.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NLP-project

Overview

Procedure of getting scores for labels

Procedure of predicting

Code Usage

About

Releases

Packages

Languages

guyuntian/NLP-project

Folders and files

Latest commit

History

Repository files navigation

NLP-project

Overview

Procedure of getting scores for labels

Procedure of predicting

Code Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages