
An implementation of Lexical Unit Analysis (LUA) for sequence segmentation tasks (e.g., Chinese POS Tagging). Note that this is not an officially supported Tencent product.

## Preparation

Preparation takes two steps. First, reformulate the chunking datasets and move them into a new folder named "dataset". The folder contains {train, dev, test}.json, where each JSON file is a list of dicts. See the following NER example:

```json
[
 {
  "sentence": "['Somerset', '83', 'and', '174', '(', 'P.', 'Simmons']",
  "labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O'), (2, 2, 'O'), (3, 3, 'O'), (4, 4, 'O'), (5, 6, 'PER')]"
 },
 {
  "sentence": "['Leicestershire', '22', 'points', ',', 'Somerset', '4', '.']",
  "labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O'), (2, 2, 'O'), (3, 3, 'O'), (4, 4, 'ORG'), (5, 5, 'O'), (6, 6, 'O')]"
 }
]
```
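Note that the "sentence" and "labeled entities" fields are stored as stringified Python literals, so they must be decoded (e.g., with `ast.literal_eval`) after the JSON is parsed. A minimal loader sketch, assuming the field names shown in the example above (`load_split` is a hypothetical helper, not part of this repo):

```python
import ast
import json
import os
import tempfile

def load_split(path):
    """Load one {train, dev, test}.json split and decode its string-encoded fields."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    examples = []
    for rec in records:
        tokens = ast.literal_eval(rec["sentence"])          # list of token strings
        spans = ast.literal_eval(rec["labeled entities"])   # list of (start, end, label)
        examples.append((tokens, spans))
    return examples

# Build a tiny split on disk to show the round trip.
sample = [{
    "sentence": "['Somerset', '83']",
    "labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O')]",
}]
path = os.path.join(tempfile.mkdtemp(), "train.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

tokens, spans = load_split(path)[0]
print(tokens)  # ['Somerset', '83']
print(spans)   # [(0, 0, 'ORG'), (1, 1, 'O')]
```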

Second, prepare the pretrained LM (i.e., BERT) and the evaluation script. Create another directory, "resource", with the following arrangement:

```
resource
├── pretrained_lm
│   ├── model.pt
│   └── vocab.txt
└── conlleval.pl
```
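The skeleton of this layout can be scaffolded with a few lines of Python; the file names match the tree above, but the real model.pt and vocab.txt must be supplied from the pretrained checkpoint (the placeholders below only mark where they go):

```python
from pathlib import Path

def scaffold_resource(root="resource"):
    """Create the directory skeleton expected by main.py; model files are added separately."""
    lm_dir = Path(root) / "pretrained_lm"
    lm_dir.mkdir(parents=True, exist_ok=True)
    # Empty placeholders: replace with the real model.pt / vocab.txt
    # extracted from the pretrained LM checkpoint.
    for name in ("model.pt", "vocab.txt"):
        (lm_dir / name).touch()
    (Path(root) / "conlleval.pl").touch()
    return lm_dir

lm_dir = scaffold_resource()
print(sorted(p.name for p in lm_dir.iterdir()))  # ['model.pt', 'vocab.txt']
```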

For Chinese tasks, "pretrained_lm" can be constructed from bert-base-chinese.

## Training and Test

```shell
CUDA_VISIBLE_DEVICES=0 python main.py -dd dataset -sd dump -rd resource
```
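The three flags point main.py at the data, dump, and resource directories. A hypothetical parser matching the command above (the long option names and help strings are illustrative assumptions, not taken from the repo):

```python
import argparse

def build_parser():
    # Short flags -dd/-sd/-rd match the command above; long names are illustrative.
    parser = argparse.ArgumentParser(description="Train and evaluate the LUA model.")
    parser.add_argument("-dd", "--data_dir", default="dataset",
                        help="folder holding train/dev/test.json")
    parser.add_argument("-sd", "--save_dir", default="dump",
                        help="folder for checkpoints and outputs")
    parser.add_argument("-rd", "--resource_dir", default="resource",
                        help="folder with pretrained_lm/ and conlleval.pl")
    return parser

args = build_parser().parse_args(["-dd", "dataset", "-sd", "dump", "-rd", "resource"])
print(args.data_dir, args.save_dir, args.resource_dir)  # dataset dump resource
```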

## Citation

```bibtex
@inproceedings{li-etal-2021-segmenting-natural,
    title = "Segmenting Natural Language Sentences via Lexical Unit Analysis",
    author = "Li, Yangming  and  Liu, Lemao  and  Shi, Shuming",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.18",
    doi = "10.18653/v1/2021.findings-emnlp.18",
    pages = "181--187",
}
```