NER-UK 2.0: A Rich Corpus of Named Entities for Ukrainian

Ukrainian version

Data description.

This is the second version of the Ukrainian NER corpus. The data and documentation for the first version are available here.

The labeled corpus is located in the v2.0/data folder. In total, the corpus contains:

  • 560 texts (train: 391, test: 169)
  • 21,993 named entities
  • 13 entity types
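
| Entity type | NashiGroshi | Bruk | Total |
|-------------|------------:|-----:|------:|
| ART         | 319         | 316  | 635   |
| DATE        | 1496        | 551  | 2047  |
| DOC         | 108         | 34   | 142   |
| JOB         | 1344        | 638  | 1982  |
| LOC         | 1380        | 1620 | 3000  |
| MISC        | 102         | 413  | 515   |
| MON         | 897         | 46   | 943   |
| ORG         | 4431        | 782  | 5213  |
| PCT         | 186         | 77   | 263   |
| PERIOD      | 341         | 255  | 596   |
| PERS        | 1820        | 4415 | 6235  |
| QUANT       | 276         | 106  | 382   |
| TIME        | 4           | 36   | 40    |
| Total       | 12704       | 9289 | 21993 |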

The primary data sources are the Open Corpus of Ukrainian Texts (folder bruk) and texts from the publication "Nashi Groshi" (folder ng). Each processed text from the corpus has two files:

  • a file with the extension txt contains the tokenized version of the text
  • a file with the extension ann contains the NER annotations for this text in Brat Standoff Format. Each line of the file contains 3 tab-separated fields: the annotation ID; the entity type followed by the start and end offsets in the tokenized text, separated by spaces; and the entity text
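
The file layout described above can be read with a few lines of Python. This is a minimal sketch, not the repository's own tooling: the `Entity` type, the `parse_ann` helper, and the sample sentence are all hypothetical. It assumes the usual Brat standoff layout, in which the middle field holds the entity type followed by the start and end offsets, and it assumes contiguous spans only.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str   # e.g. "T1"
    label: str    # entity type, e.g. "PERS"
    start: int    # start offset in the tokenized text
    end: int      # end offset (exclusive)
    text: str     # surface form of the entity


def parse_ann(ann_content: str) -> List[Entity]:
    """Parse text-bound (T...) annotations from the content of an .ann file."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip blank lines and any non-entity annotations
        ann_id, span, text = line.split("\t", 2)
        # Assumption: contiguous spans only ("TYPE START END").
        label, start, end = span.split(" ")
        entities.append(Entity(ann_id, label, int(start), int(end), text))
    return entities


# Hypothetical sample, not taken from the corpus.
sample_text = "Тарас Шевченко народився в Моринцях ."
sample_ann = "T1\tPERS 0 14\tТарас Шевченко\nT2\tLOC 27 35\tМоринцях\n"

entities = parse_ann(sample_ann)
for e in entities:
    # The offsets should slice the entity text out of the tokenized text.
    print(e.ann_id, e.label, sample_text[e.start:e.end])
```

Checking that `sample_text[e.start:e.end] == e.text` for every entity is a cheap way to confirm that an ann file and its txt file are in sync.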

Each text was annotated by at least two annotators according to these [rules](doc/README.md), and a third editor resolved any discrepancies between them.

For model training and validation, we recommend using the standard split into DEV and TEST sets.

We provide IOB-converted data using the standard split. During this conversion, we removed nested tags.
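
To illustrate what such a conversion involves, here is a minimal sketch that turns character-offset spans into IOB tags over whitespace-separated tokens. It is not the repository's conversion script. The helpers `drop_nested` and `to_iob` are hypothetical, and the choice to keep the outermost entity when spans are nested is an assumption made for this example.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def drop_nested(spans: List[Span]) -> List[Span]:
    """Keep only spans that do not overlap an earlier, longer span.

    Assumption for this sketch: the outermost entity wins.
    """
    kept: List[Span] = []
    # Sort by start offset; for equal starts, put the longest span first.
    for start, end, label in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if kept and start < kept[-1][1]:
            continue  # overlaps or is nested inside the previous kept span
        kept.append((start, end, label))
    return kept


def to_iob(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Assign an IOB tag to each whitespace-separated token of the text."""
    flat = drop_nested(spans)
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in flat:
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        rows.append((match.group(), tag))
    return rows


# Hypothetical example: a PERS entity nested inside an ORG entity.
text = "Київський національний університет імені Тараса Шевченка"
spans = [
    (0, len(text), "ORG"),
    (text.index("Тараса"), len(text), "PERS"),
]

rows = to_iob(text, spans)
for token, tag in rows:
    print(token, tag)
```

In this example the nested PERS span is dropped, so every token receives an ORG tag. Keeping the innermost span instead would be an equally valid design choice. The source does not state which one the provided IOB data uses.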

The repository also contains scripts for converting data to other formats.

License

This data is available for use under the terms of the "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License".

"Корпус NER-анотацій українських текстів" by lang-uk is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Based on a work at https://github.com/lang-uk/ner-uk.