License: cc-by-nc-nd-4.0
This is the official code for HistRED: A Historical Document-Level Relation Extraction Dataset (ACL 2023). All materials related to this paper can be found here.
- ACL Anthology: Official proceeding publication
- Virtual-ACL 2023: You can view papers, posters, and presentation slides.
- arXiv: The camera-ready version of the paper.
Note that this dataset is released under the CC BY-NC-ND 4.0 license. The dataset is available on the Hugging Face Hub and can be loaded with the datasets library:
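```python
from datasets import load_dataset

# Download (or load from the local cache) the HistRED dataset from the Hugging Face Hub.
dataset = load_dataset("Soyoung/HistRED")
```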
Due to the complexity of the dataset, we replace the dataset preview with an example figure. The text in the figure is translated into English for comprehension (*). Unlike the figure, however, the dataset itself contains only Korean and Hanja text and no English translation. For readability, the figure shows only one relation.
Relation information includes (i) subject and object entities for Korean and Hanja (sbj_kor, sbj_han, obj_kor, obj_han), (ii) a relation type (label), (iii) evidence sentence index(es) for each language (evidence_kor, evidence_han). Metadata contains additional information, such as which book the text is extracted from.
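As a rough illustration of the relation fields listed above, the sketch below builds one hypothetical relation record and reads it back for each language. The field names come from the description; the values, the flat dictionary layout, and the helper `describe_relation` are assumptions made for illustration and may not match the exact schema returned by `load_dataset`.

```python
# Hypothetical relation record: the field names follow the description above,
# but the values and the flat dict layout are illustrative assumptions.
relation = {
    "sbj_kor": "subject (Korean)",
    "sbj_han": "subject (Hanja)",
    "obj_kor": "object (Korean)",
    "obj_han": "object (Hanja)",
    "label": "example_relation",
    "evidence_kor": [0, 2],  # evidence sentence indices in the Korean text
    "evidence_han": [1],     # evidence sentence indices in the Hanja text
}


def describe_relation(rel, lang):
    """Return (subject, label, object, evidence indices) for 'kor' or 'han'."""
    if lang not in ("kor", "han"):
        raise ValueError("lang must be 'kor' or 'han'")
    return (
        rel[f"sbj_{lang}"],
        rel["label"],
        rel[f"obj_{lang}"],
        list(rel[f"evidence_{lang}"]),
    )


print(describe_relation(relation, "kor"))
print(describe_relation(relation, "han"))
```

The relation type is shared across languages, while the entity mentions and evidence indices are stored per language, which is why the helper selects fields by language suffix.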
In this dataset, we choose Yeonhaengnok, a collection of records originally written in Hanja (classical Chinese writing) and later translated into Korean. Joseon, the last dynastic kingdom of Korea, lasted just over five centuries, from 1392 to 1897, and many aspects of Korean traditions and customs trace their roots back to this era. Numerous historical documents exist from the Joseon dynasty, including the Annals of Joseon Dynasty (AJD) and the Diaries of the Royal Secretariats (DRS). Note that the majority of Joseon's records were written in Hanja, an archaic Chinese writing that differs from modern Chinese, because the Korean language had not been standardized until much later.
In short, Yeonhaengnok is a travel diary from the Joseon period. In the past, traveling to other places, particularly to foreign countries, was rare. Therefore, intellectuals who traveled to Chung (also referred to as the Qing dynasty) meticulously documented their journeys, and Yeonhaengnok is a compilation of these accounts. Diverse individuals from different generations recorded their business trips following similar routes from Joseon to Chung, focusing on the people, products, and events they encountered. The Institute for the Translation of Korean Classics (ITKC) has open-sourced the original texts and their translations for many historical documents, promoting active historical research. All documents were collected from an open-source database at https://db.itkc.or.kr/.
- Our dataset contains (i) named entities, (ii) relations between the entities, and (iii) parallel relationships between Korean and Hanja texts.
- dataset.py returns a processed dataset that can be easily applied to general NLP models.
  - For the monolingual setting: KoreanDataset, HanjaDataset
  - For the bilingual setting: JointDataset
- ner_map.json and label_map.json are mapping dictionaries from label class to index.
- Sequence level (SL) is a unit of sequence length for extracting self-contained sub-texts without losing context information for each relation in the text. Each folder SL-k indicates that SL is k.
- Testbed for evaluating the model performance when varying the sequence length.
- Relation extraction tasks, especially on non-English or historical corpora.
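To make the role of the label maps concrete, here is a minimal sketch of encoding and decoding labels with a class-to-index dictionary. The example labels, the inline JSON string standing in for label_map.json, and the helper names `encode` and `decode` are all hypothetical; the actual files may be structured differently.

```python
import json

# Stand-in for the contents of label_map.json (hypothetical labels and indices).
label_map_text = '{"no_relation": 0, "example_relation_a": 1, "example_relation_b": 2}'
label_to_index = json.loads(label_map_text)

# Invert the mapping so that predicted indices can be decoded back to label classes.
index_to_label = {index: label for label, index in label_to_index.items()}


def encode(labels):
    """Map label class names to integer indices."""
    return [label_to_index[label] for label in labels]


def decode(indices):
    """Map integer indices back to label class names."""
    return [index_to_label[index] for index in indices]


encoded = encode(["example_relation_b", "no_relation"])
print(encoded)
print(decode(encoded))
```

The same pattern would apply to an entity-type map such as ner_map.json, assuming it is also a flat class-to-index dictionary.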
@inproceedings{yang-etal-2023-histred,
title = "{H}ist{RED}: A Historical Document-Level Relation Extraction Dataset",
author = "Yang, Soyoung and
Choi, Minseok and
Cho, Youngwoo and
Choo, Jaegul",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.180",
pages = "3207--3224",
}