A German dataset for few-shot relation extraction.
The English FewRel dataset is a benchmark for few-shot relation extraction. FewRelDe is the German version of a subset of the FewRel training and validation sets. The bootstrapping process via machine translation was first proposed by Markus Eberts in the master's thesis "Relation Extraction with Attention-Based Transfer Learning". FewRelDe was created for the master's thesis "Few-Shot Relation Extraction for German" by Anna Sauer.
The English instances from FewRel have been automatically translated into German with the help of DeepL Translator. Each translation has been reviewed by hand.
FewRelDe contains a subset of 19 relations from FewRel. Each relation is part of the Wikidata ontology. You can find more information about each relation by looking up its Wikidata id from the following table on wikidata.org:
Wikidata id | German name | English name |
---|---|---|
P6 | Leiter:in der Regierung | head of government |
P22 | Vater | father |
P25 | Mutter | mother |
P26 | Ehepartner:in | spouse |
P27 | Land der Staatsangehörigkeit | country of citizenship |
P84 | Architekt:in | architect |
P102 | Parteizugehörigkeit | member of political party |
P118 | Liga | league |
P177 | Querung | crosses |
P206 | liegt am oder im Gewässer | located in or next to body of water |
P466 | Bewohner:in, Nutzer:in | occupant |
P551 | Wohnsitz | residence |
P740 | Gründungsort | location of formation |
P931 | vom Verkehrsknotenpunkt bedient | place served by transport hub |
P937 | Wirkungsort | work location |
P974 | Nebenfluss | tributary |
P1408 | Sendelizenz in | licensed to broadcast to |
P3373 | Geschwister | sibling |
P4552 | Gebirgszug | mountain range |
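
Because every relation is a Wikidata property, its page can also be reached by URL. The following is a minimal sketch, assuming the usual `https://www.wikidata.org/wiki/Property:<id>` address pattern for property pages. The helper name `wikidata_property_url` is hypothetical and not part of FewRelDe.

```python
import re

# Assumed address pattern for Wikidata property pages.
WIKIDATA_PROPERTY_BASE = "https://www.wikidata.org/wiki/Property:"


def wikidata_property_url(property_id: str) -> str:
    """Return the Wikidata page URL for a property id such as 'P551'."""
    # Property ids are the letter P followed by a positive integer.
    if not re.fullmatch(r"P[1-9]\d*", property_id):
        raise ValueError(f"not a Wikidata property id: {property_id!r}")
    return WIKIDATA_PROPERTY_BASE + property_id


print(wikidata_property_url("P551"))
```

The validation step rejects the `OTHER` label, which has no Wikidata property of its own.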
FewRelDe also contains an OTHER relation for negative instances that do not belong to any of the other 19 relations. These negative instances were randomly selected from the remaining FewRel training and validation relations.
Each of the 20 relations in FewRelDe has 50 instances. The total of labeled relation instances thus amounts to 1,000.
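
The stated sizes (20 relations with 50 instances each, 1,000 in total) can be checked once the data is loaded. The following is a minimal sketch, assuming the JSON file has been parsed into a dictionary that maps each relation name to its list of instances, as described in the format section below. The function names are hypothetical, and the toy dictionary with placeholder keys only stands in for the real file.

```python
from collections import Counter


def instance_counts(data: dict) -> Counter:
    """Count the labeled instances stored under each relation key."""
    return Counter({relation: len(instances) for relation, instances in data.items()})


def check_sizes(data: dict, expected_relations: int = 20,
                expected_per_relation: int = 50) -> bool:
    """Return True if the relation count and per-relation size match."""
    counts = instance_counts(data)
    return (len(counts) == expected_relations
            and all(n == expected_per_relation for n in counts.values()))


# Synthetic stand-in with placeholder relation keys (not the real data).
toy = {f"R{i}": [{} for _ in range(50)] for i in range(20)}
print(check_sizes(toy), sum(instance_counts(toy).values()))
```

With the real file, `data` would come from `json.load` on the opened dataset file.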
The format of the JSON file is modeled after the data format of the FewRel benchmark for few-shot relation extraction. The file contains a dictionary whose keys are the names of the relations in the dataset. For each relation key, the corresponding value is a list of the labeled instances of that relation. This list contains a dictionary for each individual instance with the following keys:
- "tokens": a list with the token string sequence in the sentence
- "h": information on the head entity in a list with
  - a string with the entity mention in lower case
  - a string with the Wikidata id of the entity (cf. wikidata.org). In WissenschaftsSTANDARD, this string is left empty because no entity linking between the head and tail entity and their Wikidata equivalent is performed.
  - a list with a nested list that contains the indices of the entity mention tokens in the sentence
- "t": information on the tail entity in a list with the same structure as "h"
- "ner": a list with information obtained from the task of named entity recognition (NER). The list contains the BIOES entity type tag for each token in the sentence. The BIOES tagging has been created using the German NER model in Stanza with the CoNLL 2003 tag set. This tag set contains the entity types PER (person), ORG (organisation) and LOC (location) (cf. https://stanfordnlp.github.io/stanza/available_models.html#available-ner-models).
Consider the following made-up example:
{
"tokens": ["Robin", "Hood", "lebt", "in", "Sherwood", "Forest", "."],
"h": ["robin hood", "", [[0, 1]]],
"t": ["sherwood forest", "", [[4, 5]]],
"ner": ["B-PER", "E-PER", "O", "O", "B-LOC", "E-LOC", "O"]
}
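
The instance structure can be read programmatically. The following is a minimal sketch, assuming well-formed BIOES tags and the key layout described above. The function names `mention_tokens` and `bioes_spans` are hypothetical, and the instance is a separate made-up illustration rather than an entry from FewRelDe.

```python
from __future__ import annotations


def mention_tokens(instance: dict, key: str) -> list[str]:
    """Return the sentence tokens of the head ('h') or tail ('t') mention."""
    _mention, _wikidata_id, index_groups = instance[key]
    return [instance["tokens"][i] for group in index_groups for i in group]


def bioes_spans(tags: list[str]) -> list[tuple[str, int, int]]:
    """Decode BIOES tags into (entity_type, start, end) spans, end inclusive."""
    spans = []
    start = None  # index of the currently open "B-" tag, if any
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S":  # single-token entity
            spans.append((entity_type, i, i))
            start = None
        elif prefix == "B":  # a multi-token entity begins
            start = i
        elif prefix == "E" and start is not None:  # the open entity ends
            spans.append((entity_type, start, i))
            start = None
        elif prefix == "O":  # outside any entity
            start = None
        # "I" tags simply continue the open entity.
    return spans


# Made-up instance for illustration only.
instance = {
    "tokens": ["Pippi", "Langstrumpf", "wohnt", "in", "der", "Villa", "Kunterbunt", "."],
    "h": ["pippi langstrumpf", "", [[0, 1]]],
    "t": ["villa kunterbunt", "", [[5, 6]]],
    "ner": ["B-PER", "E-PER", "O", "O", "O", "B-LOC", "E-LOC", "O"],
}

print(mention_tokens(instance, "h"))
print(mention_tokens(instance, "t"))
print(bioes_spans(instance["ner"]))
```

Joining the mention tokens with spaces and lowercasing them should reproduce the first element of `"h"` or `"t"`, which makes a simple consistency check for each instance.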
FewRel has been released under a Creative Commons BY-SA 4.0 license (cf. https://thunlp.github.io/1/fewrel1.html). Therefore, FewRelDe is also released under a Creative Commons BY-SA 4.0 license.