The evaluation dataset for contextualized lexical simplification was collected as part of Eliza Hobo's MSc thesis.
The dataset contains 96 sentences, each followed by its complex word and the possible simplifications.
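As an illustration of this sentence / complex word / simplifications structure, here is a minimal Python sketch for reading such records. The file layout is an assumption, not something documented here: the sketch supposes a tab-separated file with hypothetical column names `sentence`, `complex_word`, and `simplifications`, with alternatives separated by semicolons. The sample row is invented for the example and is not taken from the dataset. Adjust the names and separators to match the actual release.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Item:
    sentence: str
    complex_word: str
    simplifications: list[str]


def read_items(handle, delimiter="\t", alt_sep=";"):
    """Parse rows with the ASSUMED columns sentence, complex_word, simplifications."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    items = []
    for row in reader:
        # Split the alternatives and drop empty entries and stray whitespace.
        alternatives = [s.strip() for s in row["simplifications"].split(alt_sep) if s.strip()]
        items.append(Item(row["sentence"], row["complex_word"], alternatives))
    return items


# Invented example row (hypothetical, NOT from the dataset).
sample = io.StringIO(
    "sentence\tcomplex_word\tsimplifications\n"
    "De gemeente wil de procedure vereenvoudigen.\tvereenvoudigen\tmakkelijker maken; versimpelen\n"
)
items = read_items(sample)
```

With a real file, the `io.StringIO` object would be replaced by an opened file handle, e.g. `open(path, encoding="utf-8", newline="")`.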
The dataset can also be found in the authors' GitHub repository.
Related scripts and analysis notebooks are also available in Eliza's MSc thesis repository.
The starting point for the dataset was a collection of ~50 documents provided by the Communications Department of the City of Amsterdam. The documents have diverse sources and purposes (e.g. reports, citizen letters, newsletters) and cover a variety of topics (legal, medical, urban planning, etc.).
Using a list of complex words, Eliza sampled ~100 sentences that each contain a complex word.
The annotation was conducted via a form filled in by 23 annotators, all highly educated native Dutch speakers. Annotators could either select a pre-filled option generated by an existing simplification model or propose other suitable alternatives of their own.
Further details about the dataset creation, as well as the developed LSBertje model, can be found in the corresponding paper.
If using the dataset, please cite as follows:
Hobo, Eliza, Charlotte Pouw, and Lisa Beinborn. "“Geen makkie”: Interpretable Classification and Simplification of Dutch Text Complexity." Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023). 2023.