
# FLORES Dataset Corrections for Four African Languages

This project corrects the FLORES evaluation dataset (dev and devtest) for four African languages: Hausa, Northern Sotho (Sepedi), Xitsonga, and isiZulu. The original dataset, though groundbreaking in its coverage of low-resource languages, contained inconsistencies and inaccuracies in these four languages that could affect the quality of Natural Language Processing (NLP) evaluations, especially for machine translation.

## Overview

In this project, native speakers meticulously reviewed and corrected the dataset to ensure improved accuracy and reliability for each language. Our goal was to enhance the integrity of downstream NLP tasks that use this data.

### What We Did

1. **Reviewed and Corrected Errors:** Identified and fixed translation inconsistencies and inaccuracies in the dataset.
2. **Statistical Analysis:** Statistically compared the original and corrected datasets, highlighting the differences and improvements made (see the tables below).
3. **Improved Dataset Quality:** Enhanced linguistic accuracy and reliability, enabling more effective evaluation of NLP tasks involving these languages.

### Key Corrections

- **Hausa:** The Hausa translations contained numerous inconsistencies, and unclear or incoherent phrasing suggested that a significant portion may have been machine-generated. Comparing the Hausa FLORES text against Google Translate showed that many incorrect lexical choices matched Google's outputs, raising further concerns about the provenance and quality of the original translations. Additional issues included mistranslated named entities and frequent omission of the special characters of the standard Hausa alphabet (e.g., ɓ, ɗ, ƙ).
- **Northern Sotho (Sepedi):** The Northern Sotho translations needed improvements in vocabulary consistency, syntax, and the rendering of technical terms. While most of the text was accurately translated, minor corrections were needed to enhance clarity, including adjustments to borrowed words and proper spacing. Notably, some translations omitted important terms and thereby changed the meaning, such as dropping "scientific" when referring to scientific tools.
- **Xitsonga:** The Xitsonga translations showed several vocabulary-accuracy issues and improper use of borrowed terms, leading to misunderstandings. Errors included incorrect translations of phrases such as "Type 1 diabetes" and uniform translations that lacked contextual variation, which hindered clarity. Spelling errors and unnecessary borrowing significantly degraded translation quality, underscoring the need for proper native-language usage.
- **isiZulu:** The isiZulu translations suffered from vocabulary inconsistencies, syntax errors, and difficulties in expressing technical terms, compounded by the language's agglutinative structure. Key problems included incorrect grammatical structures for time expressions and unnecessary borrowing of English terms, which disrupted the linguistic flow. Terminology was standardized throughout the corrections to ensure grammatical accuracy and clarity.

### Evaluating the Corrections

**dev (997 sentences)**

| lang. | #corr. (%) | #tokens_o | #tokens_c | Δ tokens | % div. |
|:------|-----------:|----------:|----------:|---------:|-------:|
| hau   | 632 (63.4) |    17,948 |    18,073 |      125 |   24.7 |
| nso   |   67 (6.7) |     2,226 |     2,271 |       45 |   28.9 |
| tso   |          - |         - |         - |        - |      - |
| zul   | 190 (19.1) |     3,605 |     3,588 |       17 |   23.7 |

**devtest (1,012 sentences)**

| lang. | #corr. (%) | #tokens_o | #tokens_c | Δ tokens | % div. |
|:------|-----------:|----------:|----------:|---------:|-------:|
| hau   |   70 (6.9) |     2,006 |     1,978 |       28 |   49.2 |
| nso   |   62 (6.1) |     2,082 |     2,105 |       23 |   28.0 |
| tso   |   83 (6.1) |     2,919 |     2,947 |       28 |   27.4 |
| zul   | 226 (22.3) |     4,414 |     4,396 |       18 |   31.8 |

*Table: Data statistics; #corr. (%) → number of sentences requiring at least one correction (percentage of original data); #tokens_o → original token count; #tokens_c → corrected token count; Δ tokens → absolute token count difference; % div. → percentage of token divergence.*
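The sentence- and token-level counts above can be reproduced with a short script. The following is a minimal sketch (not part of the repository), assuming plain-text files with one sentence per line and simple whitespace tokenization; the file paths are hypothetical, and the exact tokenizer and % div. computation used for the paper may differ.

```python
# Minimal sketch for reproducing #corr., #tokens_o, #tokens_c and Δ tokens.
# Assumes one sentence per line and whitespace tokenization; paths are hypothetical.
def token_stats(original_path, corrected_path):
    with open(original_path, encoding="utf-8") as f:
        original = [line.strip() for line in f]
    with open(corrected_path, encoding="utf-8") as f:
        corrected = [line.strip() for line in f]

    # A sentence counts as corrected if it differs from the original at all.
    n_corr = sum(o != c for o, c in zip(original, corrected))
    tokens_o = sum(len(s.split()) for s in original)
    tokens_c = sum(len(s.split()) for s in corrected)
    return {
        "#corr.": n_corr,
        "% corr.": round(100 * n_corr / len(original), 1),
        "#tokens_o": tokens_o,
        "#tokens_c": tokens_c,
        "Δ tokens": abs(tokens_c - tokens_o),
    }

# Hypothetical paths: original FLORES file vs. corrected file in this repo.
print(token_stats("flores200/dev/hau.dev", "data/dev/hau.dev"))
```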

**dev**

| lang. | TER  | #Edits | BLEU | COMET |
|:------|-----:|-------:|-----:|------:|
| hau   | 19.2 |  3,107 | 72.0 |  54.1 |
| nso   | 22.4 |    472 | 68.5 |  55.2 |
| tso   |    - |      - |    - |     - |
| zul   | 17.2 |    524 | 76.3 |  53.0 |

**devtest**

| lang. | TER  | #Edits | BLEU | COMET |
|:------|-----:|-------:|-----:|------:|
| hau   | 40.4 |    711 | 56.6 |  42.1 |
| nso   | 21.2 |    409 | 71.8 |  55.9 |
| tso   | 20.9 |    547 | 73.9 |  58.4 |
| zul   | 23.6 |    879 | 70.6 |  53.0 |

*Table: Similarities between the original and corrected FLORES evaluation data for the four African languages, computed with the original sentences as predictions and the corrected sentences as reference translations; #Edits → number of TER edit operations.*
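As a sketch of how such similarity scores can be computed, the snippet below uses the sacrebleu library with the same setup as the table caption: original sentences as predictions, corrected sentences as references. COMET scoring (available via the separate unbabel-comet package) is omitted for brevity, and the file paths are hypothetical.

```python
# Sketch: score the original text against the corrected text, treating the
# original as system output and the corrected version as the reference.
# Requires `pip install sacrebleu`; file paths are hypothetical.
from sacrebleu.metrics import BLEU, TER

def similarity(original_path, corrected_path):
    with open(original_path, encoding="utf-8") as f:
        predictions = [line.strip() for line in f]
    with open(corrected_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]

    bleu = BLEU().corpus_score(predictions, [references])
    ter = TER().corpus_score(predictions, [references])
    print(f"BLEU = {bleu.score:.1f}, TER = {ter.score:.1f}")

similarity("flores200/devtest/zul.devtest", "data/devtest/zul.devtest")
```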

## How to Use

This repository contains the corrected versions of the FLORES dev and devtest sets for the four languages. You can use these corrected datasets for more reliable evaluation of machine translation and other NLP tasks for African languages, for example as shown below.
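A minimal evaluation sketch, assuming your system outputs are stored one sentence per line and aligned with the references; the reference path below is hypothetical.

```python
# Sketch: evaluate MT output against the corrected references with sacrebleu.
# Both files must contain one sentence per line, in the same order.
from sacrebleu.metrics import BLEU

with open("data/devtest/nso.devtest", encoding="utf-8") as f:  # hypothetical path
    references = [line.strip() for line in f]
with open("outputs/my_system.nso", encoding="utf-8") as f:     # your MT output
    hypotheses = [line.strip() for line in f]

print(BLEU().corpus_score(hypotheses, [references]))
```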

### Accessing the Data

## Contributing

We welcome contributions and suggestions to further enhance the dataset. If you would like to contribute, please submit a pull request or open an issue.

## Acknowledgments

Special thanks to the native speaker annotators—university students and researchers—who volunteered to correct translations in their native languages. Their valuable contributions are crucial to the development and preservation of these low-resource languages in NLP.

## Citation

If you use these corrections in your research, please cite our paper:

```bibtex
@misc{abdulmumin2024correctingfloresevaluationdataset,
  title={Correcting FLORES Evaluation Dataset for Four African Languages},
  author={Idris Abdulmumin and Sthembiso Mkhwanazi and Mahlatse S. Mbooi and Shamsuddeen Hassan Muhammad and Ibrahim Said Ahmad and Neo Putini and Miehleketo Mathebula and Matimba Shingange and Tajuddeen Gwadabe and Vukosi Marivate},
  year={2024},
  eprint={2409.00626},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.00626},
}
```

We hope these corrections will improve your NLP research and contribute to the growing body of work on African languages!
