
# FLORES Dataset Corrections for Four African Languages

This project corrects the FLORES evaluation dataset (dev and devtest) for four African languages: Hausa, Northern Sotho (Sepedi), Xitsonga, and isiZulu. The original dataset, though groundbreaking in its coverage of low-resource languages, contained inconsistencies and inaccuracies in these four languages that could affect the quality of Natural Language Processing (NLP) evaluations, especially for machine translation.

## Overview

In this project, native speakers meticulously reviewed and corrected the dataset to ensure improved accuracy and reliability for each language. Our goal was to enhance the integrity of downstream NLP tasks that use this data.

### What We Did

1. **Reviewed and Corrected Errors:** Identified and fixed translation inconsistencies and inaccuracies in the dataset.
2. **Statistical Analysis:** Statistically compared the original and corrected datasets, highlighting the differences and improvements made (see the tables below).
3. **Improved Dataset Quality:** Enhanced linguistic accuracy and reliability, enabling more effective evaluation of NLP tasks involving these languages.

### Key Corrections

- **Hausa:** The Hausa translations contained numerous inconsistencies, and unclear or incoherent phrasing suggested that a significant portion may have been machine-generated. Comparing the Hausa FLORES text against Google Translate showed that many incorrect lexical choices matched Google's outputs, raising further concerns about the provenance and quality of the original translations. Additional issues included mistranslated named entities and frequent omission of the special characters of the standard Hausa alphabet (e.g., ɓ, ɗ, ƙ).
- **Northern Sotho (Sepedi):** The Northern Sotho translations needed improvements in vocabulary consistency, syntax, and the rendering of technical terms. While most of the text was accurately translated, minor corrections were needed to enhance clarity, including adjustments to borrowed words and proper spacing. Notably, some translations omitted important terms and thereby changed the meaning, such as dropping "scientific" when referring to scientific tools.
- **Xitsonga:** The Xitsonga translations showed several vocabulary-accuracy issues and improper use of borrowed terms, leading to misunderstandings. Errors included incorrect translations of phrases such as "Type 1 diabetes" and uniform translations that lacked contextual variation, which hindered clarity. Spelling errors and unnecessary borrowing significantly degraded translation quality, underscoring the need for proper native-language usage.
- **isiZulu:** The isiZulu translations suffered from vocabulary inconsistencies, syntax errors, and difficulties in expressing technical terms, compounded by the language's agglutinative structure. Key problems included incorrect grammatical structures for time expressions and unnecessary borrowing of English terms, which disrupted the linguistic flow. Terminology was standardized throughout the corrections to ensure grammatical accuracy and clarity.

### Evaluating the Corrections

**dev (997 sentences)**

| lang. | #corr. (%) | #tokens_o | #tokens_c | Δ tokens | % div. |
|:------|-----------:|----------:|----------:|---------:|-------:|
| hau   | 632 (63.4) |    17,948 |    18,073 |      125 |   24.7 |
| nso   |   67 (6.7) |     2,226 |     2,271 |       45 |   28.9 |
| tso   |          - |         - |         - |        - |      - |
| zul   | 190 (19.1) |     3,605 |     3,588 |       17 |   23.7 |

**devtest (1,012 sentences)**

| lang. | #corr. (%) | #tokens_o | #tokens_c | Δ tokens | % div. |
|:------|-----------:|----------:|----------:|---------:|-------:|
| hau   |   70 (6.9) |     2,006 |     1,978 |       28 |   49.2 |
| nso   |   62 (6.1) |     2,082 |     2,105 |       23 |   28.0 |
| tso   |   83 (6.1) |     2,919 |     2,947 |       28 |   27.4 |
| zul   | 226 (22.3) |     4,414 |     4,396 |       18 |   31.8 |

*Table: Data statistics; #corr. (%) → number of sentences requiring at least one correction (percentage of original data); #tokens_o → original token count; #tokens_c → corrected token count; Δ tokens → absolute token count difference; % div. → percentage of token divergence.*
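The sentence- and token-level counts above can be reproduced with a short script. The following is a minimal sketch (not part of the repository), assuming plain-text files with one sentence per line and simple whitespace tokenization; the file paths are hypothetical, and the exact tokenizer and % div. computation used for the paper may differ.

```python
# Minimal sketch for reproducing #corr., #tokens_o, #tokens_c and Δ tokens.
# Assumes one sentence per line and whitespace tokenization; paths are hypothetical.
def token_stats(original_path, corrected_path):
    with open(original_path, encoding="utf-8") as f:
        original = [line.strip() for line in f]
    with open(corrected_path, encoding="utf-8") as f:
        corrected = [line.strip() for line in f]

    # A sentence counts as corrected if it differs from the original at all.
    n_corr = sum(o != c for o, c in zip(original, corrected))
    tokens_o = sum(len(s.split()) for s in original)
    tokens_c = sum(len(s.split()) for s in corrected)
    return {
        "#corr.": n_corr,
        "% corr.": round(100 * n_corr / len(original), 1),
        "#tokens_o": tokens_o,
        "#tokens_c": tokens_c,
        "Δ tokens": abs(tokens_c - tokens_o),
    }

# Hypothetical paths: original FLORES file vs. corrected file in this repo.
print(token_stats("flores200/dev/hau.dev", "data/dev/hau.dev"))
```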

**dev**

| lang. | TER  | #Edits | BLEU | COMET |
|:------|-----:|-------:|-----:|------:|
| hau   | 19.2 |  3,107 | 72.0 |  54.1 |
| nso   | 22.4 |    472 | 68.5 |  55.2 |
| tso   |    - |      - |    - |     - |
| zul   | 17.2 |    524 | 76.3 |  53.0 |

**devtest**

| lang. | TER  | #Edits | BLEU | COMET |
|:------|-----:|-------:|-----:|------:|
| hau   | 40.4 |    711 | 56.6 |  42.1 |
| nso   | 21.2 |    409 | 71.8 |  55.9 |
| tso   | 20.9 |    547 | 73.9 |  58.4 |
| zul   | 23.6 |    879 | 70.6 |  53.0 |

*Table: Similarities between the original and corrected FLORES evaluation data for the four African languages, computed with the original sentences as predictions and the corrected sentences as reference translations; #Edits → number of TER edit operations.*
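As a sketch of how such similarity scores can be computed, the snippet below uses the sacrebleu library with the same setup as the table caption: original sentences as predictions, corrected sentences as references. COMET scoring (available via the separate unbabel-comet package) is omitted for brevity, and the file paths are hypothetical.

```python
# Sketch: score the original text against the corrected text, treating the
# original as system output and the corrected version as the reference.
# Requires `pip install sacrebleu`; file paths are hypothetical.
from sacrebleu.metrics import BLEU, TER

def similarity(original_path, corrected_path):
    with open(original_path, encoding="utf-8") as f:
        predictions = [line.strip() for line in f]
    with open(corrected_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]

    bleu = BLEU().corpus_score(predictions, [references])
    ter = TER().corpus_score(predictions, [references])
    print(f"BLEU = {bleu.score:.1f}, TER = {ter.score:.1f}")

similarity("flores200/devtest/zul.devtest", "data/devtest/zul.devtest")
```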

## How to Use

This repository contains the corrected versions of the FLORES dev and devtest sets for the four languages. You can use these corrected datasets for more reliable evaluation of machine translation and other NLP tasks for African languages, for example as shown below.
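A minimal evaluation sketch, assuming your system outputs are stored one sentence per line and aligned with the references; the reference path below is hypothetical.

```python
# Sketch: evaluate MT output against the corrected references with sacrebleu.
# Both files must contain one sentence per line, in the same order.
from sacrebleu.metrics import BLEU

with open("data/devtest/nso.devtest", encoding="utf-8") as f:  # hypothetical path
    references = [line.strip() for line in f]
with open("outputs/my_system.nso", encoding="utf-8") as f:     # your MT output
    hypotheses = [line.strip() for line in f]

print(BLEU().corpus_score(hypotheses, [references]))
```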

### Accessing the Data

## Contributing

We welcome contributions and suggestions to further enhance the dataset. If you would like to contribute, please submit a pull request or open an issue.

## Acknowledgments

Special thanks to the native speaker annotators—university students and researchers—who volunteered to correct translations in their native languages. Their valuable contributions are crucial to the development and preservation of these low-resource languages in NLP.

## Citation

If you use these corrections in your research, please cite our paper:

```bibtex
@misc{abdulmumin2024correctingfloresevaluationdataset,
  title={Correcting FLORES Evaluation Dataset for Four African Languages},
  author={Idris Abdulmumin and Sthembiso Mkhwanazi and Mahlatse S. Mbooi and Shamsuddeen Hassan Muhammad and Ibrahim Said Ahmad and Neo Putini and Miehleketo Mathebula and Matimba Shingange and Tajuddeen Gwadabe and Vukosi Marivate},
  year={2024},
  eprint={2409.00626},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.00626},
}
```

We hope these corrections will improve your NLP research and contribute to the growing body of work on African languages!
