This repository contains datasets extracted from Vuk'zenzele, prepared to train N-gram models, traditional ML models (Naive Bayes, SVM, and Logistic Regression), and large pretrained multilingual models for language identification.

License

Licenses found: Unknown (LICENSE), CC-BY-SA-4.0 (LICENSE.md)
dsfsi/za-lid

DSFSI South African Language Identification (za-lid) GitHub Repository

This documentation explains what the project is about and how to run it.

Last updated: September 2024

Table of contents

  1. Project Description
  2. Getting Started
  3. Authors
  4. More Information

Project Description


This GitHub repository contains datasets extracted from Vuk'zenzele, used to train various language identification (LID) models such as N-grams, machine learning models (e.g. SVM, Logistic Regression, K Nearest Neighbor, and Naive Bayes), and Transformer models (BERT, DistilBERT, mBERT, RemBERT, XLM-R, AfroLM, Afro-XLMR, AfriBERTa, Serengeti, etc.). The repo also contains code showing how to use available LID models such as GlotLID, OpenLID, AfroLID, and CLD V3.
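To make the N-gram approach concrete, here is a minimal sketch of a character n-gram language identifier using multinomial Naive Bayes with add-one smoothing and a uniform prior. It is not the code from this repository: the class name, the helper function, and the three toy training sentences are assumptions made for illustration only, and the sentences are not taken from the Vuk'zenzele datasets.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split lowercased text into overlapping character n-grams."""
    padded = f" {text.lower()} "  # pad so word boundaries become features
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Tiny Naive Bayes over character n-grams (hypothetical, illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language label -> n-gram counts
        self.vocab = set()                  # all n-grams seen during training

    def fit(self, sentences, labels):
        for sentence, label in zip(sentences, labels):
            grams = char_ngrams(sentence, self.n)
            self.counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, sentence):
        grams = char_ngrams(sentence, self.n)
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            # Add-one smoothing: unseen n-grams get a small non-zero probability.
            total = sum(counter.values()) + len(self.vocab)
            score = sum(math.log((counter[g] + 1) / total) for g in grams)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data for illustration only (English, Afrikaans, isiZulu).
train_sentences = [
    "good morning, how are you today",
    "goeie more, hoe gaan dit met jou",
    "sawubona, unjani namhlanje",
]
train_labels = ["eng", "afr", "zul"]

model = NgramLanguageIdentifier().fit(train_sentences, train_labels)
print(model.predict("hoe gaan dit"))
```

A real model would be trained on many sentences per language; with so little data, this sketch only separates inputs that share n-grams with a training sentence.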

Getting Started


This section provides the necessary information for a user to be able to run the code locally.

Prerequisites

All code is developed using Python:

  • Python 3.*

Installation

  1. Install all the required libraries, modules, and packages listed in requirements.txt.
Run
pip install -r requirements.txt
If some dependencies do not install successfully, or you run into compatibility issues, the dependencies you need are:
scikit-learn (imported as sklearn)
pandas
seaborn
matplotlib
numpy 
torch
transformers
nltk
tqdm
seqeval
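If the installation only partly succeeds, a quick check can show which of the modules listed above Python cannot find. The following is a minimal sketch, not part of this repository: the function name find_missing is hypothetical, and the list uses import names (scikit-learn is imported as sklearn).

```python
import importlib.util

# Import names for the dependencies listed above.
REQUIRED_MODULES = [
    "sklearn", "pandas", "seaborn", "matplotlib", "numpy",
    "torch", "transformers", "nltk", "tqdm", "seqeval",
]


def find_missing(modules):
    """Return the names in `modules` that Python cannot locate for import."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

This only confirms that each module can be located. It does not check versions, so compatibility issues may still need to be resolved by hand.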

Usage

All code and datasets are contained inside the src folder:

  1. To use the code, follow these steps:
* For each model category (N-grams, ML, or Transformers), ensure all dependencies are installed.
* For each category of models there is a scripts folder (e.g. LID_Toold/scripts). This folder contains a bash file that runs the appropriate Python file five times and saves the results in a destination folder (you may need to change the destination folder).
* To run the bash file, run nohup bash 'script_name.sh' > 'output_text_file.txt' & . This ensures the execution does not stop even if the terminal is closed.
* Once the run is complete, all output files, plots, etc. will be saved to the destination folder for you to view.
* NB: For files with no script, you may need to run the Python file directly.
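For Python files that ship without a bash script, the same pattern (several runs, each with its output saved to a destination folder) can be sketched in Python. This is an assumption about what the bash files do, based only on the description above. The function name run_repeatedly and the run_N.txt log names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def run_repeatedly(script, dest_dir, runs=5):
    """Run `script` `runs` times, saving each run's output under `dest_dir`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the destination folder if needed
    log_paths = []
    for i in range(1, runs + 1):
        log_path = dest / f"run_{i}.txt"
        with open(log_path, "w") as log:
            # Capture both stdout and stderr of this run in its own log file.
            subprocess.run(
                [sys.executable, str(script)],
                stdout=log,
                stderr=subprocess.STDOUT,
                check=True,  # stop early if a run fails
            )
        log_paths.append(log_path)
    return log_paths
```

For example, run_repeatedly("my_model.py", "results") would write results/run_1.txt to results/run_5.txt, where my_model.py and results are placeholder names. Unlike the nohup command above, this sketch does not keep running after the terminal is closed, so a wrapper that calls it would still need to be launched with nohup.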

HuggingFace Models

Authors


Contributions

This optional section describes which developers contributed and how.

How to Reference


Licence

DSFSI South African Language Identification (za-lid) © 2024 by Thapelo Sindane, Vukosi Marivate is licensed under CC BY-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/
