DSFSI South African Language Identification (za-lid) GitHub Repository

This documentation explains what the project is about and how to run its code.

Last updated: September 2024

Project Description

This GitHub repository contains datasets extracted from Vuk'zenzele, used to train various language identification (LID) models: N-gram models, machine learning models (e.g. SVM, Logistic Regression, K-Nearest Neighbors, and Naive Bayes), and Transformer models (BERT, DistilBERT, mBERT, RemBERT, XLM-R, AfroLM, Afro-XLMR, AfriBERTa, Serengeti, etc.). The repo also contains code showing how to use existing LID models such as GlotLID, OpenLID, AfroLID, and CLD3.
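To give a feel for the simplest model family listed above, here is a minimal sketch of character n-gram language identification in plain Python. It is not the repository's implementation: the function names, the add-one smoothing, and the toy sentences (which do not come from the Vuk'zenzele data) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy training sentences, invented for illustration only (not from the dataset).
TOY_SAMPLES = [
    ("Sawubona, unjani namhlanje?", "zul"),
    ("Ngiyabonga kakhulu", "zul"),
    ("Goeie môre, hoe gaan dit met jou?", "afr"),
    ("Baie dankie vir jou hulp", "afr"),
]


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count profile per language from (text, label) pairs."""
    profiles = {}
    for text, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n))
    return profiles


def predict(text, profiles, n=3):
    """Return the label whose profile gives the highest smoothed log-likelihood."""
    vocab = set().union(*profiles.values())
    best_label, best_score = None, -math.inf
    for label, counts in profiles.items():
        total = sum(counts.values())
        # Add-one smoothing so unseen n-grams do not zero out the score.
        score = sum(
            math.log((counts[gram] + 1) / (total + len(vocab) + 1))
            for gram in char_ngrams(text, n)
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    profiles = train(TOY_SAMPLES)
    print(predict("Ngiyabonga", profiles))
    print(predict("Baie dankie", profiles))
```

With only four training sentences this is a toy; the models in the repo are trained on far more text, and their features and classifiers may differ from this sketch.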

Getting Started

This section provides the necessary information for a user to be able to run the code locally.

Prerequisites

All code is developed in Python:

Python 3.*

Installation

Install all the required libraries, modules, and packages listed in requirements.txt.

Run:
pip install -r requirements.txt
If some dependencies fail to install, or you run into compatibility issues, these are the dependencies you need:
scikit-learn (imported as sklearn)
pandas
seaborn
matplotlib
numpy 
torch
transformers
nltk
tqdm
seqeval
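
If you are unsure which of the listed dependencies are already available, a short check like the one below can help. It is a sketch only: the helper name find_missing is hypothetical, and the list uses import names (for example sklearn), which can differ from the package names used with pip.

```python
import importlib.util

# Import names of the dependencies listed above.
REQUIRED_MODULES = [
    "sklearn", "pandas", "seaborn", "matplotlib", "numpy",
    "torch", "transformers", "nltk", "tqdm", "seqeval",
]


def find_missing(modules):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

The check only confirms that each module can be located; it does not verify versions or compatibility between packages.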

Usage

All code and datasets are contained inside the src folder.

To use the code, follow these steps:

* For each model category (N-grams, ML, or Transformers), ensure all dependencies are installed.
* Each model category has a scripts folder (e.g. LID_Toold/scripts). This folder contains a bash file that runs the appropriate Python file five times and saves the results in a destination folder (you may need to change the destination folder).
* To run the bash file, run nohup bash 'script_name.sh' > 'output_text_file.txt' & . This ensures that execution does not stop even if the terminal is closed.
* Once the run is complete, all output files, plots, etc. are saved to the destination folder for you to view.
* NB: For files with no script, you may need to run the Python file directly.
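
For Python files that have no bash script, the repeat-and-save pattern described above can be approximated from Python. The sketch below is an assumption about that workflow, not a copy of the repo's scripts: the function name run_repeatedly, the log file names, and the demo command are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_repeatedly(command, dest_dir, n_runs=5):
    """Run `command` n_runs times, writing each run's output to its own log file."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    log_paths = []
    for run in range(1, n_runs + 1):
        log_path = dest / f"run_{run}.txt"
        with open(log_path, "w") as log_file:
            # stderr is merged into stdout so errors land in the same log.
            subprocess.run(command, stdout=log_file,
                           stderr=subprocess.STDOUT, check=False)
        log_paths.append(log_path)
    return log_paths


if __name__ == "__main__":
    # Placeholder command: replace it with the Python file for your model category.
    demo_command = [sys.executable, "-c", "print('run finished')"]
    for path in run_repeatedly(demo_command, tempfile.mkdtemp(), n_runs=2):
        print(path, "->", path.read_text().strip())
```

Unlike the nohup command, this sketch does not detach from the terminal, so the runs stop if the terminal session ends.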

HuggingFace Models

https://huggingface.co/spaces/dsfsi/dsfsi-language-identification-spaces

Authors

Written by: Thapelo Sindane and Vukosi Marivate
Contact details: sindane.thapelo@tuks.co.za

Contributions

This optional section describes what each developer contributed and how.

How to Reference

Licence

DSFSI South African Language Identification (za-lid) © 2024 by Thapelo Sindane, Vukosi Marivate is licensed under CC BY-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/

