This is the official implementation of the paper CATT: Character-based Arabic Tashkeel Transformer.
```
├── api/                 # API-related files
├── benchmarking/        # Benchmarking scripts and data
├── catt/                # Core CATT package
│   ├── data/            # Data handling modules
│   ├── models/          # Model architectures
│   └── utils/           # Utility functions
├── configs/             # Configuration files
├── dataset/             # Dataset files
├── docs/                # Documentation
├── models/              # Pre-trained model checkpoints
├── scripts/             # Utility scripts
├── tests/               # Test files
├── compute_der.py       # Diacritization Error Rate (DER) computation
├── predict_catt.py      # Prediction script
├── train_catt.py        # Training script
├── pyproject.toml       # Project dependencies and metadata
└── README.md            # This file
```
- Clone the repository:

  ```bash
  git clone https://github.com/abjadai/catt.git
  cd catt
  ```

- Install the required dependencies:

  ```bash
  pip install poetry
  poetry install
  ```

- Download the pre-trained models:

  ```bash
  mkdir models/
  wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_ed_mlm_ns_epoch_178.pt
  wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_eo_mlm_ns_epoch_193.pt
  ```
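To confirm the downloads completed correctly, a quick check that each file deserializes can help (a minimal sketch, assuming the `.pt` releases are standard `torch.save` artifacts):

```python
import torch

# Sanity check: each downloaded checkpoint should deserialize without error.
for name in ("best_ed_mlm_ns_epoch_178.pt", "best_eo_mlm_ns_epoch_193.pt"):
    ckpt = torch.load(f"models/{name}", map_location="cpu")
    print(f"{name}: loaded ({type(ckpt).__name__})")
```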
- Start the FastAPI server:

  ```bash
  python -m api.main
  ```

- Send a POST request to `http://localhost:8000/tashkeel` with a JSON body:

  ```json
  { "text": "العلم نور والجهل ظلام." }
  ```
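For example, from Python with the `requests` library (a minimal sketch; the response schema is whatever `api/main.py` returns, so inspect the JSON rather than assuming field names):

```python
import requests

# POST undiacritized Arabic text to the local CATT server.
resp = requests.post(
    "http://localhost:8000/tashkeel",
    json={"text": "العلم نور والجهل ظلام."},
    timeout=30,
)
resp.raise_for_status()
# Print the raw JSON; the exact field names are defined by api/main.py.
print(resp.json())
```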
- Run the prediction script:

  ```bash
  python predict_catt.py ./configs/EncoderDecoder_config.yaml
  # or
  python predict_catt.py ./configs/EncoderOnly_config.yaml
  ```

  Note:
  - The Encoder-Only (EO) model is recommended for faster inference.
  - The Encoder-Decoder (ED) model is recommended for better accuracy of the predicted diacritics.
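If you want to drive prediction from another Python program rather than the shell, a thin subprocess wrapper is enough (a sketch; how the script reads input and what it prints is defined entirely by `predict_catt.py`):

```python
import subprocess

# Launch CATT prediction with the encoder-only config and capture stdout.
result = subprocess.run(
    ["python", "predict_catt.py", "./configs/EncoderOnly_config.yaml"],
    capture_output=True,
    text=True,
    check=True,  # raise if the script exits non-zero
)
print(result.stdout)
```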
- Download the dataset:

  ```bash
  wget https://github.com/abjadai/catt/releases/download/v2/dataset.zip
  unzip dataset.zip
  ```
- Edit the `configs/Sample_config.yaml` file to adjust the training parameters (a quick way to sanity-check the edited file is sketched after this list):

  ```yaml
  model_type: encoder-only  # or encoder-decoder
  max_seq_len: 1024
  d_model: 512
  n_layers: 6
  n_heads: 16
  drop_prob: 0.1
  learnable_pos_emb: false
  batch_size: 32
  dl_num_workers: 32
  threshold: 0.6
  max_epochs: 300
  model_path:
  pretrained_mlm_pt:  # Use None if you want to initialize weights randomly OR the path to the char-based BERT
  device: 'cuda'
  ```
- Run the training script:

  ```bash
  python train_catt.py ./configs/Sample_config.yaml
  ```
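Before launching a long run, the edited config can be loaded and sanity-checked with PyYAML (a sketch assuming the file is consumed as plain YAML; the checks shown are illustrative, not the validation `train_catt.py` itself performs):

```python
import yaml

# Load the training config and check a few basic invariants.
with open("configs/Sample_config.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["model_type"] in ("encoder-only", "encoder-decoder")
assert cfg["d_model"] % cfg["n_heads"] == 0, "d_model must split evenly across heads"
# An empty `pretrained_mlm_pt:` parses to None, i.e. random weight init.
print(cfg)
```

After training, `compute_der.py` evaluates the Diacritization Error Rate. As a rough illustration of the metric only (not the repo's implementation, which is authoritative), DER can be computed by pairing each base character with its diacritics and counting mismatches:

```python
# Arabic tashkeel marks (fathatan through sukun, U+064B–U+0652).
TASHKEEL = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")

def split_diacritics(text):
    """Pair each base character with the diacritic string that follows it."""
    pairs = []
    for ch in text:
        if ch in TASHKEEL and pairs:
            pairs[-1][1] += ch
        elif ch not in TASHKEEL:
            pairs.append([ch, ""])
    return pairs

def der(reference, hypothesis):
    """Share of Arabic letters whose predicted diacritics differ from the
    reference. Both strings must share the same undiacritized text; this
    simplified version compares diacritic sequences verbatim."""
    ref, hyp = split_diacritics(reference), split_diacritics(hypothesis)
    assert [c for c, _ in ref] == [c for c, _ in hyp], "base text mismatch"
    letters = [(r, h) for r, h in zip(ref, hyp) if "\u0621" <= r[0] <= "\u064a"]
    errors = sum(r[1] != h[1] for r, h in letters)
    return errors / max(len(letters), 1)
```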
- This code is mainly adapted from this repo.
- Older versions of some Arabic-processing scripts available in the pyarabic library were used as well.
ToDo:
- Inference script
- Upload pre-trained models
- Upload the CATT dataset
- Upload DER scripts
- Training script
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.