CATT: Character-based Arabic Tashkeel Transformer

CC BY-NC 4.0

This is the official implementation of the paper CATT: Character-based Arabic Tashkeel Transformer.

Table of Contents

  • Project Structure
  • Installation
  • How to Run?
  • How to Train?
  • Resources
  • ToDo
  • License

Project Structure

├── api/                  # API-related files
├── benchmarking/         # Benchmarking scripts and data
├── catt/                 # Core CATT package
│   ├── data/             # Data handling modules
│   ├── models/           # Model architectures
│   └── utils/            # Utility functions
├── configs/              # Configuration files
├── dataset/              # Dataset files
├── docs/                 # Documentation
├── models/               # Pre-trained model checkpoints
├── scripts/              # Utility scripts
├── tests/                # Test files
├── compute_der.py        # Diacritization Error Rate computation
├── predict_catt.py       # Prediction script
├── train_catt.py         # Training script
├── pyproject.toml        # Project dependencies and metadata
└── README.md             # This file
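
compute_der.py computes the Diacritization Error Rate (DER): the fraction of characters whose predicted diacritics differ from the reference. The sketch below illustrates the metric only; it is not the repository's implementation, and its diacritic-splitting helper is a simplified assumption.

    # Illustrative DER computation; NOT the logic of compute_der.py.
    ARABIC_DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")

    def split_diacritics(text):
        """Group each base character with the diacritics that follow it."""
        pairs = []
        for ch in text:
            if ch in ARABIC_DIACRITICS and pairs:
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
            else:
                pairs.append((ch, ""))
        return pairs

    def der(reference, prediction):
        """DER = wrongly diacritized characters / total characters."""
        ref, hyp = split_diacritics(reference), split_diacritics(prediction)
        assert [b for b, _ in ref] == [b for b, _ in hyp], "base text must match"
        errors = sum(r != h for (_, r), (_, h) in zip(ref, hyp))
        return errors / max(len(ref), 1)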

Installation

  1. Clone the repository:

    git clone https://github.com/abjadai/catt.git
    cd catt
  2. Install the required dependencies:

    pip install poetry
    poetry install
  3. Download the pre-trained models:

    mkdir models/
    wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_ed_mlm_ns_epoch_178.pt
    wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_eo_mlm_ns_epoch_193.pt
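
To verify a download completed correctly, you can try loading the checkpoint (a minimal sketch; it only assumes the files are standard PyTorch checkpoints):

    import torch

    # Load on CPU just to confirm the file is a valid checkpoint.
    ckpt = torch.load("models/best_ed_mlm_ns_epoch_178.pt", map_location="cpu")
    print(type(ckpt))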

How to Run?

Using the API

  1. Start the FastAPI server:

    python -m api.main
  2. Send a POST request to http://localhost:8000/tashkeel with a JSON body:

    {
      "text": "العلم نور والجهل ظلام."
    }
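
The same request can be sent from Python (a minimal client sketch using the requests library; the response is assumed to be a JSON body carrying the diacritized text):

    import requests

    # POST undiacritized text to the locally running CATT server.
    resp = requests.post(
        "http://localhost:8000/tashkeel",
        json={"text": "العلم نور والجهل ظلام."},
    )
    resp.raise_for_status()
    print(resp.json())  # assumed to contain the diacritized text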

Using the Prediction Script

  1. Run the prediction script:

    python predict_catt.py ./configs/EncoderDecoder_config.yaml
    # or
    python predict_catt.py ./configs/EncoderOnly_config.yaml

Note:

  • The Encoder-Only (EO) model is recommended for faster inference.
  • The Encoder-Decoder (ED) model is recommended for more accurate diacritic predictions.

How to Train?

  1. Download the dataset:

    wget https://github.com/abjadai/catt/releases/download/v2/dataset.zip
    unzip dataset.zip
  2. Edit the configs/Sample_config.yaml file to adjust the training parameters (a quick way to sanity-check your edits is sketched after these steps).

    model_type: encoder-only # or encoder-decoder
    max_seq_len: 1024
    d_model: 512
    n_layers: 6
    n_heads: 16
    drop_prob: 0.1
    learnable_pos_emb: false
    batch_size: 32
    dl_num_workers: 32
    threshold: 0.6
    max_epochs: 300
    model_path: 
    pretrained_mlm_pt: # Leave empty (None) to initialize weights randomly, or set the path to a char-based BERT checkpoint
    device: 'cuda'
  3. Run the training script:

    python train_catt.py ./configs/Sample_config.yaml
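
Before launching a long training run, you can sanity-check the edited config with a minimal sketch like the following (the required-key list is an assumption drawn from the sample config above, not the project's full schema):

    import yaml

    # Load the training config and verify the keys shown in the sample exist.
    with open("configs/Sample_config.yaml") as f:
        config = yaml.safe_load(f)

    required = ["model_type", "max_seq_len", "d_model", "n_layers",
                "n_heads", "batch_size", "max_epochs", "device"]
    missing = [k for k in required if k not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    print("config looks complete:", config["model_type"])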

Resources

ToDo

  • Inference script
  • Upload pre-trained models
  • Upload CATT dataset
  • Upload DER scripts
  • Training script

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

CC BY-NC 4.0
