This is the official implementation of the paper CATT: Character-based Arabic Tashkeel Transformer.
```
├── api/                 # API-related files
├── benchmarking/        # Benchmarking scripts and data
├── catt/                # Core CATT package
│   ├── data/            # Data handling modules
│   ├── models/          # Model architectures
│   └── utils/           # Utility functions
├── configs/             # Configuration files
├── dataset/             # Dataset files
├── docs/                # Documentation
├── models/              # Pre-trained model checkpoints
├── scripts/             # Utility scripts
├── tests/               # Test files
├── compute_der.py       # Diacritization Error Rate (DER) computation
├── predict_catt.py      # Prediction script
├── train_catt.py        # Training script
├── pyproject.toml       # Project dependencies and metadata
└── README.md            # This file
```
- Clone the repository:

  ```bash
  git clone https://github.com/abjadai/catt.git
  cd catt
  ```

- Install the required dependencies:

  ```bash
  pip install poetry
  poetry install
  ```

- Download the pre-trained models:

  ```bash
  mkdir models/
  wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_ed_mlm_ns_epoch_178.pt
  wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_eo_mlm_ns_epoch_193.pt
  ```
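To confirm the downloads completed correctly, a quick check that each file deserializes can help (a minimal sketch, assuming the `.pt` releases are standard `torch.save` artifacts):

```python
import torch

# Sanity check: each downloaded checkpoint should deserialize without error.
for name in ("best_ed_mlm_ns_epoch_178.pt", "best_eo_mlm_ns_epoch_193.pt"):
    ckpt = torch.load(f"models/{name}", map_location="cpu")
    print(f"{name}: loaded ({type(ckpt).__name__})")
```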
- Start the FastAPI server:

  ```bash
  python -m api.main
  ```

- Send a POST request to `http://localhost:8000/tashkeel` with a JSON body:

  ```json
  { "text": "العلم نور والجهل ظلام." }
  ```
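For example, from Python with the `requests` library (a minimal sketch; the response schema is whatever `api/main.py` returns, so inspect the JSON rather than assuming field names):

```python
import requests

# POST undiacritized Arabic text to the local CATT server.
resp = requests.post(
    "http://localhost:8000/tashkeel",
    json={"text": "العلم نور والجهل ظلام."},
    timeout=30,
)
resp.raise_for_status()
# Print the raw JSON; the exact field names are defined by api/main.py.
print(resp.json())
```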
- Run the prediction script:

  ```bash
  python predict_catt.py ./configs/EncoderDecoder_config.yaml
  # or
  python predict_catt.py ./configs/EncoderOnly_config.yaml
  ```

  Note:
  - The Encoder-Only (EO) model is recommended for faster inference.
  - The Encoder-Decoder (ED) model is recommended for better accuracy of the predicted diacritics.
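If you want to drive prediction from another Python program rather than the shell, a thin subprocess wrapper is enough (a sketch; how the script reads input and what it prints is defined entirely by `predict_catt.py`):

```python
import subprocess

# Launch CATT prediction with the encoder-only config and capture stdout.
result = subprocess.run(
    ["python", "predict_catt.py", "./configs/EncoderOnly_config.yaml"],
    capture_output=True,
    text=True,
    check=True,  # raise if the script exits non-zero
)
print(result.stdout)
```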
- Download the dataset:

  ```bash
  wget https://github.com/abjadai/catt/releases/download/v2/dataset.zip
  unzip dataset.zip
  ```
- Edit the `configs/Sample_config.yaml` file to adjust the training parameters (a quick way to sanity-check the edited file is sketched after this list):

  ```yaml
  model_type: encoder-only  # or encoder-decoder
  max_seq_len: 1024
  d_model: 512
  n_layers: 6
  n_heads: 16
  drop_prob: 0.1
  learnable_pos_emb: false
  batch_size: 32
  dl_num_workers: 32
  threshold: 0.6
  max_epochs: 300
  model_path:
  pretrained_mlm_pt:  # Use None if you want to initialize weights randomly OR the path to the char-based BERT
  device: 'cuda'
  ```
- Run the training script:

  ```bash
  python train_catt.py ./configs/Sample_config.yaml
  ```
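Before launching a long run, the edited config can be loaded and sanity-checked with PyYAML (a sketch assuming the file is consumed as plain YAML; the checks shown are illustrative, not the validation `train_catt.py` itself performs):

```python
import yaml

# Load the training config and check a few basic invariants.
with open("configs/Sample_config.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["model_type"] in ("encoder-only", "encoder-decoder")
assert cfg["d_model"] % cfg["n_heads"] == 0, "d_model must split evenly across heads"
# An empty `pretrained_mlm_pt:` parses to None, i.e. random weight init.
print(cfg)
```

After training, `compute_der.py` evaluates the Diacritization Error Rate. As a rough illustration of the metric only (not the repo's implementation, which is authoritative), DER can be computed by pairing each base character with its diacritics and counting mismatches:

```python
# Arabic tashkeel marks (fathatan through sukun, U+064B–U+0652).
TASHKEEL = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")

def split_diacritics(text):
    """Pair each base character with the diacritic string that follows it."""
    pairs = []
    for ch in text:
        if ch in TASHKEEL and pairs:
            pairs[-1][1] += ch
        elif ch not in TASHKEEL:
            pairs.append([ch, ""])
    return pairs

def der(reference, hypothesis):
    """Share of Arabic letters whose predicted diacritics differ from the
    reference. Both strings must share the same undiacritized text; this
    simplified version compares diacritic sequences verbatim."""
    ref, hyp = split_diacritics(reference), split_diacritics(hypothesis)
    assert [c for c, _ in ref] == [c for c, _ in hyp], "base text mismatch"
    letters = [(r, h) for r, h in zip(ref, hyp) if "\u0621" <= r[0] <= "\u064a"]
    errors = sum(r[1] != h[1] for r, h in letters)
    return errors / max(len(letters), 1)
```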
- This code is mainly adapted from this repo.
- Older versions of some Arabic-processing scripts available in the pyarabic library were used as well.
ToDo:
- Inference script
- Upload pre-trained models
- Upload the CATT dataset
- Upload DER scripts
- Training script
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.