Using machine learning and deep learning models to classify which Arabic dialect a tweet belongs to.
- Clone the repo.
- Download these models and copy them to `./models/`.
- Create a venv (recommended).
- Install the requirements:
pip3 install -r requirements.txt
- Run app.py:
python3 app.py
- Wait for the server to start. (It takes some time because of the deep learning model; I'm working on that.)
- Test the API on your local machine. (I recommend using Postman for this.)
In this repo I have two models that can be accessed through two different POST endpoints.
- The machine learning model can be accessed at: `yourlocalhost/ml`
- The deep learning model can be accessed at: `yourlocalhost/dl`
- Both take JSON in the body of the POST request, for example `{"text": "مثال من تويتر"}` ("an example from Twitter"), and return the prediction as a string.
- Example using Postman: send a POST request with a JSON body like the one above to either endpoint.
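If you want to test from code instead of Postman, here is a minimal sketch using the `requests` library. The host and port are assumptions (Flask's development server defaults to `127.0.0.1:5000`):

```python
import requests

# Assumed base URL: Flask's default development address.
BASE_URL = "http://127.0.0.1:5000"

# Example tweet text ("an example from Twitter").
payload = {"text": "مثال من تويتر"}

# Query the machine learning endpoint.
ml_response = requests.post(f"{BASE_URL}/ml", json=payload)
print("ML prediction:", ml_response.text)

# Query the deep learning endpoint.
dl_response = requests.post(f"{BASE_URL}/dl", json=payload)
print("DL prediction:", dl_response.text)
```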
It's clear that Arabic dialects are similar; even Arabs can confuse them in written form, and the models get confused as well.
I believe that if we combine some close dialects and end up with fewer than 18 classes, we'll get much better results (a sketch of this idea follows).
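As a rough sketch, close dialects could be merged with a simple label map before training. The grouping below is a hypothetical illustration; the exact label set and the best grouping are assumptions, not validated choices:

```python
# Hypothetical grouping of 18 country-level dialect labels into regions.
DIALECT_GROUPS = {
    "SA": "Gulf", "KW": "Gulf", "AE": "Gulf", "QA": "Gulf", "BH": "Gulf", "OM": "Gulf",
    "SY": "Levantine", "LB": "Levantine", "JO": "Levantine", "PL": "Levantine",
    "MA": "Maghrebi", "DZ": "Maghrebi", "TN": "Maghrebi", "LY": "Maghrebi",
    "EG": "Egyptian", "SD": "Sudanese", "IQ": "Iraqi", "YE": "Yemeni",
}

def merge_labels(labels):
    """Map fine-grained dialect labels to coarser regional classes."""
    return [DIALECT_GROUPS.get(label, label) for label in labels]
```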
| Model | Macro F1 score |
|---|---|
| LinearSVC | 0.537 |
| BiLSTM | 0.47 |
| Fine-tuned BERT | needs more work |
I have also thought about improvements such as multi-label classification (a rough illustration follows).
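A multi-label setup could look like this with scikit-learn; the tiny corpus and label sets here are toy assumptions, not real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy corpus ("first example", "second example", "third example"):
# each tweet may be tagged with more than one close dialect.
tweets = ["مثال اول", "مثال ثاني", "مثال ثالث"]
labels = [["EG"], ["SA", "KW"], ["MA"]]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)

# One binary LinearSVC per dialect label.
model = OneVsRestClassifier(LinearSVC()).fit(X, y)

# predict() returns a binary vector per tweet; inverse_transform recovers label sets.
print(binarizer.inverse_transform(model.predict(vectorizer.transform(["مثال اول"]))))
```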
I'll describe some of the notebooks here, not all of them, but all of them are commented.
Pre-Processing:
- Removed numbers (Arabic and Latin), emojis, and all punctuation because they don't affect the prediction.
- Applied the usual tweet cleaning: removed hashtags, URLs, and usernames (a sketch of the cleaning follows this list).
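A minimal sketch of this cleaning step (the regex patterns are illustrative, not the exact notebook code):

```python
import re
import string

# Common Arabic punctuation marks in addition to the Latin ones.
ARABIC_PUNCTUATION = "؟،؛«»"

def clean_tweet(text: str) -> str:
    """Remove URLs, usernames, hashtags, digits, emojis, and punctuation."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                   # usernames
    text = re.sub(r"#\w+", " ", text)                   # hashtags
    text = re.sub(r"[0-9\u0660-\u0669]+", " ", text)    # Latin and Arabic digits
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]+", " ", text)  # rough emoji ranges
    text = text.translate(str.maketrans("", "", string.punctuation + ARABIC_PUNCTUATION))
    return re.sub(r"\s+", " ", text).strip()
```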
Training Models:
- Chose macro-averaged f1_score as the metric, so that every class gets the same importance (see the example below).
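For reference, this is how the metric is computed with scikit-learn's `f1_score`; the labels below are toy examples:

```python
from sklearn.metrics import f1_score

# Macro averaging computes F1 per class and takes the unweighted mean,
# so rare and frequent dialects count equally.
y_true = ["EG", "EG", "SA", "MA"]
y_pred = ["EG", "SA", "SA", "MA"]

print(f1_score(y_true, y_pred, average="macro"))
```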
Future Work:
- Work on the FineTune notebook.
- Make the API more reliable.
- Add features beyond the text itself, for example whether the tweet mentions the name of a country (a sketch follows).
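A sketch of that feature idea; the country list and the simple substring matching are hypothetical choices for illustration:

```python
# Hypothetical extra feature: whether the tweet mentions a country name.
# The list below is a small illustrative sample, not exhaustive.
COUNTRY_NAMES = {
    "مصر": "EG",       # Egypt
    "السعودية": "SA",  # Saudi Arabia
    "المغرب": "MA",    # Morocco
    "العراق": "IQ",    # Iraq
}

def country_mention_features(text: str) -> dict:
    """Return one binary feature per country: 1 if its name appears in the text."""
    return {f"mentions_{code}": int(name in text) for name, code in COUNTRY_NAMES.items()}
```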