Sentiment Analysis on ArmanEmo Dataset

This project performs sentiment analysis on the ArmanEmo dataset, which consists of Persian text labeled with seven different emotional categories. The goal is to classify the emotions of unseen text using state-of-the-art NLP models.

Contributors

  • Mohammad Sadegh Poulaei
  • Fatemeh Askari

Dataset

The ArmanEmo dataset contains text from various sources, such as social media posts and Digikala product reviews. Each sentence is assigned one of seven emotion labels. The data is provided in two files: train.tsv and test.tsv.
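
The splits can be loaded with pandas before any cleaning. The sketch below is a minimal example; the column names are assumptions and should be checked against the actual .tsv headers.

    # Minimal sketch: load the ArmanEmo splits with pandas.
    # Column names are assumptions; inspect the real headers first.
    import pandas as pd

    train_df = pd.read_csv("train.tsv", sep="\t")
    test_df = pd.read_csv("test.tsv", sep="\t")

    print(train_df.columns.tolist())  # check the actual column names
    # assuming a "label" column, inspect the balance of the seven emotion classes:
    # print(train_df["label"].value_counts())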

Preprocessing Steps

The preprocessing involved the following steps (a code sketch follows the list):

  1. Converted the .tsv files to .csv.
  2. Removed non-Persian characters and symbols.
  3. Normalized elongated words (e.g., "خیییلللی" → "خیلی", the elongated and standard forms of "very").
  4. Removed Arabic-specific letters and symbols such as _ and #.
  5. Removed Persian numerals.
  6. Applied further normalization using the parsivar library.
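
The sketch below illustrates this pipeline with regular expressions and the parsivar Normalizer. The exact character ranges treated as "non-Persian" are an assumption for illustration, not taken from the repository code.

    # Sketch of the cleaning steps listed above (character ranges are assumptions).
    import re
    from parsivar import Normalizer

    normalizer = Normalizer()

    def clean_text(text: str) -> str:
        # steps 2, 4: keep characters in the Arabic/Persian Unicode block plus whitespace;
        # this drops Latin letters and symbols such as _ and #
        text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)
        # step 3: collapse elongated words, e.g. "خیییلللی" -> "خیلی"
        text = re.sub(r"(.)\1{2,}", r"\1", text)
        # step 5: remove Persian digits
        text = re.sub(r"[۰-۹]", "", text)
        # step 6: parsivar normalization (spacing fixes, unifying Arabic ي/ك with Persian ی/ک)
        return normalizer.normalize(text)

    print(clean_text("خیییلللی عالی بود ۱۲۳ #تست"))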

Models Used

We experimented with several transformer models for emotion classification (a loading sketch follows the list):

  • ParsBert: a Persian-specific BERT model fine-tuned for the sentiment task.
  • ALBERT: a lightweight BERT variant that uses cross-layer parameter sharing.
  • roberta_facebook: Facebook AI's RoBERTa, a robustly optimized BERT variant used as a general-purpose model.
  • persian_xlm_roberta_large (final model): a Persian-tuned XLM-RoBERTa, the multilingual RoBERTa variant pretrained on large multilingual corpora; it achieved the best accuracy on the test set.
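
The sketch below shows how these backbones could be loaded for 7-way classification with Hugging Face transformers. The checkpoint names are plausible public models, not confirmed from the notebooks, and xlm-roberta-large stands in for the Persian XLM-RoBERTa checkpoint actually used.

    # Sketch: swapping candidate backbones behind one loading function.
    # Checkpoint names are assumptions (not read from the repository notebooks).
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    CANDIDATES = {
        "parsbert": "HooshvareLab/bert-base-parsbert-uncased",
        "albert": "m3hrdadfi/albert-fa-base-v2",
        "roberta_facebook": "roberta-large",
        "xlm_roberta": "xlm-roberta-large",  # stand-in for persian_xlm_roberta_large
    }

    def load_backbone(name: str):
        checkpoint = CANDIDATES[name]
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=7)
        return tokenizer, model

    tokenizer, model = load_backbone("xlm_roberta")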

Final Model: persian_xlm_roberta_large

  • Dynamic masking applies a different mask pattern to each mini-batch.
  • Pretrained on 2.5 TB of CommonCrawl data covering 100 languages.
  • Achieved the highest accuracy on the ArmanEmo test set, reaching 69% after 15 epochs of fine-tuning (a rough fine-tuning sketch follows this list).
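
The fine-tuning sketch below uses the Hugging Face Trainer and mirrors the reported 15 epochs and seven labels; the batch size, learning rate, maximum length, column names, and file paths are assumptions for illustration.

    # Rough fine-tuning sketch (hyperparameters, paths, and column names are assumptions).
    import pandas as pd
    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    checkpoint = "xlm-roberta-large"  # stand-in for persian_xlm_roberta_large
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=7)

    def load_split(path):
        # assumes a cleaned CSV with a "text" column and an integer "label" column
        ds = Dataset.from_pandas(pd.read_csv(path))
        return ds.map(
            lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=128),
            batched=True,
        )

    args = TrainingArguments(
        output_dir="xlmr_armanemo",
        num_train_epochs=15,            # matches the reported 15 epochs
        per_device_train_batch_size=8,  # assumption
        learning_rate=2e-5,             # assumption
    )

    Trainer(
        model=model,
        args=args,
        train_dataset=load_split("train.csv"),
        eval_dataset=load_split("test.csv"),
    ).train()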

Evaluation Metrics

We evaluated model performance with the following metrics (a sketch of computing them follows the list):

  • Accuracy
  • F1 Score
  • Precision
  • Recall
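
These can be computed with scikit-learn as sketched below; macro averaging is an assumption, since the README does not state which averaging the reported F1, precision, and recall use.

    # Sketch: the four evaluation metrics via scikit-learn (macro averaging assumed).
    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    def compute_metrics(y_true, y_pred):
        return {
            "accuracy": accuracy_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred, average="macro"),
            "precision": precision_score(y_true, y_pred, average="macro"),
            "recall": recall_score(y_true, y_pred, average="macro"),
        }

    print(compute_metrics([0, 1, 2, 1], [0, 1, 2, 0]))  # toy sanity check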

Results

The best-performing model, persian_xlm_roberta_large, achieved the following on the test set:

  • Accuracy: 69%
  • F1 Score: 71%
  • Precision: 73%
  • Recall: 71%

Example Predictions

Some example sentences with their true and predicted emotions (approximate English translations in parentheses); an inference sketch follows the table:

Sentence | True Label | Predicted Label
"آرزوی موفقیت و پیروزی ایران در جام جهانی" ("Wishing Iran success and victory in the World Cup") | Happy | Other
"میخندم ولی دلم پر از غم است" ("I laugh, but my heart is full of sorrow") | Happy | Sad
"آهنگ عالی، راننده خطی صداش کم نمیشه" ("Great song; the shared-taxi driver won't turn the volume down") | Happy | Angry

How to Run

To run the code and reproduce the results:

  1. Clone the repository:

    git clone https://github.com/mspoulaei/armanemo-sentiment-analysis.git

  2. Open the DL_Project_Model_{model_name}.ipynb notebook for the desired model in Jupyter or Google Colab and run all cells.

