Sentiment Analysis on ArmanEmo Dataset

This project performs sentiment analysis on the ArmanEmo dataset, which consists of Persian text labeled with seven different emotional categories. The goal is to classify the emotions of unseen text using state-of-the-art NLP models.

Contributors

  • Mohammad Sadegh Poulaei
  • Fatemeh Askari

Dataset

The ArmanEmo dataset contains text from various sources, such as social media posts and Digikala product reviews. Each sentence is assigned one of seven emotion labels. The data is provided in two files: train.tsv and test.tsv.
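
The splits can be loaded with pandas before any cleaning. The sketch below is a minimal example; the column names are assumptions and should be checked against the actual .tsv headers.

    # Minimal sketch: load the ArmanEmo splits with pandas.
    # Column names are assumptions; inspect the real headers first.
    import pandas as pd

    train_df = pd.read_csv("train.tsv", sep="\t")
    test_df = pd.read_csv("test.tsv", sep="\t")

    print(train_df.columns.tolist())  # check the actual column names
    # assuming a "label" column, inspect the balance of the seven emotion classes:
    # print(train_df["label"].value_counts())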

Preprocessing Steps

The preprocessing involved the following steps (a code sketch follows the list):

  1. Converted the .tsv files to .csv.
  2. Removed non-Persian characters and symbols.
  3. Normalized elongated words (e.g., "خیییلللی" → "خیلی", the elongated and standard forms of "very").
  4. Removed Arabic-specific letters and symbols such as _ and #.
  5. Removed Persian numerals.
  6. Applied further normalization using the parsivar library.
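
The sketch below illustrates this pipeline with regular expressions and the parsivar Normalizer. The exact character ranges treated as "non-Persian" are an assumption for illustration, not taken from the repository code.

    # Sketch of the cleaning steps listed above (character ranges are assumptions).
    import re
    from parsivar import Normalizer

    normalizer = Normalizer()

    def clean_text(text: str) -> str:
        # steps 2, 4: keep characters in the Arabic/Persian Unicode block plus whitespace;
        # this drops Latin letters and symbols such as _ and #
        text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)
        # step 3: collapse elongated words, e.g. "خیییلللی" -> "خیلی"
        text = re.sub(r"(.)\1{2,}", r"\1", text)
        # step 5: remove Persian digits
        text = re.sub(r"[۰-۹]", "", text)
        # step 6: parsivar normalization (spacing fixes, unifying Arabic ي/ك with Persian ی/ک)
        return normalizer.normalize(text)

    print(clean_text("خیییلللی عالی بود ۱۲۳ #تست"))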

Models Used

We experimented with several transformer models for emotion classification (a loading sketch follows the list):

  • ParsBert: a Persian-specific BERT model fine-tuned for the sentiment task.
  • ALBERT: a lightweight BERT variant that uses cross-layer parameter sharing.
  • roberta_facebook: Facebook AI's RoBERTa, a robustly optimized BERT variant used as a general-purpose model.
  • persian_xlm_roberta_large (final model): a Persian-tuned XLM-RoBERTa, the multilingual RoBERTa variant pretrained on large multilingual corpora; it achieved the best accuracy on the test set.
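
The sketch below shows how these backbones could be loaded for 7-way classification with Hugging Face transformers. The checkpoint names are plausible public models, not confirmed from the notebooks, and xlm-roberta-large stands in for the Persian XLM-RoBERTa checkpoint actually used.

    # Sketch: swapping candidate backbones behind one loading function.
    # Checkpoint names are assumptions (not read from the repository notebooks).
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    CANDIDATES = {
        "parsbert": "HooshvareLab/bert-base-parsbert-uncased",
        "albert": "m3hrdadfi/albert-fa-base-v2",
        "roberta_facebook": "roberta-large",
        "xlm_roberta": "xlm-roberta-large",  # stand-in for persian_xlm_roberta_large
    }

    def load_backbone(name: str):
        checkpoint = CANDIDATES[name]
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=7)
        return tokenizer, model

    tokenizer, model = load_backbone("xlm_roberta")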

Final Model: persian_xlm_roberta_large

  • Dynamic masking applies a different mask pattern to each mini-batch.
  • Pretrained on 2.5 TB of CommonCrawl data covering 100 languages.
  • Achieved the highest accuracy on the ArmanEmo test set, reaching 69% after 15 epochs of fine-tuning (a rough fine-tuning sketch follows this list).
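
The fine-tuning sketch below uses the Hugging Face Trainer and mirrors the reported 15 epochs and seven labels; the batch size, learning rate, maximum length, column names, and file paths are assumptions for illustration.

    # Rough fine-tuning sketch (hyperparameters, paths, and column names are assumptions).
    import pandas as pd
    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    checkpoint = "xlm-roberta-large"  # stand-in for persian_xlm_roberta_large
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=7)

    def load_split(path):
        # assumes a cleaned CSV with a "text" column and an integer "label" column
        ds = Dataset.from_pandas(pd.read_csv(path))
        return ds.map(
            lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=128),
            batched=True,
        )

    args = TrainingArguments(
        output_dir="xlmr_armanemo",
        num_train_epochs=15,            # matches the reported 15 epochs
        per_device_train_batch_size=8,  # assumption
        learning_rate=2e-5,             # assumption
    )

    Trainer(
        model=model,
        args=args,
        train_dataset=load_split("train.csv"),
        eval_dataset=load_split("test.csv"),
    ).train()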

Evaluation Metrics

We evaluated model performance with the following metrics (a sketch of computing them follows the list):

  • Accuracy
  • F1 Score
  • Precision
  • Recall
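
These can be computed with scikit-learn as sketched below; macro averaging is an assumption, since the README does not state which averaging the reported F1, precision, and recall use.

    # Sketch: the four evaluation metrics via scikit-learn (macro averaging assumed).
    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    def compute_metrics(y_true, y_pred):
        return {
            "accuracy": accuracy_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred, average="macro"),
            "precision": precision_score(y_true, y_pred, average="macro"),
            "recall": recall_score(y_true, y_pred, average="macro"),
        }

    print(compute_metrics([0, 1, 2, 1], [0, 1, 2, 0]))  # toy sanity check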

Results

The best-performing model, persian_xlm_roberta_large, achieved the following on the test set:

  • Accuracy: 69%
  • F1 Score: 71%
  • Precision: 73%
  • Recall: 71%

Example Predictions

Some example sentences with their true and predicted emotions (approximate English translations in parentheses); an inference sketch follows the table:

Sentence | True Label | Predicted Label
"آرزوی موفقیت و پیروزی ایران در جام جهانی" ("Wishing Iran success and victory in the World Cup") | Happy | Other
"میخندم ولی دلم پر از غم است" ("I laugh, but my heart is full of sorrow") | Happy | Sad
"آهنگ عالی، راننده خطی صداش کم نمیشه" ("Great song; the shared-taxi driver won't turn the volume down") | Happy | Angry

How to Run

To run the code and reproduce the results:

  1. Clone the repository:

    git clone https://github.com/mspoulaei/armanemo-sentiment-analysis.git

  2. Open the DL_Project_Model_{model_name}.ipynb notebook for the desired model in Jupyter or Google Colab and run all cells.

