ML_Project2 : Twitter Sentiment Analysis

Team members

Sarah Antille: sarah.antille@epfl.ch

Lilia Ellouz: lilia.ellouz@epfl.ch

Zeineb Sahnoun: zeineb.sahnoun@epfl.ch

Introduction

This project is part of the Machine Learning course at EPFL and of a challenge hosted on AIcrowd (https://www.aicrowd.com/challenges/epfl-ml-text-classification-2019).

Given a training set of tweets labeled as expressing either a happy or a sad feeling, we use NLP techniques and machine learning models to predict the sentiment of each tweet in an unlabeled test dataset.

Dataset Information

To reproduce our results, download the following files from https://www.aicrowd.com/challenges/epfl-ml-text-classification-2019/dataset_files :

train_pos_full.txt (contains the positive tweets with the happy smiley removed)
train_neg_full.txt (contains the negative tweets with the sad smiley removed)
test_data.txt (contains the tweets to predict)

Make sure the above files are in a directory called data before running the scripts.
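
As an illustration of the layout described above, here is a minimal sketch of how the two training files could be read into a single labeled list. It is not taken from the project's code: the function names `load_tweets` and `load_training_set`, the one-tweet-per-line format, and the 1/0 label encoding are assumptions made for this example.

```python
from pathlib import Path


def load_tweets(path):
    """Read a text file and return one tweet per line (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_training_set(data_dir="data"):
    """Return (tweets, labels) built from the positive and negative files.

    Labels are 1 for positive and 0 for negative tweets; this encoding is
    an assumption of the sketch, not necessarily the one used in the project.
    """
    data_dir = Path(data_dir)
    pos = load_tweets(data_dir / "train_pos_full.txt")
    neg = load_tweets(data_dir / "train_neg_full.txt")
    tweets = pos + neg
    labels = [1] * len(pos) + [0] * len(neg)
    return tweets, labels
```

Calling `load_training_set()` from the repository root would then read the files from the data directory; test_data.txt is left out here because its exact line format is not described in this README.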

Library requirements

To run the project, you need the following libraries installed:

scikit-learn
keras with backend tensorflow installed and configured
nltk
gensim
globe
re (part of the Python standard library, no installation needed)
pandas
numpy

Files

run.py: creates the .csv file used for our best prediction on AIcrowd
preprocessing.py: contains the methods required to clean the training set and the test set
neural_networks.py: contains the following neural network architectures:
- simple neural network
- recurrent neural network with long short-term memory (LSTM)
- recurrent neural network with bidirectional long short-term memory (BiLSTM)
- recurrent neural network with gated recurrent units (GRU)
- convolutional neural network (CNN)
It also contains a method that predicts the labels of the test dataset.
ml_models.py: trains and validates our classifiers and prints their accuracy on the validation set. Run this file as follows:
$ python ml_models.py model_name
where model_name can be one of the following:
- baseline: a Naive Bayes classifier that uses count vectorization
- bayes: a Naive Bayes classifier that uses TF-IDF vectorization
- sgd: a Stochastic Gradient Descent classifier
- svm: a Support Vector classifier
- logistic: a regularized Logistic Regression classifier, our best-performing model in this group
create_embeddings.py: creates word2vec vectors from the dataset
ensemble.py: computes the majority vote of the predictions of 3 different models
utils.py
Rapport_ML_2.pdf: 4-page report explaining our approach and our trial and error
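
To make the majority voting mentioned for ensemble.py concrete, here is a minimal sketch of the idea. It is not the project's implementation: the function name `majority_vote`, the -1/1 label encoding, and the sample predictions are hypothetical and only serve the example.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Return, for each tweet, the label predicted by most models."""
    lengths = {len(preds) for preds in model_predictions}
    if len(lengths) != 1:
        raise ValueError("all models must predict the same number of tweets")
    # zip(*...) groups the predictions of all models for the same tweet.
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*model_predictions)
    ]


# Hypothetical predictions of three models on four tweets.
model_a = [1, -1, 1, -1]
model_b = [1, 1, -1, -1]
model_c = [-1, 1, 1, -1]

ensemble = majority_vote(model_a, model_b, model_c)
print(ensemble)  # [1, 1, 1, -1]
```

With two possible labels and an odd number of models, such as the 3 used here, a vote can never end in a tie.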

Result:

0.880 accuracy on AIcrowd, where we ranked 8th among all groups.

Reproducibility

To obtain the same predictions we used for the AIcrowd submission, run the Python script run.py. It produces a file output_ensemble_final.csv that can be submitted on the web page of the challenge.

Make sure you have an empty directory called ensemble before running the scripts.
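
Before launching a run that lasts several hours, it can help to verify the directory layout first. The following is a small sketch of such a check, assuming the data and ensemble directories sit next to run.py; the helper `check_layout` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the "Dataset Information" section of this README.
REQUIRED_FILES = ("train_pos_full.txt", "train_neg_full.txt", "test_data.txt")


def check_layout(root="."):
    """Return a list of layout problems; the list is empty if all is fine."""
    root = Path(root)
    problems = []
    for name in REQUIRED_FILES:
        if not (root / "data" / name).is_file():
            problems.append(f"missing data/{name}")
    ensemble_dir = root / "ensemble"
    if not ensemble_dir.is_dir():
        problems.append("missing ensemble/ directory")
    elif any(ensemble_dir.iterdir()):
        problems.append("ensemble/ is not empty")
    return problems
```

Calling `check_layout()` from the repository root returns an empty list when the three dataset files are in place and the ensemble directory exists and is empty.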

Remarks

The script takes a long time to run (a few hours) because it pre-processes the dataset, creates word2vec embedding vectors, and then runs several neural network models.
To speed up the neural network training, we used Google Colab: https://colab.research.google.com/notebooks/welcome.ipynb
When you run run.py, the vectors of a word2vec model are saved in your directory. Since we run 3 models, a prediction .csv file is saved for each model in the ensemble folder (the first model overwrites any file that is already there).
