Sarah Antille: sarah.antille@epfl.ch
Lilia Ellouz: lilia.ellouz@epfl.ch
Zeineb Sahnoun: zeineb.sahnoun@epfl.ch
This project is part of the Machine Learning course at EPFL and of a challenge hosted on AIcrowd (https://www.aicrowd.com/challenges/epfl-ml-text-classification-2019).
Given a training set of tweets labeled as expressing either a happy or a sad feeling, we use NLP techniques and machine learning models to predict the sentiment of each tweet in an unlabeled test set.
To reproduce our results, download the following files from https://www.aicrowd.com/challenges/epfl-ml-text-classification-2019/dataset_files :
- train_pos_full.txt (contains the positive tweets with the happy smiley removed)
- train_neg_full.txt (contains the negative tweets with the sad smiley removed)
- test_data.txt (contains the tweets to predict)
Place the above files in a directory called data before running the scripts.
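Before launching the scripts, it can help to confirm that the three files are where they are expected. The following is a minimal sketch, not part of the project: the helper name `missing_data_files` is hypothetical, and it only assumes the directory name `data` and the file names listed above.

```python
from pathlib import Path

# File names as listed above; the directory name "data" comes from this README.
REQUIRED_FILES = ["train_pos_full.txt", "train_neg_full.txt", "test_data.txt"]


def missing_data_files(data_dir="data"):
    """Return the required file names that are not present in data_dir.

    Hypothetical helper for illustration; it is not part of the project scripts.
    """
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing files in data/:", ", ".join(missing))
    else:
        print("All dataset files found.")
```

Running it from the project root prints the names of any files that still need to be downloaded.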
In order to run the project, you need the following libraries installed:
- scikit-learn (imported as sklearn)
- keras with the tensorflow backend installed and configured
- nltk
- gensim
- globe
- re (part of the Python standard library)
- pandas
- numpy
- run.py: creates the .csv file used in our best prediction on AIcrowd
- preprocessing.py: contains the methods required to clean the training set and the test set
- neural_networks.py: contains the following neural network models:
  - simple neural net
  - recurrent neural net with long short-term memory (LSTM)
  - recurrent neural net with bidirectional long short-term memory
  - recurrent neural net with gated recurrent units (GRU)
  - convolutional neural network
  It also contains a method that predicts the labels of the test dataset.
- ml_models.py: trains and validates our classifiers and prints their accuracy on the validation set. Run this file as follows:
  $ python ml_models.py model_name
  where model_name can be one of the following:
  - baseline: a Naive Bayes classifier that uses count vectorization
  - bayes: a Naive Bayes classifier that uses TF-IDF vectorization
  - sgd: a Stochastic Gradient Descent classifier
  - svm: a Support Vector Classifier
  - logistic: a regularized Logistic Regression classifier, which is our best performing model in this group
- create_embeddings.py: creates word2vec vectors from the dataset
- ensemble.py: computes the majority vote over the predictions of 3 different models
- utils.py
- Rapport_ML_2.pdf: 4-page report explaining our approach and our trial and error
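To give an idea of what the baseline option does, here is a minimal, dependency-free sketch of a multinomial Naive Bayes classifier on word counts. It is not the code in ml_models.py, which presumably relies on scikit-learn. The function names, the toy tweets, whitespace tokenization, add-one smoothing and the label encoding (1 for happy, -1 for sad) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(tweets, labels):
    """Count words per class (a bag-of-words "count vectorization")."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    for tweet, label in zip(tweets, labels):
        word_counts[label].update(tweet.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary


def predict(model, tweet):
    """Return the label with the highest log-posterior for one tweet."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        # Log prior of the class.
        score = math.log(n / total)
        # Add-one (Laplace) smoothing denominator for this class.
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in tweet.split():
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, assumed for illustration: 1 = happy, -1 = sad.
tweets = ["i love this day", "so happy and glad",
          "i hate this rain", "so sad and tired"]
labels = [1, 1, -1, -1]
model = train_naive_bayes(tweets, labels)
print(predict(model, "happy day"), predict(model, "sad rain"))
```

The bayes option differs only in the features, which are TF-IDF weights instead of raw counts.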
- 0.880 accuracy on AIcrowd, where we ranked 8th among all groups.
To obtain the same predictions we used for the AIcrowd submission, run the Python script run.py. It will produce a file output_ensemble_final.csv that can be submitted on the web page of the challenge.
Make sure you have an empty directory called ensemble before running the scripts.
- It will take a long time to run (a few hours) because it pre-processes the dataset, then creates word2vec embedding vectors, and then runs different neural network models.
- To speed up the neural network training, we used Google Colab: https://colab.research.google.com/notebooks/welcome.ipynb
- When running the script run.py, the vectors of a word2vec model will be saved in your directory. Since we run 3 models, a prediction csv file for each model will be saved in the folder ensemble (the first model will overwrite the file that is already there).
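The majority voting that combines the three prediction files can be illustrated with a short sketch. This is not the code of ensemble.py: the function name `majority_vote`, the in-memory lists standing in for the csv files, and the label encoding (1 for happy, -1 for sad) are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model label lists into one list by majority voting.

    Hypothetical helper; each argument is the list of labels predicted by
    one model, in the same tweet order.
    """
    lengths = {len(p) for p in model_predictions}
    if len(lengths) != 1:
        raise ValueError("all models must predict the same number of tweets")
    # For each tweet, keep the label that most models agree on.
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]


# Toy predictions from three models (assumed encoding: 1 = happy, -1 = sad).
model_a = [1, -1, 1, -1]
model_b = [1, 1, -1, -1]
model_c = [-1, 1, 1, -1]
final = majority_vote(model_a, model_b, model_c)
print(final)
```

With an odd number of models and two possible labels, a tie cannot occur, which is one reason to combine exactly 3 models.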