This repo contains the code and instructions needed to classify tweets as containing ':)' or ':('. The corresponding Kaggle competition was part of the CS-433 Machine Learning class at EPFL. Our team is Martian Jaggirnauts.
- data: folder which should be populated as described below
- slang_dict_parsing: code that scraped the noslang website for slang words; it did not improve accuracy, so it is not used
- src: folder containing the main code (run.py) and the models
- templates_course: the code provided by default in the project
Running time: the current model took around 12 hours to train on a machine with an 8-core CPU, 60 GB of RAM and a Tesla K80 GPU. A GPU is highly recommended.
- Clone this repo
$ git clone https://github.com/m-doru/tweets-sentiment-analysis.git
$ cd tweets-sentiment-analysis
- Install fastText v0.1.0 together with its Python build. Once this is done, the following should work without errors:
$ python3
>>> import fasttext
>>>
- Clone sent2vec at the root directory of the project and follow its Setup & Requirements section to compile it. Then download the sent2vec_twitter_bigrams v1 embeddings (23 GB, 700 dimensions, trained on English tweets) and place them in data/.
- Download the GloVe Twitter pretrained word vectors, glove.twitter.27B.zip. Unzip the file and place glove.twitter.27B.200d.txt in data/glove/.
- Download the data from the Kaggle competition and place the .txt files in data/twitter-datasets/.
- Install the following Python requirements:
  - scikit-learn
  - keras with the TensorFlow backend
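
Once glove.twitter.27B.200d.txt is in data/glove/, the vectors can be read with a few lines of Python. The following is a minimal sketch and not the loader used in src: it assumes the usual GloVe text layout (one token per line, followed by its space-separated vector components), and the function name `load_glove` and its `limit` parameter are hypothetical.

```python
def load_glove(path, dim=200, limit=None):
    """Read GloVe text vectors into a dict mapping token -> list of floats.

    Hypothetical helper for illustration; not part of this repo.
    Assumes one token per line followed by `dim` space-separated numbers.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < dim + 1:
                continue  # skip malformed or truncated lines
            # The last `dim` fields are the vector; everything before is the token.
            token = " ".join(parts[:-dim])
            embeddings[token] = [float(x) for x in parts[-dim:]]
            if limit is not None and len(embeddings) >= limit:
                break  # stop early, e.g. to save memory while experimenting
    return embeddings
```

For example, `load_glove("data/glove/glove.twitter.27B.200d.txt", limit=50000)` would keep only the first 50,000 entries of the file, which may be enough for a quick experiment on a machine with little RAM. The resulting lists can be passed to `numpy.asarray` if arrays are needed.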
To run the pretrained model, a lighter setup is enough:
- Clone this repo
$ git clone https://github.com/m-doru/tweets-sentiment-analysis.git
$ cd tweets-sentiment-analysis
- Download the data from the Kaggle competition and place the .txt files in data/twitter-datasets/.
- Install the following Python 3 requirements:
  - scikit-learn
- Run run_pretrained.py
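
Before launching the script, it can help to confirm that the steps above were completed. The following is a minimal sketch of such a check, not something shipped with this repo: the helper name `missing_setup_items` is hypothetical, and it only tests that modules can be located and that files exist, not that their versions or contents are correct.

```python
import importlib.util
from pathlib import Path


def missing_setup_items(root, modules, files=(), txt_dirs=()):
    """Return descriptions of setup items that appear to be missing.

    Hypothetical helper for illustration; not part of this repo.
    `modules` must be top-level import names (e.g. "sklearn").
    """
    root = Path(root)
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            missing.append(f"python module: {name}")
    for rel in files:
        if not (root / rel).is_file():
            missing.append(f"file: {rel}")
    for rel in txt_dirs:
        directory = root / rel
        if not (directory.is_dir() and any(directory.glob("*.txt"))):
            missing.append(f"no .txt files in: {rel}")
    return missing


if __name__ == "__main__":
    # Requirements of the pretrained run, as listed above.
    problems = missing_setup_items(
        root=".",
        modules=["sklearn"],
        txt_dirs=["data/twitter-datasets"],
    )
    print("\n".join(problems) if problems else "Setup looks complete.")
```

Run from the root of the repository. For the full training setup, the same function could be called with `modules=["fasttext", "sklearn", "keras"]` and `files=["data/glove/glove.twitter.27B.200d.txt"]`.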