Determining the veracity and authenticity of social media content has recently attracted interest in the NLP community. False claims and rumours affect people's perceptions of events and their behaviour, sometimes in harmful ways. Here, the task is to classify the tweets in a thread according to the stance they express towards a rumour, which is one of four categories: supporting (S), denying (D), querying (Q), or commenting (C), i.e., SDQC.
By improving feature extraction and incorporating tweet-dependent (and other textual) features such as hashtags and link content, we achieve accuracies close to the state of the art with most of our models. Our highest reported accuracy is 77.6%, which is comparable to that of the state-of-the-art model (78.4%). We also attempted to improve recall on the Deny and Query classes by augmenting the training dataset, and by using ensemble methods and bagging/boosting techniques.
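As an illustration of the kind of tweet-dependent features mentioned above, the sketch below extracts a few simple surface signals (hashtag count, presence of a link, question marks, swear words) from a tweet's text. The function name and exact feature set here are illustrative assumptions; the actual features are computed in `Data.py`.

```python
# Hypothetical sketch of tweet-dependent feature extraction
# (the real feature set lives in Data.py; names here are illustrative).
import re

def extract_tweet_features(text, badwords):
    """Return a dict of simple surface features for one tweet."""
    tokens = text.lower().split()
    return {
        "num_hashtags": sum(1 for t in tokens if t.startswith("#")),
        "num_mentions": sum(1 for t in tokens if t.startswith("@")),
        "has_link": int(bool(re.search(r"https?://", text))),
        "has_question_mark": int("?" in text),
        "num_swear_words": sum(1 for t in tokens if t.strip("#@.,!?") in badwords),
        "tweet_length": len(tokens),
    }

# Example usage with the swear-word list shipped in the repo (badwords.txt).
with open("badwords.txt") as f:
    badwords = set(line.strip().lower() for line in f if line.strip())

print(extract_tweet_features("Is this real? http://t.co/xyz #hoax", badwords))
```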
- The `semeval2017-task8-dataset` folder inside the `Dataset` folder contains the training data, and `semeval2017-task8-test-data` contains the test data. This is a public dataset. The labeled test tweets, in the form `tweet id:label`, are in `test_label.json`.
- We also use the `GoogleNews-vectors-negative300` pretrained vectors for word2vec features. The `GoogleNews-vectors-negative300` folder is required by the `Data.py` script and is available at https://github.com/mmihaltz/word2vec-GoogleNews-vectors.
- `badwords.txt` contains the list of swear words used to compute features in `Data.py`.
- We have additionally labeled datasets for the Las Vegas shooting and the California shooting, available at https://docs.google.com/spreadsheets/d/1fnZwO-f14QSKVbruinEX0KUV86t5kEHuyslxLs60fuI/edit?usp=sharing and https://docs.google.com/spreadsheets/d/1Y0cfFmK82J6KGjQpN9BuAGnm4oSFBcMgnkGTZ_MWhHU/edit?usp=sharing. A combined CSV containing only the labeled samples is in `newdata.csv`; the raw JSON tweets are in `las_vegas_shootout.json` and `tweets_california_shootout.json`.
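For reference, the GoogleNews vectors can be loaded with gensim as sketched below. This only illustrates the usual way of reading the binary file under the folder layout described above; `Data.py` may load the vectors differently.

```python
# Minimal sketch (assumes gensim is installed and the .bin file sits inside
# the GoogleNews-vectors-negative300 folder described above).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin",
    binary=True,
)

print(vectors["rumor"].shape)  # each word maps to a 300-dimensional vector
```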
Python version: 2.7
- First run the `Data.py` script. It reads the dataset, preprocesses it, and computes the features of the tweets that are used by the different classifiers. The features are stored as `pickle` files (a sketch of how they can be loaded and used is given after this list).
- To get the results for SVM and Logistic Regression, run `SVM_LRClassifier.py`.
- `nn.py` is a Python script for the neural network model; run it to get the neural network results. The Keras library is used in this script.
- Run `NBClassifier.py` for the Naive Bayes results.
- The Jupyter notebook `ngram_models.ipynb` (Python version: 3.6) includes code and output for the following (a sketch of this pipeline is also given after the list):
- loading the RumourEval training and test data
- preprocessing the data
- vectorizing the data using CountVectorizer and TF-IDF, for unigrams, bigrams, and trigrams
- running different classifiers on it, including
- MultinomialNB
- SVM
- Logistic Regression
- RandomForest
- XGBoost
- an attempt at LSTM
- loading the newly collected data
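As referenced in the `Data.py` step above, the following is a minimal sketch of how the pickled features could be loaded and fed to the SVM and Logistic Regression classifiers. The pickle file names and the structure of the stored objects are assumptions; see `Data.py` and `SVM_LRClassifier.py` for the actual formats.

```python
# Hypothetical sketch: the pickle file names and contents below are assumptions,
# not the exact artifacts written by Data.py.
import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC

def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Assumed layout: feature matrices and SDQC labels stored separately.
X_train, y_train = load_pickle("train_features.pkl"), load_pickle("train_labels.pkl")
X_test, y_test = load_pickle("test_features.pkl"), load_pickle("test_labels.pkl")

for name, clf in [("SVM", SVC(kernel="linear")),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    print("%s accuracy: %.3f" % (name, accuracy_score(y_test, preds)))
    print(classification_report(y_test, preds))  # per-class precision/recall (S, D, Q, C)
```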
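Below is a minimal sketch of the kind of n-gram pipeline the notebook describes: a CountVectorizer/TF-IDF vectorizer over unigrams to trigrams feeding a classifier such as MultinomialNB or Logistic Regression. The variable names, toy data, and parameter settings are illustrative; the actual configuration is in `ngram_models.ipynb`.

```python
# Illustrative n-gram pipeline (settings are assumptions; see ngram_models.ipynb).
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def build_ngram_pipeline(classifier, ngram_range=(1, 3)):
    """Count n-grams (unigrams to trigrams by default), re-weight with TF-IDF,
    then fit the given classifier."""
    return Pipeline([
        ("counts", CountVectorizer(ngram_range=ngram_range, lowercase=True)),
        ("tfidf", TfidfTransformer()),
        ("clf", classifier),
    ])

# Toy SDQC example; the notebook trains on the RumourEval tweets instead.
train_texts = ["this is definitely true", "no way, this is fake",
               "is there a source for this?", "interesting times"]
train_labels = ["support", "deny", "query", "comment"]

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    model = build_ngram_pipeline(clf)
    model.fit(train_texts, train_labels)
    print("%s -> %s" % (type(clf).__name__, model.predict(["is this confirmed?"])[0]))
```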
- Derczynski et al., SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours.
- Kochkina et al., Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM.
- Dataset Link, SemEval-2017 Task 8 Dataset.
- Bahuleyan et al., UWaterloo at SemEval-2017 Task 8: Detecting Stance towards Rumours with Topic Independent Features.