Our goal here is to detect the sentiment, positive or negative, carried by a tweet using shallow-learning techniques.
⚡Binary Classification
⚡Natural Language Processing (NLP)
⚡Logistic Regression
⚡Naive Bayes Classifier (implemented from scratch)
⚡NLTK
⚡scikit-learn
- Introduction
- Data Preprocessing
- Logistic Regression
- Naive Bayes Classifier
- How to use
- License
- Get in touch
In this project, we try to detect the sentiment, positive or negative, carried by a tweet. We use two of the simplest yet very effective classification algorithms...
a) Logistic Regression
b) Naive Bayes
Broadly, the solution is divided into four parts…
- Preprocessing
- Building and training Logistic Regression
- Building and training Naive Bayes
- Testing the trained models
We make use of the NLTK-provided `twitter_samples` dataset. It's a balanced dataset with 5000 positively labeled and 5000 negatively labeled raw tweets. The data is split into train and test sets in an 80:20 ratio.
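For reference, a minimal sketch of loading the corpus and making the 80:20 split (the variable names here are illustrative, not necessarily the notebook's):

```python
import nltk
from nltk.corpus import twitter_samples

nltk.download('twitter_samples')

# 5000 raw tweets per class
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

# 80:20 train/test split per class (4000 train, 1000 test each)
train_pos, test_pos = positive_tweets[:4000], positive_tweets[4000:]
train_neg, test_neg = negative_tweets[:4000], negative_tweets[4000:]
```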
Data preprocessing is a fundamental task for any machine-learning project. Text data in particular is highly unstructured and needs a lot of cleaning and formatting before it can be used for modeling. Tweets go through the data-cleaning steps listed below (a code sketch follows the list)...
- Remove tweet handles and symbols (`#`, `$`)
- Remove hyperlinks from the tweets
- Convert tweets to lower-case
- Break tweets into word tokens
- Get rid of punctuation tokens
- Get rid of English stop-word tokens
- Stem each token
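A minimal sketch of these cleaning steps, assuming NLTK's `TweetTokenizer`, English stop-word list, and `PorterStemmer`; the helper name `process_tweet` is illustrative, not necessarily the one used in `utils.py`:

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

nltk.download('stopwords')

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
# preserve_case=False lower-cases, strip_handles=True drops @mentions
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

def process_tweet(tweet):
    tweet = re.sub(r'\$\w*', '', tweet)         # remove $ symbols (e.g. $GE)
    tweet = re.sub(r'https?://\S+', '', tweet)  # remove hyperlinks
    tweet = re.sub(r'#', '', tweet)             # remove the # symbol, keep the word
    tokens = tokenizer.tokenize(tweet)          # lower-case, strip handles, tokenize
    return [stemmer.stem(t) for t in tokens
            if t not in stop_words and t not in string.punctuation]
```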
Once we have a clean set of data samples in the form of word tokens, they need to be encoded into numeric features before being fed to our models (see the sketch after this list).
- Count and store the number of times a word in the dataset appears in positive tweets (pos_frequency)
- Count and store the number of times a word in the dataset appears in negative tweets (neg_frequency)
- For Logistic Regression, convert each tweet into a data point with three features:
  - bias, always set to 1
  - sum of the pos_frequency of all word tokens in the tweet
  - sum of the neg_frequency of all word tokens in the tweet
- Naive Bayes does not need the encoded tweets; instead, the Naive Bayes model computes log-likelihoods from the tweet tokens and the pos_frequency and neg_frequency counts computed above.
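A sketch of the frequency counting and three-feature encoding described above, assuming labels 1 (positive) and 0 (negative) and the `process_tweet` helper from the preprocessing sketch; `build_freqs` and `extract_features` are illustrative names:

```python
import numpy as np

def build_freqs(tweets, labels):
    """Count how often each word appears in positive (1) and negative (0) tweets."""
    freqs = {}
    for tweet, label in zip(tweets, labels):
        for token in process_tweet(tweet):
            freqs[(token, label)] = freqs.get((token, label), 0) + 1
    return freqs

def extract_features(tweet, freqs):
    """Encode a tweet as [bias, sum of pos_frequency, sum of neg_frequency]."""
    x = np.zeros(3)
    x[0] = 1.0                            # bias, always set to 1
    for token in process_tweet(tweet):
        x[1] += freqs.get((token, 1), 0)  # pos_frequency
        x[2] += freqs.get((token, 0), 0)  # neg_frequency
    return x
```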
Logistic regression is a statistical model predominantly used for binary classification problems. It relies on the logistic (sigmoid) function; the single-variable (single-feature) logistic function is shown below.
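The standard single-variable logistic (sigmoid) function, with feature $x$, bias $\theta_0$, and weight $\theta_1$, is

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \theta_0 + \theta_1 x$$

Given the scikit-learn tag on this repo, a plausible training sketch on the three-feature encoding above (the variables `train_tweets` and `train_labels` are assumptions, not necessarily the notebook's names):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Encode every training tweet as [bias, pos_frequency sum, neg_frequency sum]
train_x = np.array([extract_features(t, freqs) for t in train_tweets])
train_y = np.array(train_labels)  # 1 = positive, 0 = negative

lr_model = LogisticRegression()
lr_model.fit(train_x, train_y)
```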
Naive Bayes is based on Bayes' theorem, but it makes the unrealistic assumption (hence "naive") that the events are independent of each other, which is hardly the case in real life. Despite this, it performs surprisingly well on many problems such as email spam detection and text sentiment analysis.
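Since the Naive Bayes classifier here is implemented from scratch, the following is a minimal sketch of the usual log-likelihood formulation with Laplace smoothing, assuming the `freqs` dictionary and `process_tweet` helper from the sketches above and a NumPy array `train_y` of 0/1 labels; all names are illustrative:

```python
import numpy as np

def train_naive_bayes(freqs, train_y):
    """Compute the log prior and per-word log-likelihood ratios."""
    vocab = {word for (word, _) in freqs}
    n_pos = sum(count for (_, label), count in freqs.items() if label == 1)
    n_neg = sum(count for (_, label), count in freqs.items() if label == 0)
    logprior = np.log((train_y == 1).sum()) - np.log((train_y == 0).sum())
    loglikelihood = {}
    for word in vocab:
        # Laplace (add-one) smoothing so unseen (word, class) pairs stay nonzero
        p_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + len(vocab))
        p_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + len(vocab))
        loglikelihood[word] = np.log(p_pos / p_neg)
    return logprior, loglikelihood

def naive_bayes_predict(tweet, logprior, loglikelihood):
    """Positive sentiment if the summed log score exceeds zero."""
    score = logprior + sum(loglikelihood.get(t, 0.0) for t in process_tweet(tweet))
    return 'POSITIVE' if score > 0 else 'NEGATIVE'
```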
- Ensure the below-listed packages are installed: `sklearn`, `numpy`, `nltk`
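  For example, via pip (`scikit-learn` is the pip package that provides `sklearn`):

  ```
  pip install scikit-learn numpy nltk
  ```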
- Download the Jupyter notebook `tweet_sentiment_classification.ipynb` and `utils.py`
- Within the notebook:
  - Create a couple of tweet strings (or copy real tweets), one with positive sentiment and the other with negative sentiment (refer to the example in the notebook)
  - Logistic Regression: call the `utils.get_sentiment` function as shown below

    ```python
    sentiment = utils.get_sentiment(test_tweet, lr_model, encode_tweet=True)
    ```

    `sentiment` is returned as 'POSITIVE' for the positive tweet and 'NEGATIVE' for the negative tweet. Note that the `encode_tweet` parameter is set to `True` to ensure that the tweet tokens are encoded into features for Logistic Regression.
  - Naive Bayes: call the `utils.get_sentiment` function as shown below

    ```python
    sentiment = utils.get_sentiment(test_tweet, nb_model, encode_tweet=False)
    ```

    `sentiment` is returned as 'POSITIVE' for the positive tweet and 'NEGATIVE' for the negative tweet. Note that the `encode_tweet` parameter is set to `False` in this case, to ensure that the raw tweet tokens are used for the Naive Bayes model.