- nltk
- colab
The original dataset, train.csv, has 99899 tweets.
Preprocessing usually depends on the type of data under analysis. For Twitter sentiment analysis, the preprocessing steps followed are:
- Removing words that match a particular pattern, e.g. user names such as @user1
- Removing punctuation, numbers and apostrophes
- Tokenization
- Fixing the word length (one could also perform spell correction), e.g. converting juuuusssttttt to just
- Removing stop words and words with length less than 2
- Lemmatization
- Removing rare and the most frequently occurring words
- Some manual corrections
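
The steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function names `clean_tweet` and `drop_rare_and_frequent`, the tiny `STOP_WORDS` set, and the `min_count` and `top_n` thresholds are all assumptions made for the example. In practice the stop word list and the lemmatizer would more likely come from nltk, so lemmatization is left here as an optional callable. Manual corrections are not covered.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list for illustration only;
# a full list (for example from nltk) would normally be used instead.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "in", "it", "so"}


def clean_tweet(text, lemmatize=None):
    """Apply the per-tweet cleaning steps and return a list of tokens.

    `lemmatize` is an optional callable mapping a token to its lemma
    (an assumption of this sketch; no lemmatizer is bundled here).
    """
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)       # drop user handles such as @user1
    text = re.sub(r"['’]", "", text)        # drop apostrophes: "don't" -> "dont"
    text = re.sub(r"[^a-z\s]", " ", text)   # drop punctuation and numbers
    tokens = text.split()                   # simple whitespace tokenization
    # Cap runs of a repeated letter at two: "juuuusssttttt" -> "juusstt".
    # Getting all the way to "just" would need a spell correction step.
    tokens = [re.sub(r"(.)\1{2,}", r"\1\1", t) for t in tokens]
    # Remove stop words and words shorter than 2 characters.
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) >= 2]
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]
    return tokens


def drop_rare_and_frequent(docs, min_count=2, top_n=1):
    """Remove corpus-wide rare words and the `top_n` most frequent words.

    `docs` is a list of token lists. Words seen fewer than `min_count`
    times count as rare. Both thresholds are illustrative assumptions.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    frequent = {tok for tok, _ in counts.most_common(top_n)}
    rare = {tok for tok, c in counts.items() if c < min_count}
    drop = frequent | rare
    return [[t for t in doc if t not in drop] for doc in docs]
```

For instance, `clean_tweet("@user1 it is juuuusssttttt sooo good!!! 100%")` returns `['juusstt', 'soo', 'good']`: the handle, the digits and the stop words are gone, and the stretched words are shortened but not yet spell-corrected.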
- Naive Bayes (57.23% accuracy on the test dataset)
- Logistic Regression (57.24% accuracy on the test dataset)
- LSTM