This project is a Natural Language Processing (NLP) initiative that performs sentiment analysis on IMDB movie reviews. It aims to classify reviews as either positive or negative using machine learning models, with a focus on preprocessing raw text data and converting it into numerical formats for predictive analysis.
- Cleans and preprocesses raw text data by removing HTML tags, punctuation, and stopwords.
- Converts text into numerical data using a Bag of Words model.
- Implements a Random Forest classifier for sentiment classification.
- Measures performance using the ROC AUC score.
- Provides an end-to-end pipeline for data preprocessing, model training, and evaluation.
- Build an NLP pipeline for sentiment analysis.
- Create a robust model to classify movie reviews.
- Demonstrate practical applications of NLP techniques in real-world scenarios.
- Showcase the effectiveness of the Bag of Words model and Random Forest classifier.
Natural Language Processing (NLP) enables computers to understand, interpret, and respond to human language. In this project, NLP techniques are used to preprocess text data and extract meaningful features for sentiment analysis.
This project applies NLP in the following ways:
- HTML tag removal to clean text data.
- Text normalization by converting to lowercase and removing punctuation.
- Stopword removal to focus on meaningful words.
- Feature extraction using the Bag of Words model.
- Sentiment classification with machine learning.
Input: "This movie was fantastic! The plot and characters were well-developed."
Output: Positive Sentiment.
Input: "The movie was too slow and boring for my taste."
Output: Negative Sentiment.
- Python 3.x
- Libraries: numpy, pandas, matplotlib, scikit-learn (imported as sklearn), beautifulsoup4 (imported as bs4), nltk, and the standard-library re module
- NLTK stopwords dataset
- Import the required libraries.
- Download the dataset labeledTrainData.tsv from Kaggle.
- Preprocess the data by removing HTML tags, punctuation, and stopwords.
- Train a Random Forest model using the Bag-of-Words features.
- Evaluate model performance using the ROC AUC score.
- Source: Kaggle IMDB Dataset
- File: labeledTrainData.tsv
- Format: Tab-separated values containing movie reviews and sentiment labels.
- id: Unique identifier for each review.
- sentiment: Binary sentiment (0 = negative, 1 = positive).
- review: Text content of the review.
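The column layout above can be checked with a short pandas sketch. The two sample rows and their ids are made up for illustration so the snippet runs without the Kaggle file; only the column names and the tab separator come from the description above.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the documented columns.
# To load the real file instead, pass "labeledTrainData.tsv" as the path.
sample_tsv = (
    "id\tsentiment\treview\n"
    "r1\t1\tThis movie was fantastic!\n"
    "r2\t0\tThe movie was too slow and boring for my taste.\n"
)

# sep="\t" tells pandas the file is tab-separated; header=0 uses the first row as column names.
reviews = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=0)

print(reviews.shape)
print(reviews["sentiment"].value_counts().to_dict())
```

Checking the label distribution right after loading is a cheap way to confirm that both classes are present before splitting the data.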
Sentiment analysis is a popular NLP task used in applications like customer feedback analysis, market research, and social media monitoring. This project demonstrates how to preprocess text data and apply machine learning to classify sentiments effectively.
- numpy: Efficient numerical operations for data handling.
- pandas: Data manipulation using DataFrames.
- matplotlib.pyplot: Visualization for data insights.
- sklearn.feature_extraction.text: Tools like CountVectorizer for feature engineering.
- CountVectorizer: Converts text into numerical vectors for machine learning.
- sklearn.ensemble: Includes Random Forest Classifier for model training.
- RandomForestClassifier: An ensemble learning algorithm for classification tasks.
- sklearn.metrics: Tools like ROC AUC score for performance evaluation.
- bs4 (BeautifulSoup): Removes HTML tags from text data.
- re: Handles text cleaning via regular expressions.
- nltk: Provides NLP tools, including stopwords.
- sklearn.model_selection (train_test_split): Splits data into training and testing sets.
- nltk.corpus (stopwords): Contains common stopwords for exclusion.
HTML tags can add noise to text data, affecting the model's ability to learn meaningful patterns.
Punctuation carries little sentiment information in a Bag of Words representation and can interfere with text vectorization.
Standardizing text case ensures that words like "Movie" and "movie" are treated the same.
Stopwords (e.g., "the", "is", "and") do not carry significant meaning and can dilute the model's focus on sentiment-carrying words.
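The four cleaning steps above can be sketched as one function. This is a simplified stand-in, not the project's actual code: it strips tags with a regular expression instead of BeautifulSoup and uses a small hand-written stopword set instead of the NLTK stopwords dataset, so that it runs with the standard library alone. The name `clean_review` and the contents of `STOPWORDS` are assumptions made for illustration.

```python
import re

# Tiny illustrative stopword set (assumption); the project uses NLTK's English stopword list.
STOPWORDS = {"the", "is", "and", "a", "an", "was", "this", "it", "of", "to", "in", "for", "my", "were"}


def clean_review(raw_review):
    """Return a lowercase, tag-free, punctuation-free, stopword-free version of a review."""
    # 1. Drop HTML tags such as <br /> (regex stand-in for BeautifulSoup).
    no_tags = re.sub(r"<[^>]+>", " ", raw_review)
    # 2. Replace everything that is not a letter (punctuation, digits) with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)
    # 3. Lowercase and split on whitespace.
    words = letters_only.lower().split()
    # 4. Keep only words that are not stopwords.
    kept = [word for word in words if word not in STOPWORDS]
    return " ".join(kept)


print(clean_review("<br />This movie was GREAT, and the plot is fun!"))
# prints: movie great plot fun
```

Swapping in BeautifulSoup and the NLTK list only changes steps 1 and 4; the order of the steps stays the same.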
- Features (X): Cleaned and preprocessed text data.
- Labels (Y): Sentiment labels (positive/negative).
train_x, test_x, y_train, y_test = train_test_split(x, y, test_size=0.1)
- Transforms text into a numerical matrix using word frequencies.
vectorizer = CountVectorizer(max_features=5000)
train_x = vectorizer.fit_transform(train_x).toarray()
This produces a matrix of word counts, with one column for each of up to 5,000 of the most frequent words in the training text.
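To make the count matrix concrete, here is a small sketch on a made-up corpus. The three example sentences are assumptions for illustration. It also shows the test-side counterpart of the call above: the vocabulary is learned from the training text only, and test text is encoded with `transform`, so words the vectorizer never saw are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned reviews (assumption for illustration).
train_texts = ["great movie great plot", "boring slow movie"]
test_texts = ["great acting slow start"]

vectorizer = CountVectorizer(max_features=5000)

# fit_transform learns the vocabulary from the training text and counts words.
train_matrix = vectorizer.fit_transform(train_texts).toarray()
# transform reuses that vocabulary; "acting" and "start" are unseen, so they are dropped.
test_matrix = vectorizer.transform(test_texts).toarray()

# Column order of the matrix, recovered from the learned word -> index mapping.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocabulary)    # ['boring', 'great', 'movie', 'plot', 'slow']
print(train_matrix)  # rows: [0 2 1 1 0] and [1 0 1 0 1]
print(test_matrix)   # row:  [0 1 0 0 1]
```

Calling `fit_transform` on the test text instead would build a different vocabulary, and the columns would no longer line up with what the model was trained on.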
A Random Forest is an ensemble learning method that combines multiple decision trees for robust classification. The model is trained using RandomForestClassifier(n_estimators=100, random_state=42).
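A minimal sketch of training and scoring with the settings named above follows, run on a made-up toy corpus so that it is self-contained. The example texts, the 25% test size, the stratified split, and the use of `predict_proba` for scoring are assumptions; the source does not say whether hard labels or probabilities were passed to `roc_auc_score`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Made-up, already-cleaned reviews (assumption); repeated to get 40 samples.
positive = ["great movie loved plot", "fantastic acting wonderful story",
            "loved characters great fun", "wonderful film fantastic cast"]
negative = ["boring slow movie", "terrible plot awful acting",
            "slow boring waste", "awful film terrible story"]
texts = (positive + negative) * 5
labels = ([1] * 4 + [0] * 4) * 5

# stratify keeps both classes in the test set, which roc_auc_score requires.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

vectorizer = CountVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of the positive class.
positive_scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, positive_scores)
print(f"ROC AUC: {auc:.3f}")
```

Scoring with probabilities rather than hard 0/1 predictions lets ROC AUC reflect how well the model ranks reviews, not just where a single threshold falls.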
- Vectorizer: Creates numerical features from text.
- Transform: Applies the learned vocabulary to transform new data.
- Model: The trained Random Forest Classifier.
- Predict: Generates predictions on unseen data.
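The four pieces listed above fit together roughly as follows. This is a sketch under stated assumptions: the tiny training set and the helper name `predict_sentiment` are hypothetical, and new reviews are assumed to have gone through the same cleaning as the training text.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned training reviews (assumption for illustration).
train_texts = ["great fantastic movie", "wonderful great plot",
               "boring slow movie", "terrible boring plot"]
train_labels = [1, 1, 0, 0]

# Vectorizer and Model: fitted once on the training data.
vectorizer = CountVectorizer(max_features=5000)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(vectorizer.fit_transform(train_texts).toarray(), train_labels)


def predict_sentiment(cleaned_reviews):
    """Hypothetical helper: return 0/1 labels for a list of cleaned review strings."""
    # Transform: reuse the learned vocabulary, never refit on new data.
    features = vectorizer.transform(cleaned_reviews).toarray()
    # Predict: 1 = positive, 0 = negative.
    return model.predict(features).tolist()


predictions = predict_sentiment(["great wonderful story", "slow terrible acting"])
print(predictions)
```

Keeping the fitted vectorizer and the model together matters because the model only understands the column layout that this particular vectorizer produced.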
This project is licensed under the MIT License.
Developed by Yash Mittal. Version 1.0