This project aims to perform sentiment analysis on Yelp reviews using Apache Spark and Spark NLP. The project includes two main scripts:
- spark_nlp_sentiment_analysis.py: Utilizes pre-trained models from Spark NLP for sentiment analysis.
- yelp_sentiment_analysis_pipeline.py: Implements a custom sentiment analysis pipeline using various NLP techniques and machine learning models.
There is no direct dependency between the two scripts. Each script is designed to perform sentiment analysis independently using different approaches and techniques. Here is a brief overview of each script's functionality:
spark_nlp_sentiment_analysis.py
:- Uses pre-trained models from Spark NLP for sentiment analysis.
- Loads and preprocesses Yelp review data.
- Applies the Universal Sentence Encoder and SentimentDLModel for sentiment analysis.
- Evaluates the model performance using the F1 score.
- Identifies the top 10 most positive and negative businesses based on review sentiments.
yelp_sentiment_analysis_pipeline.py
:- Implements a custom sentiment analysis pipeline using PySpark.
- Loads and preprocesses Yelp review and business data.
- Performs text preprocessing including tokenization, stop words removal, and lemmatization.
- Uses CountVectorizer and TF-IDF for feature extraction.
- Trains a Linear Support Vector Classifier (SVC) model with cross-validation.
- Evaluates the model performance using the F1 score.
- Handles class imbalance through undersampling.
- Apache Spark
- Spark NLP
- PySpark
- Clone the repository:
git clone https://github.com/pathak-ashutosh/sentiment-analysis-yelp-reviews.git
cd sentiment-analysis-yelp-reviews
- Install the required Python packages:
pip install pyspark sparknlp
- spark_nlp_sentiment_analysis.py:
spark-submit spark_nlp_sentiment_analysis.py
- yelp_sentiment_analysis_pipeline.py:
spark-submit yelp_sentiment_analysis_pipeline.py
- spark_nlp_sentiment_analysis.py: Script for sentiment analysis using pre-trained Spark NLP models.
- yelp_sentiment_analysis_pipeline.py: Script for custom sentiment analysis pipeline using PySpark.
Since there is no dependency between the scripts, you can run them in any order. Each script is self-contained and does not rely on the output or results of the other script. You can choose to run either script first based on your preference or the specific analysis you are interested in:
- If you want to use pre-trained models for a quick sentiment analysis, run
spark_nlp_sentiment_analysis.py
first. - If you are interested in building and evaluating a custom sentiment analysis pipeline, run
yelp_sentiment_analysis_pipeline.py
first.
Using the Spark NLP pre-trained model, we performed sentiment analysis on Yelp reviews. The steps and results are as follows:
-
Model and Data Processing:
- Utilized the Universal Sentence Encoder and SentimentDLModel for sentiment analysis.
- Converted star ratings to sentiment labels: 1-2 stars as negative, 2.5-3.5 stars as neutral, and 4-5 stars as positive.
-
Evaluation:
- Compared the predicted sentiments with the actual star ratings.
- Achieved an F1 score of 0.69.
-
Business Sentiment Analysis:
Using a custom sentiment analysis pipeline with PySpark, we built and evaluated a Linear Support Vector Classifier (SVC) model. The steps and results are as follows:
-
Data Preprocessing:
- Converted text to lowercase, handled contractions, and removed non-alphabetic characters.
- Tokenized text and removed stop words.
- Encoded sentiments: 1-3 stars as negative (0) and 3.5-5 stars as positive (1).
-
Feature Engineering:
- Applied CountVectorizer and TF-IDF for feature extraction.
- TF-IDF provided better results with an F1 score of 0.89 compared to 0.53 from CountVectorizer.
-
Undersampling:
- Balanced the dataset by undersampling the majority class.
- Ensured equal representation of positive and negative sentiments.
-
Model Training and Evaluation:
- Trained the Linear SVC model with cross-validation.
- Best parameters for TF-IDF:
regParam
: 0.01maxIter
: 100
- Achieved an F1 score of 0.8955 on the validation set.
-
Confusion Matrix:
- Spark NLP Model: Achieved an F1 score of 0.69.
- Linear SVC Model with TF-IDF: Achieved an F1 score of 0.8955.
- Top Positive and Negative Businesses: Identified and visualized based on review sentiments.
- State-wise Sentiment Distribution: Analyzed and plotted for better insights.
The results demonstrate the effectiveness of both pre-trained models and custom-built pipelines in performing sentiment analysis on large-scale datasets. The Linear SVC model with TF-IDF vectorization showed superior performance, highlighting the importance of feature engineering and model tuning in sentiment classification tasks.
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.