Yelp Sentiment Analysis with Apache Spark and Spark NLP

Overview

This project aims to perform sentiment analysis on Yelp reviews using Apache Spark and Spark NLP. The project includes two main scripts:

spark_nlp_sentiment_analysis.py: Utilizes pre-trained models from Spark NLP for sentiment analysis.
yelp_sentiment_analysis_pipeline.py: Implements a custom sentiment analysis pipeline using various NLP techniques and machine learning models.

Project Files

The two scripts are independent: each performs sentiment analysis on its own, using a different approach. Each script's functionality is summarized below:

spark_nlp_sentiment_analysis.py:
- Uses pre-trained models from Spark NLP for sentiment analysis.
- Loads and preprocesses Yelp review data.
- Applies the Universal Sentence Encoder and SentimentDLModel for sentiment analysis.
- Evaluates the model performance using the F1 score.
- Identifies the top 10 most positive and negative businesses based on review sentiments.
yelp_sentiment_analysis_pipeline.py:
- Implements a custom sentiment analysis pipeline using PySpark.
- Loads and preprocesses Yelp review and business data.
- Performs text preprocessing including tokenization, stop words removal, and lemmatization.
- Uses CountVectorizer and TF-IDF for feature extraction.
- Trains a Linear Support Vector Classifier (SVC) model with cross-validation.
- Evaluates the model performance using the F1 score.
- Handles class imbalance through undersampling.
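
The text preprocessing steps listed for yelp_sentiment_analysis_pipeline.py can be illustrated in plain Python. This is a minimal sketch, not the script's actual code: the function name `preprocess`, the contraction map, and the stop-word list are hypothetical and far smaller than anything a real pipeline would use, and lemmatization is left out because it needs a language model.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
# Order matters: specific contractions are expanded before the generic "n't".
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "n't": " not",
}
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "it"}


def preprocess(text):
    """Lowercase, expand contractions, drop non-alphabetic characters,
    tokenize on whitespace, and remove stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("The food wasn't great, but it's OK!"))
```

In a PySpark job the same steps would typically run as DataFrame column transformations rather than on one string at a time.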

Getting Started

Prerequisites

Apache Spark
Spark NLP
PySpark

Installation

Clone the repository:

git clone https://github.com/pathak-ashutosh/sentiment-analysis-yelp-reviews.git
cd sentiment-analysis-yelp-reviews

Install the required Python packages:

pip install pyspark spark-nlp

Running the Scripts

spark_nlp_sentiment_analysis.py:

spark-submit spark_nlp_sentiment_analysis.py

yelp_sentiment_analysis_pipeline.py:

spark-submit yelp_sentiment_analysis_pipeline.py

Project Structure

spark_nlp_sentiment_analysis.py: Script for sentiment analysis using pre-trained Spark NLP models.
yelp_sentiment_analysis_pipeline.py: Script for custom sentiment analysis pipeline using PySpark.

Running Order

Because the scripts do not depend on each other, you can run them in any order. Each is self-contained and does not use the other's output, so choose based on the analysis you are interested in:

If you want to use pre-trained models for a quick sentiment analysis, run spark_nlp_sentiment_analysis.py first.
If you are interested in building and evaluating a custom sentiment analysis pipeline, run yelp_sentiment_analysis_pipeline.py first.

Results and Analysis

Spark NLP Sentiment Analysis

Using the Spark NLP pre-trained model, we performed sentiment analysis on Yelp reviews. The steps and results are as follows:

Model and Data Processing:
- Utilized the Universal Sentence Encoder and SentimentDLModel for sentiment analysis.
- Converted star ratings to sentiment labels: 1-2 stars as negative, 2.5-3.5 stars as neutral, and 4-5 stars as positive.
Evaluation:
- Compared the predicted sentiments with the actual star ratings.
- Achieved an F1 score of 0.69.
Business Sentiment Analysis:
- Identified the top 10 most positive and negative businesses based on review sentiments.
- Plotted graphs to visualize the sentiment distribution across different states.
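
The star-to-label conversion described above can be sketched as a small Python function. The function name `stars_to_sentiment` and the range check are assumptions made for illustration; the thresholds follow the ranges stated in the list, on the assumption that ratings come in half-star steps.

```python
def stars_to_sentiment(stars):
    """Map a star rating to a three-class sentiment label.

    1-2 stars -> negative, 2.5-3.5 stars -> neutral, 4-5 stars -> positive.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars <= 2:
        return "negative"
    if stars <= 3.5:
        return "neutral"
    return "positive"


for rating in (1, 2.5, 4):
    print(rating, stars_to_sentiment(rating))
```

The resulting labels are what the predicted sentiments would be compared against when computing the F1 score.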

Linear SVC Sentiment Analysis Pipeline

Using a custom sentiment analysis pipeline with PySpark, we built and evaluated a Linear Support Vector Classifier (SVC) model. The steps and results are as follows:

Data Preprocessing:
- Converted text to lowercase, handled contractions, and removed non-alphabetic characters.
- Tokenized text and removed stop words.
- Encoded sentiments: 1-3 stars as negative (0) and 3.5-5 stars as positive (1).
Feature Engineering:
- Applied CountVectorizer and TF-IDF for feature extraction.
- TF-IDF provided better results with an F1 score of 0.89 compared to 0.53 from CountVectorizer.
Undersampling:
- Balanced the dataset by undersampling the majority class.
- Ensured equal representation of positive and negative sentiments.
Model Training and Evaluation:
- Trained the Linear SVC model with cross-validation.
- Best parameters for TF-IDF:
  - regParam: 0.01
  - maxIter: 100
- Achieved an F1 score of 0.8955 on the validation set.
Confusion Matrix:
- The model showed a high number of True Positives and True Negatives, indicating strong classification performance.
- Some False Positives and False Negatives were observed, suggesting room for improvement with further hyperparameter tuning.
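
The binary label encoding and the undersampling step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the script's code: the names `encode_binary` and `undersample`, the dictionary row format, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict


def encode_binary(stars):
    """1-3 stars -> negative (0), 3.5-5 stars -> positive (1)."""
    return 0 if stars <= 3 else 1


def undersample(rows, label_key="label", seed=42):
    """Randomly drop rows from larger classes until every class
    has as many rows as the smallest class."""
    if not rows:
        return []
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Made-up ratings: eight positive reviews and two negative ones.
reviews = [{"stars": s} for s in (5, 5, 4, 4.5, 5, 4, 3.5, 5, 1, 2)]
for review in reviews:
    review["label"] = encode_binary(review["stars"])

balanced = undersample(reviews)
print(len(balanced), sum(r["label"] for r in balanced))
```

Undersampling discards data from the majority class, so it trades training-set size for balance. On a Spark DataFrame the same effect is usually achieved by sampling each class with a different fraction.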

Summary

Spark NLP Model: Achieved an F1 score of 0.69.
Linear SVC Model with TF-IDF: Achieved an F1 score of 0.8955.
Top Positive and Negative Businesses: Identified and visualized based on review sentiments.
State-wise Sentiment Distribution: Analyzed and plotted for better insights.

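Both the pre-trained Spark NLP model and the custom pipeline can perform sentiment analysis on a large-scale dataset. The Linear SVC model with TF-IDF features scored higher (F1 of 0.8955 versus 0.69), and TF-IDF clearly outperformed CountVectorizer features (0.89 versus 0.53), which highlights the importance of feature engineering and model tuning in sentiment classification. The two headline scores are not directly comparable, however: the Spark NLP evaluation uses three sentiment classes, while the Linear SVC is a binary classifier trained on data balanced by undersampling.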
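
For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for a single positive class from confusion-matrix counts. The function name and the counts are made up for illustration and are not results from this project, and the scripts themselves may use a weighted multi-class variant of F1.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Binary F1 for the positive class from confusion-matrix counts."""
    if true_pos == 0:
        return 0.0  # no correct positive predictions: precision and recall are 0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only, not taken from the project's confusion matrix.
print(round(f1_score(true_pos=90, false_pos=10, false_neg=10), 4))
```

Because F1 ignores true negatives, it is often preferred over accuracy when the classes are imbalanced.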

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for details.
