Welcome to the exciting journey of unraveling sentiments hidden within Amazon reviews! This project leverages the power of natural language processing (NLP) to categorize sentiments as positive, negative, or neutral across various product categories.
Utilize PySpark or Pandas to clean, transform, and prepare the Amazon reviews data. Explore the magic in:
processing/process_amazon_reviews_pandas.py
processing/process_amazon_data_spark.py
Dive into detailed analysis with Jupyter notebooks:
- Exploratory Data Analysis (EDA)
- Predictions: Star Ratings, Product Categories, Sentiment Labels
- Statistical Significance Testing
Explore the NLP techniques used:
- Text Cleaning: Lowercasing, HTML unescaping, punctuation removal
- Sentiment Analysis: VADER Sentiment library (see the sketch after this list)
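To make these steps concrete, here is a minimal sketch of the cleaning and scoring flow, assuming the standalone vaderSentiment package; the helper name and the label thresholds below are illustrative, not necessarily the project's own:

```python
import html
import string

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


def clean_text(text: str) -> str:
    # Lowercase, unescape HTML entities, and strip punctuation, as described above.
    text = html.unescape(text.lower())
    return text.translate(str.maketrans("", "", string.punctuation))


analyzer = SentimentIntensityAnalyzer()
review = "Great sound &amp; battery life, but the case feels CHEAP!"
score = analyzer.polarity_scores(clean_text(review))["compound"]

# Common VADER convention for mapping compound scores to labels
# (an assumption here, not necessarily the thresholds used in this repo).
label = "positive" if score >= 0.05 else "negative" if score <= -0.05 else "neutral"
print(label, score)
```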
Before diving into the code (especially the notebooks), we strongly recommend downloading the preprocessed output dataset directory and models hosted on archive.org. This step will save you time by avoiding the need to rerun the entire data processing phase.
- Output Datasets: Available in .csv.gz format: Download Output Datasets
- Models: Available in .joblib.gz format: Download Models
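Once downloaded, both kinds of artifacts can be loaded straight from their compressed form. A minimal sketch, using the default directories from the .env skeleton below; the file names are placeholders, so substitute whatever the archive actually contains:

```python
import gzip

import joblib
import pandas as pd

# Placeholder file names -- use the actual files from the archive.org download.
reviews = pd.read_csv("./output-amazon/processed_reviews.csv.gz", compression="gzip")

with gzip.open("./models/sentiment_bayes.joblib.gz", "rb") as f:
    model = joblib.load(f)
```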
The processing package, although insightful, is mainly included to showcase our data cleaning and preparation process.
Just make sure you have Python 3.11 installed, and we'll take care of the rest!
Before proceeding with either the automatic setup or manual exploration, please ensure that the scripts have the necessary permissions to execute. This can be done by navigating to the root directory of the project and running the following commands:
chmod +x ./scripts/setup.sh
chmod +x ./scripts/download-data.sh
These commands grant execute permissions to the setup.sh and download-data.sh scripts so they can run on your system.
Now you're ready to continue with the setup process, as outlined in the sections below!
- Clone the repository.
- Navigate to the root directory of the project.
- Run the magical setup script:
./scripts/setup.sh
If you prefer to explore manually, you'll need to set some environment variables. Here's the default .env skeleton:
SPARK_SAMPLE_LIMIT=50000
PANDAS_SAMPLE_LIMIT=95000
SPARK_SAMPLE_FRACTION=0.90
AMAZON_BIGDATA_INPUT_DIRECTORY=./input-amazon/
AMAZON_BIGDATA_OUTPUT_DIRECTORY=./output-amazon/
ML_MODEL_FOLDER=./models/
ML_MODEL_TESTING_FOLDER=validation_data/
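For reference, here is a minimal sketch of how these variables might be read inside the processing scripts, assuming python-dotenv is among the dependencies; the scripts may equally read values exported into the environment by setup.sh:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is listed in requirements.txt

load_dotenv()  # picks up the .env file in the project root

input_dir = os.getenv("AMAZON_BIGDATA_INPUT_DIRECTORY", "./input-amazon/")
output_dir = os.getenv("AMAZON_BIGDATA_OUTPUT_DIRECTORY", "./output-amazon/")
pandas_limit = int(os.getenv("PANDAS_SAMPLE_LIMIT", "95000"))

print(f"Reading up to {pandas_limit} reviews from {input_dir}, writing to {output_dir}")
```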
- Download Data: Run the download-data.sh script from the root directory to download the Amazon reviews data and pre-trained models:
./scripts/download-data.sh
- Create Virtual Environment:
python3 -m venv venv
source venv/bin/activate
- Install Dependencies:
pip install -r requirements.txt
- Download spaCy Model:
python3 -m spacy download en_core_web_md
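After the download, the model is loaded by name at runtime. A quick sanity check; exactly how the project uses spaCy (tokenization, lemmatization, etc.) is an assumption here:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # fails if the download step above was skipped
doc = nlp("The battery lasted two weeks and shipping was fast.")
print([token.lemma_ for token in doc if not token.is_stop])
```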
- Process Data: Choose either Pandas or PySpark:
python3 processing/process_amazon_reviews_pandas.py # For Pandas
or
spark-submit processing/process_amazon_data_spark.py # For PySpark
- Explore Analysis Notebooks: Navigate to the analysis directory to explore the Jupyter notebooks.
Explore the performance of the Bayes models through classification reports and scoring with the analysis/run_models.py script. This script provides insights into how well the models are performing on the validation data.
To run the script, navigate to the analysis directory and execute:
python3 run_models.py
This will generate classification reports and scores for the Bayes models, displaying them in the standard output. Make sure you have the required models and validation data available before running this script.
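For orientation, the evaluation boils down to something like the sketch below. The file names, column names, and the assumption that the saved model is a full scikit-learn pipeline are placeholders, so defer to run_models.py for the actual layout:

```python
import gzip

import joblib
import pandas as pd
from sklearn.metrics import classification_report

# Placeholder paths and columns -- run_models.py defines the real layout
# (see ML_MODEL_FOLDER and ML_MODEL_TESTING_FOLDER in .env).
with gzip.open("../models/sentiment_bayes.joblib.gz", "rb") as f:
    model = joblib.load(f)  # assumed to be a vectorizer + Naive Bayes pipeline

validation = pd.read_csv("../models/validation_data/validation.csv.gz")
X, y_true = validation["review_text"], validation["sentiment_label"]

y_pred = model.predict(X)
print(classification_report(y_true, y_pred))
print("score:", model.score(X, y_true))
```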
Running Spark jobs requires significant memory and may not be suitable for machines with limited resources. If you are an SFU student, faculty member, or staff, consider using CSIL if you wish to run Spark jobs. Otherwise, you can choose the Pandas option for data processing.