End-to-end sentiment analysis project that uses natural language processing (NLP) to analyze reviews of cellular operator applications on the Google Play Store.

anggapark/sentiment-provider-app

provider-sentiment

Overview

This project analyzes user reviews of the mobile applications of four cellular operators: MyTelkomsel, MyXL, MyIM3, and MySF (Smartfren). Reviews are scraped from the Google Play Store, and natural language processing (NLP) techniques are applied to classify sentiment, gauge user satisfaction, and identify common issues.

Goals

  1. Sentiment Classification: To classify reviews of each provider's app into categories such as positive or negative, giving a clear picture of customer satisfaction.

  2. Identify Key Issues: To identify common complaints, praises, or suggestions shared by customers, helping each operator understand the issues that need immediate attention or improvement.

  3. Competitor Comparison: To compare sentiment scores between the four providers, allowing for a better understanding of public perception and brand image relative to each other.

Tech Stack

| Framework/Technology | Role |
| --- | --- |
| Kedro | Structuring data engineering and data science pipelines |
| PostgreSQL | Serves as a data lake for raw data and a data warehouse for preprocessed data |
| Docker | Containerizing the entire project |
| Apache Airflow | Scheduling workflows as DAGs |
| Scikit-learn | TF-IDF vectorizer and support vector machine |
| PyTorch | Building the LSTM model and training IndoBERT |
| Tableau | Creating visual dashboards and reports |

How to install dependencies

Declare any dependencies in requirements.txt for pip installation.

To install them, run:

pip install -r requirements.txt

How to run ETL and ML pipeline using Docker

  1. Change directory to the project root:

    cd sentiment-provider-app
    
  2. Initialize Airflow within Docker:

    docker-compose up init-airflow -d
    

    -d = Detached mode: Run containers in the background

  3. Run docker-compose:

    docker-compose up
    
  4. To open Airflow, visit this link in a browser:

    http://localhost:8080/
    

How to stop the services:

docker-compose down -v

-v = Remove named volumes declared in the "volumes" section of the Compose file and anonymous volumes attached to containers

How to Access API

  1. Change to the deploy directory

    cd deploy
    
  2. Run the API

    uvicorn api:app --reload
    
  3. Test the API

    curl -X 'POST' \
    'http://127.0.0.1:8000/predict' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "text": "aplikasi ini bagus tapi sinyalnya jelek dan kadang lemot"
    }'

    You should receive a JSON response:

    { "Sentiment": "Negative" }
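The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the API is running locally on the default uvicorn port shown above, and the helper names `build_predict_request` and `predict_sentiment` are hypothetical, not part of this repository.

```python
import json
import urllib.request

# Assumed local address of the uvicorn server started in step 2.
API_URL = "http://127.0.0.1:8000/predict"


def build_predict_request(text, url=API_URL):
    """Build the same POST request as the curl command above."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def predict_sentiment(text, url=API_URL):
    """Send the review text to the API and return the decoded JSON body."""
    request = build_predict_request(text, url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the API to be running):
# predict_sentiment("aplikasi ini bagus tapi sinyalnya jelek dan kadang lemot")
```

Separating request construction from sending keeps the payload easy to inspect without a live server.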

ETL Pipelines

[Figure: ETL pipeline diagram]

The Extract-Transform-Load (ETL) pipeline consists of the following stages:

  1. Extract

    • Scrape data from the Google Play Store
    • Store the CSV file locally for manual labelling
    • Dump the labeled dataset into PostgreSQL
  2. Transform

    • Combine datasets

    • Remove missing values

    • Remove reviews that contain only emoji

    • Case folding

    • Add a space after punctuation marks so that adjacent words are not merged when punctuation is removed

      Example
       Input: "Aplikasi yang sangat buruk,jelek,pembohong"
       Output: "Aplikasi yang sangat buruk, jelek, pembohong"
      
    • Remove punctuation characters

    • Remove non-ASCII characters from the input text

    • Remove URLs

    • Stemming (reduces words to their root form)

    • Replace slang words in the input text with their formal equivalents using the colloquial-indonesian-lexicon dictionary

    • Remove specific irrelevant words, such as brand names

    • Fix letter repetition

      Example
      "mmantap" -> "mantap",
      "mannntap" -> "mantap",
      "mantapp" -> "mantap"
      
    • Remove reviews with fewer than 2 words

    • Label encoding

    • Remove empty strings after preprocessing

  3. Load

    • Store the transformed data in PostgreSQL as the data warehouse
    • Data in the data warehouse can be used for dashboards and machine learning
  4. Machine Learning
    Several models are compared to find the best result:

    • Support Vector Machine (K-fold cross validation & grid search hyperparameter tuning)

    • LSTM (PyTorch)

    • IndoBERT transformer model

    • Gemini LLM

      | Model | F1 Score (%) |
      | --- | --- |
      | SVM | 82.353 |
      | SVM (Grid Search) | 87.850 |
      | LSTM | 82.393 |
      | IndoBERT | 97.48 |
      | Gemini LLM | 93.913 |
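Returning to the Transform stage, several of the text-cleaning steps can be sketched with the Python standard library alone. This is an illustrative sketch, not the project's actual code: the function names, the regular expressions, and the step order are assumptions (URLs are removed first here so that later punctuation handling does not split them). Stemming, slang replacement, and brand-name removal are omitted because they depend on external resources.

```python
import re
import string

# Hypothetical patterns; the project's own rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_BEFORE_TEXT_RE = re.compile(
    "([" + re.escape(string.punctuation) + r"])(?=\S)"
)
REPEAT_RE = re.compile(r"([a-z])\1+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def space_after_punctuation(text):
    """Insert a space after any punctuation mark directly followed by text."""
    return PUNCT_BEFORE_TEXT_RE.sub(r"\1 ", text)


def collapse_repeats(text):
    """Collapse runs of the same lowercase letter into a single letter.

    Naive on purpose: it also shortens legitimate double letters.
    """
    return REPEAT_RE.sub(r"\1", text)


def preprocess(text):
    """Apply a subset of the Transform steps to one review."""
    text = text.lower()                                    # case folding
    text = URL_RE.sub(" ", text)                           # remove URLs
    text = space_after_punctuation(text)                   # keep words apart
    text = text.translate(PUNCT_TABLE)                     # remove punctuation
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII
    text = collapse_repeats(text)                          # fix repetition
    return " ".join(text.split())                          # tidy whitespace


def clean_reviews(reviews):
    """Preprocess reviews and drop those with fewer than 2 words."""
    cleaned = (preprocess(review) for review in reviews)
    return [text for text in cleaned if len(text.split()) >= 2]
```

Dropping reviews with fewer than 2 words also removes emoji-only reviews and empty strings, since both reduce to an empty string after cleaning. Collapsing every repeated letter is the simplest rule that reproduces the three "mantap" examples, but it would need a dictionary check to preserve words that legitimately contain double letters.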

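For readers unfamiliar with the metric in the table, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for a single class in plain Python. The label encoding (1 for the class of interest) and the function name are assumptions, and the source does not state which averaging scheme was used for the reported scores.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1 score for the class given by `positive`.

    `positive=1` is an assumed label encoding, not taken from the project.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Usage with made-up labels: 2 true positives, 1 false positive,
# 1 false negative.
score = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```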