This sentiment analysis project analyzes user reviews of the mobile applications of four cellular operators: MyTelkomsel, MyXL, MyIM3, and MySF (Smartfren). The reviews are scraped from the Google Play Store, and natural language processing (NLP) techniques are used to measure user satisfaction and identify common issues.
- Sentiment Classification: Classify reviews of each provider's app as positive or negative, giving a clear picture of customer satisfaction.
- Identify Key Issues: Identify common complaints, praise, and suggestions shared by customers, helping each operator see which issues need immediate attention or improvement.
- Competitor Comparison: Compare sentiment scores across the four providers to understand public perception and brand image relative to one another.

Framework/Technology | Role |
---|---|
Kedro | Structures the data engineering and data science pipelines |
PostgreSQL | Serves as a data lake for raw data and a data warehouse for preprocessed data |
Docker | Containerizes the entire project |
Apache Airflow | Schedules workflows as DAGs |
Scikit-learn | TF-IDF vectorizer and support vector machine |
PyTorch | Builds the LSTM model and trains IndoBERT |
Tableau | Creates visual dashboards and reports |

Declare any dependencies in requirements.txt for pip installation.
To install them, run:
pip install -r requirements.txt
- Change to the project root directory:
  cd sentiment-provider-app
- Initialize Airflow within Docker:
  docker-compose up init-airflow -d
  -d = Detached mode: run containers in the background
- Run docker-compose:
  docker-compose up
- To open Airflow, visit this link in a browser:
  http://localhost:8080/
To stop the running services:
docker-compose down -v
-v = Remove named volumes declared in the "volumes" section of the Compose file and anonymous volumes attached to containers
- Change to the deploy directory:
  cd deploy
- Run the API:
  uvicorn api:app --reload
- Test the API:
  curl -X 'POST' \
    'http://127.0.0.1:8000/predict' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{ "text": "aplikasi ini bagus tapi sinyalnya jelek dan kadang lemot" }'
  You should receive a JSON response:
  { "Sentiment": "Negative" }
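
The same request can also be sent from Python. The sketch below uses only the standard library and mirrors the curl command above; it assumes the API is running locally on uvicorn's default address. The function names `build_predict_request` and `predict_sentiment` are illustrative and are not part of the project.

```python
import json
import urllib.request

# Assumption: the API started with `uvicorn api:app --reload` listens on
# uvicorn's default host and port, as in the curl example.
PREDICT_URL = "http://127.0.0.1:8000/predict"


def build_predict_request(text: str, url: str = PREDICT_URL) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def predict_sentiment(text: str, url: str = PREDICT_URL) -> dict:
    """Send the review text to the API and return the decoded JSON body."""
    request = build_predict_request(text, url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the API to be running):
# predict_sentiment("aplikasi ini bagus tapi sinyalnya jelek dan kadang lemot")
```

Splitting request construction from the network call keeps the payload easy to inspect without a running server.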
The Extract-Transform-Load (ETL) pipeline consists of:
- Extract
  - Scrape data from the Google Play Store
  - Store the CSV file locally for manual labelling
  - Dump the labeled dataset into PostgreSQL
- Transform
  - Combine datasets
  - Remove missing values
  - Remove reviews that contain only emoji
  - Case folding
  - Add a space after punctuation so that words do not merge when punctuation is removed
    Example
    Input: "Aplikasi yang sangat buruk,jelek,pembohong" Output: "Aplikasi yang sangat buruk, jelek, pembohong"
  - Remove punctuation characters
  - Remove non-ASCII characters from the input text
  - Remove URLs
  - Stemming (reduces words to their root form)
  - Replace slang words in the input text with their formal equivalents using the colloquial-indonesian-lexicon dictionary
  - Remove specific irrelevant words, such as brand names
  - Fix letter repetition
    Example
    "mmantap" -> "mantap", "mannntap" -> "mantap", "mantapp" -> "mantap"
  - Remove reviews with fewer than 2 words
  - Label encoding
  - Remove empty strings left after preprocessing
- Load
  - Store the transformed data in PostgreSQL as the data warehouse
  - Data in the data warehouse can be used for dashboards and machine learning
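
Several of the Transform steps can be illustrated with a short, standard-library sketch. This is not the project's implementation: the function names, the two-entry `SLANG_MAP` (a stand-in for the colloquial-indonesian-lexicon dictionary), and the regular expressions are assumptions chosen to reproduce the examples given in the list. Stemming, URL removal, and brand-word removal are left out.

```python
import re
import string

# Hypothetical stand-in for the colloquial-indonesian-lexicon dictionary.
SLANG_MAP = {"gk": "tidak", "yg": "yang"}


def add_space_after_punctuation(text: str) -> str:
    """Insert a space after punctuation that is directly followed by a word character."""
    return re.sub(r"([,.!?;:])(?=\w)", r"\1 ", text)


def remove_punctuation(text: str) -> str:
    """Delete all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_non_ascii(text: str) -> str:
    """Drop every character outside the ASCII range (emoji, accents, etc.)."""
    return text.encode("ascii", "ignore").decode("ascii")


def replace_slang(text: str, slang_map: dict = SLANG_MAP) -> str:
    """Replace each whitespace-separated token that appears in the slang map."""
    return " ".join(slang_map.get(word, word) for word in text.split())


def fix_letter_repetition(text: str) -> str:
    """Collapse runs of a repeated lowercase letter to a single letter.

    Simplification: this also shortens words whose correct spelling has a
    doubled letter, so a real pipeline would need a more careful rule.
    """
    return re.sub(r"([a-z])\1+", r"\1", text)


def clean_review(text: str) -> str:
    """Apply the steps in the order they are listed above."""
    text = text.lower()                      # case folding
    text = add_space_after_punctuation(text)
    text = remove_punctuation(text)
    text = remove_non_ascii(text)
    text = replace_slang(text)
    text = fix_letter_repetition(text)
    return " ".join(text.split())            # normalize whitespace


print(clean_review("Aplikasi yang sangat buruk,jelek,pembohong"))
```

Adding the space before stripping punctuation is what keeps "buruk,jelek" from collapsing into a single token.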
- Machine Learning
  Several models are compared to find the best result:
  - Support Vector Machine (K-fold cross-validation and grid search hyperparameter tuning)
  - LSTM (PyTorch)
  - IndoBERT transformer model
  - Gemini LLM

Model | F1 Score (%) |
---|---|
SVM | 82.353 |
SVM (Grid Search) | 87.850 |
LSTM | 82.393 |
IndoBERT | 97.48 |
Gemini LLM | 93.913 |
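
The SVM setup named above (TF-IDF features, K-fold cross-validation, grid search) might look roughly like the following scikit-learn sketch. The toy reviews, the parameter grid, and the two-fold split are made up for illustration; they are not the project's data or settings, and the F1 scores reported above do not come from this code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up toy reviews (1 = positive, 0 = negative); not the project's dataset.
texts = [
    "aplikasi bagus dan cepat",
    "sangat membantu mantap",
    "mudah digunakan bagus",
    "layanan cepat dan mantap",
    "aplikasi lemot dan jelek",
    "sinyal jelek sekali",
    "sering error dan lemot",
    "sangat buruk dan error",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier in one pipeline, so TF-IDF is refit on each training fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# Assumed search space; the real grid is not stated in this document.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}

# Two folds only because the toy dataset is tiny.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["aplikasi lemot dan error"]))
```

Keeping the vectorizer inside the pipeline prevents vocabulary from the validation fold leaking into training during cross-validation.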
-