A Python-based project for performing sentiment analysis on Twitter data. Get the full project paper here.
- Twitter data is scraped hourly using the {twint} package.
- Scheduled model training is performed monthly on MLflow, served on a `g1-small` GCP Compute Engine instance.
- Model training artifacts are stored in GCP Cloud Storage.
- An instance schedule is applied to the Compute Engine instance for cost efficiency, so it only runs when needed.
- Total cost of GCP usage is less than MYR 3.00/month (approx. USD 0.72/month).
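The instance schedule can be created and attached with `gcloud` resource policies; the schedule name, region, zone, and start/stop times below are assumptions for illustration:

```shell
# Create a schedule that starts the VM at 08:00 and stops it at 09:00 daily
# (name, region, and times are illustrative).
gcloud compute resource-policies create instance-schedule mlflow-schedule \
    --region=asia-southeast1 \
    --vm-start-schedule="0 8 * * *" \
    --vm-stop-schedule="0 9 * * *" \
    --timezone="Asia/Kuala_Lumpur"

# Attach the schedule to the instance running the MLflow server.
gcloud compute instances add-resource-policies mlflow-instance \
    --resource-policies=mlflow-schedule \
    --zone=asia-southeast1-a
```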
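As a concrete illustration of the hourly scrape, the sketch below builds a one-hour time window and hands it to twint. The handle, output path, and helper names are assumptions for illustration, not the project's actual scraper:

```python
from datetime import datetime, timedelta, timezone


def scrape_window(now=None, hours=1):
    """Return (since, until) strings covering the last `hours` hours,
    in the 'YYYY-MM-DD HH:MM:SS' format twint expects."""
    now = now or datetime.now(timezone.utc)
    since = now - timedelta(hours=hours)
    fmt = "%Y-%m-%d %H:%M:%S"
    return since.strftime(fmt), now.strftime(fmt)


def scrape(handle):
    import twint  # lazy import: keeps scrape_window() usable without twint installed

    since, until = scrape_window()
    c = twint.Config()
    c.Username = handle          # hypothetical handle of interest
    c.Since = since
    c.Until = until
    c.Store_csv = True
    c.Output = f"data/raw/{handle}.csv"  # assumed output layout
    twint.run.Search(c)
```

Run hourly (e.g. from a scheduled job), each invocation appends the latest window of tweets to the handle's CSV.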
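A monthly training job could log to the MLflow server set up below roughly like this sketch; the parameter and metric names, and the lazy import, are illustrative assumptions:

```python
def tracking_uri(host="localhost", port=5000):
    # The MLflow server below binds to localhost on its default port 5000,
    # so training jobs on the same instance reach it at this URI.
    return f"http://{host}:{port}"


def log_run(params, metrics):
    import mlflow  # lazy import so tracking_uri() stays usable without mlflow installed

    mlflow.set_tracking_uri(tracking_uri())
    with mlflow.start_run():
        mlflow.log_params(params)    # e.g. {"model": "logreg", "C": 1.0}
        mlflow.log_metrics(metrics)  # e.g. {"accuracy": 0.87}
```

Because the server is started with `--default-artifact-root` pointing at a Cloud Storage bucket, artifacts logged in a run land in GCS automatically.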
`startup-script` for the GCP Compute Engine instance:
```shell
#! /bin/bash
sudo apt update
sudo apt-get -y install tmux
echo "Installing python3-pip"
sudo apt install -y python3-pip
export PATH="$HOME/.local/bin:$PATH"
echo "Installing mlflow and google-cloud-storage"
pip3 install mlflow google-cloud-storage
echo "Starting new tmux session"
# Launch the MLflow server inside the detached tmux session so it keeps
# running after the startup script exits.
sudo -H -u <USERNAME> tmux new-session -d -s mysession \
    "mlflow server \
        --backend-store-uri sqlite:///mlflow.db \
        --default-artifact-root <gsutil URI> \
        --host localhost"
```
Working example:
```shell
#! /bin/bash
sudo apt update
sudo apt-get -y install tmux
echo "Installing python3-pip"
sudo apt install -y python3-pip
export PATH="$HOME/.local/bin:$PATH"
echo "Installing mlflow and google-cloud-storage"
pip3 install mlflow google-cloud-storage
echo "Starting new tmux session"
# Launch the MLflow server inside the detached tmux session so it keeps
# running after the startup script exits.
sudo -H -u tweet_sentiment_py tmux new-session -d -s mysession \
    "mlflow server \
        --backend-store-uri sqlite:///mlflow.db \
        --default-artifact-root gs://mlflow_bucket_001 \
        --host localhost"
```
- Unfortunately, the project is hardly reproducible due to manual pipeline integration and authentication processes.
- The dashboard is not scalable: the Twitter handles of the accounts of interest are currently hardcoded in the Python file served on {Streamlit} for data analysis and visualization.
- There are no fallbacks for failed scheduled Actions.
- Roadblock: as of Jan 2022, the GitHub Actions build may fail due to a dependency installation error. This affects both the scheduled pipelines and the dashboard. (See Ref)
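For reference, a scheduled pipeline of this kind is typically driven by a workflow file along these lines; the file name, cron expression, and step names are assumptions, and pinning exact versions in `requirements.txt` is one way to mitigate the dependency-installation failures noted above:

```yaml
# Hypothetical .github/workflows/scrape.yml
name: hourly-scrape
on:
  schedule:
    - cron: "0 * * * *"   # every hour, UTC
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      # Pinned versions in requirements.txt make installs reproducible.
      - run: pip install -r requirements.txt
      - run: python scrape.py
```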
Project based on the cookiecutter data science project template. #cookiecutterdatascience
Thank you to the developers of twint and all other packages for making this project possible!