A Python-based project for performing sentiment analysis on Twitter data. Get the full project paper here.
- Twitter data is scraped hourly using the {twint} package.
- Scheduled model training is performed monthly on MLflow, served on a `g1-small` GCP Compute Engine instance.
- Model training artifacts are stored in GCP Cloud Storage.
- An instance schedule is applied to the Compute Engine instance for cost efficiency, so it only runs when needed.
- Total cost of GCP usage is less than MYR 3.00/month (approx. USD 0.72/month).
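The instance schedule can be created and attached with `gcloud` resource policies; the schedule name, region, zone, and start/stop times below are assumptions for illustration:

```shell
# Create a schedule that starts the VM at 08:00 and stops it at 09:00 daily
# (name, region, and times are illustrative).
gcloud compute resource-policies create instance-schedule mlflow-schedule \
    --region=asia-southeast1 \
    --vm-start-schedule="0 8 * * *" \
    --vm-stop-schedule="0 9 * * *" \
    --timezone="Asia/Kuala_Lumpur"

# Attach the schedule to the instance running the MLflow server.
gcloud compute instances add-resource-policies mlflow-instance \
    --resource-policies=mlflow-schedule \
    --zone=asia-southeast1-a
```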
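As a concrete illustration of the hourly scrape, the sketch below builds a one-hour time window and hands it to twint. The handle, output path, and helper names are assumptions for illustration, not the project's actual scraper:

```python
from datetime import datetime, timedelta, timezone


def scrape_window(now=None, hours=1):
    """Return (since, until) strings covering the last `hours` hours,
    in the 'YYYY-MM-DD HH:MM:SS' format twint expects."""
    now = now or datetime.now(timezone.utc)
    since = now - timedelta(hours=hours)
    fmt = "%Y-%m-%d %H:%M:%S"
    return since.strftime(fmt), now.strftime(fmt)


def scrape(handle):
    import twint  # lazy import: keeps scrape_window() usable without twint installed

    since, until = scrape_window()
    c = twint.Config()
    c.Username = handle          # hypothetical handle of interest
    c.Since = since
    c.Until = until
    c.Store_csv = True
    c.Output = f"data/raw/{handle}.csv"  # assumed output layout
    twint.run.Search(c)
```

Run hourly (e.g. from a scheduled job), each invocation appends the latest window of tweets to the handle's CSV.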
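A monthly training job could log to the MLflow server set up below roughly like this sketch; the parameter and metric names, and the lazy import, are illustrative assumptions:

```python
def tracking_uri(host="localhost", port=5000):
    # The MLflow server below binds to localhost on its default port 5000,
    # so training jobs on the same instance reach it at this URI.
    return f"http://{host}:{port}"


def log_run(params, metrics):
    import mlflow  # lazy import so tracking_uri() stays usable without mlflow installed

    mlflow.set_tracking_uri(tracking_uri())
    with mlflow.start_run():
        mlflow.log_params(params)    # e.g. {"model": "logreg", "C": 1.0}
        mlflow.log_metrics(metrics)  # e.g. {"accuracy": 0.87}
```

Because the server is started with `--default-artifact-root` pointing at a Cloud Storage bucket, artifacts logged in a run land in GCS automatically.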
`startup-script` for the GCP Compute Engine instance:
```shell
#! /bin/bash
sudo apt update
sudo apt-get -y install tmux
echo "Installing python3-pip"
sudo apt install -y python3-pip
export PATH="$HOME/.local/bin:$PATH"
echo "Installing mlflow and google-cloud-storage"
pip3 install mlflow google-cloud-storage
echo "Starting new tmux session"
# Launch the MLflow server inside the detached tmux session so it keeps
# running after the startup script exits.
sudo -H -u <USERNAME> tmux new-session -d -s mysession \
    "mlflow server \
        --backend-store-uri sqlite:///mlflow.db \
        --default-artifact-root <gsutil URI> \
        --host localhost"
```
Working example:
```shell
#! /bin/bash
sudo apt update
sudo apt-get -y install tmux
echo "Installing python3-pip"
sudo apt install -y python3-pip
export PATH="$HOME/.local/bin:$PATH"
echo "Installing mlflow and google-cloud-storage"
pip3 install mlflow google-cloud-storage
echo "Starting new tmux session"
# Launch the MLflow server inside the detached tmux session so it keeps
# running after the startup script exits.
sudo -H -u tweet_sentiment_py tmux new-session -d -s mysession \
    "mlflow server \
        --backend-store-uri sqlite:///mlflow.db \
        --default-artifact-root gs://mlflow_bucket_001 \
        --host localhost"
```
- Unfortunately, the project is hardly reproducible due to manual pipeline integration and authentication processes.
- The dashboard is not scalable: the Twitter handles of the accounts of interest are currently hardcoded in the Python file served on {Streamlit} for data analysis and visualization.
- There are no fallbacks for failed scheduled Actions.
- Roadblock: as of Jan 2022, the GitHub Actions build may fail due to a dependency installation error. This affects both the scheduled pipelines and the dashboard. (See Ref)
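For reference, a scheduled pipeline of this kind is typically driven by a workflow file along these lines; the file name, cron expression, and step names are assumptions, and pinning exact versions in `requirements.txt` is one way to mitigate the dependency-installation failures noted above:

```yaml
# Hypothetical .github/workflows/scrape.yml
name: hourly-scrape
on:
  schedule:
    - cron: "0 * * * *"   # every hour, UTC
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      # Pinned versions in requirements.txt make installs reproducible.
      - run: pip install -r requirements.txt
      - run: python scrape.py
```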
Project based on the cookiecutter data science project template. #cookiecutterdatascience
Thank you to the developers of twint and all other packages for making this project possible!