This project builds a data pipeline based on the Lambda architecture, handling large volumes of data by combining batch and stream processing. On top of the pipeline, we also analyze Twitter's tweets.
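As a rough illustration of the two Lambda layers, the sketch below pairs a Spark batch job with a Spark Structured Streaming job. It is a minimal sketch only: the Kafka source, the `tweets` topic, the storage paths, and the `lang` column are illustrative assumptions, not taken from this repository.

```python
from pyspark.sql import SparkSession

# Illustrative Lambda-architecture sketch; assumes a Kafka broker on localhost:9092,
# a hypothetical "tweets" topic, and the spark-sql-kafka package on the classpath.
spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch layer: periodically recompute views from the full, persisted dataset.
batch_df = spark.read.parquet("data/tweets_master")  # hypothetical path
batch_df.groupBy("lang").count() \
    .write.mode("overwrite").parquet("data/batch_views/lang_counts")

# Speed layer: maintain a low-latency view from the live stream.
stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "tweets")
    .load()
)
query = (
    stream_df.selectExpr("CAST(value AS STRING) AS tweet")
    .writeStream.format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```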
The project requires:

- Python 3.*
- Apache Spark 3.2.*
- A Twitter API account
- A `config.ini` file
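To confirm the Python and Spark requirements are met, a quick check such as the one below may help; it only assumes that `pyspark` is installed in the active environment.

```python
import sys

import pyspark

# Quick check against the versions listed above.
assert sys.version_info >= (3,), "Python 3.* is required"
assert pyspark.__version__.startswith("3.2"), "Apache Spark 3.2.* is required"
print("Python", sys.version.split()[0], "| PySpark", pyspark.__version__)
```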
To set up and run the project:

- Change `config.template.ini` to `config.ini`
- Adjust some basic values in `config.ini` (an illustrative loading snippet is shown below)
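How `config.ini` is consumed depends on the project code, but a typical pattern is to load it with `configparser`. The section and key names below (`twitter`, `api_key`, `api_secret`) are placeholders for illustration, not the repository's actual schema.

```python
import configparser

# Illustrative only: section and key names are placeholders, not the real schema.
config = configparser.ConfigParser()
config.read("config.ini")

api_key = config["twitter"]["api_key"]        # hypothetical section/key
api_secret = config["twitter"]["api_secret"]  # hypothetical section/key
print("Loaded Twitter credentials; key ends with", api_key[-4:])
```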
- Grant full permission to the `logs` folder: `sudo chmod a+rwx src/logs`
- Clone the repository: `git clone <repository-url>`
- Run the Docker containers: `make start-docker`
- Set up the virtual environment for the project: `make setup-env`
- Run the project: `make start-all`
- Analyze: go to the notebook for the analysis (a minimal read-back sketch is shown below)
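As a rough sketch of reading the stored tweets back for analysis, the snippet below uses the Spark Cassandra connector. The `tweets` table name, the connection host, and the connector package are assumptions; only the `twitter` keyspace name comes from this README.

```python
from pyspark.sql import SparkSession

# Illustrative only: assumes the spark-cassandra-connector package is available
# and Cassandra is reachable on 127.0.0.1; the "tweets" table name is a guess.
spark = (
    SparkSession.builder.appName("tweet-analysis")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

tweets = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="twitter", table="tweets")
    .load()
)
tweets.printSchema()
print("Stored tweets:", tweets.count())
```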
- If the `twitter` keyspace is not found, run the `cassandra-init-schema` container again
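To check whether the `twitter` keyspace exists before re-running the container, a quick look with the Python `cassandra-driver` can look like this; the contact point `127.0.0.1` and the default port are assumptions about the local Docker setup.

```python
from cassandra.cluster import Cluster

# Illustrative check only: assumes Cassandra is exposed on 127.0.0.1:9042.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

rows = session.execute("SELECT keyspace_name FROM system_schema.keyspaces")
keyspaces = {row.keyspace_name for row in rows}

if "twitter" in keyspaces:
    print("twitter keyspace found")
else:
    print("twitter keyspace missing -- re-run the cassandra-init-schema container")

cluster.shutdown()
```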