The CENSUS project was developed as the final assignment of the Big Data Technologies course offered by the University of Trento.
The objective of the project is to deploy a Big Data system that takes specific user data as input and returns a prediction of the user's income group, based on a subcategorisation of the IRPEF taxation system and its income brackets. The prediction is the output of a Random Forest model trained on data provided by Banca d'Italia and by the Sole24ore. Specifically, the datasets employed are:
- Banca d'Italia, Indagine sui bilanci delle famiglie italiane, Indagine sul 2016 (published in 2018), with data referring to the 2016 survey
- Banca d'Italia, Indagine sui bilanci delle famiglie italiane, Indagine sul 2014 (published in 2015), with data referring to the 2014 survey
- Sole24ore, Lab24: Qualità della vita
Although Banca d'Italia publishes a large amount of data for each survey, only three datasets are retained: carcom, containing the personal details of the survey participants; rper, individual income; and rfam, income per household. The variable descriptions, along with the survey items, can be found in the Documentazione per l'utilizzo dei microdati and in exploratory_data_analysis.Rmd.
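The three Banca d'Italia datasets can be joined on their household and person identifiers before modelling. Below is a minimal sketch of such a join with pandas, assuming CSV files and the SHIW-style NQUEST (household id) and NORD (member id) columns; the actual file names and keys used by the project may differ.

```python
import pandas as pd

# Hypothetical file names: the actual files depend on the Banca d'Italia release.
carcom = pd.read_csv("carcom16.csv")  # participants' personal details
rper = pd.read_csv("rper16.csv")      # individual income
rfam = pd.read_csv("rfam16.csv")      # household income

# SHIW microdata conventionally identify households by NQUEST and
# household members by NORD (assumed column names here).
individuals = carcom.merge(rper, on=["NQUEST", "NORD"], how="inner")
full = individuals.merge(rfam, on="NQUEST", how="left")

print(full.shape)
```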
In order to run this project, the following tools have to be installed on your machine:
- Python, preferably 3.9
- Docker Desktop, preferably 3.5
- Docker Compose v1.27.0 or newer
Older versions of docker-compose
do not support all features required by the docker-compose.yml
file, so check that the minimum version requirements are satisfied.
Clone this repository into a local directory by typing in the command line:
git clone https://github.com/elypaolazz/BDT-Project.git
The creation of a virtual environment is highly recommended. If not already installed, install virtualenv:
- on Unix systems:
python3 -m pip install --user virtualenv
- on Windows systems:
python -m pip install --user virtualenv
Then create the virtual environment named venv by typing in the command line (inside the project folder):
- on Unix systems:
python3 -m venv venv
- on Windows systems:
python -m venv venv
The virtual environment can be activated as follows:
- on Unix systems:
source venv/bin/activate
- on Windows systems:
venv\Scripts\activate
In the active virtual environment, install all libraries contained in the requirements.txt
file:
pip install -r requirements.txt
This project employs a few Docker images:
- the official Apache Airflow Docker image with the Celery executor, a PostgreSQL image as backend, and a Redis image as message broker.
- the official MySQL image with its web interface phpMyAdmin.
- the shaynativ/redis-ml Docker image, which contains both the Redis server and the RedisML module used during the machine learning procedure.
If running on a Linux system, a further check is needed before deploying Airflow. The mounted volumes (defined in the docker-compose.yml file) use user/group permissions, so double-check that the container and the host machine have matching file permissions:
mkdir ./dags ./logs
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
Further information is available here: Running Airflow in Docker
On all operating systems, initialise the project by running:
docker-compose up airflow-init
The command above runs the database migrations and creates the Airflow user account with username airflow and password airflow. The initialisation is complete when the message below appears:
# airflow-init_1 | Admin user airflow created
# airflow-init_1 | 2.1.1
+ airflow-init_1 exited with code 0
Now it is possible to start all the other services by running:
docker-compose up
Or, if detached mode is preferred:
docker-compose up -d
Furthermore, the logs can be inspected with:
docker-compose logs [OPTIONAL: container name]
NOTE:
The two-step procedure can be sidestepped by running the docker-compose up -d command only once. However, this shortcut requires constantly checking the containers' status to detect when airflow-init exits:
docker-compose ps
The resulting output should show that the airflow-init container has exited with code 0.
After the virtual environment and the Docker images are set up, one last step must be performed manually. To start the entire data pipeline, type in the command line (with the virtual environment activated):
- on Unix systems:
python3 runner.py
- on Windows systems:
python runner.py
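For context, runner.py drives the pipeline by sending requests to the Airflow REST API (see the troubleshooting section below). The snippet is only an illustrative sketch of how a DAG run can be triggered and polled through the stable REST API with the default airflow/airflow credentials; the DAG id first_dag and the 30-second polling interval are assumptions, not the project's actual code.

```python
import time
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"
AUTH = ("airflow", "airflow")  # default credentials created by airflow-init

# Trigger a DAG run (the DAG id here is an assumption).
resp = requests.post(f"{AIRFLOW_API}/dags/first_dag/dagRuns", json={"conf": {}}, auth=AUTH)
resp.raise_for_status()
run_id = resp.json()["dag_run_id"]

# Poll until the run reaches a terminal state.
while True:
    state = requests.get(f"{AIRFLOW_API}/dags/first_dag/dagRuns/{run_id}", auth=AUTH).json()["state"]
    if state in ("success", "failed"):
        break
    time.sleep(30)  # a similar interval is used in runner.py

print(f"DAG run finished with state: {state}")
```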
The pipeline will start and follow these steps:
- ingestion phase: the requests Python library downloads the data from the Banca d’Italia and Sole24ore websites
- ETL phase: a DAG in Airflow extracts the relevant data, transforms it using the pandas Python library, and loads it into the MySQL database project_bdt
- storage phase: data is stored in a MySQL server running in another container
- machine learning: data is processed using the RedisML module and its Random Forest algorithm (a minimal usage sketch follows this list)
- web application: the CENSUS web application is launched and can be visualised by clicking or copy-pasting the localhost link (appearing in the terminal) in the browser
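As a rough illustration of the machine learning step, the sketch below uses redis-py and the ML.FOREST.ADD / ML.FOREST.RUN commands of the RedisML module. The forest key, the toy single-split tree, and the feature name are purely illustrative assumptions; the project builds its forest from the trained Random Forest model rather than from a hand-written tree.

```python
import redis

# Connect to the redis-ml container (host/port mapping is an assumption).
r = redis.Redis(host="localhost", port=6379)

# Add a toy tree to a forest: tree 0, root "." splits on a numeric feature,
# and the left/right children are leaves holding class labels.
r.execute_command(
    "ML.FOREST.ADD", "census:forest", 0,
    ".", "NUMERIC", "household_income", 30000,
    ".l", "LEAF", 1,
    ".r", "LEAF", 2,
)

# Run the forest on a sample expressed as comma-separated feature:value pairs.
predicted_class = r.execute_command("ML.FOREST.RUN", "census:forest", "household_income:25000")
print(predicted_class)
```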
Accessing each service's webserver is the recommended way to monitor the pipeline's progress. The webservers can be reached in the browser by navigating to:
- Airflow http://localhost:8080: a first log-in may be necessary with the chosen credentials (the default credentials are username: airflow and password: airflow).
- Flower http://localhost:5555, for monitoring the tasks assigned to the Celery worker.
- phpMyAdmin http://localhost:8082, which handles the administration of MySQL. As above, a log-in is required using the credentials chosen in the docker-compose.yml file (the default credentials are server: mysql, user: root, and password: password).
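If a quick scripted check is preferred over the browser, the small sketch below pings the three webservers; it assumes the default port mappings listed above and Airflow's standard /health endpoint.

```python
import requests

# Default port mappings (assumed unchanged in docker-compose.yml).
SERVICES = {
    "Airflow": "http://localhost:8080/health",
    "Flower": "http://localhost:5555",
    "phpMyAdmin": "http://localhost:8082",
}

for name, url in SERVICES.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: HTTP {status}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")
```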
A user can access the web application in two different ways:
- clicking on the localhost link returned at the end of the data pipeline
- connecting to the stable CENSUS web application, hosted on a third-party server: http://elisapaolazzi.pythonanywhere.com/
The web application, shown below, predicts the income bracket of the user, following the IRPEF categories and further sub-groups.
When accessing the web application, you will see this page:
The script will end automatically with the deployment of the web application. However, note that:
- the logs of the web application will be printed to the terminal.
- the Docker containers must be stopped manually.
To stop and remove the running containers:
docker-compose down
To completely clean up the environment (i.e., delete the containers, delete the volumes with database data, and delete the downloaded images), run:
docker-compose down --volumes --rmi all
The backend code structure is composed of:
- the airflow folder, containing the DAG files for the first three stages of the data pipeline: the ingestion phase (first_dag.py), the ETL phase (second_dag.py), and the MySQL storage (third_dag.py)
- the R_scripts folder, containing the R scripts for the interface graphs and the exploratory data analysis
- the src folder, containing Python files with specific functions for data collection, data transformation, and the machine learning training/testing process
- docker-compose.yml, defining the Docker containers and their relationships
- runner.py, which triggers the entire project
The final interface is a Flask web application composed of:
- main.py, which contains the function needed to launch the application on the local server and the route functions (defining the variables, actions, and events of the different pages)
- forms.py, which defines and manages the application form and its fields
- the templates folder, containing the HTML template for each page
- the static folder, containing the CSS file for the presentation (layout, colours, and fonts) and the images
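To give an idea of how these pieces fit together, here is a heavily simplified sketch of the Flask structure described above. The route, the predict_income helper, and the way form data is read are hypothetical placeholders, not the actual contents of main.py and forms.py.

```python
from flask import Flask, render_template, request

app = Flask(__name__)

def predict_income(form_data):
    # Placeholder: in the real application the submitted features are passed
    # to the Random Forest stored in RedisML and the predicted bracket is returned.
    return "hypothetical income bracket"

@app.route("/", methods=["GET", "POST"])
def home():
    prediction = None
    if request.method == "POST":
        prediction = predict_income(request.form)
    # home.htm lives in the templates folder shown in the project structure below.
    return render_template("home.htm", prediction=prediction)

if __name__ == "__main__":
    app.run(debug=True)
```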
├── airflow
│ └── dags
│ ├── first_dag.py
│ ├── second_dag.py
│ └── third_dag.py
│
├── R_scripts
│ ├── exploratory_data_analysis.Rmd
│ └── graphs_script.Rmd
│
├── src
│ ├── classifier.py
│ ├── collector.py
│ ├── province_ita.json
│ └── saver.py
│
├── static
│ ├── main.css
│ ├── graph.png
│ └── ...
│
├── templates
│ ├── about.htm
│ ├── home.htm
│ ├── layout.htm
│ └── line_chart.htm
│
├── .gitignore
├── docker-compose.yml
├── forms.py
├── main.py
├── requirements.txt
└── runner.py
When running the project, some errors may occur 😥. The most common errors, together with a possible resolution, are listed below:
- The Docker daemon does not reply after the docker-compose up command: relevant information can be found at the following link: Configure and troubleshoot the Docker daemon
- Other common errors are directly linked to the requests sent to the Airflow REST API. In particular:
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
or
CONNECTION ERROR: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
The target machine refused the connection from the client. In this context, it usually means that the Airflow initialisation procedure has not finished yet.
The suggestion is to check the Docker container status by typing docker-compose ps or docker-compose logs airflow-init in the command line and to wait until airflow-init exits.
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
This error refers to an overload of requests. A straightforward solution is to set a longer interval in the sleep() call at line 25 of the runner.py file. It is currently set to 30 seconds, but this interval may not be enough.
- When launching the webservers, a 500 Internal Server Error may arise. This server error response code indicates that the server encountered an unexpected condition that prevented it from fulfilling the request. Try refreshing the page or relaunching the webserver.
If this error appears when submitting a prediction request to the CENSUS application, either a bug is present in the deployed code or the Redis server has disconnected (to check whether this second option applies, type: docker-compose logs redis-ml).