This is a data pipeline built with an Airflow DAG that ingests data from the endpoint https://datausa.io/api/data?drilldowns=Nation&measures=Population, saves the response as a JSON file, converts the JSON into CSV, uploads it to GCS, and finally loads it into the data warehouse as an external table. The flow of the DAG is also shown below:
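As a rough code-level illustration of that flow, here is a minimal sketch of what such a DAG could look like. The task names, file paths, dataset/table names, and the use of the `google` provider operators are assumptions for illustration, not necessarily what this repository uses:

```python
import json
import os
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateExternalTableOperator,
)
from google.cloud import storage

# Hypothetical names and paths -- adjust to your own environment.
URL = "https://datausa.io/api/data?drilldowns=Nation&measures=Population"
LOCAL_HOME = os.environ.get("AIRFLOW_HOME", "/opt/airflow")
JSON_FILE = f"{LOCAL_HOME}/population.json"
CSV_FILE = f"{LOCAL_HOME}/population.csv"
BUCKET = os.environ.get("GCP_GCS_BUCKET", "my-bucket")
PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "my-project")


def convert_json_to_csv(json_path: str, csv_path: str) -> None:
    """Flatten the 'data' key of the API response into a CSV file."""
    with open(json_path) as f:
        payload = json.load(f)
    pd.DataFrame(payload["data"]).to_csv(csv_path, index=False)


def upload_to_gcs(bucket_name: str, object_name: str, local_file: str) -> None:
    """Upload the local CSV file to the GCS bucket."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(object_name).upload_from_filename(local_file)


with DAG(
    dag_id="data_ingestion_dag",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:

    # 1. Download the API response and save it as a JSON file
    download_dataset = BashOperator(
        task_id="download_dataset",
        bash_command=f"curl -sSL '{URL}' > {JSON_FILE}",
    )

    # 2. Convert the JSON file into CSV
    format_to_csv = PythonOperator(
        task_id="format_to_csv",
        python_callable=convert_json_to_csv,
        op_kwargs={"json_path": JSON_FILE, "csv_path": CSV_FILE},
    )

    # 3. Upload the CSV to the GCS bucket
    local_to_gcs = PythonOperator(
        task_id="local_to_gcs",
        python_callable=upload_to_gcs,
        op_kwargs={
            "bucket_name": BUCKET,
            "object_name": "raw/population.csv",
            "local_file": CSV_FILE,
        },
    )

    # 4. Create an external table in the data warehouse on top of the GCS file
    create_external_table = BigQueryCreateExternalTableOperator(
        task_id="create_external_table",
        table_resource={
            "tableReference": {
                "projectId": PROJECT_ID,
                "datasetId": "population_data",
                "tableId": "population_external_table",
            },
            "externalDataConfiguration": {
                "sourceFormat": "CSV",
                "sourceUris": [f"gs://{BUCKET}/raw/population.csv"],
                "autodetect": True,
            },
        },
    )

    download_dataset >> format_to_csv >> local_to_gcs >> create_external_table
```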
The end result of this pipeline is an external table created from the ingested data, which can be read through the metadata of the parquet file and can be queried as well. Below is a screenshot of the result:
Airflow Concepts and Architecture
(For the section on the Custom/Lightweight setup, scroll down)
Airflow Setup with Docker, through official guidelines
- Build the image (only first-time, or when there's any change in the `Dockerfile`; takes ~15 mins for the first-time):
  ```shell
  docker-compose build
  ```
  or (for legacy versions)
  ```shell
  docker build .
  ```
- Initialize the Airflow scheduler, DB, and other config:
  ```shell
  docker-compose up airflow-init
  ```
- Kick up all the services from the container:
  ```shell
  docker-compose up
  ```
- In another terminal, run `docker-compose ps` to see which containers are up & running (there should be 7, matching the services in your docker-compose file).
- Login to the Airflow web UI on `localhost:8080` with the default creds: `airflow`/`airflow`.
- Run your DAG on the Web Console.
- On finishing your run or to shut down the container(s):
  ```shell
  docker-compose down
  ```
  To stop and delete containers, delete volumes with database data, and delete downloaded images, run:
  ```shell
  docker-compose down --volumes --rmi all
  ```
  or
  ```shell
  docker-compose down --volumes --remove-orphans
  ```
This is a quick, simple & less memory-intensive setup of Airflow that works on a LocalExecutor.
Airflow Setup with Docker, customized
- Stop and delete containers, delete volumes with database data, & downloaded images (from the previous setup):
  ```shell
  docker-compose down --volumes --rmi all
  ```
  or
  ```shell
  docker-compose down --volumes --remove-orphans
  ```
  Or, if you need to clear your system of any pre-cached Docker issues:
  ```shell
  docker system prune
  ```
  Also, empty the airflow `logs` directory.
- Build the image (only first-time, or when there's any change in the `Dockerfile`; takes ~5-10 mins for the first-time):
  ```shell
  docker-compose build
  ```
  or (for legacy versions)
  ```shell
  docker build .
  ```
- Kick up all the services from the container (no need to specially initialize):
  ```shell
  docker-compose -f docker-compose-nofrills.yml up
  ```
- In another terminal, run `docker ps` to see which containers are up & running (there should be 3, matching the services in your docker-compose file).
- Login to the Airflow web UI on `localhost:8080` with the creds: `admin`/`admin` (explicit creation of an admin user was required).
- Run your DAG on the Web Console.
- On finishing your run or to shut down the container(s):
  ```shell
  docker-compose down
  ```
Use the `docker-compose_2.3.4.yaml` file (and rename it to `docker-compose.yaml`). Don't forget to replace the variables `GCP_PROJECT_ID` and `GCP_GCS_BUCKET`.
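For reference, the DAG code would typically pick these variables up from the container environment rather than hard-coding them. A minimal sketch, assuming the variables are exported into the Airflow containers by the docker-compose file (the fallback defaults here are placeholders, not real project values):

```python
import os

# Read the variables set in docker-compose.yaml; the defaults are
# illustrative placeholders and should be replaced with your own values.
PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "your-gcp-project-id")
BUCKET = os.environ.get("GCP_GCS_BUCKET", "your-gcs-bucket-name")
```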
- Deploy a self-hosted Airflow setup on a Kubernetes cluster, or use GCP's managed Airflow service (Cloud Composer).
For more info, check out these official docs: