# ETL/ELT Workflow Orchestration
The ETL process consists of:
- Extract time series data from the sources
- Transform it using Apache Spark
- Load it into MySQL / Azure Database for MySQL (a sketch of these steps follows the list)
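
As a rough illustration of these three steps, here is a minimal PySpark sketch. The file path, table name, connection URL, and credentials are hypothetical placeholders, not taken from this repo:

```python
# ETL sketch: extract a CSV time series, transform with Spark,
# load into MySQL over JDBC. All paths, table names, and credentials
# below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("covid_etl")
    # The MySQL JDBC driver must be available on the Spark classpath
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.33")
    .getOrCreate()
)

# Extract: read raw time series data
raw = spark.read.csv("/data/covid/latest.csv", header=True, inferSchema=True)

# Transform: parse dates and aggregate daily cases per country
daily = (
    raw.withColumn("date", F.to_date("date"))
       .groupBy("date", "country")
       .agg(F.sum("cases").alias("total_cases"))
)

# Load: append into MySQL / Azure Database for MySQL
(daily.write.format("jdbc")
      .option("url", "jdbc:mysql://<host>:3306/etl_db")
      .option("dbtable", "covid_daily")
      .option("user", "<user>")
      .option("password", "<password>")
      .mode("append")
      .save())
```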
The latest COVID and stocks data is available daily. Unemployment rate data and USA home prices/home inventory data are available on the 2nd day of every month. Since the related data is time series data, the ETL/ELT workflow should be orchestrated. Apache Airflow is a platform created by the community to programmatically author, schedule, and monitor workflows. It is:
- Extensible
- Scalable
- Dynamic
- Elegant

Airflow consists of:
- Scheduler
- Executor (Celery, Local, and Sequential executors)
- Webserver
- Database for handling metadata

Airflow tasks are grouped under a DAG (Directed Acyclic Graph) according to schedule and dependencies. The Airflow scheduler schedules the DAGs and hands tasks over to executors to run. The scheduler periodically checks the `dags` folder for changes and new DAGs. DAG execution status can be monitored on the webserver, and failed DAGs/tasks can be retried based on the retries parameter. Metadata related to DAG execution is saved in the database. The Celery and Local executors are preferred over the Sequential executor since they can run DAGs/tasks in parallel.
*ETL DAG run for COVID-19 data (screenshot)*
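
To make the schedule/executor/retry mechanics above concrete, here is a minimal Airflow DAG sketch; the DAG id, callables, and schedule are illustrative assumptions rather than this project's actual DAGs:

```python
# Minimal Airflow DAG sketch: a daily schedule, a retry policy, and an
# extract -> transform -> load dependency chain. Task IDs and function
# bodies are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull raw data from the source

def transform():
    ...  # run the Spark job

def load():
    ...  # write results to MySQL

with DAG(
    dag_id="covid_etl_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",          # monthly sources could use "0 0 2 * *"
    catchup=False,
    default_args={
        "retries": 2,                    # failed tasks are retried twice
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies keep the graph acyclic: extract -> transform -> load
    t_extract >> t_transform >> t_load
```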
The current project runs on Docker/Azure ACI with Spark in standalone mode.
- The DAGs are not complex, and the monthly and daily DAGs don't have many tasks, so the Local executor is used with a Postgres metadata database.
- One drawback of the Local executor compared to Celery, even with a small number of tasks, is that there is no way to restart a failed task alone.
- Ended up adding a separate DAG for each source (sketched after this list).
- After adding the housing data (which is about 50% of the entire dataset), having to rerun the entire DAG was an annoyance. An Airflow container with the Celery executor would have been a better option in this case.
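
One common pattern for getting a separate DAG per source is a small factory that registers each DAG as a module-level global; the source names and cron schedules below are hypothetical, not the repo's actual configuration:

```python
# Sketch of one-DAG-per-source: a factory builds an independent DAG for
# each source, so a failure in one source's pipeline does not force a
# rerun of the others. Source names and schedules are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

SOURCES = {
    "covid": "@daily",
    "stocks": "@daily",
    "unemployment": "0 0 2 * *",   # 2nd day of every month
    "housing": "0 0 2 * *",
}

def make_dag(source: str, schedule: str) -> DAG:
    with DAG(
        dag_id=f"{source}_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval=schedule,
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="run_pipeline",
            python_callable=lambda: print(f"running ETL for {source}"),
        )
    return dag

# Airflow discovers DAG objects that are module-level globals in the dags folder
for name, cron in SOURCES.items():
    globals()[f"{name}_etl"] = make_dag(name, cron)
```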
Check out the DAGs under the `dags` folder.