This project is an ELT Pipeline using Dbt (dbt-core) for transformation, Snowflake for data warehousing and Airflow for orchestration.
This project uses the Snowflake tpch dataset (tpch_sf1 schema).
Creates a Snowflake environment with a warehouse, database, and user role for dbt to access and manage data.
Defines a configuration file (dbt_profile.yaml) that specifies how dbt interacts with Snowflake, including the warehouse and schema to use for staging and production data.
Creates source and staging models:
- Source models define how to access raw data from Snowflake's sample data (tpch_sf1 schema).
- Staging models define SQL queries to transform raw data into a format suitable for further processing.
Creates a reusable function (macro) to calculate discounted amounts based on extended price and discount percentage.
Creates dbt models for data transformation:
- Intermediate tables join data from staging models and perform calculations.
- Data marts aggregate and summarize data for specific use cases.
- Fact tables integrate data from various sources for analysis.
Creates generic and singular tests for data quality:
- Generic tests (defined in YAML) enforce data integrity rules like uniqueness, not null values, and relationships between tables.
- Singular tests (written in SQL) identify specific data quality issues like invalid dates or orders with discounts.
Configures Airflow to run the dbt pipeline, I used Astronomer Cosmos for deploying the dbt project into airflow:
- Sets up Airflow with necessary Python libraries for dbt and Snowflake connection.
- Defines a Snowflake connection within the Airflow UI.
- Creates a Python script (dbt_dag.py) that configures an Airflow DAG (Directed Acyclic Graph) to run the dbt project at scheduled intervals. This DAG uses the dbt project directory, profile configuration (including Snowflake connection details), and execution settings (including the path to the dbt executable).