Airflow ETL Data Pipeline

Table of Contents

  1. Introduction
  2. Prerequisites
  3. Project Structure
  4. Setup
  5. Airflow DAG
  6. Data Transformation
  7. Running the Pipeline
  8. Logging and Monitoring

Introduction

This project demonstrates an ETL (Extract, Transform, Load) data pipeline implemented with Apache Airflow. The pipeline extracts data from an S3 bucket, applies transformations using the Astro framework, and loads the transformed data into a Snowflake database.

Prerequisites

  • An AWS account to access the S3 bucket.
  • A Snowflake account for the target Snowflake data warehouse.
  • Access to an Apache Airflow environment, such as Amazon Managed Workflows for Apache Airflow (MWAA).

Project Structure

The project structure is as follows:
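
Only the files referenced elsewhere in this README are shown below; the actual repository may contain additional configuration or data files.

```
.
├── Pipeline.py   # Airflow DAG: extract from S3, transform with Astro, load into Snowflake
└── README.md     # project documentation
```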

Setup

  1. Clone this repository to your local machine.
  2. Ensure that you have Apache Airflow or Amazon Managed Workflows for Apache Airflow (MWAA) set up.
  3. Configure your Airflow connections (a quick verification sketch follows this list):
    • Create an S3 connection (S3_CONN_ID) for accessing the S3 bucket.
    • Create a Snowflake connection (SNOWFLAKE_CONN_ID) for the Snowflake database.
  4. Adjust the DAG script to your requirements, including the S3 file path, the Snowflake connection details, and the data transformations.
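
As a quick check for step 3, the snippet below looks up both connections from inside the Airflow environment. It assumes the connection IDs are literally S3_CONN_ID and SNOWFLAKE_CONN_ID; substitute the IDs your DAG actually uses if they differ.

```python
# Sanity check that the two Airflow connections exist (run inside the Airflow environment).
from airflow.hooks.base import BaseHook

for conn_id in ("S3_CONN_ID", "SNOWFLAKE_CONN_ID"):
    conn = BaseHook.get_connection(conn_id)  # raises AirflowNotFoundException if missing
    print(f"{conn_id}: type={conn.conn_type}, host={conn.host}")
```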

Airflow DAG

The Airflow DAG (Directed Acyclic Graph) is defined in the Python script Pipeline.py. It performs the following tasks (a sketch of a DAG along these lines appears after the list):

  • Sets up a DAG with a unique dag_id.
  • Schedules the DAG to run daily.
  • Defines the tasks that load data from S3, transform it, and load it into Snowflake.
  • Uses the Astro framework for data transformation.
  • Specifies conflict resolution when merging data into Snowflake.
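
Pipeline.py is not reproduced here, but a DAG along these lines can be sketched with the Astro Python SDK (assumed here to be the astro-sdk-python package behind the "Astro framework" mentioned above). The dag_id, S3 path, table names, column names, and merge keys below are placeholders rather than values taken from this repository; only the connection IDs match the ones referenced in the Setup section.

```python
# Minimal sketch of a Pipeline.py-style DAG, assuming the astro-sdk-python package.
# dag_id, S3 path, table names, column names, and merge keys are placeholders.
from datetime import datetime

from airflow.models import DAG
from astro import sql as aql
from astro.files import File
from astro.sql.table import Table

S3_CONN_ID = "S3_CONN_ID"                # Airflow connection to the source bucket
SNOWFLAKE_CONN_ID = "SNOWFLAKE_CONN_ID"  # Airflow connection to Snowflake
S3_FILE_PATH = "s3://<your-bucket>/orders_data.csv"  # placeholder path


@aql.transform
def filter_orders(input_table: Table):
    # Keep only orders above the 150 threshold described in this README.
    return "SELECT * FROM {{input_table}} WHERE amount > 150"


@aql.transform
def join_orders_customers(filtered_orders_table: Table, customers_table: Table):
    # Join the filtered orders with the customers table on a shared key.
    return """
        SELECT c.customer_id, c.customer_name, f.order_id, f.amount
        FROM {{filtered_orders_table}} f
        JOIN {{customers_table}} c ON f.customer_id = c.customer_id
    """


with DAG(
    dag_id="airflow_etl_pipeline",   # placeholder dag_id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # daily run (Airflow 2.4+ kwarg; use schedule_interval on older versions)
    catchup=False,
) as dag:
    # Extract: load the S3 file into a temporary Snowflake table.
    orders_table = aql.load_file(
        input_file=File(path=S3_FILE_PATH, conn_id=S3_CONN_ID),
        output_table=Table(conn_id=SNOWFLAKE_CONN_ID),
    )

    # Transform: filter, then join with an existing customers table.
    customers_table = Table(name="customers_table", conn_id=SNOWFLAKE_CONN_ID)
    joined_table = join_orders_customers(filter_orders(orders_table), customers_table)

    # Load: merge into the reporting table, resolving conflicts on the order key.
    aql.merge(
        target_table=Table(name="reporting_table", conn_id=SNOWFLAKE_CONN_ID),
        source_table=joined_table,
        target_conflict_columns=["order_id"],
        columns=["customer_id", "customer_name", "order_id", "amount"],
        if_conflicts="update",
    )

    # Drop the temporary tables created by the SDK once the run finishes.
    aql.cleanup()
```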

Data Transformation

Data transformations are defined using the Astro framework. Two transformation functions are used:

  1. filter_orders: Selects only the rows of the input table whose "amount" value is greater than 150.
  2. join_orders_customers: Joins the filtered orders table with the customers table using SQL.

Running the Pipeline

  1. Ensure your DAG script is uploaded and configured in your Airflow environment.
  2. Trigger the DAG to start the ETL process (from the Airflow UI, the CLI, or the REST API; see the sketch after this list).
  3. Monitor the Airflow logs for any issues or failures.
  4. Set up scheduling as per your requirements.
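
For step 2, a run can also be started programmatically through Airflow's stable REST API. The sketch below assumes a standard Apache Airflow 2 deployment with the basic-auth API backend enabled; the URL, credentials, and dag_id are placeholders, and MWAA uses its own token-based flow instead.

```python
# Trigger a DAG run via the Airflow 2 stable REST API (standard deployments, not MWAA's token flow).
import requests

AIRFLOW_URL = "http://localhost:8080"  # placeholder webserver URL
DAG_ID = "airflow_etl_pipeline"        # placeholder; use the dag_id defined in Pipeline.py

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "airflow"),       # placeholder basic-auth credentials
    json={"conf": {}},                 # optional run-level configuration
    timeout=30,
)
response.raise_for_status()
print("Started run:", response.json()["dag_run_id"])
```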

Logging and Monitoring

To monitor the pipeline's progress and diagnose issues, you can:

  • Monitor Airflow logs for task execution details.
  • Configure alerts and notifications for failures (see the callback sketch below).
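
A lightweight way to get failure alerts is a task-level on_failure_callback passed through the DAG's default_args. The sketch below only logs the failing task; notify_failure is a placeholder where an e-mail, Slack, or SNS notification would go.

```python
# Minimal failure-alert hook; wire it in via default_args on the DAG.
from datetime import timedelta


def notify_failure(context):
    """Airflow invokes this with the task context whenever a task fails."""
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed; logs: {ti.log_url}")
    # Replace the print with an e-mail, Slack, or SNS notification as needed.


default_args = {
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_failure,
    # "email_on_failure": True,  # alternative, if SMTP is configured for the environment
}

# Pass default_args=default_args to the DAG(...) constructor in Pipeline.py.
```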

Feel free to connect with me on LinkedIn
