A reference implementation of the write-audit-publish pattern with Bauplan and Prefect 3.0
A common need on S3-backed analytics systems (e.g. a data lakehouse) is safely ingesting new data into tables available to downstream consumers.
Because lakehouses are distributed and the data to be bulk-inserted is often large, lakehouse ingestion is more delicate than the equivalent operation on a traditional database.
Data engineering best practices suggest the Write-Audit-Publish (WAP) pattern, which consists of three main logical steps:
- Write: ingest data into a "staging" / "temporary" section of the lakehouse - the data is not visible yet to downstream consumers;
- Audit: run quality checks on the data, to verify integrity and quality (avoid the "garbage in, garbage out" problem);
- Publish: if the quality checks succeed, publish the data to the main section of the lakehouse - the data is now visible to downstream consumers; otherwise, raise an error and clean up the staged data.
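The three steps above can be sketched in plain Python. This is a toy illustration only: `ToyLakehouse`, `AuditError`, and `write_audit_publish` are hypothetical names invented for this example, the "lakehouse" is an in-memory dictionary, and none of it reflects the Bauplan or Prefect APIs used in the repository.

```python
from dataclasses import dataclass, field


class AuditError(Exception):
    """Raised when staged data fails a quality check."""


@dataclass
class ToyLakehouse:
    """Hypothetical in-memory stand-in: each branch maps table names to rows."""

    branches: dict = field(default_factory=lambda: {"main": {}})

    def create_branch(self, name: str) -> None:
        # Copy main so that staged writes are isolated from consumers.
        main = self.branches["main"]
        self.branches[name] = {table: list(rows) for table, rows in main.items()}

    def merge_into_main(self, name: str) -> None:
        # Make the staged state visible to downstream consumers.
        staged = self.branches[name]
        self.branches["main"] = {table: list(rows) for table, rows in staged.items()}

    def delete_branch(self, name: str) -> None:
        self.branches.pop(name, None)


def write_audit_publish(lake: ToyLakehouse, table: str, rows: list, branch: str = "staging") -> None:
    lake.create_branch(branch)
    try:
        # Write: new rows land on the staging branch only.
        lake.branches[branch].setdefault(table, []).extend(rows)
        # Audit: an example check, here "no row may have a missing id".
        if any(row.get("id") is None for row in lake.branches[branch][table]):
            raise AuditError(f"null id found in {table!r} on branch {branch!r}")
        # Publish: reached only if the audit passed.
        lake.merge_into_main(branch)
    finally:
        # Clean up the staging branch whether or not the audit succeeded.
        lake.delete_branch(branch)
```

If the audit raises, `main` is never touched and the staging branch is still removed, which is the property the pattern is meant to guarantee.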
This repository showcases how Prefect and Bauplan can be used to implement WAP in ~150 lines of no-nonsense pure Python code: no knowledge of the JVM, SQL or Iceberg is required.
In particular, we will leverage Prefect transactions as the "outer layer" for safe handling of the relevant tasks, and Bauplan transactions (through branches) as the "inner layer" for safe handling of the relevant data assets.
- For a longer discussion on the context behind the project and the trade-offs involved, please refer to our blog post.
- To get a quick feeling on the developer experience, check out this demo video.
Bauplan is the programmable lakehouse: you can load, transform, and query data, all from your code (CLI or Python). You can learn more here, read the docs or explore its architecture and ergonomics.
To use Bauplan, you need an API key for our preview environment: you can request one here.
Note: the current SDK version is 0.0.3a243, but it is subject to change as the platform evolves - ping us if you need help with any of the APIs used in this project.
Install the required dependencies (Bauplan and Prefect) in a virtual environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Start a local Prefect server and take note of the URL:
prefect server start
Then, in a separate terminal, set up the connection and run the flow:
cd src
prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
python wap_flow.py --table_name <table_name> --branch_name <branch_name> --s3_path s3://a-public-bucket/your-data.csv
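For readers wondering how the three flags in the command above might be consumed, here is a minimal sketch using `argparse` from the standard library. Only the flag names come from the command; the `parse_args` helper, the help strings, and the `s3://` prefix check are assumptions made for illustration, not a description of what `wap_flow.py` actually does.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser for the flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Run a write-audit-publish flow.")
    parser.add_argument("--table_name", required=True, help="table to ingest into")
    parser.add_argument("--branch_name", required=True, help="staging branch to write to")
    parser.add_argument("--s3_path", required=True, help="S3 URI of the source file")
    args = parser.parse_args(argv)
    # Assumed sanity check: fail early on something that is not an S3 URI.
    if not args.s3_path.startswith("s3://"):
        parser.error("--s3_path must be an s3:// URI")
    return args


if __name__ == "__main__":
    args = parse_args()
    print(args.table_name, args.branch_name, args.s3_path)
```

Validating the path before the flow starts means a typo in the URI fails immediately, rather than after a branch has already been created.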
This is a video demonstration of the flow in action, both in case of successful audit and in case of failure.
Through the Prefect server, you can visualize the flow in the UI, e.g. you can check the latest run:
The code in the project is licensed under the MIT License (Prefect and Bauplan belong to their respective owners and have their own licenses).