feat(DBArchival): publish Database Archival project
Change-Id: Ie0239750e9d0ddbe565bd26e83a887a99de1e88c
GitOrigin-RevId: af83305185ef26ad167cba5fe3b819170cfb95f3
nitobuendia authored and Copybara committed Jan 13, 2025
1 parent fedb331 commit 9a392e4
Showing 77 changed files with 15,394 additions and 0 deletions.
23 changes: 23 additions & 0 deletions projects/database-archival/README.md
# Database Archival and Pruning

Database Archival and Pruning (or "Database Archival" for short) is a solution
that archives data from Google Cloud database products (such as Cloud SQL,
AlloyDB or Spanner) into BigQuery and prunes it from the source database. The
solution archives and prunes data from Historical Data tables.

Historical Data tables are usually large tables with a date or timestamp
column that store transactions, facts, logs or raw data used for long-term
storage and analytics needs. These tables are copied from the operational
database to BigQuery, after which the data can be removed from the
database.

The solution supports multiple tables - across multiple databases and
instances - with customizable configuration and retention periods for each
table.

This concept is similar to Bigtable's age-based
[Garbage Collection](https://cloud.google.com/bigtable/docs/garbage-collection).

Refer to the detailed documentation in [docs/index.md](/docs/index.md).

Disclaimer: This is not an official Google product.
16 changes: 16 additions & 0 deletions projects/database-archival/buildtest.dockerfile
FROM python:3.9

ARG PROJECT_SUBDIRECTORY
ENV PROJECT_SUBDIRECTORY=$PROJECT_SUBDIRECTORY
WORKDIR ${PROJECT_SUBDIRECTORY}
ENTRYPOINT [ "/bin/bash", "-e", "-x", "-c" ]
CMD [ " \
cd ${PROJECT_SUBDIRECTORY} && \
python3 -m venv .venv && \
. .venv/bin/activate && \
python3 -m pip install --upgrade pip && \
python3 -m pip install --no-deps --require-hashes -r requirements.txt && \
python3 -m pip install -e . && \
python3 -m unittest discover tools '*_test.py' && \
python3 -m airflow db init && \
python3 -m unittest discover src/database_archival '*_test.py'" ]
27 changes: 27 additions & 0 deletions projects/database-archival/demo/sample_config.json
[
{
"database_table_name": "Transaction",
"bigquery_location": "${REGION}",
"bigquery_table_name": "${PROJECT_ID}.${BIGQUERY_DATASET}.${CLOUD_SQL_DATABASE_NAME}_Transaction",
"bigquery_days_to_keep": 3650,
"database_prune_data": true,
"database_prune_batch_size": 1000,
"table_primary_key_columns": ["transaction_id"],
"table_date_column": "transaction_date",
"table_date_column_data_type": "DATE",
"database_days_to_keep": 730,
"database_type": "MYSQL",
"database_instance_name": "${CLOUD_SQL_INSTANCE_NAME}",
"database_name": "${CLOUD_SQL_DATABASE_NAME}",
"database_username": "${CLOUD_SQL_USER_NAME}",
"database_password_secret": "${CLOUD_SQL_PASSWORD_SECRET}"
},

{
"database_table_name": "User",
"bigquery_location": "${REGION}",
"bigquery_table_name": "${PROJECT_ID}.${BIGQUERY_DATASET}.${CLOUD_SQL_DATABASE_NAME}_User",
"bigquery_days_to_keep": 3650,
"database_prune_data": false
}
]
7 changes: 7 additions & 0 deletions projects/database-archival/docs/.pages
title: Database Archival and Pruning
nav:
- 'Overview': 'index.md'
- 'Architecture': 'architecture.md'
- 'Deployment and Configuration': 'deployment.md'
- 'Best Practices and Considerations': 'best_practices.md'
- 'Frequently Asked Questions (FAQs)': 'faqs.md'
210 changes: 210 additions & 0 deletions projects/database-archival/docs/architecture.md
# Architecture Design of Database Archival and Pruning

## Overview

![Database Archival and Pruning Cloud architecture](./images/architecture.svg)

* **Datastream** continuous replication creates a copy of the database in a
BigQuery dataset. This is a live replica which reflects the same data as the
database.

* **Composer** coordinates the process of creating table snapshots (archiving)
and deleting data from the database. Composer creates BigQuery jobs to copy
(archive) the data and to create a metadata table with the pruning status.
Composer calls a Cloud Run Function to perform the data deletion (pruning).

* **Cloud Run Function** is used as ephemeral compute to connect to the
database and delete the data in small batches.

* **BigQuery** is used as the target data destination for the archived data
and the metadata for these tables.

### Flow of data

![Cloud Composer Airflow DAG with the pipeline details](./images/dag.png)

* Datastream runs independently of this pipeline, continuously copying data to
BigQuery.

* Composer runs periodically (configurable), reads a configuration file stored
in Google Cloud Storage and sets up the pipeline.

* One workflow is created for each table configuration.

* For Historical Data:

* A BigQuery job is created to copy the data from the BigQuery table
created as a live replica by Datastream (e.g. `transactions`) into
a snapshot table (e.g. `transactions_snapshot`). Only the data that
is older than the specified retention period is copied. Each row is
marked with the execution date (`_snapshot_date`) and the run id
(`_snapshot_run_id`) of the Composer pipeline. The table is
partitioned by date (`_snapshot_date`) and a configurable partition
expiration time is set. A sketch of this step is shown after this
list.

* A BigQuery job is created to copy and batch (`_prune_batch_number`)
the primary keys from the snapshot table (e.g.
`transactions_snapshot`) into a metadata table (e.g.
`transactions_snapshot_prune_progress`), which tracks whether the
primary keys have been deleted from the database (`is_data_pruned`).
The batch size is configurable; keeping it between 100 and 10,000 is
recommended. The execution date (`_snapshot_date`) and the run id
(`_snapshot_run_id`) of the Composer pipeline are also stored. The
table is partitioned by date (`_snapshot_date`) and a configurable
partition expiration time is set.

* A Cloud Run Function is invoked to delete one batch of data from the
database. Each batch is deleted as a single transaction. Only one
batch per table is deleted at a time, and a configurable waiting
time is applied between batches.

* A BigQuery job is created to update the metadata table (e.g.
`transactions_snapshot_prune_progress`) and confirm that the given
data batch has been pruned successfully.

* For Main Data:

* A BigQuery job is created to copy the data from the BigQuery table
created as a live replica by Datastream (e.g. `users`) into a
snapshot table (e.g. `users_snapshot`). All the data in the
Main Data table (e.g. `users`) is copied regardless of date or
usage. Each row is marked with the execution date (`_snapshot_date`)
and the run id (`_snapshot_run_id`) of the Composer pipeline. The
table is partitioned by date (`_snapshot_date`) and a configurable
partition expiration time is set.

* BigQuery automatically prunes (removes) partitions that are older than their
expiration date, independently of this pipeline.
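
The sketch below illustrates how the first two Historical Data steps (the
snapshot copy and the primary-key batching) could be expressed with the
BigQuery Python client. It is an illustration only, not the project's
implementation: the project and dataset names (`my-project.my_dataset`), the
run date and run id, the 730-day retention and the 1,000-row batch size are
placeholders mirroring the examples in this document and the sample
configuration.

```python
"""Illustrative sketch of the Historical Data archive and batching steps."""

from google.cloud import bigquery

client = bigquery.Client()
RUN_DATE = "2025-01-13"           # Becomes _snapshot_date for this run.
RUN_ID = "scheduled__2025-01-13"  # Becomes _snapshot_run_id for this run.

# Step 1: copy rows older than the retention period from the Datastream
# replica into the date-partitioned snapshot table.
archive_job = client.query(
    f"""
    SELECT
        *,
        DATE('{RUN_DATE}') AS _snapshot_date,
        '{RUN_ID}' AS _snapshot_run_id
    FROM `my-project.my_dataset.transactions`
    WHERE transaction_date < DATE_SUB(CURRENT_DATE(), INTERVAL 730 DAY)
    """,
    job_config=bigquery.QueryJobConfig(
        destination="my-project.my_dataset.transactions_snapshot",
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        time_partitioning=bigquery.TimePartitioning(field="_snapshot_date"),
    ),
)
archive_job.result()  # Wait for the archival copy to complete.

# Step 2: copy the archived primary keys into the metadata table, assigning
# a batch number (1,000 keys per batch) and marking them as not yet pruned.
batching_job = client.query(
    f"""
    SELECT
        transaction_id,
        _snapshot_date,
        _snapshot_run_id,
        CAST(CEIL(ROW_NUMBER() OVER (ORDER BY transaction_id) / 1000) AS INT64)
            AS _prune_batch_number,
        FALSE AS is_data_pruned
    FROM `my-project.my_dataset.transactions_snapshot`
    WHERE _snapshot_run_id = '{RUN_ID}'
    """,
    job_config=bigquery.QueryJobConfig(
        destination="my-project.my_dataset.transactions_snapshot_prune_progress",
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        time_partitioning=bigquery.TimePartitioning(field="_snapshot_date"),
    ),
)
batching_job.result()
```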

### Sample tables

#### Historical Data: Flights

This table contains records of all the flights of the company. We may no longer
be interested in the flights that are older than 3 years in our operational
database.

##### Snapshot table for Historical Data

![View of a snapshot table for a Historical Data table](./images/historical_table_snapshot.png)

This table reflects the archived (copied) data. In this case, it will contain
a full copy of the data older than 3 years.

1. Contains metadata from the Composer run including date and run_id to be able
to identify the source of the data.

1. Contains the original data. In this case, only the data that is older than
the desired database expiration date will be copied.

1. Contains the metadata created by Datastream, which was used for the initial
data movement to BigQuery.

##### Metadata table for Historical Data

This table holds the primary keys of the archived (copied) data that need to be
deleted from the database. The pruning (deletion) status is tracked; a sketch
of the per-batch deletion follows the list below.

![View of a metadata table for a Historical Data table](./images/historical_table_snapshot_metadata.png)

1. Contains the primary keys of the source table.

1. Contains the metadata from Composer, which matches the snapshot table where
the archived data is stored.

1. Contains whether these primary keys have been deleted from the database, and
to which batch they belong.
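
As an illustration of the pruning steps (batch deletion and confirmation), the
sketch below shows how one batch could be deleted as a single transaction and
then confirmed in this metadata table. It is not the shipped Cloud Run
Function: the connection details, the MySQL access via `pymysql`, and the
table and column names (`Transaction`, `transaction_id`) are placeholders
based on this document and the sample configuration; the real solution reads
credentials from Secret Manager.

```python
"""Illustrative sketch of pruning one batch of archived rows."""

import pymysql
from google.cloud import bigquery

BATCH_KEYS_SQL = """
SELECT transaction_id
FROM `my-project.my_dataset.transactions_snapshot_prune_progress`
WHERE _snapshot_run_id = @run_id
  AND _prune_batch_number = @batch_number
  AND is_data_pruned = FALSE
"""

CONFIRM_SQL = """
UPDATE `my-project.my_dataset.transactions_snapshot_prune_progress`
SET is_data_pruned = TRUE
WHERE _snapshot_run_id = @run_id AND _prune_batch_number = @batch_number
"""


def prune_batch(run_id: str, batch_number: int) -> int:
    """Deletes one batch from the database and confirms it in the metadata."""
    bq_client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("run_id", "STRING", run_id),
            bigquery.ScalarQueryParameter("batch_number", "INT64", batch_number),
        ]
    )
    keys_job = bq_client.query(BATCH_KEYS_SQL, job_config=job_config)
    keys = [row.transaction_id for row in keys_job]
    if not keys:
        return 0

    connection = pymysql.connect(
        host="127.0.0.1",        # Placeholder, e.g. the Cloud SQL Auth Proxy.
        user="archival_user",    # Placeholder user.
        password="change-me",    # In practice, fetched from Secret Manager.
        database="my_database",
    )
    try:
        with connection.cursor() as cursor:
            placeholders = ", ".join(["%s"] * len(keys))
            # Delete the whole batch inside a single transaction.
            cursor.execute(
                f"DELETE FROM Transaction WHERE transaction_id IN ({placeholders})",
                keys,
            )
        connection.commit()
    finally:
        connection.close()

    # Confirm the batch as pruned in the metadata table.
    bq_client.query(CONFIRM_SQL, job_config=job_config).result()
    return len(keys)
```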

#### Main Data: Airlines

Flights contain foreign keys like `airline_id` or `airplane_id`. As a result,
we may want to archive the Airlines Main Data together with Flights. We do
not want to delete the data in Airlines.

##### Snapshot table for Main Data

![View of a snapshot table for a Main Data table](./images/main_table_snapshot.png)

This table reflects the archived (copied) data. In this case, it will contain
a full copy of the original data.

1. Contains metadata from the Composer run including date and run_id to be able
to identify the source of the data.

1. Contains the original data. In this case, the whole table is copied on every
run to allow point-in-time queries together with the Historical Data.

1. Contains the metadata created by Datastream, which was used for the initial
data movement to BigQuery.

##### Metadata table for Main Data

There is no metadata table for Airlines since no data will be pruned.

## Design Decisions

### Design Principles

**Principle: Keep design simple and robust.**

* Leverage the optimal number of components in the pipeline.
* Leverage existing features and functionality, where available.
* Log and validate actions taken by the system.

**Principle: Flexible and customizable at table level.**

* Allow customization and flexibility at table level.
* Support adding additional tables at later stages.

**Principle: Minimize the impact on the application performance.**

* Leverage database's (e.g. Cloud SQL) read replicas to run Datastream to
minimize pressure on the database when moving data to BigQuery.
* Leverage BigQuery data, migrated by Datastream, to create the data archive
to minimize queries to the database.
* Batch data for deletion into small batches to minimize pressure on the
database when deleting data.

**Principle: Take a conservative approach towards data pruning.**

* Sequence pruning to start if and only if the archiving step has succeeded.
* Prune data from the database when the data is confirmed to be archived.
* Batch deletion of data into small sets.

### Architecture design choices

**Datastream** is a ready-to-use solution to replicate data from databases
(e.g. Cloud SQL as a source) to BigQuery. Datastream supports all Cloud SQL
databases and AlloyDB. Datastream provides continuous replication while
minimizing one-off spikes in reads when performing data movement. The archival
is then performed on BigQuery, minimizing the cost and performance impact of
reads during the archival.

Either another database (e.g. a separate Cloud SQL instance) or BigQuery could
have been used to archive data away from the main instance while keeping it
queryable. However, since the goal is to reduce costs and there are limits on
the total size of data that can be stored in databases like Cloud SQL,
**BigQuery** has been chosen. BigQuery is optimized for analytical workloads,
which suits the common needs for this data (e.g. running AI models) beyond
simply storing it. BigQuery also has native mechanisms to prune data using
partitions and partition expiration dates. By leveraging partitions, we are
able to:

1. Query data more efficiently without having to go through all the
Historical Data.

1. Drop a full partition or overwrite it if the process fails during
archival.

1. Auto-delete the data after the retention period has passed using
expiration dates (see the sketch after this list).

1. Access and query the data efficiently without scanning data older
than necessary, leading to lower query costs.
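
For illustration, the snippet below shows how a partition expiration could be
set on a hypothetical snapshot table with the BigQuery Python client. The
table name is a placeholder and the 3650-day value mirrors
`bigquery_days_to_keep` in the sample configuration; the actual solution
configures this as part of its pipeline.

```python
"""Illustrative sketch of setting a partition expiration on a snapshot table."""

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.transactions_snapshot")

# The table is already partitioned by _snapshot_date; this only updates the
# expiration so partitions older than ~10 years are deleted automatically.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="_snapshot_date",
    expiration_ms=3650 * 24 * 60 * 60 * 1000,
)
client.update_table(table, ["time_partitioning"])
```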

**Cloud Composer** is recommended as these are multi-step pipelines with
dependencies. Composer is ideal to coordinate the steps and to ensure they are
completed sequentially or in parallel as needed. Cloud Composer provides
observability (e.g. alerting) and the ability to retry or resume failed tasks,
which is critical to minimize risks during archiving and pruning. Given the
critical nature of archiving data, these are important goals.
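
For illustration, the skeleton below shows how such a DAG could sequence one
archive task per configured table, followed by a prune task only when pruning
is enabled, with retries. It is a sketch, not the project's DAG: the task
bodies are placeholders, and the local `config.json` path stands in for the
configuration file that the real pipeline reads from Google Cloud Storage.

```python
"""Illustrative Airflow DAG skeleton for the archival and pruning pipeline."""

import datetime
import json

from airflow import DAG
from airflow.operators.python import PythonOperator


def archive_table(table_config: dict, **context) -> None:
    """Placeholder for the BigQuery snapshot-copy and batching steps."""
    print(f"Archiving {table_config['database_table_name']}")


def prune_table(table_config: dict, **context) -> None:
    """Placeholder for the batched deletion via the Cloud Run Function."""
    print(f"Pruning {table_config['database_table_name']}")


with open("config.json") as config_file:
    TABLE_CONFIGS = json.load(config_file)

with DAG(
    dag_id="database_archival",
    start_date=datetime.datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2},  # Failed tasks are retried before alerting.
) as dag:
    for table_config in TABLE_CONFIGS:
        archive = PythonOperator(
            task_id=f"archive_{table_config['database_table_name']}",
            python_callable=archive_table,
            op_kwargs={"table_config": table_config},
        )
        if table_config.get("database_prune_data"):
            prune = PythonOperator(
                task_id=f"prune_{table_config['database_table_name']}",
                python_callable=prune_table,
                op_kwargs={"table_config": table_config},
            )
            # Pruning starts only after archiving has succeeded.
            archive >> prune
```

Per-task retries and explicit dependencies are what make it safe to prune only
after the archive step has succeeded, in line with the conservative pruning
principle above.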