feat(DBArchival): publish Database Archival project
Change-Id: Ie0239750e9d0ddbe565bd26e83a887a99de1e88c GitOrigin-RevId: af83305185ef26ad167cba5fe3b819170cfb95f3
1 parent fedb331, commit 9a392e4. Showing 77 changed files with 15,394 additions and 0 deletions.
@@ -0,0 +1,23 @@
# Database Archival and Pruning

Database Archival and Pruning (or "Database Archival" for short) is a solution
that allows you to archive and prune data from Google Cloud database products
(such as Cloud SQL, AlloyDB and Spanner) into BigQuery. This solution archives
and prunes data from Historical Data tables.

Historical Data tables are usually large tables, with a date or timestamp
column, which store transactions, facts, logs or raw data that are useful for
various long-term storage and analytics needs. These tables are copied from the
operational database to BigQuery, after which the data can be removed from the
database.

The solution supports multiple tables, across multiple databases and instances,
with customizable configuration and retention periods for each table.

This concept is similar to Bigtable's age-based
[Garbage Collection](https://cloud.google.com/bigtable/docs/garbage-collection).

Refer to the detailed documentation in [docs/index.md](/docs/index.md).

Disclaimer: This is not an official Google product.
@@ -0,0 +1,16 @@
FROM python:3.9

ARG PROJECT_SUBDIRECTORY
ENV PROJECT_SUBDIRECTORY=$PROJECT_SUBDIRECTORY
WORKDIR ${PROJECT_SUBDIRECTORY}
ENTRYPOINT [ "/bin/bash", "-e", "-x", "-c" ]
CMD [ " \
    cd ${PROJECT_SUBDIRECTORY} && \
    python3 -m venv .venv && \
    . .venv/bin/activate && \
    python3 -m pip install --upgrade pip && \
    python3 -m pip install --no-deps --require-hashes -r requirements.txt && \
    python3 -m pip install -e . && \
    python3 -m unittest discover tools '*_test.py' && \
    python3 -m airflow db init && \
    python3 -m unittest discover src/database_archival '*_test.py'" ]
@@ -0,0 +1,27 @@
[
    {
        "database_table_name": "Transaction",
        "bigquery_location": "${REGION}",
        "bigquery_table_name": "${PROJECT_ID}.${BIGQUERY_DATASET}.${CLOUD_SQL_DATABASE_NAME}_Transaction",
        "bigquery_days_to_keep": 3650,
        "database_prune_data": true,
        "database_prune_batch_size": 1000,
        "table_primary_key_columns": ["transaction_id"],
        "table_date_column": "transaction_date",
        "table_date_column_data_type": "DATE",
        "database_days_to_keep": 730,
        "database_type": "MYSQL",
        "database_instance_name": "${CLOUD_SQL_INSTANCE_NAME}",
        "database_name": "${CLOUD_SQL_DATABASE_NAME}",
        "database_username": "${CLOUD_SQL_USER_NAME}",
        "database_password_secret": "${CLOUD_SQL_PASSWORD_SECRET}"
    },

    {
        "database_table_name": "User",
        "bigquery_location": "${REGION}",
        "bigquery_table_name": "${PROJECT_ID}.${BIGQUERY_DATASET}.${CLOUD_SQL_DATABASE_NAME}_User",
        "bigquery_days_to_keep": 3650,
        "database_prune_data": false
    }
]
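
As a rough illustration of how a pipeline could consume a file like this, the hedged Python sketch below loads the JSON from Cloud Storage and iterates over each table entry. The helper name, bucket and object path are hypothetical and not taken from this repository.

```python
"""Hypothetical sketch of loading the per-table configuration from Cloud Storage."""

import json

from google.cloud import storage  # assumes google-cloud-storage is installed


def load_table_configs(bucket_name: str, blob_path: str) -> list:
    """Downloads and parses the JSON list of per-table configurations."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_path)
    return json.loads(blob.download_as_text())


if __name__ == "__main__":
    # Hypothetical bucket and object path; the real deployment defines its own location.
    configs = load_table_configs("my-config-bucket", "database_archival/config.json")
    for table_config in configs:
        name = table_config["database_table_name"]
        prune = table_config.get("database_prune_data", False)
        print(f"Table {name}: prune enabled = {prune}")
```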
@@ -0,0 +1,7 @@
title: Database Archival and Pruning
nav:
    - 'Overview': 'index.md'
    - 'Architecture': 'architecture.md'
    - 'Deployment and Configuration': 'deployment.md'
    - 'Best Practices and Considerations': 'best_practices.md'
    - 'Frequently Asked Questions (FAQs)': 'faqs.md'
@@ -0,0 +1,210 @@
# Architecture Design of Database Archival and Pruning

## Overview

![Database Archival and Pruning Cloud architecture](./images/architecture.svg)

* **Datastream** continuous replication creates a copy of the database in a
  BigQuery dataset. This is a live replica which reflects the same data as the
  database.

* **Composer** coordinates the process of creating table snapshots (archiving)
  and deleting data from the database. Composer creates BigQuery jobs to copy
  (archive) the data and to create a metadata table with the pruning status.
  Composer calls a Cloud Run Function to perform the data deletion (pruning).

* **Cloud Run Function** is used as ephemeral compute to connect to the
  database and delete the data in small batches.

* **BigQuery** is used as the target destination for the archived data and the
  metadata for these tables.
### Flow of data

![Cloud Composer Airflow DAG with the pipeline details](./images/dag.png)

* Datastream runs independently of this pipeline, continuously copying data to
  BigQuery.

* Composer runs periodically (configurable), reads a configuration file stored
  in Google Cloud Storage and sets up the pipeline.

* One workflow is created for each table configuration.

* For Historical Data:

    * A BigQuery job is created to copy the data from the BigQuery table
      created as a live replica by Datastream (e.g. `transactions`) into a
      snapshot table (e.g. `transactions_snapshot`). Only the data that is
      older than the specified retention period is copied. Each row is marked
      with the execution date (`_snapshot_date`) and the run id
      (`_snapshot_run_id`) of the Composer pipeline. The table is partitioned
      by date (`_snapshot_date`), and a configurable partition expiration time
      is set. A hedged SQL sketch of this step is shown after this list.

    * A BigQuery job is created to copy and batch (`_prune_batch_number`) the
      primary keys from the snapshot table (e.g. `transactions_snapshot`) into
      a metadata table (e.g. `transactions_snapshot_prune_progress`), which
      records whether the primary keys have been deleted from the database
      (`is_data_pruned`). The batch size is configurable; keeping it between
      100 and 10,000 is recommended. The execution date (`_snapshot_date`) and
      the run id (`_snapshot_run_id`) of the Composer pipeline are also stored.
      The table is partitioned by date (`_snapshot_date`), and a configurable
      partition expiration time is set.

    * A Cloud Run Function is run to delete one batch of data from the
      database. Each batch is deleted as a transaction, and only one batch per
      table is deleted at a time. A configurable waiting time is applied
      between batches. A hedged pruning sketch is shown after this list.

    * A BigQuery job is created to update the metadata table (e.g.
      `transactions_snapshot_prune_progress`) and confirm that the given data
      batch has been pruned successfully.

* For Main Data:

    * A BigQuery job is created to copy the data from the BigQuery table
      created as a live replica by Datastream (e.g. `users`) into a snapshot
      table (e.g. `users_snapshot`). All the data in the Main Data table (e.g.
      `users`) is copied regardless of date or usage. Each row is marked with
      the execution date (`_snapshot_date`) and the run id (`_snapshot_run_id`)
      of the Composer pipeline. The table is partitioned by date
      (`_snapshot_date`), and a configurable partition expiration time is set.

* BigQuery automatically removes (prunes) partitions that are older than their
  expiration date, independently of this pipeline.
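
As an illustration of the Historical Data archival step above, here is a minimal sketch using the BigQuery Python client. It assumes the snapshot table already exists, partitioned by `_snapshot_date` with a partition expiration set, and that the date column is of type `DATE`; the function name, table names and column names other than `_snapshot_date` / `_snapshot_run_id` are illustrative placeholders rather than the project's actual code.

```python
"""Hedged sketch of the 'copy to snapshot' BigQuery job (not the project's actual code)."""

import datetime

from google.cloud import bigquery


def archive_old_rows(
    client: bigquery.Client,
    source_table: str,     # e.g. "project.dataset.transactions" (Datastream replica)
    snapshot_table: str,   # e.g. "project.dataset.transactions_snapshot"
    date_column: str,      # e.g. "transaction_date", assumed to be a DATE column
    days_to_keep: int,     # e.g. 730, the database retention period
    snapshot_date: datetime.date,
    snapshot_run_id: str,
) -> None:
    """Copies rows older than the retention period into the partitioned snapshot table."""
    # Assumes the snapshot table has the source schema plus the two `_snapshot_*`
    # columns, is partitioned by `_snapshot_date`, and has a partition expiration set.
    sql = f"""
        INSERT INTO `{snapshot_table}`
        SELECT
            source.*,
            @snapshot_date AS _snapshot_date,       -- Composer execution date
            @snapshot_run_id AS _snapshot_run_id    -- Composer run id
        FROM `{source_table}` AS source
        WHERE source.{date_column} < DATE_SUB(@snapshot_date, INTERVAL {days_to_keep} DAY)
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("snapshot_date", "DATE", snapshot_date),
            bigquery.ScalarQueryParameter("snapshot_run_id", "STRING", snapshot_run_id),
        ]
    )
    client.query(sql, job_config=job_config).result()  # blocks until the job completes
```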
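
Similarly, here is a minimal sketch of the pruning step performed by the Cloud Run Function: it reads the primary keys of one unpruned batch from the metadata table and deletes those rows from the operational database in a single transaction. The helper names, the `transaction_id` key column and the pymysql-style DB-API connection are assumptions, not the repository's implementation.

```python
"""Hedged sketch of pruning one batch, in the spirit of the Cloud Run Function step."""

from typing import Sequence

from google.cloud import bigquery


def fetch_batch_primary_keys(
    client: bigquery.Client,
    metadata_table: str,   # e.g. "project.dataset.transactions_snapshot_prune_progress"
    snapshot_run_id: str,
    batch_number: int,
) -> list:
    """Reads the primary keys of one unpruned batch from the metadata table."""
    sql = f"""
        SELECT transaction_id
        FROM `{metadata_table}`
        WHERE _snapshot_run_id = @run_id
          AND _prune_batch_number = @batch
          AND is_data_pruned = FALSE
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("run_id", "STRING", snapshot_run_id),
            bigquery.ScalarQueryParameter("batch", "INT64", batch_number),
        ]
    )
    rows = client.query(sql, job_config=job_config).result()
    return [row["transaction_id"] for row in rows]


def prune_batch(connection, table_name: str, key_column: str, keys: Sequence) -> None:
    """Deletes one batch of rows from the operational database as a single transaction."""
    if not keys:
        return
    placeholders = ", ".join(["%s"] * len(keys))  # pymysql-style parameter placeholders
    with connection.cursor() as cursor:
        cursor.execute(
            f"DELETE FROM {table_name} WHERE {key_column} IN ({placeholders})",
            tuple(keys),
        )
    connection.commit()  # the whole batch is committed (or rolled back) together
```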
### Sample tables

#### Historical Data: Flights

This table contains records of all the company's flights. We may no longer be
interested in keeping flights older than 3 years in our operational database.

##### Snapshot table for Historical Data

![View of a snapshot table for a Historical Data table](./images/historical_table_snapshot.png)

This table reflects the archived (copied) data. In this case, it will contain a
full copy of the data older than 3 years.

1. Contains metadata from the Composer run, including date and run_id, to
   identify the source of the data.

1. Contains the original data. In this case, only the data that is older than
   the desired database expiration date is copied.

1. Contains the metadata created by Datastream, which was used to perform the
   initial data movement to BigQuery.

##### Metadata table for Historical Data

This table reflects the primary keys of the archived (copied) data that need to
be deleted from the database. The pruning (deletion) status is tracked.

![View of a metadata table for a Historical Data table](./images/historical_table_snapshot_metadata.png)

1. Contains the primary keys of the source table.

1. Contains the metadata from Composer, which is the same as in the snapshot
   table where the archived data is stored.

1. Contains whether these primary keys have been deleted from the database, and
   to which batch they belong.
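
A hedged sketch of how such a metadata table could be populated, including one way `_prune_batch_number` could be assigned, is shown below; the exact SQL used by the project is not shown here, and the table and key column names are illustrative.

```python
"""Hedged sketch of populating the prune-progress metadata table (illustrative SQL)."""

PRUNE_BATCH_SIZE = 1000  # matches "database_prune_batch_size" in the sample configuration

METADATA_SQL = f"""
    INSERT INTO `project.dataset.transactions_snapshot_prune_progress`
    SELECT
        transaction_id,                      -- primary key column(s) of the source table
        _snapshot_date,
        _snapshot_run_id,
        CAST(CEIL(ROW_NUMBER() OVER (ORDER BY transaction_id) / {PRUNE_BATCH_SIZE}) AS INT64)
            AS _prune_batch_number,          -- groups the keys into batches of {PRUNE_BATCH_SIZE}
        FALSE AS is_data_pruned              -- flipped to TRUE once the batch is deleted
    FROM `project.dataset.transactions_snapshot`
    WHERE _snapshot_run_id = @run_id         -- bound as a query parameter when the job runs
"""
```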
#### Main Data: Airlines

Flights contains foreign keys such as `airline_id` or `airplane_id`. As a
result, we may want to archive the airlines Main Data together with Flights. We
do not want to delete the data in airlines.

##### Snapshot table for Main Data

![View of a snapshot table for a Main Data table](./images/main_table_snapshot.png)

This table reflects the archived (copied) data. In this case, it will contain a
full copy of the original data.

1. Contains metadata from the Composer run, including date and run_id, to
   identify the source of the data.

1. Contains the original data. In this case, the whole table is copied on every
   run to allow point-in-time queries together with the Historical Data.

1. Contains the metadata created by Datastream, which was used to perform the
   initial data movement to BigQuery.

##### Metadata table for Main Data

There is no metadata table for airlines since no data will be pruned.
## Design Decisions

### Design Principles

**Principle: Keep the design simple and robust.**

* Use the optimal (minimal) number of components in the pipeline.
* Leverage existing features and functionality, where available.
* Log and validate actions taken by the system.

**Principle: Flexible and customizable at table level.**

* Allow customization and flexibility at table level.
* Support adding additional tables at later stages.

**Principle: Minimize the impact on application performance.**

* Leverage the database's (e.g. Cloud SQL) read replicas as the Datastream
  source to minimize pressure on the database when moving data to BigQuery.
* Leverage the BigQuery data, migrated by Datastream, to create the data
  archive, minimizing queries to the database.
* Batch data for deletion into small batches to minimize pressure on the
  database when deleting data.

**Principle: Take a conservative approach towards data pruning.**

* Sequence pruning to start if and only if the archiving step has succeeded.
* Prune data from the database only when the data is confirmed to be archived.
* Batch deletion of data into small sets.
### Architecture design choices

**Datastream** is a ready-to-use solution to replicate data from databases
(e.g. Cloud SQL as a source) to BigQuery. Datastream supports all Cloud SQL
databases, as well as AlloyDB. Datastream provides continuous replication while
minimizing one-off spikes in reads when performing data movement. The archival
is then performed over BigQuery, minimizing the cost and performance hit of
reads during the archival.

Either another database (e.g. Cloud SQL) or BigQuery could have been used to
achieve the goals of archiving data away from the main instance and providing
queryable access. However, since the goal is to reduce costs and there are
limits on the total size of data that can be stored in databases like Cloud
SQL, **BigQuery** was chosen. BigQuery is optimized for analytical workloads,
which is ideal for the common needs for this data (e.g. running AI models), in
addition to just storing it. BigQuery also has native mechanisms to prune data
using partitions and expiration dates for those partitions. By leveraging
partitions, we gain a few capabilities. We are able to:

1. Query data more efficiently without having to go through all the
   Historical Data.

1. Drop a full partition or overwrite it if the process fails during
   archival.

1. Auto-delete the data after the retention period has passed using
   expiration dates.

1. Access and query the data without touching data older than necessary,
   leading to lower query costs.
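
For illustration, the hedged snippets below (BigQuery SQL held as Python string constants) show what these partition operations can look like; the table name and dates are placeholders.

```python
"""Hedged examples of the partition operations listed above; names and dates are placeholders."""

# 1. Query only the partition(s) you need instead of scanning all Historical Data.
QUERY_ONE_SNAPSHOT = """
    SELECT *
    FROM `project.dataset.transactions_snapshot`
    WHERE _snapshot_date = '2024-06-01'
"""

# 2. Drop a single partition (e.g. to redo a failed archival run) without touching other runs.
DROP_FAILED_PARTITION = """
    DELETE FROM `project.dataset.transactions_snapshot`
    WHERE _snapshot_date = '2024-06-01'
"""

# 3. Let BigQuery auto-delete old partitions by setting a partition expiration on the table.
SET_PARTITION_EXPIRATION = """
    ALTER TABLE `project.dataset.transactions_snapshot`
    SET OPTIONS (partition_expiration_days = 3650)
"""
```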
**Cloud Composer** is recommended because these are multi-step pipelines with
dependencies. Composer is ideal for coordinating the work and ensuring steps
are completed sequentially or in parallel as needed. Cloud Composer provides
observability (e.g. alerting) and the ability to resume failed tasks (i.e.
retryability), which is critical to minimize risks during archiving and
pruning. Given the critical nature of archiving data, these are important
goals.
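
To make the orchestration concrete, below is a minimal, hypothetical Airflow DAG that mirrors the per-table workflow described earlier; the DAG id, schedule, task ids and placeholder callables are assumptions, not the project's actual DAG.

```python
"""Hypothetical, minimal Airflow DAG mirroring the per-table workflow (not the project's DAG)."""

import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _archive_to_snapshot() -> None:
    """Placeholder: copy rows older than the retention period into the snapshot table."""


def _build_prune_metadata() -> None:
    """Placeholder: copy and batch the primary keys into the prune-progress metadata table."""


def _prune_one_batch() -> None:
    """Placeholder: call the Cloud Run Function to delete one batch from the database."""


def _confirm_batch_pruned() -> None:
    """Placeholder: mark the batch as pruned in the metadata table."""


with DAG(
    dag_id="database_archival_transactions",  # hypothetical: one workflow per table config
    start_date=datetime.datetime(2024, 1, 1),
    schedule_interval="@weekly",              # the run frequency is configurable
    catchup=False,
) as dag:
    archive = PythonOperator(task_id="archive_to_snapshot", python_callable=_archive_to_snapshot)
    metadata = PythonOperator(task_id="build_prune_metadata", python_callable=_build_prune_metadata)
    prune = PythonOperator(task_id="prune_one_batch", python_callable=_prune_one_batch)
    confirm = PythonOperator(task_id="confirm_batch_pruned", python_callable=_confirm_batch_pruned)

    # Pruning starts only after archiving succeeds, matching the conservative-pruning principle.
    archive >> metadata >> prune >> confirm
```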