EDU-1502: Adds bigQuery page #2432

Open: wants to merge 1 commit into main
1 change: 1 addition & 0 deletions content/integrations/index.textile
@@ -38,6 +38,7 @@ The following pre-built services can be configured:
* "AMQP":/docs/integrations/streaming/amqp
* "AWS SQS":/docs/integrations/streaming/sqs
* "Apache Pulsar":/docs/integrations/streaming/pulsar
* "Google BigQuery":/docs/integrations/streaming/bigquery

h2(#queues). Message queues

109 changes: 109 additions & 0 deletions content/integrations/streaming/bigquery.textile
@@ -0,0 +1,109 @@
---
title: Google BigQuery
meta_description: "Stream realtime event data from Ably into Google BigQuery using the Firehose BigQuery rule. Configure the integration and analyze your data efficiently."
---

Stream events published to Ably directly into a "table":https://cloud.google.com/bigquery/docs/tables in "BigQuery":https://cloud.google.com/bigquery for analytical or archival purposes. General use cases include:

* Realtime analytics on message data.
* Centralized storage for raw event data, enabling downstream processing.
* Historical auditing of messages.

To stream data from Ably into BigQuery, you need to create a BigQuery "rule":#rule.

<aside data-type='note'>
<p>Ably's BigQuery integration for "Firehose":/docs/integrations/streaming is in alpha status.</p>
</aside>

h2(#rule). Create a BigQuery rule

A rule defines what data gets sent, where it goes, and how it's authenticated. For example, you can improve query performance by configuring a rule to stream data from a specific channel and write it into a "partitioned":https://cloud.google.com/bigquery/docs/partitioned-tables table.

h3(#dashboard). Create a rule using the Ably dashboard

The following steps create a BigQuery rule using the Ably dashboard:

* Log in to the "Ably dashboard":https://ably.com/accounts/any and select the application you want to stream data from.
* Navigate to the *Integrations* tab.
* Click *New integration rule*.
* Select *Firehose*.
* Choose *BigQuery* from the list of available Firehose integrations.
* "Configure":#configure the rule settings. Then, click *Create*.

h3(#api-rule). Create a rule using the Ably Control API

The following steps create a BigQuery rule using the Control API:

* Use the required "rules":/docs/control-api#examples-rules endpoint to specify the following parameters:
** @ruleType@: Set this to @bigquery@ to define the rule as a BigQuery integration.
** @destinationTable@: Specify the BigQuery table where the data will be stored.
** @serviceAccountCredentials@: Provide the necessary GCP service account JSON key to authenticate and authorize data insertion.
** @channelFilter@ (optional): Use a regular expression to apply the rule to specific channels.
** @format@ (optional): Define the data format based on how you want messages to be structured.
* Make an HTTP request to the Control API to create the rule.
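
A reviewer has noted that a BigQuery rule type does not yet appear in the Control API specification, so confirm it is available before relying on it. With that caveat, the following is a hypothetical sketch of a rule request body: the endpoint, field names, and nesting are assumptions based on the parameters listed above and on the general shape of other Firehose rules, not a confirmed schema.

```[json]
{
  "ruleType": "bigquery",
  "source": {
    "channelFilter": "^sensor:.*",
    "type": "channel.message"
  },
  "target": {
    "destinationTable": "project_id.dataset_id.table_id",
    "serviceAccountCredentials": "<contents of the GCP service account JSON key>",
    "format": "json"
  }
}
```

The body would typically be sent as a @POST@ request to the rules endpoint for your app (for example @/v1/apps/{app_id}/rules@), authenticated with a Control API access token.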

h2(#configure). Configure BigQuery

Using the Google Cloud "Console":https://cloud.google.com/bigquery/docs/bigquery-web-ui, configure the required BigQuery resources, permissions, and authentication to allow Ably to write data securely to BigQuery.

The following steps configure BigQuery using the Google Cloud Console:

* Create or select a *BigQuery dataset* in the Google Cloud Console.
* Create a *BigQuery table* in that dataset.
** Use the "JSON schema":#schema.
** For large datasets, partition the table by ingestion time, with daily partitioning recommended for optimal performance.
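
For example, a daily ingestion-time partitioned table can be created with DDL along the following lines. This is an illustrative sketch only: the project, dataset, and table names are placeholders, and the column set (@id@, @channel@, @data@) is inferred from the schema excerpt and query examples later on this page rather than from a published Ably schema.

```[sql]
-- Illustrative sketch: a daily ingestion-time partitioned table.
-- Project, dataset, table, and column names are placeholders.
CREATE TABLE `project_id.dataset_id.table_id` (
  id STRING NOT NULL OPTIONS (description = 'Unique message ID assigned by Ably'),
  channel STRING,
  data BYTES
)
PARTITION BY _PARTITIONDATE;
```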

The following steps set up permissions and authentication using the Google Cloud Console:

* Create a Google Cloud Platform (GCP) "service account":https://cloud.google.com/iam/docs/service-accounts-create with the minimal required BigQuery permissions.
* Grant the service account table-level access to the specific table with the following permissions:
** @bigquery.tables.get@: to read table metadata.
** @bigquery.tables.updateData@: to insert records.
* Generate and securely store the *JSON key file* for the service account.
** Ably requires this key file to authenticate and write data to your table.
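
As an illustration, table-level access can be granted with BigQuery DCL as sketched below. Note that @roles/bigquery.dataEditor@ is broader than the two permissions listed above; a custom role restricted to @bigquery.tables.get@ and @bigquery.tables.updateData@ is the tighter option. The service account email and table name are placeholders.

```[sql]
-- Illustrative sketch: grant table-level access to the service account.
-- roles/bigquery.dataEditor is broader than the minimal permissions above;
-- a custom role limited to tables.get and tables.updateData is tighter.
GRANT `roles/bigquery.dataEditor`
ON TABLE `project_id.dataset_id.table_id`
TO "serviceAccount:ably-firehose@your-project.iam.gserviceaccount.com";
```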

h3(#settings). BigQuery configuration options

The following table explains the configuration options for a BigQuery rule:

|_. Section |_. Purpose |
| *Source* | Defines the type of event(s) for delivery. |
| *Channel filter* | A regular expression to filter which channels to capture. Only events on channels matching this regex are streamed into BigQuery. |
| *Table* | The full destination table path in BigQuery, typically in the format @project_id.dataset_id.table_id@. |
| *Service account key* | A JSON key file Ably uses to authenticate with Google Cloud. You must upload or provide the contents of this key file. |
| *Partitioning* | _(Optional)_ The table must be created with the desired partitioning settings in BigQuery before creating the rule in Ably. |
| *Advanced settings* | Any additional configuration or custom fields relevant to your BigQuery setup (for future enhancements). |

h2(#schema). JSON schema

To store and structure message data in BigQuery, you need a table schema that defines the expected fields and helps ensure consistency. The following excerpt from an example schema defines the @id@ field, which holds the unique message ID assigned by Ably (optionally supplied by the client):

```[json]
{
  "name": "id",
  "type": "STRING",
  "mode": "REQUIRED",
  "description": "Unique ID assigned by Ably to this message. Can optionally be assigned by the client."
}
```
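
A complete table schema is a JSON array of such field definitions. The sketch below extends the excerpt with @channel@ and @data@ fields, which the query examples in the next section assume exist; treat the exact field list and modes as assumptions rather than Ably's published schema.

```[json]
[
  {
    "name": "id",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Unique ID assigned by Ably to this message."
  },
  {
    "name": "channel",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Name of the channel the message was published on."
  },
  {
    "name": "data",
    "type": "BYTES",
    "mode": "NULLABLE",
    "description": "Raw message payload, stored as JSON encoded to bytes."
  }
]
```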

h2(#queries). Direct queries

In Ably-managed BigQuery tables, message payloads are stored in the @data@ column as raw JSON encoded as @BYTES@, and you can extract fields at query time. The following example query converts the @data@ column from @BYTES@ to @STRING@, parses it into a JSON object, and filters results by channel name:

```[sql]
SELECT
  PARSE_JSON(CAST(data AS STRING)) AS parsed_payload
FROM `project_id.dataset_id.table_id`
WHERE channel = 'my-channel'
```
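
To pull individual fields out of the payload, @JSON_VALUE@ can be applied with a JSONPath. The example below is a sketch that assumes the payload contains a @temperature@ field; adjust the path to match your own message structure.

```[sql]
SELECT
  id,
  -- 'temperature' is a hypothetical payload field; replace with your own.
  JSON_VALUE(CAST(data AS STRING), '$.temperature') AS temperature
FROM `project_id.dataset_id.table_id`
WHERE channel = 'my-channel'
```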

h2(#etl). Extract, Transform, Load (ETL)

ETL is recommended for large-scale analytics to structure, deduplicate, and optimize data for querying. Since parsing JSON at query time can be costly for large datasets, pre-process and store structured fields in a secondary table instead. Convert raw data (JSON or BYTES), remove duplicates, and write it into an optimized table for better performance:

* Convert data from raw @BYTES@ or JSON into structured columns, for example geospatial fields or numeric types, for detailed analysis.
* Write transformed records to a new optimized table tailored for query performance.
* Deduplicate records using the unique ID field to ensure data integrity.
* Automate the process using BigQuery scheduled queries or an external workflow to run transformations at regular intervals.
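
A minimal sketch of such a transformation is shown below, assuming a pre-created optimized table with @id@, @channel@, and @payload@ (JSON) columns and an ingestion-time partitioned source table; the table names, column set, and dedupe ordering are illustrative rather than a prescribed pipeline.

```[sql]
-- Illustrative ETL sketch: dedupe by message id and write parsed, structured
-- rows into a pre-created optimized table. Names and columns are placeholders;
-- the source table is assumed to be ingestion-time partitioned (_PARTITIONTIME).
MERGE `project_id.dataset_id.table_optimized` AS target
USING (
  SELECT id, channel, payload
  FROM (
    SELECT
      id,
      channel,
      PARSE_JSON(CAST(data AS STRING)) AS payload,
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY _PARTITIONTIME DESC) AS rn
    FROM `project_id.dataset_id.table_id`
  )
  WHERE rn = 1
) AS source
ON target.id = source.id
WHEN NOT MATCHED THEN
  INSERT (id, channel, payload)
  VALUES (source.id, source.channel, source.payload)
```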

4 changes: 4 additions & 0 deletions src/data/nav/platform.ts
@@ -139,6 +139,10 @@ export default {
name: 'Pulsar',
link: '/docs/integrations/streaming/pulsar',
},
{
name: 'Google BigQuery',
link: '/docs/integrations/streaming/bigquery',
},
],
},
{