diff --git a/_posts/2023-07-23-snowflake.md b/_posts/2023-07-23-snowflake.md
index 3c513d1..eb93252 100644
--- a/_posts/2023-07-23-snowflake.md
+++ b/_posts/2023-07-23-snowflake.md
@@ -838,7 +838,7 @@ title: Snowflake
- `avg_running` - average number of queries executed
- recall that a single virtual warehouse is a cluster of compute nodes
- - consolidation - if we have two bi applications that rarely use a warehouse concurrently, we can use the same warehouse for both of them
+ - consolidation - if we have two bi applications that rarely use a warehouse concurrently, we can use the same warehouse for both of the applications to reduce costs
```sql
select
@@ -920,16 +920,17 @@ title: Snowflake
## Resource Monitors
- **resource monitors** - warehouses can be assigned resource monitors
-- we can define a limit on the number of credits consumed
-- when this limit is reached, we can be notified / warehouses can be suspended
-- my understanding - notification works on both user managed warehouses and warehouses managed by cloud services. however, we can only suspend user managed warehouses
+- we can define a limit on the number of credits consumed - when this limit is reached, we can be notified
+- notification works on both user managed warehouses and warehouses managed by cloud services
![](/assets/img/warehouse-and-snowflake/resource-monitor-visiblity.png)
-- in the above resource monitor, we set -
- - credit quota to 1
- - select the warehouse to monitor
- - start monitoring immediately, never stop monitoring and reset the monitor daily
+- in the above resource monitor, we -
+ - set credit quota to 1
+ - monitor type - we can monitor the entire account or only a specific warehouse
+ - select the warehouse to monitor - a warehouse can have only one resource monitor, but the same resource monitor can be used for multiple warehouses
+ - start monitoring immediately, and never stop monitoring
+ - reset the monitor daily - it starts tracking from 0 again
- notify when 75% and 100% of the quota is consumed
- i ran this statement to modify the users to be notified -
```sql
@@ -938,3 +939,23 @@ title: Snowflake
- finally, on reaching the consumption, i get an email as follows -
![](/assets/img/warehouse-and-snowflake/resource-monitor-notification.png)
+
+- note - i also had to enable receiving notifications in my profile -
+
+![](/assets/img/warehouse-and-snowflake/enable-resource-monitor-notifications.png)
+
+- till now, we saw the **visibility** aspect of resource monitors. now, we look at its **control** aspect
+- we can only suspend user managed warehouses, not ones managed by cloud services
+- we can choose to suspend either after running queries complete, or after cancelling all running queries
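+- as a minimal sketch (the warehouse / monitor names are made up), a resource monitor covering both aspects could be created with ddl like the following - the trigger actions capture the **control** aspect -
+  ```sql
+  create resource monitor my_monitor
+    with
+      credit_quota = 1
+      frequency = daily
+      start_timestamp = immediately
+    triggers
+      on 75 percent do notify                -- visibility - only send a notification
+      on 100 percent do suspend              -- control - suspend after running queries complete
+      on 110 percent do suspend_immediate;   -- control - suspend after cancelling running queries
+
+  -- attach the monitor to a user managed warehouse
+  alter warehouse my_warehouse set resource_monitor = my_monitor;
+  ```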
+
+## Control Strategies
+
+- **virtual warehouses** can be **single cluster** or **multi cluster**
+- changing warehouse size helps with vertical scaling (scaling up) i.e. processing more complex queries
+- using multi cluster warehouse helps with horizontal scaling (scaling out) i.e. processing more concurrent queries
+- some features of warehouses that help control spend include -
+  - **auto suspend** - automatically suspend warehouses if there is no activity
+  - **auto resume** - automatically restart the warehouse when there are new queries
+  - **auto scaling** - for multi cluster warehouses, start additional clusters dynamically if queries are queued, and bring them down when the queue becomes empty. there are two scaling policies - **standard** and **economy**. the first prefers starting additional clusters, while the second prefers conserving credits
+- `statement_queued_timeout_in_seconds` - statements are dropped if queued for longer than this. default is 0 i.e. never dropped
+- `statement_timeout_in_seconds` - statements are cancelled if they run for longer than this. default is 48hrs
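+- as a rough sketch (the warehouse name is made up, and multi cluster warehouses need the enterprise edition), these controls map to warehouse parameters as follows -
+  ```sql
+  create warehouse if not exists my_warehouse
+    warehouse_size = 'xsmall'    -- vertical scaling - resize to handle more complex queries
+    min_cluster_count = 1
+    max_cluster_count = 3        -- horizontal scaling - add clusters for more concurrent queries
+    scaling_policy = 'standard'  -- or 'economy'
+    auto_suspend = 60            -- seconds of inactivity before suspending
+    auto_resume = true;
+
+  alter warehouse my_warehouse set
+    statement_queued_timeout_in_seconds = 300   -- drop statements queued for over 5 minutes
+    statement_timeout_in_seconds = 3600;        -- cancel statements running for over an hour
+  ```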
diff --git a/_posts/2024-08-31-dbt.md b/_posts/2024-08-31-dbt.md
index dcc5648..5ab49ef 100644
--- a/_posts/2024-08-31-dbt.md
+++ b/_posts/2024-08-31-dbt.md
@@ -10,30 +10,24 @@ title: DBT
- **data engineers** - build and maintain infrastructure, overall pipeline orchestration, integrations for ingesting the data, etc
- **analytics engineers** - generate cleaned, transformed data for analysis
- **data analysts** - work with business to understand the requirements. use dashboards, sql, etc to query the transformed data
-- dbt sits on top of the cloud warehouses. snowflake / redshift / big query / databricks. it is the t of elt, i.e. it is meant to be used by the analytics engineers
+- dbt sits on top of the cloud warehouses like snowflake / redshift / big query / databricks. it is used for the t of elt by the analytics engineers
- we can **manage**, **test** and **document** transformations from one place
-- can deploy the dbt project on a **schedule** using **environments**
-- dbt builds a **dag** (**directed acyclic graph**) to represent the flow of data between different tables
-- **sources** - where our data actually comes from. we create sources using yaml
-- **models** - the intermediate layer(s) we create
-- we use **macros** e.g. **source** to reference **sources**, **ref** to reference **models**, etc. we create models using sql
-- based on our sql, the dag is created for us / the order of creation of models are determined for us
- **jinja** is used, which is a pythonic language. everything inside double curly braces is jinja, and the rest of it is regular sql. note - replaced curly with round braces, because it was not showing up otherwise
```sql
- ((% for i in range(5) %))
+ (% for i in range(5) %)
select (( i )) as number (% if not loop.last %) union all (% endif %)
- ((% endfor %))
+ (% endfor %)
```
-- this results in the following table containing one column number, with values 0-4
-- there is also a **compiled code** tab, where we can see the actual code that our dbt code compiles down to
-- we can also look at the **lineage** tab, which shows us the **dag** (directed acyclic graph). it tells us the order the dbt models should be built in
- ![](/assets/img/dbt/lineage.png)
-- the graph is interactive - double click on the node to open the corresponding model file automatically
-- `dbt run` - find and create the models in the warehouse for us
-- `dbt test` - test the models for us
-- `dbt build` - a combination of run and test
-- `dbt docs generate` - generate the documentation for us
-- ide - edit models and make changes to the dbt project
+- this results in a table containing one column number, with values 0-4
+- we use the if condition so that union all is not added after the last select. this is what the compiled tab shows -
+ ```sql
+ select 0 as number union all
+ select 1 as number union all
+ select 2 as number union all
+ select 3 as number union all
+ select 4 as number
+ ```
+- **normalized modelling** - earlier, we would model our data as star schema. it was relevant when storage / compute was expensive. now, we tend to also use **denormalized modelling**, where our models are created on an ad hoc / agile basis
## Getting Started
@@ -50,7 +44,8 @@ title: DBT
![](/assets/img/dbt/development-credentials.png)
- in the development credentials, we specify a schema. each dbt developer should use their own specific **target schema** to be able to work simultaneously
- we then hit **test connection** for dbt to test the connection, and hit next
-- can leverage git to version control the code. for this, we can either use **managed repositories** or one of the supported git providers like github, gitlab, etc
+- can leverage git to version control the code
+- for this, we can either use **managed repositories**, or one of the supported git providers like github, gitlab, etc
## Models
@@ -88,74 +83,94 @@ title: DBT
- we can see the corresponding ddl for a particular model in the logs
![](/assets/img/dbt/model-logs.png)
- use the **preview** tab in the ide to see the actual data in table format
-- we see that a view has been created for inside at the schema we had specified in the **development credentials**
-- we can configure the way our model is **materialized** in one of the following ways -
- - configure dbt_project.yml present at the root. everything in `jaffle_shop` will be materialized as a table, but everything inside example as a view
- ```yml
- models:
- jaffle_shop:
- +materialized: table
- example:
- +materialized: view
- ```
- - to make it specific for a model, we can specify the following snippet at the top of the sql file for the model -
- ```sql
- {{
- config(
- materialized='view'
- )
- }}
- ```
+- we see that a view has been created inside the schema we had specified in the **development credentials**
+- we can configure the way our model is **materialized** using the following snippet at the top of the sql file for the model -
+ ```sql
+ ((
+ config(
+ materialized='view'
+ )
+ ))
+ ```
- when we delete models, it does not delete them from snowflake. so, we might have to delete them manually
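+- e.g. if a deleted model had been materialized as a view, the manual cleanup is just a drop - a sketch (the model names here are illustrative) -
+  ```sql
+  drop view if exists dbt_analytics.dbt_sagarwal.stg_customers;
+  -- or, if the model was materialized as a table
+  drop table if exists dbt_analytics.dbt_sagarwal.some_old_model;
+  ```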
### Modularity
-- breaking things down into separate models. it allows us to for e.g. reuse the smaller models in multiple combined models. these smaller models are called **staging models** (like the staging area in warehouses?)
- - stg_customers.sql
- ```sql
- select
- *
- from
- dbt_raw.jaffle_shop.customers
- ```
- - stg_customer_orders.sql
- ```sql
- select
- user_id as customer_id,
- min(order_date) as first_order_date,
- max(order_date) as last_order_date,
- count(*) as number_of_orders,
- from
- dbt_raw.jaffle_shop.orders
- group by
- customer_id
- ```
+- breaking things down into separate models allows us to, for e.g., reuse the smaller models in multiple combined models
+- sources - discussed [here](#sources)
+- **staging models** (stg) - a clean view of the source data. should not be visible to the data analysts. e.g. convert cents to dollars
+- **intermediate models** (int) - in between the staging and final models. these too should not be visible to data analysts
+- **fact tables** (fct) - grow very quickly, e.g. orders
+- **dimension tables** (dim) - relatively smaller and represent reference data, e.g. customers
+- so, we create two folders under models - marts and staging
+- models/staging/stg_customers.sql -
+ ```sql
+ select
+ *
+ from
+ dbt_raw.jaffle_shop.customers
+ ```
+- similarly, we have staging models for orders and payments
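+- e.g. models/staging/stg_payment.sql could do the cents to dollars conversion mentioned above - a sketch, where the raw table / column names are assumptions -
+  ```sql
+  select
+      id as payment_id,
+      order_id,
+      status,
+      -- staging level cleanup - the raw table stores amounts in cents
+      amount / 100 as amount
+  from
+      dbt_raw.jaffle_shop.payment
+  ```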
- **ref** function - allows us to reference staging models in our actual models. dbt can infer the order to build these models in using the **dag**
-- customers.sql now is changed and looks as follows -
+- `ref` is an example of a **macro** - we use the `ref` macro for models, the `source` macro for sources, etc
+- models/marts/dim_customers.sql -
```sql
- with stg_customers as (
- select * from (( ref('stg_customers') ))
+ with customers as (
+ select * from {{ ref('stg_customers') }}
),
- stg_customer_orders as (
- select * from (( ref('stg_customer_orders') ))
- )
+ order_analytics as (
+ select
+ user_id customer_id,
+ min(order_date) first_order_date,
+ max(order_date) last_order_date,
+ count(*) number_of_orders,
+ from
+ {{ ref('stg_orders') }}
+ group by
+ customer_id
+  )
select
- stg_customers.*,
- stg_customer_orders.first_order_date,
- stg_customer_orders.last_order_date,
- coalesce(stg_customer_orders.number_of_orders, 0) as number_of_orders
+ customers.*,
+ order_analytics.first_order_date,
+ order_analytics.last_order_date,
+ coalesce(order_analytics.number_of_orders, 0) as number_of_orders
from
- stg_customers
- left join stg_customer_orders on stg_customers.id = stg_customer_orders.customer_id
+ customers
+ left join order_analytics on customers.id = order_analytics.customer_id
+ ```
+- now, we might, for e.g., want our staging models to be materialized as views and our models in marts as tables
+- we can configure this globally in `dbt_project.yml` instead of doing it in every file like we saw [here](#models) as follows -
+ ```yml
+ name: 'jaffle_shop'
+
+ models:
+ jaffle_shop:
+ marts:
+ +materialized: table
+ staging:
+ +materialized: view
+ ```
+- output for our materialization -
+ ![](/assets/img/dbt/materialization.png)
+- when we go to the **compile** tab, we see the jinja being replaced with the actual table name
+ ```sql
+ with stg_customers as (
+ select * from dbt_analytics.dbt_sagarwal.stg_customers
+ ),
+ -- ......
```
-- when we go to the **compile** tab, we see the jinja being replaced with the actual sql that is generated
+- understand how it automatically adds the schema name based on our development credentials. so, the code automatically works for different developers
+- remember - the compile tab does not contain the actual ddl / dml, it only contains the sql with the jinja expanded. the actual ddl / dml shows up in the logs
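+- e.g. for a model materialized as a view, the logged ddl is roughly of this shape (a sketch, not the exact statement dbt emits) -
+  ```sql
+  create or replace view dbt_analytics.dbt_sagarwal.dim_customers as (
+      with customers as (
+          select * from dbt_analytics.dbt_sagarwal.stg_customers
+      )
+      -- ...rest of the compiled sql for the model
+      select * from customers
+  );
+  ```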
+- we can also look at the **lineage** tab, which shows us the **dag** (directed acyclic graph). it tells us the order the dbt models should be built in
+ ![](/assets/img/dbt/lineage.png)
+- the graph is interactive - double click on the node to open the corresponding model file automatically
## Sources
-- **sources** - helps describe data load by extract load / by data engineers
-- advantages -
+- **sources** - help describe the raw data that has been loaded into the warehouse by the data engineers during extract / load
+- some advantages of using sources over referencing table names directly -
- helps track **lineage** better when we use **source** function instead of table names directly
- helps test assumptions on source data
- calculate freshness of source data
@@ -179,40 +194,156 @@ title: DBT
from
(( source('jaffle_shop', 'customers') ))
```
+- the graph starts showing the sources in the lineage as well -
+ ![](/assets/img/dbt/sources.png)
+- **freshness** - we can add freshness checks to our source tables as follows -
+ ```yml
+ - name: orders
+ loaded_at_field: _etl_loaded_at
+ freshness:
+ warn_after: {count: 6, period: hour}
+ error_after: {count: 12, period: hour}
+ ```
+- we run the command `dbt source freshness` to check the freshness of our data, and we will automatically get a warning or error based on our configuration
+- e.g. a warning is raised if the greatest value of `_etl_loaded_at` is more than 6 but less than 12 hours old
+- note - the `freshness` and `loaded_at_field` configuration can also be added at each source level i.e. alongside database / schema to apply to all tables
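+- under the hood, `dbt source freshness` runs a query roughly like the following against the source table, and compares the age of max_loaded_at with warn_after / error_after (a sketch, not the exact query dbt generates) -
+  ```sql
+  select
+      max(_etl_loaded_at) as max_loaded_at,
+      convert_timezone('UTC', current_timestamp()) as snapshotted_at
+  from
+      dbt_raw.jaffle_shop.orders
+  ```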
-## Tests and Documentation
+## Tests
-- for each of our tests, a query is created that returns the number of rows that fail the test. this should be zero for the test to pass
-- to add tests, create a file schema.yml under the models directory -
+- tests in **analytics engineering** are about making assertions on data
+- we write tests along with our development to validate our [models](#models)
+- these tests are then run in production
+- we use the command `dbt test` to run all of our tests
+- **generic tests** - applied across different models
+- generic tests are of four types - unique, not null, accepted values, relationships
+- **unique** - every value in a column is unique
+- **not null** - every value in a column is not null
+- **accepted values** - every value in a column exists in a predefined list
+- **relationships** - every value in a column exists in the column of another table to maintain referential integrity
+- we can also use packages / write our own custom generic tests using jinja and macros
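+- e.g. a custom generic test is just a parameterized query wrapped in a test block - a sketch (the test name and file name here are made up) could be tests/generic/assert_positive.sql -
+  ```sql
+  {% test assert_positive(model, column_name) %}
+
+  select
+      {{ column_name }}
+  from
+      {{ model }}
+  where
+      {{ column_name }} <= 0
+
+  {% endtest %}
+  ```
+- it can then be applied under a column's tests in yml just like the built in generic tests, e.g. `- assert_positive`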
+- e.g. create a file called models/staging/staging.yml -
```yml
version: 2
-
+
models:
- name: stg_customers
columns:
- name: id
- tests: [unique, not_null]
-
- - name: stg_customer_orders
- columns:
- - name: customer_id
tests:
- unique
- not_null
- - relationships:
- field: id
- to: ref('stg_customers')
+
+ - name: stg_orders
+ columns:
+ - name: user_id
+ tests:
+ - relationships:
+ field: id
+ to: ref('stg_customers')
+ ```
+- the logs on running dbt test show us what the actual sql for the test looks like. if any rows satisfy the condition, our test fails. e.g. not_null_stg_customers_id uses the following sql -
+ ```sql
+ select
+ count(*) as failures,
+ count(*) != 0 as should_warn,
+ count(*) != 0 as should_error
+ from
+ (
+ select id
+ from dbt_analytics.dbt_sagarwal.stg_customers
+ where id is null
+ ) dbt_internal_test
+ ```
+- **singular tests** - custom sql scripts for assertions that cannot be expressed using the generic tests that dbt ships with
+- in the tests directory, create a file staging/assert_stg_payment_amt_is_positive.sql with the following content -
+ ```sql
+ select
+ order_id,
+ sum(amount) total_amount
+ from
+ {{ ref('stg_payment') }}
+ group by
+ order_id
+ having
+ total_amount < 0
+ ```
+- we can also apply tests to our sources directly. it is done in the same way as models - using custom sql scripts in the tests directory for singular tests or using the sources.yml file for generic tests
+ ```yml
+ version: 2
+
+ sources:
+ - name: jaffle_shop
+ database: dbt_raw
+ schema: jaffle_shop
+ tables:
+ - name: orders
+ columns:
+ - name: status
+ tests:
+ - accepted_values:
+ values: [returned, completed, return_pending, shipped, placed]
+  ```
+- `dbt build` - how it works
+ - run `dbt test` on sources
+ - run `dbt run` and then `dbt test` on 1st layer of our models
+ - run `dbt run` and then `dbt test` on 2nd layer of our models
+ - and so on...
+- this way, our **downstream models** are never even built if the **upstream models** fail
+- the **skip** tab will show us the models which were for e.g. skipped due to failing tests
+
+## Documentation
+
+- we maintain documentation along with code, so that keeping the documentation up to date takes minimal extra effort
+- by using macros like `source` and `ref`, our dags are generated
+- along with this, we can document things like 'what does this model mean', 'what does this field mean', etc
+- we can add descriptions to our models / columns like so. we continue doing this in the same file we added tests to i.e. models/staging/staging.yml -
+ ```yml
+ models:
+ - name: stg_customers
+ description: one unique customer per row
+ columns:
+ - name: id
+ description: primary key
+ ```
+- sometimes, we might want more extensive documentation, which is not possible using just yaml. we can do so using **doc blocks**
+- create a file models/staging/documentation.md with the following content -
+ ```md
+ {% docs order_status %}
+
+  status can take one of the following values:
+
+ | status | definition |
+ |----------------|--------------------------------------------------|
+ | placed | Order placed, not yet shipped |
+ | shipped | Order has been shipped, not yet been delivered |
+ | completed | Order has been received by customers |
+ | return pending | Customer indicated they want to return this item |
+ | returned | Item has been returned |
+
+ {% enddocs %}
+ ```
+- then, reference it inside staging.yml as follows -
+ ```yml
+ - name: status
+ description: "{{ doc('order_status') }}"
```
-- my understanding of **relationship** test - check if all values of `stg_customer_orders#customer_id` is present in `stg_customers#id`
-- for all **models** and their individual fields in the above yaml, for all tables in [**sources**](#sources), etc we can include a description
-- this way, when we run `dbt docs generate`, it generates rich documentation for our project using json
+- just like we documented models, we can also document our sources
+- to generate the documentation, we use `dbt docs generate`
## Deployment
-- click deploy and click create environment to create a new **environment**. similar to what we did when creating a [project](#getting-started), we need to enter
- - **connection settings**
- - **deployment credentials**
-- then, we can create a **job**
- - specify the **environment** we created
- - in the **execution settings**, specify the commands to run, e.g. `dbt build`, `dbt docs generate`, etc
- - we can set a **schedule** for this job to run on, or run it manually whenever we would like to
+- go to deploy -> environments and create a new environment
+- code in production runs against the default branch (main / master) of our git repository. we can also configure this branch
+- just like we saw in [getting started](#getting-started), we need to set the snowflake account url, database, warehouse
+- we enter **deployment credentials** instead of **development credentials**
+ - the right way would be to create and use a service user's credentials instead of our own credentials here
+ - the code also ideally runs on a dedicated production schema, which only this service user has access to, not a user specific **target schema**
+- after configuring all this, we can create our environment
+- then, we can create a **job** of type **daily job** for this environment
+- here, we can enter a schedule for this job
+- we can also configure the set of commands that we would want to run on that cadence - it runs `dbt build`, and gives us checkboxes for running freshness checks on sources and for generating docs
+- after configuring all this, we can create our job
+- we can trigger this job manually as well
+- dbt cloud easily lets us access our documentation, shows us a breakdown of how much time the models took to build, etc
diff --git a/assets/img/dbt/lineage.png b/assets/img/dbt/lineage.png
index db2a51e..6e2e5e5 100644
Binary files a/assets/img/dbt/lineage.png and b/assets/img/dbt/lineage.png differ
diff --git a/assets/img/dbt/materialization.png b/assets/img/dbt/materialization.png
new file mode 100644
index 0000000..c05886e
Binary files /dev/null and b/assets/img/dbt/materialization.png differ
diff --git a/assets/img/dbt/sources.png b/assets/img/dbt/sources.png
new file mode 100644
index 0000000..cfd47b3
Binary files /dev/null and b/assets/img/dbt/sources.png differ
diff --git a/assets/img/warehouse-and-snowflake/enable-resource-monitor-notifications.png b/assets/img/warehouse-and-snowflake/enable-resource-monitor-notifications.png
new file mode 100644
index 0000000..ad9197f
Binary files /dev/null and b/assets/img/warehouse-and-snowflake/enable-resource-monitor-notifications.png differ