diff --git a/_posts/2023-07-23-snowflake.md b/_posts/2023-07-23-snowflake.md index 3c513d1..eb93252 100644 --- a/_posts/2023-07-23-snowflake.md +++ b/_posts/2023-07-23-snowflake.md @@ -838,7 +838,7 @@ title: Snowflake - `avg_running` - average number of queries executed - recall that a single virtual warehouse is a cluster of compute nodes - - consolidation - if we have two bi applications that rarely use a warehouse concurrently, we can use the same warehouse for both of them + - consolidation - if we have two bi applications that rarely use a warehouse concurrently, we can use the same warehouse for both of the applications to reduce costs ```sql select @@ -920,16 +920,17 @@ title: Snowflake ## Resource Monitors - **resource monitors** - warehouses can be assigned resource monitors -- we can define a limit on the number of credits consumed -- when this limit is reached, we can be notified / warehouses can be suspended -- my understanding - notification works on both user managed warehouses and warehouses managed by cloud services. 
however, we can only suspend user managed warehouses +- we can define a limit on the number of credits consumed - when this limit is reached, we can be notified +- notification works on both user managed warehouses and warehouses managed by cloud services ![](/assets/img/warehouse-and-snowflake/resource-monitor-visiblity.png) -- in the above resource monitor, we set - - - credit quota to 1 - - select the warehouse to monitor - - start monitoring immediately, never stop monitoring and reset the monitor daily +- in the above resource monitor, we - + - set credit quota to 1 + - monitor type - we can monitor the entire account or only a specific warehouse + - select the warehouse to monitor - a warehouse can have only one resource monitor, but the same resource monitor can be used for multiple warehouses + - start monitoring immediately, and never stop monitoring + - reset the monitor daily - it starts tracking from 0 again - notify when 75% and 100% of the quota is consumed - i ran this statement to modify the users to be notified - ```sql @@ -938,3 +939,23 @@ title: Snowflake - finally, on reaching the consumption, i get an email as follows - ![](/assets/img/warehouse-and-snowflake/resource-monitor-notification.png) + +- note - i also had to enable receiving notifications in my profile - + +![](/assets/img/warehouse-and-snowflake/enable-resource-monitor-notifications.png) + +- till now, we saw the **visibility** aspect of resource monitors, now we look at its **control** aspect +- we can only suspend user managed warehouses, not ones managed by cloud services +- we can choose to suspend either after running queries complete, or after cancelling all running queries + +## Control Strategies + +- **virtual warehouses** can be **single cluster** or **multi cluster** +- changing warehouse size helps with vertical scaling (scaling up) i.e. processing more complex queries +- using multi cluster warehouse helps with horizontal scaling (scaling out) i.e. 
processing more concurrent queries +- some features of warehouses to control spend include - + - **auto suspend** - automatically suspend warehouses if there is no activity + - **auto resume** - automatically restart the warehouse when there are new queries + - **auto scaling** - start additional clusters dynamically if queries are queued. bring them down when the queue becomes empty. there are two scaling policies - **standard** and **economy**. the first prefers starting additional clusters, while the second prefers conserving credits +- `statement_queued_timeout_in_seconds` - statements are dropped if queued for longer than this. default is 0 i.e. never dropped +- `statement_timeout_in_seconds` - statements are cancelled if they run for longer than this. default is 48 hours diff --git a/_posts/2024-08-31-dbt.md b/_posts/2024-08-31-dbt.md index dcc5648..5ab49ef 100644 --- a/_posts/2024-08-31-dbt.md +++ b/_posts/2024-08-31-dbt.md @@ -10,30 +10,24 @@ title: DBT - **data engineers** - build and maintain infrastructure, overall pipeline orchestration, integrations for ingesting the data, etc - **analytics engineers** - generate cleaned, transformed data for analysis - **data analysts** - work with business to understand the requirements. use dashboards, sql, etc to query the transformed data -- dbt sits on top of the cloud warehouses. snowflake / redshift / big query / databricks. it is the t of elt, i.e. it is meant to be used by the analytics engineers +- dbt sits on top of the cloud warehouses like snowflake / redshift / big query / databricks. it is used for the t of elt by the analytics engineers - we can **manage**, **test** and **document** transformations from one place -- can deploy the dbt project on a **schedule** using **environments** -- dbt builds a **dag** (**directed acyclic graph**) to represent the flow of data between different tables -- **sources** - where our data actually comes from. 
we create sources using yaml -- **models** - the intermediate layer(s) we create -- we use **macros** e.g. **source** to reference **sources**, **ref** to reference **models**, etc. we create models using sql -- based on our sql, the dag is created for us / the order of creation of models are determined for us - **jinja** is used, which is a pythonic language. everything inside double curly braces is jinja, and the rest of it is regular sql. note - replaced curly with round braces, because it was not showing up otherwise ```sql - ((% for i in range(5) %)) + (% for i in range(5) %) select (( i )) as number (% if not loop.last %) union all (% endif %) - ((% endfor %)) + (% endfor %) ``` -- this results in the following table containing one column number, with values 0-4 -- there is also a **compiled code** tab, where we can see the actual code that our dbt code compiles down to -- we can also look at the **lineage** tab, which shows us the **dag** (directed acyclic graph). it tells us the order the dbt models should be built in
- ![](/assets/img/dbt/lineage.png) -- the graph is interactive - double click on the node to open the corresponding model file automatically -- `dbt run` - find and create the models in the warehouse for us -- `dbt test` - test the models for us -- `dbt build` - a combination of run and test -- `dbt docs generate` - generate the documentation for us -- ide - edit models and make changes to the dbt project +- this results in a table containing one column number, with values 0-4 +- we use the if condition so that union all is not added after the last select. this is what the compiled tab shows - + ```sql + select 0 as number union all + select 1 as number union all + select 2 as number union all + select 3 as number union all + select 4 as number + ``` +- **normalized modelling** - earlier, we would model our data as star schema. it was relevant when storage / compute was expensive. now, we tend to also use **denormalized modelling**, where our models are created on an ad hoc / agile basis ## Getting Started @@ -50,7 +44,8 @@ title: DBT ![](/assets/img/dbt/development-credentials.png) - in the development credentials, we specify a schema. each dbt developer should use their own specific **target schema** to be able to work simultaneously - we then hit **test connection** for dbt to test the connection, and hit next -- can leverage git to version control the code. for this, we can either use **managed repositories** or one of the supported git providers like github, gitlab, etc +- can leverage git to version control the code +- for this, we can either use **managed repositories**, or one of the supported git providers like github, gitlab, etc ## Models @@ -88,74 +83,94 @@ title: DBT - we can see the corresponding ddl for a particular model in the logs
![](/assets/img/dbt/model-logs.png) - use the **preview** tab in the ide to see the actual data in table format -- we see that a view has been created for inside at the schema we had specified in the **development credentials** -- we can configure the way our model is **materialized** in one of the following ways - - - configure dbt_project.yml present at the root. everything in `jaffle_shop` will be materialized as a table, but everything inside example as a view - ```yml - models: - jaffle_shop: - +materialized: table - example: - +materialized: view - ``` - - to make it specific for a model, we can specify the following snippet at the top of the sql file for the model - - ```sql - {{ - config( - materialized='view' - ) - }} - ``` +- we see that a view has been created inside the schema we had specified in the **development credentials** +- we can configure the way our model is **materialized** using the following snippet at the top of the sql file for the model - + ```sql + (( + config( + materialized='view' + ) + )) + ``` - when we delete models, it does not delete them from snowflake. so, we might have to delete them manually ### Modularity -- breaking things down into separate models. it allows us to for e.g. reuse the smaller models in multiple combined models. these smaller models are called **staging models** (like the staging area in warehouses?) - - stg_customers.sql - ```sql - select - * - from - dbt_raw.jaffle_shop.customers - ``` - - stg_customer_orders.sql - ```sql - select - user_id as customer_id, - min(order_date) as first_order_date, - max(order_date) as last_order_date, - count(*) as number_of_orders, - from - dbt_raw.jaffle_shop.orders - group by - customer_id - ``` +- breaking things down into separate models. it allows us to for e.g. reuse the smaller models in multiple combined models +- sources - discussed [here](#sources) +- **staging models** (stg) - a clean view of the source data. should not be visible to the data analysts. e.g. 
convert cents to dollars +- **intermediate models** (int) - in between the staging and final models. these too should not be visible to data analysts +- **fact tables** (fct) - grow very quickly, e.g. orders +- **dimension tables** (dim) - relatively smaller and represent reference data, e.g. customers +- so, we create two folders under models - marts and staging +- models/staging/stg_customers.sql - + ```sql + select + * + from + dbt_raw.jaffle_shop.customers + ``` +- similarly, we have staging models for orders and payments - **ref** function - allows us to reference staging models in our actual models. dbt can infer the order to build these models in using the **dag** -- customers.sql now is changed and looks as follows - +- this is also called a **macro** - we use the `ref` macro for models, the `source` macro for sources, etc +- models/marts/dim_customers.sql - ```sql - with stg_customers as ( - select * from (( ref('stg_customers') )) + with customers as ( + select * from {{ ref('stg_customers') }} ), - stg_customer_orders as ( - select * from (( ref('stg_customer_orders') )) - ) + order_analytics as ( + select + user_id customer_id, + min(order_date) first_order_date, + max(order_date) last_order_date, + count(*) number_of_orders + from + {{ ref('stg_orders') }} + group by + customer_id + ) select - stg_customers.*, - stg_customer_orders.first_order_date, - stg_customer_orders.last_order_date, - coalesce(stg_customer_orders.number_of_orders, 0) as number_of_orders + customers.*, + order_analytics.first_order_date, + order_analytics.last_order_date, + coalesce(order_analytics.number_of_orders, 0) as number_of_orders from - stg_customers - left join stg_customer_orders on stg_customers.id = stg_customer_orders.customer_id + customers + left join order_analytics on customers.id = order_analytics.customer_id + ``` +- now, we might for e.g. 
want our staging models to be materialized as views and our models in marts as tables +- we can configure this globally in `dbt_project.yml` instead of doing it in every file like we saw [here](#models) as follows - + ```yml + name: 'jaffle_shop' + + models: + jaffle_shop: + marts: + +materialized: table + staging: + +materialized: view + ``` +- output for our materialization - + ![](/assets/img/dbt/materialization.png) +- when we go to the **compile** tab, we see the jinja being replaced with the actual table name + ```sql + with stg_customers as ( + select * from dbt_analytics.dbt_sagarwal.stg_customers + ), + -- ...... ``` -- when we go to the **compile** tab, we see the jinja being replaced with the actual sql that is generated +- understand how it automatically adds the schema name based on our development credentials. so, the code automatically works for different developers +- remember - the compile tab does not contain the actual ddl / dml, it only contains the transformed jinja for instance. the actual ddl / dml comes up in the logs +- we can also look at the **lineage** tab, which shows us the **dag** (directed acyclic graph). it tells us the order the dbt models should be built in
 ![](/assets/img/dbt/lineage.png) +- the graph is interactive - double click on the node to open the corresponding model file automatically ## Sources -- **sources** - helps describe data load by extract load / by data engineers +- **sources** - helps describe the data loaded during the extract phase by data engineers +- some advantages of using sources over referencing table names directly - - helps track **lineage** better when we use **source** function instead of table names directly - helps test assumptions on source data - calculate freshness of source data @@ -179,40 +194,156 @@ title: DBT from (( source('jaffle_shop', 'customers') )) ``` +- the graph starts showing the sources in the lineage as well - + ![](/assets/img/dbt/sources.png) +- **freshness** - we can add freshness to our sources table as follows - + ```yml + - name: orders + loaded_at_field: _etl_loaded_at + freshness: + warn_after: {count: 6, period: hour} + error_after: {count: 12, period: hour} + ``` +- we run the command `dbt source freshness` to check the freshness of our data, and we will automatically get a warning or error based on our configuration +- e.g. a warning is raised if the greatest value of `_etl_loaded_at` is more than 6 but less than 12 hours old +- note - the `freshness` and `loaded_at_field` configuration can also be added at each source level i.e. alongside database / schema to apply to all tables -## Tests and Documentation +## Tests -- for each of our tests, a query is created that returns the number of rows that fail the test. 
this should be zero for the test to pass -- to add tests, create a file schema.yml under the models directory - +- tests in **analytics engineering** are about making assertions on data +- we write tests along with our development to validate our [models](#models) +- these tests are then run in production +- we use the command `dbt test` to run all of our tests +- **generic tests** - applied across different models +- generic tests are of four types - unique, not null, accepted values, relationships +- **unique** - every value in a column is unique +- **not null** - every value in a column is not null +- **accepted values** - every value in a column exists in a predefined list +- **relationships** - every value in a column exists in the column of another table to maintain referential integrity +- we can also use packages / write our own custom generic tests using jinja and macros +- e.g. create a file called models/staging/staging.yml - ```yml version: 2 - + models: - name: stg_customers columns: - name: id - tests: [unique, not_null] - - - name: stg_customer_orders - columns: - - name: customer_id tests: - unique - not_null - - relationships: - field: id - to: ref('stg_customers') + + - name: stg_orders + columns: + - name: user_id + tests: + - relationships: + field: id + to: ref('stg_customers') ``` +- the logs on running dbt test show us what the actual sql for the test looks like. if any rows satisfy the condition, our test fails. e.g. 
not_null_stg_customers_id uses the following sql - + ```sql + select + count(*) as failures, + count(*) != 0 as should_warn, + count(*) != 0 as should_error + from + ( + select id + from dbt_analytics.dbt_sagarwal.stg_customers + where id is null + ) dbt_internal_test + ``` +- **singular tests** - custom sql scripts for assertions that are not otherwise possible using the generic tests that dbt ships with +- in the tests directory, create a file staging/assert_stg_payment_amt_is_positive.sql with the following content - + ```sql + select + order_id, + sum(amount) total_amount + from + {{ ref('stg_payment') }} + group by + order_id + having + total_amount < 0 + ``` +- we can also apply tests to our sources directly. it is done in the same way as models - using custom sql scripts in the tests directory for singular tests or using the sources.yml file for generic tests + ```yml + version: 2 + + sources: + - name: jaffle_shop + database: dbt_raw + schema: jaffle_shop + tables: + - name: orders + columns: + - name: status + tests: + - accepted_values: + values: [returned, completed, return_pending, shipped, placed] + ``` +- `dbt build` - how it works + - run `dbt test` on sources + - run `dbt run` and then `dbt test` on 1st layer of our models + - run `dbt run` and then `dbt test` on 2nd layer of our models + - and so on... +- this way, our **downstream models** are never even built if the **upstream models** fail +- the **skip** tab will show us the models which were skipped, e.g. due to failing tests + +## Documentation + +- we maintain documentation along with code so that maintaining documentation requires minimal extra effort +- by using macros like `source` and `ref`, our dags are generated +- along with this, we can document things like 'what does this model mean', 'what does this field mean', etc +- we can add descriptions to our models / statuses as follows. we continue doing this in the same file we added tests to i.e. 
models/staging/staging.yml - + ```yml + models: + - name: stg_customers + description: one unique customer per row + columns: + - name: id + description: primary key + ``` +- sometimes, we might want more extensive documentation, which is not possible using just yaml. we can do so using **doc blocks** +- create a file models/staging/documentation.md with the following content - + ```md + {% docs order_status %} + + status can take one of the following values - + + | status | definition | + |----------------|--------------------------------------------------| + | placed | Order placed, not yet shipped | + | shipped | Order has been shipped, not yet been delivered | + | completed | Order has been received by customers | + | return pending | Customer indicated they want to return this item | + | returned | Item has been returned | + + {% enddocs %} + ``` +- then, reference it inside staging.yml as follows - + ```yml + - name: status + description: "{{ doc('order_status') }}" ``` -- my understanding of **relationship** test - check if all values of `stg_customer_orders#customer_id` is present in `stg_customers#id` -- for all **models** and their individual fields in the above yaml, for all tables in [**sources**](#sources), etc we can include a description -- this way, when we run `dbt docs generate`, it generates rich documentation for our project using json +- just like we documented models, we can also document our sources +- to generate the documentation, we use `dbt docs generate` ## Deployment -- click deploy and click create environment to create a new **environment**. similar to what we did when creating a [project](#getting-started), we need to enter - - **connection settings** - - **deployment credentials** -- then, we can create a **job** - - specify the **environment** we created - - in the **execution settings**, specify the commands to run, e.g. 
`dbt build`, `dbt docs generate`, etc - - we can set a **schedule** for this job to run on, or run it manually whenever we would like to +- go to deploy -> environments and create a new environment +- code in production runs against the default branch (main / master) of our git repository. we can also configure this branch +- just like we saw in [getting started](#getting-started), we need to set the snowflake account url, database, warehouse +- we enter **deployment credentials** instead of **development credentials** + - the right way would be to create and use a service user's credentials instead of our own credentials here + - the code also ideally runs on a dedicated production schema, which only this service user has access to, not a user specific **target schema** +- after configuring all this, we can create our environment +- then, we can create a **job** of type **daily job** for this environment +- here, we can enter a schedule for this job +- we can also configure the set of commands that we would want to run on that cadence - it runs build, and allows us to check options for running freshness checks on sources and generating docs +- after configuring all this, we can create our job +- we can trigger this job manually as well +- dbt cloud easily lets us access our documentation, shows us a breakdown of how much time the models took to build, etc diff --git a/assets/img/dbt/lineage.png b/assets/img/dbt/lineage.png index db2a51e..6e2e5e5 100644 Binary files a/assets/img/dbt/lineage.png and b/assets/img/dbt/lineage.png differ diff --git a/assets/img/dbt/materialization.png b/assets/img/dbt/materialization.png new file mode 100644 index 0000000..c05886e Binary files /dev/null and b/assets/img/dbt/materialization.png differ diff --git a/assets/img/dbt/sources.png b/assets/img/dbt/sources.png new file mode 100644 index 0000000..cfd47b3 Binary files /dev/null and b/assets/img/dbt/sources.png differ diff --git 
a/assets/img/warehouse-and-snowflake/enable-resource-monitor-notifications.png b/assets/img/warehouse-and-snowflake/enable-resource-monitor-notifications.png new file mode 100644 index 0000000..ad9197f Binary files /dev/null and b/assets/img/warehouse-and-snowflake/enable-resource-monitor-notifications.png differ
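- the resource monitor walkthrough in the snowflake post above configures everything through the ui. as a minimal sketch, the same monitor could be created in sql - the warehouse name `my_wh` and monitor name `daily_monitor` here are made up for illustration -
  ```sql
  -- hypothetical names, adjust to your account
  -- quota of 1 credit, reset daily, notify at 75%, suspend at 100%
  create resource monitor daily_monitor with
    credit_quota = 1
    frequency = daily
    start_timestamp = immediately
    triggers
      on 75 percent do notify
      on 100 percent do suspend; -- suspend only works for user managed warehouses

  -- a warehouse can have only one resource monitor,
  -- but one monitor can be attached to multiple warehouses
  alter warehouse my_wh set resource_monitor = daily_monitor;
  ```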