Dbt_With_SnowFlake

Table of Contents

  • Overview
  • Data Build Tool
  • dbt Installation
  • Connecting dbt with Snowflake
  • Folder Structure
  • Program Flow
  • Program Execution
  • Data Model
  • Level Up
  • Documentation and Material
  • Tools and Technologies

Overview

  • The purpose of this project was to learn how the transformation step can be made easier using data build tool (dbt).
  • Check which functionalities dbt supports that can save development time.
  • Snowflake was used as the warehouse; it comes with a free 30-day trial when you sign up.

Data Build Tool

  • dbt enables analytics engineers to transform data in their warehouses by simply writing select statements.
  • dbt handles turning these select statements into tables and views; models can also be materialized incrementally and dbt will handle that (see the sketch after this list).
  • dbt does the T in ELT (Extract, Load, Transform) processes.
  • dbt provides a web UI that can be used to visualize the lineage and dependencies of models.
  • The web UI also provides details about each model, what it depends on, and the test cases defined on certain models.
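  • As a small illustration, a dbt model is just a select statement saved in a .sql file under models/, with the materialization set in a config block. The model and column names below are hypothetical:

  -- models/core/dim_city.sql (hypothetical example)
  {{ config(materialized='table') }}

  select
      city_id,
      city_name,
      country_id
  from {{ ref('stg_city') }}

  • dbt run builds this as a table in the target schema; changing materialized to 'view' would create a view instead.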

dbt Installation

  • Install git and python.

  • dbt installation on Linux can run into problems, so instead we install it from the dbt-core project.

  • All of this will be done inside a virtual environment.

  • This will create a virtual env

  python3 -m venv dbt-env
  • Activate the env
    • Remember to be in the correct folder.
  source dbt-env/bin/activate
  • In this repo dbt-core has already been downloaded, but you can clone it yourself as well.
  git clone https://github.com/dbt-labs/dbt.git
  • Go into the dbt folder and install dbt.
  • In requirements.txt you can specify which adapters to install, e.g. Snowflake, BigQuery.
  cd dbt
  pip install -r requirements.txt
  • Command to verify that dbt is installed.
  dbt --version
  • Install the Snowflake plugin if it was not specified in the requirements.txt file.
  pip install dbt-snowflake
  • Open your editor of choice (VS Code is used in this demo). Running the command below will create a project populated with different folders (seeds, models, etc.).
  dbt init <project-name>

Connecting dbt with Snowflake
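
  • dbt reads warehouse connections from profiles.yml (by default under ~/.dbt/). The profile below uses placeholder values; its top-level name (dbt_model) must match the profile key in dbt_project.yml.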

dbt_model:
  outputs:
    dev:
      account: ap12345.ap-south-1.aws
      database: <database>
      password: <password>
      role: <role-from-snowflake>
      schema: <schema>
      threads: 2
      type: snowflake
      user: <username>
      warehouse: <warehouse-name-from-snowflake>
  target: dev
  • Once the Snowflake profile has been set, run the command below to check the connection.
  dbt debug

Folder Structure

  • dbt         - dbt cloned repo used for installation
  • dbt-env     - Python virtual environment files
  • dbt-model
    • dbt-model       - created after dbt init
      • analyses
      • macros            - create macros here and reference them later (see the sketch after this list)
      • models            - tables, views, incremental load, merge
      • seeds             - flat files, in case you want to load them into staging tables using dbt
      • snapshots         - SCD tables
      • tests             - tests on different models
      • dbt_project.yml   - one place to configure everything
      • packages.yml      - dbt has many packages that can be downloaded
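
  • As a small illustration of the macros folder, a macro is a Jinja function defined in a .sql file and referenced from models. The macro below (cents_to_dollars) is a hypothetical example:

  -- macros/cents_to_dollars.sql (hypothetical example)
  {% macro cents_to_dollars(column_name) %}
      ({{ column_name }} / 100)::number(16, 2)
  {% endmacro %}

  -- usage inside a model:
  -- select {{ cents_to_dollars('amount_cents') }} as amount from {{ ref('stg_transactions') }}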

Program Flow


Program Execution

  • Before executing any of the commands, remember to be in the correct folder.
  cd <project-name>
  • To load files from the seeds folder into the stage tables in Snowflake.
  dbt seed
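  • A seed is just a CSV file in the seeds folder; dbt seed creates a table named after the file (e.g. seeds/country.csv becomes a country table). The target schema can be set in dbt_project.yml; the project and schema names below are assumptions:

  # dbt_project.yml (hypothetical seed configuration)
  seeds:
    dbt_model:
      +schema: staging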
  • The data is now in the stage tables; next, load it into the Core/Dim tables.
    • City, Country, and Transactions will be loaded as they need no history handling (an incremental sketch for Transactions follows the dbt run command).
  dbt run
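  • Since Transactions is append-only, it can be materialized incrementally so each run only processes new rows. A minimal sketch, assuming hypothetical column names (transaction_id, loaded_at) and a staging model stg_transactions:

  -- models/core/fct_transactions.sql (hypothetical example)
  {{ config(materialized='incremental', unique_key='transaction_id') }}

  select *
  from {{ ref('stg_transactions') }}
  {% if is_incremental() %}
    -- only pick up rows newer than what is already in the target table
    where loaded_at > (select max(loaded_at) from {{ this }})
  {% endif %}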
  • To run only the models in a specific folder inside the models folder.
  dbt run -m <folder-name>
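  • Newer dbt versions use --select for the same purpose; selectors can also target a single model or its downstream dependents, for example:
  dbt run --select <folder-name>
  dbt run --select <model-name>+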
  • The snapshots folder holds all the models on which SCD Type 2 is applied.
  dbt snapshot
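  • A snapshot is a .sql file in the snapshots folder; dbt adds the dbt_valid_from / dbt_valid_to columns that implement SCD Type 2. A minimal sketch with hypothetical names (customer_snapshot, updated_at):

  -- snapshots/customer_snapshot.sql (hypothetical example)
  {% snapshot customer_snapshot %}
  {{
      config(
          target_schema='snapshots',
          unique_key='customer_id',
          strategy='timestamp',
          updated_at='updated_at'
      )
  }}
  select * from {{ ref('stg_customer') }}
  {% endsnapshot %}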
  • We can also run the test cases that are defined on different models, snapshots, and seeds.
  dbt test
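  • Generic tests such as unique and not_null are declared in a YAML file next to the models; custom SQL tests go in the tests folder. A minimal sketch with hypothetical model and column names:

  # models/core/schema.yml (hypothetical example)
  version: 2
  models:
    - name: dim_city
      columns:
        - name: city_id
          tests:
            - unique
            - not_null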
  • dbt provides a web UI that can be generated and served with the commands below.
    • Internally dbt stores all the metadata as JSON files, which the web UI reads.
  dbt docs generate
  dbt docs serve
  • You can inspect the different models in the UI and view the lineage graph as well.


Data Model

  • The source system provides a full dump (all of the data every time).
  • The Transactions table is append-only data.
  • Most of the tables in Core/Dim are SCD Type 1 or Type 2.
  • Country and City do not change much, so they can be loaded manually whenever needed.

Level Up

  • Right now the project only loads data to Core/Dim.
  • Create a pipeline and orchestrate it using Airflow or Airbyte.
  • Build fact tables and visualizations on top of them.
  • Use dbt test for testing purposes and the different packages that can aid in development (see the sketch below).
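
  • For example, the dbt_utils package adds commonly used macros and generic tests. A minimal packages.yml sketch (the version shown is an assumption, pick one compatible with your dbt version):

  # packages.yml (hypothetical example)
  packages:
    - package: dbt-labs/dbt_utils
      version: 1.1.1
  • Run dbt deps to download the packages listed there.
  dbt deps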

Documentation and Material


Tools and Technologies

  • dbt
  • Snowflake
  • Git