Fix: Documentation Fixes (#245)
* Amended documentation and fixed dependencies.
hustic authored Jan 9, 2024
1 parent 678950a commit 8e4dae7
Showing 14 changed files with 109 additions and 98 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -40,10 +40,10 @@ SAYN aims to empower data engineers and analysts through its three core design
SAYN supports Python 3.7 to 3.10.

```bash
$ pip install sayn
$ sayn init test_sayn
$ cd test_sayn
$ sayn run
pip install sayn
sayn init test_sayn
cd test_sayn
sayn run
```

That's it! You have completed your first SAYN run on the example project. Continue with the [Tutorial: Part 1](https://173tech.github.io/sayn/tutorials/tutorial_part1/) which will give you a good overview of SAYN's true power!
2 changes: 2 additions & 0 deletions docs/cli.md
@@ -49,9 +49,11 @@ not produced by the currently filtered tasks. Head over to [database objects](da
Both task filtering and upstream prod arguments can be set using `default_run` in `settings.yaml`. Example:

!!! example "settings.yaml"
```yaml
profiles:
dev:
default_run: -x group:extract
```

With this example, running `sayn run` will exclude the tasks in the group called `extract` by default.

14 changes: 7 additions & 7 deletions docs/database_objects.md
@@ -14,7 +14,7 @@ default schema. SAYN uses the same format to refer to database tables and views,
## Compilation of object names

In a real world scenario we want to write our code in a way that dynamically changes depending on the profile we're running
on (eg: test vs production). This allows for multiple people to collaborate on the same project, wihtout someone's actions
on (eg: test vs production). This allows for multiple people to collaborate on the same project, without someone's actions
affecting the work of others in the team. Let's consider this example task:

!!! example "tasks/core.yaml"
@@ -37,7 +37,7 @@ A way to solve this problem could be to have different databases for each person
database setups, potential data governance issues and increased database costs, as you might need a copy of the data per person
working with it.

In SAYN there's another solution: we express database object names like `schema.table` but the code that's execution in the database
In SAYN there's another solution: we express database object names like `schema.table` but the code that is executed in the database
is transformed according to personal settings. For example, we could have a schema called `analytics_models` where our production lives
and another called `test_models` where we store data produced during development, with table names like `USER_PREFIX_table` rather
than `table` so there's no collision and we minimise data redundancy.
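The translation just described can be sketched as follows (an illustrative stand-in, not SAYN's actual code; `compile_obj_name` and its parameters are hypothetical):

```python
# Illustrative sketch of the name translation described above -- not SAYN's
# actual implementation. Hypothetical settings: a schema override and a
# table prefix, as in the analytics_models / test_models example.
def compile_obj_name(name, schema_override=None, table_prefix=None):
    """Translate a 'schema.table' reference according to personal settings."""
    schema, _, table = name.partition(".")
    if schema_override:
        schema = schema_override
    if table_prefix:
        table = f"{table_prefix}_{table}"
    return f"{schema}.{table}"

# In production nothing changes; in development both parts are rewritten.
print(compile_obj_name("analytics_models.table"))  # production settings
print(compile_obj_name("analytics_models.table",
                       schema_override="test_models",
                       table_prefix="USER_PREFIX"))  # development settings
```

The same code can thus be run unchanged by every team member, with only the personal settings differing.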
@@ -61,7 +61,7 @@ The modifications described above are setup with prefixes, suffixes and override

The above will make every `schema.table` specification compile to `test_schema.up_table`.

Following the example in the previous section, if we want the to call the production schema `analytics_models` we can do so by
Following the example in the previous section, if we want to call the production schema `analytics_models` we can do so by
adding the prefix in the `project.yaml` file:

!!! example "project.yaml"
@@ -167,7 +167,7 @@ model. `context.out` and `self.out` are also available in python tasks and their
Note that calling `src` and `out` in the `run` method of a python task class or in the function code when using a decorator doesn't
affect task dependencies, it simply outputs the translated database object name. The task dependency behaviour in python tasks is done
by either calling `self.src` or `self.out` in the `config` method of the class or by passing these references to the `task` decorator
in the `sources` and `outputs` arguments as seen in this example. For more details head to [the python task section](tasks/python).
in the `sources` and `outputs` arguments as seen in this example. For more details head to [the python task section](tasks/python.md).

## Altering the behaviour of `src`

@@ -221,7 +221,7 @@ actually be:

As you can see, we just need to specify a list of tables in `from_prod` to always read from the production configuration, that is, the
settings shared by all team members as specified in `project.yaml`. To make it easier to use, wildcards (`*`) are accepted, so that we
can specify a whole schema like in the example, but we can also specify a list of tables explicitely instead.
can specify a whole schema like in the example, but we can also specify a list of tables explicitly instead.

`from_prod` can also be specified using environment variables with `export SAYN_FROM_PROD="logs.*"` where the value is a comma
separated list of tables.
@@ -297,9 +297,9 @@ but the code executed for `another_example_model` will be:
FROM test_models.up_example_model
```

Because `example_task` is part of this exeuction and produces the table `models.example_model` reference by `another_example_task`
Because `example_task` is part of this execution and produces the table `models.example_model` referenced by `another_example_task`,
`models.example_model` is translated using the testing settings into `test_models.up_example_model` unlike `logs.raw_table` which
as no task producing it is present in this execution, will be translated into the production name.
(as no task present in this execution is producing it) will be translated into the production name.

With upstream prod it becomes a lot easier to work with your modelling layer without having to duplicate all your upstream tables
for every person in the team or being forced to work with sampled data.
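The rule above can be summarised in a small sketch (hypothetical helper, not SAYN's real API; the `test_models.up_` prefix mirrors the testing settings used in the example):

```python
# Rough sketch of the upstream prod rule: a reference is translated with
# the test settings only when a task in the current execution produces it;
# otherwise the production name is used as-is.
def resolve(reference, producers_in_run):
    to_test = lambda name: "test_models.up_" + name.split(".", 1)[1]
    if reference in producers_in_run:
        return to_test(reference)
    return reference  # production name, untouched

producers = {"models.example_model"}
print(resolve("models.example_model", producers))  # test_models.up_example_model
print(resolve("logs.raw_table", producers))        # logs.raw_table
```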
4 changes: 4 additions & 0 deletions docs/settings/settings_yaml.md
@@ -125,12 +125,16 @@ with modelling tasks) it's useful to automatically filter the tasks that will be
`settings.yaml` or through the environment variable `SAYN_DEFAULT_RUN`:

!!! example "settings.yaml"
```yaml
profile:
dev:
default_run: -x group:extract
```

!!! example ".env.sh"
```bash
export SAYN_DEFAULT_RUN="-x group:extract"
```

So we just add the arguments we would give after `sayn run` or `sayn compile`. Only task selection and
upstream prod are allowed (`-t/--tasks`, `-x/--exclude` and `-u/--upstream-prod`).
47 changes: 24 additions & 23 deletions docs/tasks/autosql.md
@@ -9,7 +9,7 @@ The `autosql` task lets you write a `SELECT` statement and SAYN then automates t
An `autosql` task group is defined as follows:

!!! example "project.yaml"
```
```yaml
...

groups:
Expand Down Expand Up @@ -46,12 +46,12 @@ An `autosql` task is defined by the following attributes:

With `autosql` tasks, you should use the `src` macro in your `SELECT` statements to implicitly create task dependencies.

!!! example `autosql` query
```
SELECT field1
, field2
FROM {{ src('my_table') }} l
```
!!! example "`autosql` query"
```sql
SELECT field1
, field2
FROM {{ src('my_table') }} l
```

By using the `{{ src('my_table') }}` in your `FROM` clause, you are effectively telling SAYN that your task depends on the `my_table` table (or view). As a result, SAYN will look for the task that produces `my_table` and set it as a parent of this `autosql` task automatically.

@@ -63,7 +63,7 @@ By using the `{{ src('my_table') }}` in your `FROM` clause, you are effectively
If you need to amend the configuration (e.g. materialisation) of a specific `autosql` task within a `group`, you can overload the values specified in the YAML group definition. To do this, we simply call `config` from a Jinja tag within the sql file of the task:

!!! example "autosql with config"
```
```sql
{{ config(materialisation='view') }}

SELECT ...
@@ -132,31 +132,32 @@ Autosql tasks accept a `columns` field in the task definition that affects the t

SAYN also lets you customise the `CREATE TABLE` statement if you need more control. This is done with:

* columns: the list of columns including their definitions.
* table_properties: database specific properties that affect table creation (indexes, cluster, sorting, etc.).
* post_hook: SQL statments executed right after the table/view creation.
* `columns`: the list of columns including their definitions.
* `table_properties`: database specific properties that affect table creation (indexes, cluster, sorting, etc.).
* `post_hook`: SQL statements executed right after the table/view creation.

`columns` can define the following attributes:

* name: the column name.
* type: the column type.
* tests: list of keywords that constraint a specific column
- unique: enforces a unique constraint on the column.
- not_null: enforces a non null constraint on the column.
- allowed_values: list allowed values for the column.
* `name`: the column name.
* `type`: the column type.
* `tests`: list of keywords that constrain a specific column:
- `unique`: enforces a unique constraint on the column.
- `not_null`: enforces a non null constraint on the column.
- `allowed_values`: list allowed values for the column.

`table_properties` can define the following attributes (database specific):
* indexes:
* sorting: specify the sorting for the table
* distribution_key: specify the type of distribution.
* partitioning: specify the partitioning model for the table.
* clustering: specify the clustering for the table.

* `indexes`: specify the indexes for the table,
* `sorting`: specify the sorting for the table,
* `distribution_key`: specify the type of distribution,
* `partitioning`: specify the partitioning model for the table,
* `clustering`: specify the clustering for the table.

!!! attention
Each supported database might have specific `table_properties` related to it; see the database-specific pages for further details and examples.

!!! Attention
If the a primary key is defined in both the `columns` and `indexes` DDL entries, the primary key will be set as part of the `CREATE TABLE` statement only.
If a primary key is defined in both the `columns` and `indexes` DDL entries, the primary key will be set as part of the `CREATE TABLE` statement only.

!!! example "autosql with columns"
```yaml
1 change: 1 addition & 0 deletions docs/tasks/copy.md
@@ -149,6 +149,7 @@ specific column types in the final table:
```

In this example we define 2 columns for `task_copy`: `id` and `updated_at`. This will make SAYN:

1. Copy only those 2 columns, disregarding any other columns present at source
2. Infer the type of `id` based on the type of that column at source
3. Enforce the destination table type for `updated_at` to be `TIMESTAMP`
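That behaviour can be sketched in a few lines of Python (purely illustrative; the column names and types are hypothetical):

```python
# Sketch of the column handling just described: copy only the listed
# columns, keeping the source type unless the task definition forces one.
source_types = {"id": "BIGINT", "updated_at": "VARCHAR", "extra": "TEXT"}
task_columns = [{"name": "id"},                               # type inferred from source
                {"name": "updated_at", "type": "TIMESTAMP"}]  # type enforced

destination = {c["name"]: c.get("type") or source_types[c["name"]]
               for c in task_columns}
print(destination)  # {'id': 'BIGINT', 'updated_at': 'TIMESTAMP'} -- 'extra' is dropped
```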
10 changes: 4 additions & 6 deletions docs/tasks/overview.md
@@ -34,14 +34,13 @@ Please see below the available SAYN task types:
Tasks in SAYN are organised into `groups`, which are described in the `project.yaml` file of your project. Task `groups` define a set of tasks which share the same attributes. For example, we can define a group formed of `sql` tasks called `core` like this:

!!! example "project.yaml"
```
```yaml
groups:
core:
type: sql
file_name: "core/*.sql"
materialisation: table
destination:
table: "{{ task.name }}"
destination: "{{ task.name }}"
```

The properties defined in the group tell SAYN how to generate tasks:
@@ -64,14 +63,13 @@ This definition of `groups` in the `project.yaml` file is available for `autosql
## Task Attributes

!!! example "project.yaml"
```
```yaml
groups:
core:
type: sql
file_name: "core/*.sql"
materialisation: table
destination:
table: "{{ task.name }}"
destination: "{{ task.name }}"
```

As you saw in the example above, task attributes can be defined dynamically. This example uses the task name to set the destination, effectively telling the `core` tasks to create their outputs in tables named after the `task`, which for `sql` tasks is the name of the file without the `.sql` extension.
20 changes: 10 additions & 10 deletions docs/tasks/python.md
@@ -12,7 +12,7 @@ There are two models for specifying python tasks in SAYN: a simple way through u
You can define `python` tasks in SAYN very simply by using decorators. This will let you write a Python function and turn that function into a task. First, you need to add a group in `project.yaml` pointing to the `.py` file where the task code lives:

!!! example "project.yaml"
```
```yaml
groups:
decorator_tasks:
type: python
Expand All @@ -24,7 +24,7 @@ You can define `python` tasks in SAYN very simply by using decorators. This will
Now all tasks defined in `python/decorator_tasks.py` will be added to the DAG. The `module` property expects a python path relative to the `python` folder, similar to how you would import a module in python. For example, if our task definition lives in `python/example_mod/decorator_tasks.py`, the value of `module` would be `example_mod.decorator_tasks`.

!!! example "python/decorator_tasks.py"
```
```python
from sayn import task

@task(outputs='logs.api_table', sources='logs.another_table')
@@ -34,6 +34,11 @@ Now all tasks defined in `python/decorator_tasks.py` will be added to the DAG. T
warehouse.execute(f'CREATE OR REPLACE TABLE {out_table} AS SELECT * from {src_table}')
```

!!! info "Python decorators"
Decorators in python are used to modify the behaviour of a function. They can be a bit daunting when we first encounter them, but for the purposes of SAYN all you need to know is that `@task` turns a standard python function into a SAYN task which can access useful properties via its arguments. There are many resources online describing how decorators work, [for example this one](https://realpython.com/primer-on-python-decorators/).
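For readers curious about the mechanics, a registering decorator of this kind can be sketched generically (this is not SAYN's implementation; the `TASKS` registry and this simplified `task` are hypothetical):

```python
# Generic illustration of what a registering decorator does: it records
# the decorated function in a registry so a framework can discover and
# run it later, together with its declared metadata.
TASKS = {}

def task(outputs=None, sources=None):
    def register(fn):
        TASKS[fn.__name__] = {"run": fn, "outputs": outputs, "sources": sources}
        return fn
    return register

@task(outputs="logs.api_table", sources="logs.another_table")
def example_task():
    return "ran"

print(TASKS["example_task"]["outputs"])  # logs.api_table
print(example_task())                    # ran -- the function still works as normal
```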

The above example showcases the key elements to a python task:

* `task`: we import SAYN's `task` decorator which is used to turn functions into SAYN tasks added to the DAG.
@@ -42,12 +47,7 @@ The above example showcases the key elements to a python task:
* function parameters: arguments to the function have special meaning and so the names need to be respected:
* `context`: is an object granting access to some functionality like project parameters, connections and other functions as seen further down.
* `warehouse`: connection names (`required_credentials` in `project.yaml`) will automatically provide the object of that connection. You can specify any number of connections here.
* param1: the rest of the function arguments are matched against task parameters, these are values defined in the `parameter` property in the group.

!!! info "Python decorators"
Decorators in python are used to modify the behaviour of a function. It can be a bit daunting to understand when we first encounter them but for the purpose of SAYN all you need to know is that `@task` turns a standard python
function into a SAYN task which can assess useful properties via arguments. There are many resources online describing how decorators work,
[for example this](https://realpython.com/primer-on-python-decorators/).
* `param1`: the rest of the function arguments are matched against task parameters, these are values defined in the `parameter` property in the group.

Given the code above, this task will:

@@ -97,10 +97,10 @@ Where `class` is a python path to the Python class implementing the task. This c

In this example:

* We create a new class inheriting from SAYN's PythonTask.
* We create a new class inheriting from SAYN's `PythonTask`.
* We set some dependencies by calling `self.src` and `self.out`.
* We define a setup method to do some sanity checks. This method can be skipped, but it's
useful to check the validity of project parameters or so some initial setup.
useful to check the validity of project parameters or initial setup.
* We define the actual process to execute during `sayn run` with the `run` method.
* Both `setup` and `run` return the task status: `return self.success()` reports success, while `return self.fail()` reports a failure to SAYN. Failing a python task forces child tasks to be skipped.
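The lifecycle described above can be sketched schematically (a stand-in base class, not the real sayn `PythonTask`; only the `success`/`fail` semantics are mirrored here):

```python
# Schematic sketch of the setup/run lifecycle. The stand-in base class
# only mimics the success/fail reporting described above.
class PythonTaskSketch:
    def success(self):
        return "success"

    def fail(self):
        return "fail"

class TaskExample(PythonTaskSketch):
    def setup(self, parameters):
        # sanity-check parameters before running
        if "required_param" not in parameters:
            return self.fail()
        return self.success()

    def run(self):
        # the actual work would happen here
        return self.success()

t = TaskExample()
print(t.setup({}))                     # fail -> child tasks would be skipped
print(t.setup({"required_param": 1}))  # success
print(t.run())                         # success
```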
