Fix: Documentation Fixes (#245)
* Amended documentation and fixed dependencies.
hustic authored Jan 9, 2024
1 parent 678950a commit 8e4dae7
Showing 14 changed files with 109 additions and 98 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -40,10 +40,10 @@ SAYN aims to empower data engineers and analysts through its three core design
SAYN supports Python 3.7 to 3.10.

```bash
$ pip install sayn
$ sayn init test_sayn
$ cd test_sayn
$ sayn run
pip install sayn
sayn init test_sayn
cd test_sayn
sayn run
```

That's it! You have completed your first SAYN run on the example project. Continue with the [Tutorial: Part 1](https://173tech.github.io/sayn/tutorials/tutorial_part1/) which will give you a good overview of SAYN's true power!
2 changes: 2 additions & 0 deletions docs/cli.md
@@ -49,9 +49,11 @@ not produced by the currently filtered tasks. Head over to [database objects](da
Both task filtering and upstream prod arguments can be set using `default_run` in `settings.yaml`. Example:

!!! example "settings.yaml"
```yaml
profiles:
dev:
default_run: -x group:extract
```

With this example, running `sayn run` will exclude the tasks in the group called `extract` by default.

14 changes: 7 additions & 7 deletions docs/database_objects.md
@@ -14,7 +14,7 @@ default schema. SAYN uses the same format to refer to database tables and views,
## Compilation of object names

In a real world scenario we want to write our code in a way that dynamically changes depending on the profile we're running
on (eg: test vs production). This allows for multiple people to collaborate on the same project, wihtout someone's actions
on (eg: test vs production). This allows for multiple people to collaborate on the same project, without someone's actions
affecting the work of others in the team. Let's consider this example task:

!!! example "tasks/core.yaml"
@@ -37,7 +37,7 @@ A way to solve this problem could be to have different databases for each person
database setups, potential data governance issues and increased database costs, as you might need a copy of the data per person
working with it.

In SAYN there's another solution: we express database object names like `schema.table` but the code that's execution in the database
In SAYN there's another solution: we express database object names like `schema.table` but the code that is executed in the database
is transformed according to personal settings. For example, we could have a schema called `analytics_models` where our production lives
and another called `test_models` where we store data produced during development, with table names like `USER_PREFIX_table` rather
than `table` so there's no collision and we minimise data redundancy.
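The translation just described can be sketched as follows (an illustrative stand-in, not SAYN's actual code; `compile_obj_name` and its parameters are hypothetical):

```python
# Illustrative sketch of the name translation described above -- not SAYN's
# actual implementation. Hypothetical settings: a schema override and a
# table prefix, as in the analytics_models / test_models example.
def compile_obj_name(name, schema_override=None, table_prefix=None):
    """Translate a 'schema.table' reference according to personal settings."""
    schema, _, table = name.partition(".")
    if schema_override:
        schema = schema_override
    if table_prefix:
        table = f"{table_prefix}_{table}"
    return f"{schema}.{table}"

# In production nothing changes; in development both parts are rewritten.
print(compile_obj_name("analytics_models.table"))  # production settings
print(compile_obj_name("analytics_models.table",
                       schema_override="test_models",
                       table_prefix="USER_PREFIX"))  # development settings
```

The same code can thus be run unchanged by every team member, with only the personal settings differing.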
@@ -61,7 +61,7 @@ The modifications described above are setup with prefixes, suffixes and override

The above will make every `schema.table` specification compile to `test_schema.up_table`.

Following the example in the previous section, if we want the to call the production schema `analytics_models` we can do so by
Following the example in the previous section, if we want to call the production schema `analytics_models` we can do so by
adding the prefix in the `project.yaml` file:

!!! example "project.yaml"
@@ -167,7 +167,7 @@ model. `context.out` and `self.out` are also available in python tasks and their
Note that calling `src` and `out` in the `run` method of a python task class or in the function code when using a decorator doesn't
affect task dependencies, it simply outputs the translated database object name. The task dependency behaviour in python tasks is done
by either calling `self.src` or `self.out` in the `config` method of the class or by passing these references to the `task` decorator
in the `sources` and `outputs` arguments as seen in this example. For more details head to [the python task section](tasks/python).
in the `sources` and `outputs` arguments as seen in this example. For more details head to [the python task section](tasks/python.md).

## Altering the behaviour of `src`

@@ -221,7 +221,7 @@ actually be:

As you can see, we just need to specify a list of tables in `from_prod` to always read from the production configuration, that is, the
settings shared by all team members as specified in `project.yaml`. To make it easier to use, wildcards (`*`) are accepted, so that we
can specify a whole schema like in the example, but we can also specify a list of tables explicitely instead.
can specify a whole schema like in the example, but we can also specify a list of tables explicitly instead.

`from_prod` can also be specified using environment variables with `export SAYN_FROM_PROD="logs.*"` where the value is a comma
separated list of tables.
@@ -297,9 +297,9 @@ but the code executed for `another_example_model` will be:
FROM test_models.up_example_model
```

Because `example_task` is part of this exeuction and produces the table `models.example_model` reference by `another_example_task`
Because `example_task` is part of this execution and produces the table `models.example_model` referenced by `another_example_task`,
`models.example_model` is translated using the testing settings into `test_models.up_example_model` unlike `logs.raw_table` which
as no task producing it is present in this execution, will be translated into the production name.
(as no task present in this execution is producing it) will be translated into the production name.

With upstream prod it becomes a lot easier to work with your modelling layer without having to duplicate all your upstream tables
for every person in the team or being forced to work with sampled data.
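The rule above can be summarised in a small sketch (hypothetical helper, not SAYN's real API; the `test_models.up_` prefix mirrors the testing settings used in the example):

```python
# Rough sketch of the upstream prod rule: a reference is translated with
# the test settings only when a task in the current execution produces it;
# otherwise the production name is used as-is.
def resolve(reference, producers_in_run):
    to_test = lambda name: "test_models.up_" + name.split(".", 1)[1]
    if reference in producers_in_run:
        return to_test(reference)
    return reference  # production name, untouched

producers = {"models.example_model"}
print(resolve("models.example_model", producers))  # test_models.up_example_model
print(resolve("logs.raw_table", producers))        # logs.raw_table
```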
4 changes: 4 additions & 0 deletions docs/settings/settings_yaml.md
@@ -125,12 +125,16 @@ with modelling tasks) it's useful to automatically filter the tasks that will be
`settings.yaml` or through the environment variable `SAYN_DEFAULT_RUN`:

!!! example "settings.yaml"
```yaml
profile:
dev:
default_run: -x group:extract
```

!!! example ".env.sh"
```bash
export SAYN_DEFAULT_RUN="-x group:extract"
```

So we just add the arguments we would give after `sayn run` or `sayn compile`. Only task selection and
upstream prod are allowed (`-t/--tasks`, `-x/--exclude` and `-u/--upstream-prod`).
47 changes: 24 additions & 23 deletions docs/tasks/autosql.md
@@ -9,7 +9,7 @@ The `autosql` task lets you write a `SELECT` statement and SAYN then automates t
An `autosql` task group is defined as follows:

!!! example "project.yaml"
```
```yaml
...

groups:
Expand Down Expand Up @@ -46,12 +46,12 @@ An `autosql` task is defined by the following attributes:

With `autosql` tasks, you should use the `src` macro in your `SELECT` statements to implicitly create task dependencies.

!!! example `autosql` query
```
SELECT field1
, field2
FROM {{ src('my_table') }} l
```
!!! example "`autosql` query"
```sql
SELECT field1
, field2
FROM {{ src('my_table') }} l
```

By using the `{{ src('my_table') }}` in your `FROM` clause, you are effectively telling SAYN that your task depends on the `my_table` table (or view). As a result, SAYN will look for the task that produces `my_table` and set it as a parent of this `autosql` task automatically.

@@ -63,7 +63,7 @@ By using the `{{ src('my_table') }}` in your `FROM` clause, you are effectively
If you need to amend the configuration (e.g. materialisation) of a specific `autosql` task within a `group`, you can overload the values specified in the YAML group definition. To do this, we simply call `config` from a Jinja tag within the sql file of the task:

!!! example "autosql with config"
```
```sql
{{ config(materialisation='view') }}

SELECT ...
@@ -132,31 +132,32 @@ Autosql tasks accept a `columns` field in the task definition that affects the t

SAYN also lets you customise the `CREATE TABLE` statement if you need more control. This is done with:

* columns: the list of columns including their definitions.
* table_properties: database specific properties that affect table creation (indexes, cluster, sorting, etc.).
* post_hook: SQL statments executed right after the table/view creation.
* `columns`: the list of columns including their definitions.
* `table_properties`: database specific properties that affect table creation (indexes, cluster, sorting, etc.).
* `post_hook`: SQL statements executed right after the table/view creation.

`columns` can define the following attributes:

* name: the column name.
* type: the column type.
* tests: list of keywords that constraint a specific column
- unique: enforces a unique constraint on the column.
- not_null: enforces a non null constraint on the column.
- allowed_values: list allowed values for the column.
* `name`: the column name.
* `type`: the column type.
* `tests`: list of keywords that constrain a specific column:
- `unique`: enforces a unique constraint on the column.
- `not_null`: enforces a non null constraint on the column.
- `allowed_values`: list allowed values for the column.

`table_properties` can define the following attributes (database specific):
* indexes:
* sorting: specify the sorting for the table
* distribution_key: specify the type of distribution.
* partitioning: specify the partitioning model for the table.
* clustering: specify the clustering for the table.

* `indexes`: specify the indexes for the table,
* `sorting`: specify the sorting for the table,
* `distribution_key`: specify the type of distribution,
* `partitioning`: specify the partitioning model for the table,
* `clustering`: specify the clustering for the table.

!!! attention
Each supported database might have specific `table_properties` related to it; see the database-specific pages for further details and examples.

!!! Attention
If the a primary key is defined in both the `columns` and `indexes` DDL entries, the primary key will be set as part of the `CREATE TABLE` statement only.
If a primary key is defined in both the `columns` and `indexes` DDL entries, the primary key will be set as part of the `CREATE TABLE` statement only.

!!! example "autosql with columns"
```yaml
1 change: 1 addition & 0 deletions docs/tasks/copy.md
@@ -149,6 +149,7 @@ specific column types in the final table:
```

In this example we define 2 columns for `task_copy`: `id` and `updated_at`. This will make SAYN:

1. Copy only those 2 columns, disregarding any other columns present at source
2. Infer the type of `id` based on the type of that column at source
3. Enforce the destination table type for `updated_at` to be `TIMESTAMP`
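That behaviour can be sketched in a few lines of Python (purely illustrative; the column names and types are hypothetical):

```python
# Sketch of the column handling just described: copy only the listed
# columns, keeping the source type unless the task definition forces one.
source_types = {"id": "BIGINT", "updated_at": "VARCHAR", "extra": "TEXT"}
task_columns = [{"name": "id"},                               # type inferred from source
                {"name": "updated_at", "type": "TIMESTAMP"}]  # type enforced

destination = {c["name"]: c.get("type") or source_types[c["name"]]
               for c in task_columns}
print(destination)  # {'id': 'BIGINT', 'updated_at': 'TIMESTAMP'} -- 'extra' is dropped
```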
10 changes: 4 additions & 6 deletions docs/tasks/overview.md
@@ -34,14 +34,13 @@ Please see below the available SAYN task types:
Tasks in SAYN are organised into `groups`, which are described in the `project.yaml` file of your project. Task `groups` define a set of tasks which share the same attributes. For example, we can define a group formed of `sql` tasks called `core` like this:

!!! example "project.yaml"
```
```yaml
groups:
core:
type: sql
file_name: "core/*.sql"
materialisation: table
destination:
table: "{{ task.name }}"
destination: "{{ task.name }}"
```

The properties defined in the group tell SAYN how to generate tasks:
@@ -64,14 +63,13 @@ This definition of `groups` in the `project.yaml` file is available for `autosql
## Task Attributes

!!! example "project.yaml"
```
```yaml
groups:
core:
type: sql
file_name: "core/*.sql"
materialisation: table
destination:
table: "{{ task.name }}"
destination: "{{ task.name }}"
```

As you saw in the example above, task attributes can be defined dynamically. This example uses the task name to set the destination, effectively telling the `core` tasks to create their outputs in tables named after the `task`, which for `sql` tasks is the name of the file without the `.sql` extension.
20 changes: 10 additions & 10 deletions docs/tasks/python.md
@@ -12,7 +12,7 @@ There are two models for specifying python tasks in SAYN: a simple way through u
You can define `python` tasks in SAYN very simply by using decorators. This will let you write a Python function and turn that function into a task. First, you need to add a group in `project.yaml` pointing to the `.py` file where the task code lives:

!!! example "project.yaml"
```
```yaml
groups:
decorator_tasks:
type: python
Expand All @@ -24,7 +24,7 @@ You can define `python` tasks in SAYN very simply by using decorators. This will
Now all tasks defined in `python/decorator_tasks.py` will be added to the DAG. The `module` property expects a python path relative to the `python` folder, similar to how you would import a module in python. For example, if our task definition lives in `python/example_mod/decorator_tasks.py`, the value of `module` would be `example_mod.decorator_tasks`.

!!! example "python/decorator_tasks.py"
```
```python
from sayn import task

@task(outputs='logs.api_table', sources='logs.another_table')
@@ -34,6 +34,11 @@ Now all tasks defined in `python/decorator_tasks.py` will be added to the DAG. T
warehouse.execute(f'CREATE OR REPLACE TABLE {out_table} AS SELECT * from {src_table}')
```

!!! info "Python decorators"
Decorators in python are used to modify the behaviour of a function. They can be a bit daunting when we first encounter them, but for the purposes of SAYN all you need to know is that `@task` turns a standard python function into a SAYN task which can access useful properties via its arguments. There are many resources online describing how decorators work, [for example this one](https://realpython.com/primer-on-python-decorators/).
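For readers curious about the mechanics, a registering decorator of this kind can be sketched generically (this is not SAYN's implementation; the `TASKS` registry and this simplified `task` are hypothetical):

```python
# Generic illustration of what a registering decorator does: it records
# the decorated function in a registry so a framework can discover and
# run it later, together with its declared metadata.
TASKS = {}

def task(outputs=None, sources=None):
    def register(fn):
        TASKS[fn.__name__] = {"run": fn, "outputs": outputs, "sources": sources}
        return fn
    return register

@task(outputs="logs.api_table", sources="logs.another_table")
def example_task():
    return "ran"

print(TASKS["example_task"]["outputs"])  # logs.api_table
print(example_task())                    # ran -- the function still works as normal
```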

The above example showcases the key elements to a python task:

* `task`: we import SAYN's `task` decorator which is used to turn functions into SAYN tasks added to the DAG.
@@ -42,12 +47,7 @@ The above example showcases the key elements to a python task:
* function parameters: arguments to the function have special meaning and so the names need to be respected:
* `context`: is an object granting access to some functionality like project parameters, connections and other functions as seen further down.
* `warehouse`: connection names (`required_credentials` in `project.yaml`) will automatically provide the object of that connection. You can specify any number of connections here.
* param1: the rest of the function arguments are matched against task parameters, these are values defined in the `parameter` property in the group.

!!! info "Python decorators"
Decorators in python are used to modify the behaviour of a function. It can be a bit daunting to understand when we first encounter them but for the purpose of SAYN all you need to know is that `@task` turns a standard python
function into a SAYN task which can assess useful properties via arguments. There are many resources online describing how decorators work,
[for example this](https://realpython.com/primer-on-python-decorators/).
* `param1`: the rest of the function arguments are matched against task parameters, these are values defined in the `parameter` property in the group.

Given the code above, this task will:

@@ -97,10 +97,10 @@ Where `class` is a python path to the Python class implementing the task. This c

In this example:

* We create a new class inheriting from SAYN's PythonTask.
* We create a new class inheriting from SAYN's `PythonTask`.
* We set some dependencies by calling `self.src` and `self.out`.
* We define a setup method to do some sanity checks. This method can be skipped, but it's
useful to check the validity of project parameters or so some initial setup.
useful to check the validity of project parameters or initial setup.
* We define the actual process to execute during `sayn run` with the `run` method.
* Both `setup` and `run` return the task status: `return self.success()` reports success, while `return self.fail()` reports a failure to SAYN. Failing a python task forces child tasks to be skipped.
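The lifecycle described above can be sketched schematically (a stand-in base class, not the real sayn `PythonTask`; only the `success`/`fail` semantics are mirrored here):

```python
# Schematic sketch of the setup/run lifecycle. The stand-in base class
# only mimics the success/fail reporting described above.
class PythonTaskSketch:
    def success(self):
        return "success"

    def fail(self):
        return "fail"

class TaskExample(PythonTaskSketch):
    def setup(self, parameters):
        # sanity-check parameters before running
        if "required_param" not in parameters:
            return self.fail()
        return self.success()

    def run(self):
        # the actual work would happen here
        return self.success()

t = TaskExample()
print(t.setup({}))                     # fail -> child tasks would be skipped
print(t.setup({"required_param": 1}))  # success
print(t.run())                         # success
```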
