[RFC] Configuration & Environment

:Subject: Configuration & Environment
:Authors: CW Andrews & R. Dorgueil
:Created: Sep 7, 2017
:Modified: Oct 7, 2017
:Target: 0.5
:Status: First bits released, draft needs cleanup.

THIS IS A DRAFT

TL;DR

ETL jobs need to be parametrizable by the end user.

For the simplest needs, system environment is sufficient and one can read from os.environ (easy to do since 0.5).
There may be a need for "validation". For example, a variable may be required (an API key...) or may need to be an int (a number of queries...). This is not yet possible, but it would improve the developer experience (target: future).
There should be a way to change the graph topology depending on configuration. For example, the "slack api" branch in the graph may be added only if SLACK_KEY is present.
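
As an illustration of the last point, here is a minimal sketch of configuration-dependent topology. It does not use the bonobo API: `build_chain` and the step names are hypothetical, and a real job would add nodes to a graph rather than strings to a list.

```python
import os


def build_chain(environ=None):
    """Return the ordered step names for a job (hypothetical helper).

    The Slack step is only included when SLACK_KEY is set and non-empty.
    """
    environ = os.environ if environ is None else environ
    chain = ["extract", "transform", "load"]
    # Optional branch: only present when the key is configured.
    if environ.get("SLACK_KEY"):
        chain.append("notify_slack")
    return chain
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to test without touching the real environment.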

Environment variables

Runtime configuration should be done using environment variables.

In the future (0.7+), there may be a way to add validation, but not for now.

Order of priority should be, from lower to higher (higher wins, if set):

1. default values
   - os.getenv("VARNAME", default_value)
2. --default-env-file values (not yet implemented)
   - Specifies a file to read default env values from. Each env var in the file is used only if no corresponding value is already set in the system environment (system environment vars are not overwritten).
3. --default-env values (not yet implemented)
   - Works like #2, but the default NAME=value pairs are passed individually, one per --default-env flag, rather than gathered from a specified file.
4. system environment values
   - Env vars already set at the system level. Note that env vars passed via NAME=value bonobo run ... fall here in the order of priority.
5. --env-file values (not yet implemented)
   - Env vars specified here are set like those in #2, except that these values have priority over those set at the system level.
6. --env values
   - Env vars set using the --env / -e flag work like #3 but take priority over all other env vars.
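
The order of priority above can be sketched as a layered merge. This is only an illustration of the proposed precedence, not the bonobo implementation: `resolve_env` and its parameter names are hypothetical, and each layer is assumed to be a plain dict of already-parsed NAME/value pairs.

```python
def resolve_env(defaults, default_env_file, default_env, system, env_file, env):
    """Merge configuration layers from lowest to highest priority.

    Later layers overwrite earlier ones, so a value given with --env wins
    over everything else, and system values win over all defaults.
    """
    resolved = {}
    for layer in (defaults, default_env_file, default_env, system, env_file, env):
        resolved.update(layer)
    return resolved
```

For example, a variable set both in a default env file and at the system level resolves to the system value, while a variable also passed with --env resolves to the --env value.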

Notes

Environment variables are the way to go for runtime configuration.
Reading a value from the environment is done using the standard os.getenv("VARNAME", default_value).
There is no way to "validate" those options yet. It is not clear whether validation is needed, so for now nothing is done.
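
Until validation exists, a job can check its own variables. The following is a minimal sketch of what "required" and "must be an int" could look like in user code; `require_env` is a hypothetical helper, not part of bonobo.

```python
import os


def require_env(name, cast=str, environ=None):
    """Read a required variable and convert it with `cast` (hypothetical helper).

    Raises KeyError when the variable is missing and ValueError when the
    value cannot be converted.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        raise KeyError(f"Missing required environment variable: {name}")
    try:
        return cast(raw)
    except ValueError as exc:
        raise ValueError(f"{name}={raw!r} is not a valid {cast.__name__}") from exc
```

A job could then call, for instance, `require_env("SLACK_KEY")` or `require_env("MAX_QUERIES", int)` before building its graph (both variable names are placeholders).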

Overriding from shell

If you have a bash-like shell, you can override variables directly on the command line.

FOO=bar bonobo run ...

Overriding with arguments

Some shells apparently make it harder to override env from the command line. Bonobo now includes the --env / -e flag to pass vars in a shell-agnostic way.

bonobo run --env FOO=bar ...
bonobo run -e FOO=bar ...
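
A repeatable --env / -e flag of this kind can be sketched with argparse's "append" action. This is an assumption-laden illustration rather than bonobo's actual code: `parse_env_args` and `apply_env` are hypothetical names, and the parser only knows about this one flag.

```python
import argparse
import os


def parse_env_args(argv):
    """Collect every --env / -e occurrence into a list of NAME=VALUE strings."""
    parser = argparse.ArgumentParser(prog="run-sketch")
    parser.add_argument("--env", "-e", action="append", default=[], metavar="NAME=VALUE")
    return parser.parse_args(argv).env


def apply_env(pairs, environ=None):
    """Set each NAME=VALUE pair in order, so the last occurrence wins."""
    environ = os.environ if environ is None else environ
    for pair in pairs:
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"Expected NAME=VALUE, got {pair!r}")
        environ[name] = value
    return environ
```

Because the pairs are applied in order, repeating a name on the command line is harmless: the last value given is the one that ends up set.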

Environment file (future)

Not implemented yet.

Support for a .env file should be possible.

1. Perhaps a .env.pub (public) file, which can be safely included in online git repos because it contains only general settings. I think this is a good idea in anticipation of users sharing graphs with each other, or using bonobo for their projects, without everyone on the project having to rewrite the same settings in their private .env files. A .env.loc (local) file, or some such, would be the private counterpart to .env.pub and might contain individual API keys, environment-specific settings, etc. (the file names used here are just placeholders). I am not totally sure how this would work, but I think it is worth contemplating.
2. I think these two files would be used in combination, with the private settings overriding the public ones when the same variable is set in both. I don't think this would be too hard to implement.
3. Going off of #2: as it stands, the implementation of --env has argparse collect whatever args are passed in as a list, and bonobo.execute just iterates through the list, setting each in turn. For example, bonobo run ... -e MY_NUM=3 -e MY_NUM=5 wouldn't cause an issue; the last one set is simply the one used for the graph, in this case MY_NUM being 5. Following the same approach, the public and private env files would just need to be collected and set in order. As an example, the vars in .env.pub would be collected into list_1 and those in .env.loc into list_2; then env_vars = list_1 + list_2 would join the lists in the desired order, and iterating through env_vars would have the desired effect. The cli args (--env), having already been collected into a list, would then be added via env_vars.extend(passed_env_vars).
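
The public/private file proposal can be sketched as follows. Everything here is hypothetical: the helper names, the assumption that the files use simple NAME=value lines with # comments, and the placeholder file names .env.pub and .env.loc. The helpers take lines and lists rather than paths so the sketch runs without any files on disk.

```python
def parse_env_lines(lines):
    """Turn NAME=value lines into (name, value) pairs, skipping blanks and # comments."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip():
            pairs.append((name.strip(), value.strip()))
    return pairs


def merge_env(public_pairs, private_pairs, cli_pairs):
    """Join the lists in priority order and let later pairs overwrite earlier ones."""
    env_vars = public_pairs + private_pairs  # .env.pub first, then .env.loc
    env_vars.extend(cli_pairs)               # --env values are applied last
    resolved = {}
    for name, value in env_vars:
        resolved[name] = value
    return resolved
```

With MY_NUM set to 1 in the public file, 3 in the private file and 5 via --env, the merged result holds MY_NUM as "5", while variables defined only in the public file are kept unchanged.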

The one issue I foresee is that passing vars at runtime via MY_NUM=5 bonobo run ... would no longer work as expected, because there is no way to differentiate MY_NUM set in this manner from any other environment variable at the system level. In this instance, the simplest solution would be to ask users to pass vars at runtime with --env rather than MY_NUM=5 bonobo run ...: not only is --env shell-agnostic (which has its own benefits), but I don't see a real advantage to setting variables via MY_NUM=5 bonobo run ... as opposed to --env.

Documentation

This needs complete documentation.

Bonobo ETL - Documentation
