
[RFC] Configuration & Environment

Romain Dorgueil edited this page Oct 8, 2017 · 5 revisions
Subject: Configuration & Environment
Author: CW Andrews
Created: Sep 7, 2017
Modified: Oct 7, 2017
Target: 0.5
Status: First bits released, draft needs cleanup.

THIS IS A DRAFT

TL;DR

ETL jobs need to be parametrizable by the end user.

  • For the simplest needs, system environment is sufficient and one can read from os.environ (easy to do since 0.5).
  • There may be a need for "validation". For example, a variable may be required (an API key...) or may need to be an int (number of queries...). This is not yet possible but should improve the developer experience (target: future).
  • There should be a way to change the graph topology depending on configuration. For example, the "slack api" branch in the graph may be added only if SLACK_KEY is present.
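
To make the last point concrete, here is a minimal sketch of a job whose shape depends on the environment. It deliberately uses a plain list of callables rather than any bonobo API; `extract`, `notify_slack` and `build_steps` are hypothetical names introduced only for illustration.

```python
import os


def extract():
    """Hypothetical source step that yields a couple of rows."""
    yield "hello"
    yield "world"


def notify_slack(row):
    """Hypothetical stand-in for the "slack api" branch; a real step would call Slack here."""
    return row


def build_steps(environ=os.environ):
    """Return the ordered steps of the job, adding the Slack branch only when SLACK_KEY is set."""
    steps = [extract, str.upper]
    if environ.get("SLACK_KEY"):
        steps.append(notify_slack)
    return steps
```

Passing the mapping as an argument keeps the decision testable without touching the real process environment.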

Environment variables

Runtime configuration should be done using environment variables.

In the future (0.7+), there may be a way to add validation, but not for now.

Order of priority should be, from lowest to highest (the highest one that is set wins):

  • default values
  • --default-env-file values
  • --default-env values
  • system environment values
  • --env-file values
  • --env values
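
The priority list above can be sketched as a layered lookup. This is only an illustration under the assumption that each source has already been loaded into a dict; `resolve_environment` and its parameter names are hypothetical and not part of bonobo.

```python
from collections import ChainMap


def resolve_environment(defaults, default_env_file, default_env,
                        system_env, env_file, env):
    """Merge the six layers; the highest-priority layer that sets a name wins."""
    # ChainMap searches its maps left to right, so the layers are listed
    # from highest priority (--env) down to lowest (default values).
    layers = ChainMap(env, env_file, system_env,
                      default_env, default_env_file, defaults)
    return dict(layers)
```

A name set in several layers resolves to the value from the layer closest to `--env`; a name set in only one layer is passed through unchanged.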

This is the recommended approach for runtime configuration.

Reading a value from the environment is done using the standard os.getenv("VARNAME", default_value).
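
As a small sketch of what this looks like in a job, keep in mind that environment values are always strings, so any conversion or "required" check has to be done by hand for now. The variable names `API_KEY` and `QUERY_COUNT` and the helper `read_settings` are hypothetical examples, not names used by bonobo.

```python
import os


def read_settings():
    """Read two hypothetical settings from the environment."""
    # Required value: there is no built-in validation, so check manually.
    api_key = os.getenv("API_KEY")
    if not api_key:
        raise RuntimeError("API_KEY must be set")

    # Optional value: the default is a string too, converted explicitly to int.
    query_count = int(os.getenv("QUERY_COUNT", "10"))
    return {"api_key": api_key, "query_count": query_count}
```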

There is no way to "validate" those options yet. It is not clear whether this is needed, so for now, let's do nothing.

Resolution order

Target-Version: 0.6?

The first value found is used.

Although we agree that the variables actually used should go from most specific (execution instance) to most general (system environment), I think the wording here needs to be rethought, as it is technically the opposite of how it works. At the same time, I will need to think about how to say it simply, without confusing the reader of the documentation, while still being technically correct.

  1. --env FOO=bar
  2. --env-file _OR_ .env (then in order of appearance if more than one --env-file)

In my mind, any time an argument/option is passed on the command line it should be the one used, as it is the most specific, being set on a per-execution basis. For example, the .env file may have FORCE=false while a different configuration is needed when forcing a job (maybe ignoring whether or not there is a new file and reprocessing something anyway). The more I think about it, the more I think that the .env file should be used by default, but if --env-file is passed then it should be used instead. In other words, --env-file and .env would be mutually exclusive.

  3. system environment
  4. default value (os.getenv(..., default))

Maybe let's act like docker and not use system environment unless one says --env SYSVAR.

I am not totally sure about this, _but_ if we end up doing it then we need a way to set it in the .env / environment files (in my opinion). Ideally, if users are worried about isolation to that extent, why not just have them use bonobo-docker?

Overriding from shell

If you have a bash-like shell, you can override variables directly in the shell.

FOO=bar bonobo run ...

Overriding with arguments

Some shells apparently make it harder to override env from the command line. Let's allow a flag to do that in a shell-agnostic way.

bonobo run -e FOO=bar ...
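
For illustration, a repeatable `-e` / `--env` flag can be collected with argparse roughly as follows. This is a sketch of the general technique, not bonobo's actual command-line code; `parse_env_args` and the `run-sketch` program name are hypothetical.

```python
import argparse


def parse_env_args(argv):
    """Collect repeated -e/--env NAME=VALUE options into a dict (last one wins)."""
    parser = argparse.ArgumentParser(prog="run-sketch")  # hypothetical program name
    parser.add_argument("-e", "--env", action="append", default=[],
                        metavar="NAME=VALUE")
    args = parser.parse_args(argv)

    overrides = {}
    for item in args.env:
        # Split on the first "=" only, so values may themselves contain "=".
        name, sep, value = item.partition("=")
        if not sep or not name:
            parser.error("expected NAME=VALUE, got %r" % item)
        overrides[name] = value
    return overrides
```

Because the options are applied in the order given, passing the same name twice simply keeps the later value.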

Environment file (__future__)

_Not implemented yet._

A .env file should be supported.

  1. Perhaps a .env.pub (public) file, which can safely be included in online git repos as it just contains general settings. I think this might be a good idea in anticipation of users being able to share graphs with each other, or use bonobo for their projects, without having everyone working on the project rewrite the same settings in their private .env files. A .env.loc (local) file, or something similar, would be the private counterpart to .env.pub and might contain individual API keys, environment-specific settings, etc. (keep in mind the file names used here are really just placeholders). I am not totally sure how this would work but definitely think it is worth contemplating.
  2. I think that these two would be used in combination, with the private settings overriding the public ones if/when the same variables are set in both files. I don't think this would be too hard to implement.
  3. Going off of #2: as it stands, the implementation of --env I went with has argparse collect whatever args are passed in as a list, and bonobo.execute just iterates through the list, setting each in turn. For example, bonobo run ... -e MY_NUM=3 -e MY_NUM=5 wouldn't cause an issue; the last one set would simply be the one used for the graph, in this case MY_NUM being 5. Given this, a public and a private env file would just need to be collected and set in order for this to work as proposed. As an example, the vars in .env.pub would be collected into list_1 and those in .env.loc into list_2; then a simple env_vars = list_1 + list_2 would join the lists in the desired order (list_1.extend(list_2) would also work, but it modifies list_1 in place and returns None, so its result must not be assigned), and iterating through the list would have the desired effect. Extending this example, the collected CLI args (--env), having already been collected into a list, would then be added to the env_vars list via env_vars.extend(passed_env_vars).
  • The one issue I foresee with this is that passing vars at runtime via MY_NUM=5 bonobo run ... would no longer work as expected, because there is no way to differentiate MY_NUM set in this manner from any other environment variable at the system level. However, in this instance the simplest solution would be to ask users to use --env to pass args at runtime rather than MY_NUM=5 bonobo run ..., as not only is --env shell agnostic (which has its own benefits), but I simply don't see a real advantage to setting variables via MY_NUM=5 bonobo run ... as opposed to --env.
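
The collect-then-apply idea from item 3 could look roughly like the sketch below. It assumes a very simple file format (one NAME=VALUE per line, with blank lines and "#" comments ignored), which this RFC does not specify; `parse_env_lines` and `merge_env` are hypothetical helpers, and the file names remain placeholders.

```python
def parse_env_lines(lines):
    """Turn NAME=VALUE lines into an ordered list of (name, value) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (assumed syntax)
        name, sep, value = line.partition("=")
        if sep:
            pairs.append((name.strip(), value.strip()))
    return pairs


def merge_env(public_pairs, private_pairs, cli_pairs):
    """Apply public, then private, then --env pairs; later entries override earlier ones."""
    # Concatenation builds a new list and leaves the inputs untouched.
    env_vars = public_pairs + private_pairs + cli_pairs
    resolved = {}
    for name, value in env_vars:
        resolved[name] = value
    return resolved
```

With this ordering, a value in the private file overrides the public one, and anything passed through --env overrides both.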

Documentation

This needs complete documentation.