Skip to content

Latest commit

 

History

History
executable file
·
197 lines (161 loc) · 8.46 KB

File metadata and controls

executable file
·
197 lines (161 loc) · 8.46 KB

CircleCI

Reward Function Interpretability

Installation

First clone this repository. Install inside your python virtual environment system of choice:

pip install -r requirements.txt
pip install -r requirements-dev.txt  # If dev packages are needed.

Note on the python version: if you use a separate installation of imitation for creating reward models, then make sure that your environment for imitation and for reward_preprocessing use the same version of Python, otherwise you might run into issues when unpickling imitation models for use in reward_preprocessing.

Docker

You can also use docker by building the image and running the scripts from inside the container.

Usage

Getting a Reward Model

Training the Supervised Reward Model

  • Get an expert rl agent
    python train.py \
      --exp_name coinrun \
      --env_name coinrun \
      --num_levels 100000 \
      --distribution_mode hard \
      --param_name hard-500 \
      --num_timesteps 200000000 \
      --num_checkpoints 5 \
      --seed 6033 \
      --random_percent 0 \
      --gpu_device=3
    • Render videos to see how well the expert does:
    python render.py \
      --exp_name render \
      --env_name coinrun \
      --num_levels 100000 \
      --distribution_mode hard \
      --param_name hard-500 \
      --seed 6033 \
      --model_file path/model_200015872.pth 
      --vid_dir video
    • Add a trajectory path to save trajectories that can be used to train reward nets:
    python render.py \
      --exp_name render \
      --env_name coinrun \
      --num_levels 100000 \
      --distribution_mode hard \
      --param_name hard-500 \
      --seed 9073 \
      --model_file path/model_200015872.pth \
      --vid_dir video \
      --traj_path rollouts.npz \
      --noview \
      # num_wpisodes is not actually the number of episodes, but the number of iterations
      # n_steps (usually 256 for most hparams) each.
      --num_episodes=1000

Interpreting a Reward Model

Once you have a reward model, you can use visualize various interpretability methods. Use print_config to get details of the config params of the sacred scripts, as in:

python -m reward_preprocessing.interpret print_config

Code Structure

  • src
    • notebooks: Currently only a WIP port of the rl vision notebook.
    • reward_preprocessing
      • common: Code used across multiple submodules.
      • env: Implemented environments, mostly used for the old reward-preprocessing code.
      • ext: Code that is basically unmodified from other sources. Often this will be subject to a different license than the main repo.
        • However, this is not exclusive: There are other files which rely heavily on other sources and are therefore subject to a different license.
      • policies: RL policies for training experts with train_rl.
      • preprocessing Reward preprocessing / reward shaping code.
      • scripts: All scripts that are not the main scripts of the projects. Helpers and scripts that produce artifacts that are used by the main script. Everything here should either be an executable file or a config for one.
        • helpers: Helper scripts that are bash executables or python scripts that are not full sacred experiments.
      • trainers: Our additions to the suite of reward learning algorithms available in imitation. Currently this contains the trainer for training reward nets with supervised learning.
      • vis: Visualization code for interpreting reward functions.
      • interpret.py: The main script that provides the functionality for this project.
  • tests

Philosophy on Dependencies

Lucent

Unless there is an outright bug in lucent, when extending lucent add the code here and leave the dependency to vanilla lucent (see e.g. reward_preprocessing.vis.objectives).

Imitation

So far we have tried to keep everything as compatible with imitation as possible to the point where this project could (almost) be an extension for imitation. This also gives us imitation's logging and sacred integration for free.

Tests

Run

poetry run pytest

to run all available tests (or just run pytest if the environment is active).

Code style and linting

ci/code_checks.sh contains checks for formatting and linting. We recommend using it as a pre-commit hook:

ln -s ../../ci/code_checks.sh .git/hooks/pre-commit

To automatically change the formatting (rather than just checking it), run

ci/format_code.sh

Alternatively, you may want to configure your editor to run black and isort when saving a file.

Run poetry run pytype (or just pytype inside the poetry shell) to type check the code base. This may take a bit longer than linting, which is why it's not part of the pre-commit hook. It's still checked during CI though.

Content From Old Reward Preprocessing Readme

Running experiments

The experimental pipeline consists of three to four steps:

  1. Create the reward models that should be visualized (either by using imitation to train models, or by using create_models.py to generate ground truth reward functions)
  2. In a non-tabular setting: generate rollouts to use for preprocessing and visualization. This usually means training an expert policy and collecting rollouts from that.
  3. Use preprocess.py to preprocess the reward function.
  4. Use plot_heatmaps.py (for gridworlds) or plot_reward_curves.py (for other environments) to plot the preprocessed rewards.

Below, we describe these in more detail.

Creating reward models

The input to our preprocessing pipeline is an imitation RewardNet, saved as a .pt file. You can create one using any of the reward learning algorithms in imitation, or your own implementation.

In our paper, we also have experiments on shaped versions of the ground truth reward. To simplify the rest of the pipeline, we have implemented these as simple static RewardNets, which can be stored to disk using create_models.py. (This will store all the different versions in results/ground_truth_models.)

Creating rollouts

Tabular environments do not require samples for preprocessing, since we can optimize the interpretability objective over the entire space of transitions. But for non-tabular environments, a transition distribution to optimize over is needed. Specifically, the preprocessing step requires a sequence of imitations TrajectoryWithRew.

The source of these trajectories can be one of several options, as well as a combination of those. But the most important cases are:

  • a pickled file of trajectories, specified as rollouts=[(0, None, "<rolout name>", "path/to/rollouts.pkl")] in the command line arguments
  • rollouts generated on the fly using some policy, specified as rollouts=[(0, "path/to/policy.zip", "<rollout name>)]

Since always generating rollouts on the fly is inefficient, we have the helper script create_rollouts.py that takes some rollout configuration (such as the second one above) and produces a pickled file with the corresponding rollouts, which can then be used during further stages using the first example above.

Preprocessing

The main part of our pipeline is preprocess.py, which takes in a rollouts=... config as just described (not required in a tabular setting) and a model_path=path/to/reward_model.pt config, and then produces preprocessed versions of the input model (also stored as .pt files).

An example command for gridworld would be: python -m reward_preprocessing.preprocess with env.empty_maze_10 model_path=path/to/reward_net.pt save_path=path/to/output_folder

The env.empty_maze_10 part specifies the environment (in this case, using one of the preconfigured options).

Directories

runs/ and results/ are both in .gitignore and are meant to be used to store artifacts. runs/ is used for the Sacred FileObserver, so it contains information about every single run. In contrast, results/ is meant for generated artifacts that are meant to be used by subsequent runs, such as trained models or datasets of transition-reward pairs. It will only contain the versions of these artifacts that are needed (e.g. the best/latest model), whereas runs/ will contain all the artifacts generated by past runs.