Callback collection, cleanup and bug fixes

Released by @araffin on 12 Mar 22:26

Breaking Changes:

  • evaluate_policy now returns the standard deviation of the reward per episode
    as its second return value (instead of n_steps)

  • evaluate_policy now returns a list of the episode lengths as its second return
    value when return_episode_rewards is set to True (instead of n_steps); see the
    sketch after this list

  • Callbacks are now called after each env.step() for consistency (they used to be
    called every n_steps in algorithms like A2C or PPO2)

  • Removed unused code in a2c/utils.py (calc_entropy_softmax, make_path)

  • Refactoring, including removing files and moving functions:

    • Algorithms no longer import from each other, and common does not import from algorithms.

    • a2c/utils.py removed and split into other files:

      • common/tf_util.py: sample, calc_entropy, mse, avg_norm, total_episode_reward_logger,
        q_explained_variance, gradient_add, check_shape,
        seq_to_batch, batch_to_seq.
      • common/tf_layers.py: conv, linear, lstm, _ln, lnlstm, conv_to_fc, ortho_init.
      • a2c/a2c.py: discount_with_dones.
      • acer/acer_simple.py: get_by_index, EpisodeStats.
      • common/schedules.py: constant, linear_schedule, middle_drop, double_linear_con, double_middle_drop,
        SCHEDULES, Scheduler.
    • trpo_mpi/utils.py functions moved (traj_segment_generator moved to common/runners.py, flatten_lists to common/misc_util.py).

    • ppo2/ppo2.py functions moved (safe_mean to common/math_util.py, constfn and get_schedule_fn to common/schedules.py).

    • sac/policies.py function mlp moved to common/tf_layers.py.

    • sac/sac.py function get_vars removed (replaced with tf_util.get_trainable_vars).

    • deepq/replay_buffer.py renamed to common/buffers.py.
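
The evaluate_policy change above affects any code that unpacked its second return value. Below is a minimal sketch of the new behavior; the choice of algorithm, environment, and timestep counts is illustrative only:

```python
from stable_baselines import PPO2
from stable_baselines.common.evaluation import evaluate_policy

model = PPO2("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=10000)

# The second return value is now the std of the per-episode reward (was n_steps)
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)

# With return_episode_rewards=True, the second return value is now
# the list of episode lengths (was n_steps)
episode_rewards, episode_lengths = evaluate_policy(
    model, model.get_env(), n_eval_episodes=10, return_episode_rewards=True
)
```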

New Features:

  • Parallelized updating and sampling from the replay buffer in DQN. (@flodorner)
  • Docker build script, scripts/build_docker.sh, can push images automatically.
  • Added a callback collection (see the sketch after this list)
  • Added unwrap_vec_normalize and sync_envs_normalization in the vec_env module
    to synchronize two VecNormalize environments
  • Added a seeding method for vectorized environments. (@NeoExtended)
  • Added extend method to store batches of experience in ReplayBuffer. (@solliet)
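
A minimal sketch combining a few of these features. The environment, hyperparameters, and EvalCallback configuration are illustrative, and the explicit sync_envs_normalization call is only there to show the new helper:

```python
import gym

from stable_baselines import PPO2
from stable_baselines.common.callbacks import EvalCallback
from stable_baselines.common.vec_env import (DummyVecEnv, VecNormalize,
                                             sync_envs_normalization)

# Build a vectorized training env and seed it (new VecEnv seeding method)
venv = DummyVecEnv([lambda: gym.make("CartPole-v1")])
venv.seed(0)
train_env = VecNormalize(venv)

# Separate evaluation env whose normalization statistics are frozen
eval_env = VecNormalize(DummyVecEnv([lambda: gym.make("CartPole-v1")]),
                        training=False, norm_reward=False)

# Keep the evaluation wrapper's obs/reward statistics in sync with training
sync_envs_normalization(train_env, eval_env)

# EvalCallback from the new callback collection: evaluate every 1000 steps
eval_callback = EvalCallback(eval_env, eval_freq=1000, n_eval_episodes=5)

model = PPO2("MlpPolicy", train_env, verbose=0)
model.learn(total_timesteps=10000, callback=eval_callback)
```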

Bug Fixes:

  • Fixed Docker images via scripts/build_docker.sh and Dockerfile: the GPU image now contains
    tensorflow-gpu, and both images have stable_baselines installed in developer mode at the
    correct directory for mounting.
  • Fixed the Docker GPU run script, scripts/run_docker_gpu.sh, to work with the new NVIDIA Container Toolkit.
  • Repeated calls to RLModel.learn() now preserve internal counters for some episode
    logging statistics that used to be zeroed at the start of every call.
  • Fixed DummyVecEnv.render for num_envs > 1, which used to print a warning and then
    not render at all. (@shwang)
  • Fixed a bug in PPO2, ACER, A2C, and ACKTR where repeated calls to learn(total_timesteps)
    reset the environment on every call, potentially biasing samples toward early episode
    timesteps; see the sketch after this list. (@shwang)
  • Fixed the above by adding a lazy property, ActorCriticRLModel.runner. Subclasses now reuse
    the lazily-created self.runner instead of initializing a new Runner every time learn()
    is called.
  • Fixed a bug in check_env where it would fail on high-dimensional action spaces
  • Fixed Monitor.close(), which was not calling the parent method
  • Fixed a bug in BaseRLModel when seeding vectorized environments. (@NeoExtended)
  • Fixed num_timesteps computation to be consistent across algorithms (it is now updated
    after env.step()). Only TRPO and PPO1 update it differently (after synchronization)
    because they rely on MPI
  • Fixed bug in TRPO with NaN standardized advantages (@richardwu)
  • Fixed partial minibatch computation in ExpertDataset (@richardwu)
  • Fixed normalization (with VecNormalize) for off-policy algorithms
  • Fixed sync_envs_normalization to sync the reward normalization too
  • Bumped the minimum Gym version (>=0.11)
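
Several of the fixes above concern repeated calls to learn(). A minimal sketch of continued training that relies on them; the algorithm, environment, and timestep counts are illustrative:

```python
from stable_baselines import PPO2

model = PPO2("MlpPolicy", "CartPole-v1", verbose=0)

# First training phase
model.learn(total_timesteps=5000)

# Continue training: the environment is no longer reset between calls, and
# internal episode-logging counters are preserved. Passing
# reset_num_timesteps=False also keeps the num_timesteps counter running.
model.learn(total_timesteps=5000, reset_num_timesteps=False)
```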

Others:

  • Removed redundant return value from a2c.utils::total_episode_reward_logger. (@shwang)
  • Cleanup and refactoring in common/identity_env.py (@shwang)
  • Added a Makefile to simplify common development tasks (building the docs, type checking, running the tests)

Documentation:

  • Added a dedicated page for callbacks
  • Fixed the example for creating a GIF (@KuKuXia)
  • Changed Colab links in the README to point to the notebooks repo
  • Fixed a typo in the Reinforcement Learning Tips and Tricks page. (@mmcenta)