CS780 Project 4: Safe Exploration in Continuous Action Spaces

This repository contains the runs for the safe exploration experiments described in the original paper (Dalal et al., 2018). The paper proposes a novel architecture for real-world problems in which violating safety-critical constraints is heavily penalized.

The paper uses two reference environments, Ball (n-dimensional) and Spaceship, whose dynamics are modelled by first-order and second-order differential equations, respectively. The approach models the problem as a Constrained Markov Decision Process and focuses on constrained policy optimization: an additional safety layer is built on top of the policy learned with DDPG (Deep Deterministic Policy Gradient). The safety layer avoids constraint violations by correcting the action after every policy query; that is, it solves an optimization problem to find the minimal change to the action such that the safety constraints are met.
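The action correction described above can be sketched as follows. This is a minimal NumPy illustration, not the code from the notebooks: it assumes the linearized constraint model from Dalal et al., in which each constraint is approximated as `c_i(s) + g_i(s)·a <= C_i`, and it assumes at most one constraint is active at a time, which gives a closed-form solution. The function name `safe_action` and its argument names are hypothetical.

```python
import numpy as np

def safe_action(action, g, c, c_max):
    """Return the closest action to `action` that satisfies the linearized constraints.

    Assumes at most one constraint is active (closed-form case).
    action : (d,)   action proposed by the policy
    g      : (k, d) constraint sensitivities, one row per constraint
                    (in practice the output of a trained constraint model)
    c      : (k,)   current constraint values c_i(s)
    c_max  : (k,)   constraint limits C_i
    """
    # Predicted amount by which each constraint would exceed its limit.
    violation = g @ action + c - c_max
    # Lagrange multiplier of each constraint if it were the only active one,
    # clipped at zero so already-safe constraints leave the action unchanged.
    lam = np.maximum(violation / (np.sum(g * g, axis=1) + 1e-8), 0.0)
    # Correct along the sensitivity of the most violated constraint.
    i = int(np.argmax(lam))
    return action - lam[i] * g[i]

# Illustrative values only: one constraint on the first action dimension.
g_demo = np.array([[1.0, 0.0]])
c_demo = np.array([0.5])
limit_demo = np.array([1.0])
corrected = safe_action(np.array([1.0, 0.0]), g_demo, c_demo, limit_demo)
unchanged = safe_action(np.array([0.2, 0.0]), g_demo, c_demo, limit_demo)
```

When the proposed action is already predicted to be safe, all multipliers are zero and the action passes through unchanged; otherwise it is moved just far enough to sit on the constraint boundary.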

Safety-Layer Diagram

Implementations and Experimentations

Our experiments include designing the safety layer from scratch and integrating it with DDPG and Twin Delayed Deep Deterministic Policy Gradient (TD3) on various gym environments, including Ball-1D, Ball-2D, Ball-3D, Spaceship-Arena, Spaceship-Corridor, and Bioreactor. TD3 improves on DDPG by reducing overestimation bias: it trains twin critics and uses the smaller of their two value estimates when forming the learning target. Our experiments also report rewards and cumulative constraint violations for each environment, with customized reward shaping. We performed a comparative analysis showing how a minimal safety layer on top of the deterministic policy model increases the training and evaluation rewards the agent obtains over episodes in each environment, and comes close to producing constraint-free actions.
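To make the twin-critic idea concrete, here is a small sketch of the TD3 learning target (clipped double Q-learning). It is a simplified illustration rather than the implementation in the notebooks: the function name `td3_target` is hypothetical, the next-state critic values are passed in as plain arrays, and TD3's other components (target policy smoothing and delayed actor updates) are omitted.

```python
import numpy as np

def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Compute the shared regression target for both critics.

    reward  : (n,) rewards for a batch of transitions
    done    : (n,) 1.0 where the episode terminated, else 0.0
    q1_next : (n,) first target critic's value at the next state and target action
    q2_next : (n,) second target critic's value at the same inputs
    """
    # Taking the element-wise minimum counters the overestimation
    # that a single critic tends to accumulate.
    q_min = np.minimum(q1_next, q2_next)
    # No bootstrapping past terminal transitions.
    return reward + gamma * (1.0 - done) * q_min

# Illustrative batch of two transitions; the second one is terminal.
targets = td3_target(
    reward=np.array([1.0, 0.0]),
    done=np.array([0.0, 1.0]),
    q1_next=np.array([2.0, 5.0]),
    q2_next=np.array([3.0, 4.0]),
)
```

Both critics are regressed toward the same target, so a single optimistic critic cannot inflate the value estimates on its own.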

The plots obtained with the safety layer for the different environments show that the agent converges to high rewards in far fewer episodes. The action correction comes at the cost of increased wall-clock time, since every action selection requires a forward pass through the trained constraint model to return a safe action. The implementation also achieves zero constraint violations in some of the environments, highlighting the potential of a linear safety approximation in several industrial use cases.

All of the results are compiled as .npy files inside the files link. The script for visualizing the results for all of the environments mentioned above is at Link. Some visualizations and comparisons can be found in Link.
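As a rough guide to working with results stored this way, the sketch below loads a per-episode array from a .npy file, accumulates constraint violations, and smooths a curve with a moving average. The file name, the array layout (one value per episode), and the helper names are assumptions made for illustration; the actual files in the repository may be organized differently.

```python
import tempfile
from pathlib import Path

import numpy as np

def cumulative_violations(path):
    """Load per-episode violation counts and return their running total."""
    per_episode = np.load(path)
    return np.cumsum(per_episode)

def moving_average(values, window):
    """Smooth a 1-D curve with a simple moving average (output is shorter by window - 1)."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Demonstration on synthetic data written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "violations_demo.npy"  # hypothetical file name
    np.save(demo_path, np.array([2, 0, 1, 0, 0]))
    total = cumulative_violations(demo_path)

smoothed = moving_average(np.array([1.0, 2.0, 3.0, 4.0]), window=2)
```

The resulting arrays can be passed directly to any plotting library to compare runs with and without the safety layer.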

Project based resources

All working implementations are in the ./notebooks directory.

Members Github-ID
Rajarshi Dutta @Rajarshi1001
Udvas Basak @IonUdvas
Divyani Gaur @DivyaniGaur

References

  • Dalal, Gal, et al. "Safe exploration in continuous action spaces." arXiv preprint arXiv:1801.08757 (2018).
