Solving the environment requires an average total reward of over 300 over 100 consecutive episodes. Training BipedalWalker is considered a difficult task; in particular, it is very hard to train with DDPG or with single-agent PPO. In this directory we solve the environment in 450 episodes using the PPO algorithm with multiple agents, see Multi-Agent RL or the Baseline doc. For other solutions (based on a single agent) see BipedalWalker-TD3 and BipedalWalker-SAC.
The environment is simulated as a list of 16 gym environments that run in 16 subprocesses, adopted from the OpenAI Baselines code:
num_processes=16
envs = parallelEnv('BipedalWalker-v2', n=num_processes, seed=seed)
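For orientation, here is a minimal usage sketch of the batched interface, assuming parallelEnv follows the OpenAI Baselines SubprocVecEnv convention, where reset() and step() act on all 16 subprocesses at once; the random actions are only for illustration and are replaced by the agent's policy during training:
import numpy as np

obs = envs.reset()                                        # array of shape (16, 24): one BipedalWalker observation per subprocess
for t in range(100):
    actions = np.random.uniform(-1.0, 1.0, size=(16, 4))  # placeholder actions, one 4-dimensional action per environment
    obs, rewards, dones, infos = envs.step(actions)       # all 16 environments advance one step in parallel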
The agent uses the following hyperparameters:
gamma=0.99     # discount factor
epoch = 16     # number of epochs in the PPO update mechanism
mini_batch=16  # the optimizer and backward pass run after sampling BATCH elements
lr = 0.001     # learning rate
eps=0.2        # clipping parameter used in the calculation of the action loss
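As a rough illustration of how epoch and mini_batch enter the update loop (policy, rollouts, sample_minibatches and ppo_loss are hypothetical placeholders, not the actual names used in this repository):
import torch.optim as optim

optimizer = optim.Adam(policy.parameters(), lr=lr)          # policy is a hypothetical actor-critic network

for _ in range(epoch):                                      # re-use the same collected rollout for several epochs
    for batch in sample_minibatches(rollouts, mini_batch):  # hypothetical sampler yielding mini-batches of the rollout
        loss = ppo_loss(batch)                              # clipped-surrogate loss (a sketch is shown further below)
        optimizer.zero_grad()
        loss.backward()                                     # backward pass per mini-batch
        optimizer.step()                                    # one optimizer step per mini-batch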
Standard policy gradient methods perform one gradient update per data sample. The original paper proposed a novel objective function that enables multiple epochs of minibatch updates. This is the loss function L_t(\theta), which is (approximately) maximized at each iteration:
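In the notation of the original PPO paper (Schulman et al., 2017), this combined objective is

L_t^{CLIP+VF+S}(\theta) = \hat{E}_t [ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) ],

where L_t^{CLIP}(\theta) = \hat{E}_t [ \min( r_t(\theta) \hat{A}_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t ) ] is the clipped surrogate objective with probability ratio r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t), L_t^{VF}(\theta) = (V_\theta(s_t) - V_t^{targ})^2 is the squared value error, and S[\pi_\theta](s_t) is an entropy bonus that encourages exploration.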
Parameters c1, c2 and epoch are essential hyperparameters of the PPO algorithm. In this agent c1 = 0.5 and c2 = 0.01, which gives the following (minimized) loss:
value_loss = (return_batch - values).pow(2)  # squared value (critic) error, L^VF
loss = -torch.min(surr1, surr2) + 0.5 * value_loss - 0.01 * dist_entropy  # negative clipped surrogate + c1 * L^VF - c2 * entropy bonus
The update is performed in the function ppo_agent.update().
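For reference, here is a self-contained sketch of how surr1 and surr2 are typically formed inside such an update; the function name and its arguments are illustrative, not the exact signature used in ppo_agent.update():
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             dist_entropy, eps=0.2, c1=0.5, c2=0.01):
    # probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages                                     # unclipped surrogate
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages  # clipped surrogate
    value_loss = (returns - values).pow(2)                         # squared critic error, L^VF
    # minimized loss: negative clipped surrogate + c1 * value loss - c2 * entropy bonus
    loss = -torch.min(surr1, surr2) + c1 * value_loss - c2 * dist_entropy
    return loss.mean()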
We train the agent to use information from its surroundings to choose the next best action.
The score of 300.5 was achieved in episode 450 after 2 hours 33 minutes of training.
Most of the code is based on the Udacity code and Ilya Kostrikov's code (https://github.com/ikostrikov).