A recurrent, multi-process and readable PyTorch implementation of the deep reinforcement learning algorithms A2C and PPO, inspired by 3 repositories.

Features:
- General kinds of observation spaces: tensors and dicts of tensors
- General kinds of action spaces: discrete and continuous
- Recurrent policy with a `--recurrence` argument
- Observation preprocessing
- Reward shaping
- Entropy regularization
- Fast:
  - Multiprocessing for collecting trajectories in multiple environments simultaneously
  - GPU (CUDA) for tensor operations
- Training logs:
  - CSV
  - Tensorboard
- PyTorch 0.4.0
You have to clone the repository and then install the module:

    pip3 install -e torch_rl

To get updates from the code, you just need to do a `git pull`. No need to install the module again.
The module consists of:

- 2 classes, `torch_rl.A2CAlgo` and `torch_rl.PPOAlgo`, for, respectively, the A2C and PPO algorithms
- 2 abstract classes, `torch_rl.ACModel` and `torch_rl.RecurrentACModel`, for, respectively, non-recurrent and recurrent actor-critic models
- 1 class, `torch_rl.DictList`, for making dictionaries of lists batch-friendly
The points that cannot be understood immediately by looking at the definition files of the classes, or at the arguments of `scripts/train.py` listed by the `scripts/train.py --help` command, are detailed below.
`torch_rl.A2CAlgo` and `torch_rl.PPOAlgo` have 2 methods:

- `__init__` that may take, among other parameters:
  - an `acmodel` actor-critic model that is an instance of a class inheriting from one of the two abstract classes `torch_rl.ACModel` or `torch_rl.RecurrentACModel`.
  - a `preprocess_obss` function that transforms a list of observations given by the environment into an object `X`. This object `X` must allow you to retrieve a sublist of preprocessed observations given a list of indexes `indexes` with `X[indexes]`. By default, the observations given by the environment are transformed into a PyTorch tensor.
  - a `reshape_reward` function that takes as parameters, in this order, an observation `obs`, the action `action` of the model, the reward `reward` and the terminal status `done`, and returns a new reward.
  - a `recurrence` number to specify over how many timesteps gradients are backpropagated. This number is only considered if a recurrent model is used and must divide the `num_frames_per_agent` parameter and, for PPO, the `batch_size` parameter.
- `update_parameters` that returns some logs.
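As a rough, hypothetical sketch (not part of the package), a `preprocess_obss` and a `reshape_reward` function following the conventions above could look like this. The `device` keyword argument, the shaping coefficients and the commented-out constructor call (notably the `envs` argument) are illustrative assumptions:

```python
import torch

def preprocess_obss(obss, device=None):
    # Turn a list of observations (assumed here to be flat, array-like values)
    # into a PyTorch tensor, which supports indexing with a list of indexes.
    return torch.tensor(obss, device=device, dtype=torch.float)

def reshape_reward(obs, action, reward, done):
    # Simple shaping example: scale the reward and add a small per-step
    # penalty to encourage shorter episodes.
    return 10 * reward - 0.01

# These functions would then be passed to the algorithm constructor, e.g.
# (illustrative only, the positional `envs` argument is an assumption):
# algo = torch_rl.PPOAlgo(envs, acmodel,
#                         preprocess_obss=preprocess_obss,
#                         reshape_reward=reshape_reward,
#                         recurrence=4)
```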
`torch_rl.ACModel` has 2 abstract methods:

- `__init__` that takes as parameters the `observation_space` and the `action_space` given by the environment.
- `forward` that takes as parameter N preprocessed observations `obs` and returns a PyTorch distribution `dist` and a tensor of values `value`. The tensor of values must be of size N, not N x 1.
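For illustration, here is a hypothetical minimal implementation of `torch_rl.ACModel` for flat tensor observations and a discrete action space. The architecture and the use of Gym-style `observation_space.shape` and `action_space.n` attributes are assumptions for the sake of the example, not requirements of the package:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

import torch_rl

class SimpleACModel(nn.Module, torch_rl.ACModel):
    def __init__(self, observation_space, action_space):
        super().__init__()
        obs_size = int(torch.tensor(observation_space.shape).prod())
        self.actor = nn.Sequential(
            nn.Linear(obs_size, 64), nn.Tanh(),
            nn.Linear(64, action_space.n)
        )
        self.critic = nn.Sequential(
            nn.Linear(obs_size, 64), nn.Tanh(),
            nn.Linear(64, 1)
        )

    def forward(self, obs):
        x = obs.reshape(obs.shape[0], -1)
        dist = Categorical(logits=self.actor(x))  # PyTorch distribution
        value = self.critic(x).squeeze(1)         # tensor of size N, not N x 1
        return dist, value
```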
`torch_rl.RecurrentACModel` has 3 abstract methods:

- `__init__` that takes the same parameters as `torch_rl.ACModel`.
- `forward` that takes the same parameters as `torch_rl.ACModel`, along with a tensor of N memories `memory` of size N x M, where M is the size of a memory. It returns the same things as `torch_rl.ACModel`, plus a tensor of N memories `memory`.
- `memory_size` that returns the size M of a memory.
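A hypothetical recurrent counterpart could use the hidden state of a GRU cell as its memory, so that M equals the hidden size. `memory_size` is shown as a property here; check the abstract class and the example in `model.py` for the exact convention used by the package:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

import torch_rl

class SimpleRecurrentACModel(nn.Module, torch_rl.RecurrentACModel):
    def __init__(self, observation_space, action_space, hidden_size=64):
        super().__init__()
        obs_size = int(torch.tensor(observation_space.shape).prod())
        self.rnn = nn.GRUCell(obs_size, hidden_size)
        self.actor = nn.Linear(hidden_size, action_space.n)
        self.critic = nn.Linear(hidden_size, 1)
        self.hidden_size = hidden_size

    @property
    def memory_size(self):
        # M: the size of one memory vector.
        return self.hidden_size

    def forward(self, obs, memory):
        x = obs.reshape(obs.shape[0], -1)
        memory = self.rnn(x, memory)               # new N x M memory
        dist = Categorical(logits=self.actor(memory))
        value = self.critic(memory).squeeze(1)     # tensor of size N
        return dist, value, memory
```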
For speed purposes, the observations are only preprocessed once. Hence, because of the use of batches in PPO, the preprocessed observations `X` must allow you to retrieve a sublist of preprocessed observations given a list of indexes `indexes` with `X[indexes]`. If your preprocessed observations are a PyTorch tensor, you are already done; if you want your preprocessed observations to be a dictionary of lists or of tensors, you are also already done if you use the `torch_rl.DictList` class as follows:

    >>> d = DictList({"a": [[1, 2], [3, 4]], "b": [[5], [6]]})
    >>> d.a
    [[1, 2], [3, 4]]
    >>> d[0]
    DictList({"a": [1, 2], "b": [5]})

Note: if you use an RNN, you will need to set `batch_first` to `True`.
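As another illustrative sketch (not the package's actual code), a `preprocess_obss` function for dict observations, such as MiniGrid's image plus instruction, could pack each key into a tensor inside a `DictList`. The `"image"` key, the `device` keyword and the top-level `DictList` import are assumptions; the real example is the `ObsPreprocessor.__call__` function of `utils/format.py`:

```python
import torch
from torch_rl import DictList

def preprocess_obss(obss, device=None):
    # Stack the "image" part of each observation into one tensor. The textual
    # instruction would need an extra encoding step, which is omitted here.
    images = torch.tensor([obs["image"] for obs in obss],
                          device=device, dtype=torch.float)
    return DictList({"image": images})
```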
An example of use of the `torch_rl.A2CAlgo` and `torch_rl.PPOAlgo` classes is given in `scripts/train.py`.

An example of implementation of the `torch_rl.RecurrentACModel` abstract class is given in `model.py`.

An example of use of `torch_rl.DictList` and an example of a `preprocess_obss` function are given in the `ObsPreprocessor.__call__` function of `utils/format.py`.
`OMP_NUM_THREADS` affects the number of threads used by MKL. The default value may severely damage your performance. This can be avoided by setting it to 1:

    export OMP_NUM_THREADS=1
For your own purposes, you will probably need to change:

- the model in `model.py`,
- the `ObssPreprocessor.__call__` method in `utils.format`.
Along with the `torch_rl` package is provided a model that:

- has a memory. This can be disabled by setting `use_memory` to `False` in the constructor.
- understands instructions. This can be disabled by setting `use_instr` to `False` in the constructor.
Along with the `torch_rl` package are provided 3 general reinforcement learning scripts:

- `train.py` for training an actor-critic model with A2C or PPO.
- `enjoy.py` for visualizing your trained model acting.
- `evaluate.py` for evaluating the performance of your trained model over X episodes.
These scripts were designed especially for the MiniGrid environments. These environments give the agent an observation containing an image and a textual instruction, and a reward of 1 if it successfully executes the instruction, 0 otherwise. They are used in what follows for illustration purposes.
These scripts assume that you have already installed the `gym` package (with `pip3 install gym` for example). By default, models and logs are stored in the `storage` folder. You can define a different folder in the `TORCH_RL_STORAGE` environment variable.
`scripts/train.py` enables you to load a model, train it with the specified actor-critic algorithm, and save it in the `storage` folder.
2 arguments are required:

- `--algo ALGO`: name of the actor-critic algorithm.
- `--env ENV`: name of the environment to train on.

and a bunch of optional arguments are available, among which:

- `--model MODEL`: name of the model, used for loading and saving it. If not specified, it is the `_`-concatenation of the environment name and the algorithm name.
- `--frames-per-proc FRAMES_PER_PROC`: number of frames per process before updating parameters.
- `--no-instr`: disable the instruction understanding of the original model in `model.py`. If your model is trained on an environment where there is no need to understand instructions, it is advised to disable it for faster training.
- `--no-mem`: disable the memory of the original model in `model.py`. If your model is trained on an environment where there is no need to remember anything, it is advised to disable it for faster training.
- ... (see more using `--help`)
Here is an example command:

    python3 -m scripts.train --algo ppo --env MiniGrid-DoorKey-5x5-v0 --no-instr --no-mem --model DoorKey --save-interval 10

This will print some logs in your terminal, where:
- "U" is for "Update".
- "F" is for the total number of "Frames".
- "FPS" is for "Frames Per Second".
- "D" is for "Duration".
- "rR" is for "reshaped Return" per episode. The 4 following numbers are, in the order, the mean
x̄
, the standard deviationσ
, the minimumm
and the maximumM
of the reshaped return per episode during the update. - "F" is for the number of "Frames" per episode. The 4 following numbers are again, in the order, the mean, the standard deviation, the minimum, the maximum of the number of frames per episode during the update.
- "H" is for "Entropy".
- "V" is for "Value".
- "pL" is for "policy Loss".
- "vL" is for "value Loss".
- "∇" is for the gradient norm.
These logs are also saved in a logging format in `log.log` and in a CSV format in `log.csv` in the `storage` folder.

If you add `--tb` to the command, logs are also plotted in Tensorboard using the `tensorboardX` package, which you can install with `pip3 install tensorboardX`. Then, you just have to execute:

    tensorboard --logdir storage

and the training curves will be displayed in Tensorboard.
`scripts/enjoy.py` enables you to visualize your trained model acting.

2 arguments are required:

- `--env ENV`: name of the environment to act on.
- `--model MODEL`: name of the trained model.

and several optional arguments are available (see more using `--help`).

Here is an example command:

    python3 -m scripts.enjoy --env MiniGrid-DoorKey-5x5-v0 --model DoorKey
In the `MiniGrid-DoorKey-6x6-v0` environment, the agent has to reach the green goal. In particular, it has to learn how to open a locked door.

In the `MiniGrid-GoToDoor-5x5-v0` environment, the agent has to open a door specified by its color. In particular, it has to understand textual instructions.

In the `MiniGrid-RedBlueDoors-6x6-v0` environment, the agent has to open the red door and then the blue door. Because the agent initially faces the blue door, it has to remember whether the red door has been opened.
`scripts/evaluate.py` enables you to evaluate the performance of your trained model on X episodes.

2 arguments are required:

- `--env ENV`: name of the environment to act on.
- `--model MODEL`: name of the trained model.

and several optional arguments are available (see more using `--help`).

By default, the model is tested on 100 episodes with the random seed set to 2, instead of the seed 1 used during training.

Here is an example command:

    python3 -m scripts.evaluate --env MiniGrid-DoorKey-5x5-v0 --model DoorKey

This will print the evaluation in your terminal, where "R" is for "Return" per episode.