https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8632747&tag=1
-
Gorila architecture (General Reinforcement Learning Architecture)
- a massively distributed version of DQN: parallel actors generate experience, parallel learners compute gradients, and a distributed replay memory and parameter server tie them together
-
Double DQN
- reduces the observed overestimation by learning two value networks that each use the other network for value estimation (a minimal target sketch follows)
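A minimal sketch of the Double DQN target computation, assuming PyTorch and hypothetical `online_net`/`target_net` modules that map a batch of states to per-action Q-values:

```python
import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: the online network selects the next action,
    the target network evaluates it, decoupling selection from evaluation."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
        return rewards + gamma * (1.0 - dones) * next_q
```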
-
-
Dueling DQN
-
uses a network that is split into two streams after the convolutional layers to separately estimate the state-value and the action-advantage functions (see the sketch after this list).
-
The main benefit of this factoring is to generalise learning across actions without imposing any change to the underlying reinforcement learning algorithm.
-
Dueling DQN improves Double DQN and can also be combined with prioritized experience replay
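A minimal sketch of the dueling head, assuming PyTorch; the mean-subtracted advantage is the aggregation that keeps V and A identifiable, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Splits into a state-value stream V(s) and an advantage stream A(s, a)
    after the shared convolutional trunk, then recombines them into Q(s, a)."""
    def __init__(self, feature_dim, num_actions):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, num_actions))

    def forward(self, features):
        v = self.value(features)          # (batch, 1)
        a = self.advantage(features)      # (batch, num_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # subtract mean advantage for identifiability
```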
-
-
Actor-Critic with Experience Replay (ACER)
- implements an efficient trust region policy method that forces updates to not deviate far from a running average of past policies
- It is much more data efficient than A3C
-
[Advantage Actor-Critic (A2C)](Asynchronous Methods for Deep Reinforcement Learning)
- a synchronous variant of A3C
- updates the parameters synchronously in batches and has comparable performance while only maintaining one neural network
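A minimal sketch of the A2C loss on one synchronous batch, assuming PyTorch; `returns` are n-step returns from the parallel environments, and the coefficients are typical defaults rather than values from the paper.

```python
import torch

def a2c_loss(policy_logits, values, actions, returns, value_coef=0.5, entropy_coef=0.01):
    """Single synchronous update: policy gradient weighted by the advantage,
    value regression toward the n-step return, plus an entropy bonus."""
    dist = torch.distributions.Categorical(logits=policy_logits)
    advantages = returns - values.detach()
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    entropy = dist.entropy().mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```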
-
[Actor-Critic using Kronecker-Factored Trust Region (ACKTR)](Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation)
- extends A2C by approximating the natural policy gradient updates for both the actor and the critic
-
Trust Region Policy Optimization (TRPO)
- uses a surrogate objective with theoretical guarantees for monotonic policy improvement; in practice it implements an approximation, the trust region, by constraining network updates with a bound on the KL divergence between the current and the updated policy (see the sketch after this list)
- shows robust and data-efficient performance in Atari games, but has high memory requirements and several restrictions
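A sketch of the two quantities TRPO works with, assuming PyTorch distributions; the actual update (conjugate gradient plus a line search inside the KL bound) is omitted.

```python
import torch

def trpo_surrogate_and_kl(new_log_probs, old_log_probs, advantages, old_dist, new_dist):
    """Surrogate objective E[pi_new(a|s) / pi_old(a|s) * A(s, a)], to be maximized
    subject to mean KL(pi_old || pi_new) <= delta (the trust region)."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    surrogate = (ratio * advantages).mean()
    kl = torch.distributions.kl_divergence(old_dist, new_dist).mean()
    return surrogate, kl
```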
-
IMPALA (Importance Weighted Actor-Learner Architecture)
- an actor-critic method where multiple learners with GPU access share gradients between each other while being synchronously updated from a set of actors
-
-
Distributional DQN (C51)
-
takes a distributional perspective on reinforcement learning by treating the Q-function as an approximate distribution of returns instead of a single approximate expectation for each action, as in the conventional setting.
-
The support of the distribution is divided into a set of so-called atoms, which determines the granularity of the distribution.
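A minimal sketch of the atom support and of how Q-values are recovered as the expectation over atoms, assuming PyTorch; the distributional Bellman projection onto the support is omitted. The 51 atoms and the [-10, 10] range are the C51 defaults.

```python
import torch

num_atoms, v_min, v_max = 51, -10.0, 10.0
atoms = torch.linspace(v_min, v_max, num_atoms)      # fixed support z_1 .. z_51

def q_values(atom_probs):
    """atom_probs: (batch, num_actions, num_atoms), a softmax over atoms per action.
    The Q-value of an action is the expectation of its return distribution."""
    return (atom_probs * atoms).sum(dim=-1)           # (batch, num_actions)
```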
-
-
Evolution Strategies (ES)
- black-box optimization algorithms that rely on parameter exploration through stochastic noise (see the sketch after this list). Running on 720 CPUs for one hour, ES managed to outperform A3C (which ran for 4 days) in 23 out of 51 Atari games
-
- A simple genetic algorithm with a Gaussian noise mutation operator evolves the parameters of a deep neural network and can achieve surprisingly good scores across several Atari games
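A rough sketch of both ideas in NumPy: the ES gradient estimate from Gaussian parameter noise and a plain Gaussian-mutation GA step. `fitness` is a hypothetical function that evaluates a flat parameter vector on the game; hyperparameters are illustrative.

```python
import numpy as np

def es_step(theta, fitness, pop_size=64, sigma=0.1, lr=0.01):
    """Evolution Strategies: perturb the parameters with Gaussian noise and
    weight each noise vector by the (normalized) return it achieved."""
    noise = np.random.randn(pop_size, theta.size)
    returns = np.array([fitness(theta + sigma * eps) for eps in noise])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return theta + lr / (pop_size * sigma) * noise.T @ returns

def ga_mutate(parent, sigma=0.005):
    """Simple genetic algorithm: offspring are the parent's parameters plus Gaussian noise."""
    return parent + sigma * np.random.randn(parent.size)
```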
-
- a slow planning agent based on Monte-Carlo Tree Search was applied offline to generate data for training a CNN via multinomial classification, and the resulting policy was shown to outperform DQN.
-
-
Policy Distillation
-
a method for transferring one or more action policies from Q-networks to an untrained network.
-
The method has multiple advantages: network size can be compressed by up to 15 times without degradation in performance; multiple expert policies can be combined into a single multi-task policy that can outperform the original experts; and finally it can be applied as a real-time, online learning process by continually distilling the best policy to a target network, thus efficiently tracking the evolving Q-learning policy.
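A minimal sketch of a distillation loss in this spirit, assuming PyTorch: the teacher's Q-values are softened with a low temperature and the student policy is trained to match them with a KL loss. The temperature value is illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_q, teacher_q, tau=0.01):
    """KL divergence between the temperature-softened teacher policy and the
    student policy; a low temperature sharpens the teacher's preferred action."""
    teacher_probs = F.softmax(teacher_q / tau, dim=1)
    student_log_probs = F.log_softmax(student_q, dim=1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```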
-
-
Actor-Mimic
- exploits the use of deep reinforcement learning and model compression techniques to train a single policy network that learns how to act in a set of distinct tasks by using the guidance of several expert teachers
-
Hybrid Reward Architecture (HRA)
-
The training objective provides feedback to the agent while the performance objective specifies the target behavior. Often, a single reward function takes both roles, but for some games, the performance objective does not guide the training sufficiently
-
The Hybrid Reward Architecture (HRA) splits the reward function into n different reward functions, each of which is assigned a separate learning agent
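A minimal sketch of the hybrid head, assuming PyTorch: one Q-head per reward component (each trained against its own reward signal), with the summed Q used for action selection.

```python
import torch
import torch.nn as nn

class HRAHead(nn.Module):
    """n separate Q-heads, one per reward component; the behavior policy acts
    greedily on the sum of the component Q-values."""
    def __init__(self, feature_dim, num_actions, num_rewards):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(feature_dim, num_actions) for _ in range(num_rewards)])

    def forward(self, features):
        q_per_reward = torch.stack([head(features) for head in self.heads])  # (n, batch, actions)
        return q_per_reward.sum(dim=0)                                       # aggregated Q for acting
```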
-
Most of the algorithms introduced above fail to learn from the sparse feedback in Montezuma's Revenge. For instance, DQN fails to obtain any reward in this game (receiving a score of 0) and Gorila achieves an average score of just 4.2, whereas a human expert scores 4,367. It is clear that the methods presented so far cannot deal with environments with such sparse rewards.
-
-
Hierarchical DQN (h-DQN)
-
A top-level value function learns a policy over intrinsic goals, and a lower-level function learns a policy over atomic actions to satisfy the given goals.
-
It operates on two temporal scales: the controller learns a policy over atomic actions that satisfy the goals chosen by a higher-level Q-value function, while the meta-controller learns a policy over intrinsic goals (a simplified loop is sketched below).
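A heavily simplified sketch of the two-timescale loop; `env`, `meta_q`, `controller_q`, and `critic` (which detects goal completion and emits the intrinsic reward) are all hypothetical interfaces, not the paper's code.

```python
def h_dqn_episode(env, meta_q, controller_q, critic, epsilon=0.1):
    """Meta-controller picks intrinsic goals on a slow timescale; the controller
    picks atomic actions to satisfy the current goal and learns from intrinsic reward."""
    state, done = env.reset(), False
    while not done:
        goal_state, goal = state, meta_q.epsilon_greedy_goal(state, epsilon)
        extrinsic_return = 0.0
        while not done and not critic.goal_reached(state, goal):
            action = controller_q.epsilon_greedy_action(state, goal, epsilon)
            next_state, reward, done, _ = env.step(action)
            intrinsic = critic.intrinsic_reward(next_state, goal)
            controller_q.store(state, goal, action, intrinsic, next_state, done)
            extrinsic_return += reward
            state = next_state
        meta_q.store(goal_state, goal, extrinsic_return, state, done)
```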
-
-
DQN-CTS (DQN with Context Tree Switching)
- Pseudo-counts have been used to provide intrinsic motivation in the form of exploration bonuses when unexpected pixel configurations are observed and can be derived from CTS density models
- they focus on the problem of exploration in non-tabular reinforcement learning
- they use density models to measure uncertainty and propose an algorithm for deriving a pseudo-count from an arbitrary density model (see the sketch after this list)
- Skip Context Tree Switching: Bellemare et al., 2014
- the paper shows how to generalize this technique to the class of K-skip prediction suffix trees.
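A sketch of the pseudo-count-based exploration bonus, assuming a hypothetical `density_model` with `prob` and `update` methods (e.g. a CTS or PixelCNN model over frames); the formulas follow the pseudo-count construction.

```python
import math

def exploration_bonus(density_model, state, beta=0.05):
    """rho is the model's probability of `state` before updating on it and
    rho_prime the probability after the update (the recoding probability);
    the pseudo-count is N(s) = rho * (1 - rho_prime) / (rho_prime - rho)."""
    rho = density_model.prob(state)
    density_model.update(state)
    rho_prime = density_model.prob(state)
    pseudo_count = rho * (1.0 - rho_prime) / max(rho_prime - rho, 1e-12)
    return beta / math.sqrt(pseudo_count + 0.01)
```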
-
DQN-PixelCNN
- they combine PixelCNN pseudo-counts with different agent architectures to dramatically improve the state of the art on several hard Atari games
-
- a distributed DQN architecture similar to Gorila
-
Natural Language Guided Reinforcement Learning
-
The agent uses a multi-modal embedding between environment observations and natural language to self-monitor progress through a list of English instructions, granting itself reward for completing instructions in addition to increasing the game score
-
Instructions were linked to positions in rooms and agents were rewarded when they reached those locations
-
-
Language acquisition in a virtual environment
-
how an agent can execute text-based commands in a 2D maze-like environment called XWORLD, such as walking to and picking up objects, after having learned a teacher’s language
-
An RNN-based language module is connected to a CNN-based perception module. These two modules were then connected to an action selection module and a recognition module that learns the teacher’s language in a question answering process.
-
-
Direct Perception
- a CNN learns to map from images to meaningful affordance indicators, such as the car angle and the distance to lane markings, from which a simple controller can make decisions.
- Direct perception was trained on recordings of 12 hours of human driving in TORCS and the trained system was able to drive in very diverse environments. Amazingly, the network was also able to generalize to real images.
-
Deterministic Policy Gradient (DPG)
- directly differentiates the deterministic policy and approximates it with a neural network
-
- demonstrated that a CNN with max-pooling and fully connected layers trained with DQN can achieve human-like behaviors in basic scenarios. In the Visual Doom AI Competition 2016
-
SLAM-Augmented Deep Reinforcement Learning
- Position inference and object mapping from pixels and depth-buffers using Simultaneous Localization and Mapping (SLAM) also improve DQN in Doom
- they address the partial observability of the environment by using the SLAM-generated map to make the agent aware of its current position.
-
Direct Future Prediction (DFP)
-
The architecture used in DFP has three streams: one for the screen pixels, one for lower-dimensional measurements describing the agent’s current state and one for describing the agent’s goal, which is a linear combination of prioritized measurements.
-
DFP collects experiences in a memory and is trained with supervised learning techniques to predict the future measurements based on the current state, goal and selected action
-
During training, actions are selected that yield the best-predicted outcome, based on the current goal. This method can be trained on various goals and generalizes to unseen goals at test time.
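A minimal sketch of the three-stream network and the goal-weighted action choice, assuming PyTorch; the real model uses a convolutional trunk for the pixels and predicts measurements at several temporal offsets, so all sizes here are illustrative.

```python
import torch
import torch.nn as nn

class DFPNet(nn.Module):
    """Fuses the pixel, measurement and goal streams, predicts future measurements
    per action, and ranks actions by the inner product with the goal vector."""
    def __init__(self, obs_feat, meas_dim, num_actions, horizon=6):
        super().__init__()
        self.obs_net = nn.Sequential(nn.Linear(obs_feat, 128), nn.ReLU())   # stand-in for the conv trunk
        self.meas_net = nn.Sequential(nn.Linear(meas_dim, 128), nn.ReLU())
        self.goal_net = nn.Sequential(nn.Linear(meas_dim * horizon, 128), nn.ReLU())
        self.pred = nn.Linear(3 * 128, num_actions * meas_dim * horizon)
        self.num_actions = num_actions

    def forward(self, obs, meas, goal):
        fused = torch.cat([self.obs_net(obs), self.meas_net(meas), self.goal_net(goal)], dim=1)
        preds = self.pred(fused).view(obs.size(0), self.num_actions, -1)  # predicted future measurements
        scores = (preds * goal.unsqueeze(1)).sum(dim=-1)                  # goal-weighted outcome per action
        return preds, scores.argmax(dim=1)                                # predictions and greedy action
```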
-
-
Distral (Distill & transfer learning)
-
Intrinsic Curiosity Module (ICM)
-
curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life.
-
they formulate curiosity as the error in an agent’s ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model.
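A minimal sketch of the curiosity module, assuming PyTorch and flat observations: the inverse model shapes the feature space, and the forward model's prediction error in that space is the intrinsic reward. The MLP encoder and sizes are assumptions.

```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    """Intrinsic reward = forward-model error in a feature space trained with an
    inverse-dynamics objective (predicting a_t from phi(s_t) and phi(s_t+1))."""
    def __init__(self, obs_dim, num_actions, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        self.inverse = nn.Linear(2 * feat_dim, num_actions)
        self.forward_model = nn.Linear(feat_dim + num_actions, feat_dim)
        self.num_actions = num_actions

    def forward(self, obs, next_obs, action):
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        action_onehot = nn.functional.one_hot(action, self.num_actions).float()
        phi_next_pred = self.forward_model(torch.cat([phi, action_onehot], dim=1))
        action_logits = self.inverse(torch.cat([phi, phi_next], dim=1))          # for the inverse loss
        intrinsic = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=1)  # curiosity reward
        return intrinsic, action_logits
```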
-
-
Teacher-Student Curriculum Learning (TSCL)
-
a framework for automatic curriculum learning, where the Student tries to learn a complex task and the Teacher automatically chooses subtasks from a given set for the Student to train on.
-
The framework incorporates a teacher that prioritizes tasks on which the student's performance is either increasing (learning) or decreasing (forgetting).
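A minimal sketch of one teacher variant, assuming NumPy: the slope of each subtask's recent scores serves as the learning-progress signal, and the teacher samples the task with the largest absolute slope (epsilon-greedy).

```python
import random
import numpy as np

class Teacher:
    """Prioritizes subtasks whose score is changing fastest, in either direction,
    so both fast learning and forgetting attract training time."""
    def __init__(self, num_tasks, window=10, epsilon=0.1):
        self.scores = [[] for _ in range(num_tasks)]
        self.window, self.epsilon = window, epsilon

    def choose_task(self):
        if random.random() < self.epsilon or any(len(s) < 2 for s in self.scores):
            return random.randrange(len(self.scores))
        slopes = [np.polyfit(range(len(s[-self.window:])), s[-self.window:], 1)[0]
                  for s in self.scores]
        return int(np.argmax(np.abs(slopes)))

    def report(self, task, score):
        self.scores[task].append(score)
```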
-
-
Real-Time Strategy (RTS) games
-
Players have to control multiple agents simultaneously in real-time on a partially observable map.
-
RTS games have no in-game scoring and thus the reward is determined by who wins the game.
-
Multi-agent credit assignment problem
- States and actions are often described locally relative to units, which are extracted from the game engine. If agents are trained individually, it is difficult to know which agents contributed to the global reward.
So we will look at some MARL (multi-agent RL) approaches
-
Independent Q-Learning (IQL)
- Given the same number of reinforcement learning agents, will cooperative agents outperform independent agents who do not communicate during learning?
- IQL simplifies the multi-agent RL problem by controlling units individually while treating other agents as if they were part of the environment
-
Multiagent Bidirectionally-Coordinated Network (BiCNet)
-
implements a vectorized actor-critic framework based on a bi-directional RNN, with one dimension for every agent, and outputs a sequence of actions
-
BiCNet can handle different types of combat with arbitrary numbers of AI agents on both sides. Their analysis demonstrates that, without any supervision such as human demonstrations or labelled data, BiCNet can learn various types of advanced coordination strategies that are commonly used by experienced game players
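A minimal sketch of the actor side, assuming PyTorch: a bidirectional GRU runs over the agent dimension so every agent's action depends on hidden states flowing from both directions. The critic and the actual training procedure are omitted.

```python
import torch
import torch.nn as nn

class BiCNetActor(nn.Module):
    """Bidirectional RNN over agents: the 'sequence' axis is the set of agents,
    so coordination information flows between teammates in both directions."""
    def __init__(self, obs_dim, hidden_dim, action_dim):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, bidirectional=True)
        self.action_head = nn.Linear(2 * hidden_dim, action_dim)

    def forward(self, per_agent_obs):
        # per_agent_obs: (num_agents, batch, obs_dim)
        hidden, _ = self.rnn(per_agent_obs)     # (num_agents, batch, 2 * hidden_dim)
        return self.action_head(hidden)         # one action vector per agent
```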
-
-
Deep Reinforcement Relevance Net (DRRN)
-
the architecture represents action and state spaces with separate embedding vectors, which are combined with an interaction function to approximate the Q-function in reinforcement learning.
-
This approach has two networks that learn word embeddings. One embeds the state description, the other embeds the action description. Relevance between the two embedding vectors is calculated with an interaction function such as the inner product of the vectors or a bilinear operation.
-
The Relevance is then used as the Q-Value and the whole process is trained end-to-end with Deep Q-Learning. This approach allows the network to generalize to phrases not seen during training which is an improvement for very large text games. The approach was tested on the text games Saving John and Machine of Death, both choice-based games.
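A minimal sketch of the relevance computation, assuming PyTorch and bag-of-words token ids: separate encoders embed the state text and each candidate action text, and their inner product is used as the Q-value. Sizes and the averaging encoder are assumptions.

```python
import torch
import torch.nn as nn

class DRRN(nn.Module):
    """Separate embedding networks for state and action descriptions;
    the interaction function (here an inner product) gives Q(s, a)."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.state_encoder = nn.Sequential(nn.EmbeddingBag(vocab_size, embed_dim), nn.Linear(embed_dim, hidden_dim))
        self.action_encoder = nn.Sequential(nn.EmbeddingBag(vocab_size, embed_dim), nn.Linear(embed_dim, hidden_dim))

    def forward(self, state_tokens, action_tokens):
        # state_tokens: LongTensor (1, state_len); action_tokens: LongTensor (num_actions, action_len)
        s = self.state_encoder(state_tokens)        # (1, hidden_dim)
        a = self.action_encoder(action_tokens)      # (num_actions, hidden_dim)
        return a @ s.squeeze(0)                     # one Q-value per candidate action
```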
-
-
Affordance Extraction via Word Embeddings
- Affordance detection (affordances: the set of behaviors enabled by a situation) is particularly helpful in domains with large action spaces, allowing the agent to prune its search space by avoiding futile behaviors.
- A word embedding is first learned from a Wikipedia corpus via unsupervised learning; this embedding is then used to compute analogies such as "song is to sing as bike is to x", where x can be found in the embedding space (see the sketch after this list)
- The authors build a dictionary of verb-noun pairs and another one of object-manipulation pairs. Using the learned affordances, the model can suggest a small set of actions for a state description. Policies were learned with Q-learning and tested on 50 Z-Machine games.
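A minimal sketch of the analogy computation, assuming NumPy and a hypothetical `embeddings` dict mapping words to vectors (e.g. trained on Wikipedia):

```python
import numpy as np

def analogy(embeddings, a, b, c, candidates):
    """Solve 'a is to b as c is to x': x is the candidate whose vector is most
    similar (cosine) to v_b - v_a + v_c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# e.g. analogy(embeddings, "song", "sing", "bike", vocabulary) would ideally return "ride"
```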
-
- Golovin: an agent that relies exclusively on language models. Using word embeddings, the agent can replace synonyms with known words. Golovin is built of five command generators: General, Movement, Battle, Gather, and Inventory. Commands are generated by analyzing the state description and using the language models to calculate and sample from a number of features for each command. Golovin uses no reinforcement learning and scores comparably to the affordance method.
-
AEN (Action Elimination Network)
- propose the Action-Elimination Deep Q-Network (AE-DQN) architecture that combines a Deep RL algorithm with an Action Elimination Network (AEN) that eliminates sub-optimal actions. The AEN is trained to predict invalid actions, supervised by an external elimination signal provided by the environment.
- In parser-based games, the action space is very large. The AEN learns, while playing, to predict which actions will have no effect for a given state description. The AEN is then used to eliminate most of the available actions for a given state, after which the remaining actions are evaluated with the Q-network (see the sketch below). The whole process is trained end-to-end and achieves performance similar to DQN with a manually constrained action space. Despite the progress made on text adventure games, current techniques are still far from matching human performance
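A minimal sketch of action selection with elimination, assuming PyTorch and hypothetical `q_net` / `elimination_net` modules that map a state to per-action values and invalidity logits; the fixed threshold is a simplification of the paper's elimination rule.

```python
import torch

def act_with_elimination(q_net, elimination_net, state, threshold=0.5):
    """Mask out actions the AEN predicts to be invalid for this state,
    then act greedily with the Q-network over the remaining actions."""
    with torch.no_grad():
        q_values = q_net(state)                               # (1, num_actions)
        invalid_prob = torch.sigmoid(elimination_net(state))  # (1, num_actions)
        masked_q = q_values.masked_fill(invalid_prob > threshold, float("-inf"))
        return int(masked_q.argmax(dim=1))
```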