Excerpt

## Table of Contents

Abstract

List of Abbreviations

List of Figures

List of Tables

List of Algorithms

1 Introduction

2 Research Method

2.1 Related Work

2.2 Research Conduction

3 Background

3.1 Reinforcement Learning

3.1.1 Markov Decision Process

3.1.2 Value Functions

3.1.3 Tabular Solution Methods

3.2 Deep Learning

4 Results

4.1 Value-Based Deep Reinforcement Learning

4.1.1 Deep Q-Learning and Deep Q-Networks

4.1.2 Double Q-Learning and Double Q-Network

4.1.3 Prioritized Replay

4.1.4 Dueling Network

4.1.5 Distributional Reinforcement Learning

4.1.6 Rainbow

4.2 Policy-Based Deep Reinforcement Learning

4.2.1 Asynchronous Advantage Actor-Critic

4.2.2 Trust Region Policy Optimization

4.2.3 Deep Deterministic Policy Gradients

4.2.4 Policy Iteration Using Monte Carlo Tree Search

4.2.5 Evolutionary Algorithms

4.3 Performance of the Algorithms

4.3.1 Atari 2600

4.3.2 MuJuCo

4.3.3 Various Measures

5 Discussion

5.1 Exploration vs. Exploitation

5.2 Need for Rewards

5.3 Knowledge Reusability

5.4 Inefficiency

5.5 Multi-Agent Reinforcement Learning

5.6 Model-Based Reinforcement Learning

5.7 Proposed Research Directions

6 Conclusion

References

## Abstract

Reinforcement learning is a learning problem in which an actor has to behave optimally in its environment (Kaelbling et al., 1996, p. 237). Deep learning methods, on the other hand, are a subclass of representation learning, which in turn focuses on extracting the necessary features for the task (e.g. classification or detection) (Lecun et al., 2015, p. 436). As such, they serve as powerful function approximators. The combination of those two paradigm results in deep reinforcement learning. This thesis gives an overview of the recent advancement in the field. The results are divided into two broad research directions: value-based and policy-based approaches.

This research shows several algorithms from those directions and how they perform. Finally, multiple open research questions are addressed and new research directions are proposed.

## Zusammenfassung

”Reinforcement learning” ist ein Lernproblem, in dem ein Agent das Ziel hat, sich opti- mal in seiner Umgebung zu verhalten (Kaelbling et al., 1996, p. 237). ”Deep learning” Methoden hingegen sind in der Lage, eine Funktion zu erlernen, die möglichst präzise das Verhältnis von Input zu Output ihrer Trainingsdaten symbolisiert (Lecun et al., 2015, p. 436). Die Kombination dieser zwei Paradigmen resultiert in ”deep reinforcement learn- ing”. Das Ziel dieser Arbeit ist es, eine Übersicht über ”deep reinforcement learning” zu geben, welche grob in zwei Forschungsrichtungen geteilt werden kann: wertorientierte- und verhaltensorientierte Ansätze.

Diese Arbeit führt mehrere Algorithmen dieser Bereiche auf und vergleicht deren Leis- tung. Zum Schluss werden noch offene Fragen angezeigt und weitere Forschungsrichtun- gen ermittelt.

## List of Abbreviations

illustration not visible in this excerpt

## List of Figures

1 Reinforcement Learning Model

2 Policy Iteration Scheme

3 Deep Neural Network Example

4 Forward and Backward Pass

5 Convolutional Neural Network Example

6 Dueling Network Architecture

7 Distributional Bellman Operator

8 Trust Region Policy Optimization Sampling Schemes

9 Self-Play Reinforcement Learning

10 Monte Carlo Tree Search

11 Image Hard Maze

12 Exploration vs. Exploitation Example

13 Multiplayer Online Battle Arena

14 Stratego Game Field

## List of Tables

1 Monte Carlo Tree Seach Steps

2 Median Normalized Scores

3 Trust Region Policy Optimization Scores

4 Evolutionary Algorithm Scores

5 Short Summary of the Model-Based vs. Model-Free Comparison

6 Short Overview of Covered Research

## List of Algorithms

1 Policy Iteration

2 Value Iteration

3 Deep Q-Learning with Experience Replay

4 Double Q-Learning

5 Double Deep Q-Network

6 Categorical Algorithm

7 Asynchronous Advantage Actor-Critic

8 Trust Region Policy Optimization

9 Deep Deterministic Policy Gradient Algorithm

10 Evolution Strategies

11 Parallelized Evolution Strategies

12 Simple Genetic Algorithm

13 Novelty Search

## 1 Introduction

Reinforcement learning (RL) is defined as a class of problem in which an agent has to behave optimally in its environment (Kaelbling et al., 1996, p. 237). For every action the agent takes it receives a reward that indicates to the agent how useful that action was. The action taken by the agent directly influences the future states the agent will end up in. Hence, the goal of the agent is to pick a series of actions that will result in the highest possible reward (Sutton and Barto, 2017, p. 1-5).

One of the key differences between RL as opposed to supervised learning is the lack of a teacher (Kaelbling et al., 1996, p. 239). Whereas in supervised learning the learner is immediately told the error between the predicted and actual output, in RL the agent has no knowledge whether the taken action was, in fact, optimal.

For tasks in which the agent deals with a great state and action space, some form of generalization is needed (Sutton and Barto, 2017, p. 158). An example here would be the case of a self-driving car. It has many types of input, ranging from images from its many cameras and other sensory data like velocity of the car or the outside temperature. At every time-step the car would *see* its environment slightly different than before. At one point there may be a red car driving next to it and later there could be a car with the same model but with a different color. From the point of view of the agent, these would be entirely different states even though the agent should behave the same in both cases. In other words, to prevent the agent from dealing with overwhelmingly many states, it should generalize from experience.

In the last few years, deep learning (Lecun et al., 2015) has shown great success in terms of generalization and has since been used extensively in modern RL methods. Deep learning (DL) delivers very powerful function approximations that have shown stunning performance in different domains, such as mastering the ancient Chinese board game Go (Silver et al., 2017) or playing the multiplayer online battle arena (MOBA) game of Dota 2 (OpenAI, n.d.)

Combining these two paradigms results in deep reinforcement learning (DRL). Here, RL defines the objective, i.e. what the agent wants to achieve while DL on the other hand provides the mechanism to reach the agent’s goal (Silver, 2016, slide 5).

Across a wide variety of domains in machine learning (ML) there has been an ever increasing pace of research over the last recent years. New algorithms emerge between shortening time intervals while new research questions are revealed. To not lose sight of the greater picture and to keep track with the accelerating research, the need for an all encompassing view increases.

As such, the goal of this thesis is twofold: first, this thesis will provide a clear overview of the most recent achievements in the field of DRL, focusing primarily on the algorithms; and secondly, it will emphasize new research directions for the future. The remainder of this thesis is organized as follows. The next chapter will cover how the research for this thesis was conducted, where the sources for the literature and the way they were reviewed will be shown. On the same note, several other closely related papers that influenced this thesis will be mentioned. Chapter three will lay the foundations for this thesis. Before one can understand the algorithms and methods of recent studies, it is vital to know what those were built upon. Here, the fundamental aspects of RL and DL will be covered. The fourth chapter shows the results. This chapter will discuss the recent achievements and their algorithms and techniques. This structure of this chapter is, in parts, derived from the talk at ICML held by David Silver in 2016 (see section 2.1). Chapter five will focus on existing problems and research gaps. The last chapter, six, will summarize this thesis and give a conclusion.

## 2 Research Method

The first section of this chapter will be dedicated to the related work. Afterwards, the way this research was conducted will be laid out. The sources, and the literature itself, will be described.

### 2.1 Related Work

Giving overviews in research field is usually common practice and occurs periodically. RL is no exception and as such, there have already been several overviews. One of the earliest overviews were given by Kaelbling et al. (1996). Their survey focused on dynamic programming and model-free methods, such as Q-Learning and TD-Learning. A more formal view on the matter was given by Gosavi (2009), ”aimed at *uncovering the mathematical roots* of this science [...]” (Gosavi, 2009).

Kober et al. (2012) gave a survey, where the authors focused on RL in conjunction with robotics. On the other hand, Busoniu et al. (2008) focused on multiagent reinforcement learning in their overview. Their research is referred to in this thesis as well (see section 5.5).

The closest related papers to this thesis are Mousavi et al. (2016), Li (2017) and Arulku- maran et al. (2017). These research papers are temporally very close and have therefore very similar content and structure to each other and this thesis. Whereas these papers go more into the *width* of RL, this thesis focuses on a few relevant papers with more detail.

Other content that is very closely related to this thesis is the talk given by David Silver in 2016 at an ICML conference (Silver, 2016)^{1}. The structure of this thesis is in part derived from his talk and it has greatly influenced this thesis as it served as a great starting point for this thesis.

### 2.2 Research Conduction

As already mentioned, the starting point of this research was the talk by David Silver in 2016 (see footnote). Since he is a researcher at Google’s DeepMind, their website^{2} was used to search for more research papers. The arXiv^{3} database and Google Scholar^{4} were also used to search for more papers, while primarily focusing on the most recent ones. These include research papers as early as 2015 and 2016.

Besides Google’s DeepMind, other big companies’ artificial intelligence (AI) research departments were searched for relevant papers. These include UberAI^{5}, the non-profit organization OpenAI^{6} and Microsoft’s AI research^{7}. To be informed, whenever new research papers were released, the reinforcement learning subreddit^{8} was followed. When the initial number of papers were found, their relevance for this thesis was determined, based on their content, results, connection to prior work and the novelty of their work. The papers that were deemed relevant for this thesis were then thoroughly studied and their content reviewed here.

## 3 Background

This chapter is dedicated to deliver necessary background knowledge in both RL and DL in order to understand subsequent sections. The first half of this chapter will cover some of the fundamental aspects of RL and serve as an introduction into the field. That section’s structure and content are based on Sutton and Barto (2017). The second half will deal with DL.

### 3.1 Reinforcement Learning

A typical RL model is composed of two main components, namely the agent and the environment. The agent perceives the environment and receives an observation of the current state of the environment as an input. Based on what the agent observed, it will output some action that can change the environment. The value of that change is communicated to the agent via a scalar reward. High positive values indicate that the previously chosen action led the agent closer to its goal while negative rewards punish the agent for wrong actions (Kaelbling et al., 1996, p. 237-238). The goal of the agent is to pick a series of actions that will eventually lead to the highest possible rewards. In other words, the agent has to *plan* over a long time period (e.g. the agent could take a series of seemingly *bad* actions, which could after many timesteps lead to a high reward).

illustration not visible in this excerpt

Figure 1: RL model, simplified adaptation from (Arulkumaran et al., 2017, p. 3)

The behavior of the agent is called the agent’s *policy* (Sutton and Barto, 2017, p. 5) and is one of the major components in RL. It can be seen as the mapping from a certain state to a particular action. For example, the policy of an agent in a rock-paper-scissors environment could be to always pick rock.

#### 3.1.1 Markov Decision Process

Whereas the previous section described RL conceptually, this section will introduce Markov decision processes (MDP), which are commonly used to formally describe RL problems. As can be seen in figure 1, the agent receives both an observation and a reward at every timestep. The observation is the representation of the environment’s current state and will be denoted as *S t ∈ S* and the output of the agent, i.e. the action it chooses, will be denoted as *A t ∈ A*. After executing an action, the reward signal *R t* +1 arrives at the agent (Sutton and Barto, 2017, p. 37-38).

Every state in an MDP must satisfy the Markov property, which states:

illustration not visible in this excerpt

The probability of transitioning to the next state given the current state is the same as transitioning to next state given all the previous states. In other words, the current state fully characterizes the history of previous states (Silver, 2015, slide 4). As already mentioned in 3.1, the goal of the agent is to maximize the rewards it gets over time. This goal will be written as

illustration not visible in this excerpt

where *γ ∈* [0 *,* 1] is a discount factor and *k* in the sum is the number of timesteps (Sutton and Barto, 2017, p. 43). The closer *γ* is to 1 the more farsighted the agent becomes. Similarly, if *γ* is close to 0, the agent becomes shortsighted and is only interested in the immediate reward.

The last piece of a MDP is the state transition probability matrix. It describes the probability of transitioning to the next state *s ′* given the current state *s* and the action *a* the agent took (Sutton and Barto, 2017, p. 38). It is defined as

illustration not visible in this excerpt

These components make up a MDP as a 5-tuple *〈 S, A, P, R, γ 〉*, where *S* and *A* are each a set of states and actions respectively, *P* is the transition probability matrix, *R* is the reward signal and *γ ∈* [0 *,* 1] is the discount factor.

Actions are chosen according to the agent’s policy *π*, which is a mapping from states to actions (Sutton and Barto, 2017, p. 45-46) and is defined as

illustration not visible in this excerpt

#### 3.1.2 Value Functions

The state value function is used to define the expected total reward of being in a par- ticular state and following the policy *π*. The state value function *v* will be written as

illustration not visible in this excerpt

where E[ *·* ] is the expectation of the random input variables (Sutton and Barto, 2017,

p. 46). Similarly, the value of taking a certain action in a state and then following the policy afterwards is called the state-action value *q* (Sutton and Barto, 2017, p. 46). It is written as

illustration not visible in this excerpt

These functions can both be estimated from experience. Both the state value and the state-action value function can be rewritten in a recursive manner using equation 3.2, resulting in the Bellman equations for *v* (Sutton and Barto, 2017, p. 46)

illustration not visible in this excerpt

and similarly

illustration not visible in this excerpt

With these functions, it is possible to express the advantage of an action as

illustration not visible in this excerpt

In any MDP there exists atleast one optimal policy [illustration not visible in this excerpt]. Finding [illustration not visible in this excerpt] is the goal of the RL agent. A policy *π* is better than some other policy [illustration not visible in this excerpt] i.e. [illustration not visible in this excerpt], if [illustration not visible in this excerpt]

*S*. Following the optimal policy [illustration not visible in this excerpt] will yield the highest possible reward for every state. From this, it results that the optimal value functions are

illustration not visible in this excerpt

illustration not visible in this excerpt

(Sutton and Barto, 2017, p. 49). Since the value of state *s* is the same as taking the best action and following the optimal policy afterwards, it can be written as

illustration not visible in this excerpt

where the equation above is the Bellman optimality equation for the state value. Simi- larly, the Bellman optimality equation for the state-action value function can be written as

illustration not visible in this excerpt

(Sutton and Barto, 2017, p. 50). Having either the optimal state value or the state-action value function will solve the MDP, because it is possible to derive the optimal policy from them. If the agent knew the optimal state-action value function, for example, then executing the action with the highest *q* value would be the optimal behavior. So far, only certain components of a MDP have been defined and the goal was specified. However, it was not described how to approach the goal of computing the optimal policy in an algorithmic manner. In fact, expressing *v* and *R* as a vector and substituting the expectation for the probability transition matrix, the Bellman equation can be written as:

illustration not visible in this excerpt

where *I* is the identity matrix. However, due to the immense computational cost of *O* (*n* ^{3} ) for matrix inversion for every state *n*, scaling is out of the question (Silver, 2015, slide 22-23).

Using dynamic programming (DP), however, it is possible to arrive at the optimal policy iteratively. This will be covered in the next section.

#### 3.1.3 Tabular Solution Methods

Before presenting the most recent advancements in RL that mostly include some sort of function approximator, it is necessary to view the tabular solution methods first.

Dynamic Programming

If a perfect representation, i.e. model, of the environment is given, then DP can be applied to iteratively solve for the optimal behavior (Sutton and Barto, 2017, p. 57). There are two general approaches to this, which will both be covered in this section:

1. Policy Iteration

2. Value Iteration

Policy iteration consists of two repeating steps, which are policy evaluation (prediction) and policy improvement (Sutton and Barto, 2017, p. 62).

illustration not visible in this excerpt

Figure 2: Policy iteration scheme, adapted from (Sutton and Barto, 2017, p. 68)

During the first step, policy evaluation, the value function *v* for following the current policy is computed. As mentioned above, this could (theoretically) be immediately computed but is very computationally expensive.

Using the Bellman equation, an iterative update can be defined as

illustration not visible in this excerpt

where *R a ss ′* istherewardreceivedwhentransitioningfromstate *s* to *s ′* bytakingaction *a* and *k* is the current evaluation iteration. This produces a sequence of value functions in the form of

illustration not visible in this excerpt

starting with *k* = 0. This update sequence will eventually converge to the current policy that is being evaluated. In other words, by iteratively applying the equation 3.14 the quality of the policy *π* can be measured (Sutton and Barto, 2017, p. 58-59). All that is left is to improve the policy in the next step after its evaluation.

Policy improvement is the next step for the policy iteration scheme (Sutton and Barto, 2017, p. 60-62). The goal of this step is to *somehow* find a better policy than before, i.e. to find a policy *π ′ ≥ π*. After the evaluation step, the value of all states are available. Suppose that a single state *s random ∈ S* is randomly chosen among the available states. The current value of this state, while following the current policy, is given by *v π* (*s random*). Now, suppose that for this single state, the policy is not followed for the first action. Instead of choosing an action like *a* = *π* (*a | s*), the agent now looks at all the possible actions and executes the one with the highest *q* value and *then* follows the original policy. In other words,

illustration not visible in this excerpt

This is never decrease the overall performance. If the agent chooses an action that is for a fact the best action for that state, then (unless it already was the best action) the overall improvement must improve (or remain the same if it was already the best action). This process is referred to as *acting greedily*. If for a single state this greedy action improves (or stays at) the current performance, then naturally, applying it to all states improves the performance even further. If two subsequent policies return the same values for all states, i.e. *π ′* = *π*, then both of them must be *π ∗*. The pseudocode below shows the policy iteration scheme.

illustration not visible in this excerpt

Value Iteration

As mentioned previously, policy iteration consists of policy evaluation and improvement, where the policy evaluation step makes several loops over the entire state space. Value iteration, on the other hand, combines policy evaluation and improvement in one update step (Sutton and Barto, 2017, p. 65-66). Instead of looping over the state space multiple times, each state is updated only once per iteration, while using

illustration not visible in this excerpt

as an update step, where the notation is the same as previously. The value iteration algorithm is shown below.

illustration not visible in this excerpt

Model-Free Tabular Methods

The previous section showed two methods of how to solve finite MDPs using DP. How- ever, there were some very unrealistic assumptions so far. To use DP it is necessary to have a perfect representation of the environment. Additionally, all the possible actions must be available, which is, although difficult, a certainly more realistic assumption. On top of this, however, the state transition probabilities must also be known. These components are rarely available for many of the modern problems and as such, other methods were needed that did not require any model, but rather relied on experi- ence only.

These tabular model-free methods will, however, not be covered in this thesis, as they would require an entire thesis on their own. Since most of the algorithms in section 4 do not require a model of the environment, it was deemed necessary to show the procedure, when such a model was given.

### 3.2 Deep Learning

In this section the main concepts of DL will be covered. For the purposes of this thesis, DL methods can be seen as a form of non-linear function approximator. DL methods are a subclass of representation learning, which in turn focuses on extracting the necessary features for the task (e.g. classification or detection) (Lecun et al., 2015, p. 436). Typically, supervised learning, which this section will be focused on, is used. Here, labeled training data is fed to a non-linear function approximator, such as a neural network. In this context, *” labeled ”* means that along with the data (e.g. pixel values of an image) a target is also given (e.g. the image shows a dog). The network now learns a function that maps the inputs (e.g. the pixels) to the output (e.g. the label). The goal of the network is, given enough training data, to be able to generalize to new, unseen data (Lecun et al., 2015, p. 436-438).

The most widely used model architectures are feedforward (neural) networks (FNN), which are also called multilayer perceptrons (MLP) (Goodfellow et al., 2016, p. 167).

illustration not visible in this excerpt

Figure 3: Deep Neural Network Example, adapted from (Lecun et al., 2015, p. 437). Indices show the corresponding neuron (e.g. *i th* neuron of the input layer).

The architecture of these networks typically consist of multiple layers: the input layer, (multiple) hidden layers and the output layer (Lecun et al., 2015, p. 436-438), where each subsequent hidden layer can be seen as a more abstract representation of the input than the previous layer.

The structure of the network is loosely inspired by nature, particularly the model of human brains (Goodfellow et al., 2016, p. 168). Each layer consists of multiple *neurons* (which are sometimes referred to as *units*) that are connected to the *neurons* of the next layer by *weights* (see figure 3). Each weight holds some numerical value and influences the output. More formally, the mapping from input to output can be described as

illustration not visible in this excerpt

(Goodfellow et al., 2016, p. 167). Throughout the rest of this thesis, whenever a function involves ”; *θ*”, unless specified explicitly, it can be expected to use some function approximator with parameters *θ*. These parameters are most of the time used to represent the values held by the weights.

So far, the components of the FNN were only mentioned. The question remains how these components interact to be able to learn a reasonable function for input-output mapping.

Most FNN are optimized with a technique called *stochastic gradient descent* (SGD), us- ing *backpropagation* to compute the gradient of the cost function (Lecun et al., 2015, p. 437-439). The computation of the gradient involves two steps: a *forward* and a *backward* pass.

illustration not visible in this excerpt

Figure 4: Forward and Backward, adapted from (Lecun et al., 2015, p. 437), some connections are omitted to preserve simplicity; *t l* is the target value, *E* is the error function and the indices refer to the particular *neuron* in the layer.

During the forward pass, the weighted sum of the input layers are calculated. Afterwards, a non-linear function *σ* (*·*) is applied to the weighted sum^{9}. Usually, a rectified linear unit *σ* (*z*) = max(0 *, z*) is used, although other non-linear functions are possible as well (Lecun et al., 2015, p. 437-438). This is shown on the left side of figure 4, where *w ab* are the weights from neurons in layer *a* to layer *b*.

After the forward pass, the cost function is computed. This is usually the squared difference between the output of the output layer and the target value. This can be multiplied by 0 *.* 5 to make differentiating easier. In other words, 0 *.* 5(*y l − t*)^{2}, where *y l* is the output of the last layer. Now that the cost, i.e. the difference from the expected output and the networks output is known, the question is how to adjust the weights such that the output matches more closely the target.

For this, the change in the cost function with respect to all the weights must be calcu- lated. For the specific network in figure 4, it would have the form

illustration not visible in this excerpt

Adjusting the weights in the opposite direction of [illustration not visible in this excerpt] will adjust the weights to minimize the cost function (Lecun et al., 2015, p. 436). The computation of these derivatives of the cost function with respect to the weights is shown in figure 4. Figure 3 is one of many model architectures in DL. Another important one of those models is the *convolutional (neural) network* (CNN), which has shown to be very strong in image classification (Krizhevsky et al., 2012).

**[...]**

^{1} http://techtalks.tv/talks/deep-reinforcement-learning/62360/, accessed on 3.1.2018

^{2} https://deepmind.com/ - Accessed on 4.1.2018

^{3} https://arxiv.org/ - Accessed on 4.1.2018

^{4} https://scholar.google.de/ - Accessed on 4.1.2018

^{5} http://uber.ai/ - Accessed on 4.1.2018

^{6} https://openai.com/ - Accessed on 4.1.2018

^{7} https://www.microsoft.com/en-us/research/lab/microsoft-research-ai/ - Accessed on 4.1.2018

^{8} https://www.reddit.com/r/reinforcementlearning/ - Accessed on 4.1.2018

^{9} Normally, a *bias* is added to this sum (Lecun et al., 2015, p. 437), which will be disregarded here for the sake of simplicity.

- Quote paper
- Artur Sahakjan (Author), 2018, A Review of Recent Advancements in Deep Reinforcement Learning, Munich, GRIN Verlag, https://www.grin.com/document/432230

Comments