
Australian National University

A Combination of Deep Learning and Mathematical Models for Addressing the failure of Computer Vision in Reinforcement Learning Models

Submitted by:

Abhishek Singh
ID: u6411540
COMP 8755

Supervisors: Ms. Josephine Plested, Prof. Tom Gedeon
Canberra, June 12, 2020


Acknowledgement

Thanks to God for everything. From God we come and to God we go.

But seek first the kingdom of God and his righteousness, and all these things will be added to you. Matthew 6:33

I am really grateful and thankful to Ms. Josephine Plested for her invaluable guidance and immense help.


Abstract

One of the profound uses of Deep Neural Networks since 2013 has been in teaching an AI agent to play Atari-based games. A less researched problem, however, still lingers: the failure of an agent to transfer knowledge learned in one domain to a related domain. It has been established that an agent trained on a source domain fails to transfer its learning to a related domain when following the standard transfer learning protocols. In this report we propose a technique to enable an agent to focus only on the important dynamics of the environment while ignoring the rest.

Keywords: Reinforcement Learning, Transfer Learning, Deep Convolutional Networks


Contents

1 Introduction
    1.1 Introduction

2 Literature Review

3 Background
    3.1 Reinforcement Learning
        3.1.1 History
        3.1.2 Reinforcement Learning Problem
        3.1.3 Value Function
        3.1.4 Dynamic Programming
        3.1.5 Monte Carlo Methods
        3.1.6 Temporal Difference Methods
    3.2 Neural Networks
        3.2.1 Overview
        3.2.2 Overview of Convolutional Neural Network
        3.2.3 Input Layer
        3.2.4 Convolution Layer
        3.2.5 Pooling Layer
        3.2.6 Fully Connected Layers
        3.2.7 Recurrent Neural Network
        3.2.8 Long Short Term Memory Networks
    3.3 Asynchronous Advantage Actor Critic architecture
        3.3.1 Policy and Value Network
    3.4 Problems in Reinforcement Learning
        3.4.1 Overview
        3.4.2 How an Agent Makes a Decision

4 Methodology
    4.1 OpenAI GYM
    4.2 LSTM in conjunction with Saliency Maps

5 Result and Conclusion

6 Future Work

References


1 Introduction

1.1 Introduction

The use of Deep Convolutional Neural Networks to play Atari games [Volodymyr Mnih, 2013] in 2013 has led to the emergence of a new area of research in Reinforcement Learning (RL). RL applications are now being applied across a spectrum of problems [G. Zheng and Li., 2018; Z. Zhou and Zare., 2017; J. Jin, 2018; J. Kober, 2013].

Reinforcement learning problems involve learning what to do—how to map situations to actions—so as to maximize a numerical reward signal [Richard Sutton, 2014]. There are primarily three ways to solve this problem, viz. Dynamic Programming, Monte Carlo Methods and Temporal Difference Methods [G.Barto, 1995]. While Dynamic Programming and Monte Carlo methods are expensive with respect to memory usage, Temporal Difference methods, which make use of one or another CNN architecture, have lately shown promising results with the successful demonstration of an AI agent playing Atari games and performing better than human players [Volodymyr Mnih, 2013].

Estimation/approximation of the value function of a state is central to all RL techniques. The state in an RL setting is generally described by the pixel representation, and it is parameterised by everything in the game environment at time t.

Intuitively, not all state parameters should be relevant to the game dynamics. For example, in the game of classic Mario, the background colour of the sky (blue) should not interfere with the agent's decision making. It is seen [Shani Gamrian, 2019], however, that an agent trained on one background (let us say blue) fails to transfer when the background is changed (say to green). This can be explained by the fact that all state parameters are part of model training.

In this report we seek to formulate a technique to enable an agent to focus on the relevant state parameters when making decisions. We start with a general overview of the RL problem, then proceed to the network architecture of the base model, i.e. Asynchronous Advantage Actor Critic, used in this paper, and then close with a discussion of the suggested technique to deal with the problem.


2 Literature Review

The particular problem of transferring the learning an agent makes in one domain to a related domain has been primarily tackled by making use of image-to-image translation. With image-to-image translation [Zak Murez, 2016], a network trained on a source domain predicts information about the target domain by translating the target domain into the source domain and then making predictions about the former from the learning previously made on the latter.

This approach has been successfully adopted in the paper [Shani Gamrian, 2019] by making use of Generative Adversarial Networks (GANs) [Goodfellow, 2016], where the GANs are used to map a target image to a source image.

A similar approach to tackle the problem has been used in the paper [Thomas Carr, 2019], where a mapping between the target state and source state is created using an Adversarial AutoEncoder [Alireza Makhzani, 2016], which uses an autoencoder as the generator of the GAN framework.

There exist general frameworks [Matthew E. Taylor, 2009; Felipe Leno da Silva, 2016; André Barreto, 2017; Croonenborghs, 2017] for transfer learning in the context of RL, where much is discussed about transfer learning for a related task, not necessarily a related domain.


3 Background

3.1 Reinforcement Learning

3.1.1 History

The problem of the reinforcement learning agent had previously been solved as a problem of Dynamic Programming [G.Barto, 1995] in a general computer science setting, and has been in existence since as early as the mid 1990s. One positive effect of this development is that if the problem can be framed as a DP problem, it has a theoretical solution. Under the dynamic programming setting, however, the estimation of value functions for different states remained a major problem, because the optimal solution of the Dynamic Programming problem is Bellman's optimality equation, which takes into account all the state-action combinations. So for a two-action, N-state problem, the number of calculations required will be 2N. This has the inherent capability to cause memory issues when the number of states or actions is very large, which is a common occurrence.

With the research and advancement in machine learning techniques such as the Convolutional Neural Network, these problems are now solved differently. However, the fundamental problem of estimating the value function for a state remains as it is. CNNs help in solving the memory issues by approximating the state-value function in every episode and propagating the error after every episode to make adjustments to the weights. This approach has made solving the RL problem viable and has attracted much research in recent times.

We next introduce the reinforcement learning problem and talk about Dynamic Programming, Monte Carlo methods and Temporal Difference methods.

3.1.2 Reinforcement Learning Problem

The reinforcement learning problem is the framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called an agent, which interacts with an environment. The agent interacts with the environment by selecting actions, and the environment responds to the actions by giving the agent a numerical reward, which the agent tries to maximize over time [Richard Sutton, 2014].


Figure 1. Agent-Environment interaction in Reinforcement Learning [Richard Sutton, 2014]

Shown above is an Agent-Environment interaction generating rewards and a transition to the next state. The relevant terms are defined below:

• Agent: the learner and the decision maker.

• Environment: the thing the agent interacts with; it comprises everything outside the agent.

• State s_t: the current state of the environment at time t.

• Action a_t: the action that the agent takes in the current state, which generates a numerical reward and leads to a state transition.

• Transition probabilities: P^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a)

It is important to note here that the transition probabilities are considered to be Markov and depend only on the previous state and action, not on the entire history of states and actions that passed by. Formally, any process which follows

Pr(s_{t+1} = s' | s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, ...) = Pr(s_{t+1} = s' | s_t = s, a_t = a)   (1)

is a Markov Decision Process [Bertsekas, 1995]. This is particularly useful, as we shall see when the value functions are simplified.

The problem of reinforcement learning [Richard Sutton, 2014], then, is to make the agent interact with the environment in such a way as to maximise the total discounted reward earned in an episode. The total discounted reward in an episode is given by:

r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}   (2)


where the reward r_{t+k} corresponds to the reward the agent receives while it transitions from state s_{t+k-1} → s_{t+k}. The choice of γ, in the context of this report, is γ ≤ 1. This choice of γ simply means that a reward r which the agent will get immediately is worth more than if the agent were to get this reward in the future. This particular choice of γ has intuitive significance, as we shall see later. A γ = 0 would mean that the agent is concerned only with the immediate reward, not with the long term total reward. We next see how to formalise the problem further by introducing value functions.
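As a concrete illustration of equation (2) and the role of γ, the following minimal Python sketch computes the discounted return of a finite reward sequence; the function name and the example rewards are illustrative only:

    def discounted_return(rewards, gamma=0.99):
        # Computes sum_{k>=0} gamma^k * r_{t+k+1} by folding backwards over the rewards.
        G = 0.0
        for r in reversed(rewards):
            G = r + gamma * G
        return G

    # Example: a reward of 1 received three steps from now, discounted with gamma = 0.9,
    # is worth 0.9^2 * 1 = 0.81 today.
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81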

3.1.3 Value Function

Consider the following hypothetical game.

Figure 2. A Hypothetical Game

The Agent's objective is to reach the terminal state, represented by a flag. If the agent, however, happens to reach the state represented with fire, the game is lost. The state with diagonal lines is unreachable. Further, the reward for all transitions is 0, except for the two states as depicted.

From the visuals, there are indeed two paths available for the agent to reach the terminal state, as depicted in the figure below:


Figure 3. Two paths available for the agent to reach terminal state

Intuitively, the agent should follow the dotted path. This is because the other path includes a state in which a wrong action, when taken, can make the agent lose the game. This state is shown with two yellow arrows in the figure below:


Figure 4. Possible actions from a state

It can be argued that the state with the blue arrow is safer than the state with two yellow arrows. This is because while the agent is in the state on the blue path, whatever immediate action it takes, it will not land in the state with fire. So the idea is then to teach an agent to follow the dotted path. This can be achieved by assigning a value to the individual state(s) encountered in a path and then selecting the path for which the total sum over all states is maximum. With this principle, the values assigned to the individual cubes/states should be something like:


Figure 5. A probable choice of value function

With this distribution, the agent can then be told to follow the path which gives the maximum total reward. To come up with a value function which will give the value of states in a fashion similar to the one depicted in the above figure, the parameters of the value function need to include a combination of action and reward in some form. The value function which fulfils this criterion is defined to be [Richard Sutton, 2014]:

V^π(s) = E_π(R_t | s_t = s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]   (3)

This equation can be simplified to take the following form:

V^π(s) = Σ_{r,s',a} π(a|s) p(s', r|s, a) [r + γ E_π(G_{t+1} | S_{t+1} = s')]   (4)

The above result¹ can be understood with the knowledge of the Law of Total Expectation.

Note: the term p(s', r|s, a) is a simplification of the term p(s', r | s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, ...) following the Markov assumption discussed in the previous section.

1 The derivation of Bellman's equation consists of a lemma and then using the lemma to arrive at the final results. The lemma was worked out by the author of this report, while the later part of the proof was provided by Mr. Siong Thye Goh.


Lemma 1 (Law of Total Expectation). For any random variables X and Y, the following relationship holds:

E(X) = E_Y( E(X|Y) )   (5)

Proof:

E_Y( E(X|Y) ) = E_Y( Σ_x x · P(X = x | Y) )
             = Σ_y P(Y = y) Σ_x x · P(X = x | Y = y)
             = Σ_x x · Σ_y P(X = x | Y = y) P(Y = y)
             = Σ_x x · P(X = x)
             = E(X)

With the above lemma, we now simplify the value function as follows:


V^π(s) = E_π(R_{t+1} + γ G_{t+1} | S_t = s)   (22)

= Σ_{r,s',a} E_π(R_{t+1} + γ G_{t+1} | R_{t+1} = r, S_{t+1} = s', A = a, S_t = s) · Pr(R_{t+1} = r, S_{t+1} = s', A = a | S_t = s)   (23)

= Σ_{r,s',a} Pr(A = a | S_t = s) · Pr(R_{t+1} = r, S_{t+1} = s' | A = a, S_t = s) · E_π(r + γ G_{t+1} | S_{t+1} = s', A = a, S_t = s)   (24)

= Σ_{r,s',a} π(a|s) p(s', r|s, a) E_π(r + γ G_{t+1} | S_{t+1} = s')   (25)

= Σ_{r,s',a} π(a|s) p(s', r|s, a) [r + γ E_π(G_{t+1} | S_{t+1} = s')]   (26)

Another important and related function is the Q function (section 3.1.4), defined as [Richard Sutton, 2014]:

Q^π(s, a) = E_π(R_t | s_t = s, a_t = a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]   (28)

It can be read as the expected reward earned starting from state s, taking an action a (while in state s) and then following the policy π thereafter.

The subscript π denotes the policy being followed, which is a valid distribution P_π(a|s), understood as the probability of taking an action a while in state s. Further, a policy π' is said to be better than a policy π iff V_{π'}(s) > V_π(s). The optimal value function can then be defined to be [Richard Sutton, 2014]:

V*(s) = max_π V_π(s)   ∀s ∈ S   (29)

Similarly, the optimal Q function can be defined to be [Richard Sutton, 2014]:

Q*(s, a) = max_π Q_π(s, a)   ∀s ∈ S   (30)

The relationship between the two equations can be established from the argument that while in a particular state s, the maximum value of that state (which will be attained under the optimal policy) will be equal to the value of the best action that can be taken in that state.

With the knowledge of the terminology used, we next look at the three popular approaches to solving a Reinforcement Learning problem.


3.1.4 Dynamic Programming

Dynamic programming (DP) problems are a class of optimization problems and are solved under Bellman's Principle of Optimality [Kirk., 1970]. An understanding of the underlying principles of Dynamic Programming is essential to understand the problem of reinforcement learning. Although modern machine learning techniques help to solve the problem, they in no way reframe the underlying theoretical framework of the original problem framed as DP. Further, all reinforcement learning techniques involving any computational tool can be viewed as attempts to achieve the same effect as DP, only with less computation and without assuming a perfect model of the environment [Richard Sutton, 2014]. The objective of a reinforcement learning problem is to find the best policy (which will correspond to the optimal value function), or to find a value function which will earn the agent the maximum reward over the course of an episode (which will correspond to the optimal policy). There are two popular methods in dynamic programming for this purpose, viz. policy iteration and value iteration. Both of these methods depend on policy evaluation, which is explained below.

Policy Evaluation 2

Policy evaluation refers to the problem of solving the following expression for each state:

V^π(s) = E_π[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... | s_t = s ]   (31)
       = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]   (32)
       = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]   (33)

The above expression, for each state, is evaluated by successive approximation. This is done by starting with a random value for each state V(s) and then updating the approximation after time step k by:

V_{k+1}(s) = E_π[ r_{t+1} + γ V_k(s_{t+1}) | s_t = s ]   (34)

So if we roll out the value of a state, let us say s_4, we will get a converging sequence {V_k}, and this way of approximating V is called Iterative Policy Evaluation. Once policy evaluation is done for one policy, the next task that remains is to improve this policy. This is done by making the existing policy greedy at some state. Consider an existing deterministic policy π, which gives the following state-action tuples for the game:

(s_1, a_1) → (s_2, a_2) → (s_3, a_3) → ... → (s_n, a_n)   (36)

If, for example, we change the action a_2 in the state s_2 and rather take an action a'_2, and then follow the same policy π for the remainder of the episode, the trajectory might look as follows:

(s_1, a_1) → (s_2, a'_2) → (s'_3, a'_3) → ... → (s'_n, a'_n)   (37)

2 The theme of this section is derived from [Richard Sutton, 2014], including most of the equations. The theme is then rephrased by the author of this report.


One obvious choice for the action a'_2 can be the action with the highest immediate reward. The value function in such a case is written in the following form:

Q^π(s, a) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a ]   (38)

which means starting in state s, taking an action a and then following the policy π thereafter. If, however,

Q^π(s, π'(s)) ≥ V^π(s),   ∀s ∈ S   (39)

then the policy π' is either an improvement over the existing policy or at least as good as the existing policy. This gives rise to the idea of Policy Improvement. Continuing this way, we can start with a policy π_0 and keep iterating until we obtain a sequence of converging functions as follows:

π_0 →E V^{π_0} →I π_1 →E V^{π_1} →I π_2 → ... →I π* →E V*   (40)

where →E denotes a policy evaluation step and →I denotes a policy improvement step.

Value Iteration 3

The drawback of policy iteration is with respect to time and space complexity: to obtain the optimal policy, the policy iteration method involves a full policy evaluation followed by an improvement step at each iteration. In value iteration, the policy evaluation is stopped after each state has been updated only once, i.e.:

V_{k+1}(s) = max_a E[ r_{t+1} + γ V_k(s_{t+1}) | s_t = s, a_t = a ]   (41)

The sequence {V_k} can be shown to converge [John Schulman, 2017].

Algorithm 1 Value Iteration

Initialize V arbitrarily, e.g. V(s) = 0, ∀s ∈ S+
repeat
    δ ← 0
    for each s ∈ S do
        v ← V(s)
        V(s) ← max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V(s') ]
        δ ← max(δ, |v − V(s)|)
    end
until δ < θ

3 The theme of this section is derived from [Richard Sutton, 2014], including all equations. The theme is then rephrased by the author of this report.
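As a concrete illustration of Algorithm 1, the following minimal Python sketch runs value iteration on a hypothetical two-state, two-action MDP; the transition and reward tables are illustrative assumptions, not taken from the games discussed later.

    def value_iteration(P, R, gamma=0.9, theta=1e-6):
        # P[s][a] is a list of (probability, next_state) pairs, i.e. P^a_{ss'},
        # and R[s][a][s'] is the corresponding reward R^a_{ss'}.
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v = V[s]
                V[s] = max(sum(p * (R[s][a][s2] + gamma * V[s2]) for p, s2 in P[s][a])
                           for a in P[s])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                return V

    # Toy example: from state "A", action "go" reaches state "B" with reward 1.
    P = {"A": {"go": [(1.0, "B")], "stay": [(1.0, "A")]},
         "B": {"go": [(1.0, "B")], "stay": [(1.0, "B")]}}
    R = {"A": {"go": {"B": 1.0}, "stay": {"A": 0.0}},
         "B": {"go": {"B": 0.0}, "stay": {"B": 0.0}}}
    print(value_iteration(P, R))  # V(A) ≈ 1.0, V(B) = 0.0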


3.1.5 Monte Carlo Methods

Monte Carlo methods use repeated random sampling and statistical modelling to estimate mathematical functions and mimic the operation of complex systems [Jarrison, 2010]. In the current scenario, Monte Carlo methods will be used to estimate the state-value function V(s) for a given policy.⁴

The state-value function is the expected cumulative future discounted reward, starting from that state. Intuitively, we can think of approximating the value of an individual state by keeping a record of the future discounted rewards obtained after visiting that state and averaging them. This is indeed what Monte Carlo methods do in principle. This way of approximation is important in reinforcement learning problems, because in many cases the transition probabilities P^a_{ss'} may simply be unknown, along with other things.

With Monte Carlo methods, the game parameters can be evaluated only after the game ends. This is a drawback, however, for non-terminating games. Monte Carlo for policy evaluation can be broken down into:

• Play the game with a policy π.

• For each state s appearing in the episode:
  – Calculate the return R following the occurrence of s.
  – Append the return R to an array of returns.
  – Backtrack the return with a discount factor to calculate the value function V(s).

This will be apparent as we discuss the Actor-Critic methods.
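A compact first-visit Monte Carlo sketch of the steps above, in Python; the episode format, a list of (state, reward) pairs, is an assumption made for illustration:

    from collections import defaultdict
    import numpy as np

    def mc_policy_evaluation(episodes, gamma=0.99):
        # episodes: list of trajectories generated by pi, each a list of (state, reward)
        # pairs, where reward is the reward received after leaving that state.
        returns = defaultdict(list)
        for episode in episodes:
            G = 0.0
            first_visit_return = {}
            for state, reward in reversed(episode):   # backtrack the discounted return
                G = reward + gamma * G
                first_visit_return[state] = G         # overwritten until the first visit remains
            for state, G in first_visit_return.items():
                returns[state].append(G)
        return {state: float(np.mean(g)) for state, g in returns.items()}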

3.1.6 Temporal Difference Methods

Temporal difference (TD) methods are a combination of Dynamic Programming and Monte Carlo methods. Like Monte Carlo, they do not require the parameters of the model. And like dynamic programming, they update estimates based in part on other learned estimates, without waiting for a final outcome [Richard Sutton, 2014].⁵ Discussed below are two main algorithms based on this technique, viz. Q-learning and Actor-Critic, which will form the basis of the subsequent discussion.

4 The theme of this section is derived from [Richard Sutton, 2014]. The theme is then rephrased by the author of this report.

5 The theme of this section is derived from [Richard Sutton, 2014], including most of the equations. The theme is then rephrased by the author of this report.


Q Learning

Q-learning is a model-free algorithm to find the optimal policy for a given reinforcement learning problem [Watkins, 1992]. It does so by calculating the Q values of the states repeatedly until they converge. Once the Q value of every state is calculated, the agent can navigate by following the Q values. The update rule for the Q value is [Richard Sutton, 2014]:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]   (42)

The algorithm [Richard Sutton, 2014] for Q-learning is given by:

Algorithm 2 Q-Learning

Initialise Q(s, a) arbitrarily
for each episode do
    Initialise s
    for each step of the episode, until s is terminal do
        Choose a from s using a policy derived from Q
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
        s ← s'
    end
end
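A minimal tabular sketch of Algorithm 2 in Python; the Gym-style reset()/step() interface and the ε-greedy choice of actions are assumptions made for illustration, and the states are assumed to be hashable (e.g. discrete):

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(float)                      # Q[(state, action)], initialised to 0
        actions = list(range(env.action_space.n))
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < epsilon:       # policy derived from Q (epsilon-greedy)
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda a_: Q[(s, a_)])
                s2, r, done, _ = env.step(a)
                best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # equation (42)
                s = s2
        return Q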


Actor Critic

Actor-critic methods have been used to implement the code for this report. Actor-critic methods have the following architecture:

Figure 6. Actor - Critic Architecture

In this architecture, the policy network is called the actor, as it selects the actions which let the agent interact with the environment, generating rewards. The value network evaluates the value function, which is then compared with the sum of discounted rewards. The error thus produced is used to update the weights of the CNN. Both the policy and the value networks are CNNs and share the same error parameters for back propagation. In this report, we are going to focus on one variant of Actor-Critic called the Asynchronous Advantage Actor-Critic method.

The idea of actor-critic methods stems from policy gradient techniques. With policy gradient methods, the aim is to define a function which relates the policy directly with rewards.


To understand this better, we start by defining a trajectory. Let a trajectory τ be the sequence of states observed in a particular episode. The reward function is then the expected value of the rewards with respect to the trajectories. Mathematically,

J(θ) = Σ_τ P(τ; θ) R(τ)   (43)

Here θ parametrises the policy function. The aim is to find the optimal θ which maximizes the objective function. It can be proved that the policy-reward function defined in the above way has a local maximum [Bhandari, 2019]. Calculating the optimal parameters then becomes:⁶

J(θ) = Σ_τ P(τ; θ) R(τ)   (44)

∇_θ J(θ) = ∇_θ Σ_τ P(τ; θ) R(τ)   (45)
         = Σ_τ ∇_θ P(τ; θ) R(τ)   (46)
         = Σ_τ P(τ; θ) · [∇_θ P(τ; θ) / P(τ; θ)] · R(τ)   (47)
         = Σ_τ P(τ; θ) ∇_θ log P(τ; θ) R(τ)   (48)

We finally use the simplification that the probability of selecting each of the m sampled trajectories is the same (= 1/m) in this case. So we get an intermediate result:

∇_θ J(θ) = (1/m) Σ_{i=1}^{m} ∇_θ log P(τ_i; θ) R(τ_i)   (53)

We can further simplify the logarithmic term with the following argument. The value P(τ; θ) represents the probability of the trajectory, parameterized by θ. The probability of a trajectory being followed is essentially the probability that all the states in the trajectory are visited. This implies:

P(τ; θ) = Π_{t=0}^{T} P(s_{t+1} | s_t, a_t) π_θ(a_t | s_t)   (54)

6 This derivation is based in part on the work of contributors to online mathematics communities and in part on the author's own work.

Taking the logarithm of this product:


∇_θ log P(τ; θ) = ∇_θ log Π_{t=0}^{T} P(s_{t+1} | s_t, a_t) π_θ(a_t | s_t)   (55)
                = ∇_θ [ Σ_{t=0}^{T} log P(s_{t+1} | s_t, a_t) + Σ_{t=0}^{T} log π_θ(a_t | s_t) ]   (56)
                = ∇_θ [ Σ_{t=0}^{T} log π_θ(a_t | s_t) ]   (57)

since the transition probabilities P(s_{t+1} | s_t, a_t) do not depend on θ.

Substituting the above result in the original equation, we get:

∇_θ J(θ) = (1/m) Σ_{i=1}^{m} ∇_θ log P(τ_i; θ) R(τ_i)   (59)
         = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) R(τ_i)   (60)

It might so happen that the cumulative reward over the trajectory sums to 0, in which case the policy gradient would evaluate to 0. To tackle such cases, a baseline is introduced and the equation then becomes:

∇_θ J(θ) = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} ∇_θ log P(τ_i; θ) [R(τ) − b_t]   (62)

Here R(τ) is the immediate reward when action a is taken at time t. If the baseline b_t is taken to be the average reward obtained by following the trajectory τ, then the equation can be rewritten as:

∇_θ J(θ) = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} ∇_θ log P(τ_i; θ) [Q(s_t, a_t) − V_φ(s_t)]   (64)


The terms are interpreted as follows:

• log P(τ; θ): Actor

• [Q(s_t, a_t) − V_φ(s_t)]: Critic

The way Q(s_t, a_t) is calculated gives rise to different types of algorithms, viz.:

• Q Actor-Critic

• Advantage Actor Critic

• TD Actor-critic

• TD(λ) Actor-Critic

• Natural Actor-Critic

This paper uses the Advantage Actor-Critic, where the immediate return R_t is used for Q(s_t, a_t) and the term thus obtained, R_t − V_φ(s_t), is interpreted as the Advantage, hence the name Advantage. The name Actor-Critic comes from the discussion above, and Asynchronous comes from the way the code is written (i.e. asynchronously). Hence the name: Asynchronous Advantage Actor-Critic.

A3C is based on the above skeleton, but here multiple agents interact with the environment asynchronously [V minh, 2016]. The agents are controlled by a global network, and the training is often done with a shared architecture between the Value and Policy functions.


Algorithm 3 Asynchronous advantage actor-critic - pseudocode for each actor-learner thread [V minh, 2016]

repeat
    Reset gradients: dθ ← 0 and dθ_v ← 0
    Synchronize thread-specific parameters θ' = θ and θ'_v = θ_v
    t_start = t
    Get state s_t
    repeat
        Perform a_t according to policy π(a_t | s_t; θ')
        Receive reward r_t and new state s_{t+1}
        t ← t + 1
        T ← T + 1
    until terminal s_t or t − t_start == t_max

    R = 0 for terminal s_t, otherwise R = V(s_t, θ'_v)   // bootstrap from the last state

    for i ∈ {t − 1, ..., t_start} do
        R ← r_i + γR
        Accumulate gradients wrt θ': dθ ← dθ + ∇_{θ'} log π(a_i | s_i; θ') (R − V(s_i; θ'_v))
        Accumulate gradients wrt θ'_v: dθ_v ← dθ_v + ∂(R − V(s_i; θ'_v))² / ∂θ'_v
    end

    Perform asynchronous update of θ using dθ and of θ_v using dθ_v
until T > T_max
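The gradient-accumulation loop in Algorithm 3 can be sketched as follows in Python (a simplified, single-worker illustration assuming PyTorch tensors; log_probs and values are the per-step outputs already collected by the thread, and R is the bootstrap value from the last state):

    def a3c_losses(log_probs, values, rewards, R, gamma=0.99):
        # Walk the rollout backwards, bootstrapping from R as in Algorithm 3, and
        # accumulate the actor (policy) and critic (value) losses.
        policy_loss, value_loss = 0.0, 0.0
        for t in reversed(range(len(rewards))):
            R = rewards[t] + gamma * R
            advantage = R - values[t]
            value_loss = value_loss + advantage.pow(2)              # (R - V(s_t))^2
            policy_loss = policy_loss - log_probs[t] * advantage.detach()
        return policy_loss, value_loss                              # backpropagated into the shared parameters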

The implementation of the above is described in the next section. Before going into its implementation, we dedicate a section to Neural Networks, which will be instrumental in understanding the individual components of A3C.


3.2 Neural Networks

3.2.1 Overview

CNNs are primarily used in this research. Starting with the architecture of a general CNN, we will then explain how CNNs are combined to create an actor-critic model.⁷

3.2.2 Overview of Convolutional Neural Network

Deep Convolutional Neural Networks consist of feature detector units arranged in layers.

Figure 7. Basic Convolution Neural Network Architecture [Odinlmshen, 2018]

Convolutional Neural Networks are a class of deep neural networks which are primarily used in object recognition. Unlike the Artificial Neural Network [Schimidhuber., 2015], where a single weight matrix is applied to the input in one single operation, in a C.N.N. weight/kernel matrices (smaller than the image dimensions) are applied to different subsets of the input, hence the name convolution, because of its similarity with the convolution operation [Bryan., 2008] on two functions f and g in discrete time:

(f ∗ g)(t) = Σ_{τ=−∞}^{∞} f(τ) g(t − τ)   (66)

The idea of using subsets of the input comes from the fact that nearby pixels in an image are similar, and with the convolution operation we can get a quantified value of the shape they represent.

In the context of a C.N.N., the functions f and g can be replaced by the filter and an image subset (or vice versa). Additionally, multiple filters are used, with each filter detecting a particular shape. A small sketch of this operation follows.
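A minimal Python/NumPy sketch of the sliding-window operation described above (stride 1, no padding); the function name is illustrative:

    import numpy as np

    def conv2d_valid(image, kernel):
        # Slide the kernel over every position of the image and take the
        # element-wise product summed over the window (discrete cross-correlation).
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for r in range(out_h):
            for c in range(out_w):
                out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
        return out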

The basic architecture of a C.N.N consists of :

• An Input layer

• A Convolutional layer

• Pooling Layer

• Fully Connected Layer

7 This section is the author's own work (unless otherwise cited) and has been used in a separate report previously.


3.2.3 Input Layer

The input layer is where the image, two-dimensional for a black and white image, is fed into the network. For a coloured input, we use the two-dimensional Red-Green-Blue representation of the image, so for a coloured image we have three two-dimensional images as the input. The number of different representations (Red, Green or Blue) is called the number of "channels", or the depth. For the experiments done in this paper, we have used a:

• 28× 28× 1 image for MNIST([Yann LeCun, 1998])

• 64× 64× 3 image for Tiny ImageNet

where the z axis parameter represents the channel.

3.2.4 Convolution Layer

This layer performs the convolution operation as defined above. The performed convolution spans the depth of the image. The convolution layer serves the purpose of extracting a particular shape from the image [Rikiya Yamashita, 2018]. This, along with the pooling layer, is shown in figure 8 for further discussion.

3.2.5 Pooling Layer

A pooling layer is used to reduce the dimensionality of the data. For example, a 2 × 2 max (for maximum) pooling kernel will select the highest numerical value covered by that 2 × 2 window. The pooling layer gives a single quantified value of a shape. This helps in reducing the amount of memory required to work with the network, and also helps against overfitting. It also serves the purpose of making the model translationally invariant, which means that the down-sampling done with the pooling layer ensures that the mapping of the shape is largely independent of its position, so the network will give the same quantified value to a particular shape irrespective of where it appears [GoodFellow, 2016].
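A tiny NumPy sketch of non-overlapping 2 × 2 max pooling, for illustration:

    import numpy as np

    def max_pool2d(feature_map, k=2):
        # Split the feature map into non-overlapping k x k windows and keep the
        # largest response in each window.
        h, w = feature_map.shape
        h, w = h - h % k, w - w % k            # trim so dimensions divide evenly
        fm = feature_map[:h, :w]
        return fm.reshape(h // k, k, w // k, k).max(axis=(1, 3))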


Figure 8. A convolution operation at the pixel level [SuperDataScience, 2018]

In the above figure, a convolution operation is being performed on the input image (tensor). The filter moves across the pixel matrix by x positions to the right, where x is called the "stride", which has the numerical value one in the current case. After the image is operated on by the filter, it then passes through the pooling layer. In this figure a sum pooling is used, which sums all the elements in the pooling window. Regarding the translational invariance discussed above, two identical shapes will have the same numbers in the matrix, albeit at different entries; when summed, however, they will give the same result.


3.2.6 Fully Connected layers

The output of the last pooling layer is flattened and operated upon by a squashing function like the Rectified Linear Unit [Agarap, 2018] before it is fed to the fully connected layer, which is like an Artificial Neural Network that gives the final probabilities for the input to belong to a particular class/label.

A complete C.N.N. model looks like:

Fig 9. A Sample Convolution Neural Network Architecture [SuperDataScience, 2018]

3.2.7 Recurrent Neural Network

A recurrent neural network (RNN) is a class of artificial neural networks where the connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behaviour. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable-length sequences of inputs [Dupond, 2019].⁸

RNNs were envisaged from how humans learn. For example, suppose there is a text sequence of the form "I can see a beautiful boat in the ...". It is very likely that the next word in the sequence will be "ocean". This is intuitive, because from memory we know that boat and ocean have a high possibility of being related in some way or the other.

An RNN was designed over a traditional neural network with the same principles in mind, so that prediction could be aided by some memory of the history. A basic RNN looks like the following:

8 The work in this section has been inspired by and derived from [Dupond, 2019].

Fig 10. Basic RNN

It can be interpreted as working simply as a network with an output influenced by the current input and the previous output. Usually this influence is achieved by using a hyperbolic function like tanh. The rationale for using tanh over sigmoids and other squashing functions is that the influence over a decision can be positive or negative, and the inherent nature of tanh allows for this, while sigmoids are always positive. Mathematically, tanh is given by:

tanh x = (e^{2x} − 1) / (e^{2x} + 1)   (67)

The input to a single RNN cell at time t is denoted by x_t. This input goes through the following operations with the hidden state vector h_t:

h_t = tanh(W_{xh} x_t + W_{hh} h_{t−1} + b_h)   (68)
y_t = W_{hy} h_t + b_y   (69)


Here W_{xh} and W_{hh} are the weights applied to the vectors x_t and h_{t−1} respectively. These weight matrices are constant across time steps, unlike the per-layer weight vectors of a feedforward neural network. A complete RNN consisting of multiple cells looks like the following [Dupond, 2019]:

Fig 11. A rolled out RNN
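A minimal NumPy sketch of one step of equations (68)-(69); the weight shapes are assumed to match the dimensions of x_t and h_t:

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
        # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h);  y_t = W_hy h_t + b_y
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
        y_t = W_hy @ h_t + b_y
        return h_t, y_t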

One of the drawbacks an RNN suffers from is failure when the dependency of the current output on the history goes back a long way [Hochreiter, 2019]. This is depicted in the figure below.

Fig 12. The current state h3 is dependent on values encountered long back

Nevertheless, RNNs form the basic construct on which Long Short Term Memory networks, which are capable of solving this problem, are based.


3.2.8 Long Short Term Memory Networks

Long Short Term Memory (LSTM) networks [Hochreiter, 2019] are a special type of RNN which were introduced to overcome the shortcomings of the RNN. The recurring network/sub-network in an LSTM looks like:⁹

Fig 13. Basic Structure of LSTM

The LSTM, in a nutshell, consists of sigmoid functions and gates in addition to the hyperbolic functions. The variables hold the following relationships:

f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f)   (70)
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i)   (71)
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o)   (72)
c̃_t = σ_h(W_c x_t + U_c h_{t−1} + b_c)   (73)
c_t = f_t ∘ c_{t−1} + i_t ∘ c̃_t   (74)
h_t = o_t ∘ σ_h(c_t)   (75)

• x_t ∈ R^d: input vector to the LSTM unit

• f_t ∈ R^h: forget gate's activation vector

• i_t ∈ R^h: input/update gate's activation vector

• o_t ∈ R^h: output gate's activation vector

• h_t ∈ R^h: hidden state vector, also known as the output vector of the LSTM unit

• c̃_t ∈ R^h: cell input activation vector

• c_t ∈ R^h: cell state vector

• W ∈ R^{h×d}, U ∈ R^{h×h} and b ∈ R^h: weight matrices and bias vector

9 The equations in this section are derived from the work of [olah, 2019].

The working of an LSTM is quite similar to that of an RNN. The multiple gates, however, help it retain important data from the history better. The LSTM works as follows: at time t − 1 it maintains a cell state c_{t−1} and a hidden state vector h_{t−1}. On receiving an input x_t at time t, it updates its values for c_t and h_t based on the above equations. A copy of h_t then serves as an input to the cell at time t + 1.
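A minimal NumPy sketch of one LSTM step implementing equations (70)-(75); the dictionaries of weight matrices are an assumption made to keep the example compact:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        # W, U, b are dicts keyed by gate name ('f', 'i', 'o', 'c'), matching (70)-(75).
        f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate (70)
        i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate (71)
        o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate (72)
        c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate cell state (73)
        c_t = f_t * c_prev + i_t * c_tilde                          # Hadamard products (74)
        h_t = o_t * np.tanh(c_t)                                    # new hidden state (75)
        return h_t, c_t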


3.3 Asynchronous Advantage Actor Critic architecture

The Asynchronous Advantage Actor Critic (A3C) architecture is implemented in broadly two ways:

• Separate Policy and Value Networks

• Shared parameters between the Policy and Value Networks

In this paper, we use the architecture where parameters are shared between the Actor and Critic networks. This way of training produces better results for Atari games [John Schulman, 2017]. We further make use of OpenAI Gym [Greg Brockman, 2014] to work with the game environment.

Fig 13a. A3C Network Architecture [Grattarola, 2017]

The above architecture shows an A3C with shared parameters for the Policy and Value functions, which are explained in detail with respect to their implementation in the current report. The network is laid out asynchronously so that the code runs with multiple threads at the same time. Each thread has its own copy of the environment, and each updates the Policy and Value networks.


3.3.1 Policy and Value Network

The input to both the Policy network and the Value network is the current state (and a sampled action); they output the probabilities of taking the possible actions and the value V(s) of the state s respectively. The shared policy and value network has the following architecture:

• An input layer with input_channels = 3, output_channels = 32, kernel_size = 3 × 3, stride = 3, padding = 1.

• This is followed by an Exponential Linear Unit (ELU). An ELU [Djork-Arne Clevert, 2015] has the following form:

  f(x) = x if x > 0, a(e^x − 1) otherwise,

  where a ≥ 0 is the hyper-parameter.

• 4 hidden layers with input_channels = 32, output_channels = 32, kernel_size = 3 × 3, stride = 3, padding = 1, each of them followed by an ELU.

• An LSTM layer with dimensions:

  h_0 × c_0 = 32 × 5
  h_1 × c_1 = 5 × 256

• An output layer with dimension equal to the number of actions for the Actor, and an output layer with a single output ∈ R for the Value network.

Cost function for the Policy Network:

Cost = − log probability × advantage   (76)

Cost function for the Value Network:

Cost = value function estimate for the state from the OpenAI environment − value function estimate from the CNN   (77)
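A simplified PyTorch sketch of a shared actor-critic network of this general shape. The number of convolutions, the stride of 2 and the 80 × 80 input size are assumptions made so that the example is self-contained and runnable; they are not claimed to be the exact values used in the experiments:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedActorCritic(nn.Module):
        def __init__(self, num_actions, in_channels=3, frame_size=80):
            super().__init__()
            # one input convolution + three hidden convolutions, each followed by an ELU
            self.convs = nn.ModuleList(
                [nn.Conv2d(in_channels, 32, 3, stride=2, padding=1)] +
                [nn.Conv2d(32, 32, 3, stride=2, padding=1) for _ in range(3)])
            with torch.no_grad():                      # infer the flattened size fed to the LSTM
                dummy = torch.zeros(1, in_channels, frame_size, frame_size)
                for conv in self.convs:
                    dummy = F.elu(conv(dummy))
                feat = dummy.numel()
            self.lstm = nn.LSTMCell(feat, 256)
            self.actor = nn.Linear(256, num_actions)   # policy head
            self.critic = nn.Linear(256, 1)            # value head

        def forward(self, x, hx, cx):
            for conv in self.convs:
                x = F.elu(conv(x))
            x = x.view(x.size(0), -1)
            hx, cx = self.lstm(x, (hx, cx))
            return self.actor(hx), self.critic(hx), (hx, cx)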


3.4 Problems in Reinforcement Learning

3.4.1 Overview

While training an RL agent through the A3C method described above, the importance of the state can be judged from the fact that it is an input to both the Value and Policy networks. The state of the system is given by the pixels on the screen, taking values between 0 and 255. Often in an RL problem, some of the state parameters should not be relevant to the performance of the agent. If we take, for example, the Atari game Breakout, where the agent is the horizontal paddle, the aim is to maximise the score by hitting the bricks while making sure the paddle does not miss the ball.

Fig 14. Screenshot of the game Breakout [Shani Gamrian, 2019]

The background (black in this case) should not interfere with the agent's choice of action. However, when the background is changed, the agent fails to transfer its learning completely [Shani Gamrian, 2019]. Consider, for example, two sample perturbations of the game environment.


Fig 15. Perturbed Game Environment [Shani Gamrian, 2019]

Transfer learning protocols inspired by [Yosinski and Lipson, 2014] were used to transfer the learning made on the source domain to the target domain (perturbed images) using a combination of frozen and fine-tuned layers. For this purpose the following transfer learning protocols were used:

• From-Scratch: The game is trained from scratch on the target game.

• Full-FT: All of the layers are initialized with the weights of the source task and are fine-tuned on the target task.

• Random-Output: The convolutional layers and the LSTM layer are initialized with the weights of the source task and are fine-tuned on the target task. The output layers are initialized randomly.

• Partial-FT: All of the layers are initialized with the weights of the source task. The three first convolutional layers are kept frozen, and the rest are fine-tuned on the target task.

• Partial-Random-FT: The three first convolutional layers are initialized with the weights of the source task and are kept frozen, and the rest are initialized randomly.

Below are the results obtained while using the above protocols for transfer:


Fig 16. Score vs Epochs for the Image with Green Lines[Shani Gamrian, 2019]

Fig 17. Score vs Epochs for the Image with Diagonal Lines[Shani Gamrian, 2019]


The results clearly show the inability of the agent to transfer its learning. The reason for the failure of the agent stems from the way the agent is trained. The agent, while training, has its input in the form of screen pixels. As such, a natural question arises as to which pixels influence the decision making of the agent.

3.4.2 How an Agent Makes a Decision:

Saliency Maps in CNN

We begin with our discussion of saliency maps. Saliency maps were introduced back in 2014 [Karen Simonyan, 2014] to investigate how a CNN makes a prediction. We first explore this idea and then build upon it to understand the same concepts in the context of Reinforcement Learning. The saliency map for a class tells which pixels in the image contributed most to designating the image to that specific class. For this purpose, a class saliency score S_c is defined to be equal to the output of the last layer of the CNN:

S_c ∈ R   (79)

But as we know, as more layers are added to the network its linearity is broken, and the output is usually a highly involved combination of the weight vectors w of the different layers. Hence a first order Taylor approximation is used to approximate S_c:

S_c(I) = ω_c^T I + b_c   (80)

where ω_c is the first derivative of the saliency score with respect to I, and I is the flattened vector of pixels, such that the values in the vector ω_c and in I correspond to each other. Now each pixel in I (and its corresponding first order derivative) is removed iteratively to come up with a set of pixels in order of highest contribution to classifying image I into class c.

The rationale is that the higher the contribution of a pixel, the higher the change in the saliency score will be if that pixel is modified/changed.

Saliency Map in RL problem

The aforementioned concept of the saliency map has been extended in the paper [Sam Greynadus, 2014] to understand decision making in the context of an RL agent. The idea mentioned therein is going to form the basis for the rest of this report. Similar to the saliency score mentioned for the CNN, the saliency scores for the Value and Policy functions are defined separately. For the purpose of this report, we are concerned with the saliency score of the Policy function only, which is defined as:

S_π(t, i, j) = ½ ‖π_u(I_{1:t}) − π_u(I'_{1:t})‖   (81)

where

I'_{1:k} = φ(I_k, i, j) if k = t, and I_k otherwise.


Here φ(I_k, i, j) corresponds to a perturbation of the frame centred around the pixel coordinates (i, j). The way the perturbation is done in the current context is explained next.

Perturbation in the Image

To understand this perturbation we revisit some terminology and mathematical constructs of Gaussian filtering. A Gaussian filter is a kernel composed of discretized values that a Gaussian function takes. A Gaussian function [Squires, 2008] is given by:

f(x) = (1 / (σ √(2π))) e^{−(x−µ)² / (2σ²)}   (82)

An equivalent Gaussian filter as a square matrix of dimension 3 and 5 is given by:

M = (1/16) ·  | 1  2  1 |
              | 2  4  2 |
              | 1  2  1 |

M = (1/256) · | 1   4   6   4   1 |
              | 4  16  24  16   4 |
              | 6  24  36  24   6 |
              | 4  16  24  16   4 |
              | 1   4   6   4   1 |

Next, recall that the equation of a line in two dimensions is given by y = mx + c, and any point between two points x_1 and x_2 on the line can be calculated by the following equation:

x_3 = x_1 + (x_2 − x_1) m   (83)

where m is the slope, or equivalently a scaling factor. Extending this idea further, the perturbed state is calculated with the two extreme states being x_1 and x_2: x_1 is the original state I_t, and x_2 is a state A obtained by Gaussian smoothing with a Gaussian N(µ, 3). The "slope" m is given by the mask M(i, j) ∈ (0, 1)^{m×n}, which is a 2D Gaussian centred at (i, j) with σ = 5.

Finally the blur φ(I_t, i, j) is obtained by the following equation with Hadamard products (not convolutions):

φ(I_t, i, j) = I_t ∘ (1 − M(i, j)) + A(I_t, σ_A) ∘ M(i, j)   (84)
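A hedged Python sketch of equation (84) using SciPy's Gaussian filter; the exact normalisation of the mask and the single-channel frame format are assumptions made for illustration:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def gaussian_mask(shape, i, j, sigma=5.0):
        # 2D Gaussian M(i, j) centred at pixel (i, j), peak scaled to 1
        y, x = np.ogrid[:shape[0], :shape[1]]
        m = np.exp(-((x - j) ** 2 + (y - i) ** 2) / (2.0 * sigma ** 2))
        return m / m.max()

    def perturb_frame(frame, i, j, sigma_blur=3.0, sigma_mask=5.0):
        # phi(I_t, i, j) = I_t o (1 - M(i, j)) + A(I_t, sigma_A) o M(i, j), equation (84)
        blurred = gaussian_filter(frame, sigma=sigma_blur)   # A(I_t, sigma_A)
        mask = gaussian_mask(frame.shape, i, j, sigma_mask)  # M(i, j)
        return frame * (1.0 - mask) + blurred * mask         # Hadamard products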

Policy function Evaluation

With the above definition of a perturbed state, the saliency with respect to the policy function is calculated for each frame as follows:

S_π(t, i, j) = ½ ‖π_u(I_{1:t}) − π_u(I'_{1:t})‖   (85)


Saliency Map 1 (left) and Saliency Map 2 (right)

A high saliency score means that the difference between the perturbed and unperturbed outputs is high. This in turn means that the presence/absence of the masked features is important, and that region indeed serves as the region of focus for the RL agent.

Saliency Map for the game breakout

The saliency maps obtained following the above approach [Sam Greynadus, 2014] are shown in the figures above. The two images show two different frames of the game at two different time intervals. The area the agent focuses on to take an action is shown in blue. Clearly this area includes the background region, and hence a change in the background leads to a failure of decision making by the agent. In the next section, we will build on the saliency maps to come up with the algorithm described in this report.


4 Methodology

The methodology is formulated assuming the availability of the variables provided by OpenAI Gym [Greg Brockman, 2014]. This is discussed in the next section.

4.1 OpenAI GYM

OpenAI Gym provides a framework to test RL algorithms on a set of problems (games). Several types of game environment (in our case Breakout-Atari) are available with which the agent can interact. Every game environment has its own state and action space. For example, for the game Breakout, where the agent is the paddle, the possible values for the different spaces are:

• State : The pixel representation of the screen in the form of a numpy vector.

• Action: The possible actions an agent can take. Right or Left in case of the gamebreakout.

The agent interacts with the game environment by taking an action a_t at time t while in state s_t, which generates a set of four variables, (using the standard naming convention) {state, reward, done, info}, where (a minimal interaction loop is sketched after this list):

• state: The new state s_{t+1} the agent has reached.

• reward: The reward r_t the agent received.

• done: The status (True or False) of the game, i.e. True if the game has reached a terminal state and False if the game is still in progress.

• info: Diagnostic information for debugging.
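A minimal random-policy interaction loop using the classic Gym interface described above; the environment id "Breakout-v0" is an assumption, and any Atari Breakout id with the same four-value step() return would do:

    import gym

    env = gym.make("Breakout-v0")
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()            # random action, for illustration only
        state, reward, done, info = env.step(action)  # the four variables listed above
        total_reward += reward
    env.close()
    print("episode return:", total_reward)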


4.2 LSTM in conjunction with Saliency Maps

The approach in this report to overcome the problem faced by the RL agent in the previous section is to train the RL agent through something we name Selective Learning. We first look at what the final output of the policy network looks like. For simplicity we take a policy network with the following architecture.

Fig 20. Simplified Policy Network

The weights(not shown) from the input neuron to the hidden, follows the standard nam-ing convention. Same holds for any 2 connected neurons in the network. .The weightmatrix and the equation for the hidden state vector thus obtained is:

\sigma\!\left(
\begin{pmatrix}
w_{h_1 s_1} & w_{h_1 s_2} & 0 & 0 \\
0 & w_{h_2 s_2} & w_{h_2 s_3} & 0 \\
0 & 0 & w_{h_3 s_3} & w_{h_3 s_4}
\end{pmatrix}
\begin{pmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \end{pmatrix}
\right)
=
\begin{pmatrix} h_1 \\ h_2 \\ h_3 \end{pmatrix}

The second set of equations is:

\sigma\!\left(
\begin{pmatrix}
w_{m_1 h_1} & w_{m_1 h_2} & 0 \\
0 & w_{m_2 h_2} & w_{m_2 h_3}
\end{pmatrix}
\begin{pmatrix} h_1 \\ h_2 \\ h_3 \end{pmatrix}
\right)
=
\begin{pmatrix} m_1 \\ m_2 \end{pmatrix}
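The following NumPy snippet reproduces the two matrix equations above with arbitrary, untrained weight values; it is only meant to make the shapes concrete (4 inputs, 3 hidden units, 2 outputs) and is not part of the actual policy network.

import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))       # logistic activation

s = np.array([0.1, 0.4, 0.2, 0.7])        # input vector (s1, s2, s3, s4)

W1 = np.array([[0.5, 0.3, 0.0, 0.0],      # w_h1s1, w_h1s2, 0, 0
               [0.0, 0.2, 0.6, 0.0],      # 0, w_h2s2, w_h2s3, 0
               [0.0, 0.0, 0.4, 0.1]])     # 0, 0, w_h3s3, w_h3s4
W2 = np.array([[0.7, 0.5, 0.0],           # w_m1h1, w_m1h2, 0
               [0.0, 0.3, 0.8]])          # 0, w_m2h2, w_m2h3

h = sigma(W1 @ s)                          # hidden state vector (h1, h2, h3)
m = sigma(W2 @ h)                          # output vector (m1, m2), later fed to the LSTM
print(h, m)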

At this point the output vector (m_1, m_2) is fed into the LSTM network. Inside the LSTM, we seek to reinforce the cell state with the saliency pixels (Section 3.4.2). This is suggested via the following algorithm, which uses two separate networks:

• The base A3C architecture

• A Network to calculate the saliency maps for the frames being processed


Fig 21. Suggested Algorithm

This is very similar to the Asynchronous Advantage Actor-Critic method [V minh, 2016], but with the LSTM block being fed memory from the saliency vector instead of a random initialization. It works as follows (a code sketch is given after the list):

• Play the game with random actions (sampled from the action space). Store the variables (action, state, reward, saliency pixels) in an array.

• Once the episode ends (either by reaching a terminal state or after a hard-coded number of iterations), run the update block for the value and policy networks, with the policy network's LSTM being fed its cell state from the saliency vector.
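The sketch below illustrates the second bullet in PyTorch: an LSTMCell whose cell state is seeded from a saliency-derived vector rather than zeros. The saliency_to_cell projection, the 256-dimensional sizes (matching the hidden and cell states reported in Section 5), and the four-action output are assumptions made for illustration, not the exact architecture of the base A3C code.

import torch
import torch.nn as nn

class SaliencyLSTMPolicy(nn.Module):
    """Actor-critic head whose LSTM cell state is initialised from a saliency vector."""

    def __init__(self, feature_dim=256, hidden_dim=256, num_actions=4):
        super().__init__()
        self.lstm = nn.LSTMCell(feature_dim, hidden_dim)
        self.saliency_to_cell = nn.Linear(feature_dim, hidden_dim)  # saliency summary -> c_0
        self.policy_head = nn.Linear(hidden_dim, num_actions)       # pi(a | s)
        self.value_head = nn.Linear(hidden_dim, 1)                  # V(s)

    def forward(self, features, saliency_vec, hx=None):
        if hx is None:
            # Seed the memory: zero hidden state, cell state taken from the saliency vector.
            h0 = features.new_zeros(features.size(0), self.lstm.hidden_size)
            c0 = torch.tanh(self.saliency_to_cell(saliency_vec))
            hx = (h0, c0)
        h, c = self.lstm(features, hx)
        return self.policy_head(h), self.value_head(h), (h, c)

The value and policy losses can then be computed over the stored (action, state, reward, saliency) tuples exactly as in standard A3C; the only difference is the seeded cell state.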


5 Results and Conclusion

The network was run with the A3C base code [Shani Gamrian, 2019] and the code for saliency vectors [Sam Greynadus, 2014]. We were able to obtain the hidden state variable h_t for the base model. A sample value of the hidden state obtained after 5 time steps had the following properties:

• Dim(h_t) = 256 × 1

• Range(h_t) ∈ (−1, 1)

The sample (truncated) value looked something like: h_t = [1.8941e−02, −8.1231e−03, 4.9399e−04, 6.2160e−04, 3.3453e−02, −6.2437e−03, ...]

Similarly, the values for the cell state c_t obtained after 5 time steps confirmed the dimension assertion and showed similar properties, but with a significantly larger range than the hidden state:

• Dim(c_t) = 256 × 1

• Range(c_t) ∈ (−10, 10)

The sample (truncated) value looked something like: c_t = [2.8254, 0.6599, 6.9898, −0.5480, −1.4773, 3.6065, ...]

One of the major problems with the base code was its speed. As RL algorithms generally take a long time to train, the network architecture becomes important. A simpler architecture was tried to overcome this issue, which produced the results discussed below.¹⁰



However, the base code showed better results for the same number of epochs, and hence this approach had to be rejected.

In conclusion, the theoretical foundation of the framework we suggest presents an alternative, different from the image-to-image translation framework, for tackling the failure of an RL agent to transfer its knowledge across domains. As the field grows, we expect to see more research in this area.

¹⁰The results were obtained from the original code (reimplemented by the author of this report) by the GitHub user Lazy Programmer: https://github.com/lazyprogrammer/machinelearningexamples/tree/master/rl2


6 Future Work

Although saliency maps provide a good understanding of the process behind the scenes, they are still far from perfect. For example, in the heat map given in Section 3.4.2, the imperfections can be seen in regions of focus that lie far away from both the agent and the ball. Such regions have the potential to act as noise in the hidden state vector. It would be interesting to see whether this problem could be solved by thresholding the pixel selection based on proximity to the agent or to the other important factors controlling the game dynamics.

It would also be interesting to visualise the hidden state vector. There has been research on understanding how neural networks make decisions [Marco Tulio Ribeiero, 2016; Karen Simonyan, 2014], but we do not know much about what features the cell state and the hidden state of an LSTM store. Until this is investigated, it is difficult to suggest a proper way to feed pixels into the LSTM memory.

New methods [Rahul Iyer, 2018] keep appearing for identifying the features that are important to an RL agent when it makes a decision. It would be worth investigating whether they fit into the architecture and yield better results.

Finally, the way the value function V(s) is defined should be revisited. The inherent dependence of the value function on raw pixels is the root cause of the issue. It would be interesting to see whether this dependency can be reduced by using experience replay [Mirza Ramicic, 2017].


References

Agarap, A. F. (2018). Deep learning using rectified linear units (ReLU). Conference paper, CoRR, abs/1803.08375.
Alireza Makhzani, Jonathon Shlens, N. J. I. G. B. F. (2016). Adversarial Autoencoders. Conference, tep, Arxiv.
André Barreto, Will Dabney, R. M. J. J. H. T. S. H. v. H. D. S. (2017). Successor Features for Transfer in Reinforcement Learning. Conference, Neural Information Processing Systems, Arxiv.
Bertsekas, D. (1995). Dynamic Programming and Optimal Control, vol. 1, 2. Athena Scientific. Conference, N/A, Arxiv.
Bhandari, J. (2019). Global Optimality guarantees for policy gradient methods. Book, Oxford, Columbia University.
Bryan, S. (2008). Discrete Fourier Analysis and Wavelets: Applications to Signal and Image Processing. New York: Wiley. Journal, 18th Annual Conference of the International Speech Communication Association, aa.
Croonenborghs, J. R. D. (2017). Transfer Learning in Reinforcement Learning Problems Through Partial Policy Recycling. Conference, Neural Information Processing Systems, Arxiv.
Djork-Arne Clevert, Thomas Unterthiner, S. H. G. (2015). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). Conference, University of Lugano.
Dupond, S. (2019). A thorough review on the current advance of neural network structures. Journal, P-429, aa.
Felipe Leno da Silva, A. H. R. C. (2016). Transfer Learning for Multiagent Reinforcement Learning Systems. Conference, International Joint Conference on Artificial Intelligence, Arxiv.
G. Zheng, F. Zhang, Z. Z. Y. X. N. J. Y. X. X. and Li, Z. (2018). A Deep Reinforcement Learning Framework for News Recommendation. Conference, temp, Proceedings of the 27th International Conference on Neural Information Processing Systems.
G. Barto, A. (1995). Reinforcement Learning and Dynamic Programming. Book.
Goodfellow, I. (2016). Deep Learning (Adaptive Computation and Machine Learning series). Book, P-425, CVPR.
Goodfellow, I. (2016). Generative Adversarial Networks. Conference, tep, NIPS 2016.
Grattarola, D. (2017). Deep Feature Extraction for Sample-Efficient Reinforcement Learning. Thesis, University of Lugano.
Greg Brockman, Vicki Cheung, L. P. J. S. J. S. J. T. W. Z. (2014). OpenAI Gym. Conference, temp, Proceedings of the 27th International Conference on Neural Information Processing Systems.


Hochreiter, S. (2019). Long Short-Term Memory. Journal, P-429, Neural Computation.
J. Jin, C. Song, H. L. K. G. J. a. W. Z. (2018). Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising. Conference, tep, ACS Central Science 3.
J. Kober, J. A. D. Bagnell, J. P. (2013). Reinforcement Learning in Robotics. Conference, tep, ACS Central Science 3.
Jarrison, R. L. (2010). Introduction to Monte Carlo Simulation. Conference, N/A, Arxiv.
John Schulman, Filip Wolski, P. D. A. R. O. K. (2017). Proximal Policy Optimization Algorithms. Journal, Oxford, Journal of Machine Learning.
Karen Simonyan, Andrea Vedaldi, A. Z. (2014). Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. Journal, Oxford, Visual Geometry Group, Oxford.
Kirk, D. E. (1970). Optimal Control Theory: An Introduction. Journal, Prentice Hall.
Marco Tulio Ribeiero, Sameer Singh, C. G. (2016). Why Should I Trust You? Explaining the Predictions of Any Classifier. Conference, Neural Information Processing Systems, Arxiv.
Matthew E. Taylor, P. S. (2009). Transfer Learning for Reinforcement Learning Domains: A Survey. Journal, Journal of Machine Learning Research, Arxiv.
Mirza Ramicic, A. B. (2017). Attention-Based Experience Replay in Deep Q-Learning. Conference, University of Lugano, ICML.
Odinlmshen (2018). Deploying a Convolutional Neural Network on Cortex-M with CMSIS-NN. Website, N/A, https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/deploying-convolutional-neural-network-on-cortex-m-with-cmsis-nn.
Olah, C. (2019). Understanding LSTM Networks. Website, P-429, https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
Rahul Iyer, Yuezhang Li, H. L. M. L. R. S. K. S. (2018). Transparency and Explanation in Deep Reinforcement Learning Neural Networks. Conference, University of Lugano.
Richard Sutton, A. G. B. (2014). Reinforcement Learning: An Introduction. Book.
Rikiya Yamashita, M. N. K. T. (2018). Convolutional neural networks: An overview and application in Radiology. Journal.
Sam Greynadus, Anurag Koul, J. D. A. F. (2014). Visualising and Understanding Atari Agents. Conference, Oxford, ICML conference paper.
Schmidhuber, J. (2015). Deep Learning in Neural Networks: An Overview. Neural Networks. Journal, 18th Annual Conference of the International Speech Communication Association, aa.
Shani Gamrian, Y. G. (2019). Transfer Learning for Related Reinforcement Learning Tasks via Image-to-Image Translation. Journal, P-429, aa.


Squires, G. L. (2008). Practical Physics. Book, Oxford, Cambridge University.
SuperDataScience (2018). Convolution Operation. Website.
Thomas Carr, Maria Chli, G. V. (2019). Domain Adaptation for Reinforcement Learning on the Atari. Conference, tep, Aston University.
V minh, Mehdi Mirza, A. G. T. H. D. S. (2016). Asynchronous Methods for Deep Reinforcement Learning. Journal, Oxford, Journal of Machine Learning.
Volodymyr Mnih, Koray Kavukcuoglu, D. S. A. G. I. A. D. W. M. R. (2013). Playing Atari with Deep Reinforcement Learning. Journal, DeepMind Technologies.
Watkins, C. (1992). Q-Learning. Technical note, N/A, Arxiv.
Yann LeCun, Corinna Cortes, C. J. B. (1998). The MNIST Database of Handwritten Digits. Website, Google Labs, Microsoft Research.
Yosinski, J., C. J. B. Y. and Lipson, H. (2014). How transferable are features in deep neural networks? Conference, temp, Proceedings of the 27th International Conference on Neural Information Processing Systems.
Z. Zhou, X. L. and Zare, R. N. (2017). Optimizing Chemical Reactions with Deep Reinforcement Learning. Conference, tep, ACS Central Science 3.
Zak Murez, Soheil Kolouri, D. K. R. R. K. K. (2016). Image to Image Translation for Domain Adaptation. Conference, tep, NIPS 2016.
