Machine Learning
Lecture 16: Value-Based Deep Reinforcement Learning

Nevin L. Zhang
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
This set of notes is based on the references listed at the end and internet resources.
Introduction
Outline
1 Introduction
2 Training Target
3 Experience Replay
4 DQN in Atari
5 Improvements to DQN
  Double DQN
  Prioritized Experience Replay
Motivation
Q-learning:
It is difficult to represent Q(s, a) as a lookup table when the number of states is large.
Atari games: a state is an image with 210 × 160 pixels, and each pixel has 128 possible color values. The total number of possible states is 128^{210×160}.
It cannot determine actions for states never visited (and there are many such states).
Deep Q-Networks
In DQN, we aim to approximate the optimal action-value function Q∗(s, a) using a neural network with parameters θ:
Q(s, a; θ) ≈ Q∗(s, a)
The function is defined over all possible states (images) without enumerating them.
So there is information for picking actions at any state.
Key question: How to learn θ from experiences with the environment?
s0, a0, r0, s1, a1, r1, s2, a2, r2, . . .
Training Target
Learning DQN
Q-learning:
We will learn θ iteratively, just as Q(s, a) is learned iteratively in Q-learning.
As the first step toward a DQN learning algorithm, we can change the Q-value update step into a θ update step:
θ ← update(θ; s, a, s′, r)
To do so, we need to come up with an objective function and update θ using its gradient.
Learning from an experience tuple e = (s, a, s′, r)
We want to change θ so that Q(s, a; θ) gets closer to Q∗(s, a).
The problem is that we do not know the true target Q∗(s, a).
So, we use the TD target:
y = r(s, a) + γ max_{a′} Q(s′, a′; θ)
Let Q′ be the result if we run one value iteration on Q(s, a; θ). It is closer to Q∗ than Q.
As explained in the previous lecture, y is an unbiased estimate of Q′ based on the experience tuple.
Our target is now to push Q(s, a; θ) toward the TD target. So, we use the following loss function:
L(θ) = (y − Q(s, a; θ))² = ([r(s, a) + γ max_{a′} Q(s′, a′; θ)] − Q(s, a; θ))²   (1)
The update rule for θ is:
θ ← θ − α∇θ([r(s, a) + γ max_{a′} Q(s′, a′; θ)] − Q(s, a; θ))²   (2)
where α is the learning rate.
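To make the update concrete, here is a minimal PyTorch-style sketch of loss (1) and update (2) for a single experience tuple. It is illustrative only: q_net, optimizer, and the tensor shapes are assumptions, and the TD target is treated as a constant during differentiation (the semi-gradient commonly used in practice).

```python
import torch

def dqn_update(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One gradient step on the squared TD error for a single tuple (s, a, s', r)."""
    q_sa = q_net(s)[a]                        # Q(s, a; theta)
    with torch.no_grad():                     # treat the TD target as a constant
        y = r + gamma * q_net(s_next).max()   # y = r(s, a) + gamma * max_a' Q(s', a'; theta)
    loss = (y - q_sa) ** 2                    # loss (1)
    optimizer.zero_grad()
    loss.backward()                           # gradient step of (2)
    optimizer.step()
    return loss.item()
```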
Moving Target
The TD target y = r(s, a) + γ max_{a′} Q(s′, a′; θ) depends on the parameters θ.
Every time the parameters are updated, the target is also changed.
So, learning θ with update rule (2) is like chasing a moving target.
This makes learning unstable.
Target Network
To deal with the moving target issue, create a target network that has the same structure as the DQN. Let θ− be its parameters.
Use the target network to compute the TD target y.
Set θ− = θ once in a while (every C steps).
So, we have this new update rule:
θ ← θ − α∇θ([r(s, a) + γ max_{a′} Q(s′, a′; θ−)] − Q(s, a; θ))²   (3)
This way, we are solving a sequence of well-defined optimization problems.
Deep Q-Learning with Target Network
Repeat:
Take action a in the current state s, observe r and s′.
Update the parameters:
θ ← θ − α∇θ([r(s, a) + γ max_{a′} Q(s′, a′; θ−)] − Q(s, a; θ))²
s ← s′
θ− ← θ every C steps.
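A sketch of this loop in Python, under assumed names (q_net, select_action, env_step, num_steps, and C are placeholders; the environment interface and the exploration policy are not specified on the slide):

```python
import copy
import torch

target_net = copy.deepcopy(q_net)   # theta^- starts as a copy of theta

for step in range(num_steps):
    a = select_action(q_net, s)                         # e.g. epsilon-greedy (assumed)
    s_next, r = env_step(s, a)                          # observe r and s'
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max()        # TD target computed with theta^-
    loss = (y - q_net(s)[a]) ** 2
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    s = s_next
    if (step + 1) % C == 0:
        target_net.load_state_dict(q_net.state_dict())  # theta^- <- theta every C steps
```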
Experience Replay
In Deep Q-Learning with a target network, only the latest experience tuple is used for each parameter update.
The idea of experience replay is to:
Store experience tuples in a buffer.
Use a random minibatch of experience tuples for each parameter update.
Deep Q-Learning with Target Network and Experience Replay
Repeat:
Take action a in the current state s, observe r and s′; add the experience tuple (s, a, s′, r) to a buffer D; s ← s′.
Sample a minibatch B = {(s_j, a_j, s′_j, r_j)} from D.
Update the parameters:
θ ← θ − α∇θ Σ_j ([r(s_j, a_j) + γ max_{a′_j} Q(s′_j, a′_j; θ−)] − Q(s_j, a_j; θ))²
θ− ← θ every C steps.
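A sketch of the minibatch update with a replay buffer. It assumes each field of the stored tuples is already a tensor (state tensors, int64 actions, float rewards); the buffer size and batch size are illustrative values, not from the slide.

```python
import random
from collections import deque
import torch

buffer = deque(maxlen=100_000)   # replay memory D; the oldest tuples are dropped when it is full

def replay_update(q_net, target_net, optimizer, gamma=0.99, batch_size=32):
    """One parameter update from a random minibatch of (s, a, s', r) tuples."""
    batch = random.sample(buffer, batch_size)
    s, a, s_next, r = (torch.stack(x) for x in zip(*batch))
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q(s_j, a_j; theta)
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values     # targets use theta^-
    loss = ((y - q_sa) ** 2).sum()                               # sum over the minibatch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```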
Advantages of Experience Replay
Without experience replay, the experience tuples used in consecutive parameter updates are strongly correlated. Experience replay breaks the correlations and hence reduces the variance of the updates.
Experience replay avoids oscillations or divergence in the parameters.
With experience replay, each experience tuple is potentially used multiple times. Hence better data efficiency.
With experience replay, each update improves Q(s, a; θ) using signals from multiple points (experience tuples). Hence faster convergence.
A More General View of DQN
Processes 1 and 3 run at the same pace, while Process 2 is slower.
Old experiences are discarded when the buffer is full.
DQN in Atari
Atari Games
Space Invaders
DQN in Atari
End-to-end learning of values Q(s, a) from pixels s.
Input state s is a stack of raw pixels from the last 4 frames.
Output is Q(s, a) for 18 joystick/button positions.
Reward is the change in score for that step.
Network architecture and hyperparameters are fixed across all games.
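A sketch of such a Q-network in PyTorch. The layer sizes follow the commonly cited DQN architecture and assume frames preprocessed to 84×84 grayscale; these details are assumptions, not stated on the slide.

```python
import torch.nn as nn

class AtariQNet(nn.Module):
    """Maps a stack of 4 preprocessed frames to one Q-value per action."""
    def __init__(self, num_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # input: 4 stacked frames
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # assumes 84x84 input frames
            nn.Linear(512, num_actions),                            # Q(s, a) for each action
        )

    def forward(self, x):
        return self.net(x)
```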
Comparable with or superior to humans at the majority of games.
Results on Breakout
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Improvements to DQN
Double DQN
Overestimate of Value function
In (3), we estimate the TD target as follows:
y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−)
The subscript Q indicates that y_Q is the estimate used in Q-learning (DQN).
The Q-values used are not accurate, and are influenced by random factors denoted as ω−. So, we can write:
y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−, ω−)
If we run the process many times and take the average of the y values, we get
E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)]
Ideally, we would like to first get a robust estimate of Q by taking the average over many runs, i.e.,
E_{ω−}[Q(s′, a′; θ−, ω−)]
and then use it to estimate the TD target:
y = r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]
On average, y_Q overestimates the ideal value y because
E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)]
≥ r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]
= y
Note that, for any two random variables X1 and X2, E[max(X1, X2)] ≥ max(E[X1], E[X2]).
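A small numerical illustration of this inequality (not from the slides): two equally noisy estimates of a quantity whose true value is 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(loc=1.0, scale=1.0, size=1_000_000)   # noisy estimate of a quantity that is truly 1.0
x2 = rng.normal(loc=1.0, scale=1.0, size=1_000_000)   # a second, independent noisy estimate

print(np.maximum(x1, x2).mean())   # E[max(X1, X2)], roughly 1.56: taking the max inflates the estimate
print(max(x1.mean(), x2.mean()))   # max(E[X1], E[X2]), roughly 1.0: the noise-free ideal
```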
Double DQN
In Double DQN, the next action is selected with the current network θ and evaluated with the target network θ−:
a∗ = arg max_{a′} Q(s′, a′; θ, ω)
y_{DoubleQ} = r(s, a) + γ Q(s′, a∗; θ−, ω−)
Taking the average, we get
E_{ω−}[y_{DoubleQ}] = r(s, a) + γ E_{ω−}[Q(s′, a∗; θ−, ω−)]
This is closer to
y = r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]
than
E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)]
An Analogy
TD target in DQN:
y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−)
The second term estimates the future optimal value of s′ in two steps:
Pick the next action a′, and
Evaluate Q(s′, a′).
In both steps, we use θ−.
Task: Estimate the highest exam score of a class given the current situation s′.
Method 1: Ask one teacher θ− to
Estimate the score of each student a′: Q(s′, a′; θ−).
Pick the best student: a∗ = arg max_{a′} Q(s′, a′; θ−).
Report Q(s′, a∗; θ−).
Problem: The best student picked by teacher θ− might not be the best after all. The score for a∗ is overestimated, and hence the max score of the class is overestimated.
Method 2:
Ask one teacher θ to pick the best student: a∗ = arg max_{a′} Q(s′, a′; θ).
Ask another teacher θ− to estimate her score: Q(s′, a∗; θ−).
Double DQN
Update rule for DQN:
θ ← θ − α∇θ([r(s, a) + γ max_{a′} Q(s′, a′; θ−)] − Q(s, a; θ))²
Or, equivalently:
θ ← θ − α∇θ([r(s, a) + γ Q(s′, a∗; θ−)] − Q(s, a; θ))²
where a∗ = arg max_{a′} Q(s′, a′; θ−).
Update rule for Double DQN:
θ ← θ − α∇θ([r(s, a) + γ Q(s′, a∗; θ−)] − Q(s, a; θ))²   (4)
where a∗ = arg max_{a′} Q(s′, a′; θ).
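Side by side, the two targets differ only in which network picks a∗. A hedged sketch (function names and tensor shapes are assumptions; both would normally be called under torch.no_grad()):

```python
def dqn_target(target_net, r, s_next, gamma=0.99):
    # DQN: theta^- both selects and evaluates the next action
    return r + gamma * target_net(s_next).max(dim=1).values

def double_dqn_target(q_net, target_net, r, s_next, gamma=0.99):
    # Double DQN, rule (4): the online network theta selects a*,
    # and the target network theta^- evaluates it
    a_star = q_net(s_next).argmax(dim=1, keepdim=True)
    return r + gamma * target_net(s_next).gather(1, a_star).squeeze(1)
```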
DQN versus Double DQN
Prioritized Experience Replay
Why Prioritized Experience Replay?
In DQN and Double DQN, experience transitions are uniformly sampled from a replay memory for the purpose of Q-function regression (Process 3).
This approach does not take the significance of the experience transitions into consideration.
Prioritized Experience Replay
The significance of an experience tuple (s, a, s′, r) is measured using the TD error:
δ = [r(s, a) + γ max_{a′} Q(s′, a′; θ)] − Q(s, a; θ)
Let {(s_i, a_i, s′_i, r_i)} be the collection of experience tuples in the replay memory.
For each update of θ, experience tuples are sampled using the following distribution:
P(i) = p_i^α / Σ_k p_k^α
p_i is a function of δ_i: p_i = |δ_i| + ε or p_i = 1/rank(i).
α determines how much prioritization is used; α = 0 amounts to uniform sampling.
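A sketch of the proportional variant of this sampling scheme (the α and ε values are illustrative, not from the slide; the rank-based variant would replace p with 1/rank):

```python
import numpy as np

def sample_indices(td_errors, batch_size, alpha=0.6, eps=1e-6):
    """Sample tuple indices with probability P(i) proportional to p_i^alpha, where p_i = |delta_i| + eps."""
    p = np.abs(td_errors) + eps
    probs = p ** alpha / np.sum(p ** alpha)
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

# Example: tuples with larger TD errors are sampled more often
deltas = np.array([0.1, 2.0, 0.5, 0.05])
print(sample_indices(deltas, batch_size=3))
```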
Prioritized Experience Replay Improves Double DQN in Most Cases
Code Example
github.com/vmayoral/basic_reinforcement_learning/blob/master/tutorial6/README.md
References
Sergey Levine. Deep Reinforcement Learning, http://rail.eecs.berkeley.edu/deeprlcourse/, 2018.
David Silver. Tutorial: Deep Reinforcement Learning, http://hunch.net/~beygel/deep_rl_tutorial.pdf, 2016.
Tambet Matiisen. Demystifying Deep Reinforcement Learning. https://ai.intel.com/demystifying-deep-reinforcement-learning/, 2015.
Watkins. (1989). Learning from delayed rewards: introduces Q-learning.
Riedmiller. (2005). Neural fitted Q-iteration: batch-mode Q-learning with neural networks.
Mnih et al. (2013). Human-level control through deep reinforcement learning: Q-learning with convolutional networks for playing Atari.
Van Hasselt, Guez, Silver. (2015). Deep reinforcement learning with double Q-learning: a very effective trick to improve performance of deep Q-learning.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In ICLR, 2016.
Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas (2016). Dueling network architectures for deep reinforcement learning: separates value and advantage estimation in Q-function.