Machine Learning
Lecture 16: Value-Based Deep Reinforcement Learning

Nevin L. Zhang
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
This set of notes is based on the references listed at the end and internet resources.
Introduction
Outline
1 Introduction
2 Training Target
3 Experience Replay
4 DQN in Atari
5 Improvements to DQN
  Double DQN
  Prioritized Experience Replay
Motivation
Q-learning:
It is difficult to represent Q(s, a) as a lookup table when the number of states is large.
Atari games: a state is an image with 210 × 160 pixels, and each pixel has 128 possible color values. The total number of possible states is 128^{210×160}.
It cannot determine actions for states never visited (and there are many such states).
Deep Q-Networks
In DQN, we aim to approximate the optimal action-value function Q∗(s, a) using a neural network with parameters θ:
Q(s, a; θ) ≈ Q∗(s, a)
The function is defined over all possible states (images) without enumerating them.
So there is information for picking actions at any state.
Key question: How to learn θ from experiences with the environment?
s0, a0, r0, s1, a1, r1, s2, a2, r2, . . .
Training Target
Learning DQN
Q-learning:
We will learn θ iteratively, just as Q(s, a) is learned iteratively in Q-learning.
As the first step toward a DQN learning algorithm, we can change the Q-value update step into a θ update step:
θ ← update(θ; s, a, s′, r)
To do so, we need to come up with an objective function and update θ using its gradient.
Learning from an experience tuple e = (s, a, s′, r)
We want to change θ so that Q(s, a; θ) gets closer to Q∗(s, a).
The problem is that we do not know the true target Q∗(s, a).
So, we use the TD target:
y = r(s, a) + γ max_{a′} Q(s′, a′; θ)
Let Q′ be the result if we run one value iteration on Q(s, a; θ). It is closer to Q∗ than Q.
As explained in the previous lecture, y is an unbiased estimate of Q′ based on the experience tuple.
Our target is now to push Q(s, a; θ) toward the TD target. So, we use the following loss function:
L(θ) = (y − Q(s, a; θ))² = ([r(s, a) + γ max_{a′} Q(s′, a′; θ)] − Q(s, a; θ))²   (1)
The update rule for θ is:
θ ← θ − α∇θ([r(s, a) + γ max_{a′} Q(s′, a′; θ)] − Q(s, a; θ))²   (2)
where α is the learning rate.
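To make the update concrete, here is a minimal PyTorch-style sketch of loss (1) and update (2) for a single experience tuple. It is illustrative only: q_net, optimizer, and the tensor shapes are assumptions, and the TD target is treated as a constant during differentiation (the semi-gradient commonly used in practice).

```python
import torch

def dqn_update(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One gradient step on the squared TD error for a single tuple (s, a, s', r)."""
    q_sa = q_net(s)[a]                        # Q(s, a; theta)
    with torch.no_grad():                     # treat the TD target as a constant
        y = r + gamma * q_net(s_next).max()   # y = r(s, a) + gamma * max_a' Q(s', a'; theta)
    loss = (y - q_sa) ** 2                    # loss (1)
    optimizer.zero_grad()
    loss.backward()                           # gradient step of (2)
    optimizer.step()
    return loss.item()
```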
Moving Target
The TD target y = r(s, a) + γ max_{a′} Q(s′, a′; θ) depends on the parameters θ.
Every time the parameters are updated, the target is also changed.
So, learning θ with update rule (2) is like chasing a moving target.
This makes learning unstable.
Target Network
To deal with the moving target issue, create a target network that has the same structure as the DQN. Let θ− be its parameters.
Use the target network to compute the TD target y.
Set θ− = θ once in a while (every C steps).
So, we have this new update rule:
θ ← θ − α∇θ([r(s, a) + γ max_{a′} Q(s′, a′; θ−)] − Q(s, a; θ))²   (3)
This way, we are solving a sequence of well-defined optimization problems.
Deep Q-Learning with Target Network
Repeat:
Take action a in the current state s, observe r and s′.
Update the parameters:
θ ← θ − α∇θ([r(s, a) + γ max_{a′} Q(s′, a′; θ−)] − Q(s, a; θ))²
s ← s′
θ− ← θ every C steps.
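A sketch of this loop in Python, under assumed names (q_net, select_action, env_step, num_steps, and C are placeholders; the environment interface and the exploration policy are not specified on the slide):

```python
import copy
import torch

target_net = copy.deepcopy(q_net)   # theta^- starts as a copy of theta

for step in range(num_steps):
    a = select_action(q_net, s)                         # e.g. epsilon-greedy (assumed)
    s_next, r = env_step(s, a)                          # observe r and s'
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max()        # TD target computed with theta^-
    loss = (y - q_net(s)[a]) ** 2
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    s = s_next
    if (step + 1) % C == 0:
        target_net.load_state_dict(q_net.state_dict())  # theta^- <- theta every C steps
```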
Experience Replay
In Deep Q-Learning with a target network, only the latest experience tuple is used for each parameter update.
The idea of experience replay is to:
Store experience tuples in a buffer.
Use a random minibatch of experience tuples for each parameter update.
Deep Q-Learning with Target Network and Experience Replay
Repeat:
Take action a in the current state s, observe r and s′; add the experience tuple (s, a, s′, r) to a buffer D; s ← s′.
Sample a minibatch B = {(s_j, a_j, s′_j, r_j)} from D.
Update the parameters:
θ ← θ − α∇θ Σ_j ([r(s_j, a_j) + γ max_{a′_j} Q(s′_j, a′_j; θ−)] − Q(s_j, a_j; θ))²
θ− ← θ every C steps.
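A sketch of the minibatch update with a replay buffer. It assumes each field of the stored tuples is already a tensor (state tensors, int64 actions, float rewards); the buffer size and batch size are illustrative values, not from the slide.

```python
import random
from collections import deque
import torch

buffer = deque(maxlen=100_000)   # replay memory D; the oldest tuples are dropped when it is full

def replay_update(q_net, target_net, optimizer, gamma=0.99, batch_size=32):
    """One parameter update from a random minibatch of (s, a, s', r) tuples."""
    batch = random.sample(buffer, batch_size)
    s, a, s_next, r = (torch.stack(x) for x in zip(*batch))
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q(s_j, a_j; theta)
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values     # targets use theta^-
    loss = ((y - q_sa) ** 2).sum()                               # sum over the minibatch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```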
Advantages of Experience Replay
Without experience replay, the experience tuples used in consecutive parameter updates are strongly correlated. Experience replay breaks the correlations and hence reduces the variance of the updates.
Experience replay avoids oscillations or divergence in the parameters.
With experience replay, each experience tuple is potentially used multiple times. Hence better data efficiency.
With experience replay, each update improves Q(s, a; θ) using signals from multiple points (experience tuples). Hence faster convergence.
A More General View of DQN
Processes 1 and 3 run at the same pace, while Process 2 is slower.
Old experiences are discarded when the buffer is full.
DQN in Atari
Atari Games
Space Invaders
DQN in Atari
End-to-end learning of values Q(s, a) from pixels s.
Input state s is a stack of raw pixels from the last 4 frames.
Output is Q(s, a) for 18 joystick/button positions.
Reward is the change in score for that step.
Network architecture and hyperparameters are fixed across all games.
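A sketch of such a Q-network in PyTorch. The layer sizes follow the commonly cited DQN architecture and assume frames preprocessed to 84×84 grayscale; these details are assumptions, not stated on the slide.

```python
import torch.nn as nn

class AtariQNet(nn.Module):
    """Maps a stack of 4 preprocessed frames to one Q-value per action."""
    def __init__(self, num_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # input: 4 stacked frames
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # assumes 84x84 input frames
            nn.Linear(512, num_actions),                            # Q(s, a) for each action
        )

    def forward(self, x):
        return self.net(x)
```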
Comparable with or superior to humans at the majority of games.
Results on Breakout
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Improvements to DQN
Double DQN
Overestimate of Value function
In (3), we estimate the TD target as follows:
y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−)
The subscript Q indicates that y_Q is the estimate used in Q-learning (DQN).
The Q-values used are not accurate, and are influenced by random factors denoted as ω−. So, we can write:
y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−, ω−)
If we run the process many times and take the average of the y values, we get
E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)]
Ideally, we would like to first get a robust estimate of Q by taking the average over many runs, i.e.,
E_{ω−}[Q(s′, a′; θ−, ω−)]
and then use it to estimate the TD target:
y = r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]
On average, y_Q overestimates the ideal value y because
E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)]
≥ r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]
= y
Note that, for any two random variables X1 and X2, E[max(X1, X2)] ≥ max(E[X1], E[X2]).
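A small numerical illustration of this inequality (not from the slides): two equally noisy estimates of a quantity whose true value is 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(loc=1.0, scale=1.0, size=1_000_000)   # noisy estimate of a quantity that is truly 1.0
x2 = rng.normal(loc=1.0, scale=1.0, size=1_000_000)   # a second, independent noisy estimate

print(np.maximum(x1, x2).mean())   # E[max(X1, X2)], roughly 1.56: taking the max inflates the estimate
print(max(x1.mean(), x2.mean()))   # max(E[X1], E[X2]), roughly 1.0: the noise-free ideal
```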
Double DQN
In Double DQN, the next action is selected with the current network θ and evaluated with the target network θ−:
a∗ = arg max_{a′} Q(s′, a′; θ, ω)
y_{DoubleQ} = r(s, a) + γ Q(s′, a∗; θ−, ω−)
Taking the average, we get
E_{ω−}[y_{DoubleQ}] = r(s, a) + γ E_{ω−}[Q(s′, a∗; θ−, ω−)]
This is closer to
y = r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]
than
E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)]
An Analogy
TD target in DQN:
y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−)
The second term estimates the future optimal value of s′ in two steps:
Pick the next action a′, and
Evaluate Q(s′, a′).
In both steps, we use θ−.
Task: Estimate the highest exam score of a class given the current situation s′.
Method 1: Ask one teacher θ− to
Estimate the score of each student a′: Q(s′, a′; θ−).
Pick the best student: a∗ = arg max_{a′} Q(s′, a′; θ−).
Report Q(s′, a∗; θ−).
Problem: The best student picked by teacher θ− might not be the best after all. The score for a∗ is overestimated, and hence the max score of the class is overestimated.
Method 2:
Ask one teacher θ to pick the best student: a∗ = arg max_{a′} Q(s′, a′; θ).
Ask another teacher θ− to estimate her score: Q(s′, a∗; θ−).
Double DQN
Update rule for DQN:
θ ← θ − α∇θ([r(s, a) + γ max_{a′} Q(s′, a′; θ−)] − Q(s, a; θ))²
Or, equivalently:
θ ← θ − α∇θ([r(s, a) + γ Q(s′, a∗; θ−)] − Q(s, a; θ))²
where a∗ = arg max_{a′} Q(s′, a′; θ−).
Update rule for Double DQN:
θ ← θ − α∇θ([r(s, a) + γ Q(s′, a∗; θ−)] − Q(s, a; θ))²   (4)
where a∗ = arg max_{a′} Q(s′, a′; θ).
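Side by side, the two targets differ only in which network picks a∗. A hedged sketch (function names and tensor shapes are assumptions; both would normally be called under torch.no_grad()):

```python
def dqn_target(target_net, r, s_next, gamma=0.99):
    # DQN: theta^- both selects and evaluates the next action
    return r + gamma * target_net(s_next).max(dim=1).values

def double_dqn_target(q_net, target_net, r, s_next, gamma=0.99):
    # Double DQN, rule (4): the online network theta selects a*,
    # and the target network theta^- evaluates it
    a_star = q_net(s_next).argmax(dim=1, keepdim=True)
    return r + gamma * target_net(s_next).gather(1, a_star).squeeze(1)
```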
DQN versus Double DQN
Prioritized Experience Replay
Why Prioritized Experience Replay?
In DQN and Double DQN, experience transitions are uniformly sampled from a replay memory for the purpose of Q-function regression (Process 3).
This approach does not take the significance of the experience transitions into consideration.
Prioritized Experience Replay
The significance of an experience tuple (s, a, s′, r) is measured using the TD error:
δ = [r(s, a) + γ max_{a′} Q(s′, a′; θ)] − Q(s, a; θ)
Let {(s_i, a_i, s′_i, r_i)} be the collection of experience tuples in the replay memory.
For each update of θ, experience tuples are sampled using the following distribution:
P(i) = p_i^α / Σ_k p_k^α
p_i is a function of δ_i: p_i = |δ_i| + ε or p_i = 1/rank(i).
α determines how much prioritization is used; α = 0 amounts to uniform sampling.
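A sketch of the proportional variant of this sampling scheme (the α and ε values are illustrative, not from the slide; the rank-based variant would replace p with 1/rank):

```python
import numpy as np

def sample_indices(td_errors, batch_size, alpha=0.6, eps=1e-6):
    """Sample tuple indices with probability P(i) proportional to p_i^alpha, where p_i = |delta_i| + eps."""
    p = np.abs(td_errors) + eps
    probs = p ** alpha / np.sum(p ** alpha)
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

# Example: tuples with larger TD errors are sampled more often
deltas = np.array([0.1, 2.0, 0.5, 0.05])
print(sample_indices(deltas, batch_size=3))
```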
Prioritized Experience Replay Improves Double DQN in Most Cases
Code Example
github.com/vmayoral/basic_reinforcement_learning/blob/master/tutorial6/README.md
References
Sergey Levine. Deep Reinforcement Learning, http://rail.eecs.berkeley.edu/deeprlcourse/, 2018.
David Silver. Tutorial: Deep Reinforcement Learning, http://hunch.net/~beygel/deep_rl_tutorial.pdf, 2016.
Tambet Matiisen. Demystifying Deep Reinforcement Learning. https://ai.intel.com/demystifying-deep-reinforcement-learning/, 2015.
Watkins. (1989). Learning from delayed rewards: introduces Q-learning.
Riedmiller. (2005). Neural fitted Q-iteration: batch-mode Q-learning with neural networks.
Mnih et al. (2013). Human-level control through deep reinforcement learning: Q-learning with convolutional networks for playing Atari.
Van Hasselt, Guez, Silver. (2015). Deep reinforcement learning with double Q-learning: a very effective trick to improve performance of deep Q-learning.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In ICLR, 2016.
Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas (2016). Dueling network architectures for deep reinforcement learning: separates value and advantage estimation in Q-function.