Safe and efficient off-policy reinforcement learning
Remi Munos, Tom Stepleton, Anna Harutyunyan, Marc Bellemare


Page 1

Safe and efficient off-policy reinforcement learning

Remi Munos, Tom Stepleton, Anna Harutyunyan, Marc Bellemare

Page 2

Remi Munos Google DeepMind

Two desired properties of an RL algorithm:

● Off-policy learning
  ○ use memory replay
  ○ do exploration

● Use multi-step returns
  ○ propagate rewards rapidly
  ○ avoid accumulation of approximation/estimation errors

Ex: Q-learning (and DQN) is off-policy but does not use multi-step returns. Policy gradient (and A3C) uses returns but is on-policy.

Both properties are important in deepRL. Can we have both simultaneously?

Page 3

Off-policy reinforcement learning

Behavior policy μ, target policy π

Observe a trajectory (x_0, a_0, r_0, x_1, a_1, r_1, …)

where a_t ~ μ(·|x_t), r_t = r(x_t, a_t), and x_{t+1} ~ P(·|x_t, a_t)

Goal:
● Policy evaluation: estimate Q^π(x, a) = E_π[ Σ_{t≥0} γ^t r_t | x_0 = x, a_0 = a ]

● Optimal control: estimate Q* = max_π Q^π

Page 4

Off-policy credit assignment problem
Behavior policy μ, target policy π

Can we use the TD error δ_t = r_t + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t) to estimate Q^π(x_s, a_s) for all s ≤ t?

Page 5

Importance sampling

Reweight the traces by the product of IS ratios: c_s = π(a_s|x_s) / μ(a_s|x_s)

Page 6

Importance sampling

Unbiased estimate of Q^π

Page 7

Importance sampling

Unbiased estimate of Q^π

Large (possibly infinite) variance → not efficient!

[Precup, Sutton, Singh, 2000], [Mahmood, Yu, White, Sutton, 2015], ...
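
To see why this matters, here is a small illustrative simulation (not from the talk): in a toy two-action problem, each per-step IS ratio has expectation 1, yet the variance of the product of ratios grows exponentially with the trajectory length.

    import numpy as np

    rng = np.random.default_rng(0)

    def is_product_variance(horizon, n_trajs=200_000, pi0=0.9, mu0=0.5):
        """Empirical variance of the product of IS ratios over `horizon` steps,
        with a behavior policy picking action 0 w.p. mu0 and a target policy
        picking it w.p. pi0 (illustrative toy example only)."""
        pi = np.array([pi0, 1.0 - pi0])
        mu = np.array([mu0, 1.0 - mu0])
        products = np.ones(n_trajs)
        for _ in range(horizon):
            actions = (rng.random(n_trajs) > mu0).astype(int)  # actions sampled from mu
            products *= pi[actions] / mu[actions]              # per-step ratio pi/mu
        return products.var()

    for horizon in (1, 5, 10, 20):
        print(f"horizon={horizon:2d}  variance={is_product_variance(horizon):.3g}")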

Page 8

Q^π(λ) algorithm [Harutyunyan, Bellemare, Stepleton, Munos, 2016]

Cut the traces by a constant: c_s = λ

Page 9

Q^π(λ) algorithm [Harutyunyan, Bellemare, Stepleton, Munos, 2016]

works if π and μ are sufficiently close: λ ≤ (1 − γ) / (γ ε), where ε := max_x ‖π(·|x) − μ(·|x)‖₁

Page 10

Q^π(λ) algorithm [Harutyunyan, Bellemare, Stepleton, Munos, 2016]

works if π and μ are sufficiently close: λ ≤ (1 − γ) / (γ ε), where ε := max_x ‖π(·|x) − μ(·|x)‖₁

may not work (can diverge) otherwise → not safe!
(e.g. with γ = 0.99 and λ = 1 this requires ‖π − μ‖₁ ≤ 0.0101: essentially on-policy)

Page 11

Tree backup TB(λ) algorithm [Precup, Sutton, Singh, 2000]

Reweight the traces by the product of target probabilities: c_s = λ π(a_s|x_s)

Page 12

works for arbitrary policies π and μ

Tree backup TB(λ) algorithm [Precup, Sutton, Singh, 2000]

Page 13

works for arbitrary policies π and μ

cuts the traces unnecessarily when (close to) on-policy → not efficient!

Tree backup TB(λ) algorithm [Precup, Sutton, Singh, 2000]

Page 14

General off-policy return-based algorithm:

ΔQ(x, a) = Σ_{t≥0} γ^t (c_1 c_2 ⋯ c_t) ( r_t + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t) ),  with (x_0, a_0) = (x, a)

Algorithm     Trace coefficient c_s                    Problem
IS            π(a_s|x_s) / μ(a_s|x_s)                  high variance
Q^π(λ)        λ                                        not safe (off-policy)
TB(λ)         λ π(a_s|x_s)                             not efficient (on-policy)
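
A minimal tabular sketch of this general update (an illustration, not the authors' code): Q, pi and mu are assumed to be |X|×|A| NumPy arrays, traj is a list of (x, a, r, x_next) transitions generated by μ, and the trace coefficient c_s is left pluggable so that each row of the table corresponds to one choice of trace_coeff.

    import numpy as np

    def general_update(Q, traj, pi, mu, gamma, trace_coeff, alpha=0.5):
        """Off-policy return-based update of Q at the first state-action pair of traj."""
        x0, a0 = traj[0][0], traj[0][1]
        total, discount, trace = 0.0, 1.0, 1.0
        for t, (x, a, r, x_next) in enumerate(traj):
            if t > 0:
                trace *= trace_coeff(pi, mu, x, a)   # running product c_1 ... c_t
            # expected TD error under the target policy pi
            delta = r + gamma * np.dot(pi[x_next], Q[x_next]) - Q[x, a]
            total += discount * trace * delta
            discount *= gamma
        Q[x0, a0] += alpha * total
        return Q

    # The three trace coefficients from the table above:
    def c_is(pi, mu, x, a):                # importance sampling: high variance
        return pi[x, a] / mu[x, a]

    def c_qpi(lam):                        # Q^pi(lambda): constant trace, not safe off-policy
        return lambda pi, mu, x, a: lam

    def c_tb(lam):                         # tree-backup TB(lambda): not efficient on-policy
        return lambda pi, mu, x, a: lam * pi[x, a]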

Page 15

Off-policy policy evaluation:

Theorem 1: Assume a finite state space. Generate trajectories according to the behavior policy μ. Update all states (x_s, a_s) along the trajectories according to:

Q(x_s, a_s) ← Q(x_s, a_s) + α Σ_{t≥s} γ^{t−s} (c_{s+1} ⋯ c_t) ( r_t + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t) )

Assume all states are visited infinitely often. Under the usual stochastic-approximation assumptions, if 0 ≤ c_s ≤ π(a_s|x_s) / μ(a_s|x_s), then Q → Q^π a.s.

Sufficient conditions for a safe algorithm (works for any π and μ)

Page 16

Off-policy return-based operator

Lemma: Assume the traces satisfy 0 ≤ c_s ≤ π(a_s|x_s) / μ(a_s|x_s).

Then the off-policy return-based operator:

ℛQ(x, a) := Q(x, a) + E_μ[ Σ_{t≥0} γ^t (c_1 ⋯ c_t) ( r_t + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t) ) ]

is a contraction mapping (with coefficient at most γ, whatever π and μ) and Q^π is its fixed point.

Page 17

Proof [part 1]

Since ℛQ^π = Q^π,

ℛQ(x, a) − Q^π(x, a) = (Q − Q^π)(x, a) + E_μ[ Σ_{t≥0} γ^t (c_1 ⋯ c_t) ( γ E_π(Q − Q^π)(x_{t+1}, ·) − (Q − Q^π)(x_t, a_t) ) ]

Thus, re-pairing the terms (the t = 0 term cancels (Q − Q^π)(x, a)),

ℛQ(x, a) − Q^π(x, a) = E_μ[ Σ_{t≥0} γ^{t+1} (c_1 ⋯ c_t) ( E_π(Q − Q^π)(x_{t+1}, ·) − c_{t+1} (Q − Q^π)(x_{t+1}, a_{t+1}) ) ]

which is a linear combination of the values (Q − Q^π)(x_{t+1}, b), weighted by non-negative coefficients (non-negative since c_{t+1} ≤ π(b|x_{t+1}) / μ(b|x_{t+1})) which sum to...

Page 18

Proof [part 2]

Sum of the coefficients:

E_μ[ Σ_{t≥0} γ^{t+1} (c_1 ⋯ c_t) (1 − c_{t+1}) ] = γ − (1 − γ) E_μ[ Σ_{t≥1} γ^t (c_1 ⋯ c_t) ] ≤ γ

Thus

‖ℛQ − Q^π‖ ≤ γ ‖Q − Q^π‖, i.e. ℛ is a contraction (with coefficient at most γ) around its fixed point Q^π.

Page 19

Tradeoff for trace coefficients

● Contraction coefficient η of the expected operator ℛ:

  η = γ when c_s = 0 (one-step Bellman update)
  η = 0 when c_s = 1 and μ = π (full Monte-Carlo rollouts)

● Variance of the estimate (can be infinite for the IS ratios c_s = π(a_s|x_s) / μ(a_s|x_s))

  Large c_s: uses multi-step returns, but large variance
  Small c_s: low variance, but does not use multi-step returns

Page 20

Our recommendation:

Use Retrace(λ), defined by the trace coefficient

c_s = λ min( 1, π(a_s|x_s) / μ(a_s|x_s) )

Properties:
● Low variance, since c_s ≤ 1 (the product of traces cannot blow up)

● Safe (off-policy): cuts the traces when π and μ disagree, i.e. when needed

● Efficient (on-policy): cuts them only when needed. Note that when μ = π, c_s = λ, so no trace is cut unnecessarily
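
In the same illustrative setting as the sketch on Page 14, Retrace(λ) is simply the general update with the truncated importance ratio as trace coefficient:

    def c_retrace(lam):
        """Retrace(lambda): truncated IS ratio, never larger than lam and only
        cut when pi and mu actually disagree on the sampled action."""
        return lambda pi, mu, x, a: lam * min(1.0, pi[x, a] / mu[x, a])

    # e.g.  Q = general_update(Q, traj, pi, mu, gamma=0.99, trace_coeff=c_retrace(1.0))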

Page 21

Summary

Algorithm      Trace coefficient c_s                      Problem
IS             π(a_s|x_s) / μ(a_s|x_s)                    high variance
Q^π(λ)         λ                                          not safe (off-policy)
TB(λ)          λ π(a_s|x_s)                               not efficient (on-policy)
Retrace(λ)     λ min(1, π(a_s|x_s) / μ(a_s|x_s))          none!

Page 22

Retrace(λ) for optimal control

Let (μ_k) and (π_k) be sequences of behavior and target policies, and (Q_k) the resulting sequence of Retrace iterates.

Theorem 2: Under the previous assumptions (+ a technical assumption), assume the (π_k) are "increasingly greedy" wrt (Q_k). Then, a.s., Q_k → Q*.
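
For intuition, ε_k-greedy target policies with a non-increasing exploration parameter ε_k become greedier over time, in the spirit of this condition; a small illustrative helper (the function names and the schedule ε_k = 1/(k+1) are mine, not from the talk):

    import numpy as np

    def epsilon_greedy(q_row, epsilon):
        """epsilon-greedy action distribution for one state."""
        n = len(q_row)
        probs = np.full(n, epsilon / n)
        probs[np.argmax(q_row)] += 1.0 - epsilon
        return probs

    def target_policy(Q, k):
        """Target policy pi_k: greedier as k grows (epsilon_k is non-increasing).
        The behavior policy mu_k can keep exploring: no GLIE assumption on mu_k."""
        eps_k = 1.0 / (k + 1)
        return np.array([epsilon_greedy(Q[x], eps_k) for x in range(Q.shape[0])])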

Page 23

Remarks

● If the (π_k) are greedy policies, then c_s = λ 𝟙{a_s is greedy} → convergence of Watkins's Q(λ) to Q*

(open problem since 1989)

● "Increasingly greedy" allows for smoother traces, thus faster convergence

● The behavior policies (μ_k) do not need to become greedy wrt (Q_k) → no GLIE assumption (Greedy in the Limit with Infinite Exploration)

(first return-based algorithm converging to Q* without GLIE)

Page 24

Retrace for deepRL

Retrace is particularly suitable for deepRL as it is off-policy (useful for exploration and when using memory replay) and uses multi-step returns (fast propagation of rewards).

It automatically adjusts the length of the return to the degree of "off-policy-ness".

Typical use:
→ Memory replay. Ex: DQN [Mnih et al., 2015]
(behavior policy = the policy stored in memory, target policy = current policy)

→ On-line update. Ex: A3C [Mnih et al., 2016]
(behavior policy = exploratory policy, target policy = evaluation policy)
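
As an illustration of the memory-replay case, here is a minimal sketch (mine, not from the talk) of computing Retrace regression targets over a replayed sequence, assuming the behavior probabilities μ(a_t|x_t) were stored with the transitions; it uses the backward recursion of the Retrace return.

    import numpy as np

    def retrace_targets(q_sa, exp_q_next, rewards, ratios, gamma=0.99, lam=1.0):
        """Retrace targets for a replayed sequence of T transitions.

        q_sa       : (T,) Q(x_t, a_t) for the stored actions
        exp_q_next : (T,) sum_a pi(a|x_{t+1}) Q(x_{t+1}, a)  (0 at terminal steps)
        rewards    : (T,) stored rewards
        ratios     : (T,) pi(a_t|x_t) / mu(a_t|x_t), with mu read from replay memory
        """
        T = len(rewards)
        c = lam * np.minimum(1.0, ratios)            # Retrace trace coefficients
        delta = rewards + gamma * exp_q_next - q_sa  # expected TD errors under pi
        correction = np.zeros(T)
        acc = 0.0
        for t in reversed(range(T)):
            next_c = c[t + 1] if t + 1 < T else 0.0  # no trace beyond the stored sequence
            acc = delta[t] + gamma * next_c * acc
            correction[t] = acc
        return q_sa + correction                     # regression targets for Q(x_t, a_t)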

Page 25

Atari 2600 environments: Retrace vs DQN

Games: Asteroids, Defender, Demon Attack, Hero, Krull, River Raid, Space Invaders, Star Gunner, Wizard of Wor, Zaxxon

Page 26

Experiments on 60 Atari games

Page 27

Comparison of the traces: Retrace vs TB

Retrace: c_s = λ min(1, π(a_s|x_s) / μ(a_s|x_s))

TB: c_s = λ π(a_s|x_s)

Since π(a_s|x_s) ≤ min(1, π(a_s|x_s) / μ(a_s|x_s)), Retrace never cuts the traces more than TB does.
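
A tiny numeric illustration of the difference (mine, not from the slides), for a single nearly on-policy step:

    # Target and behavior policies mostly agree on the sampled action:
    pi_a, mu_a, lam = 0.6, 0.5, 1.0
    retrace_c = lam * min(1.0, pi_a / mu_a)   # = 1.0 : trace kept intact
    tb_c      = lam * pi_a                    # = 0.6 : trace cut even though the
                                              #         data is nearly on-policy
    print(retrace_c, tb_c)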

Page 28

Conclusions

● General update rule for off-policy return-based RL
● Conditions under which an algorithm is safe and efficient
● We recommend using Retrace(λ):

  ○ Converges to Q*
  ○ Safe: cuts the traces when needed
  ○ Efficient: cuts them only when needed
  ○ Works for policy evaluation and for control
  ○ Particularly suited for deepRL

Extensions:
● Works in continuous action spaces
● Can be used in off-policy policy gradient [Wang et al., 2016]

If you want to know more, come to our poster at NIPS!

Page 29

References (mentioned in the slides):

● [Harutyunyan, Bellemare, Stepleton, Munos, 2016] Q(λ) with Off-Policy Corrections

● [Mahmood, Yu, White, Sutton, 2015] Emphatic Temporal-Difference Learning

● [Mnih et al., 2015] Human Level Control Through Deep Reinforcement Learning

● [Mnih et al., 2016] Asynchronous Methods for Deep Reinforcement Learning

● [Munos, Stepleton, Harutyunyan, Bellemare, 2016] Safe and Efficient Off-Policy Reinforcement Learning

● [Precup, Sutton, Singh, 2000] Eligibility traces for off-policy policy evaluation

● [Sutton, 1988] Learning to predict by the methods of temporal differences

Page 30

Temporal credit assignment problem

Behavior policy μ

Can a "surprise" (the TD error δ_t = r_t + γ V(x_{t+1}) − V(x_t)) be used to train V(x_s) for all s ≤ t?

Page 31

Temporal difference learning TD(λ) [Sutton, 1988]

Behavior policy μ