
Safe and efficient off-policy reinforcement learning

Remi Munos, Tom Stepleton, Anna Harutyunyan, Marc Bellemare

Remi Munos Google DeepMind

Two desired properties of a RL algorithm:

● Off-policy learning
  ○ use memory replay
  ○ do exploration

● Use multi-step returns
  ○ propagate rewards rapidly
  ○ avoid accumulation of approximation/estimation errors

Ex: Q-learning (and DQN) is off-policy but does not use multi-step returns. Policy gradient (and A3C) uses returns but is on-policy.

Both properties are important in deepRL. Can we have both simultaneously?


Off-policy reinforcement learning

Behavior policy μ, target policy π

Observe trajectory (x_t, a_t, r_t)_{t≥0}

where a_t ∼ μ(·|x_t), r_t = r(x_t, a_t), and x_{t+1} ∼ p(·|x_t, a_t)

Goal:
● Policy evaluation: estimate Q^π(x, a) = E_π[ Σ_{t≥0} γ^t r_t | x_0 = x, a_0 = a ]

● Optimal control: estimate Q*(x, a) = max_π Q^π(x, a)


Off-policy credit assignment problem

Behavior policy μ, target policy π

Can we use the TD error δ_t = r_t + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t) to estimate Q^π(x_s, a_s) for all s ≤ t?


Importance sampling

Reweight the traces by the product of IS ratios:

ΔQ(x_s, a_s) = Σ_{t≥s} γ^{t−s} ( ∏_{i=s+1}^{t} π(a_i|x_i)/μ(a_i|x_i) ) δ_t


Unbiased estimate of Q^π, but large (possibly infinite) variance: not efficient! [Precup, Sutton, Singh, 2000], [Mahmood, Yu, White, Sutton, 2015], ...
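To make the variance problem concrete, here is a small numeric sketch (illustrative numbers, not from the slides): even with benign, bounded IS ratios, the product weighting the correction at lag t has a variance that grows exponentially with t.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative only: i.i.d. IS ratios pi/mu drawn from {0.5, 2.0} with equal probability.
# The correction at lag t is weighted by the product of t such ratios, and the variance
# of that product explodes with t.
ratios = rng.choice([0.5, 2.0], size=(100_000, 20))   # 100k simulated trajectories, 20 steps
products = np.cumprod(ratios, axis=1)                 # IS weight after 1..20 steps
for t in (1, 5, 10, 20):
    print(f"lag t={t:2d}:  var of IS weight = {products[:, t - 1].var():.3e}")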


Qπ(λ) algorithm [Harutyunyan, Bellemare, Stepleton, Munos, 2016]

Cut the traces by a constant: c_s = λ


Works if μ and π are sufficiently close: λ < (1 − γ) / (γ ε), with ε := max_x ‖π(·|x) − μ(·|x)‖₁

May not work otherwise: not safe!
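A tiny numeric check of the closeness condition as reconstructed above; the exact form of the condition and the policies below are assumptions for illustration, not part of the slides.

import numpy as np

def qpi_lambda_is_safe(pi, mu, lam, gamma):
    """Check the (reconstructed) sufficient condition for Q^pi(lambda):
    lambda < (1 - gamma) / (gamma * eps),  eps = max_x sum_a |pi(a|x) - mu(a|x)|."""
    eps = np.abs(pi - mu).sum(axis=1).max()
    return lam * gamma * eps < 1.0 - gamma

# Illustrative policies over 2 states x 3 actions:
pi = np.array([[0.8, 0.1, 0.1],
               [0.1, 0.8, 0.1]])
mu = np.array([[0.4, 0.3, 0.3],
               [0.3, 0.4, 0.3]])
print(qpi_lambda_is_safe(pi, mu, lam=1.0, gamma=0.99))   # False: too off-policy for lambda = 1
print(qpi_lambda_is_safe(pi, mu, lam=0.01, gamma=0.5))   # True: small lambda / small gamma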


Tree backup TB(λ) algorithm [Precup, Sutton, Singh, 2000]

Reweight the traces by the product of target probabilities: c_s = λ π(a_s|x_s)


Works for arbitrary policies π and μ, but cuts the traces unnecessarily when (close to) on-policy: not efficient!
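A small numeric illustration of that inefficiency (illustrative values, not from the slides): even when μ = π, the TB(λ) trace multiplies in π(a_s|x_s) ≤ 1 at every step, so the return is truncated although no correction is needed.

import numpy as np

lam = 1.0
pi_taken = np.full(10, 0.25)           # on-policy: pi = mu, uniform over 4 actions
tb_traces = lam * pi_taken             # TB(lambda): c_s = lambda * pi(a_s|x_s)
print(np.cumprod(tb_traces))           # decays as 0.25^t even though we are on-policy
# Full importance sampling would give c_s = pi/mu = 1 here, i.e. no truncation at all.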


General off-policy return-based algorithm:

ΔQ(x_s, a_s) = Σ_{t≥s} γ^{t−s} (c_{s+1} ⋯ c_t) δ_t,   with δ_t = r_t + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t)

Algorithm     Trace coefficient c_s              Problem
IS            π(a_s|x_s) / μ(a_s|x_s)            high variance
Qπ(λ)         λ                                  not safe (off-policy)
TB(λ)         λ π(a_s|x_s)                       not efficient (on-policy)
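Below is a minimal tabular sketch of this general update with the trace coefficient left as a plug-in; the interface (a target_policy(x) returning π(·|x) and stored behavior_probs giving μ(a_t|x_t)) is an assumption for illustration, not the paper's implementation.

import numpy as np

def off_policy_return_update(Q, trajectory, target_policy, behavior_probs,
                             trace_fn, lam=1.0, gamma=0.99, alpha=0.1):
    """One sweep of  Q(x_s,a_s) += alpha * sum_{t>=s} gamma^(t-s) (c_{s+1}...c_t) delta_t,
    where delta_t = r_t + gamma * E_pi Q(x_{t+1},.) - Q(x_t,a_t) and c_t = trace_fn(...)."""
    T = len(trajectory)
    deltas, traces = np.empty(T), np.empty(T)
    for t, (x, a, r, x_next) in enumerate(trajectory):
        pi_x = target_policy(x)                                   # pi(.|x_t)
        deltas[t] = r + gamma * target_policy(x_next) @ Q[x_next] - Q[x, a]
        traces[t] = trace_fn(pi_x[a], behavior_probs[t], lam)     # c_t

    for s, (x, a, _, _) in enumerate(trajectory):
        weight, total = 1.0, 0.0
        for t in range(s, T):
            if t > s:
                weight *= traces[t]                               # product c_{s+1}...c_t
            total += gamma ** (t - s) * weight * deltas[t]
        Q[x, a] += alpha * total
    return Q

# The three trace choices from the table (pi_a = pi(a_s|x_s), mu_a = mu(a_s|x_s)):
trace_is = lambda pi_a, mu_a, lam: pi_a / mu_a      # IS           -> high variance
trace_ql = lambda pi_a, mu_a, lam: lam              # Q^pi(lambda) -> not safe off-policy
trace_tb = lambda pi_a, mu_a, lam: lam * pi_a       # TB(lambda)   -> cuts traces on-policy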


Off-policy policy evaluation:

Theorem 1: Assume a finite state space. Generate trajectories according to the behavior policy μ. Update all states along the trajectories according to:

Q(x_s, a_s) ← Q(x_s, a_s) + α Σ_{t≥s} γ^{t−s} (c_{s+1} ⋯ c_t) δ_t

Assume all states are visited infinitely often. Under the usual stochastic-approximation assumptions, if 0 ≤ c_s ≤ π(a_s|x_s)/μ(a_s|x_s), then Q → Q^π a.s.

Sufficient conditions for a safe algorithm (works for any π and μ)


Off-policy return-based operator

Lemma: Assume the traces satisfy 0 ≤ c_s ≤ π(a_s|x_s) / μ(a_s|x_s).

Then the off-policy return-based operator:

R Q(x, a) := Q(x, a) + E_μ [ Σ_{t≥0} γ^t (c_1 ⋯ c_t) δ_t ]   (trajectory started at x_0 = x, a_0 = a and generated by μ)

is a contraction mapping (whatever π and μ) and Q^π is its fixed point.


Proof [part 1]

Write ΔQ := Q − Q^π. Expanding the operator, R Q(x, a) − Q^π(x, a) is a linear combination of the values of ΔQ, weighted by non-negative coefficients (non-negativity follows from c_s ≤ π(a_s|x_s)/μ(a_s|x_s)) which sum to...


Proof [part 2]

Sum of the coefficients: η = 1 − (1 − γ) E_μ [ Σ_{t≥0} γ^t (c_1 ⋯ c_t) ] ≤ γ

Thus ‖R Q − Q^π‖ ≤ η ‖Q − Q^π‖ ≤ γ ‖Q − Q^π‖: R is a contraction and Q^π is its fixed point.


Tradeoff for trace coefficients

● Contraction coefficient of the expected operator: η = 1 − (1 − γ) E_μ [ Σ_{t≥0} γ^t (c_1 ⋯ c_t) ]
  ○ η = γ when c_s = 0 (one-step Bellman update)
  ○ η = 0 when c_s = π(a_s|x_s)/μ(a_s|x_s) (full Monte-Carlo rollouts)

● Variance of the estimate (can be infinite for c_s = π(a_s|x_s)/μ(a_s|x_s))
  ○ Large c_s: uses multi-step returns, but large variance
  ○ Small c_s: low variance, but does not use multi-step returns


Our recommendation:

Use Retrace(λ), defined by the trace coefficient

c_s = λ min(1, π(a_s|x_s) / μ(a_s|x_s))

Properties:
● Low variance since c_s ≤ 1

● Safe (off-policy): cuts the traces when needed

● Efficient (on-policy): but only when needed. Note that c_s = λ when on-policy (π = μ).
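As a one-liner (names are illustrative), which can be plugged into the general update sketched earlier to obtain a tabular Retrace(λ) evaluation step:

def retrace_trace(pi_a, mu_a, lam=1.0):
    """Retrace(lambda): c_s = lambda * min(1, pi(a_s|x_s) / mu(a_s|x_s))."""
    return lam * min(1.0, pi_a / mu_a)

# Low variance: c_s <= lambda <= 1, so products of traces never blow up.
# Safe:         the trace shrinks exactly when the taken action is unlikely under pi.
# Efficient:    on-policy (pi = mu) the ratio is 1, so c_s = lambda, as in TD(lambda).

Usage with the earlier sketch: off_policy_return_update(Q, trajectory, target_policy, behavior_probs, trace_fn=retrace_trace).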


Summary

Algorithm     Trace coefficient c_s                    Problem
IS            π(a_s|x_s) / μ(a_s|x_s)                  high variance
Qπ(λ)         λ                                        not safe (off-policy)
TB(λ)         λ π(a_s|x_s)                             not efficient (on-policy)
Retrace(λ)    λ min(1, π(a_s|x_s) / μ(a_s|x_s))        none!


Retrace(λ) for optimal control

Let (μ_k) and (π_k) be sequences of behavior and target policies, and (Q_k) the resulting sequence of Q-functions.

Theorem 2: Under the previous assumptions (+ a technical assumption), assume the (π_k) are “increasingly greedy” wrt (Q_k). Then, a.s., Q_k → Q*.


Remarks

● If the (π_k) are greedy policies, then Retrace(λ) reduces to Watkins’ Q(λ) → convergence of Watkins’ Q(λ) to Q*

(open problem since 1989)

● “Increasingly greedy” allows for smoother traces thus faster convergence

● The behavior policies (μ_k) do not need to become greedy wrt (Q_k) → no GLIE assumption (Greedy in the Limit with Infinite Exploration)

(first return-based algorithm converging to Q* without GLIE)
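A tiny check of the first remark above (illustrative numbers, not from the slides): with a greedy target policy, π(a|x) is 1 on the greedy action and 0 elsewhere, so the Retrace coefficient collapses to Watkins-style trace cutting.

import numpy as np

def retrace_trace(pi_a, mu_a, lam=1.0):
    return lam * min(1.0, pi_a / mu_a)

q_x  = np.array([0.1, 0.7, 0.2])     # illustrative Q(x, .) values
mu_a = 0.25                          # behavior probability of the taken action
for a in range(3):
    pi_a = 1.0 if a == int(np.argmax(q_x)) else 0.0
    print(a, retrace_trace(pi_a, mu_a, lam=0.9))
# -> 0.9 on the greedy action, 0.0 otherwise: the trace is cut as in Watkins' Q(lambda).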


Retrace for deepRL

Retrace is particularly suitable for deepRL as it is off-policy (useful for exploration and when using memory replay) and uses multi-step returns (fast propagation of rewards).

It automatically adjusts the length of the return to the degree of “off-policy-ness”.

Typical uses:

→ Memory replay, e.g. DQN [Mnih et al., 2015] (behavior policy = policy stored in memory, target policy = current policy)

→ On-line updates, e.g. A3C [Mnih et al., 2016] (behavior policy = exploratory policy, target policy = evaluation policy)
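For the memory-replay case, here is a minimal sketch of how Retrace regression targets can be computed backward over a replayed sequence of length T; the array names and interface are assumptions for illustration, not the implementation behind the experiments.

import numpy as np

def retrace_targets(q_values, next_q_values, target_probs, next_target_probs,
                    behavior_probs, actions, rewards, dones, gamma=0.99, lam=1.0):
    """Backward recursion  D_t = delta_t + gamma * c_{t+1} * D_{t+1},
    target_t = Q(x_t, a_t) + D_t,  with Retrace traces c_t = lam * min(1, pi/mu).

    Shapes (T = sequence length, A = number of discrete actions):
      q_values (T, A): Q(x_t, .)        next_q_values (T, A): Q(x_{t+1}, .)
      target_probs (T, A): pi(.|x_t)    next_target_probs (T, A): pi(.|x_{t+1})
      behavior_probs (T,): mu(a_t|x_t) stored in the replay buffer
      actions (T,) int, rewards (T,) float, dones (T,) float in {0, 1}
    """
    T = q_values.shape[0]
    idx = np.arange(T)
    q_taken = q_values[idx, actions]                              # Q(x_t, a_t)
    exp_q_next = (next_target_probs * next_q_values).sum(-1)      # E_pi Q(x_{t+1}, .)
    deltas = rewards + gamma * (1.0 - dones) * exp_q_next - q_taken
    traces = lam * np.minimum(1.0, target_probs[idx, actions] / behavior_probs)

    targets = np.empty(T)
    correction = 0.0                                              # D past the last step is 0
    for t in reversed(range(T)):
        next_trace = traces[t + 1] if t + 1 < T else 0.0
        correction = deltas[t] + gamma * (1.0 - dones[t]) * next_trace * correction
        targets[t] = q_taken[t] + correction
    return targets   # regression targets for Q(x_t, a_t)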


Atari 2600 environments: Retrace vs DQN

Games: Asteroids, Defender, Demon Attack, Hero, Krull, River Raid, Space Invaders, Star Gunner, Wizard of Wor, Zaxxon


Experiments on 60 Atari games


Comparison of the traces: Retrace vs TB

Retrace: c_s = λ min(1, π(a_s|x_s) / μ(a_s|x_s))

TB: c_s = λ π(a_s|x_s)


Conclusions

● General update rule for off-policy return-based RL
● Conditions under which an algorithm is safe and efficient
● We recommend using Retrace(λ):
  ○ Converges to Q*
  ○ Safe: cuts the traces when needed
  ○ Efficient: but only when needed
  ○ Works for policy evaluation and for control
  ○ Particularly suited for deepRL

Extensions:
● Works in continuous action spaces
● Can be used in off-policy policy gradient [Wang et al., 2016]

If you want to know more, come to our poster at NIPS!


References (mentioned in the slides):

● [Harutyunyan, Bellemare, Stepleton, Munos, 2016] Q(λ) with Off-Policy Corrections

● [Mahmood, Yu, White, Sutton, 2015] Emphatic Temporal-Difference Learning

● [Mnih et al., 2015] Human Level Control Through Deep Reinforcement Learning

● [Mnih et al., 2016] Asynchronous Methods for Deep Reinforcement Learning

● [Munos, Stepleton, Harutyunyan, Bellemare, 2016] Safe and Efficient Off-Policy Reinforcement Learning

● [Precup, Sutton, Singh, 2000] Eligibility traces for off-policy policy evaluation

● [Sutton, 1988] Learning to predict by the methods of temporal differences


Temporal credit assignment problem

Behavior policy μ

Can a “surprise” (the TD error δ_t) be used to train the value estimates at all states x_s, for s ≤ t?


Temporal difference learning TD(λ) [Sutton, 1988]

Behavior policy
