Safety in Sequential Decision Making

Yinlam Chow

Google DeepMind

Reinforcement Learning (RL)

● Combines machine learning with decision making

● Interaction modeled by a Markov decision process (MDP)
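For reference, the MDP formulation the talk builds on, written out explicitly (the notation is mine, not from the slides): an MDP is the tuple

\mathcal{M} = (\mathcal{X}, \mathcal{A}, P, r, \gamma),

and RL seeks a policy \pi maximizing the expected discounted return

\max_{\pi} \; \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r(x_t, a_t) \;\middle|\; x_0, \pi \right].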

Successful Use Cases of RL [V. Mnih et al., NIPS 13; D. Silver et al., Nature 17]

Some Challenges to Apply RL in Real-world [A. Irpan 18]

● Noisy data

● Training with insufficient data

● Unknown reward functions

● Robust models w.r.t. uncertainty?

● Safety guarantees in RL?

● Safe exploration

Safety Problems in RL

Studied the following safety problems:

1. Safety w.r.t. Baseline

○ Model-free

○ Model-based

○ Apprenticeship learning

2. Safety w.r.t. Environment Constraints

3. Robust and Risk-sensitive RL (skip here)

Safety w.r.t. Baseline

Question: How do we train an RL policy offline that performs no worse than a baseline policy?

Safety w.r.t. Baseline: Model-free Approach

Problem: Safe off-policy optimization [P. Thomas et al., AAAI 15 & ICML 15]

Recent work: More Robust Doubly Robust Off-policy Evaluation [M. Farajtabar, Y. Chow, M. Ghavamzadeh, ICML 18]
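As background for this line of work, here is a minimal sketch of the step-wise doubly robust (DR) off-policy estimator that the ICML 18 paper builds on. The trajectory format and the model names (q_hat, v_hat) are illustrative assumptions, not the paper's code.

def doubly_robust_value(trajectory, pi_e, pi_b, q_hat, v_hat, gamma=0.99):
    """Step-wise DR estimate of the target policy's value from one behavior trajectory.
    trajectory: list of (state, action, reward); pi_e/pi_b return action probabilities
    under the target and behavior policies; q_hat/v_hat are learned value models."""
    v_dr, rho, discount = 0.0, 1.0, 1.0
    for (s, a, r) in trajectory:
        # Model-based baseline term, weighted by the importance ratio up to step t-1.
        v_dr += discount * rho * v_hat(s)
        # Update the cumulative importance ratio to include step t.
        rho *= pi_e(a, s) / pi_b(a, s)
        # Importance-weighted correction of the model's Q estimate.
        v_dr += discount * rho * (r - q_hat(s, a))
        discount *= gamma
    return v_dr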

Safety w.r.t. Baseline: Model-based Approach

Problem: Sim-to-real with transferable guarantees in performance

Recent work: Safe Policy Improvement by Minimizing Robust Baseline Regret [M. Ghavamzadeh, M. Petrik, Y. Chow, NIPS 16]
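Roughly, the objective there is to maximize the worst-case improvement over the baseline across an uncertainty set of models (notation is mine):

\pi^{\ast} \in \arg\max_{\pi} \; \min_{\xi \in \Xi} \; \big( \rho(\pi, \xi) - \rho(\pi_B, \xi) \big),

where \rho(\pi, \xi) is the return of policy \pi in model \xi, \Xi is the set of models consistent with the simulator and data, and \pi_B is the baseline policy.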

Safety and Apprenticeship Learning [P. Abbeel & A. Ng, ICML 04]

Problem: Imitate an expert while being more risk-averse than the expert, without access to a reward function.

Recent work: Risk-Sensitive Generative Adversarial Imitation Learning [J. Lacotte, M. Ghavamzadeh, Y. Chow, M. Pavone, UAI workshop 18, AISTATS 19 (submitted)]
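Risk sensitivity in this setting is typically obtained by replacing an expectation over losses with a coherent risk measure such as conditional value-at-risk (CVaR). For completeness, the standard Rockafellar-Uryasev form of CVaR at level \alpha of a loss Z is

\mathrm{CVaR}_{\alpha}(Z) = \min_{\nu \in \mathbb{R}} \Big\{ \nu + \tfrac{1}{\alpha} \, \mathbb{E}\big[ (Z - \nu)_{+} \big] \Big\}.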

Safety Problems in RL

Studied the following safety problems:

1. Safety w.r.t. Baseline

○ Model-free

○ Model-based

○ Apprenticeship learning

2. Safety w.r.t. Environment Constraints

3. Robust and Risk-sensitive RL (skip here)

The Safe Decision Making Problem

Safety definitions come directly from environment constraints, e.g.,

● Collision avoidance, speed & fuel limits, traffic rules

● System overheating, quality-of-service guarantees

● User satisfaction in online recommendation

A Lyapunov Approach to Safe RL

Collaboration with Ofir Nachum (Brain), Mohammad Ghavamzadeh, Edgar Duenez-Guzman (DMG) -- NIPS 2018, ICLR 2019 (submitted)

Safety w.r.t. Environment Constraints

Overview of Notions of Safety

● Reachability (safety probability):

Safe if agent doesn’t enter an undesirable region (w.h.p.)

Application: Robot motion planning

● Limit Visits to Undesirable States (time spent in dangerous regions):

Safe if agent doesn’t stay in an undesirable region for long

Applications: System (data center) maintenance

NOTE: Constraints are trajectory-based
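One common way to formalize the two notions above (symbols are mine; \mathcal{D} is the undesirable region and \pi the agent's policy):

Reachability: \Pr\big( \exists\, t \le T : x_t \in \mathcal{D} \mid x_0, \pi \big) \le \delta

Limited visits: \mathbb{E}\Big[ \sum_{t=0}^{T} \mathbf{1}\{ x_t \in \mathcal{D} \} \Big] \le d_0

Both are constraints on whole trajectories, matching the note above.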

General Problem Formulation

Modeled by Constrained Markov Decision Process (CMDP)

● Reward: primary return performance

● Constraint cost: model safety constraints

Goals:

1. Find an optimal (feasible) RL agent

2. More restrictive: Guarantee safety during training
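In symbols (notation mine), the CMDP problem is

\max_{\pi} \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} r(x_t, a_t) \Big] \quad \text{s.t.} \quad \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} d(x_t) \Big] \le d_0,

with r the reward, d the constraint cost, and d_0 the safety budget.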

Some Prior Art and Limitations

● Prior art: (e.g., Lagrangian-based methods, constrained policy optimization (CPO), and safety layers [Dalal et al. 2018], which reappear as baselines later in the talk)

● Contributions of the Lyapunov-based method:

○ Safety during training

○ Scalable model-free RL (on-policy and off-policy); less conservative policies

Lyapunov Function and Safety
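Paraphrasing the NIPS 2018 paper (notation approximate, not taken from the slides): a Lyapunov function is a state-dependent budget L : \mathcal{X} \to \mathbb{R}_{\ge 0} satisfying, w.r.t. the baseline policy \pi_B and constraint cost d,

T_{\pi_B, d}[L](x) := d(x) + \gamma \sum_{a} \pi_B(a \mid x) \sum_{x'} P(x' \mid x, a) \, L(x') \;\le\; L(x) \quad \forall x, \qquad L(x_0) \le d_0.

Any policy \pi with T_{\pi, d}[L](x) \le L(x) for all x (an L-induced policy) then satisfies the CMDP constraint, which is what allows safe policy updates.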

Safe Policy Iteration
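A minimal sketch of the per-state improvement step this suggests, assuming Q-values for the reward (q_reward) and for the Lyapunov/constraint function (q_lyapunov) have already been estimated; this illustrates the idea and is not the paper's implementation.

import numpy as np
from scipy.optimize import linprog

def improve_policy_at_state(q_reward, q_lyapunov, l_value):
    """Maximize the expected reward Q-value at one state while keeping the
    action distribution inside the set of Lyapunov-induced (feasible) policies."""
    n = len(q_reward)
    res = linprog(
        c=-np.asarray(q_reward),               # maximize sum_a pi(a) * Q_r(x, a)
        A_ub=np.asarray(q_lyapunov)[None, :],  # sum_a pi(a) * Q_L(x, a) <= L(x)
        b_ub=[l_value],
        A_eq=np.ones((1, n)),                  # pi(.|x) must be a probability distribution
        b_eq=[1.0],
        bounds=[(0.0, 1.0)] * n,
    )
    return res.x  # improved action distribution at this state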

Environment with Discrete Actions

[Figure: 2D GridWorld with agent, goal, and lava cells; agent actions succeed with ε-error]

● Stochastic 2D GridWorld

● Stage-wise cost is 1 for fuel usage

● Goal reward is 1000

● Incurs a constraint cost of 1 in lava; tolerate at most 5 touches

● Experiments: (i) Planning, (ii) RL (explicit position or image obs.)
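The same environment spelled out as a small config sketch, to make the reward/constraint split explicit (field names are mine):

GRIDWORLD_CMDP = dict(
    step_cost=1.0,          # stage-wise fuel cost of 1 per move
    goal_reward=1000.0,     # reward for reaching the goal
    lava_cost=1.0,          # constraint cost of 1 incurred in lava
    constraint_budget=5.0,  # tolerate at most 5 lava touches per episode
    action_error=None,      # actions succeed with ε-error; the value of ε is not given on the slide
)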

Safe Planning with Discrete Actions

[Figure: Lyapunov-based (our method) vs. Dual-LP (exact but expensive)]

Safe Planning with Discrete Actions

[Figure: Lagrangian-based (baseline) vs. unconstrained (baseline)]

From SPI/SVI to Safe Value-based RL

Safe DQN/DPI with Discrete Actions

Baseline: Unsafe during training!

Unconstrained: Always unsafe!

Our Methods: Balance performance and safety

Safe Policy Gradient (Recent Extension)

To handle safety in RL (policy gradient) with continuous actions:

1. Constrained optimization w.r.t. policy parameter (CPO)

2. Embed Lyapunov constraint into policy network via network augmentation (see [Dalal et al. 2018] for simpler setting)
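A rough sketch of option 2, in the spirit of the safety-layer idea in [Dalal et al. 2018]: project the policy's proposed action onto a linearized version of the constraint in closed form, so the correction can sit inside the policy network. The linear coefficients g and b below are assumptions for illustration, not part of the original method description.

import numpy as np

def safety_layer(action, g, b, d0):
    """Project `action` so that the linearized constraint g.a + b <= d0 holds.
    Closed-form least-squares projection for a single linear constraint."""
    action, g = np.asarray(action, dtype=float), np.asarray(g, dtype=float)
    violation = g @ action + b - d0
    if violation <= 0.0:
        return action                          # already feasible, pass through unchanged
    return action - (violation / (g @ g)) * g  # minimal-norm correction onto the constraint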

Safe Policy Gradient

Safe RL for Continuous Control

Objective: Train a MuJoCo HalfCheetah to run stably.

Standard method: DDPG/PPO

Issue: Unstable if it runs too fast!

Remedy: Soft constraint on the torque at the joints

Safe if the total torque violation is bounded (d0 = 50)
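A sketch of how this safety signal could be computed; the per-joint torque threshold is an assumption, only the budget d0 = 50 comes from the slide.

import numpy as np

TORQUE_LIMIT = 1.0   # assumed per-joint threshold (not stated on the slide)
D0 = 50.0            # constraint budget from the slide

def constraint_cost(action):
    """Per-step cost: total torque magnitude above the per-joint limit."""
    return float(np.sum(np.maximum(np.abs(action) - TORQUE_LIMIT, 0.0)))

def is_safe(actions):
    """A rollout satisfies the soft constraint if its cumulative violation is at most d0."""
    return sum(constraint_cost(a) for a in actions) <= D0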

Lagrangian-PPO versus Lyapunov-PPO

[Figure: Lagrangian PPO (baseline) vs. Lyapunov PPO (our method)]

Conclusion

● Contributions:

a. Formulated safety problems as CMDPs

b. Proposed a Lyapunov-based safe RL methodology

■ Work well for on-policy/off-policy settings

■ Applicable to value-based and policy-based algorithms

c. Guarantee* safety during training

● Limitations:

a. Provably optimal only under restricted conditions!

b. Theoretically justified in MDPs, but not yet with function approximation (RL)!

■ Future work in the model-based setting

Current Work

In talks with the following projects at Google & DeepMind:

● PRM-RL for indoor robot navigation (Brain Robotics)

● FineTuner/TF.agents (Brain RL/rSWE)

● DrData (DeepMind)

Acknowledgements

● General:

○ Mohammad Ghavamzadeh (FAIR)

● Safety w.r.t. Baseline:

○ Mehrdad Farajtabar (DMG)

○ Marek Petrik (UNH)

○ Jonathan Lacotte, Marco Pavone (Stanford)

● Safety w.r.t. Environment Constraint:

○ Ofir Nachum (Brain)

○ Edgar Duenez-Guzman (DMG)

○ Aleksandra Faust, Jasmine Hsu, Vikas Sindhwani (Brain Robotics)

Appendix of Lyapunov Work

Highlights of Other Recent Work

1. Robust and Controllable Representation Learning

2. Risk-Sensitive Imitation Learning

3. Minimizing Robust Regret in RL (Sim-to-Real)

4. More Robust Doubly Robust Off-policy Evaluation
