rll.berkeley.edu/deeprlcourse/docs/inverserl.pdf

Inverse Reinforcement Learning
Chelsea Finn
3/5/2017

Course Reminders:
- March 22nd: Project group & title due
- April 17th: Milestone report due & milestone presentations
- April 26th: Beginning of project presentations

Inverse RL: Outline
1. Motivation & Definition
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions

[Figure: a reinforcement learning agent and its reward (Mnih et al. ’15); video from Montessori New Zealand. What is the reward?]

In the real world, humans don’t get a score.

A reward function is essential for RL, but in real-world domains the reward/cost is often difficult to specify:
• robotic manipulation
• autonomous driving
• dialog systems
• virtual assistants
• and more…

(Tesauro ’95, Kohl & Stone ’04, Mnih et al. ’15, Silver et al. ’16)

Motivation

Behavioral Cloning: mimic the actions of the expert
- but no reasoning about outcomes or dynamics
- the expert might have different degrees of freedom

Can we reason about what the expert is trying to achieve?

Inverse Optimal Control / Inverse Reinforcement Learning (IOC/IRL): infer the cost/reward function from expert demonstrations
(Kalman ’64, Ng & Russell ’00)

Inverse Optimal Control / Inverse Reinforcement Learning: infer the cost/reward function from demonstrations

Challenges:
- underdefined problem
- difficult to evaluate a learned cost
- demonstrations may not be precisely optimal

Given:
- state & action space
- roll-outs from the expert policy π*
- dynamics model [sometimes]

Goal:
- recover the reward function
- then use the reward to get a policy

Compare to DAgger: no direct access to π*

Early IRL Approaches

All approaches alternate between solving the MDP w.r.t. the current cost and updating the cost:

Ng & Russell ’00: expert actions should have higher value than other actions; a larger gap is better

Abbeel & Ng ’04: the policy that is optimal w.r.t. the learned cost should match the feature counts of the expert trajectories (see the sketch below)

Ratliff et al. ’06: max-margin formulation between the value of expert actions and other actions
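To make the Abbeel & Ng feature-matching idea concrete, here is a minimal sketch in Python/NumPy (written with rewards rather than costs). It assumes a tabular MDP and a hypothetical solve_mdp(per_state_reward) planner; the weight update is a plain gradient step, not the paper's max-margin/projection step.

```python
import numpy as np

def expected_feature_counts(policy, features, P, mu0, gamma, T):
    """Discounted expected feature counts of a deterministic tabular policy.
    P[a] is the |S| x |S| transition matrix for action a (rows = current state);
    features is |S| x d; mu0 is the initial state distribution."""
    n_states, d_feat = features.shape
    mu = np.zeros(d_feat)
    dist = mu0.copy()                                # state distribution at time t
    for t in range(T):
        mu += (gamma ** t) * dist @ features
        nxt = np.zeros(n_states)
        for s in range(n_states):
            nxt += dist[s] * P[policy[s]][s]         # follow the policy's action in s
        dist = nxt
    return mu

def feature_matching_irl(demo_feature_counts, features, P, mu0, solve_mdp,
                         gamma=0.99, T=100, n_iters=50, lr=0.1):
    """Simplified feature-matching loop in the spirit of Abbeel & Ng '04:
    nudge the linear reward weights so the expert's feature counts look
    better than those of the current learned policy."""
    w = np.zeros(features.shape[1])
    for _ in range(n_iters):
        policy = solve_mdp(features @ w)             # hypothetical planner over per-state rewards
        mu_pi = expected_feature_counts(policy, features, P, mu0, gamma, T)
        w += lr * (demo_feature_counts - mu_pi)      # move toward matching expert feature counts
    return w, policy
```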

How to handle ambiguity? What if expert is not perfect?

Inverse RL: Outline
1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions

Whiteboard

Maximum Entropy Inverse RL (Ziebart et al. ’08)

Notation:

c_θ(τ): cost with parameters θ [linear case: c_θ(τ) = θᵀ f(τ)]

D = {τ_i}: dataset of demonstrations

p(s_{t+1} | s_t, a_t): transition dynamics
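The derivation itself was done on the whiteboard; for reference, the standard MaxEnt IRL model of Ziebart et al. ’08, written in the notation above (and assuming deterministic dynamics; stochastic dynamics add a transition-probability factor), is roughly:

```latex
% Trajectories become exponentially less likely as their cost increases:
p(\tau \mid \theta) = \frac{1}{Z(\theta)} \exp\!\big(-c_\theta(\tau)\big),
\qquad
Z(\theta) = \sum_{\tau} \exp\!\big(-c_\theta(\tau)\big)

% Learning maximizes the log-likelihood of the demonstrations:
\mathcal{L}(\theta) = \frac{1}{N} \sum_{\tau_i \in \mathcal{D}} -\,c_\theta(\tau_i) \;-\; \log Z(\theta)

% Its gradient compares cost gradients under the demos and under the model:
\nabla_\theta \mathcal{L}
  = -\frac{1}{N} \sum_{\tau_i \in \mathcal{D}} \nabla_\theta c_\theta(\tau_i)
    + \mathbb{E}_{\tau \sim p(\tau \mid \theta)}\!\big[\nabla_\theta c_\theta(\tau)\big]
```

In the linear case c_θ(τ) = θᵀ f(τ), this gradient is the difference between the expert's and the model's expected feature counts, which connects back to the feature-matching view above.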


Inverse RL: Outline
1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy IRL
4. Scaling IRL to deep cost functions

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost
(NIPS Deep RL workshop 2015; IROS 2016)

Need to iteratively solve the MDP for every weight update.
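To make the "tabular setting, neural net cost" concrete, here is a minimal sketch of one MaxEnt IRL update with known dynamics: soft value iteration gives the MaxEnt policy for the current cost, a forward pass gives expected state visitations, and their difference from the demo visitations is the gradient signal for the cost. Function names and the finite-horizon simplification are mine; with a neural net cost, the visitation difference is backpropagated through the network's per-state outputs rather than applied to a cost table.

```python
import numpy as np

def soft_value_iteration(cost, P, T):
    """MaxEnt (soft) backup over a finite horizon T.
    cost: per-state cost, shape (S,).  P: transitions, shape (A, S, S) with P[a, s, s']."""
    A, S, _ = P.shape
    V = np.zeros(S)
    policies = []
    for _ in range(T):
        Q = -cost[None, :] + P @ V                        # Q[a, s] = -c(s) + E_{s'}[V(s')]
        Vmax = Q.max(axis=0)
        V = Vmax + np.log(np.exp(Q - Vmax).sum(axis=0))   # soft max over actions
        policies.append(np.exp(Q - V[None, :]))           # pi_t(a | s), shape (A, S)
    return policies[::-1]                                 # index by time from t = 0

def expected_state_visitations(policies, P, mu0, T):
    """Forward pass: expected state visitation counts under the MaxEnt policy."""
    d = mu0.copy()
    visits = np.zeros_like(mu0)
    for t in range(T):
        visits += d
        d = np.einsum('s,as,asj->j', d, policies[t], P)   # propagate one step
    return visits

def maxent_irl_cost_gradient(cost, P, mu0, demo_visitations, T):
    """Gradient of the demo log-likelihood w.r.t. the per-state cost
    (demo_visitations: average state visitation counts of the demonstrations).
    Take an ascent step on this, or backprop it through a neural net cost:
    states the model visits more than the demos get higher cost."""
    policies = soft_value_iteration(cost, P, T)
    mu_model = expected_state_visitations(policies, P, mu0, T)
    return mu_model - demo_visitations
```

Every call to maxent_irl_cost_gradient re-runs soft value iteration, which is exactly the "solve the MDP for every weight update" limitation noted above.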


[Figure: urban driving experiment. Demonstrations (120 km of demonstration data); input features: mean height, height variance, cell visibility; test-set trajectory prediction compared against a manually designed cost. MHD: modified Hausdorff distance.]


Strengths:
- scales to neural net costs
- efficient enough for real robots

Limitations:
- still need to repeatedly solve the MDP
- assumes known dynamics

Whiteboard

What about unknown dynamics?

Goals:
- remove the need to solve the MDP in the inner loop
- be able to handle unknown dynamics
- handle continuous state & action spaces

Case Study: Guided Cost Learning

ICML 2016

[Figure: guided cost learning algorithm. Start from an initial policy π0 and loop: generate policy samples from π → update the cost (a neural network c_θ(x)) using samples & demos → update π w.r.t. the cost.]
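At a high level, the loop in the figure can be written as the following sketch. The names policy.rollout, cost_update_step, and policy_optimizer_step are placeholders, not the paper's API; the actual algorithm uses guided policy search style policy optimization and the importance-sampled cost objective shown after the next figure.

```python
def guided_cost_learning(demos, init_policy, cost, n_iterations, n_samples):
    """High-level structure of guided cost learning (sketch, not the exact algorithm):
    alternate between (1) sampling from the current policy, (2) updating the cost
    using those samples plus the demonstrations, and (3) partially optimizing the
    policy against the current cost."""
    policy = init_policy
    for _ in range(n_iterations):
        samples = [policy.rollout() for _ in range(n_samples)]   # generate policy samples from pi
        cost = cost_update_step(cost, demos, samples)            # update cost using samples & demos
        policy = policy_optimizer_step(policy, cost, samples)    # update pi w.r.t. cost (partially optimize)
    return cost, policy
```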

[Figure: the same loop viewed as a GAN: starting from policy π0, generate policy samples from π (the policy acts as the generator) → update the cost c_θ(x) using samples & demos (the cost acts as the discriminator) → update π w.r.t. the cost (partially optimize). The cost is updated in the inner loop of policy optimization.]
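The "update cost using samples & demos" box is, roughly, the MaxEnt IRL negative log-likelihood with the partition function estimated by importance sampling from the current sample distribution q, and the generator/discriminator labels come from viewing the cost as a discriminator of a particular form. Up to details handled in the papers, these look like:

```latex
% Cost objective (negative log-likelihood) with the partition function
% estimated by importance sampling, using samples tau_j ~ q:
\mathcal{L}_{\text{cost}}(\theta)
  \approx \frac{1}{N} \sum_{\tau_i \in \mathcal{D}} c_\theta(\tau_i)
  + \log \frac{1}{M} \sum_{\tau_j \sim q} \frac{\exp\!\big(-c_\theta(\tau_j)\big)}{q(\tau_j)}

% Discriminator view: the cost parameterizes a discriminator between
% demonstrations and samples from q:
D_\theta(\tau)
  = \frac{\tfrac{1}{Z}\exp\!\big(-c_\theta(\tau)\big)}
         {\tfrac{1}{Z}\exp\!\big(-c_\theta(\tau)\big) + q(\tau)}
```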

[Figure: the same generator/discriminator loop as studied by Ho et al., ICML ’16 and NIPS ’16: generate policy samples from π (generator), update the cost/discriminator using samples & demos, update π w.r.t. the cost (partially optimize).]

What about unknown dynamics? Adaptive importance sampling.

GCL Experiments: Real-world Tasks
- dish placement: state includes the goal plate pose
- pouring almonds: state includes unsupervised visual features [Finn et al. ’16]
- action: joint torques

[Figure: the comparison methods update the cost c using samples & demos, where the samples come from a fixed sampling distribution q (no policy-update step).]

Comparisons: Relative Entropy IRL (Boularias et al. ’11), Path Integral IRL (Kalakrishnan et al. ’13)

Dish placement, demos


Dish placement, standard cost


Dish placement, RelEnt IRL

• video of dish baseline method


Dish placement, GCL policy

• video of dish our method - samples & reoptimizing


Pouring, demos

• video of pouring demos


Pouring, RelEnt IRL

• video of pouring baseline method


Pouring, GCL policy

• video of pouring our method - samples


Conclusion: We can recover successful policies for new positions.

Is the cost function also useful for new scenarios?


Dish placement - GCL reopt.

• video of dish our method - samples & reoptimizing


Pouring - GCL reopt.

• video of pouring our method - reoptimization


Note: normally the GAN discriminator is discarded


Case Study: Generative Adversarial Imitation Learning (NIPS 2016)
- demonstrations from a TRPO-optimized policy
- use TRPO as the policy optimizer
- OpenAI gym tasks
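A minimal sketch of the adversarial imitation loop described here, with a logistic-regression discriminator standing in for the paper's neural network and a generic policy_update callable standing in for TRPO. The label convention (expert = 1, policy = 0) and the reward r = log D(s, a) are one common choice and may differ from the paper's exact formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adversarial_imitation(expert_sa, sample_policy_sa, policy_update, n_iterations=100):
    """Sketch of GAIL-style training.
    expert_sa        : (N, d) array of expert state-action features
    sample_policy_sa : callable returning an (M, d) array of state-action features
                       sampled from the current policy
    policy_update    : callable(policy_sa, rewards) that improves the policy
                       (the paper uses TRPO for this step)"""
    for _ in range(n_iterations):
        policy_sa = sample_policy_sa()
        # Discriminator step: binary classification with expert = 1, policy = 0.
        X = np.vstack([expert_sa, policy_sa])
        y = np.concatenate([np.ones(len(expert_sa)), np.zeros(len(policy_sa))])
        disc = LogisticRegression(max_iter=200).fit(X, y)
        # Policy step: reward state-actions the discriminator finds expert-like.
        rewards = np.log(disc.predict_proba(policy_sa)[:, 1] + 1e-8)
        policy_update(policy_sa, rewards)
```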

Guided Cost Learning & Generative Adversarial Imitation Learning

Strengths:
- can handle unknown dynamics
- scales to neural net costs
- efficient enough for real robots

Limitations:
- adversarial optimization is hard
- can't scale to raw pixel observations of demos
- demonstrations typically collected with kinesthetic teaching or teleoperation (first person)

Next Time: Back to forward RL (advanced policy gradients)


IOC is under-defined; need regularization:
• encourage slowly changing cost
• cost of demos decreases strictly monotonically in time
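The exact regularizer definitions are in the guided cost learning paper; purely as an illustration of the two bullets above, one plausible instantiation over a trajectory's per-timestep costs could look like this (the names lcr and mono match the ablation legend below; the precise forms here are my own):

```python
import numpy as np

def lcr_penalty(costs):
    """Encourage a slowly changing cost: penalize the discrete second
    differences of the per-timestep cost along a trajectory."""
    second_diff = costs[2:] - 2.0 * costs[1:-1] + costs[:-2]
    return np.sum(second_diff ** 2)

def mono_penalty(costs):
    """Encourage the cost of demonstrations to decrease monotonically in time:
    penalize any timestep where the cost increases."""
    increases = np.maximum(0.0, costs[1:] - costs[:-1])
    return np.sum(increases ** 2)

# Illustrative use: total objective = IRL loss + w_lcr * lcr_penalty(costs)
#                                              + w_mono * mono_penalty(demo_costs)
```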

[Figure: regularization ablation. Two plots (Reaching and Peg Insertion) of distance vs. number of samples; legend: ours (demo init), ours (rand. init), ours, no lcr reg (demo init / rand. init), ours, no mono reg (demo init / rand. init).]
