
  • Interactive Imitation Learning



    Sham Kakade and Wen Sun
    CS 6789: Foundations of Reinforcement Learning

  • Announcements

    Final report: NeurIPS format

    Maximum 9 pages for the main text (not including references and appendix)

  • Recap

    Offline IL and Hybrid Setting:

    The ground-truth reward r(s, a) ∈ [0,1] is unknown; assume the expert π⋆ is a near-optimal policy.

    We have a dataset 𝒟 = {(s⋆_i, a⋆_i)}_{i=1}^M ∼ d^{π⋆}

  • Recap

    Maximum Entropy IRL formulation:

    max_π entropy[ρ_π]   s.t.   𝔼_{s,a∼d^π} ϕ(s, a) = 𝔼_{s,a∼d^{π⋆}} ϕ(s, a)

    We can rewrite it in a max-min formulation:

    max_θ min_π [ 𝔼_{s,a∼d^π} θ^⊤ϕ(s, a) − 𝔼_{s,a∼d^{π⋆}} θ^⊤ϕ(s, a) + 𝔼_{s,a∼d^π} ln π(a|s) ] := f(θ, π)

    Algorithm: incremental update on the cost parameter θ, exact update on the policy π (soft VI); a structural sketch follows below:

    θ_{t+1} := θ_t + η ∇_θ f(θ_t, π_t),    π_{t+1} = arg min_π f(θ_{t+1}, π)
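
    A minimal structural sketch in Python of the alternating updates above (my own illustration, not the lecture's implementation; `feature_expectation` and `soft_value_iteration` are hypothetical helpers returning 𝔼_{s,a∼d^π}[ϕ(s, a)] and arg min_π f(θ, π), respectively):

        # Structural sketch of the MaxEnt-IRL updates: incremental gradient step on theta,
        # exact policy update via soft value iteration. The two helpers passed in are
        # hypothetical placeholders for the quantities defined on the slide.
        def max_ent_irl(theta, expert_policy, eta, n_steps,
                        feature_expectation, soft_value_iteration):
            policy = soft_value_iteration(theta)
            for _ in range(n_steps):
                # grad_theta f(theta, pi) = E_{d^pi}[phi] - E_{d^{pi*}}[phi]
                grad = feature_expectation(policy) - feature_expectation(expert_policy)
                theta = theta + eta * grad              # incremental update on the cost parameter
                policy = soft_value_iteration(theta)    # exact update on the policy (soft VI)
            return theta, policy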

  • Today:

    Interactive Imitation Learning Setting

    Key assumption: we can query the expert π⋆ at any time and in any state during training.

    (Recall that previously we only had an offline dataset 𝒟 = {(s⋆_i, a⋆_i)}_{i=1}^M ∼ d^{π⋆}.)

  • Recall the Main Problem from Behavior Cloning:

    (Figure: the expert's trajectory vs. the learned policy's trajectory)

    No training data of "recovery" behavior

  • Intuitive solution: Interaction

    Use interaction to collect data where the learned policy goes

  • General Idea: Iterative Interactive Approach

    (Diagram: collect data through interaction → new data → update policy → updated policy → repeat)

    All DAgger slides credit: Drew Bagnell, Stephane Ross, Arun Venkatraman

  • DAgger: Dataset Aggregation, 0th iteration [Ross11a]

    Expert demonstrates the task → dataset → supervised learning → 1st policy π1

  • DAgger: Dataset Aggregation, 1st iteration [Ross11a]

    Execute π1 and query the expert (steering from the expert) on the states the learned policy visits → new data

    Aggregate with all previous data → aggregate dataset

    Supervised learning → new policy π2

  • DAgger: Dataset Aggregation, 2nd iteration [Ross11a]

    Execute π2 and query the expert (steering from the expert) → new data

    Aggregate with all previous data → aggregate dataset

    Supervised learning → new policy π3

  • DAgger: Dataset Aggregation, nth iteration [Ross11a]

    Execute πn−1 and query the expert (steering from the expert) → new data

    Aggregate with all previous data → aggregate dataset

    Supervised learning → new policy πn

    (A code sketch of the full loop follows below.)
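
    To make the loop explicit, here is a minimal Python sketch of the DAgger iterations above (not the lecture's code; `rollout`, `expert_action`, and `fit_supervised` are hypothetical helpers that roll out a policy and return the visited states, query the interactive expert for an action label, and fit a policy by supervised learning, respectively):

        # A minimal DAgger sketch; `rollout`, `expert_action`, and `fit_supervised`
        # are hypothetical stand-ins, not tied to any particular library.
        def dagger(env, expert_action, rollout, fit_supervised, n_iterations):
            # 0th iteration: the expert demonstrates the task; train pi_1 by supervised learning.
            demo_states = rollout(env, policy=expert_action)
            dataset = [(s, expert_action(s)) for s in demo_states]
            policy = fit_supervised(dataset)
            for _ in range(1, n_iterations):
                # Execute the current learned policy to visit its own states...
                visited_states = rollout(env, policy=policy)
                # ...query the expert for labels on those states (new data)...
                new_data = [(s, expert_action(s)) for s in visited_states]
                # ...aggregate with all previous data, and retrain by supervised learning.
                dataset += new_data
                policy = fit_supervised(dataset)
            return policy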

  • Success!

    [Ross AISTATS 2011]

  • Better

    (Figure: average falls/lap) [Ross AISTATS 2011]

  • More fun than Video Games…

    [Ross ICRA 2013]

  • Forms of the Interactive Expert

    An interactive expert is expensive, especially when the expert is a human…

    But the expert does not have to be human…

    Example: high-speed off-road driving [Pan et al., RSS 18, Best System Paper]

    Goal: learn a racing control policy that maps data from cheap on-board sensors (raw-pixel images) to low-level control (steering and throttle)

  • Forms of the Interactive Expert (continued): high-speed off-road driving [Pan et al., RSS 18, Best System Paper]

    Their setup:

    At training time, they have expensive sensors for accurate state estimation, and they have the computational resources to run MPC (i.e., high-frequency replanning).

    The MPC is the expert in this case! (A small sketch of how such an expert plugs into the DAgger loop follows below.)
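
    Since the expert here is a controller rather than a person, it can be wrapped as the labeling oracle used by the DAgger loop sketched earlier. A hedged sketch (the names `full_state_estimate` and `mpc_controller` are hypothetical):

        # Hypothetical wrapper: at training time only, expensive sensors provide an accurate
        # state estimate and MPC (high-frequency replanning) supplies the action label.
        def make_mpc_expert(full_state_estimate, mpc_controller):
            def expert_action(raw_observation):
                state = full_state_estimate(raw_observation)     # training-time sensing
                steering, throttle = mpc_controller.plan(state)  # MPC acts as the expert
                return steering, throttle
            return expert_action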

  • Analysis of DAgger

    First, let's give a quick introduction to online no-regret learning.

  • Online Learning

    Two players, a learner and an adversary, interact over a convex decision set 𝒳
    [Vovk92, Warmuth94, Freund97, Zinkevich03, Kalai05, Hazan06, Kakade08]

    Learner picks a decision x_0

    Adversary picks a loss ℓ_0 : 𝒳 → ℝ

    Learner picks a new decision x_1

    Adversary picks a loss ℓ_1 : 𝒳 → ℝ

    …

    Regret = ∑_{t=0}^{T−1} ℓ_t(x_t) − min_{x∈𝒳} ∑_{t=0}^{T−1} ℓ_t(x)

  • A no-regret algorithm: Follow-the-Leader (FTL)

    At time step t, the learner has seen ℓ_0, …, ℓ_{t−1}; which new decision should she pick?

    FTL: x_t = arg min_{x∈𝒳} ∑_{i=0}^{t−1} ℓ_i(x)

    Theorem (FTL): if 𝒳 is convex and ℓ_t is strongly convex for all t, then the average regret of FTL satisfies

    (1/T) [ ∑_{t=0}^{T−1} ℓ_t(x_t) − min_{x∈𝒳} ∑_{t=0}^{T−1} ℓ_t(x) ] = O(log(T)/T)

    (A small numerical check of this rate follows below.)
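
    A small self-contained numerical check (a toy example of mine, not from the slides): run FTL on the strongly convex losses ℓ_t(x) = (x − z_t)², for which the FTL decision is simply the running mean of the z's.

        import numpy as np

        rng = np.random.default_rng(0)
        T = 2000
        z = rng.uniform(size=T)                   # adversary's loss parameters: l_t(x) = (x - z_t)^2

        x = np.empty(T)
        x[0] = 0.5                                # arbitrary first decision
        for t in range(1, T):
            x[t] = z[:t].mean()                   # FTL: argmin_x sum_{i<t} (x - z_i)^2

        learner = np.sum((x - z) ** 2)            # sum_t l_t(x_t)
        comparator = np.sum((z.mean() - z) ** 2)  # min_x sum_t l_t(x), attained at the overall mean
        print((learner - comparator) / T,         # average regret ...
              np.log(T) / T)                      # ... is on the order of log(T)/T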

  • DAgger Revisit

    At iteration n: execute the current policy and query the expert (steering from the expert) → new data; aggregate with all previous data → aggregate dataset; supervised learning → new policy πn

    Loss on the data 𝒟_n collected at iteration n (states x visited by the learned policy, expert labels y):

    L_n(π) = ∑_{(x, y) ∈ 𝒟_n} ‖π(x) − y‖²_2

    Training on the aggregated dataset solves arg min_π ∑_{t=1}^{n} L_t(π)

    Data Aggregation = Follow-the-Leader Online Learner (a small numerical check follows below)
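
    A tiny numerical check of this equivalence (my own toy example, with a linear policy class and the squared loss above): refitting least squares on the aggregated dataset is exactly the FTL minimizer of ∑_t L_t(π).

        import numpy as np

        rng = np.random.default_rng(0)
        # Three "iterations" of collected data: state features x and expert labels y.
        batches = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]

        # FTL over a linear policy class: argmin_w sum_t L_t(w) with L_t(w) = sum_i (x_i^T w - y_i)^2,
        # which is ordinary least squares on the aggregated dataset.
        X = np.vstack([x for x, _ in batches])
        Y = np.concatenate([y for _, y in batches])
        w, *_ = np.linalg.lstsq(X, Y, rcond=None)

        # The gradient of sum_t L_t(w) vanishes at the aggregate least-squares solution.
        grad = sum(2 * x.T @ (x @ w - y) for x, y in batches)
        print(np.allclose(grad, 0.0, atol=1e-8))  # True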

  • DAgger Analysis: A reduction to no-regret online learning

    Decision set Π := {π : S ↦ A} (restricted policy class; π⋆ may not be inside Π)

    Finite-horizon episodic MDP; assume a discrete action space

    Online learning loss at iteration t: ℓ_t(π) = 𝔼_{s∼d^{π_t}} [ℓ(π(s), π⋆(s))]

    (Here ℓ could be any convex surrogate loss for classification, e.g., hinge loss; a sketch of how ℓ_t is estimated in practice follows below)

    DAgger is equivalent to FTL, i.e., π_{t+1} = arg min_{π∈Π} ∑_{i=0}^{t} ℓ_i(π)

    If the online learning procedure ensures no regret, then

    (1/T) [ ∑_{t=0}^{T−1} ℓ_t(π_t) − min_{π∈Π} ∑_{t=0}^{T−1} ℓ_t(π) ] = o(T)/T → 0
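
    In practice ℓ_t is not available in closed form; it is estimated from states visited while executing π_t, with the interactive expert supplying labels. A hedged sketch (the helpers `rollout_states`, `expert_action`, and `surrogate_loss` are hypothetical):

        # Monte-Carlo estimate of l_t(pi) = E_{s ~ d^{pi_t}}[ l(pi(s), pi_star(s)) ].
        def estimate_online_loss(pi, pi_t, env, rollout_states, expert_action, surrogate_loss):
            states = rollout_states(env, pi_t)              # states s ~ d^{pi_t}
            labels = [expert_action(s) for s in states]     # interactive expert queries
            return sum(surrogate_loss(pi(s), a_star)
                       for s, a_star in zip(states, labels)) / len(states)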

  • DAgger Analysis: A reduction to no-regret online learning

    Online learning loss at iteration t: ℓ_t(π) = 𝔼_{s∼d^{π_t}} [ℓ(π(s), π⋆(s))]

    No regret means ∑_{t=0}^{T−1} ℓ_t(π_t) − min_{π∈Π} ∑_{t=0}^{T−1} ℓ_t(π) = o(T)

    Dividing by T:

    (1/T) ∑_{t=0}^{T−1} ℓ_t(π_t) = ϵ_avg-reg + ϵ_Π,   where ϵ_avg-reg := o(T)/T (the average regret) and ϵ_Π := min_{π∈Π} (1/T) ∑_{t=0}^{T−1} ℓ_t(π)

    Since the minimum over iterations is at most the average, ∃ t̂ ∈ {0, …, T−1} such that ℓ_{t̂}(π_{t̂}) ≤ ϵ_avg-reg + ϵ_Π

    Under the assumption that the surrogate loss upper bounds the zero-one loss:

    𝔼_{s∼d^{π_{t̂}}} [1{π_{t̂}(s) ≠ π⋆(s)}] ≤ 𝔼_{s∼d^{π_{t̂}}} [ℓ(π_{t̂}(s), π⋆(s))] ≤ ϵ_avg-reg + ϵ_Π

  • DAgger Analysis: A reduction to no-regret online learning

    𝔼_{s∼d^{π_{t̂}}} [1{π_{t̂}(s) ≠ π⋆(s)}] ≤ 𝔼_{s∼d^{π_{t̂}}} [ℓ(π_{t̂}(s), π⋆(s))] ≤ ϵ_avg-reg + ϵ_Π

    i.e., π_{t̂} can predict π⋆ well under its own state distribution.

    Let's turn this into a bound on the true performance under the cost function c(s, a).

    By the performance difference lemma (and using A⋆(s, π⋆(s)) = 0):

    V^{π_{t̂}} − V^{π⋆} = (1/(1−γ)) 𝔼_{s∼d^{π_{t̂}}} [A⋆(s, π_{t̂}(s))]

                       = (1/(1−γ)) 𝔼_{s∼d^{π_{t̂}}} [A⋆(s, π_{t̂}(s)) − A⋆(s, π⋆(s))]

                       ≤ (1/(1−γ)) 𝔼_{s∼d^{π_{t̂}}} [1{π_{t̂}(s) ≠ π⋆(s)} · max_{s,a} A⋆(s, a)]

                       ≤ (max_{s,a} A⋆(s, a) / (1−γ)) · (ϵ_avg-reg + ϵ_Π)

    Case study (restated in one line each below):

    1. Worst case: A⋆(s, a) ≈ 1/(1−γ) (not recoverable from a mistake): quadratic dependence on the horizon, i.e., no better than BC.

    2. Good case: A⋆(s, a) ≈ o(1/(1−γ)) (easily recoverable from a one-step mistake): better than BC.
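
    Plugging the two regimes into the final bound (just a one-line restatement of the case study, in LaTeX):

        \[
        \max_{s,a} A^\star(s,a) \approx \tfrac{1}{1-\gamma}
        \;\Longrightarrow\;
        V^{\pi_{\hat t}} - V^{\pi^\star} \lesssim \tfrac{1}{(1-\gamma)^2}\,(\epsilon_{\mathrm{avg\text{-}reg}} + \epsilon_\Pi),
        \qquad
        \max_{s,a} A^\star(s,a) = o\!\left(\tfrac{1}{1-\gamma}\right)
        \;\Longrightarrow\;
        V^{\pi_{\hat t}} - V^{\pi^\star} = o\!\left(\tfrac{1}{(1-\gamma)^2}\right)(\epsilon_{\mathrm{avg\text{-}reg}} + \epsilon_\Pi).
        \]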

  • Summary of Imitation Learning

    1. Behavior Cloning (maximum likelihood estimation):

       Performance gap ≈ (1/(1−γ)²) × (classification error)

    2. Hybrid distribution matching (w/ IPM or MaxEnt-IRL):

       Performance gap ≈ (1/(1−γ)) × (classification error)

    3. DAgger w/ an interactive expert:

       Performance gap ≈ (sup_{s,a} |A⋆(s, a)| / (1−γ)) × (classification error)