
  • Interactive Imitation Learning



    Sham Kakade and Wen Sun
    CS 6789: Foundations of Reinforcement Learning

  • Announcements

    Final report: NeurIPS format

    Maximum 9 pages for the main text (not including references and appendix)

  • Recap

    Offline IL and Hybrid Setting:

    The ground-truth reward r(s, a) ∈ [0,1] is unknown; assume the expert π⋆ is a near-optimal policy.

    We have a dataset 𝒟 = {(s⋆_i, a⋆_i)}_{i=1}^M ∼ d^{π⋆}

  • Recap

    Maximum Entropy IRL formulation:

    max_π entropy[ρ_π]   s.t.   𝔼_{s,a∼d^π} ϕ(s, a) = 𝔼_{s,a∼d^{π⋆}} ϕ(s, a)

    We can rewrite it in a max-min formulation:

    max_θ min_π [ 𝔼_{s,a∼d^π} θ^⊤ϕ(s, a) − 𝔼_{s,a∼d^{π⋆}} θ^⊤ϕ(s, a) + 𝔼_{s,a∼d^π} ln π(a|s) ] := f(θ, π)

    Algorithm: incremental update on the cost parameter θ, exact update on the policy π (soft VI); a structural sketch follows below:

    θ_{t+1} := θ_t + η ∇_θ f(θ_t, π_t),    π_{t+1} = arg min_π f(θ_{t+1}, π)
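
    A minimal structural sketch in Python of the alternating updates above (my own illustration, not the lecture's implementation; `feature_expectation` and `soft_value_iteration` are hypothetical helpers returning 𝔼_{s,a∼d^π}[ϕ(s, a)] and arg min_π f(θ, π), respectively):

        # Structural sketch of the MaxEnt-IRL updates: incremental gradient step on theta,
        # exact policy update via soft value iteration. The two helpers passed in are
        # hypothetical placeholders for the quantities defined on the slide.
        def max_ent_irl(theta, expert_policy, eta, n_steps,
                        feature_expectation, soft_value_iteration):
            policy = soft_value_iteration(theta)
            for _ in range(n_steps):
                # grad_theta f(theta, pi) = E_{d^pi}[phi] - E_{d^{pi*}}[phi]
                grad = feature_expectation(policy) - feature_expectation(expert_policy)
                theta = theta + eta * grad              # incremental update on the cost parameter
                policy = soft_value_iteration(theta)    # exact update on the policy (soft VI)
            return theta, policy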

  • Today:

    Interactive Imitation Learning Setting

    Key assumption: we can query the expert π⋆ at any time and in any state during training.

    (Recall that previously we only had an offline dataset 𝒟 = {(s⋆_i, a⋆_i)}_{i=1}^M ∼ d^{π⋆}.)

  • Recall the Main Problem from Behavior Cloning:

    (Figure: the expert's trajectory vs. the learned policy's trajectory)

    No training data of "recovery" behavior

  • Intuitive solution: Interaction

    Use interaction to collect data where the learned policy goes

  • General Idea: Iterative Interactive Approach

    (Diagram: collect data through interaction → new data → update policy → updated policy → repeat)

    All DAgger slides credit: Drew Bagnell, Stephane Ross, Arun Venkatraman

  • DAgger: Dataset Aggregation, 0th iteration [Ross11a]

    Expert demonstrates the task → dataset → supervised learning → 1st policy π1

  • DAgger: Dataset Aggregation, 1st iteration [Ross11a]

    Execute π1 and query the expert (steering from the expert) on the states the learned policy visits → new data

    Aggregate with all previous data → aggregate dataset

    Supervised learning → new policy π2

  • DAgger: Dataset Aggregation, 2nd iteration [Ross11a]

    Execute π2 and query the expert (steering from the expert) → new data

    Aggregate with all previous data → aggregate dataset

    Supervised learning → new policy π3

  • DAgger: Dataset Aggregation, nth iteration [Ross11a]

    Execute πn−1 and query the expert (steering from the expert) → new data

    Aggregate with all previous data → aggregate dataset

    Supervised learning → new policy πn

    (A code sketch of the full loop follows below.)
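
    To make the loop explicit, here is a minimal Python sketch of the DAgger iterations above (not the lecture's code; `rollout`, `expert_action`, and `fit_supervised` are hypothetical helpers that roll out a policy and return the visited states, query the interactive expert for an action label, and fit a policy by supervised learning, respectively):

        # A minimal DAgger sketch; `rollout`, `expert_action`, and `fit_supervised`
        # are hypothetical stand-ins, not tied to any particular library.
        def dagger(env, expert_action, rollout, fit_supervised, n_iterations):
            # 0th iteration: the expert demonstrates the task; train pi_1 by supervised learning.
            demo_states = rollout(env, policy=expert_action)
            dataset = [(s, expert_action(s)) for s in demo_states]
            policy = fit_supervised(dataset)
            for _ in range(1, n_iterations):
                # Execute the current learned policy to visit its own states...
                visited_states = rollout(env, policy=policy)
                # ...query the expert for labels on those states (new data)...
                new_data = [(s, expert_action(s)) for s in visited_states]
                # ...aggregate with all previous data, and retrain by supervised learning.
                dataset += new_data
                policy = fit_supervised(dataset)
            return policy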

  • Success!

    [Ross AISTATS 2011]

  • Better

    (Figure: average falls/lap) [Ross AISTATS 2011]

  • More fun than Video Games…

    [Ross ICRA 2013]

  • Forms of the Interactive Expert

    An interactive expert is expensive, especially when the expert is a human…

    But the expert does not have to be human…

    Example: high-speed off-road driving [Pan et al., RSS 18, Best System Paper]

    Goal: learn a racing control policy that maps data from cheap on-board sensors (raw-pixel images) to low-level control (steering and throttle)

  • Forms of the Interactive Expert (continued): high-speed off-road driving [Pan et al., RSS 18, Best System Paper]

    Their setup:

    At training time, they have expensive sensors for accurate state estimation, and they have the computational resources to run MPC (i.e., high-frequency replanning).

    The MPC is the expert in this case! (A small sketch of how such an expert plugs into the DAgger loop follows below.)
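
    Since the expert here is a controller rather than a person, it can be wrapped as the labeling oracle used by the DAgger loop sketched earlier. A hedged sketch (the names `full_state_estimate` and `mpc_controller` are hypothetical):

        # Hypothetical wrapper: at training time only, expensive sensors provide an accurate
        # state estimate and MPC (high-frequency replanning) supplies the action label.
        def make_mpc_expert(full_state_estimate, mpc_controller):
            def expert_action(raw_observation):
                state = full_state_estimate(raw_observation)     # training-time sensing
                steering, throttle = mpc_controller.plan(state)  # MPC acts as the expert
                return steering, throttle
            return expert_action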

  • Analysis of DAgger

    First, let's give a quick introduction to online no-regret learning.

  • Online Learning

    Two players, a learner and an adversary, interact over a convex decision set 𝒳
    [Vovk92, Warmuth94, Freund97, Zinkevich03, Kalai05, Hazan06, Kakade08]

    Learner picks a decision x_0

    Adversary picks a loss ℓ_0 : 𝒳 → ℝ

    Learner picks a new decision x_1

    Adversary picks a loss ℓ_1 : 𝒳 → ℝ

    …

    Regret = ∑_{t=0}^{T−1} ℓ_t(x_t) − min_{x∈𝒳} ∑_{t=0}^{T−1} ℓ_t(x)

  • A no-regret algorithm: Follow-the-Leader (FTL)

    At time step t, the learner has seen ℓ_0, …, ℓ_{t−1}; which new decision should she pick?

    FTL: x_t = arg min_{x∈𝒳} ∑_{i=0}^{t−1} ℓ_i(x)

    Theorem (FTL): if 𝒳 is convex and ℓ_t is strongly convex for all t, then the average regret of FTL satisfies

    (1/T) [ ∑_{t=0}^{T−1} ℓ_t(x_t) − min_{x∈𝒳} ∑_{t=0}^{T−1} ℓ_t(x) ] = O(log(T)/T)

    (A small numerical check of this rate follows below.)
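
    A small self-contained numerical check (a toy example of mine, not from the slides): run FTL on the strongly convex losses ℓ_t(x) = (x − z_t)², for which the FTL decision is simply the running mean of the z's.

        import numpy as np

        rng = np.random.default_rng(0)
        T = 2000
        z = rng.uniform(size=T)                   # adversary's loss parameters: l_t(x) = (x - z_t)^2

        x = np.empty(T)
        x[0] = 0.5                                # arbitrary first decision
        for t in range(1, T):
            x[t] = z[:t].mean()                   # FTL: argmin_x sum_{i<t} (x - z_i)^2

        learner = np.sum((x - z) ** 2)            # sum_t l_t(x_t)
        comparator = np.sum((z.mean() - z) ** 2)  # min_x sum_t l_t(x), attained at the overall mean
        print((learner - comparator) / T,         # average regret ...
              np.log(T) / T)                      # ... is on the order of log(T)/T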

  • DAgger Revisit

    At iteration n: execute the current policy and query the expert (steering from the expert) → new data; aggregate with all previous data → aggregate dataset; supervised learning → new policy πn

    Loss on the data 𝒟_n collected at iteration n (states x visited by the learned policy, expert labels y):

    L_n(π) = ∑_{(x, y) ∈ 𝒟_n} ‖π(x) − y‖²_2

    Training on the aggregated dataset solves arg min_π ∑_{t=1}^{n} L_t(π)

    Data Aggregation = Follow-the-Leader Online Learner (a small numerical check follows below)
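
    A tiny numerical check of this equivalence (my own toy example, with a linear policy class and the squared loss above): refitting least squares on the aggregated dataset is exactly the FTL minimizer of ∑_t L_t(π).

        import numpy as np

        rng = np.random.default_rng(0)
        # Three "iterations" of collected data: state features x and expert labels y.
        batches = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]

        # FTL over a linear policy class: argmin_w sum_t L_t(w) with L_t(w) = sum_i (x_i^T w - y_i)^2,
        # which is ordinary least squares on the aggregated dataset.
        X = np.vstack([x for x, _ in batches])
        Y = np.concatenate([y for _, y in batches])
        w, *_ = np.linalg.lstsq(X, Y, rcond=None)

        # The gradient of sum_t L_t(w) vanishes at the aggregate least-squares solution.
        grad = sum(2 * x.T @ (x @ w - y) for x, y in batches)
        print(np.allclose(grad, 0.0, atol=1e-8))  # True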

  • DAgger Analysis: A reduction to no-regret online learning

    Decision set Π := {π : S ↦ A} (restricted policy class; π⋆ may not be inside Π)

    Finite-horizon episodic MDP; assume a discrete action space

    Online learning loss at iteration t: ℓ_t(π) = 𝔼_{s∼d^{π_t}} [ℓ(π(s), π⋆(s))]

    (Here ℓ could be any convex surrogate loss for classification, e.g., hinge loss; a sketch of how ℓ_t is estimated in practice follows below)

    DAgger is equivalent to FTL, i.e., π_{t+1} = arg min_{π∈Π} ∑_{i=0}^{t} ℓ_i(π)

    If the online learning procedure ensures no regret, then

    (1/T) [ ∑_{t=0}^{T−1} ℓ_t(π_t) − min_{π∈Π} ∑_{t=0}^{T−1} ℓ_t(π) ] = o(T)/T → 0
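
    In practice ℓ_t is not available in closed form; it is estimated from states visited while executing π_t, with the interactive expert supplying labels. A hedged sketch (the helpers `rollout_states`, `expert_action`, and `surrogate_loss` are hypothetical):

        # Monte-Carlo estimate of l_t(pi) = E_{s ~ d^{pi_t}}[ l(pi(s), pi_star(s)) ].
        def estimate_online_loss(pi, pi_t, env, rollout_states, expert_action, surrogate_loss):
            states = rollout_states(env, pi_t)              # states s ~ d^{pi_t}
            labels = [expert_action(s) for s in states]     # interactive expert queries
            return sum(surrogate_loss(pi(s), a_star)
                       for s, a_star in zip(states, labels)) / len(states)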

  • DAgger Analysis: A reduction to no-regret online learning

    Online learning loss at iteration t: ℓ_t(π) = 𝔼_{s∼d^{π_t}} [ℓ(π(s), π⋆(s))]

    No regret means ∑_{t=0}^{T−1} ℓ_t(π_t) − min_{π∈Π} ∑_{t=0}^{T−1} ℓ_t(π) = o(T)

    Dividing by T:

    (1/T) ∑_{t=0}^{T−1} ℓ_t(π_t) = ϵ_avg-reg + ϵ_Π,   where ϵ_avg-reg := o(T)/T (the average regret) and ϵ_Π := min_{π∈Π} (1/T) ∑_{t=0}^{T−1} ℓ_t(π)

    Since the minimum over iterations is at most the average, ∃ t̂ ∈ {0, …, T−1} such that ℓ_{t̂}(π_{t̂}) ≤ ϵ_avg-reg + ϵ_Π

    Under the assumption that the surrogate loss upper bounds the zero-one loss:

    𝔼_{s∼d^{π_{t̂}}} [1{π_{t̂}(s) ≠ π⋆(s)}] ≤ 𝔼_{s∼d^{π_{t̂}}} [ℓ(π_{t̂}(s), π⋆(s))] ≤ ϵ_avg-reg + ϵ_Π

  • DAgger Analysis: A reduction to no-regret online learning

    𝔼_{s∼d^{π_{t̂}}} [1{π_{t̂}(s) ≠ π⋆(s)}] ≤ 𝔼_{s∼d^{π_{t̂}}} [ℓ(π_{t̂}(s), π⋆(s))] ≤ ϵ_avg-reg + ϵ_Π

    i.e., π_{t̂} can predict π⋆ well under its own state distribution.

    Let's turn this into a bound on the true performance under the cost function c(s, a).

    By the performance difference lemma (and using A⋆(s, π⋆(s)) = 0):

    V^{π_{t̂}} − V^{π⋆} = (1/(1−γ)) 𝔼_{s∼d^{π_{t̂}}} [A⋆(s, π_{t̂}(s))]

                       = (1/(1−γ)) 𝔼_{s∼d^{π_{t̂}}} [A⋆(s, π_{t̂}(s)) − A⋆(s, π⋆(s))]

                       ≤ (1/(1−γ)) 𝔼_{s∼d^{π_{t̂}}} [1{π_{t̂}(s) ≠ π⋆(s)} · max_{s,a} A⋆(s, a)]

                       ≤ (max_{s,a} A⋆(s, a) / (1−γ)) · (ϵ_avg-reg + ϵ_Π)

    Case study (restated in one line each below):

    1. Worst case: A⋆(s, a) ≈ 1/(1−γ) (not recoverable from a mistake): quadratic dependence on the horizon, i.e., no better than BC.

    2. Good case: A⋆(s, a) ≈ o(1/(1−γ)) (easily recoverable from a one-step mistake): better than BC.
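
    Plugging the two regimes into the final bound (just a one-line restatement of the case study, in LaTeX):

        \[
        \max_{s,a} A^\star(s,a) \approx \tfrac{1}{1-\gamma}
        \;\Longrightarrow\;
        V^{\pi_{\hat t}} - V^{\pi^\star} \lesssim \tfrac{1}{(1-\gamma)^2}\,(\epsilon_{\mathrm{avg\text{-}reg}} + \epsilon_\Pi),
        \qquad
        \max_{s,a} A^\star(s,a) = o\!\left(\tfrac{1}{1-\gamma}\right)
        \;\Longrightarrow\;
        V^{\pi_{\hat t}} - V^{\pi^\star} = o\!\left(\tfrac{1}{(1-\gamma)^2}\right)(\epsilon_{\mathrm{avg\text{-}reg}} + \epsilon_\Pi).
        \]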

  • Summary of Imitation Learning

    1. Behavior Cloning (maximum likelihood estimation):

       Performance gap ≈ (1/(1−γ)²) × (classification error)

    2. Hybrid distribution matching (w/ IPM or MaxEnt-IRL):

       Performance gap ≈ (1/(1−γ)) × (classification error)

    3. DAgger w/ an interactive expert:

       Performance gap ≈ (sup_{s,a} |A⋆(s, a)| / (1−γ)) × (classification error)