Interactive Imitation Learning
Sham Kakade and Wen Sun CS 6789: Foundations of Reinforcement Learning
Announcements
Final report: NeurIPS format
Maximum 9 pages for the main text (not including references and appendix)
Recap
Offline IL and Hybrid Setting:
Ground-truth reward r(s, a) ∈ [0,1] is unknown;
assume the expert π⋆ is a near-optimal policy.
We have a dataset 𝒟 = {(s⋆_i, a⋆_i)}_{i=1}^{M} ∼ d^{π⋆}
Recap
Maximum Entropy IRL formulation:
max_π entropy[ρ_π]
s.t. 𝔼_{s,a∼d^π} ϕ(s, a) = 𝔼_{s,a∼d^{π⋆}} ϕ(s, a)
We can rewrite it in the max-min formulation:
max_θ min_π [𝔼_{s,a∼d^π} θ^⊤ϕ(s, a) − 𝔼_{s,a∼d^{π⋆}} θ^⊤ϕ(s, a) + 𝔼_{s,a∼d^π} ln π(a|s)] =: f(θ, π)
Algorithm: incremental update on the cost parameter θ, exact update on the policy π (soft VI):
θ_{t+1} := θ_t + η ∇_θ f(θ_t, π_t),  π_{t+1} = arg min_π f(θ_{t+1}, π)
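To make the update concrete, here is a minimal tabular sketch in Python (an illustration under toy assumptions, not the lecture's implementation): it assumes a small random MDP, a linear cost θ⊤ϕ(s, a), and that the "expert" is itself a soft-optimal policy under a hidden cost vector; the names soft_vi, occupancy, and feat_exp are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, D, gamma = 5, 3, 4, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] = distribution over next states
    phi = rng.normal(size=(S, A, D))               # features phi(s, a)
    mu0 = np.ones(S) / S                           # initial state distribution

    def soft_vi(cost, iters=300):
        # entropy-regularized cost minimization: pi(a|s) ∝ exp(-Q(s,a)), V(s) = -log Σ_a exp(-Q(s,a))
        V = np.zeros(S)
        for _ in range(iters):
            Q = cost + gamma * (P @ V)
            m = (-Q).max(axis=1, keepdims=True)
            V = -(m + np.log(np.exp(-Q - m).sum(axis=1, keepdims=True))).squeeze(1)
        Q = cost + gamma * (P @ V)
        pi = np.exp(-(Q - Q.min(axis=1, keepdims=True)))
        return pi / pi.sum(axis=1, keepdims=True)

    def occupancy(pi, iters=300):
        # normalized discounted state-action occupancy d^pi
        d_s, cur = np.zeros(S), mu0.copy()
        for t in range(iters):
            d_s += (gamma ** t) * cur
            cur = np.einsum('s,sa,sap->p', cur, pi, P)
        return (1 - gamma) * d_s[:, None] * pi

    def feat_exp(pi):
        return np.einsum('sa,sad->d', occupancy(pi), phi)

    pi_star = soft_vi(phi @ rng.normal(size=D))    # stand-in "expert": soft-optimal under a hidden cost
    f_star = feat_exp(pi_star)

    theta = np.zeros(D)
    for t in range(200):
        pi_t = soft_vi(phi @ theta)                                   # exact inner minimization over pi
        theta += 0.5 / np.sqrt(t + 1) * (feat_exp(pi_t) - f_star)     # incremental gradient step on theta
    print(np.linalg.norm(feat_exp(soft_vi(phi @ theta)) - f_star))    # feature-matching gap shrinks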
Today:
Interactive Imitation Learning Setting
Key assumption: we can query the expert π⋆ at any time and in any state during training.
(Recall that previously we only had an offline dataset 𝒟 = {(s⋆_i, a⋆_i)}_{i=1}^{M} ∼ d^{π⋆}.)
Recall the main problem with Behavior Cloning:
[Figure: expert's trajectory vs. learned policy's trajectory]
No training data of "recovery" behavior
Intuitive solution: Interaction
Use interaction to collect data where the learned policy goes.
General Idea: Iterative Interactive Approach
[Diagram: Collect Data through Interaction → New Data → Update Policy → Updated Policy → repeat]
All DAgger slides credit: Drew Bagnell, Stephane Ross, Arun Venkatraman
DAgger: Dataset Aggregation, 0th iteration [Ross11a]
Expert demonstrates the task → dataset → supervised learning → 1st policy π1

DAgger: Dataset Aggregation, 1st iteration [Ross11a]
Execute π1 and query the expert (steering labels from the expert) on the states the learned policy visits → new data → aggregate with all previous data → supervised learning → new policy π2

DAgger: Dataset Aggregation, 2nd iteration [Ross11a]
Execute π2 and query the expert → new data → aggregate dataset → supervised learning → new policy π3

DAgger: Dataset Aggregation, nth iteration [Ross11a]
Execute πn−1 and query the expert → new data → aggregate dataset → supervised learning → new policy πn
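In pseudocode, the loop above is just the following minimal sketch (illustrative, not the [Ross11a] implementation; the interface env.reset() / env.step(a) returning the next state and a done flag, expert(s), and fit(D) are all hypothetical names):

    def rollout(env, act, label, horizon):
        # execute the policy `act`, but label every visited state with the queried expert action `label(s)`
        data, s = [], env.reset()
        for _ in range(horizon):
            data.append((s, label(s)))
            s, done = env.step(act(s))
            if done:
                s = env.reset()
        return data

    def dagger(env, expert, fit, n_iters=10, horizon=100):
        D = rollout(env, expert, expert, horizon)      # 0th iteration: expert demonstrates the task
        pi = fit(D)                                    # supervised learning -> pi_1
        for n in range(1, n_iters):
            D = D + rollout(env, pi, expert, horizon)  # execute pi_n, query the expert, aggregate
            pi = fit(D)                                # retrain on all previous data -> pi_{n+1}
        return pi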
Success! [Ross AISTATS 2011]

Better [Ross AISTATS 2011]
[Plot: Average Falls/Lap]
More fun than Video Games… [Ross ICRA 2013]
Forms of the Interactive Expert
An interactive expert is expensive, especially when the expert is a human…
But the expert does not have to be human.
Example: high-speed off-road driving [Pan et al., RSS 18, Best System Paper]
Goal: learn a racing control policy that maps from cheap on-board sensors (raw-pixel images) to low-level control (steering and throttle).
Their setup: at training time we have expensive sensors for accurate state estimation, and we have the computational resources for MPC (i.e., high-frequency replanning).
The MPC is the expert in this case!
Analysis of DAgger
First, let's do a quick introduction to online no-regret learning.
Online Learning
[Vovk92, Warmuth94, Freund97, Zinkevich03, Kalai05, Hazan06, Kakade08]
Convex decision set 𝒳
Learner picks a decision x_0 ∈ 𝒳
Adversary picks a loss ℓ_0 : 𝒳 → ℝ
Learner picks a new decision x_1
Adversary picks a loss ℓ_1 : 𝒳 → ℝ
…
Regret = ∑_{t=0}^{T−1} ℓ_t(x_t) − min_{x∈𝒳} ∑_{t=0}^{T−1} ℓ_t(x)
A no-regret algorithm: Follow-the-Leader (FTL)
At time step t, the learner has seen ℓ_0, …, ℓ_{t−1}; which new decision should she pick?
FTL: x_t = arg min_{x∈𝒳} ∑_{i=0}^{t−1} ℓ_i(x)
Theorem (FTL): if 𝒳 is convex and ℓ_t is strongly convex for all t, then the average regret of FTL satisfies:
(1/T) [∑_{t=0}^{T−1} ℓ_t(x_t) − min_{x∈𝒳} ∑_{t=0}^{T−1} ℓ_t(x)] = O(log(T)/T)
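As a quick sanity check of the theorem (illustrative only), take strongly convex quadratic losses ℓ_t(x) = ‖x − y_t‖²: the FTL decision is just the running mean of the y_t seen so far, and the average regret shrinks on the order of log(T)/T.

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 2000, 3
    ys = rng.normal(size=(T, d))                # the adversary's (here random) loss parameters
    x, total, seen = np.zeros(d), 0.0, []
    for t in range(T):
        total += np.sum((x - ys[t]) ** 2)       # suffer ell_t(x_t)
        seen.append(ys[t])
        x = np.mean(seen, axis=0)               # FTL: argmin_x of sum_{i<=t} ||x - y_i||^2 is the running mean
    best = np.sum((ys - ys.mean(axis=0)) ** 2)  # best fixed decision in hindsight
    print((total - best) / T)                   # average regret: small, and shrinking in T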
DAgger Revisited
At iteration n: execute the current learned policy and query the expert (steering labels) on the states it visits → new data; aggregate with all previous data; run supervised learning → new policy π_n.
Per-iteration loss: L_n(π) = ∑_i ‖π(x_i) − y_i‖_2^2, summed over the data collected at iteration n.
Policy update: arg min_π ∑_{t=1}^{n} L_t(π), i.e., supervised learning on the aggregated dataset.
Data Aggregation = Follow-the-Leader Online Learner
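Concretely, if we additionally assume a linear policy π(x) = W⊤x (purely for illustration), the FTL step arg min_π ∑_{t≤n} L_t(π) with the squared loss above is just (regularized) least squares on the aggregated dataset:

    import numpy as np

    def ftl_linear_policy(X_agg, Y_agg, reg=1e-3):
        # X_agg: all states visited so far, stacked (N, d); Y_agg: the queried expert actions (N, k)
        d = X_agg.shape[1]
        W = np.linalg.solve(X_agg.T @ X_agg + reg * np.eye(d), X_agg.T @ Y_agg)
        return lambda x: x @ W                  # the new policy pi_{n+1}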
DAgger Analysis: a reduction to no-regret online learning
Finite-horizon episodic MDP; assume a discrete action space.
Decision set: a restricted policy class Π := {π : S ↦ A} (π⋆ may not be inside Π).
Online learning loss at iteration t: ℓ_t(π) = 𝔼_{s∼d^{π_t}}[ℓ(π(s), π⋆(s))]
(Here ℓ could be any convex surrogate loss for classification, e.g., hinge loss.)
DAgger is equivalent to FTL, i.e., π_{t+1} = arg min_{π∈Π} ∑_{i=0}^{t} ℓ_i(π)
If the online learning procedure ensures no regret, then
(1/T) [∑_{t=0}^{T−1} ℓ_t(π_t) − min_{π∈Π} ∑_{t=0}^{T−1} ℓ_t(π)] = o(T)/T
DAgger Analysis: a reduction to no-regret online learning
Online learning loss at iteration t: ℓ_t(π) = 𝔼_{s∼d^{π_t}}[ℓ(π(s), π⋆(s))]
No regret means ∑_{t=0}^{T−1} ℓ_t(π_t) − min_{π∈Π} ∑_{t=0}^{T−1} ℓ_t(π) = o(T). Dividing by T:
(1/T) ∑_{t=0}^{T−1} ℓ_t(π_t) = ϵ_{avg-reg} + ϵ_Π, where ϵ_{avg-reg} := o(T)/T and ϵ_Π := min_{π∈Π} (1/T) ∑_{t=0}^{T−1} ℓ_t(π).
Hence ∃ t̂ ∈ {0, …, T−1} such that ℓ_{t̂}(π_{t̂}) ≤ ϵ_{avg-reg} + ϵ_Π.
Under the assumption that the surrogate loss upper bounds the zero-one loss:
𝔼_{s∼d^{π_{t̂}}}[1{π_{t̂}(s) ≠ π⋆(s)}] ≤ 𝔼_{s∼d^{π_{t̂}}}[ℓ(π_{t̂}(s), π⋆(s))] ≤ ϵ_{avg-reg} + ϵ_Π
DAgger Analysis: a reduction to no-regret online learning
𝔼_{s∼d^{π_{t̂}}}[1{π_{t̂}(s) ≠ π⋆(s)}] ≤ 𝔼_{s∼d^{π_{t̂}}}[ℓ(π_{t̂}(s), π⋆(s))] ≤ ϵ_reg + ϵ_Π
i.e., π_{t̂} can predict π⋆ well under its own state distribution.
Let's turn this into a bound on the true performance under the cost function c(s, a). By the performance difference lemma, and using A⋆(s, π⋆(s)) = 0:
V^{π_{t̂}} − V^{π⋆} = (1/(1−γ)) 𝔼_{s∼d^{π_{t̂}}}[A⋆(s, π_{t̂}(s))]
 = (1/(1−γ)) 𝔼_{s∼d^{π_{t̂}}}[A⋆(s, π_{t̂}(s)) − A⋆(s, π⋆(s))]
 ≤ (1/(1−γ)) 𝔼_{s∼d^{π_{t̂}}}[1{π_{t̂}(s) ≠ π⋆(s)} · max_{s,a} A⋆(s, a)]
 ≤ (max_{s,a} A⋆(s, a) / (1−γ)) · (ϵ_reg + ϵ_Π)
Case study:
1. Worst case: A⋆(s, a) ≈ 1/(1−γ) (not recoverable from a mistake): quadratic dependence on the horizon, i.e., no better than BC;
2. Good case: A⋆(s, a) ≈ o(1/(1−γ)) (easily recoverable from a one-step mistake): better than BC.
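Writing ε := ϵ_reg + ϵ_Π and H := 1/(1−γ) for the effective horizon, the two cases plug into the bound as follows (a restatement of the case study above, not a new result):

    V^{\pi_{\hat t}} - V^{\pi^\star}
    \;\le\; \frac{\max_{s,a} A^\star(s,a)}{1-\gamma}\,\varepsilon
    \;\approx\;
    \begin{cases}
    H^2\,\varepsilon & \text{if } \max_{s,a} A^\star(s,a) \approx H \ \text{(mistakes not recoverable; BC-like rate)},\\
    o(H^2)\,\varepsilon & \text{if } \max_{s,a} A^\star(s,a) = o(H) \ \text{(one-step mistakes are cheap to recover from)}.
    \end{cases}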
Summary of Imitation Learning
1. Behavior Cloning (Maximum Likelihood Estimation):
Performance gap ≈ (1/(1−γ)^2) · (classification error)
2. Hybrid Distribution Matching (w/ IPM or MaxEnt-IRL):
Performance gap ≈ (1/(1−γ)) · (classification error)
3. DAgger w/ Interactive Experts:
Performance gap ≈ (sup_{s,a} |A⋆(s, a)| / (1−γ)) · (classification error)