
Page 1

Inverse Reinforcement Learning: Model-Free Approaches

Olivier Pietquin ([email protected])

Supélec - UMI 2958 (GeorgiaTech - CNRS), France, MaLIS Research Group

September 6, 2012, GDR Robotique

Page 2

Reinforcement Learning

Notions

- Agent
- Task
- Environment

Page 3

Imitation: Expert

Page 4

Generalization

Inverse RL

Reward inference

Given 1) measurements of an agent's behaviour over time, in a variety of circumstances; 2) measurements of the sensory inputs to that agent; 3) a model of the physical environment (if available).

Determine the reward function that the agent is optimizing.

Page 5

Seminal paper by Russell [Russell, 1998] (and [Kalman, 1964])

Why?
- In RL, the reward is often a heuristic that might be wrong
- Understand human and animal behavior
- Most compact representation of the task (transfer, imitation)

Questions
- To what extent is a reward uniquely recoverable?
- What are appropriate error metrics for fitting?
- What is the computational complexity?
- Can we identify local inconsistencies?
- Can we determine the reward before learning?
- How many observations are required?

Page 6

First IRL algorithms [Ng and Russell, 2000]

Inverting Bellman equations for finite MDPs

$$\forall a \ne a^*: \quad (P_{a^*} - P_a)(I - \gamma P_{a^*})^{-1} R \succeq 0$$

Problems
- 0 is a solution
- There is an infinite set of solutions

Solutions: add constraints
- Penalize deviations w.r.t. $\pi^*$
- Linear representation of R ($\hat{r} = \theta^T \psi(s)$)
- Solve with linear programming
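As a concrete illustration (not from the slides), the optimality condition above can be checked numerically on a finite MDP; the function name, array shapes and tolerance below are assumptions:

```python
import numpy as np

def satisfies_ng_russell_condition(P, pi_star, R, gamma=0.95, tol=1e-8):
    """Check (P_{a*} - P_a)(I - gamma * P_{a*})^{-1} R >= 0 for every a != a*.

    P: transition kernels, shape (A, S, S); pi_star: expert action per state, shape (S,);
    R: candidate state reward, shape (S,).
    """
    A, S, _ = P.shape
    P_star = P[pi_star, np.arange(S), :]                # rows follow the expert policy
    V = np.linalg.solve(np.eye(S) - gamma * P_star, R)  # (I - gamma * P_{a*})^{-1} R
    return all(np.all((P_star - P[a]) @ V >= -tol) for a in range(A))
```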

Page 7

Apprenticeship learning via IRL [Abbeel and Ng, 2004] I

Find a policy whose performance is close to the expert's on the unknown reward function $\hat{r}^* = \theta^{*T} \psi(s)$.

Feature expectation: $\mu^{\pi}(s)$

$$V^{\pi}(s_t) = E\left[\sum_i \gamma^i r_{t+i} \,\middle|\, \pi\right] = E\left[\sum_i \gamma^i \theta^T \psi(s_{t+i}) \,\middle|\, \pi\right] = \theta^T \underbrace{E\left[\sum_i \gamma^i \psi(s_{t+i}) \,\middle|\, \pi\right]}_{\mu^{\pi}(s_t)} = \theta^T \mu^{\pi}(s_t)$$
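A minimal Monte Carlo sketch of this identity (my own illustration, not from the slides; the feature map phi and the discount are assumptions): $\mu^\pi$ is estimated from rollouts of $\pi$, and the value follows as $\theta^T \mu^\pi$.

```python
import numpy as np

def mc_feature_expectation(trajectories, phi, gamma=0.95):
    """Estimate mu^pi(s_0) = E[sum_i gamma^i phi(s_i) | pi] from sampled trajectories."""
    estimates = []
    for states in trajectories:                      # each trajectory starts in s_0
        discounts = gamma ** np.arange(len(states))
        estimates.append(sum(d * phi(s) for d, s in zip(discounts, states)))
    return np.mean(estimates, axis=0)

# For a reward r(s) = theta @ phi(s), the value of pi in s_0 is then simply
# V_pi = theta @ mc_feature_expectation(trajectories, phi)
```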

Page 8

Apprenticeship learning via IRL [Abbeel and Ng, 2004] II

Fitness to performance

$$\text{If } \|\theta\|_2 \le 1: \quad \|\mu^{\pi}(s) - \mu^{E}(s)\|_2 < \epsilon \;\Rightarrow\; \|V^{\pi}(s) - V^{E}(s)\|_2 < \epsilon$$

1. Initialize $\Pi$ with a random policy $\pi_0$ and compute $\mu_0 = \mu^{\pi_0}$
2. Compute $t$ and $\theta$ such that
$$t = \max_{\theta} \min_{\pi_i \in \Pi} \theta^T (\mu^E - \mu_i) \quad \text{s.t.} \quad \|\theta\|_2 \le 1$$
If $t \le \xi$, terminate.
3. Train a new policy $\pi_i$ optimizing $R = \theta^T \psi(s)$
4. Compute $\mu_i$ for $\pi_i$; add $\pi_i$ to $\Pi$
5. Go to step 2.
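A sketch of this loop, substituting the simpler projection variant of Abbeel and Ng (2004) for the max-margin step 2; solve_mdp and estimate_mu are placeholder callables (an RL solver and a rollout estimator), not part of the slides:

```python
import numpy as np

def apprenticeship_learning(mu_E, solve_mdp, estimate_mu, xi=1e-3, max_iter=50):
    """Projection variant of apprenticeship learning via IRL (Abbeel & Ng, 2004).

    mu_E: expert feature expectation, shape (p,).
    solve_mdp(theta): returns a policy optimal for the reward r(s) = theta @ psi(s).
    estimate_mu(pi): returns the feature expectation of policy pi.
    """
    pi = solve_mdp(np.zeros_like(mu_E))            # step 1: arbitrary initial policy
    mu_bar = estimate_mu(pi)
    policies = [pi]
    theta = mu_E - mu_bar
    while np.linalg.norm(theta) > xi and len(policies) <= max_iter:
        pi = solve_mdp(theta)                      # step 3: train a new policy
        mu = estimate_mu(pi)                       # step 4: its feature expectation
        policies.append(pi)
        d = mu - mu_bar                            # project mu_E onto the segment [mu_bar, mu]
        mu_bar = mu_bar + (d @ (mu_E - mu_bar)) / (d @ d) * d
        theta = mu_E - mu_bar                      # step 2 (projection form)
    return policies, theta
```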

Page 9

Apprenticeship learning via IRL [Abbeel and Ng, 2004] III

Many methods based on matching feature expectations

Game theoretic point of view [Syed and Schapire, 2008]

Linear programming (generates only 1 policy) [Syed et al., 2008]

Maximum entropy [Ziebart et al., 2008]

All of these are iterative algorithms that require computing $\mu^{\pi}$ for many policies $\pi$.

Page 10

Max-Margin Planning [Ratliff et al., 2006] I

Ideas
- Maximize the margin between the best policy obtained with the current estimation and the expert policy
- Allow the expert to be non-optimal (slack variables)

Quadratic program

$$\min_{\theta, \xi_i} \; \frac{1}{2}\|\theta\|^2 + \frac{\gamma}{n} \sum_i \beta_i \xi_i^2 \quad \text{s.t.} \quad \forall i: \; \theta^T \mu^{\pi_i^*}_{d_0, M_i} + \xi_i \ge \max_{\pi} \theta^T \mu^{\pi}_{d_0, M_i} + L_{M_i, \pi}$$

Page 11

Max-Margin Planning [Ratliff et al., 2006] II

Optimization problem: minimize $J(\theta)$

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \max_{\pi} \left( \theta^T \mu^{\pi}_{d_0, M_i} + L_{M_i, \pi} \right) - \theta^T \mu^{\pi_i^*}_{d_0, M_i} \right] + \frac{\lambda}{2}\|\theta\|^2$$

Problems
- max operator: sub-gradient descent
- Need to solve an MDP for each constraint: A*
- Need to estimate $\mu^{\pi}$ for any $\pi$
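For illustration, one sub-gradient step on $J(\theta)$ could look as follows; loss_augmented_plan is a placeholder for the per-example planner (e.g. A*) that solves the inner max, and the step sizes are assumptions:

```python
import numpy as np

def mmp_subgradient_step(theta, examples, loss_augmented_plan, lam=1e-2, lr=0.1):
    """One sub-gradient step on the Max-Margin Planning objective J(theta).

    examples: list of (mu_expert, mdp) pairs, each mu_expert of shape (p,).
    loss_augmented_plan(theta, mdp): feature expectation of the policy achieving
    max_pi [theta @ mu^pi + L_{M,pi}] (a planner such as A* or value iteration).
    """
    grad = lam * theta
    for mu_expert, mdp in examples:
        mu_worst = loss_augmented_plan(theta, mdp)   # inner max over policies
        grad += (mu_worst - mu_expert) / len(examples)
    return theta - lr * grad
```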

Page 12

Max-Margin Planning [Ratliff et al., 2006] III

Classification-based [Ratliff et al., 2007], [Taskar et al., 2005]

$$J(q) = \frac{1}{N} \sum_{i=1}^{N} \left[ \max_{a} \left( q(s_i, a) + l(s_i, a) \right) - q(s_i, a_i) \right]$$

Problems
- max operator: sub-gradient descent
- Not taking the structure into account (inconsistent behaviour)

Advantages

- $q(s, a)$ can be of any (differentiable) form
- No need to know the MDP

Page 13

Structured Classification of IRL: SCIRL [Klein et al., 2012]

Best of both worlds
- No need of an MDP model
- Takes structure into account

Contributions

- Classification with the structure of the MDP
- Only needs data from the expert
- Can use other data if available

Page 14

Special case of classification

Score function based classifiers

- Classifier: map inputs $x \in X$ to labels $y \in Y$
- Data: $\{(x_i, y_i)\}_{1 \le i \le N}$
- Decision rule: $g \in Y^X$
- Score function: $g_s(x) \in \operatorname{argmax}_{y \in Y} s(x, y)$

Linearly parameterized score function

$$s(x, y) = \theta^T \phi(x, y) \qquad (1)$$
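For concreteness, the decision rule of such a classifier is just an argmax over labels; a tiny sketch with illustrative names (not from the slides):

```python
import numpy as np

def linear_score_decision(theta, phi, x, labels):
    """g_s(x) = argmax_{y in Y} theta^T phi(x, y) for a linear score s(x, y)."""
    scores = [theta @ phi(x, y) for y in labels]
    return labels[int(np.argmax(scores))]
```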

Page 15

Let's compare

Putting it all together: $X \equiv S$, $Y \equiv A$, $s \equiv Q^{\pi_E}$ $\Rightarrow$ $\phi \equiv \mu^E$

Linearly parametrized score-function-based classifier:
$$g_s(x) \in \operatorname{argmax}_{y \in Y} s(x, y), \qquad s(x, y) = \theta^T \phi(x, y) \quad\Rightarrow\quad g_s(x) \in \operatorname{argmax}_{y \in Y} \theta^T \phi(x, y)$$

Expert's decision rule:
$$\pi_E(s) = \operatorname{argmax}_{a} Q^{\pi_E}(s, a), \qquad Q^{\pi}(s_t, a) = \theta^T \mu^{\pi}(s_t, a) \quad\Rightarrow\quad \pi_E(s) = \operatorname{argmax}_{a} \theta^T \mu^{\pi_E}(s, a)$$


Page 20

SCIRL Pseudo-code

Algorithm 1: SCIRL algorithm

Given a training set $D = \{(s_i, a_i = \pi_E(s_i))\}_{1 \le i \le N}$, an estimate $\hat\mu^{\pi_E}$ of the expert feature expectation $\mu^{\pi_E}$, and a classification algorithm;

Compute the parameter vector $\theta_c$ using the classification algorithm fed with the training set $D$ and considering the parameterized score function $\theta^T \hat\mu^{\pi_E}(s, a)$;

Output the reward function $R_{\theta_c}(s) = \theta_c^T \psi(s)$.
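A minimal sketch of this pipeline (my own wiring, not code from the paper; mu_hat, psi and fit_classifier are assumed callables):

```python
def scirl(expert_states, expert_actions, mu_hat, psi, fit_classifier):
    """SCIRL: classify expert actions with score theta^T mu_hat(s, a), return a reward.

    expert_states, expert_actions: the training set D = {(s_i, a_i = pi_E(s_i))}.
    mu_hat(s, a): estimated expert feature expectation (shape (p,)).
    psi(s): reward features (shape (p,)).
    fit_classifier(states, actions, mu_hat): returns theta_c (for instance the
    structured large-margin classifier sketched on the instantiation slide).
    """
    theta_c = fit_classifier(expert_states, expert_actions, mu_hat)
    return lambda s: theta_c @ psi(s)          # R_{theta_c}(s) = theta_c^T psi(s)
```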

Page 21

Why is it nice?

Advantages

- Only $\mu^E$ is required
- No need to compute $\mu$ for other policies
- Based on transitions (not trajectories or policies)
- Based on standard classification methods
- Bounds can be computed
- Returns a reward

Page 22

Error bound

Definitions

$$C_f = (1 - \gamma) \sum_{t \ge 0} \gamma^t c(t) \quad \text{with} \quad c(t) = \max_{\pi_1, \dots, \pi_t, \, s \in S} \frac{(\rho_E^T P_{\pi_1} \cdots P_{\pi_t})(s)}{\rho_E(s)}$$

$$\epsilon_c = E_{s \sim \rho_E}\left[\mathbb{1}_{\{\pi_c(s) \ne \pi_E(s)\}}\right] \in [0, 1]$$

$$\epsilon_\mu = \hat\mu^{\pi_E} - \mu^{\pi_E} : S \times A \to \mathbb{R}^p, \qquad \epsilon_Q = \theta_c^T \epsilon_\mu : S \times A \to \mathbb{R}$$

$$\bar\epsilon_Q = E_{s \sim \rho_E}\left[\max_{a \in A} \epsilon_Q(s, a) - \min_{a \in A} \epsilon_Q(s, a)\right] \ge 0$$

Theorem

$$0 \le E_{s \sim \rho_E}\left[v^*_{R_{\theta_c}} - v^{\pi_E}_{R_{\theta_c}}\right] \le \frac{C_f}{1 - \gamma} \left( \bar\epsilon_Q + \epsilon_c \, \frac{2\gamma \|R_{\theta_c}\|_\infty}{1 - \gamma} \right) \qquad (2)$$


Page 28

Instantiation

Structured large margin classifier [Taskar et al., 2005]

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \max_{a} \left( \theta^T \hat\mu^{\pi_E}(s_i, a) + L(s_i, a) \right) - \theta^T \hat\mu^{\pi_E}(s_i, a_i) \right] + \frac{\lambda}{2}\|\theta\|^2 \qquad (3)$$

$L(s_i, a) = 1$ if $a \ne a_i$, 0 otherwise. Trained by sub-gradient descent (see the sketch below).
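A sub-gradient descent sketch for Eq. (3); an illustration only, with assumed hyper-parameters (lam, lr, n_iter) and an assumed signature for mu_hat:

```python
import numpy as np

def fit_structured_classifier(states, actions, mu_hat, n_actions, dim,
                              lam=1e-2, lr=0.1, n_iter=200):
    """Minimize Eq. (3) by sub-gradient descent; returns theta_c."""
    theta = np.zeros(dim)
    N = len(states)
    for _ in range(n_iter):
        grad = lam * theta
        for s, a_exp in zip(states, actions):
            # loss-augmented decision: argmax_a [theta^T mu_hat(s, a) + L(s, a)]
            scores = [theta @ mu_hat(s, a) + (a != a_exp) for a in range(n_actions)]
            a_star = int(np.argmax(scores))
            grad += (mu_hat(s, a_star) - mu_hat(s, a_exp)) / N
        theta = theta - lr * grad
    return theta
```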

Page 29

Computing µE

LSTD-µ [?]

Based on the already-known Least-Squares Temporal Differences (LSTD) method.

Characteristics

- Can be fed with mere transitions
- No model
- Off- or on-policy evaluation

Monte Carlo with heuristics

$$\hat\mu^{\pi}(s_0, a_0) = \frac{1}{M} \sum_{j=1}^{M} \sum_{i \ge 0} \gamma^i \phi(s_i^j), \qquad \hat\mu^{\pi}(s_0, a \ne a_0) = \gamma \, \hat\mu^{\pi}(s_0, a_0)$$

Characteristics

- Only on-policy
- Seems more robust in dire conditions
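A sketch of this Monte Carlo estimator with the γ-heuristic for non-expert actions (my own illustration; the trajectory format and feature map phi are assumptions):

```python
import numpy as np

def mc_mu_with_heuristic(trajectories, a0, phi, n_actions, gamma=0.95):
    """Estimate mu_hat^pi(s_0, a) from M expert trajectories sharing (s_0, a_0).

    trajectories: list of state sequences [s_0, s_1, ...] generated by the expert.
    Returns a dict a -> mu_hat(s_0, a); non-expert actions get gamma * mu_hat(s_0, a_0).
    """
    estimates = []
    for states in trajectories:
        discounts = gamma ** np.arange(len(states))
        estimates.append(sum(d * phi(s) for d, s in zip(discounts, states)))
    mu_on_policy = np.mean(estimates, axis=0)
    return {a: (mu_on_policy if a == a0 else gamma * mu_on_policy)
            for a in range(n_actions)}
```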

Page 30

Inverted Pendulum

[Figure: four panels plotted over the (Position, Speed) state space of the inverted pendulum; only the axis labels and colour-scale ranges survive extraction.]

Page 31

Results on the driving problem

[Figure: two plots of $E_{s \sim U}[V^{\pi}_{R_E}(s)]$ versus the number of samples from the expert (50 to 400).]

Description

- Goal of the expert: avoid other cars, do not go off-road, go fast
- Using only data from the expert and natural features
- Non-trivial (state of the art does not work)

Page 32

Future work

- No computation of $\mu^E$ everywhere
- Real-world problems
- Task transfer?

Page 33

Thank you...

...for your attention

Page 34

References I

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

Rudolph Kalman. When is a linear control system optimal? Transactions of the ASME, Journal of Basic Engineering, Series D, 86:81-90, 1964.

Edouard Klein, Matthieu Geist, Bilal Piot, and Olivier Pietquin. Inverse Reinforcement Learning through Structured Classification. In Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe (NV, USA), December 2012.

Andrew Y. Ng and Stuart Russell. Algorithms for Inverse Reinforcement Learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), 2000.

Nathan Ratliff, Andrew D. Bagnell, and Martin Zinkevich. Maximum Margin Planning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.

Nathan Ratliff, Andrew D. Bagnell, and Martin Zinkevich. Imitation learning for locomotion and manipulation. In Proceedings of ICHR, 2007.

Page 35

References II

Stuart Russell. Learning agents for uncertain environments (extended abstract). In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT), 1998.

Umar Syed and Robert Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20 (NIPS), 2008.

Umar Syed, Michael Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning Structured Prediction Models: a Large Margin Approach. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.

Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI), 2008.
