
Page 1

Inverse Reinforcement Learning: Model-Free Approaches

Olivier Pietquin ([email protected])

Supélec - UMI 2958 (GeorgiaTech - CNRS), France, MaLIS Research Group

September 6, 2012, GDR Robotique

Page 2

Reinforcement Learning

Notions

- Agent
- Task
- Environment

Page 3

Imitation: Expert

Page 4

Generalization

Inverse RL

Reward inference

Given 1) measurements of an agent's behaviour over time, in a variety of circumstances; 2) measurements of the sensory inputs to that agent; 3) a model of the physical environment (if available).

Determine the reward function that the agent is optimizing.

Page 5

Seminal paper by Russell [Russell, 1998] (and [Kalman, 1964])

Why?
- In RL, the reward is often a heuristic that might be wrong
- Understand human and animal behavior
- Most compact representation of the task (transfer, imitation)

Questions
- To what extent is a reward uniquely recoverable?
- What are appropriate error metrics for fitting?
- What is the computational complexity?
- Can we identify local inconsistencies?
- Can we determine the reward before learning?
- How many observations are required?

Page 6

First IRL algorithms [Ng and Russell, 2000]

Inverting Bellman equations for finite MDPs

$$\forall a \ne a^*: \quad (P_{a^*} - P_a)(I - \gamma P_{a^*})^{-1} R \succeq 0$$

Problems
- 0 is a solution
- There is an infinite set of solutions

Solutions: add constraints
- Penalize deviations w.r.t. $\pi^*$
- Linear representation of R ($\hat{r} = \theta^T \psi(s)$)
- Solve with linear programming
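As a concrete illustration (not from the slides), the optimality condition above can be checked numerically on a finite MDP; the function name, array shapes and tolerance below are assumptions:

```python
import numpy as np

def satisfies_ng_russell_condition(P, pi_star, R, gamma=0.95, tol=1e-8):
    """Check (P_{a*} - P_a)(I - gamma * P_{a*})^{-1} R >= 0 for every a != a*.

    P: transition kernels, shape (A, S, S); pi_star: expert action per state, shape (S,);
    R: candidate state reward, shape (S,).
    """
    A, S, _ = P.shape
    P_star = P[pi_star, np.arange(S), :]                # rows follow the expert policy
    V = np.linalg.solve(np.eye(S) - gamma * P_star, R)  # (I - gamma * P_{a*})^{-1} R
    return all(np.all((P_star - P[a]) @ V >= -tol) for a in range(A))
```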

Page 7

Apprenticeship learning via IRL [Abbeel and Ng, 2004] I

Find a policy whose performance is close to the expert's on the unknown reward function $\hat{r}^* = \theta^{*T} \psi(s)$.

Feature expectation: $\mu^{\pi}(s)$

$$V^{\pi}(s_t) = E\left[\sum_i \gamma^i r_{t+i} \,\middle|\, \pi\right] = E\left[\sum_i \gamma^i \theta^T \psi(s_{t+i}) \,\middle|\, \pi\right] = \theta^T \underbrace{E\left[\sum_i \gamma^i \psi(s_{t+i}) \,\middle|\, \pi\right]}_{\mu^{\pi}(s_t)} = \theta^T \mu^{\pi}(s_t)$$
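A minimal Monte Carlo sketch of this identity (my own illustration, not from the slides; the feature map phi and the discount are assumptions): $\mu^\pi$ is estimated from rollouts of $\pi$, and the value follows as $\theta^T \mu^\pi$.

```python
import numpy as np

def mc_feature_expectation(trajectories, phi, gamma=0.95):
    """Estimate mu^pi(s_0) = E[sum_i gamma^i phi(s_i) | pi] from sampled trajectories."""
    estimates = []
    for states in trajectories:                      # each trajectory starts in s_0
        discounts = gamma ** np.arange(len(states))
        estimates.append(sum(d * phi(s) for d, s in zip(discounts, states)))
    return np.mean(estimates, axis=0)

# For a reward r(s) = theta @ phi(s), the value of pi in s_0 is then simply
# V_pi = theta @ mc_feature_expectation(trajectories, phi)
```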

Page 8

Apprenticeship learning via IRL [Abbeel and Ng, 2004] II

Fitness to performance

$$\text{If } \|\theta\|_2 \le 1: \quad \|\mu^{\pi}(s) - \mu^{E}(s)\|_2 < \epsilon \;\Rightarrow\; \|V^{\pi}(s) - V^{E}(s)\|_2 < \epsilon$$

1. Initialize $\Pi$ with a random policy $\pi_0$ and compute $\mu_0 = \mu^{\pi_0}$
2. Compute $t$ and $\theta$ such that
$$t = \max_{\theta} \min_{\pi_i \in \Pi} \theta^T (\mu^E - \mu_i) \quad \text{s.t.} \quad \|\theta\|_2 \le 1$$
If $t \le \xi$, terminate.
3. Train a new policy $\pi_i$ optimizing $R = \theta^T \psi(s)$
4. Compute $\mu_i$ for $\pi_i$; add $\pi_i$ to $\Pi$
5. Go to step 2.
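A sketch of this loop, substituting the simpler projection variant of Abbeel and Ng (2004) for the max-margin step 2; solve_mdp and estimate_mu are placeholder callables (an RL solver and a rollout estimator), not part of the slides:

```python
import numpy as np

def apprenticeship_learning(mu_E, solve_mdp, estimate_mu, xi=1e-3, max_iter=50):
    """Projection variant of apprenticeship learning via IRL (Abbeel & Ng, 2004).

    mu_E: expert feature expectation, shape (p,).
    solve_mdp(theta): returns a policy optimal for the reward r(s) = theta @ psi(s).
    estimate_mu(pi): returns the feature expectation of policy pi.
    """
    pi = solve_mdp(np.zeros_like(mu_E))            # step 1: arbitrary initial policy
    mu_bar = estimate_mu(pi)
    policies = [pi]
    theta = mu_E - mu_bar
    while np.linalg.norm(theta) > xi and len(policies) <= max_iter:
        pi = solve_mdp(theta)                      # step 3: train a new policy
        mu = estimate_mu(pi)                       # step 4: its feature expectation
        policies.append(pi)
        d = mu - mu_bar                            # project mu_E onto the segment [mu_bar, mu]
        mu_bar = mu_bar + (d @ (mu_E - mu_bar)) / (d @ d) * d
        theta = mu_E - mu_bar                      # step 2 (projection form)
    return policies, theta
```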

Page 9

Apprenticeship learning via IRL [Abbeel and Ng, 2004] III

Many methods based on matching feature expectations

Game theoretic point of view [Syed and Schapire, 2008]

Linear programming (generates only 1 policy) [Syed et al., 2008]

Maximum entropy [Ziebart et al., 2008]

All of these are iterative algorithms that require computing $\mu^{\pi}$ for many policies $\pi$.

Page 10

Max-Margin Planning [Ratliff et al., 2006] I

Ideas
- Maximize the margin between the best policy obtained with the current estimation and the expert policy
- Allow the expert to be non-optimal (slack variables)

Quadratic program

$$\min_{\theta, \xi_i} \; \frac{1}{2}\|\theta\|^2 + \frac{\gamma}{n} \sum_i \beta_i \xi_i^2 \quad \text{s.t.} \quad \forall i: \; \theta^T \mu^{\pi_i^*}_{d_0, M_i} + \xi_i \ge \max_{\pi} \theta^T \mu^{\pi}_{d_0, M_i} + L_{M_i, \pi}$$

Page 11

Max-Margin Planning [Ratliff et al., 2006] II

Optimization problem: minimize $J(\theta)$

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \max_{\pi} \left( \theta^T \mu^{\pi}_{d_0, M_i} + L_{M_i, \pi} \right) - \theta^T \mu^{\pi_i^*}_{d_0, M_i} \right] + \frac{\lambda}{2}\|\theta\|^2$$

Problems
- max operator: sub-gradient descent
- Need to solve an MDP for each constraint: A*
- Need to estimate $\mu^{\pi}$ for any $\pi$
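For illustration, one sub-gradient step on $J(\theta)$ could look as follows; loss_augmented_plan is a placeholder for the per-example planner (e.g. A*) that solves the inner max, and the step sizes are assumptions:

```python
import numpy as np

def mmp_subgradient_step(theta, examples, loss_augmented_plan, lam=1e-2, lr=0.1):
    """One sub-gradient step on the Max-Margin Planning objective J(theta).

    examples: list of (mu_expert, mdp) pairs, each mu_expert of shape (p,).
    loss_augmented_plan(theta, mdp): feature expectation of the policy achieving
    max_pi [theta @ mu^pi + L_{M,pi}] (a planner such as A* or value iteration).
    """
    grad = lam * theta
    for mu_expert, mdp in examples:
        mu_worst = loss_augmented_plan(theta, mdp)   # inner max over policies
        grad += (mu_worst - mu_expert) / len(examples)
    return theta - lr * grad
```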

Page 12

Max-Margin Planning [Ratliff et al., 2006] III

Classification-based [Ratliff et al., 2007], [Taskar et al., 2005]

$$J(q) = \frac{1}{N} \sum_{i=1}^{N} \left[ \max_{a} \left( q(s_i, a) + l(s_i, a) \right) - q(s_i, a_i) \right]$$

Problems
- max operator: sub-gradient descent
- Not taking the structure into account (inconsistent behaviour)

Advantages

- $q(s, a)$ can be of any (differentiable) form
- No need to know the MDP

Page 13

Structured Classification of IRL: SCIRL [Klein et al., 2012]

Best of both worlds
- No need of an MDP model
- Takes structure into account

Contributions

- Classification with the structure of the MDP
- Only needs data from the expert
- Can use other data if available

Page 14

Special case of classification

Score function based classifiers

- Classifier: map inputs $x \in X$ to labels $y \in Y$
- Data: $\{(x_i, y_i)\}_{1 \le i \le N}$
- Decision rule: $g \in Y^X$
- Score function: $g_s(x) \in \operatorname{argmax}_{y \in Y} s(x, y)$

Linearly parameterized score function

$$s(x, y) = \theta^T \phi(x, y) \qquad (1)$$
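For concreteness, the decision rule of such a classifier is just an argmax over labels; a tiny sketch with illustrative names (not from the slides):

```python
import numpy as np

def linear_score_decision(theta, phi, x, labels):
    """g_s(x) = argmax_{y in Y} theta^T phi(x, y) for a linear score s(x, y)."""
    scores = [theta @ phi(x, y) for y in labels]
    return labels[int(np.argmax(scores))]
```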

Page 15

Let's compare

Putting it all together: $X \equiv S$, $Y \equiv A$, $s \equiv Q^{\pi_E}$ $\Rightarrow$ $\phi \equiv \mu^E$

Linearly parametrized score-function-based classifier:
$$g_s(x) \in \operatorname{argmax}_{y \in Y} s(x, y), \qquad s(x, y) = \theta^T \phi(x, y) \quad\Rightarrow\quad g_s(x) \in \operatorname{argmax}_{y \in Y} \theta^T \phi(x, y)$$

Expert's decision rule:
$$\pi_E(s) = \operatorname{argmax}_{a} Q^{\pi_E}(s, a), \qquad Q^{\pi}(s_t, a) = \theta^T \mu^{\pi}(s_t, a) \quad\Rightarrow\quad \pi_E(s) = \operatorname{argmax}_{a} \theta^T \mu^{\pi_E}(s, a)$$


Page 20

SCIRL Pseudo-code

Algorithm 1: SCIRL algorithm

Given a training set $D = \{(s_i, a_i = \pi_E(s_i))\}_{1 \le i \le N}$, an estimate $\hat\mu^{\pi_E}$ of the expert feature expectation $\mu^{\pi_E}$, and a classification algorithm;

Compute the parameter vector $\theta_c$ using the classification algorithm fed with the training set $D$ and considering the parameterized score function $\theta^T \hat\mu^{\pi_E}(s, a)$;

Output the reward function $R_{\theta_c}(s) = \theta_c^T \psi(s)$.
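A minimal sketch of this pipeline (my own wiring, not code from the paper; mu_hat, psi and fit_classifier are assumed callables):

```python
def scirl(expert_states, expert_actions, mu_hat, psi, fit_classifier):
    """SCIRL: classify expert actions with score theta^T mu_hat(s, a), return a reward.

    expert_states, expert_actions: the training set D = {(s_i, a_i = pi_E(s_i))}.
    mu_hat(s, a): estimated expert feature expectation (shape (p,)).
    psi(s): reward features (shape (p,)).
    fit_classifier(states, actions, mu_hat): returns theta_c (for instance the
    structured large-margin classifier sketched on the instantiation slide).
    """
    theta_c = fit_classifier(expert_states, expert_actions, mu_hat)
    return lambda s: theta_c @ psi(s)          # R_{theta_c}(s) = theta_c^T psi(s)
```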

Page 21

Why is it nice?

Advantages

- Only $\mu^E$ is required
- No need to compute $\mu$ for other policies
- Based on transitions (not trajectories or policies)
- Based on standard classification methods
- Bounds can be computed
- Returns a reward

Page 22

Error bound

Definitions

$$C_f = (1 - \gamma) \sum_{t \ge 0} \gamma^t c(t) \quad \text{with} \quad c(t) = \max_{\pi_1, \dots, \pi_t, \, s \in S} \frac{(\rho_E^T P_{\pi_1} \cdots P_{\pi_t})(s)}{\rho_E(s)}$$

$$\epsilon_c = E_{s \sim \rho_E}\left[\mathbb{1}_{\{\pi_c(s) \ne \pi_E(s)\}}\right] \in [0, 1]$$

$$\epsilon_\mu = \hat\mu^{\pi_E} - \mu^{\pi_E} : S \times A \to \mathbb{R}^p, \qquad \epsilon_Q = \theta_c^T \epsilon_\mu : S \times A \to \mathbb{R}$$

$$\bar\epsilon_Q = E_{s \sim \rho_E}\left[\max_{a \in A} \epsilon_Q(s, a) - \min_{a \in A} \epsilon_Q(s, a)\right] \ge 0$$

Theorem

$$0 \le E_{s \sim \rho_E}\left[v^*_{R_{\theta_c}} - v^{\pi_E}_{R_{\theta_c}}\right] \le \frac{C_f}{1 - \gamma} \left( \bar\epsilon_Q + \epsilon_c \, \frac{2\gamma \|R_{\theta_c}\|_\infty}{1 - \gamma} \right) \qquad (2)$$


Page 28

Instantiation

Structured large margin classifier [Taskar et al., 2005]

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \max_{a} \left( \theta^T \hat\mu^{\pi_E}(s_i, a) + L(s_i, a) \right) - \theta^T \hat\mu^{\pi_E}(s_i, a_i) \right] + \frac{\lambda}{2}\|\theta\|^2 \qquad (3)$$

$L(s_i, a) = 1$ if $a \ne a_i$, 0 otherwise. Trained by sub-gradient descent (see the sketch below).
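A sub-gradient descent sketch for Eq. (3); an illustration only, with assumed hyper-parameters (lam, lr, n_iter) and an assumed signature for mu_hat:

```python
import numpy as np

def fit_structured_classifier(states, actions, mu_hat, n_actions, dim,
                              lam=1e-2, lr=0.1, n_iter=200):
    """Minimize Eq. (3) by sub-gradient descent; returns theta_c."""
    theta = np.zeros(dim)
    N = len(states)
    for _ in range(n_iter):
        grad = lam * theta
        for s, a_exp in zip(states, actions):
            # loss-augmented decision: argmax_a [theta^T mu_hat(s, a) + L(s, a)]
            scores = [theta @ mu_hat(s, a) + (a != a_exp) for a in range(n_actions)]
            a_star = int(np.argmax(scores))
            grad += (mu_hat(s, a_star) - mu_hat(s, a_exp)) / N
        theta = theta - lr * grad
    return theta
```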

Page 29

Computing µE

LSTD-µ [?]

Based on the already-known Least-Squares Temporal Differences (LSTD) method.

Characteristics

- Can be fed with mere transitions
- No model
- Off- or on-policy evaluation

Monte Carlo with heuristics

$$\hat\mu^{\pi}(s_0, a_0) = \frac{1}{M} \sum_{j=1}^{M} \sum_{i \ge 0} \gamma^i \phi(s_i^j), \qquad \hat\mu^{\pi}(s_0, a \ne a_0) = \gamma \, \hat\mu^{\pi}(s_0, a_0)$$

Characteristics

- Only on-policy
- Seems more robust in dire conditions
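A sketch of this Monte Carlo estimator with the γ-heuristic for non-expert actions (my own illustration; the trajectory format and feature map phi are assumptions):

```python
import numpy as np

def mc_mu_with_heuristic(trajectories, a0, phi, n_actions, gamma=0.95):
    """Estimate mu_hat^pi(s_0, a) from M expert trajectories sharing (s_0, a_0).

    trajectories: list of state sequences [s_0, s_1, ...] generated by the expert.
    Returns a dict a -> mu_hat(s_0, a); non-expert actions get gamma * mu_hat(s_0, a_0).
    """
    estimates = []
    for states in trajectories:
        discounts = gamma ** np.arange(len(states))
        estimates.append(sum(d * phi(s) for d, s in zip(discounts, states)))
    mu_on_policy = np.mean(estimates, axis=0)
    return {a: (mu_on_policy if a == a0 else gamma * mu_on_policy)
            for a in range(n_actions)}
```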

Page 30

Inverted Pendulum

[Figure: four panels plotted over the (Position, Speed) state space of the inverted pendulum; only the axis labels and colour-scale ranges survive extraction.]

Page 31

Results on the driving problem

[Figure: two plots of $E_{s \sim U}[V^{\pi}_{R_E}(s)]$ versus the number of samples from the expert (50 to 400).]

Description

- Goal of the expert: avoid other cars, do not go off-road, go fast
- Using only data from the expert and natural features
- Non-trivial (state of the art does not work)

Page 32

Future work

- No computation of $\mu^E$ everywhere
- Real-world problems
- Task transfer?

Page 33

Thank you...

...for your attention

Page 34

References I

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

Rudolph Kalman. When is a linear control system optimal? Transactions of the ASME, Journal of Basic Engineering, Series D, 86:81-90, 1964.

Edouard Klein, Matthieu Geist, Bilal Piot, and Olivier Pietquin. Inverse Reinforcement Learning through Structured Classification. In Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe (NV, USA), December 2012.

Andrew Y. Ng and Stuart Russell. Algorithms for Inverse Reinforcement Learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), 2000.

Nathan Ratliff, Andrew D. Bagnell, and Martin Zinkevich. Maximum Margin Planning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.

Nathan Ratliff, Andrew D. Bagnell, and Martin Zinkevich. Imitation learning for locomotion and manipulation. In Proceedings of ICHR, 2007.

Page 35

References II

Stuart Russell. Learning agents for uncertain environments (extended abstract). In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT), 1998.

Umar Syed and Robert Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20 (NIPS), 2008.

Umar Syed, Michael Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning Structured Prediction Models: a Large Margin Approach. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.

Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI), 2008.
