Reinforcement Learning and
Apprenticeship Learning
Pieter Abbeel and Andrew Y. Ng
Stanford University
Example of Reinforcement Learning (RL) Problem
Highway driving.
Reinforcement Learning (RL) formalism
[Diagram] Dynamics model P_sa and reward function R feed into Reinforcement Learning, which outputs a control policy π maximizing E[R(s_0) + … + R(s_T)].
RL formalism
• Assume that at each time step, our system is in some state s_t.
• Upon taking an action a, our state randomly transitions to some new state s_{t+1}.
• We are also given a reward function R.
• The goal: pick actions over time so as to maximize the expected score: E[R(s_0) + R(s_1) + … + R(s_T)] (a Monte Carlo sketch of this objective follows the diagram below).
[Diagram] s_0 → s_1 → s_2 → … → s_{T-1} → s_T, with each arrow an application of the system dynamics; total score R(s_0) + R(s_1) + R(s_2) + … + R(s_{T-1}) + R(s_T).
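To make the objective concrete, here is a minimal Monte Carlo sketch, assuming hypothetical `policy(s)`, `step(s, a, rng)` (which samples P_sa), and `reward(s)` callables that are not part of the slides:

```python
import numpy as np

def expected_return(policy, step, reward, s0, T, n_rollouts=1000, rng=None):
    """Monte Carlo estimate of E[R(s_0) + ... + R(s_T)] under `policy`.

    policy(s) -> action; step(s, a, rng) -> next state (samples P_sa);
    reward(s) -> float. All three are assumed interfaces, not from the slides.
    """
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, reward(s0)
        for _ in range(T):
            s = step(s, policy(s), rng)
            ret += reward(s)
        total += ret
    return total / n_rollouts
```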
RL formalism
• Markov Decision Process (S, A, P_sa, s_0, R)
• W.l.o.g. we assume the reward is linear in known features: R(s) = w^T φ(s).
• Policy π: S → A.
• Utility of a policy π for reward R = w^T φ: U_w(π) = E[Σ_t R(s_t) | π] = w^T E[Σ_t φ(s_t) | π] = w^T μ(π), where μ(π) are the feature expectations of π (a Monte Carlo estimator is sketched below).
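Since U_w(π) = w^T μ(π), a policy's utility under any linear reward is determined by its feature expectations. A minimal sketch of estimating μ(π) by rollouts, reusing the same hypothetical `policy`/`step` interface as above with an assumed feature map `phi(s)`:

```python
import numpy as np

def feature_expectations(policy, step, phi, s0, T, n_rollouts=1000, rng=None):
    """Estimate mu(pi) = E[ sum_t phi(s_t) | pi ] by Monte Carlo rollouts.

    phi(s) -> length-k numpy array of features. Then U_w(pi) = w @ mu.
    """
    rng = rng or np.random.default_rng(0)
    mu = None
    for _ in range(n_rollouts):
        s, acc = s0, phi(s0).astype(float)
        for _ in range(T):
            s = step(s, policy(s), rng)
            acc += phi(s)
        mu = acc if mu is None else mu + acc
    return mu / n_rollouts
```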
RL formalism
[Diagram] Dynamics model P_sa and reward function R feed into Reinforcement Learning, which outputs a control policy π maximizing E[R(s_0) + … + R(s_T)].
Part I: Apprenticeship learning via inverse reinforcement learning
Motivation
Reinforcement learning (RL) gives powerful tools for solving MDPs, but it can be difficult to specify the reward function by hand. Example: highway driving.
Apprenticeship Learning
• Learning from observing an expert.
• Previous work:
– Learn to predict expert’s actions as a function of states.
– Usually lacks strong performance guarantees.
– (E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; …)
• Our approach:
– Based on inverse reinforcement learning (Ng & Russell, 2000).
– Returns a policy with performance as good as the expert's, as measured according to the expert's unknown reward function.
Algorithm
For t = 1, 2, …

Inverse RL step:
Estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}.

RL step:
Compute the optimal policy π_t for the estimated reward weights w.
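In outline, the loop might look like the following sketch (not the paper's exact pseudocode; `solve_mdp`, `estimate_mu`, and `fit_w` are assumed helpers standing in for an RL solver, feature-expectation estimation, and the inverse RL step):

```python
import numpy as np

def apprenticeship_learning(mu_expert, solve_mdp, estimate_mu, fit_w, eps,
                            max_iter=100):
    """Alternate the two steps until the expert is matched to within eps.

    All helpers are assumptions, not named in the slides:
    solve_mdp(w)     -> optimal policy for reward R(s) = w @ phi(s)  (RL step)
    estimate_mu(pi)  -> feature expectations mu(pi), e.g. by rollouts
    fit_w(mu_E, mus) -> (w, margin): weights under which the expert beats
                        all policies found so far          (inverse RL step)
    """
    pi = solve_mdp(np.zeros_like(mu_expert))   # arbitrary initial policy
    mus = [estimate_mu(pi)]
    for _ in range(max_iter):
        w, margin = fit_w(mu_expert, mus)      # inverse RL step
        if margin <= eps:                      # no reward separates expert
            return pi, w                       # -> performance within eps
        pi = solve_mdp(w)                      # RL step
        mus.append(estimate_mu(pi))
    return pi, w
```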
Algorithm: Inverse RL step
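The formula on this slide did not survive extraction; the inverse RL step in the underlying paper (Abbeel & Ng, 2004) is the max-margin program

max_{γ, w : ‖w‖_2 ≤ 1} γ
s.t. w^T μ(π_E) ≥ w^T μ(π_i) + γ for all previously found policies π_i,

i.e., find reward weights under which the expert outperforms every earlier policy by the largest possible margin γ.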
Feature Expectation Closeness and Performance
If we can find a policy π such that

‖μ(π_E) − μ(π)‖_2 ≤ ε,

then for any underlying reward R*(s) = w*^T φ(s) (with ‖w*‖_2 ≤ 1), we have that

|U_{w*}(π_E) − U_{w*}(π)| = |w*^T μ(π_E) − w*^T μ(π)| ≤ ‖w*‖_2 ‖μ(π_E) − μ(π)‖_2 ≤ ε,

where the middle inequality is Cauchy–Schwarz.
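A quick numerical sanity check of the bound, on hypothetical random vectors rather than data from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=6); w /= np.linalg.norm(w)       # ||w*||_2 = 1
mu_E, mu = rng.random(6), rng.random(6)              # feature expectations
lhs = abs(w @ mu_E - w @ mu)                         # |U(pi_E) - U(pi)|
rhs = np.linalg.norm(w) * np.linalg.norm(mu_E - mu)  # Cauchy-Schwarz bound
assert lhs <= rhs + 1e-12
```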
Theoretical Results: Convergence
Theorem. Let an MDP (without reward function), a k-dimensional feature vector φ, and the expert's feature expectations μ(π_E) be given. Then after at most

kT²/ε²

iterations, the algorithm outputs a policy π that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s), i.e.,

U_{w*}(π) ≥ U_{w*}(π_E) − ε.
Gridworld Experiments
The reward function is piecewise constant over small regions. The features φ for IRL are indicators of these small regions.
128x128 grid, small regions of size 16x16.
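So there are (128/16)² = 64 regions and φ(s) is a 64-dimensional indicator vector. A minimal sketch, assuming states are (row, col) pairs:

```python
import numpy as np

GRID, REGION = 128, 16
K = (GRID // REGION) ** 2          # 8 x 8 = 64 region-indicator features

def phi(state):
    """Indicator of which 16x16 region the (row, col) state lies in."""
    row, col = state
    idx = (row // REGION) * (GRID // REGION) + (col // REGION)
    f = np.zeros(K)
    f[idx] = 1.0
    return f
```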
Gridworld Experiments

[Figure slides: experimental results.]
Case study: Highway driving
The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.
[Video panels] Input: driving demonstration (left). Output: learned behavior (right).
More driving examples
In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.
Conclusions for part I

• Our algorithm returns a policy with performance as good as the expert's, as evaluated according to the expert's unknown reward function.
• The algorithm is guaranteed to converge in poly(k, 1/ε) iterations.
• The algorithm exploits reward “simplicity” (vs. policy “simplicity” in previous approaches).
Part II: Apprenticeship learning for learning the transition model
Learning the dynamics model P_sa from data
[Diagram] Dynamics model P_sa (estimated from data) and reward function R feed into Reinforcement Learning, which outputs a control policy π maximizing E[R(s_0) + … + R(s_T)].
Transition model
• Consider the problem of controlling a complicated system like a helicopter.
• No models are available that specify the dynamics accurately as a function of the helicopter's specifications.
• So we need to estimate the dynamics from data (see the least-squares sketch below).
• We have to collect enough data to model all relevant parts of the flight envelope.
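As one concrete (assumed) instance of estimating P_sa from logged flight data: fit a linear model s_{t+1} ≈ A s_t + B a_t by least squares. Real helicopter dynamics are more complex, so this is only a sketch of the idea:

```python
import numpy as np

def fit_linear_dynamics(states, actions):
    """Least-squares fit of s_{t+1} = A s_t + B a_t + noise.

    states: (T+1, ns) array; actions: (T, na) array of logged flight data.
    Returns A (ns x ns) and B (ns x na).
    """
    X = np.hstack([states[:-1], actions])        # (T, ns+na) regressors
    Y = states[1:]                               # (T, ns) targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)    # W: (ns+na, ns)
    ns = states.shape[1]
    A, B = W[:ns].T, W[ns:].T                    # s' ~ A s + B a
    return A, B
```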
Collecting data to learn dynamical model
State of the art: the E³ algorithm (Kearns and Singh, 2002).

[Flowchart] Have a good model of the dynamics? If YES: “Exploit.” If NO: “Explore.”
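A caricature of that decision rule in code (a sketch only, not Kearns & Singh's full E³ algorithm; `plan` and the known-state threshold m are assumptions): declare a state known once every action has been tried m times there, exploit in known states, and otherwise take the least-tried action.

```python
import numpy as np

def e3_action(s, counts, plan, m=50):
    """counts[s] is a per-action visit-count array for state s;
    plan(s) -> exploiting action from the current estimated model."""
    if counts[s].min() >= m:          # state is "known": exploit
        return plan(s)
    return int(np.argmin(counts[s]))  # else explore the least-tried action
```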
Learning the dynamics
[Diagram] Expert human pilot flight produces data (a_1, s_1, a_2, s_2, a_3, s_3, …) → learn P_sa. The learned dynamics model P_sa, together with the reward function R, feeds Reinforcement Learning, which outputs a control policy π maximizing E[R(s_0) + … + R(s_T)]. Autonomous flight under that policy produces further data (a_1, s_1, a_2, s_2, a_3, s_3, …) → re-learn P_sa, closing the loop.
Apprenticeship learning of model
Theorem. Suppose that we obtain m = O(poly(S, A, T, 1/ε)) examples from a human expert demonstrating the task. Then after a polynomial number N of iterations of testing/re-learning, with high probability, we will obtain a policy whose performance is comparable to the expert's:

U(π) ≥ U(π_E) − ε.
Thus, so long as a demonstration is available, it isn’t necessary to explicitly explore.
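The testing/re-learning procedure behind the theorem, as a sketch with hypothetical `fit_model`, `solve_mdp`, and `fly` helpers (not the paper's exact pseudocode):

```python
def learn_from_demonstration(expert_data, fit_model, solve_mdp, fly, n_iters):
    """Fit a model to expert data, then repeatedly: solve the MDP in the
    learned model, test the policy on the real system, and fold the new
    trajectory back into the dataset. No explicit exploration step."""
    data = list(expert_data)
    for _ in range(n_iters):
        model = fit_model(data)      # estimate P_sa from all data so far
        policy = solve_mdp(model)    # "exploit": best policy in the model
        trajectory, ok = fly(policy) # test on the real system
        if ok:                       # performs comparably to the expert
            return policy
        data.append(trajectory)      # new data covers poorly modeled regime
    return policy
```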
Proof idea
• From the initial pilot demonstrations, our model/simulator P_sa will be accurate for the part of the flight envelope (s, a) visited by the pilot.
• Our model/simulator will correctly predict the helicopter's behavior under the pilot's policy π_E.
• Consequently, there is at least one policy (namely π_E) that looks like it is able to fly the helicopter well in our simulation.
• Thus, each time we solve the MDP using the current simulator P_sa, we will find a policy that successfully flies the helicopter according to P_sa.
• If, on the actual helicopter, this policy fails to fly the helicopter, despite the model P_sa predicting that it should, then it must be visiting parts of the flight envelope that the model fails to capture accurately.
• Hence, this gives useful training data to model new parts of the flight envelope.
Conclusions
[Diagram] Dynamics model P_sa and reward function R feed into Reinforcement Learning, which outputs a control policy π maximizing E[R(s_0) + … + R(s_T)].
Given expert demonstrations, our inverse RL algorithm returns a policy with performance as good as the expert's, as evaluated according to the expert's unknown reward function.
Given an initial demonstration, there is no need to explicitly explore the state/action space. Even if you repeatedly “exploit” (use your best policy), you will collect enough data to learn a sufficiently accurate dynamical model to carry out your control task.
Thanks for your attention.
Different Formulation
LP formulation of the RL problem:

max_λ Σ_{s,a} λ(s,a) R(s)
s.t. ∀s: Σ_a λ(s,a) = Σ_{s',a} P(s|s',a) λ(s',a)

QP formulation of apprenticeship learning:

min_{λ,μ} Σ_i (μ_{E,i} − μ_i)²
s.t. ∀s: Σ_a λ(s,a) = Σ_{s',a} P(s|s',a) λ(s',a)
     ∀i: μ_i = Σ_{s,a} φ_i(s) λ(s,a)
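To make the LP concrete, here is a sketch that solves a tiny MDP with scipy's linprog. A start distribution c and a discount γ are added so the toy program is bounded; the slide's undiscounted form omits them:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, c, gamma=0.95):
    """Solve a small MDP via the occupancy-measure LP:
    max sum_{s,a} lam(s,a) R(s)  s.t. for all s:
    sum_a lam(s,a) - gamma * sum_{s',a} P[s',a,s] lam(s',a) = c(s).

    P: (S, A, S) transition probabilities; R: (S,) rewards; c: (S,) start dist.
    """
    S, A, _ = P.shape
    # Flow constraints: A_eq @ lam_flat = c, with lam_flat indexed by (s, a).
    A_eq = np.zeros((S, S * A))
    for s in range(S):
        for sp in range(S):
            for a in range(A):
                A_eq[s, sp * A + a] -= gamma * P[sp, a, s]
        for a in range(A):
            A_eq[s, s * A + a] += 1.0
    obj = -np.repeat(R, A)                       # linprog minimizes
    res = linprog(obj, A_eq=A_eq, b_eq=c, bounds=[(0, None)] * (S * A))
    lam = res.x.reshape(S, A)
    policy = lam.argmax(axis=1)                  # action carrying the occupancy
    return lam, policy
```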
Different Formulation (ctd.)
Our algorithm is equivalent to iteratively:

– linearizing the QP at the current point (inverse RL step),
– solving the resulting LP (RL step).

Why not solve the QP directly? That is typically only possible for very small toy problems (curse of dimensionality). [Our algorithm makes use of existing RL solvers to deal with the curse of dimensionality.]
Simplification of Inverse RL step: QP → Euclidean projection

• In the Inverse RL step:
– set μ̄^(i−1) = orthogonal projection of μ_E onto the line through {μ̄^(i−2), μ(π^(i−1))};
– set w^(i) = μ_E − μ̄^(i−1).
• Note: the theoretical results on convergence and sample complexity hold unchanged for the simpler algorithm.
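In numpy, one projection update might look like this sketch (indexing follows the paper's projection algorithm; μ̄ is the running projection point):

```python
import numpy as np

def projection_inverse_rl_step(mu_E, mu_bar_prev, mu_new):
    """One inverse RL step, projection version.

    mu_bar_prev: previous projection point mu_bar^(i-2);
    mu_new: feature expectations mu(pi^(i-1)) of the latest policy.
    Projects mu_E onto the line through the two points, then points
    the new reward weights w from the projection toward mu_E.
    """
    d = mu_new - mu_bar_prev
    mu_bar = mu_bar_prev + (d @ (mu_E - mu_bar_prev)) / (d @ d) * d
    w = mu_E - mu_bar                  # new reward weights w^(i)
    margin = np.linalg.norm(w)         # stop once this drops below eps
    return mu_bar, w, margin
```

This routine could serve as the core of the `fit_w` helper in the earlier loop sketch.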
Proof (sketch)

[Figure slides: in feature-expectation space (axes φ_1, φ_2), the expert's feature expectations μ(π_E), the current points μ(π^(0)), μ(π^(1)), …, the weight vector w^(1), and the successive distances d_0, d_1 to μ(π_E).]
Algorithm (projection version)

[Figure slides, animated: in feature-expectation space (axes φ_1, φ_2), start from μ(π^(0)) and set w^(1) = μ_E − μ(π^(0)); each iteration adds a projection point μ̄^(1), μ̄^(2), … and new weight vectors w^(2), w^(3), …, converging toward μ_E.]
Appendix: Different View
Bellman LP for solving MDPs:

min_V c'V
s.t. ∀s,a: V(s) ≥ R(s,a) + Σ_{s'} P(s,a,s') V(s')

Dual LP:

max_λ Σ_{s,a} λ(s,a) R(s,a)
s.t. ∀s: c(s) − Σ_a λ(s,a) + Σ_{s',a} P(s',a,s) λ(s',a) = 0

Apprenticeship learning as a QP:

min_λ Σ_i (μ_{E,i} − Σ_{s,a} λ(s,a) φ_i(s))²
s.t. ∀s: c(s) − Σ_a λ(s,a) + Σ_{s',a} P(s',a,s) λ(s',a) = 0
Car driving results (more detail)

Style | Quantity | Collision | Offroad Left | Left Lane | Middle Lane | Right Lane | Offroad Right
1 | Feature Distr. Expert | 0 | 0 | 0.1325 | 0.2033 | 0.5983 | 0.0658
1 | Feature Distr. Learned | 5.00E-05 | 0.0004 | 0.0904 | 0.2286 | 0.604 | 0.0764
1 | Weights Learned | -0.0767 | -0.0439 | 0.0077 | 0.0078 | 0.0318 | -0.0035
2 | Feature Distr. Expert | 0.1167 | 0 | 0.0633 | 0.4667 | 0.47 | 0
2 | Feature Distr. Learned | 0.1332 | 0 | 0.1045 | 0.3196 | 0.5759 | 0
2 | Weights Learned | 0.234 | -0.1098 | 0.0092 | 0.0487 | 0.0576 | -0.0056
3 | Feature Distr. Expert | 0 | 0 | 0 | 0.0033 | 0.7058 | 0.2908
3 | Feature Distr. Learned | 0 | 0 | 0 | 0 | 0.7447 | 0.2554
3 | Weights Learned | -0.1056 | -0.0051 | -0.0573 | -0.0386 | 0.0929 | 0.0081
4 | Feature Distr. Expert | 0.06 | 0 | 0 | 0.0033 | 0.2908 | 0.7058
4 | Feature Distr. Learned | 0.0569 | 0 | 0 | 0 | 0.2666 | 0.7334
4 | Weights Learned | 0.1079 | -0.0001 | -0.0487 | -0.0666 | 0.059 | 0.0564
5 | Feature Distr. Expert | 0.06 | 0 | 0 | 1 | 0 | 0
5 | Feature Distr. Learned | 0.0542 | 0 | 0 | 1 | 0 | 0
5 | Weights Learned | 0.0094 | -0.0108 | -0.2765 | 0.8126 | -0.51 | -0.0153