
Page 1: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Reinforcement Learning in MDPs

by Least-Squares Policy Iteration

Presented by Lihan He

Machine Learning Reading Group

Duke University

09/16/2005

Page 2: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Outline

MDP and Q-function

Value function approximation

LSPI: Least-Squares Policy Iteration

Proto-value Functions

RPI: Representation Policy Iteration

Page 3: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Markov Decision Process (MDP)

An MDP is a model M = <S, A, T, R> consisting of
a set of environment states S,
a set of actions A,
a transition function T: S × A × S → [0,1], T(s,a,s') = P(s'|s,a),
a reward function R: S × A → ℝ.

A policy is a function π: S → A.

Value function (expected cumulative reward) V^π: S → ℝ, satisfying the Bellman equation:

V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) V^π(s')
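To make the Bellman equation concrete, here is a minimal sketch (a hypothetical 3-state chain; all transition probabilities and rewards are assumed for illustration) that evaluates V^π for a fixed policy by solving the linear system V^π = R^π + γ P^π V^π with numpy.

```python
import numpy as np

# A tiny hypothetical MDP: 3 states, 2 actions (0 = left, 1 = right).
# P[a, s, s'] = P(s' | s, a); R[s, a] = expected immediate reward.
gamma = 0.9
P = np.zeros((2, 3, 3))
P[0] = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # "left" moves down the chain
P[1] = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]   # "right" moves up the chain
R = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]])           # reward only for "right" in state 2

pi = np.array([1, 1, 1])                                     # a fixed policy: always go right

# Policy evaluation: V^pi = R^pi + gamma * P^pi V^pi  <=>  (I - gamma * P^pi) V^pi = R^pi
P_pi = np.array([P[pi[s], s] for s in range(3)])             # |S| x |S| transition matrix under pi
R_pi = np.array([R[s, pi[s]] for s in range(3)])
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(V_pi)   # state values increase toward the rewarding state
```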

Page 4: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Markov Decision Process (MDP)

An example of a grid-world environment:

[Figure: a small grid world with a +1 reward cell. Left panel: the optimal policy (an arrow in each cell). Right panel: the corresponding value function, with values between 0.5 and 0.9 and +1 at the reward cell.]

Page 5: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

State-action value function Q

The state-action value function Q^π(s,a) of a policy π is defined over all possible combinations of states and actions and indicates the expected, discounted, total reward when taking action a in state s and following policy π thereafter:

Q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Q^π(s', π(s'))

Given a policy π, for each state-action pair we have a Q^π(s,a) value.

In matrix form, the above Bellman equation becomes

Q^π = R + γ P Π_π Q^π

Q^π, R: vectors of size |S||A|
P: stochastic matrix of size (|S||A| × |S|), P((s,a), s') = P(s'|s,a)
Π_π: matrix of size (|S| × |S||A|) describing policy π, with Π_π(s', (s',a')) = 1 if a' = π(s') and 0 otherwise
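The matrix form can be checked on a tiny model. The sketch below (the same kind of illustrative 3-state MDP; all names and numbers are assumptions) builds P as a |S||A| × |S| matrix and Π_π as a |S| × |S||A| matrix, then solves Q^π = R + γ P Π_π Q^π exactly.

```python
import numpy as np

gamma = 0.9
nS, nA = 3, 2
# Hypothetical model: T[s, a, s'] = P(s' | s, a), Rsa[s, a] = expected reward.
T = np.zeros((nS, nA, nS))
T[:, 0, :] = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]   # action 0: step left
T[:, 1, :] = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]   # action 1: step right
Rsa = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]])

pi = np.array([1, 1, 1])                          # fixed policy: always right

# Index (s, a) pairs as s * nA + a.
P = T.reshape(nS * nA, nS)                        # |S||A| x |S|, P[(s,a), s'] = P(s'|s,a)
R = Rsa.reshape(nS * nA)                          # |S||A| reward vector

# Pi_pi is |S| x |S||A| with a 1 at (s', (s', pi(s'))).
Pi_pi = np.zeros((nS, nS * nA))
Pi_pi[np.arange(nS), np.arange(nS) * nA + pi] = 1.0

# Solve Q^pi = R + gamma * P Pi_pi Q^pi  <=>  (I - gamma * P @ Pi_pi) Q^pi = R
Q_pi = np.linalg.solve(np.eye(nS * nA) - gamma * P @ Pi_pi, R)
print(Q_pi.reshape(nS, nA))                       # Q values arranged as (state, action)
```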

Page 6: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

How does policy iteration work?

[Diagram: policy iteration alternates between value evaluation, which uses the model to solve Q^π = R + γ P Π_π Q^π for the current policy π, and policy improvement, which derives a new greedy policy from Q^π.]

For model-free reinforcement learning, we don't have the model P.
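For reference, a minimal sketch of this alternation under the same illustrative 3-state model (all numbers assumed): evaluate Q^π exactly with the model, improve the policy greedily, and stop when the policy no longer changes. This is exactly the model-dependent step that LSPI will replace with a sample-based estimate.

```python
import numpy as np

gamma = 0.9
nS, nA = 3, 2
T = np.zeros((nS, nA, nS))
T[:, 0, :] = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
T[:, 1, :] = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]
Rsa = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
P = T.reshape(nS * nA, nS)
R = Rsa.reshape(nS * nA)

def evaluate(pi):
    """Value evaluation: solve Q^pi = R + gamma * P Pi_pi Q^pi using the model."""
    Pi_pi = np.zeros((nS, nS * nA))
    Pi_pi[np.arange(nS), np.arange(nS) * nA + pi] = 1.0
    return np.linalg.solve(np.eye(nS * nA) - gamma * P @ Pi_pi, R)

pi = np.zeros(nS, dtype=int)                   # start from an arbitrary policy
while True:
    Q = evaluate(pi)                           # value evaluation (needs the model P)
    pi_new = Q.reshape(nS, nA).argmax(axis=1)  # policy improvement (greedy in Q)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
print(pi)                                      # converged policy
```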

Page 7: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Value function approximation

Let Q̂(s,a;w) be an approximation to Q^π(s,a) with free parameters w:

Q̂(s,a;w) = Σ_{j=1}^{k} φ_j(s,a) w_j = φ(s,a)^T w,   where φ(s,a) = [φ_1(s,a), ..., φ_k(s,a)]^T

i.e., Q values are approximated by a linear parametric combination of k basis functions φ_j(s,a), j = 1, ..., k. The basis functions are fixed, but arbitrarily selected (possibly non-linear), functions of s and a.

Note that Q^π is a vector of size |S||A|. If k = |S||A| and the bases are linearly independent, we can find w such that Q̂ = Q^π.

In general k << |S||A|, so we use a linear combination of only a few bases to approximate the value function Q^π.

Solve for w, evaluate Q̂, and get the updated policy π(s) = argmax_a Q̂(s,a).

Page 8: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Value function approximation

Examples of basis functions:

Polynomials, e.g. for two actions a_1, a_2 and a scalar state s:

φ(s,a) = [ I(a=a_1), I(a=a_1)·s, I(a=a_1)·s², I(a=a_2), I(a=a_2)·s, I(a=a_2)·s² ]^T

Radial basis functions (RBF)

Proto-value functions

Other manually designed bases tailored to the specific problem

Use the indicator function I(a=a_i) to decouple actions so that each action gets its own parameters.
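As an illustration of the action-indicator construction, here is a sketch (one-dimensional state, two actions, polynomial degree 2, and illustrative weights are all assumptions) in which the feature vector stacks one polynomial block per action, so only the block of the chosen action is non-zero:

```python
import numpy as np

def phi(s, a, n_actions=2, degree=2):
    """Polynomial basis in s, decoupled per action via the indicator I(a = a_i).

    Returns a vector of length n_actions * (degree + 1); only the block
    belonging to action a is non-zero.
    """
    block = np.array([s ** d for d in range(degree + 1)])   # [1, s, s^2]
    features = np.zeros(n_actions * (degree + 1))
    features[a * (degree + 1):(a + 1) * (degree + 1)] = block
    return features

# Q-hat(s, a; w) = phi(s, a)^T w, and the greedy policy is the argmax over actions.
w = np.array([0.1, 0.0, -0.02, 0.3, 0.05, -0.01])           # illustrative weights
s = 4.0
q_values = [phi(s, a) @ w for a in range(2)]
greedy_action = int(np.argmax(q_values))
print(q_values, greedy_action)
```

The same pattern works for RBF or proto-value features: compute the per-state features once and copy them into the block of the selected action.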

Page 9: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Value function approximation

Least-Squares Fixed-Point Approximation

Use Q^π = R + γ P Π_π Q^π together with the approximation

Q̂(s,a;w) = Σ_{j=1}^{k} φ_j(s,a) w_j = φ(s,a)^T w.

Let

Φ = [ φ(s_1,a_1)^T ; φ(s_1,a_2)^T ; ... ; φ(s_|S|,a_|A|)^T ]

be the |S||A| × k matrix whose rows are the basis vectors, so that Q̂ = Φ w.

Requiring that the Bellman backup of Q̂, projected back onto the space spanned by the columns of Φ, equals Q̂ itself (the fixed-point condition), we finally get

Φ^T (Φ - γ P Π_π Φ) w^π = Φ^T R
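A small sketch of this fixed-point solution with the exact model (the 3-state MDP and the simple per-action features are assumptions for illustration): build Φ, P and Π_π, then solve Φ^T(Φ - γ P Π_π Φ) w = Φ^T R.

```python
import numpy as np

gamma = 0.9
nS, nA, k = 3, 2, 4
T = np.zeros((nS, nA, nS))
T[:, 0, :] = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
T[:, 1, :] = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]
Rsa = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
P = T.reshape(nS * nA, nS)                       # |S||A| x |S|
R = Rsa.reshape(nS * nA)
pi = np.array([1, 1, 1])                         # policy being evaluated

def phi(s, a):
    # k = 4 features: [I(a=0), I(a=0)*s, I(a=1), I(a=1)*s]
    f = np.zeros(k)
    f[a * 2:a * 2 + 2] = [1.0, float(s)]
    return f

# Phi: |S||A| x k, rows are phi(s, a)^T, ordered as (s0,a0), (s0,a1), (s1,a0), ...
Phi = np.array([phi(s, a) for s in range(nS) for a in range(nA)])
# Pi_pi: |S| x |S||A|, picks the pair (s', pi(s')) for each next state s'
Pi_pi = np.zeros((nS, nS * nA))
Pi_pi[np.arange(nS), np.arange(nS) * nA + pi] = 1.0

A = Phi.T @ (Phi - gamma * P @ Pi_pi @ Phi)      # k x k
b = Phi.T @ R                                    # length-k vector
w_pi = np.linalg.solve(A, b)
print(Phi @ w_pi)                                # approximate Q^pi for every (s, a)
```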

Page 10: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Least-Squares Policy Iteration

Solving for the parameter w is equivalent to solving the linear system A w = b, where

A = Φ^T (Φ - γ P Π_π Φ) = Σ_s Σ_a Σ_{s'} P(s'|s,a) [ φ(s,a) (φ(s,a) - γ φ(s', π(s')))^T ]

b = Φ^T R = Σ_s Σ_a Σ_{s'} P(s'|s,a) [ φ(s,a) R(s,a,s') ]

A is the sum of many rank-one matrices φ(s,a) (φ(s,a) - γ φ(s', π(s')))^T,

b is the sum of many vectors φ(s,a) R(s,a,s'),

and they are weighted by the transition probability P(s'|s,a).

Page 11: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Least-Squares Policy Iteration

If we sample data from the underlying transition probability, we get samples

D = {(s_i, a_i, r_i, s'_i)}_{i=1}^{L}

A and b can then be learned (in block form) as

Ã = (1/L) Σ_{i=1}^{L} [ φ(s_i,a_i) (φ(s_i,a_i) - γ φ(s'_i, π(s'_i)))^T ]

b̃ = (1/L) Σ_{i=1}^{L} φ(s_i,a_i) r_i

or incrementally (in real time) as

Ã^(t+1) = Ã^(t) + φ(s_t,a_t) (φ(s_t,a_t) - γ φ(s'_t, π(s'_t)))^T

b̃^(t+1) = b̃^(t) + φ(s_t,a_t) r_t
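A sketch of the batch estimate (the sample format, feature map, and policy representation below are assumptions for illustration; φ can be any feature map such as the polynomial one sketched earlier):

```python
import numpy as np

def lstdq(samples, phi, policy, k, gamma=0.9):
    """Estimate A~ and b~ from samples (s, a, r, s') and solve for w.

    samples: list of (s, a, r, s_next) tuples
    phi:     feature map, phi(s, a) -> length-k vector
    policy:  current policy, policy(s) -> action
    """
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)   # rank-one update, as in the slide
        b += f * r
    # The 1/L factors cancel in A^{-1} b, so they can be dropped.
    # Assumes A~ is non-singular; use np.linalg.lstsq otherwise.
    return np.linalg.solve(A, b)
```

The incremental (real-time) form simply keeps Ã and b̃ across time steps and adds one rank-one term per new sample.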

Page 12: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Least-Squares Policy Iteration

Input: D, k, φ, γ, ε, π0(w0)

π' ← π0    % π(s) = argmax_a φ(s,a)^T w0

repeat

    π ← π'

    Ã = (1/L) Σ_{i=1}^{L} [ φ(s_i,a_i) (φ(s_i,a_i) - γ φ(s'_i, π(s'_i)))^T ]

    b̃ = (1/L) Σ_{i=1}^{L} φ(s_i,a_i) r_i

    w̃ = Ã^{-1} b̃    % value-function update; could instead use the real-time updates Ã^(t+1), b̃^(t+1)

    π'(s) = argmax_a φ(s,a)^T w̃    % policy update

until ||w - w'|| < ε
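Putting the pieces together, here is a compact, self-contained sketch of the loop above (feature map, stopping threshold, and sample format are illustrative assumptions; the inner loop is the same LSTDQ-style estimate as in the previous sketch):

```python
import numpy as np

def lspi(samples, phi, k, n_actions, gamma=0.9, eps=1e-6, max_iter=50):
    """Least-Squares Policy Iteration on a fixed batch of (s, a, r, s') samples."""
    w = np.zeros(k)                                   # weights of the initial policy pi_0

    def policy(s, weights):
        # Greedy policy: pi(s) = argmax_a phi(s, a)^T w
        return int(np.argmax([phi(s, a) @ weights for a in range(n_actions)]))

    for _ in range(max_iter):
        # LSTDQ: build A~ and b~ for the current policy and solve A~ w' = b~.
        A = np.zeros((k, k))
        b = np.zeros(k)
        for s, a, r, s_next in samples:
            f = phi(s, a)
            f_next = phi(s_next, policy(s_next, w))
            A += np.outer(f, f - gamma * f_next)
            b += f * r
        w_new = np.linalg.lstsq(A, b, rcond=None)[0]  # robust to a singular A~
        if np.linalg.norm(w_new - w) < eps:           # until ||w - w'|| < eps
            return w_new
        w = w_new
    return w
```

Once the weights converge, the learned policy is simply the greedy argmax over φ(s,a)^T w.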

Page 13: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Proto-Value Function

How do we choose the basis functions φ(s,a) in the LSPI algorithm?

No need to design bases manually

The data tell us what the corresponding proto-value functions are

Generated from the topology of the underlying state space

No need to estimate the underlying state transition probabilities

Capture the intrinsic smoothness constraints that true value functions have.

Proto-value functions are good bases for value function approximation.

Page 14: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Proto-Value Function

1. Graph representation of the underlying state transitions:

[Figure: an undirected graph over the states s1, ..., s11, with an edge between two states whenever a single transition connects them.]

2. Adjacency matrix A:

[Figure: the corresponding 11 × 11 adjacency matrix, with A(i,j) = 1 for each edge (s_i, s_j) and 0 elsewhere.]

Page 15: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Proto-Value Function

3. Combinatorial Laplacian L: L=T - A

where T is the diagonal matrix whose entries are row sums of the adjacency matrix A

4. Proto-value functions: eigenvectors of the combinatorial Laplacian L,

L f = λ f

Each eigenvector provides one basis function φ_j(s); combined with the indicator function for action a, we get φ_j(s,a).
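A sketch of steps 1-4 for a small grid (the grid size, 4-neighbour connectivity, and number of eigenvectors are assumptions for illustration): build the adjacency matrix from the state-space topology, form L = T - A, and take the lowest-order eigenvectors as proto-value functions.

```python
import numpy as np

# 1-2. Adjacency matrix of a small grid world (4-neighbour connectivity).
rows, cols = 5, 5
n_states = rows * cols
A = np.zeros((n_states, n_states))
for r in range(rows):
    for c in range(cols):
        s = r * cols + c
        for dr, dc in [(1, 0), (0, 1)]:                  # link to the right and down neighbours
            rr, cc = r + dr, c + dc
            if rr < rows and cc < cols:
                t = rr * cols + cc
                A[s, t] = A[t, s] = 1.0

# 3. Combinatorial Laplacian: L = T - A, T = diagonal of the row sums of A.
T = np.diag(A.sum(axis=1))
L = T - A

# 4. Proto-value functions: the k lowest-order eigenvectors of L (solutions of L f = lambda f).
k = 10
eigvals, eigvecs = np.linalg.eigh(L)                     # eigh: symmetric L, eigenvalues ascending
proto = eigvecs[:, :k]                                   # columns are proto-value functions phi_j(s)

# Combine with action indicators to get phi_j(s, a), as in the slide.
def phi(s, a, n_actions=4):
    f = np.zeros(k * n_actions)
    f[a * k:(a + 1) * k] = proto[s]
    return f
```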

Page 16: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Example of proto-value functions: a grid world with 1260 states.

[Figure: the grid-world layout and its adjacency matrix (a 1260 × 1260 sparsity pattern), shown together with a zoomed-in view of rows and columns 450-500.]

Page 17: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Proto-value functions: Low-order eigenvectors as basis functions

Page 18: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Optimal value function

Value function approximation using 10 proto-value functions as bases

Page 19: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Representation Policy Iteration (offline)

Input: D, k, γ, ε, π0(w0)

1. Construct basis functions:

(a) Use the sample D to learn a graph that encodes the underlying state-space topology.

(b) Compute the lowest-order k eigenvectors of the combinatorial Laplacian on the graph. The basis functions φ(s,a) are produced by combining the k proto-value functions with the indicator function of action a.

2. π' ← π0    % π(s) = argmax_a φ(s,a)^T w0

3. repeat

    π ← π'

    Ã = (1/L) Σ_{i=1}^{L} [ φ(s_i,a_i) (φ(s_i,a_i) - γ φ(s'_i, π(s'_i)))^T ]

    b̃ = (1/L) Σ_{i=1}^{L} φ(s_i,a_i) r_i

    w̃ = Ã^{-1} b̃    % value-function update

    π'(s) = argmax_a φ(s,a)^T w̃    % policy update

until ||w - w'|| < ε

Page 20: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Representation Policy Iteration (online)

Input: D0, k, γ, ε, π0(w0)

1. Initialization: using the offline algorithm, based on D0, k, γ, ε, π0, learn policy π^(0) and obtain Ã^(0), b̃^(0).

2. π' ← π^(0)    % π(s) = argmax_a φ(s,a)^T w

3. repeat

    (a) π^(t) ← π'

    (b) Execute π^(t) to get new data D^(t) = {s_t, a_t, r_t, s'_t}.

    (c) If the new data sample D^(t) changes the topology of the graph G, compute a new set of basis functions.

    (d) Ã^(t+1) = Ã^(t) + φ(s_t,a_t) (φ(s_t,a_t) - γ φ(s'_t, π(s'_t)))^T
        b̃^(t+1) = b̃^(t) + φ(s_t,a_t) r_t
        w̃ = (Ã^(t+1))^{-1} b̃^(t+1)    % value-function update

    (e) π'(s) = argmax_a φ(s,a)^T w̃    % policy update

until ||w - w'|| < ε

Page 21: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Example: Chain MDP with 50 states, reward +1 at states 10 and 41, 0 otherwise.

Optimal policy: states 1-9 and 26-41: Right; states 11-25 and 42-50: Left.
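For reference, a sketch of this chain as a sampling environment. The 0.9 action-success probability is borrowed from the chain-walk domain of Lagoudakis & Parr; the slide itself only fixes the reward layout, so the dynamics here are an assumption. The collected tuples are exactly the samples D that LSPI/RPI consume.

```python
import numpy as np

n_states = 50
rewarding = {10, 41}                         # reward +1 at states 10 and 41 (1-indexed), else 0

def step(s, a, rng, p_success=0.9):
    """One transition of the chain MDP.

    a = 0 moves left, a = 1 moves right; the move succeeds with probability
    p_success and goes the other way otherwise (assumed dynamics).
    Reward is granted on entering a rewarding state (a modeling choice).
    """
    direction = 1 if a == 1 else -1
    if rng.random() > p_success:
        direction = -direction
    s_next = min(max(s + direction, 1), n_states)
    r = 1.0 if s_next in rewarding else 0.0
    return r, s_next

# Collect a batch of samples D = {(s_i, a_i, r_i, s'_i)} with random exploration.
rng = np.random.default_rng(0)
D = []
s = int(rng.integers(1, n_states + 1))
for _ in range(5000):
    a = int(rng.integers(2))
    r, s_next = step(s, a, rng)
    D.append((s, a, r, s_next))
    s = s_next
```

These samples can then be fed to the LSPI sketch above together with, e.g., an RBF or proto-value feature map over the 50 states.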

Page 22: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Example: Chain MDP

Value function and approximation in each iteration

20 bases used

Page 23: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

Example: Chain MDP

Steps to convergence

Policy L1 error with respect to optimal policy

Performance comparison

Page 24: Reinforcement Learning in MDPs by Least-Squares Policy Iteration

References:

M. G. Lagoudakis and R. Parr, Least-Squares Policy Iteration. Journal of Machine Learning Research 4 (2003), 1107-1149.

-- Gives the LSPI algorithm for reinforcement learning.

S. Mahadevan, Proto-Value Functions: Developmental Reinforcement Learning. Proceedings of ICML 2005.

-- Shows how to build basis functions for the LSPI algorithm.

C. Kwok and D. Fox, Reinforcement Learning for Sensing Strategies. Proceedings of IROS 2004.

-- An application of LSPI.