
Page 1: Policy Gradient in Continuous Time

Policy Gradient in Continuous Time

Presented by Hui Li

Duke University Machine Learning Group

May 30, 2007

by Remi Munos, JMLR 2006

Page 2: Policy Gradient in Continuous Time

Outline

• Introduction

• Discretized Stochastic Processes Approximation

• Model-free Reinforcement Learning (RL) algorithm

• Example Results

Page 3: Policy Gradient in Continuous Time

Introduction of the Problem

• Consider an optimal control problem with a continuous state

  System dynamics: dx_t/dt = f(x_t, u_t), where u_t is the control and x_t is the state

• Objective: find an optimal control (u_t) that maximizes the functional

  Objective function: J(x_0; (u_t)) = r(x_T)

  (deterministic process, continuous state)
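For concreteness, here is a minimal Python sketch of this setup, assuming hypothetical callables f(x, u) for the dynamics and r(x) for the terminal reward; explicit Euler integration stands in for the continuous-time flow.

```python
import numpy as np

def simulate(f, policy, x0, T, dt):
    """Euler-integrate dx/dt = f(x, u) with control u_t = policy(t, x_t)."""
    x = np.array(x0, dtype=float)
    t = 0.0
    while t < T:
        u = policy(t, x)
        x = x + dt * f(x, u)   # explicit Euler step for the deterministic dynamics
        t += dt
    return x

def J(f, r, policy, x0, T, dt=1e-2):
    """Objective functional J(x0; (u_t)) = r(x_T): terminal reward only."""
    return r(simulate(f, policy, x0, T, dt))
```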

Page 4: Policy Gradient in Continuous Time

Introduction of the Problem

• Consider a class of parameterized policies u_t = π_α(t, x_t) with parameter α

• Find the parameter α that maximizes the performance measure

  V(α) = J(x_0; π_α(t, x_t))

• The standard approach is to use gradient ascent (a minimal update sketch follows below); computing ∇_α V(α) is the object of the paper
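The gradient-ascent loop itself is standard; a minimal sketch, where grad_V stands for whatever estimator of ∇_α V(α) is available (alpha0, step_size and num_iterations are illustrative names):

```python
import numpy as np

def gradient_ascent(grad_V, alpha0, step_size=0.01, num_iterations=200):
    """Ascend V(alpha) using any estimator grad_V of its gradient."""
    alpha = np.array(alpha0, dtype=float)
    for _ in range(num_iterations):
        alpha = alpha + step_size * grad_V(alpha)   # move along the estimated gradient
    return alpha
```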

Page 5: Policy Gradient in Continuous Time

Introduction of the Problem

How to compute ∇_α V(α)?

• Finite-difference method

  ∂V(α)/∂α_i ≈ [V(α + ε e_i) − V(α)] / ε,  where e_i is the i-th coordinate vector

This method requires a large number of trajectories to compute the gradient of the performance measure (sketched in the code below).

• Pathwise estimation of the gradient

Compute the gradient using one trajectory only
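A sketch of the finite-difference estimator, assuming V is a function that evaluates the performance measure by rolling out one trajectory per call; each partial derivative costs one extra rollout, which is why many trajectories are needed:

```python
import numpy as np

def finite_difference_gradient(V, alpha, eps=1e-4):
    """Estimate each dV/dalpha_i as (V(alpha + eps * e_i) - V(alpha)) / eps."""
    alpha = np.asarray(alpha, dtype=float)
    base = V(alpha)                      # one unperturbed rollout
    grad = np.zeros_like(alpha)
    for i in range(alpha.size):
        perturbed = alpha.copy()
        perturbed[i] += eps              # perturb a single coordinate
        grad[i] = (V(perturbed) - base) / eps   # one extra rollout per parameter
    return grad
```

With p parameters this needs p + 1 rollouts per gradient estimate, in contrast to the pathwise approach described next.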

Page 6: Policy Gradient in Continuous Time

Introduction of the Problem

Pathwise estimation of the gradient

Define z_t = ∇_α x_t

Dynamics of z_t:  dz_t/dt = ∇_α f(x_t) + ∇_x f(x_t) z_t

Gradient:  ∇_α V(α) = ∇_x r(x_T) ∇_α x_T = ∇_x r(x_T) z_T
(∇_x r(x_T) is known; z_T is unknown)

• In reinforcement learning, ∇_x f(x_t) is unknown. How can we approximate z_t?
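When the model is available, z_t can be integrated alongside x_t on the same trajectory. A minimal sketch, assuming hypothetical callables df_dx and df_dalpha for the Jacobians ∇_x f and ∇_α f (with the policy composed into f); these are exactly the quantities that are unavailable in the model-free setting addressed next:

```python
import numpy as np

def pathwise_gradient(f, df_dx, df_dalpha, dr_dx, policy, x0, alpha, T, dt=1e-2):
    """Integrate x_t and the sensitivity z_t = d x_t / d alpha along one trajectory,
    then return dV/dalpha = dr/dx(x_T) @ z_T."""
    x = np.array(x0, dtype=float)                 # state, shape (n,)
    z = np.zeros((x.size, np.size(alpha)))        # z_t, shape (n, p)
    t = 0.0
    while t < T:
        u = policy(t, x, alpha)
        # dz/dt = grad_alpha f + grad_x f @ z, evaluated along the trajectory
        z = z + dt * (df_dalpha(t, x, alpha) + df_dx(t, x, alpha) @ z)
        x = x + dt * f(x, u)
        t += dt
    return dr_dx(x) @ z                           # gradient of the terminal reward w.r.t. alpha
```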

Page 7: Policy Gradient in Continuous Time

Discretized Stochastic Processes Approximation

• A General Convergence Result

If dx_t/dt = f(x_t)

Page 8: Policy Gradient in Continuous Time

• Discretization of the state

  Stochastic policy

  Stochastic discrete state process (X_{t_n})_{0 ≤ n ≤ N}

  Initialization: X_0 = x_0

  Jump in state
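A sketch of this discretized stochastic process, assuming a finite action set and a hypothetical policy_probs(t, x) that returns the action probabilities of the stochastic policy:

```python
import numpy as np

def simulate_discrete_process(f, policy_probs, actions, x0, T, dt, rng=None):
    """Stochastic discrete state process (X_{t_n}): at each step sample
    u_n ~ pi_alpha(. | t_n, X_n), then jump X_{n+1} = X_n + dt * f(X_n, u_n)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.array(x0, dtype=float)                  # initialization X_0 = x_0
    states, controls = [X.copy()], []
    for n in range(int(round(T / dt))):
        p = policy_probs(n * dt, X)                # probabilities over the finite action set
        u = actions[rng.choice(len(actions), p=p)]
        X = X + dt * np.asarray(f(X, u), dtype=float)   # jump in state
        states.append(X.copy())
        controls.append(u)
    return np.array(states), controls
```

As the step size shrinks, this process converges to the deterministic trajectory of the policy-averaged dynamics (Proposition 5).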

Page 9: Policy Gradient in Continuous Time

Proof of Proposition 5:

From Taylor's formula, with the action u_{t_n} held constant on [t_n, t_{n+1}]:

  X_{t_{n+1}} = X_{t_n} + Δt_n f(X_{t_n}, u_{t_n}) + O(Δt_n²)

The average jump:

  E[ΔX_{t_n} | X_{t_n} = x] = Σ_{u ∈ U} π_α(u | t_n, x) [Δt_n f(x, u) + O(Δt_n²)]
                            = Δt_n Σ_{u ∈ U} π_α(u | t_n, x) f(x, u) + O(Δt_n²)

Directly applying Theorem 3, Proposition 5 is proved.

Page 10: Policy Gradient in Continuous Time

• Discretization of the state gradient

  Stochastic discrete state gradient process (Z_{t_n})_{0 ≤ n ≤ N}

  Initialization: Z_0 = ∇_α X_0 = 0

  With

Page 11: Policy Gradient in Continuous Time

Proof of Proposition 6:

Since

then

Directly applying Theorem 3, Proposition 6 is proved.

Page 12: Policy Gradient in Continuous Time

Model-free Reinforcement Learning Algorithm

Let

In this stochastic approximation, is observed, and

is given; we only need to approximate

Page 13: Policy Gradient in Continuous Time

Least-Squares Approximation of

Define

  S(t) = { s ∈ [t − c, t] | u_s = u_t }

  the set of past discrete times s in [t − c, t] at which the action u_t has been taken (see the sketch below).

From Taylor's formula, for all discrete times s,

We deduce
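A small sketch of how S(t) can be collected from a recorded trajectory, assuming the discrete times and the actions taken at them are stored in parallel lists (hypothetical names):

```python
def past_times_same_action(times, actions_taken, t_index, c):
    """S(t) = { s in [t - c, t] | u_s = u_t }: past discrete times, within a window
    of length c, at which the same action as u_t was taken."""
    t, u_t = times[t_index], actions_taken[t_index]
    return [s for s, u in zip(times, actions_taken) if t - c <= s <= t and u == u_t]
```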

Page 14: Policy Gradient in Continuous Time

Where

We may derive an approximation of  by solving the least-squares problem:

Then we have

Here

denotes the average value of
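The regressors and targets of this least-squares problem are given by the formulas omitted above; whatever they are, the numerical step is a standard linear least-squares solve, sketched here:

```python
import numpy as np

def solve_least_squares(A, y):
    """Return the vector b minimizing ||A b - y||^2 (the generic least-squares step)."""
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b
```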

Page 15: Policy Gradient in Continuous Time

Algorithm

Page 16: Policy Gradient in Continuous Time

Experimental Results

Six continuous state variables:

x0, y0: hand position

x, y: mass position

vx, vy: mass velocity

Four control actions: U = {(1, 0), (0, 1), (−1, 0), (0, −1)}

Goal: reach a target (x_G, y_G) with the mass at a specified time T

Terminal reward function

Page 17: Policy Gradient in Continuous Time

The system dynamics:

Consider a Boltzmann-like stochastic policy

where
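The exact form of the policy is in the omitted formulas; as a rough sketch, a Boltzmann-like (softmax) policy over the four actions could look as follows, with score(x, u, alpha) standing in for the parameterized preference used in the paper (hypothetical):

```python
import numpy as np

# Action set from the experiment: U = {(1,0), (0,1), (-1,0), (0,-1)}
U = [(1, 0), (0, 1), (-1, 0), (0, -1)]

def boltzmann_probs(score, x, alpha, actions=U, temperature=1.0):
    """pi_alpha(u | x) proportional to exp(score(x, u, alpha) / temperature)."""
    s = np.array([score(x, u, alpha) for u in actions], dtype=float)
    s = (s - s.max()) / temperature          # shift scores for numerical stability
    w = np.exp(s)
    return w / w.sum()

def sample_action(score, x, alpha, actions=U, temperature=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p = boltzmann_probs(score, x, alpha, actions, temperature)
    return actions[rng.choice(len(actions), p=p)]
```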

Page 18: Policy Gradient in Continuous Time
Page 19: Policy Gradient in Continuous Time

Conclusion

• Described a reinforcement learning method for approximating the gradient of the performance measure of a continuous-time deterministic control problem with respect to the control parameters

• Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process