
Page 1: Policy Gradient in Continuous Time

Policy Gradient in Continuous Time

Presented by Hui Li

Duke University Machine Learning Group

May 30, 2007

by Remi Munos, JMLR 2006

Page 2: Policy Gradient in Continuous Time

Outline

• Introduction

• Discretized Stochastic Processes Approximation

• Model-free Reinforcement Learning (RL) algorithm

• Example Results

Page 3: Policy Gradient in Continuous Time

Introduction of the Problem

• Consider an optimal control problem with a continuous state

  System dynamics: dx_t/dt = f(x_t, u_t), where u_t is the control and x_t is the state

• Objective: find an optimal control (u_t) that maximizes the functional

  Objective function: J(x_0; (u_t)) = r(x_T)

  (deterministic process, continuous state)
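For concreteness, here is a minimal Python sketch of this setup, assuming hypothetical callables f(x, u) for the dynamics and r(x) for the terminal reward; explicit Euler integration stands in for the continuous-time flow.

```python
import numpy as np

def simulate(f, policy, x0, T, dt):
    """Euler-integrate dx/dt = f(x, u) with control u_t = policy(t, x_t)."""
    x = np.array(x0, dtype=float)
    t = 0.0
    while t < T:
        u = policy(t, x)
        x = x + dt * f(x, u)   # explicit Euler step for the deterministic dynamics
        t += dt
    return x

def J(f, r, policy, x0, T, dt=1e-2):
    """Objective functional J(x0; (u_t)) = r(x_T): terminal reward only."""
    return r(simulate(f, policy, x0, T, dt))
```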

Page 4: Policy Gradient in Continuous Time

Introduction of the Problem

• Consider a class of parameterized policies u_t = π_α(t, x_t) with parameter α

• Find the parameter α that maximizes the performance measure

  V(α) = J(x_0; π_α(t, x_t))

• The standard approach is to use gradient ascent (a minimal update sketch follows below); computing ∇_α V(α) is the object of the paper
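The gradient-ascent loop itself is standard; a minimal sketch, where grad_V stands for whatever estimator of ∇_α V(α) is available (alpha0, step_size and num_iterations are illustrative names):

```python
import numpy as np

def gradient_ascent(grad_V, alpha0, step_size=0.01, num_iterations=200):
    """Ascend V(alpha) using any estimator grad_V of its gradient."""
    alpha = np.array(alpha0, dtype=float)
    for _ in range(num_iterations):
        alpha = alpha + step_size * grad_V(alpha)   # move along the estimated gradient
    return alpha
```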

Page 5: Policy Gradient in Continuous Time

Introduction of the Problem

How to compute ∇_α V(α)?

• Finite-difference method

  ∂V(α)/∂α_i ≈ [V(α + ε e_i) − V(α)] / ε,  where e_i is the i-th coordinate vector

This method requires a large number of trajectories to compute the gradient of the performance measure (sketched in the code below).

• Pathwise estimation of the gradient

Compute the gradient using one trajectory only
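A sketch of the finite-difference estimator, assuming V is a function that evaluates the performance measure by rolling out one trajectory per call; each partial derivative costs one extra rollout, which is why many trajectories are needed:

```python
import numpy as np

def finite_difference_gradient(V, alpha, eps=1e-4):
    """Estimate each dV/dalpha_i as (V(alpha + eps * e_i) - V(alpha)) / eps."""
    alpha = np.asarray(alpha, dtype=float)
    base = V(alpha)                      # one unperturbed rollout
    grad = np.zeros_like(alpha)
    for i in range(alpha.size):
        perturbed = alpha.copy()
        perturbed[i] += eps              # perturb a single coordinate
        grad[i] = (V(perturbed) - base) / eps   # one extra rollout per parameter
    return grad
```

With p parameters this needs p + 1 rollouts per gradient estimate, in contrast to the pathwise approach described next.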

Page 6: Policy Gradient in Continuous Time

Introduction of the Problem

Pathwise estimation of the gradient

Define z_t = ∇_α x_t

Dynamics of z_t:  dz_t/dt = ∇_α f(x_t) + ∇_x f(x_t) z_t

Gradient:  ∇_α V(α) = ∇_x r(x_T) ∇_α x_T = ∇_x r(x_T) z_T
(∇_x r(x_T) is known; z_T is unknown)

• In reinforcement learning, ∇_x f(x_t) is unknown. How can we approximate z_t?
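When the model is available, z_t can be integrated alongside x_t on the same trajectory. A minimal sketch, assuming hypothetical callables df_dx and df_dalpha for the Jacobians ∇_x f and ∇_α f (with the policy composed into f); these are exactly the quantities that are unavailable in the model-free setting addressed next:

```python
import numpy as np

def pathwise_gradient(f, df_dx, df_dalpha, dr_dx, policy, x0, alpha, T, dt=1e-2):
    """Integrate x_t and the sensitivity z_t = d x_t / d alpha along one trajectory,
    then return dV/dalpha = dr/dx(x_T) @ z_T."""
    x = np.array(x0, dtype=float)                 # state, shape (n,)
    z = np.zeros((x.size, np.size(alpha)))        # z_t, shape (n, p)
    t = 0.0
    while t < T:
        u = policy(t, x, alpha)
        # dz/dt = grad_alpha f + grad_x f @ z, evaluated along the trajectory
        z = z + dt * (df_dalpha(t, x, alpha) + df_dx(t, x, alpha) @ z)
        x = x + dt * f(x, u)
        t += dt
    return dr_dx(x) @ z                           # gradient of the terminal reward w.r.t. alpha
```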

Page 7: Policy Gradient in Continuous Time

Discretized Stochastic Processes Approximation

• A General Convergence Result

If dx_t/dt = f(x_t)

Page 8: Policy Gradient in Continuous Time

• Discretization of the state

  Stochastic policy

  Stochastic discrete state process (X_{t_n})_{0 ≤ n ≤ N}

  Initialization: X_0 = x_0

  Jump in state
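A sketch of this discretized stochastic process, assuming a finite action set and a hypothetical policy_probs(t, x) that returns the action probabilities of the stochastic policy:

```python
import numpy as np

def simulate_discrete_process(f, policy_probs, actions, x0, T, dt, rng=None):
    """Stochastic discrete state process (X_{t_n}): at each step sample
    u_n ~ pi_alpha(. | t_n, X_n), then jump X_{n+1} = X_n + dt * f(X_n, u_n)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.array(x0, dtype=float)                  # initialization X_0 = x_0
    states, controls = [X.copy()], []
    for n in range(int(round(T / dt))):
        p = policy_probs(n * dt, X)                # probabilities over the finite action set
        u = actions[rng.choice(len(actions), p=p)]
        X = X + dt * np.asarray(f(X, u), dtype=float)   # jump in state
        states.append(X.copy())
        controls.append(u)
    return np.array(states), controls
```

As the step size shrinks, this process converges to the deterministic trajectory of the policy-averaged dynamics (Proposition 5).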

Page 9: Policy Gradient in Continuous Time

Proof of Proposition 5:

From Taylor's formula, with the action u_{t_n} held constant on [t_n, t_{n+1}]:

  X_{t_{n+1}} = X_{t_n} + Δt_n f(X_{t_n}, u_{t_n}) + O(Δt_n²)

The average jump:

  E[ΔX_{t_n} | X_{t_n} = x] = Σ_{u ∈ U} π_α(u | t_n, x) [Δt_n f(x, u) + O(Δt_n²)]
                            = Δt_n Σ_{u ∈ U} π_α(u | t_n, x) f(x, u) + O(Δt_n²)

Directly applying Theorem 3, Proposition 5 is proved.

Page 10: Policy Gradient in Continuous Time

• Discretization of the state gradient

  Stochastic discrete state gradient process (Z_{t_n})_{0 ≤ n ≤ N}

  Initialization: Z_0 = ∇_α X_0 = 0

  With

Page 11: Policy Gradient in Continuous Time

Proof of Proposition 6:

Since

then

Directly applying Theorem 3, Proposition 6 is proved.

Page 12: Policy Gradient in Continuous Time

Model-free Reinforcement Learning Algorithm

Let

In this stochastic approximation, is observed, and

is given; we only need to approximate

Page 13: Policy Gradient in Continuous Time

Least-Squares Approximation of

Define

  S(t) = { s ∈ [t − c, t] | u_s = u_t }

  the set of past discrete times s in [t − c, t] at which the action u_t has been taken (see the sketch below).

From Taylor's formula, for all discrete times s,

We deduce
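A small sketch of how S(t) can be collected from a recorded trajectory, assuming the discrete times and the actions taken at them are stored in parallel lists (hypothetical names):

```python
def past_times_same_action(times, actions_taken, t_index, c):
    """S(t) = { s in [t - c, t] | u_s = u_t }: past discrete times, within a window
    of length c, at which the same action as u_t was taken."""
    t, u_t = times[t_index], actions_taken[t_index]
    return [s for s, u in zip(times, actions_taken) if t - c <= s <= t and u == u_t]
```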

Page 14: Policy Gradient in Continuous Time

Where

We may derive an approximation of  by solving the least-squares problem:

Then we have

Here

denotes the average value of
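The regressors and targets of this least-squares problem are given by the formulas omitted above; whatever they are, the numerical step is a standard linear least-squares solve, sketched here:

```python
import numpy as np

def solve_least_squares(A, y):
    """Return the vector b minimizing ||A b - y||^2 (the generic least-squares step)."""
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b
```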

Page 15: Policy Gradient in Continuous Time

Algorithm

Page 16: Policy Gradient in Continuous Time

Experimental Results

Six continuous state variables:

x0, y0: hand position

x, y: mass position

vx, vy: mass velocity

Four control actions: U = {(1, 0), (0, 1), (−1, 0), (0, −1)}

Goal: reach a target (x_G, y_G) with the mass at a specified time T

Terminal reward function

Page 17: Policy Gradient in Continuous Time

The system dynamics:

Consider a Boltzmann-like stochastic policy

where
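The exact form of the policy is in the omitted formulas; as a rough sketch, a Boltzmann-like (softmax) policy over the four actions could look as follows, with score(x, u, alpha) standing in for the parameterized preference used in the paper (hypothetical):

```python
import numpy as np

# Action set from the experiment: U = {(1,0), (0,1), (-1,0), (0,-1)}
U = [(1, 0), (0, 1), (-1, 0), (0, -1)]

def boltzmann_probs(score, x, alpha, actions=U, temperature=1.0):
    """pi_alpha(u | x) proportional to exp(score(x, u, alpha) / temperature)."""
    s = np.array([score(x, u, alpha) for u in actions], dtype=float)
    s = (s - s.max()) / temperature          # shift scores for numerical stability
    w = np.exp(s)
    return w / w.sum()

def sample_action(score, x, alpha, actions=U, temperature=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p = boltzmann_probs(score, x, alpha, actions, temperature)
    return actions[rng.choice(len(actions), p=p)]
```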

Page 18: Policy Gradient in Continuous Time
Page 19: Policy Gradient in Continuous Time

Conclusion

• Described a reinforcement learning method for approximating the gradient of the performance measure of a continuous-time deterministic control problem with respect to the control parameters

• Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process