Policy Gradient in Continuous Time
by Remi Munos, JMLR 2006

Presented by Hui Li
Duke University Machine Learning Group
May 30, 2007
Outline
• Introduction
• Discretized Stochastic Processes Approximation
• Model-free Reinforcement Learning (RL) Algorithm
• Example Results
Introduction of the Problem

• Consider an optimal control problem with continuous state and deterministic dynamics. System dynamics:

$$\frac{dx_t}{dt} = f(x_t, u_t)$$

where $x_t$ is the continuous state and $u_t$ the control.

• Objective: find an optimal control $(u_t)$ that maximizes the functional given by the terminal reward,

$$J(x_0; (u_t)_{t \ge 0}) = r(x_T).$$
• Consider a class of policies parameterized by $\alpha$: $u_t = \pi_\alpha(t, x_t)$.

• Find the parameter $\alpha$ that maximizes the performance measure

$$V(\alpha) = J(x_0; \pi_\alpha(t, x_t)).$$

• The standard approach is the gradient ascent method (sketched just below), which requires computing $\nabla_\alpha V(\alpha)$: the object of the paper.
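For concreteness, the gradient ascent update reads as follows (the step size $\eta$ is assumed notation, not from the slides):

$$\alpha \leftarrow \alpha + \eta\, \nabla_\alpha V(\alpha)$$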
Introduction of the Problem
How to compute $\nabla_\alpha V(\alpha)$?

• Finite-difference method (a minimal sketch follows below):

$$\nabla_{\alpha_i} V(\alpha) \simeq \frac{V(\alpha + \epsilon e_i) - V(\alpha)}{\epsilon}$$

This method requires a large number of trajectories, at least one per parameter component, to compute the gradient of the performance measure.
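A minimal sketch of the finite-difference baseline, assuming `V` is a routine that rolls out one trajectory under $\pi_\alpha$ and returns the terminal reward; all names here are illustrative:

```python
import numpy as np

def finite_difference_gradient(V, alpha, eps=1e-4):
    """Forward finite-difference estimate of grad V(alpha).

    V: callable evaluating the performance measure, i.e. it simulates one
       full trajectory under pi_alpha and returns the terminal reward r(x_T).
    alpha: 1-D parameter vector. Each gradient component costs one extra
    trajectory."""
    grad = np.zeros_like(alpha)
    base = V(alpha)                          # one rollout at alpha
    for i in range(alpha.size):
        e_i = np.zeros_like(alpha)
        e_i[i] = eps                         # perturb one coordinate
        grad[i] = (V(alpha + e_i) - base) / eps
    return grad
```

The cost grows linearly with the number of parameters, which motivates the pathwise approach below.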
• Pathwise estimation of the gradient: compute the gradient using one trajectory only.
Introduction of the Problem

Pathwise estimation of the gradient. Define $z_t = \nabla_\alpha x_t$, the sensitivity of the state with respect to the policy parameters.

Dynamics of $z_t$:

$$\frac{dz_t}{dt} = \nabla_\alpha f(x_t) + \nabla_x f(x_t)\, z_t$$

Gradient:

$$\nabla_\alpha V(\alpha) = \nabla_x r(x_T)\, \nabla_\alpha x_T = \nabla_x r(x_T)\, z_T$$

Here $\nabla_x r(x_T)$ is known (the reward function is given), but in reinforcement learning $f$, and hence $\nabla_x f(x_t)$, is unknown. How to approximate $z_t$? (A model-based reference sketch follows; the model-free construction comes next.)
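Before the model-free construction, a minimal Euler sketch of the pathwise computation in the model-based case, where $f$ and its derivatives are assumed available; signatures and names are illustrative:

```python
import numpy as np

def pathwise_gradient(f, grad_alpha_f, grad_x_f, grad_x_r, x0, alpha, T, n_steps):
    """Euler integration of the coupled (x_t, z_t) dynamics when the model
    is known (model-based case, for reference only).

    f(x, alpha, t)            -> dx/dt, shape (d,)
    grad_alpha_f(x, alpha, t) -> (d, m) Jacobian of the closed-loop dynamics in alpha
    grad_x_f(x, alpha, t)     -> (d, d) Jacobian in x
    grad_x_r(x)               -> (d,) gradient of the terminal reward
    Returns the pathwise estimate grad_x r(x_T) @ z_T of grad V(alpha)."""
    dt = T / n_steps
    x = np.asarray(x0, dtype=float)
    z = np.zeros((x.size, alpha.size))       # z_0 = 0: x_0 does not depend on alpha
    for n in range(n_steps):
        t = n * dt
        z = z + dt * (grad_alpha_f(x, alpha, t) + grad_x_f(x, alpha, t) @ z)
        x = x + dt * f(x, alpha, t)
    return grad_x_r(x) @ z                   # one trajectory suffices
```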
Discretized Stochastic Processes Approximation

• A General Convergence Result (Theorem 3): conditions under which a stochastic discrete-time process converges to the solution of the continuous deterministic dynamics

$$\frac{dx_t}{dt} = f(x_t).$$

Roughly, the average jump of the process must match $\Delta t\, f$ up to higher-order terms, with a jump variance that vanishes with the step size.
• Discretization of the state: replace the deterministic policy by a stochastic policy $\pi_\alpha(u \mid t, x)$ and define a stochastic discrete state process $(X_n)_{0 \le n \le N}$ (see the sketch below).

Initialization: $X_0 = x_0$.

Jump in state: $X_{n+1} = X_n + \Delta t\, f(X_n, u_n)$, with $u_n$ drawn from $\pi_\alpha(\cdot \mid t_n, X_n)$.
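A minimal simulation sketch of the discretized stochastic process, assuming the Euler-style jump above; `policy_probs` and the other names are illustrative:

```python
import numpy as np

def simulate_discrete_process(f, policy_probs, actions, x0, alpha, T, n_steps,
                              rng=np.random.default_rng(0)):
    """Stochastic discrete state process (X_n): at each step an action is
    drawn from the stochastic policy pi_alpha(u | t, x), then the state
    jumps by dt * f(X_n, u_n).

    policy_probs(alpha, t, x) -> probabilities over the finite action set."""
    dt = T / n_steps
    x = np.asarray(x0, dtype=float)
    traj, taken = [x.copy()], []
    for n in range(n_steps):
        t = n * dt
        p = policy_probs(alpha, t, x)
        u = actions[rng.choice(len(actions), p=p)]
        x = x + dt * f(x, u)                 # jump in state
        traj.append(x.copy())
        taken.append(u)
    return np.array(traj), taken
```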
Proof of Proposition 5:

From Taylor's formula,

$$f(x_t + \Delta x, u) = f(x_t, u) + \nabla_x f(x_t, u)\,\Delta x + o(\|\Delta x\|).$$

The average jump:

$$E[f(x_t, u_t)] = \sum_{u \in U} \pi_\alpha(u \mid t, x)\, f(x, u),$$

which matches the mean dynamics up to higher-order terms. Directly applying Theorem 3, Proposition 5 is proved.
• Discretization of the state gradient: a stochastic discrete state gradient process $(Z_n)_{0 \le n \le N}$.

Initialization: $Z_0 = 0$.

With jumps chosen so that the average jump of $Z_n$ matches the dynamics of $z_t$ (a sketch of one plausible update follows the proof below).
Proof of Proposition 6:

Since the average jump of $Z_n$ matches the dynamics of $z_t$ up to higher-order terms, with vanishing jump variance, directly applying Theorem 3, Proposition 6 is proved.
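One plausible form of the discrete gradient jump, a minimal sketch assuming a plain Euler step that mirrors the continuous $z_t$ dynamics (the paper's exact update may differ):

```python
def gradient_process_step(Z, x, u, alpha, t, dt, grad_alpha_f, grad_x_f):
    """One jump of the discrete gradient process (Z_n), sketched as an
    Euler step of dz/dt = grad_alpha f + grad_x f @ z evaluated at the
    sampled action u. grad_alpha_f and grad_x_f are assumed callables
    returning the Jacobians of f at (x, u); Z is a (d, m) array."""
    return Z + dt * (grad_alpha_f(x, u, alpha, t) + grad_x_f(x, u, alpha, t) @ Z)
```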
Model-free Reinforcement Learning Algorithm

In this stochastic approximation, the state jump $\Delta X_n$ is observed and the policy $\pi_\alpha$ is given; we only need to approximate the term involving $\nabla_x f$, since the model $f$ is unknown.
Least-Square Approximation of $\nabla_x f$

Define

$$S(t) = \{ s \in [t - c, t] \mid u_s = u_t \},$$

the set of past discrete times $s \in [t - c, t]$ at which the action $u_t$ has been taken.

From Taylor's formula, for all discrete times $s \in S(t)$,

$$\frac{\Delta X_s}{\Delta t} = f(X_s, u_t) \simeq f(X_t, u_t) + \nabla_x f(X_t, u_t)\,(X_s - X_t).$$

We deduce, writing $\bar X$ and $\overline{\Delta X}$ for the average values of $X_s$ and $\Delta X_s$ over $s \in S(t)$,

$$\frac{\Delta X_s - \overline{\Delta X}}{\Delta t} \simeq \nabla_x f(X_t, u_t)\,(X_s - \bar X).$$

We may then derive an approximation of $\nabla_x f(X_t, u_t)$ by solving the least-square problem

$$\min_{A} \sum_{s \in S(t)} \left\| \frac{\Delta X_s - \overline{\Delta X}}{\Delta t} - A\,(X_s - \bar X) \right\|^2.$$
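A minimal sketch of this least-squares step, assuming the centered regression above; array names are illustrative:

```python
import numpy as np

def estimate_grad_x_f(states, jumps, dt):
    """Least-squares estimate of the Jacobian grad_x f(X_t, u_t) from the
    past states X_s and observed jumps Delta X_s at times s in S(t) where
    the same action u_t was taken.

    states: (k, d) array of X_s; jumps: (k, d) array of Delta X_s."""
    Xc = states - states.mean(axis=0)            # X_s minus its average value
    Yc = (jumps - jumps.mean(axis=0)) / dt       # centered jump rates
    # Solve min_A sum_s || Yc_s - A Xc_s ||^2, i.e. Xc @ A.T ~ Yc.
    A_T, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    return A_T.T                                 # estimated (d, d) Jacobian
```

With fewer than $d$ usable past times the system is underdetermined and `lstsq` returns the minimum-norm solution, so the window $c$ must be chosen large enough that $S(t)$ is informative.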
Algorithm
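Putting the pieces together, a high-level sketch of one plausible end-to-end loop; this assembles the components above and is an assumption about the overall flow, not the paper's verbatim algorithm. `simulate_with_gradient` is a hypothetical routine combining the simulation, least-squares, and gradient-process steps above:

```python
def policy_gradient_ascent(simulate_with_gradient, grad_x_r, alpha, lr, n_iters):
    """Repeatedly: roll out one trajectory of (X_n, Z_n) under pi_alpha,
    form the pathwise estimate grad_x r(X_N) @ Z_N, and ascend."""
    for _ in range(n_iters):
        X_final, Z_final = simulate_with_gradient(alpha)
        grad_V = grad_x_r(X_final) @ Z_final     # gradient estimate from one trajectory
        alpha = alpha + lr * grad_V              # gradient ascent step
    return alpha
```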
Experimental Results

Six continuous state variables:
x0, y0: hand position
x, y: mass position
vx, vy: mass velocity

Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)}

Goal: reach a target (xG, yG) with the mass at a specified time T.
Terminal reward function: measures how close the mass is to the target at time T.

The system dynamics: the control moves the hand, and the motion of the mass is driven by the hand and mass positions and the mass velocity.

Consider a Boltzmann-like stochastic policy over the four actions, with action probabilities proportional to the exponential of a parameterized score (a sketch follows below).
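A minimal sketch of a Boltzmann-like policy over the four actions; the linear score $\alpha_u \cdot \phi(t, x)$ is an assumed parameterization, since the slide does not show the exact definition:

```python
import numpy as np

ACTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]     # the four control actions U

def boltzmann_probs(alpha, phi):
    """Boltzmann-like action probabilities: p(u) proportional to
    exp(alpha[u] . phi), where phi = phi(t, x) is a feature vector and
    alpha is a (4, k) parameter matrix (an assumed parameterization)."""
    scores = alpha @ phi
    scores = scores - scores.max()               # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def sample_action(alpha, phi, rng=np.random.default_rng()):
    p = boltzmann_probs(alpha, phi)
    return ACTIONS[rng.choice(len(ACTIONS), p=p)]
```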
Conclusion
• Described a reinforcement learning method for approximating the gradient of the performance measure of a continuous-time deterministic problem with respect to the control parameters
• Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process