On Movement Skill Learning and Movement Representations for Robotics
Gerhard Neumann
Graz University of Technology, Institute for Theoretical Computer Science
November 2, 2011
Seminar Talk
Modern Robotic Systems: Motivation...
Many degrees of freedom, compliant actuators, highly dynamic movements...
In principle the advanced morphology of these robots would allow us to perform a wide range of complex movements such as
• Different forms of locomotion (walking, running, trotting)
• Jumping
• Playing tennis...
Classical control methods often fail or are very hard to use for such complex movements.
• More promising approach: let the robot learn the movement from trial and error
• Main topic of this thesis!
Movement Skill Learning for Robotics
Movement Skill Learning can be easily formulated as a Reinforcement Learning problem.
• The agent has to search for a policy which optimizes reward
So why is it challenging?
• High-dimensional continuous state spaces
• High-dimensional continuous action spaces
• Data is expensive: needs to be data efficient
• Needs to be safe
Movement Skill Learning for Robotics
Learning algorithms can be roughly divided into
• Value-based methods
• Policy-search methods
Value-based methods
• Estimate the expected discounted future reward for each state s when following policy π

V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t ]

• Also denoted as the value function of policy π
• Recursive form:

V^π(s) = E[ r(s, a) + γ V^π(s') ]
Value-based methods
+ The value function can be used to assess the quality of each intermediate action of an episode
• E.g. by the use of the Temporal Difference (TD) error:

δ_t = r_t + γ V^π(s_{t+1}) − V^π(s_t)

• Evaluates whether the current step ⟨s_t, a_t, r_t, s_{t+1}⟩ was better or worse than expected
• We can efficiently solve the temporal credit assignment problem
− The value function is very hard to estimate in high-dimensional continuous state and action spaces
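The TD error above fits in a one-line update; a minimal tabular sketch, assuming a dict-based value function and illustrative discount/learning-rate values (not from the thesis):

```python
# Hedged sketch of a TD(0) value update; `V` is a plain dict mapping
# states to value estimates, gamma/alpha are illustrative constants.
def td_update(V, s, r, s_next, gamma=0.95, alpha=0.1):
    """Move V[s] toward the one-step bootstrap target r + gamma * V[s']."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # TD error
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```

A positive δ_t means the step went better than expected, a negative one worse.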
Policy Search Methods
• Rely on a parametric representation of the policy π(a|s; w)
• w ... parameters of the policy
• Directly optimize the policy parameters by performing rollouts on the real system
− We can only assess the quality of a whole trajectory instead of single actions
+ However, as no value function is estimated, this can be done very accurately
• More successful than value-based methods
• Performance strongly depends on the used movement representation
Outline: The thesis is divided into 3 parts...
Value-based Methods
• Graph-Based Reinforcement Learning
• Fitted Q-Iteration by Advantage Weighted Regression
Movement Representations
• Kinematic Synergies
• Motion Templates
• Planning Movement Primitives
Policy Search
• Variational Inference for Policy Search in Changing Situations
Fitted Q-Iteration: Batch-Mode Reinforcement Learning (BMRL)
• Batch-mode RL methods use the whole history H of the agent to update the value or action-value function:

H = { ⟨s_i, a_i, r_i, s'_i⟩ }_{1≤i≤N}

• Advantage: data points are used more efficiently than in online methods
• Fitted Q-Iteration (Ernst et al., 2003) approximates the state-action value function Q(s, a) by iteratively applying supervised regression techniques
• Repeat K times:

Q_{k+1}(i) = r_i + γ V_k(s'_i) = r_i + γ max_{a'} Q_k(s'_i, a')

D_k = { [(s_i, a_i), Q_{k+1}(i)] }_{1≤i≤N},   Q_{k+1} = Regress(D_k)
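The iteration can be sketched compactly; a hedged sketch assuming a finite action set, with a tiny 1-nearest-neighbour regressor standing in for the supervised learner (trees, neural networks, GPs) that FQI can plug in. All names and constants are illustrative:

```python
# Illustrative fitted Q-iteration over (s, a, r, s') transitions.
import numpy as np

class OneNN:
    """Tiny 1-nearest-neighbour regressor standing in for the
    supervised learner that FQI can plug in."""
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y, float)
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        d = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(-1)
        return self.y[d.argmin(axis=1)]

def fitted_q_iteration(transitions, actions, gamma=0.95, K=20):
    """transitions: list of (s, a, r, s_next); actions: finite action set."""
    S = np.array([list(s) + [a] for s, a, _, _ in transitions], float)
    q = None
    for _ in range(K):
        targets = []
        for s, a, r, s_next in transitions:
            # v_next approximates max_a' Q_k(s', a') over the action set
            v_next = 0.0 if q is None else max(
                q.predict([list(s_next) + [a2]])[0] for a2 in actions)
            targets.append(r + gamma * v_next)
        q = OneNN().fit(S, targets)   # Q_{k+1} = Regress(D_k)
    return q
```

The max over a finite action set is exactly the step that becomes problematic for continuous actions, which the slides address next.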
+ FQI has been shown to outperform classical online RL methods in many applications (Ernst et al., 2005).
+ Any type of supervised learning method can be used, e.g. neural networks (Riedmiller, 2005), regression trees (Ernst et al., 2005), Gaussian Processes.
− High computational demands...
FQI for Robotics...
Continuous state spaces: ✓
• Any type of supervised learning method can be used, e.g. neural networks, regression trees, Gaussian Processes
Continuous action spaces:
• We have to solve

Q_{k+1}(i) = r_i + γ max_{a'} Q_k(s'_i, a')

− Hm... how do we perform the max_{a'} operator in continuous action spaces?
• Discretizations become prohibitively expensive in high-dimensional spaces
• We have to solve an optimization problem for each sample! E.g. use Cross-Entropy optimization for each data point s'_i
• We show that an advantage-weighted regression can be used to approximate max_a Q(s, a).
• The regression uses the states s_i as input values and Q(s_i, a_i) as target values.
• The weighting w_i = exp(τ A(s_i, a_i)) of each data point is based on the advantage function A(s, a) = Q(s, a) − V(s).
FQI for Robotics...
What is a weighted regression?
• Minimize the error function w.r.t. θ:

E = Σ_{i=1}^N w_i ( V(s_i; θ) − Q(s_i, a_i) )²

• w_i ... each data point gets an individual weighting
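A weighted regression of this form has a closed-form solution; a minimal sketch, assuming linear features [1, s^T] as on the slides (variable names are ours):

```python
# Hedged sketch of weighted linear least squares:
# minimize sum_i w_i (phi(s_i)^T theta - y_i)^2 in closed form.
import numpy as np

def weighted_regression(S, y, w):
    """S: (N, d) states, y: (N,) targets, w: (N,) nonnegative weights."""
    S, y, w = (np.asarray(x, float) for x in (S, y, w))
    Phi = np.hstack([np.ones((len(S), 1)), S])   # features [1, s^T]
    W = np.diag(w)
    # theta = (Phi^T W Phi)^{-1} Phi^T W y
    return np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ y)
```

Setting a weight to zero removes that data point from the fit, which is exactly the lever the advantage weighting will use.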
FQI for Robotics...
We prove this by applying the following 2 steps:
• Weighted regression for value estimation
• Soft-greedy policy improvement
Weighted regression for value estimation
• The value function of a stochastic policy π is given by

V^π(s) = ∫_a π(a|s) Q(s, a) da

• We show that this can be approximated without evaluating the integral by solving a weighted regression problem:

D_V = { ⟨s_i, Q(s_i, a_i)⟩ },   U = { π(a_i|s_i) },   V = WeightedReg(D_V, U)
Proof
We want to find an approximation V(s) of V^π(s) by minimizing the error function

Error(V) = ∫_s μ(s) ( ∫_a π(a|s) Q(s, a) da − V(s) )² ds
         = ∫_s μ(s) ( ∫_a π(a|s) ( Q(s, a) − V(s) ) da )² ds,

where the second equality uses ∫_a π(a|s) da = 1.

• μ(s) ... state distribution when following policy π(·|s).
An upper bound of Error(V) is given by

Error_B(V) = ∫_s μ(s) ∫_a π(a|s) ( Q(s, a) − V(s) )² da ds ≥ Error(V).

• Follows from Jensen's inequality
It is easy to show that both error functions have the same minimum for V.
• The upper bound Error_B can be approximated straightforwardly from samples {(s_i, a_i), Q(s_i, a_i)}_{1≤i≤N}:

Error_B(V) ≈ Σ_{i=1}^N π(a_i|s_i) ( Q(s_i, a_i) − V(s_i) )²   (1)

• No integral over the action space is needed!
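The sample approximation (1) can be checked numerically; a tiny sketch with one state, two actions, and invented numbers, using a constant model V(s) = θ so that the weighted least-squares minimizer is the π-weighted mean:

```python
# Numeric check of Eq. (1) at a single state; all values are invented.
import numpy as np

Q = np.array([2.0, 0.0])    # Q(s, a1), Q(s, a2) at one state s
pi = np.array([0.7, 0.3])   # pi(a_i | s), used as regression weights
# With a constant model V(s) = theta, minimizing Eq. (1) gives the
# pi-weighted mean, which equals the exact V^pi(s) = sum_a pi(a|s) Q(s, a)
theta = (pi * Q).sum() / pi.sum()
```

Here θ recovers V^π(s) without ever integrating over the action space.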
Soft-greedy policy improvement
The optimal value function V(s) = max_a Q(s, a) can be approximated without evaluating max_a Q(s, a) by solving an advantage-weighted regression problem:

D_V = { ⟨s_i, Q(s_i, a_i)⟩ },   U* = { exp(τ Ā(s_i, a_i)) },   (2)
V = WeightedReg(D_V, U*)   (3)

• τ ... greediness parameter of the algorithm
• Ā(s, a) ... normalized advantage function
Proof
We approximate the value function V^{π1} of a soft-max policy π1 by the use of weighted regression.
• Since a soft-max policy is an approximation of the greedy policy, we can replace V(s) = max_a Q(s, a) with V^{π1}(s).
The used soft-max policy π1(a|s) is based on the advantage function A(s, a) = Q(s, a) − V(s):

π1(a|s) = exp(τ Ā(s, a)) / ∫_a exp(τ Ā(s, a)) da,   Ā(s, a) = (A(s, a) − m_A(s)) / σ_A(s).

• If we assume that the advantages Ā(s, a) are normally distributed, the denominator of π1 is constant.
• Thus we can use exp(τ Ā(s, a)) ∝ π1(a|s) directly as the weighting for the regression.
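The resulting weighting can be sketched directly; a hedged sketch that normalizes advantages, exponentiates them, and uses them as weights (the batch layout and the τ value are our assumptions, not the thesis'):

```python
# Illustrative advantage weighting for samples sharing a state s.
import numpy as np

def advantage_weights(Q_sa, V_s, tau=1.0):
    """Exponentiated normalized advantages u_i = exp(tau * A_bar_i)."""
    A = np.asarray(Q_sa, float) - V_s           # advantage A(s, a)
    A_bar = (A - A.mean()) / (A.std() + 1e-8)   # normalized advantage
    return np.exp(tau * A_bar)

Q_sa = np.array([1.0, 2.0, 3.0])
u = advantage_weights(Q_sa, V_s=2.0)
# The advantage-weighted mean leans toward max_a Q(s, a) = 3
V_soft = (u * Q_sa).sum() / u.sum()
```

Larger τ makes the weighting greedier, pushing the weighted estimate closer to the maximum Q-value.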
Concrete algorithm: LAWER
The Locally-Advantage WEighted Regression (LAWER) algorithm implements the presented theoretical results.
• It combines Locally Weighted Regression (LWR, Atkeson et al., 1997) and advantage-weighted regression.
• The locality weighting w_i(s) and the advantage weighting u_i = exp(τ Ā(s_i, a_i)) can be multiplicatively combined.
• The value function is then given by a simple weighted linear regression:

V_{k+1}(s) = s̄ (S^T U S)^{-1} S^T U Q_{k+1}

• s̄ = [1, s^T], S = [s̄_1, s̄_2, ..., s̄_N]^T ... state matrix
• U = diag(w_i(s) u_i)
• In order to approximate V(s) = max_a Q_k(s, a), only the Q-values of neighboring state-action pairs are needed.
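The combined weighting can be sketched as one weighted linear regression at a query state; a hedged sketch where Gaussian locality weights stand in for the locality kernel, and the bandwidth and τ are illustrative choices of ours:

```python
# Illustrative LAWER-style value estimate at a query state.
import numpy as np

def lawer_value(s_query, S, Q, A_bar, tau=1.0, bandwidth=0.5):
    """Locality- and advantage-weighted linear regression for V(s_query)."""
    S, Q = np.asarray(S, float), np.asarray(Q, float)
    # Gaussian locality weights w_i(s) around the query state
    w_loc = np.exp(-((S - s_query) ** 2).sum(axis=1) / (2 * bandwidth ** 2))
    # advantage weights u_i = exp(tau * A_bar_i)
    u_adv = np.exp(tau * np.asarray(A_bar, float))
    U = np.diag(w_loc * u_adv)                   # combined weighting
    Phi = np.hstack([np.ones((len(S), 1)), S])   # rows [1, s_i^T]
    theta = np.linalg.solve(Phi.T @ U @ Phi, Phi.T @ U @ Q)
    return np.concatenate([[1.0], np.atleast_1d(s_query)]) @ theta
```

With all normalized advantages set to zero this reduces to plain locally weighted regression, which is the LWR half of the combination.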
Approximation of the policy
For unseen states we need to approximate the soft-max policy:
• Gaussian policy π(a|s) = N(a | μ(s), σ²)
• For estimating this policy we use reward-weighted regression (Peters & Schaal, 2007); only the advantage is used instead of the reward for the weighting.
• Thus we optimize the long-term reward instead of the immediate reward.
Results
• We use the Cross-Entropy (CE) optimization method (de Boer et al., 2005) as comparison to find the maximum Q-values max_a Q(s, a).
• We compare the LAWER algorithm to 3 different state-of-the-art CE-based fitted Q-iteration algorithms:
• Tree-based FQI (Ernst et al., 2005) (CE-Tree)
• Neural FQI (Riedmiller, 2005) (CE-Net)
• LWR-based FQI (CE-LWR)
• After each FQI cycle new data was collected.
• The immediate reward function was quadratic in the distance to the goal position x_G and in the applied torque/force.
Pendulum swing-up task
• A pendulum needs to be swung up from the position at the bottom to the top position (Riedmiller, 2005).
• 2 experiments with different torque punishment factors (c2) were carried out.
[Figure: Average reward vs. number of data collections for LAWER, CE-Tree, CE-LWR, and CE-Net; (a) c2 = 0.005, (b) c2 = 0.025]
Comparison of torque trajectories
[Figure: Torque trajectories u [N] over time for LAWER, CE-Tree, and CE-LWR; (c) c2 = 0.005, (d) c2 = 0.025]
Dynamic puddle-world
• The agent has to navigate from a start position to a goal position; it gets negative reward when going through puddles.
• Dynamic version of the puddle-world: the agent can set a force accelerating a k-dimensional point mass.
• This was done for k = 2 and k = 3 dimensions.
[Figure: Puddle-world layout with start and goal positions]
Comparison of the algorithms
[Figure: Average reward vs. number of data collections for LAWER and CE-Tree; (e) 2-D, (f) 3-D]
• The CE-Tree method learns faster, but does not manage to learn high-quality policies for the 3-D setting.
• LAWER also works for high-dimensional action spaces.
Comparison of torque trajectories
[Figure: Torque trajectories u1, u2, u3 over time; (g) LAWER, (h) CE-Tree]
Conclusion
• We have proven that the greedy operator max_a Q(s, a) can be approximated efficiently by an advantage-weighted regression.
• The resulting algorithm runs an order of magnitude faster than competing algorithms.
• In spite of the resulting soft-greedy policy improvement, our algorithm was able to produce policies of higher quality.
• The Locally-Advantage Weighted Regression algorithm allows us to use fitted Q-iteration even for high-dimensional continuous action spaces.
Movement Representations for Motor Skill Learning
Directly optimize a parametric movement representation:
• No value estimation is needed
• What is a good representation for learning a movement?
Episodic tasks:
• Often it is sufficient to formulate the learning task in the episodic RL setup
• Single initial state, specified fixed duration of the movement
• Direct policy search can be applied easily in this setup
Episodic setup: use a trajectory-based representation
• We learn a parametric representation of the desired trajectory [q_d(t; w), q̇_d(t; w)]
• t ... time within the movement; no direct dependence on the high-dimensional state
• t is now a scalar, which significantly simplifies the learning problem
• Can only be used in the episodic setup (single start state)
• This trajectory is then followed by using feedback control laws
• Most common movement representations are trajectory-based...
• Dynamic Movement Primitives (Ijspeert & Schaal, 2003), splines (Kolter & Ng, 2009), ...
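A trajectory-based representation can be sketched as a weighted sum of basis functions over time, in the spirit of the spline/DMP representations mentioned above; the Gaussian basis choice and all names are our own:

```python
# Illustrative parametric desired trajectory q_d(t; w) built from
# normalized Gaussian bases over time; w are the learnable parameters.
import numpy as np

def desired_trajectory(t, w, centers, width=0.1):
    """q_d(t; w) = sum_j w_j * phi_j(t) with normalized Gaussian bases."""
    t = np.atleast_1d(np.asarray(t, float))[:, None]
    phi = np.exp(-(t - centers[None, :]) ** 2 / (2 * width ** 2))
    phi = phi / phi.sum(axis=1, keepdims=True)   # normalize the bases
    return phi @ w

centers = np.linspace(0.0, 1.0, 5)
w = np.array([0.0, 0.2, 1.0, 0.2, 0.0])   # parameters to be learned
q = desired_trajectory(np.linspace(0.0, 1.0, 50), w, centers)
```

Policy search then operates only on w, while a feedback controller tracks the resulting q_d(t; w).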
Trajectory-Based vs. Value-Based Motor Skill Learning
Trajectory-Based:
• Can be seen as a single-step decision task
• The agent chooses the parameters w as the action of a single, temporally extended step
• Only one step per episode...
Value-Based:
• One decision per time step of the agent
• The agent chooses the torque u as the action of a single, very short time step
• Up to a few hundred steps per episode...
Can we find a more intuitive solution in which the agent chooses new actions only at certain, characteristic time points of the movement?
• Temporal abstraction: sequencing of temporally extended actions, also called Motion Templates
Temporal Abstractions for Motor-Skill Learning
Example: drawing a triangle with a pen
Flat setup:
• We have to make many unessential decisions
Abstracted setup:
• The movement can be easily decomposed into 3 elemental motions
Temporal Abstractions for Motor-Skill Learning
Standard framework for temporally extended actions: Options (Sutton et al., 1999)
• Options are closed-loop policies taking actions over a period of time
• However, they are mainly used in discrete environments.
• In many applications options are discrete temporally extended actions
• E.g. "Go to another room", "Follow the hallway" or "Frighten the poor monkey"
• For motor tasks, useful options are often difficult to specify.
Temporal Abstractions for Motor-Skill Learning: Illustration
• Pendulum swing-up task:
• Standard RL benchmark task
• Learn how to swing up and balance an inverted pendulum from the bottom position
• We additionally want to minimize the energy consumption
• Flat RL: choose a new action every 50 ms
Pendulum Swing-Up: Illustration
How can we decompose the trajectory into options?
[Figure: Torque [N] over time [s], with positive peaks, negative peaks, and a balancing motion marked]
• We have positive and negative peaks in the torque trajectory ...
• ... followed by a final balancing motion.
Specify the exact form of the peaks and the balancing motion for the options?
• Requires a lot of prior knowledge...
• The learning task becomes trivial...
However: we can specify the functional form of the options
• Use parameterized options...
Motion Templates
• Motion templates: parameterized options
• Used as our building blocks of motion.
• A motion template m_p is defined by:
• Its k_p-dimensional parameter space Θ_p
• Its parameterized policy u_p(s, t; θ_p)
• Its termination condition c_p(s, t; θ_p)
• s ... state, t ... execution time, θ_p ∈ Θ_p ... parameters
• The functional form of u_p and c_p is chosen by the designer; the parameters θ_p are learned by the agent
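The interface just described can be sketched as a small data structure; the field names and the example peak template below are our own illustrations, not the thesis' definitions:

```python
# Hedged sketch of the motion-template interface: a parameterized
# policy u_p(s, t; theta_p) plus a termination condition c_p(s, t; theta_p).
from dataclasses import dataclass
from typing import Callable

@dataclass
class MotionTemplate:
    policy: Callable      # u_p(s, t, theta) -> control action
    terminates: Callable  # c_p(s, t, theta) -> bool

# Hypothetical example: a torque ramp depending only on execution time,
# with parameters theta = (height a, duration d)
peak = MotionTemplate(
    policy=lambda s, t, theta: theta[0] * min(t / theta[1], 1.0),
    terminates=lambda s, t, theta: t >= theta[1],
)
```

The designer fixes the functional form; the agent only searches over θ.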
• At each decision time step σ_k the agent has to choose:
• Which motion template m_p ∈ A(σ_k) to use.
• A(σ_k) ... set of available motion templates at decision time step σ_k
• Which parameterization θ_p ∈ Θ_p of m_p to use.
• Subsequently the policy u_p is executed until the termination condition c_p is fulfilled.
• Continuous time: the duration of the templates can be continuous-valued
• The agent has to learn the correct sequence and parameterization of the motion templates
Pendulum Swing-Up: Decomposition into Motion Templates
How can we decompose the trajectory into motion templates?
[Figure: Torque trajectory with positive peaks, negative peaks, and a balancing motion marked]
Pendulum Swing-Up: Templates to model the peaks
We use 2 templates per peak:
• One for the ascending part: m1 ...
• ... and one for the descending part: m2
• Both just depend on the execution time of the template.
[Figure: Ascending-part template m1 and descending-part template m2, torque [N] over time [s], shown for different values of the template parameters]
Parameters:
• a_i ... height of the template
• o_i ... curvature of the template
• d_i ... duration of the template
• We fix the height of the descending peak template m2 to be the height of m1.
• m3 and m4 are the same templates, just for negative peaks.
[Figure: Swing-up torque trajectory decomposed into the peak templates m1–m4 and the balancing motion]
• The balancing template is implemented as a PD controller:

MT   Functional form    Parameters
m5   −k1 θ − k2 θ̇      k1, k2

• k1 and k2 are the PD controller gains.
• m5 always runs for 20 s; subsequently the episode is terminated.
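The balancing template fits in one line of code; the gains follow the slide's functional form, while the state convention (θ, θ̇ measured from the upright position) is our assumption:

```python
# Minimal sketch of the balancing template m5 as a PD controller.
def balancing_torque(theta, theta_dot, k1, k2):
    """m5: u = -k1*theta - k2*theta_dot, angle measured from upright."""
    return -k1 * theta - k2 * theta_dot
```

The agent only learns the two gains k1 and k2, not the control law itself.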
Pendulum Swing-Up: Constructing the motion
• The agent can either choose the peak templates in the predefined order (m2, m3, m4, m1, m2, ...)
• ... or it can use the balancing template m5 as the final template.
• Thus the agent has to learn the correct number of swing-ups and the correct parameterization of the swing-ups.
[Figure: Torque trajectory decomposed into the template sequence m2, m3, m4, m1, ..., m5]
• Flat: approximately 50 decisions/parameters are needed to reach the top position
• Motion templates: the whole motion consists of only 5 decisions / 13 parameters
Pendulum Swing-Up: Accuracy of the policy
• Motion templates decrease the number of necessary decisions significantly
• The overall learning task is simplified
• Ok... where is the catch?
• A single decision now has much more influence on the outcome of the whole motion.
• Therefore a single decision has to be made much more precisely than in flat RL.
Algorithm for Motion Template Learning
• An RL algorithm is needed which can learn very precise continuous-valued policies!
• For each template m_p, we use an extension of the Locally Advantage WEighted Regression (LAWER, Neumann & Peters, 2009) algorithm to learn the policy π_p(θ_p|s) for selecting the parameters of m_p.
Extensions of LAWER: LAWER for Motion Template Learning
Due to the increased precision requirements of motion template learning, we had to develop 2 substantial extensions of LAWER:
• Adaptive tree-based kernels
• An additional optimization to improve the approximation of V(s) = max_a Q(s, a)
Extensions of LAWER: Adaptive Tree-Based Kernels
The use of a uniform weighting kernel is often problematic in the case of ...
• High-dimensional input spaces ('curse of dimensionality')
• Spatially varying data densities
• Spatially varying curvatures of the regression surface
This problem can be alleviated by varying the 'shape' of the weighting kernel.
• We do this by the use of randomized regression trees...
Extensions of LAWER: Improved approximation of V(s) = max_a Q(s, a)
• In order to estimate the weightings u_i, the original LAWER needed the assumption of normally distributed advantage values.
• Often this assumption does not hold and the estimate of u_i gets imprecise.
• We improve the estimate of the u_i by an additional optimization...
Experiments
Minimum-time problems with additional energy-consumption constraints (c2):
• Pendulum swing-up
• 2-link pendulum swing-up
• 2-link pendulum balancing
Iterative learning protocol:
• We collect L episodes with the currently estimated exploration policy
• Subsequently the optimal policy is re-estimated ...
• ... and the performance (summed reward) of the optimal policy (without exploration) is evaluated.
Experiments: Pendulum Swing-Up
Comparison of learning progress for different energy punishment factors (L = 50)
Figure: Learning curves (average reward vs. number of data collections) for the Gaussian kernel (MT Gauss), the tree-based kernel (MT Tree), and flat RL, for (left) c2 = 0.025 and (right) c2 = 0.075
Comparison of the flat and the motion template policy
Figure: (a) Torque trajectories and motion templates learned for different energy punishment factors c2. (b) Torque trajectories learned with flat RL.
Performance for c2 = 0.075:
• flat RL −48.6, motion templates −38.5
Experiments: 2-Link Pendulum Swing-Up
Same templates as for the 1-dimensional task:
• The peak templates now have 2 additional parameters, the height and the curvature for the second control dimension u2.
• The parameters of the balancer template m5 consist of two 2 × 2 matrices for the controller gains.
Figure: Comparison of motion template learning with tree-based kernels and flat RL (average reward vs. number of data collections)
Learned motion template policy
Figure: Left: torque trajectories and their decomposition into the motion templates. Right: illustration of the motion; the bold postures represent the switching time points of the motion templates.
Conclusions
• We have shown that by the use of motion templates, i.e. parameterized options, many motor tasks can be decomposed into elemental movements.
• Motion templates are the first movement representation which can be sequenced in time.
• While the whole motion consists of fewer decisions, a single decision has to be made more precisely.
• We propose a new algorithm for motion template learning which can cope with the precision requirements.
• We have shown that learning with motion templates can produce policies of higher quality than flat RL and could even be applied to tasks where flat RL was not successful.
Policy Search for trajectory-based representations
Back to trajectory-based representations:
• Only 1 decision per episode: choose the parameter vector w
• Typically w is very high-dimensional (40–100 parameters)
How can we optimize the parameters w?
• Policy gradient methods (Williams, 1992; Peters & Schaal, 2006)
• EM-based methods (Kober & Peters, 2010)
• Inference-based methods (Vlassis et al., 2009; Theodorou et al., 2010)
Inference-based Methods: Policy Search for Changing Situations

In different situations s0,i we have to choose different parameter vectors wi

• Can we generalize between solutions to avoid relearning?
• Learn a hierarchic policy πMP(w|s0; θ) which chooses the parameter vector w according to the situation s0.
• In order to do so we will use approximate inference methods
Outline
Approximate Inference for Policy Search
• Decomposition of the log-likelihood
• Monte-Carlo EM based methods
• Variational Inference based methods
Policy Search for Movement Primitives in changing situations
• 4-Link Balancing
Approximate Inference for Policy Search
Using inference or inference-based methods has proven to be very useful for policy search

• PoWER (Kober & Peters, 2010), Policy Improvement by Path Integrals (Theodorou et al., 2010)
• Reward-Weighted Regression, Cost-Regularized Kernel Regression (Kober et al., 2010)
• Monte Carlo EM Policy Search (Vlassis et al., 2009)
• CMA-ES (Heidrich-Meisner & Igel, 2009)
All these algorithms use the Moment-projection of a certain target distribution to estimate the policy

• As we will see, this can be problematic in many cases (multi-modal solution spaces, complex reward functions, ...)
• Here we will introduce the theory to use the Information-projection and show that this projection alleviates many of these problems
Approximate Inference for Policy Search
Formulating policy search as an inference problem...

• Observed variable:
  • Introduce a reward event p(R = 1|τ), e.g. p(R = 1|τ) ∝ exp(−C(τ))
  • C(τ) ... trajectory costs
• Latent variables: trajectories τ
• Probabilistic model: p(R = 1, τ; θ) = p(R = 1|τ) p(τ; θ)

We want to find parameters θ which maximize the log-marginal likelihood

log p(R; θ) = log ∫τ p(R|τ) p(τ; θ) dτ
Approximate Inference for Policy Search
Policy search can be seen as finding the maximum likelihood (ML) solution of p(R; θ)

p(R; θ) = ∫τ p(R|τ) p(τ; θ) dτ

• Problem: huge trajectory space, the integral is intractable
Decomposition of the log-likelihood
We can decompose the log-likelihood by introducing a variational distribution q(τ) over the latent variable τ:

log p(R; θ) = L(q, θ) + KL(q||pR),

Lower bound L(q, θ):

L(q, θ) = ∫τ q(τ) log p(R, τ; θ) dτ + f1(q) = · · ·
        = ∫τ q(τ) log p(τ; θ) dτ + f2(q)

• Expected complete-data log-likelihood ...
Decomposition of the log-likelihood
We can decompose the log-likelihood by introducing a variational distribution q(τ) over the latent variable τ:

log p(R; θ) = L(q, θ) + KL(q||pR),

Kullback-Leibler divergence KL(q||pR):

KL(q||pR) = −∫τ q(τ) log ( pR(τ) / q(τ) ) dτ

• 'Distance' between the variational distribution q and the conditional distribution of the latent variable p(τ|R; θ)
• pR(τ) = p(τ|R; θ) ∝ p(R|τ) p(τ; θ) ... the reward-weighted model distribution
Decomposition of the log-likelihood
We can now iteratively increase the lower bound L(q, θ) by:
• E-Step:
  • Keep the model parameters θ fixed
  • Minimize the KL-divergence KL(q||pR) w.r.t. q
• M-Step:
  • Keep the variational distribution q fixed
  • Maximize the lower bound L(q, θ) w.r.t. θ
Approximate Inference for Policy Search
Two types of policy search algorithms emerge from this decomposition

• Monte-Carlo EM based Policy Search (Kober et al., 2010; Kober & Peters, 2010; Vlassis et al., 2009)
• Variational Inference Policy Search
Monte-Carlo (MC) EM based Algorithms
MC-EM based algorithms use a sample-based approximation of q in the E-step.

• E-Step minq KL(q||pR): q(i) = pR(i) ∝ p(R|τi) p(τi; θold)
• M-Step maxθ L(q, θ): use q(i) to approximate the lower bound

L(q, θ) ≈ ∑i pR(i) log p(τi; θ) + const
        = −KL(pR||p(τ; θ)) + const

This is the same lower bound as the one used by PoWER and Reward-Weighted Regression.
Monte-Carlo (MC) EM based Algorithms
Iteratively calculate the M(oment)-projection of pR:

minθ KL(pR||p) = −∑i pR(i) log ( p(τi; θ) / pR(i) )

• The model becomes 'reward-attracted':
  • Forces the model p to have high probability in regions with high reward
  • Negatively rewarded samples are neglected!
• Minimization?
  • p can easily be calculated by matching the moments of p with the moments of pR
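The moment-matching update can be sketched as follows for a Gaussian model over parameter vectors. The 2-D parameter space and the quadratic cost are illustrative assumptions, not the tasks from the thesis; since the samples are drawn from p(τ; θold), the weights reduce to p(R|τi).

```python
import numpy as np

# One MC-EM update: M-projection of the reward-weighted distribution p_R
# onto a Gaussian by matching its first two moments.
rng = np.random.default_rng(1)
mu, cov = np.zeros(2), np.eye(2)                   # current Gaussian p(tau; theta_old)
samples = rng.multivariate_normal(mu, cov, size=500)

# Illustrative cost: low near (1.0, -0.5). p_R(i) ∝ exp(-C(tau_i)).
cost = np.sum((samples - np.array([1.0, -0.5]))**2, axis=1)
w = np.exp(-cost)
w /= w.sum()                                       # normalized weights p_R(i)

mu_new = w @ samples                               # match the first moment
diff = samples - mu_new
cov_new = (w[:, None] * diff).T @ diff             # match the second moment
# mu_new has moved toward the low-cost region around (1.0, -0.5)
```

One update pulls the Gaussian toward the high-reward region; iterating the sample/reweight/match loop concentrates it there.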
Variational Inference based Algorithms
For variational inference we use a parametric variational distribution q(τ) = q(τ; θ′)

• E-Step minq KL(q||pR): use a sample-based approximation for the integral in the KL-divergence

KL(q(τ; θ′)||pR) ≈ −∑i q(τi; θ′) log ( pR(i) / q(τi; θ′) )

• M-Step maxθ L(q, θ): if we use the same family of distributions for p(τ; θ) and q(τ; θ′), we can simply set θ to θ′
Variational Inference based Algorithms
Iteratively calculate the I(nformation)-projection of pR:

minθ KL(p||pR) = −∑i p(τi; θ) log ( pR(i) / p(τi; θ) )

• The model becomes 'cost-averse':
  • Tries to avoid including regions with low reward in p(τ; θ)
  • Uses information from negatively and positively rewarded samples
• Minimization?
  • Non-convex optimization problem (computationally much more demanding than the M-projection) ...
  • We use numerical gradient ascent
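To make the mode-seeking behaviour of the I-projection concrete, here is a sketch using numerical gradient descent for a 1-D Gaussian model with fixed variance. The bimodal target stands in for a multi-modal solution space; the modes at ±3, the step size, and the sample count are all illustrative choices.

```python
import numpy as np

def log_pR(x):
    # Unnormalized log target with two modes at -3 and +3 (illustrative).
    return np.logaddexp(-0.5 * (x - 3.0)**2, -0.5 * (x + 3.0)**2)

def log_p(x, mu, sigma=1.0):
    # log N(x | mu, sigma^2)
    return -0.5 * ((x - mu) / sigma)**2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def kl_estimate(mu, eps):
    # Sample-based KL(p(.; mu) || p_R) up to a constant, with
    # reparameterized samples x = mu + eps drawn from the model.
    x = mu + eps
    return np.mean(log_p(x, mu) - log_pR(x))

rng = np.random.default_rng(2)
eps = rng.standard_normal(2000)     # fixed noise: common random numbers
mu, step, h = 0.5, 0.5, 1e-4
for _ in range(200):                # numerical gradient descent on the KL
    grad = (kl_estimate(mu + h, eps) - kl_estimate(mu - h, eps)) / (2.0 * h)
    mu -= step * grad
# mu settles near ONE mode (+3 here, since we started at 0.5) instead of
# averaging the two modes at 0, as the M-projection would.
```

Fixing the noise vector eps across evaluations keeps the finite-difference gradient stable; without it, sampling noise would dominate the tiny central difference.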
Approximate Inference for Policy Search
MC-EM: M-projection based

minθ KL(pR||p) = −∑i pR(i) log ( p(τi; θ) / pR(i) )

Variational inference: I-projection based

minθ KL(p||pR) = −∑i p(τi; θ) log ( pR(i) / p(τi; θ) )

Both algorithms are guaranteed to iteratively increase the lower bound...
I vs M-projection : Illustrative Examples
Let's look at the differences in more detail...
I vs M-projection : Illustrative Examples
We consider 1-step decision problems in continuous state and action spaces

• We typically use a Gaussian distribution as the model distribution

p(s, a; θ) = N( [s; a] | [μs; μa], [Σss Σsa; Σas Σaa] ),

• with θ = {μs, μa, Σss, Σsa, Σaa}
I vs M-projection : Illustrative Examples
2-dimensional action space, no state variables, multimodal target distribution

[Figure: the fitted Gaussian model under the M-projection (top) and the I-projection (bottom).]

• The M-projection averages over all modes
• The I-projection concentrates on one mode
I vs M-projection : Illustrative Examples
We also want to have state variables...
• The policy π(a|s; θ) is obtained by conditioning on the state s.
• Policy π is a linear Gaussian model...
In order to get more complex policies π(a|st; θ) ...
• For each state st, we re-estimate the model p(s, a; θ) locally (using either the M- or the I-projection)
• We clamp µs at st.
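Conditioning the joint Gaussian on a state s gives the linear Gaussian policy in closed form. A minimal sketch with illustrative 1-D numbers (the covariances below are made up, not fitted to any task):

```python
import numpy as np

# Conditioning the joint Gaussian p(s, a; theta) on a state s yields the
# linear Gaussian policy pi(a | s). Dimensions and values are illustrative.
mu_s, mu_a = np.array([0.0]), np.array([1.0])
S_ss = np.array([[1.0]])            # Sigma_ss
S_sa = np.array([[0.6]])            # Sigma_sa (state-action cross-covariance)
S_aa = np.array([[2.0]])            # Sigma_aa

def policy(s):
    K = S_sa.T @ np.linalg.inv(S_ss)        # gain of the linear Gaussian
    mean = mu_a + K @ (s - mu_s)            # E[a | s]
    cov = S_aa - K @ S_sa                   # Cov[a | s] (Schur complement)
    return mean, cov

mean, cov = policy(np.array([2.0]))
# mean = [2.2] (= 1.0 + 0.6 * 2.0), cov = [[1.64]] (= 2.0 - 0.6 * 0.6)
```

The conditional mean is linear in s and the conditional covariance is state-independent, which is exactly why a single joint Gaussian can only represent a linear Gaussian policy and local re-estimation is needed for more complex policies.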
I vs M-projection : Illustrative Examples
• 1-dimensional state and action space
• Complex reward function (dark background indicates negative reward)

• Policy is estimated for 6 different states

[Figure: policies estimated at states s1–s6 under the M-projection (top) and the I-projection (bottom).]

• The M-projection includes areas of low reward in the distribution!
Policy Search for Motion Primitives
Let's apply variational inference to policy search in changing situations

• Movement representation: parametrized velocity profiles
Multi situation setting : How can we learn θ?
• Existing algorithms are all MC-EM based and therefore use the M-projection
• Reward-Weighted Regression (Peters & Schaal, 2007), Cost-Regularized Kernel Regression (Kober et al., 2010)
• Online learning setup: as samples we always use the history of the agent...
Experiments : Cannon-Ball Task
Learn to shoot a cannon-ball at a desired location

• State space s0: desired location, wind force
• Parameter space w: launching angle and velocity of the ball

Comparison of the I- and M-projection:

• CRKR: Cost-Regularized Kernel Regression
• Multi-modal solution space, the I-projection performs best

[Figure: performance over 5000 episodes for the I-projection, the M-projection, and CRKR.]
Experiments : 4-link pendulum balancing
4-link ’Humanoid’ robot has to counterbalance different pushes
• Situations:
  • The robot gets pushed with different forces Fi ∈ [0; 25] Ns at 4 different points of origin
  • 4-dimensional state space
• Movement primitives:
  • Sequence of sigmoidal velocity profiles (39 parameters)...

[Figure: snapshots of the balancing motion at t = 0.10 s, 0.60 s, 1.10 s, 1.60 s, and 2.10 s.]
Experiments : 4-link pendulum balancing
4-link ’Humanoid’ robot has to counterbalance different pushes
• After 60,000 episodes the robot has learned to balance almost every force
• The robot learns completely different balancing strategies
• We could not produce reliable results with the M-projection...
Conclusion
• We can use the M-projection or the I-projection for policy search
• The I-projection also uses the information of bad samples, which are neglected by the M-projection!
• It can therefore be used with ease for multi-modal distributions or non-concave reward functions
• Computationally quite demanding...
  • More efficient methods to calculate the I-projection are needed
• Is there still a big difference for more complex model distributions...?
The end
Thanks for your attention!
References
Bibliography I
Atkeson, Chris G., Moore, Andrew W., & Schaal, Stefan. 1997. Locally Weighted Learning. Artificial Intelligence Review, 11, 11–73.

de Boer, Pieter-Tjerk, Kroese, Dirk, Mannor, Shie, & Rubinstein, Reuven. 2005. A Tutorial on the Cross-Entropy Method. Annals of Operations Research, 134(1), 19–67.

Ernst, D., Geurts, P., & Wehenkel, L. 2005. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6, 503–556.
Bibliography II
Ernst, Damien, Geurts, Pierre, & Wehenkel, Louis. 2003. Iteratively Extending Time Horizon Reinforcement Learning. Pages 96–107 of: European Conference on Machine Learning (ECML).

Heidrich-Meisner, V., & Igel, C. 2009. Neuroevolution Strategies for Episodic Reinforcement Learning. Journal of Algorithms, 64(4), 152–168.
Bibliography III
Ijspeert, Auke Jan, & Schaal, Stefan. 2003. Learning Attractor Landscapes for Learning Motor Primitives. Pages 1523–1530 of: Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.

Kober, J., & Peters, J. 2010. Policy Search for Motor Primitives in Robotics. Machine Learning Journal, online first, 1–33.

Kober, Jens, Oztop, Erhan, & Peters, Jan. 2010. Reinforcement Learning to Adjust Robot Movements to New Situations. In: Proceedings of the 2010 Robotics: Science and Systems Conference (RSS 2010).
Bibliography IV
Kolter, Z., & Ng, A. 2009. Task-Space Trajectories via Cubic Spline Optimization. Pages 2364–2371 of: Proceedings of the 2009 IEEE International Conference on Robotics and Automation (ICRA '09). Piscataway, NJ, USA: IEEE Press.

Neumann, G., & Peters, J. 2009. Fitted Q-Iteration by Advantage Weighted Regression. In: Advances in Neural Information Processing Systems 22 (NIPS 2008). Cambridge, MA: MIT Press.
Bibliography V
Peters, J., & Schaal, S. 2006. Policy Gradient Methods for Robotics. In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS).

Peters, J., & Schaal, S. 2007. Reinforcement Learning by Reward-Weighted Regression for Operational Space Control. In: Proceedings of the International Conference on Machine Learning (ICML).
Bibliography VI
Riedmiller, M. 2005. Neural Fitted Q-Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method. In: Proceedings of the European Conference on Machine Learning (ECML).

Sutton, Richard, Precup, Doina, & Singh, Satinder. 1999. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 112, 181–211.
Bibliography VII
Theodorou, E., Buchli, J., & Schaal, S. 2010. Reinforcement Learning of Motor Skills in High Dimensions: A Path Integral Approach. Pages 2397–2403 of: Robotics and Automation (ICRA), 2010 IEEE International Conference on.

Vlassis, Nikos, Toussaint, Marc, Kontes, Georgios, & Piperidis, Savas. 2009. Learning Model-Free Robot Control by a Monte Carlo EM Algorithm. Autonomous Robots, 27(2), 123–130.
Bibliography VIII
Williams, Ronald J. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8, 229–256.