On Movement Skill Learning and Movement Representations for Robotics
Gerhard Neumann
Graz University of Technology, Institute for Theoretical Computer Science
November 2, 2011
Seminar Talk
Modern Robotic Systems: Motivation...
Many degrees of freedom, compliant actuators, highly dynamic movements...
In principle the advanced morphology of these robots would allow us to perform a wide range of complex movements such as
• Different forms of locomotion (walking, running, trotting)
• Jumping
• Playing tennis...
Classical control methods often fail or are very hard to use for such complex movements.
• More promising approach: let the robot learn the movement from trial and error
• Main topic of this thesis!
Movement Skill Learning for Robotics
Movement Skill Learning can be easily formulated as a Reinforcement Learning problem.
• The agent has to search for a policy which optimizes reward
So why is it challenging?
• High-dimensional continuous state spaces
• High-dimensional continuous action spaces
• Data is expensive: needs to be data efficient
• Needs to be safe
Movement Skill Learning for Robotics
Learning algorithms can be roughly divided into
• Value-based methods
• Policy-search methods
Value-based methods
• Estimate the expected discounted future reward for each state s when following policy π

V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t ]

• Also denoted as the value function of policy π
• Recursive form:

V^π(s) = E[ r(s, a) + γ V^π(s') ]
Value-based methods
+ The value function can be used to assess the quality of each intermediate action of an episode
• E.g. by the use of the Temporal Difference (TD) error:

δ_t = r_t + γ V^π(s_{t+1}) − V^π(s_t)

• Evaluates whether the current step ⟨s_t, a_t, r_t, s_{t+1}⟩ was better or worse than expected
• We can efficiently solve the temporal credit assignment problem
− The value function is very hard to estimate in high-dimensional continuous state and action spaces
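The TD error above fits in a one-line update; a minimal tabular sketch, assuming a dict-based value function and illustrative discount/learning-rate values (not from the thesis):

```python
# Hedged sketch of a TD(0) value update; `V` is a plain dict mapping
# states to value estimates, gamma/alpha are illustrative constants.
def td_update(V, s, r, s_next, gamma=0.95, alpha=0.1):
    """Move V[s] toward the one-step bootstrap target r + gamma * V[s']."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # TD error
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```

A positive δ_t means the step went better than expected, a negative one worse.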
Policy Search Methods
• Rely on a parametric representation of the policy π(a|s; w)
• w ... parameters of the policy
• Directly optimize the policy parameters by performing rollouts on the real system
− We can only assess the quality of a whole trajectory instead of single actions
+ However, as no value function is estimated, this can be done very accurately
• More successful than value-based methods
• Performance strongly depends on the used movement representation
Outline: The thesis is divided into 3 parts...
Value-based Methods
• Graph-Based Reinforcement Learning
• Fitted Q-Iteration by Advantage Weighted Regression
Movement Representations
• Kinematic Synergies
• Motion Templates
• Planning Movement Primitives
Policy Search
• Variational Inference for Policy Search in Changing Situations
Fitted Q-Iteration: Batch-Mode Reinforcement Learning (BMRL)
• Batch-mode RL methods use the whole history H of the agent to update the value or action-value function:

H = { ⟨s_i, a_i, r_i, s'_i⟩ }_{1≤i≤N}

• Advantage: data points are used more efficiently than in online methods
• Fitted Q-Iteration (Ernst et al., 2003) approximates the state-action value function Q(s, a) by iteratively applying supervised regression techniques
• Repeat K times:

Q_{k+1}(i) = r_i + γ V_k(s'_i) = r_i + γ max_{a'} Q_k(s'_i, a')

D_k = { [(s_i, a_i), Q_{k+1}(i)] }_{1≤i≤N},   Q_{k+1} = Regress(D_k)
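The iteration can be sketched compactly; a hedged sketch assuming a finite action set, with a tiny 1-nearest-neighbour regressor standing in for the supervised learner (trees, neural networks, GPs) that FQI can plug in. All names and constants are illustrative:

```python
# Illustrative fitted Q-iteration over (s, a, r, s') transitions.
import numpy as np

class OneNN:
    """Tiny 1-nearest-neighbour regressor standing in for the
    supervised learner that FQI can plug in."""
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y, float)
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        d = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(-1)
        return self.y[d.argmin(axis=1)]

def fitted_q_iteration(transitions, actions, gamma=0.95, K=20):
    """transitions: list of (s, a, r, s_next); actions: finite action set."""
    S = np.array([list(s) + [a] for s, a, _, _ in transitions], float)
    q = None
    for _ in range(K):
        targets = []
        for s, a, r, s_next in transitions:
            # v_next approximates max_a' Q_k(s', a') over the action set
            v_next = 0.0 if q is None else max(
                q.predict([list(s_next) + [a2]])[0] for a2 in actions)
            targets.append(r + gamma * v_next)
        q = OneNN().fit(S, targets)   # Q_{k+1} = Regress(D_k)
    return q
```

The max over a finite action set is exactly the step that becomes problematic for continuous actions, which the slides address next.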
+ FQI has been shown to outperform classical online RL methods in many applications (Ernst et al., 2005).
+ Any type of supervised learning method can be used, e.g. neural networks (Riedmiller, 2005), regression trees (Ernst et al., 2005), Gaussian Processes.
− High computational demands...
FQI for Robotics...
Continuous state spaces: ✓
• Any type of supervised learning method can be used, e.g. neural networks, regression trees, Gaussian Processes
Continuous action spaces:
• We have to solve

Q_{k+1}(i) = r_i + γ max_{a'} Q_k(s'_i, a')

− Hm... how do we perform the max_{a'} operator in continuous action spaces?
• Discretizations become prohibitively expensive in high-dimensional spaces
• We have to solve an optimization problem for each sample! E.g. use Cross-Entropy optimization for each data point s'_i
• We show that an advantage-weighted regression can be used to approximate max_a Q(s, a).
• The regression uses the states s_i as input values and Q(s_i, a_i) as target values.
• The weighting w_i = exp(τ A(s_i, a_i)) of each data point is based on the advantage function A(s, a) = Q(s, a) − V(s).
FQI for Robotics...
What is a weighted regression?
• Minimize the error function w.r.t. θ:

E = Σ_{i=1}^N w_i ( V(s_i; θ) − Q(s_i, a_i) )²

• w_i ... each data point gets an individual weighting
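A weighted regression of this form has a closed-form solution; a minimal sketch, assuming linear features [1, s^T] as on the slides (variable names are ours):

```python
# Hedged sketch of weighted linear least squares:
# minimize sum_i w_i (phi(s_i)^T theta - y_i)^2 in closed form.
import numpy as np

def weighted_regression(S, y, w):
    """S: (N, d) states, y: (N,) targets, w: (N,) nonnegative weights."""
    S, y, w = (np.asarray(x, float) for x in (S, y, w))
    Phi = np.hstack([np.ones((len(S), 1)), S])   # features [1, s^T]
    W = np.diag(w)
    # theta = (Phi^T W Phi)^{-1} Phi^T W y
    return np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ y)
```

Setting a weight to zero removes that data point from the fit, which is exactly the lever the advantage weighting will use.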
FQI for Robotics...
We prove this by applying the following 2 steps:
• Weighted regression for value estimation
• Soft-greedy policy improvement
Weighted regression for value estimation
• The value function of a stochastic policy π is given by

V^π(s) = ∫_a π(a|s) Q(s, a) da

• We show that this can be approximated without evaluating the integral by solving a weighted regression problem:

D_V = { ⟨s_i, Q(s_i, a_i)⟩ },   U = { π(a_i|s_i) },   V = WeightedReg(D_V, U)
Proof
We want to find an approximation V(s) of V^π(s) by minimizing the error function

Error(V) = ∫_s μ(s) ( ∫_a π(a|s) Q(s, a) da − V(s) )² ds
         = ∫_s μ(s) ( ∫_a π(a|s) ( Q(s, a) − V(s) ) da )² ds,

where the second equality uses ∫_a π(a|s) da = 1.

• μ(s) ... state distribution when following policy π(·|s).
An upper bound of Error(V) is given by

Error_B(V) = ∫_s μ(s) ∫_a π(a|s) ( Q(s, a) − V(s) )² da ds ≥ Error(V).

• Follows from Jensen's inequality
It is easy to show that both error functions have the same minimum for V.
• The upper bound Error_B can be approximated straightforwardly from samples {(s_i, a_i), Q(s_i, a_i)}_{1≤i≤N}:

Error_B(V) ≈ Σ_{i=1}^N π(a_i|s_i) ( Q(s_i, a_i) − V(s_i) )²   (1)

• No integral over the action space is needed!
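The sample approximation (1) can be checked numerically; a tiny sketch with one state, two actions, and invented numbers, using a constant model V(s) = θ so that the weighted least-squares minimizer is the π-weighted mean:

```python
# Numeric check of Eq. (1) at a single state; all values are invented.
import numpy as np

Q = np.array([2.0, 0.0])    # Q(s, a1), Q(s, a2) at one state s
pi = np.array([0.7, 0.3])   # pi(a_i | s), used as regression weights
# With a constant model V(s) = theta, minimizing Eq. (1) gives the
# pi-weighted mean, which equals the exact V^pi(s) = sum_a pi(a|s) Q(s, a)
theta = (pi * Q).sum() / pi.sum()
```

Here θ recovers V^π(s) without ever integrating over the action space.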
Soft-greedy policy improvement
The optimal value function V(s) = max_a Q(s, a) can be approximated without evaluating max_a Q(s, a) by solving an advantage-weighted regression problem:

D_V = { ⟨s_i, Q(s_i, a_i)⟩ },   U* = { exp(τ Ā(s_i, a_i)) },   (2)
V = WeightedReg(D_V, U*)   (3)

• τ ... greediness parameter of the algorithm
• Ā(s, a) ... normalized advantage function
Proof
We approximate the value function V^{π1} of a soft-max policy π1 by the use of weighted regression.
• Since a soft-max policy is an approximation of the greedy policy, we can replace V(s) = max_a Q(s, a) with V^{π1}(s).
The used soft-max policy π1(a|s) is based on the advantage function A(s, a) = Q(s, a) − V(s):

π1(a|s) = exp(τ Ā(s, a)) / ∫_a exp(τ Ā(s, a)) da,   Ā(s, a) = (A(s, a) − m_A(s)) / σ_A(s).

• If we assume that the advantages Ā(s, a) are normally distributed, the denominator of π1 is constant.
• Thus we can use exp(τ Ā(s, a)) ∝ π1(a|s) directly as the weighting for the regression.
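The resulting weighting can be sketched directly; a hedged sketch that normalizes advantages, exponentiates them, and uses them as weights (the batch layout and the τ value are our assumptions, not the thesis'):

```python
# Illustrative advantage weighting for samples sharing a state s.
import numpy as np

def advantage_weights(Q_sa, V_s, tau=1.0):
    """Exponentiated normalized advantages u_i = exp(tau * A_bar_i)."""
    A = np.asarray(Q_sa, float) - V_s           # advantage A(s, a)
    A_bar = (A - A.mean()) / (A.std() + 1e-8)   # normalized advantage
    return np.exp(tau * A_bar)

Q_sa = np.array([1.0, 2.0, 3.0])
u = advantage_weights(Q_sa, V_s=2.0)
# The advantage-weighted mean leans toward max_a Q(s, a) = 3
V_soft = (u * Q_sa).sum() / u.sum()
```

Larger τ makes the weighting greedier, pushing the weighted estimate closer to the maximum Q-value.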
Concrete algorithm: LAWER
The Locally-Advantage WEighted Regression (LAWER) algorithm implements the presented theoretical results.
• It combines Locally Weighted Regression (LWR, Atkeson et al., 1997) and advantage-weighted regression.
• The locality weighting w_i(s) and the advantage weighting u_i = exp(τ Ā(s_i, a_i)) can be multiplicatively combined.
• The value function is then given by a simple weighted linear regression:

V_{k+1}(s) = s̄ (S^T U S)^{-1} S^T U Q_{k+1}

• s̄ = [1, s^T], S = [s̄_1, s̄_2, ..., s̄_N]^T ... state matrix
• U = diag(w_i(s) u_i)
• In order to approximate V(s) = max_a Q_k(s, a), only the Q-values of neighboring state-action pairs are needed.
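The combined weighting can be sketched as one weighted linear regression at a query state; a hedged sketch where Gaussian locality weights stand in for the locality kernel, and the bandwidth and τ are illustrative choices of ours:

```python
# Illustrative LAWER-style value estimate at a query state.
import numpy as np

def lawer_value(s_query, S, Q, A_bar, tau=1.0, bandwidth=0.5):
    """Locality- and advantage-weighted linear regression for V(s_query)."""
    S, Q = np.asarray(S, float), np.asarray(Q, float)
    # Gaussian locality weights w_i(s) around the query state
    w_loc = np.exp(-((S - s_query) ** 2).sum(axis=1) / (2 * bandwidth ** 2))
    # advantage weights u_i = exp(tau * A_bar_i)
    u_adv = np.exp(tau * np.asarray(A_bar, float))
    U = np.diag(w_loc * u_adv)                   # combined weighting
    Phi = np.hstack([np.ones((len(S), 1)), S])   # rows [1, s_i^T]
    theta = np.linalg.solve(Phi.T @ U @ Phi, Phi.T @ U @ Q)
    return np.concatenate([[1.0], np.atleast_1d(s_query)]) @ theta
```

With all normalized advantages set to zero this reduces to plain locally weighted regression, which is the LWR half of the combination.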
Approximation of the policy
For unseen states we need to approximate the soft-max policy:
• Gaussian policy π(a|s) = N(a | μ(s), σ²)
• For estimating this policy we use reward-weighted regression (Peters & Schaal, 2007); only the advantage is used instead of the reward for the weighting.
• Thus we optimize the long-term reward instead of the immediate reward.
Results
• We use the Cross-Entropy (CE) optimization method (de Boer et al., 2005) as comparison to find the maximum Q-values max_a Q(s, a).
• We compare the LAWER algorithm to 3 different state-of-the-art CE-based fitted Q-iteration algorithms:
• Tree-based FQI (Ernst et al., 2005) (CE-Tree)
• Neural FQI (Riedmiller, 2005) (CE-Net)
• LWR-based FQI (CE-LWR)
• After each FQI cycle new data was collected.
• The immediate reward function was quadratic in the distance to the goal position x_G and in the applied torque/force.
Pendulum swing-up task
• A pendulum needs to be swung up from the position at the bottom to the top position (Riedmiller, 2005).
• 2 experiments with different torque punishment factors (c2) were carried out.
[Figure: Average reward vs. number of data collections for LAWER, CE-Tree, CE-LWR, and CE-Net; (a) c2 = 0.005, (b) c2 = 0.025]
Comparison of torque trajectories
[Figure: Torque trajectories u [N] over time for LAWER, CE-Tree, and CE-LWR; (c) c2 = 0.005, (d) c2 = 0.025]
Dynamic puddle-world
• The agent has to navigate from a start position to a goal position; it gets negative reward when going through puddles.
• Dynamic version of the puddle-world: the agent can set a force accelerating a k-dimensional point mass.
• This was done for k = 2 and k = 3 dimensions.
[Figure: Puddle-world layout with start and goal positions]
Comparison of the algorithms
[Figure: Average reward vs. number of data collections for LAWER and CE-Tree; (e) 2-D, (f) 3-D]
• The CE-Tree method learns faster, but does not manage to learn high-quality policies for the 3-D setting.
• LAWER also works for high-dimensional action spaces.
Comparison of torque trajectories
[Figure: Torque trajectories u1, u2, u3 over time; (g) LAWER, (h) CE-Tree]
Conclusion
• We have proven that the greedy operator max_a Q(s, a) can be approximated efficiently by an advantage-weighted regression.
• The resulting algorithm runs an order of magnitude faster than competing algorithms.
• In spite of the resulting soft-greedy policy improvement, our algorithm was able to produce policies of higher quality.
• The Locally-Advantage Weighted Regression algorithm allows us to use fitted Q-iteration even for high-dimensional continuous action spaces.
Movement Representations for Motor Skill Learning
Directly optimize a parametric movement representation:
• No value estimation is needed
• What is a good representation for learning a movement?
Episodic tasks:
• Often it is sufficient to formulate the learning task in the episodic RL setup
• Single initial state, specified fixed duration of the movement
• Direct policy search can be applied easily in this setup
Episodic setup: use a trajectory-based representation
• We learn a parametric representation of the desired trajectory [q_d(t; w), q̇_d(t; w)]
• t ... time within the movement; no direct dependence on the high-dimensional state
• t is now a scalar, which significantly simplifies the learning problem
• Can only be used in the episodic setup (single start state)
• This trajectory is then followed by using feedback control laws
• Most common movement representations are trajectory-based...
• Dynamic Movement Primitives (Ijspeert & Schaal, 2003), splines (Kolter & Ng, 2009), ...
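A trajectory-based representation can be sketched as a weighted sum of basis functions over time, in the spirit of the spline/DMP representations mentioned above; the Gaussian basis choice and all names are our own:

```python
# Illustrative parametric desired trajectory q_d(t; w) built from
# normalized Gaussian bases over time; w are the learnable parameters.
import numpy as np

def desired_trajectory(t, w, centers, width=0.1):
    """q_d(t; w) = sum_j w_j * phi_j(t) with normalized Gaussian bases."""
    t = np.atleast_1d(np.asarray(t, float))[:, None]
    phi = np.exp(-(t - centers[None, :]) ** 2 / (2 * width ** 2))
    phi = phi / phi.sum(axis=1, keepdims=True)   # normalize the bases
    return phi @ w

centers = np.linspace(0.0, 1.0, 5)
w = np.array([0.0, 0.2, 1.0, 0.2, 0.0])   # parameters to be learned
q = desired_trajectory(np.linspace(0.0, 1.0, 50), w, centers)
```

Policy search then operates only on w, while a feedback controller tracks the resulting q_d(t; w).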
Trajectory-Based vs. Value-Based Motor Skill Learning
Trajectory-Based:
• Can be seen as a single-step decision task
• The agent chooses the parameters w as the action of a single, temporally extended step
• Only one step per episode...
Value-Based:
• One decision per time step of the agent
• The agent chooses the torque u as the action of a single, very short time step
• Up to a few hundred steps per episode...
Can we find a more intuitive solution in which the agent chooses new actions only at certain, characteristic time points of the movement?
• Temporal abstraction: sequencing of temporally extended actions, also called Motion Templates
Temporal Abstractions for Motor-Skill Learning
Example: drawing a triangle with a pen
Flat setup:
• We have to make many unessential decisions
Abstracted setup:
• The movement can be easily decomposed into 3 elemental motions
Temporal Abstractions for Motor-Skill Learning
Standard framework for temporally extended actions: Options (Sutton et al., 1999)
• Options are closed-loop policies taking actions over a period of time
• However, they are mainly used in discrete environments.
• In many applications options are discrete temporally extended actions
• E.g. "Go to another room", "Follow the hallway" or "Frighten the poor monkey"
• For motor tasks, useful options are often difficult to specify.
Temporal Abstractions for Motor-Skill Learning: Illustration
• Pendulum swing-up task:
• Standard RL benchmark task
• Learn how to swing up and balance an inverted pendulum from the bottom position
• We additionally want to minimize the energy consumption
• Flat RL: choose a new action every 50 ms
Pendulum Swing-Up: Illustration
How can we decompose the trajectory into options?
[Figure: Torque [N] over time [s], with positive peaks, negative peaks, and a balancing motion marked]
• We have positive and negative peaks in the torque trajectory ...
• ... followed by a final balancing motion.
Specify the exact form of the peaks and the balancing motion for the options?
• Requires a lot of prior knowledge...
• The learning task becomes trivial...
However: we can specify the functional form of the options
• Use parameterized options...
Motion Templates
• Motion templates: parameterized options
• Used as our building blocks of motion.
• A motion template m_p is defined by:
• Its k_p-dimensional parameter space Θ_p
• Its parameterized policy u_p(s, t; θ_p)
• Its termination condition c_p(s, t; θ_p)
• s ... state, t ... execution time, θ_p ∈ Θ_p ... parameters
• The functional form of u_p and c_p is chosen by the designer; the parameters θ_p are learned by the agent
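The interface just described can be sketched as a small data structure; the field names and the example peak template below are our own illustrations, not the thesis' definitions:

```python
# Hedged sketch of the motion-template interface: a parameterized
# policy u_p(s, t; theta_p) plus a termination condition c_p(s, t; theta_p).
from dataclasses import dataclass
from typing import Callable

@dataclass
class MotionTemplate:
    policy: Callable      # u_p(s, t, theta) -> control action
    terminates: Callable  # c_p(s, t, theta) -> bool

# Hypothetical example: a torque ramp depending only on execution time,
# with parameters theta = (height a, duration d)
peak = MotionTemplate(
    policy=lambda s, t, theta: theta[0] * min(t / theta[1], 1.0),
    terminates=lambda s, t, theta: t >= theta[1],
)
```

The designer fixes the functional form; the agent only searches over θ.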
• At each decision time step σ_k the agent has to choose:
• Which motion template m_p ∈ A(σ_k) to use.
• A(σ_k) ... set of available motion templates at decision time step σ_k
• Which parameterization θ_p ∈ Θ_p of m_p to use.
• Subsequently the policy u_p is executed until the termination condition c_p is fulfilled.
• Continuous time: the duration of the templates can be continuous-valued
• The agent has to learn the correct sequence and parameterization of the motion templates
Pendulum Swing-Up: Decomposition into Motion Templates
How can we decompose the trajectory into motion templates?
[Figure: Torque trajectory with positive peaks, negative peaks, and a balancing motion marked]
Pendulum Swing-Up: Templates to model the peaks
We use 2 templates per peak:
• One for the ascending part: m1 ...
• ... and one for the descending part: m2
• Both just depend on the execution time of the template.
[Figure: Ascending-part template m1 and descending-part template m2, torque [N] over time [s], shown for different values of the template parameters]
Parameters:
• a_i ... height of the template
• o_i ... curvature of the template
• d_i ... duration of the template
• We fix the height of the descending peak template m2 to be the height of m1.
• m3 and m4 are the same templates, just for negative peaks.
[Figure: Swing-up torque trajectory decomposed into the peak templates m1–m4 and the balancing motion]
• The balancing template is implemented as a PD controller:

MT   Functional form    Parameters
m5   −k1 θ − k2 θ̇      k1, k2

• k1 and k2 are the PD controller gains.
• m5 always runs for 20 s; subsequently the episode is terminated.
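The balancing template fits in one line of code; the gains follow the slide's functional form, while the state convention (θ, θ̇ measured from the upright position) is our assumption:

```python
# Minimal sketch of the balancing template m5 as a PD controller.
def balancing_torque(theta, theta_dot, k1, k2):
    """m5: u = -k1*theta - k2*theta_dot, angle measured from upright."""
    return -k1 * theta - k2 * theta_dot
```

The agent only learns the two gains k1 and k2, not the control law itself.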
Pendulum Swing-Up: Constructing the motion
• The agent can either choose the peak templates in the predefined order (m2, m3, m4, m1, m2, ...)
• ... or it can use the balancing template m5 as the final template.
• Thus the agent has to learn the correct number of swing-ups and the correct parameterization of the swing-ups.
[Figure: Torque trajectory decomposed into the template sequence m2, m3, m4, m1, ..., m5]
• Flat: approximately 50 decisions/parameters are needed to reach the top position
• Motion templates: the whole motion consists of only 5 decisions / 13 parameters
Pendulum Swing-Up: Accuracy of the policy
• Motion templates decrease the number of necessary decisions significantly
• The overall learning task is simplified
• Ok... where is the catch?
• A single decision now has much more influence on the outcome of the whole motion.
• Therefore a single decision has to be made much more precisely than in flat RL.
Algorithm for Motion Template Learning
• An RL algorithm is needed which can learn very precise continuous-valued policies!
• For each template m_p, we use an extension of the Locally Advantage WEighted Regression (LAWER, Neumann & Peters, 2009) algorithm to learn the policy π_p(θ_p|s) for selecting the parameters of m_p.
Extensions of LAWER: LAWER for Motion Template Learning
Due to the increased precision requirements of motion template learning, we had to develop 2 substantial extensions of LAWER:
• Adaptive tree-based kernels
• An additional optimization to improve the approximation of V(s) = max_a Q(s, a)
Extensions of LAWER: Adaptive Tree-Based Kernels
The use of a uniform weighting kernel is often problematic in the case of ...
• High-dimensional input spaces ('curse of dimensionality')
• Spatially varying data densities
• Spatially varying curvatures of the regression surface
This problem can be alleviated by varying the 'shape' of the weighting kernel.
• We do this by the use of randomized regression trees...
Extensions of LAWER: Improved approximation of V(s) = max_a Q(s, a)
• In order to estimate the weightings u_i, the original LAWER needed the assumption of normally distributed advantage values.
• Often this assumption does not hold and the estimate of u_i gets imprecise.
• We improve the estimate of the u_i by an additional optimization...
Experiments
Minimum-time problems with additional energy-consumption constraints (c2):
• Pendulum swing-up
• 2-link pendulum swing-up
• 2-link pendulum balancing
Iterative learning protocol:
• We collect L episodes with the currently estimated exploration policy
• Subsequently the optimal policy is re-estimated ...
• ... and the performance (summed reward) of the optimal policy (without exploration) is evaluated.
Experiments: Pendulum Swing-Up
Comparison of learning progress for different energy punishment factors (L = 50)
Figure: Learning curves (average reward vs. number of data collections) for the Gaussian kernel (MT Gauss), the tree-based kernel (MT Tree), and flat RL, for (left) c2 = 0.025 and (right) c2 = 0.075
Comparison of the flat and the motion template policy
Figure: (a) Torque trajectories and motion templates learned for different energy punishment factors c2. (b) Torque trajectories learned with flat RL.
Performance for c2 = 0.075:
• flat RL −48.6, motion templates −38.5
Experiments: 2-Link Pendulum Swing-Up
Same templates as for the 1-dimensional task:
• The peak templates now have 2 additional parameters, the height and the curvature for the second control dimension u2.
• The parameters of the balancer template m5 consist of two 2 × 2 matrices for the controller gains.
Figure: Comparison of motion template learning with tree-based kernels and flat RL (average reward vs. number of data collections)
Learned motion template policy
Figure: Left: torque trajectories and their decomposition into the motion templates. Right: illustration of the motion; the bold postures represent the switching time points of the motion templates.
Conclusions
• We have shown that by the use of motion templates, i.e. parameterized options, many motor tasks can be decomposed into elemental movements.
• Motion templates are the first movement representation which can be sequenced in time.
• While the whole motion consists of fewer decisions, a single decision has to be made more precisely.
• We propose a new algorithm for motion template learning which can cope with the precision requirements.
• We have shown that learning with motion templates can produce policies of higher quality than flat RL and could even be applied to tasks where flat RL was not successful.
Policy Search for trajectory-based representations
Back to trajectory-based representations:
• Only 1 decision per episode: choose the parameter vector w
• Typically w is very high-dimensional (40–100 parameters)
How can we optimize the parameters w?
• Policy gradient methods (Williams, 1992; Peters & Schaal, 2006)
• EM-based methods (Kober & Peters, 2010)
• Inference-based methods (Vlassis et al., 2009; Theodorou et al., 2010)
Inference-based Methods: Policy Search for Changing Situations

In different situations s0,i we have to choose different parameter vectors wi

• Can we generalize between solutions to avoid relearning?
• Learn a hierarchic policy πMP(w|s0; θ) which chooses the parameter vector w according to the situation s0.
• In order to do so we will use approximate inference methods
Outline
Approximate Inference for Policy Search
• Decomposition of the log-likelihood
• Monte-Carlo EM based methods
• Variational Inference based methods
Policy Search for Movement Primitives in changing situations
• 4-Link Balancing
Approximate Inference for Policy Search
Using inference or inference-based methods has proven to be very useful for policy search

• PoWER (Kober & Peters, 2010), Policy Improvement by Path Integrals (Theodorou et al., 2010)
• Reward-Weighted Regression, Cost-Regularized Kernel Regression (Kober et al., 2010)
• Monte Carlo EM Policy Search (Vlassis et al., 2009)
• CMA-ES (Heidrich-Meisner & Igel, 2009)
All these algorithms use the Moment-projection of a certain target distribution to estimate the policy

• As we will see, this can be problematic in many cases (multi-modal solution spaces, complex reward functions, ...)
• Here we will introduce the theory to use the Information-projection and show that this projection alleviates many of these problems
Approximate Inference for Policy Search
Formulating policy search as an inference problem...

• Observed variable:
  • Introduce a reward event p(R = 1|τ), e.g. p(R = 1|τ) ∝ exp(−C(τ))
  • C(τ) ... trajectory costs
• Latent variables: trajectories τ
• Probabilistic model: p(R = 1, τ; θ) = p(R = 1|τ) p(τ; θ)

We want to find parameters θ which maximize the log-marginal likelihood

log p(R; θ) = log ∫τ p(R|τ) p(τ; θ) dτ
Approximate Inference for Policy Search
Policy search can be seen as finding the maximum likelihood (ML) solution of p(R; θ)

p(R; θ) = ∫τ p(R|τ) p(τ; θ) dτ

• Problem: huge trajectory space, the integral is intractable
Decomposition of the log-likelihood
We can decompose the log-likelihood by introducing a variational distribution q(τ) over the latent variable τ:

log p(R; θ) = L(q, θ) + KL(q||pR),

Lower bound L(q, θ):

L(q, θ) = ∫τ q(τ) log p(R, τ; θ) dτ + f1(q) = · · ·
        = ∫τ q(τ) log p(τ; θ) dτ + f2(q)

• Expected complete-data log-likelihood ...
Decomposition of the log-likelihood
We can decompose the log-likelihood by introducing a variational distribution q(τ) over the latent variable τ:

log p(R; θ) = L(q, θ) + KL(q||pR),

Kullback-Leibler divergence KL(q||pR):

KL(q||pR) = −∫τ q(τ) log ( pR(τ) / q(τ) ) dτ

• 'Distance' between the variational distribution q and the conditional distribution of the latent variable p(τ|R; θ)
• pR(τ) = p(τ|R; θ) ∝ p(R|τ) p(τ; θ) ... the reward-weighted model distribution
Decomposition of the log-likelihood
We can now iteratively increase the lower bound L(q, θ) by:
• E-Step:
  • Keep the model parameters θ fixed
  • Minimize the KL-divergence KL(q||pR) w.r.t. q
• M-Step:
  • Keep the variational distribution q fixed
  • Maximize the lower bound L(q, θ) w.r.t. θ
Approximate Inference for Policy Search
Two types of policy search algorithms emerge from this decomposition

• Monte-Carlo EM based Policy Search (Kober et al., 2010; Kober & Peters, 2010; Vlassis et al., 2009)
• Variational Inference Policy Search
Monte-Carlo (MC) EM based Algorithms
MC-EM based algorithms use a sample-based approximation of q in the E-step.

• E-Step minq KL(q||pR): q(i) = pR(i) ∝ p(R|τi) p(τi; θold)
• M-Step maxθ L(q, θ): use q(i) to approximate the lower bound

L(q, θ) ≈ ∑i pR(i) log p(τi; θ) + const
        = −KL(pR||p(τ; θ)) + const

This is the same lower bound as the one used by PoWER and Reward-Weighted Regression.
Monte-Carlo (MC) EM based Algorithms
Iteratively calculate the M(oment)-projection of pR:

minθ KL(pR||p) = −∑i pR(i) log ( p(τi; θ) / pR(i) )

• The model becomes 'reward-attracted':
  • Forces the model p to have high probability in regions with high reward
  • Negatively rewarded samples are neglected!
• Minimization?
  • p can easily be calculated by matching the moments of p with the moments of pR
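The moment-matching update can be sketched as follows for a Gaussian model over parameter vectors. The 2-D parameter space and the quadratic cost are illustrative assumptions, not the tasks from the thesis; since the samples are drawn from p(τ; θold), the weights reduce to p(R|τi).

```python
import numpy as np

# One MC-EM update: M-projection of the reward-weighted distribution p_R
# onto a Gaussian by matching its first two moments.
rng = np.random.default_rng(1)
mu, cov = np.zeros(2), np.eye(2)                   # current Gaussian p(tau; theta_old)
samples = rng.multivariate_normal(mu, cov, size=500)

# Illustrative cost: low near (1.0, -0.5). p_R(i) ∝ exp(-C(tau_i)).
cost = np.sum((samples - np.array([1.0, -0.5]))**2, axis=1)
w = np.exp(-cost)
w /= w.sum()                                       # normalized weights p_R(i)

mu_new = w @ samples                               # match the first moment
diff = samples - mu_new
cov_new = (w[:, None] * diff).T @ diff             # match the second moment
# mu_new has moved toward the low-cost region around (1.0, -0.5)
```

One update pulls the Gaussian toward the high-reward region; iterating the sample/reweight/match loop concentrates it there.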
Variational Inference based Algorithms
For variational inference we use a parametric variational distribution q(τ) = q(τ; θ′)

• E-Step minq KL(q||pR): use a sample-based approximation for the integral in the KL-divergence

KL(q(τ; θ′)||pR) ≈ −∑i q(τi; θ′) log ( pR(i) / q(τi; θ′) )

• M-Step maxθ L(q, θ): if we use the same family of distributions for p(τ; θ) and q(τ; θ′), we can simply set θ to θ′
Variational Inference based Algorithms
Iteratively calculate the I(nformation)-projection of pR:

minθ KL(p||pR) = −∑i p(τi; θ) log ( pR(i) / p(τi; θ) )

• The model becomes 'cost-averse':
  • Tries to avoid including regions with low reward in p(τ; θ)
  • Uses information from negatively and positively rewarded samples
• Minimization?
  • Non-convex optimization problem (computationally much more demanding than the M-projection) ...
  • We use numerical gradient ascent
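To make the mode-seeking behaviour of the I-projection concrete, here is a sketch using numerical gradient descent for a 1-D Gaussian model with fixed variance. The bimodal target stands in for a multi-modal solution space; the modes at ±3, the step size, and the sample count are all illustrative choices.

```python
import numpy as np

def log_pR(x):
    # Unnormalized log target with two modes at -3 and +3 (illustrative).
    return np.logaddexp(-0.5 * (x - 3.0)**2, -0.5 * (x + 3.0)**2)

def log_p(x, mu, sigma=1.0):
    # log N(x | mu, sigma^2)
    return -0.5 * ((x - mu) / sigma)**2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def kl_estimate(mu, eps):
    # Sample-based KL(p(.; mu) || p_R) up to a constant, with
    # reparameterized samples x = mu + eps drawn from the model.
    x = mu + eps
    return np.mean(log_p(x, mu) - log_pR(x))

rng = np.random.default_rng(2)
eps = rng.standard_normal(2000)     # fixed noise: common random numbers
mu, step, h = 0.5, 0.5, 1e-4
for _ in range(200):                # numerical gradient descent on the KL
    grad = (kl_estimate(mu + h, eps) - kl_estimate(mu - h, eps)) / (2.0 * h)
    mu -= step * grad
# mu settles near ONE mode (+3 here, since we started at 0.5) instead of
# averaging the two modes at 0, as the M-projection would.
```

Fixing the noise vector eps across evaluations keeps the finite-difference gradient stable; without it, sampling noise would dominate the tiny central difference.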
Approximate Inference for Policy Search
MC-EM: M-projection based

minθ KL(pR||p) = −∑i pR(i) log ( p(τi; θ) / pR(i) )

Variational inference: I-projection based

minθ KL(p||pR) = −∑i p(τi; θ) log ( pR(i) / p(τi; θ) )

Both algorithms are guaranteed to iteratively increase the lower bound...
I vs M-projection : Illustrative Examples
Let's look at the differences in more detail...
I vs M-projection : Illustrative Examples
We consider 1-step decision problems in continuous state and action spaces

• We typically use a Gaussian distribution as the model distribution

p(s, a; θ) = N( [s; a] | [μs; μa], [Σss Σsa; Σas Σaa] ),

• with θ = {μs, μa, Σss, Σsa, Σaa}
I vs M-projection : Illustrative Examples
2-dimensional action space, no state variables, multimodal target distribution

[Figure: the fitted Gaussian model under the M-projection (top) and the I-projection (bottom).]

• The M-projection averages over all modes
• The I-projection concentrates on one mode
I vs M-projection : Illustrative Examples
We also want to have state variables...
• The policy π(a|s; θ) is obtained by conditioning on the state s.
• Policy π is a linear Gaussian model...
In order to get more complex policies π(a|st; θ) ...
• For each state st, we re-estimate the model p(s, a; θ) locally (using either the M- or the I-projection)
• We clamp µs at st.
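Conditioning the joint Gaussian on a state s gives the linear Gaussian policy in closed form. A minimal sketch with illustrative 1-D numbers (the covariances below are made up, not fitted to any task):

```python
import numpy as np

# Conditioning the joint Gaussian p(s, a; theta) on a state s yields the
# linear Gaussian policy pi(a | s). Dimensions and values are illustrative.
mu_s, mu_a = np.array([0.0]), np.array([1.0])
S_ss = np.array([[1.0]])            # Sigma_ss
S_sa = np.array([[0.6]])            # Sigma_sa (state-action cross-covariance)
S_aa = np.array([[2.0]])            # Sigma_aa

def policy(s):
    K = S_sa.T @ np.linalg.inv(S_ss)        # gain of the linear Gaussian
    mean = mu_a + K @ (s - mu_s)            # E[a | s]
    cov = S_aa - K @ S_sa                   # Cov[a | s] (Schur complement)
    return mean, cov

mean, cov = policy(np.array([2.0]))
# mean = [2.2] (= 1.0 + 0.6 * 2.0), cov = [[1.64]] (= 2.0 - 0.6 * 0.6)
```

The conditional mean is linear in s and the conditional covariance is state-independent, which is exactly why a single joint Gaussian can only represent a linear Gaussian policy and local re-estimation is needed for more complex policies.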
I vs M-projection : Illustrative Examples
• 1-dimensional state and action space
• Complex reward function (dark background indicates negative reward)

• Policy is estimated for 6 different states

[Figure: policies estimated at states s1–s6 under the M-projection (top) and the I-projection (bottom).]

• The M-projection includes areas of low reward in the distribution!
Policy Search for Motion Primitives
Let's apply variational inference to policy search in changing situations

• Movement representation: parametrized velocity profiles
Multi situation setting : How can we learn θ?
• Existing algorithms are all MC-EM based and therefore use the M-projection
• Reward-Weighted Regression (Peters & Schaal, 2007), Cost-Regularized Kernel Regression (Kober et al., 2010)
• Online learning setup: as samples we always use the history of the agent...
Experiments : Cannon-Ball Task
Learn to shoot a cannon-ball at a desired location

• State space s0: desired location, wind force
• Parameter space w: launching angle and velocity of the ball

Comparison of the I- and M-projection:

• CRKR: Cost-Regularized Kernel Regression
• Multi-modal solution space, the I-projection performs best

[Figure: performance over 5000 episodes for the I-projection, the M-projection, and CRKR.]
Experiments : 4-link pendulum balancing
4-link ’Humanoid’ robot has to counterbalance different pushes
• Situations:
  • The robot gets pushed with different forces Fi ∈ [0; 25] Ns at 4 different points of origin
  • 4-dimensional state space
• Movement primitives:
  • Sequence of sigmoidal velocity profiles (39 parameters)...

[Figure: snapshots of the balancing motion at t = 0.10 s, 0.60 s, 1.10 s, 1.60 s, and 2.10 s.]
Experiments : 4-link pendulum balancing
4-link ’Humanoid’ robot has to counterbalance different pushes
• After 60,000 episodes the robot has learned to balance almost every force
• The robot learns completely different balancing strategies
• We could not produce reliable results with the M-projection...
Conclusion
• We can use the M-projection or the I-projection for policy search
• The I-projection also uses the information of bad samples, which are neglected by the M-projection!
• It can therefore be used with ease for multi-modal distributions or non-concave reward functions
• Computationally quite demanding...
  • More efficient methods to calculate the I-projection are needed
• Is there still a big difference for more complex model distributions...?
The end
Thanks for your attention!
References
Bibliography I
Atkeson, Chris G., Moore, Andrew W., & Schaal, Stefan. 1997. Locally Weighted Learning. Artificial Intelligence Review, 11, 11–73.

de Boer, Pieter-Tjerk, Kroese, Dirk, Mannor, Shie, & Rubinstein, Reuven. 2005. A Tutorial on the Cross-Entropy Method. Annals of Operations Research, 134(1), 19–67.

Ernst, D., Geurts, P., & Wehenkel, L. 2005. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6, 503–556.
Bibliography II
Ernst, Damien, Geurts, Pierre, & Wehenkel, Louis. 2003. Iteratively Extending Time Horizon Reinforcement Learning. Pages 96–107 of: European Conference on Machine Learning (ECML).

Heidrich-Meisner, V., & Igel, C. 2009. Neuroevolution Strategies for Episodic Reinforcement Learning. Journal of Algorithms, 64(4), 152–168.
Bibliography III
Ijspeert, Auke Jan, & Schaal, Stefan. 2003. Learning Attractor Landscapes for Learning Motor Primitives. Pages 1523–1530 of: Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.

Kober, J., & Peters, J. 2010. Policy Search for Motor Primitives in Robotics. Machine Learning Journal, online first, 1–33.

Kober, Jens, Oztop, Erhan, & Peters, Jan. 2010. Reinforcement Learning to Adjust Robot Movements to New Situations. In: Proceedings of the 2010 Robotics: Science and Systems Conference (RSS 2010).
Bibliography IV
Kolter, Z., & Ng, A. 2009. Task-Space Trajectories via Cubic Spline Optimization. Pages 2364–2371 of: Proceedings of the 2009 IEEE International Conference on Robotics and Automation (ICRA '09). Piscataway, NJ, USA: IEEE Press.

Neumann, G., & Peters, J. 2009. Fitted Q-Iteration by Advantage Weighted Regression. In: Advances in Neural Information Processing Systems 22 (NIPS 2008). Cambridge, MA: MIT Press.
Bibliography V
Peters, J., & Schaal, S. 2006. Policy Gradient Methods for Robotics. In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS).

Peters, J., & Schaal, S. 2007. Reinforcement Learning by Reward-Weighted Regression for Operational Space Control. In: Proceedings of the International Conference on Machine Learning (ICML).
Bibliography VI
Riedmiller, M. 2005. Neural Fitted Q-Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method. In: Proceedings of the European Conference on Machine Learning (ECML).

Sutton, Richard, Precup, Doina, & Singh, Satinder. 1999. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 112, 181–211.
Bibliography VII
Theodorou, E., Buchli, J., & Schaal, S. 2010. Reinforcement Learning of Motor Skills in High Dimensions: A Path Integral Approach. Pages 2397–2403 of: Robotics and Automation (ICRA), 2010 IEEE International Conference on.

Vlassis, Nikos, Toussaint, Marc, Kontes, Georgios, & Piperidis, Savas. 2009. Learning Model-Free Robot Control by a Monte Carlo EM Algorithm. Autonomous Robots, 27(2), 123–130.
Bibliography VIII
Williams, Ronald J. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8, 229–256.