Robot Learning
Jeremy Wyatt
School of Computer Science
University of Birmingham
Plan
Why and when
What we can do
– Learning how to act
– Learning maps
– Evolutionary Robotics
How we do it
– Supervised Learning
– Learning from punishments and rewards
– Unsupervised Learning
Learning How to Act
What can we do?
– Reaching
– Road following
– Box pushing
– Wall following
– Pole-balancing
– Stick juggling
– Walking
Learning How to Act: Reaching
We can learn from reinforcement or from a teacher (supervised learning)
Reinforcement Learning:
– Action: Move your arm ()
– You received a reward of 2.1
Supervised Learning:
– Action: Move your hand to
– You should have moved to (x,y,z)
Learning How to Act: Driving
– ALVINN: learned to drive in 5 minutes
– Learns to copy the human response
– Feedforward multilayer neural network
(Diagram: a 30×32 image input feeds the network; the output layer encodes the steering wheel position)
Learning How to Act: Driving
– Network outputs form a Gaussian
– The mean encodes the driving direction
– Compare with the “correct” human action
– Compute the error for each unit given the desired Gaussian (a sketch follows the plot below)
(Plot: the desired Gaussian activation pattern over the output units)
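A minimal sketch of this target-and-error computation, assuming 30 output units, an arbitrary Gaussian width, and a squared-error measure; these specifics are illustrative choices, not taken from ALVINN itself.

import numpy as np

def target_gaussian(correct_unit, n_units=30, sigma=2.0):
    # Desired activation pattern: a Gaussian centred on the output unit
    # that corresponds to the human's steering direction.
    units = np.arange(n_units)
    target = np.exp(-0.5 * ((units - correct_unit) / sigma) ** 2)
    return target / target.sum()  # normalise so the activations sum to 1

def per_unit_error(outputs, correct_unit):
    # Error for each output unit: squared difference between the network's
    # activation and the desired Gaussian.
    desired = target_gaussian(correct_unit, n_units=len(outputs))
    return 0.5 * (desired - outputs) ** 2

# Example: random network outputs, with the human steering towards unit 12
outputs = np.random.rand(30)
outputs /= outputs.sum()
print(per_unit_error(outputs, correct_unit=12).round(4))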
Learning How to Act: Driving
The distribution of training examples from on-the-fly learning causes problems
– The network doesn’t see how to cope with misalignments
– The network can forget if it doesn’t see a situation for a while
– Answer: generate new examples from the on-the-fly images
Learning How to Act: Driving
– Use camera geometry to assess the new field of view
– Fill in using information about road structure
– Transform the target steering direction
– Present as a new training example (a rough sketch follows)
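The sketch below is a simplified stand-in for this procedure: it slides the camera image sideways, as if the vehicle were misaligned, and nudges the steering target by a proportional correction. The pixel-shift-to-steering gain and the edge-fill rule are assumptions; the real system uses full camera geometry and road-structure information to fill in the missing region.

import numpy as np

def make_shifted_example(image, steering, shift_px, gain=0.01):
    # Create a synthetic training example by sliding the image horizontally.
    # shift_px: horizontal pixel shift (positive = vehicle displaced to one side)
    # gain: assumed conversion from pixel shift to steering correction
    shifted = np.roll(image, shift_px, axis=1)
    # Replace the wrapped-around columns with the nearest valid column
    # (a crude substitute for filling in using road structure).
    if shift_px > 0:
        shifted[:, :shift_px] = shifted[:, shift_px:shift_px + 1]
    elif shift_px < 0:
        shifted[:, shift_px:] = shifted[:, shift_px - 1:shift_px]
    # Transform the target: steer back towards the road centre.
    return shifted, steering + gain * shift_px

# Example: a 30x32 image with a straight-ahead steering target of 0.0
image = np.random.rand(30, 32)
new_examples = [make_shifted_example(image, 0.0, s) for s in (-4, -2, 2, 4)]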
Learning How to Act: Driving
Learning How to Act
Obelix
– Learns to push boxes
– Reinforcement Learning
What is Reinforcement Learning?
– Learning from punishments and rewards
– The agent moves through the world, observing states and rewards
– Adapts its behaviour to maximise some function of reward
(Diagram: a trajectory of states s1, s2, s3, …, actions a1, a2, …, and the rewards received along the way, e.g. +3, −1, −1, …, +50)
Return: Long-term performance
– Let’s assume our agent acts according to some rules, called a policy, π
– The return R_t is a measure of the long-term reward collected after time t
– The expected return for a state-action pair is called a Q value, Q(s,a)
(The trajectory figure is repeated, with rewards +3, −1, −1, …, +50)
R_0 = r_1 + r_2 + r_3 + ⋯ = 3 − 1 − 1 + ⋯ + 50 + ⋯
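As a concrete illustration, the few lines below compute a return from a sequence of rewards. The discount factor and the treatment of the figure's rewards as one consecutive sequence are assumptions made only for this example.

def discounted_return(rewards, gamma=1.0):
    # R_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Rewards taken from the trajectory figure: +3, -1, -1, +50
print(discounted_return([3, -1, -1, 50]))        # 51 when gamma = 1
print(discounted_return([3, -1, -1, 50], 0.9))   # discounted version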
One step Q-learning
– Guess how good state-action pairs are
– Take an action
– Watch the new state and reward
– Update the state-action value
Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_t [ r_{t+1} + γ max_{b∈A} Q̂(s_{t+1}, b) − Q̂(s_t, a_t) ]
(Diagram: in state s_t the agent takes action a_t, then observes reward r_{t+1} and new state s_{t+1})
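A minimal tabular sketch of this update rule; the dictionary-based Q table, learning rate, discount factor, and the toy states and actions are illustrative choices, not details of the Obelix system.

from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated value

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # One-step Q-learning: move Q(s,a) towards r + gamma * max_b Q(s',b)
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Example: a couple of (state, action, reward, next_state) transitions
actions = ["left", "right", "push"]
for s, a, r, s_next in [("s1", "push", 3.0, "s2"), ("s2", "left", -1.0, "s3")]:
    q_update(s, a, r, s_next, actions)
print(dict(Q))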
Obelix
– Won’t converge with a single controller
– Works if you divide it into behaviours
– But …
Evolutionary Robotics
Learning Maps