DARPA Mobile Autonomous Robot Software, May 2000
Adaptive Intelligent Mobile Robotics
William D. Smart, Presenter
Leslie Pack Kaelbling, PI
Artificial Intelligence Laboratory
MIT
Progress to Date
• Fast bootstrapped reinforcement learning
  • algorithmic techniques
  • demo on robot
• Optical-flow based navigation
  • flow algorithm implemented
  • pilot navigation experiments on robot
  • pilot navigation experiments in simulation testbed
Making RL Really Work
Typical RL methods require far too much data to be practical in an online setting. We address this problem with
• strong generalization techniques
• human input to bootstrap learning
Let humans do what they’re good at
Let learning algorithms do what they’re good at
JAQL
Learning a value function in a continuous state and action space
• based on locally weighted regression (a fancy version of nearest neighbor)
• algorithm knows what it knows
• uses this meta-knowledge to be conservative about dynamic-programming updates
Problems with Q-Learning on Robots
• Huge state spaces / sparse data
• Continuous states and actions
• Slow to propagate values
• Safety during exploration
• Lack of initial knowledge
Value Function Approximation
Use a function approximator instead of a table
• generalization
• deals with continuous spaces and actions
• Q-learning with VFA has been shown to diverge, even in benign cases
Which function approximator should we use to minimize problems?
[Diagram: state s and action a are fed into function approximator F, which outputs Q(s,a)]
Locally Weighted Regression
• Store all previous data points
• Given a query point, find the k nearest points
• Fit a locally linear model to these points, giving closer ones more weight
• Use KD-trees to make lookups more efficient
• Fast learning from a single data point
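The steps above can be sketched as follows. This is a minimal illustrative implementation, not the actual JAQL code: the function name, the brute-force neighbor scan (standing in for a KD-tree), and the Gaussian weighting kernel are all assumptions.

```python
import numpy as np

def lwr_predict(X, y, query, k=10):
    """Locally weighted regression sketch: fit a weighted linear
    model to the k nearest stored points, evaluate it at the query."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Find the k nearest neighbors (a KD-tree would replace this scan).
    dists = np.linalg.norm(X - query, axis=1)
    idx = np.argsort(dists)[:k]
    Xk, yk, dk = X[idx], y[idx], dists[idx]
    # Gaussian kernel: closer points get more weight.
    bandwidth = dk.max() + 1e-8
    w = np.exp(-(dk / bandwidth) ** 2)
    # Weighted least squares with an intercept term.
    A = np.hstack([Xk, np.ones((len(idx), 1))])
    beta, *_ = np.linalg.lstsq(A * w[:, None], yk * w, rcond=None)
    return np.append(query, 1.0) @ beta
```

Because only stored points near the query influence the fit, a single new data point immediately changes predictions in its neighborhood, which is what gives the fast learning noted above.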
Locally Weighted Regression
Original function
Locally Weighted Regression
Bandwidth = 0.1, 500 training points
Problems with Approximate Q-Learning
Errors are amplified by backups
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ Q_next − Q(s_t, a_t) ]

Q_next = max_a Q(s_{t+1}, a)
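The standard Q-learning backup can be written as a one-line sketch; the function name and the default values of the learning rate alpha and discount gamma are assumptions for illustration.

```python
def q_update(q, r, q_next, alpha=0.2, gamma=0.9):
    """One temporal-difference backup: move Q(s_t, a_t) toward the
    target r_t + gamma * max_a Q(s_{t+1}, a)."""
    return q + alpha * (r + gamma * q_next - q)
```

When Q_next comes from a function approximator, any approximation error in it is folded into the target and then amplified by repeated backups, which is the instability this slide is pointing at.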
One Source of Errors
Independent Variable Hull
Interpolation is safe; extrapolation is not, so
• construct hull around known points
• do local regression if the query point is within the hull
• give a default prediction if not
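One common way to realize an independent variable hull is a leverage test: a query counts as interpolation if its leverage under the training design does not exceed the largest leverage of any training point. The sketch below assumes this leverage-based formulation and a global linear fit for simplicity; the function names are illustrative.

```python
import numpy as np

def ivh_check(X, query):
    """Leverage-based IVH test: the query is 'inside' if its leverage
    x^T (X^T X)^+ x is no larger than that of any training point."""
    X = np.asarray(X, dtype=float)
    G = np.linalg.pinv(X.T @ X)
    train_leverage = np.einsum('ij,jk,ik->i', X, G, X)
    q = np.asarray(query, dtype=float)
    return q @ G @ q <= train_leverage.max() + 1e-12

def safe_predict(X, y, query, default=0.0):
    """Regress only inside the hull; otherwise admit 'don't know'."""
    if not ivh_check(X, query):
        return default
    beta, *_ = np.linalg.lstsq(np.asarray(X, float),
                               np.asarray(y, float), rcond=None)
    return np.asarray(query, float) @ beta
```

Returning a default instead of an extrapolated value is what keeps bad predictions from being amplified by the backups on the previous slide.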
Recap
Use LWR to represent the value function
• generalization
• continuous spaces

Use the IVH and “don’t know”
• conservative predictions
• safer backups
Incorporating Human Input
Humans can help a lot, even if they can’t perform the task very well.
• Provide some initial successful trajectories through the space
• Trajectories are not used for supervised learning, but to guide the reinforcement-learning methods through useful parts of the space
• Learn models of the dynamics of the world and of the reward structure
• Once learned models are good, use them to update the value function and policy as well.
Give Some Trajectories
Supply an example policy
• Need not be optimal and might be very wrong
• Coded or human-controlled

Used to generate experience
• Follow the example policy and record experiences
• Shows the learner “interesting” parts of the space
• “Bad” initial policies might be better
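The experience-generation loop amounts to rolling out the supplied policy and logging transitions for the learner. This sketch is an assumed minimal interface (the names `policy`, `env_step`, and the tuple layout are illustrative, not the project's actual API):

```python
def collect_experience(policy, env_step, start_obs, n_steps):
    """Phase one: follow the supplied example policy and record
    (obs, action, reward, next_obs) tuples; the policy need not
    be anywhere near optimal."""
    obs, experience = start_obs, []
    for _ in range(n_steps):
        action = policy(obs)
        next_obs, reward = env_step(obs, action)
        experience.append((obs, action, reward, next_obs))
        obs = next_obs
    return experience
```

The recorded tuples are exactly the inputs the Q-learning backup needs, so the learner can train on them without ever having chosen the actions itself.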
Two Learning Phases
[Diagram, Phase One: the supplied control policy drives the environment with actions A, while the learning system passively observes the actions A, rewards R, and observations O]
Two Learning Phases
[Diagram, Phase Two: the learning system now selects the actions A itself, receiving rewards R and observations O from the environment]
What does this Give Us?
• Natural way to insert human knowledge
• Keeps robot safe in early stages of learning
• Bootstraps information into the Q-function
Experimental Results: Corridor-Following
Corridor-Following
3 continuous state dimensions
• corridor angle
• offset from middle
• distance to end of corridor

1 continuous action dimension
• rotation velocity

Supplied example policy
• Average 110 steps to goal
Corridor-Following
Experimental setup
• Initial training runs start from roughly the middle of the corridor
• Translation speed has a fixed policy
• Evaluation on a number of set starting points
• Reward
  • 10 at end of corridor
  • 0 everywhere else
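The sparse reward described above is trivial to write down; the function name and the goal-tolerance threshold are assumptions for illustration:

```python
def corridor_reward(distance_to_end, goal_tolerance=0.1):
    """Sparse reward from the slides: 10 at the end of the
    corridor, 0 everywhere else (the tolerance defining 'at the
    end' is an assumed parameter)."""
    return 10.0 if distance_to_end <= goal_tolerance else 0.0
```

With reward this sparse, random exploration would rarely reach the goal, which is precisely why the supplied example trajectories matter.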
Corridor-Following
Average steps to goal

[Plot: steps to goal vs. number of training runs, Phase 1 followed by Phase 2, with reference lines for the “best” possible policy and the average training performance]
Corridor Following: Initial Policy
Corridor Following: After Phase 1
Corridor Following: After Phase 2
Conclusions
VFA can be made more stable
• Locally weighted regression
• Independent variable hull
• Conservative backups

Bootstrapping the value function really helps
• Initial supplied trajectories
• Two learning phases
Optical Flow
Get range information visually by computing the optical flow field
• nearer objects cause flow of higher magnitude
• an expansion pattern means you’re going to hit something
• the rate of expansion tells you when
• elegant control laws based on the center and rate of expansion (derived from human and fly behavior)
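The "rate of expansion tells you when" point is the classic looming relation: for an object whose image size s grows at rate ds/dt, the time to contact is tau = s / (ds/dt). A minimal sketch (the function name and the infinite-tau convention for non-expanding objects are assumptions):

```python
def time_to_contact(size, expansion_rate):
    """Time to contact from looming: tau = s / (ds/dt) for an
    object of image size s expanding at rate ds/dt."""
    if expansion_rate <= 0:
        return float('inf')  # not expanding: no imminent collision
    return size / expansion_rate
```

Notably, tau depends only on image quantities, so a controller can act on it without ever recovering metric depth.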
Approaching a Wall
Balance Strategy
Simple obstacle-avoidance strategy
• compute flow field
• compute average magnitude of flow in each hemifield
• turn away from the side with higher magnitude (because it has closer objects)
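The balance comparison itself is a few lines once the flow magnitudes are in hand. This sketch assumes a 2-D array of per-pixel flow magnitudes and illustrative return labels; it is not the project's actual controller:

```python
import numpy as np

def balance_turn(flow_magnitudes):
    """Balance strategy: compare mean flow magnitude in the left
    and right hemifields of a 2-D magnitude image and turn away
    from the larger one (closer objects produce faster flow)."""
    flow = np.asarray(flow_magnitudes, dtype=float)
    mid = flow.shape[1] // 2
    left, right = flow[:, :mid].mean(), flow[:, mid:].mean()
    if left > right:
        return 'right'
    if right > left:
        return 'left'
    return 'straight'
```

Honeybee experiments suggest this is roughly how flying insects center themselves in corridors, which is part of the biological motivation mentioned above.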
Balance Strategy in Action
Crystal Space
Next Steps
• Extend RL architecture to include model-learning and planning
• Apply RL techniques to tune optical-flow parameters
• Build topological maps using visual information
• Build highly complex simulated environment
• Integrate planning and learning in multi-layer system