DARPA Mobile Autonomous Robot Software, May 2000


Adaptive Intelligent Mobile Robotics

William D. Smart, Presenter

Leslie Pack Kaelbling, PI

Artificial Intelligence Laboratory

MIT


Progress to Date

• Fast bootstrapped reinforcement learning
  • algorithmic techniques
  • demo on robot

• Optical-flow based navigation
  • flow algorithm implemented
  • pilot navigation experiments on robot
  • pilot navigation experiments in simulation testbed


Making RL Really Work

Typical RL methods require far too much data to be practical in an online setting. Address the problem by

• strong generalization techniques
• using human input to bootstrap

Let humans do what they’re good at

Let learning algorithms do what they’re good at


JAQL

Learning a value function in a continuous state and action space

• based on locally weighted regression (fancy version of nearest neighbor)

• algorithm knows what it knows
• use meta-knowledge to be conservative about dynamic-programming updates


Problems with Q-Learning on Robots

• Huge state spaces/sparse data
• Continuous states and actions
• Slow to propagate values
• Safety during exploration
• Lack of initial knowledge


Value Function Approximation

Use a function approximator instead of a table
• generalization
• deals with continuous states and actions

• Q-learning with VFA has been shown to diverge, even in benign cases

Which function approximator should we use to minimize problems?

[Diagram: a function approximator F takes state s and action a as inputs and outputs Q(s,a)]


Locally Weighted Regression

• Store all previous data points
• Given a query point, find k nearest points
• Fit a locally linear model to these points, giving closer ones more weight
• Use KD-trees to make lookups more efficient

• Fast learning from a single data point
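
The steps listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: the Gaussian weighting kernel, the default values of `k` and `bandwidth`, and the function name `lwr_predict` are all assumptions, and a brute-force neighbor search stands in for the KD-tree lookup.

```python
import numpy as np

def lwr_predict(X, y, query, k=10, bandwidth=0.1):
    """Predict y at `query` with a locally weighted linear fit (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    query = np.atleast_1d(np.asarray(query, dtype=float))

    # Find the k stored points nearest to the query (brute force here;
    # a KD-tree would make this lookup faster on large data sets).
    dists = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(dists)[:k]

    # Closer points get more weight (Gaussian kernel is an assumption).
    weights = np.exp(-(dists[nearest] / bandwidth) ** 2)

    # Weighted least squares for a local linear model with an intercept.
    A = np.hstack([X[nearest], np.ones((len(nearest), 1))])
    sw = np.sqrt(weights)[:, None]
    beta, *_ = np.linalg.lstsq(A * sw, y[nearest] * sw[:, 0], rcond=None)
    return float(np.append(query, 1.0) @ beta)

# Usage: noise-free samples of a line are recovered by the local fit.
X_train = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y_train = 2.0 * X_train[:, 0] + 1.0
prediction = lwr_predict(X_train, y_train, [0.5])
```

Because all data points are simply stored, adding one new experience changes nearby predictions immediately, which is what makes learning from a single data point fast.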


Locally Weighted Regression

[Figure: original function]


Locally Weighted Regression

[Figure: bandwidth = 0.1, 500 training points]


Problems with Approximate Q-Learning

Errors are amplified by backups

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q_{\text{next}} - Q(s_t, a_t) \right]$$

$$Q_{\text{next}} = \max_{a} Q(s_{t+1}, a)$$
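
A small numeric sketch may help show why backups amplify errors: the max over next-state values tends to pick whichever estimate is too high. The learning rate, discount factor, and the example values below are assumptions chosen for illustration, and `q_backup` is a hypothetical helper name.

```python
def q_backup(q_sa, reward, next_q_values, alpha=0.5, gamma=0.9):
    """One Q-learning backup for Q(s_t, a_t) (alpha and gamma are assumed)."""
    q_next = max(next_q_values)  # Q_next = max over actions at s_{t+1}
    return q_sa + alpha * (reward + gamma * q_next - q_sa)

# True next-state values are all 1.0.
true_next = [1.0, 1.0, 1.0]
# The approximator's errors average to zero (-0.5, 0, +0.5), but the
# max operator selects the overestimate.
noisy_next = [0.5, 1.0, 1.5]

clean_update = q_backup(0.0, 0.0, true_next)
noisy_update = q_backup(0.0, 0.0, noisy_next)
```

Here the backed-up value from the noisy estimates is larger than the one from the true values, even though the errors cancel on average, and that inflated value is then itself used in later backups.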


One Source of Errors


Independent Variable Hull

Interpolation is safe; extrapolation is not, so

• construct hull around known points

• do local regression if the query point is within the hull

• give a default prediction if not
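
The slide does not say how the hull is constructed. The sketch below assumes one common formulation, an ellipsoidal hull defined through regression leverages (the diagonal of the hat matrix): a query counts as inside if its leverage does not exceed the largest leverage among the stored points. The function names, the tolerance, and the default value are hypothetical.

```python
import numpy as np

def in_hull(X, query, tol=1e-9):
    """Leverage-based hull test (assumed formulation, see lead-in)."""
    X1 = np.hstack([np.asarray(X, dtype=float), np.ones((len(X), 1))])
    q1 = np.append(np.asarray(query, dtype=float), 1.0)
    G = np.linalg.pinv(X1.T @ X1)
    # Leverage of each stored point: x_i^T (X^T X)^-1 x_i.
    leverages = np.einsum("ij,jk,ik->i", X1, G, X1)
    return bool(q1 @ G @ q1 <= leverages.max() + tol)

def guarded_predict(X, y, query, predict, default=0.0):
    """Use the local model only when interpolating; else a default."""
    if in_hull(X, query):
        return predict(X, y, query)
    return default  # the "don't know" answer

# Usage: four stored points at the corners of the unit square.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [0.0, 1.0, 1.0, 2.0]
constant_model = lambda X, y, q: 42.0  # stand-in for a local regression

inside = guarded_predict(corners, values, [0.5, 0.5], constant_model)
outside = guarded_predict(corners, values, [5.0, 5.0], constant_model)
```

The center of the square is answered by the model, while the far-away query falls back to the default instead of extrapolating.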


Recap

Use LWR to represent the value function
• generalization
• continuous spaces

Use IVH and “don’t know”
• conservative predictions
• safer backups


Incorporating Human Input

Humans can help a lot, even if they can’t perform the task very well.

• Provide some initial successful trajectories through the space

• Trajectories are not used for supervised learning, but to guide the reinforcement-learning methods through useful parts of the space

• Learn models of the dynamics of the world and of the reward structure

• Once learned models are good, use them to update the value function and policy as well.


Give Some Trajectories

Supply an example policy
• Need not be optimal and might be very wrong
• Code or human-controlled

Used to generate experience
• Follow example policy and record experiences
• Shows learner “interesting” parts of the space
• “Bad” initial policies might be better


Two Learning Phases

[Diagram, Phase One: blocks for the Learning System, the Supplied Control Policy, and the Environment, connected by signals labeled A, R, and O]


Two Learning Phases

[Diagram, Phase Two: blocks for the Learning System, the Supplied Control Policy, and the Environment, connected by signals labeled A, R, and O]


What does this Give Us?

• Natural way to insert human knowledge
• Keeps robot safe in early stages of learning
• Bootstraps information into the Q-function
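
As a rough illustration of the two phases, the toy sketch below lets a supplied policy drive while the learner updates its Q-values from the recorded experience, then hands control to the learned greedy policy. That hand-over in phase two is an assumption about what the diagrams show. The six-cell corridor, the tabular Q-function (in place of a function approximator), the always-move-right supplied policy, and all names and parameter values are hypothetical; only the reward of 10 at the end and 0 elsewhere mirrors the corridor-following setup.

```python
GOAL = 5            # last cell of a hypothetical 1-D corridor
ACTIONS = (-1, +1)  # step left or right

def env_step(state, action):
    """Move within [0, GOAL]; reward 10 at the end, 0 elsewhere."""
    nxt = min(max(state + action, 0), GOAL)
    done = nxt == GOAL
    return nxt, (10.0 if done else 0.0), done

def run_episode(q, choose_action, start=0, alpha=0.5, gamma=0.9,
                max_steps=100):
    """Run one episode, updating q from every experience; return steps."""
    state, steps = start, 0
    while steps < max_steps:
        action = choose_action(state)
        nxt, reward, done = env_step(state, action)
        best_next = 0.0 if done else max(
            q.get((nxt, a), 0.0) for a in ACTIONS)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
        state, steps = nxt, steps + 1
        if done:
            break
    return steps

def supplied_policy(state):
    return +1  # example policy; could equally be a human at a joystick

q = {}

# Phase one: the supplied policy acts, the learner only watches and learns.
for _ in range(10):
    run_episode(q, supplied_policy)

# Phase two: the learned Q-function now chooses the actions.
def greedy_policy(state):
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

steps_phase_two = run_episode(q, greedy_policy)
```

By the time the learner takes control, the Q-function already holds useful values along the demonstrated path, so its first self-directed run reaches the goal directly.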


Experimental Results: Corridor-Following


Corridor-Following

3 continuous state dimensions
• corridor angle
• offset from middle
• distance to end of corridor

1 continuous action dimension
• rotation velocity

Supplied example policy
• Average 110 steps to goal


Corridor-Following

Experimental setup
• Initial training runs start from roughly the middle of the corridor
• Translation speed has a fixed policy
• Evaluation on a number of set starting points
• Reward
  • 10 at end of corridor
  • 0 everywhere else


Corridor-Following

[Plot: average steps to goal (y-axis "Steps to goal", ticks 65 to 125) against training runs (x-axis, ticks -25 to 25), spanning Phase 1 and Phase 2, with series labeled “Best” possible and Average training]


Corridor Following: Initial Policy


Corridor Following: After Phase 1


Corridor Following: After Phase 1


Corridor Following: After Phase 2


Conclusions

VFA can be made more stable
• Locally weighted regression
• Independent variable hull
• Conservative backups

Bootstrapping value function really helps
• Initial supplied trajectories
• Two learning phases


Optical Flow

Get range information visually by computing optical flow field

• nearer objects cause flow of higher magnitude
• expansion pattern means you’re going to hit
• rate of expansion tells you when
• elegant control laws based on center and rate of expansion (derived from human and fly behavior)
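
The link between rate of expansion and time of impact can be made concrete with a standard pinhole-camera argument. This derivation is not taken from the slides: the symbols $f$ (focal length), $X$ (lateral offset of a scene point), $Z$ (its distance), $v$ (approach speed), $x$ (image offset from the center of expansion), and $\tau$ (time to contact) are introduced here for illustration, and it assumes a head-on approach at constant speed toward a stationary surface.

$$x = \frac{fX}{Z}, \qquad \dot{x} = -\frac{fX}{Z^{2}}\,\dot{Z} = \frac{x\,v}{Z} \quad (v = -\dot{Z}), \qquad \tau = \frac{Z}{v} = \frac{x}{\dot{x}}$$

Under these assumptions the time to contact follows from image measurements alone, without knowing distance or speed separately. The middle expression also shows that, for a fixed speed, a smaller $Z$ gives a larger flow magnitude.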


Approaching a Wall


Balance Strategy

Simple obstacle-avoidance strategy
• compute flow field
• compute average magnitude of flow in each hemifield
• turn away from the side with higher magnitude (because it has closer objects)
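
The balance strategy can be sketched in a few lines once a flow field is available. This is an illustrative sketch only: the array layout (height × width × 2), the normalized difference used as the turn command, the sign convention (positive means turn right), and the name `balance_turn` are assumptions, and computing the flow field itself is left out.

```python
import numpy as np

def balance_turn(flow, gain=1.0):
    """Turn command from a dense flow field of shape (H, W, 2).

    Positive output means "turn right" (assumed convention), i.e. away
    from a left hemifield with larger average flow magnitude.
    """
    magnitude = np.linalg.norm(flow, axis=2)
    mid = magnitude.shape[1] // 2
    left = magnitude[:, :mid].mean()
    right = magnitude[:, mid:].mean()
    # Normalize so the command does not scale with overall speed;
    # the small constant avoids division by zero for an empty field.
    return gain * (left - right) / (left + right + 1e-9)

# Usage: stronger flow on the left, so the command is to turn right.
flow = np.zeros((4, 6, 2))
flow[:, :3, 0] = 2.0   # left hemifield, closer objects
flow[:, 3:, 0] = 1.0   # right hemifield
turn = balance_turn(flow)

balanced = balance_turn(np.ones((4, 6, 2)))
```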


Balance Strategy in Action


Crystal Space


Crystal Space


Crystal Space


Next Steps

• Extend RL architecture to include model-learning and planning

• Apply RL techniques to tune parameters in optical-flow

• Build topological maps using visual information
• Build highly complex simulated environment
• Integrate planning and learning in multi-layer system