Page 1

TEXPLORE: Real-Time Sample-Efficient Reinforcement Learning for Robots

Todd Hester and Peter Stone

Learning Agents Research Group, Department of Computer Science, The University of Texas at Austin

Journal Track: appeared in Machine Learning, 2013

Page 2

Robot Learning

Robots have the potential to solve many problems

We need methods for them to learn and adapt to new situations

Page 3

Reinforcement Learning

Value-function RL has a string of positive theoretical results [Watkins 1989, Brafman and Tennenholtz 2001]

Could be used for learning and adaptation on robots

Page 4

Reinforcement Learning

Model-free Methods

Learn a value function directly from interaction with the environment

Can run in real-time, but are not very sample efficient

Model-based Methods

Learn a model of the transition and reward dynamics

Update the value function using the model (planning)

Can update action-values without taking real actions in the world (see the sketch below)
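
To make the distinction concrete, here is a minimal sketch (not the paper's implementation) of the two update styles on a generic discrete task; `q`, `model`, `alpha`, and the transition bookkeeping are illustrative assumptions.

```python
import random
from collections import defaultdict

alpha, gamma = 0.1, 0.99          # illustrative learning rate and discount
q = defaultdict(float)            # action-values indexed by (state, action)

def model_free_update(s, a, r, s2, actions):
    """Q-learning: update the value function directly from one real transition."""
    best_next = max(q[(s2, a2)] for a2 in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

# Model-based: first learn a model of the dynamics, then plan with it.
model = {}                        # (state, action) -> (reward, next_state)

def model_based_updates(actions, n_planning_steps=50):
    """Dyna-style planning: replay simulated transitions from the learned model,
    improving action-values without taking more real actions in the world."""
    for _ in range(n_planning_steps):
        (s, a), (r, s2) = random.choice(list(model.items()))
        best_next = max(q[(s2, a2)] for a2 in actions)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
```

The planning loop improves the same action-values from simulated transitions, which is what buys sample efficiency at the cost of extra computation.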

Page 5

Velocity Control of an Autonomous Vehicle

Upgraded to run autonomously by adding shift-by-wire, steering, and braking actuators

10-second episodes (at 20 Hz: 200 samples / episode)

Page 6

Velocity Control

State:

Current Velocity

Desired Velocity

Accelerator Pedal Position

Brake Pedal Position

Actions:

Do nothing

Increase/decrease brake position by 0.1

Increase/decrease accelerator position by 0.1

Reward: -10.0 * velocity error (m/s) (see the sketch below)
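
A minimal sketch of how this state, action, and reward specification could be encoded as an episodic task; the pedal-to-velocity dynamics below are placeholders, not the vehicle's actual physics.

```python
import numpy as np

ACTIONS = ["nothing", "brake+0.1", "brake-0.1", "accel+0.1", "accel-0.1"]

class VelocityControlTask:
    """Sketch of the 20 Hz velocity-control episode (10 s = 200 steps)."""
    def __init__(self, desired_vel):
        self.vel, self.desired = 0.0, desired_vel
        self.accel_pedal, self.brake_pedal = 0.0, 0.0

    def step(self, action):
        if action == "brake+0.1":  self.brake_pedal = min(1.0, self.brake_pedal + 0.1)
        if action == "brake-0.1":  self.brake_pedal = max(0.0, self.brake_pedal - 0.1)
        if action == "accel+0.1":  self.accel_pedal = min(1.0, self.accel_pedal + 0.1)
        if action == "accel-0.1":  self.accel_pedal = max(0.0, self.accel_pedal - 0.1)
        # Placeholder dynamics: pedals push velocity up or down each 0.05 s tick.
        self.vel = max(0.0, self.vel + 0.05 * (3.0 * self.accel_pedal - 4.0 * self.brake_pedal))
        state = np.array([self.vel, self.desired, self.accel_pedal, self.brake_pedal])
        reward = -10.0 * abs(self.vel - self.desired)   # -10.0 * velocity error (m/s)
        return state, reward
```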

Page 7

Desiderata

1 Learning algorithm must learn in very few actions (be sample efficient)

2 Learning algorithm must take actions continually in real-time (while learning)

3 Learning algorithm must handle continuous state

4 Learning algorithm must handle delayed actions

Page 9

Common Approaches

Algorithm     Citation                     Sample Efficient   Real Time   Continuous   Delay
R-Max         Brafman 2001                 Yes                No          No           No
Q-Learning    Watkins 1989                 No                 Yes         No           No
  with F.A.   Sutton & Barto 1998          No                 Yes         Yes          No
SARSA         Rummery & Niranjan 1994      No                 Yes         No           No
GPRL          Deisenroth & Rasmussen 2011  Yes                No          Yes          No
BOSS          Asmuth et al 2009            Yes                No          No           No
Bayesian DP   Strens 2000                  Yes                No          No           No
MBBE          Dearden et al 1999           Yes                No          No           No
MBS           Walsh et al 2009             Yes                No          No           Yes
Dyna          Sutton 1990                  No                 Yes         No           No

Page 11

The TEXPLORE Algorithm

1 Limits exploration to be sample efficient

2 Selects actions continually in real-time

3 Handles continuous state

4 Handles actuator delays

Available publicly as a ROS package:

www.ros.org/wiki/rl-texplore-ros-pkg

Page 12

Challenge 1: Sample Efficiency

Treat model learning as a supervised learning problem

Input: state and action

Output: distribution over next states and reward

Factored model: learn a separate model to predict each next-state feature and the reward

Decision trees: split the state space into regions with similar dynamics (see the sketch below)
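
A minimal sketch of the factored, tree-based model idea, approximated here with scikit-learn trees (an assumption; the released TEXPLORE code is a separate C++ implementation, and its trees predict full distributions rather than the point estimates below). One tree predicts the change in each state feature, and one predicts reward, all from (state, action) inputs.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class FactoredTreeModel:
    """One tree per next-state feature (predicting that feature's change)
    plus one tree for reward, each trained on (state, action) inputs."""
    def __init__(self, n_features):
        self.feature_trees = [DecisionTreeRegressor() for _ in range(n_features)]
        self.reward_tree = DecisionTreeRegressor()

    def train(self, states, actions, next_states, rewards):
        X = np.column_stack([states, actions])      # input: state and action
        deltas = next_states - states                # predict per-feature change
        for i, tree in enumerate(self.feature_trees):
            tree.fit(X, deltas[:, i])
        self.reward_tree.fit(X, rewards)

    def predict(self, state, action):
        x = np.append(state, action).reshape(1, -1)
        delta = np.array([t.predict(x)[0] for t in self.feature_trees])
        return state + delta, float(self.reward_tree.predict(x)[0])
```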

Page 13

Random Forest Model [ICDL 2010]

Average predictions of m different decision trees

Each tree represents a hypothesis of the true dynamics of the domain

Acting greedily w.r.t. the average model balances predictions of optimistic and pessimistic models

Limits the agent's exploration to state-actions that appear promising, while avoiding those which may have negative outcomes (see the sketch below)
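
A minimal sketch of the forest averaging, assuming each per-tree model exposes the train/predict interface from the previous sketch (hypothetical names): each tree is trained on a different random subsample of the experience, so the trees form distinct hypotheses, and their predictions are averaged before the agent acts greedily.

```python
import random
import numpy as np

class ForestModel:
    """Average the predictions of m tree models, each a hypothesis of the true dynamics."""
    def __init__(self, make_model, m=5):
        self.models = [make_model() for _ in range(m)]

    def train(self, transitions):
        # transitions: list of (state, action, next_state, reward) tuples.
        for model in self.models:
            # Each tree sees a different random subsample, so the hypotheses differ.
            sample = random.sample(transitions, k=max(1, len(transitions) // 2))
            states, actions, next_states, rewards = map(np.array, zip(*sample))
            model.train(states, actions, next_states, rewards)

    def predict(self, state, action):
        preds = [m.predict(state, action) for m in self.models]
        next_states, rewards = zip(*preds)
        # Acting greedily w.r.t. this average balances optimistic and pessimistic trees.
        return np.mean(next_states, axis=0), float(np.mean(rewards))
```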

Page 15

Challenge 2: Real-Time Action Selection

Model update can take too long

Planning can take too long

Page 16

Real-Time Model Based Architecture (RTMBA)

Model learning and planning run on parallel threads

Action selection is not restricted by their computation time

Use sample-based planning (anytime)

Mutex locks protect shared data (see the sketch below)
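
A minimal sketch of the threading pattern under those constraints (illustrative structure only, not the released implementation; `env`, `model`, and `policy` are assumed interfaces): the action thread keeps its fixed cycle regardless of how long model updates or planning take, and locks guard the shared model and plan.

```python
import threading, queue, time

update_queue = queue.Queue()        # real experience handed to the model thread
model_lock = threading.Lock()       # guards the shared model
policy_lock = threading.Lock()      # guards the shared plan / action-values

def action_thread(env, policy, n_steps, dt=0.05):
    """Selects an action every dt seconds regardless of model/planning progress."""
    state = env.reset()
    for _ in range(n_steps):
        with policy_lock:
            action = policy.best_action(state)
        next_state, reward = env.step(action)
        update_queue.put((state, action, next_state, reward))   # never blocks
        state = next_state
        time.sleep(dt)

def model_thread(model):
    """Folds new experience into the model whenever it arrives."""
    while True:
        transition = update_queue.get()
        with model_lock:
            model.update(transition)

def planning_thread(model, policy):
    """Anytime, sample-based planning: keeps improving the plan from the latest model."""
    while True:
        with model_lock:
            snapshot = model.copy()
        rollout = snapshot.sample_rollout()     # simulated experience
        with policy_lock:
            policy.update_from_rollout(rollout)
```

Because the experience queue and the planning loop never block the action thread, the agent can keep issuing commands at the vehicle's 20 Hz control rate while learning.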

Page 17

Challenge 3: Continuous State

Use regression trees to model continuous state

Each tree has a linear regression model at its leaves

Discretize the state space for value updates from UCT, but still plan over continuously valued states (see the sketch below)
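
A minimal sketch of a linear-model tree, built here from scikit-learn parts as an assumption (the paper describes its own incremental trees): a regression tree partitions the continuous input space, and a separate linear regression is fit to the training points in each leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

class LinearModelTree:
    """Regression tree whose leaves each hold a linear model of the local dynamics."""
    def __init__(self, max_depth=4):
        self.tree = DecisionTreeRegressor(max_depth=max_depth)
        self.leaf_models = {}

    def fit(self, X, y):
        self.tree.fit(X, y)
        leaves = self.tree.apply(X)              # leaf index for each training point
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            self.leaf_models[leaf] = LinearRegression().fit(X[mask], y[mask])

    def predict(self, X):
        leaves = self.tree.apply(X)
        return np.array([self.leaf_models[l].predict(x.reshape(1, -1))[0]
                         for l, x in zip(leaves, X)])
```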

Page 18

Challenge 4: Actuator Delays

Delays make the domain non-Markov, but k-Markov

Provide the model with the previous k actions (similar to U-Tree [McCallum 1996])

Trees can learn which delayed actions are relevant

UCT can plan over augmented state-action histories easily (see the sketch below)

This would not be as easy with tabular models or dynamic programming
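
A minimal sketch of that augmentation (a hypothetical wrapper; it assumes the same `env.step`/`env.reset` interface as the earlier sketches and a numeric action encoding): the observed state is extended with the previous k actions, so the tree model and UCT see a k-Markov delayed domain as Markov.

```python
import numpy as np
from collections import deque

class DelayedActionWrapper:
    """Appends the previous k actions to the state so a k-Markov delayed
    domain looks Markov to the model and planner."""
    def __init__(self, env, k):
        self.env, self.k = env, k
        self.history = deque([0.0] * k, maxlen=k)   # assume 0.0 encodes "no action yet"

    def reset(self):
        self.history = deque([0.0] * self.k, maxlen=self.k)
        return self._augment(self.env.reset())

    def step(self, action):
        next_state, reward = self.env.step(action)
        self.history.append(float(action))           # most recent action is last
        return self._augment(next_state), reward

    def _augment(self, state):
        return np.concatenate([state, np.array(self.history)])
```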

Page 19

Autonomous Vehicle

Upgraded to run autonomously by adding shift-by-wire, steering, and braking actuators

Vehicle runs at 20 Hz; the agent must provide commands at this frequency

Page 20

Uses ROS [Quigley et al 2009]

http://www.ros.org/wiki/rl_msgs

Page 21

Simulation Experiments

Exploration Approaches

Epsilon-Greedy

Boltzmann Exploration

Use merged BOSS-like model

Use random model each episode

Sample Efficient Methods

BOSS [Asmuth et al 2009]

Bayesian DP [Strens 2000]

Gaussian Process RL [Deisenroth & Rasmussen 2011]

Page 22

Simulation Experiments

Continuous Models

Tabular Models

Gaussian Process RL [Deisenroth & Rasmussen 2011]

KWIK linear regression [Strehl & Littman 2007]

Real-Time Architectures

Real Time Dynamic Programming [Barto et al 1995]

Dyna [Sutton 1990]

Parallel Value Iteration

Actuator Delays

Model Based Simulation [Walsh et al 2009]

Page 23

Challenge 1: Sample Efficiency

[Figure: Simulated Car Control Between Random Velocities; average reward per episode, comparing TEXPLORE (Greedy), Epsilon-Greedy, Boltzmann, Variance-Bonus b=1, Variance-Bonus b=10, BayesDP-like, and BOSS-like exploration]

Page 24

Challenge 1: Sample Efficiency

[Figure: Simulated Car Control Between Random Velocities; average reward per episode, comparing TEXPLORE (Greedy), BOSS, Bayesian DP, GPRL, R-Max, and Q-Learning with tile-coding]

Page 25

Challenge 2: Real-Time Action Selection

[Figure: Simulated Car Control Between Random Velocities; average reward per episode, comparing RTMBA (TEXPLORE), RTDP, Parallel VI, Value Iteration, RT-Dyna, and Q-Learning with tile-coding]

Page 26

Challenge 3: Modeling Continuous Domains

[Figure: Model Accuracy on Next State; average state error (Euclidean distance) vs. number of state-actions, comparing Regression Tree Forest, Regression Tree, Decision Tree Forest, Decision Tree, Tabular, KWIK Linear Regression, and GP Regression]

Page 27

Challenge 3: Modeling Continuous Domains

[Figures: Model Accuracy on Next State (average state error, Euclidean distance) and Model Accuracy on Reward (average reward error) vs. number of state-actions, comparing Regression Tree Forest, Regression Tree, Decision Tree Forest, Decision Tree, Tabular, KWIK Linear Regression, and GP Regression]

Page 28

Challenge 4: Handling Delayed Actions

[Figure: Simulated Car Control Between Random Velocities; average reward per episode, comparing TEXPLORE with k=0, 1, 2, 3, MBS with k=1, 2, 3, and Tabular with k=2]

Page 29

On the physical vehicle

But, does it work on the actual vehicle?

Page 30

On the physical vehicle

[Figure: Physical Vehicle Velocity Control from 2 to 5 m/s; average reward per episode]

Yes! It learns the task within 2 minutes of driving time

Page 31

Conclusion

TEXPLORE can:

1 Learn in few samples

2 Act continually in real-time

3 Learn in continuous domains

4 Handle actuator delays

TEXPLORE code has been released as a ROS package: www.ros.org/wiki/rl-texplore-ros-pkg
