Lifelong Learning for Disturbance Rejection on Mobile Robots
GRASP LABORATORY
David Isele, José Marcio Luna, Eric Eaton, Gabriel V. de la Cruz, James Irwin,
Brandon Kallaher, Matthew E. Taylor
Motivation
Problem 1: Without prior knowledge, RL in a new task is slow
Idea: Reuse knowledge from previously learned tasks
[Figure: a sequence of go-to-goal tasks (goal marked G) arriving over time up to the current task, comparing standard "tabula rasa" initialization with initialization via transfer]
We focus on the lifelong learning case:
• The agent learns multiple tasks consecutively
• We want stability guarantees as the number of tasks grows large
Background
Background: Policy Gradient Methods for Control
• The agent interacts with the environment, taking consecutive actions
• Formalized as a Markov Decision Process (MDP)
[Figure: agent–environment loop — the agent makes sequential decisions; the environment returns rewards through a reward function and evolves under a probabilistic transition model toward the goal G]
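For reference, the standard MDP notation behind this slide (the notation here is mine, not taken from the deck):

```latex
% MDP tuple assumed by the slide: states, actions, transition model, reward
\langle \mathcal{X}, \mathcal{A}, P, R \rangle, \qquad
\mathbf{x}_{h+1} \sim P(\cdot \mid \mathbf{x}_h, \mathbf{a}_h) \ \text{(probabilistic transition)}, \qquad
r_h = R(\mathbf{x}_h, \mathbf{a}_h) \ \text{(reward function)}
```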
Background: Policy Gradient Methods for Control
• PG methods support continuous state and action spaces
  – They have shown recent success in applications to robotic control [Kober & Peters 2011; Peters & Schaal 2008; Sutton et al. 2000]
• A policy gradient learner improves the policy from n sampled trajectories
Goal: find the policy parameters $\theta$ that maximize the expected return
  $\mathcal{J}(\theta) = \int p_\theta(\tau)\,\mathcal{R}(\tau)\,d\tau$
where $p_\theta(\tau)$ is the probability of trajectory $\tau$ under the policy and $\mathcal{R}(\tau)$ is the reward function.
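The gradient of this objective is usually estimated from the n sampled trajectories with the likelihood-ratio trick; this is the standard episodic PG estimator (e.g., Sutton et al. 2000), reconstructed here rather than copied from the slide:

```latex
\nabla_\theta \mathcal{J}(\theta)
  = \mathbb{E}_{\tau \sim p_\theta}\!\big[\nabla_\theta \log p_\theta(\tau)\,\mathcal{R}(\tau)\big]
  \approx \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{h}\nabla_\theta \log
      \pi_\theta\big(\mathbf{a}^{(i)}_h \mid \mathbf{x}^{(i)}_h\big)\Big)\,
      \mathcal{R}\big(\tau^{(i)}\big)
```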
Background: Finite Difference Policy Gradients
• Approximate the change in reward with sampled disturbances
• Use the pseudo-inverse to find the gradient
• Update the current policy
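A minimal Python sketch of these three steps, assuming only a `rollout_return` function that estimates J(θ) from rollouts (names and hyperparameters are illustrative, not from the slides):

```python
import numpy as np

def finite_difference_gradient(theta, rollout_return, n_samples=20, sigma=0.05):
    """Estimate the policy gradient by finite differences.

    theta          : 1-D array of policy parameters
    rollout_return : hypothetical function theta -> estimated return J(theta)
    """
    d = theta.size
    j_ref = rollout_return(theta)                       # reference return
    d_theta = sigma * np.random.randn(n_samples, d)     # sampled disturbances
    d_j = np.array([rollout_return(theta + dt) - j_ref  # change in reward
                    for dt in d_theta])
    # Pseudo-inverse (least squares): solve d_theta @ g ~= d_j for the gradient
    g, *_ = np.linalg.lstsq(d_theta, d_j, rcond=None)
    return g

# Update the current policy by gradient ascent with step size alpha:
# theta = theta + alpha * finite_difference_gradient(theta, rollout_return)
```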
Lifelong PG Learning
Lifelong Machine Learning
[Figure: lifelong learning system — a timeline of previously learned tasks (..., t-3, t-2, t-1), the current task t, and future learning tasks (t+1, t+2, t+3, ...); trajectories for task t feed the learner, which draws on previously learned knowledge and returns a learned policy]
1.) Tasks are received consecutively
2.) Knowledge is transferred from previously learned tasks
3.) New knowledge is stored for future use
4.) Existing knowledge is refined
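A minimal sketch of this four-step loop; `TaskLibrary` and `learn_policy` are illustrative stand-ins, not the authors' implementation:

```python
class TaskLibrary:
    """Illustrative store of previously learned knowledge."""
    def __init__(self):
        self.policies = {}

    def transfer(self, task):
        # 2.) Warm-start from prior knowledge (here: any stored policy).
        return next(iter(self.policies.values()), None)

    def store(self, task, theta):
        # 3.) Store new knowledge for future use.
        self.policies[task] = theta

    def refine(self):
        # 4.) Refine existing knowledge (a no-op in this sketch).
        pass

def lifelong_learning(task_stream, learn_policy, library=None):
    library = library or TaskLibrary()
    for task in task_stream:               # 1.) tasks arrive consecutively
        theta0 = library.transfer(task)    # 2.) transfer
        theta = learn_policy(task, theta0)
        library.store(task, theta)         # 3.) store
        library.refine()                   # 4.) refine
    return library
```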
PG-ELLA Objective
Issue: the objective depends on the trajectories of all tasks, so optimizing it directly would require revisiting every task
Solution: approximate each task's loss to second order around its single-task solution, using the Hessian of the expected return
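For reference, the PG-ELLA objective from Bou Ammar et al. [ICML 2014], which the slide's missing equations presumably showed. Each task policy factors as θ(t) = L s(t), with a shared basis L and sparse task codes s(t):

```latex
e_T(L) = \frac{1}{T}\sum_{t=1}^{T}\min_{s^{(t)}}
  \Big[-\mathcal{J}\big(L s^{(t)}\big) + \mu\,\big\|s^{(t)}\big\|_1\Big]
  + \lambda\,\|L\|_F^2
% Evaluating J couples L to every task's trajectories. A second-order
% expansion around each single-task solution \alpha^{(t)}, with
% \Gamma^{(t)} proportional to the negative Hessian of \mathcal{J} at
% \alpha^{(t)}, removes that dependence:
\hat{e}_T(L) = \frac{1}{T}\sum_{t=1}^{T}\min_{s^{(t)}}
  \Big[\big\|\alpha^{(t)} - L s^{(t)}\big\|^2_{\Gamma^{(t)}}
  + \mu\,\big\|s^{(t)}\big\|_1\Big] + \lambda\,\|L\|_F^2
```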
Experiments
Verification on Robots
Results for Robot Go-to-Goal Task
• Run RL on a new robot (new goal and disturbance) for a small number of iterations
• Use PG-ELLA to adjust the policy according to known solutions
• Continue training
PG-ELLA improves learning
Better Results Incorporating Prior
• Initialization with the average policy of the other robots increases the benefit
PG-ELLA improves learning
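A one-line sketch of this warm start, assuming the other robots' learned parameter vectors are stacked in a hypothetical array `learned_thetas`:

```python
import numpy as np

# Hypothetical stand-in: rows are the other robots' learned policy parameters.
learned_thetas = np.array([[0.1, -0.3], [0.2, -0.1], [0.0, -0.2]])

theta_init = learned_thetas.mean(axis=0)  # average policy as the new robot's start
```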
GRASP LABORATORY
Thank you!
Questions?
This research was supported by ONR N00014-11-1-0139, AFRL FA8750-14-1-0069, AFRL FA8750-14-1-0070, NSF IIS-1149917, NSF IIS-1319412, USDA 2014-67021-22174, and a Google Research Award.
Lifelong Learning for Disturbance Rejection on Mobile Robots
David Isele, José Marcio Luna, Eric Eaton, Gabriel V. de la Cruz, James Irwin, Brandon Kallaher, Matthew E. Taylor