Space-Indexed Dynamic Programming: Learning to Follow Trajectories


Space-Indexed Dynamic Programming: Learning to

Follow Trajectories

J. Zico Kolter, Adam Coates, Andrew Y. Ng, Yi Gu, Charles DuHadway

Computer Science Department, Stanford University

July 2008, ICML

Outline

• Reinforcement Learning and Following Trajectories

• Space-indexed Dynamical Systems and Space-indexed Dynamic Programming

• Experimental Results

Reinforcement Learning and Following Trajectories

Trajectory Following

• Consider the task of following a trajectory in a vehicle such as a car or helicopter

• The state space is too large to discretize, so tabular RL / dynamic programming cannot be applied

Trajectory Following

• Dynamic programming algorithms with non-stationary policies seem well-suited to the task: Policy Search by Dynamic Programming (Bagnell et al.), Differential Dynamic Programming (Jacobson and Mayne)

Dynamic Programming

• Divide the control task into discrete time steps t = 1, 2, …, T

• Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1: π_T, π_{T-1}, …, π_2, π_1 (a minimal sketch of this backward loop follows below)

• Key Advantage: Policies are local (each only needs to perform well over a small portion of the state space)
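The backward pass can be sketched as follows. This is only an illustrative sketch of time-indexed DP over a trajectory, in the spirit of PSDP, not the paper's implementation; the helper names sample_states_at_time, simulate, fit_policy, and rollout_cost are hypothetical placeholders.

    # Minimal sketch of time-indexed dynamic programming over a trajectory:
    # policies are learned backwards in time, each trained only on the state
    # distribution expected at its own time step (hence the "local" policies).
    # All helper names are hypothetical placeholders, not from the paper.
    def time_indexed_dp(T, sample_states_at_time, simulate, fit_policy, rollout_cost):
        policies = [None] * (T + 1)              # policies[t] is executed at time t
        for t in range(T, 0, -1):                # t = T, T-1, ..., 1
            states_t = sample_states_at_time(t)  # states expected at time step t
            # Fit a policy minimizing the cost-to-go from time t, executing the
            # already-learned policies pi_{t+1}, ..., pi_T afterwards.
            policies[t] = fit_policy(
                states_t,
                lambda s, a, t=t: rollout_cost(simulate(s, a), policies[t + 1:]),
            )
        return policies[1:]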

Problems with Dynamic Programming

Problem #1: Policies from traditional dynamic programming algorithms are time-indexed

Problems with Dynamic Programming

• Suppose we learned policy π_5 assuming a particular distribution over states at t = 5

• But, due to the natural stochasticity of the environment, the car may actually be somewhere else at t = 5

• The resulting policy will perform very poorly

• Partial Solution (Re-indexing): execute the policy learned for the location closest to the current state, regardless of time (see the sketch below)
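A minimal sketch of this re-indexing heuristic, assuming each learned policy is stored together with the nominal trajectory position it was trained for; reference_points is a hypothetical name for those positions.

    import numpy as np

    # Re-indexing heuristic: instead of executing the policy for the current
    # *time* step, execute the policy learned for the trajectory point closest
    # to the vehicle's current position.
    def reindexed_action(position, policies, reference_points):
        dists = [np.linalg.norm(position - p) for p in reference_points]
        i = int(np.argmin(dists))          # index of the nearest trajectory point
        return policies[i](position)       # act with that policy, ignoring time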

Problems with Dynamic Programming

Problem #2: Uncertainty over future states makes it hard to learn any good policy

• Due to stochasticity, there is large uncertainty over states in the distant future (e.g., the distribution over states at time t = 5 is wide)

• DP algorithms require learning a policy that performs well over this entire distribution

Space-Indexed Dynamic Programming

• Basic idea of Space-Indexed Dynamic Programming (SIDP): perform DP with respect to space indices (planes tangent to the trajectory)

Space-Indexed Dynamical Systems and Dynamic Programming

Difficulty with SIDP

• No guarantee that taking a single action will move the vehicle to the next plane along the trajectory

• Introduce the notion of a space-indexed dynamical system

Time-Indexed Dynamical System

• Creating time-indexed dynamical systems:

  ṡ = f(s, u)

  where s is the current state, u is the control action, and ṡ is the time derivative of the state

• Euler integration:

  s_{t+Δt} = s_t + f(s_t, u_t) Δt
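A one-line sketch of the Euler update above; f stands in for an arbitrary vehicle dynamics model and is an assumption here, not something specified by the slides.

    # One Euler step of the continuous-time dynamics s_dot = f(s, u):
    # s_{t+dt} = s_t + f(s_t, u_t) * dt, with s and f(s, u) as numpy arrays.
    def euler_step(f, s, u, dt):
        return s + f(s, u) * dt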

Space-Indexed Dynamical Systems

• Creating space-indexed dynamical systems: starting from the plane at space index d, simulate the dynamics ṡ = f(s, u) forward until the vehicle hits the next plane, at space index d+1

• This gives the discrete update

  s_{d+1} = s_d + f(s_d, u_d) Δt(s_d, u_d)

  where the time to reach the next plane is

  Δt(s, u) = (ṡ*_{d+1})^T (s*_{d+1} − s) / ((ṡ*_{d+1})^T ṡ)

  with s*_{d+1} and ṡ*_{d+1} the desired state and velocity on the trajectory at plane d+1

  (a positive solution exists as long as the controller makes some forward progress)
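A minimal sketch of this one-step, space-indexed update under the Euler approximation; s_star_next and s_dot_star_next stand for s*_{d+1} and ṡ*_{d+1}, and all names are illustrative placeholders rather than an API from the paper.

    # One step of a space-indexed dynamical system (Euler approximation):
    # solve for the time dt at which the motion s_d + f(s_d, u_d) * t crosses
    # the plane through s_star_next with normal s_dot_star_next, then advance
    # to that crossing.  Inputs are numpy arrays; f is an assumed dynamics model.
    def space_indexed_step(f, s_d, u_d, s_star_next, s_dot_star_next):
        s_dot = f(s_d, u_d)                            # current velocity
        forward = float(s_dot_star_next @ s_dot)       # progress toward the plane
        if forward <= 0.0:
            raise ValueError("no forward progress: next plane is never reached")
        dt = float(s_dot_star_next @ (s_star_next - s_d)) / forward
        return s_d + s_dot * dt, dt                    # state on plane d+1, time taken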

Space-Indexed Dynamical Systems

• Result is a dynamical system indexed by the spatial-index variable d rather than time:

  s_{d+1} = s_d + f(s_d, u_d) Δt(s_d, u_d)

• Space-indexed dynamic programming runs DP directly on this system

Space-Indexed Dynamic Programming

• Divide the trajectory into discrete space planes d = 1, 2, …, D

• Proceeding backwards, learn policies for d = D, D-1, …, 2, 1: π_D, π_{D-1}, …, π_2, π_1 (see the sketch below)
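The backward pass is structurally identical to the time-indexed loop sketched earlier, except that policies are indexed by plane and trained on states sampled on that plane. Again a hedged sketch with hypothetical helper names (sample_states_on_plane, fit_policy, cost_to_go).

    # Space-indexed DP: same backward structure as time-indexed DP, but each
    # policy pi_d is trained on the distribution of states *on plane d*, using
    # the space-indexed dynamics to roll forward to later planes.
    def space_indexed_dp(D, sample_states_on_plane, fit_policy, cost_to_go):
        policies = [None] * (D + 1)               # policies[d] is executed at plane d
        for d in range(D, 0, -1):                 # d = D, D-1, ..., 1
            states_d = sample_states_on_plane(d)  # states restricted to plane d
            policies[d] = fit_policy(
                states_d,
                lambda s, a, d=d: cost_to_go(d, s, a, policies[d + 1:]),
            )
        return policies[1:]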

Problems with Dynamic Programming

Problem #1: Policies from traditional dynamic programming algorithms are time-indexed

Space-Indexed Dynamic Programming

• Time-indexed DP: may execute a policy (e.g., π_5) learned for a different location

• Space-indexed DP: always executes the policy based on the current spatial index

Problems with Dynamic Programming

Problem #2: Uncertainty over future states makes it hard to learn any good policy

Space-Indexed Dynamic Programming

• Time-indexed DP: wide distribution over future states (distribution over states at time t = 5)

• Space-indexed DP: much tighter distribution over future states (distribution over states at index d = 5)


Experiments

Experimental Domain

• Task: following a race-track trajectory with an RC car, with randomly placed obstacles

Experimental Setup

• Implemented a space-indexed version of the PSDP algorithm
  – Policy chooses a steering angle using an SVM classifier (constant velocity); see the sketch after this list
  – Used a simple textbook model simulator of the car dynamics to learn the policy

• Evaluated time-indexed PSDP, time-indexed PSDP with re-indexing, and space-indexed PSDP
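A minimal sketch of such a steering policy, assuming scikit-learn; the feature representation, kernel, and candidate steering angles are illustrative assumptions, not the paper's exact setup.

    import numpy as np
    from sklearn.svm import SVC

    # Policy class used in this sketch: a multi-class SVM maps state features to
    # one of a few discrete steering angles, while velocity is held constant.
    # The angle set below is an assumed example.
    STEERING_ANGLES = np.radians([-20.0, -10.0, 0.0, 10.0, 20.0])

    def fit_steering_policy(state_features, best_angle_index):
        """Fit an SVM classifier from state features to a steering-angle index."""
        clf = SVC(kernel="rbf").fit(state_features, best_angle_index)
        return lambda s: STEERING_ANGLES[int(clf.predict(np.atleast_2d(s))[0])]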

Time-Indexed PSDP

Time-Indexed PSDP w/ Re-indexing

Space-Indexed PSDP

Empirical Evaluation

  Method                                Cost
  Time-indexed PSDP                     Infinite (no trajectory succeeds)
  Time-indexed PSDP with Re-indexing    59.74
  Space-indexed PSDP                    49.32

Additional Experiments

• In the paper: additional experiments on the Stanford Grand Challenge Car using space-indexed DDP, and on a simulated helicopter domain using space-indexed PSDP

Related Work

• Reinforcement learning / dynamic programming: Bagnell et al., 2004; Jacobson and Mayne, 1970; Lagoudakis and Parr, 2003; Langford and Zadrozny, 2005

• Differential Dynamic Programming: Atkeson, 1994; Tassa et al., 2008

• Gain Scheduling, Model Predictive Control: Leith and Leithead, 2000; Garcia et al., 1989

Summary

• Trajectory following calls for non-stationary policies, but traditional DP / RL algorithms suffer because their policies are time-indexed

• In this paper, we introduce the notions of a space-indexed dynamical system and space-indexed dynamic programming

• Demonstrated usefulness of these methods on real-world control tasks.

Thank you!

Videos available online at http://cs.stanford.edu/~kolter/icml08videos
