
Using Inaccurate Models in Reinforcement Learning
Pieter Abbeel, Morgan Quigley, and Andrew Y. Ng (Stanford University)



6. Theoretical Guarantees

Let the local policy improvement algorithm be policy gradient.

Notes: These assumptions are insufficient to give the same performance guarantees for model-based RL. The constant K depends only on the dimensionality of the state, the action, and the policy parameterization, the horizon H, and an upper bound on the first and second derivatives of the transition model, the policy, and the reward function.
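The statement of the guarantee itself did not survive as text. As a hedged sketch only, a bound of the following general shape is consistent with the notes above, where ε is an assumed bound on the error of the approximate dynamics and their first derivatives; this is not the authors' exact statement.

```latex
% Hedged sketch, not the poster's exact theorem: if the approximate dynamics
% \hat f_t and their first derivatives are within \epsilon of the true dynamics,
% then the gradient computed in the bias-corrected model satisfies
\[
  \bigl\| \widehat{\nabla}_{\theta}\, U(\theta) - \nabla_{\theta}\, U(\theta) \bigr\|
  \;\le\; K\,\epsilon .
\]
```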

4. Main Idea

Effect on the Policy Gradient Estimate

Exact policy gradient: derivatives of the true dynamics, evaluated along the real-life trajectory.

Model-based policy gradient: derivatives of the approximate model, evaluated along the model-predicted trajectory.

Two sources of error: (1) the derivatives of the approximate transition function; (2) the evaluation of those derivatives along the wrong (model-predicted) trajectory. A hedged reconstruction of the gradient expressions is sketched below.
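The two gradient expressions appeared on the poster as images; the following is a hedged reconstruction for the deterministic, time-varying setting of Section 1, with the chain-rule notation assumed rather than quoted.

```latex
% Deterministic rollout: s_{t+1} = f_t(s_t, u_t), u_t = \pi_\theta(s_t),
% utility U(\theta) = \sum_{t=0}^{H} R(s_t).
% Exact policy gradient: true dynamics f_t, derivatives evaluated along the
% real trajectory \{s_t\}:
\[
  \nabla_\theta U(\theta) = \sum_{t=0}^{H} \frac{\partial R}{\partial s}(s_t)\,\frac{d s_t}{d\theta},
  \qquad
  \frac{d s_{t+1}}{d\theta}
    = \frac{\partial f_t}{\partial s}(s_t,u_t)\,\frac{d s_t}{d\theta}
    + \frac{\partial f_t}{\partial u}(s_t,u_t)\,\frac{d u_t}{d\theta}.
\]
% Model-based policy gradient: the same recursion with the approximate
% dynamics \hat f_t, and with all derivatives evaluated along the
% model-predicted trajectory \{\hat s_t\} instead of \{s_t\}.
```

After the time-step bias correction (Step 3 of the complete algorithm), the model reproduces the real trajectory exactly, so the model-predicted states coincide with the real states and the second source of error vanishes.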

0. Overview

RL for high-dimensional continuous state-space tasks.

Model-based RL: difficult to build an accurate model. Model-free RL: often requires large numbers of real-life trials.

We present a hybrid algorithm that requires only an approximate model and a small number of real-life trials.

The resulting policy is an approximate local optimum.


[Illustration: the real-life trajectory vs. the trajectory predicted by the model (which equals the desired trajectory).]

The new model perfectly predicts the state sequence obtained by the current policy. Consequently, the new model “knows” that more right steering is required.

Our algorithm eliminates the second source of error.

1. Preliminaries

Markov Decision Process (MDP) M = (S, A, T, H, s_0, R).

S = ℝⁿ (continuous state space). Time-varying, deterministic dynamics T = { f_t : S × A → S, t = 0, …, H }.

Goal: find a policy π : S → A that maximizes U(π) = E[ Σ_{t=0}^{H} R(s_t) | π ]. Focus: the task of trajectory following.
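A minimal Python sketch of this finite-horizon, deterministic setting; the type aliases and function names are illustrative, not from the poster.

```python
from typing import Callable, List, Sequence

import numpy as np

# Illustrative type aliases for the setting above.
State = np.ndarray
Action = np.ndarray
Dynamics = Callable[[State, Action], State]   # f_t : S x A -> S
Policy = Callable[[State, int], Action]       # pi(s, t) -> a (time-varying policy)
Reward = Callable[[State], float]             # R : S -> reals


def rollout(f: Sequence[Dynamics], pi: Policy, s0: State, H: int) -> List[State]:
    """Roll out the deterministic, time-varying dynamics for H steps."""
    states = [s0]
    for t in range(H):
        states.append(f[t](states[t], pi(states[t], t)))
    return states


def utility(states: Sequence[State], R: Reward) -> float:
    """U(pi) = sum of R(s_t) along the (deterministic) trajectory."""
    return float(sum(R(s) for s in states))
```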

2. Motivating Example

A student driver learning to make a 90-degree right turn: only a few trials needed, no accurate model. Key aspects:

Real-life trial: shows whether the turn is wide or short. Crude model: turning the steering wheel more to the right results in a sharper turn; turning it more to the left results in a wider turn.

Result: a good policy gradient estimate.

5. Complete Algorithm

1. Find the (locally) optimal policy for the model.

2. Execute the current policy and record the state trajectory.

3. Update the model such that the new model is exact for the current policy.

4. Compute the policy gradient in the new model and update the policy: θ := θ + α · (policy gradient in the new model).

5. Go back to Step 2.

Notes: The step-size parameter α is determined by a line search. Instead of the policy gradient, any algorithm that provides a local policy improvement direction can be used; in our experiments we used differential dynamic programming. A sketch of the loop is given below.
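A Python sketch of the loop, reusing the rollout helpers above. The function names, the interfaces of real_rollout / model_gradient / model_utility, the candidate step sizes, and the choice to evaluate the line search in the corrected model are illustrative assumptions rather than the poster's specification.

```python
def add_time_step_bias(f_model, real_states, actions):
    """Step 3: return a bias-corrected model that exactly reproduces the
    recorded real-life trajectory under the actions that generated it."""
    corrected = []
    for t, f_t in enumerate(f_model):
        bias = real_states[t + 1] - f_t(real_states[t], actions[t])
        # Bind f_t and bias per time step; each corrected f_t adds a constant offset.
        corrected.append(lambda s, u, f_t=f_t, b=bias: f_t(s, u) + b)
    return corrected


def inaccurate_model_rl(theta, f_model, real_rollout, model_gradient, model_utility,
                        n_iters=10, step_sizes=(1.0, 0.3, 0.1, 0.03)):
    """Sketch of the complete algorithm. `theta` is assumed to already hold the
    (locally) model-optimal policy parameters (Step 1)."""
    for _ in range(n_iters):
        # Step 2: execute the current policy on the real system and record the trajectory.
        real_states, actions = real_rollout(theta)

        # Step 3: make the model exact for the current policy via per-time-step biases.
        f_corrected = add_time_step_bias(f_model, real_states, actions)

        # Step 4: policy gradient in the corrected model; step size alpha by line search.
        grad = model_gradient(theta, f_corrected)
        best_theta, best_u = theta, model_utility(theta, f_corrected)
        for alpha in step_sizes:
            candidate = theta + alpha * grad
            u = model_utility(candidate, f_corrected)
            if u > best_u:
                best_theta, best_u = candidate, u
        theta = best_theta  # Step 5: go back to Step 2.
    return theta
```

In the experiments, DDP replaces the plain gradient step as the local policy improvement direction.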

8. Related Work

Iterative Learning Control:

Uchiyama (1978), Longman et al. (1992), Moore (1993), Horowitz (1993), Bien et al. (1991), Owens et al. (1995), Chen et al. (1997), …

Successful robot control with limited number of trials: Atkeson and Schaal (1997), Morimoto and Doya (2001).

Non-parametric learning: Atkeson et al. (1997).

Classical and robust control theory: Anderson and Moore (1989), Zhou et al. (1995), Bagnell et al. (2001), Morimoto and Atkeson (2002), …

9. Conclusion

We presented an algorithm that uses a crude model and a small number of real-life trials to find a policy that works well in real life. Our theoretical results show that, assuming a deterministic setting and an approximate model, our algorithm returns a policy that is (locally) near-optimal. Our experiments show that our algorithm can significantly improve on purely model-based RL using only a small number of real-life trials, even when the true system is not deterministic.

7. Experiments

Real RC Car

Control actions: throttle and steering. We used DDP. Our algorithm took 10 iterations.

[Plot: the desired trajectory, the model-based controller, and our algorithm.]

76% utility improvement over model-based RL

Videos available.

Fixed-wing flight simulator available at: http://sourceforge.net/projects/aviones.

Improvements over model-based RL:

Turn: 97% Circle: 88% Figure-8: 67%

[Plots: Figure-8 Maneuver and Circle Maneuver.]

Flight Simulator

We generated “approximate models” by randomly perturbing the 43 model parameters. All 4 standard fixed-wing control actions were used: throttle, ailerons, elevators, and rudder. We used differential dynamic programming (DDP) for the model-based RL and to provide local policy improvements. Our algorithm took 5 iterations.
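A hedged sketch of how such randomly perturbed “approximate models” could be generated; the multiplicative form and the 10% noise scale are assumptions, not the poster's recipe.

```python
import numpy as np


def perturb_model_params(true_params, rel_noise=0.1, rng=None):
    """Illustrative only: perturb each model parameter (43 in the flight
    simulator) by a random multiplicative factor to obtain an approximate model."""
    rng = np.random.default_rng() if rng is None else rng
    true_params = np.asarray(true_params, dtype=float)
    return true_params * (1.0 + rel_noise * rng.standard_normal(true_params.shape))
```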

Open-Loop Turn

[Illustration: the open-loop turn executed from t = 0 to t = H.]

Test the model-based optimal policy in real life.

How to proceed when the real-life trajectory is not the desired trajectory predicted by the model?

The policy gradient is zero according to the model, so no improvement is possible based on the model.
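In symbols, using the notation of Section 1 (a hedged restatement of the sentence above, not a quoted formula):

```latex
% After Step 1 the policy is locally optimal for the (uncorrected) model, so the
% gradient of the model-predicted utility vanishes,
\[
  \widehat{\nabla}_{\theta}\, U(\theta) = 0,
\]
% even though the true gradient \nabla_\theta U(\theta) is generally nonzero:
% the uncorrected model alone offers no further improvement direction.
```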

Solution: Update the model so that it becomes exact for the current policy. More specifically, add a bias to the model for each time step (see the illustration notes above; a worked form of the bias follows below).
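A worked form of the per-time-step bias, consistent with the add_time_step_bias sketch given with the complete algorithm above; this is a hedged reconstruction of the illustrated update, not the poster's own equation.

```latex
% Let (s_0, u_0, s_1, u_1, \dots, s_H) be the recorded real-life trajectory
% under the current policy. For each time step t, add a constant bias:
\[
  \hat f_t^{\,\mathrm{new}}(s, u) \;=\; \hat f_t(s, u) + \bigl( s_{t+1} - \hat f_t(s_t, u_t) \bigr),
\]
% so that \hat f_t^{new}(s_t, u_t) = s_{t+1}: the corrected model reproduces
% the real trajectory exactly for the current policy.
```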