
Page 1:

Imitation Learning

Jannik Zürn

Page 2:

Terminology

[Figure: an image classifier maps an input image to class probabilities (Cat 0.6, Dog 0.0, Car 0.0, Liquid 0.4); analogously, a driving policy maps a camera image to action probabilities (Left 0.1, Right 0.1, Straight 0.8).]

Page 3:

Mapping observations to actions

[Figure: the policy maps a camera observation to a distribution over driving actions, e.g. Left 0.1, Right 0.1, Straight 0.8.]
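As an illustration of such a mapping, here is a minimal sketch in PyTorch; the architecture (a tiny CNN with three output actions) is an assumption for illustration only, not the network from the slides or from [Bojarski et al. 2016]:

import torch
import torch.nn as nn

class DrivingPolicy(nn.Module):
    def __init__(self, num_actions=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_actions)   # logits for Left / Right / Straight

    def forward(self, obs):                       # obs: (B, 3, H, W) camera image
        return torch.softmax(self.head(self.features(obs)), dim=-1)

probs = DrivingPolicy()(torch.zeros(1, 3, 66, 200))   # a distribution over the 3 actions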

Page 4:

States and observations

Page 5:

States and observations

[Figure from [1]: graphical model with states s_t, observations o_t, and actions a_t. The lossy mapping p(o_t | s_t) relates states to observations, the transition function p(s_{t+1} | s_t, a_t) describes the dynamics, and the Markov property holds for the states.]

Page 6:

Where to get labels?

Problems?

- Noisy labels (the expert makes mistakes)
- Have to observe a lot of data from the expert
- Data distribution mismatch: the policy encounters observations that are not in the training data set

Behavior Cloning

Page 7:

Data Distribution Mismatch

Page 8:

Imitation Driving

[Bojarski et al. 2016]

Page 9:

Imitation Driving - why does it work?

[Bojarski et al. 2016]

Page 10:

Dataset Aggregation

Problem: the distribution of observations visited by the learned policy, p_π(o_t), drifts away from the training distribution p_data(o_t)

Solution: move p_data(o_t) closer to p_π(o_t)

How? → Add samples of o_t ~ p_π(o_t), annotated with actions by the expert, to the dataset

DAgger algorithm (Dataset Aggregation), see the sketch below:

1. Train the policy π_θ(a_t | o_t) from expert data D
2. Run π_θ to collect a new dataset of observations D_π
3. Ask the expert to label D_π with actions a_t
4. Aggregate: D ← D ∪ D_π
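A compact Python sketch of the loop above; train_policy, run_policy and query_expert are hypothetical placeholders for "fit a policy to the dataset", "roll out the current policy and collect observations", and "ask the expert to label one observation":

def dagger(expert_data, train_policy, run_policy, query_expert, n_iter=10):
    dataset = list(expert_data)                  # D: (observation, action) pairs from the expert
    policy = train_policy(dataset)               # 1. train pi_theta on expert data
    for _ in range(n_iter):
        observations = run_policy(policy)        # 2. run pi_theta to collect new observations
        labeled = [(o, query_expert(o)) for o in observations]   # 3. expert labels them
        dataset += labeled                       # 4. aggregate: D <- D u D_pi
        policy = train_policy(dataset)           # retrain on the aggregated dataset
    return policy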

Page 11:

Failing to fit the expert

Reasons for failing:

- Multimodal expert behavior
- Non-Markovian expert behavior

Page 12:

Multimodal expert behavior

[1]

Page 13:

Non-Markovian expert behavior

[Figure: the policy conditions only on the current observation and outputs action probabilities, e.g. Left 0.1, Right 0.1, Straight 0.8.]

Page 14:

DAgger

Problems:

- Repeatedly query the expert
- Execute an unsafe/partially trained policy

Page 15:

Non-human experts

- For training: learn a classifier from an expert brute-force algorithm (i.e. exhaustive search)
- Query the expert algorithm using DAgger
- For inference: use the learned policy instead of the slow expert

Page 16:

Cost function for Imitation

Reward: r(s, a) = log p(a = π*(s) | s)

Loss: negative log-likelihood of the expert's actions (cross-entropy loss)
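A minimal sketch of this loss in PyTorch, assuming a policy network that outputs logits over the discrete actions (the function and variable names are illustrative):

import torch.nn.functional as F

def behavior_cloning_loss(policy, observations, expert_actions):
    logits = policy(observations)                   # (B, num_actions) unnormalised scores
    return F.cross_entropy(logits, expert_actions)  # expert_actions: (B,) integer action labels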

Page 17:

References

- [1] UC Berkeley Deep RL course by Sergey Levine, lecture 1

- [2] A. Giusti et al.: A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots (2016)

- [3] M. Bojarski et al.: End to End Learning for Self-Driving Cars (2016)

Page 18:

Deep Reinforcement Learning

Niklas Wetzel

Page 19:

Outline

Recap reinforcement learning (RL) concepts

Value function approximation

Deep Q-Networks

Page 20:

Components of RL: Recap

Markov Decision Process (MDP)

Value-functions, action-value functions

Policy estimation (prediction), policy improvement, policy iteration

Monte-Carlo estimation, temporal difference (TD) learning

Page 21:

Recap: Markov Decision Process

Defined via a state space, an action space, dynamics, rewards, and a discount factor

Dynamics defined by the transition probabilities p(s′, r | s, a)

Notation: we write p(s′ | s, a) for the dynamics, assuming they are the same for all t (a stationary MDP)

Goal: To maximize the expected return

Page 22:

Markov Decision Process Terminology

Policy: π(a | s) = Pr(A_t = a | S_t = s)

Return: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k≥0} γ^k R_{t+k+1}

State-value function: v_π(s) = E_π[G_t | S_t = s]

Action-value function: q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

Page 23:

How to estimate the value functions

Monte Carlo: Average over samples (unbiased)

Bootstrapping via TD targets (biased)
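A small sketch of the two update rules in tabular form (V is assumed to be a dict of state values, alpha the step size, gamma the discount factor; the names are illustrative):

def mc_update(V, s, G, alpha):
    # Monte Carlo: move V(s) towards the sampled return G_t (unbiased, high variance)
    V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, gamma, alpha):
    # TD(0): move V(s) towards the bootstrapped target r + gamma * V(s') (biased, lower variance)
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])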

Page 24:

Large-Scale Reinforcement Learning

Reinforcement learning can be used to solve large problems, e.g.

Backgammon: 10^20 states

Chess: 10^47 states

Go: 10^170 states

Helicopter: continuous state space

How can we scale up methods for prediction and control?

Page 25:

Value Function Problems

Until now the value function has been represented as a lookup table:

Every state s has an entry v_π(s)

Or every state-action pair (s, a) has an entry q_π(s, a)

Problems with large MDPs:

Too many states and/or actions to store in memory

Too slow to learn the value of each state individually

Page 26:

Value Function Approximation

Solution for large MDPs:

Estimate the value functions with function approximation: V_θ(s) ≈ v_π(s) and Q_θ(s, a) ≈ q_π(s, a)

Generalize from seen states to unseen states

Update the parameters θ using MC or TD learning

Page 27:

Function Approximator Types?

We choose neural networks as function approximators, but other choices are possible, e.g.

Linear combination of features

Decision trees

Nearest neighbour methods

Fourier / wavelet bases

We require a training method that is suitable for

non-stationary, non-i.i.d. data

Page 28:

Gradient Descent

Task: find a local minimum of a differentiable function J(θ)

Adjust θ in the direction of the negative gradient:

Δθ = −½ α ∇_θ J(θ),

where α is the step size
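As a toy illustration of the idea (plain gradient descent, without the ½ convention used above), on an assumed one-dimensional objective J(θ) = (θ − 3)²:

theta, alpha = 0.0, 0.1
for _ in range(100):
    grad = 2.0 * (theta - 3.0)   # dJ/dtheta for J(theta) = (theta - 3)^2
    theta -= alpha * grad        # step in the direction of the negative gradient
print(theta)                     # approaches the minimiser theta = 3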

Page 29:

Value Function Approx. by GD

Goal: find parameters θ minimizing the mean-squared error between the approximate value function V_θ(s) and the true value function v_π(s)
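Written out, this is the standard mean-squared-error objective (θ is used here for the parameters the slides denote generically):

J(\theta) = \mathbb{E}_{\pi}\!\left[\left(v_\pi(S) - V_\theta(S)\right)^2\right]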

Gradient descent (GD) finds a local minimum

Page 30:

Value Function Approx. by SGD

Stochastic gradient descent (SGD) samples states S(1),…,S(N) and estimates the gradient

Expected update is equal to the full gradient update

Page 31:

Incremental Prediction Algorithms

So far we have assumed that the true value function v_π(s) is given by a supervisor

In RL there is no supervisor, only rewards

In practice, we substitute a target for v_π(s)

For MC, the target is the return G_t

For TD, the target is the TD target R_{t+1} + γ V_θ(S_{t+1})

Page 32:

MC with Value Function Approximation

The return G_t is an unbiased sample of the true value v_π(S_t)

Can therefore apply supervised learning to the “training data” ⟨S_1, G_1⟩, …, ⟨S_T, G_T⟩

MC guaranteed to converge to a local optimum using non-linear value function approximation
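A sketch of this as a single supervised regression step in PyTorch (V is assumed to be any module mapping a batch of states to scalar values; the names are illustrative):

def mc_fit_step(V, optimizer, states, returns):
    # regress V_theta(S_t) onto the observed returns G_t (the "training data" pairs)
    loss = (V(states).squeeze(-1) - returns).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()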

Page 33:

TD with Value Function Approximation

The TD target is a biased sample of the true value v_π(S_t)

Can still apply supervised learning to “training data” by substituting return samples with TD target samples

Not guaranteed to converge to a local optimum using non-linear value function approximation
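The corresponding TD(0) step, sketched in PyTorch; the bootstrapped target is computed without gradient tracking, i.e. it is treated as a constant (a "semi-gradient" update), which is one reason the convergence guarantee is lost (names and shapes are illustrative assumptions):

import torch

def td0_fit_step(V, optimizer, s, r, s_next, gamma=0.99):
    with torch.no_grad():                          # TD target is treated as a constant
        target = r + gamma * V(s_next).squeeze(-1)
    loss = (V(s).squeeze(-1) - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()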

Page 34:

Action-Value Function Approximation

Approximate the action-value function: Q_θ(s, a) ≈ q_π(s, a)

Minimize the mean-squared error between Q_θ(s, a) and the true action-value function q_π(s, a)

Page 35:

Action-Value Function Approx. by SGD

Stochastic gradient descent (SGD) samples state-action pairs (S(1), A(1)), …, (S(N), A(N)) and estimates the gradient

Expected update is equal to the full gradient update

Page 36:

TD-Control: Optimal Value Functions

Optimal state-value function: v*(s) = max_π v_π(s)

Optimal action-value function: q*(s, a) = max_π q_π(s, a)

TD control goal: Estimate optimal value functions

Page 37:

Sarsa: On-policy TD-control. Approximate q_π with Q_θ

TD targets given by R_{t+1} + γ Q_θ(S_{t+1}, A_{t+1})

Policy improvement: update the behaviour policy π to be (ε-)greedy w.r.t. Q_θ

Repeat until convergence:

Page 38:

Control with Value Function Approximation

Policy evaluation: approximate policy evaluation, Q_θ ≈ q_π

Policy improvement: any improvement step, e.g. ε-greedy

Page 39:

Policy iteration

Page 40:

Q-learning: Off-policy TD-control. Directly approximate q* with Q_θ

TD target given by R_{t+1} + γ max_a Q_θ(S_{t+1}, a)

Advantage: off-policy

Page 41:

Q-learning with function approximator 1

Q-iteration algorithm (online):
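The algorithm itself appears only as a figure on the slide; a plausible reconstruction, following the generic online Q-iteration scheme from the CS 294-112 lectures (θ denotes the Q-network parameters, α the step size):

1. Take an action $a_i$ and observe the transition $(s_i, a_i, s'_i, r_i)$.
2. Compute the target $y_i = r_i + \gamma \max_{a'} Q_\theta(s'_i, a')$.
3. Update $\theta \leftarrow \theta - \alpha \, \frac{dQ_\theta}{d\theta}(s_i, a_i)\,\bigl(Q_\theta(s_i, a_i) - y_i\bigr)$.

Consecutive transitions come from the same trajectory, so successive samples are strongly correlated.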

Problem: Highly correlated states

Page 42:

Deep Q-learning

Idea: Decorrelate data with replay buffer
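A minimal replay buffer sketch in Python (the interface and capacity are illustrative assumptions): storing past transitions and sampling random minibatches breaks the temporal correlation between consecutive samples.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are dropped first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)   # approximately i.i.d. minibatch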

Page 43:

Q-learning with function approximator 2

Q-iteration algorithm (offline):
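This algorithm is also shown as a figure; a plausible reconstruction, again following the fitted Q-iteration scheme from the CS 294-112 lectures:

1. Collect a dataset of transitions $\{(s_i, a_i, s'_i, r_i)\}$, e.g. with some exploration policy.
2. Set $y_i \leftarrow r_i + \gamma \max_{a'} Q_\theta(s'_i, a')$ for every sample $i$.
3. Set $\theta \leftarrow \arg\min_\theta \sum_i \bigl(Q_\theta(s_i, a_i) - y_i\bigr)^2$.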

Careful: this is not true SGD, since y_i depends on θ but is treated (“viewed”) as a constant in step 3.

Page 44:

Q-learning with function approximator 3

Idea: Fix TD targets for a given time period

Q-iteration with replay buffer and target network:

You will implement this in your RL exercise
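A hedged sketch of one such update step in PyTorch; the tensor shapes, hyperparameters and the target-network refresh interval are assumptions for illustration, not the exercise's reference implementation.

import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch                      # minibatch sampled from the replay buffer
    with torch.no_grad():                              # targets come from the frozen target network
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_theta(s_i, a_i); a: (B,) int64, done: (B,) float
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# every K updates, copy the online weights into the target network:
# target_net.load_state_dict(q_net.state_dict())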

Page 45:

Sources:

- CS 294-112: Deep Reinforcement Learning, Sergey Levine (UC Berkeley)

- UCL Course on RL, David Silver

- Reinforcement Learning: An Introduction, R. Sutton and A. Barto