Model Based Reinforcement Learning
Deep Reinforcement Learning and Control
Katerina Fragkiadaki, Carnegie Mellon School of Computer Science


Page 1:

Model Based Reinforcement Learning

Deep Reinforcement Learning and Control

Katerina Fragkiadaki

Carnegie Mellon School of Computer Science

Page 2:

Model

[Figure: a model maps a state s and action a to the next state s′ and reward r. From Lecture 8: Integrating Learning and Planning. In model-free RL, the agent-environment loop exchanges state S_t, reward R_t, and action A_t directly.]

A model is anything the agent can use to predict how the environment will respond to its actions; concretely, the state transition T(s′ | s, a) and the reward R(s, a).

Page 3:

Model learning

We learn the model from experience tuples (s, a, r, s′): a supervised learning problem.

[Figure: a learning machine (random forest, deep neural network, linear/shallow predictor) takes (s, a) as input and predicts (s′, r).]
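A minimal sketch of this supervised formulation, assuming a simple MLP dynamics model trained by regression on (s, a, r, s′) tuples; the architecture, names, and hyperparameters are illustrative, not from the slides.

```python
# Sketch: fit a one-step dynamics model on (s, a, r, s') experience tuples by regression.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),   # predicts next state and reward
        )

    def forward(self, s, a):
        out = self.net(torch.cat([s, a], dim=-1))
        return out[..., :-1], out[..., -1]      # (s', r)

def fit(model, dataset, epochs=50, lr=1e-3):
    """dataset yields batches of experience tuples (s, a, r, s')."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for s, a, r, s_next in dataset:
            pred_s, pred_r = model(s, a)
            loss = ((pred_s - s_next) ** 2).mean() + ((pred_r - r) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```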

Page 4:

Learning Dynamics

Newtonian physics equations vs. neural networks:

• System identification: the dynamics equations are assumed given from physics and only a few parameters are unknown. Much easier to learn, but suffers from under-modeling and bad models.
• Neural networks: a general parametric form with lots of unknown parameters and no prior from physics knowledge. Very flexible, but very hard to get to generalize.

Page 5:

Observation prediction

[Figure: a learning machine (random forest, deep neural network, linear/shallow predictor) takes the current observation o and action a and predicts the next observation o′ and reward r.]

Our model tries to predict the observations. Why? Because MANY different rewards can be computed once I have access to the future visual observation, e.g., make Mario jump, make Mario move to the right or to the left, lie down, make Mario jump on the wall and then jump back down again, etc. If I were just predicting rewards, then I could only plan towards that specific goal, e.g., win the game, same as in the model-free case.

Unroll the model by feeding the prediction back as input!
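A minimal sketch of such an unroll, assuming a one-step model with the (s, a) → (s′, r) interface sketched earlier; errors compound with the horizon, which motivates the latent and coarse models that follow.

```python
# Sketch: unroll a learned one-step model by feeding its prediction back as the next input.
def unroll(model, s0, actions):
    states, rewards, s = [], [], s0
    for a in actions:              # actions: a planned or candidate action sequence
        s, r = model(s, a)         # the prediction becomes the next input state
        states.append(s)
        rewards.append(r)
    return states, rewards
```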

Page 6:

Prediction in a latent space

Our model tries to predict a (potentially latent) embedding, from which rewards can be computed, e.g., by matching the embedding of my desired goal image to the prediction:

r = exp(−∥h′ − h_g∥)

[Figure: a learning machine (random forest, deep neural network, linear/shallow predictor) maps the encoded observation h and action a to the predicted embedding h′, which is compared against the goal embedding h_g to produce the reward r.]
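A minimal sketch of this goal-matching reward, assuming hypothetical `encoder` and `latent_model` components:

```python
# Sketch: reward from matching a predicted latent embedding to a goal-image embedding,
# r = exp(-||h' - h_g||).
import torch

def latent_reward(encoder, latent_model, obs, action, goal_image):
    h = encoder(obs)                     # h   = embedding of the current observation
    h_next = latent_model(h, action)     # h'  = predicted next embedding
    h_goal = encoder(goal_image)         # h_g = embedding of the desired goal image
    return torch.exp(-torch.norm(h_next - h_goal, dim=-1))
```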

Page 7:

Prediction in a latent space

Our model tries to predict a (potentially latent) embedding, from which rewards can be computed, e.g., by matching the embedding of my desired goal image to the prediction. One such feature encoding we have seen is the one that keeps from the observation ONLY whatever is controllable by the agent:

min_{θ, ψ}  ∥T(h(s), a; θ) − h(s′)∥ + ∥Inv(h(s), h(s′); ψ) − a∥

[Figure: the encoder h maps s and s′ to h(s) and h(s′); the forward model T(h(s), a; θ) predicts h(s′), and the inverse model Inv(h(s), h(s′); ψ) predicts the action a.]
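A minimal sketch of this joint forward/inverse objective, assuming hypothetical encoder `h`, forward model `T`, and inverse model `Inv` modules:

```python
# Sketch: forward model in latent space plus an inverse model that recovers the action,
# so the encoding h keeps only what the agent can control.
import torch

def controllable_latent_loss(h, T, Inv, s, a, s_next):
    z, z_next = h(s), h(s_next)                               # encode both observations
    forward_loss = torch.norm(T(z, a) - z_next, dim=-1).mean()    # ||T(h(s),a) - h(s')||
    inverse_loss = torch.norm(Inv(z, z_next) - a, dim=-1).mean()  # ||Inv(h(s),h(s')) - a||
    return forward_loss + inverse_loss
```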


Page 9:

Prediction in a latent space

Our model tries to predict a (potentially latent) embedding, from which rewards can be computed, e.g., by matching the embedding of my desired goal image to the prediction.

Unroll the model by feeding the predicted embedding back as input!

r = exp(−∥h′ − h_g∥)

Page 10:

Avoid or minimize unrolling

Unrolling quickly causes errors to accumulate. We can instead consider coarse models, where we input a long sequence of actions and predict the final embedding in one shot, without unrolling.

r = exp(−∥h′ − h_g∥)

[Figure: a learning machine (random forest, deep neural network, linear/shallow predictor) maps the observation o and a whole action sequence directly to the final embedding h′, which is matched against the goal embedding h_g.]
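A minimal sketch of such a coarse model, assuming the whole action sequence is flattened and mapped to the final embedding in a single forward pass; the architecture is illustrative:

```python
# Sketch: a "coarse" model that maps the current embedding plus an entire action sequence
# to the final embedding in one shot, avoiding step-by-step unrolling.
import torch
import torch.nn as nn

class CoarseModel(nn.Module):
    def __init__(self, latent_dim, action_dim, horizon, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + horizon * action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, h, action_seq):            # action_seq: (batch, horizon, action_dim)
        flat = action_seq.flatten(start_dim=1)
        return self.net(torch.cat([h, flat], dim=-1))   # predicted final embedding h'

# reward as before: r = torch.exp(-torch.norm(h_pred - h_goal, dim=-1))
```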

Page 11:

Why model learning

• Online planning at test time: model predictive control
• Model-based RL: training policies using simulated experience
• Efficient exploration

Page 12:

Why model learning

• Online planning at test time: model predictive control
• Model-based RL: training policies using simulated experience
• Efficient exploration

Why learn the model? Given a state, I unroll my model forward and seek the action that results in the highest reward. How do I select this action?
1. I discretize my action space and perform tree search.
2. I use continuous gradient descent to optimize over actions.


Page 14:

Backpropagate to actions

[Figure: a computation graph unrolled over time: the policy πθ(s) produces actions a0, a1, ..., the dynamics T(s, a) produce states s0, s1, ..., and the rewards r0, r1, ... come from ρ(s, a). Reward and dynamics are known.]

deterministic node: the value is a deterministic function of its input
stochastic node: the value is sampled based on its input (which parametrizes the distribution to sample from)

Page 15:

Backpropagate to actions

[Figure: the unrolled graph s0 → s1 → ... → sT through the dynamics T(s, a; θ), with the actions a0, a1, ... as free variables and the reward r computed at the end.]

No policy is learned; actions are selected directly by backpropagating through the dynamics, the continuous analog of online planning.

Given a state, I unroll my model forward and seek the actions that result in the highest reward. How do I select these actions?
1. I discretize my action space and perform tree search.
2. I use continuous gradient descent to optimize over actions.

The dynamics are frozen; we backpropagate to the actions directly.
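A minimal sketch of option 2, gradient-based action selection through a frozen, differentiable dynamics model; `dynamics` and `reward_fn` are assumed callables:

```python
# Sketch: continuous planning by backpropagating through a frozen dynamics model
# to the actions themselves.
import torch

def plan_actions(dynamics, reward_fn, s0, horizon, action_dim, steps=100, lr=0.1):
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)      # only the actions are optimized
    for _ in range(steps):
        s, total_reward = s0, 0.0
        for t in range(horizon):                  # unroll the frozen model
            s = dynamics(s, actions[t])
            total_reward = total_reward + reward_fn(s, actions[t])
        loss = -total_reward                      # maximize reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return actions.detach()
```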

Page 16:

Why model learning

• Online planning at test time: model predictive control
• Model-based RL: training policies using simulated experience
• Efficient exploration

Page 17:

DPG in Simulated Physics
• Physics domains are simulated in MuJoCo
• End-to-end learning of the control policy from raw pixels s
• The input state s is a stack of raw pixels from the last 4 frames
• Two separate convnets are used for Q and π
• The policy π is adjusted in the direction that most improves Q

[Figure: the state s feeds two DNNs, the critic Q(s, a; θ^Q) and the policy π(s; θ^μ); a noise sample z perturbs the action.]

z ∼ N(0, 1),   a = μ(s; θ) + z σ(s; θ)

Remember: Stochastic Value Gradients, SVG(0)
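A minimal sketch of this reparameterized action sampling, assuming separate (hypothetical) networks for μ and σ:

```python
# Sketch: sample z ~ N(0, 1) and set a = mu(s; theta) + z * sigma(s; theta),
# so gradients flow through mu and sigma into the policy parameters.
import torch

def sample_action(mu_net, sigma_net, s):
    mu, sigma = mu_net(s), sigma_net(s)
    z = torch.randn_like(mu)        # z ~ N(0, 1)
    return mu + z * sigma           # differentiable w.r.t. the policy parameters
```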

Page 18:

Backpropagate to the policy

[Figure: the same computation graph unrolled over time: the policy πθ(s) produces actions a0, a1, ..., the dynamics T(s, a) produce states s0, s1, ..., and the rewards r0, r1, ... come from ρ(s, a); gradients flow back to the policy parameters θ. Reward and dynamics are known.]

deterministic node: the value is a deterministic function of its input
stochastic node: the value is sampled based on its input (which parametrizes the distribution to sample from)

Page 19:

Backpropagate to the policy

[Figure: the unrolled graph where the policy πθ(s) produces a0, a1, ..., the learned dynamics T(s, a, θ) produce s0, s1, ..., and a critic Q(s, a) scores the end of the horizon.]

The dynamics are frozen; we backpropagate to the policy directly by maximizing Q within a time horizon.
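A minimal sketch of one such policy update, unrolling the frozen learned dynamics for a short horizon and maximizing the critic Q at the end; all module names are assumptions:

```python
# Sketch: train the policy by backpropagating through a frozen dynamics model
# and maximizing Q within a short horizon.
import torch

def policy_improvement_step(policy, dynamics, Q, s0, horizon, policy_opt):
    s = s0
    for _ in range(horizon):
        a = policy(s)
        s = dynamics(s, a)          # dynamics are frozen: no optimizer step on their parameters
    a = policy(s)
    loss = -Q(s, a).mean()          # maximize Q at the end of the horizon
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
```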

Page 20:

Why model learning

• Online planning at test time: model predictive control
• Model-based RL: training policies using simulated experience
• Efficient exploration

Page 21:

Challenges
• Errors accumulate during unrolling.
• A policy learnt on top of an inaccurate model is upper-bounded by the accuracy of the model.
• Policies exploit model errors by being overly optimistic.
• With lots of experience, model-free methods would always do better.

Answers:
• Use the model to pre-train your policy, then finetune while being model-free.
• Use the model to explore fast, but always try actions not suggested by the model so you do not suffer its biases.
• Build the model on top of a latent space which is succinct and easily predictable.
• Abandon global models and train local linear models, which do not generalize but help you solve your problem fast, then distill the knowledge of the actions into a general neural network policy (next week).

Page 22:

Model Learning

Three questions always in mind:
• What shall we be predicting?
• What is the architecture of the model, and what structural biases should we add to get it to generalize?
• What is the action representation?

Page 23:


Malik

How do we learn to play Billiards?

• First, we transfer all knowledge about how objects move that we have accumulated so far.

• Second, we watch other people play and practise ourselves, to finetune such model knowledge.

Page 24:


Malik

How do we learn to play Billiards?

Pages 25-28: [image-only slides]

Page 29:

Predictive Visual Models of Physics for Playing Billiards, Fragkiadaki et al., ICLR 2016

Learning Action-Conditioned Billiard Dynamics

Page 30:

Learning Action-Conditioned Billiard Dynamics

[Figure: a CNN takes the image and the applied force field as input to predict the future.]

Q: will our model be able to generalize across different numbers of balls present?

Page 31:

Learning Action-Conditioned Billiard Dynamics

[Figure: world-centric prediction vs. object-centric prediction]

Q: will our model be able to generalize across different numbers of balls present?

Pages 32-37: [image-only slides]

Page 38:

Object-centric Billiard Dynamics

[Figure: a CNN takes a glimpse around each ball, together with the applied force, and predicts the ball displacement dx.]

The object-centric CNN is shared across all objects in the scene. We apply it one object at a time to predict the object's future displacement. We then copy-paste the ball at the predicted location and feed the result back as input.
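A minimal sketch of this per-object rollout; the shared `cnn` and the `crop_around`/`paste_ball` rendering helpers are hypothetical and must be supplied:

```python
# Sketch: one object-centric rollout step. A shared CNN predicts each ball's displacement
# from a local glimpse; the ball is then redrawn at the new location and the resulting
# frame is fed back in for the next step.
def object_centric_step(cnn, crop_around, paste_ball, frame, ball_positions, forces):
    """crop_around and paste_ball are user-supplied rendering helpers."""
    new_positions = []
    for pos, force in zip(ball_positions, forces):
        glimpse = crop_around(frame, pos)      # local context around this ball
        dx = cnn(glimpse, force)               # shared CNN predicts this ball's displacement
        new_positions.append(pos + dx)
    next_frame = frame
    for old, new in zip(ball_positions, new_positions):
        next_frame = paste_ball(next_frame, old, new)   # redraw the ball at its new location
    return next_frame, new_positions           # feed next_frame back in as the next input
```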

Page 39: [media-only slide]

Page 40:

Playing Billiards

How should I push the red ball so that it collides with the green one? We search in the force space for the push that achieves this.

Page 41:

Learning Dynamics

Two good ideas so far:
1) Object graphs instead of images. Such an encoding allows us to generalize across different numbers of entities in the scene.
2) Predict motion instead of appearance. Since appearance does not change, predicting motion suffices. Let's predict only the dynamic properties and keep the static ones fixed.

Page 42:

Billiards
• We predicted object displacement trajectories.
• We had one CNN per object in the scene and shared the weights across objects.
• The action was a force applied to each object.

Page 43:

Graph Encoding

In the billiards case, object computations were coordinated by using a large enough context around each object (node). What if we explicitly send each node's computations to neighboring nodes to be taken into account when computing their features?

Graph Networks as Learnable Physics Engines for Inference and Control, Sanchez-Gonzalez et al.

We will encode a robotic agent as a graph, where nodes are the different bodies of the agent and edges are the joints/links between the bodies.

Page 44:

Graph Encoding

Graph Networks as Learnable Physics Engines for Inference and Control, Sanchez-Gonzalez et al.

Node features
• Observable/dynamic: 3D position, 4D quaternion orientation, linear and angular velocities
• Unobservable/static: mass, inertia tensor
• Actions: forces applied on the joints

Page 45:

Graph Forward Dynamics

Graph Networks as Learnable Physics Engines for Inference and Control, Sanchez-Gonzalez et al.

Predictions: we predict only the dynamic node features, specifically their temporal difference. Train with regression.

Node features
• Observable/dynamic: 3D position, 4D quaternion orientation, linear and angular velocities
• Unobservable/static: mass, inertia tensor
• Actions: forces applied on the joints
• No visual input here, much easier!
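A minimal sketch of a graph-network forward-dynamics step in this spirit: per-edge messages are aggregated at each node and a shared node update predicts the temporal difference of the node features. Shapes and MLPs are illustrative, not the paper's exact architecture:

```python
# Sketch: graph-network forward dynamics. Edges are joints, nodes are bodies;
# the node update is shared across nodes, so it generalizes across graph sizes.
import torch
import torch.nn as nn

class GraphDynamics(nn.Module):
    def __init__(self, node_dim, action_dim, hidden=128):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * node_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden))
        self.node_mlp = nn.Sequential(nn.Linear(node_dim + action_dim + hidden, hidden),
                                      nn.ReLU(), nn.Linear(hidden, node_dim))

    def forward(self, nodes, actions, edges):
        # nodes: (N, node_dim), actions: (N, action_dim), edges: list of (sender, receiver)
        agg = torch.zeros(nodes.shape[0], self.edge_mlp[-1].out_features)
        for s, r in edges:
            msg = self.edge_mlp(torch.cat([nodes[s], nodes[r]], dim=-1))
            agg[r] = agg[r] + msg                          # sum incoming messages per node
        delta = self.node_mlp(torch.cat([nodes, actions, agg], dim=-1))
        return nodes + delta                               # predict the temporal difference, then add it
```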

Page 46:

Robots as graphs
• We predicted only the dynamic node features.
• Our network is a graph network; the node update function is shared across all nodes (thus we can generalize across different numbers of nodes).
• The actions are forces applied to each node.


Page 48:

Graph Model Predictive Control

Graph Networks as Learnable Physics Engines for Inference and Control, Sanchez-Gonzalez et al.


Page 50:

Visual dynamics using motion transformation

Differentiable warping
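A minimal sketch of differentiable warping with a predicted motion field, using bilinear sampling via `torch.nn.functional.grid_sample`; the flow-prediction network itself is not shown, and `flow` is its assumed output in pixels:

```python
# Sketch: differentiably warp the current frame with a predicted motion field.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) pixel displacements (dx, dy)."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]        # where each output pixel samples from
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # normalize sampling locations to [-1, 1], as grid_sample expects
    grid = torch.stack([2 * grid_x / (W - 1) - 1, 2 * grid_y / (H - 1) - 1], dim=-1)
    return F.grid_sample(frame, grid, align_corners=True)
```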

Page 51:

green: input, red: sampled future motion field and corresponding frame completion

Visual dynamics using motion transformation

Page 52:

Visual dynamics using motion transformation

Goal representation: move certain pixels of the initial image to desired locations.

We will learn a model of pixel motion displacements.

Page 53:

Visual dynamics using motion transformation

Differentiable warping

Can I use this model?

Page 54:

Visual dynamics using motion transformation

Page 55:

Visual dynamics using motion transformation

Self-Supervised Visual Planning with Temporal Skip Connections, Ebert et al.

Page 56:

Visual dynamics using motion transformation

https://sites.google.com/view/sna-visual-mpc

Page 57:

What should we be predicting?

Do we really need to be predicting observations?

What if we knew the quantities that matter for the goals I care about? For example, I care to predict where the object will end up during pushing, but I do not care exactly where it ends up when it falls off the table, and I do not care about its intensity changes due to lighting.

Let's assume we knew this set of important, useful-to-predict features. Would we do better? Yes! We would win the Doom competition.

Page 58:

Learning dynamics of goal-related measurements

Main idea: you are provided with a set of measurements m paired with the input visual (and other sensory) observations. Measurements can be health, ammunition levels, or enemies killed. Your goal can be expressed as a combination of those measurements.

Measurement offsets are the prediction targets: f = (m_{t+τ1} − m_t, ⋯, m_{t+τn} − m_t)

(Multi-)goal representation: u(f, g) = g⊤f

Page 59:

Learning dynamics of goal-related measurements

Train a deep predictor. No unrolling! One shot prediction of future values:

No policy, direct action selection:
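A minimal sketch of such direct selection over a discrete action set, assuming a hypothetical `predictor` that outputs the measurement-offset vector f for each candidate action:

```python
# Sketch: pick the action whose predicted measurement offsets f maximize the
# goal-weighted utility u(f, g) = g^T f.
import torch

def select_action(predictor, obs, measurements, goal_g, actions):
    best_a, best_u = None, -float("inf")
    for a in actions:                              # discrete candidate actions
        f = predictor(obs, measurements, a)        # predicted offsets (m_{t+tau} - m_t, ...)
        u = torch.dot(goal_g, f)                   # u(f, g) = g^T f
        if u.item() > best_u:
            best_a, best_u = a, u.item()
    return best_a
```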

Page 60:

Learning dynamics of goal-related measurements

Action selection:

Training: we learn the model using an ε-greedy exploration policy over the currently best chosen actions.

Page 61:

Learning dynamics of goal-related measurements

Page 62:

Learning dynamics of goal-related measurements

Page 63:

Exploration by Planning

Skill-guided Look-ahead Exploration for Reinforcement Learning of Manipulation Policies, submitted

1. Learn a set of skills, namely grasp, reach, and transfer, using HER.
2. For each skill, we have a multistep inverse model π(g, s).
3. For each skill, we further train a forward model T(s, g) → s′.
4. In each exploration step, we look ahead by chaining multistep skills, as opposed to single steps (see the sketch below).
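A minimal sketch of this skill-chaining look-ahead; the per-skill forward models, candidate subgoal plans, and distance function are all assumptions:

```python
# Sketch: look ahead by chaining per-skill forward models T_k(s, g) over candidate
# subgoal sequences, and keep the chain whose predicted end state is closest to the goal.
def lookahead(skill_models, s0, candidate_plans, final_goal, distance):
    best_plan, best_d = None, float("inf")
    for plan in candidate_plans:                 # plan = [(skill_id, subgoal), ...]
        s = s0
        for skill_id, subgoal in plan:
            s = skill_models[skill_id](s, subgoal)   # multistep forward model T(s, g) -> s'
        d = distance(s, final_goal)
        if d < best_d:
            best_plan, best_d = plan, d
    return best_plan
```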
