Reinforcement Learning: Review (CS 330)

Page 1: Reinforcement Learning: Review

Reinforcement Learning: Review

CS 330

Page 2: Reinforcement Learning: Review

Reminders

Today:

Monday next week:

Project proposals due

Homework 2 due, Homework 3 out

Page 3: Reinforcement Learning: Review

Why Reinforcement Learning?

Common applications (most deployed ML systems):
robotics, autonomous driving, language & dialog, business operations, finance

+ a key aspect of intelligence

Isolated action that doesn’t affect the future? Supervised learning?

Page 4: Reinforcement Learning: Review

The Plan

Reinforcement learning problem

Policy gradients

Q-learning

Page 5: Reinforcement Learning: Review

The Plan

Reinforcement learning problem

Policy gradients

Q-learning

Page 6: Reinforcement Learning: Review

supervised learning vs. sequential decision making:

- object classification vs. object manipulation
- i.i.d. data vs. action affects next state
- large labeled, curated dataset vs. how to collect data? what are the labels?
- well-defined notions of success vs. what does success mean?

Page 7: Reinforcement Learning: Review

Terminology & notation

1. run away
2. ignore
3. pet

Slide adapted from Sergey Levine

Page 8: Reinforcement Learning: Review

Imitation Learning

[diagram: training data → supervised learning; images from Bojarski et al. ‘16, NVIDIA]

Slide adapted from Sergey Levine

Imitation Learning vs Reinforcement Learning?

Page 9: Reinforcement Learning: Review

Reward functions

Slide adapted from Sergey Levine

Page 10: Reinforcement Learning: Review

The goal of reinforcement learning

Slide adapted from Sergey Levine

Page 11: Reinforcement Learning: Review

The goal of reinforcement learning

Slide adapted from Sergey Levine

Page 12: Reinforcement Learning: Review

The goal of reinforcement learning

Slide adapted from Sergey Levine

Page 13: Reinforcement Learning: Review

What is a reinforcement learning task?

Supervised learning: a task 𝒯𝑖 ≜ {𝑝𝑖(𝐱), 𝑝𝑖(𝐲|𝐱), ℒ𝑖}, i.e. data-generating distributions and a loss.

Reinforcement learning: a task 𝒯𝑖 ≜ {𝒮𝑖 , 𝒜𝑖 , 𝑝𝑖(𝐬1), 𝑝𝑖(𝐬′|𝐬, 𝐚), 𝑟𝑖(𝐬, 𝐚)}, i.e. a Markov decision process with state space 𝒮𝑖 , action space 𝒜𝑖 , initial state distribution 𝑝𝑖(𝐬1), dynamics 𝑝𝑖(𝐬′|𝐬, 𝐚), and reward 𝑟𝑖(𝐬, 𝐚).

much more than the semantic meaning of “task”!
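
To make the task tuple concrete, here is a minimal sketch of how such a task could be represented in code. The class name, field names, and callable signatures below are illustrative choices, not an interface from the course:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class RLTask:
    """One task T_i = {S_i, A_i, p_i(s1), p_i(s'|s,a), r_i(s,a)}, i.e. an MDP."""
    state_space: Any                               # S_i
    action_space: Any                              # A_i
    sample_initial_state: Callable[[], Any]        # s1 ~ p_i(s1)
    sample_next_state: Callable[[Any, Any], Any]   # s' ~ p_i(s'|s,a), the dynamics
    reward: Callable[[Any, Any], float]            # r_i(s,a)
```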

Page 14: Reinforcement Learning: Review

Example Task Distributions

A task: 𝒯𝑖 ≜ {𝒮𝑖 , 𝒜𝑖 , 𝑝𝑖(𝐬1), 𝑝𝑖(𝐬′|𝐬, 𝐚), 𝑟𝑖(𝐬, 𝐚)}

Character animation, across maneuvers: 𝑟𝑖(𝐬, 𝐚) varies

Across garments & initial states: 𝑝𝑖(𝐬1), 𝑝𝑖(𝐬′|𝐬, 𝐚) vary

Multi-robot RL: 𝒮𝑖 , 𝒜𝑖 , 𝑝𝑖(𝐬1), 𝑝𝑖(𝐬′|𝐬, 𝐚) vary

Page 15: Reinforcement Learning: Review

The Plan

Reinforcement learning problem

Policy gradients

Q-learning

Page 16: Reinforcement Learning: Review

The anatomy of a reinforcement learning algorithm

Page 17: Reinforcement Learning: Review

Evaluating the objective

Slide adapted from Sergey Levine

Page 18: Reinforcement Learning: Review

Direct policy differentiation

a convenient identity

Slide adapted from Sergey Levine

Page 19: Reinforcement Learning: Review

Direct policy differentiation

Slide adapted from Sergey Levine

Page 20: Reinforcement Learning: Review

Evaluating the policy gradient

generate samples (i.e. run the policy)

fit a model to estimate return

improve the policy

Slide adapted from Sergey Levine
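
As a concrete anchor for the three boxes (run the policy, estimate returns, improve the policy), here is a minimal REINFORCE-style sketch of the vanilla policy-gradient estimator. It assumes a Gymnasium-style `env` (reset/step), a small categorical policy network, and illustrative hyperparameters; it is a sketch, not the course's reference implementation:

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2   # e.g. a CartPole-sized problem (illustrative)
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def run_episode(env):
    """Step 1: generate a sample trajectory by running the current policy."""
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(float(reward))
    return torch.stack(log_probs), rewards

def reinforce_update(env, n_episodes=10):
    """Steps 2-3: estimate returns, then take one gradient step on a surrogate
    loss whose gradient is (1/N) sum_i [sum_t grad log pi(a_t|s_t)] * R_i."""
    losses = []
    for _ in range(n_episodes):
        log_probs, rewards = run_episode(env)
        episode_return = sum(rewards)                     # R_i: total reward of trajectory i
        losses.append(-log_probs.sum() * episode_return)  # minimize -J(theta)
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()
```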

Page 21: Reinforcement Learning: Review

Comparison to maximum likelihood

training data

supervised learning

Slide adapted from Sergey Levine

Page 22: Reinforcement Learning: Review

What did we just do?

good stuff is made more likely

bad stuff is made less likely

simply formalizes the notion of “trial and error”!

Slide adapted from Sergey Levine

Page 23: Reinforcement Learning: Review

Policy Gradients

Pros:
+ Simple
+ Easy to combine with existing multi-task & meta-learning algorithms

Cons:
- Produces a high-variance gradient
  - Can be mitigated with baselines (used by all algorithms in practice) and trust regions
- Requires on-policy data: cannot reuse existing experience to estimate the gradient!
  - Importance weights can help, but are also high variance
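
To make the baseline remark concrete, here is a tiny NumPy sketch of the most common fix: weight each trajectory's log-probability term by its return minus the batch-average return. Subtracting a constant baseline leaves the gradient estimate unbiased while typically reducing its variance (the function name and the batch-mean choice are illustrative):

```python
import numpy as np

def pg_weights_with_baseline(episode_returns):
    """Centered per-trajectory weights for the policy gradient: R_i minus the
    batch-average return (a constant baseline keeps the estimator unbiased)."""
    episode_returns = np.asarray(episode_returns, dtype=np.float64)
    return episode_returns - episode_returns.mean()

print(pg_weights_with_baseline([10.0, 12.0, 11.0, 9.0]))  # [-0.5  1.5  0.5 -1.5]
```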

Page 24: Reinforcement Learning: Review

On-policy vs Off-policy

- Data comes from the current policy

- Compatible with all RL algorithms

- Can’t reuse data from previous

policies

- Data comes from any policy

- Works with specific RL

algorithms

- Much more sample efficient,

can re-use old data

Page 25: Reinforcement Learning: Review

Small note

Reward “to go”
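
A minimal sketch of the reward-to-go computation referred to here, assuming a discount factor gamma (the function name and the discounting choice are illustrative):

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: weight log pi(a_t|s_t) only by rewards from time t
    onward, since the action at time t cannot influence earlier rewards."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.9))  # [2.62 1.8  2.  ]
```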

Page 26: Reinforcement Learning: Review

The Plan

Reinforcement learning problem

Policy gradients

Q-learning

Page 27: Reinforcement Learning: Review

The anatomy of a reinforcement learning algorithm

Page 28: Reinforcement Learning: Review

Improving the policy gradient

Reward “to go”

Slide adapted from Sergey Levine

Page 29: Reinforcement Learning: Review

State & state-action value functions

Slide adapted from Sergey Levine

Page 30: Reinforcement Learning: Review

Value-Based RL

Reward = 1 if I can play it in a month, 0 otherwise

[diagram: state 𝐬𝑡 with candidate actions 𝐚1, 𝐚2, 𝐚3]

Current 𝜋(𝐚1|𝐬) = 1

Value function: 𝑉𝜋(𝐬𝑡) = ?

Q function: 𝑄𝜋(𝐬𝑡, 𝐚𝑡) = ?

Advantage function: 𝐴𝜋(𝐬𝑡 , 𝐚𝑡) = ?
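
For reference, the standard definitions behind these three quantities (with discount factor 𝛾; this is the usual convention rather than anything specific to this deck):

𝑉𝜋(𝐬𝑡): expected sum of discounted future rewards when starting from 𝐬𝑡 and following 𝜋.

𝑄𝜋(𝐬𝑡, 𝐚𝑡): expected sum of discounted future rewards when starting from 𝐬𝑡, taking 𝐚𝑡, then following 𝜋.

𝐴𝜋(𝐬𝑡, 𝐚𝑡) = 𝑄𝜋(𝐬𝑡, 𝐚𝑡) − 𝑉𝜋(𝐬𝑡): how much better taking 𝐚𝑡 is than what 𝜋 would do on average at 𝐬𝑡.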

Page 31: Reinforcement Learning: Review

Multi-Step Prediction

- How do you update your predictions about winning the game?
- What happens if you don’t finish the game?
- Do you always wait till the end?

Page 32: Reinforcement Learning: Review

How can we use all of this to fit a better estimator?

Slide adapted from Sergey Levine

Goal:
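
One way to read the goal here is as a supervised regression problem: fit an estimator 𝑉𝜙 to return targets computed from sampled trajectories. A minimal PyTorch sketch under that reading (the network size, learning rate, and the choice between Monte Carlo and bootstrapped targets are illustrative):

```python
import torch
import torch.nn as nn

obs_dim = 4   # illustrative observation size
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value(states, targets, n_steps=100):
    """states: (N, obs_dim) float tensor of visited states.
    targets: (N,) float tensor of return estimates, e.g. Monte Carlo reward-to-go
    y_t = sum_{t'>=t} gamma^{t'-t} r_t', or bootstrapped y_t = r_t + gamma * V(s_{t+1})."""
    for _ in range(n_steps):
        pred = value_net(states).squeeze(-1)
        loss = ((pred - targets) ** 2).mean()   # plain supervised regression onto the targets
        opt.zero_grad()
        loss.backward()
        opt.step()
```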

Page 33: Reinforcement Learning: Review

Policy evaluation examples

TD-Gammon, Gerald Tesauro 1992

AlphaGo, Silver et al. 2016

Page 34: Reinforcement Learning: Review

Slide adapted from Sergey Levine

Page 35: Reinforcement Learning: Review

This was just the prediction part…

Page 36: Reinforcement Learning: Review

Improving the Policy

how good is an action compared to the policy?

Page 37: Reinforcement Learning: Review

Value-Based RL

Reward = 1 if I can play it in a month, 0 otherwise

[diagram: state 𝐬𝑡 with candidate actions 𝐚1, 𝐚2, 𝐚3]

Current 𝜋(𝐚1|𝐬) = 1

Value function: 𝑉𝜋(𝐬𝑡) = ?

Q function: 𝑄𝜋(𝐬𝑡, 𝐚𝑡) = ?

Advantage function: 𝐴𝜋(𝐬𝑡 , 𝐚𝑡) = ?

How can we improve the policy?

Page 38: Reinforcement Learning: Review

Improving the Policy

Slide adapted from Sergey Levine

Page 39: Reinforcement Learning: Review

Policy Iteration

Slide adapted from Sergey Levine

Page 40: Reinforcement Learning: Review

Value Iteration

approximates the new value!

Slide adapted from Sergey Levine
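
A tabular sketch of the value-iteration update, assuming a small known MDP represented by arrays P[s, a, s'] (dynamics) and R[s, a] (reward); the array-based interface and tolerance are illustrative choices:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """P: (S, A, S') transition probabilities p(s'|s,a); R: (S, A) rewards r(s,a)."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)             # Q[s, a] = r(s, a) + gamma * E_{s'~p}[V(s')]
        V_new = Q.max(axis=1)               # the max over actions approximates the new value
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # value estimate and the greedy policy
        V = V_new
```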

Page 41: Reinforcement Learning: Review

Q-learning

doesn’t require simulation of actions!
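
To make that point concrete: the tabular Q-learning update below only needs sampled transitions (s, a, r, s'), with the max taken over stored Q-values for the next state rather than simulated actions. The step size alpha and discount gamma are illustrative, and terminal-state handling is omitted for brevity:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step from a sampled transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap with the best stored next-state value
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the TD target
    return Q
```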

Page 42: Reinforcement Learning: Review

Value-Based RL

Reward = 1 if I can play it in a month, 0 otherwise

[diagram: state 𝐬𝑡 with candidate actions 𝐚1, 𝐚2, 𝐚3]

Current 𝜋(𝐚1|𝐬) = 1

Value function: 𝑉𝜋(𝐬𝑡) = ?

Q function: 𝑄𝜋(𝐬𝑡, 𝐚𝑡) = ?

Q* function: 𝑄∗(𝐬𝑡 , 𝐚𝑡) = ?

Value* function: 𝑉∗(𝐬𝑡) = ?

Page 43: Reinforcement Learning: Review

Fitted Q-iteration Algorithm

Algorithm hyperparameters

This is not a gradient descent algorithm!

Result: get a policy 𝜋(𝐚|𝐬) from arg max𝐚 𝑄𝜙(𝐬, 𝐚)

Important notes:
- We can reuse data from previous policies! (an off-policy algorithm, using replay buffers)
- Can be readily extended to multi-task/goal-conditioned RL

Slide adapted from Sergey Levine
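
A minimal fitted-Q / DQN-flavored sketch of the ideas on this slide: a replay buffer of off-policy transitions, regression onto the Bellman target, and a greedy policy from arg max𝐚 𝑄𝜙(𝐬, 𝐚). The network, buffer size, and epsilon-greedy exploration are illustrative choices, not the slide's exact algorithm:

```python
import random
from collections import deque
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99   # illustrative sizes
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
# Transitions (obs list, action int, reward float, next_obs list, done bool)
# collected by any previous policy -- this is what makes the method off-policy.
replay_buffer = deque(maxlen=100_000)

def q_update(batch_size=64):
    """Regress Q_phi(s, a) onto the Bellman target r + gamma * max_a' Q_phi(s', a')."""
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, done = (torch.as_tensor(x) for x in zip(*batch))
    s, s2, r = s.float(), s2.float(), r.float()
    with torch.no_grad():
        target = r + gamma * (1.0 - done.float()) * q_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = ((q_sa - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

def act(obs, epsilon=0.1):
    """Greedy policy from argmax_a Q_phi(s, a), with epsilon-greedy exploration."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(q_net(torch.as_tensor(obs, dtype=torch.float32)).argmax())
```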

Page 44: Reinforcement Learning: Review

Example: Q-learning Applied to Robotics

Continuous action space?

Simple optimization algorithm -> Cross Entropy Method (CEM)
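
A sketch of using the cross-entropy method for the inner arg max over a continuous action, as mentioned here. The `q_fn` interface, population size, elite fraction, and iteration count are illustrative assumptions:

```python
import numpy as np

def cem_argmax(q_fn, action_dim, n_iters=3, pop=64, n_elite=6):
    """q_fn maps a (pop, action_dim) array of candidate actions to (pop,) Q-values
    for the current state; returns an approximate argmax_a Q(s, a)."""
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(n_iters):
        samples = np.random.randn(pop, action_dim) * std + mean   # sample candidate actions
        elite = samples[np.argsort(q_fn(samples))[-n_elite:]]     # keep the best-scoring ones
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6  # refit the sampling Gaussian
    return mean

# e.g. a toy Q landscape peaked at a = (0.5, 0.5):
print(cem_argmax(lambda a: -np.sum((a - 0.5) ** 2, axis=1), action_dim=2))
```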

Page 45: Reinforcement Learning: Review

QT-Opt: Q-learning at Scale

[system diagram: stored data from all past experiments, in-memory buffers, Bellman updaters, training jobs, CEM optimization]

QT-Opt: Kalashnikov et al. ‘18, Google Brain

Slide adapted from D. Kalashnikov

Page 46: Reinforcement Learning: Review

QT-Opt: Setup and Results

7 robots collected 580k grasps

Unseen test objects: 96% test success rate!

Page 47: Reinforcement Learning: Review

Bellman equation: 𝑄⋆(𝐬𝑡 , 𝐚𝑡) = 𝔼𝐬′∼𝑝(⋅|𝐬,𝐚)[𝑟(𝐬, 𝐚) + 𝛾 max𝐚′ 𝑄⋆(𝐬′, 𝐚′)]

Q-learning

Pros:
+ More sample efficient than on-policy methods
+ Can incorporate off-policy data (including a fully offline setting)
+ Can update the policy even without seeing the reward
+ Relatively easy to parallelize

Cons:
- Lots of “tricks” to make it work
- Potentially could be harder to learn than just a policy

Page 48: Reinforcement Learning: Review

The Plan

Reinforcement learning problem

Policy gradients

Q-learning

Page 49: Reinforcement Learning: Review

Additional RL Resources

Stanford CS234: Reinforcement Learning

UCL Course from David Silver: Reinforcement Learning

Berkeley CS285: Deep Reinforcement Learning