Distributed RL Richard Liaw


Page 1:

Distributed RL
Richard Liaw

Page 2:

Common Computational Patterns for RL

[Figure: common pattern - many parallel simulations feeding batch optimization steps, repeated round after round]

How can we better utilize our computational resources to accelerate RL progress?


Page 3:

History of large scale distributed RL

2013: DQN - "Playing Atari with Deep Reinforcement Learning" (Mnih 2013)

2015: GORILA - "Massively Parallel Methods for Deep Reinforcement Learning" (Nair 2015)

2016: A3C - "Asynchronous Methods for Deep Reinforcement Learning" (Mnih 2016)

2018: Ape-X - "Distributed Prioritized Experience Replay" (Horgan 2018)

2018: IMPALA - "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" (Espeholt 2018)

2019: R2D3 - "Making Efficient Use of Demonstrations to Solve Hard Exploration Problems" (Le Paine 2019)

Page 4:

2013/2015: DQN

for i in range(T):
    s, a, s_1, r = evaluate()            # step the environment with the current policy
    replay.store((s, a, s_1, r))

    minibatch = replay.sample()
    q_network.update(minibatch)          # one gradient step on the sampled minibatch

    if should_update_target():
        target_net.sync_with(q_network)  # periodically copy online weights into the target network

Page 5:

2015: General Reinforcement Learning Architecture (GORILA)

Page 6:

GORILA Performance

Page 7:

History of large scale distributed RL (timeline repeated from Page 3)

Page 8:

2016: Asynchronous Advantage Actor Critic (A3C)

Each worker sends gradients back to the master:

# Each worker:
while True:
    sync_weights_from_master()

    samples = []
    for i in range(5):                     # short rollout segment
        samples.append(collect_sample_from_env())

    grad = compute_grad(samples)
    async_send_grad_to_master(grad)        # gradients are applied asynchronously to the shared weights

Each worker uses a different exploration policy -> more diverse samples!

Page 9:

A3C Performance

Changes relative to GORILA:
1. Faster updates
2. Removes the replay buffer
3. Moves to actor-critic (from Q-learning)

Page 10:

Importance Weighted Actor-Learner Architectures (IMPALA)

Motivated by progress in distributed deep learning!

Page 11:

How to correct for Policy Lag? Importance Sampling!

Given an actor-critic model:

1. Apply importance sampling to the policy gradient

2. Apply importance sampling to the critic (value) update
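IMPALA's concrete instantiation of both corrections is V-trace. Below is a minimal numpy sketch of V-trace-style value targets; the function name and arguments are illustrative, not IMPALA's actual code.

import numpy as np

def vtrace_targets(behaviour_logp, target_logp, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    # Importance ratios pi/mu between the learner's (target) policy and the
    # actor's (behaviour) policy that generated the trajectory.
    rhos = np.exp(target_logp - behaviour_logp)
    clipped_rhos = np.minimum(rho_bar, rhos)     # used inside the TD errors
    cs = np.minimum(c_bar, rhos)                 # "trace" coefficients

    values_tp1 = np.append(values[1:], bootstrap_value)         # V(x_{t+1})
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)

    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    acc = 0.0
    vs_minus_v = np.zeros_like(values)
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v    # v_s: regression targets for the critic

The policy gradient is weighted by the clipped ratio as well (roughly rho_s * (r_s + gamma * v_{s+1} - V(x_s)) * grad log pi(a_s|x_s)); truncating rho and c is what keeps the variance of these off-policy corrections bounded.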

Page 12:

Ape-X/R2D2 (2018)

Scaling Off-Policy learning...

Ape-X:
1. Distributed DQN/DDPG/R2D2
2. Reintroduces the replay buffer
3. Distributed prioritization: unlike Prioritized DQN, initial priorities are not set to "max TD" - each actor computes them from its own TD errors before shipping data to replay (see the sketch below)
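A rough sketch of the actor-side priority computation (illustrative only - 1-step TD here, whereas Ape-X uses n-step returns; all names are made up):

import numpy as np

def actor_side_priorities(q_values, actions, rewards, next_q_target, dones,
                          gamma=0.99, eps=1e-6):
    # q_values / next_q_target are the per-step Q estimates the actor already
    # computed while acting, so this costs no extra forward passes.
    q_taken = q_values[np.arange(len(actions)), actions]
    targets = rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)
    td_errors = targets - q_taken
    return np.abs(td_errors) + eps    # priority = |TD error|, not the buffer's current max

# The actor then ships (transitions, priorities) to the central replay buffer,
# so new data arrives with meaningful priorities instead of all sharing "max TD".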

Page 13:

Ape-X Performance

Page 14:

With Demonstrations: R2D3 (2019)

Page 15:

Other interesting distributed architectures

Page 16:

QT-Opt

https://arxiv.org/pdf/1806.10293.pdf

Page 17:

AlphaZero

Each model trained on 64 GPUs and 19 parameter servers!

Page 18:

Evolution Strategies

Page 19:

Beyond RL: Population-based Training

Page 20:

Benefits of PBT

https://deepmind.com/blog/article/population-based-training-neural-networks

Page 21:

RLlib: Abstractions for Distributed Reinforcement Learning (ICML'18)


Eric Liang*, Richard Liaw*, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica

Page 22:

Fig. courtesy OpenAI

Fig. courtesy NVidia Inc.

RL research scales with compute

Page 23:

How do we leverage this hardware?


scalable abstractions for RL?

(a) Supervised Learning (b) Reinforcement Learning

Page 24:

Systems for RL today

• Many implementations (16,000+ repos on GitHub!) - how general are they, and do they scale?
  – PPO: multiprocessing, MPI
  – AlphaZero: custom systems
  – Evolution Strategies: Redis
  – IMPALA: Distributed TensorFlow
  – A3C: shared memory, multiprocessing, TF

• Huge variety of algorithms and distributed systems used to implement them, but little reuse of components

Page 25:

Challenges to reuse

1. Wide range of physical execution strategies for one "algorithm"

[Figure: axes of execution choices - single-node vs. cluster, CPU vs. GPU, synchronous vs. asynchronous, send gradients vs. send experiences, MPI / multiprocessing / param-server]

Page 26:

Challenges to reuse

2. Tight coupling with deep learning frameworks

Different parallelism paradigms:
– Distributed TensorFlow vs. TensorFlow + MPI?

Page 27:

Challenges to reuse

3. Large variety of algorithms with different structures


Page 28:

We need abstractions for RL

Good abstractions decompose RL algorithms into reusable components.

Goals:
– Code reuse across deep learning frameworks
– Scalable execution of algorithms
– Easily compare and reproduce algorithms

Page 29:

Structure of RL computations

[Figure: agent-environment loop - the policy (state → action) receives state s_i (observation) and reward r_i from the environment and emits action a_{i+1}]

Page 30:

Structure of RL computations

[Figure: the same agent-environment loop split into policy evaluation (state → action, producing a trajectory X: s0, (s1, r1), …, (sn, rn)) and policy improvement (e.g., SGD), which feeds an updated policy back to evaluation]
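In code, policy evaluation is just a rollout loop. A generic gym-style sketch (env is assumed to follow the usual reset/step interface, and policy is any state → action function):

def rollout(env, policy, max_steps=1000):
    # Collect one trajectory X = s0, (s1, r1), ..., (sn, rn) with the current policy.
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)                            # policy: state -> action
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory                                   # fed into policy improvement (e.g., SGD)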

Page 31:

Many RL loop decompositions

Async DQN (Mnih et al., 2016): actor-learners + parameter server
  Each actor-learner:
    X <- rollout()
    dθ <- grad(L, X)
    sync(dθ)

Ape-X DQN (Horgan et al., 2018): actors + replay + central learner
  Each actor:
    θ <- sync()
    rollout()
  Learner:
    X <- replay()
    apply(grad(L, X))

Page 32:

Common components

[Figure: the Async DQN and Ape-X DQN diagrams again - every actor embeds the same three pieces:]
• Policy πθ(ot)
• Trajectory postprocessor ρθ(X)
• Loss L(θ, X)

Page 33:

Common components (continued - same diagram as Page 32, with the shared components highlighted)

Page 34:

Structural differences

Async DQN (Mnih et al., 2016)
● Asynchronous optimization
● Replicated workers
● Single machine

Ape-X DQN (Horgan et al., 2018)
● Central learner
● Data queues between components
● Large replay buffers
● Scales to clusters

+ Population-Based Training (Jaderberg et al., 2017)
● Nested parallel computations
● Control decisions based on intermediate results

...and this is just one family!

➝ No existing system can effectively meet all the varied demands of RL workloads.

Page 35:

Requirements for a new system

Goal: capture a broad range of RL workloads with high performance and substantial code reuse.

1. Support stateful computations
   - e.g., simulators, neural nets, replay buffers
   - big data frameworks, e.g., Spark, are typically stateless
2. Support asynchrony
   - difficult to express in MPI, especially nested parallelism
3. Allow easy composition of (distributed) components

Page 36:

Ray System Substrate

Hierarchical Task Model

• RLlib builds on Ray to provide higher-level RL abstractions
• Hierarchical parallel task model with stateful workers
  – flexible enough to capture a broad range of RL workloads (vs. specialized systems)

[Figure: the same execution-choice axes as Page 25 - single-node vs. cluster, CPU vs. GPU, synchronous vs. asynchronous, send gradients vs. send experiences, MPI / multiprocessing / param-server]

Page 37:

Hierarchical Parallel Task Model (on a Ray cluster)

1. Create Python class instances in the cluster (stateful workers)
2. Schedule short-running tasks onto workers
   – Challenge: high performance - 1e6+ tasks/s, ~200 µs task overhead

[Figure: a top-level worker (Python process) issues commands such as "run K steps of training", "collect experiences", "allreduce your gradients", and "do model-based rollouts" to sub-workers and sub-sub-worker processes; weight shards are exchanged through the Ray object store]

Page 38:

Unifying system enables RL Abstractions

Policy Graph Abstraction: {πθ, ρθ, L(θ, X)}
  Examples: {Q-func, n-step, Q-loss}, {LSTM, adv. calc, PG loss}

Policy Optimizer Abstraction: SyncSamples, SyncReplay, AsyncGradients, AsyncSamples, MultiGPU, ...

[Figure: these abstractions sit on top of the hierarchical task model and execution choices from Page 36]
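As a rough illustration of the policy graph abstraction, the three pluggable pieces can be thought of as one interface (a simplified sketch, not RLlib's exact class):

class PolicyGraph:
    def compute_actions(self, obs_batch):
        # pi_theta: map a batch of observations to actions.
        raise NotImplementedError

    def postprocess_trajectory(self, batch):
        # rho_theta: e.g., compute n-step returns or GAE advantages.
        raise NotImplementedError

    def loss(self, batch):
        # L(theta, X): scalar training loss on a postprocessed batch.
        raise NotImplementedError

# A policy optimizer (SyncSamples, AsyncGradients, SyncReplay, ...) then decides
# where and when these methods run: locally or on remote workers, with or
# without a replay buffer, synchronously or asynchronously.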

Page 39:

RLlib Abstractions in Action

Policy Optimizers: SyncSamples, SyncReplay, AsyncGradients, AsyncSamples, MultiGPU, ...

Policy Graphs:
• {Q-func, n-step, Q-loss} → DQN (2015), Async DQN (2016), Ape-X (2018)
• {LSTM, adv. calc, PG loss} → Policy Gradient (2000); + actor-critic loss, GAE → A2C (2016), A3C (2016); + clipped obj. → PPO (2017), PPO (GPU-optimized); + V-trace → IMPALA (2018)

Page 40:

RLlib Reference Algorithms

• High-throughput architectures
  – Distributed Prioritized Experience Replay (Ape-X)
  – Importance Weighted Actor-Learner Architecture (IMPALA)
• Gradient-based
  – Advantage Actor-Critic (A2C, A3C)
  – Deep Deterministic Policy Gradients (DDPG)
  – Deep Q Networks (DQN, Rainbow)
  – Policy Gradients
  – Proximal Policy Optimization (PPO)
• Derivative-free
  – Augmented Random Search (ARS)
  – Evolution Strategies

Community Contributions

Page 41:

Scale your algorithms with RLlib

• Beyond being a "collection of algorithms", RLlib's abstractions let you easily implement and scale new algorithms (multi-agent, novel losses, architectures, etc.)

Page 42:

Code example: training PPO
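The code on this slide is a screenshot that does not survive the export; training PPO with RLlib in this era looked roughly like the following sketch (trainer-class API of the time; config values are illustrative):

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

trainer = PPOTrainer(
    env="CartPole-v0",
    config={
        "num_workers": 4,           # parallel rollout workers
        "train_batch_size": 4000,
    },
)

for i in range(10):
    result = trainer.train()        # one iteration: collect samples, then optimize
    print(i, result["episode_reward_mean"])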

Page 43:

Code example: hyperparam tuning
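Again the original slide embeds a screenshot; with Ray Tune the sweep looks roughly like this (hyperparameter values are made up):

import ray
from ray import tune

ray.init()

tune.run(
    "PPO",                                           # Tune can launch RLlib trainers by name
    config={
        "env": "CartPole-v0",
        "num_workers": 2,
        "lr": tune.grid_search([1e-4, 5e-4, 1e-3]),  # sweep the learning rate
    },
    stop={"episode_reward_mean": 190},
)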

Page 44:

Code example: hyperparam tuning (continued)

Page 45:

RLlib is open source and available at http://rllib.io. Thanks!

Summary: Ray and RLlib address the challenges of providing scalable abstractions for reinforcement learning.

Page 46:

Ray distributed execution engine

• Ray provides task-parallel and actor APIs built on dynamic task graphs

• These APIs are used to build distributed applications, libraries, and systems

[Figure: layered stack - applications (numerical computation, third-party simulators, ...) on top of the Ray programming model (task parallelism, actors), which compiles down to the Ray execution model (dynamic task graphs)]
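For concreteness, the two primitives look roughly like this (a minimal sketch; the function and class are toy examples):

import ray

ray.init()

@ray.remote
def square(x):            # task parallelism: a stateless function run asynchronously in the cluster
    return x * x

@ray.remote
class Counter:            # actor: a stateful worker whose methods run as tasks against its own state
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x
        return self.total

print(ray.get([square.remote(i) for i in range(4)]))   # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.add.remote(5)))                   # 5
print(ray.get(counter.add.remote(3)))                   # 8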

Page 47:

Ray distributed scheduler


• Faster than Python multi-processing on a single node

• Competitive with MPI in many workloads

[Figure: each node runs driver/worker processes, a local object store, and a local scheduler; local schedulers can escalate scheduling decisions to replicated global schedulers]