Distributed RL
Richard Liaw, Eric Liang

Source: rail.eecs.berkeley.edu/deeprlcourse-fa18/static/slides/lec-21.pdf

Page 1:

Distributed RL
Richard Liaw, Eric Liang

Page 2:

Common Computational Patterns for RL

(Figure: RL alternates between simulation, which generates experience, and batch optimization, which updates the policy; multiple simulation workers can feed each optimization step.)

How can we better utilize our computational resources to accelerate RL progress?

Page 3:

History of large-scale distributed RL

2013: DQN - "Playing Atari with Deep Reinforcement Learning" (Mnih 2013)
2015: GORILA - "Massively Parallel Methods for Deep Reinforcement Learning" (Nair 2015)
2016: A3C - "Asynchronous Methods for Deep Reinforcement Learning" (Mnih 2016)
2018: Ape-X - "Distributed Prioritized Experience Replay" (Horgan 2018)
2018: IMPALA - "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" (Espeholt 2018)

...and what comes next?

Page 4:

2013/2015: DQN

for i in range(T):
    s, a, s_1, r = evaluate()
    replay.store((s, a, s_1, r))
    minibatch = replay.sample()
    q_network.update(minibatch)
    if should_update_target():
        q_network.sync_with(target_net)
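To make the loop above concrete, here is a minimal self-contained sketch of the same serial structure, with a deque-based replay buffer and stubbed-out environment and network (all names here are illustrative, not from the slides):

import random
from collections import deque

T = 1000
replay = deque(maxlen=10_000)   # simple FIFO replay buffer

def evaluate():
    # Stub environment step: returns a (state, action, next_state, reward) tuple.
    return (random.random(), random.randint(0, 3), random.random(), 1.0)

def update_q_network(minibatch):
    # Stub: one gradient step on the TD loss would go here.
    pass

def sync_target_network():
    # Stub: copy the online network's weights into the target network.
    pass

for i in range(T):
    transition = evaluate()                       # act in the environment
    replay.append(transition)                     # store the transition
    minibatch = random.sample(list(replay), min(32, len(replay)))
    update_q_network(minibatch)                   # learn from replayed experience
    if i % 500 == 0:                              # periodic target-network sync
        sync_target_network()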

Page 5:

2015: General Reinforcement Learning Architecture (GORILA)
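The GORILA architecture figure is not reproduced here. As described in Nair et al. (2015), it splits DQN across four kinds of distributed components: actors that generate experience, a (possibly distributed) replay memory, learners that compute DQN gradients, and a central parameter server that applies them. A rough schematic of one actor process and one learner process, in illustrative pseudocode (the class and method names are assumptions, not the paper's code):

# Actor process: generate experience with a slightly stale copy of the network.
def actor_loop(param_server, replay, env, q_network):
    while True:
        q_network.set_weights(param_server.get_weights())  # periodically refresh
        s, done = env.reset(), False
        while not done:
            a = q_network.epsilon_greedy_action(s)
            s_next, r, done = env.step(a)
            replay.store((s, a, r, s_next, done))           # ship experience to replay
            s = s_next

# Learner process: sample from replay, compute gradients, send them to the server.
def learner_loop(param_server, replay, q_network, target_network):
    while True:
        q_network.set_weights(param_server.get_weights())
        batch = replay.sample(32)
        grads = q_network.dqn_gradients(batch, target_network)  # TD-error gradients
        param_server.apply_gradients(grads)                     # asynchronous update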

Page 6:

GORILA Performance

Page 7:

2016: Asynchronous Advantage Actor Critic (A3C)

Workers send gradients back to the master asynchronously:

# Each worker:
while True:
    sync_weights_from_master()
    samples = [collect_sample_from_env() for _ in range(5)]
    grad = compute_grad(samples)
    async_send_grad_to_master(grad)

Each worker has a different exploration policy -> more diverse samples!
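For completeness, the master side of this scheme is roughly the mirror image, in the same pseudocode style as the slide (illustrative; A3C as published uses Hogwild-style shared-memory updates on a single machine):

# Master / shared parameters:
while True:
    grad = receive_next_gradient()           # from any worker, in arrival order
    weights = apply_gradient(weights, grad)  # asynchronous SGD update
    # workers pick up the new weights on their next sync_weights_from_master()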

Page 8:

A3C Performance

Changes relative to GORILA:
1. Faster updates
2. Removes the replay buffer
3. Moves to actor-critic (from Q-learning)

Page 9:

Distributed Prioritized Experience Replay (Ape-X)

A3C doesn’t scale very well…

Ape-X:
1. Distributed DQN/DDPG
2. Reintroduces the replay buffer
3. Distributed prioritization: unlike Prioritized DQN, initial priorities are not set to “max TD”; actors compute them locally (see the sketch below)
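A rough sketch of what distributed prioritization means on the actor side, per the Ape-X paper: each actor computes the TD errors of its own transitions with its local network copy and attaches them as initial priorities when it ships data to the replay server (the helper names below are hypothetical):

# On each Ape-X actor (illustrative):
def send_local_batch(local_q_net, target_net, replay_client, buffer, gamma=0.99):
    transitions = buffer.get_local_batch()        # experience gathered by this actor
    priorities = []
    for (s, a, r, s_next, done) in transitions:
        # Initial priority = absolute TD error computed locally,
        # instead of the "max priority" default used by single-process PER.
        target = r + (0.0 if done else gamma * target_net.max_q(s_next))
        td_error = target - local_q_net.q_value(s, a)
        priorities.append(abs(td_error))
    replay_client.add(transitions, priorities)    # ship data + priorities to replay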

Page 10:

Ape-X Performance

Page 11:

Importance Weighted Actor-Learner Architectures (IMPALA)

Motivated by progress in distributed deep learning!

Page 12:

How to correct for Policy Lag? Importance Sampling!

Given an actor-critic model:

1. Apply importance sampling to the policy-gradient update
2. Apply importance sampling to the critic (value function) update

(Both corrections are instances of V-trace; see the equations below.)
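The equations from the slide are not reproduced in this extraction. For reference, the correction IMPALA uses is V-trace (Espeholt et al., 2018), which truncates the importance weights in both places. With behavior policy $\mu$ (the actor that generated the data) and learner policy $\pi$, the n-step V-trace target for the value at state $x_s$ is

$$ v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big( \prod_{i=s}^{t-1} c_i \Big) \delta_t V, \qquad \delta_t V = \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big), $$

with truncated importance weights $\rho_t = \min\!\big(\bar{\rho},\, \tfrac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\big)$ and $c_i = \lambda \min\!\big(\bar{c},\, \tfrac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\big)$. The critic regresses $V(x_s)$ toward $v_s$, and the policy gradient is estimated with the truncated weight

$$ \rho_s \, \nabla_\theta \log \pi_\theta(a_s \mid x_s) \, \big( r_s + \gamma v_{s+1} - V(x_s) \big). $$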

Page 13:

IMPALA Performance

Page 14:

Other interesting distributed architectures

Page 15:

AlphaZero

Each model trained on 64 GPUs and 19 parameter servers!

Page 16:

Evolution Strategies

Page 17:

RLlib: Abstractions for Distributed Reinforcement Learning (ICML'18)
http://rllib.io

Eric Liang*, Richard Liaw*, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica

Page 18:

RL research scales with compute

(Figures not reproduced; courtesy OpenAI and NVidia Inc.)

Page 19:

How do we leverage this hardware? Can we build scalable abstractions for RL?

(Figure: (a) supervised learning vs. (b) reinforcement learning computational patterns.)

Page 20:

Systems for RL today

• Many implementations (7000+ repos on GitHub!)
  – how general are they (and do they scale)?
    PPO: multiprocessing, MPI
    AlphaZero: custom systems
    Evolution Strategies: Redis
    IMPALA: Distributed TensorFlow
    A3C: shared memory, multiprocessing, TF
• Huge variety of algorithms and distributed systems used to implement them, but little reuse of components

Page 21:

Challenges to reuse

1. Wide range of physical execution strategies for one "algorithm"

(Figure: the same algorithm can run single-node or on a cluster, on CPU or GPU, synchronously or asynchronously, sending gradients or sending experiences, and be built on multiprocessing, MPI, or a parameter server.)

Page 22:

Challenges to reuse

2. Tight coupling with deep learning frameworks

Different parallelism paradigms:
– Distributed TensorFlow vs. TensorFlow + MPI?

Page 23:

Challenges to reuse

3. Large variety of algorithms with different structures

Page 24:

We need abstractions for RL

Good abstractions decompose RL algorithms into reusable components.

Goals:
– Code reuse across deep learning frameworks
– Scalable execution of algorithms
– Easily compare and reproduce algorithms

Page 25:

Structure of RL computations

(Figure: the agent-environment loop. The agent's policy maps state to action; the environment returns the next state (observation) s_i and reward r_i, and the agent responds with action a_{i+1}.)

Page 26:

Structure of RL computations

(Figure: the same agent-environment loop, with the agent split into two parts. Policy evaluation (state → action) interacts with the environment to produce trajectories X: s0, (s1, r1), …, (sn, rn); policy improvement (e.g., SGD) consumes those trajectories and produces an updated policy. See the code sketch below.)
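This evaluation/improvement split is easy to see in code. Below is a minimal sketch using the classic Gym API (reset returning an observation and step returning a 4-tuple; newer Gym/Gymnasium versions differ), with placeholder policy and update objects:

import gym

env = gym.make("CartPole-v1")

class RandomPolicy:
    # Placeholder "policy": samples uniformly; stands in for a neural network.
    def compute_action(self, obs):
        return env.action_space.sample()

def improve_policy(policy, trajectory):
    # Placeholder "policy improvement": a real agent would run SGD on a loss here.
    return policy

def collect_trajectory(policy, env, max_steps=200):
    # Policy evaluation: roll the current policy in the environment.
    trajectory, obs = [], env.reset()
    for _ in range(max_steps):
        action = policy.compute_action(obs)              # state -> action
        obs_next, reward, done, info = env.step(action)  # classic 4-tuple Gym API
        trajectory.append((obs, action, reward))
        obs = obs_next
        if done:
            break
    return trajectory

policy = RandomPolicy()
for _ in range(10):
    X = collect_trajectory(policy, env)   # policy evaluation
    policy = improve_policy(policy, X)    # policy improvement (e.g., SGD)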

Page 27:

Many RL loop decompositions

(Figure: two decompositions of the RL loop.)

Async DQN (Mnih et al., 2016): several actor-learners plus a parameter server; each actor-learner runs
    X <- rollout()
    dθ <- grad(L, X)
    sync(dθ)

Ape-X DQN (Horgan et al., 2018): several actors, a replay buffer, and a central learner; each actor runs
    θ <- sync()
    rollout()
and the learner runs
    X <- replay()
    apply(grad(L, X))

Page 28:

Common components

(Figure: the Async DQN (Mnih et al., 2016) and Ape-X DQN (Horgan et al., 2018) architectures from the previous slide, annotated with the components they share.)

Common to both architectures (sketched as an interface below):
– Policy πθ(o_t)
– Trajectory postprocessor ρθ(X)
– Loss L(θ, X)
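These three pieces are what RLlib later bundles into its policy graph abstraction. A rough sketch of what such an interface can look like, loosely modeled on RLlib's PolicyGraph (method names are illustrative and version-dependent):

class PolicyGraph:
    """Bundles the three reusable pieces: policy, trajectory postprocessing, loss."""

    def compute_actions(self, obs_batch, state=None):
        # Policy pi_theta(o_t): map a batch of observations to actions.
        raise NotImplementedError

    def postprocess_trajectory(self, sample_batch, other_agent_batches=None):
        # Trajectory postprocessor rho_theta(X): e.g., compute n-step returns,
        # GAE advantages, or replay priorities over a collected trajectory.
        return sample_batch

    def loss(self, postprocessed_batch):
        # Loss L(theta, X): e.g., a Q-learning loss or a policy-gradient loss.
        raise NotImplementedError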

Page 29:

Common components (repeat of the previous slide's figure)

Page 30:

Structural differences

Async DQN (Mnih et al., 2016)
● Asynchronous optimization
● Replicated workers
● Single machine

Ape-X DQN (Horgan et al., 2018)
● Central learner
● Data queues between components
● Large replay buffers
● Scales to clusters

+ Population-Based Training (Jaderberg et al., 2017)
● Nested parallel computations
● Control decisions based on intermediate results

...and this is just one family!

➝ No existing system can effectively meet all the varied demands of RL workloads.

Page 31:

Requirements for a new system

Goal: capture a broad range of RL workloads with high performance and substantial code reuse

1. Support stateful computations
   - e.g., simulators, neural nets, replay buffers
   - big data frameworks, e.g., Spark, are typically stateless
2. Support asynchrony
   - difficult to express in MPI, especially nested parallelism
3. Allow easy composition of (distributed) components

Page 32:

Ray System Substrate

• RLlib builds on Ray to provide higher-level RL abstractions
• Hierarchical parallel task model with stateful workers
  – flexible enough to capture a broad range of RL workloads (vs. specialized systems)

(Figure: the same range of execution strategies as on Page 21 - single-node or cluster, CPU or GPU, synchronous or asynchronous, sending gradients or experiences, multiprocessing / MPI / parameter server - all expressible in the hierarchical task model.)

Page 33:

Hierarchical Parallel Task Model

1. Create Python class instances in the cluster (stateful workers)
2. Schedule short-running tasks onto those workers (see the sketch below)
   – Challenge: high performance - 1e6+ tasks/s, ~200 us task overhead

(Figure: a Ray cluster with a top-level worker (Python process) issuing tasks such as "run K steps of training" and "collect experiences" to sub-workers, which in turn issue tasks like "do model-based rollouts" or "allreduce your gradients" to sub-sub-worker processes; weight shards are exchanged through the Ray object store.)

Page 34:

Unifying system enables RL abstractions

Policy Graph Abstraction: {πθ, ρθ, L(θ,X)}
  Examples: {Q-func, n-step, Q-loss}, {LSTM, adv. calc, PG loss}

Policy Optimizer Abstraction: SyncSamples, SyncReplay, AsyncGradients, AsyncSamples, MultiGPU, ...

(Both sit on top of the hierarchical task model and its range of execution strategies from the previous slides.)

Page 35:

RLlib Abstractions in Action

(Figure: a grid showing how combinations of policy graphs and policy optimizers recover published algorithms.)

Policy graphs:
– {Q-func, n-step, Q-loss}: DQN (2015), Async DQN (2016), Ape-X (2018)
– {LSTM, adv. calc, PG loss}: Policy Gradient (2000); + actor-critic loss, GAE: A2C (2016), A3C (2016); + V-trace: IMPALA (2018); + clipped obj.: PPO (2017), PPO (GPU-optimized)

Policy optimizers: SyncSamples, SyncReplay, AsyncGradients, AsyncSamples, MultiGPU, ...

Page 36:

RLlib Reference Algorithms

• High-throughput architectures
  – Distributed Prioritized Experience Replay (Ape-X)
  – Importance Weighted Actor-Learner Architecture (IMPALA)
• Gradient-based
  – Advantage Actor-Critic (A2C, A3C)
  – Deep Deterministic Policy Gradients (DDPG)
  – Deep Q Networks (DQN, Rainbow)
  – Policy Gradients
  – Proximal Policy Optimization (PPO)
• Derivative-free
  – Augmented Random Search (ARS)
  – Evolution Strategies
• Community contributions

Page 37:

RLlib Reference Algorithms

(Figure not reproduced: results on 1 GPU + 64 vCPUs, a large single machine.)

Page 38:

Scale your algorithms with RLlib

• Beyond a "collection of algorithms": RLlib's abstractions let you easily implement and scale new algorithms (multi-agent, novel losses, architectures, etc.)

Page 39:

Code example: training PPO
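The code shown on the slide is not captured in this extraction. As a stand-in, here is a minimal sketch of training PPO through Ray Tune, roughly in the style of RLlib examples from around this time (exact config keys and Tune entry points vary across versions):

import ray
from ray import tune

ray.init()

# One PPO trial on CartPole with a few parallel rollout workers.
tune.run_experiments({
    "ppo_cartpole": {
        "run": "PPO",                          # RLlib's registered PPO trainer
        "env": "CartPole-v0",
        "stop": {"episode_reward_mean": 190},  # stop once the task is solved
        "config": {
            "num_workers": 4,                  # parallel rollout workers
            "gamma": 0.99,
        },
    },
})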

Page 40:

Code example: multi-agent RL

Page 41:

Code example: hyperparam tuning
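Again, the slide's code is not captured here. As an illustration, a hyperparameter sweep with Ray Tune can be expressed as a grid search over part of the algorithm config, e.g. PPO's learning rate (keys and entry points are assumptions and vary by version):

import ray
from ray import tune

ray.init()

tune.run_experiments({
    "ppo_lr_sweep": {
        "run": "PPO",
        "env": "CartPole-v0",
        "stop": {"training_iteration": 50},
        "config": {
            "num_workers": 2,
            # Tune launches one trial per value in the grid.
            "lr": tune.grid_search([5e-5, 5e-4, 5e-3]),
        },
    },
})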

Page 42:

Code example: hyperparam tuning (continued)

Page 43:

Summary: Ray and RLlib address the challenges of providing scalable abstractions for reinforcement learning.

RLlib is open source and available at http://rllib.io. Thanks!

Page 44:

Ray distributed execution engine

• Ray provides task-parallel and actor APIs built on dynamic task graphs (illustrated below)
• These APIs are used to build distributed applications, libraries, and systems

(Figure: the Ray stack. Applications such as numerical computation and third-party simulators sit on the Ray programming model (task parallelism and actors), which sits on the Ray execution model (dynamic task graphs).)
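A tiny illustration of the two APIs and how futures compose into a dynamic task graph (toy functions, unrelated to RLlib):

import ray

ray.init()

@ray.remote
def simulate(seed):
    # Task-parallel API: a stateless remote function.
    return seed * 2

@ray.remote
class Counter:
    # Actor API: a stateful remote object.
    def __init__(self):
        self.total = 0
    def add(self, x):
        self.total += x
        return self.total

counter = Counter.remote()
# Futures from tasks can be passed straight into actor methods;
# Ray tracks these dependencies as a dynamic task graph.
results = [counter.add.remote(simulate.remote(i)) for i in range(5)]
print(ray.get(results))   # running totals: [0, 2, 6, 12, 20]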

Page 45:

Ray distributed scheduler

• Faster than Python multiprocessing on a single node
• Competitive with MPI in many workloads

(Figure: Ray architecture. Each node runs a driver and/or workers, an object store, and a local scheduler; local schedulers hand off to replicated global schedulers when needed.)