Performant deep reinforcement learning: latency, hazards, and pipeline stalls in the GPU era… and how to avoid them
Mark Hammond, Co-founder / CEO
Latency (n): The time elapsed (typically in clock cycles) between a stimulus and the response to it
Hazard (n): A problem in a CPU's instruction pipeline that prevents the next instruction from executing in the following clock cycle
add [memory_location], register1, register2
sub register3, register4, [memory_location]
Data Hazard Example
1. CPU feeds registers and addition instruction to the ALU
2. ALU performs operation and stores to temporary register
3. CPU directs memory controller to write result to memory
4. CPU retrieves memory at indicated location and feeds contents + register4 and subtraction instruction to ALU
5. ALU performs operation and stores to register3
1. CPU feeds registers and addition instruction to the ALU
2. ALU performs operation and stores to temporary register
3. Concurrently:
   a. CPU directs memory controller to write result to memory
   b. CPU forwards temporary register, register4, and subtraction instruction to ALU
4. ALU performs operation and stores to register3

Michael Abrash, Zen of Code Optimization
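The payoff from forwarding can be made concrete with a toy cycle-count model. The latencies below are illustrative assumptions, not real hardware numbers:

```python
# Toy model of the data hazard above. Assumed latencies (illustrative
# only): 1 cycle per ALU operation, 3 cycles per memory access.
ALU_CYCLES = 1
MEM_CYCLES = 3

def cycles_without_forwarding():
    # add executes; result is written to memory; sub must read that
    # memory location back before it can execute
    return ALU_CYCLES + MEM_CYCLES + MEM_CYCLES + ALU_CYCLES

def cycles_with_forwarding():
    # add executes; the memory write proceeds concurrently while sub
    # consumes the forwarded temporary register directly
    return ALU_CYCLES + max(MEM_CYCLES, ALU_CYCLES)

print(cycles_without_forwarding())  # 8
print(cycles_with_forwarding())     # 4
```

The same pattern, keeping the dependent consumer fed from a fast path instead of round-tripping through slow storage, is exactly what the deep RL optimizations later in this deck exploit.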
From the CPU to the GPU
The Essence of Machine Learning
Traditional programming: a programmer authors the function f(); user / data inputs x are transformed into the desired outputs f(x).

Machine learning: f() is machine learned from observed inputs x and observed outputs f(x).
Challenges to Learning the Underlying Function
f(x) = c1 x + c0
f(x) = c2 x² + c1 x + c0
f(x) = c4 x⁴ + c3 x³ + c2 x² + c1 x + c0
Even with simple one-dimensional functions, you have to worry about issues like overfitting

(Figure: three fits of the same data — underfit, generalized fit, overfit)
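A minimal sketch of the underfit/overfit trade-off, using made-up noisy samples of y = 2x + 1: an ordinary least-squares line generalizes, while a polynomial that interpolates every training point chases the noise.

```python
# Illustrative data (assumed, not from the deck): noisy samples of y = 2x + 1.

def fit_line(xs, ys):
    # ordinary least squares for y = a*x + b
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

def interpolate(xs, ys):
    # Lagrange polynomial through every point: zero training error,
    # but it follows the noise (the overfit case)
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
noise   = [0.3, -0.4, 0.5, -0.2, 0.4]   # fixed "noise" for reproducibility
train_y = [2 * x + 1 + e for x, e in zip(train_x, noise)]

line = fit_line(train_x, train_y)
poly = interpolate(train_x, train_y)

# held-out point from the true function y = 2x + 1
x_test, y_test = 2.5, 6.0
print(abs(line(x_test) - y_test))   # ~0.14: the line generalizes
print(abs(poly(x_test) - y_test))   # ~0.31: the interpolant overfit the noise
```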
Challenges to Learning the Underlying Function
Real data is multi-dimensional and often entails crafting features
Title: The Triumph of the Nerds: The Rise of Accidental Empires
Release date: April 14, 1996
Genre: Documentary
Synopsis: Three-part documentary that takes an inside look at the history of computers, from their rise in the 1970s to the beginning of the dot-com boom of the late 1990s.
Writers: Robert X. Cringely (book), Robert X. Cringely (screenplay)
Stars: Robert X. Cringely, Douglas Adams, Sam Albert
Running time: 150 minutes
Reviewers: 1,102
Rating: 8.5/10
What movies will someone enjoy watching?
An engineered feature: topic area(s), as derived using natural language processing techniques on the synopsis
ANN image from Wikimedia Commons - Mcstrother
For a good overview of neural network types see:
http://www.asimovinstitute.org/neural-network-zoo/
http://www.asimovinstitute.org/neural-network-zoo-prequel-cells-layers/
Deep learning for large scale flexibility
How do we make this performant?
1) Make sure the data pipeline can keep the GPUs populated and processing
2) Optimize the efficiency of the neural network architecture by exploiting structural aspects of the data and problem
Reinforcement learning for control and optimization
(Diagram: the Actor sends an Action to the Environment, a simulation or physical system; environment assessment returns a State and a Reward to the Actor.)
Estimating the value of an action when it is taken
Time | Action | Value of Action
-----|--------|----------------
  1  |   A1   |       ?
  2  |   A2   |       ?
  3  |   A3   |       ?
  4  |   A4   |       ?
Dynamic Programming
Use a model of transition probabilities and the reward function to calculate the value function (visiting all states)
Monte Carlo Methods
Use sampling to estimate the value function (need to see a whole episode)
Temporal Difference “TD” Methods
Look at the error in an estimated value function at every time step (don’t wait until the end of an episode before updating)
Policy Search
Optimize the policy directly
Must estimate one of:
• Policy
• Value of a state
• Value of an action given a state
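The TD idea can be shown in a few lines. This is a generic tabular TD(0) update on a made-up two-state transition; gamma, alpha, and the states are illustrative:

```python
# Tabular TD(0) sketch (toy values): at every step the value estimate
# moves toward reward + gamma * V(next_state), without waiting for the
# episode to end (unlike Monte Carlo methods).
gamma, alpha = 0.9, 0.5
V = {"s0": 0.0, "s1": 0.0}

def td0_update(V, s, r, s_next):
    # the TD error is how far off the current estimate is from the
    # one-step bootstrapped target
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# one observed transition: s0 --(reward=1)--> s1
err = td0_update(V, "s0", 1.0, "s1")
print(V["s0"])  # 0.5 after one update
```

Because the update needs only (state, reward, next state), it can run on every step, which is what lets TD methods keep the learner busy without waiting for full episodes.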
(Diagram: the same actor/environment loop, with Action, State, and Reward flowing between the Actor and the Environment.)
Combining deep learning and reinforcement learning
Leverage deep learning to approximate one of:
• Value (state value) function
• Q value (action value) function
• Policy
• Model (state transition + reward)
New hazards:
• Latency roundtripping to simulator / real world
• State transition from CPU to GPU memory
• Stale parameters during learning
• …
Innovations to address the modern hazards of Deep RL
• Replay buffers / memories
• State space sizing
• Parameter server vs. single model + interleaving
• Separation of training and experience collecting, and stale data loss
• Actor/critic results within the GPU vs. transition over system memory
Replay buffers / memories
Prioritized Experience Replay – Schaul et al. – ICLR 2016 – https://arxiv.org/abs/1511.05952
• Record experience (state, action, reward, resulting state) to a buffer
• Sub-sample the buffer (mini-batches) for training
• Decorrelates the data and makes training with non-linear functions like neural networks stable
• Allows effective use of the GPU as it can process batches even with serial actions
• Optimization trick: prioritize. Favor important transitions for inclusion in mini-batches.
• Optimization trick: dedupe. Store state (often large) separately and record references.
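A hedged sketch of a replay buffer with both tricks, proportional prioritization and deduplicated state storage. The class and method names are illustrative, not Bonsai's or any paper's API:

```python
# Sketch of a prioritized, deduplicating replay buffer (toy version).
import random

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.states = {}        # state_id -> state, stored once and referenced
        self.transitions = []   # (state_id, action, reward, next_state_id)
        self.priorities = []    # e.g. |TD error| per transition

    def _state_id(self, state):
        sid = hash(state)
        self.states.setdefault(sid, state)
        return sid

    def store(self, state, action, reward, next_state, priority=1.0):
        if len(self.transitions) >= self.capacity:
            # a deque would be more efficient; pop(0) keeps the sketch short
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append((self._state_id(state), action, reward,
                                 self._state_id(next_state)))
        self.priorities.append(priority)

    def sample(self, batch_size):
        # proportional prioritized sampling: higher-priority transitions
        # are drawn more often (scheme from Schaul et al.)
        batch = random.choices(self.transitions,
                               weights=self.priorities, k=batch_size)
        return [(self.states[s], a, r, self.states[sn])
                for (s, a, r, sn) in batch]

buf = ReplayBuffer()
buf.store((0, 0), "left", 0.0, (0, 1), priority=0.1)
buf.store((0, 1), "right", 1.0, (0, 0), priority=5.0)
print(len(buf.sample(4)))  # 4: mini-batches can be larger than the buffer
```

Sampling with replacement is what lets the GPU chew on large mini-batches even while actions are generated serially.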
State space sizing
https://xkcd.com/832/
• Assess state space for target domain to select efficient algorithmic approaches
• Optimization trick: pre-populate the replay buffer
• As you pre-populate the replay buffer, you can analyze state recurrence to gauge sizing
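One way to gauge sizing while warming up the buffer, sketched with made-up states:

```python
# Gauge state-space size by measuring how often states recur during
# buffer warm-up. The observed states are illustrative placeholders.
from collections import Counter

observed = ["s1", "s2", "s1", "s3", "s1", "s2"]   # states seen during warm-up
counts = Counter(observed)
recurrence_rate = 1 - len(counts) / len(observed)
print(f"{len(counts)} unique states, recurrence rate {recurrence_rate:.2f}")
```

A high recurrence rate suggests a small effective state space, where tabular or small-model approaches may suffice; a rate near zero points toward function approximation.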
Parameter server vs. single model + interleaving
(Diagram: a central Parameter Server serving multiple Actor & Environment workers.)
• Run many concurrent actors asynchronously from the learner
• Requires copying parameters to the actors, but eliminates the need for a replay buffer
• Decorrelation is achieved by having multiple actors generating samples in parallel instead of mini-batch subsampling
Separation of training and experience collecting, and stale data loss
• "On-policy" vs. "off-policy"
• On-policy methods can only learn from recorded experience that is consistent with the current policy
• Updating the policy therefore causes stale data loss: the replay buffer must be made consistent
• Off-policy methods can learn from any experience, whether consistent with the current policy or not
Sutton, Richard S. and Barto, Andrew G., Reinforcement Learning: An Introduction, MIT Press, 1998
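The distinction is easiest to see in the update rules. Below, a generic tabular SARSA (on-policy) update next to a Q-learning (off-policy) update on the same made-up transition; the Q-table values are illustrative:

```python
# SARSA vs. Q-learning on one toy transition.
gamma, alpha = 0.9, 0.5

def sarsa_update(Q, s, a, r, s2, a2):
    # on-policy: bootstraps from the action the current policy actually took
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions):
    # off-policy: bootstraps from the greedy action, regardless of what
    # the behavior policy did, so any recorded experience remains usable
    best = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

Q_on  = {("s", "a"): 0.0, ("s2", "a"): 2.0, ("s2", "b"): 5.0}
Q_off = dict(Q_on)

# same transition: took "a" in s, reward 0, landed in s2, then took "a" again
sarsa_update(Q_on, "s", "a", 0.0, "s2", "a")
q_learning_update(Q_off, "s", "a", 0.0, "s2", ["a", "b"])

print(Q_on[("s", "a")])   # 0.9  (bootstraps from Q(s2, a) = 2)
print(Q_off[("s", "a")])  # 2.25 (bootstraps from max_b Q(s2, b) = 5)
```

Because the SARSA target depends on what the current policy chose, old experience generated by a different policy is invalid for it; the Q-learning target does not, which is what makes replay buffers safe for off-policy methods.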
Actor/critic results within the GPU vs. transition over system memory
• If separate models are maintained for the actor and the critic, then the results must be fed to each other via system memory
• Even with two separately trained components, performing the full training cycle within one composed model removes the need to leave the GPU
(Diagram: the Agent comprises an Actor and a Critic, interacting with the Environment.)
Parallelized Deep RL Implementations
• A3C – Asynchronous Advantage Actor-Critic
• GA3C – Hybrid CPU/GPU Asynchronous Advantage Actor-Critic
• ANAF – Asynchronous Normalized Advantage Functions
• TRPO – Trust Region Policy Optimization
A3C – Asynchronous Advantage Actor-Critic
Asynchronous Methods for Deep Reinforcement Learning – Mnih et al. – https://arxiv.org/abs/1602.01783
• Parameter Server Approach
• Does not require replay buffer
• Instead of scaling out across machines, makes use of a single multicore CPU
• Compounds updates from multiple actors over time before updating model
• Non-locking
• Vary exploration strategy per actor
• Works for on- and off-policy methods
• No GPU – single CPU only (due to non-locking and global model accessibility)
GA3C – Hybrid CPU/GPU Asynchronous Advantage Actor-Critic
Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU – Babaeizadeh et al. – ICLR 2017 – https://arxiv.org/abs/1611.06256
• Addresses underloading of GPUs from A3C due to lack of replay buffer
• Additional constraint of off-policy method
• Does not use a parameter server architecture – uses a single copy of the model in an interleaving mode for prediction/training
• Requires tuning of the various queues (dynamic adjustment of the number of simulators, predictors, and trainers)
• Sees improved performance over CPU-only A3C
ANAF – Asynchronous Normalized Advantage Functions
Continuous Deep Q-Learning with Model-based Acceleration – Gu et al. – https://arxiv.org/abs/1603.00748
Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates – Gu et al. – https://arxiv.org/abs/1610.00633
• Parallelizes NAF using a parameter server approach
• Off-policy
• Does not require the careful balancing of resources required by GA3C
TRPO – Trust Region Policy Optimization
Parallel Trust Region Policy Optimization with Multiple Actors – Frans et al. – http://kvfrans.com/static/trpo.pdf
• On-policy
• Unlike many other on-policy algorithms, more time is spent collecting samples than computing gradients to update the policy
• More reliant on the CPU than the GPU, since policy updates requiring the GPU are infrequent
• Monte Carlo estimates
What approach does Bonsai take?
• Automatic algorithm selection based on state space sizing
• Keep a single copy of the model in GPU
• Interleave prediction and learning requests
• Use off-policy methods
• Run multiple concurrent simulations
• Use a learner memory and take advantage of it to keep the GPU loaded
• Use queues to desynchronize experience gathering from simulations, predictions, and training
• Queue predictions and training data to allow for the data transformation between CPU and GPU to be handled in an asynchronous way
• Make use of user defined abstract concepts and curricula
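The queue-based desynchronization above can be sketched with Python's standard library. The thread structure below is an assumption for illustration, not Bonsai's actual implementation:

```python
# Desynchronize experience gathering from training with a blocking queue:
# simulator threads produce transitions, a trainer thread consumes them.
import queue
import threading

experience_q = queue.Queue(maxsize=256)
processed = []

def simulator(sim_id, steps):
    for t in range(steps):
        # ...step the environment, produce a transition...
        experience_q.put((sim_id, t))

def trainer(total):
    while len(processed) < total:
        transition = experience_q.get()   # blocks until experience arrives
        # ...batch here and move to GPU memory asynchronously...
        processed.append(transition)

sims = [threading.Thread(target=simulator, args=(i, 10)) for i in range(3)]
learn = threading.Thread(target=trainer, args=(30,))
learn.start()
for s in sims:
    s.start()
for s in sims:
    s.join()
learn.join()
print(len(processed))  # 30
```

The queue absorbs the latency mismatch between simulation round-trips and GPU batching, the deep RL analogue of the forwarding trick from the CPU hazard example.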
User defined abstract concepts and curricula
(Diagram: a concept mapping inputs i to output o is progressively decomposed into subconcepts a, b, and c, with subconcept a further decomposed into a1 through a4.)
• Not explicit feature engineering, but guidance for navigating state space through decomposition
• Leverages subject matter expertise
• Can aid in parallelization and avoiding hazards:
• Implicit subconcepts can be created to auto-parallelize
• Guided lesson decomposition of a given concept
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.28                 Driver Version: 370.28                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 0000:01:00.0      On |                  N/A |
| 35%   63C    P2   161W / 250W |  3491MiB / 12186MiB  |     68%      Default |
+-------------------------------+----------------------+----------------------+
Breakout training with 10 concurrent simulations
Bringing it all together
• Rapid iteration in this space
• We're at the "writing assembly code in multiple columns" equivalent state in the CPU era progression
• Companies like Bonsai are working to give you compilers that handle this for you
• Ultimately expect to see the same symbiosis with GPU hardware incorporating native capabilities for abstractions like replay buffers
Join us for the Journey
• Do you have a control or optimization problem you’re looking to tackle with Deep Reinforcement Learning?
Sign up for our recently announced early adopter program:
https://bons.ai/getting-started
• Are you interested in working in this area?
We’re hiring!
https://bons.ai/careers
• Questions? Want to learn more?
Visit us at Booth #524 and enter to win an NVIDIA Titan X
https://bons.ai