Performant deep reinforcement learning: latency, hazards, and pipeline stalls in the GPU era… and how to avoid them
Mark Hammond, Co-founder / CEO
Latency (n): The time elapsed (typically in clock cycles) between a stimulus and the response to it
Hazard (n): A problem in a CPU's instruction pipeline that prevents the next instruction from executing in the following clock cycle
add [memory_location], register1, register2
sub register3, register4, [memory_location]
Data Hazard Example
1. CPU feeds registers and addition instruction to the ALU
2. ALU performs operation and stores to temporary register
3. CPU directs memory controller to write result to memory
4. CPU retrieves memory at indicated location and feeds contents + register4 and subtraction instruction to ALU
5. ALU performs operation and stores to register3
1. CPU feeds registers and addition instruction to the ALU
2. ALU performs operation and stores to temporary register
3. Concurrently:
   a. CPU directs memory controller to write result to memory
   b. CPU forwards temporary register, register4, and subtraction instruction to ALU
4. ALU performs operation and stores to register3

Michael Abrash, Zen of Code Optimization
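The payoff from forwarding can be made concrete with a toy cycle-count model. The latencies below are illustrative assumptions, not real hardware numbers:

```python
# Toy model of the data hazard above. Assumed latencies (illustrative
# only): 1 cycle per ALU operation, 3 cycles per memory access.
ALU_CYCLES = 1
MEM_CYCLES = 3

def cycles_without_forwarding():
    # add executes; result is written to memory; sub must read that
    # memory location back before it can execute
    return ALU_CYCLES + MEM_CYCLES + MEM_CYCLES + ALU_CYCLES

def cycles_with_forwarding():
    # add executes; the memory write proceeds concurrently while sub
    # consumes the forwarded temporary register directly
    return ALU_CYCLES + max(MEM_CYCLES, ALU_CYCLES)

print(cycles_without_forwarding())  # 8
print(cycles_with_forwarding())     # 4
```

The same pattern, keeping the dependent consumer fed from a fast path instead of round-tripping through slow storage, is exactly what the deep RL optimizations later in this deck exploit.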
From the CPU to the GPU
The Essence of Machine Learning
Traditional programming: a programmer authors the function f(); user / data inputs x are transformed into the desired outputs f(x).

Machine learning: f() is machine learned from observed inputs x and observed outputs f(x).
Challenges to Learning the Underlying Function
f(x) = c1 x + c0
f(x) = c2 x² + c1 x + c0
f(x) = c4 x⁴ + c3 x³ + c2 x² + c1 x + c0
Even with simple one-dimensional functions, you have to worry about issues like overfitting

(Figure: three fits of the same data — underfit, generalized fit, overfit)
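A minimal sketch of the underfit/overfit trade-off, using made-up noisy samples of y = 2x + 1: an ordinary least-squares line generalizes, while a polynomial that interpolates every training point chases the noise.

```python
# Illustrative data (assumed, not from the deck): noisy samples of y = 2x + 1.

def fit_line(xs, ys):
    # ordinary least squares for y = a*x + b
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

def interpolate(xs, ys):
    # Lagrange polynomial through every point: zero training error,
    # but it follows the noise (the overfit case)
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
noise   = [0.3, -0.4, 0.5, -0.2, 0.4]   # fixed "noise" for reproducibility
train_y = [2 * x + 1 + e for x, e in zip(train_x, noise)]

line = fit_line(train_x, train_y)
poly = interpolate(train_x, train_y)

# held-out point from the true function y = 2x + 1
x_test, y_test = 2.5, 6.0
print(abs(line(x_test) - y_test))   # ~0.14: the line generalizes
print(abs(poly(x_test) - y_test))   # ~0.31: the interpolant overfit the noise
```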
Challenges to Learning the Underlying Function
Real data is multi-dimensional and often entails crafting features
Title: The Triumph of the Nerds: The Rise of Accidental Empires
Release date: April 14, 1996
Genre: Documentary
Synopsis: Three-part documentary that takes an inside look at the history of computers, from their rise in the 1970s to the beginning of the dot-com boom of the late 1990s.
Writers: Robert X. Cringely (book), Robert X. Cringely (screenplay)
Stars: Robert X. Cringely, Douglas Adams, Sam Albert
Running time: 150 minutes
Reviewers: 1,102
Rating: 8.5/10
What movies will someone enjoy watching?
An engineered feature: topic area(s), as derived using natural language processing techniques on the synopsis
ANN image from Wikimedia Commons - Mcstrother
For a good overview of neural network types see:
http://www.asimovinstitute.org/neural-network-zoo/
http://www.asimovinstitute.org/neural-network-zoo-prequel-cells-layers/
Deep learning for large scale flexibility
How do we make this performant?
1) Make sure the data pipeline can keep the GPUs populated and processing
2) Optimize the efficiency of the neural network architecture by exploiting structural aspects of the data and problem
Reinforcement learning for control and optimization
(Diagram: the Actor sends an Action to the Environment, a simulation or physical system; environment assessment returns a State and a Reward to the Actor.)
Estimating the value of an action when it is taken
Time | Action | Value of Action
-----|--------|----------------
  1  |   A1   |       ?
  2  |   A2   |       ?
  3  |   A3   |       ?
  4  |   A4   |       ?
Dynamic Programming
Use a model of transition probabilities and the reward function to calculate the value function (visiting all states)
Monte Carlo Methods
Use sampling to estimate the value function (need to see a whole episode)
Temporal Difference “TD” Methods
Look at the error in an estimated value function at every time step (don’t wait until the end of an episode before updating)
Policy Search
Optimize the policy directly
Must estimate one of:
• Policy
• Value of a state
• Value of an action given a state
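The TD idea can be shown in a few lines. This is a generic tabular TD(0) update on a made-up two-state transition; gamma, alpha, and the states are illustrative:

```python
# Tabular TD(0) sketch (toy values): at every step the value estimate
# moves toward reward + gamma * V(next_state), without waiting for the
# episode to end (unlike Monte Carlo methods).
gamma, alpha = 0.9, 0.5
V = {"s0": 0.0, "s1": 0.0}

def td0_update(V, s, r, s_next):
    # the TD error is how far off the current estimate is from the
    # one-step bootstrapped target
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# one observed transition: s0 --(reward=1)--> s1
err = td0_update(V, "s0", 1.0, "s1")
print(V["s0"])  # 0.5 after one update
```

Because the update needs only (state, reward, next state), it can run on every step, which is what lets TD methods keep the learner busy without waiting for full episodes.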
(Diagram: the same actor/environment loop, with Action, State, and Reward flowing between the Actor and the Environment.)
Combining deep learning and reinforcement learning
Leverage deep learning to approximate one of:
• Value (state value) function
• Q value (action value) function
• Policy
• Model (state transition + reward)
New hazards:
• Latency roundtripping to simulator / real world
• State transition from CPU to GPU memory
• Stale parameters during learning
• …
Innovations to address the modern hazards of Deep RL
• Replay buffers / memories
• State space sizing
• Parameter server vs. single model + interleaving
• Separation of training and experience collecting, and stale data loss
• Actor/critic results within the GPU vs. transition over system memory
Replay buffers / memories
Prioritized Experience Replay – Schaul et al. – ICLR 2016 – https://arxiv.org/abs/1511.05952
• Record experience (state, action, reward, resulting state) to a buffer
• Sub-sample the buffer (mini-batches) for training
• Decorrelates the data and makes training with non-linear functions like neural networks stable
• Allows effective use of the GPU as it can process batches even with serial actions
• Optimization trick: prioritize. Favor important transitions for inclusion in mini-batches.
• Optimization trick: dedupe. Store state (often large) separately and record references.
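A hedged sketch of a replay buffer with both tricks, proportional prioritization and deduplicated state storage. The class and method names are illustrative, not Bonsai's or any paper's API:

```python
# Sketch of a prioritized, deduplicating replay buffer (toy version).
import random

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.states = {}        # state_id -> state, stored once and referenced
        self.transitions = []   # (state_id, action, reward, next_state_id)
        self.priorities = []    # e.g. |TD error| per transition

    def _state_id(self, state):
        sid = hash(state)
        self.states.setdefault(sid, state)
        return sid

    def store(self, state, action, reward, next_state, priority=1.0):
        if len(self.transitions) >= self.capacity:
            # a deque would be more efficient; pop(0) keeps the sketch short
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append((self._state_id(state), action, reward,
                                 self._state_id(next_state)))
        self.priorities.append(priority)

    def sample(self, batch_size):
        # proportional prioritized sampling: higher-priority transitions
        # are drawn more often (scheme from Schaul et al.)
        batch = random.choices(self.transitions,
                               weights=self.priorities, k=batch_size)
        return [(self.states[s], a, r, self.states[sn])
                for (s, a, r, sn) in batch]

buf = ReplayBuffer()
buf.store((0, 0), "left", 0.0, (0, 1), priority=0.1)
buf.store((0, 1), "right", 1.0, (0, 0), priority=5.0)
print(len(buf.sample(4)))  # 4: mini-batches can be larger than the buffer
```

Sampling with replacement is what lets the GPU chew on large mini-batches even while actions are generated serially.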
State space sizing
https://xkcd.com/832/
• Assess state space for target domain to select efficient algorithmic approaches
• Optimization trick: pre-populate the replay buffer
• As you pre-populate the replay buffer, you can analyze state recurrence to gauge sizing
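One way to gauge sizing while warming up the buffer, sketched with made-up states:

```python
# Gauge state-space size by measuring how often states recur during
# buffer warm-up. The observed states are illustrative placeholders.
from collections import Counter

observed = ["s1", "s2", "s1", "s3", "s1", "s2"]   # states seen during warm-up
counts = Counter(observed)
recurrence_rate = 1 - len(counts) / len(observed)
print(f"{len(counts)} unique states, recurrence rate {recurrence_rate:.2f}")
```

A high recurrence rate suggests a small effective state space, where tabular or small-model approaches may suffice; a rate near zero points toward function approximation.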
Parameter server vs. single model + interleaving
(Diagram: a central Parameter Server serving multiple Actor & Environment workers.)
• Run many concurrent actors asynchronously from the learner
• Requires copying parameters to the actors, but eliminates the need for a replay buffer
• Decorrelation is achieved by having multiple actors generating samples in parallel instead of mini-batch subsampling
Separation of training and experience collecting, and stale data loss
• "On-policy" vs. "off-policy"
• On-policy methods can only learn from recorded experience that is consistent with the current policy
• Updating the policy therefore causes stale data loss: the replay buffer must be made consistent
• Off-policy methods can learn from any experience, whether consistent with the current policy or not
Sutton, Richard S. and Barto, Andrew G., Reinforcement Learning: An Introduction, MIT Press, 1998
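The distinction is easiest to see in the update rules. Below, a generic tabular SARSA (on-policy) update next to a Q-learning (off-policy) update on the same made-up transition; the Q-table values are illustrative:

```python
# SARSA vs. Q-learning on one toy transition.
gamma, alpha = 0.9, 0.5

def sarsa_update(Q, s, a, r, s2, a2):
    # on-policy: bootstraps from the action the current policy actually took
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions):
    # off-policy: bootstraps from the greedy action, regardless of what
    # the behavior policy did, so any recorded experience remains usable
    best = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

Q_on  = {("s", "a"): 0.0, ("s2", "a"): 2.0, ("s2", "b"): 5.0}
Q_off = dict(Q_on)

# same transition: took "a" in s, reward 0, landed in s2, then took "a" again
sarsa_update(Q_on, "s", "a", 0.0, "s2", "a")
q_learning_update(Q_off, "s", "a", 0.0, "s2", ["a", "b"])

print(Q_on[("s", "a")])   # 0.9  (bootstraps from Q(s2, a) = 2)
print(Q_off[("s", "a")])  # 2.25 (bootstraps from max_b Q(s2, b) = 5)
```

Because the SARSA target depends on what the current policy chose, old experience generated by a different policy is invalid for it; the Q-learning target does not, which is what makes replay buffers safe for off-policy methods.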
Actor/critic results within the GPU vs. transition over system memory
• If separate models are maintained for the actor and the critic, then the results must be fed to each other via system memory
• Even with two separately trained components, performing the full training cycle within one composed model removes the need to leave the GPU
(Diagram: the Agent comprises an Actor and a Critic, interacting with the Environment.)
Parallelized Deep RL Implementations
• A3C – Asynchronous Advantage Actor-Critic
• GA3C – Hybrid CPU/GPU Asynchronous Advantage Actor-Critic
• ANAF – Asynchronous Normalized Advantage Functions
• TRPO – Trust Region Policy Optimization
A3C – Asynchronous Advantage Actor-Critic
Asynchronous Methods for Deep Reinforcement Learning – Mnih et al. – https://arxiv.org/abs/1602.01783
• Parameter Server Approach
• Does not require replay buffer
• Instead of scaling out across machines, makes use of a single multicore CPU
• Compounds updates from multiple actors over time before updating model
• Non-locking
• Vary exploration strategy per actor
• Works for on- and off-policy methods
• No GPU – single CPU only (due to non-locking and global model accessibility)
GA3C – Hybrid CPU/GPU Asynchronous Advantage Actor-Critic
Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU – Babaeizadeh et al. – ICLR 2017 – https://arxiv.org/abs/1611.06256
• Addresses underloading of GPUs from A3C due to lack of replay buffer
• Additional constraint of off-policy method
• Does not use a parameter server architecture – uses a single copy of the model in an interleaving mode for prediction/training
• Requires tuning of the various queues (dynamic adjustment of the number of simulators, predictors, and trainers)
• Sees improved performance over CPU-only A3C
ANAF – Asynchronous Normalized Advantage Functions
Continuous Deep Q-Learning with Model-based Acceleration – Gu et al. – https://arxiv.org/abs/1603.00748
Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates – Gu et al. – https://arxiv.org/abs/1610.00633
• Parallelizes NAF using a parameter server approach
• Off-policy
• Does not require the careful balancing of resources required by GA3C
TRPO – Trust Region Policy Optimization
Parallel Trust Region Policy Optimization with Multiple Actors – Frans et al. – http://kvfrans.com/static/trpo.pdf
• On-policy
• Unlike many other on-policy algorithms, more time is spent collecting samples than computing gradients to update the policy
• More reliant on the CPU than the GPU, since policy updates requiring the GPU are infrequent
• Monte Carlo estimates
What approach does Bonsai take?
• Automatic algorithm selection based on state space sizing
• Keep a single copy of the model in GPU
• Interleave prediction and learning requests
• Use off-policy methods
• Run multiple concurrent simulations
• Use a learner memory and take advantage of it to keep the GPU loaded
• Use queues to desynchronize experience gathering from simulations, predictions, and training
• Queue predictions and training data to allow for the data transformation between CPU and GPU to be handled in an asynchronous way
• Make use of user defined abstract concepts and curricula
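The queue-based desynchronization above can be sketched with Python's standard library. The thread structure below is an assumption for illustration, not Bonsai's actual implementation:

```python
# Desynchronize experience gathering from training with a blocking queue:
# simulator threads produce transitions, a trainer thread consumes them.
import queue
import threading

experience_q = queue.Queue(maxsize=256)
processed = []

def simulator(sim_id, steps):
    for t in range(steps):
        # ...step the environment, produce a transition...
        experience_q.put((sim_id, t))

def trainer(total):
    while len(processed) < total:
        transition = experience_q.get()   # blocks until experience arrives
        # ...batch here and move to GPU memory asynchronously...
        processed.append(transition)

sims = [threading.Thread(target=simulator, args=(i, 10)) for i in range(3)]
learn = threading.Thread(target=trainer, args=(30,))
learn.start()
for s in sims:
    s.start()
for s in sims:
    s.join()
learn.join()
print(len(processed))  # 30
```

The queue absorbs the latency mismatch between simulation round-trips and GPU batching, the deep RL analogue of the forwarding trick from the CPU hazard example.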
User defined abstract concepts and curricula
(Diagram: a concept mapping inputs i to output o is progressively decomposed into subconcepts a, b, and c, with subconcept a further decomposed into a1 through a4.)
• Not explicit feature engineering, but guidance for navigating state space through decomposition
• Leverages subject matter expertise
• Can aid in parallelization and avoiding hazards:
• Implicit subconcepts can be created to auto-parallelize
• Guided lesson decomposition of a given concept
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.28                 Driver Version: 370.28                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 0000:01:00.0      On |                  N/A |
| 35%   63C    P2   161W / 250W |  3491MiB / 12186MiB  |     68%      Default |
+-------------------------------+----------------------+----------------------+
Breakout training with 10 concurrent simulations
Bringing it all together
• Rapid iteration in this space
• We're at the "writing assembly code in multiple columns" equivalent state in the CPU era progression
• Companies like Bonsai are working to give you compilers that handle this for you
• Ultimately expect to see the same symbiosis with GPU hardware incorporating native capabilities for abstractions like replay buffers
Join us for the Journey
• Do you have a control or optimization problem you’re looking to tackle with Deep Reinforcement Learning?
Sign up for our recently announced early adopter program:
https://bons.ai/getting-started
• Are you interested in working in this area?
We’re hiring!
https://bons.ai/careers
• Questions? Want to learn more?
Visit us at Booth #524 and enter to win an NVIDIA Titan X
https://bons.ai