spcl.inf.ethz.ch
@spcl_eth
TAL BEN-NUN
Reproducing Deep Learning
https://www.arxiv.org/abs/1802.09941
https://arxiv.org/abs/1901.10183
What is Deep Learning good for?
[Figure: timeline of deep learning milestones, from digit recognition (1989) through object classification (2012), segmentation (2013), translation (2014), gameplay AI (2016), image captioning, and neural computers (2017), alongside the number of deep learning papers per year.]
A very promising area of research!
Currently around 23 papers per day!
How does Deep Learning work? (Canziani et al. 2017)
[Figure: a deep network f(x) maps an input image to class probabilities (Cat 0.54, Dog 0.28, ...); the prediction is compared against the true label (Cat 1.00) and drives layer-wise weight updates.]
Datasets: ImageNet (1k): 180 GB; ImageNet (22k): a few TB; industry: much larger
Models: 100-200 layers deep, ~100M-2B parameters, 0.1-8 GiB parameter storage
Labels: 10-22k, and growing (e.g., face recognition)
Training time: weeks
Deep Learning is Supercomputing!
A brief theory of supervised deep learning
[Figure: the network f(x) maps a labeled sample (a cat image) to class probabilities (Cat 0.54, Dog 0.28, ...), which are compared against the true label (Cat 1.00) to drive layer-wise weight updates.]
Labeled samples x ∈ X ⊂ 𝒟 with label domain Y and true labels l(x)
Network function f(x): X → Y, with a fixed network structure and learned weights w
Training objective: w* = argmin_{w ∈ ℝ^d} 𝔼_{x∼𝒟} [ℓ(w, x)]
0-1 loss: ℓ_{0-1}(w, x) = 0 if f(x) = l(x), and 1 if f(x) ≠ l(x)
The network is a composition of layers, f(x) = f_n(f_{n-1}(f_{n-2}(… f_1(x) …))),
e.g., convolution 1 → convolution 2 → convolution 3 → pooling → fully connected,
producing intermediate outputs f_1(x), f_2(f_1(x)), …, f(x)
Cross-entropy loss: ℓ_ce(w, x) = −Σ_i l(x)_i · log( e^{f(x)_i} / Σ_k e^{f(x)_k} )
Squared loss: ℓ_sq(w, x) = ‖f(x) − l(x)‖²
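As a concrete illustration of the loss definitions above, a minimal NumPy sketch of the softmax cross-entropy loss (framework-agnostic; the six-class example values are illustrative):

import numpy as np

def cross_entropy_loss(logits, label_onehot):
    # Softmax over the network outputs f(x), then cross-entropy against the true label l(x).
    e = np.exp(logits - np.max(logits))   # shift for numerical stability
    probs = e / e.sum()
    return -np.sum(label_onehot * np.log(probs))

# Six classes as on the slide: Cat, Dog, Airplane, Truck, Horse, Bicycle
logits = np.array([2.0, 1.3, -1.5, -0.4, 1.0, -1.2])   # network outputs f(x)
label  = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])      # true label l(x): Cat
print(cross_entropy_loss(logits, label))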
Stochastic Gradient Descent
[Figure: forward pass through the layers (convolution 1 → convolution 2 → convolution 3 → pooling → fully connected), producing f_1(x), f_2(f_1(x)), …, f(x), followed by backpropagation.]
Training objective: w* = argmin_{w ∈ ℝ^d} 𝔼_{x∼𝒟} [ℓ(w, x)]
Layer storage = w_l + f_l(o_{l-1}) + 𝛻w_l + 𝛻o_l (weights, forward outputs, weight gradients, output gradients)
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
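To make the storage term concrete, a back-of-the-envelope sketch in Python (the layer dimensions below are illustrative assumptions, not taken from the slides):

# Rough per-layer memory following: storage = w_l + f_l(o_{l-1}) + grad(w_l) + grad(o_l)
# Assumed example: 3x3 convolution, 256 -> 256 channels, 56x56 output, minibatch of 32, fp32.
bytes_per_elem = 4
weights = 256 * 256 * 3 * 3              # w_l
acts    = 32 * 256 * 56 * 56             # f_l(o_{l-1})
grad_w  = weights                        # gradient w.r.t. weights
grad_o  = acts                           # gradient w.r.t. layer outputs
total_mib = (weights + acts + grad_w + grad_o) * bytes_per_elem / 2**20
print(f"~{total_mib:.0f} MiB for this single layer")   # roughly 200 MiB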
Minibatch Stochastic Gradient Descent (SGD)
[Figure: for each minibatch, the predicted class probabilities (Cat 0.54, Dog 0.28, ...) are compared against the true labels (Cat 1.00, ...), and the gradient averaged over the minibatch updates the weights.]
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
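A minimal sketch of the minibatch SGD loop (the gradient computation is a placeholder for backpropagation; names and defaults are illustrative):

import numpy as np

def minibatch_sgd(w, dataset, grad_fn, lr=0.1, batch_size=64, epochs=10):
    # grad_fn(w, batch) returns the loss gradient averaged over the minibatch.
    for _ in range(epochs):
        np.random.shuffle(dataset)
        for i in range(0, len(dataset), batch_size):
            batch = dataset[i:i + batch_size]
            w = w - lr * grad_fn(w, batch)
    return w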
Common Problems in ML Research
“I have a new optimization algorithm, how do I test it on actual data/models?”
“We can fully utilize a supercomputer, what are the commonly used DL workloads?”
“I designed a new ASIC, how do I convince ML researchers to use it?”
“Mathematical theory states that algorithm X is optimal for max-pooling; how fast is it compared to the others? How accurate is it?”
“If we use compression method Y, how drastic is the communication reduction?”
Deep Learning Breakdown
Individual operators
Network evaluation
Optimization algorithm
Distributed training
[Figure: the breakdown from operators, through networks, to training and distributed training with multiple agents and a parameter server.]
Computing convolutional layers
Direct: slide the kernel over the input (the slide's example convolves a 4×4 input with a 3×3 kernel).
im2col: lower the convolution to a matrix-matrix multiplication.
FFT (indirect): transform input and kernel with ℱ, multiply element-wise, and transform back: w * x = ℱ⁻¹( ℱ(w) ⊙ ℱ(x) ).
Winograd (indirect): minimal-filtering algorithms that reduce the number of multiplications.
S. Chetlur et al.: cuDNN: Efficient Primitives for Deep Learning, arXiv 2014
K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition 2006
M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14
A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16
X. Liu et al.: Efficient Sparse-Winograd Convolutional Neural Networks, ICLR'17 Workshop
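To illustrate the im2col lowering, a minimal single-channel NumPy sketch (valid padding, stride 1; input and kernel values are illustrative, not the slide's example):

import numpy as np

def im2col_conv2d(x, k):
    # Each output pixel corresponds to one flattened input patch (a column),
    # so the whole convolution becomes a single matrix product.
    H, W = x.shape
    KH, KW = k.shape
    OH, OW = H - KH + 1, W - KW + 1
    cols = np.empty((KH * KW, OH * OW))
    for i in range(OH):
        for j in range(OW):
            cols[:, i * OW + j] = x[i:i + KH, j:j + KW].ravel()
    return (k.ravel() @ cols).reshape(OH, OW)

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3))
print(im2col_conv2d(x, k))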
Data parallelism
Simple and efficient solution, easy to implement
Parameters are duplicated at all processors
Generalization is affected by the (effective) minibatch size
[Figure: every processor holds a full model replica and works on a different portion of the minibatch; gradients are aggregated across processors.]
A minimal sketch of the aggregation step follows the reference below.
X. Zhang et al.: An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2, NIPS’89
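A minimal data-parallel SGD step using an MPI allreduce to average gradients across workers (mpi4py; the gradient computation itself is a placeholder for backpropagation):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def data_parallel_step(w, local_batch, grad_fn, lr=0.1):
    # Every rank holds a full copy of w and computes gradients on its own shard of the minibatch.
    local_grad = grad_fn(w, local_batch)          # placeholder for backpropagation
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= comm.Get_size()                # average over all workers
    return w - lr * global_grad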
Model parallelism
Parameters can be distributed across processors
The minibatch has to be copied to all processors
Backpropagation requires all-to-all communication in every layer
[Figure: the neurons of each layer are partitioned across processors.]
A sketch of splitting one layer follows the reference below.
U.A. Muller and A. Gunzinger: Neural Net Simulation on Parallel Computers, IEEE Int’l Conf. on Neural Networks 1994
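As a sketch of the idea, a fully connected layer whose output neurons are split across MPI ranks (mpi4py; names and shapes are illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def model_parallel_fc(x, w_shard):
    # Each rank owns a column-slice of the weight matrix (a subset of the output neurons)
    # and computes only its part of the layer output for the whole minibatch x.
    local_out = x @ w_shard                                   # shape: (batch, out_features / num_ranks)
    full_out = np.concatenate(comm.allgather(local_out), axis=1)
    return full_out                                           # every rank now holds the complete activation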
Pipeline parallelism
Layers/parameters can be distributed across processors
Sparse communication pattern (only between adjacent pipeline stages)
The minibatch has to be copied through all processors
Complicated management in code
G. Blelloch and C.R. Rosenberg: Network Learning on the Connection Machine, IJCAI'87
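A toy single-process simulation of pipelined execution over microbatches (the stage functions and microbatches are illustrative placeholders):

def pipeline_run(stages, microbatches):
    # stages: one layer function per "processor" (pipeline stage).
    # After the pipeline fills, every stage works on a different microbatch each time step.
    n = len(stages)
    in_flight = [None] * n          # microbatch output currently held by each stage
    outputs = []
    for t in range(len(microbatches) + n - 1):
        for i in reversed(range(n)):             # move data one stage forward per time step
            if i == 0:
                data = microbatches[t] if t < len(microbatches) else None
            else:
                data = in_flight[i - 1]
            in_flight[i] = stages[i](data) if data is not None else None
        if in_flight[-1] is not None:
            outputs.append(in_flight[-1])
    return outputs

print(pipeline_run([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))   # [4, 6, 8]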
Hybrid parallelism
Layers/parameters can be distributed across processors
The minibatch can also be distributed
Often specific to layer types (e.g., distribute fully connected layers but handle convolutional layers data-parallel)
Enables arbitrary combinations of data, model, and pipeline parallelism – very powerful!
[Figure: data parallelism, model parallelism, and layer (pipeline) parallelism combined within one network.]
A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014
J. Dean et al.: Large scale distributed deep networks, NIPS'12
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
Hyperparameter and Architecture search
Meta-optimization of hyperparameters (e.g., momentum) and of the DNN architecture itself
Using Reinforcement Learning [1] (explore/exploit different configurations)
Genetic Algorithms with modified (specialized) mutations [2]
Particle Swarm Optimization [3] and other meta-heuristics
[Figure: search agents driven by Reinforcement Learning [1] and Evolutionary Algorithms [4].]
A simple random-search baseline sketch follows the references below.
[1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017
[2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018
[3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO'17
[4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR'18
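The methods above are more sophisticated; as a minimal baseline sketch, plain random search over a hypothetical configuration space:

import random

def random_search(train_and_eval, space, trials=20):
    # train_and_eval(config) -> validation accuracy; a placeholder for a full training run.
    best_cfg, best_acc = None, float("-inf")
    for _ in range(trials):
        cfg = {name: random.choice(values) for name, values in space.items()}
        acc = train_and_eval(cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

space = {"lr": [0.1, 0.01, 0.001], "momentum": [0.0, 0.9, 0.99], "batch_size": [64, 256, 1024]}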
Communication optimizations
Different options for how to optimize updates:
Send 𝛻w, receive w
Send the FC factors (o_{l-1}, o_l) and compute 𝛻w on the parameter server
Broadcast factors so the full w does not have to be received
Use lossy compression when sending, and accumulate the error locally!
Quantization
Quantize weight updates and potentially the weights themselves
Main trick is stochastic rounding [1] – the expectation is more accurate
Enables low precision (half, quarter) to become standard
TernGrad - ternary weights [2], 1-bit SGD [3], …
Sparsification
Do not send small weight updates, or send only the top-k [4]
Accumulate the rest locally (a sketch follows the references below)
[1] S. Gupta et al.: Deep Learning with Limited Numerical Precision, ICML'15
[2] F. Li and B. Liu: Ternary Weight Networks, arXiv 2016
[3] F. Seide et al.: 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014
[4] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018
(source: ai.intel.com)
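A NumPy sketch of top-k gradient sparsification with local error accumulation, in the spirit of [4] (an illustrative sketch, not the SparCML implementation):

import numpy as np

def sparsify_topk(grad, residual, k):
    # Add the error accumulated from previous steps, keep only the k largest-magnitude
    # entries for communication, and carry everything else over as the new residual.
    acc = grad + residual
    idx = np.argpartition(np.abs(acc), -k)[-k:]   # indices of the top-k entries
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]
    new_residual = acc - sparse                   # error feedback for the next iteration
    return sparse, new_residual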
Existing Systems
Existing Benchmarks
DeepBench
✓ Uses commonly used operator dimensions
× Cannot test in the context of an actual framework (reduced overall accuracy)
× Does not encapsulate inter-operator optimizations (now a focus of research)
DAWNBench
✓ Good overall metrics (time-to-convergence, time-to-inference)
✓ Free-for-all: anyone can use their own training algorithms and frameworks
× Competitors can optimize for the metric (i.e., methods may not generalize)
× Result is only a JSON file reporting time; the network is not necessarily shared
MLPerf
✓ Very structured (e.g., open vs. closed competitions)
✓ Includes both convergence and performance tests
✓ Seems to be heavily backed by industry
× Unclear: requirements for a potential competitor, how results are presented, reproducibility guarantees, hardware differences, HPC/distributed aspects
Deep500
A deep learning meta-framework: a framework for frameworks to reside in
[Figure: Deep500 architecture across four levels. Level 0: Operators (including custom ops); Level 1: Network and graph Executor; Level 2: Training (Dataset on HDD, Sampler, Optimizer, Training Runner/Trainer); Level 3: Distributed training (Distributed Sampler, Distributed Optimizer, datasets on a parallel file system); Metrics cut across all levels.]
Deep500 Component Overview
Level 0: Operators – Python Operators, Custom Operators, CMake Interface; validation: Operator Validation, Gradient Checker
Level 1: Network Processing – Network, ONNX Parser + Visitor, ONNX Test Parser, Graph Executor, Reference Graph Executor, L1 Framework Integration (TensorFlow, PyTorch, Caffe2), ONNX Models (LeNet, ResNet), Python and C++ References for LeNet, AlexNet, ResNet; validation: Inference Validation, Performance and Energy Metrics
Level 2: Training – Dataset, Sampler, Optimizer, Reference Optimizers (SGD, Momentum, AdaGrad, Adam), WUR, 3-Step Opt., Tickers, Metrics; validation: Training Accuracy
Level 3: Distributed Training – Distributed Dataset, Distributed Optimizer, Model Consistency, Centralization, MPI Reference Optimizer (Decentralized, Param. Server), Horovod, Distributed Caffe2; validation: Distributed Training Accuracy, Communication Volume, Overhead Analysis
Existing Systems
Use-Case 1: New Operator

Python:

class IPowOp(CustomPythonOp):
    def __init__(self, power):
        super(IPowOp, self).__init__()
        self.power = power
        assert int(power) == power  # integral power only

    def forward(self, inputs):
        # elementwise x ** power
        return inputs[0] ** self.power

    def backward(self, grads, fwd_inputs, fwd_outputs):
        # d/dx x**p = p * x**(p-1), chained with the incoming gradient
        return (grads[0] * self.power *
                (fwd_inputs[0] ** (self.power - 1)))

C++:

template<typename T>
class ipowop : public deep500::CustomOperator {
protected:
    int m_len;
public:
    ipowop(int len) : m_len(len) {}
    virtual ~ipowop() {}

    void forward(const T *input, T *output) {
        // elementwise pow, parallelized with OpenMP
        #pragma omp parallel for
        for (int i = 0; i < m_len; ++i)
            output[i] = std::pow(input[i], DPOWER);
    }

    void backward(const T *nextop_grad,
                  const T *fwd_input_tensor,
                  const T *fwd_output_tensor,
                  T *input_tensor_grad) {
        // d/dx x^p = p * x^(p-1), chained with the gradient from the next operator
        #pragma omp parallel for
        for (int i = 0; i < m_len; ++i) {
            input_tensor_grad[i] = nextop_grad[i] * DPOWER *
                std::pow(fwd_input_tensor[i], DPOWER - 1);
        }
    }
};
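To sanity-check such an operator, Deep500 includes a gradient checker at Level 0; as a framework-agnostic illustration (assuming the Python IPowOp above is importable), a plain finite-difference check:

import numpy as np

op = IPowOp(3)
x = np.random.rand(5) + 0.5
# Analytic gradient of sum(x**3) via the operator's backward pass (upstream gradient = ones).
analytic = op.backward([np.ones_like(x)], [x], [op.forward([x])])
# Central finite-difference approximation for comparison.
eps = 1e-6
numeric = np.array([
    (np.sum((x + eps * np.eye(5)[i]) ** 3) - np.sum((x - eps * np.eye(5)[i]) ** 3)) / (2 * eps)
    for i in range(5)
])
print(np.max(np.abs(analytic - numeric)))   # should be tiny (around 1e-8 or below)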
Use-Case 2: Intel FPGA DNN Executor
Use an ONNX visitor to construct OpenCL kernels per operator and compile them
The network executor then invokes the FPGA through a .so file
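A minimal sketch of walking an ONNX graph and dispatching per operator type (uses the standard onnx Python package; the emit_* code-generation functions are hypothetical placeholders, not the actual FPGA backend):

import onnx

def emit_opencl_kernels(model_path):
    model = onnx.load(model_path)
    kernels = []
    for node in model.graph.node:                        # visit each operator in the graph
        if node.op_type == "Conv":
            kernels.append(emit_conv_kernel(node))       # hypothetical code generator
        elif node.op_type == "MaxPool":
            kernels.append(emit_pool_kernel(node))       # hypothetical code generator
        else:
            kernels.append(emit_generic_kernel(node))    # hypothetical fallback
    return kernels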
Use-Case 3: New Optimization Algorithm
?
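One way such a use case might look: implementing a new update rule against a Deep500-style optimizer interface. The class and method names below are illustrative assumptions, not the actual Deep500 API:

import numpy as np

class SignSGDOptimizer:
    # Hypothetical optimizer that could be plugged in where the reference SGD optimizer is used.
    def __init__(self, lr=0.01):
        self.lr = lr

    def step(self, weights, gradients):
        # Update each parameter using only the sign of its gradient,
        # a simple example of a "new" rule one might want to benchmark.
        return {name: w - self.lr * np.sign(gradients[name])
                for name, w in weights.items()}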