
Page 1:

spcl.inf.ethz.ch

@spcl_eth

TAL BEN-NUN

Reproducing Deep Learning

https://www.arxiv.org/abs/1802.09941

https://arxiv.org/abs/1901.10183

Page 2:

What is Deep Learning good for?

[Timeline figure: 1989 digit recognition; 2012 object classification; 2013 segmentation; 2014 translation; 2016 gameplay AI and neural computers; 2017 image captioning. Accompanying chart: number of deep learning papers per year.]

A very promising area of research!

23 papers per day!

Page 3:

How does Deep Learning work? (Canziani et al. 2017)

[Figure: a deep neural network maps an input image through its layers to output class probabilities f(x) (e.g., Cat 0.54, Dog 0.28, Airplane 0.02, Truck 0.07, ...); comparing against the true one-hot label (Cat 1.00) drives the layer-wise weight update.]

Number of users: 0.8 bn
ImageNet (1k): 180 GB
ImageNet (22k): a few TB
Industry: much larger

100-200 layers deep
~100M-2B parameters
0.1-8 GiB parameter storage

10-22k labels, growing (e.g., face recognition)
weeks to train

Deep Learning is Supercomputing!

Page 4:

A brief theory of supervised deep learning

[Figure: the network prediction f(x) (e.g., Cat 0.54, Dog 0.28, ...) is compared against the true one-hot label (Cat 1.00); the resulting loss drives layer-wise weight updates.]

Labeled samples x ∈ X ⊂ 𝒟, label domain Y, true label l(x)
Network f(x): X → Y with fixed structure and learned weights w

w* = argmin_{w ∈ ℝ^d} 𝔼_{x~𝒟} [ℓ(w, x)]

0-1 loss:
ℓ_{0-1}(w, x) = 0 if f(x) = l(x), 1 if f(x) ≠ l(x)

Layer composition (e.g., convolution 1 → convolution 2 → convolution 3 → pooling → fully connected):
f(x) = f_n(f_{n−1}(f_{n−2}(… f_1(x) …)))

Cross-entropy loss:
ℓ_ce(w, x) = −Σ_i l(x)_i · log( e^{f(x)_i} / Σ_k e^{f(x)_k} )

Squared loss:
ℓ_sq(w, x) = ‖f(x) − l(x)‖²
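To make the two loss definitions concrete, here is a minimal NumPy sketch (illustrative only; it reuses the class scores from the figure above as the network output f(x)):

import numpy as np

# Network output f(x) (scores) and one-hot true label l(x)
f_x = np.array([0.54, 0.28, 0.02, 0.07, 0.33, 0.04])   # Cat, Dog, Airplane, Truck, Horse, Bicycle
l_x = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])          # true label: Cat

# Cross-entropy loss: -sum_i l(x)_i * log( e^{f(x)_i} / sum_k e^{f(x)_k} )
log_softmax = f_x - np.log(np.sum(np.exp(f_x)))
loss_ce = -np.sum(l_x * log_softmax)

# Squared loss: || f(x) - l(x) ||^2
loss_sq = np.sum((f_x - l_x) ** 2)

print(loss_ce, loss_sq)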

Page 5:

Stochastic Gradient Descent

[Figure: forward pass through convolution 1 → convolution 2 → convolution 3 → pooling → fully connected, producing f_1(x), f_2(f_1(x)), …, f(x).]

w* = argmin_{w ∈ ℝ^d} 𝔼_{x~𝒟} [ℓ(w, x)]

Layer storage = w_l + f_l(o_{l−1}) + ∇w_l + ∇o_l (weights, forward activations, weight gradients, and output gradients of layer l)

T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
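A minimal NumPy sketch of one SGD step on a two-layer network (illustrative only, not the Deep500 implementation; the layer sizes are arbitrary). It shows the per-layer storage listed above: weights w_l, forward activations f_l(o_{l−1}), weight gradients ∇w_l, and activation gradients ∇o_l.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 100))        # minibatch of inputs
y = rng.standard_normal((32, 10))         # targets (squared loss)

# Weights w_l of two fully connected layers (learned)
w1 = rng.standard_normal((100, 50)) * 0.01
w2 = rng.standard_normal((50, 10)) * 0.01

# Forward pass: store activations f_l(o_{l-1}) for the backward pass
o1 = np.maximum(x @ w1, 0.0)              # layer 1 output (ReLU)
o2 = o1 @ w2                              # layer 2 output, f(x)

# Backward pass: gradients of the squared loss w.r.t. outputs and weights
grad_o2 = 2.0 * (o2 - y)                  # gradient w.r.t. o_2
grad_w2 = o1.T @ grad_o2                  # gradient w.r.t. w_2
grad_o1 = (grad_o2 @ w2.T) * (o1 > 0)     # gradient w.r.t. o_1, through the ReLU
grad_w1 = x.T @ grad_o1                   # gradient w.r.t. w_1

# SGD update
lr = 1e-3
w1 -= lr * grad_w1
w2 -= lr * grad_w2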

Page 6:

Minibatch Stochastic Gradient Descent (SGD)

[Figure: for each minibatch sample, the network output f(x) (e.g., Cat 0.54, Dog 0.28, Airplane 0.02, Truck 0.07, Horse 0.03, Bicycle 0.04) is compared to the true one-hot label (Cat 1.00); the gradient is averaged over the minibatch before the weight update.]

T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
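A minimal sketch of the minibatch SGD loop itself (illustrative; `loss_grad(w, x, y)` is a placeholder for the full forward/backward pass and is not defined in the slides):

import numpy as np

def minibatch_sgd(w, dataset, labels, loss_grad, lr=0.1, batch_size=64, epochs=10):
    """Generic minibatch SGD: average the gradient over each sampled minibatch."""
    n = len(dataset)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for _ in range(n // batch_size):
            idx = rng.choice(n, size=batch_size, replace=False)   # sample a minibatch
            grads = [loss_grad(w, dataset[i], labels[i]) for i in idx]
            w = w - lr * np.mean(grads, axis=0)                   # averaged gradient step
    return w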

Page 7:

Common Problems in ML Research

“I have a new optimization algorithm, how do I test it on actual data/models?”
“We can fully utilize a supercomputer, what are the commonly used DL workloads?”
“I designed a new ASIC, how do I convince ML researchers to use it?”
“Mathematical theory states that algorithm X is optimal for max-pooling, how fast is it compared to the others? How accurate is it?”
“If we use compression method Y, how drastic is the communication reduction?”

Page 8:

Deep Learning Breakdown

Operators – individual operators
Networks – network evaluation
Training – optimization algorithm
Distributed Training – multiple agents, e.g., coordinated through a parameter server

[Diagram: several Agents exchanging updates with a Param. Server]

Page 9:

Computing convolutional layers

Direct:
[Figure: direct convolution example, a 4×4 input (4 1 9 8 / 5 9 9 8 / 0 7 3 4 / 2 6 3 1) convolved with a 3×3 kernel (1 −1 0 / 0.1 −2 0 / 3 4 1.1) yields the 4×4 output (21.9 59.3 53.9 43.9 / −6.3 16.8 12.3 12 / 9.6 15.3 25.8 14 / 0.4 7.1 52.1 53.1).]

Indirect:
im2col – lower the convolution to a matrix multiplication (see the sketch below)
FFT – transform, multiply pointwise with ŵ = ℱ(w), and apply ℱ⁻¹
Winograd – minimal filtering algorithms

K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int’l Workshop on Frontiers in Handwriting Recognition 2006
M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR’14
A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR’16
X. Liu et al.: Efficient Sparse-Winograd Convolutional Neural Networks, ICLR’17 Workshop
S. Chetlur et al.: cuDNN: Efficient Primitives for Deep Learning, arXiv 2014
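As a concrete illustration of the im2col lowering (a small sketch, not cuDNN's or any framework's implementation): the input patches are unrolled into columns so that the convolution becomes a single matrix multiplication.

import numpy as np

def conv2d_im2col(x, w):
    """Valid 2D convolution of an HxW input with a KxK kernel via im2col + matmul.
    Like most DL frameworks, this is cross-correlation (no kernel flip)."""
    H, W = x.shape
    K = w.shape[0]
    out_h, out_w = H - K + 1, W - K + 1
    # im2col: each output position becomes one column of K*K input values
    cols = np.empty((K * K, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + K, j:j + K].ravel()
    # The convolution is now a single matrix product
    return (w.ravel() @ cols).reshape(out_h, out_w)

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
print(conv2d_im2col(x, w))   # 2x2 "valid" output for the 4x4 input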

Page 10:

Data parallelism

Simple and efficient solution, easy to implement (sketched below)
Duplicate parameters at all processors
Generalization is affected by minibatch size

[Diagram: each processor holds a full replica of the model and processes a different part of the minibatch]

X. Zhang et al.: An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2, NIPS’89
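A minimal data-parallel SGD step using mpi4py (illustrative; `compute_gradient` and the local data shards are assumptions, not defined in the slides): each rank computes gradients on its local shard, and the gradients are averaged with an allreduce so every replica applies the same update.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def data_parallel_step(w, local_batch, local_labels, compute_gradient, lr=0.1):
    """One data-parallel SGD step: local gradient + allreduce average + identical update."""
    local_grad = compute_gradient(w, local_batch, local_labels)
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)   # sum gradients across replicas
    global_grad /= comm.Get_size()                        # average over all processors
    return w - lr * global_grad                           # every rank applies the same update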

Page 11:

Model parallelism

Parameters can be distributed across processors (sketched below)
Mini-batch has to be copied to all processors
Backpropagation requires all-to-all communication in every layer

[Diagram: each layer's neurons are partitioned across processors 1 … 3]

U.A. Muller and A. Gunzinger: Neural Net Simulation on Parallel Computers, IEEE Int’l Conf. on Neural Networks 1994
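A sketch of a model-parallel fully connected layer with mpi4py (illustrative only; the layer sizes and layout are assumptions): each rank stores a column slice of the weight matrix, computes its slice of the output, and the full output is reassembled with an allgather.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

in_dim, out_dim, batch = 512, 256, 32
assert out_dim % size == 0
local_out = out_dim // size

# Each processor owns a different column slice of the weight matrix
rng = np.random.default_rng(rank)
w_local = rng.standard_normal((in_dim, local_out)) * 0.01

# The full minibatch is replicated on every processor
x = np.ones((batch, in_dim))

# Local partial output, then allgather to reassemble the full layer output
y_local = x @ w_local                                   # (batch, local_out)
y_parts = np.empty((size, batch, local_out))
comm.Allgather(np.ascontiguousarray(y_local), y_parts)  # gather all slices
y = np.concatenate(y_parts, axis=1)                     # (batch, out_dim)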

Page 12:

Pipeline parallelism

Layers/parameters can be distributed across processors (sketched below)
Sparse communication pattern (only between pipeline stages)
Mini-batch has to be copied through all processors
Complicated management in code

G. Blelloch and C.R. Rosenberg: Network Learning on the Connection Machine, IJCAI’87
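A minimal pipeline-parallel forward pass with mpi4py (illustrative; `stage_fn` is a hypothetical per-stage function standing in for the layers owned by each rank, and the backward pass is omitted): each rank processes microbatches and forwards the activations to the next stage.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def stage_fn(x):
    """Hypothetical pipeline stage owned by this rank (e.g., a few layers)."""
    return np.maximum(x, 0.0)

n_microbatches, shape = 8, (16, 128)
for mb in range(n_microbatches):
    if rank == 0:
        act = np.random.standard_normal(shape)      # first stage reads the input microbatch
    else:
        act = np.empty(shape)
        comm.Recv(act, source=rank - 1, tag=mb)     # receive activations from the previous stage
    act = stage_fn(act)
    if rank < size - 1:
        comm.Send(act, dest=rank + 1, tag=mb)       # forward to the next pipeline stage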

Page 13:

Hybrid parallelism

Layers/parameters can be distributed across processors
Can distribute the minibatch
Often specific to layer types (e.g., distribute fc layers but handle conv layers data-parallel)
Enables arbitrary combinations of data, model, and pipeline parallelism – very powerful! (sketched below)

[Diagram: Model Parallelism, Data Parallelism, and Layer (Pipeline) Parallelism combined in one network]

A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014
J. Dean et al.: Large scale distributed deep networks, NIPS’12
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
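A sketch of how such a combination could be organized with MPI communicators (illustrative; the group size and naming are assumptions, not from the slides): ranks are split into model-parallel groups, and the corresponding ranks across groups form data-parallel groups for gradient averaging.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

MODEL_PARALLEL_SIZE = 2   # assumed: 2 ranks cooperate on one model replica

# Ranks {0,1}, {2,3}, ... each hold one partitioned model replica
model_comm = comm.Split(color=rank // MODEL_PARALLEL_SIZE, key=rank)

# Ranks holding the same model partition across replicas average gradients
data_comm = comm.Split(color=rank % MODEL_PARALLEL_SIZE, key=rank)

# model_comm: all-to-all / allgather for model-parallel layers
# data_comm:  allreduce of gradients for data parallelism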

Page 14:

Hyperparameter and Architecture search

[Figure panels: Reinforcement Learning [1]; Evolutionary Algorithms [4]]

Meta-optimization of hyper-parameters (e.g., momentum) and DNN architecture:
Reinforcement Learning [1] (explore/exploit different configurations)
Genetic Algorithms with modified (specialized) mutations [2] (sketched below)
Particle Swarm Optimization [3] and other meta-heuristics

[1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017
[2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018
[3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO’17
[4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR’18
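A generic evolutionary hyper-parameter search sketch (illustrative only; `evaluate(cfg)` stands in for training a model with the given hyper-parameters and returning its validation accuracy, and the mutation ranges are assumptions):

import random

def evolve(evaluate, population_size=16, steps=200, seed=0):
    """Simple evolutionary search: mutate a good candidate, replace the worst one."""
    rng = random.Random(seed)
    population = [{"lr": 10 ** rng.uniform(-4, -1), "momentum": rng.uniform(0.5, 0.99)}
                  for _ in range(population_size)]
    scores = [evaluate(cfg) for cfg in population]          # e.g., validation accuracy
    for _ in range(steps):
        sample = rng.sample(range(population_size), k=4)    # tournament selection
        parent = max(sample, key=lambda i: scores[i])
        child = dict(population[parent])
        child["lr"] *= 10 ** rng.uniform(-0.3, 0.3)         # specialized mutations
        child["momentum"] = min(0.99, max(0.5, child["momentum"] + rng.uniform(-0.05, 0.05)))
        worst = min(range(population_size), key=lambda i: scores[i])
        population[worst], scores[worst] = child, evaluate(child)
    best = max(range(population_size), key=lambda i: scores[i])
    return population[best]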

Page 15:

Communication optimizations

Different options for optimizing updates:
Send ∇w, receive w
Send FC factors (o_{l−1}, o_l), compute ∇w on the parameter server
Broadcast factors to avoid receiving the full w
Use lossy compression when sending, accumulate the error locally!

Quantization
Quantize weight updates and potentially weights
Main trick is stochastic rounding [1] – the expectation is more accurate
Enables low precision (half, quarter) to become standard
TernGrad – ternary weights [2], 1-bit SGD [3], …

Sparsification (sketched below)
Do not send small weight updates, or only send the top-k [4]
Accumulate them locally

[1] S. Gupta et al.: Deep Learning with Limited Numerical Precision, ICML’15
[2] F. Li and B. Liu: Ternary Weight Networks, arXiv 2016
[3] F. Seide et al.: 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014
[4] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018

Source: ai.intel.com
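A minimal sketch of top-k gradient sparsification with local error accumulation (illustrative only, not the SparCML implementation):

import numpy as np

def sparsify_topk(grad, residual, k):
    """Send only the k largest-magnitude entries; keep the rest as local residual."""
    acc = grad + residual                                # add previously unsent updates
    idx = np.argpartition(np.abs(acc), -k)[-k:]          # indices of the top-k entries
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]                               # values that get communicated
    new_residual = acc - sparse                          # accumulate the rest locally
    return sparse, new_residual

# Usage: maintain one residual buffer per worker across iterations
grad = np.random.standard_normal(1_000_000)
residual = np.zeros_like(grad)
sparse_update, residual = sparsify_topk(grad, residual, k=10_000)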

Page 16:

Existing Systems

Page 17:

Existing Benchmarks

DeepBench
Uses commonly used operator dimensions
× Cannot test in the context of an actual framework (reduced overall accuracy)
× Does not encapsulate inter-operator optimizations (now a focus of research)

DAWNBench
Good overall metric (time-to-convergence, time-to-inference)
Free-for-all: anyone can use their own training algorithms and frameworks
× Competitors can optimize for the metric (i.e., methods may not generalize)
× Result is only a JSON file reporting time; the network is not necessarily shared

MLPerf
Very structured (e.g., open vs. closed competitions)
Includes both convergence and performance tests
Seems to be heavily backed by industry
Unclear: requirements from a potential competitor, how results are presented, reproducibility guarantees, hardware differences, HPC/distributed aspects

Page 18:

Deep500

A deep learning meta-framework: a framework for frameworks to reside in.

[Architecture diagram: Dataset (on HDD/PFS), Sampler and Distributed Sampler, Optimizer and Distributed Optimizer, Executor, Operators (including CustomOp), Network, Training Runner (Trainer), and Metrics, organized into Level 0 (Operators), Level 1 (Network Processing), Level 2 (Training), and Level 3 (Distributed Training).]

Page 19:

Deep500 Component Overview

[Component overview by level:
Level 0 – Operators: Python Operators, Custom Operators, CMake Interface; validation: Operator Validation, Gradient Checker; metrics: Performance, Energy
Level 1 – Network Processing: Network, ONNX Parser + Visitor, ONNX Test Parser, Graph Executor, Reference Graph Executor, ONNX Models (LeNet, ResNet), Python/C++ References for LeNet, AlexNet, ResNet; integrations: TensorFlow, PyTorch, Caffe2; validation: Inference Validation
Level 2 – Training: Optimizer, Reference Optimizers (SGD, Momentum, Adagrad, Adam), Dataset, Tickers, Metrics, WUR, 3-Step Opt.; metrics: Training Accuracy
Level 3 – Distributed Training: Distributed Optimizer, Distributed Dataset, Model Consistency, Centralization, MPI Reference Optimizer (Decentralized, Param. Server); integrations: Horovod, Distributed Caffe2, L1 Framework Integration; metrics: Distributed Training Accuracy, Communication Volume, Overhead Analysis]

Page 20:

Existing Systems

Page 21:

Use-Case 1: New Operator

Python:

class IPowOp(CustomPythonOp):
    def __init__(self, power):
        super(IPowOp, self).__init__()
        self.power = power
        assert int(power) == power  # integral

    def forward(self, inputs):
        return inputs[0] ** self.power

    def backward(self, grads, fwd_inputs, fwd_outputs):
        return (grads[0] * self.power *
                (fwd_inputs[0] ** (self.power - 1)))

C++:

template<typename T>
class ipowop : public deep500::CustomOperator {
protected:
    int m_len;
public:
    ipowop(int len) : m_len(len) {}
    virtual ~ipowop() {}

    void forward(const T *input, T *output) {
        #pragma omp parallel for
        for (int i = 0; i < m_len; ++i)
            output[i] = std::pow(input[i], DPOWER);
    }

    void backward(const T *nextop_grad,
                  const T *fwd_input_tensor,
                  const T *fwd_output_tensor,
                  T *input_tensor_grad) {
        #pragma omp parallel for
        for (int i = 0; i < m_len; ++i) {
            input_tensor_grad[i] = nextop_grad[i] * DPOWER *
                                   std::pow(fwd_input_tensor[i], DPOWER - 1);
        }
    }
};
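A quick numerical check of the Python operator above (illustrative; it only exercises the forward/backward methods shown and does not go through Deep500's own gradient checker):

import numpy as np

op = IPowOp(power=3)
x = np.array([1.0, 2.0, -1.5])

# Forward: x^3
y = op.forward([x])

# Backward with an upstream gradient of ones -> analytic gradient 3 * x^2
analytic = op.backward([np.ones_like(x)], [x], [y])

# Finite-difference comparison (the op is elementwise, so one simultaneous perturbation suffices)
eps = 1e-6
numeric = (op.forward([x + eps]) - op.forward([x - eps])) / (2 * eps)
assert np.allclose(analytic, numeric, atol=1e-4)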

Page 22:

Use-Case 2: Intel FPGA DNN Executor

Use the ONNX visitor to construct OpenCL kernels per operator and compile them (see the sketch below)
The network executor invokes the FPGA through an .so file
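To illustrate the per-operator traversal, here is a small sketch using the `onnx` Python package (an assumption for illustration; Deep500's own ONNX visitor interface is not shown here) that walks a model's graph and dispatches one code-generation hook per operator:

import onnx

def generate_kernels(model_path, emit):
    """Walk the ONNX graph and call emit(node) once per operator."""
    model = onnx.load(model_path)             # parse the .onnx protobuf
    for node in model.graph.node:             # nodes are stored in topological order
        emit(node)                            # e.g., generate an OpenCL kernel for node.op_type

def print_kernel_stub(node):
    print(f"// kernel for {node.op_type}: inputs={list(node.input)} outputs={list(node.output)}")

# generate_kernels("lenet.onnx", print_kernel_stub)   # hypothetical model file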

Page 23:

Use-Case 3: New Optimization Algorithm

?
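As a generic illustration of what a pluggable optimization algorithm looks like (a minimal momentum-SGD sketch; the class interface is an assumption, not Deep500's Optimizer API):

import numpy as np

class MomentumSGD:
    """Minimal momentum SGD: v <- mu*v - lr*grad; w <- w + v."""
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr, self.momentum = lr, momentum
        self.velocity = None

    def step(self, w, grad):
        if self.velocity is None:
            self.velocity = np.zeros_like(w)
        self.velocity = self.momentum * self.velocity - self.lr * grad
        return w + self.velocity

# Usage inside a training loop: w = optimizer.step(w, grad)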