
Electrical Engineering – Electronic Systems group

Kanishkan Vadivel, Henk Corporaal, Pekka Jääskeläinen

General Architectures for DNN

Recap

• Inference and Learning Principles
• Improving Network Efficiency – focuses on reducing the number of MACs and weights
• Loop Transformations – software tricks to effectively use the memory hierarchy (a tiling sketch follows below)

2

Quantization

• fp32: 1 sign bit (S), 8-bit exponent (E), 23-bit mantissa (M)
• fp16: 1 sign bit (S), 5-bit exponent (E), 10-bit mantissa (M)
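For reference, a minimal loop-tiling (blocking) sketch in C of what such a transformation looks like on a matrix multiply; the matrix size N, tile size TILE, and function name are assumptions for illustration only:

```c
#include <stddef.h>

#define N 512   /* assumed matrix dimension */
#define TILE 64 /* assumed tile size, chosen so the working blocks fit in cache */

/* Tiled (blocked) matrix multiply: C += A * B.
 * The outer loops step over TILE x TILE blocks so that the data touched by
 * the inner loops stays resident in the cache and is reused many times,
 * while the arithmetic is identical to the naive triple loop. */
void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t kk = 0; kk < N; kk += TILE)
            for (size_t jj = 0; jj < N; jj += TILE)
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t k = kk; k < kk + TILE; k++)
                        for (size_t j = jj; j < jj + TILE; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```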


Outline• Introduction• Background on DNN computations• DNN on Traditional Compute Platforms

• General Purpose Processor (CPU)

• Domain Specific Processors (DSPs, VLIW-SIMD)

• Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

6

Today

Introduction

• Machine learning plays a major role in today's world

7


Made for “AI”

Compute Intensity of DNN

• Compute intensity is roughly proportional to the accuracy of the DNN

10
Source: Scaling for edge inference, Nature Electronics, 2018.

Energy Efficiency Requirements

• Ranges from cloud to edge devices (low-power embedded applications)
• Different energy budgets and compute capabilities

11

Power budget vs. compute capability: edge node ~mW, embedded device ~W, cloud server ~kW, HPC cloud ~MW

Hardware Platform for DNN

12

Workload:
1. Inference
2. Training
3. Meta Learning

Compute Platforms:
1. High-performance computing
2. Embedded systems

*Markovic, EE292 Class, Stanford, 2013


Hardware Platform for DNN

14

[Figure: flexibility vs. energy efficiency and performance/area for different platforms – CPU, DSP, ASIP, GPU, FPGA, ASIC (and "+?" beyond ASIC) – spanning roughly ~1000x* from the most flexible (CPU) to the most efficient (ASIC).]

*Markovic, EE292 Class, Stanford, 2013

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms

• General Purpose Processor (CPU)

• Domain Specific Processors (DSPs, VLIW-SIMD)

• Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

15

Today

Deep Convolutional Neural Networks

16
Source: ICIP Tutorial, 2019

Contributes more than 90% of the overall computation, dominating runtime and energy consumption

Convolution Layer

17–22

[Figure: an R x S weight kernel slides over an H x W input fmap; each kernel position produces one element of the E x F output fmap.]

Number of MACs = R x S x (H – R + 1) x (W - S + 1)

Convolution Layer

23

Many input fmaps (C channels)

[Figure: input fmap H x W x C, weights R x S x C, output fmap E x F.]

Number of MACs = R x S x (H – R + 1) x (W - S + 1) x C

Convolution Layer

24

Many input & output fmaps (M filters)

[Figure: input fmap H x W x C, M weight filters of R x S x C each, output fmap E x F x M.]

Number of MACs = R x S x (H – R + 1) x (W - S + 1) x C x M

Convolution Layer – For a Batch of N

25

Many input & output fmaps

[Figure: N input fmaps of H x W x C, M weight filters of R x S x C, N output fmaps of E x F x M.]

Number of MACs = R x S x (H – R + 1) x (W - S + 1) x C x M x N
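To make the formula concrete, a direct (naive) convolution layer is just a seven-deep loop nest in C; this is only a sketch, and the layer dimensions chosen below are assumptions for illustration:

```c
/* Assumed layer dimensions, for illustration only. */
enum { N = 2, C = 3, H = 32, W = 32, M = 8, R = 3, S = 3,
       E = H - R + 1, F = W - S + 1 };

/* Direct convolution (unit stride, no padding).
 * Returns the number of MACs performed, which equals
 * R x S x (H - R + 1) x (W - S + 1) x C x M x N. */
long conv_layer(const float in[N][C][H][W],
                const float wt[M][C][R][S],
                float out[N][M][E][F]) {
    long macs = 0;
    for (int n = 0; n < N; n++)                   /* batch */
        for (int m = 0; m < M; m++)               /* output fmaps (filters) */
            for (int e = 0; e < E; e++)           /* output rows */
                for (int f = 0; f < F; f++) {     /* output columns */
                    float acc = 0.0f;
                    for (int c = 0; c < C; c++)           /* input channels */
                        for (int r = 0; r < R; r++)       /* filter rows */
                            for (int s = 0; s < S; s++) { /* filter columns */
                                acc += in[n][c][e + r][f + s] * wt[m][c][r][s];
                                macs++;
                            }
                    out[n][m][e][f] = acc;
                }
    return macs; /* = R*S*E*F*C*M*N */
}
```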

Fully Connected Layer

26
Source: ICIP Tutorial, 2019

Contributes more than 90% of the overall computation, dominating runtime and energy consumption

Fully Connected Layer

27

1. Height and width of the output fmap are 1 (E = F = 1)
2. Filters are as large as the input fmaps (R = H, S = W)

[Figure: N input fmaps of H x W x C, M weights of R x S x C with R = H and S = W, output fmap of size 1 x 1 x M per input.]

Fully Connected Layer

28

Matrix multiplication

[Figure: weights as an M x (CRS = CHW) matrix, inputs as a (CHW) x N matrix (one flattened fmap per column), outputs as an M x N matrix.]
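A minimal C sketch of this matrix-multiplication view of the FC layer; the row-major layout and function name are assumptions:

```c
/* Fully connected layer as a matrix multiplication:
 * outputs (M x N) = weights (M x K) * inputs (K x N), with K = C*H*W.
 * Each input fmap is flattened into one column of the input matrix. */
void fc_layer(int M, int N, int K,
              const float *weights, /* M x K, row-major */
              const float *inputs,  /* K x N, row-major */
              float *outputs)       /* M x N, row-major */
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += weights[m * K + k] * inputs[k * N + n];
            outputs[m * N + n] = acc; /* M x N x K MACs in total */
        }
}
```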

Compute Intensity of Popular CNNs

29

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms

• General Purpose Processor (CPU)

• Domain Specific Processors (DSPs, VLIW-SIMD)

• Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

30

Today

Overview of Microprocessor Designs

31
Source: Time Moore, Liming Xiu, 2019.

Intrinsic Compute Capability

32–34
Source: XETAL-II, 2010

[Figure: intrinsic compute capability of different processor classes; for deep learning, the TPU reaches roughly ~200x higher performance/Watt than a CPU.]

Source of Inefficiency

35

• Results do not include DRAM power
• More than 50% of energy is spent on cache and control logic

How to improve?
• Reduce control overhead
• Improve the cache hierarchy
• Multi-core/cluster concepts provide an additional performance gain

Source: Computing's Energy Problem, ISSCC 2014

[Figure: energy breakdown – compute vs. cache & control.]

Reduce Control Overhead: SIMD Extensions

36


Reduce Control Overhead: SIMD Extensions

• Intel
  • SSE (Streaming SIMD Extensions, 4x 32-bit single precision) [SSE2, SSE3, SSSE3, SSE4]
  • AVX, AVX2, AVX-512 – 256/512-bit
• AMD – 3DNow!
• Arm – VFP (single/double-precision co-processor), NEON → 128-bit, SVE → 128 to 2048-bit
• Qualcomm – 4x 1024-bit → 4096 bits

38

Reduce Control Overhead: SIMD Extensions

39

DNN-specific extensions: reduced precision and instruction-set extensions

Example: Intel – Cascade Lake 2019 (VNNI – Vector Neural Net Instructions)

40

[Figure: Cascade Lake package, x28 cores.]

SIMD Extensions for DNNs

41

SIMD Extensions for DNNs

42

(VNNI)

• Mixed-precision mode – INT8 x INT8 with INT32 accumulate
• VNNI – FMA in a single cycle, compared to 3 cycles with normal SIMD instructions
• Some architectures support a "2x2 dot-product" as well
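As one concrete (hedged) illustration, the sketch below computes an int8 dot product with the AVX-512 VNNI intrinsic _mm512_dpbusd_epi32, which multiplies unsigned by signed 8-bit elements and accumulates groups of four products into 32-bit lanes in a single instruction; the data layout, loop bounds, and helper name are assumptions for the example:

```c
#include <immintrin.h>
#include <stdint.h>

/* int8 dot product using AVX-512 VNNI (compile with -mavx512f -mavx512vnni).
 * a: unsigned 8-bit activations, w: signed 8-bit weights, n: multiple of 64. */
int32_t dot_int8_vnni(const uint8_t *a, const int8_t *w, int n) {
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512((const void *)(a + i));
        __m512i vw = _mm512_loadu_si512((const void *)(w + i));
        /* One instruction: 64 u8 x s8 products, summed in groups of four
         * into 16 int32 accumulator lanes (replaces a multi-instruction
         * widen/multiply/add sequence on plain SIMD). */
        acc = _mm512_dpbusd_epi32(acc, va, vw);
    }
    int32_t lanes[16];
    _mm512_storeu_si512((void *)lanes, acc);
    int32_t sum = 0;
    for (int i = 0; i < 16; i++)  /* horizontal reduction of the lanes */
        sum += lanes[i];
    return sum;
}
```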

Brain Floating Point

43

• bfloat16: same dynamic range as IEEE fp32 (8-bit exponent), but less precision (7-bit mantissa)
• Example use cases: Google TPU, Cooper Lake Xeon processors
• Another option – "posit" floating point (adaptable fp format)
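A minimal C sketch of why bfloat16 keeps the fp32 dynamic range: it is simply the top 16 bits of an IEEE fp32 value (sign + 8-bit exponent + 7 mantissa bits). The helper below truncates; real hardware typically rounds to nearest even, which is omitted here:

```c
#include <stdint.h>
#include <string.h>

/* fp32 -> bfloat16 by truncation: keep sign, the full 8-bit exponent,
 * and the top 7 mantissa bits. Range is unchanged, precision drops. */
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* bit-level view of the float */
    return (uint16_t)(bits >> 16);
}

/* bfloat16 -> fp32 by padding the low 16 mantissa bits with zeros. */
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```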

How about BNN?

44

y = popcount (W XNOR X)
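A minimal C sketch of that binary dot product, assuming weights and activations are packed 64 values per word with bit value 1 encoding +1 and bit value 0 encoding -1 (__builtin_popcountll is the GCC/Clang popcount builtin):

```c
#include <stdint.h>

/* Binary dot product via XNOR + popcount, rescaled to the +/-1 domain.
 * w, x: packed binarized vectors; words: number of 64-bit words. */
int bnn_dot(const uint64_t *w, const uint64_t *x, int words) {
    int matches = 0;
    for (int i = 0; i < words; i++)
        matches += __builtin_popcountll(~(w[i] ^ x[i])); /* XNOR, then count */
    int n = 64 * words;           /* total number of +/-1 products */
    return 2 * matches - n;       /* #(+1 products) - #(-1 products) */
}
```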


Software Stack for DNN

• Parallelizing compiler
• Inline assembly
• Intrinsics
• Optimized libraries – MKL-DNN, clDNN, BLAS, Arm NN, Arm CMSIS-NN, and many more

47–48

Example: Arm NN Library

49

[Figure: software stack – a TensorFlow model is deployed through the Arm NN library onto the target hardware.]

Comparison among CNN Libraries on CPU

50
Source: Evaluating the energy efficiency of Deep CNN, Da Li, 2016. *ConvNet on Xeon E5 (16-core)

• Caffe backends – Atlas, OpenBLAS, MKL, openMP, and CaffeConTroll

• Performance depends on the quality of the library optimizations for the target

Distributed Learning and Inference

51
Source: Large Scale Distributed Deep Networks, Jeffrey Dean, Google 2019

Distributed DL - Approach

52

Distributed DL – Performance

53

• Models with more parameters benefit more from the use of additional machines

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms

• General Purpose Processor (CPU)

• Domain Specific Processors (DSPs, VLIW-SIMD)

• Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

54

Today

Domain Specific Processors (VLIW-SIMD)

• Processors optimized for a specific application domain (e.g., vision, signal processing)
• Examples: Qualcomm Hexagon, Movidius (Intel), Ceva, and many more

• Support for DNN

• Instruction-set extensions

• DNN Accelerator in the execution pipeline

55

Programming Model

56

Hexagon DSP (Qualcomm)

57

Hexagon over Quad CPU

58
Source: Hot Chips

Hexagon – Power breakdown

59

• Less overhead on control logic and memory compared to CPUs

Another Example: Ceva DSP

60
Source: AnandTech

Final Example: Movidius v2

61

Source: Hotchips’14


Final Example: Intel Neural Compute Stick

63

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms

• General Purpose Processor (CPU)

• Domain Specific Processors (DSPs, VLIW-SIMD)

• Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

64

Today

Graphics Processing Units (GPUs)

• SIMD vs GPU
  • GPUs use threads instead of vectors
  • GPUs have "shared memory" spaces

65

[Figure: SIMD vs GPU execution model.]

How Are Threads Scheduled?

66


Example: NVIDIA Fermi - 2009

69

• Streaming Multiprocessors (SM)

• 32 – cuda cores/SM

• ALU – 32/64-bit

• FP – SP/DP (with FMA)

• SFU – sin, cosine, sqrt, etc.
• Clock – 1.5 GHz (estimated)
• Peak performance – 1.5 TFLOPS

Example: NVIDIA Volta - 2017

70

• Streaming Multiprocessors (SM)
  • 80 SMs
  • 64 INT32/FP32 cores per SM
  • 32 FP64 CUDA cores per SM
  • 8 Tensor cores per SM
    • 4x4 matrix multiply
    • 512 FMA / 1024 FP ops per SM per cycle
• Clock – 1.53 GHz
• Peak Tensor TFLOPS – 125
  • 1.53 GHz x 80 x 1024
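A quick sanity check of that peak number, using only the figures in the bullets above (1024 FP ops per SM per cycle = 8 tensor cores x 64 FMAs x 2 ops per FMA):

```c
#include <stdio.h>

int main(void) {
    double clock_hz = 1.53e9;            /* 1.53 GHz */
    int sms = 80;                        /* streaming multiprocessors */
    int fp_ops_per_sm_per_cycle = 1024;  /* 8 tensor cores x 64 FMA x 2 ops */

    double peak = clock_hz * sms * fp_ops_per_sm_per_cycle;
    printf("Peak tensor throughput: %.1f TFLOPS\n", peak / 1e12); /* ~125.3 */
    return 0;
}
```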


Internals of Tensor Core

• Modes of operation – Volta
  • FP16 – A, B, C are FP16
  • Mixed precision – A and B are FP16, C is FP32
• Turing GPUs – support 1-, 2-, 4-, and 8-bit data types (int4, int8 on Tensor cores)

73

Scheduling Example for 16x16x16 GEMM

74
Source: AnandTech


How to use Tensor Cores

• cuBLAS, cuDNN, etc.
• Library takes care of tiling and the storage hierarchy
• Opcode: HMMA (Matrix Multiply Accumulate)

79
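In practice the library does this scheduling for you; purely as an illustration, the plain-C sketch below decomposes a 16x16x16 GEMM into 4x4x4 sub-products, i.e. one tensor-core-sized (HMMA-like) step per call of the inner helper. All names and the row-major layout are assumptions, not the actual cuBLAS implementation:

```c
/* One tensor-core-sized step: C_tile += A_tile(4x4) * B_tile(4x4). */
static void mma_4x4x4(const float a[4][4], const float b[4][4], float c[4][4]) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++)
                c[i][j] += a[i][k] * b[k][j]; /* 64 FMAs per step */
}

/* 16x16x16 GEMM scheduled as 4x4x4 tiles: 4*4*4 = 64 tensor-core-sized steps.
 * C is assumed to hold the accumulator input (e.g. zeros). */
void gemm16_tiled(const float A[16][16], const float B[16][16], float C[16][16]) {
    for (int ti = 0; ti < 16; ti += 4)
        for (int tj = 0; tj < 16; tj += 4)
            for (int tk = 0; tk < 16; tk += 4) {
                float a[4][4], b[4][4], c[4][4];
                for (int i = 0; i < 4; i++)      /* gather operand tiles */
                    for (int j = 0; j < 4; j++) {
                        a[i][j] = A[ti + i][tk + j];
                        b[i][j] = B[tk + i][tj + j];
                        c[i][j] = C[ti + i][tj + j];
                    }
                mma_4x4x4(a, b, c);              /* one HMMA-like step */
                for (int i = 0; i < 4; i++)      /* scatter accumulator back */
                    for (int j = 0; j < 4; j++)
                        C[ti + i][tj + j] = c[i][j];
            }
}
```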

GPU Performance

80

• We still need the CPU to some extent

GPU vs CPU Performance

• CPU – 16-core Intel Xeon E5-2650 v2 @ 2.6 GHz
• Benchmark: AlexNet
• Lower batch sizes lead to underutilization on all devices
• The K20 has less memory than the Titan X

81

Concluding Remarks

• The compute and data requirements of DNNs are quite high, and a major part of the computation comes from matrix multiplications (i.e., MAC ops)
• Common DNN-specific extensions in generic architectures are:
  1. Instruction-set extensions – generally SIMD support at reduced precision
  2. DNN accelerator in the datapath (co-processor, Tensor core, etc.)
• The effective performance of the platform depends on the hardware capability and the software support (programming model and library used to realize the network)
• Energy efficiency is still a limitation of generic platforms for DNN

82

Reference

• Evaluating the Energy Efficiency of Deep Convolutional Neural Networks on CPUs and GPUs, Da Li and Xinbo Chen, 2016

83

Backup

84

Multithreading Categories

85

[Figure: issue-slot occupancy over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; the legend distinguishes threads 1–5 and idle slots.]

Example: IBM Power4 (Superscalar)

86

Example: IBM Power5

• Supports 2 threads

87

[Figure: Power5 pipeline – 2 fetches (program counters) and 2 initial decodes, 2 commits (architected register sets).]

Power5 Thread Performance

• Relative priority of each thread is controllable in hardware
• For balanced operation, both threads run slower than if they "owned" the machine

88

Any guess on the largest chip so far?

89

Source: Cerebras, Hot Chips 2019