General Architectures for DNN
Kanishkan Vadivel, Henk Corporaal, Pekka Jääskeläinen
Electrical Engineering – Electronic Systems group


Page 1

Electrical Engineering – Electronic Systems group

Kanishkan Vadivel, Henk Corporaal, Pekka Jääskeläinen

General Architectures for DNN

Page 2

Recap
• Inference and Learning Principles
• Improving Network Efficiency – focuses on reducing the number of MACs and weights
• Loop Transformations – software tricks to effectively use the memory hierarchy


Quantization
• fp32: S | E (8) | M (23) – 1 sign, 8 exponent, 23 mantissa bits
• fp16: S | E (5) | M (10) – 1 sign, 5 exponent, 10 mantissa bits
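The fp32/fp16 bit layouts above can be inspected directly from Python's standard library; a minimal sketch (the helper names are ours, not from the slides):

```python
import struct

def fp32_fields(x: float):
    # Reinterpret the 32-bit pattern: 1 sign, 8 exponent, 23 mantissa bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fp16_fields(x: float):
    # struct's 'e' format is IEEE half precision: 1 sign, 5 exponent, 10 mantissa bits.
    bits = struct.unpack(">H", struct.pack(">e", x))[0]
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

print(fp32_fields(1.0))  # (0, 127, 0): exponent bias 127
print(fp16_fields(1.0))  # (0, 15, 0): exponent bias 15
```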

Page 3

Outline for Next two Lectures
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms
  • General Purpose Processor (CPU)
  • Domain Specific Processors (DSPs, VLIW-SIMD)
  • Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures



Page 6

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms
  • General Purpose Processor (CPU)
  • Domain Specific Processors (DSPs, VLIW-SIMD)
  • Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

Today

Page 7

Introduction
• Machine learning plays a major role in today's world


Page 9


Made for “AI”

Page 10

Compute Intensity of DNN
• Compute intensity is roughly proportional to the accuracy of the DNN

Source: Scaling for edge inference, Nature Electronics, 2018

Page 11

Energy Efficiency Requirements
• Ranges from Cloud to Edge devices (low power embedded applications)
• Different energy budgets and compute capabilities

Power budget vs. compute capability:
Edge Node – mW | Embedded Device – W | Cloud Server – kW | HPC Cloud – MW

Page 12

Hardware Platform for DNN

Workload:
1. Inference
2. Training
3. Meta Learning

Compute Platforms:
1. High-performance computing
2. Embedded systems

*Markovic, EE292 Class, Stanford, 2013


Page 14

Hardware Platform for DNN

Workload: 1. Inference, 2. Training, 3. Meta Learning
Compute Platforms: 1. High-performance computing, 2. Embedded systems, + ?

[Figure: Flexibility vs. Energy Efficiency (Performance/Area) trade-off: CPU, DSP, GPU, FPGA, ASIP, ASIC, spanning roughly ~1000x* in energy efficiency]

*Markovic, EE292 Class, Stanford, 2013

Page 15

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms
  • General Purpose Processor (CPU)
  • Domain Specific Processors (DSPs, VLIW-SIMD)
  • Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

Today

Page 16

Deep Convolutional Neural Networks

Source: ICIP Tutorial, 2019

Convolutional layers contribute more than 90% of the overall computation, dominating runtime and energy consumption

Page 17

Convolution Layer

[Figure: Input fmap (H x W) convolved (X) with weights (R x S) = Output fmap (E x F)]


Page 20

Convolution Layer

[Figure: Input fmap (H x W) convolved with weights (R x S) = Output fmap (E x F); one output element highlighted]

Number of MACs per output element = R x S


Page 22

Convolution Layer

[Figure: the R x S filter slides over the H x W input, producing an E x F output with E = H – R + 1 and F = W – S + 1 (stride 1, no padding)]

Number of MACs = R x S x (H – R + 1) x (W – S + 1)

Page 23

Convolution Layer

Many input fmaps

[Figure: Input fmap (H x W x C) convolved with weights (R x S x C) = Output fmap (E x F)]

Number of MACs = R x S x (H – R + 1) x (W – S + 1) x C

Page 24

Convolution Layer

Many input & output fmaps

[Figure: Input fmap (H x W x C) convolved with M filters (R x S x C) = Output fmap (E x F x M)]

Number of MACs = R x S x (H – R + 1) x (W – S + 1) x C x M

Page 25

Convolution Layer – For a Batch of N

[Figure: N input fmaps (H x W x C) convolved with M filters (R x S x C) = N output fmaps (E x F x M)]

Number of MACs = R x S x (H – R + 1) x (W – S + 1) x C x M x N
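The MAC-count formula on these slides (stride 1, no padding) can be wrapped in a small helper; `conv_macs` is our name, not from the slides:

```python
def conv_macs(H, W, C, R, S, M, N=1):
    """MACs for a convolution layer with stride 1 and no padding,
    following the slide's formula R*S*(H-R+1)*(W-S+1)*C*M*N."""
    E, F = H - R + 1, W - S + 1   # output fmap height/width
    return R * S * E * F * C * M * N

# One 3x3 filter over a 5x5 single-channel input: 9 MACs per output
# element, and a 3x3 output fmap -> 9 * 9 = 81 MACs.
print(conv_macs(H=5, W=5, C=1, R=3, S=3, M=1))  # 81
```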

Page 26

Fully Connected Layer

Source: ICIP Tutorial, 2019

Contributes more than 90% of overall computation, dominating runtime and energy consumption

Page 27

Fully Connected Layer

1. Height and width of the output fmap are 1 (E = F = 1)
2. Filters are as large as the input fmaps (R = H, S = W)

[Figure: N input fmaps (H x W x C) times M weight filters (R x S x C, with R = H, S = W) = N output fmaps of size 1 x 1 x M]

Page 28

Fully Connected Layer

Matrix Multiplication

[Figure: weight matrix (M x CHW) times input matrix (CHW x N) = output matrix (M x N); the filter size CRS equals CHW since R = H and S = W]
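The fully connected layer as a matrix multiplication can be sketched in a few lines of plain Python; the sizes here (CHW = 8, M = 3, N = 1) and the weight values are hypothetical, chosen only for illustration:

```python
# A fully connected layer is the product of an (M x CHW) weight matrix
# with a (CHW x N) flattened-input matrix.
CHW, M = 8, 3
weights = [[(i + j) % 5 for j in range(CHW)] for i in range(M)]  # M x CHW
x = [[v] for v in range(CHW)]                                    # CHW x 1 (N = 1)

def matmul(A, B):
    # Naive dense matrix multiply: rows of A dotted with columns of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

y = matmul(weights, x)  # M x 1 output fmap (E = F = 1)
print(y)                # [[50], [58], [71]]
```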

Page 29

Compute Intensity of Popular CNNs


Page 30

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms
  • General Purpose Processor (CPU)
  • Domain Specific Processors (DSPs, VLIW-SIMD)
  • Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

Today

Page 31

Overview of Microprocessor Designs

Source: Time Moore, Liming Xiu, 2019

Page 32

Intrinsic Compute Capability

Source: XETAL-II, 2010


Page 34

Intrinsic Compute Capability

Source: XETAL-II, 2010

[Figure: ~200x difference in Performance/Watt between CPU and TPU for deep learning]

Page 35

Source of Inefficiency

• Results do not include DRAM power
• More than 50% of energy is spent on cache and control logic

How to improve?
• Reduce control overhead
• Improve the cache hierarchy
• Multi-core/cluster concepts provide an additional performance gain

[Figure: energy breakdown between Compute and Cache & Control]

Source: Computing's Energy Problem, ISSCC 2014

Page 36

Reduce Control Overhead: SIMD Extensions



Page 38

Reduce Control Overhead: SIMD Extensions

• Intel
  • SSE (Streaming SIMD Extensions, 4x 32-bit single precision) [SSE2, SSE3, SSSE3, SSE4]
  • AVX, AVX2, AVX-512 – 256/512-bit
• AMD – 3DNow!
• Arm – VFP (single/double precision co-processor), NEON – 128-bit, SVE – 128 to 2048-bit
• Qualcomm – 4x 1024-bit – 4096 bits

Page 39

Reduce Control Overhead: SIMD Extensions


DNN-specific extensions: reduced precision and instruction-set extensions

Page 40

Example: Intel Cascade Lake 2019 (VNNI – Vector Neural Network Instructions)

[Figure: die with 28 cores]

Page 41

SIMD Extensions for DNNs


Page 42

SIMD Extensions for DNNs

(VNNI)

• Mixed-precision mode – INT8 x INT8 + INT32 accumulate
• VNNI – FMA in a single cycle, compared to 3 cycles with normal SIMD instructions
• Some architectures support "2x2 dot-product" as well
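The fused multiply-accumulate above can be sketched in scalar Python; this mirrors the behavior of one 32-bit lane of the AVX-512 VNNI `vpdpbusd` instruction (four unsigned-8 x signed-8 products summed into an int32 accumulator), though the function name and sizes here are ours:

```python
def vnni_dot_accumulate(acc: int, a_u8, b_s8) -> int:
    """Sketch of a VNNI-style op (cf. AVX-512 VNNI vpdpbusd): four
    unsigned 8-bit x signed 8-bit products are summed into a single
    32-bit accumulator lane in one instruction, instead of separate
    multiply / widen / add steps."""
    assert len(a_u8) == len(b_s8) == 4
    return acc + sum(a * b for a, b in zip(a_u8, b_s8))

print(vnni_dot_accumulate(10, [1, 2, 3, 4], [5, -6, 7, 8]))  # 10 + 5 - 12 + 21 + 32 = 56
```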

Page 43

Brain Floating Point

• bfloat16: same dynamic range as IEEE FP32 (8 exponent bits), but less accuracy (7 mantissa bits vs 23)
• Example use-cases: Google TPU, Cooper Lake Xeon processors
• Another option – "posit" floating point (an adaptable fp format)
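Because bfloat16 keeps fp32's sign and exponent bits and only shortens the mantissa, a conversion can be sketched as truncating the low 16 bits of the fp32 pattern (real hardware typically rounds to nearest even rather than truncating):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Truncate an fp32 value to bfloat16 precision: keep the sign,
    the 8 exponent bits, and the top 7 mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bfloat16(3.141592653589793))  # 3.140625: only ~2-3 decimal digits survive
```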

Page 44

How about BNN?


y = popcount (W XNOR X)
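The XNOR/popcount trick above can be sketched in a few lines, assuming weights and activations are bit-packed with 1 encoding +1 and 0 encoding -1 (the packing and function name here are for illustration):

```python
def bnn_dot(w_bits: int, x_bits: int, n: int) -> int:
    """Binary dot product over n elements encoded as bits (1 -> +1,
    0 -> -1). XNOR marks matching positions, so the +/-1 dot product
    is matches - mismatches = 2*popcount(XNOR) - n."""
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)
    return 2 * bin(xnor).count("1") - n

# w = [+1, -1, +1, +1] -> 0b1011, x = [+1, +1, -1, +1] -> 0b1101
print(bnn_dot(0b1011, 0b1101, 4))  # (+1) + (-1) + (-1) + (+1) = 0
```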


Page 48

Software Stack for DNN

• Parallelizing compiler
• Inline assembly
• Intrinsics
• Optimized libraries – MKL-DNN, clDNN, BLAS, Arm NN, Arm CMSIS-NN, and many more

Page 49

Example: Arm-NN Library

[Figure: TensorFlow model mapped onto hardware through the Arm NN stack]

Page 50

Comparison among CNN Libraries on CPU

Source: Evaluating the energy efficiency of Deep CNN, Da Li, 2016. *ConvNet on Xeon E5 (16-core)

• Caffe backends – Atlas, OpenBLAS, MKL, openMP, and CaffeConTroll

• Performance depends on the quality of the library optimizations for the target

Page 51

Distributed Learning and Inference

Source: Large Scale Distributed Deep Networks, Jeffrey Dean, Google, 2019

Page 52

Distributed DL - Approach


Page 53

Distributed DL – Performance


• Models with more parameters benefit more from the use of additional machines

Page 54

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms
  • General Purpose Processor (CPU)
  • Domain Specific Processors (DSPs, VLIW-SIMD)
  • Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

Today

Page 55

Domain Specific Processors (VLIW-SIMD)
• Processors optimized for a specific application domain (e.g. vision, signal processing)
• Examples: Qualcomm Hexagon, Movidius (Intel), Ceva, and many more
• Support for DNN:
  • Instruction-set extensions
  • DNN accelerator in the execution pipeline

Page 56

Programming Model


Page 57

Hexagon DSP (Qualcomm)


Page 58

Hexagon over Quad CPU

Source: Hot Chips

Page 59

Hexagon – Power breakdown


• Less overhead on control logic and memory compared to CPUs

Page 60

Another Example: Ceva DSP

Source: AnandTech

Page 61

Final Example: Movidius v2


Source: Hot Chips 2014


Page 63

Final Example: Intel Neural Compute Stick


Page 64

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms
  • General Purpose Processor (CPU)
  • Domain Specific Processors (DSPs, VLIW-SIMD)
  • Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

Today

Page 65

Graphics Processing Units (GPUs)
SIMD vs GPU:
• GPUs use threads instead of vectors
• GPUs have "shared memory" spaces

[Figure: SIMD vs GPU execution model]

Page 66

How Are Threads Scheduled?

Page 67

Example: NVIDIA Fermi - 2009



Page 69

Example: NVIDIA Fermi - 2009

69

• Streaming Multiprocessors (SM)

• 32 – cuda cores/SM

• ALU – 32/64-bit

• FP – SP/DP (with FMA)

• SFU – Sin, cosine, sqrt, etc• Clock – 1.5 GHz (estimated)• Peak Performance – 1.5 TFLOPS

Page 70

Example: NVIDIA Volta - 2017

• Streaming Multiprocessors (SM)
  • 80 SMs
  • 64 INT32/FP32 cores per SM
  • 32 FP64 CUDA cores per SM
  • 8 Tensor cores per SM
    • 4x4 matrix multiply
    • 512 FMA = 1024 FP ops per SM per cycle
• Clock – 1.53 GHz
• Peak Tensor TFLOPS – 125 (1.53 GHz x 80 SMs x 1024 ops)
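The peak-throughput arithmetic on this slide checks out; a one-liner to verify it:

```python
# Peak tensor throughput for Volta, from the slide's figures:
# 80 SMs x 1024 FP ops per SM per cycle x 1.53 GHz clock.
clock_hz = 1.53e9
sms = 80
fp_ops_per_sm_per_cycle = 1024  # 512 FMAs, each counted as 2 FP ops

peak_tflops = clock_hz * sms * fp_ops_per_sm_per_cycle / 1e12
print(round(peak_tflops, 1))  # 125.3
```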


Page 73

Internals of Tensor Core

Modes of operation – Volta:
• FP16 mode – A, B, C are FP16
• Mixed-precision mode – A and B are FP16, C is FP32
• Turing GPUs support 1-, 2-, 4-, and 8-bit data types (int4, int8 on Tensor cores)
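Mixed-precision mode can be sketched as D = A x B + C on 4x4 tiles, with the inputs rounded to fp16 and the accumulation kept in full precision; this is a numeric model of the operation, not NVIDIA's implementation:

```python
import struct

def fp16(x: float) -> float:
    # Round a value to fp16, as the tensor core stores its A and B inputs.
    return struct.unpack(">e", struct.pack(">e", x))[0]

def tensor_core_mma(A, B, C):
    """Sketch of one Volta tensor core op: D = A x B + C on 4x4 tiles,
    A and B in fp16, accumulation at higher precision (mixed-precision)."""
    n = 4
    return [[sum(fp16(A[i][k]) * fp16(B[k][j]) for k in range(n)) + C[i][j]
             for j in range(n)] for i in range(n)]

ident = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
C = [[0.5] * 4 for _ in range(4)]
D = tensor_core_mma(ident, ident, C)  # identity x identity + C
print(D[0])  # [1.5, 0.5, 0.5, 0.5]
```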

Page 74

Scheduling Example for a 16x16x16 GEMM

Source: AnandTech


Page 79

How to use Tensor Cores
• cuBLAS, cuDNN, etc.
• The library takes care of tiling and the storage hierarchy
• Opcode: HMMA (Matrix Multiply Accumulate)

Page 80

GPU Performance

• We still need the CPU to some extent

Page 81

GPU vs CPU Performance

• CPU – 16-core Intel Xeon E5-2650 v2 @ 2.6 GHz
• Benchmark: AlexNet
• Lower batch sizes lead to under-utilization on all devices
• K20 has less memory than Titan X

Page 82

Concluding Remarks
• The compute and data requirements of DNNs are quite high, and a major part of the computation comes from matrix multiplications (i.e. MAC ops)
• Common DNN-specific extensions in generic architectures are:
  1. Instruction-set extensions – generally SIMD support at reduced precision
  2. A DNN accelerator on the datapath (co-processor, Tensor core, etc.)
• The effective performance of a platform depends on the hardware capability and the software support (programming model and library used to realize the network)
• Energy efficiency is still a limitation of generic platforms for DNN

Page 83

Reference
• Evaluating the Energy Efficiency of Deep Convolutional Neural Networks on CPUs and GPUs, Da Li and Xinbo Chen, 2016

Page 84

Backup


Page 85

Multithreading Categories

[Figure: issue-slot occupancy over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading; the legend distinguishes Threads 1-5 and idle slots]

Page 86

Example: IBM Power4 (Superscalar)


Page 87

Example: IBM Power5
• Supports 2 threads
• 2 fetches (PC), 2 initial decodes
• 2 commits (architected register sets)

Page 88

Power5 Thread Performance
• Relative priority of each thread is controllable in hardware
• For balanced operation, both threads run slower than if they "owned" the machine

Page 89

Any guess on the largest chip so far?

Source: Cerebras, Hot Chips 2019