
Electrical Engineering – Electronic Systems group

Kanishkan Vadivel, Henk Corporaal, Pekka Jääskeläinen

General Architectures for DNN

Recap

• Inference and Learning Principles
• Improving Network Efficiency – focuses on reducing the number of MACs and weights
• Loop Transformations – software tricks to effectively use the memory hierarchy (a tiling sketch follows below)

2

Quantization

• fp32: 1 sign bit (S), 8-bit exponent (E), 23-bit mantissa (M)
• fp16: 1 sign bit (S), 5-bit exponent (E), 10-bit mantissa (M)
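For reference, a minimal loop-tiling (blocking) sketch in C of what such a transformation looks like on a matrix multiply; the matrix size N, tile size TILE, and function name are assumptions for illustration only:

```c
#include <stddef.h>

#define N 512   /* assumed matrix dimension */
#define TILE 64 /* assumed tile size, chosen so the working blocks fit in cache */

/* Tiled (blocked) matrix multiply: C += A * B.
 * The outer loops step over TILE x TILE blocks so that the data touched by
 * the inner loops stays resident in the cache and is reused many times,
 * while the arithmetic is identical to the naive triple loop. */
void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t kk = 0; kk < N; kk += TILE)
            for (size_t jj = 0; jj < N; jj += TILE)
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t k = kk; k < kk + TILE; k++)
                        for (size_t j = jj; j < jj + TILE; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```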


Outline• Introduction• Background on DNN computations• DNN on Traditional Compute Platforms

• General Purpose Processor (CPU)

• Domain Specific Processors (DSPs, VLIW-SIMD)

• Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

6

Today

Introduction

• Machine learning plays a major role in today's world

7


Made for “AI”

Compute Intensity of DNN

• Compute intensity is roughly proportional to the accuracy of the DNN

10
Source: Scaling for edge inference, Nature Electronics, 2018.

Energy Efficiency Requirements

• Ranges from cloud to edge devices (low-power embedded applications)
• Different energy budgets and compute capabilities

11

Power budget vs. compute capability: edge node ~mW, embedded device ~W, cloud server ~kW, HPC cloud ~MW

Hardware Platform for DNN

12

Workload:
1. Inference
2. Training
3. Meta Learning

Compute Platforms:
1. High-performance computing
2. Embedded systems

*Markovic, EE292 Class, Stanford, 2013


Hardware Platform for DNN

14

[Figure: flexibility vs. energy efficiency and performance/area for different platforms – CPU, DSP, ASIP, GPU, FPGA, ASIC (and "+?" beyond ASIC) – spanning roughly ~1000x* from the most flexible (CPU) to the most efficient (ASIC).]

*Markovic, EE292 Class, Stanford, 2013

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms

• General Purpose Processor (CPU)

• Domain Specific Processors (DSPs, VLIW-SIMD)

• Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

15

Today

Deep Convolutional Neural Networks

16
Source: ICIP Tutorial, 2019

Contributes more than 90% of the overall computation, dominating runtime and energy consumption

Convolution Layer

17–22

[Figure: an R x S weight kernel slides over an H x W input fmap; each kernel position produces one element of the E x F output fmap.]

Number of MACs = R x S x (H – R + 1) x (W - S + 1)

Convolution Layer

23

Many input fmaps (C channels)

[Figure: input fmap H x W x C, weights R x S x C, output fmap E x F.]

Number of MACs = R x S x (H – R + 1) x (W - S + 1) x C

Convolution Layer

24

Many input & output fmaps (M filters)

[Figure: input fmap H x W x C, M weight filters of R x S x C each, output fmap E x F x M.]

Number of MACs = R x S x (H – R + 1) x (W - S + 1) x C x M

Convolution Layer – For a Batch of N

25

Many input & output fmaps

[Figure: N input fmaps of H x W x C, M weight filters of R x S x C, N output fmaps of E x F x M.]

Number of MACs = R x S x (H – R + 1) x (W - S + 1) x C x M x N
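To make the formula concrete, a direct (naive) convolution layer is just a seven-deep loop nest in C; this is only a sketch, and the layer dimensions chosen below are assumptions for illustration:

```c
/* Assumed layer dimensions, for illustration only. */
enum { N = 2, C = 3, H = 32, W = 32, M = 8, R = 3, S = 3,
       E = H - R + 1, F = W - S + 1 };

/* Direct convolution (unit stride, no padding).
 * Returns the number of MACs performed, which equals
 * R x S x (H - R + 1) x (W - S + 1) x C x M x N. */
long conv_layer(const float in[N][C][H][W],
                const float wt[M][C][R][S],
                float out[N][M][E][F]) {
    long macs = 0;
    for (int n = 0; n < N; n++)                   /* batch */
        for (int m = 0; m < M; m++)               /* output fmaps (filters) */
            for (int e = 0; e < E; e++)           /* output rows */
                for (int f = 0; f < F; f++) {     /* output columns */
                    float acc = 0.0f;
                    for (int c = 0; c < C; c++)           /* input channels */
                        for (int r = 0; r < R; r++)       /* filter rows */
                            for (int s = 0; s < S; s++) { /* filter columns */
                                acc += in[n][c][e + r][f + s] * wt[m][c][r][s];
                                macs++;
                            }
                    out[n][m][e][f] = acc;
                }
    return macs; /* = R*S*E*F*C*M*N */
}
```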

Fully Connected Layer

26
Source: ICIP Tutorial, 2019

Contributes more than 90% of the overall computation, dominating runtime and energy consumption

Fully Connected Layer

27

1. Height and width of the output fmap are 1 (E = F = 1)
2. Filters are as large as the input fmaps (R = H, S = W)

[Figure: N input fmaps of H x W x C, M weights of R x S x C with R = H and S = W, output fmap of size 1 x 1 x M per input.]

Fully Connected Layer

28

Matrix multiplication

[Figure: weights as an M x (CRS = CHW) matrix, inputs as a (CHW) x N matrix (one flattened fmap per column), outputs as an M x N matrix.]
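A minimal C sketch of this matrix-multiplication view of the FC layer; the row-major layout and function name are assumptions:

```c
/* Fully connected layer as a matrix multiplication:
 * outputs (M x N) = weights (M x K) * inputs (K x N), with K = C*H*W.
 * Each input fmap is flattened into one column of the input matrix. */
void fc_layer(int M, int N, int K,
              const float *weights, /* M x K, row-major */
              const float *inputs,  /* K x N, row-major */
              float *outputs)       /* M x N, row-major */
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += weights[m * K + k] * inputs[k * N + n];
            outputs[m * N + n] = acc; /* M x N x K MACs in total */
        }
}
```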

Compute Intensity of Popular CNNs

29

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms

• General Purpose Processor (CPU)

• Domain Specific Processors (DSPs, VLIW-SIMD)

• Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

30

Today

Overview of Microprocessor Designs

31
Source: Time Moore, Liming Xiu, 2019.

Intrinsic Compute Capability

32–34
Source: XETAL-II, 2010

[Figure: intrinsic compute capability of different processor classes; for deep learning, the TPU reaches roughly ~200x higher performance/Watt than a CPU.]

Source of Inefficiency

35

• Results do not include DRAM power
• More than 50% of energy is spent on cache and control logic

How to improve?
• Reduce control overhead
• Improve the cache hierarchy
• Multi-core/cluster concepts provide an additional performance gain

Source: Computing's Energy Problem, ISSCC 2014

[Figure: energy breakdown – compute vs. cache & control.]

Reduce Control Overhead: SIMD Extensions

36


Reduce Control Overhead: SIMD Extensions

• Intel
  • SSE (Streaming SIMD Extensions, 4x 32-bit single precision) [SSE2, SSE3, SSSE3, SSE4]
  • AVX, AVX2, AVX-512 – 256/512-bit
• AMD – 3DNow!
• Arm – VFP (single/double-precision co-processor), NEON → 128-bit, SVE → 128 to 2048-bit
• Qualcomm – 4x 1024-bit → 4096 bits

38

Reduce Control Overhead: SIMD Extensions

39

DNN-specific extensions: reduced precision and instruction-set extensions

Example: Intel – Cascade Lake 2019 (VNNI – Vector Neural Net Instructions)

40

[Figure: Cascade Lake package, x28 cores.]

SIMD Extensions for DNNs

41

SIMD Extensions for DNNs

42

(VNNI)

• Mixed-precision mode – INT8 x INT8 with INT32 accumulate
• VNNI – FMA in a single cycle, compared to 3 cycles with normal SIMD instructions
• Some architectures support a "2x2 dot-product" as well
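As one concrete (hedged) illustration, the sketch below computes an int8 dot product with the AVX-512 VNNI intrinsic _mm512_dpbusd_epi32, which multiplies unsigned by signed 8-bit elements and accumulates groups of four products into 32-bit lanes in a single instruction; the data layout, loop bounds, and helper name are assumptions for the example:

```c
#include <immintrin.h>
#include <stdint.h>

/* int8 dot product using AVX-512 VNNI (compile with -mavx512f -mavx512vnni).
 * a: unsigned 8-bit activations, w: signed 8-bit weights, n: multiple of 64. */
int32_t dot_int8_vnni(const uint8_t *a, const int8_t *w, int n) {
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512((const void *)(a + i));
        __m512i vw = _mm512_loadu_si512((const void *)(w + i));
        /* One instruction: 64 u8 x s8 products, summed in groups of four
         * into 16 int32 accumulator lanes (replaces a multi-instruction
         * widen/multiply/add sequence on plain SIMD). */
        acc = _mm512_dpbusd_epi32(acc, va, vw);
    }
    int32_t lanes[16];
    _mm512_storeu_si512((void *)lanes, acc);
    int32_t sum = 0;
    for (int i = 0; i < 16; i++)  /* horizontal reduction of the lanes */
        sum += lanes[i];
    return sum;
}
```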

Brain Floating Point

43

• bfloat16: same dynamic range as IEEE fp32 (8-bit exponent), but less precision (7-bit mantissa)
• Example use cases: Google TPU, Cooper Lake Xeon processors
• Another option – "posit" floating point (adaptable fp format)
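A minimal C sketch of why bfloat16 keeps the fp32 dynamic range: it is simply the top 16 bits of an IEEE fp32 value (sign + 8-bit exponent + 7 mantissa bits). The helper below truncates; real hardware typically rounds to nearest even, which is omitted here:

```c
#include <stdint.h>
#include <string.h>

/* fp32 -> bfloat16 by truncation: keep sign, the full 8-bit exponent,
 * and the top 7 mantissa bits. Range is unchanged, precision drops. */
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* bit-level view of the float */
    return (uint16_t)(bits >> 16);
}

/* bfloat16 -> fp32 by padding the low 16 mantissa bits with zeros. */
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```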

How about BNN?

44

y = popcount (W XNOR X)
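A minimal C sketch of that binary dot product, assuming weights and activations are packed 64 values per word with bit value 1 encoding +1 and bit value 0 encoding -1 (__builtin_popcountll is the GCC/Clang popcount builtin):

```c
#include <stdint.h>

/* Binary dot product via XNOR + popcount, rescaled to the +/-1 domain.
 * w, x: packed binarized vectors; words: number of 64-bit words. */
int bnn_dot(const uint64_t *w, const uint64_t *x, int words) {
    int matches = 0;
    for (int i = 0; i < words; i++)
        matches += __builtin_popcountll(~(w[i] ^ x[i])); /* XNOR, then count */
    int n = 64 * words;           /* total number of +/-1 products */
    return 2 * matches - n;       /* #(+1 products) - #(-1 products) */
}
```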


Software Stack for DNN

• Parallelizing compiler
• Inline assembly
• Intrinsics
• Optimized libraries – MKL-DNN, clDNN, BLAS, Arm NN, Arm CMSIS-NN, and many more

47–48

Example: Arm NN Library

49

[Figure: software stack – a TensorFlow model is deployed through the Arm NN library onto the target hardware.]

Comparison among CNN Libraries on CPU

50
Source: Evaluating the energy efficiency of Deep CNN, Da Li, 2016. *ConvNet on Xeon E5 (16-core)

• Caffe backends – Atlas, OpenBLAS, MKL, openMP, and CaffeConTroll

• Performance depends on the quality of the library optimizations for the target

Distributed Learning and Inference

51
Source: Large Scale Distributed Deep Networks, Jeffrey Dean, Google 2019

Distributed DL - Approach

52

Distributed DL – Performance

53

• Models with more parameters benefit more from the use of additional machines

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms

• General Purpose Processor (CPU)

• Domain Specific Processors (DSPs, VLIW-SIMD)

• Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

54

Today

Domain Specific Processors (VLIW-SIMD)

• Processors optimized for a specific application domain (e.g., vision, signal processing)
• Examples: Qualcomm Hexagon, Movidius (Intel), Ceva, and many more

• Support for DNN

• Instruction-set extensions

• DNN Accelerator in the execution pipeline

55

Programming Model

56

Hexagon DSP (Qualcomm)

57

Hexagon over Quad CPU

58
Source: Hot Chips

Hexagon – Power breakdown

59

• Less overhead on control logic and memory compared to CPUs

Another Example: Ceva DSP

60
Source: AnandTech

Final Example: Movidius v2

61

Source: Hotchips’14


Final Example: Intel Neural Compute Stick

63

Outline
• Introduction
• Background on DNN computations
• DNN on Traditional Compute Platforms

• General Purpose Processor (CPU)

• Domain Specific Processors (DSPs, VLIW-SIMD)

• Graphics Processing Unit (GPU)
• Accelerators (ASICs) for DNN
• Specialized Architectures

64

Today

Graphics Processing Units (GPUs)

• SIMD vs GPU
  • GPUs use threads instead of vectors
  • GPUs have "shared memory" spaces

65

[Figure: SIMD vs GPU execution model.]

How Are Threads Scheduled?

66


Example: NVIDIA Fermi - 2009

69

• Streaming Multiprocessors (SM)

• 32 – cuda cores/SM

• ALU – 32/64-bit

• FP – SP/DP (with FMA)

• SFU – sin, cosine, sqrt, etc.
• Clock – 1.5 GHz (estimated)
• Peak performance – 1.5 TFLOPS

Example: NVIDIA Volta - 2017

70

• Streaming Multiprocessors (SM)
  • 80 SMs
  • 64 INT32/FP32 cores per SM
  • 32 FP64 CUDA cores per SM
  • 8 Tensor cores per SM
    • 4x4 matrix multiply
    • 512 FMA / 1024 FP ops per SM per cycle
• Clock – 1.53 GHz
• Peak Tensor TFLOPS – 125
  • 1.53 GHz x 80 x 1024
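A quick sanity check of that peak number, using only the figures in the bullets above (1024 FP ops per SM per cycle = 8 tensor cores x 64 FMAs x 2 ops per FMA):

```c
#include <stdio.h>

int main(void) {
    double clock_hz = 1.53e9;            /* 1.53 GHz */
    int sms = 80;                        /* streaming multiprocessors */
    int fp_ops_per_sm_per_cycle = 1024;  /* 8 tensor cores x 64 FMA x 2 ops */

    double peak = clock_hz * sms * fp_ops_per_sm_per_cycle;
    printf("Peak tensor throughput: %.1f TFLOPS\n", peak / 1e12); /* ~125.3 */
    return 0;
}
```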


Internals of Tensor Core

• Modes of operation – Volta
  • FP16 – A, B, C are FP16
  • Mixed precision – A and B are FP16, C is FP32
• Turing GPUs – support 1-, 2-, 4-, and 8-bit data types (int4, int8 on Tensor cores)

73

Scheduling Example for 16x16x16 GEMM

74
Source: AnandTech


How to use Tensor Cores

• cuBLAS, cuDNN, etc.
• Library takes care of tiling and the storage hierarchy
• Opcode: HMMA (Matrix Multiply Accumulate)

79
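In practice the library does this scheduling for you; purely as an illustration, the plain-C sketch below decomposes a 16x16x16 GEMM into 4x4x4 sub-products, i.e. one tensor-core-sized (HMMA-like) step per call of the inner helper. All names and the row-major layout are assumptions, not the actual cuBLAS implementation:

```c
/* One tensor-core-sized step: C_tile += A_tile(4x4) * B_tile(4x4). */
static void mma_4x4x4(const float a[4][4], const float b[4][4], float c[4][4]) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++)
                c[i][j] += a[i][k] * b[k][j]; /* 64 FMAs per step */
}

/* 16x16x16 GEMM scheduled as 4x4x4 tiles: 4*4*4 = 64 tensor-core-sized steps.
 * C is assumed to hold the accumulator input (e.g. zeros). */
void gemm16_tiled(const float A[16][16], const float B[16][16], float C[16][16]) {
    for (int ti = 0; ti < 16; ti += 4)
        for (int tj = 0; tj < 16; tj += 4)
            for (int tk = 0; tk < 16; tk += 4) {
                float a[4][4], b[4][4], c[4][4];
                for (int i = 0; i < 4; i++)      /* gather operand tiles */
                    for (int j = 0; j < 4; j++) {
                        a[i][j] = A[ti + i][tk + j];
                        b[i][j] = B[tk + i][tj + j];
                        c[i][j] = C[ti + i][tj + j];
                    }
                mma_4x4x4(a, b, c);              /* one HMMA-like step */
                for (int i = 0; i < 4; i++)      /* scatter accumulator back */
                    for (int j = 0; j < 4; j++)
                        C[ti + i][tj + j] = c[i][j];
            }
}
```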

GPU Performance

80

• We still need the CPU to some extent

GPU vs CPU Performance

• CPU – 16-core Intel Xeon E5-2650 v2 @ 2.6 GHz
• Benchmark: AlexNet
• Lower batch sizes lead to underutilization on all devices
• The K20 has less memory than the Titan X

81

Concluding Remarks

• The compute and data requirements of DNNs are quite high, and a major part of the computation comes from matrix multiplications (i.e., MAC ops)
• Common DNN-specific extensions in generic architectures are:
  1. Instruction-set extensions – generally SIMD support at reduced precision
  2. DNN accelerator in the datapath (co-processor, Tensor core, etc.)
• The effective performance of the platform depends on the hardware capability and the software support (programming model and library used to realize the network)
• Energy efficiency is still a limitation of generic platforms for DNN

82

Reference

• Evaluating the Energy Efficiency of Deep Convolutional Neural Networks on CPUs and GPUs, Da Li and Xinbo Chen, 2016

83

Backup

84

Multithreading Categories

85

[Figure: issue-slot occupancy over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; the legend distinguishes threads 1–5 and idle slots.]

Example: IBM Power4 (Superscalar)

86

Example: IBM Power5

• Supports 2 threads

87

[Figure: Power5 pipeline – 2 fetches (program counters) and 2 initial decodes, 2 commits (architected register sets).]

Power5 Thread Performance

• Relative priority of each thread is controllable in hardware
• For balanced operation, both threads run slower than if they "owned" the machine

88

Any guess on the largest chip so far?

89

Source: Cerebras, Hot Chips 2019