Hanguang 800 NPU
– The Ultimate AI Inference Solution for Data Centers
Yang Jiao, Liang Han, Xin Long, & Team
Alibaba Group
A Business-Driven Design

CNN Optimized
• Convolution-efficient architecture
• GEMM acceleration as well

TCO Efficient
• For data center inference acceleration
• High throughput, low latency, power efficient
• High effective TOPS

Domain Programmable
• Native OPs for computer vision deep learning tasks
• Expandable to support future activation functions
Optimized for CV Tasks
[Figure: example CV workloads — classification, object detection, segmentation, and point cloud]
Top Level Diagram
• Top level:
– 4 cores on a 200 GB/s ring bus
– Command Processor (CP)
– PCIe Gen4 x16
• Each core:
– Tensor Engine (TE)
– Pooling Engine (PE)
– Memory Engine (ME)
• SRAM-only:
– 192MB Local Memory (LM)
– Distributed, shared
– No DDR
[Figure: top-level block diagram — Core0..Core3 on the ring bus, each with SEQ, TE, PE, ME, instruction buffer (IB), constant buffer (CB), and LM; CP and DMA sit behind PCIe 4.0, I2C, and JTAG interfaces]
Tensor Engine
• Weight + activation stationary
– Data reuse via W-buffer, A-buffer, multicasting
• Avoids im2col()
– via sliding-window access
• Fused OPs
– e.g. CONV-BN-ResidualADD-ReLU (see the functional sketch below)
[Figure: Tensor Engine datapath — the W-Buffer (fed through the Sparsity Engine, SE) and the A-Buffer (with sliding-window access) multicast operands from LM into columns of multiply-accumulate units, ALUs, activation units f(), and accumulators (ACC), with a de-quantize stage before results return to LM. Inset: mapping a CONV2D — input activations Ai of shape C×H×W, K kernels W0..Wk of shape C×R×S, output Ai+1 of shape K×E×F — onto the engine]
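As a reference for what the fused-OP datapath computes in a single pass, here is a minimal functional sketch in PyTorch (the helper is hypothetical; shapes, padding, and the folded-BN form are illustrative — the hardware keeps partial sums in the accumulators instead of writing intermediates back to LM):

```python
import torch
import torch.nn.functional as F

def fused_conv_bn_residual_relu(x, w, bn_scale, bn_bias, residual):
    """One fused CONV-BN-ResidualADD-ReLU op (functional reference)."""
    y = F.conv2d(x, w, padding=1)        # CONV: partial sums stay in ACC
    y = y * bn_scale.view(1, -1, 1, 1)   # BN folded to a per-channel scale...
    y = y + bn_bias.view(1, -1, 1, 1)    # ...and a per-channel bias
    y = y + residual                     # residual add before write-back
    return torch.relu(y)                 # activation unit f()
```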
Pooling Engine
• POOL unit
– POOLs, ROIs
– ROI_align_2
• INTP unit
– Interpolations
– ROI_align_1
• Scale/Bias unit
– Scaling/bias operations
– Data formatting
[Figure: Pooling Engine pipeline — data flows from LM through the INTP unit (interpolation, ROI Align stage 1) and the POOL unit (pooling, ROI, ROI Align stage 2), then through Scale/Bias into an output buffer and back to LM; internal precision is fp19/int, external formats fp16/bf16/int8/int16]
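The ROI Align split across the two units can be read as interpolate-then-pool; a simplified PyTorch sketch of that split (whole feature map with a fixed 2x sampling grid, rather than the per-ROI sampling grids the real units handle):

```python
import torch
import torch.nn.functional as F

def roi_align_two_stage(feat):
    """ROI Align as two stages: ROI_align_1 on INTP, ROI_align_2 on POOL."""
    up = F.interpolate(feat, scale_factor=2, mode="bilinear",
                       align_corners=False)   # ROI_align_1: bilinear sampling
    return F.avg_pool2d(up, kernel_size=2)    # ROI_align_2: average pooling

x = torch.randn(1, 8, 14, 14)
print(roi_align_two_stage(x).shape)           # torch.Size([1, 8, 14, 14])
```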
Memory Engine
• Tensor copy (CPY_SRC → CPY_DST)
• Matrix transpose (TRANS_SRC → TRANS_DST)
• Image flatten (FLATTEN_SRC → FLATTEN_DST)
• Matrix-to-weight reshape (MATRIX_SRC → WEIGHT_DST)
• PCIe & cross-core copy (PCIE_DST)
[Figure: Memory Engine block diagram — reshape memory and credit memory (2Kb buffers) sit between LM and the ring bus]
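Rough numpy analogues of the ME's reshape operations (illustrative only — the exact weight-buffer layout is not specified on the slide, so the final relayout is an assumption):

```python
import numpy as np

t = np.arange(24, dtype=np.int8).reshape(2, 3, 4)  # a small CxHxW tensor in LM

copied = t.copy()                       # tensor copy (CPY_SRC -> CPY_DST)
flattened = t.reshape(t.shape[0], -1)   # image flatten: CxHxW -> Cx(H*W)
transposed = np.ascontiguousarray(flattened.T)   # matrix transpose (TRANS_*)
as_weight = transposed                  # matrix -> weight-buffer layout
                                        # (MATRIX_SRC -> WEIGHT_DST; assumed)
```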
192MB On-Chip SRAM
• 1-R/W SRAM modules
– High density, low power
• 2 blocks for W & A
– 16 modules in each block
• 4 clusters
– Distributed, shared
• 48MB in each of the 4 cores
– Memory copy across cores via ME & ring bus
[Figure: LM organization in each core — four clusters of 384KB 1-R/W modules, each cluster with its own controller serving TE, PE, and ME/DMA, alongside the compute & control logic]
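These figures are consistent if each core's four clusters each contain the two 16-module blocks of 384KB modules shown in the figure: 4 clusters × 2 blocks × 16 modules × 384KB = 48MB per core, and 4 cores × 48MB = 192MB on chip (a reading of the figure, not stated explicitly on the slide).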
Compressed and Quantized Storage/Processing
• Compressed model
– Sparsity Engine to unpack with bit-masks
– Pruning is optional, though
• Quantized computation and storage
• Vector unit with FP24
– 1 sign, 8 exponent, 15 mantissa bits
Sketches of the unpacking and the FP24 format follow below.
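A minimal sketch of bit-mask decompression — the software analogue of what the Sparsity Engine does in hardware (array contents are illustrative):

```python
import numpy as np

def unpack_sparse(values, bitmask):
    """Scatter the stored non-zeros back to the dense positions marked by
    a one-bit-per-element mask."""
    dense = np.zeros(bitmask.size, dtype=np.int8)
    dense[bitmask.astype(bool)] = values
    return dense

mask = np.array([1, 0, 0, 1, 0, 1, 0, 0], dtype=np.uint8)
nonzeros = np.array([5, -3, 7], dtype=np.int8)
print(unpack_sparse(nonzeros, mask))   # [ 5  0  0 -3  0  7  0  0]
```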
[Figure: Tensor Engine datapath annotated for compression and quantization — weights stored compressed and quantized in Local Memory (LM) pass through a Cmpr/Quant stage and the Sparsity Engine (SE) into the W-Buffer; matrix computation runs in INT8/16, vector computation in FP24; results are de-quantized and re-quantized (QuantA) before storage back to LM]
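FP24 here shares FP32's sign and exponent widths and shortens the mantissa from 23 to 15 bits, so a round-toward-zero conversion sketch just masks off the 8 low mantissa bits:

```python
import numpy as np

def fp32_to_fp24(x: np.ndarray) -> np.ndarray:
    """Truncate FP32 (1s.8e.23m) to FP24 (1s.8e.15m) precision."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFFF00)).view(np.float32)

print(fp32_to_fp24(np.array([3.14159265], dtype=np.float32)))  # [3.1414795]
```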
Workflow at Command Level
• SW: FE/Compiler — lowers the DNN model (ResNet-34, for example: CONV, POOL, CONV, MATMUL, ...) into instructions, constants (a, b, Qnt), and weights (WGT), with a StartPC and a per-task offset (Offset0, Offset1, ...)
• SW: RT/Driver — pushes commands through the command queue from the HOST over PCIe 4.0 to the CP, which dispatches to the NPU cores over the ring bus
• Setup: DMA:(Host->IB) INST, DMA:(Host->CB) CONST, DMA:(Host->LM) WGT
• Task0 (cat): DMA:(Host->LM) IN, COMPUTE: StartPC/Offset0, DMA:(LM->Host) OUT
• Task1 (car): DMA:(Host->LM) IN, COMPUTE: StartPC/Offset1, DMA:(LM->Host) OUT
[Figure: timeline — the IN, COMPUTE, and OUT phases of successive tasks (cat, car, frog, dog) overlap in time across the cores]
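In host-side pseudocode the command-level flow might look like the following (a sketch with a hypothetical runtime API — npu.dma_to_ib and friends are invented names for the DMA targets on the slide):

```python
def setup(npu, insts, consts, weights):
    """One-time per-model setup."""
    npu.dma_to_ib(insts)       # DMA:(Host->IB)  INST
    npu.dma_to_cb(consts)      # DMA:(Host->CB)  CONST
    npu.dma_to_lm(weights)     # DMA:(Host->LM)  WGT

def run_task(npu, image, start_pc, offset):
    """One inference task; IN/COMPUTE/OUT of successive tasks overlap."""
    npu.dma_to_lm(image)            # DMA:(Host->LM) IN
    npu.compute(start_pc, offset)   # COMPUTE: StartPC/Offset
    return npu.dma_from_lm()        # DMA:(LM->Host) OUT
```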
Workflow at Instruction Level
• Domain-specific instruction set
– CISC-like, with operation fusion
– Coarse-grain data at the tensor level (HW decomposes & controls)
• Synchronization among the 3 engines
– Embedded bits in instructions
– Hardware dependency checker
[Figure: (a) algorithm & mapping — an Inception-style subgraph of CONV 1x1+1(s) and CONV 3x3+1(s) chains plus a POOL 3x3+1(s), feeding a Concat, split between core 0 and core 1; (c) execution on core 0 — SEQ fetches, decodes, and issues from the IB to TE, PE, and ME, whose ops overlap]
(b) Instructions for core 0:
LP0: CONV 0
     ...
     CONV 8
     POOL 9 (w.t)
     MCPY 8 (w.p)
     CONV 9 (w.t)
     MCPY 9
     BRNZ C LP0
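The embedded wait bits (w.t = wait on the Tensor Engine, w.p = wait on the Pooling Engine) express cross-engine dependencies that the hardware dependency checker enforces at issue time. A toy scoreboard sketch of that idea (illustrative only — the queue contents and dependency encoding are assumptions, not the real microarchitecture):

```python
from collections import deque

# Per-engine instruction queues; an entry may wait until another engine
# has retired a given number of ops (modeling the w.t / w.p bits).
queues = {
    "TE": deque([("CONV 8", None), ("CONV 9", None)]),
    "PE": deque([("POOL 9", ("TE", 1))]),     # w.t: after CONV 8 retires
    "ME": deque([("MCPY 8", ("PE", 1)),       # w.p: after POOL 9 retires
                 ("MCPY 9", ("TE", 2))]),     # assumed: after CONV 9
}
retired = {eng: 0 for eng in queues}

progress = True
while progress:
    progress = False
    for eng, q in queues.items():
        if q:
            op, dep = q[0]
            if dep is None or retired[dep[0]] >= dep[1]:
                q.popleft()                   # dependency satisfied: issue
                retired[eng] += 1
                print(f"{eng}: {op}")
                progress = True
# Prints a legal order: CONV 8, POOL 9, MCPY 8, CONV 9, MCPY 9
```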
Scalable Task Mapping
• Data parallelism (for small/medium models): all 4 cores run a full copy of the CONVs and FC layers (Copy-0..3), each with its own weights (Weights-0..3)
• Multiple models: different models (Model-0..2) mapped to different cores; an unused core can be switched off
• Layer pipelining (for large models): layers (Layers-0..3) pipelined across the 4 cores
• Model parallelism (for large models): CONVs replicated per core (copy0..copy3) while the FC layer is sharded across cores (FC-0..3)
• Hybrid parallelism: combinations of the above
• Multi-chip pipelining (for X-large models): layers (Layers-0..Layers-n) pipelined across chips (Chip-0..Chip-n) behind a PCIe switch
A layer-split sketch follows.
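As a tiny illustration of the layer-pipelining split (the function name is invented):

```python
def split_layers(layers, n_cores=4):
    """Contiguous shards Layers-0..Layers-(n-1) for pipelined execution."""
    k = -(-len(layers) // n_cores)            # ceiling division
    return [layers[i * k:(i + 1) * k] for i in range(n_cores)]

print(split_layers([f"L{i}" for i in range(10)]))
# [['L0', 'L1', 'L2'], ['L3', 'L4', 'L5'], ['L6', 'L7', 'L8'], ['L9']]
```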
|                              | NVIDIA T4 | NVIDIA V100 | Huawei Ascend 910 | NVIDIA A100 | Alibaba Hanguang 800 |
| Peak performance (INT8 TOPS) | 130       | 125 (FP16)  | 512               | 624/1248    | 825                  |
| Frequency (MHz)              | 1590      | 1530        | 1000              | 1410        | 700                  |
| TDP (Watt)                   | 70        | 300         | 350               | 400         | 280                  |
| Area (mm^2)                  | 545       | 815         | 1228              | 826         | 709                  |
| Process                      | TSMC 12nm | TSMC 12nm   | TSMC 7+nm         | TSMC 7nm    | TSMC 12nm            |
Implementation
[Figure: die plot — PCIe 4.0 interface, PLL, and four cores (Core-0..Core-3), each with DPEW datapath arrays, SEQ/PE/ME, and LM blocks]
Software Stack
[Figure: software stack — deep learning frameworks on top, feeding the compiler and run-time]
Quantization Method
• Post-training quantization
– Pruning is optional
• Static quantization at compile time
• Clip out-of-range post-rounding values
• Symmetric quantization
– Weights: per-channel
– Activations: per-tensor
A numpy sketch follows.
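A minimal numpy sketch of this scheme, assuming a simple max-abs calibration statistic (the slide does not specify how scales are calibrated):

```python
import numpy as np

def symmetric_quantize(x, num_bits=8, channel_axis=None):
    """Symmetric, static quantization with post-rounding clipping.
    channel_axis=0 -> per-channel scales (weights);
    channel_axis=None -> a single per-tensor scale (activations)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for INT8
    if channel_axis is None:
        amax = np.abs(x).max()                     # per-tensor
    else:
        axes = tuple(i for i in range(x.ndim) if i != channel_axis)
        amax = np.abs(x).max(axis=axes, keepdims=True)
    scale = amax / qmax                            # fixed at compile time
    q = np.clip(np.round(x / scale), -qmax, qmax)  # clip post-rounding
    return q.astype(np.int8), scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # K,C,R,S weights
q_w, s_w = symmetric_quantize(w, channel_axis=0)       # per-channel
```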
Quantization Accuracy
Model                                 | GPU FP32      | NPU INT8
resnet_v1_50 (Top1/Top5)              | 75.18 / 92.18 | 74.93 / 92.47
resnet_v1_101 (Top1/Top5)             | 76.40 / 92.90 | 76.02 / 93.04
resnet_v1_152 (Top1/Top5)             | 76.80 / 93.20 | 76.45 / 93.41
resnet_v2_50 (Top1/Top5)              | 75.57 / 92.82 | 75.11 / 92.92
resnet_v2_101 (Top1/Top5)             | 77.00 / 93.70 | 76.58 / 93.81
inception_resnet_v2 (Top1/Top5)       | 80.40 / 95.30 | 80.19 / 95.26
inception_v3 (Top1/Top5)              | 77.99 / 93.94 | 77.86 / 94.07
inception_v4 (Top1/Top5)              | 80.2 / 95.2   | 80.11 / 95.20
ssd_resnet_50_fpn (mAP)               | 37.12         | 36.96
ssd_inception_v2_coco (mAP)           | 24            | 26.27
faster_rcnn_resnet50_coco (mAP)       | 25.89         | 25.96
faster_rcnn_inception_v2_coco (mAP)   | 24.8          | 24.66
mask_rcnn_resnet50_atrous_512 (mAP)   | 20.77         | 20.8
Performance & Efficiency
ResNet-50 v1 (224x224) inference at INT8
[Figure: normalized throughput and power efficiency — Nvidia V100 (batch=128), Nvidia T4 (batch=128), Habana Goya HL-1000 (batch=10), ours in efficiency mode (batch=1), and ours in performance mode (batch=1)]
Scalable Performance and Power
• Configurable frequency and voltage
• Can be deployed from the data center down to large edge devices
[Figure: ResNet-50 v1 throughput and power across frequency/voltage operating points]
Batching-Independent Performance
ResNet-50 v1 (224x224) inference at INT8
[Figure: images/second vs. batch size (both log scale), batch 1-128 — Habana Goya peaks at 15,453 img/s and Nvidia T4 at 5,402 img/s, with latencies growing with batch size (from ~0.2ms up to 24ms at batch 128), while ours in performance mode holds 78,563 img/s at a constant 0.11ms for every batch size]
Best-in-class latency, with no performance drop at small batch sizes
More Results
[Figure: INT8 inference throughput (img/sec) for ResNet-101 v1 and Inception V3 at batch sizes 1, 4, 8, and 10 — Habana Goya's latencies range from 0.4ms to 2.8ms across batches, while ours stay at 0.2ms in every case, at higher throughput]
MLPerf Inference V0.5 (Single-Chip Performance*)
[Figure: MLPerf Inference v0.5 single-chip results on ResNet-50 v1.5]
* Single-chip performance is not the primary metric of MLPerf. MLPerf name and logo are registered trademarks. See www.mlperf.org for more information.
Comparing with Latest GPUs
ResNet-50 v1.5 Inference
[Figure: throughput and efficiency versus the latest Nvidia GPUs; reported value pairs: 5,097/73, 7,097/24, 23,973/74, and 69,307/248]
Nvidia GPU data source: NVIDIA Data Center Deep Learning Product Performance
Deployment
• In data centers
• In edge servers
• In large end-devices
[Figure: deployment scenarios — smart city, smart manufacturing, recommender systems, image search, smart fashion, 5G edge cloud, video cloud, smart retail, smart medical — "Alibaba AI Cloud, Hanguang NPU Inside"]
Elastic Compute Service in Alibaba Cloud
The X-Dragon architecture provides a unified resource pool for virtual machines, bare metals, and containers.
Further reading: "Introducing the Sixth Generation of Alibaba Cloud's Elastic Compute Service"
Elastic Bare-Metal AI Inference Instance

X-Dragon bare-metal server
• Bare-metal performance
• Elasticity of cloud infrastructure

Common frameworks supported
• TensorFlow
• MXNet
• PyTorch

Public cloud: ecs.ebman1.24xlarge
• CPU: Cascade Lake Xeon, 104 cores
• Memory: 384GB
• NPU: 4x 2-core Hanguang 800

Private cloud
• Customized configurations
• 1/2/4x 2-core/3-core/4-core Hanguang 800
Deployed Applications via Alibaba Cloud

Business | Major App. Type | GPU Server | NPU Server | Deployment %
CV-1     | Image/Video     | 2x T4      | 2x 3-core  |  8.2%
CV-2     | Image/Video     | 8x T4      | 2x 3-core  | 28.5%
CV-3     | Image/Video     | 8x T4      | 2x 3-core  | 12.5%
CV-4     | Image/Video     | 4x T4      | 4x 2-core  | 12.5%
RM-1     | Recommender     | 1x T4      | 1x 4-core  | 12.5%
RM-2     | Recommender     | 1x T4      | 1x 4-core  | 25.1%

The top 6 businesses represent 99.1% of deployed Hanguang 800 NPUs.
TCO Saving for Business
[Figure: TCO saving per business — CV-1: 56%, CV-2: 19%, CV-3: 49%, CV-4: 87%, RM-1: 74%, RM-2: 41%; total: 51%]
Summary
• The Hanguang 800 NPU takes the lead in AI inference performance.
• Fully integrated with the X-Dragon compute platform, Hanguang instances provide both bare-metal performance and the elasticity of cloud infrastructure.
• Already deployed to cloud customers, Hanguang is estimated to lower system TCO by ~50% across various business applications.
• Stay tuned for more Hanguang-powered Alibaba Cloud services.
"Of the three swords of high antiquity, the first was called Hanguang (含光): look at it and it cannot be seen; wield it and you cannot tell what it touches; seamless and without bounds, it passes through things and they do not notice." (From the Liezi, on the ancient sword that gives the chip its name.)
Q&A
Thank You!