Hanguang 800 NPU – The Ultimate AI Inference Solution for Data Centers
Yang Jiao, Liang Han, Xin Long, & Team · Alibaba Group · 2020. 8. 18.


Page 1:

Hanguang 800 NPU

– The Ultimate AI Inference Solution for Data Centers

Yang Jiao, Liang Han, Xin Long, & Team

Alibaba Group

Page 2:

A Business-Driven Design

CNN Optimized
• Convolution-efficient architecture
• GEMM acceleration as well

TCO Efficient
• For data-center inference acceleration
• High throughput, low latency, power efficient
• High effective TOPS

Domain Programmable
• Native OPs for computer-vision deep-learning tasks
• Extensible to support future activation functions

Page 3:

Optimized for CV Tasks

Classification · Object Detection · Segmentation · Point Cloud

Page 4:

Top Level Diagram

• Top level:
  – 4 cores with a ring bus
  – Command Processor (CP)
  – PCIe Gen4 x16
• Each core:
  – Tensor Engine (TE)
  – Pooling Engine (PE)
  – Memory Engine (ME)
• SRAM-only:
  – 192MB Local Memory (LM)
  – Distributed shared
  – No DDR

[Figure: top-level block diagram – Core0–Core3 on a 200 GB/s ring bus; each core holds SEQ, TE, PE, ME, LM, CB, and IB; chip-level CP, DMA, PCIe4, I2C, and JTAG]

Page 5:

Tensor Engine

• Weight + Activation stationary
  – Data reuse via W-buffer, A-buffer, and multicasting
  – Avoids img2col() via sliding-window access
• Fused OPs
  – e.g. CONV-BN-ResidualADD-ReLU

[Figure: Tensor Engine datapath – multicast from the W-buffer and the A-buffer (with sliding-window access) into arrays of multiply-accumulate units, accumulators (ACC), ALUs, and activation units f(), with de-quantize on the weight path and data flowing from LM/CB and back to LM; inset shows mapping a CONV2D (C input channels, K R×S kernels, H×W input, P×Q output) onto the Tensor Engine]
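The "avoid img2col()" point can be illustrated in plain NumPy. This is an illustrative sketch, not the NPU's datapath: both functions below compute the same CONV2D, but the im2col variant must materialize an extra (C·R·S)×(P·Q) buffer that direct sliding-window access makes unnecessary.

```python
import numpy as np

def conv2d_direct(x, w):
    """Direct sliding-window convolution: each output element reads its
    receptive field in place, so no unrolled buffer is ever built."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1
    y = np.zeros((K, P, Q))
    for k in range(K):
        for p in range(P):
            for q in range(Q):
                y[k, p, q] = np.sum(x[:, p:p + R, q:q + S] * w[k])
    return y

def conv2d_im2col(x, w):
    """im2col + GEMM formulation: unroll each receptive field into a
    column, then do one matrix multiply. Functionally identical, but
    it costs an extra (C*R*S) x (P*Q) intermediate buffer."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1
    cols = np.zeros((C * R * S, P * Q))
    for p in range(P):
        for q in range(Q):
            cols[:, p * Q + q] = x[:, p:p + R, q:q + S].ravel()
    return (w.reshape(K, -1) @ cols).reshape(K, P, Q)
```

Both return identical results; the hardware gets the GEMM-like data reuse without paying for the unrolled copy.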

Page 6:

Pooling Engine

• POOL unit
  – POOLs, ROIs
  – ROI_align_2
• INTP unit
  – Interpolations
  – ROI_align_1
• Scale/Bias unit
  – Scaling/bias operations
  – Data formatting

[Figure: POOL (pooling, ROI) and INTP (interpolation, ROI align) units feed a Scale/Bias unit and an output buffer; int/fp19 internally, fp16/bf16/int8/int16 from and to LM]

Page 7:

Memory Engine

• Moves data between LM and the Ring Bus through dedicated source/destination ports:
  – CPY_SRC / CPY_DST: tensor copy
  – FLATTEN_SRC / FLATTEN_DST: image flatten
  – TRANS_SRC / TRANS_DST: matrix transpose
  – MATRIX_SRC / WEIGHT_DST: matrix-to-weight conversion
  – PCIE_DST: PCIe & cross-core transfers
• 2Kb reshape memory and 2Kb credit memory

Page 8:

192MB On-Chip SRAM

• 1-R/W SRAM modules – high density, low power
• 2 blocks for W & A – 16 modules in each block
• 4 clusters – distributed shared
• 48MB in each of the 4 cores – memory copy across cores via ME & Ring Bus

[Figure: LM organization in each core – banks of 384KB 1-R/W modules behind a controller, shared by the TE, PE, and ME/DMA of each of the four cores]

Page 9:

Compressed and Quantized Storage/Processing

• Compressed model
  – Sparsity Engine (SE) unpacks with bit-masks
  – Pruning is optional, though
• Quantized computation and storage
  – Matrix computation in INT-8/16
  – Compressed, quantized storage in LM
• Vector unit with FP-24
  – 1 sign, 8 exponent, 15 mantissa bits

[Figure: Tensor Engine datapath with compressed/quantized weights (Cmpr/Quant W) held in Local Memory (LM), Sparsity Engine (SE) unpacking, de-quantize feeding the FP-24 vector stage, and activation quantize (Quant A) on the path back to LM]
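As a concrete reading of the 1-sign/8-exponent/15-mantissa layout, an FP-24 value can be modeled by truncating the low 8 bits of an FP32 mantissa. This is an illustrative sketch only; the hardware's actual rounding behavior is not specified in the slides.

```python
import struct

def to_fp24(x: float) -> float:
    """Model an FP-24 (1 sign / 8 exp / 15 mantissa) value by truncating
    the low 8 bits of the 23-bit FP32 mantissa (rounds toward zero).
    Since the exponent field matches FP32's, only precision is lost,
    not dynamic range."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    bits &= ~0xFF  # drop the 8 least-significant mantissa bits
    return struct.unpack('>f', struct.pack('>I', bits))[0]
```

The relative error of this truncation is below 2^-14, which is why an FP-24 vector stage can follow INT-8/16 matrix math without visibly hurting accuracy.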

Page 10:

Workflow at Command Level

• SW FE/compiler lowers a DNN model (ResNet-34, for example: CONV, POOL, ..., MATMUL) into instructions (IB), constants (CB), and weights (LM), with a/b/Qnt parameters and per-task StartPC/Offset values.
• SW RT/driver pushes per-task command sequences into the Cmd Queue; the CP dispatches them from the host over PCIe 4.0 and the Ring Bus to the NPU cores.
• Setup: DMA(Host->IB) INST; DMA(Host->CB) CONST; DMA(Host->LM) WGT
• Task0 (cat): DMA(Host->LM) IN; COMPUTE StartPC/Offset0; DMA(LM->Host) OUT
• Task1 (car): DMA(Host->LM) IN; COMPUTE StartPC/Offset1; DMA(LM->Host) OUT
• Over time, the input DMA of the next task overlaps the COMPUTE of the current one, and output DMAs drain while later tasks (frog, dog) load and compute.
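The overlap of DMA-in, COMPUTE, and DMA-out can be sketched as a simple three-stage software pipeline. This is illustrative only; `pipeline_schedule` and its one-step-per-stage cost model are assumptions, not the driver's API.

```python
def pipeline_schedule(tasks):
    """Illustrative 3-stage pipeline: stage s of task t runs at time
    step t + s, so COMPUTE of task t overlaps the input DMA of task
    t + 1, and output DMAs drain while later tasks run."""
    stages = ["IN", "COMPUTE", "OUT"]
    timeline = {}
    for t, name in enumerate(tasks):
        for s, stage in enumerate(stages):
            timeline.setdefault(t + s, []).append(f"{stage}:{name}")
    return [timeline[step] for step in sorted(timeline)]
```

For two tasks "cat" and "car" this yields the slide's interleaving: at the middle steps, `COMPUTE:cat` runs alongside `IN:car`, and `OUT:cat` alongside `COMPUTE:car`.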

Page 11:

Workflow at Instruction Level

• Domain-specific instruction set
  – CISC-like, operation-fusion
  – Coarse-grain data at tensor level (HW decomposes & controls)
• Synchronization among the 3 engines
  – Embedded bits in instructions (the w.t / w.p wait annotations below)
  – Hardware dependency checker

(a) Algo. & mapping: a block with parallel CONV 3x3+1(s) / CONV 1x1+1(s) branches, a POOL 3x3+1(s) branch, and a Concat, mapped across core 0 and core 1.
(b) Instructions on core 0:
    LP0: CONV 0 ... CONV 8; POOL 9 (w.t); MCPY 8 (w.p); CONV 9 (w.t); MCPY 9; BRNZ C LP0
(c) Execution on core 0: the SEQ fetches, decodes, and issues from the IB to the TE, PE, and ME, which run concurrently subject to the embedded wait bits.
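A toy scoreboard conveys the idea behind the embedded wait bits. All names and the one-cycle cost model here are hypothetical, not the NPU's ISA: each engine executes its own instructions in order, and a wait annotation stalls an instruction until another engine's most recent instruction retires.

```python
def schedule(program):
    """program: list of (engine, op, wait_on) tuples, where wait_on
    names the engine whose latest op must finish first, or None.
    Returns (op, start, finish) triples assuming each op takes 1 cycle."""
    done = {}    # engine -> finish time of its most recent op
    order = []
    for engine, op, wait_on in program:
        start = done.get(engine, 0)           # in-order within an engine
        if wait_on is not None:               # cross-engine wait bit
            start = max(start, done.get(wait_on, 0))
        done[engine] = start + 1
        order.append((op, start, done[engine]))
    return order
```

Run on a fragment of the slide's loop, POOL 9 waits for the TE and MCPY 8 waits for the PE, while CONV 9 (no dependency on them in this toy model) overlaps POOL 9 on the TE.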

Page 12:

Scalable Task Mapping

• For small/medium models:
  – Data parallelism: each core holds a full weight copy (Copy-0..Copy-3) and runs the whole model independently.
  – Multiple models: different models (Model-0..Model-2) mapped to different cores; an idle core can be OFF.
• For large models:
  – Layer pipelining: layers split into groups (Layers-0..Layers-3), one group per core.
  – Model parallelism: e.g. CONVs replicated per core with the FC layer sharded (FC-0..FC-3) across cores.
  – Hybrid parallelism: combinations of the above.
• For X-large models:
  – Multi-chip pipelining: layer groups (Layers-0..Layers-n) pipelined across chips (Chip-0..Chip-n) behind a PCIe switch.
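Under the slide's taxonomy, the mapping choice is largely driven by whether the model fits in one core's 48MB LM or in the chip's 192MB total. The heuristic below is hypothetical; the thresholds and function name are assumptions, not the compiler's actual policy.

```python
def choose_mapping(model_mb, cores=4, lm_per_core_mb=48):
    """Hypothetical mapping heuristic for an SRAM-only design:
    - fits in one core's LM        -> replicate it (data parallelism)
    - fits in the chip's total LM  -> split layer groups across cores
    - larger than the chip's LM    -> pipeline across multiple chips"""
    if model_mb <= lm_per_core_mb:
        return "data-parallel (replicate per core)"
    if model_mb <= cores * lm_per_core_mb:
        return "layer pipelining across cores"
    return "multi-chip pipelining"
```

With no DDR to spill to, this capacity check is the hard constraint that the data/layer/multi-chip options on this slide are designed around.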

Page 13:

                        NVIDIA     NVIDIA      Huawei       NVIDIA     Alibaba
                        T4         V100        Ascend 910   A100       Hanguang 800
Peak Perf (INT8 TOPS)   130        125 (FP16)  512          624/1248   825
Frequency (MHz)         1590       1530        1000         1410       700
TDP (Watt)              70         300         350          400        280
Area (mm^2)             545        815         1228         826        709
Process                 TSMC 12nm  TSMC 12nm   TSMC 7+nm    TSMC 7nm   TSMC 12nm

Implementation: PCIe 4.0. [Die plot: four cores (Core-0..Core-3), each with DPEW arrays, SEQ/PE/ME, LM blocks, and a PLL]

Page 14:

Software Stack

[Figure: software stack – frameworks on top, compiler and run-time below]

Page 15:

Quantization Method

• Post-training quantization – pruning is optional
• Static quantization at compile time
• Clip out-of-range post-rounding values
• Symmetric quantization
  – Weights: per-channel
  – Activations: per-tensor
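These choices can be sketched together in NumPy. This is an illustrative sketch of symmetric, max-based scaling with post-rounding clipping; the compiler's actual calibration algorithm is not described in the slides.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8, axis=None):
    """Symmetric quantization sketch: pick a scale from the max |x|
    per channel (axis=0 for weights) or per tensor (axis=None for
    activations), round, then clip out-of-range post-rounding values."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    if axis is None:
        amax = np.max(np.abs(x))              # per-tensor scale
    else:                                     # per-channel scale
        reduce_axes = tuple(d for d in range(x.ndim) if d != axis)
        amax = np.max(np.abs(x), axis=reduce_axes, keepdims=True)
    scale = amax / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale
```

Dequantizing with `q * scale` recovers each value to within half a quantization step, which matches the near-FP32 accuracy shown on the next slide.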

Page 16:

Quantization Accuracy

Model                                  GPU FP32        NPU INT8
resnet_v1_50 (Top1/Top5)               75.18 / 92.18   74.93 / 92.47
resnet_v1_101 (Top1/Top5)              76.40 / 92.90   76.02 / 93.04
resnet_v1_152 (Top1/Top5)              76.80 / 93.20   76.45 / 93.41
resnet_v2_50 (Top1/Top5)               75.57 / 92.82   75.11 / 92.92
resnet_v2_101 (Top1/Top5)              77.00 / 93.70   76.58 / 93.81
inception_resnet_v2 (Top1/Top5)        80.40 / 95.30   80.19 / 95.26
inception_v3 (Top1/Top5)               77.99 / 93.94   77.86 / 94.07
inception_v4 (Top1/Top5)               80.2 / 95.2     80.11 / 95.20
ssd_resnet_50_fpn (mAP)                37.12           36.96
ssd_inception_v2_coco (mAP)            24              26.27
faster_rcnn_resnet50_coco (mAP)        25.89           25.96
faster_rcnn_inception_v2_coco (mAP)    24.8            24.66
mask_rcnn_resnet50_atrous_512 (mAP)    20.77           20.8

Page 17:

Performance & Efficiency

ResNet-50 v1 (224x224) inference at INT8

[Figure: normalized throughput performance and power efficiency for Nvidia V100 (batch=128), Nvidia T4 (batch=128), Habana Goya HL-1000 (batch=10), Ours in efficiency mode (batch=1), and Ours in performance mode (batch=1)]

Page 18:

Scalable Performance and Power

• Configurable frequency and voltage
• Can be deployed from data centers to large edge devices

[Figure: ResNet-50 v1 throughput and power across frequency/voltage operating points]

Page 19:

Batching-Independent Performance

ResNet-50 v1 (224x224) inference at INT8

[Figure: images/second vs batch size (both log scale, batch 1–128) for Habana Goya, Nvidia T4, and Ours in performance mode. Ours sustains 78,563 img/s at a constant 0.11 ms latency at every batch size; Goya peaks at 15,453 img/s and T4 at 5,402 img/s, with per-point latencies ranging from 0.2 ms up to 24 ms as batch size grows.]

Best-in-class latency, no performance drop.

Page 20:

More Results

[Figure: INT8 inference throughput (img/sec, up to ~40,000) and latency for ResNet-101 v1 and Inception V3 at batch sizes 1/4/8/10, Habana Goya vs Ours. Ours holds 0.2 ms latency at every batch size; Goya's per-point latencies range from 0.4 ms to 2.8 ms.]

Page 21:

MLPerf Inference V0.5 (Single-Chip Performance*)

* Single-Chip Performance is not the primary metric of MLPerf. MLPerf name and logo are registered trademarks. See www.mlperf.org for more information.

ResNet-50 v1.5

Page 22:

Comparing with Latest GPUs

ResNet-50 v1.5 inference

[Figure: throughput and efficiency pairs, as extracted – 5,097 img/s / 73; 7,097 img/s / 24; 23,973 img/s / 74; Ours: 69,307 img/s / 248]

Nvidia GPU data source: NVIDIA Data Center Deep Learning Product Performance

Page 23:

Deployment

In Data Centers · In Edge Servers · In Large End-Devices

Applications: Smart City, Smart Manufacture, Recommender Sys, Image Search, Smart Fashion, 5G Edge Cloud, Video Cloud, Smart Retail, Smart Medical

Alibaba AI Cloud – Hanguang NPU Inside

Page 24:

Elastic Compute Service in Alibaba Cloud: the X-Dragon architecture provides a unified resource pool for Virtual Machines, Bare-Metal instances, and Containers.

Further Reading: "Introducing the Sixth Generation of Alibaba Cloud's Elastic Compute Service"

Page 25:

Elastic Bare-Metal AI Inference Instance

X-Dragon Bare-Metal Server
• Bare-metal performance
• Elasticity of cloud infrastructure

Supports Common Frameworks
• TensorFlow
• MXNet
• PyTorch

Public Cloud: ecs.ebman1.24xlarge
• CPU: Cascade Lake Xeon, 104 cores
• Memory: 384GB
• NPU: 4x 2-core Hanguang 800

Private Cloud
• Customized configurations
• 1/2/4x 2-core/3-core/4-core Hanguang 800

Page 26:

Deployed Applications via Alibaba Cloud

Business   Major App. Type   GPU Server   NPU Server   Deployment %
CV-1       Image/Video       2x T4        2x 3-core    8.2%
CV-2       Image/Video       8x T4        2x 3-core    28.5%
CV-3       Image/Video       8x T4        2x 3-core    12.5%
CV-4       Image/Video       4x T4        4x 2-core    12.5%
RM-1       Recommender       1x T4        1x 4-core    12.5%
RM-2       Recommender       1x T4        1x 4-core    25.1%

Top 6 businesses representing 99.1% of Hanguang 800 NPU deployment.

[Figure: TCO saving for business – 56%, 19%, 49%, 87%, 74%, 41% across CV-1..RM-2, 51% total]

Page 27:

Summary

• The Hanguang 800 NPU leads in AI inference performance.
• Fully integrated with the X-Dragon compute platform, Hanguang instances provide both bare-metal performance and the elasticity of cloud infrastructure.
• Deployed to cloud customers, Hanguang is estimated to lower system TCO by ~50% across various business applications.
• Stay tuned for more Hanguang-powered Alibaba Cloud services.

Page 28:

"Of the three swords of remote antiquity, the first is called Hanguang (含光, 'Containing Light'): look at it and it cannot be seen; wield it and you cannot tell what it touches; seamless and boundless, it passes through things and the things do not notice."

Q&A

Page 29:

Thank You!