www.movidius.com © Copyright Movidius 2016
EMBEDDED DEEP NEURAL NETWORKS
“The Cost of Everything and the Value of Nothing”
David Moloney, CTO
Hot Chips 28, Cupertino, California, August 23, 2016
AGENDA
DEEP LEARNING: THE GREAT DISRUPTOR
TRAINING VS INFERENCE: INFERENCE MATTERS IN EMBEDDED
THE DEEPER THE BETTER?
BENEFITS OF EMBEDDED PROCESSING AT NETWORK EDGE
MAXIMISING PERFORMANCE OF NETWORKS ON MYRIAD 2 MA2x50
OPTIMISING CNNs ON VECTOR PROCESSORS FOR MOBILE PLATFORMS
CONCLUSIONS
DEEP LEARNING: THE GREAT DISRUPTOR
"Tesla Zooms Past BMW, Audi Limos In Europe, Closes In On Mercedes"
www.forbes.com, Oct 19th 2015
DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs): Inference Matters in Embedded
[Figure: an N-layer CNN (Layer 1 … Layer N → Output). Each layer applies a convolution, a pointwise ReLU activation, and a blockwise max-pooling, repeated for the N layers of the network. Inference is the forward pass only: inputs arrive from the previous layer or DDR/RAM, trained weights are read from DDR/RAM, and outputs feed the next layer. Training additionally compares the output against the expected output (label) and backpropagates the error to update the weights.]
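The forward-only inference pass described above can be sketched in a few lines of NumPy. This is an illustrative single-channel toy, not Movidius code; the shapes and random weights are assumptions for demonstration.

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D convolution (cross-correlation, as in most CNN frameworks)."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def relu(x):
    return np.maximum(x, 0.0)  # pointwise ReLU activation

def max_pool(x, k=2):
    # Blockwise maximum over non-overlapping k x k windows.
    oh, ow = x.shape[0] // k, x.shape[1] // k
    return x[:oh * k, :ow * k].reshape(oh, k, ow, k).max(axis=(1, 3))

# One inference layer: weights are fixed (loaded once, e.g. from DDR/RAM),
# data only flows forward -- there is no error or backpropagation step.
x = np.random.randn(8, 8)          # input from previous layer
w = np.random.randn(3, 3)          # trained weights
y = max_pool(relu(conv2d(x, w)))   # output feeds the next layer
```

At inference time this per-layer pipeline simply repeats for all N layers, which is why embedded deployments only need the forward pass to be fast.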
THE DEEPER THE BETTER… Yet With Increased Complexity
[Chart: ImageNet accuracy % over time]

Year | Network   | Layers | Accuracy %
2012 | AlexNet   | 8      | 83.60%
2013 | VGG16     | 16     | 88.80%
2014 | GoogLeNet | 58     | 93.33%
2015 | ResNet    | 158    | 95.06%
COMPLEXITY OF A STANDARD NETWORK: GoogLeNet
• 5M parameters
• 9 Inception layers (1–9), with filter widths 480, 480, 512, 512, 512, 832, 832, 256, 1024
• Computational cost increases by less than 2x compared to AlexNet (<1.5 Bn operations/evaluation)
[Figure: GoogLeNet layer diagram; legend: Convolution, Pooling, Softmax, Other]
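To see where an operation budget like "<1.5 Bn operations per evaluation" comes from, the multiply-accumulate count of a single convolutional layer follows directly from its dimensions. A minimal sketch, with hypothetical layer dimensions chosen only for illustration:

```python
def conv_macs(h_out, w_out, c_in, c_out, k):
    """Multiply-accumulates for one conv layer: each of the
    h_out * w_out * c_out output elements needs c_in * k * k multiplies."""
    return h_out * w_out * c_out * c_in * k * k

# Hypothetical 3x3 layer: 28x28 output, 192 input channels, 256 filters.
macs = conv_macs(28, 28, 192, 256, 3)   # ~0.35 Bn MACs for this one layer
```

Summing such per-layer counts over a whole network gives the operations-per-evaluation figure; Inception's use of narrow 1x1 reductions before the wider filters is what keeps GoogLeNet's total so low relative to its depth.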
EMBEDDED PROCESSING AT THE NETWORK EDGE: Huge Benefits

Improvement due to edge processing:
• Energy: 1,000,000x more energy-efficient
• Bandwidth: 10,000x less bandwidth consumed
• Latency: 1,000x lower latency

Other improvements in: privacy, fault-tolerance/continuity of service.
MYRIAD 2 MA2x50 Vision Processing Unit (VPU) Architecture
[Block diagram: 12 SHAVE vector processors (SHAVE 0–11) on an Inter-SHAVE Interconnect (ISI), sharing the CMX memory fabric (multi-ported RAM subsystem) via an arbiter and 16:1 mux, with a 256 KB L2 cache on the main bus. Two RISC cores: a RISC-RTOS core (32/32 kB L1, 256 KB L2, 128 KB ROM) and a RISC-RT core (4/4 kB L1, 32 kB L2). A 128-bit DDR controller connects to a stacked known-good-die (KGD) 1–8 Gbit LP-DDR2/3. Hardware imaging accelerators on an AMC crossbar: RAW shading correction, debayer, Bayer de-noise, luma DNF, chroma gen & DNF, sharpening, edge, Harris, 5x5 convolution, median, LUT/3-D LUT, LTM, colour combination, and x3 poly-phase scaling, with MIPI TX/RX filters. Supporting blocks: PLL & CPM, 32x HW mutexes, and a bridge to the main bus; >20 independent power islands. Software-controlled I/O multiplexing exposes MIPI D-PHY x12 lanes, SD, USB3, 1 Gb Ethernet, SPI x3, I2C x3, UART x2, I2S x3, LCD, CIF and NAL.]
SHAVE PROCESSING: Maximising GEMM Performance
[Diagram: SHAVE Processor v3.0, a VLIW core with a 32x128-bit vector register file (VRF, 12 ports) and a 32x32-bit integer register file (IRF, 17 ports), plus functional units: PEU (predication), BRU (branch unit), LSU0/LSU1 (load-store, ports 0 and 1), IAU (integer unit), SAU (scalar unit), VAU (vector unit), CMU (compare unit), DCU (debug), IDC (instruction decode), 2 KB I-cache and 1 KB D-cache.]

The GEMM inner loop computes 2 rows of A*B = 2 rows of C in 16 lines of SHAVE VLIW assembler, using zero-overhead looping and both load/store ports:
• VAU: 16x 8-bit MACs with 32-bit partials — 32 ops/cycle
• SAU: 32-bit partial-product summation (A0n*B0n, n in {0–7}) — 15 ops/cycle
• IAU: 32-bit accumulation — 1 op/cycle

Total: 48 ops/cycle per SHAVE; at 600 MHz on 12 SHAVEs this is 345.6 GOPS.
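Functionally, the scheme above is an int8 GEMM that processes two rows of A per loop iteration and keeps all partial sums at 32-bit so the narrow 8-bit products never overflow. A NumPy sketch of that arithmetic (not SHAVE VLIW code; the matrix sizes are illustrative):

```python
import numpy as np

def gemm_rows2_int8(a, b):
    """Int8 GEMM, two rows of A -> two rows of C per step,
    accumulating in int32 (mirrors 8-bit MACs with 32-bit partials)."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % 2 == 0
    a32 = a.astype(np.int32)   # widen before multiply: 8-bit products
    b32 = b.astype(np.int32)   # are summed in 32-bit accumulators
    c = np.zeros((m, n), dtype=np.int32)
    for i in range(0, m, 2):   # two rows of A give two rows of C
        c[i] = a32[i] @ b32
        c[i + 1] = a32[i + 1] @ b32
    return c

a = np.random.randint(-128, 128, size=(4, 16), dtype=np.int8)
b = np.random.randint(-128, 128, size=(16, 8), dtype=np.int8)
c = gemm_rows2_int8(a, b)
# Matches a full-precision reference computation.
assert np.array_equal(c, a.astype(np.int32) @ b.astype(np.int32))
```

Processing two output rows per iteration is what lets the real inner loop keep the VAU, SAU and IAU all busy every cycle, which is where the 48 ops/cycle figure comes from.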
GoogLeNet RELATIVE GFLOPS/W Performance Results
GoogLeNet single inference (batch = 1), no heatsink or fan:

Device   | Process (nm) | fps | Power (W) | GFLOPS | GFLOPS/W
Myriad 2 | 28           | 25  | 1.2       | 85.61  | 71.34
GPU-B    | 20           | 22  | 7.5       | 75.34  | 10.04
GPU-A    | 28           | 15  | 7.5       | 51.37  | 6.85
HOW MOVIDIUS IS DEPLOYING STANDARD CNNs At the Network Edge

Target applications: drones, robotics, AR/VR, security.

The toolchain parses a CNN model description and its weights, then maps the network onto kernel function libraries for efficient execution and for integration on the Myriad 2 VPU.

[Figure: the Myriad 2 MA2x50 block diagram, repeated from the architecture slide, annotated with this deployment flow.]
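The parse-then-dispatch flow on this slide can be illustrated with a tiny interpreter: a layer list (the model description) is walked in order, and each op is looked up in a library of kernel functions. All names here are hypothetical, for illustration only, and not Movidius APIs:

```python
import numpy as np

# Hypothetical kernel function library: each kernel takes the layer
# input and that layer's weights (None if the op has no weights).
KERNELS = {
    "relu":  lambda x, w: np.maximum(x, 0.0),
    "scale": lambda x, w: x * w,
}

def run_network(description, weights, x):
    """Execute layers in order, dispatching each op to the kernel library."""
    for layer in description:
        x = KERNELS[layer["op"]](x, weights.get(layer["name"]))
    return x

# Parsed model description (layer order) and its trained weights.
desc = [{"name": "s1", "op": "scale"}, {"name": "a1", "op": "relu"}]
w = {"s1": 2.0}
out = run_network(desc, w, np.array([-1.0, 3.0]))  # -> [0.0, 6.0]
```

Separating the network description from hand-optimised kernels is what lets one toolchain run many standard CNNs: only the description and weights change, while the kernel library stays tuned for the target hardware.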
IS IT WORTH IT? Incremental Accuracy and the Power Efficiency Cost
[Chart: power efficiency (GFLOPS/W) of mobile platforms (XU4, TK1, TX1) and server platforms (x86, K40) running AlexNet (2012), VGG-16 (2013), GoogLeNet (2014) and SqueezeNet (2016), plotted against each network's ImageNet accuracy (83.00%, 92.00%, 93.33% and 83.00% respectively). Measured efficiencies span roughly 1 to 41.43 GFLOPS/W depending on network and platform. Notes: ImageNet, batch = 10/64, using active cooling.]
SUMMARY: Achievements and Future Challenges
• Deep learning for embedded is all about inference
• Standard networks are designed to achieve high accuracy
• Embedded implementation on architectures such as the Movidius VPU can achieve significant performance at the network edge
• The next challenge is to further optimise networks to maximise performance per watt

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n°611183 (EXCESS Project, www.excess-project.eu).