www.movidius.com © Copyright Movidius 2016
EMBEDDED DEEP NEURAL NETWORKS
“The Cost of Everything and the Value of Nothing”
David Moloney, CTO
Hot Chips 28, Cupertino, California, August 23, 2016
AGENDA
DEEP LEARNING: THE GREAT DISRUPTOR
TRAINING VS INFERENCE: INFERENCE MATTERS IN EMBEDDED
THE DEEPER THE BETTER?
BENEFITS OF EMBEDDED PROCESSING AT NETWORK EDGE
MAXIMISING PERFORMANCE OF NETWORKS ON MYRIAD 2 MA2x50
OPTIMISING CNNs ON VECTOR PROCESSORS FOR MOBILE PLATFORMS
CONCLUSIONS
DEEP LEARNING: THE GREAT DISRUPTOR
"Tesla Zooms Past BMW, Audi Limos In Europe, Closes In On Mercedes"
www.forbes.com, Oct 19th 2015
DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs): Inference Matters in Embedded
[Figure: an N-layer CNN (Layer 1 … Layer N → Output). Each layer applies a convolution, a pointwise ReLU activation, and a blockwise max-pooling, repeated for the N layers of the network. Inference is the forward pass only: inputs arrive from the previous layer or DDR/RAM, trained weights are read from DDR/RAM, and outputs feed the next layer. Training additionally compares the output against the expected output (label) and backpropagates the error to update the weights.]
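The forward-only inference pass described above can be sketched in a few lines of NumPy. This is an illustrative single-channel toy, not Movidius code; the shapes and random weights are assumptions for demonstration.

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D convolution (cross-correlation, as in most CNN frameworks)."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def relu(x):
    return np.maximum(x, 0.0)  # pointwise ReLU activation

def max_pool(x, k=2):
    # Blockwise maximum over non-overlapping k x k windows.
    oh, ow = x.shape[0] // k, x.shape[1] // k
    return x[:oh * k, :ow * k].reshape(oh, k, ow, k).max(axis=(1, 3))

# One inference layer: weights are fixed (loaded once, e.g. from DDR/RAM),
# data only flows forward -- there is no error or backpropagation step.
x = np.random.randn(8, 8)          # input from previous layer
w = np.random.randn(3, 3)          # trained weights
y = max_pool(relu(conv2d(x, w)))   # output feeds the next layer
```

At inference time this per-layer pipeline simply repeats for all N layers, which is why embedded deployments only need the forward pass to be fast.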
THE DEEPER THE BETTER… Yet With Increased Complexity
[Chart: ImageNet accuracy % over time]

Year | Network   | Layers | Accuracy %
2012 | AlexNet   | 8      | 83.60%
2013 | VGG16     | 16     | 88.80%
2014 | GoogLeNet | 58     | 93.33%
2015 | ResNet    | 158    | 95.06%
COMPLEXITY OF A STANDARD NETWORK: GoogLeNet
• 5M parameters
• 9 Inception layers (1–9), with filter widths 480, 480, 512, 512, 512, 832, 832, 256, 1024
• Computational cost increases by less than 2x compared to AlexNet (<1.5 Bn operations/evaluation)
[Figure: GoogLeNet layer diagram; legend: Convolution, Pooling, Softmax, Other]
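To see where an operation budget like "<1.5 Bn operations per evaluation" comes from, the multiply-accumulate count of a single convolutional layer follows directly from its dimensions. A minimal sketch, with hypothetical layer dimensions chosen only for illustration:

```python
def conv_macs(h_out, w_out, c_in, c_out, k):
    """Multiply-accumulates for one conv layer: each of the
    h_out * w_out * c_out output elements needs c_in * k * k multiplies."""
    return h_out * w_out * c_out * c_in * k * k

# Hypothetical 3x3 layer: 28x28 output, 192 input channels, 256 filters.
macs = conv_macs(28, 28, 192, 256, 3)   # ~0.35 Bn MACs for this one layer
```

Summing such per-layer counts over a whole network gives the operations-per-evaluation figure; Inception's use of narrow 1x1 reductions before the wider filters is what keeps GoogLeNet's total so low relative to its depth.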
EMBEDDED PROCESSING AT THE NETWORK EDGE: Huge Benefits

Improvement due to edge processing:
• Energy: 1,000,000x more energy-efficient
• Bandwidth: 10,000x less bandwidth consumed
• Latency: 1,000x lower latency

Other improvements in: privacy, fault-tolerance/continuity of service.
MYRIAD 2 MA2x50 Vision Processing Unit (VPU) Architecture
[Block diagram: 12 SHAVE vector processors (SHAVE 0–11) on an Inter-SHAVE Interconnect (ISI), sharing the CMX memory fabric (multi-ported RAM subsystem) via an arbiter and 16:1 mux, with a 256 KB L2 cache on the main bus. Two RISC cores: a RISC-RTOS core (32/32 kB L1, 256 KB L2, 128 KB ROM) and a RISC-RT core (4/4 kB L1, 32 kB L2). A 128-bit DDR controller connects to a stacked known-good-die (KGD) 1–8 Gbit LP-DDR2/3. Hardware imaging accelerators on an AMC crossbar: RAW shading correction, debayer, Bayer de-noise, luma DNF, chroma gen & DNF, sharpening, edge, Harris, 5x5 convolution, median, LUT/3-D LUT, LTM, colour combination, and x3 poly-phase scaling, with MIPI TX/RX filters. Supporting blocks: PLL & CPM, 32x HW mutexes, and a bridge to the main bus; >20 independent power islands. Software-controlled I/O multiplexing exposes MIPI D-PHY x12 lanes, SD, USB3, 1 Gb Ethernet, SPI x3, I2C x3, UART x2, I2S x3, LCD, CIF and NAL.]
SHAVE PROCESSING: Maximising GEMM Performance
[Diagram: SHAVE Processor v3.0, a VLIW core with a 32x128-bit vector register file (VRF, 12 ports) and a 32x32-bit integer register file (IRF, 17 ports), plus functional units: PEU (predication), BRU (branch unit), LSU0/LSU1 (load-store, ports 0 and 1), IAU (integer unit), SAU (scalar unit), VAU (vector unit), CMU (compare unit), DCU (debug), IDC (instruction decode), 2 KB I-cache and 1 KB D-cache.]

The GEMM inner loop computes 2 rows of A*B = 2 rows of C in 16 lines of SHAVE VLIW assembler, using zero-overhead looping and both load/store ports:
• VAU: 16x 8-bit MACs with 32-bit partials — 32 ops/cycle
• SAU: 32-bit partial-product summation (A0n*B0n, n in {0–7}) — 15 ops/cycle
• IAU: 32-bit accumulation — 1 op/cycle

Total: 48 ops/cycle per SHAVE; at 600 MHz on 12 SHAVEs this is 345.6 GOPS.
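Functionally, the scheme above is an int8 GEMM that processes two rows of A per loop iteration and keeps all partial sums at 32-bit so the narrow 8-bit products never overflow. A NumPy sketch of that arithmetic (not SHAVE VLIW code; the matrix sizes are illustrative):

```python
import numpy as np

def gemm_rows2_int8(a, b):
    """Int8 GEMM, two rows of A -> two rows of C per step,
    accumulating in int32 (mirrors 8-bit MACs with 32-bit partials)."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % 2 == 0
    a32 = a.astype(np.int32)   # widen before multiply: 8-bit products
    b32 = b.astype(np.int32)   # are summed in 32-bit accumulators
    c = np.zeros((m, n), dtype=np.int32)
    for i in range(0, m, 2):   # two rows of A give two rows of C
        c[i] = a32[i] @ b32
        c[i + 1] = a32[i + 1] @ b32
    return c

a = np.random.randint(-128, 128, size=(4, 16), dtype=np.int8)
b = np.random.randint(-128, 128, size=(16, 8), dtype=np.int8)
c = gemm_rows2_int8(a, b)
# Matches a full-precision reference computation.
assert np.array_equal(c, a.astype(np.int32) @ b.astype(np.int32))
```

Processing two output rows per iteration is what lets the real inner loop keep the VAU, SAU and IAU all busy every cycle, which is where the 48 ops/cycle figure comes from.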
GoogLeNet RELATIVE GFLOPS/W Performance Results
GoogLeNet single inference (batch = 1), no heatsink or fan:

Device   | Process (nm) | fps | Power (W) | GFLOPS | GFLOPS/W
Myriad 2 | 28           | 25  | 1.2       | 85.61  | 71.34
GPU-B    | 20           | 22  | 7.5       | 75.34  | 10.04
GPU-A    | 28           | 15  | 7.5       | 51.37  | 6.85
HOW MOVIDIUS IS DEPLOYING STANDARD CNNs At the Network Edge

Target applications: drones, robotics, AR/VR, security.

The toolchain parses a CNN model description and its weights, then maps the network onto kernel function libraries for efficient execution and for integration on the Myriad 2 VPU.

[Figure: the Myriad 2 MA2x50 block diagram, repeated from the architecture slide, annotated with this deployment flow.]
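The parse-then-dispatch flow on this slide can be illustrated with a tiny interpreter: a layer list (the model description) is walked in order, and each op is looked up in a library of kernel functions. All names here are hypothetical, for illustration only, and not Movidius APIs:

```python
import numpy as np

# Hypothetical kernel function library: each kernel takes the layer
# input and that layer's weights (None if the op has no weights).
KERNELS = {
    "relu":  lambda x, w: np.maximum(x, 0.0),
    "scale": lambda x, w: x * w,
}

def run_network(description, weights, x):
    """Execute layers in order, dispatching each op to the kernel library."""
    for layer in description:
        x = KERNELS[layer["op"]](x, weights.get(layer["name"]))
    return x

# Parsed model description (layer order) and its trained weights.
desc = [{"name": "s1", "op": "scale"}, {"name": "a1", "op": "relu"}]
w = {"s1": 2.0}
out = run_network(desc, w, np.array([-1.0, 3.0]))  # -> [0.0, 6.0]
```

Separating the network description from hand-optimised kernels is what lets one toolchain run many standard CNNs: only the description and weights change, while the kernel library stays tuned for the target hardware.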
IS IT WORTH IT? Incremental Accuracy and the Power Efficiency Cost
[Chart: power efficiency (GFLOPS/W) of mobile platforms (XU4, TK1, TX1) and server platforms (x86, K40) running AlexNet (2012), VGG-16 (2013), GoogLeNet (2014) and SqueezeNet (2016), plotted against each network's ImageNet accuracy (83.00%, 92.00%, 93.33% and 83.00% respectively). Measured efficiencies span roughly 1 to 41.43 GFLOPS/W depending on network and platform. Notes: ImageNet, batch = 10/64, using active cooling.]
SUMMARY: Achievements and Future Challenges
• Deep learning for embedded is all about inference
• Standard networks are designed to achieve high accuracy
• Embedded implementation on architectures such as the Movidius VPU can achieve significant performance at the network edge
• The next challenge is to further optimise networks to maximise performance per watt

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n°611183 (EXCESS Project, www.excess-project.eu).