Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks

Yu-Hsin Chen (MIT), Joel Emer (MIT / NVIDIA), Vivienne Sze (MIT)
Contributions of This Work
• A novel energy-efficient CNN dataflow that has been verified in a fabricated chip, Eyeriss.
• A taxonomy of CNN dataflows that classifies previous work into three categories.
• A framework that compares the energy efficiency of different dataflows under the same area and CNN setup.
[Die photo: 4000 µm × 4000 µm chip with on-chip buffer and spatial PE array]
Eyeriss [ISSCC, 2016]: a reconfigurable CNN processor achieving 35 fps @ 278 mW (AlexNet CONV layers).
Deep Convolutional Neural Networks

A deep CNN pipelines CONV layers, which extract features from low-level to high-level, into FC layers, which produce the output classes.
• Modern deep CNNs have up to 1000 CONV layers.
• FC layers: typically only 1 – 3 layers.
• Convolutions account for more than 90% of overall computation, dominating runtime and energy consumption.
High-Dimensional CNN Convolution

In a CONV layer, an R×R filter slides over an H×H input image (feature map); at each window position, an element-wise multiplication and partial sum (psum) accumulation produces one pixel of the E×E output image. The full computation extends this 2D sliding-window convolution along three more dimensions:
• Many input channels (C): each filter and input image has C channels, and products are accumulated across channels.
• Many filters (M): M filters are applied to each input image, producing M output channels.
• Many input images (N): a batch of N input images produces N output images.
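To make these dimensions concrete, here is a minimal NumPy sketch of the full loop nest (the function and variable names are mine, and it assumes unit stride and no padding, which the slides do not specify):

```python
import numpy as np

def conv_layer(images, filters, stride=1):
    """Direct 7-loop CNN convolution over the dimensions on the slides:
    N input images, C channels, HxH inputs, M RxR filters, ExE outputs."""
    N, C, H, _ = images.shape
    M, _, R, _ = filters.shape
    E = (H - R) // stride + 1
    outputs = np.zeros((N, M, E, E))
    for n in range(N):              # many input images (N)
        for m in range(M):          # many filters / output channels (M)
            for y in range(E):      # sliding window, output rows
                for x in range(E):  # sliding window, output cols
                    for c in range(C):      # accumulate across channels
                        for i in range(R):  # filter rows
                            for j in range(R):  # filter cols
                                outputs[n, m, y, x] += (
                                    images[n, c, stride*y + i, stride*x + j]
                                    * filters[m, c, i, j])
    return outputs
```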
Memory Access is the Bottleneck

Each MAC (multiply-and-accumulate) performs three memory reads (filter weight, image pixel, partial sum) and one memory write (updated partial sum).
Worst case: all memory reads and writes are DRAM accesses. Example: AlexNet [NIPS 2012] has 724M MACs, and at four DRAM accesses per MAC this requires 2896M DRAM accesses.
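A back-of-the-envelope helper (my own, not from the talk) that applies this worst case to one CONV layer, using the dimension names defined earlier:

```python
def worst_case_dram_accesses(N, M, C, E, R):
    """Worst-case DRAM traffic for one CONV layer: every MAC performs
    3 reads (weight, pixel, psum) + 1 write (updated psum) to DRAM."""
    macs = N * M * C * E * E * R * R  # one MAC per filter tap per output pixel
    return 4 * macs

# e.g. a hypothetical layer: batch 16, 256 filters, 192 channels,
# 13x13 outputs, 3x3 filters
print(worst_case_dram_accesses(N=16, M=256, C=192, E=13, R=3))
```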
Adding extra levels of local memory hierarchy (Mem) between DRAM and the ALU creates two opportunities: (1) data reuse and (2) local accumulation.
Types of Data Reuse in CNN
• Convolutional reuse (CONV layers only, sliding window): both filter weights and image pixels are reused across window positions.
• Filter reuse (CONV and FC layers, batch size > 1): the same filter weights are reused across multiple images.
• Image reuse (CONV and FC layers): the same image pixels are reused across multiple filters.
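These reuse opportunities follow directly from the convolution dimensions; a small sketch (my own formulas, assuming unit stride) that quantifies the upper-bound reuse factor of each type:

```python
def reuse_factors(N, M, E, R):
    """Upper-bound reuse factors implied by the convolution dimensions
    (unit stride assumed; boundary pixels see less convolutional reuse)."""
    return {
        "convolutional: uses per filter weight": E * E,  # one per window position
        "convolutional: uses per image pixel": R * R,    # windows covering the pixel
        "filter: uses per weight across batch": N,       # same filter, N images
        "image: uses per pixel across filters": M,       # same image, M filters
    }

print(reuse_factors(N=16, M=256, E=13, R=3))
```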
Memory Access is the Bottleneck

Exploiting these opportunities in the local memory hierarchy:
1) Data reuse can reduce DRAM reads of filters/images by up to 500× (AlexNet CONV layers).
2) Partial sum accumulation does NOT have to access DRAM.
• Example: DRAM accesses in AlexNet can be reduced from 2896M to 61M (best case).
Spatial Architecture for CNN

An array of Processing Elements (PEs), each pairing an ALU with a local register file (Reg File, 0.5 – 1.0 kB), fed by a Global Buffer (100 – 500 kB) that talks to DRAM.
Local memory hierarchy:
• Global Buffer
• Direct inter-PE network
• PE-local memory (RF)
Low-Cost Local Data Access

Normalized energy cost of fetching data to run a MAC at the ALU (measured from a commercial 65nm process):
• ALU computation: 1× (reference)
• RF → ALU (0.5 – 1.0 kB): 1×
• PE → ALU (inter-PE NoC, 200 – 1000 PEs): 2×
• Buffer → ALU (100 – 500 kB): 6×
• DRAM → ALU: 200×
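These costs suggest a simple way to compare dataflows: weight each memory level's access count by its normalized energy. A minimal sketch of that idea (my own code; the paper's comparison framework is more detailed):

```python
# Normalized per-access energy costs from the slide (65nm measurements)
COST = {"alu": 1.0, "rf": 1.0, "noc": 2.0, "buffer": 6.0, "dram": 200.0}

def dataflow_energy_per_mac(access_counts, num_macs):
    """Normalized energy per MAC for a layer under a given dataflow.
    access_counts: per-level access totals, e.g.
    {"rf": ..., "noc": ..., "buffer": ..., "dram": ...}"""
    energy = num_macs * COST["alu"]  # the computation itself
    for level, count in access_counts.items():
        energy += count * COST[level]
    return energy / num_macs
```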
How can data reuse and local accumulation be exploited with such limited low-cost local storage? A specialized processing dataflow is required!
Taxonomy of Existing Dataflows
• Weight Stationary (WS)
• Output Stationary (OS)
• No Local Reuse (NLR)
Weight Stationary (WS)
• Minimize weight read energy consumption
− maximize convolutional and filter reuse of weights
• Examples: [Chakradhar, ISCA 2010] [nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [Origami, GLSVLSI 2015]
[Figure: weights W0 – W7 held stationary in the PEs; pixels and psums flow through the Global Buffer]
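As loop-nest intuition, a hedged single-channel sketch of the general WS idea (my own code, not any of the listed designs): each weight is fetched once and reused across all window positions before being replaced.

```python
import numpy as np

def conv2d_weight_stationary(image, filters):
    """Single-channel 2D conv with a weight-stationary loop order:
    each weight w is fetched once and reused across all E*E window
    positions before the next weight is loaded (illustrative only)."""
    M, R, _ = filters.shape
    H = image.shape[0]
    E = H - R + 1
    psum = np.zeros((M, E, E))
    for m in range(M):
        for i in range(R):
            for j in range(R):
                w = filters[m, i, j]          # the "stationary" weight
                for y in range(E):
                    for x in range(E):        # reuse w E*E times
                        psum[m, y, x] += image[y + i, x + j] * w
    return psum
```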
Output Stationary (OS)
• Minimize partial sum R/W energy consumption
− maximize local accumulation
• Examples: [Gupta, ICML 2015] [ShiDianNao, ISCA 2015] [Peemen, ICCD 2013]
[Figure: psums P0 – P7 held stationary in the PEs; pixels and weights flow through the Global Buffer]
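The corresponding single-channel OS sketch (again my own code): each output's partial sum stays in a local accumulator until it is complete.

```python
import numpy as np

def conv2d_output_stationary(image, filters):
    """Single-channel 2D conv with an output-stationary loop order:
    each psum stays in a local accumulator until fully accumulated,
    so partial sums never travel to larger memories (illustrative only)."""
    M, R, _ = filters.shape
    H = image.shape[0]
    E = H - R + 1
    out = np.zeros((M, E, E))
    for m in range(M):
        for y in range(E):
            for x in range(E):
                acc = 0.0                     # the "stationary" psum
                for i in range(R):
                    for j in range(R):        # accumulate locally
                        acc += image[y + i, x + j] * filters[m, i, j]
                out[m, y, x] = acc
    return out
```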
No Local Reuse (NLR)
• Use a large global buffer as shared storage
− reduce DRAM access energy consumption
• Examples: [DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014] [Zhang, FPGA 2015]
[Figure: pixels, weights, and psums all flow between the PEs and the Global Buffer]
Energy Efficiency Comparison

[Chart: normalized energy/MAC (0 – 2 scale) across CNN dataflows: WS, OSA, OSB, OSC (variants of OS), NLR, and Row Stationary (RS)]
Setup: same total area, 256 PEs, AlexNet configuration (CONV layers), batch size = 16.
Energy-Efficient Dataflow: Row Stationary (RS)
• Maximize reuse and accumulation at the RF
• Optimize for overall energy efficiency instead of for only a certain data type
1D Row Convolution in PE

A filter row (a b c) is convolved with an input image row (a b c d e), producing a row of partial sums. The PE keeps the filter row and a sliding window of the image row in its Reg File, stepping the window one pixel at a time and accumulating each psum locally.
• Maximize row convolutional reuse in RF
− keep a filter row and image sliding window in RF
• Maximize row psum accumulation in RF
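A minimal sketch of what one PE computes (my own code; the real PE operates on fixed-point data and pipelines the MACs):

```python
def pe_row_conv(filter_row, image_row):
    """1D row convolution inside one PE: the filter row stays in the RF,
    an R-wide window of the image row slides by one pixel per step, and
    each partial sum is accumulated locally before being output."""
    R = len(filter_row)
    E = len(image_row) - R + 1
    psum_row = []
    for x in range(E):
        window = image_row[x:x + R]        # sliding window held in RF
        acc = 0
        for w, p in zip(filter_row, window):
            acc += w * p                   # local psum accumulation
        psum_row.append(acc)
    return psum_row

# e.g. filter row [a, b, c] over image row [a, b, c, d, e] -> 3 psums
print(pe_row_conv([1, 2, 3], [1, 2, 3, 4, 5]))
```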
2D Convolution in PE Array

A 2D convolution is mapped onto the PE array so that each PE performs one 1D row convolution. For a 3-row filter and a 3-row output:
• PE 1 – PE 3 (column 1): filter rows 1 – 3 * image rows 1 – 3 → output row 1
• PE 4 – PE 6 (column 2): filter rows 1 – 3 * image rows 2 – 4 → output row 2
• PE 7 – PE 9 (column 3): filter rows 1 – 3 * image rows 3 – 5 → output row 3
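Building on pe_row_conv from the sketch above, the array mapping can be expressed as follows (my own code): grid position (i, j) convolves filter row i with image row i + j, and psum rows accumulate down each column.

```python
def conv2d_row_stationary(filter_rows, image_rows):
    """Map a 2D convolution onto a logical PE grid: PE (i, j) computes
    filter_rows[i] * image_rows[i + j], and the psum rows produced by
    each column j are summed (the vertical accumulation on the slides)."""
    R = len(filter_rows)                       # number of filter rows
    E = len(image_rows) - R + 1                # number of output rows
    output = []
    for j in range(E):                         # one PE column per output row
        col_psums = [pe_row_conv(filter_rows[i], image_rows[i + j])
                     for i in range(R)]        # R PEs in this column
        output.append([sum(vals) for vals in zip(*col_psums)])
    return output
```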
Convolutional Reuse Maximized
• Filter rows are reused across PEs horizontally: every PE in the same array row holds the same filter row.
• Image rows are reused across PEs diagonally: PEs along each diagonal share the same image row.

Maximize 2D Accumulation in PE Array
• Partial sums accumulate across PEs vertically: each PE column produces one complete output row.
Dimensions Beyond 2D Convolution
1 Multiple Images  2 Multiple Filters  3 Multiple Channels
Filter Reuse in PE (1 Multiple Images)
Rows from different images share the same filter row:
• Image 1, channel 1, row 1 * filter 1, row 1 = psum 1, row 1
• Image 2, channel 1, row 1 * filter 1, row 1 = psum 2, row 1
Processing in PE: concatenate the image rows, so one pass over filter 1's row 1 produces psums 1 & 2.
Image Reuse in PE (2 Multiple Filters)
Rows from different filters share the same image row:
• Image 1, channel 1, row 1 * filter 1, row 1 = psum 1, row 1
• Image 1, channel 1, row 1 * filter 2, row 1 = psum 2, row 1
Processing in PE: interleave the filter rows, so one pass over image 1's row 1 produces psums 1 & 2.
Channel Accumulation in PE (3 Multiple Channels)
Psums from different channels of the same output accumulate together:
• Image 1, channel 1, row 1 * filter 1, row 1 = psum 1, row 1
• Image 1, channel 2, row 1 * filter 1, row 1 = psum 1, row 1
• psum (channel 1) + psum (channel 2) = psum, row 1
Processing in PE: interleave the channels and accumulate the psums locally.
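A sketch of the channel-accumulation case, reusing pe_row_conv from the earlier sketch (my own code; the real PE interleaves channels cycle by cycle rather than running them back to back):

```python
def pe_multi_channel(filter_rows_by_channel, image_rows_by_channel):
    """Accumulate psums across channels inside one PE: each channel's
    row convolution produces a psum row, and the rows are summed so no
    per-channel psum ever leaves the PE."""
    psum_rows = [pe_row_conv(f, img) for f, img in
                 zip(filter_rows_by_channel, image_rows_by_channel)]
    return [sum(vals) for vals in zip(*psum_rows)]

# channels 1 and 2 of image 1 against channels 1 and 2 of filter 1
print(pe_multi_channel([[1, 2, 3], [4, 5, 6]],
                       [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]))
```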
CNN Convolution – The Full Picture

The 3×3 PE-array mapping handles the 2D convolution, while each PE handles the extra dimensions:
• Multiple images: Image 1 & 2 * Filter 1 = Psum 1 & 2 (concatenated image rows)
• Multiple filters: Image 1 * Filter 1 & 2 = Psum 1 & 2 (interleaved filter rows)
• Multiple channels: Image 1 * Filter 1 = Psum (interleaved channels, accumulated)
Simulation Results
• Same total hardware area
• 256 PEs
• AlexNet configuration
• Batch size = 16
Dataflow Comparison: CONV Layers

[Chart: normalized energy/MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, and RS, broken down by hierarchy level: ALU, RF, NoC, buffer, DRAM]
RS uses 1.4× – 2.5× lower energy than the other dataflows.
[Chart: the same comparison broken down by data type (pixels, weights, psums) rather than hierarchy level]
RS optimizes for the best overall energy efficiency.
Dataflow Comparison: FC Layers

[Chart: normalized energy/MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, and RS, broken down by data type: pixels, weights, psums]
RS uses at least 1.3× lower energy than the other dataflows.
Row Stationary: Layer Breakdown

[Chart: normalized energy (1 MAC = 1, scale up to 2.0e10) for AlexNet layers L1 – L8, broken down by ALU, RF, NoC, buffer, DRAM]
The CONV layers (80% of total energy) are RF-dominated; the FC layers (20% of total energy) are DRAM-dominated.
Summary
• We propose a Row Stationary (RS) dataflow to exploit the low-cost local memories in a spatial architecture.
• RS optimizes for the best overall energy efficiency, while existing CNN dataflows focus only on certain data types.
• RS has higher energy efficiency than existing dataflows:
− 1.4× – 2.5× higher in CONV layers
− at least 1.3× higher in FC layers (batch size ≥ 16)
• We have verified RS in a fabricated CNN processor chip, Eyeriss (4000 µm × 4000 µm die).
Thank You

Learn more about Eyeriss at http://eyeriss.mit.edu