University of Michigan, Electrical Engineering and Computer Science
Data-centric Subgraph Mapping for Narrow Computation Accelerators
Amir Hormati, Nathan Clark,
and Scott Mahlke
Advanced Computer Architecture Lab.
University of Michigan
Introduction
• Migration of applications
• Programmability and cost issues in ASIC
• More functionality in the embedded processor
What Are the Challenges?
• Accelerator hardware
• Compiler algorithm
Configurable Compute Array (CCA)
• Array of FUs
• Arithmetic/logic
• 32-bit functional units
• Full interconnect between rows
• Supports 95 percent of all computation patterns
(Nathan Clark, ISCA 2005)
[Figure: CCA diagram with four inputs (Input1 to Input4) and two outputs (Output1, Output2)]
Report Card on the Original CCA
• Easy to integrate to current embedded systems
• High performance gain
however...
• 32-bit general-purpose CCA:
  – 130 nm standard cell library
  – Area requirement: 0.3 mm²
  – Latency: 3.3 ns
[Figure: die photo of a processor with CCA]
Objectives of this Work
• Redesign of the CCA hardware
  – Area
  – Latency
• Compilation strategy
  – Code quality
  – Runtime
Width Utilization
• Full width of the FUs is not always needed.
• Replacing the FUs with narrower FUs is not a solution by itself.
Benchmark    Less than 16-bit    Less than 8-bit
Rawcaudio    94%                 52%
Rawdaudio    91%                 60%
Epic         80%                 45%
Unepic       74%                 40%
Cjpeg        76%                 49%
Djpeg        70%                 53%

Benchmark    Larger than 16-bit  Larger than 8-bit
3des         86%                 90%
bitcount     80%                 85%
rijndael     50%                 64%
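A width profile like the one in the table could be gathered from a trace of operand values. The sketch below is only an illustration: the function name `width_profile`, the sample trace, and the reading of "less than N-bit" as "the value fits in at most N bits" are assumptions, not details given on the slide.

```python
def width_profile(values, thresholds=(16, 8)):
    """Return {N: fraction of values that fit in at most N bits}.

    Assumption: width is measured on the magnitude of the value,
    using Python's int.bit_length().
    """
    values = list(values)
    profile = {}
    for bits in thresholds:
        narrow = sum(1 for v in values if abs(v).bit_length() <= bits)
        profile[bits] = narrow / len(values)
    return profile


# Hypothetical operand trace, not taken from any benchmark.
trace = [3, 200, 70000, 12, 40000, 255, 1, 90]
print(width_profile(trace))
```

A profile dominated by narrow values is what motivates a narrow datapath, while the wide-value benchmarks in the second table show why full-width results must still be supported.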
Width-Aware Narrow CCA
[Figure: block diagram of the width-aware narrow CCA. Labeled parts: Width Checker, Iteration Controller (Iterate signal), Input Registers, Carry Bits, CCA, Output Registers, Output 1, Output 2. Data paths are labeled with bit ranges [0-7] and [8-31].]
Sparse Interconnect
• Rank wires based on utilization.
• More than 50% of the wires are removed.
• 91% of all patterns are supported.
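The ranking idea can be sketched as follows. This is a minimal illustration under stated assumptions: the wire names, the pattern list, and the `keep_fraction` parameter are hypothetical, and the deck does not say how ties or unused wires were handled.

```python
from collections import Counter


def prune_wires(patterns, keep_fraction):
    """Keep the most-used wires and report how many patterns still fit.

    patterns: list of sets, each holding the wire ids one pattern needs.
    Returns (kept wires, fraction of patterns fully supported).
    """
    usage = Counter(w for p in patterns for w in p)
    # Rank by descending use count; break ties by name for determinism.
    ranked = sorted(usage, key=lambda w: (-usage[w], w))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    kept = set(ranked[:n_keep])
    supported = sum(1 for p in patterns if p <= kept)
    return kept, supported / len(patterns)


# Hypothetical patterns over four wires a, b, c, d.
patterns = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"a", "b"}, {"d"}]
kept, coverage = prune_wires(patterns, keep_fraction=0.5)
print(sorted(kept), coverage)
```

Because a few wires carry most of the traffic, dropping half of them in this toy input still leaves most patterns mappable, which is the same trade-off the slide reports.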
[Figure: two CCA diagrams, each with inputs Input1 to Input4 and outputs Output1 and Output2]
Synthesis Results
Accelerator Configuration          Latency (ns)   Area (mm²)
32-bit with full interconnect      3.30           0.301
32-bit with sparse interconnect    2.95           0.270
16-bit with full interconnect      2.88           0.168
16-bit with sparse interconnect    2.55           0.140
8-bit with full interconnect       2.56           0.080
8-bit with sparse interconnect     2.00           0.070
Width Checker                      0.39           0.002
• Synthesized using Synopsys and Encounter with a 130 nm library.
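The table values can be compared directly. The sketch below is one possible reading, not the authors' stated calculation: it assumes the width checker's area is added to the 8-bit sparse design and that its latency does not lengthen the CCA's own critical path.

```python
# (latency in ns, area in mm^2), copied from the synthesis table.
CONFIGS = {
    "32-bit full": (3.30, 0.301),
    "8-bit sparse": (2.00, 0.070),
}
WIDTH_CHECKER = (0.39, 0.002)

base_latency, base_area = CONFIGS["32-bit full"]
narrow_latency, narrow_area = CONFIGS["8-bit sparse"]

# Assumption: the narrow design pays for the width checker's area.
area_ratio = base_area / (narrow_area + WIDTH_CHECKER[1])
# Assumption: the width checker is off the CCA's critical path.
latency_ratio = base_latency / narrow_latency

print(f"area ratio: {area_ratio:.2f}x, latency ratio: {latency_ratio:.2f}x")
```

Under these assumptions the ratios come out at roughly 4.2x in area and 1.65x in latency.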
Compilation Challenges
• Finding the best portions of the code to accelerate
• Non-uniform latency
• What are the current solutions?
  – Hand coding
  – Function intrinsics
  – Greedy solution
Step 1: Enumeration
[Figure: dataflow graph with operations numbered 1 to 8 (ADD, AND, OR, XOR, and CMP nodes) between live-in and live-out values, shown alongside examples of enumerated subgraphs]
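Enumeration of candidate subgraphs can be sketched as a brute-force search over connected node sets. This is only an illustration: the edge list is hypothetical, the `max_size` bound is an assumption, and the deck does not describe the enumeration algorithm actually used.

```python
from itertools import combinations


def is_connected(nodes, adjacency):
    """Check that `nodes` form one connected piece (edges treated as undirected)."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current] & nodes:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def enumerate_subgraphs(edges, max_size):
    """Return every connected node set with at most `max_size` operations."""
    adjacency = {}
    for producer, consumer in edges:
        adjacency.setdefault(producer, set()).add(consumer)
        adjacency.setdefault(consumer, set()).add(producer)
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(adjacency), size):
            if is_connected(combo, adjacency):
                found.append(frozenset(combo))
    return found


# Hypothetical three-operation chain 1 -> 2 -> 3.
subgraphs = enumerate_subgraphs([(1, 2), (2, 3)], max_size=3)
print(len(subgraphs))
```

Exhaustive enumeration grows quickly with block size, which is why a size bound or other search-space reduction is needed in practice.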
Step 2: Subgraph Isomorphism Pruning
• Ensure subgraphs can run on accelerator
[Figure: step-by-step mapping of a candidate subgraph (3 AND, 6 SUB, 8 SHL, 10 SHRA, 11 ADD) onto the accelerator's functional units. The units are labeled A to H: row one offers <<, *, and logic; row two offers >>, >>, and +/-; row three offers +/- and +/-. Operations are placed one at a time until all five are mapped.]
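The pruning check can be sketched as a small backtracking search that tries to place each operation on a compatible functional unit. This is an illustration under assumptions: the slot table mirrors the unit labels in the figure, but the dependence edges and the rule "a producer must sit in an earlier row than its consumer" are hypothetical simplifications, and this is not a general subgraph isomorphism solver.

```python
# Slot -> (row, operation class), following the A..H labels in the figure.
SLOTS = {
    "A": (0, "shift_left"), "B": (0, "multiply"), "C": (0, "logic"),
    "D": (1, "shift_right"), "E": (1, "shift_right"), "F": (1, "add_sub"),
    "G": (2, "add_sub"), "H": (2, "add_sub"),
}
OP_CLASS = {
    "SHL": "shift_left", "MUL": "multiply", "SHRA": "shift_right",
    "AND": "logic", "OR": "logic", "XOR": "logic",
    "ADD": "add_sub", "SUB": "add_sub",
}


def find_mapping(ops, edges):
    """ops: {op id: opcode}; edges: (producer, consumer) pairs.

    Returns {op id: slot} or None when the subgraph cannot be placed.
    """
    order = sorted(ops)

    def place(index, assigned):
        if index == len(order):
            return dict(assigned)
        node = order[index]
        for slot, (row, op_class) in SLOTS.items():
            if slot in assigned.values() or op_class != OP_CLASS[ops[node]]:
                continue
            assigned[node] = slot
            # Assumed rule: data flows only from earlier rows to later rows.
            rows_ok = all(
                SLOTS[assigned[u]][0] < SLOTS[assigned[v]][0]
                for u, v in edges
                if u in assigned and v in assigned
            )
            if rows_ok:
                found = place(index + 1, assigned)
                if found is not None:
                    return found
            del assigned[node]
        return None

    return place(0, {})


ops = {3: "AND", 6: "SUB", 8: "SHL", 10: "SHRA", 11: "ADD"}
edges = [(3, 6), (8, 10), (6, 11), (10, 11)]  # hypothetical dependences
print(find_mapping(ops, edges))
```

Subgraphs for which no placement exists are dropped before the later steps consider them.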
Step 3: Grouping
[Figure: the dataflow graph of operations 1 to 8 with candidate subgraphs labeled A to F; a second copy of the graph adds the grouped candidate AC]
• Assuming A and C are the only possibilities for grouping.
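Grouping can be sketched as pairing candidates that share no operations and whose combined inputs and outputs still fit the accelerator. The sketch below is an assumption-heavy illustration: the limits of four inputs and two outputs are taken from the CCA figure, the candidate data are hypothetical, only one pairing pass is shown, and "disconnected" is simplified to "no shared operations".

```python
from itertools import combinations

MAX_INPUTS, MAX_OUTPUTS = 4, 2  # assumed from the CCA figure's port count


def group_candidates(candidates):
    """candidates: {name: (set of op ids, input count, output count)}.

    Returns new grouped candidates keyed by the concatenated names.
    """
    grouped = {}
    for (name1, c1), (name2, c2) in combinations(sorted(candidates.items()), 2):
        ops1, in1, out1 = c1
        ops2, in2, out2 = c2
        if ops1 & ops2:
            continue  # overlapping subgraphs cannot both be selected
        if in1 + in2 <= MAX_INPUTS and out1 + out2 <= MAX_OUTPUTS:
            grouped[name1 + name2] = (ops1 | ops2, in1 + in2, out1 + out2)
    return grouped


# Hypothetical candidates; op ids and port counts are made up.
candidates = {
    "A": ({3}, 2, 1),
    "B": ({3, 4}, 3, 1),
    "C": ({6, 7}, 2, 1),
}
print(sorted(group_candidates(candidates)))
```

With these made-up numbers only A and C can be grouped, which matches the situation the slide assumes.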
Dealing with Non-uniform Latency
[Figure: example subgraph made of OR, ADD, and AND operations]

Op     W[0,8]   W[9,16]   W[17,24]   W[25,32]   Average Latency
ADD    100%     0%        0%         0%         1
OR     0%       50%       0%         50%        3
AND    0%       50%       50%        0%         2.5

Subgraph cost: 3, benefit: 0

[Figure: timeline of three cases (A, B, C) that mix 8-bit and 24-bit operands, each with an average latency of 2]
• More than 94% of operands do not change width.
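The average latencies in the table are consistent with a weighted average in which each width range costs one iteration per 8 bits. The sketch below reproduces that arithmetic; the one-iteration-per-8-bits rule is inferred from the table values rather than stated on the slide, and the function name is hypothetical.

```python
import math


def average_latency(width_distribution, chunk_bits=8):
    """width_distribution: {upper bound of a width range in bits: probability}.

    Assumption: a value up to `upper` bits wide needs ceil(upper / chunk_bits)
    iterations on the narrow accelerator.
    """
    return sum(
        probability * math.ceil(upper / chunk_bits)
        for upper, probability in width_distribution.items()
    )


print(average_latency({8: 1.0}))             # ADD row
print(average_latency({16: 0.5, 32: 0.5}))   # OR row
print(average_latency({16: 0.5, 24: 0.5}))   # AND row
```

These calls give 1, 3, and 2.5, matching the ADD, OR, and AND rows.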
Step 4: Unate Covering
Width Op ID A B C AC D E F G H … N
24 1 1 1 1 …
8 2 1 1 1 …
24 3 1 1 1 1 …
8 4 1 1 1 …
32 5 1 1 …
32 6 1 1 …
8 7 1 1 …
8 8 1 1 … 1
Cost 3 4 3 3 1 4 4 1 1 … 1
Benefit -1 -1 -1 1 1 -1 -1 0 0 … 0
[Figure: a second covering matrix, labeled AC, with columns Width, Op ID, D, G, H, …, N and its own Cost and Benefit rows]
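Selection over a covering matrix can be sketched with an exhaustive search for the cheapest set of candidates that covers every operation. This is only an illustration: the candidates and costs are hypothetical, operations may be covered more than once as in the textbook formulation, and a practical solver would use branch and bound rather than trying every combination.

```python
from itertools import combinations


def min_cost_cover(rows, candidates):
    """rows: set of op ids to cover.
    candidates: {name: (frozenset of covered op ids, cost)}.

    Returns (total cost, tuple of chosen names), or None if no cover exists.
    """
    names = sorted(candidates)
    best = None
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            covered = set().union(*(candidates[n][0] for n in combo))
            if covered >= rows:
                cost = sum(candidates[n][1] for n in combo)
                if best is None or cost < best[0]:
                    best = (cost, combo)
    return best


# Hypothetical matrix: X is a two-op subgraph, W, Y, Z are single ops.
candidates = {
    "X": (frozenset({1, 2}), 3),
    "W": (frozenset({2}), 1),
    "Y": (frozenset({3}), 1),
    "Z": (frozenset({1}), 1),
}
print(min_cost_cover({1, 2, 3}, candidates))
```

In this toy input the two-operation candidate is too expensive, so the cheapest cover leaves all three operations unmapped to it.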
Experimental Evaluation
• ARM port of Trimaran compiler system
• Processor model
  – ARM926EJ-S
  – Single issue, in-order execution, 5-stage pipeline
  – I/D caches: 16 KB, 64-way
• Hardware simulation: SimpleScalar 4.0
Comparison of Different CCAs
[Figure: bar chart of percent speedup (y-axis, 0 to 90) per benchmark for the 32-bit, 16-bit, and 8-bit CCAs]
• The 16-bit and 8-bit CCAs are 7% and 9% better, respectively, than the 32-bit CCA.
• Assuming a clock speed of 1/(3.3 ns) ≈ 300 MHz
Comparison of Different Algorithms
[Figure: bar chart of percent speedup (y-axis, 0 to 35) for the data-centric and data-unaware algorithms. Benchmarks: md5, blowfish, 3des, sha, sobel, rc4, cjpeg, djpeg, epic, unepic, g721decode, g721encode, mpeg2dec, mpeg2enc, rawcaudio, rawdaudio, rasta, rijndael, dijkstra_large, susan, rls, LU, bitcount, and the average]
• Previous work: the greedy algorithm is 10% worse than the data-unaware algorithm.
Conclusion
• Programmable hardware accelerator
• Width-aware CCA: optimizes for the common case.
  – 64% faster clock
  – 4.2x smaller
• Data-centric compilation: deals with the non-uniform latency of the CCA.
  – Average 6.5%, max 12% better than the data-unaware algorithm.
Questions?
For more information: http://cccp.eecs.umich.edu/
Total Runtime
[Figure: log-scale plot of time in seconds (0.01 to 10000) against block size (0 to 300) for the Data-Centric and FEU series, annotated with 89%, 96%, and 99%]
Operation of Narrow CCA
[Figure: a narrow CCA with three FUs (ADD, OR, ADD) evaluating [(0x1D + 0x0C) + (0x20 OR 0x08)] over successive iterations, with inputs A = 0x1D, B = 0x0C, C = 0x20, D = 0x08, carry bits passed between iterations, and final result 0x51]
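The iterative evaluation of this expression can be sketched in software. The sketch is an illustration, not the hardware design: the function name and parameters are hypothetical, and the 4-bit slice width in the usage example is chosen only so that a two-digit hex example needs two iterations.

```python
def narrow_eval(a, b, c, d, chunk_bits=8, total_bits=32):
    """Evaluate (a + b) + (c | d) one narrow slice at a time.

    Each adder keeps its own carry bit between iterations, standing in
    for the carry bits saved by the iteration controller.
    """
    mask = (1 << chunk_bits) - 1
    carry_first = carry_second = 0
    result = 0
    for shift in range(0, total_bits, chunk_bits):
        sa, sb, sc, sd = ((x >> shift) & mask for x in (a, b, c, d))
        first_sum = sa + sb + carry_first          # first ADD unit
        carry_first, first_sum = first_sum >> chunk_bits, first_sum & mask
        or_value = sc | sd                         # OR unit, no carry needed
        second_sum = first_sum + or_value + carry_second  # second ADD unit
        carry_second, second_sum = second_sum >> chunk_bits, second_sum & mask
        result |= second_sum << shift
    return result


print(hex(narrow_eval(0x1D, 0x0C, 0x20, 0x08, chunk_bits=4, total_bits=8)))
```

With 4-bit slices the first iteration produces the low digit 1 and sets both carries, and the second iteration produces the high digit 5, giving 0x51, the same value as the full-width computation.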
Data-Centric Subgraph Mapping
• Enumerate
  – All subgraphs
• Pruning
  – Subgraph isomorphism
• Grouping
  – Iteratively group disconnected subgraphs
• Selection
  – Unate covering
• Shrink search space to control runtime
[Figure: flow of the four steps: Enumeration, Pruning, Grouping, Selection]
How Good is the Cost Function
[Figure: bar chart of normalized width variance (y-axis, 0.75 to 1.00) per benchmark: md5, blowfish, 3des, sha, sobel, rc4, cjpeg, djpeg, epic, unepic, g721decode, g721encode, mpeg2dec, mpeg2enc, rawcaudio, rawdaudio, rasta, rijndael, dijkstra_large, susan, rls, LU, bitcount, and the average]
• Almost all operands keep the same width range throughout execution.