1. Automatically Generating Custom Instruction Set Extensions
Nathan Clark, Wilkin Tang, Scott Mahlke
Workshop on Application Specific Processors
2. Problem Statement
- There is a demand for high-performance, low-power special-purpose systems, e.g. cell phones, network routers, PDAs
- One way to achieve these goals is to augment a general-purpose processor with Custom Function Units (CFUs), which combine several primitive operations
- We propose an automated method for CFU generation
3. System Overview
4. Example
[Figure: dataflow graph with eight numbered operations]
Potential CFUs: {1,3}, {2,4}, {2,6}, {3,4}, {4,5}, {5,8}, {6,7}, {7,8}
5. Example (continued)
[Figure: the same dataflow graph]
Potential CFUs: {1,3}, {2,4}, {2,6}, …, {1,3,4}, {2,4,5}, {2,6,7}, …
6. Example (continued)
[Figure: the same dataflow graph]
Potential CFUs: {1,3}, {2,4}, {2,6}, …, {1,3,4,5}, {2,4,5,8}, {2,6,7,8}, …, {1,3,4,5,8}
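The growth of the candidate list across these slides — pairs, then triples, then larger connected subgraphs — can be sketched as repeatedly extending each pattern by one adjacent operation. A minimal sketch in C, assuming the example DFG's edges are exactly the candidate pairs listed above (node numbers come from the slide; the data structures and everything else are illustrative, not the paper's implementation):

```c
/* Edges of the example DFG, assumed from the candidate pairs above. */
static const int edges[][2] = {
    {1,3}, {2,4}, {2,6}, {3,4}, {4,5}, {5,8}, {6,7}, {7,8}
};
enum { NEDGES = sizeof edges / sizeof edges[0] };

/* Extend every pattern (a bitmask of node numbers) by one adjacent node,
   deduplicating; returns the number of new, larger patterns produced. */
static int grow(const unsigned *in, int n, unsigned *out)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        for (int e = 0; e < NEDGES; e++) {
            unsigned a = 1u << edges[e][0], b = 1u << edges[e][1];
            unsigned ext = 0;
            if ((in[i] & a) && !(in[i] & b))
                ext = in[i] | b;          /* extend across the edge */
            else if ((in[i] & b) && !(in[i] & a))
                ext = in[i] | a;
            if (!ext)
                continue;                 /* edge fully inside or outside */
            int dup = 0;
            for (int j = 0; j < m; j++)
                if (out[j] == ext) dup = 1;
            if (!dup)
                out[m++] = ext;
        }
    }
    return m;
}
```

Seeding with the eight pairs and calling `grow` once produces the connected triples, among them the {1,3,4}, {2,4,5}, and {2,6,7} shown on the slide; calling it again yields four-node patterns such as {1,3,4,5} and {2,4,5,8}.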
7. Characterization
- Use the macro library to get information on each potential CFU
- Latency is the sum of each primitive's latency
- Area is the sum of each primitive's macrocell area
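A sketch of that characterization step. The latency and area numbers would come from the macro library, which we do not have here, so the values used below are illustrative only; summing latencies implicitly treats the primitives as a serial chain, which is the conservative estimate the slide describes:

```c
/* One primitive operation as described by a macro library entry.
   The struct layout is an assumption for illustration. */
struct prim {
    const char *op;
    double latency;  /* cycles */
    double area;     /* macrocell area, arbitrary units */
};

/* Characterize a candidate CFU: estimate its latency and area as the
   sums over its constituent primitives. */
static void characterize(const struct prim *ops, int n,
                         double *latency, double *area)
{
    *latency = 0.0;
    *area = 0.0;
    for (int i = 0; i < n; i++) {
        *latency += ops[i].latency;
        *area    += ops[i].area;
    }
}
```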
8. Issues We Consider
- Performance: is the pattern on the critical path; cycles saved
- Cost:
  - CFU area
  - Control logic (difficult to measure)
  - Decode logic (difficult to measure)
  - Register file area (can be amortized)
[Figure: example dataflow graph (LD, ADD, ADD, AND, ASL, XOR, BR) annotated with per-primitive latencies of 1, 0.6, and 0.1 cycles]
9. More Issues to Consider
- IO: number of input and output operands
- Usability: how well the compiler can use the pattern
[Figure: example pattern of OR, LSL, AND, and CMPP operations]
10. Selection
- Currently use a greedy algorithm: pick the best performance gain / area first
- Can yield bad selections
[Figure: the same OR, LSL, AND, CMPP pattern]
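The greedy step can be sketched as follows: repeatedly pick the remaining candidate with the best gain/area ratio that still fits an area budget. The budget and the exact ratio metric are assumptions on our part; the slide only says "best performance gain / area first":

```c
/* One candidate CFU, as produced by the characterization step. */
struct cand {
    double gain;   /* estimated cycles saved */
    double area;   /* estimated CFU area     */
    int picked;
};

/* Greedy selection: repeatedly take the unpicked candidate with the best
   gain/area ratio that still fits the remaining area budget.
   Returns the total gain of the selected set. */
static double select_greedy(struct cand *c, int n, double budget)
{
    double total = 0.0;
    for (;;) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (c[i].picked || c[i].area > budget)
                continue;
            if (best < 0 ||
                c[i].gain / c[i].area > c[best].gain / c[best].area)
                best = i;
        }
        if (best < 0)
            break;                 /* nothing left that fits */
        c[best].picked = 1;
        budget -= c[best].area;
        total  += c[best].gain;
    }
    return total;
}
```

This also shows why greedy "can yield bad selections": with a budget of 10 and candidates (gain 6, area 4) and (gain 10, area 8), the greedy pass takes the first (ratio 1.5), then cannot fit the second, saving 6 cycles where selecting only the second would have saved 10.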
11. Case Study 1: Blowfish
- Speedup: 1.24; 10 cycles can be compressed down to 2!
- Cost: ~6 adders; 6 inputs, 2 outputs
- C code this DFG came from:
  r ^= (((s[(t>>24)] + s[0x0100+((t>>16)&0xff)]) ^ s[0x0200+((t>>8)&0xff)]) + s[0x0300+(t&0xff)]) & 0xffffffff;
[Figure: the extracted dataflow graph (ADD, XOR, ADD, AND, XOR, LSR, AND, ADD, LSL, ADD) with its register and constant operands]
12. Case Study 2: ADPCM Decode
- Speedup: 1.20; 3 cycles can be compressed down to 1
- Cost: ~1.5 adders; 2 inputs, 2 outputs
- C code this DFG came from:
  d = d & 7;
  if ( d & 4 ) { … }
[Figure: the extracted dataflow graph (AND, AND, CMPP) over register r16 and constants 7, 4, 0]
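The fused behaviour of that three-op pattern (AND, AND, CMPP) can be written out in C: the slide reports a CFU with 2 inputs and 2 outputs that produces both the masked value and the branch predicate in a single cycle. The function name below is hypothetical; the real CFU is synthesized hardware, not a C routine:

```c
/* Semantics of the fused AND/AND/CMPP pattern from the slide: one CFU
   computes d & 7 and the branch predicate together, replacing a 3-cycle
   sequence with a single operation.
   (cfu_and_and_cmpp is an illustrative name, not from the paper.) */
static void cfu_and_and_cmpp(unsigned d, unsigned *masked, int *pred)
{
    *masked = d & 7;               /* AND:        d = d & 7       */
    *pred   = (*masked & 4) != 0;  /* AND + CMPP: if ( d & 4 ) …  */
}
```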
13. Experimental Setup
- CFU recognition implemented in the Trimaran research infrastructure
- Speedup shown is with CFUs relative to a baseline machine: four-wide VLIW with predication; can issue at most 1 Int, Flt, Mem, Brn inst./cyc.; 300 MHz clock
- CFU latency is estimated using standard cells from Synopsys' design library
14. Varying the Number of CFUs
- More CFUs yields more performance
- A weakness in our selection algorithm causes plateaus
[Chart: adpcm-decode — speedup (1 to 2.2) and additional cost (0 to 80) vs. number of function units (0 to 20)]
15. Varying the Number of Ops
- Bigger CFUs yield better performance
- If they're too big, they can't be used as often and they expose alternate critical paths
[Chart: blowfish — speedup (1 to 2) and additional cost (0 to 80) vs. max number of ops/CFU (0 to 20)]
16. Related Work
- Many people have done this for code size: Bose et al., Liao et al.
- Typically done with traces: Arnold et al.
- A previous paper used a more enumerative discovery algorithm
- We are unique because of our compiler-based approach and novel analysis of CFUs
17. Conclusion and Future Work
- CFUs have the potential to offer big performance gains for small cost
- Recognize more complex subgraphs: generalized acyclic/cyclic subgraphs
- Develop our system to automatically synthesize application-tailored coprocessors