University of Michigan, Electrical Engineering and Computer Science
Processor Acceleration Through Automated Instruction Set Customization
Nathan Clark, Hongtao Zhong, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan, Ann Arbor
December 3, 2003
Motivation
• Cell phones, PDAs, digital cameras, etc. are everywhere
  – High performance yet low power design point
• General core + ASIC solution
  – Limited post-programmability
• General core + application-specific instructions (CFUs)
[Figure: two block diagrams, a CPU paired with an ASIC and a CPU paired with a CFU]
What is a CFU?
• Combine multiple primitive operations
  – Smaller code size, fewer RF reads
  – Increases performance
[Figure: a dataflow graph of primitive operations (&, |, <<, ^, *, +) with subgraphs collapsed into CFU 1 (+, ^) and CFU 2 (&, <<, |)]
Automation is Key
• This is ¼ of the DFG for a single basic block of blowfish
[Figure: excerpt of the blowfish dataflow graph; visible node labels include 159 XOR, 164 SHR, and 173 AND]
Related Work
• Tensilica Xtensa
  – Commercial example
  – MIPS core + manually constructed CFU
• Automatic instruction set synthesis is a mature field
  – See paper for comparison of techniques
• Our contributions
  – Novel technique for automatic CFU creation
  – System to utilize CFUs in multiple applications
  – Analysis of how effectively CFUs for one application apply to other applications in the same domain
System Overview
• Synthesis
  – Subgraph identification
    • Discover candidates for CFUs
    • Weed out what shouldn’t be picked
  – Selection
    • Determine which candidates to use as CFUs
• Compilation
  – Subgraph replacement
    • Make use of the CFUs in a range of applications
Subgraph Identification
• Grow subgraphs from seed nodes
  – All nodes are seeds
  – Most directions don’t make sense
• How to decide where to grow?
  – Make decisions using factors similar to those an architect would weigh
  – Take 4 factors into consideration
    • Criticality, Latency, Area, Input/Output
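• The sum of these factors determines the value of each direction
  – This step only enumerates candidates; it is NOT picking CFUs
[Figure: the example DFG (%, ^, <<, +, *, &, |) beside a growing list of CFU candidates]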
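A minimal sketch of how the four weights might combine into one value per growth direction. The direction names and per-factor numbers are hypothetical, chosen for illustration; only the summation itself comes from the slide.

```python
FACTORS = ("criticality", "latency", "area", "io")

def direction_value(weights):
    """Sum the four per-factor weights for one growth direction."""
    return sum(weights[f] for f in FACTORS)

# Hypothetical weights for two growth directions out of the same seed node.
directions = {
    "toward_add":   {"criticality": 10.0, "latency": 5.0, "area": 10.0, "io": 4.0},
    "toward_shift": {"criticality": 2.5,  "latency": 7.5, "area": 2.5,  "io": 6.0},
}

# Rank directions from most to least valuable.
ranked = sorted(directions,
                key=lambda d: direction_value(directions[d]),
                reverse=True)

for name in ranked:
    print(name, direction_value(directions[name]))
```

Ranking rather than thresholding is a choice made for this sketch; the slides do not say how the summed value is used to cut off growth.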
Critical Path
• Combining operations on the critical path will shrink the longer dependence chains
  – Maximize potential performance gain
• Wt = $\frac{10}{\text{slack} + 1}$
  – Slack is # cycles off longest dependence path
• Worked examples from the figure: 10/(0+1) = 10 and 10/(2+1) = 3.33
[Figure: the example DFG with two growth directions annotated with their criticality weights]
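A small sketch of the criticality weight, assuming unit latency per node and a hypothetical five-node DFG (the node names and edges are not from the slides). Slack is computed here as the critical path length minus the longest path through the node.

```python
from functools import lru_cache

# Hypothetical DFG: node -> successors. a->b->c->d is the critical path,
# and e feeds d directly, so e sits two cycles off the longest path.
succs = {"a": ["b"], "b": ["c"], "c": ["d"], "d": [], "e": ["d"]}

preds = {n: [] for n in succs}
for node, outs in succs.items():
    for out in outs:
        preds[out].append(node)

@lru_cache(maxsize=None)
def depth_from_top(n):
    """Longest chain ending at n, counting n itself."""
    return 1 + max((depth_from_top(p) for p in preds[n]), default=0)

@lru_cache(maxsize=None)
def depth_to_bottom(n):
    """Longest chain starting at n, counting n itself."""
    return 1 + max((depth_to_bottom(s) for s in succs[n]), default=0)

critical_length = max(depth_from_top(n) for n in succs)

def slack(n):
    """Cycles by which the longest path through n falls short of the critical path."""
    return critical_length - (depth_from_top(n) + depth_to_bottom(n) - 1)

def criticality_weight(n):
    """Wt = 10 / (slack + 1), as on the slide."""
    return 10 / (slack(n) + 1)

print(criticality_weight("a"))            # node on the critical path
print(round(criticality_weight("e"), 2))  # node with slack 2
```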
Latency
• Growing toward low latency operations allows combination of more nodes in a cycle
  – Maximize DFG compression
• Wt = $10 \cdot \frac{\text{latency}_{\text{old}}}{\text{latency}_{\text{new}}}$
• Worked examples from the figure: 10*0.3 / 0.6 = 5 and 10*0.3 / 0.36 = 8.33
[Figure: the example DFG with two growth directions annotated with their latency weights]
Opcode    Area    Cycles
+         1.00    0.30
&         0.12    0.06
<<, >>    0.01    ~0.00
^         0.16    0.09
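A sketch of the latency weight using the Cycles column of the table. It assumes the candidate is a serial chain so that delays simply add, and it treats the "~0.00" shift entry as 0.0; both are simplifications for illustration.

```python
# Fraction of a clock cycle per operation, copied from the slide's table.
# Shifts are listed as ~0.00 and are treated as exactly 0.0 here.
CYCLES = {"+": 0.30, "&": 0.06, "<<": 0.0, ">>": 0.0, "^": 0.09}

def chain_latency(ops):
    """Latency of a candidate, assuming its ops form one serial chain."""
    return sum(CYCLES[op] for op in ops)

def latency_weight(current_ops, new_op):
    """Wt = 10 * old latency / new latency.

    The caller must make sure the grown candidate has non-zero latency.
    """
    old = chain_latency(current_ops)
    new = chain_latency(current_ops + [new_op])
    return 10 * old / new

# Reproduces the two worked examples on the slide.
print(round(latency_weight(["+"], "+"), 2))  # growing an adder toward another adder
print(round(latency_weight(["+"], "&"), 2))  # growing an adder toward an AND
```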
Area
• Want the most benefit for the least area
• Wt = $10 \cdot \frac{\text{area}_{\text{old}}}{\text{area}_{\text{new}}}$
  – Area is the sum of macrocell areas
• Worked examples from the figure: 10*0.5/0.5 = 10 and 10*0.5/1.5 = 3.33
[Figure: the example DFG with two growth directions annotated with their area weights]
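A sketch of the area weight, with candidate area taken as the sum of macrocell areas from the Area column of the opcode table. The starting candidates below are hypothetical; the slide's own worked examples start from an area of 0.5, which does not correspond to a single table entry.

```python
# Macrocell areas in adder-equivalents, copied from the slide's table.
AREA = {"+": 1.00, "&": 0.12, "<<": 0.01, ">>": 0.01, "^": 0.16}

def candidate_area(ops):
    """Area of a candidate = sum of the areas of its macrocells."""
    return sum(AREA[op] for op in ops)

def area_weight(current_ops, new_op):
    """Wt = 10 * old area / new area."""
    old = candidate_area(current_ops)
    new = old + AREA[new_op]
    return 10 * old / new

# Adding a cheap shift barely lowers the weight; adding an adder halves it.
print(round(area_weight(["+"], "<<"), 2))
print(round(area_weight(["+"], "+"), 2))
```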
Input/Output
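• Want CFUs to use as few RF ports as possible
  – Smaller encoding
  – Allow growth of larger candidates
• Wt = $\min\left(10 \cdot \frac{\#\text{ports}_{\text{old}}}{\#\text{ports}_{\text{new}} + 1},\ 10\right)$
• Worked examples from the figure: 10*2/(2+1) = 6.67 and 10*2/(4+1) = 4
[Figure: the example DFG with two growth directions annotated with their input/output weights]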
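A sketch of the input/output weight. The `count_ports` helper is hypothetical: it assumes a candidate's port count is its distinct external inputs plus the internal nodes whose results are used outside the candidate, which the slides do not spell out.

```python
def count_ports(edges, subgraph):
    """Count register-file ports under the assumption stated above."""
    inputs = {src for src, dst in edges if dst in subgraph and src not in subgraph}
    outputs = {src for src, dst in edges if src in subgraph and dst not in subgraph}
    return len(inputs) + len(outputs)

def io_weight(ports_old, ports_new):
    """Wt = min(10 * ports_old / (ports_new + 1), 10)."""
    return min(10 * ports_old / (ports_new + 1), 10)

# The two worked examples on the slide.
print(round(io_weight(2, 2), 2))
print(io_weight(2, 4))

# Hypothetical DFG: x and y feed a; a and z feed b; b feeds out.
edges = [("x", "a"), ("y", "a"), ("a", "b"), ("z", "b"), ("b", "out")]
old_ports = count_ports(edges, {"a"})        # 2 inputs + 1 output
new_ports = count_ports(edges, {"a", "b"})   # 3 inputs + 1 output
print(io_weight(old_ports, new_ports))
```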
Example
[Figure, built up over several slides: the example DFG with each candidate growth direction annotated with its summed weight (values range from 28.5 to 40). The candidate then grows one node at a time, absorbing &, ^, +, and << nodes.]
• Finished – Met External Constraints
Set of Candidates
[Figure: the candidate subgraphs recorded at each growth step, ranging from two-node candidates to larger ones combining ^, <<, &, and + operations]
Avoids Exponential Explosion
[Figure: number of candidates (K, log scale from 10 to 100000) versus cost constraint in adders (1 to 5) for the Intelligent and Exponential searches, with Performance plotted as speedup (1.00 to 1.50) on a second axis]
Greedy Selection Heuristic
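Candidate table before a selection:
Subgraph Number    Value    Cost    Ops
1                  20       4       (3,4),(6,8)
2                  6        1       (1,3,7)
…                  …        …       …
N                  9        5       (1,7)
Candidate table after subgraph 2 is selected (occurrences that overlap its ops 1, 3, and 7 are dropped and values are updated):
Subgraph Number    Value    Cost    Ops
1                  10       4       (6,8)
2                  6        1       (1,3,7)
…                  …        …       …
N                  0        5
• Use estimates of performance improvement / cost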
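A minimal sketch of a greedy value/cost selection over the table on this slide. The `gain_per_occurrence` field is an assumption (the slide only lists total value), as is the rule that an occurrence is lost when it shares any operation with a selected candidate; the budget parameter is likewise illustrative.

```python
def greedy_select(candidates, budget):
    """Repeatedly pick the affordable candidate with the best value/cost ratio."""
    # Work on copies so the caller's table is left untouched.
    table = {
        name: {"cost": c["cost"],
               "gain": c["gain_per_occurrence"],
               "occurrences": [set(o) for o in c["occurrences"]]}
        for name, c in candidates.items()
    }

    def ratio(name):
        entry = table[name]
        value = entry["gain"] * len(entry["occurrences"])
        return value / entry["cost"]

    chosen = []
    while True:
        affordable = [n for n in table
                      if n not in chosen
                      and table[n]["cost"] <= budget
                      and table[n]["occurrences"]]
        if not affordable:
            break
        best = max(affordable, key=ratio)
        chosen.append(best)
        budget -= table[best]["cost"]
        covered = set().union(*table[best]["occurrences"])
        # Other candidates lose every occurrence that overlaps the covered ops.
        for name, entry in table.items():
            if name != best:
                entry["occurrences"] = [o for o in entry["occurrences"]
                                        if not (o & covered)]
    return chosen

# Rows 1, 2, and N from the slide's table.
candidates = {
    "1": {"cost": 4, "gain_per_occurrence": 10, "occurrences": [(3, 4), (6, 8)]},
    "2": {"cost": 1, "gain_per_occurrence": 6,  "occurrences": [(1, 3, 7)]},
    "N": {"cost": 5, "gain_per_occurrence": 9,  "occurrences": [(1, 7)]},
}
print(greedy_select(candidates, budget=5))
```

With these rows, subgraph 2 has the best ratio (6/1), and selecting it removes occurrence (3,4) of subgraph 1 and the only occurrence of subgraph N, matching the updated values of 10 and 0 in the second table.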
Compiler Replacement
• Multiple applications can utilize CFUs
• Vflib pattern matcher [Cor ’99]
[Figure: Instruction Synthesis produces a CFU Description that feeds the Compiler; a numbered DFG (nodes 1 to 6) is shown before and after the compiler replaces a matched subgraph with a single CFU node]
Experimental Setup
• Implemented in the Trimaran toolset
• Baseline machine: 1 Int, 1 Flt, 1 Br, 1 Mem/Cycle
  – CFUs use Int issue slot
• CFU latency/area generated as sum of each individual macrocell
  – Pipeline latches were added if CFU latency >1 clock cycle
  – 300 MHz clock assumed
  – No branch or memory instructions in CFUs
• Four application domains tested
  – Audio, Encryption, Image, Network
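A rough sketch of how a summed macrocell latency might translate into whole clock cycles and pipeline latches. It assumes a serial chain of operations, uses the per-operation cycle fractions from the earlier opcode table, and assumes one latch per cycle boundary; none of these details are confirmed by the slides.

```python
import math

# Fraction of a 300 MHz clock cycle per operation (from the opcode table;
# the "~0.00" shift entry is treated as 0.0).
CYCLES = {"+": 0.30, "&": 0.06, "<<": 0.0, ">>": 0.0, "^": 0.09}
CYCLE_NS = 1e3 / 300  # one clock period at 300 MHz, in nanoseconds

def cfu_cycles(chain_ops):
    """Whole cycles needed by a serial chain of macrocells (at least 1)."""
    total = round(sum(CYCLES[op] for op in chain_ops), 6)  # damp float noise
    return max(1, math.ceil(total))

def pipeline_latches(chain_ops):
    """Assumed: one latch per cycle boundary inside a multi-cycle CFU."""
    return cfu_cycles(chain_ops) - 1

def chain_delay_ns(chain_ops):
    """Combinational delay of the chain in nanoseconds."""
    return sum(CYCLES[op] for op in chain_ops) * CYCLE_NS

print(cfu_cycles(["+", "&", "^"]), pipeline_latches(["+", "&", "^"]))
print(cfu_cycles(["+"] * 4), pipeline_latches(["+"] * 4))
```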
Native Encryption Results
[Figure: speedup (1.0 to 2.0) versus cost budget in adders (0 to 16) for blowfish, rijndael, and sha]
Encryption Cross Compile
[Figure: speedup (1.0 to 2.0) versus cost budget in adders (0 to 16) for the pairs blowfish-rijndael, blowfish-sha, rijndael-blowfish, rijndael-sha, sha-blowfish, and sha-rijndael]
Generalizing CFUs
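• Subsumed (Multiple Paths)
• Wildcards (Multiple Nodes)
[Figure: a CFU built from >>, |, and + with inputs IN_1 and IN_2 and constants 0x8 and 0xF, shown in three forms: the original, a subsumed form whose constants are annotated "0x8, 0x0" and "0xF, 0x0", and a wildcard form whose nodes are annotated "|, &" and "+, -"]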
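A small sketch of wildcard matching, in which each CFU node accepts a set of opcodes instead of exactly one. The nested-tuple encoding, the shape of the CFU, and the `matches` helper are all hypothetical; this is not Vflib and it ignores operand commutativity.

```python
def matches(pattern, expr):
    """Return True if expr (nested tuples) fits the wildcard pattern.

    A pattern is None (any operand) or (allowed_ops, sub_pattern, ...).
    An expression is a leaf value or (op, operand, ...).
    """
    if pattern is None:
        return True
    if not isinstance(expr, tuple):
        return False  # pattern expects an operation but found a leaf
    allowed_ops, *sub_patterns = pattern
    op, *operands = expr
    return (op in allowed_ops
            and len(operands) == len(sub_patterns)
            and all(matches(p, e) for p, e in zip(sub_patterns, operands)))

# Hypothetical CFU: shift, then | or &, then + or -.
cfu = ({"+", "-"},
       ({"|", "&"}, ({">>"}, None, None), None),
       None)

expr_or_add = ("+", ("|", (">>", "a", 8), 0xF), "b")
expr_and_sub = ("-", ("&", (">>", "a", 8), 0xF), "b")
expr_xor_add = ("+", ("^", (">>", "a", 8), 0xF), "b")

print(matches(cfu, expr_or_add))   # original opcodes
print(matches(cfu, expr_and_sub))  # wildcard alternatives
print(matches(cfu, expr_xor_add))  # ^ is not in the allowed set
```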
Effects of Generalization
[Figure: speedup (1.0 to 2.0) for blowfish, bfish-rijn, bfish-sha, rijndael, rijn-bfish, rijn-sha, sha, sha-bfish, and sha-rijn, with series CFUs and Subsumed Subgraphs]
Conclusions
• Developed two-phase instruction set synthesis system
  – Guide function removes bad candidates
  – Greedy selection heuristic
• Substantial speedups can be attained with very little die impact
• Subsumed subgraphs and wildcarding increase cross-application effectiveness
Domain          Encryption    Network    Image    Audio
Ave. Speedup    1.61          1.38       1.16     1.66
Questions?
http://cccp.eecs.umich.edu
Individual Factors - Blowfish
[Figure: speedup (1.0 to 1.7) over an x-axis from 0 to 16, with one series each for IO, Latency, Area, Criticality, and All]
Individual Factors - Djpeg
[Figure: speedup (1.0 to 1.25) over an x-axis from 0 to 16, with one series each for IO, Latency, Area, Criticality, and All]