1. Automatically Generating Custom Instruction Set Extensions
Nathan Clark, Wilkin Tang, Scott Mahlke
Workshop on Application Specific Processors
2. Problem Statement
- There is a demand for high-performance, low-power special-purpose systems, e.g. cell phones, network routers, PDAs
- One way to achieve these goals is to augment a general-purpose processor with Custom Function Units (CFUs), which combine several primitive operations
- We propose an automated method for CFU generation
3. System Overview
4. Example
[Figure: dataflow graph with eight numbered operations]
Potential CFUs: {1,3}, {2,4}, {2,6}, {3,4}, {4,5}, {5,8}, {6,7}, {7,8}
5. Example (continued)
[Figure: the same dataflow graph]
Potential CFUs: {1,3}, {2,4}, {2,6}, …, {1,3,4}, {2,4,5}, {2,6,7}, …
6. Example (continued)
[Figure: the same dataflow graph]
Potential CFUs: {1,3}, {2,4}, {2,6}, …, {1,3,4,5}, {2,4,5,8}, {2,6,7,8}, …, {1,3,4,5,8}
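The growth of the candidate list across these slides — pairs, then triples, then larger connected subgraphs — can be sketched as repeatedly extending each pattern by one adjacent operation. A minimal sketch in C, assuming the example DFG's edges are exactly the candidate pairs listed above (node numbers come from the slide; the data structures and everything else are illustrative, not the paper's implementation):

```c
/* Edges of the example DFG, assumed from the candidate pairs above. */
static const int edges[][2] = {
    {1,3}, {2,4}, {2,6}, {3,4}, {4,5}, {5,8}, {6,7}, {7,8}
};
enum { NEDGES = sizeof edges / sizeof edges[0] };

/* Extend every pattern (a bitmask of node numbers) by one adjacent node,
   deduplicating; returns the number of new, larger patterns produced. */
static int grow(const unsigned *in, int n, unsigned *out)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        for (int e = 0; e < NEDGES; e++) {
            unsigned a = 1u << edges[e][0], b = 1u << edges[e][1];
            unsigned ext = 0;
            if ((in[i] & a) && !(in[i] & b))
                ext = in[i] | b;          /* extend across the edge */
            else if ((in[i] & b) && !(in[i] & a))
                ext = in[i] | a;
            if (!ext)
                continue;                 /* edge fully inside or outside */
            int dup = 0;
            for (int j = 0; j < m; j++)
                if (out[j] == ext) dup = 1;
            if (!dup)
                out[m++] = ext;
        }
    }
    return m;
}
```

Seeding with the eight pairs and calling `grow` once produces the connected triples, among them the {1,3,4}, {2,4,5}, and {2,6,7} shown on the slide; calling it again yields four-node patterns such as {1,3,4,5} and {2,4,5,8}.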
7. Characterization
- Use the macro library to get information on each potential CFU
- Latency is the sum of each primitive's latency
- Area is the sum of each primitive's macrocell area
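A sketch of that characterization step. The latency and area numbers would come from the macro library, which we do not have here, so the values used below are illustrative only; summing latencies implicitly treats the primitives as a serial chain, which is the conservative estimate the slide describes:

```c
/* One primitive operation as described by a macro library entry.
   The struct layout is an assumption for illustration. */
struct prim {
    const char *op;
    double latency;  /* cycles */
    double area;     /* macrocell area, arbitrary units */
};

/* Characterize a candidate CFU: estimate its latency and area as the
   sums over its constituent primitives. */
static void characterize(const struct prim *ops, int n,
                         double *latency, double *area)
{
    *latency = 0.0;
    *area = 0.0;
    for (int i = 0; i < n; i++) {
        *latency += ops[i].latency;
        *area    += ops[i].area;
    }
}
```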
8. Issues We Consider
- Performance: is the pattern on the critical path; cycles saved
- Cost:
  - CFU area
  - Control logic (difficult to measure)
  - Decode logic (difficult to measure)
  - Register file area (can be amortized)
[Figure: example dataflow graph (LD, ADD, ADD, AND, ASL, XOR, BR) annotated with per-primitive latencies of 1, 0.6, and 0.1 cycles]
9. More Issues to Consider
- IO: number of input and output operands
- Usability: how well the compiler can use the pattern
[Figure: example pattern of OR, LSL, AND, and CMPP operations]
10. Selection
- Currently use a greedy algorithm: pick the best performance gain / area first
- Can yield bad selections
[Figure: the same OR, LSL, AND, CMPP pattern]
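The greedy step can be sketched as follows: repeatedly pick the remaining candidate with the best gain/area ratio that still fits an area budget. The budget and the exact ratio metric are assumptions on our part; the slide only says "best performance gain / area first":

```c
/* One candidate CFU, as produced by the characterization step. */
struct cand {
    double gain;   /* estimated cycles saved */
    double area;   /* estimated CFU area     */
    int picked;
};

/* Greedy selection: repeatedly take the unpicked candidate with the best
   gain/area ratio that still fits the remaining area budget.
   Returns the total gain of the selected set. */
static double select_greedy(struct cand *c, int n, double budget)
{
    double total = 0.0;
    for (;;) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (c[i].picked || c[i].area > budget)
                continue;
            if (best < 0 ||
                c[i].gain / c[i].area > c[best].gain / c[best].area)
                best = i;
        }
        if (best < 0)
            break;                 /* nothing left that fits */
        c[best].picked = 1;
        budget -= c[best].area;
        total  += c[best].gain;
    }
    return total;
}
```

This also shows why greedy "can yield bad selections": with a budget of 10 and candidates (gain 6, area 4) and (gain 10, area 8), the greedy pass takes the first (ratio 1.5), then cannot fit the second, saving 6 cycles where selecting only the second would have saved 10.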
11. Case Study 1: Blowfish
- Speedup: 1.24; 10 cycles can be compressed down to 2!
- Cost: ~6 adders; 6 inputs, 2 outputs
- C code this DFG came from:
  r ^= (((s[(t>>24)] + s[0x0100+((t>>16)&0xff)]) ^ s[0x0200+((t>>8)&0xff)]) + s[0x0300+(t&0xff)]) & 0xffffffff;
[Figure: the extracted dataflow graph (ADD, XOR, ADD, AND, XOR, LSR, AND, ADD, LSL, ADD) with its register and constant operands]
12. Case Study 2: ADPCM Decode
- Speedup: 1.20; 3 cycles can be compressed down to 1
- Cost: ~1.5 adders; 2 inputs, 2 outputs
- C code this DFG came from:
  d = d & 7;
  if ( d & 4 ) { … }
[Figure: the extracted dataflow graph (AND, AND, CMPP) over register r16 and constants 7, 4, 0]
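The fused behaviour of that three-op pattern (AND, AND, CMPP) can be written out in C: the slide reports a CFU with 2 inputs and 2 outputs that produces both the masked value and the branch predicate in a single cycle. The function name below is hypothetical; the real CFU is synthesized hardware, not a C routine:

```c
/* Semantics of the fused AND/AND/CMPP pattern from the slide: one CFU
   computes d & 7 and the branch predicate together, replacing a 3-cycle
   sequence with a single operation.
   (cfu_and_and_cmpp is an illustrative name, not from the paper.) */
static void cfu_and_and_cmpp(unsigned d, unsigned *masked, int *pred)
{
    *masked = d & 7;               /* AND:        d = d & 7       */
    *pred   = (*masked & 4) != 0;  /* AND + CMPP: if ( d & 4 ) …  */
}
```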
13. Experimental Setup
- CFU recognition implemented in the Trimaran research infrastructure
- Speedup shown is with CFUs relative to a baseline machine: four-wide VLIW with predication; can issue at most 1 Int, Flt, Mem, Brn inst./cyc.; 300 MHz clock
- CFU latency is estimated using standard cells from Synopsys' design library
14. Varying the Number of CFUs
- More CFUs yields more performance
- A weakness in our selection algorithm causes plateaus
[Chart: adpcm-decode — speedup (1 to 2.2) and additional cost (0 to 80) vs. number of function units (0 to 20)]
15. Varying the Number of Ops
- Bigger CFUs yield better performance
- If they're too big, they can't be used as often and they expose alternate critical paths
[Chart: blowfish — speedup (1 to 2) and additional cost (0 to 80) vs. max number of ops/CFU (0 to 20)]
16. Related Work
- Many people have done this for code size: Bose et al., Liao et al.
- Typically done with traces: Arnold et al.
- A previous paper used a more enumerative discovery algorithm
- We are unique because of our compiler-based approach and novel analysis of CFUs
17. Conclusion and Future Work
- CFUs have the potential to offer big performance gains for small cost
- Recognize more complex subgraphs: generalized acyclic/cyclic subgraphs
- Develop our system to automatically synthesize application-tailored coprocessors