University of Michigan, Electrical Engineering and Computer Science

Processor Acceleration Through Automated Instruction Set Customization

Nathan Clark, Hongtao Zhong, Scott Mahlke

Advanced Computer Architecture Lab

University of Michigan, Ann Arbor

December 3, 2003

Motivation

• Cell phones, PDAs, digital cameras, etc. are everywhere
  – High performance yet low power design point

• General core + ASIC solution
  – Limited post-programmability

• General core + application specific instructions (CFUs)

[Figure: block diagrams contrasting a CPU paired with an ASIC and a CPU containing a CFU]

What is a CFU?

• Combine multiple primitive operations
  – Smaller code size, fewer RF reads
  – Increases performance

[Figure: dataflow graph of primitive operations (&, |, <<, ^, *, +) in which two subgraphs are collapsed into CFU 1 (+, ^) and CFU 2 (&, <<, |); nodes are annotated with the numbers 1 and 2]
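A minimal sketch of why fusing operations reduces code size and register-file (RF) reads. It assumes CFU 1 computes `(a + b) ^ c`, which is only one reading of the "+ ^" label; the register names and the `cfu1` opcode are made up for illustration.

```python
from typing import List, Tuple

# (opcode, source registers, destination register); register names are made up.
Instr = Tuple[str, Tuple[str, ...], str]

def rf_reads(instrs: List[Instr]) -> int:
    """Total register-file reads issued by a sequence of instructions."""
    return sum(len(srcs) for _, srcs, _ in instrs)

# r4 = (r1 + r2) ^ r3 as two primitive operations; t0 is a temporary register.
primitive: List[Instr] = [
    ("+", ("r1", "r2"), "t0"),
    ("^", ("t0", "r3"), "r4"),
]

# The same computation as one hypothetical CFU instruction.
fused: List[Instr] = [("cfu1", ("r1", "r2", "r3"), "r4")]

print(len(primitive), rf_reads(primitive))  # 2 4
print(len(fused), rf_reads(fused))          # 1 3
```

The intermediate value no longer travels through the register file, which is where the saved read (and the saved write of `t0`) comes from.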

Automation is Key

• This is ¼ of the DFG for a single basic block of blowfish

[Figure: portion of the blowfish DFG; visible node labels include 159 XOR, 164 SHR, and 173 AND]

Related Work

• Tensilica Xtensa
  – Commercial example
  – MIPS core + manually constructed CFU

• Automatic instruction set synthesis is a mature field
  – See paper for comparison of techniques

• Our contributions
  – Novel technique for automatic CFU creation
  – System to utilize CFUs in multiple applications
  – Analysis of how effectively CFUs for one application apply to other applications in the same domain

System Overview

• Synthesis
  – Subgraph identification
    • Discover candidates for CFUs
    • Weed out what shouldn’t be picked
  – Selection
    • Determine which candidates to use as CFUs

• Compilation
  – Subgraph replacement
    • Make use of the CFUs in a range of applications

Subgraph Identification

• Grow subgraphs from seed nodes
  – All nodes are seeds
  – Most directions don’t make sense

• How to decide where to grow?
  – Make decisions using factors similar to those an architect would weigh
  – Take 4 factors into consideration
    • Criticality, Latency, Area, Input/Output

• Sum of these factors determines value of each direction
  – NOT picking CFUs

[Figure: example DFG (%, ^, <<, +, *, &, |) next to the list of CFU candidates grown so far, built up from & and << with a + node added last]
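The "sum of these factors" step can be sketched as below. This is an illustration only: the names `Factors`, `direction_value`, and `worthwhile_directions` are hypothetical, the cutoff `threshold` is an assumption (the slides do not say how a direction's value is turned into a grow/don't-grow decision), and the numbers in the example are invented.

```python
from typing import Dict, List, NamedTuple

class Factors(NamedTuple):
    criticality: float
    latency: float
    area: float
    io: float

def direction_value(f: Factors) -> float:
    """Value of one growth direction: the plain sum of the four factor weights."""
    return f.criticality + f.latency + f.area + f.io

def worthwhile_directions(candidates: Dict[str, Factors], threshold: float) -> List[str]:
    """Directions whose value meets the (assumed) threshold, best first."""
    scored = {name: direction_value(f) for name, f in candidates.items()}
    keep = [name for name, value in scored.items() if value >= threshold]
    return sorted(keep, key=lambda name: scored[name], reverse=True)

# Invented weights for three growth directions out of one seed node.
directions = {
    "toward_a": Factors(10.0, 5.0, 3.33, 6.67),
    "toward_b": Factors(3.33, 8.33, 10.0, 4.0),
    "toward_c": Factors(1.0, 1.0, 1.0, 1.0),
}
print(worthwhile_directions(directions, threshold=20.0))  # ['toward_b', 'toward_a']
```

This step only prunes where a candidate subgraph may grow; choosing which candidates become CFUs is a separate selection phase.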

Critical Path

• Combining operations on the critical path will shrink the longer dependence chains
  – Maximize potential performance gain

• $Wt = \frac{10}{\text{slack} + 1}$
  – Slack is # cycles off longest dependence path

[Figure: example DFG (^, &, >>, +, <<) with two growth directions annotated 10/(0+1) = 10 and 10/(2+1) = 3.33]
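The two worked values on this slide (10 and 3.33) are consistent with a weight of 10 / (slack + 1). A minimal sketch of that reading follows; the function name `criticality_weight` is illustrative and not taken from the authors' tool.

```python
def criticality_weight(slack_cycles: int) -> float:
    """Weight for growing toward a node that is `slack_cycles` off the longest dependence path."""
    if slack_cycles < 0:
        raise ValueError("slack cannot be negative")
    return 10.0 / (slack_cycles + 1)

print(round(criticality_weight(0), 2))  # 10.0
print(round(criticality_weight(2), 2))  # 3.33
```

A node on the critical path (slack 0) gets the full weight of 10, and the weight falls off as the node moves away from the longest chain.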

Latency

• Growing toward low latency operations allows combination of more nodes in a cycle
  – Maximize DFG compression

• $Wt = 10 \cdot \frac{\text{old\_latency}}{\text{new\_latency}}$

[Figure: example DFG (^, &, >>, +, <<) with two growth directions annotated 10*0.3 / 0.6 = 5 and 10*0.3 / 0.36 = 8.33]

Opcode    Area    Cycles
+         1.00    0.30
&         0.12    0.06
<<, >>    0.01    ~0.00
^         0.16    0.09
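The worked values on this slide fit a weight of 10 * old latency / new latency, where the new latency is the old one plus the latency of the node being added. The sketch below assumes that serial-addition reading and treats the "~0.00" shift latency as 0.0; the function names are illustrative.

```python
# Per-opcode latency in fractions of a clock cycle, copied from the slide's table.
# "<<" and ">>" are listed as ~0.00 and are taken as 0.0 here (an assumption).
CYCLES = {"+": 0.30, "&": 0.06, "<<": 0.0, ">>": 0.0, "^": 0.09}

def latency_weight(old_latency: float, new_latency: float) -> float:
    """Weight of a growth direction given the candidate's latency before and after growing."""
    return 10.0 * old_latency / new_latency

def grown_latency(old_latency: float, opcode: str) -> float:
    """Latency after appending one node in series (assumed model)."""
    return old_latency + CYCLES[opcode]

seed = CYCLES["+"]  # a candidate consisting of a single adder
print(round(latency_weight(seed, grown_latency(seed, "+")), 2))  # 5.0
print(round(latency_weight(seed, grown_latency(seed, "&")), 2))  # 8.33
```

Growing toward the cheap AND keeps the weight high, while growing toward a second adder doubles the latency and halves the weight.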

Area

• Want the most benefit for the least area

• $Wt = 10 \cdot \frac{\text{old\_area}}{\text{new\_area}}$

• Area is the sum of macrocell areas

[Figure: example DFG (^, &, >>, +, <<) with two growth directions annotated 10*0.5/0.5 = 10 and 10*0.5/1.5 = 3.33]

Opcode    Area    Cycles
+         1.00    0.30
&         0.12    0.06
<<, >>    0.01    ~0.00
^         0.16    0.09
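The slide's two worked values fit a weight of 10 * old area / new area, with a candidate's area taken as the sum of its macrocell areas. A small sketch under that reading; `subgraph_area` and `area_weight` are illustrative names.

```python
from typing import List

# Relative macrocell areas copied from the slide's table (an adder is 1.00).
AREA = {"+": 1.00, "&": 0.12, "<<": 0.01, ">>": 0.01, "^": 0.16}

def subgraph_area(opcodes: List[str]) -> float:
    """Area of a candidate: the sum of its macrocell areas."""
    return sum(AREA[op] for op in opcodes)

def area_weight(old_area: float, new_area: float) -> float:
    """Weight of a growth direction given the candidate's area before and after growing."""
    return 10.0 * old_area / new_area

# The slide's two worked values.
print(round(area_weight(0.5, 0.5), 2))  # 10.0
print(round(area_weight(0.5, 1.5), 2))  # 3.33

# Growing a candidate made of "&" and "^" toward an adder, using the table.
old = subgraph_area(["&", "^"])
new = subgraph_area(["&", "^", "+"])
print(round(area_weight(old, new), 2))
```

Adding an adder to a candidate built from cheap logic operations multiplies its area several times over, so that direction scores poorly on this factor.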

Input/Output

• Want CFUs to use as few RF ports as possible
  – Smaller encoding
  – Allow growth of larger candidates

• $Wt = \min\left(10 \cdot \frac{\#\text{old\_ports}}{\#\text{new\_ports} + 1},\ 10\right)$

[Figure: example DFG (^, &, >>, +, <<) with two growth directions annotated 10*2/(2+1) = 6.67 and 10*2/(4+1) = 4]
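The formula fragments on this slide, together with its two worked values, suggest min(10 * old ports / (new ports + 1), 10). A minimal sketch of that reading; `io_weight` is an illustrative name.

```python
def io_weight(old_ports: int, new_ports: int) -> float:
    """Weight of a growth direction given the RF ports used before and after growing."""
    return min(10.0 * old_ports / (new_ports + 1), 10.0)

print(round(io_weight(2, 2), 2))  # 6.67
print(round(io_weight(2, 4), 2))  # 4.0
```

The cap at 10 keeps this factor on the same scale as the other three when growing actually reduces the number of ports.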

Example

[Figure sequence (11 slides): a candidate subgraph is grown one node at a time across the example DFG (^, &, >>, +, <<). The first three steps label the possible growth directions with summed weights: 35, 28.5, 37.5, and 30.8; then 35, 28.5, 33.5, 30.8, and 40; then 35, 28.5, 36, and 30.8. The remaining steps show nodes (&, ^, +, <<, <<) joining the candidate in turn.]

Finished – Met External Constraints

Set of Candidates

[Figure: the candidate subgraphs recorded during growth, from a single ^ node up to subgraphs combining ^, <<, &, and + nodes]

Avoids Exponential Explosion

[Figure: number of candidates (K, log scale from 10 to 100000) and speedup (1.00 to 1.50) versus cost constraint (1 to 5 adders); series: Intelligent, Exponential, Performance]

Greedy Selection Heuristic

Before a selection:

Subgraph Number    Value    Cost    Ops
1                  20       4       (3,4),(6,8)
2                  6        1       (1,3,7)
…                  …        …       …
N                  9        5       (1,7)

After subgraph 2 is selected (occurrences that share its ops are dropped from the other candidates):

Subgraph Number    Value    Cost    Ops
1                  10       4       (6,8)
2                  6        1       (1,3,7)
…                  …        …       …
N                  0        5

• Use estimates of performance improvement / cost
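One way to read the two tables is as a greedy loop: pick the candidate with the best value/cost ratio, then discard every occurrence of the other candidates that overlaps the ops just covered. The sketch below follows that reading. The `Candidate` class, the `budget` parameter, and the split of subgraph 1's value into 10 per occurrence are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Candidate:
    cost: float  # assumed to be positive
    # Each occurrence: (ids of the ops it covers, estimated gain if replaced).
    occurrences: List[Tuple[Tuple[int, ...], float]] = field(default_factory=list)

    @property
    def value(self) -> float:
        return sum(gain for _, gain in self.occurrences)

def greedy_select(candidates: Dict[str, Candidate], budget: float) -> List[str]:
    """Pick candidates by value/cost until nothing useful fits; updates candidates in place."""
    chosen: List[str] = []
    remaining = dict(candidates)
    while True:
        affordable = {n: c for n, c in remaining.items()
                      if c.cost <= budget and c.value > 0}
        if not affordable:
            return chosen
        best = max(affordable, key=lambda n: affordable[n].value / affordable[n].cost)
        picked = remaining.pop(best)
        chosen.append(best)
        budget -= picked.cost
        covered = {op for ops, _ in picked.occurrences for op in ops}
        # Occurrences that share an op with the pick can no longer be replaced.
        for c in remaining.values():
            c.occurrences = [(ops, g) for ops, g in c.occurrences
                             if covered.isdisjoint(ops)]

table = {
    "1": Candidate(4, [((3, 4), 10.0), ((6, 8), 10.0)]),
    "2": Candidate(1, [((1, 3, 7), 6.0)]),
    "N": Candidate(5, [((1, 7), 9.0)]),
}
print(greedy_select(table, budget=5))      # ['2', '1']
print(table["1"].value, table["N"].value)  # 10.0 0
```

With these numbers subgraph 2 wins the first round (6/1 beats 20/4 and 9/5), after which subgraph 1 is worth 10 and subgraph N is worth 0, matching the second table.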

Compiler Replacement

• Multiple applications can utilize CFUs

• Vflib pattern matcher [Cor ’99]

[Figure: flow from Instruction Synthesis through a CFU Description to the Compiler; a graph with nodes numbered 1 to 6 is shown before and after a CFU node replaces part of it]

Experimental Setup

• Implemented in the Trimaran toolset

• Baseline machine: 1 Int, 1 Flt, 1 Br, 1 Mem/Cycle
  – CFUs use Int issue slot

• CFU latency/area generated as sum of each individual macrocell
  – Pipeline latches were added if CFU latency > 1 clock cycle
  – 300 MHz clock assumed
  – No branch or memory instructions in CFUs

• Four application domains tested
  – Audio, Encryption, Image, Network
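A rough sketch of the CFU cost model described above, taking "sum of each individual macrocell" literally for both area and latency. The per-opcode numbers are the ones from the macrocell table shown on the Latency and Area slides; rounding the latency up to whole cycles to count pipeline stages is an assumption, as is the name `cfu_estimate`.

```python
import math
from typing import List, Tuple

# Macrocell area (relative to an adder) and latency (fraction of a clock cycle).
# "~0.00" for shifts is taken as 0.0 (an assumption).
AREA = {"+": 1.00, "&": 0.12, "<<": 0.01, ">>": 0.01, "^": 0.16}
CYCLES = {"+": 0.30, "&": 0.06, "<<": 0.0, ">>": 0.0, "^": 0.09}

def cfu_estimate(opcodes: List[str]) -> Tuple[float, float, int]:
    """Return (area, latency in cycles, pipeline stages) for a CFU built from `opcodes`."""
    area = sum(AREA[op] for op in opcodes)
    latency = sum(CYCLES[op] for op in opcodes)
    # A CFU longer than one clock cycle is split into whole-cycle stages by latches.
    stages = max(1, math.ceil(latency))
    return area, latency, stages

print(cfu_estimate(["&", "<<", "^"])[2])      # 1
print(cfu_estimate(["+", "+", "+", "+"])[2])  # 2
```

Summing every macrocell's latency treats the operations as one serial chain, so for a CFU with parallel branches this sketch would overestimate the latency.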

Native Encryption Results

[Figure: speedup (1.0 to 2.0) versus cost budget (0 to 16 adders) for blowfish, rijndael, and sha]

Encryption Cross Compile

[Figure: speedup (1.0 to 2.0) versus cost budget (0 to 16 adders) for the pairs blowfish-rijndael, blowfish-sha, rijndael-blowfish, rijndael-sha, sha-blowfish, and sha-rijndael]

Generalizing CFUs

• Subsumed (Multiple Paths)

• Wildcards (Multiple Nodes)

[Figure: a CFU built from >>, |, and + nodes with inputs IN_1 and IN_2 and constants 0x8 and 0xF, shown three times: as synthesized; with each constant also allowed to be 0x0 (subsumed); and with | also allowed to be & and + also allowed to be - (wildcards)]
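The two generalizations can be illustrated in a few lines. The wiring assumed here, `((IN_1 >> 0x8) | 0xF) + IN_2`, is a guess at the figure's datapath, and the function `cfu` with its keyword parameters is hypothetical. The idea it shows is the one on the slide: identity constants (0x0) let the same hardware execute smaller, subsumed subgraphs, and wildcards let a node position perform more than one opcode.

```python
import operator

def cfu(in_1, in_2, shift=0x8, const=0xF, logic=operator.or_, arith=operator.add):
    """Hypothetical datapath: arith(logic(in_1 >> shift, const), in_2)."""
    return arith(logic(in_1 >> shift, const), in_2)

x, y = 0x1234, 5

# Pattern the CFU was synthesized for: ((x >> 0x8) | 0xF) + y.
assert cfu(x, y) == ((x >> 0x8) | 0xF) + y

# Subsumed subgraphs: an identity constant (0x0) turns a node into a pass-through.
assert cfu(x, y, const=0x0) == (x >> 0x8) + y  # OR bypassed
assert cfu(x, y, shift=0x0) == (x | 0xF) + y   # shift bypassed

# Wildcards: the same node position may perform a different opcode.
assert cfu(x, y, logic=operator.and_) == ((x >> 0x8) & 0xF) + y
assert cfu(x, y, arith=operator.sub) == ((x >> 0x8) | 0xF) - y
```

Both tricks widen the set of DFG patterns one CFU can cover, which is what matters when a CFU built for one application is reused in another.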

Effects of Generalization

[Figure: bar chart of speedup (1.0 to 2.0) for blowfish, bfish-rijn, bfish-sha, rijndael, rijn-bfish, rijn-sha, sha, sha-bfish, and sha-rijn; series: CFUs, Subsumed Subgraphs]

Conclusions

• Developed two phase instruction set synthesis system
  – Guide function removes bad candidates
  – Greedy selection heuristic

• Substantial speedups can be attained with very little die impact

• Subsumed subgraphs and wildcarding increase cross-application effectiveness

Domain          Encryption    Network    Image    Audio
Ave. Speedup    1.61          1.38       1.16     1.66


Questions?

http://cccp.eecs.umich.edu


Backup slides

Individual Factors - Blowfish

[Figure: plot with y-axis from 1 to 1.7 and x-axis from 0 to 16; series: IO, Latency, Area, Criticality, All]

Individual Factors - Djpeg

[Figure: plot with y-axis from 1 to 1.25 and x-axis from 0 to 16; series: IO, Latency, Area, Criticality, All]

Selection

• Uses estimates of performance improvement

• Greedy Heuristic used

[Figure: the example DFG (^, &, >>, +, <<)]