56
1 University of Michigan Electrical Engineering and Computer Science Uncovering Hidden Loop Level Parallelism in Sequential Applications Hongtao Zhong, Mojtaba Mehrara, Steve Lieberman, Scott Mahlke Advanced Computer Architecture Lab. University of Michigan

Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

1 University of MichiganElectrical Engineering and Computer Science

Uncovering Hidden Loop Level Parallelism in Sequential Applications

Hongtao Zhong, Mojtaba Mehrara, Steve Lieberman,

Scott Mahlke

Advanced Computer Architecture Lab.University of Michigan

Page 2: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

2 University of MichiganElectrical Engineering and Computer Science

CMP Architectures• Multiple cores on a chip

– Higher throughput– Reduced complexity (per core)– More power/heat friendly

• Multithreaded applications

Inte

l Cor

e 2

Duo

AMD

Quad

-cor

e (B

arce

lona

)

Sun Niagara 2

Page 3: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

3 University of MichiganElectrical Engineering and Computer Science

How About Single Thread?

[Source : Bridges et al, MICRO `07]

Page 4: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

4 University of MichiganElectrical Engineering and Computer Science

Loop Level Parallelization

i = 0-39

DOALL loop

Page 5: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

5 University of MichiganElectrical Engineering and Computer Science

Loop Level Parallelization

i = 0-39i = 20-39i = 0-19

Core 1Core 0

DOALL loop

Page 6: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

6 University of MichiganElectrical Engineering and Computer Science

Loop Level Parallelization

i = 0-39

Speculative DOALL loop

Page 7: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

7 University of MichiganElectrical Engineering and Computer Science

Loop Level Parallelization

i = 0-39

i = 10-19

i = 30-39

i = 0-9

i = 20-29

Core 1Core 0

Loop Chunk

Speculative DOALL loop

Page 8: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

8 University of MichiganElectrical Engineering and Computer Science

Loop Level Parallelization

i = 0-39

i = 10-19

i = 30-39

i = 0-9

i = 20-29

Core 1Core 0

Loop Chunk

Bad news: limited number of parallel loops in general purpose applications

–1.3x speedup for SpecINT2000 on 4 cores

Speculative DOALL loop

Page 9: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

9 University of MichiganElectrical Engineering and Computer Science

Contributions

• Code generation framework– Speculative parallelization of

uncounted loops

• Compiler transformations – Speculative loop fission– Isolation of infrequent dependences– Speculative prematerialization

Initialization

Consolidation

Abort Handler

for(i=IS; i<IE; i++) { ...... if (brk_cond) local_brk_flag = 1; break;}

XBEGIN

if (global_brk_flag) break;

perm = RECV(THREADj-1)XCOMMITif (local_brk_flag) global_brk_flag = 1; kill_other_threads;elseif (IE < n) SEND(perm,THREADj+1)

IS = ...; IE = ...;

Spawn

Page 10: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

10 University of MichiganElectrical Engineering and Computer Science

Target Architecture

L2 cache

L2 cache

Core 0 Core 1

Core 2 Core 3

Page 11: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

11 University of MichiganElectrical Engineering and Computer Science

Target Architecture

L2 cache

L2 cache

Core 0 Core 1

Core 2 Core 3

Scalar operand network

Page 12: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

12 University of MichiganElectrical Engineering and Computer Science

Target Architecture

L2 cache

L2 cache

Core 0 Core 1

Core 2 Core 3Hardware transactional

memory

Scalar operand network

Page 13: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

13 University of MichiganElectrical Engineering and Computer Science

Code Generation Framework

for (i=0;i<n;i++)// original loop code

Page 14: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

14 University of MichiganElectrical Engineering and Computer Science

Code Generation Framework

while (...)IS+=...; IE+=...;XBEGIN

XCOMMIT

for (i=IS;i<IE;i++)// original loop code

Page 15: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

15 University of MichiganElectrical Engineering and Computer Science

RECV(THREADj-1)XCOMMITSEND(THREADj+1)

Code Generation Framework

while (...)IS+=...; IE+=...;XBEGIN

for (i=IS;i<IE;i++)// original loop code

Page 16: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

16 University of MichiganElectrical Engineering and Computer Science

RECV(THREADj-1)XCOMMITSEND(THREADj+1)

Code Generation Framework

while (...)IS+=...; IE+=...;XBEGIN

for (i=IS;i<IE;i++)// original loop code

Spawn

Page 17: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

17 University of MichiganElectrical Engineering and Computer Science

RECV(THREADj-1)XCOMMITSEND(THREADj+1)

Code Generation Framework

while (...)IS+=...; IE+=...;XBEGIN

for (i=IS;i<IE;i++)// original loop code if (brkCond) break;

Spawn

Page 18: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

18 University of MichiganElectrical Engineering and Computer Science

for (i=IS;i<IE;i++)// original loop code if (brkCond)

localBrk=1; break;

RECV(THREADj-1)XCOMMITif (localBrk)globalBrk=1;abortOtherTXs;SEND(THREADj+1)

Code Generation Framework

while (...)IS+=...; IE+=...;XBEGINif (globalBrk) break;

Spawn

Page 19: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

19 University of MichiganElectrical Engineering and Computer Science

for (i=IS;i<IE;i++)// original loop code if (brkCond)

localBrk=1; break;

RECV(THREADj-1)XCOMMITif (localBrk)globalBrk=1;abortOtherTXs;SEND(THREADj+1)

Code Generation Framework

while (...)IS+=...; IE+=...;XBEGINif (globalBrk) break;

Consolidation

Spawn

Page 20: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

20 University of MichiganElectrical Engineering and Computer Science

Code Generation Framework• Supports counted and

uncounted loops– Software managed

control speculation• Iteration chunking• Enforce transaction

ordering• Handles livein, liveout &

accumulator registers

for (i=IS;i<IE;i++)// original loop code if (brkCond)

localBrk=1; break;

RECV(THREADj-1)XCOMMITif (localBrk)globalBrk=1;abortOtherTXs;SEND(THREADj+1)

while (...)IS+=...; IE+=...;XBEGINif (globalBrk) break;

Consolidation

Spawn

Page 21: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

21 University of MichiganElectrical Engineering and Computer Science

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

052.a

lvin

n056.e

ar171.s

wim

172.m

grid

177.m

esa

179.a

rt183.e

quak

e188.a

mm

p008.e

spre

sso

023.e

qnto

tt026.c

ompre

ss072.s

c099.g

o124.m

88ks

im129.c

ompre

ss130.li

132.ij

peg

164.g

zip

175.v

pr

181.m

cf197.p

arse

r256.b

zip2

300.t

wol

fcj

peg

djp

egep

icg721dec

ode

g721en

code

gsm

dec

ode

gsm

enco

de

mpeg

2dec

mpeg

2en

cpeg

witdec

peg

witen

cra

wca

udio

raw

dau

dio

unep

icgre

ple

xya

ccav

erag

e

SPEC FP SPEC INT Mediabench Utilities

Fra

ctio

n o

f se

qu

en

tial execu

tion Provable DOALL

DOALL Coverage – Provable and Profiled

Page 22: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

22 University of MichiganElectrical Engineering and Computer Science

DOALL Coverage – Provable and Profiled

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

052.a

lvin

n056.e

ar171.s

wim

172.m

grid

177.m

esa

179.a

rt183.e

quak

e188.a

mm

p008.e

spre

sso

023.e

qnto

tt026.c

ompre

ss072.s

c099.g

o124.m

88ks

im129.c

ompre

ss130.li

132.ij

peg

164.g

zip

175.v

pr

181.m

cf197.p

arse

r256.b

zip2

300.t

wol

fcj

peg

djp

egep

icg721dec

ode

g721en

code

gsm

dec

ode

gsm

enco

de

mpeg

2dec

mpeg

2en

cpeg

witdec

peg

witen

cra

wca

udio

raw

dau

dio

unep

icgre

ple

xya

ccav

erag

e

SPEC FP SPEC INT Mediabench Utilities

Fra

ctio

n o

f se

qu

en

tial execu

tion

Profiled DOALLProvable DOALL

Page 23: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

23 University of MichiganElectrical Engineering and Computer Science

DOALL Coverage – Provable and Profiled

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

052.a

lvin

n056.e

ar171.s

wim

172.m

grid

177.m

esa

179.a

rt183.e

quak

e188.a

mm

p008.e

spre

sso

023.e

qnto

tt026.c

ompre

ss072.s

c099.g

o124.m

88ks

im129.c

ompre

ss130.li

132.ij

peg

164.g

zip

175.v

pr

181.m

cf197.p

arse

r256.b

zip2

300.t

wol

fcj

peg

djp

egep

icg721dec

ode

g721en

code

gsm

dec

ode

gsm

enco

de

mpeg

2dec

mpeg

2en

cpeg

witdec

peg

witen

cra

wca

udio

raw

dau

dio

unep

icgre

ple

xya

ccav

erag

e

SPEC FP SPEC INT Mediabench Utilities

Fra

ctio

n o

f se

qu

en

tial execu

tion

Profiled DOALLProvable DOALL

Still not good enough!Few dependences hinder parallelization in many loops

Page 24: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

24 University of MichiganElectrical Engineering and Computer Science

DOALL Coverage – Provable and Profiled

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

052.a

lvin

n056.e

ar171.s

wim

172.m

grid

177.m

esa

179.a

rt183.e

quak

e188.a

mm

p008.e

spre

sso

023.e

qnto

tt026.c

ompre

ss072.s

c099.g

o124.m

88ks

im129.c

ompre

ss130.li

132.ij

peg

164.g

zip

175.v

pr

181.m

cf197.p

arse

r256.b

zip2

300.t

wol

fcj

peg

djp

egep

icg721dec

ode

g721en

code

gsm

dec

ode

gsm

enco

de

mpeg

2dec

mpeg

2en

cpeg

witdec

peg

witen

cra

wca

udio

raw

dau

dio

unep

icgre

ple

xya

ccav

erag

e

SPEC FP SPEC INT Mediabench Utilities

Fra

ctio

n o

f se

qu

en

tial execu

tion

Profiled DOALLProvable DOALL

Still not good enough!Few dependences hinder parallelization in many loops

Compiler can help:•Speculative fission•Isolation of infrequent paths•Speculative prematerialization

Page 25: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

25 University of MichiganElectrical Engineering and Computer Science

1: while (node) {2: work(node);3: node = node->next;

}

Speculative Loop Fission

Page 26: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

26 University of MichiganElectrical Engineering and Computer Science

1: while (node) {2: work(node);3: node = node->next;

}

Speculative Loop Fission1: while (node) {4: node_array[count++] = node;3: node = node->next;

}

Page 27: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

27 University of MichiganElectrical Engineering and Computer Science

1: while (node) {2: work(node);3: node = node->next;

}

Speculative Loop Fission1: while (node) {4: node_array[count++] = node;3: node = node->next;

}

XBEGIN5: node = node_array[IS];

i = 0;1':while (node && i++ < CS) {2: work(node);3': node = node->next;

}RECV(THREADj-1)XCOMMITSEND(THREADj+1)}

Page 28: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

28 University of MichiganElectrical Engineering and Computer Science

1: while (node) {2: work(node);3: node = node->next;

}

Speculative Loop Fission

XBEGIN5: node = node_array[IS];

i = 0;1':while (node && i++ < CS) {2: work(node);3': node = node->next;

}RECV(THREADj-1)XCOMMITif (node!= node_array[IS+CS]){

update_node_array;kill_other_threads();}

SEND(THREADj+1)}

1: while (node) {4: node_array[count++] = node;3: node = node->next;

}

Page 29: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

29 University of MichiganElectrical Engineering and Computer Science

1: while (node) {2: work(node);3: node = node->next;

}

Speculative Loop Fission

XBEGIN5: node = node_array[IS];

i = 0;1':while (node && i++ < CS) {2: work(node);3': node = node->next;

}RECV(THREADj-1)XCOMMITif (node!= node_array[IS+CS]){

update_node_array;kill_other_threads();}

SEND(THREADj+1)}

1: while (node) {4: node_array[count++] = node;3: node = node->next;

}

Page 30: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

30 University of MichiganElectrical Engineering and Computer Science

1: while (node) {2: work(node);3: node = node->next;

}

Speculative Loop Fission

XBEGIN5: node = node_array[IS];

i = 0;1':while (node && i++ < CS) {2: work(node);3': node = node->next;

}RECV(THREADj-1)XCOMMITif (node!= node_array[IS+CS]){

update_node_array;kill_other_threads();}

SEND(THREADj+1)}

1: while (node) {4: node_array[count++] = node;3: node = node->next;

}

Page 31: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

31 University of MichiganElectrical Engineering and Computer Science

1: while (node) {2: work(node);3: node = node->next;

}

Speculative Loop Fission

XBEGIN5: node = node_array[IS];

i = 0;1':while (node && i++ < CS) {2: work(node);3': node = node->next;

}RECV(THREADj-1)XCOMMITif (node!= node_array[IS+CS]){

update_node_array;kill_other_threads();}

SEND(THREADj+1)}

1: while (node) {4: node_array[count++] = node;3: node = node->next;

}

Page 32: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

32 University of MichiganElectrical Engineering and Computer Science

1: while (node) {2: work(node);3: node = node->next;

}

Speculative Loop Fission

XBEGIN5: node = node_array[IS];

i = 0;1':while (node && i++ < CS) {2: work(node);3': node = node->next;

}RECV(THREADj-1)XCOMMITif (node!= node_array[IS+CS]){

update_node_array;kill_other_threads();}

SEND(THREADj+1)}

1: while (node) {4: node_array[count++] = node;3: node = node->next;

}

Page 33: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

33 University of MichiganElectrical Engineering and Computer Science

Infrequent Dependence Isolation

1:

2:99%

1%

A

B

C

Page 34: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

34 University of MichiganElectrical Engineering and Computer Science

Infrequent Dependence Isolation

1:

2:

1:

2:99%

1%

A

B

C

A

B

C

Page 35: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

35 University of MichiganElectrical Engineering and Computer Science

Infrequent Dependence Isolation

1:

2:

1:

2:99%

1%

A

B

C

A

B

C’

C

Page 36: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

36 University of MichiganElectrical Engineering and Computer Science

Infrequent Dependence Isolation

1:

2:

1:

2:99%

1%break

A

B

C

A

C’

C

B

1%

99%

Page 37: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

37 University of MichiganElectrical Engineering and Computer Science

for( j=0; j<=nstate; ++j ){if( tystate[j] == 0 ) continue;if( tystate[j] == best ) continue;count = 0;cbest = tystate[j];for (k=j; k<=nstate; ++k)if (tystate[k]==cbest) ++count;if ( count > times) {best = cbest;times = count;

}}

Infrequent Dependence Isolation

Sample loop from yacc benchmark

Page 38: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

38 University of MichiganElectrical Engineering and Computer Science

for( j=0; j<=nstate; ++j ){if( tystate[j] == 0 ) continue;if( tystate[j] == best ) continue;count = 0;cbest = tystate[j];for (k=j; k<=nstate; ++k)if (tystate[k]==cbest) ++count;if ( count > times) {best = cbest;times = count;

}}

Infrequent Dependence Isolation

Sample loop from yacc benchmark

Page 39: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

39 University of MichiganElectrical Engineering and Computer Science

for( j=0; j<=nstate; ++j ){if( tystate[j] == 0 ) continue;if( tystate[j] == best ) continue;count = 0;cbest = tystate[j];for (k=j; k<=nstate; ++k)if (tystate[k]==cbest) ++count;if ( count > times) {best = cbest;times = count;

}}

Infrequent Dependence Isolation

if ( count > times) {best = cbest;times = count;

}

1 %

Sample loop from yacc benchmark

Page 40: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

40 University of MichiganElectrical Engineering and Computer Science

for( j=0; j<=nstate; ++j ){if( tystate[j] == 0 ) continue;if( tystate[j] == best ) continue;count = 0;cbest = tystate[j];for (k=j; k<=nstate; ++k)if (tystate[k]==cbest) ++count;if ( count > times) {best = cbest;times = count;

}}

Infrequent Dependence Isolation

if ( count > times) {best = cbest;times = count;

}

j=0;while (j<=nstate){

for( ; j<=nstate; ++j ){if( tystate[j] == 0 ) continue;if( tystate[j] == best ) continue;count = 0;cbest = tystate[j];for (k=j; k<=nstate; ++k)if (tystate[k]==cbest) ++count;if ( count > times)break;

}

if (count > times) {best = cbest;times = count; j++;}}

1 %1 %

Sample loop from yacc benchmark

Page 41: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

41 University of MichiganElectrical Engineering and Computer Science

for( j=0; j<=nstate; ++j ){if( tystate[j] == 0 ) continue;if( tystate[j] == best ) continue;count = 0;cbest = tystate[j];for (k=j; k<=nstate; ++k)if (tystate[k]==cbest) ++count;if ( count > times) {best = cbest;times = count;

}}

Infrequent Dependence Isolation

if ( count > times) {best = cbest;times = count;

}

j=0;while (j<=nstate){

for( ; j<=nstate; ++j ){if( tystate[j] == 0 ) continue;if( tystate[j] == best ) continue;count = 0;cbest = tystate[j];for (k=j; k<=nstate; ++k)if (tystate[k]==cbest) ++count;if ( count > times)break;

}

if (count > times) {best = cbest;times = count; j++;}}

1 %1 %

Sample loop from yacc benchmark

Page 42: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

42 University of MichiganElectrical Engineering and Computer Science

for( j=0; j<=nstate; ++j ){if( tystate[j] == 0 ) continue;if( tystate[j] == best ) continue;count = 0;cbest = tystate[j];for (k=j; k<=nstate; ++k)if (tystate[k]==cbest) ++count;if ( count > times) {best = cbest;times = count;

}}

Infrequent Dependence Isolation

if ( count > times) {best = cbest;times = count;

}

j=0;while (j<=nstate){

for( ; j<=nstate; ++j ){if( tystate[j] == 0 ) continue;if( tystate[j] == best ) continue;count = 0;cbest = tystate[j];for (k=j; k<=nstate; ++k)if (tystate[k]==cbest) ++count;if ( count > times)break;

}

if (count > times) {best = cbest;times = count; j++;}}

1 %

1 %

1 %

Sample loop from yacc benchmark

Page 43: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

43 University of MichiganElectrical Engineering and Computer Science

for( j=0; j<=nstate; ++j ){if( tystate[j] == 0 ) continue;if( tystate[j] == best ) continue;count = 0;cbest = tystate[j];for (k=j; k<=nstate; ++k)if (tystate[k]==cbest) ++count;if ( count > times) {best = cbest;times = count;

}}

Infrequent Dependence Isolation

if ( count > times) {best = cbest;times = count;

}

j=0;while (j<=nstate){

for( ; j<=nstate; ++j ){if( tystate[j] == 0 ) continue;if( tystate[j] == best ) continue;count = 0;cbest = tystate[j];for (k=j; k<=nstate; ++k)if (tystate[k]==cbest) ++count;if ( count > times)break;

}

if (count > times) {best = cbest;times = count; j++;}}

1 %

1 %

1 %

Sample loop from yacc benchmark

Page 44: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

44 University of MichiganElectrical Engineering and Computer Science

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

052.a

lvin

n056.e

ar171.s

wim

172.m

gri

d177.m

esa

179.a

rt183.e

quak

e188.a

mm

p008.e

spre

sso

023.e

qnto

tt026.c

om

pre

ss072.s

c099.g

o124.m

88ks

im129.c

om

pre

ss130.li

132.ijp

eg164.g

zip

175.v

pr

181.m

cf197.p

arse

r256.b

zip2

300.t

wolf

cjpeg

djp

egep

icg721dec

ode

g721en

code

gsm

dec

ode

gsm

enco

de

mpeg

2dec

mpeg

2en

cpeg

wit

dec

peg

wit

enc

raw

caudio

raw

dau

dio

unep

icgre

ple

xya

ccav

erag

e

SPEC FP SPEC INT Mediabench Utilities

Fra

cti

on

of

se

qu

en

tia

l e

xe

cu

tio

n

profiled + provable

DOALL Coverage – Profiled and Transformed

Page 45: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

45 University of MichiganElectrical Engineering and Computer Science

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

052.a

lvin

n056.e

ar171.s

wim

172.m

gri

d177.m

esa

179.a

rt183.e

quak

e188.a

mm

p008.e

spre

sso

023.e

qnto

tt026.c

om

pre

ss072.s

c099.g

o124.m

88ks

im129.c

om

pre

ss130.li

132.ijp

eg164.g

zip

175.v

pr

181.m

cf197.p

arse

r256.b

zip2

300.t

wolf

cjpeg

djp

egep

icg721dec

ode

g721en

code

gsm

dec

ode

gsm

enco

de

mpeg

2dec

mpeg

2en

cpeg

wit

dec

peg

wit

enc

raw

caudio

raw

dau

dio

unep

icgre

ple

xya

ccav

erag

e

SPEC FP SPEC INT Mediabench Utilities

Fra

cti

on

of

se

qu

en

tia

l e

xe

cu

tio

n

profiled + provable

DOALL Coverage – Profiled and Transformed

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

052.a

lvin

n056.e

ar171.s

wim

172.m

gri

d177.m

esa

179.a

rt183.e

quak

e188.a

mm

p008.e

spre

sso

023.e

qnto

tt026.c

om

pre

ss072.s

c099.g

o124.m

88ks

im129.c

om

pre

ss130.li

132.ijp

eg164.g

zip

175.v

pr

181.m

cf197.p

arse

r256.b

zip2

300.t

wolf

cjpeg

djp

egep

icg721dec

ode

g721en

code

gsm

dec

ode

gsm

enco

de

mpeg

2dec

mpeg

2en

cpeg

wit

dec

peg

wit

enc

raw

caudio

raw

dau

dio

unep

icgre

ple

xya

ccav

erag

e

SPEC FP SPEC INT Mediabench Utilities

Fra

cti

on

of

se

qu

en

tia

l e

xe

cu

tio

n

profiled + provable transformations

Page 46: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

46 University of MichiganElectrical Engineering and Computer Science

Coverage Breakdown

0

10

20

30

40

50

60

70

SpecINT MediaBench Utilities

Fra

cti

on

of

se

qu

en

tia

l e

xe

cu

tio

n

DOALL loops Control speculation for uncounted loops

Speculative fission Speculative prematerialization

Infrequent dependence isolation DOALL loops after transformations

Page 47: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

47 University of MichiganElectrical Engineering and Computer Science

Coverage Breakdown

0

10

20

30

40

50

60

70

SpecINT MediaBench Utilities

Fra

cti

on

of

se

qu

en

tia

l e

xe

cu

tio

n

DOALL loops Control speculation for uncounted loops

Speculative fission Speculative prematerialization

Infrequent dependence isolation DOALL loops after transformations

Page 48: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

48 University of MichiganElectrical Engineering and Computer Science

Experimental Setup

• OpenIMPACT compiler• Multicore simulator

– Simulates up to 8 ARM9-like processors– Models scalar operand network– Assumes perfect memory system– Uses STM library to emulate HTM functionality

Page 49: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

49 University of MichiganElectrical Engineering and Computer Science

1

1.5

2

2.5

3

3.5

4

4.5

5

05

2.a

lvin

n

05

6.e

ar

17

1.s

wim

17

2.m

gri

d

17

7.m

esa

17

9.a

rt

18

3.e

qu

ake

18

8.a

mm

p

00

8.e

spre

sso

02

3.e

qn

tott

02

6.c

om

pre

ss

07

2.s

c

09

9.g

o

12

4.m

88

ksi

m

12

9.c

om

pre

ss

13

0.li

13

2.ijp

eg

16

4.g

zip

17

5.v

pr

18

1.m

cf

19

7.p

ars

er

25

6.b

zip

2

30

0.t

wolf

cjp

eg

djp

eg

ep

ic

g7

21

deco

de

g7

21

en

cod

e

gsm

deco

de

gsm

en

cod

e

mp

eg

2d

ec

mp

eg

2en

c

peg

wit

dec

peg

wit

en

c

raw

cau

dio

raw

dau

dio

un

ep

ic

gre

p

lex

yacc

avera

ge

SPEC FP SPEC INT Mediabench Utilities

Sp

ee

du

p

With transformations

Without transformations

Speedup

2 core4 core8 core

7.897.37

7.87

6.44

Page 50: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

50 University of MichiganElectrical Engineering and Computer Science

1

1.5

2

2.5

3

3.5

4

4.5

5

05

2.a

lvin

n

05

6.e

ar

17

1.s

wim

17

2.m

gri

d

17

7.m

esa

17

9.a

rt

18

3.e

qu

ake

18

8.a

mm

p

00

8.e

spre

sso

02

3.e

qn

tott

02

6.c

om

pre

ss

07

2.s

c

09

9.g

o

12

4.m

88

ksi

m

12

9.c

om

pre

ss

13

0.li

13

2.ijp

eg

16

4.g

zip

17

5.v

pr

18

1.m

cf

19

7.p

ars

er

25

6.b

zip

2

30

0.t

wolf

cjp

eg

djp

eg

ep

ic

g7

21

deco

de

g7

21

en

cod

e

gsm

deco

de

gsm

en

cod

e

mp

eg

2d

ec

mp

eg

2en

c

peg

wit

dec

peg

wit

en

c

raw

cau

dio

raw

dau

dio

un

ep

ic

gre

p

lex

yacc

avera

ge

SPEC FP SPEC INT Mediabench Utilities

Sp

ee

du

p

With transformations

Without transformations

Speedup

2 core4 core8 core

7.897.37

7.87

6.44

1.36x, 1.84x and 2.34x speedup on 2-, 4-, and 8-cores

Page 51: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

51 University of MichiganElectrical Engineering and Computer Science

Conclusion

• Figure out ways to use available resources for legacy applications– Codes like error handlers, linked list & tree

traversal limit parallelism• Compiler analysis and optimization

looks promising • 1.84x speedup on 4 cores after

transformations compared to 1.41x

Page 52: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

52 University of MichiganElectrical Engineering and Computer Science

Questions?

Thank you!

Page 53: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

53 University of MichiganElectrical Engineering and Computer Science

SpecDSWP vs. Speculative Fission

B

A

C

Page 54: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

54 University of MichiganElectrical Engineering and Computer Science

SpecDSWP vs. Speculative Fission

B0

A0A1A2A3 B1

B2B3

C0

C1

C2

C3

Core 0 Core 1 Core 2 Core 3

B0

A0A1A2A3

B1 B2 B3

C0 C1 C2 C3

Core 0 Core 1 Core 2 Core 3

Page 55: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

55 University of MichiganElectrical Engineering and Computer Science

Speculative Prematerialization

for (...) {1: current = ...;2: work(last);3: last = current;

}

Page 56: Uncovering Hidden Loop Level Parallelism in …cccp.eecs.umich.edu/slides/mehrara-hpca08.pdf0.9 1 052.alvinn 056.ear 171.swim 172.mgrid 177.mesa 179.art 183.equake 188.ammp 008.espresso

56 University of MichiganElectrical Engineering and Computer Science

Speculative Prematerialization

for (...) {1: current = ...;2: work(last);3: last = current;

}

XBEGIN1’: current =3’: last =

for (...) {1: current = ...;2: work(last);3: last = current;

}XCOMMIT