TM performance: seeing the whole picture or Looking back over the first 500 papers



Tim Harris (MSR Cambridge)

How might we compare TM systems?

Where might TM be most useful?

Extending Dan’s GC analogy

Concurrent GC algorithm (run GC in small steps in amongst mutators)

A: “Here’s a way to reduce the pause times...”

B: “Here’s a way to support pinned objects...”

C: “Here’s a way to improve the throughput (total app runtime)...”

Min mutator utilization

[Figure: min fraction of interval running mutator (0.0 to 0.9) against time interval / ms (0 to 12), for Algorithm A and Algorithm B]
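The metric on this slide can be computed from a trace of GC pauses. The following is a minimal sketch, assuming a hypothetical trace format (sorted, non-overlapping `(start, end)` pauses in ms over a run spanning `[0, total]`); the function name and the two example traces are illustrative and not taken from the talk. For a given window length it finds the window with the most pause time and reports the fraction left to the mutator.

```python
def min_mutator_utilization(pauses, total, window):
    """Worst-case fraction of any `window`-long interval spent in the mutator.

    pauses: sorted, non-overlapping (start, end) GC pauses, in ms.
    total:  length of the whole run, in ms (the run spans [0, total]).
    """
    if not 0 < window <= total:
        raise ValueError("window must be in (0, total]")

    def paused_time(lo, hi):
        # Total pause time overlapping the interval [lo, hi].
        return sum(max(0.0, min(hi, e) - max(lo, s)) for s, e in pauses)

    # The worst window either starts at a pause start or ends at a pause end
    # (or sits at a boundary of the run), so only those positions are checked.
    candidates = {0.0, total - window}
    for s, e in pauses:
        candidates.add(s)
        candidates.add(e - window)

    worst = 0.0
    for lo in candidates:
        lo = min(max(lo, 0.0), total - window)  # clamp the window into the run
        worst = max(worst, paused_time(lo, lo + window))
    return 1.0 - worst / window


# Hypothetical traces: one 4 ms pause versus four 1 ms pauses.
algo_a = [(10.0, 14.0)]
algo_b = [(10.0, 11.0), (20.0, 21.0), (30.0, 31.0), (40.0, 41.0)]
for w in (2.0, 8.0):
    print(w, min_mutator_utilization(algo_a, 50.0, w),
          min_mutator_utilization(algo_b, 50.0, w))
```

Both hypothetical traces spend the same 4 ms in GC overall, yet they differ sharply at small window sizes, which is the kind of difference a single throughput number hides.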

Five dimensions to TM behavior

• Sequential overhead
• Scalability (to longer transactions)
• Scalability (to more cores)
• Tx-supported operations
• Semantics

Scaling to large transactions

[Figure: normalized execution time (0.0 to 5.0) against Tx size (0 to 10), for Algorithm A and Algorithm B]

1.0 = optimized sequential code (no tx, no locks)

Scaling: n*1-core copies

[Figure: normalized execution time (0 to 6) against #cores (0 to 10), for Algorithm A and Algorithm B]

1.0 = optimized sequential code (no tx, no locks)

Scaling: 1*n-core copy

[Figure: speedup over sequential (0 to 2.5) against #cores (0 to 10), for Algorithm A and Algorithm B]

1.0 = optimized sequential code (no tx, no locks)

How might we compare TM systems?

Where might TM be most useful?

Application model #1

Sequential Parallelizable

f = fraction of original program that is parallelizable

Application model #1

Sequential

Parallel

Parallel

Parallel

...

f = fraction of original program that is parallelizable
n = num parallel threads

Application model #1

Sequential

Parallel, transactional

Parallel, transactional

Parallel, transactional

...

f = fraction of original program that is parallelizable
n = num parallel threads
x = straight-line transactional slow-down
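The slides do not spell out a formula for this model, so the following is only a sketch under a stated assumption: the sequential part runs unchanged, and the parallelizable part is split over n threads with each thread's work multiplied by x. With x = 1 this reduces to Amdahl's law. The function name `speedup_model1` is illustrative.

```python
def speedup_model1(f, n, x=1.0):
    """Speedup over the original sequential program under an assumed cost model.

    f: fraction of the original program that is parallelizable
    n: number of parallel threads
    x: straight-line transactional slow-down (1.0 = no overhead)
    """
    sequential_part = 1.0 - f
    # Assumption: transactional overhead multiplies the parallel work only.
    parallel_part = f * x / n
    return 1.0 / (sequential_part + parallel_part)


# Same program (f = 0.95) on 16 threads, with increasing slow-down x.
for x in (1.0, 2.0, 4.0):
    print(x, round(speedup_model1(0.95, 16, x), 2))
```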

Conflict model

f = fraction of original program that is parallelizable
n = num parallel threads
x = straight-line transactional slow-down
c = mean number of attempts per transaction (1 => no conflicts)

1 2 3 4 5 6

Fixed number of alternatives, execute different alternatives in parallel

Execute conflicting operations in series
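One way to read this conflict model is as a balls-into-bins experiment. The sketch below assumes (this is an interpretation, not something the slides state) that each of the n threads picks one alternative uniformly at random, threads that picked the same alternative run one after another, and a round therefore lasts as long as the most-contended alternative. The function name, trial count and seed are illustrative. Under that reading the simulated values come out close to the c values that later slide titles pair with a range of alternatives, such as c=3.1 for 1..16 and c=1.1 for 1..1024.

```python
import random


def mean_attempts(n, alternatives, trials=5000, seed=0):
    """Estimate c as the mean size of the most-contended alternative.

    n:            number of threads, each picking one alternative per round
    alternatives: number of alternatives to choose from
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    total = 0
    for _ in range(trials):
        counts = [0] * alternatives
        for _ in range(n):
            counts[rng.randrange(alternatives)] += 1
        # Conflicting operations run in series, so the round takes
        # as many steps as the busiest alternative has threads.
        total += max(counts)
    return total / trials


for k in (16, 64, 256, 1024):
    print(k, round(mean_attempts(16, k), 2))
```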

n=16, c=1.0, vary f, vary x

[Figure: f (parallel proportion), 75% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

n=16, c=1.0

[Figure: f (parallel proportion), 75% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

8x on 16 threads => 95% parallelizable
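The annotation can be sanity-checked by inverting Amdahl's law. This is a sketch that assumes the no-overhead, no-conflict case (x = 1, c = 1) follows plain Amdahl's law, which may not be exactly what the plot uses; the function name is illustrative. Solving 1 / ((1 - f) + f / n) = S for f gives a required parallel fraction of about 0.93 for 8x on 16 threads, close to the roughly 95% quoted on the slide.

```python
def required_parallel_fraction(target_speedup, n):
    """Smallest f that reaches `target_speedup` on n threads under Amdahl's law.

    Solves 1 / ((1 - f) + f / n) = target_speedup for f.
    """
    if n <= 1 or not 1.0 <= target_speedup <= n:
        raise ValueError("need n > 1 and 1 <= target_speedup <= n")
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n)


print(round(required_parallel_fraction(8, 16), 4))
```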

n=16, c=1.0

[Figure: f (parallel proportion), 75% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

Straight-line slow-down bites quickly
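To put a number on how quickly the slow-down bites, the sketch below solves for the largest x that still reaches a target speedup. It assumes a cost model the slides do not state explicitly: running time (1 - f) + f * x / n relative to the original program, with no conflicts. The function name is illustrative. Under that assumption, even a fully parallelizable program (f = 1.0) can tolerate at most a 2x slow-down if it is to reach 8x on 16 threads.

```python
def max_slowdown(target_speedup, f, n):
    """Largest slow-down x that still achieves `target_speedup`.

    Assumed model: time = (1 - f) + f * x / n, relative to the original
    sequential program. A result below 1.0 means the target is out of
    reach even with zero transactional overhead.
    """
    return n * (1.0 / target_speedup - (1.0 - f)) / f


# Target: 8x on 16 threads, for decreasing parallel fractions.
for f in (1.0, 0.95, 0.90):
    print(f, round(max_slowdown(8, f, 16), 3))
```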

n=16, c=1.1 (1..1024)

[Figure: f (parallel proportion), 75% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

n=16, c=1.4 (1..256)

[Figure: f (parallel proportion), 75% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

n=16, c=2.0 (1..64)

[Figure: f (parallel proportion), 75% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

n=16, c=3.1 (1..16)

[Figure: f (parallel proportion), 75% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

If Amdahl and overheads don’t get you then conflicts still can...

n=16, c=1.0, scaling of large tx

[Figure: f (parallel proportion), 75% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

[Inset: x*f (0.0 to 10.0) against x*f (0.0 to 4.0)]

n=16, c=1.0, x*(f+(f^1.25)/4)

[Figure: f (parallel proportion), 75% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

[Inset: x*(f+(f^1.25)/4) (0.0 to 10.0) against x*f (0.0 to 4.0)]

n=16, c=1.0, x*(f+(f^2)/4)

[Figure: f (parallel proportion), 75% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

[Inset: x*(f+(f^2)/4) (0.0 to 10.0) against x*f (0.0 to 4.0)]

Application model #2: 100% parallel

Tx Non-tx
Tx Non-tx
Tx Non-tx
...

t = fraction of original program that is transactional
n = num parallel threads
x = straight-line transactional slow-down
c = mean number of attempts per transaction (1 => no conflicts)
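As with the first model, the slides give the parameters but not a formula, so this is a sketch under assumptions: the program divides perfectly over n threads, non-transactional code runs at its original speed, and each transaction costs x times its original code for every one of its c attempts on average. The function name `speedup_model2` is illustrative.

```python
def speedup_model2(t, n, x=1.0, c=1.0):
    """Speedup over the original sequential program for a 100% parallel app.

    t: fraction of the original program that is transactional
    n: number of parallel threads
    x: straight-line transactional slow-down
    c: mean number of attempts per transaction (1.0 = no conflicts)
    """
    # Assumption: every attempt pays the full slowed-down transaction cost.
    work = (1.0 - t) + t * x * c
    return n / work


print(round(speedup_model2(0.2, 16, x=2.0, c=1.0), 2))
print(round(speedup_model2(0.2, 16, x=4.0, c=2.0), 2))
```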

Workloads (ASPLOS ’10)

[Figure: t (transactional proportion), 0% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56, labelled with the workloads Labyrinth, Genome, JBBAtomic, Vacation and MaxFlow]


n=16, c=1.0 (no conflicts)

[Figure: t (transactional proportion), 0% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

n=16, c=1.0 (no conflicts)

[Figure: t (transactional proportion), 0% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

Overheads rapidly reduce the amount that transactions can be used
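The claim can be illustrated by asking how much of a program may be transactional before a target speedup is lost. The sketch assumes per-thread work of (1 - t) + t * x * c relative to the original code, spread perfectly over n threads; this formula is an assumption, not something the slides state, and the function name is illustrative. Under it, reaching 8x on 16 threads allows any t at x = 2 but only t = 0.25 at x = 5.

```python
def max_tx_fraction(target_speedup, n, x, c=1.0):
    """Largest transactional fraction t that still reaches `target_speedup`.

    Assumed model: speedup = n / ((1 - t) + t * x * c).
    The result is clamped to the range [0.0, 1.0].
    """
    cost = x * c                    # cost of tx code relative to the original
    budget = n / target_speedup     # work allowed, relative to the original
    if cost <= 1.0:
        return 1.0                  # transactions add no overhead at all
    return min(1.0, max(0.0, (budget - 1.0) / (cost - 1.0)))


# Target: 8x on 16 threads with no conflicts, for increasing slow-down x.
for x in (2.0, 3.0, 5.0):
    print(x, max_tx_fraction(8, 16, x))
```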

n=16, c=1.1 (1..1024)

[Figure: t (transactional proportion), 0% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

n=16, c=1.4 (1..256)

[Figure: t (transactional proportion), 0% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

n=16, c=2.0 (1..64)

[Figure: t (transactional proportion), 0% to 100%, against x (straight-line transactional slow-down), 1 to about 5.56]

Conclusions

• Bad things come in threes...
– Amdahl’s law
– Sequential overhead
– Conflicts

• When developing TM systems we need to be careful about tradeoffs between these

• There’s a risk of “chasing around the TM design space”
– Sequential overhead
– Scaling without conflicts
– Scaling with conflicts
