19
Early Experiences on Accelerating Dijkstra’s Algorithm Using Transactional Memory Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris Computing Systems Laboratory School of Electrical and Computer Engineering National Technical University of Athens {anastop,knikas,goumas,nkoziris}@cslab.ece.ntua.gr http://www.cslab.ece.ntua.gr May 31, 2009 CSLab National Technical University of Athens

Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Early Experiences on Accelerating Dijkstra’s AlgorithmUsing Transactional Memory

Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumasand Nectarios Koziris

Computing Systems LaboratorySchool of Electrical and Computer Engineering

National Technical University of Athens{anastop,knikas,goumas,nkoziris}@cslab.ece.ntua.gr

http://www.cslab.ece.ntua.gr

May 31, 2009

CSLabNational Technical University of Athens

Page 2: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Outline

1 Dijkstra’s Basics

2 Straightforward Parallelization Scheme

3 Helper-Threading Scheme

4 Experimental Evaluation

5 Conclusions

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 2 / 19

Page 3: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

The Basics of Dijkstra’s Algorithm

SSSP Problem

Directed graph G = (V ,E ), weight function w : E → R+, sourcevertex s

∀v ∈ V : compute δ(v) = min{w(p) : sp v}

Shortest path estimate d(v)

gradually converges to δ(v) through relaxations

relax (v,w): d(w) = min{d(w), d(v) + w(v ,w)}I can we find a better path s

p w by going through v?

Three partitions of vertices

Settled: d(v) = δ(v)

Queued: d(v) > δ(v) and d(v) 6=∞Unreached: d(v) =∞

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 3 / 19

Page 4: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

The Basics of Dijkstra’s Algorithm

Serial algorithm

Input : G = (V ,E), w : E → R+,1

source vertex s, min QOutput : shortest distance array d ,2

predecessor array πforeach v ∈ V do3

d [v ]← inf;4

π[v ]← nil;5

Insert(Q, v);6

end7

d [s]← 0;8

while Q 6= ∅ do9

u ← ExtractMin(Q);10

foreach v adjacent to u do11

sum← d [u] + w(u, v);12

if d [v ] > sum then13

DecreaseKey(Q, v , sum);14

d [v ]← sum;15

π[v ]← u;16

end17

end18

50

55

60 65

70

5

2

10

10 15

20

7

S

A

B

CD

E

8

8

8

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 4 / 19

Page 5: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

The Basics of Dijkstra’s Algorithm

5

7 8

12 9

15 17 13 16

10 13

12 14

i

j

k

4

6 5

7 9

15 12 13 16

8 13

10 14

i:17 6

Min-priority queue implemented as binary min-heap

maintains all but the settled vertices

min-heap property: ∀i : d(parent(i)) ≤ d(i)

amortizes the cost of multiple ExtractMin’s and DecreaseKey’sI O((|E |+ |V |)log |V |) time complexity

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 5 / 19

Page 6: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Straightforward Parallelization

Fine-grain parallelization at the inner loop level

Fine-Grain Multi-Threaded

/* Initialization phase same to the serialcode */

while Q 6= ∅ do1

Barrier2

if tid = 0 then3

u ← ExtractMin(Q);4

Barrier5

for v adjacent to u in parallel do6

sum← d [u] + w(u, v);7

if d [v ] > sum then8

Begin-Atomic9

DecreaseKey(Q, v , sum);10

End-Atomic11

d [v ]← sum;12

π[v ]← u;13

end14

end15

50

55

60 65

70

5

2

10

10 15

20

7

S

A

B

CD

E

8

8

8

Issues:

speedup bounded by averageout-degreeconcurrent heap updates due toDecreaseKey’sbarrier synchronization overhead

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 6 / 19

Page 7: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Concurrent Heap Updates with Locks

5

7 8

12 9

15 17 13 16

10 13

12 14

i

j

k

4

6 5

7 9

15 12 13 16

8 13

10 14

j

k:12 4i:17 6

3

5 8

7 9

15 12 13 16

10 13

12 14

ki:17 3

j:9 2

x

x

Coarse-grain synchronization (cgs-lock)I enforces atomicity at the level of a DecreaseKey operationI one lock for the entire heapI serializes DecreaseKey’s

Fine-grain synchronization (fgs-lock)I enforces atomicity at the level of a single swapI allows multiple swap sequences to execute in parallel as long as they

are temporally non-overlappingI separate locks for each parent-child pair

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 7 / 19

Page 8: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Performance of FGMT with Locks

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1 1.2 1.3

2 4 6 8 10 12 14 16

Mul

tithr

eade

d sp

eedu

p

Number of threads

cgs-lockperfbar+cgs-lockperfbar+fgs-lock

Software barriers dominate total execution time

72% with 2 threads, 88% with 8replace with idealized (simulated) zero-latency barriers

Fgs-lock scheme more scalable, but still fails to outperform serial

locking overhead (2 locks + 2 unlocks per swap)

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 8 / 19

Page 9: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Concurrent Heap Updates with TM

5

7 8

12 9

15 17 13 16

10 13

12 14

i

j

k

4

6 5

7 9

15 12 13 16

8 13

10 14

j

k:12 4i:17 6

3

5 8

7 9

15 12 13 16

10 13

12 14

ki:17 3

j:9 2

x

x

TM-based

Coarse-grain synchronization (cgs-tm)I enclose DecreaseKey within a transactionI allows multiple swap sequences to execute in parallel as long as they

are spatially (and temporally) non-overlappingI conflicting transaction stalls and retries or aborts

Fine-grain synchronization (fgs-tm)I enclose each swap operation within a transactionI atomicity as in fgs-lockI shorter but more transactions

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 9 / 19

Page 10: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Performance of FGMT with TM

0.5

0.6

0.7

0.8

0.9

1

1.1

1.2

2 4 6 8 10 12 14 16

Mul

tithr

eade

d sp

eedu

p

Number of threads

perfbar+cgs-lockperfbar+fgs-lockperfbar+cgs-tmperfbar+fgs-tm

TM-based schemes offer speedup up to ∼ 1.1

less overhead for cgs-tm, yet equally able to exploit availableconcurrency

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 10 / 19

Page 11: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Helper-Threading Scheme

Motivation

expose more parallelism to each threadeliminate costly barrier synchronization

Rationale

in serial, relaxations are performed onlyfrom the extracted (settled) vertex

allow relaxations for out-edges ofqueued vertices, hoping that some ofthem might already be settled

I main thread operates as in the serialalgorithm

I assign the next t vertices in thequeue (x2 . . . xt+1) to t helper threads

I helper thread k relaxes all out-edgesof vertex xk

50

55

60 65

70

5

2

10

10 15

20

7

i-1

i

S

A

B

CD

E

8

8

8

speculation on the status of d(xk)I if already optimal , main thread will be offloadedI if not optimal , any suboptimal relaxations will be corrected eventually by

main thread

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 11 / 19

Page 12: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Execution Pattern

ste

p k

ste

p k

+1

ste

p k

+2

Main

Th

read

1

Th

read

2

Th

read

3

Th

read

4

ste

p k

ste

p k

+1

ste

p k

+2

kill kill

killM

ain H

elp

er 1

Help

er 2

Help

er 3

kill

kill

ste

p k

ste

p k

+1

ste

p k

+2

extract-min

relax edges

read tidth-min

Serial FGMT Helper Threads

the main thread stops all helpers at the end of each iteration

unfinished work will be corrected, as with mis-speculated distances

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 12 / 19

Page 13: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Helper-Threading Scheme

Main thread

while Q 6= ∅ do1

u ← ExtractMin(Q);2

done ← 0;3

foreach v adjacent to u do4

sum← d [u] + w(u, v);5

Begin-Xact6

if d [v ] > sum then7

DecreaseKey(Q, v , sum);8

d [v ]← sum;9

π[v ]← u;10

End-Xact11

end12

Begin-Xact13

done ← 1;14

End-Xact15

end16

Helper thread

while Q 6= ∅ do1

while done = 1 do ;2

x ← ReadMin(Q, tid)3

stop ← 04

foreach y adjacent to x and while stop = 0 do5

Begin-Xact6

if done = 0 then7

sum← d [x] + w(x , y)8

if d [y ] > sum then9

DecreaseKey(Q, y , sum)10

d [y ]← sum11

π[y ]← x12

else13

stop ← 114

End-Xact15

end16

end17

for a single neighbour, the check for relaxation, updates to the heap, andupdates to d ,π arrays, are enclosed within a transaction

I performed “all-or-none”I on a conflict, only one thread commits

interruption of helper threads implemented through TM, as well

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 13 / 19

Page 14: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Helper-Threading Scheme

Main thread

while Q 6= ∅ do1

u ← ExtractMin(Q);2

done ← 0;3

foreach v adjacent to u do4

sum← d [u] + w(u, v);5

Begin-Xact6

if d [v ] > sum then7

DecreaseKey(Q, v , sum);8

d [v ]← sum;9

π[v ]← u;10

End-Xact11

end12

Begin-Xact13

done ← 1;14

End-Xact15

end16

Helper thread

while Q 6= ∅ do1

while done = 1 do ;2

x ← ReadMin(Q, tid)3

stop ← 04

foreach y adjacent to x and while stop = 0 do5

Begin-Xact6

if done = 0 then7

sum← d [x] + w(x , y)8

if d [y ] > sum then9

DecreaseKey(Q, y , sum)10

d [y ]← sum11

π[y ]← x12

else13

stop ← 114

End-Xact15

end16

end17

Why with TM?

composableI all dependent atomic sub-operations composed into a large atomic

operation, without limiting concurrency

optimisticeasily programmable

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 14 / 19

Page 15: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Experimental SetupFull-system simulation

Simics 3.0.31 in conjunction with GEMS toolset 2.1boots unmodified Solaris 10 (UltraSPARC III Cu)

LogTM (“Signature Edition”)

eager version managementeager conflict detection

I on a conflict, a transaction stalls and either retries or aborts

HYBRID conflict resolution policyI favors older transactions

Hardware platform

single CMP system (configurations up to 32 cores)private L1 caches (64KB), shared L2 cache (2MB)

Software

Pthreads for threading and synchronizationSimics “magic” instructions to simulate idealized barriersSun Studio 12 C compiler (-xO3)

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 15 / 19

Page 16: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Graphs

Three graph families

Random: G (n,m) model

SSCA#2: cliques with varying size (1 – C )connected with probability P

R-MAT: power-law degree distributions

GTgraph graph generatorFixed #nodes (10K), varying density

sparse (∼10K edges)

medium (∼100K edges)

dense (∼200K edges)

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 16 / 19

Page 17: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Speedups

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1 1.2 1.3 1.4 1.5

2 4 6 8 10 12 14 16

Number of threads

ssca2-10000x28351

perfbar+cgs-lockperfbar+cgs-tmperfbar+fgs-lockperfbar+fgs-tmhelper

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1 1.2 1.3 1.4 1.5

2 4 6 8 10 12 14 16

Number of threads

ssca2-10000x118853

perfbar+cgs-lockperfbar+cgs-tmperfbar+fgs-lockperfbar+fgs-tmhelper

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1 1.2 1.3 1.4 1.5

2 4 6 8 10 12 14 16

Number of threads

rmat-10000x200000

perfbar+cgs-lockperfbar+cgs-tmperfbar+fgs-lockperfbar+fgs-tmhelper

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1 1.2 1.3 1.4 1.5

2 4 6 8 10 12 14 16

Number of threads

rmat-10000x10000

perfbar+cgs-lockperfbar+cgs-tmperfbar+fgs-lockperfbar+fgs-tmhelper

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1 1.2 1.3 1.4 1.5

2 4 6 8 10 12 14 16

Number of threads

rand-10000x100000

perfbar+cgs-lockperfbar+cgs-tmperfbar+fgs-lockperfbar+fgs-tmhelper

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 1.1 1.2 1.3 1.4 1.5

2 4 6 8 10 12 14 16

Number of threads

rand-10000x200000

perfbar+cgs-lockperfbar+cgs-tmperfbar+fgs-lockperfbar+fgs-tmhelper

Helper-Threading

speedups in 6 out of 9 cases (not all shown), up to 1.46

performance improves with increasing density

main thread not obstructed by helpers (<1% abort rate in all cases)

FGMT with TM

speedups only with perfect barriers

optimistic parallelism does exist in concurrent queue updates

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 17 / 19

Page 18: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Conclusions

FGMT

conventional synchronization mechanisms incur unacceptableoverhead

TM reduces overheads and highlights the existence of parallelism, butstill requires very efficient barriers to offer some speedup

HT with TM

exposes more parallelism and eliminates barrier synchronization

noteworthy speedups with minimal code extensions

Future work

more aggressive parallelization schemes

dynamic adaptation of helper threads to algorithm’s execution phases

explore impact of TM characteristics

applicability of HT on other SSSP algorithms (∆-stepping,Bellman-Ford) and other similar (“greedy”) applications

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 18 / 19

Page 19: Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas ...anastop/files/papers/mtaap09-presentation.pdf · Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Thank you!Questions?

Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 19 / 19