35
On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case Alessandro Pellegrini Francesco Quaglia High Performance and Dependable Computing Systems Group Sapienza, University of Rome InfQ 2014

On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

On the Relevance of Wait-free Coordination Algorithms inShared-Memory HPC: The Global Virtual Time Case

Alessandro PellegriniFrancesco Quaglia

High Performance and DependableComputing Systems Group

Sapienza, University of Rome

InfQ 2014

Page 2: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Current Technological Trend

• Implications of Moore’s Lawhave changed since 2003

• 130W is considered an upperbound (the power wall)

P = ACV 2f

2 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 3: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Current Technological Trend

• Implications of Moore’s Lawhave changed since 2003

• 130W is considered an upperbound (the power wall)

P = ACV 2f

2 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 4: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Current Technological Trend

• Implications of Moore’s Lawhave changed since 2003

• 130W is considered an upperbound (the power wall)

P = ACV 2f

2 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 5: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Multicore Software Scaling

3 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 6: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Multicore Software Scaling

3 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 7: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Multicore Software Scaling

3 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 8: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Multicore Software Scaling

3 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 9: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Multicore Software Scaling

3 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 10: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Problem Statement in HPC

• Many HPC applications enforce data separation◦ The amount of synchronization points are already reduced

• Coordination algorithms easily become the bottleneck of HPCSystems on multicore architectures

• They cannot be avoided at all◦ Necessary for the overall correctness

• How to deal with performance issues of HPC Systems onmulticores?

4 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 11: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Wait-Free Algorithms

• All threads complete their job in a finite number of steps

• Algorithms in which threads communicate using some sort ofatomic-update machine instruction◦ Compare and Swap (CAS)◦ Load-Link/Store-Conditional (LL/SC)

• Both CAS & LL/SC will fail if their initial conditions change

• The “fix” for failing is usually to loop and try again.

5 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 12: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Wait-Free Algorithms

• All threads complete their job in a finite number of steps

• Algorithms in which threads communicate using some sort ofatomic-update machine instruction◦ Compare and Swap (CAS)◦ Load-Link/Store-Conditional (LL/SC)

• Both CAS & LL/SC will fail if their initial conditions change

• The “fix” for failing is usually to loop and try again.

5 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 13: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Target Architecture: PDES

• Exchanged information unit is the discrete event

• The involved objects state is updated according to the flow ofevents

6 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 14: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Target Architecture: PDES

• Exchanged information unit is the discrete event

• The involved objects state is updated according to the flow ofevents

Communication Network

Machine

Processor

Kernel

LPLP

LP

Processor

Kernel

LPLP

LP

Machine

Processor

Kernel

LPLP

LP

Processor

Kernel

LPLP

LP

... ...

...

6 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 15: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Target Architecture: PDES

• Exchanged information unit is the discrete event

• The involved objects state is updated according to the flow ofevents

Communication Network

Machine

CPU

Kernel

LPLP

LP LPLP

LP LPLP

LP LPLP

LP

...

...

CPU CPU CPU

Machine

CPU

Kernel

...CPU CPU CPU

Kernel

6 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 16: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Target Architecture: PDES

• Exchanged information unit is the discrete event

• The involved objects state is updated according to the flow ofevents

Machine

CPU

LPLP

LP LPLP

LP

CPU CPU CPU

Simulation Kernel

Discrete Event

Runtime

Environment

6 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 17: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Coordination Algorithm in PDES: GVT

LPi

LPj

LPk Execution Time

Execution Time

Execution Time

7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 18: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Coordination Algorithm in PDES: GVT

LPi

LPj 15

5

LPk Execution Time7

10 Execution Time

Execution Time

7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 19: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Coordination Algorithm in PDES: GVT

LPi

LPj 15

5

LPk Execution Time7 17

10

17

Execution Time

Execution Time

7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 20: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Coordination Algorithm in PDES: GVT

LPi

LPj 15

5 10

20

LPk Execution Time7 17 25

10

17

Execution Time

Execution Time

7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 21: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Coordination Algorithm in PDES: GVT

LPi

LPj 15

5 10

20

12

LPk Execution Time7 17 25

10

17

Execution Time

Execution Time

7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 22: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Coordination Algorithm in PDES: GVT

LPi

LPj 15

5 10

20

Straggler Message

12

LPk Execution Time7 17 25

10

17

Rollback Execution:

Recovering state at

LVT 10

Execution Time

Execution Time

7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 23: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Coordination Algorithm in PDES: GVT

LPi

LPj 15

5 10

20

Straggler Message

12

LPk Execution Time7 17 25

10

17 17

Anti-message

anti-message

reception

Rollback Execution:

Recovering state at

LVT 10

Rollback Execution:

Recovering State at

LVT 7

Execution Time

Execution Time

7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 24: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Coordination Algorithm in PDES: GVT

LPi

LPj 15

5 10

20 12

Straggler Message

12

LPk Execution Time7 17 25

10

17

17 17

Anti-message

anti-message

reception

Rollback Execution:

Recovering state at

LVT 10

Rollback Execution:

Recovering State at

LVT 7

Execution Time

Execution Time

7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 25: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Coordination Algorithm in PDES: GVT

LPi

LPj 15

5 10

20 12

Straggler Message

12

LPk Execution Time7 17 25

10

17

17 17

Anti-message

anti-message

reception

Rollback Execution:

Recovering state at

LVT 10

Rollback Execution:

Recovering State at

LVT 7

Execution Time

Execution Time

7 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 26: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Fujimoto & Hybinette

• Reference algorithm has been so far Fujimoto’s and Hybinette’sGVT for Shared-Memory Multiprocessors (1998)

• It targets observable Time Warp Systems◦ Essentially the send operation leads to the direct incorporation into

the message queue◦ This removes the notion of in-transit messages

• It relies on the notion of GVT computation phase

8 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 27: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Fujimoto & Hybinette (2)

procedure Main-Loopwhile there are events to process do

LocalGVTFlag ← GVTFlag . Atomically set to NPEReceive messages and process rollbacksReceive antimessages, process annihilations and rollbacksSelect next event Mif LocalGVTFlag > 0 and LocalMin not computed then . CS

PEMin[PE] ← min{SendMin, TS(M)}GVTFlags ← GVTFlags - 1if GVTFlag = 0 then

GVT ← min{PEMin[1], . . . , PEMin[NPE]}end if

end ifend while

end procedure

9 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 28: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Wait-Free GVT Algorithm

• Each worker thread manages a set of LPs

• Event-queues are directly accessible by worker threads◦ No message/anti-message is ever in-flight across worker-threads

• GVT protocol entails determining the right moment to look atdata structures and compute the local minimum◦ It is computed after the incorporation of incoming

messages/anti-messages into the event-queue◦ No need to account for anti-messages and canceled events timestamp

• The algorithm is wait-free because it relies on a sequence of phases

10 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 29: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Wait-Free GVT Algorithm (2)

wall-clock-time

t1 t2

GVT_flag is

set to TRUE

phase-A phase-send

t3

phase-BWT1

WT2

WT3

m

ts

11 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 30: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Test-Bed Scenario

• 32-cores HP ProLiant Server

• 64GB of RAM

• 2.6.32-5-amd64 Linux kernel

• ROOT-Sim Simulation Kernel v. 1.0.2

• Modified version of PHOLD benchmark◦ homogeneous LPs that perform memory allocation/deallocation

operations and read/write operations◦ LPs modify their memory layout◦ few and small-granularity events ⇒ increased likelihood of competing

for the critical section

12 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 31: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Experimental Results

0

2

4

6

8

10

12

14

16

8 16 24 32

Ove

rall

Exe

cutio

n T

ime

(sec

onds

)

Number of Concurrent Worker Threads

WF FH

13 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 32: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Experimental Results (2)

0

0.5

1

1.5

2

8 16 24 32

Num

ber

of S

pinl

ock

Cyc

les

per

WC

T U

nit

Number of Concurrent Worker Threads

FH

14 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 33: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Experimental Results (3)

0

5

10

15

20

25

30

35

8 12 16 20 24

Ove

rall

Exe

cutio

n T

ime

(sec

onds

)

Number of Busy (with external workload) CPUs

WF FH

15 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 34: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Conclusions

• HPC Applications usually have few synchronization points◦ There is a high demand for enhanced synchronization schemes

• Modern multicore architectures enhance the impact ofsynchronization points◦ Contention due to locks can hamper scalability

• Wait-free algorithms can be a viable solution to be studied to facethis issue◦ They can provide enhanced scalability◦ There is a benefit in non-dedicated environments as well

16 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case

Page 35: On the Relevance of Wait-free Coordination Algorithms in ...pellegrini/talks/infq14.pdf · LPj 15 5 10 20 LPk 7 17 25 Execution Time 10 17 Execution Time Execution Time 7 of 17 -

Thanks for your attention

Questio

ns?

Questio

ns?

[email protected]

http://www.dis.uniroma1.it/∼pellegrini

http://www.dis.uniroma1.it/∼ROOT-Sim

17 of 17 - On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case