18
Photos placed in horizontal position with even amount of white space between photos and header Photos placed in horizontal position with even amount of white space between photos and header Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND2016-5027C Horseshoes & hand grenades (& coordination) The case for approximate coordination in local checkpointing protocols Patrick M. Widener, Kurt Ferreira, Scott Levy Center for Computing Research Sandia National Laboratories Albuquerque, NM, USA [email protected]

Horseshoes & hand grenades (& coordination)

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Horseshoes & hand grenades (& coordination)

Photos placed in horizontal position with even amount

of white space between photos

and header

Photos placed in horizontal position

with even amount of white space

between photos and header

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND2016-5027C

Horseshoes & hand grenades (& coordination)

The case for approximate coordination in local checkpointing

protocols

Patrick M. Widener, Kurt Ferreira, Scott Levy

Center for Computing Research Sandia National Laboratories

Albuquerque, NM, USA [email protected]

Page 2: Horseshoes & hand grenades (& coordination)

Overview

!  Today’s predominant HPC resilience mechanism, coordinated checkpoint/restart, is unlikely to scale well !  Resource contention introduced by coordination

!  Uncoordinated solutions have been proposed, but also are unlikely to scale well !  Communication-induced delays propagate between processes

(Ferreira et al. SC’14)

!  Our simulation-based exploration suggests that there is a “sweet spot” between the two coordination extremes !  We call this approximate coordination !  Characterizes the tradeoff between relieving resource contention

and avoiding delay propagation !  Achievable at reasonable cost

Page 3: Horseshoes & hand grenades (& coordination)

Coordinated C/R unlikely to scale Sy

stem

MTB

F (h

ours

)

Effic

ienc

y (%

)

Socket Count

25 year socket MTBF% Efficiency

0.1

1

10

100

1000

1024 2048

4096 8192

16384 32768

65536 131072

262144

0 %

20 %

40 %

60 %

80 %

100 %

Core Count

Page 4: Horseshoes & hand grenades (& coordination)

Coordinated checkpoint/restart (CCR)

PROCESS 3

PROCESS 2

PROCESS 4

PROCESS 1 δ

δ

δ

δ

t1 t2 t3

Page 5: Horseshoes & hand grenades (& coordination)

Uncoordinated Checkpointing (UCR)

δ

δ

PROCESS 3

PROCESS 2

PROCESS 4

PROCESS 1

t1 t3 t2

Page 6: Horseshoes & hand grenades (& coordination)

Advantages of uncoordinated checkpointing

•  Each process checkpoints independently, avoids coordination overheads

•  Logging of messages ensures all checkpoints are consistent

•  Upon failure, only failed node(s) restarts rather than all nodes with coordinated checkpoint/restart

Page 7: Horseshoes & hand grenades (& coordination)

Uncoordinated overheads greater than coordinated due to analogy with “jitter”

Coordinated C/R Uncoordinated C/R

Effic

ienc

y (%

)

Processes

LAMMPSHPCCG

CTH

0

20

40

60

80

100

128256

5121Ki

2Ki4Ki

8Ki16Ki

32Ki64Ki

128Ki

Effic

ienc

y (%

)

Processes

LAMMPSHPCCG

CTH

0

20

40

60

80

100

128256

5121Ki

2Ki4Ki

8Ki16Ki

32Ki64Ki

128Ki

We want to explore the space in between these extremes

Page 8: Horseshoes & hand grenades (& coordination)

Exploration via simulation

!  We have a validated, well-exercised simulation framework, LogGOPSim (Levy et al. PMBS 2014)

!  Checkpointing activity is modeled as CPU detours !  Periods of time during which the CPU is taken from the application

!  Simulate application behavior by replaying collected communication traces !  All communication dependencies are reproduced

!  Traces can be extrapolated from small application runs !  Collect trace of app with p processes, extrapolate to k * p !  Communication traces for MPI collectives are reproduced exactly

!  Validated for extrapolation and prediction of local checkpointing overheads !  Hoefler et al. SC’10, Ferreira et al. SC’14, Hoefler et al. HPDC’10

Page 9: Horseshoes & hand grenades (& coordination)

Effect of different levels of approximate coordination

!  We simulated executions of a set of workloads with varying degrees of checkpointing coordination !  LAMMPS – molecular dynamics simulation !  CTH – large material deformation / strong shock modeling !  HPCCG – conjugate gradient solver mini-app from Mantevo !  LULESH – shock hydrodynamics mini-app

!  Vary the standard deviation of the normal distribution used to produce starting offsets

!  Measure application slowdown in each case, for different process counts

Page 10: Horseshoes & hand grenades (& coordination)

Varying the degree of coordination

Relative Offset from Mean (sec.)

Prob

abilit

y Den

sity

10−3

10−2

10−1

100

−50 −25 0 25 50

stddev=100sstddev=20sstddev=1sstddev=0.1s

Drawing offsets from a normal distribution and varying the standard deviation simulates different degrees of coordination

Page 11: Horseshoes & hand grenades (& coordination)

Simulating the role of coordination !  LogGOPSim uses a detour trace to represent checkpoint

events

(time, duration) (time, duration)

…. (time, duration)

(time, duration) (time, duration)

…. (time, duration)

Coordinated all processes begin trace from offset 0

Completely uncoordinated processes begin trace from uniformly distributed random

offsets

Page 12: Horseshoes & hand grenades (& coordination)

LAMMPS/LJ

Page 13: Horseshoes & hand grenades (& coordination)

HPCCG

Page 14: Horseshoes & hand grenades (& coordination)

CTH

Page 15: Horseshoes & hand grenades (& coordination)

LULESH

Page 16: Horseshoes & hand grenades (& coordination)

All applications at 32Ki processes

Page 17: Horseshoes & hand grenades (& coordination)

Simulation indicates relatively loose coordination is sufficient

!  For normally distributed offsets with stddev=100ms !  On average 95% of checkpoints happen within a 200ms window !  Application runtime increases less than 5% !  This is well within the capabilities of systems with hardware support

!  Dedicated global interconnects or GPS cards have achieved skews ~ 1 us

!  Even extremely loose coordination is viable !  Approximately 10% slowdown for our studied workloads !  Easily realizable in modern HPC systems !  Also possible with acceptable reliabliity in wide-area/cloud contexts

!  Diminishing returns – tighter synchronization does not result in improved performance

!  Insensitive to scale of application

Page 18: Horseshoes & hand grenades (& coordination)

In conclusion

!  We have studied approximate coordination in uncoordinated checkpointing !  Simulation technique with validated simulation infrastructure !  Demonstrated with well-studied and important workloads

!  This type of simulation approach will be valuable for exascale application co-design

!  Approximate synchronization appears viable !  Even modest coordination levels give acceptable performance !  Dedicated hardware is not necessary !  And you may get it for free depending on your collective

operations (EuroMPI 2016) !  Acknowledgements

!  Support: DOE ASC