Horseshoes & hand grenades (& coordination)

Photos placed in horizontal position with even amount

of white space between photos

and header

Photos placed in horizontal position

with even amount of white space

between photos and header

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND2016-5027C

Horseshoes & hand grenades (& coordination)

The case for approximate coordination in local checkpointing

protocols

Patrick M. Widener, Kurt Ferreira, Scott Levy

Center for Computing Research Sandia National Laboratories

Albuquerque, NM, USA patrick.widener@sandia.gov

Overview

!  Today’s predominant HPC resilience mechanism, coordinated checkpoint/restart, is unlikely to scale well !  Resource contention introduced by coordination

!  Uncoordinated solutions have been proposed, but also are unlikely to scale well !  Communication-induced delays propagate between processes

(Ferreira et al. SC’14)

!  Our simulation-based exploration suggests that there is a “sweet spot” between the two coordination extremes !  We call this approximate coordination !  Characterizes the tradeoff between relieving resource contention

and avoiding delay propagation !  Achievable at reasonable cost

Coordinated C/R unlikely to scale Sy

Socket Count

25 year socket MTBF% Efficiency

1024 2048

4096 8192

16384 32768

65536 131072

262144

Core Count

Coordinated checkpoint/restart (CCR)

PROCESS 3

PROCESS 2

PROCESS 4

PROCESS 1 δ

t1 t2 t3

Uncoordinated Checkpointing (UCR)

PROCESS 3

PROCESS 2

PROCESS 4

PROCESS 1

t1 t3 t2

Advantages of uncoordinated checkpointing

•  Each process checkpoints independently, avoids coordination overheads

•  Logging of messages ensures all checkpoints are consistent

•  Upon failure, only failed node(s) restarts rather than all nodes with coordinated checkpoint/restart

Uncoordinated overheads greater than coordinated due to analogy with “jitter”

Coordinated C/R Uncoordinated C/R

Processes

LAMMPSHPCCG

128256

5121Ki

2Ki4Ki

8Ki16Ki

32Ki64Ki

Processes

LAMMPSHPCCG

128256

5121Ki

2Ki4Ki

8Ki16Ki

32Ki64Ki

We want to explore the space in between these extremes

Exploration via simulation

!  We have a validated, well-exercised simulation framework, LogGOPSim (Levy et al. PMBS 2014)

!  Checkpointing activity is modeled as CPU detours !  Periods of time during which the CPU is taken from the application

!  Simulate application behavior by replaying collected communication traces !  All communication dependencies are reproduced

!  Traces can be extrapolated from small application runs !  Collect trace of app with p processes, extrapolate to k * p !  Communication traces for MPI collectives are reproduced exactly

!  Validated for extrapolation and prediction of local checkpointing overheads !  Hoefler et al. SC’10, Ferreira et al. SC’14, Hoefler et al. HPDC’10

Effect of different levels of approximate coordination

!  We simulated executions of a set of workloads with varying degrees of checkpointing coordination !  LAMMPS – molecular dynamics simulation !  CTH – large material deformation / strong shock modeling !  HPCCG – conjugate gradient solver mini-app from Mantevo !  LULESH – shock hydrodynamics mini-app

!  Vary the standard deviation of the normal distribution used to produce starting offsets

!  Measure application slowdown in each case, for different process counts

Varying the degree of coordination

Relative Offset from Mean (sec.)

abilit

10−3

10−2

10−1

−50 −25 0 25 50

stddev=100sstddev=20sstddev=1sstddev=0.1s

Drawing offsets from a normal distribution and varying the standard deviation simulates different degrees of coordination

Simulating the role of coordination !  LogGOPSim uses a detour trace to represent checkpoint

events

(time, duration) (time, duration)

…. (time, duration)

(time, duration) (time, duration)

…. (time, duration)

Coordinated all processes begin trace from offset 0

Completely uncoordinated processes begin trace from uniformly distributed random

offsets

LAMMPS/LJ

LULESH

All applications at 32Ki processes

Simulation indicates relatively loose coordination is sufficient

!  For normally distributed offsets with stddev=100ms !  On average 95% of checkpoints happen within a 200ms window !  Application runtime increases less than 5% !  This is well within the capabilities of systems with hardware support

!  Dedicated global interconnects or GPS cards have achieved skews ~ 1 us

!  Even extremely loose coordination is viable !  Approximately 10% slowdown for our studied workloads !  Easily realizable in modern HPC systems !  Also possible with acceptable reliabliity in wide-area/cloud contexts

!  Diminishing returns – tighter synchronization does not result in improved performance

!  Insensitive to scale of application

In conclusion

!  We have studied approximate coordination in uncoordinated checkpointing !  Simulation technique with validated simulation infrastructure !  Demonstrated with well-studied and important workloads

!  This type of simulation approach will be valuable for exascale application co-design

!  Approximate synchronization appears viable !  Even modest coordination levels give acceptable performance !  Dedicated hardware is not necessary !  And you may get it for free depending on your collective

operations (EuroMPI 2016) !  Acknowledgements

!  Support: DOE ASC

Horseshoes & hand grenades (& coordination)

Documents

TM 43-0001-29 Grenades

Almost horseshoes presentation

Grenades, Mines and Boobytraps

Handbook 1741A Grenades 1918

Shamrocks and horseshoes

Employ Hand Grenades

Grenades - Militaria WWIImilitaria.wwii.free.fr/doc/-grenades/german hand and... · 2007-12-18 · Title: Grenades Author: Unknown Created Date: Thursday, June 28, 2001 8:38:18 PM

About grenades

Take Aim - Test Equipment Depot...Close Only Counts... Everyone knows that close only counts in horseshoes and hand grenades, and that is definitely true for HVACR professionals, because

GRENADES AND PYROTECHNICS - The Eye Intelligence... · A IDENTIFICATION - COMMON USER HAND-GRENADES B IDENTIFICATION - FRAGMENTATION-GRENADES C IDENTIFICATION - SMOKE- GRENADES CHAPTER

Royal kerckhaert horseshoes tool catalog

Horseshoes and Hand Grenades: Frank v.Gaos and the Problem

Royal Kerckhaert Horseshoes > Home

Grenades Web Article

Hand Grenades

40 MM GRENADES - gim.rs

HORSESHOES. Horseshoes is a popular backyard, recreational and tournament game. It involves throwing horseshoes toward a small post for points, where

Grenades - talpo.it

MEDINA LAKE Scuba Diving - Fishing Medina Lake Mico TXScuba Diving HT8 HT1 HT2 HT3 HT4 HT7 HT6 HT5 Horseshoes Horseshoes Horseshoes. Title: Chrisornelas_map Created Date: 4/2/2017

Horseshoes and Hand Grenades: THE Dodd-Frank Wall Street