SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery

Daniel J. Sorin, et al., University of Wisconsin-Madison
Presented by: Nick Kirchem, March 26, 2004

Page 1:

SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery

Daniel J. Sorin, et al., University of Wisconsin-Madison

Presented by: Nick Kirchem, March 26, 2004

Page 2: Motivation and Goals

- Availability is crucial
  - Internet services and database management systems are highly relied upon
  - Unless the architecture changes, availability will decrease as the number of components increases
- Goals for the paper
  - A lightweight mechanism providing end-to-end recovery for transient and permanent faults
  - Decouple recovery from detection (use traditional detection techniques: RAID, ECC, duplicate ALUs, etc.)

Page 3: Solution: SafetyNet

- A global checkpoint/recovery scheme
  - Creates periodic system-wide logical checkpoints
  - Logs all changes to the architected state

Page 4: Recovery Scheme Challenges

1. Saving previous values before every register/cache update or coherence message would require too much storage space
2. All processors and components must recover to a consistent point
3. SafetyNet must determine when it is safe to roll back to a recovery checkpoint

Page 5: (1) Checkpointing Via Logging

- Checkpoints contain a complete copy of the system's architectural state
- Taken at coarse granularity (e.g., every 100,000 cycles)
- A log entry is created only on the first altering action per checkpoint interval (illustrated in the sketch below)
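A minimal sketch of this idea (the names and structures are illustrative, not the paper's hardware): repeated writes to the same block within one interval add only a single log entry, which is what bounds log growth per checkpoint.

```python
# Toy model: log a block's old value only on the first write to it
# within a checkpoint interval, never on later writes in that interval.
log = {}            # (interval, addr) -> pre-interval value (one entry max)
memory = {0x40: 0}  # toy "architected state"

def write(addr, value, interval):
    key = (interval, addr)
    if key not in log:               # first altering action this interval
        log[key] = memory.get(addr)  # save the old copy once
    memory[addr] = value

for v in range(1000):                # 1,000 writes in interval 7...
    write(0x40, v, interval=7)
print(len(log))                      # ...yield only 1 log entry
```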

Page 6: (2) Consistent Checkpoints

- All components coordinate local checkpoints through logical time
- A coherence transaction appears logically atomic once it completes
- Point of atomicity (PoA)
  - Occurs when the previous owner processes the request
  - The response includes the checkpoint number (CN) of this PoA
- The requestor does not advance its recovery point until all outstanding transactions are complete

Page 7:

(2) PoA Example
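The figure from the original slide is not reproduced here; the following is an illustrative scenario, not necessarily the paper's exact example. Suppose processor P1 requests ownership of a block held by P2. P2 happens to process the request during checkpoint interval 5, so interval 5 is the transaction's point of atomicity, and P2's data response carries CN 5. Even if P1 issued the request in interval 4 and the response arrives in interval 6, both sides attribute the ownership transfer to interval 5, so a recovery to any checkpoint at or before 5 undoes the transfer consistently on both nodes.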

Page 8: (3) Validating Checkpoints

- States: the current (active) state, checkpoints awaiting validation, and the recovery point
- Validation: determining which checkpoint becomes the recovery point
  - All execution prior to it must be fault-free
- Coordination is pipelined and performed in the background, off the critical path (see the sketch below)
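A toy sketch of this lifecycle (illustrative structure, not the paper's hardware): checkpoints sit in a pipeline between the recovery point and the active interval, and the recovery point advances as the oldest pending checkpoint validates.

```python
from collections import deque

recovery_point = 3          # oldest checkpoint the system can roll back to
pending = deque([4, 5, 6])  # checkpoints awaiting validation
current = 7                 # active interval, still being logged

def validate_oldest(fault_free: bool):
    """Runs in the background, once fault-detection latency has elapsed."""
    global recovery_point
    if fault_free:
        # Execution up to this checkpoint is known fault-free: advance the
        # recovery point and free the log state of the old one.
        recovery_point = pending.popleft()
    # On a detected fault, the system instead recovers to recovery_point.
```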

Page 9: (3) Validation, Continued

- Validation latency depends on fault detection latency
- Output commit problem
  - Delay all output events until their checkpoint is validated (see the sketch below)
  - The delay depends on validation latency
- Input commit problem
  - Log incoming messages
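A sketch of the output-commit rule (the interface is assumed for illustration): outbound events are buffered with their checkpoint number and released only once the recovery point has moved past that checkpoint, so no output that might later be rolled back ever escapes the system.

```python
out_buffer = []  # (checkpoint number, event) pairs held back from the outside

def emit(event, ccn):
    out_buffer.append((ccn, event))      # never release speculative output

def on_recovery_point_advanced(rpcn, send):
    # Assuming recovery to checkpoint rpcn re-executes interval rpcn onward,
    # output from any earlier interval can never be rolled back.
    global out_buffer
    ready      = [(cn, e) for cn, e in out_buffer if cn < rpcn]
    out_buffer = [(cn, e) for cn, e in out_buffer if cn >= rpcn]
    for _, e in ready:
        send(e)
```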

Page 10: Recovery

- Processors restore their register checkpoints
- Caches and memories unroll their local logs (see the sketch below)
- State from coherence transactions in progress is discarded
- Reconfiguration is performed if necessary
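A minimal sketch of the unroll step (toy structures, matching the logging sketch later in these slides): logged old values are applied newest-first for every interval at or after the recovery point.

```python
def recover(memory, clb, rpcn):
    """memory: addr -> value; clb: list of (ccn, addr, old_value) entries,
    oldest first; rpcn: recovery point checkpoint number."""
    for ccn, addr, old_value in reversed(clb):  # unroll newest-first
        if ccn >= rpcn:
            memory[addr] = old_value            # restore the saved old copy
    clb[:] = [e for e in clb if e[0] < rpcn]    # those entries are now consumed
    return memory
```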

Page 11:

Implementation

Page 12: Implementation (Cont'd)

- Checkpoint Log Buffers (CLBs)
  - Associated with each cache and memory component
  - Store log state
- Shadow registers
- 2D torus interconnect
- MOSI directory protocol

Page 13: Logical Time Base

- A loosely synchronous checkpoint clock, distributed redundantly
  - Ensures no single point of failure
- Each clock edge increments the current checkpoint number (CCN)
- Works as long as skew < minimum communication time between nodes (see the sketch below)
- Assigning a transaction to a checkpoint interval is protocol-dependent
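A toy rendering of the logical-time base (the latency and skew constants are hypothetical, for illustration only): each node derives its current checkpoint number from a loosely synchronized clock, and the skew bound keeps every node's view consistent with the messages it receives.

```python
CHECKPOINT_PERIOD = 100_000  # cycles per interval (the evaluation's value)

def ccn_at(local_cycle_count):
    """Current checkpoint number (CCN): increments once per clock edge."""
    return local_cycle_count // CHECKPOINT_PERIOD

# Consistency condition from the slide: as long as inter-node clock skew is
# smaller than the minimum node-to-node message latency, a message tagged in
# interval k cannot arrive at a node whose logical time is still before k.
MIN_MSG_LATENCY_CYCLES = 50   # hypothetical
MAX_SKEW_CYCLES = 20          # hypothetical
assert MAX_SKEW_CYCLES < MIN_MSG_LATENCY_CYCLES
```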

Page 14: Logging

- A memory block is written to the CLB whenever an action might have to be undone
- CLBs are write-only (except during recovery) and off the critical path
- A CN is added to each block in the cache
- Steps taken for an update action (sketched below):
  1. Compare the CCN with the block's CN
  2. Log the block if CCN >= CN
  3. Update the block's CN to CCN + 1
  4. Perform the update action
- The updated CN is sent with the coherence response
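A minimal sketch of these steps (illustrative Python, not the paper's hardware): a block is logged at most once per interval, because the first logged write bumps the block's CN past the current CCN.

```python
class ToyCache:
    def __init__(self):
        self.data = {}  # block address -> value
        self.cn   = {}  # block address -> checkpoint number (CN)
        self.clb  = []  # Checkpoint Log Buffer: (ccn, addr, old_value)

    def update(self, addr, new_value, ccn):
        """Apply an update action under current checkpoint number `ccn`."""
        block_cn = self.cn.get(addr, 0)
        if ccn >= block_cn:
            # First altering action on this block this interval: log old copy.
            self.clb.append((ccn, addr, self.data.get(addr)))
            # Later writes in interval `ccn` see ccn < CN and skip logging.
            self.cn[addr] = ccn + 1
        self.data[addr] = new_value
        return self.cn[addr]  # updated CN travels with the coherence response
```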

Page 15: Checkpoint Creation/Validation

- Choose a suitable checkpoint clock frequency, considering:
  - Detection latency tolerance
  - Total CLB storage
- Lost messages (detected by timeout) trigger recovery
- The recovery point checkpoint number (RPCN) is broadcast when the recovery point is advanced
- After a fault, a recovery message (including the RPCN) is sent
  - The interconnection network is drained
  - Processors, caches, and memories recover to the RPCN

Page 16: Implementation Summary

- Processor/cache changes required
  - The processor must be able to checkpoint its register state
  - Old versions must be copied out of the cache before transferring them
  - CNs added to each L1 cache block
- Directory protocol changes
  - CNs added to data response messages
  - Coherence requests can be nack'ed
  - A final ack is required from the requestor to the directory

Page 17: Evaluation Parameters

- 16-processor target system
- Simics + memory hierarchy simulator
- 4 commercial workloads, 1 scientific
- In-order processors, 4 billion instructions/sec
- MOSI directory protocol with 2D torus
- Checkpoint interval = 100,000 cycles

Page 18: Experiments

1. Fault-free performance
   - Overhead determined to be negligible
2. Dropped messages
   - Periodic transient faults injected (10/sec)
   - Recovery latency << crash + reboot
3. Lost switch
   - Hard fault: kill a half-switch
   - Crash avoided, but performance suffers due to restricted bandwidth

Page 19: Sensitivity Analysis

- Cache bandwidth
  - Depends on the frequency of stores that require logging (additional bandwidth is consumed reading the old copy of the block)
  - Cache ownership transfers: no additional bandwidth
- Storage cost
  - CLBs are sized to avoid performance degradation due to full buffers
  - Entries per checkpoint correspond to logging frequency

Page 20: Conclusion

- SafetyNet is efficient in the common case (error-free execution), adding little to no latency
- Latency is hidden by pipelining the validation of checkpoints
- Checkpoints are coordinated in logical time (no synchronization message exchanges necessary)

Page 21: Questions

- What about faults/errors within the saved state itself?
- What if there's a permanent fault for which you can't reconfigure (an endless loop of recovering to the last checkpoint)?