34
mSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi, Sarita Adve Department of Computer Science University of Illinois at Urbana-Champaign [email protected]

MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Embed Size (px)

Citation preview

Page 1: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

mSWAT: Low-Cost Hardware Fault Detection

and Diagnosis for Multicore Systems

Siva Kumar Sastry Hari, Man-Lap (Alex) Li,

Pradeep Ramachandran, Byn Choi, Sarita Adve

Department of Computer Science

University of Illinois at Urbana-Champaign

[email protected]

Page 2: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Motivation

• Hardware will fail in-the-field due to several reasons

Þ Need in-field detection, diagnosis, repair, and recovery

• Reliability problem pervasive across many markets

– Traditional redundancy solutions (e.g., nMR) too expensive

Need low-cost solutions for multiple failure sources* Must incur low area, performance, power overhead

Transient errors(High-energy particles )

Wear-out(Devices are weaker)

Design Bugs… and so on

Page 3: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

SWAT: Low-Cost Hardware Reliability

SWAT Observations

• Need handle only hardware faults that propagate to software

• Fault-free case remains common, must be optimized

SWAT Approach

Watch for software anomalies (symptoms)

Zero to low overhead “always-on” monitors

Diagnose cause after symptom detected

May incur high overhead, but rarely invoked

3

Page 4: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

SWAT Framework Components

Fault Error Symptomdetected

Recovery

Diagnosis Repair

Checkpoint Checkpoint

Detectors with simple hardware [Li et al. ASPLOS’08]

µarch-level Fault Diagnosis (TBFD)

[Li et al. DSN’09] 4

Page 5: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Challenge

Fault Error Symptomdetected

Recovery

Diagnosis Repair

Checkpoint Checkpoint

Detectors with simple hardware [Li et.al. ASPLOS’08]

µarch-level Fault Diagnosis (TBFD)

[Li et.al. DSN’09] 5

Shown to work well for single-threaded apps

Does SWAT approach work on multithreaded apps?

Page 6: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Challenge: Data sharing in multithreaded apps

• Multithreaded apps share data among threads

• Does symptom detection work?

• Symptom causing core may not be faulty

– How to diagnose faulty core?6

Memory

Store

Symptom Detectionon a fault-free core

Load

Core 1 Core 2

Error

Fault

Page 7: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Contributions

• Evaluate SWAT detectors on multithreaded apps

– Low Silent Data Corruption rate for multithreaded apps

– Observed symptom from fault-free cores

• Novel fault diagnosis for multithreaded apps

– Identifies the faulty core despite error propagation

– Provides high diagnosability

7

Page 8: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Outline

• Motivation

• mSWAT Detection

• mSWAT Diagnosis

• Results

• Summary and Future Work

8

Page 9: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

mSWAT Fault Detection

• SWAT Detectors:

– Low-cost monitors to detect anomalous sw behavior

– Incur near-zero perf overhead in fault-free operation

• Symptom detectors provide low Silent Data Corruption rate

SWAT firmware

Fatal Traps

Division by zero,RED state, etc.

Kernel Panic

OS enters panicState due to fault

High OS

High contiguousOS activity

Hangs

Simple HW hangdetector

App Abort

Application abort due to fault

Page 10: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

SWAT Fault Diagnosis

• Rollback/replay on same/different core

– Single-threaded application on multicore

No symptom Symptom

Deterministic s/w orPermanent h/w bug

Symptom detected

Faulty

Rollback on faulty core

Rollback/replay on good core

Continue Execution

Transient or Non-deterministic s/w bug

Symptom

Permanenth/w fault,

needs repair!

No symptom

Deterministic s/w bug, send to s/w layer

10

Good

Page 11: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Challenges

• Rollback/replay on same/different core

– Single-threaded application on multicore

No symptom Symptom

Deterministic s/w orPermanent h/w bug

Symptom detected

Rollback on faulty core

Rollback/replay on good core

Continue Execution

Transient or Non-deterministic s/w bug

Symptom

Permanenth/w fault,

needs repair!

No symptom

Deterministic s/w bug, send to s/w layer

11

Faulty Good

No known good cores available

Faulty core is unknown

How to replay multithreaded apps?

Page 12: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Extending SWAT Diagnosis to Multithreaded Apps

• Assumptions: In-core faults, single core fault model

• Naïve extension – N known good cores to replay the trace

Too expensive – area

Requires full-system deterministic replay

• Simple optimization – One spare core

Not scalable, requires N full-system deterministic replays

High hardware overhead – requires a spare core

Single point of failure – spare core12

Faulty core is C2C1 C2 C3 No Symptom DetectedSpare

C1 C2 C3 Symptom DetectedSpare

C1 C2 C3 Symptom DetectedSpare

Page 13: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

mSWAT Diagnosis - Key Ideas

13

Challenges

Multithreaded applications

Full-system deterministic

replay

No known good core

Emulated TMR

TA TB TC TD

TA

TA TB TC TD

TA

A B C D

TA

A B C D

Key IdeasIsolated

deterministic replay

Page 14: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

mSWAT Diagnosis - Key Ideas

14

Challenges

Multithreaded applications

Full-system deterministic

replay

No known good core

Emulated TMR

TA TB TC TD

TA

TA TB TC TD

A B C DA B C D

Key IdeasIsolated

deterministic replay

TA TB

TA TB

Page 15: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

mSWAT Diagnosis - Key Ideas

15

Challenges

Multithreaded applications

Full-system deterministic

replay

No known good core

Emulated TMR

TA TB TC TD

TA

TA TB TC TD

A B C DA B C D

Key IdeasIsolated

deterministic replay

TA TB TC

TC TA TB

Page 16: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

mSWAT Diagnosis - Key Ideas

16

Challenges

Multithreaded applications

Full-system deterministic

replay

No known good core

Emulated TMR

TA TB TC TD

TA

TA TB TC TD

A B C DA B C D

Key IdeasIsolated

deterministic replay

TD TA TB TC

TC TD TA TBMaximum 3 replays

Page 17: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Multicore Fault Diagnosis Algorithm Overview

17

Replay & capture fault

activating trace

TA TB TC TD

A B C D

Example

Symptom

detected

Diagnosis

Deterministically replay

captured trace

Page 18: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Multicore Fault Diagnosis Algorithm Overview

18

Replay & capture fault

activating trace

Symptom

detected

Diagnosis

Deterministically replay

captured trace

TA TB TC TD

A B C D A B C D

Example

Page 19: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Multicore Fault Diagnosis Algorithm Overview

19

Replay & capture fault

activating trace

Symptom

detected

Diagnosis

Deterministically replay

captured trace

TA TB TC TD

A B C DTD TA TB TC

A B C D

Example

Look for

divergence

Page 20: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Multicore Fault Diagnosis Algorithm Overview

20

Replay & capture fault

activating trace

Symptom

detected

Diagnosis

Deterministically replay

captured trace

Look for

divergence

TA TB TC TD

A B C DTD TA TB TC

A B C D

Divergence

Example

TA

A B C DFaulty core is A

Fault-free Cores

Divergence

Faulty core

Page 21: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Digging Deeper

21

Symptom

detected

Replay &capture fault

activating trace

Isolateddeterministic

replay

Faulty core

Look for divergence

What info to capture to enableisolated deterministic replay?

How to identify divergence?

Hardware costs?

Page 22: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Enabling Isolated Deterministic Replay

22

Thread

Input to thread

Ld

Ld

Ld

Ld

• Recording thread inputs sufficient – similar to BugNet

• Record all retiring loads values

Page 23: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Digging Deeper (Contd.)

23

Symptom

detected

Replay &capture fault

activating trace

Isolateddeterministic

replay

Faulty core

Look for divergence

How to identify divergence?

Trace Buffer

What info to capture to enable isolated deterministic replay?

Hardware costs?

Page 24: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Identifying Divergence

24

Thread

• Comparing all instructions Large buffer requirement

• Faults corrupt software through memory and control instrns

• Capture memory and control instructions

Store

Load

Branch

Store

Page 25: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Digging Deeper (Contd.)

25

Symptom

detected

Replay &capture fault

activating trace

Isolateddeterministic

replay

Faulty core

Look for divergence

How to identify divergence?

Trace Buffer

What info to capture to enable isolated deterministic replay?

Hardware costs?

Page 26: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

How Big?

Hardware Costs

26

Native Execution

Memory Backed LogSmall hardware support

Firmware Emulation

Minor support for firmware reliability

Symptom

detected

Replay &capture fault

activating trace

Isolateddeterministic

replay

Faulty core

Look for divergence

Trace Buffer

What if the faulty core subverts the process?

Key Idea: On a divergence two cores take over

Page 27: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Trace Buffer Size

• Long detection latency large trace buffers (8MB/core)

– Need to reduce the size requirement

Iterative Diagnosis Algorithm

27

Repeatedly execute on short tracese.g. 100,000 instrns

Symptom

detected

Replay &capture fault

activating trace

Isolateddeterministic

replay

Faulty core

Look for divergence

Trace Buffer

Page 28: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Experimental Methodology

• Microarchitecture-level fault injection

– GEMS timing models + Simics full-system simulation

– Six multithreaded applications on OpenSolaris* 4 Multimedia apps and 1 each from SPLASH and PARSEC

– 4 core system running 4-threades apps

• Faults in latches of 7 arch units

– Permanent (stuck-at) and transients faults

28

Page 29: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Experimental Methodology

Detection:

• Metrics: SDC Rate, detection latency

Diagnosis:

• Iterative algorithm with 100,000 instrns in each iteration

• Until divergence or 20M instrns

• Deterministic replay is native execution

• Not firmware emulated

• Metrics: Diagnosability, overheads29

10M instr

Timing simulation

If no symptom in 10M instr, run to completion

Functional simulation

Fault

Masked or Silent Data Corruption (SDC)

Page 30: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Results: mSWAT Detection Summary

• SDC Rate: Only 0.2% for permanents & 0.55% for transients

• Detection Latency: Over 99% detected within 10M instrns 30

Permanents Transients0%

20%

40%

60%

80%

100%

SDC

DUE

Detect-Faulty

Masked

Per

cen

tag

e o

f in

ject

ion

s

0.2% 0.55%

Detected

Page 31: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Results: mSWAT Detection Summary

• SDC Rate: Only 0.2% for permanents & 0.55% for transients

• Detection Latency: Over 99% detected within 10M instrns 31

Permanents Transients0%

20%

40%

60%

80%

100%

SDC

DUE

Detect-Fault-Free

Detect-Faulty

Masked

Per

cen

tag

e o

f in

ject

ion

s

0.2% 0.55% 4.5% detected in a good core

Page 32: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Results: mSWAT Diagnosability

• Over 95% of detected faults are successfully diagnosed

• All faults detected in fault-free core are diagnosed

• Undiagnosed faults: 88% did not activate faults32

0%

20%

40%

60%

80%

100%

CorrectlyDiagnosed Undiagnosed

Pe

rce

nta

ge

of

De

tec

ted

Fa

ult

s 99 99 99 86 100 80 99 95.9

Page 33: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

Results: mSWAT Diagnosis Overheads

• Diagnosis Latency

– 98% diagnosed <10 million cycles (10ms in 1GHz system)

– 93% were diagnosed in 1 iteration

* Iterative approach is effective

• Trace Buffer size

– 96% require <400KB/core

* Trace buffer can easily fit in L2 or L3 cache

33

Page 34: MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,

mSWAT Summary

• Detection: Low SDC rate, detection latency

• Diagnosis – identifying the faulty core

– Challenges: no known good core, deterministic replay

– High diagnosability with low diagnosis latency

– Low Hardware overhead - Firmware based implementation

– Scalable – maximum 3 replays for any system

• Future Work:

– Reducing SDCs, detection latency, recovery overheads

– Extending to server apps; off-core faults

– Validation on FPGAs (w/ Michigan)

34