Oct. 2006 Terminology, Models, and Measures Slide 1 Fault-Tolerant Computing Motivation, Background, and Tools


Fault-Tolerant Computing

Motivation, Background, and Tools


About This Presentation

Edition: First

Released: Oct. 2006

This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami


Terminology, Models, and Measures for Dependability



Impairments to Dependability

Error

Malfunction

Degradation

Failure

Fault

Intrusion

Hazard

Defect

Flaw

Bug

Crash


The Fault-Error-Failure Cycle

Schematic diagram of the Newcastle hierarchical model and the impairments within one level.

Aspect Impairment

Structure Fault

State Error

Behavior Failure

Includes both components and design

[Figure: logic-circuit example with the labels "0 0", "Fault", "Correct signal", and "Replaced with NAND?"]


The Four-Universe Model

Cause-effect diagram for Avižienis’ four-universe model of impairments to dependability.

Universe Impairment

Physical Failure

Logical Fault

Informational Error

External Crash


Unrolling the Fault-Error-Failure Cycle

Cause-effect diagram for an extended six-level view of impairments to dependability.

Abstraction Impairment

Component Defect

Logic Fault

Information Error

System Malfunction

Service Degradation

Result Failure

Level labels: Low-Level, Mid-Level, High-Level

Cycle labels: First Cycle, Second Cycle

Aspect Impairment

Structure Fault

State Error

Behavior Failure


Multilevel Model

Levels: Component, Logic, Information, System, Service, Result

States: Ideal, Defective, Faulty, Erroneous, Malfunctioning, Degraded, Failed

State groups: Low-Level Impaired, Mid-Level Impaired, High-Level Impaired

Legend: Initial Entry, Deviation, Remedy, Tolerance, Entry


Analogy for the Multilevel Model

An analogy for our multi-level model of dependable computing. Defects, faults, errors, malfunctions, degradations, and failures are represented by pouring water from above. Valves represent avoidance and tolerance techniques. The goal is to avoid overflow.

Wall heights represent inter-level latencies

Drain valves represent tolerance techniques

Concentric reservoirs are analogs of the six model levels, with defect being innermost


Inlet valves represent avoidance techniques


Why Our Concern with Dependability?

Reliability of an n-transistor system, each transistor having failure rate $\lambda$:

$R(t) = e^{-n\lambda t}$

There are only 3 ways of making systems more reliable:

Reduce $\lambda$

Reduce $n$

Reduce $t$

[Plot: $e^{-n\lambda t}$ (vertical axis, 0.0 to 1.0) versus $nt$ (ticks at $10^4$, $10^6$, $10^8$, $10^{10}$), with marked values .9999, .9990, .9900, .9048, .3679]

Alternative: Change the reliability formula by introducing redundancy in the system
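As a rough numerical illustration of the reliability formula on this slide, the sketch below evaluates R(t) = exp(-n * lambda * t), where lambda is the per-transistor failure rate. The function name and the sample values of n, lambda, and t are assumptions chosen only for illustration; they are not taken from the presentation.

```python
import math

def series_reliability(n: int, failure_rate: float, t: float) -> float:
    """Reliability at time t of n components that must all work,
    each with the same constant failure rate."""
    return math.exp(-n * failure_rate * t)

# Hypothetical numbers: 1e6 transistors, 1e-10 failures per transistor-hour.
n, lam = 10**6, 1e-10
for hours in (1, 10, 100, 1_000, 10_000):
    r = series_reliability(n, lam, hours)
    print(f"t = {hours:>6} h   n*lambda*t = {n * lam * hours:.0e}   R = {r:.4f}")
```

With these assumed values the product n * lambda * t runs from 1e-4 to 1, so the printed reliabilities fall from near 1 toward 1/e. Shrinking any one of the three factors moves the result back toward 1.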


Highly Dependable Computer Systems

Long-life systems: Fail-slow, Rugged, High-reliability
Spacecraft with multiyear missions, systems in inaccessible locations
Methods: Replication (spares), error coding, monitoring, shielding

Safety-critical systems: Fail-safe, Sound, High-integrity
Flight control computers, nuclear-plant shutdown, medical monitoring
Methods: Replication with voting, time redundancy, design diversity

High-availability: Fail-soft, Robust, High-availability
Telephone switching centers, transaction processing, e-commerce
Methods: HW/info redundancy, backup schemes, hot-swap, recovery

Just as performance enhancement techniques gradually migrate from supercomputers to desktops, so too dependability enhancement methods find their way from exotic systems into personal computers


Aspects of Dependability

Aspects: Reliability, Maintainability, Availability, Performability, Security, Integrity, Serviceability, Testability, Safety, Robustness, Resilience

Associated measures and concepts shown in the figure:

Reliability, MTTF = MTFF

Risk, consequence

Controllability, observability

Performability, MCBF

Pointwise av., Interval av., MTBF, MTTR


Concepts from Probability Theory

Cumulative distribution function (CDF): $F(t) = \mathrm{prob}[x \le t] = \int_0^t f(x)\,dx$

Probability density function (pdf): $f(t) = \mathrm{prob}[t \le x \le t + dt] / dt = dF(t) / dt$

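[Figure: three plots over Time 0 to 50: lifetimes of 20 identical systems, the CDF $F(t)$ (vertical axis 0.0 to 1.0), and the pdf $f(t)$ (vertical axis 0.00 to 0.05)]

Expected value of x: $E_x = \int_{-\infty}^{+\infty} x\,f(x)\,dx$ (continuous) or $E_x = \sum_k x_k\,f(x_k)$ (discrete)

Variance of x: $\sigma_x^2 = \int_{-\infty}^{+\infty} (x - E_x)^2 f(x)\,dx$ (continuous) or $\sigma_x^2 = \sum_k (x_k - E_x)^2 f(x_k)$ (discrete)

Covariance of x and y: $\psi_{x,y} = E[(x - E_x)(y - E_y)] = E[xy] - E_x E_y$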
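The discrete forms of the expected value, variance, and covariance on this slide can be checked with a short sketch. The function names and both sample distributions below are made up for illustration only; they are not data from the presentation.

```python
def expected_value(values, probs):
    """Discrete expected value: sum over k of x_k * f(x_k)."""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """Discrete variance: sum over k of (x_k - E_x)^2 * f(x_k)."""
    mean = expected_value(values, probs)
    return sum((x - mean) ** 2 * p for x, p in zip(values, probs))

def covariance(joint):
    """Covariance as E[xy] - E_x * E_y; joint maps (x, y) to a probability."""
    ex = sum(x * p for (x, _), p in joint.items())
    ey = sum(y * p for (_, y), p in joint.items())
    exy = sum(x * y * p for (x, y), p in joint.items())
    return exy - ex * ey

# Hypothetical lifetime distribution (time units -> probability).
lifetimes = [10, 20, 30, 40]
probs = [0.1, 0.4, 0.3, 0.2]
print(expected_value(lifetimes, probs), variance(lifetimes, probs))

# Hypothetical joint distribution of two 0/1 indicator variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(covariance(joint))
```

A positive covariance, as in the second example, means the two variables tend to take high values together.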


Some Simple Probability Distributions

[Figure: CDF $F(x)$ and pdf $f(x)$ plots for four distributions: Uniform, Exponential, Normal, Binomial]


Reliability and MTTF

Reliability: R(t)

Probability that system remains in the “Good” state through the interval [0, t]

Two-state nonrepairable system

[State diagram: Start state Good; Failure takes Good to Failed]

$R(t + dt) = R(t)\,[1 - z(t)\,dt]$, where $z(t)$ is the hazard function

Constant hazard function $z(t) = \lambda$ gives $R(t) = e^{-\lambda t}$ (system failure rate is independent of its age). This is the exponential reliability law.

$R(t) = 1 - F(t)$, where $F(t)$ is the CDF of the system lifetime, or its unreliability

Mean time to failure: $\mathrm{MTTF} = \int_0^{\infty} t\,f(t)\,dt = \int_0^{\infty} R(t)\,dt$

The first integral is the expected value of the lifetime; the second is the area under the reliability curve (the equality is easily provable)
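The claim that MTTF equals the area under the reliability curve can be checked numerically. The sketch below assumes a constant hazard rate (called lam here), for which R(t) = exp(-lam * t) and the closed-form MTTF is 1/lam. The helper names, the sample rate, and the integration settings are illustrative assumptions.

```python
import math

def exponential_reliability(t: float, lam: float) -> float:
    """R(t) for a constant hazard rate lam."""
    return math.exp(-lam * t)

def mttf_by_integration(reliability, horizon: float, steps: int = 100_000) -> float:
    """Approximate the integral of R(t) over [0, horizon] with the trapezoidal rule."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    total += sum(reliability(i * dt) for i in range(1, steps))
    return total * dt

lam = 0.01  # hypothetical failure rate per hour
# Integrate far enough out (50 mean lifetimes) that the neglected tail is negligible.
estimate = mttf_by_integration(lambda t: exponential_reliability(t, lam), horizon=50 / lam)
print(f"numerical MTTF = {estimate:.3f} h, closed form 1/lam = {1 / lam:.3f} h")
```

The same integrator works for any reliability function passed in, provided the horizon is long enough for R(t) to have decayed to nearly zero.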


Failure Distributions of Interest

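Exponential: $z(t) = \lambda$, $R(t) = e^{-\lambda t}$, $\mathrm{MTTF} = 1/\lambda$

Weibull: $z(t) = \alpha\lambda(\lambda t)^{\alpha - 1}$, $R(t) = e^{-(\lambda t)^{\alpha}}$, $\mathrm{MTTF} = (1/\lambda)\,\Gamma(1 + 1/\alpha)$

Erlang: $\mathrm{MTTF} = k/\lambda$

Gamma: Erlang and exponential are special cases

Normal: Reliability and MTTF formulas are complicated

Rayleigh: $z(t) = 2\lambda(\lambda t)$, $R(t) = e^{-(\lambda t)^2}$, $\mathrm{MTTF} = (1/\lambda)\,\sqrt{\pi}/2$

Discrete versions: Geometric ($R(k) = q^k$), Binomial, Discrete Weibull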
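The relationship among the exponential, Rayleigh, and Weibull laws can be seen in a small sketch. It assumes the standard two-parameter Weibull form with rate parameter lam and shape parameter alpha, that is, R(t) = exp(-(lam * t) ** alpha) and MTTF = (1/lam) * Gamma(1 + 1/alpha). The function names and the sample rate are illustrative assumptions.

```python
import math

def weibull_reliability(t: float, lam: float, alpha: float) -> float:
    """R(t) = exp(-(lam * t) ** alpha)."""
    return math.exp(-((lam * t) ** alpha))

def weibull_hazard(t: float, lam: float, alpha: float) -> float:
    """z(t) = alpha * lam * (lam * t) ** (alpha - 1), for t > 0."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_mttf(lam: float, alpha: float) -> float:
    """MTTF = (1 / lam) * Gamma(1 + 1 / alpha)."""
    return (1.0 / lam) * math.gamma(1.0 + 1.0 / alpha)

lam = 0.001  # hypothetical rate parameter, per hour
print(weibull_mttf(lam, 1.0))  # alpha = 1 reduces to the exponential case, 1/lam
print(weibull_mttf(lam, 2.0))  # alpha = 2 reduces to the Rayleigh case, sqrt(pi)/(2*lam)
print(weibull_hazard(500.0, lam, 1.0), weibull_hazard(500.0, lam, 2.0))
```

With alpha = 1 the hazard is constant; with alpha above 1 it grows with age, which is how the shape parameter models wear-out.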


Comparing Reliabilities

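Reliability functions for Systems 1 and 2: $R_1(t)$ and $R_2(t)$

[Figure: System Reliability (R): curves $R_1(t)$ and $R_2(t)$ versus Time (t), vertical axis 0.0 to 1.0, marking $\mathrm{MTTF}_1$, $\mathrm{MTTF}_2$, $t_M$ with $R_1(t_M)$ and $R_2(t_M)$, and $r_G$ with $T_1(r_G)$ and $T_2(r_G)$]

Reliability difference: $R_2 - R_1$

Reliability gain: $R_2 / R_1$

Reliability improvement factor: $\mathrm{RIF}_{2/1} = [1 - R_1(t_M)] / [1 - R_2(t_M)]$

Example: [1 – 0.9] / [1 – 0.99] = 10

Reliability improvement index: $\mathrm{RII} = \log R_1(t_M) / \log R_2(t_M)$

Mission time extension: $\mathrm{MTE}_{2/1}(r_G) = T_2(r_G) - T_1(r_G)$

Mission time improvement factor: $\mathrm{MTIF}_{2/1}(r_G) = T_2(r_G) / T_1(r_G)$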
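The comparison measures on this slide can be computed side by side. The sketch below assumes both systems follow the exponential law, and it reads T_i(r) as the time at which system i's reliability has dropped to r. That reading, the function names, and the sample failure rates are assumptions made for illustration.

```python
import math

def reliability(t: float, lam: float) -> float:
    """Exponential reliability, assumed here for both systems."""
    return math.exp(-lam * t)

def time_to_reach(r: float, lam: float) -> float:
    """T(r): time at which an exponential system's reliability equals r."""
    return -math.log(r) / lam

def rif(r1: float, r2: float) -> float:
    """Reliability improvement factor of system 2 over system 1."""
    return (1 - r1) / (1 - r2)

def compare(lam1: float, lam2: float, t_mission: float, r_goal: float) -> dict:
    r1, r2 = reliability(t_mission, lam1), reliability(t_mission, lam2)
    t1, t2 = time_to_reach(r_goal, lam1), time_to_reach(r_goal, lam2)
    return {
        "difference": r2 - r1,
        "gain": r2 / r1,
        "RIF": rif(r1, r2),
        "RII": math.log(r1) / math.log(r2),
        "MTE": t2 - t1,
        "MTIF": t2 / t1,
    }

# Hypothetical systems: system 2 has one tenth the failure rate of system 1.
measures = compare(lam1=1e-3, lam2=1e-4, t_mission=100.0, r_goal=0.95)
for name, value in measures.items():
    print(f"{name:>10}: {value:.4f}")
```

Under the exponential assumption both RII and MTIF reduce to the ratio of the two failure rates, which is one way to sanity-check the output.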


Availability, MTTR, and MTBF

(Interval) Availability: A(t)

Fraction of time that system is in the “Up” state during the interval [0, t]

Two-state repairable system

Availability = Reliability, when there is no repair

Availability is a function not only of how rarely a system fails (reliability) but also of how quickly it can be repaired (time to repair)

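Pointwise availability: a(t)

Probability that system is available at time t

$A(t) = (1/t) \int_0^t a(x)\,dx$

[State diagram: Start state Up; Failure takes Up to Down; Repair takes Down to Up]

Steady-state availability: $A = \lim_{t \to \infty} A(t)$

$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{\mathrm{MTTF}}{\mathrm{MTBF}} = \frac{\mu}{\lambda + \mu}$

Repair rate $\mu$: $1/\mu = \mathrm{MTTR}$ (will justify this equation later)

In general, $\mu \gg \lambda$, leading to $A \approx 1$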
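A minimal sketch of the steady-state availability formula follows, written both in terms of MTTF and MTTR and in terms of the failure rate (1/MTTF) and repair rate (1/MTTR). The function names and the sample figures are assumptions for illustration only.

```python
def steady_state_availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def availability_from_rates(failure_rate: float, repair_rate: float) -> float:
    """Same quantity expressed with rates: repair_rate / (failure_rate + repair_rate)."""
    return repair_rate / (failure_rate + repair_rate)

# Hypothetical system: fails on average every 1000 h, takes 2 h to repair.
mttf, mttr = 1000.0, 2.0
a = steady_state_availability(mttf, mttr)
downtime_hours_per_year = (1 - a) * 8760  # 8760 hours in a non-leap year
print(f"A = {a:.6f}, expected downtime = {downtime_hours_per_year:.1f} h/year")
```

Because repair is usually much faster than failure, the result sits close to 1, so the complementary downtime figure is often the more informative number.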


System Up and Down Times

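[Figure: timeline of Up and Down periods over the interval from 0 to t, marking the time to first failure, the time between failures, and the repair time, with time marks t1, t2, t'1, t'2]

[State diagram: Start state Up; Failure takes Up to Down; Repair takes Down to Up]

Short repair time implies good maintainability (serviceability)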


Performability and MCBF

Performability: P

Composite measure, incorporating both performance and reliability

Three-state degradable system

[State diagram: Start state Up 2; states Up 2, Up 1, Down; transitions labeled Partial failure, Failure, Partial repair, Repair]

Simple example: Worth of “Up2” twice that of “Up1”; $p_{\mathrm{Up}i}$ = probability system is in state Up$i$

$P = 2 p_{\mathrm{Up2}} + p_{\mathrm{Up1}}$

$p_{\mathrm{Up2}} = 0.92$, $p_{\mathrm{Up1}} = 0.06$, $p_{\mathrm{Down}} = 0.02$, $P = 1.90$ (system performance equiv. to that of 1.9 processors on average)

Performability improvement factor of this system (akin to RIF) relative to a fail-hard system that goes down when either processor fails:

$\mathrm{PIF} = (2 - 2 \times 0.92) / (2 - 1.90) = 1.6$
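The arithmetic of the performability example can be reproduced with a few lines. The state probabilities and worths are the ones quoted on the slide; the function and variable names are illustrative assumptions.

```python
def performability(state_probs: dict, state_worth: dict) -> float:
    """Weighted sum of state probabilities, with worth in processor-equivalents."""
    return sum(p * state_worth[state] for state, p in state_probs.items())

probs = {"Up2": 0.92, "Up1": 0.06, "Down": 0.02}
worth = {"Up2": 2, "Up1": 1, "Down": 0}

p_degradable = performability(probs, worth)

# Fail-hard baseline from the slide: the system does useful work only in Up2.
p_fail_hard = worth["Up2"] * probs["Up2"]

# PIF compares each system's shortfall from the ideal worth of 2.
best = worth["Up2"]
pif = (best - p_fail_hard) / (best - p_degradable)
print(f"P = {p_degradable:.2f}, fail-hard P = {p_fail_hard:.2f}, PIF = {pif:.2f}")
```

Changing the worth assigned to the Up1 state shows how sensitive the measure is to the value placed on degraded operation.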

Question: What is system availability here?


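System Up, Partially Up, and Down Times

[Figure: timeline of Up, Partially Up, and Down periods over the interval from 0 to t, with time marks t1, t2, t'2, t'1, t3, t'3, the events Partial Failure, Total Failure, and Partial Repair, and the label MCBF]

Important to prevent direct transitions to the “Down” state (coverage)

[State diagram: Start state Up 2; states Up 2, Up 1, Down; transitions labeled Partial failure, Failure, Partial repair, Repair]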


Integrity and Safety

Risk: Prob. of being in “Unsafe Failed” state

There may be multiple unsafe states, each with a different consequence (cost)

Three-state fail-safe system

[State diagram: Start state Good; one Failure transition takes Good to Safe Failed, another takes Good to Unsafe Failed]

Simple analysis: Lump “Safe Failed” state with “Good” state; proceed as in reliability analysis

More detailed analysis: Even though “Safe Failed” state is more desirable than “Unsafe Failed”, it is still not as desirable as the “Good” state; so keeping it separate makes sense

For example, if a repair transition is introduced between “Safe Failed” and “Good” states, we can tackle questions such as the expected outage of the system in safe mode, and thus its availability
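As a sketch of how the three states can be kept separate, the code below computes the state probabilities of a fail-safe system under assumptions that are not stated on the slide: both failure transitions leave the "Good" state at constant rates, and there is no repair. Under those assumptions the risk is the probability of the "Unsafe Failed" state. The function name and the sample rates are hypothetical.

```python
import math

def fail_safe_state_probs(t: float, safe_rate: float, unsafe_rate: float) -> dict:
    """State probabilities at time t for a Good state with two competing,
    constant-rate failure transitions and no repair (assumed model)."""
    total = safe_rate + unsafe_rate
    p_good = math.exp(-total * t)
    p_failed = 1.0 - p_good
    return {
        "Good": p_good,
        # A failure is safe or unsafe in proportion to the two rates.
        "Safe Failed": p_failed * safe_rate / total,
        "Unsafe Failed": p_failed * unsafe_rate / total,
    }

# Hypothetical rates per hour: most failures are caught and end in the safe state.
state_probs = fail_safe_state_probs(t=100.0, safe_rate=1e-3, unsafe_rate=1e-4)
for state, p in state_probs.items():
    print(f"{state:>13}: {p:.5f}")
print(f"risk = {state_probs['Unsafe Failed']:.5f}")
```

Adding a repair transition from "Safe Failed" back to "Good", as the slide suggests, would turn this into a repairable model whose solution is not covered by this closed form.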
