60
ECE 753: FAULT- TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Embed Size (px)

Citation preview

Page 1: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753: FAULT-TOLERANT COMPUTING

Kewal K.SalujaDepartment of Electrical and Computer Engineering

Motivation and IntroductionLecture Set 1

Page 2: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 2

Overview• Motivation

• About the Course and the Instructor

– Conduct, Outline, Coursepack

• Introduction

• Terminology and definitions– Sources, Overview and Comments

– System defined

• Dependability/Security and their attributes

• Threat to dependability and modeling FEF chain

• Means to attain dependability

• Fundamental Principles

Page 3: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 3

Motivation

• Informal Definition

• Key Attributes

• Who, What and Why Study

• Examples

Page 4: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 4

Motivation

• What is Fault-Tolerance?

A “fault-tolerant system” is one that continues to perform at desired level of service in spite of failures in some components that constitute the system.

Page 5: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 5

Motivation (contd.)

• Key attributes

Fault - Error - Failure

Performance - Availability - Reliability More recently concept of “survivability”

Inclusions of these constraints at design stage is likely to be more cost effective.

Page 6: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 6

Motivation (contd.)• Who is concerned about fault-tolerance?

– System Users – irrespective of the application but some are a lot more concerned than others

• Who is concerned at design stages?– Universities

• R, d, and a (Research, development, applications)– Industry

• r, D, and A (research, Development, Applications)• Issues

– Design, Analysis/Validation, Implementation, Testing/Validation, Evaluation

Page 7: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 7

Motivation (contd.)

Examples

• General Purpose Systems– PCs: RAMs with parity checks and possibly ECC

(consideration of re-execution on failure detection is being investigated)

– Workstations/Servers: error detection (HW), occasional corrective action (SW), Even ECC (HW), keeping log (SW)

Page 8: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 8

Motivation (contd.)

Examples

• Reliable Systems– Telephone systems– Banking systems e.g. ATM– Stock market– CAE - exams/projects– Football games - display/ticketing

Page 9: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 9

Motivation (contd.)

Examples

• Critical and Life Critical Systems– Manned and unmanned space borne systems– Aircraft control systems– Nuclear reactor control systems– Life support systems

Page 10: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 10

Motivation (contd.)

Examples

• Reliable -> Critical Systems– 911 telephone switching system– Traffic light control system– Automotive control systems (ABS, Fuel

injection system)

Page 11: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 11

About the Course and the Instructor

• Conduct– homeworks, exam, project, grading

• Outline

• Coursepack– references and reading list

Page 12: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 12

Introduction

– Historical perspective and major push

– New initiatives

– Goals of fault-tolerance

– Applications of fault-tolerance

Page 13: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 13

Introduction (contd.)• Historical Perspective

– not a new concept

– first use by J. van Neumann 1956• probabilistic logic and synthesis of reliable organism from

unreliable components, Annals of mathematical studies, Princeton University Press

• Major push– Space program

– HW Fault tolerance - then

– SW Fault tolerance later

– Merge the two

Page 14: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 14

Introduction (contd.)

• New initiativesDensity of devices more failures likely

Power issue – schedular, on-chip sensorsFailures due to soft-errors, life time degradations

- hardening, re-exection, - on-chip ECC- erconfiguration- microarchitectural solutions- architectural solutions

Page 15: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 15

Introduction (contd.)

• New initiatives (contd.)Deep submicron technology and time to market pressure designs not fully verified Implementation of numerous functionalities on

chip/board/system possibility of system hang-up

Speculative execution results may need to be re-checked

Low cost of HW and SW affordable/ecnomical

• Hot issues: Soft errors, Life-time failures, Power and Thermal Management

Page 16: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 16

Introduction (contd.)

• Goals - different goals for different applications

The key word is “reliability” – has different meaning for different users and applications

• Intuitive explanations– Dependability– Service– Specification

Page 17: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 17

Introduction (contd.)• Intuitive concepts

– Reliability – continues to work– Availability – works when I need it– Safety – does not put me in jeopardy – Performability – Maintainability– Testability– Survivability – will the system survive

catastrophic events?– Security

Page 18: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 18

Introduction (contd.)

• Applications– Space borne system

• long life system

– Airplane control system• critical system

– Transaction processing system• high availability system

– Switching system• high availability over certain level of performance

Page 19: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 19

Terminology and definitions

• Reliability and concept of probability– R(t): conditional probability that a system provides

continuous proper service in the interval [0,t] given that it provided desired service at time 0.

• Availability

• Performabiltiy – An Example

• Dependability

• Security

Page 20: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Sources, Overview and Comments (1/4)Key reference:

• Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr, Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, Jan-Mar 2004.

Other references:• Israel Koren and C. Mani Krishna, Fault Tolerant Systems, Elsevier, 2007.

• D. K. Pradhan, editor, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.

• B. W. Johnson, Design and analysis of fault tolerant digital systems, Addison-Wesley, First edition, 1989.

• My course (Fault-Tolerant Computing) URL: http://homepages.cae.wisc.edu/~ece753/INFO.html

ECE 753 Fault Tolerant Computing

Page 21: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Sources, Overview and Comments (2/4)

• What does the paper cover?– Very basic definitions of the terminologies used in

dependable computing– It categorizes definitions in three groups

• System, attributes of dependability, threats to dependability

– Covers very briefly methods to attain dependability

ECE 753 Fault Tolerant Computing

Page 22: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Sources, Overview and Comments (3/4)

• How to read the paper?– It is easy to read – scan it first and then read it– I have organized the material differently – you may

find it helpful

• What is not covered?– One attribute almost missing - survivability– Basic methods of Fault Tolerance and their

characterization

ECE 753 Fault Tolerant Computing

Page 23: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Sources, Overview and Comments (4/4)

• Chronology of Developments– Need for fault-tolerance - inception of the space program

(recall “Voyager” launched in 1977 is still sending signals)

– First standard glossary in 1985

– Integration of performance etc into fault tolerance – and hence the term “Dependability” – book published in 1992

– Recognition of “Security” as a basic attribute of dependability – this paper in 2004

ECE 753 Fault Tolerant Computing

Page 24: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

System Defined (1/4)• “. . . an entity that interacts with other entities”

– First entity (system) – limited to be “electronic (mostly digital)” or “computer based”

– Second entity• Hardware, software, human, other systems, .. (can also be called

“environment”)

• Characterization and fundamental properties– Functionality

– Performance

– Dependability and security

– Cost

(usuability, managability, adaptabilty : not directly included in the paper)

ECE 753 Fault Tolerant Computing

Page 25: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

System Defined (2/4)

• Function – “ what the system is intended to do” – functional specifications: describe it in terms of functionality

and performance

– behavior – described as a sequence of states to implement the functionality

– Total states – set of states as system evolves • Internal states

• External states – as viewed by the environment and users

• Structure – “What enables system behavior (function)” – Interconnected components – recursively defined to

“atomic” level

ECE 753 Fault Tolerant Computing

Page 26: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

System Defined (3/4)

• System Life Cycle– Development phase

– Use phase

• Service – what is delivered by the system to its “environment” (user)– Environment sees only the “external states”

– Development Phase – activities from concept to decision that system is ready for “use phase”

– Use Phase - More meaningful and includes service delivery, service outage, service shutdown, maintenance

ECE 753 Fault Tolerant Computing

Page 27: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

System Defined (4/4)• Development phase environment

– Physical world

– Human developers

– Development tools

– Production and test facilities

• User phase environment– Physical world

– Administrators – maintainers

– Users and intruders

– Providers and infrastructure

ECE 753 Fault Tolerant Computing

Page 28: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Dependability/Security Attributes (1/6)

• Original definition: “ability to deliver service that can justifiably be trusted”

• Encompassing the following attributes– Availability

– Reliability

– Safety

– Integrity

– maintainability

ECE 753 Fault Tolerant Computing

Page 29: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Dependability/Security Attributes (2/6)

• New definition: “ability to avoid service failures that are more frequent or more severe than is acceptable” - deliver service that can justifiably be trusted

• Reason for modification– Security related issues

– This recognizes that a system can fail and it usually does fail and it still can be called dependable

– This definition also enables a connection with “development failures”

ECE 753 Fault Tolerant Computing

Page 30: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Dependability/Security Attributes (3/6) Dependability

• availability: readiness for correct service.

• reliability: continuity of correct service.

• safety: absence of catastrophic consequences on the user(s) and the environment.

• integrity: absence of improper system alterations.

• maintainability: ability to undergo modification and repairs

When addressing security, an additional attribute confidentiality: the absence of unauthorized disclosure

ECE 753 Fault Tolerant Computing

Page 31: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Security is concurrent existence of composite of the attributes

1) availability (for authorized actions only),

2) confidentiality, and

3) integrity (with “improper” meaning “unauthorized”)

Dependability/Security Attributes (4/6)

ECE 753 Fault Tolerant Computing

Page 32: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

F

Dependability/Security Attributes (5/6)

ECE 753 Fault Tolerant Computing

Page 33: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

• Other related concepts – summarized in table (Fig 15) - these are– Dependability– High confidence– Survivability– Trustworthiness

• Example: all these have similar goals such as 1): ability to deliver service, 2): predictable service, 3): fulfill mission, 4): assurance of expected service delivery

Dependability/Security Attributes (6/6)

ECE 753 Fault Tolerant Computing

Page 34: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Threats and modeling threats (1/12)

• Different phases are open to different types of threats – generally termed as “faults”

• Faults lead to “errors” – a total state of the system different from the “true total state”

• Errors can lead to “failure” – the service deviates from the desired service

• This creates a FEF chain – a hierarchical phenomenon (see next and more later)

ECE 753 Fault Tolerant Computing

Page 35: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

fault error

failure

Fault activation – Error manifestation – Failure

Threats and modeling threats (2/12)

Fault – active or dormant

Error – masked or latent

Failure – incorrect response

Page 36: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Threats and modeling threats (3/12)

FEF Chain in an hierarchy

Page 37: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Threats and modeling threats (4/12)

Fault classes

• Groups (not exclusive)– Development, Physical – (that affect hardware - I

disagree with this definition), Interaction

• Viewpoints: – phase, system boundary, cause, dimension,

objective, intent, capability, persistence

Page 38: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Threats and modeling threats (5/12)

Fault Taxonomy and Examples

Production defect: physical, hardware, natural

Bug: physical, software, natural

Omission (absence of an action): Humam made, system generated

Melicious (meant to cause harm): Human made, Hardware or software

Notes:

1. Paper has a classification – Fig 4 and 5

2. Examples and definition of many other faults given. Some listed on next slide

Page 39: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Threats and modeling threats (6/12)

Fault Taxonomy (contd.)

Permanent faults

Intermittent faults – repeat at some interval

Transient faults – no specific interval

Malicious logic faults – caused be natural faults

Intrusion attempts – caused by humans

Interaction faults – may be development phase or use phase

Configuration faults – incorrect setting of parameters

Page 40: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Threats and modeling threats (7/12)

Errors classes

• Detected

• Latent

An example– An adder gives incorrect sum for certain operands

– Fault is active when those operands appear, otherwise it is dormant

– Incorrect sum is latent unless used or checked for correctness

Page 41: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Threats and modeling threats (8/12)

Failure classes

• Development failures

• Service failures

• Security failures

Page 42: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Threats and modeling threats (9/12)

• Development failures – introduced during the development phase– Human developers– Tools – Production facility– Budgetary reasons– Scheduling issue (time to market)

(basically the system delivered is a downgraded system)

Page 43: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Threats and modeling threats (10/12)

• Service failures - delivery of incorrect service – Four viewpoints

1. Failure domain– Content failure– Timing failure – early or late delivery of

the service(s)• Special case: silent failure, halt failure, crash

failure

• Erratic failure (like Byzantine failure)

Page 44: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Threats and modeling threats (11/12)

2. Failure detectability– Signal provided by some checking mechanism

• Signaled failure

• Unsignaled failure

• False alarm

3. Consistency – Consistent failure – all services see the same

data– Inconsistent – different services see different

data (like Byzantine failure)

Page 45: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Threats and modeling threats (12/12)

4. Consequence of failure– Need to rate the failure and hence develop

criteria – examples:• Outage of duration (availability related)

• Lives being endangered (safely related)

• Extent of corrupted service (integrity related)

• Amount of information disclosed (confidentiality related)

Page 46: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Means to attain dependability (1/6)

• Fault Prevention or Fault Avoidance• Improvement of development process

• Elimination of causes that can induce faults

• Fault Tolerance• Techniques and implementations

(more later)

Page 47: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Means to attain dependability (2/6)

• Fault Removal • Remove faults during development phase

– extensive simulation and validation

• Testing• Deterministic testing

• Random and statistical testing

• Back to back testing

Test/validation quality: fault injection, design for test/verification

Page 48: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Means to attain dependability (3/6)

• Fault Forecasting – evaluate the system behavior and then use one or more methods previously discussed to improve dependability• Qualitative evaluation• Quantitative evaluation• Use benchmarks• Use of simulators

Examples: 1) Error and failure logs

2) when and where commissioned

Page 49: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Means to attain dependability (4/6)

• Fault Tolerance Techniques• Error detection - need redundancy

• Duplicate execution

• Use of parity

• Checker programs and/or hardware

• More later

Page 50: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Means to attain dependability (5/6)

• Recovery - Key is redundancy

• Error handling• Masking and compensation

• Rollback

• Rollforward

• Fault handling• Diagnosis

• Isolation

• Reconfiguration

• Initialization

Page 51: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

Means to attain dependability (6/6)

• Key to fault tolerance• Break FEF chain

• Use “redundancy” to improve “use phase” dependability and security

• See next “fundamental principles”

Page 52: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 52

Fundamental Principles

• Hardware redundancy• Low level

• High level

• Software Redundancy• Time Redundancy• Information Redundancy

Page 53: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 53

Fundamental Principles (contd.)

• Hardware Redundancy - Low level– logic level

• Example 1 - Self checking circuits

• Example 2 - Arithmetic code A modular adder using the mathematical principle

(A+B) mod k = ((A mod k) + (B mod k)) mod k

• Hardware Redundancy - High level– Triplicate or 5-copies as in space shuttle

Page 54: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 54

Fundamental Principles (contd.)

• Software Redundancy – Use two different programs/algorithms

• Time Redundancy– Re-compute or redo the task and compare the results

– May or may not use the same hardware/software

• Information Redundancy– backup information

– Use of ECC

• Question - What kind of FT is achieved?

Page 55: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 55

Fault-Error-Failure

• Intuitive definitions• Origins of faults• Methods to break FEF chain• Attribute of faults

Page 56: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 56

Fault-Error-Failure concept (contd.)

Intuitive definitions

• Fault -– An anomalous physical condition caused by a

manufacturing problem, fatigue, external disturbance (intentional or un-intentional), desgin flaw, …

– Causes

• Error - Effect of activation of a fault

• Failure - over-all system effect of an error

Fault -> Error -> Failure

Page 57: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 57

Fault-Error-Failure concept (contd.)

Origins of faults

• Physical device level (HW)

• Logic level (HW)

• Chip level (HW)

• System level (HW/SW)– interfacing, specifications, …

• Why systems fail

Page 58: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 58

Fault-Error-Failure concept (contd.)

Methods to break FEF chain

• Flow FEF

• Barriers– Fault avoidance– Fault masking– Fault removal– Fault forecasting

Page 59: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1
Page 60: ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

ECE 753 Fault Tolerant Computing 60

Fault-Error-Failure concept (contd.)

Attribute of faults

• Cause

• Nature

• Duration

• Extent

• Value