1. Introduction Reliable System Design 2010 by: Amir M. Rahmani

Computing Everywhere

Computing everywhere:
• Desktop, Laptop, Cars, Cell phones

Input devices everywhere:
• Sensors, cameras, microphones

Connectivity everywhere:
• Rapid growth of bandwidth in the interior of the net
• Internet at home and office

Increased reliance on computers is inevitable

Computer systems will become invisible only when they are reliable

Fault-Tolerance

Why?
• Computers are increasingly being used in critical applications where system failures may have severe consequences.

How?
• By introducing redundancy (extra resources) in the computer system, e.g., hardware redundancy and software redundancy.

Need for Fault Tolerance: Universal

Natural objects:
• Fat deposits in body: survival in starvation
• Duplication of eyes: graceful degradation upon failure

Man-made objects:
• Redundancy in ordinary text
• Asking for password twice during initial set-up
• Duplicate tires in trucks

Mission Specific Approaches

High availability systems:
• Telephone
• Transaction processing: banks/airlines

Long life missions:
• Unscheduled maintenance too costly
• Manned and unmanned space-borne systems

Critical applications:
• Real-time industrial control
• Aircraft control systems
• Life support systems

General Purpose Systems:
• CDs: encoding
• Internet: packet retransmission

Example of Failures - eBay Crash

eBay: giant internet auction house
• A top 10 internet business
• Market value of $22 billion
• 3.8 million users as of March 1999

June 6, 1999
• eBay system is unavailable for 22 hours with problems ongoing for several days
• Stock drops by 6.5%, $3-5 billion lost revenues
• Problems blamed on Sun server software

Example - Ariane 5 Rocket Crash

Ariane 5 and its payload destroyed about 40 seconds after launch, June 1996

Error due to software bug:
• Conversion of floating point to 16-bit int
• Out-of-range error generated but not handled

Testing of full system under actual conditions not done due to budget limits

Estimated cost: $120 million
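A minimal sketch in Python (an illustration only, not the flight software) of the failure mode listed above: a floating-point value is narrowed to a 16-bit signed integer, the out-of-range case raises an error, and a caller handles that error rather than letting it propagate. All function names and the clamping policy are assumptions made for this example.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Narrow a float to a 16-bit signed integer, raising if it does not fit."""
    # The chained comparison is False for NaN and infinities, so they are rejected too.
    if not (INT16_MIN - 1 < value < INT16_MAX + 1):
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return int(value)  # truncates toward zero


def to_int16_saturating(value: float) -> int:
    """One possible handler (an assumption): clamp to the nearest limit."""
    if value >= INT16_MAX:
        return INT16_MAX
    if value <= INT16_MIN:
        return INT16_MIN
    return int(value)  # note: NaN would still raise ValueError here


def convert_with_handling(value: float) -> int:
    """Catch the out-of-range error instead of leaving it unhandled."""
    try:
        return to_int16_checked(value)
    except OverflowError:
        return to_int16_saturating(value)
```

Whether clamping is an acceptable response depends on the application; the point of the sketch is only that the out-of-range case has an explicit handler.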

Example - The Therac-25 Failure

Therac-25 is a linear accelerator used for radiation therapy; the accidents occurred between June 1985 and January 1987.

More dependent on software for safety than predecessors (Therac-20, Therac-6)

Machine reliably treated thousands of patients, but occasionally there were serious accidents, involving major injuries and 1 death.

Software problems:
• No locks on shared variables (race conditions).
• Timing sensitivity in user interface.
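A generic sketch in Python (not the Therac-25 software) of what a lock on a shared variable provides: several threads update the same value, and the lock makes each read-modify-write step atomic so that no update is lost. The class and function names are hypothetical.

```python
import threading


class SharedCounter:
    """A shared variable whose updates are protected by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, two threads could read the same old value
        # and one of the two updates would be lost (a race condition).
        with self._lock:
            self.value += 1


def run_workers(n_threads: int = 4, n_increments: int = 10_000) -> int:
    """Start several threads that all increment the same counter."""
    counter = SharedCounter()

    def worker() -> None:
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place, `run_workers(4, 10_000)` returns exactly 40000 on every run; without it the result is not guaranteed.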

Example - Tele Denmark

Tele Denmark Internet, ISP

August 31, 1999
• Internet service down for 3 hours
• Truck drove into the power supply cabinet at Tele Denmark
• Where were the UPSs?
  • Old ones had been disconnected for upgrade
  • New ones were on the truck!

Dependable

Webster Dictionary

Dependable: capable of being depended on: RELIABLE

Reliable: suitable or fit to be relied on: DEPENDABLE

Rely:
• 1) to be dependent <the system for which we depend on water>
• 2) to have confidence based on experience <someone you can rely on>

Dependability

Dependability is that property of a computer system such that reliance can justifiably be placed on the service it delivers.

Attributes of Dependability
• Reliability
• Availability
• Safety
• Confidentiality
• Integrity
• Maintainability

Fault tolerance is not a system requirement. Fault tolerance is one of the mechanisms that can be used to provide dependability.

Motivation

Extreme fault tolerance has always been around
• NASA’s deep space probes
• Medical computing devices (e.g., pacemakers)

But now fault tolerance is becoming more important
• More reliance on computers

Extreme fault tolerance
• Car controllers (e.g., anti-lock brakes), etc.

High fault tolerance
• Commercial servers (databases, web servers), file servers, etc.

Some fault tolerance
• Desktops, laptops (really!), etc.

Reliability R(t) - Unreliability

R(t) is the probability that the system performs as specified without interruption over the entire interval [0,t]. R(t) is conditioned on the system being operational at time t=0.

The time t can be very long, e.g., years in the case of space applications.

Unreliability F(t) is the probability that the system fails at any time in the interval [0,t].

F(t) = 1 - R(t)
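A small sketch of how R(t) and F(t) could be estimated from observed times to first failure, assuming every unit was operational at t = 0. The failure times below are made-up numbers for illustration, and the function names are hypothetical.

```python
def reliability(failure_times, t):
    """Empirical R(t): fraction of units whose first failure occurs after time t."""
    if not failure_times:
        raise ValueError("need at least one observed failure time")
    survivors = sum(1 for ft in failure_times if ft > t)
    return survivors / len(failure_times)


def unreliability(failure_times, t):
    """Empirical F(t) = 1 - R(t): fraction of units that failed within [0, t]."""
    return 1.0 - reliability(failure_times, t)


# Hypothetical hours until first failure for eight identical units.
hours_to_failure = [120, 340, 560, 610, 900, 1500, 2100, 3000]

r_600 = reliability(hours_to_failure, 600)    # 5 of 8 units still running
f_600 = unreliability(hours_to_failure, 600)  # 3 of 8 units have failed
```

Because R(t) covers the whole interval [0,t], a unit that failed once and was later repaired still counts as failed here.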

Availability A(t)

A(t) is the probability that the system is up and running correctly at time t

This is different from reliability.
• Reliability considers the interval [0,t]
• Availability takes an instant of time

Examples: transaction processing systems, e.g. reservation systems

Reliability vs. Availability

Example: A system that fails, on average, once per hour but which restarts automatically in ten milliseconds is not very reliable but is highly available

Availability = Uptime/(Uptime+Downtime) = (3600000-10)/(3600000) ≈ 0.9999972 (times in milliseconds; one hour = 3,600,000 ms)
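The same calculation as a short Python sketch, with all times in milliseconds (one hour is 3,600,000 ms). The function and variable names are illustrative.

```python
def availability(uptime: float, downtime: float) -> float:
    """Availability = Uptime / (Uptime + Downtime), both in the same unit."""
    return uptime / (uptime + downtime)


MS_PER_HOUR = 60 * 60 * 1000  # 3,600,000 ms
restart_ms = 10               # one failure per hour, 10 ms to restart

a = availability(MS_PER_HOUR - restart_ms, restart_ms)
print(f"{a:.7f}")  # prints 0.9999972
```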

Nines of Availability
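The "nines" of an availability figure count its leading 9s: 0.999 is three nines, 0.99999 is five nines. A minimal sketch, assuming a 365-day year, of how each level translates into allowed downtime per year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability with `nines` leading 9s."""
    unavailability = 10.0 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(1, 7):
    availability = 1 - 10.0 ** -n
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines ({availability:.6f}): {minutes:10.2f} min/year")
```

Five nines works out to about 5.26 minutes of downtime per year, and each additional nine cuts the allowance by a factor of ten.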

Question ?

Which has higher availability?
• (1) two 4-hour outages / year
• (2) one 1-minute outage / day
• A(1) = (365*24-2*4)/(365*24) ≈ 0.9991
• A(2) = (24*60-1)/(24*60) ≈ 0.9993

For an Internet-based company such as eBay or AOL, which would be more desirable? Why?

For a hacker?

Need to specify details of acceptable outages
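A short sketch of the comparison, assuming a 365-day year. Two 4-hour outages total 8 hours of downtime per year. Besides the two availabilities, it reports the details an availability number hides: how often outages occur and how long the worst one lasts. The function name and dictionary keys are illustrative.

```python
HOURS_PER_YEAR = 365 * 24   # 8760
MINUTES_PER_DAY = 24 * 60   # 1440


def summarize(outages_per_year: int, outage_minutes: float) -> dict:
    """Availability plus the outage pattern behind it (365-day year assumed)."""
    downtime_min = outages_per_year * outage_minutes
    total_min = HOURS_PER_YEAR * 60
    return {
        "availability": (total_min - downtime_min) / total_min,
        "downtime_hours_per_year": downtime_min / 60,
        "outages_per_year": outages_per_year,
        "longest_outage_minutes": outage_minutes,
    }


option_1 = summarize(outages_per_year=2, outage_minutes=4 * 60)  # two 4-hour outages
option_2 = summarize(outages_per_year=365, outage_minutes=1)     # 1 minute every day

print(f"A(1) = {option_1['availability']:.4f}")
print(f"A(2) = {option_2['availability']:.4f}")
```

Option (2) has the higher availability, but it interrupts service 365 times a year, while option (1) has two interruptions that each last 4 hours. Which pattern is acceptable depends on the application.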

Safety S(t)

S(t) is the probability that the system does not fail in the interval [0,t] in such a manner as to cause unacceptable damage or other terrible effects.

S(t) is an attribute of a system which either operates correctly or fails in a safe manner.

Safety is a measure of the fail-safe capability of the system
• system can be unreliable, yet safe
• bias towards safe failure

Confidentiality

Absence of unauthorized disclosure of information

• Microsoft source code vs. Linux source code
• Web browsing
• Operating Systems Security Model
• Files
• Medical records
• Credit card transaction records
• School grades

Integrity

Absence of improper system state alterations

• Operating systems:
• Files, memory, network packets
• Linux kernel backdoor attempt
• Database records
• Your bank account
• File transfer
• Did I really get the right version of software X?

Maintainability M(t)

M(t) is the probability that a failed system will be restored within a specified period of time t

Restoration process
• locating problem, e.g. via diagnostics
• physically repairing system
• bringing system back to its operational condition

Ex. Dependability Requirements

Telecommunications:
• Availability, maintainability

Transportation:
• Reliability, availability, safety

Weapons:
• Safety

Nuclear systems:
• Safety

Dependability of Pacemaker

What matters for this system?
• Correct computation?
• Correct logic?
• Usability?

No, the safety of the patient

Dependability of Pacemaker

General Characteristics

Eight-bit processors, moving to 32-bit

Software:
• Approximately 30K lines, mostly “C”
• Vastly more software in external programmer

Patient data storage example:
• 200 samples/sec

Long battery life necessary:
• Device “sleeps” between heart beats

Dependability of Pacemaker

Is availability the goal? How about an availability of 0.99999?
• This corresponds to an average of five minutes per year of downtime
• Death would result if this occurred all at once

Is safety the goal? It’s safe when it’s off - or is it?
• Leaving the system off might result in death very quickly...

Is reliability the goal?
• Typical battery life is five years, but persistent storage is needed

Security

Security is a combination of attributes:
• Integrity
• Confidentiality
• Availability

Under different situations, these attributes are more or less important:
• Denial of service is an availability issue
• Disclosure of information is a confidentiality issue

Performability P(L,t)

P(L,t) is the probability that the system performance will be at or above some level L at time t

Measure of the likelihood that some subset of the function is performed correctly

This differs from reliability, which dictates that all functions are performed correctly

Graceful Degradation

The ability of a system to automatically decrease its level of performance to compensate for hardware failures and software errors

Testability

Testability: ease of detecting presence of a fault