© 2006, Monash University, Australia CSE4884 Network Design and Management Lecturer: Dr Carlo Kopp, MIEEE, MAIAA, PEng Lecture 15-16 Reliability Theory

CSE4884 Network Design and Management

Lecturer: Dr Carlo Kopp, MIEEE, MAIAA, PEng

Lecture 15-16

Reliability Theory Concepts; Managing Reliability and Maintainability in Networks

References and Reading

Igor Bazovsky, Reliability Theory and Practice, Dover Books on Mathematics, (1961) – recommended reading.

Kopp C., System Reliability and Metrics of Reliability, Peter Harding & Associates, Pty Ltd, Lecture Slides.

MIL-STD-756 Revision B Reliability Modeling & Prediction, US Department of Defense, Revision B - Nov 1991.

Why Study Reliability in Networking?

Networks are among the most complicated systems ever created by man.

A modern network combines hardware, embedded software and host system resident software, providing a range of data transfer and management functions.

The complexity of a modern network makes its reliability a major consideration.

Prudent network design can improve its reliability. Poor network design can reduce its reliability. Running costs and user satisfaction depend strongly on

network reliability. Reliability matters!

What is Reliability?

Reliability is defined as the Probability of System ‘Survival’ or P[S](t) over time T.

Where R(t) is reliability and Q(t) is probability of failure. A measure of the likelihood of no fault occurring. Reliability is related to system function and architecture. Kopp’s 5th Axiom: ‘All systems will fail, the only issue

is when, and how frequently.’

)(1)()]([ tQtRtSP

Defining System Reliability

System Reliability includes the following components:

1. Hardware Reliability.

2. Software Reliability.

3. Reliability of interaction between hardware and software.

4. Reliability of interaction between the system and the operator.

Failure to consider any of these is always at the expense of the reliability of the end product.

Hardware Reliability Classification Considers the reliability of electronic components,

printed circuit boards (PCB), cables, interconnection (ie connector) reliability, and failure modes across all such components. Failure regimes include:

1. Hard failures - The component or system fails and remains in that state.

2. Transient failures – the component or system temporarily experiences a loss a loss of function.

3. Intermittent failures – the component or system repeatedly and temporarily experiences a loss a loss of function.

Failure types are primarily divided into:1. Random failures - exponentially distributed.2. Wearout failures - normally distributed3. Infant Mortality failures

Hardware Life Cycle vs Reliability All hardware exhibits three distinct failure modes through

its operational life cycle. The first weeks or months after introduction exhibit infant

mortality failures, which arise from manufacturing defects which fail under stress.

Once the equipment is established in operation, it exhibits random failures, which arise in components for a variety of reasons. Random failures are Poisson distributed.

As the equipment reaches the end of its useful life, it begins to exhibit wearout failures. Wearout failures are normally (Gaussian) distributed.

The ‘Bathtub Curve’ illustrates failure frequency over the product life cycle.

Life Cycles - Bathtub Curve

Random Failures Random failures arise throughout the useful life of

equipment. They are exponentially distributed.

The failure rate λ is a measure of how frequently they arise.

MTBF = Mean Time Between Failures

Temperature dependency of λ - failure rates always increase at high operating temperatures.

Electrical voltage dependency of λ - failure rates always increase at higher electrical stress levels.

High stress → high failure rates!

)exp()( tetR t

MTBFMTBF

11

Wearout Failures Wearout failures arise at the end of the useful life of

equipment. They are normally distributed.

The mean wearout time μ is a measure of the average time at which they arise. The standard deviation of mean wearout time σ is a measure of their spread in time.

The spread in wearout failures depends on the quality of the components and the types of loads they are subjected to over their life cycle.

Wearout arises in mechanical components and connectors due to cyclic mechanical loads, in semiconductor chips due to cyclic thermal loads and junction diffusion effects.

dtT

tRWearout ))(2

1exp(

2

1)( 2

Wearout Examples

Connectors wear out with insertion and extraction cycles which rub away plating and cause metal fatigue damage.

Electrical relays wear out with switching cycles which rub away plating and cause metal fatigue damage.

Fans wear out due to bush or ball bearing failures, causing a loss of airflow rate and ultimately seizure of the bush or bearing. Fans with bushes – 10,000 hr life.

Cables wear out due to cyclic mechanical loads, especially near connectors, but also due to dielectric degradation due to age and moisture ingress.

Many electrical components fail due to oxidation of metals which is a corrosion effect.

Water, especially salt water, can produce corrosion.

System Reliability – Lusser’s Product Law

Lusser’s Product Law was discovered in Germany during A4/V2 ballistic missile testing during the 1943-44 period.

It superceded the earlier and dysfunctional ‘weak link’ model, which attributed failures to the most failure prone component in a design.

Lusser’s Product Law describes the behaviour of complex series systems, in which the function of the system depends on the function of each and every component.

It provides the theoretical basis of the US Mil-Hdbk-217 and Mil-Std-756B standards, which are the industry benchmark for reliability modelling.

System Reliability – Lusser’s Product Law

Lusser states that the probability of system ‘survival’ is the product of the individual probabilities of survival of each component in the system, where

This means P[S]system = P[S]1*P[S]2*P[S]3 ….P[S]N or:

Where R(t) = P[S](t) = 1 – Q(t), where Q(t) is the probability of failure.

If we know the failure rates λ for the components in a system, we can calculate the system reliability.

N

iiSystem RR

1

)exp(11

tRRN

ii

N

iiSystem

Parallel Systems - Redundancy

Failure of single element is survivable, but P[S] then reduced as a result.

Rp = 1 – QN

Where Q is the probability of failure for each of the redundant components.

Used in aircraft flight control systems, Space Shuttle and critical control applications.

Large servers with multiple parallel interfaces are a good example of a parallel system built for reliability.

RAID storage servers are a similar example.

Complex System Reliability Complex systems combine parallel and serial models. Such systems require detailed analysis to determine R(t)

for subsystems and the complete system. It is necessary to analyse for dependencies. For

instance, a cascading series of failures may arise if one component fails, and results in overstress to other components, which fail in turn (Refer Mil-Std-756B).

Designers must avoid Single Point of Failure (SPoF) items. Such items are typically shared components.

The higher the complexity of the system, the higher the component reliability needed to achieve any given MTBF.

In very complicated systems, this can require exceptionally high component reliability (and thus cost).

Example – RAID Array in Server System (2006)

N x 1 RAID array with a single cooling fan and power supply.

Assessment:

1. Disk drive redundancy in array is good.

2. Power supply failure represent a single point of failure.

3. Fan failures represent a single point of failure.

Problem fixed by introducing redundant fans and power supplies.

By removing single point of failure items we have significantly improved the reliability of the system.

Example - P-38 Twin Engine Aircraft (1944)

Electrical propeller pitch control, radiator and intercooler doors, dive flap actuators, turbocharger controls.

Twin engine aircraft, only one generator on one of the engines.

Loss of generator equipped engine - feather propeller, fail over to battery.

Once battery flat, prop unfeathers, windmills, turbo runaway -> aircraft crashes.

Problem fixed with dual generators, one per engine. Significant loss of pilot lives until problem solved.

LuckyLady

H MC

79th FS/20th FG, Arthur Heiden, May 1944

Software vs Hardware Reliability Hardware failures can induce software failures. Software failures can induce hardware failures. It is often extremely difficult to separate hardware and

software failures. We cannot apply physical models to software failures. While Lusser’s Product Law provides a model for system

level reliability, we have no hard measures for calculating or estimating the component failure rates in software.

The result of software and/or hardware failures is system failure.

Networking equipment, especially routers, contain significant amounts of embedded software to handle protocol stacks, management, and data buffering.

Modes of Software Failure

We can identify four basic modes of software failure:

1. Transient Failure – the program produces an incorrect result, but the program continues to run.

2. Hard Failure – the program crashes (stack overrun, heap overrun, broken thread) and ceases to run.

3. Cascaded Failure – the program crashes and takes down other programs as a result.

4. Catastrophic Failure – the program crashes and takes down the operating system or complete system -> total failure.

Types of Software Failure Numerical Failure - bad result calculated. Propagated Numerical Failure - bad result used in

other calculations. Control Flow Failure - control flow of thread is diverted. Propagated Control Flow Failure - bad control flow

propagates through code. Addressing Failure - bad pointer or array index. Synchronisation Failure - two pieces of code

misunderstand each other's state. In networking equipment, the synchronisation failure is a

very common occurrence. It usually arises due to bugs, misconfiguration or incompatible implementations of a protocol engine.

Case study – PPP LCP failures.

Runtime Detection of Software Failures

Consistency checks on values – is the result that which was expected?

Watchdog timers – has an operation completed on time?

Bounds checking – is the result reasonable or within some safe limits?

Embedded software in networking equipment which is well designed must have runtime software failure detection functions built in.

When choosing equipment it is essential to determine whether critical equipment items have some or any such capability.

Recovery Strategies – Runtime Failures Redundant data structures - overwrite bad data with

clean data. Signal operator or log problem cause and then die. Hot Start - restart from known position, do not reinitialise

data structures. Cold Start - reinitialise data structures and restart, or

reboot. Failover to Standby System in redundant scheme (eg

flight controls). If an item of networking equipment is critical, it is

important to determine how it handles runtime failures. Far too often a failure in synchronisation is not detected,

causing chaos as a result.

Typical Causes of Software Failures

Programmer did not understand the system design very well.

Programmer made unrealistic assumptions about operating conditions.

Programmer made coding error. Programmers and hardware engineers did not talk to

each other. Inadequate or inappropriate testing of code. A network designer is unlikely to have access to

embedded code in equipment, or access to designers. If an item is critical to the function of the network, it should be tested rigorously before it is introduced into a production network and made available to users.

Dormant Fault Problem

Statistical models used for hardware are irrelevant. Code may be operational for years with a fatal bug

hidden somewhere. A set of conditions may one day arise which trigger the

fault. If major disaster arises it may be impossible to recreate

same conditions. In a large and complex network, many dormant bugs

may exist in embedded code inside equipment and in host operating systems.

If such bugs result in transient or intermittent failures, they may be extremely difficult to isolate.

Complex System Problem

Extremely complex systems will be extremely difficult to simulate or test.

Complexity may result in infeasible regression testing time.

Components of system may interact in ‘unpredictable’ ways .

Synchronisation failures may arise. Faults may be hidden and symptoms not easily

detectable due complexity. Networks represent a typical case study of a complex

system, insofar as they may have hundreds of switches and routers, all with embedded software running in them.

Network Design for Reliability

1. Network design objectives must be well understood.2. Redundancy should be used as appropriate for critical

portions of the design, especially if a formal reliability specification exists for the network.

3. Failure modes and consequences should be understood, for all items of hardware and software in the network.

4. Each hardware and software module should be tested thoroughly before use in a network design.

5. A hardware reliability model should be produced, based on Mil-Std-756B or a serial model, as required.

6. Good estimates of hardware reliability are feasible, where manufacturers are able to provide MTBF figures for equipment.

Maintainability

Regardless of how reliable a network might be, maintainability is a critical operational issue.

When inevitable failures arise, these must be fixed as quickly as possible.

Network failures can cause the loss of hundreds or thousands of personnel hours for every hour of network downtime. Time is money!

From a user and management perspective, maintainability is very important and impacts any economic assessment of the running costs of a network.

It is customary to measure network reliability in terms of overall MTBF, or in terms of ‘Availability’ which is the fraction of time, over time, the network can be used.

Maintainability - MTTR The most common measure of maintainability is MTTR. Unfortunately, multiple definitions exist for MTTR, as a

result of which a designer or manager must be very careful when contracting:

Mean Time To Respond – average time for a maintenance crew to respond to a request.

Mean Time To Repair – average time to repair a fault. Mean Time To Restart – average time to restart the

network after a fault. Mean Time To Restore – average time to restore

network function after a fault. Some hardware suppliers will provide MTTR (repair)

numbers for their products. In general, care should be taken when specifying MTTR and when assessing MTTR in a bid.

Maintainability Repairing hardware faults requires spare parts, or

complete replacement equipment. If good MTTR is required, then it is necessary to

maintain a stockpile of spares. The size of the stockpile is typically determined by the

MTBF of the component, and the number of items in operation.

For instance, if the MTBF of a switch is 100,000 hrs and you have 100 of them in operation, annually you incur a total of 876,000 hrs of running time on these switches.

You can thus expect, on average, 8.76 faults annually in these switches.

A spares stockpile of around 10 switches would be needed, and an annual budget for 10 replacements.

Frequency of Repairs vs Availability If we can expect some number of faults annually due to

random failures, since these are Poisson we cannot know exactly when they will occur.

MTBF and population size for specific components will determine on average, how frequently one of these will fail.

We can estimate the Availability of the network if we have good estimates for MTBF and MTTR.

In practice, MTBF can be calculated accurately, but MTTR can be difficult to measure accurately, especially given the range of possible network failure modes and debugging times which result.

MTTRMTBF

MTBFtyAvailabili

Axioms to Memorise

Murphy's Law applies 99% of the time (Vonada's Law) Simpler solutions are usually easier to prove correct

(Occam's Razor) Paranoia Pays Off (Kopp's 4th Axiom) All systems will fail, the only issue is when, and how

frequently (Kopp’s 5th Axiom)

Tutorial

Q&A and Discussion, case studies

Documents

© 2006, Monash University, Australia CSE4884 Network Design and Management Lecturer: Dr Carlo Kopp, MIEEE, MAIAA, PEng Lecture 15-16 Reliability Theory