56
Unit QUAN Session 7 Reliability

Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Unit QUAN – Session 7

Reliability

Page 2: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The
Page 3: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Page 3 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

MSc Strategic Quality Management

Quantitative Methods - QUAN

RELIABILITY

Aims of Session

By the end of the session the student should understand the basic concepts in reliability and

be able to apply the principles to a range of problems for reliability improvement.

Learning Approach

Study the attached notes, and case study Healthy Laptop.

On completion of the above tasks you should be able to do the self-assessment exercises in

the attached notes.

Reading

You will find further reading on this topic in Besterfield (2004), Chapter 11. O‘Connor (2002

or earlier editions) provides a more extensive coverage.

Revised December 2009

Page 4: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit – QUAN

MSc SQM\QUAN Page 4 of 56 Session 7 ©University of Portsmouth

Contents

Introduction to Reliability What is reliability? Quantitative Approach to Reliability Concept of Bath-tub Curve Measuring reliability Exercises Reliability Distributions Reliability Prediction Using Exponential Distribution Introduction to Mean Time between Failures and Mean Time to Failure Exercises Self-Assessment Exercise The addition and multiplication rules of probability Reliability of Systems - Series, Parallel, and Combination Systems Exercises Self-assessment Exercise Exercises Reliability Improvement Techniques Fault Tree Analysis Mathematical Appendix Healthy laptop case study

Page 5: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 5 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

RELIABILITY

Introduction to Reliability

Reliability has gained increasing importance in the last few years in manufacturing

organisations, the government and civilian communities. With recent concern about

government spending, agencies are trying to purchase systems with higher reliability and

lower maintenance costs. In defence, for example, these systems range from complete

weapon systems such as aircraft or tanks, to individual critical components such as diodes,

resistors, transistors, IC's and so on. As consumers, we are mainly concerned with buying

products that last longer and are cheaper to maintain, i.e., have higher reliability. But what is

reliability and how do we measure reliability?

What is reliability?

The reliability of a product (or system) can be defined as the probability that the product will

perform a required function under specified conditions for a certain period of time.

Why we need to go for high product or component or system reliability?

Higher customer satisfaction

Increased sales

Improved safety

Increased competition

Decreased warranty costs

Decreased maintenance costs, etc.

Reliability is often expressed as a probability of success ranging from 0 to 1 or as a

percentage from 0 to 100%.

Page 6: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth

The Quantitative Approach to Reliability

Reliability

We can quantitatively define the reliability of a component or system as the probability that it

will meet the qualitative definition.

The probabilistic definition of reliability is given by:

At time t = 0, the number of survivors is equal to number of items put on test.

Therefore, the reliability at t = 0 is

R(0) = 1 = 100%

After this, when t > 0, R(t) will decline as some components fail.

The hazard rate and the failure rate

The hazard rate (usually represented by h(t) or λ) is a very useful quantity. This is defined as

the probability of a component failing in one (small) unit of time.

Let NF = number of failures in a small time interval, say, Δt.

NS = number of survivors at time t.

The hazard rate can then be calculated by the equation:

tN

Nth

S

F

*)(

For example, if there are 200 surviving components after 400 seconds, and 8 components fail

over the next 10 seconds, the hazard rate is given by

h(400) = 8 / (200 x 10) = 0.004 = 0.4%

0

)(

ttimeattestonputitemsofnumber

ttimeatsurvivorsofnumbertR

Page 7: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 7 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

This simply means that 0.4% of the surviving components fail in each second.

(You may wonder why the above equation defines h(400) and not h(410). The reason is that

Δt is a small time interval, so it is reasonable to assume that the hazard rate will not change

appreciably during the interval. We then define the hazard rate using the beginning of the

interval for convenience. In the extreme we can make Δt infinitesimally small—which is the

basis of the differential calculus.)

The term hazard rate applies to non-repairable items such as transistors, light bulbs,

microprocessors etc. The failure rate is defined in just the same way as the hazard rate,

except that it applies to repairable items such as computers and TV sets.

The Concept of the Bath-tub Curve

The so-called bath-tub curve represents the pattern of failure for many products – e.g.

semiconductor devices and electronic components. The vertical axis in Figure 2 is the hazard

or failure rate at each point in time. Higher values here indicate higher probabilities of failure.

The bath-tub curve is divided into three regions: infant mortality, useful life and wear-out.

The bath-tub curve is illustrated in Figure 1.

Figure 1 Product failure pattern illustrated by bath-tub

curve

Time since start of testing

Obs

erve

d fa

ilure

rate

Infant mortality stage Useful life stage Wear-out stage

Figure 2 Product failure pattern illustrated by bath-tub curve

Page 8: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 8 of 56 Session 7 ©University of Portsmouth

Infant Mortality:

This stage is also called early failure or debugging stage. In this stage, the failure rate is high

but decreases gradually with time. During this period, failures occur because engineering did

not test products or systems or devices sufficiently, or manufacturing made some defective

products. Therefore the failure rate at the beginning of infant mortality stage is high and then

it decreases with time after early failures are removed by burn-in or other stress screening

methods. Some of the typical early failures are:

poor welds

poor connections

contamination on surface in materials

incorrect positioning of parts, etc.

Useful life:

This is the middle stage of the bath-tub curve. This stage is characterised by a constant failure

rate. This period is usually given the most consideration during design stage and is the most

significant period for reliability prediction and evaluation activities. Product or component

reliability with a constant failure rate can be predicted by the exponential distribution (which

we come to later in this session).

Wear-out stage:

This is the final stage where the failure rate increases as the products begin to wear out

because of age or lack of maintenance. When the failure rate becomes high, repair,

replacement of parts etc., should be done.

Page 9: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 9 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

Figure 2 Extended bath-tub curve

Can we extend the useful life stage of a product? Products can have their useful life extended

by proper design, conscientious assembly, and careful handling. This leads to a change in the

shape of the bath tub curve (see Figure 2).

Measuring reliability

To see the level and pattern of the reliability of a product or component in practice, it is

necessary to make some measurements. The simplest way to do this is to test a large number

of products or components until they fail, and then analyse the resulting data. Exercise 2

below shows how this works. This enables us to estimate the hazard rate and reliability after

different lengths of time - and decide, empirically, if the bath tub curve applies, or if the

hazard rate shows some other pattern.

There are a number of obvious difficulties which may arise. If the useful life is large it may

be not be practical to wait until products or components fail. It may be too expensive to test

large samples, so small samples may have to suffice. And for some products (e.g. space

capsules) it may be difficult to simulate operating conditions at all closely. There are a

number of approaches to these difficulties - most of these are beyond the scope of this unit,

but they are discussed in the reading suggested for this session. (An exception is the

prediction of the reliability of a product from information about the reliability of its

Time

Obs

erve

d fa

ilure

rate

Useful life stage

Figure 3 Extended bath-tub curve

Page 10: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 10 of 56 Session 7 ©University of Portsmouth

components - this avoids the necessity to test the whole product - which is discussed in a later

section of the notes for this session).

It is also possible that the failure rate will depend on environmental conditions in a

predictable way. For example, one of the key factors affecting the reliability of electronic

components and systems is temperature – basically the higher the temperature of the device

the higher the failure rate. Most computer equipment therefore has some form of cooling,

ranging from a simple fan to forced chilled air cooling. The following chart is from a report

on the effect of temperature on a single board computer published by Crane Surface Warfare

Center Division in 2001, and clearly shows the effect of temperature on failure rate.

(Failure rate = failures per million hours)

Figure 3

Failure Rate over Temperature

0

10

20

30

40

50

60

70

Temperature

Failu

re R

ate

25 35 45 55 65 75 85 95

Temperature

Page 11: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 11 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

QUESTIONS TO STIMULATE YOUR THINKING

Now consider and answer the following questions, the purpose of which

is to test your retention of the information given in the course notes for

this session. When you have done this, check back on any questions

which you could not answer, or where you gave the wrong answer then

go on to the Self-Appraisal Exercise.

Questions

1 Would you expect the bath tub curve to apply to a car? What about a human

being?

2 One thousand transistors are placed on life test, and the number of failures in

each time interval are recorded. Find the reliability and the hazard rate at each

point in time. (You may find it helpful to set this up on a spreadsheet.)

Time interval Number of failures 0-100 160

100-200 86

200-300 78

300-400 70

400-500 64

500-600 58

600-700 52

700-800 43

800-900 42

900-1000 36

Do you think this component shows the bath tub pattern of failure?

Page 12: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 12 of 56 Session 7 ©University of Portsmouth

(BLANK PAGE)

Page 13: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 13 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

/ Suggested Answers

1 I think the bath tub pattern would apply to both.

2 My answers are below.

Time Failures in

next 100 h Survivors Reliability Hazard rate

(per hr)

0 160 1000 100.00% 0.16% 100 86 840 84.0% 0.10% 200 78 754 75.4% 0.10% 300 70 676 67.6% 0.10% 400 64 606 60.6% 0.11% 500 58 542 54.2% 0.11% 600 52 484 48.4% 0.11% 700 43 432 43.2% 0.10% 800 42 389 38.9% 0.11% 900 36 347 34.7% 0.10%

Note that the hazard rate is per hour. (This is why the figures are so small.) A

hazard rate of 0.10% means that the probability of a failure in a given hour is

0.10%.

Note also that I am using data from the whole interval from 0-100 hours to

calculate the figures for 100 hours. This is an approximation which may lead to

small inaccuracies.

This result shows that the hazard rate is initially 0.16% per hour, and then drops

to around 0.10% to 0.11%. There is an infant mortality phase, a useful life phase

with a fairly constant hazard rate, but no wear out phase. However, there are still

311 components left after the 1000 hour test - the wear out phase may become

apparent if the test were to be prolonged. Obviously, the best way to show this

pattern would be to draw a graph of hazard rate against time.

Page 14: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 14 of 56 Session 7 ©University of Portsmouth

(BLANK PAGE)

Page 15: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 15 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

Reliability Distributions

There are many statistical distributions used for reliability analysis—for example, the

exponential distribution, the Weibull distribution, the normal distribution, the lognormal

distribution, and the gamma distribution. In this unit we look at the exponential distribution

only, as this is the simplest and the most widely applicable.

Reliability Prediction Using the Exponential Distribution

The exponential distribution applies when the hazard rate is constant - the graph is a

straight horizontal line instead of a “bath tub”. (It can be used to analyse the middle phase

of a bath tub - e.g. the period from 100 to 1000 hours in Exercise 2 above.) It is one of the

most commonly used distributions in reliability, and is used to predict the probability of

survival to a particular time. If λ is the failure rate and t is the time, then the reliability, R, can

be determined by the following equation

R(t) = e -λ t

There is a brief note on the mathematical background to this equation in the Appendix to this

session.

To see that it gives sensible results, imagine that there are initially 1000 components and that

is 10% per hour. After one hour about 10% of the original 1000 components will have failed

- leaving about 900 survivors. After two hours, about 10% of the 900 survivors will have

failed leaving about 810. Similarly there will be about 729 survivors after the third hour,

which means that the reliability after 3 hours is 0.729.

Using the exponential distribution the reliability after 3 hours, with λ=0.1, is given by

R(t) = e -3 λ

= e -0.3

= 0.741

(You can work this out using a calculator or a spreadsheet—see the mathematical appendix for

more details.)

This is close to the earlier answer as we should expect. The reason it is not identical is that the

method of subtracting 10% every hour to obtain 900, 810, etc ignores the fact that the number

of survivors is changing all the time, not just every hour. The exponential formula uses

Page 16: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 16 of 56 Session 7 ©University of Portsmouth

calculus to take this into account.

If the failure rate is small in relation to the time involved a much simpler method will give

reasonable results. Let‘s suppose that λ=0.01 (1%) in the example above. The formula now

gives the reliability after three hours as

R(t) = e -3 λ

= e -0.03

= 0.9704

A simpler way of working this out would be just to say that if the failure rate is 0.01 per hour

the total proportion of failures in 3 hours will be 0.03 (3%) so the reliability after three hours

is simply

R(t) = 1 – 0.03 = 0.97 (or 100% – 3% = 97%)

This is not exactly right because in each hour the expected number of failures will decline as

the surviving pool of working components gets smaller. But when the failure are is 1% and we

are interested in what happens after three hours, the error is negligible. On the other hand, if

we want to know what happens after 300 hours the simple method gives a silly answer (the

reliability will be negative!) and we need to use the exponential method. You should be able

to check this answer with your calculator or a computer (the answer should be 0.0498 or about

5%).

Mean time to Failure (MTTF) and Mean time between Failures (MTBF)

MTTF applies to non-repairable items or devices and is defined as "the average time an item

may be expected to function before failure". This can be estimated from a suitable sample of

items which have been tested to the point of failure: the MTTF is simply the average of all the

times to failure. For example, if four items have lasted 3,000 hours, 4000, hours, 4000 hours

and 5,000 hours, the MTTF is 16,000/4 or 4,000 hours.

The MTBF applies to repairable items. The definition of this refers to ―between‖ failures for

obvious reasons. It should be obvious that

MTBF = Total device hours / number of failures

For example, consider an item which has failed, say, 4 times over a period of 16,000 hours.

Then MTBF is 16,000/4 = 4,000 hours. (This is, of course, just the same method as for

MTTF.)

For the particular case of an exponential distribution,

λ = 1/MTBF (or 1/MTTF)

where λ is the hazard rate.

Page 17: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 17 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

For example, the item above fails, on average, once every 4000 hours, so the probability of

failure for each hour is obviously 1/4000. This depends on the failure rate being constant -

which is the condition for the exponential distribution.

This equation can also be written the other way round:

MTBF (or MTTF) = 1/λ

For example, if the hazard rate is 0.00025, then

MTBF (or MTTF) = 1/0.00025 = 4,000 hours.

Page 18: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 18 of 56 Session 7 ©University of Portsmouth

QUESTIONS TO STIMULATE YOUR THINKING

Now consider and answer the following questions. The purpose of

which is to test your retention of the information given in the course

notes for this session. When you have done this, check back on any

questions which you could not answer, or where you gave the wrong

answer then go on to the Self-Appraisal Exercise.

Questions

1 An industrial machine compresses natural gas into an interstate gas pipeline.

The compressor is on line 24 hours a day. (If the machine is down, a gas field

has to be shut down until the natural gas can be compressed, so down time is

very expensive.) The vendor knows that the compressor has a constant failure

rate of 0.000001 failures/hr. What is the operational reliability after 2500 hours

of continuous service?

2 What is the highest failure rate for a product if it is to have a reliability (or

probability of survival) of 98 percent at 5000 hours? Assume that the time to

failure follows an exponential distribution.

3 Suppose that a component we wish to model has a constant failure rate with a

mean time between failures of 25 hours? Find:-

(a) The reliability function.

(b) The reliability of the item at 30 hours.

4 A certain type of engine seal (non-repairable) is know to have life exponentially

distributed with a constant failure rate = 0.03 * 10-4

failures/hour.

(a) What is the MTTF of the seal?

(b) What is the reliability at MTTF?

Page 19: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 19 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

(BLANK PAGE)

Page 20: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 20 of 56 Session 7 ©University of Portsmouth

Suggested Answers

1 The compressor has a constant failure rate and therefore the life times of these

reliability is given by:

R = e -λ t

Failure rate λ = 0.000001 f/hr., operational time t = 2500 hours

Reliability = e-(0.000001 * 2500)

= 0.9975

2 The reliability of the product is given to be 0.98. The reliability of an

exponential distribution is given by:

R = e -λt

t

i.e., 0.98 = e(-λ * 5000)

Taking natural logarithms on both sides (see Appendix), we get,

-0.02020 = -λ * 5000

Therefore λ = 4.04 * 10-6

f/hr

(Alternatively, you could use a trial and error process with your calculator to

find a negative number which gives you 0.98 when you press the ex button. This

number must be equal to λ*5000...)

Therefore the highest failure rate is 4.04/106 hours for a reliability of 0.98 at

5000 hours.

3 (a) Since the failure rate is constant, we will use the exponential distribution.

Also, the MTBF = 25 hours. We know, for an exponential distribution, MTBF

= 1/λ

Therefore λ = 1/25 = 0.04

The reliability function is given by: R(t) = e-λt =

e- (0.04 * t)

(b) The reliability of the item at 30 hours = e-0.04 * 30

= 0.3012

4 a) λ = 0.03 * 10-4

failures/hour

MTTF = 1/λ = 333,333 hours

i.e., the average life of these seals is about 333,333 hours

b) MTTF = 333,333 hours

Therefore R(333,333) = e- (0.000003 * 333,333)

= 0.368

Page 21: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 21 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

(BLANK PAGE)

Page 22: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 22 of 56 Session 7 ©University of Portsmouth

SELF-APPRAISAL EXERCISE

Exercise

Here follows an exercise to give you further information and food for thought.

The assigned task is:

1 The equipment in a packaging plant has a MTBF of 1000 hours. What is the

probability that the equipment will operate for a period of 500 hours without

failure?

2 During World War II, the hazard rate for bomber aircraft flying over Europe

was believed to be a 4% chance of non-return from each mission, however

experienced the pilot was. Calculate the probability that a crew member will

survive 25 missions. How many missions would it take to reduce a crew

member's probability of survival to 10%?

3 ALCO manufactures microwave ovens. In order to develop warranty

guidelines, TALCO randomly tested 10 microwave ovens continuously to

failure. The failure information of the 10 ovens is shown in Table 2.

Microwave Hours

1 2300

2 2150

3 2800

4 1890

5 2790

6 1890

7 2450

8 2630

9 2100

10 2120

What is the mean time to failure of the microwave ovens? (Note that the mean life of the microwave is defined in terms of their mean time to failure because no maintenance is performed on the ovens).

Page 23: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 23 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

(BLANK PAGE)

Page 24: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 24 of 56 Session 7 ©University of Portsmouth

Suggested Answers

1 Assuming the exponential model, the hazard rate is

1/MTBF = 0.001

So

R(500) = e-500*0.001

= e-0.5

= 0.61

2 As the hazard rate is constant we can use the exponential model with a hazard

rate of 0.04.

R(25) = e -0.04*25

= e-1

= 0.37

The crew have a 37% chance of surviving 25 missions.

To reduce this to 10% we need e-0.4*t

=0.1.

Trial and error (or natural logarithms - see the mathematical appendix) shows

that

e-2.3

= 0.1 (approximately)

So 0.04t = 2.3, so t is 23/0.04 or about 57 missions.

Note that ―time‖ is measured in missions.

3 The MTTF is 2312 hours (simply the average).

Page 25: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 25 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

The addition and multiplication rules of probability

The next aspects of reliability theory we will consider depend on some probability theory—

the addition and multiplication rules. I will explain these in terms of dice and cards.

Suppose that you throw a single dice. The probability of getting a 6 is

P(6) = 1/6

And the probability of getting a 5 is also

P(5) = 1/6

Equally obvious is that the probability getting a 5 or a 6 is

P(5 or 6) = 2/6 = 1/3

And that

P(5 or 6) = P(5) + P(6)

Similarly if you choose a single card from a pack of (52) cards, the probability of getting an

ace or a picture card (jack, queen or king) is

P(ace or picture) = 4/52 + 12/52 = 16/52

Because 4 of the 52 cards are aces, and another 12 are picture cards.

These examples suggest that if we want the probability of something or something else

happening, we can add the probabilities.

But … what about the probability of an ace or a red card (hearts or diamonds). Can we say

that

P(ace or red) = P(ace) + P(red) = 4/52 + 26/52 = 30/52 ??

This is obviously wrong because two of the aces are also red, so we are in effect double

counting these aces if we add the probabilities. Before adding the probabilities you need to

check that the two events cannot both occur—i.e. they do not overlap or are mutually

exclusive (i.e. each excludes the other).

The complete addition rule is

P(A or B) = P(A) + P(B) if A and B are mutually exclusive (i.e. they don’t overlap).

It can easily be extended to three or more events:

P(A or B or C or …) = P(A) + P(B) + P(C) + … if A, B, C … are mutually exclusive.

To explain the and rule we need to do more than one thing, so let‘s throw the dice and draw a

card from the pack. How can we work out the probability of getting a 6 and a spade?

Page 26: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 26 of 56 Session 7 ©University of Portsmouth

This is a little more complicated than the or rule. It helps to imagine doing the experiment

lots of times—say 1000 times.

Obviously you will get a 6 on about 1/6 of these thousand times—i.e. about 167 times. Now

think about how many times you will get a spade as well. This will happen on about ¼ of

these 167 times, or about 42 times out of the 1000 times we imagined doing the experiment.

So the probability is about 42/100.

Now do the same thought experiment, but working in terms of the probabilities this time. We

will get a 6 on about one sixth of the times, and a spade as well on about one quarter of this

one sixth. One quarter of one sixth means the same as one quarter times one sixth, so we

simply multiply the probabilities:

P(6 and spade) = P(6)*P(spade) = 1/6 * 1/4 = 1/24

which is, of course, about 42/1000 as before.

Just like the addition rule there is an important assumption here. We assumed that the two

events are statistically independent. This means that the two probabilities are independent of

each other: knowing whether one has happened is of no help in assessing the probability of

the other happening. In the example above, knowing that we‘ve got a 6 is of no relevance to

what will happen with the cards—so these events are statistically independent.

But in some situations you need to be very careful about this assumption. Suppose you know

that in a particular place the probability of rain falling on a given day is 1/3. Can you say that

P(rain today and rain tomorrow) = 1/3 * 1/3 = 1/9 ??

In practice this is likely to be wrong because the two events are not likely to be independent.

In England certainly, if it rains today the probability of rain tomorrow is likely to be higher

than if it did not rain today because the weather tends to set in to wet or dry spells. This

means that the second probability should probably be rather more than 1/3, so the result is

likely to be substantially more than 1/9.

The multiplication rule, then, also has an important condition:

P(A and B) = P(A)*P(B) if A and B are statistically independent

And just as before it can be extended:

P(A and B and C and …) = P(A)*P(B)*P(C)*… if A, B, C, … are statistically

independent

Page 27: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 27 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

SELF APPRAISAL EXERCISE

Exercise

Here follows an exercise to give you further information and food for thought.

The assigned task is:

Two football teams are due to play each other three times in the next few weeks. The

data from their matches in the past suggests that the probability of Team A winning is

50%, and the probability of Team B winning is 30%

1 What is the probability of a draw in the first match?

2 Assuming that the results of the three matches are statistically independent what

is the probability of Team B winning all three matches?

3 Assuming that the results of the three matches are statistically independent what

is the probability of Team B winning at least one of the matches?

4 Do you think the assumption of statistical independence is justified?

.

Page 28: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 28 of 56 Session 7 ©University of Portsmouth

Suggested Answers

1

2

3

4

The match must end in a win for Team A, or a win for Team B, or a draw.

These three probabilities are mutually exclusive so they must add up to

100%. This means the probability of a draw is 20% because

50%+30%+20%=100%.

P(B winning three matches)=P(B wins 1st match)*P(B wins 2

nd

match)*P(B wins 3rd

)

=30% * 30% * 30% = 27/1000 = 0.027 = 2.7%

so it is unlikely!

This is a little more difficult. The easiest approach to ―at least one‖

questions is to work out the probability of the event never happening, and

then subtract it from 1. In this case, the probability of Team B failing to

win each match is 70% (i.e. the probability of a Team A win or a draw), so

the probability of failing to win all three matches is

70% * 70% * 70% = 7/10 * 7/10 *7/10 = 343/1000 = 34.3%

In all other circumstances Team B will win at least one match so the

probability is

P(B wins at least one match) = 100% - 34.3% = 65.7%

Which is fairly likely!

This method is used to derive one of the formulae (for parallel

configurations) below.

This is a difficult question to which there is no easy answer. Perhaps if

Team B wins the first match they would become more confident and so

more likely to win the next match? Or perhaps Team A would get more

determined?

Page 29: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 29 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

Reliability of Systems - Series, Parallel, and Combination Systems

It is often useful to be able estimate the reliability of a whole system from the reliability of

the individual components. This enables designers to predict reliability levels without having

to build and test the whole system. It also enables them to see weak spots in the system, and

to experiment with ways of improving these - perhaps by installing backups for critical

components.

The following three types of configurations will be discussed:

1 Series configuration

2 Parallel configuration

3 Combination of series and parallel

1 Series Configuration

For a series configuration, the system will fail if any of the components in the system fail.

Like links in a chain, if one link breaks, then the entire chain fails to perform the intended

function. This can be represented by a block diagram like the one in Question 1 in the next

set of Questions to stimulate your thinking on page 31.

If R1, R2, R3, ............. Rn are the reliabilities of first, second, third, ........ nth

components in a

series system, then the system reliability is given by:

Rsys. = R1 x R2 x R3 ..............Rn

This follows from the rules of probability. It depends on the assumptions that the

probabilities are statistically independent. This means that knowing one of the probabilities

is of no help in estimating any other the others: they can all be estimated independently of

each other.

When several components are placed in series, the total system reliability decreases quickly.

If we assume that: R1= R2=R3 ..........=Rn =R , then,

Rsys. = R x R x R ...................R

Assume that each component has the reliability of 0.95, then the system reliability of 5

components in series is given by:

Page 30: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 30 of 56 Session 7 ©University of Portsmouth

Rsys. = 0.955

= 0.774

2 Parallel Configuration

In a parallel configuration, the system fails only if all of the individual components fail. If

one component fails the others serve as a backup - see the diagram in Question 2 on page 31.

below.

Rsys. =P(component 1 works) or P(component 2 works) or ....... P(component n

works)

Assuming that there are two components in the system, then for the system to fail, both

components have to fail, which will happen with probability

(1-R1) (1-R2)

So the system reliability is given by

Rsys. = 1-{ (1-R1) (1-R2)}

Again, this assumes that the reliabilities are statistically independent.

This argument can easily be extended to n components in parallel:

Rsys. = 1-{ (1-R1) ... (1-Rn)}

3 Combination of Series and Parallel Configuration

Electrical circuits or mechanical assemblies are often designed in configurations which are

part series, part parallel. To increase the reliability of the system, wherever possible, low

reliability components are placed in parallel, and components with relatively high reliability

are in series.

The following steps could be useful in dealing with these mixed configurations.

Step 1: Calculate the equivalent reliability of components in series within any parallel

configuration.

Page 31: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 31 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

Step 2: Reduce each parallel configuration to an equivalent single reliability.

Step 3: Multiply reliabilities in a series together until the overall reliability is derived.

The exercises below should help make this clear.

The importance of statistical independence

The calculations for both series and parallel configurations depend on the assumption of

statistical independence. The answers may be very different if the probabilities are not

statistically independent - an example is provided by the first self appraisal exercise below.

Page 32: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 32 of 56 Session 7 ©University of Portsmouth

QUESTIONS TO STIMULATE YOUR THINKING

Now consider and answer the following questions. The purpose of which is

to test your retention of the information given in the course notes for this

session. When you have done this, check back on any questions which you

could not answer, or where you gave the wrong answer then go on to the

Self-Appraisal Exercise.

Questions

1 Five components in series have individual reliabilities of 0.99, 0.97, 0.90, 0.98, and

0.94 - as shown in the following (block) diagram. Calculate the reliability of the

system.

0.99 0.97 0.90 0.940.98

2 Three components in parallel have individual reliabilities of 0.99, 0.90, and 0.85.

Calculate the reliability of the system.

0.99

0.90

0.85

. . .

Page 33: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 33 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

. . .

3 Consider the following system:

0.90 0.90

0.90

0.900.90

1 2

3 4

5

The first two components are in series, the last configuration in parallel. Find the

system reliability.

Page 34: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 34 of 56 Session 7 ©University of Portsmouth

Suggested Answers

1 Because the given system is a series configuration, the reliability of the system

can be obtained by:

Rsys. = 0.99 * 0.97 * 0.90 * 0.98 * 0.94 = 0.796

2 Because the components are connected in parallel, the reliability of the system is

given by:

Rsys. = 1 - Qsys. = 1 - {(1 - R1) (1 - R2) (1 - R3)}

= 1 - {(1 - 0.99) * (1 - 0.90) * (1 - 0.85)} = 0.99985

3 This is a simple series/parallel configuration. The first step is to determine the

reliability of the series system with components 3 and 4.

Therefore Rs1 =0.90 * 0.90 = 0.81

Now Rs1 forms a parallel system with component 5. Therefore the reliability of

the parallel system can be determined as follows:

Rp1 = 0.81 + 0.90 - 0.90 * 0.81 = 0.981

Now we have formed a series system with reliabilities 0.90, 0.90 and 0.981.

Therefore the system reliability is given by:

Rsys. = 0.90 * 0.90 * 0.981 = 0.795

Page 35: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 35 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

(BLANK PAGE)

Page 36: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 36 of 56 Session 7 ©University of Portsmouth

SELF-APPRAISAL EXERCISE

Exercise

Here follows another exercise to give you further information and food for thought.

The assigned task is:

1 Below is a cross-section diagram of a booster rocket outer shell at a joint

between two stages. The system will fail only if both o-rings fail.

(a) Do the two o-rings form a series or parallel system?

(b) Suppose the reliability of each o-ring is 0.95 during the most critical

phase of flight. What is the system reliability?

(c) Do you think the two reliabilities are likely to be statistically

independent? Would this make any difference to your answer for the

system reliability?

Atmosphere

Vent

Fuel

O-rings

2 A certain electronic component has an exponential failure time with a mean of

50 hours.

(a) What is the instantaneous failure rate of this component?

(b) What is the reliability of this component at 100 hours?

(c) What is the minimum number of these components that should be placed

in parallel if we desire a reliability of 0.90 at 100 hours?

Page 37: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 37 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

(BLANK PAGE)

Page 38: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 38 of 56 Session 7 ©University of Portsmouth

Suggested Answers

1 a A parallel system as the rocket will fail only if both o-rings fail.

b 1 - (1- 0.95)(1- 0.95) = 99.75%

c This assumes the two potential failures are statistically independent - which

may be unlikely in practice. A single factor - such as temperature - may be

responsible for failures in both. This means that, if we know one o-ring has

failed, our estimate of the probability of the other failing will be higher. In

the extreme case, one o-ring fails whenever the other does, and the

reliability of the system is 0.95. In practice, the reliability of the whole

system is likely to be between 0.95 and 0.9975 (the answer based on the

assumption of independence).

2 a 2% per hour

b R(100) = e-0.02x100=0.1353

c The parallel system will only fail if all components fail. The probability of

each failing in 1 - 0.1353 = 0.8647.

If there are n in parallel we need

1 - 0.8647n = 0.9, or

0.8647n = 0.1

By trial and error (or see the Appendix)

n = 16

We need n components in parallel.

Page 39: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 39 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

Reliability Improvement Techniques

There are two established techniques for this purpose:

1 Fault Tree Analysis

2 Failure Mode Effect and Criticality Analysis (FMECA)

The second (FMECA) is not discussed here as it is covered in another unit of the course

(Performance Evaluation).

Fault Tree Analysis

This is a commonly used technique in industry to evaluate reliability of a system. It was

developed in the early 1960s to evaluate reliability of the Minuteman Launch Control

System. Since then it has gained favour especially when analysing complex systems. Fault

tree analysis begins by identifying the top event, known as the undesirable event of the

system. The undesirable event of the system is caused by events generated and connected by

logic gates such as AND, OR, etc (as you will see below). The following basic steps are

involved in performing fault tree analysis:

i Establishing system definition.

ii Constructing the fault tree.

iii Evaluating the fault tree qualitatively.

iv Collecting basic data such as components' failure rates, repair rates, and failure

occurrence probability.

v Evaluating fault tree quantitatively.

vi Recommending corrective measures.

This method is frequently used as a qualitative evaluation method in order to assist the

designer, planner or operator in deciding how a system may fail and what remedies may be

used to overcome the causes of failure. The method can also be used for quantitative

evaluation, in which case the causes of system failure are gradually broken down into an

increasing number of hierarchical levels until a level is reached at which reliability data is

Page 40: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 40 of 56 Session 7 ©University of Portsmouth

sufficient or precise enough for a quantitative assessment to be made. The appropriate data is

then inserted into the tree at this hierarchial level and combined together using the logic of

the tree to give the reliability assessment of the complete system being studied.

In order to illustrate the application of this method, consider the electric power requirements

of the system in the following example. In this example, the failure event being considered is

'loss of the electric power'. In practice the electric power requirements may be both a.c.

power, to supply energy for prime movers, and d.c. power, to operate relays and contractors,

both of which are required to ensure the successful operation of the electric power.

Consequently the event 'loss of electric power' can be divided into two sub-events 'loss of

a.c.power' and 'loss of d.c.power'. This is shown in the following figure (see Figure 4) with

the events being joined by an OR gate as failure of either, or both, causes the system to fail.

Figure 4: Development of a Fault Tree

If this subdivision is insufficient, sub-events can be divided further. The event 'loss of a.c.

power' may be caused by 'loss of offsite power' (the grid supply) and by 'loss of onsite power'

(standby generators or similar devices). This process can be continued downwards to any

required level of subdivision. After developing a fault tree, it is necessary to evaluate the

probability of occurrence of the upper event by combining component probabilities using

basic rules of probability and the logic defined in the fault tree.

In the present example, suppose the probabilities of the events at the bottom of the tree are:

Prob (Loss of offsite power) = 0.067

Prob (Loss of onsite power) = 0.075

Prob (Loss of dc power) = 0.005

Loss of Electric power

Loss of a.c. power Loss of d.c. power

Loss of offsite power Loss of onsite power

or

and

Figure 5: Development of a Fault Tree

Page 41: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 41 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

Notice that these are the probabilities of faults and are typically low, whereas the reliabilities

used in the block diagrams are typically larger. Obviously

Reliability of the offsite power = 1 – Prob (Loss of offsite power) = 1 – 0.067 = 0.933.

These probabilities for the faults at the bottom of the tree can now be combined using the

addition and multiplication rules of probability. For the AND gate at the bottom we multiply

the probabilities to work out

Prob (Loss of a.c. power) = Prob(Loss of offsite power) x Prob (Loss of onsite power)

= 0.067 x 0.075 = 0.005025.

For the OR gate we add the probabilities to get the probability of the top event:

Prob (Loss of electric power) = Prob (Loss of a.c. power) + Prob (Loss of d..c power)

= 0.005025 + 0.005 = 0.010025.

This means that the probability of the top, undesirable, event – Loss of electric power – is

about 1%. The calculation make two assumptions.

First, to multiply the two probabilities at the bottom of the tree we must assume they are

statistically independent. This is reasonable if the onsite and offsite systems are separate so

that the failure of one is independent of the other. Notice that the combined probability (Loss

of a.c. power) is much less than the probability of either of the two probabilities in the AND

gate. The principle here is that of reducing the risk of a fault by using a backup system: this

will not work, of course, if the two systems are dependent so that if one fails there is an

increased chance that other will fail too.

The second assumption is the assumption when we add the probabilities for the OR gate that

the events are mutually exclusive (i.e. they don‘t overlap). With small probabilities, like we

have here, this is reasonable because the probability of more than one fault occuring is small.

Page 42: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 42 of 56 Session 7 ©University of Portsmouth

SELF-APPRAISAL EXERCISE

Exercise

A risk assessment of a proposed new railway included the following results:

Passenger train derailment would occur if

derailment occurred due to over-speeding (4.3 x 10-11

) or

derailment occurred due to rail faults (6.4 x10-9

) or

derailment occurred due to rolling stock faults (3.9 x10-9

) or

derailment occurred due to running into obstructions (7.1 x 10-9

)

The figures in brackets represent the estimated probability of occurrence of each of

these per train kilometre. For example the first probability means that the probability for

the event in question is 4.3 x 10-11

for each train for each kilometre. Clearly the

probability for six trains, each travelling 100 km, would be 600 times as great.

The probabilities were estimated by breaking the events down into more detail

and using historical data. For example, ―derailment occurred due to rolling stock faults‖

was broken down into seven events which would lead to rolling stock faults - these

included wheel failure and suspension system failure. Furthermore, suspension system

failure would occur if

the suspension system deflated (1.5 x 10-6

), and

the emergency springs failed (1.0 x 10-4

)

The assigned task is:

1 Draw a fault tree from the above information.

2 Estimate the probability (per train kilometre) of suspension system failure and

explain the assumptions on which your estimate is based.

3 Estimate the probability (per train kilometre) of passenger train derailment (the

top event of the fault tree) and explain the assumptions on which your answer is

based.

4 Assuming that the railway is 100 km long and there are 20 trains each way every

day, estimate the mean time between passenger train derailments. Do you think

this represents a satisfactory level of risk?

5 What do you think of the accuracy and usefulness of this type of analysis?

(The events and probabilities are taken from a presentation by C. Leighton &

C. Dennis at a conference on Risk Analysis and Assessment, Edinburgh, 1994.)

Page 43: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 43 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

(BLANK PAGE)

Page 44: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 44 of 56 Session 7 ©University of Portsmouth

Suggested Answers

1 The fault tree is on the next page.

2 Assuming the probabilities of the two events (suspension system deflated and

emergency springs failed) are independent, we can multiply the probabilities, so

the estimate for the probability of suspension system failure is 1.5 x 10-10

. In

practice, this assumption of independence may not be fully realistic if some

causes of the first event also make the second more likely. If the events are not

independent, the probabilities may be much higher.

3 To work out the probability of passenger train derailment we add the four

probabilities to give 1.7 x 10-8

. Note that the probability of the first event - over-

speeding - is negligible compared with the others.

This assumes that these are the only faults that will lead to derailment. Sabotage,

for example, does not seem to be included. (Strictly, adding probabilities

presupposes the events are mutually exclusive. It is possible that more than one

event can occur. However, the probability of this is so small - of the order of 10-

18 - that it can reasonably be ignored.)

4 The probability of derailment per day is 100 x 20 x 2 x 1.7 x 10-8

= 6.8 x 10-5

This is, in effect, a failure rate, so the mean time between derailments (failures)

is 1/(6.8 x 10-5

) or 1.5 x 104 or 15000 days or about 41 years. Derailments can

be expected, on average, every 41 years.

5 The accuracy and usefulness of the model are obviously dependent on the data

on which it is built. In particular, it is obviously important that all input

probabilities are accurate (e.g. of the emergency springs failing), that

assumptions about statistical independence are carefully checked, and that all

lists of events specifying the possible ways a fault can occur are as complete as

possible. This obviously requires systematic research.

Page 45: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 45 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

Fault tree for question 1

(Note that this tree omits some faults, as explained above.)

Derailment

OR

Rolling stock

fault Obstruction Rail fault Over-speeding

Suspension

failure

Wheel failure

OR

AND

Emergency springs

failed

Suspension

deflated

Page 46: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 46 of 56 Session 7 ©University of Portsmouth

Mathematical Appendix

Powers and roots

As you will doubtless know

pppp (four p‘s multiplied together) can be written as p4

You may also need to do this backwards. Suppose that

p4 = 0.7 (p

4 is p^4 on a spreadsheet.)

and you want to know p. p is the fourth root of 0.7 or

p = 0.71/4

= 0.70.25

= 0.915

To check, try 0.9154. This should come to 0.7 (approximately).

Powers are also defined for negative and any other fractional index. Experiment with your

calculator or spreadsheet.

Exponential functions and natural logarithms

ex is a function which arises from the mathematics of constant rates of growth. You should

have a button for it on your calculator, and on a spreadsheet the function will be EXP(X) or

something similar. Some examples (all rounded to two decimal places):

e1 = 2.72

e3 = 20.09

e-1.6

= 0.20

The inverse function to ex is the natural logarithm of x, loge(x) or ln(x). This is useful if you

want to know what x is if, for example,

ex = 0.5

You can find the answer by using natural logarithms:

x = loge(0.5) = -0.69315

This can be checked by working out e-0.69315

. It should be very close to 0.5.

Page 47: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 47 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

Healthy Laptop Case Study

This case study explores some of the concepts of reliability. It concerns a private provider of

health care that runs a healthcare system with 8 acute care hospitals, 100 clinics, and home

healthcare.

The home healthcare (HHC) division provides nurses (i.e. district nurses) to home care

patients. It employs 125 nurses and each one is provided with a laptop that allows them to

maintain patient records, point of care documentation, case loads, and payroll entry. The

nurses maintain a remote connection with the central server to access new cases and upload

any historical patient data gained throughout the day. The availability of the system, quality

of the data and timeliness of transcribing information are critical to the nurses.

The nurses have been complaining that they are having a lot of problems with the laptops and

the connection to the network and server. Reported laptop performance is very poor and 60%

of the nurses have complained of problems of one sort or another. All the laptops are from the

same manufacturer and have a similar configuration. Whenever a failure occurs the nurse has

to return the laptop to the office, obtain a loan machine and then return that to the office when

their own laptop had been repaired. The company knows how many machines have been

returned to the manufacturer for repair but did not itself keep any records of failure symptoms

and causes – that is done by the vendor‘s field service group who made the repairs.

Discussions between HHC‘s IT Director and the manufacturer‘s Field Support Manager are

getting progressively more heated with HHC threatening to switch suppliers completely –

with serious consequences for the manufacturer who actually supplied all HHC‘s IT

hardware. Losing the laptop contract is likely could result in the entire customer being lost.

In order to try and bring the relationship back onto a more business like footing HHC decided

to bring in an independent reliability specialist to arbitrate.

A typical laptop configuration is shown in the following diagram;

Page 48: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 48 of 56 Session 7 ©University of Portsmouth

The manufacturer did not and would not publish reliability data for its products, saying that

―We do not quote MTBF (Mean Time Before Failure) numbers for our products or

believe this type of information should be used as a meaningful description of the quality

of our systems or component parts within those systems. MTBF is an older industry term

that today has very little value and is mostly misinterpreted and misunderstood. The

following is a definition and example of why MTBF should be avoided.

MTBF is the point at which 63.2% of the population, (everything of that component

built by a manufacturer), will fail. So for example, disk drives that claim 1,000,000 hour

MTBF with 720 power on hours per month would take 115 years to reach the 63.2%

mark. It does NOT mean that no failure should occur for 115 years. This is a total

population statement from the manufacturer‘s point of view not from the customer point

of view. As a Manufacturer, we know that some of that product will fail but we have no

idea which ones, and we do not know the distribution of good or bad product shipped to

any given customer. A single customer may get more or less than his fair share of drives

Page 49: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 49 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

that may fail earlier than expected and therefore may achieve better or worse results and

still meet the MTBF criteria.‖

Exercise A.

Imagine that you are the independent reliability consultant;

Please comment on the manufacturer‘s refusal to publish MTBF data – is their position

reasonable? Where does the 63.2% come from?

===========================================================

The manufacturers were willing to show the defect data for HHC‘s laptops, but once again

would not compare those results with a control sample of the same number of the same

product. This laptop failure data (a Pareto diagram) was produced at the next meeting, as

follows;

The average age of the machines was three years. This prompted some discussion that they

were therefore fully depreciated and should be written off and replacements bought. The

consultant thought that this was an accountant‘s view of the situation and was not a view that

could be supported from a quality and reliability perspective. At three years the machines

should be at the most reliable part of the reliability (bathtub) curve – well past early life

failure and not yet approaching the wear-out part of the curve when failures start to increase.

The high number of problems due to damaged covers, hard drives and CD failures made him

suspect that part of the problem lay in the environment in which the laptops were used and

Laptop Failure causes

0

5

10

15

20

25

Dam

aged

cove

rs

Har

drive

failu

re

Una

ble

to re

ad C

D

Cra

cked

displ

ay

Batte

ry n

ot c

harg

ing

Will n

ot b

oot u

p

Flopp

y fa

ilure

Displ

ay u

nrea

dable

Power d

ropp

ing

Oth

er

Mod

em

Keybo

ard

Page 50: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 50 of 56 Session 7 ©University of Portsmouth

the treatment that they received. If this were the case then replacing them with new machines

would not solve the problem as the new ones would ultimately suffer the same fate.

The diagram below shows a fault tree for one of these problem categories – hard drive

failures. (In order to make it legible it has not been fully developed—seek and logic failures

have not been explored.)

Page 51: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 51 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

Exercise B

How can this diagram, and other similar diagrams, be used to analyse and help prevent

failures? Does it show how misuse can contribute to the failure of the component under

review?

===========================================================

The consultant decided to ―walk the process‖ and observe a sample of the nurses at work over

a period of time. He was struck by the fact that the nurses used their cars as mobile offices.

They explained that they did not like taking the laptop into the patient‘s house and updating

patient records in front of them but preferred to do it in the privacy of their car in between

calls. He also noticed that usually the nurses did not pack the laptop away in its case when

they had finished to update and often put it on the car floor where it was easy to get to next

time. He also noticed that nurses without CD players in their cars were prone to use their

laptop to play their favoured music as they drove between calls.

This pointed him to the cause of the reliability problems. Most of the cover damage and

screen breakages were actually being caused by shock and vibration in the car when the

Page 52: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 52 of 56 Session 7 ©University of Portsmouth

laptop was unprotected, or by being dropped when picked up from the floor. Some of the

other failures could be caused by contamination (dust and grit, etc) getting into the drives

while the machine was in the relatively dirty area of the car floor. The operating environment

in the car could cause other failures – the ambient temperature in the car if the laptop was left

in direct sunlight while the nurse was with a patient could easily exceed the manufacturer‘s

specification. The factors that are normally used to calculate failure rates are power on hours,

power on cycles, temperature, and relative humidity.

Although it was not possible to compare these findings to the manufacturer‘s own data it was

possible to compare them to the findings of published industry surveys. This indicated that

laptops were increasingly being used in environments for which they are not designed and

that field personnel were experiencing failure rates as high as 20% per week. This confirmed

the consultant‘s opinion that merely changing from manufacturer‘s standard product to

another‘s would not solve the problem.

She decided that two courses of action were required, both involving Failure Mode Effect

Analysis (FMEA—which is covered in the Performance Evaluation Unit), in order to:

a) Recognise and evaluate the potential failure of a product or process and its effects.

b) Identify actions that could eliminate or reduce the chance of the potential failure

occurring.

With the agreement of HHC it was decided to apply FMEA to the way the nurses used

and treated their laptops (their process) to reduce the opportunities to induce failure.

Process FMEA is normally performed on the manufacturing process. In this case study

we are applying the technique the users process. The objective is to identify where within

the process damage or failure can be caused to the laptop, and what action can be taken

by the nurse or HHC to minimise or eliminate it.

With the agreement of the manufacturer it was also decided to use FMEA to evaluate the

hardware in order to develop a ―ruggedised‖ version of the laptop for use in this and

similar field situations.

These FMEA documents are in the Appendix.

Page 53: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 53 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

Laptop Case Study – Answers and Appendix

Exercise A.

Please comment on the manufacturer’s refusal to publish MTBF data – is their position

reasonable?

In these situations it is always very frustrating for customers and user groups to be told that

they will not be given Mean Time Between Failure (MTBF) and other failure rate data. From

the manufacturer‘s perspective this is a sensible approach because there is always a danger

that the MTBF data will be misinterpreted by the customer. MTBF is a measure applied to

the population as a whole, not to a very small sample.

It is true that caution must exercised when applying averages to individual cases. However,

provided that the customer understands that the MTBF is an average figure, the MTBF would

still be useful, particularly for a customer such as HHC which buys large numbers of laptops.

In this situation, a significant departure from the average figures would indicate either a

problem with the laptops supplied, or that they are being used in unsuitable conditions. (This

issue can be analysed more formally as a null hypothesis test – see Wood, 2003, Chapter 8).

The 63.2% comes from the exponential reliability distribution. If the MTBF is 1,000,000

hours, the failure rate is 0.000001 per hour and the exponential distribution gives

R = e –0.000001 x 1000000

= e –1

= 0.368 = 36.8%

which means that the other 63.2% must have failed. The quotation from the manufacturer is

misleading in that it is not a definition of the MTBF, but rather a fact that follows from the

exponential distribution.

===========================================================

Exercise B.

This diagram is a good illustration of how a fault tree can help in the analysis and prevention

of failures. It shows how external factors such as mechanical shock and dirt can damage

components. Other fault trees should be constructed for the other sub-systems contained in

the laptop schematic diagram and then combined into a fault tree for the entire laptop system.

The first stage of the analysis of the fault tree is a qualitative evaluation of the subsystem to

understand how failure occurs. The next stage is to gather data on component failure rates

and repair rates to allow a quantitative evaluation – where are the major sources of failure and

what action can be taken to minimise or eliminate them?

Page 54: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 54 of 56 Session 7 ©University of Portsmouth

Appendix: FMEA Documents

A process FMEA document for the Field Nurse Laptop Use Process, identifying

reasonable changes that could be made without inconvenience to the nurses, that could

reduce the failure rate of their laptops.

Failure Mode Effect Analysis (Process FMEA)

Process

Step

Potential

Failure

Mode

S

Potential cause/

mechanism of failure

O

Current

Design

Controls

Recommended Actions

(process change)

Put laptop in

car

- Drop laptop 1 - Broken screen

- Dislodged

component - Damaged hard disk

- Damaged outer

casing

1 None 1. Fit laptops into

permanent protective case

with hinged top. 2. Make nurses responsible

for damage caused by

carelessness or abuse, or reward those that do not

cause any problems.

Enter patient

data

- Drop laptop

when picking it up.

- Broken screen

- Dislodged component

- Damaged hard disk

- Damaged outer casing

4 None 1. Instruct nurses to carry

laptop into patient‘s home and not to use their car as

an office.

2. Investigate reengineering the process;

for example, use voice

recognition software so that nurses can phone in

after each call and dictate

directly into the server and eliminate the need for

laptops.

Put laptop on car floor while

driving

- Excessive vibration

- Dirt/dust

entry

- Excessive

heat

- Hard drive damage - Electronic

component failure

8 None Instruct nurses to put laptops in the car boot in

its protective case. Also

avoids a security risk of

laptops being stolen.

Play music CD

in Laptop

- Dirt/dust

entry - Excessive

CD usage

- Failure of CD drive None As above

Take Laptop

out of car

- Drop laptop 1 - Broken screen

- Dislodged

component - Damaged hard disk

- Damaged outer

casing

1. Fit laptops into

permanent protective case

with hinged top. 2. Make nurses responsible

for damage caused by

carelessness or abuse, or reward those that do not

cause any problems.

S = Severity; an assessment of the seriousness of the effect of the potential failure mode.

Rated on a scale of 1 (none) to 10 (most severe)

O = Occurrence; the chance that the specific cause will occur. Rated on a scale of 1 (least

chance of occurrence) to 10 (highest chance of occurrence).

Page 55: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

Page 55 of 56 MSc SQM\QUAN ©University of Portsmouth Session 7

An FMEA study on the laptop itself with the objective of identifying the failure points and

the design changes that could minimise or eliminate the risk of it occurring. First create

the block diagram of a laptop and then the FMEA document (only complete the document

for one item).

Design FMEA should always begin with a block diagram (Total Quality Management,

Besterfield). It is used to show the different flows involved with the item being analysed. Its

purpose is to understand the input to each block, the function of the block, and the output.

First list all the components of the system, their function and the means of connection or

attachment between components, then the components are placed in blocks and their

functional relationship represented by lines connecting the blocks.

The laptop component diagram provided earlier only shows data flows between components

but it can be updated as follows;

Page 56: Unit QUAN Session 7 Reliability - University of Portsmouthwoodm.myweb.port.ac.uk/quan//Quan7.pdf · 2011-08-02 · MSc SQM\QUAN Page 6 of 56 Session 7 ©University of Portsmouth The

Quantitative Methods Unit - QUAN

MSc SQM\QUAN Page 56 of 56 Session 7 ©University of Portsmouth

Failure Mode Effect Analysis (Product Design FMEA)

Item

/Function

Potential Failure Mode

Potential Effect of Failure

S

Potential

cause/

mechanism

of failure

O

Current

Design

Controls

Recommended Actions

(design change)

30 gb hard drive

Store data for

use by the

operating system and

applications.

Read data from

disk and

provide it to the processor

chip when

required.

Write data to disk when

instructed by

the processor chip.

Head / Disk

interference

Complete

functional

failure – data can neither

be read nor

written to or from disk.

10

Contaminatio

n during

manufacture

3

Product is assembled in a

clean room.

1. Increase frequency of checks to ensure that clean room environment meets specification. 2. Increase checks

on clean room staff clothing and

cleanliness

Disk

becoming

distorted by physical

damage

5

Laptop casing

designed to

absorb impact without damage

to contents

1. Use better energy absorbing material for casing. 2. Integrate a

carrying handle into

the casing.

Dirt ingress during

operation

2

Product is a sealed unit.

Review construction

of disk drive case – can it fail in use and

allow dirt ingress?

Slow disk

rotation

Excessive

data errors on

read/write.

Complete

functional failure as

speed drops

9

Bearing

‗sticky‘ if

long period idle.

5

A sample are

laboratory

tested during manufacture

Shelf life problem –

apply stock control

limits to spare parts and assemblies.

Motor defective

1

Manufacturing put a sample on

life test.

Change to a higher specification if test

results show a life

problem.