Handbook of Reliability Engineering

Hoang Pham, Editor

Springer


Springer: London, Berlin, Heidelberg, New York, Hong Kong, Milan, Paris, Tokyo
http://www.springer.de/phys/



Hoang Pham, PhD
Rutgers University
Piscataway
New Jersey, USA

British Library Cataloguing in Publication Data
Handbook of reliability engineering
1. Reliability (Engineering) I. Pham, Hoang
620′.00452
ISBN 1852334533

Library of Congress Cataloging-in-Publication Data
Handbook of reliability engineering / Hoang Pham (ed.).
p. cm.
ISBN 1-85233-453-3 (alk. paper)
1. Reliability (Engineering)–Handbooks, manuals, etc. I. Pham, Hoang.
TA169.H358 2003
620′.00452–dc21 2002030652

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

ISBN 1-85233-453-3 Springer-Verlag London Berlin Heidelberg
A member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.co.uk

© Springer-Verlag London Limited 2003

The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Typesetting: Sunrise Setting Ltd, Torquay, Devon, UK
Printed and bound in the United States of America
69/3830-543210 Printed on acid-free paper SPIN 10795762


To

Michelle, Hoang Jr., and David


Preface

In today’s technological world nearly everyone depends upon the continued functioning of a wide array of complex machinery and equipment for their everyday safety, security, mobility and economic welfare. We expect our electric appliances, lights, hospital monitoring controls, next-generation aircraft, nuclear power plants, data exchange systems, and aerospace applications to function whenever we need them. When they fail, the results can be catastrophic, resulting in injury or even loss of life.

As our society grows in complexity, so do the critical reliability challenges and problems that must be solved. The area of reliability engineering currently receives tremendous attention from numerous researchers and practitioners alike.

This Handbook of Reliability Engineering, comprising 35 chapters, aims to provide a comprehensive state-of-the-art reference volume that covers both fundamental and theoretical work in the areas of reliability, including optimization, multi-state systems, life testing, burn-in, software reliability, system redundancy, component reliability, system reliability, combinatorial optimization, network reliability, consecutive systems, stochastic dependence and aging, change-point modeling, characteristics of life distributions, warranty, maintenance, calibration modeling, step-stress life testing, human reliability, risk assessment, dependability and safety, fault-tolerant systems, system performability, and engineering management.

The Handbook consists of five parts. Part I of the Handbook contains five chapters and deals with different aspects of System Reliability and Optimization.

Chapter 1 by Zuo, Huang and Kuo studies new theoretical concepts and methods for performance evaluation of multi-state k-out-of-n systems. Chapter 2 by Pham describes in detail the characteristics of system reliability with multiple failure modes. Chapter 3 by Chang and Hwang presents several generalizations of the reliability of consecutive-k systems, obtained by exchanging the roles of working and failed components in such systems. Chapter 4 by Levitin and Lisnianski discusses various reliability optimization problems for multi-state systems with two failure modes, using a combination of the universal generating function technique and a genetic algorithm. Chapter 5 by Sung, Cho and Song discusses a variety of exact and heuristic solution approaches, such as integer programming, dynamic programming, greedy-type heuristics, and simulated annealing, for solving combinatorial reliability optimization problems of complex system structures subject to multiple resource and choice constraints.

Part II of the Handbook contains five chapters and focuses on Statistical Reliability Theory. Chapter 6 by Finkelstein presents stochastic models for the observed failure rate of systems with periods of operation and repair that form an alternating process. Chapter 7 by Lai and Xie studies a general concept of stochastic dependence, including positive dependence and dependence orderings. Chapter 8 by Zhao discusses statistical reliability change-point models that can be used to model the reliability of both software and hardware systems. Chapter 9 by Lai and Xie discusses the basic concepts of the stochastic univariate and multivariate aging classes, including the bathtub-shaped failure rate. Chapter 10 by Park studies the characteristics of a new class of NBU-t0 life distributions and its preservation properties.

Part III of the Handbook contains six chapters and focuses on Software Reliability. Chapter 11 by Dalal presents software reliability models that quantify the reliability of software products in the early stages as well as in the test and operational phases. Further research directions useful to practitioners and researchers are also discussed. Chapter 12 by Ledoux provides an overview of several aspects of software reliability modeling, including black-box modeling, white-box modeling, and Bayesian calibration modeling. Chapter 13 by Tokuno and Yamada presents software availability models and their availability measures, such as interval software reliability and the conditional mean available time.

Chapter 14 by Dohi, Goševa-Popstojanova, Vaidyanathan, Trivedi and Osaki presents analytical modeling and measurement-based approaches for evaluating the effectiveness of preventive maintenance in operational software systems and determining the optimal time to perform software rejuvenation. Chapter 15 by Kimura and Yamada discusses nonhomogeneous death process, hidden-Markov process, and continuous state-space models for evaluating and predicting the reliability of software products during the testing phase. Chapter 16 by Pham presents basic concepts and recent studies of nonhomogeneous Poisson process software reliability and cost models that consider random field environments. Some challenging issues in software reliability are also included.

Part IV contains nine chapters and focuses on Maintenance Theory and Testing. Chapter 17 by Murthy and Jack presents an overview of product warranty and maintenance, warranty policies and contracts. Further research topics that link maintenance and warranty are also discussed. Chapter 18 by Pulcini studies stochastic point-process maintenance models with imperfect preventive maintenance, sequences of imperfect and minimal repairs, and imperfect repairs interspersed with imperfect preventive maintenance. Chapter 19 by Dohi, Kaio and Osaki presents the basic preventive maintenance policies and their extensions in terms of both continuous- and discrete-time modeling. Chapter 20 by Nakagawa presents the basic maintenance policies, such as age replacement, block replacement, imperfect preventive maintenance, and periodic replacement with minimal repair, for multi-component systems.

Chapter 21 by Wang and Pham studies various imperfect maintenance models that minimize the system maintenance cost rate. Chapter 22 by Elsayed presents the basic concepts of accelerated life testing and a detailed test plan that can be designed before conducting an accelerated life test. Chapter 23 by Owen and Padgett focuses on the Birnbaum–Saunders distribution and its application in reliability and life testing. Chapter 24 by Tang discusses two related issues in step-stress accelerated life testing (SSALT): how to design a multiple-steps accelerated life test and how to analyze the data obtained from a SSALT. Chapter 25 by Xiong deals with statistical models and estimation based on SSALT data for estimating the unknown parameters in the stress–response relationship and the reliability function at the design stress.

Part V contains ten chapters and primarily focuses on Practices and Emerging Applications. Chapter 26 by Phillips presents proportional and non-proportional hazard reliability models and their applications in reliability analysis using a non-parametric approach. Chapter 27 by Yip, Wang and Chao discusses capture–recapture methods and the Horvitz–Thompson estimator for estimating the number of faults in a computer system. Chapter 28 by Billinton and Allan provides an overview of, and deals with, the reliability evaluation methods for electric power systems. Chapter 29 by Dhillon discusses various aspects of human and medical device reliability.

Chapter 30 by Bari presents the basics of probabilistic risk assessment methods that were developed and matured within the commercial nuclear power reactor industry. Chapter 31 by Akersten and Klefsjö studies methodologies and tools in dependability and safety management. Chapter 32 by Albeanu and Vladicescu deals with approaches in software quality assurance and engineering management. Chapter 33 by Teng and Pham presents a generalized software reliability growth model for N-version programming systems, based on the non-homogeneous Poisson process, which considers the error-introduction rate, the error-removal efficiency, and multi-version coincident failures. Chapter 34 by Carrasco presents Markovian models for evaluating the dependability and performability of fault-tolerant systems.

Chapter 35 by Lee discusses a new random-request availability measure and presents closed-form mathematical expressions for random-request availability that incorporate random task arrivals, the system state, and the operational requirements of the system.

All the chapters are written by over 45 leading reliability experts in academia and industry. I am deeply indebted to, and wish to thank, all of them for their contributions and cooperation. Thanks are also due to the Springer staff, especially Peter Mitchell, Roger Dobbing and Oliver Jackson, for their editorial work. I hope that practitioners will find this Handbook useful when looking for solutions to practical problems, and that researchers can use it for quick access to the background, recent research and trends, and the most important references regarding certain topics in reliability.

Hoang Pham
Piscataway, New Jersey


Contents

PART I. System Reliability and Optimization

1 Multi-state k-out-of-n Systems
Ming J. Zuo, Jinsheng Huang and Way Kuo . . . 3
1.1 Introduction . . . 3
1.2 Relevant Concepts in Binary Reliability Theory . . . 3
1.3 Binary k-out-of-n Models . . . 4
1.3.1 The k-out-of-n:G System with Independently and Identically Distributed Components . . . 5
1.3.2 Reliability Evaluation Using Minimal Path or Cut Sets . . . 5
1.3.3 Recursive Algorithms . . . 6
1.3.4 Equivalence Between a k-out-of-n:G System and an (n − k + 1)-out-of-n:F System . . . 6
1.3.5 The Dual Relationship Between the k-out-of-n G and F Systems . . . 7
1.4 Relevant Concepts in Multi-state Reliability Theory . . . 8
1.5 A Simple Multi-state k-out-of-n:G Model . . . 10
1.6 A Generalized Multi-state k-out-of-n:G System Model . . . 11
1.7 Properties of Generalized Multi-state k-out-of-n:G Systems . . . 13
1.8 Equivalence and Duality in Generalized Multi-state k-out-of-n Systems . . . 15

2 Reliability of Systems with Multiple Failure Modes
Hoang Pham . . . 19
2.1 Introduction . . . 19
2.2 The Series System . . . 20
2.3 The Parallel System . . . 21
2.3.1 Cost Optimization . . . 21
2.4 The Parallel–Series System . . . 22
2.4.1 The Profit Maximization Problem . . . 23
2.4.2 Optimization Problem . . . 24
2.5 The Series–Parallel System . . . 25
2.5.1 Maximizing the Average System Profit . . . 26
2.5.2 Consideration of Type I Design Error . . . 27
2.6 The k-out-of-n Systems . . . 27
2.6.1 Minimizing the Average System Cost . . . 29
2.7 Fault-tolerant Systems . . . 32
2.7.1 Reliability Evaluation . . . 33
2.7.2 Redundancy Optimization . . . 34
2.8 Weighted Systems with Three Failure Modes . . . 34

3 Reliabilities of Consecutive-k Systems
Jen-Chun Chang and Frank K. Hwang . . . 37
3.1 Introduction . . . 37
3.1.1 Background . . . 37
3.1.2 Notation . . . 38
3.2 Computation of Reliability . . . 39
3.2.1 The Recursive Equation Approach . . . 39
3.2.2 The Markov Chain Approach . . . 40
3.2.3 Asymptotic Analysis . . . 41
3.3 Invariant Consecutive Systems . . . 41
3.3.1 Invariant Consecutive-2 Systems . . . 41
3.3.2 Invariant Consecutive-k Systems . . . 42
3.3.3 Invariant Consecutive-k G System . . . 43
3.4 Component Importance and the Component Replacement Problem . . . 43
3.4.1 The Birnbaum Importance . . . 44
3.4.2 Partial Birnbaum Importance . . . 45
3.4.3 The Optimal Component Replacement . . . 45
3.5 The Weighted-consecutive-k-out-of-n System . . . 47
3.5.1 The Linear Weighted-consecutive-k-out-of-n System . . . 47
3.5.2 The Circular Weighted-consecutive-k-out-of-n System . . . 47
3.6 Window Systems . . . 48
3.6.1 The f-within-consecutive-k-out-of-n System . . . 49
3.6.2 The 2-within-consecutive-k-out-of-n System . . . 51
3.6.3 The b-fold-window System . . . 52
3.7 Network Systems . . . 53
3.7.1 The Linear Consecutive-2 Network System . . . 53
3.7.2 The Linear Consecutive-k Network System . . . 54
3.7.3 The Linear Consecutive-k Flow Network System . . . 55
3.8 Conclusion . . . 57

4 Multi-state System Reliability Analysis and Optimization
G. Levitin and A. Lisnianski . . . 61
4.1 Introduction . . . 61
4.1.1 Notation . . . 63
4.2 Multi-state System Reliability Measures . . . 63
4.3 Multi-state System Reliability Indices Evaluation Based on the Universal Generating Function . . . 64
4.4 Determination of u-function of Complex Multi-state System Using Composition Operators . . . 67
4.5 Importance and Sensitivity Analysis of Multi-state Systems . . . 68
4.6 Multi-state System Structure Optimization Problems . . . 72
4.6.1 Optimization Technique . . . 73
4.6.1.1 Genetic Algorithm . . . 73
4.6.1.2 Solution Representation and Decoding Procedure . . . 75
4.6.2 Structure Optimization of Series–Parallel System with Capacity-based Performance Measure . . . 75
4.6.2.1 Problem Formulation . . . 75
4.6.2.2 Solution Quality Evaluation . . . 76
4.6.3 Structure Optimization of Multi-state System with Two Failure Modes . . . 77
4.6.3.1 Problem Formulation . . . 77
4.6.3.2 Solution Quality Evaluation . . . 80
4.6.4 Structure Optimization for Multi-state System with Fixed Resource Requirements and Unreliable Sources . . . 83
4.6.4.1 Problem Formulation . . . 83
4.6.4.2 Solution Quality Evaluation . . . 84
4.6.4.3 The Output Performance Distribution of a System Containing Identical Elements in the Main Producing Subsystem . . . 85
4.6.4.4 The Output Performance Distribution of a System Containing Different Elements in the Main Producing Subsystem . . . 85
4.6.5 Other Problems of Multi-state System Optimization . . . 87

5 Combinatorial Reliability Optimization
C. S. Sung, Y. K. Cho and S. H. Song . . . 91
5.1 Introduction . . . 91
5.2 Combinatorial Reliability Optimization Problems of Series Structure . . . 95
5.2.1 Optimal Solution Approaches . . . 95
5.2.1.1 Partial Enumeration Method . . . 95
5.2.1.2 Branch-and-bound Method . . . 96
5.2.1.3 Dynamic Programming . . . 98
5.2.2 Heuristic Solution Approach . . . 99
5.3 Combinatorial Reliability Optimization Problems of a Non-series Structure . . . 102
5.3.1 Mixed Series–Parallel System Optimization Problems . . . 102
5.3.2 General System Optimization Problems . . . 106
5.4 Combinatorial Reliability Optimization Problems with Multiple-choice Constraints . . . 107
5.4.1 One-dimensional Problems . . . 108
5.4.2 Multi-dimensional Problems . . . 111
5.5 Summary . . . 113

PART II. Statistical Reliability Theory

6 Modeling the Observed Failure Rate
M. S. Finkelstein . . . 117
6.1 Introduction . . . 117
6.2 Survival in the Plane . . . 118
6.2.1 One-dimensional Case . . . 118
6.2.2 Fixed Obstacles . . . 119
6.2.3 Failure Rate Process . . . 121
6.2.4 Moving Obstacles . . . 122
6.3 Multiple Availability . . . 124
6.3.1 Statement of the Problem . . . 124
6.3.2 Ordinary Multiple Availability . . . 125
6.3.3 Accuracy of a Fast Repair Approximation . . . 126
6.3.4 Two Non-serviced Demands in a Row . . . 127
6.3.5 Not More than N Non-serviced Demands . . . 129
6.3.6 Time Redundancy . . . 130
6.4 Modeling the Mixture Failure Rate . . . 132
6.4.1 Definitions and Conditional Characteristics . . . 132
6.4.2 Additive Model . . . 133
6.4.3 Multiplicative Model . . . 133
6.4.4 Some Examples . . . 135
6.4.5 Inverse Problem . . . 136

7 Concepts of Stochastic Dependence in Reliability Analysis
C. D. Lai and M. Xie . . . 141
7.1 Introduction . . . 141
7.2 Important Conditions Describing Positive Dependence . . . 142
7.2.1 Six Basic Conditions . . . 143
7.2.2 The Relative Stringency of the Conditions . . . 143
7.2.3 Positive Quadrant Dependent in Expectation . . . 144
7.2.4 Associated Random Variables . . . 144
7.2.5 Positively Correlated Distributions . . . 145
7.2.6 Summary of Interrelationships . . . 145
7.3 Positive Quadrant Dependent Concept . . . 145
7.3.1 Constructions of Positive Quadrant Dependent Bivariate Distributions . . . 146
7.3.2 Applications of Positive Quadrant Dependence Concept to Reliability . . . 146
7.3.3 Effect of Positive Dependence on the Mean Lifetime of a Parallel System . . . 146
7.3.4 Inequality Without Any Aging Assumption . . . 147
7.4 Families of Bivariate Distributions that are Positive Quadrant Dependent . . . 147
7.4.1 Positive Quadrant Dependent Bivariate Distributions with Simple Structures . . . 148
7.4.2 Positive Quadrant Dependent Bivariate Distributions with More Complicated Structures . . . 149
7.4.3 Positive Quadrant Dependent Bivariate Uniform Distributions . . . 150
7.4.3.1 Generalized Farlie–Gumbel–Morgenstern Family of Copulas . . . 151
7.5 Some Related Issues on Positive Dependence . . . 152
7.5.1 Examples of Bivariate Positive Dependence Stronger than Positive Quadrant Dependent Condition . . . 152
7.5.2 Examples of Negative Quadrant Dependence . . . 153
7.6 Positive Dependence Orderings . . . 153
7.7 Concluding Remarks . . . 154

8 Statistical Reliability Change-point Estimation Models
Ming Zhao . . . 157
8.1 Introduction . . . 157
8.2 Assumptions in Reliability Change-point Models . . . 158
8.3 Some Specific Change-point Models . . . 159
8.3.1 Jelinski–Moranda De-eutrophication Model with a Change Point . . . 159
8.3.1.1 Model Review . . . 159
8.3.1.2 Model with One Change Point . . . 159
8.3.2 Weibull Change-point Model . . . 160
8.3.3 Littlewood Model with One Change Point . . . 160
8.4 Maximum Likelihood Estimation . . . 160
8.5 Application . . . 161
8.6 Summary . . . 162

9 Concepts and Applications of Stochastic Aging in Reliability
C. D. Lai and M. Xie . . . 165
9.1 Introduction . . . 165
9.2 Basic Concepts for Univariate Reliability Classes . . . 167
9.2.1 Some Acronyms and the Notions of Aging . . . 167
9.2.2 Definitions of Reliability Classes . . . 167
9.2.3 Interrelationships . . . 169
9.3 Properties of the Basic Concepts . . . 169
9.3.1 Properties of Increasing and Decreasing Failure Rates . . . 169
9.3.2 Property of Increasing Failure Rate on Average . . . 169
9.3.3 Properties of NBU, NBUC, and NBUE . . . 169
9.4 Distributions with Bathtub-shaped Failure Rates . . . 169
9.5 Life Classes Characterized by the Mean Residual Lifetime . . . 170
9.6 Some Further Classes of Aging . . . 171
9.7 Partial Ordering of Life Distributions . . . 171
9.7.1 Relative Aging . . . 172
9.7.2 Applications of Partial Orderings . . . 172
9.8 Bivariate Reliability Classes . . . 173
9.9 Tests of Stochastic Aging . . . 173
9.9.1 A General Sketch of Tests . . . 174
9.9.2 Summary of Tests of Aging in Univariate Case . . . 177
9.9.3 Summary of Tests of Bivariate Aging . . . 177
9.10 Concluding Remarks on Aging . . . 177

10 Class of NBU-t0 Life Distribution
Dong Ho Park . . . 181
10.1 Introduction . . . 181
10.2 Characterization of NBU-t0 Class . . . 182
10.2.1 Boundary Members of NBU-t0 and NWU-t0 . . . 182
10.2.2 Preservation of NBU-t0 and NWU-t0 Properties under Reliability Operations . . . 184
10.3 Estimation of NBU-t0 Life Distribution . . . 186
10.3.1 Reneau–Samaniego Estimator . . . 186
10.3.2 Chang–Rao Estimator . . . 188
10.3.2.1 Positively Biased Estimator . . . 188
10.3.2.2 Geometric Mean Estimator . . . 188
10.4 Tests for NBU-t0 Life Distribution . . . 189
10.4.1 Tests for NBU-t0 Alternatives Using Complete Data . . . 189
10.4.1.1 Hollander–Park–Proschan Test . . . 190
10.4.1.2 Ebrahimi–Habibullah Test . . . 192
10.4.1.3 Ahmad Test . . . 193
10.4.2 Tests for NBU-t0 Alternatives Using Incomplete Data . . . 195

PART III. Software Reliability

11 Software Reliability Models: A Selective Survey and New Directions
Siddhartha R. Dalal . . . 201
11.1 Introduction . . . 201
11.2 Static Models . . . 203
11.2.1 Phase-based Model: Gaffney and Davis . . . 203
11.2.2 Predictive Development Life Cycle Model: Dalal and Ho . . . 203
11.3 Dynamic Models: Reliability Growth Models for Testing and Operational Use . . . 205
11.3.1 A General Class of Models . . . 205
11.3.2 Assumptions Underlying the Reliability Growth Models . . . 206
11.3.3 Caution in Using Reliability Growth Models . . . 207
11.4 Reliability Growth Modeling with Covariates . . . 207
11.5 When to Stop Testing Software . . . 208
11.6 Challenges and Conclusions . . . 209

12 Software Reliability Modeling
James Ledoux . . . 213
12.1 Introduction . . . 213
12.2 Basic Concepts of Stochastic Modeling . . . 214
12.2.1 Metrics with Regard to the First Failure . . . 214
12.2.2 Stochastic Process of Times of Failure . . . 215
12.3 Black-box Software Reliability Models . . . 215
12.3.1 Self-exciting Point Processes . . . 216
12.3.1.1 Counting Statistics for a Self-exciting Point Process . . . 218
12.3.1.2 Likelihood Function for a Self-exciting Point Process . . . 218
12.3.1.3 Reliability and Mean Time to Failure Functions . . . 218
12.3.2 Classification of Software Reliability Models . . . 219
12.3.2.1 0-Memory Self-exciting Point Process . . . 219
12.3.2.2 Non-homogeneous Poisson Process Model: λ(t; Ht, F0) = f(t; F0) and is Deterministic . . . 220
12.3.2.3 1-Memory Self-exciting Point Process with λ(t; Ht, F0) = f(N(t), t − TN(t), F0) . . . 221
12.3.2.4 m ≥ 2-Memory . . . 221
12.4 White-box Modeling . . . 222
12.5 Calibration of Model . . . 223
12.5.1 Frequentist Procedures . . . 223
12.5.2 Bayesian Procedure . . . 225
12.6 Current Issues . . . 225
12.6.1 Black-box Modeling . . . 225
12.6.1.1 Imperfect Debugging . . . 225
12.6.1.2 Early Prediction of Software Reliability . . . 226
12.6.1.3 Environmental Factors . . . 227
12.6.1.4 Conclusion . . . 228
12.6.2 White-box Modeling . . . 229
12.6.3 Statistical Issues . . . 230

13 Software Availability Theory and Its Applications
Koichi Tokuno and Shigeru Yamada . . . 235
13.1 Introduction . . . 235
13.2 Basic Model and Software Availability Measures . . . 236
13.3 Modified Models . . . 239
13.3.1 Model with Two Types of Failure . . . 239
13.3.2 Model with Two Types of Restoration . . . 240
13.4 Applied Models . . . 241
13.4.1 Model with Computation Performance . . . 241
13.4.2 Model for Hardware–Software System . . . 242
13.5 Concluding Remarks . . . 243

14 Software Rejuvenation: Modeling and Applications
Tadashi Dohi, Katerina Goševa-Popstojanova, Kalyanaraman Vaidyanathan, Kishor S. Trivedi and Shunji Osaki . . . 245
14.1 Introduction . . . 245
14.2 Modeling-based Estimation . . . 246
14.2.1 Examples in Telecommunication Billing Applications . . . 247
14.2.2 Examples in a Transaction-based Software System . . . 251
14.2.3 Examples in a Cluster System . . . 255
14.3 Measurement-based Estimation . . . 257
14.3.1 Time-based Estimation . . . 258
14.3.2 Time and Workload-based Estimation . . . 260
14.4 Conclusion and Future Work . . . 262

15 Software Reliability Management: Techniques and Applications
Mitsuhiro Kimura and Shigeru Yamada . . . 265
15.1 Introduction . . . 265
15.2 Death Process Model for Software Testing Management . . . 266
15.2.1 Model Description . . . 267
15.2.1.1 Mean Number of Remaining Software Faults/Testing Cases . . . 268
15.2.1.2 Mean Time to Extinction . . . 268
15.2.2 Estimation Method of Unknown Parameters . . . 268
15.2.2.1 Case of 0 < α ≤ 1 . . . 268
15.2.2.2 Case of α = 0 . . . 269
15.2.3 Software Testing Progress Evaluation . . . 269
15.2.4 Numerical Illustrations . . . 270
15.2.5 Concluding Remarks . . . 271
15.3 Estimation Method of Imperfect Debugging Probability . . . 271
15.3.1 Hidden-Markov Modeling for Software Reliability Growth Phenomenon . . . 271
15.3.2 Estimation Method of Unknown Parameters . . . 272
15.3.3 Numerical Illustrations . . . 273
15.3.4 Concluding Remarks . . . 274
15.4 Continuous State Space Model for Large-scale Software . . . 274
15.4.1 Model Description . . . 275
15.4.2 Nonlinear Characteristics of Software Debugging Speed . . . 277
15.4.3 Estimation Method of Unknown Parameters . . . 277
15.4.4 Software Reliability Assessment Measures . . . 279
15.4.4.1 Expected Number of Remaining Faults and Its Variance . . . 279
15.4.4.2 Cumulative and Instantaneous Mean Time Between Failures . . . 279
15.4.5 Concluding Remarks . . . 280
15.5 Development of a Software Reliability Management Tool . . . 280
15.5.1 Definition of the Specification Requirement . . . 280
15.5.2 Object-oriented Design . . . 281
15.5.3 Examples of Reliability Estimation and Discussion . . . 282

16 Recent Studies in Software Reliability Engineering
Hoang Pham . . . 285
16.1 Introduction . . . 285
16.1.1 Software Reliability Concepts . . . 285
16.1.2 Software Life Cycle . . . 288
16.2 Software Reliability Modeling . . . 288
16.2.1 A Generalized Non-homogeneous Poisson Process Model . . . 289
16.2.2 Application 1: The Real-time Control System . . . 289
16.3 Generalized Models with Environmental Factors . . . 289
16.3.1 Parameters Estimation . . . 292
16.3.2 Application 2: The Real-time Monitor Systems . . . 292
16.4 Cost Modeling . . . 295
16.4.1 Generalized Risk–Cost Models . . . 295
16.5 Recent Studies with Considerations of Random Field Environments . . . 296
16.5.1 A Reliability Model . . . 297
16.5.2 A Cost Model . . . 297
16.6 Further Reading . . . 300

PART IV. Maintenance Theory and Testing

17 Warranty and Maintenance
D. N. P. Murthy and N. Jack . . . 305
17.1 Introduction . . . 305
17.2 Product Warranties: An Overview . . . 306
17.2.1 Role and Concept . . . 306
17.2.2 Product Categories . . . 306
17.2.3 Warranty Policies . . . 306
17.2.3.1 Warranty Policies for Standard Products Sold Individually . . . 306
17.2.3.2 Warranty Policies for Standard Products Sold in Lots . . . 307
17.2.3.3 Warranty Policies for Specialized Products . . . 307
17.2.3.4 Extended Warranties . . . 307
17.2.3.5 Warranties for Used Products . . . 308
17.2.4 Issues in Product Warranty . . . 308
17.2.4.1 Warranty Cost Analysis . . . 308
17.2.4.2 Warranty Servicing . . . 309
17.2.5 Review of Warranty Literature . . . 309
17.3 Maintenance: An Overview . . . 309
17.3.1 Corrective Maintenance . . . 309
17.3.2 Preventive Maintenance . . . 310
17.3.3 Review of Maintenance Literature . . . 310
17.4 Warranty and Corrective Maintenance . . . 311
17.5 Warranty and Preventive Maintenance . . . 312
17.6 Extended Warranties and Service Contracts . . . 313
17.7 Conclusions and Topics for Future Research . . . 314

18 Mechanical Reliability and Maintenance Models
Gianpaolo Pulcini . . . 317
18.1 Introduction . . . 317
18.2 Stochastic Point Processes . . . 318
18.3 Perfect Maintenance . . . 320
18.4 Minimal Repair . . . 321
18.4.1 No Trend with Operating Time . . . 323
18.4.2 Monotonic Trend with Operating Time . . . 323
18.4.2.1 The Power Law Process . . . 324
18.4.2.2 The Log–Linear Process . . . 325
18.4.2.3 Bounded Intensity Processes . . . 326
18.4.3 Bathtub-type Intensity . . . 327
18.4.3.1 Numerical Example . . . 328
18.4.4 Non-homogeneous Poisson Process Incorporating Covariate Information . . . 329
18.5 Imperfect or Worse Repair . . . 330
18.5.1 Proportional Age Reduction Models . . . 330
18.5.2 Inhomogeneous Gamma Processes . . . 331
18.5.3 Lawless–Thiagarajah Models . . . 333
18.5.4 Proportional Intensity Variation Model . . . 334
18.6 Complex Maintenance Policy . . . 335
18.6.1 Sequence of Perfect and Minimal Repairs Without Preventive Maintenance . . . 336
18.6.2 Minimal Repairs Interspersed with Perfect Preventive Maintenance . . . 338
18.6.3 Imperfect Repairs Interspersed with Perfect Preventive Maintenance . . . 339
18.6.4 Minimal Repairs Interspersed with Imperfect Preventive Maintenance . . . 340
18.6.4.1 Numerical Example . . . 341
18.6.5 Corrective Repairs Interspersed with Preventive Maintenance Without Restrictive Assumptions . . . 342
18.7 Reliability Growth . . . 343
18.7.1 Continuous Models . . . 344
18.7.2 Discrete Models . . . 345

19 Preventive Maintenance Models: Replacement, Repair, Ordering, and Inspection
Tadashi Dohi, Naoto Kaio and Shunji Osaki . . . 349
19.1 Introduction . . . 349
19.2 Block Replacement Models . . . 350
19.2.1 Model I . . . 350
19.2.2 Model II . . . 352
19.2.3 Model III . . . 352
19.3 Age Replacement Models . . . 354
19.3.1 Basic Age Replacement Model . . . 354
19.4 Ordering Models . . . 356
19.4.1 Continuous-time Model . . . 357
19.4.2 Discrete-time Model . . . 358
19.4.3 Combined Model with Minimal Repairs . . . 359
19.5 Inspection Models . . . 361
19.5.1 Nearly Optimal Inspection Policy by Kaio and Osaki (K&O Policy) . . . 362
19.5.2 Nearly Optimal Inspection Policy by Munford and Shahani (M&S Policy) . . . 363
19.5.3 Nearly Optimal Inspection Policy by Nakagawa and Yasui (N&Y Policy) . . . 363
19.6 Concluding Remarks . . . 363

20 Maintenance and Optimum Policy
Toshio Nakagawa . . . 367
20.1 Introduction . . . 367
20.2 Replacement Policies . . . 368
20.2.1 Age Replacement . . . 368
20.2.2 Block Replacement . . . 370
20.2.2.1 No Replacement at Failure . . . 370
20.2.2.2 Replacement with Two Variables . . . 371
20.2.3 Periodic Replacement . . . 371
20.2.3.1 Modified Models with Two Variables . . . 372
20.2.3.2 Replacement at N Variables . . . 373
20.2.4 Other Replacement Models . . . 373
20.2.4.1 Replacements with Discounting . . . 373
20.2.4.2 Discrete Replacement Models . . . 374
20.2.4.3 Replacements with Two Types of Unit . . . 375
20.2.4.4 Replacement of a Shock Model . . . 376
20.2.5 Remarks . . . 377
20.3 Preventive Maintenance Policies . . . 378
20.3.1 One-unit System . . . 378
20.3.1.1 Interval Reliability . . . 379
20.3.2 Two-unit System . . . 380
20.3.3 Imperfect Preventive Maintenance . . . 381
20.3.3.1 Imperfect with Probability . . . 383
20.3.3.2 Reduced Age . . . 383
20.3.4 Modified Preventive Maintenance . . . 384
20.4 Inspection Policies . . . 385
20.4.1 Standard Inspection . . . 386
20.4.2 Inspection with Preventive Maintenance . . . 387
20.4.3 Inspection of a Storage System . . . 388

21 Optimal Imperfect Maintenance Models
Hongzhou Wang and Hoang Pham . . . 397
21.1 Introduction . . . 397
21.2 Treatment Methods for Imperfect Maintenance . . . 399
21.2.1 Treatment Method 1 . . . 399
21.2.2 Treatment Method 2 . . . 400
21.2.3 Treatment Method 3 . . . 401
21.2.4 Treatment Method 4 . . . 402
21.2.5 Treatment Method 5 . . . 403
21.2.6 Treatment Method 6 . . . 403
21.2.7 Treatment Method 7 . . . 403
21.2.8 Other Methods . . . 404
21.3 Some Results on Imperfect Maintenance . . . 404
21.3.1 A Quasi-renewal Process and Imperfect Maintenance . . . 404
21.3.1.1 Imperfect Maintenance Model A . . . 405
21.3.1.2 Imperfect Maintenance Model B . . . 405
21.3.1.3 Imperfect Maintenance Model C . . . 405
21.3.1.4 Imperfect Maintenance Model D . . . 407
21.3.1.5 Imperfect Maintenance Model E . . . 408
21.3.2 Optimal Imperfect Maintenance of k-out-of-n Systems . . . 409
21.4 Future Research on Imperfect Maintenance . . . 411
21.A Appendix . . . 412
21.A.1 Acronyms and Definitions . . . 412
21.A.2 Exercises . . . 412

22 Accelerated Life Testing
Elsayed A. Elsayed . . . 415
22.1 Introduction . . . 415
22.2 Design of Accelerated Life Testing Plans . . . 416
22.2.1 Stress Loadings . . . 416
22.2.2 Types of Stress . . . 416
22.3 Accelerated Life Testing Models . . . 417
22.3.1 Parametric Statistics-based Models . . . 418
22.3.2 Acceleration Model for the Exponential Model . . . 419
22.3.3 Acceleration Model for the Weibull Model . . . 420
22.3.4 The Arrhenius Model . . . 422
22.3.5 Non-parametric Accelerated Life Testing Models: Cox’s Model . . . 424
22.4 Extensions of the Proportional Hazards Model . . . 426

23 Accelerated Test Models with the Birnbaum–Saunders Distribution
W. Jason Owen and William J. Padgett . . . 429
23.1 Introduction . . . 429
23.1.1 Accelerated Testing . . . 430
23.1.2 The Birnbaum–Saunders Distribution . . . 431
23.2 Accelerated Birnbaum–Saunders Models . . . 431
23.2.1 The Power-law Accelerated Birnbaum–Saunders Model . . . 432
23.2.2 Cumulative Damage Models . . . 432
23.2.2.1 Additive Damage Models . . . 433
23.2.2.2 Multiplicative Damage Models . . . 434
23.3 Inference Procedures with Accelerated Life Models . . . 435
23.4 Estimation from Experimental Data . . . 437
23.4.1 Fatigue Failure Data . . . 437
23.4.2 Micro-Composite Strength Data . . . 437

24 Multiple-steps Step-stress Accelerated Life Test
Loon-Ching Tang . . . 441
24.1 Introduction . . . 441
24.2 Cumulative Exposure Models . . . 443
24.3 Planning a Step-stress Accelerated Life Test . . . 445
24.3.1 Planning a Simple Step-stress Accelerated Life Test . . . 446
24.3.1.1 The Likelihood Function . . . 446
24.3.1.2 Setting a Target Accelerating Factor . . . 447
24.3.1.3 Maximum Likelihood Estimator and Asymptotic Variance . . . 447
24.3.1.4 Nonlinear Programming for Joint Optimality in Hold Time and Low Stress . . . 447
24.3.2 Multiple-steps Step-stress Accelerated Life Test Plans . . . 448
24.4 Data Analysis in the Step-stress Accelerated Life Test . . . 450
24.4.1 Multiply Censored, Continuously Monitored Step-stress Accelerated Life Test . . . 450
24.4.1.1 Parameter Estimation for Weibull Distribution . . . 451
24.4.2 Read-out Data . . . 451
24.5 Implementation in Microsoft Excel™ . . . 453
24.6 Conclusion . . . 454

25 Step-stress Accelerated Life Testing
Chengjie Xiong . . . 457
25.1 Introduction . . . 457
25.2 Step-stress Life Testing with Constant Stress-change Times . . . 457
25.2.1 Cumulative Exposure Model . . . 457
25.2.2 Estimation with Exponential Data . . . 459
25.2.3 Estimation with Other Distributions . . . 462
25.2.4 Optimum Test Plan . . . 463
25.3 Step-stress Life Testing with Random Stress-change Times . . . 463
25.3.1 Marginal Distribution of Lifetime . . . 463
25.3.2 Estimation . . . 467
25.3.3 Optimum Test Plan . . . 467
25.4 Bibliographical Notes . . . 468

PART V. Practices and Emerging Applications

26 Statistical Methods for Reliability Data Analysis
Michael J. Phillips . . . 475
26.1 Introduction . . . 475
26.2 Nature of Reliability Data . . . 475
26.3 Probability and Random Variables . . . 478
26.4 Principles of Statistical Methods . . . 479
26.5 Censored Data . . . 480
26.6 Weibull Regression Model . . . 483
26.7 Accelerated Failure-time Model . . . 485
26.8 Proportional Hazards Model . . . 486
26.9 Residual Plots for the Proportional Hazards Model . . . 489
26.10 Non-proportional Hazards Models . . . 490
26.11 Selecting the Model and the Variables . . . 491
26.12 Discussion . . . 491

27 The Application of Capture–Recapture Methods in Reliability Studies
Paul S. F. Yip, Yan Wang and Anne Chao . . . 493
27.1 Introduction . . . 493
27.2 Formulation of the Problem . . . 495
27.2.1 Homogeneous Model with Recapture . . . 496
27.2.2 A Seeded Fault Approach Without Recapture . . . 498
27.2.3 Heterogeneous Model . . . 499
27.2.3.1 Non-parametric Case: λi(t) = γi αt . . . 499
27.2.3.2 Parametric Case: λi(t) = γi . . . 501
27.3 A Sequential Procedure . . . 504
27.4 Real Examples . . . 504
27.5 Simulation Studies . . . 505
27.6 Discussion . . . 508

28 Reliability of Electric Power Systems: An Overview
Roy Billinton and Ronald N. Allan . . . 511
28.1 Introduction . . . 511
28.2 System Reliability Performance . . . 512
28.3 System Reliability Prediction . . . 515
28.3.1 System Analysis . . . 515
28.3.2 Predictive Assessment at HLI . . . 516
28.3.3 Predictive Assessment at HLII . . . 518
28.3.4 Distribution System Reliability Assessment . . . 519
28.3.5 Predictive Assessment at HLIII . . . 520
28.4 System Reliability Data . . . 521
28.4.1 Canadian Electricity Association Database . . . 522
28.4.2 Canadian Electricity Association Equipment Reliability Information System Database for HLI Evaluation . . . 523
28.4.3 Canadian Electricity Association Equipment Reliability Information System Database for HLII Evaluation . . . 523
28.4.4 Canadian Electricity Association Equipment Reliability Information System Database for HLIII Evaluation . . . 524
28.5 System Reliability Worth . . . 525
28.6 Guide to Further Study . . . 527

29 Human and Medical Device Reliability
B. S. Dhillon . . . 529
29.1 Introduction . . . 529
29.2 Human and Medical Device Reliability Terms and Definitions . . . 529
29.3 Human Stress–Performance Effectiveness, Human Error Types, and Causes of Human Error . . . 530
29.4 Human Reliability Analysis Methods . . . 531
29.4.1 Probability Tree Method . . . 531
29.4.2 Fault Tree Method . . . 532
29.4.3 Markov Method . . . 534
29.5 Human Unreliability Data Sources . . . 535
29.6 Medical Device Reliability Related Facts and Figures . . . 535
29.7 Medical Device Recalls and Equipment Classification . . . 536
29.8 Human Error in Medical Devices . . . 537
29.9 Tools for Medical Device Reliability Assurance . . . 537
29.9.1 General Method . . . 538
29.9.2 Failure Modes and Effect Analysis . . . 538
29.9.3 Fault Tree Method . . . 538
29.9.4 Markov Method . . . 538
29.10 Data Sources for Performing Medical Device Reliability Studies . . . 539
29.11 Guidelines for Reliability Engineers with Respect to Medical Devices . . . 539

30 Probabilistic Risk Assessment
Robert A. Bari

30.1 Introduction
30.2 Historical Comments
30.3 Probabilistic Risk Assessment Methodology
30.4 Engineering Risk Versus Environmental Risk
30.5 Risk Measures and Public Impact
30.6 Transition to Risk-informed Regulation
30.7 Some Successful Probabilistic Risk Assessment Applications
30.8 Comments on Uncertainty
30.9 Deterministic, Probabilistic, Prescriptive, Performance-based
30.10 Outlook

31 Total Dependability Management
Per Anders Akersten and Bengt Klefsjö

31.1 Introduction
31.2 Background
31.3 Total Dependability Management
31.4 Management System Components
31.5 Conclusions

32 Total Quality for Software Engineering Management
G. Albeanu and Fl. Popentiu Vladicescu

32.1 Introduction
32.1.1 The Meaning of Software Quality
32.1.2 Approaches in Software Quality Assurance
32.2 The Practice of Software Engineering
32.2.1 Software Lifecycle
32.2.2 Software Development Process
32.2.3 Software Measurements
32.3 Software Quality Models
32.3.1 Measuring Aspects of Quality
32.3.2 Software Reliability Engineering
32.3.3 Effort and Cost Models


32.4 Total Quality Management for Software Engineering
32.4.1 Deming's Theory
32.4.2 Continuous Improvement
32.5 Conclusions

33 Software Fault Tolerance
Xiaolin Teng and Hoang Pham

33.1 Introduction
33.2 Software Fault-tolerant Methodologies
33.2.1 N-version Programming
33.2.2 Recovery Block
33.2.3 Other Fault-tolerance Techniques
33.3 N-version Programming Modeling
33.3.1 Basic Analysis
33.3.1.1 Data-domain Modeling
33.3.1.2 Time-domain Modeling
33.3.2 Reliability in the Presence of Failure Correlation
33.3.3 Reliability Analysis and Modeling
33.4 Generalized Non-homogeneous Poisson Process Model Formulation
33.5 Non-homogeneous Poisson Process Reliability Model for N-version Programming Systems
33.5.1 Model Assumptions
33.5.2 Model Formulations
33.5.2.1 Mean Value Functions
33.5.2.2 Common Failures
33.5.2.3 Concurrent Independent Failures
33.5.3 N-version Programming System Reliability
33.5.4 Parameter Estimation
33.6 N-version Programming–Software Reliability Growth
33.6.1 Applications of N-version Programming–Software Reliability Growth Models
33.6.1.1 Testing Data
33.7 Conclusion

34 Markovian Dependability/Performability Modeling of Fault-tolerant Systems
Juan A. Carrasco

34.1 Introduction
34.2 Measures
34.2.1 Expected Steady-state Reward Rate
34.2.2 Expected Cumulative Reward Till Exit of a Subset of States
34.2.3 Expected Cumulative Reward During Stay in a Subset of States
34.2.4 Expected Transient Reward Rate
34.2.5 Expected Averaged Reward Rate
34.2.6 Cumulative Reward Distribution Till Exit of a Subset of States
34.2.7 Cumulative Reward Distribution During Stay in a Subset of States
34.2.8 Cumulative Reward Distribution
34.2.9 Extended Reward Structures
34.3 Model Specification
34.4 Model Solution
34.5 The Largeness Problem
34.6 A Case Study
34.7 Conclusions

35 Random-request Availability
Kang W. Lee

35.1 Introduction
35.2 System Description and Definition
35.3 Mathematical Expression for the Random-request Availability
35.3.1 Notation
35.3.2 Mathematical Assumptions
35.3.3 Mathematical Expressions
35.4 Numerical Examples
35.5 Simulation Results
35.6 Approximation
35.7 Concluding Remarks

Index


Contributors

Professor Per Anders Akersten, Division of Quality Technology and Statistics, Lulea University of Technology, Sweden
Professor G. Albeanu, The Technical University of Denmark, Denmark
Professor Ronald N. Allan, Department of Electrical Engineering and Electronics, UMIST, Manchester, United Kingdom
Dr. Robert A. Bari, Energy, Environment and National Security, Brookhaven National Laboratory, USA
Professor Roy Billinton, Department of Electrical Engineering, University of Saskatchewan, Canada
Professor Juan A. Carrasco, Dep. d'Enginyeria Electronica, UPC, Spain
Professor Jen-Chun Chang, Department of Information Management, Ming Hsin Institute of Technology, Taiwan, ROC
Professor Anne Chao, Institute of Statistics, National Tsing Hua University, Taiwan
Dr. Yong Kwon Cho, Technology and Industry Department, Samsung Economic Research Institute, Republic of Korea
Dr. Siddhartha R. Dalal, Information Analysis and Services Research Department, Applied Research, Telcordia Technologies, USA
Professor B. S. Dhillon, Department of Mechanical Engineering, University of Ottawa, Canada
Professor Tadashi Dohi, Department of Information Engineering, Hiroshima University, Japan
Professor Elsayed A. Elsayed, Department of Industrial Engineering, Rutgers University, USA
Professor Maxim S. Finkelstein, Department of Mathematical Statistics, University of the Orange Free State, Republic of South Africa
Professor Katerina Goševa-Popstojanova, Lane Dept. of Computer Science and Electrical Engineering, West Virginia University, USA
Dr. Jinsheng Huang, Stantec Consulting, Canada
Professor Frank K. Hwang, Department of Applied Mathematics, National Chao Tung University, Taiwan, ROC


Dr. Nat Jack, School of Computing, University of Abertay Dundee, Scotland
Professor Naoto Kaio, Department of Economic Informatics, Faculty of Economic Sciences, Hiroshima Shudo University, Japan
Professor Mitsuhiro Kimura, Department of Industrial and Systems Engineering, Faculty of Engineering, Hosei University, Japan
Professor Bengt Klefsjö, Division of Quality Technology and Statistics, Lulea University of Technology, Sweden
Professor Way Kuo, Department of Industrial Engineering, Texas A&M University, USA
Professor C. D. Lai, Institute of Information Sciences and Technology, Massey University, New Zealand
Professor James Ledoux, Centre de mathématiques, Institut National des Sciences Appliquees, France
Dr. Gregory Levitin, Reliability Department, The Israel Electric Corporation, Israel
Dr. Anatoly Lisnianski, Reliability Department, The Israel Electric Corporation, Israel
Professor D. N. P. Murthy, Engineering and Operations Management Program, Department of Mechanical Engineering, The University of Queensland, Australia
Professor Toshio Nakagawa, Department of Industrial Engineering, Aichi Institute of Technology, Japan
Professor Shunji Osaki, Department of Information and Telecommunication Engineering, Faculty of Mathematical Sciences and Information Engineering, Nanzan University, Japan
Dr. W. Jason Owen, Mathematics and Computer Science Department, University of Richmond, USA
Professor William J. Padgett, Department of Statistics, University of South Carolina, USA
Professor Dong Ho Park, Department of Statistics, Hallym University, Korea
Professor Hoang Pham, Department of Industrial Engineering, Rutgers University, USA
Dr. Michael J. Phillips, Department of Mathematics and Computer Science, University of Leicester, United Kingdom
Professor Gianpaolo Pulcini, Statistics and Reliability Department, Istituto Motori CNR, Italy
Mr. Sang Hwa Song, Department of Industrial Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea
Professor Chang Sup Sung, Department of Industrial Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea


Professor Loon-Ching Tang, Department of Industrial and Systems Engineering, National University of Singapore, Singapore
Dr. Xiaolin Teng, Department of Industrial Engineering, Rutgers University, USA
Professor Koichi Tokuno, Department of Social Systems Engineering, Tottori University, Japan
Professor Kishor S. Trivedi, Department of Electrical and Computer Engineering, Duke University, USA
Dr. Kalyanaraman Vaidyanathan, Department of Electrical and Computer Engineering, Duke University, USA
Professor Florin Popentiu Vladicescu, The Technical University of Denmark, Denmark
Dr. Yan Wang, School of Mathematics, City West Campus, UniSA, Australia
Professor Min Xie, Department of Industrial and Systems Engineering, National University of Singapore, Singapore
Professor Chengjie Xiong, Division of Biostatistics, Washington University in St. Louis, USA
Professor Shigeru Yamada, Department of Social Systems Engineering, Tottori University, Japan
Professor Paul S. F. Yip, Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong
Professor Ming Zhao, Department of Technology, University of Gävle, Sweden
Professor Ming J. Zuo, Department of Mechanical Engineering, University of Alberta, Canada


Part I. System Reliability and Optimization

1 Multi-state k-out-of-n Systems
1.1 Introduction
1.2 Relevant Concepts in Binary Reliability Theory
1.3 Binary k-out-of-n Models
1.4 Relevant Concepts in Multi-state Reliability Theory
1.5 A Simple Multi-state k-out-of-n:G Model
1.6 A Generalized Multi-state k-out-of-n:G System Model
1.7 Properties of Generalized Multi-state k-out-of-n:G Systems
1.8 Equivalence and Duality in Generalized Multi-state k-out-of-n Systems

2 Reliability of Systems with Multiple Failure Modes
2.1 Introduction
2.2 The Series System
2.3 The Parallel System
2.4 The Parallel–Series System
2.5 The Series–Parallel System
2.6 The k-out-of-n Systems
2.7 Fault-tolerant Systems
2.8 Weighted Systems with Three Failure Modes

3 Reliabilities of Consecutive-k-Systems
3.1 Introduction
3.2 Computation of Reliability
3.3 Invariant Consecutive Systems
3.4 Component Importance and the Component Replacement Problem
3.5 The Weighted-consecutive-k-out-of-n System
3.6 Window Systems
3.7 Network Systems
3.8 Conclusion


4 Multi-State System Reliability Analysis and Optimization
4.1 Introduction
4.2 MSS Reliability Measures
4.3 MSS Reliability Indices Evaluation Based on the UGF
4.4 Determination of u-Function of Complex MSS Using Composition Operators
4.5 Importance and Sensitivity Analysis of Multi-state Systems
4.6 MSS Structure Optimization Problems

5 Combinatorial Reliability Optimization
5.1 Introduction
5.2 Combinatorial Reliability Optimization Problems of Series Structure
5.3 Combinatorial Reliability Optimization Problems of Non-series Structure
5.4 Combinatorial Reliability Optimization Problems with Multiple-choice Constraints
5.5 Summary


Chapter 1

Multi-state k-out-of-n Systems

Ming J. Zuo, Jinsheng Huang and Way Kuo

1.1 Introduction
1.2 Relevant Concepts in Binary Reliability Theory
1.3 Binary k-out-of-n Models
1.3.1 The k-out-of-n:G System with Independently and Identically Distributed Components
1.3.2 Reliability Evaluation Using Minimal Path or Cut Sets
1.3.3 Recursive Algorithms
1.3.4 Equivalence Between a k-out-of-n:G System and an (n − k + 1)-out-of-n:F System
1.3.5 The Dual Relationship Between the k-out-of-n G and F Systems
1.4 Relevant Concepts in Multi-state Reliability Theory
1.5 A Simple Multi-state k-out-of-n:G Model
1.6 A Generalized Multi-state k-out-of-n:G System Model
1.7 Properties of Generalized Multi-state k-out-of-n:G Systems
1.8 Equivalence and Duality in Generalized Multi-state k-out-of-n Systems

1.1 Introduction

In traditional reliability theory, both the system and its components are allowed to take only two possible states: either working or failed. In a multi-state system, both the system and its components are allowed to experience more than two possible states, e.g. completely working, partially working or partially failed, and completely failed. A multi-state system reliability model provides more flexibility for the modeling of equipment conditions. The terms binary and multi-state will be used to indicate these two fundamental assumptions in system reliability models.

1.2 Relevant Concepts in Binary Reliability Theory

The following notation will be used:

• $x_i$: state of component $i$; $x_i = 1$ if component $i$ is working and zero otherwise;
• $\mathbf{x}$: an $n$-dimensional vector representing the states of all components, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$;
• $\phi(\mathbf{x})$: state of the system, which is also called the structure function of the system;
• $(j_i, \mathbf{x})$: a vector $\mathbf{x}$ whose $i$th argument is set equal to $j$, where $j = 0, 1$ and $i = 1, 2, \ldots, n$.

A component is irrelevant if its state does not affect the state of the system at all. The structure function of the system indicates that the state of the system is completely determined by the states of all components. A system of components is said to be coherent if: (1) its structure function is non-decreasing in each argument; (2) there are no irrelevant components in the system. These two requirements of a coherent system can be stated as: (1) the improvement of any component does not degrade the system performance; (2) every component in the system makes some non-zero contribution to the system's performance. A mathematical definition of a coherent system is given below.

Definition 1. A binary system with $n$ components is a coherent system if its structure function $\phi(\mathbf{x})$ satisfies:

1. $\phi(\mathbf{x})$ is non-decreasing in each argument $x_i$, $i = 1, 2, \ldots, n$;
2. there exists a vector $\mathbf{x}$ such that $0 = \phi(0_i, \mathbf{x}) < \phi(1_i, \mathbf{x}) = 1$;
3. $\phi(\mathbf{0}) = 0$ and $\phi(\mathbf{1}) = 1$.

Condition 1 in Definition 1 requires that $\phi(\mathbf{x})$ be a monotonically increasing function of each argument. Condition 2 specifies the so-called relevancy condition, which requires every component to be relevant to system performance. Condition 3 states that the system fails when all components are failed and the system works when all components are working.

A minimal path set is a minimal set of components whose functioning ensures the functioning of the system. A minimal cut set is a minimal set of components whose failure ensures the failure of the system. The following mathematical definitions of minimal path and cut sets are given by Barlow and Proschan [1].

Definition 2. Define $C_0(\mathbf{x}) = \{i \mid x_i = 0\}$ and $C_1(\mathbf{x}) = \{i \mid x_i = 1\}$. A path vector is a vector $\mathbf{x}$ such that $\phi(\mathbf{x}) = 1$. The corresponding path set is $C_1(\mathbf{x})$. A minimal path vector is a path vector $\mathbf{x}$ such that $\phi(\mathbf{y}) = 0$ for any $\mathbf{y} < \mathbf{x}$. The corresponding minimal path set is $C_1(\mathbf{x})$. A cut vector is a vector $\mathbf{x}$ such that $\phi(\mathbf{x}) = 0$. The corresponding cut set is $C_0(\mathbf{x})$. A minimal cut vector is a cut vector $\mathbf{x}$ such that $\phi(\mathbf{y}) = 1$ for any $\mathbf{y} > \mathbf{x}$. The corresponding minimal cut set is $C_0(\mathbf{x})$.

The reliability of a system is equal to the probability that at least one of the minimal path sets works. The unreliability of the system is equal to the probability that at least one minimal cut set is failed. For a minimal path set to work, each component in the set must work. For a minimal cut set to fail, all components in the set must fail.

1.3 Binary k-out-of-n Models

A system of $n$ components that works (or is "good") if and only if at least $k$ of the $n$ components work (or are "good") is called a k-out-of-n:G system. A system of $n$ components that fails if and only if at least $k$ of the $n$ components fail is called a k-out-of-n:F system. The term k-out-of-n system is often used to indicate either a G system, an F system, or both. Since the value of $n$ is usually larger than the value of $k$, redundancy is built into a k-out-of-n system. Both the parallel and the series systems are special cases of the k-out-of-n system. A series system is equivalent to a 1-out-of-n:F system and to an n-out-of-n:G system. A parallel system is equivalent to an n-out-of-n:F system and to a 1-out-of-n:G system.

The k-out-of-n system structure is a very popular type of redundancy in fault-tolerant systems. It finds wide applications in both industrial and military systems. Fault-tolerant systems include the multi-display system in a cockpit, the multi-engine system in an airplane, and the multi-pump system in a hydraulic control system. For example, in a V-8 engine of an automobile it may be possible to drive the car if only four cylinders are firing. However, if fewer than four cylinders fire, then the automobile cannot be driven. Thus, the functioning of the engine may be represented by a 4-out-of-8:G system. It is tolerant of failures of up to four cylinders for minimal functioning of the engine. In a data-processing system with five video displays, a minimum of three operable displays may be sufficient for full data display. In this case, the display subsystem behaves as a 3-out-of-5:G system. In a communications system with three transmitters, the average message load may be such that at least two transmitters must be operational at all times or critical messages may be lost. Thus, the transmission subsystem functions as a 2-out-of-3:G system. Systems with spares may also be represented by the k-out-of-n system model. In the case of an automobile with four tires, for example, the vehicle is usually equipped with one additional spare tire. Thus, the vehicle can be driven as long as at least four out of five tires are in good condition.

We will also adopt the following notation:

• $n$: number of components in the system;
• $k$: minimum number of components that must work for the k-out-of-n:G system to work;
• $p_i$: reliability of component $i$, $p_i = \Pr(x_i = 1)$, $i = 1, 2, \ldots, n$;
• $p$: reliability of each component when all components are i.i.d.;
• $q_i$: unreliability of component $i$, $q_i = 1 - p_i$, $i = 1, 2, \ldots, n$;
• $q$: unreliability of each component when all components are i.i.d., $q = 1 - p$;
• $R_e(k, n)$: probability that exactly $k$ out of $n$ components are working;
• $R(k, n)$: reliability of a k-out-of-n:G system, i.e. the probability that at least $k$ out of the $n$ components are working, where $0 \le k \le n$ and both $k$ and $n$ are integers;
• $Q(k, n)$: unreliability of a k-out-of-n:G system, i.e. the probability that fewer than $k$ out of the $n$ components are working, where $0 \le k \le n$ and both $k$ and $n$ are integers; $Q(k, n) = 1 - R(k, n)$.

1.3.1 The k-out-of-n:G System with Independently and Identically Distributed Components

The reliability of a k-out-of-n:G system with independently and identically distributed (i.i.d.) components is equal to the probability that the number of working components is greater than or equal to $k$:

$$R(k, n) = \sum_{i=k}^{n} \binom{n}{i} p^i q^{n-i} \qquad (1.1)$$

Other equations that can be used for system reliability evaluation include

$$R(k, n) = p^k \sum_{i=k}^{n} \binom{i-1}{k-1} q^{i-k} \qquad (1.2)$$

and

$$R(k, n) = \binom{n-1}{k-1} p^k q^{n-k} + R(k, n-1) \qquad (1.3)$$

with the boundary condition:

$$R(k, n) = 0 \quad \text{for } n = k - 1 \qquad (1.4)$$
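To make Equation 1.1 concrete, here is a minimal Python sketch; the function name r_iid and the sample numbers are ours, not from the chapter:

from math import comb

def r_iid(k, n, p):
    """Reliability of a k-out-of-n:G system with i.i.d. components
    of reliability p, per Equation 1.1."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# The 4-out-of-8:G engine example with an assumed cylinder reliability of 0.95
print(r_iid(4, 8, 0.95))  # prints a value very close to 1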

1.3.2 Reliability Evaluation Using Minimal Path or Cut Sets

In a k-out-of-n:G system, there are $\binom{n}{k}$ minimal path sets and $\binom{n}{n-k+1}$ minimal cut sets. Each minimal path set contains exactly $k$ different components and each minimal cut set contains exactly $n - k + 1$ components. Thus, all minimal path sets and minimal cut sets are known. To find the reliability of a k-out-of-n:G system, one may choose to evaluate the probability that at least one of the minimal path sets contains all working components, or one minus the probability that at least one minimal cut set contains all failed components.

The inclusion–exclusion (IE) method can be used for reliability evaluation of a k-out-of-n:G system. However, it has the disadvantage of involving many canceling terms. Heidtmann [2] and McGrady [3] provide improved versions of the IE method for reliability evaluation of the k-out-of-n:G system. In their improved algorithms, the canceling terms are eliminated. However, these algorithms are still enumerative in nature. For example, the formula provided by Heidtmann [2] is as follows:

$$R(k, n) = \sum_{i=k}^{n} (-1)^{i-k} \binom{i-1}{k-1} \sum_{j_1 < j_2 < \cdots < j_i} \prod_{l=1}^{i} p_{j_l} \qquad (1.5)$$

In this equation, for each fixed $i$ value, the inner summation term gives us the probability that $i$ components are working properly regardless of whether the other $n - i$ components are working or not. The total number of terms to be summed together in the inner summation series is equal to $\binom{n}{i}$.

The sum-of-disjoint-products (SDP) method can also be used for reliability evaluation of the k-out-of-n:G systems. Let $S_i$ indicate the $i$th minimal path set of a k-out-of-n:G system, $i = 1, 2, \ldots, m$, where $m = \binom{n}{k}$. The SDP method uses the following equation for system reliability evaluation:

$$R(k, n) = \Pr(S_1) + \Pr(\bar{S}_1 S_2) + \Pr(\bar{S}_1 \bar{S}_2 S_3) + \cdots + \Pr(\bar{S}_1 \bar{S}_2 \cdots \bar{S}_{m-1} S_m) \qquad (1.6)$$


Like the improved IE method given in Equation 1.5, the SDP method is easy to use for the k-out-of-n:G systems. However, we will see later that there are much more efficient methods than the IE method (and its improved version) and the SDP method for the k-out-of-n:G systems.

1.3.3 Recursive Algorithms

Under the assumption that components are s-independent, several efficient recursive algorithms have been developed for system reliability evaluation of the k-out-of-n:G systems. Barlow and Heidtmann [4] and Rushdi [5] independently provide an algorithm with complexity $O(k(n-k+1))$ for system reliability evaluation of the k-out-of-n:G systems. The approaches used to derive the algorithm are the generating function approach (Barlow and Heidtmann) and the symmetric switching function approach (Rushdi). The following equation summarizes the algorithm:

$$R(i, j) = p_j R(i-1, j-1) + q_j R(i, j-1), \quad i \ge 0,\ j \ge 0 \qquad (1.7)$$

with boundary conditions

$$R(0, j) = 1, \quad j \ge 0 \qquad (1.8)$$
$$R(j+1, j) = 0, \quad j \ge 0 \qquad (1.9)$$
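As an illustration, a minimal Python sketch of this recursion; the function name and the convention that p[j-1] holds the reliability of component j are our own assumptions:

from functools import lru_cache

def k_out_of_n_G(k, n, p):
    """Reliability of a k-out-of-n:G system with independent, possibly
    non-identical components, via the recursion of Equation 1.7."""
    @lru_cache(maxsize=None)
    def R(i, j):
        if i == 0:       # boundary condition (1.8)
            return 1.0
        if i > j:        # boundary condition (1.9)
            return 0.0
        pj = p[j - 1]
        return pj * R(i - 1, j - 1) + (1 - pj) * R(i, j - 1)
    return R(k, n)

# Example: a 2-out-of-3:G system with component reliabilities 0.9, 0.8, 0.7
print(k_out_of_n_G(2, 3, (0.9, 0.8, 0.7)))  # 0.902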

Chao and Lin [6] were the first to use the Markov chain technique in analyzing reliability system structures; in their case, it was for the consecutive-k-out-of-n:F system. Subsequently, Chao and Fu [7, 8] standardized this approach of using the Markov chain in the analysis of various system structures and provided a general framework and general results for this technique. The system structures that can be represented by a Markov chain were termed linearly connected systems by Fu and Lou [9]. Koutras [10] provides a systematic summary of this technique and calls these systems Markov chain imbeddable (MIS) systems. Koutras [10] applied this technique to the k-out-of-n:F system and provided recursive equations for system reliability evaluation of the k-out-of-n:F systems. In the following, we provide the equations for the k-out-of-n:G systems:

$$a_0(t) = q_t a_0(t-1), \quad t \ge 1 \qquad (1.10)$$
$$a_j(t) = p_t a_{j-1}(t-1) + q_t a_j(t-1), \quad 1 \le j < k,\ j \le t \le n \qquad (1.11)$$
$$a_k(t) = p_t a_{k-1}(t-1) + a_k(t-1), \quad k \le t \le n \qquad (1.12)$$

where $a_j(t)$ is the probability that there are exactly $j$ working components in a system with $t$ components for $0 \le j < k$, and $a_k(t)$ is the probability that there are at least $k$ working components in the $t$-component subsystem. The following boundary conditions are immediate:

$$a_0(0) = 1 \qquad (1.13)$$
$$a_j(0) = 0, \quad j > 0 \qquad (1.14)$$
$$a_j(t) = 0 \quad \text{for } t < j \qquad (1.15)$$

The reliability of the system is $R(k, n) = a_k(n)$. The computational complexity of the recursive Equations 1.10–1.12 for the system reliability of a k-out-of-n:G system is also $O(k(n-k+1))$.
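The same system can be evaluated with the Markov chain recursion; a sketch under the same naming conventions as above:

def k_out_of_n_G_markov(k, n, p):
    """k-out-of-n:G reliability via Equations 1.10-1.12. a[j] carries
    a_j(t) for j < k; a[k] accumulates the probability of at least k
    working components. p[t-1] is the reliability of component t."""
    a = [1.0] + [0.0] * k                       # a_0(0) = 1, a_j(0) = 0 for j > 0
    for t in range(1, n + 1):
        pt, qt = p[t - 1], 1.0 - p[t - 1]
        new = [qt * a[0]] + [0.0] * k           # Equation 1.10
        for j in range(1, k):
            new[j] = pt * a[j - 1] + qt * a[j]  # Equation 1.11
        new[k] = pt * a[k - 1] + a[k]           # Equation 1.12
        a = new
    return a[k]                                 # R(k, n) = a_k(n)

print(k_out_of_n_G_markov(2, 3, (0.9, 0.8, 0.7)))  # 0.902, matching the sketch above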

Belfore [11] used the generating function approach as used by Barlow and Heidtmann [4] and applied a fast Fourier transform (FFT) in the computation of the products of the generating functions. The resulting algorithm for reliability evaluation of k-out-of-n:G systems has a computational complexity of $O(n[\log_2(n)]^2)$. This algorithm is not easy to use for manual calculations or when the system size is small. For details of this algorithm, readers are referred to Belfore [11].

1.3.4 Equivalence Between a k-out-of-n:G System and an (n − k + 1)-out-of-n:F System

Based on the definitions of these two types of systems, a k-out-of-n:G system is equivalent to an (n − k + 1)-out-of-n:F system. Similarly, a k-out-of-n:F system is equivalent to an (n − k + 1)-out-of-n:G system. This means that, provided the systems have the same set of component reliabilities, the reliability of a k-out-of-n:G system is equal to the reliability of an (n − k + 1)-out-of-n:F system, and the reliability of a k-out-of-n:F system is equal to the reliability of an (n − k + 1)-out-of-n:G system. As a result, we can use the algorithms that have been covered in the previous section for the k-out-of-n:G systems in reliability evaluation of the k-out-of-n:F systems. The procedure is simple and is outlined below.

Procedure 1. Procedure for using algorithms for the G systems in reliability evaluation of the F systems utilizing the equivalence relationship:

1. given: $k, n, p_1, p_2, \ldots, p_n$ for a k-out-of-n:F system;
2. calculate $k_1 = n - k + 1$;
3. use $k_1, n, p_1, p_2, \ldots, p_n$ to calculate the reliability of a $k_1$-out-of-n:G system. This reliability is also the reliability of the original k-out-of-n:F system.
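Under the same sketch conventions, Procedure 1 reduces to a one-line wrapper around the k_out_of_n_G sketch from Section 1.3.3:

def k_out_of_n_F(k, n, p):
    """Reliability of a k-out-of-n:F system via Procedure 1: it equals
    the reliability of an (n - k + 1)-out-of-n:G system."""
    return k_out_of_n_G(n - k + 1, n, p)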

1.3.5 The Dual Relationship Between the k-out-of-n G and F Systems

Definition 3. (Barlow and Proschan [1]) Given a structure $\phi$, its dual structure $\phi^D$ is given by

$$\phi^D(\mathbf{x}) = 1 - \phi(\mathbf{1} - \mathbf{x}) \qquad (1.16)$$

where $\mathbf{1} - \mathbf{x} = (1 - x_1, 1 - x_2, \ldots, 1 - x_n)$.

With a simple variable substitution of $\mathbf{y} = \mathbf{1} - \mathbf{x}$ and then writing $\mathbf{y}$ as $\mathbf{x}$, we have the following equation:

$$\phi^D(\mathbf{1} - \mathbf{x}) = 1 - \phi(\mathbf{x}) \qquad (1.17)$$

We can interpret Equation 1.17 as follows. Given a primal system with component state vector $\mathbf{x}$ and the system state represented by $\phi(\mathbf{x})$, the state of the dual system is equal to $1 - \phi(\mathbf{x})$ if the component state vector for the dual system can be expressed by $\mathbf{1} - \mathbf{x}$. In the binary system context, each component and the system may only be in two possible states: either working or failed. We say that two components with different states have opposite states. For example, if component 1 is in state 1 and component 2 is in state 0, components 1 and 2 have opposite states. Suppose a system (called system 1) has component state vector $\mathbf{x}$ and system state $\phi(\mathbf{x})$. Consider another system (called system 2) having the same number of components as system 1. If each component in system 2 has the opposite state of the corresponding component in system 1 and the state of system 2 becomes the opposite of the state of system 1, then system 1 and system 2 are duals of each other.

Now let us examine the k-out-of-n G and F systems. Suppose that in the k-out-of-n:G system, there are exactly $j$ working components and the system is working (in other words, $j \ge k$). Now assume that there are exactly $j$ failed components in the k-out-of-n:F system. Since $j \ge k$, the k-out-of-n:F system must be in the failed state. If $j < k$, the k-out-of-n:G system is failed, and at the same time, the k-out-of-n:F system is working. Thus, the k-out-of-n G and F systems are duals of each other. The dual and equivalence relationships between the k-out-of-n G and F systems are summarized below.

1. A k-out-of-n:G system is equivalent to an (n − k + 1)-out-of-n:F system.
2. A k-out-of-n:F system is equivalent to an (n − k + 1)-out-of-n:G system.
3. The dual of a k-out-of-n:G system is a k-out-of-n:F system.
4. The dual of a k-out-of-n:G system is an (n − k + 1)-out-of-n:G system.
5. The dual of a k-out-of-n:F system is a k-out-of-n:G system.
6. The dual of a k-out-of-n:F system is an (n − k + 1)-out-of-n:F system.

Using the dual relationship, we can summarize the following procedure for reliability evaluation of the dual system if the available algorithms are for the primal system.

Procedure 2. Procedure for using algorithms for the G systems in reliability evaluation of the F systems utilizing the dual relationship:

1. given: $k, n, p_1, p_2, \ldots, p_n$ for a k-out-of-n:F system;
2. calculate $q_i = 1 - p_i$ for $i = 1, 2, \ldots, n$;
3. treat $q_i$ as the reliability of component $i$ in a k-out-of-n:G system and use the algorithms for the G system discussed in the previous section to evaluate the reliability of the G system;
4. subtract the calculated reliability of the G system from one to obtain the reliability of the original k-out-of-n:F system.

Using the dual relationship, we can also obtain algorithms for k-out-of-n:F system reliability evaluation from those developed for the k-out-of-n:G systems. We only need to change reliability measures to unreliability measures and vice versa. Take the algorithm developed by Rushdi [5] as an example. The formulas for reliability and unreliability evaluation of a k-out-of-n:G system are given in Equation 1.7 with boundary conditions in Equations 1.8 and 1.9. By changing $R(i, j)$ to $Q(i, j)$, $Q(i, j)$ to $R(i, j)$, $p_i$ to $q_i$, and $q_i$ to $p_i$ in those equations, we obtain the following equations for unreliability evaluation of a k-out-of-n:F system:

$$Q_F(i, j) = q_j Q_F(i-1, j-1) + p_j Q_F(i, j-1) \qquad (1.18)$$

with the following boundary conditions:

$$Q_F(0, j) = 1 \qquad (1.19)$$
$$Q_F(j+1, j) = 0 \qquad (1.20)$$

The subscript "F" is added to indicate that these measures are for the F system to avoid confusion. Similar steps can be applied to other algorithms for the G systems to derive the corresponding algorithms for the F systems.
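A sketch of Equations 1.18–1.20 under our earlier conventions; note how it mirrors the G-system recursion with the roles of the reliabilities and unreliabilities exchanged:

from functools import lru_cache

def k_out_of_n_F_unreliability(k, n, p):
    """Unreliability Q_F(k, n) of a k-out-of-n:F system via
    Equations 1.18-1.20; p[j-1] is the reliability of component j."""
    @lru_cache(maxsize=None)
    def QF(i, j):
        if i == 0:       # boundary condition (1.19)
            return 1.0
        if i > j:        # boundary condition (1.20)
            return 0.0
        qj = 1.0 - p[j - 1]
        return qj * QF(i - 1, j - 1) + (1 - qj) * QF(i, j - 1)
    return QF(k, n)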

1.4 Relevant Concepts in Multi-state Reliability Theory

In the multi-state context, the definition domains of the state of the system and its components are expanded from $\{0, 1\}$ to $\{0, 1, \ldots, M\}$, where $M$ is an integer greater than one. Such multi-state system models are discrete models. There are also continuous-state system reliability models. However, we will focus on discrete-state models in this chapter.

We will adopt the following notation:

• $x_i$: state of component $i$; $x_i = j$ if component $i$ is in state $j$, $0 \le j \le M$;
• $\mathbf{x}$: an $n$-dimensional vector representing the states of all components, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$;
• $\phi(\mathbf{x})$: state of the system, which is also called the structure function of the system, $0 \le \phi(\mathbf{x}) \le M$;
• $(j_i, \mathbf{x})$: a vector $\mathbf{x}$ whose $i$th argument is set equal to $j$, where $j \in \{0, 1, \ldots, M\}$ and $i = 1, 2, \ldots, n$.

In extending Definition 1 for a binary coherent system to the multi-state context, conditions 1 and 3 can be easily extended: namely, $\phi(\mathbf{x})$ is still a monotonically increasing function in each argument, and the state of the system is equal to the common state of all components when all components are in the same state. However, there are many different ways to extend condition 2, namely, the relevancy condition. One definition of a coherent system in the multi-state context is given below.

Definition 4. A multi-state system with $n$ components is defined to be a coherent system if its structure function satisfies:

1. $\phi(\mathbf{x})$ is monotonically increasing in each argument;
2. there exists a vector $\mathbf{x}$ such that $\phi(0_i, \mathbf{x}) < \phi(M_i, \mathbf{x})$ for each $i = 1, 2, \ldots, n$;
3. $\phi(\mathbf{j}) = j$ for $j = 0, 1, \ldots, M$.

The reason that there are many different ways to define the relevancy condition (condition 2) is that there are different degrees of relevancy that a component may have on the system in the multi-state context. A component has $M + 1$ levels and the system has $M + 1$ levels. A level of a component may be relevant to some levels of the system but not others. Every level of a component may be relevant to every level of the system (this is the case in the binary case, since there are only two levels). There are many papers discussing relevancy conditions; see Andrzejezak [12] for a review. However, the focus of this chapter is not on relevancy conditions.


There are two different definitions of minimal path vectors and minimal cut vectors in the multi-state context. One is given in terms of resulting in the system state being exactly a certain state, while the other is given in terms of resulting in the system state being in or above a certain state.

Definition 5. (El-Neweihi et al. [13]) A vector $\mathbf{x}$ is a connection vector to level $j$ if $\phi(\mathbf{x}) = j$. A vector $\mathbf{x}$ is a lower critical connection vector to level $j$ if $\phi(\mathbf{x}) = j$ and $\phi(\mathbf{y}) < j$ for all $\mathbf{y} < \mathbf{x}$. A vector $\mathbf{x}$ is an upper critical connection vector to level $j$ if $\phi(\mathbf{x}) = j$ and $\phi(\mathbf{y}) > j$ for all $\mathbf{y} > \mathbf{x}$.

A connection vector to level $j$ defined here is a path vector that results in a system state of level $j$. A lower critical connection vector to level $j$ can be called a minimal path vector to level $j$. An upper critical connection vector can be called a maximal path vector to level $j$. These terms refer to path vectors instead of cut vectors. If these vectors are known, one can use them in evaluation of the probability distribution of the system state.

Definition 6. (Natvig [14]) A vector $\mathbf{x}$ is called a minimal path vector to level $j$ if $\phi(\mathbf{x}) \ge j$ and $\phi(\mathbf{y}) < j$ for all $\mathbf{y} < \mathbf{x}$. A vector $\mathbf{x}$ is called a minimal cut vector to level $j$ if $\phi(\mathbf{x}) < j$ and $\phi(\mathbf{y}) \ge j$ for all $\mathbf{y} \ge \mathbf{x}$.

In this definition, both minimal path vectors and minimal cut vectors are defined in terms of whether they result in a system state "equal to or greater than $j$" or not. Based on the definitions given by El-Neweihi et al. [13] and Natvig [14], a minimal path vector to level $j$ may or may not be a connection vector to level $j$, as it may result in a system state higher than $j$. A minimal cut vector to level $j$ may or may not be a connection vector to level $j - 1$, as it may result in a system state below $j - 1$. We find that the minimal path vectors and minimal cut vectors defined by Natvig [14] are easier to use than the connection vectors defined by El-Neweihi et al. [13].

Definition 7. (Xue [15]) Let $\phi(\mathbf{x})$ be the structure function of a multi-state system. The structure function of its dual system, $\phi^D(\mathbf{x})$, is defined to be

$$\phi^D(\mathbf{x}) = M - \phi(\mathbf{M} - \mathbf{x})$$

where $\mathbf{M} - \mathbf{x} = (M - x_1, M - x_2, \ldots, M - x_n)$.

Based on this definition of duality, the following two results are immediate.

1. $(\phi^D(\mathbf{x}))^D = \phi(\mathbf{x})$ [15].
2. Vector $\mathbf{x}$ is an upper critical connection vector to level $j$ of $\phi$ if and only if $\mathbf{M} - \mathbf{x}$ is a lower critical connection vector to level $M - j$ of $\phi^D$. Vector $\mathbf{x}$ is a lower critical connection vector to level $j$ of $\phi$ if and only if $\mathbf{M} - \mathbf{x}$ is an upper critical connection vector to level $M - j$ of $\phi^D$.

Reliability is the most widely used performance measure of a binary system. It is the probability that the system is in state 1. Once the reliability of a system is given, we also know the probability that the system is in state 0. Thus, the reliability uniquely defines the distribution of a binary system over its states. In a multi-state system, it is often assumed that the state distribution, i.e. the distribution of each component over its states, is given. The performance of the system is represented by its state distribution. Thus, the most important performance measure of a multi-state system is its state distribution. A state distribution may be given in terms of a probability distribution function, a cumulative distribution function, or a reliability function.

We define the following notation:

• $p_{ij}$: probability that component $i$ is in state $j$, $1 \le i \le n$, $0 \le j \le M$;
• $p_j$: probability that a component is in state $j$ when all components are i.i.d.;
• $P_{ij}$: probability that component $i$ is in state $j$ or above;
• $P_j$: probability that a component is in state $j$ or above when all components are i.i.d.;
• $Q_{ij} = 1 - P_{ij}$: probability that component $i$ is in a state below $j$;
• $Q_j = 1 - P_j$;
• $R_{sj} = \Pr(\phi(\mathbf{x}) \ge j)$;
• $Q_{sj} = 1 - R_{sj}$;
• $r_{sj} = \Pr(\phi(\mathbf{x}) = j)$.


Based on this notation, we have the following facts (noting that a component is always in state 0 or above):

$$P_{i0} = 1, \quad 1 \le i \le n$$
$$\sum_{j=0}^{M} p_{ij} = 1, \quad 1 \le i \le n$$
$$p_{iM} = P_{iM}, \quad p_{ij} = P_{ij} - P_{i,j+1}, \quad 1 \le i \le n,\ 0 \le j \le M - 1$$
$$R_{s0} = 1, \quad R_{sM} = r_{sM}, \quad r_{sj} = R_{sj} - R_{s(j+1)}, \quad 0 \le j \le M - 1$$

1.5 A Simple Multi-state k-out-of-n:G Model

The following notation is adopted:

• $R(k, n; j)$: probability that the k-out-of-n:G system is in state $j$ or above.

The state of a multi-state series system is equal to the state of the worst component in the system [13], i.e.

$$\phi(\mathbf{x}) = \min_{1 \le i \le n} x_i$$

The state of a parallel system is equal to the state of the best component in the system [13], i.e.

$$\phi(\mathbf{x}) = \max_{1 \le i \le n} x_i$$

These definitions are natural extensions from the binary case to the multi-state case. System performance evaluation for the defined multi-state parallel and series systems is straightforward:

$$R_{sj} = \prod_{i=1}^{n} P_{ij}, \quad j = 1, 2, \ldots, M \qquad (1.21)$$

for a series system, and

$$Q_{sj} = \prod_{i=1}^{n} Q_{ij}, \quad j = 1, 2, \ldots, M \qquad (1.22)$$

for a parallel system.
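Equations 1.21 and 1.22 translate directly into code. In the sketch below, P_col[i-1] is assumed to hold $P_{ij}$ for a fixed level $j$; the function names are ours:

import math

def series_Rsj(P_col):
    """Equation 1.21: R_sj of a multi-state series system is the product
    of the component probabilities P_ij of being in state j or above."""
    return math.prod(P_col)

def parallel_Rsj(P_col):
    """Equation 1.22: for a parallel system Q_sj is the product of the
    Q_ij = 1 - P_ij, so R_sj = 1 - product of (1 - P_ij)."""
    return 1.0 - math.prod(1.0 - Pij for Pij in P_col)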

A simple extension of the binary k-out-of-n:G system results in the definition of a simple multi-state k-out-of-n:G system. We call it simple because there are more general models of the k-out-of-n:G systems, which will be discussed later.

Definition 8. (El-Neweihi et al. [13]) A system is a k-out-of-n:G system if its structure function satisfies

$$\phi(\mathbf{x}) = x_{(n-k+1)}$$

where $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ is a non-decreasing arrangement of $x_1, x_2, \ldots,$ and $x_n$.

Based on this definition, a k-out-of-n:G system has the following properties.

1. The multi-state series and parallel systems defined above satisfy this definition of the multi-state k-out-of-n:G system.
2. The state of the system is determined by the worst state of the best $k$ components.
3. Each state $j$ for $1 \le j \le M$ has $\binom{n}{k}$ minimal path vectors and $\binom{n}{k-1}$ minimal cut vectors. The numbers of minimal path and minimal cut vectors are the same for each state. We can say that the defined k-out-of-n:G system has the same structure at each system state $j$ for $1 \le j \le M$. In other words, the system is in state $j$ or above if and only if at least $k$ components are in state $j$ or above, for each $j$ ($1 \le j \le M$). Because of this property, Boedigheimer and Kapur [16] actually define the k-out-of-n:G system as one that has $\binom{n}{k}$ minimal path vectors and $\binom{n}{k-1}$ minimal cut vectors to system state $j$ for $1 \le j \le M$.

System performance evaluation for the simple k-out-of-n:G system is straightforward. For any specified system state $j$, we can use the system reliability evaluation algorithms for a binary k-out-of-n:G system to evaluate the probability that the multi-state system is in state $j$ or above. For example, Equation 1.7 and its boundary conditions can be used as follows, for $1 \le j \le M$:

$$R(k, n; j) = P_{nj} R(k-1, n-1; j) + Q_{nj} R(k, n-1; j) \qquad (1.23)$$
$$R(0, n; j) = 1 \qquad (1.24)$$
$$R(n+1, n; j) = 0 \qquad (1.25)$$
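A sketch of Equations 1.23–1.25, reusing the binary recursion level by level; the indexing convention P[i-1][j] = $P_{ij}$ is our own assumption:

from functools import lru_cache

def simple_multistate_R(k, n, j, P):
    """Probability that a simple multi-state k-out-of-n:G system is in
    state j or above, per Equations 1.23-1.25."""
    @lru_cache(maxsize=None)
    def R(kk, nn):
        if kk == 0:      # boundary condition (1.24)
            return 1.0
        if kk > nn:      # boundary condition (1.25)
            return 0.0
        Pnj = P[nn - 1][j]
        return Pnj * R(kk - 1, nn - 1) + (1.0 - Pnj) * R(kk, nn - 1)  # (1.23)
    return R(k, n)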


1.6 A Generalized Multi-state k-out-of-n:G System Model

The simple multi-state k-out-of-n:G system model requires the system structure at each system level to be the same. In practical situations, a multi-state system may have different structures at different system levels. For example, consider a three-component system with four possible states. The system could be a 1-out-of-3:G structure at level 1; in other words, it requires at least one component to be in state 1 or above for the system to be in state 1 or above. It may have a 2-out-of-3:G structure at level 2; in other words, for the system to be in state 2 or above, at least two components must be in state 2 or above. It may have a 3-out-of-3:G structure at level 3; namely, at least three components have to be in state 3 for the system to be in state 3. Such a k-out-of-n system model is more flexible for modeling real-life systems. Huang et al. [17] proposed a definition of the generalized multi-state k-out-of-n:G system and developed reliability evaluation algorithms for the following multi-state k-out-of-n:G system model.

Definition 9. (Huang [17]) $\phi(\mathbf{x}) \ge j$ ($j = 1, 2, \ldots, M$) if there exists an integer value $l$ ($j \le l \le M$) such that at least $k_l$ components are in state $l$ or above. An $n$-component system with such a property is called a multi-state k-out-of-n:G system.

In this definition, the $k_j$ values do not have to be the same for different system states $j$ ($1 \le j \le M$). This means that the structure of the multi-state system may be different for different system state levels. Generally speaking, the $k_j$ values are not necessarily in a monotone ordering. But the following two special cases of this definition will be particularly considered.

• When $k_1 \le k_2 \le \cdots \le k_M$, the system is called an increasing multi-state k-out-of-n:G system. In this case, for the system to be in a higher state level $j$ or above, a larger number of components must be in state $j$ or above. In other words, there is an increasing requirement on the number of components that must be in a certain state or above for the system to be in a higher state level. That is why we call it the increasing multi-state k-out-of-n:G system.
• When $k_1 \ge k_2 \ge \cdots \ge k_M$, the system is called a decreasing multi-state k-out-of-n:G system. In this case, for a higher system state level $j$, there is a decreasing requirement on the number of components that must be in state level $j$ or above.

When $k_j$ is a constant, i.e. $k_1 = k_2 = \cdots = k_M = k$, the structure of the system is the same for all system state levels. This reduces to the definition of the simple multi-state k-out-of-n:G system discussed in the previous section. We call such systems constant multi-state k-out-of-n:G systems. All the concepts and results of binary k-out-of-n:G systems can be easily extended to the constant multi-state k-out-of-n:G systems. The constant multi-state k-out-of-n:G system is treated as a special case of the increasing multi-state k-out-of-n:G system in our later discussions.

For an increasing multi-state k-out-of-n:G system, i.e. $k_1 \le k_2 \le \cdots \le k_M$, Definition 9 can be rephrased as follows: $\phi(\mathbf{x}) \ge j$ if and only if at least $k_j$ components are in state $j$ or above. If at least $k_j$ components are in state $j$ or above (these components can be considered "functioning" as far as state level $j$ is concerned), then the system will be in state $j$ or above (the system is considered to be "functioning") for $1 \le j \le M$. The only difference between this case of Definition 9 and Definition 8 is that the number of components required to be in state $j$ or above for the system to be in state $j$ or above may change from state to state. Other characteristics of this case of the generalized multi-state k-out-of-n:G system are exactly the same as those defined with Definition 8. Algorithms for binary k-out-of-n:G system reliability evaluation can also be extended for increasing multi-state k-out-of-n:G system reliability evaluation:

$$R_j(k_j, n) = P_{nj} R_j(k_j - 1, n-1) + (1 - P_{nj}) R_j(k_j, n-1) \qquad (1.26)$$

where $R_j(b, a)$ is the probability that at least $b$ out of $a$ components are in state $j$ or above. The following boundary conditions are needed for Equation 1.26:

$$R_j(0, a) = 1 \quad \text{for } a \ge 0 \qquad (1.27)$$
$$R_j(b, a) = 0 \quad \text{for } b > a > 0 \qquad (1.28)$$

When all the components have the same state probability distribution, i.e. $p_{ij} = p_j$ for all $i$, the probability that the system is in state $j$ or above, $R_{sj}$, can be expressed as:

$$R_{sj} = \sum_{k=k_j}^{n} \binom{n}{k} P_j^k (1 - P_j)^{n-k} \qquad (1.29)$$
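For the i.i.d. case, Equation 1.29 is a one-line binomial tail sum; the sketch below (names ours) reproduces Example 1:

from math import comb

def Rsj_iid(n, kj, Pj):
    """Equation 1.29: probability that an increasing multi-state
    k-out-of-n:G system with i.i.d. components is in state j or above;
    Pj is the probability that a component is in state j or above."""
    return sum(comb(n, k) * Pj**k * (1 - Pj)**(n - k) for k in range(kj, n + 1))

# Example 1: n = 3, (k1, k2, k3) = (1, 2, 3), (P1, P2, P3) = (0.9, 0.6, 0.2)
print(Rsj_iid(3, 1, 0.9), Rsj_iid(3, 2, 0.6), Rsj_iid(3, 3, 0.2))
# approximately 0.999, 0.648, 0.008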

Example 1. Consider an increasing multi-state k-out-of-n:G system with $k_1 = 1$, $k_2 = 2$, $k_3 = 3$, $p_0 = 0.1$, $p_1 = 0.3$, $p_2 = 0.4$, and $p_3 = 0.2$. We can use Equation 1.29 to calculate the probability that the system is at each level. We have $P_1 = 0.9$, $P_2 = 0.6$, $P_3 = 0.2$.

At level 3, $k_3 = 3$:

$$R_{s3} = P_3^3 = 0.2^3 = 0.008$$

At level 2, $k_2 = 2$:

$$R_{s2} = \binom{3}{2} \times 0.6^2 \times (1 - 0.6) + 0.6^3 = 0.648$$

At level 1, $k_1 = 1$:

$$R_{s1} = \binom{3}{1} \times 0.9 \times (1 - 0.9)^2 + \binom{3}{2} \times 0.9^2 \times (1 - 0.9) + 0.9^3 = 0.999$$

The system probabilities at all levels are as follows:

$$r_{s3} = 0.008$$
$$r_{s2} = R_{s2} - R_{s3} = 0.64$$
$$r_{s1} = R_{s1} - R_{s2} = 0.351$$
$$r_{s0} = 1 - 0.008 - 0.64 - 0.351 = 0.001$$

For a decreasing multi-state k-out-of-n:G system, i.e. $k_1 \ge k_2 \ge \cdots \ge k_M$, the wording of its definition is not as simple. The system is in level $M$ if at least $k_M$ components are in level $M$. The system is in level $M - 1$ or above if at least $k_{M-1}$ components are in level $M - 1$ or above or at least $k_M$ components are in level $M$. Generally speaking, the system is in level $j$ or above ($1 \le j \le M$) if at least $k_j$ components are in level $j$ or above, at least $k_{j+1}$ components are in level $j + 1$ or above, at least $k_{j+2}$ components are in level $j + 2$ or above, . . . , or at least $k_M$ components are in level $M$. The definition of the decreasing multi-state k-out-of-n:G system can also be stated as follows in terms of the system being exactly in a certain state: $\phi(\mathbf{x}) = j$ if and only if at least $k_j$ components are in state $j$ or above and at most $k_l - 1$ components are in state $l$ or above for $l = j + 1, j + 2, \ldots, M$, where $j = 1, 2, \ldots, M$.

When all the components have the same state probability distribution, the following equation can be used to calculate the probability that the system is in state $j$ for a decreasing multi-state k-out-of-n:G system:

$$r_{sj} = \sum_{k=k_j}^{n} \binom{n}{k} \left( \sum_{m=0}^{j-1} p_m \right)^{n-k} \left( p_j^k + \sum_{l=j+1,\ k_l > 1}^{M} \beta_l(k) \right) \qquad (1.30)$$

where $\beta_l(k)$ is the probability that there are at least one and at most $k_l - 1$ components in state $l$, at most $k_u - 1$ components in state $u$ for $j \le u < l$, and the total number of components in states between $j$ and $l$ inclusive is $k$. To calculate $\beta_l(k)$, we can use the following equation:

$$\beta_l(k) = \sum_{i_1=1}^{k_l-1} \binom{k}{i_1} p_l^{i_1} \sum_{i_2=0}^{k_{l-1}-1-i_1} \binom{k-i_1}{i_2} p_{l-1}^{i_2} \times \cdots \times \sum_{i_{l-j}=0}^{k_{j+1}-1-I_{l-j-1}} \binom{k-I_{l-j-1}}{i_{l-j}} p_{j+1}^{i_{l-j}} p_j^{k-I_{l-j}} \qquad (1.31)$$

where $I_{l-j-1} = \sum_{m=1}^{l-j-1} i_m$ and $I_{l-j} = \sum_{m=1}^{l-j} i_m$.


We can see that even when the components are i.i.d., Equations 1.30 and 1.31 are enumerative in nature. When the components are not necessarily identical, the procedure for evaluation of the system state distribution is more complicated. Huang et al. [17] provide an enumerative procedure for this purpose. These algorithms need further improvement.

1.7 Properties of Generalized Multi-state k-out-of-n:G Systems

We adopt the following notation:

• $x_{ij}$: a binary indicator, $x_{ij} = 1$ if $x_i \ge j$ and $x_{ij} = 0$ otherwise;
• $\mathbf{x}_j$: a binary indicator vector, $\mathbf{x}_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$;
• $\phi_j(\mathbf{x})$: a binary indicator, $\phi_j(\mathbf{x}) = 1$ if $\phi(\mathbf{x}) \ge j$ and $\phi_j(\mathbf{x}) = 0$ otherwise;
• $\psi_j$: binary structure function for system state $j$;
• $\psi$: multi-state structure function of binary variables;
• $\phi^D$: the dual structure function of $\phi$.

In order to develop more efficient algorithms for evaluation of the system state distribution, properties of multi-state systems should be investigated.

Definition 10. (Hudson and Kapur [18]) Two component state vectors $\mathbf{x}$ and $\mathbf{y}$ are said to be equivalent if and only if there exists a $j$ such that $\phi(\mathbf{x}) = \phi(\mathbf{y}) = j$, $j = 0, 1, \ldots, M$. We use the notation $\mathbf{x} \leftrightarrow \mathbf{y}$ to indicate that these two vectors are equivalent.

In the generalized k-out-of-n:G system, all permutations of the elements of a component state vector are equivalent to the component state vector, since the positions of the components in the system are not important. For example, consider a k-out-of-n:G system with three components and four possible states. Component state vector (1, 2, 3) and its permutations, namely (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2) and (3, 2, 1), are equivalent to one another.

Definition 11. (Huang and Zuo [19]) A multi-state coherent system is called a dominant system if and only if its structure function $\phi$ satisfies: $\phi(\mathbf{y}) > \phi(\mathbf{x})$ implies either (1) $\mathbf{y} > \mathbf{x}$ or (2) $\mathbf{y} > \mathbf{z}$ and $\mathbf{x} \leftrightarrow \mathbf{z}$.

This dominance condition says that if $\phi(\mathbf{y}) > \phi(\mathbf{x})$, then vector $\mathbf{y}$ must be larger than a vector that is in the same equivalence class as vector $\mathbf{x}$. A vector is larger than another vector if every element of the first vector is at least as large as the corresponding element in the second vector and at least one element in the first vector is strictly larger than the corresponding element in the second vector. For example, vector (2, 2, 0) is larger than (2, 1, 0), but not larger than (0, 1, 1). We use the word "dominant" to indicate that vector $\mathbf{y}$ dominates vector $\mathbf{x}$, even though we may not necessarily have $\mathbf{y} > \mathbf{x}$. If a multi-state system does not satisfy Definition 11, we call it a non-dominant system.

The increasing k-out-of-n:G system is a dominant system, whereas a decreasing k-out-of-n:G system is a non-dominant system. Consider a decreasing k-out-of-n:G system with $n = 3$, $M = 3$, $k_1 = 3$, $k_2 = 2$, and $k_3 = 1$. Then we have $\phi(3, 0, 0) = 3$, $\phi(2, 2, 0) = 2$, and $\phi(1, 1, 1) = 1$. The dominance conditions given in Definition 11 are not satisfied, since $(3, 0, 0) \not> (2, 2, 0)^*$ and $(2, 2, 0)^* \not> (1, 1, 1)^*$ even though $\phi(3, 0, 0) > \phi(2, 2, 0) > \phi(1, 1, 1)$. The asterisk is used to indicate the vector or any one of its permutations.

The concept of a binary image has been used to indicate whether a multi-state system can be analyzed with the algorithms for binary systems. Based on Ansell and Bendell [20], a multi-state system has a binary image if the condition $\mathbf{x}_j = \mathbf{y}_j$ implies $\phi_j(\mathbf{x}) = \phi_j(\mathbf{y})$. Huang and Zuo [19] point out that this requirement is too strong. A system may have a binary image even if this condition is not satisfied.


Table 1.1. Structure functions of the multi-state systems in Example 2

System A: $\phi(\mathbf{x}) = 0$ for $\mathbf{x} = (0, 0)$; $\phi(\mathbf{x}) = 1$ for $\mathbf{x} = (1, 0), (0, 1), (1, 1), (2, 0), (0, 2), (2, 1), (1, 2)$; $\phi(\mathbf{x}) = 2$ for $\mathbf{x} = (2, 2)$.
System B: $\phi(\mathbf{x}) = 0$ for $\mathbf{x} = (0, 0), (0, 1), (0, 2)$; $\phi(\mathbf{x}) = 1$ for $\mathbf{x} = (1, 0), (1, 1)$; $\phi(\mathbf{x}) = 2$ for $\mathbf{x} = (2, 0), (2, 1), (2, 2), (1, 2)$.
System C: $\phi(\mathbf{x}) = 0$ for $\mathbf{x} = (0, 0)$; $\phi(\mathbf{x}) = 1$ for $\mathbf{x} = (1, 1), (1, 0), (0, 1)$; $\phi(\mathbf{x}) = 2$ for $\mathbf{x} = (2, 0), (0, 2), (2, 1), (1, 2), (2, 2)$.

The following definition of binary-imaged systems is provided.

Definition 12. (Huang and Zuo [19]) A multi-state system has a binary image if and only if its structure indicator functions satisfy: $\phi_j(\mathbf{x}) = \phi_j(\mathbf{y})$ implies either (1) $\mathbf{x}_j = \mathbf{y}_j$ or (2) $\mathbf{x}_j = \mathbf{z}_j$ and $\mathbf{z} \leftrightarrow \mathbf{x}$, for each $j$ ($1 \le j \le M$).

Based on Definitions 11 and 12, we can see that a binary-imaged multi-state system must be a dominant system, whereas a dominant multi-state system may not have a binary image. The increasing k-out-of-n:G system has a binary image. The decreasing multi-state k-out-of-n:G system does not have a binary image. If a multi-state system has a binary image, algorithms for binary systems can be used for evaluation of its system state distribution. These binary algorithms are not efficient for non-dominant systems. The following example illustrates a dominant system with binary image, a dominant system without binary image, and a non-dominant system.

Example 2. Consider three different multi-state systems, each with two components and three possible states. The structure functions of the systems are given in Table 1.1, where System A is a dominant system with binary image, System B is a dominant system without binary image, and System C is a non-dominant system.

The following properties of multi-state systems are identified by Huang and Zuo [19].

1. Let $\mathbf{P}_1^j, \mathbf{P}_2^j, \ldots, \mathbf{P}_r^j$ be the minimal path vectors to level $j$ of a dominant system; then $\phi(\mathbf{P}_i^j) = j$ for $i = 1, 2, \ldots, r$.
2. Let $\mathbf{K}_1^j, \mathbf{K}_2^j, \ldots, \mathbf{K}_s^j$ be the minimal cut vectors to level $j$ of a dominant system; then $\phi(\mathbf{K}_i^j) = j - 1$ for $i = 1, 2, \ldots, s$.
3. The minimal path vectors to level $j$ of a binary-imaged dominant system are of the form $(j, \ldots, j, 0, \ldots, 0)$, or one of its permutations, for $j = 1, 2, \ldots, M$.

Example 3. A multi-state coherent system consists of three i.i.d. components. Both the system and the components are allowed to have four possible states: 0, 1, 2, 3. Assume that $p_0 = 0.1$, $p_1 = 0.3$, $p_2 = 0.4$, and $p_3 = 0.2$. The structures of the system are shown in Figure 1.1.

Figure 1.1. A dominant system with binary image

We say that the system has a parallel structure at level 1 because it is at level 1 or above if and only if at least one of the components is at level 1 or above. The system has a mixed parallel–series structure at level 2, because it is at level 2 or above if and only if at least one of components 1 and 2 is at level 2 or above and component 3 is at level 2 or above. The system has a series structure at level 3, because it is at level 3 if and only if all three components are at level 3. The structures change from strong to weak as the level of the system increases. The minimal path vectors to level 1 are (1, 0, 0), (0, 1, 0), and (0, 0, 1); to level 2 they are (2, 0, 2) and (0, 2, 2); and to level 3 it is (3, 3, 3). The system is a binary-imaged system. To calculate the probability of the system at each level, we can extend the algorithms for binary reliability systems:

$$P_1 = 0.9, \quad P_2 = 0.6, \quad P_3 = 0.2$$
$$R_{s3} = P_{13} \times P_{23} \times P_{33} = 0.008$$
$$R_{s2} = [1 - (1 - P_{12})(1 - P_{22})] \times P_{32} = 2P_2^2 - P_2^3 = 0.504$$
$$R_{s1} = 1 - (1 - P_{11})(1 - P_{21})(1 - P_{31}) = 1 - (1 - P_1)^3 = 0.999$$
$$r_{s3} = R_{s3} = 0.008$$
$$r_{s2} = R_{s2} - R_{s3} = 0.496$$
$$r_{s1} = R_{s1} - R_{s2} = 0.495$$
$$r_{s0} = 1 - R_{s1} = 0.001$$

1.8 Equivalence and Duality in Generalized Multi-state k-out-of-n Systems

Following the way that a generalized multi-state k-out-of-n:G system is defined, we propose a definition of the generalized multi-state k-out-of-n:F system as follows.

Definition 13. $\phi(\mathbf{x}) < j$ ($j = 1, 2, \ldots, M$) if at least $k_l$ components are in states below $l$ for all $l$ such that $j \le l \le M$. An $n$-component system with such a property is called a multi-state k-out-of-n:F system.

A generalized multi-state k-out-of-n:F system is labeled increasing if $k_1 \le k_2 \le \cdots \le k_M$ and decreasing if $k_1 \ge k_2 \ge \cdots \ge k_M$. Similar to an increasing k-out-of-n:G system, the definition of a decreasing k-out-of-n:F system can be stated in a simpler form. A decreasing k-out-of-n:F system is in a state below $j$ if and only if at least $k_j$ components are in states below $j$ for $1 \le j \le M$. For such a system, we do not need to state the requirements for those states above $j$ because they will be automatically satisfied by the requirement on state $j$. However, this cannot be said of an increasing multi-state k-out-of-n:F system. An increasing multi-state k-out-of-n:F system is in a state below $j$ if and only if all the following conditions are satisfied: at least $k_j$ components are below state $j$, at least $k_{j+1}$ components are below state $j + 1$, at least $k_{j+2}$ components are below state $j + 2$, . . . , and at least $k_M$ components are below state $M$. Since the increasing k-out-of-n:G system has a binary image, there is an efficient algorithm for evaluation of its system state distribution, as discussed in a previous section. For the decreasing k-out-of-n:F system, there is a similar efficient algorithm for evaluation of the system state distribution:

$$Q_j(k_j, n) = (1 - P_{nj}) Q_j(k_j - 1, n - 1) + P_{nj} Q_j(k_j, n - 1) \qquad (1.32)$$

where $Q_j(b, a)$ is the probability that at least $b$ out of $a$ components are below state $j$, $P_{nj}$ is the probability that component $n$ is in state $j$ or above, and $Q_j(k_j, n)$ is the probability that the system is below state $j$. The following boundary conditions are needed for Equation 1.32:

$$Q_j(0, a) = 1 \quad \text{for } a \ge 0 \qquad (1.33)$$

$$Q_j(b, a) = 0 \quad \text{for } b > a > 0 \qquad (1.34)$$

When all the components have the same state probability distribution, i.e. $p_{ij} = p_j$ for all $i$, the probability $Q_{sj}$ that the system is below state $j$ can be expressed as:

$$Q_{sj} = \sum_{k=k_j}^{n} \binom{n}{k} (1 - P_j)^k P_j^{n-k} \qquad (1.35)$$
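To make the recursion concrete, here is a short Python sketch (ours, not from the original text) that evaluates $Q_j(k_j, n)$ by Equation 1.32 with the boundary conditions 1.33 and 1.34, and checks it against the binomial form of Equation 1.35 in the i.i.d. case:

```python
from functools import lru_cache
from math import comb

def Q_below(kj, P):
    """Pr{at least kj of the n components are below state j} via Equation 1.32.
    P[i] is the probability that component i+1 is at state j or above (P_{ij})."""
    n = len(P)

    @lru_cache(maxsize=None)
    def Q(b, a):
        if b == 0:        # boundary condition (1.33)
            return 1.0
        if b > a:         # boundary condition (1.34)
            return 0.0
        Pa = P[a - 1]     # condition on the state of component a
        return (1 - Pa) * Q(b - 1, a - 1) + Pa * Q(b, a - 1)

    return Q(kj, n)

# i.i.d. check against Equation 1.35 with P_j = 0.6, n = 5, k_j = 3:
Pj, n, kj = 0.6, 5, 3
print(Q_below(kj, [Pj] * n))                      # 0.31744
print(sum(comb(n, k) * (1 - Pj)**k * Pj**(n - k)
          for k in range(kj, n + 1)))             # 0.31744
```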

The evaluation of the system state distribution for an increasing multi-state k-out-of-n:F system is much more complicated than that for the decreasing case. An algorithm similar to that for a decreasing multi-state k-out-of-n:G system developed by Huang et al. [17] can be derived. However, that kind of algorithm is enumerative in nature. More efficient algorithms are needed for the increasing k-out-of-n:F and the decreasing k-out-of-n:G systems.

The definition of a multi-state k-out-of-n:G system with vector $\mathbf{k} = (k_1, k_2, \ldots, k_M)$ given in Definition 9 can be stated in an alternative way as follows. A k-out-of-n:G system is below state $j$ if at least $n - k_l + 1$ components are below state $l$ for all $l$ such that $j \le l \le M$. Similarly, the definition of a multi-state k-out-of-n:F system with $\mathbf{k} = (k_1, k_2, \ldots, k_M)$ given in Definition 13 can be stated in an alternative way as follows. A k-out-of-n:F system is in state $j$ or above if at least $n - k_l + 1$ components are in state $l$ or above for at least one $l$ such that $j \le l \le M$. An equivalence relationship exists between the generalized multi-state k-out-of-n:F and G systems. An increasing k-out-of-n:F system with vector $\mathbf{k}_F = (k_1, k_2, \ldots, k_M)$ is equivalent to a decreasing k-out-of-n:G system with vector $\mathbf{k}_G = (n - k_1 + 1, n - k_2 + 1, \ldots, n - k_M + 1)$. A decreasing k-out-of-n:F system with vector $\mathbf{k}_F = (k_1, k_2, \ldots, k_M)$ is equivalent to an increasing k-out-of-n:G system with vector $\mathbf{k}_G = (n - k_1 + 1, n - k_2 + 1, \ldots, n - k_M + 1)$.

Using the definition of duality for multi-state systems given in Definition 7, we can verify that, unlike the binary case, there does not exist any duality relationship between the generalized multi-state k-out-of-n:F and G systems. In a binary k-out-of-n:G system, if vector $\mathbf{x}$ has $k$ components in state 1, then vector $\mathbf{y} = \mathbf{1} - \mathbf{x}$ must have $k$ components below state 1. However, in a multi-state k-out-of-n:G system, if vector $\mathbf{x}$ has $k$ components in state $j$ or above, then vector $\mathbf{y} = \mathbf{M} - \mathbf{x}$ does not necessarily have $k$ components in states below $j$.

The issues that should be investigated further on generalized multi-state k-out-of-n systems and general multi-state systems include the following.

1. Should a different definition of duality be given for multi-state systems?

2. Does the dual of a non-dominant system become a dominant system?

3. What techniques can be developed to provide bounds on the probability that the system is in a certain state?

4. How should we evaluate the system state distribution of a dominant system without binary image?

5. How should we evaluate the system state distribution of a non-dominant system?

References

[1] Barlow RE, Proschan F. Statistical theory of reliability and life testing: probability models. New York: Holt, Rinehart and Winston; 1975.
[2] Heidtmann KD. Improved method of inclusion-exclusion applied to k-out-of-n systems. IEEE Trans Reliab 1982;31(1):36–40.
[3] McGrady PW. The availability of a k-out-of-n:G network. IEEE Trans Reliab 1985;34(5):451–2.
[4] Barlow RE, Heidtmann KD. Computing k-out-of-n system reliability. IEEE Trans Reliab 1984;R-33:322–3.
[5] Rushdi AM. Utilization of symmetric switching functions in the computation of k-out-of-n system reliability. Microelectron Reliab 1986;26(5):973–87.
[6] Chao MT, Lin GD. Economical design of large consecutive-k-out-of-n:F systems. IEEE Trans Reliab 1984;R-33(5):411–3.
[7] Chao MT, Fu JC. A limit theorem of certain repairable systems. Ann Inst Stat Math 1989;41:809–18.
[8] Chao MT, Fu JC. The reliability of large series systems under Markov structure. Adv Appl Prob 1991;23:894–908.
[9] Fu JC, Lou WY. On reliabilities of certain large linearly connected engineering systems. Stat Prob Lett 1991;12:291–6.
[10] Koutras MV. On a Markov chain approach for the study of reliability structures. J Appl Prob 1996;33:357–67.
[11] Belfore LA. An O(n(log2(n))^2) algorithm for computing the reliability of k-out-of-n:G & k-to-l-out-of-n:G systems. IEEE Trans Reliab 1995;44(1):132–6.
[12] Andrzejczak K. Structure analysis of multi-state coherent systems. Optimization 1992;25:301–16.
[13] El-Neweihi E, Proschan F, Sethuraman J. Multi-state coherent system. J Appl Prob 1978;15:675–88.
[14] Natvig B. Two suggestions of how to define a multi-state coherent system. Appl Prob 1982;14:391–402.
[15] Xue J. On multi-state system analysis. IEEE Trans Reliab 1985;R-34(4):329–37.
[16] Boedigheimer RA, Kapur KC. Customer-driven reliability models for multi-state coherent systems. IEEE Trans Reliab 1994;43(1):46–50.
[17] Huang J, Zuo MJ, Wu YH. Generalized multi-state k-out-of-n:G systems. IEEE Trans Reliab 2002; in press.
[18] Hudson JC, Kapur KC. Reliability analysis for multi-state systems with multi-state components. IIE Trans 1983;15:127–35.
[19] Huang J, Zuo MJ. Dominant multi-state systems. IEEE Trans Reliab 2002; in press.
[20] Ansell JI, Bendell A. On alternative definitions of multi-state coherent systems. Optimization 1987;18(1):119–36.


Chapter 2

Reliability of Systems with Multiple Failure Modes

Hoang Pham

2.1 Introduction
2.2 The Series System
2.3 The Parallel System
2.3.1 Cost Optimization
2.4 The Parallel–Series System
2.4.1 The Profit Maximization Problem
2.4.2 Optimization Problem
2.5 The Series–Parallel System
2.5.1 Maximizing the Average System Profit
2.5.2 Consideration of Type I Design Error
2.6 The k-out-of-n Systems
2.6.1 Minimizing the Average System Cost
2.7 Fault-tolerant Systems
2.7.1 Reliability Evaluation
2.7.2 Redundancy Optimization
2.8 Weighted Systems with Three Failure Modes

2.1 Introduction

A component is subject to failure in either open or closed modes. Networks of relays, fuse systems for warheads, diode circuits, fluid flow valves, etc. are a few examples of such components. Redundancy can be used to enhance the reliability of a system without any change in the reliability of the individual components that form the system. However, in a two-failure-mode problem, redundancy may either increase or decrease the system's reliability. For example, a network consisting of n relays in series has the property that an open-circuit failure of any one of the relays would cause an open-mode failure of the system, while a closed-mode failure of the system occurs only if all n relays fail closed. (The designations "closed mode" and "short mode" both appear in this chapter, and we will use the two terms interchangeably.) On the other hand, if the n relays were arranged in parallel, a closed-mode failure of any one relay would cause a system closed-mode failure, and an open-mode failure of all n relays would cause an open-mode failure of the system. Therefore, adding components to the system may decrease the system reliability. Diodes and transistors also exhibit open-mode and short-mode failure behavior.

For instance, in an electrical system having components connected in series, if a short circuit occurs in one of the components, then the short-circuited component will not operate but will permit flow of current through the remaining components so that they continue to operate. However, an open-circuit failure of any of the components will cause an open-circuit failure of the system. As an example, suppose we have a number of 5 W bulbs that remain operative in satisfactory condition at voltages ranging between 3 V and 6 V. By the usual voltage-division rule, if these bulbs are arranged in a series network to form a two-failure-mode system operated at 240 V, then the maximum and minimum numbers of bulbs at these voltages are n = 80 and k = 40, respectively. In this case, any of the bulbs may fail either in closed or in open mode until the system is operative with 40 bulbs. Here it is clear that, after each failure in closed mode, the rate of failure of a bulb in open mode increases, because the voltage across each bulb increases as the number of bulbs in the series decreases.

System reliability where components have various failure modes is covered in References [1–11]. Barlow et al. [1] studied series–parallel and parallel–series systems, where the size of each subsystem was fixed, but the number of subsystems was varied to maximize reliability. Ben-Dov [2] determined a value of k that maximizes the reliability of k-out-of-n systems. Jenney and Sherwin [4] considered systems in which the components are i.i.d. and subject to mutually exclusive open and short failures. Page and Perry [7] discussed the problem of designing the most reliable structure of a given number of i.i.d. components and proposed an alternative algorithm for selecting near-optimal configurations for large systems. Sah and Stiglitz [10] obtained a necessary and sufficient condition for determining a threshold value that maximizes the mean profit of k-out-of-n systems. Pham and Pham [9] further studied the effect of system parameters on the optimal k or n and showed that there does not exist a (k, n) maximizing the mean system profit.

This chapter discusses in detail aspects of the reliability optimization of systems subject to two types of failure. It is assumed that the system component states are statistically independent and identically distributed, and that no constraints are imposed on the number of components to be used. Reliability optimization of series, parallel, parallel–series, series–parallel, and k-out-of-n systems subject to two types of failure is discussed next.

In general, the formula for computing the reliability of a system subject to two kinds of failure is [6]:

System reliability
= Pr{system works in both modes}
= Pr{system works in open mode} − Pr{system fails in closed mode} + Pr{system fails in both modes}   (2.1)

When the open- and closed-mode failure structures are dual of one another, i.e. Pr{system fails in both modes} = 0, then the system reliability given by Equation 2.1 becomes

System reliability = 1 − Pr{system fails in open mode} − Pr{system fails in closed mode}   (2.2)

We adopt the following notation:

$q_o$  the open-mode failure probability of each component ($p_o = 1 - q_o$)
$q_s$  the short-mode failure probability of each component ($p_s = 1 - q_s$)
$\bar{x}$  implies $1 - x$ for any $x$
$\lfloor x \rfloor$  the largest integer not exceeding $x$
$*$  implies an optimal value.

2.2 The Series System

Consider a series system consisting of n components. In this series system, any one component failing in open mode causes system failure, whereas all components of the system must malfunction in short mode for the system to fail.

The probabilities of system failure in open mode and in short mode are

$$F_o(n) = 1 - (1 - q_o)^n$$

and

$$F_s(n) = q_s^n$$

respectively. From Equation 2.2, the system reliability is:

$$R_s(n) = (1 - q_o)^n - q_s^n \qquad (2.3)$$

where $n$ is the number of identical and independent components. In a series arrangement, reliability with respect to closed system failure increases with the number of components, whereas reliability with respect to open system failure falls. There exists an optimum number of components, say $n^*$, that maximizes the system reliability. If we define

$$n_0 = \frac{\log\left(\dfrac{q_o}{1 - q_s}\right)}{\log\left(\dfrac{q_s}{1 - q_o}\right)}$$

then the system reliability $R_s(n^*)$ is maximum for

$$n^* = \begin{cases} \lfloor n_0 \rfloor + 1 & \text{if } n_0 \text{ is not an integer} \\ n_0 \text{ or } n_0 + 1 & \text{if } n_0 \text{ is an integer} \end{cases} \qquad (2.4)$$

Example 1. A switch has two failure modes: fail-open and fail-short. The probabilities of switch open-circuit failure and short-circuit failure are 0.1 and 0.2, respectively. A system consists of $n$ switches wired in series; that is, $q_o = 0.1$ and $q_s = 0.2$. From Equation 2.4:

$$n_0 = \frac{\log\left(\dfrac{0.1}{1 - 0.2}\right)}{\log\left(\dfrac{0.2}{1 - 0.1}\right)} = 1.4$$

Thus, $n^* = \lfloor 1.4 \rfloor + 1 = 2$. Therefore, when $n^* = 2$ the system reliability $R_s(n) = 0.77$ is maximized.
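A small Python sketch (ours) of Equations 2.3 and 2.4 confirms the example:

```python
import math

def optimal_series_size(qo, qs):
    """n0 and n* for a series system with two failure modes (Equation 2.4)."""
    n0 = math.log(qo / (1 - qs)) / math.log(qs / (1 - qo))
    return n0, math.floor(n0) + 1   # the tie case (n0 an integer) is ignored here

def R_series(n, qo, qs):
    return (1 - qo)**n - qs**n      # Equation 2.3

n0, n_star = optimal_series_size(0.1, 0.2)
print(round(n0, 1), n_star)        # 1.4 2
print(R_series(n_star, 0.1, 0.2))  # 0.77
```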

2.3 The Parallel System

Consider a parallel system consisting of n components. For a parallel configuration, all the components must fail in open mode, or at least one component must malfunction in short mode, to cause the system to fail completely.

The system reliability is

$$R_p(n) = (1 - q_s)^n - q_o^n \qquad (2.5)$$

where $n$ is the number of components connected in parallel. In this case, $(1 - q_s)^n$ represents the probability that no components fail in short mode, and $q_o^n$ represents the probability that all components fail in open mode. If we define

$$n_0 = \frac{\log\left(\dfrac{q_s}{1 - q_o}\right)}{\log\left(\dfrac{q_o}{1 - q_s}\right)} \qquad (2.6)$$

then the system reliability $R_p(n^*)$ is maximum for

$$n^* = \begin{cases} \lfloor n_0 \rfloor + 1 & \text{if } n_0 \text{ is not an integer} \\ n_0 \text{ or } n_0 + 1 & \text{if } n_0 \text{ is an integer} \end{cases} \qquad (2.7)$$

It is observed that, for any range of $q_o$ and $q_s$, the optimal number of parallel components that maximizes the system reliability is one if $q_s > q_o$. For most other practical values of $q_o$ and $q_s$, the optimal number turns out to be two. In general, the optimal number of parallel components can easily be obtained using Equation 2.6.
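The same kind of check applies here; the sketch below (ours, with a second, arbitrary pair of failure probabilities for contrast) implements Equations 2.6 and 2.7 and reproduces both observations:

```python
import math

def optimal_parallel_size(qo, qs):
    n0 = math.log(qs / (1 - qo)) / math.log(qo / (1 - qs))  # Equation 2.6
    return math.floor(n0) + 1   # Equation 2.7, non-integer n0 assumed

print(optimal_parallel_size(0.1, 0.2))    # 1, since qs > qo
print(optimal_parallel_size(0.2, 0.05))   # 2, a more typical case with qs < qo
```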

2.3.1 Cost Optimization

Suppose that each component costs d dollars and system failure costs c dollars of revenue. We now wish to determine the optimal system size n that minimizes the average system cost, given that the costs of system failure in open and short modes are known. Let $T_n$ be the total cost of a system of size $n$. The average system cost is given by

$$E[T_n] = dn + c[1 - R_p(n)]$$

where $R_p(n)$ is defined as in Equation 2.5. For given $q_o$, $q_s$, $c$, and $d$, we can obtain a value of $n$, say $n^*$, minimizing the average system cost.

Theorem 1. Fix $q_o$, $q_s$, $c$, and $d$. There exists a unique value $n^*$ that minimizes the average system cost, and

$$n^* = \inf\left\{ n \le n_1 : (1 - q_o) q_o^n - q_s (1 - q_s)^n < \frac{d}{c} \right\} \qquad (2.8)$$

where $n_1 = \lfloor n_0 \rfloor + 1$ and $n_0$ is given in Equation 2.6.


The proof is straightforward and left for an exercise. It was assumed that the cost of system failure in either open mode or short mode was the same. We are now interested in how the cost of system failure in open mode may differ from that in short mode.

Suppose that each component costs d dollars and system failure in open mode and short mode costs $c_1$ and $c_2$ dollars of revenue respectively. Then the average system cost is given by

$$E[T_n] = dn + c_1 q_o^n + c_2 [1 - (1 - q_s)^n] \qquad (2.9)$$

In other words, the average system cost for system size $n$ is the cost incurred when the system has failed in either open mode or short mode, plus the cost of all components in the system. We can determine the optimal value of $n$, say $n^*$, which minimizes the average system cost, as shown in the following theorem [5].

Theorem 2. Fix $q_o$, $q_s$, $c_1$, $c_2$, and $d$. There exists a unique value $n^*$ that minimizes the average system cost, and

$$n^* = \begin{cases} 1 & \text{if } n_a \le 0 \\ n_0 & \text{otherwise} \end{cases}$$

where

$$n_0 = \inf\left\{ n \le n_a : h(n) \le \frac{d}{c_2 q_s} \right\}$$

and

$$h(n) = q_o^n \left[ \frac{1 - q_o}{q_s} \cdot \frac{c_1}{c_2} - \left( \frac{1 - q_s}{q_o} \right)^{n} \right]$$

$$n_a = \frac{\log\left( \dfrac{1 - q_o}{q_s} \cdot \dfrac{c_1}{c_2} \right)}{\log\left( \dfrac{1 - q_s}{q_o} \right)}$$

Example 2. Suppose $d = 10$, $c_1 = 1500$, $c_2 = 300$, $q_s = 0.1$, $q_o = 0.3$. Then

$$\frac{d}{c_2 q_s} = 0.333$$

From Table 2.1, $h(3) = 0.216 < 0.333$; therefore, the optimal value of $n$ is $n^* = 3$. That is, when $n^* = 3$ the average system cost (151.8) is minimized.

Table 2.1. The function h(n) vs n

 n     h(n)      Rp(n)    E[Tn]
 1     9.6       0.6      490.0
 2     2.34      0.72     212.0
 3     0.216     0.702    151.8
 4    −0.373     0.648    155.3
 5    −0.504     0.588    176.5
 6    −0.506     0.531    201.7
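Table 2.1 can be regenerated directly from Theorem 2 and Equation 2.9; a minimal Python sketch (ours) for the data of Example 2:

```python
def h(n, qo, qs, c1, c2):
    # h(n) as defined in Theorem 2
    return qo**n * ((1 - qo) / qs * c1 / c2 - ((1 - qs) / qo)**n)

def avg_cost(n, qo, qs, c1, c2, d):
    # Equation 2.9: component cost plus expected failure costs
    return d * n + c1 * qo**n + c2 * (1 - (1 - qs)**n)

qo, qs, c1, c2, d = 0.3, 0.1, 1500, 300, 10
print("threshold d/(c2*qs) =", d / (c2 * qs))    # 0.333
for n in range(1, 7):
    print(n, round(h(n, qo, qs, c1, c2), 3),
          round(avg_cost(n, qo, qs, c1, c2, d), 1))
# The first n with h(n) <= 0.333 is n* = 3, where E[T_3] = 151.8.
```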

2.4 The Parallel–Series System

Consider a system of components arranged so that there are m subsystems operating in parallel, each subsystem consisting of n identical components in series. Such an arrangement is called a parallel–series arrangement. The components could be a logic gate, a fluid-flow valve, or an electronic diode, and they are subject to two types of failure: failure in open mode and failure in short mode. Applications of parallel–series systems can be found in the areas of communication, networks, and nuclear power systems. For example, consider a digital circuit module designed to process the incoming message in a communication system. Suppose that there are, at most, m ways of getting a message through the system, depending on which of the branches with n modules are operable. Such a system is subject to two failure modes: (1) a failure in open circuit of a single component in each subsystem would render the system unresponsive; or (2) a failure in short circuit of all the components in any subsystem would render the entire system unresponsive.

We adopt the following notation:

m  number of subsystems in a system (or subsystem size)
n  number of components in each subsystem
Fo(m)  probability of system failure in open mode
Fs(m)  probability of system failure in short mode.

The systems are characterized by the following properties.

1. The system consists of m subsystems, each subsystem containing n i.i.d. components.

2. A component is either good, failed open, or failed short. Failed components can never become good, and there are no transitions between the open and short failure modes.

3. The system can be (a) good, (b) failed open (at least one component in each subsystem fails open), or (c) failed short (all the components in any subsystem fail short).

4. The unconditional probabilities of component failure in open and short modes are known and are constrained: $q_o, q_s > 0$; $q_o + q_s < 1$.

The probabilities of a system failing in open mode and failing in short mode are given by

$$F_o(m) = [1 - (1 - q_o)^n]^m \qquad (2.10)$$

and

$$F_s(m) = 1 - (1 - q_s^n)^m \qquad (2.11)$$

respectively. The system reliability is

$$R_{ps}(n, m) = (1 - q_s^n)^m - [1 - (1 - q_o)^n]^m \qquad (2.12)$$

where $m$ is the number of identical subsystems in parallel and $n$ is the number of identical components in each series subsystem. The term $(1 - q_s^n)^m$ represents the probability that none of the subsystems has failed in closed mode. Similarly, $[1 - (1 - q_o)^n]^m$ represents the probability that all the subsystems have failed in open mode.

An interesting example in Ref. [1] shows that there exists no pair $(n, m)$ maximizing system reliability, since $R_{ps}$ can be made arbitrarily close to one by appropriate choice of $m$ and $n$. To see this, let

$$a = \frac{\log q_s - \log(1 - q_o)}{\log q_s + \log(1 - q_o)} \qquad M_n = q_s^{-n/(1+a)} \qquad m_n = \lfloor M_n \rfloor$$

For given $n$, take $m = m_n$; then one can rewrite Equation 2.12 as:

$$R_{ps}(n, m_n) = (1 - q_s^n)^{m_n} - [1 - (1 - q_o)^n]^{m_n}$$

A straightforward computation yields

$$\lim_{n\to\infty} R_{ps}(n, m_n) = \lim_{n\to\infty} \left\{(1 - q_s^n)^{m_n} - [1 - (1 - q_o)^n]^{m_n}\right\} = 1$$

For fixed $n$, $q_o$, and $q_s$, one can determine the value of $m$ that maximizes $R_{ps}$, and this is given below [8].

Theorem 3. Let $n$, $q_o$, and $q_s$ be fixed. The maximum value of $R_{ps}(m)$ is attained at $m^* = \lfloor m_0 \rfloor + 1$, where

$$m_0 = \frac{n(\log p_o - \log q_s)}{\log(1 - q_s^n) - \log(1 - p_o^n)} \qquad (2.13)$$

If $m_0$ is an integer, then $m_0$ and $m_0 + 1$ both maximize $R_{ps}(m)$.
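Equation 2.13 is easily checked against a brute-force search over m; the following sketch (ours, with illustrative failure probabilities) does so using Equation 2.12:

```python
import math

def R_ps(n, m, qo, qs):
    # Equation 2.12: m parallel subsystems, each of n series components
    return (1 - qs**n)**m - (1 - (1 - qo)**n)**m

def m_star(n, qo, qs):
    po = 1 - qo
    m0 = n * (math.log(po) - math.log(qs)) \
         / (math.log(1 - qs**n) - math.log(1 - po**n))   # Equation 2.13
    return math.floor(m0) + 1

n, qo, qs = 2, 0.1, 0.2
closed_form = m_star(n, qo, qs)
brute_force = max(range(1, 50), key=lambda m: R_ps(n, m, qo, qs))
print(closed_form, brute_force)    # both give m* = 2 for these values
```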

2.4.1 The Profit Maximization Problem

We now wish to determine the optimal subsystem size $m$ that maximizes the average system profit. We study how the optimal subsystem size $m$ depends on the system parameters. We also show that there does not exist a pair $(m, n)$ maximizing the average system profit.

We adopt the following notation:

A(m)  average system profit
β  conditional probability that the system is in open mode
1 − β  conditional probability that the system is in short mode
c1, c3  gain from system success in open, short mode
c2, c4  gain from system failure in open, short mode; c1 > c2, c3 > c4.

The average system profit is given by

$$A(m) = \beta\{c_1[1 - F_o(m)] + c_2 F_o(m)\} + (1 - \beta)\{c_3[1 - F_s(m)] + c_4 F_s(m)\} \qquad (2.14)$$

Define

$$a = \frac{\beta(c_1 - c_2)}{(1 - \beta)(c_3 - c_4)}$$

and

$$b = \beta c_1 + (1 - \beta) c_4 \qquad (2.15)$$

We can rewrite Equation 2.14 as

$$A(m) = (1 - \beta)(c_3 - c_4)\{[1 - F_s(m)] - a F_o(m)\} + b \qquad (2.16)$$

When the costs of the two kinds of system failure are identical, and the system is in the two modes with equal probability, the optimization criterion becomes the same as maximizing the system reliability. The following analysis deals with cases that need not satisfy these special restrictions.

For a given value of $n$, one wishes to find the optimal number of subsystems $m$ ($m^*$) that maximizes the average system profit. Of course, we would expect the optimal value of $m$ to depend on the values of both $q_o$ and $q_s$. Define

$$m_0 = \frac{\ln a + n \ln\left(\dfrac{1 - q_o}{q_s}\right)}{\ln\left[\dfrac{1 - q_s^n}{1 - (1 - q_o)^n}\right]} \qquad (2.17)$$

Theorem 4. Fix $\beta$, $n$, $q_o$, $q_s$, and $c_i$ for $i = 1, 2, 3, 4$. The maximum value of $A(m)$ is attained at

$$m^* = \begin{cases} 1 & \text{if } m_0 < 0 \\ \lfloor m_0 \rfloor + 1 & \text{if } m_0 \ge 0 \end{cases} \qquad (2.18)$$

If $m_0$ is a non-negative integer, both $m_0$ and $m_0 + 1$ maximize $A(m)$.

The proof is straightforward. When $m_0$ is a non-negative integer, the lower value will provide the more economical optimal configuration for the system. It is of interest to study how the optimal subsystem size $m^*$ depends on the various parameters $q_o$ and $q_s$.

Theorem 5. For fixed $n$, $c_1$, $c_2$, $c_3$, and $c_4$:

(a) If $a \ge 1$, then the optimal subsystem size $m^*$ is an increasing function of $q_o$.
(b) If $a \le 1$, then the optimal subsystem size $m^*$ is a decreasing function of $q_s$.
(c) The optimal subsystem size $m^*$ is an increasing function of $\beta$.

The proof is left for an exercise. It is worth noting that we cannot find a pair $(m, n)$ maximizing the average system profit $A(m)$. Let

$$x = \frac{\ln q_s - \ln p_o}{\ln q_s + \ln p_o} \qquad M_n = q_s^{-n/(1+x)} \qquad m_n = \lfloor M_n \rfloor \qquad (2.19)$$

For given $n$, take $m = m_n$. From Equation 2.14, the average system profit can be rewritten as

$$A(m_n) = (1 - \beta)(c_3 - c_4)\{[1 - F_s(m_n)] - a F_o(m_n)\} + b \qquad (2.20)$$

Theorem 6. For fixed $q_o$ and $q_s$,

$$\lim_{n\to\infty} A(m_n) = \beta c_1 + (1 - \beta) c_3 \qquad (2.21)$$

The proof is left for an exercise. This result shows that we cannot seek a pair $(m, n)$ maximizing the average system profit $A(m_n)$, since $A(m_n)$ can be made arbitrarily close to $\beta c_1 + (1 - \beta) c_3$.

2.4.2 Optimization Problem

We show how design policies can be chosen when the objective is to minimize the average total system cost, given that the costs of system failure in open mode and short mode may not necessarily be the same.

The following notation is adopted:

d  cost of each component
c1  cost when the system fails in open mode
c2  cost when the system fails in short mode
T(m)  total system cost
E[T(m)]  average total system cost.

Suppose that each component costs d dollars, and system failure in open mode and short mode costs $c_1$ and $c_2$ dollars of revenue, respectively. The average total system cost is

$$E[T(m)] = dnm + c_1 F_o(m) + c_2 F_s(m) \qquad (2.22)$$

In other words, the average system cost is the cost incurred when the system has failed in either the open mode or the short mode, plus the cost of all


components in the system. Define

$$h(m) = [1 - p_o^n]^m \left[ c_1 p_o^n - c_2 q_s^n \left( \frac{1 - q_s^n}{1 - p_o^n} \right)^{m} \right] \qquad (2.23)$$

$$m_1 = \inf\{m < m_2 : h(m) < dn\}$$

and

$$m_2 = \left\lfloor \frac{\ln\left[\dfrac{c_1}{c_2}\left(\dfrac{p_o}{q_s}\right)^{n}\right]}{\ln\left(\dfrac{1 - q_s^n}{1 - p_o^n}\right)} \right\rfloor + 1 \qquad (2.24)$$

From Equation 2.23, $h(m) > 0$ if and only if

$$c_1 p_o^n > c_2 q_s^n \left( \frac{1 - q_s^n}{1 - p_o^n} \right)^{m}$$

or, equivalently, $m < m_2$. Thus, the function $h(m)$ is decreasing in $m$ for all $m < m_2$. For fixed $n$, we determine the optimal value of $m$, $m^*$, that minimizes the expected system cost, as shown in the following theorem [11].

Theorem 7. Fix $q_o$, $q_s$, $d$, $c_1$, and $c_2$. There exists a unique value $m^*$ such that the system minimizes the expected cost, and

(a) if $m_2 > 0$ then

$$m^* = \begin{cases} m_1 & \text{if } E[T(m_1)] \le E[T(m_2)] \\ m_2 & \text{if } E[T(m_1)] > E[T(m_2)] \end{cases} \qquad (2.25)$$

(b) if $m_2 \le 0$ then $m^* = 1$.

The proof is straightforward. Since the function $h(m)$ is decreasing in $m$ for $m < m_2$, the resulting optimization problem in Equation 2.25 is easily solved in practice.

Example 3. Suppose $n = 5$, $d = 10$, $c_1 = 500$, $c_2 = 700$, $q_s = 0.1$, and $q_o = 0.2$. From Equation 2.24, we obtain $m_2 = 26$. Since $m_2 > 0$, we determine the optimal value of $m$ by using Theorem 7(a). The subsystem size $m$, $h(m)$, and the expected system cost $E[T(m)]$ are listed in Table 2.2; from this table, we have

$$m_1 = \inf\{m < 26 : h(m) < 50\} = 3$$

Table 2.2. The data for Example 3

 m     h(m)       E[T(m)]
 1     110.146    386.17
 2     74.051     326.02
 3     49.784     301.97
 4     33.469     302.19
 5     22.499     318.71
 6     15.124     346.22
 7     10.166     381.10
 8     6.832      420.93
 9     4.591      464.10
 10    3.085      509.50

and

$$E[T(m_1)] = 301.97$$

For $m_2 = 26$, $E[T(m_2)] = 1300.20$. From Theorem 7(a), the optimal value of $m$ required to minimize the expected total system cost is 3, and the expected total system cost corresponding to this value is 301.97.
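Table 2.2 and the optimization of Theorem 7 can be reproduced with a few lines of Python (our sketch):

```python
def h(m, n, qo, qs, c1, c2):
    pon, qsn = (1 - qo)**n, qs**n
    # Equation 2.23
    return (1 - pon)**m * (c1 * pon - c2 * qsn * ((1 - qsn) / (1 - pon))**m)

def avg_cost(m, n, qo, qs, c1, c2, d):
    Fo = (1 - (1 - qo)**n)**m            # Equation 2.10
    Fs = 1 - (1 - qs**n)**m              # Equation 2.11
    return d * n * m + c1 * Fo + c2 * Fs # Equation 2.22

n, d, c1, c2, qs, qo = 5, 10, 500, 700, 0.1, 0.2
for m in range(1, 5):
    print(m, round(h(m, n, qo, qs, c1, c2), 3),
          round(avg_cost(m, n, qo, qs, c1, c2, d), 2))
# h(3) = 49.784 is the first value below d*n = 50, so m1 = 3,
# and E[T(3)] = 301.97 is the minimum average total system cost.
```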

2.5 The Series–Parallel System

The series–parallel structure is the dual of the parallel–series structure in Section 2.4. We study a system of components arranged so that there are m subsystems operating in series, each subsystem consisting of n identical components in parallel. Such an arrangement is called a series–parallel arrangement. Applications of such systems can be found in the areas of communication, networks, and nuclear power systems. For example, consider a digital communication system consisting of m substations in series. A message is initially sent to substation 1, is then relayed to substation 2, etc., until the message passes through substation m and is received. The message consists of a sequence of 0's and 1's, and each digit is sent separately through the series of m substations. Unfortunately, the substations are not perfect and can transmit as output a different digit than that received as input. Such a system is subject to two failure modes: errors in digital transmission occur in such a manner that either (1) a one appears instead of a zero, or (2) a zero appears instead of a one.


Failure in open mode of all the components in any subsystem makes the system unresponsive. Failure in closed (short) mode of a single component in each subsystem also makes the system unresponsive. The probabilities of system failure in open and short mode are given by

$$F_o(m) = 1 - (1 - q_o^n)^m \qquad (2.26)$$

and

$$F_s(m) = [1 - (1 - q_s)^n]^m \qquad (2.27)$$

respectively. The system reliability is

$$R(m) = (1 - q_o^n)^m - [1 - (1 - q_s)^n]^m \qquad (2.28)$$

where $m$ is the number of identical subsystems in series and $n$ is the number of identical components in each parallel subsystem.

Barlow et al. [1] show that there exists no pair $(m, n)$ maximizing system reliability. For fixed $n$, $q_o$, and $q_s$, however, one can determine the value of $m$ that maximizes the system reliability.

Theorem 8. Let $n$, $q_o$, and $q_s$ be fixed. The maximum value of $R(m)$ is attained at $m^* = \lfloor m_0 \rfloor + 1$, where

$$m_0 = \frac{n(\log p_s - \log q_o)}{\log(1 - q_o^n) - \log(1 - p_s^n)} \qquad (2.29)$$

If $m_0$ is an integer, then $m_0$ and $m_0 + 1$ both maximize $R(m)$.

2.5.1 Maximizing the Average System Profit

The effect of the system parameters on the optimal m is now studied. We also determine the optimal subsystem size that maximizes the average system profit subject to a restricted type I (system failure in open mode) design error.

The following notation is adopted:

β  conditional probability (given system failure) that the system is in open mode
1 − β  conditional probability (given system failure) that the system is in short mode
c1  gain from system success in open mode
c2  gain from system failure in open mode (c1 > c2)
c3  gain from system success in short mode
c4  gain from system failure in short mode (c3 > c4).

The average system profit, $P(m)$, is given by

$$P(m) = \beta\{c_1[1 - F_o(m)] + c_2 F_o(m)\} + (1 - \beta)\{c_3[1 - F_s(m)] + c_4 F_s(m)\} \qquad (2.30)$$

where $F_o(m)$ and $F_s(m)$ are defined as in Equations 2.26 and 2.27 respectively. Let

$$a = \frac{\beta(c_1 - c_2)}{(1 - \beta)(c_3 - c_4)}$$

and

$$b = \beta c_1 + (1 - \beta) c_4$$

We can rewrite Equation 2.30 as

$$P(m) = (1 - \beta)(c_3 - c_4)[1 - F_s(m) - a F_o(m)] + b \qquad (2.31)$$

For a given value of $n$, one wishes to find the optimal number of subsystems $m$, say $m^*$, that maximizes the average system profit. We would anticipate that $m^*$ depends on the values of both $q_o$ and $q_s$. Let

$$m_0 = \frac{n \ln\left(\dfrac{1 - q_s}{q_o}\right) - \ln a}{\ln\left[\dfrac{1 - q_o^n}{1 - (1 - q_s)^n}\right]} \qquad (2.32)$$

Theorem 9. Fix $\beta$, $n$, $q_o$, $q_s$, and $c_i$ for $i = 1, 2, 3, 4$. The maximum value of $P(m)$ is attained at

$$m^* = \begin{cases} 1 & \text{if } m_0 < 0 \\ \lfloor m_0 \rfloor + 1 & \text{if } m_0 \ge 0 \end{cases}$$

If $m_0 \ge 0$ and $m_0$ is an integer, both $m_0$ and $m_0 + 1$ maximize $P(m)$.

The proof is straightforward and left as an exercise. When both $m_0$ and $m_0 + 1$ maximize the average system profit, the lower of the two values costs less. It is of interest to study how $m^*$ depends on the various parameters $q_o$ and $q_s$.


Theorem 10. For fixed $n$ and $c_i$, $i = 1, 2, 3, 4$:

(a) If $a \ge 1$, then the optimal subsystem size $m^*$ is a decreasing function of $q_o$.
(b) If $a \le 1$, then the optimal subsystem size $m^*$ is an increasing function of $q_s$.

This theorem states that when $q_o$ increases, it is desirable to reduce $m$ as close to one as is feasible. On the other hand, when $q_s$ increases, the average system profit increases with the number of subsystems.

2.5.2 Consideration of Type I Design Error

The solution provided by Theorem 9 is optimal in terms of the average system profit. Such an optimal configuration, when adopted, leads to a type I design error (system failure in open mode), which may not be acceptable at the design stage. It should be noted that the more subsystems we add to the system, the greater is the chance of system failure by opening ($F_o(m)$, Equation 2.26); however, we do make the probability of system failure in short mode smaller by placing additional subsystems in series. Therefore, given $\beta$, $n$, $q_o$, $q_s$, and $c_i$ for $i = 1, 2, \ldots, 4$, we wish to determine the optimal subsystem size $m^*$ that maximizes the average system profit $P(m)$ in such a way that the probability of a system type I design error (i.e. the probability of system failure in open mode) is at most $\alpha$.

Theorem 9 remains unchanged if the $m^*$ obtained from Theorem 9 is kept within the tolerable $\alpha$ level, namely $F_o(m) \le \alpha$. Otherwise, modifications are needed to determine the optimal system size. This is stated in the following result.

Corollary 1. For given values of $\beta$, $n$, $q_o$, $q_s$, and $c_i$ for $i = 1, 2, \ldots, 4$, the optimal value of $m$, say $m^*$, that maximizes the average system profit subject to a restricted type I design error $\alpha$ is attained at

$$m^* = \begin{cases} 1 & \text{if } \min\{\lfloor m_0 \rfloor, \lfloor m_1 \rfloor\} < 1 \\ \min\{\lfloor m_0 \rfloor + 1, \lfloor m_1 \rfloor\} & \text{otherwise} \end{cases}$$

where $\lfloor m_0 \rfloor + 1$ is the solution obtained from Theorem 9 and

$$m_1 = \frac{\ln(1 - \alpha)}{\ln(1 - q_o^n)}$$

2.6 The k-out-of-n Systems

Consider a model in which a k-out-of-n system is composed of n identical and independent components that can be either good or failed. The components are subject to two types of failure: failure in open mode and failure in closed mode. The system can fail when k or more components fail in closed mode or when $(n - k + 1)$ or more components fail in open mode. Applications of k-out-of-n systems can be found in the areas of target detection, communication, and safety monitoring systems, and, particularly, in the area of human organizations. The following is an example in the area of human organizations. Consider a committee with n members who must decide to accept or reject innovation-oriented projects. The projects are of two types: "good" and "bad". It is assumed that the communication among the members is limited, and each member will make a yes–no decision on each project. A committee member can make two types of error: the error of accepting a bad project and the error of rejecting a good project. The committee will accept a project when k or more members accept it, and will reject a project when $(n - k + 1)$ or more members reject it. Thus, the two types of potential error of the committee are: (1) the acceptance of a bad project (which occurs when k or more members make the error of accepting a bad project); (2) the rejection of a good project (which occurs when $(n - k + 1)$ or more members make the error of rejecting a good project). This section determines the:

• optimal k that minimizes the expected total system cost;
• optimal n that minimizes the expected total system cost;
• optimal k and n that minimize the expected total system cost.


We also study the effect of the system's parameters on the optimal k or n. The system fails in closed mode if and only if at least k of its n components fail in closed mode, and we obtain

$$F_s(k, n) = \sum_{i=k}^{n} \binom{n}{i} q_s^i p_s^{n-i} = 1 - \sum_{i=0}^{k-1} \binom{n}{i} q_s^i p_s^{n-i} \qquad (2.33)$$

The system fails in open mode if and only if at least $n - k + 1$ of its $n$ components fail in open mode, that is:

$$F_o(k, n) = \sum_{i=n-k+1}^{n} \binom{n}{i} q_o^i p_o^{n-i} = \sum_{i=0}^{k-1} \binom{n}{i} p_o^i q_o^{n-i} \qquad (2.34)$$

Hence, the system reliability is given by

$$R(k, n) = 1 - F_o(k, n) - F_s(k, n) = \sum_{i=0}^{k-1} \binom{n}{i} q_s^i p_s^{n-i} - \sum_{i=0}^{k-1} \binom{n}{i} p_o^i q_o^{n-i} \qquad (2.35)$$

Let

$$b(k; p, n) = \binom{n}{k} p^k (1 - p)^{n-k}$$

and

$$b_{\inf}(k; p, n) = \sum_{i=0}^{k} b(i; p, n)$$

We can rewrite Equations 2.33–2.35 as

$$F_s(k, n) = 1 - b_{\inf}(k - 1; q_s, n)$$
$$F_o(k, n) = b_{\inf}(k - 1; p_o, n)$$
$$R(k, n) = b_{\inf}(k - 1; q_s, n) - b_{\inf}(k - 1; p_o, n)$$

respectively. For a given $k$, we can find the optimum value of $n$, say $n^*$, that maximizes the system reliability.

Theorem 11. For fixed $k$, $q_o$, and $q_s$, the maximum value of $R(k, n)$ is attained at $n^* = \lfloor n_0 \rfloor$, where

$$n_0 = k\left[1 + \frac{\log\left(\dfrac{1 - q_o}{q_s}\right)}{\log\left(\dfrac{1 - q_s}{q_o}\right)}\right]$$

If $n_0$ is an integer, both $n_0$ and $n_0 + 1$ maximize $R(k, n)$.

This result shows that when $n_0$ is an integer, both $n^* - 1$ and $n^*$ maximize the system reliability $R(k, n)$. In such cases, the lower value will provide the more economical optimal configuration for the system. If $q_o = q_s$, the system reliability $R(k, n)$ is maximized when $n = 2k$ or $2k - 1$. In this case, the optimum value of $n$ does not depend on the values of $q_o$ and $q_s$, and the best choice for a decision voter is a majority voter; this system is also called a majority system [12].

From Theorem 11 above we understand that the optimal system size $n^*$ depends on the parameters $q_o$ and $q_s$. It can be shown that the optimal value $n^*$ is an increasing function of $q_o$ and a decreasing function of $q_s$. Intuitively, these results state that when $q_s$ increases it is desirable to reduce the number of components in the system as close to the value of the threshold level $k$ as possible. On the other hand, when $q_o$ increases, the system reliability will be improved if the number of components increases.

For fixed $n$, $q_o$, and $q_s$, it is straightforward to see that the maximum value of $R(k, n)$ is attained at $k^* = \lfloor k_0 \rfloor + 1$, where

$$k_0 = n \frac{\log\left(\dfrac{q_o}{p_s}\right)}{\log\left(\dfrac{q_s q_o}{p_s p_o}\right)}$$

If $k_0$ is an integer, both $k_0$ and $k_0 + 1$ maximize $R(k, n)$.

We now discuss how these two values, $k^*$ and $n^*$, are related to one another. Define $\alpha$ by

$$\alpha = \frac{\log\left(\dfrac{q_o}{p_s}\right)}{\log\left(\dfrac{q_s q_o}{p_s p_o}\right)}$$

Then, for a given $n$, the optimal threshold $k$ is given by $k^* = \lfloor n\alpha \rfloor$, and for a given $k$ the optimal $n$ is $n^* = \lfloor k/\alpha \rfloor$. For any given $q_o$ and $q_s$, we can easily show that

$$q_s < \alpha < p_o$$


Therefore, we can obtain the following bounds for the optimal value of the threshold $k$:

$$n q_s < k^* < n p_o$$

This result shows that, for given values of $q_o$ and $q_s$, an upper bound for the optimal threshold $k^*$ is the expected number of components working in open mode, and a lower bound for the optimal threshold $k^*$ is the expected number of components failing in closed mode.
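These formulas are easy to exercise numerically. The sketch below (ours; the probabilities are illustrative only) computes $k^*$ from $\alpha$, confirms it against an exhaustive evaluation of Equation 2.35, and checks the bounds $nq_s < k^* < np_o$:

```python
import math
from math import comb

def R(k, n, qo, qs):
    po, ps = 1 - qo, 1 - qs
    # Equation 2.35
    return sum(comb(n, i) * (qs**i * ps**(n - i) - po**i * qo**(n - i))
               for i in range(k))

qo, qs, n = 0.1, 0.2, 10
po, ps = 1 - qo, 1 - qs
alpha = math.log(qo / ps) / math.log(qs * qo / (ps * po))
k_star = math.floor(n * alpha) + 1                 # k* = floor(k0) + 1
brute = max(range(1, n + 1), key=lambda k: R(k, n, qo, qs))
print(k_star, brute, n * qs < k_star < n * po)     # 6 6 True
```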

2.6.1 Minimizing the Average System Cost

We adopt the following notation:

d  cost of each component
c1  cost when system failure is in open mode
c2  cost when system failure is in short mode
$\bar{b}_{\inf}(k; q_s, n) = 1 - b_{\inf}(k - 1; q_s, n)$

The average total system cost $E[T(k, n)]$ is

$$E[T(k, n)] = dn + [c_1 F_o(k, n) + c_2 F_s(k, n)] \qquad (2.36)$$

In other words, the average total system cost is the cost of all components in the system ($dn$), plus the average cost of system failure in the open mode ($c_1 F_o(k, n)$) and the average cost of system failure in the short mode ($c_2 F_s(k, n)$).

We now study the problem of how design policies can be chosen when the objective is to minimize the average total system cost, given the cost of components and the costs of system failure in the open and short modes. We wish to find the:

• optimal k (k∗) that minimizes the average system cost for a given n;
• optimal n (n∗) that minimizes the average system cost for a given k;
• optimal k and n (k∗, n∗) that minimize the average system cost.

Define

$$k_0 = \frac{\log\left(\dfrac{c_2}{c_1}\right) + n \log\left(\dfrac{p_s}{q_o}\right)}{\log\left(\dfrac{p_o p_s}{q_o q_s}\right)} \qquad (2.37)$$

Theorem 12. Fix $n$, $q_o$, $q_s$, $c_1$, $c_2$, and $d$. The minimum value of $E[T(k, n)]$ is attained at

$$k^* = \begin{cases} \max\{1, \lfloor k_0 \rfloor + 1\} & \text{if } k_0 < n \\ n & \text{if } k_0 \ge n \end{cases}$$

If $k_0$ is a positive integer, both $k_0$ and $k_0 + 1$ minimize $E[T(k, n)]$.
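A numerical sketch of Theorem 12 (ours; the costs and probabilities are illustrative only) compares Equation 2.37 with an exhaustive minimization of Equation 2.36:

```python
import math
from math import comb

def avg_cost(k, n, qo, qs, c1, c2, d):
    po, ps = 1 - qo, 1 - qs
    Fo = sum(comb(n, i) * po**i * qo**(n - i) for i in range(k))      # Eq. 2.34
    Fs = 1 - sum(comb(n, i) * qs**i * ps**(n - i) for i in range(k))  # Eq. 2.33
    return d * n + c1 * Fo + c2 * Fs                                  # Eq. 2.36

n, qo, qs, c1, c2, d = 10, 0.1, 0.2, 500, 700, 10
po, ps = 1 - qo, 1 - qs
k0 = (math.log(c2 / c1) + n * math.log(ps / qo)) / math.log(po * ps / (qo * qs))
k_star = n if k0 >= n else max(1, math.floor(k0) + 1)                 # Theorem 12
brute = min(range(1, n + 1), key=lambda k: avg_cost(k, n, qo, qs, c1, c2, d))
print(round(k0, 2), k_star, brute)    # 5.9 6 6
```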

It is of interest to study how the optimal value of $k$, $k^*$, depends on the probabilities of component failure in the open mode ($q_o$) and in the short mode ($q_s$).

Corollary 2. Fix $n$.

1. If $c_1 \ge c_2$, then $k^*$ is decreasing in $q_o$.
2. If $c_1 \le c_2$, then $k^*$ is increasing in $q_s$.

Intuitively, this result states that if the cost of system failure in the open mode is greater than or equal to the cost of system failure in the short mode, then, as $q_o$ increases, it is desirable to reduce the threshold level $k$ as close to one as is feasible. Similarly, if the cost of system failure in the open mode is less than or equal to the cost of system failure in the short mode, then, as $q_s$ increases, it is desirable to increase $k$ as close to $n$ as is feasible. Define

$$a = \frac{c_1}{c_2}$$

$$n_0 = \frac{\log a + k \log\left(\dfrac{p_o p_s}{q_o q_s}\right)}{\log\left(\dfrac{p_s}{q_o}\right)} - 1$$

$$n_1 = \left\lceil \frac{k - 1}{1 - q_o} \right\rceil - 1$$

$$f(n) = \left(\frac{p_s}{q_o}\right)^{n} \frac{(n+1)q_s - (k-1)}{(n+1)p_o - (k-1)}$$

$$B = a \left(\frac{p_o p_s}{q_o q_s}\right)^{k} \frac{q_o}{p_s}$$

and

$$n_2 = f^{-1}(B) \quad \text{for } k \le n_2 \le n_1$$

Let

$$n_3 = \inf\left\{ n \in [n_2, n_0] : h(n) < \frac{d}{c_2} \right\}$$


where

$$h(n) = \binom{n}{k-1} p_o^k q_o^{n-k+1} \left[ a - \left(\frac{q_o q_s}{p_s p_o}\right)^{k} \left(\frac{p_s}{q_o}\right)^{n+1} \right] \qquad (2.38)$$

It is easy to show that the function $h(n)$ is positive for all $k \le n \le n_0$, is increasing in $n$ for $n \in [k, n_2)$, and is decreasing in $n$ for $n \in [n_2, n_0]$. This shows that the function $h(n)$ is unimodal and achieves a maximum value at $n = n_2$. Note that $n_2 \le n_1$; when the probability of component failure in the open mode, $q_o$, is quite small, $n_1 \approx k$, and so $n_2 \approx k$. On the other hand, for a given arbitrary $q_o$, one can find the value $n_2$ between $k$ and $n_1$ by using a binary search technique.

Theorem 13. Fix $q_o$, $q_s$, $k$, $d$, $c_1$, and $c_2$. The optimal value of $n$, say $n^*$, such that the system minimizes the expected total cost is $n^* = k$ if $n_0 \le k$. Suppose $n_0 > k$. Then:

1. if $h(n_2) < d/c_2$, then $n^* = k$;
2. if $h(n_2) \ge d/c_2$ and $h(k) \ge d/c_2$, then $n^* = n_3$;
3. if $h(n_2) \ge d/c_2$ and $h(k) < d/c_2$, then

$$n^* = \begin{cases} k & \text{if } E[T(k, k)] \le E[T(k, n_3)] \\ n_3 & \text{if } E[T(k, k)] > E[T(k, n_3)] \end{cases} \qquad (2.39)$$

Proof. Let $\Delta E[T(n)] = E[T(k, n+1)] - E[T(k, n)]$. From Equation 2.36, we obtain

$$\Delta E[T(n)] = d - c_1 \binom{n}{k-1} p_o^k q_o^{n-k+1} + c_2 \binom{n}{k-1} q_s^k p_s^{n-k+1} \qquad (2.40)$$

Substituting $c_1 = a c_2$ into Equation 2.40, and after simplification, we obtain

$$\Delta E[T(n)] = d - c_2 \binom{n}{k-1} p_o^k q_o^{n-k+1} \left[ a - \left(\frac{q_o q_s}{p_o p_s}\right)^{k} \left(\frac{p_s}{q_o}\right)^{n+1} \right] = d - c_2 h(n)$$

The system of size $n + 1$ is better than the system of size $n$ if, and only if, $h(n) \ge d/c_2$. If $n_0 \le k$, then $h(n) \le 0$ for all $n \ge k$, so that $E[T(k, n)]$ is increasing in $n$ for all $n \ge k$. Thus $n^* = k$ minimizes the expected total system cost. Suppose $n_0 > k$. Since the function $h(n)$ is decreasing in $n$ for $n_2 \le n \le n_0$, there exists an $n$ such that $h(n) < d/c_2$ on the interval $n_2 \le n \le n_0$. Let $n_3$ denote the smallest such $n$. Because $h(n)$ is decreasing on the interval $[n_2, n_0]$ where the function $h(n)$ is positive, we have $h(n) \ge d/c_2$ for $n_2 \le n \le n_3$ and $h(n) < d/c_2$ for $n > n_3$. Let $n^*$ be an optimal value of $n$ such that $E[T(k, n)]$ is minimized.

(a) If $h(n_2) < d/c_2$, then $n_3 = n_2$ and $h(k) < h(n_2) < d/c_2$, since $h(n)$ is increasing in $[k, n_2)$ and decreasing in $[n_2, n_0]$. Note that incrementing the system size reduces the expected system cost only when $h(n) \ge d/c_2$. This implies that $n^* = k$ minimizes $E[T(k, n)]$.

(b) Assume $h(n_2) \ge d/c_2$ and $h(k) \ge d/c_2$. Then $h(n) \ge d/c_2$ for $k \le n < n_2$, since $h(n)$ is increasing in $n$ for $k < n < n_2$. This implies that $E[T(k, n+1)] \le E[T(k, n)]$ for $k \le n < n_2$. Since $h(n_2) \ge d/c_2$, we have $h(n) \ge d/c_2$ for $n_2 < n < n_3$ and $h(n) < d/c_2$ for $n > n_3$. This shows that $n^* = n_3$ minimizes $E[T(k, n)]$.

(c) Similarly, assume that $h(n_2) \ge d/c_2$ and $h(k) < d/c_2$. Then either $n^* = k$ or $n^* = n_3$ is the optimal solution for $n$. Thus, $n^* = k$ if $E[T(k, k)] \le E[T(k, n_3)]$; on the other hand, $n^* = n_3$ if $E[T(k, k)] > E[T(k, n_3)]$. □

In practical applications, the probability of component failure in the open mode, $q_o$, is often quite small, and so the value of $n_1$ is close to $k$. Therefore, the number of computations needed to find the value of $n_2$ is quite small. Hence, the result of Theorem 13 is easily applied in practice.

In the remainder of this section, we assume that the two system parameters $k$ and $n$ are unknown. It is of interest to determine the optimum values of $(k, n)$, say $(k^*, n^*)$, that minimize the expected total system cost when the cost of components and the costs of system failures are known. Define

$$\alpha = \frac{\log(p_s/q_o)}{\log(p_o p_s/q_o q_s)} \qquad \beta = \frac{\log(c_2/c_1)}{\log(p_o p_s/q_o q_s)} \qquad (2.41)$$


We need the following lemma.

Lemma 1. For $0 \le m \le n$ and $0 \le p \le 1$:

$$\sum_{i=0}^{m} \binom{n}{i} p^i (1 - p)^{n-i} < \sqrt{\frac{n}{2\pi m(n - m)}}$$

Proof. See [5], Lemma 3.16, for a detailed proof. □

Theorem 14. Fix $q_o$, $q_s$, $d$, $c_1$, and $c_2$. There exists an optimal pair of values $(k_n, n)$, say $(k_{n^*}, n^*)$, such that the average total system cost is minimized at $(k_{n^*}, n^*)$, where

$$k_{n^*} = \lfloor n^* \alpha \rfloor$$

and

$$n^* \le \frac{(1 - q_o - q_s)^2 \left(\dfrac{c_1}{d}\right)^{2} \dfrac{1}{2\pi} + 1 + \beta}{\alpha(1 - \alpha)} \qquad (2.42)$$

Proof. Define $\Delta E[T(n)] = E[T(k_{n+1}, n+1)] - E[T(k_n, n)]$. From Equation 2.36, we obtain

$$\Delta E[T(n)] = d + c_1[b_{\inf}(k_{n+1} - 1; p_o, n+1) - b_{\inf}(k_n - 1; p_o, n)] - c_2[b_{\inf}(k_{n+1} - 1; q_s, n+1) - b_{\inf}(k_n - 1; q_s, n)]$$

Let $r = c_2/c_1$; then

$$\Delta E[T(n)] = d - c_1 g(n)$$

$$g(n) = r[b_{\inf}(k_{n+1} - 1; q_s, n+1) - b_{\inf}(k_n - 1; q_s, n)] - [b_{\inf}(k_{n+1} - 1; p_o, n+1) - b_{\inf}(k_n - 1; p_o, n)] \qquad (2.43)$$

Case 1. Assume $k_{n+1} = k_n + 1$. We have

$$g(n) = \binom{n}{k_n} p_o^{k_n} q_o^{n-k_n+1} \left[ r \left(\frac{q_o q_s}{p_o p_s}\right)^{k_n} \left(\frac{p_s}{q_o}\right)^{n+1} - 1 \right]$$

Recall that

$$\left(\frac{p_o p_s}{q_o q_s}\right)^{\beta} = r$$

then

$$\left(\frac{q_o q_s}{p_o p_s}\right)^{n\alpha+\beta} \left(\frac{p_s}{q_o}\right)^{n+1} = \frac{1}{r} \cdot \frac{p_s}{q_o} \qquad (2.44)$$

Since $n\alpha + \beta \le k_n \le (n+1)\alpha + \beta$, we obtain

$$r \left(\frac{q_o q_s}{p_o p_s}\right)^{k_n} \left(\frac{p_s}{q_o}\right)^{n+1} \le r \left(\frac{q_o q_s}{p_o p_s}\right)^{n\alpha+\beta} \left(\frac{p_s}{q_o}\right)^{n+1} = \frac{p_s}{q_o}$$

Thus

$$g(n) \le \binom{n}{k_n} p_o^{k_n} q_o^{n-k_n+1} \left(\frac{p_s}{q_o} - 1\right) = \binom{n}{k_n} p_o^{k_n} q_o^{n-k_n} (p_s - q_o)$$

From Lemma 1, and $n\alpha + \beta \le k_n \le (n+1)\alpha + \beta$, we obtain

$$g(n) \le (p_s - q_o) \left[2\pi n \frac{k_n}{n}\left(1 - \frac{k_n}{n}\right)\right]^{-1/2} \le (p_s - q_o) \left\{2\pi n \left(\alpha + \frac{\beta}{n}\right)\left[1 - \alpha\left(\frac{n+1}{n}\right) - \frac{\beta}{n}\right]\right\}^{-1/2}$$
$$\le (1 - q_o - q_s)\{2\pi[n\alpha(1 - \alpha) - \alpha(\alpha + \beta)]\}^{-1/2} \qquad (2.45)$$

Case 2. Similarly, if $k_{n+1} = k_n$, then from Equation 2.43 we have

$$g(n) = \binom{n}{k_n - 1} q_s^{k_n} p_s^{n-k_n+1} \left[\left(\frac{p_o p_s}{q_o q_s}\right)^{k_n} \left(\frac{q_o}{p_s}\right)^{n+1} - r\right]$$

Since $k_n = \lfloor n\alpha + \beta \rfloor \le n\alpha + \beta + 1$, and from Equation 2.44 and Lemma 1, we have

$$g(n) \le q_s \binom{n}{k_n - 1} q_s^{k_n-1} p_s^{n-k_n+1} \left[\left(\frac{p_o p_s}{q_o q_s}\right)^{n\alpha+\beta+1} \left(\frac{q_o}{p_s}\right)^{n+1} - r\right]$$
$$\le q_s \sqrt{\frac{n}{2\pi(k_n - 1)[n - (k_n - 1)]}} \left[\left(\frac{p_o p_s}{q_o q_s}\right)^{n\alpha+\beta} \left(\frac{q_o}{p_s}\right)^{n+1} \left(\frac{p_o p_s}{q_o q_s}\right) - r\right]$$
$$\le q_s \sqrt{\frac{n}{2\pi(k_n - 1)[n - (k_n - 1)]}} \left[r\left(\frac{q_o}{p_s}\right)\left(\frac{p_o p_s}{q_o q_s}\right) - r\right]$$
$$\le \sqrt{\frac{n}{2\pi(k_n - 1)[n - (k_n - 1)]}} \,(1 - q_o - q_s)$$

Note that $k_{n+1} = k_n$; then $n\alpha - (1 - \alpha - \beta) \le k_n - 1 \le n\alpha + \beta$. After simplification, we have

$$g(n) \le (1 - q_o - q_s)\left[2\pi n \left(\frac{k_n - 1}{n}\right)\left(1 - \frac{k_n - 1}{n}\right)\right]^{-1/2} \le (1 - q_o - q_s)\left[2\pi n \left(\alpha - \frac{1 - \alpha - \beta}{n}\right)\left(1 - \alpha - \frac{\beta}{n}\right)\right]^{-1/2}$$
$$\le \frac{1 - q_o - q_s}{\sqrt{2\pi[n\alpha(1 - \alpha) - (1 - \alpha)^2 - (1 - \alpha)\beta]}} \qquad (2.46)$$

From the inequalities in Equations 2.45 and 2.46, set

$$(1 - q_s - q_o)\frac{1}{\sqrt{2\pi[n\alpha(1 - \alpha) - \alpha(\alpha + \beta)]}} \le \frac{d}{c_1}$$

and

$$\frac{1 - q_o - q_s}{\sqrt{2\pi[n\alpha(1 - \alpha) - (1 - \alpha)^2 - (1 - \alpha)\beta]}} \le \frac{d}{c_1}$$

We obtain

$$(1 - q_o - q_s)^2 \left(\frac{c_1}{d}\right)^{2} \frac{1}{2\pi} \le \min\{n\alpha(1 - \alpha) - \alpha(\alpha + \beta),\; n\alpha(1 - \alpha) - (1 - \alpha)^2 - (1 - \alpha)\beta\}$$

and $\Delta E[T(n)] \ge 0$ when

$$n \ge \frac{(1 - q_o - q_s)^2 \left(\dfrac{c_1}{d}\right)^{2} \dfrac{1}{2\pi} + 1 + \beta}{\alpha(1 - \alpha)}$$

Hence

$$n^* \le \frac{(1 - q_o - q_s)^2 \left(\dfrac{c_1}{d}\right)^{2} \dfrac{1}{2\pi} + 1 + \beta}{\alpha(1 - \alpha)} \qquad \square$$

The result in Equation 2.42 provides an upper bound for the optimal system size.

2.7 Fault-tolerant Systems

In many critical applications of digital systems, fault tolerance has been an essential architectural attribute for achieving high reliability. It is universally accepted that computers cannot achieve the intended reliability in operating systems, application programs, control programs, or commercial systems, such as in the space shuttle, nuclear power plant control, etc., without employing redundancy. Several techniques can achieve fault tolerance using redundant hardware [12] or software [13]. Typical forms of redundant hardware structures for fault-tolerant systems are of two types: fault masking and standby. Masking redundancy is achieved by implementing the functions so that they are inherently error correcting, e.g. triple-modular redundancy (TMR), N-modular redundancy (NMR), and self-purging redundancy. In standby redundancy, spare units are switched into the system when working units break down. Mathur and De Sousa [12] have analyzed, in detail, hardware redundancy in the design of fault-tolerant digital systems. Redundant software structures for fault-tolerant systems based on acceptance tests have been proposed by Horning et al. [13].

This section presents a fault-tolerant architecture to increase the reliability of a special class of digital systems in communication [14]. In this system, a monitor and a switch are associated with each redundant unit. The switches and monitors can fail. The monitors have two failure modes: failure to accept a correct result, and failure to reject an incorrect result. The scheme can be used in communication systems to improve their reliability.

Consider a digital circuit module designed to process the incoming messages in a communication system. This module consists of two units: a converter to process the messages, and a monitor to analyze the messages for their accuracy. For example, the converter could be decoding or unpacking circuitry, whereas the monitor could be checker circuitry [12]. To guarantee a high reliability of operation at the receiver end, n converters are arranged in "parallel". All, except converter n, have a monitor to determine if the output of the converter is correct. If the output of a converter is not correct, the output is cancelled and a switch is changed so that the original input message is sent to the next converter. The architecture of such a system has been proposed by Pham and Upadhyaya [14]. Systems of this kind have useful applications in communication and network control systems and in the analysis of fault-tolerant software systems.

We assume that a switch is never connected to the next converter without a signal from the monitor, and the probability that it is connected when a signal arrives is $p^s$. We next present a general expression for the reliability of the system consisting of n non-identical converters arranged in "parallel". An optimization problem is formulated and solved for the minimum average system cost. Let us define the following notation, events, and assumptions.

The notation is as follows:

$p_i^c$  Pr{converter i works}
$p_i^s$  Pr{switch i is connected to converter (i + 1) when a signal arrives}
$p_i^{m1}$  Pr{monitor i works when converter i works} = Pr{not sending a signal to the switch when converter i works}
$p_i^{m2}$  Pr{monitor i works when converter i has failed} = Pr{sending a signal to the switch when converter i has failed}
$R_{n-k}^k$  reliability of the remaining system of size n − k given that the first k switches work
$R_n$  reliability of the system consisting of n converters.

The events are:

$C_i^w, C_i^f$  converter i works, fails
$M_i^w, M_i^f$  monitor i works, fails
$S_i^w, S_i^f$  switch i works, fails
$W$  system works.

The assumptions are:

1. the system, the switches, and the converters are two-state: good or failed;
2. the module (converter, monitor, or switch) states are mutually statistically independent;
3. the monitors have three states: good, failed in mode 1, failed in mode 2;
4. the modules are not identical.

2.7.1 Reliability Evaluation

The reliability of the system is defined as the probability of obtaining the correctly processed message at the output. To derive a general expression for the reliability of the system, we use an adapted form of the total probability theorem as translated into the language of reliability. Let A denote the event that the system performs as desired, and let $X^w$ and $X^f$ be the events that a component X (e.g. converter, monitor, or switch) is good or failed, respectively. Then

Pr{system works}
= Pr{system works when unit X is good} × Pr{unit X is good}
+ Pr{system works when unit X fails} × Pr{unit X is failed}


The above equation provides a convenient way of calculating the reliability of complex systems. Notice that $R_1 = p_1^c$, and for $n \ge 2$ the reliability of the system can be calculated as follows:

$$R_n = \Pr\{W \mid C_1^w \text{ and } M_1^w\}\Pr\{C_1^w \text{ and } M_1^w\} + \Pr\{W \mid C_1^w \text{ and } M_1^f\}\Pr\{C_1^w \text{ and } M_1^f\}$$
$$+ \Pr\{W \mid C_1^f \text{ and } M_1^w\}\Pr\{C_1^f \text{ and } M_1^w\} + \Pr\{W \mid C_1^f \text{ and } M_1^f\}\Pr\{C_1^f \text{ and } M_1^f\}$$

In order for the system to operate when the first converter works and the first monitor fails, the first switch must work and the remaining system of size $n - 1$ must work:

$$\Pr\{W \mid C_1^w \text{ and } M_1^f\} = p_1^s R_{n-1}^1$$

Similarly:

$$\Pr\{W \mid C_1^f \text{ and } M_1^w\} = p_1^s R_{n-1}^1$$

Then

$$R_n = p_1^c p_1^{m1} + [p_1^c(1 - p_1^{m1}) + (1 - p_1^c) p_1^{m2}] p_1^s R_{n-1}^1$$

The reliability of the system consisting of $n$ non-identical converters can be easily obtained:

$$R_n = \sum_{i=1}^{n-1} p_i^c p_i^{m1} \pi_{i-1} + \pi_{n-1} p_n^c \quad \text{for } n > 1 \qquad (2.47)$$

and

$$R_1 = p_1^c$$

where

$$\pi_k^j = \prod_{i=j}^{k} A_i \quad \text{for } k \ge 1, \qquad \pi_k = \pi_k^1 \text{ for all } k, \qquad \pi_0 = 1$$

and

$$A_i \equiv [p_i^c(1 - p_i^{m1}) + (1 - p_i^c) p_i^{m2}] p_i^s$$

for all $i = 1, 2, \ldots, n$. Assume that all the converters, monitors, and switches have the same reliability, that is:

$$p_i^c = p^c, \quad p_i^{m1} = p^{m1}, \quad p_i^{m2} = p^{m2}, \quad p_i^s = p^s$$

for all $i$; then we obtain a closed-form expression for the reliability of the system as follows:

$$R_n = \frac{p^c p^{m1}}{1 - A}(1 - A^{n-1}) + p^c A^{n-1} \qquad (2.48)$$

where

$$A = [p^c(1 - p^{m1}) + (1 - p^c) p^{m2}] p^s$$
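The sum in Equation 2.47 and the closed form of Equation 2.48 should agree for identical modules; a quick Python cross-check (our sketch):

```python
def A_of(pc, pm1, pm2, ps):
    return (pc * (1 - pm1) + (1 - pc) * pm2) * ps

def R_sum(n, pc, pm1, pm2, ps):
    A = A_of(pc, pm1, pm2, ps)
    # Equation 2.47 with identical modules: pi_{i-1} = A**(i-1)
    return pc if n == 1 else \
        sum(pc * pm1 * A**(i - 1) for i in range(1, n)) + A**(n - 1) * pc

def R_closed(n, pc, pm1, pm2, ps):
    A = A_of(pc, pm1, pm2, ps)
    # Equation 2.48
    return pc if n == 1 else \
        pc * pm1 * (1 - A**(n - 1)) / (1 - A) + pc * A**(n - 1)

mods = (0.8, 0.90, 0.95, 0.90)
print(R_sum(4, *mods), R_closed(4, *mods))   # both approximately 0.949
```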

2.7.2 Redundancy Optimization

Assume that the system failure costs $d$ units of revenue, and that each converter, monitor, and switch module costs $a$, $b$, and $c$ units respectively. Let $T_n$ be the system cost for a system of size $n$. The average system cost for size $n$, $E[T_n]$, is the cost incurred when the system has failed, plus the cost of all $n$ converters, $n - 1$ monitors, and $n - 1$ switches. Therefore:

$$E[T_n] = an + (b + c)(n - 1) + d(1 - R_n)$$

where $R_n$ is given in Equation 2.48. The minimum value of $E[T_n]$ is attained at

$$n^* = \begin{cases} 1 & \text{if } A \le 1 - p^{m1} \\ \lceil n_0 \rceil & \text{otherwise} \end{cases}$$

where

$$n_0 = \frac{\ln(a + b + c) - \ln[d p^c (A + p^{m1} - 1)]}{\ln A} + 1$$

Example 4. [14] Given a system with $p^c = 0.8$, $p^{m1} = 0.90$, $p^{m2} = 0.95$, $p^s = 0.90$, and $a = 2.5$, $b = 2.0$, $c = 1.5$, $d = 1200$. The optimal system size is $n^* = 4$, and the corresponding average cost (81.8) is minimized.
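Example 4 is straightforward to reproduce; the sketch below (ours) evaluates the optimization formula and confirms the optimum by enumerating $E[T_n]$:

```python
import math

pc, pm1, pm2, ps = 0.8, 0.90, 0.95, 0.90
a, b, c, d = 2.5, 2.0, 1.5, 1200
A = (pc * (1 - pm1) + (1 - pc) * pm2) * ps          # A = 0.243

def avg_cost(n):
    Rn = pc if n == 1 else \
        pc * pm1 * (1 - A**(n - 1)) / (1 - A) + pc * A**(n - 1)   # Eq. 2.48
    return a * n + (b + c) * (n - 1) + d * (1 - Rn)

n0 = (math.log(a + b + c) - math.log(d * pc * (A + pm1 - 1))) / math.log(A) + 1
n_star = 1 if A <= 1 - pm1 else math.ceil(n0)
print(round(n0, 2), n_star)                                   # 3.21 4
print(min((round(avg_cost(n), 1), n) for n in range(1, 10)))  # (81.8, 4)
```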

2.8 Weighted Systems with Three Failure Modes

In many applications, ranging from target detection to pattern recognition, including safety-monitoring protection, undersea communication, and human organization systems, a decision has to be made on whether or not to accept a hypothesis based on the given information, so that the probability of making a correct decision is maximized. In safety-monitoring protection systems, e.g. in a nuclear power plant where the system state is monitored by a multi-channel sensor system, various core groups of sensors monitor the status of neutron flux density, coolant temperature at the reaction core exit (outlet temperature), coolant temperature at the core entrance (inlet temperature), coolant flow rate, coolant level in the pressurizer, and the on–off status of coolant pumps. Hazard-preventive actions should be performed when an unsafe state is detected by the sensor system. Similarly, in the case of chlorination of a hydrocarbon gas in a gas-lined reactor, the possibility of an exothermic, runaway reaction occurs whenever the Cl2/hydrocarbon gas ratio is too high, in which case a detonation occurs, since a source of ignition is always present. Therefore, there are three unsafe phenomena: a high chlorine flow y1, a low hydrocarbon gas flow y2, and a high chlorine-to-gas ratio in the reactor y3. The chlorine flow must be shut off when an unsafe state is detected by the sensor system. In this application, each channel monitors a different phenomenon and has different failure probabilities in each mode; the outputs of each channel will have different weights in the decision (output). Similarly, in each channel there are a distinct number of sensors, and each sensor might have different capabilities depending upon its physical position. Therefore, each sensor in a particular channel might have different failure probabilities; thereby, each sensor will have different weights on the channel output. This application can be considered as a two-level weighted threshold voting protection system.

In undersea communication and decision-making systems, the system consists of n electronic sensors, each scanning for an underwater enemy target [16]. Some electronic sensors, however, might falsely detect a target when none is approaching. Therefore, it is important to determine a threshold level that maximizes the probability of making a correct decision.

All these applications have the following working principles in common. (1) System units make individual decisions; thereafter, the system as an entity makes a decision based on the information from the system units. (2) The individual decisions of the system units need not be consistent and can even be contradictory; for any system, rules must be made on how to incorporate all information into a final decision. System units and their outputs are, in general, subject to different errors, which in turn affects the reliability of the system decision.

This chapter has detailed the problem of optimizing the reliability of systems with two failure modes. Some interesting results concerning the behavior of the system reliability function have also been discussed, and several cost optimization problems have been presented. This chapter also gives a brief summary of recent studies in the reliability analysis of systems with three failure modes [17–19]. Pham [17] studied dynamic redundant systems with three failure modes, in which each unit is subject to stuck-at-0, stuck-at-1, and stuck-at-x failures and the system outcome is either good or failed. Focusing on the dynamic majority and k-out-of-n systems, Pham derived optimal design policies for maximizing the system reliability. Nordmann and Pham [18] have presented a simple algorithm to evaluate the reliability of weighted dynamic-threshold voting systems, and they recently presented [19] a general analytic method for evaluating the reliability of weighted-threshold voting systems. It is worth considering the reliability of weighted voting systems with time-dependency.

References

[1] Barlow RE, Hunter LC, Proschan F. Optimum redundancy when components are subject to two kinds of failure. J Soc Ind Appl Math 1963;11(1):64–73.
[2] Ben-Dov Y. Optimal reliability design of k-out-of-n systems subject to two kinds of failure. J Opt Res Soc 1980;31:743–8.
[3] Dhillon BS, Rayapati SN. A complex system reliability evaluation method. Reliab Eng 1986;16:163–77.
[4] Jenney BW, Sherwin DJ. Open and short circuit reliability of systems of identical items. IEEE Trans Reliab 1986;R-35:532–8.
[5] Pham H. Optimal designs of systems with competing failure modes. PhD Dissertation, State University of New York, Buffalo, February 1989 (unpublished).
[6] Malon DM. On a common error in open and short circuit reliability computation. IEEE Trans Reliab 1989;38:275–6.
[7] Page LB, Perry JE. Optimal series–parallel networks of 3-stage devices. IEEE Trans Reliab 1988;37:388–94.
[8] Pham H. Optimal design of systems subject to two kinds of failure. Proceedings Annual Reliability and Maintainability Symposium, 1990. p. 149–52.
[9] Pham H, Pham M. Optimal designs of {k, n − k + 1}-out-of-n:F systems (subject to 2 failure modes). IEEE Trans Reliab 1991;40:559–62.
[10] Sah RK, Stiglitz JE. Qualitative properties of profit-making k-out-of-n systems subject to two kinds of failures. IEEE Trans Reliab 1988;37:515–20.
[11] Pham H, Malon DM. Optimal designs of systems with competing failure modes. IEEE Trans Reliab 1994;43:251–4.
[12] Mathur FP, De Sousa PT. Reliability modeling and analysis of general modular redundant systems. IEEE Trans Reliab 1975;24:296–9.
[13] Horning JJ, Lauer HC, Melliar-Smith PM, Randell B. A program structure for error detection and recovery. Lecture Notes in Computer Science, vol. 16. Springer; 1974. p. 177–93.
[14] Pham H, Upadhyaya SJ. Reliability analysis of a class of fault-tolerant systems. IEEE Trans Reliab 1989;38:333–7.
[15] Pham H, editor. Fault-tolerant software systems: techniques and applications. Los Alamitos (CA): IEEE Computer Society Press; 1992.
[16] Pham H. Reliability analysis of digital communication systems with imperfect voters. Math Comput Model J 1997;26:103–12.
[17] Pham H. Reliability analysis of dynamic configurations of systems with three failure modes. Reliab Eng Syst Saf 1999;63:13–23.
[18] Nordmann L, Pham H. Weighted voting human-organization systems. IEEE Trans Syst Man Cybernet Pt A 1997;30(1):543–9.
[19] Nordmann L, Pham H. Weighted voting systems. IEEE Trans Reliab 1999;48:42–9.

Page 70: Handbook of Reliability Engineering - pdi2.org

Reliabilities of Consecutive-kSystems

Chap

ter3

Jen-Chun Chang and Frank K. Hwang

3.1 Introduction3.1.1 Background3.1.2 Notation3.2 Computation of Reliability3.2.1 The Recursive Equation Approach3.2.2 The Markov Chain Approach3.2.3 Asymptotic Analysis3.3 Invariant Consecutive Systems3.3.1 Invariant Consecutive-2 Systems3.3.2 Invariant Consecutive-k Systems3.3.3 Invariant Consecutive-k G System3.4 Component Importance and the Component Replacement Problem3.4.1 The Birnbaum Importance3.4.2 Partial Birnbaum Importance3.4.3 The Optimal Component Replacement3.5 The Weighted-consecutive-k-out-of-nSystem3.5.1 The Linear Weighted-consecutive-k-out-of-nSystem3.5.2 The Circular Weighted-consecutive-k-out-of-nSystem3.6 Window Systems3.6.1 The f -within-consecutive-k-out-of-nSystem3.6.2 The 2-within-consecutive-k-out-of-nSystem3.6.3 The b-fold-window System3.7 Network Systems3.7.1 The Linear Consecutive-2 Network System3.7.2 The Linear Consecutive-k Network System3.7.3 The Linear Consecutive-k Flow Network System3.8 Conclusion

3.1 Introduction

In this chapter we introduce the consecutive-ksystems, including the original well-knownconsecutive-k-out-of-n:F system. Two generalthemes are considered: computing the systemreliability and maximizing the system reliabilitythrough optimally assigning components topositions in the system.

3.1.1 Background

The consecutive-k-out-of-n:F system is a systemof n components arranged in a line suchthat the system fails if and only if somek consecutive components fail. It was firststudied by Kontoleon [1] under the name of r-successive-out-of-n:F system, but Kontoleon onlygave an enumerative reliability algorithm for

37

Page 71: Handbook of Reliability Engineering - pdi2.org

38 System Reliability and Optimization

the system. Chiang and Niu [2] motivated thestudy of the system by some real applications.They proposed the current name “consecutive-k-out-of-n:F system” for the system and gavean efficient reliability algorithm by recursiveequations. From then on, the system became moreand more popular.

There are many variations and generalizationsof the system, such as circular consecutive-k-out-of-n:F systems [3], weighted-consecutive-k-out-of-n:F systems [4, 5], f -or-consecutive-k-out-of-n:F systems [6, 7], f -within-consecutive-k-out-of-n:F systems [8, 9], consecutive-k-out-of-n:F networks [10, 11], consecutive-k-out-of-n:Fflow networks [8], and consecutive-k − r-out-of-n:DFM systems [12], etc. Reliability analysis forthese systems has been widely studied in recentyears.

There are other variations called consecutive-k:G systems, which are defined by exchangingthe role of working and failed components inthe consecutive-k systems. For example, theconsecutive-k-out-of-n:G system works if andonly if some k consecutive components allwork. The reliability of the G system is simplythe unreliability of the F system computedby switching the component reliabilities andthe unreliabilities. However, to maximize thereliability of a G system is not equivalent tominimize the reliability of an F system.

Two basic assumptions for the consecutive-ksystems are described as follows.

1. Binary system. The components and thesystem all have two states: working or failed.

2. IND model. The states of the componentsare independent. In addition, if the workingprobabilities of all components are identical,we call it the IID model.

We consider two general problems in this chap-ter: computing the system reliability and maxi-mizing the system reliability through optimallyassigning components to positions in the system.

Two general approaches are used to computethe reliability for the IND probability models:one is the recursive equation approach, and theother is the Markov chain approach. The recursive

equation approach was first pursued by Chiangand Niu [2] and Derman et al. [3]. The Markovchain approach was first proposed by Griffith andGovindarajula [13], Griffith [14], Chao and Lin[15], and perfected by Fu [16].

The general reliability optimization problem isto allocate m≥ n components with non-identicalreliabilities to the n positions in a system, where aposition can be allocated one or more componentsto maximize the system reliability (it is assumedthat the components are functionally equivalent).Two different versions of this general problem are:

1. Sequential. Components are allocated one ata time. Once a component is placed into aposition, its state is known.

2. Non-adaptive. Components are allocated si-multaneously.

Since the general optimization problem is verydifficult, in most cases we do not have a globallyoptimal algorithm. Sometimes heuristics are used.The special case which we study here is m= n

under the IND model.

3.1.2 Notation

The following notation is used in this chapter.

Pr(e) probability of event ePr(e1 | e2) probability of event e1 under

the condition that event e2occurs

E(X) expectation of a randomvariable X

E(X | e) expectation of a randomvariable X under the conditionthat event e occurs

Trace(M) trace of a matrix Mn number of components in the

systempi probability that component i

functionsqi 1− pi

L(1, n) or L consecutive-k-out-of-n linearsystem, where k and all pi areassumed understood

Page 72: Handbook of Reliability Engineering - pdi2.org

Reliabilities of Consecutive-k Systems 39

C(1, n) or C circular consecutive-k-out-of-nlinear system, where k and allpi are assumed understood

Rx(1, n) or Rx reliability of the system x(1, n)where x ∈ {L, C}, k and all pi

are assumed understoodRx(1, n) or Rx 1− Rx(1, n), unreliability of

the system x(1, n)Rx(n) Rx(1, n) when all pi = p and

all qi = q

Rx(n) 1− Rx(n)

si working state of component isi failure state of component ix(i, j) x system containing

components i, i + 1, . . . , jRx(i, j) reliability of the subsystem x(i, j)

Rx(i, j) 1− Rx(i, j), unreliability ofthe subsystem x(i, j)

Sx(i, j) set of all component statesincluding a working x(i, j)

Sx(i, j) set of all component statesincluding a failed x(i, j)

Nx(d, n, k) number of ways that n nodes,including exactly d failednodes, ordered on a line (ifx = L) or a cycle (if x = C)contain no k consecutive failednodes.

3.2 Computation of Reliability

We introduce various approaches to computethe reliability of the consecutive-k-out-of-n:Fsystem and its circular version. In addition, wealso compare the time complexities of theseapproaches.

3.2.1 The Recursive EquationApproach

The first solution of the reliability of the lin-ear consecutive-k-out-of-n:F system was given by

Chiang and Niu [2] (for the IID model). The so-lution is in the form of recursive equations. Fromthen on, many other recursive equations have beenproposed to improve the efficiency of the relia-bility computation. We only present the fastestones.

Shanthikumar [17] and Hwang [18] gave thefollowing recursive equation for the reliability ofthe linear consecutive-k-out-of-n:F system.

Theorem 1.

RL(1, n)= RL(1, n− 1)

− RL(1, n− k − 1)pn−kk∏

j=1

qn−k+j

This equation holds for any n≥ k + 1.Since RL(1, n) can be computed after RL(1, 1),RL(1, 2), . . . , RL(1, n− 1) are all computed inthat order, the system reliability RL(1, n) can becomputed in O(n) time.

The solution for the reliability of the circularconsecutive-k-out-of-n:F system was first given byDerman et al. [3]. Note that the indices are takenmodulo n for circular systems.

Theorem 2.

RC(1, n)=n∑

1≤i≤kn−k+i≤j≤n

pj

( n+i−1∏h=j+1

qh

)

× piRL(i + 1, j − 1)

Hwang [18] observed that there are O(k2) RL

terms in the right-hand side of the equation,each needing O(n) time to compute. Therefore,Hwang announced that RC(1, n) can be computedin O(k2n) time. Wu and Chen [19] observedthat RL(i, i), RL(i, i + 1), . . . , RL(i, n) can becomputed together in O(n) time. Hence, theyclaimed that RC(1, n) can be computed in O(kn)

time.Wu and Chen [20] also showed a trick of using

fictitious components to make the O(kn) timemore explicit. Let LW(1, n+ i) be a linear systemwith p1 = p2 = · · · = pi = pn+1 = pn+2 = · · · =pn+i = 0. They gave the following recursiveequation.

Page 73: Handbook of Reliability Engineering - pdi2.org

40 System Reliability and Optimization

Theorem 3.

RC(1, n)= RL(1, n)−k−1∑i=1

i∏j=1

qj

× [RLW (1, n+ i − 1)− RLW (1, n+ i)]

Because there are only O(kn) terms of RL,RC(1, n) can be computed in O(kn) time.

In addition, Antonopoulou and Papastavridis[21] gave the following recursive equation.

Theorem 4.

RC(1, n)= pnRL(1, n− 1)+ qnRC(1, n− 1)

−k∑

i=1

pn−k+i−1

( n+i−1∏j=n−k+i

qj

)× piRL(i + 1, n− k + i − 2)

Antonopoulou and Papastavridis claimed thatthis is an O(kn) algorithm, but Wu and Chen[20] found that the computational complexity isO(kn2) rather than O(kn). Later, Hwang [22]gave a different, more efficient implementation ofthe same recursive equation, which is an O(kn)

algorithm.

3.2.2 The Markov Chain Approach

The Markov chain approach was first introducedinto the reliability study by Griffith [14] andGriffith and Govindarajula [13], but they didnot give an efficient algorithm. The first efficientalgorithm based on the Markov chain approachwas given by Chao and Lin [15].

We briefly describe the Markov chain approachas follows. Chao and Lin [15] defined Sv as theaggregate of states of nodes v − k + 1, v − k +2, . . . , v. Hence {Sv} forms a Markov chain, andSv+1 depends only on Sv . Fu [16] lowered thenumber of states by defining Sv to be in state i,i ∈ {0, 1, . . . , k} if the last i nodes including nodev are all failed. Note that all states (except thefailed state k) are working states. The followingtransition probability matrix for Sv was given by

Fu and Hu [23].

v =

pv qvpv qv...

. . .

pv qv0 1

Let π0 and U0 denote the 1× (k + 1) row vector{1, 0, . . . , 0} and the (k + 1)× 1 column vector ofall 1 except the last bit is 0, respectively. The systemreliability of the linear consecutive-k-out-of-n:Fsystem can be computed as

RL(1, n)= π0

( n∏v=1

v

)U0

Fu and Hu did not give any time complexityanalysis. Koutras [24] restated this method andclaimed that RL(1, n) can be computed in O(kn)

time.For the circular system, the Markov chain

method does not work, since a Markov chain musthave a starting point. Hwang and Wright [25]gave a different Markov chain method (called thetransfer matrix method) that works well for thecircular system. For simplicity, assume k divides n.The n nodes are divided into n/k sets of k

each. The ith set consists of nodes (i − 1)k + 1,(i − 1)k + 2, . . . , ik. A state of a set is simply avector of states of its k elements. Let Si = {Siu}be the state space of set i. Then |Si | = 2k. Let Mi

denote the matrix whose rows are indexed by theelements of Si , and columns are indexed by theelements of Si+1. Cell (u, v)= 1 if Siu ∪ S(i+1)vdoes not contain k consecutive failed nodes, and 0otherwise. For the circular system, let Sn/k+1 =S1. The number of working cases out of the 2n totalcases is

Trace

( n/k∏i=1

Mi

)Hwang and Wright [25] described a novel way

to compute the reliability by substituting eachentry 1 in cell (u, v) of Mi with√

Pr(Siu) Pr(S(i+1)v)

since each Siu appears once in Mi and once inMi+1. This method can also be applied to linear

Page 74: Handbook of Reliability Engineering - pdi2.org

Reliabilities of Consecutive-k Systems 41

systems. For a linear system, the correspondingterm is then

n/k−1∏i=1

Mi

and M1(u, v) = 1 needs to be replaced withPr(S1u)

√Pr(S2v) and Mn/k−1(u, v)= 1 with√

Pr(Sn/k−1,v) Pr(Sn/k,u). In this approach,there are O(n/k) matrix multiplications whereeach takes O(23k) time. Therefore, the totalcomputational complexity is O(23kn/k). Hwangand Wright [25] also found a clever way to speedup the multiplication time from O(23k) to O(k6).For a working state Siu, let l denote the index ofthe first working node and r (≥l) denote the last.For each Si , delete the unique failed state andregroup the working states according to (l, r).The size of |Si | is lowered down to O(k2), whilethe ability to determine whether Siu ∪ S(i+1)vcontains k consecutive failed nodes is preserved.Note that

Pr(Si,(l,r))=( l−1∏

j=1

q(i−1)k+j)p(i−1)k+lp(i−1)k+r

×( k−1∏

j=1

q(i−1)k+r+j)

but without p(i−1)k+r if l = r . The total computa-tional complexity for the reliability is O(k5n).

3.2.3 Asymptotic Analysis

Chao and Lin [15] conjectured, and Fu [26],proved the following result.

Theorem 5. For any integer k ≥ 1, if the compo-nent reliability pn = 1− λn−1/k (λ > 0) holds forIID model, then

limn→∞ RL(n)= exp{−λk}

Chao and Fu [27] embedded the consecutive-k-out-of-n line into a finite Markov chain{X0, X1, . . . , Xn} with states {1, . . . , k + 1} and

transition probability matrix (blanks are zeros)

n(t)=

pt qtpt qt...

. . .

pt qt0 1

(k+1)×(k+1)

Let π0 = (π1, . . . , πk+1) be the initial probabilityvector. Then

RL(n)= π0

n∏t=1

n(t)U0

where U0 is the (k + 1)× 1 column vector(1, . . . , 1, 0)T.

Define

b(t, n)= Pr{Xt = k + 1 | Xt−1 ≤ k}and

λn =∞∑j=1

1

j

n∑t=1

bj (t, n)

Chao and Fu proved Theorem 5 under a moregeneral condition.

Theorem 6. Suppose π0U0 = 1 and limn→∞ λn =λ > 0. Then

limn→∞ RL(n, k)= exp{−λk}

3.3 Invariant ConsecutiveSystemsGiven n components with reliabilities p1, p2, . . . ,

pn, then a permutation of the n reliabilitiesdefines a consecutive-k system and its reliability.An optimal system is one with the largestreliability. An optimal system is called invariant ifit depends only on the ranks, but not the actualvalues, of pi .

3.3.1 Invariant Consecutive-2 Systems

Derman et al. [3] proposed two optimizationproblems for linear consecutive-2 systems andsolved one of them. Let the reliabilities of the

Page 75: Handbook of Reliability Engineering - pdi2.org

42 System Reliability and Optimization

n components be p1, p2, . . . , pn. Assume p[i]denotes the ith smallest pj . In the sequentialassignment problem, components are assignedone at a time to the system, and the state ofthe component is determined as soon as it isconnected to the system. In the non-adaptiveassignment problem, the system is constructed allat once.

For the sequential assignment problem,Derman et al. [3] proposed the followingassignment rule: Assign the least reliablecomponent first. Afterwards, if the last assignedcomponent works, then assign the least reliablecomponent in the remaining pool next; otherwise,if the last assigned component is failed, thenassign the most reliable component in theremaining pool next. This rule is called DLR,which is optimal.

Theorem 7. The assignment rule DLR is optimal.

For the non-adaptive assignment problem,Derman et al. [3] conjectured the optimalsequence is

Ln = (p[1], p[n], p[3], p[n−2],. . . , p[n−3], p[4], p[n−1], p[2])

where p[i] denotes the ith smallest ofp1, p2, . . . , pn. This conjecture was extended tothe circular system by Hwang [18] as follows:

Cn = (p[1], p[n−1], p[3], p[n−3],. . . , p[n−4], p[4], p[n−2], p[2], p[n])

Since any consecutive-k-out-of-n linear systemcan be formulated as a consecutive-k-out-of-ncircular system by setting pn+1 = 1, the lineconjecture holds if the cycle conjecture holds.Du and Hwang [28] proved the cycle conjecture,while Malon [29] claimed a proof of the lineconjecture (see [30] for comment). Later, Changet al. [30] gave a proof for the cycle conjecturethat combines Malon’s technique for the lineconjecture and Du and Hwang’s technique forthe cycle conjecture, but is simpler than both.These results can be summarized as follows.

Theorem 8. Ln is the unique optimal consecutive-2 line.

Theorem 9. Cn is the unique optimal consecutive-2 cycle.

Note that the optimal dynamic consecutive-2 line, the optimal static consecutive-2 line, andcycle are all invariant. One would expect that theremaining case, the optimal dynamic consecutive-2 cycle, is also invariant. However, Hwang andPai [31] recently gave a negative result.

3.3.2 Invariant Consecutive-k Systems

The invariant consecutive-k line problem iscompletely solved by Malon [29]. The result isquoted as follows.

Theorem 10. There exist invariant consecutive-k lines if and only if k ∈ {1, 2, n− 2, n− 1, n}.The invariant lines are given below:

(any arrangement) if k = 1;(p[1], p[n], p[3], p[n−2],. . . , p[n−3], p[4], p[n−1], p[2]) if k = 2;

(p[1], p[4], (any arrangement),p[3], p[2]) if k = n− 2;(p[1], (any arrangement), p[2]) if k = n− 1;(any arrangement) if k = n.

Hwang [32] extended Malon’s result to the cycle.

Theorem 11. There exist invariant consecutive-kcycles if and only if k ∈ {1, 2, n− 2, n− 1, n}.For k ∈ {1, n− 1, n}, any cycle is optimal. For k ∈{2, n− 2}, the unique invariant cycle is Cn.

Sometimes there is no universal invariantsystem, but there are some local invariantassignments. Such knowledge can reduce thenumber of systems we need to search as candidatesof optimal systems. Tong [33] gave the followingresult for interchanging two adjacent components.

Theorem 12. Assume that (n− 1)/2≤ k ≤ n− 2and p1 > p2. Let C′ be obtained from C =(p1, p2, . . . , pn) by interchanging p1 and p2.Then R(C′)≥ R(C) if and only if( k+1∏

i=3

qi

)pk+2 − pn−k+1

( n∏i=n−k+2

qi

)≤ 0

Page 76: Handbook of Reliability Engineering - pdi2.org

Reliabilities of Consecutive-k Systems 43

For n < 2k, the following theorem was inde-pendently given by Tong [34] and Malon [29]. Kuoet al. [35] observed that it holds in general.

Theorem 13. In an optimal consecutive-k line,p1 ≤ p2 ≤ · · · ≤ pn.

3.3.3 Invariant Consecutive-kG System

A consecutive-k G system is the counterpartof a consecutive-k (F) system by interchangingthe notions of working and failed components.That is, a G system works if and only if some k

consecutive components all work. The G systemwas first suggested by Tong [34]. For reliabilitycomputation, formulas for the F systems also workwell for the G systems by interchanging pi and qifor all i, and interchanging R with R.

However, the G system brought out new prob-lems in reliability optimization and in componentimportance. The main reason is that maximizingthe reliability of the G system is equivalent tominimizing the reliability of the F system, whichhas not been studied before. Zuo and Kuo [36]gave the following result.

Theorem 14. There does not exist an invariantconsecutive-k-out-of-n G line for 2≤ k < n/2.

This result can be extended to the circular system.That is:

Theorem 15. There does not exist an invari-ant consecutive-k-out-of-n G cycle for 2≤ k <

(n− 1)/2.

Zuo and Kuo [36] first observed the followingresult.

Theorem 16. In an optimal assignment of aG system, p1 ≤ p2 ≤ · · · ≤ pk and pn ≤ pn−1 ≤· · · ≤ pn−k+1.

For the G line, the case of n≤ 2k was studied byKuo et al. [35], and the case n≤ 2k + 1 for the Gcycle was studied by Zuo and Kuo [36]. But bothproofs are incomplete. Recently, Jalali et al. [37]proposed the following lemma.

Lemma 1. In an optimal assignment of a Gsystem, p1 = p[1], pk = p[n−1], pk+1 = p[n] andpn = p[2].

With Lemma 1, Jalali et al. [37] proved thefollowing theorem for the G line.

Theorem 17. The unique invariant consecutive-k-out-of-n G line for n= 2k is α = (p[1], p[3],p[5], . . . , p[6], p[4], p[2]).

This result can be extended to the n < 2k case.

Theorem 18. For n < 2k, the invariantconsecutive-k-out-of-n G line is

(p[1], p[3], p[5], . . . , B, . . . , p[6], p[4], p[2]),

where B is a center block of 2k − n largestreliabilities in any arrangement.

Du et al. [38] used a completely differentmethod to prove the more general cycle case.

Theorem 19. For n≤ 2k + 1, the unique invariantconsecutive-k-out-of-n G cycle is

(p[1], p[3], p[5], . . . , p[6], p[4], p[2]),

where the two ends are considered adjacent.

3.4 Component Importanceand the ComponentReplacement Problem

Component importance measures the relativeimportance of a component, or sometimes theposition of a component in the system, withrespect to system reliability. The importance indexcan be used to assist the allocation of redundantcomponents or replacement by giving priority tothe more important positions. The most popularimportance index is the Birnbaum importance,but very few results have been obtained. Weakerversions have been introduced to obtain moreresults.

Page 77: Handbook of Reliability Engineering - pdi2.org

44 System Reliability and Optimization

3.4.1 The Birnbaum Importance

The Birnbaum (reliability) importance of compo-nent i in system x ∈ {L, C} is defined as

Ix(i)= ∂Rx(p1, p2, . . . , pn)

∂pi

It is the rate at which the system reliabilitygrows when the reliability of component i grows.When necessary, we use Ix(i, n) to denoteIx(i) with n fixed. Independently, Griffith andGovindarajulu [13] and Papastavridis [39] firststudied the Birnbaum importance for consecutivelines. Their results can be quoted as follows.

Theorem 20.

IL(i)= [RL(1, i − 1)RL(i + 1, n)− RL(1, n)]/qiand

IC(i)= [RL(pi+1, pi+2, . . . , pi−1)

− RC(1, n)]/qi.It can be shown that IL(i) observes the same

recursive equations as observed by L(1, n). Thussimilar to Theorem 1, the following theoremholds.

Theorem 21. For an IID model

1. IL(i, n)= IL(i, n− 1)− pqkIL(i, n− k − 1)if n− i ≥ k + 1

2. IL(i, n)= IL(i−1, n−1)− pqkIL(i − k − 1,n− k − 1) if i − 1≥ k + 1.

The comparison of Birnbaum importanceunder the IND model is valid only for theunderlying set (p1, p2, . . . , pn). On the otherhand, the IID model is the suitable one if the focusis on comparing the positions by neutralizing thedifferences in pi . Even for the IID model, Hwanget al. [40] showed that the comparison is notindependent of p.

The following theorem is given by Chang et al.[30].

Theorem 22. Consider the consecutive-2 line un-der the IID model. Then

IL(2i) > IL(2i − 1) for 2i ≤ (n+ 1)/2

IL(2i) < IL(2i − 2) for 2i ≤ (n+ 1)/2

IL(2i + 1) > IL(2i − 1) for 2i + 1≤ (n+ 1)/2

For general k, not much is known. Kuo et al.[35] first observed the following result.

Theorem 23. For the consecutive-k line under theIID model

IL(1) < IL(2) < · · ·< IL(k) if n≥ 2k

IL(n− k + 1)= IL(n− k + 2)

= · · · = IL(k) if n < 2k

The following theorem was proved (partially)by Zuo [41] and (partially) by Zakaria et al. [42].

Theorem 24. Consider the consecutive-k line un-der the IID model. Then IL(1)≤ IL(i) for all i ≤n/2 and IL(k) > IL(k + 1) for n > 2k.

Chang et al. [43, 44] gave a method to compareIL(i) with IL(i + 1), and derived the followingresults.

Theorem 25.

IL(2k + 1) < IL(2k), IL(k + 1) < IL(k + 2)

and

IL(2k − 1, 4k − 1) < IL(2k, 4k − 1).

Recently, Chang et al. [45] extended the resultsin Theorem 25 as follows:

Theorem 26.

(i) IL((t − 2)k − 1, tk − 1) < IL[(t − 2)k,tk − 1] for t ≥ 3

(ii) IL(3k + 1, 6k + 1) < IL(3k, 6k + 1).

Hwang [46] defined a new importance mea-sure. In Hwang’s definition, component i is saidto be more important than component j , writ-ten as H(i) > H(j), if for every d = k, k + 1, k +2, . . . , n, |CSi,d |, the number of d-cutsets con-taining i is never fewer than |CSj,d |. Hwang [46]proved thatH more importance implies Birnbaummore importance. He gave the following theorem.

Theorem 27. H(i)≥H(j) implies IL(i)≥ IL(j)

under the IID model for all p.

Note that one cannot use computation to proveIL(i)≥ IL(j) for all p since there is an infinitenumber of them. But for any finite system, we

Page 78: Handbook of Reliability Engineering - pdi2.org

Reliabilities of Consecutive-k Systems 45

can verify H(i)≥H(j) since d is bounded by n.Once H(i)≥H(j) is verified, then the previouslyimpossible-to-verify relation IL(i)≥ IL(j) is alsoverified. Chang et al. [43] also proved that:

Theorem 28. H(k)≥H(i) for all i ≤ (n+ 1)/2.

3.4.2 Partial Birnbaum Importance

Chang et al. [45] proposed the half-line impor-tance I h, which requires I h(i) > I h(j) only forall p ≥ 1/2 for a comparison. They justified thishalf-line condition by noting that, in most practi-cal cases, p ≥ 1/2. For the consecutive-k-out-of-nline, they were able to establish:

Theorem 29.

I h(1) < I h(2) < · · ·< I h(k − 1)

< I h(k + 1) < I h(i) < I h(2k) < I h(k)

for all i > k + 1 and i �= 2k.

The Birnbaum importance for the special casep = 1/2 is known as the “structure importance”in the literature. Chang et al. [30] suggestedcalling it the “combinatorial importance” so thatthe term “structure importance” can be reservedfor general use (there are other importanceindices depending on structure only). Denote thecombinatorial importance by IC(i); Lin et al.[47] found an interesting correspondence betweenIC(i) and fk,n, the Fibonacci numbers of order k,which is defined by

fk,n =

0 if 1≤ n≤ k − 1

1 if n= kk∑

i=1

fk,n−i if n≥ k + 1

They proved:

Theorem 30. For p = 1/2,

RL(n)= (1/2)nfk,n+k+1.

Thus fk,n+k+1 can be interpreted as thenumber of working consecutive-k-out-of-n lines.

Theorem 31.

IC(i)= (1/2)n−1(2fk,i+kfk,n−i+k−1 − fk,n+k+1).

Chang and Hwang [48] considered the case thatp tends to zero. The importance index, denoted byIR(i), actually measures the number of minimumpathsets (a subset of components whose collectivesuccesses induce a system success) containingcomponent i. Since, in practice, p is not likelyto approach zero, IR(i) is not of interest per se.However, it could be a useful tool for thecomparison of Birnbaum importance. While itis not easy to establish I (i) > I (j), sometimesit is also difficult to establish the falsity ofit. By proving IR(i) < IR(j), we automaticallyestablish the above falsity. Further, if we haveproved I h(i) > I h(j), then proving IR(i)≥ IR(j)

would add a lot of credibility to the conjecturethat I (i) > I (j) since IR(i)≥ IR(j) providesevidence from the other end of the p spectrum.

Represent n as n= qk + r with 0≤ r < k.Then q is the minimum number of workingcomponents for a pathset to exist. Let psq(k, n)denote the number of pathsets with q workingcomponents and let psi,q (k, n) the number ofthose containing component i. Chang and Hwang[48] proved:

Theorem 32.

psq(k, n)=(q + k − r − 1k − r − 1

)Represent i as i = uk + v with 0 < v ≤ k:

Theorem 33.

psi,q (k, n)=(u+ k − v

k − v

) (q − u+ v − r − 2

v − r − 1

)Theorem 34. IR(uk + v) ≤ IR[(u+ 1)k + v] for1≤ v ≤ (k + 1)/2 and (u+ 1)k + v ≤ (n+ 1)/2.

IR(uk + 1) < IR(uk) for uk + 1≤ (n+ 1)/2

IR(uk + 1)≤ IR(j) for uk + 1≤ j ≤ (n+ 1/2)

3.4.3 The Optimal ComponentReplacement

Consider the problem: “When a new extracomponent is given to replace a component in

Page 79: Handbook of Reliability Engineering - pdi2.org

46 System Reliability and Optimization

a linear consecutive-k-out-of-n:F system in orderto raise the system reliability, which componentshould be replaced such that the resulting systemreliability is maximized?” When a component isreplaced, the change of system reliability is notonly dependent on the working probabilities ofthe removed component and the new component,but also on the working probabilities of all othercomponents. A straightforward algorithm is firstre-computing the resulting system reliabilities ofall possible replacements and then selecting thebest position to replace a component. Even usingthe most efficient O(n) reliability algorithmfor linear consecutive-k-out-of-n:F systems, thecomputational complexity of the straightforwardcomponent replacement algorithm is O(n2).

Chang et al. [49] proposed an O(n)-time al-gorithm for the component replacement problembased on the Birnbaum importance. They firstobserved the following results.

Lemma 2. IL(i) is independent of pi .

Let the reliability of the new extra componentbe p∗. Chang et al. [49] derived that:

Theorem 35.

RL(p1, . . . , pi−1, p∗, pi+1, . . . , pn)

− RL(p1, . . . , pi−1, pi , pi+1, . . . , pn)

= IL(i)(p∗ − pi)

They provided an algorithm to find the optimallocation where the component should be replaced.The algorithm is quoted as follows.

Algorithm 1. (Linear component replacementalgorithm)

1. Compute RL(1, 1), RL(1, 2), . . . , RL(1, n)in O(n) time.

2. Compute RL(n, n), RL(n− 1, n), . . . , RL

(2, n) in O(n) time.3. Compute IL(i) for i = 1, 2, . . . , n, with the

equation given in Theorem 14.4. Compute IL(i)(p∗ − pi) for i = 1, 2, . . . , n.5. Choose the i in step 4 with the largest

IL(i)(p∗ − pi) value. Then replace com-

ponent i with the new extra component.

The reliability of the resulting system isIL(i)(p

∗ − pi)+ RL(1, n).

In Algorithm 1, each of the five steps takes at mostO(n) time. Therefore, the total computationalcomplexity is O(n).

Consider the component replacement problemfor the circular system. As with the linear case, astraightforward algorithm can be designed as firstre-computing the resulting system reliabilities ofall possible replacements and then selecting thebest position to replace a component. However,even using the most efficient O(kn) reliabilityalgorithm for the circular systems, the computa-tional complexity of the straightforward algorithmis still O(kn2). Chang et al. [49] also proposed asimilar algorithm for the circular component re-placement problem. The computational complex-ity is O(n2).

If the circular consecutive-k-out-of-n:F systemcontains some components with zero workingprobabilities such that the whole system reliabilityis zero, then the computational complexity of thecircular component replacement algorithm can befurther improved. The following theorem for thespecial case was given by Chang et al. [49].

Theorem 36. If RC(1, n)= 0 and there is an i

in {1, 2, . . . , n} such that IC(i) > 0, then thefollowing three conditions must be satisfied.

1. C(1, n) has just one run of at least k

consecutive components, where the workingprobability of each component is 0.

2. The run that mentioned in condition 1contains fewer than 2k components.

3. If the run that mentioned in condition 1contains components 1, 2, . . . , m, where k ≤m< 2k, then:

IC(i) > 0

for all i ∈ {m− k + 1, m− k + 2, . . . , k},IC(i)= 0

for all i /∈ {m− k + 1, m− k + 2, . . . , k}.Based on Theorem 36, the circular compo-

nent replacement algorithm can be modified asfollows.

Page 80: Handbook of Reliability Engineering - pdi2.org

Reliabilities of Consecutive-k Systems 47

Algorithm 2. (Modified circular component re-placement algorithm for RC(1, n)= 0)

1. Find the largest run of consecutive compo-nents consisting of components with 0 work-ing probabilities 0. Then re-index the compo-nents such that this run contains components1, 2, . . . , m (m≥ k).

2. If m≥ 2k, then any replacement is optimal.STOP.

3. Compute RL(i + 1, n+ i − 1) for i ∈ {m−k + 1, m− k + 2, . . . , k}.

4. Choose the i in step 3 with the largest RL(i +1, n+ i − 1) value. Then replace component iwith the new extra component. The reliabilityof the resulting system is RL(i + 1, n+ i −1)p∗. STOP.

In the modified algorithm, step 1 takes O(n)

time, step 2 takes O(1) time, step 3 takes at mostO(kn) time, and step 4 takes at most O(n) time.The total computational complexity is thus O(kn).

3.5 The Weighted-consecutive-k-out-of-n SystemThe weighted-consecutive-k-out-of-n F systemwas first proposed by Wu and Chen [5].A weighted-consecutive-k-out-of-n F systemconsists of n components; each component has itsown working probability and a positive integerweight such that the whole system fails if andonly if the total weight of some consecutivefailed components is at least k. The ordinaryconsecutive-k-out-of-n F system is a special casewith all weights set to 1. Similar to the originalconsecutive-k-out-of-n F system, this weightedsystem also has two types: the linear and thecircular.

3.5.1 The Linear Weighted-consecutive-k-out-of-n System

For the linear weighted-consecutive-k-out-of-nsystem, Wu and Chen [5] gave an O(n)-timealgorithm to compute the set of minimal cutsets.

The algorithm scans the system from node 1 andadds the weights one by one of nodes 1, 2, . . .until a sum K ≥ k is found, say at node j .Then subtract from K the weights one by oneof nodes 1, 2, . . . until a further subtractionwould reduce K to below k. Suppose i is thenew beginning node. Then (i, i + 1, . . . , j ) isthe first minimal cutset. Repeating this procedureby starting from node i + 1, we can find otherminimal cutsets.

Once all minimal cutsets are found, the systemreliability can be computed easily. Wu and Chen[5] gave the following results, where m is the totalnumber of cutsets, F(1, n)= 1− R(1, n), Beg(i)is the index of the first component of ith cutset,and End(i) is the index of the last component ofith cutset.

Theorem 37. In a linear weighted-consecutive-k-out-of-n system:

1. F(1, i)= 0, for i = 0, 1, . . . , End(1)− 1;2. F(1, i) = F(1, End(j)), for i = End(j),

End(j)+ 1, . . . , End(j + 1)− 1 and j = 1,2, . . . , m− 1;

3. F(1, i) = F(1, End(m)), for i = End(m),

End(m)+ 1, . . . , n;4. F(1, End(j))− FLW(1, End(j − 1))

=Beg(j)−Beg(j−1)−1∑

i=0

R[1, Beg(j − 1)+ i − 1]

× pBeg(j−1)+i( End(j)∏

t=Beg(j−1)+i−1

qt

)for j = 2, 3, . . . , m

Based on the recursive equations given inTheorem 37, the system reliability can be com-puted in O(n) time.

3.5.2 The Circular Weighted-consecutive-k-out-of-n System

For the circular weighted-consecutive-k-out-of-nsystem, Wu and Chen [5] also gave an O(n ·min{n, k})-time reliability algorithm. They con-sider the system in two cases: n≥ k and n < k, and

Page 81: Handbook of Reliability Engineering - pdi2.org

48 System Reliability and Optimization

proposed the following equations, where wi is theweight of component i.

Theorem 38. In a circular weighted-consecutive-k-out-of-n system where n≥ k

R(1, n)=k∑

s=1

n∑l=n−k+s

δ(s, l)

where

δ(s, l)

=

( s−1∏i=1

q1

)psR(s + 1, l − 1)pl

n∏i=l+1

qi

ifs−1∑i=1

wi +n∑

j=l+1

wj < k

0 otherwise

Theorem 39. In a circular weighted-consecutive-k-out-of-n system where n < k

R(1, n)=n−2∑s=1

n∑l=s+2

δ(s, l)

where

δ(s, l)

=

( s−1∏i=1

qi

)psR(s + 1, l − 1)pl

n∏i=l+1

qi

ifs−1∑i=1

w1 +n∑

j=l+1

wj < k

0 otherwise

By combining the equations in both cases, Wuand Chen [5] claimed that the reliability of thecircular weighted system can be computed inO(n ·min{n, k})-time.

Though Wu and Chen’s algorithm seems towork well, Chang et al. [4] found that it isincomplete. In some special circular weightedsystems, Wu and Chen’s algorithm will result inwrong reliabilities. Chang et al. [4] also gave anO(T n)-time reliability algorithm for the circularweighted system, where

T =max

{i

∣∣∣∣ i−1∑j=1

wj < k, 1≤ i ≤ n+ 1

}

The basis of Chang et al.’s algorithm [4] isan equation that expresses the reliability of thecircular weighted system in reliability terms oflinear weighted systems. We describe the equationbelow.

Theorem 40. In a circular weighted-consecutive-k-out-of-n system

RC(1, n)

=

1 n < TT∑i=1

( i−1∏j=1

qj

)piRL(i + 1, n+ i − 1)

n≥ T

where

T =max

{i

∣∣∣∣ i−1∑j=1

wj < k, 1≤ i ≤ n+ 1

}In order to analyze the computational complex-

ity, Chang et al. [4] also gave an upper bound of T .

Theorem 41. T ≤min{n, �(k−wmax)/wmin + 1},where wmax and wmin are the maximum andminimum weights of all components, respectively.

Chang et al.’s [4] reliability algorithm isdescribed as follows.

Algorithm 3. (CCH)

1. Compute T .2. If T > n, then RC(1, n)= 1. STOP.3. Compute RL(i + 1, n+ i − 1), for i = 1, 2,

. . . , T .4. Compute RC(1, n) with the equation given in

Theorem 40. STOP.

In this algorithm, the bottleneck step isstep 3, which costs O(T n) time. Therefore, thetotal computational complexity of this reliabilityalgorithm is O(T n).

3.6 Window SystemsA sequence of k consecutive nodes is called ak-window. In various problems, the definition of

Page 82: Handbook of Reliability Engineering - pdi2.org

Reliabilities of Consecutive-k Systems 49

a “bad” window varies. In this section, the systemfailure is defined in terms of windows. It could bethat the system fails if it contains a bad window.But we can also define a system to be failed if everywindow is bad, or, equivalently, the system worksif it contains a good window. In general, we candefine a system to be failed if it contains b badwindows, called a b-fold-window system.

The window systems have many practical ap-plications. For example, consider a linear flownetwork consisting of n+ 2 nodes (node 0 tonode n+ 1) and directed links from node i tonode j (0≤ i < j ≤ n+ 1, j − i ≤ k). In this flownetwork, the source (node 0), the sink (node n+1), and all links are infallible, but the intermediatenodes (nodes 1, 2, . . . , n) may fail. When an in-termediate node fails, no flow can go through it;when it works, the flow capacity is unity. Then,the probability that the maximum flow from thesource to the sink is at least f is equal to theprobability that the intermediate nodes do notcontain a bad k-window, where a bad window isa window that contains at least k − f + 1 failednodes. Such a flow network can be realized as acircuit switching wireless telecommunication net-work, where the communication between nodeshas a distance limitation and each intermediatenode has an identical bandwidth limitation.

3.6.1 The f -within-consecutive-k-out-of-n System

An f -within-consecutive-k-out-of-n system, orabbreviated as an (f, k, n) system, is a linear orcircular system consisting of n components thatfails if and only if there exist some k consecutivecomponents (a k-window) containing at least f

failed components.The problem of computing the reliability of an

(f, k, n) system can be viewed as the binomialversion of the generalized birthday problemstudied by Saperstein [50]. The generalizedbirthday problem is described as follows:

Given a random assignment of w 1’sand n− w 0’s, what is the probabilitythat there is no set of k consecutive

bits in the arrangement containing f

or more 1’s?

Naus [51] first studied the case k = 2.Saperstein considered the binomial versionof the generalized birthday problem in which eachbit has a probability p to be 0 and a probabilityq to be 1. By interpreting each 0 as a workingcomponent and each 1 as a failed component,this problem is equivalent to the problem ofcomputing the reliability of the (f, k, n) system.

Hwang and Wright [9] proposed an O(23kn)-time reliability algorithm for the (f, k, n) system.Their algorithm is implemented using Griffith’s[14] Markov chain approach. Let w(l) denote thenumber of 1’s in a binary k-vector l (that is,the weight of l). They define the state space ofthe Markov chain {Y (t) : t ≥ 0} as

S = {l ∈ {0, 1}k : 0 <w(l) < f } ∪ {s1} ∪ {sN }Therefore

N = |S| =f−1∑i=0

(k

i

)+ 1=O(2k)

And the Markov chain {Y (t)} is defined as follows.

1. The k-vector l is encoded from the last k

components of Lf,k(1, t) where a workingcomponent is encoded as 0 and a failedcomponent is encoded as 1. If t < k, attachleading 0’s to l.

2. Y (t)= s1 if w(l)= 0.3. Y (t)= l if 0 <w(l) < f .4. Y (t)= sN if w(l) ≥ f .

The transition matrix for {Y (t) : t ≥ 0} is t,

an N × N matrix. Each (except the last) row oft contains two nonzero elements: one is pt ,and the other is qt . The last row of t onlycontains a non-zero element, which is 1, in the lastcolumn. With the transition matrix t , the systemreliability can be computed with the followingequation:

Rf,k(1, n)= π0

( n∏t=1

t

)UT

whereπ0 = (1, 0, . . . , 0)1×N

Page 83: Handbook of Reliability Engineering - pdi2.org

50 System Reliability and Optimization

and

U= (1, . . . , 1, 0)1×N

Since a multiplication of two N × N matricescostsO(N3) time (or less by Strassen’s algorithm),the total computational complexity of Hwangand Wright’s reliability algorithm is O(N3n)=O(23kn).

Chang [8] proposed another reliability algo-rithm for the (f, k, n) system, which is more ef-ficient than the O(23kn) one. Their algorithm is aMarkov chain approach, but the Markov chain isderived from an automaton with minimal numberof states. The automaton is a mathematical modelof a system, with discrete inputs and outputs.The system can be in any one of a finite numberof internal configurations or “states”. The state ofa system summarizes the information concern-ing past inputs that is needed to determine thebehavior of the system on subsequent inputs. Incomputer science, the theory of automata is auseful design tool for many finite state systems,such as switching circuits, text editors, and lexicalanalyzers. Chang [8] also employed the sparse ma-trix data structure to speed up the computation ofreliabilities. Their method is described as follows.

Consider an f -within-consecutive-k-out-of-nsystem. Every system state can be viewed asa binary string where the ith bit is 1 if andonly if component i works. Chang [8] definedM as the set of all strings corresponding toworking systems. Therefore, M = {0, 1}∗ −M canbe expressed as

M = (0+ 1)∗[ ∑x1,x2,...,xk∈{0,1}

x1+x2+···+xk≤k−f

x1x2 . . . xk

](0+ 1)∗

Based on automata theory, the minimal stateautomaton accepting exactly the strings in M hasthe following set of states Q. They labeled thestates with k-bit binary strings.

Q=Q0 ∪Q1 ∪ · · · ∪Qf

where

Q0 =1 . . . 1︸ ︷︷ ︸

k−f0 . . . 0︸ ︷︷ ︸

f

Q1 =

{b1b2 . . . bk

∣∣∣∣ b1 = 1,k∑

i=2

bi = k − f

}Q2 =

{b1b2 . . . bk

∣∣∣∣ b1 = b2 = 1, . . .

. . .

k∑i=3

bi = k − f

}. . .

Qf ={b1b2 . . . bk

∣∣∣∣ b1 = b2 = · · ·

= bf = 1,k∑

i=f+1

bi = k − f

}

={

1 . . . 1︸ ︷︷ ︸k

}The only state in Qf is the initial state. The onlystate in Q0 is the rejecting state; all other states(including the initial state) are accepting states.The state transition function δ :Q× {0, 1}→Q

is defined as:

δ(b1 . . . bk, 0)=

1 . . . 1︸ ︷︷ ︸k−f

0 . . . 0︸ ︷︷ ︸f

ifk∑

i=2

bi = k − f

b2 . . . bk0

otherwise

δ(b1 . . . bk, 1)=

1 . . . 1︸ ︷︷ ︸k−f

0 . . . 0︸ ︷︷ ︸f

ifk∑

i=1

bi = k − f

1 . . . 1︸ ︷︷ ︸t−2

bt . . . bk1

otherwise,

where

t =min

{x

∣∣∣∣ k∑i=x

bi = k − f − 1

}

Page 84: Handbook of Reliability Engineering - pdi2.org

Reliabilities of Consecutive-k Systems 51

Thus, in the minimal state automaton

|Q| = 1+(k − 1

k − f

)+(k − 2

k − f

)+ · · ·

+(k − f

k − f

)= 1+

(k − 1

f − 1

)+(k − 2

f − 2

)+ · · ·

+(k − f

f − f

)= 1+

(k

f − 1

)When this minimum state automaton is inter-preted as a heterogeneous Markov chain, a reli-ability algorithm can be derived easily. For con-venience, they rewrote the states in Q such thatQ= {s1, s2, . . . , sN } where

s1 =1, 1, . . . , 1︸ ︷︷ ︸

k

is the initial state and

SN =1, . . . , 1︸ ︷︷ ︸

k−f, 0, . . . , 0︸ ︷︷ ︸

f

is the only one failed state. When the workingprobability of component i in the f -within-consecutive-k-out-of-n system is pi , the transitionprobability matrix of {Y (t)} is

t =m1,1 · · · m1,N

......

mN,1 · · · mN,N

where

mi,j =

pt if i �= N, δ(si , 0)= sj

1− pt if i �= N, δ(si , 1)= sj

0 if i = N, j �=N

1 if i = N, j =N

Therefore, the reliability Rf,k(1, n) can be com-puted as follows:

Rf,k(1, n)= π0

( n∏t=1

t

)UT

whereπ0 = (1, 0, . . . , 0)1×N

andU= (1, . . . , 1, 0)1×N.

The total computational complexity of Changet al.’s algorithm is O

((k

f−1

)n), which is the most

efficient one so far.

3.6.2 The 2-within-consecutive-k-out-of-n System

The 2-within-consecutive-k-out-of-n system, ab-breviated as the (2, k, n) system, is a specialcase of the f -within-consecutive-k-out-of-n sys-tem. Most results obtained from the consecu-tive system can be extended to the (2, k, n) sys-tem. Let NS(d, f, k, n) denote the number of the(f, k, n) system which works, S ∈ {L, C}, contain-ing exactly d failed components. Naus [51] gavethe following result for f = 2.

Theorem 42.

NL(d, 2, k, n)=(n− (d − 1)(k − 1)

d

)Sfaniakakis et al. [52] gave the following result forthe circular case:

Theorem 43.

NC(d, 2, k, n)= n

n− d(k − 1)

(n− d(k − 1)

d

)Thus, for the IID model and f = 2:

RL(n)=n∑

d=0

(n− (d − 1)(k − 1)

d

)qdpn−d

RC(n)=n∑

d=0

n

n− d(k − 1)

×(n− d(k − 1)

d

)qdpn−d

Higashiyama et al. [53] gave a recursiveequation for the IND model:

Page 85: Handbook of Reliability Engineering - pdi2.org

52 System Reliability and Optimization

Theorem 44. For f = 2 and n≥ k:

RL(1, n)= pnRL(1, n− 1)

+ qn

( n−1∑i=n−k+1

pi

)RL(1, n− k)

They also gave a recursive equation for the INDcircular case:

Theorem 45. For f = 2:

RC(1, n)=( n∏

i=n−k+2

pi

)RL(1, n− k + 1)

+k−1∑i=1

qn−k+i+1

× RL(i + 1, n− 2k + i + 1)

×( n−k+i∏

j=n−2k+i+2

pj

)( n+i∏j=n−k+i+2

pj

)

3.6.3 The b-fold-window System

First consider the b-fold-non-overlapping-consecutive system when a window is aconsecutive system and the system fails ifand only if there exist b non-overlapping badwindows. Let Rb

L(1, n) be the reliability of sucha (linear) system. Papastavridis [54] gave thefollowing result. Based on this result, Rb

L(1, n) canbe computed in O(b2n) time.

Theorem 46.

RbL(1, n)= Rb

L(1, n− 1)

−b∑

j=1

pn−jk( jk∏

j=1

qn−jk+i)

× [Rb−jL (1, n− jk − 1)

− Rb−j+1L (1, n− jk − 1)]

Alevizos et al. [55] gave a similar argument forthe circular system. With their equation, Rb

C(1, n)can be computed in O(b3kn) time.

Theorem 47.

RbC(1, n)= pnR

bL(1, n)+ qnR

bC(1, n− 1)

−b∑

j=1

jk∑i=1

( n+i−1∏l=n−jk+i

ql

)× pn−jk+i−1pn+i× [Rb−j

L (i + 1, n− jk + i − 2)

− Rb−j+1L (i + 1, n− jk + i − 2)]

For an IID model, Papastavridis proved thefollowing results:

Theorem 48.

RbL(n)=

n∑d=0

NbL(d, n, k)p

n−dqd

where

NbL(d, n, k)=

(n− d + b

b

)×∑i≥0

(−1)i(n− d + 1

i

)×(n− k(i + b)

n− d

)Theorem 49.

RbC(n)=

n∑d=0

NbC(d, n, k)p

n−dqd

where

NbC(d, n, k)=

(n− d + b − 1

b

)(n− d

d

)×∑i≥0

(−1)i(n− d + 1

i

)×(n− k(i + b)

n− d

)Windows are called isolated if there exists a

working component between any two windows.Chang et al. [30] considered the case that thesystem fails if and only if there are b isolated badwindows. Let R(b) denote the system reliability.Their results are summarized as follows:

Page 86: Handbook of Reliability Engineering - pdi2.org

Reliabilities of Consecutive-k Systems 53

Theorem 50.

R(b)L (1, n)= R

(b)L (1, n− 1)−

( n∏i=n−k+1

qi

)pn−k

Theorem 51.

R(b)C (1, n)= pnR

(b)L (1, n)+ qnR

(b)C (1, n− 1)

−n∑

i=n−k+1

{( i+k+1∏j=i

qj

)pi−1pi+k

× [R(b−1)L (i + k + 1, i − 2)

− R(b)L (i + k + 1, i − 2)]

}Based on these equations, the reliability

R(b)L (1, n) can be computed in O(bn) time, and

the reliability R(b)C (1, n) in O(bkn) time.

3.7 Network SystemsIn the literature, a consecutive-k-out-of-n systemis a system consisting of n components arrangedin a sequence such that the system fails if and onlyif some k consecutive components all fail. In manypractical applications, such as an oil pipeline ora telecommunications system, the objective of thesystem is to transmit something from the sourceto the sink. For these cases, the system can berepresented by a network with n+ 2 nodes (nodes0, 1, 2, . . . , n+ 1) and directed links from node ito node j (0≤ i < j ≤ n+ 1, j − i ≤ k). We referto node 0 as the source and node n+ 1 as thesink, and assume that they are infallible. Nodes 1to n correspond to the n components. If some k

consecutive nodes all fail, then there exists no pathfrom the source to the sink. On the other hand,if no k consecutive nodes all fail, then the pathstarting from the source can always move forwardto a node until it lands at the sink. Therefore, thesystem fails if and only if there exists no path fromthe source to the sink.

The path interpretation of the consecutive-k-out-of-n system brings about the notion oflinks that connect adjacent pairs of nodes.Therefore, one can study the link failure model

in which nodes always work but links can fail.In many practical situations this could be themore appropriate model. For example, nodescould be some physical entities, like telephoneswitching centers or airports, and the links areroutes interconnecting them. When a phone callor a flight is blocked, it is usually due to thenon-availability of routes (wires busy or ticketssold out) rather than a breakdown of hardware.However, previous research suggested that thereliability of the link failure system is much harderto compute than that of the node failure system.The network system is more general than the linkfailure system and the node failure system, sincenodes and links can all fail.

3.7.1 The Linear Consecutive-2Network System

Chen et al. [10, 56] first proposed the networkmodel for the consecutive-2-out-of-n line. Theyassumed that the reliabilities of all nodes are allequal to p. And p1 (p2) is the reliability of thelink from node i to node i + 1 (i + 2). Let R1(n)

denote the probability that node 0 has a workingpath to node n+ 1. They gave a recursive equationas follows:

R1(n)= p(1− q1q2)R1(n− 1)

+ pp2(1− pp1)R1(n− 2)

− p2q1p22R1(n− 3)

from which a closed-form solution is obtained.

Theorem 52. For 0 < p(1− q1q2) < 1 we have:

R(n)= aαn+2 + bβn+2 + cγ n+2

where α, β, γ are the roots of

f (x)= x3 − p(1− q1q2)x2

− pp2(1− pp1)x + p2q1p22

satisfying

1 > α > p(1 − q1q2) > b ≥ pp2q1

≥ 0 >−β > γ >−α >−1

Page 87: Handbook of Reliability Engineering - pdi2.org

54 System Reliability and Optimization

and

a = [pp2q1 − α]/[p(α − β)(γ − α)]> 0

b = [pp2q1 − β]/[p(β − γ )(α − β)]< 0

c = [pp2q1 − γ ]/[p(γ − α)(β − γ )]< 0

Newton’s method can be employed to approx-imate the three roots α, β, γ of f (x). Once theroots are known, the reliability R1(n) can be com-puted in another O(log n) time. However, the pre-cision of the final value of R1(n) is usually worsethan that of recursively employing the equationgiven in Theorem 52, which needs O(n) time.

Chen et al. [10] also compared the importanceof the three types of component (nodes, 1-links, and 2-link) with respect to the systemreliability. Let r > s > t be three given reliabilities.The question is how do we map {r, s, t} to{p, p1, p2}. They found:

Theorem 53. R(n) is maximized by setting p = r ,p1 = s and p2 = t .

3.7.2 The Linear Consecutive-kNetwork System

Chen et al. [10] also claimed that their approachcan be extended to compute the reliability of theconsecutive-k-out-of-n network, but the detailsare messy. Let M be a nonempty subset of thefirst k nodes including node 0. They said one candefine:

RM(n)= Pr{a node in M has a path to the sink}Then one can write 2k−1 recursive equations, onefor each M . The reliability of the consecutive-k-out-of-n:F network is R{1}(n).

Chang et al. [11] proposed a new Markovchain method and gave a closed-form expressionfor the reliability of the consecutive-k-out-of-nnetwork system where each of the nodes and linkshas its own reliability. In their algorithm, theytry to embed the consecutive-k-out-of-n networkinto a Markov chain {Y (t) : t ≥ k} defined on thestate space S = {0, 1, 2, . . . , 2k − 1}. Then, theyshowed that the reliability of the consecutive-k-out-of-n network is equal to the probability that

Y (n+ 1) is odd. The Markov chain {Y (t) : t ≥ k}was defined as follows.

Definition 1. {Y (t) : t ≥ k} is a Markov chaindefined on the state space S, where

S = {0, 1, 2, . . . , 2k − 1}Define (x1x2 . . . xm)2 =∑m

i=1 2m−ixi . Initially:

Pr{Y (k)= 0} = Pr{(d1d2 . . . dk)2 = 0}Pr{Y (k)= 1} = Pr{(d1d2 . . . dk)2 = 1}Pr{Y (k)= 2} = Pr{(d1d2 . . . dk)2 = 2}...

Pr{Y (k)= 2k − 1} = Pr{(d1d2 . . . dk)2 = 2k − 1}The transition probability matrix of the Markovchain {Y (t) : t ≥ k} is

t =

m0,0,t m0,1,t · · · m0,2k−1,tm1,0,t m1,1,t · · · m1,2k−1,t

......

m2k−1,0,t m2k−1,1,t · · · m2k−1,2k−1,t

where, for 0≤ i ≤ 2k − 1, 0≤ j ≤ 2k − 1, k < t ≤n+ 1

mi,j,t = Pr{Y (t)= j | Y (t − 1)= i}

=

1 if i = j = 0[1−

∏x:1≤x≤k,

(i mod 2x)≥2x−1

ql(t − x, t)

]pn(t)

if j = (2i + 1) mod (2k)

1−[1−

∏x:1≤x≤k,

(i mod 2x)≥2x−1

ql(t − x, t)

]pn(t)

if j = (2i) mod (2k)

0 otherwise

Theorem 54.

RN(k, n)= Pr{Y (n+ 1) is odd}

= πk

( n+1∏t=k+1

t

)U

Page 88: Handbook of Reliability Engineering - pdi2.org

Reliabilities of Consecutive-k Systems 55

where

πk = [Pr{Y (k)= 0}, Pr{Y (k)= 1},. . . , Pr{Y (k)= 2k − 1}]

U= [0, 1, 0, 1, . . . , 0, 1]T

Using a straightforward algorithm to computethe initial probabilities, the computational com-plexity for πk is at least (2k(k+3)/2). In orderto simplify the computation further, Chang et al.[11] transformed the network system into an ex-tended network system by adding k − 1 dummynodes before node 0. The reliability of the ex-tended consecutive-k-out-of-n:F network was de-fined as the probability that there is a working pathfrom “node 0” to node n+ 1. Therefore, dummynodes and their related links cannot contributeto a working path from node 0 to node n+ 1.The reliability of the extended consecutive-k-out-of-n network is identical to that of the originalconsecutive-k-out-of-n network.

When computing the reliability of the ex-tended consecutive-k-out-of-n:F network, previ-ous definitions must be modified. In Definition 1,{Y ′(t)} = {Y (t) : t ≥ 0} can be defined in a similarway except that t is defined for 0 < t ≤ n+ 1 andinitially

Pr{Y (0)= 0} = Pr{(d−k+1d−k+2 . . . d0)2 = 0}Pr{Y (0)= 1} = Pr{(d−k+1d−k+2 . . . d0)2 = 1}Pr{Y (0)= 2} = Pr{(d−k+1d−k+2 . . . d0)2 = 2}...

Pr{Y (0)= 2k − 1}= Pr{(d−k+1d−k+2 . . . d0)2 = 2k − 1}

Furthermore, in the extended consecutive-k-out-of-n:F network, one can let

pn(i)= 0 for − (k − 1)≤ i < 0

pl(i, j)= 0 for − (k − 1)≤ i < 0,

0 < j − i ≤ k

Again, the extended consecutive-k-out-of-n:F net-work was embedded into the Markov chain

{Y ′(t)} = {Y (t) : t ≥ 0}. Considering the nodes−k + 1,−k + 2, . . . ,−1, it is true that they al-ways fail and there is no working path fromnode 0 to them. Thus the binary number(d−k+1d−k+2 . . . d0)2 is always equal to one.

Let

π0 = [Pr{Y (0)= 0}, Pr{Y (0)= 1}, . . . ,Pr{Y (0)= 2k − 1}]

= [0, 1, 0, 0, . . . , 0]The system reliability can be computed with

Rn(k, n)= π0

( n+1∏t=1

t

)U

Chang et al.’s [11] reliability algorithm is de-scribed as follows.

Algorithm 4.

1. π← π0, t← 1.2. Construct t in a sparse matrix data struc-

ture.3. π← πt .

4. If t < n+ 1 thent← t + 1,go to Step 2.

5. Return πU . STOP.

With the efficient sparse matrix data structure,this algorithm can compute RN(k, n) in O(2kn)time. This is very efficient, since in most practicalsituations where n is large and k is small, the timecomplexity becomes O(n).

3.7.3 The Linear Consecutive-k FlowNetwork System

The linear consecutive-k flow network system, firstproposed by Chang [8], was modified from thenetwork system. The structure of the flow networksystem is similar to that of the network system.In the flow network, the source (node 0), the sink(node n+ 1), and all links are infallible, but theintermediate nodes (nodes 1, 2, . . . , n) may fail.When an intermediate node fails, no flow can gothrough it; when it works, the flow capacity isunity.

Page 89: Handbook of Reliability Engineering - pdi2.org

56 System Reliability and Optimization

The flow network can be realized as a circuit-switching wireless telecommunications network, where the communication between nodes has a distance limitation and each intermediate node has an identical bandwidth limitation. Chang [8] defined the f-flow-reliability, denoted as RFN(f, k, n), as the probability that the maximum flow from the source to the sink is at least f. The maximum flow from the source to the sink is at least f if and only if there are at least f node-disjoint paths from the source to the sink:

RFN(f, k, n) = Pr{the maximum flow from node 0 to node n + 1 is at least f}

Chang [8] then proved the following lemma:

Lemma 3. For 1 ≤ f ≤ k, the maximum flow from node 0 to node n + 1 is at least f if and only if there are no k consecutive intermediate nodes containing at least k − f + 1 failed nodes.

Based on Lemma 3, the following result holds immediately:

Theorem 5.

RFN(f, k, n) = Pr{there are no k consecutive intermediate nodes containing at least k − f + 1 failed nodes} = R_{k−f+1,k}(n)

That is, the f-flow-reliability of the consecutive-k-out-of-n:F flow network is equal to the reliability of the (k − f + 1)-within-consecutive-k-out-of-n:F system. Thus, by substituting f by k − f + 1, the reliability algorithm for the f-within-consecutive-k-out-of-n system can be used to compute the f-flow-reliability of the consecutive-k-out-of-n flow network system. The total computational complexity for RFN(f, k, n) is O(C(k, k − f) n) = O(C(k, f) n), where C(·, ·) denotes the binomial coefficient.

Furthermore, if the task is to compute the probability distribution of the maximum flow (from node 0 to node n + 1), the following equations can be used:

Pr{the maximum flow = 0} = 1 − RFN(1, k, n)
Pr{the maximum flow = 1} = RFN(1, k, n) − RFN(2, k, n)
Pr{the maximum flow = 2} = RFN(2, k, n) − RFN(3, k, n)
...
Pr{the maximum flow = k − 1} = RFN(k − 1, k, n) − RFN(k, k, n)
Pr{the maximum flow = k} = RFN(k, k, n)

Or equivalently:

Pr{the maximum flow = 0} = 1 − Rk(n)
Pr{the maximum flow = 1} = Rk(n) − R_{k−1,k}(n)
Pr{the maximum flow = 2} = R_{k−1,k}(n) − R_{k−2,k}(n)
...
Pr{the maximum flow = k − 1} = R_{2,k}(n) − ∏_{i=1}^{n} pi
Pr{the maximum flow = k} = ∏_{i=1}^{n} pi
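As a small illustration, the distribution above can be tabulated directly from any routine that evaluates the f-flow-reliability. The sketch below assumes such a routine rfn(f, k, n) (a hypothetical name) is available, e.g. an implementation of the (k − f + 1)-within-consecutive-k-out-of-n:F reliability algorithm.

def max_flow_distribution(k, n, rfn):
    """Return [Pr{max flow = 0}, ..., Pr{max flow = k}] from RFN values."""
    r = [rfn(f, k, n) for f in range(1, k + 1)]     # RFN(1..k, k, n)
    dist = [1.0 - r[0]]                             # Pr{max flow = 0}
    dist += [r[f - 1] - r[f] for f in range(1, k)]  # Pr{max flow = f}, f < k
    dist.append(r[k - 1])                           # Pr{max flow = k}
    return dist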

Another topic related to the flow network system is finding the route of the maximum flow. Chang [8] proposed an on-line routing algorithm to route the maximum flow. In executing the routing algorithm, each working intermediate node dynamically chooses a proper flow route through itself in order to maximize the total flow from the source to the sink.

The on-line routing algorithm is described in Algorithm 6. Note that for each i, 1 ≤ i ≤ n, Algorithm 6 is periodically executed on node i. Algorithm 5 is a subroutine called by the main procedure in Algorithm 6.


Algorithm 5. (Algorithm_Mark(i))

{ if (Ind(i) ≤ k) then
    Mark(i) ← Ind(i);
  else
  { for j ← 1 to k do temp(j) ← 0;
    for j ← 1 to k do
    { (x, y) ← (Ind(i − j), Mark(i − j));
      // Get messages from node i − j.
      // If node i − j fails (timeout), skip this iteration.
      if (temp(y) = 0) then temp(y) ← x;
    }
    x ← n;
    for j ← 1 to k do
      if (temp(j) > 0 and temp(j) < x) then
      { x ← temp(j);
        Mark(i) ← j;
      }
  }
}

Algorithm 6. (Algorithm_Route(i))

{ call Algorithm_Mark(i);
  if (Ind(i) ≤ k) then
    receive a unity flow from node 0;
  else
  { for j ← 1 to k do
      if (Mark(i) = Mark(i − j)) then break;
    receive a unity flow from node i − j;
  }
}

The computational complexity of the routing algorithm is O(k), which is optimal, since it is necessary for each node to collect messages from its k predecessor nodes in order to maximize the total flow.

3.8 Conclusion

Owing to a limitation of 10,000 words for this survey, we have had to leave out some topics in order to do justice to other, more popular ones. Among others, we leave out the following:

The lifetime distribution model. Component i has a lifetime distribution fi(t), namely, the probability that i is still working at time t. Note that, at a given time t, fi(t) can be interpreted as pi, the parameter in the static model we study. Hence, our model studies the availability of the system at any time t. However, there are problems unique to the lifetime model, like mean time to failure, the increasing failure rate property, and estimation of parameters in the lifetime distribution.

The dependence model. The failures among components are dependent. This dependence could be due to batch production or common causes of failure. Typically, Markov chains are used to study this model.

The consecutively connected systems. The reachability parameter k of a node is no longer constant. It is assumed that node i can reach the next ki nodes. More generally, it is assumed that the nodes are multi-state nodes and node i has probability pij of being in state (i, j), which allows it to reach the next j nodes.

The multi-failure models. In one model, a component has two failure modes: open-circuit and short-circuit. Consequently, the system also has two failure modes: consecutive k open-circuits and consecutive r short-circuits. In another model, the system fails if either some d components fail, or some k (< d) consecutive components all fail.

The redundancy model. Suppose there are many redundant components; how do we allocate them to the n positions in a consecutive-k system? Though Birnbaum importance does provide some guidelines, it has only a figure-of-merit value, and there are very few Birnbaum importance comparisons. More definite answers have been obtained by combinatorial analysis.

The 2-dimension system. Components are arranged in an m × n rectangle, and the rectangle is declared failed if there exists an r × s subrectangle consisting of all failed components. It is surprisingly difficult to write down efficient recursive equations to compute the reliability of the 2-dimension system.

The graph model. The components are nodes in a graph, and the system fails if and only if there exist two failed adjacent nodes. Computing the reliability is possible only for some special classes of graphs. A theory of the invariant consecutive-2 graph has been developed.

We refer the reader to the recent book by Chang et al. [30], which covers the above topics and also gives more details on the topics we covered here.


References

[1] Kontoleon JM. Reliability determination of r-successive-out-of-n:F system. IEEE Trans Reliab 1980;R-29:437.
[2] Chiang DT, Niu SC. Reliability of consecutive-k-out-of-n:F system. IEEE Trans Reliab 1981;R-30:87–9.
[3] Derman C, Lieberman GJ, Ross SM. On the consecutive-k-out-of-n:F system. IEEE Trans Reliab 1982;R-31:57–63.
[4] Chang JC, Chen RJ, Hwang FK. A fast reliability algorithm for the circular consecutive-weighted-k-out-of-n:F system. IEEE Trans Reliab 1998;47:472–4.
[5] Wu JS, Chen RJ. Efficient algorithms for k-out-of-n and consecutive-weighted-k-out-of-n:F system. IEEE Trans Reliab 1994;43:650–5.
[6] Chang GJ, Cui L, Hwang FK. Reliability for (n, f, k) systems. Stat Prob Lett 1999;43:237–42.
[7] Chang JC, Chen RJ, Hwang FK. Faster algorithms to compute reliabilities of the (n, f, k) systems. Preprint 2000.
[8] Chang JC. Reliability algorithms for consecutive-k systems. PhD Thesis, Department of Computer Science and Information Engineering, National Chiao Tung University, December 2000.
[9] Hwang FK, Wright PE. An O(n log n) algorithm for the generalized birthday problem. Comput Stat Data Anal 1997;23:443–51.
[10] Chen RW, Hwang FK, Li WC. Consecutive-2-out-of-n:F systems with node and link failures. IEEE Trans Reliab 1993;42:497–502.
[11] Chang JC, Chen RJ, Hwang FK. An efficient algorithm for the reliability of consecutive-k-n networks. J Inform Sci Eng 2002; in press.
[12] Koutras MV. Consecutive-k,r-out-of-n:DFM systems. Microelectron Reliab 1997;37:597–603.
[13] Griffith WS, Govindarajulu Z. Consecutive-k-out-of-n failure systems: reliability, availability, component importance and multi-state extensions. Am J Math Manag Sci 1985;5:125–60.
[14] Griffith WS. On consecutive k-out-of-n failure systems and their generalizations. In: Basu AP, editor. Reliability and quality control. Amsterdam: Elsevier Science; 1986. p. 157–65.
[15] Chao MT, Lin GD. Economical design of large consecutive-k-out-of-n:F systems. IEEE Trans Reliab 1984;R-33:411–3.
[16] Fu JC. Reliability of consecutive-k-out-of-n:F systems with (k−1)-step Markov dependence. IEEE Trans Reliab 1986;R-35:602–6.
[17] Shanthikumar JG. Recursive algorithm to evaluate the reliability of a consecutive-k-out-of-n:F system. IEEE Trans Reliab 1982;R-31:442–3.
[18] Hwang FK. Fast solutions for consecutive-k-out-of-n:F system. IEEE Trans Reliab 1982;R-31:447–8.
[19] Wu JS, Chen RJ. Efficient algorithm for reliability of a circular consecutive-k-out-of-n:F system. IEEE Trans Reliab 1993;42:163–4.
[20] Wu JS, Chen RJ. An O(kn) algorithm for a circular consecutive-k-out-of-n:F system. IEEE Trans Reliab 1992;41:303–5.
[21] Antonopoulou I, Papastavridis S. Fast recursive algorithms to evaluate the reliability of a circular consecutive-k-out-of-n:F system. IEEE Trans Reliab 1987;R-36:83–4.
[22] Hwang FK. An O(kn)-time algorithm for computing the reliability of a circular consecutive-k-out-of-n:F system. IEEE Trans Reliab 1993;42:161–2.
[23] Fu JC, Hu B. On reliability of a large consecutive-k-out-of-n:F system with (k−1)-step Markov dependence. IEEE Trans Reliab 1987;R-36:75–7.
[24] Koutras MV. On a Markov approach for the study of reliability structures. J Appl Prob 1996;33:357–67.
[25] Hwang FK, Wright PE. An O(k^3 log(n/k)) algorithm for the consecutive-k-out-of-n:F system. IEEE Trans Reliab 1995;44:128–31.
[26] Fu JC. Reliability of a large consecutive-k-out-of-n:F system. IEEE Trans Reliab 1985;R-34:127–30.
[27] Chao MT, Fu JC. The reliability of a large series system under Markov structure. Adv Appl Prob 1991;23:894–908.
[28] Du DZ, Hwang FK. Optimal consecutive-2-out-of-n:F system. Math Oper Res 1986;11:187–91.
[29] Malon DM. Optimal consecutive-k-out-of-n:F component sequencing. IEEE Trans Reliab 1985;R-34:46–9.
[30] Chang GJ, Cui L, Hwang FK. Reliabilities of consecutive-k systems. Boston: Kluwer; 2000.
[31] Hwang FK, Pai CK. Sequential construction of a consecutive-2 system. Inform Proc Lett 2002; in press.
[32] Hwang FK. Invariant permutations for consecutive-k-out-of-n cycles. IEEE Trans Reliab 1989;R-38:65–7.
[33] Tong YL. Some new results on the reliability of circular consecutive-k-out-of-n:F system. In: Basu AP, editor. Reliability and quality control. Elsevier Science; 1986. p. 395–9.
[34] Tong YL. A rearrangement inequality for the longest run, with an application to network reliability. J Appl Prob 1985;22:386–93.
[35] Kuo W, Zhang W, Zuo M. A consecutive-k-out-of-n:G system: the mirror image of consecutive-k-out-of-n:F system. IEEE Trans Reliab 1990;39:244–53.
[36] Zuo M, Kuo W. Design and performance analysis of consecutive-k-out-of-n structure. Nav Res Logist 1990;37:203–30.
[37] Jalali A, Hawkes AG, Cui L, Hwang FK. The optimal consecutive-k-out-of-n:G line for n ≤ 2k. Preprint 1999.
[38] Du DZ, Hwang FK, Jung Y, Ngo HQ. Optimal consecutive-k-out-of-(2k+1):G cycle. J Global Opt 2001;19:51–60.
[39] Papastavridis S. The most important component in a consecutive-k-out-of-n:F system. IEEE Trans Reliab 1987;R-36:266–8.
[40] Hwang FK, Cui LR, Chang JC, Lin WD. Comments on "Reliability and component importance of a consecutive-k-out-of-n system by Zuo". Microelectron Reliab 2000;40:1061–3.
[41] Zuo MJ. Reliability and component importance of a consecutive-k-out-of-n system. Microelectron Reliab 1993;33:243–58.
[42] Zakaria RS, David HA, Kuo W. The nonmonotonicity of component importance measures in linear consecutive-k-out-of-n systems. IIE Trans 1992;24:147–54.
[43] Chang GJ, Cui L, Hwang FK. New comparisons in Birnbaum importance for the consecutive-k-out-of-n system. Prob Eng Inform Sci 1999;13:187–92.
[44] Chang GJ, Hwang FK, Cui L. Corrigenda on "New comparisons in Birnbaum importance for the consecutive-k-out-of-n system". Prob Eng Inform Sci 2000;14:405.
[45] Chang HW, Chen RJ, Hwang FK. The structural Birnbaum importance of consecutive-k systems. J Combin Opt 2002; in press.
[46] Hwang FK. A new index for component importance. Oper Res Lett 2002; in press.
[47] Lin FH, Kuo W, Hwang FK. Structure importance of consecutive-k-out-of-n systems. Oper Res Lett 1990;25:101–7.
[48] Chang HW, Hwang FK. Rare event component importance of the consecutive-k system. Nav Res Logist 2002; in press.
[49] Chang JC, Chen RJ, Hwang FK. The Birnbaum importance and the optimal replacement for the consecutive-k-out-of-n:F system. In: International Computer Symposium: Workshop on Algorithms; 1998. p. 51–4.
[50] Saperstein B. The generalized birthday problem. J Am Stat Assoc 1972;67:425–8.
[51] Naus J. An extension of the birthday problem. Am Stat 1968;22:27–9.
[52] Sfakianakis M, Kounias S, Hillaris A. Reliability of a consecutive k-out-of-r-from-n:F system. IEEE Trans Reliab 1992;41:442–7.
[53] Higashiyama Y, Ariyoshi H, Kraetzl M. Fast solutions for consecutive-2-out-of-r-from-n:F system. IEICE Trans Fundam Electron Commun Comput Sci 1995;E78A:680–4.
[54] Papastavridis S. The number of failed components in a consecutive-k-out-of-n:F system. IEEE Trans Reliab 1989;38:338–40.
[55] Alevizos PD, Papastavridis SG, Sypsas P. Reliability of cyclic m-consecutive-k-out-of-n:F system. In: Proceedings of the IASTED Conference on Reliability, Quality Control and Risk Assessment; 1992.
[56] Chen RW, Hwang FK, Li WC. A reversible model for consecutive-2-out-of-n:F systems with node and link failures. Prob Eng Inform Sci 1994;8:189–200.


Chapter 4

Multi-state System Reliability Analysis and Optimization (Universal Generating Function and Genetic Algorithm Approach)

G. Levitin and A. Lisnianski

4.1 Introduction
4.1.1 Notation
4.2 Multi-state System Reliability Measures
4.3 Multi-state System Reliability Indices Evaluation Based on the Universal Generating Function
4.4 Determination of u-function of Complex Multi-state System Using Composition Operators
4.5 Importance and Sensitivity Analysis of Multi-state Systems
4.6 Multi-state System Structure Optimization Problems
4.6.1 Optimization Technique
4.6.1.1 Genetic Algorithm
4.6.1.2 Solution Representation and Decoding Procedure
4.6.2 Structure Optimization of Series-Parallel System with Capacity-based Performance Measure
4.6.2.1 Problem Formulation
4.6.2.2 Solution Quality Evaluation
4.6.3 Structure Optimization of Multi-state System with Two Failure Modes
4.6.3.1 Problem Formulation
4.6.3.2 Solution Quality Evaluation
4.6.4 Structure Optimization for Multi-state System with Fixed Resource Requirements and Unreliable Sources
4.6.4.1 Problem Formulation
4.6.4.2 Solution Quality Evaluation
4.6.4.3 The Output Performance Distribution of a System Containing Identical Elements in the Main Producing Subsystem
4.6.4.4 The Output Performance Distribution of a System Containing Different Elements in the Main Producing Subsystem
4.6.5 Other Problems of Multi-state System Optimization

4.1 Introduction

Modern large-scale technical systems are distinguished by their structural complexity. Many of them can perform their task at several different levels. In such cases, a system failure can lead to a decreased ability to perform the given task, but not to a complete failure.

In addition, each system element can also perform its task at several different levels. For example, the generating unit in a power system has its nominal generating capacity, which is fully


available if there are no failures. Some types of failure can cause complete unit outage, whereas other types of failure can cause a unit to work with reduced capacity. When a system and its components can have an arbitrary finite number of different states (task performance levels), the system is termed a multi-state system (MSS) [1].

The physical characteristics of the performance depend on the physical nature of the system outcome. Therefore, it is important to measure the performance rates of system components by their contribution to the entire MSS output performance. In practical cases, one should deal with various types of MSS corresponding to the physical nature of MSS performance. For example, in some applications the performance measure is defined as productivity or capacity. Examples of such MSSs are continuous materials or energy transmission systems, power generation systems [2, 3], etc. The main task of these systems is to provide the desired throughput or transmission capacity for continuous energy, material, or information flow. The data processing speed can also be considered as a performance measure [4, 5], in which case the main task of the system is to complete its task within the desired time. Some other types of MSS were considered by Gnedenko and Ushakov [6].

Much work in the field of reliability analysis has been devoted to binary-state systems, where only complete failures are considered. The reliability analysis of an MSS is much more complex.

The MSS was introduced in the middle of the 1970s [7–10]. In these studies, the basic concepts of MSS reliability were formulated, the system structure function was defined for a coherent MSS, and its properties were investigated. Griffith [11] generalized the coherence definition and studied three types of coherence. The reliability importance was extended to MSSs by Griffith [11] and Butler [12]. An asymptotic approach to MSS reliability evaluation was developed by Kolowrocki [13]. An engineering method for MSS unavailability boundary point estimation based on binary model extension was proposed by Pourret et al. [14].

Practical methods of MSS reliability assessment are based on three different approaches [15]: the structure function approach, where Boolean models are extended to the multi-valued case; the stochastic process (mainly Markov) approach; and Monte Carlo simulation. Obviously, the stochastic process method can be applied only to relatively small MSSs, because the number of system states increases drastically with the number of system components. The structure function approach is also extremely time consuming. A Monte Carlo simulation model may be a fairly true representation of the real world, but the main disadvantage of the simulation technique is the time and expense involved in the development and execution of the model [15]. This is an especially important drawback when optimization problems are solved. In spite of these limitations, the above-mentioned methods are often used by practitioners, for example in the field of power system reliability analysis [2].

In real-world problems of MSS reliability analysis, the great number of system states that need to be evaluated makes it difficult to use traditional techniques in various optimization problems. In contrast, the universal generating function (UGF) technique is fast enough to be used in these problems. In addition, this technique allows practitioners to find the entire MSS performance distribution based on the performance distributions of its components. An engineer can find it by using the same procedures for MSSs with different physical natures of performance. In the following sections, the application of the UGF to MSS reliability analysis and optimization is considered.

To solve the wide range of reliability optimization problems, one has to choose an optimization tool that is robust and universal and that imposes minimal requirements as to the knowledge of the structure of the solution space. A genetic algorithm (GA) has all these properties and can be applied for optimizing vectors of binary and real values, as well as for combinatorial optimization. GAs have been proven to be effective optimization tools in reliability engineering. The main areas of GA implementation in this field are redundancy allocation and structure optimization subject to reliability constraints [16, 17], optimal design of


reliable network topology [18, 19], optimization of reliability analysis procedures [20], fault diagnosis [21], and maintenance optimization [22, 23].

4.1.1 Notation

We adopt the following notation herein:

Pr{e}  probability of event e
G  random output performance rate of MSS
Gk  output performance rate of MSS at state k
pk  probability that the system is at state k
W  random system demand
Wm  mth possible value of system demand
qm  Pr{W = Wm}
F(G, W)  function representing the desired relation between MSS performance and demand
A  MSS availability
EA  stationary probability that MSS meets a variable demand
EG  expected MSS performance
EU  expected MSS unsupplied demand
uj(z)  u-function representing performance distribution of individual element
U(z)  u-function representing performance distribution of MSS
⊗ω  composition operator over u-functions
ω  function determining performance rate for a group of elements
δx(U(z), F, W)  operator over MSS u-function which determines the Ex index, x ∈ {A, G, U}
S  MSS sensitivity index
g  nominal performance rate of individual MSS element
a  availability of individual MSS element
h  version of element chosen for MSS
r(n, h)  number of elements of version h included in the nth MSS component
R  MSS configuration

Figure 4.1. MSS reliability indices for failure criterion F(G, W) = G − W < 0

4.2 Multi-state System Reliability Measures

Consider a system consisting of n units. We suppose that any system unit i can have ki states: from complete failure up to perfect functioning. The entire system has K different states as determined by the states of its units. Denote the MSS state at instant t as Y(t) ∈ {1, 2, . . . , K}. The performance rate Gk is associated with each state k ∈ {1, 2, . . . , K}, and the system output performance distribution (OPD) can be defined by two finite vectors G and p = {pk(t)} = Pr{G(t) = Gk} (1 ≤ k ≤ K), where G(t) is the random output performance rate of the MSS and Pr{x} is the probability of event x.

The MSS behavior is characterized by its evolution in the space of states. To characterize this evolution process numerically, one has to determine the MSS reliability indices. These indices can be considered as extensions of the corresponding reliability indices for a binary-state system.

The MSS reliability measures were systematically studied by Aven [15] and Brunelle and Kapur [24]. In this chapter, we consider three measures that are most commonly used by engineers, namely MSS availability, MSS expected performance, and MSS expected unsupplied demand (lost throughput) (Figure 4.1).

In order to define the MSS ability to perform its task, we determine a function F(G, W) representing the desired relation between the MSS


random performance rate G and some specified performance level (demand) W. The condition F(G, W) < 0 is used as the criterion of MSS failure. Usually F(G, W) = G − W, which means that states with a system performance rate less than the demand are interpreted as failure states.

MSS availability A(t) is the probability that the MSS will be in a state with performance level satisfying the condition F(G, W) ≥ 0 at a specified moment t > 0, given that the MSS initial state at instant t = 0 is j and F(Gj, W) ≥ 0. For large t, the initial state has practically no influence on the availability. Therefore, the index A is usually used for the steady-state case and is called the stationary availability coefficient, or simply the MSS availability. MSS availability is a function of demand W. It may be defined as

A(W) = ∑_{F(Gk,W)≥0} pk    (4.1)

where pk is the steady-state probability of MSS state k. The sum is taken only over the states satisfying the condition F(Gk, W) ≥ 0.

In practice, the system operation period T is often partitioned into M intervals Tm (1 ≤ m ≤ M), and each Tm has its own demand level Wm. The following generalization of the availability index [4] is used in these cases:

EA(W, q) = ∑_{m=1}^{M} A(Wm) qm    (4.2)

where W is the vector of possible demand levels, W = {W1, . . . , WM}, and q = {q1, . . . , qM} is the vector of steady-state probabilities of the demand levels:

qm = Tm / ∑_{m=1}^{M} Tm    (4.3)

For example, in power system reliability analysis, the index (1 − EA) is often used and treated as the loss of load probability [2]. The MSS performance in this case is interpreted as the power system generating capacity.

The value of the MSS expected performance can be determined as

EG = ∑_{k=1}^{K} pk Gk    (4.4)

One can note that the expected MSS performance does not depend on the demand W. Examples of the EG measure are the average productivity (capacity) [2] or the processing speed of the system.

When penalty expenses are proportional to the unsupplied demand, the expected unsupplied demand EU may be used as a measure of system output performance. If in the case of failure (F(Gk, W) < 0) the function −F(Gk, W) expresses the amount of unsupplied demand at state k, the index EU may be presented by the following expression:

EU(W, q) = ∑_{m=1}^{M} ∑_{k=1}^{K} pk qm max{−F(Gk, Wm), 0}    (4.5)

Examples of the EU measure are the unsupplied power in power distribution systems and the expected output tardiness in information processing systems.

In the following section we consider MSS reliability assessment based on the MSS reliability indices introduced above. The reliability assessment methods presented are based on the UGF technique.

4.3 Multi-state System Reliability Indices Evaluation Based on the Universal Generating Function

Ushakov [25] introduced the UGF and formulated its principles of application [26]. The most systematic description of the mathematical aspects of the method can be found in [6], where the method is referred to as a generalized generating sequences approach. A brief overview of the method with respect to its applications for MSS reliability assessment was presented by Levitin et al. [4]. The method was first applied to real system reliability assessment and optimization in [27].

The UGF approach is based on the definition of a u-function of discrete random variables and composition operators over u-functions. The u-function of a variable X is defined as a polynomial

u(z) = ∑_{k=1}^{K} pk z^{Xk}    (4.6)

where the variable X has K possible values and pk is the probability that X is equal to Xk. The composition operators over u-functions of different random variables are defined in order to determine the probabilistic distribution of some functions of these variables.

The UGF extends the widely known ordinary moment generating function [6]. The essential difference between the ordinary generating function and the UGF is that the latter allows one to evaluate an OPD for a wide range of systems characterized by different topologies, different natures of interaction among system elements, and different physical natures of the elements' performance measures. This can be done by introducing different composition operators over the UGF (the only composition operator used with an ordinary moment generating function is the product of polynomials). The UGF composition operators will be defined in the following sections.

In our case, the UGF, represented by the polynomial U(z), can define an MSS OPD, i.e. it represents all the possible states of the system (or element) by relating the probability of each state pk to the performance Gk of the MSS in that state in the following form:

U(t, z) = ∑_{k=1}^{K} pk(t) z^{Gk}    (4.7)

Having an MSS OPD in the form of Equation 4.7, one can obtain the system availability for arbitrary t and W using the following operator δA:

A(t, F, W) = δA[U(t, z), F, W] = δA(∑_{k=1}^{K} pk(t) z^{Gk}, F, W) = ∑_{k=1}^{K} pk(t) α[F(Gk, W)]    (4.8)

where

α(x) = { 1, x ≥ 0; 0, x < 0 }    (4.9)

A multi-state stationary (steady-state) availability was introduced as Pr{G(t) ≥ W} after enough time has passed for this probability to become constant. In the steady state, the distribution of state probabilities is

pk = lim_{t→∞} Pr{G(t) = Gk},  G(t) ∈ {G1, . . . , GK}    (4.10)

The MSS stationary availability may be defined according to Equation 4.1 when the demand is constant or according to Equation 4.2 in the case of variable demand. Thus, for the given MSS OPD represented by the polynomial U(z), the MSS availability can be calculated as

EA(W, q) = ∑_{m=1}^{M} qm δA(U(z), F, Wm)    (4.11)

The expected system output performance value during the operating time (Figure 4.1), defined by Equation 4.4, can be obtained for a given U(z) using the following δG operator:

EG = δG(U(z)) = δG(∑_{k=1}^{K} pk z^{Gk}) = ∑_{k=1}^{K} pk Gk    (4.12)

In order to obtain the expected unsupplied demand EU for the given U(z) and variable demand according to Equation 4.5, the following δU operator should be used:

EU(W, q) = ∑_{m=1}^{M} qm δU(U(z), F, Wm)    (4.13)

where

EU(Wm) = δU(U(z), F, Wm) = δU(∑_{k=1}^{K} pk z^{Gk}, F, Wm) = ∑_{k=1}^{K} pk max(−F(Gk, Wm), 0)    (4.14)


Example 1. Consider, for example, two power system generators with nominal capacity 100 MW as two different MSSs. In the first generator, some types of failure require the capacity to be reduced to 60 MW and some types lead to a complete generator outage. In the second generator, some types of failure require the capacity to be reduced to 80 MW, some types lead to capacity reduction to 40 MW, and some types lead to a complete generator outage. So, there are three possible relative capacity levels that characterize the performance of the first generator:

G^1_1 = 0.0,  G^1_2 = 60/100 = 0.6,  G^1_3 = 100/100 = 1.0

and four relative capacity levels that characterize the performance of the second one:

G^2_1 = 0.0,  G^2_2 = 40/100 = 0.4,  G^2_3 = 80/100 = 0.8,  G^2_4 = 100/100 = 1.0

The corresponding steady-state probabilities are the following:

p^1_1 = 0.1,  p^1_2 = 0.6,  p^1_3 = 0.3

for the first generator and

p^2_1 = 0.05,  p^2_2 = 0.25,  p^2_3 = 0.3,  p^2_4 = 0.4

for the second generator. Now we find the reliability indices for both MSSs for W = 0.5 (required capacity level of 50 MW).

1. The MSS u-functions for the first and second generators respectively, according to Equation 4.7, are as follows:

U^1_MSS(z) = p^1_1 z^{G^1_1} + p^1_2 z^{G^1_2} + p^1_3 z^{G^1_3} = 0.1 + 0.6 z^{0.6} + 0.3 z^{1.0}

U^2_MSS(z) = p^2_1 z^{G^2_1} + p^2_2 z^{G^2_2} + p^2_3 z^{G^2_3} + p^2_4 z^{G^2_4} = 0.05 + 0.25 z^{0.4} + 0.3 z^{0.8} + 0.4 z^{1.0}

2. The MSS stationary availability (Equation 4.8) is

E^1_A(W) = A^1(0.5) = ∑_{G^1_k − W ≥ 0} p^1_k = 0.6 + 0.3 = 0.9

E^2_A(W) = A^2(0.5) = ∑_{G^2_k − W ≥ 0} p^2_k = 0.3 + 0.4 = 0.7

3. The expected MSS performance (Equation 4.12) is

E^1_G = ∑_{k=1}^{3} p^1_k G^1_k = 0.1 × 0 + 0.6 × 0.6 + 0.3 × 1.0 = 0.66

which means 66% of the nominal generating capacity for the first generator, and

E^2_G = ∑_{k=1}^{4} p^2_k G^2_k = 0.05 × 0 + 0.25 × 0.4 + 0.3 × 0.8 + 0.4 × 1.0 = 0.74

which means 74% of the nominal generating capacity for the second generator.

4. The expected unsupplied demand (Equation 4.14) is

E^1_U(W) = ∑_{G^1_k − W < 0} p^1_k (W − G^1_k) = 0.1 × (0.5 − 0.0) = 0.05

E^2_U(W) = ∑_{G^2_k − W < 0} p^2_k (W − G^2_k) = 0.05 × (0.5 − 0.0) + 0.25 × (0.5 − 0.4) = 0.05

In this case, EU may be interpreted as the expected electric power unsupplied to consumers. The absolute value of this unsupplied demand is 5 MW for both generators. Multiplying this index by the considered system operating time, one can obtain the expected unsupplied energy.
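Using the dict-based sketch from earlier in this section, the indices of Example 1 can be reproduced directly (the expected values are shown in the comments):

u1 = {0.0: 0.1, 0.6: 0.6, 1.0: 0.3}               # first generator
u2 = {0.0: 0.05, 0.4: 0.25, 0.8: 0.3, 1.0: 0.4}   # second generator

delta_A(u1, 0.5), delta_A(u2, 0.5)  # (0.9, 0.7)
delta_G(u1), delta_G(u2)            # (0.66, 0.74)
delta_U(u1, 0.5), delta_U(u2, 0.5)  # (0.05, 0.05)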

Note that since the reliability indices obtained have a different nature, they cannot be used interchangeably. In Example 1, for instance, the first generator performs better than the second one when availability is considered (E^1_A(0.5) > E^2_A(0.5)), the second generator performs better than the first one when expected productivity is considered (E^2_G > E^1_G), and both generators have the same unsupplied demand (E^1_U(0.5) = E^2_U(0.5)).

4.4 Determination of u-function of Complex Multi-state System Using Composition Operators

Real-world MSSs are often very complex and consist of a large number of elements connected in different ways. To obtain the MSS OPD and the corresponding u-function, we must develop some rules to determine the system u-function based on the individual u-functions of its elements.

In order to obtain the u-function of a subsystem (component) containing a number of elements, composition operators are introduced. These operators determine the subsystem u-function, expressed as a polynomial U(z), for a group of elements using simple algebraic operations over the individual u-functions of the elements. All the composition operators for two different elements with random performance rates g1 ∈ {g1i} (1 ≤ i ≤ I) and g2 ∈ {g2j} (1 ≤ j ≤ J) take the form

⊗ω(u1(z), u2(z)) = ⊗ω[∑_{i=1}^{I} p1i z^{g1i}, ∑_{j=1}^{J} p2j z^{g2j}] = ∑_{i=1}^{I} ∑_{j=1}^{J} p1i p2j z^{ω(g1i, g2j)}    (4.15)

where u1(z), u2(z) are the individual u-functions of the elements and ω(·) is a function defined according to the physical nature of the MSS performance and the interactions between MSS elements. The function ω(·) in the composition operators expresses the entire performance of a subsystem consisting of different elements in terms of the individual performances of the elements. The definition of the function ω(·) strictly depends on the type of connection between the elements in the reliability diagram sense, i.e. on the topology of the subsystem structure. It also depends on the physical nature of the system performance measure.

For example, in an MSS where the performance measure is defined as capacity or productivity (MSSc), the total capacity of a pair of elements connected in parallel is equal to the sum of the capacities of the elements. Therefore, the function ω(·) in the composition operator takes the form

ω(g1, g2) = g1 + g2    (4.16)

For a pair of elements connected in series, the element with the least capacity becomes the bottleneck of the system. In this case, the function ω(·) takes the form

ω(g1, g2) = min(g1, g2)    (4.17)

In MSSs where the performances of elements are characterized by their processing speed (MSSs) and parallel elements cannot share their work, the task is assumed to be completed by the group of parallel elements when it is completed by at least one of the elements. The entire group processing speed is defined by the maximum element processing speed:

ω(g1, g2) = max(g1, g2)    (4.18)

If a system contains two elements connected in series, then the total processing time is equal to the sum of the processing times t1 and t2 of the individual elements: T = t1 + t2 = g1^{−1} + g2^{−1}. Therefore, the total processing speed of the system can be obtained as T^{−1} = g1 g2/(g1 + g2), and the ω(·) function for a pair of elements is defined as

ω(g1, g2) = g1 g2/(g1 + g2)    (4.19)

Composition operators ⊗ω were determined in [4, 27] for several important types of series–parallel MSS. Some additional composition operators were also derived for bridge structures [5, 28].

One can see that the operators for series and parallel connection satisfy the following conditions:

⊗ω{u1(z), . . . , uk(z), uk+1(z), . . . , un(z)} = ⊗ω{u1(z), . . . , uk+1(z), uk(z), . . . , un(z)}

⊗ω{u1(z), . . . , uk(z), uk+1(z), . . . , un(z)} = ⊗ω{⊗ω{u1(z), . . . , uk(z)}, ⊗ω{uk+1(z), . . . , un(z)}}    (4.20)

Therefore, applying the ⊗ω operators in sequence, one can obtain the u-function representing the system performance distribution for an arbitrary number of elements connected in series and in parallel.
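In code, Equation 4.15 is a double loop over the two OPDs that collects like terms. A short sketch in the dict representation used earlier, with the structure functions of Equations 4.16–4.19 written as plain Python functions:

def compose(u1, u2, omega):
    """Composition operator of Equation 4.15 for a pair of elements."""
    out = {}
    for g1, p1 in u1.items():
        for g2, p2 in u2.items():
            g = omega(g1, g2)                   # performance of the pair
            out[g] = out.get(g, 0.0) + p1 * p2  # collect like terms of z^g
    return out

def parallel_capacity(g1, g2): return g1 + g2      # Equation 4.16
def series_capacity(g1, g2):   return min(g1, g2)  # Equation 4.17
def parallel_speed(g1, g2):    return max(g1, g2)  # Equation 4.18
def series_speed(g1, g2):                          # Equation 4.19
    return g1 * g2 / (g1 + g2) if g1 + g2 > 0 else 0.0

Because the operators are commutative and associative (Equation 4.20), compose() can be folded over any number of elements in any order.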

Example 2. Consider a system consisting of two elements with total failures connected in parallel. The elements have nominal performances g1 and g2 (g1 < g2) and constant availabilities a1 and a2 respectively. In the failure state the elements have performance zero. Even when a system consists of elements with total failures, it should be considered as a multi-state one when its output performance rate is of interest. Indeed, the u-functions of the individual elements are (1 − a1) z^0 + a1 z^{g1} and (1 − a2) z^0 + a2 z^{g2} respectively. The u-function for the entire MSS is

UMSS(z) = ⊗ω[u1(z), u2(z)] = ⊗ω[(1 − a1) z^0 + a1 z^{g1}, (1 − a2) z^0 + a2 z^{g2}]

which for MSSc takes the form

U(z) = (1 − a1)(1 − a2) z^0 + a1(1 − a2) z^{g1} + a2(1 − a1) z^{g2} + a1 a2 z^{g1+g2}

and for MSSs takes the form

U(z) = (1 − a1)(1 − a2) z^0 + a1(1 − a2) z^{g1} + a2(1 − a1) z^{g2} + a1 a2 z^{max(g1,g2)} = (1 − a1)(1 − a2) z^0 + a1(1 − a2) z^{g1} + a2 z^{g2}

The measures of the system output performance obtained for the failure criterion F(G, W) = G − W < 0 according to Equations 4.11–4.13 for both types of MSS are presented in Table 4.1.

4.5 Importance and Sensitivity Analysis of Multi-state Systems

Methods for evaluating the relative influence of component availability on the availability of the entire system provide useful information about the importance of these elements. Importance evaluation is a key point in tracing bottlenecks in systems and in the identification of the most important components. It is a useful tool to help the analyst find weaknesses in a design and to suggest modifications for system upgrade.

Various importance measures have been introduced previously. The first importance measure was introduced by Birnbaum [29]. This index characterizes the rate at which the system reliability changes with respect to changes in the reliability of a given element. Thus, an improvement in the reliability of the element with the highest importance causes the greatest increase in system reliability. Several other measures of element and minimal cut set importance in coherent systems were developed by Barlow and Proschan [30] and Vesely [31].

The above importance measures have been defined for coherent systems consisting of binary components. In multi-state systems the failure effect will be essentially different for system elements with different performance levels. Therefore, the performance levels of system elements should be taken into account when their importance is estimated. Some extensions of importance measures for coherent MSSs have been suggested, e.g. [11, 30], and for non-coherent MSSs [32].

From Equations 4.11 and 4.14 one can see that the entire MSS reliability indices are complex functions of the demand W, which is an additional factor having a strong impact on an element's importance in multi-state systems. Availability of a certain component may be very important for one demand level and less important for another.

For the complex system structure, where each system component can have a large number of possible performance levels and there can be a large number of demand levels M, the importance evaluation for each component obtained using existing methods requires an unaffordable effort. Such a problem is quite difficult to formalize because of the great number of logical functions for the top-event description when we use the logic methods, and the great number of states when the Markov technique is used.


Table 4.1. Measures of system performance obtained for MSS

MSS type | EA | EU | EG | W
MSSc | 0 | W − a1g1 − a2g2 | a1g1 + a2g2 | W > g1 + g2
     | a1a2 | g1a1(a2 − 1) + g2a2(a1 − 1) + W(1 − a1a2) | | g2 < W ≤ g1 + g2
     | a2 | (1 − a2)(W − g1a1) | | g1 < W ≤ g2
     | a1 + a2 − a1a2 | (1 − a1)(1 − a2)W | | 0 < W ≤ g1
MSSs | 0 | W − a1g1 − a2g2 + a1a2g1 | a1(1 − a2)g1 + a2g2 | W > g2
     | a2 | (1 − a2)(W − g1a1) | | g1 < W ≤ g2
     | a1 + a2 − a1a2 | (1 − a1)(1 − a2)W | | 0 < W ≤ g1


In this section we demonstrate the method for the Birnbaum importance calculation based on the UGF technique. The method provides the importance evaluation for complex MSSs with different physical natures of performance and also takes into account the demand.

The natural generalization of the Birnbaum importance for MSSs consisting of elements with total failures is the rate at which the MSS reliability index changes with respect to changes in the availability of a given element i. For the constant demand W, the element importance can be obtained as

∂A(W)/∂ai    (4.21)

where ai is the availability of the ith element at the given moment and A(W) is the availability of the entire MSS, which can be obtained for an MSS with a given structure, parameters, and demand using Equation 4.8.

For the variable demand represented by vectors W and q, the sensitivity of the generalized MSS availability EA to the availability of the given element i is

SA(i) = ∂EA(W, q)/∂ai    (4.22)

where the index EA can be obtained using Equation 4.11.

In the same way, one can obtain the sensitivity of the EG and EU indices for an MSS to the availability of the given element i as

SG(i) = ∂EG/∂ai    (4.23)

and

SU(i) = |∂EU(W, q)/∂ai|    (4.24)

Since EU is a decreasing function of ai for each element i, the absolute value of the derivative is considered to estimate the degree of influence of element reliability on the unsupplied demand.

It can easily be seen that all the suggested measures of system performance (EA, EG, EU) are linear functions of the elements' availability. Therefore, the corresponding sensitivities can easily be obtained by calculating the performance measures for two different values of availability.
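This observation translates into a two-evaluation scheme: because the index is linear in each ai, the slope between any two evaluations equals the exact derivative. A sketch, where index_of_a is a hypothetical callback that recomputes the chosen index with element i's availability set to the given value:

def sensitivity(index_of_a, a_low=0.0, a_high=1.0):
    """Exact dE/da_i for an index E that is linear in a_i."""
    return (index_of_a(a_high) - index_of_a(a_low)) / (a_high - a_low)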

The sensitivity indices for each MSS element depend strongly on the element's place in the system, its nominal performance level, and the system demand.

Example 3. Analytical example. Consider the system consisting of two elements with total failures connected in parallel. The availabilities of the elements are a1 and a2 and the nominal performance rates are g1 and g2 (g1 < g2). The analytically obtained measures of the system output performance for MSSs of both types are presented in Table 4.1. The sensitivity indices can also be obtained analytically; these indices for MSSc and MSSs are presented in Table 4.2. Note that the sensitivity indices are different for MSSs of different types, even in this simplest case.


Table 4.2. Sensitivity indices obtained for MSS

MSS type | SA(1) | SA(2) | SU(1) | SU(2) | SG(1) | SG(2) | W
MSSc | 0 | 0 | −g1 | −g2 | g1 | g2 | W > g1 + g2
     | a2 | a1 | (a2 − 1)g1 − a2(W − g2) | (a1 − 1)g2 − a1(W − g1) | | | g2 < W ≤ g1 + g2
     | 0 | 1 | (a2 − 1)g1 | a1g1 − W | | | g1 < W ≤ g2
     | 1 − a2 | 1 − a1 | (a2 − 1)W | (a1 − 1)W | | | 0 < W ≤ g1
MSSs | 0 | 0 | (a2 − 1)g1 | a1g1 − g2 | g1(1 − a2) | g2 − a1g1 | W > g2
     | 0 | 1 | (a2 − 1)g1 | a1g1 − W | | | g1 < W ≤ g2
     | 1 − a2 | 1 − a1 | (a2 − 1)W | (a1 − 1)W | | | 0 < W ≤ g1

Figure 4.2. Example of series–parallel MSS


Numerical example. The series–parallel system presented in Figure 4.2 consists of ten elements of six different types. The nominal performance rate gi and the availability ai of each type of element are presented in Table 4.3.

We will consider the example of the system with the given structure and parameters as MSSc and MSSs separately. The cumulative demand curves q1, W1 for MSSc and q2, W2 for MSSs are presented in Table 4.4 (q1 = q2 = q).

The sensitivity indices estimated for both types of system are presented in Table 4.5. These indices depend strongly on the element's place in the system, on the element's nominal performance, and on the system demand.

Table 4.3. Parameters of system elements

No. of element i | Nominal performance rate gi | Availability ai
1 | 0.40 | 0.977
2 | 0.60 | 0.945
3 | 0.30 | 0.918
4 | 1.30 | 0.983
5 | 0.85 | 0.997
6 | 0.25 | 0.967

Table 4.4. Cumulative demand curves

q | 0.15 | 0.25 | 0.35 | 0.25
W1 (ton h^{-1}) | 1.00 | 0.90 | 0.70 | 0.50
W2 (kbytes s^{-1}) | 0.30 | 0.27 | 0.21 | 0.15

The dependencies of the sensitivity index SA on the system demand are presented in Figures 4.3 and 4.4. Here, the system demand variation is defined as the vector kW, where W is the initial demand vector given for each system in Table 4.4 and k is the demand variation coefficient. As can be seen from Figures 4.3 and 4.4, the SA(k) are complicated non-monotonic piecewise continuous functions for both types of system. The order of elements according to their SA indices changes when the demand varies. By using the graphs, all the MSS elements can be ordered according to their importance for any required demand level.


Table 4.5. Sensitivity indices obtained for MSS (numerical example)

Element no. | MSSc: SA | SG | SU | MSSs: SA | SG | SU
1 | 0.0230 | 0.1649 | 0.0035 | 0.1224 | 0.0194 | 0.0013
2 | 0.7333 | 0.4976 | 0.2064 | 0.4080 | 0.0730 | 0.0193
3 | 0.1802 | 0.2027 | 0.0389 | 0.1504 | 0.0292 | 0.0048
4 | 0.9150 | 1.0282 | 0.7273 | 0.9338 | 0.3073 | 0.2218
5 | 0.9021 | 0.7755 | 0.4788 | 0.7006 | 0.1533 | 0.0709
6 | 0.3460 | 0.2022 | 0.0307 | 0.1104 | 0.0243 | 0.0023

Figure 4.3. Sensitivity SA as a function of demand for MSSc

Figure 4.4. Sensitivity SA as a function of demand for MSSs

On comparing the graphs in Figures 4.3 and 4.4, one can notice that the physical nature of MSS performance has a great impact on the relative importance of elements. For example, for 0.6 ≤ k ≤ 1.0, SA(4) and SA(5) are almost equal for MSSc and essentially different for MSSs.

The fourth element is the most important one for both types of system and for all the performance criteria. The order of elements according to their sensitivity indices changes when different system reliability measures are considered. For example, the sixth element is more important than the third one when SA is considered for MSSc, but is less important when SG and SU are considered.

The UGF approach makes it possible to easily evaluate the dependencies of the sensitivity indices on the nominal performance rates of elements.


Figure 4.5. Sensitivity SG as a function of nominal performancerates of elements 5 and 6 for MSSc

Figure 4.6. Sensitivity SG as a function of nominal performancerates of elements 5 and 6 for MSSs

The sensitivities SG(5), which are increasing functions of g5 and decreasing functions of g6 for both types of system, are presented in Figures 4.5 and 4.6. As elements 5 and 6 share their work, the importance of each element is proportional to its share in the total component performance.

4.6 Multi-state System Structure Optimization Problems

The UGF technique allows the system performance distribution, and thereby its reliability indices, to be evaluated using a fast procedure. The system reliability indices can be obtained as a function of the system structure (topology and number of elements), performance rates, and reliability indices of its elements. Therefore, numerous optimization problems can be formulated in which the optimal composition of all or part of the factors influencing the entire MSS reliability has to be found subject to different constraints (e.g. system cost). The performance rates and reliability indices of the elements composing the system can be supplied by the producer of the elements or obtained from field testing.

In order to provide a required level of system reliability, redundant elements are included. Usually, engineers try to achieve this level with minimal cost. The problem of total investment cost minimization, subject to reliability constraints, is well known as the redundancy optimization problem. The redundancy optimization problem for an MSS, which may consist of elements with different performance rates and reliability, is a problem of system structure optimization.

In order to solve practical problems in which a variety of products exist on the market and analytical dependencies are unavailable for the cost of system components, the reliability engineer should have an optimization methodology in which each product is characterized by its productivity, reliability, price, and/or other parameters. To distinguish among products with different characteristics, the notion of element version is introduced. To find the optimal system


structure, one should choose the appropriate versions from a list of available products for each type of equipment, as well as the number of parallel elements of these versions. The objective is to minimize some criterion of MSS quality (e.g. the total MSS cost) subject to the requirement of meeting the demand with the desired level of reliability, or to maximize the reliability subject to constraints imposed on the system parameters.

The most general formulation of the MSS structure optimization problem is as follows. Consider a system consisting of N components. Each component of type n contains a number of different elements connected in parallel. Different versions and numbers of elements may be chosen for any given system component. Element operation is characterized by its availability and nominal performance rate.

For each component n there are Hn element versions available. A vector of parameters (cost, availability, nominal performance rate, etc.) can be specified for each version h of element of type n. The structure of system component n is defined, therefore, by the numbers of parallel elements of each version, r(n, h) for 1 ≤ h ≤ Hn. The vectors rn = {r(n, h)} (1 ≤ n ≤ N, 1 ≤ h ≤ Hn) define the entire system structure. For a given set of vectors R = {r1, r2, . . . , rN}, the entire system reliability index E(W, R) can be obtained, as well as additional MSS characteristics C(W, R), such as cost, weight, etc.

Now consider two possible formulations of the problem of system structure optimization:

Formulation 1. Find a system configuration R∗ that provides the maximal value of the system reliability index subject to constraints imposed on other system parameters:

R∗ = arg{E(W, R)→ max | C(W, R) < C∗}    (4.25)

where C∗ is the maximal allowable value of the system parameter C.

Formulation 2. Find a system configuration R∗ that minimizes the system characteristic C(W, R) while providing the desired level of the system reliability index E(W, R):

R∗ = arg{C(W, R)→ min | E(W, R) ≥ E∗}    (4.26)

where E∗ is the minimal allowable level of the system reliability index.

4.6.1 Optimization Technique

Equations 4.25 and 4.26 formulate a complicated optimization problem having an enormous number of possible solutions. An exhaustive examination of all these solutions is not realistic, considering reasonable time limitations. As in most optimal allocation problems, the quality of a given solution is the only information available during the search for the optimal solution. Therefore, a heuristic search algorithm is needed that uses only estimates of solution quality and does not require derivative information to determine the next direction of the search.

The recently developed family of GAs is based on the simple principle of evolutionary search in the solution space. GAs have been proven to be effective optimization tools for a large number of applications.

It is recognized that GAs have the theoretical property of global convergence [33]. Despite the fact that their convergence reliability and convergence velocity are contradictory, for most practical, moderately sized combinatorial problems the proper choice of GA parameters allows solutions close enough to the global optimum to be obtained in a relatively short time.

4.6.1.1 Genetic Algorithm

The basic notions of GAs were originally inspired by biological genetics. GAs operate with a "chromosomal" representation of solutions, to which crossover, mutation, and selection procedures are applied. Unlike various constructive optimization algorithms that use sophisticated methods to obtain a good singular solution, the GA deals with a set of solutions (population) and tends to manipulate each solution in the simplest manner.

Page 107: Handbook of Reliability Engineering - pdi2.org

74 System Reliability and Optimization

"Chromosomal" representation requires the solution to be coded as a finite-length string (in our GA we use strings of integers).

Detailed information on GAs can be found in Goldberg's comprehensive book [34], and recent developments in GA theory and practice can be found in the books by Gen and Cheng [17] and Back [33]. The basic structure of the steady-state version of the GA, referred to as GENITOR [35], is as follows.

First, an initial population of Ns randomly constructed solutions (strings) is generated. Within this population, new solutions are obtained during the genetic cycle by using crossover and mutation operators. The crossover produces a new solution (offspring) from a randomly selected pair of parent solutions, facilitating the inheritance of some basic properties of the parents by the offspring. The probability of selecting a solution as a parent is proportional to the rank of this solution. (All the solutions in the population are ranked in increasing order of their fitness.) In this work, we use the so-called two-point (or fragment) crossover operator, which creates the offspring string for a given pair of parent strings by copying the string elements belonging to the fragment between two randomly chosen positions from the first parent and by copying the rest of the string elements from the second parent. The following example illustrates the crossover procedure (here the fragment spans positions 3–7 and is copied from the first parent):

Parent string: 7 1 3 5 5 1 4 6
Parent string: 1 1 8 6 2 3 5 7
Offspring string: 1 1 3 5 5 1 4 7

Each offspring solution undergoes mutation, which results in slight changes to the offspring's structure and maintains the diversity of solutions. This procedure avoids premature convergence to a local optimum and facilitates jumps in the solution space. The positive changes in the solution code created by the mutation can later be propagated throughout the population via crossovers. In our GA, the mutation procedure swaps elements initially located in two randomly chosen positions on the string. The following

example illustrates the mutation procedure (here the elements in positions 3 and 7 are swapped):

Solution encoding string before mutation: 1 1 3 5 5 1 4 7
Solution encoding string after mutation: 1 1 4 5 5 1 3 7
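A minimal sketch of these two operators, assuming solutions are plain Python lists of integers (the strings above):

import random

def two_point_crossover(parent1, parent2):
    """Copy a random fragment from parent1 and the rest from parent2."""
    i, j = sorted(random.sample(range(len(parent1)), 2))
    return parent2[:i] + parent1[i:j + 1] + parent2[j + 1:]

def swap_mutation(solution):
    """Swap the genes at two randomly chosen positions."""
    i, j = random.sample(range(len(solution)), 2)
    mutated = solution[:]  # keep the original offspring intact
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

With fragment positions i = 2 and j = 6 (0-based), two_point_crossover reproduces the offspring 1 1 3 5 5 1 4 7 shown above.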

Each new solution is decoded, and its objective function (fitness) values are estimated. These values, which are a measure of quality, are used to compare different solutions. The comparison is accomplished by a selection procedure that determines which solution is better: the newly obtained solution or the worst solution in the population. The better solution joins the population, while the other is discarded. If the population contains equivalent solutions following selection, the redundancies are eliminated and the population size decreases as a result.

After new solutions are produced Nrep times, new randomly constructed solutions are generated to replenish the shrunken population, and a new genetic cycle begins.

Note that each time a new solution has sufficient fitness to enter the population, it alters the pool of prospective parent solutions and increases the average fitness of the current population. The average fitness increases monotonically (or, in the worst case, does not vary) during each genetic cycle. But at the beginning of a new genetic cycle the average fitness can decrease drastically due to the inclusion of poor random solutions into the population. These new solutions are necessary to bring new "genetic material" into the population; this widens the search space and, like the mutation operator, prevents premature convergence to a local optimum.

The GA is terminated after Nc genetic cycles. The final population contains the best solution achieved. It also contains different near-optimal solutions, which may be of interest in the decision-making process.

The choice of GA parameters depends on the specific optimization problem. One can find information about parameter choice in [4, 5, 36, 37].


To apply the GA to a specific problem, a solution representation and a decoding procedure must be defined.

4.6.1.2 Solution Representation and Decoding Procedure

To provide the possibility of choosing a combination of elements of different versions, our GA deals with integer strings of length L, where L is the total number of versions available:

L = ∑_{i=1}^{N} Hi    (4.27)

Each solution is represented by a string S = {s1, s2, . . . , sL}, where for each

j = ∑_{i=1}^{n−1} Hi + h    (4.28)

sj denotes the number of parallel elements of type n and version h: r(n, h) = sj.

For example, for a problem with N = 3, H1 = 3, H2 = 2, and H3 = 3, we have L = 8, and the string {0 2 1 0 3 4 0 0} represents a solution in which the first component contains two elements of version 2 and one element of version 3, the second component contains three elements of version 2, and the third component contains four elements of version 1.

Any arbitrary integer string represents a feasible solution but, in order to reduce the search space, the total number of parallel elements belonging to each component should be limited. To provide this limitation, the solution decoding procedure transforms the string S in the following way: for each component n, for versions from h = 1 to h = Hn in sequence, define

r(n, h) = min{sj, D − ∑_{i=1}^{h−1} r(n, i)}    (4.29)

where D is a specified constant that limits the number of parallel elements and j is the position in string S defined by Equation 4.28. All the values of the elements of S are produced by a random generator in the range 0 ≤ sj ≤ D.

where D is a specified constant that limitsthe number of parallel elements and j is thenumber of elements of string S defined usingEquation 4.28. All the values of the elements of Sare produced by a random generator in the rangeD ≥ sj ≥ 0.

For example, for a component n with Hm =6 and D = 8 its corresponding substring ofS {1 3 2 4 2 1} will be transformed into substring{1 3 2 2 0 0}.
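The truncation rule of Equation 4.29 is a running-sum clip. A sketch, operating on one component's substring:

def decode_component(substring, D):
    """Clip version counts so the component has at most D parallel elements."""
    r, total = [], 0
    for s in substring:
        take = min(s, D - total)  # Equation 4.29
        r.append(take)
        total += take
    return r

decode_component([1, 3, 2, 4, 2, 1], D=8)  # -> [1, 3, 2, 2, 0, 0]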

In order to illustrate the MSS structure optimization approach, we present three optimization problems in which MSSs of different types are considered. In the following sections, for each type of MSS we formulate the optimization problem, describe the method of quality evaluation for an arbitrary solution generated by the GA, and present an illustrative example.

4.6.2 Structure Optimization of Series–Parallel System with Capacity-based Performance Measure

4.6.2.1 Problem Formulation

A system consisting of N components connected in series is considered (Figure 4.7). Each component of type n contains a number of different elements with total failures connected in parallel.

For each component n there are a number of element versions available in the market. A vector of parameters g_{nh}, a_{nh}, c_{nh} can be specified for each version h of element of type n. This vector contains the nominal capacity, availability, and cost of the element respectively. The element capacity is measured as a percentage of the nominal total system capacity.

The entire MSS should provide a capacity level not less than W (the failure criterion is F(G, W) = G − W < 0).

For a given set of vectors r_1, r_2, ..., r_N the total cost of the system can be calculated as

C = \sum_{n=1}^{N} \sum_{h=1}^{H_n} r(n, h) c_{nh}   (4.30)

The problem of series–parallel system structure optimization is as follows. Find the minimal cost system configuration R^* that provides the required availability level E_A^*:

R^* = \arg\{C(R) \to \min \mid E_A(W, R) \ge E_A^*\}   (4.31)


Figure 4.7. Series–parallel MSS structure

4.6.2.2 Solution Quality Evaluation

Using the u-transform technique, the following procedure for E_A index evaluation is used. Since the capacity g_{nh} and availability a_{nh} are given for each element of type n and version h, and the number of such elements is determined by the jth element of string S (Equation 4.28), one can represent the u-function of the subsystem containing r(n, h) parallel identical elements as

u_{nh}(z) = [a_{nh} z^{g_{nh}} + (1 - a_{nh})]^{r(n,h)} = \sum_{k=0}^{r(n,h)} \frac{r(n,h)!}{k!\,(r(n,h)-k)!}\, a_{nh}^{k} (1 - a_{nh})^{r(n,h)-k} z^{k g_{nh}}   (4.32)

(This equation is obtained by using the composition operator (Equation 4.15) with the function in Equation 4.16 corresponding to parallel elements with capacity-based performance.)
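For concreteness, a u-function can be represented as a mapping from performance rate to probability; the following sketch builds the binomial expansion of Equation 4.32 under that representation (which is our choice, not one prescribed by the text).

```python
from math import comb

def u_parallel_identical(a, g, r):
    """u-function (Equation 4.32) of r identical parallel elements with
    availability a and nominal capacity g, as {performance: probability}."""
    return {k * g: comb(r, k) * a**k * (1 - a)**(r - k) for k in range(r + 1)}

# Two parallel elements of version 1 from Table 4.6 (a = 0.98, g = 120):
# {0: 0.0004, 120: 0.0392, 240: 0.9604}
u = u_parallel_identical(0.98, 120, 2)
```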

To obtain the u-function for the entire component n, represented by the elements of string S with position numbers from \sum_{i=1}^{n-1} H_i + 1 to \sum_{i=1}^{n} H_i, one can use the same operator over the u-functions u_{nh}(z) for 1 ≤ h ≤ H_n:

U_n(z) = \otimes(u_{n1}(z), \ldots, u_{nH_n}(z)) = \prod_{j=1}^{H_n} u_{nj}(z) = \sum_{k=1}^{V_n} \mu_k z^{\beta_k}   (4.33)

where V_n is the total number of different states of the component n, β_k is the output performance rate of the component in state k, and µ_k is the probability of the state k.
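Because the exponents of the u-functions (capacities) add under parallel composition while the coefficients (probabilities) multiply, Equation 4.33 is an ordinary polynomial product; a sketch under the dictionary representation assumed above:

```python
from functools import reduce

def compose_parallel(u1, u2):
    """Capacity-based parallel composition: probabilities multiply,
    performance rates add (used to build Equation 4.33)."""
    out = {}
    for b1, m1 in u1.items():
        for b2, m2 in u2.items():
            out[b1 + b2] = out.get(b1 + b2, 0.0) + m1 * m2
    return out

def component_u(subsystem_us):
    """U_n(z) as the product of the subsystem u-functions."""
    return reduce(compose_parallel, subsystem_us)
```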

In the case of a capacity-based MSS, in any combination of states of components connected in series in which at least one component has a performance rate lower than the demand, the output performance of the entire system is also lower than the demand. In this case, there is no need to obtain the u-function of the entire system from the u-functions of its components connected in series. Indeed, when considering only that part of the polynomial U_n(z) that provides a capacity exceeding a given demand level W_m, one has to take into account only the terms for which β_k ≥ W_m. Therefore, the following sum should be


calculated:

p_n(W_m) = \delta_A(U_n(z), F, W_m) = \sum_{\beta_k \ge W_m} \mu_k   (4.34)

One can calculate the probability of providing a capacity that exceeds the level W_m for the entire system containing N components connected in series as

A(W_m) = \prod_{n=1}^{N} p_n(W_m)   (4.35)

and obtain the total E_A index for the variable demand using Equation 4.11.
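A sketch of this series reduction follows; the last function assumes that Equation 4.11 (not reproduced here) defines E_A as the T_m-weighted average of A(W_m) over the cumulative demand curve, which is how the index is used with Table 4.7.

```python
def p_exceed(u, w):
    """Equation 4.34: probability that a component provides capacity >= w."""
    return sum(mu for beta, mu in u.items() if beta >= w)

def series_availability(component_us, w):
    """Equation 4.35: product over the series-connected components."""
    a = 1.0
    for u in component_us:
        a *= p_exceed(u, w)
    return a

def ea_index(component_us, W, T):
    """E_A for variable demand, assuming a T_m-weighted average of A(W_m)."""
    return sum(t * series_availability(component_us, w)
               for w, t in zip(W, T)) / sum(T)
```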

In order to let the GA look for the solution with minimal total cost and with an E_A that is not less than the required value E_A^*, the solution quality (fitness) is evaluated as follows:

fitness = λ − φ(E_A^* − E_A) − \sum_{n=1}^{N} \sum_{j=1}^{H_n} r(n, j) c_{nj}   (4.36)

where

φ(x) = θx for x ≥ 0, and φ(x) = 0 for x < 0   (4.37)

and θ is a sufficiently large penalty coefficient, a constant much greater than any possible system cost. For solutions meeting the requirement E_A > E_A^*, the fitness of a solution depends only on the total system cost.
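A direct transcription of Equations 4.36 and 4.37 (the constants lam and theta are tuning values whose magnitudes are only constrained as described above):

```python
def fitness(ea, ea_required, total_cost, lam=1e3, theta=1e6):
    """Penalized solution quality of Equations 4.36 and 4.37."""
    shortfall = ea_required - ea
    penalty = theta * shortfall if shortfall >= 0 else 0.0
    return lam - penalty - total_cost
```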

Example 4. A power station coal transportation system that supplies the boiler consists of five basic components:

1. primary feeder, which loads the coal from the bin to the primary conveyor;

2. primary conveyor, which transports the coal to the stacker–reclaimer;

3. stacker–reclaimer, which lifts the coal up to the burner level;

4. secondary feeder, which loads the secondary conveyor;

5. secondary conveyor, which supplies the burner feeding system of the boiler.

Each element of the system is considered as a unit with total failures. The characteristics of the products available in the market for each type of equipment are presented in Table 4.6. This table shows the availability a, nominal capacity g (given as a percentage of the nominal boiler capacity), and unit cost c. Table 4.7 contains the data of the piecewise cumulative boiler demand curve.

For this type of problem, we define the minimal cost system configuration that provides the desired reliability level E_A ≥ E_A^*. The results obtained for different desired values of the index E_A^* are presented in Table 4.8. Optimal solutions for a system in which each component can contain only identical parallel elements are given for comparison. System structure is represented by an ordered sequence of strings. Each string has the format n: r_1 ∗ h_1, ..., r_i ∗ h_i, ..., r_k ∗ h_k, where n is a number denoting the system component, and r_i is the number of elements of version h_i belonging to the corresponding component.

One can see that the algorithm allocating different elements within a component allows much more effective solutions to be obtained. For example, the cost of the system configuration with different parallel elements obtained for E_A^* = 0.975 is 21% less than the cost of the optimal configuration with identical parallel elements.

The detailed description of the optimization method applied to MSSs with capacity- and processing-speed-based performance measures can be found in [4, 5, 36]. Structure optimization for MSSs with bridge topology is described in [5, 28].

4.6.3 Structure Optimization of Multi-state System with Two Failure Modes

4.6.3.1 Problem Formulation

Systems with two failure modes (STFMs) consist of devices that can fail in either of two different modes. For example, switching systems not only can fail to close when commanded to close but can also fail to open when commanded to open. A typical example of a switching device with two failure modes is a fluid flow valve.


Table 4.6. Characteristics of the system elements available in the market

Component no.  Description         Version no.  g (%)  a      c ($10^6)
1              Primary feeder      1            120    0.980  0.590
                                   2            100    0.977  0.535
                                   3             85    0.982  0.470
                                   4             85    0.978  0.420
                                   5             48    0.983  0.400
                                   6             31    0.920  0.180
                                   7             26    0.984  0.220
2              Primary conveyor    1            100    0.995  0.205
                                   2             92    0.996  0.189
                                   3             53    0.997  0.091
                                   4             28    0.997  0.056
                                   5             21    0.998  0.042
3              Stacker–reclaimer   1            100    0.971  7.525
                                   2             60    0.973  4.720
                                   3             40    0.971  3.590
                                   4             20    0.976  2.420
4              Secondary feeder    1            115    0.977  0.180
                                   2            100    0.978  0.160
                                   3             91    0.978  0.150
                                   4             72    0.983  0.121
                                   5             72    0.981  0.102
                                   6             72    0.971  0.096
                                   7             55    0.983  0.071
                                   8             25    0.982  0.049
                                   9             25    0.977  0.044
5              Secondary conveyor  1            128    0.984  0.986
                                   2            100    0.983  0.825
                                   3             60    0.987  0.490
                                   4             51    0.981  0.475

Table 4.7. Parameters of the cumulative demand curve

Wm (%)  100   80   50    20
Tm (h)  4203  788  1228  2536

It is known that redundancy, introduced to increase the reliability of a system without any change in the reliability of the individual devices that form the system, may in the case of an STFM either increase or decrease the entire system reliability [38].

Consider a series–parallel switching system designed to connect or disconnect its input A_1 and output B_N according to a command (Figure 4.8). The system consists of N components connected in series. Each component contains a number of elements connected in parallel. When commanded to close (connect), the system can perform the task when the nodes A_n and B_n (1 ≤ n ≤ N) are connected in each component by at least one element. When commanded to open (disconnect), the system can perform the task if, in at least one of the components, the nodes A_n and B_n are disconnected, which can occur only if all elements of this component are disconnected. This duality in element and component roles in the two operation modes creates a situation in which redundancy, while increasing system reliability in an open mode, decreases it in a closed mode, and vice versa.


Table 4.8. Parameters of the optimal solutions

       Identical elements               Different elements
E_A^*  E_A    C       Structure         E_A    C       Structure
0.975  0.977  16.450  1: 2∗2            0.976  12.855  1: 2∗4, 1∗6
                      2: 2∗3                           2: 6∗5
                      3: 3∗2                           3: 1∗1, 1∗4
                      4: 3∗7                           4: 3∗7
                      5: 1∗2                           5: 3∗4
0.980  0.981  16.520  1: 2∗2            0.980  14.770  1: 2∗4, 1∗6
                      2: 6∗5                           2: 2∗3
                      3: 3∗2                           3: 1∗2, 2∗3
                      4: 3∗7                           4: 3∗7
                      5: 1∗2                           5: 2∗3, 1∗4
0.990  0.994  17.050  1: 2∗2            0.992  15.870  1: 2∗4, 1∗6
                      2: 2∗3                           2: 2∗3
                      3: 3∗2                           3: 2∗2, 1∗3
                      4: 3∗7                           4: 3∗7
                      5: 3∗4                           5: 3∗4

Figure 4.8. Structure of a series–parallel switching system


Therefore, an optimal number of redundant elements can be found that provides the minimal probability of failure in both modes.

In many practical cases, some measures of element (system) performance must be taken into account. For example, fluid-transmitting capacity is an important characteristic of a system containing fluid valves. A system can have different levels of output performance depending on the combination of elements available at any given moment. Therefore, the system should be considered to be an MSS.

Since STFMs operate in two modes, two corresponding failure conditions should be defined: F_o(G_o, W_o) < 0 in the open mode and F_c(G_c, W_c) < 0 in the closed mode, where G_o and G_c are the output performance of the MSS at time t in the open and closed modes respectively, and W_o and W_c are the required system output performances in the open and closed modes respectively. Since the failures in the open and closed modes, which have the probabilities

Q_o = Pr{F_o(G_o, W_o) < 0} = 1 − E_Ao(F_o, W_o)

and

Q_c = Pr{F_c(G_c, W_c) < 0} = 1 − E_Ac(F_c, W_c)   (4.38)

respectively, are mutually exclusive events, the entire system reliability can be defined as

1 − Q_o − Q_c = E_Ao(F_o, W_o) + E_Ac(F_c, W_c) − 1   (4.39)

One can also determine the expected system performance in both modes, E_Go and E_Gc.

Consider two possible formulations of the problem of system structure optimization for a given list of element versions characterized by a vector of parameters g_o, g_c, a_o and a_c, where g_x is the nominal element capacity in mode x and a_x is its availability in mode x:

Formulation 3. Find a system configuration R^* that provides maximal system availability:

R^* = \arg\{E_Ao(W_o, F_o, R) + E_Ac(W_c, F_c, R) − 1 \to \max\}   (4.40)

Formulation 4. Find a system configuration R^* that provides the maximal proximity of expected system performance to the desired levels for both modes while satisfying the reliability requirements:

R^* = \arg\{|E_Go(R) − E_Go^*| + |E_Gc(R) − E_Gc^*| \to \min \mid E_Ao(R) ≥ E_Ao^*, E_Ac(R) ≥ E_Ac^*\}   (4.41)

4.6.3.2 Solution Quality Evaluation

To determine the u-function of an individual flow-transmitting element with total failures (e.g. a fluid flow valve) in the closed mode, note that in the operational state, which has probability a_c, the element should transmit the nominal flow f, and in the failure state it fails to transmit any flow. Therefore, according to Equation 4.17, the u-function of the element takes the form

u_c(z) = a_c z^f + (1 − a_c) z^0   (4.42)

In the open mode the element has to prevent flow transmission through the system. If it succeeds in doing this (with probability a_o), the flow is zero; if it fails to do so, the flow is equal to its nominal value in the closed mode, f. The u-function of the element in the open mode takes the form

u_o(z) = a_o z^0 + (1 − a_o) z^f   (4.43)

Since the system of flow-transmitting elements is a capacity-based MSS, one has to use the composition operator with the functions in Equations 4.16 and 4.17 over the u-functions of the individual elements (Equations 4.42 and 4.43) to determine the u-function of the entire system.

Note that the u-function of a subsystem containing n identical parallel elements can be obtained by applying the operator \otimes(u(z), \ldots, u(z)) with the function ω determined in Equation 4.16 over the n u-functions u(z) of an individual element represented by Equation 4.42 or 4.43. The u-function of this subsystem takes the form

\sum_{k=0}^{n} \frac{n!}{k!\,(n-k)!}\, a_c^{k} (1 − a_c)^{n-k} z^{kf}   (4.44)


for the closed mode and

\sum_{k=0}^{n} \frac{n!}{k!\,(n-k)!}\, a_o^{n-k} (1 − a_o)^{k} z^{kf}   (4.45)

for the open mode.

Having the u-functions of the individual elements and applying the corresponding composition operators, one obtains the u-functions U_c(z) and U_o(z) of the entire system for both modes. The expected values of the flows through the system in the open and closed modes, E_Go and E_Gc, are obtained using the operators δ_G(U_o(z)) and δ_G(U_c(z)) (Equation 4.12).
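Equations 4.44 and 4.45 translate directly into the dictionary representation used in the earlier sketches; both functions below are ours, written under that assumption.

```python
from math import comb

def u_subsystem_closed(a_c, f, n):
    """Equation 4.44: n identical parallel valves in the closed
    (transmitting) mode; k working elements pass a flow of k*f."""
    return {k * f: comb(n, k) * a_c**k * (1 - a_c)**(n - k)
            for k in range(n + 1)}

def u_subsystem_open(a_o, f, n):
    """Equation 4.45: the same subsystem in the open (blocking) mode;
    a flow of k*f leaks through the k elements that failed to open."""
    return {k * f: comb(n, k) * a_o**(n - k) * (1 - a_o)**k
            for k in range(n + 1)}
```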

To determine the system reliability, one has to define the conditions of its successful functioning. For the flow-transmitting system it is natural to require that in the closed mode the amount of flow should exceed some specified value W_c, whereas in the open mode it should not exceed a value W_o. Therefore, the conditions of system success are

F_c(G_c, W_c) = G_c − W_c ≥ 0

and

F_o(G_o, W_o) = W_o − G_o ≥ 0   (4.46)

Having these conditions, one can easily evaluate the system availability using the operators δ_A(U_o(z), F_o, W_o) and δ_A(U_c(z), F_c, W_c) (Equation 4.8).

In order to let the GA look for the solution meeting the requirements of Equation 4.40 or 4.41, the following universal expression of solution quality (fitness) is used:

fitness = λ − |E_Go − E_Go^*| − |E_Gc − E_Gc^*| − θ(max{0, E_Ao^* − E_Ao} + max{0, E_Ac^* − E_Ac})   (4.47)

where θ and λ are constants much greater than the maximal possible value of system output performance.

Note that the case E_Ao^* = E_Ac^* = 1 corresponds to the formulation in Equation 4.40. Indeed, since θ is sufficiently large, maximizing the fitness then amounts to minimizing the penalty term θ(2 − E_Ao − E_Ac), i.e. to maximizing θ(E_Ao + E_Ac). On the other hand, when E_Ao^* = E_Ac^* = 0, all reliability limitations are removed and expected performance becomes the only factor in determining the system structure.
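Under the same dictionary representation as before, the success conditions of Equation 4.46 and the fitness of Equation 4.47 can be sketched as follows (names and constant magnitudes are illustrative):

```python
def availability_closed(u_c, w_c):
    """E_Ac = Pr{G_c - W_c >= 0} (Equation 4.46, closed mode)."""
    return sum(p for g, p in u_c.items() if g >= w_c)

def availability_open(u_o, w_o):
    """E_Ao = Pr{W_o - G_o >= 0} (Equation 4.46, open mode)."""
    return sum(p for g, p in u_o.items() if g <= w_o)

def fitness_stfm(e_go, e_gc, e_ao, e_ac, req, lam=1e3, theta=1e6):
    """Universal fitness of Equation 4.47; req holds the four targets."""
    return (lam - abs(e_go - req['E_Go']) - abs(e_gc - req['E_Gc'])
            - theta * (max(0.0, req['E_Ao'] - e_ao)
                       + max(0.0, req['E_Ac'] - e_ac)))
```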

Example 5. Consider the optimization of a fluid-flow transmission system consisting of four components connected in series. Each component can contain a number of elements connected in parallel. The elements in each component should belong to a certain type (e.g. each component can operate in a different medium, which imposes specific requirements on the valves). There exists a list of available element versions. In this list each version is characterized by the element parameters: availability in the open and closed modes, and transmitting capacity f. The problem is to find the optimal system configuration by choosing elements for each component from the lists presented in Table 4.9.

We want the flow to be not less than W_c = 10 in the closed mode and not greater than W_o = 2 in the open (disconnected) mode. Three different solutions were obtained for Formulation 3 of

Table 4.9. Parameters of elements available for flow transmission system

Component no.  Element version no.  f    a_c   a_o
1              1                    1.8  0.96  0.92
               2                    2.2  0.99  0.90
               3                    3.0  0.94  0.88
2              1                    1.0  0.97  0.93
               2                    2.0  0.99  0.91
               3                    3.0  0.97  0.88
3              1                    3.0  0.95  0.88
               2                    4.0  0.93  0.87
4              1                    1.5  0.95  0.89
               2                    3.0  0.99  0.86
               3                    4.5  0.93  0.85
               4                    5.0  0.94  0.86


Table 4.10. Solutions obtained for flow transmission system

                   E_Ao^* = E_Ac^* = 1                      E_Ao^* = E_Ac^* = 0.95
Component          N ≥ 13     N = 9      N = 5            N = 13         N = 9          N = 5
1                  9∗1        9∗1        1∗1, 1∗2, 3∗3    13∗1           6∗1, 3∗3       1∗2, 4∗3
2                  13∗1       5∗1, 4∗2   2∗2, 3∗3         7∗1, 1∗2, 5∗3  3∗1, 1∗2, 5∗3  5∗3
3                  2∗1, 4∗2   2∗1, 4∗2   2∗1, 3∗2         7∗2            6∗2            5∗2
4                  11∗4       8∗1, 1∗2   1∗1, 4∗4         5∗1, 4∗4       6∗1, 3∗4       1∗1, 4∗4
Q_c = 1 − E_Ac     0.001      0.002      0.026            0.0001         0.0006         0.034
Q_o = 1 − E_Ao     0.007      0.010      0.037            0.049          0.050          0.049
E_Ao + E_Ac − 1    0.992      0.988      0.937            0.951          0.949          0.917
E_Gc               12.569     12.669     12.026           21.689         18.125         13.078
E_Go               0.164      0.151      0.118            0.390          0.288          0.156

the problem, where the minimal overall failure probability was achieved (E_Ao^* = E_Ac^* = 1). The optimal solutions were obtained under constraints on the maximal number of elements within a single component: D = 15, D = 9, and D = 5. These solutions are presented in Table 4.10, where the system structure is represented by strings r_1 ∗ h_1, ..., r_i ∗ h_i, ..., r_k ∗ h_k for each component, where r_i is the number of elements of version h_i belonging to the corresponding component. Table 4.10 also contains the fault probabilities 1 − E_Ac and 1 − E_Ao, the availability index E_Ao + E_Ac − 1, and the expected flows through the system in the open and closed modes, E_Go and E_Gc, obtained for each solution. Note that when the number of elements was restricted to 15, the optimal solution contains no more than 13 elements in each component. A further increase in the number of elements cannot improve the system availability.

Figure 4.9. Cumulative performance distribution of solutions obtained for a flow-transmission system in closed mode (fk stands for problem formulation k)

The solutions obtained for Formulation 4 of the optimization problem for E_Ao^* = E_Ac^* = 0.95 are also presented in Table 4.10. Here, we desired to obtain a flow as great as possible (E_Gc^* = 25) in the closed mode and as small as possible (E_Go^* = 0) in the open mode, while satisfying the conditions Pr{G_c ≥ 10} ≥ 0.95 and Pr{G_o < 2} ≥ 0.95. The same constraints on the number of elements within a single component were imposed. One can see that this set of solutions provides a much greater difference E_Gc − E_Go, while the system availability is reduced when compared with the Formulation 3 solutions. The probabilistic distributions of flows through the system


Figure 4.10. Cumulative performance distribution of solutions obtained for a flow-transmission system in open mode (fk stands for problem formulation k)

in the closed and open modes for all the solutions obtained are presented in Figures 4.9 and 4.10 in the form of the cumulative probabilities Pr{G_c ≥ W_c} and Pr{G_o ≥ W_o}.

Levitin and Lisnianski [39] provide a detailed description of the optimization method applied to STFMs with capacity-based (flow valves) and processing-speed-based (electronic switches) performance measures.

4.6.4 Structure Optimization for Multi-state System with Fixed Resource Requirements and Unreliable Sources

4.6.4.1 Problem Formulation

Although the algorithms suggested in the previous sections cover a wide range of series–parallel systems, these algorithms are restricted to systems with continuous flows, which are comprised of elements that can process any piece of product (resource) within their capacity (productivity) limits. In this case, the minimal amount of product which can proceed through the system is not limited.

In practice, there are technical elements that can work only if the amount of some resource is not lower than a specified limit. If this requirement is not met, the element fails to work. An example of such a situation is a control system that stops the controlled process if a decrease in its computational resources does not allow the necessary information to be processed within the required cycle time. Another example is a metalworking machine that cannot perform its task if the flow of coolant supplied is less than required. In both these examples the amount of resources necessary to provide the normal operation of a given composition of the main producing units (controlled processes or machines) is fixed. Any deficit of the resource makes it impossible for all the units from the composition to operate together (in parallel), because no single unit can reduce the amount of resource it consumes. Therefore, any resource deficit leads to the turning off of some producing units.

This section considers systems containing producing elements with fixed resource consumption. The systems consist of a number of resource-generating subsystems (RGSs) that supply different resources to the main producing subsystem (MPS). Each subsystem consists of different elements connected in parallel. Each element of the MPS can perform only by consuming a fixed amount of resources. If, following failures in the RGSs, there are not enough resources to allow all the available producing elements to work, then some of these elements should be turned off. We assume that the choice of the working MPS elements is made in such a way as to maximize the total performance rate of the MPS under the given resource constraints.

The problem is to find the minimal cost RGS and MPS structure that provides the desired level of the entire system's ability to meet the demand.

In spite of the fact that only a two-level RGS–MPS hierarchy is considered in this section, the


Figure 4.11. Structure of a simple N − 1 RGS–MPS system

method can easily be expanded to systems with a multilevel hierarchy. When solving the problem for multilevel systems, the entire RGS–MPS system (with the OPD defined by its structure) may be considered in its turn as one of the RGSs for a higher level MPS.

Consider a system consisting of an MPS and N − 1 different RGSs (Figure 4.11). The MPS can have up to D different elements connected in parallel. Each producing element consumes resources supplied by the RGSs and produces a final product. There are H_N versions of producing elements available. Each version h (1 ≤ h ≤ H_N) is characterized by its availability a_{Nh}, performance rate (productivity or capacity) g_{Nh}, cost c_{Nh}, and vector of required resources w_h = {w_{nh}}, 1 ≤ n ≤ N − 1. An MPS element of version h can work only if it receives the amount of each resource defined by the vector w_h.

Each resource n (1 ≤ n ≤ N − 1) is generated by the corresponding RGS, which can contain up to D_n parallel resource-generating elements of different versions. The total number of available versions for the nth RGS is H_n. Each version of an element of the RGS supplying the nth resource is characterized by its availability, productivity, and cost.

The problem of system structure optimization is as follows: find the minimal cost system configuration R^* that provides the required reliability level E_A^* (Equation 4.31), where for any given system structure R the total cost of the system can be calculated using Equation 4.30.

4.6.4.2 Solution Quality Evaluation

Since the considered system contains elements with total failure represented by two-term u-functions, and each subsystem consists of a number of parallel elements, one can obtain the u-functions U_n(z) for each RGS n, representing the probabilistic distribution of the amount of the nth resource which can be supplied to the MPS, as follows:

U_n(z) = \prod_{h=1}^{H_n} [(1 − a_{nh}) z^0 + a_{nh} z^{g_{nh}}]^{r(n,h)} = \sum_{k=1}^{V_n} \mu_{nk} z^{\beta_{nk}}   (4.48)

where V_n is the number of different levels of resource n generation, β_{nk} is the available amount of the nth resource at state k, and µ_{nk} is the probability of state k for RGS n.

The same equation can be used in order to obtain the u-function U_N(z) representing the OPD of the MPS. In this case, V_N is the number of different possible levels of MPS productivity. The function U_N(z) represents the distribution of system productivity defined only by the MPS elements' availability. This distribution corresponds to situations in which there are no limitations on the required resources.

4.6.4.3 The Output Performance Distribution of a System Containing Identical Elements in the Main Producing Subsystem

If a producing subsystem contains only identical elements, then the number of elements that can work in parallel when the available amount of the nth resource at state k is β_{nk} is \lfloor β_{nk}/w_n \rfloor, which corresponds to a total system productivity γ_{nk} = g \lfloor β_{nk}/w_n \rfloor. In this expression, g and w_n are respectively the productivity of a single element of the MPS and the amount of resource n required for this element, and the function \lfloor x \rfloor rounds a number x down to the nearest integer. Note that γ_{nk} represents the total theoretical productivity which can be achieved using the available resource n by an unlimited number of producing elements. In terms of the entire system output, the u-function of the nth RGS can be obtained in the following form:

U_n(z) = \sum_{k=1}^{V_n} \mu_{nk} z^{\gamma_{nk}} = \sum_{k=1}^{V_n} \mu_{nk} z^{g \lfloor \beta_{nk}/w_n \rfloor}   (4.49)
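A sketch of the transform of Equation 4.49 under the dictionary representation assumed earlier (the function name is ours):

```python
def rgs_u_in_output_terms(u_resource, g, w):
    """Re-express an RGS u-function {beta: mu} in terms of the system
    productivity g * floor(beta / w) it can support (Equation 4.49)."""
    out = {}
    for beta, mu in u_resource.items():
        gamma = g * (beta // w)   # floor-based theoretical productivity
        out[gamma] = out.get(gamma, 0.0) + mu
    return out
```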

The RGS that provides the work of the minimal number of producing units becomes the bottleneck of the system. Therefore, this RGS defines the total system capacity. To calculate the u-function U_r(z) for all the RGSs, one has to apply the composition operator with the ω function in Equation 4.17. The function U_r(z) represents the entire system OPD for the case of an unlimited number of producing elements.

The entire system productivity is equal to the minimum of the total theoretical productivity which can be achieved using the available resources and the total productivity of the available producing elements. To obtain the OPD of the entire system taking into account the availability of the MPS elements, the same operator should be applied:

U(z) = \otimes(U_N(z), U_r(z)) = \otimes(U_1(z), U_2(z), \ldots, U_N(z))   (4.50)

4.6.4.4 The Output Performance Distribution of a System Containing Different Elements in the Main Producing Subsystem

If an MPS has D different elements, there are 2^D possible states of element availability composition. Each state k may be characterized by the set π_k (1 ≤ k ≤ 2^D) of available elements. The probability of state k can be evaluated as follows:

p_k = \prod_{j \in \pi_k} a_{Nj} \prod_{i \notin \pi_k} (1 − a_{Ni})   (4.51)

The maximal possible productivity of the MPS and the corresponding resource consumptions in state k are \sum_{j \in \pi_k} g_j and \sum_{j \in \pi_k} w_{nj} (1 ≤ n < N) respectively.

The amount of resources generated by an RGS is defined by its OPD. There are not always enough resources to provide the maximal possible productivity of the MPS at state k. In order to define the maximal possible productivity G of the MPS under the resource constraints, one has to solve the following integer linear programming problem:

G(β_1, β_2, \ldots, β_{N−1}, π_k) = \max \sum_{j \in \pi_k} g_j y_j


subject to

\sum_{j \in \pi_k} w_{nj} y_j \le \beta_n  for 1 ≤ n < N,  y_j ∈ {0, 1}   (4.52)

where β_n is the available amount of the nth resource, y_j = 1 if element j works (producing g_j units of the main product and consuming w_{nj} of each resource n, 1 ≤ n < N), and y_j = 0 if element j is turned off.
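Since each state has at most 2^D candidate subsets and D is small in the example below, Equation 4.52 can be solved exactly by exhaustive enumeration; the following brute-force sketch is an illustration only and is practical only for small D.

```python
from itertools import combinations

def max_productivity(pi_k, g, w, beta):
    """Equation 4.52: maximal total productivity of the available MPS
    elements pi_k under the available resource amounts beta.

    g[j] is the productivity of element j, w[j][n] the amount of
    resource n it requires, and beta[n] the available amount.
    """
    best = 0.0
    for size in range(1, len(pi_k) + 1):
        for subset in combinations(pi_k, size):
            feasible = all(sum(w[j][n] for j in subset) <= beta[n]
                           for n in range(len(beta)))
            if feasible:
                best = max(best, sum(g[j] for j in subset))
    return best
```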

The OPD of the entire system can be defined by evaluating all the possible combinations of available resources generated by the RGSs and the states of the MPS. For each combination, a solution of the above-formulated optimization problem defines the system productivity. The u-function representing the OPD of the entire system can be defined as follows:

U(z) = \sum_{i_1=1}^{V_1} \sum_{i_2=1}^{V_2} \cdots \sum_{i_{N-1}=1}^{V_{N-1}} \left[ \left( \prod_{n=1}^{N-1} \mu_{n i_n} \right) \sum_{k=1}^{2^D} p_k z^{G(\beta_{1 i_1}, \beta_{2 i_2}, \ldots, \beta_{N-1\, i_{N-1}}, \pi_k)} \right]   (4.53)

To evaluate the E_A index for the entire system, having its OPD, as the probability that the total productivity of the system is not less than a specified demand level W, we can use the function F = G − W and the operator δ_A (Equation 4.8).

To obtain the system OPD, its productivity should be determined for each unique combination of available resources and for each unique state of the MPS. From Equation 4.53 one can see that in the general case the total number of integer linear programs to be solved to obtain U(z) is 2^D \prod_{n=1}^{N-1} V_n. In practice, the number of programs to be solved can be drastically reduced using the following rules.

1. If for the given vector (β_1, ..., β_n, ..., β_{N−1}) and for the given set of MPS elements π_k there exists an n for which β_n < min_{j∈π_k} w_{nj}, then the system productivity is G = 0.

2. If for each element j from π_k there exists an n for which β_n < w_{nj}, then the system productivity is G = 0.

3. If there exists an element j ∈ π_k for which β_n < w_{nj} for some n, this means that in the program (Equation 4.52) y_j must be zeroed. In this case the dimension of the integer program can be reduced by removing all such elements from π_k.

4. If for the given vector (β_1, ..., β_n, ..., β_{N−1}) and for the given set π_k the solution of the integer program (Equation 4.52) determines the subset π_k^* of turned-on MPS elements (j ∈ π_k^* if y_j = 1), the same solution must be optimal for the MPS states characterized by any set π'_k such that π_k^* ⊂ π'_k ⊂ π_k. This allows one to avoid solving many integer programs by assigning the value of G(β_1, ..., β_{N−1}, π_k) to all the G(β_1, ..., β_{N−1}, π'_k).

It should be noted that, for systems with a large number of elements and/or resources, the required computational effort for solving the redundancy optimization problem could be unaffordable, even when applying the computational complexity reduction technique presented. In this case, the use of fast heuristics for solving the integer programs is recommended instead of exact algorithms.

By applying the technique described for obtaining U(z) and using the δ_A operator for determining the E_A index, one can estimate the solution fitness in the GA using Equation 4.36.

Example 6. The main producing component of the system may have up to six producing elements (chemical reactors) working in parallel. To perform their task, the producing elements require three different resources:

1. power, generated by the energy supply subsystem (group of converters);

2. computational resource, provided by the control subsystem (group of controllers);

3. cooling water, provided by the water supply subsystem (group of pumps).

Each of these RGSs can have up to five parallel elements. Both producing units and resource-generating units may be chosen from the list


Table 4.11. Parameters of the MPS units available

Version no.  Cost c  Performance rate g  Availability a  Resources required w
                                                         Resource 1  Resource 2  Resource 3
1            9.9     30.0                0.970           2.8         2.0         1.8
2            8.1     25.0                0.954           0.2         0.5         1.2
3            7.9     25.0                0.960           1.3         2.0         0.1
4            4.2     13.0                0.988           2.0         1.5         0.1
5            4.0     13.0                0.974           0.8         1.0         2.0
6            3.0     10.0                0.991           1.0         0.6         0.7

Table 4.12. Parameters of the RGS units available

Type of resource  Version no.  Cost c  Performance rate g  Availability a
1                 1            0.590   1.8                 0.980
                  2            0.535   1.0                 0.977
                  3            0.370   0.75                0.982
                  4            0.320   0.75                0.978
2                 1            0.205   2.00                0.995
                  2            0.189   1.70                0.996
                  3            0.091   0.70                0.997
3                 1            2.125   3.00                0.971
                  2            2.720   2.60                0.973
                  3            1.590   2.40                0.971
                  4            1.420   2.20                0.976

Table 4.13. Parameters of the cumulative demand curve

Wm  65.0  48.0  25.0  8.0
Tm  60    10    10    20

of products available in the market. Each producing unit is characterized by its availability, its productivity, its cost, and the amount of resources required for its work. The characteristics of the available producing units are presented in Table 4.11. The resource-generating units are characterized by their availability, their generating capacity (productivity), and their cost. The characteristics of the available resource-generating units are presented in Table 4.12. Each element of the system is considered to be a unit with total failures.

The demand for the final product varies with time. The demand distribution is presented in Table 4.13 in the form of a cumulative demand curve.

Table 4.14 contains minimal cost solutions for different required levels of system availability E_A^*. The structure of each subsystem is presented by the list of the version numbers of the elements included in the subsystem. The actual estimated availability of the system and its total cost are also presented in the table for each solution.

The solutions of the system structure optimization problem when the MPS can contain only identical elements are presented in Table 4.15 for comparison. Note that, when the MPS is composed of elements of different types, the same system availability can be achieved at a much lower cost. Indeed, using elements with different availability and capacity (productivity) provides much greater flexibility for optimizing the entire system performance in different states. Therefore, the algorithm for solving the problem for different MPS elements, which requires much greater computational effort, usually yields better solutions than the one for identical elements.

A detailed description of the optimization method applied to an MSS with fixed resource requirements has been given by Levitin [40].

4.6.5 Other Problems of Multi-state System Optimization

In this section we present a short description of reliability optimization problems that can be


Table 4.14. Parameters of the optimal solutions for system with different MPS elements

                      System structure
E_A^*  E_A    C       MPS               RGS 1          RGS 2       RGS 3
0.950  0.951  27.790  3, 6, 6, 6, 6     1, 1, 1, 1, 1  1, 1, 2, 3  1, 1
0.970  0.973  30.200  3, 6, 6, 6, 6, 6  1, 1, 1, 1     1, 1, 2, 3  1, 1
0.990  0.992  33.690  3, 3, 6, 6, 6, 6  1, 1, 1, 1     1, 1, 2, 3  4, 4
0.999  0.999  44.613  2, 2, 3, 3, 6, 6  1, 1, 1        1, 2, 2     4, 4, 4

Table 4.15. Parameters of the optimal solutions for system with identical MPS elements

                      System structure
E_A^*  E_A    C       MPS               RGS 1          RGS 2       RGS 3
0.950  0.951  34.752  4, 4, 4, 4, 4, 4  1, 4, 4, 4, 4  2, 2, 2     1, 3, 3, 3, 4
0.970  0.972  35.161  4, 4, 4, 4, 4, 4  1, 1, 1, 4     2, 2, 2, 2  1, 3, 3, 3, 4
0.990  0.991  37.664  2, 2, 2, 2        4, 4           3, 3, 3, 3  4, 4, 4
0.999  0.999  47.248  2, 2, 2, 2, 2     3, 4           2, 2        4, 4, 4, 4

solved using a combination of the UGF technique and GAs.

In practice, the designer often has to include additional elements in an existing system rather than develop a new one from scratch. It may be necessary, for example, to modernize a system according to new demand levels or according to new reliability requirements. The problem of optimal single-stage MSS expansion to enhance its reliability and/or performance is an important extension of the structure optimization problem. In this case, one has to decide which elements should be added to the existing system and to which component they should be added. Such a problem was considered by Levitin et al. [36].

During the MSS lifetime, the demand and reliability requirements can change. To provide a desired level of MSS performance, management should develop a multistage expansion plan. For the problem of optimal multistage MSS expansion [41], it is important to answer not only the question of what must be included in the system, but also the question of when.

By optimizing the maintenance policy one can achieve the desired level of system reliability (availability) at minimal cost. The UGF technique allows the entire MSS reliability to be obtained as a function of the reliabilities of its elements. Therefore, by having estimations of the influence of different maintenance actions on the elements' reliability, one can evaluate their influence on the entire complex MSS containing elements with different performance rates and reliabilities. An optimal maintenance policy can be developed that would answer the questions of which elements should be the focus of maintenance activity and what the intensity of this activity should be [42, 43].

Since the maintenance activity serves the same role in MSS reliability enhancement as does the incorporation of redundancy, the question arises as to which is more effective. In other words, should the designer prefer a structure with more redundant elements and less investment in maintenance, or vice versa? The optimal compromise should minimize the MSS cost while providing its desired reliability. The joint maintenance and redundancy optimization problem [37] is to find this optimal compromise, taking into account differences in the reliability and performance rates of the elements composing the MSS.


Finally, the most general optimization problem is optimal multistage modernization of an MSS subject to reliability and performance requirements [44]. In order to solve this problem, one should develop a minimal-cost modernization plan that includes maintenance, modernization of elements, and system expansion actions. The objective is to provide the desired reliability level while meeting the increasing demand during the lifetime of the MSS.

References

[1] Aven T, Jensen U. Stochastic models in reliability. New York: Springer-Verlag; 1999.
[2] Billinton R, Allan R. Reliability evaluation of power systems. Pitman; 1990.
[3] Aven T. Availability evaluation of flow networks with varying throughput-demand and deferred repairs. IEEE Trans Reliab 1990;38:499–505.
[4] Levitin G, Lisnianski A, Ben-Haim H, Elmakis D. Redundancy optimization for series–parallel multi-state systems. IEEE Trans Reliab 1998;47(2):165–72.
[5] Lisnianski A, Levitin G, Ben-Haim H. Structure optimization of multi-state system with time redundancy. Reliab Eng Syst Saf 2000;67:103–12.
[6] Gnedenko B, Ushakov I. Probabilistic reliability engineering. New York: John Wiley & Sons; 1995.
[7] Murchland J. Fundamental concepts and relations for reliability analysis of multistate systems. In: Barlow R, Fussell S, Singpurwalla N, editors. Reliability and fault tree analysis, theoretical and applied aspects of system reliability. Philadelphia: SIAM; 1975. p.581–618.
[8] El-Neveihi E, Proschan F, Setharaman J. Multistate coherent systems. J Appl Probab 1978;15:675–88.
[9] Barlow R, Wu A. Coherent systems with multistate components. Math Oper Res 1978;3:275–81.
[10] Ross S. Multivalued state component systems. Ann Probab 1979;7:379–83.
[11] Griffith W. Multistate reliability models. J Appl Probab 1980;17:735–44.
[12] Butler D. A complete importance ranking for components of binary coherent systems with extensions to multistate systems. Nav Res Logist Q 1979;26:556–78.
[13] Koloworcki K. An asymptotic approach to multi-state systems reliability evaluation. In: Limios N, Nikulin M, editors. Recent advances in reliability theory, methodology, practice, inference. Birkhauser; 2000. p.163–80.
[14] Pourret O, Collet J, Bon J-L. Evaluation of the unavailability of a multi-state component system using a binary model. Reliab Eng Syst Saf 1999;64:13–7.
[15] Aven T. On performance measures for multistate monotone systems. Reliab Eng Syst Saf 1993;41:259–66.
[16] Coit D, Smith A. Reliability optimization of series–parallel systems using genetic algorithm. IEEE Trans Reliab 1996;45:254–66.
[17] Gen M, Cheng R. Genetic algorithms and engineering design. New York: John Wiley & Sons; 1997.
[18] Dengiz B, Altiparmak F, Smith A. Efficient optimization of all-terminal reliable networks, using an evolutionary approach. IEEE Trans Reliab 1997;46:18–26.
[19] Cheng S. Topological optimization of a reliable communication network. IEEE Trans Reliab 1998;47:23–31.
[20] Bartlett L, Andrews J. Efficient basic event ordering schemes for fault tree analysis. Qual Reliab Eng Int 1999;15:95–101.
[21] Zhou Y-P, Zhao B-Q, Wu D-X. Application of genetic algorithms to fault diagnosis in nuclear power plants. Reliab Eng Syst Saf 2000;67(2):153–60.
[22] Martorell S, Carlos S, Sanchez A, Serradell V. Constrained optimization of test intervals using a steady-state genetic algorithm. Reliab Eng Syst Saf 2000;67:215–32.
[23] Marseguerra M, Zio E. Optimizing maintenance and repair policies via a combination of genetic algorithms and Monte Carlo simulation. Reliab Eng Syst Saf 2000;68:69–83.
[24] Brunelle R, Kapur K. Review and classification of reliability measures for multistate and continuum models. IIE Trans 1999;31:1171–80.
[25] Ushakov I. A universal generating function. Sov J Comput Syst Sci 1986;24:37–49.
[26] Ushakov I. Reliability analysis of multi-state systems by means of a modified generating function. J Inform Process Cybernet 1988;34:24–9.
[27] Lisnianski A, Levitin G, Ben-Haim H, Elmakis D. Power system structure optimization subject to reliability constraints. Electr Power Syst Res 1996;39:145–52.
[28] Levitin G, Lisnianski A. Survivability maximization for vulnerable multi-state systems with bridge topology. Reliab Eng Syst Saf 2000;70:125–40.
[29] Birnbaum ZW. On the importance of different components in a multicomponent system. In: Krishnaiah PR, editor. Multivariate analysis 2. New York: Academic Press; 1969.
[30] Barlow RE, Proschan F. Importance of system components and fault tree events. Stochast Process Appl 1975;3:153–73.
[31] Vesely W. A time dependent methodology for fault tree evaluation. Nucl Eng Des 1970;13:337–60.
[32] Bossche A. Calculation of critical importance for multi-state components. IEEE Trans Reliab 1987;R-36(2):247–9.
[33] Back T. Evolutionary algorithms in theory and practice. Evolution strategies. Evolutionary programming. Genetic algorithms. Oxford University Press; 1996.
[34] Goldberg D. Genetic algorithms in search, optimization and machine learning. Reading (MA): Addison Wesley; 1989.
[35] Whitley D. The GENITOR algorithm and selective pressure: why rank-based allocation of reproductive trials is best. In: Schaffer D, editor. Proceedings 3rd International Conference on Genetic Algorithms. Morgan Kaufmann; 1989. p.116–21.
[36] Levitin G, Lisnianski A, Elmakis D. Structure optimization of power system with different redundant elements. Electr Power Syst Res 1997;43:19–27.
[37] Levitin G, Lisnianski A. Joint redundancy and maintenance optimization for multistate series–parallel systems. Reliab Eng Syst Saf 1998;64:33–42.
[38] Barlow R. Engineering reliability. Philadelphia: SIAM; 1998.
[39] Levitin G, Lisnianski A. Structure optimization of multi-state system with two failure modes. Reliab Eng Syst Saf 2001;72:75–89.
[40] Levitin G. Redundancy optimization for multi-state system with fixed resource requirements and unreliable sources. IEEE Trans Reliab 2001;50(1):52–9.
[41] Levitin G. Multistate series–parallel system expansion scheduling subject to availability constraints. IEEE Trans Reliab 2000;49:71–9.
[42] Levitin G, Lisnianski A. Optimization of imperfect preventive maintenance for multi-state systems. Reliab Eng Syst Saf 2000;67:193–203.
[43] Levitin G, Lisnianski A. Optimal replacement scheduling in multi-state series–parallel systems (short communication). Qual Reliab Eng 2000;16:157–62.
[44] Levitin G, Lisnianski A. Optimal multistage modernization of power system subject to reliability and capacity requirements. Electr Power Syst Res 1999;50:183–90.


Chapter 5

Combinatorial Reliability Optimization

C. S. Sung, Y. K. Cho and S. H. Song

5.1 Introduction
5.2 Combinatorial Reliability Optimization Problems of Series Structure
5.2.1 Optimal Solution Approaches
5.2.1.1 Partial Enumeration Method
5.2.1.2 Branch-and-bound Method
5.2.1.3 Dynamic Programming
5.2.2 Heuristic Solution Approach
5.3 Combinatorial Reliability Optimization Problems of a Non-series Structure
5.3.1 Mixed Series–Parallel System Optimization Problems
5.3.2 General System Optimization Problems
5.4 Combinatorial Reliability Optimization Problems with Multiple-choice Constraints
5.4.1 One-dimensional Problems
5.4.2 Multi-dimensional Problems
5.5 Summary

5.1 Introduction

Reliability engineering as a concept appeared in the late 1940s and early 1950s and was applied first to the fields of communication and transportation. Reliability is understood to be a measure of how well a system meets its design objective during a given operation period without repair work. In general, reliability systems are composed of several subsystems (stages), each having more than one component, including operating and redundant units.

Reliability systems have become so complex that the successful operation of each subsystem has been a significant issue in automation and productivity management. One important design issue is to design an optimal system having the greatest quality of reliability, or to find the best way to increase any given system reliability, which may be subject to various engineering constraints associated with cost, weight, and volume. Two common approaches to the design issue have been postulated as follows:

1. an approach of incorporating more reliable components (units);

2. an approach of incorporating more redundant components.

The first approach is not always feasible, since high cost may be involved or reliability improvement may be technically limited. Therefore, the second approach is more commonly adopted for the economical design of systems, for which optimal redundancy is the main concern. In particular, the combinatorial reliability optimization problem is defined in terms of redundancy optimization as the problem of determining the optimal number of redundant units of the component associated with each stage, subject to a minimum requirement for the associated whole-system reliability and also various resource restrictions, such as cost, weight, and volume consumption.


There are two distinct types of redundant unit, called parallel and standby units. In parallel redundant systems, one original operating unit and all redundant units at each stage operate simultaneously. However, in standby redundant systems, the redundant units are not activated unless the operating unit fails. If the operating unit fails, then it will be replaced with one from among the redundant units. Under the assumptions that the replacement time for each failed unit is ignored and the failure probability of each unit is fixed at a real value, a mathematical programming model for the standby redundancy optimization problem can be derived that is the same as that for the parallel redundancy optimization problem. The last assumption may be relaxed for the problem to be treated in a stochastic model approach.

The combinatorial reliability optimization issues concerned with redundancy optimization may be classified into:

1. maximization of system effectiveness (reliability) subject to various resource constraints; and

2. minimization of system cost subject to the condition that the associated system effectiveness (reliability) be equal to or greater than a desired level.

The associated system reliability function can have a variety of different forms, depending on the system structure. For example, the redundancy optimization problem commonly considers three types of system structure, including:

1. series structure;

2. mixed series–parallel structure;

3. general structure, which is neither series nor parallel, but rather of a bridge network type.

The series system has only one path between the source and the terminal node, but the mixed series–parallel system can have multiple paths between them. The general-structured system can also have multiple paths, but it differs from the mixed series–parallel system in that at least two subsystems (stages) are related in a non-series and/or non-parallel structure.

The reliability function of the series system is represented by a product form of all the associated stage reliabilities. However, the other two systems cannot have their reliability functions represented by any product form of all stage (subsystem) reliabilities. Rather, their reliabilities may be computed by use of a minimal path set or a minimal cut set. This implies that the series-structured system has the simplest reliability function form.

Thus, the redundancy optimization problem can be formulated in a variety of different mathematical programming expressions depending on the associated system structures and problem objectives, as shown below.

Series system reliability maximization problem (SRP):

Z_SRP = \max \prod_{i \in I} [1 − (1 − r_i)^{y_i}]

subject to

\sum_{i \in I} g_i^m(y_i) \le C_m   ∀m ∈ M   (5.1)

y_i = non-negative integer   ∀i ∈ I   (5.2)

where Z_SRP represents the optimal system reliability of the problem (SRP), I is the set of stages, r_i is the reliability of the component used at stage i, y_i is the variable representing the number of redundant units of the component allocated at stage i, g_i^m(y_i) is the consumption of resource m for the component and its redundant units at stage i, and C_m is the total available amount of resource m.

The constraints in Equation 5.1 represent the resource constraints, and the constraints in Equation 5.2 define the decision variables to be non-negative integers.
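As an illustration of how a candidate allocation y would be evaluated under (SRP), a minimal Python sketch follows; it assumes the linear resource-consumption form discussed at the end of this section, and all names are ours.

```python
def srp_objective(y, r):
    """Series system reliability: product over stages of 1 - (1 - r_i)**y_i."""
    reliability = 1.0
    for yi, ri in zip(y, r):
        reliability *= 1.0 - (1.0 - ri) ** yi
    return reliability

def srp_feasible(y, c, C):
    """Equation 5.1 with linear consumption: sum_i c[m][i] * y_i <= C[m]."""
    return all(sum(cm[i] * yi for i, yi in enumerate(y)) <= Cm
               for cm, Cm in zip(c, C))
```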

Non-series system reliability maximization problem (NRP):

Z_NRP = \max R(R_1, \ldots, R_i, \ldots, R_{|I|})

subject to the constraints in Equations 5.1 and 5.2, where Z_NRP represents the optimal system reliability of the problem (NRP), and


R(R_1, \ldots, R_i, \ldots, R_{|I|}) is the system reliability function, given the reliability of each stage i as R_i = 1 − (1 − r_i)^{y_i}, which is dependent upon the associated system structure.

Series system cost minimization problem (SCP):

Z_SCP = \min \sum_{i \in I} g_i(y_i)

subject to the constraints in Equations 5.1 and 5.2 and

\prod_{i \in I} [1 − (1 − r_i)^{y_i}] \ge R_{min}   (5.3)

where Z_SCP represents the optimal system cost of the problem (SCP), g_i(y_i) is the cost charged for the resource consumption y_i at stage i, and R_{min} is the minimum level of the system reliability requirement. The constraint in Equation 5.3 represents the system reliability constraint.

Non-series system cost minimization problem (NCP):

Z_NCP = \min \sum_{i \in I} g_i(y_i)

subject to the constraints in Equations 5.1 and 5.2 and

R(R_1, \ldots, R_i, \ldots, R_{|I|}) \ge R_{min}   (5.4)

where Z_NCP represents the optimal system cost of the problem (NCP). The constraint in Equation 5.4 represents the system reliability requirement.

The constraints in Equation 5.1 representing resource consumption are given in general function forms, since the resource consumption for improving the associated system reliability does not increase linearly as the number of redundant units of the component at each stage increases. Therefore, the data on the resource consumption rate are important for the problem formulation. However, only few such data are practically available, so the resource constraints are often expressed as linear functions of the variable representing the number of redundant units. Accordingly, the constraints in Equation 5.1 can be specified as

\sum_{i \in I} g_i^m(y_i) = \sum_{i \in I} c_i^m y_i \le C_m

where c_i^m denotes the consumption of resource m at stage i.

The proposed combinatorial reliability optimization problems are now found to be expressed as nonlinear integer programs, where the associated system reliability functions are nonlinear. It is also found that the reliability function of the non-series-structured system is not separable. In other words, the system reliability cannot be written in any product form of all stage reliabilities. This implies that it is not easy to evaluate the reliability of the general-structured system. Valiant [1] has proved that the problem of computing reliabilities (reliability objective function values) for general-structured systems is NP-complete. In the meantime, for the mixed series–parallel types of system among them, Satyanarayana and Wood [2] have proposed a linear time algorithm to compute the reliability objective function values, given their associated stage reliabilities.

The complexity analysis by Chern [3] shows that the simplest reliability maximization problem of series systems with only one resource constraint incorporated is NP-complete. Therefore, even though many algorithms have been proposed in the literature, only a few have been proven effective for large-scale nonlinear reliability programming problems, and none of them has been proven to be superior over any of the others. A variety of different solution approaches have been studied for the combinatorial reliability optimization problems; these are listed as follows.

Optimal solution approaches:

• integer programming;
• partial enumeration method;
• branch-and-bound method;
• dynamic programming.

Heuristic approaches:

• greedy-type heuristics.

Continuous relaxation approaches:

• maximum principle;
• geometric programming;
• sequential unconstrained minimization technique (SUMT);


• modified sequential simplex pattern search;
• Lagrangian multipliers and Kuhn–Tucker conditions;
• generalized Lagrangian function;
• generalized reduced gradient (GRG);
• parametric programming.

The integer programming approach has been considered for integer solutions, whereas the associated problems have nonlinear objective functions and/or constraints that may be too difficult to handle in the approach. However, no integer programming technique guarantees that the optimal solutions can be obtained for large-sized problems in a reasonable time. Similarly, the dynamic programming approach suffers from the curse of dimensionality: its computational load may increase exponentially with the number of the associated state variables. Moreover, in general, it gets harder to solve problems with more than two constraints.

Heuristic algorithms have been considered to find integer-valued solutions (local optima) in a reasonable time, but they do not guarantee that their resulting solutions are globally optimal. However, Malon [4] has shown that greedy assembly is optimal whenever the assembly modules have a series structure.

The continuous relaxation approaches, including the sequential unconstrained minimization technique, the generalized reduced gradient method, and the generalized Lagrangian function method, have also been developed for reliability optimization problems. Along with these methods, rounding-off procedures have often been applied to redundancy allocation problems in situations where none of the relaxation approaches could yield integer solutions. The associated reference works have been summarized in Tables 1 and 3 of Tillman et al. [5].

Any problem with a nonlinear reliability objective function can be transformed into a problem with a linear objective function when the nonlinear objective function is separable. This has been studied in Tillman and Liittschwager [6]. By taking logarithms of the associated system reliability function and considering appropriate substitutions, the following equivalent problem (LSRP) with linear objective function and constraints can be derived:

Problem (LSRP):

Z_LSRP = \max \sum_{i \in I} \sum_{k=1}^{\infty} \Delta f_{ik} y_{ik}

subject to

\sum_{i \in I} \sum_{k=1}^{\infty} \Delta c_{ik}^m y_{ik} \le C_m   ∀m ∈ M   (5.5)

y_{ik} − y_{i,k−1} \le 0   ∀k ∈ Z+, i ∈ I   (5.6)

y_{i0} = 1   ∀i ∈ I   (5.7)

y_{ik} = binary integer   ∀k ∈ Z+, i ∈ I   (5.8)

where y_{ik} is the variable representing the kth redundancy at stage i,

\Delta f_{ik} = [1 − (1 − r_i)^k] − [1 − (1 − r_i)^{k−1}] = (1 − r_i)^{k−1} − (1 − r_i)^k

which is the change in system reliability due to the kth redundancy added at stage i, and \Delta c_{ik}^m = g_i^m(k) − g_i^m(k − 1), which is the change in consumption of resource m due to the kth redundancy added at stage i.
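A small sketch tabulating the increments Δf_{ik} shows why the ordering constraint of Equation 5.6 is natural: for any 0 < r_i < 1 the increments are positive and strictly decreasing in k. (The function name and example value are illustrative only.)

```python
def delta_f(r_i, k_max):
    """Reliability increments of the LSRP transformation:
    delta_f[k] = (1 - r_i)**(k - 1) - (1 - r_i)**k for k = 1..k_max."""
    q = 1.0 - r_i
    return [q ** (k - 1) - q ** k for k in range(1, k_max + 1)]

# Example: r_i = 0.8 gives the decreasing increments
# [0.8, 0.16, 0.032, 0.0064]
print(delta_f(0.8, 4))
```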

Likewise, the problem (SCP) can also be transformed into a problem with linear constraints. The problems (NRP) and (NCP), however, cannot be transformed into any linear integer programming problems, since the associated system reliability functions are not separable.

Note that if any problem with a separable reliability function has linear resource constraints, then its transformed problem can be reduced to a knapsack problem.

This chapter is organized as follows. In Section 5.2, the combinatorial reliability optimization problems for series structures are considered. Various solution approaches, including optimal solution methods and heuristics, are introduced. Section 5.3 considers the problems (NRP) for non-series structured systems. Section 5.4 considers a problem with multiple-choice constraints incorporated, and the corresponding solution approach


is introduced, which has been studied recently.Finally, concluding remarks are presented.

5.2 Combinatorial ReliabilityOptimization Problems of SeriesStructure

This section considers combinatorial reliabilityoptimization problems having series structures.Many industrial systems (e.g. heavy machinesand military weapons), whose repair work isso costly (difficult), are commonly designed inseries structures of their subsystems to havemultiple components in parallel at each stage(subsystem) so as to maintain (increase) the wholesystem reliability. For example, each combat tank(system) is composed of three major subsystems(including power train, suspension, and firecontrol subsystems) in a series structure. In orderto increase the whole combat tank reliability,the combat tank is designed to have multipleblowers in parallel in the power train subsystem,multiple torsion bars and support rollers inparallel in the suspension subsystem, and multiplegunner controllers in parallel in the fire controlsubsystem. Thus, the associated design issue ofhow many components to be installed multiply ineach subsystem for an optimal combat tank systemcan be handled as a series system combinatorialreliability optimization problem.

The approach of having multiple components in parallel at each subsystem can also be applied to many production systems. For example, the common wafer-fabrication flow line is composed of three major processes, viz. diffusion/oxidation, photolithography, and etching, which are handled by three corresponding facilities called the furnace, stepper, and dry etcher, respectively. In the fabrication flow line it is common to have multiple facilities in parallel for each of the processes so as to increase the whole line reliability. Thus, the associated design issue of how many facilities to install in each process for an optimal wafer-fabrication flow line can also be handled as a series-system combinatorial reliability optimization problem.

5.2.1 Optimal Solution Approaches

The combinatorial reliability problem of a series structure has drawn research attention, as shown in the literature, since the 1960s. All the early studies of the associated reliability optimization problems were concerned with finding optimal solutions. The representative solution approaches include the partial enumeration method, the branch-and-bound method, and the dynamic programming algorithm.

5.2.1.1 Partial Enumeration Method

Misra [7] has applied the partial enumeration method to the redundancy optimization problem. The partial enumeration method was first proposed by Lawler and Bell [8], and was developed for a discrete optimization problem with a monotonic objective function and arbitrary constraints. Misra [7] has dealt with the problem of maximizing reliability (or optimizing some other objective function) subject to multiple separable constraints (not necessarily linear functions), and has transformed the problem into forms conformable to the method.

Using the partial enumeration method of Lawler and Bell [8], the following type of problem can be solved:

$$\min\ g_0(v) \quad \text{subject to} \quad g_{m1}(v) - g_{m2}(v) \ge 0 \quad \forall m \in M$$

where $v = (v_1, v_2, \ldots, v_{|I|})$ and $v_i = 0$ or $1$, $\forall i \in I$. Note that each of the functions $g_{m1}(v)$ and $g_{m2}(v)$ must be monotonically non-decreasing in each of its arguments. With some ingenuity, many problems may be put in this form.

Let $v = (v_1, v_2, \ldots, v_{|I|})$ be a binary vector in the sense that each $v_i$ is either 0 or 1. Define $v^1 \le v^2$ iff $v^1_i \le v^2_i$ for all $i \in I$, which is called the vector partial ordering. The lexicographic or numerical ordering of these vectors can also be obtained by identifying with each $v$ the integer value $N(v) = v_1 2^{n-1} + v_2 2^{n-2} + \cdots + v_n 2^0$. The numerical ordering is a refinement of the vector partial ordering in that $v^1 \le v^2$ implies $N(v^1) \le N(v^2)$, whereas $N(v^1) \le N(v^2)$ does not imply $v^1 \le v^2$.

The partial enumeration method of Lawler and Bell [8] is basically a search method, which starts with $(0, \ldots, 0)$ and examines the $2^{|I|}$ solution vectors in the numerical ordering described above. The labor of examination can be considerably reduced by the following three rules. As the examination proceeds, one retains the least costly solution found so far. If $\bar{v}$ is the least costly solution found, having cost $g_0(\bar{v})$, and $v$ is the vector to be examined, then the following steps indicate the conditions under which certain vectors may be skipped.

Rule 1. Test whether $g_0(v) \ge g_0(\bar{v})$. If yes, skip to $v^*$ and repeat the operation; otherwise, proceed to Rule 2. Here, $v^*$ is the first vector following $v$ in the numerical order that does not satisfy $v \le v^*$. For any $v$, $v^*$ is calculated as follows: subtract one from the binary number $v$; then $v^* - 1$ is obtained by a logical OR operation between $v$ and $v - 1$ (e.g. for $v = (1010)$, $v - 1 = (1001)$, and so $v^* - 1 = v \text{ OR } (v - 1) = (1011)$); finally, add one to obtain $v^*$.

Rule 2. Examine whether $g_{m1}(v^* - 1) - g_{m2}(v) \le 0$ $\forall m \in M$. If yes, proceed to Rule 3; otherwise, skip to $v^*$ and go to Rule 1.

Rule 3. If $g_{m1}(v^* - 1) - g_{m2}(v) \ge 0$ $\forall m \in M$, then replace $\bar{v}$ by $v$ and skip to $v^*$; otherwise, change $v$ to $v + 1$. In either case, transfer to Rule 1 for further execution.

The above three rules are the so-called algorithm skipping rules. By following them, it is not necessary to examine all the binary vectors; one continues scanning until the vector having maximum numerical order, $(1, \ldots, 1)$, is reached. In a situation where one has skipped to a vector having numerical order higher than $(1, \ldots, 1)$, the procedure is terminated, and the resulting least-costly vector provides the optimal solution. For details, refer to Lawler and Bell [8]. For a numerical example, refer to Misra [7].
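The skip operation $v \to v^*$ reduces to two integer operations when each binary vector is encoded as the integer $N(v)$. A minimal sketch follows; the encoding and the example vector are illustrative.

```python
# Sketch: the skip step v -> v* of the Lawler-Bell search [8].
# A binary vector (v1, ..., vn) is encoded as the integer
# N(v) = v1*2^(n-1) + ... + vn*2^0, as in the numerical ordering above.

def skip(v: int) -> int:
    """For v > 0, return v*: subtract one, OR with v to get v* - 1,
    then add one. All vectors numerically between v and v* - 1
    dominate v componentwise and can be skipped together."""
    return (v | (v - 1)) + 1

v = 0b1010                 # the example vector (1, 0, 1, 0)
print(bin(v - 1))          # 0b1001
print(bin(v | (v - 1)))    # 0b1011, i.e. v* - 1
print(bin(skip(v)))        # 0b1100, i.e. v*
```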

5.2.1.2 Branch-and-bound Method

A few researchers have considered the branch-and-bound method for the series system reliability maximization problem, including Ghare and Taylor [9], Sung and Lee [10], and Kuo et al. [11]. The branch-and-bound method is usually described by specifying bounding and branching procedures. For a general branch-and-bound technique, refer to Lawler and Wood [12]. In Ghare and Taylor [9], the problem (SRP) was transformed into a problem with binary variables, and it was proved that there is a correspondence between their feasible solutions. Finally, they proposed a branch-and-bound method for the transformed binary problem.

The transformed binary problem is formulated as follows:

$$Z_{\text{SRP}} = \max \sum_{i \in I} \sum_{k=1}^{\infty} a_{ik} x_{ik}$$

subject to

$$\sum_{i \in I} \sum_{k=1}^{\infty} c_{mi} x_{ik} \le b_m \quad \forall m \in M \quad (5.9)$$
$$x_{ik} = \text{binary integer} \quad \forall k \in Z^+,\ i \in I \quad (5.10)$$

where $x_{ik}$ indicates whether or not the kth redundant unit is used at stage i, $a_{ik} = \ln[1 - (q_i)^{k+1}] - \ln[1 - (q_i)^k]$, $q_i = 1 - r_i$, $c_{mi}$ denotes the consumption of resource m at stage i, and $b_m = C_m - \sum_{i \in I} c_{mi}$.

Ghare and Taylor [9] have developed a branch-and-bound method for the binary problem, in which a simple bound on the series system reliability is derived from the component reliabilities and resource consumptions. In order to develop a bounding procedure for the associated branch-and-bound algorithm, they considered the single-dimensional knapsack problem $Z_{\text{SRP}} = \max \sum_{i \in I} \sum_{k=1}^{\infty} a_{ik} x_{ik}$ subject to the single constraint $\sum_{i \in I} \sum_{k=1}^{\infty} c_{mi} x_{ik} \le b_m$ for a given m, and defined the ratios $\alpha_{ik} = a_{ik}/c_{mi}$.


Then, for any feasible solution, the following relation holds:

$$Z_{\text{SRP}} = \max \sum_{i \in I} \sum_{k=1}^{\infty} a_{ik} x_{ik} = \max \sum_{i \in I} \sum_{k=1}^{\infty} \alpha_{ik} c_{mi} x_{ik} \le \max_{i,k}(\alpha_{ik}) \sum_{i \in I} \sum_{k=1}^{\infty} c_{mi} x_{ik} \le \max_{i,k}(\alpha_{ik})\, b_m$$

Moreover, since

$$\exp(a_{ik}) = \frac{1 - (q_i)^{k+1}}{1 - (q_i)^k} = 1 + \frac{(q_i)^k}{1 + q_i + \cdots + (q_i)^{k-1}}$$

and

$$\exp(a_{i,k+1}) = \frac{1 - (q_i)^{k+2}}{1 - (q_i)^{k+1}} = 1 + \frac{(q_i)^{k+1}}{1 + q_i + \cdots + (q_i)^k}$$

it can be seen that $a_{ik} > a_{i,k+1}$, which implies $\alpha_{ik} > \alpha_{i,k+1}$, or $\max_{i,k}(\alpha_{ik}) = \max_i(\alpha_{i1})$. It follows that $Z_{\text{SRP}} \le \max_i(\alpha_{i1})\, b_m$.

In the problem there are $|M|$ constraints, one for each resource m, so that for any feasible solution

$$Z_{\text{SRP}} \le \max_i(\alpha_{i1})\, b_m \ \text{ for any } m \quad \Longrightarrow \quad Z_{\text{SRP}} \le \min_{m \in M}\big[\max_i(\alpha_{i1})\, b_m\big]$$

Consequently, the optimal objective value $Z_{\text{SRP}}$ is bounded by the quantity $\min_{m \in M}[\max_i(\alpha_{i1}) b_m]$, which is the upper bound for the problem. For the complete description of the proposed branch-and-bound procedure, refer to Ghare and Taylor [9] and Balas [13]. Their branch-and-bound procedure has been modified by McLeavey [14] in several respects, including algorithm initialization, branching variable selection, depth of searching, and problem formulation with either binary or integer variables. Moreover, in McLeavey and McLeavey [15], all those solution methods, together with the continuous relaxation method of Luus [16], the greedy-type heuristic method of Aggarwal et al. [17], and the branch-and-bound method of Ghare and Taylor [9], have been compared against one another in performance, measured in terms of CPU time, optimality rate, relative error from the optimum, and coding effort.
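A minimal sketch of the Ghare-Taylor bound computation is given below. The data and the variable names are illustrative assumptions, and the bound applies to the transformed objective, which lives on a log scale.

```python
import math

# Sketch: the knapsack-based upper bound of Ghare and Taylor [9],
# min over m of max_i(alpha_i1) * b_m, for the transformed binary
# problem (illustrative data only).

r = [0.8, 0.9, 0.85]                 # component reliabilities
cm = [[2, 4, 3], [3, 2, 4]]          # cm[m][i]: resource m used per unit at stage i
Cm = [20, 22]                        # resource limits C_m

def a(i: int, k: int) -> float:
    """a_ik = ln(1 - q_i^(k+1)) - ln(1 - q_i^k), with q_i = 1 - r_i."""
    q = 1.0 - r[i]
    return math.log(1.0 - q ** (k + 1)) - math.log(1.0 - q ** k)

b = [Cm[m] - sum(cm[m]) for m in range(len(Cm))]   # b_m = C_m - sum_i c_mi
bound = min(max(a(i, 1) / cm[m][i] for i in range(len(r))) * b[m]
            for m in range(len(Cm)))
# the bound applies to the transformed objective, i.e. log reliability
print("Z_SRP <=", round(bound, 4))
```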

Recently, a new bounding mechanism has been proposed by Sung and Lee [10], based on which an efficient branch-and-bound algorithm has been developed. In developing the algorithm, they did not deal with any transformed binary problem, but dealt with the problem (SRP) directly.

Given any feasible system reliability $\bar{Z}$ for the problem, Sung and Lee [10] have used two properties: the optimal system reliability $Z_{\text{SRP}}$ must be equal to or greater than $\bar{Z}$, and the system reliability can be represented as a product of all the stage reliabilities. These properties are used to derive the following relation for each stage s:

$$R_s(y_s^*) \ge \frac{Z_{\text{SRP}}}{\prod_{i \ne s} R_i(y_i^u)} \ge \frac{\bar{Z}}{\prod_{i \ne s} R_i(y_i^u)}$$

where $R_s(y_s^*) = 1 - (1 - r_s)^{y_s^*}$, and $y_s^*$ and $y_s^u$ are the optimal solution and an upper bound of $y_s$, respectively. Accordingly, $y_s^*$ should be less than or equal to $y_s^u$. Therefore, the lower bound $y_s^l$ of $y_s$ can be determined as

$$y_s^l = \max\Big\{ y_s \ \Big|\ R_s(y_s) \ge \frac{\bar{Z}}{\prod_{i \ne s} R_i(y_i^u)} \Big\}$$

In addition, an efficient way of determining such upper bounds can be found based on $y_i^l\ \forall i \in I$, $K_m$, and $R_s^m$, where

$$K_m = \max_{i \in I} \left\{ \frac{\ln[R_i(y_i^l + 1)] - \ln[R_i(y_i^l)]}{g_{mi}(y_i^l + 1) - g_{mi}(y_i^l)} \right\}$$

and

$$R_s^m = K_m C_m - \sum_{i \ne s} \big\{ K_m g_{mi}(y_i^l) - \ln[R_i(y_i^l)] \big\} - \ln \bar{Z}$$

$K_m$ may be interpreted as the maximum marginal contribution to the reliability per unit addition of resource m, and $R_s^m$ is a value computed in association with $\bar{Z}$ and $y_i^l$ ($i \ne s$). Thus, the upper bound of $y_s$ can be determined as

$$y_s^u = \max\Big\{ y_s \ \Big|\ y_s \ge y_s^l;\ K_m g_{ms}(y_s) - \ln[R_s(y_s)] \le R_s^m \ \text{ or } \ g_{ms}(y_s) \le C_m - \sum_{i \ne s} g_{mi}(y_i),\ \forall m \Big\}$$

From the above discussion, the lower and upper bounds of $y_s$ can now be summarized as giving the basis for finding the optimal value $y_s^*$ such that:

(a) $y_s^l \le y_s^* \le y_s^u$, $\forall s \in I$;
(b) each $y_s^l$ can be determined for given values $y_i^u$ ($i \ne s$);
(c) each $y_s^u$ can be determined for given values $y_i^l$ ($i \ne s$); and
(d) larger lower bounds lead to smaller upper bounds, and vice versa.

Based on these relations between the lower and upper bounds, an efficient search procedure for the bounds has been proposed.

Bounding procedure:

Step 1. Set $t = 1$ and $R_i(y_{i0}^u) = 1$ (i.e. $y_{i0}^u = \infty$), where the index t represents the iteration number. Then find $\bar{Z}$ by using the greedy-type heuristic algorithm in Sung and Lee [10].

Step 2. Calculate $y_{it}^l\ \forall i \in I$ by using $y_{i,t-1}^u\ \forall i \in I$.

Step 3. Calculate $y_{it}^u\ \forall i \in I$ by using $y_{it}^l\ \forall i \in I$. If $y_{it}^u = y_{i,t-1}^u\ \forall i \in I$, then go to Step 4; otherwise, set $t = t + 1$ and go to Step 2.

Step 4. Set $y_i^u = y_{it}^u$ and $y_i^l = y_{it}^l$, and terminate the procedure.

Sung and Lee [10] have proved that both the upper and lower bounds obtained from the proposed bounding procedure converge toward the heuristic solution $\bar{y}_i$ for every i.

Theorem 1. (Sung and Lee [10]) The heuristic solution $(\bar{y}_1, \ldots, \bar{y}_{|I|})$ satisfies the relations

$$y_{i1}^l \le y_{i2}^l \le \cdots \le y_i^l \le \bar{y}_i \le y_i^u \le \cdots \le y_{i1}^u \quad \forall i \in I$$

In Sung and Lee [10], no theoretical comparison between the bound of Ghare and Taylor [9] and that of Sung and Lee [10] has been made, but a relative efficiency comparison with the branch-and-bound algorithm of Ghare and Taylor [9] on ten-stage reliability problems found the proposed algorithm to be about ten times more efficient. The numerical tests of Sung and Lee [10] also show that the smaller-gap-first-order branching procedure is more efficient than the larger-gap-first-order branching procedure, where smaller-gap-first order means that the stage with the smaller gap between upper and lower bounds is branched earlier. For details, refer to Sung and Lee [10].

Kuo et al. [11] have dealt with the problem (SRP) and proposed a method incorporating a Lagrangian relaxation method and a branch-and-bound method. However, the method does not guarantee the optimal solution, since the branch-and-bound method is applied only to search for an integer solution based on a continuous relaxed solution obtained by the Lagrangian multiplier method.

5.2.1.3 Dynamic Programming

Various research studies have shown how dynamic programming can be used to solve a variety of combinatorial reliability optimization problems. As mentioned earlier, the dynamic programming method is based on the principle of optimality. It can yield an exact solution to a problem, but its computational complexity increases exponentially as the number of constraints or the number of variables in the problem increases. For the combinatorial reliability optimization problems, two dynamic programming approaches have been applied: one is a basic dynamic programming approach [18], and the other is a dynamic programming approach using the concept of a dominance sequence [19, 20]. The first of these approaches is described in this section; the latter is introduced in Section 5.3, where the computational complexity of each of the two approaches is discussed.

Table 5.1. Problem data for the basic dynamic programming algorithm

  Stage          1     2     3
  r_i            0.8   0.9   0.85
  c_i (= cost)   2     4     3

  Constraint: C ≤ 9

The basic dynamic programming approach is applied to the redundancy optimization problem for a series system, which is to maximize the system reliability subject to only the budget constraint. The recursive formula of the problem in the basic dynamic programming algorithm is as follows:

for stage 1:
$$f_1(d) = \max_{g_1(y_1) \le d,\ y_1 \in Z^+} [R_1(y_1)], \quad 1 \le d \le C$$

for stage 2:
$$f_2(d) = \max_{g_2(y_2) \le d,\ y_2 \in Z^+} \{R_2(y_2)\, f_1[d - g_2(y_2)]\}, \quad 1 \le d \le C$$

for stage i:
$$f_i(d) = \max_{g_i(y_i) \le d,\ y_i \in Z^+} \{R_i(y_i)\, f_{i-1}[d - g_i(y_i)]\}, \quad 1 \le d \le C$$

for the last stage $|I|$:
$$f_{|I|}(C) = \max_{y_{|I|} \in Z^+} \{R_{|I|}(y_{|I|})\, f_{|I|-1}[C - g_{|I|}(y_{|I|})]\}$$

where C represents the amount of budget available, and $R_i(y_i)$ represents the reliability of stage i when stage i has $y_i$ redundant units.

To illustrate the basic dynamic programming algorithm, a numerical example is solved with the data given in Table 5.1; the results are summarized in Table 5.2.
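The recursion is short to implement. The sketch below assumes linear stage costs $g_i(y) = c_i y$, as in Table 5.1, and tabulates $f_i(d)$ together with the maximizing allocation; for the Table 5.1 data it returns (1, 1, 1) with reliability $0.85 \times 0.72 = 0.612$, consistent with the stage-3 entries of Table 5.2.

```python
# Sketch: the basic dynamic programming recursion for a series system,
# assuming linear stage costs g_i(y) = c_i * y (as in Table 5.1).

r = [0.8, 0.9, 0.85]   # component reliability per stage
c = [2, 4, 3]          # unit cost per stage
C = 9                  # available budget

def stage_rel(i: int, y: int) -> float:
    """R_i(y) = 1 - (1 - r_i)^y for y parallel units."""
    return 1.0 - (1.0 - r[i]) ** y

# f[d]: (best reliability, allocation) over the stages processed so far,
# using a budget of at most d.
f = [(1.0, [])] * (C + 1)
for i in range(len(r)):
    g = [(0.0, [])] * (C + 1)
    for d in range(C + 1):
        y = 1                               # at least one unit per stage
        while y * c[i] <= d:
            prev_val, prev_alloc = f[d - y * c[i]]
            cand = prev_val * stage_rel(i, y)
            if cand > g[d][0]:
                g[d] = (cand, prev_alloc + [y])
            y += 1
    f = g

best, alloc = f[C]
print(alloc, round(best, 3))   # -> [1, 1, 1] 0.612 for the Table 5.1 data
```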

5.2.2 Heuristic Solution Approach

This section presents a variety of heuristic methods that can give integer solutions directly. As discussed already, the given problems are known to be in the class of NP-hard problems. Therefore, the solution methods discussed in the previous sections may give optimal solutions, but their computation time may increase exponentially as the problem size (number of variables) increases. This is why a variety of heuristic methods have been considered as practical approaches for finding approximate solutions in a reasonable time.

Research on heuristic approaches was extensive from the early 1970s to the mid 1980s. Most of these heuristics are categorized as greedy-type heuristic methods. For example, Sharma and Venkateswaran [21] considered a problem in 1971 whose objective was to maximize the associated system reliability subject to nonlinear constraints, which is the same as the problem (SRP). For the problem, they presented a simple algorithm that transforms the objective function into an easier, failure-probability form, as follows:

$$\max \prod_{i \in I} [1 - (1 - r_i)^{y_i}] = \min \Big\{ 1 - \prod_{i \in I} [1 - (1 - r_i)^{y_i}] \Big\} \cong \min \Big\{ 1 - \sum_{i \in I} [1 - (1 - r_i)^{y_i}] \Big\} \cong \min \sum_{i \in I} (1 - r_i)^{y_i} = \min \sum_{i \in I} q_i^{y_i}$$

where $q_i$ is the failure probability of the component at stage i.

The algorithm of Sharma and Venkateswaran [21] adds one component per iteration to the stage having the largest failure probability among the stages in the system, until no more components can be added due to the resource constraints.


Table 5.2. Example of the basic dynamic programming algorithm

  Stage 1            Stage 2                                       Stage 3
  d   y1   f1(d)     d   y2   R2(y2)   f1(d − c2 y2)   f2(d)      y3   R3(y3)   f2(d − c3 y3)
  2   1    0.8       4   1    0.9      0               0          1    0.85     0.72 a
  4   2    0.96      6   1    0.9      0.8             0.72       2    0.978    0
  6   3    0.992     8   2    0.99     0               0          3    0.997    0
  8   4    0.998     —   —    —        —               —          —    —        —

  a Optimal solution value.

Table 5.3. Problem data for the heuristic algorithm

  Stage             1     2     3
  q_i               0.1   0.2   0.15
  c1_i (= cost)     5     7     7
  c2_i (= weight)   8     7     8

  Constraints: C1 ≤ 45 and C2 ≤ 54

For the detailed step-by-step procedure, refer to Sharma and Venkateswaran [21].

To illustrate the algorithm of Sharma and Venkateswaran [21], a numerical example is solved with the data given in Table 5.3.

Starting with the feasible solution (1, 1, 1), one unit is added at each iteration of the algorithm, as shown in Table 5.4. After four iterations, the final solution (2, 3, 2) is obtained, and its objective function value is 0.96.
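A sketch of the greedy rule follows. One modeling detail is glossed: when the worst stage cannot accept another unit, this version tries the next-worst stage, which makes no difference for this data set.

```python
# Sketch: the greedy rule of Sharma and Venkateswaran [21] on the
# Table 5.3 data: repeatedly add a unit to the stage whose failure
# probability q_i^(y_i) is largest, while the constraints permit.

q = [0.1, 0.2, 0.15]          # stage failure probabilities
cost = [5, 7, 7]              # resource 1 (cost) per unit
weight = [8, 7, 8]            # resource 2 (weight) per unit
C1, C2 = 45, 54               # resource limits

y = [1, 1, 1]                 # start from one unit per stage
while True:
    order = sorted(range(3), key=lambda i: q[i] ** y[i], reverse=True)
    used_c = sum(cost[i] * y[i] for i in range(3))
    used_w = sum(weight[i] * y[i] for i in range(3))
    for i in order:           # worst stage first
        if used_c + cost[i] <= C1 and used_w + weight[i] <= C2:
            y[i] += 1
            break
    else:
        break                 # no stage can take another unit

rel = 1.0
for i in range(3):
    rel *= 1.0 - q[i] ** y[i]
print(y, round(rel, 2))       # -> [2, 3, 2] 0.96, as in Table 5.4
```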

The algorithm of Sharma and Venkateswaran [21] may not yield a good (near-optimal) solution if the components available at the stages have similar reliabilities but quite different resource consumption rates. Besides, in certain cases, the balance of unused resources may prevent the addition of one component to the stage having the lowest reliability, yet permit the addition of more than one component to another stage having the highest reliability; the latter can yield a greater net increase in system reliability than the former. Their selection rule, however, is based only on stage-wise reliability, without considering the effectiveness of resource utilization. In fact, this effectiveness can be represented by the contribution of each component addition to the whole system reliability. Therefore, Aggarwal et al. [17] proposed a heuristic criterion for solving the resource-constrained problem that takes resource utilization effectiveness into account. The heuristic algorithm is based on the concept that a component is added to the stage where its addition produces the largest ratio of the increment in reliability to the product of the increments in resource usage. Accordingly, the stage selection criterion is defined as

$$F_i(y_i) \equiv \frac{\Delta(q_i^{y_i})}{\prod_{m \in M} \Delta g_{mi}(y_i)}$$

where $\Delta(q_i^{y_i}) = q_i^{y_i} - q_i^{y_i+1}$ for all i, and $\Delta g_{mi}(y_i) = g_{mi}(y_i + 1) - g_{mi}(y_i)$ for all i and m.

In the case of linear constraints, however, all $F_i(y_i)$ can be evaluated by using the recursive relation

$$F_i(y_i + 1) = q_i F_i(y_i)$$

For details, refer to Aggarwal et al. [17]. The algorithm of Aggarwal et al. [17] is illustrated with the data of Table 5.3, as summarized in Table 5.5.


Table 5.4. Illustration of the stepwise iteration of the algorithm of Sharma and Venkateswaran [21]

  Iteration   No. of units per stage   Failure probability per stage    Cost   Weight
              1  2  3                  1        2        3
  0           1  1  1                  0.1      0.2 a    0.15           19     23
  1           1  2  1                  0.1      0.04     0.15 a         26     30
  2           1  2  2                  0.1 a    0.04     0.0225         33     38
  3           2  2  2                  0.01     0.04 a   0.0225         38     46
  4           2  3  2                  0.01     0.008    0.0225         45     53

  a The stage to which a redundant unit is to be added.

Table 5.5. Illustration of the stepwise iteration of the algorithm of Aggarwal et al. [17]

  Iteration   No. of units per stage   F_i(y_i)                            Cost   Weight
              1  2  3                  1           2           3
  0           1  1  1                  0.00225     0.00327 a   0.00228     19     23
  1           1  2  1                  0.00225     0.00065     0.00228 a   26     30
  2           1  2  2                  0.00225 a   0.00065     0.00034     33     38
  3           2  2  2                  0.00023     0.00065 a   0.00034     38     46
  4           2  3  2                  —           —           —           45     53

  a The stage to which a redundant unit is to be added.

For the example given, after four iterations the algorithm gives the same final solution as the algorithm of Sharma and Venkateswaran [21].
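For linear constraints, the Aggarwal et al. [17] selection factor reduces to a one-line recursion, as the following sketch for the Table 5.3 data shows.

```python
# Sketch: the selection factor of Aggarwal et al. [17] on the
# Table 5.3 data; with linear constraints, F_i(1) is computed once
# and F_i(y+1) = q_i * F_i(y) updates it.

q = [0.1, 0.2, 0.15]          # stage failure probabilities
cost = [5, 7, 7]              # resource 1 increment per unit
weight = [8, 7, 8]            # resource 2 increment per unit

def F(i: int, y: int) -> float:
    """Reliability gain over the product of resource increments."""
    base = (q[i] - q[i] ** 2) / (cost[i] * weight[i])   # F_i(1)
    return base * q[i] ** (y - 1)                       # F_i(y+1) = q_i F_i(y)

print([round(F(i, 1), 6) for i in range(3)])
# -> [0.00225, 0.003265, 0.002277]: stage 2 has the largest factor
#    and receives the first unit, matching iteration 0 of Table 5.5.
```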

Some computational experience has revealed that the performance of the algorithm of Aggarwal et al. [17] deteriorates as the number of constraints increases. The reason may be the excessive emphasis on resources in the product term in the denominator of the selection factor, which is the ratio of the increase in system reliability to the product of the increments in consumption of the various resources when one more redundant component is added to a stage. To take care of this problem, Gopal et al. [22] proposed a heuristic criterion for selecting the stage to which a redundant unit is to be added, based on the ratio of the largest relative increment in resources to the decrement in failure probability. The relative increment in resource consumption is defined as

$$\Delta G_{mi}(y_i) = \Delta g_{mi}(y_i) / \max_{i \in I}[\Delta g_{mi}(y_i)] \quad \forall i \text{ and } m$$

where $\Delta g_{mi}(y_i)$ represents the increment in $g_{mi}(y_i)$ due to increasing $y_i$ by unity. A redundant unit is then added to the stage where its addition leads to the least value of the selection factor without violating any of the $|M|$ constraints in Equation 5.1:

$$F_i(y_i) = \max_{m \in M}[\Delta G_{mi}(y_i)] / \Delta Q_s(y_i \mid y)$$

where $\Delta Q_s(y_i \mid y)$ represents the decrement in system failure probability due to increasing $y_i$ by unity.

For a system with linear constraints, the function $\Delta G_{mi}(y_i)$ is independent of $y_i$ and hence needs to be computed only once. Accordingly, the selection factor $F_i(y_i)$ can be easily computed via the following forms:

(a) $F_i(y_i) \approx F_i(y_i - 1)/(1 - r_i)$ for $y_i > 1$, and
(b) $F_i(1) = \max_{m \in M}[\Delta G_{mi}(y_i)]/[r_i(1 - r_i)]$.

For details, refer to Gopal et al. [22]. With the same data as in Table 5.3, the algorithm of Gopal et al. [22] is illustrated in Table 5.6.


Table 5.6. Illustration of the stepwise iteration of the algorithm of Gopal et al. [22]

  Iteration   No. of units per stage   F_i(y_i)                       Cost   Weight
              1  2  3                  1          2         3
  0           1  1  1                  11.11      6.25 a    7.84       19     23
  1           1  2  1                  11.11      31.25     7.84 a     26     30
  2           1  2  2                  11.11 a    31.25     52.29      33     38
  3           2  2  2                  111.11     31.25 a   52.29      38     46
  4           2  3  2                  —          —         —          45     53

  a The stage to which a redundant unit is to be added.

Likewise, after four iterations the algorithm of Gopal et al. [22] gives the final solution, which is the same as those of the algorithms of Sharma and Venkateswaran [21] and Aggarwal et al. [17].
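The Gopal et al. [22] selection factor for linear constraints is equally compact; the following sketch reproduces the first row of Table 5.6.

```python
# Sketch: the selection factor of Gopal et al. [22] for linear
# constraints, using forms (a) and (b) above with the Table 5.3 data.

r = [0.9, 0.8, 0.85]          # stage reliabilities r_i = 1 - q_i
cost = [5, 7, 7]
weight = [8, 7, 8]

# relative increments dG_mi = dg_mi / max_i(dg_mi), constant for linear g
dG = [[x / max(cost) for x in cost], [w / max(weight) for w in weight]]
G = [max(dG[0][i], dG[1][i]) for i in range(3)]   # max over resources m

def F(i: int, y: int) -> float:
    """F_i(1) = G_i / (r_i (1 - r_i)); F_i(y) ~ F_i(y - 1) / (1 - r_i)."""
    return G[i] / (r[i] * (1.0 - r[i])) / (1.0 - r[i]) ** (y - 1)

print([round(F(i, 1), 2) for i in range(3)])
# -> [11.11, 6.25, 7.84]: the smallest value selects stage 2,
#    matching iteration 0 of Table 5.6.
```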

Other heuristic algorithms have been proposed by Misra [23] and by Nakagawa and Nakashima [24]. For example, Misra [23] suggested reducing the problem with r constraints to r single-constraint problems, each considering only one of the r constraints. In the approach, a desirability factor (the ratio of the percentage increase in system reliability to the percentage increase in the corresponding cost) is used to determine the stage to which a redundancy is added. However, the approach deals only with linear constraint problems, and as the number of constraints increases, the computation time becomes considerably larger. Nakagawa and Nakashima [24] presented a method that can solve the reliability maximization problem with nonlinear constraints for a series system; balancing the reliability objective function against the resource constraints is a basic consideration in the method. A limitation of the approach is that it cannot be used to solve problems with complex system configurations. For detailed algorithms and numerical examples, refer to Misra [23] and Nakagawa and Nakashima [24].

Kuo et al. [25] have presented a note on some of the heuristic algorithms of Sharma and Venkateswaran [21], Aggarwal et al. [17], Misra [23], and Nakagawa and Nakashima [24]. Nakagawa and Miyazaki [26] have compared three heuristic algorithms against one another, including those of Sharma and Venkateswaran [21], Gopal et al. [22], and Nakagawa and Nakashima [24], in terms of CPU time, optimality rate, and relative error.

5.3 Combinatorial Reliability Optimization Problems of a Non-series Structure

This section considers problems having non-series structures, including mixed series–parallel structures and general structures.

5.3.1 Mixed Series–Parallel System Optimization Problems

As mentioned earlier, a mixed series–parallel structure is one in which the relation between any two stages is either series or parallel. For such a structure, the system reliability function is not expressed as a product of the reliabilities of all stages, but as a multinomial form of the reliabilities of multiple paths. Therefore, the system reliability depends on the system structure and is complicated to compute. However, it is known that the system reliability of a mixed series–parallel structure can be computed in time polynomial in the number of stages, so that the complexity of computing the system reliability does not increase exponentially as the number of stages in the system increases.

One often sees mixed series–parallel structures in international telephone systems. For example, each international telephone system between two countries is composed of two major subsystems: a wire system and a wireless system. In the whole system, all calls from one country to another are collected at an international gateway; some of the calls are transmitted through the wireless system from an earth station of one country to the corresponding earth station of the other country via a communications satellite, and the rest are transmitted through the wire system (composed of terminals and optical cables) from the international gateway to the corresponding international gateway of the other country. For these international communications, both the wire and wireless systems are equipped with multiple modules in parallel, each composed of various chipsets, so as to increase the whole system reliability. Thus, the associated design issue of how many modules to install in each subsystem for an optimal international telephone system can be handled as a mixed series–parallel system optimization problem.

Only a few studies have considered the reliability optimization issue for such mixed series–parallel systems. For example, Burton and Howard [27] have dealt with the problem of maximizing system reliability subject to one resource constraint, such as a budget restriction; their algorithm is based on basic dynamic programming. Recently, Cho [28] has proposed a dominance-sequence-based dynamic programming algorithm for the problem with binary variables, analyzed its computational complexity, and compared it with the algorithm of Burton and Howard [27].

We now introduce the dominance-sequence-based dynamic programming algorithm of Cho [28], which deals with a mixed series–parallel system structure. For developing the algorithm, the following assumptions are made. First, the resource consumption of the entire system is represented by a discrete linear function. Second, the resource coefficients are all integer valued. The proposed algorithm is composed of two phases. The first phase constructs a reduction-order graph, defined as a directed graph giving the precedence relation between nodes in the proposed mixed series–parallel system. Referring to Satyanarayana and Wood [2], a mixed series–parallel system can be reduced to a one-node network with complexity of order O(|E|), where |E| denotes the cardinality of the set of edges in the given mixed series–parallel system. Two examples of reduction-order graphs are depicted in Figure 5.1. The order of reducing the given graph is not unique, but the structure of the reduction-order graph remains the same as that of the given graph, which is represented as a tree; the number of end nodes of the graph depends on the given system structure. The second phase is a process of reducing two stages into one (by series-stage reduction or parallel-stage reduction), as follows.

Series-stage reduction. Suppose that stages i and k are merged in series relation to generate a new stage s. Then the reduction is processed as $r_{ij} r_{kl} \to r_{s,j \times l}$ and $c_{mij} + c_{mkl} \to c_{m,s,j \times l}$ for all m. Accordingly, the variable $y_{s,j \times l}$ is included in stage s with reliability $r_{s,j \times l}$ and consumption $c_{m,s,j \times l}$ of resource m.

Parallel-stage reduction. Suppose that stages i and k are merged in parallel relation to generate a new stage s. Then the reduction is processed as $r_{ij} + r_{kl} - r_{ij} r_{kl} \to r_{s,j \times l}$ and $c_{mij} + c_{mkl} \to c_{m,s,j \times l}$. Accordingly, the variable $y_{s,j \times l}$ is newly included in stage s with reliability $r_{s,j \times l}$ and consumption $c_{m,s,j \times l}$ of resource m.
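The two reductions amount to combining variable lists pairwise. The sketch below omits dominance pruning for brevity and fathoms only infeasible combinations; it reproduces the series merge of stages b and c from the first iteration of the example that follows.

```python
# Sketch: the two stage-reduction operations of the dominance-sequence
# dynamic programming of Cho [28]. Each stage is a list of variables
# (r, cost, weight); merged variables violating a budget are fathomed.
# Dominance pruning is omitted here for brevity.

C_COST, C_WEIGHT = 30, 17            # available resources (Table 5.7)

def merge(stage_i, stage_k, parallel: bool):
    """Combine every variable pair of two stages into a new stage."""
    out = []
    for (ri, ci, wi) in stage_i:
        for (rk, ck, wk) in stage_k:
            rs = ri + rk - ri * rk if parallel else ri * rk
            cs, ws = ci + ck, wi + wk
            if cs <= C_COST and ws <= C_WEIGHT:   # fathom infeasible combos
                out.append((round(rs, 4), cs, ws))
    return out

b = [(0.90, 3, 9), (0.98, 8, 3)]      # stage b variables
c = [(0.92, 5, 11), (0.98, 11, 4)]    # stage c variables
print(merge(b, c, parallel=False))
# series merge of stages b and c, as in the first iteration (Table 5.8);
# the combination (0.828, 8, 20) is fathomed since its weight exceeds 17
```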

The above stage reductions are now put together to formulate the dominance-sequence-based dynamic programming procedure:

Figure 5.1. Two examples of the reduction-order graph

Step 0. (Ordering) Apply the algorithm of Satyanarayana and Wood [2] to the given mixed series–parallel system and find the reduction-order graph of the system. Go to Step 1.

Step 1. (Initialization) Arrange all stages according to the reduction-order graph, and set $\sigma = \emptyset$ and $\bar{\sigma} = \{(1), (2), \ldots, (|I|)\}$, which represent the set of reduced stages and the set of non-reduced stages, respectively. Go to Step 2.

Step 2. (Reduction) If the two selected stages are in series relation, perform the series-stage reduction procedure for them; otherwise, perform the parallel-stage reduction procedure. Go to Step 3.

Step 3. (Termination) If $\bar{\sigma}$ becomes the null set, terminate the procedure; otherwise, go to Step 2.

To illustrate the dominance-sequence-based dynamic programming procedure, a numerical example is solved. The problem data are listed in Table 5.7, and the structure of Figure 5.1(a) is used for the example.

At the first iteration, the series-stage reduction procedure is applied to merge stages b and c into stage 3. The variables of stage 3 are generated in Table 5.8.

At the second iteration, the parallel-stage reduction procedure is applied to merge stages 3 and d into stage 2. At the last iteration, the series-stage reduction procedure is applied to merge stages 2 and a into stage 1, whose variables are given in Table 5.9.

The first and second variables of stage 1 are not feasible because their weight consumptions are too great. Thus, the optimal system reliability is found to be 0.9761, with cost and weight consumptions of 28 and 15, respectively.

Now the computational complexity of the dominance-sequence-based dynamic programming algorithm is analyzed and compared with that of the basic dynamic programming algorithm of Burton and Howard [27]. For the one-dimensional problem, the following properties hold.

Proposition 1. (Cho [28]) If the reduction-order graph of the given system is a tree with only one end node, then the computational complexity of the dominance-sequence-based dynamic programming algorithm of Cho [28] is of order O(|I|JC), where |I| and $|J_i|$ denote the cardinalities of I and $J_i$ respectively, $J_i$ represents the set of variables used at stage i, and $J = \max_{i \in I}\{|J_i|\}$.

Proposition 2. (Cho [28]) If the reduction-order graph of the given system is a tree with more than one end node, then the computational complexity of the dominance-sequence-based dynamic programming algorithm is of order $O(|I|C^2)$.

Proposition 3. (Cho [28]) The computational complexity of the basic dynamic programming algorithm of Burton and Howard [27] is of order $O(|I|C^2)$ for any mixed series–parallel network.


Table 5.7. Problem data (reliability, cost, weight) for the dominance-sequence-based dynamic programming algorithm a

  Component type   Stage a       Stage b      Stage c       Stage d
  1                0.92, 7, 5    0.90, 3, 9   0.92, 5, 11   0.80, 3, 3
  2                              0.98, 8, 3   0.98, 11, 4   0.90, 5, 6

  a Available cost and weight are 30 and 17, respectively.

Table 5.8. Variables generated at the first iteration of the algorithm of Cho [28]

  j   r3j (= rbj × rcj)           c3j (= cbj + ccj)   w3j (= wbj + wcj)
  1   0.828 (= 0.90 × 0.92) a     8 (= 3 + 5)         20 (= 9 + 11)
  2   0.882                       14                  13
  3   0.9016                      13                  14
  4   0.9604                      19                  7

  a The variable is fathomed because its weight consumption is not feasible (i.e. 20 > 17).

Table 5.9. Variables generated at the third iteration of the algorithm of Cho [28]

  j   r1j (= r2j × raj)             c1j (= c2j + caj)   w1j (= w2j + waj)
  1   0.9569 (= 0.9764 × 0.98) a    21 (= 17 + 4)       18 (= 16 + 2)
  2   0.9607 a                      20                  19
  3   0.9723                        26                  12
  4   0.9761                        28                  15

  a The variable is fathomed due to resource violation.

As seen in the computational complexity analysis of the one-dimensional case, the complexity of the dominance-sequence-based dynamic programming algorithm may depend on the system structure, whereas that of the basic dynamic programming algorithm of Burton and Howard [27] does not. Therefore, the dominance-sequence-based algorithm may have reduced computational complexity for systems whose reduction-order graph is a tree with one end node. For the detailed proofs, refer to Burton and Howard [27] and Cho [28].

For multi-dimensional problems, the computational complexity is characterized below.

Proposition 4. (Cho [28]) The computational complexity of the dominance-sequence-based dynamic programming algorithm of Cho [28] is of order $O(|M|J^{|I|})$ for any mixed series–parallel network.

Proposition 5. (Cho [28]) The computational complexity of the basic dynamic programming algorithm of Burton and Howard [27] is of order $O(|I|C^{2|M|})$ for any mixed series–parallel network.

In general, the number of constraints is smaller than the number of stages, so that the computational complexity of the proposed algorithm is larger than that of Burton and Howard [27], even in the situation where $J \ll C^2$. There may exist a number of variables having the same reliability but different resource consumption rates, so that the number of variables generated is not bounded by the value C at each iteration of the algorithm. For the detailed proofs, refer to Burton and Howard [27] and Cho [28].

5.3.2 General System Optimization Problems

One often sees general-structure systems in the area of telecommunications. Telecommunication systems are composed of many switches, interconnected to form complex mesh-type networks, where each switch is equipped with multiple modules in parallel, each composed of many chipsets that process traffic transmission operations, so as to increase the whole switch reliability. Thus, the associated design issue of how many modules to install in each switch for an optimal telecommunication system can be handled as a general system reliability optimization problem.

For the reliability optimization problems of complex system structures, no efficient optimal solution method has yet been derived, and only a few heuristic methods [29, 30] have been proposed. The heuristic methods, however, have been applied only to small-sized problems because, as stated earlier, the computational complexity of evaluating the system reliability of complex structures may increase exponentially as the number of stages in the system increases.

Aggarwal [29] proposed a simple heuristic algorithm for the reliability maximization problem (NRP): select the stage having the largest ratio of the increment in reliability to the increment in resource usage, and add a redundant unit at that stage. That is, a redundant unit is added to the stage where its addition has the largest value of the selection factor

$$F_i(y_i) = \frac{\Delta Q_s(y_i)}{\prod_{m \in M} g_{mi}(y_i)}$$

where $\Delta Q_s(y_i)$ represents the decrement in system failure probability when a redundant unit is added to stage i having $y_i$ redundant units, i.e.

$$\Delta Q_s(y_i) = Q_s(Q_1, \ldots, Q_i, \ldots, Q_{|I|}) - Q_s(\bar{Q}_1, \ldots, \bar{Q}_i, \ldots, \bar{Q}_{|I|})$$

where $Q_s(Q_1, \ldots, Q_{|I|})$ and $Q_i = (1 - r_i)^{y_i}$ represent the failure probabilities of the system and of stage i, respectively, and $\bar{Q}_j = Q_j\ \forall j \ne i$ and $\bar{Q}_i = (1 - r_i)^{y_i + 1}$. For the detailed step-by-step procedure, refer to Aggarwal [29].

Figure 5.2. A bridge system

To illustrate the algorithm of Aggarwal [29] for the bridge system structure given in Figure 5.2, a numerical example is solved with the data of Table 5.10; the results are presented in Table 5.11.
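Since the bridge has only five stages, $Q_s$ can be evaluated by direct state enumeration. The sketch below does this for the Table 5.10 data and ranks the stages by $\Delta Q_s(y_i)/c_i$; the first selection agrees with Table 5.11 (the table reports the factors on a percentage scale).

```python
from itertools import product

# Sketch: exact system failure probability of the Figure 5.2 bridge by
# state enumeration, used to rank the stages by Delta-Q_s(y_i) / c_i
# for the Table 5.10 data (single cost constraint).

r = [0.7, 0.85, 0.75, 0.8, 0.9]      # component reliabilities, stages 1..5
c = [3, 4, 2, 3, 2]                  # unit costs

def works(s) -> bool:
    """Minimal paths of the bridge: 1-2, 3-4, 1-4-5, 2-3-5."""
    return bool((s[0] and s[1]) or (s[2] and s[3]) or
                (s[0] and s[3] and s[4]) or (s[1] and s[2] and s[4]))

def q_sys(y) -> float:
    """System failure probability with y_i parallel units at stage i."""
    R = [1.0 - (1.0 - r[i]) ** y[i] for i in range(5)]
    q = 0.0
    for s in product([0, 1], repeat=5):
        if not works(s):
            p = 1.0
            for i in range(5):
                p *= R[i] if s[i] else 1.0 - R[i]
            q += p
    return q

y = [1, 1, 1, 1, 1]
print(round(q_sys(y), 4))            # -> 0.1087, i.e. the 10.9% of Table 5.11
gain = [q_sys(y) - q_sys(y[:i] + [y[i] + 1] + y[i + 1:]) for i in range(5)]
print(max(range(5), key=lambda i: gain[i] / c[i]) + 1)
# -> 3: the first redundant unit goes to stage 3, as in Table 5.11
```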

Shi [30] has also proposed a greedy-type heuristic method for the reliability maximization problem (NRP), which is composed of three phases. The first phase chooses, from all minimal paths of the system, the minimal path with the highest sensitivity factor

$$a_l = \frac{\prod_{i \in P_l} R_i(y_i)}{\sum_{i \in P_l} \sum_{m \in M} [g_{mi}(y_i)/(|M| C_m)]} \quad \text{for } l \in L$$

which represents the ratio of the minimal-path reliability to the fraction of the resources consumed by that path. The second phase finds, within the chosen minimal path, the stage having the highest selection factor

$$b_i = \frac{R_i(y_i)}{\sum_{m \in M} [g_{mi}(y_i)/(|M| C_m)]} \quad \text{for } i \in P_l$$

Table 5.10. Problem data for the bridge system problem

  Stage          1     2      3      4     5
  r_i            0.7   0.85   0.75   0.8   0.9
  c_i (= cost)   3     4      2      3     2

  Constraint: C ≤ 19


Table 5.11. Illustration of the stepwise iteration of the algorithm of Aggarwal [29]

  Iteration   No. of units per stage   F_i(y_i)                             Σc_i   Q_s (%)
              1  2  3  4  5            1      2      3        4       5
  1           1  1  1  1  1            1.77   1.34   2.76 a   1.07    0.27   14     10.9
  2           1  1  2  1  1            0.53   0.62   0.69     1.15 a  0.24   16     5.4
  3           1  1  2  2  1            —      —      —        —       —      19     2.6

  a The stage to which a redundant unit is to be added.

Table 5.12. Illustration of the stepwise iteration of the algorithm of Shi [30]

  No. of units per stage   Σc_i   a_l                           b_i
  1  2  3  4  5                   P1     P2       P3     P4     1   2   3       4       5
  1  1  1  1  1            14     1.62   2.28 a   1.20   1.36   —   —   7.1 b   5.1     —
  1  1  2  1  1            16     1.62   2.04     1.20   1.36   —   —   4.5     5.1 b   —
  1  1  2  2  1            19     —      —        —      —      —   —   —       —       —

  a The minimal path set having the largest sensitivity factor.
  b The stage to which a redundant unit is to be added.

The third phase checks feasibility and, if feasible, allocates a component to the selected stage. For the detailed step-by-step procedure, refer to Shi [30].
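The two selection factors of Shi [30] are straightforward to evaluate; the following sketch reproduces the first row of Table 5.12 for the Table 5.10 data.

```python
# Sketch: the two selection factors of Shi [30] for the Table 5.10
# data with the single budget constraint C = 19 (so |M| = 1).

r = [0.7, 0.85, 0.75, 0.8, 0.9]
c = [3, 4, 2, 3, 2]
C = 19.0
paths = [[0, 1], [2, 3], [0, 3, 4], [1, 2, 4]]   # P1..P4, 0-based stages

def R(i: int, y: int) -> float:
    return 1.0 - (1.0 - r[i]) ** y               # stage reliability

def a(path, y):
    """Path reliability over the fraction of the budget the path uses."""
    num = 1.0
    for i in path:
        num *= R(i, y[i])
    return num / sum(c[i] * y[i] / C for i in path)

def b(i: int, y) -> float:
    """Stage reliability over the fraction of the budget the stage uses."""
    return R(i, y[i]) / (c[i] * y[i] / C)

y = [1, 1, 1, 1, 1]
print([round(a(p, y), 2) for p in paths])    # -> [1.62, 2.28, 1.2, 1.36]: P2
print([round(b(i, y), 1) for i in paths[1]]) # -> [7.1, 5.1]: stage 3 wins
```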

To illustrate the algorithm of Shi [30], a numerical example is solved with the data of Table 5.10. Table 5.12 shows the stepwise results leading to the final solution; the minimal path sets are P1 = {1, 2}, P2 = {3, 4}, P3 = {1, 4, 5}, and P4 = {2, 3, 5}.

As seen in the above illustration, the solution obtained from the algorithm of Shi [30] is the same as that of the algorithm of Aggarwal [29].

Recently, Ravi et al. [31] proposed a simulated-annealing technique for the reliability optimization problem of a general system structure, with the objective of maximizing the system reliability subject to multiple resource constraints.

5.4 Combinatorial Reliability Optimization Problems with Multiple-choice Constraints

In considering a redundant system, an important practical issue is the handling of a variety of different design alternatives in choosing each component type. That is, more than one component type may have to be considered in choosing each redundant unit for a subsystem. This combinatorial situation makes the associated reliability optimization model incorporate a new constraint, called a multiple-choice constraint. For example, a guided-weapons system, such as an anti-aircraft gun, has three major subsystems: the target detection/tracking subsystem, the control console subsystem, and the fault detection subsystem. A failure of any one of these subsystems implies failure of the whole system. Moreover, each subsystem can have many components. For instance, the target detection/tracking subsystem may have three important components: radar, a TV camera, and forward-looking infra-red equipment. These components are normally assembled in a parallel-redundant structure (rather than a standby one) to maintain the subsystem reliability in situations where no on-site repair work is possible. For such subsystems, a variety of different component types may be available to choose from for each component.


It is often desired at each stage to consider the practical design issue of handling a variety of combinations of different component types, along with various design methods such as parallel redundancy of two ordinary (similar) units, standby redundancy of main and slave units, and 2-out-of-3 redundancy of ordinary units. For this design issue, the associated reliability optimization problem is required to select only one design combination from among the different design alternatives.

The reliability optimization problem, called Problem (NP), is expressed as the following binary nonlinear integer program:

$$Z_{\text{NP}} = \max \prod_{i \in I} \Big( \sum_{j \in J_i} r_{ij} x_{ij} \Big)$$

subject to

$$\sum_{i \in I} \sum_{j \in J_i} c_{mij} x_{ij} \le C_m \quad \forall m \in M \quad (5.11)$$
$$\sum_{j \in J_i} x_{ij} = 1 \quad \forall i \in I \quad (5.12)$$
$$x_{ij} \in \{0, 1\} \quad \forall j \in J_i,\ i \in I \quad (5.13)$$

where $x_{ij}$ indicates whether or not the jth design alternative is used at stage i, and $r_{ij}$ and $c_{mij}$ represent the reliability of design alternative $x_{ij}$ and its consumption of resource m, respectively. The constraints in Equations 5.11 and 5.12 represent the resource consumption constraints and the multiple-choice constraints, respectively, and the constraint in Equation 5.13 defines the decision variables.

5.4.1 One-dimensional Problems

The one-dimensional case of Problem (NP) has the constraints in Equation 5.11 expressed as

$$\sum_{i \in I} \sum_{j \in J_i} c_{ij} x_{ij} \le C$$

The problem has been investigated by Sung and Cho [32], who proposed a branch-and-bound solution procedure for it, along with a reduction procedure for its solution space to improve the search efficiency. For the branch-and-bound procedure, the lower and upper bounds of the optimal reliability at each stage are derived as in Theorems 2 and 3, given a feasible system reliability.

Theorem 2. (Sung and Cho [32]) Let $Z^l$ be a feasible system reliability. Then $R_s^l$, where $R_s^l = Z^l / \prod_{i \in I \setminus \{s\}} r_{i,|J_i|}$, serves as a lower bound of the reliability at stage s in the optimal solution. The corresponding lower bound index of $x_{sj}$ for $j \in J_s$ is determined as $j_s^l = \min\{j \mid r_{sj} \ge R_s^l,\ j \in J_s\}$, so that $x_{sj} = 0$, $\forall j < j_s^l$.

Theorem 3. (Sung and Cho [32]) At stage s, $b_s = \sum_{i \in I \setminus \{s\}} \min\{c_{ij} \mid r_{ij} \ge R_i^l,\ j \in J_i\}$ is the minimum budget allocated to the system excluding stage s. Given the total budget C, $r_{s,s^u}$, where $s^u = \max\{j \mid c_{sj} \le C - b_s,\ j \in J_s\}$, is an upper bound of the reliability at stage s in the optimal solution, and $s^u$ is the corresponding upper bound index at stage s, so that $x_{sj} = 0$, $\forall j > s^u$.

For the detailed proofs of Theorems 2 and 3, refer to Sung and Cho [32]. The computational complexity of one iteration of the full bounding process for finding the stage-wise lower and upper bounds (by Theorems 2 and 3, respectively), from stage 1 through stage |I|, is of order $O[\max(|I|^2, |I|J)]$, where $J = \max_{i \in I} |J_i|$. The number of iterations of the full bounding process required for the solution space reduction is of order O(|I|J). Therewith, the total bounding process for the solution space reduction requires computational complexity of order $O[\max(|I|^3 J, |I|^2 J^2)]$. Table 5.13 shows how efficient Theorems 2 and 3 are in reducing the solution space.
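A sketch of one pass of this reduction is given below; the data structures and the feasible reliability Z are hypothetical, and repeated passes could tighten the sets further.

```python
from math import prod

# Sketch: one pass of the solution-space reduction of Theorems 2 and 3
# for the one-dimensional multiple-choice problem. stages[i] lists the
# alternatives (r_ij, c_ij) of stage i in increasing order of j, and Z
# is any feasible system reliability (hypothetical data below).

def reduce_space(stages, C, Z):
    # Theorem 2: drop alternatives below the stage-wise reliability floor
    floors = [Z / prod(s[-1][0] for k, s in enumerate(stages) if k != i)
              for i in range(len(stages))]
    kept = [[(rj, cj) for (rj, cj) in stages[i] if rj >= floors[i]]
            for i in range(len(stages))]
    # Theorem 3: drop alternatives too costly given the cheapest
    # Theorem-2-surviving choices at the other stages
    out = []
    for i in range(len(kept)):
        b = sum(min(cj for (_, cj) in kept[k])
                for k in range(len(kept)) if k != i)
        out.append([(rj, cj) for (rj, cj) in kept[i] if cj <= C - b])
    return out

stages = [[(0.9, 2), (0.99, 4)], [(0.85, 3), (0.9775, 6)]]
print(reduce_space(stages, C=9, Z=0.86))
# -> [[(0.9, 2)], [(0.9775, 6)]]: only one alternative survives per stage
```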

Sung and Cho [32] derived several different bounds for the branch-and-bound procedure: a bound derived by a continuous relaxation approach, a bound derived by a Lagrangian relaxation approach, and a bound derived by use of the results of Theorems 2 and 3. Among them, the Lagrangian relaxation approach has been considered for finding the upper bound value of the given objective function.


Table 5.13. Problem data (C = 24) and stepwise results of the solution space reduction procedure [31]

  j   Stage 1               Stage 2           Stage 3               Stage 4
      r1j          c1j      r2j       c2j     r3j          c3j      r4j     c4j
  1   0.9          2 a      0.85      3 a     0.8          2 a      0.75    3 a
  2   0.99         4        0.9775    6       0.96         4        0.938   6
  3   0.999        6        0.9966    9       0.99         6        0.984   9
  4   1 − 0.1^4    8        0.9995    12 b    0.998        8        0.996   12 b
  5   1 − 0.1^5    10 b     0.9999    15 b    0.9997       10 b     0.999   15 b
  6   1 − 0.1^6    12 b                       0.9999       12 b
  7   1 − 0.1^7    14 b                       1 − 0.2^7    14 b
  8   1 − 0.1^8    18 b                       1 − 0.2^8    16 b

  J1 = {2, 3, 4}   J2 = {3, 4, . . . , 8}   J3 = {2, 3, 4}   J4 = {2, 3}

  a Variables discarded by Theorem 2.
  b Variables discarded by Theorem 3.

In the continuous relaxation approach, the Lagrangian dual of the continuous relaxed problem, with the constraints in Equation 5.11 dualized, has been found to satisfy the integrality property. This implies that the optimal objective value of the Lagrangian dual problem is equal to that obtained by the continuous relaxation approach.

In order to derive the associated solution bounds, some notation is introduced. Let $\sigma_p = \{(1), (2), \ldots, (p)\}$ denote the set of stages already branched from level 0 down to level p in the branch-and-bound tree, and let $\bar{\sigma}_p = \{(p+1), \ldots, (|I|)\}$ denote its complement. For an upper bound derivation, let $N(j_{(1)}, j_{(2)}, \ldots, j_{(p)})$ denote a (branched) node in level p of the branch-and-bound tree. The node represents a partial solution in which the variables of the stages in $\sigma_p$ have already been fixed as $x_{(1),j_{(1)}} = 1, \ldots, x_{(p),j_{(p)}} = 1$, while those of the stages in $\bar{\sigma}_p$ are not yet determined. Then any one of the following three expressions can be used as a bound at the branching node $N(j_{(1)}, j_{(2)}, \ldots, j_{(p)})$:

(b.1) $B_1(j_{(1)}, \ldots, j_{(p)}) = \prod_{i \in \sigma_p} r_{i,j_i} \prod_{i \in \bar{\sigma}_p} r_{i,j_i^u}$

(b.2) $B_2(j_{(1)}, \ldots, j_{(p)}) = \prod_{i \in \sigma_p} r_{i,j_i} \prod_{i \in \bar{\sigma}_p} r_{i,\bar{j}_i}$

(b.3) $B_3(j_{(1)}, \ldots, j_{(p)}) = \prod_{i \in \sigma_p} r_{i,j_i}\, \mathrm{e}^{Z^*_{\text{CLNP}}(j_{(1)}, \ldots, j_{(p)})}$

where $j_i^u$, $\forall i \in \bar{\sigma}_p$, is the upper bound index derived by using the results of Theorems 2 and 3,

$$\bar{j}_i = \max\{j \mid c_{ij} \le C - \bar{c}_i,\ j \in J_i\}, \qquad \bar{c}_i = \sum_{k \in \sigma_p} c_{k,j_k} + \sum_{k \in \bar{\sigma}_p \setminus \{i\}} \min_{j \in J_k}\{c_{kj}\}$$

and $Z^*_{\text{CLNP}}(j_{(1)}, \ldots, j_{(p)})$ denotes the optimal objective value of the partial LP problem (corresponding to the remaining stages $(p+1)$ through $(|I|)$) at the branching node, obtained by relaxing the variables in $\bar{\sigma}_p$ to continuous ones.

Bound (b.1) is obtained from the proposed problem based on the reduced solution space (Theorems 2 and 3), and bound (b.2) is derived by applying the results of Theorem 3 to node $N(j_{(1)}, \ldots, j_{(p)})$. It is easily seen that $B_1(j_{(1)}, \ldots, j_{(p)}) \ge B_2(j_{(1)}, \ldots, j_{(p)})$.

For bounds (b.2) and (b.3), it has not been proved that the value of bound (b.3) is always less than or equal to that of bound (b.2). However, it has been proved that the upper bound value of one of the integer-valued stages in bound (b.3) is less than that of the stage in bound (b.2), under the condition that some stages have integer solution values in the optimal solution of the continuous relaxed problem. This implies that the value of bound (b.3) is probably less than that of bound (b.2), a conclusion also borne out in the numerical tests. For the detailed derivation, refer to Sung and Cho [32].

Based on the above discussion, Sung and Cho [32] derived a branch-and-bound procedure under the depth-first-search principle.

Branching strategy. At each branching step, select a non-fathomed node with the largest upper bound value for the next branching, based on the depth-first-search principle. Then, for the selected node $N(j_{(1)}, \ldots, j_{(p)})$, branch all of its immediately succeeding nodes $N(j_{(1)}, \ldots, j_{(p)}, j')$, $\forall j' \in J_{(p+1)}$.

Fathoming strategy. For each currently branched node, compute the bound $B(j_{(1)}, \ldots, j_{(p)})$. If the bound satisfies the following condition (a), then fathom the node:

condition (a): $B(j_{(1)}, \ldots, j_{(p)}) \le Z_{\text{opt}}$

where $Z_{\text{opt}}$ represents the current best feasible system reliability obtained from the branch-and-bound procedure. Moreover, any node satisfying the following Proposition 6 is fathomed from any further consideration in the solution search.

Proposition 6. (Sung and Cho [32]) Each branch node $N(j_{(1)}, \ldots, j_{(p)}, j^*)$ with $j^* \in J_{(p+1)}$ and $j^* > \bar{j}_{(p+1)}$ needs to be removed from the solution search tree.

The variables satisfying the inequality $j^* > \bar{j}_{(p+1)}$ lead to infeasible leaf solutions in the associated branch-and-bound tree, so that any branching node with such variables can be eliminated from further consideration in finding the optimal solution, where a leaf solution corresponds to a node in the last (bottom) level of the branch-and-bound tree. Thus, any currently branched node $N(j_{(1)}, \ldots, j_{(p)}, j^*)$, $j^* \in J_{(p+1)}$, satisfying the following condition (b) is removed from the solution search tree:

condition (b): $\{j^* \mid c_{(p+1),j^*} > C - \bar{c}_{(p+1)},\ j^* \in J_{(p+1)}\}$

Backtracking strategy. If all the currently examined nodes need to be fathomed, or the currently examined node is a leaf node, then the immediately preceding (parent) node is backtracked to. This backtracking operation continues until a node to be branched is found.

The above-discussed properties are now put together to formulate the branch-and-bound procedure:

Step 0. (Variable reduction) Reduce the solution space by using Theorems 2 and 3. Go to Step 1.

Step 1. (Initialization) Arrange all stages in the smallest-gap-first ordered sequence. Go to Step 2.

Step 2. (Branching) According to the branching strategy, select the next branching node and implement the branching process at the selected node. Go to Step 3.

Step 3. (Fathoming) For each currently branched node, check whether it satisfies the fathoming condition (b). If it does, fathom it; otherwise, compute the upper bound associated with the node and use it to decide whether the node needs to be fathomed according to the fathoming condition (a). If all the currently branched nodes need to be fathomed, go to Step 5. If the currently examined (non-fathomed) node is a leaf node, go to Step 4. Otherwise, go to Step 2.

Step 4. (Updating) Whenever a solution (leaf node) is newly found, compare it with the current best solution. Update the current best solution by replacing it with the newly obtained solution, and fathom every other node whose bound is less than or equal to the reliability of the newly updated best solution. Go to Step 5.

Step 5. (Backtracking and terminating) Backtrack according to the backtracking strategy. If there are no more branching nodes to be examined, then terminate the search process; the current best solution is optimal. Otherwise, go to Step 2.

The smaller problem resulting from Table 5.13 is then solved by employing the branch-and-bound procedure with bound (b.2); this is depicted in Figure 5.3 for illustration, where the number in each circle denotes the node number. As seen from Figure 5.3, the stage ordering in the branch-and-bound process can be decided by sequencing the associated tree levels in the smallest-gap-first order 2–4–1–3, since |J2| = 2, |J4| = 2, |J1| = 3, and |J3| = 3. For details, refer to Sung and Cho [32].

Figure 5.3. The illustrative tree of the branch-and-bound procedure
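The following sketch shows the depth-first search skeleton with the simpler product bound $\prod_{i \in \sigma_p} r_{i,j_i} \prod_{i \in \bar{\sigma}_p} \max_j r_{ij}$ (the looser bound noted in Section 5.4.2) and a condition (b)-style cost look-ahead. It is not the exact bookkeeping of Sung and Cho [32], and the data are illustrative.

```python
# Sketch: a depth-first branch-and-bound skeleton for the multiple-choice
# problem, with the simple product bound prod(fixed r) * prod(max
# remaining r) and a feasibility look-ahead on the remaining stages.

def solve(stages, C):
    """stages[i] = list of (r_ij, c_ij); returns (best Z, chosen j's)."""
    n = len(stages)
    max_r = [max(rr for rr, _ in s) for s in stages]    # for the bound
    min_c = [min(cc for _, cc in s) for s in stages]    # cheapest completion
    best = [0.0, None]

    def dfs(i, rel, spent, picks):
        if i == n:
            if rel > best[0]:
                best[0], best[1] = rel, picks[:]
            return
        bound = rel
        for k in range(i, n):
            bound *= max_r[k]
        if bound <= best[0]:                # even the best completion loses
            return
        rest = sum(min_c[i + 1:])           # minimum cost of later stages
        for j, (rr, cc) in enumerate(stages[i]):
            if spent + cc + rest <= C:      # prune infeasible branches
                picks.append(j)
                dfs(i + 1, rel * rr, spent + cc, picks)
                picks.pop()

    dfs(0, 1.0, 0.0, [])
    return best[0], best[1]

stages = [[(0.99, 4), (0.999, 6)], [(0.9775, 6), (0.9966, 9)],
          [(0.96, 4), (0.99, 6)], [(0.938, 6), (0.984, 9)]]
print(solve(stages, C=24))
```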

5.4.2 Multi-dimensional Problems

The extension of the problem of Section 5.4.1 to a problem with multiple resource constraints, called a multi-dimensional problem, has been handled by Chern and Jan [33] and by Sung and Cho [34], who have proposed branch-and-bound methods for it. Sung and Cho [34] derived solution bounds based on some solution properties and compared them with those of Chern and Jan [33].

For a cost minimization problem with multiple resource constraints, Sasaki et al. [35] presented a partial enumeration method based on the algorithm of Lawler and Bell [8], in which the objective is to minimize the system cost subject to a target system reliability.


Table 5.14. Illustration of the solution space reduction by Theorems 4 and 5 [33]

  j   Stage 1                          Stage 2                    Stage 3
      r1j            c11j    c21j      r2j        c12j    c22j    r3j         c13j    c23j
  1   0.99           4       2         0.8 a      3       3       0.92 a      5       6
  2   0.9999         8       4         0.9 a      3       9       0.98        11      4
  3   1 − 10^−6      12      6         0.96 a     6       6       0.9936      10      12
  4   1 − 10^−8 a    16      8         0.98       8       3       0.9984      16      10
  5   1 − 10^−10 a   20      10        0.992      9       9       0.9996 a    22      8
  6                                    0.996      11      6
  7                                    0.9992     14      9
  8                                    0.9996     16      6
  9                                    0.9999 a   19      9

  J1 = {1, 2, 3}   J2 = {3, 4, . . . , 8}   J3 = {2, 3, 4}

  a Variables discarded by the reduction properties.

Sung and Cho [34] have proved that the bounding properties of Theorems 2 and 3 can be easily extended to the multi-dimensional problem. Accordingly, Theorems 4 and 5 can be used for reducing the solution space by eliminating unnecessary variables in finding the optimal solution.

Theorem 4. (Sung and Cho [34]) Define $\bar{R}_s = \bar{Z} / \prod_{i \in I \setminus \{s\}} r_{i,|J_i|}$ for any stage s, where $\bar{Z}$ is a feasible system reliability of the multi-dimensional problem. Then the reliability $R_s^*$ of stage s in the optimal solution to the problem satisfies $R_s^* \ge \bar{R}_s$. Moreover, the set of variables $\{x_{sj},\ j \in J_s : r_{sj} < \bar{R}_s\}$ take on values of zero in the optimal solution.

Theorem 5. (Sung and Cho [34]) Define

$$b_s^m = \sum_{i \in I \setminus \{s\}} \min_{j \in J_i} \{c_{ij}^m : r_{ij} \ge \bar{R}_i\}$$

and

$$s^u = \arg\max_{j \in J_s} \{x_{sj} : c_{sj}^m \le C_m - b_s^m,\ \forall m \in M\}$$

Then the reliability $R_s^*$ of stage s in the optimal solution to the problem satisfies $R_s^* \le r_{s,s^u}$. Moreover, the set of variables $\{x_{sj},\ j \in J_s : r_{sj} > r_{s,s^u}\}$ take on values of zero in the optimal solution.

Table 5.14 shows how efficient Theorems 4 and 5 are in reducing the solution space.

Sung and Cho [34] derived a bounding strategy by using the results of Theorem 5. For an upper bound derivation, let $N(j_{(1)}, j_{(2)}, \ldots, j_{(p)})$ denote a node in level p of the associated branch-and-bound tree. The node represents a partial solution in which the variables of the stages in $\sigma_p$ have already been fixed as $x_{(1),j_{(1)}} = 1, \ldots, x_{(p),j_{(p)}} = 1$, while those of the stages in $\bar{\sigma}_p$ are not yet determined. Thus, an upper bound $B(j_{(1)}, j_{(2)}, \ldots, j_{(p)})$ on the objective function at the node can be computed as

$$B(j_{(1)}, \ldots, j_{(p)}) = \prod_{i \in \sigma_p} r_{i,j_i} \prod_{i \in \bar{\sigma}_p} r_{i,\bar{j}_i}$$

where

$$\bar{j}_i = \arg\max_{j \in J_i} \{x_{ij} : c_{ij}^m \le C_m - \bar{c}_i^m,\ \forall m \in M\}$$

and

$$\bar{c}_i^m = \sum_{k \in \sigma_p} c_{k,j_k}^m + \sum_{k \in \bar{\sigma}_p \setminus \{i\}} \min_{j \in J_k} \{c_{kj}^m\}$$


Note that this upper bound is less than or equal to that suggested by Chern and Jan [33], since

$$B(j_{(1)}, \ldots, j_{(p)}) = \prod_{i \in \sigma_p} r_{i,j_i} \prod_{i \in \bar{\sigma}_p} r_{i,\bar{j}_i} \le \prod_{i \in \sigma_p} r_{i,j_i} \prod_{i \in \bar{\sigma}_p} \max_{j \in J_i}\{r_{ij}\}$$

For the problem, Sung and Cho [34] proposed a branch-and-bound procedure based on the depth-first-search principle, as for the one-dimensional case, whereas Chern and Jan [33] suggested a branch-and-bound procedure on a binary tree, referred to as a tree of Murphee [36]. Numerical tests have shown that the branch-and-bound procedure of Sung and Cho [34] finds the optimal solution much more quickly than that of Chern and Jan [33]. For detailed illustrations, refer to Chern and Jan [33] and Sung and Cho [34].

5.5 Summary

This chapter has considered various combinatorial reliability optimization problems with multiple resource constraints or multiple-choice constraints incorporated, for a variety of different system structures, including series systems, mixed series–parallel systems, and general systems.

Each of the problems is known to be NP-hard. This has motivated the development of various solution approaches, including optimal solution approaches, continuous relaxation approaches, and heuristic approaches, based on various problem-dependent solution properties. In particular, owing to the combinatorial nature of the problems, this chapter has focused on combinatorial solution approaches, including the branch-and-bound method, the partial enumeration method, the dynamic programming method, and greedy-type heuristic methods, which can give integer optimal or heuristic solutions.

These days, metaheuristic approaches, including the simulated annealing method, the tabu search method, the neural network approach, and genetic algorithms, have been becoming popular for combinatorial optimization. However, not many applications of those approaches to combinatorial reliability optimization problems have been made yet, so they are not discussed here.

References

[1] Valiant LG. The complexity of enumeration and reliability problems. SIAM J Comput 1979;8:410–21.
[2] Satyanarayana A, Wood RK. A linear-time algorithm for computing K-terminal reliability in series–parallel networks. SIAM J Comput 1985;14:818–32.
[3] Chern MS. On the computational complexity of reliability redundancy allocation in a series system. Oper Res Lett 1992;11:309–15.
[4] Malon DM. When is greedy module assembly optimal? Nav Res Logist 1990;37:847–54.
[5] Tillman FA, Hwang CL, Kuo W. Optimization techniques for system reliability with redundancy: a review. IEEE Trans Reliab 1977;R-26:148–55.
[6] Tillman FA, Liittschwager JM. Integer programming formulation of constrained reliability problems. Manage Sci 1967;13:887–99.
[7] Misra KB. A method of solving redundancy optimization problems. IEEE Trans Reliab 1971;R-20:117–20.
[8] Lawler EL, Bell MD. A method of solving discrete optimization problems. Oper Res 1966;14:1098–112.
[9] Ghare PM, Taylor RE. Optimal redundancy for reliability in series systems. Oper Res 1969;17:838–47.
[10] Sung CS, Lee HK. A branch-and-bound approach for spare unit allocation in a series system. Eur J Oper Res 1994;75:217–32.
[11] Kuo W, Lin HH, Xu Z, Zhang W. Reliability optimization with the Lagrange-multiplier and branch-and-bound technique. IEEE Trans Reliab 1987;R-36:624–30.
[12] Lawler EL, Wood DE. Branch-and-bound methods: a survey. Oper Res 1966;14:699–719.
[13] Balas E. A note on the branch-and-bound principle. Oper Res 1968;16:442–5.
[14] McLeavey DW. Numerical investigation of optimal parallel redundancy in series systems. Oper Res 1974;22:1110–7.
[15] McLeavey DW, McLeavey JA. Optimization of system reliability by branch-and-bound. IEEE Trans Reliab 1976;R-25:327–9.
[16] Luus R. Optimization of system reliability by a new nonlinear programming procedure. IEEE Trans Reliab 1975;R-24:14–6.
[17] Aggarwal KK, Gupta JS, Misra KB. A new heuristic criterion for solving a redundancy optimization problem. IEEE Trans Reliab 1975;R-24:86–7.
[18] Bellman R, Dreyfus S. Dynamic programming and the reliability of multicomponent devices. Oper Res 1958;6:200–6.
[19] Kettelle JD. Least-cost allocation of reliability investment. Oper Res 1962;10:249–65.
[20] Woodhouse CF. Optimal redundancy allocation by dynamic programming. IEEE Trans Reliab 1972;R-21:60–2.
[21] Sharma J, Venkateswaran KV. A direct method for maximizing the system reliability. IEEE Trans Reliab 1971;R-20:256–9.
[22] Gopal K, Aggarwal KK, Gupta JS. An improved algorithm for reliability optimization. IEEE Trans Reliab 1978;R-27:325–8.
[23] Misra KB. A simple approach for constrained redundancy optimization problem. IEEE Trans Reliab 1972;R-21:30–4.
[24] Nakagawa Y, Nakashima K. A heuristic method for determining optimal reliability allocation. IEEE Trans Reliab 1977;R-26:156–61.
[25] Kuo W, Hwang CL, Tillman FA. A note on heuristic methods in optimal system reliability. IEEE Trans Reliab 1978;R-27:320–4.
[26] Nakagawa Y, Miyazaki S. An experimental comparison of the heuristic methods for solving reliability optimization problems. IEEE Trans Reliab 1981;R-30:181–4.
[27] Burton RM, Howard GT. Optimal system reliability for a mixed series and parallel structure. J Math Anal Appl 1969;28:370–82.
[28] Cho YK. Redundancy optimization for a class of reliability systems with multiple-choice resource constraints. Unpublished PhD Thesis, Department of Industrial Engineering, KAIST, Korea, 2000; p. 66–75.
[29] Aggarwal KK. Redundancy optimization in general systems. IEEE Trans Reliab 1976;R-25:330–2.
[30] Shi DH. A new heuristic algorithm for constrained redundancy optimization in complex systems. IEEE Trans Reliab 1987;R-36:621–3.
[31] Ravi V, Murty BSN, Reddy PJ. Nonequilibrium simulated annealing algorithm applied to reliability optimization of complex systems. IEEE Trans Reliab 1997;46:233–9.
[32] Sung CS, Cho YK. Reliability optimization of a series system with multiple-choice and budget constraints. Eur J Oper Res 2000;127:159–71.
[33] Chern MS, Jan RH. Reliability optimization problems with multiple constraints. IEEE Trans Reliab 1986;R-35:431–6.
[34] Sung CS, Cho YK. Branch-and-bound redundancy optimization for a series system with multiple-choice constraints. IEEE Trans Reliab 1999;48:108–17.
[35] Sasaki G, Okada T, Shingai S. A new technique to optimize system reliability. IEEE Trans Reliab 1983;R-32:175–82.
[36] Murphee EL, Fenves S. A technique for generating interpretive translators for problem oriented languages. BIT 1970:310–23.


Part II. Statistical Reliability Theory

6 Modeling the Observed Failure Rate
6.1 Introduction
6.2 Survival in the Plane
6.3 Multiple Availability
6.4 Modeling the Mixture Failure Rate

7 Concepts of Stochastic Dependence in Reliability Analysis
7.1 Introduction
7.2 Important Conditions Describing Positive Dependence
7.3 Positive Quadrant Dependent Concept
7.4 Families of Bivariate Distributions that are Positive Quadrant Dependent
7.5 Some Related Issues on Positive Dependence
7.6 Positive Dependence Orderings
7.7 Concluding Remarks

8 Statistical Reliability Change-point Estimation Models
8.1 Introduction
8.2 Assumptions in Reliability Change-point Models
8.3 Some Specific Change-point Models
8.4 Maximum Likelihood Estimation
8.5 Application
8.6 Summary

9 Concepts and Applications of Stochastic Aging in Reliability
9.1 Introduction
9.2 Basic Concepts for Univariate Reliability Classes
9.3 Properties of the Basic Concepts
9.4 Distributions with Bathtub-shaped Failure Rates
9.5 Life Classes Characterized by the Mean Residual Lifetime
9.6 Some Further Classes of Aging
9.7 Partial Ordering of Life Distributions
9.8 Bivariate Reliability Classes
9.9 Tests of Stochastic Aging
9.10 Concluding Remarks on Aging

10 Class of NBU-t0 Life Distribution
10.1 Introduction
10.2 Characterization of NBU-t0 Class
10.3 Estimation of NBU-t0 Life Distribution
10.4 Tests for NBU-t0 Life Distribution

Chapter 6
Modeling the Observed Failure Rate

M. S. Finkelstein

6.1 Introduction
6.2 Survival in the Plane
6.2.1 One-dimensional Case
6.2.2 Fixed Obstacles
6.2.3 Failure Rate Process
6.2.4 Moving Obstacles
6.3 Multiple Availability
6.3.1 Statement of the Problem
6.3.2 Ordinary Multiple Availability
6.3.3 Accuracy of a Fast Repair Approximation
6.3.4 Two Non-serviced Demands in a Row
6.3.5 Not More than N Non-serviced Demands
6.3.6 Time Redundancy
6.4 Modeling the Mixture Failure Rate
6.4.1 Definitions and Conditional Characteristics
6.4.2 Additive Model
6.4.3 Multiplicative Model
6.4.4 Some Examples
6.4.5 Inverse Problem

6.1 Introduction

The notion of failure rate is crucial in reliability and survival analysis. However, obtaining the failure rate in many practical situations is often not so simple, as the structure of the system to be considered, for instance, can be rather complex, or the process of failure development cannot be described in a simple way. In these cases a "proper model" can help a lot in deriving reliability characteristics. In this chapter we consider several models that can be effectively used for deriving and analyzing the corresponding failure rate, and eventually the survival function. The general approach developed eventually boils down to constructing an equivalent failure rate for different settings (survival in the plane, multiple availability on demand, mixtures of distributions). It can be used for other applications as well.

Denote by T ≥ 0 a lifetime random variable and assume that the corresponding cumulative distribution function (cdf) F(t) is absolutely continuous. Then the following exponential formula holds:
$$F(t) = 1 - \exp\left[-\int_0^t \lambda(u)\, du\right] \qquad (6.1)$$
where λ(t), t ≥ 0, is the failure rate. In many instances a conventional statistical analysis of the overall random variable T presents certain difficulties, as the corresponding data can be scarce (e.g. the failure times of highly reliable systems). On the other hand, information on the structure of the system (object) or on the failure process can often be available. This information can be used for modeling λ(t), to be called the observed (equivalent) failure rate, and eventually for estimation of F(t). The other important problem that can be approached in this way is the analysis of the shape of the failure rate, which is crucial for the study of the aging properties of the cumulative distribution functions under consideration.

The simplest reliability example of this kind is a system of two independent components in parallel with exponentially distributed lifetimes. In this case, the system's lifetime cumulative distribution function is
$$F(t) = 1 - \exp(-\lambda_1 t) - \exp(-\lambda_2 t) + \exp[-(\lambda_1 + \lambda_2)t]$$
where $\lambda_1$ ($\lambda_2$) is the failure rate of the first (second) component. It can be easily seen that the following relation defines the system's observed (equivalent) failure rate:
$$\lambda(t) = \frac{\lambda_1 \exp(-\lambda_1 t) + \lambda_2 \exp(-\lambda_2 t) - (\lambda_1 + \lambda_2) \exp[-(\lambda_1 + \lambda_2)t]}{\exp(-\lambda_1 t) + \exp(-\lambda_2 t) - \exp[-(\lambda_1 + \lambda_2)t]}$$
A simple analysis of this relation shows that the observed failure rate λ(t) (with λ(0) = 0) is monotonically increasing in $[0, t_0)$ and monotonically decreasing in $[t_0, \infty)$, asymptotically approaching $\lambda_1$ from above as $t \to \infty$ ($\lambda_1 < \lambda_2$). The maximum point $t_0$ is uniquely obtained from the equation
$$\lambda_2^2 \exp(-\lambda_1 t) + \lambda_1^2 \exp(-\lambda_2 t) = (\lambda_1 - \lambda_2)^2 \qquad \lambda_1 \ne \lambda_2$$
It is also evident that $\lambda(t) < \lambda_2$, because the reliability of the considered system cannot be worse than the reliability of a component.
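As a numerical illustration (the component failure rates below are our own choices, not from the text), the observed failure rate of this parallel system and its maximum point $t_0$ can be computed directly; a minimal Python sketch:

```python
import numpy as np
from scipy.optimize import brentq

l1, l2 = 0.5, 1.5   # component failure rates, l1 < l2 (illustrative values)

def rate(t):
    # observed (equivalent) failure rate of two exponential components in parallel
    num = l1*np.exp(-l1*t) + l2*np.exp(-l2*t) - (l1 + l2)*np.exp(-(l1 + l2)*t)
    den = np.exp(-l1*t) + np.exp(-l2*t) - np.exp(-(l1 + l2)*t)
    return num / den

# the maximum point t0 solves l2^2 e^{-l1 t} + l1^2 e^{-l2 t} = (l1 - l2)^2
t0 = brentq(lambda t: l2**2*np.exp(-l1*t) + l1**2*np.exp(-l2*t) - (l1 - l2)**2,
            1e-9, 100.0)
print(t0, rate(t0))   # the rate increases up to t0 (about 1.65 here), then decreases
print(rate(50.0))     # and approaches l1 = 0.5 from above as t grows
```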

Thus, a given structure of the system creates a possibility of obtaining and analyzing λ(t) analytically. There are situations, however, when deriving the observed (equivalent) failure rate is not so straightforward. In this chapter, several applications of this kind are considered, where the observed failure rate can be effectively constructed via specific models describing the process of "failure development".

Sections 6.2 and 6.3 are devoted to two different applications of the method of the per demand failure rate [1]. The first presents a probabilistic description of survival in the plane, when a small normally or tangentially oriented interval is moving along a fixed route in the plane, crossing points of initial Poisson random processes. Each crossing leads to a termination of the process with a given probability, and the probability of passing the route without termination is derived. The second deals with the so-called multiple availability, when the system should be available at each instant of demand, and the probability of this event is of interest. The demands occur in accordance with a Poisson process. Some weaker criteria of failure are also considered, when it is not necessary that all demands be serviced in a given interval of time. Though these applications are different in nature, the mathematical approaches for obtaining the characteristics of interest are similar.

Finally, in Section 6.4, a mixture of distributions is studied. The analysis of the observed failure rate, which in this case is a mixture failure rate, is performed for the given governing and mixing distributions. It appears that the shape of the observed failure rate can dramatically differ from the shape of the failure rate of the governing distribution. This fact is rather surprising, and should be taken into account in practical applications.

6.2 Survival in the Plane

6.2.1 One-dimensional Case

A model of survival in the plane is described in this section, based on the following simple reasoning used in the one-dimensional case. Consider a system subject to stochastic point influences (shocks). Each shock can lead, with a given probability, to a fatal failure of the system, resulting in a termination of the process; this will be called an "accident". The probability of performance without accidents in the time interval (0, t] is of interest. It is natural to describe the situation in terms of stochastic point processes. Let {N(t); t > 0} be a point process of shocks, occurring at times $0 < t_1 < t_2 < \cdots$, where N(t) is the corresponding counting measure in (0, t].

Denote by h(t) the rate function of the point process. For orderly processes, assuming the limits exist [2]:
$$h(t) = \lim_{\Delta t \to 0} \frac{\Pr\{N(t, t + \Delta t) = 1\}}{\Delta t} = \lim_{\Delta t \to 0} \frac{E[N(t, t + \Delta t)]}{\Delta t}$$
Thus, h(t) dt can be interpreted as an approximate probability of a shock occurrence in (t, t + dt].

Assume now that a shock, which occurs in (t, t + dt] independently of the previous shocks, leads to an accident with probability θ(t), and does not cause any changes in the system with the complementary probability 1 − θ(t). Denote by $T_a$ a random time to an accident and by $F_a(t) = \Pr\{T_a \le t\}$ the corresponding cumulative distribution function. If $F_a(t)$ is absolutely continuous, then, similar to Equation 6.1,
$$P(t) = 1 - F_a(t) = \exp\left[-\int_0^t \lambda_a(x)\, dx\right] \qquad (6.2)$$
where $\lambda_a(t)$ is the failure (accident) rate corresponding to $F_a(t)$, and P(t) is the survival function: the probability of performance without accidents in (0, t]. Assume that {N(t); t > 0} is a non-homogeneous Poisson process, θ(x)h(x) is integrable in [0, t) ∀t ∈ (0, ∞) and
$$\int_0^\infty \theta(x)h(x)\, dx = \infty$$
Then the following important relation takes place [3, 4]:
$$\lambda_a(t) = \theta(t)h(t) \qquad (6.3)$$

For the time-independent case θ(t) ≡ θ, this approach has been widely used in the literature. In [1], for instance, it was called the method of the per demand failure rate. This name comes from the following simple reasoning. Let h(t) be the rate of the non-homogeneous Poisson process of demands and θ be the probability that a single demand is not serviced (it is serviced with probability 1 − θ). Then the probability that all demands in [0, t) are serviced is given by
$$1 - F_a(t) = \sum_{k=0}^{\infty} (1 - \theta)^k \exp\left[-\int_0^t h(u)\, du\right] \frac{\left[\int_0^t h(u)\, du\right]^k}{k!} = \exp\left[-\int_0^t \theta h(u)\, du\right]$$
Therefore, θh(t) can be called the per demand failure rate.
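The identity above is easy to confirm by simulation. In the following small Monte Carlo sketch (the parameters are our own illustrative choices), demands arrive as a homogeneous Poisson process with rate h, each is left non-serviced with probability θ, and the fraction of histories without a non-serviced demand is compared with exp(−θht):

```python
import math
import random

h, theta, t, n = 2.0, 0.3, 1.5, 200_000   # illustrative parameters
random.seed(1)

survived = 0
for _ in range(n):
    s = random.expovariate(h)             # first demand of a Poisson process
    ok = True
    while s < t:
        if random.random() < theta:       # this demand is not serviced
            ok = False
            break
        s += random.expovariate(h)        # next inter-demand time
    survived += ok
print(survived / n, math.exp(-theta * h * t))   # both close to 0.4066
```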

As in Section 6.1, the failure rate in Equation 6.2 is called the observed failure rate, and the relation in Equation 6.3 gives its concrete expression for the setting under consideration. In other words, the modeling of the observed failure rate is performed via the relation in Equation 6.3. This concept will be exploited throughout the current chapter. Considering the Poisson point processes of shocks in the plane (which will be interpreted as obstacles) in the next section will lead to the corresponding observed failure rate "along the fixed curve". An important application of this model results in deriving the probability of a safe passage for a ship in a field of obstacles with fixed (shallows) or moving (other ships) coordinates [4].

6.2.2 Fixed Obstacles

Denote by {N(B)} a non-homogeneous Poisson point process in the plane, where N(B) is the corresponding random measure: i.e. the random number of points in B ⊂ ℝ², where B belongs to the Borel σ-algebra in ℝ². We shall consider points as prospective point influences on the system (shallows for the ship, for instance). Similar to the one-dimensional definition (Equation 6.1), the rate $h_f(\xi)$ can be formally defined as [2]
$$h_f(\xi) = \lim_{S(\delta(\xi)) \to 0} \frac{E[N(\delta(\xi))]}{S(\delta(\xi))} \qquad (6.4)$$
where B = δ(ξ) is the neighborhood of ξ with area S(δ(ξ)) and diameter tending to zero, and the subscript "f" stands for "fixed".

Assume for simplicity that $h_f(\xi)$ is a continuous function of ξ in an arbitrary closed circle in ℝ². Let $R_{\xi_1,\xi_2}$ be a fixed continuous curve, to be called a route, connecting $\xi_1$ and $\xi_2$, two distinct points in the plane. A point (a ship in the application) is moving in one direction along the route. Every time it "crosses a point" of the process {N(B)} an accident can happen with a given probability. Let r be the distance from $\xi_1$ to the current point of the route (coordinate) and $h_f(r)$ denote the rate of the process in (r, r + dr]. Thus, a one-dimensional parameterization is considered. For defining the corresponding Poisson measure, the dimensions of the objects under consideration should be taken into account.

Let $(\gamma_n^+(r), \gamma_n^-(r))$ be a small interval of length $\gamma_n(r) = \gamma_n^+(r) + \gamma_n^-(r)$ in the normal to $R_{\xi_1,\xi_2}$ at the point with coordinate r, where the upper indexes denote the corresponding direction ($\gamma_n^+(r)$ is on one side of $R_{\xi_1,\xi_2}$, and $\gamma_n^-(r)$ is on the other). Let R be the length of $R_{\xi_1,\xi_2}$: $R \equiv |R_{\xi_1,\xi_2}|$, and assume that
$$R \gg \gamma_n(r) \qquad \forall r \in [0, R]$$

The interval $(\gamma_n^+(r), \gamma_n^-(r))$ is moving along $R_{\xi_1,\xi_2}$, crossing points of the random field. For the application, it is reasonable to assume the following model for the symmetrical ($\gamma_n^+(r) = \gamma_n^-(r)$) equivalent interval: $\gamma_n(r) = 2\delta_s + 2\delta_o(r)$, where $2\delta_s$ and $2\delta_o(r)$ are the diameters of the ship and of an obstacle respectively. It is assumed for simplicity that all obstacles have the same diameter. There can be other models as well. Using the definition in Equation 6.4 for the specific domain B and the r-parameterization, the corresponding equivalent rate of occurrence of points, $h_{ef}(r)$, can be obtained:
$$h_{ef}(r) = \lim_{\Delta r \to 0} \frac{E[N(B(r, \Delta r, \gamma_n(r)))]}{\Delta r} \qquad (6.5)$$
where $N(B(r, \Delta r, \gamma_n(r)))$ is the random number of points crossed by the interval $\gamma_n(r)$ moving from r to $r + \Delta r$.

When $\Delta r \to 0$ and $\gamma_n(r)$ is sufficiently small [4]:
$$E[N(B(r, \Delta r, \gamma_n(r)))] = \int_{B(r,\Delta r,\gamma_n(r))} h_f(\xi)\, dS(\delta(\xi)) = \gamma_n(r) h_f(r)\, \Delta r\, [1 + o(1)]$$
which leads to the following relation for the equivalent rate of the corresponding one-dimensional non-homogeneous Poisson process:
$$h_{ef}(r) = \gamma_n(r) h_f(r)[1 + o(1)] \qquad (6.6)$$
It was also assumed, while obtaining Equation 6.6, that the radius of curvature $R_c(r)$ of the route is sufficiently large compared with $\gamma_n(r)$:
$$\gamma_n(r) \ll R_c(r)$$
uniformly in r ∈ [0, R]. This means that the domain covered by the interval $(\gamma_n^+(r), \gamma_n^-(r))$ while it moves from r to $r + \Delta r$ along the route is asymptotically rectangular with area $\gamma_n(r)\Delta r$, $\Delta r \to 0$. A generalization allowing small values of the radius of curvature can be considered. In this case some points, for instance, can be intersected by the moving interval twice.

Hence, the r-parameterization along the fixed route reduces the problem to the one-dimensional setting of Section 6.2.1. As in the one-dimensional case, assume that crossing a point with coordinate r leads to an accident with probability $\theta_f(r)$ and to "survival" with the complementary probability $\bar{\theta}_f(r) = 1 - \theta_f(r)$. Denote by $\mathcal{R}$ the random distance from the initial point $\xi_1$ of the route to the point on the route where an accident has occurred. Similar to Equations 6.2 and 6.3, the probability of passing the route $R_{\xi_1,\xi_2}$ without accidents can be derived in the following way:
$$\Pr\{\mathcal{R} > R\} \equiv P(R) = 1 - F_{af}(R) = \exp\left[-\int_0^R \theta_f(r) h_{ef}(r)\, dr\right] \equiv \exp\left[-\int_0^R \lambda_{af}(r)\, dr\right] \qquad (6.7)$$

where
$$\lambda_{af}(r) \equiv \theta_f(r) h_{ef}(r) \qquad (6.8)$$
is the corresponding observed failure rate, written in the same way as the per demand failure rate (Equation 6.3) of the previous section.
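For a given route, Equations 6.6–6.8 reduce the computation of P(R) to a one-dimensional quadrature. A minimal sketch follows, with an entirely assumed obstacle field $h_f(r)$, interval width $\gamma_n(r)$, and accident probability $\theta_f(r)$ (none of these numbers come from the text):

```python
import numpy as np

R = 10.0                                   # route length (illustrative)
r = np.linspace(0.0, R, 2001)
h_f = 0.05 * (1 + 0.5*np.sin(r))           # planar rate along the route (assumed)
gamma_n = 0.4 * np.ones_like(r)            # width of the moving normal interval
theta_f = 0.2 * np.ones_like(r)            # accident probability per crossing

lam_af = theta_f * gamma_n * h_f           # observed rate (6.8), h_ef from (6.6)
integral = np.sum((lam_af[:-1] + lam_af[1:]) / 2 * np.diff(r))  # trapezoid rule
print(np.exp(-integral))                   # survival probability P(R), Eq. 6.7
```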

6.2.3 Failure Rate Process

Assume that $\lambda_{af}(r)$ in Equation 6.8 is now a stochastic process defined, for instance, by an unobserved covariate stochastic process $Y = Y_r$, $r \ge 0$ [5]. By definition, the observed failure rate cannot be stochastic. Therefore, it should be obtained via the corresponding conditioning on realizations of Y. Denote the defined failure (hazard) rate process by $\lambda_{af}(Y, r)$. It is well known, e.g. [6], that in this case the following equation holds:
$$P(R) = E\left[\exp\left(-\int_0^R \lambda_{af}(Y, r)\, dr\right)\right] \qquad (6.9)$$
where conditioning is performed with respect to the stochastic process Y. Equation 6.9 can be written via the conditional failure (hazard) rate process [5] as
$$P(R) = \exp\left\{-\int_0^R E[\lambda_{af}(Y, r) \mid \mathcal{R} > r]\, dr\right\} = \exp\left[-\int_0^R \lambda_{af}(r)\, dr\right] \qquad (6.10)$$
where $\lambda_{af}(r)$ now denotes the corresponding observed hazard rate. Thus, the relation between the observed and conditional hazard rates is
$$\lambda_{af}(r) = E[\lambda_{af}(Y, r) \mid \mathcal{R} > r] \qquad (6.11)$$
As follows from Equation 6.10, Equation 6.11 can constitute a reasonable tool for obtaining P(R), but the corresponding explicit derivations can be performed only in some of the simplest specific cases. On the other hand, it can help to analyze some important properties.

Consider an important specific case. Let the probability $\theta_f(r)$ be indexed by a parameter Y: $\theta_f(Y, r)$, where Y is interpreted as a non-negative random variable with support in [0, ∞) and probability density function π(y). In the sea safety application this randomization can be due to the unknown characteristics of a navigation (or (and) a collision avoidance) onboard system, for instance. There can be other interpretations as well. Thus the specific case when Y in Equations 6.9 and 6.10 is a random variable is considered. The observed failure rate $\lambda_{af}(r)$ reduces in this case to the corresponding mixture failure rate:
$$\lambda_{af}(r) = \int_0^\infty \lambda_{af}(y, r) \pi(y \mid r)\, dy \qquad (6.12)$$
where $\pi(y \mid r)$ is the conditional probability density function of Y given that $\mathcal{R} > r$. As in [7], it can be defined in the following way:
$$\pi(y \mid r) = \frac{\pi(y) P(y, r)}{\int_0^\infty P(y, r) \pi(y)\, dy} \qquad (6.13)$$
where P(y, r) is defined as in Equation 6.7, with $\lambda_{af}(r)$ substituted by $\theta_f(y, r) h_{ef}(r)$ for fixed y.

The relations in Equations 6.12 and 6.13 constitute a convenient tool for analyzing the shape of the observed failure rate $\lambda_{af}(r)$. This topic will be considered in a more detailed way in Section 6.4. It is only worth noting now that the shape of $\lambda_{af}(r)$ can differ dramatically from the shape of the conditional failure rate $\lambda_{af}(y, r)$, and this fact should be taken into consideration in applications. Assume, for instance, a specific multiplicative form of parameterization:
$$\theta_f(Y, r) h_{ef}(r) = Y \theta_f(r) h_{ef}(r)$$
It is well known that if $\theta_f(r) h_{ef}(r)$ is constant, then the observed failure rate is decreasing [8]. But it turns out that even if $\theta_f(r) h_{ef}(r)$ is sharply increasing, the observed failure rate $\lambda_{af}(r)$ can still decrease, at least for sufficiently large r! Thus, the random parameter changes the aging properties of the governing distribution function (Section 6.4).
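A quick way to see this effect is to take a gamma-distributed Y (our illustrative choice; the chapter does not fix a mixing law here). For the multiplicative model with gamma mixing of shape k and rate b, the mixture rate has the well-known closed form $k\,\alpha(r)/(b + \int_0^r \alpha(u)\,du)$, so a sharply increasing conditional rate $Y\alpha(r)$ still yields an eventually decreasing observed rate:

```python
import numpy as np

k, b = 2.0, 1.0                  # gamma mixing: shape k, rate b (our choice)
r = np.linspace(0.01, 10.0, 1000)
alpha = 3 * r**2                 # sharply increasing conditional rate Y*alpha(r)
A = r**3                         # cumulative integral of alpha
lam_mix = k * alpha / (b + A)    # closed-form mixture rate for gamma mixing
print(lam_mix[::250])            # rises, peaks near r = 2**(1/3), then decays ~ 3k/r
```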

A similar randomization can be performed in the second multiplier on the right-hand side of Equation 6.8: $h_{ef}(r) \approx \gamma_n(r) h_f(r)$, where the length of the moving interval $\gamma_n(r)$ can be considered random, as well as the rate of fixed obstacles $h_f(\xi)$. In the latter case the situation can be described in terms of doubly stochastic Poisson processes [2].

For "highly reliable systems", when, for instance, $h_f \to 0$ uniformly in r ∈ [0, R], one can easily obtain from Equations 6.9 and 6.11 the following obvious "unconditional" asymptotic approximate relations:
$$P(R) = \left\{1 - E\left[\int_0^R \lambda_{af}(Y, r)\, dr\right]\right\}[1 + o(1)] = \left\{1 - \int_0^R E[\lambda_{af}(Y, r)]\, dr\right\}[1 + o(1)]$$
$$\lambda_{af}(r) = E[\lambda_{af}(Y, r) \mid \mathcal{R} > r] = E[\lambda_{af}(Y, r)][1 + o(1)]$$
By applying Jensen's inequality to the right-hand side of Equation 6.9, a simple lower bound for P(R) can also be derived:
$$P(R) \ge \exp\left\{E\left[-\int_0^R \lambda_{af}(Y, r)\, dr\right]\right\}$$

6.2.4 Moving Obstacles

Consider a random process of continuous curves in the plane, to be called paths. We keep in mind an application in which the ship routes on a chart represent paths, and the rate of the stochastic processes, to be defined later, represents the intensity of navigation in a given sea area. The specific case of stationary random lines in the plane is called a stationary line process.

It is convenient to characterize a line in the plane by its (ρ, ψ) coordinates, where ρ is the perpendicular distance from the line to a fixed origin, and ψ is the angle between this perpendicular line and a fixed reference direction. A random process of undirected lines can be defined as a point process on the cylinder ℝ⁺ × S, where ℝ⁺ = (0, ∞) and S denotes both the circle group and its representation as (0, 2π] [9]. Thus, each point on the cylinder is equivalent to a line in ℝ², and for the finite case the point process (and associated stationary line process) can be described. Let V be a fixed line in ℝ² with coordinates $(\rho_V, \alpha)$ and let $N_V$ be the point process on V generated by its intersections with the stationary line process. Then $N_V$ is a stationary point process on V with rate $h_V$ given by [9]:
$$h_V = h \int_S |\cos(\psi - \alpha)| P(d\psi) \qquad (6.14)$$

where h is the constant rate of the stationary line process and P(dψ) is the probability that an arbitrary line has orientation ψ. If the line process is isotropic, then the relation in Equation 6.14 reduces to
$$h_V = 2h/\pi$$
The rate h is induced by the random measure defined by the total length of lines inside any closed bounded convex set in ℝ². One cannot define the corresponding measure as the number of lines intersecting the above-mentioned set, because in this case it will not be additive.

Assume that the line process is a homogeneous Poisson process in the sense that the point process $N_V$ generated by its intersections with an arbitrary V is a Poisson point process. Consider now a stationary temporal Poisson line process in the plane. Similar to $N_V$, the Poisson point process $\{N_V(t);\ t > 0\}$ of its intersections with V in time can be defined. The constant rate of this process, $h_V(1)$, as usual, defines the probability of intersection (by a line from the temporal line process) of an interval of unit length in V in a unit interval of time, given these units are sufficiently small. As previously, Equation 6.14 results in $h_V(1) = 2h(1)/\pi$ for the isotropic case.

Let $V_{\xi_1,\xi_2}$ be a finite line route connecting $\xi_1$ and $\xi_2$ in ℝ², and let r, as in the previous section, be the distance from $\xi_1$ to the current point of $V_{\xi_1,\xi_2}$. Then $h_V(1)\, dr\, dt$ can be interpreted as the probability of intersecting $V_{\xi_1,\xi_2}$ by the temporal line process in
$$(r, r + dr) \times (t, t + dt) \qquad \forall r \in (0, R),\ t > 0$$
A point (a ship) starts moving along $V_{\xi_1,\xi_2}$ at $\xi_1$, t = 0, with a given speed. We assume that an accident happens with a given probability when "it intersects" a line from the (initial) temporal line process. Note that intersection in the previous section was time independent. A regularization procedure involving dimensions (of the ship, in particular) can be performed in the following way:


an attraction interval
$$(r - \gamma_{ta}^-, r + \gamma_{ta}^+) \subset V_{\xi_1,\xi_2} \qquad \gamma_{ta}^+, \gamma_{ta}^- \ge 0$$
$$\gamma_{ta}(r) = \gamma_{ta}^+(r) + \gamma_{ta}^-(r) \ll R$$
where the subscript "ta" stands for tangential, is considered. The attraction interval (which can be defined by the ship's dimensions), "attached to the point r", is moving along the route. The coordinate of the point changes in time in accordance with the following equation:
$$r(t) = \int_0^t \nu(s)\, ds \qquad t \le t_R$$
where $t_R$ is the total time on the route.

Similar to Equations 6.5 and 6.6, the equivalent rate of intersections $h_{em}(r)$ can be derived. Assume for simplicity that the speed $v(t) = v_0$ and the length of the attraction interval $\gamma_{ta}$ are constants. Then
$$E[N_V((r, r + \Delta r), \Delta t)] = h_V(1)\, \Delta r\, \Delta t \qquad (6.15)$$
Thus, the equivalent rate is also a constant:
$$h_{em} = \Delta t\, h_V(1) = \frac{\gamma_{ta}}{v_0} h_V(1) \qquad (6.16)$$
where $\Delta t = \gamma_{ta}/v_0$ is the time needed for the moving attraction interval to pass the interval $(r, r + \Delta r)$ as $\Delta r \to 0$. In other words, the number of intersections of (r, r + dr) by lines from the temporal Poisson line process is counted only during this interval of time. Therefore, the temporal component of the process is "eliminated" and its r-parameterization is performed.

As assumed earlier, an intersection can lead to an accident. Let the corresponding probability of an accident $\theta_m$ also be a constant. Then, using the results of Sections 6.2.1 and 6.2.2, the probability of moving along the route $V_{\xi_1,\xi_2}$ without accidents can be easily obtained in a simple form:
$$P(R) = \exp(-\theta_m h_{em} R) \qquad (6.17)$$

The nonlinear generalization, which is rather straightforward, can be performed in the following way. Let the line route $V_{\xi_1,\xi_2}$ turn into the continuous curve $R_{\xi_1,\xi_2}$, and let the lines of the stochastic line process also turn into continuous curves. As previously, assume that the radius of curvature $R_c(r)$ is bounded from below by a positive, sufficiently large constant (the same for the curves of the stochastic process). The idea of this generalization is based on the fact that for a sufficiently narrow domain (containing $R_{\xi_1,\xi_2}$) the paths can be considered as lines, and sufficiently small portions of the route can also be considered as approximately linear.

Alternatively, it can be assumed from the beginning (and this seems to be quite reasonable for our application) that the point process generated by intersections of $R_{\xi_1,\xi_2}$ with temporal-planar paths is a non-homogeneous Poisson process (homogeneous in time, for simplicity) with rate $h_R(r, 1)$. Similar to Equation 6.16, an r-dependent equivalent rate can be obtained asymptotically as $\gamma_{ta} \to 0$:
$$h_{em}(r) = \frac{\gamma_{ta}}{v_0} h_R(r, 1)[1 + o(1)] \qquad (6.18)$$

It is important that, once again, everything is written in the r-parameterization. The marginal cases for Equation 6.18 are quite obvious. When $v_0 \to 0$ the equivalent rate tends to infinity (the time of exposure of the ship to the temporal path process tends to infinity); when $v_0 \to \infty$ the rate tends to zero. The relation in Equation 6.17 can now be written in the r-dependent way:
$$P(R) = \exp\left[-\int_0^R \theta_m(r) h_{em}(r)\, dr\right] \qquad (6.19)$$
Similar to the previous section, the corresponding randomization approaches can also be applied to the equivalent failure rate $\theta_m(r) h_{em}(r)$. Finally, assuming independence of the two causes of accidents, the probability of survival on the route is given by the following equation, which combines Equations 6.7 and 6.19:
$$P(R) = \exp\left\{-\int_0^R [\theta_f(r) h_{ef}(r) + \theta_m(r) h_{em}(r)]\, dr\right\} \qquad (6.20)$$

Other types of external point influence can also be considered, which under certain assumptions will result in additional terms in the integrand on the right-hand side of Equation 6.20. Thus, the method of the per demand failure rate, as in the one-dimensional case, combined with the r-parameterization procedure, has resulted in simple relations for the observed failure rate and, eventually, in deriving the probability of survival in the plane.
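The whole construction of this section condenses into a few lines of code. The following sketch evaluates Equation 6.20 with the equivalent rates of Equations 6.6 and 6.18; all numbers (obstacle rates, interval widths, speed, accident probabilities) are our own illustrative assumptions:

```python
import numpy as np

R, v0, gamma_ta = 20.0, 5.0, 0.3          # route length, speed, attraction interval
r = np.linspace(0.0, R, 2001)

h_ef = 0.4 * 0.05 * (1 + 0.3*np.cos(r/4)) # gamma_n(r) * h_f(r), Equation 6.6
theta_f = 0.2                             # accident probability, fixed obstacles

h_em = (gamma_ta / v0) * 0.8              # (gamma_ta/v0) * h_R(r,1), Equation 6.18
theta_m = 0.1                             # accident probability, moving obstacles

lam = theta_f * h_ef + theta_m * h_em     # integrand of Equation 6.20
P = np.exp(-np.sum((lam[:-1] + lam[1:]) / 2 * np.diff(r)))
print(P)                                  # probability of a safe passage
```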

6.3 Multiple Availability

6.3.1 Statement of the Problem

The main objective of this section is to show how the method of the per demand failure rate works as a fast repair approximation to obtain some reliability characteristics of repairable systems that generalize the conventional availability. At the same time, exact solutions for the characteristics under consideration, or the corresponding integral equations, will also be derived.

Consider a repairable system, which begins operating at t = 0. The system is repaired upon failure, and then the process repeats. Assume that all time intervals (operation and repair) are s-independent, the periods of operation are exponential, independent, identically distributed (i.i.d.) random variables, and the periods of repair are also exponential, i.i.d. random variables. Thus, a specific alternating renewal process describes the process of the system's functioning.

One of the main reliability characteristics of repairable systems is availability: the probability of being in the operating state at an arbitrary instant of time t. This probability can be easily obtained [8] for a system that is functioning and repaired under the above assumptions:
$$A(t) = \frac{\mu}{\mu + \lambda} + \frac{\lambda}{\mu + \lambda} \exp[-(\lambda + \mu)t] \qquad (6.21)$$
where λ and μ are the parameters of the exponential distributions of time to failure and time to repair respectively.

It is quite natural to generalize the notion of availability to a setting where the system should be available at a number of time instants (demands). Define a breakdown (it was called "an accident" in the specific setting of Section 6.2) as an event when the system is in the state of repair at the moment of a demand occurrence. Define multiple availability as the probability of the following event: the system is in the operating state at each moment of demand in [0, t).

Let, at first, the instants of demands be fixed: $0 < t_1 < t_2 < \cdots < t_k < t$. Then, owing to the Markovian properties of the model, multiple availability V(t) can be obtained for this case as
$$V(t) = A(t_1) A(t_2 - t_1) \cdots A(t_k - t_{k-1}) \qquad (6.22)$$
Assume now that the point process of demands is a homogeneous Poisson process with rate $\lambda_d$, s-independent of the process of the system's functioning and repair. Straightforward integration of Equation 6.22 for $t_i$, i = 1, 2, . . . , k, from the homogeneous Poisson process and summation over all possible k from 1 to ∞ give the following general formula [10]:
$$V(t) = \sum_{k=1}^{\infty} \int\!\!\int \cdots \int A_{\lambda,\mu}(t_1, t_2, \ldots, t_k)\, \lambda_d^k\, dt_1\, dt_2 \cdots dt_k \exp(-\lambda_d t) + \exp(-\lambda_d t) \qquad (6.23)$$
where the limits of integration for each $t_i$, i = 1, 2, . . . , are 0 and t respectively. The term $\exp(-\lambda_d t)$ stands for the probability of the absence of demands in [0, t). Of course, computations of V(t) using the relation in Equation 6.23 can only be approximate.

There are many applications where multiple availability can be considered an important probabilistic characteristic of repairable systems. The performance of various information-measuring systems, for instance, which operate continuously and deliver information only on users' demand at some instants of time, can be adequately described by multiple availability [11].

The criterion of a breakdown in the above definition of multiple availability is very strong: the system should be available at each instant of demand. Thus, it is reasonable to consider weaker versions of this criterion, when:

1. not all demands need to be serviced;
2. if a demand occurs while the system is in the state of repair, but the time from the start of repair (or to the completion of repair) is sufficiently small, then this event may not qualify as a breakdown.

These new assumptions are quite natural. The first describes the fact that the quality of the system's performance can still be considered acceptable if the number of demands occurring while the system is in the state of repair is less than k, k ≥ 2. The second can be applied, for instance, to information-measuring systems when the quality of information, measured before the last system failure, is still acceptable after τ units of repair time (information is aging slowly). Two methods are used for deriving expressions for multiple availability under these criteria of a breakdown:

1. the method of integral equations, to be solved via the corresponding Laplace transforms;
2. an approximate method of the per demand failure rate.

The method of integral equations leads to an exact explicit solution for multiple availability only for the simplest criterion of a breakdown and exponential distributions of times to failure and repair. In other cases the solution can be obtained via the numerical inversion of Laplace transforms.

On the other hand, simple relations for multiple availability can be obtained via the method of the per demand failure rate on the basis of the fast repair assumption: the mean time of repair is much less than the mean time to failure of the system and the mean time between consecutive demands. The fast repair approximation also makes it possible to derive approximate solutions for problems that cannot be solved by the first method, e.g. for the case of arbitrary time to failure and time to repair cumulative distribution functions.

6.3.2 Ordinary Multiple Availability

Multiple availability V(t), defined in Section 6.3.1, will be called ordinary multiple availability, as distinct from the availability measures of subsequent sections. It is easy to obtain the following integral equation for V(t) [11]:
$$V(t) = \exp(-\lambda_d t) + \int_0^t \lambda_d \exp(-\lambda_d x) A(x) V(t - x)\, dx \qquad (6.24)$$
where the first term defines the probability of the absence of demands in [0, t), and the integrand is the joint probability of three events:

• occurrence of the first demand in [x, x + dx);
• the system is in the operating state at the time instant x;
• the system is in the operating state at each moment of demand in [x, t) (probability: V(t − x)).

Equation 6.24 can be easily solved via the Laplace transform. By applying this transformation to both sides of Equation 6.24 and using the Laplace transform of A(t), then, after elementary transformations, the following expression for the Laplace transform of V(t) can be obtained:
$$V(s) = \frac{s + \lambda_d + \lambda + \mu}{s^2 + s(\lambda_d + \lambda + \mu) + \lambda_d \lambda} \qquad (6.25)$$
Finally, performing the inverse transform:
$$V(t) = B_1 \exp(s_1 t) + B_2 \exp(s_2 t) \qquad (6.26)$$
where the roots of the denominator in Equation 6.25 are
$$s_{1,2} = \frac{-(\lambda_d + \lambda + \mu) \pm \sqrt{(\lambda_d + \lambda + \mu)^2 - 4\lambda_d \lambda}}{2}$$
and $B_1 = s_2/(s_2 - s_1)$ and $B_2 = -s_1/(s_2 - s_1)$. It is worth noting that, unlike the relation in Equation 6.23, a simple explicit expression is obtained in Equation 6.26.

The following assumptions are used while deriving approximations for V(t) and the corresponding errors of approximation:
$$\frac{1}{\lambda} \gg \frac{1}{\mu} \quad (\lambda \ll \mu) \qquad (6.27)$$
$$\frac{1}{\lambda_d} \gg \frac{1}{\mu} \quad (\lambda_d \ll \mu) \qquad (6.28)$$
$$t \gg \frac{1}{\mu} \qquad (6.29)$$

The assumption in Equation 6.27 implies that the mean time to failure far exceeds the mean time of repair, and this is obviously the case in many practical situations. The mean time of repair usually does not exceed several hours, whereas $\lambda^{-1}$ can be thousands of hours. This is usually referred to as a condition of fast repair [12]. This assumption is important for the method of the current section because, as will be shown, it is used for approximating the availability (Equation 6.21) by its stationary characteristic in the interval between two consecutive demands, and for expansions in series as well. This means that the interval mentioned should be sufficiently large, and this is stated by the assumption in Equation 6.28. There exist, of course, systems with a very high intensity of demands (a database system, where queries arrive every few seconds, for instance), but here we assume a sufficiently low intensity of demands, which is relevant for a number of applications. An onboard navigation system will be considered in this section as the corresponding example. The assumption in Equation 6.29 means that the mission time t should be much larger than the mean time of repair, and this is quite natural for the fast repair approximation.

With the help of the assumptions in Equations 6.28 and 6.29 [11], an approximate relation for V(t) can be obtained via the method of the per demand failure rate, Equations 6.2 and 6.3, in the following way:
$$V(t) \equiv \exp\left[-\int_0^t \lambda_b(u)\, du\right] \approx \exp[-(1 - A)\lambda_d t] = \exp\left(-\frac{\lambda \lambda_d t}{\lambda + \mu}\right) \qquad (6.30)$$
where $\lambda_b(t)$ is the observed failure rate, to be called now the breakdown rate (the notation $\lambda_a(t)$ was used for the similar characteristic).

The above assumptions are necessary for obtaining Equation 6.30, as the time-dependent availability A(t) is substituted by its stationary characteristic $A(\infty) \equiv A = \mu/(\lambda + \mu)$. Thus
$$\lambda_b(t) = \frac{\lambda \lambda_d}{\lambda + \mu}$$
and
$$\theta(t) = (1 - A) = \frac{\lambda}{\lambda + \mu}, \qquad h(t) = \lambda_d$$

6.3.3 Accuracy of a Fast Repair Approximation

What is the source of the error in the approximate relation in Equation 6.30? The error results from the fact that the system is actually available on demand with probability $A(t_d)$, where $t_d$ is the time since the previous demand. The assumptions in Equations 6.28 and 6.29 imply that $A(t_d) \approx A$, and this leads to the approximation in Equation 6.30, whereas the exact solution is defined by Equation 6.26. When $t \gg \mu^{-1}$ (assumption in Equation 6.29), the second term on the right-hand side of Equation 6.26 is negligible. Then the error δ is defined as
$$\delta = \frac{s_2}{s_2 - s_1} \exp(s_1 t) - \exp\left(-\frac{\lambda \lambda_d t}{\lambda + \mu}\right)$$
Using Equation 6.27 for expanding $s_1$ and $s_2$ in series:
$$\delta = \left[1 + \frac{\lambda \lambda_d}{(\lambda + \lambda_d + \mu)^2}\right]\left(1 - s_1 t + \frac{s_1^2 t^2}{2}\right)[1 + o(1)] - \left[1 - \frac{\lambda \lambda_d t}{\lambda + \mu} + \frac{(\lambda \lambda_d)^2 t^2}{2(\lambda + \mu)^2}\right][1 + o(1)]$$
$$= \frac{\lambda \lambda_d}{(\lambda + \lambda_d + \mu)^2}\left[1 + \lambda_d t\left(1 - \frac{\lambda \lambda_d t}{\lambda + \mu}\right)\right][1 + o(1)] = \frac{\lambda \lambda_d}{\mu^2}(1 + \lambda_d t)[1 + o(1)]$$
The notation o(1) is understood asymptotically: o(1) → 0 as λ/μ → 0.

On the other hand, a more accurate result can easily be obtained if, instead of 1 − A, an "averaged probability" $1 - \bar{A}$ is used:
$$1 - \bar{A} \equiv \int_0^\infty [1 - A(t_d)] \lambda_d \exp(-\lambda_d t_d)\, dt_d = \frac{\lambda}{\lambda + \lambda_d + \mu}$$
where Equation 6.21 was taken into account. Hence
$$V(t) \approx \exp[-(1 - \bar{A})\lambda_d t] = \exp\left(-\frac{\lambda \lambda_d t}{\lambda + \lambda_d + \mu}\right) \qquad (6.31)$$
In the same way as above, it can be shown that
$$\delta \equiv V(t) - \exp\left(-\frac{\lambda \lambda_d t}{\lambda + \lambda_d + \mu}\right) = \frac{\lambda \lambda_d}{\mu^2}\left(1 - \frac{\lambda \lambda_d t}{\lambda + \lambda_d + \mu}\right)[1 + o(1)]$$
The error in Equation 6.31 results from the infinite upper bound of integration in $1 - \bar{A}$, whereas in realizations it is bounded. Thus, $\exp[-(\lambda \lambda_d t)/(\lambda + \lambda_d + \mu)]$ is a more accurate approximation than $\exp[-(\lambda \lambda_d t)/(\lambda + \mu)]$, as $\lambda_d t$ is not necessarily small.

Consider the following practical example [11].

Example 1. A repairable onboard navigation system is operating in accordance with the above assumptions. To obtain the required high precision of navigational data, corrections from an external source are performed with rate $\lambda_d$, s-independent of the process of the system's functioning and repair. The duration of a correction is negligible. Corrections can be performed only while the system is operating; otherwise there is a breakdown. These conditions describe multiple availability; thus V(t) can be assessed by Equations 6.30 and 6.31. Let:
$$\lambda = 10^{-2}\ \mathrm{h}^{-1} \quad \mu = 2\ \mathrm{h}^{-1} \quad \lambda_d \approx 0.021\ \mathrm{h}^{-1} \quad t = 10^3\ \mathrm{h} \qquad (6.32)$$
This implies that the mean time between failures is 100 h, the mean time of repair is 0.5 h, and the mean time between successive corrections is approximately 48 h. The data in Equation 6.32 characterize the performance of a real navigation system. On substituting these concrete values of the parameters in the exact relation (Equation 6.26):
$$V(t) = 0.901\,812\,8$$
whereas the approximate relation (Equation 6.31) gives:
$$V(t) \approx 0.901\,769\,3$$
Thus, the "real error" is δ = 0.000 043 5, and
$$\frac{\lambda \lambda_d}{\mu^2}\left(1 - \frac{\lambda \lambda_d t}{\lambda + \lambda_d + \mu}\right) = 0.000\,047\,1$$
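The figures of Example 1 are easy to reproduce; a short sketch evaluating the exact formula (Equation 6.26) against the fast repair approximation (Equation 6.31):

```python
import math

lam, mu, lam_d, t = 1e-2, 2.0, 0.021, 1e3          # the data of Equation 6.32

a = lam_d + lam + mu                                # roots of Equation 6.25
d = math.sqrt(a*a - 4*lam_d*lam)
s1, s2 = (-a + d)/2, (-a - d)/2

V_exact = s2/(s2 - s1)*math.exp(s1*t) - s1/(s2 - s1)*math.exp(s2*t)  # Eq. 6.26
V_appr  = math.exp(-lam*lam_d*t / (lam + lam_d + mu))                # Eq. 6.31

print(V_exact, V_appr, V_exact - V_appr)
# 0.9018128..., 0.9017693..., delta = 4.35e-05, as quoted in the text
```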

The general case of arbitrary distributions of time to failure, F(t) with mean T, and time to repair, G(t) with mean τ, under assumptions similar to Equations 6.27–6.29, can also be considered. This possibility follows from the fact that the stationary value of availability A depends only on the mean times to failure and repair. Equation 6.30, for instance, will turn into
$$V(t) \equiv \exp\left[-\int_0^t \lambda_b(u)\, du\right] \approx \exp[-(1 - A)\lambda_d t] = \exp\left(-\frac{\tau \lambda_d t}{T + \tau}\right) \qquad (6.33)$$
Furthermore, as follows from [11], the method of the per demand failure rate can be generalized to a non-homogeneous Poisson process of demands. In this case, $\lambda_d t$ in Equation 6.33 should be substituted by $\int_0^t \lambda_d(u)\, du$.

It is difficult to estimate the error of approximation for the case of arbitrary F(t) and G(t) directly. Therefore, strictly speaking, Equation 6.33 can be considered only as a heuristic result. Similar to Equation 6.31, a more accurate solution $\exp[-(1 - \bar{A})\lambda_d t]$ can be formally defined, but the necessary integration of the availability function can be performed explicitly only for special choices of F(t) and G(t), like the exponential or the Erlangian, for instance.

6.3.4 Two Non-serviced Demands in a Row

The breakdown in this case is defined as an event when the system is in the repair state at two consecutive moments of demand. Multiple availability $V_1(t)$ is now defined as the probability of the system functioning without breakdowns in [0, t).

As was stated before, this setting can be quite typical for some information-processing systems. If, for instance, a correction of the navigation system in the example above has failed (the system was unavailable), we can still wait for the next correction and consider the situation as normal if this second correction is performed properly.

Similar to Equation 6.24, the following integral equation for $V_1(t)$ can be obtained:
$$V_1(t) = \exp(-\lambda_d t) + \int_0^t \lambda_d \exp(-\lambda_d x) A(x) V_1(t - x)\, dx + \int_0^t \left\{\lambda_d \exp(-\lambda_d x)[1 - A(x)] \int_0^{t-x} \lambda_d \exp(-\lambda_d y) A^*(y) V_1(t - x - y)\, dy\right\} dx \qquad (6.34)$$
where $A^*(t)$ is the availability of the system at the time instant t, if at t = 0 the system was in the state of repair:
$$A^*(t) = \frac{\mu}{\mu + \lambda} - \frac{\mu}{\mu + \lambda} \exp[-(\lambda + \mu)t]$$
The first two terms on the right-hand side of Equation 6.34 have the same meaning as on the right-hand side of Equation 6.24, and the third term defines the joint probability of the following events:

• occurrence of the first demand in [x, x + dx);
• the system is in the repair state at x (probability: 1 − A(x));
• occurrence of the second demand in [x + y, x + y + dy), given that the first demand had occurred in [x, x + dx);
• the system is in the operating state at x + y, given that it was in the repair state at the previous demand (probability: A*(y));
• the system operates without breakdowns in [x + y, t) (probability: $V_1(t - x - y)$).

Equation 6.34 can also be easily solved via the Laplace transform. After elementary transformations
$$V_1(s) = \frac{(s + \lambda_d)(s + \lambda_d + \lambda + \mu)^2}{s(s + \lambda_d)(s + \lambda_d + \lambda + \mu)^2 + s\lambda_d \lambda(s + 2\lambda_d + \lambda + \mu) + \lambda_d^2 \lambda(\lambda_d + \lambda)} = \frac{P_3(s)}{P_4(s)} \qquad (6.35)$$
where $P_3(s)$ ($P_4(s)$) denotes the polynomial of the third (fourth) order in the numerator (denominator).

The inverse transformation results in
$$V_1(t) = \sum_{1}^{4} \frac{P_3(s_i)}{P_4'(s_i)} \exp(s_i t) \qquad (6.36)$$
where $P_4'(s)$ denotes the derivative of $P_4(s)$ and $s_i$, i = 1, 2, 3, 4, are the roots of the denominator in Equation 6.35:
$$P_4(s) = \sum_{0}^{4} b_k s^{4-k} = 0 \qquad (6.37)$$
where the $b_k$ are defined as:
$$b_0 = 1$$
$$b_1 = 2\lambda + 2\mu + 3\lambda_d$$
$$b_2 = (\lambda_d + \lambda + \mu)(3\lambda_d + \lambda + \mu) + \lambda_d \lambda$$
$$b_3 = \lambda_d[(\lambda_d + \lambda + \mu)^2 + \lambda(\lambda_d + \lambda + \mu) + \lambda(\lambda_d + \lambda)]$$
$$b_4 = \lambda_d^2 \lambda(\lambda_d + \lambda)$$
Equation 6.36 defines the exact solution of the problem. It can be easily obtained numerically by solving Equation 6.37 and substituting the corresponding roots in Equation 6.36. As in the previous section, a simple approximate formula can also be used in practice. Let the assumptions in Equations 6.27 and 6.28 hold. All $b_k$, k = 1, 2, 3, 4, in Equation 6.37 are positive, and this means that there are no positive roots of this equation. Consider the smallest-in-absolute-value root $s_1$.

Owing to the assumption in Equation 6.27, and the definitions of $P_3(s)$ and $P_4(s)$ in Equation 6.35, it can be seen that
$$s_1 \approx -\frac{b_4}{b_3} \approx -\frac{\lambda \lambda_d(\lambda + \lambda_d)}{\mu^2}, \qquad \frac{P_3(s_1)}{P_4'(s_1)} \approx 1$$
and that the absolute values of the other roots far exceed $|s_1|$. Thus, Equation 6.36 can be written as the fast repair exponential approximation:
$$V_1(t) \approx \exp\left[-\frac{\lambda \lambda_d(\lambda + \lambda_d)}{\mu^2} t\right] \qquad (6.38)$$

It is difficult to assess the approximation error of this approach directly, as was done in the previous section, because the root $s_1$ is also defined approximately.

On the other hand, the method of the per demand failure rate can be used for obtaining $V_1(t)$. Similar to Equation 6.30:
$$V_1(t) \equiv \exp\left[-\int_0^t \lambda_b(u)\, du\right] \approx \exp[-A(1 - A)^2 \lambda_d t] = \exp\left[-\frac{\mu \lambda^2 \lambda_d t}{(\lambda + \mu)^3}\right] \qquad (6.39)$$
Indeed, a breakdown can happen in [t, t + dt) if a demand occurs in this interval (probability: $\lambda_d\, dt$) and the system is unavailable at this moment of time and at the moment of the previous demand, while it was available before. Owing to the assumptions in Equations 6.28 and 6.29, this probability is approximately $\mu \lambda^2/(\lambda + \mu)^3$, as
$$A \equiv A(\infty) = A^*(\infty) = \frac{\mu}{\lambda + \mu}$$
Similar to Equation 6.31, a more accurate approximation for $V_1(t)$ can be obtained if we substitute in Equation 6.39 more accurate values of the unavailability at the moments of the current ($1 - \bar{A}^*$) and previous ($1 - \bar{A}$) demands. Hence
$$V_1(t) \approx \exp[-A(1 - \bar{A})(1 - \bar{A}^*)\lambda_d t] = \exp\left\{-\frac{\mu \lambda \lambda_d[\mu(\lambda + \lambda_d) + \lambda \lambda_d + \lambda^2]t}{(\lambda + \mu)^2(\lambda + \lambda_d + \mu)^2}\right\} \qquad (6.40)$$

Actually, this is also an obvious modification of the method of the per demand failure rate. The analysis of the accuracy of this and subsequent models is much more cumbersome than that for ordinary multiple availability. Therefore, it is not included in this chapter. The fast repair assumption (Equation 6.27) leads to an interesting result:
$$\frac{\mu \lambda \lambda_d[\mu(\lambda + \lambda_d) + \lambda \lambda_d + \lambda^2]}{(\lambda + \mu)^2(\lambda + \lambda_d + \mu)^2} = \frac{\lambda \lambda_d(\lambda + \lambda_d)}{\mu^2}[1 + o(1)]$$
Thus, Equation 6.40 and the approximate formula in Equation 6.38, which was derived via the Laplace transform, are asymptotically equivalent! It is worth noting that the cruder relation (Equation 6.39) does not always lead to Equation 6.38, as it is not necessarily so that $\lambda_d \ll \lambda$.

Taking into account the reasoning that was used while deriving Equation 6.30, the approximation in Equation 6.39 can be generalized to arbitrary F(t) and G(t):
$$V_1(t) \approx \exp[-A(1 - A)^2 \lambda_d t] = \exp\left[-\frac{\lambda_d T \tau^2 t}{(T + \tau)^3}\right] \qquad (6.41)$$
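Both routes to $V_1(t)$ are easy to check numerically. The sketch below solves the quartic of Equation 6.37 with numpy.roots to evaluate the exact Equation 6.36 and compares it with the fast repair approximation of Equation 6.38; the data are those of Example 1:

```python
import numpy as np

lam, mu, lam_d, t = 1e-2, 2.0, 0.021, 1e3

# coefficients b0..b4 of P4(s), Equation 6.37
b = [1.0,
     2*lam + 2*mu + 3*lam_d,
     (lam_d + lam + mu)*(3*lam_d + lam + mu) + lam_d*lam,
     lam_d*((lam_d + lam + mu)**2 + lam*(lam_d + lam + mu) + lam*(lam_d + lam)),
     lam_d**2 * lam * (lam_d + lam)]

P3  = lambda s: (s + lam_d) * (s + lam_d + lam + mu)**2      # numerator of (6.35)
P4p = lambda s: 4*b[0]*s**3 + 3*b[1]*s**2 + 2*b[2]*s + b[3]  # derivative of P4

roots = np.roots(b)
V1_exact = sum(P3(s)/P4p(s) * np.exp(s*t) for s in roots).real   # Equation 6.36
V1_fast  = np.exp(-lam*lam_d*(lam + lam_d)*t / mu**2)            # Equation 6.38
print(V1_exact, V1_fast)     # both close to 0.9984 for these data
```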

6.3.5 Not More than N Non-serviced Demands

The breakdown in this case is defined as an event when more than N ≥ 1 demands are non-serviced in [0, t). Multiple availability $V_{2,N}(t)$ is defined as the probability of the system functioning without breakdowns in [0, t).

Let N = 1. Similar to Equations 6.24 and 6.34:
$$V_{2,1}(t) = \exp(-\lambda_d t) + \int_0^t \lambda_d \exp(-\lambda_d x) A(x) V_{2,1}(t - x)\, dx + \int_0^t \lambda_d \exp(-\lambda_d x)[1 - A(x)] V^*(t - x)\, dx \qquad (6.42)$$
where $V^*(t)$ is the ordinary multiple availability for a system that at t = 0 is in the state of repair. Similar to Equation 6.26:
$$V^*(t) = B_1^* \exp(s_1 t) + B_2^* \exp(s_2 t)$$
where $s_1$ and $s_2$ are the same as in Equation 6.26 and
$$B_1^* = \frac{s_2 + \lambda_d}{s_2 - s_1} \qquad B_2^* = -\frac{s_1 + \lambda_d}{s_2 - s_1}$$

The first term on the right-hand side of Equation 6.42 determines the probability of the absence of demands in [0, t). The second term defines the probability that the system is operable at the first demand (a renewal point) and functions without breakdowns in [x, t). The third term determines the probability of the system being in the state of repair at the first demand and in the operable state at all subsequent demands.

Applying the Laplace transform to both sides of Equation 6.42 and solving the algebraic equation with respect to $V_{2,1}(s)$:
$$V_{2,1}(s) = \frac{(s + \lambda_d + \lambda + \mu)[s^2 + s(\lambda_d + \lambda + \mu) + \lambda_d \lambda] + \lambda_d \lambda(s + \lambda + \mu)}{[s^2 + s(\lambda_d + \lambda + \mu) + \lambda_d \lambda]^2} \qquad (6.43)$$

The roots of the denominator, $s_1$ and $s_2$, are the same as in Equation 6.25, but each of multiplicity two. A rather cumbersome exact solution can be obtained by inverting Equation 6.43. The complexity increases with increasing N. A cumbersome recurrent system of integral equations for obtaining $V_{2,N}(t)$ can be derived and solved in terms of the Laplace transform (to be inverted numerically). Hence, the fast repair approximation of the previous sections can also be very helpful in this case.

Consider the point process of breakdowns of our system in the sense of Section 6.3.2 (each point of unavailability on demand forms this point process). It is clear that this is a renewal process with 1 − V(t) as the cumulative distribution function of the first cycle and 1 − V*(t) as the cumulative distribution function of the subsequent cycles. On the other hand, as follows from Equation 6.31, the point process can be approximated by the Poisson process with rate $(1 - \bar{A})\lambda_d$. This leads to the following approximate result for arbitrary N:
$$V_{2,N}(t) \approx \exp\left(-\frac{\lambda \lambda_d}{\lambda + \lambda_d + \mu} t\right) \sum_{n=0}^{N} \frac{1}{n!}\left(\frac{\lambda \lambda_d}{\lambda + \lambda_d + \mu} t\right)^n \qquad N = 1, 2, \ldots \qquad (6.44)$$
Thus, a rather complicated problem has been immediately solved via the Poisson approximation, based on the per demand failure rate $\lambda \lambda_d(\lambda + \lambda_d + \mu)^{-1}$. When N = 0, we arrive at the case of ordinary multiple availability: $V_{2,0}(t) \equiv V(t)$. As previously:

$$V_{2,N}(t) \approx \exp\left(-\frac{\lambda_d \tau}{T + \tau} t\right) \sum_{n=0}^{N} \frac{1}{n!}\left(\frac{\lambda_d \tau}{T + \tau} t\right)^n \qquad N = 1, 2, \ldots$$
for the case of general distributions of failure and repair times.
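A sketch of the Poisson approximation (Equation 6.44); with N = 0 it reproduces ordinary multiple availability for the data of Example 1:

```python
import math

def V2(N, lam, mu, lam_d, t):
    # Equation 6.44: at most N demands find the system under repair in [0, t)
    x = lam * lam_d * t / (lam + lam_d + mu)   # mean number of non-serviced demands
    return math.exp(-x) * sum(x**n / math.factorial(n) for n in range(N + 1))

for N in range(4):
    print(N, V2(N, 1e-2, 2.0, 0.021, 1e3))
# N = 0 gives V(t) ~ 0.90177; the probability rises quickly with each
# additional tolerated non-serviced demand
```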

6.3.6 Time Redundancy

The breakdown in this case is defined as an event when, at the moment of a demand occurrence, the system has been in the repair state for a time exceeding τ > 0. As previously, multiple availability $V_3(t)$ is defined as the probability of the system functioning without breakdowns in [0, t).

The definition of a breakdown for this case means that, if the system has been "not too long" in the repair state at the moment of a demand occurrence, then this situation is not considered a breakdown. Various types of sensors in information-measuring systems, for instance, can possess this feature of certain inertia. It is clear that if τ = 0, then $V_3(t) = V(t)$.

The following notation will be used:

$V_3^*(t)$: multiple availability in [0, t) in accordance with the given definition of a breakdown (the system is in the repair state at t = 0);
$\Phi(x, \tau)$: probability that the system has been in the repair state at the moment of a demand occurrence x for a time not exceeding τ (the system is in the operable state at t = 0);
$\Phi^*(x, \tau)$: probability that the system has been in the repair state at the moment of a demand occurrence x for a time not exceeding τ (the system is in the state of repair at t = 0).

Generalizing Equation 6.24, the following integral equation can be obtained:
$$V_3(t) = \exp(-\lambda_d t) + \int_0^t \lambda_d \exp(-\lambda_d x) A(x) V_3(t - x)\, dx + \int_0^t \lambda_d \exp(-\lambda_d x) \Phi(x, \tau) V_3^*(t - x)\, dx \qquad (6.45)$$
where
$$V_3^*(t) = \exp(-\lambda_d t) + \int_0^t \lambda_d \exp(-\lambda_d x) A^*(x) V_3(t - x)\, dx + \int_0^t \lambda_d \exp(-\lambda_d x) \Phi^*(x, \tau) V_3^*(t - x)\, dx$$
and
$$\Phi(x, \tau) = 1 - A(x) \qquad \Phi^*(x, \tau) = 1 - A^*(x)$$
for x ≤ τ, and
$$\Phi(x, \tau) = \int_0^\tau A(x - y)\left[\int_0^y \lambda \exp(-\lambda z) \exp[-\mu(y - z)]\, dz\right] dy$$
$$\Phi^*(x, \tau) = \int_0^\tau A^*(x - y)\left[\int_0^y \lambda \exp(-\lambda z) \exp[-\mu(y - z)]\, dz\right] dy$$
for x ≥ τ.

The integrand in the third term on the right-hand side of Equation 6.45, for instance, defines the joint probability of the following events: a demand occurs in [x, x + dx); the system has been in the repair state for less than τ at this moment of time; the system operates without breakdowns in [x, t) (given that the system was in the repair state at t = 0). By solving these equations via the Laplace transform, cumbersome exact formulas for $V_3(t)$ can also be obtained.

To obtain a simple approximate formula by means of the method of the per demand failure rate, as in the previous section, consider the Poisson process with rate $(1 - \bar{A})\lambda_d$, which, due to Equation 6.31, approximates the point process of the non-serviced demands (each point of unavailability on demand forms this point process). Multiplying the rate of this initial process, $(1 - \bar{A})\lambda_d$, by the probability of a breakdown on a demand, $\exp(-\mu\tau)$, the corresponding breakdown rate can be obtained. Eventually:
$$V_3(t) \equiv \exp\left[-\int_0^t \lambda_b(u)\, du\right] \approx \exp[-\lambda_d(1 - \bar{A}) \exp(-\mu\tau) t] = \exp\left[-\frac{\lambda \lambda_d}{\lambda + \lambda_d + \mu} \exp(-\mu\tau)\, t\right] \qquad (6.46)$$
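A one-line evaluation of Equation 6.46 shows how much even a small repair-time slack τ buys; the numbers below use the data of Example 1:

```python
import math

def V3(lam, mu, lam_d, tau, t):
    # Equation 6.46: a demand during repair is a breakdown only if the repair
    # has already lasted longer than tau
    return math.exp(-lam*lam_d * math.exp(-mu*tau) * t / (lam + lam_d + mu))

print(V3(1e-2, 2.0, 0.021, 0.0, 1e3))   # tau = 0: ordinary V(t), about 0.90177
print(V3(1e-2, 2.0, 0.021, 1.0, 1e3))   # tau = 1 h of slack: about 0.98610
```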

The definition of the breakdown will now be changed slightly. It is defined as an event when the system is in the repair state at the moment of a demand occurrence and remains in this state for at least time τ after the occurrence of the demand. The corresponding multiple availability is denoted in this case by $V_4(t)$.

This setting describes the following situation: a demand cannot be serviced at the instant of its occurrence (the system is in the repair state), but, if the remaining time of repair is less than τ (probability $1 - \exp(-\mu\tau)$), it can still be serviced after the completion of the repair operation. The integral equation for this case can be easily written:
$$V_4(t) = \exp(-\lambda_d t) + \int_0^t \lambda_d \exp(-\lambda_d x) A(x) V_4(t - x)\, dx + \int_0^t \lambda_d \exp(-\lambda_d x)[1 - A(x)] \int_0^\tau \mu \exp(-\mu y) V_4(t - x - y)\, dy\, dx$$

Though this equation seems to be similar to the previous one, it cannot be solved via the Laplace transform, as $V_4(t - x - y)$ depends on y. Thus, numerical methods for solving integral equations should be used, and the importance of the approximate approach for this setting increases.

It can be easily seen that the method of the per demand failure rate can also be applied in this case, which results in the same approximate relation (Equation 6.46) for $V_4(t)$.

6.4 Modeling the MixtureFailure Rate

6.4.1 Definitions and ConditionalCharacteristics

In most settings involving lifetimes, the popula-tion of lifetimes is not homogeneous. That is, allof the items in the population do not have exactlythe same distribution; there is usually some per-centage of the lifetimes that is different from themajority. For example, for electronic components,most of the population might be exponential, withlong lives, while a certain percentage often hasan exponential distribution with short lives anda certain percentage can be characterized by livesof intermediate duration. Thus, a mixing proce-dure arises naturally when pooling from hetero-geneous populations. The observed failure rate,which is actually the mixture failure rate in thiscase, obtained by usual statistical methods, doesnot bear any information on this mixing proce-dure. To say more, as was stated in Section 6.1,obtaining the observed failure rate in this waycan often present certain difficulties, because thecorresponding data can be scarce. On the otherhand, it is clear that the proper modeling canadd some valuable information for estimating the

mixture failure rate and for analyzing and com-paring its shape with the shape of the failure rateof the governing cumulative distribution function.Therefore, the analysis of the shape of the ob-served failure rate will be the main topic of thissection.

Mixtures of decreasing failure rate (DFR)distributions are always DFR distributions [8]. Itturns out that, very often, mixtures of increasingfailure rate (IFR) distributions can decrease, atleast in some intervals of time. This means that theoperation of mixing can change the correspondingpattern of aging. This fact is rather surprising,and should be taken into account in variousapplications owing to its practical importance.

As before, consider a lifetime random variable T ≥ 0 with a cumulative distribution function F(t). Assume, similar to Section 6.2.3, that F(t) is indexed by a parameter θ, so that P(T ≤ t | θ) = F(t, θ), and that the probability density function f(t, θ) exists. Then the corresponding failure rate λ(t, θ) can be defined in the usual way as f(t, θ)/F̄(t, θ), where F̄(t, θ) = 1 − F(t, θ) is the survival function. Let θ be interpreted as a non-negative random variable with support in [0, ∞) and a probability density function π(θ). For the sake of convenience, the same notation is used for the random variable and its realization. A mixture cumulative distribution function is defined by

Fm(t) = ∫₀^∞ F(t, θ)π(θ) dθ   (6.47)

and the mixture (observed) failure rate in accordance with this definition is

λm(t) = ∫₀^∞ f(t, θ)π(θ) dθ / ∫₀^∞ F̄(t, θ)π(θ) dθ   (6.48)

Similar to Equations 6.12 and 6.13, and using the conditional probability density function π(θ | t) of θ given T ≥ t:

π(θ | t) = π(θ)F̄(t, θ) / ∫₀^∞ F̄(t, θ)π(θ) dθ   (6.49)

the mixture failure rate λm(t) can be written as:

λm(t) = ∫₀^∞ λ(t, θ)π(θ | t) dθ   (6.50)


Denote by E[θ | t] the conditional expectation of θ given T ≥ t:

E[θ | t] = ∫₀^∞ θπ(θ | t) dθ

An important characteristic for further consideration is E′[θ | t], the derivative with respect to t:

E′[θ | t] = ∫₀^∞ θπ′(θ | t) dθ

where

π′(θ | t) = λm(t)π(θ | t) − f(t, θ)π(θ) / ∫₀^∞ F̄(t, θ)π(θ) dθ

Therefore, the following relation holds [13]:

E′[θ | t] = λm(t)E[θ | t] − ∫₀^∞ θf(t, θ)π(θ) dθ / ∫₀^∞ F̄(t, θ)π(θ) dθ   (6.51)

In the next two sections the specific models of mixing will be considered.

6.4.2 Additive Model

Suppose that

λ(t, θ) = α(t) + θ   (6.52)

where α(t) is a deterministic continuous increasing function (α(t) ≥ 0, t ≥ 0) to be specified later. Then, noting that f(t, θ) = λ(t, θ)F̄(t, θ), and applying the definition given in Equation 6.48 for this concrete model:

λm(t) = α(t) + ∫₀^∞ θF̄(t, θ)π(θ) dθ / ∫₀^∞ F̄(t, θ)π(θ) dθ = α(t) + E[θ | t]   (6.53)

Using Equations 6.51 and 6.53, the specific form of E′[θ | t] can be obtained:

E′[θ | t] = [E[θ | t]]² − ∫₀^∞ θ²π(θ | t) dθ = −Var(θ | t) < 0   (6.54)

Therefore, the conditional expectation of θ for the additive model is a decreasing function of t ∈ [0, ∞).

Upon differentiating Equation 6.53, and using Equation 6.54, the following result on the shape of the mixture failure rate can be obtained [7]:

Let α(t) be an increasing (non-decreasing) convex function in [0, ∞). Assume that Var(θ | t) is decreasing in t ∈ [0, ∞).

If Var(θ | 0) > α′(0), then λm(t) decreases in [0, c) and increases in [c, ∞), where c can be uniquely defined by the equation Var(θ | t) = α′(t) (bathtub shape).

If Var(θ | 0) ≤ α′(0), then λm(t) is increasing in [0, ∞).
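As a numerical illustration (not from the chapter), take an exponential mixing density π(θ) = ϑ exp(−ϑθ); the θ-dependent part of the survival function is then exp(−θt), so that E[θ | t] = 1/(t + ϑ) and Var(θ | t) = 1/(t + ϑ)² (these closed forms reappear in Section 6.4.5). With the convex baseline α(t) = at², Var(θ | 0) = 1/ϑ² > α′(0) = 0, so a bathtub shape is expected. A minimal sketch:

```python
import numpy as np

# Sketch (illustrative values): additive model lambda(t, theta) = alpha(t) + theta
# with exponential mixing pi(theta) = vartheta * exp(-vartheta*theta).
vartheta, a = 1.0, 0.05
t = np.linspace(0, 10, 1001)

alpha = a * t**2                      # increasing convex baseline
E_theta_t = 1.0 / (t + vartheta)      # E[theta | t]
lam_m = alpha + E_theta_t             # Equation 6.53

i_min = np.argmin(lam_m)
print("interior minimum of lambda_m at t =", t[i_min])   # bathtub shape
print("check Var(theta|c) = alpha'(c):",
      1/(t[i_min] + vartheta)**2, 2*a*t[i_min])          # approximately equal
```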

6.4.3 Multiplicative Model

Suppose that

λ(t, θ) = θα(t)   (6.55)

where α(t) is some deterministic increasing (non-decreasing), at least for sufficiently large t, continuous function (α(t) ≥ 0, t ≥ 0). The model in Equation 6.55 is usually called a proportional hazards (PH) model, whereas the previous one is called an additive hazards model. Applying Equation 6.50:

λm(t) = ∫₀^∞ λ(t, θ)π(θ | t) dθ = α(t)E[θ | t]   (6.56)

After differentiating:

λ′m(t) = α′(t)E[θ | t] + α(t)E′[θ | t]

It follows immediately from this equation that when α(0) = 0 the failure rate λm(t) increases in the neighborhood of t = 0. Further behavior of this function depends on the other parameters involved. If λ′m(t) ≥ 0, then the mixture failure rate is increasing (non-decreasing). Thus, the mixture will have an IFR if and only if, ∀t ∈ [0, ∞):

α′(t)/α(t) ≥ −E′[θ | t]/E[θ | t]   (6.57)


Similar to Equation 6.54, it can be shown that [13]:

E′[θ | t] = −α(t) Var(θ | t) < 0   (6.58)

which means that the conditional expectation of θ for the model (Equation 6.55) is a decreasing function of t ∈ [0, ∞). Combining this property with Equations 6.55 and 6.56 for the specific case α(t) = const, the well-known result that the mixture of exponentials has a DFR can be easily obtained. Thus, the foregoing can be considered as a new proof of this fact.

With the help of Equation 6.58, the inequality in Equation 6.57 can be written as

α′(t)/α²(t) ≥ Var(θ | t)/E[θ | t]   (6.59)

Thus, the first two conditional moments and the function α(t) are responsible for the IFR (DFR) properties of the mixture distribution. These characteristics can be obtained for each specific case.

The following result is intuitively evident, as it is known already that E[θ | t] is monotonically decreasing:

E[θ | t] = ∫₀^∞ θπ(θ | t) dθ → 0 as t → ∞   (6.60)

It is clear from the definition of π(θ | t) and the fact that the weaker populations are dying out earlier [14] that E[θ | t] should converge to zero. The weaker population "should be distinct" from the stronger one as t → ∞. This means that the distance between different realizations of λ(t, θ) should be bounded from below by some constant d(θ1, θ2):

λ(t, θ2) − λ(t, θ1) ≥ d(θ1, θ2) > 0,   θ2 > θ1, θ1, θ2 ∈ [0, ∞)

uniformly in t ∈ (c, ∞), where c is a sufficiently large constant. This is obviously the case for our model (Equation 6.55), as α(t) is an increasing function for large t. Therefore, as t → ∞, the proportion of populations with larger failure rates λ(t, θ) (larger values of θ) is decreasing. More precisely: for each ε > 0, the proportion of populations with θ ≥ ε is asymptotically decreasing to zero no matter how small ε is. This means that E[θ | t] tends to zero as t → ∞ [15]. Alternatively, asymptotic convergence (Equation 6.60) can be proved [13] more strictly based on the fact that

π(θ | t) → δ(θ) as t → ∞

where δ(·) denotes the Dirac delta function (a point mass at θ = 0).

From Equation 6.55:

lim_{t→∞} λm(t) = lim_{t→∞} α(t) ∫_a^b θπ(θ | t) dθ = lim_{t→∞} α(t)E[θ | t]   (6.61)

where [a, b], a ≥ 0, denotes the support of π(θ).

Specifically, for the mixture of exponentials when α(t) = const, it follows from Equations 6.60 and 6.61 that the mixture failure rate λm(t) asymptotically decreases to zero as t → ∞. The same conclusion can be made for the case when F(t, θ) is a gamma cumulative distribution function. Indeed, it is well known that the failure rate of the gamma distribution asymptotically converges to a constant.

Asymptotic behavior of λm(t) in Equation 6.61 for a general case depends on the pattern of α(t) increasing (E[θ | t] decreasing). Hence, by using the L'Hospital rule (α(t) → ∞), the following result can be obtained [13]:

The mixture failure rate asymptotically tends to zero:

lim_{t→∞} λm(t) = 0   (6.62)

if and only if

lim_{t→∞} [−α²(t)E′[θ | t]/α′(t)] = lim_{t→∞} [α³(t) Var(θ | t)/α′(t)] = 0   (6.63)

This convergence will be monotonic if, in addition to Equation 6.63:

α′(t)/α²(t) < Var(θ | t)/E[θ | t]   (6.64)

Hence, the conditional characteristics E[θ | t] and Var(θ | t) are responsible for the asymptotic and monotonic properties of the mixture failure rate. The failure rate of the mixture can decrease if E[θ | t] decreases more sharply than α(t) increases (see Equations 6.58 and 6.61). As follows from Equation 6.63, the variance should also be sufficiently small for convergence to zero. Simplifying the situation, one can say that the mixture failure rate tends to increase but, at the same time, another process of "sliding down", due to the effect of dying out of the weakest populations, takes place. The combination of these effects can eventually result in convergence to zero, which is usually the case for the commonly used lifetime distribution functions.

Results obtained for the multiplicative model are valid, in a simplified way, for the additive model. Indeed:

λm(t) = ∫_a^b [α(t) + θ]π(θ | t) dθ = α(t) + ∫_a^b θπ(θ | t) dθ = α(t) + E[θ | t]

Similar to Equation 6.60, it can be shown that E[θ | t] → a as t → ∞, where a is the lower endpoint of the support of π(θ). Therefore, λm(t) asymptotically converges, as t → ∞, to the increasing function α(t) + a:

λm(t) − [α(t) + a] → 0

6.4.4 Some Examples

Three simple examples analyzing the shape of the mixture failure rate for the multiplicative model will be considered in this section [13]. The corresponding results will be obtained by using the "direct" definition (Equation 6.48).

Example 2. Let F(t) be an exponential distribution with parameter λ, and let π(θ) be the exponential probability density function with parameter ϑ. Thus, the corresponding failure rate is constant: λ(t, θ) = θλ. Using Equation 6.48:

λm(t) = λ/(λt + ϑ) = (1/t)[1 + o(1)] → 0 as t → ∞

It can be easily shown by formal calculations that the same asymptotic result is valid when F(t) is a gamma distribution. This follows from general considerations, because the failure rate of a gamma distribution tends to a constant as t → ∞. The conditional expectation in this case is defined by

E[θ | t] = 1/(λt + ϑ)
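These closed forms are easy to confirm by quadrature. The following sketch (illustrative parameter values, not from the chapter) evaluates the definition of Equation 6.48 directly for the exponential mixture and compares it with λ/(λt + ϑ):

```python
import numpy as np
from scipy.integrate import quad

lam, vartheta = 1.0, 2.0            # illustrative parameter values

def lam_mixture(t):
    # Equation 6.48 with lambda(t, theta) = theta*lam, exponential mixing
    num = quad(lambda th: th*lam*np.exp(-th*lam*t)
               * vartheta*np.exp(-vartheta*th), 0, np.inf)[0]
    den = quad(lambda th: np.exp(-th*lam*t)
               * vartheta*np.exp(-vartheta*th), 0, np.inf)[0]
    return num / den

for t in [0.0, 1.0, 5.0, 25.0]:
    print(t, lam_mixture(t), lam/(lam*t + vartheta))   # the two columns agree
```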

Example 3. Consider the specific type of the Weibull distribution with linearly increasing failure rate λ(t, θ) = 2θt and assume that π(θ) is a gamma probability density function:

π(θ) = ϑ^β θ^{β−1} exp(−θϑ)/Γ(β)   β, ϑ > 0, θ ≥ 0

Via direct integration, the mixture failure rate is defined by

λm(t) = 2βt/(ϑ + t²)

It is equal to zero at t = 0 and tends to zero as t → ∞, with a single maximum at t = √ϑ. Hence, the mixture of IFR distributions has a decreasing (tending to zero!) failure rate for sufficiently large t, and this is rather surprising. Furthermore, the same result asymptotically holds for arbitrary Weibull distributions with increasing failure rate.

It is surprising that the mixture failure rate converges to zero in both examples in a similar pattern defined by t⁻¹, because in Example 2 we are mixing constant failure rates, whereas in this example we are mixing IFRs. The conditional expectation in this case is:

E[θ | t] = β/(ϑ + t²)

Example 4. Consider the truncated extreme value distribution defined in the following way:

F(t, k, θ) = 1 − exp{−θk[exp(t) − 1]}   t ≥ 0

λ(t, θ) = θα(t) = θk exp(t)

where k > 0 is a constant. Assume that π(θ) is an exponential probability density function with parameter ϑ. Then:

∫₀^∞ f(t, k, θ)π(θ) dθ = ∫₀^∞ θk exp(t) exp{−θk[exp(t) − 1]} ϑ exp(−θϑ) dθ = ϑk e^t/ω²

where

ω = k exp(t) − k + ϑ

∫₀^∞ F̄(t, k, θ)π(θ) dθ = ϑ ∫₀^∞ exp(−θω) dθ = ϑ/ω

Eventually, using the definition in Equation 6.48:

λm(t) = k exp(t)/ω = k exp(t)/[k exp(t) − k + ϑ] = 1 + (k − ϑ)/[k exp(t) − k + ϑ]   (6.65)

For analyzing the shape of λm(t) it is convenient to write Equation 6.65 in the following way:

λm(t) = 1/[1 + (d/z)]   (6.66)

where z = k exp(t) and d = −k + ϑ.

It follows from Equations 6.65 and 6.66 that λm(0) = k/ϑ. When d < 0 (k > ϑ), the mixture failure rate is monotonically decreasing, asymptotically converging to unity. For d > 0 (k < ϑ) it is monotonically increasing, asymptotically converging to unity. When d = 0 (k = ϑ), the mixture failure rate is equal to unity: λm(t) ≡ 1, ∀t ∈ [0, ∞). The conditional expectation is defined as

E[θ | t] = 1/[k exp(t) − k + ϑ] = 1/(z + d)

This result is even more surprising. The initial failure rate is increasing extremely sharply, and still, for some values of the parameters, the mixture failure rate is decreasing! On the other hand, for k < ϑ it is increasing, but bears no resemblance at all to the pattern of λ(t, θ) increasing. The most amazing fact is that this kind of mixing can result in a constant failure rate for k = ϑ!

The mixing rule resulting in Equation 6.66 has a very interesting interpretation in survival analysis. An exponential failure rate (the Gompertz law) is used to model human mortality, but recent demographic studies show that the long-term pattern of adult mortality does not follow the exponential shape of the corresponding failure rate. This means that after the age of 80–85 years a substantial deceleration in the observed failure rate is encountered. What is the reason for this deceleration? The model of Example 4 can explain this. Indeed, for various reasons, the population is not homogeneous. This means that parameter θ is random, and the corresponding mixing leads to a result similar to Equation 6.66. If the Weibull cumulative distribution function is chosen as a model for F(t, θ) (and this model is also used in demographic studies), then, similar to Example 3, the failure rate can eventually decrease to zero, but this effect can start at ages far beyond the current human life span.
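A tiny numerical sketch (illustrative values, not from the chapter) of the plateau implied by Equation 6.66: every individual hazard θk exp(t) grows exponentially, yet the mixture failure rate levels off at unity.

```python
import numpy as np

# Sketch (illustrative): mortality deceleration under the model of Example 4.
k, vartheta = 1.0, 2.0                      # k < vartheta: lambda_m rises to 1
t = np.array([0.0, 1.0, 2.0, 5.0, 10.0])
z = k * np.exp(t)
lam_m = 1.0 / (1.0 + (vartheta - k) / z)    # Equation 6.66, d = vartheta - k
print(lam_m)   # rises from k/vartheta = 0.5 and flattens out near 1,
               # even though each lambda(t, theta) = theta*k*exp(t) explodes
```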

6.4.5 Inverse Problem

The following important inverse problem [15] arises while modeling the mixture failure rate:

Given the mixture failure rate λm(t) and the mixing distribution π(θ), obtain the failure rate λ(t) of the governing distribution F(t).

It will be shown that this problem can be explicitly solved for additive and proportional failure rate models (Equations 6.52 and 6.55 respectively). In both of these cases the baseline failure rate α(t) should be obtained.

The conditional expectation E[θ | t] in Equation 6.53 for the additive model (Equation 6.52) can be simplified to:

E[θ | t] = ∫₀^∞ θ exp(−θt)π(θ) dθ / ∫₀^∞ exp(−θt)π(θ) dθ   (6.67)

It is important that E[θ | t] does not depend on α(t). We want the mixture failure rate λm(t) to be equal to some arbitrary continuous function g(t), such that ∫₀^∞ g(t) dt = ∞ and g(t) > 0.


Thus, the corresponding equation for solving the inverse problem is λm(t) = g(t). It follows from Equation 6.53 that

α(t) = g(t) − E[θ | t]   (6.68)

and this is, actually, the trivial solution of the inverse problem for the additive model. It is clear that, for the given π(θ), an additional condition should be imposed on g(t):

g(t) ≥ E[θ | t] ∀t ≥ 0

It is worth noting that α(t) → g(t) as t → ∞. If, for instance, π(θ) is an exponential probability density function with parameter ϑ, then it follows from Equation 6.67 that E[θ | t] = (t + ϑ)⁻¹, and

α(t) = g(t) − 1/(t + ϑ)

Assume that g(t) is a constant: g(t) = c. This means that mixing will result in a constant failure rate!

Thus, an arbitrary shape of λm(t) can be constructed. For instance, let us consider the modeling of the bathtub shape [7], which is popular in many applications (decreasing in [0, t0) and increasing in [t0, ∞) for some t0 > 0). This operation can be performed via mixing IFR distributions. It follows from the result of Section 6.4.2 that, if α(t) is an increasing convex function in [0, ∞) with α′(0) < ϑ⁻², then g(t) has a bathtub shape.

The analysis is slightly more complicated for the multiplicative model (Equation 6.55) [15]. As for the additive model, the equation for obtaining α(t) is λm(t) = g(t). It follows from Equations 6.55 and 6.56 that

β′(t) ∫₀^∞ θ exp[−θβ(t)]π(θ) dθ / ∫₀^∞ exp[−θβ(t)]π(θ) dθ = g(t)   (6.69)

where β(t) = ∫₀^t α(u) du. Thus, unlike the first model, the conditional expectation E[θ | t] now depends on α(t).

Denote by π*(t) the Laplace transform of π(θ) for t ≥ 0:

∫₀^∞ exp(−θt)π(θ) dθ ≡ π*(t)

Taking into account that β(t) is monotonically increasing:

∫₀^∞ exp[−θβ(t)]π(θ) dθ ≡ π*[β(t)] ≡ S(t)   (6.70)

Differentiating both sides of Equation 6.70:

S′(t) = −β′(t) ∫₀^∞ θ exp[−θβ(t)]π(θ) dθ   (6.71)

Combining Equation 6.69 with Equations 6.70 and 6.71, we can now obtain a simple expected result [15], which also follows from the definition of the mixture failure rate. It is very convenient that, in this specific case, the solution of the inverse problem can be written via the Laplace transform as

π*[β(t)] = S(t) = exp[−∫₀^t g(u) du]   (6.72)

⇒ β(t) = (π*)⁻¹{exp[−∫₀^t g(u) du]}   (6.73)

Thus, obtaining π*(t), substituting β(t) instead of t, and solving Equation 6.72 with respect to β(t) = ∫₀^t α(u) du leads to the solution of the inverse problem for this case: obtaining the governing distribution, which, due to the model in Equation 6.55, is defined by α(t). Equation 6.72 always has a unique solution, as β(t) is a monotonically increasing function and π*(t) (and hence π*(β(t))) is a survival function (monotonically decreasing from unity at t = 0 (β(t) = 0) to zero as t → ∞ (β(t) → ∞)). This fact is actually stated by the inverse transform (Equation 6.73). The solution can be easily obtained explicitly for mixing distributions with a "nice" Laplace transform π*(t).

Example 5. Let π(θ) = ϑ exp(−ϑθ). Then

π*(t) = ϑ/(t + ϑ)   π*[β(t)] = ϑ/[β(t) + ϑ]


and

π*[β(t)] = exp[−∫₀^t g(u) du] ⇒ β(t) = ϑ exp[∫₀^t g(u) du] − ϑ

α(t) = ϑg(t) exp[∫₀^t g(u) du]

]For the gamma mixing distribution:

π(θ) = exp(−ϑθ)θ^{n−1}ϑ^n/Γ(n)   n > 0

π*(t) = ∫₀^∞ exp(−θt)[exp(−ϑθ)θ^{n−1}ϑ^n/Γ(n)] dθ = ϑ^n/(t + ϑ)^n

Thus:

π*[β(t)] = ϑ^n/[β(t) + ϑ]^n

β(t) = ϑ exp[(1/n)∫₀^t g(u) du] − ϑ

Finally:

α(t) = (ϑ/n)g(t) exp[(1/n)∫₀^t g(u) du]   (6.74)

Example 6. Assume for the setting of the previous example the concrete shape of the mixture failure rate: λm(t) = t (Weibull Fm(t) with linear failure rate). It follows from Equation 6.74 that

α(t) = (ϑt/n) exp(t²/2n)

Thus, for obtaining the baseline α(t) in this case, the mixture failure rate t should just be multiplied by (ϑ/n) exp(t²/2n).
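As a consistency check, the sketch below (illustrative values, not from the chapter) takes the baseline α(t) of Example 6, forms the mixture failure rate numerically from Equation 6.48 under gamma mixing, and recovers g(t) = t:

```python
import numpy as np
from scipy.integrate import quad
from math import gamma as Gamma

n, vartheta = 2.0, 1.5                  # illustrative gamma mixing parameters

def alpha(t):                           # baseline from Example 6
    return vartheta * t / n * np.exp(t**2 / (2*n))

def beta(t):                            # integral of alpha
    return vartheta * (np.exp(t**2 / (2*n)) - 1.0)

def pi(th):                             # gamma mixing density
    return np.exp(-vartheta*th) * th**(n - 1) * vartheta**n / Gamma(n)

def lam_m(t):                           # Equation 6.48, lambda(t,theta) = theta*alpha(t)
    num = quad(lambda th: th*alpha(t)*np.exp(-th*beta(t))*pi(th), 0, np.inf)[0]
    den = quad(lambda th: np.exp(-th*beta(t))*pi(th), 0, np.inf)[0]
    return num / den

for t in [0.5, 1.0, 2.0, 3.0]:
    print(t, lam_m(t))                  # recovers g(t) = t
```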

The corresponding equation for the general model of mixing (Equations 6.47–6.50) can be written as

∫₀^∞ exp[−∫₀^t λ(u, θ) du] π(θ) dθ = exp[−∫₀^t g(u) du]   (6.75)

where the failure rate λ(t, θ) should be defined by the concrete model of mixing. Specifically, this model can be additive, multiplicative, or the mixing rule specified by the accelerated life model [16]:

λ(t, θ) = θα(θt)

where, as in the multiplicative model, α(t) denotes a baseline failure rate. Equation 6.75 turns, in this case, into:

∫₀^∞ exp[−∫₀^{tθ} α(u) du] π(θ) dθ = exp[−∫₀^t g(u) du]   (6.76)

In spite of its similarity to Equation 6.72, Equation 6.76 cannot be solved in explicit form in this case.

References

[1] Thompson Jr WA. Point process models with applications to safety and reliability. London: Chapman and Hall; 1988.
[2] Cox DR, Isham V. Point processes. London: Chapman and Hall; 1980.
[3] Block HW, Savits TH, Borges W. Age dependent minimal repair. J Appl Prob 1985;22:370–86.
[4] Finkelstein MS. A point process stochastic model of safety at sea. Reliab Eng Syst Saf 1998;60:227–33.
[5] Yashin AI, Manton KG. Effects of unobserved and partially observed covariate processes on system failure: a review of models and estimation strategies. Stat Sci 1997;12:20–34.
[6] Kebir Y. On hazard rate processes. Nav Res Logist 1991;38:865–76.
[7] Lynn NJ, Singpurwalla ND. Comment: "Burn-in" makes us feel good. Stat Sci 1997;12:13–9.
[8] Barlow RE, Proschan F. Statistical theory of reliability: probability models. New York: Holt, Rinehart & Winston; 1975.
[9] Daley DJ, Vere-Jones D. An introduction to the theory of point processes. New York: Springer-Verlag; 1988.
[10] Lee WK, Higgins JJ, Tillman FA. Stochastic models for mission effectiveness. IEEE Trans Reliab 1990;39:321–4.
[11] Finkelstein MS. Multiple availability on stochastic demand. IEEE Trans Reliab 1999;39:19–24.
[12] Ushakov IA, Harrison RA. Handbook of reliability engineering. New York: John Wiley & Sons; 1994.
[13] Finkelstein MS, Esaulova V. Why the mixture failure rate decreases. Reliab Eng Syst Saf 2001;71:173–7.
[14] Block HW, Mi J, Savits TH. Burn-in and mixed populations. J Appl Prob 1993;30:692–702.
[15] Finkelstein MS, Esaulova V. On inverse problem in mixture failure rate modeling. Appl Stochast Model Bus Ind 2001;17(2).
[16] Block HW, Joe H. Tail behavior of the failure rate functions of mixtures. Lifetime Data Anal 1999;3:269–88.


Chapter 7

Concepts of Stochastic Dependence in Reliability Analysis

C. D. Lai and M. Xie

7.1 Introduction
7.2 Important Conditions Describing Positive Dependence
7.2.1 Six Basic Conditions
7.2.2 The Relative Stringency of the Conditions
7.2.3 Positive Quadrant Dependent in Expectation
7.2.4 Associated Random Variables
7.2.5 Positively Correlated Distributions
7.2.6 Summary of Interrelationships
7.3 Positive Quadrant Dependent Concept
7.3.1 Constructions of Positive Quadrant Dependent Bivariate Distributions
7.3.2 Applications of Positive Quadrant Dependence Concept to Reliability
7.3.3 Effect of Positive Dependence on the Mean Lifetime of a Parallel System
7.3.4 Inequality Without Any Aging Assumption
7.4 Families of Bivariate Distributions that are Positive Quadrant Dependent
7.4.1 Positive Quadrant Dependent Bivariate Distributions with Simple Structures
7.4.2 Positive Quadrant Dependent Bivariate Distributions with More Complicated Structures
7.4.3 Positive Quadrant Dependent Bivariate Uniform Distributions
7.4.3.1 Generalized Farlie–Gumbel–Morgenstern Family of Copulas
7.5 Some Related Issues on Positive Dependence
7.5.1 Examples of Bivariate Positive Dependence Stronger than Positive Quadrant Dependent Condition
7.5.2 Examples of Negative Quadrant Dependence
7.6 Positive Dependence Orderings
7.7 Concluding Remarks

7.1 Introduction

The concept of dependence permeates the world. There are many examples of interdependence in medicine, economic structures, and reliability engineering, to name just a few. A typical example in engineering is that all output from a piece of equipment will depend on the input in a broader sense, which includes material, equipment, environment, and others. Moreover, the dependence is not deterministic but of a stochastic nature. In this chapter, we limit the scope of our discussion to the dependence notions that are relevant to reliability analysis.

In the reliability literature, it is usually assumed that the component lifetimes are independent. However, components in the same system are used in the same environment or share the same load, and hence the failure of one component affects the others. We also have the case of so-called common cause failures, where components might fail at the same time. The dependence is usually difficult to describe, even for very similar components. From light bulbs in an overhead projector to engines in an aeroplane, we have dependence, and it is essential to study the effect of dependence for better reliability design and analysis. There are many notions of bivariate and multivariate dependence. Several of these concepts were motivated from applications in reliability.

We may have seen abbreviations like PQD, SI, LTD, RTI, etc. in recent years. As one probably would expect, they refer to the form of positive dependence between two or more variables. We shall try to explain them and their interrelationships in this chapter. Positive dependence means that large values of Y tend to accompany large values of X, and vice versa for small values. Discussion of the concepts of dependence involves refining this basic idea, by means of definitions and deductions.

In this chapter, we focus our attention on a relatively weaker notion of dependence, namely the positive quadrant dependence between two variables X and Y. We think that this easily verified form of positive dependence is more relevant in the subject area under discussion. Also, as might be expected, the notions of dependence are simpler and their relationships more readily exposed in the bivariate case than in the multivariate one.

Hutchinson and Lai [1] devoted a chapter to reviewing concepts of dependence for a bivariate distribution. Recently, Joe [2] gave a comprehensive treatment of the subject of multivariate dependence. Thus, our goal here is not to provide another review; instead, we shall focus our attention on positive dependence, in particular positive quadrant dependence. For simplicity, we confine ourselves mainly to the bivariate case, although most of our discussion can be generalized to multivariate situations. An important aspect of this chapter is the availability of several examples that are employed to illustrate the concepts of positive dependence.

The present chapter endeavors to develop a fundamental dependence idea with the following structure:

• positive dependence, a general concept;
• important conditions describing positive dependence (we state some definitions, and examine their relative stringency and their interrelationships);
• positive quadrant dependence—conditions and applications;
• examples of positive quadrant dependence;
• positive dependence orderings.

Some terms that we will come across herein are listed below:

PQD (NQD)      Positive quadrant dependent (negative quadrant dependent)
SI (alias PRD) Stochastically increasing (positively regression dependent)
LTD (RTI)      Left-tail decreasing (right-tail increasing)
TP2            Totally positive of order 2
PQDE           Positive quadrant dependent in expectation
RCSI           Right corner set increasing
F̄ = 1 − F      F̄ survival function, F cumulative distribution function

7.2 Important Conditions Describing Positive Dependence

Concepts of stochastic dependence for a bivariate distribution play an important part in statistics. For each concept, it is often convenient to refer to the bivariate distributions that satisfy it as a family or group. In this chapter, we are mainly concerned with positive dependence. Although negative dependence concepts do exist, they are often obtained as negative analogues of positive dependence via reversing the inequality signs.

Various notions of dependence are motivated from applications in statistical reliability (e.g. see Barlow and Proschan [3]). The baseline, or starting point, of the reliability analysis of systems is independence of the lifetimes of the components. As noted by many authors, it is often more realistic to assume some form of positive dependence among components.

Around about 1970, several studies discussed different notions of positive dependence between two random variables, and derived some interrelationships among them, e.g. Lehmann [4], Esary et al. [5], Esary and Proschan [6], Harris [7], and Brindley and Thompson [8], among others. Yanagimoto [9] unified some of these notions by introducing a family of concepts of positive dependence. Some further notions of positive dependence were introduced by Shaked [10–12].

For concepts of multivariate dependence, see Block and Ting [13] and a more recent text by Joe [2].

7.2.1 Six Basic Conditions

Jogdeo [14] lists four of the following basic conditions describing positive dependence; these are in increasing order of stringency (see also Jogdeo [15]).

1. Positive correlation, cov(X, Y) ≥ 0.

2. For every pair of increasing functions a and b, defined on the real line R, cov[a(X), b(Y)] ≥ 0. Lehmann [4] showed that this condition is equivalent to

Pr(X ≥ x, Y ≥ y) ≥ Pr(X ≥ x) Pr(Y ≥ y)   (7.1)

or, equivalently, Pr(X ≤ x, Y ≤ y) ≥ Pr(X ≤ x) Pr(Y ≤ y). We say that (X, Y) shows positive quadrant dependence if and only if these inequalities hold. Several families of PQD distributions will be introduced in Section 7.4.

3. Esary et al. [5] introduced the following condition, termed "association": for every pair of functions a and b, defined on R², that are increasing in each of the arguments (separately)

cov[a(X, Y), b(X, Y)] ≥ 0   (7.2)

We note, in passing, that a direct verification of this dependence concept is difficult in general. However, it is often easier to verify one of the alternative positive dependence notions that imply association.

4. Y is right-tail increasing in X if Pr(Y > y | X > x) is increasing in x for all y (written as RTI(Y | X)). Similarly, Y is left-tail decreasing in X if Pr(Y ≤ y | X ≤ x) is decreasing in x for all y (written as LTD(Y | X)).

5. Y is said to be stochastically increasing in x for all y (written as SI(Y | X)) if, for every y, Pr(Y > y | X = x) is increasing in x. Similarly, we say that X is stochastically increasing in y for all x (written as SI(X | Y)) if, for every x, Pr(X > x | Y = y) is increasing in y. Note that SI(Y | X) is often simply denoted by SI. Some authors refer to this relationship as Y being positively regression dependent on X (abbreviated to PRD), and similarly X being positively regression dependent on Y.

6. Let X and Y have a joint probability density function f(x, y). Then f is said to be totally positive of order two (TP2) if, for all x1 < x2, y1 < y2:

f(x1, y1)f(x2, y2) ≥ f(x1, y2)f(x2, y1)   (7.3)

Note that if f is TP2, then F and F̄ (survival function) are also TP2, i.e. F(x1, y1)F(x2, y2) ≥ F(x1, y2)F(x2, y1) and F̄(x1, y1)F̄(x2, y2) ≥ F̄(x1, y2)F̄(x2, y1). It is easy to see that either F TP2 or F̄ TP2 implies that F is PQD.

7.2.2 The Relative Stringency of the Conditions

It is well known that these concepts are interrelated, e.g. see Joe [2]. The six conditions we listed above can be arranged in an increasing order of stringency, i.e. (6) ⇒ (5) ⇒ (4) ⇒ (3) ⇒ (2) ⇒ (1). More precisely:

TP2 ⇒ SI ⇒ RTI ⇒ Association ⇒ PQD ⇒ cov(X, Y) ≥ 0

TP2 ⇒ SI ⇒ LTD ⇒ Association ⇒ PQD ⇒ cov(X, Y) ≥ 0


Some of the proofs for the chain of implications are not straightforward, whereas some others are obvious (e.g. see Barlow and Proschan [3], p.143–4). For example, it is easy to see that PQD implies positive correlation by applying Hoeffding's lemma, which states:

cov(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [F(x, y) − FX(x)FY(y)] dx dy   (7.4)

This identity is often useful in many areas of statistics.

7.2.3 Positive Quadrant Dependent in Expectation

We now introduce a slightly less stringent dependence notion that would include the PQD distributions. For any real number x, let Yx be the random variable with distribution function Pr(Y ≤ y | X > x). It is easy to verify that the inequality in the conditional distribution Pr(Y ≤ y | X > x) ≤ Pr(Y ≤ y) implies an inequality in expectation E(Yx) ≥ E(Y) if Y is a non-negative random variable.

We say that Y is PQDE on X if the last inequality involving expectation holds. Similarly, we say that there is negative quadrant dependence in expectation if E(Yx) ≤ E(Y).

It is easy to show that PQD ⇒ PQDE by observing that PQD is equivalent to Pr(Y > y | X > x) ≥ Pr(Y > y), which in turn implies E(Yx) ≥ E(Y) (assuming Y ≥ 0).

Next, we have

cov(X, Y) = ∫∫ [F̄(x, y) − F̄X(x)F̄Y(y)] dx dy

          = ∫ F̄X(x){∫ [Pr(Y > y | X > x) − F̄Y(y)] dy} dx

          = ∫ F̄X(x)[E(Yx) − E(Y)] dx

which is non-negative if X and Y are PQDE. Thus, PQDE implies that cov(X, Y) ≥ 0. In other words, PQDE lies between PQD and positive correlation. There are many bivariate random variables that are PQDE, because all PQD distributions with Y ≥ 0 are also PQDE.

7.2.4 Associated Random Variables

Recall from Section 7.2.1 that two random variables X and Y are associated if cov[a(X, Y), b(X, Y)] ≥ 0. Obviously, this expression can be represented alternatively by

E[a(X, Y)b(X, Y)] ≥ E[a(X, Y)]E[b(X, Y)]   (7.5)

where the inequality holds for all real functions a and b that are increasing in each component and are such that the expectations in Equation 7.5 exist.

It appears from Equation 7.5 that it is almost impossible to check this condition for association directly. What one normally does is to verify one of the dependence conditions in the higher hierarchy that implies association.

Barlow and Proschan [3], p.29, considered some practical reliability situations for which the component lifetimes are not independent, but rather are associated:

1. minimal path structures of a coherent system having components in common;
2. components subject to the same set of stresses;
3. structures in which components share the same load, so that the failure of one component results in increased load on each of the remaining components.

We note that in each case the random variables of interest tend to act similarly. In fact, all the positive dependence concepts share this characteristic.

An important application of the concept of association is to provide probability bounds for system reliability. Many such bounds are presented in Barlow and Proschan [3], in section 3 of chapter 2 and section 7 of chapter 4.

7.2.5 Positively Correlated Distributions

Positive correlation is the weakest notion of dependence between two random variables X and Y. We note that it is easy to construct a positively correlated bivariate distribution. For example, such a distribution may be obtained by simply applying a well-known trivariate reduction technique described as follows.

Set X = X1 + X3, Y = X2 + X3, with Xi (i = 1, 2, 3) being mutually independent; then the correlation coefficient of X and Y is

ρ = var(X3)/[var(X1 + X3) var(X2 + X3)]^(1/2) > 0

For example, let Xi ∼ Poisson(λi), i = 1, 2, 3. Then X ∼ Poisson(λ1 + λ3), Y ∼ Poisson(λ2 + λ3), with ρ = λ3/[(λ1 + λ3)(λ2 + λ3)]^(1/2) > 0. X and Y constructed in this manner are also PQD; see example 1(ii) in Lehmann [4], p.1139.
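The trivariate reduction construction is easy to exercise numerically. The following minimal sketch (illustrative rates, not from the chapter) generates such a positively correlated Poisson pair and compares the sample correlation with the formula above:

```python
import numpy as np

rng = np.random.default_rng(42)
lam1, lam2, lam3 = 1.0, 2.0, 1.5       # illustrative rates
n = 200000

# Trivariate reduction: a shared component X3 induces positive dependence.
x1 = rng.poisson(lam1, n)
x2 = rng.poisson(lam2, n)
x3 = rng.poisson(lam3, n)
X, Y = x1 + x3, x2 + x3

rho_sample = np.corrcoef(X, Y)[0, 1]
rho_theory = lam3 / np.sqrt((lam1 + lam3) * (lam2 + lam3))
print(rho_sample, rho_theory)          # the two values agree closely
```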

7.2.6 Summary of Interrelationships

Among the positive dependence concepts we have introduced so far, TP2 is the strongest. A slightly weaker notion introduced by Harris [7] is called right corner set increasing (RCSI), meaning Pr(X > x1, Y > y1 | X > x2, Y > y2) is increasing in x2 and y2 for all x1 and y1. Shaked [10] showed that TP2 ⇒ RCSI. By choosing x1 = −∞ and y2 = −∞, we see that RCSI ⇒ RTI.

We may summarize the chain of relations in the following (in which Y is conditional on X whenever there is a conditioning):

RCSI ⇒ RTI ⇒ Association ⇒ PQD ⇒ cov ≥ 0
 ⇑      ⇑         ⇑
TP2  ⇒  SI   ⇒   LTD

There are other chains of relationships between various concepts of dependence.

7.3 Positive Quadrant Dependent Concept

There are many notions of bivariate dependence known in the literature. The notion of positive quadrant dependence appears to be more straightforward and easier to verify than other notions. The rest of the chapter mainly focuses on this dependence concept. The definition of PQD, which was first given by Lehmann [4], is now reproduced below.

Definition 1. Random variables X and Y are PQD if the following inequality holds:

Pr(X > x, Y > y) ≥ Pr(X > x) Pr(Y > y)   (7.6)

for all x and y. Equation 7.6 is equivalent to Pr(X ≤ x, Y ≤ y) ≥ Pr(X ≤ x) Pr(Y ≤ y).

The reason why Equation 7.6 constitutes a positive dependence concept is that X and Y here are more likely to be large or small together compared with the independent case.

If the inequality in Equation 7.6 is reversed, then X and Y are negatively quadrant dependent.

PQD is shown to be a stronger notion of dependence than positive (Pearson) correlation but weaker than "association", which is a key concept of positive dependence in Barlow and Proschan [3], originally introduced by Esary et al. [5].

Consider a system of two components that are arranged in series. By assuming that the two components are independent, when they are in fact PQD, we will underestimate the system reliability. For systems in parallel, on the other hand, assuming independence when components are in fact PQD will lead to overestimation of system reliability. This is because the other component will fail earlier knowing that the first has failed. This, from a practical point of view, reduces the effectiveness of adding parallel redundancy. Thus a proper knowledge of the extent of dependence among the components in a system will enable us to obtain a more accurate estimate of the reliability characteristic in question.


7.3.1 Constructions of Positive Quadrant Dependent Bivariate Distributions

Let F(x, y) denote the distribution function of (X, Y) having continuous marginal cumulative distribution functions FX(x) and FY(y), with marginal probability density functions fX = F′X and fY = F′Y respectively. For a PQD bivariate distribution, the joint distribution function may be written as

F(x, y) = FX(x)FY(y) + w(x, y)

satisfying the following conditions:

w(x, y) ≥ 0   (7.7)

w(x, ∞) → 0,  w(∞, y) → 0,  w(x, −∞) = 0,  w(−∞, y) = 0   (7.8)

∂²w(x, y)/∂x∂y + fX(x)fY(y) ≥ 0   (7.9)

Note that if both X ≥ 0 and Y ≥ 0, then the condition in Equation 7.8 may be replaced by w(x, ∞) → 0, w(∞, y) → 0, w(x, 0) = 0, w(0, y) = 0.

Lai and Xie [16] used these conditions to construct a family of PQD distributions with uniform marginals.

7.3.2 Applications of Positive Quadrant Dependence Concept to Reliability

The notion of association is used to establish probability bounds on reliability systems, e.g. see chapter 3 of Barlow and Proschan [3]. Consider a coherent system of n components with minimal path sets Pi (i = 1, 2, . . . , p) and minimal cut sets Kj (j = 1, 2, . . . , k). Let Ti denote the lifetime of the ith component, so that pi = P(Ti > t) is its survival probability at time t. It has been shown, e.g. see [3], p.35–8, that if the components are independent, then

∏_{j=1}^{k} ∐_{i∈Kj} pi ≤ System Reliability ≤ ∐_{j=1}^{p} ∏_{i∈Pj} pi   (7.10)

(here ∐ denotes the complementary product, ∐_i pi = 1 − ∏_i (1 − pi)). If, in addition, components are associated, then we have an alternative set of bounds

max_{1≤r≤p} ∏_{i∈Pr} pi ≤ System Reliability ≤ min_{1≤s≤k} ∐_{i∈Ks} pi   (7.11)

As independence implies association, the bounds given in Equation 7.11 are also applicable for a system with independent components, although one cannot conclude that Equation 7.11 is tighter than Equation 7.10 or vice versa. A closer examination of the derivations leading to Equation 7.11 readily reveals that the reliability bounds presented here remain valid if we assume only the positive quadrant dependence of the components.
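The following sketch (illustrative, not from the chapter) evaluates both sets of bounds for a hypothetical five-component bridge structure; the path sets, cut sets, and component reliabilities are made-up inputs.

```python
import numpy as np

# Hypothetical inputs: a 5-component bridge structure.
p = {1: 0.9, 2: 0.8, 3: 0.85, 4: 0.9, 5: 0.95}        # component reliabilities
path_sets = [{1, 4}, {2, 5}, {1, 3, 5}, {2, 3, 4}]     # minimal path sets
cut_sets  = [{1, 2}, {4, 5}, {1, 3, 5}, {2, 3, 4}]     # minimal cut sets

def prod(s):             # ordinary product over a set of components
    return np.prod([p[i] for i in s])

def coprod(s):           # complementary product: 1 - prod(1 - p_i)
    return 1 - np.prod([1 - p[i] for i in s])

# Equation 7.10 (independent components)
lo_710 = np.prod([coprod(K) for K in cut_sets])
hi_710 = 1 - np.prod([1 - prod(P) for P in path_sets])

# Equation 7.11 (associated -- or merely PQD -- components)
lo_711 = max(prod(P) for P in path_sets)
hi_711 = min(coprod(K) for K in cut_sets)

print("Eq 7.10 bounds:", lo_710, hi_710)
print("Eq 7.11 bounds:", lo_711, hi_711)
```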

One can find details related to bounds on the reliability of a coherent system with associated components in the text by Barlow and Proschan [3].

To our disappointment, we have not been able to find more examples of real applications of positive dependence in reliability contexts.

We note in passing that the concept of positive quadrant dependence is widely used in statistics, for example:

• partial sums [17];
• order statistics [5];
• analysis of variance [18];
• contingency tables [19].

7.3.3 Effect of Positive Dependence on the Mean Lifetime of a Parallel System

Parallel redundancy is a common device to increase system reliability by enhancing the mean time to failure. Xie and Lai [20] studied the effectiveness of adding parallel redundancy to a single component in a complex system consisting of several independent components. It is shown that adding parallel redundancy to components with increasing failure rate (IFR) is less effective than that for a component with decreasing failure rate (DFR). Motivated to enhance reliability, Mi [21] considered the question of which component should be "bolstered" or "improved" in order to stochastically maximize the lifetime of a parallel system, series system, or, in general, a k-out-of-n system.

Xie and Lai [20] assumed that the two parallel components were independent. This assumption is rarely valid in practice. Indeed, in reliability analysis, the component lifetimes are usually dependent. Two components in a reliability structure may share the same load or may be subject to the same set of stresses. This will cause the two lifetime random variables to be related to each other, or to be positively dependent.

When the components are dependent, the problem of effectiveness via adding a component may be different from the case of independent components. In particular, we are interested in investigating how the degree of the correlation affects the increase in the mean lifetime. A general approach may not be possible, and at present we can only consider typical cases when the two components are either PQD or NQD.

Kotz et al. [22] studied the increase in the mean lifetime by means of parallel redundancy for several cases when the two components are either PQD or NQD. The findings strongly suggest that the effect of dependence among the components should be taken into consideration when designing reliability systems. The above observation is valid whenever the two components are PQD, and the result is stated in the following.

Let X and Y be identically distributed, not necessarily independent, PQD non-negative random variables. Kotz et al. [22] showed that

E(T) < E(X) + E(Y) − ∫₀^∞ F̄²(t) dt   (7.12)

where T = max(X, Y) is the lifetime of the parallel system, F(t) is the common distribution function, and F̄(t) = 1 − F(t). Hence, for increasing failure rate average components that are PQD, we have

E(T) < E(X) + E(Y) − ∫₀^∞ F̄²(t) dt = 2µ − µ/2 + (µ/2)e⁻² = (µ/2)(3 + e⁻²)   (7.13)

Similarly, if a pair of non-negative identically distributed random variables X and Y is NQD, then the inequality in Equation 7.12 is reversed, i.e.

E(T) > E(X) + E(Y) − ∫₀^∞ F̄²(t) dt   (7.14)

If, in addition, both X and Y are DFR components, then

E(T) > E(X) + E(Y) − ∫₀^∞ F̄²(t) dt ≥ µ(3/2 − e⁻²/2)
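A small simulation (illustrative, not from the chapter) shows the direction of the effect: exponential marginals coupled by an FGM copula with α > 0 (a PQD pair; see Example 1 in Section 7.4.1) give a parallel-system mean lifetime smaller than the value computed under an independence assumption. Sampling uses the standard conditional-inverse method for the FGM copula.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, lam, n = 1.0, 1.0, 400000       # FGM parameter, exponential rate

# Sample (U, V) from the FGM copula via the conditional-inverse method:
# the conditional CDF of V given U = u is v[1 + a(1 - v)], a = alpha(1 - 2u),
# which inverts as the smaller root of a quadratic in v.
u = rng.uniform(size=n)
w = rng.uniform(size=n)
a = alpha * (1 - 2*u)
a = np.where(np.abs(a) < 1e-12, 1e-12, a)     # avoid division by zero
v = ((1 + a) - np.sqrt((1 + a)**2 - 4*a*w)) / (2*a)

x = -np.log(1 - u) / lam               # exponential marginals
y = -np.log(1 - v) / lam

mean_parallel_pqd = np.mean(np.maximum(x, y))
# Independence benchmark: E[max(X, Y)] = 2/lam - 1/(2*lam) = 1.5/lam
print(mean_parallel_pqd, 1.5 / lam)    # PQD mean is smaller, as Eq 7.12 predicts
```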

7.3.4 Inequality Without Any Aging Assumption

If the aging property of the marginal distributions is unknown but the median m of the common marginal is given, then Equation 7.12 reduces to

E(T) < E(X) + E(Y) − ∫₀^m F̄(t) dt + m/2

7.4 Families of Bivariate Distributions that are Positive Quadrant Dependent

Since the PQD concept is important in reliability applications, it is imperative for a reliability practitioner to know what kinds of PQD bivariate distribution are available for reliability modeling. In this section, we list several well-known PQD distributions, some of which were originally derived from a reliability perspective. Most of these PQD bivariate distributions can be found, for example, in Hutchinson and Lai [1].


7.4.1 Positive Quadrant Dependent Bivariate Distributions with Simple Structures

The distributions whose PQD property can be established easily are given below.

Example 1. The Farlie–Gumbel–Morgenstern bivariate distribution [23]:

F(x, y) = FX(x)FY(y){1 + α[1 − FX(x)][1 − FY(y)]}   (7.15)

For convenience, the above family may simply be denoted by FGM. This general system of bivariate distributions is widely studied in the literature. It is easy to verify that X and Y are PQD if α > 0.

Consider a special case of the FGM system where both marginals are exponential. The joint distribution function is then of the form (e.g. see Johnson and Kotz [24], p.262–3):

F(x, y) = (1 − e^{−λ1x})(1 − e^{−λ2y})(1 + α e^{−λ1x−λ2y})

Clearly

w(x, y) = F(x, y) − FX(x)FY(y) = α e^{−λ1x−λ2y}(1 − e^{−λ1x})(1 − e^{−λ2y})   0 < α ≤ 1

satisfies the conditions in Equations 7.7–7.9, and hence X and Y are PQD.
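For a quick check, the sketch below (illustrative parameters, not from the chapter) evaluates w(x, y) on a grid for the exponential FGM case and confirms it is non-negative, which is the PQD condition of Equation 7.7:

```python
import numpy as np

lam1, lam2, alpha = 1.0, 0.5, 0.8      # illustrative parameters, 0 < alpha <= 1
x = np.linspace(0, 10, 201)
y = np.linspace(0, 10, 201)
X, Y = np.meshgrid(x, y)

# w(x, y) = F(x, y) - FX(x) FY(y) for the exponential FGM distribution
w = (alpha * np.exp(-lam1*X - lam2*Y)
     * (1 - np.exp(-lam1*X)) * (1 - np.exp(-lam2*Y)))
print("min of w on the grid:", w.min())   # >= 0, so (X, Y) is PQD
```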

Mukerjee and Sasmal [25] have worked out the properties of a system of two exponential components having the FGM distribution. The properties are such things as the densities, means, moment-generating functions, and tail probabilities of min(X, Y), max(X, Y), and X + Y, these being of relevance to series, parallel, and standby systems respectively.

Lingappaiah [26] was also concerned with properties of the FGM distribution relevant to the reliability context, but with gamma marginals.

Building upon a paper by Philips [27], Kotz and Johnson [28] considered a model in which components 1 and 2 were subject to "revealed" and "unrevealed" faults respectively, with (X, Y) having an FGM distribution, where X is the time between unrevealed faults and Y is the time from an unrevealed fault to a revealed fault.

Example 2. The bivariate exponential distribution

F(x, y) = 1 − e⁻ˣ − e⁻ʸ + (eˣ + eʸ − 1)⁻¹

This distribution is not well known, and we could not confirm its source. However, both marginals are exponential, which is used widely in a reliability context. This bivariate distribution function can be rewritten as

F(x, y) = 1 − e⁻ˣ − e⁻ʸ + e^{−(x+y)} + (eˣ + eʸ − 1)⁻¹ − e^{−(x+y)}

        = FX(x)FY(y) + (eˣ + eʸ − 1)⁻¹ − e^{−(x+y)}

Now

(eˣ + eʸ − 1)⁻¹ − e^{−(x+y)} = (eˣ − 1)(eʸ − 1)/[(eˣ + eʸ − 1) e^{x+y}]

                            = (1 − e⁻ˣ)(1 − e⁻ʸ)/(eˣ + eʸ − 1) ≥ 0

and therefore F is PQD.

Example 3. The bivariate Pareto distribution

F̄(x, y) = (1 + ax + by)^{−λ}   a, b, λ > 0

(e.g. see Mardia [29], p.91). Consider a system of two independent exponential components that share a common environment factor η that can be described by a gamma distribution. Lindley and Singpurwalla [30] showed that the resulting joint distribution has a bivariate Pareto distribution. It is very easy to verify that this joint distribution is PQD. For a generalization to multivariate components, see Nayak [31].

Example 4. The Durling–Pareto distribution

F̄(x, y) = (1 + x + y + kxy)^{−a}   a > 0, 0 ≤ k ≤ a + 1   (7.16)


Obviously, it is a generalization of Example 3 above.

Consider a system of two dependent exponential components having a bivariate Gumbel distribution

F(x, y) = 1 − e⁻ˣ − e⁻ʸ + e^{−x−y−θxy}   x, y ≥ 0, 0 ≤ θ ≤ 1

and sharing a common environment that has a gamma distribution. Sankaran and Nair [32] have shown that the resulting bivariate distribution is specified by Equation 7.16. It follows from Equation 7.16 that

F̄(x, y) − F̄X(x)F̄Y(y) = 1/(1 + x + y + kxy)^a − 1/[(1 + x)(1 + y)]^a   0 ≤ k ≤ (a + 1)

                      = 1/(1 + x + y + kxy)^a − 1/(1 + x + y + xy)^a ≥ 0   0 ≤ k ≤ 1

Hence, F is PQD if 0≤ k ≤ 1.

7.4.2 Positive Quadrant Dependent Bivariate Distributions with More Complicated Structures

Example 5. Marshall and Olkin's bivariate exponential distribution [33]

P(X > x, Y > y) = exp[−λ1x − λ2y − λ12 max(x, y)]   λ1, λ2, λ12 ≥ 0   (7.17)

This has become a widely used bivariate exponential distribution over the last three decades. The Marshall and Olkin bivariate exponential distribution was derived from a reliability context.

Suppose we have a two-component system subjected to shocks that are always fatal. These shocks are assumed to be governed by three independent Poisson processes with parameters λ1, λ2, and λ12, according to whether the shock applies to component 1 only, component 2 only, or to both components. Then the joint survival function is given by Equation 7.17.

Barlow and Proschan [3], p.129, show that X and Y are PQD.

Example 6. Bivariate distribution of Block and Basu [34]

F̄(x, y) = [(2 + θ)/2] exp[−x − y − θ max(x, y)] − (θ/2) exp[−(2 + θ) max(x, y)]   θ, x, y ≥ 0

This was constructed to modify Marshall and Olkin's bivariate exponential, which has a singular part. It is, in fact, a reparameterization of a special case of Freund's [35] bivariate exponential distribution. The marginal is

F̄X(x) = [(2 + θ)/2] exp[−(1 + θ)x] − (θ/2) exp[−(2 + θ)x]

and a similar expression for F̄Y(y). It is easy to show that this distribution is PQD.

Example 7. Kibble's bivariate gamma distribution. The joint density function is

fρ(x, y; α) = fX(x)fY(y) [Γ(α)/(1 − ρ)] exp[−ρ(x + y)/(1 − ρ)] (xyρ)^{−(α−1)/2} I_{α−1}(2√(xyρ)/(1 − ρ))   0 ≤ ρ < 1

with fX, fY being the marginal gamma probability density functions with shape parameter α. Here, Iα(·) is the modified Bessel function of the first kind and αth order.

Lai and Moore [36] show that the distribution function is given by

F(x, y; ρ) = FX(x)FY(y) + α ∫₀^ρ f_t(x, y; α + 1) dt ≥ FX(x)FY(y)

because u(x, y) = ∫₀^ρ f_t(x, y; α + 1) dt is obviously positive.

For the special case when α = 1, Kibble's bivariate gamma becomes the well-known Moran–Downton bivariate exponential distribution.


Downton [37] presented a construction from a reliability perspective. He assumed that the two components C1 and C2 receive shocks occurring in independent Poisson streams at rates λ1 and λ2 respectively, and that the numbers N1 and N2 of shocks needed to cause failure of C1 and C2 respectively have a bivariate geometric distribution.

For applications of Kibble's bivariate gamma, see, for example, Hutchinson and Lai [1].

Example 8. The bivariate exponential distribution of Sarmanov. Sarmanov [38] introduced a family of bivariate densities of the form

f(x, y) = fX(x)fY(y)[1 + ωφ1(x)φ2(y)]   (7.18)

where

∫_{−∞}^{∞} φ1(x)fX(x) dx = 0   ∫_{−∞}^{∞} φ2(y)fY(y) dy = 0

and ω satisfies the condition that 1 + ωφ1(x)φ2(y) ≥ 0 for all x and y.

Lee [39] discussed the properties of the Sarmanov family; in particular, she derived the bivariate exponential distribution given below:

f(x, y) = λ1λ2 e^{−λ1x−λ2y}[1 + ω(e⁻ˣ − λ1/(1 + λ1))(e⁻ʸ − λ2/(1 + λ2))]   (7.19)

where

−(1 + λ1)(1 + λ2)/(λ1λ2) ≤ ω ≤ (1 + λ1)(1 + λ2)/max(λ1, λ2)

(Here, φ1(x) = e⁻ˣ − [λ1/(1 + λ1)] and φ2(y) = e⁻ʸ − [λ2/(1 + λ2)].)

It is easy to see that, for ω > 0:

F(x, y) = (1 − e^{−λx})(1 − e^{−λy}) + ω[λ/(1 + λ)]² [e^{−λx} − e^{−(λ+1)x}][e^{−λy} − e^{−(λ+1)y}] ≥ FX(x)FY(y)

(here λ1 = λ2 = λ)

whence X and Y are shown to be PQD if

0 ≤ ω ≤ (1 + λ1)(1 + λ2)/max(λ1, λ2)

Example 9. The bivariate normal distribution has a density function given by

f(x, y) = (2π√(1 − ρ²))⁻¹ exp[−(x² − 2ρxy + y²)/(2(1 − ρ²))]   −1 < ρ < 1

X and Y are PQD for 0 ≤ ρ < 1 and NQD for −1 < ρ ≤ 0. This result follows straightaway from the following lemma:

Lemma 1. Let (X1, Y1) and (X2, Y2) be two standard bivariate normal distributions, with correlation coefficients ρ1 and ρ2 respectively. If ρ1 ≥ ρ2, then

Pr(X1 > x, Y1 > y) ≥ Pr(X2 > x, Y2 > y)

The above is known as the Slepian inequality [40, p.805].

By letting ρ2 = 0 (thus ρ1 ≥ 0), we establish that X and Y are PQD. On the other hand, letting ρ1 = 0 (thus ρ2 ≤ 0), X and Y are then NQD.

7.4.3 Positive Quadrant Dependent Bivariate Uniform Distributions

A copula C(u, v) is simply the uniform representation of a bivariate distribution. Hence a copula is just a bivariate uniform distribution. For a formal definition of a copula, see, for example, Nelsen [41]. By a simple marginal transformation, a copula becomes a bivariate distribution with specified marginals. There are many examples of copulas that are PQD, such as that given in Example 10.

Example 10. The Ali–Mikhail–Haq family

C(u, v) = uv/[1 − θ(1 − u)(1 − v)]   θ ∈ [0, 1]

It is clear that the copula is PQD. In fact, Bairamov et al. [42] have shown that it is a copula that corresponds to the Durling–Pareto distribution given in Example 4.

Nelsen [41], p.152, has pointed out that if X and Y are PQD, then their copula C is also PQD. Nelsen's book provides a comprehensive treatment of copulas, and a number of examples of PQD copulas can be found therein.

7.4.3.1 Generalized Farlie–Gumbel–Morgenstern Family of Copulas

The so-called bivariate FGM distribution given in Example 1 was originally introduced by Morgenstern [43] for Cauchy marginals. Gumbel [44] investigated the same structure for exponential marginals.

It is easy to show that the FGM copula is given by

Cα(u, v) = uv[1 + α(1 − u)(1 − v)]   0 ≤ u, v ≤ 1, −1 ≤ α ≤ 1   (7.20)

It is clear that the FGM copula is PQD for 0 ≤ α ≤ 1.

It was Farlie [23] who extended the construction by Morgenstern and Gumbel to

Cα(u, v) = uv[1 + αA(u)B(v)]   0 ≤ u, v ≤ 1   (7.21)

where A(u) → 0 and B(v) → 0 as u, v → 1, and A(u) and B(v) satisfy certain regularity conditions ensuring that C is a copula. Here, the admissible range of α depends on the functions A and B.

If A(u) = 1 − u and B(v) = 1 − v, we then have the classical one-parameter FGM family of Equation 7.20.

Huang and Kotz [45] consider the two types:

(i) A(u) = (1 − u)^p, B(v) = (1 − v)^p, p > 1,
    −1 ≤ α ≤ [(p + 1)/(p − 1)]^{p−1}

(ii) A(u) = 1 − u^p, B(v) = 1 − v^p, p > 0,
    −(max{1, p})⁻² ≤ α ≤ p⁻¹

We note that copula (ii) was investigated earlier by Woodworth [46].

Bairamov and Kotz [47] introduce further generalizations such that:

(iii) A(u) = (1 − u)^p, B(v) = (1 − v)^q, p > 1, q > 1 (p ≠ q),

    −min{1, [(1 + p)/(p − 1)]^{p−1}[(1 + q)/(q − 1)]^{q−1}}
        ≤ α ≤ min{[(1 + p)/(p − 1)]^{p−1}, [(1 + q)/(q − 1)]^{q−1}}

(iv) A(u) = (1 − uⁿ)^p, B(v) = (1 − vⁿ)^q, p ≥ 1, n ≥ 1,

    −min{(1/n²)[(1 + np)/(n(p − 1))]^{2(p−1)}, 1}
        ≤ α ≤ (1/n)[(1 + np)/(n(p − 1))]^{p−1}

Recently, Bairamov et al. [48] considered a more general model:

(v) A(u) = (1 − u^{p1})^{q1}, B(v) = (1 − v^{p2})^{q2}, p1, p2 ≥ 1, q1, q2 > 1,

    −min{1, (1/(p1p2))[1 + p1q1/(p1(q1 − 1))]^{q1−1}[1 + p2q2/(p2(q2 − 1))]^{q2−1}}
        ≤ α ≤ min{(1/p1)[1 + p1q1/(p1(q1 − 1))]^{q1−1}, (1/p2)[1 + p2q2/(p2(q2 − 1))]^{q2−1}}

Motivated by a desire to construct PQD distributions, Lai and Xie [16] derived a new family of FGM copulas that possess the PQD property with:

(vi) A(u) = u^{b−1}(1 − u)^a, B(v) = v^{b−1}(1 − v)^a, a, b ≥ 1, 0 ≤ α ≤ 1

so that

Cα(u, v) = uv + αu^b v^b (1 − u)^a (1 − v)^a   a, b ≥ 1, 0 ≤ α ≤ 1   (7.22)


Table 7.1. Range of dependence parameter α for some positive quadrant dependent FGM copulas

Copula type   α range for which copula is PQD
(i)     0 ≤ α ≤ [(p + 1)/(p − 1)]^{p−1}
(ii)    0 ≤ α ≤ p⁻¹
(iii)   0 ≤ α ≤ min{[(1 + p)/(p − 1)]^{p−1}, [(1 + q)/(q − 1)]^{q−1}}, p > 1, q > 1
(iv)    0 ≤ α ≤ (1/n)[(1 + np)/(n(p − 1))]^{p−1}
(v)     0 ≤ α ≤ min{(1/p1)[1 + p1q1/(p1(q1 − 1))]^{q1−1}, (1/p2)[1 + p2q2/(p2(q2 − 1))]^{q2−1}}
(vi)    0 ≤ α ≤ 1/[B⁺(a, b)B⁻(a, b)], where B⁺, B⁻ are some functions of a and b

Bairamov and Kotz [49] have shown that the range of α in Equation 7.22 can be extended, and they also provide the ranges of α for which the copulas (i)–(v) are PQD. These are summarized in Table 7.1.

In concluding this section, we note that Joe [2], p.19, considered the concepts of PQD and the concordance ordering (more PQD) that are discussed in Section 7.6 as being basic to the parametric families of copulas in determining whether a multivariate parameter is a dependence parameter.

7.5 Some Related Issues on Positive Dependence

7.5.1 Examples of Bivariate Positive Dependence Stronger than the Positive Quadrant Dependent Condition

So far, we have presented only families of bivariate distributions that are PQD, which is the weaker notion of positive dependence discussed in this chapter. We now introduce some bivariate distributions that also satisfy more stringent conditions.

(i) The bivariate normal density is TP2 if and only if the correlation coefficient satisfies 0 ≤ ρ < 1 (e.g. see Barlow and Proschan [3], p.149). Abdel-Hameed and Sampson [50] have shown that the bivariate density of the absolute normal distribution is TP2.

(ii) X and Y of Marshall and Olkin's bivariate distribution are associated owing to having a variable in common in the construction procedure. In fact, Barlow and Proschan [3], p.132, showed that Y is stochastically increasing in X (SI), which, in turn, implies association.

Rödel [51] showed that, for an FGM distribution,X and Y are SI (alias positively regressiondependent) if α > 0. The following is a direct andeasy proof for the case with exponential marginalssuch that α > 0:

P(Y ≤ y | X = x) = [1 − α(2 e⁻ˣ − 1)](1 − e⁻ʸ) + α(2 e⁻ˣ − 1)(1 − e⁻²ʸ)

                 = (1 − e⁻ʸ) + α(2 e⁻ˣ − 1)(e⁻ʸ − e⁻²ʸ)

Thus

P(Y > y | X = x) = e⁻ʸ − α(2 e⁻ˣ − 1)(e⁻ʸ − e⁻²ʸ)   (7.23)

which is clearly increasing in x, from which we conclude that X and Y are positively regression dependent if α > 0.


(iv) Rödel [51] showed that Kibble's bivariate gamma distribution given in Section 7.4.2 is also SI (alias PRD), which is a stronger concept of positive dependence than PQD.

(v) Sarmanov bivariate exponential. The conditional distribution that corresponds to Equation 7.19 is

P(Y ≤ y | X = x) = FY(y) + ωφ1(x) ∫_{−∞}^{y} φ2(z)fY(z) dz

φi(x) = e⁻ˣ − λi/(1 + λi)   i = 1, 2

It follows that

P(Y > y | X = x) = e^{−λ2y} − ω(e⁻ˣ − λ1/(1 + λ1)) ∫_{−∞}^{y} φ2(z)fY(z) dz

is increasing in x because

∫_{−∞}^{y} φ2(z)fY(z) dz ≥ 0

and thus Y is stochastically increasing in x if

0 ≤ ω ≤ (1 + λ1)(1 + λ2)/max(λ1, λ2)

Further, it follows from theorem 3 of Lee [39] that (X, Y) is TP2, since ωφ1′(x)φ2′(y) ≥ 0 for ω ≥ 0.

(vi) Durling–Pareto distribution: Lai et al. [52] showed that X and Y are right tail increasing if k ≤ 1 and right tail decreasing if k ≥ 1. From the chains of relationships in Section 7.2.6, it is known that right tail increasing implies association. Thus X and Y are associated if k ≤ 1.

(vii) The bivariate exponential of Example 2:

F(x, y) = 1 − e⁻ˣ − e⁻ʸ + (eˣ + eʸ − 1)⁻¹

It can easily be shown that

P(Y ≤ y | X = x) = 1 − e²ˣ/(eˣ + eʸ − 1)²

and hence

P(Y > y | X = x) = e²ˣ/(eˣ + eʸ − 1)²

which is increasing in x, so Y is SI in X.

7.5.2 Examples of Negative Quadrant Dependence

Although the main theme of this chapter is positive dependence, it is common knowledge that negative dependence does exist in various reliability situations. Several bivariate distributions discussed in Section 7.4, namely the bivariate normal, FGM family, Durling–Pareto distribution, and bivariate exponential distribution of Sarmanov, are NQD when the ranges of the dependence parameter are appropriately specified. The two variables of the following example can only be negatively dependent.

Example 11. Gumbel's bivariate exponential distribution

F(x, y) = 1 − e⁻ˣ − e⁻ʸ + e^{−(x+y+θxy)}   0 ≤ θ ≤ 1

F(x, y) − FX(x)FY(y) = e^{−(x+y+θxy)} − e^{−(x+y)} ≤ 0   0 ≤ θ ≤ 1

showing that F is NQD. It is well known, e.g. see Kotz et al. [53], p.351, that −0.40365 ≤ corr(X, Y) ≤ 0.

Lehmann [4] presented the following situations for which negative quadrant dependence occurs:

• Consider the rankings of n objects by m persons. Let X and Y denote the rank sums for the ith and jth objects respectively. Then X and Y are NQD.
• Consider a sequence of n multinomial trials with s possible outcomes. Let X and Y denote the numbers of trials resulting in outcomes i and j respectively. Then X and Y are NQD.

7.6 Positive Dependence Orderings

Consider two bivariate distributions having the same pair of marginals FX and FY, and assume that they are both positively dependent. Naturally, we would like to know which of the two bivariate distributions is more positively dependent. In other words, we wish to order the two given bivariate distributions by the extent of the positive dependence between the two marginal variables, with higher in the ordering meaning more positively dependent. In this section, the concept of positive dependence ordering is introduced. The following definition is found in Kimeldorf and Sampson [54].

Definition 2. A relation ≼ on the family of all bivariate distributions is a positive dependence ordering (PDO) if it satisfies the following ten conditions:

(P0) F ≼ G ⇒ F(x, ∞) = G(x, ∞) and F(∞, y) = G(∞, y);
(P1) F ≼ G ⇒ F(x, y) ≤ G(x, y) for all x, y;
(P2) F ≼ G and G ≼ H ⇒ F ≼ H;
(P3) F ≼ F;
(P4) F ≼ G and G ≼ F ⇒ F = G;
(P5) F⁻ ≼ F ≼ F⁺, where F⁺(x, y) = min[F(x, ∞), F(∞, y)] and F⁻(x, y) = max[F(x, ∞) + F(∞, y) − 1, 0];
(P6) (X, Y) ≼ (U, V) ⇒ (a(X), Y) ≼ (a(U), V), where (X, Y) ≼ (U, V) means that the relation ≼ holds between the corresponding bivariate distributions;
(P7) (X, Y) ≼ (U, V) ⇒ (−U, V) ≼ (−X, Y);
(P8) (X, Y) ≼ (U, V) ⇒ (Y, X) ≼ (V, U);
(P9) Fn ≼ Gn, Fn → F in distribution, Gn → G in distribution ⇒ F ≼ G, where Fn, F, Gn, and G all have the same pair of marginals.

Tchen [55] defined a bivariate distribution G to be more PQD than a bivariate distribution F having the same pair of marginals if G(x, y) ≥ F(x, y) for all (x, y) ∈ R². It was shown that this PQD partial ordering is a PDO.

Note that "more PQD" is also known as "more concordant" in the dependence concepts literature.

Example 12. Generalized FGM copula. Lai and Xie [16] constructed a new family of PQD bivariate distributions that is a generalization of the FGM copula:

$$C_\theta(u, v) = uv + \theta u^b v^b (1 - u)^a (1 - v)^a \qquad a, b \ge 1,\; 0 \le \theta \le 1 \qquad (7.24)$$

Let the dependence ordering be defined through the ordering of θ. It is clear from Equation 7.24 that, when θ < θ′, then Cθ(u, v) ≼ Cθ′(u, v).
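This ordering is simple to verify numerically. The sketch below is our own illustration (the exponents a = 2 and b = 1.5 are arbitrary admissible choices): the copula with the larger θ dominates pointwise on the unit square, i.e. it is more PQD (more concordant).

```python
import numpy as np

def gen_fgm_copula(u, v, theta, a=2.0, b=1.5):
    """Generalized FGM copula of Equation 7.24, valid for a, b >= 1
    and 0 <= theta <= 1."""
    return u * v + theta * u**b * v**b * (1 - u)**a * (1 - v)**a

u, v = np.meshgrid(np.linspace(0, 1, 201), np.linspace(0, 1, 201))
c_small = gen_fgm_copula(u, v, theta=0.2)
c_large = gen_fgm_copula(u, v, theta=0.8)
# theta = 0.8 dominates theta = 0.2 pointwise, so it is higher in the PQD ordering.
assert np.all(c_small <= c_large)
```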

Example 13. Bivariate normal with positive correlation coefficient ρ. The Slepian inequality in Section 7.4.2 says

$$\Pr(X_1 > x, Y_1 > y) \ge \Pr(X_2 > x, Y_2 > y)$$

if ρ₁ ≥ ρ₂. Thus a more PQD ordering can be defined in terms of the positive correlation coefficient ρ.

Example 14. Cθ(u, v) = uv/[1 − θ(1 − u)(1 − v)], θ ∈ [0, 1] (the Ali–Mikhail–Haq family). It is easy to see that Cθ ≽ Cθ′ if θ > θ′, i.e. Cθ is more PQD than Cθ′.

There are several other types of PDO in the literature, and we recommend that the reader consult the text by Shaked and Shanthikumar [56], who gave a comprehensive treatment of stochastic orderings.

7.7 Concluding Remarks

Concepts of stochastic dependence are widely applicable in statistics. Given that some of these concepts have arisen from reliability contexts, it seems rather unfortunate that not many reliability practitioners have caught onto this important subject. This observation is borne out by the fact that the assumption of independence still prevails in many reliability analyses. Among the dependence concepts, correlation is the most widely used in applications. Association is advocated and studied in Barlow and Proschan [3]. On the other hand, PQD is a weaker condition and also seems to be easier to verify. On reflection, this state of affairs may be due in part to the fact that many of the proposed dependence models are often not readily applicable. One would hope that, in the near future, more applied probabilists and reliability engineers will get together to forge a partnership to bridge the gap between the theory and applications of stochastic dependence.

References

[1] Hutchinson TP, Lai CD. Continuous bivariate distributions, emphasising applications. Adelaide, Australia: Rumsby Scientific Publishing; 1990.
[2] Joe H. Multivariate models and dependence concepts. London: Chapman and Hall; 1997.
[3] Barlow RE, Proschan F. Statistical theory of reliability and life testing: probability models. Silver Spring (MD): To Begin With; 1981.
[4] Lehmann EL. Some concepts of dependence. Ann Math Stat 1966;37:1137–53.
[5] Esary JD, Proschan F, Walkup DW. Association of random variables, with applications. Ann Math Stat 1967;38:1466–74.
[6] Esary JD, Proschan F. Relationships among some concepts of bivariate dependence. Ann Math Stat 1972;43:651–5.
[7] Harris R. A multivariate definition for increasing hazard rate distribution. Ann Math Stat 1970;41:713–7.
[8] Brindley EC, Thompson WA. Dependence and aging aspects of multivariate survival. J Am Stat Assoc 1972;67:822–30.
[9] Yanagimoto T. Families of positive random variables. Ann Inst Stat Math 1972;26:559–73.
[10] Shaked M. A concept of positive dependence for exchangeable random variables. Ann Stat 1977;5:505–15.
[11] Shaked M. Some concepts of positive dependence for bivariate interchangeable distributions. Ann Inst Stat Math 1979;31:67–84.
[12] Shaked M. A general theory of some positive dependence notions. J Multivar Anal 1982;12:199–218.
[13] Block HW, Ting ML. Some concepts of multivariate dependence. Commun Stat A: Theor Methods 1981;10:749–62.
[14] Jogdeo K. Dependence concepts and probability inequalities. In: Patil GP, Kotz S, Ord JK, editors. A modern course on distributions in scientific work – models and structures, vol. 1. Dordrecht: Reidel; 1975. p. 271–9.
[15] Jogdeo K. Dependence, concepts of. In: Encyclopedia of statistical sciences, vol. 2. New York: Wiley; 1982. p. 324–34.
[16] Lai CD, Xie M. A new family of positive dependence bivariate distributions. Stat Probab Lett 2000;46:359–64.
[17] Robbins H. A remark on the joint distribution of cumulative sums. Ann Math Stat 1954;25:614–6.
[18] Kimball AW. On dependent tests of significance in analysis of variance. Ann Math Stat 1951;22:600–2.
[19] Douglas R, Fienberg SE, Lee MLT, Sampson AR, Whitaker LR. Positive dependence concepts for ordinal contingency tables. In: IMS lecture notes monograph series: topics in statistical dependence, vol. 16. Hayward (CA): Institute of Mathematical Statistics; 1990. p. 189–202.
[20] Xie M, Lai CD. On the increase of the expected lifetime by parallel redundancy. Asia–Pac J Oper Res 1996;13:171–9.
[21] Mi J. Bolstering components for maximizing system lifetime. Nav Res Logist 1998;45:497–509.
[22] Kotz S, Lai CD, Xie M. The expected lifetime when adding redundancy in systems with dependent components. IIE Trans 2003; in press.
[23] Farlie DJG. The performance of some correlation coefficients for a general bivariate distribution. Biometrika 1960;47:307–23.
[24] Johnson NL, Kotz S. Distributions in statistics: continuous multivariate distributions. New York: Wiley; 1972.
[25] Mukerjee SP, Sasmal BC. Life distributions of coherent dependent systems. Calcutta Stat Assoc Bull 1977;26:39–52.
[26] Lingappaiah GS. Bivariate gamma distribution as a life test model. Aplik Mat 1983;29:182–8.
[27] Philips MJ. A preventive maintenance plan for a system subject to revealed and unrevealed faults. Reliab Eng 1981;2:221–31.
[28] Kotz S, Johnson NL. Some replacement-times distributions in two-component systems. Reliab Eng 1984;7:151–7.
[29] Mardia KV. Families of bivariate distributions. London: Griffin; 1970.
[30] Lindley DV, Singpurwalla ND. Multivariate distributions for the life lengths of components of a system sharing a common environment. J Appl Probab 1986;23:418–31.
[31] Nayak TK. Multivariate Lomax distribution: properties and usefulness in reliability theory. J Appl Probab 1987;24:170–7.
[32] Sankaran PG, Nair NU. A bivariate Pareto model and its applications to reliability. Nav Res Logist 1993;40:1013–20.
[33] Marshall AW, Olkin I. A multivariate exponential distribution. J Am Stat Assoc 1967;62:30–44.
[34] Block HW, Basu AP. A continuous bivariate exponential distribution. J Am Stat Assoc 1976;64:1031–7.
[35] Freund J. A bivariate extension of the exponential distribution. J Am Stat Assoc 1961;56:971–7.
[36] Lai CD, Moore T. Probability integrals of a bivariate gamma distribution. J Stat Comput Simul 1984;19:205–13.
[37] Downton F. Bivariate exponential distributions in reliability theory. J R Stat Soc Ser B 1970;32:408–17.
[38] Sarmanov OV. Generalized normal correlation and two-dimensional Frechet classes. Dokl Sov Math 1966;168:596–9.
[39] Lee MLT. Properties and applications of the Sarmanov family of bivariate distributions. Commun Stat A: Theor Methods 1996;25:1207–22.
[40] Gupta SS. Probability integrals of multivariate normal and multivariate t. Ann Math Stat 1963;34:792–828.
[41] Nelsen RB. An introduction to copulas. Lecture notes in statistics, vol. 139. New York: Springer-Verlag; 1999.
[42] Bairamov I, Lai CD, Xie M. Bivariate Lomax distribution and generalized Ali–Mikhail–Haq distribution. Unpublished results.
[43] Morgenstern D. Einfache Beispiele zweidimensionaler Verteilungen. Mitteilungsbl Math Stat 1956;8:234–5.
[44] Gumbel EJ. Bivariate exponential distributions. J Am Stat Assoc 1960;55:698–707.
[45] Huang JS, Kotz S. Modifications of the Farlie–Gumbel–Morgenstern distributions. A tough hill to climb. Metrika 1999;49:135–45.
[46] Woodworth GG. On the asymptotic theory of tests of independence based on bivariate layer ranks. Technical Report No. 75, Department of Statistics, University of Minnesota. See also Abstr Ann Math Stat 1966;36:1609.
[47] Bairamov I, Kotz S. Dependence structure and symmetry of Huang–Kotz FGM distributions and their extensions. Metrika 2002; in press.
[48] Bairamov I, Kotz S, Bekci M. New generalized Farlie–Gumbel–Morgenstern distributions and concomitants of order statistics. J Appl Stat 2001;28:521–36.
[49] Bairamov I, Kotz S. On a new family of positive quadrant dependent bivariate distributions. GWU/IRRA/TR No. 2000/05. The George Washington University, 2001.
[50] Abdel-Hameed M, Sampson AR. Positive dependence of the bivariate and trivariate absolute normal, t, χ² and F distributions. Ann Stat 1978;6:1360–8.
[51] Rödel E. A necessary condition for positive dependence. Statistics 1987;18:351–9.
[52] Lai CD, Xie M, Bairamov I. Dependence and ageing properties of bivariate Lomax distribution. In: Hayakawa Y, Irony T, Xie M, editors. A volume in honor of Professor R. E. Barlow on his 70th birthday. Singapore: WSP; 2001. p. 243–55.
[53] Kotz S, Balakrishnan N, Johnson NL. Continuous multivariate distributions, vol. 1: models and applications. New York: Wiley; 2000.
[54] Kimeldorf G, Sampson AR. Positive dependence orderings. Ann Inst Stat Math 1987;39:113–28.
[55] Tchen A. Inequalities for distributions with given marginals. Ann Probab 1980;8:814–27.
[56] Shaked M, Shanthikumar JG, editors. Stochastic orders and their applications. New York: Academic Press; 1994.


Chapter 8
Statistical Reliability Change-point Estimation Models

Ming Zhao

8.1 Introduction
8.2 Assumptions in Reliability Change-point Models
8.3 Some Specific Change-point Models
8.3.1 Jelinski–Moranda De-eutrophication Model with a Change Point
8.3.1.1 Model Review
8.3.1.2 Model with One Change Point
8.3.2 Weibull Change-point Model
8.3.3 Littlewood Model with One Change Point
8.4 Maximum Likelihood Estimation
8.5 Application
8.6 Summary

8.1 Introduction

The classical change-point problem arises from the observation of a sequence of random variables X1, X2, . . . , Xn, such that X1, X2, . . . , Xτ have a common distribution F while Xτ+1, Xτ+2, . . . , Xn have the distribution G with F ≠ G. The index τ, called the change point, is usually unknown and has to be estimated from the data.

The change-point problem has been widely discussed in the literature. Hinkley [1] used the maximum likelihood (ML) method to estimate the change point τ in situations where F and G can be arbitrary known distributions and belong to the same parametric family. The non-parametric estimation of the change point has been discussed by Carlstein [2]. Joseph and Wolfson [3] generalized the change-point problem by studying multipath change points, where several independent sequences are considered simultaneously and each sequence has one change point. The bootstrap and empirical Bayes methods have also been suggested for estimating the change points. If the sequence X1, X2, . . . , Xn is the observation of the arrival times at which n events occur in a Poisson process, which is widely applied in reliability analysis, this is a Poisson process model with one change point. Raftery and Akman [4] proposed using Bayesian analysis for the Poisson process model with a change point. The parametric estimation in the Poisson process change-point model has been given by Leonard [5].

In reliability analysis, change-point models can be very useful. In software reliability modeling, the initial number of faults contained in a program is always unknown and its estimation is a great concern. One has to execute the program in a specific environment and improve its reliability by detecting and correcting the faults. Many software reliability models assume that, during the fault detection process, each failure caused by a fault occurs independently and randomly in time according to the same distribution, e.g. see Musa et al. [6]. The failure distribution can be affected by many different factors, such as the running environment, the testing strategy, and the resources. Generally, the running environment may not be homogeneous and can change with the human learning process. Hence, change-point models are of interest in modeling the fault detection process. A change point can occur when the testing strategy and the resource allocation are changed. Also, with increasing knowledge of the program, the testing facilities and other random factors can be the causes of change points. Zhao [7] has discussed different software reliability change-point models and the method of parametric estimation. The non-homogeneous Poisson process (NHPP) model with change points has been studied by Chang [8], and the parameters in the NHPP change-point model are estimated by the weighted least-squares method.

Change-point problems can also arise in hardware reliability and survival analysis, for example, when reliability tests are conducted in a random testing environment. Nevertheless, the problem may not be as complex as in the software reliability case. This is because the sample size in a lifetime test, which is comparable to the initial number of faults in software that needs to be estimated, is usually known in hardware reliability analysis. Unlike the common change-point models, the observed data in hardware reliability or survival analysis are often grouped and dependent. For example, if we are investigating a new treatment for some disease, the observed data may be censored because some patients may drop out of the trials and new patients may come into the trials. There is thus a need to develop new models for the reliability change-point problem.

In this chapter, we address the change-point problems in reliability analysis. A change-point model, which is applicable to both software and hardware reliability systems, is considered in a general form. Nevertheless, more focus is given to software reliability change-point models, since the hardware reliability change-point models can be analyzed as a special case where the sample size is assumed to be known. Where necessary, however, some discussion is given of the differences between hardware and software reliability models. The ML method is considered for estimating the change point and the model parameters. A numerical example is also provided to show the application.

8.2 Assumptions in Reliability Change-point Models

Let F and G be two different lifetime distributions with density functions f(t) and g(t), and let X1, X2, . . . , Xτ, Xτ+1, Xτ+2, . . . , Xn be the inter-failure times of the sequential failures in a lifetime test. Before we consider a number of specific reliability change-point models, we need to make some assumptions.

Assumption 1. There are a finite number of items N under reliability test; the parameter N may be unknown.

In hardware reliability analysis, the parameter N, which is usually assumed to be known, is the total number of units under test. However, where software reliability analysis is concerned, the parameter N, the initial number of faults in the software, is assumed to be unknown.

Assumption 2. At the beginning, all of the items have the same lifetime distribution F. After τ failures are observed, the remaining (N − τ) items have the distribution G. The change point τ is assumed unknown.

In software reliability, this assumption means that all faults in the software have the same detection rate, but that the rate changes once a number of faults have been detected. The most common reason is that the test team has learned a lot from the testing process, so that the detection process becomes more efficient. On the other hand, it could be that fault detection becomes increasingly difficult due to the complexity of the problem. In both situations, a change-point model is an alternative to apply.

Assumption 3. The sequence {X1, X2, . . . , Xτ} is statistically independent of the sequence {Xτ+1, Xτ+2, . . . , Xn}.

This assumption may not be realistic in some cases. We use it only for the sake of model simplicity. However, it is easy to modify this assumption. Note that we do not assume independence between the variables within the sequence {X1, X2, . . . , Xτ} or {Xτ+1, Xτ+2, . . . , Xn}, because the inter-failure times in lifetime testing are usually dependent.

8.3 Some Specific Change-point Models

To consider the reliability change-point model, we further assume that the lifetime test is performed according to a Type-II censoring plan, in which the number of failures n is determined in advance.

Denote by T1, T2, . . . , Tn the arrival times of the sequential failures in the lifetime test. Then we have

$$T_1 = X_1,\quad T_2 = X_1 + X_2,\quad \ldots,\quad T_n = X_1 + X_2 + \cdots + X_n \qquad (8.1)$$

According to Assumptions 1–3, the failure times T1, T2, . . . , Tτ are the first τ order statistics of a sample of size N from the parent distribution F, and Tτ+1, Tτ+2, . . . , Tn are the first (n − τ) order statistics of a sample of size (N − τ) from the parent distribution G.

8.3.1 Jelinski–Moranda De-eutrophication Model with a Change Point

8.3.1.1 Model Review

One of the earliest software reliability models is the de-eutrophication model developed by Jelinski and Moranda [9]. It was derived based on the following assumptions.

Assumption 4. There are N initial faults in the program.

Assumption 5. A detected fault is removed instantaneously and no new fault is introduced.

Assumption 6. Each failure caused by a fault occurs independently and randomly in time according to an exponential distribution.

Based on the model assumptions, we can determine that the inter-failure times Xi = Ti − Ti−1, i = 1, 2, . . . , n, are independent exponentially distributed random variables with failure rate λ(N − i + 1), where λ is the initial fault detection rate. One can see that each fault is discovered with a failure rate reduced by the proportionality constant λ. This means that the impact of each fault removal is the same and the faults are of equal size; see [10, 11] for details.

The Jelinski–Moranda de-eutrophication model is the simplest and most cited software reliability model. It is also the most criticized model, because Assumptions 5 and 6 imply that all faults in a program have the same size and each removal of a detected fault reduces the failure intensity by the same amount. However, the model remains central to the topic of software reliability. Many other models are derived based on assumptions similar to those of the de-eutrophication model.

8.3.1.2 Model with One Change Point

We assume that F and G are exponential distributions with failure rate parameters λ1 and λ2 respectively. From Assumptions 1–3, it is easy to show that the inter-failure times X1, X2, . . . , Xn are independently exponentially distributed. Specifically, Xi = Ti − Ti−1, i = 1, 2, . . . , τ, are exponentially distributed with parameter λ1(N − i + 1), and Xj = Tj − Tj−1, j = τ + 1, τ + 2, . . . , n, are exponentially distributed with parameter λ2(N − j + 1).
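This structure makes the model easy to simulate. The following sketch is our own illustration (the parameter values are placeholders, not estimates from any data set):

```python
import numpy as np

def simulate_jm_change_point(N, tau, lam1, lam2, n, seed=None):
    """Simulate inter-failure times of the Jelinski-Moranda model with one
    change point: X_i ~ Exp(lam1 * (N - i + 1)) for i <= tau and
    X_i ~ Exp(lam2 * (N - i + 1)) for i > tau."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, n + 1)
    rates = np.where(i <= tau, lam1, lam2) * (N - i + 1)
    return rng.exponential(1.0 / rates)

x = simulate_jm_change_point(N=150, tau=20, lam1=1e-4, lam2=3e-5, n=136)
t = np.cumsum(x)  # failure arrival times T_1, ..., T_n as in Equation 8.1
```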

Note that the original Jelinski–Moranda de-eutrophication model implies that all faults are of the same size and that the debugging process is perfect. However, an imperfect debugging assumption is more realistic; see [10] for details. Considering a change point in this model implies that there are at least two groups of faults with different sizes. If more change points are considered in the model, one can think of the debugging as being imperfect and the faults as being of different sizes. The proposed model is thus expected to be closer to reality.

8.3.2 Weibull Change-point Model

A Weibull change-point model appears when F and G are Weibull distribution functions with parameters (λ1, β1) and (λ2, β2) respectively. That is,

$$F(t) = 1 - \exp(-\lambda_1 t^{\beta_1}) \qquad (8.2)$$

$$G(t) = 1 - \exp(-\lambda_2 t^{\beta_2}) \qquad (8.3)$$

In this case, the sequence {X1, X2, . . . , Xτ} is not independent. The Weibull model without change points has been used by Wagoner [12] to describe the fault detection process. In particular, when the shape parameter β = 2, the Weibull model is the model proposed by Schick and Wolverton [13]. In application, one can use the simplified model in which the shape parameters β1 and β2 are assumed to be equal.

8.3.3 Littlewood Model with One Change Point

Assume that F and G are Pareto distribution functions given by

$$F(t) = 1 - (1 + t/\lambda_1)^{-\beta_1} \qquad (8.4)$$

$$G(t) = 1 - (1 + t/\lambda_2)^{-\beta_2} \qquad (8.5)$$

Then we have a Littlewood model with one change point. The model without a change point was given by Littlewood [14]. It is derived using Assumptions 4–6, but the failure rates associated with each fault are assumed to be identically and independently distributed random variables with gamma distributions. Under the Bayesian framework, the failure distribution is shown to be a Pareto distribution; see [14] for details. This model tries to account for the possibility that the software program could become less reliable than before.

8.4 Maximum Likelihood Estimation

In previous sections we have presented some reliability change-point models. In order to consider the parametric estimation, we assume that the distributions belong to parametric families {F(t | θ1), θ1 ∈ Θ1} and {G(t | θ2), θ2 ∈ Θ2}. Because T1, T2, . . . , Tτ are the first τ order statistics of a sample of size N from the parent distribution F, and Tτ+1, Tτ+2, . . . , Tn are the first (n − τ) order statistics of a sample of size (N − τ) from the parent distribution G, the log-likelihood, without the constant term, is given by

$$L(\tau, N, \theta_1, \theta_2 \mid T_1, T_2, \ldots, T_n) = \sum_{i=1}^{n} \log(N - i + 1) + \sum_{i=1}^{\tau} \log f(T_i \mid \theta_1) + \sum_{i=\tau+1}^{n} \log g(T_i \mid \theta_2) + (N - \tau)\log[1 - F(T_\tau \mid \theta_1)] + (N - n)\log[1 - G(T_n \mid \theta_2)] \qquad (8.6)$$

Note that where hardware reliability analysis is concerned, the sample size N is known, and the log-likelihood function is then given by

$$L(\tau, \theta_1, \theta_2 \mid T_1, T_2, \ldots, T_n) = \sum_{i=1}^{\tau} \log f(T_i \mid \theta_1) + \sum_{i=\tau+1}^{n} \log g(T_i \mid \theta_2) + (N - \tau)\log[1 - F(T_\tau \mid \theta_1)] + (N - n)\log[1 - G(T_n \mid \theta_2)]$$

In the following, we consider only the case where the sample size N is unknown, because it is more general.

The ML estimator (MLE) of the change point is the value of τ that, together with (N, θ1, θ2), maximizes the function in Equation 8.6. Unfortunately, there is no closed form for the estimate of τ. However, it can be obtained by calculating the log-likelihood for each possible value of τ, 1 ≤ τ ≤ (n − 1), and selecting as the estimate the value that maximizes the log-likelihood function.


Figure 8.1. The plot of the cumulative number of failures against their occurrence times in the SYS1 data set

In principle, the likelihood function (Equation 8.6) is valid for all the change-point models discussed in the previous section. Here, we only consider the estimation for the Jelinski–Moranda de-eutrophication model with one change point.

For this model, the MLEs of the parameters (N, λ1, λ2), given the change point τ, are determined in terms of the inter-failure times by the following equations:

$$\lambda_1 = \frac{\tau}{\sum_{i=1}^{\tau} (N - i + 1) x_i} \qquad (8.7)$$

$$\lambda_2 = \frac{n - \tau}{\sum_{i=\tau+1}^{n} (N - i + 1) x_i} \qquad (8.8)$$

$$\sum_{i=1}^{n} \frac{1}{N - i + 1} = \lambda_1 \sum_{i=1}^{\tau} x_i + \lambda_2 \sum_{i=\tau+1}^{n} x_i \qquad (8.9)$$

To find the estimates, a numerical method has to be applied. In a regular situation, one can expect that the MLEs exist. We can then see that the likelihood function depends only on τ, since the other parameters can be determined from Equations 8.7–8.9 once τ is given. Consequently, we can compare the values of the likelihood function taken at τ = 1, 2, . . . , n − 1 respectively. The value of τ that makes the likelihood maximum is the MLE of τ.
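The sketch below is our own illustration of this grid search, assuming the inter-failure-time form of the likelihood for the Jelinski–Moranda change-point model of Section 8.3.1: for each candidate τ it solves Equation 8.9 for N by root finding, substitutes Equations 8.7 and 8.8 for λ1 and λ2, and keeps the τ with the largest log-likelihood. Consistent with the existence caveat discussed below, it skips any τ for which no finite root for N can be bracketed.

```python
import numpy as np
from scipy.optimize import brentq

def profile_loglik(tau, x):
    """Profile log-likelihood of a candidate change point tau for the
    Jelinski-Moranda model with one change point (Equations 8.7-8.9).
    Returns (loglik, N, lam1, lam2), or None if the MLE of N does not
    exist for this tau."""
    n = len(x)
    i = np.arange(1, n + 1)

    def lambdas(N):
        lam1 = tau / np.sum((N - i[:tau] + 1) * x[:tau])
        lam2 = (n - tau) / np.sum((N - i[tau:] + 1) * x[tau:])
        return lam1, lam2

    def score(N):  # Equation 8.9 rearranged as score(N) = 0
        lam1, lam2 = lambdas(N)
        return (np.sum(1.0 / (N - i + 1))
                - lam1 * np.sum(x[:tau]) - lam2 * np.sum(x[tau:]))

    lo, hi = n - 1 + 1e-6, 1e8          # the weights require N > n - 1
    if score(lo) * score(hi) > 0:
        return None                      # no finite root: MLE of N does not exist
    N = brentq(score, lo, hi)
    lam1, lam2 = lambdas(N)
    # At the values (8.7) and (8.8) the exponential terms of the
    # inter-failure-time log-likelihood reduce to the constant -n.
    loglik = (np.sum(np.log(N - i + 1))
              + tau * np.log(lam1) + (n - tau) * np.log(lam2) - n)
    return loglik, N, lam1, lam2

def change_point_mle(x):
    """Grid search over tau = 1, ..., n-1, as described above."""
    x = np.asarray(x, dtype=float)
    fits = {tau: profile_loglik(tau, x) for tau in range(1, len(x))}
    fits = {tau: f for tau, f in fits.items() if f is not None}
    tau_hat = max(fits, key=lambda tau: fits[tau][0])
    return tau_hat, fits[tau_hat]
```

Applied to inter-failure times such as those of the SYS1 data set discussed in Section 8.5, a search of this kind produces the profile of the log-likelihood over τ plotted in Figure 8.2.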

Figure 8.2. The values of the log-likelihood function against the change points

Regarding the distribution of the estimate of τ, it is not easy to reach a general conclusion. Its asymptotic distribution may be obtained using the method in [1]. It is also possible to use the bootstrap approach to estimate the distribution of the estimate, following the method in [3]. We do not discuss these topics further here, but we should point out that there exists a serious problem when the parameter N is unknown. In some cases, the MLE of N does not exist; e.g. see [15–17], where the Jelinski–Moranda model and other models are considered and the conditions for the existence of the MLE of N are given. Consequently, when the MLE of N does not exist for some data set, the MLE of τ does not exist either.

8.5 Application

In this section, we apply the Jelinski–Moranda de-eutrophication model to a real data set from software reliability engineering. The data set used, called SYS1, is from Musa [18] and has frequently been analyzed in the literature. Figure 8.1 shows the plot of the cumulative failure numbers against their occurrence times in seconds. There are 136 failures in total in the data set.

The log-likelihood for the SYS1 data set is plotted in Figure 8.2 for the possible values of the change point τ. We can see that the log-likelihood takes its maximum value near 20. Actually, the MLEs of the parameters in the model with one change point are

τ̂ = 16,  N̂ = 145,  λ̂1 = 1.1086 × 10⁻⁴,  λ̂2 = 2.9925 × 10⁻⁵

When no change point is considered, the MLEs of the parameters N and λ are

N̂ = 142,  λ̂ = 3.4967 × 10⁻⁵

We can see that the estimates of the number of faults N given by the two models are slightly different.

In order to see the differences between these two models, we can examine their prediction ability, which is a very important criterion for evaluating models, since the main objective in software reliability is to predict the behavior of future failures as precisely as possible; e.g. see [19, 20] for more discussion of prediction ability.

The u-plot and prequential likelihood techniques [21, 22] are used to evaluate the prediction ability in this case. In general, the u-plot of a good predicting system should be close to the diagonal line with unit slope, and the maximum vertical distance between the u-plot and the diagonal line, which is called the Kolmogorov distance, is a measure of the closeness. On the other hand, it can be shown that model A is favored over model B if the prequential likelihood ratio PLA/PLB is consistently increasing as the predicting process continues.
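As a small illustration (our own sketch; the one-step-ahead predictive probabilities u_i are assumed to have been computed beforehand from the fitted model), the Kolmogorov distance of a u-plot can be obtained as follows:

```python
import numpy as np

def kolmogorov_distance(u):
    """Maximum vertical distance between the u-plot (the empirical CDF of
    the predictive probabilities u_1, ..., u_n) and the unit diagonal."""
    u = np.sort(np.asarray(u, dtype=float))
    n = len(u)
    upper = np.arange(1, n + 1) / n     # ECDF value just after each u_i
    lower = np.arange(0, n) / n         # ECDF value just before each u_i
    return max(np.max(upper - u), np.max(u - lower))

# A well-calibrated predictor yields roughly uniform u_i and a small distance:
print(kolmogorov_distance(np.random.default_rng(1).uniform(size=100)))
```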

Figure 8.3 shows the u-plots of the change-point model and the original model, starting after 35 failures. The Kolmogorov distance for the change-point model is equal to 0.088, whereas for the original model the distance is 0.1874.

Figure 8.4 is the plot of the log prequential likelihood ratio between the change-point model and the Jelinski–Moranda model. The increasing trend is strong, so the change-point model gives the better predictions.

The analysis above has shown that both the fit and the prediction are greatly improved by considering a change point in the Jelinski–Moranda model for this specific data set.

Figure 8.3. A comparison of the u-plots between the change-point model and the Jelinski–Moranda model

Figure 8.4. The log-prequential likelihood ratios between the change-point model and the Jelinski–Moranda model

8.6 Summary

We have presented some change-point models that can be used in reliability analysis. The main differences between the models discussed and the classical change-point model are the dependent sample and the unknown sample size, the latter being particular to software reliability analysis. When the sample size is assumed to be known, the model discussed is the one used for hardware reliability analysis. Using a classical software reliability model, we have shown how a change-point model can be developed and applied to real failure data. The improvement in modeling and prediction obtained by introducing the change point has been demonstrated for one data set. Problems remain with the ML estimation of the change point, because in some software reliability models the ML estimate of the number of faults can be infinite. In such cases, another estimation procedure has to be developed. An extension of the change-point problems in reliability analysis could be made by considering multipath change points.

References

[1] Hinkley DV. Inference about the change-point in a sequence of random variables. Biometrika 1970;57:1–16.
[2] Carlstein E. Nonparametric change-point estimation. Ann Stat 1988;16:188–97.
[3] Joseph L, Wolfson DB. Estimation in multi-path change-point problems. Commun Stat Theor Methods 1992;21:897–913.
[4] Raftery AE, Akman VE. Bayesian analysis of a Poisson process with a change-point. Biometrika 1986;73:85–9.
[5] Leonard T. Density estimation, stochastic processes and prior information (with discussion). J R Stat Soc B 1978;40:113–46.
[6] Musa JD, Iannino A, Okumoto K. Software reliability: measurement, prediction, application. New York: McGraw-Hill; 1987.
[7] Zhao M. Statistical reliability change-point estimation models. Commun Stat Theor Methods 1993;22:757–68.
[8] Chang YP. Estimation of parameters for nonhomogeneous Poisson process: software reliability with change-point model. Commun Stat Simul C 2001;30(3):623–35.
[9] Jelinski Z, Moranda PB. Software reliability research. In: Freiberger W, editor. Statistical computer performance evaluation. New York: Academic Press; 1972. p. 465–97.
[10] Pham H. Software reliability. Singapore: Springer-Verlag; 2000.
[11] Xie M. Software reliability modelling. Singapore: World Scientific; 1991.
[12] Wagoner WL. The final report on a software reliability measurement study. In: Report TOR-007(4112)-1. Aerospace Corporation, 1973.
[13] Schick GJ, Wolverton RW. Assessment of software reliability. In: Proceedings, Operations Research. Wurzburg-Wein: Physica-Verlag; 1973. p. 395–422.
[14] Littlewood B. Stochastic reliability growth: a model for fault-removal in computer-programs and hardware-design. IEEE Trans Reliab 1981;30:312–20.
[15] Joe H, Reid N. Estimating the number of faults in a system. J Am Stat Assoc 1985;80:222–6.
[16] Wright DE, Hazelhurst CE. Estimation and prediction for a simple software reliability model. Statistician 1988;37:319–25.
[17] Huang XZ. The limit conditions of some time between failure models of software reliability. Microelectron Reliab 1990;30:481–5.
[18] Musa JD. Software reliability data. New York: RADC; 1979.
[19] Musa JD. Software reliability engineered testing. UK: McGraw-Hill; 1998.
[20] Zhao M. Nonhomogeneous Poisson processes and their application in software reliability. PhD Dissertation, Linköping University, 1994.
[21] Littlewood B. Modelling growth in software reliability. In: Rook P, editor. Software reliability handbook. UK: Elsevier Applied Science; 1991. p. 111–36.
[22] Dawid AP. Statistical theory: the prequential approach. J R Stat Soc A 1984;147:278–92.


Chapter 9
Concepts and Applications of Stochastic Aging in Reliability

C. D. Lai and M. Xie

9.1 Introduction
9.2 Basic Concepts for Univariate Reliability Classes
9.2.1 Some Acronyms and the Notions of Aging
9.2.2 Definitions of Reliability Classes
9.2.3 Interrelationships
9.3 Properties of the Basic Concepts
9.3.1 Properties of Increasing and Decreasing Failure Rates
9.3.2 Property of Increasing Failure Rate on Average
9.3.3 Properties of NBU, NBUC, and NBUE
9.4 Distributions with Bathtub-shaped Failure Rates
9.5 Life Classes Characterized by the Mean Residual Lifetime
9.6 Some Further Classes of Aging
9.7 Partial Ordering of Life Distributions
9.7.1 Relative Aging
9.7.2 Applications of Partial Orderings
9.8 Bivariate Reliability Classes
9.9 Tests of Stochastic Aging
9.9.1 A General Sketch of Tests
9.9.2 Summary of Tests of Aging in Univariate Case
9.9.3 Summary of Tests of Bivariate Aging
9.10 Concluding Remarks on Aging

9.1 Introduction

The notion of aging plays an important role in reliability and maintenance theory. "No aging" means that the age of a component has no effect on the distribution of its residual lifetime. "Positive aging" describes the situation where the residual lifetime tends to decrease, in some probabilistic sense, with increasing age of the component. This situation is common in reliability engineering, as systems or components tend to become worse with time due to increased wear and tear. On the other hand, "negative aging" has the opposite effect on the residual lifetime. Although this is less common, when a system undergoes regular testing and improvement there are cases for which we have reliability growth phenomena.

Concepts of aging describe how a component or system improves or deteriorates with age. Many classes of life distributions are categorized or defined in the literature according to their aging properties. The classifications are usually based on the behavior of the conditional survival function, the failure rate function, or the mean residual life of the component concerned. Within a given class of life distributions, closure properties under certain reliability operations are often investigated. Several classes, for example the increasing failure rate average class, arise naturally as a result of damage being accumulated from random shocks. Moreover, there are a number of


chains of implications among these aging classes. The exponential distribution is the simplest life distribution, characterized by a constant failure rate, or no aging. It is also nearly always a member of every known aging class, and thus several tests have been proposed for testing whether a life distribution is exponential against the alternative that it belongs to a specified aging class. Some of the life classes have been derived more recently and, as far as we know, no test statistics have been proposed for them. On the other hand, there are several tests available for some classes.

By "life distributions" we mean those for which negative values do not occur, i.e. F(x) = 0 for x < 0. The non-negative variate X is thought of as the time to failure (or death) of an electrical or mechanical component (or organism), but other interpretations may be possible; an inter-event time, for instance, is normally necessarily positive.

In this chapter, we focus on classes of life distributions based on notions of aging; increasing failure rate (IFR) is perhaps the best known, but we shall meet many others also, and study their interrelationships whenever possible. The major parts of the chapter are devoted to: (i) the naming and interpretation of univariate reliability classes, and showing how these notions may be extended to the bivariate case; (ii) presenting the interrelationships between different classes; and (iii) summarizing various statistical tests on aging in three tables for various univariate and bivariate aging classes. An expository paper by Bergman [1] and chapter 18 of Hutchinson and Lai [2] are helpful for (i) and (ii); details on (iii) can be found in Lai [3].

From the definitions of the life distribution classes, results may be derived concerning such things as properties of systems (based upon properties of components), bounds for survival functions, moment inequalities, and algorithms for use in maintenance policies [4].

Most readers will know that statistical theory applied to distributions of lifetime lengths plays an important part in both the reliability engineering literature and the biometrics literature. We may also note a third applications area: Heckman and Singer [5] review econometric work on duration variables (e.g. lengths of periods of unemployment, or time intervals between purchases of a certain good), much of which, they say, has borrowed freely and often uncritically from reliability theory and biostatistics. We believe the topics under discussion are indeed important, and hope that this brief review will provide a clear picture of the recent developments in this area.

The chapter considers the following aspects:

• univariate reliability classes;
• interrelationships between the classes;
• bathtub-shaped life distributions;
• life classes based on mean residual lifetime;
• partial ordering of life distributions;
• bivariate reliability classes;
• tests of univariate and bivariate aging.

In Section 9.2 we begin by giving a review of some basic aging notions and their interrelationships, together with some key references. Section 9.3 discusses the properties of these basic aging classes, and Section 9.4 is devoted to the bathtub-shaped life distributions that are important in reliability applications. Section 9.5 characterizes life classes based on their mean residual lifetimes, and Section 9.6 tidies up the aging classifications with the inclusion of some further, but less well known, classes. Section 9.7 provides an introduction to partial ordering, through which the strength of the aging property of two life distributions within the same class is compared. Univariate aging concepts are extended in Section 9.8, in which bivariate aging classes are introduced. Section 9.9 considers the subject of tests of aging, with the exponential distribution being taken as the null hypothesis against various alternatives. We omit the details of these test procedures, as the subject matter may have less appeal to engineers than to statisticians. Finally, in Section 9.10, we tidy up the loose ends on stochastic aging, and the section ends with some remarks concerning future research directions that may bridge the theory and applications.

The following notation is adopted herein:

F     cumulative distribution function
f     density function, given by F′
F̄     reliability function, given by 1 − F (also known as the survival probability)
X̄     sample mean
ψ     indicator function
H0    null hypothesis
H1    alternative hypothesis
φ     non-negative function
r(t)  hazard rate (or failure rate) function
m(t)  mean residual lifetime

A list of abbreviations used in the text is given in Table 9.1.

9.2 Basic Concepts for Univariate Reliability Classes

9.2.1 Some Acronyms and the Notions of Aging

The concepts of increasing and decreasing failure rates for univariate distributions have been found very useful in reliability theory. The classes of distributions having these aging properties are designated the IFR and DFR distributions respectively, and they have been studied extensively. Other classes, such as "increasing failure rate on average" (IFRA), "new better than used" (NBU), "new better than used in expectation" (NBUE), and "decreasing mean residual life" (DMRL), have also been of much interest. For fuller accounts of these classes see, for example, Bryson and Siddiqui [6], Barlow and Proschan [7], Klefsjö [8], and Hollander and Proschan [4].

A class that sits between NBU and NBUE, known as "new better than used in convex ordering" (NBUC), has also attracted some interest recently.

The notion of "harmonically new better than used in expectation" (HNBUE) was introduced by Rolski [9] and studied by Klefsjö [10, 11]. Further generalizations along this line were given by Basu and Ebrahimi [12]. A class of distributions denoted by L has an aging property that is based on the Laplace transform, and was put forward by Klefsjö [8]. Deshpande et al. [13] used stochastic dominance comparisons to describe positive aging and suggested several new positive aging criteria based on these ideas (see their paper for details). Two further classes, NBUFR ("new better than used in failure rate") and NBUFRA ("new better than used in failure rate average"), require the absolute continuity of the distribution function, and have been discussed by Loh [14, 15], Deshpande et al. [13], Kochar and Wiens [16], and Abouammoh and Ahmed [17].

Rather than F(x), we often think of F̄(x) = Pr(X > x) = 1 − F(x), which is known as the survival distribution or reliability function. The expected value of X is denoted by µ. The function

$$\bar{F}(t \mid x) = \bar{F}(x + t)/\bar{F}(x) \qquad (9.1)$$

represents the survival function of a unit of age x, i.e. the conditional probability that a unit of age x will survive for an additional t units of time. The expected value of the remaining (residual) life at age x is m(x) = E(X − x | X > x), which may be shown to equal $\int_0^\infty \bar{F}(t \mid x)\, dt$.

When F′(x) = f(x) exists, we can define the failure rate (or hazard rate, or force of mortality) as

$$r(x) = f(x)/\bar{F}(x) \qquad (9.2)$$

for x such that F̄(x) > 0. It follows that, if r(x) exists:

$$-\log \bar{F}(t) = \int_0^t r(x)\, dx \qquad (9.3)$$

which represents the cumulative failure rate.
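All three quantities are easy to evaluate numerically for a given parametric model. The sketch below is our own illustration (using a Weibull distribution with assumed parameters) of the conditional survival function, the failure rate, and the mean residual life just defined:

```python
import numpy as np
from scipy.integrate import quad

LAM, BETA = 1.0, 2.0  # assumed Weibull parameters; beta > 1 gives an IFR model

def sf(t):
    """Survival function F_bar(t) = exp(-LAM * t**BETA)."""
    return np.exp(-LAM * t**BETA)

def cond_sf(t, x):
    """F_bar(t | x) = F_bar(x + t) / F_bar(x), Equation 9.1."""
    return sf(x + t) / sf(x)

def failure_rate(x):
    """r(x) = f(x) / F_bar(x) = LAM * BETA * x**(BETA - 1), Equation 9.2."""
    return LAM * BETA * x ** (BETA - 1)

def mrl(x):
    """Mean residual life m(x), the integral of F_bar(t | x) over t >= 0."""
    return quad(lambda t: cond_sf(t, x), 0, np.inf)[0]

# With BETA = 2 the failure rate increases, so the mean residual life decreases:
print([round(mrl(x), 4) for x in (0.0, 0.5, 1.0, 2.0)])
```

We are now ready to define several reliability classes.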

9.2.2 Definitions of Reliability Classes

Most of the reliability classes are defined in terms of the failure rate r(t), the conditional survival function F̄(t | x), or the mean residual life m(t). All three functions provide probabilistic information on the residual lifetime, and hence aging classes may be formed according to the behavior of the aging effect on a component. The ten reliability classes mentioned above are defined as follows.

Definition 1. F is said to be IFR if F̄(t | x) is decreasing in 0 ≤ x < ∞ for each t ≥ 0.


Table 9.1. List of abbreviations used in the text

Abbreviation      Aging class

BFR               Bathtub-shaped failure rate
DMRL (IMRL)       Decreasing mean residual life (increasing mean residual life)
HNBUE             Harmonically new better than used in expectation
IFR (DFR)         Increasing failure rate (decreasing failure rate)
IFRA (DFRA)       Increasing failure rate average (decreasing failure rate average)
L-class           Laplace class of distributions
NBU (NWU)         New better than used (new worse than used)
NBUE (NWUE)       New better than used in expectation (new worse than used in expectation)
NBUC              New better than used in convex ordering
NBUFR             New better than used in failure rate
NBUFRA            New better than used in failure rate average
NBWUE (NWBUE)     New better then worse than used in expectation (new worse then better than used in expectation)

When the density exists, this is equivalent to r(x) = f(x)/F̄(x) being increasing in x ≥ 0 [7].

Definition 2. F is said to be IFRA if −(1/t) log F̄(t) is increasing in t. (That is, if −log F̄ is a star-shaped function; for this notion, see Dykstra [18].)

As −log F̄, given in Equation 9.3, represents the cumulative failure rate, the name given to this class is appropriate. Block and Savits [19] showed that IFRA is equivalent to $E^{\alpha}[h(X)] \le E[h^{\alpha}(X/\alpha)]$ for all continuous non-negative increasing functions h and all α such that 0 < α < 1. This class is perhaps the most important in reliability analysis. It is the smallest class containing the exponential distributions that is closed under the formation of coherent systems. It has been shown that a device subject to shocks governed by a Poisson process, which fails when the accumulated damage exceeds a fixed threshold, has an IFRA distribution [20].

Definition 3. F is said to be DMRL if the mean remaining life function m(x) = $\int_0^\infty \bar{F}(t \mid x)\, dt$ is decreasing in x, i.e. the older the device is, the smaller is its mean residual life [6].

Definition 4. F is said to be NBU if F̄(t | x) ≤ F̄(t) for x ≥ 0, t ≥ 0.

This means that a device of any particular age has a stochastically smaller remaining lifetime than does a new device [7].

Definition 5. F is said to be NBUE if $\int_0^\infty \bar{F}(t \mid x)\, dt \le \mu$ for x ≥ 0.

This means that a device of any particular age has a smaller mean remaining lifetime than does a new device [7].

Definition 6. F is said to be HNBUE if $\int_x^\infty \bar{F}(t)\, dt \le \mu \exp(-x/\mu)$ for x ≥ 0. There is an alternative definition in terms of the mean residual life [9].

Definition 7. F is said to be an L-distribution if $\int_0^\infty e^{-st} \bar{F}(t)\, dt \ge \mu/(1 + s\mu)$ for s ≥ 0.

The expression µ/(1 + sµ) can be written as $\int_0^\infty \exp(-sx)\bar{G}(x)\, dx$, where Ḡ(x) = exp(−x/µ). This means that the inequality is one between the Laplace transforms of F̄ and of an exponential survival function with the same mean as F [8].

Definition 8. F is said to be NBUFR if r(x) > r(0) for x > 0 [13].

Definition 9. F is said to be NBUFRA if $r(0) \le (1/x) \int_0^x r(t)\, dt = -\log[\bar{F}(x)]/x$ [14].

Using Laplace transforms, Block and Savits [21] established necessary and sufficient conditions for the IFR, IFRA, DMRL, NBU, and NBUE properties to hold.

Definition 10. F is NBUC if $\int_y^\infty \bar{F}(t \mid x)\, dt \le \int_y^\infty \bar{F}(t)\, dt$ for all x, y ≥ 0 [22].


Hendi et al. [23] have shown that the new better than used in convex ordering class is closed under the formation of parallel systems with independent and identically distributed components. Li et al. [24] presented a lower bound on the reliability function for this class based upon the mean and the variance. Cao and Wang [22] have proved that

NBU ⇒ NBUC ⇒ NBUE ⇒ HNBUE

9.2.3 Interrelationships

The following chains of implications exist among the aging classes [13, 16]:

IFR ⇒ IFRA ⇒ NBU ⇒ NBUFR ⇒ NBUFRA
IFR ⇒ DMRL ⇒ NBUE
NBU ⇒ NBUC ⇒ NBUE ⇒ HNBUE ⇒ L

In Definitions 1–10, if we reverse the inequalities and interchange "increasing" and "decreasing", we obtain the classes DFR, DFRA, NWU, IMRL, NWUE, HNWUE, L, NWUFR, NWUFRA, and NWUC. They satisfy the same chain of implications.

9.3 Properties of the Basic Concepts

9.3.1 Properties of Increasing and Decreasing Failure Rates

Patel [25] gives the following properties of the IFR and DFR concepts.

1. If X1 and X2 are both IFR, so is X1 + X2; but the DFR property is not so preserved.
2. A mixture of DFR distributions is also DFR; but this is not necessarily true of IFR distributions.
3. Parallel systems of identical IFR units are IFR.
4. Series systems of (not necessarily identical) IFR units are IFR.
5. Order statistics from an IFR distribution have IFR distributions, but this is not true for spacings from an IFR distribution; order statistics from a DFR distribution do not necessarily have a DFR distribution, but spacings from a DFR distribution are DFR.
6. The probability density function (p.d.f.) of an IFR distribution need not be unimodal.
7. The p.d.f. of a DFR distribution is a decreasing function.

9.3.2 Property of Increasing Failure Rate on Average

This aging notion is fully investigated in the book by Barlow and Proschan [7]; in particular, IFRA is closed under the formation of coherent systems, as well as under convolution. The IFRA closure theorem is pivotal to many of the results given in Barlow and Proschan [7].

9.3.3 Properties of NBU, NBUC, and NBUE

The properties of NBU, NWU, NBUE, and NWUE are also well documented in the book by Barlow and Proschan [7]. Chen [26] showed that the distributions of these classes may be characterized through certain properties of the corresponding renewal functions.

Li et al. [24] gave a lower bound on the reliability of the NBUC class based on the first two moments. Among other things, they also proved that the NBUC class is preserved under the formation of parallel systems. The latter property was also proved earlier by Hendi et al. [23].

9.4 Distributions with Bathtub-shaped Failure Rates

A failure rate function falls into one of three categories: (a) monotonic failure rates, where r(t) is either increasing or decreasing; (b) bathtub failure rates, where r(t) has a bathtub or U shape; and (c) generalized bathtub failure rates, where r(t) is a polynomial, or has a roller-coaster shape or some generalization.

Figure 9.1. A bathtub failure rate r(t)

The main theme of this section is to give an overview of the class of bathtub-shaped (BFR) life distributions having non-monotonic failure rate functions. For a recent review, see Lai et al. [27].

There are several definitions of bathtub-shaped life distributions. Below is the one that gives rise to curves having definite "bathtub shapes".

Definition 11. (Mi [28]) A distribution F is a bathtub-shaped life distribution if there exist points 0 ≤ t₁ ≤ t₂ such that:

1. r(t) is strictly decreasing if 0 ≤ t ≤ t₁;
2. r(t) is a constant if t₁ ≤ t ≤ t₂;
3. r(t) is strictly increasing if t ≥ t₂.

The above definition may be rewritten as

$$r(t) = \begin{cases} r_1(t) & \text{for } t \le t_1 \\ \lambda & \text{for } t_1 \le t \le t_2 \\ r_2(t) & \text{for } t \ge t_2 \end{cases} \qquad (9.4)$$

where r₁(t) is strictly decreasing in [0, t₁] and r₂(t) is strictly increasing for t ≥ t₂. We do not know of many parametric distributions that possess this property, except for some piecewise continuous distributions. One example is the sectional Weibull in Jiang and Murthy [29], which involves three Weibull distributions.
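A concrete realization of Equation 9.4 (our own sketch under assumed parameter values, loosely in the spirit of the sectional models just mentioned) splices a strictly decreasing piece and a strictly increasing piece onto a constant middle section, matching the values at t₁ and t₂ so that the failure rate is continuous:

```python
import numpy as np

T1, T2, LAM = 1.0, 4.0, 0.5  # assumed change points and constant hazard level

def bathtub_hazard(t):
    """Piecewise failure rate of Equation 9.4: r1(t) strictly decreasing on
    [0, t1], the constant LAM on [t1, t2], r2(t) strictly increasing beyond t2."""
    t = np.asarray(t, dtype=float)
    r1 = LAM * np.sqrt(T1 / np.maximum(t, 1e-12))  # decreasing; equals LAM at T1
    r2 = LAM * (t / T2) ** 2                       # increasing; equals LAM at T2
    return np.where(t < T1, r1, np.where(t <= T2, LAM, r2))

# Survival function via the cumulative failure rate of Equation 9.3
# (a crude Riemann-sum approximation of the integral):
ts = np.linspace(0.01, 8.0, 800)
H = np.cumsum(bathtub_hazard(ts)) * (ts[1] - ts[0])
sf = np.exp(-H)
```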

A typical bathtub-shaped failure rate function is shown in Figure 9.1. Several comments are in order.

In Definition 11, Mi [28] called the points t₁ and t₂ the change points of r(t). If t₁ = t₂ = 0, then a BFR becomes an IFR; and if t₁ = t₂ → ∞, then r(t) is strictly decreasing, so becoming a DFR. In general, if t₁ = t₂, then the interval on which r(t) is a constant degenerates to a single point. In other words, the strictly monotonic failure rate distributions IFR and DFR may be treated as special cases of BFR in this definition. Park [30] also used the same definition. The reader will find more information on bathtub-shaped life distributions in Lai et al. [27].

The class of lifetime distributions having a bathtub-shaped failure rate function is very important because the lifetimes of electronic, electromechanical, and mechanical products are often modeled with this feature. In survival analysis, the lifetime of human beings exhibits this pattern. Kao [31], Lieberman [32], Glaser [33], and Lawless [34] give many examples of bathtub-shaped life distributions. For more recent examples, see Chen [35] and Lai et al. [27]. For applications of bathtub-shaped life distributions, see Lai et al. [27].

9.5 Life Classes Characterized by the Mean Residual Lifetime

Recall that the mean residual lifetime is defined as m(x) = E(X − x | X > x), which is equivalent to $\int_0^\infty \bar{F}(t \mid x)\, dt$. Recall also that F is said to be DMRL if this mean remaining life function is decreasing in x; that is, the older the device is, the smaller is its mean residual life, so that m(x) is monotonic. However, in many real-life situations the mean residual lifetime is non-monotonic, and thus several aging notions have been defined in terms of the non-monotonic behavior of m(t).

Definition 12. (Guess et al. [36]) A life distribution with a finite first moment is called an increasing then decreasing mean residual life (IDMRL) distribution if there exists a turning point τ ≥ 0 such that

$$m(s) \begin{cases} \le m(t) & \text{for } 0 \le s \le t < \tau \\ \ge m(t) & \text{for } \tau \le s \le t \end{cases} \qquad (9.5)$$

The dual class of "decreasing initially, then increasing mean residual life" (DIMRL) distributions is obtained by reversing the inequalities in Equation 9.5.

Gupta and Akman [37] showed that the non-monotonic behavior of the mean residual life is related to non-monotonic failure rates. Their results are summarized below.

Case (A): r(t) is BFR; then:

1. m(t) is decreasing if r(0) ≤ 1/µ;
2. m(t) is IDMRL if r(0) ≥ 1/µ.

Case (B): r(t) has an upside-down bathtub shape; then:

1. m(t) is increasing if r(0) ≥ 1/µ;
2. m(t) is bathtub shaped if r(0) ≤ 1/µ.

A similar result is also given by Mi [28], who stated that if the failure rate has a bathtub shape, then the associated MRL has an upside-down bathtub shape.

Mitra and Basu [38] defined another aging notion based on a non-monotonic MRL:

Definition 13. A life distribution F having support on [0, ∞) (and finite mean µ) is said to be "new worse then better than used in expectation" (NWBUE) ("new better then worse than used in expectation" (NBWUE)) if there exists a point τ ≥ 0 such that

$$m(t) \begin{cases} \ge (\le)\; m(0) & \text{for } t < \tau \\ \le (\ge)\; m(0) & \text{for } t \ge \tau \end{cases}$$

They showed that non-monotonic aging classes such as BFR and NWBUE arise from the shock model under consideration.

For applications of concepts based on the mean residual life, see Guess and Proschan [39]. See Newby [40] for some simple explanations and examples of analyses in terms of the mean residual life, hazard, and hazard rate average.

9.6 Some Further Classes of Aging

In addition to the lifetime classes defined above, there are a number of other aging classes that have been investigated over the years. Without giving details, we just present their acronyms and references in Table 9.2.

Other chains of relationships are given in Joe and Proschan [46]:

NBU ⇔ NBUP-α for all 0 < α < 1 ⇒ NBUE
IFR ⇔ DPRL-α for all 0 < α < 1
DPRL ⇒ DPRL-α ⇒ NBUP-α for any 0 < α < 1

There are other chains of relationships, but we will not elaborate on them here.

9.7 Partial Ordering of Life Distributions

A distribution F is said to be more IFR than G if the ratio of hazards

$$\frac{r_F[F^{-1}(t)]}{r_G[G^{-1}(t)]}$$

is non-decreasing in t. (This definition assumes that the failure rates exist.) Orderings for each of the other reliability classes may also be defined.

There is a chain of implications connecting the orderings of the classes that is similar, but not identical, to the chain connecting the classes themselves (Section 9.2.3) [16].

Let X and Y be two random variables with distribution functions G and H respectively. We say that X is stochastically larger than Y if G(x) ≤ H(x) for all x. In economic theory, this is known as first-order stochastic dominance. Higher orders of stochastic dominance are defined in Deshpande et al. [13].

Table 9.2. Table of further aging classes

IFR(2)     Increasing failure rate (of second order) [13, 41]
NBU(2)     New better than used (of second order) [13, 41, 42]
HNBUE(3)   Harmonic new better than used (of third order) [13]
DMRLHA     Decreasing mean residual life in harmonic average [13]
SIFR       Stochastically increasing failure rate [43]
SNBU       Stochastically new better than used [43]
NBU-t0     New better than used of age t0 [4]
BMRL-t0    Better mean residual life at t0 (i.e. the mean residual life declines during the time 0 to t0, and thereafter is no longer greater than what it was at t0) [44]
DVRL       Decreasing variance of residual life [45]
DPRL-α     Decreasing 100α percentile residual life [46]
NBUP-α     New better than used with respect to the 100α percentile [46]

9.7.1 Relative Aging

Recently, Sengupta and Deshpande [47] studied three types of relative aging of two life distributions. The first of these relative aging concepts is the partial ordering originally proposed by Kalashnikov and Rachev [48], which is defined as follows:

Definition 14. Let F and G be the distribution functions of the random variables X and Y respectively. X is said to be aging faster than Y if the random variable Z = G̃(X) has an IFR distribution, where G̃ = −log Ḡ and Ḡ(t) = 1 − G(t).

If the failure rates r_F(t) and r_G(t) both exist, with r_F(t) = f(t)/[1 − F(t)] and r_G(t) = g(t)/[1 − G(t)], then the above definition is equivalent to r_F(t)/r_G(t) being an increasing function of t.

Several well-known partial orderings are given in Table 9.3 (assuming the failure rates of both distributions exist).

Lai and Xie [51] established some results on the relative aging of two parallel structures. It is observed that the relative aging property may be used to allocate resources, and can be used for failure identification when two components (systems) have the same mean. In particular, if X ages faster than Y and they have the same mean, then Var(X) ≤ Var(Y). Several examples are given. In particular, it is shown that when two Weibull distributions have equal means, the one that ages faster has a smaller variance.

9.7.2 Applications of Partial Orderings

Most of the stochastic aging definitions involvefailure rate in some way. The partial orderingsdiscussed above have important applications inreliability.

Design engineers are well aware that a systemwhere active spare allocation is made at the com-ponent level has a lifetime stochastically largerthan the corresponding system where active spareallocation is made at the system level. Bolandand El-Neweihi [52] investigated this principle inhazard rate ordering and demonstrated that itdoes not hold in general. However, they discov-ered that for a 2-out-of-n system with independentand identical components and spares, active spareallocation at the component level is superior toactive spare allocation at the system level. Theyconjectured that such a principle holds in generalfor a k-out-of-n system. Singh and Singh [50] haveproved that for a k-out-of-n system where com-ponents and spares have independent and iden-tical life distributions, active spare allocation atthe component level is superior to active spareallocation at the system level in likelihood ratio or-dering. This is stronger than hazard rate orderingand thus establishes the conjecture of Boland andEl-Neweihi [52].

Page 206: Handbook of Reliability Engineering - pdi2.org

Concepts and Applications of Stochastic Aging in Reliability 173

Table 9.3. Partial orderings on aging

Partial ordering Condition Reference

X ≤st Y F (t)≤ G(t), for all t [7] p.110(Y > X in stochastic order)

X ≤hr Y rF (t)≥ rG(t), for all t [49](Y > X in hazard order)

X ≤c Y G−1F(t) is convex in t [7] p.106(F is convex with respect toG)

X ≤∗ Y (1/t)G−1F(t) is↑ in t [7] p.106(F is star shaped with respect toG)

X ≺c Y rF (t)/rG(t) is↑ in t [47](X is aging faster than Y )

X ≺∗ Y Z =G(X) has an IFRA [47](X is aging faster than Y on average)

X ≺lr Y f (t)/g(t) is↑ in t [50](X is smaller than Y in likelihood ratio)

Boland and El-Neweihi [52] has also estab-lished that the active spare allocation at the com-ponent level is better than the active spare allo-cation at the system level in hazard rate order-ing for a series system when the components arematching although the components may not beidentically distributed. Boland [49] gave an exam-ple to show that the hazard rate comparison iswhat people really mean when they compare theperformance of two products. For more on thehazard rate and other stochastic orders, the read-ers should consult Shaked and Shanthikumar [53].

9.8 Bivariate Reliability ClassesIn dealing with multicomponent systems, onewants to extend the whole univariate aging prop-erties to bivariate and multivariate distributions.Consider, for example, the IFR concept in thebivariate case. Let us use the notation F (x1, x2) tomean the probability that item 1 survives longerthan time x1 and item 2 survives longer than timex2. Then a possible definition of bivariate IFR isthat F (x1 + t, x2 + t)/F (x1, x2) decreases in x1and x2 for all t .

Bivariate as well as multivariate versions ofIFR, IFRA, NBU, NBUE, DMRL, HNBUE and of

Table 9.4. Bivariate versions of univariate aging concepts

Ageing Reference

Bivariate IFR [58–67]Bivariate IFRA [55, 68–73]Bivariate NBU [69, 74–76]Bivariate NBUE [56]Bivariate DMRL [54, 77, 78]Bivariate HNBUE [57, 79, 80]

their duals have been defined and their propertieshave been developed by several authors. In eachcase, however, several definitions are possible,because of different requirements imposed byvarious authors. For example, Buchanan andSingpurwalla [54] used one set of requirementswhereas Esary and Marshall [55] used a differentset. For a bibliography of available results, seeBlock and Savits [56], Basu et al. [57], andHutchinson and Lai [2]. Table 9.4 gives the relevantreferences for different bivariate aging concepts.

9.9 Tests of Stochastic AgingTests of stochastic aging play an importantpart in reliability analysis. Usually, the null

Page 207: Handbook of Reliability Engineering - pdi2.org

174 Statistical Reliability Theory

hypothesis is that the lifetime distribution followsan exponential distribution, i.e. there is no aging.Let T have the exponential distribution functionF(t)= 1− e−λt for t ≥ 0. A property of theexponential distribution that makes it especiallyimportant in reliability theory and applicationis that the remaining life of a used exponentialcomponent is independent of its initial age (i.e. ithas the memoryless property). In other words:

Pr{T > t + x | T > t} = e−λx x ≥ 0 (9.6)

(independent of t). It is easy to verify that theexponential distribution has a constant failurerate, i.e. r(t)= λ, and is the only life distributionwith this property.

The exponential distribution is characterizedby having a constant failure rate, and thus standsat the boundary between the IFR and DFRclasses. For a review of goodness-of-fit tests of theexponential distribution, see Ascher [81].

9.9.1 A General Sketch of Tests

In developing tests for different classes of lifedistributions, it is almost without exceptionthat the exponential distribution is used as astrawman to be knocked down. As the exponentialdistribution is always a boundary member of anaging class C (in the univariate case), a usualformat for testing is

H0 (null hypothesis):F is exponential versus

H1 (alternative):F ∈ C but F is not exponential.

If one is merely interested to test whetherF is exponential, then no knowledge of thealternative is required; in such a situation, a two-tailed test is appropriate. Recently, Ascher [81]discussed and compared a wide selection of testsfor exponentiality. Power computations, usingsimulations, were done for each procedure. Inshort, he found:

1. certain tests performed well for alternativedistributions with non-monotonic hazard(failure) rates, whereas others fared well formonotonic hazard rates;

2. of all the procedures compared, the score testpresented in Cox and Oakes [82] appears tobe the best if one does not have a particularalternative in mind.

In what follows, our discussion will focus ontests with alternatives being specified. In otherwords, our aim is not to test exponentiality, butrather to test whether a life distribution belongsto a specific aging class. A general outline of theseprocedures is as follows.

1. Find an appropriate measure of deviationfrom the null hypothesis of exponentiality to-ward the alternative (under H1). The measureof deviation is usually based on some charac-teristic property of the aging class we wish totest against.

2. Based on this measure, some statistics, such asthe U -statistic, are proposed. A large value ofthis statistic often (but not always) indicatesthe distribution belongs to class C.

3. Large-sample properties, such as asymptoticnormality and consistency, are proved.

4. Pitman’s asymptotic relative efficiencies areusually calculated for the following families ofdistributions (with scale factor of unity, θ ≥ 0,x ≥ 0):

(a) the Weibull distribution F1(x)=exp(−xθ );

(b) the linear failure rate distribution:

F2(x)= exp(−x − θx2/2)

(c) the Makeham distribution:

F3(x)= exp{−[x + θ(x + e−x − 1)]}(d) the Gamma distribution with density

f (x)= [xθ−1/�(θ)] e−x.The above distributions are well known exceptfor the Makeham distribution, which is widelyused in life insurance and mortality studies(see [7], p.229). All these distributions are IFR(for an appropriate restriction on θ); hence theyall belong to a wider class. Moreover, all thesereduce to the exponential distribution with anappropriate value of θ .

Page 208: Handbook of Reliability Engineering - pdi2.org

Concepts and Applications of Stochastic Aging in Reliability 175

Table 9.5. Tests on univariate aging

Test name Basic statistics Aging Keyalternatives references

TTT plot τ(Xi)=i∑

j=1

(n− j + 1)(Xj −Xj−1), j = i, . . . n IFR, IFRA, [57]

Ui = τ(Xi)/τ(Xn), T =n∑

j=1

ajUj

NBUEDMRLHNBUE [84]

Each test has a different set of ai values

Un Un = [4/n(n− 1)(n− 2)(n− 3)]∑

ψ[min(Xγ1Xγ2), (Xγ3 +Xγ4)/2] IFR [85–87]

sum over = {γi �= γj } ∩ {γ1 < γ2} ∩ {γ3 < γ4}Jb Jb = n

(n− 1)

∑ψ(Xi, bXj ), 0 < b < 1 IFRA [88]

Qn Qn =n∑

i=1

J

(i

n+ i

)/nX, J (u)= 2(1− u)[1− log(1− u)] − 1 IFRA [89]

Link2

n(n− 1)

∑i<j

Xi/Xj IFRA [90]

V ∗ V ∗ = V/X,V = n−4n∑

i=1

Ci,nXi DMRL [91, 92]

Ci,n = 4

3i3 − 4ni2 + 3n2i − 1

2n3 + 1

2n2 − 1

2i2 + 1

6i

Jn or J Jn = 2[n(n− 1)(n− 2)]−1∑

ψ(Xγ1, Xγ2 +Xγ3), 1≤ γ ≤ n, NBU [93]

γ1 �= γ2, γ1 �= γ3, γ2 < γ3

δ δn = n−2n∑i

n∑j

φ(Sij /n), φ increasing; Sij =n∑

k=1

ψ(Xk, Xi +Xj ) NBU [94, 95]

Generalized J J test together with a two-stage test procedure NBU [96]

S S =U − J , see above for J test NBU [97]

Ln Ln(ψ, m)= n−1n∑

i=1

φ(n−1S(m)i ), S(m)

i =n∑

j=1

ψ(Xj , mX) NBU [98]

K∗ K∗ =K/X,K = 1

2n2

n∑i=1

(3n

2− 2i + 1

2

)Xi NBUE [91]

CV S/X, S2 =n∑

i=1

(Xi − X)2/n NBUE, NWUE [10]

Tn Tn = 1

n

n∑i=1

exp(−Xi/X) HNBUE [101]

En En = 1

n

n−1∑i=1

log[1− τ(Xi)], τ(Xi)=∑i

j=1 (n− j + 1)(Xj −Xj−1) HNBUE [102]

Page 209: Handbook of Reliability Engineering - pdi2.org

176 Statistical Reliability Theory

Table 9.5. Continued.

Test name Basic statistics Aging Keyalternatives references

T T = [n(n− 1)]−1∑

ψ(Xα1 , Xα2 + t0)− (2n)−1n∑

i=1

ψ(Xi, t0) NBU-t0 [103]

Tk Tk = T1k − T2k , T1k =∑

i=j=nψ(Xi, Xj + kt0)

/2

n(n− 1)NBU-t0 [104]

T2k = 1

2

∑ψk(Xi1 , . . . , Xik )

/(nk

)ψk(a1, . . . , ak)=

{1 if min ai > t00 otherwise

W1:n W1n = 1

4

n−1∑i=0

B1i

n(Xi+1 −Xi), DPRL-α [105, 106]

B1(t)={−t2(t2 − 1) 0≤ t < α

t2(α−2 − 1)[2(α−2 + 1)t2 − 1] α < t < 1t = 1− t

α = 1− α

W2:n W2:n = 1

2F−1n (α)− 1

2

n∑i=1

[B2[(i − 1)/n)− B2(i/n)] NBUP-α [105, 106]

B2(t)=

−1

2[(1− t)2 − 1] 1≤ t < α

1

2[(1− α)−2 − 1](1− t)2 α < t ≤ 1

TTT T =n∑

j=1

ajUj , Ui = τ(Xi)/τ(Xn) BFR [107]

Stochastic Residual lifeXt = {X − t |X > t}with cumulative distribution [108]order function FX(t + x)/FX(t), x ≥ 0. For 0≤ s < t

F IFR ifXt ≤st Xs ; F DFR ifXs ≤st Xt ; IFR, DFR

F NBU ifXt ≤st X; F NWU ifX ≤st Xt . NBU, NWU

Test statistics based on testing stochastic order

Rightcensored

Fn(x) = 1−∏

{i:Z(i)≤x}

(n− i

n− i + 1

)δ(i)(Kaplan–Meier estimator [111]), IFRA [109]

δi = 0 or 1 depending on whether object i is censored on the right or NBU [110]not,Zi =min(Xi, Yi), Yi is random time to the right censorship NBUE [112]

NBU-t0 [113]DPRL-α [105]NBUP-α [105]

Page 210: Handbook of Reliability Engineering - pdi2.org

Concepts and Applications of Stochastic Aging in Reliability 177

9.9.2 Summary of Tests of Aging inUnivariate Case

We shall now present several statistical proceduresfor testing exponentiality of a life distributionagainst different alternatives.

Let ψ be the indicator function defined as

ψ(a, b)={

1 if a > b

0 otherwise

φ denotes an increasing function (i.e. φ(x) ≤φ(y), for x < y), and X1 <X2 < · · ·< Xn are theorder statistics.

Table 9.5 gives an overview of the tests availablefor various aging classes. We refer our readers toLai [3] for details.

9.9.3 Summary of Tests of BivariateAgingLet X and Y denote the lifetimes of twocomponents having a joint distribution functionF(x, y). The joint survival function is given byF (x, y)= Pr(X > x, Y > y).

In testing bivariate aging properties, there aretwo problems facing us:

• Which bivariate exponential distribution isthe null distribution?• Which version of bivariate aging property in

a given class are we dealing with? (Thereare several versions of bivariate IFR, bivariateNBU, etc.)

Generally, the bivariate exponential distribution ofMarshall and Olkin [61] (denoted by BVE) is usedas the null distribution. This joint distribution hasthe survival function given by

F (x, y)= exp[−λ1x − λ2y − λ12 max(x, y)]x, y ≥ 0 (9.7)

λ1, λ2 > 0, λ12 ≥ 0.BVE is chosen presumably because:

• it was derived from a reliability context;• it has the property of bivariate lack of

memory:

F (x + t, y + t)= F (x, y)F (t, t) (9.8)

Table 9.6. Tests of bivariate aging

Bivariate version Key references on bivariatetests

IFR [114, 115]IFRA [116]DMRL [117]NBU [118, 119]NBUE [117]HNBUE [120]

Most of the test statistics listed in Table 9.6are bivariate generalizations of the univariate testsdiscussed in Section 9.9.2.

It is our impression that tests for bivariate andmultivariate stochastic aging are more difficultthan for the univariate case; however, they wouldprobably offer a greater scope for applications. Weanticipate that more work will be done in this area.

9.10 Concluding Remarks onAging

It is clear from the material presented in thischapter that research on aging properties (univari-ate, bivariate, and multivariate) is currently beingpursued vigorously. Several surveys are given, e.g.see Block and Savits [56,121,122], Hutchinson andLai [2], and Lai [3].

The simple aging concepts such as IFR, IFRA,NBU, NBUE, DMRL, etc., have been shown to bevery useful in reliability-related decision making,such as in replacement and maintenance studies.For an introduction on these applications, seeBergman [1] or Barlow and Proschan [7].

It is our impression that the more advancedunivariate aging concepts exist in somethingof a mathematical limbo, and have not yetbeen applied by the mainstream of reliabilitypractitioners. Also, it seems that, at present,many of the multivariate aging definitions givenabove lack clear physical interpretations, thoughit is true that some can be deduced from shockmodels. The details of these derivations can be

Page 211: Handbook of Reliability Engineering - pdi2.org

178 Statistical Reliability Theory

found in Marshall and Shaked [73], Ghosh andEbrahimi [123], and Savits [124].

It is our belief that practical concern is verymuch with multicomponent systems—in whichcase, appreciation and clarification of the bivariateconcepts (and hence of the multivariate concepts)could lead to greater application of this wholebody of work.

References[1] Bergman B. On reliability theory and its applications.

Scand J Stat 1985;12:1–30 (discussion: 30–41).[2] Hutchinson TP, Lai CD. Continuous bivariate distribu-

tions, emphasising applications. Adelaide: Rumsby Sci-entific Publishing; 1990.

[3] Lai CD. Tests of univariate and bivariate stochasticageing. IEEE Trans Reliab 1994;R43:233–41.

[4] Hollander M, Proschan F. Nonparametric concepts andmethods in reliability. In: Krishnaiah PR, Sen PK, ed-itors. Handbook of statistics: nonparametric methods,vol. 4. Amsterdam: North Holland; 1984. p.613–55.

[5] Heckman JJ, Singer B. Economics analysis of longi-tudinal data. In: Griliches Z, Intriligator MD, editors.Handbook of econometrics, vol. 3. Amsterdam: North-Holland; 1986. p.1689–763.

[6] Bryson MC, Siddiqui MM. Some criteria for aging. J AmStat Assoc 1969;64:1472–83.

[7] Barlow RE, Proschan F. Statistical theory of reliabilityand life testing. Silver Spring: To Begin With; 1981.

[8] Klefsjö B. A useful ageing property based on the Laplacetransform. J Appl Probab 1983;20:615–26.

[9] Rolski T. Mean residual life. Bull Int Stat Inst1975;46:266–70.

[10] Klefsjö B. HNBUE survival under shock models. Scand JStat 1981;8:39–47.

[11] Klefsjö B. HNUBE and HNWUE classes of life distribu-tions. Nav Res Logist Q 1982;29:615–26.

[12] Basu, AP, Ebrahimi N. On k-order harmonic new betterthan used in expectation distributions. Ann Inst StatMath 1984;36:87–100.

[13] Deshpande JV, Kochar SC, Singh H. Aspects of positiveageing. J Appl Probab 1986;23:748–58.

[14] Loh WY. A new generalisation of NBU distributions.IEEE Trans Reliab 1984;R-33:419–22.

[15] Loh WY. Bounds on ARE’s for restricted classesof distributions defined via tail orderings. Ann Stat1984;12:685–701.

[16] Kochar SC, Wiens DD. Partial orderings of lifedistributions with respect to their ageing properties.Nav Res Logist 1987;34:823–9.

[17] Abouammoh AM, Ahmed AN. The new better than usedfailure rate class of life distributions. Adv Appl Probab1988;20:237–40.

[18] Dykstra RL. Ordering, starshaped. In: Encyclopedia ofstatistical sciences, vol. 6. New York: Wiley; 1985. p.499–501.

[19] Block HW, Savits TH. The IFRA closure problem. AnnProbab 1976;4:1030–2.

[20] Esary JD, Marshall AW, Proschan F. Shock models andwear processes. Ann Probab 1973;1:627–47.

[21] Block HW, Savits TH. Laplace transforms for classes oflife distributions. Ann Probab 1980;8:465–74.

[22] Cao J, Wang Y. The NBUC and NWUC classes of lifedistributions. J Appl Probab 1991;28:473–9.

[23] Hendi MI, Mashhour AF, Montassser MA. Closure of theNBUC class under formation of parallel systems. J ApplProbab 1993;30:975–8.

[24] Li X, Li Z, Jing B. Some results about the NBUC class oflife distributions. Stat Probab Lett 2000;46:229–37.

[25] Patel JK. Hazard rate and other classifications ofdistributions. In: Encyclopedia in statistical sciences,vol. 3. New York: Wiley; 1983. p.590–4.

[26] Chen Y. Classes of life distributions and renewalcounting process. J Appl Probab 1994;31:1110–5.

[27] Lai CD, Xie M, Murthy DNP. Bathtub-shaped failurelife distributions. In: Balakrishnan N, Rao CR, editors.Handbook of statistics, vol. 20. Amsterdam: ElsevierScience; 2001. p.69–104.

[28] Mi J. Bathtub failure rate and upside-down bathtubmean residual life. IEEE Trans Reliab 1995;R-44:388–91.

[29] Jiang R, Murthy DNP. Parametric study of competingrisk model involving two Weibull distributions. Int JReliab Qual Saf Eng 1997;4:17–34.

[30] Park KS. Effect of burn-in on mean residual life. IEEETrans Reliab 1985;R-34:522–3.

[31] Kao JHK. A graphical estimation of mixed Weibull pa-rameters in life testing of electronic tubes. Technomet-rics 1959;1:389–407.

[32] Lieberman GJ. The status and impact of reliabilitymethodology. Nav Res Logist Q 1969;14:17–35.

[33] Glaser RE. Bathtub and related failure rate characteriza-tions. J Am Stat Assoc 1980;75:667–72.

[34] Lawless JF. Statistical models and methods for life timedata. New York: John Wiley; 1982.

[35] Chen Z. A new two-parameter lifetime distribution withbathtub shape or increasing failure rate function. StatProbab Lett 2000;49:155–61.

[36] Guess F, Hollander M, Proschan F. Testing exponential-ity versus a trend change in mean residual life. Ann Stat1986;14:1388–98.

[37] Gupta RC, Akman HO. Mean residual life function forcertain types of non-monotonic ageing. Commun StatStochast Models 1995;11:219–25.

[38] Mitra M, Basu SK. Shock models leading to non-monotonic aging classes of life distributions. J Stat PlanInfer 1996;55:131–8.

[39] Guess F, Proschan F. Mean residual life: theory andapplications. In: Krishnaiah PR, Rao CR, editors.Handbook of statistics, vol. 7. Amsterdam: ElsevierScience; 1988. p.215–24.

Page 212: Handbook of Reliability Engineering - pdi2.org

Concepts and Applications of Stochastic Aging in Reliability 179

[40] Newby M. Applications of concepts of ageing inreliability data analysis. Reliab Eng 1986;14:291–308.

[41] Franco M, Ruiz JM, Ruiz MC. On closure of the IFR(2)and NBU(2) classes. J Appl Probab 2001;38:235–41.

[42] Li XH, Kochar SC. Some new results involving theNBU(2) class of life distributions. J Appl Probab2001;38:242–7.

[43] Singh H, Deshpande JV. On some new ageing properties.Scand J Stat 1985;12:213–20.

[44] Kulaseker KB, Park HD. The class of better meanresidual life at age t0 . Microelectron Reliab 1987;27:725–35.

[45] Launer RL. Inequalities for NBUE and NWUE lifedistributions. Oper Res 1984;32:660–7.

[46] Joe H, Proschan F. Percentile residual life functions.Oper Res 1984;32:668–78

[47] Sengupta D, Deshpande JV. Some results on relative age-ing of two life distributions. J Appl Probab 1994;31:991–1003.

[48] Kalashnikov VV, Rachev ST. Characterization of queu-ing models and their stability. In: Prohorov YuK, Stat-ulevicius VA, Sazonov VV, Grigelionis B, editors. Proba-bility theory and mathematical statistics, vol. 2. Amster-dam: VNU Science Press; 1986. p.37–53.

[49] Boland PJ. A reliability comparison of basic systemsusing hazard rate functions. Appl Stochast Models DataAnal 1998;13:377–84.

[50] Singh H, Singh RS. On allocation of spares at componentlevel versus system level. J Appl Probab 1997;34:283–7.

[51] Lai CD, Xie M. Relative ageing for two parallel systemsand related problems. Math Comput Model 2002;inpress.

[52] Boland PJ, El-Neweihi E. Component redundancy versussystem redundancy in the hazard rate ordering. IEEETrans Reliab 1995;R-8:614–9.

[53] Shaked M, Shanthikumar JG. Stochastic orders and theirapplications. San Diego: Academic Press; 1994.

[54] Buchanan WB, Singpurwalla ND. Some stochasticcharacterizations of multivariate survival. In: Tsokos CP,Shimi IN, editors. The theory and applications ofreliability. New York: Academic Press; 1977. p.329–48.

[55] Esary JD, Marshall AW. Multivariate distributions withincreasing hazard average. Ann Probab 1979;7:359–70.

[56] Block HW, Savits TH. Multivariate classes in reliabilitytheory. Math Oper Res 1981;6:453–61.

[57] Basu AP, Ebrahimi N, Klefsjö B. Multivariate harmonicnew better than used in expectation distributions.Scand J Stat 1983;10:19–25.

[58] Harris R. A multivariate definition for increasing hazardrate distribution. Ann Math Stat 1970;41:713–7.

[59] Brindley EC, Thompson WA. Dependence and ageingaspects of multivariate survival. J Am Stat Assoc1972;67:822–30.

[60] Marshall AW. Multivariate distributions with monotonichazard rate. In: Barlow RE, Fussel JR, Singpurwalla ND,editors. Reliability and fault tree analysis—theoreticaland applied aspects of system reliability and safetyassessment. Philadelphia: Society for Industrial andApplied Mathematics; 1975. p.259–84.

[61] Marshall AW, Olkin I. A multivariate exponentialdistribution. J Am Stat Assoc 1967;62:291–302.

[62] Basu AP. Bivariate failure rate. J Am Stat Assoc1971;66:103–4.

[63] Block HW. Multivariate reliability classes. In: Krishna-iah PR, editor. Applications of statistics. Amsterdam:North-Holland; 1977. p.79–88.

[64] Johnson NL, Kotz S. A vector multivariate hazard rate.J Multivar Anal 1975;5:53–66.

[65] Johnson NL, Kotz S. A vector valued multivariate hazardrate. Bull Int Stat Inst 1973;45(Book 1): 570–4.

[66] Savits TH. A multivariate IFR class. J Appl Probab1985:22:197–204.

[67] Shanbhag DN, Kotz S. Some new approaches tomultivariate probability distributions. J Multivar Anal1987;22:189–211

[68] Block HW, Savits TH. Multivariate increasing failure ratedistributions. Ann Probab 1980;8:730–801.

[69] Buchanan WB, Singpurwalla ND. Some stochastic char-acterizations of multivariate survival. In: Tsokos CP,Shimi IN, editors. The theory and applications of reli-ability, with emphasis on Bayesian and nonparametricmethods, vol. 1. New York: Academic Press; 1977. p.329–48.

[70] Block HW, Savits TH. The class of MIFRA lifetime andits relation to other classes. Nav Res Logist Q 1982;29:55–61.

[71] Shaked M, Shantikumar JG. IFRA processes. In: Basu AP,editor. Reliability and quality control. Amsterdam:North-Holland; 1986. p.345–52.

[72] Muhkerjee SP, Chatterjee A. A new MIFRA class of lifedistributions. Calcutta Stat Assoc Bull 1988;37:67–80.

[73] Marshall AW, Shaked M. Multivariate shock models fordistribution with increasing failure rate average. AnnProbab 1979;7:343–58.

[74] Marshall AW, Shaked M. A class of multivariate newbetter than used distributions. Ann Probab 1982;10:259–64.

[75] Marshall AW, Shaked M. Multivariate new better thanused distributions. Math Oper Res 1986;11:110–6.

[76] Marshall AW, Shaked M. Multivariate new better thanused distributions: a survey. Scand J Stat 1986;13:277–90.

[77] Arnold BC, Zahedi H. On multivariate mean remaininglife functions. J Multivar Anal 1988:25:1–9.

[78] Zahedi H. Some new classes of multivariate survivaldistribution functions. J Stat Plan Infer 1985;11:171–88.

[79] Klefsjö B. On some classes of bivariate life distributions.Statistical Research Report 1980-9, Department ofMathematical Statistics, University of Umea, 1980.

[80] Basu AP, Ebrahimi N. HNBUE and HNWUEdistributions—a survey. In: Basu AP, editor. Reliabilityand quality control. Amsterdam: North-Holland; 1986.p.33–46.

[81] Ascher S. A survey of tests for exponentiality. CommunStat Theor Methods 1990;19:1811–25.

[82] Cox DR, Oakes D. Analysis of survival data. London:Chapman and Hall; 1984.

Page 213: Handbook of Reliability Engineering - pdi2.org

180 Statistical Reliability Theory

[83] Klefsjö B. Some tests against ageing based on the totaltime on test transform. Commun Stat Theor Methods1983;12:907–27.

[84] Klefsjö B. Testing exponentiality against HNBUE.Scand J Stat 1983;10:67–75.

[85] Ahmad IA. A nonparametric test for the monotonicityof a failure rate function. Commun Stat 1975;4:967–74.

[86] Ahmad IA. Corrections and amendments. Commun StatTheor Methods 1976;5:15.

[87] Hoeffiding W. A class of statistics with asymptoticallynormal distributions. Ann Math Stat 1948;19:293–325.

[88] Deshpande JV. A class of tests for exponentiality againstincreasing failure rate average alternatives. Biometrika1983;70:514–8.

[89] Kochar SC. Testing exponentiality against monotonefailure rate average. Commun Stat Theor Methods1985;14:381–92.

[90] Link WA. Testing for exponentiality against monotonefailure rate average alternatives. Commun Stat TheorMethods 1989;18:3009–17.

[91] Hollander M, Proschan F. Tests for the mean residuallife. Biometrika 1975;62:585–93.

[92] Hollander M, Proschan F. Amendments and corrections.Biometrika 1980;67:259.

[93] Hollander M, Proschan F. Testing whether new is betterthan used. Ann Math Stat 1972;43:1136–46.

[94] Koul HL. A test for new is better than used. CommunStat Theor Methods 1977;6:563–73.

[95] Koul HL. A class of testing new is better than used. Can JStat 1978;6:249–71.

[96] Alam MS, Basu AP. Two-stage testing whether new isbetter than used. Seq Anal 1990;9:283–96.

[97] Deshpande JV, Kochar SC. A linear combination of twoU-statistics for testing new better than used. CommunStat Theor Methods 1983;12:153–9.

[98] Kumazawa Y. A class of test statistics for testing whethernew better than used. Commun Stat Theor Methods1983;12:311–21.

[99] Hollander M, Proschan F. Testing whether new is betterthan used. Ann Math Stat 1972;43:1136–46.

[100] De Souza Borges W, Proschan F. A simple test for newbetter than used in expectation. Commun Stat TheorMethods 1984;13:3217–23.

[101] Singh H, Kochar SC. A test for exponentiality againstHNBUE alternative. Commun Stat Theor Methods1986;15:2295–2304.

[102] Kochar SC, Deshpande JV. On exponential scoresstatistics for testing against positive ageing. Stat ProbabLett 1985;3:71–3.

[103] Hollander M, Park HD, Proschan F. A class of lifedistributions for ageing. J Am Stat Assoc 1986;81:91–5.

[104] Ebrahimi N, Habibullah M. Testing whether survivaldistribution is new better than used of specific age.Biometrika 1990;77:212–5.

[105] Joe H, Proschan F. Tests for properties of the percentileresidual life function. Commun Stat Theor Methods1983;12:1087–119.

[106] Joe H, Proschan F. Percentile residual life functions.Oper Res 1984;32:668–78.

[107] Xie M. Some total time on test quantiles useful fortesting constant against bathtub-shaped failure ratedistributions. Scand J Stat 1988;16:137–44.

[108] Belzunce F, Candel J, Ruiz JM. Testing the stochasticorder and the IFR, DFR, NBU, NWU ageing classes. IEEETrans Reliab 1998;R47:285–96.

[109] Wells MT, Tiware RC. A class of tests for testingan increasing failure rate average distributions withrandomly right-censored data. IEEE Trans Reliab 1991;R40:152–6.

[110] Chen YY, Hollander M, Langberg NA. Testing whethernew is better than used with randomly censored data.Ann Stat 1983;11:267–74.

[111] Kaplan EL, Meier P. Nonparametric estimation from in-complete observations. J Am Stat Assoc 1958;53:457–81.

[112] Koul HL, Susarla V. Testing for new better than usedin expectation with incomplete data. J Am Stat Assoc1980;75:952–6.

[113] Hollander M, Park HD, Proschan F. Testing whether newis better than used of a specified age, with randomlycensored data. Can J Stat 1985;13:45–52.

[114] Sen K, Jain MB. A test for bivariate exponentialityagainst BIFR alternative. Commun Stat Theor Methods1991;20:3139–45.

[115] Bandyopadhyay D, Basu AP. A class of tests forexponentiality against bivariate increasing failure ratealternatives. J Stat Plan Infer 1991;29:337–49.

[116] Basu AP Habibullah M. A test for bivariate exponential-ity against BIFRA alternative. Calcutta Stat Assoc Bull1987;36:79–84.

[117] Sen K, Jain MB. Tests for bivariate mean residual life.Commun Stat Theor Methods 1991;20:2549–58.

[118] Basu AP, Ebrahimi N. Testing whether survival functionis bivariate new better than used. Commun Stat1984;13:1839–49.

[119] Sen K, Jain MB. A new test for bivariate distributions:exponential vs new-better-than-used alternative. Com-mun Stat Theor Methods 1991;20:881–7.

[120] Sen K, Jain MB. A test for bivariate exponentialityagainst BHNBUE alternative. Commun Stat TheorMethods 1990;19:1827–35.

[121] Block HW, Savits TH. Multivariate distributions inreliability theory and life testing. In: Taillie C, Patel GP,Baldessari BA, editors. Statistical distributions inscientific work, vol. 5. Dordrecht, Boston (MA): Reidel;1981. p.271–88.

[122] Block HW, Savits, TH. Multivariate nonparametricclasses in reliability. In: Krishnaiah PR, Rao CR, editors.Handbook of statistics, vol. 7. Amsterdam: North-Holland; 1988. p.121–9.

[123] Ghosh M, Ebrahimi N. Shock models leading multivari-ate NBU and NBUE distributions. In: Sen PK, editor.Contributions to statistics: essays in honour of Nor-man Lloyd Johnson. Amsterdam: North-Holland; 1983.p.175–84.

[124] Savits TH. Some multivariate distributions derived froma non-fatal shock model. J Appl Probab 1988;25:383–90.

Page 214: Handbook of Reliability Engineering - pdi2.org

Class of NBU-t0 Life Distribution

Chap

ter1

0Dong Ho Park

10.1 Introduction10.2 Characterization of NBU-t0 Class10.2.1 Boundary Members of NBU-t0 and NWU-t010.2.2 Preservation of NBU-t0 and NWU-t0 Properties under Reliability Operations10.3 Estimation of NBU-t0 Life Distribution10.3.1 Reneau–Samaniego Estimator10.3.2 Chang–Rao Estimator10.3.2.1 Positively Biased Estimator10.3.2.2 Geometric Mean Estimator10.4 Tests for NBU-t0 Life Distribution10.4.1 Tests for NBU-t0 Alternatives Using Complete Data10.4.1.1 Hollander–Park–Proschan Test10.4.1.2 Ebrahimi–Habibullah Test10.4.1.3 Ahmad Test10.4.2 Tests for NBU-t0 Alternatives Using Incomplete Data

10.1 Introduction

The notion of aging plays an important rolein reliability theory, and several classes oflife distributions (i.e. distributions for whichF(t)= 0 for t < 0) have been proposed anddiscussed to categorize different aspects of aging.Among those, the classes of life distributions thathave been shown to be fundamental in the study ofmaintenance policies are the new better than used(NBU) and the new better than used in expectation(NBUE) classes. Marshall and Proschan [1] andEsary et al. [2] discuss the maintenance policy ofa system when the underlying life distributionsbelong to either the NBU class or the NBUE class.

Definition 1. A life distribution F is an NBU dis-tribution if F (x + y)≤ F (x)F (y) for all x, y ≥ 0,where F ≡ 1− F denotes the survival function.The dual concept of a new worse than used (NWU)distribution is defined by reversing the inequality(i.e. F is NWU if F (x + y)≥ F (x)F (y) for allx, y ≥ 0).

Definition 2. A life distribution F is an NBUE dis-tribution if

∫∞0 F (x) dx <∞ and εF (0)≥ εF (t)

for all t ≥ 0, where εF (t)=∫∞t

F (x) dx/F (t) isthe mean residual life at age t . It is new worse thanused in expectation (NWUE) if εF (0)≤ εF (t) forall t ≥ 0.

The NBU property may be interpreted asstating that a used item of any age has astochastically smaller residual life length thandoes a new item, and the NBUE property impliesthat the mean residual life length at any age t ≥ 0 isless than or equal to the mean life length of a newitem, i.e. the mean residual life length at age zero.The boundary members of the NBU and NBUEclasses are the exponential distributions, for whichresidual life lengths do not improve or deterioratewith age.

Other well-known classes of life distributionsthat have been categorized according to mono-tonicity properties of the failure rate, the averagefailure rate, and the mean residual life length in-clude the following:

181

Page 215: Handbook of Reliability Engineering - pdi2.org

182 Statistical Reliability Theory

Definition 3. A life distribution F is an increasingfailure rate (IFR) distribution if F(0)= 0 andF (x + t)/F (t) is decreasing in t for 0 < t <

F−1(1) and x > 0. Here F−1(1) is defined assupt {t : F(t) < 1}. It is a decreasing failure rate(DFR) distribution if F (x + t)/F (t) is increasingin t for 0 < t < F−1(1). If F has a density f , thenF is IFR (DFR) if the failure rate r(t)= f (t)/F (t)

is increasing (decreasing) for 0 < t < F−1(1).

Definition 4. A life distribution F is an increas-ing failure rate average (IFRA) distribution ifF(0)= 0 and −(1/t) log F (t) is increasing in t

for 0 < t < F−1(1). It is a decreasing failure rateaverage (DFRA) distribution if −(1/t) log F (t) isdecreasing in t for 0 < t < F−1(1). If F has adensity f , then F is IFRA (DFRA) if the averagefailure rate (1/t)

∫ t0 r(u) du is increasing (decreas-

ing) for 0 < t < F−1(1).

Definition 5. A life distribution F is a decreas-ing mean residual life (DMRL) distribution if∫∞

0 F (t) dt <∞ and εF (s) ≥ εF (t) for all 0≤ s ≤t . It is an increasing mean residual life (IMRL)distribution if εF (0)≤ εF (t) for all 0≤ s ≤ t .

It is well known [3] that the following implica-tions among these classes of life distributions holdif F has a finite mean:

IFR ⇒ IFRA ⇒ NBU ⇒ NBUEIFR ⇒ DMRL ⇒ NBUE

Similar relationships exist for the dual classes,DFR, DFRA, IMRL, NWU, and NWUE. It is alsoknown that the exponential distributions are theonly distributions that are both IFR and DFR,both IFRA and DFRA, both NBU and NWU, bothDMRL and IMRL, and both NBUE and NWUE.

We introduce a new class of life distributions,which is obtained by relaxing the conditions forthe NBU (NWU) class somewhat.

Definition 6. Let t0 ≥ 0. A life distribution F isnew better than used at t0 (NBU-t0) if

F (x + t0) ≤ F (x)F (t0) for all x ≥ 0 (10.1)

The dual notion of new worse than used at t0(NBU-t0) is defined analogously by reversing thefirst inequality in Equation 10.1.

It is obvious that the NBU-t0 class is relatedto, but contains and is much larger than, theNBU class. The NBU-t0 property states thata used item of age t0 has a stochasticallysmaller residual life length than does a newitem, whereas the NBU property states thata used item of any age has a stochasticallysmaller residual life length than does a newitem. Section 10.2 characterizes the NBU-t0 andNWU-t0 classes and considers their preservationproperties under several reliability operations.The boundary members of the NBU-t0 andNWU-t0 classes are also discussed. In Section 10.3,the estimation of the NBU-t0 (NWU-t0) lifedistribution is described. Section 10.4 presentsvarious nonparametric tests of exponentialityagainst the NBU-t0 (NWU-t0) alternatives basedon complete and incomplete data.

10.2 Characterization ofNBU-t0 Class

10.2.1 Boundary Members of NBU-t0and NWU-t0

It is well known that the only life distributionsthat belong to both IFR and DFR, IFRA andDFRA, NBU and NWU, DMRL and IMRL, andNBUE and NWUE are the class of exponentialdistributions. However, not only the class ofexponential distributions, but also some other lifedistributions belong to the boundary members ofNBU-t0 and NWU-t0.

Let C denote the class of boundary members ofNBU-t0 and NWU-t0. That is, for t0 > 0

C0 = {F : F (x + t0)= F (x)F (t0) for all x ≥ 0}(10.2)

Using theorem 2 of Marsaglia and Tubilla [4], wemay easily verify that the following distributionsF1, F2, and F3 are in C0:

F1(x)= exp(−λx) λ > 0, x ≥ 0

F2(x)= G(x) 0≤ x <∞ (10.3)

Page 216: Handbook of Reliability Engineering - pdi2.org

Class of NBU-t0 Life Distribution 183

where G(x) is a survival function for whichG(0)= 1 and G(t0)= 0.

F3 = G(x) for 0≤ x < t0

= G(t0)G(x − t0) for t0 ≤ x < 2t0

= G2(t0)G(x − 2t0) for 2t0 ≤ x < 3t0...

= Gn(t0)G(x − nt0) for nt0 ≤ x < (n+ 1)t0...

where G(x) is a survival function defined forx ≥ 0. Note that if G has a density function on[0, t0], then the failure rate of F3 is periodic withperiod t0.

Since the NBU-t0 class contains the NBUclass, it is of interest to find what kinds of lifedistributions belong to the NBU-t0 class, but notto the NBU class. We consider

Ca ={F : F (x + t0) ≤ F (x)F (t0) for all x ≥ 0,

and equality holds for some x ≥ 0} (10.4)

Then Ca is the class of NBU-t0 life distributionsexcluding the boundary members of NBU-t0 andNWU-t0, defined in Equation 10.2. Let C∗ be theclass of life distributions that are not NBU, but arein Ca . Theorem 1 gives a method of constructingsome distribution functions in C∗.

Given a survival function H , let Ht (x)≡H (t + x)/H (t) be the conditional survivalfunction. Recall that for x ≥ 0:

H (x)= exp

[−∫ x

0rH (u) du

](10.5)

Ht (x)= exp

[−∫ t+x

t

rH (u) du

](10.6)

when H has a failure rate function rH .

Theorem 1. Let G be NBU with failure ratefunction rG(x) > 0 for 0≤ x <∞ and let F havea failure rate function rF satisfying:

rF (x)≤ rG(x) for 0≤ x ≤ t0 (10.7)

rF (x)= rG(x) for t0 ≤ x ≤∞ (10.8)

Figure 10.1. Failure rates forF andG of Example 1

and

rF (x) is strictly decreasing for 0≤ t ≤ t1 (10.9)

where 0 < t1 < t0. Then F is NBU-t0 but not NBU.

Proof. F is not NBU by Equation 10.9.To show that F is NBU-t0, note that for x ≥ 0,Ft0(x)= Gt0(x) by Equations 10.6 and 10.8. Also,Gt0(x)≤ G(x) since G is NBU, and G(x)≤ F (x)

by Equations 10.7 and 10.8. Thus, we concludethat, for x ≥ 0, Ft0(x)≤ F (x); i.e. F is NBU-t0. �

Example 1. As an example of Theorem 1,let rG(x)= 1 for 0≤ x <∞, and letrF (x)= 1− (θ/t0)x for 0≤ x < t0 and 0 < θ ≤ 1,and rF (x)= 1 for t0 ≤ x <∞. We do not letθ exceed unity, since we want to ensure thatrF (x) remains positive as x→ t0. Then rFsatisfies Equations 10.7–10.9 and thus F is in C∗.(See Figure 10.1.)

Using Equation 10.3 and the failure ratesdefined in Example 1, if we extend the range of θto include θ = 0, we can construct

F (x; θ)= exp{−[x − θ(2t0)−1x2]} 0≤ x < t0

= exp{−[x − θ(2)−1t0]} x ≥ t0

for 0≤ θ ≤ 1. For θ = 0, F is reduced to theexponential distribution which is both NBU andNBU-t0. For 0 < θ ≤ 1, F is NBU-t0, but not NBU.Another possible NBU-t0 life distribution that isnot NBU can be constructed as in the followingexample.

Example 2. Consider a distribution function F

with the following failure rate rF (t) shown in

Page 217: Handbook of Reliability Engineering - pdi2.org

184 Statistical Reliability Theory

Figure 10.2. Failure rate forF of Example 2

Figure 10.2. Then for n≥ 1

F (x)= exp[−(θ/2t0)x2]for 0≤ x < t0

= exp{−[−θt0 + (1+ θ)x − (θ/2t0)x2]}for t0 ≤ x < 2t0

= exp{−[−3θt0 + (1+ 2θ)x − (θ/2t0)x2]}for 2t0 ≤ x < 3t0

...

= exp(−{−[n(n− 1)θ/2]t0+ [1+ (n− 1)θ ]x − (θ/2t0)x2})

for (n− 1)t0 ≤ x < nt0

= exp{−[−(n/2)θt0 + x]}for x ≥ nt0

To see that F is inCa , we note that, for x + t0 ≤ nt0

F (x + t0)= F (t0)F (x + t0 − t0)= F (t0)F (x)

and for x + t0 > nt0

F (x + t0)= exp{−[−(n/2)θt0 + t0 + x]}= F (t0) exp(−{−[(n− 1)/2]θt0 + x})

Case 1. (n− 1)t0 ≤ x < nt0:

F (x)= exp(−{−[n(n− 1)/2]θt0+ [1+ (n− 1)θ ]x − (θ/2t0)x

2})= exp{−[−(n− 1)/2]θt0 + x}× exp{(θ/2t0)[x − (n− 1)t0]2}≥ exp{−[−(n− 1)/2]θt0 + x}

Thus F (x + t0) ≤ F (x)F (t0).

Case 2. x ≥ nt0:

F (x)= exp{−[−(n/2)θt0 + x]}≥ exp{−[−(n− 1)/2]θt0 + x}

Thus F (x + t0) ≤ F (x)F (t0). In both casesF (x + t0) ≤ F (x)F (t0) and so F is in Ca . It isobvious that F is not an NBU distribution.

10.2.2 Preservation of NBU-t0 andNWU-t0 Properties under ReliabilityOperations

Table 10.1 summarizes the preservation propertiesof several known classes of life distributionsunder reliability operations. With the exceptionof the DMRL class, each class representingadverse aging is closed under convolution ofdistributions. On the other hand, none of theseclasses is closed under mixture of distributions.For the classes representing beneficial aging, thereverse is true: Each is closed under mixtureof distributions (in the NWU and NWUE cases,the distributions being mixed are non-crossing).On the other hand, none of these beneficialaging life distribution classes is closed underconvolution of distributions.

We now consider preservation properties of theNBU-t0 and NWU-t0 classes under those reliabilityoperations specified in Table 10.1.

Theorem 2. The NBU-t0 class is preserved underthe formation of coherent system.

Proof. The proof is exactly analogous to that ofthe corresponding result for the NBU class. In theproof of theorem 5.1 of Barlow and Proschan [5],p.182–3, simply replace s by x and t by t0. �

Theorem 3. The NWU-t0 class is (a) preservedunder mixture of non-crossing distributions and(b) not preserved under arbitrary mixtures.

Proof. (a) In the proof of the corresponding resultfor the NWU class (Barlow and Proschan [5],p.186, theorem 5.7), substitute x for s and t0

Page 218: Handbook of Reliability Engineering - pdi2.org

Class of NBU-t0 Life Distribution 185

Table 10.1. Preservation properties of classes of life distributions

Class of life Reliability Formation of Convolution of Mixture of life Mixture ofdistribution operation coherent structure life distributions distributions non-crossing life

distributions

Adverse aging IFR Not closed Closed Not closed Not closedIFRA Closed Closed Not closed Not closedNBU Closed Closed Not closed Not closedNBUE Not closed Closed Not closed Not closedDMRL Not closed Not closed Not closed Not closedHNBUE Not closed Closed Not closed Not closed

Beneficial aging DFR Not closed Not closed Closed ClosedDFRA Not closed Not closed Closed ClosedNWU Not closed Not closed Not closed ClosedNWUE Not closed Not closed Not closed ClosedIMRL Not closed Not closed Closed ClosedHNWUE Not closed Not closed Closed Not closed

for t . (b) The example given in 5.9, Barlow andProschan [5], p.197, may be used. �

Both NBU-t0 and NWU-t0 classes are notclosed under other reliability operations andsuch properties are exhibited by providing thefollowing counter examples in those reliabilityoperations.

Example 3. The NBU-t0 class is not preservedunder convolution. Let F be the distribution thatplaces mass 1

2 − ε at the point �1, mass 12 − ε

at the point 58 −�2, and mass 2ε at the point

1+ 12�1, where �1, �2, and ε are “small” positive

numbers. Let t0 = 1. Then it is obvious that F isNBU-t0, since a new item with life distribution F

survives at least until time �1 with probability ofunity, whereas an item of age t0 has residual life12�1 with probability unity.

Next, consider F (2)(t0 + x)= P [X1 +X2 >

t0 + x], where X1 and X2 are independent,identically distributed (i.i.d.) ∼ F , t0 = 1(as above), and x = 1

4 . Then F (2)( 54 )= ( 1

2 + ε)2,since X1 +X2 >

54 if and only if X1 ≥ 5

8 −�2

and X2 ≥ 58 −�2. Similarly, F (2)(t0)=

F (2)(1)= P [X1 +X2 > 1] = (2ε)2 + 2(2ε), sinceX1 +X2 > 1 if and only if: (a) X1 >

58 −�2 and

X2 >58 −�2; (b) X1 = 1+ 1

2�1 and X2 is anyvalue; or (c) X2 = 1+ 1

2�2 and X1 is any value.

Finally, F (2)( 14 )= 1− ( 1

2 − ε)2, since X1 +X2

> 14 except when X1 =�1 and X2 =�1. It follows

that, for t0 = 1 and x = 14 , we have

F (2)(t0 + x)− F (2)(t0)F(2)(x)

= ( 12 + ε)2 − {[( 1

2 + ε)2 + 2(2ε)]× [1− ( 1

2 − ε)2]}= 1

4 − ( 14

34 )+ o(ε)

which is greater than zero for sufficiently small ε.Thus F (2) is not NBU-t0.

Example 4. The NWU-t0 class is not preservedunder convolution. The exponential distributionF(x)= 1− e−x is NWU-t0. The convolution F (2)

of F with itself is the gamma distribution oforder two:F (2)(x)= 1− (1+ x) e−x , with strictlyincreasing failure rate. Thus F (2) is not NWU-t0.

Example 5. The NWU-t0 class is not preservedunder the formation of coherent systems.This may be shown using the same example as isused for the analogous result for NWU systems inBarlow and Proschan [5], p.183.

Example 6. The NBU-t0 class is not preservedunder mixtures. The following example showsthat a mixture of NBU-t0 distributions neednot be NBU-t0. Let Fα(x)= e−αx and G(x)=∫∞

0 Fα(x) e−α dα = (x + 1)−1. Then the density

Page 219: Handbook of Reliability Engineering - pdi2.org

186 Statistical Reliability Theory

Table 10.2.

Class of life Formation of Convolution of Mixture of life Mixture ofdistribution coherent structure life distributions distributions non-crossing life

distributions

NBU-t0 Closed Not closed Not closed Not closedNWU-t0 Not closed Not closed Not closed Closed

function is g(x)= (x + 1)−2 and the failure ratefunction is rg(x)= (x + 1)−1, which is strictlydecreasing in x ≥ 0. Thus G is not NBU-t0.

In summary, we have Table 10.2.

10.3 Estimation of NBU-t0 LifeDistribution

In this section we describe the known estimatorsof the survival function when the underlying lifedistribution belongs to the class of NBU-t0 distri-butions. Reneau and Samaniego [6] proposed anestimator of the survival function that is knownto be a member of the NBU-t0 class and stud-ied many properties, including the consistency ofthe estimator and convergence rates. Confidenceprocedures based on the estimator were also dis-cussed. The Reneau and Samaniego (RS) estima-tor is shown to be negatively biased in the sensethat the expectation of the estimator is boundedabove by its survival function. Chang [7] providessufficient conditions for the weak convergence forthe RS estimator to hold. Chang and Rao [8]modified the RS estimator to propose a new esti-mator, which improved the optimality propertiessuch as mean-squared error and goodness of fit.They also developed an estimator that is positivelybiased.

Let X1, . . . , Xn be a random sample from thesurvival function F , which is known to be amember of the NBU-t0 class. Denote the empiricalsurvival function by Fn. It is well known that Fn

is an unbiased estimator of F and almost surelyconverges to F at an optimal rate (see Serfling [9],theorem 2.1.4B). However, Fn is, in general, not amember of the NBU-t0 class.

10.3.1 Reneau–Samaniego Estimator

The first non-parametric estimator of F whenF belongs to the NBU-t0 class was proposed byReneau and Samaniego [6]. Let t0 > 0 be fixed.To find an NBU-t0 estimator of F , the conditionin Equation 10.1 can be rewritten as

F (x)≤ F (x − t0)F (t0) for all x ≥ t0 (10.10)

The procedure to construct an NBU-t0 estimator

is as follows. For x ∈ [0, t0], we let ˆF(x)= Fn(x).

Since ˆF(x)≤ ˆF(x − t0)ˆF(t0)= Fn(x − t0)Fn(t0)

for t0 ∈ [t0, 2t0], we let ˆF(x)= Fn(x) if Fn(x)≤Fn(x − t0)Fn(t0). Otherwise, we define ˆF(x)=Fn(x − t0)Fn(t0). The procedure continues in thisfashion for x ∈ (kt0, (k + h)t0], where k ≥ 2 is aninteger, and consequently we may express theestimator by the following recursion:

ˆF(x)= Fn(x) for 0≤ x ≤ t0

=min{Fn(x),ˆF(x − t0)

ˆF(t0)}for x > t0 (10.11)

Theorem 4. The estimator, given in Equation10.11, uniquely defines an NBU-t0 survival functionthat may be written as a function of Fn as follows:

ˆF(x)= min0≤k≤[x/t0]

{F kn (t0)Fn(x − kt0)}

where [u] denotes the greatest integer less than orequal to u.

Proof. See Reneau and Samaniego [6]. �Theorem 4 assures that the estimator of

Equation 10.11 is indeed in the NBU-t0 class. It isobvious by the method of construction of the

estimator that ˆF(x)= Fn(x) for 0≤ x ≤ t0 and

Page 220: Handbook of Reliability Engineering - pdi2.org

Class of NBU-t0 Life Distribution 187

ˆF(x)≤ Fn(x) for x > t0. Thus it is shown thatˆF(x) is the largest NBU-t0 survival function that

is everywhere less than or equal to Fn for all x ≥ 0.As a result, the RS estimator is a negatively biased

estimator and the expectation of ˆF(x) is boundedabove by F (x) for all x ≥ 0.

The following results show that the estimatorof Equation 10.11 is uniformly consistent andconverges uniformly to F with probability of one.For the proofs, see Reneau and Samaniego [6].

Lemma 1. Define

Dn ≡ supx≥0|Fn(x)− F (x)|

T (x)= F k(t0)F (x − kt0)

and

Tn(x)= F kn (t0)Fn(x − kt0) for 0≤ k ≤ [x/t0]

Then

|Tn(x)− T (x)| ≤Dn{k[F (t0)+Dn]k−1 + 1}Theorem 5. If F is in the NBU-t0 class, then ˆFconverges uniformly to F with probability of one.

It is well known that the empirical survivalfunction Fn is mean squared, consistent with rateO(n−1) and almost surely uniformly consistentwith rate O[n−1/2(log log n)1/2]. The followingresults show that the RS estimator converges to F

at these same rates when F is in the NBU-t0 class.

Theorem 6. E| ˆF(x)− F (x)|2 =O(n−1).

Theorem 7.

supx≥0| ˆF(x)− F (x)| =O[n−1/2(log log n)1/2]

almost surely.

Reneau and Samaniego [6] also show that,

for fixed x ≥ 0, ˆF(x) has the same asymptoticdistribution as Fn(x), which is

ˆF(x)→ N

(F (x),

1

nF (x)[1− F (x)]

)as n→∞ and, based on the asymptotic distribu-

tion of ˆF(x), the large sample confidence proce-dures are discussed using the bootstrap method.

Reneau and Samaniego [6] presented a simulationstudy showing that the mean-squared errors of theRS estimator are smaller than those of the empiri-cal survival function in all cases. Such a simulationstudy implies that the RS estimator, given in Equa-tion 10.11, estimates the NBU-t0 survival functionmore precisely than does the empirical survivalfunction. For the simulation study, Reneau andSamaniego [6] used three NBU-t0 distributions: alinear failure rate distribution, a Weibull distribu-tion, and

F = exp[−x + (θ/2t0)x2] for 0≤ x < t0

= exp[−x + (θ/2t0)] for x ≥ t0

(10.12)

which is in the NBU-t0 class, but not in theNBU class, Such a distribution was originallyintroduced by Hollander et al. [10]. For all threedistributions, t0 is taken to be equal to one.

Although Reneau and Samaniego [6] proved

that ˆF(x) is the largest NBU-t0 survival function

such that ˆF(x)≤ F (x) for every x ≥ 0 and ˆFstrongly uniformly converges to F at an optimalrate, the weak convergence of the stochastic pro-

cess Wn =√n( ˆF − F ) was not solved. Chang [7]discussed such a weak convergence of Wn. For thispurpose, Chang [7] defined two new subclasses ofsurvival functions as follows.

A survival function is said to be New StrictlyBetter than Used of age t0 if F (x + t0) <

F (x)F (t0) for all x > 0. A survival function issaid to be New is the Same as Used of age t0 ifF (x + t0)= F (x)F (t0) for all x ≥ 0.

Chang [7] showed that the weak convergenceof Wn does not hold in general and establishedsufficient conditions for the weak convergenceto hold. Three important cases are as follows.(1) If the underlying survival function F is fromthe subclass of New Strictly Better than Used ofage t0 (e.g. gamma and Weibull distributions),then Wn converges weakly with the same limitingdistribution as that of the empirical process. (2) IfF is from the subclass of New is the Same as Usedof age t0 (e.g. exponential distributions), then theweak convergence of Wn holds, and the finite-dimensional limiting distribution is estimable.

Page 221: Handbook of Reliability Engineering - pdi2.org

188 Statistical Reliability Theory

Wn does not converge weakly to a Brownianbridge, however. (3) If F satisfies neither (1)nor (2), then the weak convergence of Wn fails tohold. For more detailed discussions on the weakconvergence of Wn, refer to Chang [7].

10.3.2 Chang–Rao Estimator

10.3.2.1 Positively Biased Estimator

Chang and Rao [8] modified the RS estimator topropose several estimators of an NBU-t0 survivalfunction, which is itself in an NBU-t0 class.Derivation of an estimator of an NBU-t0 survivalfunction is motivated by rewriting the NBU-t0condition of Equation 10.1 as

F (x)≥ F (x + t0)

F (t0)for all x ≥ 0

Define ˜F0 = Fn and for i = 1, 2, . . . , we let

˜Fi(x)= Fn(x) if Fn(t0)= 0

=max{ ˜Fi−1(x), [ ˜Fi−1(x + t0)]/ ˜Fi−1(t0)}if Fn(t0) > 0

Then, it is obvious that the functions ˜Fi satisfiesthe inequalities

Fn = ˜F0 ≤ · · · ≤ ˜Fi ≤ ˜Fi+1 ≤ · · · ≤ 1

Define ˜F = limi→∞

˜Fi (10.13)

Then, ˜F is a well-defined function, and Chang and

Rao [8] refer to ˜F as a positively based NBU-t0estimator. The following results show that ˜F isin the NBU-t0 class and is a positively biasedestimator of F .

Theorem 8. The estimator ˜F , defined in Equa-tion 10.13, is an NBU-t0 survival function such that

E[ ˜F(x)− F (x)] ≥ 0 for all x ≥ 0. The empiricalsurvival function Fn is a member of an NBU-t0class if and only if ˜F = Fn.

Proof. See Chang and Rao [8]. �

Under the assumption that F is an NBU-t0survival function with compact support, the

consistency of ˜F and the convergence rate arestated in Theorem 9. For the proof, refer to Changand Rao [8].

Theorem 9. Suppose that F is an NBU-t0 survivalfunction such that T = sup{x : F (x) > 0}<∞.

Then, ˜F converges uniformly to F with probabilityof one.

Theorem 10. Under the same conditionsas in Theorem 9, supx≥0 | ˜F(x)− F (x)| =O[n−1/2(log log n)1/2] almost surely.

10.3.2.2 Geometric Mean Estimator

Let 0≤ α ≤ 1 and define the weighted geometricaverage of the RS estimator and positively biasedestimator, i.e.

ˆFα = [ ˆF ]α[ ˜F ]1−α (10.14)

For α = 1 and α = 0, ˆFα is reduced to the RSestimator and the positively biased estimatorrespectively. Consider the class

N = { ˆFα : 0≤ α ≤ 1}Then, each member of N is a survival functionand is in an NBU-t0 class since

ˆFα(x + t0)

= [ ˆF(x + t0)]α[ ˜F(x + t0)]1−α

≤ [ ˆF(x) ˆF(t0)]α[ ˜F(x) ˜F(t0)]1−α

= [ ˆF(x)]α[ ˜F(x)]1−α[ ˆF(t0)]α[ ˜F(t0)]1−α

= ˆFα(x)ˆFα(t0) (10.15)

which shows that ˆFα satisfies the NBU-t0 conditionof Equation 10.1. The inequality of Equation

10.15 holds since ˆF and ˜F are the membersof the NBU-t0 class. For the estimator ˆFα ofEquation 10.14 to be practical, the value of α must

be estimated, denoted by α, so that ˆFα can be usedas an estimator of F .

Page 222: Handbook of Reliability Engineering - pdi2.org

Class of NBU-t0 Life Distribution 189

Chang and Rao [8] recommend the estimatorˆFα where α satisfies the minimax criterion

| ˆFα − Fn| = inf0≤α≤1

supx≥0| ˆFα(x)− Fn(x)| (10.16)

where Fn is the empirical survival function.The minimizing value α can be found by any

standard grid search program. The estimator ˆFα isreferred to as the geometric mean (GM) estimator.

Theorems 11 and 12 prove that ˆFα is a stronglyconsistent NBU-t0 estimator of F and convergesto F at an optimal rate. The proofs are given inChang and Rao [8].

Theorem 11. The estimator ˆFα converges uni-formly to F with probability of one.

Theorem 12. The estimator ˆFα converges to F

at an optimal rate, i.e. supx≥0 | ˆFα(x)− F (x)| =O[n−1/2(log log n)1/2] almost surely.

Note that neither Theorem 11 nor Theorem 12requires the assumption that F possesses acompact support.

A simulation study is conducted to comparethe RS estimator and the GM estimator. For thispurpose, the simulated values of the RS and GMestimators are used to calculate four measures ofperformance of the two estimation procedures.They are bias, mean-squared error, sup-normand L-2 norm. As an NBU-t0 life distributionfor the simulation study, the distribution givenin Equation 10.12 is used. In summary, thesimulation study shows that, in addition to havinga smaller mean-squared error at all time pointsof practical interest, the GM estimator providesan improved overall estimate of the underlyingNBU-t0 survival function, as measured in terms ofsup-norm and L-2 norm.

10.4 Tests for NBU-t0 LifeDistributionSince the class of NBU-t0 life distributions was firstintroduced in Hollander et al. [10], several non-parametric tests dealing with the NBU-t0 class

have been proposed in the literature. The tests ofexponentiality versus (non-exponential) NBU-t0alternatives on the basis of complete data wereproposed by Hollander et al. [10], Ebrahimi andHabibullah [11], and Ahmad [12]. Hollander et al.[13] also discussed the NBU-t0 test when the dataare censored.

10.4.1 Tests for NBU-t0 AlternativesUsing Complete Data

Let X1, . . . , Xn be a random sample from acontinuous life distribution F . Based on thesecomplete data, the problem of our interest is to test

H0 : F is in C0 versus Ha : F is in Ca

for t0 > 0 being fixed, whereC0 andCa are definedin Equations 10.3 and 10.5 respectively. The nullhypothesis asserts that a new item is as good asa used item of age t0, whereas the alternative Ha

states that a new item has stochastically greaterresidual life than does a used item of age t0.

Examples of situations where it is reasonable touse the NBU-t0 (NWU-t0) tests are as follows.

(i) From experience, cancer specialists believethat a patient newly diagnosed as having a certaintype of cancer has a distinctly smaller chance ofsurvival than does a patient who has survived5 years (= t0) following initial diagnosis. (In fact,such survivors are often designated as “cured”.)The cancer specialists may wish to test theirbeliefs. Censorship may occur at the time of dataanalysis because some patients may still be alive.

(ii) A manufacturer believes that a certaincomponent exhibits “infant mortality”, e.g. hasa decreasing failure rate over an interval [0, t0].This belief stems from experience accumulatedfor similar components. He wishes to determinewhether a used component of age t0 has stochasti-cally greater residual life length than does a newcomponent. If so, he will test over the interval[0, t0] a certain percentage of his output, andthen sell the surviving components of age t0 athigher prices to purchasers who must have high-reliability components (e.g. a spacecraft assem-bler). He wishes to test such a hypothesis to rejector accept his a priori belief.


10.4.1.1 Hollander–Park–Proschan Test

The first non-parametric test of $H_0$ versus $H_a$ was introduced by Hollander et al. [10]. Their test statistic (termed HPP here) is motivated by considering the following parameter as a measure of deviation. Let

$$T(F) \equiv \int_0^\infty [\bar F(x + t_0) - \bar F(x)\bar F(t_0)]\, dF(x) = \int_0^\infty \bar F(x + t_0)\, dF(x) - \tfrac{1}{2}\bar F(t_0) \equiv T_1(F) - T_2(F)$$

Observe that under $H_0$, $T_1(F) = T_2(F)$, and that under $H_a$, $T_1(F) \le T_2(F)$, and thus $T(F) \le 0$. In fact, $T(F)$ is strictly less than zero under $H_a$ if $F$ is continuous. $T(F)$ gives a measure of the deviation of $F$ from its specification under $H_0$, and, roughly speaking, the more negative $T(F)$ is, the greater is the evidence in favor of $H_a$. It is reasonable to replace $F$ in $T(F)$ by the empirical distribution function $F_n$ of $X_1, \ldots, X_n$ and reject $H_0$ in favor of $H_a$ if $T(F_n) = T_1(F_n) - T_2(F_n)$ is too small. Instead of using this test statistic, the HPP test uses the asymptotically equivalent $U$-statistic $T_n$ given in Equation 10.17. Hoeffding's $U$-statistic theory [14] yields asymptotic normality of $T_n$, and the HPP test is based on this large-sample approximation.

Let

$$h_1(x_1, x_2) = \tfrac{1}{2}[\psi(x_1, x_2 + t_0) + \psi(x_2, x_1 + t_0)]$$

and

$$h_2(x_1) = \tfrac{1}{2}\psi(x_1, t_0)$$

be the kernels of degrees 2 and 1 corresponding to $T_1(F)$ and $T_2(F)$ respectively, where $\psi(a, b) = 1$ if $a > b$, and $\psi(a, b) = 0$ if $a \le b$. Let

$$T_n = [n(n-1)]^{-1} \sum{}' \, \psi(X_{\alpha_1}, X_{\alpha_2} + t_0) - [2n]^{-1} \sum_{i=1}^{n} \psi(X_i, t_0) \tag{10.17}$$

where $\sum{}'$ is the sum taken over all $n(n-1)$ sets of two integers $(\alpha_1, \alpha_2)$ such that $1 \le \alpha_i \le n$, $i = 1, 2$, and $\alpha_1 \ne \alpha_2$. To apply Hoeffding's $U$-statistic theory, we let

$$\xi_1^{[1]} = E[h_1(X_1, X_2)h_1(X_1, X_3)] - [T_1(F)]^2$$
$$\xi_2^{[1]} = E[h_1^2(X_1, X_2)] - [T_1(F)]^2$$
$$\xi_1^{[2]} = E[h_2^2(X_1)] - [T_2(F)]^2$$

and

$$\xi^{[1,2]} = E\{[h_1(X_1, X_2) - T_1(F)][h_2(X_1) - T_2(F)]\}$$

Then

$$\operatorname{var}(T_n) = \binom{n}{2}^{-1} \sum_{k=1}^{2} \binom{2}{k}\binom{n-2}{2-k}\, \xi_k^{[1]} + n^{-1}\xi_1^{[2]} - (4/n)\,\xi^{[1,2]}$$

and

$$\sigma^2 = \lim_{n \to \infty} n \operatorname{var}(T_n) = 4\xi_1^{[1]} + \xi_1^{[2]} - 4\xi^{[1,2]} \tag{10.18}$$

It follows from Hoeffding's $U$-statistic theory that if $F$ is such that $\sigma^2 > 0$, where $\sigma^2$ is given in Equation 10.18, then the limiting distribution of $n^{1/2}[T_n - T(F)]$ is normal with mean zero and variance $\sigma^2$. Straightforward calculations yield, under $H_0$,

$$\operatorname{var}_0(T_n) = (n+1)[n(n-1)]^{-1}[(1/12)\bar F(t_0) + (1/12)\bar F^2(t_0) - (1/6)\bar F^3(t_0)]$$

and

$$\sigma_0^2 = (1/12)\bar F(t_0) + (1/12)\bar F^2(t_0) - (1/6)\bar F^3(t_0)$$

Thus, under $H_0$, the limiting distribution of $n^{1/2}T_n$ is normal with mean zero and variance $\sigma_0^2$. Since the null variance $\sigma_0^2$ of $n^{1/2}T_n$ does depend on the unspecified distribution $F$, $\sigma_0^2$ must be estimated from the data. A consistent estimator of $\sigma_0^2$ can be obtained by replacing $F$ by its empirical distribution $F_n$ as follows:

$$\sigma_n^2 = (1/12)\bar F_n(t_0) + (1/12)\bar F_n^2(t_0) - (1/6)\bar F_n^3(t_0) \tag{10.19}$$

where $\bar F_n = 1 - F_n$. Using the asymptotic normality of $T_n$ and Slutsky's theorem, it can be shown, under $H_0$, that $n^{1/2}T_n\sigma_n^{-1}$ is asymptotically $N(0, 1)$. Thus the approximate $\alpha$-level test of


Table 10.3. Asymptotic relative efficiencies

t0     EF1(J,T)   EF1(S,T)   EF2(J,T)   EF2(S,T)   EF3(J,T)   EF3(S,T)
0.2    6.569      14.598     3.554      4.443      0.579      0.433
0.6    2.156      4.79       1.695      2.118      0.253      0.531
1.0    1.342      2.983      1.493      1.866      0          0.145
1.4    1.047      2.328      1.607      2.009      0          0
1.8    0.933      2.074      1.929      2.411      0          0
2.0    0.913      2.030      2.172      2.715      0          0

$H_0$ versus $H_a$ is to reject $H_0$ in favor of $H_a$ if

$$n^{1/2} T_n \sigma_n^{-1} \le -z_\alpha \tag{10.20}$$

where $z_\alpha$ is the upper $\alpha$-percentile point of a standard normal distribution. The test of $H_0$ versus $H_a$ defined by Equation 10.20 is referred to as the HPP test. Analogously, the approximate $\alpha$-level test of $H_0$ against the alternative that a new item has stochastically less residual life length than does a used item of age $t_0$ is to reject $H_0$ if $n^{1/2} T_n \sigma_n^{-1} \ge z_\alpha$.
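
A hedged sketch of the HPP procedure follows: it computes $T_n$ from Equation 10.17, the variance estimate of Equation 10.19, and the one-sided decision of Equation 10.20. The function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def hpp_test(x, t0, alpha=0.05):
    # T_n of Equation 10.17: U-statistic part minus the (2n)^-1 sum
    x = np.asarray(x, dtype=float)
    n = len(x)
    pairs = x[:, None] > (x[None, :] + t0)   # psi(X_a1, X_a2 + t0)
    np.fill_diagonal(pairs, False)           # exclude alpha_1 == alpha_2
    tn = pairs.sum() / (n * (n - 1)) - (x > t0).sum() / (2.0 * n)
    # Null-variance estimate of Equation 10.19, with Fbar_n(t0) plugged in
    fbar = (x > t0).mean()
    sigma_n = np.sqrt(fbar / 12.0 + fbar**2 / 12.0 - fbar**3 / 6.0)
    z = np.sqrt(n) * tn / sigma_n
    return tn, z, z <= -norm.ppf(1.0 - alpha)   # reject H0 => NBU-t0 evidence
```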

Theorem 13. If $F$ is continuous, then the HPP test is consistent against $C_a$, where $C_a$ is the class of non-exponential NBU-$t_0$ life distributions defined in Equation 10.4.

Proof. See Hollander et al. [10]. □

Note that

$$P\{n^{1/2}T_n\sigma_n^{-1} \le -z_\alpha\} = P\{n^{1/2}[T_n - T(F)] \le -z_\alpha\sigma_n - n^{1/2}T(F)\}$$

and for $F \in C_a$, $T(F) < 0$ and $\sigma_n \xrightarrow{p} \sigma_0 < \infty$. Thus, under $H_a$, $P\{n^{1/2}T_n\sigma_n^{-1} \le -z_\alpha\} \ge \alpha$ for sufficiently large $n$, which shows that the HPP test is asymptotically unbiased against $C_a$.

To evaluate the performance of the HPP test, the Pitman asymptotic relative efficiencies are computed for the following three distributions.

1. Linear failure rate distribution:
$$\bar F_1(x; \theta) = \exp\{-[x + (\theta/2)x^2]\}, \quad \theta \ge 0,\ x \ge 0$$

2. Makeham distribution:
$$\bar F_2(x; \theta) = \exp(-\{x + \theta[x + \exp(-x) - 1]\}), \quad \theta \ge 0,\ x \ge 0$$

3.
$$\bar F_3(x; \theta) = \begin{cases} \exp[-\{x - \theta(2t_0)^{-1}x^2\}] & 0 \le \theta \le 1,\ 0 \le x < t_0 \\ \exp[-\{x - \theta t_0/2\}] & 0 \le \theta \le 1,\ x \ge t_0 \end{cases}$$

For $\theta = 0$, $F_1$, $F_2$, and $F_3$ reduce to the exponential distribution. For $\theta > 0$, $F_1$ and $F_2$ are NBU. For $0 < \theta \le 1$, as shown in Example 1, $F_3$ is in $C_a$ but is not NBU.

Since no other tests of $H_0$ versus $H_a$ have been proposed, the HPP test is compared with Hollander and Proschan's NBU test [15] (referred to as the J test) and Bickel and Doksum's test [16] (referred to as the S test). Both the J test and the S test address the null hypothesis $H_0'$: $F$ is exponential, versus the alternative that $F$ is NBU and not exponential. The asymptotic relative efficiency calculations are summarized in Table 10.3, where T refers to the HPP test. The values of zero for $E_{F_3}(J, T)$ and $E_{F_3}(S, T)$ indicate that the J test and the S test are not consistent against $F_3$ when $t_0$ exceeds a certain value. More details are given in Hollander et al. [10].

The HPP test based on the statistic $T_n$ is not intended to be a competitor of the J or S test. The latter tests are designed for smaller classes of alternatives, whereas the HPP test is designed for the relatively large class of alternatives $C_a$. As Table 10.3 indicates, when $F$ is NBU-$t_0$ and is not a member of the smaller classes (such as the NBU class), the HPP test will often be preferred to other tests.

10.4.1.2 Ebrahimi–Habibullah Test

To develop a class of tests for testing $H_0$ versus $H_a$, Ebrahimi and Habibullah [11] consider the parameter

$$\Delta_k(F) = \int_0^\infty [\bar F(x + kt_0) - \bar F(x)\{\bar F(t_0)\}^k]\, dF(x) = \Delta_{1k}(F) - \Delta_{2k}(F)$$

where

$$\Delta_{1k}(F) = \int_0^\infty \bar F(x + kt_0)\, dF(x), \qquad \Delta_{2k}(F) = \tfrac{1}{2}\{\bar F(t_0)\}^k$$

$\Delta_k(F)$ is motivated by the fact that the NBU-$t_0$ property implies the weaker property $\bar F(x + kt_0) \le \bar F(x)[\bar F(t_0)]^k$ for all $x \ge 0$ and $k = 1, 2, \ldots$. They term such a property new better than used of order $kt_0$. Under $H_0$, $\Delta_k(F) = 0$ and under $H_a$, $\Delta_k(F) < 0$. Thus, $\Delta_k(F)$ can be taken as an overall measure of deviation from $H_0$ towards $H_a$. Ebrahimi and Habibullah [11] utilized the $U$-statistic to develop a class of test statistics. Define

$$T_k = T_{1k} - T_{2k}$$

where

$$T_{1k} = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \psi(X_i, X_j + kt_0)$$

and

$$T_{2k} = \left[2\binom{n}{k}\right]^{-1} \sum \psi_k(X_{i_1}, \ldots, X_{i_k})$$

and the sum is over all combinations of $k$ integers $(i_1, \ldots, i_k)$ chosen out of $(1, \ldots, n)$. Here $\psi_k(a_1, \ldots, a_k) = 1$ if $\min_{1 \le i \le k} a_i > t_0$, and zero otherwise.

Note that $T_{1k}$ and $T_{2k}$ are the $U$-statistics corresponding to the parameters $\Delta_{1k}(F)$ and $\Delta_{2k}(F)$ with kernels $\psi$ and $\psi_k$ of degrees 2 and $k$ respectively. Thus, $T_k$ is a $U$-statistic corresponding to $\Delta_k(F)$ and works as a test statistic for $H_0$ versus $H_a$. It follows that $E(T_k) \le 0$ or $E(T_k) \ge 0$ according to whether $F$ is new better than used of age $t_0$ or new worse than used of age $t_0$ respectively. If $k = 1$, $T_k$ reduces to the test statistic of Hollander et al. [10]. The asymptotic normality of $T_k$ is established by applying Hoeffding's $U$-statistic theory [14].
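
A hedged sketch of computing $T_k$ is given below; it follows the displayed formulas directly, using the fact that the $\psi_k$ sum simply counts the $k$-subsets whose minimum exceeds $t_0$. The function name is illustrative.

```python
import numpy as np
from math import comb

def eh_statistic(x, t0, k):
    # Ebrahimi-Habibullah T_k = T_1k - T_2k
    x = np.asarray(x, dtype=float)
    n = len(x)
    iu, ju = np.triu_indices(n, k=1)               # all index pairs i < j
    t1k = (x[iu] > x[ju] + k * t0).sum() / comb(n, 2)
    m = int((x > t0).sum())                        # sample values exceeding t0
    t2k = comb(m, k) / (2.0 * comb(n, k))          # k-subsets with min > t0
    return t1k - t2k
```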

Theorem 14. If $F$ is such that

$$\sigma^2 = \lim_{n \to \infty} n \operatorname{var}(T_k) > 0$$

then the asymptotic distribution of $n^{1/2}[T_k - \Delta_k(F)]$ is normal with mean zero and variance $\sigma^2$.

Corollary 1. Under $H_0$, $n^{1/2}T_k$ is asymptotically normal with mean zero and variance

$$\sigma_0^2 = \tfrac{1}{3}[\bar F(t_0)]^k + \left(k - \tfrac{2}{3} - \tfrac{k^2}{4}\right)[\bar F(t_0)]^{2k} + \tfrac{1}{3}[\bar F(t_0)]^{3k} + \left(\tfrac{k^2}{4} - \tfrac{k}{2}\right)[\bar F(t_0)]^{2k-1} - \tfrac{k}{2}[\bar F(t_0)]^{2k+1}$$

Approximate $\alpha$-level tests of $H_0$ against $H_a$ are obtained by replacing $\bar F(t_0)$ by its consistent estimator $\bar F_n(t_0) = n^{-1}\sum \psi(X_i, t_0)$ in $\sigma_0^2$. It can be easily shown that if $F$ is continuous, then $T_k$ is consistent against $H_a$. The Pitman asymptotic relative efficiencies of $T_k$ for a given $k \ge 2$ relative to the T test of Hollander et al. [10] are evaluated for the following distributions:

$$\bar F_1(x) = \exp(-\{x + \theta[x + \exp(-x) - 1]\}), \quad \theta \ge 0,\ x \ge 0$$

$$\bar F_2(x) = \exp\left(-x + \frac{\theta x^2}{2t_0}\right) I(0 \le x < t_0) + \exp\left(-x + \frac{\theta t_0}{2}\right) I(x \ge t_0), \quad 0 \le \theta \le \tfrac{2}{3}$$

For $\theta = 0$, $F_1$ and $F_2$ reduce to the exponential distribution and thus satisfy $H_0$. For $\theta > 0$, $F_1$ is both NBU and NBU-$t_0$, and $F_2$ is NBU-$t_0$ but not NBU.


Table 10.4. Pitman asymptotic relative efficiency of Tk with respect to T for F1 and F2

t0     (k1, k2)   (EF1(Tk1,T), EF1(Tk2,T))   (k1, k2)   (EF2(Tk1,T), EF2(Tk2,T))
0.2    (2, 16)    (1.154, 3.187)             (2, 5)     (1.138, 1.205)
0.6    (2, 4)     (1.221, 1.634)             (2, 3)     (1.665, 1.779)
1.0    (2, 3)     (1.0, 1.0)                 (2, 2)     (2.638, 2.638)
1.4    (1, 1)     (1.0, 1.0)                 (2, 2)     (4.241, 4.241)
1.8    (1, 1)     (1.0, 1.0)                 (2, 2)     (6.681, 6.681)
2.0    (1, 1)     (1.0, 1.0)                 (2, 2)     (8.271, 8.271)

Table 10.4 gives the numerical values of $E_{F_1}(T_k, T)$ and $E_{F_2}(T_k, T)$ for certain specified values of $t_0$. In both cases, two values of $k$, namely $k_1$ and $k_2$, are reported, where $k_1$ is the smallest value for which $E(T_k, T) > 1$ and $k_2$ is the integer that gives the optimum Pitman asymptotic relative efficiency. Table 10.4 shows that, although the value of $k$ varies, there is always a $k > 1$ for which $T_k$ performs better than $T$ for small $t_0$. Secondly, if the alternative is NBU-$t_0$ but not NBU, then for all $t_0$ there is a $k > 1$ for which $T_k$ performs better than $T$.

10.4.1.3 Ahmad Test

Ahmad [12] considered the problem of testing $H_0$ versus $H_a$ when the point $t_0$ is unknown but can be estimated from the data. He proposed tests for the cases $t_0 = \mu$, the mean of $F$, and $t_0 = \xi_p$, the $p$th percentile of $F$. In particular, when $t_0 = \xi_p$ the proposed test is shown to be distribution free, whereas neither the HPP test nor the Ebrahimi–Habibullah test is distribution free.

Ahmad [12] utilized the fact that $F$ is NBU-$t_0$ if and only if, for all integers $k \ge 1$, $\bar F(x + kt_0) \le \bar F(x)[\bar F(t_0)]^k$, and considered the following parameter as the measure of departure of $F$ from $H_0$. Let

$$\gamma_k(F) = \int_0^\infty [\bar F(x)\bar F^k(t_0) - \bar F(x + kt_0)]\, dF(x) = \tfrac{1}{2}\bar F^k(t_0) - \int_0^\infty \bar F(x + kt_0)\, dF(x) \tag{10.21}$$

When $t_0 = \xi_p$, $\bar F(t_0) = 1 - p$ and thus Equation 10.21 reduces to

$$\gamma_k(F) = \tfrac{1}{2}(1-p)^k - \int_0^\infty \bar F(x + k\xi_p)\, dF(x)$$

In the usual way, $\xi_p$ is estimated by $\hat\xi_p = X_{([np])}$, where $X_{(r)}$ denotes the $r$th-order statistic in the sample. Thus, $\gamma_k(F)$ can be estimated by

$$\gamma_k(F_n) = \tfrac{1}{2}(1-p)^k - \int_0^\infty \bar F_n(x + kX_{([np])})\, dF_n(x) = \tfrac{1}{2}(1-p)^k - n^{-2}\sum_{i=1}^{n}\sum_{j=1}^{n} I(X_i > X_j + kX_{([np])})$$

where $F_n$ ($\bar F_n$) is the empirical distribution (survival) function.

A $U$-statistic equivalent to $\gamma_k(F_n)$ is given by

$$T_k = \tfrac{1}{2}(1-p)^k - \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} I(X_i > X_j + kX_{([np])})$$

The statistic $\hat\gamma_k = \gamma_k(F_n)$ is used to test $H_0$ versus $H_1$: $F$ is NBU-$\xi_p$, for $0 \le p \le 1$. The asymptotic distribution of $\hat\gamma_k$ is stated in the following theorem.
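
The estimator is straightforward to compute; below is a hedged sketch following the double-sum form above. The 1-based order-statistic index follows the text's convention, and the function name is illustrative.

```python
import numpy as np

def ahmad_statistic(x, p, k):
    # gamma_k(F_n) with t0 estimated by the [np]-th order statistic
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    xi_p = xs[max(int(n * p), 1) - 1]        # X([np]), 1-based order statistic
    count = (xs[:, None] > xs[None, :] + k * xi_p).sum()
    return 0.5 * (1.0 - p) ** k - count / n**2
```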

Theorem 15. As $n \to \infty$, $n^{1/2}[\hat\gamma_k - \gamma_k(F)]$ is asymptotically normal with mean zero and variance $\sigma^2 > 0$, provided that $f = F'$ exists and is bounded. Under $H_0$, the variance reduces to $\sigma_0^2 = \tfrac{1}{3}(1-p)^k[1 - (1-p)^k]^2$. Here, $\sigma^2 = V[\bar F(X_1 + k\xi_p) + F(X_1 - k\xi_p)]$.

Proof. See Ahmad [12]. □

Table 10.5. Efficacies of $\hat\gamma_k$ for F1 and F2 (for each k, the first row is for F1 and the second for F2)

k     p = 0.05   p = 0.25   p = 0.50   p = 0.75   p = 0.95
1     0.0919     0.1318     0.1942     0.2618     0.2319
      14.4689    0.1605     0.0022     0.0002
3     0.1093     0.2160     0.2741     0.1484     0.0065
      1.0397     0.0002     0.0002
5     0.1266     0.2646     0.1986     0.0288     0.0001
      0.2346     0.0002
7     0.1436     0.2750     0.1045     0.0037
      0.0721     0.0002

Although there are no other tests in the literature to compare with the Ahmad test for the testing problem of $H_0$ versus $H_1$, the Pitman asymptotic efficacies have been calculated for two members of the NBU-$t_0$ class. The two NBU-$t_0$ life distributions and their calculated efficacies are as follows.

(i) Makeham distribution:
$$\bar F_1(x) = \exp(-\{x + \theta[x + \exp(-x) - 1]\}), \quad \theta \ge 0,\ x \ge 0$$
$$\operatorname{eff}(F_1) = \frac{[2 + 3\log(1-p)^k - 2(1-p)^k]^2 (1-p)^k}{12[1 - (1-p)^k]^2}$$

(ii) Linear failure rate distribution:
$$\bar F_2(x) = \exp[-(x + \theta x^2/2)], \quad \theta \ge 0,\ x \ge 0$$
$$\operatorname{eff}(F_2) = \frac{3(1-p)^k[\log(1-p)^k]^2[1 + \log(1-p)^k]^2}{64[1 - (1-p)^k]^2}$$

Table 10.5 offers some values of the above efficacies for some choices of $k$ and $p$. Unreported entries in the table are zero to four decimal places. Note that, for $F_1$, the efficacy is increasing in $k$. It also increases in $p$ initially, and then decreases for large $p$. For $F_2$, however, the efficacy is decreasing in both $k$ and $p$.

When $t_0 = \mu$, $\gamma_k(F)$ of Equation 10.21 becomes

$$\gamma_k(F) = \tfrac{1}{2}\bar F^k(\mu) - \int_0^\infty \bar F(x + k\mu)\, dF(x)$$

Replacing $\mu$ by $\bar X$, and $F$ ($\bar F$) by its corresponding empirical distribution (survival) function $F_n$ ($\bar F_n$), $\gamma_k(F)$ can be estimated by

$$\gamma_k(F_n) = \tfrac{1}{2}\bar F_n^k(\bar X) - \int_0^\infty \bar F_n(x + k\bar X)\, dF_n(x)$$

The equivalent $U$-statistic estimator is

$$T_k = \tfrac{1}{2}T_{1k} - T_{2k}$$

where

$$T_{1k} = \binom{n}{k}^{-1} \sum_{1 \le i_1 < \cdots < i_k \le n} I\{\min(X_{i_1}, \ldots, X_{i_k}) > \bar X\}$$

$$T_{2k} = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} I(X_i > X_j + k\bar X)$$

Theorem 16. As $n \to \infty$, $n^{1/2}[T_k - \gamma_k(F)]$ is asymptotically normal with mean zero and variance $\sigma^2$, provided that $f = F'$ exists and is bounded. Under $H_0$, the variance reduces to

$$\sigma_0^2 = \tfrac{1}{3}[\bar F(\mu)]^k + \left(k - \tfrac{2}{3} - \tfrac{k^2}{4}\right)[\bar F(\mu)]^{2k} + \tfrac{1}{3}[\bar F(\mu)]^{3k} + \left(\tfrac{k^2}{4} - \tfrac{k}{2}\right)[\bar F(\mu)]^{2k-1} - \tfrac{k}{2}[\bar F(\mu)]^{2k+1}$$

Here, $\sigma^2 = \lim_{n \to \infty} n \operatorname{var}(T_k) > 0$.

Proof. See Ahmad [12]. □

10.4.2 Tests for NBU-t0 Alternatives Using Incomplete Data

In many practical situations the data are incomplete due to random censoring. For example, in a clinical study, patients may withdraw from the study, or they may not yet have experienced the end-point event before the completion of the study. Thus, instead of observing a complete sample $X_1, \ldots, X_n$ from the life distribution $F$, we are able to observe only the pairs $(Z_i, \delta_i)$, $i = 1, \ldots, n$, where

$$Z_i = \min(X_i, Y_i)$$

and

$$\delta_i = \begin{cases} 1 & \text{if } Z_i = X_i \\ 0 & \text{if } Z_i = Y_i \end{cases}$$

Here, we assume that $Y_1, \ldots, Y_n$ are i.i.d. according to the continuous censoring distribution $H$ and that the $X$ and $Y$ values are mutually independent. Observe that $\delta_i = 1$ means that the $i$th observation is uncensored, whereas $\delta_i = 0$ means that the $i$th observation is censored on the right by $Y_i$. The hypothesis $H_0$ against $H_a$ of Section 10.4.1 is tested on the basis of the incomplete observations $(Z_1, \delta_1), \ldots, (Z_n, \delta_n)$. Several consistent estimators of $F$ based on incomplete observations have been developed by Kaplan and Meier [17], Susarla and Van Ryzin [18], and Kitchin et al. [19], among others. To develop the NBU-$t_0$ test under a randomly censored model, the Kaplan–Meier [17] estimator (KME) is utilized.

Under the assumption that $F$ is continuous, the KME is given by

$$\hat{\bar F}_n(t) = \prod_{\{i:\, Z_{(i)} \le t\}} [(n-i)(n-i+1)^{-1}]^{\delta_{(i)}}, \quad t \in [0, Z_{(n)})$$

where $Z_{(0)} \equiv 0 < Z_{(1)} < \cdots < Z_{(n)}$ denote the ordered $Z$ values and $\delta_{(i)}$ is the $\delta$ corresponding to $Z_{(i)}$. Here, we treat $Z_{(n)}$ as an uncensored observation, whether it is uncensored or censored. When censored observations are tied with uncensored observations, our convention for ordering the $Z$ values is to treat the uncensored observations as preceding the censored observations.
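
As a hedged sketch, the KME in the product form above can be computed as follows; the tie-breaking and the treatment of $Z_{(n)}$ follow the conventions just stated, and the function name is illustrative.

```python
import numpy as np

def km_survival(z, delta):
    # Kaplan-Meier estimate of the survival function at the ordered Z values
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    order = np.lexsort((1 - delta, z))    # sort by Z; uncensored first on ties
    d = delta[order].copy()
    d[-1] = 1                             # treat Z(n) as uncensored, as above
    n = len(z)
    i = np.arange(1, n + 1)
    surv = np.cumprod(((n - i) / (n - i + 1.0)) ** d)  # value just after Z(i)
    return z[order], d, surv
```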

Large-sample properties of $\hat{\bar F}_n$ have been studied by Efron [20], Breslow and Crowley [21], Peterson [22], and Gill [23]. In the sequel, $\hat{\bar F}_n$ will denote the KME and $\hat F_n = 1 - \hat{\bar F}_n$. Hollander et al. [13] form the test statistic by replacing $F$ in $T(F)$, defined in Section 10.4.1.1, by $\hat F_n$, the KME of $F$, i.e.

$$T(\hat F_n) = \int_0^\infty \hat{\bar F}_n(x + t_0)\, d\hat F_n(x) - \tfrac{1}{2}\hat{\bar F}_n(t_0)$$

For computational purposes, we may write

$$T_n^c \equiv T(\hat F_n) = \sum_{i=1}^{n} \left( \prod_{\{k:\, Z_{(k)} \le Z_{(i)} + t_0\}} [(n-k)(n-k+1)^{-1}]^{\delta_{(k)}} \right) \left\{ \prod_{r=1}^{i-1} [(n-r)(n-r+1)^{-1}]^{\delta_{(r)}} \right\} \{1 - [(n-i)(n-i+1)^{-1}]^{\delta_{(i)}}\} - \frac{1}{2} \prod_{\{k:\, Z_{(k)} \le t_0\}} [(n-k)(n-k+1)^{-1}]^{\delta_{(k)}}$$

(the $i$th summand is $\hat{\bar F}_n(Z_{(i)} + t_0)$ times the KME mass at $Z_{(i)}$).
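
Equivalently, $T_n^c$ can be evaluated directly from the KME, as in the hedged sketch below; it reuses the km_survival helper from the previous sketch, and the step-function evaluation is an assumption of the illustration.

```python
import numpy as np

def tn_censored(z, delta, t0):
    zs, d, surv = km_survival(z, delta)                 # from the sketch above
    jumps = np.concatenate(([1.0], surv[:-1])) - surv   # KME mass at each Z(i)
    def sbar(t):                                        # step evaluation of KME
        idx = np.searchsorted(zs, t, side="right") - 1
        return 1.0 if idx < 0 else surv[idx]
    # integral term: Fbar_hat(Z(i) + t0) weighted by the mass at Z(i)
    term = sum(sbar(zi + t0) * ji for zi, ji in zip(zs, jumps))
    return term - 0.5 * sbar(t0)
```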

Asymptotic normality of $n^{1/2}[T_n^c - T(F)]$ can be established under the following assumptions.

Assumption 1. The support of both $F$ and $H$ is $[0, \infty)$.

Assumption 2. $\sup\{[\bar F(x)]^{1-\varepsilon}[\bar H(x)]^{-1},\ x \in [0, \infty)\} < \infty$, for some $\varepsilon \ge 0$.

Assumption 2 restricts the amount of censoring. For example, in the proportional hazards model where $\bar H(t) = [\bar F(t)]^\beta$, Assumption 2 implies that $\beta < 1$, which means that the expected proportion of censored observations, $\beta/(\beta + 1)$, must be less than 0.5.

Let $\bar K(t) = \bar F(t)\bar H(t)$, $t \in [0, \infty)$, and let $\{\phi(t), t \in (0, \infty)\}$ be a Gaussian process with


mean zero and covariance kernel given by

$$E[\phi(t)\phi(s)] = \bar F(t)\bar F(s) \int_0^s [\bar K(z)\bar F(z)]^{-1}\, dF(z), \quad 0 \le s < t < \infty$$

Theorem 17. Assume that Assumptions 1 and 2 hold. Then $n^{1/2}[T_n^c - T(F)]$ converges in distribution to a normal random variable with mean zero and variance $\sigma_c^2$, where

$$\sigma_c^2 = \int\!\!\int E\left\{\left[\phi(x + t_0) - \phi(x - t_0) - \tfrac{1}{2}\phi(t_0)\right]\left[\phi(u + t_0) - \phi(u - t_0) - \tfrac{1}{2}\phi(t_0)\right]\right\}\, dF(x)\, dF(u)$$

Proof. See Hollander et al. [13]. □

The null asymptotic mean of $n^{1/2}T_n^c$ is zero, independent of the distributions $F$ and $H$. However, the null asymptotic variance of $n^{1/2}T_n^c$ depends on both $F$ and $H$, and thus it must be estimated from the incomplete observations $(Z_1, \delta_1), \ldots, (Z_n, \delta_n)$. Under $H_0$, it can be shown that the null asymptotic variance of $n^{1/2}T_n^c$ is

$$\sigma_{c0}^2 = \tfrac{1}{4}\bar F^2(t_0) \int_0^\infty \bar F^3(z)[\bar K(z + t_0)]^{-1}\, dF(z) + \tfrac{1}{4}\bar F^2(t_0) \int_0^\infty \bar F^3(z)[\bar K(z)]^{-1}\, dF(z) - \tfrac{1}{2}\bar F^4(t_0) \int_0^\infty \bar F^3(z)[\bar K(z + t_0)]^{-1}\, dF(z) \tag{10.22}$$

If there is no censoring, i.e. if $\bar K(z) = \bar F(z)$ for $z \in [0, \infty)$, then $\sigma_{c0}^2$ reduces to $\tfrac{1}{12}\bar F(t_0) + \tfrac{1}{12}\bar F^2(t_0) - \tfrac{1}{6}\bar F^3(t_0)$, agreeing with the null asymptotic variance $\sigma_0^2$ of $n^{1/2}T_n$ for the complete-data case. By a change of variable in the first and third terms of Equation 10.22, $\sigma_{c0}^2$ can be simplified to

$$\sigma_{c0}^2 = \tfrac{1}{4}[\bar F(t_0)]^{-2} \int_{t_0}^\infty \bar F^3(u)[\bar K(u)]^{-1}\, dF(u) + \tfrac{1}{4}\bar F^2(t_0) \int_0^\infty \bar F^3(u)[\bar K(u)]^{-1}\, dF(u) - \tfrac{1}{2} \int_{t_0}^\infty \bar F^3(u)[\bar K(u)]^{-1}\, dF(u) \tag{10.23}$$

Let $Z_{(1)} \le \cdots \le Z_{(n)}$ denote the ordered $Z$ values, let $K_n$ denote the empirical distribution function of the $Z$ values, so that $nK_n(t) = $ (number of $Z$ values $\le t$), and let $\bar K_n = 1 - K_n$. Since $T_n^c = 0$ when $Z_{(n)} \le t_0$, we assume that our sample is such that $Z_{(n)} > t_0$. Replacing $\bar F$ by $\hat{\bar F}_n$, $\bar K$ by $\bar K_n$, and $\infty$ by $Z_{(n)}$ in Equation 10.23 yields the estimate $\sigma_{cn}^2$ defined by Equation 10.24:

$$\sigma_{cn}^2 = \left\{\tfrac{1}{4}[\hat{\bar F}_n(t_0)]^{-2} - \tfrac{1}{2}\right\} \sum_{\{i:\, t_0 \le W_{(i)} \le Z_{(n)}\}} \hat{\bar F}_n^3(W_{(i)})[\bar K_n(W_{(i)})]^{-1}[\hat F_n(W_{(i)}) - \hat F_n(W_{(i-1)})] + \tfrac{1}{4}\hat{\bar F}_n^2(t_0) \sum_{\{i:\, 1 \le i \le \tau(n)\}} \hat{\bar F}_n^3(W_{(i)})[\bar K_n(W_{(i)})]^{-1}[\hat F_n(W_{(i)}) - \hat F_n(W_{(i-1)})] \tag{10.24}$$

where $W_{(0)} \equiv 0 < W_{(1)} < W_{(2)} < \cdots < W_{(\tau(n))}$ are the ordered observed failure times, and $\tau(n) = \sum_{i=1}^{n} \delta_i$ is the total number of failures among the $n$ observations. In Equation 10.24, when $W_{(i)} = Z_{(n)}$ we redefine $\bar K_n(Z_{(n)})$, changing it from zero to $1/n$, so that Equation 10.24 is well defined.

The approximate $\alpha$-level test of $H_0$ versus $H_a$, which rejects $H_0$ in favor of $H_a$ if $n^{1/2}T_n^c\sigma_{cn}^{-1} \le -z_\alpha$ and accepts $H_0$ otherwise, is called the NBU-$t_0$ test for incomplete data. The approximate $\alpha$-level test of $H_0$ versus the alternative that a new item has a stochastically smaller residual life length than does a used item of age $t_0$ is called the NWU-$t_0$ test for incomplete data. The NWU-$t_0$ test rejects $H_0$ if $n^{1/2}T_n^c\sigma_{cn}^{-1} \ge z_\alpha$ and accepts $H_0$ otherwise.

References

[1] Marshall AH, Proschan F. Classes of distributions applicable in replacement, with renewal theory implications. In: Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, vol. I. University of California Press; 1972. p. 395–415.
[2] Esary JD, Marshall AH, Proschan F. Shock models and wear processes. Ann Prob 1973;1:627–49.
[3] Bryson MC, Siddiqui MM. Some criteria for aging. J Am Stat Assoc 1969;64:1472–83.


[4] Marsaglia G, Tubilla A. A note on the "lack of memory" property of the exponential distribution. Ann Prob 1975;3:353–4.
[5] Barlow RE, Proschan F. Statistical theory of reliability and life testing: probability models. Silver Spring: To Begin With; 1981.
[6] Reneau DM, Samaniego FJ. Estimating the survival curve when new is better than used of a specified age. J Am Stat Assoc 1990;85:123–31.
[7] Chang MN. On weak convergence of an estimator of the survival function when new is better than used of a specified age. J Am Stat Assoc 1991;86:173–8.
[8] Chang MN, Rao PV. On the estimation of a survival function when new is better than used of a specified age. J Nonparametric Stat 1992;2:45–8.
[9] Serfling RJ. Approximation theorems of mathematical statistics. New York: Wiley; 1980.
[10] Hollander M, Park DH, Proschan F. A class of life distributions for aging. J Am Stat Assoc 1986;81:91–5.
[11] Ebrahimi N, Habibullah M. Testing whether a survival distribution is new better than used of a specified age. Biometrika 1990;77:212–5.
[12] Ahmad IA. Testing whether a survival distribution is new better than used of an unknown specified age. Biometrika 1998;85:451–6.
[13] Hollander M, Park DH, Proschan F. Testing whether new is better than used of a specified age, with randomly censored data. Can J Stat 1985;13:45–52.
[14] Hoeffding W. A class of statistics with asymptotically normal distribution. Ann Math Stat 1948;19:293–325.
[15] Hollander M, Proschan F. Testing whether new is better than used. Ann Math Stat 1972;43:1136–46.
[16] Bickel PJ, Doksum KA. Tests for monotone failure rate based on normalized spacings. Ann Math Stat 1969;40:1216–35.
[17] Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958;53:457–81.
[18] Susarla V, Van Ryzin J. Nonparametric Bayesian estimation of survival curves from incomplete observations. J Am Stat Assoc 1976;71:897–902.
[19] Kitchin J, Langberg NA, Proschan F. A new method for estimating life distributions from incomplete data. Florida State University Department of Statistics Technical Report No. 548, 1980.
[20] Efron B. The two sample problem with censored data. In: Proceedings of the 5th Berkeley Symposium, vol. 4, 1967. p. 831–53.
[21] Breslow N, Crowley J. A large sample study of the life table and product limit estimates under random censorship. Ann Stat 1974;2:437–53.
[22] Peterson AV. Expressing the Kaplan–Meier estimator as a function of empirical subsurvival functions. J Am Stat Assoc 1977;72:854–8.
[23] Gill RD. Large sample behaviour of the product limit estimator on the whole line [Preprint]. Amsterdam: Mathematisch Centrum; 1981.


Part III
Software Reliability

11 Software Reliability Models: A Selective Survey and New Directions
   11.1 Introduction
   11.2 Static Models
   11.3 Dynamic Models: Reliability Growth Models for Testing and Operational Use
   11.4 Reliability Growth Modeling with Covariates
   11.5 When to Stop Testing Software
   11.6 Challenges and Conclusions

12 Software Reliability Modeling
   12.1 Introduction
   12.2 Basic Concepts of Stochastic Modeling
   12.3 Black-box Software Reliability Models
   12.4 White-box Modeling
   12.5 Calibration of Model
   12.6 Current Issues

13 Software Availability Theory and its Applications
   13.1 Introduction
   13.2 Basic Model and Software Availability Measures
   13.3 Modified Models
   13.4 Applied Models
   13.5 Concluding Remarks

14 Software Rejuvenation: Modeling and Applications
   14.1 Introduction
   14.2 Modeling-based Estimation
   14.3 Measurement-based Estimation
   14.4 Conclusion and Future Work

15 Software Reliability Management: Techniques and Applications
   15.1 Introduction
   15.2 Death Process Model for Software Testing Management
   15.3 Estimation Method of Imperfect Debugging Probability
   15.4 Continuous State Space Model for Large-scale Software
   15.5 Development of a Software Reliability Management Tool

16 Recent Studies in Software Reliability Engineering
   16.1 Introduction
   16.2 Software Reliability Modeling
   16.3 Generalized Models with Environmental Factors
   16.4 Cost Modeling
   16.5 Recent Studies with Considerations of Random Field Environments
   16.6 Further Reading


Chapter 11
Software Reliability Models: A Selective Survey and New Directions
Siddhartha R. Dalal

11.1 Introduction
11.2 Static Models
    11.2.1 Phase-based Model: Gaffney and Davis
    11.2.2 Predictive Development Life Cycle Model: Dalal and Ho
11.3 Dynamic Models: Reliability Growth Models for Testing and Operational Use
    11.3.1 A General Class of Models
    11.3.2 Assumptions Underlying the Reliability Growth Models
    11.3.3 Caution in Using Reliability Growth Models
11.4 Reliability Growth Modeling with Covariates
11.5 When to Stop Testing Software
11.6 Challenges and Conclusions

11.1 Introduction

Software development, design, and testing have become very intricate with the advent of modern highly distributed systems, networks, middleware, and interdependent applications. The demand for complex software systems has increased more rapidly than the ability to design, implement, test, and maintain them, and the reliability of software systems has become a major concern for our modern society. Within the last decade of the 20th century and the first few years of the 21st century, many reported system outages or machine crashes were traced back to computer software failures. Consequently, recent literature is replete with horror stories due to software problems.

Even discounting the costly "Y2K" problem as a design failure, a problem that occupied tens of thousands of programmers in 1998–99 with costs running to tens of billions of dollars, there have been many other critical failures. Software failures have impaired several high-visibility programs in the space, telecommunications, defense, and health industries. The Mars Climate Orbiter crashed in 1999. The Mars Climate Orbiter Mission Failure Investigation Board [1] concluded that "The 'root cause' of the loss of the spacecraft was the failed translation of English units into metric units in a segment of ground-based, navigation-related mission software, . . . ". Besides the costs involved, it set back the space program by more than a year. Current versions of the Osprey aircraft, developed at a cost of billions of dollars, are not deployed because of software-induced field failures. In the health industry [2], software errors in the sophisticated control systems of the Therac-25 radiation therapy machine claimed several patients' lives in 1985 and 1986. Even in the telecommunications industry, known for its five-nines reliability, the nationwide long-distance network of a major carrier suffered an embarrassing network outage on 15 January 1990, due to a software problem. In 1991, a series of local network outages occurred in a number of US cities due to software problems in central office switches [3].


Software reliability is defined as the probability of failure-free software operation for a specified period of time in a specified environment [4]. The software reliability field discusses ways of quantifying it and using it for improvement and control of the software development process. Software reliability is operationally measured by the number of field failures, or failures seen in development, along with a variety of ancillary information. The ancillary information includes the time at which the failure was found, in which part of the software it was found, the state of the software at that time, the nature of the failure, etc.

Most quality improvement efforts are triggered by lack of software reliability. Thus, software companies recognize the need for systematic approaches to measuring and assuring software reliability, and devote a major share of project development resources to this. Almost a third of the total development budget is typically spent on testing, with the expectation of measuring and improving software reliability. A number of standards have emerged in the area of developing reliable software consistently and efficiently. ISO 9000-3 [5] is the weakest amongst the recognized standards, in that it specifies measurement of field failures as the only required quality metric: ". . . at a minimum, some metrics should be used which represent reported field failures and/or defects from the customer's viewpoint. . . . The supplier of software products should collect and act on quantitative measures of the quality of these software products". The Software Engineering Institute has proposed an elaborate standard called the Software Capability Maturity Model (CMM) [6] that scores software development organizations on multiple criteria and gives a numeric grade from one to five. A similar approach is taken by the SPICE standards, which are prevalent in Europe [7].

Formally, software reliability engineering is the field that quantifies the operational behavior of software-based systems with respect to user requirements concerning reliability. It includes data collection on reliability, statistical estimation and prediction, metrics and attributes of product architecture, design, software development, and the operational environment. Besides its use for operational decisions like deployment, it includes guiding software architecture, development, testing, etc. Indeed, much of the testing process is driven by software reliability concerns, and most applications of software reliability models are to improve the effectiveness of testing. Many current software reliability engineering techniques and practices are detailed by Musa et al. [8] and Lyu [9]. However, in this chapter we take a narrower view and just look at the models used in software reliability, their efficacy and adequacy, without going into details of the interplay between testing and software reliability models.

Though prevalent software reliability models have their genesis in hardware reliability models, there are clearly a number of differences between hardware and software reliability models. Failures in hardware are typically based on the age of the hardware and the stress of the operational environment, whereas failures in software are due to incorrect requirements, design, coding, or the inability to interoperate with other related software. Software failures typically manifest when the software is operated in an environment for which it was not designed or tested. Typically, except for the infant mortality factor in hardware, hardware reliability decreases with age, whereas software reliability increases with age (due to fault fixes).

In this chapter we focus on software reliability models and measurements. A software reliability model specifies the general form of the dependence of the failure process on the principal factors that affect it: fault introduction, fault removal, and the operational environment. During the test phase, the failure rate of a software system is generally decreasing due to the discovery and correction of software faults. With careful record-keeping procedures in place, it is possible to use statistical methods to analyze the historical record. The purpose of these analyses is twofold: (1) to predict the additional time needed to achieve a specified reliability objective; (2) to predict the expected reliability when testing is finished.

Implicit in this discussion is the concept of "time". For some purposes this may be


calendar time, assuming that testing proceeds roughly uniformly; another possibility is to use computer execution time, or some other measure of testing effort. Another implicit assumption is that the software system being tested remains fixed throughout (except for the removal of faults as they are found). This assumption is frequently violated.

Software reliability measurement includes two types of model: static and dynamic reliability estimation, used typically in the earlier and later stages of development respectively. These will be discussed in the following two sections. One of the main weaknesses of many of the models is that they do not take into account ancillary information, like churn in the system during testing. Such a model is described in Section 11.4. A key use of the reliability models is in the area of when to stop testing. An economic formulation is discussed in Section 11.5. Additional challenges and conclusions are stated in Section 11.6.

11.2 Static Models

One purpose of reliability models is to perform reliability prediction in an early stage of software development. This activity determines future software reliability based upon available software metrics and measures. Particularly when field failure data are not available (e.g. the software is in the design or coding stage), the metrics obtained from the software development process and the characteristics of the resulting product can be used to estimate the reliability of the software upon testing or delivery. We discuss two prediction models: the phase-based model by Gaffney and Davis [10] and a predictive development life cycle model from Telcordia Technologies by Dalal and Ho [11].

11.2.1 Phase-based Model: Gaffney and Davis

Gaffney and Davis [10] proposed the phase-based model, which divides the software development cycle into different phases (e.g. requirements review, design, implementation, unit test, software integration, systems test, operation, etc.) and assumes that code size estimates are available during the early phases of development. Further, it assumes that faults found in different phases follow a Rayleigh density function when normalized by the lines of code. Their model makes use of the fault statistics obtained during the early development phases (e.g. requirements review, design, implementation, and unit test) to predict the expected fault densities during a later phase (e.g. system test, acceptance test, and operation).

The key idea is to place the stages of development along a continuous time axis (i.e. $t =$ "0–1" means requirements analysis, and so on) and overlay the Rayleigh density function with a scale parameter. The scale parameter, known as the fault discovery phase constant, is estimated by equating the area under the curve between earlier phases with the observed error rates normalized by the lines of code. This method gives an estimate of the fault density for any later phase. The model also estimates the number of faults in a given phase by multiplying the fault density estimate by the number of lines of code.
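
The sketch below illustrates this estimation scheme under stated assumptions: phases are mapped to unit intervals, observed fault densities are matched to the Rayleigh areas over those intervals by least squares, and all numbers and names (including the assumed total density) are illustrative rather than taken from the original model's data.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rayleigh_cdf(t, sigma):
    # Area under the Rayleigh density from 0 to t, scale parameter sigma
    return 1.0 - np.exp(-t**2 / (2.0 * sigma**2))

def fit_phase_constant(observed_density, total_density):
    # observed_density[i]: faults/KLOC found in phase interval (i, i+1)
    edges = np.arange(len(observed_density) + 1.0)
    def sse(sigma):
        expected = total_density * np.diff(rayleigh_cdf(edges, sigma))
        return np.sum((expected - observed_density) ** 2)
    return minimize_scalar(sse, bounds=(0.1, 10.0), method="bounded").x

# Illustrative densities for requirements review, design, implementation
early = np.array([8.0, 14.0, 11.0])
sigma = fit_phase_constant(early, total_density=40.0)
# Predicted fault density for a later phase occupying the interval (4, 5)
later = 40.0 * (rayleigh_cdf(5.0, sigma) - rayleigh_cdf(4.0, sigma))
```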

This model is clearly motivated by the corresponding model used in hardware reliability, and its predictions are hardwired in the model, based on one parameter. In spite of this criticism, this model is one of the first to leverage information available in earlier development life cycle phases.

11.2.2 Predictive Development Life Cycle Model: Dalal and Ho

In this model the development life cycle is divided into the same phases as in Section 11.2.1. However, unlike in Section 11.2.1, it does not postulate a fixed relationship (i.e. Rayleigh distribution) between the numbers of faults discovered during different phases. Instead, it leverages past releases of similar products to determine the relationships. The relationships are not postulated beforehand, but are determined from data using only a few releases per product. Similarity is measured by


Figure 11.1. 22 products and their releases versus observed (+) and predicted fault density connected by dashed lines. Solid vertical lines are 90% predictive intervals for fault density

using an empirical hierarchical Bayes framework. The number of releases used as data is kept minimal and, typically, only the most recent one or two releases are used for prediction. This is critical, since there are often major modifications to the software development process over time, and these modifications change the interphase relationships between faults. The lack of data is made up for by using as many products as possible that were being developed in a software organization at around the same time. In that sense it is similar to meta-analysis [12], where a lack of longitudinal data is overcome by using cross-sectional data.

Conceptually, the basic assumptions behind this model are as follows:

Assumption 1. Defect rates from different products in the same product life cycle phase are samples from a statistical universe of products coming from that development organization.

Assumption 2. Different releases from a given product are samples from a statistical universe of releases for that product.

Assumption 1 reflects the fact that the products developed within the same organization at the same life cycle maturity are more or less


homogeneous. Defect density for a given product is related to covariates like lines of code and number of faults in previous life cycle phases by use of a regression model, with the coefficients of the regression model assumed to have a normal prior distribution. The homogeneity assumption is minimally restrictive, since the Bayesian estimates we obtain depend increasingly on the data as more data become available. Based on the detailed model described by Dalal and Ho [11], one obtains predictive distributions of the fault density per life cycle phase conditionally on observing some of the previous product life cycle phases. Figure 11.1 shows the power of prediction of this method. On the horizontal axis we have 22 products, each with either one or two releases (some were new products and had no previous release). On the vertical axis we plot the predicted system-test fault density per million lines of code based on all fault information available prior to the system-test phase, along with the corresponding posterior confidence intervals. A dashed line connects the predicted fault densities, and "+" indicates the observed fault density. Except for product number 4, all observed values are quite close to the predicted values.

11.3 Dynamic Models: Reliability Growth Models for Testing and Operational Use

Software reliability estimation determines the current software reliability by applying statistical inference techniques to failure data obtained during system test or during system operation. Since reliability tends to improve over time during the software testing and operation periods because of the removal of faults, the models are also called reliability growth models. They model the underlying failure process of the software, and use the observed failure history as a guideline, in order to estimate the residual number of faults in the software and the test time required to detect them. This can be used to make release and deployment decisions. Most current software reliability models fall into this category. Details of these models can be found in Lyu [9], Musa et al. [8], Singpurwalla and Wilson [13], and Gokhale et al. [14].

11.3.1 A General Class of Models

Now we describe a general class of models. In binomial models the total number of faults is some number $N$; the number found by time $t$ has a binomial distribution with mean $\mu(t) = NF(t)$, where $F(t)$ is the probability of a particular fault being found by time $t$. Thus, the number of faults found in any interval of time (including the interval $(t, \infty)$) is also binomial. $F(t)$ could be any arbitrary cumulative distribution function. Then, a general class of reliability models is obtained by appropriate parameterization of $\mu(t)$ and $N$.

Letting $N$ be Poisson (with some mean $\nu$) gives the related Poisson model; now, the number of faults found in any interval is Poisson, and for disjoint intervals these numbers are independent. Denoting the derivative of $F$ by $F'$, the hazard rate at time $t$ is $F'(t)/[1 - F(t)]$. These models are Markovian but not strongly Markovian, except when $F$ is exponential; minor variations of this case were studied by Jelinski and Moranda [15], Shooman [16], Schneidewind [17], Musa [18], Moranda [19], and Goel and Okumoto [20]. Schick and Wolverton [21] and Crow [22] made $F$ a Weibull distribution; Yamada et al. [23] made $F$ a Gamma distribution; and Littlewood's model [24] is equivalent to assuming $F$ to be Pareto. Musa and Okumoto [25] assumed the hazard rate to be an inverse linear function of time; for this "logarithmic Poisson" model the total number of failures is infinite.

The success of a model is often judged by how well it fits an estimated reliability curve $\mu(t)$ to the observed "number of faults versus time" function. On general grounds, having a good fit may be an overfit and may have little to do with how useful the model is in predicting future faults in the present system, or future experience with another system, unless we can establish statistical relationships between the measurable attributes of the system and the estimated parameters of the fitted models.


Figure 11.2. The observed versus the fitted model

Let us examine the real example plotted in Figure 11.2 from testing a large software system at a telecommunications research company. The system had been developed over years, and new releases were created and tested by the same development and testing groups respectively. In Figure 11.2, the elapsed testing time in staff days $t$ is plotted against the cumulative number of faults found for one of the releases. It is not clear whether there is some "total number" of bugs to be found, or whether the number found will continue to increase indefinitely. However, from data such as that in Figure 11.2, an estimation of the tail of a distribution with a reasonable degree of precision is not possible. We also fit a special case of the general reliability growth model described above, corresponding to $N$ being Poisson and $F$ being exponential.
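
That special case has mean value function $\mu(t) = \nu(1 - \mathrm{e}^{-bt})$, the Goel–Okumoto form. A hedged sketch of fitting it by least squares to cumulative fault counts follows; the data values are illustrative, not those of Figure 11.2.

```python
import numpy as np
from scipy.optimize import curve_fit

def mu(t, nu, b):
    # Goel-Okumoto mean value function: nu total faults, detection rate b
    return nu * (1.0 - np.exp(-b * t))

t = np.array([20.0, 40, 60, 80, 100, 120, 140])   # staff days (illustrative)
k = np.array([12.0, 21, 27, 31, 34, 36, 37])      # cumulative faults found
(nu_hat, b_hat), _ = curve_fit(mu, t, k, p0=(k[-1], 0.01))
residual = nu_hat - k[-1]    # expected number of faults still undetected
```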

11.3.2 Assumptions Underlying the Reliability Growth Models

Different sets of assumptions can lead to equivalent models. For example, the assumption that, for each fault, the time-to-detection is a random variable with a Pareto distribution, these random variables being independent, is equivalent to assuming that each fault has an exponential lifetime, with these lifetimes being independent, and the rates for the different faults being distributed according to a Gamma distribution (this is Littlewood's model [24]). A single experience cannot distinguish between a model that assumes a fixed but unknown number of faults and a model that assumes this number is random. Little is known about how well the various models can be distinguished.

Most of the published models are based on common underlying assumptions. These commonly include the following.

1. The system being tested remains essentially unchanged throughout testing, except for the removal of faults as they are found. Some models allow for the possibility that faults are not corrected perfectly. Miller [26] assumed that if faults are not removed as they are found, then each fault causes failures according to a stationary Poisson process; these processes are independent of one another and may have different rates. By specifying the rates, many of the models mentioned in Section 11.3.1 can be obtained. Gokhale et al. [27] dealt with the case of imperfect debugging.

2. Removing a fault does not affect the chance that a different fault will be found.

3. "Time" is measured in such a way that testing effort is constant. Musa [18] reported that execution time (processor time) is the most successful way to measure time. Others prefer testing effort measured in staff hours [28].

4. The model is Markovian, i.e. at each time, the future evolution of the testing process depends only on the present state (the current time, the number of faults found and remaining, and the overall parameters of the model) and not on details of the past history of the testing process. In some models a stronger property holds, namely that the future depends only on the current state and the parameters, and not on the current time. We call this the "strong Markov" property.

5. All faults are of equal importance (contribute equally to the failure rate). Some extensions have been discussed by Dalal and Mallows [29] in the context of when to stop testing.

6. At the start of testing, there is some finite total number of faults, which may be fixed (known or unknown) or random; if random, its distribution may be known or of known form


with unknown parameters. Alternatively, the "number of faults" is not assumed finite, so that, if testing continues indefinitely, an ever-increasing number of faults will be found.

7. Between failures, the hazard rate follows a known functional form; this is often taken to be simply a constant.

11.3.3 Caution in Using Reliability Growth Models

Here we would also like to offer some caution to readers regarding the usage of software reliability models.

In fitting any model to a given data set, one must first bear in mind the model's assumptions. For example, if a model assumes that a fixed number of software faults will be removed within a limited period of time, but in the observed process the number of faults is not fixed (e.g. new faults are inserted due to imperfect fault removal, or new code is added), then one should use another model that does not make this assumption.

A second model limitation and implementation issue concerns future predictions. If the software is being operated in a manner different from the way it is tested (e.g. new capabilities are being exercised that were not tested before), the failure history of the past will not reflect these changes, and poor predictions may result. Development of operational profiles, as proposed by Musa et al. [30], is very important if one wants to predict future reliability accurately in the user's environment.

Another issue relates to the software development environment. Most reliability growth models are primarily applicable from testing onward: the software is assumed to have matured to the point that extensive changes are not being made. These models cannot have a credible performance when the software is changing and churn of the software code is observed during testing. In this case, the techniques described in Section 11.4 should be used to handle the dynamic testing situation.

11.4 Reliability Growth Modeling with Covariates

We have so far discussed a number of different kinds of reliability model of varying degrees of plausibility, including phase-based models depending upon a Rayleigh curve, growth models like the Goel–Okumoto model, etc. The growth models take as input either failure time or failure count data, and fit a stochastic process model to reflect reliability growth. The differences between the models lie principally in the assumptions made on the underlying stochastic process generating the data.

However, most existing models assume that no explanatory variables are available. This assumption is assuredly simplistic when the models are used to model a testing process for all but small systems involving short development and life cycles. For large systems (e.g. greater than 100 KNCSL, i.e. thousands of non-commentary source lines) there are variables, other than time, that are very relevant. For example, it is typically assumed that the number of faults (found and unfound) in a system under test remains stable during testing. This implies that the code remains frozen during testing. However, this is rarely the case for large systems, since aggressive delivery cycles force the final phases of development to overlap with the initial stages of system test. Thus, the size of code and, consequently, the number of faults in a large system can vary widely during testing. If these changes in code size are not considered as a covariate, one is, at best, likely to have an increase in variability and a loss in predictive performance; at worst, a poorly fitting model with unstable parameter estimates is likely. We briefly describe a general approach proposed by Dalal and McIntosh [28] for incorporating covariates, along with a case study dealing with reliability modeling during product testing when code is changing.

Example 1. Consider a new release of a large telecommunications system with approximately 7 million NCSL and 300 KNCNCSL (i.e. thousands of lines of non-commentary new or changed source lines). For a faster delivery cycle, the source code used for system test was updated every night throughout the test period. At the end of each of 198 calendar days in the test cycle, the number of faults found, the NCNCSL, and the staff time spent on testing were collected. Figure 11.3 portrays the growth of the system in terms of NCNCSL and of faults against staff time. The corresponding numerical data are provided in Dalal and McIntosh [28].

Figure 11.3. Plots of module size (NCNCSL) versus staff time (days) for a large telecommunications software system (top). Observed and fitted cumulative faults versus staff time (bottom). The dotted line (barely visible) represents the fitted model, the solid line represents the observed data, and the dashed line is the extrapolation of the fitted model

Assume that the testing process is observed at times $t_i$, $i = 0, \ldots, h$, and that at any given time the amount of time it takes to find a specific bug is exponential with rate $m$. At time $t_i$, the total number of faults remaining in the system is Poisson with mean $l_{i+1}$, and NCNCSL is increased by an amount $C_i$. This change adds a Poisson number of faults with mean proportional to $C_i$, say $qC_i$. These assumptions lead to the mass balance equation, namely that the expected number of faults in the system at $t_i$ (after possible modification) is the expected number of faults in the system at $t_{i-1}$ adjusted by the expected number found in the interval $(t_{i-1}, t_i)$, plus the faults introduced by the changes made at $t_i$:

$$l_{i+1} = l_i\, \mathrm{e}^{-m(t_i - t_{i-1})} + qC_i$$

for $i = 1, \ldots, h$. Note that $q$ represents the number of new faults entering the system per additional NCNCSL, and $l_1$ represents the number of faults in the code at the start of system test. Both of these parameters make it possible to differentiate between the new code added in the current release and the older code. For the example, the estimated parameters are $q = 0.025$, $m = 0.002$, and $l_1 = 41$. The fitted and the observed data are plotted against staff time in Figure 11.3 (bottom). The fit is evidently very good. Of course, assessing the model on independent or new data is required for proper validation.
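
A hedged sketch of this mass-balance recursion with a Poisson-likelihood fit is given below; taking the per-interval fault count to have mean $l_i[1 - \mathrm{e}^{-m(t_i - t_{i-1})}]$ is an assumption consistent with the exponential detection times above, and all data arrays are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def expected_found(params, dt, C):
    # Mass balance: l' = l*exp(-m*dt) + q*C after each interval
    m, q, l1 = params
    l, means = l1, []
    for dti, Ci in zip(dt, C):
        means.append(l * (1.0 - np.exp(-m * dti)))   # mean faults found
        l = l * np.exp(-m * dti) + q * Ci            # update remaining faults
    return np.array(means)

def neg_log_lik(params, dt, C, found):
    lam = np.clip(expected_found(params, dt, C), 1e-9, None)
    return np.sum(lam - found * np.log(lam))         # Poisson NLL (no constant)

# Placeholder data: daily intervals, NCNCSL added, and faults found per day
dt = np.ones(198)
C = np.full(198, 1500.0)
found = np.random.default_rng(1).poisson(2.0, 198)
fit = minimize(neg_log_lik, x0=(0.002, 0.025, 41.0),
               args=(dt, C, found), method="Nelder-Mead")
m_hat, q_hat, l1_hat = fit.x
```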

Now we examine the efficacy of creating a statistical model. The estimate of $q$ in the example is highly significant, both statistically and practically, showing the need for incorporating changes in NCNCSL as a covariate. Its numerical value implies that for every additional 10 000 NCNCSL added to the system, 25 faults are being added as well. For these data, the predicted number of faults at the end of the test period is Poisson distributed with mean 145. Dividing this quantity by the total NCNCSL gives 4.2 per 10 000 NCNCSL as an estimated field fault density. These estimates of the incoming and outgoing quality are valuable in judging the efficacy of system testing and for deciding where resources should be allocated to improve the quality. Here, for example, system testing was effective, in that it removed 21 of every 25 faults. However, it raises another issue: 25 faults per 10 000 NCNCSL entering system test may be too high, and a plan ought to be considered to improve the incoming quality.

None of the above conclusions could have been made without using a statistical model. These conclusions are valuable for controlling and improving the process. Further, for this analysis it was essential to have a covariate other than time.

11.5 When to Stop Testing Software

Dynamic reliability growth models can be used to make decisions about when to stop testing.


Software testing is a necessary but expensive process, consuming one-third to one-half of the cost of a typical development project. Testing a large software system costs thousands of dollars per day. Overzealous testing can lead to a product that is overpriced and late to market, whereas fixing a fault in a released system is usually an order of magnitude more expensive than fixing the fault in the testing laboratory. Thus, the question of how much to test is an important economic one. We discuss an economic formulation of the "when to stop testing" issue as proposed by Dalal and Mallows [29, 31]. Other formulations have been proposed by Dalal and Mallows [32] and by Singpurwalla [33].

Like many other reliability models, the Dalal and Mallows stochastic model assumes that there are $N$ (unknown) faults in the software, that the times to find faults are observable and are i.i.d. exponential with rate $m$, that $N$ is Poisson($l$), and that $l$ is Gamma($a, b$). Further, their economic model defines the cost of testing at time $t$ to be $ft - cK(t)$, where $K(t)$ is the number of faults observed up to time $t$ and $f$ is the cost of operating the testing laboratory per unit time. The constant $c$ is the net cost of fixing a fault after rather than before release. Under somewhat more general assumptions, Dalal and Mallows [29] found the exact optimal stopping rule. The structure of the exact rule, which is based on stochastic dynamic programming, is rather complex. However, for large $N$, which is necessarily the case for large systems, the optimal stopping rule is: stop as soon as $f(\mathrm{e}^{mt} - 1)/(mc) \ge K(t)$. Besides the economic guarantee, this rule gives a guarantee on the number of remaining faults, namely that this number has a Poisson distribution with mean $f/(mc)$. Thus, instead of determining the ratio $f/c$ from economic considerations, we can choose it so that there are probabilistic guarantees on the number of remaining faults. Some practitioners may find that this probabilistic guarantee on the number of remaining faults is more relevant in their application. (See Dalal and Mallows [32] for a more detailed discussion.) Finally, by using reasoning similar to that used in deriving equation (4.5) of Dalal and Mallows [29], it can be shown

that the current estimate of the additional time required for testing, $\Delta t$, is given by

$$\Delta t = \frac{1}{m} \log\left[\frac{cmK(t)}{f(\mathrm{e}^{mt} - 1)}\right]$$

For applications of this we refer the reader to Dalal and Mallows [31].
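
As a hedged sketch, the large-$N$ stopping boundary and the $\Delta t$ estimate above translate directly into code; the parameter values in the example are invented for illustration.

```python
import numpy as np

def should_stop(t, K_t, f, c, m):
    # Stop as soon as f*(exp(m*t) - 1)/(m*c) >= K(t)
    return f * (np.exp(m * t) - 1.0) / (m * c) >= K_t

def additional_test_time(t, K_t, f, c, m):
    # Delta-t estimate; zero once the stopping boundary has been reached
    ratio = c * m * K_t / (f * (np.exp(m * t) - 1.0))
    return np.log(ratio) / m if ratio > 1.0 else 0.0

# Illustrative values: 120 staff days of testing, 95 faults found so far
print(should_stop(120.0, 95, f=1.0, c=50.0, m=0.01))            # keep testing
print(additional_test_time(120.0, 95, f=1.0, c=50.0, m=0.01))   # ~302 days
```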

11.6 Challenges and Conclusions

Software reliability modeling and measurement have drawn quite a bit of attention recently in various industries due to concerns about the quality of software. Many reliability models have been proposed, many success stories have been reported, several conferences and forums have been formed, and much project experience has been shared.

In spite of this, there are many challenges in getting widespread use of software reliability models. Part of the challenge is that testing and other activities are not as compartmentalized as assumed in the models. As discussed in Section 11.4, code churn constantly takes place during testing, and, except for the Dalal and McIntosh model described in Section 11.4, there is very little work in that area. Further, for decision-making purposes during the testing and deployment phases one would like to have a quick estimate of the system reliability. Waiting to collect a substantial amount of data before being able to fit a model is not feasible in many settings. Leveraging information from the early phases of the development life cycle to come up with a quick reliable model would ameliorate this difficulty. An extension of the approach described in Section 11.2.2 seems to be needed. Looking to the future, it would also be worthwhile incorporating the architecture of the system to come up with preliminary estimates; for a survey of architecture-based reliability models, see Goševa-Popstojanova and Trivedi [34]. Finally, to a great extent, the reliability growth models as used in the field do not leverage information


about the test cases used. One of the reasons is that each test case is considered to be unique. However, as we move into the model-based testing area [35], test case creation is supported by an underlying meta-model of the use cases and the constraints imposed by users. Creating new reliability models that leverage these meta-models would be important.

In conclusion, in this chapter we have described key software reliability models for the early stages, as well as for the test and operational phases, and have given some examples of their uses. We have also proposed some new research directions useful to practitioners, which will lead to wider use of software reliability models.

References

[1] Mars Climate Orbiter Mishap Investigation Board Phase I Report, 1999, NASA, ftp://ftp.hq.nasa.gov/pub/pao/reports/1999/MCO_report.pdf

[2] Lee L. The day the phones stopped: how people get hurtwhen computers go wrong. New York: Donald I. Fine,Inc.; 1992.

[3] Dalal SR, Horgan JR, Kettenring JR. Reliable software andcommunication: software quality, reliability, and safety.IEEE J Spec Areas Commun 1993;12:33–9.

[4] Institute of Electrical and Electronics Engineers.ANSI/IEEE standard glossary of software engineeringterminology, IEEE Std. 729-1991.

[5] ISO 9000-3. Quality management and quality assurancestandard—part 3: guidelines for the application ofISO 9001 to the development, supply and maintenance ofsoftware. Switzerland: ISO; 1991.

[6] Paulk M, Curtis W, Chrissis M, Weber C. Capabilitymaturity model for software, version 1.1, CMU/SEI-93-TR-24. Carnegie Mellon University, Software EngineeringInstitute, 1993.

[7] Emam K, Jean-Normand D, Melo W. SPICE: the theoryand practice of software process improvement andcapability determination. IEEE Computer Society Press;1997.

[8] Musa JD, Iannino A, Okumoto K. Software reliability—measurement, prediction, application. New York:McGraw-Hill; 1987.

[9] Lyu MR, editor. Handbook of software reliabilityengineering. New York: McGraw-Hill; 1996.

[10] Gaffney JD, Davis CF. An approach to estimating softwareerrors and availability. SPC-TR-88-007, version 1.0, 1988.[Also in Proceedings of the 11th Minnowbrook Workshopon Software Reliability.]

[11] Dalal SR, and Ho YY. Predicting later phase faultsknowing early stage data using hierarchical Bayesmodels. Technical Report, Telcordia Technologies, 2000.

[12] Thomas D, Cook T, Cooper H, Cordray D, Hartmann H,Hedges L, Light R, Louis T, Mosteller F. Meta-analysisfor explanation: a casebook. New York: Russell SageFoundation; 1992.

[13] Singpurwalla ND, Wilson SP.Software reliability model-ing. Int Stat Rev 1994;62(3):289–317.

[14] Gokhale S, Marinos P, Trivedi K. Important milestones insoftware reliability modeling. In: Proceedings of SoftwareEngineering and Knowledge Engineering (SEKE 96),1996. p.345–52.

[15] Jelinski Z, Moranda PB. Software reliability research. In:Statistical computer performance evaluation. New York:Academic Press; 1972. p.465–84.

[16] Shooman ML. Probabilistic models for software relia-bility prediction. In: Statistical computer performanceevaluation. New York: Academic Press; 1972. p.485–502.

[17] Schneidewind NF. Analysis of error processes in com-puter software. Sigplan Note 1975;10(6):337–46.

[18] Musa JD. A theory of software reliability and itsapplication. IEEE Trans Software Eng 1975;SE-1(3):312–27.

[19] Moranda PB. Predictions of software reliability duringdebugging. In: Proceedings of the Annual Reliabilityand Maintainability Symposium, Washington, DC, 1975.p.327–32.

[20] Goel AL, Okumoto K. Time-dependent error-detectionrate model for software and other performance measures.IEEE Trans Reliab 1979;R-28(3):206–11.

[21] Schick GJ, Wolverton RW. Assessment of software relia-bility. In: Proceedings, Operations Research. Wurzburg–Wien: Physica-Verlag; 1973. p.395–422.

[22] Crow LH. Reliability analysis for complex repairablesystems. In: Proschan F, Serfling RJ, editors. Reliabilityand biometry. Philadelphia: SIAM; 1974. p.379–410.

[23] Yamada S, Ohba M, Osaki S. S-shaped reliability growthmodeling for software error detection. IEEE Trans Reliab1983;R-32(5):475–8.

[24] Littlewood B. Stochastic reliability growth: a modelfor fault-removal in computer programs and hardwaredesigns. IEEE Trans Reliab 1981;R-30(4):313–20.

[25] Musa JD, Okumoto K. A logarithmic Poisson executiontime model for software reliability measurement. In: Pro-ceedings Seventh International Conference on SoftwareEngineering, Orlando (FL), 1984. p.230–8.

[26] Miller D. Exponential order statistic models of softwarereliability growth. IEEE Trans Software Eng 1986;SE-12(1):12–24.

[27] Gokhale S, Lyu M, Trivedi K. Software reliability analysisincorporating debugging activities. In: Proceedingsof International Symposium on Software ReliabilityEngineering (ISSRE 98), 1998. p.202–11.

[28] Dalal SR, McIntosh AM. When to stop testing forlarge software systems with changing code. IEEE TransSoftware Eng 1994;20:318–23.

[29] Dalal SR, Mallows CL. When should one stop softwaretesting? J Am Stat Assoc 1988;83:872–9.

Page 244: Handbook of Reliability Engineering - pdi2.org

Software Reliability Models: A Selective Survey and New Directions 211

[30] Musa JD, Fuoco G. Irving N, Kropfl D, Juhlin B. Theoperational profile. In: Lyu MR, editor. Handbook ofsoftware reliability engineering. New York: McGraw-Hill;1996. p.167–218.

[31] Dalal SR, Mallows CL. Some graphical aids for decidingwhen to stop testing software. Software Quality &Productivity [special issue]. IEEE J Spec Areas Commun1990;8:169–75.

[32] Dalal SR, Mallows CL. Buying with exact confidence. AnnAppl Probab 1992;2:752–65.

[33] Singpurwalla ND. Determining an optimal time intervalfor testing and debugging software. IEEE Trans SoftwareEng 1991;17(4):313–9.

[34] Goševa-Popstojanova K, Trivedi K. Architecture-basedsoftware reliability. In: Proceedings of ASSM 2000International Conference on Applied Stochastic SystemModeling, March 2000.

[35] Dalal SR, Jain A, Karunanithi N, Leaton J, Lott C, Pat-ton G. Model-based testing in practice. In: InternationalConference in Software Engineering—ICSE ’99, 1999.


Chapter 12

Software Reliability Modeling

James Ledoux

12.1 Introduction
12.2 Basic Concepts of Stochastic Modeling
12.2.1 Metrics with Regard to the First Failure
12.2.2 Stochastic Process of Times of Failure
12.3 Black-box Software Reliability Models
12.3.1 Self-exciting Point Processes
12.3.1.1 Counting Statistics for a Self-exciting Point Process
12.3.1.2 Likelihood Function for a Self-exciting Point Process
12.3.1.3 Reliability and Mean Time to Failure Functions
12.3.2 Classification of Software Reliability Models
12.3.2.1 0-Memory Self-exciting Point Process
12.3.2.2 Non-homogeneous Poisson Process Model: λ(t; H_t, F_0) = f(t; F_0) and is Deterministic
12.3.2.3 1-Memory Self-exciting Point Process with λ(t; H_t, F_0) = f(N(t), t − T_{N(t)}, F_0)
12.3.2.4 m ≥ 2-Memory
12.4 White-box Modeling
12.5 Calibration of Model
12.5.1 Frequentist Procedures
12.5.2 Bayesian Procedure
12.6 Current Issues
12.6.1 Black-box Modeling
12.6.1.1 Imperfect Debugging
12.6.1.2 Early Prediction of Software Reliability
12.6.1.3 Environmental Factors
12.6.1.4 Conclusion
12.6.2 White-box Modeling
12.6.3 Statistical Issues

12.1 Introduction

This chapter proposes an overview of some aspects of software reliability (SR) engineering. Most systems are now driven by software. Thus, it is well recognized that assessing the reliability of software applications is a major issue in reliability engineering, particularly in terms of cost. But predicting software reliability is not easy. Perhaps the major difficulty is that we are concerned primarily with design faults, which is a very different situation from that tackled by conventional hardware theory. A fault (or bug) refers to a manifestation in the code of a mistake made by the programmer or designer with respect to the specification of the software. Activation of a fault by an input value leads to an incorrect output. Detection of such an event corresponds to the occurrence of a software failure. Input values may be considered as arriving at the software randomly. So although software failures may not be generated stochastically, they may be detected in such a manner. This justifies the use of stochastic models of the underlying


random process that governs the software failures. In Section 12.2 we briefly recall the basic concepts of stochastic modeling for reliability. Two approaches are used in SR modeling. The most prevalent is the so-called black-box approach, in which only the interactions of the software with its environment are considered. Following Gaudoin [1] and Singpurwalla and Wilson [2], in Section 12.3 we use self-exciting point processes as a basic tool to model the failure process. This enables an overview of most of the published SR models. A second approach, called the white-box approach, incorporates information on the structure of the software into the models. This is presented in Section 12.4. Section 12.5 proposes basic techniques for calibrating black-box models. The final section tries to give an account of current practice in SR modeling and points out some challenging issues for future research.

Note that this chapter does not aspire to cover the whole topic of SR engineering. In particular, we do not discuss fault prevention, fault removal, or fault tolerance, which are three methods to achieve reliable software. We focus here on methods to forecast failure times. For a more complete view, we refer the reader to Musa et al. [3] and the two software reliability handbooks [4, 5]. We have used the two recent books by Singpurwalla and Wilson [2] and by Pham [6] to prepare this chapter. We also recommend reading the short paper by Everett et al. [7], which describes, in particular, the available software reliability toolkits (see also Ramani et al. [8]). Finally, the references of the chapter give a good account of those journals that propose research and tutorial papers on SR.

12.2 Basic Concepts of Stochastic Modeling

The reliability of software, as defined by Laprie [9], is a measure of the continuous delivery of correct service by the software in a specified environment. It is a measure of the time to failure.

12.2.1 Metrics with Regard to the First Failure

The metrics of the first time to failure of a system, as defined by Barlow and Proschan [10] and Ascher and Feingold [11], are now recalled. The first failure time is a random variable T with distribution function

F(t) = P{T ≤ t},  t ∈ ℝ

If F has a probability density function (p.d.f.) f, then we define the hazard rate of the random variable T by

r(t) = f(t)/R(t),  t ≥ 0

with R(t) = 1 − F(t) = P{T > t}. We will also use the term failure rate. The function R(t) is called the survivor function of the random variable T. The hazard rate function is interpreted as

r(t) dt ≈ P{t < T ≤ t + dt | T > t}
       ≈ P{a failure occurs in ]t, t + dt] given that no failure occurred up to time t}

Thus, the phenomenon of reliability growth ("wear-out") may be represented by a decreasing (increasing) hazard rate.

When F is continuous, the hazard rate function characterizes the probability distribution of T through the exponentiation formula

R(t) = exp(−∫_0^t r(s) ds)

Finally, the mean time to failure (MTTF) is the expectation E[T] of the waiting time to the first failure. Note that E[T] is also ∫_0^{+∞} R(s) ds.

A basic model for the non-negative random variable T is the Weibull distribution with parameters λ, β > 0:

f(t) = λβ t^{β−1} exp(−λt^β) 1_{]0,+∞[}(t)
R(t) = exp(−λt^β)
r(t) = λβ t^{β−1}
MTTF = (1/λ^{1/β}) ∫_0^{+∞} u^{1/β} exp(−u) du


Note that the hazard rate is increasing for β > 1, decreasing for β < 1, and constant for β = 1. For β = 1 we obtain the exponential model with parameter λ.
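As a small illustration (not part of the original text), the following Python sketch collects these Weibull metrics; it uses the identity ∫_0^{+∞} u^{1/β} exp(−u) du = Γ(1 + 1/β), and the parameter values are arbitrary.

import math

def weibull_metrics(lam, beta):
    # hazard rate, survivor function and MTTF of the Weibull model
    # with parameters lam, beta > 0 (notation of the text)
    r = lambda t: lam * beta * t ** (beta - 1.0)   # hazard rate
    R = lambda t: math.exp(-lam * t ** beta)       # survivor function
    mttf = math.gamma(1.0 + 1.0 / beta) / lam ** (1.0 / beta)
    return r, R, mttf

r, R, mttf = weibull_metrics(lam=0.1, beta=0.8)    # beta < 1: decreasing hazard
print(r(10.0), R(10.0), mttf)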

12.2.2 Stochastic Process of Times of Failure

The failure process can be thought of as a point process, i.e. a sequence of random variables (T_i)_{i≥0} where T_i is the ith failure time of the software (with T_0 = 0). An equivalent point of view is to define the sequence of random variables X_i = T_i − T_{i−1} for i ≥ 1; X_i is the ith inter-failure time. We define the counting process N(·) associated with a point process by

N(t) = Σ_{i≥1} 1_{]0,t]}(T_i)   (N(0) = 0)

N(t) is the number of observed failures up to time t. A point process will refer to any of (T_i), (X_i), or N(·). The standard metrics associated with a counting process are [11]:

• the mean value function: M(t) = E[N(t)]
• the rate of occurrence of failures at time t: ROCOF(t) = dM(t)/dt

In such a context, we define the (conditional) reliability function at time t ≥ 0 by

R_t(s) = P{N(t + s) − N(t) = 0 | N(t), T_1, . . . , T_{N(t)}},  s ≥ 0

This is a measure of the continuous delivery of correct service during the mission interval ]t, t + s]. At time t = T_i, this function is nothing else but the conditional survivor function of the random variable X_{i+1} = T_{i+1} − T_i given T_1, . . . , T_i. It will be denoted by R_i(s). We also define the (conditional) mean time to failure at time t, MTTF(t), by

MTTF(t) = ∫_0^{+∞} R_t(s) ds

The mean time to failure at t = T_i will also be denoted by MTTF_i and is E[X_{i+1} | T_1, . . . , T_i].

During the operational life of the software, repairs are carried out when it fails to perform correctly. In such a case, the time to repair, the time to reboot the system, and other factors affect the dependability of a product. Thus, we may define software availability as a measure of the delivery of correct service with respect to the alternation of correct and incorrect service. Availability is highly dependent on the maintenance policies for the software. We do not go into further detail on dependability in the operational phase. Indeed, we focus here on the reliability attribute of the software, as most of the literature on software reliability modeling does. We refer the reader to chapter 2 of Lyu [4] for an account of dependability during the operational phase.

12.3 Black-box Software Reliability Models

In this section, only dynamic models will be discussed. That is, we are only concerned with models that consider the failure process as a stochastic process. In other words, time is an essential component of the description of the models. Static models, on the other hand, are essentially capture–recapture models. For a good account of static models, we refer to chapter 5 of Xie [12] and to Pham [6]. A recent evaluation of capture–recapture models in the software engineering context is that of Briand et al. [13]. Our overview of dynamic models closely follows chapter 2 of Gaudoin [1], Singpurwalla and coworkers [2, 14], and Snyder and Miller [15]. We assume throughout this section that any corrective action is instantaneous and that each detected fault is removed.

A basic way to represent the time evolution of confidence in software is as follows. At instant zero, the first failure occurs at time t_1 according to a random variable X_1 = T_1 with hazard rate r_1. Given the time T_1 = t_1, we observe a second failure at time t_2 at rate r_2. The function r_2 is the hazard rate of the inter-failure random variable X_2 = T_2 − T_1


given T_1 = t_1. The choice of r_2 is based on the fact that one fault was detected at time t_1. At time t_2, a third failure occurs at t_3 with failure rate r_3. The function r_3 is the hazard rate of the random variable X_3 = T_3 − T_2 given T_1 = t_1, T_2 = t_2, and is selected according to the "past" of the failure process at time t_2: two observed failures at times t_1 and t_2. And so on. It is expected that, owing to the fault removal activity, confidence in the software's ability to deliver a proper service will improve during its life cycle. Therefore, a basic model in SR has to capture the phenomenon of reliability growth. Reliability growth will basically follow from a sequence of inequalities of the form

r_{i+1}(t − t_i) ≤ r_i(t_i)  on t ≥ t_i   (12.1)

and/or from the selection of decreasing hazard rates r_i(·). We illustrate this "modeling process" on the celebrated Jelinski–Moranda (JM) model [16]. We assume a priori that the software includes only a finite number N of faults. The first hazard rate is r_1(t; φ, N) = φN, where φ is some non-negative parameter. From time T_1 = t_1, a second failure occurs with the constant failure rate r_2(t; φ, N) = φ(N − 1), and so on. In a more formal setting, the two parameters N and φ will be encompassed in what we call a background history F_0, which is any background information that we may have about the software. Then "the failure rate" of the software is represented by the function

r_C(t; F_0) = Σ_{i=1}^{+∞} r_i(t − T_{i−1}; F_0) 1_{[T_{i−1},T_i[}(t),  t ≥ 0   (12.2)

which is called the concatenated failure rate function by Singpurwalla and Wilson [2]. An appealing graphical display of a path of this stochastic function is given in Figure 12.1 for the JM model. We can rewrite Equation 12.2 as

λ(t; F_0, N(t), T_1, . . . , T_{N(t)}) = φ[N − N(t)]   (12.3)

Figure 12.1. Concatenated failure rate function for (JM)

The function λ(·) will be called the stochastic intensity of the point process N(·). We see that the stochastic intensity of the JM model is proportional to the residual number of bugs at any time t, and each detection of a failure results in a failure rate whose value decreases by an amount φ. This suggests that no new fault is inserted during a corrective action and that every bug contributes in the same manner to the "failure rate" of the software.
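A path of the JM failure process is easy to simulate, since the ith inter-failure time is exponential with rate φ[N − (i − 1)]. The following Python sketch (parameter values arbitrary) generates one such path; it is merely an illustration of the model just described.

import numpy as np

def simulate_jm(N, phi, seed=1):
    # one path of the JM process: the i-th inter-failure time is
    # exponential with rate phi*(N - (i - 1))
    rng = np.random.default_rng(seed)
    inter = [rng.exponential(1.0 / (phi * (N - i))) for i in range(N)]
    return np.cumsum(inter)        # failure times T_1, ..., T_N

print(simulate_jm(N=20, phi=0.05)[:5])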

To go further, we recast our intuitive presentation in a stochastic modeling framework. The justification for what follows is that almost all published software reliability models can be interpreted in this framework. Specifically, this allows a complete overview of the stochastic properties of the panoply of available models without referring to their original presentations.

12.3.1 Self-exciting Point Processes

A slightly more formal presentation of the previous construction of the point process (T_i) would be that we have to specify all the conditional distributions

L(X_i | F_0, T_{i−1}, . . . , T_1),  i ≥ 1

The sequence of conditionings F_0, {F_0, T_1}, {F_0, T_1, T_2}, . . . should be thought of as the natural or internal history of the point process at times 0, T_1, T_2, . . . , respectively. Thus, the function r_i is the hazard rate of the conditional distribution L(X_i | F_0, T_{i−1}, . . . , T_1), i.e. when it has a p.d.f.,

r_i(t; F_0, T_{i−1}, . . . , T_1) = f_{X_i|F_0,T_{i−1},...,T_1}(t) / R_{X_i|F_0,T_{i−1},...,T_1}(t)   (12.4)


This leads to the following expression for the stochastic intensity:

λ(t; F_0, N(t), T_1, . . . , T_{N(t)}) = Σ_{i=1}^{+∞} [f_{X_i|F_0,T_{i−1},...,T_1}(t − T_{i−1}) / R_{X_i|F_0,T_{i−1},...,T_1}(t − T_{i−1})] 1_{[T_{i−1},T_i[}(t)   (12.5)

If we turn back to the JM model, we have F_0 = {φ, N} and

f_{X_i|F_0,T_{i−1},...,T_1}(t) = φ[N − (i − 1)] exp{−φ[N − (i − 1)]t} 1_{[0,+∞[}(t)

Continuing in this way should lead to the martingale approach for analyzing a point process, which essentially adheres to the concepts of compensator and stochastic intensity with respect to the internal history of the counting process. In particular, the left-continuous version of the stochastic intensity defined in Equation 12.5 may be thought of as the usual predictable intensity of a point process in the martingale point of view (e.g. see Theorem 11 of [17] and references cited therein). Van Pul [18] gives a good account of what can be done using the so-called dynamic approach to a point process. We do not go into further details here. We prefer to embrace the engineering point of view developed by Snyder and Miller [15].

It is clear from Equation 12.5 that the stochastic intensity is excited by the history of the point process itself. Such a stochastic process is usually called a self-exciting point process (SEPP). What follows is from Gaudoin [1], Singpurwalla and Wilson [2], and Snyder and Miller [15].

1. H_t = {N(t), T_1, . . . , T_{N(t)}} will denote the internal history of N(·) up to time t.

2. N(·) is said to be conditionally orderly if, for any Q_t ⊆ H_t, we have

P{N(t + dt) − N(t) ≥ 2 | F_0, Q_t} = P{N(t + dt) − N(t) = 1 | F_0, Q_t} O(dt)

where O(dt) is some real-valued function such that lim_{dt→0} O(dt) = 0.

With Q_t = ∅ and Equation 12.6 we get P{N(t + dt) − N(t) ≥ 2 | F_0} = o(dt), where o(dt) is some real-valued function such that lim_{dt→0} o(dt)/dt = 0. This is the usual orderliness or regularity property of a point process [11]. Conditional orderliness is to be interpreted as saying that, given Q_t and F_0, as dt decreases to zero, the probability of at least two failures occurring in a time interval of length dt tends to zero at a higher rate than the probability that exactly one failure occurs in the same interval.

Definition 1. A point process N(·) is called an SEPP if:

1. N(·) is conditionally orderly;
2. there exists a non-negative function λ(·; F_0, H_t) such that

P{N(t + dt) − N(t) = 1 | F_0, H_t} = λ(t; F_0, H_t) dt + o(dt)   (12.6)

and E[λ(t; F_0, H_t)] < +∞ for any t > 0;
3. P{N(0) = 0 | F_0} = 1.

The function λ is called the stochastic intensity of the SEPP.

We must think of λ(t; F_0, H_t) as a function of F_0, t, and N(t), T_1, . . . , T_{N(t)}. The degree to which λ(t) depends on H_t is formalized in the notion of memory. An SEPP is of m-memory if:

• for m = 0: λ(·) depends on H_t only through N(t), the number of observed failures at time t;
• for m = 1: λ(·) depends on H_t only through N(t) and T_{N(t)};
• for m ≥ 2: λ(·) depends on H_t only through N(t), T_{N(t)}, . . . , T_{N(t)−m+1};
• for m = −∞: λ(·) is independent of the history H_t of the point process. We also say that the stochastic intensity has no memory.

When the stochastic intensity depends only on a background history F_0, we get a doubly stochastic Poisson process (DSPP). Thus, the class of SEPPs also encompasses the family of Poisson processes. If the intensity is a non-random constant λ, we have the homogeneous Poisson process (HPP).


An SEPP with no memory and a deterministic intensity function λ(·) is a non-homogeneous Poisson process (NHPP). In particular, if we turn back to the concatenated failure rate function (see Equation 12.2), then selecting an NHPP model corresponds to selecting some continuous deterministic function as r_C. We see that, given T_i = t_i, the hazard rate of X_{i+1} is r_{i+1}(·) = r_C(· + t_i) = λ(· + t_i). We retrieve a well-known fact for an NHPP: the (conditional) hazard rate between the ith and the (i + 1)th failure times and the intensity function λ(·) differ only through the initial time of observation of the two functions. We now list the properties of an SEPP [15] that are of some value for analyzing the main characteristics of models. We will omit writing the dependence on F_0.

12.3.1.1 Counting Statistics for a Self-exciting Point Process

The probability distribution of the random variable N(t) is strongly related to the conditional expectation

λ(t; N(t)) = E[λ(t; H_t) | N(t)]

This function is called the count-conditional intensity. For an SEPP, λ(·; N(t)) satisfies

λ(t; N(t)) = lim_{dt→0} P{N(t + dt) − N(t) = 1 | N(t)}/dt = lim_{dt→0} P{N(t + dt) − N(t) ≥ 1 | N(t)}/dt

Then we can obtain the following explicit representation for P{N(t) = n} with n ≥ 1:

P{N(t) = n} = ∫_{0<t_1<···<t_n<t} ∏_{i=1}^n λ(t_i; i − 1) exp[−Σ_{i=0}^n ∫_{t_i}^{t_{i+1}} λ(u; i) du] dt_1 . . . dt_n   (12.7)

with t_0 = 0 and t_{n+1} = t.

ROCOF(t), defined as the derivative of M(t), is then

ROCOF(t) = E[λ(t; N(t))] = E[λ(t; H_t)]

We see that ROCOF(t) and the stochastic intensity coincide only if the intensity is a deterministic function of time, i.e. the point process is an NHPP.

12.3.1.2 Likelihood Function for a Self-exciting Point Process

Assume that we observe a fixed number i of failures. Then the likelihood function is

f_{T_1,...,T_i}(t_1, . . . , t_i) = λ(t_1; 0) ∏_{k=2}^i λ(t_k; k − 1, t_1, . . . , t_{k−1}) × exp[−∫_0^{t_1} λ(s; 0) ds − Σ_{k=2}^i ∫_{t_{k−1}}^{t_k} λ(s; k − 1, t_1, . . . , t_{k−1}) ds]   (12.8)

If we observe the failure process up to time t, the joint distribution of N(t), T_1, . . . , T_{N(t)} is given by Theorem 6.2.2 of [15].
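Equation 12.8 can be evaluated numerically for any intensity of this form. The sketch below (Python; the interface is our own choice) computes the log-likelihood segment by segment using quadrature, and is exercised on the JM intensity with arbitrary parameter values.

import numpy as np
from scipy.integrate import quad

def sepp_log_likelihood(times, intensity):
    # log of Equation 12.8; intensity(t, history) must return
    # lambda(t; k - 1, t_1, ..., t_{k-1}) for history = (t_1, ..., t_{k-1})
    loglik, prev = 0.0, 0.0
    for k, t_k in enumerate(times):
        hist = tuple(times[:k])
        loglik += np.log(intensity(t_k, hist))
        loglik -= quad(intensity, prev, t_k, args=(hist,))[0]
        prev = t_k
    return loglik

# JM intensity with hypothetical N = 30 and phi = 0.02
jm = lambda t, hist: 0.02 * (30 - len(hist))
print(sepp_log_likelihood([3.0, 10.0, 21.0], jm))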

12.3.1.3 Reliability and Mean Time to Failure Functions

R_t(s) = exp[−∫_t^{t+s} λ(u; N(t), T_1, . . . , T_{N(t)}) du]

In particular, at the instant T_i we obtain

R_i(s) = exp[−∫_0^s λ(u; 0) du]   if i = 0
R_i(s) = exp[−∫_{T_i}^{T_i+s} λ(u; i, T_1, . . . , T_i) du]   if i ≥ 1

The following characterization of a 0-memory SEPP is intuitively clear from the definition of the stochastic intensity of an SEPP.

Theorem 1. N(·) is a 0-memory SEPP if and only if N(·) is a Markov process.


This result explains why a very large part of SR models may be developed in a Markov framework (e.g. see Xie [12], and Musa et al. [3] chapter 10). In particular, all NHPP models are of Markov type. In fact, it can be shown [15] that an m-memory SEPP corresponds to a process (T_i) that is an m-order Markov chain, i.e. L(T_{i+1} | T_i, . . . , T_1) = L(T_{i+1} | T_i, . . . , T_{i−m+1}) for i ≥ m.

A final relevant result concerns 1-memory SEPPs.

Theorem 2. A 1-memory SEPP with a stochastic intensity satisfying

λ(t; N(t), T_{N(t)}) = f(N(t), t − T_{N(t)})

for some real-valued function f is characterized by a sequence of independent inter-failure durations (X_i). In this case, the probability density function of X_i is

f_{X_i}(x_i) = f(i − 1; x_i) exp[−∫_0^{x_i} f(i − 1; u) du]   (12.9)

Such an SEPP was called a generalized renewal process [19] because T_i is the sum of independent but not identically distributed random variables. Moreover, it is a usual renewal process when the function f does not depend on N(t).

To close this presentation of self-exciting processes, we point out that we have only used a "constructive" point of view. Our purpose here is not to discuss the existence of point processes with a given concatenated failure rate function or stochastic intensity. However, we emphasize that the orderliness condition and the existence of the limit in Equation 12.6 of Definition 1 are enough to specify the point process (e.g. see Cox and Isham [20]). It is also shown by Chen and Singpurwalla [14] (Theorem 4.1) that, under the conditional orderliness condition, the concatenated failure rate function well defines an SEPP with respect to the operational Definition 1. Moreover, an easily checked criterion for conditional orderliness is given by Chen and Singpurwalla [14] (Theorem 4.2). In particular, if the hazard rates in Equation 12.4 are locally bounded then conditional orderliness holds.

12.3.2 Classification of Software Reliability Models

From the concept of memory for an SEPP we obtain a classification of the existing models. It is appealing to define a model with a high memory. But, as usual, the pay-off is the complexity of the statistical inference and the amount of data to be collected.

12.3.2.1 0-Memory Self-exciting Point Process

The first type of 0-memory SEPP is that with λ(t; H_t, F_0) = f(N(t), F_0). We obtain the major common properties of this first class of models from Section 12.3.1.

• N(·) is a Markov process (a pure birth Markov process).
• The (X_i) are independent (given F_0). From Equation 12.9, the random variable X_i has an exponential distribution with parameter f(i − 1, F_0). This easily gives a likelihood function given the inter-failure durations (X_i).
• T_i = Σ_{k=1}^i X_k is a hypoexponentially distributed random variable, as the sum of i independent and exponentially distributed random variables [21].
• R_t(s) = exp[−f(N(t), F_0)s]. The reliability function depends only on the current number of failures N(t). We have MTTF(t) = 1/f(N(t), F_0).

Example 1. (JM) The Jelinski–Moranda model was introduced in Section 12.3. This model has to be considered as a benchmark model, since all authors designing a new model emphasize that their model includes the JM model as a particular case. The stochastic intensity is given in Equation 12.3 (see Figure 12.1 for a path). Besides the properties common to the class of SEPPs considered in this subsection, we have the following additional assumptions: the software includes a finite number N of bugs and no new fault is inserted during debugging. We also note that each fault makes the same contribution to the unreliability of the software.


These assumptions are generally considered questionable. We derive from Equation 12.7 that the distribution of the random variable N(t) is binomial with parameters N and 1 − exp(−φt). The main reliability metrics are:

ROCOF(t) = Nφ exp(−φt)
MTTF_i = 1/[(N − i)φ]
R_i(s) = exp[−(N − i)φs]   (12.10)

We also mention the geometric model of Moranda [22], where the stochastic intensity is λ(t; H_t, λ, c) = λc^{N(t)} with λ ≥ 0 and c ∈ ]0, 1[.

A second class of 0-memory SEPP arises when the stochastic intensity is actually a function of time t and N(t) (and F_0). In fact, the only models in this category are SEPPs with λ(t; H_t, F_0) = [N − N(t)]ϕ(t) for some deterministic function ϕ(·) of time t. Note that ϕ(t) = φ gives (JM).

• N(·) is a Markov process.
• Let us denote ∫_0^t ϕ(s) ds by Φ(t). We deduce from Equation 12.7 that N(t) is a binomially distributed random variable with parameters N and p(t) = 1 − exp[−Φ(t)]. This leads to the term binomial-type model in [23]. It follows that E[N(t)] = Np(t) and ROCOF(t) = Nϕ(t) exp[−Φ(t)].
• R_t(s) = exp{−[N − N(t)][Φ(s + t) − Φ(t)]}.
• We get the likelihood function given the failure times (T_i) from Equation 12.8:

f_{T_1,...,T_i}(t_1, . . . , t_i) = [N!/(N − i)!] exp[−(N − i)Φ(t_i)] ∏_{j=1}^i ϕ(t_j) exp[−Φ(t_j)]

• The p.d.f. of the random variable T_i (the ith order statistic of the N detection times) is f_{T_i}(t_i) = [N!/((i − 1)!(N − i)!)] ϕ(t_i) exp[−(N − i + 1)Φ(t_i)] {1 − exp[−Φ(t_i)]}^{i−1}.

Littlewood's model [24] is an instance of a binomial model with ϕ(t) = α/(β + t) and α, β > 0.

12.3.2.2 Non-homogeneous Poisson Process Model: λ(t; H_t, F_0) = f(t; F_0) and is Deterministic

A large number of models are of the NHPP type. Refer to Osaki and Yamada [25] and chapter 5 of Pham [6] for a complete list of such models. The very appealing properties of the counting process explain the wide use of this family of point processes in SR modeling.

• N(·) is a Markov process.
• We deduce from Definition 1 that, for any t, N(t) is a Poisson distributed random variable with parameter Λ(t) = ∫_0^t f(s; F_0) ds. In fact, N(·) has independent (but non-stationary) increments, i.e. N(t_1), N(t_2) − N(t_1), . . . , N(t_i) − N(t_{i−1}) are independent random variables for any (t_1, . . . , t_i).
• The mean value function M(t) is Λ(t). Then ROCOF(t) = f(t; F_0). Such a model is called a finite NHPP model if M(+∞) < +∞ and an infinite one when M(+∞) = +∞. Indeed, if M(+∞) < +∞ then we only consider a finite number of failure times with probability one.
• R_t(s) = exp{−[Λ(t + s) − Λ(t)]}.
• The likelihood function given the failure times (T_i) is, from Equation 12.8,

f_{T_1,...,T_i}(t_1, . . . , t_i) = exp[−Λ(t_i)] ∏_{k=1}^i f(t_k; F_0)   (12.11)

Example 2. (GO) This model is characterized by the following mean value function and intensity function: Λ(t) = M[1 − exp(−φt)] and λ(t; φ, M) = Mφ exp(−φt) for t ≥ 0. The parameters φ and M are the failure rate per fault and the finite expected (initial) number of faults contained in the software, respectively. We see that, at any time t, the intensity function is proportional to the expected remaining number of faults: λ(t; φ, M) = φ[M − Λ(t)]. Thus, (GO) is essentially an NHPP version of (JM).
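A minimal Python sketch of the (GO) quantities (parameter values arbitrary); the reliability function follows the general NHPP formula R_t(s) = exp{−[Λ(t + s) − Λ(t)]} given above.

import math

def go_metrics(M, phi):
    # mean value, intensity (ROCOF) and reliability functions of (GO)
    Lam = lambda t: M * (1.0 - math.exp(-phi * t))
    lam = lambda t: M * phi * math.exp(-phi * t)
    R = lambda t, s: math.exp(-(Lam(t + s) - Lam(t)))   # R_t(s)
    return Lam, lam, R

Lam, lam, R = go_metrics(M=100.0, phi=0.01)
print(Lam(50.0), lam(50.0), R(50.0, 10.0))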


Example 3. (MO) For the Musa–Okumoto model [3], the mean value function and intensity function are Λ(t) = ln(λθt + 1)/θ and λ(t; θ, λ) = λ/(λθt + 1), respectively. λ is the initial value of the intensity function and θ is called the failure intensity decay parameter. It is easily seen that λ(t; θ, λ) = λ exp[−θΛ(t)]. Thus, the intensity function decreases exponentially with the expected number of failures, which shows that (MO) may be understood as an NHPP version of Moranda's geometric model.

As noted by Littlewood [26], using an NHPP model may appear inappropriate to describe reliability growth of software. Indeed, it is debugging that modifies the reliability. Thus, the true intensity function probably changes in a discontinuous manner during corrective actions. However, Miller [27] showed that the binomial-type model of Section 12.3.2.1 may be transformed into an NHPP variant by assuming that the initial number of faults is a Poisson distributed random variable with expectation N. For instance, (JM) is transformed into (GO). Moreover, Miller showed that the binomial-type model and its NHPP variant are indistinguishable from a single realization of the failure process. But the two models differ as prediction models, because the estimates of their parameters are different.

12.3.2.3 1-Memory Self-exciting Point Process with λ(t; H_t, F_0) = f(N(t), t − T_{N(t)}, F_0)

N(·) is not Markovian. But the inter-failure durations (X_i) are independent random variables given F_0 (see Theorem 2), and the p.d.f. of the random variable X_i is given in Equation 12.9.

Example 4. (LV) The stochastic intensity of the Littlewood–Verrall model [28] is

λ(t; H_t, α, ψ(·)) = α / {ψ[N(t) + 1] + t − T_{N(t)}}   (12.12)

for some non-negative function ψ(·) (see Figure 12.2 for a path of this stochastic intensity). We briefly recall the Bayesian rationale underlying the definition of this model. Uncertainty about the debugging operation is represented by a sequence of stochastically decreasing failure rates (Λ_i)_{i≥1}, i.e.

Λ_j ≤st Λ_{j−1},  i.e. ∀t ∈ ℝ: P{Λ_j ≤ t} ≥ P{Λ_{j−1} ≤ t}   (12.13)

Figure 12.2. A path of the stochastic intensity for (LV)

Thus, using stochastic order allows a decay of reliability, which takes place when faults are inserted. The prior distribution of the random variable Λ_i is Gamma with parameters α and ψ(i). It can be shown that the inequality of Equation 12.13 holds when ψ(·) is a monotonic increasing function of i. Given Λ_i = λ_i, the random variable X_i has an exponential distribution with parameter λ_i. The unconditional distribution of the random variable X_i is then a Pareto distribution with p.d.f.

f(x_i; α, ψ(i)) = αψ(i)^α / [x_i + ψ(i)]^{α+1}

Thus, the "true" hazard rate of X_i is r_i(t; α, ψ(i)) = α/[t + ψ(i)]. Mazzuchi and Soyer [29] estimated the parameters α and ψ using a Bayesian method. The corresponding model is called a hierarchical Bayesian model [2] and is also a 1-memory SEPP. If ψ(·) is linear in i, we have

R_i(t) = [ψ(i + 1)/(t + ψ(i + 1))]^α,  MTTF_i = ψ(i + 1)/(α − 1)

We also mention the Schick–Wolverton model [30], where λ(t; H_t, φ, N) = φ[N − N(t)](t − T_{N(t)}), and φ > 0, N are the same parameters as for (JM).

12.3.2.4 m ≥ 2-Memory

Instances of such models are rare. For m = 2, we have the time-series models of Singpurwalla and


Soyer [31, 32]. For m > 2, we have the adaptive concatenated failure rate model of Al-Mutairi et al. [33] and the Weibull models of Pham and Pham [34]. We refer the reader to the original contributions for details.

12.4 White-box Modeling

Most work on SR assessment adopts the black-box view of the system, in which only the interactions with the environment are considered. The white-box (or structural) point of view is an alternative approach in which the structure of the system is explicitly taken into account. This is advocated, for instance, by Cheung [35] and Littlewood [36]. Specifically, the structure-based approach allows one to analyze the sensitivity of the reliability of the system with respect to the reliability of its components. Until recently, only a few papers proposed structure-based SR models. Representative samples can be found for discrete time [35, 37, 38] and continuous time [19, 36, 39, 40]. An up-to-date review of the architecture-based approach is given by Goseva-Popstojanova and Trivedi [41]. We will present the main features of the basic Littlewood model that are common to most of the previously cited works. The discrete-time counterpart is Cheung's model (see Ledoux and Rubino [42]).

In a first step, Littlewood defines an execution model of the software. The basic entity is the standard software engineering concept of a module (e.g. see Cheung [35]). The software structure is then represented by the call graph of the set M of modules. These modules interact by transfer of execution control and, at each instant, control lies in one and only one of the modules, which is called the active module. From such a view of the system, we build a continuous-time stochastic process (X_t)_{t≥0} that indicates the active module at each time t. (X_t)_{t≥0} is assumed to be a homogeneous Markov process on the set M.

In a second step, Littlewood describes the failure processes associated with execution actions. A failure may happen during a control transfer between two modules or during an execution period of any module. During a sojourn of the execution process in module i, failures occur according to a Poisson process with parameter λ_i. When control is transferred from module i ∈ M to module j ∈ M, a failure may happen with probability µ(i, j). Given a sequence of executed modules, the failure processes associated with each state are independent. Also, the interface failure events are independent of each other and of the failure processes occurring when a module is active.

The architecture of the software is combined with the failure behavior of the modules and that of the interfaces into a single model, which can then be analyzed. This method is referred to as the "composite method" in the classification of Markov models in the white-box approach proposed by Goseva-Popstojanova and Trivedi [41]. Basically, we are still interested in the counting process N(·).

Another interesting point process is obtained by assuming that the probability of a secondary failure during a control transfer is zero. Thus, assuming that µ(i, j) = 0 for all i, j ∈ M in the previous context, we get a Poisson process whose parameter is modulated by the Markov process (X_t)_{t≥0}. This is also called a Markov-modulated Poisson process (MMPP). It is well known (e.g. [17]) that the stochastic intensity of an MMPP with respect to the history H_t ∨ F_0, where F_0 = σ(X_s, s ≥ 0), is λ_{X_t}. Thus, N(·) is an instance of a DSPP.
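The following Python sketch simulates the failure times of such an MMPP for a hypothetical two-module system: a continuous-time Markov chain with generator Q switches the active module, and failures occur at the module-dependent rate λ_i while module i is active. The generator and the rates are illustrative, not taken from the references.

import numpy as np

def simulate_mmpp(Q, lam, t_max, seed=2):
    # failure times of a Poisson process whose rate lam[x] is modulated
    # by a continuous-time Markov chain with generator Q
    rng = np.random.default_rng(seed)
    x, t, failures = 0, 0.0, []
    while t < t_max:
        rate_out = -Q[x, x]
        end = min(t + rng.exponential(1.0 / rate_out), t_max)
        u = t + rng.exponential(1.0 / lam[x])
        while u < end:                       # failures while module x is active
            failures.append(u)
            u += rng.exponential(1.0 / lam[x])
        probs = Q[x].copy(); probs[x] = 0.0; probs /= rate_out
        x, t = rng.choice(len(lam), p=probs), end   # jump to the next module
    return failures

Q = np.array([[-1.0, 1.0], [2.0, -2.0]])     # switching rates between 2 modules
print(simulate_mmpp(Q, lam=[0.05, 0.20], t_max=100.0))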

Asymptotic analysis and transient assessment of the distribution of the random variable N(t) may be carried out by observing that the bivariate process (X_t, N(t))_{t≥0} is a jump Markov process with state space M × ℕ. Computation of all standard reliability metrics may be performed as in [42, 43]. We refer the reader to these papers for details and for calibration of the models.

It can be argued that models of the Littlewood type are inappropriate to capture reliability growth of the software. In fact, Laprie et al. [19] have developed a method to incorporate such a phenomenon; it can be used, for instance, in the Littlewood model to take into account reliability growth of the modules in the assessment of the overall reliability (see Ledoux [43]).


In the original papers about Littlewood's model for modular software, it is claimed that if the failure parameters decrease to zero then N(·) is asymptotically an HPP with parameter

λ = Σ_{i∈M} π(i)[Σ_{j∈M, j≠i} Q(i, j)µ(i, j) + λ_i]

where Q is the irreducible generator of the Markov process (X_t)_{t≥0} and π its stationary distribution. π(i) is to be interpreted as the proportion of time the software spends in module i (over a long time period). We just discuss the case of an MMPP. Convergence to zero of the failure parameters λ_i (i ∈ M) may be achieved by multiplying each of them by a positive scalar ε, and considering that ε decreases to zero. Hence, the new stochastic intensity of the MMPP is ελ_{X_t}. As ε tends to zero, it is easily seen for an MMPP with a modulating two-state Markov process (X_t)_{t≥0} that, for the first failure time T, the probability P{T > t} converges to unity. Therefore, we cannot obtain an exponential approximation to the distribution of the random variable T as ε tends to zero. Thus, we cannot expect to derive a Poisson approximation to the distribution of N(·). In fact, the correct statement is: if the failure parameters are much smaller than the switching rates between modules, then N(·) is approximately an HPP with parameter λ. For an MMPP, a proof is given by Kabanov et al. [44] using martingale theory. Moreover, the rate of convergence in total variation of the finite-dimensional distributions of N(·) to those of the HPP is shown to be in ε. This last fact is important, because the user otherwise has no information on the quality of the Poissonian approximation given in [36]. However, there is no rule to decide a priori whether the approximation is optimistic or pessimistic (see Ledoux [43]). An approach similar to that of Kabanov et al. [44] is used by Gravereaux and Ledoux [45] to derive the Poisson approximation and rate of convergence for a more general point process than the MMPP, including the complete Littlewood counting process [46] and the counting model of Ledoux [43]. Such asymptotic results give a "hierarchical method" [41] for reliability prediction: we solve the architectural model and superimpose the failure behavior of the modules and that of the interfaces onto the solution to predict reliability.

12.5 Calibration of Model

Suppose that we have selected one of the black-box models of Section 12.3. We obtain reliability metrics that depend on the unknown parameters of the model. Thus, we have to estimate these metrics from the failure data. We briefly review standard methods to get point estimates in Sections 12.5.1 and 12.5.2.

One major goal of SR modeling is to predict the future value of metrics from the gathered failure data. It is clear from Section 12.3 that a central problem in SR is selecting a model, because of the huge number of models available. Criteria to compare SR models are listed by Musa et al. [3]. The authors propose quality of assumptions, applicability, simplicity, and predictive validity. To assess the predictive validity of models, we need methods that are not just based on a goodness-of-fit approach. Various techniques may be used: the u-plot, prequential likelihood, etc. We do not discuss this issue here. Good accounts of predictive validation methods are given by Abdel-Ghaly et al. [47], Brocklehurst et al. [48], Lyu [4] (chapter 4), Musa et al. [3], and Ascher and Feingold [11], where comparisons between models are also carried out. Note also that the predictive quality of an SR model may be drastically improved by preprocessing the data. In particular, statistical tests have been designed to capture trends in data. Thus, reliability trend analysis allows the use of SR models that are adapted to reliability growth, stable reliability, and reliability decrease. We refer the reader to Kanoun [49], and Lyu [4] (chapter 10) and references cited therein for details. Parameters will be denoted by θ (which can be multivariate).

12.5.1 Frequentist Procedures

The parameter θ is considered as taking an unknown but fixed value. Two basic methods to estimate


the value of θ are the method of maximum likelihood (ML) and the least squares method. That is, to get a point estimate we optimize, with respect to θ, an objective function that depends on θ and the collected failure data. Another standard procedure, interval estimation, gives an interval of values as the estimate of θ. We only present point estimation by the ML method for (JM) and NHPP models. Other models may be analyzed in a similar way. ML estimation possesses several appealing properties that make the procedure widely used. Two of these properties are consistency and asymptotic normality, the latter allowing a confidence interval for the point estimate to be obtained. Another is that the ML estimate of f(θ) (for a one-to-one function f) is simply f(θ̂), where θ̂ is the ML estimate of θ. We refer the reader to chapter 12 of Musa et al. [3] for a complete view of frequentist inference procedures in the SR modeling context.

Example 5. (JM) The parameters are φ and N. Assume that the failure data are given by the observed values x = (x_1, . . . , x_i) of the random variables X_1, . . . , X_i. If (φ, N) are the true values of the parameters, then the likelihood of observing x is defined as

L(φ, N; x) = f_{X_1,...,X_i}(x; φ, N)   (12.14)

where f_{X_1,...,X_i}(·; φ, N) is the p.d.f. of the joint distribution of X_1, . . . , X_i. The estimate (φ̂, N̂) of (φ, N) is the value of (φ, N) that maximizes the likelihood of observing the data x: (φ̂, N̂) = arg max_{φ,N} L(φ, N; x). Maximizing the likelihood function is equivalent to maximizing the log-likelihood function ln L(φ, N; x). From the independence of the random variables (X_i) and Equation 12.14, we obtain

ln L(φ, N; x) = ln ∏_{k=1}^i φ[N − (k − 1)] exp{−φ[N − (k − 1)]x_k}

The estimates (φ̂, N̂) are solutions of ∂ ln L(φ, N; x)/∂φ = ∂ ln L(φ, N; x)/∂N = 0:

φ̂ = i / [N̂ Σ_{k=1}^i x_k − Σ_{k=1}^i (k − 1)x_k]

Σ_{k=1}^i 1/[N̂ − (k − 1)] = i / [N̂ − (Σ_{k=1}^i x_k)^{−1} Σ_{k=1}^i (k − 1)x_k]

The second equation may be solved by numerical techniques, and the solution is then substituted into the first equation to get φ̂. These estimates are plugged into Equation 12.10 to get ML estimates of the reliability metrics.
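A convenient numerical route is the profile likelihood: for each candidate integer N, the first equation gives φ̂(N) in closed form, and the log-likelihood is then maximized over N. The Python sketch below (our own implementation choices, exercised on simulated data) follows this route.

import numpy as np

def jm_mle(x, N_max=1000):
    # ML estimates (phi, N) for (JM) from inter-failure times x;
    # for fixed N, phi(N) = i / (N*S - W) maximizes the likelihood
    x = np.asarray(x, dtype=float)
    i = len(x)
    S, W = x.sum(), (np.arange(i) * x).sum()   # sum x_k, sum (k-1)*x_k
    best = None
    for N in range(i, N_max + 1):
        phi = i / (N * S - W)
        ll = sum(np.log(phi * (N - k)) - phi * (N - k) * x[k] for k in range(i))
        if best is None or ll > best[0]:
            best = (ll, phi, N)
    return best[1], best[2]                    # (phi_hat, N_hat)

rng = np.random.default_rng(3)
x = [rng.exponential(1.0 / (0.05 * (25 - k))) for k in range(15)]
print(jm_mle(x))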

Example 6. (NHPP models) Assume that the failure data are the observed failure times t = (t_1, . . . , t_i). The likelihood function is given by Equation 12.11. Thus, for the (MO) model, the ML estimates of the parameters β_0 and β_1 (where β_0 = 1/θ and β_1 = λθ) are the solution of [3]:

β̂_0 = i / ln(1 + β̂_1 t_i)

(1/β̂_1) Σ_{k=1}^i 1/(1 + β̂_1 t_k) = i t_i / [(1 + β̂_1 t_i) ln(1 + β̂_1 t_i)]

Assume now that the failure data are the cumulative numbers of failures n_1, . . . , n_i observed at some instants d_1, . . . , d_i. From the independence and Poisson distribution of the increments of the counting process N(·), the likelihood function is

L(θ; t) = ∏_{k=1}^i {[Λ_θ(d_k) − Λ_θ(d_{k−1})]^{n_k−n_{k−1}} / (n_k − n_{k−1})!} exp{−[Λ_θ(d_k) − Λ_θ(d_{k−1})]}   (d_0 = n_0 = 0)

where Λ_θ(·) is the mean value function of N(·) given that θ is the true value of the parameter. For the (GO) model, the parameters are M and φ, and their


ML estimates are

M̂ = n_i / [1 − exp(−φ̂d_i)]

d_i exp(−φ̂d_i) n_i / [1 − exp(−φ̂d_i)] = Σ_{k=1}^i (n_k − n_{k−1})[d_k exp(−φ̂d_k) − d_{k−1} exp(−φ̂d_{k−1})] / [exp(−φ̂d_{k−1}) − exp(−φ̂d_k)]

The second equation is solved by numerical techniques, and the solution is substituted into the first equation to get M̂. Estimates of the reliability metrics are then obtained as for (JM).
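Here too the first equation lets M be profiled out, leaving a one-dimensional maximization in φ. A Python sketch (interface and data purely illustrative):

import numpy as np
from scipy.optimize import minimize_scalar

def go_mle(d, n):
    # ML for (GO) from cumulative counts n at instants d, with M
    # profiled out via M(phi) = n_i / (1 - exp(-phi*d_i))
    d, n = np.asarray(d, float), np.asarray(n, float)
    dn = np.diff(np.concatenate(([0.0], n)))   # increments n_k - n_{k-1}
    d0 = np.concatenate(([0.0], d))            # instants with d_0 = 0
    def neg_ll(phi):
        M = n[-1] / (1.0 - np.exp(-phi * d[-1]))
        incr = M * (np.exp(-phi * d0[:-1]) - np.exp(-phi * d0[1:]))
        return -(np.sum(dn * np.log(incr)) - M * (1.0 - np.exp(-phi * d[-1])))
    phi = minimize_scalar(neg_ll, bounds=(1e-6, 1.0), method='bounded').x
    return n[-1] / (1.0 - np.exp(-phi * d[-1])), phi   # (M_hat, phi_hat)

print(go_mle(d=[10.0, 20.0, 30.0, 40.0], n=[25.0, 41.0, 52.0, 58.0]))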

12.5.2 Bayesian Procedure

An alternative to frequentist procedures is to use Bayesian statistics. The parameter θ is considered as a value of a random variable Θ. A prior distribution of Θ has to be selected. This distribution represents the a priori knowledge about the parameter and is assumed to have a p.d.f. π_Θ(·). Now, from the failure data x, we have to update our knowledge of Θ. If L(θ; x) is the likelihood function of the data, then Bayes' theorem is used as an updating formula: the p.d.f. of the posterior distribution of Θ given the data x is

f_{Θ|x}(θ) = L(θ; x)π_Θ(θ) / ∫_ℝ L(θ; x)π_Θ(θ) dθ

Now, we find an estimate θ̂ of θ by minimizing the so-called posterior expected loss

∫_ℝ l(θ̂, θ) f_{Θ|x}(θ) dθ

where l(·, ·) is the loss function. With a quadratic loss function l(θ̂, θ) = (θ̂ − θ)^2, it is well known that the minimum is attained by the conditional expectation

θ̂ = E[Θ | x] = ∫_ℝ θ f_{Θ|x}(θ) dθ

This is just the mean of the posterior distribution of Θ. Note that all SR models involve two or more parameters, so that the previous integral must be considered as multidimensional. Therefore, computation of such integrals requires numerical techniques. For a decade now, progress has been made on such methods, e.g. Markov chain Monte Carlo (MCMC) methods (see Gilks et al. [50]).

Note that the prior distribution can also be parametric. In general, Gamma or Beta distributions are used. Hence, additional parameters need to be estimated. This may be carried out using the ML method; for instance, this is the case for the function ψ(·) in the (LV) model [28]. However, such estimates may also be derived in the Bayesian framework, and we then obtain a hierarchical Bayesian analysis of the model. Many of the models of Section 12.3 have been analyzed from a Bayesian point of view using MCMC methods such as Gibbs sampling or the data augmentation method (see Kuo and Yang [51] for (JM) and (LV), Kuo and Yang [52] for the NHPP model, and Kuo et al. [53] for S-shaped models, and references cited therein).
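For a model simple enough to admit a conjugate prior, the posterior mean can be checked directly. The Python sketch below (all numerical choices are illustrative) computes E[Θ | x] by importance sampling from the prior, for i.i.d. exponential inter-failure times with a Gamma prior on the rate, and compares it with the exact conjugate answer.

import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(1.0 / 0.2, size=20)   # data: exponential, true rate 0.2
a, b = 2.0, 5.0                           # Gamma(a, b) prior on the rate (assumed)

# Monte Carlo: sample from the prior, weight by the likelihood
theta = rng.gamma(a, 1.0 / b, size=200_000)
log_w = len(x) * np.log(theta) - theta * x.sum()
w = np.exp(log_w - log_w.max())           # stabilized importance weights
print("MC posterior mean:", np.sum(w * theta) / np.sum(w))
print("exact (conjugacy):", (a + len(x)) / (b + x.sum()))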

12.6 Current Issues

12.6.1 Black-box Modeling

We list some issues that are not new, but which cover well-documented limitations of popular black-box models. Most of them have been addressed recently, and practical validation is needed. We will see that the prevalent approach is to use the NHPP modeling framework. Indeed, an easy way to incorporate the various factors affecting the reliability of software in an NHPP model is to select a suitable parameterization of the intensity function (or ROCOF). Any alternative model combining most of these factors would be of value.

12.6.1.1 Imperfect Debugging

The problem of imperfect debugging may be naturally addressed in a Bayesian framework. Reliability growth is captured through a deterministically non-increasing sequence of failure rates (r_i(·)) (see Equation 12.1). In a Bayesian framework, the parameters of r_i(·) are considered as random.


Hence, we can deal with a stochastically decreasing sequence of random variables (r_i)_i (see Equation 12.13), which allows one to take into account the uncertainty in the effect of a corrective action. An instance of this approach is given by the (LV) model (see also Mazzuchi and Soyer [29], Pham and Pham [34]).

Note that the binomial class of models can incorporate a defective correction of a detected bug. Indeed, assume that each detected fault has a probability p of being removed from the software. The hazard rate after (i − 1) repairs is then φ[N − p(i − 1)] (see Shanthikumar [23]). But the problem of the eventual introduction of new faults is not addressed. Kremer [54] solved the case of a single insertion using a non-homogeneous birth–death Markov model. This has been extended to multiple introductions by Gokhale et al. [55]. Shanthikumar and Sumita [56] proposed a multivariate model where multiple removals and insertions of faults are allowed at each repair. This model involves complex computational procedures and is not considered in the literature. A recent advance in addressing the problem of the eventual insertion of new faults is concerned with finite NHPP models. It consists in generalizing the basic proportionality between the intensity function and the expected number of remaining faults at time t of the (GO) model (see Example 2) to

λ(t) = φ(t)[n(t) − Λ(t)]

where φ(t) represents a time-dependent detection rate per fault, and n(t) is the number of faults in the software at time t, including those already detected and removed and those inserted during the debugging process (φ(t) = φ and n(t) = M in (GO)). Making φ(·) time dependent allows one to represent the learning process that is closely related to changes in the efficiency of testing. This function can monotonically increase during the testing period. Selecting a non-decreasing S-shaped curve as φ(·) gives a usual S-shaped NHPP model. Further details are given by Pham [6] (chapter 5 and references cited therein).

Another basic way to consider the insertion of new faults is to use a marked point process (MPP) (e.g. see chapter 4 of Snyder and Miller [15]). We have the point process of failure detection times T_1 < T_2 < · · ·, and with each date T_i we associate a mark M_i that represents the cumulative number of faults removed and inserted during the debugging phase. We retrieve a usual point process if all marks are unity. For such a model, we are interested in the mark-accumulator process Σ_{i=0}^{N(t)} M_i (M_0 = 0). A basic instance of an MPP is the compound Poisson process, where N(·) is an HPP and the M_i are i.i.d. and independent of N(·). This may also be used to model clustering of failures (see Sahinoglu [57]). Such MPPs are not SEPPs in the sense of Section 12.3.1 because the orderliness condition fails. This framework was used by van Pul [18] to extend models of the (JM) type to incorporate the possibility of inserting new faults during repair.

12.6.1.2 Early Prediction of Software Reliability

A major limitation of the SR models of Section 12.3 for the software engineering community is that they provide no help for managing the earlier phases of development (or testing) of a product. Indeed, calibration of these black-box models requires a relatively large set of failure data, which is rarely available early in the life-cycle of software. In some sense, we are now concerned with the general topic of software quality assessment (which includes dependability concepts) with no failure data. Thus, we have to develop statistical models of quality that are not directly related to knowledge of part of the failure process. In such a case, the model must be based on a priori information about the product: the judgment of experts, the quality of the development process, similar existing products, software complexity, etc. Incorporating subjective information leads naturally to Bayesian statistics. We refer the reader to chapters 5 and 6 of Singpurwalla and Wilson [2] for discussion in this context.


Since we are mainly interested in reliability assessment, we restrict ourselves to more or less recent issues relating quality control to software reliability.

A widespread idea is that the complexity of software is a factor influencing its reliability attributes. Much work has been devoted to quantifying software complexity through software metrics (e.g. see Fenton [58]). Typically, we compute the Halstead and McCabe metrics, which are program-size and control-flow measures, respectively. It is worthy of note that most software complexity metrics are strongly related to the concept of the structure of the software code. Thus, including a complexity factor in SR may be thought of as a first attempt to take the architecture of the software into account in reliability assessment. We return to this issue in Section 12.6.2. Now, how do we include complexity attributes in early reliability analysis? Most of the recent research focuses on the identification of software modules that are likely to be fault-prone from data on various complexity metrics. In fact, we are faced with a typical problem of data analysis, which explains why the literature on this subject is mainly concerned with procedures of multivariate analysis: linear and nonlinear regression methods, classification methods, and techniques of discriminant analysis. We refer the reader to Lyu [4] (chapter 12) and Khoshgoftaar and coworkers [59, 60] and references cited therein for details.

Other empirical evidence suggests that the higher the test coverage, the higher the reliability of the software will be. Thus, a model that incorporates information on functional testing as soon as it is available is of value. This issue is addressed in an NHPP model proposed by Gokhale and Trivedi [61]. It consists in defining an appropriate parameterization of a finite NHPP model which relates software reliability to measurements that can be obtained from the code during functional testing. Let a be the expected number of faults that would be detected given infinite testing time. The intensity function λ(·) is assumed to be proportional to the expected number of remaining faults: λ(t) = [a − Λ(t)]φ(t), where φ(t) is the hazard rate per fault. Finally, the time-dependent function φ(t) is of the form

φ(t) = (dc(t)/dt) / (1 − c(t))

where c(t) is the coverage function, that is, the ratio of the number of potential fault sites covered by time t to the total number of potential fault sites under consideration during testing. The function c(t) is assumed to be continuous and monotonic as a function of testing time. Specific forms of the function c(·) allow one to retrieve some well-known finite-failure models: the exponential function c(t) = 1 − exp(−φt) corresponds to the (GO) model; the Weibull coverage function c(t) = 1 − exp(−φt^γ) corresponds to the generalized (GO) model [62]; the S-shaped coverage function corresponds to S-shaped models [25]; etc. Gokhale and Trivedi [61] propose the use of a log-logistic function; see that reference for details. Such a parameterization leads to estimating a and the parameters of the function c(·). The model may be calibrated according to the different phases of the software life-cycle. In the early phase of testing, an approach is to estimate a from software metrics (using procedures of multivariate analysis) and to measure coverage during functional testing using a coverage measurement tool (e.g. see Lyu [4], chapter 13). Thus, we get early prediction of reliability (see Xie et al. [63] for an alternative using information from testing phases of similar past projects).
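One can check by differentiation that this parameterization integrates to the mean value function Λ(t) = a·c(t) when c(0) = 0, since [a − a c(t)]c′(t)/(1 − c(t)) = a c′(t). The following Python sketch (our own wrapper; values arbitrary) encodes this and verifies the (GO) special case.

import math

def coverage_nhpp(a, c, dc):
    # NHPP with lambda(t) = [a - Lam(t)]*phi(t), phi(t) = c'(t)/(1 - c(t));
    # this gives Lam(t) = a*c(t) when c(0) = 0
    Lam = lambda t: a * c(t)
    lam = lambda t: (a - Lam(t)) * dc(t) / (1.0 - c(t))
    return Lam, lam

phi = 0.01
c = lambda t: 1.0 - math.exp(-phi * t)    # exponential coverage -> (GO)
dc = lambda t: phi * math.exp(-phi * t)
Lam, lam = coverage_nhpp(a=100.0, c=c, dc=dc)
print(Lam(50.0), lam(50.0))               # lam(t) = a*phi*exp(-phi*t), as in (GO)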

12.6.1.3 Environmental Factors

Most SR models in Section 12.3 ignore the factors affecting software reliability. In some sense, the previous issues discussed in this section can be considered as attempts to capture some environmental factors. Imperfect debugging is related to the fact that new faults may be inserted during a repair. The complexity attributes of software are strongly correlated with its fault-proneness. Empirical investigations show that the development process, testing procedure, programmer skill, human factors, the operational profile, and many other factors affect the reliability of a product (e.g. see Pasquini et al. [64], Lyu [4] (chapter 13),


Özekici and Sofer [65], Zhang and Pham [66],and references cited therein). A major issue is toincorporate all these attributes into a single model.At the present time, investigations focus on thefunctional relationship between the hazard rateri(·) of the software and quantitative measures ofthe various factors. In this context, a well-knownmodel is the so-called Cox proportional hazardmodel (PHM), where ri (·) is assumed to be anexponential function of the environmental factors:

ri (t)= r(t) exp

( n∑j=1

βjzj (i)

)(12.15)

where the zj(·), called explanatory variables or covariates, are the measures of the factors, and the βj are the regression coefficients. r(·) is a baseline hazard rate that gives the hazard rate when all covariates are set to zero. Therefore, given z = (z1, . . . , zn), the reliability function Ri is

Ri(t | z) = R(t)^{exp( Σ_{j=1}^{n} βj zj(i) )}    (12.16)

with R(t) = exp(− ∫_0^t r(s) ds). Equation 12.15 expresses the effect of accelerating or decelerating the time to failure given z. Note that covariates may be time-dependent and random. In this last case the reliability function will be the expectation of the function in Equation 12.16. The baseline hazard rate may be any of the hazard rates used in Section 12.3. The family of parameters can be estimated using ML. Note that one of the reasons for the popularity of the PHM is that the unknown βj may be estimated by the partial likelihood approach without putting a parametric structure on the baseline hazard rate. We refer the reader to Kalbfleisch and Prentice [67], Cox and Oakes [68], Ascher and Feingold [11], and Andersen et al. [69] for a general discussion of Cox regression models. Applications of the PHM to software reliability modeling are given by Xie [12] (chapter 7), Saglietti [70], Wright [71], and references cited therein. Recently, Pham [72] derived an enhanced proportional hazard (JM) model.
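As a concrete illustration of the partial likelihood approach, the sketch below fits a Cox PHM to failure data; it assumes the open-source lifelines package is available, and the covariates (a workload measure and a tester-skill score) and failure times are purely hypothetical, not data from the references above.

```python
# Sketch: fitting a Cox proportional hazards model to software failure data.
# Assumes the lifelines package; all data values are illustrative.
import pandas as pd
from lifelines import CoxPHFitter

# One row per (possibly censored) failure time, with two covariates.
df = pd.DataFrame({
    "time":     [12.0, 30.5, 45.2, 7.1, 60.0, 22.3],
    "event":    [1, 1, 0, 1, 0, 1],          # 1 = failure observed, 0 = censored
    "workload": [0.8, 0.3, 0.2, 0.9, 0.1, 0.5],
    "skill":    [2, 4, 5, 1, 5, 3],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")  # partial likelihood estimation
cph.print_summary()  # estimated regression coefficients beta_j
```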

A general way to represent the influence of environmental factors on reliability is to assume that the stochastic intensity of the counting process N(·) is a function of some m stochastic processes E1(t), . . . , Em(t) or covariates:

λ(t; Ht, F0) = f(t, E1(t), . . . , Em(t), T1, . . . , TN(t), N(t))

where Ht is the past up to time t of the point process and F0 encompasses the specification of the paths of all covariates. Thus, the function λ(t; Ht, F0) may be thought of as the stochastic intensity of an SEPP driven or modulated by the multivariate environmental process (E1(t), . . . , Em(t)). The DSPP of Section 12.3.1 is a basic instance of such models and has been widely used in communication engineering and in reliability. Castillo and Siewiorek [73] proposed a DSPP with a cyclostationary stochastic intensity to represent the effect of the workload (a measure of system usage) on the failure process. That is, the intensity is a stochastic process assumed to have a periodic mean and autocorrelation function. A classic form for the intensity of a DSPP is λ(Et), where (Et) is a finite Markov process. This is the MMPP discussed in Section 12.4, where (Et) represented the control flow structure of the software. In the same spirit of a system in a random environment, Özekici and Soyer [65] use a point process whose stochastic intensity is [N − N(t)]λ(Et), where N − N(t) is the remaining number of faults at time t and Et is the operation performed by the system at t. (Et) is also assumed to be a finite Markov process. Transient analysis of N(·) may be carried out, as in Ledoux and Rubino [42], from the Markov property of the bivariate process (N − N(t), Et); see Özekici and Soyer [65] for details. Note that both point processes are instances of SEPPs with respective stochastic intensities E[λ(Et) | Ht] and [N − N(t)]E[λ(Et) | Ht] (Ht is the past of the counting process up to time t). The practical purpose of such models has to be addressed. In particular, further investigations are needed to estimate parameters (see Koch and Spreij [74]).
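For a feel of how such a modulated failure process behaves, here is a minimal simulation sketch of a two-state MMPP, i.e. a DSPP whose intensity λ(Et) is driven by a finite Markov environment process; all numerical rates are hypothetical.

```python
# Sketch: simulating failure times of a two-state MMPP (a DSPP with
# intensity lambda(E_t), (E_t) a finite Markov process). Values illustrative.
import random

def simulate_mmpp(horizon, lam=(0.02, 0.2), switch=(0.1, 0.3), seed=1):
    rng = random.Random(seed)
    t, state, failures = 0.0, 0, []
    while t < horizon:
        # Competing exponentials: next environment switch vs. next failure
        # at the current state's intensity.
        t_switch = rng.expovariate(switch[state])
        t_fail = rng.expovariate(lam[state])
        if t_fail < t_switch:
            t += t_fail
            if t < horizon:
                failures.append(t)
        else:
            t += t_switch
            state = 1 - state  # environment jumps to the other state
    return failures

print(simulate_mmpp(200.0))
```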

12.6.1.4 Conclusion

There are other issues of value in SR engineering. In the black-box modeling framework, we can think about alternatives to the approaches reported in Section 12.3. The problem of assessing SR growth may be thought of as a problem of statistical analysis of data. Therefore, prediction techniques developed in this area of research can be used. For instance, some authors have considered neural networks (NNs). The main interest of such an SR model is that it is non-parametric. Thus, we rejoin the discussion of the statistical issues raised in Section 12.6.3. NNs may also be used as a classification tool. For instance, identifying fault-prone modules may be performed with an NN classifier. See Lyu [4] (chapter 17) and references cited therein for an account of the NN approach. An empirical comparison of the predictive performance of NN models and recalibrated standard models (as defined in [75]) is given by Sitte [76]. NNs were found to be a good alternative to the standard models.
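As an illustration of the NN approach, the sketch below trains a small feed-forward network to predict the next inter-failure time from a window of past ones; it assumes scikit-learn, uses synthetic data, and is not the architecture studied by Sitte [76].

```python
# Sketch: neural-network prediction of the next inter-failure time from a
# sliding window of past inter-failure times. Data are synthetic.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic reliability growth: inter-failure times with increasing mean.
x = rng.exponential(np.linspace(1.0, 5.0, 60))

window = 5
X = np.array([x[i:i + window] for i in range(len(x) - window)])
y = x[window:]

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
model.fit(X[:-10], y[:-10])      # train on the earlier failures
print(model.predict(X[-10:]))    # one-step-ahead predictions for the last ten
```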

Computing dependability metrics is not an end in itself in software engineering. A major question is when to release the software. In particular, we have to decide when to stop testing. Choosing the optimal testing time is a problem of decision making under uncertainty. A good account of Bayesian decision theory for solving such a problem is given by Singpurwalla and Wilson [2] (chapter 6). In general, software release policies are based on reliability requirements and cost factors. We do not go into further detail here. See Xie [12] (chapter 8) for a survey up to the 1990s, and Pham and Zhang [77, 78] for more recent contributions to these topics.
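To give the flavor of a cost-based release policy, the sketch below minimizes a classical total expected cost for a (GO)-type NHPP, trading off the cost of fixing faults before and after release against the cost of testing time; the cost coefficients and model parameters are hypothetical, and this is only one simple instance of the policies surveyed in the references above.

```python
# Sketch: cost-based optimal software release time under a (GO)-type NHPP.
# C(T) = c1*m(T) + c2*[a - m(T)] + c3*T, with m(T) = a(1 - exp(-b T)).
# c1: cost per fault fixed in testing; c2: (larger) cost per fault fixed in
# the field; c3: testing cost per unit time. All values are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar

a, b = 100.0, 0.05
c1, c2, c3 = 1.0, 5.0, 0.5

def total_cost(T):
    m = a * (1.0 - np.exp(-b * T))
    return c1 * m + c2 * (a - m) + c3 * T

res = minimize_scalar(total_cost, bounds=(0.0, 500.0), method="bounded")
print("optimal release time:", res.x, "expected cost:", res.fun)
```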

12.6.2 White-box Modeling

A challenging issue in SR modeling is to define models taking into account information about the architecture of the software. To go further, software interacts with hardware to make a system. In order to derive a model for a system made up of software and hardware, the approach to take is the white-box approach (see Kanoun [49] and Laprie and Kanoun [79] for an account of this topic). We focus on the software product here.

Many reasons lead to advocating a structure-based approach in SR modeling:

• advancement and widespread use of object-oriented system designs, and reuse of components;
• software is developed in a heterogeneous fashion using component-based software development;
• early prediction methods of reliability have to take into account information on the testing and the reliability of the components of the software;
• early failure data are prior to the integration phase and thus concern the part of the software under test, not the whole product;
• addressing problems of reliability allocation and resource allocation for modular software;
• analyzing the sensitivity of the reliability of the software to the reliability of its components.

As noted in Section 12.4, the structure-based approach has been largely ignored. The foundations of the Markovian models presented in Section 12.4 are not new. Some limitations of Littlewood's model have been recently addressed [43], in particular to obtain availability measures. Asymptotic considerations [45] show that such a reliability model tends to be of the Poisson type (homogeneous or not, depending on stationarity or not of the failure parameters) when the product has achieved a good level of reliability. It is important to point out that no experience with such kinds of models is reported in the literature. Maybe this is related to questionable assumptions in the modeling, such as the Markovian exchanges of control between modules.

An alternative way to represent the interactions between components of software is to use one of the available modeling tools based on stochastic Petri nets, SAN networks, etc. But whereas many of them offer a high degree of flexibility in the representation of the behavior of the software, the computation of the various metrics is very often performed using automatic generation of a Markov chain. So these approaches are subject to the traditional limitations of Markov modeling: the failure rate of the components is not time dependent; the generated state space is intractable from the computational point of view; etc.

To overcome the limitations of an analytic approach, a widespread method in the performance analysis of a system is discrete-event simulation. This point of view was initiated by Lyu [4] (chapter 16) for software dependability assessment. The idea is to represent the behavior of each component as a non-homogeneous Markov process whose dynamic evolution depends only on a hazard rate function. At any time t, this hazard rate function depends on the number of failures observed from the component up to time t, as well as on the execution time experienced by the component up to time t. Then a rate-based simulation technique may be used to obtain a possible realization of such a Markovian arrival process. The overall hazard rate of the software is actually a function of the number of failures observed from each component up to time t and of the amount of execution time experienced by each component. See Lyu [4] (chapter 16) and Lyu et al. [80] for details. In some sense, the approach is to simulate the failure process from the stochastic intensity of the counting process of failures.
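The following is a minimal sketch of such a rate-based simulation for a single component, using the standard thinning technique to draw failure epochs from a time-varying hazard rate; the rate function is hypothetical and the procedure is only a simple instance of the technique described by Lyu [4] (chapter 16).

```python
# Sketch: rate-based (thinning) simulation of one component whose hazard
# rate depends on execution time and on the failures observed so far.
import math
import random

def hazard(t, n_failures):
    # Hypothetical rate: drops with each observed failure (reliability
    # growth) and decays slowly with execution time t; bounded by 0.1.
    return 0.1 / (1 + n_failures) * math.exp(-0.001 * t)

def simulate(horizon, lam_max=0.2, seed=42):
    rng = random.Random(seed)
    t, failures = 0.0, []
    while True:
        # Candidate event from a homogeneous process at the bounding rate.
        t += rng.expovariate(lam_max)
        if t >= horizon:
            return failures
        # Accept with probability hazard(t)/lam_max ("thinning").
        if rng.random() < hazard(t, len(failures)) / lam_max:
            failures.append(t)

print(simulate(500.0))
```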

For a long time now, the theory of the "coherent system" has allowed one to analyze a system made up of n components through the so-called structure function. If xi (i = 1, . . . , n) denotes the state of component i (xi = 1 if component i is up and zero otherwise), then the state of the system is obtained from computation of the structure function

Φ(x1, . . . , xn) = 1 if the system is up
                 = 0 if the system is down

The function Φ describes the functional relationship between the state of the system and the states of its components. Many textbooks on reliability review methods for computing reliability from the complex function Φ assuming that the state of each component is a Bernoulli random variable (e.g. see Aven and Jensen [17] (chapter 2)). An instance of the representation of a 2-module software system by a structure function taking into account control flow and data flow is discussed by Singpurwalla and Wilson [2] (chapter 7). This "coherent system" approach is widely used to analyze the reliability of communication networks. However, it is well known that exact computation of reliability is then an NP-hard problem. Thus, only structure functions of a few dozen components can be analyzed exactly. Large systems have to be assessed by Monte Carlo simulation techniques (e.g. see Ball et al. [81] and Rubino [82]). Moreover, in the case of fault-tolerant software, we are faced with a highly reliable system that requires sophisticated simulation procedures to overcome the limitations of standard ones. In such a context, an alternative consists in using binary decision diagrams (see Lyu [4] (chapter 15) and Limnios and Rauzy [83]). We point out that the structure function is mainly a functional representation of the system. Thus, many issues discussed in the context of black-box modeling also have to be addressed. For instance, how do you incorporate the environmental factors identified by Pham and Pham [34]?
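As an illustration, the sketch below evaluates the reliability of a small hypothetical series-parallel system exactly from its structure function, and cross-checks the result by Monte Carlo, the route required for large systems; the layout and component reliabilities are invented for the example.

```python
# Sketch: system reliability from a coherent structure function.
# Hypothetical layout: component 1 in series with the parallel pair (2, 3).
import itertools
import math
import random

def phi(x):
    # Structure function: 1 if the system is up, 0 if it is down
    return x[0] * max(x[1], x[2])

p = [0.95, 0.90, 0.80]  # hypothetical component reliabilities

# Exact system reliability: enumerate all 2^n component state vectors.
exact = sum(
    phi(x) * math.prod(pi if xi else 1 - pi for xi, pi in zip(x, p))
    for x in itertools.product([0, 1], repeat=len(p))
)

# Monte Carlo estimate.
rng = random.Random(0)
mc = sum(phi([rng.random() < pi for pi in p]) for _ in range(100_000)) / 100_000

print(exact, mc)  # both close to 0.95 * (1 - 0.1*0.2) = 0.9310
```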

As we can see, a great deal of research is needed to obtain a white-box model that offers the advantages motivating the development of such an approach. Opening the "black box" to get accurate models is a hard task. Many aspects have to be addressed: the definition of what the structure or architecture of software is; what kind of data can be expected for future calibration of models; and so on. A first study of the potential sources of SR data available during development is given by Smidts and Sova [84]. This would help the creation of benchmark data sets, which would allow validation of white-box models. What is clear is that actual progress in white-box modeling can only be achieved through an active interaction between the statistics and software engineering communities. All this surely explains why the prevalent approach in SR is the black-box approach.

12.6.3 Statistical Issues

A delicate issue in SR is the statistical properties of the estimators used to calibrate models. The main drawbacks of the ML method are well documented in the literature. Finding ML estimators requires solving equations that may not always have a solution, or which may give an inappropriate solution. For instance, Littlewood and Verrall [85] give a criterion for N = ∞ and φ = 0 (with finite non-zero λ = Nφ) to be the unique solution of the ML equations for the (JM) model. The problem of solving ML equations in SR modeling has also been addressed [86–89]. Another well-known drawback is that such ML estimators are usually unstable with small data sets. This situation is basic in SR. Moreover, note that certain models, like (JM), assume that software contains a finite number of faults. So, using the standard asymptotic properties of ML estimators may be questionable. Such asymptotic results are well known in the case of an i.i.d. sample. In SR, however, samples are not i.i.d. That explains why recent investigations of asymptotic normality and consistency of ML estimators for standard SR models use the framework of martingale theory, which allows dependence in the data. Note that dealing with a conceptually finite (expected) number of faults requires an unusual notion of asymptotics. A detailed discussion is given by van Pul [18] (and references cited therein) and by Zhao and Xie [88] for NHPP models. Such studies are important, because these asymptotic properties are the foundations of interval estimation (a standard alternative to point estimation), of the derivation of confidence intervals for parameters, of studies of the asymptotic variance of estimators, etc. All these topics must be addressed in detail to improve the predictive quality of SR models.
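To make the difficulty concrete, the sketch below writes down the (JM) log-likelihood for observed inter-failure times and maximizes it numerically, treating N as continuous for simplicity; the data are synthetic, and in practice the maximizer may drift toward the N = ∞ boundary, which is exactly the pathology discussed above.

```python
# Sketch: numerical ML estimation for the Jelinski-Moranda model, in which
# X_i ~ Exp(phi * (N - i + 1)). Data are synthetic.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
N_true, phi_true, n = 40, 0.02, 25
x = rng.exponential(1.0 / (phi_true * (N_true - np.arange(n))))

def neg_loglik(theta):
    N, phi = theta                    # N treated as a continuous parameter
    k = N - np.arange(n)              # remaining faults before each failure
    if phi <= 0 or np.any(k <= 0):
        return np.inf
    return -np.sum(np.log(phi * k) - phi * k * x)

res = minimize(neg_loglik, x0=[n + 5.0, 0.05], method="Nelder-Mead")
print("N_hat, phi_hat =", res.x)      # may sit near a boundary for small samples
```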

It is clear that any model works well with failure data that correspond to the basic assumptions of the model. But, given data, a large number of models are inappropriate. A natural way to overcome too-stringent assumptions, in particular of the distributional type, is to use non-parametric models. However, the parametric approach remains highly prevalent in SR modeling. A major attempt to fill this gap was undertaken by Miller and Sofer [90], where a completely monotonic ROCOF is estimated by regression techniques (see also Brocklehurst and Littlewood [91]). A recent study by Littlewood and co-workers [92] uses non-parametric estimates of the distribution of the inter-failure times (Xi). This is based on kernel methods for p.d.f. estimation (see Silverman [93]). The conclusion of the authors is that the results are not very impressive but that more investigation is needed, in particular using various kernel functions. We can think, for instance, of wavelets [94]. A similar discussion may be undertaken with regard to the Bayesian approach to SR modeling. Specifically, most Bayesian inference for NHPPs assumes a parametric model for the ROCOF and proceeds with prior assumptions on the unknown parameters. In such a context, an instance of a non-parametric Bayesian approach has recently been given by Kuo and Ghosh [95]. Conceptually, the non-parametric approach is promising, but it is computationally intensive in general and is not easy to comprehend.
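As a small illustration of the kernel approach, the sketch below forms a Gaussian kernel density estimate of the inter-failure-time p.d.f. from synthetic data; the bandwidth is scipy's default rule, not one of the choices investigated in [92].

```python
# Sketch: Gaussian kernel density estimate of the inter-failure time p.d.f.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)
x = rng.exponential(2.0, size=50)   # synthetic inter-failure times

kde = gaussian_kde(x)               # bandwidth: scipy's default (Scott's rule)
grid = np.linspace(0.0, x.max(), 5)
print(grid)
print(kde(grid))                    # estimated density at the grid points
```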

These developments may be viewed as preliminary studies using statistical methods based on the so-called dynamic approach to counting processes, as reported for instance in the book by Andersen et al. [69] (whose bibliography gives a large account of research in this area). The pioneering work of Aalen was on the multiplicative intensity model, which, roughly speaking, writes the stochastic intensity associated with a counting process as

λ(t; Ht, σ(Ys, s ≤ t)) = λ(t) Y(t)

where λ(·) is a non-negative deterministic function, whereas Y(·) is a non-negative observable stochastic process whose value at any time t is known just before t (Y(·) is a predictable process). Non-parametric estimation for such a model is discussed by Andersen et al. [69] (chapter 4). A Cox-type model may be obtained by choosing Y(t) = exp( Σ_j βj Zj(t) ) with stochastic processes as covariates Zj (see Slud [96]). We can also consider additive intensity models when the multiplicative form involves multivariate functions λ(·) and Y(·) (e.g. see Pijnenburg [97]). Conceptually, this dynamic approach is appealing because it is well supported by many theoretical results. It is worth noting that martingale theory may be of some value for analyzing static models such as capture–recapture models (see Yip and coworkers [98, 99] for details). Thus, the applicability of the dynamic point of view on point processes in the small-data-set context of software engineering is clearly a direction for further investigation (e.g. see van Pul [18] for such an account). Moreover, if it is shown that the gain in predictive validity is high with respect to standard approaches, then a user-oriented "transfer of technology" must follow. That is, friendly tools for using such statistical material must be developed.
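For a flavor of non-parametric estimation in this framework, the sketch below computes the Nelson-Aalen estimate of the cumulative hazard, the basic estimator associated with Aalen's multiplicative intensity model; the failure times are synthetic and uncensored.

```python
# Sketch: Nelson-Aalen estimate of the cumulative hazard, the basic
# non-parametric estimator for Aalen's multiplicative intensity model.
import numpy as np

rng = np.random.default_rng(11)
times = np.sort(rng.exponential(3.0, size=20))   # synthetic failure times
event = np.ones_like(times, dtype=bool)          # no censoring in this example

at_risk = np.arange(len(times), 0, -1)           # Y(t): items still at risk
increments = event / at_risk                     # dN(t) / Y(t) at each failure
cum_hazard = np.cumsum(increments)

for t, h in list(zip(times, cum_hazard))[:5]:
    print(f"t = {t:6.3f}  H(t) = {h:.3f}")
```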

References

[1] Gaudoin O. Statistical tools for software reliability evaluation (in French). PhD thesis, Université Joseph Fourier–Grenoble I, 1990.
[2] Singpurwalla ND, Wilson SP. Statistical methods in software engineering: reliability and risk. Springer; 1999.
[3] Musa JD, Iannino A, Okumoto K. Software reliability: measurement, prediction, application. Computer Science Series. McGraw-Hill International Editions; 1987.
[4] Lyu MR, editor. Handbook of software reliability engineering. McGraw-Hill; 1996.
[5] Software reliability estimation and prediction handbook. American Institute of Aeronautics and Astronautics; 1992.
[6] Pham H. Software reliability. Springer; 2000.
[7] Everett W, Keene S, Nikora A. Applying software reliability engineering in the 1990s. IEEE Trans Reliab 1998;47:372–8.
[8] Ramani S, Gokhale SS, Trivedi KS. Software reliability estimation and prediction tool. Perform Eval 2000;39:37–60.
[9] Laprie J-C. Dependability: basic concepts and terminology. Springer; 1992.
[10] Barlow RE, Proschan F. Statistical theory of reliability and life testing. New York: Holt, Rinehart and Winston; 1975.
[11] Ascher H, Feingold H. Repairable systems reliability. Lecture Notes in Statistics, vol. 7. New York: Marcel Dekker; 1984.
[12] Xie M. Software reliability modeling. UK: World Scientific Publishing; 1991.
[13] Briand LC, El Emam K, Freimut BG. A comprehensive evaluation of capture-recapture models for estimating software defect content. IEEE Trans Software Eng 2000;26:518–40.
[14] Chen Y, Singpurwalla ND. Unification of software reliability models by self-exciting point processes. Adv Appl Probab 1997;29:337–52.
[15] Snyder DL, Miller MI. Random point processes in time and space. Springer; 1991.
[16] Jelinski Z, Moranda PB. Software reliability research. In: Freiberger W, editor. Statistical methods for the evaluation of computer system performance. Academic Press; 1972. p.465–84.
[17] Aven T, Jensen U. Stochastic models in reliability. Applications of Mathematics, vol. 41. Springer; 1999.
[18] Van Pul MC. A general introduction to software reliability. CWI Q 1994;7:203–44.
[19] Laprie J-C, Kanoun K, Béounes C, Kaâniche M. The KAT (knowledge–action–transformation) approach to the modeling and evaluation of reliability and availability growth. IEEE Trans Software Eng 1991;17:370–82.
[20] Cox DR, Isham V. Point processes. Chapman and Hall; 1980.
[21] Trivedi KS. Probability and statistics with reliability, queuing and computer science applications. John Wiley & Sons; 2001.
[22] Moranda PB. Predictions of software reliability during debugging. In: Annual Reliability and Maintainability Symposium, 1975; p.327–32.
[23] Shanthikumar JG. Software reliability models: a review. Microelectron Reliab 1983;23:903–43.
[24] Littlewood B. Stochastic reliability-growth: a model for fault-removal in computer programs and hardware designs. IEEE Trans Reliab 1981;30:313–20.
[25] Osaki S, Yamada S. Reliability growth models for hardware and software systems based on nonhomogeneous Poisson processes: a survey. Microelectron Reliab 1983;23:91–112.
[26] Littlewood B. Forecasting software reliability. In: Bittanti S, editor. Software reliability modeling and identification. Lecture Notes in Computer Science 341. Springer; 1988. p.141–209.
[27] Miller DR. Exponential order statistic models for software reliability growth. IEEE Trans Software Eng 1986;12:12–24.
[28] Littlewood B, Verrall JL. A Bayesian reliability growth model for computer software. Appl Stat 1973;22:332–46.
[29] Mazzuchi TA, Soyer R. A Bayes empirical-Bayes model for software reliability. IEEE Trans Reliab 1988;37:248–54.
[30] Schick GJ, Wolverton RW. Assessment of software reliability. In: Operations Research. Physica-Verlag; 1973. p.395–422.
[31] Singpurwalla ND, Soyer R. Assessing (software) reliability growth using a random coefficient autoregressive process and its ramification. IEEE Trans Software Eng 1985;11:1456–64.
[32] Singpurwalla ND, Soyer R. Nonhomogeneous autoregressive processes for tracking (software) reliability growth, and their Bayesian analysis. J R Stat Soc Ser B 1992;54:145–56.
[33] Al-Mutairi D, Chen Y, Singpurwalla ND. An adaptive concatenated failure rate model for software reliability. J Am Stat Assoc 1998;93:1150–63.
[34] Pham L, Pham H. Software reliability models with time-dependent hazard function based on Bayesian approach. IEEE Trans Syst Man Cybern Part A 2000;30:25–35.
[35] Cheung RC. A user-oriented software reliability model. IEEE Trans Software Eng 1980;6:118–25.
[36] Littlewood B. Software reliability model for modular program structure. IEEE Trans Reliab 1979;28:241–6.
[37] Siegrist K. Reliability of systems with Markov transfer of control. IEEE Trans Software Eng 1988;14:1049–53.
[38] Kaâniche M, Kanoun K. The discrete time hyperexponential model for software reliability growth evaluation. In: International Symposium on Software Reliability (ISSRE), 1992; p.64–75.
[39] Littlewood B. A reliability model for systems with Markov structure. Appl Stat 1975;24:172–7.
[40] Kubat P. Assessing reliability of modular software. Oper Res Lett 1989;8:35–41.
[41] Goseva-Popstojanova K, Trivedi KS. Architecture-based approach to reliability assessment of software systems. Perform Eval 2001;45:179–204.
[42] Ledoux J, Rubino G. Simple formulae for counting processes in reliability models. Adv Appl Probab 1997;29:1018–38.
[43] Ledoux J. Availability modeling of modular software. IEEE Trans Reliab 1999;48:159–68.
[44] Kabanov YM, Liptser RS, Shiryayev AN. Weak and strong convergence of the distributions of counting processes. Theor Probab Appl 1983;28:303–36.
[45] Gravereaux JB, Ledoux J. Poisson approximation for some point processes in reliability. Technical report, Institut National des Sciences Appliquées, Rennes, France, 2001.
[46] Ledoux J. Littlewood reliability model for modular software and Poisson approximation. In: Mathematical Methods for Reliability, June 2002; p.367–70.
[47] Abdel-Ghaly AA, Chan PY, Littlewood B. Evaluation of competing software reliability predictions. IEEE Trans Software Eng 1986;12:950–67.
[48] Brocklehurst S, Kanoun K, Laprie J-C, Littlewood B, Metge S, Mellor P, et al. Analyses of software failure data. Technical report No. 91173, Laboratoire d'Analyse et d'Architecture des Systèmes, Toulouse, France, May 1991.
[49] Kanoun K. Software dependability growth: characterization, modeling, evaluation (in French). Technical report 89.320, LAAS, Doctor ès Sciences thesis, Polytechnic National Institute, Toulouse, 1989.
[50] Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov chain Monte Carlo in practice. Chapman and Hall; 1996.
[51] Kuo L, Yang TY. Bayesian computation of software reliability. J Comput Graph Stat 1995;4:65–82.
[52] Kuo L, Yang TY. Bayesian computation for nonhomogeneous Poisson processes in software reliability. J Am Stat Assoc 1996;91:763–73.
[53] Kuo L, Lee JC, Choi K, Yang TY. Bayes inference for s-shaped software-reliability growth models. IEEE Trans Reliab 1997;46:76–80.
[54] Kremer W. Birth–death and bug counting. IEEE Trans Reliab 1983;32:37–47.
[55] Gokhale SS, Philip T, Marinos PN. A non-homogeneous Markov software reliability model with imperfect repair. In: International Performance and Dependability Symposium, 1996; p.262–70.
[56] Shanthikumar JG, Sumita U. A software reliability model with multiple-error introduction and removal. IEEE Trans Reliab 1986;35:459–62.
[57] Sahinoglu H. Compound-Poisson software reliability model. IEEE Trans Software Eng 1992;18:624–30.
[58] Fenton NE, Pfleeger SL. Software metrics: a rigorous and practical approach, 2nd edn. International Thomson Computer Press; 1996.
[59] Khoshgoftaar TM, Allen EB, Wendell DJ, Hudepohl JP. Classification-tree models of software-quality over multiple releases. IEEE Trans Reliab 2000;49:4–11.
[60] Khoshgoftaar TM, Allen EB. A practical classification-rule for software-quality models. IEEE Trans Reliab 2000;49:209–16.
[61] Gokhale SS, Trivedi KS. A time/structure based software reliability model. Ann Software Eng 1999;8:85–121.
[62] Goel AL. Software reliability models: assumptions, limitations, and applicability. IEEE Trans Software Eng 1985;11:1411–23.
[63] Xie M, Hong GY, Wohlin C. A practical method for the estimation of software reliability growth in the early stage of testing. In: International Symposium on Software Reliability (ISSRE), 1997; p.116–23.
[64] Pasquini A, Crespo AN, Matrella P. Sensitivity of reliability-growth models to operational profile errors vs. testing accuracy. IEEE Trans Reliab 1996;45:531–40.
[65] Özekici S, Soyer R. Reliability of software with an operational profile. Technical report, The George Washington University, Department of Management Science, 2000.
[66] Zhang X, Pham H. An analysis of factors affecting software reliability. J Syst Software 2000;50:43–56.
[67] Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. Wiley; 1980.
[68] Cox DR, Oakes D. Analysis of survival data. London: Chapman and Hall; 1984.
[69] Andersen PK, Borgan O, Gill RD, Keiding N. Statistical models based on counting processes. Springer Series in Statistics. Springer; 1993.
[70] Saglietti F. Systematic software testing strategies as explanatory variables of proportional hazards. In: SAFECOMP'91, 1991; p.163–7.
[71] Wright D. Incorporating explanatory variables in software reliability models. In: Second year report of PDCS, vol. 1. Esprit BRA Project 3092, May 1991.
[72] Pham H. Software reliability. In: Webster JG, editor. Wiley encyclopedia of electrical and electronics engineering. Wiley; 1999. p.565–78.
[73] Castillo X, Siewiorek DP. A workload dependent software reliability prediction model. In: 12th International Symposium on Fault-Tolerant Computing, 1982; p.279–86.
[74] Koch G, Spreij P. Software reliability as an application of martingale & filtering theory. IEEE Trans Reliab 1983;32:342–5.
[75] Brocklehurst S, Chan PY, Littlewood B, Snell J. Recalibrating software reliability models. IEEE Trans Software Eng 1990;16:458–70.
[76] Sitte R. Comparison of software-reliability-growth predictions: neural networks vs parametric-recalibration. IEEE Trans Reliab 1999;49:285–91.
[77] Pham H, Zhang X. A software cost model with warranty and risk costs. IEEE Trans Comput 1999;48:71–5.
[78] Pham H, Zhang X. Software release policies with gain in reliability justifying the costs. Ann Software Eng 1999;8:147–66.
[79] Laprie J-C, Kanoun K. X-ware reliability and availability modeling. IEEE Trans Software Eng 1992;18:130–47.
[80] Lyu MR, Gokhale SS, Trivedi KS. Reliability simulation of component-based systems. In: International Symposium on Software Reliability (ISSRE), 1998; p.192–201.
[81] Ball MO, Colbourn C, Provan JS. Network models. In: Monma C, Ball MO, Magnanti T, Nemhauser G, editors. Handbook of operations research and management science. Elsevier; 1995. p.673–762.
[82] Rubino G. Network reliability evaluation. In: Bagchi K, Walrand J, editors. State-of-the-art in performance modeling and simulation. Gordon and Breach; 1998.
[83] Limnios N, Rauzy A, editors. Special issue on binary decision diagrams and reliability. Eur J Automat 1996;30(8).
[84] Smidts C, Sova D. An architectural model for software reliability quantification: sources of data. Reliab Eng Syst Saf 1999;64:279–90.
[85] Littlewood B, Verrall JL. Likelihood function of a debugging model for computer software reliability. IEEE Trans Reliab 1981;30:145–8.
[86] Huang XZ. The limit condition of some time between failure models of software reliability. Microelectron Reliab 1990;30:481–5.
[87] Hossain SA, Dahiya RC. Estimating the parameters of a non-homogeneous Poisson-process model for software reliability. IEEE Trans Reliab 1993;42:604–12.
[88] Zhao M, Xie M. On maximum likelihood estimation for a general non-homogeneous Poisson process. Scand J Stat 1996;23:597–607.
[89] Knafl G, Morgan J. Solving ML equations for 2-parameter Poisson-process models for ungrouped software-failure data. IEEE Trans Reliab 1996;45:43–53.
[90] Miller DR, Sofer A. A nonparametric software-reliability growth model. IEEE Trans Reliab 1991;40:329–37.
[91] Brocklehurst S, Littlewood B. New ways to get accurate reliability measures. IEEE Software 1992;9:34–42.
[92] Barghout M, Littlewood B, Abdel-Ghaly AA. A non-parametric order statistics software reliability model. J Test Verif Reliab 1998;8:113–32.
[93] Silverman BW. Density estimation for statistics and data analysis. Chapman and Hall; 1986.
[94] Antoniadis A, Oppenheim G, editors. Wavelets and statistics. Lecture Notes in Statistics, vol. 103. Springer; 1995.
[95] Kuo L, Ghosh SK. Bayesian nonparametric inference for nonhomogeneous Poisson processes. Technical report 9718, University of Connecticut, Storrs, 1997.
[96] Slud EV. Some applications of counting process models with partially observed covariates. Telecommun Syst 1997;7:95–104.
[97] Pijnenburg M. Additive hazard models in repairable systems reliability. Reliab Eng Syst Saf 1991;31:369–90.
[98] Lloyd CJ, Yip PSF, Chan Sun K. Estimating the number of errors in a system using a martingale approach. IEEE Trans Reliab 1999;48:369–76.
[99] Yip PSF, Xi L, Fong DYT, Hayakawa Y. Sensitivity-analysis and estimating the number-of-faults in removing debugging. IEEE Trans Reliab 1999;48:300–5.


Chapter 13

Software Availability Theory and Its Applications

Koichi Tokuno and Shigeru Yamada

13.1 Introduction
13.2 Basic Model and Software Availability Measures
13.3 Modified Models
13.3.1 Model with Two Types of Failure
13.3.2 Model with Two Types of Restoration
13.4 Applied Models
13.4.1 Model with Computation Performance
13.4.2 Model for Hardware–Software System
13.5 Concluding Remarks

13.1 Introduction

Many methodologies for software reliability measurement and assessment have been discussed over the last few decades. A mathematical software reliability model is often called a software reliability growth model (SRGM). Several books and papers have surveyed software reliability modeling [1–8]. Most existing SRGMs have described the stochastic behavior of only the software fault-detection or software failure-occurrence phenomena during the testing phase of the software development process and the operation phase. A software failure is defined as an unacceptable departure from program operation caused by a fault remaining in the software system. These models can provide several measures of software reliability, defined as the attribute that software-intensive systems can perform without software failures for a given time period, under the specified environment; for example, the mean time between software failures (MTBSF), the expected number of faults remaining in the system, and so on. These measures are developer-oriented and are utilized for measuring and assessing the degree of achievement of software reliability, deciding the time to software release for operational use, and estimating the maintenance cost for faults undetected during the testing phase. Therefore, the traditional SRGMs have evaluated "the reliability for developers" or "the inherent reliability".

These days, the performance and quality of software systems are evaluated from customers' viewpoints. For example, customers take an interest in information on possible utilization, not the number of faults remaining in the system. One of the customer-oriented software quality attributes is software availability [9–11]; this is the attribute that software-intensive systems are available at a given time point, under the specified environment. Software availability is also called "the reliability for customers" or "the reliability with maintainability".

When we measure and assess software availability, we need to consider explicitly not only the up time (the software failure time) but also the down time to restore the system, and to describe the stochastic behavior of the system alternating between the up and down states.

This chapter surveys the current research on stochastic modeling for software availability measurement techniques. In particular, we focus on software availability modeling with Markov processes [12] to describe the time-dependent behavior of the systems. The organization of this chapter is as follows. Section 13.2 gives the basic ideas of software availability modeling and several software availability measures. Based on the basic model, Section 13.3 discusses two modified models reflecting the software failure-occurrence phenomenon and the restoration scenario peculiar to the user-operational phase, since software availability is a metric applied in the user operation phase. Section 13.4 also refers to the applied models, considering computation performance and the combination of a hardware and a software subsystem.

13.2 Basic Model and Software Availability Measures

Beforehand, we state the difference in the descriptions of the failure/restoration characteristics between hardware and software systems. The causes of hardware failures are the deterioration or wearing-out of component parts due to secular changes, and hardware systems are renewed by the replacement of failed parts. Therefore, the failure and restoration characteristics of hardware systems are often described without consideration of the cumulative number of failures or restorations. On the other hand, the causes of software failures are faults latent in the system introduced during the development process, and the restoration actions include the debugging activities for the detected faults. The debugging activities decrease the number of faults and improve software reliability, unless secondary faults are introduced. Accordingly, the failure and restoration characteristics of software systems are often described in relation to the numbers of software failures and debugging activities. The above remarks have to be reflected in software availability modeling.

The following lists the general assumptions for software availability modeling.

Assumption 1. A software system is unavailable and starts to be restored as soon as a software failure occurs, and the system cannot operate until the restoration action is complete.

Assumption 2. A restoration action implies a debugging activity, which is performed perfectly with probability a (0 < a ≤ 1) and imperfectly with probability b (= 1 − a). We call a the perfect debugging rate. One fault is corrected and removed from the software system when the debugging activity is perfect.

Assumption 3. The software failure-occurrence time (up time) and the restoration time (down time) follow exponential distributions with means 1/λn and 1/µn, respectively, where n denotes the cumulative number of corrected faults. λn and µn are functions of n.

Assumption 4. The probability that two or more software failures occur simultaneously is negligible.

Consider the stochastic behavior of the system alternating between up and down states with a Markov process. Let {X(t), t ≥ 0} be the Markov process representing the state of the system at time point t, with state space {Wn, Rn} (n = 0, 1, 2, . . . ); then denote

Wn  the system is operating
Rn  the system is inoperable due to the restoration action.

In the actual debugging environment, it often happens that some fault corrections are inexact and that secondary faults are introduced. This is the so-called imperfect debugging environment. The assumption of imperfect debugging has given rise to much controversy in software reliability/availability modeling [1, 13]. In this chapter, the following refers to the treatment of perfect debugging: the purpose of the debugging activity is to improve software quality/reliability. Therefore, we assume that a debugging activity is perfect when it contributes to the improvement of software reliability. That is, a perfect debugging activity means that the hazard rate mentioned later decreases and that one fault is corrected and removed from the system. We also assume that the increase of the hazard rate due to the introduction of new faults is negligible [14].


From Assumption 2, when the restoration action has been completed in {X(t) = Rn},

X(t) = Wn    (with probability b)
     = Wn+1  (with probability a)    (13.1)

Next, we refer to the transition probabilities between the states, i.e. the descriptions of the failure/restoration characteristics. For describing the software failure characteristic, several classical SRGMs can be applied. For example, Okumoto and Goel [15] and Kim et al. [16] have applied the model of Jelinski and Moranda [17], i.e. they have described the hazard rate λn as

λn = φ(N − n)   (n = 0, 1, 2, . . . , N; N > 0, φ > 0)    (13.2)

where N and φ are respectively the initial fault content prior to testing and the hazard rate per fault remaining in the system. Then X(t) forms a finite-state Markov process. Tokuno and Yamada [18–20] have applied the model of Moranda [21], i.e. they have given λn as

λn = Dk^n   (n = 0, 1, 2, . . . ; D > 0, 0 < k < 1)    (13.3)

where D and k are the initial hazard rate and the decreasing ratio of the hazard rate, respectively. Then X(t) forms an infinite-state Markov process. Equation 13.2 assumes that all faults have the same impact on software reliability growth. On the other hand, Equation 13.3 means that the perfect debugging activities for the faults detected earlier have a higher impact on software reliability growth than those for the later faults [2, 22].

We also mention the restoration rate µn. There are many cases where the difficulty of fault isolation and debugging continues to rise and the restoration time tends to become longer as fault correction progresses [23]. In order that such situations are reflected in the modeling, the same forms as Equations 13.2 and 13.3, which are decreasing functions of n, are often applied to µn. When we apply µn ≡ Er^n (E > 0, 0 < r ≤ 1), E and r denote the initial restoration rate and the decreasing ratio of the restoration rate, respectively.

Figure 13.1. A sample realization of Y(t)

Let Y(t) be the random variable representing the cumulative number of faults corrected up to time t. Figure 13.1 illustrates the sample behavior of Y(t), where Tn and Un (n = 0, 1, 2, . . . ) denote the random variables representing the sojourn times in states Wn and Rn, respectively. It is noted that the cumulative number of corrected faults does not always coincide with that of software failures or restoration actions. Furthermore, Figure 13.2 illustrates the sample state transition diagram of X(t).

We can obtain the state occupancy probabilities that the system is in states Wn and Rn at time point t as

PWn(t) ≡ Pr{X(t) = Wn} = gn+1(t)/(aλn) + g′n+1(t)/(aλnµn)   (n = 0, 1, 2, . . . )    (13.4)

PRn(t) ≡ Pr{X(t) = Rn} = gn+1(t)/(aµn)   (n = 0, 1, 2, . . . )    (13.5)

respectively, where gn(t) is the probability density function of the random variable Sn representing the first passage time to state Wn, and g′n(t) ≡ dgn(t)/dt. gn(t) and g′n(t) can be obtained analytically.


Figure 13.2. State transition diagram for software availability modeling

The following equation holds for any time t:

Σ_{n=0}^{∞} [PWn(t) + PRn(t)] = 1    (13.6)

The instantaneous software availability is defined as

A(t) ≡ Σ_{n=0}^{∞} PWn(t)    (13.7)

which represents the probability that the software system is operating at time point t. Furthermore, the average software availability over (0, t] is defined as

Aav(t) ≡ (1/t) ∫_0^t A(x) dx    (13.8)

which represents the expected proportion of the system's operating time in the time interval (0, t]. Using Equations 13.4 and 13.5, we can express Equations 13.7 and 13.8 as

A(t) = Σ_{n=0}^{∞} [gn+1(t)/(aλn) + g′n+1(t)/(aλnµn)]
     = 1 − Σ_{n=0}^{∞} gn+1(t)/(aµn)    (13.9)

Aav(t) = (1/t) Σ_{n=0}^{∞} [Gn+1(t)/(aλn) + gn+1(t)/(aλnµn)]
       = 1 − (1/t) Σ_{n=0}^{∞} Gn+1(t)/(aµn)    (13.10)

respectively, where Gn(t) is the distribution function of Sn.

The ratio of the down time to the up time is called the maintenance factor. Given that λn ≡ Dk^n and µn ≡ Er^n, the maintenance factor ρn is expressed by

ρn ≡ E[Un]/E[Tn] = βv^n   (β ≡ D/E, v ≡ k/r)    (13.11)

Figure 13.3. Dependence of v on A(t) (λn ≡ Dk^n, µn ≡ Er^n; a = 0.9, D = 0.1, E = 0.3, r = 0.8)


Figure 13.4. State transition diagram for software availability modeling with two types of failure

where β and v are the initial maintenance factor and the availability improvement parameter, respectively [24]. In general, the maintenance factor of a hardware system is assumed to be constant, having no bearing on the number of failures, whereas that of a software system depends on n. Figure 13.3 shows the dependence of A(t) in Equation 13.9 on v. This figure tells us that we can judge whether the software availability increases or decreases from the value of v: A(t) increases when v < 1, decreases when v > 1, and is constant when v = 1. We must consider not only the software reliability growth process but also the upward tendency of the difficulty in debugging in software availability measurement and assessment.
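To make the basic model concrete, the sketch below computes the instantaneous availability A(t) numerically by exponentiating the generator of the Markov process truncated at a finite number of corrected faults, rather than via the closed-form Equations 13.4–13.10; all parameter values are illustrative, chosen in the spirit of Figure 13.3.

```python
# Sketch: numerical A(t) for the basic availability model, truncating the
# state space at N corrected faults. States alternate W0, R0, W1, R1, ...
# Parameter values are illustrative; truncation makes W_N absorbing, which
# slightly overstates availability for very large t.
import numpy as np
from scipy.linalg import expm

a, D, E, k, r, N = 0.9, 0.1, 0.3, 0.8, 0.9, 40

dim = 2 * (N + 1)               # index 2n -> W_n, index 2n+1 -> R_n
Q = np.zeros((dim, dim))
for n in range(N + 1):
    lam, mu = D * k**n, E * r**n
    if n < N:
        Q[2*n, 2*n+1] = lam             # failure: W_n -> R_n
        Q[2*n+1, 2*n] = (1 - a) * mu    # imperfect debugging: R_n -> W_n
        Q[2*n+1, 2*n+2] = a * mu        # perfect debugging: R_n -> W_{n+1}
np.fill_diagonal(Q, -Q.sum(axis=1))     # generator rows sum to zero

p0 = np.zeros(dim); p0[0] = 1.0         # start in W_0
for t in (10.0, 100.0, 1000.0):
    p = p0 @ expm(Q * t)                # transient state probabilities
    print(t, p[0::2].sum())             # A(t): mass on the W_n states
```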

13.3 Modified Models

Since software availability is a quality characteristic to be considered in user operation, we need to adapt the basic model discussed above to the actual operational environment.

13.3.1 Model with Two Types of Failure

In this section, we assume that the following two types of software failure exist during the operation phase [25, 26]:

F1: software failures caused by faults that could not be detected/corrected during the testing phase
F2: software failures caused by faults introduced by deviating from the expected operational use.

The state space of the process {X(t), t ≥ 0} in this section is defined as follows:

Wn   the system is operating
R1n  the system is inoperable due to F1 and being restored
R2n  the system is inoperable due to F2 and being restored.

The failure and restoration characteristics of F1 are assumed to be described by forms dependent on the number of corrected faults, and those of F2 are assumed to be constant. Figure 13.4 illustrates the sample state transition diagram of X(t), where θ and η are the hazard rate and the restoration rate of F2, respectively.

We can express the instantaneous software availability and the average software availability as

A(t) = Σ_{n=0}^{∞} [gn+1(t)/(aλn) + g′n+1(t)/(aλnµn)]    (13.12)

Aav(t) = (1/t) Σ_{n=0}^{∞} [Gn+1(t)/(aλn) + gn+1(t)/(aλnµn)]    (13.13)

respectively.


Figure 13.5. Dependence of the ratio of the hazard rates of F1 and F2 on A(t) (λn ≡ Dk^n, µn ≡ Er^n; a = 0.9, η = 1.0, k = 0.8, E = 0.5, r = 0.9)

The analytical solutions of Gn(t), gn(t), and g′n(t) can also be obtained.

Figure 13.5 shows the dependence of the occurrence proportions of F1 and F2 on Equation 13.12. The hazard rates of cases (i) and (ii) in Figure 13.5 for the first software failure, without distinction between F1 and F2, α0 = D + θ, take the same value, i.e. both (i) and (ii) have α0 = 0.06. A(t) of (ii) is smaller than that of (i) in the early stage of the operation phase. However, A(t) of (ii) increases above that of (i) with time, since (ii), whose D is larger than its θ, means that the system has more room for reliability growth in the operation phase than (i), whose D is smaller than its θ.

13.3.2 Model with Two Types of Restoration

In Section 13.3.1, the basic model was modified from the viewpoint of the software failure characteristics. Tokuno and Yamada [27] have paid attention to the restoration actions for operational use. Cases often exist where the system is restored without the debugging activities corresponding to the software failures occurring in the operation phase, since protracting an inoperable period may greatly affect the customers.

This is a different policy from the testing phase. This section considers two kinds of restoration action during the operation phase: one involves debugging and the other does not. Furthermore, we assume that whether or not a debugging activity is performed is probabilistic and that the restoration time without debugging has no bearing on the number of corrected faults.

The state space of {X(t), t ≥ 0} in this section is defined as follows:

Wn   the system is operating
R1n  the system is inoperable and restored with a debugging activity
R2n  the system is inoperable and restored without a debugging activity.

Figure 13.6 illustrates the sample state transition diagram of X(t), where p (0 < p < 1) denotes the probability that a restoration action with debugging is performed, q = 1 − p, and η denotes the restoration rate without debugging.

We can express the instantaneous software availability and the average software availability as

A(t) = Σ_{n=0}^{∞} [gn+1(t)/(paλn) + g′n+1(t)/(paλnµn)]    (13.14)

Aav(t) = (1/t) Σ_{n=0}^{∞} [Gn+1(t)/(paλn) + gn+1(t)/(paλnµn)]    (13.15)

respectively. The analytical solutions of Gn(t), gn(t), and g′n(t) can also be obtained.

Figure 13.7 shows the dependence of A(t) in Equation 13.14 on p. A(t) is lower in the early stage of the operation phase, but it increases more with time as the value of p increases. A larger p means that the system developer intends to improve software reliability even during the operation phase. However, the time of a restoration with debugging tends to be longer than one without debugging. Therefore, the value of p depends on the policy of the developer and is an important factor in software availability assessment.


Figure 13.6. State transition diagram for software availability modeling with two types of restoration

Figure 13.7. Dependence of p on A(t) (λn ≡ Dk^n, µn ≡ Er^n; a = 0.9, D = 0.1, k = 0.8, E = 0.5, r = 0.9, η = 1.0)

13.4 Applied Models

13.4.1 Model with Computation Performance

In Section 13.3, the basic model was modified by detailing the inoperable states. This section discusses a new software availability measure focusing on the operable states. Tokuno and Yamada [28] have provided a software availability measure with computation performance by introducing the concept of computation capacity [29]. The computation capacity is defined as the amount of computation that the system is able to process per unit time.

The state space of {X(t), t ≥ 0} in this section is defined as follows:

Wn  the system is operating with full performance in accordance with the specification
Ln  the system is operating but its performance is degraded
Rn  the system is inoperable and being restored.

Suppose that the sojourn times in states Wn and Ln are exponentially distributed with rates θ and η, respectively. Then, Figure 13.8 illustrates the sample state transition diagram of X(t).

Let C (> 0) and Cδ (0 < δ < 1) denote the computation capacities when the system is in states Wn and Ln, respectively. Then the computation software availability is given by

Ac(t) ≡ C Σ_{n=0}^{∞} [Pr{X(t) = Wn} + δ Pr{X(t) = Ln}]    (13.16)


Figure 13.8. State transition diagram for software availability modeling with computation performance

Figure 13.9. Dependence of δ on Ac(t) (λn ≡ Dk^n, µn ≡ Er^n; C = 1.0, a = 0.9, D = 0.1, k = 0.8, E = 0.2, r = 0.9, θ = 0.1, η = 1.0)

which represents the expected value of the computation capacity at time point t. Figure 13.9 shows a numerical example of Equation 13.16.

13.4.2 Model for Hardware–Software System

Recently, the design of computer systems has begun to attach importance to hardware/software co-design (or simply co-design) [30]; this is the concept that the hardware and software systems should be designed synchronously, not separately, with due consideration of each other. Co-design is not a new concept, but it has received much attention as computer systems have grown in size and complexity and both the hardware and software systems have to be designed so as to bring out their mutual maximum performance. The concept of co-design is also important in system quality/performance measurement and assessment. Goel and Soenjoto [31] and Tokuno and Yamada [32] have discussed availability models for a system consisting of one hardware and one software subsystem. Figure 13.10 shows a sample behavior of a hardware–software system.

Figure 13.10. Sample behavior of a hardware–software system


The state space of {X(t), t ≥ 0} in this section is defined as follows:

Wn   the system is operating
RSn  the system is inoperable due to a software failure
RHn  the system is inoperable due to a hardware failure.

Suppose that the failure and restoration characteristics of the software subsystem depend on the number of corrected faults and that those of the hardware subsystem have no bearing on the number of failures. Then the instantaneous and average system availabilities can be obtained in the same forms as for the model with two types of software failure discussed in Section 13.3.1, by replacing the states RSn and RHn with the states R1n and R2n, respectively.

13.5 Concluding Remarks

In this chapter, we have surveyed Markovian models for software availability measurement and assessment. Most of the results in this chapter have been obtained in closed form. Thus, we can evaluate the performance measures more easily than with simulation models. The model parameters then have to be estimated from actual data. However, it is very difficult to collect failure data in the operation phase, especially the software restoration times. Under the present circumstances, we cannot help but set the parameters experientially from similar systems developed before. We need to provide collection procedures for the field data.

In addition, software safety has now begun to draw attention as one of the user-oriented quality characteristics. This is defined as the characteristic that the software system does not induce any hazards, whether or not the system maintains its intended functions [33], and it is distinct from software reliability. It is of urgent necessity to establish software safety evaluation methodologies. We also add that studies on software safety measurement and assessment [34, 35] are currently being pursued.

References

[1] Goel AL. Software reliability models: assumptions, limitations, and applicability. IEEE Trans Software Eng 1985;SE-11:1411–23.
[2] Lyu MR, editor. Handbook of software reliability engineering. Los Alamitos (CA): IEEE Computer Society Press; 1996.
[3] Malaiya YK, Srimani PK, editors. Software reliability models: theoretical developments, evaluation and applications. Los Alamitos (CA): IEEE Computer Society Press; 1991.
[4] Musa JD. Software reliability engineering. New York: McGraw-Hill; 1999.
[5] Pham H. Software reliability. Singapore: Springer-Verlag; 2000.
[6] Xie M. Software reliability modelling. Singapore: World Scientific; 1991.
[7] Yamada S. Software reliability models: fundamentals and applications (in Japanese). Tokyo: JUSE Press; 1994.
[8] Yamada S. Software reliability models. In: Osaki S, editor. Stochastic models in reliability and maintenance. Berlin: Springer-Verlag; 2002. p.253–80.
[9] Laprie J-C, Kanoun K, Béounes C, Kaâniche M. The KAT (knowledge–action–transformation) approach to the modeling and evaluation of reliability and availability growth. IEEE Trans Software Eng 1991;17:370–82.
[10] Laprie J-C, Kanoun K. X-ware reliability and availability modeling. IEEE Trans Software Eng 1992;18:130–47.
[11] Tokuno K, Yamada S. User-oriented software reliability assessment technology (in Japanese). Bull Jpn Soc Ind Appl Math 2000;10:186–97.
[12] Ross SM. Stochastic processes, 2nd edn. New York: John Wiley & Sons; 1996.
[13] Ohba M, Chou X. Does imperfect debugging affect software reliability growth? In: Proceedings of the 11th IEEE International Conference on Software Engineering, 1989; p.237–44.
[14] Tokuno K, Yamada S. An imperfect debugging model with two types of hazard rates for software reliability measurement and assessment. Math Comput Modell 2000;31:343–52.
[15] Okumoto K, Goel AL. Availability and other performance measures for system under imperfect maintenance. In: Proceedings of COMPSAC'78, 1978; p.66–71.
[16] Kim JH, Kim YH, Park CJ. A modified Markov model for the estimation of computer software performance. Oper Res Lett 1982;1:253–57.
[17] Jelinski Z, Moranda PB. Software reliability research. In: Freiberger W, editor. Statistical computer performance evaluation. New York: Academic Press; 1972. p.465–84.
[18] Tokuno K, Yamada S. A Markovian software availability measurement with a geometrically decreasing failure-occurrence rate. IEICE Trans Fundam 1995;E78-A:737–41.
[19] Tokuno K, Yamada S. Markovian software availability modeling for performance evaluation. In: Christer AH, Osaki S, Thomas LC, editors. Stochastic modelling in innovative manufacturing: proceedings. Berlin: Springer-Verlag; 1997. p.246–56.
[20] Tokuno K, Yamada S. Software availability model with a decreasing fault-correction rate (in Japanese). J Reliab Eng Assoc Jpn 1997;19:3–12.
[21] Moranda PB. Event-altered rate models for general reliability analysis. IEEE Trans Reliab 1979;R-28:376–81.
[22] Yamada S, Tokuno K, Osaki S. Software reliability measurement in imperfect debugging environment and its application. Reliab Eng Syst Saf 1993;40:139–47.
[23] Nakagawa Y, Takenaka I. Error complexity model for software reliability estimation (in Japanese). Trans IEICE D-I 1991;J74-D-I:379–86.
[24] Tokuno K, Yamada S. Markovian software availability measurement based on the number of restoration actions. IEICE Trans Fundam 2000;E83-A:835–41.
[25] Tokuno K, Yamada S. A Markovian software availability model for operational use (in Japanese). J Jpn Soc Software Sci Technol 1998;15:17–24.
[26] Tokuno K, Yamada S. Markovian availability measurement with two types of software failures during the operation phase. Int J Reliab Qual Saf Eng 1999;6:43–56.
[27] Tokuno K, Yamada S. Operational software availability measurement with two kinds of restoration actions. J Qual Mainten Eng 1998;4:273–83.
[28] Tokuno K, Yamada S. Markovian software availability modeling with degenerated performance. In: Lydersen S, Hansen GK, Sandtorv HA, editors. Proceedings of the European Conference on Safety and Reliability, vol. 1. Rotterdam: AA Balkema; 1998. p.425–31.
[29] Beaudry MD. Performance-related reliability measures for computing systems. IEEE Trans Comput 1978;C-27:540–7.
[30] De Micheli G. A survey of problems and methods for computer-aided hardware/software co-design. J Inform Process Soc Jpn 1995;36:605–13.
[31] Goel AL, Soenjoto J. Models for hardware–software system operational-performance evaluation. IEEE Trans Reliab 1981;R-30:232–9.
[32] Tokuno K, Yamada S. Markovian availability modeling for software-intensive systems. Int J Qual Reliab Manage 2000;17:200–12.
[33] Leveson NG. Safeware: system safety and computers. New York: Addison-Wesley; 1995.
[34] Tokuno K, Yamada S. Stochastic software safety/reliability measurement and its application. Ann Software Eng 1999;8:123–45.
[35] Tokuno K, Yamada S. Markovian reliability modeling for software safety/availability measurement. In: Pham H, editor. Recent advances in reliability and quality engineering. Singapore: World Scientific; 2001. p.181–201.

Page 278: Handbook of Reliability Engineering - pdi2.org

Software Rejuvenation: Modeling and Applications

Chapter 14

Tadashi Dohi, Katerina Goševa-Popstojanova, Kalyanaraman Vaidyanathan, Kishor S. Trivedi and Shunji Osaki

14.1 Introduction
14.2 Modeling-based Estimation
14.2.1 Examples in Telecommunication Billing Applications
14.2.2 Examples in a Transaction-based Software System
14.2.3 Examples in a Cluster System
14.3 Measurement-based Estimation
14.3.1 Time-based Estimation
14.3.2 Time- and Workload-based Estimation
14.4 Conclusion and Future Work

14.1 Introduction

Since system failures due to software faults are more frequent than failures caused by hardware faults, there is a tremendous need for improvement of software availability and reliability. Present-day applications impose stringent requirements in terms of cumulative downtime and failure-free operation of software, since, in many cases, the consequences of software failure can lead to huge economic losses or risk to human life. However, these requirements are very difficult to design for and guarantee, particularly in applications of non-trivial complexity.

Recently, the phenomenon of software aging [1–4], in which error conditions actually accrue with time and/or load, has been observed. In systems with high reliability/availability requirements, software aging can cause outages resulting in high costs. Huang and coworkers [2, 5, 6] and Jalote et al. [7] report this phenomenon in telecommunications billing applications, where over time the application experiences a crash or a hang failure. Avritzer and Weyuker [8] and Levendel [9] discuss aging in telecommunication switching software, where the effect manifests as gradual performance degradation. Software aging has also been observed in widely used software like Netscape and xrn. Perhaps the most vivid example of aging in safety-critical systems is the Patriot's software [10], where the accumulated errors led to a failure that resulted in loss of human life.

Resource leaks and other problems that cause software to age are due to software faults whose fixing is not always possible because, for example, the application developer may not have access to the source code. Furthermore, it is almost impossible to fully test and verify that a piece of software is fault free. Testing software becomes harder if it is complex, and more so if testing and debugging cycle times are reduced by shorter release time requirements. Common experience suggests that most software failures are transient in nature [11]. Since transient failures will disappear if the operation is retried in a slightly different context, it is difficult to characterize their root origin. Therefore, the residual faults have to be tolerated in the operational phase. The usual strategies to deal with failures in the operational phase are reactive in nature; they consist of action taken after the failure.

A complementary approach to handling transient software failures, called software rejuvenation, was proposed by Huang et al. [2]. Software rejuvenation is a preventive and proactive (as opposed to reactive) solution that is particularly useful for counteracting the phenomenon of software aging. It involves stopping the running software occasionally, cleaning its internal state and restarting it. Garbage collection, flushing operating system kernel tables and reinitializing internal data structures are some examples of what cleaning the internal state of software might involve [12]. An extreme, but well-known, example of rejuvenation is a hardware reboot.

Apart from being used in an ad hoc manner by almost all computer users, rejuvenation has been used to avoid unplanned outages in high availability systems, such as telecommunication systems [2, 8], where the cost of downtime is extremely high. Among typical applications of mission-critical systems, periodic software and system rejuvenation have been implemented for long-life deep-space missions [13–15].

Recently, the use of rejuvenation was extended to cluster systems [16]. Using the node failover mechanisms in a high availability cluster, one can maintain operation (though possibly at a degraded level) while rejuvenating one node at a time. The first commercial version of this kind of software rejuvenation agent, for IBM cluster servers, has been implemented in collaboration with Duke University researchers [16, 17].

Although the faults in the software still remain, performing rejuvenation periodically removes or minimizes potential error conditions due to these faults, thus preventing failures that might have unacceptable consequences. Rejuvenation has the same motivation and advantages/disadvantages as preventive maintenance policies in hardware systems. Rejuvenation typically involves an overhead but, on the other hand, it prevents more severe failures from occurring. The application will, of course, be unavailable during rejuvenation, but since this is a scheduled downtime the cost is expected to be much lower than the cost of an unscheduled downtime caused by failure. Hence, an important issue is to determine the optimal schedule for performing software rejuvenation in terms of availability and cost.

In this chapter, we present an overview of the approaches for analyzing software aging and software rejuvenation. In the following section, we describe the model-based approach to determine the optimal rejuvenation times, introducing the basic software rejuvenation model and examples from transaction-based software systems and cluster systems. In this approach, Markov/semi-Markov processes, queuing processes and stochastic Petri nets are applied to describe the system behavior. Measures such as availability and cost are formulated, and algorithms to calculate the optimal software rejuvenation times are developed. Next, we describe the measurement-based approach for detection and validation of the existence of software aging. For quantifying the effect of aging in UNIX operating system resources, we perform empirical studies with a purely time-based approach and a time- and workload-based approach. Here, we periodically monitor and collect data on the attributes responsible for determining the health of the executing software and estimate resource exhaustion times. Finally, the chapter is concluded with some remarks.

14.2 Modeling-based Estimation

The model-based approach is aimed at evaluating the effectiveness of software rejuvenation and determining the optimal schedules to perform rejuvenation. Huang et al. [2] used a continuous-time Markov chain to model software rejuvenation. They considered the two-step failure model where

Page 280: Handbook of Reliability Engineering - pdi2.org

Software Rejuvenation: Modeling and Applications 247

the application goes from the initial robust (clean) state to a failure-probable (degraded) state, from which two actions are possible: rejuvenation or transition to the failure state. Both rejuvenation and recovery from failure return the software system to the robust state.

This model was recently generalized using semi-Markov reward processes [18–20]. The optimal software rejuvenation schedules are analytically derived under the steady-state availability and the expected cost per unit time in the steady state. Further, non-parametric statistical algorithms to estimate the optimal software rejuvenation schedules are also developed. Thus, this approach does not depend on the form of the failure time distribution function. To deal with deterministic intervals between successive rejuvenations, Garg et al. [21] used a Markov regenerative stochastic Petri net model.

The fine-grained software rejuvenation model presented by Bobbio and Sereno [22] takes a different approach to characterizing the effect of software aging. It assumes that the degradation process consists of a sequence of additive random shocks; the system is considered out of service as soon as the appropriate parameter reaches an assigned threshold level. Garg et al. [23] analyzed the effects of checkpointing and rejuvenation used together on the expected completion time of a software program. A fluid stochastic Petri-net-based model that captures the behavior of aging software systems which employ rejuvenation, restoration, and checkpointing was also proposed [24]. The use of preventive on-board maintenance that includes periodic software and system rejuvenation has also been proposed and analyzed [13–15].

Garg et al. [25] include arrival and queuing of jobs in the system and compute a load- and time-dependent rejuvenation policy. The above models consider the effect of aging as crash/hang failures, referred to as hard failures, which result in unavailability of the software. However, the aging of a software system can also manifest as soft failures, i.e. performance degradation. Pfening et al. [26] modeled performance degradation by a gradual decrease of the service rate. Both effects of aging, hard failures that result in unavailability and soft failures that result in performance degradation, are considered in the model of a transaction-based software system presented by Garg et al. [27]. This model was generalized by considering multiple servers [28] and threshold policies [29].

Recently, stochastic models of time-based and prediction-based rejuvenation as applied to cluster systems were developed [16]. These models capture a multitude of cluster system characteristics, failure behavior, and performability measures.

14.2.1 Examples in Telecommunication Billing Applications

Consider the basic software rejuvenation model proposed by Huang et al. [2]. Suppose that the stochastic behavior of the system can be described by a simple continuous-time Markov chain (CTMC) with the following four states.

State 0: highly robust state (normal operation state).

State 1: failure probable state.

State 2: failure state.

State 3: software rejuvenation state.

Figure 14.1 depicts the state transition diagram of the CTMC. Let Z be the random time interval when the highly robust state changes to the failure probable state, having the exponential distribution Pr{Z ≤ t} = F0(t) = 1 − exp(−t/µ0) (µ0 > 0). Just after the state becomes the failure probable state, a system failure may occur with a positive probability. Without loss of generality, we assume that the random variable Z is observable during the system operation.

Figure 14.1. State transition diagram of CTMC

Define the failure time X (from State 1) and the repair time Y, having the exponential distributions Pr{X ≤ t} = Ff(t) = 1 − exp(−t/λf) and Pr{Y ≤ t} = Fa(t) = 1 − exp(−t/µa) (λf > 0, µa > 0). If the system failure occurs before triggering a software rejuvenation, then the repair is started immediately at that time and is completed


after the random time Y elapses. Otherwise, the software rejuvenation is started. Note that the software rejuvenation cycle is measured from the time instant just after the system enters State 1. Define the distribution functions of the time to invoke the software rejuvenation and of the time to complete software rejuvenation by Fr(t) = 1 − exp(−t/µr) and Fc(t) = 1 − exp(−t/µc) (µr > 0, µc > 0) respectively. Huang et al. [2] analyzed the above CTMC and calculated the expected system downtime and the expected cost per unit time in the steady state. However, it should be noted that the rejuvenation schedule for the CTMC model is not feasible, since the preventive maintenance time to rejuvenate the system is an exponentially distributed random variable.
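To make the CTMC analysis concrete, here is a minimal sketch that builds the generator matrix of the four-state model and solves for its steady-state probabilities; the mean times are hypothetical placeholders (in hours), not values taken from Huang et al. [2].

```python
import numpy as np

# Hypothetical mean times (hours); the chapter's parameters are means,
# so each transition rate is the reciprocal of the corresponding mean.
mu0, lam_f, mu_r, mu_a, mu_c = 240.0, 720.0, 168.0, 0.5, 1.0 / 6.0

# States: 0 robust, 1 failure probable, 2 failed, 3 rejuvenating
Q = np.zeros((4, 4))
Q[0, 1] = 1.0 / mu0     # robust -> failure probable
Q[1, 2] = 1.0 / lam_f   # failure probable -> failure
Q[1, 3] = 1.0 / mu_r    # failure probable -> rejuvenation triggered
Q[2, 0] = 1.0 / mu_a    # repair completes
Q[3, 0] = 1.0 / mu_c    # rejuvenation completes
np.fill_diagonal(Q, -Q.sum(axis=1))   # generator rows must sum to zero

# Solve pi Q = 0 subject to sum(pi) = 1 (least squares on the stacked system)
A = np.vstack([Q.T, np.ones(4)])
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]
print("steady-state availability:", pi[0] + pi[1])   # States 0 and 1 are up
```

The expected downtime and the cost criterion of [2] follow from the same vector π by weighting the down states (2 and 3) with their durations and costs.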

It is not difficult to introduce a periodic rejuvenation schedule and to extend the CTMC model to a general one. Dohi et al. [18–20] developed semi-Markov models with periodic rejuvenation and general transition distribution functions. More specifically, let Z be the random variable having the common distribution function Pr{Z ≤ t} = F0(t) with finite mean µ0 (>0). Also, let X and Y be the random variables having the common distribution functions Pr{X ≤ t} = Ff(t) and Pr{Y ≤ t} = Fa(t) with finite means λf (>0) and µa (>0) respectively. Denote the distribution function of the time to invoke the software rejuvenation and the distribution of the time to complete software rejuvenation by Fr(t) and Fc(t) (with mean µc (>0)) respectively. After completing the repair or the rejuvenation, the software system becomes as good as new, and the software age is initiated at the beginning of the next highly robust state. Consequently, we define the time interval from the beginning of the system operation to the next one as one cycle, and the same cycle is repeated again and again.

If we consider the time to software rejuvenation as a constant t0, then it follows that

$$
F_r(t) = U(t - t_0) =
\begin{cases}
1 & \text{if } t \ge t_0 \\
0 & \text{otherwise}
\end{cases}
\tag{14.1}
$$

We call t0 (≥0) the software rejuvenation interval in this chapter, and U(·) is the unit step function.

The underlying stochastic process is a semi-Markov process with four regeneration states. If the sojourn times in all states are exponentially distributed, of course, this model is the CTMC in Huang et al. [2]. From the familiar renewal argument [30], we formulate the steady-state system availability as

$$
A(t_0) = \Pr\{\text{software system is operative in the steady state}\}
= \frac{\mu_0 + \int_0^{t_0} \bar{F}_f(t)\,\mathrm{d}t}{\mu_0 + \mu_a F_f(t_0) + \mu_c \bar{F}_f(t_0) + \int_0^{t_0} \bar{F}_f(t)\,\mathrm{d}t}
= S(t_0)/T(t_0)
\tag{14.2}
$$

where in general $\bar{\phi}(\cdot) = 1 - \phi(\cdot)$. The problem is to derive the optimal software rejuvenation interval t0* that maximizes the steady-state system availability A(t0).

We make the following assumption.

Assumption 1. µa > µc.

Assumption 1 means that the mean time to repair is strictly larger than the mean time to complete the software rejuvenation. This assumption is quite reasonable and intuitive. The following result gives the optimal software rejuvenation schedule for the semi-Markov model.


Theorem 1.
(1) Suppose that the failure time distribution is strictly increasing failure rate (IFR) under Assumption 1. Define the following nonlinear function:

$$
q(t_0) = T(t_0) - \{(\mu_a - \mu_c) r_f(t_0) + 1\} S(t_0)
\tag{14.3}
$$

where $r_f(t) = (\mathrm{d}F_f(t)/\mathrm{d}t)/\bar{F}_f(t)$ is the failure rate.

(i) If q(0) > 0 and q(∞) < 0, then there exists a finite and unique optimal software rejuvenation schedule t0* (0 < t0* < ∞) satisfying q(t0*) = 0, and the maximum system availability is

$$
A(t_0^*) = \frac{1}{(\mu_a - \mu_c) r_f(t_0^*) + 1}
\tag{14.4}
$$

(ii) If q(0) ≤ 0, then the optimal software rejuvenation schedule is t0* = 0, i.e. it is optimal to start the rejuvenation just after entering the failure probable state, and the maximum system availability is A(0) = µ0/(µ0 + µc).

(iii) If q(∞) ≥ 0, then the optimal rejuvenation schedule is t0* → ∞, i.e. it is optimal not to carry out the rejuvenation, and the maximum system availability is A(∞) = (µ0 + λf)/(µ0 + µa + λf).

(2) Suppose that the failure time distribution is decreasing failure rate (DFR) under Assumption 1. Then the system availability A(t0) is a convex function of t0, and the optimal rejuvenation schedule is t0* = 0 or t0* → ∞.

Proof. Differentiating A(t0) with respect to t0 and setting it equal to zero implies q(t0) = 0. Further differentiation yields

$$
\frac{\mathrm{d}q(t_0)}{\mathrm{d}t_0} = -(\mu_a - \mu_c) S(t_0) \frac{\mathrm{d}r_f(t_0)}{\mathrm{d}t_0}
\tag{14.5}
$$

If rf(t0) is strictly increasing, then the function q(t0) is strictly decreasing and the system availability A(t0) is strictly concave in t0 under Assumption 1. Further, if q(0) > 0 and q(∞) < 0, then there exists a unique optimal software rejuvenation schedule t0* (0 < t0* < ∞) satisfying q(t0*) = 0. If q(0) ≤ 0 or q(∞) ≥ 0, then the system availability is monotonically increasing or decreasing in t0, and the optimal policy becomes t0* = 0 or t0* → ∞. On the other hand, if rf(t0) is a decreasing function of t0, then A(t0) is a convex function of t0, and the optimal software rejuvenation schedule is t0* = 0 or t0* → ∞. The proof is thus completed. □

It is easy to check that the result above implies the result in Huang et al. [2], although they used the system downtime and its associated cost as criteria of optimality. As is clear from Theorem 1, when the failure time obeys the exponential distribution, the optimal software rejuvenation schedule becomes t0* = 0 or t0* → ∞. This means that the rejuvenation should be performed as soon as the software enters the failure probable state (t0 = 0) or should not be performed at all (t0 → ∞). Therefore, the determination of the optimal rejuvenation schedule based on the system availability is never motivated in such a situation. Since for a software system that ages it is more realistic to assume that the failure time distribution is strictly IFR, our general setting is plausible and the result satisfies our intuition.
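As a numerical sketch of Theorem 1, the snippet below finds the root of q(t0) in Equation 14.3 for a strictly IFR Weibull failure time, using the parameter values quoted later with Figure 14.2; SciPy is assumed to be available.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

# Weibull failure time (shape beta > 1, hence strictly IFR) and model means,
# matching the values quoted with Figure 14.2
beta, theta = 4.0, 0.9
mu0, mu_a, mu_c = 2.0, 0.04, 0.03

Fbar = lambda t: np.exp(-(t / theta) ** beta)               # survivor function
rf = lambda t: (beta / theta) * (t / theta) ** (beta - 1)   # failure rate

S = lambda t0: mu0 + quad(Fbar, 0.0, t0)[0]
T = lambda t0: (mu0 + mu_a * (1.0 - Fbar(t0)) + mu_c * Fbar(t0)
                + quad(Fbar, 0.0, t0)[0])
q = lambda t0: T(t0) - ((mu_a - mu_c) * rf(t0) + 1.0) * S(t0)  # Equation 14.3

t0_opt = brentq(q, 1e-9, 10.0)        # case (i): q(0) > 0 and q(inf) < 0
print("t0* =", t0_opt, " A(t0*) =", S(t0_opt) / T(t0_opt))
```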

Dohi et al. [19] developed a non-parametric algorithm to estimate the optimal software rejuvenation time when the failure time data are obtained but the underlying distribution function is unknown. Before developing the statistical estimation algorithms for the optimal software rejuvenation schedules, we translate the underlying problem max_{0≤t0<∞} A(t0) into a graphical one. Following Barlow and Campo [31], define the scaled total time on test (TTT) transform of the failure time distribution:

$$
\phi(p) = (1/\lambda_f) \int_0^{F_f^{-1}(p)} \bar{F}_f(t)\,\mathrm{d}t
\tag{14.6}
$$

where

$$
F_f^{-1}(p) = \inf\{t_0;\ F_f(t_0) \ge p\} \quad (0 \le p \le 1)
\tag{14.7}
$$

It is well known [31] that Ff(t) is IFR (DFR) if and only if φ(p) is concave (convex) on p ∈ [0, 1]. After a few algebraic manipulations, we have the following result.


Theorem 2. Obtaining the optimal software rejuvenation schedule t0* maximizing the system availability A(t0) is equivalent to obtaining p* (0 ≤ p* ≤ 1) such that

$$
\max_{0 \le p \le 1} \frac{\phi(p) + \alpha}{p + \beta}
\tag{14.8}
$$

where α = µ0/λf and β = µc/(µa − µc).

The proof is omitted for brevity. From Theorem 2 it follows that the optimal software rejuvenation schedule t0* = Ff^{−1}(p*) is determined by calculating the optimal point p* (0 ≤ p* ≤ 1) maximizing the tangent slope from the point (−β, −α) ∈ (−∞, 0) × (−∞, 0) to the curve (p, φ(p)) ∈ [0, 1] × [0, 1].

Suppose that the optimal software rejuvenation schedule needs to be estimated from an ordered sample of observations 0 = x0 ≤ x1 ≤ x2 ≤ · · · ≤ xn of the failure times from an absolutely continuous distribution Ff, which is unknown. Then the scaled TTT statistics based on this sample are defined by φnj = ψj/ψn, where

$$
\psi_j = \sum_{k=1}^{j} (n - k + 1)(x_k - x_{k-1}) \quad (j = 1, 2, \ldots, n;\ \psi_0 = 0)
\tag{14.9}
$$

The empirical distribution function Fn(x) corresponding to the sample data xj (j = 0, 1, 2, . . . , n) is

$$
F_n(x) =
\begin{cases}
j/n & \text{for } x_j \le x < x_{j+1} \\
1 & \text{for } x_n \le x
\end{cases}
\tag{14.10}
$$

From this, we obtain the polygon by plotting the points (Fn(xj), φnj) (j = 0, 1, 2, . . . , n) and connecting them by line segments, known as the scaled TTT plot. In other words, the scaled TTT plot can be regarded as a numerical counterpart of the scaled TTT transform.

The following result gives a non-parametric statistical estimation algorithm for the optimal software rejuvenation schedule.

Theorem 3.
(i) Suppose that the optimal software rejuvenation schedule is to be estimated from n ordered samples 0 = x0 ≤ x1 ≤ x2 ≤ · · · ≤ xn of the failure times from an absolutely continuous distribution Ff, which is unknown. Then a non-parametric estimator of the optimal software rejuvenation schedule t0* that maximizes A(t0) is given by xj*, where

$$
j^* = \left\{ j \;\middle|\; \max_{0 \le j \le n} \frac{\phi_{nj} + \hat{\alpha}}{j/n + \hat{\beta}} \right\}
\tag{14.11}
$$

and the theoretical mean λf is replaced by the sample mean $\sum_{k=1}^{n} x_k / n$.

(ii) The estimator given in (i) is strongly consistent, i.e. xj* converges to the optimal solution t0* uniformly with probability one as n → ∞, if a unique optimal software rejuvenation schedule exists.

It is straightforward to prove result (i) from Theorem 2. The uniform convergence property in (ii) is due to the Glivenko–Cantelli lemma (e.g. see [31]) and the strong law of large numbers. The graphical procedure proposed here has an educational value for better understanding of the optimization problem, and it is convenient for performing sensitivity analysis of the optimal software rejuvenation policy when different values are assigned to the model parameters. Finally, it enables us to estimate the optimal schedule without specifying the failure time distribution. Although some typical theoretical distribution functions, such as the Weibull distribution and the gamma distribution, are often assumed in reliability analysis, our non-parametric estimation algorithm can generate the optimal software rejuvenation schedule using on-line knowledge of the observed failure times.

Figure 14.2 shows the estimation result of the optimal software rejuvenation schedule for the semi-Markov model, where the failure time data are generated by the Weibull distribution with shape parameter β = 4.0 and scale parameter θ = 0.9. For 100 failure data points, the estimates of the optimal rejuvenation schedule and the maximum system availability are t0* = 0.56592 and A(t0*) = 0.987813 respectively. We also confirmed through a simulation study in [18–20] that the non-parametric estimator of the optimal software rejuvenation time derived using Theorem 3 has a nice convergence property.

Figure 14.2. Estimation of the optimal software rejuvenation schedule on the two-dimensional graph: θ = 0.9, β = 4.0, µ0 = 2.0, µa = 0.04, µc = 0.03
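The estimation algorithm of Theorem 3 is short enough to sketch directly: compute the scaled TTT statistics of Equation 14.9, replace λf by the sample mean, and pick the maximizer in Equation 14.11. The failure-time sample below is simulated from the same Weibull distribution, not the data set behind Figure 14.2.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(0.9 * rng.weibull(4.0, size=100))   # ordered failure-time sample
mu0, mu_a, mu_c = 2.0, 0.04, 0.03
n = x.size

gaps = np.diff(np.concatenate(([0.0], x)))      # x_k - x_{k-1}, with x_0 = 0
psi = np.cumsum((n - np.arange(n)) * gaps)      # Equation 14.9
phi = np.concatenate(([0.0], psi / psi[-1]))    # scaled TTT statistics phi_nj
p = np.arange(n + 1) / n                        # j/n, j = 0, ..., n

alpha_hat = mu0 / x.mean()                      # lambda_f -> sample mean
beta_hat = mu_c / (mu_a - mu_c)
j_star = int(np.argmax((phi + alpha_hat) / (p + beta_hat)))  # Equation 14.11
t0_hat = x[j_star - 1] if j_star > 0 else 0.0
print("estimated optimal rejuvenation schedule:", t0_hat)
```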

14.2.2 Examples in a Transaction-based Software System

Next, consider the rejuvenation scheme for a single-server software system. Garg et al. [27] analyzed a queuing system with preventive maintenance as a mathematical model for a transaction-based software system. Suppose that transactions arrive at the system in accordance with a homogeneous Poisson process with rate λ (>0). The transactions are processed with service rate µ(·) (>0) based on the first come, first served (FCFS) discipline. Transactions that arrive during the service of another transaction are accumulated in a buffer whose maximum capacity is K (>1), where without loss of generality the buffer is empty at time t = 0. If a transaction arrives at the system when K transactions have already been accumulated in the buffer, it will be rejected from the system automatically. That is to say, the software system under consideration can continue processing if there exists at least one transaction in the buffer. When the buffer is empty, the system will wait for the arrival of an additional transaction, keeping the system idle.

The service rate µ(·) is assumed to be a function of the operation time t, the number of transactions in the buffer, and/or other variables. For example, let {Xt; t ≥ 0} be the number of transactions accumulated in the buffer at time t. Then the service rate may be given by µ(t), µ(Xt) or µ(t, Xt). In this model, the software system is assumed to be unreliable, where the failure rate ρ(·) also depends on the operation time t, the number of transactions in the buffer and/or other variables, i.e. it can be described as ρ(t), ρ(Xt) or ρ(t, Xt). The recovery (repair) operation starts immediately when the system fails, where the recovery time is the random variable Yr with E[Yr] = γr (>0). A newly arriving transaction is rejected if it arrives at the system during the recovery operation period. Intuitively, software rejuvenation is beneficial if the software failure rate ρ(·) increases with the passage of time [2, 27]. Thus, it is assumed that the system failure time distribution is IFR, i.e. the software failure rate ρ(·) is an increasing function of time t. Let YR, the preventive maintenance time for the software rejuvenation, be the random variable with E[YR] = γR (>0). Suppose that the software system operation is started at time t = 0 with an empty buffer. The system is regarded as good as new after both the completion of software rejuvenation and the completion of recovery.

Garg et al. [27] consider the following software rejuvenation schemes based on the cumulative operation time.

Policy I: rejuvenate the system when the cumulative operation time reaches T (>0). Then all transactions in the buffer, if any, will be lost.

Policy II: rejuvenate the system at the beginning of the first idle period following the cumulative operation time T (>0).

Figure 14.3 depicts the possible behavior of the software system under the two kinds of rejuvenation policies.

Figure 14.3. Possible realization of a transaction-based software system under the two policies

Consider three dependability measures: steady-state availability, loss probability of transactions, and mean response time of transactions. We will apply the well-known theory of the Markov regenerative process (MRGP) to derive these measures. First, consider a discrete-time Markov chain (DTMC) with three states.

State A: system operation.

State B: system is recovering from failure.

State C: system is undergoing software rejuvenation.

Figure 14.4 is the state transition diagram of the DTMC. Let PAB and PAC denote the transition probabilities from State A to B and from State A to C respectively. Taking account of PAC = 1 − PAB, the transition probability matrix is given by

$$
\mathbf{P} =
\begin{pmatrix}
0 & P_{AB} & P_{AC} \\
1 & 0 & 0 \\
1 & 0 & 0
\end{pmatrix}
$$

The steady-state probabilities for the DTMC can easily be found as

$$
\pi_A = \tfrac{1}{2}
\tag{14.12}
$$

$$
\pi_B = \frac{P_{AB}}{2}
\tag{14.13}
$$

$$
\pi_C = \frac{P_{AC}}{2}
\tag{14.14}
$$

Figure 14.4. State transition diagram of DTMC

Next, define the following random variables:

U is the sojourn time in State A in the steady state;

Un is the sojourn time in State A when n (= 0, . . . , K) transactions are accumulated in the buffer, so that

$$
U = \sum_{n=0}^{K} U_n \quad \text{almost surely}
\tag{14.15}
$$

Nl is the number of transactions lost at the transition from State A to State B or C.

Theorem 4.
(i) The steady-state system availability is

$$
A_{ss} = \frac{\pi_A E[U]}{\pi_A E[U] + \pi_B \gamma_r + \pi_C \gamma_R}
= \frac{E[U]}{E[U] + P_{AB}\gamma_r + P_{AC}\gamma_R}
\tag{14.16}
$$

(ii) The loss probability of transactions is

$$
P_{loss} = \frac{\lambda(P_{AB}\gamma_r + P_{AC}\gamma_R + E[U_K]) + E[N_l]}{\lambda(E[U] + P_{AB}\gamma_r + P_{AC}\gamma_R)}
\tag{14.17}
$$

(iii) Let E and Ws be the expected number of transactions processed by the system and the expected total processing time for transactions served by the system respectively, where E = λ(E[U] − E[U_K]). Then an upper bound of the mean response time of transactions, Tres = Ws/(E − E[N_l]), is given by

$$
T_{res} < \frac{W}{E - E[N_l]}
\tag{14.18}
$$

In this way, we can implicitly formulate the steady-state availability, the loss probability of transactions, and the upper bound of the mean response time of transactions, based on this simple DTMC. Next, we wish to derive the optimal rejuvenation interval T* under Policy I and Policy II. To do this, we represent the steady-state availability, the loss probability of transactions, and the upper bound of the mean response time as functions of T, i.e. Ass(T), Ploss(T) and Tres(T). Since these functions include the parameters PAB, E[Un] and E[Nl] depending on T, we need to calculate PAB, E[Un] and E[Nl] numerically. Define the following probabilities:

pn(t) is the probability that n transactions remain in the buffer at time t;

pn′(t) is the probability that the system fails up to time t and that n transactions remain in the buffer at the failure time;

where n = 0, . . . , K. First, we analyze a single-server queuing system under Policy I. Consider a non-homogeneous continuous-time Markov chain (NHCTMC) with 2(K + 1) states, where:

States 0, . . . , K: 0, . . . , K denotes the number of transactions in the buffer;

States 0′, . . . , K′: the system fails with 0, . . . , K transactions in the buffer (absorbing states).

Figure 14.5. State transition diagram of Policy I

Figure 14.5 illustrates the state transition diagram for the NHCTMC under Policy I. Applying the well-known state-space method to the NHCTMC, we can formulate the difference-differential equations (Kolmogorov's forward equations) for pn(t) and pn′(t) as follows:

$$
\frac{\mathrm{d}p_0(t)}{\mathrm{d}t} = \mu(\cdot)p_1(t) - \{\lambda + \rho(\cdot)\}p_0(t)
\tag{14.19}
$$

$$
\frac{\mathrm{d}p_n(t)}{\mathrm{d}t} = \mu(\cdot)p_{n+1}(t) + \lambda p_{n-1}(t) - \{\mu(\cdot) + \lambda + \rho(\cdot)\}p_n(t) \quad n = 1, \ldots, K - 1
\tag{14.20}
$$

$$
\frac{\mathrm{d}p_K(t)}{\mathrm{d}t} = \lambda p_{K-1}(t) - \{\mu(\cdot) + \rho(\cdot)\}p_K(t)
\tag{14.21}
$$

$$
\frac{\mathrm{d}p_{n'}(t)}{\mathrm{d}t} = \rho(\cdot)p_n(t) \quad n = 0, \ldots, K
\tag{14.22}
$$

By solving the difference-differential equations numerically, we can obtain

$$
P_{AB} = \sum_{n=0}^{K} p_{n'}(T)
\tag{14.23}
$$

$$
E[U_n] = \int_0^T p_n(t)\,\mathrm{d}t \quad n = 1, \ldots, K
\tag{14.24}
$$

$$
E[N_l] = \sum_{n=0}^{K} n\{p_n(T) + p_{n'}(T)\}
\tag{14.25}
$$

and the dependability measures in Equations 14.16, 14.17 and 14.18 based on Policy I.

Figure 14.6. State transition diagram of Policy II

Figure 14.6 illustrates the state transition diagram for the NHCTMC under Policy II. In a fashion similar to Policy I, we derive the difference-differential equations for pn(t) and pn′(t) under Policy II. Since the system starts the software rejuvenation when the buffer first becomes empty after the cumulative operation time T, the state transition diagrams of Policies I and II are equivalent for t ≤ T. For t > T, the difference-differential equations under Policy II are as follows:

$$
\frac{\mathrm{d}p_0(t)}{\mathrm{d}t} = \mu(\cdot)p_1(t)
\tag{14.26}
$$

$$
\frac{\mathrm{d}p_1(t)}{\mathrm{d}t} = \mu(\cdot)p_2(t) - \{\mu(\cdot) + \lambda + \rho(\cdot)\}p_1(t)
\tag{14.27}
$$

$$
\frac{\mathrm{d}p_n(t)}{\mathrm{d}t} = \mu(\cdot)p_{n+1}(t) + \lambda p_{n-1}(t) - \{\mu(\cdot) + \lambda + \rho(\cdot)\}p_n(t) \quad n = 2, \ldots, K - 1
\tag{14.28}
$$

$$
\frac{\mathrm{d}p_K(t)}{\mathrm{d}t} = \lambda p_{K-1}(t) - \{\mu(\cdot) + \rho(\cdot)\}p_K(t)
\tag{14.29}
$$

$$
\frac{\mathrm{d}p_{n'}(t)}{\mathrm{d}t} = \rho(\cdot)p_n(t) \quad n = 1, \ldots, K
\tag{14.30}
$$

Using pn(t) and pn′(t) from these difference-differential equations, PAB, E[Un] and E[Nl] are given by

$$
P_{AB} = \sum_{n=0}^{K} p_{n'}(\infty)
\tag{14.31}
$$

$$
E[U_0] = \int_0^T p_0(t)\,\mathrm{d}t
\tag{14.32}
$$

$$
E[U_n] = \int_0^\infty p_n(t)\,\mathrm{d}t \quad n = 1, \ldots, K
\tag{14.33}
$$

$$
E[N_l] = \sum_{n=0}^{K} n p_{n'}(\infty)
\tag{14.34}
$$

The difference-differential equations derived here can be solved with standard numerical methods, such as the Runge–Kutta and Adams methods. However, it is not always easy to carry out the numerical calculation, since the computation time depends on the size of the underlying NHCTMC, i.e. the number of states.
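As an illustration, the sketch below integrates the Policy I equations (14.19–14.22) with SciPy's solve_ivp and evaluates Equations 14.23 and 14.25 at the rejuvenation interval T. The buffer size, arrival rate and the functional forms of µ(t) and ρ(t) are hypothetical stand-ins, not the exact functions used in [27].

```python
import numpy as np
from scipy.integrate import solve_ivp

K, lam, T_rejuv = 5, 10.0, 100.0          # assumed buffer size, arrival rate, interval
mu = lambda t: 15.0 / (1.0 + 0.001 * t)   # slowly degrading service rate (assumed)
rho = lambda t: 0.002 * (1.0 + 0.05 * t)  # increasing failure rate (assumed)

def forward_equations(t, y):
    p, pf = y[:K + 1], y[K + 1:]          # pf[n]: failed with n in buffer
    dp = np.zeros(K + 1)
    dp[0] = mu(t) * p[1] - (lam + rho(t)) * p[0]                  # Eq. 14.19
    for n in range(1, K):
        dp[n] = (mu(t) * p[n + 1] + lam * p[n - 1]
                 - (mu(t) + lam + rho(t)) * p[n])                 # Eq. 14.20
    dp[K] = lam * p[K - 1] - (mu(t) + rho(t)) * p[K]              # Eq. 14.21
    return np.concatenate([dp, rho(t) * p])                      # Eq. 14.22

y0 = np.zeros(2 * (K + 1)); y0[0] = 1.0   # empty buffer at t = 0
sol = solve_ivp(forward_equations, (0.0, T_rejuv), y0, rtol=1e-8, atol=1e-10)

p_T, pf_T = sol.y[:K + 1, -1], sol.y[K + 1:, -1]
print("P_AB  =", pf_T.sum())                                  # Eq. 14.23
print("E[Nl] =", (np.arange(K + 1) * (p_T + pf_T)).sum())     # Eq. 14.25
```

E[Un] in Equation 14.24 can be obtained in the same run by integrating pn(t) over [0, T], e.g. with the solver's dense output.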

Figure 14.7. Ass and Ploss for both policies plotted against T

We illustrate the usefulness of the model presented in determining the optimum value of T with an example adopted from Garg et al. [27]. The failure rate and the service rate are assumed to be functions of real time, where ρ(t) is defined to be the hazard function of the Weibull distribution, and µ(t) is defined to be a monotone non-increasing function that approximates the service degradation. Figure 14.7 shows Ass and Ploss for both policies plotted against T for different values of the mean time to perform rejuvenation γR. Under both policies, it can be seen that, for any particular value of T, the higher the value of γR, the lower is the availability and the higher is the corresponding loss probability. It can also be observed that the value of T which minimizes the probability of loss is much lower than that which maximizes availability. In fact, the probability of loss becomes very high at values of T that maximize availability. For any specific value of γR, Policy II results in a lower minimum loss probability than that achieved under Policy I. Therefore, if the objective is to minimize the long-run probability of loss, as in the case of telecommunication switching software, Policy II always fares better than Policy I.


14.2.3 Examples in a Cluster System

In this section we discuss software rejuvenation as applied to cluster systems [16]. This is a novel application, which significantly improves cluster system availability and productivity. Cluster computing [32] has become increasingly popular in the last decade, and is being widely used in various kinds of applications, many of which are not tolerant of system failures and the resulting loss of service. By coupling software rejuvenation with clustering, significant increases in cluster system availability and performance can be achieved. Taking advantage of node failovers in a cluster, one can still maintain operation (though possibly at a degraded level) by rejuvenating one node at a time. The main assumptions here are that a node rejuvenation takes less time to perform, is less disruptive, and is less expensive than recovery from an unplanned node failure. Simple time-based rejuvenation policies, in which the nodes are rejuvenated at regular intervals, can be implemented easily. The cluster system availability and service level can be further enhanced by taking a more proactive approach to detect and predict an impending outage of a specific server in order to initiate planned failover in a more orderly fashion. This approach not only improves the end user's perception of the service provided by the system, but also gives the system administrator additional time to work around any system capacity issues that may arise.

Figure 14.8. SRN model of a cluster system employing time-based rejuvenation

The stochastic reward net (SRN) model of a cluster system employing simple time-based rejuvenation is shown in Figure 14.8. The cluster consists of n nodes that are initially in a "robust" working state Pup. The aging process is modeled as a two-stage hypo-exponential distribution (increasing failure rate) with transitions Tfprob and Tnodefail1. Place Pfprob represents a "failure-probable" state in which the nodes are still operational. The nodes then can eventually transit to the fail state, Pnodefail1. A node can be repaired through the transition Tnoderepair, with coverage c. In addition to individual node failures, there is also a common-mode failure (transition Tcmode). The system is also considered down when there are a (a ≤ n) individual node failures. The system is repaired through the transition Tsysrepair.

In the simple time-based policy, rejuvenation is performed successively for all the operational nodes in the cluster at the end of each deterministic interval. The transition Trejuvinterval fires every d time units, depositing a token in place Pstartrejuv. Only one node can be rejuvenated at any time (at places Prejuv1 or Prejuv2). Weight functions are assigned such that the probability of selecting a token from Pup or Pfprob is directly proportional to the number of tokens in each. After a node has been rejuvenated, it goes back to the "robust" working state, represented by place Prejuved. This is a duplicate place for Pup in order to distinguish the nodes that are waiting to be rejuvenated from the nodes that have already been rejuvenated. A node, after rejuvenation, is then allowed to fail with the same rates as before rejuvenation, even when another node is being rejuvenated. Duplicate places for Pup and Pfprob are needed to capture this. Node repair is disabled during rejuvenation. Rejuvenation is complete when the sum of the nodes in places Prejuved, Pfprobrejuv and Pnodefail2 is equal to the total number of nodes n. In this case, the immediate transition Timmd10 fires, putting back all the rejuvenated nodes in places Pup and Pfprob. Rejuvenation stops when there are a − 1 tokens in place Pnodefail2, to prevent a system failure. The clock resets itself when rejuvenation is complete and is disabled while the system is undergoing repair. Guard functions (g1 through g7) are assigned to express complex enabling conditions textually.

Figure 14.9. Cluster system employing prediction-based rejuvenation

In prediction-based rejuvenation (Figure 14.9), rejuvenation is attempted only when a node transits into the "failure probable" state. In practice, this degraded state could be predicted in advance by means of analyses of some observable system parameters [33]. In the case of a successful prediction, assuming that no other node is being rejuvenated at that time, the newly detected node can be rejuvenated. A node is allowed to fail even while waiting for rejuvenation.

For the analyses, the following values are assumed. The mean times spent in places Pup and Pfprob are 240 h and 720 h respectively. The mean times to repair a node, to rejuvenate a node, and to repair the system are 30 min, 10 min and 4 h respectively. In this analysis, the common-mode failure is disabled and node failure coverage is assumed to be perfect. All the models were solved using the Stochastic Petri Net Package (SPNP) tool. The measures computed were the expected unavailability and the expected cost incurred over a fixed time interval. It is assumed that the cost incurred due to node rejuvenation is much less than the cost of a node or system failure, since rejuvenation can be done at predetermined or scheduled times. In our analysis, we fix the value of costnodefail at $5000/h and costrejuv at $250/h. The value of costsysfail is computed as the number of nodes n times costnodefail.

Figure 14.10. Time-based rejuvenation for the 8/1 configuration

Figure 14.10 shows the plots for an 8/1 configuration (eight nodes including one spare) system employing simple time-based rejuvenation. The upper and lower plots show the expected cost incurred and the expected downtime (in hours) respectively in a given time interval, versus the rejuvenation interval (time between successive rejuvenations) in hours. If the rejuvenation interval is close to zero, the system is always rejuvenating and thus incurs high cost and downtime. As the rejuvenation interval increases, both the expected unavailability and the cost incurred decrease and reach an optimum value. If the rejuvenation interval goes beyond the optimal value, system failure has more influence on these measures than rejuvenation. The analysis was repeated for 2/1, 8/2, 16/1 and 16/2 configurations. For time-based rejuvenation, the optimal rejuvenation interval was 100 h for the one-spare clusters, and approximately 1 h for the two-spare clusters. In our analysis of predictive rejuvenation, we assumed 90% prediction coverage. For systems that have one spare, time-based rejuvenation can reduce downtime by 26% relative to no rejuvenation. Predictive rejuvenation does somewhat better, reducing downtime by 62% relative to no rejuvenation. However, when the system can tolerate more than one failure at a time, downtime is reduced by 95% to 98% via time-based rejuvenation, compared with a mere 85% for predictive rejuvenation.
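The SRN itself requires a Petri-net tool such as SPNP, but the qualitative trade-off can be reproduced with a deliberately simplified, single-node Monte Carlo sketch: two-stage exponential aging with the mean stage times given above, a 10 min rejuvenation every d hours, a 30 min repair after an unplanned failure, and the cost rates quoted in the text. Failover, coverage and common-mode effects are ignored, so the numbers are only indicative.

```python
import numpy as np

rng = np.random.default_rng(7)
COST_FAIL, COST_REJUV = 5000.0, 250.0     # $/h of failure and rejuvenation downtime

def time_based_policy(d, cycles=100_000):
    """Single node: rejuvenate every d hours of operation unless it fails first."""
    up = down = cost = 0.0
    for _ in range(cycles):
        t_fail = rng.exponential(240.0) + rng.exponential(720.0)  # two-stage aging
        if t_fail < d:                    # unplanned failure: 30 min repair
            up += t_fail; down += 0.5; cost += 0.5 * COST_FAIL
        else:                             # scheduled rejuvenation: 10 min
            up += d; down += 1.0 / 6.0; cost += COST_REJUV / 6.0
    total = up + down
    return 1.0 - up / total, cost / total  # unavailability and cost rate ($/h)

for d in [10.0, 50.0, 100.0, 500.0, 5000.0]:
    unavail, cost_rate = time_based_policy(d)
    print(f"d = {d:7.1f} h  unavailability = {unavail:.6f}  cost = {cost_rate:.3f} $/h")
```

Sweeping d shows the cost rate falling from the "always rejuvenating" extreme to a minimum and rising again as unplanned failures dominate, the same U-shape as in Figure 14.10.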

14.3 Measurement-based Estimation

In this section we describe the measurement-based approach for detection and validation of the existence of software aging. The basic idea is to monitor periodically and collect data on the attributes responsible for determining the health of the executing software, in this case the UNIX operating system. The SNMP-based distributed resource monitoring tool discussed by Garg et al. [33] was used to collect operating system resource usage and system activity data from nine heterogeneous UNIX workstations connected by an Ethernet LAN at the Duke Department of Electrical and Computer Engineering. A central monitoring station runs the manager program, which sends get requests periodically to each of the agent programs running on the monitored workstations. The agent programs, in turn, obtain data for the manager from their respective machines by executing various standard UNIX utility programs like pstat, iostat and vmstat. For quantifying the effect of aging in operating system resources, the metric estimated time to exhaustion is proposed. The earlier work [33] used a purely time-based approach to estimate resource exhaustion times, whereas the work presented by Vaidyanathan and Trivedi [34] takes the current system workload into account as well.

14.3.1 Time-based Estimation

Data were collected from the machines at intervals of 15 min for about 53 days. Time-ordered values for each monitored object are obtained, constituting a time series for that object. The objective is to detect aging, i.e. a long-term trend (increasing or decreasing) in the values. Only results for the data collected from the machine Rossby are discussed here.

First, the trends in operating system resource usage and system activity are detected by smoothing the observed data with robust locally weighted regression, proposed by Cleveland [35]. This technique is used to obtain the global trend between outages by removing the local variations. Then, the slope of the trend is estimated in order to do prediction. Figure 14.11 shows the smoothed data superimposed on the original data points from the time series of objects for Rossby. The amount of real memory free (plot 1) shows an overall decrease, whereas file table size (plot 2) shows an increase. Plots of some other resources not discussed here also showed an increase or decrease. This corroborates the hypothesis of aging with respect to various objects.

The seasonal Kendall test [36] was applied to each of these time series to detect the presence of any global trends at a significance level α of 0.05.

Table 14.1. Seasonal Kendall test

Resource name              Rossby
Real memory free          −13.668
File table size            38.001
Process table size         40.540
Used swap space            15.280
No. of disk data blocks    48.840
No. of queues              39.645

The associated statistics are listed in Table 14.1. With Zα = 1.96, all values in the table are such that the null hypothesis H0 that no trend exists is rejected.

Given that a global trend is present and that its slope is calculated for a particular resource, the time at which the resource will be exhausted because of aging alone is estimated. Table 14.2 refers to several objects on Rossby and lists an estimate of the slope (change per day) of the trend obtained by applying Sen's slope estimate for data with seasons [33]. The values for real memory and swap space are in kilobytes. A negative slope, as in the case of real memory, indicates a decreasing trend, whereas a positive slope, as in the case of file table size, is indicative of an increasing trend. Given the slope estimate, the table lists the estimated time to failure of the machine due to aging alone with respect to this particular resource. The calculation of the time to exhaustion is done by using the standard linear approximation y = mx + c.
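For illustration, the sketch below applies the plain (non-seasonal) Theil–Sen version of Sen's slope estimator to a synthetic daily series of free real memory and extrapolates the exhaustion time with y = mx + c; the seasonal adjustment used in [33] is omitted.

```python
import numpy as np

def sen_slope(t, y):
    """Median of all pairwise slopes (non-seasonal Theil-Sen estimator)."""
    slopes = [(y[j] - y[i]) / (t[j] - t[i])
              for i in range(len(t)) for j in range(i + 1, len(t))]
    return float(np.median(slopes))

# Synthetic stand-in: free real memory (KB), one sample per day
rng = np.random.default_rng(3)
t = np.arange(60.0)
mem_free = 40000.0 - 250.0 * t + rng.normal(0.0, 800.0, t.size)

m = sen_slope(t, mem_free)               # change per day (negative = depletion)
c = float(np.median(mem_free - m * t))   # robust intercept
print("slope =", m, " days to exhaustion =", -c / m)   # solve m*x + c = 0
```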

A comparative effect of aging on different system resources can be obtained from the above estimates. Overall, it was found that file table size and process table size are not as important as used swap space and real memory free, since they have a very small slope and high estimated times to failure due to exhaustion. Based on such comparisons, we can identify important resources to monitor and manage in order to deal with aging-related software failures. For example, the resource used swap space has the highest slope and real memory free has the second highest slope. However, real memory free has a lower time to exhaustion than used swap space.


Figure 14.11. Non-parametric regression smoothing for Rossby objects

Table 14.2. Estimated slope and time to exhaustion for Rossby objects

Resource name        Initial value   Max value   Sen's slope estimate   Estimated time to exhaustion
Real memory free     40 814.17       84 980      −252.00                161.96
File table size      220             7110        1.33                   5167.50
Process table size   57              2058        0.43                   4602.30
Used swap space      39 372          312 724     267.08                 1023.50


Table 14.3. Statistics for the workload clusters

No.   Cluster center (cpuConSw, sysCall, pgOut, pgIn)      % of pts
1     48 405.16    94 194.66     5.16    677.83            0.98
2     54 184.56    122 229.68    5.39    81.41             0.98
3     34 059.61    193 927.00    0.02    136.73            0.93
4     20 479.21    45 811.71     0.53    243.40            1.89
5     21 361.38    37 027.41     0.26    12.64             7.17
6     15 734.65    54 056.27     0.27    14.45             6.55
7     37 825.76    40 912.18     0.91    12.21             11.77
8     11 013.22    38 682.46     0.03    10.43             42.87
9     67 290.83    37 246.76     7.58    19.88             4.93
10    10 003.94    32 067.20     0.01    9.61              21.23
11    197 934.42   67 822.48     415.71  184.38            0.93

14.3.2 Time- and Workload-based Estimation

The method discussed in Section 14.3.1 assumes that the accumulated use of a resource over a time period depends only on the elapsed time. However, it is intuitive that the rate at which a resource is consumed depends on the current workload. In this subsection, we discuss the measurement-based model presented by Vaidyanathan and Trivedi [34] for estimating the rate of exhaustion of operating system resources as a function of both time and the system workload. The SNMP-based distributed resource monitoring tool described previously was used to collect operating system resource usage and system activity parameters (at 10 min intervals) for over 3 months. The longest stretch of sample points in which no reboots or failures occurred was used for building the model. A semi-Markov reward model was constructed using the data. First, different workload states were identified using statistical cluster analysis and a state-space model was constructed. Corresponding to each resource, a reward function based on the rate of resource exhaustion in the different states was then defined. Finally, the model is solved to obtain trends and times to exhaustion for the resources.

The following variables were chosen to characterize the system workload: cpuContextSwitch, sysCall, pageIn, and pageOut. Hartigan's k-means clustering algorithm [37] was used for partitioning the data points into clusters based on workload. The statistics for the 11 workload clusters obtained are shown in Table 14.3. Clusters whose centroids were relatively close to each other, and those with a small percentage of data points in them, were merged to simplify computations. The resulting clusters are W1 = {1, 2, 3}, W2 = {4, 5}, W3 = {6}, W4 = {7}, W5 = {8}, W6 = {9}, W7 = {10}, and W8 = {11}.
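A sketch of the clustering step: it partitions synthetic four-variable workload samples into 11 clusters and reports the percentage of points per cluster, as in Table 14.3. SciPy's kmeans2 implements standard Lloyd-style k-means rather than Hartigan's exact algorithm, and all data here are simulated placeholders.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, whiten

# Synthetic stand-ins for the monitored workload variables (one row per sample):
# cpuContextSwitch, sysCall, pageOut, pageIn
rng = np.random.default_rng(11)
data = np.column_stack([
    rng.lognormal(10.3, 0.7, 2000),   # cpuContextSwitch
    rng.lognormal(10.9, 0.5, 2000),   # sysCall
    rng.lognormal(0.5, 1.5, 2000),    # pageOut
    rng.lognormal(3.0, 1.2, 2000),    # pageIn
])

np.random.seed(11)                    # kmeans2 initialization uses the global RNG
centroids, labels = kmeans2(whiten(data), k=11, minit='++')  # standardized features
share = 100.0 * np.bincount(labels, minlength=11) / labels.size
print(np.round(share, 2))             # percentage of points in each cluster
```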

Transition probabilities from one state to another were computed from the data, resulting in the transition probability matrix P of the embedded DTMC shown below:

$$
\mathbf{P} =
\begin{pmatrix}
0.00 & 0.16 & 0.22 & 0.13 & 0.26 & 0.03 & 0.17 & 0.03 \\
0.07 & 0.00 & 0.14 & 0.14 & 0.32 & 0.03 & 0.31 & 0.00 \\
0.12 & 0.26 & 0.00 & 0.10 & 0.43 & 0.00 & 0.11 & 0.02 \\
0.15 & 0.36 & 0.06 & 0.00 & 0.10 & 0.22 & 0.09 & 0.03 \\
0.03 & 0.07 & 0.04 & 0.01 & 0.00 & 0.00 & 0.85 & 0.00 \\
0.07 & 0.16 & 0.02 & 0.54 & 0.12 & 0.00 & 0.02 & 0.07 \\
0.02 & 0.05 & 0.00 & 0.00 & 0.92 & 0.00 & 0.00 & 0.00 \\
0.31 & 0.08 & 0.15 & 0.23 & 0.08 & 0.15 & 0.00 & 0.00
\end{pmatrix}
$$


Table 14.4. Sojourn time distributions

State   Sojourn time distribution F(t)
W1      1 − 1.602919 e^{−0.9t} + 0.6029185 e^{−2.392739t}
W2      1 − 0.9995 e^{−0.449902t} − 0.0005 e^{−0.00711007t}
W3      1 − 0.9952 e^{−0.3274977t} − 0.0048 e^{−0.0175027t}
W4      1 − 0.841362 e^{−0.3275372t} − 0.158638 e^{−0.03825429t}
W5      1 − 1.425856 e^{−0.56t} + 0.4258555 e^{−1.875t}
W6      1 − 0.80694 e^{−0.5509307t} − 0.19306 e^{−0.03705756t}
W7      1 − 2.86533 e^{−1.302t} + 1.86533 e^{−2t}
W8      1 − 0.9883 e^{−0.2655196t} − 0.0117 e^{−0.02710147t}

Table 14.5. Slope estimates (in KB/10 min)

State   usedSwapSpace                            realMemoryFree
        Slope estimate   95% confidence int.     Slope estimate   95% confidence int.
W1      119.3            5.5 to 222.4            −133.7           −137.7 to −133.3
W2      0.57             0.40 to 0.71            −1.47            −1.78 to −1.09
W3      0.76             0.73 to 0.80            −1.43            −2.50 to −0.62
W4      0.57             0.00 to 0.69            −1.23            −1.67 to −0.80
W5      0.78             0.75 to 0.80            0.00             −5.65 to 6.00
W6      0.81             0.64 to 1.00            −1.14            −1.40 to −0.88
W7      0.00             0.00 to 0.00            0.00             0.00 to 0.00
W8      91.8             72.4 to 111.0           91.7             −369.9 to 475.2

The sojourn time distribution for each of the workload states was fitted to either a two-stage hyper-exponential or a two-stage hypo-exponential distribution function. The fitted distributions, shown in Table 14.4, were tested using the Kolmogorov–Smirnov test at a significance level of 0.01.
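The goodness-of-fit step can be sketched with SciPy's one-sample Kolmogorov–Smirnov test, here checking a sample against the fitted W1 distribution of Table 14.4; the sample below is synthetic, standing in for the measured sojourn times.

```python
import numpy as np
from scipy.stats import kstest

def F_w1(t):
    """Fitted two-stage hypo-exponential CDF for state W1 (Table 14.4)."""
    return 1.0 - 1.602919 * np.exp(-0.9 * t) + 0.6029185 * np.exp(-2.392739 * t)

rng = np.random.default_rng(5)
# Placeholder observations: a true W1 sojourn is the sum of two exponential stages
sample = rng.exponential(1.0 / 0.9, 200) + rng.exponential(1.0 / 2.392739, 200)

stat, p_value = kstest(sample, F_w1)
print("KS statistic =", stat, " p-value =", p_value,
      "-> reject at 0.01" if p_value < 0.01 else "-> fail to reject at 0.01")
```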

Two resources, usedSwapSpace and realMemoryFree, are considered for the analysis, since the previous time-based analysis suggested that they are critical resources. For each resource, the reward function is defined as the rate of the corresponding resource's exhaustion in the different states. The true slope (rate of increase/decrease) of a resource in every workload state is estimated by using Sen's non-parametric method [34]. Table 14.5 shows the slopes with 95% confidence intervals.

It was observed that the slopes in a given workload state for a particular resource during different visits to that state are almost the same. Further, the slopes across different workload states are different and, generally, the higher the system activity, the higher is the resource utilization. This validates the assumption that resource usage depends on the system workload and that the rates of exhaustion vary with workload changes. It can also be observed from Table 14.5 that the slopes for usedSwapSpace in all the workload states are non-negative, and the slopes for realMemoryFree are non-positive in all the workload states except one. It follows that usedSwapSpace increases, whereas realMemoryFree decreases, over time, which validates the software aging phenomenon.

The semi-Markov reward model was solved using SHARPE [38]. The slope for the workload-based estimation is computed as the expected reward rate in steady state from the model. As in the case of time-based estimation, the times to resource exhaustion are computed using the linear formula y = mx + c. Table 14.6 gives the estimates of the slope and time to exhaustion for usedSwapSpace and realMemoryFree.


Table 14.6. Estimates of slope (in KB/10 min) and time to exhaustion (in days)

Estimation method   usedSwapSpace                          realMemoryFree
                    Slope estimate   Time to exhaustion    Slope estimate   Time to exhaustion
Time based          0.787            2276.46               −2.806           60.81
Workload based      4.647            453.62                −3.386           50.39

It can be seen that the workload-based estimation gave a lower time to resource exhaustion than that computed using the time-based estimation. Since machine failures due to resource exhaustion were observed much before the times to resource exhaustion estimated by the time-based method, it follows that the workload-based approach results in better estimations.
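The steady-state solution behind Table 14.6 can be sketched without SHARPE: solve the embedded DTMC for its stationary vector, convert to semi-Markov state probabilities using the mean sojourn times implied by Table 14.4, and take the expected reward rate with the usedSwapSpace slopes of Table 14.5 as rewards. P and the table entries are used as reconstructed above, so the output only approximates the published 4.647 KB/10 min.

```python
import numpy as np

# Embedded DTMC transition probability matrix (states W1..W8), rows renormalized
P = np.array([
    [0.00, 0.16, 0.22, 0.13, 0.26, 0.03, 0.17, 0.03],
    [0.07, 0.00, 0.14, 0.14, 0.32, 0.03, 0.31, 0.00],
    [0.12, 0.26, 0.00, 0.10, 0.43, 0.00, 0.11, 0.02],
    [0.15, 0.36, 0.06, 0.00, 0.10, 0.22, 0.09, 0.03],
    [0.03, 0.07, 0.04, 0.01, 0.00, 0.00, 0.85, 0.00],
    [0.07, 0.16, 0.02, 0.54, 0.12, 0.00, 0.02, 0.07],
    [0.02, 0.05, 0.00, 0.00, 0.92, 0.00, 0.00, 0.00],
    [0.31, 0.08, 0.15, 0.23, 0.08, 0.15, 0.00, 0.00],
])
P /= P.sum(axis=1, keepdims=True)

# Mean sojourn times from Table 14.4: with Fbar(t) = sum_i c_i exp(-l_i t),
# the mean is sum_i c_i / l_i
fits = [([1.602919, -0.6029185], [0.9, 2.392739]),
        ([0.9995, 0.0005], [0.449902, 0.00711007]),
        ([0.9952, 0.0048], [0.3274977, 0.0175027]),
        ([0.841362, 0.158638], [0.3275372, 0.03825429]),
        ([1.425856, -0.4258555], [0.56, 1.875]),
        ([0.80694, 0.19306], [0.5509307, 0.03705756]),
        ([2.86533, -1.86533], [1.302, 2.0]),
        ([0.9883, 0.0117], [0.2655196, 0.02710147])]
h = np.array([sum(c / l for c, l in zip(cs, ls)) for cs, ls in fits])

# Stationary vector v of the embedded DTMC (v P = v, sum v = 1)
w, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(w - 1.0))])
v /= v.sum()

pi = v * h / (v @ h)                  # semi-Markov state probabilities
reward = np.array([119.3, 0.57, 0.76, 0.57, 0.78, 0.81, 0.00, 91.8])  # Table 14.5
print("usedSwapSpace slope:", pi @ reward, "KB per 10 min")
```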

14.4 Conclusion and Future Work

In this chapter we have motivated the need for pursuing preventive maintenance in operational software systems on a scientific basis rather than through the current ad hoc practice. Thus, an important research issue is to determine the optimal times at which to perform the preventive maintenance. In this regard, we discussed two possible approaches: analytical modeling and a measurement-based approach.

In the first part, we discussed analytical models for evaluating the effectiveness of preventive maintenance in operational software systems that experience aging. The aim of the analytical modeling approach is to determine the optimal times to perform rejuvenation. The latter part of the chapter dealt with a measurement-based approach for detection and validation of the existence of software aging. Although the SNMP-based distributed resource monitoring tool is specific to UNIX, the methodology can also be used for detection and estimation of aging in other software systems.

In the future, a unified approach that takes account of both analytical modeling and measurement-based rejuvenation should be established. Then, the probability distribution function of the software failure occurrence time has to be modeled in a consistent way. In other words, stochastic models that describe the physical failure phenomena in operational software systems should be developed under a random usage environment.

Acknowledgments

This material is based upon support by the Department of Defense, Alcatel, Telcordia, and Aprisma Management Technologies via a CACC Duke core project. The first author is also supported in part by the Telecommunication Advancement Foundation, Tokyo, Japan.

References

[1] Adams E. Optimizing preventive service of the software products. IBM J Res Dev 1984;28:2–14.

[2] Huang Y, Kintala C, Kolettis N, Fulton ND. Software rejuvenation: analysis, module and applications. In: Proceedings of 25th International Symposium on Fault Tolerant Computer Systems. Los Alamitos: IEEE CS Press; 1995. p.381–90.

[3] Lin F. Re-engineering option analysis for managing software rejuvenation. Inform Software Technol 1993;35:462–7.

[4] Parnas DL. Software aging. In: Proceedings of 16th International Conference on Software Engineering. New York: ACM Press; 1994. p.279–87.

[5] Huang Y, Kintala C. Software fault tolerance in the application layer. In: Lyu MR, editor. Software fault tolerance. Chichester: John Wiley & Sons; 1995. p.231–48.

[6] Huang Y, Jalote P, Kintala C. Two techniques for transient software error recovery. In: Banatre M, Lee PA, editors. Hardware and software architectures for fault tolerance: experience and perspectives. Springer Lecture Notes in Computer Science, vol. 774. Berlin: Springer-Verlag; 1995. p.159–70.

[7] Jalote P, Huang Y, Kintala C. A framework for understanding and handling transient software failures. In: Proceedings of 2nd ISSAT International Conference on Reliability and Quality in Design, 1995. p.231–7.


[8] Avritzer A, Weyuker EJ. Monitoring smoothly degrading systems for increased dependability. Empirical Software Eng J 1997;2:59–77.

[9] Levendel Y. The cost effectiveness of telecommunication service. In: Lyu MR, editor. Software fault tolerance. Chichester: John Wiley & Sons; 1995. p.279–314.

[10] Marshall E. Fatal error: how Patriot overlooked a Scud. Science 1992;255:1347.

[11] Gray J, Siewiorek DP. High-availability computer systems. IEEE Comput 1991;9:39–48.

[12] Trivedi KS, Vaidyanathan K, Goševa-Popstojanova K. Modeling and analysis of software aging and rejuvenation. In: Proceedings of 33rd Annual Simulation Symposium. Los Alamitos: IEEE CS Press; 2000. p.270–9.

[13] Tai AT, Chau SN, Alkalaj L, Hecht H. On-board preventive maintenance: analysis of effectiveness and optimal duty period. In: Proceedings of 3rd International Workshop on Object-Oriented Real-Time Dependable Systems. Los Alamitos: IEEE CS Press; 1997. p.40–7.

[14] Tai AT, Alkalaj L, Chau SN. On-board preventive maintenance for long-life deep-space missions: a model-based evaluation. In: Proceedings of 3rd IEEE International Computer Performance & Dependability Symposium. Los Alamitos: IEEE CS Press; 1998. p.196–205.

[15] Tai AT, Alkalaj L, Chau SN. On-board preventive maintenance: a design-oriented analytic study for long-life applications. Perform Eval 1999;35:215–32.

[16] Vaidyanathan K, Harper RE, Hunter SW, Trivedi KS. Analysis and implementation of software rejuvenation in cluster systems. In: Proceedings of Joint International Conference on Measurement and Modeling of Computer Systems, ACM SIGMETRICS 2001/Performance 2001. New York: ACM; 2001. p.62–71.

[17] IBM Netfinity director software rejuvenation – white paper. URL: http://www.pc.ibm.com/us/techlink/wtpapers/.

[18] Dohi T, Goševa-Popstojanova K, Trivedi KS. Analysis of software cost models with rejuvenation. In: Proceedings of 5th IEEE International Symposium on High Assurance Systems Engineering. Los Alamitos: IEEE CS Press; 2000. p.25–34.

[19] Dohi T, Goševa-Popstojanova K, Trivedi KS. Statistical non-parametric algorithms to estimate the optimal software rejuvenation schedule. In: Proceedings of 2000 Pacific Rim International Symposium on Dependable Computing. Los Alamitos: IEEE CS Press; 2000. p.77–84.

[20] Dohi T, Goševa-Popstojanova K, Trivedi KS. Estimating software rejuvenation schedules in high assurance systems. Comput J 2001;44:473–85.

[21] Garg S, Puliafito A, Trivedi KS. Analysis of software rejuvenation using Markov regenerative stochastic Petri net. In: Proceedings of 6th International Symposium on Software Reliability Engineering. Los Alamitos: IEEE CS Press; 1995. p.180–7.

[22] Bobbio A, Sereno M. Fine grained software rejuvenation models. In: Proceedings of 3rd IEEE International Computer Performance & Dependability Symposium. Los Alamitos: IEEE CS Press; 1998. p.4–12.

[23] Garg S, Huang Y, Kintala C, Trivedi KS. Minimizing completion time of a program by checkpointing and rejuvenation. In: Proceedings of 1996 ACM SIGMETRICS Conference. New York: ACM; 1996. p.252–61.

[24] Bobbio A, Garg S, Gribaudo M, Horváth A, Sereno M, Telek M. Modeling software systems with rejuvenation, restoration and checkpointing through fluid stochastic Petri nets. In: Proceedings of 8th International Workshop on Petri Nets and Performance Models. Los Alamitos: IEEE CS Press; 1999. p.82–91.

[25] Garg S, Huang Y, Kintala C, Trivedi KS. Time and load based software rejuvenation: policy, evaluation and optimality. In: Proceedings of 1st Fault-Tolerant Symposium, 1995. p.22–5.

[26] Pfening A, Garg S, Puliafito A, Telek M, Trivedi KS. Optimal rejuvenation for tolerating soft failures. Perform Eval 1996;27–28:491–506.

[27] Garg S, Puliafito A, Telek M, Trivedi KS. Analysis of preventive maintenance in transactions based software systems. IEEE Trans Comput 1998;47:96–107.

[28] Okamura H, Fujimoto A, Dohi T, Osaki S, Trivedi KS. The optimal preventive maintenance policy for a software system with multi server station. In: Proceedings of 6th ISSAT International Conference on Reliability and Quality in Design, 2000. p.275–9.

[29] Okamura H, Miyahara S, Dohi T, Osaki S. Performance evaluation of workload-based software rejuvenation scheme. IEICE Trans Inform Syst D 2001;E84-D:1368–75.

[30] Ross SM. Applied probability models with optimization applications. San Francisco (CA): Holden-Day; 1970.

[31] Barlow RE, Campo R. Total time on test processes and applications to failure data analysis. In: Barlow RE, Fussell J, Singpurwalla ND, editors. Reliability and fault tree analysis. Philadelphia: SIAM; 1975. p.451–81.

[32] Pfister G. In search of clusters: the coming battle in lowly parallel computing. Englewood Cliffs (NJ): Prentice-Hall; 1998.

[33] Garg S, van Moorsel A, Vaidyanathan K, Trivedi KS. A methodology for detection and estimation of software aging. In: Proceedings of 9th International Symposium on Software Reliability Engineering. Los Alamitos: IEEE CS Press; 1998. p.282–92.

[34] Vaidyanathan K, Trivedi KS. A measurement-based model for estimation of resource exhaustion in operational software systems. In: Proceedings of 10th IEEE International Symposium on Software Reliability Engineering. Los Alamitos: IEEE CS Press; 1999. p.84–93.

[35] Cleveland WS. Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 1979;74:829–36.

[36] Gilbert RO. Statistical methods for environmental pollution monitoring. New York (NY): Van Nostrand Reinhold; 1987.

[37] Hartigan JA. Clustering algorithms. New York (NY): John Wiley & Sons; 1975.

[38] Sahner RA, Trivedi KS, Puliafito A. Performance and reliability analysis of computer systems – an example-based approach using the SHARPE software package. Norwell (MA): Kluwer Academic Publishers; 1996.

Chapter 15

Software Reliability Management: Techniques and Applications

Mitsuhiro Kimura and Shigeru Yamada

15.1 Introduction
15.2 Death Process Model for Software Testing Management
15.2.1 Model Description
15.2.1.1 Mean Number of Remaining Software Faults/Testing Cases
15.2.1.2 Mean Time to Extinction
15.2.2 Estimation Method of Unknown Parameters
15.2.2.1 Case of 0 < α ≤ 1
15.2.2.2 Case of α = 0
15.2.3 Software Testing Progress Evaluation
15.2.4 Numerical Illustrations
15.2.5 Concluding Remarks
15.3 Estimation Method of Imperfect Debugging Probability
15.3.1 Hidden-Markov Modeling for Software Reliability Growth Phenomenon
15.3.2 Estimation Method of Unknown Parameters
15.3.3 Numerical Illustrations
15.3.4 Concluding Remarks
15.4 Continuous State Space Model for Large-scale Software
15.4.1 Model Description
15.4.2 Nonlinear Characteristics of Software Debugging Speed
15.4.3 Estimation Method of Unknown Parameters
15.4.4 Software Reliability Assessment Measures
15.4.4.1 Expected Number of Remaining Faults and Its Variance
15.4.4.2 Cumulative and Instantaneous Mean Time Between Failures
15.4.5 Concluding Remarks
15.5 Development of a Software Reliability Management Tool
15.5.1 Definition of the Specification Requirement
15.5.2 Object-oriented Design
15.5.3 Examples of Reliability Estimation and Discussion

15.1 Introduction

In this chapter we discuss three new stochastic models for assessing software reliability and the degree of software testing progress, and a software reliability management tool.

Precise software reliability assessment is necessary to measure and predict the reliability and performance of a developed software product. Also, it has become one of the urgent issues in software engineering to develop quality software products and increase the productivity of their development processes [1–5]. In order to solve such software quality management issues, a number of software reliability assessment models have been developed by many researchers over the last two decades [2–5]. Among these models, those built on stochastic processes have recently become known gradually to software development managers and practitioners.

Section 15.2 considers a model that describes the degree of software testing progress by focusing on the consumption process of test cases during the software testing phase. Measuring the software testing progress may also play a significant role in software development management, because one of the most important factors is traceability in the software development process. To evaluate the degree of the testing progress, we construct a generalized death process model that is able to analyze the data sets of testing time and the number of test cases consumed during the testing phase. The model predicts the mean number of the consumed test cases by extrapolation and estimates the mean completion time of the testing phase. We show several numerical illustrations using the actual data.

In Section 15.3, we propose a new approach to estimating the software reliability assessment model considering the imperfect debugging environment by analyzing the traditional software reliability data sets. Generally, stochastic models for software reliability assessment need several assumptions, and some of them are too strict to describe the actual phenomena. In the research area of software reliability modeling, it is known that the assumption of perfect debugging is very useful in simplifying the models, but many software quality/reliability researchers and practitioners are critical of assuming perfect debugging in a software testing environment. Against such a background, several imperfect debugging software reliability assessment models have been developed [6–12]. However, truly practical methods for estimating the imperfect debugging probability have not yet been proposed. We show that the hidden-Markov estimation method [13] enables one to estimate the imperfect debugging probability by using only the traditional data sets.

In Section 15.4, we consider a state dependency in the software reliability growth process. In the literature, almost all of the software reliability assessment models based on stochastic processes are continuous time and discrete state space models. In particular, the models based on a non-homogeneous Poisson process (NHPP) are widely known as software reliability growth models and have actually been implemented in several computer-aided software engineering (CASE) tools [4, 5]. These NHPP models can be easily applied to the observed data sets and expanded to describe several important factors which have an effect on software debugging processes in the actual software testing phase. However, the NHPP models essentially have a linear relation on the state dependency; this is a doubtful assumption with regard to actual software reliability growth phenomena. That is, these models have no choice but to use a mean value function for describing a nonlinear relation between the cumulative number of detected faults and the fault detection rate. Thus we present a continuous state space model to introduce directly the nonlinearity of the state dependency of the software reliability growth process.

Finally, in Section 15.5, we introduce a prototype of a software reliability management tool that includes the models described in this chapter. This tool has been implemented using JAVA and object-oriented analysis.

15.2 Death Process Model for Software Testing Management

In this section we focus on two phenomena that are observable during the software testing phase, which is the last phase of a large-scale software development process. The first is the software failure-occurrence or fault-detection process, and the second is the consumption process of test cases, which are provided previously and consumed to examine the system test of the software system implemented. To evaluate the quality of the software development process or the software product itself, we describe a non-homogeneous death process [14] with a state-dependent parameter α (0 ≤ α ≤ 1).

After the model description, we derive several measures for quantitative assessment. We show that this model includes Shanthikumar's binomial reliability model [15] as a special case. The estimation method of maximum likelihood for the unknown parameters is also shown. Finally, we illustrate several numerical examples for software testing-progress evaluation by applying the actual data sets to our model, and we discuss the applicability of the model.

15.2.1 Model Description

We first consider a death process [16, 17] {X(t), t ≥ 0}, which has the initial population K (K ≥ 1) and satisfies the following assumptions:

• Pr[X(t + Δt) = x − 1 | X(t) = x] = (xα + 1 − α)φ(t)Δt + o(Δt)
• Pr[X(t) − X(t + Δt) ≥ 2] = o(Δt)
• Pr[X(0) = K] = 1
• {X(t), t ≥ 0} has independent increments.

The term population has two meanings in this section. The first one is the number of software faults that are latent in the software system, if our new model describes the software fault-detection or failure-occurrence process. The second one means the number of test cases for the testing phase, if the model gives a description of the consumption process of the test cases.

In the above assumptions, the positive function φ(t) is called an intensity function, and α (0 ≤ α ≤ 1) is a constant parameter that controls the state-dependent property of the process {X(t), t ≥ 0}. That is, if α = 0, the state transition probability Pr[X(t + Δt) = x − 1 | X(t) = x] does not depend on the state x at time t but only on the intensity function φ(t). On the other hand, when α = 1, the process depends on both the residual population x and φ(t) at time t. This model describes a generalized non-homogeneous death process. We consider that the residual population at time t, X(t), represents the number of remaining software faults or remaining test cases at the testing time t.
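The following is a minimal simulation sketch (ours, not from the chapter) of this generalized death process for the Weibull-type intensity φ(t) = λβt^(β−1) introduced later in Equation 15.16, for which G(t) = λt^β can be inverted in closed form; the function name and parameter values are illustrative assumptions.

```python
import numpy as np

def simulate_death_process(K, alpha, lam, beta, rng=np.random.default_rng(1)):
    """Sample one path: in state x the death intensity is
    (x*alpha + 1 - alpha) * phi(t), with phi(t) = lam*beta*t**(beta-1)."""
    t, x, events = 0.0, K, []
    G = 0.0                                       # G(t) = lam * t**beta at t = 0
    while x > 0:
        rate_factor = x * alpha + 1.0 - alpha     # state-dependent multiplier
        G += rng.exponential(1.0) / rate_factor   # next event in transformed time
        t = (G / lam) ** (1.0 / beta)             # invert G(t) = lam * t**beta
        x -= 1
        events.append((t, K - x))                 # (time, cumulative deaths)
    return events
```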

Under the assumptions listed above, we can obtain the probability that the cumulative number of deaths is equal to n at time t, Pr[X(t) = K − n] = P_n(t | α), as follows:

$$P_n(t \mid \alpha) = \frac{\prod_{i=0}^{n-1}[(K-i)\alpha + 1 - \alpha]}{n!}\,\bigl[e^{-\alpha G(t)}\bigr]^{K-n-1}\left[\frac{1 - e^{-\alpha G(t)}}{\alpha}\right]^{n} e^{-G(t)} \quad (n = 0, 1, 2, \ldots, K-1) \quad (15.1)$$

$$P_K(t \mid \alpha) = 1 - \sum_{n=0}^{K-1} P_n(t \mid \alpha)$$

where

$$G(t) = \int_0^t \phi(x)\,dx \quad (15.2)$$

and we define

$$\prod_{i=0}^{-1}[(K-i)\alpha + 1 - \alpha] = 1 \quad (15.3)$$

In the above, the term death means the elimination of a software fault or the consumption of a test case during the testing phase.

In Equation 15.1, taking the limit α → 0 of P_n(t | α) yields

$$P_n(t \mid 0) = \frac{G(t)^n}{n!}\,e^{-G(t)} \quad (n = 0, 1, 2, \ldots, K-1) \quad (15.4)$$

$$P_K(t \mid 0) = \sum_{i=K}^{\infty} \frac{G(t)^i}{i!}\,e^{-G(t)}$$

since

$$\lim_{\alpha \to 0} \frac{1 - e^{-\alpha G(t)}}{\alpha} = G(t) \quad (15.5)$$

Similarly, by letting α → 1 we obtain

$$P_n(t \mid 1) = \binom{K}{n} e^{-G(t)(K-n)}\,(1 - e^{-G(t)})^n \quad (n = 0, 1, \ldots, K) \quad (15.6)$$

Equation 15.6 is the binomial reliability model of Shanthikumar [15].

Based on the above, we can derive several measures.

15.2.1.1 Mean Number of Remaining Software Faults/Testing Cases

The mean number of remaining software faults/test cases at time t, E[X(t)], is derived as

$$E[X(t)] = K - e^{-G(t)} \sum_{n=1}^{K} \frac{\Gamma(K + 1/\alpha)}{\Gamma(K + 1/\alpha - n)\,(n-1)!}\,\bigl[e^{-\alpha G(t)}\bigr]^{K-n-1}\bigl[1 - e^{-\alpha G(t)}\bigr]^{n} \quad (15.7)$$

if 0 < α ≤ 1. To obtain Equation 15.7, we use the following identity:

$$\frac{\prod_{i=0}^{n-1}[(K-i)\alpha + 1 - \alpha]}{\alpha^n} = \frac{\Gamma(K + 1/\alpha)}{\Gamma(K + 1/\alpha - n)} \quad (15.8)$$

where Γ(a) represents the gamma function defined as

$$\Gamma(a) = \int_0^{\infty} e^{-x} x^{a-1}\,dx \quad (15.9)$$

Also, Equation 15.7 can be simplified if α = 1 as

$$E[X(t)]_{\alpha=1} = K e^{-G(t)} \quad (15.10)$$

In the case of α = 0, E[X(t)] yields

$$E[X(t)]_{\alpha=0} = K A(K, G(t)) - G(t) A(K-1, G(t)) \quad (15.11)$$

where A(K, G(t)) represents the incomplete gamma ratio, which is defined as follows [14]:

$$A(K, G(t)) = \Gamma(K, G(t))/\Gamma(K, 0) \quad (15.12)$$

$$\Gamma(a, b) = \int_b^{\infty} e^{-x} x^{a-1}\,dx \quad (15.13)$$

15.2.1.2 Mean Time to Extinction

Let B(t) be the cumulative distribution function (CDF) of the event that all K members of the initial population have disappeared by time t. From Equation 15.1, B(t) is represented as

$$B(t) = P_K(t \mid \alpha) = \sum_{n=K}^{\infty} P_n(t \mid \alpha) \quad (15.14)$$

Therefore, the mean time to extinction E[T] is

$$E[T] = \int_0^{\infty} t\,dB(t) \quad (15.15)$$

If the intensity function φ(t) is assumed to be

$$\phi(t) = \lambda \beta t^{\beta-1} \quad (15.16)$$

two special cases of E[T] are obtained as follows:

$$E[T]_{\alpha=0} = \frac{\Gamma(K + 1/\beta)}{\Gamma(K)}\left(\frac{1}{\lambda}\right)^{1/\beta} \quad (15.17)$$

$$E[T]_{\alpha=1} = \sum_{r=1}^{K}\left[\binom{K}{r}(-1)^{r+1}\left(\frac{1}{r}\right)^{1/\beta}\right]\left(\frac{1}{\beta}\right)\Gamma\left(\frac{1}{\beta}\right)\left(\frac{1}{\lambda}\right)^{1/\beta} \quad (15.18)$$
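These closed forms are easy to evaluate numerically. Below is a small illustrative sketch (our assumption, not from the chapter); note that the alternating sum in Equation 15.18 can lose precision for large K.

```python
from math import comb, gamma

def mean_extinction_alpha0(K, lam, beta):
    # Equation 15.17
    return gamma(K + 1.0 / beta) / gamma(K) * (1.0 / lam) ** (1.0 / beta)

def mean_extinction_alpha1(K, lam, beta):
    # Equation 15.18; alternating sum, numerically delicate for large K
    s = sum(comb(K, r) * (-1) ** (r + 1) * (1.0 / r) ** (1.0 / beta)
            for r in range(1, K + 1))
    return s * (1.0 / beta) * gamma(1.0 / beta) * (1.0 / lam) ** (1.0 / beta)
```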

15.2.2 Estimation Method of Unknown Parameters

We show the estimation method for the unknown parameters included in the model, except for K. We assume that a data set consists of times tj and cumulative numbers of deaths yj, i.e. (tj, yj) (j = 1, 2, . . . , m) with t1 < t2 < · · · < tm and y1 < y2 < · · · < ym, and that the initial population K is given.

15.2.2.1 Case of 0 < α ≤ 1

By using the data, the likelihood function l can be constructed as

$$l = \Pr[X(t_1) = K - y_1,\, X(t_2) = K - y_2,\, \ldots,\, X(t_m) = K - y_m]$$

$$= \prod_{j=1}^{m} \frac{\Gamma(K + 1/\alpha - y_{j-1})}{(y_j - y_{j-1})!\,\Gamma(K + 1/\alpha - y_j)}\,\bigl(e^{-\alpha G(t_{j-1})} - e^{-\alpha G(t_j)}\bigr)^{y_j - y_{j-1}}\bigl(1 - e^{-\alpha G(t_{j-1})} + e^{-\alpha G(t_j)}\bigr)^{K + 1/\alpha - y_j - 1} \quad (15.19)$$

where t_0 ≡ 0, y_0 ≡ 0. From the above equation, the log-likelihood function L is represented as

$$L \equiv \log l = \sum_{j=1}^{m} \log\!\left[\Gamma\!\left(K + \frac{1}{\alpha} - y_{j-1}\right)\right] - \sum_{j=1}^{m} \log[(y_j - y_{j-1})!] - \sum_{j=1}^{m} \log\!\left[\Gamma\!\left(K + \frac{1}{\alpha} - y_j\right)\right] + \sum_{j=1}^{m} (y_j - y_{j-1}) \log\bigl[e^{-\alpha G(t_{j-1})} - e^{-\alpha G(t_j)}\bigr] + \sum_{j=1}^{m} \left(K + \frac{1}{\alpha} - y_j - 1\right) \log\bigl[1 - e^{-\alpha G(t_{j-1})} + e^{-\alpha G(t_j)}\bigr] \quad (15.20)$$

If the intensity function is given by Equation 15.16, the simultaneous likelihood equations are

$$\frac{\partial L(\alpha, \lambda, \beta)}{\partial \alpha} = 0, \qquad \frac{\partial L(\alpha, \lambda, \beta)}{\partial \lambda} = 0, \qquad \frac{\partial L(\alpha, \lambda, \beta)}{\partial \beta} = 0 \quad (15.21)$$

We can solve these equations numerically to obtain the maximum likelihood estimates.

15.2.2.2 Case of α = 0

In this case, the likelihood function l and the log-likelihood L are rewritten respectively as

$$l \equiv \prod_{j=1}^{m} \frac{\left[\int_{t_{j-1}}^{t_j} \phi(x)\,dx\right]^{y_j - y_{j-1}}}{(y_j - y_{j-1})!} \exp\!\left[-\int_{t_{j-1}}^{t_j} \phi(x)\,dx\right] \quad (15.22)$$

$$L = \log l = \sum_{j=1}^{m} (y_j - y_{j-1}) \log[G(t_j) - G(t_{j-1})] - \sum_{j=1}^{m} \log(y_j - y_{j-1})! - G(t_m) \quad (15.23)$$

Substituting Equation 15.16 into Equation 15.23, we have

$$L(\lambda, \beta) = y_m \log \lambda + \sum_{j=1}^{m} (y_j - y_{j-1}) \log(t_j^{\beta} - t_{j-1}^{\beta}) - \sum_{j=1}^{m} \log(y_j - y_{j-1})! - \lambda t_m^{\beta} \quad (15.24)$$

The simultaneous likelihood equations are obtained as follows:

$$\frac{\partial L(\lambda, \beta)}{\partial \lambda} = \frac{y_m}{\lambda} - t_m^{\beta} = 0 \quad (15.25)$$

$$\frac{\partial L(\lambda, \beta)}{\partial \beta} = \sum_{j=1}^{m} (y_j - y_{j-1})\,\frac{t_j^{\beta} \log t_j - t_{j-1}^{\beta} \log t_{j-1}}{t_j^{\beta} - t_{j-1}^{\beta}} - y_m \log t_m = 0 \quad (15.26)$$

By solving Equations 15.25 and 15.26 numerically, we obtain the maximum likelihood estimates for λ and β.
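In practice, Equation 15.25 gives λ = y_m / t_m^β in closed form, so only Equation 15.26 has to be solved for β. A minimal sketch of this fit follows (our assumption, not the authors' implementation; the bracketing interval is illustrative).

```python
import numpy as np
from scipy.optimize import brentq

def fit_alpha_zero(t, y):
    """t, y: testing times and cumulative death counts, with t[0] > 0."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)

    def xlogx(u, beta):
        # u**beta * log(u), with the t = 0 boundary term taken as 0
        return np.where(u > 0, u**beta * np.log(np.where(u > 0, u, 1.0)), 0.0)

    def dL_dbeta(beta):
        # left-hand side of Equation 15.26
        tj, tj1 = t, np.concatenate(([0.0], t[:-1]))
        yj, yj1 = y, np.concatenate(([0.0], y[:-1]))
        num = xlogx(tj, beta) - xlogx(tj1, beta)
        return np.sum((yj - yj1) * num / (tj**beta - tj1**beta)) \
               - y[-1] * np.log(t[-1])

    beta = brentq(dL_dbeta, 0.05, 20.0)   # bracket chosen by assumption
    lam = y[-1] / t[-1]**beta             # Equation 15.25
    return lam, beta
```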

15.2.3 Software Testing Progress Evaluation

Up to the present, many stochastic models have been proposed for software reliability/quality assessment. Basically, such models analyze the software fault-detection or failure-occurrence data which can be observed in the software testing phase. Therefore, we first consider the applicability of our model to the software fault-detection process.

The meaning of the parameters included in Equation 15.1 is as follows:

K is the total number of software faults latent in the software system
φ(t) is the fault-detection rate at testing time t per fault
α is the degree to which the fault-detection rate is affected by the number of remaining faults.

However, it is difficult for this model to analyze software fault-detection data, because the estimation of the unknown parameter K is difficult unless α = 1. Another estimation method would be needed for K.

We now focus on the analysis of the consumption process of test cases. The software development managers not only want to know the software reliability attained during the testing, but also the testing progress. In the testing phase of software development, many test cases provided previously are examined as to whether each of them satisfies the specification of the software product or not. When the software completes the task that is defined by each test case without causing any software failure, the test case is consumed. On the other hand, if a software failure occurs, the software faults that cause the software failure are detected and removed. In this situation, {X(t), t ≥ 0} can be treated as the consumption process of test cases. The meaning of the parameters is as follows:

K is the total number of test cases provided
φ(t) is the consumption rate of test cases at time t per test case
α is the degree to which the consumption rate is affected by the number of remaining test cases.

15.2.4 Numerical Illustrations

We show several numerical examples for software testing-progress evaluation. The data sets were obtained in actual testing processes, and consist of testing time ti, the cumulative number of consumed test cases yi, and the total number of test cases K. The data sets are:

DS-2.1: (ti, yi) (i = 1, 2, . . . , 10), K = 5964. ti is measured in days. This data appeared in Shibata [18].

DS-2.2: (ti, yi) (i = 1, 2, . . . , 19), K = 11 855. ti is measured in days. This data set was observed from a module testing phase, and the modules were implemented in a high-level programming language [14].

First, we estimate the three unknown parameters λ, β and α using DS-2.1. The results are λ = 1.764, β = 1.563, and α = 1.000. We illustrate the mean number of consumed test cases and the actual data in Figure 15.1: the small dots represent the realizations of the number of consumed test cases actually observed.

Figure 15.1. Estimated mean number of consumed test cases (DS-2.1; λ = 1.764, β = 1.563, α = 1.000)

Next, we discuss the predictability of our model in terms of the mean number of consumed test cases by using DS-2.2. That is, we estimate the model parameters by using the first half of DS-2.2 (i.e. t1 → t10). The estimates are obtained as λ = 4.826, β = 3.104, and α = 0.815. Figure 15.2 shows the estimation result for DS-2.2. The right-hand side of the plot is the extrapolated prediction of the test case consumption. These data sets seem to fit our model.

Hence software development managers may estimate the software testing progress quantitatively by using our model.

Figure 15.2. Estimated mean number of consumed test cases (DS-2.2; λ = 4.826, β = 3.104, α = 0.815)

15.2.5 Concluding Remarks

In this section, we formulated a non-homogeneous death process model with a state-dependent parameter, and derived the mean residual population at time t and the mean time to extinction. The estimation method for the unknown parameters included in the model has also been presented based on the method of maximum likelihood. We have considered the applicability of the model to two software quality assessment issues. As a result, it is found that our model is not suitable for assessing software reliability, since the parameter K, which represents the total number of software faults latent in the software system, is difficult to estimate. This difficulty remains as a future problem to be solved. However, our model is useful for such problems when K is known. Thus we have shown numerical examples of testing-progress evaluation by using the model and the actual data.

15.3 Estimation Method of Imperfect Debugging Probability

In this section we study a simple software reliability assessment model that takes an imperfect debugging environment into consideration. The model was originally proposed by Goel and Okumoto [12]. We focus on developing a practical estimation method for this imperfect debugging software reliability assessment model.

In Section 15.3.1 we describe the modeling and the several assumptions required for it. We show the Baum–Welch re-estimation procedure, which is commonly applied in hidden-Markov modeling [13, 19], in Section 15.3.2. Section 15.3.3 presents several numerical illustrations based on the actual software fault-detection data set collected. We discuss some computational difficulties of the estimation procedure. Finally, in Section 15.3.4, we summarize with some remarks about the results obtained, and state several issues remaining for future study.

Figure 15.3. State transition diagram of the imperfect debugging model

15.3.1 Hidden-Markov Modeling for Software Reliability Growth Phenomenon

In this section, we assume the following software testing and debugging environment.

• Software is tested by using test cases that are previously provided based on the specification requirements.
• One software failure is caused by one software fault.
• In the debugging process, a detected software fault is either fixed correctly, or the fix itself is done correctly but the debugging activity introduces one new fault into another program path in terms of the test case.

In the usual software testing phase, software is inspected and debugged until a test case can be processed successfully. Consequently, when a test case is performed correctly by the software system, the program path of the test case must have no fault. In other words, if a software debugging worker fails to fix a fault, a new fault might be introduced into another program path, and it may be detected by another (later) test case, or not be detected at all.

Based on these assumptions, Goel and Okumoto [12] proposed a simple software reliability assessment model considering imperfect debugging environments. This model represents the behavior of the number of faults latent in the software system as a Markov model. The state transition diagram is illustrated in Figure 15.3. In this figure, the parameters p and q represent perfect and imperfect debugging probabilities, respectively (p + q = 1). The sojourn time distribution function of state j is defined as

$$F_j(t) = 1 - \exp[-\lambda j t] \quad (j = 0, 1, \ldots, N) \quad (15.27)$$

where j represents the number of remaining faults, λ is a constant parameter that corresponds to the software failure occurrence rate, and N is the total number of software faults. This simple distribution function was first proposed by Jelinski and Moranda [20]. Goel and Okumoto [12] discussed statistical inference by the methods of maximum likelihood and Bayesian estimation. However, the former method, especially, needs data that include information on whether each software failure is caused by a fault that is originally latent in the software system or by an introduced one. Generally, during the actual testing phase in software development, it seems very difficult for testing managers to collect such data. We need to find another estimation method for the imperfect debugging probability.

In this section, we consider this imperfect debugging model as a hidden-Markov model. A hidden-Markov model is a kind of Markov model; however, it has hidden states in its state space. That is, the model includes states that are unobservable from the sequence of results of the process considered. The advantage of this modeling technique is that there is an established method for estimating the unknown parameters associated with the hidden states. Hidden-Markov modeling has been mainly used in the research area of speech recognition [13, 19]. We apply this method to estimate the imperfect debugging probability q and other unknown parameters.

Now we slightly generalize the sojourn time distribution function of state j as

$$F_j(t) = 1 - \exp[-\lambda_j t] \quad (j = 0, 1, \ldots, N) \quad (15.28)$$

This F_j(t) corresponds to Equation 15.27 if λ_j = λj.

We adopt the following notation:

N is the number of software faults latent in the software system at the beginning of the testing phase of software development
p is the perfect debugging probability (0 < p ≤ 1)
q is the imperfect debugging rate (q = 1 − p)
λ_j is the hazard rate of software failure occurrence at state j (λ_j > 0)
j is the state variable representing the number of software faults latent in the software system (j = 0, 1, . . . , N)
F_j(t) is the sojourn time distribution in state j, which represents the CDF of the time interval measured from the (N − j)th to the (N − j + 1)th fault-detection (or failure-occurrence) time.

In our model, the state transition diagram is the same as that of Goel and Okumoto (see Figure 15.3). In the figure, each state represents the number of faults latent in the software during the testing phase. From the viewpoint of hidden-Markov modeling, the hidden states in this model are the states themselves. In other words, we cannot observe how the sequence of transitions occurred, based only on the observed software fault-detection data.

15.3.2 Estimation Method of Unknown Parameters

We use the Baum–Welch re-estimation formulas to estimate the unknown parameters q, λ_j, and N, with fault-detection time data (i, xi) (i = 1, 2, . . . , T), where xi means the time interval between the (i − 1)th and ith fault-detection times (x0 ≡ 0).

By using this estimation procedure, we can re-estimate the values of q and λ_j (j = N+, N+ − 1, . . . , 1, 0), where N+ represents the initial fault content that is given virtually for the procedure. We can also obtain the re-estimated value of π_j (j = N+, N+ − 1, . . . , 1, 0) through this procedure, where π_j is the initial probability of state j (with Σ_{j=0}^{N+} π_j = 1). The re-estimated values are given as follows:

$$q = \frac{\sum_{t=1}^{T-1} \xi_{00}(t)}{\sum_{t=1}^{T-1} \gamma_0(t)} \quad (15.29)$$

$$\lambda_j = \frac{\sum_{t=1}^{T} \alpha_j(t)\beta_j(t)}{\sum_{t=1}^{T} \alpha_j(t)\beta_j(t)\,x_t} \quad (15.30)$$

$$\pi_j = \gamma_j(1) \quad (15.31)$$

where T represents the length of the observed data, and

$$a_{ij} = \begin{cases} 1 & (i = 0,\; j = 0) \\ q & (i = j) \\ p & (i - j = 1) \\ 0 & (\text{otherwise}) \end{cases} \quad (15.32)$$

$$b_j(k) = \frac{dF_j(k)}{dk} = \lambda_j \exp[-\lambda_j k] \quad (j = N^{+}, N^{+}-1, \ldots, 1, 0)$$

$$\alpha_j(t) = \begin{cases} \pi_j b_j(x_1) & (t = 1;\; j = N^{+}, N^{+}-1, \ldots, 1, 0) \\ \sum_{i=0}^{N^{+}} \alpha_i(t-1)\,a_{ij}\,b_j(x_t) & (t = 2, 3, \ldots, T;\; j = N^{+}, N^{+}-1, \ldots, 1, 0) \end{cases} \quad (15.33)$$

$$\beta_i(t) = \begin{cases} 1 & (t = T;\; i = N^{+}, N^{+}-1, \ldots, 1, 0) \\ \sum_{j=0}^{N^{+}} a_{ij}\,b_j(x_{t+1})\,\beta_j(t+1) & (t = T-1, T-2, \ldots, 1;\; i = N^{+}, N^{+}-1, \ldots, 1, 0) \end{cases} \quad (15.34)$$

$$\gamma_i(t) = \frac{\alpha_i(t)\beta_i(t)}{\sum_{i=0}^{N^{+}} \alpha_i(T)} \quad (15.35)$$

$$\xi_{ij}(t) = \frac{\alpha_i(t)\,a_{ij}\,b_j(x_{t+1})\,\beta_j(t+1)}{\sum_{i=0}^{N^{+}} \alpha_i(T)} \quad (15.36)$$

The initial values for the parameters q, λ_j, and π = {π_0, π_1, . . . , π_{N+}} should be given beforehand. After iterating these re-estimation procedures until convergence, we obtain the estimated values. Also, we can estimate the optimal state sequence by using γ_i(t) in Equation 15.35. The most likely state, s_i, is given as

$$s_i = \underset{0 \le j \le N^{+}}{\arg\max}\,[\gamma_j(i)] \quad (1 \le i \le T) \quad (15.37)$$

The estimated value of the parameter N, denoted N̂, is given by N̂ = s_1 from Equation 15.37.
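The re-estimation loop above is straightforward to implement. The following is a compact sketch (our assumption, not the authors' tool) for the reduced model of Section 15.3.3 with λ_j = λ·j; since state 0 is absorbing, q is re-estimated here from the self-transitions of all states j ≥ 1 (a tied-parameter variant of Equation 15.29), and no probability scaling is applied, for clarity.

```python
import numpy as np

def baum_welch(x, Nplus, q=0.05, lam=0.01, n_iter=100):
    """x: fault-detection intervals x_1..x_T; Nplus: assumed initial fault content."""
    x = np.asarray(x, dtype=float)
    T, S = len(x), Nplus + 1                  # states j = 0, 1, ..., N+
    pi = np.full(S, 1.0 / S)                  # initial state probabilities
    for _ in range(n_iter):
        A = np.zeros((S, S))                  # Equation 15.32
        for j in range(1, S):
            A[j, j], A[j, j - 1] = q, 1.0 - q
        A[0, 0] = 1.0
        lamj = lam * np.arange(S)             # lambda_j = lam * j
        B = lamj[None, :] * np.exp(-lamj[None, :] * x[:, None])  # b_j(x_t)

        alpha = np.zeros((T, S)); beta = np.ones((T, S))
        alpha[0] = pi * B[0]                  # Equation 15.33
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[t]
        for t in range(T - 2, -1, -1):        # Equation 15.34
            beta[t] = A @ (B[t + 1] * beta[t + 1])

        Z = alpha[-1].sum()
        gamma = alpha * beta / Z              # Equation 15.35
        xi_self = alpha[:-1] * np.diag(A) * B[1:] * beta[1:] / Z  # xi_jj(t)

        q = xi_self[:, 1:].sum() / gamma[:-1, 1:].sum()   # cf. Equation 15.29
        j = np.arange(S)
        lam = gamma[:, 1:].sum() / (gamma[:, 1:] * j[1:] * x[:, None]).sum()
        pi = gamma[0]

    s = gamma.argmax(axis=1)                  # Equation 15.37
    return q, lam, s[0], s                    # N-hat = s_1
```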

Table 15.1. Estimation results for λ and N

q (%)   λ         N    SSE
0.0     0.00695   31   7171.9
1.0     0.00688   31   7171.7
2.0     0.00680   31   7172.4
3.0     0.00674   31   7173.5
4.0     0.00713   30   6872.0
5.0     0.00706   30   7213.6
6.0     0.00698   30   7182.7
7.0     0.00740   29   7217.1
8.0     0.00732   29   6872.7
9.0     0.00725   29   7248.9
10.0    0.00717   29   7218.6

15.3.3 Numerical Illustrations

We now try to estimate the unknown parameters included in our model described in the previous section. We analyze a data set that was cited by Goel and Okumoto [21]. This data set (denoted DS-3.1) forms (i, xi) (i = 1, 2, . . . , 26, i.e. T = 26). However, we have actually found some computational difficulty in the re-estimation procedure. This arises from the fact that there are many unknown parameters, namely λ_j (j = N+, N+ − 1, . . . , 1, 0). Hence, we have reduced our model to Goel and Okumoto's one, i.e. λ_j = λ · j (for all j).

We give the initial values for π and λ by taking into consideration the possible initial fault content suggested by the data set. For computational convenience, we fix the imperfect debugging rate q at q = 0, 1, 2, . . . , 10 (%), and estimate the other parameters for each q.

The estimation results are shown in Table 15.1. In this table, we also calculate the sum of squared errors (SSE), which is defined as

$$\mathrm{SSE} = \sum_{i=1}^{26} (x_i - E[X_i])^2 \quad (15.38)$$

where E[X_i] represents the estimated mean time between the (i − 1)th and the ith fault-detection times, i.e. E[X_i] = 1/(λ s_i) (i = 1, 2, . . . , 26).

Table 15.2. Results of data analysis for DS-3.1

i    x_i   E[x_i]   s_i
1    9     4.68     30
2    12    4.84     29
3    11    5.01     28
4    4     5.19     27
5    7     5.39     26
6    2     5.61     25
7    5     5.84     24
8    8     6.10     23
9    5     6.38     22
10   7     6.68     21
11   1     7.01     20
12   6     7.38     19
13   1     7.79     18
14   9     8.25     17
15   4     8.77     16
16   1     9.35     15
17   3     10.0     14
18   3     10.8     13
19   6     11.7     12
20   1     12.8     11
21   11    14.0     10
22   33    15.6     9
23   7     17.5     8
24   91    20.0     7
25   2     20.0     7 (a)
26   1     23.4     6

(a) The occurrence of imperfect debugging.

From the viewpoint of minimizing the SSE, the best model for this data set is N = 30, q = 4.0 (%), λ = 0.00713. This model fits better than the model by Jelinski and Moranda [20] (their model corresponds to the case of q = 0.0 (%) in Table 15.1). By using Equation 15.37, we additionally obtain the most likely sequence of the (hidden) states. The result is shown in Table 15.2 with the analyzed data set. From Table 15.2, we find that the estimated number of residual faults is five. Figure 15.4 illustrates the cumulative testing time versus the number of fault detections based on the results of Table 15.2. We have used the actual data from Table 15.2, and plotted them with the estimated mean cumulative fault-detection time for the first 26 faults. Since an additional five data points are also available from Goel and Okumoto [21], we have additionally plotted them in Figure 15.4 with the prediction curve, which is extrapolated by using

$$E[X_m] = \frac{1}{\lambda(1-q)(N - m + 1 + I)} \quad (m = 27, 28, \ldots, 31) \quad (15.39)$$

$$I = 26 - (N - s_{26} + 1) \quad (15.40)$$

where I denotes the estimated number of times of imperfect debugging during the testing phase.
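For concreteness, a tiny illustrative computation (using the estimates quoted above, with s_26 = 6 from Table 15.2) of the extrapolated intervals:

```python
lam, q, N, s26 = 0.00713, 0.04, 30, 6
I = 26 - (N - s26 + 1)                              # Equation 15.40
for m in range(27, 32):
    e_xm = 1.0 / (lam * (1 - q) * (N - m + 1 + I))  # Equation 15.39
    print(m, round(e_xm, 1))
```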

15.3.4 Concluding Remarks

This section considered the method of estimating the imperfect debugging probability which appears in several software reliability assessment models. We have applied hidden-Markov modeling and the Baum–Welch re-estimation procedure to the estimation problems of the unknown model parameters. We have found that there are still some computational difficulties in obtaining the estimates if the model structure is complicated. However, we could estimate the imperfect debugging probability by using only the fault-detection time data.

It seems that hidden-Markov modeling is capable of describing more complicated models for the software fault-detection process. One area of future study is to solve the computational problems appearing in the re-estimation procedure, and to find some rational techniques for giving the initial values.

Figure 15.4. Estimated cumulative fault-detection time

15.4 Continuous State Space Model for Large-scale Software

One of our interests here is to introduce the nonlinear state dependency of software debugging processes directly into the software reliability assessment model, not as a mean behavior like the NHPP models. By focusing on software debugging speed, we first show the importance of introducing the nonlinear behavior of the debugging speed into the software reliability modeling. The modeling needs the mathematical theory of stochastic differential equations of the Itô type [22–24], since a counting process (including an NHPP) cannot directly describe the nonlinearity unless it is modeled as a mean behavior. We derive several software reliability assessment measures based on our new model, and also show the estimation method of unknown parameters included in the model.

15.4.1 Model Description

Let M(t) be the number of software faults remaining in the software system at testing time t (t ≥ 0). We consider a stochastic process {M(t), t ≥ 0}, and assume M(t) takes on a continuous real value. In past studies, {M(t), t ≥ 0} has usually been modeled as a counting process for software reliability assessment modeling [2–5, 20, 21, 25]. However, we can suppose that M(t) is continuous when the size of the software system in the testing phase is sufficiently large [26, 27]. The process {M(t), t ≥ 0} may start from a fixed value and gradually decrease with some fluctuation as the testing phase goes on. Thus, we assume that M(t) obeys the following basic equation:

$$\frac{dM(t)}{dt} = -b(t)\,g(M(t)) \quad (15.41)$$

where b(t) (> 0) represents a fault-detection rate per unit time, and M(0) (= m0 (const.)) is the number of inherent faults at the beginning of the testing phase. g(x) represents a nonlinear function, which satisfies the following required conditions:

• g(x) is non-negative and has Lipschitz continuity;
• g(0) = 0.

Equation 15.41 means that the fault-detection rate at testing time t, dM(t)/dt, is defined as a function of the number of faults remaining at testing time t, M(t). dM(t)/dt decreases gradually as the testing phase goes on. This supposition has often been made in software reliability growth modeling.

In this section, we suppose that b(t) in Equation 15.41 has an irregular fluctuation, i.e. we expand Equation 15.41 to the following stochastic differential equation [27]:

$$\frac{dM(t)}{dt} = -\{b(t) + \xi(t)\}\,g(M(t)) \quad (15.42)$$

where ξ(t) represents a noise function that denotes the irregular fluctuation. Further, ξ(t) is defined as

$$\xi(t) = \sigma \gamma(t) \quad (15.43)$$

where γ(t) is a standard Gaussian white noise and σ is a positive constant that corresponds to the magnitude of the irregular fluctuation. Hence we rewrite Equation 15.42 as

$$\frac{dM(t)}{dt} = -\{b(t) + \sigma \gamma(t)\}\,g(M(t)) \quad (15.44)$$

By using the Wong–Zakai transformation [28, 29], we obtain the following stochastic differential equation of the Itô type:

$$dM(t) = \left\{-b(t)\,g(M(t)) + \tfrac{1}{2}\sigma^2 g(M(t))\,g'(M(t))\right\} dt - \sigma g(M(t))\,dW(t) \quad (15.45)$$

where the so-called growth condition is assumed to be satisfied with respect to the function g(x).

In Equation 15.45, W(t) represents a one-dimensional Wiener process, which is formally defined as an integration of the white noise γ(t) with respect to testing time t. The Wiener process {W(t), t ≥ 0} is a Gaussian process, and has the following properties:

• Pr[W(0) = 0] = 1;
• E[W(t)] = 0;
• E[W(t)W(τ)] = min[t, τ].

Under these assumptions and conditions, we derive a transition probability distribution function of the solution process by solving the Fokker–Planck equation [22] based on Equation 15.45.

The transition probability distribution function P(m, t | m0) is denoted by

$$P(m, t \mid m_0) = \Pr[M(t) \le m \mid M(0) = m_0] \quad (15.46)$$

and its density is represented as

$$p(m, t \mid m_0) = \frac{\partial}{\partial m} P(m, t \mid m_0) \quad (15.47)$$

The transition probability density function p(m, t | m0) satisfies the following Fokker–Planck equation:

$$\frac{\partial p(m, t)}{\partial t} = b(t)\frac{\partial}{\partial m}\{g(m)p(m, t)\} - \frac{1}{2}\sigma^2 \frac{\partial}{\partial m}\{g(m)g'(m)p(m, t)\} + \frac{1}{2}\sigma^2 \frac{\partial^2}{\partial m^2}\{g(m)^2 p(m, t)\} \quad (15.48)$$

By solving Equation 15.48 with the following initial condition:

$$\lim_{t \to 0} p(m, t) = \delta(m - m_0) \quad (15.49)$$

we obtain the transition probability density function of the process M(t) as follows:

$$p(m, t \mid m_0) = \frac{1}{g(m)\sigma\sqrt{2\pi t}} \exp\left\{-\frac{\left[\int_{m_0}^{m} \frac{dm'}{g(m')} + \int_0^t b(t')\,dt'\right]^2}{2\sigma^2 t}\right\} \quad (15.50)$$

Moreover, the transition probability distribution function of M(t) is obtained as

$$P(m, t \mid m_0) = \Pr[M(t) \le m \mid M(0) = m_0] = \int_{-\infty}^{m} p(m', t \mid m_0)\,dm' = \Phi\left(\frac{\int_{m_0}^{m} \frac{dm'}{g(m')} + \int_0^t b(t')\,dt'}{\sigma\sqrt{t}}\right) \quad (15.51)$$

where the function Φ(z) denotes the standard normal distribution function defined as

$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} \exp\left[-\frac{s^2}{2}\right] ds \quad (15.52)$$

Let N(t) be the total number of detected faults up to testing time t. Since the condition M(t) + N(t) = m0 holds with probability one, we have the transition probability distribution function of the process N(t) as follows:

$$\Pr[N(t) \le n \mid N(0) = 0, M(0) = m_0] = \Phi\left(\frac{\int_0^{n} \frac{dn'}{g(m_0 - n')} - \int_0^t b(t')\,dt'}{\sigma\sqrt{t}}\right) \quad (n < m_0) \quad (15.53)$$

Figure 15.5. Software debugging speed dN(t)/dt for DS-4.1

15.4.2 Nonlinear Characteristics of Software Debugging Speed

In order to apply the model constructed in the previous section to the software fault-detection data, we should determine the function g(x) in advance. First, we investigate two data sets (denoted DS-4.1 and DS-4.2) that were collected in the actual testing phase of software development to illustrate the behavior of their debugging speed. Each data set forms (tj, nj) (j = 1, 2, . . . , K; 0 < t1 < t2 < · · · < tK), where tj represents testing time and nj is the total number of detected faults up to the testing time tj [30]. By evaluating numerically, we obtain the debugging speed dN(t)/dt from these data sets. Figures 15.5 and 15.6 illustrate the behavior of debugging speed versus the cumulative number of detected faults N(t) for the two data sets. Figure 15.5 shows an approximate linear relation between the debugging speed and the degree of fault detection.

Figure 15.6. Software debugging speed dN(t)/dt for DS-4.2

Therefore, we can give g(x) as

$$g(x) = x \quad (15.54)$$

for such a data set. On the other hand, Figure 15.6 illustrates the existence of a nonlinear relation between them. Thus, we expand Equation 15.54 and assume the function as follows:

$$g(x) = \begin{cases} x^r & (0 \le x \le x_c,\; r > 1) \\ r x_c^{\,r-1} x + (1 - r) x_c^{\,r} & (x > x_c,\; r > 1) \end{cases} \quad (15.55)$$

which is a power function that is partially revised in x > x_c so as to satisfy the growth condition. In Equation 15.55, r denotes a shape parameter and x_c is a positive constant that is needed to ensure the existence of the solution process M(t) in Equation 15.45. The parameter x_c can be given arbitrarily. We assume x_c = m0 in this model.

15.4.3 Estimation Method of Unknown Parameters

We discuss the estimation method for the unknown parameters m0, b, and σ by applying the maximum likelihood method. The shape parameter r is estimated by the numerical analysis between dN(t)/dt and N(t), which was mentioned in the previous section, because r is very sensitive. We assume again that we have a data set of the form (tj, nj) (j = 1, 2, . . . , K; 0 < t1 < t2 < · · · < tK).

As a basic model, we suppose that the instantaneous fault-detection rate b(t) takes on a constant value, i.e.

$$b(t) = b \;(\text{const.}) \quad (15.56)$$

Let us denote the joint probability distribution function of the process N(t) as

$$P(t_1, n_1; t_2, n_2; \ldots; t_K, n_K) \equiv \Pr[N(t_1) \le n_1, \ldots, N(t_K) \le n_K \mid N(0) = 0, M(0) = m_0] \quad (15.57)$$

and denote its density as

$$p(t_1, n_1; t_2, n_2; \ldots; t_K, n_K) \equiv \frac{\partial^K P(t_1, n_1; t_2, n_2; \ldots; t_K, n_K)}{\partial n_1 \partial n_2 \cdots \partial n_K} \quad (15.58)$$

Therefore, we obtain the likelihood function l for the data set as follows:

$$l = p(t_1, n_1; t_2, n_2; \ldots; t_K, n_K) \quad (15.59)$$

since N(t) takes on a continuous value. For convenience in mathematical manipulations, we use the following logarithmic likelihood function L:

$$L = \log l \quad (15.60)$$

The maximum-likelihood estimates m0*, b*, and σ* are the values maximizing L in Equation 15.60, which can be obtained as the solutions of the following simultaneous equations:

$$\frac{\partial L}{\partial m_0} = 0, \qquad \frac{\partial L}{\partial b} = 0, \qquad \frac{\partial L}{\partial \sigma} = 0 \quad (15.61)$$

By using Bayes' formula, the likelihood function l can be rewritten as

$$l = p(t_2, n_2; \ldots; t_K, n_K \mid t_1, n_1)\,p(t_1, n_1 \mid t_0, n_0) \quad (15.62)$$

where p(· | t0, n0) is the conditional probability density under the condition N(t0) = n0, and we suppose that t0 = 0 and n0 = 0. By iterating this transformation under the Markov property, we finally obtain the likelihood function l as follows:

$$l = \prod_{j=1}^{K} p(t_j, n_j \mid t_{j-1}, n_{j-1}) \quad (15.63)$$

The transition probability of N(tj) under the condition N(tj−1) = nj−1 can be obtained as

$$\Pr[N(t_j) \le n_j \mid N(t_{j-1}) = n_{j-1}] = \Phi\left(\frac{\int_{n_{j-1}}^{n_j} \frac{dn'}{g(m_0 - n')} - b(t_j - t_{j-1})}{\sigma\sqrt{t_j - t_{j-1}}}\right) \quad (15.64)$$

Partially differentiating Equation 15.64 with respect to nj yields

$$p(t_j, n_j \mid t_{j-1}, n_{j-1}) = \frac{1}{g(m_0 - n_j)\,\sigma\sqrt{2\pi(t_j - t_{j-1})}} \exp\left\{-\frac{\left[\int_{n_{j-1}}^{n_j} \frac{dn'}{g(m_0 - n')} - b(t_j - t_{j-1})\right]^2}{2\sigma^2(t_j - t_{j-1})}\right\} \quad (15.65)$$

By substituting Equation 15.65 into Equation 15.63 and taking the logarithm of the equation, we obtain the logarithmic likelihood function as follows:

$$L = \log l = -K \log \sigma - \frac{K}{2} \log 2\pi - \sum_{j=1}^{K} \log g(m_0 - n_j) - \frac{1}{2} \sum_{j=1}^{K} \log(t_j - t_{j-1}) - \frac{1}{2\sigma^2} \sum_{j=1}^{K} \left[\int_{n_{j-1}}^{n_j} \frac{dn'}{g(m_0 - n')} - b(t_j - t_{j-1})\right]^2 \Big/ (t_j - t_{j-1}) \quad (15.66)$$

Substituting Equation 15.55 into Equation 15.66 yields

$$L = \log l = -K \log \sigma - \frac{K}{2} \log 2\pi - r \sum_{j=1}^{K} \log(m_0 - n_j) - \frac{1}{2} \sum_{j=1}^{K} \log(t_j - t_{j-1}) - \frac{1}{2\sigma^2} \sum_{j=1}^{K} \left\{\frac{1}{r-1}\left[(m_0 - n_j)^{1-r} - (m_0 - n_{j-1})^{1-r}\right] - b(t_j - t_{j-1})\right\}^2 \Big/ (t_j - t_{j-1}) \quad (15.67)$$

We obtain the simultaneous likelihood equations by applying Equation 15.67 to Equation 15.61 as follows:

$$\frac{\partial L}{\partial m_0} = -r \sum_{j=1}^{K} \frac{1}{m_0 - n_j} - \frac{1}{\sigma^2} \sum_{j=1}^{K} \frac{S_{Ij}\,\partial S_{Ij}/\partial m_0}{t_j - t_{j-1}} = 0 \quad (15.68)$$

$$\frac{\partial L}{\partial b} = -\frac{1}{\sigma^2} \sum_{j=1}^{K} \left\{\frac{(m_0 - n_j)^{1-r} - (m_0 - n_{j-1})^{1-r}}{r-1} - b(t_j - t_{j-1})\right\} = 0 \quad (15.69)$$

$$\frac{\partial L}{\partial \sigma} = -\frac{K}{\sigma} + \frac{1}{\sigma^3} \sum_{j=1}^{K} \frac{S_{Ij}^2}{t_j - t_{j-1}} = 0 \quad (15.70)$$

where

$$S_{Ij} = \frac{1}{r-1}\left[(m_0 - n_j)^{1-r} - (m_0 - n_{j-1})^{1-r}\right] - b(t_j - t_{j-1}) \quad (15.71)$$

$$\frac{\partial S_{Ij}}{\partial m_0} = -(m_0 - n_j)^{-r} + (m_0 - n_{j-1})^{-r} \quad (15.72)$$

Since the variables b and σ can be eliminated in Equations 15.68–15.70, we can estimate these three parameters by solving only one nonlinear equation numerically.
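A minimal sketch of this reduction (our assumption, not the authors' implementation) for the case g(x) = x^r with r > 1: Equation 15.69 gives b in closed form (the sum of the per-interval integrals telescopes), Equation 15.70 then gives σ², and Equation 15.68 remains as one equation in m0; the bracketing interval for the root search is illustrative.

```python
import numpy as np
from scipy.optimize import brentq

def fit_sde_model(t, n, r):
    """t, n: testing times and cumulative detected faults; r: shape parameter (> 1)."""
    t = np.concatenate(([0.0], np.asarray(t, dtype=float)))
    n = np.concatenate(([0.0], np.asarray(n, dtype=float)))
    dt = np.diff(t)

    def pieces(m0):
        F = (m0 - n) ** (1.0 - r) / (r - 1.0)   # antiderivative of 1/g(m0 - n')
        dI = np.diff(F)                          # per-interval integrals
        b = dI.sum() / t[-1]                     # Equation 15.69: sum of S_Ij = 0
        S = dI - b * dt                          # Equation 15.71
        sigma2 = np.sum(S**2 / dt) / len(dt)     # Equation 15.70
        return b, S, sigma2

    def dL_dm0(m0):                              # Equation 15.68
        b, S, sigma2 = pieces(m0)
        dS = -(m0 - n[1:]) ** (-r) + (m0 - n[:-1]) ** (-r)   # Equation 15.72
        return -r * np.sum(1.0 / (m0 - n[1:])) - np.sum(S * dS / dt) / sigma2

    m0 = brentq(dL_dm0, n[-1] + 1.0, 100.0 * n[-1])  # bracket by assumption
    b, _, sigma2 = pieces(m0)
    return m0, b, float(np.sqrt(sigma2))
```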

15.4.4 Software Reliability Assessment Measures

We derive several quantitative measures to assess software reliability in the testing and operational phases.

15.4.4.1 Expected Number of Remaining Faults and Its Variance

In order to assess software reliability quantitatively, information on the current number of remaining faults in the software system is useful for estimating the progress of the software testing phase.

The expected number of remaining faults can be evaluated by

$$E[M(t)] = \int_0^{\infty} m\,d\Pr[M(t) \le m \mid M(0) = m_0] \quad (15.73)$$

The variance of M(t) is also obtained as

$$\mathrm{Var}[M(t)] = E[M(t)^2] - E[M(t)]^2 \quad (15.74)$$

If we assume g(x) = x as a special case, we have

$$E[M(t)] = m_0 \exp[-(b - \tfrac{1}{2}\sigma^2)t] \quad (15.75)$$

$$\mathrm{Var}[M(t)] = m_0^2 \exp[-(2b - \sigma^2)t]\{\exp[\sigma^2 t] - 1\} \quad (15.76)$$

15.4.4.2 Cumulative and Instantaneous Mean Time Between Failures

In the fault-detection process {N(t), t ≥ 0}, the average fault-detection time interval per fault up to time t is denoted by t/N(t). Hence, the cumulative mean time between failures (MTBF), MTBF_C(t), is approximately given by

$$\mathrm{MTBF}_C(t) = E\left[\frac{t}{N(t)}\right] = E\left[\frac{t}{m_0 - M(t)}\right] \approx \frac{t}{m_0 - E[M(t)]} = \frac{t}{m_0\{1 - \exp[-(b - \frac{1}{2}\sigma^2)t]\}} \quad (15.77)$$

when g(x) = x is assumed.

The instantaneous MTBF, MTBF_I(t), can be evaluated by

$$\mathrm{MTBF}_I(t) = E\left[\frac{1}{dN(t)/dt}\right] \quad (15.78)$$

We calculate MTBF_I(t) approximately by

$$\mathrm{MTBF}_I(t) \approx \frac{1}{E[dN(t)/dt]} \quad (15.79)$$

Since the Wiener process has the independent increment property, W(t) and dW(t) are statistically independent and E[dW(t)] = 0, we have

$$\mathrm{MTBF}_I(t) = \frac{1}{E[-dM(t)/dt]} = \frac{1}{m_0(b - \frac{1}{2}\sigma^2) \exp[-(b - \frac{1}{2}\sigma^2)t]} \quad (15.80)$$

if g(x) = x.
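For the linear case g(x) = x these measures are closed form and trivial to compute; a small illustrative sketch follows (not from the chapter; the parameter values are those quoted with Figure 15.8 in Section 15.5.3).

```python
import numpy as np

def assessment_measures(t, m0, b, sigma):
    c = b - 0.5 * sigma**2                        # effective decay rate
    e_m = m0 * np.exp(-c * t)                     # Equation 15.75
    var_m = m0**2 * np.exp(-(2*b - sigma**2) * t) * (np.exp(sigma**2 * t) - 1)
    mtbf_c = t / (m0 * (1 - np.exp(-c * t)))      # Equation 15.77
    mtbf_i = 1.0 / (m0 * c * np.exp(-c * t))      # Equation 15.80
    return e_m, var_m, mtbf_c, mtbf_i

print(assessment_measures(100.0, m0=3569.8, b=0.000131, sigma=0.00800))
```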

15.4.5 Concluding Remarks

In this section, a new model for software reliability assessment has been proposed by using nonlinear stochastic differential equations of the Itô type. By introducing the debugging speed function, we can obtain a more precise estimation in terms of software reliability assessment. We have derived several measures for software reliability assessment based on this nonlinear stochastic model. The method of maximum-likelihood estimation for the model parameters has also been shown.

In any future study we must evaluate the performance of our model in terms of software reliability assessment and prediction by using data actually collected in the testing phase of software development.

15.5 Development of a Software Reliability Management Tool

Recently, numerous software tools that can assist the software development process have been constructed (e.g. see [4, 5]). The tools are constructed as software packages to analyze software fault data observed in the testing phase, and to evaluate the quality/reliability of the software product. In the testing phase of the software development process, the testing manager can perform the reliability evaluation easily by using such tools without knowing the details of the fault data analysis process. Thus, this section aims at constructing a tool for software management that provides quantitative estimation/prediction of the software reliability and testing management.

We introduce the notion of object-oriented analysis for the implementation of the software management tool. The methodology of object-oriented analysis has had great success in the fields of programming languages, simulation, the graphical user interface (GUI), and construction of databases in software development. The general idea of object-oriented methodology was developed as a technique for easily constructing and maintaining a complex software system. This tool has been implemented using JAVA, which is one of a number of object-oriented languages. Hence we first show the definition of the specification requirements of the tool. Then we discuss the implementation methodology of the software testing management tool using JAVA. An example of software quality/reliability evaluation is presented for a case where the tool is applied to actual observed data. Finally, we discuss the future problems of this tool.

15.5.1 Definition of the Specification Requirement

The definition of the specification requirements for this tool is as follows.

1. It should perform as a software reliability management tool, which can analyze the fault data and evaluate the software quality/reliability and the degree of testing progress quantitatively. The results should be displayed as a window system on a PC.

2. For the data analysis required for quality/reliability and testing-progress control, seven stochastic models should be employed for software reliability analysis and one for testing-progress evaluation. For software reliability assessment, the models are:

(i) exponential software reliability growth model (SRGM);
(ii) delayed S-shaped SRGM;
(iii) exponential-S-shaped SRGM [31];
(iv) logarithmic Poisson execution time model [32];
(v) Weibull process model [2, 33];
(vi) hidden-Markov imperfect debugging model;
(vii) nonlinear stochastic differential equation model.

Figure 15.7. Structure of the software reliability management tool

Of the above models, the first five belong to the class of NHPP models. The model for testing-progress evaluation is the generalized death process model.

The tool additionally applies goodness-of-fit tests to the observed fault data.

3. This tool should be operated by mouse clicks and by typing at the keyboard to input the data through a GUI system.

4. JAVA, which is an object-oriented language, should be used to implement the program. This tool is developed as a stand-alone application on the Windows95 operating system.

15.5.2 Object-oriented Design

This tool should be composed of several processes, such as fault data analysis, estimation of unknown parameters, goodness-of-fit tests for estimated models, graphical representation of fault data, and their estimation results (see Figure 15.7). For searching for the numerical solution of the simultaneous likelihood equations obtained from the software reliability assessment models based on NHPP and stochastic differential equation models, the bisection method is adopted. The Baum–Welch re-estimation method is implemented for the imperfect debugging model explained in Section 15.3.
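For reference, the bisection step is the textbook interval-halving root finder; a minimal sketch (ours; function name and tolerances are illustrative assumptions) applied to a one-parameter likelihood equation f(θ) = ∂L/∂θ = 0:

```python
def bisect(f, lo, hi, tol=1e-10, max_iter=200):
    """Find a root of f in [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    flo = f(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if abs(fmid) < tol or (hi - lo) < tol:
            return mid
        if flo * fmid < 0:       # root lies in the left half-interval
            hi = mid
        else:                    # root lies in the right half-interval
            lo, flo = mid, fmid
    return 0.5 * (lo + hi)
```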

The goodness of fit of each estimated model is examined by the K–S test and the Akaike information criterion (AIC) [34] for the software reliability assessment models and the testing-progress evaluation model. To represent the results of quality/reliability evaluation, the estimated mean value function, the expected number of remaining faults, the software reliability, and the MTBF are displayed graphically for the software reliability assessment models, except for the hidden-Markov imperfect debugging model. The imperfect debugging model graphically gives the estimation result of the imperfect debugging probability and the expected number of actually detected faults.

To perform the testing-progress evaluation, the generalized death process model described in Section 15.2 predicts the expected number of consumed test cases at testing time t, and the mean completion time of the testing phase.

The object-oriented methodology for the implementation of a software program is defined as the general idea that enables us to implement the program code abstractly, with the program logic performed by passing messages among the objects. This is different from the traditional methodology of structured programming. Thus, it combines the advantages of information hiding and modularization. Moreover, it is comparatively easy for programmers to construct application software programs with a GUI by using an object-oriented language, e.g. see Arnold et al. [35].

This tool is implemented using the JAVA language, and the source code of the program conforms to the object-oriented programming methodology by using message communication among the various classes. In the class construction, we implement the source program so that each class performs a single function. For example, the MOML class is constructed to perform the function of estimating the unknown parameters of the models using the method of maximum likelihood. To protect from improper access among classes, one should avoid using static variables as much as possible.

The proposed tool has been implemented on an IBM-PC/AT compatible personal computer with the Windows95 operating system and JAVA interpreter version 1.01 for MS-DOS. We use the integrated JAVA development environment package software "WinCafe", which has a project manager, editor, class browser, GUI debugger, and several utilities.

15.5.3 Examples of Reliability Estimation and Discussion

In this section we discuss a software testing-management tool that can evaluate quantitatively the quality/reliability and the degree of testing progress. The tool analyzes the fault data observed in the testing phase of software development and displays the accompanying various information in visual forms. Using this tool, the users can quantitatively and easily obtain the achievement level of the software reliability, the degree of testing progress, and the testing stability.

By analyzing the actual fault-detection data set, denoted DS-4.1 in Section 15.4, we performed a reliability analysis using this tool and show examples of the quality/reliability assessment in Figure 15.8. This figure shows the estimation results for the expected number of detected faults by using the nonlinear stochastic differential equation model.

Figure 15.8. Expected cumulative number of detected faults (by SDE model with DS-4.1; m0 = 3569.8, b = 0.000131, σ = 0.00800, r = 1.0)

In the following we evaluate our tool and discuss several problems. The first point concerns the expandability of our tool. JAVA is simple and compactly developed, and we can easily code and debug the source code. Moreover, since JAVA, like the C++ language, is an object-oriented language, we can easily use object-oriented techniques and their attendant advantages, such as modularization and reuse of existing code. We have implemented several classes to perform various functions, such as producing a new data file, estimating the unknown parameters, displaying results graphically, and several other functions. If new software reliability assessment techniques are developed in the field of software reliability research, our tool can be easily reconstructed with new software components to perform the new assessment techniques. So we can quickly and flexibly cope with extending the functions of our tool. The second point is the portability of the program. JAVA is a platform-free programming language, and so does not depend on a specific computer architecture. This tool can be implemented and run on any kind of computer hardware. The final point is the processing

speed. The JAVA compiler makes use of bytecode, which cannot be directly executed by the CPU. This is the greatest difference from a traditional programming language. But we do not need to pay attention to the problem of the processing speed of JAVA bytecode, since the processing speed of our tool is sufficient.

In conclusion, this tool is new and has the advantage that the results of data analysis are represented simply and visually by a GUI. The tool is also easily expandable, machine portable, and maintainable because of the use of JAVA.

References

[1] Pressman R. Software engineering: a practitioner's approach. Singapore: McGraw-Hill Higher Education; 2001.
[2] Musa JD, Iannino A, Okumoto K. Software reliability: measurement, prediction, application. New York: McGraw-Hill; 1987.
[3] Bittanti S. Software reliability modeling and identification. Berlin: Springer-Verlag; 1988.
[4] Lyu M. Handbook of software reliability engineering. Los Alamitos: IEEE Computer Society Press; 1995.
[5] Pham H. Software reliability. Singapore: Springer-Verlag; 2000.
[6] Okumoto K, Goel AL. Availability and other performance measurement for system under imperfect debugging. In: Proceedings of COMPSAC'78, 1978; p.66–71.
[7] Shanthikumar JG. A state- and time-dependent error occurrence-rate software reliability model with imperfect debugging. In: Proceedings of the National Computer Conference, 1981; p.311–5.
[8] Kramer W. Birth–death and bug counting. IEEE Trans Reliab 1983;R-32:37–47.
[9] Sumita U, Shanthikumar JG. A software reliability model with multiple-error introduction & removal. IEEE Trans Reliab 1986;R-35:459–62.
[10] Ohba M, Chou X. Does imperfect debugging affect software reliability growth? In: Proceedings of the IEEE International Conference on Software Engineering, 1989; p.237–44.
[11] Goel AL, Okumoto K. An imperfect debugging model for reliability and other quantitative measures of software systems. Technical Report 78-1, Department of IE&OR, Syracuse University, 1978.
[12] Goel AL, Okumoto K. Classical and Bayesian inference for the software imperfect debugging model. Technical Report 78-2, Department of IE&OR, Syracuse University, 1978.
[13] Rabiner LR, Juang BH. An introduction to hidden Markov models. IEEE ASSP Mag 1986;3:4–16.
[14] Kimura M, Yamada S. A software testing-progress evaluation model based on a digestion process of test cases. Int J Reliab Qual Saf Eng 1997;4:229–39.
[15] Shanthikumar JG. A general software reliability model for performance prediction. Microelectron Reliab 1981;21:671–82.
[16] Ross SM. Stochastic processes. New York: John Wiley & Sons; 1996.
[17] Karlin S, Taylor HM. A first course in stochastic processes. San Diego: Academic Press; 1975.
[18] Shibata K. Project planning and phase management of software product (in Japanese). Inf Process Soc Jpn 1980;21:1035–42.
[19] Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 1989;77:257–86.
[20] Jelinski Z, Moranda PB. Software reliability research. In: Freiberger W, editor. Statistical computer performance evaluation. New York: Academic Press; 1972. p.465–84.
[21] Goel AL, Okumoto K. Time-dependent error-detection rate model for software reliability and other performance measures. IEEE Trans Reliab 1979;R-28:206–11.
[22] Arnold L. Stochastic differential equations: theory and applications. New York: John Wiley & Sons; 1974.
[23] Protter P. Stochastic integration and differential equations: a new approach. Berlin Heidelberg New York: Springer-Verlag; 1995.
[24] Karlin S, Taylor HM. A second course in stochastic processes. San Diego: Academic Press; 1981.
[25] Goel AL, Okumoto K. A Markovian model for reliability and other performance measures of software system. In: Proceedings of the National Computer Conference, 1979; p.769–74.
[26] Yamada S, Kimura M, Tanaka H, Osaki S. Software reliability measurement and assessment with stochastic differential equations. IEICE Trans Fundam 1994;E77-A:109–16.
[27] Kimura M, Yamada S, Tanaka H. Software reliability assessment modeling based on nonlinear stochastic differential equations of Itô type. In: Proceedings of the ISSAT International Conference on Reliability and Quality in Design, 2000; p.160–3.
[28] Wong E, Zakai M. On the relation between ordinary and stochastic differential equations. Int J Eng Sci 1965;3:213–29.
[29] Wong E. Stochastic processes in information and systems. New York: McGraw-Hill; 1971.
[30] Yamada S, Ohtera H, Narihisa H. Software reliability growth models with testing-effort. IEEE Trans Reliab 1986;R-35:19–23.
[31] Kimura M, Yamada S, Osaki S. Software reliability assessment for an exponential-S-shaped reliability growth phenomenon. J Comput Math Appl 1992;24:71–8.
[32] Musa JD, Okumoto K. A logarithmic Poisson execution time model for software reliability measurement. In: Proceedings of the 7th International Conference on Software Engineering, 1984; p.230–8.
[33] Crow LH. On tracking reliability growth. In: Proceedings of the 1975 Annual Reliability and Maintainability Symposium; p.438–43.
[34] Akaike H. A new look at the statistical model identification. IEEE Trans Autom Control 1974;AC-19:716–23.
[35] Arnold K, Gosling J, Holmes D. The Java programming language. Boston: Addison-Wesley; 2000.


Chapter 16

Recent Studies in Software Reliability Engineering

Hoang Pham

16.1 Introduction
  16.1.1 Software Reliability Concepts
  16.1.2 Software Life Cycle
16.2 Software Reliability Modeling
  16.2.1 A Generalized Non-homogeneous Poisson Process Model
  16.2.2 Application 1: The Real-time Control System
16.3 Generalized Models with Environmental Factors
  16.3.1 Parameters Estimation
  16.3.2 Application 2: The Real-time Monitor Systems
16.4 Cost Modeling
  16.4.1 Generalized Risk–Cost Models
16.5 Recent Studies with Considerations of Random Field Environments
  16.5.1 A Reliability Model
  16.5.2 A Cost Model
16.6 Further Reading

16.1 Introduction

Computers are being used in diverse areas for various applications, e.g. air traffic control, nuclear reactors, aircraft, real-time military weapons, industrial process control, automotive mechanical and safety control, and hospital patient monitoring systems. New computer and communication technologies are obviously transforming our daily life. They are the basis of many of the changes in our telecommunications systems, and also of a new wave of automation on the farm, in manufacturing, in hospitals, in transportation, and in the office. Computer information products and services have become a major, and still rapidly growing, component of the global economy.

This chapter is organized as follows. Section 16.1 presents the basic concepts of software reliability. Section 16.2 presents some existing non-homogeneous Poisson process (NHPP) software reliability models and an application that illustrates the results. Section 16.3 presents generalized software reliability models with environmental factors. Section 16.4 discusses a risk–cost model. Finally, Section 16.5 presents recent studies on software reliability and cost models with consideration of random field environments. Future research directions and challenging issues in software reliability engineering are also discussed.

16.1.1 Software Reliability Concepts

Research activities in software reliability engineering have been conducted over the past 30 years, and many models have been developed for the estimation of software reliability [1, 2]. Software reliability is a measure of how closely user requirements are met by a software system in actual operation. Most existing models [2–16] for quantifying software reliability are based purely on observation of failures during the system test of the software product. Some companies are indeed putting these models to use in their software product development. For example, the Hewlett-Packard company has used an existing reliability model to estimate the failure intensity expected for firmware (software embedded in the hardware) in two terminals, known as the HP2393A and HP2394A, in order to determine when to release the firmware. The results of the reliability modeling enabled it to test the modules more efficiently, and so contributed to the terminals' success by reducing development cycle cost while maintaining high reliability. AT&T software developers also used a software reliability model to predict the quality of software system T; the predictions were consistent, and the observed results were within 10% of them. AT&T Bell Laboratories likewise predicted the requested rate of field maintenance for its 5ESS telephone switching system, and the prediction differed from the company's actual experience by only 5 to 13%. Such accuracies could help to make warranties of software performance practical [17].

Most existing models, however, require considerable amounts of failure data in order to obtain an accurate reliability prediction. There are two common types of failure data: time-domain data and interval-domain data. Some existing software reliability models can handle both types of data. Time-domain data are characterized by recording the individual times at which failures occurred. For example, in the real-time control system data of Lyu [18], a total of 136 faults were reported, and the times between failures (TBF) in seconds are listed in Table 16.1. Interval-domain data are characterized by counting the number of failures occurring during a fixed period. Time-domain data always provide better accuracy in the parameter estimates with current existing software reliability models, but involve more data collection effort than the interval-domain approach.

Table 16.1. The real-time control system data for the time domain approach (fault number, TBF, cumulative TBF)

 1    3     3  |  35  227  5324  |  69  529 15806  | 103  108 42296
 2   30    33  |  36   65  5389  |  70  379 16185  | 104    0 42296
 3  113   146  |  37  176  5565  |  71   44 16229  | 105 3110 45406
 4   81   227  |  38   58  5623  |  72  129 16358  | 106 1247 46653
 5  115   342  |  39  457  6080  |  73  810 17168  | 107  943 47596
 6    9   351  |  40  300  6380  |  74  290 17458  | 108  700 48296
 7    2   353  |  41   97  6477  |  75  300 17758  | 109  875 49171
 8   91   444  |  42  263  6740  |  76  529 18287  | 110  245 49416
 9  112   556  |  43  452  7192  |  77  281 18568  | 111  729 50145
10   15   571  |  44  255  7447  |  78  160 18728  | 112 1897 52042
11  138   709  |  45  197  7644  |  79  828 19556  | 113  447 52489
12   50   759  |  46  193  7837  |  80 1011 20567  | 114  386 52875
13   77   836  |  47    6  7843  |  81  445 21012  | 115  446 53321
14   24   860  |  48   79  7922  |  82  296 21308  | 116  122 53443
15  108   968  |  49  816  8738  |  83 1755 23063  | 117  990 54433
16   88  1056  |  50 1351 10089  |  84 1064 24127  | 118  948 55381
17  670  1726  |  51  148 10237  |  85 1783 25910  | 119 1082 56463
18  120  1846  |  52   21 10258  |  86  860 26770  | 120   22 56485
19   26  1872  |  53  233 10491  |  87  983 27753  | 121   75 56560
20  114  1986  |  54  134 10625  |  88  707 28460  | 122  482 57042
21  325  2311  |  55  357 10982  |  89   33 28493  | 123 5509 62551
22   55  2366  |  56  193 11175  |  90  868 29361  | 124  100 62651
23  242  2608  |  57  236 11411  |  91  724 30085  | 125   10 62661
24   68  2676  |  58   31 11442  |  92 2323 32408  | 126 1071 63732
25  422  3098  |  59  369 11811  |  93 2930 35338  | 127  371 64103
26  180  3278  |  60  748 12559  |  94 1461 36799  | 128  790 64893
27   10  3288  |  61    0 12559  |  95  843 37642  | 129 6150 71043
28 1146  4434  |  62  232 12791  |  96   12 37654  | 130 3321 74364
29  600  5034  |  63  330 13121  |  97  261 37915  | 131 1045 75409
30   15  5049  |  64  365 13486  |  98 1800 39715  | 132  648 76057
31   36  5085  |  65 1222 14708  |  99  865 40580  | 133 5485 81542
32    4  5089  |  66  543 15251  | 100 1435 42015  | 134 1160 82702
33    0  5089  |  67   10 15261  | 101   30 42045  | 135 1864 84566
34    8  5097  |  68   16 15277  | 102  143 42188  | 136 4116 88682

Information concerning the development of the software product, the method of failure detection, environmental factors, field environments, etc., however, is ignored in almost all existing models. In order to develop a useful software reliability model and to make sound judgments when using it, one needs an in-depth understanding of how software is produced: how errors are introduced, how software is tested, how errors occur, and the types of errors involved. Understanding these environmental factors can help in justifying the reasonableness of the assumptions, the usefulness of the model, and its applicability under given user environments [19]. In other words, such models would be valuable to software developers, users, and practitioners if they were capable of using information about the software development process, incorporating the environmental factors, and giving greater confidence in estimates based on small numbers of failure data.

Within the non-Bayes framework, the parameters of the assumed distribution are thought to be unknown but fixed, and are estimated from past inter-failure times using the maximum likelihood estimation (MLE) technique. A time interval for the next failure is then obtained from the fitted model. However, TBF models based on this technique tend to give results that are grossly optimistic [20] due to the use of MLE. Various Bayes predictive models [21, 22] have been proposed to overcome such problems, since if the prior accurately reflects the actual failure process, then model prediction of software reliability can be improved while the total testing time or sample size requirements are reduced. Pham and Pham [14] developed Bayesian software reliability models with a stochastically decreasing hazard rate. Within any given failure time interval, the hazard rate is a function of both the total testing time and the number of failures encountered. To improve the predictive performance of these particular models, Pham and Pham [16] recently presented a modified Bayesian model introducing pseudo-failures whenever there is a period when the failure-free execution estimate equals the interval-percentile of a predictive distribution.

In addition to those already defined, the following acronyms will be used throughout the chapter:

AIC    the Akaike information criterion [23]
EF     environmental factors
LSE    least squares estimate
MVF    mean value function
SRGM   software reliability growth model
SSE    sum of squared errors
SRE    software reliability engineering

The following notation is also used:

a(t)      time-dependent fault content function: the total number of faults in the software, including the initial and introduced faults
b(t)      time-dependent fault detection-rate function, per fault per unit time
λ(t)      failure intensity function: faults per unit time
m(t)      expected number of errors detected by time t (the MVF)
N(t)      random variable representing the cumulative number of errors detected by time t
Sj        actual time at which the jth error is detected
R(x/t)    software reliability function, i.e. the conditional probability that no failure occurs during (t, t + x), given that the last failure occurred at time t
ˆ         denotes an estimate obtained using the MLE method
yk        number of actual failures observed at time tk
m(tk)     estimated cumulative number of failures at time tk obtained from the fitted MVFs, k = 1, 2, . . . , n

16.1.2 Software Life Cycle

As software becomes an increasingly important part of many different types of systems that perform complex and critical functions in many applications, such as military defense and nuclear reactors, the risk and impact of software-caused failures have increased dramatically. There is now general agreement on the need to increase software reliability by eliminating errors made during software development.

Software is a collection of instructions or statements in a computer language; it is also called a computer program, or simply a program. A software program is designed to perform specified functions. Upon execution of a program, an input state is translated into an output state. An input state can be defined as a combination of input variables or a typical transaction to the program. When the actual output deviates from the expected output, a failure occurs. The definition of failure, however, differs from application to application and should be clearly defined in the specifications. For instance, a response time of 30 s could be a serious failure for an air traffic control system, but acceptable for an airline reservation system.

A software life cycle consists of five successive phases: analysis, design, coding, testing, and operation [2]. The analysis phase is the most important phase; it is the first step in the software development process and the foundation for building a successful software product. The purpose of the analysis phase is to define the requirements and provide specifications for the subsequent phases and activities. The design phase is concerned with how to build the system so that it behaves as described. There are two parts to design: system architecture design and detailed design. The system architecture design includes, for example, the system structure and the system architecture document. System structure design is the process of partitioning a software system into smaller parts.

Before decomposing the system, we need to carry out further specification analysis, that is, to examine the details of the performance requirements, security requirements, assumptions and constraints, and the needs for hardware and software.

The coding phase involves translating the design into code in a programming language. Coding can be decomposed into the following activities: identifying reusable modules, code editing, code inspection, and the final test plan. The final test plan should provide details of what needs to be tested, testing strategies and methods, testing schedules and all necessary resources, and should be ready by the coding phase. The testing phase is the verification and validation activity for the software product. Verification and validation are the two ways to check whether the design satisfies the user requirements: verification checks whether the product under construction meets the requirements definition, and validation checks whether the product's functions are what the customer wants. The objectives of the testing phase are to: (1) affirm the quality of the product by finding and eliminating faults in the program; (2) demonstrate the presence of all specified functionality in the product; and (3) estimate the operational reliability of the software. During this phase, system integration of the software components and system acceptance tests are performed against the requirements. The operation phase is the final phase in the software life cycle. It usually contains activities such as installation, training, support, and maintenance. It involves the transfer of responsibility for the maintenance of the software from the developer to the user: once the software product is installed, it becomes the user's responsibility to establish a program to control and manage the software.

16.2 Software Reliability Modeling

Many NHPP software reliability growth models [2, 4, 6–11, 14, 24] have been developed over the past 25 years to assess the reliability of software, and software reliability models based on the NHPP have been quite successful tools in practical software reliability engineering [2]. In this section we discuss only software reliability models based on the NHPP. These models treat the debugging process as a counting process characterized by its MVF; software reliability can be estimated once the MVF is determined. Model parameters are usually estimated using either the MLE or the LSE method.

16.2.1 A Generalized Non-homogeneous Poisson Process Model

Many existing NHPP models assume that the failure intensity is proportional to the residual fault content. A general class of NHPP SRGMs can be obtained by solving the following differential equation [7]:

    dm(t)/dt = b(t) [a(t) − m(t)]    (16.1)

The general solution of this differential equation is given by

    m(t) = e^{−B(t)} [ m0 + ∫_{t0}^{t} a(τ) b(τ) e^{B(τ)} dτ ]    (16.2)

where B(t) = ∫_{t0}^{t} b(τ) dτ and m(t0) = m0 is the marginal condition of Equation 16.1, with t0 representing the starting time of the debugging process. The reliability function based on the NHPP is given by:

    R(x/t) = e^{−[m(t+x) − m(t)]}    (16.3)

Many existing NHPP models can be considered as special cases of Equation 16.2. An increasing function a(t) implies an increasing total number of faults (note that this includes those already detected and removed and those inserted during the debugging process) and reflects imperfect debugging. An increasing b(t) implies an increasing fault detection rate, which could be attributed to a learning-curve phenomenon, to software process fluctuations, or to a combination of both. Different a(t) and b(t) functions reflect different assumptions about the software testing process. A summary of the main existing NHPP models is presented in Table 16.2.

Table 16.2. Summary of the MVFs [8]

Goel–Okumoto (G–O), concave:
  m(t) = a(1 − e^{−bt});  a(t) = a, b(t) = b.
  Also called the exponential model.

Delayed S-shaped, S-shaped:
  m(t) = a[1 − (1 + bt) e^{−bt}].
  Modification of the G–O model to make it S-shaped.

Inflection S-shaped SRGM, S-shaped:
  m(t) = a(1 − e^{−bt}) / (1 + β e^{−bt});  a(t) = a, b(t) = b/(1 + β e^{−bt}).
  Solves a technical condition with the G–O model; becomes the G–O model if β = 0.

Yamada exponential, S-shaped:
  m(t) = a{1 − e^{−rα[1 − exp(−βt)]}};  a(t) = a, b(t) = rαβ e^{−βt}.
  Attempts to account for testing effort.

Yamada Rayleigh, S-shaped:
  m(t) = a{1 − e^{−rα[1 − exp(−βt²/2)]}};  a(t) = a, b(t) = rαβt e^{−βt²/2}.
  Attempts to account for testing effort.

Yamada exponential imperfect debugging model (Y-ExpI), concave:
  m(t) = [ab/(α + b)](e^{αt} − e^{−bt});  a(t) = a e^{αt}, b(t) = b.
  Assumes an exponential fault content function and a constant fault detection rate.

Yamada linear imperfect debugging model (Y-LinI), concave:
  m(t) = a(1 − e^{−bt})(1 − α/b) + αat;  a(t) = a(1 + αt), b(t) = b.
  Assumes a constant introduction rate α and a constant fault detection rate.

Pham–Nordmann–Zhang (P–N–Z) model, S-shaped and concave:
  m(t) = [a(1 − e^{−bt})(1 − α/b) + αat] / (1 + β e^{−bt});  a(t) = a(1 + αt), b(t) = b/(1 + β e^{−bt}).
  Assumes that the introduction rate is a linear function of testing time and that the fault detection rate function is non-decreasing with an inflexion S-shaped model.

Pham–Zhang (P–Z) model, S-shaped and concave:
  m(t) = [1/(1 + β e^{−bt})][(c + a)(1 − e^{−bt}) − ab/(b − α)(e^{−αt} − e^{−bt})];  a(t) = c + a(1 − e^{−αt}), b(t) = b/(1 + β e^{−bt}).
  Assumes that the introduction rate is an exponential function of testing time and that the fault detection rate is non-decreasing with an inflexion S-shaped model.
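As a computational check on this framework, the short sketch below (in Python; the function names are ours, and the parameter values are merely of the order fitted later in Table 16.3) integrates Equation 16.2 numerically for the constant a(t) = a, b(t) = b case and compares the result with the closed-form G–O entry of Table 16.2; Equation 16.3 then gives the conditional reliability.

    # Numerical sketch of Equations 16.1-16.3 for constant a(t) = a, b(t) = b.
    # Illustrative values only; nothing here is prescribed by the chapter.
    import math
    from scipy.integrate import quad

    def m(t, a, b, m0=0.0, t0=0.0):
        """Mean value function from Equation 16.2, with B(t) = b*(t - t0)."""
        B = lambda u: b * (u - t0)
        integral, _ = quad(lambda tau: a * b * math.exp(B(tau)), t0, t)
        return math.exp(-B(t)) * (m0 + integral)

    def reliability(x, t, a, b):
        """R(x/t) = exp(-[m(t + x) - m(t)]) from Equation 16.3."""
        return math.exp(-(m(t + x, a, b) - m(t, a, b)))

    a, b = 125.0, 0.00006
    print(m(1000.0, a, b))                    # numerical solution of (16.2)
    print(a * (1 - math.exp(-b * 1000.0)))    # closed-form G-O MVF from Table 16.2
    print(reliability(100.0, 1000.0, a, b))

The two printed MVF values agree, confirming that the G–O model is the constant-parameter special case of Equation 16.2.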

16.2.2 Application 1: The Real-time Control System

Let us perform the reliability analysis using the real-time control system data given in Table 16.1. The first 122 data points are used for the goodness-of-fit test and the remaining 14 data points are used for the predictive power test. The results for fit and prediction are listed in Table 16.3.

Table 16.3. Parameter estimation and model comparison

Model name          SSE (fit)   SSE (predict)  AIC     MLEs
G–O model           7615.1      704.82         426.05  a = 125, b = 0.00006
Delayed S-shaped    51729.23    257.67         546     a = 140, b = 0.00007
Inflexion S-shaped  15878.6     203.23         436.8   a = 135.5, b = 0.00007, β = 1.2
Yamada exponential  6571.55     332.99         421.18  a = 130, α = 10.5, β = 5.4 × 10^{−6}
Yamada Rayleigh     51759.23    258.45         548     a = 130, α = 5 × 10^{−10}, β = 6.035
Y-ExpI model        5719.2      327.99         450     a = 120, b = 0.00006, α = 1 × 10^{−5}
Y-LinI model        6819.83     482.7          416     a = 120.3, b = 0.00005, α = 3 × 10^{−5}
P–N–Z model         5755.93     106.81         415     a = 121, b = 0.00005, α = 2.5 × 10^{−6}, β = 0.002
P–Z model           14233.88    85.36          416     a = 20, b = 0.00007, α = 1.0 × 10^{−5}, β = 1.922, c = 125
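The comparison criteria of Table 16.3 are mechanical to compute once an MVF has been fitted. The sketch below is a simplified illustration (the helper names are ours; the time-domain NHPP log-likelihood is taken in the standard form Σ ln λ(s_j) − m(s_n); only the first ten faults of Table 16.1 are used), showing how the SSE and the AIC [23] are obtained.

    # Sketch of the SSE and AIC criteria used in Table 16.3.
    import math

    def sse(y, m_hat):
        """Sum of squared errors between observed and fitted cumulative counts."""
        return sum((mk - yk) ** 2 for yk, mk in zip(y, m_hat))

    def aic(log_likelihood, n_params):
        """Akaike information criterion [23]: AIC = -2 ln L + 2k."""
        return -2.0 * log_likelihood + 2.0 * n_params

    def nhpp_log_lik(times, m):
        """Time-domain NHPP log-likelihood: sum of ln lambda(s_j), minus m(s_n)."""
        eps = 1e-3
        lam = lambda t: (m(t + eps) - m(t - eps)) / (2 * eps)  # numerical intensity
        return sum(math.log(lam(s)) for s in times) - m(times[-1])

    m_go = lambda t, a=125.0, b=0.00006: a * (1 - math.exp(-b * t))
    times = [3, 33, 146, 227, 342, 351, 353, 444, 556, 571]  # faults 1-10, Table 16.1
    print("SSE:", sse(range(1, 11), [m_go(s) for s in times]))
    print("AIC:", aic(nhpp_log_lik(times, m_go), n_params=2))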

Although software reliability models based on the NHPP have been quite successful tools in practical software reliability engineering [2], there remains a need to fully validate them against other applications, such as communications, manufacturing, medical monitoring, and defense systems.

16.3 Generalized Models with Environmental Factors

We adopt the following notation:

z          vector of environmental factors
β          coefficient vector of the environmental factors
Φ(βz)      function of the environmental factors
λ0(t)      failure intensity rate function without environmental factors
λ(t, z)    failure intensity rate function with environmental factors
m0(t)      MVF without environmental factors
m(t, z)    MVF with environmental factors
R(x/t, z)  reliability function with environmental factors

The proportional hazard model (PHM), first proposed by Cox [25], has been successfully used to incorporate environmental factors in survival data analysis, in the medical field, and in the hardware system reliability area. The basic assumption of the PHM is that the hazard rates of any two items associated with different settings of the environmental factors, say z1 and z2, are proportional to each other. The environmental factors are also known as covariates in the PHM. When the PHM is applied to the NHPP it becomes the proportional intensity model (PIM). A general fault intensity rate function incorporating the environmental factors based on the PIM can be constructed using the following assumptions.

(a) The new fault intensity rate function consists of two components: the fault intensity rate function without environmental factors, λ0(t), and the environmental factor function Φ(βz).

(b) The fault intensity rate function λ0(t) and the function of the environmental factors are independent. The function λ0(t) is also called the baseline intensity function.

Assume that the fault intensity function λ(t, z) has the following form:

    λ(t, z) = λ0(t) · Φ(βz)

Typically, Φ(βz) takes an exponential form, such as

    Φ(βz) = exp(β0 + β1 z1 + β2 z2 + · · ·)

The MVF with environmental factors can then be easily obtained:

    m(t, z) = ∫_0^t λ0(s) Φ(βz) ds = Φ(βz) ∫_0^t λ0(s) ds = Φ(βz) m0(t)

Therefore, the reliability function with environmental factors is given by [9]:

    R(x/t, z) = e^{−[m(t+x, z) − m(t, z)]}
              = e^{−Φ(βz)[m0(t+x) − m0(t)]}
              = (exp{−[m0(t+x) − m0(t)]})^{Φ(βz)}
              = [R0(x/t)]^{Φ(βz)}
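A minimal sketch of the PIM calculation (our own function names, an illustrative G–O baseline, and made-up coefficients) shows how the covariates enter only through the exponent Φ(βz):

    # PIM sketch: m(t, z) = Phi * m0(t) and R(x/t, z) = R0(x/t) ** Phi.
    import math

    def phi(beta, z):
        """Exponential link Phi(beta z) = exp(beta0 + beta1*z1 + ...)."""
        return math.exp(beta[0] + sum(bi * zi for bi, zi in zip(beta[1:], z)))

    def m0(t, a=125.0, b=0.00006):
        """Illustrative G-O baseline MVF (not fitted to any data here)."""
        return a * (1 - math.exp(-b * t))

    def reliability_env(x, t, beta, z):
        r0 = math.exp(-(m0(t + x) - m0(t)))  # baseline reliability R0(x/t)
        return r0 ** phi(beta, z)

    print(reliability_env(100.0, 1000.0, beta=[0.0, 0.5], z=[1.0]))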

16.3.1 Parameters Estimation

The MLE method is widely used to estimate unknown model parameters and will be used to estimate the model parameters in Section 16.3. Since environmental factors are considered here, the parameters that need to be estimated include not only those in the baseline intensity rate function λ0(t), but also the coefficients βi in the link function of the environmental factors introduced. For example, if there are m unknown parameters in the function λ0(t) and k environmental factors β1, β2, . . . , βk are introduced into the model, then there are (m + k) unknown parameters to estimate. The maximum likelihood function for this model can be expressed as follows:

    L(θ, β, t, z) = P{N(t_{i,j}, z_j) = y_{i,j}, i = 1, 2, . . . , k_j; j = 1, 2, . . . , n}

                  = ∏_{j=1}^{n} ∏_{i=1}^{k_j} { [m(t_{i,j}, z_j) − m(t_{i−1,j}, z_j)]^{(y_{i,j} − y_{i−1,j})} / (y_{i,j} − y_{i−1,j})! } e^{−[m(t_{i,j}, z_j) − m(t_{i−1,j}, z_j)]}

where n is the total number of failure data groups, k_j is the number of faults in group j (j = 1, 2, . . . , n), z_j is the vector of environmental factors in data group j, and m(t_{i,j}, z_j) is the mean value function incorporating the environmental factors.

The logarithm of the likelihood function is given by

    ln L(θ, β, t, z) = Σ_{j=1}^{n} Σ_{i=1}^{k_j} { (y_{i,j} − y_{i−1,j}) ln[m(t_{i,j}, z_j) − m(t_{i−1,j}, z_j)] − ln[(y_{i,j} − y_{i−1,j})!] − [m(t_{i,j}, z_j) − m(t_{i−1,j}, z_j)] }

A system of equations can be constructed by taking the derivative of the log-likelihood function with respect to each parameter and setting it equal to zero. The estimates of the unknown parameters are then obtained by solving these equations.

A widely used technique, known as the partial likelihood method, can be used to facilitate the parameter estimation process. The partial likelihood method estimates the coefficients of the covariates, the βi values, separately from the parameters in the baseline intensity function. The likelihood function for the partial likelihood method is given by [25]:

    L(β) = ∏_i exp(βz_i) / [ Σ_{m ∈ R} exp(βz_m) ]^{d_i}

where R denotes the risk set at the ith failure time and d_i represents the tied failure times.

16.3.2 Application 2: The Real-time Monitor Systems

In this section we illustrate the software reliability model with environmental factors based on the PIM method, using the software failure data collected from real-time monitor systems [26]. The software consists of about 200 modules, each module having, on average, 1000 lines of a high-level language like FORTRAN. A total of 481 software faults were detected during the 111-day testing period. Both the information on the testing team size and the software failure data were recorded.

The only environmental factor available in this application is the testing team size. Team size is one of the most useful measures in the software development process; it is closely related to the testing effort, the testing efficiency, and development management issues. From the correlation analysis of the 32 environmental factors [27], team size is the only factor correlated with program complexity, which is the number-one significant factor according to our environmental factor study. Intuitively, the more complex the software, the larger the team that is required. Since the testing team size ranges from one to eight, we first categorize the factor of team size into two levels, letting z1 denote the factor of team size as follows:

    z1 = 0   if the team size ranges from 1 to 4
    z1 = 1   if the team size ranges from 5 to 8

After carefully examining the failure data, we find that after day 61 the software becomes stable and failures occur with a much lower frequency. Therefore, we use the first 61 data points for testing the goodness of fit and estimating the parameters, and use the remaining 50 data points (from day 62 to day 111) as real data for examining the predictive power of the software reliability models.

In this application, we use the P–Z model listed in Table 16.2 as the baseline mean value function, i.e.

    m0(t) = [1/(1 + β e^{−bt})] [ (c + a)(1 − e^{−bt}) − ab/(b − α)(e^{−αt} − e^{−bt}) ]

The corresponding baseline intensity function is the derivative λ0(t) = dm0(t)/dt, namely

    λ0(t) = [1/(1 + β e^{−bt})] [ (c + a) b e^{−bt} − ab/(b − α)(b e^{−bt} − α e^{−αt}) ]
            + [β b e^{−bt}/(1 + β e^{−bt})²] [ (c + a)(1 − e^{−bt}) − ab/(b − α)(e^{−αt} − e^{−bt}) ]

Therefore, the intensity function with the environmental factor is given by [28]:

    λ(t, z1) = λ0(t) e^{β1 z1}

The estimate of β1 obtained using the partial likelihood method is β1 = 0.0246, which indicates that this factor is significant and should be considered. The estimates of the parameters in the baseline intensity function are: a = 40.0, b = 0.09, β = 8.0, α = 0.015, c = 450.
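Substituting these estimates into the P–Z baseline of Table 16.2 and the link e^{β1 z1} gives the fitted MVF for the two team-size levels. The short sketch below (our own code; the numerical values are those reported above) evaluates the fitted model at day 61, the end of the fitting window.

    # Fitted environmental-factor MVF, using the estimates reported above.
    import math

    a, b, beta, alpha, c, beta1 = 40.0, 0.09, 8.0, 0.015, 450.0, 0.0246

    def m0(t):
        """P-Z baseline MVF from Table 16.2."""
        return (1.0 / (1.0 + beta * math.exp(-b * t))) * (
            (c + a) * (1 - math.exp(-b * t))
            - a * b / (b - alpha) * (math.exp(-alpha * t) - math.exp(-b * t)))

    def m_env(t, z1):
        """PIM: baseline MVF scaled by the covariate link exp(beta1*z1)."""
        return m0(t) * math.exp(beta1 * z1)

    for z1 in (0, 1):                          # small vs. large testing team
        print(z1, round(m_env(61.0, z1), 1))   # expected faults by day 61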

The results for several existing NHPP models are also listed in Table 16.4.

The results show that incorporating the factor of team size into the P–Z model explains the fault detection process better and thus enhances the predictive power of the model. Further research is needed to incorporate application complexity, test effectiveness, test suite diversity, test coverage, code reuse, and real application operational environments into the NHPP software reliability models and into software reliability modeling in general. Future software reliability models must account for these important factors.

Table 16.4. Model evaluation

G–O model:
  m(t) = a(1 − e^{−bt});  a(t) = a, b(t) = b.
  SSE (prediction) = 1052528;  AIC = 978.14

Delayed S-shaped:
  m(t) = a[1 − (1 + bt) e^{−bt}];  a(t) = a, b(t) = b²t/(1 + bt).
  SSE (prediction) = 83929.3;  AIC = 983.90

Inflexion S-shaped:
  m(t) = a(1 − e^{−bt})/(1 + β e^{−bt});  a(t) = a, b(t) = b/(1 + β e^{−bt}).
  SSE (prediction) = 1051714.7;  AIC = 980.14

Yamada exponential:
  m(t) = a{1 − e^{−rα[1 − exp(−βt)]}};  a(t) = a, b(t) = rαβ e^{−βt}.
  SSE (prediction) = 1085650.8;  AIC = 979.88

Yamada Rayleigh:
  m(t) = a{1 − e^{−rα[1 − exp(−βt²/2)]}};  a(t) = a, b(t) = rαβt e^{−βt²/2}.
  SSE (prediction) = 86472.3;  AIC = 967.92

Y-ExpI model:
  m(t) = [ab/(α + b)](e^{αt} − e^{−bt});  a(t) = a e^{αt}, b(t) = b.
  SSE (prediction) = 791941;  AIC = 981.44

Y-LinI model:
  m(t) = a(1 − e^{−bt})(1 − α/b) + αat;  a(t) = a(1 + αt), b(t) = b.
  SSE (prediction) = 238324;  AIC = 984.62

P–N–Z model:
  m(t) = [a/(1 + β e^{−bt})][(1 − e^{−bt})(1 − α/b) + αt];  a(t) = a(1 + αt), b(t) = b/(1 + β e^{−bt}).
  SSE (prediction) = 94112.2;  AIC = 965.37

P–Z model:
  m(t) = [1/(1 + β e^{−bt})][(c + a)(1 − e^{−bt}) − ab/(b − α)(e^{−αt} − e^{−bt})];  a(t) = c + a(1 − e^{−αt}), b(t) = b/(1 + β e^{−bt}).
  SSE (prediction) = 86180.8;  AIC = 960.68

Environmental factor model:
  m(t) = [1/(1 + β e^{−bt})][(c + a)(1 − e^{−bt}) − ab/(b − α)(e^{−αt} − e^{−bt})] e^{β1 z1};  a(t) = c + a(1 − e^{−αt}), b(t) = b/(1 + β e^{−bt}).
  SSE (prediction) = 560.82;  AIC = 890.68


16.4 Cost Modeling

The quality of a software system usually depends on how much time testing takes and what testing methodologies are used. On the one hand, the longer the testing time, the more errors can be removed, which leads to more reliable software; however, the testing cost of the software will also increase. On the other hand, if the testing time is too short, the cost of the software could be reduced, but the customers take a higher risk of buying unreliable software [29–31]. This will also increase the cost during the operational phase, since it is much more expensive to fix an error during the operational phase than during the testing phase. Therefore, it is important to determine when to stop testing and release the software. In this section, we present a recent generalized cost model as well as some other existing cost models.

16.4.1 Generalized Risk–Cost Models

In order to improve the reliability of software products, testing serves as the main tool for removing faults. However, efforts to increase reliability require an exponential increase in cost, especially after a certain level of software refinement has been reached. Therefore, it is important to determine when to stop testing based on reliability and cost assessment. Several software cost models and optimal release policies have been studied in the past two decades. Okumoto and Goel [32] discussed a simple cost model addressing a linear development cost during the testing and operational periods. Ohtera and Yamada [33] also discussed the optimum software-release time problem with a fault-detection phenomenon during operation; they introduced two evaluation criteria for the problem, software reliability and mean time between failures. Leung [34] discussed the optimal software release time subject to a given cost budget. Dalal and McIntosh [35] studied the stop-testing problem for large software systems with changing code using graphical methods, and reported the details of a real-time trial of a large software system that had a substantial amount of code added during testing. Yang and Chao [36] proposed two criteria for deciding when to stop testing: software products are released to the market when (1) the reliability has reached a given threshold, and (2) the gain in reliability cannot justify the testing cost.

Pham [6] developed a cost model with imperfect debugging, a random life cycle, and a penalty cost to determine the optimal release policies for a software system. Hou et al. [37] discussed optimal release times for software systems with scheduled delivery times based on the hyper-geometric software reliability growth model; their cost model includes the penalty cost incurred by the manufacturer for a delay in software release. Recently, Pham and Zhang [38] developed the expected total net gain in reliability of the software development process (the economic net gain in software reliability in excess of the expected total cost of the software development), which is used to determine the optimal software release time that maximizes the expected total gain of the software system.

Pham and Zhang [29] also recently developed a generalized cost model addressing, for the first time, the fault removal cost, the warranty cost, and the software risk cost due to software failures. The following cost model calculates the expected total cost:

    E(T) = C0 + C1 T^α + C2 m(T) µy + C3 µw [m(T + Tw) − m(T)] + CR [1 − R(x | T)]

where C0 is the set-up cost for software testing, C1 is the software test cost per unit time, C2 is the cost of removing each fault per unit time during testing, C3 is the cost of removing a fault detected during the warranty period, CR is the loss due to software failure, E(T) is the expected total cost of a software system at time T, µy is the expected time to remove a fault during the testing period, and µw is the expected time to remove a fault during the warranty period.
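This expected-cost expression is easy to evaluate once an MVF has been fitted. The sketch below assumes a G–O MVF and purely illustrative cost coefficients; none of these numbers comes from the chapter.

    # Sketch of the generalized expected total cost E(T); illustrative values.
    import math

    def m(t, a=125.0, b=0.00006):
        """Illustrative G-O mean value function."""
        return a * (1 - math.exp(-b * t))

    def reliability(x, t):
        return math.exp(-(m(t + x) - m(t)))

    def expected_total_cost(T, C0=100.0, C1=50.0, C2=25.0, C3=1000.0,
                            CR=5000.0, alpha=0.95, mu_y=0.1, mu_w=0.2,
                            Tw=5000.0, x=100.0):
        return (C0 + C1 * T ** alpha                    # set-up and testing cost
                + C2 * m(T) * mu_y                      # fault removal, testing
                + C3 * mu_w * (m(T + Tw) - m(T))        # fault removal, warranty
                + CR * (1 - reliability(x, T)))         # risk cost due to failure

    for T in (1000.0, 5000.0, 20000.0):
        print(T, round(expected_total_cost(T), 1))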


The details of how to obtain the optimal software release policies that minimize the expected total cost can be found in Pham and Zhang [29]. The benefits of using the above cost models are that they provide:

1. an assurance that the software has achieved safety goals;

2. a means of rationalizing when to stop testing the software.

In addition, with this type of information, a software manager can determine whether more testing is warranted or whether the software is sufficiently tested to allow its release or unrestricted use [29].

16.5 Recent Studies with Considerations of Random Field Environments

In this section, a generalized software reliability study under a random field environment is discussed, considering not only the time to remove faults during in-house testing, the cost of removing faults during beta testing, and the risk cost due to software failure, but also the benefits from reliable executions of the software during beta testing and final operation. Once a software product is released, it can be used in many different locations, applications, tasks, and industries. The field environments for the software product thus differ from place to place, and the random effects of the end-user environment affect the software reliability in ways that cannot be predicted in advance.

We adopt the following notation:

R(x | T)    software reliability function, defined as the probability that a software failure does not occur in the time interval [t, t + x], where t ≥ 0 and x > 0
G(η)        cumulative distribution function of the random environmental factor
γ           shape parameter of the field environmental factor (gamma density variable)
θ           scale parameter of the field environmental factor (gamma density variable)
N(T)        counting process, which counts the number of software failures discovered by time T
m(T)        expected number of software failures by time T, m(T) = E[N(T)]
m1(T)       expected number of software failures during in-house testing by time T
m2(T)       expected number of software failures in beta testing and final field operation by time T
mF(t | η)   expected number of software failures in the field by time t
C0          set-up cost for software testing
C1          cost of software in-house testing per unit time
C2          cost of removing a fault per unit time during in-house testing
C3          cost of removing a fault per unit time during beta testing
C4          penalty cost due to software failure
C5          benefit if the software does not fail during beta testing
C6          benefit if the software does not fail in field operation
µy          expected time to remove a fault during the in-house testing phase
µw          expected time to remove a fault during the beta testing phase
a           number of initial software faults at the beginning of testing
aF          number of initial software faults at the beginning of field operation
t0          time to stop testing and release the software for field operation
b           fault detection rate per fault
Tw          time length of the beta testing
x           length of time that the software is going to be used
p           probability that a fault is successfully removed from the software
β           probability that a fault is introduced into the software during debugging, with β ≪ p


16.5.1 A Reliability Model

Teng and Pham [39] recently proposed an NHPP software reliability model with consideration of random field environments. This is a unified software reliability model that covers both the testing and operation phases of the software life cycle. The model also allows one to remove faults if a software failure occurs in the field, and can be used to describe the common practice of so-called "beta testing" in the software industry. During beta testing, software faults are still removed from the software after failures occur, but beta testing is conducted in an environment that is the same as (or close to) the end-user environment, which is commonly quite different from the in-house testing environment.

In contrast to most existing NHPP models, the model assumes that the field environment affects the unit failure detection rate b only by multiplying it by a random factor η. The mean value function mF(t | η) in the field can be obtained as [39]:

    mF(t | η) = [aF/(p − β)] {1 − exp[−ηb(p − β)t]}    (16.4)

where η is the random field environmental factor. The random factor η captures the random field environment effects on software reliability. If η is modeled as a gamma-distributed variable with probability density function G(γ, θ), the MVF of this NHPP model becomes [39]:

    m(T) = [a/(p − β)] [1 − e^{−b(p−β)T}]                                    for T ≤ t0
    m(T) = [a/(p − β)] {1 − e^{−b(p−β)t0} [θ/(θ + b(p − β)(T − t0))]^γ}      for T ≥ t0    (16.5)

Generally, software reliability prediction is of interest after the software has been released for field operations, i.e. for T ≥ t0. The reliability of the software in the field is then

    R(x | T) = exp( [a/(p − β)] e^{−b(p−β)t0} { [θ/(θ + b(p − β)(T + x − t0))]^γ − [θ/(θ + b(p − β)(T − t0))]^γ } )

We now need to determine when to stop testing and release the software. In other words, we only need the software reliability immediately after the software is released. In this case T = t0, and therefore

    R(x | T) = exp( −[a/(p − β)] e^{−b(p−β)T} { 1 − [θ/(θ + b(p − β)x)]^γ } )
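The following sketch evaluates the MVF of Equation 16.5 and the release-time reliability given above (our own function names; illustrative parameter values; the a/(p − β) factor is carried over from Equation 16.5).

    # Sketch of the random-field-environment MVF and release-time reliability.
    import math

    def m_rfe(T, a, b, p, beta, t0, gamma, theta):
        """Equation 16.5: testing phase for T <= t0, field phase for T >= t0."""
        k = a / (p - beta)
        if T <= t0:
            return k * (1 - math.exp(-b * (p - beta) * T))
        shrink = (theta / (theta + b * (p - beta) * (T - t0))) ** gamma
        return k * (1 - math.exp(-b * (p - beta) * t0) * shrink)

    def rel_at_release(x, T, a, b, p, beta, gamma, theta):
        """R(x | T) immediately after release, i.e. with T = t0."""
        k = a / (p - beta)
        frac = (theta / (theta + b * (p - beta) * x)) ** gamma
        return math.exp(-k * math.exp(-b * (p - beta) * T) * (1 - frac))

    args = dict(a=20.0, b=0.0005, p=0.95, beta=0.02, gamma=2.0, theta=0.2)
    print(m_rfe(10000.0, t0=8000.0, **args))     # expected failures by T = 10000
    print(rel_at_release(10.0, 8000.0, **args))  # reliability just after release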

16.5.2 A Cost Model

The quality of the software depends on how much time testing takes and what testing methodologies are used. The longer the testing time, the more faults can be removed, which leads to more reliable software; however, the testing cost will also increase. If the testing time is too short, the development cost is reduced, but the risk cost in the operational phase will increase. Therefore, it is important to determine when to stop testing and release the software [29].

Figure 16.1 shows the entire software development life cycle considered in this software cost model: in-house testing, beta testing, and operation. Beta testing and operation are conducted in the field environment, which is commonly quite different from the environment in which the in-house testing is conducted.

Figure 16.1. Software gain model

Let us consider the following.

1. There is a set-up cost at the beginning of the in-house testing, and we assume it is a constant.

2. The cost of testing is a linear function of the in-house testing time.

3. The cost of removing faults during the in-house testing period is proportional to the total time spent removing all faults detected during this period.

4. The cost of removing faults during the beta testing period is proportional to the total time spent removing all faults detected in the time interval [T, T + Tw].

5. It takes time to remove faults, and it is assumed that the time to remove each fault follows a truncated exponential distribution.

6. There is a penalty cost due to software failure after the software is formally released, i.e. after the beta testing.

7. Software companies receive economic profits from reliable executions of their software during beta testing.

8. Software companies receive economic profits from reliable executions of their software during final operation.

From point 5, the expected time to remove each fault during in-house testing is given by [29]:

    µy = [1 − (λy T0 + 1) e^{−λy T0}] / [λy (1 − e^{−λy T0})]

where λy is a constant parameter associated with a truncated exponential density function for the time to remove a fault during in-house testing, and T0 is the maximum time to remove any fault during in-house testing. Similarly, the expected time to remove each fault during beta testing is given by:

    µw = [1 − (λw T1 + 1) e^{−λw T1}] / [λw (1 − e^{−λw T1})]

where λw is a constant parameter associated with a truncated exponential density function for the time to remove a fault during beta testing, and T1 is the maximum time to remove any fault during beta testing.
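Both expressions are simply the mean of an exponential removal time truncated at the maximum allowed repair time; a short evaluation (with illustrative λ and truncation values) reads:

    # Mean removal time for a truncated exponential distribution.
    import math

    def truncated_exp_mean(lam, T_max):
        """E[removal time] on [0, T_max]; the formula for mu_y and mu_w."""
        return (1 - (lam * T_max + 1) * math.exp(-lam * T_max)) / \
               (lam * (1 - math.exp(-lam * T_max)))

    print(truncated_exp_mean(lam=2.0, T_max=1.0))  # e.g. mu_y, in-house testing
    print(truncated_exp_mean(lam=1.0, T_max=2.0))  # e.g. mu_w, beta testing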

The expected net gain of the software development process, E(T), is defined as the economic net gain in software reliability in excess of the expected total cost of the software development [38]. That is,

    E(T) = Expected Gain in Reliability − (Total Development Cost + Risk Cost)

Based on the above assumptions, the expected software system cost consists of the following.

1. The total development cost EC(T), which includes:

(a) A constant set-up cost C0.

(b) The cost of in-house testing, E1(T). We assume it is a linear function of the in-house testing time T:

    E1(T) = C1 T

In some previous literature the testing cost has been assumed to be proportional to the testing time; we keep that assumption here.

(c) The fault removal cost during the in-house testing period, E2(T). Letting Yi denote the time to remove the ith fault, the expected total time to remove all faults detected during the in-house testing period is

    E[ Σ_{i=1}^{N(T)} Yi ] = E[N(T)] E[Yi] = m(T) µy

Hence, the expected cost to remove all faults detected by time T can be expressed as

    E2(T) = C2 E[ Σ_{i=1}^{N(T)} Yi ] = C2 m(T) µy

(d) The fault removal cost during the beta testing period, E3(T). Letting Wi denote the time to remove the ith fault during beta testing, the expected total time to remove all faults detected during the beta testing period [T, T + Tw] is

    E[ Σ_{i=N(T)}^{N(T+Tw)} Wi ] = E[N(T + Tw) − N(T)] E[Wi] = [m(T + Tw) − m(T)] µw

Hence, the expected cost to remove all faults detected during [T, T + Tw] is

    E3(T) = C3 E[ Σ_{i=N(T)}^{N(T+Tw)} Wi ] = C3 µw [m(T + Tw) − m(T)]

Therefore, the total software development cost is

    EC(T) = C0 + E1(T) + E2(T) + E3(T)

2. The penalty cost due to software failures after releasing the software:

    E4(T) = C4 [1 − R(x | T + Tw)]

3. The expected gain Ep(T), which includes:

(a) the benefit gained during beta testing due to reliable execution of the software,

    E5(T) = C5 R(Tw | T)

(b) the benefit gained during field operation due to reliable execution of the software,

    E6(T) = C6 R(x | T + Tw)

Therefore, the expected net gain of the software development process E(T) can be expressed as:

    E(T) = Ep(T) − [EC(T) + E4(T)]
         = [E5(T) + E6(T)] − [C0 + E1(T) + E2(T) + E3(T) + E4(T)]
         = C5 R(Tw | T) + C6 R(x | T + Tw) − C0 − C1 T − C2 µy m(T)
           − C3 µw [m(T + Tw) − m(T)] − C4 [1 − R(x | T + Tw)]

or, equivalently, as

    E(T) = C5 R(Tw | T) + (C4 + C6) R(x | T + Tw) − (C0 + C4) − C1 T − C2 m1(T) µy − C3 µw [m2(T + Tw) − m2(T)]

where

    m1(T) = [a/(p − β)] [1 − e^{−b(p−β)T}]
    m2(T) = [a/(p − β)] {1 − e^{−b(p−β)t0} [θ/(θ + b(p − β)(T − t0))]^γ}

The optimal software release time T*, which maximizes the expected net gain of the software development process, can be obtained as follows.

Theorem 1. Given C0, C1, C2, C3, C4, C5, C6, x, µy, µw, and Tw, the optimal value of T, say T*, that maximizes the expected net gain of the software development process E(T) is as follows.

1. If u(0) ≤ C, then:
(a) if y(0) ≤ 0, then T* = 0;
(b) if y(∞) > 0, then T* = ∞;
(c) if y(0) > 0, y(T) ≥ 0 for T ∈ (0, T′] and y(T) < 0 for T ∈ (T′, ∞), then T* = T′, where T′ = y^{−1}(0).

2. If u(∞) > C, then:
(a) if y(0) ≥ 0, then T* = ∞;
(b) if y(∞) < 0, then T* = 0;
(c) if y(0) < 0, y(T) ≤ 0 for T ∈ (0, T″] and y(T) > 0 for T ∈ (T″, ∞), then T* = ∞ if E(0) < E(∞) and T* = 0 if E(0) ≥ E(∞), where T″ = y^{−1}(0).

3. If u(0) > C, u(T) ≥ C for T ∈ (0, T⁰] and u(T) < C for T ∈ (T⁰, ∞), where T⁰ = u^{−1}(C), then:
(a) if y(0) < 0, then: if y(T⁰) ≤ 0, then T* = 0; if y(T⁰) > 0, then T* = 0 if E(0) ≥ E(Tb) and T* = Tb if E(0) < E(Tb), where Tb = y^{−1}(0) and Tb ≥ T⁰;
(b) if y(0) ≥ 0, then T* = Tc, where Tc = y^{−1}(0).

Here the functions y(T) and u(T) and the constant C are given by

    y(T) = −C1 − C2 µy ab e^{−b(p−β)T}
           + C3 µw ab e^{−b(p−β)T} { 1 − [θ/(θ + b(p − β)Tw)]^γ }
           + C5 ab e^{−b(p−β)T} R(Tw | T) { 1 − [θ/(θ + b(p − β)Tw)]^γ }
           + (C4 + C6) ab e^{−b(p−β)T} R(x | T + Tw) { [θ/(θ + b(p − β)Tw)]^γ − [θ/(θ + b(p − β)(Tw + x))]^γ }

    u(T) = −(C4 + C6)(p − β) ab² R(x | T + Tw) { [θ/(θ + b(p − β)Tw)]^γ − [θ/(θ + b(p − β)(Tw + x))]^γ }
           × ( 1 − [a/(p − β)] e^{−b(p−β)T} { [θ/(θ + b(p − β)Tw)]^γ − [θ/(θ + b(p − β)(Tw + x))]^γ } )
           − C5 (p − β) ab² R(Tw | T) { 1 − [θ/(θ + b(p − β)Tw)]^γ }
           × ( 1 − [a/(p − β)] e^{−b(p−β)T} { 1 − [θ/(θ + b(p − β)Tw)]^γ } )

    C = C3 µw (p − β) ab² { 1 − [θ/(θ + b(p − β)Tw)]^γ } − C2 µy (p − β) ab²

This model can help developers to determine when to stop testing the software and release it to beta-testing users and to end-users.
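In practice, the case analysis of Theorem 1 can also be sidestepped by scanning E(T) numerically. The sketch below is a simplified illustration with invented cost values, using the MVF and reliability forms of Section 16.5.1 and taking the release time t0 equal to the in-house testing time T.

    # Grid search for the release time T* maximizing the expected net gain E(T).
    import math

    def expected_net_gain(T, C=(10.0, 0.05, 1.0, 2.0, 500.0, 500.0, 800.0),
                          a=20.0, b=0.0005, p=0.95, beta=0.02, gamma=2.0,
                          theta=0.2, Tw=2000.0, x=10.0, mu_y=0.5, mu_w=1.0):
        C0, C1, C2, C3, C4, C5, C6 = C
        k = a / (p - beta)
        m1 = lambda t: k * (1 - math.exp(-b * (p - beta) * t))
        def m2(t):  # field-phase MVF with t0 = T
            s = (theta / (theta + b * (p - beta) * (t - T))) ** gamma
            return k * (1 - math.exp(-b * (p - beta) * T) * s)
        R = lambda x_, t: math.exp(-(m2(t + x_) - m2(t)))
        return (C5 * R(Tw, T) + (C4 + C6) * R(x, T + Tw) - (C0 + C4)
                - C1 * T - C2 * m1(T) * mu_y
                - C3 * mu_w * (m2(T + Tw) - m2(T)))

    T_grid = [100.0 * i for i in range(1, 401)]
    T_star = max(T_grid, key=expected_net_gain)
    print("approximate T* =", T_star)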

Further research is needed to determine the following.

1. How should resources be allocated from the beginning of the software life cycle to ensure the on-time and efficient delivery of a software product?

2. What information does a software engineer need during the system test in order to determine when to release the software for a given reliability/risk level?

3. What is the risk cost due to software failures after release?

4. How should marketing efforts (including advertising, public relations, trade-show participation, direct mail, and related initiatives) be allocated to support the release of a new software product effectively?

16.6 Further Reading

There are several survey papers and books on software reliability research and software cost published in the last five years that can be read at an introductory or intermediate level. Interested readers are referred to the articles by Pham [9], Whittaker and Voas [40], and Voas [41]; the Handbook of Software Reliability Engineering edited by Lyu [18] (McGraw-Hill and IEEE CS Press, 1996); the book Software Reliability by Pham [2] (Springer-Verlag, 2000); and Software-Reliability-Engineered Testing Practice by J. Musa (John Wiley & Sons, 1998). These are excellent texts and references for students, researchers, and practitioners.

This list is by no means exhaustive, but I believe it will help readers get started learning about these subjects.

Acknowledgment

This research was supported in part by the US National Science Foundation under INT-0107755.

References

[1] Wood A. Predicting software reliability. IEEE Comput 1996;11:69–77.
[2] Pham H. Software reliability. Springer-Verlag; 2000.
[3] Jelinski Z, Moranda PB. Software reliability research. In: Freiberger W, editor. Statistical computer performance evaluation. New York: Academic Press; 1972.
[4] Goel AL, Okumoto K. Time-dependent error-detection rate model for software and other performance measures. IEEE Trans Reliab 1979;28:206–11.
[5] Pham H. Software reliability assessment: imperfect debugging and multiple failure types in software development. EG&G-RAAM-10737, Idaho National Engineering Laboratory, 1993.
[6] Pham H. A software cost model with imperfect debugging, random life cycle and penalty cost. Int J Syst Sci 1996;27(5):455–63.
[7] Pham H, Zhang X. An NHPP software reliability model and its comparison. Int J Reliab Qual Saf Eng 1997;4(3):269–82.
[8] Pham H, Nordmann L, Zhang X. A general imperfect software debugging model with S-shaped fault detection rate. IEEE Trans Reliab 1999;48(2):169–75.
[9] Pham H. Software reliability. In: Webster JG, editor. Wiley encyclopedia of electrical and electronics engineering, vol. 19. Wiley & Sons; 1999. p. 565–78.
[10] Ohba M. Software reliability analysis models. IBM J Res Dev 1984;28:428–43.
[11] Yamada S, Ohba M, Osaki S. S-shaped reliability growth modeling for software error detection. IEEE Trans Reliab 1983;12:475–84.
[12] Yamada S, Osaki S. Software reliability growth modeling: models and applications. IEEE Trans Software Eng 1985;11:1431–7.
[13] Yamada S, Tokuno K, Osaki S. Imperfect debugging models with fault introduction rate for software reliability assessment. Int J Syst Sci 1992;23(12).
[14] Pham L, Pham H. Software reliability models with time-dependent hazard function based on Bayesian approach. IEEE Trans Syst Man Cybernet Part A: Syst Hum 2000;30(1).
[15] Pham H, Wang H. A quasi-renewal process for software reliability and testing costs. IEEE Trans Syst Man Cybernet 2001;31(6):623–31.
[16] Pham L, Pham H. A Bayesian predictive software reliability model with pseudo-failures. IEEE Trans Syst Man Cybernet Part A: Syst Hum 2001;31(3):233–8.
[17] Ehrlich W, Prasanna B, Stampfel J, Wu J. Determining the cost of a stop-testing decision. IEEE Software 1993;10:33–42.
[18] Lyu M, editor. Handbook of software reliability engineering. McGraw-Hill; 1996.
[19] Pham H. Software reliability and testing. IEEE Computer Society Press; 1995.
[20] Littlewood B, Sofer A. A Bayesian modification to the Jelinski–Moranda software reliability growth model. Software Eng J 1989;30–41.
[21] Mazzuchi TA, Soyer R. A Bayes empirical-Bayes model for software reliability. IEEE Trans Reliab 1988;37:248–54.
[22] Csenki A. Bayes predictive analysis of a fundamental software reliability model. IEEE Trans Reliab 1990;39:142–8.
[23] Akaike H. A new look at the statistical model identification. IEEE Trans Autom Control 1974;19:716–23.
[24] Hossain SA, Ram CD. Estimating the parameters of a non-homogeneous Poisson process model for software reliability. IEEE Trans Reliab 1993;42(4):604–12.
[25] Cox DR. Regression models and life-tables (with discussion). J R Stat Soc B 1972;34:187–220.
[26] Tohma Y, Yamano H, Ohba M, Jacoby R. The estimation of parameters of the hypergeometric distribution and its application to the software reliability growth model. IEEE Trans Software Eng 1991;17(5):483–9.
[27] Zhang X, Pham H. An analysis of factors affecting software reliability. J Syst Software 2000;50:43–56.
[28] Zhang X. Software reliability and cost models with environmental factors. PhD Dissertation, Rutgers University, 1999, unpublished.
[29] Pham H, Zhang X. A software cost model with warranty and risk costs. IEEE Trans Comput 1999;48(1):71–5.
[30] Zhang X, Pham H. A software cost model with error removal times and risk costs. Int J Syst Sci 1998;29(4):435–42.
[31] Zhang X, Pham H. A software cost model with warranty and risk costs. IIE Trans Qual Reliab Eng 1999;30(12):1135–42.
[32] Okumoto K, Goel AL. Optimum release time for software systems based on reliability and cost criteria. J Syst Software 1980;1:315–8.
[33] Ohtera H, Yamada S. Optimal allocation and control problems for software-testing resources. IEEE Trans Reliab 1990;39:171–6.
[34] Leung YW. Optimal software release time with a given cost budget. J Syst Software 1992;17:233–42.
[35] Dalal SR, McIntosh AA. When to stop testing for large software systems with changing code. IEEE Trans Software Eng 1994;20(4):318–23.
[36] Yang MC, Chao A. Reliability-estimation and stopping rules for software testing based on repeated appearances of bugs. IEEE Trans Reliab 1995;22(2):315–26.
[37] Hou RH, Kuo SY, Chang YP. Optimal release times for software systems with scheduled delivery time based on the HGDM. IEEE Trans Comput 1997;46(2):216–21.
[38] Pham H, Zhang X. Software release policies with gain in reliability justifying the costs. Ann Software Eng 1999;8:147–66.
[39] Teng X, Pham H. A NHPP software reliability model under random environment. Quality and Reliability Engineering Center Report, Rutgers University, 2002.
[40] Whittaker JA, Voas J. Toward a more reliable theory of software reliability. IEEE Comput 2000;(December):36–42.
[41] Voas J. Fault tolerance. IEEE Software 2001;(July/August):54–7.


Part IV. Maintenance Theory and Testing

17 Warranty and Maintenance
   17.1 Introduction
   17.2 Product Warranties: An Overview
   17.3 Maintenance: An Overview
   17.4 Warranty and Corrective Maintenance
   17.5 Warranty and Preventive Maintenance
   17.6 Extended Warranties and Service Contracts
   17.7 Conclusions and Topics for Future Research

18 Mechanical Reliability and Maintenance Models
   18.1 Introduction
   18.2 Stochastic Point Processes
   18.3 Perfect Maintenance
   18.4 Minimal Repair
   18.5 Imperfect or Worse Repair
   18.6 Complex Maintenance Policy
   18.7 Reliability Growth

19 Preventive Maintenance Models: Replacement, Repair, Ordering and Inspection
   19.1 Introduction
   19.2 Block Replacement Models
   19.3 Age Replacement Models
   19.4 Ordering Models
   19.5 Inspection Models
   19.6 Concluding Remarks

20 Maintenance and Optimum Policy
   20.1 Introduction
   20.2 Replacement Policies
   20.3 Preventive Maintenance Policies
   20.4 Inspection Policies

21 Optimal Imperfect Maintenance Models
   21.1 Introduction
   21.2 Treatment Methods for Imperfect Maintenance
   21.3 Some Results on Imperfect Maintenance
   21.4 Future Research on Imperfect Maintenance

22 Accelerated Life Testing
   22.1 Introduction
   22.2 Design of Accelerated Life Testing Plans
   22.3 Accelerated Life Testing Models
   22.4 Extensions of the PH Model

23 Accelerated Test Models with the Birnbaum–Saunders Distribution
   23.1 Introduction
   23.2 Accelerated Birnbaum–Saunders Models
   23.3 Inference Procedures with Accelerated Life Models
   23.4 Estimation from Experimental Data

24 Multiple-Steps Step-Stress Accelerated Life Test
   24.1 Introduction
   24.2 CE Models
   24.3 Planning SSALT
   24.4 Data Analysis in SSALT
   24.5 Implementation on Microsoft Excel
   24.6 Conclusion

25 Step-Stress Accelerated Life Testing
   25.1 Introduction
   25.2 Step-Stress Life Testing with Constant Stress-Change Times
   25.3 Step-Stress Life Testing with Random Stress-Change Times
   25.4 Bibliographical Notes

Chapter 17

Warranty and Maintenance

D. N. P. Murthy and N. Jack

17.1 Introduction
17.2 Product Warranties: An Overview
  17.2.1 Role and Concept
  17.2.2 Product Categories
  17.2.3 Warranty Policies
    17.2.3.1 Warranty Policies for Standard Products Sold Individually
    17.2.3.2 Warranty Policies for Standard Products Sold in Lots
    17.2.3.3 Warranty Policies for Specialized Products
    17.2.3.4 Extended Warranties
    17.2.3.5 Warranties for Used Products
  17.2.4 Issues in Product Warranty
    17.2.4.1 Warranty Cost Analysis
    17.2.4.2 Warranty Servicing
  17.2.5 Review of Warranty Literature
17.3 Maintenance: An Overview
  17.3.1 Corrective Maintenance
  17.3.2 Preventive Maintenance
  17.3.3 Review of Maintenance Literature
17.4 Warranty and Corrective Maintenance
17.5 Warranty and Preventive Maintenance
17.6 Extended Warranties and Service Contracts
17.7 Conclusions and Topics for Future Research

17.1 Introduction

All products are unreliable in the sense that they eventually fail. A failure can occur early in an item's life due to manufacturing defects, or late in its life due to degradation that depends on age and usage. Most products are sold with a warranty that offers buyers protection against early failures over the warranty period. The warranty period offered has been getting progressively longer. For example, the warranty period for cars was 3 months in the early 1930s; this changed to 1 year in the 1960s, and currently it varies from 3 to 5 years. With extended warranties, items are covered for a significant part of their useful lives, which implies that failures due to degradation can occur within the warranty period.

Offering a warranty implies additional costs (called "warranty servicing" costs) to the manufacturer. The warranty servicing cost is the cost of repairing item failures (through corrective maintenance (CM)) over the warranty period. For short warranty periods, the manufacturer can minimize the expected warranty servicing cost through optimal CM decision making. For long warranty periods, the degradation of an item can be controlled through preventive maintenance (PM), which reduces the likelihood of failures. Optimal PM actions need to be viewed from a life cycle perspective (for both the buyer and the manufacturer), and this raises several new issues.

The literature that links warranty and maintenance is significant, but small in comparison with the literature on these two topics individually. In this paper we review the literature dealing with warranty and maintenance and suggest areas for future research.

The outline of the paper is as follows. In Sections 17.2 and 17.3 we give brief overviews of product warranty and maintenance, respectively, in order to set the background for later discussions. Following this, we review warranty and CM in Section 17.4, and warranty and PM in Section 17.5. Section 17.6 covers extended warranties and service contracts, and in Section 17.7 we state our conclusions and give a brief discussion of future research topics.

17.2 Product Warranties: An Overview

17.2.1 Role and Concept

A warranty is a contract between buyer and manufacturer associated with the sale of a product. Its purpose is basically to establish liability in the event of a premature failure of an item sold, where "failure" means the inability of the item to perform its intended function. The contract specifies the promised performance and, if this is not met, the means for the buyer to be compensated. The contract also specifies the buyer's responsibilities with regard to due care and operation of the item purchased.

Warranties serve different purposes for buyers and manufacturers. For buyers, warranties serve a dual role—protectional (assurance against unsatisfactory product performance) and informational (better warranty terms indicating a more reliable product). For manufacturers, warranties also serve a dual role—promotional (to communicate information regarding product quality and to differentiate from their competitors’ products) and protectional (against excessive claims from unreasonable buyers).

17.2.2 Product Categories

Products can be either new or used, and new products can be further divided into the following three categories.

1. Consumer durables, such as household appliances, computers, and automobiles bought by individual households as a single item.

2. Industrial and commercial products bought by businesses for the provision of services (e.g. equipment used in a restaurant, aircraft used by airlines, or copy machines used in an office). These are bought either individually (e.g. a single X-ray machine bought by a hospital) or as a batch, or lot, of K (K > 1) items (e.g. tires bought by a car manufacturer). Here we differentiate between “standard” off-the-shelf products and “specialized” products that are custom built to the buyer’s specifications.

3. Government acquisitions, such as a new fleet of defense equipment (e.g. missiles, ships, etc.). These are usually “custom built” and are products involving new and evolving technologies. They are usually characterized by a high degree of uncertainty in the product development process.

Used products are, in general, sold individually and can be consumer durables, industrial, or commercial products.

17.2.3 Warranty Policies

Both new and used products are sold with many different types of warranty. Blischke and Murthy [1] developed a taxonomy for new product warranty and Murthy and Chattopadhyay [2] developed a similar taxonomy for used products. In this section we briefly discuss the salient features of the main categories of policies for new products and then briefly touch on policies for used items.

17.2.3.1 Warranty Policies for Standard Products Sold Individually

We first consider the case of new standard products where items are sold individually. The first important characteristic of a warranty is the form of compensation to the customer on failure of an item. The most common forms for non-repairable products are (1) a lump-sum rebate (e.g. “money-back guarantee”), (2) a free replacement of an item identical to the failed item, (3) a replacement provided at reduced cost to the buyer, and (4) some combination of the preceding terms. Warranties of Type (2) are called Free Replacement Warranties (FRWs). For warranties of Type (3), the amount of reduction is usually a function of the amount of service received by the buyer up to the time of failure, with decreasing discount as time of service increases. The discount is a percentage of the purchase price, which can change one or more times during the warranty period, or it may be a continuous function of the time remaining in the warranty period. These are called pro rata warranties (PRWs). The most common combination warranty is one that provides for free replacements up to a specified time and a replacement at pro-rated cost during the remainder of the warranty period. This is called a combination FRW/PRW. For a repairable product under an FRW policy, the failed item is repaired at no cost to the buyer.

Warranties can be further divided into different sub-groups based on dimensionality (one-dimensional warranties involve only age or usage; two-dimensional warranties involve both age and usage) and whether the warranty is renewing or not. In a renewing warranty, the repaired or replacement item comes with a new warranty identical to the initial warranty.

For used products, warranty coverage may also be limited in many ways. For example, certain types of failure or certain parts may be specifically excluded from coverage. Coverage may include parts and labor or parts only, or parts and labor for a portion of the warranty period and parts only thereafter. The variations are almost endless.

17.2.3.2 Warranty Policies for Standard Products Sold in Lots

Under this type of warranty, an entire batch of items is guaranteed to provide a specified total amount of service, without specifying a guarantee on any individual item. For example, rather than guaranteeing that each item in a batch of K items will operate without failure for W hours, the batch as a whole is guaranteed to provide at least KW hours of service. If, after the last item in the batch has failed, the total service time is less than KW hours, items are provided as specified in the warranty (e.g. free of charge or at pro rata cost) until such time as the total of KW hours is achieved.
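As a simple numerical illustration of this policy, the following Python sketch (all lifetimes are hypothetical) computes the service shortfall that must be made good under a KW-hour cumulative guarantee:

import numpy as np

# Hedged sketch: cumulative-service lot warranty. A batch of K items is
# guaranteed to deliver K*W hours of service in total; any shortfall remaining
# after the last item fails must be made good under the warranty terms.

def lot_warranty_shortfall(lifetimes, W):
    lifetimes = np.asarray(lifetimes, dtype=float)
    K = len(lifetimes)
    return max(K * W - lifetimes.sum(), 0.0)

lifetimes = [950.0, 1010.0, 780.0, 1105.0, 990.0]   # hours, hypothetical batch of K = 5
print(lot_warranty_shortfall(lifetimes, W=1000.0))  # hours still owed: 165.0 here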

17.2.3.3 Warranty Policies for Specialized Products

In the procurement of complex military and industrial equipment, warranties play a very different and important role of providing incentives to the seller to increase the reliability of the items after they are put into service. This is accomplished by requiring that the contractor maintain the items in the field and make design changes as failures are observed and analyzed. The incentive is an increased fee paid to the contractor if it can be demonstrated that the reliability of the item has, in fact, been increased. Warranties of this type are called reliability improvement warranties (RIWs).

17.2.3.4 Extended Warranties

The warranty that is an integral part of a product sale is called the base warranty. It is offered by the manufacturer at no additional cost and is factored into the sale price. An extended warranty provides additional coverage over the base warranty and is obtained by the buyer through paying a premium. Extended warranties are optional warranties that are not tied to the sale process and can be offered either by the manufacturer or by a third party (e.g. several credit card companies offer extended warranties for products bought using their credit cards, and some large merchants offer extended warranties). The terms of the extended warranty can be either identical to the base warranty or may differ from it in the sense that they might include cost limits, deductibles, exclusions, etc.

Extended warranties are similar to service contracts, where an external agent agrees to maintain a product for a specified time period under a contract with the owner of the product. The terms of the contract can vary and can include CM and/or PM actions.


17.2.3.5 Warranties for Used Products

Some of the warranty policies for second-hand products are similar to those for new products, whilst others are different. Warranties for second-hand products can involve additional features, such as cost limits, exclusions, and so on. The terms (e.g. duration and features) can vary from item to item and can depend on the condition of the item involved. They are also influenced by the negotiating skills of the buyer. Murthy and Chattopadhyay [2] proposed a taxonomy for one-dimensional warranty policies for used items sold individually. These include options such as cost limits, deductibles, cost sharing, money-back guarantees, etc.

17.2.4 Issues in Product Warranty

Warranties have been analyzed by researchers from many different disciplines. The various issues dealt with are given in the following list.

1. Historical: origin and use of the notion.
2. Legal: court action, dispute resolution, product liability.
3. Legislative: Magnusson–Moss Act in the USA, warranty requirements in government acquisition (particularly military) in different countries, EU legislation.
4. Economic: market equilibrium, social welfare.
5. Behavioral: buyer reaction, influence on purchase decision, perceived role of warranty, claims behavior.
6. Consumerist: product information, consumer protection, and consumer awareness.
7. Engineering: design, manufacturing, quality control, testing.
8. Statistics: data acquisition and analysis, stochastic modeling.
9. Operations research: cost modeling, optimization.
10. Accounting: tracking of costs, time of accrual, taxation implications.
11. Marketing: assessment of consumer attitudes, assessment of the marketplace, use of warranty as a marketing tool, warranty and sales.
12. Management: integration of many of the previous items, determination of warranty policy, warranty servicing.
13. Society: public policy issues.

Blischke and Murthy [3] examined several of these issues in depth. Here, we briefly discuss two issues of relevance to this chapter.

17.2.4.1 Warranty Cost Analysis

We first look at warranty cost from the manufacturer’s perspective. There are a number of approaches to the costing of warranties, and costs are clearly different for buyer and seller. The following are some of the methods for calculating costs that might be considered.

1. Cost per item sold: this per unit cost may be calculated as the total cost of warranty, as determined by general principles of accounting, divided by the number of items sold.

2. Cost per unit of time.

These costs are random variables, since claims under warranty and the cost to rectify each claim are uncertain. The selection of an appropriate cost basis depends on the product, the context and perspective. The type of customer—individual, corporation, or government—is also important, as are many other factors.

From the buyer’s perspective, the time interval of interest is from the instant an item is purchased to the instant when it is disposed of or replaced. This includes the warranty period and the post-warranty period. The cost of rectification over the warranty period depends on the type of warranty. It can vary from no cost (in the case of an FRW) to cost sharing (in the case of a PRW). The cost of rectification during the post-warranty period is borne completely by the buyer. As such, the variable of interest to the buyer is the cost of maintaining an item over its useful life. Hence, the following cost estimates are of interest.

1. Cost per item averaging over all items purchased plus those obtained free or at reduced price under warranty.
2. Life cycle cost of ownership of an item with or without warranty, including purchase price, operating and maintenance cost, etc., and finally the cost of disposal.
3. Life cycle cost of an item and its replacements, whether purchased at full price or replaced under warranty, over a fixed time horizon.

The warranty costs depend on the nature of the maintenance actions (corrective and/or preventive) used and we discuss this further later in the chapter.
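To illustrate how a warranty cost per item can be evaluated, the sketch below computes the manufacturer’s expected servicing cost for a non-renewing FRW in which every failure is minimally repaired; the power-law form of the expected number of failures and all numerical values are assumptions made for this example, not results from this chapter.

# Hedged sketch: under minimal repair, failures over the warranty period (0, W)
# follow a non-homogeneous Poisson process, so the expected servicing cost is
# c_repair * M(W), where M(W) is the expected number of failures in (0, W).
# A power-law form M(t) = (t/eta)**beta is assumed purely for illustration.

def expected_frw_servicing_cost(W, beta, eta, c_repair):
    expected_failures = (W / eta) ** beta
    return c_repair * expected_failures

# Hypothetical values: 2-year warranty on a deteriorating item (beta > 1).
print(expected_frw_servicing_cost(W=2.0, beta=1.5, eta=8.0, c_repair=120.0))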

17.2.4.2 Warranty Servicing

Warranty servicing costs can be minimized by using optimal servicing strategies. In the case where only CM actions are used, two possible strategies are as follows.

1. Replace versus repair. The manufacturer has the option of either repairing or replacing a failed item by a new one.

2. Cost repair limit strategy. Here, an estimate of the cost to repair a failed item is made and, by comparing it with some specified limit, the failed item is either repaired or replaced by a new one.
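To make the cost repair limit strategy concrete, the following hedged sketch estimates, by Monte Carlo, the expected servicing cost per failure as a function of the repair cost limit; the lognormal repair-cost distribution and all costs are hypothetical.

import numpy as np

# Hedged sketch: cost repair limit strategy. At each failure, a repair cost
# estimate C is obtained; the item is repaired if C <= limit, otherwise it is
# replaced by a new one at cost c_new.

def expected_cost_per_failure(limit, c_new, n_sims=100_000, seed=1):
    rng = np.random.default_rng(seed)
    repair_estimates = rng.lognormal(mean=3.0, sigma=0.8, size=n_sims)  # hypothetical
    return np.where(repair_estimates <= limit, repair_estimates, c_new).mean()

for limit in (20.0, 40.0, 60.0):    # scan limits to locate the cost-minimizing one
    print(limit, round(expected_cost_per_failure(limit, c_new=70.0), 2))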

17.2.5 Review of Warranty Literature

Review papers on warranties include a three-part paper in the European Journal of Operational Research: Blischke and Murthy [1] deal with concepts and taxonomy, Murthy and Blischke [4] deal with a framework for the study of warranties, and Murthy and Blischke [5] deal with warranty cost analysis. Papers by Blischke [6] and Chukova et al. [7] deal mainly with warranty cost analysis. Two recent review papers are by Thomas and Rao [8] and Murthy and Djamaludin [9], with the latter dealing with warranty from a broader perspective.

Over the last 6 years, four books have appeared on the subject. Blischke and Murthy [10] deal with cost analysis of over 40 different warranty policies for new products. Blischke and Murthy [3] provide a collection of review papers dealing with warranty from many different perspectives. Sahin and Polatoglu [11] deal with the cost analysis of some basic one-dimensional warranty policies. Brennan [12] deals with warranty administration in the context of defense products.

Finally, Djamaludin et al. [13] list over 1500 papers on warranties, dividing these into different categories. This list does not include papers that have appeared in the law journals.

17.3 Maintenance: An Overview

As indicated earlier, maintenance can be defined as actions (i) to control the deterioration process leading to failure of a system, and (ii) to restore the system to its operational state through corrective actions after a failure. The former is called PM and the latter CM.

CM actions are unscheduled actions intended to restore a system from a failed state into a working state. These involve either repair or replacement of failed components. In contrast, PM actions are scheduled actions carried out either to reduce the likelihood of a failure or prolong the life of the system.

17.3.1 Corrective Maintenance

In the case of a repairable product, the behavior of an item after a repair depends on the type of repair carried out. Various types of repair action can be defined.

• Good as new repair. Here, the failure time distribution of repaired items is identical to that of a new item, and we model successive failures using an ordinary renewal process. In real life this type of repair would seldom occur.
• Minimal repair. A failed item is returned to operation with the same effective age as it possessed immediately prior to failure. Failures then occur according to a non-homogeneous Poisson process with an intensity function having the same form as the hazard rate of the time to first failure distribution. This type of rectification model is appropriate when item failure is caused by one of many components failing and the failed component being replaced by a new one (see Murthy [14] and Nakagawa and Kowada [15]); a simulation sketch contrasting this model with good-as-new repair follows this list.
• Different from new repair (I). Sometimes when an item fails, not only the failed components are replaced but also others that have deteriorated sufficiently. These major overhauls result in F1(x), the failure time distribution function of all repaired items, being different from F(x), the failure time distribution function of a new item. The mean time to failure of a repaired item is assumed to be smaller than that of a new item. In this case, successive failures are modeled by a modified renewal process.
• Different from new repair (II). In some instances, the failure distribution of a repaired item depends on the number of times the item has been repaired. This situation can be modeled by assuming that the distribution function after the jth repair (j ≥ 1) is Fj(x), with the mean time to failure µj decreasing as j increases.
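The following hedged Python sketch contrasts the first two repair models by simulation, assuming a Weibull time-to-first-failure distribution (an assumption made for illustration): good-as-new repair generates a renewal process, while minimal repair generates a non-homogeneous Poisson process whose intensity equals the Weibull hazard rate.

import numpy as np

# Hedged sketch: failure times on (0, t_max] under two repair models, assuming
# a Weibull(shape k, scale lam) time to first failure.
# - Good as new repair: renewal process with i.i.d. Weibull gaps.
# - Minimal repair: NHPP simulated by inverting the cumulative hazard
#   H(t) = (t/lam)**k, using H(t_i) - H(t_{i-1}) ~ Exp(1).

def renewal_times(k, lam, t_max, rng):
    t, out = 0.0, []
    while True:
        t += lam * rng.weibull(k)
        if t > t_max:
            return out
        out.append(t)

def minimal_repair_times(k, lam, t_max, rng):
    h, out = 0.0, []
    while True:
        h += rng.exponential(1.0)
        t = lam * h ** (1.0 / k)      # invert the cumulative hazard
        if t > t_max:
            return out
        out.append(t)

rng = np.random.default_rng(0)
print(len(renewal_times(2.0, 100.0, 1000.0, rng)),
      len(minimal_repair_times(2.0, 100.0, 1000.0, rng)))
# With k > 1, gaps shorten over time under minimal repair but not under renewal.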

17.3.2 Preventive Maintenance

PM actions can be divided into the following categories.

• Clock-based maintenance. PM actions are carried out at set times. An example of this is the “block replacement” policy.
• Age-based maintenance. PM actions are based on the age of the component. An example of this is the “age replacement” policy (see the sketch after this list).
• Usage-based maintenance. PM actions are based on usage of the product. This is appropriate for items such as tires, components of an aircraft, and so forth.
• Condition-based maintenance. PM actions are based on the condition of the component being maintained. This involves monitoring of one or more variables characterizing the wear process (e.g. crack growth in a mechanical component). It is often difficult to measure the variable of interest directly, and in this case some other variable may be used to obtain estimates of the variable of interest. For example, the wear of bearings can be measured by dismantling the crankcase of an engine. However, measuring the vibration, noise, or temperature of the bearing case provides information about wear, since there is a strong correlation between these variables and bearing wear.
• Opportunity-based maintenance. This is applicable for multi-component systems, where maintenance actions (PM or CM) for a component provide an opportunity for carrying out PM actions on one or more of the remaining components of the system.
• Design-out maintenance. This involves carrying out modifications through redesigning the component. As a result, the new component has better reliability characteristics.
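As an illustration of the age replacement policy mentioned in the list above, the following sketch uses the standard long-run cost rate formulation c(T) = [c_p R(T) + c_f F(T)] / ∫_0^T R(t) dt (a textbook result, not derived in this chapter) with a hypothetical Weibull lifetime and hypothetical costs:

import numpy as np

# Hedged sketch: optimal age replacement. Replace preventively at age T
# (cost c_p) or at failure, whichever occurs first (cost c_f > c_p), and
# minimize the long-run expected cost per unit time by a grid search.

def cost_rate(T, k, lam, c_p, c_f, n_grid=2000):
    t = np.linspace(0.0, T, n_grid)
    R = np.exp(-(t / lam) ** k)                           # Weibull survival function
    dt = t[1] - t[0]
    cycle_length = dt * (R.sum() - 0.5 * (R[0] + R[-1]))  # trapezoidal integral of R
    cycle_cost = c_p * R[-1] + c_f * (1.0 - R[-1])
    return cycle_cost / cycle_length

Ts = np.linspace(10.0, 200.0, 400)
costs = [cost_rate(T, k=2.5, lam=100.0, c_p=1.0, c_f=10.0) for T in Ts]
print("approximately optimal T:", round(Ts[int(np.argmin(costs))], 1))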

In general, PM is carried out at discrete time instants. In cases where the PM actions are carried out fairly frequently they can be treated as occurring continuously over time. Many different types of model formulations have been proposed to study the effect of PM on the degradation and failure occurrence of items and to derive optimal PM strategies.

17.3.3 Review of Maintenance Literature

Several review papers on maintenance have appeared over the last 30 years. These include McCall [16], Pierskalla and Voelker [17], Sherif and Smith [18], Monahan [19], Jardine and Buzacott [20], Thomas [21], Gits [22], Valdez-Flores and Feldman [23], Pintelon and Gelders [24], Dekker [25], and Scarf [26]. Cho and Parlar [27] and Dekker et al. [28] deal with the maintenance of multi-component systems and Pham and Wang [29] review models with imperfect maintenance. These review papers contain references to the large number of papers and books dealing with maintenance.


17.4 Warranty and Corrective Maintenance

The bulk of the literature on warranty and corrective maintenance deals with warranty servicing costs under different CM actions. We first review the literature linking warranties and CM for new products, then we consider the literature for used items.

Although most warranted items are complex multi-component systems, the “black-box” approach has often been used to model time to first failure. The items are viewed as single entities characterized by two states—working and failed—and F(x), the distribution function for time to first failure, is usually selected either on an intuitive basis or from the analysis of failure data. Subsequent failures are modeled by an appropriate stochastic point process formulation depending on the type of rectification action used. If the times to complete rectification actions are very small in comparison with the times between failures, they are ignored in the modeling.

Most of the literature on warranty servicing (for one- and two-dimensional warranties) is summarized by Blischke and Murthy [3, 10]. We confine our discussion to one-dimensional warranties, for standard products, sold individually, and focus on issues relating to maintenance actions and optimal decision making.

Models where items are subjected to different from new repair (I) include those of Biedenweg [30], and Nguyen and Murthy [31, 32]. Biedenweg [30] shows that the optimal strategy is to replace with a new item at any failure occurring up to a certain time measured from the initial purchase and then repair all other failures that occur during the remainder of the warranty period. This technique of splitting the warranty period into distinct intervals for replacement and repair is also used by Nguyen and Murthy [31, 32], where any item failures occurring during the second part of the warranty period are rectified using a stock of used items [31]. Nguyen and Murthy [32] extend Biedenweg’s [30] model by adding a third interval, where failed items are either replaced or repaired and a new warranty is given at each failure.

The first warranty servicing model, involving minimal repair and assuming constant repair and replacement costs, is that of Nguyen [33]. As in Biedenweg [30], the warranty period is split into a replacement interval followed by a repair interval. Under this strategy a failed item is always replaced by a new one in the first interval, irrespective of its age at failure. Thus, if the failure occurs close to the beginning of the warranty then the item will be replaced at a higher cost than that of a repair and yet there will be very little reduction in its effective age. This is the major limitation of this model, and makes the strategy clearly sub-optimal.

Using the same assumptions as Nguyen [33], Jack and Van der Duyn Schouten [34] investigated the structure of the manufacturer’s optimal servicing strategy using a dynamic programming model. They showed that the repair–replacement decision on failure should be made by comparing the item’s current age with a time-dependent control limit function h(t). The item is replaced on failure at time t if and only if its age is greater than h(t). A new servicing strategy proposed by Jack and Murthy [35] involves splitting the warranty period into distinct intervals for carrying out repairs and replacements with no need to monitor the item’s age. In intervals near the beginning and end of the warranty period the failed items are always repaired, whereas in the intermediate interval at most one failure replacement is carried out.

Murthy and Nguyen [36] examined the optimal cost limit repair strategy where, at each failure during the warranty period, the item is inspected and an estimate of the repair cost determined. If this estimate is less than a specified limit then the failed item is minimally repaired, otherwise a replacement is provided at no cost to the buyer.

For used items, Murthy and Chattopadhyay [2] deal with both FRW and PRW policies with no cost limits. Chattopadhyay and Murthy [37] deal with the cost analysis of limit on total cost (LTC) policies. Chattopadhyay and Murthy [38] deal with the following three different policies—specific parts exclusion (SPE) policy; limit on individual cost (LIC) policy; and limit on individual and total cost (LITC) policy—and discuss their cost analysis.

17.5 Warranty and Preventive Maintenance

PM actions are carried out either to reduce the likelihood of a failure or to prolong the life of an item. PM can be perfect (restoring the item to “good-as-new”) or imperfect (restoring the item to a condition that is between “good-as-new” and “bad-as-old”).

PM over the warranty period has an impact on the warranty servicing cost. It is worthwhile for the manufacturer to carry out this maintenance only if the reduction in the warranty cost exceeds the cost of PM. From a buyer’s perspective, a myopic buyer might decide not to invest in any PM over the warranty period, as item failures over this period are rectified by the manufacturer at no cost to the buyer. Investing in maintenance is viewed as an additional unnecessary cost. However, from a life cycle perspective the total life cycle cost to the buyer is influenced by maintenance actions during the warranty period and the post-warranty period. This implies that the buyer needs to evaluate the cost under different scenarios for PM actions.

This raises several interesting questions. These include the following:

1. Should PM be used during the warranty period?
2. If so, what should be the optimal maintenance effort? Should the buyer or the manufacturer pay for this, or should it be shared?
3. What level of maintenance should the buyer use during the post-warranty period?

PM actions are normally scheduled and carried out at discrete time instants. When the PM is carried out frequently and the time between two successive maintenance actions is small, then one can treat the maintenance effort as being continuous over time. This leads to two different ways (discrete and continuous) of modeling maintenance effort.

Another complicating factor is the information aspect. This relates to issues such as the state of the item, the type of distribution function appropriate for modeling failures, the parameters of the distribution function, etc. The two extreme situations are complete information and no information, but often the information available to the manufacturer and the buyer lies somewhere between these two extremes and can vary. This raises several interesting issues, such as the adverse selection and moral hazard problems. Quality variations (with all items not being statistically similar) add yet another dimension to the complexity.

As such, the effective study of PM for products sold under warranty requires a framework that incorporates the factors discussed above. The number of factors to be considered and the nature of their characterization result in many different model formulations linking PM and warranty. We now present a chronological review of the models that have been developed involving warranty and PM.

Ritchken and Fuh [39] discuss a preventive replacement policy for a non-repairable item after the expiry of a PRW. Any item failure occurring within the warranty period results in a replacement by a new item with the cost shared between the producer and the buyer. After the warranty period finishes, the item in use is either preventively replaced by the buyer after a period T (measured from the end of the warranty period) or replaced on failure, whichever occurs first. A new warranty is issued with the replacement item and the optimal T∗ is found by minimizing the buyer’s asymptotic expected cost per unit time.

Chun and Lee [40] consider a repairable item with an increasing failure rate that is subjected to periodic imperfect PM actions both during the warranty period and after the warranty expires. They assume that each PM action reduces the item’s age by a fixed amount and all failures between PM actions are minimally repaired. During the warranty period, the manufacturer pays all the repair costs and a proportion of each PM cost, with the proportion depending on when the action is carried out. After the warranty expires, the buyer pays for the cost of all repairs and PM. The optimal period between PM actions is obtained by minimizing the buyer’s asymptotic expected cost per unit time over an infinite horizon. An example is given for an item with a Weibull failure distribution.

Chun [41] dealt with a similar problem to Chun and Lee [40], but with the focus instead on the manufacturer’s periodic PM strategy over the warranty period. The optimal number of PM actions N∗ is obtained by minimizing the expected cost of repairs and PMs over this finite horizon.

Jack and Dagpunar [42] considered the model studied by Chun [41] and showed that, when the product has an increasing failure rate, a strict periodic policy for PM actions is not the optimal strategy. They showed that, for a warranty of length W and a fixed amount of age reduction at each PM, the optimal strategy is to perform N PMs at intervals x apart, followed by a final interval at the end of the warranty of length W − Nx, where only minimal repairs are carried out. Performing PMs with this frequency means that the item is effectively being restored to as-good-as-new condition.

Dagpunar and Jack [43] assumed that the amount of age reduction is under the control of the manufacturer and the cost of each PM action depends on the operating age of the item and on the effective age reduction resulting from the action. In this model, the optimal strategy can result in the product not being restored to as good as new at each PM. The optimal number of PM actions N∗, optimal operating age to perform a PM s∗, and optimal age reduction x∗ are obtained by minimizing the manufacturer’s expected warranty servicing cost.

Sahin and Polatoglu [44] discussed two types of preventive replacement policy for the buyer of a repairable item following the expiry of a warranty. Failures over the warranty period are minimally repaired at no cost to the buyer. In the first model, the item is replaced by a new item at a fixed time T after the warranty ends. Failures before T are minimally repaired, with the buyer paying all repair costs. In the second model, the replacement is postponed until the first failure after T. They considered both stationary and non-stationary strategies in order to minimize the long-run average cost to the buyer. The non-stationary strategies depend on the information regarding item age and number of previous failures that might be available to the buyer at the end of the warranty period. Sahin and Polatoglu [45] dealt with a model that examined PM policies with uncertainty in product quality.

Monga and Zuo [46] presented a model for the reliability-based design of a series–parallel system considering burn-in, warranty, and maintenance. They minimized the expected system life cycle cost and used genetic algorithms to determine the optimal values of system design, burn-in period, PM intervals, and replacement time. The manufacturer pays the costs of rectifying failures under warranty and the buyer pays post-warranty costs.

Finally, Jung et al. [47] determined the optimal number and period for periodic PMs following the expiry of a warranty by minimizing the buyer’s asymptotic expected cost per unit time. Both renewing PRWs and renewing FRWs are considered. The item is assumed to have a monotonically increasing failure rate and the PM actions slow down the degradation.

17.6 Extended Warranties and Service Contracts

The literature on extended warranties deals mainly with the servicing cost to the provider of these extended warranties. This is calculated using models similar to those for the cost analysis of base warranties with only CM actions.

Padmanabhan and Rao [48] and Padmanabhan [49] examined extended warranties with heterogeneous customers having different attitudes to risk, captured through a utility function. Patankar and Mitra [50] considered the case where items are sold with a PRW and the customer is given the option of renewing the initial warranty by paying a premium should the product not fail during the initial warranty period.

Mitra and Patankar [51] dealt with the model where the product is sold with a rebate policy and the buyer has the option to extend the warranty should the product not fail during the initial warranty period. Padmanabhan [52] discussed alternative theories and the design of extended warranty policies.

Service contracts also involve maintenance actions. Murthy and Ashgarizadeh [53, 54] dealt with service contracts involving only CM. The authors are unaware of any service contract models that deal with PM or optimal decision making with regard to maintenance actions.

17.7 Conclusions and Topics for Future Research

In this final section we again stress the importance of maintenance modeling in a warranty context, we emphasize the need for model validation, and we then outline some further research topics that link maintenance and warranty.

Post-sale service by a manufacturer is an important element in the sale of a new product, but it can result in substantial additional costs. These warranty servicing costs, which can vary between 0.5 and 7% of a product’s sale price, have a significant impact on the competitive behavior of manufacturers. However, manufacturers can reduce the servicing costs by adopting proper CM and PM strategies, and these are found by using appropriate maintenance models.

Models for determining optimal maintenance strategies in a warranty context were reviewed in Sections 17.4 and 17.5. The strategies discussed, whilst originating from more general maintenance models, have often had to be adapted to suit the special finite time horizon nature of warranty problems. However, all warranty maintenance models will only provide useful information to manufacturers provided they can be validated, and this requires the collection of accurate product failure data.

We have seen that some maintenance and warranty modeling has already been done, but there is still scope for new research, and we now suggest some topics worthy of investigation.

1. For complex products (such as locomotives, aircraft, etc.) the (corrective and preventive) maintenance needs for different components vary. Any realistic modeling requires grouping the components into different categories based on the maintenance needs. This implies modeling and analysis at the component level rather than the product level (see Chukova and Dimitrov [55]).

2. The literature linking warranty and PM deals primarily with age-based or clock-based maintenance. Opportunity-based maintenance also offers potential for reducing the overall warranty servicing costs. The study of optimal opportunistic maintenance policies in the context of warranty servicing is an area for new research.

3. Our discussion has been confined to one-dimensional warranties. Optimal maintenance strategies for two-dimensional warranties have received very little attention. Iskandar and Murthy [56] deal with a simple model, and there is considerable scope for more research on this topic.

4. The optimal (corrective and preventive) maintenance strategies discussed in the literature assume that the model structure and model parameters are known. In real life, this is often not true. In this case, the optimal decisions must be based on the information available, and this changes over time as more failure data are obtained. This implies that the modeling must be done using a Bayesian framework. Mazzuchi and Soyer [57] and Percy and Kobbacy [58] dealt with this issue in the context of maintenance, and a topic for research is to apply these ideas in the context of warranties and extended warranties.

5. The issue of risk becomes important in the context of service contracts. When the attitude to risk varies across the population and there is asymmetry in the information available to different parties, several new issues (such as moral hazard, adverse selection) need to be incorporated into the model (see Murthy and Padmanabhan [59]).

References

[1] Blischke WR, Murthy DNP. Product warranty management—I. A taxonomy for warranty policies. Eur J Oper Res 1991;62:127–48.
[2] Murthy DNP, Chattopadhyay G. Warranties for second-hand products. In: Proceedings of the FAIM Conference, Tilburg, Netherlands, June, 1999.
[3] Blischke WR, Murthy DNP. Product warranty handbook. New York: Marcel Dekker; 1996.
[4] Murthy DNP, Blischke WR. Product warranty management—II: an integrated framework for study. Eur J Oper Res 1991;62:261–80.
[5] Murthy DNP, Blischke WR. Product warranty management—III: a review of mathematical models. Eur J Oper Res 1991;62:1–34.
[6] Blischke WR. Mathematical models for warranty cost analysis. Math Comput Model 1990;13:1–16.
[7] Chukova SS, Dimitrov BN, Rykov VV. Warranty analysis. J Sov Math 1993;67:3486–508.
[8] Thomas M, Rao S. Warranty economic decision models: a review and suggested directions for future research. Oper Res 2000;47:807–20.
[9] Murthy DNP, Djamaludin I. Product warranties—a review. Int J Prod Econ 2002;in press.
[10] Blischke WR, Murthy DNP. Warranty cost analysis. New York: Marcel Dekker; 1994.
[11] Sahin I, Polatoglu H. Quality, warranty and preventive maintenance. Boston: Kluwer Academic Publishers; 1998.
[12] Brennan JR. Warranties: planning, analysis and implementation. New York: McGraw-Hill; 1994.
[13] Djamaludin I, Murthy DNP, Blischke WR. Bibliography on warranties. In: Blischke WR, Murthy DNP, editors. Product warranty handbook. New York: Marcel Dekker; 1996.
[14] Murthy DNP. A note on minimal repair. IEEE Trans Reliab 1991;R-40:245–6.
[15] Nakagawa T, Kowada M. Analysis of a system with minimal repair and its application to replacement policy. Eur J Oper Res 1983;12:176–82.
[16] McCall JJ. Maintenance policies for stochastically failing equipment: a survey. Manage Sci 1965;11:493–524.
[17] Pierskalla WP, Voelker JA. A survey of maintenance models: the control and surveillance of deteriorating systems. Nav Res Logist Q 1976;23:353–88.
[18] Sherif YS, Smith ML. Optimal maintenance models for systems subject to failure—a review. Nav Res Logist Q 1976;23:47–74.
[19] Monahan GE. A survey of partially observable Markov decision processes: theory, models and algorithms. Manage Sci 1982;28:1–16.
[20] Jardine AKS, Buzacott JA. Equipment reliability and maintenance. Eur J Oper Res 1985;19:285–96.
[21] Thomas LC. A survey of maintenance and replacement models for maintainability and reliability of multi-item systems. Reliab Eng 1986;16:297–309.
[22] Gits CW. Design of maintenance concepts. Int J Prod Econ 1992;24:217–26.
[23] Valdez-Flores C, Feldman RM. A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Nav Res Logist Q 1989;36:419–46.
[24] Pintelon LM, Gelders LF. Maintenance management decision making. Eur J Oper Res 1992;58:301–17.
[25] Dekker R. Applications of maintenance optimization models: a review and analysis. Reliab Eng Syst Saf 1996;51:229–40.
[26] Scarf PA. On the application of mathematical models in maintenance. Eur J Oper Res 1997;99:493–506.
[27] Cho D, Parlar M. A survey of maintenance models for multi-unit systems. Eur J Oper Res 1991;51:1–23.
[28] Dekker R, Wildeman RE, Van der Duyn Schouten FA. A review of multi-component maintenance models with economic dependence. Math Methods Oper Res 1997;45:411–35.
[29] Pham H, Wang H. Imperfect maintenance. Eur J Oper Res 1996;94:425–38.
[30] Biedenweg FM. Warranty analysis: consumer value vs manufacturers cost. Unpublished PhD thesis, Stanford University, USA, 1981.
[31] Nguyen DG, Murthy DNP. An optimal policy for servicing warranty. J Oper Res Soc 1986;37:1081–8.
[32] Nguyen DG, Murthy DNP. Optimal replace–repair strategy for servicing items sold with warranty. Eur J Oper Res 1989;39:206–12.
[33] Nguyen DG. Studies in warranty policies and product reliability. Unpublished PhD thesis, The University of Queensland, Australia, 1984.
[34] Jack N, Van der Duyn Schouten FA. Optimal repair–replace strategies for a warranted product. Int J Prod Econ 2000;67:95–100.
[35] Jack N, Murthy DNP. Servicing strategies for items sold with warranty. J Oper Res Soc 2001;52:1284–8.
[36] Murthy DNP, Nguyen DG. An optimal repair cost limit policy for servicing warranty. Math Comput Model 1988;11:595–9.
[37] Chattopadhyay GN, Murthy DNP. Warranty cost analysis for second hand products. Math Comput Model 2000;31:81–8.
[38] Chattopadhyay GN, Murthy DNP. Warranty cost analysis for second-hand products sold with cost sharing policies. Int Trans Oper Res 2001;8:47–68.
[39] Ritchken PH, Fuh D. Optimal replacement policies for irreparable warrantied items. IEEE Trans Reliab 1986;R-35:621–3.
[40] Chun YH, Lee CS. Optimal replacement policy for a warrantied system with imperfect preventive maintenance operations. Microelectron Reliab 1992;32:839–43.
[41] Chun YH. Optimal number of periodic preventive maintenance operations under warranty. Reliab Eng Syst Saf 1992;37:223–5.
[42] Jack N, Dagpunar JS. An optimal imperfect maintenance policy over a warranty period. Microelectron Reliab 1994;34:529–34.
[43] Dagpunar JS, Jack N. Preventive maintenance strategy for equipment under warranty. Microelectron Reliab 1994;34:1089–93.
[44] Sahin I, Polatoglu H. Maintenance strategies following the expiration of warranty. IEEE Trans Reliab 1996;45:220–8.
[45] Sahin I, Polatoglu H. Manufacturing quality, reliability and preventive maintenance. Prod Oper Manage 1996;5:132–47.
[46] Monga A, Zuo MJ. Optimal system design considering maintenance and warranty. Comput Oper Res 1998;9:691–705.
[47] Jung GM, Lee CH, Park DH. Periodic preventive maintenance policies following the expiration of warranty. Asia–Pac J Oper Res 2000;17:17–26.
[48] Padmanabhan V, Rao RC. Warranty policy and extended service contracts: theory and an application to automobiles. Market Sci 1993;12:230–47.
[49] Padmanabhan V. Usage heterogeneity and extended warranty. J Econ Manage Strat 1995;4:33–53.
[50] Patankar JG, Mitra A. A multicriteria model for a renewable warranty program. J Eng Val Cost Anal 1999;2:171–85.
[51] Mitra A, Patankar JG. Market share and warranty costs for renewable warranty programs. Int J Prod Econ 1997;50:155–68.
[52] Padmanabhan V. Extended warranties. In: Blischke WR, Murthy DNP, editors. Product warranty handbook. New York: Marcel Dekker; 1996. p.439–51.
[53] Murthy DNP, Ashgarizadeh E. A stochastic model for service contracts. Int J Reliab Qual Saf Eng 1998;5:29–45.
[54] Murthy DNP, Ashgarizadeh E. Optimal decision making in a maintenance service operation. Eur J Oper Res 1999;116:259–73.
[55] Chukova SS, Dimitrov BV. Warranty analysis for complex systems. In: Blischke WR, Murthy DNP, editors. Product warranty handbook. New York: Marcel Dekker; 1996. p.543–84.
[56] Iskandar BP, Murthy DNP. Repair–replace strategies for two-dimensional warranty policies. In: Proceedings of the Third Australia–Japan Workshop on Stochastic Models, Christchurch, September, 1999; p.206–13.
[57] Mazzuchi TA, Soyer R. A Bayesian perspective on some replacement strategies. IEEE Trans Reliab 1996;R-37:248–54.
[58] Percy DF, Kobbacy KAH. Preventive maintenance modelling—a Bayesian perspective. J Qual Maint Eng 1996;2:44–50.
[59] Murthy DNP, Padmanabhan V. A dynamic model of product warranty with consumer moral hazard. Research paper no. 1263. Graduate School of Business, Stanford University, Stanford, CA, 1993.


Chapter 18

Mechanical Reliability and Maintenance Models

Gianpaolo Pulcini

18.1 Introduction
18.2 Stochastic Point Processes
18.3 Perfect Maintenance
18.4 Minimal Repair
18.4.1 No Trend with Operating Time
18.4.2 Monotonic Trend with Operating Time
18.4.2.1 The Power Law Process
18.4.2.2 The Log–Linear Process
18.4.2.3 Bounded Intensity Processes
18.4.3 Bathtub-type Intensity
18.4.3.1 Numerical Example
18.4.4 Non-homogeneous Poisson Process Incorporating Covariate Information
18.5 Imperfect or Worse Repair
18.5.1 Proportional Age Reduction Models
18.5.2 Inhomogeneous Gamma Processes
18.5.3 Lawless–Thiagarajah Models
18.5.4 Proportional Intensity Variation Model
18.6 Complex Maintenance Policy
18.6.1 Sequence of Perfect and Minimal Repairs Without Preventive Maintenance
18.6.2 Minimal Repairs Interspersed with Perfect Preventive Maintenance
18.6.3 Imperfect Repairs Interspersed with Perfect Preventive Maintenance
18.6.4 Minimal Repairs Interspersed with Imperfect Preventive Maintenance
18.6.4.1 Numerical Example
18.6.5 Corrective Repairs Interspersed with Preventive Maintenance Without Restrictive Assumptions
18.7 Reliability Growth
18.7.1 Continuous Models
18.7.2 Discrete Models

18.1 Introduction

This chapter deals with the reliability modeling of repairable mechanical equipment and illustrates some useful models, and related statistical procedures, for the statistical analysis of failure data of repairable mechanical equipment. Repairable equipment is an item that may fail to perform at least one of its required functions many times during its lifetime. After each failure, it is not replaced but is restored to satisfactory performance by repairing or by substituting the failed part (a complete definition of repairable equipment can be found in Ascher and Feingold [1] (p.8)).


Many products are designed to be repaired and put back into service. However, the distinction between repairable and non-repairable units sometimes depends on economic considerations rather than equipment design, since it may be uneconomical, unsafe, or unwanted to repair the failed item. For example, a toaster is repaired numerous times in a poor society (and hence is repairable equipment), whereas it is often discarded and replaced when it fails in a rich society (and is therefore a non-repairable item).

Apart from possible improvement phenomena in an early phase of the operating time, mechanical repairable equipment is subjected, unlike electronic equipment, to degradation phenomena with operating time, so that the failures become increasingly frequent with time. This makes the definition of reliability models for mechanical equipment more complicated, both due to the presence of multiple failure mechanisms and to the complex maintenance policy that is often carried out. In fact, the failure pattern (i.e. the sequence of times between successive failures) of mechanical repairable equipment strictly depends not only on the failure mechanisms and physical structure of the equipment, but also on the type of repair that is performed at failure (corrective maintenance). Moreover, the failure pattern also depends on the effect of preventive maintenance actions that may be carried out in the attempt to prevent equipment degradation.

Hence, an appropriate modeling of the failure pattern of mechanical repairable equipment cannot but take into account the type of maintenance policy being employed. On the other hand, corrective and preventive maintenance actions are generally classified just in terms of their effect on the operating conditions of the equipment (e.g. see Pham and Wang [2]). In particular:

1. if the maintenance action restores the equipment to the same as new condition, the maintenance is defined as a perfect maintenance;
2. if the maintenance action brings the equipment to the condition it was in just before the maintenance (same as old), the maintenance is a minimal maintenance;
3. if the maintenance action significantly improves the equipment condition, even without bringing the equipment to a seemingly new condition (better than old), the maintenance is an imperfect maintenance;
4. if the maintenance action brings the equipment to a worse condition than it was in just before the maintenance (worse than old), the maintenance is a worse maintenance.

Of course, the classification of a given maintenance action depends not only on the action that is performed (repair or substitution of the failed part, major overhaul of the equipment, and so on), but also on the physical structure and complexity of the equipment. For example, the substitution of a failed part of complex equipment does not generally produce an improvement to the equipment conditions, and hence can be classified as a minimal repair. On the contrary, if the equipment is not complex, then the same substitution can produce a noticeable improvement and could be classified as an imperfect repair.

In Section 18.2, the basic concepts of stochastic point processes are illustrated. Sections 18.3–18.6 deal with stochastic models used to describe the failure pattern of repairable equipment subject to perfect, minimal, imperfect or worse maintenance, as well as to some complex maintenance policies. The final section illustrates models that describe the failure pattern of equipment undergoing development programs that are carried out to improve reliability through design modifications (reliability growth). Of course, in this context, the observed failure pattern is also strongly influenced by design changes.

Owing to obvious limitations of space, this discussion cannot be exhaustive; it simply intends to illustrate those models that have been mainly referred to in reliability analysis.

18.2 Stochastic Point Processes

The most commonly used way to model the reliability of repairable equipment is through stochastic point processes. A stochastic point process is a probability model to describe a physical phenomenon that is characterized by highly localized events that occur randomly in the continuum [1] (p.17). In reliability modeling, the events are the equipment failures and the continuum is the operating time (or any other measure of the actual use of the equipment).

Suppose that the failures of repairable equipment, observed over time t ≥ 0, occur at times t1 < t2 < · · · . Time ti (i = 1, 2, . . . ), which is measured from zero, is called the arrival time to the ith failure. If repair times are negligible, or are measured on a different time scale, then the interarrival times xi = ti − ti−1 (where t0 = 0 and i = 1, 2, . . . ) represent the operating times between the (i − 1)th and ith failures, and constitute the failure pattern of the equipment.

Let N(s, t) denote the (integer valued) random variable counting the number of failures that occur in the time interval (s, t). The history of the process includes both the number of failures occurring in (s, t) and the times at which they occur. Such a point process can be specified mathematically in several ways, e.g. via the joint distributions of the counts of failures in arbitrary time intervals. One convenient way [3] is via its complete (or conditional) intensity function (CIF) defined as [4] (p.9):

\[
\lambda(t; H_t) = \lim_{\Delta t \to 0} \frac{\Pr\{N(t, t + \Delta t) \ge 1 \mid H_t\}}{\Delta t}
\tag{18.1}
\]

where Ht = {N(0, s) : 0 ≤ s ≤ t} represents the entire history of the process through time t. For small Δt values, the product λ(t; Ht)Δt is approximately equal to the conditional probability that at least one failure (not necessarily the first) occurs in the time interval (t, t + Δt), given the history up to t [3].

The expectation of the number N(t) = N(0, t) of failures up to t, say M(t) = E{N(t)}, is a non-decreasing right continuous function. Its derivative µ(t) = M′(t), if it exists, is called the instantaneous rate of occurrence of failures (ROCOF) and represents the time rate of change of the expected number of failures [1] (p.19). If the probability of a failure at any specified point is zero, then the expected number of failures M(t) is continuous everywhere and the process is said to be regular [5] (p.15–16).

If simultaneous failures cannot occur, so that the ti are distinct, then the point process is said to be orderly [1] (p.23). Under orderliness conditions, the ROCOF µ(t) numerically equals the (unconditional) intensity function λ(t), if they exist, which is defined as

\[
\lambda(t) = \lim_{\Delta t \to 0} \frac{\Pr\{N(t, t + \Delta t) \ge 1\}}{\Delta t}
\tag{18.2}
\]

The intensity λ(t) can be viewed as the expectation of the CIF (Equation 18.1) over the entire space of possible histories. Of course, if the CIF is independent of Ht then the two intensity functions coincide. Also, for small Δt, the product λ(t)Δt is approximately equal to the (unconditional) probability that at least one failure occurs in the time interval (t, t + Δt) [6]. Note that, if the process is not orderly, then the ROCOF is greater than the intensity function λ(t) [4] (p.26) and does not specify the process uniquely [3].

The assumption of no simultaneous failures is usually reasonable in most applications and allows useful distribution functions to be derived from the CIF.

Under orderliness conditions, the CIF can be related to the conditional distribution fi(t; Ht) of the arrival time ti of the ith failure. Suppose that the (i − 1)th failure has occurred at ti−1; then, for t > ti−1:

\[
\lambda(t; H_t) = \frac{f_i(t; H_t)}{1 - F_i(t; H_t)} \qquad t \ge t_{i-1}
\tag{18.3}
\]

where Ht is the observed history that consists of (i − 1) failures. Hence, the conditional density function fi(t; Ht) of ti is

\[
f_i(t; H_t) = \lambda(t; H_t) \exp\left[-\int_{t_{i-1}}^{t} \lambda(z; H_z)\, dz\right] \qquad t \ge t_{i-1}
\tag{18.4}
\]

From Equation 18.4, the conditional density function fi(x; Ht) of the ith interarrival time xi = ti − ti−1 is

\[
f_i(x; H_t) = \lambda(t_{i-1} + x; H_{t_{i-1}+x}) \exp\left[-\int_{0}^{x} \lambda(t_{i-1} + z; H_{t_{i-1}+z})\, dz\right] \qquad x \ge 0
\tag{18.5}
\]
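To illustrate how a CIF can be used computationally, the hedged Python sketch below simulates failure times from a given intensity by the standard thinning (Lewis–Shedler) method; the increasing power-law intensity is an assumption made only for this example.

import numpy as np

# Hedged sketch: simulate a point process on (0, T] from its intensity by
# thinning. Candidate points are generated at the bounding rate lam_max and
# accepted with probability lam(t)/lam_max; for an increasing intensity the
# bound on (0, T] is attained at T.

def intensity(t, beta=2.0, eta=100.0):
    return (beta / eta) * (t / eta) ** (beta - 1.0)   # assumed power-law form

def simulate_by_thinning(T, rng=None):
    rng = np.random.default_rng(rng)
    lam_max = intensity(T)
    t, times = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)
        if t > T:
            return np.array(times)
        if rng.random() < intensity(t) / lam_max:
            times.append(t)

print(simulate_by_thinning(T=500.0, rng=7))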

The equipment reliability R(t, t + w), defined as the probability that the equipment operates without failing over the time interval (t, t + w), generally depends on the entire history of the process, and hence can be expressed as [1] (p.24–25)

\[
R(t, t+w) \equiv \Pr\{N(t, t+w) = 0 \mid H_t\}
\tag{18.6}
\]

For any orderly point process, the equipment reliability is

\[
R(t, t+w) = \exp\left[-\int_{t}^{t+w} \lambda(x; H_x)\, dx\right]
\tag{18.7}
\]

The forward waiting time w_t is the time to the next failure measured from an arbitrary time t: w_t = t_{N(t)+1} − t, where t_{N(t)+1} is the time of the next failure. The cumulative distribution function of w_t is related to the equipment reliability (Equation 18.6) by [1] (p.23–25):

\[
K_t(w) = \Pr\{w_t \le w \mid H_t\} = \Pr\{N(t, t+w) \ge 1 \mid H_t\} = 1 - R(t, t+w)
\tag{18.8}
\]

which, under orderliness conditions, becomes

\[
K_t(w) = 1 - \exp\left[-\int_{0}^{w} \lambda(t + x; H_{t+x})\, dx\right]
\tag{18.9}
\]

Of course, when the forward waiting time w_t is measured from a failure time, say t_{N(t)}, then w_t coincides with the interarrival time x_{N(t)+1} = t_{N(t)+1} − t_{N(t)}.

For any orderly process, the likelihood relative to a time-truncated sample t1 < t2 < · · · < tn ≤ T is expressible in the form [7]

\[
L(\theta) = \prod_{i=1}^{n} \lambda(t_i; H_{t_i}) \exp\left[-\int_{0}^{T} \lambda(t; H_t)\, dt\right]
\tag{18.10}
\]

where θ is the vector of unknown parameters of the CIF. For failure-truncated sampling, T ≡ tn.
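As a concrete (and hedged) example of using Equation 18.10, for a minimal-repair process with power-law intensity λ(t) = (β/η)(t/η)^{β−1} the likelihood has well-known closed-form maximizers, sketched below with hypothetical failure times:

import numpy as np

# Hedged sketch: closed-form ML estimates from Equation 18.10 for the special
# case of an NHPP with power-law intensity, observed until a fixed time T
# (time-truncated sampling).

def fit_power_law(times, T):
    times = np.asarray(times, dtype=float)
    n = len(times)
    beta_hat = n / np.sum(np.log(T / times))   # beta_hat > 1 suggests deterioration
    eta_hat = T / n ** (1.0 / beta_hat)        # solves (T/eta)**beta = n
    return beta_hat, eta_hat

failure_times = [55.0, 166.0, 205.0, 341.0, 488.0, 567.0, 731.0, 768.0, 895.0]  # hypothetical
print(fit_power_law(failure_times, T=1000.0))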

Finally, the integrated CIF over the time between two successive failures, say e_i = ∫_{t_{i−1}}^{t_i} λ(t; Ht) dt, under orderliness conditions, is distributed as a standard exponential random variable. This property provides a useful graphical tool for checking whether the assumed stochastic model adequately describes an observed data set [3]. In fact, if the assumed model is satisfactory, then the estimates of the generalized residuals e_i should look roughly like a standard exponential sample.
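A minimal sketch of this residual check, assuming a fitted power-law cumulative intensity M(t) = (t/η)^β (e.g. from the previous example), is:

import numpy as np

# Hedged sketch: generalized residuals e_i = M(t_i) - M(t_{i-1}) under an
# assumed power-law cumulative intensity. If the model is adequate, the e_i
# should resemble a standard exponential sample (mean close to 1).

def generalized_residuals(times, beta, eta):
    m = (np.asarray(times, dtype=float) / eta) ** beta
    return np.diff(np.concatenate(([0.0], m)))

res = generalized_residuals([55.0, 166.0, 205.0, 341.0, 488.0], beta=1.2, eta=50.0)
print(res, res.mean())   # follow up with an exponential probability plot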

Since the CIF characterizes the failure pattern of the repairable equipment, its form depends both on the failure mechanisms acting on the equipment during its operating life and on the maintenance policy that is performed. We assume that the CIF can be discontinuous with operating time and that discontinuities can occur only at the (corrective or preventive) maintenance epochs, so that the CIF is continuous in any time interval between two successive maintenance epochs.

In the following, some point processes that satisfy the reasonable orderliness conditions and meet practical applications in the reliability analysis of mechanical equipment are presented, according to the type of maintenance policy being employed.

18.3 Perfect Maintenance

Perfect maintenance is a maintenance action that brings the equipment to a like-new condition. In other words, each time the equipment is perfectly maintained, it is as though we are starting over with a new piece of equipment. Then, if a perfect maintenance is performed at time τ, the CIF at any time t ≥ τ depends only on the history of the process from τ up to t.

If the equipment is perfectly repaired at each failure, then its failure pattern can be described by a renewal process (same as new process). The CIF of a renewal process depends on the operating time t and on the history Ht only through the difference t − t_{N(t)}, where t_{N(t)} is the time of the most recent failure prior to t:

\[
\lambda(t; H_t) = h(t - t_{N(t)})
\tag{18.11}
\]

The CIF (Equation 18.11) implies that the times xi (i = 1, 2, . . . ) between successive failures are independent and identically distributed random variables with hazard rate r(x) numerically equal to h(x), so the failure pattern shows no trend with time (times between failures do not tend to become smaller or larger with operating time).

As pointed out by many authors (e.g. see Thompson [5] (p.49) and Ascher [8]), a renewal process is generally inappropriate to describe the failure pattern of repairable equipment. In fact, it is implausible that each repair actually brings the equipment to a same as new condition, and thus the equipment experiences deterioration (times between failures tend to become smaller with time) that cannot be modeled by a renewal process. Likewise, the renewal process cannot be used to model the failure pattern of equipment that experiences reliability improvement during the earlier period of its operating life.

A special form of renewal process arises when the CIF (Equation 18.11) is constant with the operating time t: λ(t; Ht) = ρ, so that the interarrival times xi are independent and identically exponentially distributed random variables. Such a process is also a same as old model, since each repair restores the equipment to the condition it was in just before the repair (each repair is both perfect and minimal). For convenience of exposition, such a process will be discussed later, in Section 18.4.1.

18.4 Minimal Repair

Minimal repair is a corrective maintenance action that brings the repaired equipment to the condition it was in just before the failure occurrence. Thus, if a minimal repair is carried out at time ti, then

\[
\lambda(t_i^{+}; H_{t_i^{+}}) = \lambda(t_i^{-}; H_{t_i^{-}})
\tag{18.12}
\]

and the equipment reliability is unchanged by failure and repair (lack of memory property). If the equipment is minimally repaired at each failure, and preventive maintenance is not carried out, then the CIF is continuous everywhere and does not depend on the history of the process but only on the operating time t:

\[
\lambda(t; H_t) = \rho(t)
\tag{18.13}
\]

In this case, ρ(t) is both the unconditionalintensity function and the ROCOF µ(t). Theassumption of minimal repair implies that thefailure process is a Poisson process [6,9], i.e. a pointprocess with independent Poisson distributedincrements. Thus:

Pr{N(t, t + Δ) = k} = {[M(t, t + Δ)]^k/k!} exp[−M(t, t + Δ)]   k = 0, 1, 2, . . .   (18.14)

where M(t, t + Δ) is the expected number of failures in the time interval (t, t + Δ):

M(t, t + Δ) = E{N(t, t + Δ)} = ∫_t^{t+Δ} µ(z) dz   (18.15)

For a Poisson process, the interarrival times x_i = t_i − t_{i−1} generally are neither identically nor independently distributed random variables. In fact, the conditional density function of x_i depends on the occurrence time t_{i−1} of the most recent failure:

f_i(x; t_{i−1}) = ρ(t_{i−1} + x) exp[−∫_0^x ρ(t_{i−1} + z) dz]   x ≥ 0   (18.16)

Then, unlike the renewal process, only the first interarrival time x_1 and, then, the arrival time t_1 of the first failure have hazard rate numerically equal to ρ(t). Also, the generalized residual e_i = ∫_{t_{i−1}}^{t_i} λ(t; H_t) dt equals the expected number of failures in the time interval (t_{i−1}, t_i), i.e. for a Poisson process: e_i = M(t_{i−1}, t_i).

The form of Equation 18.13 enables one to model, in principle, the failure pattern of any repairable equipment subjected to minimal repair, both when a complex behavior is observed in failure data and when a simpler behavior arises. Several testing procedures have been developed for testing the absence of trend with operating time against the alternative of a simple behavior in failure data (such as monotonic or bathtub-type trend).

One of the earliest tests is the Laplace test [10] (p.47), which is based, under a failure-truncated sampling, on the statistic [1] (p.79)

LA = [∑_{i=1}^{n−1} t_i − (n − 1)t_n/2] / [t_n √((n − 1)/12)]   (18.17)

where t_i (i = 1, . . . , n) are the occurrence times of the observed failures. Under the null hypothesis of constant intensity, LA is approximately distributed as a standard normal random variable. Then, a large positive (large negative) value of LA provides evidence of a monotonically increasing (decreasing) trend with time, i.e. of deteriorating (improving) equipment. The Laplace test is the uniformly most powerful unbiased (UMPU) test when the failure data come from a Poisson process with log–linear intensity (see Section 18.4.2.2).

The test procedure, which generally performs well in testing the null hypothesis of constant intensity against a broad class of Poisson processes with monotonically increasing intensity [11], is based, for failure-truncated samples, on the statistic

Z = 2 ∑_{i=1}^{n−1} ln(t_n/t_i)   (18.18)

Under the null hypothesis of constant intensity, Z is exactly distributed as a χ² random variable with 2(n − 1) degrees of freedom [12]. Small (large) values of Z provide evidence of a monotonically increasing (decreasing) trend. The Z test is UMPU when the failure data actually come from a Poisson process with power-law intensity (see Section 18.4.2.1).

Slightly different forms of LA and Z arise when the failure data come from a time-truncated sampling, i.e. when the process is observed until a prefixed time T [11]:

LA = [∑_{i=1}^n t_i − nT/2] / [T √(n/12)]   and   Z = 2 ∑_{i=1}^n ln(T/t_i)   (18.19)

Under a time-truncated sampling, LA is still approximately normally distributed and Z is χ² distributed with 2n degrees of freedom.

Both LA and Z can erroneously provide evidence of trend when data arise from the more general renewal process, rather than from a Poisson process with constant intensity. Then, if a time trend is to be tested against the null hypothesis of a general renewal process, one should use the Lewis and Robinson test, based on the statistic [13]

LR = LA [x̄² / (∑_{i=1}^n (x_i − x̄)²/(n − 1))]^{1/2}   (18.20)

where x_i (i = 1, . . . , n) are the times between successive failures and x̄ = ∑_{i=1}^n x_i/n = t_n/n. Under the null hypothesis of a general renewal process, LR is approximately distributed as a standard normal random variable, and then large positive (large negative) values of LR provide evidence of deteriorating (improving) equipment.

The above test procedures are often not effective in detecting deviation from no trend when there is a non-monotonic trend, i.e. when the process has a bathtub or inverted bathtub intensity. For testing non-monotonic trends, Vaurio [14] proposed three different test statistics under a time-truncated sampling. In particular, the statistic

V1 = [∑_{i=1}^n |t_i − T/2| − nT/4] / [T √(n/48)]   (18.21)

under the null hypothesis of constant intensity, is approximately distributed as a standard normal random variable. Then, large positive (large negative) values of V1 indicate a bathtub-type (inverted bathtub-type) behavior in failure data.
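As a concrete illustration, the following sketch (in Python with numpy; the function name trend_tests and its interface are ours, not from the cited literature) evaluates the four statistics in their time-truncated forms of Equations 18.19–18.21, applying the interarrival-time correction of Equation 18.20 to the time-truncated LA:

    import numpy as np

    def trend_tests(t, T):
        """LA, Z, LR and V1 for failure times t[0] < ... < t[n-1] observed
        on (0, T] (time-truncated forms, Equations 18.19-18.21)."""
        t = np.asarray(t, dtype=float)
        n = len(t)
        LA = (t.sum() - n * T / 2.0) / (T * np.sqrt(n / 12.0))
        Z = 2.0 * np.log(T / t).sum()               # chi-square with 2n d.f.
        x = np.diff(np.concatenate(([0.0], t)))     # times between failures
        cv2 = ((x - x.mean()) ** 2).sum() / (n - 1) / x.mean() ** 2
        LR = LA / np.sqrt(cv2)                      # Equation 18.20
        V1 = (np.abs(t - T / 2.0).sum() - n * T / 4.0) / (T * np.sqrt(n / 48.0))
        return LA, Z, LR, V1

LA, LR and V1 are referred to standard normal quantiles, and Z to χ² quantiles with 2n degrees of freedom; for failure-truncated data, the forms of Equations 18.17, 18.18 and 18.20 apply instead.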

A very useful way to investigate the presence and nature of trend in the failure pattern is by plotting the cumulative number i of failures that occurred until the observed failure time t_i (i = 1, 2, . . . , n) against the times t_i. If the plot is approximately linear, then no trend is present in the observed data. A concave (convex) plot is representative of an increasing (decreasing) trend, whereas an inverted S-shaped plot suggests the presence of a non-monotonic trend with bathtub-type intensity.

18.4.1 No Trend with Operating Time

If the failure pattern of minimally repaired equipment does not show any trend with time, i.e. times between failures do not tend to become smaller or larger with time, then it can be described by a homogeneous Poisson process (HPP) [4]. The HPP is a Poisson process with constant intensity:

λ(t; H_t) = ρ   (18.22)

so that the interarrival times x_i are independent and identically exponentially distributed random variables with mean 1/ρ. Such a Poisson process is called homogeneous since it has stationary increments; Pr{N(t, t + Δ) = k} does not depend on the time t, only on Δ:

Pr{N(t, t + Δ) = k} = [(ρΔ)^k/k!] exp(−ρΔ)   k = 0, 1, 2, . . .   (18.23)

The forward waiting time w_t is exponentially distributed with mean 1/ρ, too, and the failure times t_k are gamma distributed with scale parameter 1/ρ and shape parameter k. The expected number of failures increases linearly with time: M(t) = ρt. Hence, if a failure data set follows the HPP, then the plot of the cumulative number of failures i (i = 1, 2, . . . ) versus the observed failure times t_i should be reasonably close to a straight line with slope ρ.

Suppose that a process is observed from t = 0 until a prefixed time T, and that n failures occur at times t_1 < t_2 < · · · < t_n ≤ T. The likelihood relative to this data set is

L(ρ) = ρ^n exp(−ρT)   (18.24)

and the maximum likelihood (ML) estimate of the intensity ρ depends only on the number of observed failures and not on the failure times: ρ̂ = n/T.

When data consist of interarrival times, statistical methods developed for the exponential distribution can be used [15]. Methods based on counts of failures in fixed time intervals are discussed by Cox and Lewis [10] (p.29–36), and Bayesian procedures for predicting the number of failures in a future time interval are given in Beiser and Rigdon [16].

Despite its restrictive hypotheses, the HPP has been successfully used to describe the failure pattern of repairable mechanical equipment over portions of operating life, such as the so-called useful life [17], i.e. the phase that follows the initial life period of early failures and precedes the degradation period. Likewise, when defective parts and/or assembling defects are screened out completely by the manufacturer so that early failures do not occur, the HPP can model adequately the entire initial period of the operating life (which generally includes the warranty period) until degradation phenomena appear. Of course, the HPP cannot describe reliability degradation or improvement, and then caution has to be used in making predictions if the HPP is used to model the failure pattern both in the observation period and in future intervals.

Application examples of the HPP are:

• air-conditioning equipment in Boeing 720 aircraft, first given by Proschan [18] and later analyzed by Lawless and Thiagarajah [3] and Cox and Lewis [10]
• the hydraulic systems of some load–haul–dump machines deployed at Kiruna mine [19]
• modified hydraulic excavators supplied to cement plants, coal mines, and iron ore mines, and observed during the warranty period [20].

18.4.2 Monotonic Trend with Operating Time

In many cases, mechanical equipment subject to minimal repairs experiences reliability improvement or deterioration with the operating time, i.e. there is a tendency toward less or more frequent failures [8]. In such situations, the failure pattern can be described by a Poisson process whose intensity function (Equation 18.13) monotonically decreases or increases with t. A Poisson process with a non-constant intensity is called non-homogeneous, since it does not have stationary increments.

18.4.2.1 The Power Law Process

The best known and most frequently used non-homogeneous Poisson process (NHPP) is the power law process (PLP), whose intensity (Equation 18.13) is

ρ(t) = (β/α)(t/α)^{β−1}   α, β > 0   (18.25)

The intensity function (Equation 18.25) increases (decreases) with t when the power-law parameter β > 1 (β < 1), and thus the PLP can describe both deterioration and reliability improvement. If β = 1, the PLP reduces to the HPP. The expected number of failures M(t) = (t/α)^β is linear with t in a log–log scale: ln[M(t)] = β ln(t) − β ln(α). Hence, if a failure data set follows the PLP, then the plot of the cumulative number of failures i (i = 1, 2, . . . ) versus the observed failure times t_i should be reasonably close to a straight line with slope β in a coordinate system with both logarithmic scales.

Some objections are raised to the PLP when it describes improving equipment. In fact, in such a situation (β < 1), the PLP intensity (Equation 18.25) is unrealistically infinite at t = 0 and tends to zero as t approaches infinity. Nevertheless, the PLP enjoys large popularity in the analysis of failure data of repairable equipment, partly due to the simplicity of statistical inference procedures. In fact, given the first n successive failure times t_1 < t_2 < · · · < t_n in a failure-truncated sampling, the likelihood under the hypothesis of a PLP is surprisingly simple:

L(α, β) = (β/α)^n [∏_{i=1}^n (t_i/α)^{β−1}] exp[−(t_n/α)^β]   (18.26)

so that the closed form of the ML estimators of the parameters β and α can be obtained:

β̂ = n / ∑_{i=1}^n ln(t_n/t_i)   α̂ = t_n/n^{1/β̂}   (18.27)

These estimators are joint sufficient statistics, and some useful pivotal quantities can be defined. Again, in most cases, confidence intervals can be obtained using existing tables.
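For instance, a minimal sketch of Equation 18.27 (Python with numpy; the function name is ours) is:

    import numpy as np

    def plp_ml(t):
        """Closed-form ML estimates (Equation 18.27) of the PLP parameters
        from a failure-truncated sample t[0] < ... < t[n-1]."""
        t = np.asarray(t, dtype=float)
        n, tn = len(t), t[-1]
        # the i = n term of the sum, ln(tn/tn), is zero and is omitted
        beta = n / np.log(tn / t[:-1]).sum()
        alpha = tn / n ** (1.0 / beta)
        return alpha, beta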

Results on ML estimation of PLP parameters, both for time-truncated and failure-truncated data, can be found in Crow [12], Finkelstein [21], Lee and Lee [22], Bain and Engelhardt [23], and Calabria et al. [24]. Testing procedures for comparing the scale parameters α or the power-law parameters β of several independent PLPs were given by Crow [12], Lee [25], and Calabria et al. [26]. Also, sequential testing procedures for the power-law parameter of one or more PLPs can be found in Bain and Engelhardt [27]. When only count data from several identical PLPs are available, estimation procedures based on parametric bootstrap are given in Calabria et al. [28].

Goodness-of-fit tests for checking the hypothesis that the observed failure data came from a PLP are discussed in Crow [12], Rigdon and Basu [29] (who also give a complete review both of PLP properties and of inferential and testing procedures), Park and Kim [30], and Baker [31]. Also, methods for obtaining ML prediction limits for future failure times of a PLP can be found in Lee and Lee [22], Engelhardt and Bain [32], and Calabria et al. [33].

A complete discussion on ML inferential, prediction and testing procedures, both for time- and failure-truncated sampling, can be found in Bain and Engelhardt [34] (p.415–449). In addition, a useful FORTRAN program is illustrated by Rigdon et al. [35]; this allows ML estimates of PLP parameters and intensity function to be computed, as well as goodness-of-fit tests to be performed, both for single and multiple equipment.

Inferential procedures on the PLP parameters and functions thereof, based on the Bayesian approach, can be found in Guida et al. [36], Bar-Lev et al. [37], and Campodónico and Singpurwalla [38]. Bayesian prediction procedures for future failure times are given in Calabria et al. [39] and Bar-Lev et al. [37]. Point and interval predictions for the number of failures in some given future interval can be found in Bar-Lev et al. [37], Campodónico and Singpurwalla [38], and Beiser and Rigdon [40]. These Bayes procedures allow the unification of the two sampling protocols (failure- and time-truncated) under a single analysis and can provide more accurate estimates and predictions when prior information or technical knowledge on the failure mechanism is available.

Finally, graphical methods for analyzing failure data of several independent PLPs can be found in Lilius [41], Härtler [42], and Nelson [43], on the basis of non-parametric estimates of the intensity function or of the mean cumulative number of failures.

A very large number of numerical applications show the adequacy of the PLP in describing the failure pattern of mechanical equipment experiencing reliability improvement or degradation. Some examples follow:

• a 115 kV transmission circuit in New Mexico, given by Martz [44] and analyzed by Rigdon and Basu [29];
• several machines (among which are diesel engines and loading cranes) [45];
• several automobiles (1973 AMC Ambassador) [46];
• the hydraulic systems of some load–haul–dump machines deployed at Kiruna mine [19];
• an airplane air-conditioning equipment [18], denoted as Plane 6 by Cox and Lewis [10] and analyzed by Park and Kim [30] and Ascher and Hansen [47];
• American railroad tracks [38];
• bus engines of the Kowloon Motor Bus Company in Hong Kong [48].

18.4.2.2 The Log–Linear Process

A second well accepted form of the NHPP is the log–linear process (LLP), whose intensity (Equation 18.13) has the form

ρ(t) = exp(α + βt)   −∞ < α, β < ∞   (18.28)

This process was first proposed by Cox and Lewis [10] (p.45) and can describe monotonic trends in the failure data: reliability improvement when β < 0, and deterioration when β > 0. When β = 0, the LLP reduces to the HPP. Unlike the PLP, the LLP intensity (Equation 18.28) is finite at t = 0 when a reliability improvement is described, and it is greater than zero when deterioration is described. Since the increasing LLP intensity takes non-negligible values at the beginning of the observed period, the LLP is able to describe the occurrence of a small number of failures at the beginning of the operating time, even of equipment that on the whole deteriorates. In addition, the LLP with β > 0 can adequately model repairable equipment with extremely rapid deterioration, since the intensity (Equation 18.28) is increasing at an exponential rate with time.

The expected number of failures for an LLP is M(t) = exp(α)[exp(βt) − 1]/β and, under a failure-truncated sampling, the likelihood results in

L(α, β) = exp{nα + β ∑_{i=1}^n t_i − exp(α)[exp(βt_n) − 1]/β}   (18.29)

Since the observed data t_i (i = 1, 2, . . . , n) enter the likelihood only through (t_n, ∑_{i=1}^n t_i), these are sufficient statistics. The ML estimates of the LLP parameters are not in closed form and can be found from

∑_{i=1}^n t_i + n/β̂ − n t_n/[1 − exp(−β̂ t_n)] = 0

α̂ = ln{n β̂/[exp(β̂ t_n) − 1]}   (18.30)
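Numerically, β̂ can be obtained by root-finding on the first of Equations 18.30, and α̂ then follows in closed form. A minimal sketch (Python with numpy and scipy; the bracketing interval is a heuristic of ours and may need widening for strongly trended data):

    import numpy as np
    from scipy.optimize import brentq

    def llp_ml(t):
        """ML estimates of the LLP parameters (Equations 18.30) for a
        failure-truncated sample t[0] < ... < t[n-1]."""
        t = np.asarray(t, dtype=float)
        n, tn = len(t), t[-1]

        def score(beta):
            # left-hand side of the first of Equations 18.30
            return t.sum() + n / beta - n * tn / (1.0 - np.exp(-beta * tn))

        # the score has a removable singularity at beta = 0 (the HPP limit),
        # so the bracket is kept away from exactly zero
        eps = 1e-6 / tn
        if score(eps) > 0.0:
            beta = brentq(score, eps, 50.0 / tn)
        else:
            beta = brentq(score, -50.0 / tn, -eps)
        alpha = np.log(n * beta / (np.exp(beta * tn) - 1.0))
        return alpha, beta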

Inferential and testing procedures can be found in Ascher and Feingold [9], Cox and Lewis [10] (p.45) and Lawless [15] (p.497–499). A goodness-of-fit test based on the Cramér–von Mises statistic was described by Ascher and Feingold [9], whereas Lawless [15] used the ML estimates of the generalized residuals

ê_i = M̂(t_{i−1}, t_i) = exp(α̂)[exp(β̂ t_i) − exp(β̂ t_{i−1})]/β̂   i = 1, . . . , n   (18.31)

in order to assess the adequacy of the LLP. Testing procedures for comparing the improvement or deterioration rate of two types of equipment operating in two different environmental conditions were given by Aven [49]. Bayesian inferential and prediction procedures, based on prior information on the initial intensity ρ_0 = exp(α) and on the aging rate β, have recently been proposed by Huang and Bier [50].

Since classical inferential procedures for the LLP are not as easy as for the PLP, the LLP has not received the large attention devoted to the PLP. Nevertheless, the LLP has often been used to analyze the failure data of repairable mechanical equipment:

• the air-conditioning equipment [18] in Boeing 720 airplanes, denoted as Planes 2 and 6 by Cox and Lewis [10] and analyzed also by Lawless and coworkers [3, 15] and Aven [49];
• main propulsion diesel engines installed on several submarines [9];
• a number of hydraulic excavator systems [20];
• a turbine-driven pump in a pressurized water reactor auxiliary feedwater system [50].

18.4.2.3 Bounded Intensity Processes

Both the PLP and the LLP intensity functions used to model deterioration tend to infinity as the operating time increases. However, the repeated application of repairs or substitutions of failed parts in mechanical equipment subject to reliability deterioration sometimes produces a finite bound for the increasing intensity function. In such a situation, the equipment becomes composed of parts with a randomized mix of ages, so that its intensity settles down to a constant value (bounded increasing intensity). This result is known as Drenick's theorem [1] and can be observed when the operating time of the equipment is very large.

A simple NHPP with bounded intensity function was suggested by Engelhardt and Bain [51]:

ρ(t) = (1/θ)[1 − (1 + t/θ)^{−1}]   θ > 0   (18.32)

However, because of its analytical simplicity, this one-parameter model does not seem adequate for describing actual failure data.

A two-parameter NHPP (called the bounded intensity process (BIP)) has recently been proposed by Pulcini [52]. Its intensity has the form

ρ(t) = α[1 − exp(−t/β)]   α, β > 0   (18.33)

This increasing intensity is equal to zero at t = 0 and approaches the asymptote α as t tends to infinity. The parameter β is a measure of the initial increasing rate of the intensity (Equation 18.33): the smaller β is, the faster the intensity increases initially until it approaches α. As β tends to zero, the intensity (Equation 18.33) approaches α, and then the BIP includes the HPP as a limiting form. The expected number of failures up to a given time t is

M(t) = α{t − β[1 − exp(−t/β)]}   (18.34)

and the likelihood in a time-truncated sampling is

L(α, β) = α^n {∏_{i=1}^n [1 − exp(−t_i/β)]} exp(−α{T − β[1 − exp(−T/β)]})   (18.35)

The ML estimates α̂ and β̂ of the BIP parameters are given by

α̂ = n / {T − β̂[1 − exp(−T/β̂)]}

n [1 − exp(−T/β̂) − (T/β̂) exp(−T/β̂)] / {(T/β̂) − [1 − exp(−T/β̂)]} = ∑_{i=1}^n (t_i/β̂) exp(−t_i/β̂) / [1 − exp(−t_i/β̂)]   (18.36)

For the case of failure-truncated sampling, T ≡ t_n. Approximate confidence intervals can be obtained on the basis of the asymptotic distribution of α̂ and β̂, and a testing procedure for time trend against the HPP can be performed on the basis of the likelihood ratio statistic. The BIP has been used in analyzing the failure data (in days) of automobile #4 [46] and the failure data of the hydraulic system of the load–haul–dump LHD17 machine [19].
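A numerical sketch of Equations 18.36 (Python with numpy/scipy; the function name and bracketing heuristic are ours) solves the second equation for β̂ and then recovers α̂; if the data show no increasing trend, the root may not lie in the bracket and the function g should be inspected:

    import numpy as np
    from scipy.optimize import brentq

    def bip_ml(t, T):
        """ML estimates of the BIP parameters (Equations 18.36) for a
        time-truncated sample; for failure truncation set T = t[-1]."""
        t = np.asarray(t, dtype=float)
        n = len(t)

        def g(beta):
            u = T / beta
            lhs = n * (1.0 - np.exp(-u) - u * np.exp(-u)) / (u - 1.0 + np.exp(-u))
            v = t / beta
            rhs = (v * np.exp(-v) / (1.0 - np.exp(-v))).sum()
            return lhs - rhs

        beta = brentq(g, 1e-3 * T, 1e3 * T)   # heuristic bracket
        alpha = n / (T - beta * (1.0 - np.exp(-T / beta)))
        return alpha, beta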

18.4.3 Bathtub-type Intensity

In many cases, deteriorating mechanical equipment is also subject to early failures, because of the presence of defective parts or assembling defects that were not screened out through burn-in techniques by the manufacturer. Thus, the failure pattern has a bathtub-type behavior: at first the interarrival times increase with operating time, and then they decrease. A bathtub-type intensity is more typical of large and complex equipment with many failure modes. As stated by Nelson [43], a bathtub intensity is typical for 10 to 20% of products.

The presence of a bathtub intensity in an observed data set can be detected both by a graphical technique (the plot of the cumulative number of failures i against the failure times t_i should be inverted S-shaped) and by performing the testing procedures of Vaurio [14].

An NHPP suitable to describe bathtub behavior is the three-parameter process with intensity

ρ(t) = λγ t^{γ−1} exp(βt)   λ, γ > 0; −∞ < β < ∞   (18.37)

This process, for which the PLP and the LLP are special cases, was proposed by Lee [53] in order to assess the adequacy of the simpler processes to analyze a given data set. Ascher and Feingold [1] (p.85) and Lawless [15] (p.506) showed that, for β > 0 and γ < 1, the intensity given by Equation 18.37 has a bathtub-type behavior and so can model the failure pattern of equipment subject both to early failures and deterioration phenomena. However, to our knowledge, statistical inference procedures for the Lee model have not been developed yet.

Another NHPP with bathtub-type intensity, denoted the exponential power process (EPP),

ρ(t) = (β/η)(t/η)^{β−1} exp[(t/η)^β]   η > 0; 0 ≤ β ≤ 1   (18.38)

can be found in [54], where the EPP has been proposed in a wider context that includes minimal, perfect, and imperfect repairs. However, the EPP does not seem adequate to describe actual failure data because of its analytical simplicity.

More recently, an NHPP (called the S-PLP) that arises from the superposition [5] (p.68–71) of two PLPs with different power-law parameters has been proposed by Pulcini [55]. The S-PLP intensity has the form

ρ(t) = (β1/α1)(t/α1)^{β1−1} + (β2/α2)(t/α2)^{β2−1}   α1, β1, α2, β2 > 0   (18.39)

If, and only if, either β1 or β2 is less than 1 (with the other greater than 1), the intensity given by Equation 18.39 is non-monotonic and has a bathtub-type behavior. The expected number of failures up to a given time t is

M(t) = (t/α1)^{β1} + (t/α2)^{β2}   (18.40)

and the likelihood relative to a time-truncated sample t_1 < · · · < t_n ≤ T is

L(α1, β1, α2, β2) = {∏_{i=1}^n [(β1/α1)(t_i/α1)^{β1−1} + (β2/α2)(t_i/α2)^{β2−1}]} × exp[−(T/α1)^{β1} − (T/α2)^{β2}]   (18.41)

A closed form solution for the ML estimators of the S-PLP parameters does not exist, so that the ML estimates can be obtained by maximization of the modified three-parameter log-likelihood:

ℓ′(α1, β1, β2) = ∑_{i=1}^n ln{(β1/α1)(t_i/α1)^{β1−1} + [n − (T/α1)^{β1}](β2/T)(t_i/T)^{β2−1}} − n   (18.42)

subject to the nonlinear constraints α1 > T/n^{1/β1}, β1 > 0, and β2 > β1, which assure α2 > 0 and allow a unique solution to be obtained. For failure-truncated samples, T ≡ t_n.
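As a purely illustrative alternative to the constrained profile likelihood above, the full four-parameter log-likelihood of Equation 18.41 can be maximized directly (Python with numpy/scipy; the function name, the log-parametrization and the starting values are ours). After convergence, the two components can be relabeled so that β1 < β2:

    import numpy as np
    from scipy.optimize import minimize

    def splp_ml(t, T, start=(1.0, 0.5, 100.0, 2.5)):
        """Direct numerical ML for the S-PLP (likelihood of Equation 18.41);
        start = (alpha1, beta1, alpha2, beta2) is an arbitrary initial guess."""
        t = np.asarray(t, dtype=float)

        def negll(p):
            a1, b1, a2, b2 = np.exp(p)   # logs enforce positivity
            rho = (b1 / a1) * (t / a1) ** (b1 - 1.0) \
                + (b2 / a2) * (t / a2) ** (b2 - 1.0)
            return -(np.log(rho).sum() - (T / a1) ** b1 - (T / a2) ** b2)

        res = minimize(negll, np.log(start), method="Nelder-Mead",
                       options={"maxiter": 20000, "fatol": 1e-10})
        return np.exp(res.x)   # (alpha1, beta1, alpha2, beta2)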

A graphical approach, based on the transformations x = ln(t) and y = ln[M(t)], allows one to determine whether the S-PLP can adequately describe a given data set, and to obtain, if the sample size is sufficiently large, crude but easy estimates of the process parameters. In fact, if a data set follows the S-PLP model, then the plot (on log–log paper) of the cumulative number of failures versus the observed failure times t_i should be reasonably close to a concave curve, with a slope that increases monotonically from the smaller β value to the larger one.

The S-PLP has been used to analyze the failure data of a 180 ton rear dump truck [56], and the failure data of the diesel-operated load–haul–dump LHD-A machine operating in a Swedish mine [57]. The application of the Laplace, Lewis and Robinson, and Vaurio trend tests has provided evidence, in both failure data sets, of a bathtub-type intensity in a global context of increasing intensity.

18.4.3.1 Numerical Example

Let us consider the supergrid transmission failure data read from plots in Bendell and Walls [58]. The data consist of 134 failures observed until t_134 = 433.7 (in thousands of unspecified time units) and are given in Table 18.1. Bendell and Walls [58] analyzed the above data graphically and concluded that there was a decreasing trend of the failure intensity in the initial period, with a global tendency for the time between failures to become shorter, on average, over the entire observation period. To test the graphical conclusions of Bendell and Walls [58], the LA, Z, LR, and V1 test statistics are evaluated:

LA = 4.01   Z = 243.0   LR = 1.59   V1 = 1.62

and compared with the χ² and standard normal 0.10 and 0.90 quantiles. The Vaurio test provides evidence of a non-monotonic bathtub-type trend against the null hypothesis of no trend, and the LR test rejects the null hypothesis of a renewal process against the increasing trend alternative.

Figure 18.1. ML estimate of M(t) and observed cumulative number of failures in a log–log plot for supergrid transmission data in the numerical example

The LA and Z tests provide conflicting results. In fact, the Z test does not reject the null hypothesis of an HPP against the PLP alternative, whereas the LA test strongly rejects the HPP hypothesis against the hypothesis of an increasing log–linear intensity. The conflict between the LA and Z results depends on the fact that, unlike the PLP, the LLP is able to describe the occurrence of a small number of early failures in a deteriorating process. Furthermore, the LA test can be effective in detecting deviation from the HPP hypothesis even in the case of bathtub intensity, provided that early failures constitute a small part of the observed failures.

Thus, the supergrid transmission failure data show statistically significant evidence of a bathtub intensity in a global context of increasing intensity. The ML estimates of the S-PLP parameters are

β̂1 = 0.40   α̂1 = 0.074   β̂2 = 2.53   α̂2 = 70

and, since β̂1 < 1 and β̂2 > 1, the bathtub behavior is confirmed. In Figure 18.1 the ML estimate of M(t) is shown on a log–log plot and is compared with the observed cumulative number of failures.


Table 18.1. Supergrid transmission failure data in numerical example (read from plots in Bendell and Walls [58])

0.1 0.2 0.3 0.4 0.6 18.6 24.8 24.9 25.0 25.1 27.1 36.1
36.2 45.7 45.8 49.0 62.1 62.2 69.7 69.8 82.0 82.1 82.2 82.3
159.5 159.6 166.8 166.9 167.0 181.2 181.3 181.4 181.5 181.6 187.6 187.7
187.8 187.9 188.0 188.1 211.5 211.7 211.9 212.0 212.1 212.2 216.5 225.1
236.6 237.9 252.1 254.6 254.8 254.9 261.1 261.3 261.4 261.8 262.8 262.9
263.0 263.1 264.6 264.7 273.5 273.6 275.6 275.7 275.9 276.0 277.5 287.5
287.7 287.9 288.0 288.1 288.2 294.2 301.4 301.5 305.0 312.2 312.7 312.8
339.1 349.1 349.2 349.3 349.4 349.8 349.9 350.0 350.2 350.3 350.4 377.4
377.5 381.5 382.5 382.6 382.7 382.8 383.5 387.5 387.6 390.9 399.4 399.6
399.7 401.5 401.6 401.8 403.3 405.1 406.2 406.4 407.4 407.6 411.4 411.5
412.1 412.7 415.9 416.0 420.3 425.1 425.2 425.3 425.6 425.7 426.1 426.5
427.7 433.7

18.4.4 Non-homogeneous Poisson Process Incorporating Covariate Information

The influence of the operating and environmental conditions on the intensity function is often significant and cannot be disregarded. One way is to relate the intensity function not only to the operating time but also to explanatory variables, or covariates, that provide a measure of the operating and environmental conditions. Several methods of regression analysis for repairable equipment have been discussed by Ascher and Feingold [1] (p.96–99).

More recently, Lawless [59] gave methods of regression analysis for minimally repaired equipment based on the so-called proportional intensity Poisson process, which is defined as follows. An equipment with k × 1 covariate vector x has repeated failures that occur according to an NHPP with intensity:

ρ_x(t) = ρ_0(t) exp(x′a)   (18.43)

where ρ_0(t) is a baseline intensity with parameter vector θ and a is the vector of regression coefficients (regressors). Suppose that m units are observed over the time intervals (0, T_i) (i = 1, . . . , m) and that the ith equipment experiences n_i failures at times t_{i,1} < t_{i,2} < · · · < t_{i,n_i}. If the ith equipment has covariate vector x_i, the likelihood is

L(θ, a) = ∏_{i=1}^m [∏_{j=1}^{n_i} ρ_{x_i}(t_{i,j})] × exp[−(∫_0^{T_i} ρ_0(t) dt) exp(x_i′a)]   (18.44)

A semi-parametric analysis with ρ_0(t) unspecified has been described, as well as a fully parametric analysis based on the assumption of a power-law form for the baseline intensity: ρ_0(t) = νβt^{β−1}. ML inferential procedures, model checks, and tests have also been discussed. Under the power-law intensity assumption, the parameter ν is included in the regression function by defining exp(a_0) = ν and taking the covariate x_0 to be identically equal to one. When all T_i values are equal (T_i = T), the ML estimate of the power-law parameter is:

β̂ = n / ∑_{i=1}^m ∑_{j=1}^{n_i} ln(T/t_{i,j})   (18.45)

and the ML estimate of the regressors a can be obtained by solving iteratively:

∑_{i=1}^m n_i x_{i,r} − ∑_{i=1}^m T^{β̂} x_{i,r} exp(x_i′a) = 0   r = 0, 1, . . . , k   (18.46)

When the T_i values are unequal, the ML estimates can be obtained by maximization of the log-likelihood.
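A sketch for the equal-window case (Python with numpy/scipy; the design-matrix convention and function name are ours) computes β̂ from Equation 18.45 and solves the score equations of Equation 18.46 for the regressors:

    import numpy as np
    from scipy.optimize import fsolve

    def pipp_ml(times, X, T):
        """Proportional-intensity Poisson process with power-law baseline
        and a common window (0, T] (Equations 18.45 and 18.46).
        times : list of m arrays of failure times, one per unit
        X     : (m, k+1) matrix whose first column is all ones, so that
                a[0] plays the role of ln(nu)."""
        n_i = np.array([len(ti) for ti in times])
        n = n_i.sum()
        beta = n / sum(np.log(T / np.asarray(ti)).sum() for ti in times)

        def score(a):                    # Equation 18.46, r = 0, ..., k
            w = np.exp(X @ a)
            return X.T @ n_i - T ** beta * (X.T @ w)

        a0 = np.zeros(X.shape[1])
        a0[0] = np.log(n / (len(times) * T ** beta))   # crude starting value
        return beta, fsolve(score, a0)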

On the basis of the above model, Guida and Giorgio [60] analyzed failure data of repairable equipment in accelerated testing, where the covariate is the stress level at which the equipment is tested. They obtained an estimate of the baseline intensity ρ_0(t) under the use condition from failure data obtained under higher stresses. They assumed a power-law form for the baseline intensity, ρ_0(t) = (β/α_0)(t/α_0)^{β−1}, and a single acceleration variable (one-dimensional covariate vector) with two stress levels x_1 and x_2 > x_1, both greater than the use condition level. ML estimation of the regression coefficient a and of the parameters β and α_0 was also discussed, both when one unit and when several units are tested at each stress level. In the case of one unit tested until T_1 under stress level x_1 and another unit tested until T_2 under x_2, the ML estimators are in closed form:

β̂ = n / ∑_{i=1}^2 ∑_{j=1}^{n_i} ln(T_i/t_{i,j})

α̂_0 = (T_1/n_1^{1/β̂})^ξ (T_2/n_2^{1/β̂})^{1−ξ}

â = ln[(T_1/T_2)(n_2/n_1)^{1/β̂}]/(x_2 − x_1)   (18.47)

where n_1 and n_2 are the numbers of failures observed under x_1 and x_2 respectively, and ξ = x_2/(x_2 − x_1) is the extrapolation factor.

Proportional intensity models have found wide application, both in the case of minimal repair and when complex maintenance is carried out. A complete discussion on this topic has been presented by Kumar [61], where procedures for the estimation of regression coefficients and for the selection of a regression model are illustrated. Some proportional intensity models, specifically devoted to the analysis of complex maintenance policy, will be illustrated in Section 18.6.

18.5 Imperfect or Worse Repair

In many cases, the repair actions produce a noticeable improvement in the equipment condition, even without bringing the equipment to a new condition (imperfect repair). One explanation for this is that often a minimum time repair policy is performed rather than a minimal repair policy, in order to shorten the downtime of the equipment [62]. In other cases, repair actions can be incorrect or inject fresh faults into the equipment, so that the equipment after repair is in a worse condition than just before the failure occurrence (worse repair). This can happen when, for example, the repair actions are performed by inexperienced or unauthorized maintenance staff and/or under difficult conditions. Models that treat imperfect or worse maintenance situations have been widely proposed and studied. A complete discussion on this topic is given by Pham and Wang [2].

In the following, a number of families of models able to describe the effect of imperfect or worse repairs on the equipment condition are illustrated, with particular emphasis on reliability modeling. The case of complex maintenance policies, such as those of minimal repairs interspersed with periodic preventive overhauls or perfect repairs, will be treated in Section 18.6.

18.5.1 Proportional Age Reduction Models

A general approach to model an imperfect maintenance action was proposed by Malik [63], who assumed that each maintenance reduces the age of the equipment by a quantity proportional to the operating time elapsed from the most recent maintenance (proportional age reduction (PAR) model). Hence, each maintenance action is assumed to reduce only the damage incurred since the previous maintenance epoch. This approach has since been largely used to describe the effect of corrective or preventive maintenance on the equipment condition, especially when the repairable equipment is subject to a complex maintenance policy (see Sections 18.6.3 and 18.6.4).

In the case that, at each failure, the equipment is imperfectly repaired, the CIF for the PAR model is

λ(t; H_t) = h(t − ρt_{i−1})   0 ≤ ρ ≤ 1, t_{i−1} < t ≤ t_i   (18.48)

where ρ is the improvement parameter. Then, the CIF (Equation 18.48) shows jumps downward at each failure time t_i (i = 1, 2, . . . ), except for the special case of ρ = 0, when the PAR model reduces to an NHPP with intensity h(t) (minimal repair). The value of each jump depends on ρ and on the operating time since the most recent failure. When ρ = 1, the repair brings the equipment to a like-new condition (perfect repair) and the PAR model reduces to a renewal process.

The PAR model has been adopted by Shin et al. [64] in their Model A, where the equipment is assumed to be imperfectly repaired at each failure. Two different forms for the intensity function up to the first failure time have been considered, the power-law and the log–linear intensities, so that the CIF (Equation 18.48) results in

λ(t; H_t) = (β/α)[(t − ρt_{i−1})/α]^{β−1}   0 ≤ ρ ≤ 1, α, β > 0, t_{i−1} < t ≤ t_i   (18.49)

and

λ(t; H_t) = exp[α + β(t − ρt_{i−1})]   0 ≤ ρ ≤ 1, −∞ < α, β < ∞, t_{i−1} < t ≤ t_i   (18.50)

respectively. Depending on the values of the parameters, both the models in Equations 18.49 and 18.50 are able to describe the failure pattern of equipment subject to imperfect repairs and experiencing reliability improvement or deterioration. Figure 18.2 shows the plot of the PAR model with power-law intensity (Equation 18.49) of deteriorating equipment (β > 1) for arbitrary failure times t_1, . . . , t_4, and compares the PAR plot with the minimal (ρ = 0) and perfect (ρ = 1) repair plots.

Figure 18.2. Plot of the complete intensity function for the PAR model (with ρ = 0.3 and power-law form) of deteriorating equipment. Comparison with the minimal (ρ = 0) and perfect (ρ = 1) repair plots

ML estimates of the model parameters, which require the maximization of the log-likelihood, can be found in [64] when data arise from several identical units. Since the effect of repair may be not uniform, the improvement factor ρ should be interpreted as the effect on average over the observation period.

Although the PAR model has been proposed to describe imperfect repair, it can adequately describe worse maintenance by allowing the parameter ρ to be negative. In such a way, the effect of each maintenance action is that of increasing the age of the equipment by a quantity proportional to the operating time elapsed from the most recent maintenance.
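The PAR complete intensity is easy to evaluate once the failure history is given. A minimal sketch (Python with numpy; the function name and parameter values are ours, for illustration only), here paired with the power-law form of Equation 18.49:

    import numpy as np

    def par_intensity(t, failures, rho, h):
        """Complete intensity of the PAR model (Equation 18.48) at time t,
        given the past failure times and the first-failure intensity h();
        rho < 0 reproduces the worse-maintenance variant noted above."""
        failures = np.asarray(failures, dtype=float)
        prev = failures[failures < t]
        t_last = prev[-1] if prev.size else 0.0
        return h(t - rho * t_last)

    # power-law form of Equation 18.49 (illustrative parameter values)
    alpha, beta, rho = 100.0, 2.0, 0.3
    h = lambda x: (beta / alpha) * (x / alpha) ** (beta - 1.0)
    lam = par_intensity(250.0, [120.0, 200.0], rho, h)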

18.5.2 Inhomogeneous Gamma Processes

The effect of imperfect or worse repairs on the equipment condition was modeled by Berman [65] by assuming that the equipment is subjected to shocks that occur according to an NHPP with intensity ρ(t). The failure occurs not at every shock but at every kth shock. Then, if k > 1, the equipment is in a better condition just after a repair (and then the process describes an imperfect repair), whereas a value of k < 1 indicates that the system is in a worse condition just after a failure (worse repair).

The random variables z_i = ∫_{t_{i−1}}^{t_i} ρ(t) dt (i = 1, . . . , n) (where t_1 < t_2 < · · · < t_n are the first n failure times) are independently and identically distributed according to the gamma distribution with unit scale parameter and shape parameter k. The CIF at any time t in the interval (t_{i−1}, t_i) depends on the history only through the time t_{i−1} of the most recent failure:

λ(t; H_t) = {ρ(t)[U(t_{i−1}, t)]^{k−1} exp[−U(t_{i−1}, t)]/Γ(k)} × {∫_t^∞ ρ(z)[U(t_{i−1}, z)]^{k−1} exp[−U(t_{i−1}, z)]/Γ(k) dz}^{−1}   t > t_{i−1}   (18.51)

where U(t_{i−1}, t) = ∫_{t_{i−1}}^t ρ(z) dz is the expected number of shocks from the most recent failure time up to t. A process with CIF given by Equation 18.51 is called an inhomogeneous gamma process, and the likelihood relative to a failure-truncated sample t_1 < t_2 < · · · < t_n is

L(θ, k) = [Γ(k)]^{−n} {∏_{i=1}^n ρ(t_i)[U(t_{i−1}, t_i)]^{k−1}} exp[−U(0, t_n)]   (18.52)

where θ is the parameter vector of ρ(t). Although lacking a shock model interpretation, Equation 18.51 still defines the CIF of a point process when k is positive but not an integer. If k = 1, the inhomogeneous gamma process reduces to the NHPP (the repair is minimal), whereas if ρ(t) = c (constant), then the process reduces to a renewal process with times between failures distributed as a gamma random variable with scale parameter c and shape parameter k.

A special form of an inhomogeneous gamma process, called the modulated gamma process, can be found in [65]. In a modulated gamma process the shock process is modeled by an NHPP with intensity ρ(t) = µ exp{β′z(t)}, where β′ = (β_1, . . . , β_p) and z(t)′ = {z_1(t), . . . , z_p(t)}. Existence of a time trend in the data can be modeled simply by including a single term z_1(t) = t or, less simply, by including a further term z_2(t) = t².

Another special form of inhomogeneous gamma process, which has been studied greatly, is the modulated PLP (MPLP). In the MPLP the process of shocks is modeled by a PLP with intensity ρ(t) = (β/α)(t/α)^{β−1}, so that U(t_{i−1}, t) = (t/α)^β − (t_{i−1}/α)^β. When k = 1, the MPLP reduces to a PLP, whereas when β = 1 the process reduces to a gamma renewal process. Finally, when both k = 1 and β = 1, the MPLP becomes an HPP. The likelihood relative to a failure-truncated sample t_1 < t_2 < · · · < t_n results in

L(α, β, k) = {(β/α)^n/[Γ(k)]^n} {∏_{i=1}^n (t_i/α)^{β−1} [(t_i/α)^β − (t_{i−1}/α)^β]^{k−1}} × exp[−(t_n/α)^β]   (18.53)

A complete discussion on the MPLP and ML procedures is given by Rigdon and coworkers [66, 67]. In particular, the ML estimates of the parameters can be obtained by solving

α̂ = t_n/(nk̂)^{1/β̂}

β̂ = n / [(t_n/α̂)^{β̂} ln(t_n/α̂) + nk̂ ln α̂ − ∑_{i=1}^n ln t_i − (k̂ − 1) ∑_{i=1}^n (t_i^{β̂} ln t_i − t_{i−1}^{β̂} ln t_{i−1})/(t_i^{β̂} − t_{i−1}^{β̂})]

ψ(k̂) = [∑_{i=1}^n ln(t_i^{β̂} − t_{i−1}^{β̂})]/n − β̂ ln α̂   (18.54)

where ψ denotes the di-gamma function: ψ(x) = Γ′(x)/Γ(x). Approximate confidence intervals and hypothesis tests for the MPLP parameters can be obtained on the basis of the asymptotic distribution of the ML estimators.
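Rather than solving Equations 18.54 directly, the log of the likelihood in Equation 18.53 can be maximized numerically, which is often more robust. A sketch (Python with numpy/scipy; the function name and starting values are ours):

    import numpy as np
    from scipy.special import gammaln
    from scipy.optimize import minimize

    def mplp_ml(t):
        """Numerical ML for the MPLP (likelihood of Equation 18.53),
        failure-truncated sample t[0] < ... < t[n-1]."""
        t = np.asarray(t, dtype=float)
        n = len(t)

        def negll(p):
            alpha, beta, k = np.exp(p)     # logs enforce positivity
            u = (t / alpha) ** beta        # U(0, t_i)
            du = np.diff(np.concatenate(([0.0], u)))
            return -(n * np.log(beta / alpha) - n * gammaln(k)
                     + (beta - 1.0) * np.log(t / alpha).sum()
                     + (k - 1.0) * np.log(du).sum() - u[-1])

        start = np.log([t[-1] / n, 1.0, 1.0])   # HPP-like initial guess
        res = minimize(negll, start, method="Nelder-Mead",
                       options={"maxiter": 10000})
        return np.exp(res.x)   # (alpha, beta, k)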

Bayesian inference, based on non-informative and vague prior densities, was proposed by Calabria and Pulcini [68] to obtain point estimates of MPLP parameters and credibility intervals that do not rely on asymptotic results. Bayesian prediction procedures for future failure times, which perform well even in the case of small samples, can be found in [68].

The MPLP expected number of failures M(t) can be well approximated by M(t) ≅ (1/k)(t/α)^β, and hence the (unconditional) intensity function λ(t) is approximately λ(t) ≅ (1/k)(β/α)(t/α)^{β−1} [69]. On the basis of such approximations, Bayesian inference on the MPLP parameters, as well as on M(t) and λ(t), can be found in [69] when prior information on β, k and/or M(t) is available. Bayesian prediction on the failure times in a future sample can also be found. Finally, testing procedures for assessing the time trend and the effect of repair actions in one or two MPLP samples can be found in [70].

The MPLP model has been applied to analyze the failure data of several aircraft air-conditioning units [18], an aircraft generator [71] and a photocopier [31].

18.5.3 Lawless–Thiagarajah Models

A very useful family of models that incorporates both time trend and the effect of past failures, such as renewal-type behavior, has been proposed by Lawless and Thiagarajah [3]. The CIF is of the form:

λ(t; H_t) = exp{θ′z(t)}   (18.55)

where z(t) = {z_1(t), . . . , z_p(t)}′ is a vector of functions that may depend on both t and the history H_t, and θ = (θ_1, . . . , θ_p)′ is a vector of unknown parameters. This model includes, as special cases, many commonly used processes, such as the PLP (when z(t) = {1, ln(t)}′), the LLP (when z(t) = {1, t}′), and the renewal processes (when z(t) is a function of the time u(t) since the most recent repair: u(t) = t − t_{N(t−)}). Models with CIF given by Equation 18.55 are then suitable for exploring failure data of repairable equipment when a renewal-type behavior may be expected, perhaps with a time trend superimposed.

The ML estimation procedures for the general form given by Equation 18.55, which in general need numerical integration, are given by Lawless and Thiagarajah [3]. A particular form of Equation 18.55 is discussed:

λ(t; H_t) = exp{α + βt + γu(t)}   −∞ < α, β, γ < ∞   (18.56)

This CIF shows jumps at each failure time t_i (i = 1, 2, . . . ), except for the special case (γ = 0), when it reduces to the NHPP with log–linear intensity (LLP). When β = 0, the process reduces to a renewal process with interarrival times distributed as a Gumbel random variable. The complete intensity (Equation 18.56) is then piecewise continuous: when the repair is effective and produces an improvement in the condition of the equipment (imperfect repair), the jumps are downward, whereas when the repair is incorrect (worse repair) these jumps are upward. Test procedures for time trend can be found in [3], where the adequacy of the large sample approximation to moderate samples has been assessed.

Other suitable forms of Equation 18.55, in particular the so-called power-law–Weibull renewal (PL–WR) and log–linear–Weibull renewal (LL–WR), can be found in [72]. The CIFs are

λ(t; H_t) = γ t^{β−1}[u(t)]^{δ−1}   γ > 0, β + δ > 1   (18.57)

and

λ(t; H_t) = δ exp(θ + βt)[u(t)]^{δ−1}   −∞ < θ, β < ∞, δ > 0   (18.58)

respectively. Both the CIFs show jumps at each failure time t_i (i = 1, 2, . . . ), except for the special case (δ = 1), when they reduce to the PLP and the LLP respectively. The PL–WR and LL–WR models are suitable to describe the failure pattern of a wide range of repairable mechanical units that experience reliability improvement or deterioration with operating time and are subjected to imperfect or worse repairs. For example, when δ > 1 (δ < 1) the PL–WR model describes a repeated beneficial (harmful) effect of repair actions with respect to the minimal repair condition. The more δ differs from unity, the more the failure process departs from a minimal repair condition. When β > 1 (1 − δ < β < 1) the PL–WR process describes the failure pattern of equipment experiencing reliability deterioration (improvement) with operating time.

Figure 18.3. Plots of the complete intensity function of the PL–WR model for deteriorating equipment under imperfect (δ > 1) or worse (δ < 1) repairs. Comparison with the minimal repair case (δ = 1)

Figure 18.3 shows the plots of the CIF for arbitrary failure times t_1, . . . , t_4 of equipment that experiences reliability deterioration with operating time (β > 1) and is subject to imperfect (δ > 1) or worse (δ < 1) repairs. The plots are also compared with the minimal repair case (δ = 1).

Point and interval ML estimates of the parameters of the PL–WR and LL–WR models can be found in [72], as well as test procedures, based on the likelihood ratio statistic, for assessing the departure from the minimal or perfect repair assumption. In particular, under a time-truncated sampling, the ML estimators of the PL–WR parameters β and δ can be found by maximization of the modified two-parameter log-likelihood:

ℓ′(β, δ) = n ln[n/∑_{i=1}^{n+1} I_i(β, δ)] + (β − 1) ln(∏_{i=1}^n t_i) + (δ − 1) ln[∏_{i=1}^n (t_i − t_{i−1})] − n   (18.59)

where I_i(β, δ) = ∫_{t_{i−1}}^{t_i} t^{β−1}(t − t_{i−1})^{δ−1} dt (with t_0 = 0 and t_{n+1} = T), subject to the constraint β + δ > 1. The ML estimate of γ is in closed form:

γ̂ = n / ∑_{i=1}^{n+1} I_i(β̂, δ̂)   (18.60)
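Since the integrals I_i have no general closed form, a numerical sketch (Python with numpy/scipy; the function names are ours, and quad must cope with the integrable endpoint singularity when δ < 1) is:

    import numpy as np
    from scipy.integrate import quad
    from scipy.optimize import minimize

    def plwr_ml(t, T):
        """ML for the PL-WR model (Equations 18.59 and 18.60),
        time-truncated sampling."""
        t = np.asarray(t, dtype=float)
        n = len(t)
        lo = np.concatenate(([0.0], t))    # t_0, ..., t_n
        hi = np.concatenate((t, [T]))      # t_1, ..., t_{n+1} = T

        def I_sum(beta, delta):
            return sum(quad(lambda u, a=a: u ** (beta - 1.0)
                            * (u - a) ** (delta - 1.0), a, b)[0]
                       for a, b in zip(lo, hi))

        def negll(p):
            beta, delta = p
            if beta + delta <= 1.0:        # constraint from Equation 18.57
                return np.inf
            return -(n * np.log(n / I_sum(beta, delta))
                     + (beta - 1.0) * np.log(t).sum()
                     + (delta - 1.0) * np.log(t - lo[:-1]).sum() - n)

        res = minimize(negll, x0=(1.0, 1.0), method="Nelder-Mead")
        beta, delta = res.x
        return n / I_sum(beta, delta), beta, delta   # (gamma, beta, delta)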

The Lawless–Thiagarajah family of models has been applied to several failure data sets, such as the air-conditioning equipment data [18] denoted as Planes 6 and 7 by Cox and Lewis [10], the aircraft generator data [71], and the failure data of automobile #3 [46].

18.5.4 Proportional Intensity Variation Model

An alternative way to describe a beneficial effect of preventive maintenance on the equipment reliability has been given by Chan and Shaw [73], by assuming that each maintenance reduces the (non-decreasing) intensity by a quantity that can be constant (fixed reduction) or proportional to the current intensity value (proportional reduction).

The latter assumption has been adopted by Calabria and Pulcini [74] to model the failure pattern of equipment subject to imperfect or worse repairs. Each repair alters the CIF of the equipment in such a way that the failure intensity immediately after the repair is proportional to the value just before the failure occurrence:

λ(t_i^+; H_{t_i^+}) = δ λ(t_i^−; H_{t_i^−})   δ > 0   (18.61)

Hence, the CIF shows jumps at each failure time t_i, except in the case of δ = 1. Such jumps are proportional to the value of the complete intensity and, unlike the PAR model, do not depend explicitly on the operating time since the most recent failure. If ρ(t) denotes the intensity function up to the first failure time, the CIF is

λ(t; H_t) = δ^i ρ(t)   t_i < t ≤ t_{i+1}   (18.62)

(with t_0 = 0) and depends on the history of the process only through the number i of previous failures. The parameter δ is a measure of the effectiveness of repair actions. When δ = 1, the model in Equation 18.62 reduces to an NHPP with intensity ρ(t). When δ < 1 (δ > 1), the repair action produces an improvement (worsening) in the equipment condition, and then the model can describe the effect of imperfect or worse repair for equipment that, depending on the behavior of ρ(t), experiences reliability improvement or deterioration.

Two different forms for ρ(t) have been discussed, namely the power-law and the log–linear intensity, and the ML estimators of the model parameters can be found in [74]. Procedures for testing the departure from the minimal repair assumption, based on the log-likelihood ratio and Wald statistics, can be found both for large and moderate samples. In particular, by assuming the power-law intensity ρ(t) = (β/α)(t/α)^{β−1}, the likelihood relative to a failure-truncated sample of size n is

L(α, β, δ) = δ^ν (β^n/α^{nβ}) [∏_{i=1}^n t_i^{β−1}] exp{−∑_{i=1}^n δ^{i−1}[(t_i/α)^β − (t_{i−1}/α)^β]}   (18.63)

where ν = (n² − n)/2. The ML estimators of β and δ can be found by maximization of the modified two-parameter log-likelihood:

ℓ′(β, δ) = ν ln δ − n ln[∑_{i=1}^n δ^{i−1}(t_i^β − t_{i−1}^β)] + (β − 1) ln(∏_{i=1}^n t_i) − n   (18.64)

subject to the constraints β > 0 and δ > 0, and the ML estimate of α is:

α̂ = [∑_{i=1}^n δ̂^{i−1}(t_i^{β̂} − t_{i−1}^{β̂})/n]^{1/β̂}   (18.65)

Figure 18.4. Plots of the complete intensity function for the proportional intensity variation model of deteriorating equipment subject to imperfect (δ < 1) or worse (δ > 1) repairs. Comparison with the minimal repair case (δ = 1)

Figure 18.4 shows the plots of the CIF (Equation 18.62) for arbitrary failure times t_1, . . . , t_4 of equipment that experiences reliability deterioration with operating time (β > 1) and is subject to imperfect (δ < 1) or worse (δ > 1) repairs. The plots are also compared with the minimal repair case (δ = 1). By comparing Figure 18.2 with Figure 18.4, one can see that the improvement caused by the third repair is very small under the PAR model, since the time interval (t_3 − t_2) is narrow. On the contrary, the improvement produced by the same third repair under the proportional intensity variation model is more noticeable, since the complete intensity just before the third failure is large.
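For completeness, a numerical sketch of the estimation scheme of Equations 18.64 and 18.65 (Python with numpy/scipy; the function name and the log-parametrization are ours):

    import numpy as np
    from scipy.optimize import minimize

    def piv_ml(t):
        """ML for the proportional intensity variation model with power-law
        rho(t) (Equations 18.63-18.65), failure-truncated sample."""
        t = np.asarray(t, dtype=float)
        n = len(t)
        nu = (n * n - n) / 2.0
        idx = np.arange(1, n + 1)
        t_prev = np.concatenate(([0.0], t[:-1]))

        def negll(p):
            beta, delta = np.exp(p)        # logs keep both parameters > 0
            s = (delta ** (idx - 1) * (t ** beta - t_prev ** beta)).sum()
            return -(nu * np.log(delta) - n * np.log(s)
                     + (beta - 1.0) * np.log(t).sum() - n)

        res = minimize(negll, np.log([1.0, 1.0]), method="Nelder-Mead")
        beta, delta = np.exp(res.x)
        s = (delta ** (idx - 1) * (t ** beta - t_prev ** beta)).sum()
        return (s / n) ** (1.0 / beta), beta, delta   # (alpha, beta, delta)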

18.6 Complex Maintenance Policy

This section illustrates some reliability models that are able to describe the failure pattern of repairable equipment subject to a maintenance policy consisting in general of a sequence of corrective repairs interspersed with preventive maintenance.

Except for the model illustrated in Section 18.6.5, all models discussed in Sections 18.6.1–18.6.4 are based on specific assumptions regarding the effect of corrective and/or preventive maintenance on the equipment condition. Such assumptions are summarized in Table 18.2. In most models, the preventive maintenance is assumed to improve the equipment reliability significantly, until the limit case of a complete restoration of the equipment (complete overhaul). However, as emphasized in Ascher and Feingold [1], not only do overhauls not necessarily restore the equipment to a same as new condition, but in some cases they can reduce reliability. This may happen when overhaul is performed by inexperienced staff and/or under difficult working conditions: reassembling injects fresh faults into the equipment, and burn-in techniques cannot be applied before putting the equipment in service again. Thus, the assumption that preventive maintenance improves the equipment condition should be assessed before using stochastic models based on such an assumption.

18.6.1 Sequence of Perfect and Minimal Repairs Without Preventive Maintenance

Let us consider repairable equipment that is repaired at failure and is not preventively maintained. By using the approach proposed by Nakagawa [75] to model preventive maintenance, Brown and Proschan [76] assumed that the equipment is perfectly repaired with probability p and is minimally repaired with probability 1 − p. It is as though the equipment is subject to two types of failure: Type I failure (catastrophic failure) occurs with probability p and requires the replacement of the equipment, whereas Type II failure (minor failure) occurs with probability 1 − p and is corrected with minimal repair. Note that both Nakagawa [75] and Brown and Proschan [76], as well as other authors, e.g. [77], use the term "imperfect repair" when they refer to "same as old repair".

If p = 0, then the failure process reduces to a minimal repair process (NHPP), whereas if p = 1 the process is a renewal one. When 0 < p < 1, the process is a sequence of NHPPs interspersed with perfect repairs; each NHPP starts from the time epoch of the most recent perfect repair.

If F_1(t) and r_1(t) are respectively the cumulative distribution and the hazard rate of the occurrence time of the first failure, then the distribution of the time between successive perfect repairs is F_p(t) = 1 − [1 − F_1(t)]^p and the corresponding hazard rate is r_p(t) = p r_1(t).

Let τ_j (with j = 1, . . . , m) denote the time epochs of m perfect repairs. For a generic time t in the interval (τ_{i−1}, τ_i) (where τ_0 = 0), the CIF depends on the operating time t and on the history H_t only through the difference t − τ_{i−1}:

λ(t; H_t) = h(t − τ_{i−1})   τ_{i−1} < t ≤ τ_i   (18.66)

It is numerically equal to the hazard rate r_1(x) of the first failure time evaluated at x = t − τ_{i−1}.
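The structure of the model is easy to see by simulation. The sketch below (Python with numpy; a Weibull first-failure hazard is assumed purely for illustration) draws each failure by inverting the cumulative hazard and resets the age only when the repair happens to be perfect:

    import numpy as np

    rng = np.random.default_rng(1)

    def simulate_brown_proschan(p, alpha, beta, t_end):
        """Brown-Proschan model: perfect repair with probability p, minimal
        otherwise, with H(x) = (x/alpha)**beta as the (assumed) first-failure
        cumulative hazard.  Returns failure times and repair modes."""
        times, modes = [], []
        tau, age = 0.0, 0.0        # last perfect-repair epoch and current age
        while True:
            e = rng.exponential()  # H(next age) - H(age) is standard exponential
            age = alpha * ((age / alpha) ** beta + e) ** (1.0 / beta)
            t = tau + age
            if t > t_end:
                return np.array(times), np.array(modes)
            perfect = rng.random() < p
            times.append(t)
            modes.append(int(perfect))
            if perfect:
                tau, age = t, 0.0  # renewal: the NHPP restarts from zero age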

Non-parametric estimates of the cumulative distribution function F_1(t) can be found in Whitaker and Samaniego [77] under the assumption that both the failure times t_i (i = 1, 2, . . . ) and the repair mode (minimal or perfect) following each failure are fully known (complete data). The likelihood relative to a complete data set observed until the occurrence of the nth failure is

L(p, F_1) = p^m (1 − p)^{n−m−1} f_1(x_{(1)}) ∏_{i=1}^{n−1} f_1(x_{(i+1)}) / [1 − F_1(x_{(i)})]^{1−z_{(i)}}   (18.67)

where m is the number of perfect repairs, x_{(i)} (i = 1, . . . , n) are the ordered failure times measured from the most recent perfect repair epoch (the equipment installation corresponds to a perfect repair), and z_{(i)} (i = 1, . . . , n − 1) is the (non-ordered) indicator variable for the mode of the ith ordered repair: z_{(i)} = 1 for perfect repair, and z_{(i)} = 0 for minimal repair. Of course, ∑_{i=1}^{n−1} z_{(i)} = m. If z_{(n−1)} = 1, the non-parametric ML estimate of F_1(t) is

F̂_1(t) = 0   for t < x_{(1)}
F̂_1(t) = 1 − ∏_{j=1}^i φ_j   for x_{(i)} ≤ t < x_{(i+1)}   (i = 1, . . . , n − 1)
F̂_1(t) = 1   for t ≥ x_{(n)}   (18.68)

Table 18.2. Assumptions regarding the effect of corrective and/or preventive maintenance for models in Sections 18.6.1–18.6.4

Maintenance   Section 18.6.1       Section 18.6.2   Section 18.6.3   Section 18.6.4
Corrective    Perfect or minimal   Minimal          Imperfect        Minimal
Preventive    No PM                Perfect          Perfect          Imperfect

where φ_i = k_i/(k_i − 1) and k_i is the number of ordered failure times greater than x_{(i)} for which the equipment was perfectly repaired. When z_{(n−1)} = 0, the non-parametric estimate does not exist. The estimate of F_1(t) allows the CIF of Equation 18.66 between successive perfect repairs to be estimated. Whitaker and Samaniego [77] applied their estimation procedure to the failure data of the air-conditioning equipment of Boeing 720 airplane 7914 [18], by assigning arbitrary (but seemingly reasonable) values to the indicator variables z_{(i)}.

A Bayes procedure to estimate the unknown and random probability p of perfect repair, and the waiting time between two successive perfect repairs, when prior information on p is available, can be found in [78].

A parametric form of the CIF (Equation 18.66) has been studied by Lim [79], who modeled the failure process between successive perfect repairs through a PLP. Then, the likelihood relative to a fully known data set, observed until the occurrence of the nth failure, is

L(p, α, β) = p^m (1 − p)^{n−m−1} (β^n/α^{nβ}) [∏_{i=1}^n x_i^{β−1}] exp[−∑_{i=1}^n z_i (x_i/α)^β]   (18.69)

where m = ∑_{i=1}^{n−1} z_i is the number of perfect repairs and z_n = 1. Note that, in the parametric formulation, the failure times x_i (i = 1, . . . , n) from the most recent perfect repair need not be ordered. The ML estimates of the model parameters for complete data are

p̂ = m/(n − 1)   α̂ = [(1/n) ∑_{i=1}^n z_i x_i^{β̂}]^{1/β̂}

β̂ = [∑_{i=1}^n z_i x_i^{β̂} ln x_i / ∑_{i=1}^n z_i x_i^{β̂} − (1/n) ∑_{i=1}^n ln x_i]^{−1}   (18.70)
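Equations 18.70 are coupled through β̂; a simple fixed-point iteration (Python with numpy; the scheme and names are ours, and convergence should be checked in practice) gives:

    import numpy as np

    def lim_ml(x, z, iters=1000, tol=1e-10):
        """ML estimates of Equation 18.70 for Lim's model with complete data:
        x[i] are failure times from the most recent perfect repair and z[i]
        the repair-mode indicators (z[-1] = 1)."""
        x = np.asarray(x, dtype=float)
        z = np.asarray(z, dtype=float)
        n = len(x)
        p = z[:-1].sum() / (n - 1)          # p-hat = m / (n - 1)
        beta = 1.0                          # fixed-point iteration on beta
        for _ in range(iters):
            xb = x ** beta
            new = 1.0 / ((z * xb * np.log(x)).sum() / (z * xb).sum()
                         - np.log(x).sum() / n)
            if abs(new - beta) < tol:
                beta = new
                break
            beta = new
        alpha = ((z * x ** beta).sum() / n) ** (1.0 / beta)
        return p, alpha, beta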

When the data set is incomplete, i.e. the repair modes are not recorded, the inferential procedures are much more complex. The ML estimate of the probability p of perfect repair, as well as that of the PLP parameters, can be obtained through an expectation–maximization algorithm based on an iterative performance of the expectation step and the maximization step. The proposed inferential procedure can be applied even when an NHPP different from the PLP is assumed to model the failure process between successive perfect repairs, by just modifying the maximization step.

When each repair mode can depend on the mode of the previous repair, so that there is a first-order dependency between two consecutive repair modes, the procedure recently proposed by Lim and Lie [80] can be used. The failure process between successive perfect repairs is still modeled by a PLP, and the data can be partially masked, i.e. all failure times are available but some of the repair modes are not recorded. The ML estimates of the model parameters and of the transition probabilities p_{j,k} = Pr{z_i = k|z_{i−1} = j} (j, k = 0, 1; i = 1, 2, . . . ) of the repair process can be obtained by performing an expectation–maximization algorithm.

In some circumstances, the choice of repair mode does not depend only on external factors (such as the availability of a replacement) that are stable over time, but also on the equipment condition. For example, catastrophic failures, which require the replacement of the equipment, can occur with higher probability as the equipment age increases. This age-dependent repair mode situation has been modeled by Block et al. [81]. By extending the Brown–Proschan model, the repair is assumed to be perfect with probability p(x) and minimal with probability 1 − p(x), where x is the age of the equipment measured from the last perfect repair epoch. The likelihood relative to a complete data set observed until the occurrence of the nth failure is given by

L(p(·), F_1) = {∏_{i=1}^{n−1} p(x_i)^{z_i} [1 − p(x_i)]^{1−z_i}} f_1(x_1) ∏_{i=1}^{n−1} f_1(x_{i+1}) / [1 − F_1(x_i)]^{1−z_i}   (18.71)

Non-parametric estimates of the cumulative distribution of the first failure time can be found in [77].

18.6.2 Minimal Repairs Interspersed with Perfect Preventive Maintenance

The failure pattern of repairable equipment subject to preventive maintenance that completely regenerates the intensity function (perfect maintenance) has been modeled by Love and Guo [82, 83]. Between successive preventive maintenance actions, the equipment is minimally repaired at failure, so that within each maintenance cycle the failure process can be regarded as a piecewise NHPP. In addition, the operating and environmental conditions of the equipment are not constant during each maintenance cycle but can change after each repair and then affect the next failure time. This operating and environmental influence is then modeled through suitable covariates [59].

Let y_{i,j} denote the jth failure time in the ith maintenance cycle measured from the most recent maintenance epoch. Then, the CIF at any time y in the interval (y_{i,j−1}, y_{i,j}) is given by

ρ_{i,j}(y) = ρ_0(y) exp(z′_{i,j} a)    y_{i,j−1} ≤ y < y_{i,j}    (18.72)

where y_{i,0} = 0, z_{i,j} is the k × 1 vector of explanatory variables (covariates) that characterize the operating conditions of the equipment in the time interval (y_{i,j−1}, y_{i,j}), and a is the vector of regression coefficients.

A non-parametric form for the baseline intensity has been assumed in [82], whereas a power-law form for the baseline intensity ρ_0(y) = (β/α)(y/α)^{β−1} has been chosen in [83]. In the latter case, the likelihood relative to m maintenance cycles is

L(α, β, a) = (β/α)^n [∏_{i=1}^{m} ∏_{j=1}^{n_i} (y_{i,j}/α)^{β−1} exp(z′_{i,j} a)] × exp{−∑_{i=1}^{m} ∑_{j=1}^{n_i+1} [(y_{i,j}/α)^β − (y_{i,j−1}/α)^β] exp(z′_{i,j} a)}    (18.73)

where n_i is the number of failures that occurred in the ith maintenance cycle, y_{i,n_i+1} is the ending time of the ith maintenance cycle, and n = ∑_{i=1}^{m} n_i is the total number of observed failures. The ML estimates of α, β, and a can be found by solving the equations ∂ ln(L)/∂β = 0, ∂ ln(L)/∂α = 0, and ∂ ln(L)/∂a_l = 0 (l = 1, . . . , k) through the Newton–Raphson algorithm. Tests of the significance of the regression model can be performed on the basis of the log-likelihood ratio statistic under the null hypothesis of constant environmental and operating conditions (the regression coefficients are assumed to be zero). Significance tests for individual regression coefficients are based on the asymptotic distribution of the ML estimators.
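As a minimal numerical sketch of this fitting step, the log of Equation 18.73 can be coded directly and maximized with a general-purpose optimizer in place of a hand-written Newton–Raphson scheme; the two maintenance cycles, failure times, and single scalar covariate below are hypothetical, and this is not the procedure of [83].

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: failure times y[i] within cycle i (measured from the
# last maintenance epoch), cycle ending times y_end[i], and one scalar
# covariate per inter-failure interval (n_i + 1 values in cycle i; the
# last value refers to the interval ending at y_end[i]).
y = [np.array([210.0, 480.0, 745.0]), np.array([330.0, 610.0])]
y_end = [900.0, 800.0]
z = [np.array([0.4, 0.9, 1.3, 1.1]), np.array([0.7, 1.0, 0.8])]
n = sum(len(c) for c in y)

def neg_loglik(theta):
    beta, alpha, a = theta
    ll = n * np.log(beta / alpha)                    # (beta/alpha)^n term
    for i in range(len(y)):
        yy = np.concatenate(([0.0], y[i], [y_end[i]]))
        ll += np.sum((beta - 1) * np.log(y[i] / alpha) + a * z[i][:-1])
        # integrated intensity over each interval (y_{i,j-1}, y_{i,j}]
        ll -= np.sum(((yy[1:] / alpha) ** beta - (yy[:-1] / alpha) ** beta)
                     * np.exp(a * z[i]))
    return -ll

res = minimize(neg_loglik, x0=[1.5, 500.0, 0.0],
               bounds=[(0.05, 10.0), (1.0, 1e4), (-5.0, 5.0)])
beta_hat, alpha_hat, a_hat = res.x
```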

The model in Equation 18.72 has also been applied to a large cement roller mill, whose data are the times between corrective or preventive maintenance actions. Three potential symptomatic covariates were considered: instantaneous


change in pressure on the main bearing, recirculating load on the bucket elevator, and the power demanded by the main motor. The second and third covariates were found to be significant, and both regression models that include one of the significant covariates fit the observed data rather better than the baseline model.

18.6.3 Imperfect Repairs Interspersed with Perfect Preventive Maintenance

The failure pattern of equipment imperfectly repaired at failure and subject to perfect preventive maintenance can be described by the model given by Guo and Love [62]. The effect of imperfect repair is modeled through the PAR model of Section 18.5.1, and the covariate structure for the complete intensity given by Love and Guo [82, 83] can be adopted when different environmental and operating conditions have to be considered.

Let y_{i,j} denote the jth failure time in the ith maintenance cycle measured from the most recent maintenance epoch. The CIF in any time interval (y_{i,j−1}, y_{i,j}) (where y_{i,0} = 0) is then

λ_{i,j}(y; H_y) = h_0[y − (1 − δ) y_{i,j−1}] exp(z′_{i,j} a)    0 ≤ δ ≤ 1,  y_{i,j−1} < y ≤ y_{i,j}    (18.74)

where 1 − δ is equal to the improvement factor ρ of Section 18.5.1. The effect of repair may be assumed to be constant (and unknown) over all failures (constant repair impact), or the effect of repair may vary from time to time (repair-specific impact). In the latter case, the value of the repair factor δ after each failure has to be assigned.

The baseline intensity up to the first failure has the power-law form and, under the assumption of constant repair effect (δ is then an unknown parameter), the likelihood relative to m maintenance cycles is

L(α, β, δ, a) = (β/α)^n {∏_{i=1}^{m} ∏_{j=1}^{n_i} [(y_{i,j} − (1 − δ) y_{i,j−1})/α]^{β−1}} exp(z′_{sumf} a) × exp{−∑_{i=1}^{m} ∑_{j=1}^{n_i+1} {[(y_{i,j} − (1 − δ) y_{i,j−1})/α]^β − (δ y_{i,j−1}/α)^β} exp(z′_{i,j} a)}    (18.75)

where ni is the number of failures occurredduring the ith maintenance cycle, yi,ni+1 is theending time of the ith maintenance cycle, andzsumf =∑i

∑j zi,j is the summation vector of

the covariates over the failure set. The MLestimate of α, β , δ, and a can be found inthe traditional way by solving the equations∂ ln(L)/∂β = 0, ∂ ln(L)/∂α = 0, ∂ ln(L)/∂δ = 0,and ∂ ln(L)/∂al = 0 (l = 1, . . . , k).

When the effect of repair varies from time to time, the likelihood is conditioned on the vector Δ = {δ_{1,1}, . . . , δ_{1,n_1}, . . . , δ_{m,1}, . . . , δ_{m,n_m}}′ collecting the (assigned) values of the repair factors δ_{i,j}. Then, the form of the likelihood L(α, β, a | Δ) can be obtained from Equation 18.75 by substituting δ with δ_{i,j−1}.

Both the proposed models (the constant repair impact model and the repair-specific impact one) have been compared by Guo and Love [62] with the minimal repair model [83] and the hybrid model [77]. Data of the large cement roller mill [82] have been analyzed, and it is shown that the repair-specific impact model requires good management judgement on the effect of each repair in order to be effective.

If the original intensity function (before accounting for repair discontinuities) shows a bathtub-type behavior in each maintenance cycle, the failure pattern of imperfectly repaired equipment can be described by the model given by Love and Guo [54]. The EPP intensity (Equation 18.38) describes the original intensity function, whereas the imperfect repair is described through the PAR model [63]. The conditional ML estimate of


model parameters β and η, given the values of the repair factors δ_{i,j}, can be obtained from the first- and second-order partial derivatives of the log-likelihood, by applying the Newton–Raphson algorithm. The mean square error of the ordered estimates of the generalized residuals with respect to the expected standard exponential order statistics can be used to compare the fit of different models to a given data set.

18.6.4 Minimal Repairs Interspersed with Imperfect Preventive Maintenance

A model suitable to describe the failure pattern of several units subject to pre-scheduled or random imperfect preventive maintenance has been proposed by both Shin et al. [64] (and there denoted as Model B) and Jack [84] (where it is denoted as Model I). The units are minimally repaired at failure and the effect of imperfect preventive maintenance is modeled according to the PAR criterion [63].

The failure pattern in each maintenance cycle is described by an NHPP, but the actual age of the equipment in the kth cycle is reduced by a fraction of the most recent maintenance epoch τ_{k−1}. The CIF at a time t in the kth maintenance cycle is

λ(t; H_t) = h(t − ρ τ_{k−1})    τ_{k−1} < t ≤ τ_k    (18.76)

Suppose that l units are observed until T_i (i = 1, . . . , l) and that each unit is subject to m_i preventive maintenance actions at epochs τ_{i,1} < · · · < τ_{i,m_i} ≤ T_i. The ith unit experiences r_{i,k} failures during the kth maintenance cycle (k = 1, . . . , m_i + 1). Let t_{i,k,j} (j = 1, . . . , r_{i,k}) denote the jth failure experienced in the kth maintenance cycle by the ith unit. Then, under the assumption that the improvement factor ρ is constant over all units and maintenance actions, the likelihood relative to the above data is

L = ∏_{i=1}^{l} {∏_{k=1}^{m_i+1} [∏_{j=1}^{r_{i,k}} h(t_{i,k,j} − ρ τ_{i,k−1})] × exp[−∑_{k=1}^{m_i+1} ∫_{τ_{i,k−1}}^{τ_{i,k}} h(x − ρ τ_{i,k−1}) dx]}    (18.77)

where τ_{i,0} = 0 and τ_{i,m_i+1} ≡ T_i. Since the effect of preventive maintenance may not be uniform over units and time, the improvement factor ρ can be considered as the averaged effect of preventive maintenance over the observation period.

Two different forms for the intensity function up to the first failure, namely the power-law and log–linear intensities, have been studied, and the related ML estimates can be found in [64] and [84]. In particular, under the power-law assumption, the likelihood is

L(α, β, ρ) = (β/α)^n [∏_{i=1}^{l} ∏_{k=1}^{m_i+1} ∏_{j=1}^{r_{i,k}} ((t_{i,k,j} − ρ τ_{i,k−1})/α)^{β−1}] × exp{−∑_{i=1}^{l} ∑_{k=1}^{m_i+1} [((τ_{i,k} − ρ τ_{i,k−1})/α)^β − ((τ_{i,k−1} − ρ τ_{i,k−1})/α)^β]}    (18.78)

where n is the total number of failures for the l units during the whole observation periods. The ML estimate of the scale parameter α is in closed form, given the ML estimates of β and ρ:

α̂ = {∑_{i=1}^{l} ∑_{k=1}^{m_i+1} [(τ_{i,k} − ρ̂ τ_{i,k−1})^β̂ − (τ_{i,k−1} − ρ̂ τ_{i,k−1})^β̂]/n}^{1/β̂}    (18.79)

and the ML estimates of β and ρ can be found by maximizing the modified two-parameter log-likelihood:

ℓ′(β, ρ) = n ln β − n ln{∑_{i=1}^{l} ∑_{k=1}^{m_i+1} [(τ_{i,k} − ρ τ_{i,k−1})^β − (τ_{i,k−1} − ρ τ_{i,k−1})^β]} + n ln n + (β − 1)[∑_{i=1}^{l} ∑_{k=1}^{m_i+1} ∑_{j=1}^{r_{i,k}} ln(t_{i,k,j} − ρ τ_{i,k−1})] − n    (18.80)


Approximate confidence intervals on model parameters can be obtained on the basis of the asymptotic distribution of ML estimators [64] or by using the likelihood ratio method [84]. Shin et al. [64] have analyzed the failure data of a central cooler system of a nuclear power plant subject to major overhauls, by assuming a power-law form for the CIF (Equation 18.76). This application will be discussed later on. Under the same power-law assumption, Jack [84] analyzed the failure data of medical equipment (syringe-driver infusion pumps) used in a large teaching hospital and estimated the model parameters and the expected number of failures in given future time intervals.

Bayes inferential procedures for the above model under the power-law intensity (denoted as the PAR–PLP model) are given by Pulcini [85, 86] when data arise from a single piece of equipment observed until T. Point and interval estimates of model parameters and of the CIF can be obtained on the basis of a number of suitable prior assumptions that reflect different degrees of belief on the failure process and maintenance effect [85]. Unlike the ML procedure, the Bayesian credibility intervals do not rely on asymptotic results, and hence, when they are based on non-informative or vague prior densities, they are a valid alternative to the confidence intervals for small or moderate samples. In addition, Bayes inference on the expected number of failures in a future time interval (T, T + Δ) can be made when preventive maintenance or no maintenance is assumed to be performed at T.

Tests on the effectiveness of preventive maintenance can be carried out on the basis of the Bayes factor, which measures the posterior evidence provided by the observed data in favor of the null hypothesis of minimal or perfect preventive maintenance against the alternative of imperfect maintenance. Finally, Bayes prediction procedures, both for the future failure times and for the number of failures in a future time interval, can be found in [86] on the basis both of vague and informative prior densities on β and ρ.

The PAR criterion proposed by Malik [63] assumes that maintenance reduces only the damage incurred since the previous preventive maintenance, and thus prohibits a major improvement to the conditions of equipment that has a large age but has recently undergone preventive maintenance. A different age reduction criterion, where preventive maintenance reduces all previously incurred damage, has been proposed by Nakagawa [87]. The equipment is minimally repaired at failure and the CIF at any time t in the interval (τ_{k−1}, τ_k) between the (k − 1)th and the kth maintenance epochs is given by

λ(t; H_t) = h{t − τ_{k−1} + ∑_{i=1}^{k−1} [(τ_i − τ_{i−1}) ∏_{j=i}^{k−1} b_j]}    τ_{k−1} < t ≤ τ_k    (18.81)

where τ_0 = 0 and 0 < b_j < 1 (j = 1, 2, . . . ) is the improvement factor in age of the equipment reduced by the jth preventive maintenance. This criterion has been adopted in Model II of Jack [84], by assuming that all improvement factors are equal: b_j = δ (j = 1, 2, . . . ). Thus, the CIF in Equation 18.81 at any time t in the interval (τ_{k−1}, τ_k) becomes

λ(t; H_t) = h[t + ∑_{i=1}^{k−1} δ^i (τ_{k−i} − τ_{k−i−1}) − τ_{k−1}]    0 ≤ δ ≤ 1,  τ_{k−1} < t ≤ τ_k    (18.82)

ML estimation procedures for the model in Equation 18.82 are similar to those for the model in Equation 18.76.

18.6.4.1 Numerical Example

Let us consider the failure data of a central cooler system of a nuclear power plant analyzed by Shin et al. [64]. Data consist of n = 15 failure times and m = 3 major overhaul epochs observed over 612 days and are given in Table 18.3. The failure pattern between successive overhauls is modeled through a PLP, and each overhaul reduces the age of the equipment according to the PAR model [63].

Figure 18.5 compares the ML estimate given by Shin et al. [64] of the CIF under the imperfect overhauls assumption with the alternative assumptions of (a) minimal overhauls (ρ = 0), and


Table 18.3. Failure data of a central cooler system in numerical example (given in Shin et al. [64])

116  151  154∗  213  263∗  386
387  395  407   463  492   494
501  512∗ 537   564  590   609

∗ denotes PM epochs.

Figure 18.5. ML estimate of the complete intensity function under imperfect, minimal, and perfect overhaul assumptions for central cooler system data in numerical example

(b) perfect overhauls (ρ = 1). The assumption of minimal overhauls coincides with the assumption that the equipment is not preventively maintained but only minimally repaired at failure, and the whole failure process is then described by the PLP of Section 18.4.2.1. Under the assumption of perfect overhauls, the failure pattern is described by the model (without covariates) given by Love and Guo [83] and discussed in Section 18.6.2.
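As a sketch of how such estimates can be reproduced, the modified log-likelihood of Equation 18.80 (for a single unit, l = 1) can be maximized numerically over (β, ρ), and α̂ then recovered from Equation 18.79; with the Table 18.3 data the optimizer should land near the ML values reported in Table 18.4. The implementation below is illustrative and is not the code of Shin et al. [64].

```python
import numpy as np
from scipy.optimize import minimize

failures = [116, 151, 213, 386, 387, 395, 407, 463, 492, 494,
            501, 537, 564, 590, 609]
tau = [0.0, 154.0, 263.0, 512.0, 612.0]   # tau_0, PM epochs, and T
n = len(failures)

def S(beta, rho):
    # sum over cycles of (tau_k - rho tau_{k-1})^beta
    #                  - (tau_{k-1} - rho tau_{k-1})^beta
    return sum((b - rho * a) ** beta - (a - rho * a) ** beta
               for a, b in zip(tau[:-1], tau[1:]))

def neg_ell(theta):                        # negative of Equation 18.80
    beta, rho = theta
    logsum = sum(np.log(t - rho * a)
                 for a, b in zip(tau[:-1], tau[1:])
                 for t in failures if a < t <= b)
    return -(n * np.log(beta) - n * np.log(S(beta, rho))
             + n * np.log(n) + (beta - 1) * logsum - n)

res = minimize(neg_ell, x0=[2.0, 0.5], bounds=[(0.1, 10.0), (0.0, 1.0)])
beta_hat, rho_hat = res.x
alpha_hat = (S(beta_hat, rho_hat) / n) ** (1.0 / beta_hat)   # Eq. 18.79
print(f"beta = {beta_hat:.2f}, rho = {rho_hat:.2f}, alpha = {alpha_hat:.1f}")
```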

In Table 18.4 the ML estimates of parameters β and ρ are compared with the Bayes estimates obtained by Pulcini [85] by assuming a uniform prior density for β over the interval (1, 5) and a gamma prior density for ρ with mean µ_ρ = 0.7 and standard deviation σ_ρ = 0.15. Such densities formalize a prior ignorance on the value of β (except the prior conviction that the equipment experiences deterioration) and a vague belief that overhauls are quite effective. Table 18.5 gives the posterior mean and the equal-tails 0.80 credibility

Table 18.4. ML and Bayesian estimates of model parameters β and ρ

                    ML estimates      Bayes estimates
                    β       ρ         β       ρ
Point estimate      2.93    0.77      2.79    0.72
0.95 lower limit    1.82    0.50      1.73    0.50
0.95 upper limit    4.00    1.00      3.98    0.89

interval of the expected number of failures in the future time interval (612, 612 + Δ), for selected Δ values ranging from 30 to 100 days, under the different hypotheses that a major overhaul is or is not performed at T = 612 days.

18.6.5 Corrective Repairs Interspersed with Preventive Maintenance Without Restrictive Assumptions

The previously illustrated models are based on specific assumptions regarding the effect of corrective repairs and preventive maintenance on the equipment conditions. A model that does not require any restrictive assumption on the effect of maintenance actions on the equipment conditions is given by Kobbacy et al. [88]. The model assumes two different forms for the CIF after a corrective repair (CR) and a preventive maintenance (PM). In both intensities, the dependence on the full history of the process is modeled through explanatory variables. In particular, the CIF following a CR or a PM is

λ_CR(t; z) = λ_{CR,0}(t) exp(z′a)    (18.83)

or

λ_PM(t; y) = λ_{PM,0}(t) exp(y′b)    (18.84)

respectively, where time t is measured from the time of applying this CR or PM, z and y are the k × 1 and l × 1 vectors of the explanatory variables, and a and b are the vectors of the


Table 18.5. Bayesian estimate of the expected number of failures in the future time interval (T, T + Δ) when a major overhaul is or is not performed at T = 612 days

                    No overhaul at T                 Major overhaul at T
                    Δ=30   Δ=50   Δ=80   Δ=100       Δ=30   Δ=50   Δ=80   Δ=100
Point estimate      1.60   2.88   5.15   6.92        0.87   1.60   2.94   4.03
0.90 lower limit    0.97   1.72   2.98   3.91        0.40   0.79   1.56   2.21
0.90 upper limit    2.34   4.23   7.70   10.49       1.41   2.54   4.57   6.19

regression coefficients. Thus, the CIF following a failure or a preventive maintenance is given in terms of a baseline intensity (λ_{CR,0}(t) or λ_{PM,0}(t) respectively), which depends on the time t measured from this event, and of the values of the explanatory variables at a point in time just before the event occurrence. Potential explanatory variables are the age of the equipment (measured from the time the equipment was put into service), the total number of failures that have occurred, the time since the last CR, the time since the last PM, and so on. Hence, the explanatory variables measure the full history of the process until each CR and PM.

18.7 Reliability Growth

In many cases, the first prototype of mechanical equipment contains design and engineering flaws, and prototypes are therefore subjected to development testing programs before mass production starts. Reliability and performance problems are identified and design modifications are introduced to eliminate the observed failure modes. As a result, if the design changes are effective, the failure intensity decreases at each design change and a reliability growth is observed. Of course, in this context, the failure pattern of the equipment is strongly influenced by the effectiveness of design changes, as well as by failure mechanisms and maintenance actions. A complete discussion of reliability growth programs is given by Benton and Crow [89].

A large number of models that describe the reliability growth of repairable equipment have been proposed. Such models can be roughly classified into continuous and discrete models.

Continuous models are mathematical tools able to describe the design process, i.e. the decreasing trend in the ROCOF resulting from the repeated applications of design changes, which must not be confused with the behavior of the underlying failure process. Continuous models are commonly applied in test-analyze-and-fix (TAAF) programs when the equipment is tested until it fails; a design modification is then introduced in order to remove the observed failure cause, and the test goes on. This procedure is repeated until a desired level of reliability is achieved. The observed failure pattern is then affected by the failure mechanisms and design changes.

Discrete models incorporate a change in the failure intensity at each design change, which is not necessarily introduced at failure occurrence. Thus, more than one failure can occur during each development stage and the failure pattern depends on failure mechanisms, design changes and repair policy. In many cases, discrete models assume explicitly that the failure intensity is constant during each development stage, so that the resulting plot of the failure intensity as a function of test time is a non-increasing series of steps (step-intensity models). Thus, the failure process in each stage is modeled by an HPP (e.g. see the non-parametric model of Robinson and Dietrich [90] and the parametric model of Sen [91]). Under the step-intensity assumption, dealing with repairable units that, after repair and/or design modifications, continue to operate until the next failure is like dealing with non-repairable units (with constant hazard rate)


which are replaced at each failure. Thus, models based on the explicit assumption of constant intensity in each development phase could be viewed as models for non-repairable units with exponentially distributed lifetime and will not be treated in this chapter, which deals with models specifically addressing repairable equipment.

18.7.1 Continuous Models

The most commonly used continuous model in the analysis of repairable equipment undergoing a TAAF program is the PLP model of Section 18.4.2.1, whose intensity function is

ρ(t) = (β/α)(t/α)^{β−1}    α > 0,  0 < β < 1    (18.85)

In the TAAF context, the power-law parameter β is a measure of the management and engineering efforts in eliminating the failure modes during the development program. Small β values arise when an aggressive program is conducted by a knowledgeable staff.

The PLP was proposed and found its first application in a reliability growth context [12], specifically to explain the empirical results of Duane [71]. Duane analyzed the failure data of five types of mechanical unit, such as complex hydromechanical devices, aircraft generators and jet engines, undergoing development programs, and observed that a plot of the cumulative number of failures N(t) observed until time t against the ratio t/N(t) was nearly linear when a log–log plot was used.
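Duane's empirical check is easy to reproduce: regressing ln[t/N(t)] on ln t should give a nearly linear fit, and for a PLP the slope estimates 1 − β. The cumulative failure times below are hypothetical.

```python
import numpy as np

# Hypothetical cumulative failure times from a TAAF test (hours)
t = np.array([35.0, 96.0, 210.0, 430.0, 790.0, 1300.0, 2100.0, 3200.0])
k = np.arange(1, len(t) + 1)              # N(t_k) = k at the k-th failure
cum_mtbf = t / k                          # the ratio t/N(t)

# For a PLP, E[N(t)] = (t/alpha)^beta, so t/N(t) grows like t^(1-beta):
# the log-log slope is the Duane growth rate 1 - beta.
slope, intercept = np.polyfit(np.log(t), np.log(cum_mtbf), 1)
print(f"growth rate = {slope:.2f}, implied beta = {1 - slope:.2f}")
```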

In the application of the PLP in a reliability growth context, the inferential procedures are mainly aimed at estimating the reliability of the equipment at the end of the development program. In fact, if no further improvement is incorporated into the equipment design after the development program, the reliability level at the end of the program will characterize the reliability of the equipment as it goes into production. In addition, it is generally assumed that the current intensity function of the equipment as it goes into production remains constant during the so-called useful life period [92] and is equal to the intensity ρ(T) at the termination time T of the development program. Thus, the failure process of the equipment as it goes into production can be modeled by an HPP with constant intensity ρ(T), and the current reliability is given by

R_0(t_0) = exp[−ρ(T) t_0]    t_0 > 0    (18.86)

ML inferential procedures for the current intensity and current reliability can be found in Lee and Lee [22], Bain and Engelhardt [23], Crow [93, 94], Calabria et al. [24], Tsokos and Rao [95], and Sen and Khattree [96]. These procedures, which refer both to failure- and time-truncated sampling, are based on pivotal properties of certain quantities and allow point estimates and exact confidence intervals to be obtained. In particular, if the development program terminates at the time t_n of the nth failure, confidence intervals on the current intensity ρ(t_n) and the current reliability R_0(t_0) can be obtained by exploiting the result that [22]

ρ̂(t_n)/ρ(t_n) ∼ (1/(4n²)) Z S    (18.87)

where Z and S are independent chi-squared random variables with 2(n − 1) and 2n degrees of freedom respectively. For single or multiple copies undergoing a time-truncated development program, confidence intervals on ρ(T) and R_0(t_0) can be obtained by using tables given by Crow [93, 94].
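In place of published tables, the pivotal result in Equation 18.87 can also be exploited by simulation: the quantiles of ZS/(4n²) give confidence limits on ρ(t_n). The sketch below uses hypothetical values of ρ̂ and n.

```python
import numpy as np

def current_intensity_ci(rho_hat, n, conf=0.90, n_sim=200_000, seed=1):
    """Confidence limits on rho(t_n) from the pivotal quantity
    rho_hat/rho ~ Z*S/(4 n^2), Z ~ chi2(2(n-1)), S ~ chi2(2n)."""
    rng = np.random.default_rng(seed)
    piv = (rng.chisquare(2 * (n - 1), n_sim)
           * rng.chisquare(2 * n, n_sim)) / (4.0 * n**2)
    q_lo, q_up = np.quantile(piv, [(1 - conf) / 2, (1 + conf) / 2])
    return rho_hat / q_up, rho_hat / q_lo   # since rho = rho_hat / pivot

print(current_intensity_ci(rho_hat=0.012, n=20))   # hypothetical values
```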

A quasi-Bayes method to estimate the current intensity function has been proposed by Higgins and Tsokos [97], whereas a Bayes procedure for estimating both the current intensity and the current reliability on the basis of prior information on the parameter β can be found in Calabria et al. [98].

Some prediction problems have found a solution in Calabria and Pulcini [99] and Beiser and Rigdon [40]. In particular, Calabria and Pulcini [99] have found prediction procedures for the current lifetime t_0, both in the ML and Bayesian contexts. Exact confidence limits on t_0 can be computed by using a specific table, whereas, if prior information on the parameter β is available, prediction on t_0 can be made through a Bayesian procedure. Finally, prediction of the number of


failures that new equipment will experience in a future time interval can be obtained by using the Bayesian procedure given by Beiser and Rigdon [40].

Another continuous model that has found application in reliability growth analysis is the NHPP with log–linear intensity of Section 18.4.2.2:

ρ(t) = exp(α + βt)    −∞ < α < ∞,  β < 0    (18.88)

Unlike the PLP intensity, which becomes infinite as t tends to zero, the LLP intensity more realistically approaches the finite positive value exp(α) as t tends to zero. In the reliability growth context, the LLP yields a defective distribution for the time to first failure [100], and hence there is a positive probability that no failure will be observed during the development program.

When the TAAF program is carried out on several units, some of which are manufactured while the development program goes on, the initial reliability of all units put on test is not the same. In fact, the manufacturing process improves as the TAAF program goes on, since process-related defects, discovered during the TAAF program, are corrected. Thus, a unit manufactured a little while after the beginning of the production process will have fewer defects than one manufactured exactly at the beginning of the production process. A model that takes into account the dependence of the initial reliability of equipment on the age of the manufacturing process has been proposed by Heimann and Clark [101]. This process-related reliability-growth model describes the design process through a PLP (with β < 1) and assumes that the scale parameter α is not constant but depends on the age X of the manufacturing process when the unit is manufactured: α(X) = α[1 − exp(−bX)]. Thus, the intensity function

ρ(t | X) = β t^{β−1} / {α[1 − exp(−bX)]}^β    (18.89)

depends through X on the maturity level of the manufacturing process. The ML estimates of model parameters, and then of the process-age-dependent intensity ρ(t | X), can be obtained by maximization of the log-likelihood. Tests to assess the null hypothesis of constant initial reliability over all units (i.e. constant scale parameter over X) against the process-related model in Equation 18.89 can be performed through the log-likelihood ratio statistic.

18.7.2 Discrete Models

Discrete models find applications when one or more identical copies of the equipment are put on test and design modifications are not introduced a few at each failure but are implemented in block at the end of each development stage (delayed fixes) as a result of the experience gained during that stage. The next stage begins with new copies that incorporate the design changes and, if those modifications are effective, the reliability of the new design is higher than the reliability of the previous design.

One of the earliest models on this topic was proposed by Robinson and Dietrich [102]. It is based on the assumption that each time a failure occurs the equipment is perfectly repaired, so that the failure pattern in each development stage is described through a renewal process. The perfect repair assumption is not realistic when the development program involves complex equipment.

More recently, a discrete model based on the assumptions that the equipment is minimally repaired at failure and the failure process in each stage is modeled by a PLP has been proposed by Ebrahimi [103] and Calabria et al. [104]. The parameter β is constant over design modifications (changes do not alter the failure mechanism), whereas the scale parameter α increases from one stage to the next as a result of change effectiveness. Thus, the intensity function in the jth stage is

ρ_j(t) = (β/α_j)(t/α_j)^{β−1}    j = 1, 2, . . .    (18.90)


where time t is measured from the beginning of the jth stage. ML estimates can be found in [103], where the intensity (Equation 18.90) has been reparameterized in terms of µ_j = α_j^{−β} (j = 1, 2, . . . ). The parameters µ_j satisfy the conditions µ_1 ≥ µ_2 ≥ · · · , since a reliability growth between stages is modeled. If T_j (j = 1, 2, . . . , m) denotes the test time of the jth stage and n_j failures occur in the jth stage at times t_{1,j} < t_{2,j} < · · · < t_{n_j,j}, then the likelihood is

L(β, µ_1, . . . , µ_m) = [∏_{j=1}^{m} µ_j^{n_j}] β^n [∏_{j=1}^{m} ∏_{i=1}^{n_j} t_{i,j}^{β−1}] × exp(−∑_{j=1}^{m} µ_j T_j^β)    (18.91)

where n is the total number of observed failures. The constrained ML estimate of model parameters can be found through a two-stage procedure that assures µ_1 ≥ µ_2 ≥ · · · ≥ µ_m, and approximate confidence intervals can be obtained through the asymptotic distribution of the estimators.

The Bayesian approach proposed by Calabria et al. [104] is addressed to the solution of a decision-making problem rather than an inferential one. Several identical copies of the equipment are put on test at each development stage. At the end of each stage, depending on the reliability level achieved, a decision between two alternative actions is made: (a) to accept the current design of the equipment for mass production, or (b) to continue the development program. The equipment reliability at stage j is measured by the expected number of failures in a given time interval (0, τ), say N_j(τ), so that the decision process is based on the posterior density of N_j(τ) and on specific loss functions that measure the economic consequences associated with each alternative action.

Acknowledgment

The author is indebted to Professor Maurizio Guida for valuable discussions and suggestions which helped the exposition of the article.

References

[1] Ascher H, Feingold H. Repairable systems reliability: modeling, inference, misconceptions and their causes. New York, Basel: Marcel Dekker; 1984.
[2] Pham H, Wang H. Imperfect maintenance. Eur J Oper Res 1996;94:425–38.
[3] Lawless JF, Thiagarajah K. A point-process model incorporating renewals and time trends, with application to repairable systems. Technometrics 1996;38:131–8.
[4] Cox DR, Isham V. Point processes. London, New York: Chapman and Hall; 1980.
[5] Thompson WA. Point process models with applications to safety and reliability. New York, London: Chapman and Hall; 1988.
[6] Thompson WA. On the foundations of reliability. Technometrics 1981;23:1–13.
[7] Berman M, Turner TR. Approximating point process likelihoods with GLIM. Appl Stat 1992;41:31–8.
[8] Ascher HE. Reliability models for repairable systems. In: Møltoft J, Jensen F, editors. Reliability technology—theory and applications. Oxford: Elsevier; 1986. p. 177–85.
[9] Ascher H, Feingold H. "Bad-as-old" analysis of system failure data. In: Proceedings of 8th Reliability and Maintenance Conference, Denver, 1969; p. 49–62.
[10] Cox DR, Lewis PAW. The statistical analysis of series of events. London: Chapman and Hall; 1978.
[11] Bain LJ, Engelhardt M, Wright FT. Tests for an increasing trend in the intensity of a Poisson process: a power study. J Am Stat Assoc 1985;80:419–22.
[12] Crow LH. Reliability analysis for complex, repairable systems. In: Proschan F, Serfling RJ, editors. Reliability and biometry. Philadelphia: SIAM; 1974. p. 379–410.
[13] Lewis PAW, Robinson DW. Testing for a monotone trend in a modulated renewal process. In: Proschan F, Serfling RJ, editors. Reliability and biometry. Philadelphia: SIAM; 1974. p. 163–82.
[14] Vaurio JK. Identification of process and distribution characteristics by testing monotonic and non-monotonic trends in failure intensities and hazard rates. Reliab Eng Syst Saf 1999;64:345–57.
[15] Lawless JF. Statistical models and methods for lifetime data. New York: John Wiley and Sons; 1982.
[16] Beiser JA, Rigdon SE. Bayesian prediction for the number of failures of a repairable system modeled by a homogeneous Poisson process. In: Vogt WG, Mickle MH, editors. Modeling and simulation, vol. 23, part 4. Control, signal processing, robotics, systems, power. Pittsburgh: Instrument Society of America; 1992. p. 1783–9.
[17] Ascher H, Feingold H. Is there repair after failure? In: Proceedings of Annual Reliability and Maintenance Symposium, Los Angeles, 1978; p. 190–7.
[18] Proschan F. Theoretical explanation of observed decreasing failure rate. Technometrics 1963;5:375–83.
[19] Kumar U, Klefsjö B. Reliability analysis of hydraulic systems of LHD machines using the power law process model. Reliab Eng Syst Saf 1992;35:217–24.


[20] Majumdar SK. Study on reliability modelling of a hydraulic excavator system. Qual Reliab Eng Int 1995;11:49–63.
[21] Finkelstein JM. Confidence bounds on the parameters of the Weibull process. Technometrics 1976;18:115–7.
[22] Lee L, Lee SK. Some results on inference for the Weibull process. Technometrics 1978;20:41–5.
[23] Bain LJ, Engelhardt M. Inferences on the parameters and current system reliability for a time truncated Weibull process. Technometrics 1980;22:421–6.
[24] Calabria R, Guida M, Pulcini G. Some modified maximum likelihood estimators for the Weibull process. Reliab Eng Syst Saf 1988;23:51–8.
[25] Lee L. Comparing rates of several independent Weibull processes. Technometrics 1980;22:427–30.
[26] Calabria R, Guida M, Pulcini G. Power bounds for a test of equality of trends in k independent power law processes. Commun Stat Theor Methods 1992;21:3275–90.
[27] Bain LJ, Engelhardt M. Sequential probability ratio tests for the shape parameter of a nonhomogeneous Poisson process. IEEE Trans Reliab 1982;R-31:79–83.
[28] Calabria R, Guida M, Pulcini G. Reliability analysis of repairable systems from in-service failure count data. Appl Stoch Models Data Anal 1994;10:141–51.
[29] Rigdon SE, Basu AP. The power law process: a model for the reliability of repairable systems. J Qual Technol 1989;21:251–60.
[30] Park WJ, Kim YG. Goodness-of-fit tests for the power-law process. IEEE Trans Reliab 1992;41:107–11.
[31] Baker RD. Some new tests of the power law process. Technometrics 1996;38:256–65.
[32] Engelhardt M, Bain LJ. Prediction intervals for the Weibull process. Technometrics 1978;20:167–9.
[33] Calabria R, Guida M, Pulcini G. Point estimation of future failure times of a repairable system. Reliab Eng Syst Saf 1990;28:23–34.
[34] Bain LJ, Engelhardt M. Statistical analysis of reliability and life-testing models: theory and methods. 2nd edition. New York, Basel, Hong Kong: Marcel Dekker; 1991.
[35] Rigdon SE, Ma X, Bodden KM. Statistical inference for repairable systems using the power law process. J Qual Technol 1998;30:395–400.
[36] Guida M, Calabria R, Pulcini G. Bayes inference for a non-homogeneous Poisson process with power intensity law. IEEE Trans Reliab 1989;38:603–9.
[37] Bar-Lev SK, Lavi I, Reiser B. Bayesian inference for the power law process. Ann Inst Stat Math 1992;44:623–39.
[38] Campodónico S, Singpurwalla ND. Inference and predictions from Poisson point processes incorporating expert knowledge. J Am Stat Assoc 1995;90:220–6.
[39] Calabria R, Guida M, Pulcini G. Bayes estimation of prediction intervals for a power law process. Commun Stat Theor Methods 1990;19:3023–35.
[40] Beiser JA, Rigdon SE. Bayes prediction for the number of failures of a repairable system. IEEE Trans Reliab 1997;46:291–5.
[41] Lilius WA. Graphical analysis of repairable systems. In: Proceedings of Annual Reliability and Maintenance Symposium, Los Angeles, USA, 1979; p. 403–6.
[42] Härtler G. Graphical Weibull analysis of repairable systems. Qual Reliab Eng Int 1985;1:23–6.
[43] Nelson W. Graphical analysis of system repair data. J Qual Technol 1988;20:24–35.
[44] Martz HF. Pooling life test data by means of the empirical Bayes method. IEEE Trans Reliab 1975;R-24:27–30.
[45] Bassin WM. Increasing hazard functions and overhaul policy. In: Proceedings of Annual Symposium on Reliability, Chicago, USA, 1969; p. 173–8.
[46] Ahn CW, Chae KC, Clark GM. Estimating parameters of the power law process with two measures of failure time. J Qual Technol 1998;30:127–32.
[47] Ascher HE, Hansen CK. Spurious exponentiality observed when incorrectly fitting a distribution to nonstationary data. IEEE Trans Reliab 1998;47:451–9.
[48] Leung FKN, Cheng ALM. Determining replacement policies for bus engines. Int J Qual Reliab Manage 2000;17:771–83.
[49] Aven T. Some tests for comparing reliability growth/deterioration rates of repairable systems. IEEE Trans Reliab 1989;38:440–3.
[50] Huang Y-S, Bier VM. A natural conjugate prior for the nonhomogeneous Poisson process with an exponential intensity function. Commun Stat Theor Methods 1999;28:1479–509.
[51] Engelhardt M, Bain LJ. On the mean time between failures for repairable systems. IEEE Trans Reliab 1986;R-35:419–22.
[52] Pulcini G. A bounded intensity process for the reliability of repairable equipment. J Qual Technol 2001;33:480–92.
[53] Lee L. Testing adequacy of the Weibull and log linear rate models for a Poisson process. Technometrics 1980;22:195–9.
[54] Love CE, Guo R. An application of a bathtub failure model to imperfectly repaired systems data. Qual Reliab Eng Int 1993;9:127–35.
[55] Pulcini G. Modeling the failure data of a repairable equipment with bathtub type failure intensity. Reliab Eng Syst Saf 2001;71:209–18.
[56] Coetzee JL. Reliability degradation and the equipment replacement problem. In: Proceedings of International Conference of the Maintenance Societies (ICOMS-96), Melbourne, 1996; paper 21.
[57] Kumar U, Klefsjö B, Granholm S. Reliability investigation for a fleet of load–haul–dump machines in a Swedish mine. Reliab Eng Syst Saf 1989;26:341–61.
[58] Bendell A, Walls LA. Exploring reliability data. Qual Reliab Eng Int 1985;1:37–51.
[59] Lawless JF. Regression methods for Poisson process data. J Am Stat Assoc 1987;82:808–15.
[60] Guida M, Giorgio M. Reliability analysis of accelerated life-test data from a repairable system. IEEE Trans Reliab 1995;44:337–46.
[61] Kumar D. Proportional hazards modelling of repairable systems. Qual Reliab Eng Int 1995;11:361–9.


[62] Guo R, Love CE. Statistical analysis of an age model for imperfectly repaired systems. Qual Reliab Eng Int 1992;8:133–46.
[63] Malik MAK. Reliable preventive maintenance scheduling. AIIE Trans 1979;11:221–8.
[64] Shin I, Lim TJ, Lie CH. Estimating parameters of intensity function and maintenance effect for repairable unit. Reliab Eng Syst Saf 1996;54:1–10.
[65] Berman M. Inhomogeneous and modulated gamma processes. Biometrika 1981;68:143–52.
[66] Lakey MJ, Rigdon SE. The modulated power law process. In: Proceedings of 45th Annual ASQC Quality Congress, Milwaukee, 1992; p. 559–63.
[67] Black SE, Rigdon SE. Statistical inference for a modulated power law process. J Qual Technol 1996;28:81–90.
[68] Calabria R, Pulcini G. Bayes inference for the modulated power law process. Commun Stat Theor Methods 1997;26:2421–38.
[69] Calabria R, Pulcini G. Bayes inference for repairable mechanical units with imperfect or hazardous maintenance. Int J Reliab Qual Saf Eng 1998;5:65–83.
[70] Calabria R, Pulcini G. On testing for repair effect and time trend in repairable mechanical units. Commun Stat Theor Methods 1999;28:367–87.
[71] Duane JT. Learning curve approach to reliability. IEEE Trans Aerosp 1964;2:563–6.
[72] Calabria R, Pulcini G. Inference and test in modeling the failure/repair process of repairable mechanical equipments. Reliab Eng Syst Saf 2000;67:41–53.
[73] Chan J-K, Shaw L. Modeling repairable systems with failure rates that depend on age and maintenance. IEEE Trans Reliab 1993;42:566–71.
[74] Calabria R, Pulcini G. Discontinuous point processes for the analysis of repairable units. Int J Reliab Qual Saf Eng 1999;6:361–82.
[75] Nakagawa T. Optimum policies when preventive maintenance is imperfect. IEEE Trans Reliab 1979;R-28:331–2.
[76] Brown M, Proschan F. Imperfect repair. J Appl Prob 1983;20:851–9.
[77] Whitaker LR, Samaniego FJ. Estimating the reliability of systems subject to imperfect repair. J Am Stat Assoc 1989;84:301–9.
[78] Lim JH, Lu KL, Park DH. Bayesian imperfect repair model. Commun Stat Theor Methods 1998;27:965–84.
[79] Lim TJ. Estimating system reliability with fully masked data under Brown–Proschan imperfect repair model. Reliab Eng Syst Saf 1998;59:277–89.
[80] Lim TJ, Lie CH. Analysis of system reliability with dependent repair modes. IEEE Trans Reliab 2000;49:153–62.
[81] Block HW, Borges WS, Savits TH. Age-dependent minimal repair. J Appl Prob 1985;22:370–85.
[82] Love CE, Guo R. Using proportional hazard modelling in plant maintenance. Qual Reliab Eng Int 1991;7:7–17.
[83] Love CE, Guo R. Application of Weibull proportional hazards modelling to bad-as-old failure data. Qual Reliab Eng Int 1991;7:149–57.
[84] Jack N. Analysing event data from a repairable machine subject to imperfect preventive maintenance. Qual Reliab Eng Int 1997;13:183–6.
[85] Pulcini G. On the overhaul effect for repairable mechanical units: a Bayes approach. Reliab Eng Syst Saf 2000;70:85–94.
[86] Pulcini G. On the prediction of future failures for a repairable equipment subject to overhauls. Commun Stat Theor Methods 2001;30:691–706.
[87] Nakagawa T. Sequential imperfect preventive maintenance policies. IEEE Trans Reliab 1988;37:295–8.
[88] Kobbacy KAH, Fawzi BB, Percy DF, Ascher HE. A full history proportional hazards model for preventive maintenance scheduling. Qual Reliab Eng Int 1997;13:187–98.
[89] Benton AW, Crow LH. Integrated reliability growth testing. In: Proceedings of the Annual Reliability and Maintenance Symposium, Atlanta, USA, 1989; p. 160–6.
[90] Robinson D, Dietrich D. A nonparametric-Bayes reliability-growth model. IEEE Trans Reliab 1989;38:591–8.
[91] Sen A. Estimation of current reliability in a Duane-based growth model. Technometrics 1996;40:334–44.
[92] Crow LH. Tracking reliability growth. In: Proceedings of 20th Conference on Design of Experiments. ARO Report 75-2, 1975; p. 741–54.
[93] Crow LH. Confidence interval procedures for the Weibull process with application to reliability growth. Technometrics 1982;24:67–72.
[94] Crow LH. Confidence intervals on the reliability of repairable systems. In: Proceedings of the Annual Reliability and Maintenance Symposium, Atlanta, USA, 1993; p. 126–33.
[95] Tsokos CP, Rao ANV. Estimation of failure intensity for the Weibull process. Reliab Eng Syst Saf 1994;45:271–5.
[96] Sen A, Khattree R. On estimating the current intensity of failure for the power-law process. J Stat Plan Infer 1998;74:253–72.
[97] Higgins JJ, Tsokos CP. A quasi-Bayes estimate of the failure intensity of a reliability-growth model. IEEE Trans Reliab 1981;R-30:471–5.
[98] Calabria R, Guida M, Pulcini G. A Bayes procedure for estimation of current system reliability. IEEE Trans Reliab 1992;41:616–20.
[99] Calabria R, Pulcini G. Maximum likelihood and Bayes prediction of current system lifetime. Commun Stat Theor Methods 1996;25:2297–309.
[100] Cozzolino JM. Probabilistic models of decreasing failure rate processes. Nav Res Logist Q 1968;15:361–74.
[101] Heimann DI, Clark WD. Process-related reliability-growth modeling: how and why. In: Proceedings of the Annual Reliability and Maintenance Symposium, Las Vegas, USA, 1992; p. 316–21.
[102] Robinson DG, Dietrich D. A new nonparametric growth model. IEEE Trans Reliab 1987;R-36:411–8.
[103] Ebrahimi N. How to model reliability-growth when times of design modifications are known. IEEE Trans Reliab 1996;45:54–8.
[104] Calabria R, Guida M, Pulcini G. A reliability-growth model in a Bayes-decision framework. IEEE Trans Reliab 1996;45:505–10.


Chapter 19

Preventive Maintenance Models: Replacement, Repair, Ordering, and Inspection

Tadashi Dohi, Naoto Kaio and Shunji Osaki

19.1 Introduction
19.2 Block Replacement Models
19.2.1 Model I
19.2.2 Model II
19.2.3 Model III
19.3 Age Replacement Models
19.3.1 Basic Age Replacement Model
19.4 Ordering Models
19.4.1 Continuous-time Model
19.4.2 Discrete-time Model
19.4.3 Combined Model with Minimal Repairs
19.5 Inspection Models
19.5.1 Nearly Optimal Inspection Policy by Kaio and Osaki (K&O Policy)
19.5.2 Nearly Optimal Inspection Policy by Munford and Shahani (M&S Policy)
19.5.3 Nearly Optimal Inspection Policy by Nakagawa and Yasui (N&Y Policy)
19.6 Concluding Remarks

19.1 Introduction

Many systems, such as airplanes and computer networks, have become larger in scale and more complicated, and they greatly influence our society. Reliability/maintainability theory (engineering) plays a very important role in maintaining such systems. Mathematical maintenance policies, centering on preventive maintenance, have been developed mainly in the research area of operations research/management science, to generate effective preventive maintenance schedules. The most important problem in mathematical maintenance policies is to design a maintenance plan with two maintenance options: preventive replacement and corrective replacement. In preventive replacement, the system or unit is replaced by a new one before it fails. On the other hand, with

corrective replacement it is the failed unit that is replaced. Hitherto, a huge number of replacement methods have been proposed in the literature. At the same time, some technical books on this problem have been published. For example, Arrow et al. [1], Barlow and Proschan [2, 3], Jorgenson et al. [4], Gnedenko et al. [5], Gertsbakh [6], Ascher and Feingold [7] and Osaki [8, 9] are the classical but very important texts to study regarding mathematical maintenance theory. Many authors have also published several monographs on specific topics. The reader is referred to Osaki and Hatoyama [10], Osaki and Cao [11], Ozekici [12], and Christer et al. [13]. Recently, Barlow [14] and Aven and Jensen [15] presented excellent textbooks reviewing mathematical maintenance theory. There are also some useful survey papers reviewing the history of this research context, such


as those by McCall [16], Osaki and Nakagawa [17], Pierskalla and Voelker [18], Sherif and Smith [19], and Valdez-Flores and Feldman [20].

Since the mechanism that causes failure in almost all real complex systems may be considered to be uncertain, the mathematical techniques to deal with maintenance problems should be based on the usual probability theory. If we are interested in the dynamic behavior of system failures depending on time, the problems are essentially reduced to studying the stochastic processes presenting phenomena of both failure and replacement. In fact, since mathematical maintenance theory depends strongly on the theory of stochastic processes, many textbooks on stochastic processes have treated several maintenance problems. See, for example, Feller [21], Karlin and Taylor [22, 23], Taylor and Karlin [24], and Ross [25]. In other words, in order to design a maintenance schedule effectively, both the underlying stochastic process governing the failure mechanism and the role of maintenance options carried out on the process have to be analyzed carefully. In that sense, mathematical maintenance theory is one of the most important parts of applied probability modeling.

In this tutorial article, we present the preventive maintenance policies arising in the context of maintenance theory. Simple but practically important preventive maintenance optimization models, that involve age replacement and block replacement, are reviewed in the framework of the well-known renewal reward argument. Some extensions to these basic models, as well as the corresponding discrete-time models, are also introduced with the aim of the application of the theory to practice. In almost all textbooks and technical papers, the discrete-time preventive maintenance models have been paid scant attention. The main reason is that discrete-time models are ordinarily considered as trivial analogies of continuous-time models. However, we often face some maintenance problems modeled in the discrete-time setting in practice. If one considers the situation where the number of take-offs from airports influences the damage to an airplane, the parts comprising the airplane should be replaced at a prespecified number of take-offs rather than

after an elapsed time. Also, in the Japanese electric power company under our investigation, the failure time data of electric switching devices are recorded as group data (the number of failures per year) and it is not easy to carry out a preventive replacement schedule on a weekly or monthly basis, since the service team is engaged in other works, too. From our questionnaire, it is helpful for practitioners that a preventive replacement schedule should be carried out on a near-annual basis. This will motivate attention toward discrete-time maintenance models.

19.2 Block Replacement Models

For block replacement models, the preventive replacement is executed periodically at a prespecified time kt_0 (t_0 ≥ 0) or kN (N = 0, 1, 2, . . . ), (k = 1, 2, 3, . . . ). If the unit fails during the time interval ((k − 1)t_0, kt_0] or ((k − 1)N, kN], then the corrective maintenance is made at the failure time. The main property of the block replacement is that it is easier to administer in general, since the preventive replacement time is scheduled in advance and one need not observe the age of a unit. In this section, we develop the following three variations of the block replacement model:

1. a failed unit is replaced instantaneously at failure (Model I);
2. a failed unit remains inoperable until the next scheduled replacement comes (Model II);
3. a failed unit undergoes minimal repair (Model III).

The cost components used in this section are as follows:

c_p (>0): unit preventive replacement cost at time kt_0 or kN (k = 1, 2, 3, . . . )
c_c (>0): unit corrective replacement cost at failure time
c_m (>0): unit minimal repair cost at failure time.

19.2.1 Model I

First, we consider the continuous-time block replacement model [2]. A failed unit is replaced


by a new one during the replacement interval t_0, and the scheduled replacement for the non-failed unit is performed at kt_0 (k = 1, 2, 3, . . . ). Let F(t) be the continuous lifetime distribution with finite mean 1/λ (>0). From the well-known renewal reward theorem, it is immediate to formulate the expected cost per unit time in the steady state for the block replacement model as follows:

B_c(t_0) = [c_c M(t_0) + c_p]/t_0    t_0 ≥ 0    (19.1)

where the function M(t) = ∑_{k=1}^{∞} F^{(k)}(t) denotes the mean number of failures during the time period (0, t] (the renewal function) and F^{(k)}(t) the k-fold convolution of the lifetime distribution. The problem is, of course, to derive the optimal block replacement time t_0^∗ that minimizes B_c(t_0).

Define the numerator of the derivative of B_c(t_0) as

j_c(t_0) = c_c[t_0 m(t_0) − M(t_0)] − c_p    (19.2)

where the function m(t) = dM(t)/dt is the renewal density. Then, we have the optimal block replacement time t_0^∗ that minimizes the expected cost per unit time in the steady state B_c(t_0).

Theorem 1.

(1) Suppose that the function m(t) is strictly increasing with respect to t (>0).
(i) If j_c(∞) > 0, then there exists one finite optimal block replacement time t_0^∗ (0 < t_0^∗ < ∞) that satisfies j_c(t_0^∗) = 0. Then the corresponding minimum expected cost is

B_c(t_0^∗) = c_c m(t_0^∗)    (19.3)

(ii) If j_c(∞) ≤ 0, then t_0^∗ → ∞, i.e. it is optimal to carry out only the corrective replacement, and the corresponding minimum expected cost is

B_c(∞) = λ c_c    (19.4)

(2) Suppose that the function m(t) is decreasing with respect to t (>0). Then, the optimal block replacement time is t_0^∗ → ∞.
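A minimal numerical sketch of Theorem 1 in use: for a Weibull lifetime the renewal function can be approximated by discretizing the renewal equation M = F + M ∗ dF, and B_c(t_0) then minimized on the grid. The Weibull parameters and costs below are hypothetical.

```python
import numpy as np

beta, eta = 2.0, 100.0        # hypothetical Weibull shape/scale (IFR case)
cc, cp = 5.0, 1.0             # hypothetical corrective/preventive costs
h, n_grid = 0.5, 600
t = h * np.arange(n_grid + 1)
F = 1.0 - np.exp(-(t / eta) ** beta)
dF = np.diff(F, prepend=0.0)

# Discretized renewal equation: M(t_i) ~ F(t_i) + sum_j M(t_i - t_j) dF(t_j)
M = np.zeros(n_grid + 1)
for i in range(1, n_grid + 1):
    M[i] = F[i] + np.dot(dF[1:i + 1], M[i - 1::-1])

B = (cc * M[1:] + cp) / t[1:]             # Equation 19.1 on the grid
i_star = np.argmin(B)
print(f"t0* ~ {t[1 + i_star]:.1f}, B_c(t0*) ~ {B[i_star]:.4f}")
```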

Next, we formulate the discrete-time block replacement model [29]. In the discrete-time setting, the expected cost per unit time in the steady state is

B_d(N) = [c_c M(N) + c_p]/N    N = 0, 1, 2, . . .    (19.5)

where the function M(n) = ∑_{k=1}^{∞} F^{(k)}(n) is the discrete renewal function for the discrete lifetime distribution F(n) (n = 0, 1, 2, . . . ) and F^{(k)}(n) is the k-fold convolution; for more details, see Munter [26] and Kaio and Osaki [27]. Define the numerator of the difference of B_d(N) as

j_d(N) = c_c[N m(N + 1) − M(N)] − c_p    (19.6)

where the function m(n) = M(n) − M(n − 1) is the renewal probability mass function. Then, we have the optimal block replacement time N^∗ that minimizes the expected cost per unit time in the steady state B_d(N).

Theorem 2.

(1) Suppose that m(n) is strictly increasing with respect to n (>0).
(i) If j_d(∞) > 0, then there exists one finite optimal block replacement time N^∗ (0 < N^∗ < ∞) that satisfies j_d(N − 1) < 0 and j_d(N) ≥ 0. Then the corresponding minimum expected cost satisfies the inequality

c_c m(N^∗) < B_d(N^∗) ≤ c_c m(N^∗ + 1)    (19.7)

(ii) If j_d(∞) ≤ 0, then N^∗ → ∞, i.e. it is optimal to carry out only the corrective replacement, and the corresponding minimum expected cost is

B_d(∞) = λ c_c    (19.8)

(2) Suppose that the function m(n) is decreasing with respect to n (>0). Then, the optimal block replacement time is N^∗ → ∞.
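The discrete case of Theorem 2 works the same way; a brief sketch with a hypothetical discrete Weibull lifetime F(N) = 1 − q^(N^b):

```python
import numpy as np

q, b = 0.9998, 2.0                        # hypothetical discrete Weibull
cc, cp = 5.0, 1.0
N_max = 300
N = np.arange(N_max + 1)
F = 1.0 - q ** (N.astype(float) ** b)     # lifetime cdf, F(0) = 0
f = np.diff(F, prepend=0.0)               # pmf f(n) = F(n) - F(n-1)

# Discrete renewal equation: M(n) = F(n) + sum_{k=1..n} f(k) M(n-k)
M = np.zeros(N_max + 1)
for i in range(1, N_max + 1):
    M[i] = F[i] + np.dot(f[1:i + 1], M[i - 1::-1])

B = (cc * M[1:] + cp) / N[1:]             # Equation 19.5
N_star = int(N[1:][np.argmin(B)])
print(f"N* = {N_star}, B_d(N*) = {B.min():.4f}")
```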

Remarks. A large number of variations on the block replacement model have been studied in the literature. Though we assume in the model above that the cost component is constant,


some modifications are possible. Tilquin and Cleroux [30] and Berg and Epstein [31] extended the original model in terms of cost structure.

19.2.2 Model II

For the first model, we have assumed that a failed unit is detected instantaneously just after failure. This implies that a sensing device monitors the operating unit. Since such a case is not always general, however, we assume that the failure is detected only at kt_0 (t_0 ≥ 0) or kN (N = 0, 1, 2, . . . ), (k = 1, 2, 3, . . . ) (see Osaki [9]). Consequently, in Model II, a unit is always replaced at kt_0 or kN, but is not replaced at the time of failure, and the unit remains inoperable for the time duration from the occurrence of failure until its detection.

In the continuous-time model, since the expected duration from the occurrence of failure until its detection per cycle is given by ∫_0^{t_0} (t_0 − t) dF(t) = ∫_0^{t_0} F(t) dt, we have the expected cost per unit time in the steady state:

C_c(t_0) = [c_c ∫_0^{t_0} F(t) dt + c_p]/t_0    (19.9)

where c_c is changed to the cost of failure per unit time, i.e. the cost incurred per unit time of system downtime. Define the numerator of the derivative of C_c(t_0) with respect to t_0 as k_c(t_0), i.e.

k_c(t_0) = c_c[F(t_0) t_0 − ∫_0^{t_0} F(t) dt] − c_p    (19.10)

Theorem 3.

(i) If k_c(∞) > 0, then there exists one unique optimal block replacement time t_0^∗ (0 < t_0^∗ < ∞) that satisfies k_c(t_0^∗) = 0, and the corresponding minimum expected cost is

C_c(t_0^∗) = c_c F(t_0^∗)    (19.11)

(ii) If k_c(∞) ≤ 0, then t_0^∗ → ∞ and C_c(∞) = c_c.
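A sketch of Theorem 3 in use, solving k_c(t_0) = 0 numerically for a Weibull lifetime; the parameters and costs (note that c_c is now a downtime cost per unit time) are hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

beta, eta = 2.0, 100.0                    # hypothetical Weibull lifetime
cc, cp = 0.2, 1.0                         # hypothetical costs
F = lambda t: 1.0 - np.exp(-(t / eta) ** beta)

def k_c(t0):                              # Equation 19.10
    integral, _ = quad(F, 0.0, t0)
    return cc * (F(t0) * t0 - integral) - cp

t0_star = brentq(k_c, 1.0, 1000.0)        # bracket chosen by inspection
C_star = cc * F(t0_star)                  # Equation 19.11
print(f"t0* = {t0_star:.1f}, C_c(t0*) = {C_star:.4f}")
```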

On the other hand, in the discrete-time setting, the expected cost per unit time in the steady state is

C_d(N) = [c_c ∑_{k=1}^{N−1} F(k) + c_p]/N    N = 0, 1, 2, . . .    (19.12)

where the function F(n) is the lifetime distribution (n = 0, 1, 2, . . . ). Define the numerator of the difference of C_d(N) as

i_d(N) = c_c[N F(N) − ∑_{k=1}^{N−1} F(k)] − c_p    (19.13)

Then, we have the optimal block replacement time N^∗ that minimizes the expected cost per unit time in the steady state C_d(N).

Theorem 4.

(i) If i_d(∞) > 0, then there exists one finite optimal block replacement time N^∗ (0 < N^∗ < ∞) that satisfies i_d(N − 1) < 0 and i_d(N) ≥ 0. Then the corresponding minimum expected cost satisfies the inequality

c_c F(N^∗ − 1) < C_d(N^∗) ≤ c_c F(N^∗)    (19.14)

(ii) If i_d(∞) ≤ 0, then N^∗ → ∞.

Remarks. It is noted that Model II has not been widely studied in the literature, since it cannot detect the failure instantly and is not certainly superior to Model I in terms of cost minimization. However, as described previously, one can see that the continuous monitoring of the operating unit is not always possible in all practical applications.

19.2.3 Model III

In the final model, we assume that minimal repair is performed when a unit fails and that the failure rate is not disturbed by each repair. If we consider a stochastic process {N(t), t ≥ 0} in which N(t) represents the number of minimal repairs up to time t, the process {N(t), t ≥ 0} is governed by a non-homogeneous Poisson process with mean value function

Λ(t) = ∫_0^t r(x) dx    (19.15)


which is also called the hazard function, where the function r(t) = f(t)/F̄(t) is called the failure rate or the hazard rate; in general ψ̄(·) = 1 − ψ(·). Noting this fact, Barlow and Hunter [28] gave the expected cost per unit time in the steady state for the continuous-time model:

V_c(t_0) = [c_m Λ(t_0) + c_p]/t_0    (19.16)

Define the numerator of the derivative of V_c(t_0) as

l_c(t_0) = c_m[t_0 r(t_0) − Λ(t_0)] − c_p    (19.17)

Then, we have the optimal block replacement time (with minimal repair) t_0^∗ that minimizes the expected cost per unit time in the steady state V_c(t_0).

Theorem 5.

(1) Suppose that F(t) is strictly increasing failure rate (IFR), i.e. the failure rate is strictly increasing.
(i) If l_c(∞) > 0, then there exists one finite optimal block replacement time with minimal repair t_0^∗ (0 < t_0^∗ < ∞) that satisfies l_c(t_0^∗) = 0. Then the corresponding minimum expected cost is

V_c(t_0^∗) = c_m r(t_0^∗)    (19.18)

(ii) If l_c(∞) ≤ 0, then t_0^∗ → ∞ and the corresponding minimum expected cost is

V_c(∞) = c_m r(∞)    (19.19)

(2) Suppose that F(t) is decreasing failure rate (DFR), i.e. the failure rate is decreasing. Then, the optimal block replacement time with minimal repair is t_0^∗ → ∞.
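For a Weibull lifetime, Theorem 5 gives a closed form: with Λ(t) = (t/α)^β and r(t) = (β/α)(t/α)^{β−1}, Equation 19.17 reduces to l_c(t_0) = c_m(β − 1)(t_0/α)^β − c_p, so l_c(t_0^∗) = 0 yields t_0^∗ = α[c_p/(c_m(β − 1))]^{1/β} for β > 1. A short sketch with hypothetical numbers:

```python
# Worked example of Theorem 5 under a Weibull lifetime; the parameters
# and costs are hypothetical.
alpha, beta = 100.0, 2.5
cm, cp = 1.0, 4.0
t_star = alpha * (cp / (cm * (beta - 1.0))) ** (1.0 / beta)
v_star = cm * (beta / alpha) * (t_star / alpha) ** (beta - 1.0)  # Eq. 19.18
print(f"t0* = {t_star:.1f}, minimum cost rate = {v_star:.4f}")
```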

Next, we formulate the discrete-time block replacement model with minimal repair [29]. In the discrete-time setting, the expected cost per unit time in the steady state is

V_d(N) = [c_m Λ(N) + c_p]/N    N = 0, 1, 2, . . .    (19.20)

where the function Λ(n) is the mean value function of the discrete non-homogeneous Poisson process. Define the numerator of the difference of V_d(N) as

l_d(N) = c_m[N r(N + 1) − Λ(N)] − c_p    (19.21)

where the function r(n)=(n)−(n− 1)=f (n)/F (n− 1) is the failure rate function. Then,we have the optimal block replacement time withminimal repair N∗ that minimizes the expectedcost per unit time in the steady state Vd(N).

Theorem 6.

(1) Suppose that F(n) is strictly IFR.

(i) If l_d(∞) > 0, then there exists one finite optimal block replacement time with minimal repair N^* (0 < N^* < ∞) that satisfies l_d(N^* − 1) < 0 and l_d(N^*) ≥ 0. Then the corresponding minimum expected cost satisfies the inequality

c_m r(N^*) < V_d(N^*) ≤ c_m r(N^* + 1)   (19.22)

(ii) If l_d(∞) ≤ 0, then N^* → ∞.

(2) Suppose that F(n) is DFR. Then, the optimal block replacement time with minimal repair is N^* → ∞.

Remarks. A great number of papers on minimal repair models have been published. Morimura [32], Tilquin and Cleroux [33], and Cleroux et al. [34] introduced several interesting modifications. Later, Park [35], Nakagawa [36–40], Nakagawa and Kowada [41], Phelps [42], Berg and Cleroux [43], Boland [44], Boland and Proschan [45], Block et al. [46], and Beichelt [47] proposed extended minimal repair models from the standpoint of generalization. The most interesting of the models with minimal repair is the (t, T) policy, a combined policy with three kinds of maintenance option: minimal repair, failure replacement, and preventive replacement. That is, minimal repair is executed for failures during the first period [0, t), but failure replacement is made for failures in [t, T], where T is the preventive replacement time.


Tahara and Nishida [48] formulated this model first. It was then investigated by Phelps [49] and Segawa et al. [50]; recently, Ohnishi [51] proved the optimality of the (t, T) policy under the average cost criterion via a dynamic programming approach. This tells us that the (t, T) policy is optimal when only these three kinds of maintenance option are available.

19.3 Age Replacement Models

As is well known, in the age replacement model, if the unit does not fail until a prespecified time t_0 (t_0 ≥ 0) or N (N = 0, 1, 2, \ldots), then it is replaced by a new one preventively; otherwise, it is replaced at the failure time. Denote the corrective and the preventive replacement costs by c_c and c_p respectively, where, without loss of generality, c_c > c_p. This model plays a central role among all replacement models, since the optimality of the age replacement model was proved by Bergman [52] when replacement by a new unit is the only maintenance option (i.e. when no repair is considered as an alternative). In the remainder of this section, we introduce three kinds of age replacement model.

19.3.1 Basic Age Replacement Model

See Barlow and Proschan [2], Barlow and Hunter [53], and Osaki and Nakagawa [54]. From the renewal reward theorem, it can be seen that the expected cost per unit time in the steady state for the age replacement model is

A_c(t_0) = \frac{c_c F(t_0) + c_p \bar F(t_0)}{\int_0^{t_0} \bar F(t)\, dt}   t_0 ≥ 0   (19.23)

If one can assume that there exists a density f(t) for the lifetime distribution F(t) (t ≥ 0), the failure rate r(t) = f(t)/\bar F(t) necessarily exists. Define the numerator of the derivative of A_c(t_0) with respect to t_0, divided by \bar F(t_0), as h_c(t_0), i.e.

h_c(t_0) = r(t_0) \int_0^{t_0} \bar F(t)\, dt − F(t_0) − \frac{c_p}{c_c − c_p}   (19.24)

Then, we have the optimal age replacement time t_0^* that minimizes the expected cost per unit time in the steady state A_c(t_0).

Theorem 7.

(1) Suppose that the lifetime distribution F(t) is strictly IFR.

(i) If r(∞) > K = λc_c/(c_c − c_p), then there exists one finite optimal age replacement time t_0^* (0 < t_0^* < ∞) that satisfies h_c(t_0^*) = 0. Then the corresponding minimum expected cost is

A_c(t_0^*) = (c_c − c_p) r(t_0^*)   (19.25)

(ii) If r(∞) ≤ K, then t_0^* → ∞ and A_c(∞) = B_c(∞) = λc_c.

(2) If F(t) is DFR, then t_0^* → ∞.
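In practice, t_0^* of Theorem 7(i) is found by solving h_c(t_0) = 0 numerically. The following Python sketch is not from the original text; it uses a Weibull lifetime with hypothetical costs and requires scipy.

```python
# Sketch: solving h_c(t0) = 0 (Eq. 19.24) numerically for a Weibull lifetime.
# Parameter values are hypothetical; requires scipy.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

cc, cp = 10.0, 1.0        # corrective and preventive replacement costs (cc > cp)
m, eta = 2.0, 100.0       # Weibull shape (>1, strictly IFR) and scale

Fbar = lambda t: np.exp(-(t / eta) ** m)          # survival function
r = lambda t: (m / eta) * (t / eta) ** (m - 1.0)  # failure rate

def h_c(t0):              # Eq. 19.24
    integral, _ = quad(Fbar, 0.0, t0)
    return r(t0) * integral - (1.0 - Fbar(t0)) - cp / (cc - cp)

t_star = brentq(h_c, 1e-6, 10 * eta)              # Theorem 7(i), assuming r(inf) > K
print(t_star, (cc - cp) * r(t_star))              # minimum cost A_c(t0*), Eq. 19.25
```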

Next, let us consider the case where the cost is discounted with discount factor α (α > 0) (see Fox [55]). The present value of a unit cost incurred at time t (t ≥ 0) is e^{−αt}. In the continuous-time age replacement problem, the expected total discounted cost over an infinite time horizon is

A_c(t_0; α) = \frac{c_c \int_0^{t_0} e^{−αt} f(t)\, dt + c_p e^{−αt_0} \bar F(t_0)}{α \int_0^{t_0} e^{−αt} \bar F(t)\, dt}   t_0 ≥ 0   (19.26)

Define the numerator of the derivative of A_c(t_0; α) with respect to t_0, divided by \bar F(t_0) e^{−αt_0}, as h_c(t_0; α):

h_c(t_0; α) = r(t_0) \int_0^{t_0} e^{−αt} \bar F(t)\, dt − \int_0^{t_0} e^{−αt} f(t)\, dt − \frac{c_p}{c_c − c_p}   (19.27)

Then, we have the optimal age replacement time t_0^* that minimizes the expected total discounted cost over an infinite time horizon A_c(t_0; α).

Theorem 8.

(1) Suppose that the lifetime distribution F(t) is strictly IFR.


(i) If r(∞) > K(α), then there exists one finite optimal age replacement time t_0^* (0 < t_0^* < ∞) that satisfies h_c(t_0^*; α) = 0, where

K(α) = \frac{c_c F^*(α) + c_p \bar F^*(α)}{(c_c − c_p) \bar F^*(α)/α}   (19.28)

F^*(α) = \int_0^{∞} e^{−αt} f(t)\, dt   (19.29)

Then the corresponding minimum expected cost is

A_c(t_0^*; α) = \frac{(c_c − c_p) r(t_0^*)}{α} − c_p   (19.30)

(ii) If r(∞) ≤ K(α), then t_0^* → ∞ and

A_c(∞; α) = c_c F^*(α)/\bar F^*(α)   (19.31)

(2) If F(t) is DFR, then t_0^* → ∞.

Following Nakagawa and Osaki [56], let us consider the discrete-time age replacement model. Define the discrete-time lifetime distribution F(n) (n = 0, 1, 2, \ldots), the probability mass function f(n), and the failure rate r(n) = f(n)/\bar F(n − 1). From the renewal reward theorem, it can be seen that the expected cost per unit time in the steady state for the age replacement model is

A_d(N) = \frac{c_c F(N) + c_p \bar F(N)}{\sum_{i=1}^{N} \bar F(i − 1)}   N = 1, 2, \ldots   (19.32)

Define the numerator of the difference of A_d(N) as

h_d(N) = r(N + 1) \sum_{i=1}^{N} \bar F(i − 1) − F(N) − \frac{c_p}{c_c − c_p}   (19.33)

Then, we have the optimal age replacement time N^* that minimizes the expected cost per unit time in the steady state A_d(N).

Theorem 9.

(1) Suppose that the lifetime distribution F(N) is strictly IFR.

(i) If r(∞) > K, then there exists one finite optimal age replacement time N^* (0 < N^* < ∞) that satisfies h_d(N^* − 1) < 0 and h_d(N^*) ≥ 0. Then the corresponding minimum expected cost satisfies the following inequality:

(c_c − c_p) r(N^*) < A_d(N^*) ≤ (c_c − c_p) r(N^* + 1)   (19.34)

(ii) If r(∞) ≤ K, then N^* → ∞ and A_d(∞) = B_d(∞) = λc_c.

(2) If F(N) is DFR, then N^* → ∞.
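The discrete optimum of Theorem 9(i) is again found by a forward scan, now of h_d(N) in Equation 19.33. A Python sketch, not from the original chapter, with a hypothetical discrete Weibull lifetime:

```python
# Sketch: Theorem 9(i) by forward scan of h_d(N) (Eq. 19.33), assuming a
# discrete Weibull lifetime with survival Fbar(n) = q**(n**m); values hypothetical.
q, m = 0.999, 2.0
cc, cp = 10.0, 1.0

Fbar = lambda n: q ** (n ** m)                    # survival function
f = lambda n: Fbar(n - 1) - Fbar(n)               # mass function f(n)
r = lambda n: f(n) / Fbar(n - 1)                  # failure rate r(n)

def h_d(N):               # Eq. 19.33
    s = sum(Fbar(i - 1) for i in range(1, N + 1))
    return r(N + 1) * s - (1.0 - Fbar(N)) - cp / (cc - cp)

N = 1
while h_d(N) < 0:         # first N with h_d(N) >= 0 is N*
    N += 1
A_d = (cc * (1 - Fbar(N)) + cp * Fbar(N)) / sum(Fbar(i - 1) for i in range(1, N + 1))
print(N, A_d)             # check Eq. 19.34: (cc-cp)r(N*) < A_d <= (cc-cp)r(N*+1)
```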

We introduce the discount factor β (0 < β < 1) in the discrete-time age replacement problem. The present value of a unit cost after n (n = 0, 1, 2, \ldots) periods is β^n. In the discrete-time age replacement problem, the expected total discounted cost over an infinite time horizon is

A_d(N; β) = \Big[ c_c \sum_{j=0}^{N} β^j f(j) + c_p β^N \bar F(N) \Big] \Big[ \frac{1 − β}{β} \sum_{i=1}^{N} β^i \bar F(i − 1) \Big]^{-1}   N = 1, 2, \ldots   (19.35)

Define the numerator of the difference of A_d(N; β) as

h_d(N; β) = r(N + 1) \sum_{i=1}^{N} β^i \bar F(i − 1) − \sum_{j=0}^{N} β^j f(j) − \frac{c_p}{c_c − c_p}   (19.36)

Then, we have the optimal age replacement time N^* that minimizes the expected total discounted cost over an infinite time horizon A_d(N; β).

Theorem 10.

(1) Suppose that the lifetime distribution F(N) is strictly IFR.

(i) If r(∞) > K(β), then there exists one finite optimal age replacement time N^* (0 < N^* < ∞) that satisfies h_d(N^* − 1; β) < 0 and h_d(N^*; β) ≥ 0. Then the corresponding minimum expected cost satisfies the following inequalities:

\frac{(c_c − c_p) r(N^*)}{(1 − β)/β} − c_p < A_d(N^*; β)   (19.37)

and

A_d(N^*; β) ≤ \frac{(c_c − c_p) r(N^* + 1)}{(1 − β)/β} − c_p   (19.38)

where

K(β) = \Big\{ c_c \sum_{j=0}^{∞} β^j f(j) + c_p \sum_{j=0}^{∞} (1 − β^j) f(j) \Big\} \Big\{ (c_c − c_p) \sum_{i=1}^{∞} β^i \bar F(i − 1) \Big\}^{-1}   (19.39)

(ii) If r(∞) ≤ K(β), then N^* → ∞ and

A_d(∞; β) = \frac{c_c \sum_{j=0}^{∞} β^j f(j)}{\sum_{j=0}^{∞} (1 − β^j) f(j)}   (19.40)

(2) If F(N) is DFR, then N^* → ∞.

Theorem 11.

(1) For the continuous-time age replacement problems, the following relationships hold:

A_c(t_0) = \lim_{α→0} α A_c(t_0; α)   (19.41)

h_c(t_0) = \lim_{α→0} h_c(t_0; α)   (19.42)

K = \lim_{α→0} K(α)   (19.43)

(2) For the discrete-time age replacement problems, the following relationships hold:

A_d(N) = \lim_{β→1} (1 − β) A_d(N; β)   (19.44)

h_d(N) = \lim_{β→1} h_d(N; β)   (19.45)

K = \lim_{β→1} K(β)   (19.46)

Remarks. Glasser [57], Scheaffer [58], Cleroux and Hanscom [59], Osaki and Yamada [60], and Nakagawa and Osaki [61] extended the basic age replacement model mentioned above. Here, we introduce an interesting topic: the age replacement policy under a cost criterion different from the expected cost per unit time in the steady state. Based on the seminal idea of Derman and Sacks [62], Ansell et al. [63] analyzed the age replacement model under an alternative cost criterion. In the continuous-time model with no discounting, let Y_i and S_i denote the total cost and the time length of the ith cycle (i = 1, 2, \ldots) respectively, where Y_i = c_c I_{\{X_i < t_0\}} + c_p I_{\{X_i ≥ t_0\}}, S_i = min{X_i, t_0}, X_i is the lifetime for the ith cycle, and I_{\{·\}} is the indicator function.

From Equation 19.23, we find that

\lim_{t→∞} \frac{E[\text{total cost on } (0, t]]}{t} = E[Y_i]/E[S_i] = A_c(t_0)   (19.47)

On the other hand, let N_U(t) denote the number of cycles up to time t. Then, we define

η(t) ≡ \frac{1}{N_U(t)} \sum_{i=1}^{N_U(t)} E[Y_i/S_i]   (19.48)

where η(t) is the mean of the ratio E[Y_i/S_i] during N_U(t) cycles. From the independence of each cycle, we have

A_c^*(t_0) = \lim_{t→∞} η(t) = E[Y_i/S_i] = \int_0^{t_0} (c_c/t)\, dF(t) + \int_{t_0}^{∞} (c_p/t_0)\, dF(t)   (19.49)

This cost criterion is called the expected cost ratio and is, of course, different from E[Y_i]/E[S_i]. Ansell et al. [63] compared this model with an approximated age replacement policy over a finite time horizon due to Christer [64, 65] and Christer and Jack [66].
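Since A_c^*(t_0) and A_c(t_0) generally differ, it can be instructive to evaluate both numerically. The following Python sketch is not from the original text; it uses a Weibull lifetime with hypothetical costs (shape m ≥ 2 keeps the integrand f(t)/t bounded near zero) and requires scipy.

```python
# Sketch: evaluating the expected cost ratio A*_c(t0) of Eq. 19.49 for a
# Weibull lifetime and comparing it with A_c(t0) of Eq. 19.23.
import numpy as np
from scipy.integrate import quad

cc, cp = 10.0, 1.0
m, eta = 2.0, 100.0       # hypothetical Weibull shape and scale

f = lambda t: (m / eta) * (t / eta) ** (m - 1.0) * np.exp(-(t / eta) ** m)
Fbar = lambda t: np.exp(-(t / eta) ** m)

def A_star(t0):           # Eq. 19.49: E[Y_i / S_i]
    a, _ = quad(lambda t: cc / t * f(t), 0.0, t0)
    b, _ = quad(f, t0, np.inf)
    return a + (cp / t0) * b

def A(t0):                # Eq. 19.23: E[Y_i] / E[S_i]
    num = cc * (1 - Fbar(t0)) + cp * Fbar(t0)
    den, _ = quad(Fbar, 0.0, t0)
    return num / den

t0 = 50.0
print(A_star(t0), A(t0))  # the two criteria generally give different values
```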

19.4 Ordering Models

In both block and age replacement problems, a spare unit is available whenever the original


unit fails. However, this assumption is questionable in many practical cases: if a sufficiently large number of spare units is always kept on hand, a large inventory holding cost is incurred. Hence, if system failure can be regarded as a rare event for the operating system, the spare unit should instead be ordered when it is required. There were seminal contributions by Wiggins [67], Allen and D'Esopo [68, 69], Simon and D'Esopo [70], Nakagawa and Osaki [71], and Osaki [72]. A large number of ordering/order-replacement models have been analyzed by many authors; for instance, the reader should refer to Thomas and Osaki [73, 74], Kaio and Osaki [75–78], and Osaki et al. [79]. A comprehensive bibliography of this research area is given in Dohi et al. [80].

19.4.1 Continuous-time Model

Let us consider a replacement problem for a one-unit system where each failed unit is scrapped and each spare is provided, after a lead time, by an order. The original unit begins operating at time t = 0, and the planning horizon is infinite. If the original unit does not fail up to a prespecified time t_0 ∈ [0, ∞), the regular order for a spare is made at time t_0, and after a lead time L_r (> 0) the spare is delivered. If the original unit has already failed by t = t_0 + L_r, the delivered spare takes over its operation immediately; even if the original unit is still operating, it is replaced by the spare preventively. On the other hand, if the original unit fails before time t_0, the expedited order is made immediately at the failure time, and the spare takes over its operation just after it is delivered, after a lead time L_e (> 0). In this situation, it should be noted that the regular order is not made. The same cycle then repeats itself continually.

Under this model, define the interval from one replacement to the following replacement as one cycle. Let c_e (> 0), c_r (> 0), k_f (> 0), w (> 0), and s (< 0) be the expedited ordering cost, the regular ordering cost, the system down (shortage) cost per unit time, the operation cost per unit time, and the salvage cost per unit time, respectively. Then, the expected cost per unit time in the steady state is

O_c(t_0) = V_c(t_0)/T_c(t_0)   (19.50)

where

V_c(t_0) = c_e \int_0^{t_0} dF(t) + c_r \int_{t_0}^{∞} dF(t) + k_f \Big[ \int_0^{t_0} L_e\, dF(t) + \int_{t_0}^{t_0+L_r} (t_0 + L_r − t)\, dF(t) \Big] + w \Big[ \int_0^{t_0+L_r} t\, dF(t) + \int_{t_0+L_r}^{∞} (t_0 + L_r)\, dF(t) \Big] + s \int_{t_0+L_r}^{∞} (t − t_0 − L_r)\, dF(t)

= c_e F(t_0) + c_r \bar F(t_0) + k_f \Big[ (L_e − L_r) F(t_0) + \int_{t_0}^{t_0+L_r} F(t)\, dt \Big] + w \int_0^{t_0+L_r} \bar F(t)\, dt + s \int_{t_0+L_r}^{∞} \bar F(t)\, dt   t_0 ≥ 0   (19.51)

and

T_c(t_0) = \int_0^{t_0} (t + L_e)\, dF(t) + \int_{t_0}^{∞} (t_0 + L_r)\, dF(t) = (L_e − L_r) F(t_0) + L_r + \int_0^{t_0} \bar F(t)\, dt   (19.52)

Define the numerator of the derivative of O_c(t_0) with respect to t_0, divided by \bar F(t_0), as q_c(t_0), i.e.

q_c(t_0) = \{ (k_f − w + s) R(t_0) + (w − s) + [k_f (L_e − L_r) + (c_e − c_r)] r(t_0) \} \Big[ (L_e − L_r) F(t_0) + L_r + \int_0^{t_0} \bar F(t)\, dt \Big] − \Big\{ w \int_0^{t_0+L_r} \bar F(t)\, dt + c_e F(t_0) + c_r \bar F(t_0) + k_f \Big[ (L_e − L_r) F(t_0) + \int_{t_0}^{t_0+L_r} F(t)\, dt \Big] + s \int_{t_0+L_r}^{∞} \bar F(t)\, dt \Big\} [(L_e − L_r) r(t_0) + 1]   (19.53)

where the function

R(t_0) = \frac{F(t_0 + L_r) − F(t_0)}{\bar F(t_0)}   (19.54)

has the same monotone properties as the failure rate r(t_0), i.e. R(t) is increasing (decreasing) if and only if r(t) is increasing (decreasing). Then, we have the optimal order replacement (ordering) time t_0^* that minimizes the expected cost per unit time in the steady state O_c(t_0).

Theorem 12.

(1) Suppose that the lifetime distribution F(t) is strictly IFR.

(i) If q_c(0) < 0 and q_c(∞) > 0, then there exists one finite optimal order replacement time t_0^* (0 < t_0^* < ∞) that satisfies q_c(t_0^*) = 0. Then the corresponding minimum expected cost is

O_c(t_0^*) = \frac{(k_f − w + s) R(t_0^*) + (w − s) + μ(t_0^*)}{(L_e − L_r) r(t_0^*) + 1}   (19.55)

where

μ(t_0) = [k_f (L_e − L_r) + (c_e − c_r)] r(t_0)   (19.56)

(ii) If q_c(∞) ≤ 0, then t_0^* → ∞ and

O_c(∞) = \frac{w/λ + c_e + k_f L_e}{L_e + 1/λ}   (19.57)

(iii) If q_c(0) ≥ 0, then t_0^* = 0 and

O_c(0) = \frac{1}{L_r} \Big[ w \int_0^{L_r} \bar F(t)\, dt + c_r + k_f \int_0^{L_r} F(t)\, dt + s \int_{L_r}^{∞} \bar F(t)\, dt \Big]   (19.58)

(2) Suppose that F(t) is DFR. If the inequality

\Big[ w \int_0^{L_r} \bar F(t)\, dt + c_r + k_f \int_0^{L_r} F(t)\, dt + s \int_{L_r}^{∞} \bar F(t)\, dt \Big] (L_e + 1/λ) < (w/λ + c_e + k_f L_e) L_r   (19.59)

holds, then t_0^* = 0; otherwise t_0^* → ∞.
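Because O_c(t_0) = V_c(t_0)/T_c(t_0) involves only elementary integrals of F and \bar F, it can also be minimized directly by evaluation over a grid, without forming q_c. The Python sketch below is illustrative only; the Weibull lifetime, cost values, and lead times are hypothetical, and it requires scipy.

```python
# Sketch: direct numerical minimization of O_c(t0) = V_c(t0)/T_c(t0)
# (Eqs. 19.50-19.52) over a grid, for a Weibull lifetime.
import numpy as np
from scipy.integrate import quad

ce, cr, kf, w, s = 6.0, 2.0, 20.0, 1.0, -0.5   # note s < 0 (salvage)
Le, Lr = 3.0, 1.0                              # expedited / regular lead times
m, eta = 2.0, 10.0

F = lambda t: 1.0 - np.exp(-(t / eta) ** m)
Fbar = lambda t: np.exp(-(t / eta) ** m)

def O_c(t0):
    V = (ce * F(t0) + cr * Fbar(t0)
         + kf * ((Le - Lr) * F(t0) + quad(F, t0, t0 + Lr)[0])
         + w * quad(Fbar, 0.0, t0 + Lr)[0]
         + s * quad(Fbar, t0 + Lr, np.inf)[0])           # Eq. 19.51
    T = (Le - Lr) * F(t0) + Lr + quad(Fbar, 0.0, t0)[0]  # Eq. 19.52
    return V / T

grid = np.linspace(0.0, 5 * eta, 501)
print(grid[np.argmin([O_c(t) for t in grid])])           # approximate t0*
```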

19.4.2 Discrete-time Model

In the discrete order-replacement model, the function in Equation 19.54 is given by

R(N) = \frac{\sum_{n=1}^{N+L_r} f(n) − \sum_{n=1}^{N} f(n)}{\sum_{n=N+1}^{∞} f(n)}   (19.60)

Then, the expected cost per unit time in the steady state is

O_d(N) = V_d(N)/T_d(N)   (19.61)

where

V_d(N) = w \sum_{i=1}^{N+L_r} \sum_{n=i}^{∞} f(n) + c_e \sum_{n=1}^{N−1} f(n) + c_r \sum_{n=N}^{∞} f(n) + k_f \Big[ (L_e − L_r) \sum_{n=1}^{N−1} f(n) + \sum_{i=N+1}^{N+L_r} \sum_{n=1}^{i−1} f(n) \Big] + s \sum_{i=N+L_r+1}^{∞} \sum_{n=i}^{∞} f(n)   (19.62)

and

T_d(N) = (L_e − L_r) \sum_{n=1}^{N−1} f(n) + L_r + \sum_{i=1}^{N} \sum_{n=i}^{∞} f(n)   (19.63)

Notice in the equations above that L_e and L_r are positive integers.

Similar to Equation 19.53, define the numerator of the difference of O_d(N) with respect to N, divided by \bar F(N), as q_d(N), i.e.

q_d(N) = \{ (k_f − w + s) R(N) + (w − s) + [k_f (L_e − L_r) + (c_e − c_r)] r(N) \} \Big[ (L_e − L_r) \sum_{n=1}^{N−1} f(n) + L_r + \sum_{i=1}^{N} \sum_{n=i}^{∞} f(n) \Big] − \Big\{ w \sum_{i=1}^{N+L_r} \sum_{n=i}^{∞} f(n) + c_e \sum_{n=1}^{N−1} f(n) + c_r \sum_{n=N}^{∞} f(n) + k_f \Big[ (L_e − L_r) \sum_{n=1}^{N−1} f(n) + \sum_{i=N+1}^{N+L_r} \sum_{n=1}^{i−1} f(n) \Big] + s \sum_{i=N+L_r+1}^{∞} \sum_{n=i}^{∞} f(n) \Big\} [(L_e − L_r) r(N) + 1]   (19.64)

Then, we have the optimal order replacement time N^* that minimizes the expected cost per unit time in the steady state O_d(N).

Theorem 13.

(1) Suppose that the lifetime distribution F(N) is strictly IFR.

(i) If q_d(0) < 0 and q_d(∞) > 0, then there exists one finite optimal order replacement time N^* (0 < N^* < ∞) that satisfies q_d(N^* − 1) < 0 and q_d(N^*) ≥ 0.

(ii) If q_d(∞) ≤ 0, then N^* → ∞.

(iii) If q_d(0) ≥ 0, then N^* = 0.

(2) Suppose that F(N) is DFR. Then either N^* = 0 or N^* → ∞, according to whether the discrete analogue of the inequality in Equation 19.59 holds.

Remarks. This section has dealt with typical order-replacement models in both continuous- and discrete-time settings. These models can be extended in various directions. Thomas and Osaki [74] and Dohi et al. [80] presented continuous models with stochastic lead times. Recently, Dohi and coworkers [81–83] applied the order-replacement model to special continuous-review cyclic inventory control problems.

19.4.3 Combined Model with Minimal Repairs

In the preceding subsections, if the original unit fails and there is no spare on hand, the system remains down until the spare is delivered. In this subsection, we modify the above model to accommodate systems where, whenever the original unit fails, minimal repair is made to the failed unit and it continues operating until it can be replaced by the spare. We add a cost c_m suffered for each minimal repair instead of k_f, and use a uniform cannibalization cost c_s instead of s. We discuss the combined model with minimal repairs for the continuous-time case.

Consider a one-unit system, where each spare is provided only after a lead time by an order. Each failure is detected instantaneously, and for each failed unit a minimal repair is executed in a negligible time. The original unit starts operating at time zero, and the planning horizon is infinite. If the original unit does not fail before a prespecified time t_0 ∈ [0, ∞), the regular order for a spare is made at time t_0 and the spare is delivered after a lead time L; the original unit is then exchanged for that spare immediately. If the original unit fails up to the delivery of the spare, minimal repair is executed at each failure. On the other hand, if the original unit fails before time t_0, the minimal repair and the expedited order are executed simultaneously at the failure time, and the original unit is exchanged for the spare as soon as it is delivered after a lead time L; again, minimal repair is executed at each failure up to the delivery of the spare. In this case, the regular order is not made. After each minimal repair, the unit continues its operation. Each exchange is made instantaneously, and after an exchange the system starts operating immediately. We define the interval from one exchange to the following exchange as one cycle. The cycle then repeats itself continually.

Page 393: Handbook of Reliability Engineering - pdi2.org

360 Maintenance Theory and Testing

The lifetime of each unit obeys an identical and arbitrary distribution F(t) with a finite mean 1/λ and a density f(t). Let us introduce the following four costs: a cost c_e is suffered for each expedited order made before time t_0; a cost c_r is suffered for each regular order made at time t_0; a cost c_m is suffered for each minimal repair made at each failure time; and a cost c_s is suffered as a salvage cost at each exchange time. We assume that c_e > c_r. Let us define the failure rate as

r(t) = f(t)/\bar F(t)   (19.65)

where \bar F(t) = 1 − F(t). This failure rate r(t) is assumed to be differentiable. These assumptions seem reasonable.

Under these assumptions, we analyze this model and discuss the optimal ordering policies that minimize the expected cost per unit time in the steady state.

The expected cost per one cycle, A(t_0), is given by

A(t_0) = c_e F(t_0) + c_r \bar F(t_0) + c_m \Big[ F(t_0 + L) + \int_0^{t_0} \int_t^{t+L} r(τ)\, dτ\, dF(t) + \int_{t_0}^{t_0+L} \int_t^{t_0+L} r(τ)\, dτ\, dF(t) \Big] + c_s   (19.66)

Also, the mean time of one cycle, M(t_0), is given by

M(t_0) = L + \int_0^{t_0} \bar F(t)\, dt   (19.67)

Thus, the expected cost per unit time in the steady state, C(t_0), is given by

C(t_0) = A(t_0)/M(t_0)   (19.68)

and

C(0) = \frac{c_r + c_m \int_0^{L} r(τ)\, dτ + c_s}{L}   (19.69)

C(∞) = \frac{c_e + c_m \big[ 1 + \int_0^{∞} \int_t^{t+L} r(τ)\, dτ\, dF(t) \big] + c_s}{L + 1/λ}   (19.70)

Define the following function from the numerator of the derivative of C(t_0):

a(t_0) = [(c_e − c_r) r(t_0) + c_m r(t_0 + L)] M(t_0) − A(t_0)   (19.71)

The sufficient conditions for the existence of optimal ordering policies are presented as follows.

Theorem 14.

(i) If a(0) < 0, then there exists at least one optimal ordering time t_0^* (0 < t_0^* ≤ ∞) minimizing the expected cost per unit time in the steady state C(t_0).

(ii) If a(∞) > 0, then there exists at least one optimal ordering time t_0^* (0 ≤ t_0^* < ∞) minimizing the expected cost per unit time in the steady state C(t_0).

Next, under monotonicity assumptions on the failure rate, we have the following results.

Theorem 15.

(1) Suppose that the failure rate is strictly increasing.

(i) If a(0) < 0 and a(∞) > 0, then there exists a finite and unique optimal ordering time t_0^* (0 < t_0^* < ∞) that minimizes the expected cost C(t_0), satisfying a(t_0^*) = 0, and the corresponding expected cost is given by

C(t_0^*) = (c_e − c_r) r(t_0^*) + c_m r(t_0^* + L)   (19.72)

(ii) If a(0) ≥ 0, then the optimal ordering time is t_0^* = 0, i.e. the order for a spare is made when the unit is put in service. The corresponding expected cost is given by Equation 19.69.

(iii) If a(∞) ≤ 0, then the optimal ordering time is t_0^* → ∞, i.e. the order for a spare is made at the same time as the first failure of the original unit. The corresponding expected cost is given by Equation 19.70.

(2) Suppose that the failure rate is decreasing.

(i) If

\Big[ c_r + c_m \int_0^{L} r(τ)\, dτ + c_s \Big] (L + 1/λ) < \Big\{ c_e + c_m \Big[ 1 + \int_0^{∞} \int_t^{t+L} r(τ)\, dτ\, dF(t) \Big] + c_s \Big\} L   (19.73)

then t_0^* = 0.

(ii) If

\Big[ c_r + c_m \int_0^{L} r(τ)\, dτ + c_s \Big] (L + 1/λ) ≥ \Big\{ c_e + c_m \Big[ 1 + \int_0^{∞} \int_t^{t+L} r(τ)\, dτ\, dF(t) \Big] + c_s \Big\} L   (19.74)

then t_0^* → ∞.
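For a Weibull lifetime the inner integral of the failure rate in Equation 19.66 is available in closed form, since ∫_t^{t+L} r(τ) dτ = R(t + L) − R(t) with cumulative hazard R(t) = (t/η)^m, so C(t_0) = A(t_0)/M(t_0) can be evaluated with single quadratures. A Python sketch, not from the original text, with hypothetical parameter values (requires scipy):

```python
# Sketch: the combined ordering/minimal-repair cost C(t0) = A(t0)/M(t0)
# (Eqs. 19.66-19.68) for a Weibull lifetime; values hypothetical.
import numpy as np
from scipy.integrate import quad

ce, cr, cm, cs = 6.0, 2.0, 1.5, 0.5
L, m, eta = 2.0, 2.0, 10.0

R = lambda t: (t / eta) ** m                     # cumulative hazard
f = lambda t: (m / eta) * (t / eta) ** (m - 1.0) * np.exp(-R(t))
F = lambda t: 1.0 - np.exp(-R(t))
Fbar = lambda t: np.exp(-R(t))

def C(t0):
    repairs = (F(t0 + L)                          # expected minimal repairs, Eq. 19.66
               + quad(lambda t: (R(t + L) - R(t)) * f(t), 0.0, t0)[0]
               + quad(lambda t: (R(t0 + L) - R(t)) * f(t), t0, t0 + L)[0])
    A = ce * F(t0) + cr * Fbar(t0) + cm * repairs + cs
    M = L + quad(Fbar, 0.0, t0)[0]                # Eq. 19.67
    return A / M

grid = np.linspace(0.0, 5 * eta, 501)
print(grid[np.argmin([C(t) for t in grid])])      # approximate t0*
```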

So far, we have treated a non-negative ordering time t_0. However, if we choose the initial time zero carefully, the ordering time t_0 can be allowed to be negative. That is, if the regular order for the spare is made at time t_0 ∈ (−L, 0), the original unit begins operation after the time interval −t_0, i.e. at time zero, the spare is delivered at time t_0 + L, and the original unit is exchanged for that spare immediately, according to the rule above. In this case only the regular order is made, since the order is always placed before the original unit fails.

Here, we treat the ordering policy not only with a non-negative ordering time t_0 but also with a negative one as mentioned above, i.e. the regular ordering time t_0 ranges over (−L, ∞). The expected cost per unit time in the steady state C(t_0) in Equation 19.68 remains valid for t_0 ∈ (−L, ∞), since it is clear that F(t_0) = 0 and \bar F(t_0) = 1 for negative t_0. Furthermore, we assume that c_r + c_s > 0.

Thus, we can obtain theorems corresponding to those given above.

Theorem 16. If a(∞) > 0, then there exists at least one finite optimal ordering time t_0^* (−L < t_0^* < ∞) minimizing the expected cost per unit time in the steady state C(t_0) in Equation 19.68.

Theorem 17.

(1) Suppose that the failure rate is strictly increasing.

(i) If a(∞) > 0, then there exists a finite and unique optimal ordering time t_0^* (−L < t_0^* < ∞) that minimizes the expected cost C(t_0) in Equation 19.68, satisfying a(t_0^*) = 0, and the corresponding expected cost is given by Equation 19.72.

(ii) If a(∞) ≤ 0, then the optimal ordering time is t_0^* → ∞.

(2) Suppose that the failure rate is decreasing. Then t_0^* → ∞.

Remarks. A negative ordering time is of interest only if we can definitely choose the operational starting point of the system, i.e. the initial time zero. For example, it is of no interest to consider a negative ordering time in the ordering model in which a regularly ordered spare, delivered while the original unit has not yet failed, is put into inventory until the original unit first fails, since in that model we cannot choose the operational starting point of the system definitely.

19.5 Inspection Models

There are several systems in which failures are not detected immediately when they occur, usually those in which failure is not catastrophic and an inspection is needed to reveal the fault. If we execute too many inspections, then system failure is detected rapidly, but we incur a high inspection cost. Conversely, if we execute few inspections, the interval between a failure and its detection increases and we incur a high cost of failure. The optimal inspection policy minimizes the total expected cost composed of the costs for inspection and system failure. From this viewpoint, many authors have discussed optimal and/or nearly optimal inspection policies [2, 84–93].

Among those policies, the inspection policy discussed by Barlow and coworkers [2, 84] is the most famous. They discussed the optimal inspection policy in the following model. A one-unit system is considered that obeys an arbitrary lifetime distribution F(t) with a probability density function (p.d.f.) f(t). The system is inspected at prespecified times t_k (k = 1, 2, 3, \ldots), where each inspection is executed perfectly and instantaneously. The policy terminates with the inspection that detects a system failure. The costs considered are c_i (> 0), the cost of an inspection, and k_f (> 0), the cost of failure per unit of time. Then the total expected cost is

C_B = \sum_{k=0}^{∞} \int_{t_k}^{t_{k+1}} [c_i (k + 1) + k_f (t_{k+1} − t)]\, dF(t)   (19.75)

They obtained Algorithm 1 to seek the optimal inspection-time sequence that minimizes the total expected cost in Equation 19.75, using the recurrence formula

t_{k+1} − t_k = \frac{F(t_k) − F(t_{k−1})}{f(t_k)} − \frac{c_i}{k_f}   k = 1, 2, 3, \ldots   (19.76)

where f(t) is a PF2 (Pólya frequency function of order two) with f(t + Δ)/f(t) strictly decreasing in t ≥ 0 for Δ > 0, with f(t) > 0 for t > 0, and t_0 = 0.

Algorithm 1.

begin
  choose t_1 to satisfy c_i = k_f \int_0^{t_1} F(t)\, dt;
  repeat
    compute t_2, t_3, \ldots recursively using Equation 19.76;
    if any t_{k+1} − t_k > t_k − t_{k−1}, then reduce t_1;
    if any t_{k+1} − t_k < 0, then increase t_1;
  until t_1 < t_2 < \cdots are determined to the degree of accuracy required;
end
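A direct way to mechanize Algorithm 1 is to replace the trial-and-error adjustment of t_1 with bisection: a t_1 that is too small eventually produces a negative interval, while one that is too large produces growing intervals. The Python transcription below is a sketch under that interpretation, not the authors' code; the Weibull parameters (shape 2 is PF2) and costs are hypothetical.

```python
# Sketch: Algorithm 1 (Barlow et al.) with bisection on the first time t1.
import numpy as np

ci, kf = 1.0, 5.0                     # inspection cost, failure cost per unit time
m, eta = 2.0, 10.0
F = lambda t: 1.0 - np.exp(-(t / eta) ** m)
f = lambda t: (m / eta) * (t / eta) ** (m - 1.0) * np.exp(-(t / eta) ** m)

def sequence(t1, K=20):
    """Iterate Eq. 19.76 from t1; report the type of first violation, if any."""
    ts = [0.0, t1]
    for _ in range(K):
        dt = (F(ts[-1]) - F(ts[-2])) / f(ts[-1]) - ci / kf
        if dt < 0:
            return ts, "negative"     # interval < 0: t1 was too small
        if dt > ts[-1] - ts[-2]:
            return ts, "growing"      # intervals increasing: t1 was too large
        ts.append(ts[-1] + dt)
    return ts, "ok"

lo, hi = 1e-6, eta
for _ in range(60):                   # bisection on t1
    t1 = 0.5 * (lo + hi)
    ts, status = sequence(t1)
    if status == "negative":
        lo = t1
    elif status == "growing":
        hi = t1
    else:
        break                         # monotone decreasing intervals: accept
print([round(t, 3) for t in ts[:6]])
```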

However, Algorithm 1 of Barlow and coworkers is complicated to execute, because one must apply trial and error to decide the first inspection time t_1, and the assumption on f(t) is restrictive. To overcome these difficulties, some improved procedures that obtain nearly optimal inspection policies have been proposed [85, 87–93].

We review the nearly optimal inspection policies proposed by Kaio and Osaki [85, 92, 93], Munford and Shahani [87], and Nakagawa and Yasui [90]. We follow the inspection model and notation introduced by Barlow and coworkers. For details, see each of the contributed papers in the references.

19.5.1 Nearly Optimal Inspection Policy by Kaio and Osaki (K&O Policy)

Introduce the inspection density n(t) at time t, a smooth function that denotes the approximate number of inspections per unit time at time t. Then the total expected cost up to the detection of the failure is approximately

C_n(n(t)) = c_i \int_0^{∞} n(t) \bar F(t)\, dt + k_f \int_0^{∞} \frac{1}{2 n(t)}\, dF(t)   (19.77)

where, in general, \bar\psi = 1 − \psi. The density n(t) that minimizes the functional C_n(n(t)) in Equation 19.77 is

n(t) = [k_c r(t)]^{1/2}   (19.78)

where k_c = k_f/(2c_i) and r(t) = f(t)/\bar F(t) is the failure rate. The inspection times t_k (k = 1, 2, 3, \ldots) satisfy

k = \int_0^{t_k} n(t)\, dt   k = 1, 2, 3, \ldots   (19.79)

Substituting n(t) from Equation 19.78 into Equation 19.79 yields the nearly optimal inspection-time sequence.

Kaio and Osaki obtained this procedure by developing that of Keller [91]. For details, see Kaio and Osaki [85]. Note that the procedure does not depend on assumptions about the p.d.f. f(t).
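The K&O times follow mechanically from Equations 19.78 and 19.79: compute the cumulative inspection density N(t) = ∫_0^t n(u) du and invert it at k = 1, 2, 3, …. The Python sketch below is not from the original paper; the Weibull lifetime and costs are hypothetical, and it requires scipy.

```python
# Sketch: K&O nearly optimal inspection times from Eqs. 19.78-19.79,
# inverting k = N(t_k) where N(t) = int_0^t n(u) du.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

ci, kf = 1.0, 5.0
m, eta = 2.0, 10.0
kc = kf / (2.0 * ci)
r = lambda t: (m / eta) * (t / eta) ** (m - 1.0)   # failure rate
n = lambda t: np.sqrt(kc * r(t))                   # inspection density, Eq. 19.78

def t_k(k):
    N = lambda t: quad(n, 0.0, t)[0] - k           # Eq. 19.79
    return brentq(N, 1e-9, 100 * eta)

print([round(t_k(k), 3) for k in range(1, 6)])     # first five inspection times
```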


19.5.2 Nearly Optimal Inspection Policy by Munford and Shahani (M&S Policy)

Let

\frac{F(t_k) − F(t_{k−1})}{\bar F(t_{k−1})} = p   k = 1, 2, 3, \ldots;  0 < p < 1   (19.80)

Then the inspection times t_k (k = 1, 2, 3, \ldots) are

t_k = F^{−1}(1 − \bar p^{\,k})   k = 1, 2, 3, \ldots   (19.81)

where \bar p = 1 − p, and the probability p is chosen such that the approximate total expected cost up to the detection of the failure, C_p(p), is minimized:

C_p(p) = \frac{c_i}{p} + k_f \Big[ \sum_{k=1}^{∞} t_k \bar p^{\,k−1} p − \int_0^{∞} t f(t)\, dt \Big]   (19.82)

This procedure does not depend on assumptions about the p.d.f. f(t). For details, see Munford and Shahani [87]; additionally, see Munford and Shahani [88] for the case of the Weibull distribution and Tadikamalla [89] for the gamma distribution.
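For the Weibull case, F^{−1}(1 − p̄^k) = η (k ln(1/p̄))^{1/m}, so C_p(p) can be minimized by a simple grid search over p. The Python sketch below is illustrative and not the authors' procedure; the cost values and the series truncation K are hypothetical choices.

```python
# Sketch: the M&S policy (Eqs. 19.80-19.82) for a Weibull lifetime,
# using t_k = eta * (k * ln(1/pbar))**(1/m) and a grid search over p.
import numpy as np
from math import gamma, log

ci, kf = 1.0, 5.0
m, eta = 2.0, 10.0
mean = eta * gamma(1.0 + 1.0 / m)                  # E[X] = int_0^inf t f(t) dt

def C_p(p, K=2000):                                # truncate the series at K terms
    pbar = 1.0 - p
    ks = np.arange(1, K + 1)
    t = eta * (ks * log(1.0 / pbar)) ** (1.0 / m)  # Eq. 19.81
    delay = np.sum(t * pbar ** (ks - 1) * p) - mean
    return ci / p + kf * delay                     # Eq. 19.82

ps = np.linspace(0.01, 0.99, 99)
p_star = ps[np.argmin([C_p(p) for p in ps])]
print(p_star, C_p(p_star))
```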

19.5.3 Nearly Optimal Inspection Policy by Nakagawa and Yasui (N&Y Policy)

This procedure is based on that of Barlow and coworkers [2, 84] (abbreviated to the B policy below). If the p.d.f. f(t) is PF2, then Algorithm 2 is obtained.

Algorithm 2.

begin
  choose d appropriately for 0 < d < c_i/k_f;
  determine t_n after sufficient time has elapsed to give the degree of accuracy required;
  compute t_{n−1} to satisfy

  t_n − t_{n−1} − d = \frac{F(t_n) − F(t_{n−1})}{f(t_n)} − \frac{c_i}{k_f};

  repeat
    compute t_{n−2} > t_{n−3} > \cdots recursively using Equation 19.76;
  until t_i < 0 or t_{i+1} − t_i > t_i;
end

For details, see Nakagawa and Yasui [90].

Remarks. From numerical comparisons with Weibull and gamma distributions, we conclude that there are no significant differences between the optimal and nearly optimal inspection policies [93]; consequently, we should adopt the policy that is simplest to compute, that of Kaio and Osaki [85]. The K&O policy has the following advantages:

(i) We can obtain the nearly optimal inspection policy uniquely, immediately, and easily from Equations 19.78 and 19.79 for any distribution, whereas the B and N&Y policies cannot treat non-PF2 distributions.

(ii) We can analyze more complicated models and easily obtain their nearly optimal inspection policies; e.g. see Kaio and Osaki [85].

19.6 Concluding Remarks

This chapter has discussed the basic preventive maintenance policies and their extensions, in terms of both continuous- and discrete-time modeling. For further reading on the discrete models, see Nakagawa [29, 94] and Nakagawa and Osaki [95]. Since mathematical maintenance models are applicable to a variety of real problems, such modeling techniques will be useful for practitioners and researchers. Though we have reviewed only a limited set of maintenance models in the limited space available here, a number of earlier models should be reformulated in a discrete-time setting because, in most cases, the continuous-time models can be regarded as approximate models for actual maintenance problems, and the maintenance schedule is often desired in discretized circumstances. These motivations for the discrete-time setting are reinforced by the recent development of computer technologies and their related computational abilities.


Acknowledgments

This work was partially supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Sports, Science and Culture of Japan under grant nos. 09780411 and 09680426, and by the Research Program 1999 of the Institute for Advanced Studies of Hiroshima Shudo University, Hiroshima, Japan.

References

[1] Arrow KJ, Karlin S, Scarf H, editors. Studies in applied probability and management science. Stanford: Stanford University Press; 1962.
[2] Barlow RE, Proschan F. Mathematical theory of reliability. New York: John Wiley & Sons; 1965.
[3] Barlow RE, Proschan F. Statistical theory of reliability and life testing: probability models. New York: Holt, Rinehart and Winston; 1975.
[4] Jorgenson DW, McCall JJ, Radner R. Optimal replacement policy. Amsterdam: North-Holland; 1967.
[5] Gnedenko BV, Belyayev YK, Solovyev AD. Mathematical methods of reliability theory. New York: Academic Press; 1969.
[6] Gertsbakh IB. Models of preventive maintenance. Amsterdam: North-Holland; 1977.
[7] Ascher H, Feingold H. Repairable systems reliability. New York: Marcel Dekker; 1984.
[8] Osaki S. Stochastic system reliability modeling. Singapore: World Scientific; 1985.
[9] Osaki S. Applied stochastic system modeling. Berlin: Springer-Verlag; 1992.
[10] Osaki S, Hatoyama Y, editors. Stochastic models in reliability theory. Lecture Notes in Economics and Mathematical Systems, vol. 235. Berlin: Springer-Verlag; 1984.
[11] Osaki S, Cao J, editors. Reliability theory and applications. Singapore: World Scientific; 1987.
[12] Ozekici S, editor. Reliability and maintenance of complex systems. NATO ASI Series. Berlin: Springer-Verlag; 1996.
[13] Christer AH, Osaki S, Thomas LC, editors. Stochastic modelling in innovative manufacturing. Lecture Notes in Economics and Mathematical Systems, vol. 445. Berlin: Springer-Verlag; 1997.
[14] Barlow RE. Engineering reliability. Philadelphia: SIAM; 1998.
[15] Aven T, Jensen U. Stochastic models in reliability. New York: Springer-Verlag; 1999.
[16] McCall JJ. Maintenance policies for stochastically failing equipment: a survey. Manage Sci 1965;11:493–521.
[17] Osaki S, Nakagawa T. Bibliography for reliability and availability of stochastic systems. IEEE Trans Reliab 1976;R-25:284–7.
[18] Pierskalla WP, Voelker JA. A survey of maintenance models: the control and surveillance of deteriorating systems. Nav Res Logist Q 1976;23:353–88.
[19] Sherif YS, Smith ML. Optimal maintenance models for systems subject to failure—a review. Nav Res Logist Q 1981;28:47–74.
[20] Valdez-Flores C, Feldman RM. A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Nav Res Logist 1989;36:419–46.
[21] Feller W. An introduction to probability theory and its applications. New York: John Wiley & Sons; 1957.
[22] Karlin S, Taylor HM. A first course in stochastic processes. New York: Academic Press; 1975.
[23] Karlin S, Taylor HM. A second course in stochastic processes. New York: Academic Press; 1981.
[24] Taylor HM, Karlin S. An introduction to stochastic modeling. New York: Academic Press; 1984.
[25] Ross SM. Applied probability models with optimization applications. San Francisco: Holden-Day; 1970.
[26] Munter M. Discrete renewal processes. IEEE Trans Reliab 1971;R-20:46–51.
[27] Kaio N, Osaki S. Review of discrete and continuous distributions in replacement models. Int J Sys Sci 1988;19:171–7.
[28] Barlow RE, Hunter LC. Optimum preventive maintenance policies. Oper Res 1960;8:90–100.
[29] Nakagawa T. A summary of discrete replacement policies. Eur J Oper Res 1979;17:382–92.
[30] Tilquin C, Cleroux R. Block replacement with general cost structure. Technometrics 1975;17:291–8.
[31] Berg M, Epstein B. A modified block replacement policy. Nav Res Logist Q 1976;23:15–24.
[32] Morimura H. On some preventive maintenance policies for IFR. J Oper Res Soc Jpn 1970;12:94–124.
[33] Tilquin C, Cleroux R. Periodic replacement with minimal repair at failure and adjustment costs. Nav Res Logist Q 1975;22:243–54.
[34] Cleroux R, Dubuc S, Tilquin C. The age replacement problem with minimal repair and random repair costs. Oper Res 1979;27:1158–67.
[35] Park KS. Optimal number of minimal repairs before replacement. IEEE Trans Reliab 1979;R-28:137–40.
[36] Nakagawa T. A summary of periodic replacement with minimal repair at failure. J Oper Res Soc Jpn 1979;24:213–28.
[37] Nakagawa T. Generalized models for determining optimal number of minimal repairs before replacement. J Oper Res Soc Jpn 1981;24:325–57.
[38] Nakagawa T. Modified periodic replacement with minimal repair at failure. IEEE Trans Reliab 1981;R-30:165–8.
[39] Nakagawa T. Optimal policy of continuous and discrete replacement with minimal repair at failure. Nav Res Logist Q 1984;31:543–50.
[40] Nakagawa T. Periodic and sequential preventive maintenance policies. J Appl Prob 1986;23:536–42.
[41] Nakagawa T, Kowada M. Analysis of a system with minimal repair and its application to replacement policy. Eur J Oper Res 1983;12:176–82.
[42] Phelps RI. Replacement policies under minimal repair. J Oper Res Soc 1981;32:549–54.
[43] Berg M, Cleroux R. The block replacement problem with minimal repair and random repair costs. J Stat Comput Sim 1982;15:1–7.
[44] Boland PJ. Periodic replacement when minimal repair costs vary with time. Nav Res Logist Q 1982;29:541–6.
[45] Boland PJ, Proschan F. Periodic replacement with increasing minimal repair costs at failure. Oper Res 1982;30:1183–9.
[46] Block HW, Borges WS, Savits TH. Age-dependent minimal repair. J Appl Prob 1985;22:370–85.
[47] Beichelt F. A unifying treatment of replacement policies with minimal repair. Nav Res Logist 1993;40:51–67.
[48] Tahara A, Nishida T. Optimal replacement policy for minimal repair model. J Oper Res Soc Jpn 1975;18:113–24.
[49] Phelps RI. Optimal policy for minimal repair. J Oper Res Soc 1983;34:425–7.
[50] Segawa Y, Ohnishi M, Ibaraki T. Optimal minimal-repair and replacement problem with average dependent cost structure. Comput Math Appl 1992;24:91–101.
[51] Ohnishi M. Optimal minimal-repair and replacement problem under average cost criterion: optimality of (t, T)-policy. J Oper Res Soc Jpn 1997;40:373–89.
[52] Bergman B. On the optimality of stationary replacement strategies. J Appl Prob 1980;17:178–86.
[53] Barlow RE, Hunter LC. Reliability analysis of a one-unit system. Oper Res 1961;9:200–8.
[54] Osaki S, Nakagawa T. A note on age replacement. IEEE Trans Reliab 1975;R-24:92–4.
[55] Fox BL. Age replacement with discounting. Oper Res 1966;14:533–7.
[56] Nakagawa T, Osaki S. Discrete time age replacement policies. Oper Res Q 1977;28:881–5.
[57] Glasser GJ. The age replacement problem. Technometrics 1967;9:83–91.
[58] Scheaffer RL. Optimum age replacement policies with an increasing cost factor. Technometrics 1971;13:139–44.
[59] Cleroux R, Hanscom M. Age replacement with adjustment and depreciation costs and interest charges. Technometrics 1974;16:235–9.
[60] Osaki S, Yamada S. Age replacement with lead time. IEEE Trans Reliab 1976;R-25:344–5.
[61] Nakagawa T, Osaki S. Reliability analysis of a one-unit system with unrepairable spare units and its optimization applications. Oper Res Q 1976;27:101–10.
[62] Derman C, Sacks J. Replacement of periodically inspected equipment. Nav Res Logist Q 1960;7:597–607.
[63] Ansell J, Bendell A, Humble S. Age replacement under alternative cost criteria. Manage Sci 1984;30:358–67.
[64] Christer AH. Refined asymptotic costs for renewal reward processes. J Oper Res Soc 1978;29:577–83.
[65] Christer AH. Comments on finite-period applications of age-based replacement models. IMA J Math Appl Bus Ind 1987;1:111–24.
[66] Christer AH, Jack N. An integral-equation approach for replacement modelling over finite time horizons. IMA J Math Appl Bus Ind 1991;3:31–44.
[67] Wiggins AD. A minimum cost model of spare parts inventory control. Technometrics 1967;9:661–5.
[68] Allen SG, D'Esopo DA. An ordering policy for repairable stock items. Oper Res 1968;16:669–74.
[69] Allen SG, D'Esopo DA. An ordering policy for stock items when delivery can be expedited. Oper Res 1968;16:880–3.
[70] Simon RM, D'Esopo DA. Comments on a paper by S.G. Allen and D.A. D'Esopo: An ordering policy for repairable stock items. Oper Res 1971;19:986–9.
[71] Nakagawa T, Osaki S. Optimum replacement policies with delay. J Appl Prob 1974;11:102–10.
[72] Osaki S. An ordering policy with lead time. Int J Sys Sci 1977;8:1091–5.
[73] Thomas LC, Osaki S. A note on ordering policy. IEEE Trans Reliab 1978;R-27:380–1.
[74] Thomas LC, Osaki S. An optimal ordering policy for a spare unit with lead time. Eur J Oper Res 1978;2:409–19.
[75] Kaio N, Osaki S. Optimum ordering policies with lead time for an operating unit in preventive maintenance. IEEE Trans Reliab 1978;R-27:270–1.
[76] Kaio N, Osaki S. Optimum planned maintenance with salvage costs. Int J Prod Res 1978;16:249–57.
[77] Kaio N, Osaki S. Discrete-time ordering policies. IEEE Trans Reliab 1979;R-28:405–6.
[78] Kaio N, Osaki S. Optimum planned maintenance with discounting. Int J Prod Res 1980;18:515–23.
[79] Osaki S, Kaio N, Yamada S. A summary of optimal ordering policies. IEEE Trans Reliab 1981;R-30:272–7.
[80] Dohi T, Kaio N, Osaki S. On the optimal ordering policies in maintenance theory—survey and applications. Appl Stoch Models Data Anal 1998;14:309–21.
[81] Dohi T, Kaio N, Osaki S. Continuous review cyclic inventory models with emergency order. J Oper Res Soc Jpn 1995;38:212–29.
[82] Dohi T, Shibuya T, Osaki S. Models for 1-out-of-Q systems with stochastic lead times and expedited ordering options for spares inventory. Eur J Oper Res 1997;103:255–72.
[83] Shibuya T, Dohi T, Osaki S. Spare part inventory models with stochastic lead times. Int J Prod Econ 1998;55:257–71.
[84] Barlow RE, Hunter LC, Proschan F. Optimum checking procedures. J Soc Indust Appl Math 1963;11:1078–95.
[85] Kaio N, Osaki S. Some remarks on optimum inspection policies. IEEE Trans Reliab 1984;R-33:277–9.
[86] Kaio N, Osaki S. Analytical considerations on inspection policies. In: Osaki S, Hatoyama Y, editors. Stochastic models in reliability theory. Heidelberg: Springer-Verlag; 1984. p. 53–71.
[87] Munford AG, Shahani AK. A nearly optimal inspection policy. Oper Res Q 1972;23:373–9.
[88] Munford AG, Shahani AK. An inspection policy for the Weibull case. Oper Res Q 1973;24:453–8.
[89] Tadikamalla PR. An inspection policy for the gamma failure distributions. J Oper Res Soc 1979;30:77–80.
[90] Nakagawa T, Yasui K. Approximate calculation of optimal inspection times. J Oper Res Soc 1980;31:851–3.
[91] Keller JB. Optimum checking schedules for systems subject to random failure. Manage Sci 1974;21:256–60.
[92] Kaio N, Osaki S. Optimal inspection policies: a review and comparison. J Math Anal Appl 1986;119:3–20.
[93] Kaio N, Osaki S. Comparison of inspection policies. J Oper Res Soc 1989;40:499–503.
[94] Nakagawa T. Modified discrete preventive maintenance policies. Nav Res Logist Q 1986;33:703–15.
[95] Nakagawa T, Osaki S. Analysis of a repairable system that operates at discrete times. IEEE Trans Reliab 1976;R-25:110–2.


Chapter 20

Maintenance and Optimum Policy

Toshio Nakagawa

20.1 Introduction
20.2 Replacement Policies
20.2.1 Age Replacement
20.2.2 Block Replacement
20.2.2.1 No Replacement at Failure
20.2.2.2 Replacement with Two Variables
20.2.3 Periodic Replacement
20.2.3.1 Modified Models with Two Variables
20.2.3.2 Replacement at N Variables
20.2.4 Other Replacement Models
20.2.4.1 Replacements with Discounting
20.2.4.2 Discrete Replacement Models
20.2.4.3 Replacements with Two Types of Unit
20.2.4.4 Replacement of a Shock Model
20.2.5 Remarks
20.3 Preventive Maintenance Policies
20.3.1 One-unit System
20.3.1.1 Interval Reliability
20.3.2 Two-unit System
20.3.3 Imperfect Preventive Maintenance
20.3.3.1 Imperfect with Probability
20.3.3.2 Reduced Age
20.3.4 Modified Preventive Maintenance
20.4 Inspection Policies
20.4.1 Standard Inspection
20.4.2 Inspection with Preventive Maintenance
20.4.3 Inspection of a Storage System

20.1 Introduction

It is well known that high system reliability can be achieved through redundancy or maintenance. These techniques have actually been used in various real systems, such as computers, generators, radars, and airplanes, where failures during operation are costly or dangerous.

A failed system is replaced or repaired immediately. However, maintenance of a system after failure may be costly and may sometimes require a long time to undertake. It is therefore an important problem to determine when to maintain a system before its failure; on the other hand, it is not wise to maintain a system too often. From these viewpoints, the following three policies are well known.

1. Replacement policy: a system with no repair is replaced before failure with a new one.

2. Preventive maintenance (PM) policy: a system with repair is maintained preventively before failure.

3. Inspection policy: a system is checked to detect its failure.

Suppose that a system has to operate for an infinite time span. Then, it is appropriate to adopt,


as an objective function, the expected cost per unit of time from the viewpoint of economics, and the availability from the viewpoint of reliability. We summarize the theoretical results of optimum policies that minimize or maximize the above quantities. In Section 20.2 we discuss (1) age replacement, (2) block replacement, (3) periodic replacement, and (4) modified replacements with discounting, in discrete time, with two types of unit, and of a shock model. In Section 20.3, PMs of (1) a one-unit system, (2) a two-unit system, (3) imperfect PM, and (4) modified PM are discussed. Section 20.4 covers (1) standard inspection, (2) inspection with PM, and (3) inspection of a storage system.

Barlow and Proschan [1] summarized the known results of maintenance and their optimization. Since then, many papers have been published, and the subject has been surveyed extensively by Osaki and Nakagawa [2], Pierskalla and Voelker [3], Thomas [4], Valdez-Flores and Feldman [5], and Cho and Parlar [6]. The recently published books edited by Özekici [7], Christer et al. [8], Ben-Daya et al. [9], and Osaki [10] collected many reliability and maintenance models, discussed their optimum policies, and applied them to actual systems. Most of the contents of this chapter are based on our original work [11–36].

The theories of renewal and Markov renewal processes are used in analyzing maintenance policies, and are discussed in the books by Cox [37], Çinlar [38], and Osaki [39].

20.2 Replacement Policies

In this section, we consider a one-unit system where a unit is replaced upon failure. It would be wise to replace an operating unit before failure at a smaller cost. We introduce a cost incurred by failure of an operating unit and a cost incurred by replacement before failure. Most units are replaced at failure, at their ages, or at scheduled times, according to their size, character, circumstances, and so on.

A unit has to operate for an infinite time span under the following replacement costs: a cost c_1 is suffered for each failed unit that is replaced; this includes all costs resulting from its failure and replacement. A cost c_2 (< c_1) is suffered for each non-failed unit that is exchanged. Let N_1(t) denote the number of failures during (0, t] and N_2(t) denote the number of exchanges of non-failed units during (0, t]. Then, the expected cost during (0, t] is

C(t) ≡ c_1 E\{N_1(t)\} + c_2 E\{N_2(t)\}   (20.1)

When the planning horizon is infinite, it is appropriate to adopt the expected cost per unit of time, \lim_{t→∞} C(t)/t, as an objective function.

We summarize three replacement policies: (1) age replacement, (2) block replacement, and (3) periodic replacement with minimal repair at failure, and, moreover, consider (4) their modified models. We discuss optimum policies that minimize the expected cost per unit of time for each replacement, and summarize the results of their derivations.

20.2.1 Age Replacement

If the age of an operating unit is always known and its failure rate increases with age, it may be wise to replace it before failure on the basis of its age. A commonly considered age replacement policy for such a unit is that it is replaced at time T (0 < T ≤ ∞) after its installation or at failure, whichever occurs first. We call the specified time T a planned replacement time. Berg [40] and Bergman [41] proved that an age replacement policy is optimal among all reasonable policies. It is assumed that failures are instantly detected and a failed unit is replaced by a new unit. A newly installed unit also begins to operate instantly. Suppose that the failure time X_k (k = 1, 2, \ldots) of each operating unit is independent and has an identical distribution F(t) with finite mean 1/λ.

An age replacement policy has been treated by many authors. Barlow and Proschan [1] studied the optimum policy that minimizes the expected cost. Cox [37] gave a sufficient condition for a finite optimum solution to exist. Glasser [42] gave the replacement time for the cases of several failure time distributions. Scheaffer [43], Cléroux and coworkers [44, 45], and


Subramanian and Wolff [46] gave more general cost structures of replacement. Fox [47] and Ran and Rosenland [48] proposed an age replacement with continuous discounting. Further, Berg and Epstein [49] compared age replacement with block replacement. Ingram and Scheaffer [50], Frees and Ruppert [51], and Léger and Cléroux [52] gave the confidence intervals of the optimum age replacement policy when the failure time distribution F is unknown. Christer [53] and Ansell et al. [54] gave the asymptotic costs of age replacement for a finite time span. Popova and Wu [55] applied fuzzy set theory to age replacement policies. Opportunistic replacement policies are important for the PM of complex systems. This area is omitted in this chapter; the reader is referred to Zheng [56], for example.

A new unit begins to operate at t = 0. Then, an age replacement procedure generates an ordinary renewal process as follows. Let {X_k}_{k=1}^{∞} be the failure times of successive operating units, and define a new random variable Z_k ≡ min{X_k, T}. Then, {Z_k}_{k=1}^{∞} represent the intervals between replacements caused by either failure or planned replacement, and are independently, identically distributed random variables. Thus, a sequence {Z_k}_{k=1}^{∞} forms a renewal process with

Pr\{Z_k ≤ t\} = \begin{cases} F(t) & t < T \\ 1 & t ≥ T \end{cases}   (20.2)

We term one cycle as being from one replacement to the next one. Then, the expected cost on one cycle is

E\{c_1 I_{(X_k < T)} + c_2 I_{(X_k ≥ T)}\} = c_1 F(T) + c_2 \bar F(T)   (20.3)

and the mean time of one cycle is

E\{Z_k\} = \int_0^T t\, dF(t) + T \bar F(T) = \int_0^T \bar F(t)\, dt   (20.4)

where \bar F ≡ 1 − F and I_A is an indicator.

Therefore, from the elementary renewal theorem (e.g. theorem 2.6 of Barlow and Proschan [1, p. 55]), the expected cost per unit of time for an infinite time span is

C(T) ≡ \lim_{t→∞} \frac{C(t)}{t} = \frac{\text{Expected cost on one cycle}}{\text{Mean time of one cycle}} = \frac{c_1 F(T) + c_2 \bar F(T)}{\int_0^T \bar F(t)\, dt}   (20.5)

If T = ∞, then the policy corresponds to replacement only at failure, and the expected cost is C(∞) = λc_1.

We derive an optimum replacement time T^* that minimizes the expected cost C(T) in Equation 20.5. It is assumed that there exists a density f(t) of the failure time distribution F(t) with mean 1/λ. Let r(t) ≡ f(t)/\bar F(t) be the failure rate (or the hazard rate), with limit r(∞) ≡ \lim_{t→∞} r(t), which may possibly be infinite.

Theorem 1. Suppose that r(t) is continuous and strictly increasing, and let K ≡ λc_1/(c_1 − c_2).

(i) If r(∞) > K, then there exists a finite and unique T^* (0 < T^* < ∞) that satisfies

r(T) \int_0^T \bar F(t)\, dt − F(T) = \frac{c_2}{c_1 − c_2}   (20.6)

and the resulting expected cost is

C(T^*) = (c_1 − c_2) r(T^*)   (20.7)

(ii) If r(∞) ≤ K, then the optimum replacement time is T^* = ∞, i.e. a unit should be replaced only at failure.

Proof. Differentiating C(T) (or log C(T)) with respect to T and setting it equal to zero implies Equation 20.6. Denoting the left-hand side of Equation 20.6 by Q(T), we easily have that Q(T) is strictly increasing, since r(T) is strictly increasing, with Q(0) = 0 and

Q(∞) ≡ \lim_{T→∞} Q(T) = \frac{1}{λ} r(∞) − 1

If r(∞) > K, then Q(∞) > c_2/(c_1 − c_2). Hence, from the monotonicity and continuity of Q(T), there exists a finite and unique T^* (0 < T^* < ∞) that satisfies Equation 20.6, and it minimizes C(T). We easily have Equation 20.7 from Equation 20.6.

Conversely, if r(∞) ≤ K, then Q(T) < Q(∞) ≤ c_2/(c_1 − c_2) for any finite T, which implies dC(T)/dT < 0 for all T. Hence, the optimum replacement time is T^* = ∞, i.e. a unit is replaced only at failure. □

20.2.2 Block Replacement

If a system consists of a block or group of components and only their failures are known, all components may be replaced periodically, independent of their ages in use. This policy is commonly used with complex electronic systems and many electrical parts.

A new unit begins to operate at t = 0, and a failed unit is discovered instantly and replaced by a new one. Further, a unit is replaced at periodic times kT (k = 1, 2, \ldots) independent of its age. Suppose that each unit has an identical failure time distribution F(t) with finite mean 1/λ, and F^{(n)}(t) (n = 1, 2, \ldots) is the n-fold Stieltjes convolution of F(t) with itself, i.e. F^{(n)}(t) ≡ \int_0^t F^{(n−1)}(t − u)\, dF(u) (n = 1, 2, \ldots) and F^{(0)}(t) ≡ 1 for t ≥ 0.

Barlow and Proschan [1] studied block replacement and compared it with age replacement. After that, Marathe and Nair [57] and Jain and Nair [58] defined n-stage block replacement and compared it with other replacements. Schweitzer [59] compared block replacement with replacement only at individual failures for hyper-exponentially and uniformly distributed failure times. Savits [60] derived the cost relationship between age and block replacements. Tilquin and Cléroux [61] introduced adjustment costs, which increase with the age of a unit. Archibald and Dekker [62] and Sheu [63–66] considered more general replacement policies and summarized their results.

Consider one cycle with a constant time T from the planned replacement to the next one. Then, since the expected number of failed units during one cycle is M(T) ≡ \sum_{n=1}^{∞} F^{(n)}(T), the expected cost on one cycle is

c_1 E\{N_1(T)\} + c_2 E\{N_2(T)\} = c_1 M(T) + c_2

Therefore, from Equation 20.5, the expected cost per unit of time for an infinite time span (equation (2.6) of Barlow and Proschan [1, p. 95]) is

C(T) ≡ \lim_{t→∞} \frac{C(t)}{t} = \frac{c_1 M(T) + c_2}{T}   (20.8)

If a unit is replaced only at failure, i.e. T = ∞, then \lim_{T→∞} M(T)/T = λ and the expected cost is C(∞) = λc_1.

We seek an optimum planned replacement time T^* that minimizes the expected cost C(T) in Equation 20.8. It is assumed that M(t) is differentiable, and we define m(t) ≡ dM(t)/dt; in a stochastic process context, M(t) is called the renewal function and m(t) the renewal density. Then, differentiating C(T) with respect to T and setting it equal to zero, we have

T m(T) − M(T) = \frac{c_2}{c_1}   (20.9)

This equation is a necessary condition for a finite T^* to exist, and in this case the resulting expected cost is

C(T^*) = c_1 m(T^*)   (20.10)

Let σ^2 be the variance of F(t). Then, from Cox [37, p. 119], there exists a large T such that C(T) < C(∞) = λc_1 if c_2/c_1 < (1 − λ^2 σ^2)/2.
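Equation 20.9 requires the renewal function M(T) and renewal density m(T), which rarely have closed forms; a standard numerical route is to discretize the renewal equation M(t) = F(t) + ∫_0^t M(t − u) dF(u) and then minimize C(T) directly. The Python sketch below illustrates this under a hypothetical Weibull lifetime and costs; the grid step is a numerical choice, not part of the original text.

```python
# Sketch: minimizing the block replacement cost C(T) of Eq. 20.8, with the
# renewal function M(T) computed from a discretized renewal equation
# M(t) = F(t) + int_0^t M(t-u) dF(u).
import numpy as np

c1, c2 = 10.0, 2.0
m_shape, eta = 2.0, 10.0
h, n = 0.05, 1000                        # grid step and number of points
t = np.arange(1, n + 1) * h
F = 1.0 - np.exp(-(t / eta) ** m_shape)
dF = np.diff(np.concatenate(([0.0], F))) # increments of F on the grid

M = np.zeros(n)
for i in range(n):                       # convolution form of the renewal equation
    M[i] = F[i] + np.sum(M[:i][::-1] * dF[1:i + 1])

C = (c1 * M + c2) / t                    # Eq. 20.8
print(t[np.argmin(C)], C.min())          # approximate T* and C(T*)
```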

It may be wasteful to replace a failed unit by a new one just before a planned replacement. Three modified models from this viewpoint are well known. When a failure occurs just before the planned replacement, the unit remains failed until the replacement time [37, 67, 68] or is replaced by a used unit [69–72]. An operating unit of young age is not replaced at the planned replacement and remains in operation [73, 74].

We now discuss two typical modified models.

20.2.2.1 No Replacement at Failure

A unit is always replaced at times kT, but it is not replaced at failure; hence it remains failed for the time interval from the occurrence of the failure to its detection. Let c_1 be the cost per unit of time elapsed between failure and its detection, and c_2 the cost of planned replacement. Then, the expected cost per unit of time is

C(T) = \frac{1}{T} \Big[ c_1 \int_0^T F(t)\, dt + c_2 \Big]   (20.11)

Differentiating C(T) with respect to T and setting it equal to zero, we have

T F(T) − \int_0^T F(t)\, dt = \frac{c_2}{c_1}   (20.12)

Thus, if 1/λ > c_2/c_1, then there exists a unique optimum replacement time T^* that satisfies Equation 20.12, and the resulting expected cost is

C(T^*) = c_1 F(T^*)   (20.13)
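Equation 20.12 is a one-dimensional root-finding problem, since its left-hand side increases from 0 to the mean 1/λ. The Python sketch below is illustrative only; the Weibull lifetime and cost values are hypothetical (chosen so that 1/λ > c_2/c_1 and a finite T^* exists), and it requires scipy.

```python
# Sketch: solving Eq. 20.12, T*F(T) - int_0^T F(t) dt = c2/c1, for the
# "no replacement at failure" block policy.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

c1, c2 = 10.0, 2.0
m, eta = 2.0, 10.0        # Weibull mean eta*Gamma(1.5) ~ 8.86 > c2/c1 = 0.2
F = lambda t: 1.0 - np.exp(-(t / eta) ** m)

def g(T):                 # left side of Eq. 20.12 minus c2/c1
    return T * F(T) - quad(F, 0.0, T)[0] - c2 / c1

T_star = brentq(g, 1e-6, 100 * eta)
print(T_star, c1 * F(T_star))   # minimum cost C(T*) = c1*F(T*), Eq. 20.13
```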

20.2.2.2 Replacement with Two Variables

Cox [37] considered a modification of block replacement in which, if a failure occurs just before a planned replacement, replacement of the failed unit may be postponed until the next planned replacement. That is, if a failure occurs in an interval (kT − T_d, kT), the replacement is not made until time kT, and the unit is down for that time interval. Suppose that the cost suffered for unit failure is proportional to the downtime, i.e. let c_3 t be the cost of time t elapsed between failure and its detection. Then, the expected cost is

C(T, T_d) = \frac{1}{T} \Big[ c_1 M(T − T_d) + c_2 + c_3 \int_{T−T_d}^{T} (T − t) \bar F(T − t)\, dM(t) \Big]   (20.14)

When T_d = 0, this corresponds to block replacement, and when T_d = T, it corresponds to the modified model in Section 20.2.2.1. Nakagawa [28] discussed the optimum T^* and T_d^* that minimize C(T, T_d) in Equation 20.14.

20.2.3 Periodic Replacement

When we consider large and complex systems consisting of many kinds of components, we should make only minimal repair at each failure and make the planned replacement or PM at periodic times. Barlow and Proschan [1] considered the following replacement policy. A unit is replaced at periodic times kT (k = 1, 2, \ldots). After each failure, only minimal repair is made, so that the failure rate remains undisturbed by any repair of failures between successive replacements. This policy is commonly used with computers and airplanes. Holland and McLean [75] provided a practical procedure for applying the policy to large motors and small electrical parts. Tilquin and Cléroux [76], Boland and coworkers [77, 78], and Chen and Feldman [79] gave more general cost structures, and Aven [80] and Bagai and Jain [81] considered modifications of the minimal repair model. Dekker and coworkers [83, 84] presented a general framework for three replacement policies.

A new unit begins to operate at t = 0 and, when it fails, only minimal repair is made. That is, the failure rate of the unit remains undisturbed by repair of failures. Further, the unit is replaced at periodic times kT (k = 1, 2, \ldots) independent of its age, and any unit is as good as new after replacement. It is assumed that the repair and replacement times are negligible. Suppose that the failure times of each unit are independent and have a distribution F(t) with failure rate r(t) ≡ f(t)/\bar F(t), where f is a density of F. Then, failures of a unit occur according to a non-homogeneous Poisson process with mean-value function R(t), where R(t) ≡ \int_0^t r(u)\, du and \bar F(t) = e^{−R(t)} (e.g. see Nakagawa and Kowada [29] and Murthy [82]).

Consider one cycle with a constant time T fromthe planned replacement to the next one. Then,since the expected number of failures during onecycle is E{N1(T )} = R(T ), the expected cost onone cycle is

c1E{N1(T )} + c2E{N2(T )} = c1R(T )+ c2

where note that c1 is the cost of minimal repair.Therefore, from Equation 20.5, the expected

cost per unit of time for an infinite time span(equation (2.9) of Barlow and Proschan [1, p.99])is

C(T )≡ limt→∞

C(t)

t= 1

T[c1R(T )+ c2] (20.15)

Page 405: Handbook of Reliability Engineering - pdi2.org

372 Maintenance Theory and Testing

If a unit is never replaced, i.e. T =∞, thenlimT→∞ R(T )/T = r(∞), which may possibly beinfinite, and C(∞)= c1r(∞).

We seek an optimum planned replacement timeT ∗ that minimizes the expected cost C(T ) inEquation 20.15. Differentiating C(T ) with respectto T and setting it equal to zero, we have

T r(T )− R(T )= c2

c1(20.16)

Suppose that the failure rate r(t) is continuousand strictly increasing. Then, if a solution T ∗to Equation 20.16 exists, it is unique, and theresulting cost is

C(T ∗)= c1r(T∗) (20.17)

Further, Equation 20.16 can be rewritten as∫ T0 t dr(t)= c2/c1. Thus, if

∫∞0 t dr(t) > c2/c1

then there exists a solution to Equation 20.16.When r(t) is strictly increasing, we easily have

the inequality

T r(T )−∫ T

0r(t) dt ≥ r(T )

∫ T

0F (t) dt − F(T )

(20.18)Thus, an optimum T ∗ is not greater than thatof an age replacement in Section 20.2.1. Further,taking T →∞ in Equation 20.18, if r(∞) >

λ(c1 + c2)/c1 then a solution to Equation 20.16exists.

If the cost of minimal repair depends on the agex of a unit and is given by c1(x), the expected costis [77]

C(T )= 1

T

[ ∫ T

0c1(x)r(x) dx + c2

](20.19)

Several modifications of this replacement policyhave been proposed.

20.2.3.1 Modified Models with TwoVariables

See Nakagawa [28]. (a) A used unit begins tooperate at t = 0 and is replaced at times kT bythe same used unit of age x(0≤ x <∞). Then, theexpected cost is [21]

C(T , x)= 1

T

[c1

∫ T+x

x

r(t) dt + c2(x)

](20.20)

where c2(x) is an acquisition cost of a used unit ofage x.

In this case, we consider the optimum problemof determining economically the age of a usedunit. Suppose that x is a variable for a specifiedT and c2(x) is differentiable. Then, differentiatingC(T , x) with respect to x and setting it equal tozero implies

r(T + x)− r(x)=−c′2(x)c1

(20.21)

which is a necessary condition that a finite x

minimizes C(T , x) for a fixed T .(b) Mine and Kawai [85] considered the

replacement of a unit with random and wearoutfailures, where an operating unit enters a wearoutfailure period at a fixed time T0, after it has beenoperating in a random failure period. It is assumedthat a unit is replaced at planned time T + T0,where T0 is constant and previously given, and itundergoes only minimal repair at failures betweenreplacements.

Suppose that a unit has a constant failure rateλ in a random failure period and λ+ r(t) in awearout failure period. Then, the expected cost is[26]

C(T , T0)= λc1 + c1∫ T

0 r(t) dt + c2

T + T0(20.22)

(c) If a failure occurs in an interval (kT −Td, kT )(0≤ Td ≤ T ), a unit is not repaired in thisinterval and is down for the interval from itsfailure to the replacement (see Section 20.2.2.2).Then, the expected cost is

C(T , Td)= 1

T

{c1

∫ T−Td

0r(t) dt + c2

+ c31

F (T − Td)

∫ T

T−Td

[F(t)− F(T − Td)] dt}

(20.23)

In the above policy, we may not sometimesleave a failed unit as it is until the plannedreplacement time. To overcome this, we considerthe following model. If a unit fails in (T − Td, T )

then it is replaced by a new one [86]. Tahara and

Page 406: Handbook of Reliability Engineering - pdi2.org

Maintenance and Optimum Policy 373

Nishida [87] termed this policy the (t, T ) policy.The expected cost is

C(T , Td)={c1

∫ T−Td

0r(t) dt + c2

+ c4[F(T )− F(T − Td)]/F (T − Td)

}×{T − Td +

∫ T

T−Td

F (t) dt/F (T − Td)

}−1

(20.24)

where c4 is an additional replacement costcaused by failure. Note that C(T , T ) agrees withEquation 20.5. Phelps [88, 89] and Park and Yoo[90] proposed a modification of this model.

20.2.3.2 Replacement atN Variables

See Nakagawa [27]. Morimura [91] proposed anew replacement where a unit is replaced at theNth failure. This is a modification of the periodicreplacement, and would be useful in the casewhere the total operating time of a unit is notrecorded or it is too time consuming and costlyto replace it in operation. Therefore, this could beused in maintaining complex systems with a lot ofequipment of the same type.

Recalling that failures occur by a non-homogeneous Poisson process with a mean-valuefunction R(t), we have

Hj(t)≡ Pr{N1(t)= j } = [R(t)]j

j ! e−R(t)

(j = 0, 1, 2, . . . )

and hence the mean time to the Nthfailure is, from Nakagawa and Kowada [29],∑N−1

j=0

∫∞0 Hj(t) dt . Thus, the expected cost is

C(N) = (N − 1)c1 + c2∑N−1j=0

∫∞0 Hj(t) dt

(N = 1, 2, . . . )

(20.25)Suppose that r(t) is continuous and strictly

increasing. Then, if there exists a finite solution

N∗ that satisfies

1∫∞0 HN(t) dt

N−1∑j=0

∫ ∞0

Hj(t) dt − (N − 1)≥ c2

c1

(N = 1, 2, . . . ) (20.26)

it is unique and minimizes C(N). Further,if r(∞) > λc2/c1, then a finite solution toEquation 20.26 exists.

Further, suppose that a unit is replaced at timeT or at the Nth failure, whichever occurs first.Then, the expected cost is

C(N, T )={c1

[N − 1−

N−1∑j=0

(N − 1− j)Hj(T )

]

+ c2

∞∑j=N

Hj(T )+ c3

N−1∑j=0

Hj(T )

}

×{ N−1∑

j=0

∫ T

0Hj(t) dt

}−1

where c2 is the cost of planned replacement atthe Nth failure and c3 is the cost of plannedreplacement at time T . Similar replacementpolicies were discussed by Park [92], Nakagawa[30], Tapiero and Ritchken [93], Lam [94–96],Ritchken and Wilson [97], and Sheu [98].

20.2.4 Other Replacement Models

20.2.4.1 Replacements with Discounting

When we adopt the total expected cost as anappropriate objective function for an infinite timespan, we have to evaluate the present values ofall replacement costs. A continuous discountingwith rate α(0 < α <∞) will be used for costs atthe replacement time. Then, the cost of one cyclestarting at time t is

c1 e−α(t+Xk)I(Xk<T ) + c2 e−α(t+T )I(Xk≥T )

The cost at each cycle is the same, except fora discount rate, and hence the total cost is equalto the sum of discounted costs incurred on theindividual cycles. Thus, we have [13] that the

Page 407: Handbook of Reliability Engineering - pdi2.org

374 Maintenance Theory and Testing

expected cost of an age replacement is

C(T , α)= c1∫ T

0 e−αt dF(t)+ c2 e−αT F (T )

α∫ T

0 e−αt F (t) dt(20.27)

and Equations 20.6 and 20.7 are

r(T )

∫ T

0e−αt F (t) dt −

∫ T

0e−αt dF(t)

= c2

c1 − c2(20.28)

and

C(T ∗; α)= 1

α(c1 − c2)r(T

∗)− c2 (20.29)

respectively.Similarly, for a block replacement:

C(T ; α) = c1∫ T

0 e−αt dM(t)+ c2 e−αT

1− e−αT(20.30)

1− e−αT

αm(T )−

∫ T

0e−αtm(t) dt = c2

c1(20.31)

C(T ∗; α)= c1

αm(T ∗)− c2 (20.32)

For a periodic replacement:

C(T ; α)= c1∫ T

0 e−αt r(t) dt + c2 e−αT

1− e−αT(20.33)

1− e−αT

αr(T )−

∫ T

0e−αt r(t) dt = c2

c1(20.34)

C(T ∗; α)= c1

αr(T ∗)− c2 (20.35)

Note that limα→0 αC(T ; α)= C(T ), which is theexpected cost of three replacements with nodiscounting.

20.2.4.2 Discrete Replacement Models

See Nakagawa and coworkers [18, 33]. In failurestudies, the time to unit failure is often measuredby the number of cycles to failure. In actualsituations, the tires on jet fighters are replacedpreventively at 4–14 flights, which may dependon the kinds of use. In other cases, the life timesare sometimes not recorded at the exact instant

of failure and are collected statistically per day,per month, or per year. In any case, it would beinteresting and possibly useful to consider discretetime processes [99].

Consider the time over an infinitely longcycle n (n= 1, 2, . . . ) that a unit has to beoperating. Let {pn}∞n=1 be the discrete time failuredistribution with finite mean 1/λ≡∑∞n=1 npn

that a unit fails at cycle n. Further, let{p(j)

n }∞n=1 be the j th convolution of pn, i.e.

p(1)n ≡ pn (n= 1, 2, . . . ), p

(j)n ≡∑n−1

i=1 pip(j−1)n−i

(j = 2, 3, . . . , n; n= 2, 3, . . . ), p(j)n ≡ 0

(j = n+ 1, n+ 2, . . . ), let r(n)≡ pn/∑∞

j=n pj

(n= 1, 2, . . . ) be the failure rate, and letm(n)≡∑n

j=1 p(j)n (n= 1, 2, . . . ) be the renewal

density of the discrete time distribution.In an age replacement, a unit is replaced at cycle

N(1 ≤N ≤∞) after its installation or at failure,whichever occurs first. Then, the expected cost foran infinite time span is

C(N)= c1∑N

j=1 pj + c2∑∞

j=N+1 pj∑Ni=1∑∞

j=i pj

(N = 1, 2, . . . ) (20.36)

Theorem 2 can be rewritten as follows.

Theorem 2. Suppose that r(n) is strictly increasingand r(∞)≡ limn→∞ r(n)≤ 1.

(i) If r(∞) > K then there exists a finite andunique minimum that satisfies

r(N + 1)N∑i=1

∞∑j=i

pj −N∑j=1

pj ≥ c2

c1 − c2

(N = 1, 2, . . . ) (20.37)

(ii) If r(∞)≤K then N∗ =∞, and C(∞)= λc1.

In a block replacement, a unit is replaced atcycle kN (k = 1, 2, . . . ) independent of its ageand a failed unit is replaced by a new one betweenplanned replacements. Then, the expected cost is

C(N) = 1

N

[c1

N∑j=1

m(j)+ c2

](N = 1, 2, . . . ) (20.38)

Page 408: Handbook of Reliability Engineering - pdi2.org

Maintenance and Optimum Policy 375

and Equation 20.9 is

Nm(N + 1)−N∑j=1

m(j)≥ c2

c1

(N = 1, 2, . . . ) (20.39)

Example 1. Suppose that the failure time of aunit has a negative binomial distribution witha shape parameter 2, i.e. pn = np2qn−1 (n=1, 2, . . . ; 0 < p < 1) where q ≡ 1− p. Then, wehave 1/λ= (1+ q)/p, r(n)= np2/(np + q), andm(n)= p2 ∑n−1

j=0 q2j .

In an age replacement, if q/(1+ q)≤ c2/c1,then we should make no planned replacement;conversely, if q/(1+ q) > c2/c1, then we shouldadopt an optimum replacement cycle N∗ that is aunique minimum such that

2(N + 1)p + qN+2

Np + q≥ c1

c1 − c2(20.40)

Note that if c2 < c1 ≤ 2c2 then a unit should bereplaced only at failure, and C(∞)= pc1/(1+ q).

In a block replacement, from Equation 20.39

p2N∑j=1

jq2j ≥ c2

c1(20.41)

Thus, if q/(1+ q) >√c2/c1 then there exists

a finite and unique minimum N∗ that satisfiesEquation 20.41, since the left-hand side of Equa-tion 20.41 is strictly increasing from (p/q)2 to[q/(1+ q)]2. Conversely, if q/(1+ q)≤√c2/c1then a unit should be replaced only at failure. Notethat for c2/c1 ≥ 1/4, no finite value of N satisfiesEquation 20.41.

In a periodic replacement, a unit is replaced atcycle kN(k = 1, 2, . . . ) and a failed unit betweenplanned replacements undergoes only minimalrepair. Then, the expected cost is

C(N)= 1

N

[c1

N∑j=1

r(j)+ c2

](N = 1, 2, . . . ) (20.42)

and Equation 20.16 is

Nr(N + 1)−N∑j=1

r(j)≥ c2

c1(N = 1, 2, . . . )

(20.43)

Example 2. Suppose that the failure timehas a discrete Weibull distribution with ashape parameter 2, i.e. pn = q(n−1)2 − qn

2

(n= 1, 2, . . . ; 0 < q < 1) [100]. Then,r(n)= 1− q2n−1 (n= 1, 2, . . . ), which isstrictly increasing from 1− q to 1. Thus, fromEquation 20.43, if q/(1− q2) > c2/c1 then thereexists a finite and unique minimum N∗ thatsatisfies

q

1− q2{1− [1+N(1 − q2)]q2N} ≥ c2

c1(20.44)

and, otherwise, a unit undergoes only minimalrepair at failures.

20.2.4.3 Replacements with Two Types ofUnit

See Nakagawa [35]. Most systems consist of vitaland non-vital parts or essential and non-essentialcomponents. If vital parts fail then a systembecomes dangerous or suffers a high cost. Itwould be wise to make the planned replacementor overhaul at suitable times. We may classifyfailures into two types; partial and total failures,slight and serious failures, or simply faults andfailures. Several authors [101–107] have studiedthe replacement policies for systems with twotypes of unit. Beichelt and coworkers [108, 109],Murthy and Maxwell [110], Berg [111], Block etal. [112, 113], and Sheu [66] proposed generalizedreplacement models with two types of failure.

We consider a system with unit 1 and unit 2which operate independently, where unit 1 corre-sponds to non-vital parts and unit 2 to vital parts.It is assumed that unit 1 is always replaced to-gether with unit 2. Unit i has a failure time distri-bution Fi(t), failure rate ri (t), and cumulative haz-ard Ri(t)(i = 1, 2), i.e. Fi (t)= exp[−Ri(t)] andRi(t)=

∫ t0 ri(u) du. Then, we consider the follow-

ing four replacement policies, which combine age,block, and periodic replacements.

Page 409: Handbook of Reliability Engineering - pdi2.org

376 Maintenance Theory and Testing

Case (a). Unit 2 is replaced at failure or timeT , whichever occurs first, and when unit 1 failsbetween replacements, it is replaced by a new unit.Then, the expected cost is

C(T )= c1∫ T

0 m1(t)F2(t) dt + c2F2(T )+ c3∫ T0 F2(t) dt

(20.45)where c1 is the cost of replacement for a failedunit 1, c2 is the additional replacement for a failedunit 2, and c3 is the cost of replacement for units 1and 2.

Case (b). In case (a), when unit 1 fails betweenreplacements it undergoes only minimal repair.Then, the expected cost is

C(T )= c1∫ T

0 r1(t)F2(t) dt + c2F2(T )+ c3∫ T0 F2(t) dt

(20.46)where c1 is the cost of minimal repair for failedunit 1, and c2 and c3 are the same costs as case (a).

Case (c). Unit 2 is replaced at periodic timeskT (k = 1, 2, . . . ) and undergoes only minimalrepair at failures between planned replacements,and when unit 1 fails between replacements it isreplaced by a new unit. Then, the expected cost is

C(T )= 1

T[c1M1(T )+ c2R2(T )+ c3] (20.47)

where c2 is the cost of minimal repair for a failedunit 2, and c1 and c3 are the same costs as case (a).

Case (d). In case (c), when unit 1 fails betweenreplacements it also undergoes minimal repair.Then, the expected cost is

C(T )= 1

T[c1R1(T )+ c2R2(T )+ c3] (20.48)

where c1 is the cost of minimal repair for a failedunit 1, and c2 and c3 are the same costs as case (c).

We can similarly discuss optimum replacementpolicies that minimize the expected costs C(T )

in Equations 20.45–20.48. Further, we can easilyextend these policies to the replacements of asystem with one main unit and several sub-units.

20.2.4.4 Replacement of a Shock Model

We summarize briefly the replacement policies fora shock model. Shocks occur randomly in timeat a stochastic process and give the amount ofdamage to a system. This damage accumulatesand gradually weakens the system. A system failswhen the total damage has exceeded a failurelevel.

The general concept of such processes wasdue to Smith [114]. Mercer [115] consideredthe model where shocks occur in a Poissondistribution and the amount of damage due toeach shock has a gamma distribution. Cox [37],Esary et al. [116], and Nakagawa and Osaki [117]investigated the various properties of failure timedistributions.

Suppose that a system is replaced at failure bya new one. It may be wise to exchange a systembefore failure at a smaller cost. Taylor [118] andFeldman [119–121] derived the optimal control-limit policies where a system is replaced beforefailure when the total damage has exceeded athreshold level. On the other hand, Zuckerman[122–125] proposed the replacement model wherea system is replaced before failure at time T .Recently, Wortman et al. [126] and Sheu andcoworkers [127–129] studied some replacementmodels of a system subject to shocks. This sectionis based on Nakagawa [15, 33], Satow et al. [130]and Qian et al. [131].

Consider a unit that has to operate for aninfinite time span. Shocks occur by a non-homogeneous Poisson process with an inten-sity function r(t) and a mean-value functionR(t), i.e. R(t)≡ ∫ t0 r(u) du. Thus, the probabil-ity that j shocks occur during (0, t] is Hj(t)≡{[R(t)]j /j !} e−R(t) (j = 0, 1, 2, . . . ). Further, theamount Wj of damage to the j th shock hasan identical distribution G(x)≡ Pr{Wj ≤ x} (j =1, 2, . . . ). A unit fails when the total damage hasexceeded a failure level K .

A unit is replaced at time T (0 < T ≤∞) or atfailure, whichever occurs first. Let c1 and c2(< c1)

be the respective replacement costs at failure andat time T . Then, the expected cost per unit of time

Page 410: Handbook of Reliability Engineering - pdi2.org

Maintenance and Optimum Policy 377

for an infinite span is

C(T )={c1

∞∑j=0

[G(j)(K)−G(j+1)(K)]

×∫ T

0Hj(t)r(t) dt

+ c2

∞∑j=0

G(j)(K)Hj (T )

}

×{ ∞∑

j=0

G(j)(K)

∫ T

0Hj(t) dt

}−1

(20.49)

where�(j)(x) (j = 1, 2, . . . ) is the j -fold Stieltjesconvolution of �(x) and �(0)(x)≡ 1 for x ≥ 0.

When successive times between shocks have anidentical distribution F(t) , the expected cost inEquation 20.49 is rewritten as

C(T )={c1

∞∑j=0

[G(j)(K)−G(j+1)(K)]F (j+1)(T )

+ c2

∞∑j=0

G(j)(K)[F (j)(T )− F (j+1)(T )]}

×{ ∞∑

j=0

G(j)(K)

×∫ T

0[F (j)(t)− F (j+1)(t)] dt

}−1

(20.50)

Next, a unit is replaced before failure at shockN (N = 1, 2, . . . ). Then, the expected cost is

C(N) = c1[1−G(N)(K)] + c2G(N)(K)∑N−1

j=0 G(j)(K)∫∞

0 Hj(t) dt

(N = 1, 2, . . . ) (20.51)

Further, a unit is replaced preventively whenthe total damage has exceeded a threshold level

Z(0≤ Z ≤K). Then, the expected cost is

C(Z)=[c1 − (c1 − c2)

{G(K)

−∫ Z

0[1−G(K − x)] dM(x)

}]×[ ∞∑

j=0

G(j)(Z)

∫ ∞0

Hj(t) dt

]−1

(20.52)

where M(x)≡∑∞j=1 G(j)(x).

By the similar methods of obtaining optimumreplacement policies, we can discuss the optimumT ∗, N∗, and Z∗ analytically that minimize C(T ),C(N), and C(Z).

20.2.5 Remarks

It has been assumed that the time requiredfor replacement is negligible, and at any timethere is an unlimited supply of units availablefor replacement. The results and methods inthis section could be theoretically extended andmodified in such cases, and be useful for actualreplacements in practical fields.

In general, the results of block and periodicreplacements are summarized as follows: Theexpected cost is

C(T )= 1

T

[c1

∫ T

0ϕ(t) dt + c2

](20.53)

where ϕ(t) is m(t), F(t), and r(t) respectively.Differentiating C(T ) with respect to T and settingit equal to zero:

T ϕ(T )−∫ T

0ϕ(t) dt = c2

c1(20.54)

and if a solution T ∗ to Equation 20.54 exists, thenthe expected cost is

C(T ∗)= c1ϕ(T∗) (20.55)

In a discount case:

C(T ; α)= c1∫ T

0 e−αtϕ(t) dt + c2 e−αT

1− e−αT(20.56)

Page 411: Handbook of Reliability Engineering - pdi2.org

378 Maintenance Theory and Testing

1− e−αT

αϕ(T )−

∫ T

0e−αtϕ(t) dt = c2

c1(20.57)

C(T ∗; α)= c1

αϕ(T ∗)− c2 (20.58)

20.3 Preventive MaintenancePoliciesA unit is repaired upon failure. If a failed unitundergoes repair, it needs a repair time that maynot be negligible. After the completion of repair,a unit begins to operate again. A unit representsthe most fundamental unit which repeats upand down alternately. Such a unit forms analternating renewal process [37] and a Markovrenewal process with two states [132, 133].

When a unit is repaired after failure, it mayrequire much time and high cost. In particular,the downtime of computers and radars shouldbe made as short as possible by decreasing thenumber of unit failures. In this case, we need tomaintain a unit to prevent failures, but not to do ittoo often from the viewpoint of reliability or cost.

A unit has to be operating for an infinitetime span. We define that Xk and Yk (k =1, 2, . . . ) denote the uptime and the downtimerespectively, and A(t) is the probability that aunit is operating at time t . Then, from Hosford[134] and Barlow and Proschan [1], A(t) is calledthe pointwise availability and

∫ t0 A(u) du/t is the

interval availability. Further, we have

A≡ limt→∞

1

t

∫ t

0A(u) du= lim

t→∞A(t)

= E{Xk}E{Xk} + E{Yk}

(20.59)

which is called the steady-state availability and iswell known as the usual and standard definitionof availability. For an infinite time span, anappropriate objective function is the availability A

in Equation 20.59.Morse [135] first derived the optimum PM

policy that maximizes the steady-state availability.Barlow and Proschan [1] pointed out that this

problem is reduced to an age replacement oneif the mean time to repair is replaced by thereplacement cost of a failed unit and the meantime to PM by the cost of exchanging a non-failed unit. Optimum PM policies for more generalsystems were discussed by Nakagawa [17], Sherifand Smith [136], Jardine and Buzacott [137], andReineke et al. [138]. Liang [139] considered thePM policies for series systems by modifying theopportunistic replacement policies. Aven [140]and Smith and Dekker [141] treated the PM ofa system with spare units. Silver and Fiechter[142] considered the PM model where the failuretime distribution is uncertain. Further, Van derDuyn Schouten and Scarf [143] presented severalmaintenance models in Europe and gave a goodsurvey of applied PM models. Chockie andBjorkelo [144], Smith [145], and Susova andPetrov [146] gave the PM programs of plant andaircraft.

We consider the following PM policies: (1) PMof a one-unit system; (2) PM of a two-unitsystem; (3) imperfect PM; and (4) modified PM.We discuss optimum PM policies that maximizethe availabilities or minimize the expected costsof each model, and summarize their derivationresults.

20.3.1 One-unit System

When a unit fails, it undergoes repair immediately,and once repaired it is returned to the operatingstate. It is assumed that the failure time X isindependent and has an identical distributionF(t) with finite mean 1/λ, and the repair timeY1 is also independent and has an identicaldistribution G1(t) with finite mean 1/µ1.

If the operating time of a unit is always knownand its failure rate increases with time, it may bewise to maintain preventively it at time T beforefailure on its operating time. We call time T theplanned PM time. The distribution of time Y2 toPM completion is G2(t) with finite mean 1/µ2,which may be smaller than the repair time Y1.

A new unit begins to operate at t = 0. We defineone cycle as being from the beginning of operationto the completion of PM or repair. Then, the mean

Page 412: Handbook of Reliability Engineering - pdi2.org

Maintenance and Optimum Policy 379

time of downtime of one cycle is

E{Y1I(X<T ) + Y2I(X≥T )} = 1

µ1F(T )+ 1

µ2F (T )

(20.60)and from Equation 20.4, the mean time of uptimeof one cycle is

∫ T0 F (t) dt , where F ≡ 1− F .

Therefore, from Equation 20.59, the steady-state availability is

A(T )≡ Mean time of uptime

Mean time of one cycle

=∫ T

0 F (t) dt∫ T0 F (t) dt + 1

µ1F(T )+ 1

µ2F (T )

(20.61)

Thus, the policy maximizing the availability is thesame one as minimizing the expected cost C(T )

in Equation 20.5, as pointed out by Barlow andProschan [1].

20.3.1.1 Interval Reliability

See Mine and Nakagawa [16]. Barlow andProschan [1] defined that interval reliabilityR(x, T0) is the probability that, at a specified timeT0, a system is operating and will continue tooperate for an interval of time x. The intervalreliability is simply called reliability when T0 = 0;furthermore, it becomes pointwise availability attime T0 as x→ 0. Thus, the interval reliability isan important measure from the viewpoints of bothreliability and availability.

Consider the PM of the above one-unit system,where a unit is repaired at failure or is maintainedpreventively at time T , whichever occurs first.However, the PM of an operating unit is not madeduring the interval [T0, T0 + x] even if the timefor PM comes. It is assumed that the distributionof the time to pm is the same as the repair timedistribution G(t) with finite mean 1/µ, for thesimplicity of computations.

We set the PM time T to an operating unitand obtain the interval reliability R(T ; x, T0) bya method similar to Barlow and Proschan [1,p.82]. Let D(t) be the distribution of a degeneraterandom variable placing unit mass at T , i.e.D(t) ≡ 0 for t < T , and D(t)≡ 1 for t ≥ T . Then,

we have

R(T ; x, T0)= F (T0 + x)D(T0)

+∫ T0

0F (T0 + x − u)D(T0 − u) dM11(u)

(20.62)

where D ≡ 1−D, M11(t) represents the expectednumber of occurrences of a completed repairduring (0, t], and from an elementary renewaltheorem (e.g. theorem 2.9 of Barlow and Proshan[1, p.55]):

limt→∞

M11(t)

t= 1∫ T

0 F (t) dt + 1µ

in which the denominator is equal to the meantime of one cycle.

Thus, the limiting interval reliability is

R(T ; x)≡ limT0→∞

R(T ; x, T0)=∫ T+xx

F (t) dt∫ T0 F (t) dt + 1

µ

(20.63)We seek an optimum time T ∗ that maximizes

R(T ; x) in Equation 20.63 for a fixed x >

0. Let H(t)≡ [F(t + x)− F(t)]/F (t) for t, x ≥0 and F(t) < 1. Then, both H(t) and r(t)≡f (t)/F (t) are called the failure rates and have thesame properties (Barlow and Proschan [1, p.23]).Then, we have the following similar theorem toTheorem 1.

Theorem 3. Suppose that H(t) is continuous andstrictly increasing, where

K(x)≡[ ∫ x

0F (t) dt + 1/µ

]/(1/λ+ 1/µ).

(i) If H(∞) > K(x) then there exists a finite andunique T ∗(0 < T ∗ <∞) that satisfies

H(T )

[ ∫ T

0F (t) dt + 1

µ

]−∫ T

0[F (t)− F (t + x)] dt = 1

µ(20.64)

and the resulting interval reliability is

R(T ∗; x)= 1−H(T ∗) (20.65)

Page 413: Handbook of Reliability Engineering - pdi2.org

380 Maintenance Theory and Testing

(ii) If H(∞)≤K(x) then the optimum PM time isT ∗ =∞, and

R(∞; x)=∫∞x

F (t) dt

(1/λ+ 1/µ)

If the time for PM has G2(t) with mean 1/µ2and the time for repair has G1(t) with mean 1/µ1,then the limiting interval reliability is easily givenby

R(T ; x)=∫ T+xx F (t) dt∫ T

0 F (t) dt + 1µ1

F(T )+ 1µ2

F (T )

(20.66)which agrees with Equation 20.61 as x→ 0, andwith Equation 20.63 when 1/µ1 = 1/µ2 = 1/µ.

Example 3. Suppose that the failure time of a unithas a gamma distribution with order 2, i.e. f (t)=β2t e−β t . Then, we have

H(t)= 1− e−βx − βx

1+ βte−βx

K(x)= (2/β)(1− e−βx)− x e−βx + 1/µ

2/β + 1/µ

The failure rate H(t) is strictly increasing fromH(0)= F(x) to H(∞)= 1− e−βx . Thus, fromTheorem 3, if x > 1/µ then the optimum time T ∗is a unique solution of

βT

(x − 1

µ

)− x(1− e−βT )= 1+ βx

µ

and the interval reliability is

R(T ∗; x)= 1+ β(T ∗ + x)

1+ βT ∗e−βx

20.3.2 Two-unit System

A two-unit standby redundant system with asingle repairman is one of the most fundamentaland important redundant systems in reliabilitytheory. A system consists of two units where oneunit is operating and the other is in standby as aninitial condition. If an operating unit fails, then

it undergoes repair immediately and the otherstandby unit takes over its operation. Either of twounits is alternately operating. It can be said thata system failure occurs when two units are downsimultaneously.

Gaver [147] obtained the distribution of timeto system failure and its mean time for themodel with exponential failure and general repairtimes. Gnedenko [148, 149] and Srinivasan [150]extended the results for the model with bothgeneral failure and repair times. Further, Osaki[151] obtained the same results by using a signal-flow graph method.

PM policies for a two-unit standby system havebeen studied by many authors. Rozhdestvenskiyand Fanarzhi [152] obtained one method ofapproximating the optimum PM policy thatminimizes the mean time to system failure. Osakiand Asakura [153] showed that the mean timeto system failure of a system with PM is greaterthan that with only repair maintenance undersuitable conditions. Berg [154–156] consideredthe replacement policy for a two-unit systemwhere, at the failure points of one unit, theother unit is replaced if its age exceeds a controllimit.

Consider a two-unit standby system where twounits are statistically identical. An operating unithas a failure time distribution F(t) with finitemean 1/λ and a failed unit has a repair timedistribution G1(t) with finite mean 1/µ1. Whenan operating unit operates for a specified time T

without failure, we stop its operation. The timefor PM has a general distribution G2(t) with finitemean 1/µ2. Further, we make the following threeassumptions:

1. A unit is as good as new upon repair or PMcompletion.

2. PM of an operating unit is done only if theother unit is in standby.

3. An operating unit, which forfeited PM becauseof (1), undergoes PM just upon repair or PMcompletion of the other unit.

Page 414: Handbook of Reliability Engineering - pdi2.org

Maintenance and Optimum Policy 381

Under the above assumptions, the steady-stateavailability is [11]

A(T )={[

1/γ1 +∫ T

0F (t)G1(t) dt

]×[1−

∫ ∞T

G2(t) dF(t)

]+[1/γ2 +

∫ T

0F (t)G2(t) dt

]×∫ ∞T

G1(t) dF(t)

}×{[

1/µ1 +∫ T

0F (t)G1(t) dt

]×[1−

∫ ∞T

G2(t) dF(t)

]+[1/µ2 +

∫ T

0F (t)G2(t) dt

]×∫ ∞T

G1(t) dF(t)

}−1

(20.67)

where F ≡ 1− F and 1/γi ≡∫∞

0 F (t)Gi (t) dt(i = 1, 2).

When an operating unit undergoes PM imme-diately upon repair or PM completion, i.e. T = 0,the availability is

A(0)= θ2/γ1 + (1− θ1)/γ2

θ2/µ1 + (1− θ1)/µ2(20.68)

where θi ≡∫∞

0 Gi(t) dF(t) (i = 1, 2). When noPM is done, i.e. T =∞, the availability is

A(∞)= 1/λ

1/λ+ 1/µ1 − 1/γ1(20.69)

We find an optimum planned PM time T ∗ thatmaximizes the availability A(T ) in Equation 20.67.It is assumed that G1(t) < G2(t) for 0 < t <∞.That is, the probability that the repair is completedup to time t is less than the probability that the PMis completed up to time t . Let r(t)≡ f (t)/F (t),where f (t) is a density of F .

Theorem 4. Suppose that G1(t) < G2(t) for 0 <

t <∞, and r(t) is continuous and strictly increas-ing.

(i) If r(∞) > K , β1µ1 > β2µ2 and r(0) < k, orr(∞) > K and β1µ1 ≤ β2µ2, then there existsa finite and unique T ∗(0 < T ∗ <∞) thatsatisfies

r(T )

[ ∫ T

0F (t)G(t) dt +

∫ ∞0

G(t) dt

]−∫ T

0G(t) dF(t)

= β1∫∞

0 G(t) dF(t)+ β2∫∞

0 G(t) dF(t)

β1 − β2(20.70)

and the resulting availability is

A(T ∗)= 1− (1/µ1 − 1/γ1)

×{

1/µ1 − 1/γ1 + 1/λ

+∫ ∞T ∗

G1(t) dF(t)/r(T ∗)

−∫ ∞T ∗

F (t)G1(t) dt

}−1

(20.71)

(ii) If r(∞) ≤K then the optimum PM timeis T ∗ =∞, i.e. no PM is done, and theavailability is given by Equation 20.69.

(iii) If β1µ1 > β2µ2 and r(0)≥ k then the opti-mum PM time is T ∗ = 0, and the availabilityis given by Equation 20.68 where

βi ≡∫ ∞

0F(t)Gi (t) dt (i = 1, 2)

G(t)≡ β1G2(t)− β2G1(t)

β1 − β2

k ≡ β1θ2 + β2(1− θ1)

β1/µ1 − β2/µ1

K ≡ λβ1

β1 − β2

20.3.3 Imperfect PreventiveMaintenance

Most PM models have assumed that a unit afterPM is as good as new. Actually, this assumption

Page 415: Handbook of Reliability Engineering - pdi2.org

382 Maintenance Theory and Testing

may not be true. A unit after PM usually may beyounger at PM, and occasionally it may be worsethan before PM because of faulty procedures.Generally, the improvement of a unit by PM woulddepend on the resources spent for PM.

Weiss [157] first assumed that the inspectionto detect failures may not be perfect. Chan andDowns [158], Nakagawa [20, 24], and Murthy andNguyen [159] considered the imperfect PM wherea unit after PM is not like new and discussed theoptimum policies that maximize the availabilityor minimize the expected cost. Further, Lie andChun [160] and Jayabalan and Chaudhuri [161]introduced an improvement factor in failure rateor age after maintenance, and Canfield [162]considered a system degradation with time wherethe PM restores the hazard function to the sameshape.

Brown and Proschan [163], Fontenot andProschan [164], and Bhattacharjee [165] assumedthat a failed unit is as good as new with a certainprobability and investigated some properties ofa failure time distribution. Similar imperfectrepair models were studied by Ebrahimi [166],Kijima et al. [167], Natvig [168], Stadje andZuckerman [169], and Makis and coworkers[170, 171]. Recently, Wang and Pham [172, 173]considered extended PM models with imperfectrepair and discussed the optimum policies thatminimize the expected cost and maximize theavailability.

A unit begins to operate at time zero and hasto be operating for an infinite time span [36].When a unit fails, it is repaired immediately andits mean repair time is 1/µ1. To prevent failures,a unit undergoes PM at periodic times kT (k =1, 2, . . . ), where the PM time is negligible. Then,one of the following three cases after PM results.

(a) A unit is not changed with probability p1, i.e.PM is imperfect.

(b) A unit is as good as new with probability p2,i.e. PM is perfect.

(c) A unit fails with probability p3, i.e. PMbecomes failure, where p1 + p2 + p3 ≡ 1 andp2 > 0. In this case, the mean repair time forPM failure is 1/µ2.

The probability that a unit is renewed by repairupon actual failure is

∞∑j=1

pj−11

∫ jT

(j−1)TdF(t)

= (1− p1)

∞∑j=1

pj−11 F(jT ) (20.72)

The probability that it is renewed by perfect PM is

p2

∞∑j=1

pj−11 F (jT ) (20.73)

The probability that it is renewed by repair uponPM failure is

p3

∞∑j=1

pj−11 F (jT ) (20.74)

where Equations 20.72+ 20.73+ 20.74= 1.The mean time until a unit is renewed by either

repair or perfect PM is

∞∑j=1

pj−11

∫ jT

(j−1)Tt dF(t)

+ (p2 + p3)

∞∑j=1

(jT )pj−11 F (jT )

= (1− p1)

∞∑j=1

pj−11

∫ jT

0F (t) dt (20.75)

Thus, from Equations 20.59 and 20.61, theavailability is

A(T )={(1− p1)

∞∑j=1

pj−11

∫ jT

0F (t) dt

}

×{(1− p1)

∞∑j=1

pj−11

∫ jT

0F (t) dt

+ 1− p1

µ1

∞∑j=1

pj−11 F(jT )

+ p3

µ2

∞∑j=1

pj−11 F (jT )

}−1

(20.76)

which agrees with Equation 20.61 when p1 = 0and p3 = 1.

Page 416: Handbook of Reliability Engineering - pdi2.org

Maintenance and Optimum Policy 383

Let

H(T ; p1)≡∑∞

j=1 pj−11 jf (jT )∑∞

j=1 pj−11 j F (jT )

Then, by a similar method to Theorem 1, we havethe following theorem.

Theorem 5. Suppose that H(T ; p1) is continuousand strictly increasing, and K ≡ λ(1− p1)/p2.

(i) If H(∞; p1) > K and (1− p1)/µ1 > p3/µ2then there exists a finite and unique T ∗ thatsatisfies

H(T ; p1)

∞∑j=1

pj−11

∫ jT

0F (t) dt

−∞∑j=1

pj−11 F(jT )

= p3

1− p1

1µ2

1−p1µ1− p3

µ2

(20.77)

and the resulting availability is

A(T ∗)= 1

1+(

1µ1− p3

µ2(1−p1)

)H(T ∗)

(20.78)(ii) If H(∞; p1)≤K or (1− p1)/µ1 ≤ p3/µ2

then the optimum time is T ∗ =∞, andA(∞)= (1/λ)(1/λ+ 1/µ1).

We adopt the expected cost per unit of time asan objective function and give the following twoimperfect PM models.

20.3.3.1 Imperfect with Probability

An operating unit is repaired at failure or ismaintained preventively at time T , whicheveroccurs first. Then, a unit after PM has the sameage, i.e. the same failure rate, as it had beforePM with probability p (0≤ p < 1) and is as goodas new with probability q ≡ 1− p. Then, the

expected cost per unit of time is

C(T ; p)≡{c1q

∞∑j=1

pj−1F(jT )

+ c2

∞∑j=1

pj−1F (jT )

}

×{ ∞∑

j=1

pj−1∫ jT

(j−1)TF (t) dt

}−1

(20.79)

where c1 is the cost for repair and c2 is the cost forPM.

Next, suppose that an operating unit is main-tained preventively at times kT (k = 1, 2, . . . )and undergoes only minimal repair at failuresbetween PMs. Then the expected cost is

C(T ; p)= 1

T

[c1q

2∞∑j=1

pj−1∫ jT

0r(t) dt + c2

](20.80)

where c1 is the cost for minimal repair and c2 isthe cost for PM.

20.3.3.2 Reduced Age

An operating unit is maintained preventivelyat times kT (k = 1, 2, . . . ) and undergoes onlyminimal repair at failures between PMs. Further,the age of a unit becomes x (0≤ x ≤ T ) units ofyounger at each PM, and it is replaced if it operatesfor the time interval NT (N = 1, 2, . . . ). Then,the expected cost per unit of time is

C(N; T , x)= 1

NT

[c1

N−1∑j=0

∫ T+j (T−x)

j (T−x)r(t) dt

+ c2 + (N − 1)c3

](20.81)

where c1 is the cost for minimal repair, c2 is thecost for replacement at NT , and c3 is the cost forPM.

In this model, if the age after PM reduces to at

(0 < a ≤ 1) when it was t before PM, the expected

Page 417: Handbook of Reliability Engineering - pdi2.org

384 Maintenance Theory and Testing

cost is

C(N; T , a)= 1

NT

[c1

N−1∑j=0

∫ (Aj+1)T

Aj T

r(t) dt

+ c2 + (N − 1)c3

](20.82)

where Aj ≡ a + a2 + · · · + aj (j = 1, 2, . . . )and A0 ≡ 0.

The above model is extended to the followingsequential PM policy. The PM is done at fixedintervals xk (k = 1, 2, . . . , N − 1) and is replacedat the Nth PM. Further, the age after the kth PMreduces to akt when it was t before PM, where 0=a0 < a1 ≤ a2 ≤ · · · ≤ aN < 1. Then, the expectedcost per unit of time is

C(x1, x2, . . . , xN)

=c1∑N

k=1

∫ Ykak−1Yk−1

r(t) dt + c2 + (N − 1)c3∑N−1k=1 (1− ak)Yk + YN

(20.83)

where

Yk ≡ xk + ak−1xk−1 + · · · + ak−1ak−2 · · · a2a1x1

(k = 1, 2, . . . )

20.3.4 Modified PreventiveMaintenance

See Nakagawa [34]. We consider the followingmodified PM policy. Failures of a unit occur bya non-homogeneous Poisson process and the PMis made only at periodic times kT (k = 1, 2, . . . ).If the total number of failures has exceeded aspecified number N , then the PM should be madeat the next planned time, otherwise no PM shouldbe done. This policy was applied to the PM of harddisks by Sandoh et al. [174].

A unit has to operate for infinite time spanand assume that failures occur by a non-homogeneous Poisson process with an intensityfunction r(t) and a mean-value function R(t),i.e. R(t)≡ ∫ t0 r(u) du. Then, the probability thatj failures exactly occur during (0, t] is Hj(t)≡

{[R(t)]j /j !} e−R(t) (j = 0, 1, 2, . . . ). The PM isscheduled at times kT (k = 1, 2, . . . ), and ifthe total number of failures has exceeded aspecified number N , then the PM is made atthe next PM time. Otherwise, a unit is left asit is. A unit undergoes minimal repair at eachfailure.

The probability that the PM is done at time(k + 1)T (k = 0, 1, 2, . . . ), because more than N

failures have occurred during (0, (k + 1)T ] whenthe number of failures was less than N until kT ,is

N−1∑j=0

Hj [R(kT )]∞∑

i=N−jHi[R((k + 1)T )− R(kT )]

=N−1∑j=0

{Hj [R(kT )] −Hj [R((k + 1)T )]}

Thus, the mean time to PM is

∞∑k=0

[(k + 1)T ]

×N−1∑j=0

{Hj [R(kT )] −Hj [R((k + 1)T )]}

= T

∞∑k=0

N−1∑j=0

Hj [R(kT )] (20.84)

Further, the expected number of failures until PMis

∞∑k=0

N−1∑j=0

Hj [R(kT )]

×∞∑

i=N−j(i + j)Hi[R((k + 1)T )− R(kT )]

=∞∑k=0

[R((k + 1)T )− R(kT )]N−1∑j=0

Hj [R(kT )]

(20.85)

Page 418: Handbook of Reliability Engineering - pdi2.org

Maintenance and Optimum Policy 385

Therefore, from Equations 20.84 and 20.85, theexpected cost per unit of time is

C(N; T )={c1

∞∑k=0

[R((k + 1)T )− R(kT )]

×N−1∑j=0

Hj [R(kT )] + c2

}

×{T

∞∑k=0

N−1∑j=0

Hj [R(kT )]}−1

(20.86)

where c1 is the cost for minimal repair and c2 isthe cost for planned PM.

We seek an optimum number N∗ that mini-mizesC(N; T ) in Equation 20.86 for a fixed T > 0.By a similar method to that of Section 20.2.3.2,we have the following results. From the inequalityC(N + 1; T ) > C(N; T ):

q(N)

∞∑k=0

N−1∑j=0

Hj [R(kT )]

−∞∑k=0

[R((k + 1)T )−R(kT )]N−1∑j=0

Hj [R(kT )]

≥ c2

c1(N = 1, 2, . . . ) (20.87)

where

q(N)≡{ ∞∑

k=0

[R((k + 1)T )−R(kT )]HN [R(kT )]}

×{ ∞∑

k=0

HN [R(kT )]}−1

When r(t) is strictly increasing, q(N) is strictlyincreasing, and hence the left-hand side ofEquation 20.87 is also strictly increasing. Thus,if there exists a finite and minimum solution N∗which satisfies Equation 20.87, it is unique andminimizes C(N; T ).

20.4 Inspection PoliciesSuppose that failures are not detected immediatelyand can be done only through inspections.

A typical example is standby electric generators inhospitals and other public facilities. It is extremelyserious if a standby generator fails at the verymoment of electric power supply stop. Similarexamples can be found in defense systems, inwhich all weapons are in standby, and hencemust be checked at periodic times. For example,missiles are in storage for a great part of theirlifetimes after delivery. It is important to testthe functions of missiles as to whether they canoperate normally or not. Therefore, we need tocheck such systems at suitable times, and, ifnecessary, replace or repair them.

Barlow and Proschan [1] summarized theschedules of inspections that minimize twoexpected costs until detection of failure and perunit of time. Luss and Kander [175], Luss [176],and Wattanapanom and Shaw [177] consideredthe modified models where checking times arenon-negligible and a system is inoperative duringchecking times. Platz [178] gave the availabilityof periodic checks. Zacks and Fenske [179], Lussand Kander [180], Anbar [181], Kander [182],Zuckerman [183, 184], and Qiu [185] treatedmuch more complicated systems. Further, Weiss[157], Coleman and Abrams [186], Morey [187],and Apostolakis and Bansal [188] consideredimperfect inspections where some failures mightnot be detected.

It would be especially important to check andmaintain standby and protective units. Nakagawa[22], Thomas et al. [189], Sim [190], andParmigiani [191] discussed the inspection policyfor standby units, and Chay and Mazumdar [192],Inagaki et al. [193], and Shima and Nakagawa[194] did so for protective devices.

A unit has to be operating for an infinite timespan. Let c1 be the cost for one check and c2 bethe loss cost for the time elapsed between failureand its detection at the next checking time per unitof time. Then, the expected cost until detection offailure is, from Barlow and Proschan [1]:

C ≡∫ ∞

0{c1[E{N(t)} + 1] + c2E{γt }} dF(t)

(20.88)

Page 419: Handbook of Reliability Engineering - pdi2.org

386 Maintenance Theory and Testing

where N(t) is the number of checks during (0, t],γt is the interval from failure to its detection whena failure occurs at time t , and F is a failure timedistribution of a unit.

We consider the following three inspectionpolicies: (1) standard inspection and asymptoticchecking time; (2) inspection with PM; and(3) inspection of a storage system. We summarizeoptimum policies that minimize the total expectedcost until detection of failure.

20.4.1 Standard Inspection

A unit is checked at successive times xk (k =1, 2, . . . ) where x0 ≡ 0. Any failures are detectedat the next checking time and are replacedimmediately. A unit has a failure time distributionF(t) with finite mean 1/λ and its failure rater(t)≡ f (t)/F (t) is not changed by any checks,where f is a density of F and F ≡ 1− F . It isassumed that all times needed for checks andreplacement are negligible.

Then, the total expected cost until detection offailure, from Equation 20.88, is

C(x1, x2, . . . )

=∞∑k=0

∫ xk+1

xk

[c1(k + 1)+ c2(xk+1 − t)] dF(t)

=∞∑k=0

[c1 + c2(xk+1 − xk)]F (xk)− c2

λ

(20.89)

Differentiating the total expected costC(x1, x2, . . . ) with xk and putting at zero:

xk+1 − xk = F(xk)− F(xk−1)

f (xk)− c1

c2(20.90)

Balrow and Proschan [1] proved that the optimumchecking intervals are decreasing when f isPF2 and gave the algorithm for computing theoptimum schedule.

It is difficult to compute this algorithm numer-ically, because the computations are repeated untilthe procedures are determined to the requireddegree by changing the first checking time. Wesummarize three asymptotic calculations of opti-mum checking procedures.

Munford and Shahani [195] suggested that,when a unit was operating at time xk−1, theprobability that it fails during (xk−1, xk] isconstant for all k, i.e.

F(xk)− F(xk−1)

F (xk)≡ p (20.91)

Then, the total expected cost in Equation 20.89 isrewritten as

C(p)= c1

p+ c2

∞∑k=1

xkqk−1p − c2

λ(20.92)

where q ≡ 1− p. We may seek p that minimizesC(p).

Keller [196] defined a smooth density n(t),which denotes the number of checks per unit oftime, expressed as∫ xk

0n(t) dt = k (20.93)

Then, the total expected cost is approximately

C(n(t))≈ c1

∫ ∞0

n(t)F (t) dt

+ c2

2

∫ ∞0

1

n(t)dF(t) (20.94)

and the approximate checking times are given bythe following equation:

k =√

c2

2c1

∫ xk

0

√r(t) dt (k = 1, 2, . . . )

(20.95)Finally, Nakagawa and Yasui [197] considered

that if xn is sufficiently large, we may assumeapproximately that

xn+1 − xn + ε = xn − xn−1 (20.96)

where 0 < ε < c1/c2. In this case, Equation 20.90becomes

c1

c2− ε =

∫ xnxn−1[f (x)− f (xn)] dx

f (xn)(20.97)

We can determine the asymptotic schedule bycomputing xn−1 > xn−2 > · · · recursively fromEquation 20.97, from starting at some time xn ofaccuracy required.

Page 420: Handbook of Reliability Engineering - pdi2.org

Maintenance and Optimum Policy 387

Kaio and Osaki [198, 199] and Viscolani [200]discussed many modified inspection policies,using the above approximation methods, andcompared them.

Munford [201] and Luss [202] proposed thefollowing total expected cost for a continuousproduction system:

C(x1, x2, . . . )

=∞∑k=0

∫ xk+1

xk

[c1(k + 1)+ c2(xk+1 − xk)] dF(t)

= c1

∞∑k=0

F (xk)

+ c2

∞∑k=0

(xk+1 − xk)[F(xk+1)− F(xk)](20.98)

In this case, Equation 20.90 is rewritten as

xk+1 − 2xk + xk−1

= F(xk+1)− 2F(xk)+ F(xk−1)

f (xk)− c1

c2(20.99)

In particular, when a unit is checked at periodictimes and the failure time is exponential, i.e.xk = kT (k = 0, 1, 2, . . . ) and F(t)= 1− e−λt ,the total expected cost, from Equation 20.89, is

C(T )= c1 + c2

1− e−λT− c2

λ(20.100)

and an optimum checking time T ∗ is given by afinite and unique solution of

eλT − (1+ λT )= λc1

c2(20.101)

Similarly, from Equations 20.98 and 20.99, thetotal expected cost is

C(T )= c1

1− e−λT+ c2T (20.102)

and an optimum time T ∗ is a finite and uniquesolution of

eλT (1− e−λT )2 = λc1

c2(20.103)

Note that the solution to Equation 20.103 is lessthan that of Equation 20.101.

20.4.2 Inspection with PreventiveMaintenance

See Nakagawa [23, 32]. We consider a modifiedinspection policy in which an operating unit ischecked and maintained preventively at timeskT (k = 1, 2, . . . ) and the age of a unit afterPM reduces to at when it was t before PM (seeSection 20.3.3.2). Then, we obtain the mean timeto failure and the expected number of PMs beforefailure, and discuss an optimum inspection policythat minimizes the expected cost.

A unit begins to operate at time zero, and ischecked and undergoes PM at periodic times kT

(k = 1, 2, . . . ). It is assumed that the age of a unitafter PM reduces to at (0≤ a ≤ 1) when it was t

before PM. A unit has a failure time distributionF(t) with finite mean 1/λ and the process endswith its failure.

Let H(t, x) be the failure rate of a unit, i.e.H(t, x)≡ [F(t + x)− F(t)]/F (t) for x > 0, t ≥0 and F(t) < 1, where F ≡ 1− F . Then, thereliability function, which is the probability that aunit is operating at time t , is

S(t; T , a)= H (AkT , t − kT )

k−1∏j=0

H (AjT , T )

kT ≤ t < (k + 1)T (20.104)

where Ak ≡ a + a2 + · · · + ak (k = 1, 2, . . . ),A0 ≡ 0, H ≡ 1−H , and �−1

j=0 ≡ 1.Using Equation 20.104, the mean time to failure

is

γ (T ; a)=∞∑k=0

∫ (k+1)T

kT

S(t; T , a) dt

=∞∑k=0

[ k−1∏j=0

H (AjT , T )

]1

F (AkT )

×∫ (Ak+1)T

AkT

F (t) dt (20.105)

Page 421: Handbook of Reliability Engineering - pdi2.org

388 Maintenance Theory and Testing

and the expected number of PMs before failure is

M(T ; a)=∞∑k=0

kH(AkT , T )

k−1∏j=0

H (AjT , T )

=∞∑k=0

[ k∏j=0

H (AjT , T )

](20.106)

From Equations 20.105 and 20.106, we havethe following: if the failure rate H(T , x) isincreasing in T , then both γ (T ; a) and M(T ; a)are decreasing in a for any T > 0, and

M(T ; a)≤ γ (T ; a)T

≤ 1+M(T ; a) (20.107)

When a unit fails, its failure is detected at thenext checking time. Then, the total expected costuntil the detection of failure, from Equation 20.88,is

C(T ; a)=∞∑k=0

∫ (k+1)T

kT

{c1(k + 1)

+ c2[(k + 1)T − t]} d[−S(t; T , a)]= (c1 + c2T )[M(T ; a)+ 1] − c2γ (T ; a)

(20.108)

In particular, if a = 1, i.e. a unit undergoes onlychecks at periodic times, then

C(T ; 1)= (c1 + c2T )

∞∑k=0

F (kT )− c2

λ(20.109)

If a = 0, i.e. the PM is perfect, then

C(T ; 0)= 1

F(T )

[c1 + c2

∫ T

0F(t) dt

](20.110)

Since M(T ; a) is a decreasing function of a,and from Equation 20.107, we have the inequalities

C(T ; a)≥ c1[M(T ; a)+ 1] ≥ c1

∞∑k=0

F (kT )

Thus, limT→0 C(T ; a)= limT→∞ C(T ; a)=∞,and hence there exists a positive and finite T ∗that minimizes the expected cost C(T ; a) inEquation 20.108.

Table 20.1. Optimum times T ∗ for various c2 when F (t)=e−(t/100)2 and c1 = 1

a T ∗

0.1 0.5 1.0 2.0 5.0 10.0

0 76 47 38 31 23 180.2 67 40 32 26 19 150.4 60 34 27 22 16 130.6 53 29 22 18 13 100.7 47 23 18 14 10 81.0 42 19 13 9 6 4

Example 4. Suppose that the failure time of a unithas a Weibull distribution with shape parameter 2,i.e. F (T )= e−(t/100)2, and c1 = 1. Table 20.1 givesthe optimum checking times T ∗ for several a andc2. These times are small when a is large. The rea-son would be that the failure rate increases quicklywith age when a is large and a unit fails easily.

20.4.3 Inspection of a Storage System

Systems like missiles and spare parts for aircraftare stored for long periods until required.However, their reliability deteriorates with time[203], and we cannot clarify whether a systemcan operate normally or not. The periodictests and maintenances of a storage systemare indispensable to obtaining a highly reliablecondition. But, because frequent tests have a costand may degrade a system [204], we should nottest it very frequently.

Martinez [205] studied the periodic inspectionof stored electronic equipment and showed howto compute its reliability after 10 years of storage.Itô and Nakagawa [206, 207] considered a storagesystem that is overhauled if its reliability becomeslower than a prespecified value, and obtained thenumber of tests and the time to overhaul. Theyfurther discussed the optimum inspection policythat minimizes the expected cost. Wattanapanomand Shaw [177] proposed an inspection policy fora system where tests may hasten failures.

It is well known that large electric currentsoccur in an electronic circuit containing induction

Page 422: Handbook of Reliability Engineering - pdi2.org

Maintenance and Optimum Policy 389

Table 20.2. Optimum timesT ∗ and expected costsC(T ∗)whenλ3 = 9.24 × 10−7 , c2 = 1, and a = 0.9

m λ c1 T ∗ C(T ∗)

1.0 29.24 × 10−6 10 510 60315 670 76420 800 90325 920 102730 1020 1140

1.0 58.48 × 10−6 10 460 49015 590 61320 680 71825 790 81130 840 897

1.2 29.24 × 10−6 10 430 37715 540 46120 630 53125 670 59330 760 648

1.2 58.48 × 10−6 10 350 28615 390 34720 470 39925 510 44530 550 487

parts, such as coils and motors, at power on andoff. Missiles are constituted of various kinds ofelectrical and electronic parts, and some of themare degraded by power on–off cycles during eachtest [204].

We consider the following inspection policy fora storage system that has to operate when it is usedat any time.

1. A system is new at time zero, and is checkedand maintained preventively if necessary atperiodic times kT (k = 1, 2, . . . ), where T (>0) is previously specified.

2. A system consists mainly of two independentunits, where unit 1 becomes like new afterevery check; however, unit 2 does not becomelike new and is degraded with time and ateach check.

3. Unit 1 has a failure rate r1, which is given byr1(t − kT ) for kT < t ≤ (k + 1)T , because itis like new at time kT .

4. Unit 2 has two failure rate functions r2 andr3, which describe the failure rates of systemdegradations with time and at each checkrespectively. The failure rate r2(t) remainsundisturbed by any checks. Further, sinceunit 2 is degraded by power on–off cyclesduring the checking time, r3 increases byconstant rate λ3 at each check, and is definedas r3(t)= kλ3 for kT < t ≤ (k + 1)T .

Thus, the failure rate function r(t) of a system,from (3) and (4), is

r(t)≡ r1(t − kT )+ r2(t)+ kλ3 (20.111)

and the cumulative hazard function is

R(t)≡∫ t

0r(u) du

= kR1(T )+ R1(t − kT )+ R2(t)

+k−1∑j=0

jλ3T + kλ3(t − kT ) (20.112)

for kT < t ≤ (k + 1)T (k = 0, 1, 2, . . . ), whereRi(t)≡

∫ t0 ri(u) du (i = 1, 2).

Using Equation 20.112, the reliability of asystem at time t is

F (t)≡ e−R(t)

= exp

{− kR1(T )− R1(t − kT )− R2(t)

− kλ3

[t − 1

2(k + 1)T

]}(20.113)

for kT < t ≤ (k + 1)T (k = 0, 1, 2, . . . ), themean time to system failure is

γ (T )≡∫ ∞

0F (t) dt

=∞∑k=0

exp

[− kR1(T )− k(k − 1)

2λ3T

]

×∫ T

0exp[−R1(t)− R2(t + kT )

− kλ3t] dt (20.114)

Page 423: Handbook of Reliability Engineering - pdi2.org

390 Maintenance Theory and Testing

and the expected number of checks before failureis

M(T )≡∞∑k=1

k[F (kT )− F ((k + 1)T )]

=∞∑k=1

exp

[− kR1(T )− R2(kT )

− k(k − 1)

2λ3T

](20.115)

Therefore, the total expected cost until detectionof failure, from Equation 20.108, is

C(T )= (c1 + c2T )[M(T )+ 1] − c2γ (T )

(20.116)Suppose that units have a Weibull distribu-

tion, i.e. Ri(t)= λitm (i = 1, 2). Then, the total

expected cost is

C(T )= (c1 + c2T )

×∞∑k=0

exp

[− kλ1T

m − λ2(kT )m

− N(N − 1)

2λ3T

]− c2

∞∑k=0

exp

[− kλ1T

m − k(k − 1)

2λ3T

]

×∫ T

0exp[−λ1t

m − λ2(t + kT )m

− kλ3t] dt (20.117)

Changing T , we can calculate numerically anoptimum T ∗ that minimizes C(T ).

Bauer et al. [204] showed that the degradationfailure rate λ3 at each check is λ3 =NcKλSE whereNc is the ratio of total cycles to checking time, Kis the ratio of cyclic failure rate to storage failurerate, and λSE is the storage failure rate of electronicparts; their values are Nc = 2.3× 10−14, K =270, and λSE = 14.88× 10−6 h−1. Hence, λSE =9.24× 10−7 h−1.

Table 20.2 gives the optimum times T ∗ and theresulting costs C(T ∗) for λ and c1 when c2 = 1,m= 1, 1.2 and a = 0.9, where λ1 = aλ and λ2 =(1− a)λ. This indicates that both T ∗ and C(T ∗)increase when c1 and 1/λ increase, and that a

system should be checked about once a month.It is of interest that T ∗ in the case of m= 1.2is much shorter than that for m= 1, because asystem deteriorates with time.

References[1] Barlow RE, Proschan F. Mathematical theory of reliabil-

ity. New York: John Wiley & Sons, 1965.[2] Osaki S, Nakagawa T. Bibliography for reliability and

availability. IEEE Trans Reliab 1976;R-25:284–7.[3] Pierskalla WP, Voelker JA. A survey of maintenance

models: the control and surveillance of deterioratingsystems. Nav Res Logist Q 1976;23:353–88.

[4] Thomas LC. A survey of maintenance and replacementmodels for maintainability and reliability of multi-itemsystems. Reliab Eng 1986;16:297–309.

[5] Valdez-Flores C, Feldman RM. A survey of preventivemaintenance models for stochastically deterioratingsingle-unit system. Nav Res Logist Q 1989;36:419–46.

[6] Cho DI, Parlar M. A survey of maintenance models formulti-unit systems. Eur J Oper Res 1991;51:1–23.

[7] Özekici S. (ed.) Reliability and maintenance of complexsystems. Berlin: Springer-Verlag; 1996.

[8] Christer AH, Osaki S, Thomas LC. (eds.) Stochasticmodelling in innovative manufacturing. Lecture Notesin Economics and Mathematical Systems, vol. 445.Berlin: Springer-Verlag; 1997.

[9] Ben-Daya M, Duffuaa SO, Raouf A. (eds.) Maintenance,modeling and optimization. Boston: Kluwer AcademicPublishers; 2000.

[10] Osaki S. (ed.) Stochastic models in reliability main-tainance. Berlin: Springer-Verlag; 2002.

[11] Nakagawa T, Osaki S. Optimum preventive maintenancepolicies for a 2-unit redundant system. IEEE TransReliab 1974;R-23:86–91.

[12] Nakagawa T, Osaki S. Optimum preventive maintenancepolicies maximizing the mean time to the first systemfailure for a two-unit standby redundant system. OptimTheor Appl 1974;14:115–29.

[13] Osaki S, Nakagawa T. A note on age replacement. IEEETrans Reliab 1975;R-24:92–4.

[14] Nakagawa T, Osaki S. A summary of optimumpreventive maintenance policies for a two-unit standbyredundant system. Z Oper Res 1976;20:171–87.

[15] Nakagawa T. On a replacement problem of a cumulativedamage model. Oper Res Q 1976;27:895–900.

[16] Mine H, Nakagawa T. Interval reliability and optimumpreventive maintenance policy. IEEE Trans Reliab1977;R-26:131–3.

[17] Nakagawa T. Optimum preventive maintenance policiesfor repairable systems. IEEE Trans Reliab 1977;R-26:168–73.

[18] Nakagawa T, Osaki S. Discrete time age replacementpolicies. Oper Res Q 1977;28:881–5.

Page 424: Handbook of Reliability Engineering - pdi2.org

Maintenance and Optimum Policy 391

[19] Mine H, Nakagawa T. A summary of optimum preventive maintenance policies maximizing interval reliability. J Oper Res Soc Jpn 1978;21:205–16.
[20] Nakagawa T. Optimum policies when preventive maintenance is imperfect. IEEE Trans Reliab 1979;R-28:331–2.
[21] Nakagawa T. A summary of block replacement policies. RAIRO Oper Res 1979;13:351–61.
[22] Nakagawa T. Optimum inspection policies for a standby unit. J Oper Res Soc Jpn 1980;23:13–26.
[23] Nakagawa T. Replacement models with inspection and preventive maintenance. Microelectron Reliab 1980;20:427–33.
[24] Nakagawa T. A summary of imperfect preventive maintenance policies with minimal repair. RAIRO Oper Res 1980;14:249–55.
[25] Nakagawa T. Modified periodic replacement with minimal repair at failure. IEEE Trans Reliab 1981;R-30:165–8.
[26] Nakagawa T. A summary of periodic replacement with minimal repair at failure. J Oper Res Soc Jpn 1981;24:213–27.
[27] Nakagawa T. Generalized models for determining optimal number of minimal repairs before replacement. J Oper Res Soc Jpn 1981;24:325–37.
[28] Nakagawa T. A modified block replacement with two variables. IEEE Trans Reliab 1982;R-31:398–400.
[29] Nakagawa T, Kowada M. Analysis of a system with minimal repair and its application to replacement policy. Eur J Oper Res 1983;12:176–82.
[30] Nakagawa T. Optimal number of failures before replacement time. IEEE Trans Reliab 1983;R-32:115–6.
[31] Nakagawa T. Combined replacement models. RAIRO Oper Res 1983;17:193–203.
[32] Nakagawa T. Periodic inspection policy with preventive maintenance. Nav Res Logist Q 1984;31:33–40.
[33] Nakagawa T. A summary of discrete replacement policies. Eur J Oper Res 1984;17:382–92.
[34] Nakagawa T. Modified discrete preventive maintenance policies. Nav Res Logist Q 1986;33:703–15.
[35] Nakagawa T. Optimum replacement policies for systems with two types of units. In: Osaki S, Cao JH, editors. Reliability theory and applications: Proceedings of the China–Japan Reliability Symposium, Shanghai, China, 1987.
[36] Nakagawa T, Yasui K. Optimum policies for a system with imperfect maintenance. IEEE Trans Reliab 1987;R-36:631–3.
[37] Cox DR. Renewal theory. London: Methuen; 1962.
[38] Çinlar E. Introduction to stochastic processes. Englewood Cliffs (NJ): Prentice-Hall; 1975.
[39] Osaki S. Applied stochastic system modeling. Berlin: Springer-Verlag; 1992.
[40] Berg M. A proof of optimality for age replacement policies. J Appl Prob 1976;13:751–9.
[41] Bergman B. On the optimality of stationary replacement strategies. J Appl Prob 1980;17:178–86.
[42] Glasser GJ. The age replacement problem. Technometrics 1967;9:83–91.
[43] Scheaffer RL. Optimum age replacement policies with an increasing cost factor. Technometrics 1971;13:139–44.
[44] Cléroux R, Hanscom M. Age replacement with adjustment and depreciation costs and interest charges. Technometrics 1974;16:235–9.
[45] Cléroux R, Dubuc S, Tilquin C. The age replacement problem with minimal repair and random repair costs. Oper Res 1979;27:1158–67.
[46] Subramanian R, Wolff MR. Age replacement in simple systems with increasing loss functions. IEEE Trans Reliab 1976;R-25:32–4.
[47] Fox B. Age replacement with discounting. Oper Res 1966;14:533–7.
[48] Ran A, Rosenland SI. Age replacement with discounting for a continuous maintenance cost model. Technometrics 1976;18:459–65.
[49] Berg M, Epstein B. Comparison of age, block, and failure replacement policies. IEEE Trans Reliab 1978;R-27:25–9.
[50] Ingram CR, Scheaffer RL. On consistent estimation of age replacement intervals. Technometrics 1976;18:213–9.
[51] Frees EW, Ruppert D. Sequential non-parametric age replacement policies. Ann Stat 1985;13:650–62.
[52] Léger C, Cléroux R. Nonparametric age replacement: bootstrap confidence intervals for the optimal cost. Oper Res 1992;40:1062–73.
[53] Christer AH. Refined asymptotic costs for renewal reward processes. J Oper Res Soc 1978;29:577–83.
[54] Ansell J, Bendell A, Humble S. Age replacement under alternative cost criteria. Manage Sci 1984;30:358–67.
[55] Popova E, Wu HC. Renewal reward processes with fuzzy rewards and their applications to T-age replacement policies. Eur J Oper Res 1999;117:606–17.
[56] Zheng X. All opportunity-triggered replacement policy for multiple-unit systems. IEEE Trans Reliab 1995;44:648–52.
[57] Marathe VP, Nair KPK. Multistage planned replacement strategies. Oper Res 1966;14:874–87.
[58] Jain A, Nair KPK. Comparison of replacement strategies for items that fail. IEEE Trans Reliab 1974;R-23:247–51.
[59] Schweitzer PJ. Optimal replacement policies for hyperexponentially and uniform distributed lifetimes. Oper Res 1967;15:360–2.
[60] Savits TH. A cost relationship between age and block replacement policies. J Appl Prob 1988;25:789–96.
[61] Tilquin C, Cléroux R. Block replacement policies with general cost structures. Technometrics 1975;17:291–8.
[62] Archibald TW, Dekker R. Modified block-replacement for multi-component systems. IEEE Trans Reliab 1996;R-45:75–83.
[63] Sheu SH. A generalized block replacement policy with minimal repair and general random repair costs for a multi-unit system. J Oper Res Soc 1991;42:331–41.
[64] Sheu SH. Extended block replacement policy with used item and general random minimal repair cost. Eur J Oper Res 1994;79:405–16.


[65] Sheu SH. A modified block replacement policy with two variables and general random minimal repair cost. J Appl Prob 1996;33:557–72.
[66] Sheu SH. Extended optimal replacement model for deteriorating systems. Eur J Oper Res 1999;112:503–16.
[67] Crookes PCI. Replacement strategies. Oper Res Q 1963;14:167–84.
[68] Blanning RW. Replacement strategies. Oper Res Q 1965;16:253–4.
[69] Bhat BR. Used item replacement policy. J Appl Prob 1969;6:309–18.
[70] Tango T. A modified block replacement policy using less reliable items. IEEE Trans Reliab 1979;R-28:400–1.
[71] Murthy DNP, Nguyen DG. A note on extended block replacement policy with used items. J Appl Prob 1982;19:885–9.
[72] Ait Kadi D, Cléroux R. Optimal block replacement policies with multiple choice at failure. Nav Res Logist 1988;35:99–110.
[73] Berg M, Epstein B. A modified block replacement policy. Nav Res Logist Q 1976;23:15–24.
[74] Berg M, Epstein B. A note on a modified block replacement policy for units with increasing marginal running costs. Nav Res Logist Q 1979;26:157–60.
[75] Holland CW, McLean RA. Applications of replacement theory. AIIE Trans 1975;7:42–7.
[76] Tilquin C, Cléroux R. Periodic replacement with minimal repair at failure and adjustment costs. Nav Res Logist Q 1975;22:243–54.
[77] Boland PJ. Periodic replacement when minimal repair costs vary with time. Nav Res Logist Q 1982;29:541–6.
[78] Boland PJ, Proschan F. Periodic replacement with increasing minimal repair costs at failure. Oper Res 1982;30:1183–9.
[79] Chen M, Feldman RM. Optimal replacement policies with minimal repair and age-dependent costs. Eur J Oper Res 1997;98:75–84.
[80] Aven T. Optimal replacement under a minimal repair strategy—a general failure model. Adv Appl Prob 1983;15:198–211.
[81] Bagai I, Jain K. Improvement, deterioration, and optimal replacement under age-replacement with minimal repair. IEEE Trans Reliab 1994;43:156–62.
[82] Murthy DNP. A note on minimal repair. IEEE Trans Reliab 1991;40:245–6.
[83] Dekker R. A general framework for optimization priority setting, planning and combining of maintenance activities. Eur J Oper Res 1995;82:225–40.
[84] Aven T, Dekker R. A useful framework for optimal replacement models. Reliab Eng Syst Saf 1997;58:61–7.
[85] Mine H, Kawai H. Preventive replacement of a 1-unit system with a wearout state. IEEE Trans Reliab 1974;R-23:24–9.
[86] Muth E. An optimal decision rule for repair vs replacement. IEEE Trans Reliab 1977;26:179–81.
[87] Tahara A, Nishida T. Optimal replacement policy for minimal repair model. J Oper Res Soc Jpn 1975;18:113–24.
[88] Phelps RI. Replacement policies under minimal repair. J Oper Res Soc 1981;32:549–54.
[89] Phelps RI. Optimal policy for minimal repair. J Oper Res Soc 1983;34:452–7.
[90] Park KS, Yoo YK. (τ, k) block replacement policy with idle count. IEEE Trans Reliab 1993;42:561–5.
[91] Morimura H. On some preventive maintenance policies for IFR. J Oper Res Soc Jpn 1970;12:94–124.
[92] Park KS. Optimal number of minimal repairs before replacement. IEEE Trans Reliab 1979;R-28:137–40.
[93] Tapiero CS, Ritchken PH. Note on the (N, T) replacement rule. IEEE Trans Reliab 1985;R-34:374–6.
[94] Lam Y. A note on the optimal replacement problem. Adv Appl Prob 1988;20:479–82.
[95] Lam Y. A repair replacement model. Adv Appl Prob 1990;22:494–7.
[96] Lam Y. An optimal repairable replacement model for a deteriorating system. J Appl Prob 1991;28:843–51.
[97] Ritchken P, Wilson JG. (m, T) group maintenance policies. Manage Sci 1990;36:632–9.
[98] Sheu SH. A generalized model for determining optimal number of minimal repairs before replacement. Eur J Oper Res 1993;69:38–49.
[99] Munter M. Discrete renewal processes. IEEE Trans Reliab 1971;R-20:46–51.
[100] Nakagawa T, Osaki S. The discrete Weibull distribution. IEEE Trans Reliab 1975;R-24:300–1.
[101] Scheaffer RL. Optimum age replacement in the bivariate exponential case. IEEE Trans Reliab 1975;R-24:214–5.
[102] Berg M. Optimal replacement policies for two-unit machines with increasing running costs I. Stoch Process Appl 1976;4:89–106.
[103] Berg M. General trigger-off replacement procedures for two-unit systems. Nav Res Logist Q 1978;25:15–29.
[104] Yamada S, Osaki S. Optimum replacement policies for a system composed of components. IEEE Trans Reliab 1981;R-30:278–83.
[105] Bai DS, Jang JS, Kwon YI. Generalized preventive maintenance policies for a system subject to deterioration. IEEE Trans Reliab 1983;R-32:512–4.
[106] Murthy DNP, Nguyen DG. Study of two-component system with failure interaction. Nav Res Logist Q 1985;32:239–47.
[107] Pullen KW, Thomas MU. Evaluation of an opportunistic replacement policy for a 2-unit system. IEEE Trans Reliab 1986;R-35:320–4.
[108] Beichelt F, Fisher K. General failure model applied to preventive maintenance policies. IEEE Trans Reliab 1980;R-29:39–41.
[109] Beichelt F. A generalized block-replacement policy. IEEE Trans Reliab 1981;R-30:171–2.
[110] Murthy DNP, Maxwell MR. Optimal age replacement policies for items from a mixture. IEEE Trans Reliab 1981;R-30:169–70.
[111] Berg MP. The marginal cost analysis and its application to repair and replacement policies. Eur J Oper Res 1995;82:214–24.


[112] Block HW, Borges WS, Savits TH. Age-dependent minimal repair. J Appl Prob 1985;22:370–85.
[113] Block HW, Borges WS, Savits TH. A general age replacement model with minimal repair. Nav Res Logist Q 1988;35:365–72.
[114] Smith WL. Renewal theory and its ramifications. J R Stat Soc Ser B 1958;20:243–302.
[115] Mercer A. On wear-dependent renewal process. J R Stat Soc Ser B 1961;23:368–76.
[116] Esary JD, Marshall AW, Proschan F. Shock models and wear processes. Ann Prob 1973;1:627–49.
[117] Nakagawa T, Osaki S. Some aspects of damage models. Microelectron Reliab 1974;13:253–7.
[118] Taylor HM. Optimal replacement under additive damage and other failure models. Nav Res Logist Q 1975;22:1–18.
[119] Feldman RM. Optimal replacement with semi-Markov shock models. J Appl Prob 1976;13:108–17.
[120] Feldman RM. Optimal replacement with semi-Markov shock models using discounted costs. Math Oper Res 1977;2:78–90.
[121] Feldman RM. Optimal replacement for systems governed by Markov additive shock processes. Ann Prob 1977;5:413–29.
[122] Zuckerman D. Replacement models under additive damage. Nav Res Logist Q 1977;24:549–58.
[123] Zuckerman D. Optimal replacement policy for the case where the damage process is a one-sided Lévy process. Stoch Process Appl 1978;7:141–51.
[124] Zuckerman D. Optimal stopping in a semi-Markov shock model. J Appl Prob 1978;15:629–34.
[125] Zuckerman D. A note on the optimal replacement time of damaged devices. Nav Res Logist Q 1980;27:521–4.
[126] Wortman MA, Klutke GA, Ayhan H. A maintenance strategy for systems subjected to deterioration governed by random shocks. IEEE Trans Reliab 1994;43:439–45.
[127] Sheu SH, Griffith WS. Optimal number of minimal repairs before replacement of a system subject to shocks. Nav Res Logist 1996;43:319–33.
[128] Sheu SH. Extended block replacement policy of a system subject to shocks. IEEE Trans Reliab 1997;46:375–82.
[129] Sheu SH. A generalized age block replacement of a system subject to shocks. Eur J Oper Res 1998;108:345–62.
[130] Satow T, Yasui K, Nakagawa T. Optimal garbage collection policies for a database in a computer system. RAIRO Oper Res 1996;30:359–72.
[131] Qian CH, Nakamura S, Nakagawa T. Cumulative damage model with two kinds of shocks and its application to the backup policy. J Oper Res Soc Jpn 1999;42:501–11.
[132] Pyke R. Markov renewal processes: definitions and preliminary properties. Ann Math Stat 1961;32:1231–42.
[133] Pyke R. Markov renewal processes with finitely many states. Ann Math Stat 1961;32:1243–59.
[134] Hosford JE. Measures of dependability. Oper Res 1960;8:53–64.
[135] Morse PM. Queues, inventories and maintenance. New York: John Wiley & Sons; 1958. Chapter 11.
[136] Sherif YS, Smith ML. Optimal maintenance models for systems subject to failure—a review. Nav Res Logist Q 1981;28:47–74.
[137] Jardine AKS, Buzacott JA. Equipment reliability and maintenance. Eur J Oper Res 1985;19:285–96.
[138] Reineke DM, Murdock Jr WP, Pohl EA, Rehmert I. Improving availability and cost performance for complex systems with preventive maintenance. In: Proceedings Annual Reliability and Maintainability Symposium, 1999; p. 383–8.
[139] Liang TY. Optimum piggyback preventive maintenance policies. IEEE Trans Reliab 1985;R-34:529–38.
[140] Aven T. Availability formulae for standby systems of similar units that are preventively maintained. IEEE Trans Reliab 1990;R-39:603–6.
[141] Smith MAJ, Dekker R. Preventive maintenance in a 1 out of n system: the uptime, downtime and costs. Eur J Oper Res 1997;99:565–83.
[142] Silver EA, Fiechter CN. Preventive maintenance with limited historical data. Eur J Oper Res 1995;82:125–44.
[143] Van der Duyn Schouten, Scarf PA. Eleventh EURO summer institute: operational research models in maintenance. Eur J Oper Res 1997;99:493–506.
[144] Chockie A, Bjorkelo K. Effective maintenance practices to manage system aging. In: Proceedings Annual Reliability and Maintainability Symposium, 1992; p. 166–70.
[145] Smith AM. Preventive-maintenance impact on plant availability. In: Proceedings Annual Reliability and Maintainability Symposium, 1992; p. 177–80.
[146] Susova GM, Petrov AN. Markov model-based reliability and safety evaluation for aircraft maintenance-system optimization. In: Proceedings Annual Reliability and Maintainability Symposium, 1992; p. 29–36.
[147] Gaver Jr DP. Failure time for a redundant repairable system of two dissimilar elements. IEEE Trans Reliab 1964;R-13:14–22.
[148] Gnedenko BV. Idle duplication. Eng Cybernet 1964;2:1–9.
[149] Gnedenko BV. Duplication with repair. Eng Cybernet 1964;2:102–8.
[150] Srinivasan VS. The effect of standby redundancy in system's failure with repair maintenance. Oper Res 1966;14:1024–36.
[151] Osaki S. System reliability analysis by Markov renewal processes. J Oper Res Soc Jpn 1970;12:127–88.
[152] Rozhdestvenskiy DV, Fanarzhi GN. Reliability of a duplicated system with renewal and preventive maintenance. Eng Cybernet 1970;8:475–9.
[153] Osaki S, Asakura T. A two-unit standby redundant system with preventive maintenance. J Appl Prob 1970;7:641–8.
[154] Berg M. Optimal replacement policies for two-unit machines with increasing running cost I. Stoch Process Appl 1976;4:89–106.
[155] Berg M. Optimal replacement policies for two-unit machines with running cost II. Stoch Process Appl 1977;5:315–22.
[156] Berg M. General trigger-off replacement procedures for two-unit system. Nav Res Logist Q 1978;25:15–29.


[157] Weiss H. A problem in equipment maintenance. Manage Sci 1962;8:266–77.
[158] Chan PKW, Downs T. Two criteria for preventive maintenance. IEEE Trans Reliab 1978;R-27:272–3.
[159] Murthy DNP, Nguyen DG. Optimal age-policy with imperfect preventive maintenance. IEEE Trans Reliab 1981;R-30:80–1.
[160] Lie CH, Chun YH. An algorithm for preventive maintenance policy. IEEE Trans Reliab 1986;R-35:71–5.
[161] Jayabalan V, Chaudhuri D. Cost optimization of maintenance scheduling for a system with assured reliability. IEEE Trans Reliab 1992;R-41:21–5.
[162] Canfield RV. Cost optimization of periodic preventive maintenance. IEEE Trans Reliab 1986;R-35:78–81.
[163] Brown M, Proschan F. Imperfect repair. J Appl Prob 1983;20:851–9.
[164] Fontnot RA, Proschan F. Some imperfect maintenance models. In: Abdel-Hameed MS, Çinlar E, Quinn J, editors. Reliability theory and models. Orlando (FL): Academic Press; 1984.
[165] Bhattacharjee MC. New results for the Brown–Proschan model of imperfect repair. J Stat Plan Infer 1987;16:305–16.
[166] Ebrahimi N. Mean time to achieve a failure-free requirement with imperfect repair. IEEE Trans Reliab 1985;R-34:34–7.
[167] Kijima M, Morimura H, Suzuki Y. Periodic replacement problem without assuming minimal repair. Eur J Oper Res 1988;37:194–203.
[168] Natvig B. On information based minimal repair and the reduction in remaining system lifetime due to the failure of a specific module. J Appl Prob 1990;27:365–75.
[169] Stadje WG, Zuckerman D. Optimal maintenance strategies for repairable systems with general degree of repair. J Appl Prob 1991;28:384–96.
[170] Makis V, Jardine AKS. Optimal replacement policy for a general model with imperfect repair. J Oper Res Soc 1992;43:111–20.
[171] Lie XG, Makis V, Jardine AKS. A replacement model with overhauls and repairs. Nav Res Logist 1995;42:1063–79.
[172] Wang H, Pham H. Optimal age-dependent preventive maintenance policies with imperfect maintenance. Int J Reliab Qual Saf Eng 1996;3:119–35.
[173] Pham H, Wang H. Imperfect maintenance. Eur J Oper Res 1996;94:425–38.
[174] Sandoh H, Hirakoshi H, Nakagawa T. A new modified discrete preventive maintenance policy and its application to hard disk management. J Qual Maint Eng 1998;4:284–90.
[175] Luss H, Kander Z. Inspection policies when duration of checkings is non-negligible. Oper Res Q 1974;25:299–309.
[176] Luss H. Inspection policies for a system which is inoperative during inspection periods. AIIE Trans 1976;9:189–94.
[177] Wattanapanom N, Shaw L. Optimal inspection schedules for failure detection in a model where tests hasten failures. Oper Res 1979;27:303–17.
[178] Platz O. Availability of a renewable, checked system. IEEE Trans Reliab 1976;R-25:56–8.
[179] Zacks S, Fenske WJ. Sequential determination of inspection epochs for reliability systems with general lifetime distributions. Nav Res Logist Q 1973;20:377–86.
[180] Luss H, Kander Z. A preparedness model dealing with N systems operating simultaneously. Oper Res 1974;22:117–28.
[181] Anbar D. An asymptotically optimal inspection policy. Nav Res Logist Q 1976;23:211–8.
[182] Kander Z. Inspection policies for deteriorating equipment characterized by N quality levels. Nav Res Logist Q 1978;25:243–55.
[183] Zuckerman D. Inspection and replacement policies. J Appl Prob 1980;17:168–77.
[184] Zuckerman D. Optimal inspection policy for a multi-unit machine. J Appl Prob 1989;26:543–51.
[185] Qiu YP. A note on optimal inspection policy for stochastically deteriorating series systems. J Appl Prob 1991;28:934–9.
[186] Coleman JJ, Abrams IJ. Mathematical model for operational readiness. Oper Res 1962;10:126–38.
[187] Morey RC. A criterion for the economic application of imperfect inspections. Oper Res 1967;15:695–8.
[188] Apostolakis GE, Bansal PP. Effect of human error on the availability of periodically inspected redundant systems. IEEE Trans Reliab 1977;R-26:220–5.
[189] Thomas LC, Jacobs PA, Gaver DP. Optimal inspection policies for standby systems. Commun Stat Stoch Model 1987;3:259–73.
[190] Sim SH. Reliability of standby equipment with periodic testing. IEEE Trans Reliab 1987;R-36:117–23.
[191] Parmigiani G. Inspection times for stand-by units. J Appl Prob 1994;31:1015–25.
[192] Chay SC, Mazumdar M. Determination of test intervals in certain repairable standby protective systems. IEEE Trans Reliab 1975;R-24:201–5.
[193] Inagaki T, Inoue K, Akashi H. Optimization of staggered inspection schedules for protective systems. IEEE Trans Reliab 1980;R-29:170–3.
[194] Shima E, Nakagawa T. Optimum inspection policy for a protective device. Reliab Eng 1984;7:123–32.
[195] Munford AG, Shahani AK. A nearly optimal inspection policy. Oper Res Q 1972;23:373–9.
[196] Keller JB. Optimum checking schedules for systems subject to random failure. Manage Sci 1974;21:256–60.
[197] Nakagawa T, Yasui K. Approximate calculation of optimal inspection times. J Oper Res Soc 1980;31:851–3.
[198] Kaio N, Osaki S. Inspection policies: comparisons and modifications. RAIRO Oper Res 1988;22:387–400.
[199] Kaio N, Osaki S. Comparison of inspection policies. J Oper Res Soc 1989;40:499–503.
[200] Viscolani B. A note on checking schedules with finite horizon. RAIRO Oper Res 1991;25:203–8.
[201] Munford AG. Comparison among certain inspection policies. Manage Sci 1981;27:260–7.


[202] Luss H. An inspection policy model for production facilities. Manage Sci 1983;29:1102–9.
[203] Menke JT. Deterioration of electronics in storage. In: Proceedings of National SAMPE Symposium, 1983; p. 966–72.
[204] Bauer J, et al. Dormancy and power on–off cycling effects on electronic equipment and part reliability. RADC-TR-73-248 (AD-768619), 1973.
[205] Martinez EC. Storage reliability with periodic test. In: Proceedings of Annual Reliability and Maintainability Symposium, 1984; p. 181–5.
[206] Itô K, Nakagawa T. Optimal inspection policies for a system in storage. Comput Math Appl 1992;24:87–90.
[207] Itô K, Nakagawa T. Extended optimal inspection policies for a system in storage. Math Comput Model 1995;22:83–7.


Chapter 21

Optimal Imperfect Maintenance Models

Hongzhou Wang and Hoang Pham

21.1 Introduction
21.2 Treatment Methods for Imperfect Maintenance
21.2.1 Treatment Method 1
21.2.2 Treatment Method 2
21.2.3 Treatment Method 3
21.2.4 Treatment Method 4
21.2.5 Treatment Method 5
21.2.6 Treatment Method 6
21.2.7 Treatment Method 7
21.2.8 Other Methods
21.3 Some Results on Imperfect Maintenance
21.3.1 A Quasi-renewal Process and Imperfect Maintenance
21.3.1.1 Imperfect Maintenance Model A
21.3.1.2 Imperfect Maintenance Model B
21.3.1.3 Imperfect Maintenance Model C
21.3.1.4 Imperfect Maintenance Model D
21.3.1.5 Imperfect Maintenance Model E
21.3.2 Optimal Imperfect Maintenance of k-out-of-n Systems
21.4 Future Research on Imperfect Maintenance
21.A Appendix
21.A.1 Acronyms and Definitions
21.A.2 Exercises

21.1 Introduction

Maintenance involves preventive (planned) and corrective (unplanned) actions carried out to retain a system in, or restore it to, an acceptable operating condition. Optimal maintenance policies aim to provide optimum system reliability and safety performance at the lowest possible maintenance cost. Proper maintenance techniques have been emphasized in recent years owing to increased safety and reliability requirements of systems, increased complexity, and rising costs of material and labor [1]. For some systems, such as airplanes, submarines, and nuclear systems, it is extremely important to avoid failure during operation, because such failure is dangerous and can be disastrous.

One important research area in reliability engineering is the study of various maintenance policies in order to prevent the occurrence of system failure in the field and improve system availability.

In the past decades, maintenance, replacement, and inspection problems have been extensively studied in the literature, and hundreds of models have been developed. McCall [2], Barlow and Proschan [3], Pierskalla and Voelker [4], Sherif and Smith [1], Jardine and Buzacott [5], Valdez-Flores and Feldman [6], Cho and Parlar [7], Jensen [8], Dekker [9], Pham and Wang [10], and Dekker et al. [11] survey and summarize the research done in this field. However, problems in this field are still not solved satisfactorily, and some maintenance models are not realistic. Next, an



important problem in maintenance and reliability theory is discussed.

Maintenance can be classified into two major categories: corrective and preventive. Corrective maintenance (CM) occurs when the system fails. Some researchers refer to CM as repair, and we will use the two terms interchangeably throughout this chapter. According to MIL-STD-721B, CM means all actions performed, as a result of failure, to restore an item to a specified condition. Preventive maintenance (PM) occurs while the system is operating. According to MIL-STD-721B, PM means all actions performed in an attempt to retain an item in a specified condition by providing systematic inspection, detection, and prevention of incipient failures. We believe that maintenance can also be categorized according to the degree to which the operating condition of an item is restored by maintenance, in the following way.

(a) Perfect repair or perfect maintenance: a maintenance action that restores the system operating condition to "as good as new", i.e. upon perfect maintenance, a system has the same lifetime distribution and failure rate function as a brand new one. Complete overhaul of an engine with a broken connecting rod is an example of perfect repair. Generally, replacement of a failed system by a new one is a perfect repair.

(b) Minimal repair or minimal maintenance: a maintenance action that restores the system to the failure rate it had when it just failed. Minimal repair was first studied by Barlow and Proschan [3]. After such a repair, the system operating state is often called "as bad as old" in the literature. Changing a flat tire on a car is an example of minimal repair, because the overall failure rate of the car is essentially unchanged.

(c) Imperfect repair or imperfect maintenance: a maintenance action that may not make a system "as good as new", but younger. Usually, it is assumed that imperfect maintenance restores the system operating state to somewhere between "as good as new" and "as bad as old". Clearly, imperfect maintenance is general: it includes the two extreme cases of minimal and perfect repair. Engine tune-up is an example of imperfect maintenance.

(d) Worse repair or maintenance: a maintenance action that unintentionally increases the system failure rate or actual age, though the system does not break down. Thus, upon worse repair, a system's operating condition becomes worse than it was just prior to the failure.

(e) Worst repair or maintenance: a maintenance action that unintentionally makes the system fail or break down.

Because worse and worst maintenance are encountered less often in practice than imperfect maintenance, the focus of this chapter is on imperfect maintenance.

According to the above classification scheme, we can say that a PM is a minimal, perfect, imperfect, worst, or worse PM. Similarly, a CM may be a minimal, perfect, imperfect, worst, or worse CM. We will refer to imperfect CM and PM as imperfect maintenance later.

The type and degree of maintenance used in practice depend on the type of system, its cost, and the system reliability and safety requirements. In the maintenance literature, most studies assume that the system after CM or PM is "as good as new" (perfect repair) or "as bad as old" (minimal repair). In practice, the perfect maintenance assumption may be reasonable for a structurally simple system with one component. On the other hand, the minimal repair assumption seems plausible for the failure behavior of a system when one of its many, non-dominating components is replaced by a new one [12]. However, many maintenance activities result not in one of these two extreme situations but in a complicated intermediate one. For example, an engine may not be "as good as new" or "as bad as old" after a tune-up, a type of PM; it usually becomes "younger" than it was just prior to the PM. Therefore, perfect maintenance and minimal maintenance do not hold in many actual instances, and realistic imperfect maintenance should be modeled. Recently, imperfect maintenance (both corrective and preventive) has received increasing attention in the reliability and maintenance community. In fact, the advent of the concept of imperfect maintenance has great significance for reliability and maintenance engineering; we think that the study of imperfect maintenance marks a significant breakthrough in maintenance theory.

Brown and Proschan [13] propose some possible causes of imperfect, worse, or worst maintenance attributable to the maintenance performer:

• repairing the wrong part;
• only partially repairing the faulty part;
• repairing (partially or completely) the faulty part but damaging adjacent parts;
• incorrectly assessing the condition of the unit inspected;
• performing the maintenance action not when called for but at the maintainer's convenience (the timing for maintenance is off the schedule).

Nakagawa and Yasui [14] mention three possible causes of worse or worst maintenance:

• hidden faults and failures that are not detected during maintenance;
• human errors, such as wrong adjustments and further damage done during maintenance;
• replacement with faulty parts.

Helvik [15] suggests that the imperfectness of maintenance is related to the skill of the maintenance personnel, the quality of the maintenance procedure, and the maintainability of the system.

21.2 Treatment Methods for Imperfect Maintenance

Imperfect maintenance research started as early as the late 1970s. Kay [16] and Chan and Downs [17] studied the worst PM; Ingle and Siewiorek [18] examined imperfect maintenance; and Chaudhuri and Sahu [19] explored the concept of imperfect PM. Most early research on imperfect, worse, and worst maintenance concerns single-unit systems. Various methods for modeling imperfect, worse, and worst maintenance have been proposed, and it is useful to summarize and classify them. These modeling methods can be utilized in various maintenance and inspection policies. Although they are drawn mainly from work on single-unit systems, they are also useful for modeling multicomponent systems, because the treatments of imperfect, worse, and worst maintenance for single-unit systems carry over to the individual subsystems that constitute a multicomponent system. Methods for treating imperfect, worse, and worst maintenance can be classified into eight categories [10], described next.

21.2.1 Treatment Method 1

Nakagawa [20] treats imperfect PM in this way: upon imperfect PM, the component is returned to the "as good as new" state (perfect PM) with probability p and to the "as bad as old" state (minimal PM) with probability q = 1 − p. Obviously, if p = 1 the PM coincides with perfect PM, and if p = 0 it corresponds to minimal PM. In this sense, minimal and perfect maintenance are special cases of imperfect maintenance, and imperfect maintenance is the general case. Using this method for modeling imperfect maintenance, Nakagawa succeeded in determining the optimum PM policies minimizing the s-expected maintenance cost rate for a one-unit system under both age-dependent [20] and periodic [21] PM policies, given that PM is imperfect.

Similar to Nakagawa's work [20], Helvik [15] states that, after PM, a fault-tolerant system is usually renewed with probability θ2, but its operating condition sometimes remains unchanged (as bad as old) with probability θ1, where θ1 + θ2 = 1.

Brown and Proschan [13] contemplate the following model of the repair process. A unit is repaired each time it fails. The executed repair is either a perfect one with probability p or a minimal one with probability 1 − p. Assuming that all repair actions take negligible time, Brown and Proschan [13] established aging preservation properties of this imperfect repair process and the monotonicity of various parameters and random variables associated with the failure process. They obtained an important, useful result: if the life distribution of a unit is F and its failure rate is r, then the distribution function of the time between successive perfect repairs is Fp = 1 − (1 − F)^p, and the corresponding failure rate is rp = pr. Using this result, Fontenot and Proschan [22] and Wang and Pham [23] both obtained optimal imperfect maintenance policies for a one-component system. Later on, we will refer to this method of modeling imperfect maintenance as the (p, q) rule, i.e. after maintenance (corrective or preventive) a system becomes "as good as new" with probability p and "as bad as old" with probability 1 − p. In fact, quite a few imperfect maintenance models have used this rule in recent years.

Bhattacharjee [24] obtained the same results as Brown and Proschan [13], and also some new results for the Brown–Proschan model of imperfect repair, via a shock model representation of the sojourn time.
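To illustrate the (p, q) rule, the following Monte Carlo sketch (Python; the Weibull life distribution and all parameter values are illustrative assumptions, not from the original) simulates a minimal-repair failure process in which each repair is perfect with probability p, and compares the empirical mean time between perfect repairs with the mean implied by the Brown–Proschan result (1 − Fp) = (1 − F)^p:

import random, math

def time_between_perfect_repairs(k, theta, p, rng):
    # failures form an NHPP with Weibull cumulative hazard H(t) = (t/theta)**k
    # (minimal repairs leave the failure rate unchanged); each repair is
    # perfect with probability p, which resets the process to "as good as new"
    H = 0.0
    while True:
        H += rng.expovariate(1.0)       # inversion: H(t_i) = E_1 + ... + E_i
        t = theta * H ** (1.0 / k)
        if rng.random() < p:
            return t

rng = random.Random(1)
k, theta, p = 2.0, 100.0, 0.3
sims = [time_between_perfect_repairs(k, theta, p, rng) for _ in range(20000)]
# theory: (1 - Fp) = (1 - F)**p is Weibull with shape k, scale theta * p**(-1/k)
print(sum(sims) / len(sims), theta * p ** (-1.0 / k) * math.gamma(1.0 + 1.0 / k))

The two printed means should agree closely, since the accumulated hazard at the first perfect repair is exponential with rate p.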

21.2.2 Treatment Method 2

Block et al. [25] extend the above Brown–Proschan imperfect repair model with the (p, q) rule to age-dependent imperfect repair for a one-unit system: an item is repaired at failure (CM). With probability p(t) the repair is a perfect repair; with probability q(t) = 1 − p(t) it is a minimal one, where t is the age of the item in use at the failure time (the time since the last perfect repair). Block et al. [25] showed that if the item's life distribution F is a continuous function and its failure rate is r, then the successive perfect repair times form a renewal process with interarrival time distribution

Fp(t) = 1 − exp{ −∫_0^t p(x)[1 − F(x)]^(−1) F(dx) }

and the corresponding failure rate rp(t) = p(t)r(t). In fact, similar results were found by Beichelt and Fischer [26]. Block et al. [25] proved that the aging preservation results of Brown and Proschan [13] hold under suitable hypotheses on p(t). This imperfect maintenance modeling method is called the (p(t), q(t)) rule herein.

Using the (p(t), q(t)) rule, Block et al. [27] explored a general age-dependent PM policy, where an operating unit is replaced whenever it reaches age T; if it fails at age t < T, it is either replaced by a new unit, with probability p(t), or it undergoes minimal repair, with probability q(t) = 1 − p(t). The cost of the ith minimal repair is a function, ci(t), of age and number of repairs. After a perfect maintenance, planned (preventive) or unplanned (corrective), the above process is repeated.

Both the Brown–Proschan model and the Block et al. [25, 27] models assume that the repair time is negligible. It is worth mentioning that Iyer [28] obtained availability results for imperfect repair using the (p(t), q(t)) rule when the repair time is not negligible, which is obviously a more realistic treatment.

In a related study, Sumita and Shanthikumar [29] (see Pham and Wang [10]) proposed and studied an age-dependent counting process generated from a renewal process, and applied that counting process to the age-dependent imperfect repair of a one-unit system.

Whitaker and Samaniego [30] established an estimator for the life distribution when the above maintenance process proposed by Block et al. [25] is observed until the time of the mth perfect repair. This estimator was motivated by a non-parametric maximum likelihood approach and was shown to be a "neighborhood maximum likelihood estimator"; they derived large-sample results for it. Hollander et al. [31] took the more modern approach of using counting-process and martingale theory to analyze these models. Their methods yielded extensions of Whitaker and Samaniego's results to the whole line and provided a useful framework for further work on the minimal repair model.

The (p, q) rule and the (p(t), q(t)) rule for modeling imperfect maintenance seem practical and realistic: they place imperfect maintenance somewhere between perfect and minimal maintenance, and the degree to which the operating condition of an item is restored by maintenance can be measured by the probability p or p(t). In the (p(t), q(t)) rule in particular, the degree of restoration is related to the item's age t; thus, the (p(t), q(t)) rule seems more realistic, although mathematical modeling of imperfect maintenance with it may be more complicated. Both rules can be expected to be powerful in future imperfect maintenance modeling; in fact, both have received much attention and have been used in some recent imperfect repair models.

Makis and Jardine [32] proposed a general treatment method for imperfect maintenance, modeling imperfect repair at failure in such a way that the repair returns a system to the "as good as new" state with probability p(n, t), to the "as bad as old" state with probability q(n, t), or, with probability s(n, t) = 1 − p(n, t) − q(n, t), is unsuccessful, in which case the system is scrapped and replaced by a new one; here, t is the age of the system and n is the number of failures since the last replacement. This treatment method will be referred to as the (p(n, t), q(n, t), s(n, t)) rule later on.

21.2.3 Treatment Method 3

Malik [33] introduced the concept of an improvement factor in the maintenance scheduling problem. He considers that maintenance shifts the system back along the failure rate curve to some younger time, but not all the way to zero (i.e. not to new), as shown in Figure 21.1. This treatment of imperfect maintenance also places the failure rate after PM between "as good as new" and "as bad as old". The degree of improvement in the failure rate is called the improvement factor (note that PM is only justified for an increasing failure rate). Since systems need more frequent maintenance with increased age, Malik [33] contemplated successive PM intervals that decrease in order to keep the system failure rate at or below a stated level (a sequential PM policy), and proposed an algorithm to determine these successive PM intervals. Lie and Chun [34] presented a general expression to determine these PM intervals. Malik [33] relied on expert judgment to estimate the improvement factor, whereas Lie and Chun [34] gave a set of curves for the improvement factor as a function of maintenance cost and the age of the system.

Figure 21.1. Minimal, perfect, and imperfect repair versus failure rate change

Using the improvement factor and assuming a finite planning horizon, Jayabalan and Chaudhuri [35] introduced a branching algorithm to minimize the average total cost for a maintenance scheduling model with assured reliability, and they [36] discussed the optimal maintenance policy for a system with increased mean down time and assured failure rate. It is worth noting that, using fuzzy set theory and an improvement factor, Suresh and Chaudhuri [37] established a PM scheduling procedure to assure an acceptable reliability level or tolerable failure rate over a finite planning horizon. They regard the starting condition, ending condition, operating condition, and type of maintenance of a system as fuzzy sets; the improvement factor is used to determine the starting condition of the system after maintenance.

Chan and Shaw [38] considered that the failure rate of an item is reduced after each PM, and that this reduction depends on the item's age and the number of PMs. They proposed two types of failure-rate reduction: (1) failure rate with fixed reduction, where after each PM the failure rate is reduced such that all jump-downs of the failure rate are the same; and (2) failure rate with proportional reduction, where after each PM the failure rate is reduced such that each jump-down is proportional to the current failure rate. Chan and Shaw [38] derived the cycle-availability for single-unit systems and discussed a design scheme to maximize the probability of achieving a specified stochastic-cycle availability with respect to the duration of the operating interval between PMs.

This treatment of imperfect maintenance, formulated in terms of the failure rate, is useful and practical in engineering, and can serve as a general treatment method for imperfect or even worse maintenance. It will be called the improvement factor method herein.

In addition, Canfield [39] observed that PM at time t restores the failure rate function to its shape at t − τ, while the level remains unchanged, where τ is less than or equal to the PM intervention interval.
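A small sketch of the two Chan–Shaw reduction schemes (Python; the linear failure rate and all jump-down parameters are hypothetical, for illustration only) may help visualize the difference between fixed and proportional jump-downs:

a, b = 0.001, 0.0005         # hypothetical linear failure rate r(t) = a + b*t
T_pm = 10.0                  # PM interval
delta, rho = 0.002, 0.4      # fixed and proportional jump-down parameters

def rate_path(kind, n_pm=5):
    reduction = 0.0          # total failure-rate reduction accumulated by PMs
    path = []
    for j in range(n_pm):
        start = a + b * (j * T_pm) - reduction   # rate just after the j-th PM
        end = start + b * T_pm                   # rate just before the next PM
        path.append((round(start, 5), round(end, 5)))
        reduction += delta if kind == "fixed" else rho * end
    return path

print("fixed reduction       :", rate_path("fixed"))
print("proportional reduction:", rate_path("proportional"))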

21.2.4 Treatment Method 4

Kijima et al. [40] proposed an imperfect repair model using the idea of the virtual age process of a repairable system. In this imperfect repair model, if a system has virtual age Vn−1 = y immediately after the (n − 1)th repair, the nth failure time Xn is assumed to have the distribution function

Pr{Xn ≤ x | Vn−1 = y} = [F(x + y) − F(y)] / [1 − F(y)]

where F(x) is the distribution function of the time to failure of a new system. Let a be the degree of the nth repair, where 0 ≤ a ≤ 1. Kijima et al. [40] constructed the following repair model: the nth repair cannot remove the damage incurred before the (n − 1)th repair; it reduces the additional age Xn to aXn. Accordingly, the virtual age after the nth repair becomes

Vn = Vn−1 + aXn

Obviously, a = 0 corresponds to a perfect repair, whereas a = 1 is a minimal repair. In a later study, Kijima [12] extended the above model to the case where a is a random variable taking a value between zero and one, and proposed another imperfect repair model

Vn = An(Vn−1 + Xn)

where An is a random variable taking a value between zero and one for n = 1, 2, 3, . . . . For the extreme values zero and one, An = 1 means a minimal repair and An = 0 a perfect repair. Comparing this treatment method with Brown and Proschan's [13], we can see that if the An are independently and identically distributed (i.i.d.) and take only the two extreme values zero and one, the two methods coincide. Therefore, the second treatment method by Kijima [12] is general. He also derived various monotonicity properties associated with the above two imperfect repair models.

This kind of treatment method is referred to as the virtual age method herein.
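The following sketch (Python; the Weibull life distribution and the repair degrees are illustrative assumptions) simulates both virtual age recursions by sampling each time to failure from the conditional distribution given the current virtual age:

import random

k, theta = 2.0, 100.0        # Weibull life of the new system (illustrative)

def next_failure(v, rng):
    # sample X with Pr{X > x | V = v} = Fbar(x + v) / Fbar(v) by inverting
    # the conditional Weibull cumulative hazard
    e = rng.expovariate(1.0)
    return theta * ((v / theta) ** k + e) ** (1.0 / k) - v

def virtual_ages(n_repairs, a, model, rng):
    # model 1: V_n = V_{n-1} + a * X_n      (Kijima type I)
    # model 2: V_n = a * (V_{n-1} + X_n)    (Kijima type II, constant A_n = a)
    v, ages = 0.0, []
    for _ in range(n_repairs):
        x = next_failure(v, rng)
        v = v + a * x if model == 1 else a * (v + x)
        ages.append(round(v, 2))
    return ages

rng = random.Random(7)
print("type I,  a = 0.5:", virtual_ages(6, 0.5, 1, rng))
print("type II, a = 0.5:", virtual_ages(6, 0.5, 2, rng))
# a = 0 keeps the virtual age at zero when applied from new (perfect repair);
# a = 1 reproduces minimal repair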

It is worth mentioning that Uematsu andNishida [41] considered a more general modelthat includes the above two imperfect repair mod-els by Kijima and coworkers [12, 40] as spe-cial cases, and obtained some elementary prop-erties of the associated failure process. Let Tndenote the time interval between the (n− 1)thfailure and the nth failure, and Xn denote thedegree of repair. After performing the nth repair,the age of the system is q(t1, . . . , tn; x1, . . . , xn)

given that Ti = ti and Xi = xi (i = 1, 2, . . . , n)where Ti and Xi are random variables. On theother hand, q(t1, . . . , tn; x1, . . . , xn−1) repre-sents the age of the system as just beforethe nth failure. The starting epoch of an in-terval is subject to the influence of all previ-ous failure history, i.e. the nth interval is sta-tistically dependent on T1 = t1, . . . , Tn−1 = tn−1,X1 = x1, . . . , Tn−1 = tn−1. For example, if

q(t1, . . . , tn; x1, . . . , xn)=n∑

j=1

n∑i=j

xi tj ,

then Xi = 0 (Xi = 1) represents that a perfectrepair (minimal repair) is performed at the ithfailure.


21.2.5 Treatment Method 5

It is well known that the time to failure of a unit can be represented as a first passage time to a threshold for an appropriate stochastic process that describes the level of damage. Consider a unit that is subject to shocks occurring randomly in time. At time t = 0, the damage level of the unit is assumed to be zero. Upon occurrence of a shock, the unit suffers a non-negative random damage. Each damage, at the time of its occurrence, adds to the current damage level of the unit, and between shocks the damage level stays constant. The unit fails when its accumulated damage first exceeds a prespecified level. To keep the unit in an acceptable operating condition, PM is necessary [42].

Kijima and Nakagawa [42] proposed a cumulative damage shock model with imperfect periodic PM. The PM is imperfect in the sense that each PM reduces the damage level by 100(1 − b)%, 0 ≤ b ≤ 1, of the total damage. Note that if b = 1 the PM is minimal, and if b = 0 the PM coincides with a perfect PM. Obviously, this modeling approach is similar to Treatment Method 1, the (p, q) rule. Kijima and Nakagawa [42] derived a sufficient condition for the time to failure to have an increasing failure rate (IFR) distribution and discussed the problem of finding the number of PMs that minimizes the expected maintenance cost rate.

Later, Kijima and Nakagawa [43] established a cumulative damage shock model with a sequential PM policy, given that PM is imperfect. They modeled imperfect PM in such a way that the amount of damage after the kth PM becomes bkYk when it was Yk before the PM, i.e. the kth PM reduces the amount of damage Yk to bkYk, where bk is called the improvement factor. They assumed that a system is subject to shocks occurring according to a Poisson process and, upon occurrence of a shock, suffers a non-negative random damage, which is additive. Each shock causes a system failure with probability p(z) when the total damage is z at the shock. In this model, PM is done at fixed intervals xk for k = 1, 2, 3, . . . , N, because more frequent maintenance is needed with age, and the Nth PM is perfect. If the system fails between PMs, it undergoes only minimal repair. They derived the expected maintenance cost rate until replacement, assuming that p(z) is an exponential function and the damages are i.i.d., and discussed the optimal replacement policies.

This study approach for imperfect maintenance will be called the shock method later in this chapter.
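As a sketch of this kind of model (Python; the Poisson shock arrivals, exponential damage increments, PM period, and failure threshold are all illustrative assumptions), the following simulation estimates the mean time to failure under imperfect periodic PM that reduces the accumulated damage to b times its current value:

import random

def time_to_failure(rate, mean_damage, T, b, Z, rng):
    # shocks ~ Poisson(rate); each adds Exp(mean_damage) damage; PM every T
    # reduces the accumulated damage to b * damage; failure when damage > Z
    t, damage, next_pm = 0.0, 0.0, T
    while True:
        t += rng.expovariate(rate)
        while t > next_pm:              # apply all PMs that precede this shock
            damage *= b
            next_pm += T
        damage += rng.expovariate(1.0 / mean_damage)
        if damage > Z:
            return t

rng = random.Random(11)
lives = [time_to_failure(1.0, 1.0, 5.0, 0.5, 15.0, rng) for _ in range(2000)]
print("estimated mean time to failure:", round(sum(lives) / len(lives), 2))

Setting b = 0 in this sketch corresponds to perfect PM and b = 1 to minimal PM, mirroring the roles of b in the model above.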

21.2.6 Treatment Method 6

Wang and Pham [23, 44] treated imperfect repair in such a way that after repair the lifetime of a system is reduced to a fraction α of its immediately previous one, where 0 < α < 1, i.e. the lifetime decreases with the number of repairs. The interarrival times between successive repairs then constitute a quasi-renewal process, defined by Wang and Pham [23].

Wang and Pham [45] further considered non-negligible repair times, unlike most imperfect maintenance models: upon each repair, the next repair time becomes a multiple β of the current one, where β > 1, so the time to repair increases with the number of repairs. We refer to this treatment method as the (α, β) rule.

21.2.7 Treatment Method 7

Shaked and Shanthikumar [46] introduced the multivariate imperfect repair concept. They considered a system whose components have dependent lifetimes, where each component is subject to imperfect repair until it is replaced. For each component the repair is imperfect according to the (p, q) rule, i.e. at failure the repair is perfect with probability p and minimal with probability q. Assume that the n components of the system start to function at the same time zero, and that no more than one component can fail at a time. They established the joint distribution of the times to the next failure of the functioning devices after a minimal repair or perfect repair, and derived the joint density of the resulting lifetimes of the components and other probabilistic quantities of interest, from which the distribution of the lifetime of the system can be obtained. Sheu and Griffith [47] extended this work further. This treatment method will be termed the multiple (p, q) rule herein.

21.2.8 Other Methods

Nakagawa [20] modeled imperfect PM in such a way that, in the steady state, PM reduces the failure rate of a unit to a fraction of its value just before the PM, and during operation the failure rate climbs back up. He proposed that the portion by which the failure rate is reduced is a function of some resource consumed in the PM and a parameter; that is, after PM the failure rate of the unit becomes λ(t) = g(c1, θ)λ(x + T), where the fractional reduction of the failure rate, g(c1, θ), lies between zero and one, T is the time interval between PMs, c1 is the amount of resource consumed in the PM, and θ is a parameter.

Nakagawa [48, 49] used two other methods to deal with imperfect PM for the sequential PM policy. (a) The failure rate after the kth PM becomes akh(t) given that it was h(t) in the previous period, where ak ≥ 1; i.e. the failure rate increases with the number of PMs. (b) The age after the kth PM reduces to bkt when it was t before the PM, where 0 ≤ bk < 1. This modeling method is similar to the improvement factor method and will be called the reduction method herein; that is, PM reduces the age. In addition, in investigating periodic PM models, Nakagawa [21] treated imperfect PM in such a way that the unit becomes x units of time younger at each PM, and further assumed that x is proportional to the PM cost, where x is less than or equal to the PM intervention interval.

Yak et al. [50] suggested that maintenance may result in system failure (worst maintenance) when modeling the mean time to failure and the availability of a system.

21.3 Some Results on Imperfect Maintenance

Next, we summarize some important results on imperfect maintenance from recent work [10, 23, 51].

21.3.1 A Quasi-renewal Process and Imperfect Maintenance

Renewal theory had its origin in the study of strategies for the replacement of technical components. In a renewal process, the times between successive events are assumed to be i.i.d. Most maintenance models using renewal theory are based on exactly this assumption, i.e. "as good as new". However, in maintenance practice a system may not be "as good as new" after repair. Barlow and Hunter [52] (see Barlow and Proschan [3], p. 89) introduced the notion of "minimal repair" and used non-homogeneous Poisson processes to model it. In fact, "as good as new" and "as bad as old" represent two extreme types of repair result; most repair actions fall somewhere between these extremes, i.e. they are imperfect repairs. Wang and Pham [23] explored a quasi-renewal process and its applications in modeling imperfect maintenance, which are summarized below.

Let {N(t), t > 0} be a counting process and let Xn denote the time between the (n − 1)th and the nth events of this process, n ≥ 1.

Definition 1. If the sequence of non-negative random variables {X1, X2, X3, . . . } is independent and Xi = αXi−1 in distribution for i ≥ 2, where α > 0 is a constant, then the counting process {N(t), t ≥ 0} is said to be a quasi-renewal process with parameter α and first interarrival time X1.

When α = 1, this process becomes the ordinary renewal process. Given that the p.d.f., c.d.f., survival function, and failure rate of the random variable X1 are f1(x), F1(x), s1(x), and r1(x), the p.d.f., c.d.f., survival function, and failure rate of Xn for n = 2, 3, 4, . . . are, respectively,

fn(x) = α^(1−n) f1(α^(1−n)x)
Fn(x) = F1(α^(1−n)x)
sn(x) = s1(α^(1−n)x)
rn(x) = α^(1−n) r1(α^(1−n)x)

The following results are proved [23]:


Proposition 1. If f1(x) belongs to IFR, DFR, IFRA, DFRA, NBU, or NWU (see Appendix 21.A.1 for definitions), then fn(x) is in the same category for n = 2, 3, . . . .

Proposition 2. The shape parameters of Xn are the same for n = 1, 2, 3, 4, . . . for a quasi-renewal process if X1 follows the Gamma, Weibull, or lognormal distribution.

Observe a quasi-renewal process with parameter α and first interarrival time X1. For the total number N(t) of "renewals" that have occurred up to time t,

Pr{N(t) = n} = Gn(t) − Gn+1(t)

where Gn(t) is the convolution of the interarrival time distributions F1, F2, . . . , Fn.

If the mean value of N(t) is defined as the renewal function M(t), then

M(t) = E[N(t)] = Σ_{n=1}^{∞} Gn(t)

The derivative of M(t) is known as the renewal density:

m(t) = M′(t)

It can be proved that the first interarrival distribution of a quasi-renewal process uniquely determines its renewal function.
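Since closed forms for M(t) are rarely available, it is natural to estimate it by simulation. The following Monte Carlo sketch (Python; the Weibull first interarrival distribution is an illustrative assumption) generates quasi-renewal sample paths, drawing each Xn independently so that Fn(x) = F1(α^(1−n)x), and estimates M(t):

import random

def count_renewals(t, alpha, rng, shape=2.0, scale=10.0):
    # number of quasi-renewals in (0, t]; X_{n+1} is an independent Weibull
    # draw scaled by alpha**n, so that F_{n+1}(x) = F_1(alpha**(-n) * x)
    n, s = 0, 0.0
    while True:
        x = alpha ** n * rng.weibullvariate(scale, shape)
        # guard: for alpha < 1 the epochs have an a.s. finite accumulation
        # point, beyond which N(t) is infinite
        if x < 1e-12 or s + x > t:
            return n
        s, n = s + x, n + 1

def M_hat(t, alpha, rng, sims=20000):
    return sum(count_renewals(t, alpha, rng) for _ in range(sims)) / sims

rng = random.Random(3)
for t in (5.0, 10.0, 20.0, 40.0):
    print(t, M_hat(t, alpha=0.95, rng=rng))
# with alpha < 1 the interarrival times shrink, so N(t) grows faster than
# in the ordinary renewal case alpha = 1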

Three applications of the quasi-renewal process to imperfect maintenance are described next.

21.3.1.1 Imperfect Maintenance Model A

Suppose that a unit is preventively maintained at times kT, k = 1, 2, . . . , at a cost cp, independently of the unit's failure history, where the constant T > 0 and the PM is perfect. The unit undergoes imperfect repair at failures between PMs at cost cf, in the sense that after repair the lifetime is reduced to a fraction α of the immediately previous one. Then, from quasi-renewal theory, the following propositions follow.

Proposition 3. The long-run expected maintenance cost per unit time, or maintenance cost rate, is

L(T) = [cp + cf M(T)] / T

where M(T) is the renewal function of a quasi-renewal process with parameter α.

Proposition 4. There exists an optimum T* that minimizes L(T), where 0 < T* ≤ ∞, and the resulting minimum value of L(T) is cf m(T*).
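A grid-search sketch of Propositions 3 and 4 (Python; the Weibull first lifetime and the cost values are illustrative assumptions) estimates M(T) by simulation and locates an approximate T*:

import random

def failures_by(T, alpha, rng, shape=2.0, scale=10.0):
    # number of imperfect repairs (quasi-renewals) in one PM interval (0, T]
    n, s = 0, 0.0
    while True:
        x = alpha ** n * rng.weibullvariate(scale, shape)
        if x < 1e-12 or s + x > T:
            return n
        s, n = s + x, n + 1

def L(T, cp, cf, alpha, rng, sims=4000):
    M = sum(failures_by(T, alpha, rng) for _ in range(sims)) / sims
    return (cp + cf * M) / T        # cost rate of Proposition 3

rng = random.Random(5)
cp, cf, alpha = 5.0, 1.0, 0.9
grid = [2.0 * i for i in range(1, 16)]           # candidate PM intervals
costs = {T: L(T, cp, cf, alpha, rng) for T in grid}
T_star = min(costs, key=costs.get)
print("approximate T* =", T_star, "with L(T*) ~", round(costs[T_star], 4))

A short PM interval is dominated by the cp/T term, a long one by accumulating repair costs; the search balances the two, as Proposition 4 guarantees.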

21.3.1.2 Imperfect Maintenance Model B

This model is exactly like Model A, except that the PM at time kT is imperfect in the sense that, after PM, the unit is "as good as new" with probability p and "as bad as old" with probability q = 1 − p. From quasi-renewal theory, the following propositions are proved.

Proposition 5. The long-run expected maintenance cost per unit time is

L(T) = { cp + cf p^2 [ M(T) + Σ_{i=2}^{∞} q^(i−1) M(iT) ] } / T

where M(iT) is the renewal function of a quasi-renewal process with parameter α.

Proposition 6. There exists an optimum T* that minimizes L(T), where 0 < T* ≤ ∞, and the resulting minimum value of L(T) is cf p^2 Σ_{i=1}^{∞} q^(i−1) i m(iT*).

21.3.1.3 Imperfect Maintenance Model C

Assume that a unit is repaired at failure i at a cost cf + (i − 1)cv if and only if i ≤ k − 1, where i = 1, 2, . . . , and the repair is imperfect in the sense that, after repair, the lifetime is reduced to a fraction α of the immediately previous one [23]. Notice that the repair cost increases by cv with each successive imperfect repair. Suppose that the first imperfect repair time is a random variable Y1 with mean η1, the second imperfect repair time is βY1 with mean βη1, and the (k − 1)th repair time is β^(k−2)Y1 with mean β^(k−2)η1, where the constant β ≥ 1 means that the repair times increase as the number of imperfect repairs increases. After the (k − 1)th imperfect repair at failure, the unit is preventively maintained at times nT (n = 1, 2, . . . ) at a cost cp, where the constant T > 0 and the PM is imperfect in the sense that, after PM, the unit is "as good as new" with probability p and restored to its condition just prior to failure (minimal repair) with probability q = 1 − p. Further suppose that the PM time is a random variable W with mean w. If there is a failure between PMs, an imperfect repair is performed at a cost cfr with negligible repair time, in the sense that, after repair, the lifetime of the unit is reduced to a fraction λ of its immediately previous one, where 0 < λ < 1.

One possible interpretation of this model is as follows: when a new unit is put into use, the first k − 1 repairs at failure, because the unit is young at that time, are performed at a low cost cf + (i − 1)cv for i = 1, 2, . . . , k − 1, resulting in imperfect repairs. After the (k − 1)th imperfect repair at failure, the unit is in bad condition, and a better or perfect maintenance becomes necessary at a higher cost cp or cfr.

Proposition 7. The long-run expected maintenance cost rate is

L(T, k) = { (k − 1)cf + [(k − 1)(k − 2)/2]cv + cp/p + p·cfr Σ_{i=1}^{∞} q^(i−1) M(iT) }
          / { μ(1 − α^(k−1))/(1 − α) + η(1 − β^(k−1))/(1 − β) + T/p + w }

and the asymptotic average availability is

A(T, k) = { μ(1 − α^(k−1))/(1 − α) + T/p }
          / { μ(1 − α^(k−1))/(1 − α) + η(1 − β^(k−1))/(1 − β) + T/p + w }

where M(t) is the renewal function of a quasi-renewal process with parameter λ and first interarrival time distribution F1(α^(1−k)t).

Sometimes the optimum maintenance policy that minimizes the cost rate is desired subject to availability requirements. Thus, the following optimization problem can be formulated in terms of the decision variables T and k:

Minimize

L(T, k) = { (k − 1)cf + [(k − 1)(k − 2)/2]cv + cp/p + p·cfr Σ_{i=1}^{∞} q^(i−1) M(iT) }
          / { μ(1 − α^(k−1))/(1 − α) + η(1 − β^(k−1))/(1 − β) + T/p + w }

subject to

A(T, k) = { μ(1 − α^(k−1))/(1 − α) + T/p }
          / { μ(1 − α^(k−1))/(1 − α) + η(1 − β^(k−1))/(1 − β) + T/p + w } ≥ A0

k = 2, 3, . . .
T > 0

where the constant A0 is the prespecified availability requirement.
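The following structural sketch (Python) encodes the L(T, k) and A(T, k) expressions as reconstructed above on a small hypothetical grid. A(T, k) is closed form, but the renewal function M(iT) is estimated by Monte Carlo and the geometric sum is truncated, so the printed values are rough and are not guaranteed to reproduce the published optimum:

import random

mu, sigma, eta, w = 10.0, 1.0, 0.9, 0.2
cf, cv, cp, cfr = 1.0, 0.06, 3.0, 4.0
alpha, beta, p, lam, A0 = 0.95, 1.05, 0.95, 0.95, 0.94
q = 1.0 - p

def M_hat(t, k, rng, sims=1000):
    # renewal function of the quasi-renewal process with parameter lam and
    # first lifetime alpha**(k-1) * X1, X1 ~ Normal(mu, sigma**2)
    total = 0
    for _ in range(sims):
        n, s = 0, 0.0
        while True:
            x = alpha ** (k - 1) * lam ** n * rng.gauss(mu, sigma)
            if x <= 0.0 or s + x > t:   # crude guard against negative draws
                break
            s, n = s + x, n + 1
        total += n
    return total / sims

def A(T, k):  # closed-form asymptotic average availability
    up = mu * (1 - alpha ** (k - 1)) / (1 - alpha) + T / p
    down = eta * (1 - beta ** (k - 1)) / (1 - beta) + w
    return up / (up + down)

def L(T, k, rng, imax=8):  # q**(i-1) decays fast, so the sum is truncated
    num = (k - 1) * cf + (k - 1) * (k - 2) / 2 * cv + cp / p
    num += p * cfr * sum(q ** (i - 1) * M_hat(i * T, k, rng) for i in range(1, imax))
    den = (mu * (1 - alpha ** (k - 1)) / (1 - alpha)
           + eta * (1 - beta ** (k - 1)) / (1 - beta) + T / p + w)
    return num / den

rng = random.Random(9)
for k in (2, 3, 4):
    for T in (6.0, 7.0, 8.0, 9.0):
        print(k, T, round(L(T, k, rng), 4), round(A(T, k), 4), A(T, k) >= A0)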

Assume that the lifetime X1 of a new system follows a normal distribution with mean μ and variance σ^2, with

μ = 10, σ = 1, η = 0.9, cf = $1, cv = $0.06, cp = $3, cfr = $4, α = 0.95, β = 1.05, p = 0.95, w = 0.2, λ = 0.95, A0 = 0.94

From the above optimization model we can find, using commercial optimization software, the globally optimal solution (T*, k*) that minimizes the maintenance cost rate given that the availability is at least 0.94:

T* = 7.6530, k* = 3


and the corresponding minimum cost rate and availability are, respectively,

L(T*, k*) = $0.2332 and A(T*, k*) = 0.9426

The results indicate that the optimal maintenance policy is to perform repair at the first two failures of the system, at costs of $1 and $1.06 respectively, then do PM every 7.6530 time units at a cost of $3, and repair the system upon failure between PMs at a cost of $4.
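The computation in this example can be sketched in code. The following Python fragment is our illustration (not the authors' program): it estimates the quasi-renewal function M(iT) by Monte Carlo simulation and grid-searches the decision variables subject to the availability constraint. Because M(·) is simulated and the infinite series is truncated, the output is approximate and should be compared only loosely with the figures quoted above.

import numpy as np

# Parameter values from the example above.
mu, sigma, eta, w = 10.0, 1.0, 0.9, 0.2
cf, cv, cp, cfr = 1.0, 0.06, 3.0, 4.0
alpha, beta, lam, p, A0 = 0.95, 1.05, 0.95, 0.95, 0.94
q = 1.0 - p
rng = np.random.default_rng(0)

def M(t, k, n_paths=2000):
    """Monte Carlo estimate of the quasi-renewal function M(t): the i-th
    interarrival time is lam**(i-1) * alpha**(k-1) * X1, X1 ~ N(mu, sigma^2)."""
    total = 0
    for _ in range(n_paths):
        s, i = 0.0, 0
        while s <= t:
            s += lam**i * alpha**(k - 1) * max(rng.normal(mu, sigma), 0.0)
            i += 1
        total += i - 1                  # renewals completed by time t
    return total / n_paths

def cost_and_avail(T, k, imax=8):       # q**(i-1) decays fast, so truncate
    up = mu * (1 - alpha**(k - 1)) / (1 - alpha) + T / p
    down = eta * (1 - beta**(k - 1)) / (1 - beta) + w
    series = sum(q**(i - 1) * M(i * T, k) for i in range(1, imax))
    L = ((k - 1) * cf + (k - 1) * (k - 2) / 2 * cv
         + cp / p + p * cfr * series) / (up + down)
    return L, up / (up + down)

best = None
for k in (2, 3, 4, 5):
    for T in np.arange(5.0, 10.01, 0.25):
        L, A = cost_and_avail(T, k)
        if A >= A0 and (best is None or L < best[0]):
            best = (round(L, 4), round(A, 4), T, k)
print(best)                             # compare with (T*, k*) quoted above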

21.3.1.4 Imperfect Maintenance Model D

Model D is identical to Model C except that, after the (k − 1)th imperfect repair, there are two types of failure [45], and the PMs at times T, 2T, 3T, . . . are perfect. Type 1 failures might be total breakdowns, whereas type 2 failures can be interpreted as slight and easily fixed problems. Type 1 failures are subject to perfect repairs and type 2 failures are subject to minimal repairs. When a failure occurs it is a type 1 failure with probability p(t) and a type 2 failure with probability q(t) = 1 − p(t), where t is the age of the unit. Thus, the repair at failure can be modeled by the (p(t), q(t)) rule. Assume that the failure repair time is negligible, and that the PM time is a random variable V with mean v.

Consider T and k as decision variables. For this maintenance model, the times between consecutive perfect PMs constitute a renewal cycle. From classical renewal reward theory, the long-run expected maintenance cost per unit of system time, or cost rate, is

L(T, k) = C(T, k)/D(T, k)

where C(T, k) is the expected maintenance cost per renewal cycle and D(T, k) is the expected duration of a renewal cycle.

After the (k − 1)th imperfect repair, let Yp denote the time until the first perfect repair without PM since the last perfect repair, i.e. the time between successive perfect repairs. As described in Section 21.2.2, the survival distribution of Yp is given by

S(t) = exp{ −∫_0^t p(x) r_k(x) dx } = exp{ −α^{1−k} ∫_0^t p(x) r1(α^{1−k}x) dx }

Assume that Z_t represents the number of minimal repairs during the time interval (0, min{t, Yp}), and write S̄(t) = 1 − S(t). Wang and Pham [45] show that

E{Z_t | Yp < t} = [1/S̄(t)] ∫_0^t ∫_0^y q(x) r_k(x) dx dS̄(y)
= [α^{1−k}/S̄(t)] ∫_0^t ∫_0^y q(x) r1(α^{1−k}x) dx dS̄(y)

E{Z_t | Yp ≥ t} = ∫_0^t q(x) r_k(x) dx = α^{1−k} ∫_0^t q(x) r1(α^{1−k}x) dx

Let N1(t) and N2(t) denote the s-expected numbers of perfect repairs and minimal repairs in (0, t) respectively, and let c1, c2, and cp denote the costs of a perfect repair, a minimal repair, and a PM respectively. It is easy to obtain

D(T, k) = µ(1 − α^{k−1})/(1 − α) + η(1 − β^{k−1})/(1 − β) + T + v

C(T, k) = (k − 1)cf + [(k − 1)(k − 2)/2]cv + c1 N1(T) + c2 N2(T) + cp

Obviously, N1(t) is the renewal function of the renewal process with interarrival time distribution S̄(t), and it can be determined by the standard solution methods for renewal functions in renewal theory. Wang and Pham [45] proved, for t ≤ T:

N2(t) = E{Z_t | Yp ≥ t}S(t) + ∫_0^t [E{Z_x | Yp = x} + N2(t − x)] dS̄(x)

Note that

E{Z_t | Yp < t}S̄(t) = ∫_0^t E{Z_x | Yp = x} dS̄(x)

It follows that

N2(t) = E{Z_t | Yp ≥ t}S(t) + E{Z_t | Yp < t}S̄(t) + ∫_0^t N2(t − x) dS̄(x)
= E(Z_t) + ∫_0^t N2(t − x) dS̄(x)
= α^{1−k} ∫_0^t S(x) r1(α^{1−k}x) dx − S̄(t) + ∫_0^t N2(t − x) dS̄(x)

Therefore, N2(t) can be obtained by the Laplace transform or by solving the above integral equation numerically. The following proposition follows from the above equations.

Proposition 8. The long-run expected maintenance cost rate is given by

L(T, k) = { (k − 1)cf + [(k − 1)(k − 2)/2]cv + c1N1(T) + c2N2(T) + cp }
× { µ(1 − α^{k−1})/(1 − α) + η(1 − β^{k−1})/(1 − β) + T + v }^{−1}
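As a concrete illustration of the numerical route, the following sketch discretizes the renewal-type equation for N2(t) and solves it by forward recursion. The Weibull baseline hazard r1, the parameter values, and the simplification p(t) ≡ p are placeholders chosen for the sketch, not values from the source.

import numpy as np

# Placeholder ingredients: Weibull baseline hazard and a constant (p, q) rule.
alpha, k, p = 0.95, 3, 0.7
gamma, theta = 2.0, 10.0
r1 = lambda x: (gamma / theta) * (x / theta) ** (gamma - 1)  # baseline hazard

a = alpha ** (1 - k)          # age-scaling factor after k-1 imperfect repairs

def solve_N2(T, n=2000):
    """Solve N2(t) = g(t) + int_0^t N2(t - x) dSbar(x) on a uniform grid,
    where g(t) = a * int_0^t S(x) r1(a x) dx - Sbar(t)."""
    t = np.linspace(0.0, T, n + 1)
    h = t[1] - t[0]
    # Survival of Yp: hazard p * a * r1(a x), so S(t) = exp(-cumulative hazard).
    H = np.concatenate(([0.0], np.cumsum(p * a * r1(a * t[1:]) * h)))
    S = np.exp(-H)
    dSbar = -np.diff(S)                 # increments of Sbar = 1 - S
    g = np.concatenate(([0.0], np.cumsum(a * S[1:] * r1(a * t[1:]) * h))) - (1 - S)
    N2 = np.zeros(n + 1)
    for i in range(1, n + 1):
        # Convolution uses already-computed values N2(t_{i-1}), ..., N2(0).
        N2[i] = g[i] + np.dot(N2[i - 1 :: -1], dSbar[:i])
    return t, N2

t, N2 = solve_N2(T=20.0)
print(N2[-1])                           # expected minimal repairs by t = 20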

21.3.1.5 Imperfect Maintenance Model E

This model is the same as Model C except that, at the failures following the (k − 1)th imperfect repair since time zero, the repair cost is estimated by perfect inspection to determine whether to replace the unit or imperfectly repair it [53]. Assume that the repair cost has a cumulative distribution function C(x) that is independent of the age of the unit. If the estimated cost does not exceed a constant cost limit Q, then the unit is imperfectly repaired at an expected repair cost not exceeding Q. Otherwise, it is replaced by a new one at a higher fixed cost, and the replacement time is W with mean w. Imperfect repair is modeled by the (p, q) rule. Assume that the repair time is V with mean v, and that W and V are independent of the previous failure history of the unit. Upon a perfect repair or replacement the process repeats. We consider k and Q as decision variables, and α, β, and p as parameters.

Proposition 9. The long-run expected maintenance cost per unit time is

L(k, Q; α, β, p) = { (k − 1)cf + [(k − 1)(k − 2)/2]cv + [c2[1 − C(Q)] + c1C(Q)]/[1 − qC(Q)] }
× { µ(1 − α^{k−1})/(1 − α) + η(1 − β^{k−1})/(1 − β) + [[1 − C(Q)]w + pC(Q)v]/[1 − qC(Q)]
+ ∫_0^∞ exp{−H(α^{1−k}t)[1 − qC(Q)]} dt }^{−1}

and the asymptotic average availability is

A(k, Q; α, β, p) = { µ(1 − α^{k−1})/(1 − α) + ∫_0^∞ exp{−H(α^{1−k}t)[1 − qC(Q)]} dt }
× { µ(1 − α^{k−1})/(1 − α) + η(1 − β^{k−1})/(1 − β) + [[1 − C(Q)]w + pC(Q)v]/[1 − qC(Q)]
+ ∫_0^∞ exp{−H(α^{1−k}t)[1 − qC(Q)]} dt }^{−1}

where H(α^{1−k}t) = ∫_0^t α^{1−k} r1(α^{1−k}x) dx is the cumulative hazard of the unit right after the (k − 1)th imperfect repair, and c1 = [C(Q)]^{−1} ∫_0^Q t dC(t) is the mean of the repair costs that are less than Q.

The optimum maintenance policy (k*, Q*) that minimizes L(k, Q; α, β, p) or maximizes A(k, Q; α, β, p) can be obtained using any nonlinear programming software.
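To make the proposition concrete, the fragment below evaluates L(k, Q) on a grid, assuming, purely for illustration, a Weibull baseline hazard r1 and an exponential repair-cost distribution C(x) = 1 − e^{−x/θc}; neither these choices nor the parameter values come from the source.

import numpy as np
from scipy.integrate import quad

# Hypothetical parameters for the sketch.
mu, eta, w, v = 10.0, 0.9, 0.5, 0.2
cf, cv, c2 = 1.0, 0.06, 5.0            # c2: fixed replacement cost
alpha, beta, p = 0.95, 1.05, 0.7
q = 1.0 - p
gamma, theta = 2.0, 10.0               # Weibull baseline hazard r1
theta_c = 2.0                          # mean of the repair-cost distribution

C = lambda x: 1.0 - np.exp(-x / theta_c)        # repair-cost CDF
H1 = lambda t: (t / theta) ** gamma             # cumulative baseline hazard

def cost_rate(k, Q):
    a = alpha ** (1 - k)
    cq = C(Q)
    # c1: mean of repair costs below Q.
    c1 = quad(lambda x: x * np.exp(-x / theta_c) / theta_c, 0, Q)[0] / cq
    # Expected residual operating time per cycle (the integral in Prop. 9).
    tail = quad(lambda t: np.exp(-H1(a * t) * (1 - q * cq)), 0, np.inf)[0]
    num = ((k - 1) * cf + (k - 1) * (k - 2) / 2 * cv
           + (c2 * (1 - cq) + c1 * cq) / (1 - q * cq))
    den = (mu * (1 - alpha ** (k - 1)) / (1 - alpha)
           + eta * (1 - beta ** (k - 1)) / (1 - beta)
           + ((1 - cq) * w + p * cq * v) / (1 - q * cq) + tail)
    return num / den

candidates = [(k, Q) for k in range(2, 8) for Q in np.linspace(0.5, 6.0, 23)]
k_opt, Q_opt = min(candidates, key=lambda kq: cost_rate(*kq))
print(k_opt, Q_opt, cost_rate(k_opt, Q_opt))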


Figure 21.2. Imperfect maintenance policy for a k-out-of-n system

21.3.2 Optimal Imperfect Maintenance of k-out-of-n Systems

For a complex system, it may not be advisable to replace the entire system just because of the failure of one component. In fact, the system comes back into operation upon repair or replacement of the failed component by a new one or by a used but operative one. Such maintenance actions do not renew the system completely but enable the system to continue to operate. However, the system usually deteriorates with usage and time. At some point of time or usage, it may be in a bad operating condition and a perfect maintenance becomes necessary. Based on this situation, Pham and Wang [51] proposed the following imperfect maintenance policy for a k-out-of-n system.

A new system starts to operate at time zero. Each failure of a component of this system in the time interval (0, τ) is immediately removed by a minimal repair. Components that fail in the time interval (τ, T) can be left idle (partial failure is allowed). Perform CM on the failed components, together with PM on all unfailed but deteriorating ones, at a cost cf once exactly m components are idle, or perform PM on the whole system at a cost cp once the total operating time reaches T, whichever occurs first. That is, if m components fail in the time interval (τ, T), then CM combined with PM is performed; if fewer than m components fail in the time interval (τ, T), then PM is carried out at time T. After a perfect maintenance, whether this is a CM combined with PM or a PM at T, the process repeats.

In the above maintenance policy, τ and T are decision variables. A k-out-of-n system is defined as a complex coherent system with n failure-independent components such that the system operates if and only if at least k of these components function successfully. This chapter assumes that m is a predetermined positive integer with 1 ≤ m ≤ n − k + 1, and that τ < T are real numbers. This policy is illustrated in Figure 21.2. In practice, m may take different values according to different reliability and cost requirements. Obviously, m = 1 means that the system is subject to maintenance whenever one component fails after τ. For a series system (an n-out-of-n system) or a system with key applications, m may basically be required to be unity. If m is chosen as n − k + 1, then a k-out-of-n system is maintained once the system fails. In most applications, the whole system is subject to a perfect CM together with a PM upon a system failure (m = n − k + 1) or partial failure (m < n − k + 1). Here, partial failure means that some components have failed but the system still functions. However, if inspection is not continuous and the system operating condition can be known only through inspection, m could be a number greater than (n − k + 1). For simplicity, in this chapter we assume that if CM together with PM is carried out, then both are perfect; CM combined with PM takes w1 time units and PM at time T takes w2 time units. Further suppose that every component has an IFR lifetime distribution, which is differentiable and remains undisturbed by minimal repair.

The justification of this policy is as follows. Before τ, each component is “young” and no major repair is necessary; therefore, only minimal repairs, which may not take much time or money, are performed. The component deteriorates as time passes. After τ, the component has a larger failure rate (due to IFR) and might be in a bad operating condition, so a major or perfect repair may be needed. Because of economic dependence and availability requirements (less frequent shut-offs for maintenance), however, we may not replace failed components immediately, but rather start CM only when the number of failed components reaches some prespecified number m. In fact, when the number of failed components accumulates to m, the remaining (n − m) operating components may have degraded to a worse operating condition and need PM also. Note that, as long as m is less than (n − k + 1), the system will not fail and continues to operate.

Assume that for each component the cost of the ith minimal repair at age t consists of two parts: the deterministic part c1(t, i), which depends on the age t of the component and the number i of minimal repairs, and the age-dependent random part c2(t). The assumption that each component has an increasing failure rate is essential, because the system may be subject to a PM at time T, which requires the system to be IFR after τ. The following proposition states the relationship between component and system IFR properties.

Proposition 10. If a k-out-of-n system is composed of independent, identical IFR components, then the system is IFR also.

Given that the PM at times T, 2T, 3T, . . . is imperfect, that the failure rate of each component is q(t), that the survival function of the time to failure of the k-out-of-n system is F̄_{n−k+1}(y) (so that F̄_m(y) denotes the survival function of the time, measured from τ, until m components have failed), and that the cost expectation at age t is µ(t), Pham and Wang [51] derived the following results using renewal theory.

Proposition 11. If the PM is perfect with probability p and minimal with probability q = 1 − p, then the long-run expected system maintenance cost per unit time, or cost rate, for a k-out-of-n:G system is given by

L(τ, T | p) = { n ∫_0^τ µ(y)q(y) dy + cp Σ_{j=1}^∞ q^{j−1} F̄_m(jT − τ) + cf [1 − p Σ_{j=1}^∞ q^{j−1} F̄_m(jT − τ)] }
× { τ + ∫_0^{T−τ} F̄_m(t) dt + Σ_{j=2}^∞ q^{j−1} ∫_{(j−1)T−τ}^{jT−τ} F̄_m(t) dt
+ w1 [1 − p Σ_{j=1}^∞ q^{j−1} F̄_m(jT − τ)] + w2 [Σ_{j=1}^∞ q^{j−1} F̄_m(jT − τ)] }^{−1}

and the limiting average system availability is

A(τ, T | p) = { τ + ∫_0^{T−τ} F̄_m(t) dt + Σ_{j=2}^∞ q^{j−1} ∫_{(j−1)T−τ}^{jT−τ} F̄_m(t) dt }
× { τ + ∫_0^{T−τ} F̄_m(t) dt + Σ_{j=2}^∞ q^{j−1} ∫_{(j−1)T−τ}^{jT−τ} F̄_m(t) dt
+ w1 [1 − p Σ_{j=1}^∞ q^{j−1} F̄_m(jT − τ)] + w2 [Σ_{j=1}^∞ q^{j−1} F̄_m(jT − τ)] }^{−1}

The above model includes at least 12 existing maintenance models as special cases. For example, when k = 1 and n > 1, the k-out-of-n system reduces to a parallel system. If we further let m = n, it follows that the long-run expected system maintenance cost rate for a parallel system with n components is

L(τ, T) = { n ∫_0^τ µ(y)q(y) dy + (cf − cp)[F̄(τ) − F̄(T)]^n F̄^{−n}(τ) + cp }
× { τ + ∫_0^{T−τ} {1 − [F̄(τ) − F̄(τ + t)]^n F̄^{−n}(τ)} dt + (w1 − w2)[F̄(τ) − F̄(T)]^n F̄^{−n}(τ) + w2 }^{−1}

If we set τ = 0 and w1 = w2 = 0, then the above equation becomes

L(0, T) = [(cf − cp)F^n(T) + cp] / ∫_0^T [1 − F^n(t)] dt


which coincides with the result of Yasui et al. [54].

Sometimes, optimal maintenance policies may be required such that the system maintenance cost rate is minimized while availability requirements are satisfied, or such that the system availability is maximized given that the maintenance cost rate is not larger than some predetermined value. For example, for the above maintenance model, the following optimization problem can be formulated in terms of the decision variables T and τ:

Minimize

L(τ, T | p) = { n ∫_0^τ µ(y)q(y) dy + cp Σ_{j=1}^∞ q^{j−1} F̄_m(jT − τ) + cf [1 − p Σ_{j=1}^∞ q^{j−1} F̄_m(jT − τ)] }
× { τ + ∫_0^{T−τ} F̄_m(t) dt + Σ_{j=2}^∞ q^{j−1} ∫_{(j−1)T−τ}^{jT−τ} F̄_m(t) dt
+ w1 [1 − p Σ_{j=1}^∞ q^{j−1} F̄_m(jT − τ)] + w2 [Σ_{j=1}^∞ q^{j−1} F̄_m(jT − τ)] }^{−1}

Subject to

A(τ, T | p) ≥ A0
τ ≥ 0
T > 0

where constant A0 is the pre-decided minimum availability requirement.

The optimal system maintenance policy (T*, τ*) can be determined from the above optimization model by using nonlinear programming software. Similarly, other optimization models can be formulated for different requirements. Such optimization models should be truly optimal, since both availability and maintenance costs are addressed.

An application of the above model to a 2-out-of-3 aircraft engine system is discussed by Pham and Wang [51], where the time to failure of each engine follows a Weibull distribution with shape parameter β = 2 and scale parameter θ = 500 (days). The optimal maintenance policy for this 2-out-of-3 aircraft engine system that minimizes the system maintenance cost rate is determined from the above model: before τ* = 335.32 (days), only minimal repairs are performed; after τ* = 335.32, the failed engine is given a perfect repair, together with PM on the remaining two, as soon as any engine fails; and if no engine fails by time T* = 383.99 (days), then PM is carried out at T* = 383.99. A numerical sketch of this type of computation follows.
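The sketch below shows how such a policy can be evaluated numerically for the 2-out-of-3 engine example, taking p = 1 (perfect PM) so that the sums in Proposition 11 collapse to their first terms. The Weibull lifetime parameters are those quoted above; the cost constants and the minimal-repair cost intensity are hypothetical placeholders (the source does not list them), so the resulting optimum will not reproduce τ* = 335.32 and T* = 383.99 exactly. Since failures in (0, τ) are minimally repaired, every component is working with age τ at time τ, and F̄m(t) is then a binomial tail probability.

import numpy as np
from scipy.integrate import quad
from scipy.stats import binom

n, k_req = 3, 2                       # 2-out-of-3 system
m = n - k_req + 1                     # maintain when the system fails
beta_w, theta_w = 2.0, 500.0          # Weibull engine lifetimes (days)

# Hypothetical cost/time constants (not given in the source):
cp, cf, w1, w2 = 10.0, 25.0, 2.0, 1.0
mu_cost = lambda y: 0.5               # expected minimal-repair cost at age y

Sbar = lambda y: np.exp(-(y / theta_w) ** beta_w)   # component survival
rate = lambda y: (beta_w / theta_w) * (y / theta_w) ** (beta_w - 1.0)

def Fbar_m(t, tau):
    """P(fewer than m of the n components fail in (tau, tau + t])."""
    rho = (Sbar(tau) - Sbar(tau + t)) / Sbar(tau)
    return binom.cdf(m - 1, n, rho)

def cost_rate(tau, T):
    pm_prob = Fbar_m(T - tau, tau)    # PM at T is reached before m failures
    minimal = n * quad(lambda y: mu_cost(y) * rate(y), 0, tau)[0]
    num = minimal + cp * pm_prob + cf * (1 - pm_prob)
    up = tau + quad(lambda t: Fbar_m(t, tau), 0, T - tau)[0]
    return num / (up + w1 * (1 - pm_prob) + w2 * pm_prob)

grid = ((cost_rate(tau, T), tau, T)
        for tau in np.linspace(100, 500, 41)
        for T in np.linspace(150, 600, 46) if T > tau)
print(min(grid, key=lambda x: x[0]))  # (cost rate, tau*, T*) for these costs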

21.4 Future Research on Imperfect Maintenance

The following further work on imperfect maintenance is necessary:

• Investigate optimum PM policies and reliability measures for multicomponent systems, because most previous work on imperfect maintenance has focused on one-unit systems.

• Develop more methods for treating imperfect maintenance.

• Establish statistical estimation of parameters for imperfect maintenance models.

• Explore more realistic imperfect maintenance models, e.g. including non-negligible repair time and finite horizons. In most of the literature on maintenance theory, the maintenance time is assumed to be negligible. This assumption makes availability and mean-time-between-failures modeling impossible or unrealistic. Considering maintenance time will result in realistic system reliability measures.

• Use reliability measures as the optimality criteria instead of cost rates, or combine both, as shown in Section 21.3.1: an optimal maintenance policy that minimizes the cost rate under a given reliability requirement. Most existing optimal maintenance models use the optimization criterion of minimizing the system maintenance cost rate while ignoring reliability performance. However, maintenance aims to improve system reliability performance. Therefore, the optimal maintenance policy must be based not only on cost rate but also on reliability measures. It is important to note that, for multicomponent systems, minimizing the system maintenance cost rate may not imply maximizing the system reliability measures. Sometimes, when the maintenance cost rate is minimized, the system reliability measures are so low that they are not acceptable in practice [54]. This is because various components in the system may have different maintenance costs and a different reliability importance in the system [53, 55, 56]. Therefore, to achieve the best operating performance, an optimal maintenance policy needs to consider both maintenance cost and reliability measures simultaneously.

• Consider the structure of a system to obtain optimal system reliability performance and an optimal maintenance policy. For example, once a subsystem of a series system fails, it is necessary to repair it immediately; otherwise, the system will have longer downtime and worse reliability measures. However, when one subsystem of a parallel system fails, the system will still function even if this subsystem is not repaired right away. In fact, its repair can be delayed until it is time to do the PM on the system, considering economic dependence; or repair can begin at such a time that only one subsystem operates and the other subsystems have failed and are awaiting repairs [51]; or at the time that all subsystems fail, and thus the system fails, if system failure during actual operation is not critical.

21.A Appendix

21.A.1 Acronyms and Definitions

NBU   new better than used
NWU   new worse than used
NBUE  new better than used in expectation
NWUE  new worse than used in expectation
IFRA  increasing failure rate average
DFRA  decreasing failure rate average
IFR   increasing failure rate
DFR   decreasing failure rate

Assume that the lifetime has a distribution function F(t) and survival function s(t) = 1 − F(t) with mean µ. The following definitions are given (for details see Barlow and Proschan [3]).

• s is NBU if s(x + y) ≤ s(x)s(y) for all x, y ≥ 0; s is NWU if s(x + y) ≥ s(x)s(y) for all x, y ≥ 0.

• s is NBUE if ∫_t^∞ s(x) dx ≤ µs(t) for all t ≥ 0; s is NWUE if ∫_t^∞ s(x) dx ≥ µs(t) for all t ≥ 0.

• s is IFRA if [s(x)]^{1/x} is decreasing in x for x > 0; s is DFRA if [s(x)]^{1/x} is increasing in x for x > 0.

• s is IFR if and only if [F(t + x) − F(t)]/s(t) is increasing in t for all x > 0; s is DFR if and only if [F(t + x) − F(t)]/s(t) is decreasing in t for all x > 0.

21.A.2 Exercises

1. What is minimal repair? What is imperfect repair? What is the relationship between them?

2. Prove Proposition 1.

3. Assume that imperfect maintenance Model C in Section 21.3.1.3 is modified, so that the new model is exactly like it except that the imperfect PM is treated by the x rule, i.e. the age of the unit becomes x units of time younger after PM, and that the unit undergoes minimal repair at failures between PMs at cost cfm, instead of the imperfect repairs governed by parameter λ in Model C. Assume further that the Nth PM since the last perfect PM is perfect, where N is a positive integer. A cost cNp and an independent replacement time V with mean v are incurred for the perfect PM at time NT. Assume that an imperfect PM at other times takes W time with mean w and that the imperfect PM cost is cp. Suppose that cNp > cp, v ≥ w, and W and V are independent of the previous failure history of the unit. Derive the long-run expected maintenance cost per unit time and the asymptotic average availability.

4. A new model is identical to imperfect maintenance Model C in Section 21.3.1.3 except that, after the (k − 1)th repair at failure, the unit is either replaced at the next failure at a cost cfr, or preventively replaced at age T at a cost cp, whichever occurs first; i.e. after time zero a unit is imperfectly repaired at failure i at a cost cf + (i − 1)cv for i ≤ k − 1, where cf and cv are constants. The repair is imperfect in the sense that, after repair, the lifetime is reduced to a fraction α of the immediately previous lifetime, the repair times are increased to a multiple β of the immediately previous one, and the successive lifetimes and repair times are independent. Note that the lifetime of the unit after the (k − 2)th repair and the (k − 1)th repair time are respectively α^{k−2}Z_{k−1} and β^{k−2}ζ_{k−1}, with means α^{k−2}µ and β^{k−2}η, where Z_i and ζ_i are i.i.d. random variable sequences with Z_1 = X_1 and ζ_1 = Y_1. Derive the long-run expected maintenance cost per unit time, given that T and k are decision variables and α and β are parameters.

5. Show Proposition 6.

References

[1] Sherif YS, Smith ML. Optimal maintenance models for systems subject to failure—a review. Nav Res Logist Q 1981;28(1):47–74.
[2] McCall JJ. Maintenance policies for stochastically failing equipment: a survey. Manage Sci 1965;11(5):493–524.
[3] Barlow RE, Proschan F. Mathematical theory of reliability. New York: John Wiley & Sons; 1965.
[4] Pierskalla WP, Voelker JA. A survey of maintenance models: the control and surveillance of deteriorating systems. Nav Res Logist Q 1976;23:353–88.
[5] Jardine AKS, Buzacott JA. Equipment reliability and maintenance. Eur J Oper Res 1985;19:285–96.
[6] Valdez-Flores C, Feldman RM. A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Nav Res Logist 1989;36:419–46.
[7] Cho ID, Parlar M. A survey of maintenance models for multi-unit systems. Eur J Oper Res 1991;51:1–23.
[8] Jensen U. Stochastic models of reliability and maintenance: an overview. In: Ozekici S, editor. Reliability and maintenance of complex systems, NATO ASI Series, vol. 154 (Proceedings of the NATO Advanced Study Institute on Current Issues and Challenges in the Reliability and Maintenance of Complex Systems, held in Kemer–Antalya, Turkey, June 12–22, 1995). Berlin: Springer-Verlag; 1995. p. 3–36.
[9] Dekker R. Applications of maintenance optimization models: a review and analysis. Reliab Eng Syst Saf 1996;51(3):229–40.
[10] Pham H, Wang HZ. Imperfect maintenance. Eur J Oper Res 1996;94:425–38.
[11] Dekker R, Wilderman RE, van der Duyn Schouten FA. A review of multi-component maintenance models with economic dependence. Math Methods Oper Res 1997;45(3):411–35.
[12] Kijima M. Some results for repairable systems with general repair. J Appl Prob 1989;26:89–102.
[13] Brown M, Proschan F. Imperfect repair. J Appl Prob 1983;20:851–9.
[14] Nakagawa T, Yasui K. Optimum policies for a system with imperfect maintenance. IEEE Trans Reliab 1987;R-36:631–3.
[15] Helvic BE. Periodic maintenance, on the effect of imperfectness. In: 10th International Symposium on Fault-Tolerant Computing, 1980; p. 204–6.
[16] Kay E. The effectiveness of preventive maintenance. Int J Prod Res 1976;14:329–44.
[17] Chan PKW, Downs T. Two criteria for preventive maintenance. IEEE Trans Reliab 1978;R-27:272–3.
[18] Ingle AD, Siewiorek DP. Reliability models for multiprocessor systems with and without periodic maintenance. In: 7th International Symposium on Fault-Tolerant Computing, 1977; p. 3–9.
[19] Chaudhuri D, Sahu KC. Preventive maintenance intervals for optimal reliability of deteriorating system. IEEE Trans Reliab 1977;R-26:371–2.
[20] Nakagawa T. Imperfect preventive maintenance. IEEE Trans Reliab 1979;R-28(5):402.
[21] Nakagawa T. Summary of imperfect maintenance policies with minimal repair. RAIRO Rech Oper 1980;14:249–55.
[22] Fontenot RA, Proschan F. Some imperfect maintenance models. In: Abdel-Hameed MS, Cinlar E, Quinn J, editors. Reliability theory and models. Orlando (FL): Academic Press; 1984.
[23] Wang HZ, Pham H. A quasi renewal process and its application in the imperfect maintenance. Int J Syst Sci 1996;27(10):1055–62 and 1997;28(12):1329.
[24] Bhattacharjee MC. Results for the Brown–Proschan model of imperfect repair. J Stat Plan Infer 1987;16:305–16.
[25] Block HW, Borges WS, Savits TH. Age dependent minimal repair. J Appl Prob 1985;22:370–85.
[26] Beichelt F, Fischer K. General failure model applied to preventive maintenance policies. IEEE Trans Reliab 1980;R-29(1):39–41.
[27] Block HW, Borges WS, Savits TH. A general age replacement model with minimal repair. Nav Res Logist 1988;35(5):365–72.
[28] Iyer S. Availability results for imperfect repair. Sankhya: Indian J Stat 1992;54(2):249–59.
[29] Sumita U, Shanthikumar JG. An age-dependent counting process generated from a renewal process. Adv Appl Probab 1988;20(4):739–55.
[30] Whitaker LR, Samaniego FJ. Estimating the reliability of systems subject to imperfect repair. J Am Stat Assoc 1989;84(405):301–9.
[31] Hollander M, Presnell B, Sethuraman J. Nonparametric methods for imperfect repair models. Ann Stat 1992;20(2):879–87.
[32] Makis V, Jardine AKS. Optimal replacement policy for a general model with imperfect repair. J Oper Res Soc 1992;43(2):111–20.
[33] Malik MAK. Reliable preventive maintenance policy. AIIE Trans 1979;11(3):221–8.
[34] Lie CH, Chun YH. An algorithm for preventive maintenance policy. IEEE Trans Reliab 1986;R-35(1):71–5.
[35] Jayabalan V, Chaudhuri D. Cost optimization of maintenance scheduling for a system with assured reliability. IEEE Trans Reliab 1992;R-41(1):21–6.
[36] Jayabalan V, Chaudhuri D. Sequential imperfect preventive maintenance policies: a case study. Microelectron Reliab 1992;32(9):1223–9.
[37] Suresh PV, Chaudhuri D. Preventive maintenance scheduling for a system with assured reliability using fuzzy set theory. Int J Reliab Qual Saf Eng 1994;1(4):497–513.
[38] Chan JK, Shaw L. Modeling repairable systems with failure rates that depend on age & maintenance. IEEE Trans Reliab 1993;42:566–70.
[39] Canfield RV. Cost optimization of periodic preventive maintenance. IEEE Trans Reliab 1986;R-35(1):78–81.
[40] Kijima M, Morimura H, Suzuki Y. Periodical replacement problem without assuming minimal repair. Eur J Oper Res 1988;37(2):194–203.
[41] Uematsu K, Nishida T. One-unit system with a failure rate depending upon the degree of repair. Math Jpn 1987;32(1):139–47.
[42] Kijima M, Nakagawa T. Accumulative damage shock model with imperfect preventive maintenance. Nav Res Logist 1991;38:145–56.
[43] Kijima M, Nakagawa T. Replacement policies of a shock model with imperfect preventive maintenance. Eur J Oper Res 1992;57:100–10.
[44] Wang HZ, Pham H. Optimal age-dependent preventive maintenance policies with imperfect maintenance. Int J Reliab Qual Saf Eng 1996;3(2):119–35.
[45] Wang HZ, Pham H. Some maintenance models and availability with imperfect maintenance in production systems. Ann Oper Res 1999;91:305–18.
[46] Shaked M, Shanthikumar JG. Multivariate imperfect repair. Oper Res 1986;34:437–48.
[47] Sheu S, Griffith WS. Multivariate imperfect repair. J Appl Prob 1992;29(4):947–56.
[48] Nakagawa T. Periodic and sequential preventive maintenance policies. J Appl Prob 1986;23(2):536–42.
[49] Nakagawa T. Sequential imperfect preventive maintenance policies. IEEE Trans Reliab 1988;37(3):295–8.
[50] Yak YW, Dillon TS, Forward KE. The effect of imperfect periodic maintenance on fault tolerant computer systems. In: 14th International Symposium on Fault-Tolerant Computing, 1984; p. 67–70.
[51] Pham H, Wang HZ. Optimal (τ, T) opportunistic maintenance of a k-out-of-n:G system with imperfect PM and partial failure. Nav Res Logist 2000;47:223–39.
[52] Barlow RE, Hunter LC. Optimum preventive maintenance policies. Oper Res 1960;8:90–100.
[53] Wang HZ, Pham H. Optimal maintenance policies for several imperfect maintenance models. Int J Syst Sci 1996;27(6):543–9.
[54] Yasui K, Nakagawa T, Osaki S. A summary of optimal replacement policies for a parallel redundant system. Microelectron Reliab 1988;28:635–41.
[55] Wang HZ, Pham H. Availability and optimal maintenance of series system subject to imperfect repair. Int J Plant Eng Manage 1997;2:75–92.
[56] Wang HZ, Pham H, Izundu AE. Optimal preparedness maintenance of multi-unit systems with imperfect maintenance and economic dependence. In: Pham H, editor. Recent advances in reliability and quality engineering. New Jersey: World Scientific; 2001.


Chapter 22

Accelerated Life Testing

Elsayed A. Elsayed

22.1 Introduction
22.2 Design of Accelerated Life Testing Plans
22.2.1 Stress Loadings
22.2.2 Types of Stress
22.3 Accelerated Life Testing Models
22.3.1 Parametric Statistics-based Models
22.3.2 Acceleration Model for the Exponential Model
22.3.3 Acceleration Model for the Weibull Model
22.3.4 The Arrhenius Model
22.3.5 Non-parametric Accelerated Life Testing Models: Cox’s Model
22.4 Extensions of the Proportional Hazards Model

22.1 Introduction

The intensity of the global competition for the development of new products in a short time has motivated the development of new methods such as robust design, just-in-time manufacturing, and design for manufacturing and assembly. More importantly, both producers and customers expect the product to perform the intended functions for extended periods of time. Hence, extended warranties and similar assurances of product reliability have become standard features of the product. These requirements have increased the need to provide more accurate estimates of reliability by testing materials, components, and systems at different stages of product development. Testing under normal operating conditions requires a very long time (possibly years) and the use of an extensive number of units under test, so it is usually costly and impractical to perform reliability testing under normal conditions. This has led to the development of accelerated life testing (ALT), in which units are subjected to a more severe environment (increased or decreased stress levels) than the normal operating environment, so that failures can be induced in a short period of test time. Information obtained under accelerated conditions is then used in conjunction with a reliability prediction (inference) model to relate life to stress and to estimate the characteristics of life distributions at design conditions (normal operating conditions). Conducting an accelerated life test requires careful allocation of test units to different stress levels so that accurate estimation of reliability at normal conditions can be obtained using relatively few units and short test durations.

The accuracy of the inference procedure has a profound effect on the reliability estimates at normal operating conditions and on the subsequent decisions regarding system configuration, warranties, and preventive maintenance schedules. Therefore, it is important that the inference procedure be robust and accurate. In this chapter we present ALT models and demonstrate their use in relating failure data at stress conditions to normal operating conditions. Before presenting such models, we briefly describe the design of ALT plans, including methods of applying stress and determination of the number of units for each stress level.

In this chapter, we introduce the subject of ALT and its importance in assessing the reliability of components and products under normal operating conditions. We briefly describe the methods of stress application and the types of stress, and focus on the reliability prediction models that use failure data obtained under stress conditions to derive reliability information under normal conditions. Several methods and examples demonstrating their application are also presented.

22.2 Design of Accelerated Life Testing Plans

A detailed test plan is usually designed before conducting an accelerated life test. The plan requires determination of the type of stress, the method of applying stress, the stress levels, the number of units to be tested at each stress level, and an applicable accelerated life testing model that relates the failure times at accelerated conditions to those at normal conditions. Typical stress loadings are described in the following section.

22.2.1 Stress Loadings

Stress in ALT can be applied in various ways. Typical loadings include constant, cyclic, step, progressive, and random stress loadings, and combinations of such loadings. Figure 22.1 depicts different forms of stress loadings. The choice of a stress loading depends on how the product or unit is used in service and on other practical and theoretical limitations [1]. For example, a typical automobile experiences extreme environmental temperature changes, ranging from below freezing to more than 120 °F. Clearly, the corresponding accelerated stress test should be cyclic with extreme temperature changes.

The stress levels range from low to high. A low stress level should be higher than the operating conditions, to ensure failures during the test, whereas the high stress should be the highest possible stress that can be applied without inducing failure mechanisms different from those that are likely to occur under normal operating conditions. Meeker and Hahn [2] provide extensive tables and practical guidelines for planning an ALT. They present a statistically optimum test plan and then suggest an alternative plan that meets practical constraints and has desirable statistical properties. Their tables allow assessment of the effect of reducing the test stress levels (thereby reducing the degree of extrapolation) on statistical precision.

Typical accelerated testing plans allocate equal numbers of units to the test stresses. However, units tested at stress levels close to the design or operating conditions may not experience enough failures to be used effectively in the acceleration models. It is therefore preferable to allocate more test units to the low stress conditions than to the high stress conditions, so as to obtain an equal expected number of failures at both conditions.

22.2.2 Types of Stress

The type of applied stress depends on the intended operating conditions of the product and the potential cause of failure. We classify the types of stress as follows.

1. Mechanical stresses. Fatigue stress is the most commonly used accelerated test for mechanical components. When the components are subject to elevated temperature, creep testing (which combines both temperature and load) should be applied. Shock and vibration testing is suitable for components or products subject to such conditions, as in the case of bearings, shock absorbers, tires, and circuit boards in airplanes and automobiles.

2. Electrical stresses. These include power cycling, electric field, current density, and electromigration. The electric field is one of the most common electrical stresses, as it induces failures in relatively short times; its effect is also significantly stronger than that of other types of stress.

3. Environmental stresses. Temperature and thermal cycling are commonly used for most products. Of course, it is important to use appropriate stress levels that do not induce failure mechanisms different from those under normal conditions. Humidity is as critical as temperature, but its application usually requires a very long time before its effect is noticed. Other environmental stresses include ultraviolet light, sulfur dioxide, salt and fine particles, and alpha and gamma rays.

Figure 22.1. Various stress loadings in ALT

22.3 Accelerated Life Testing Models

Elsayed [3] classified the inference procedures (or models) that relate life under stress conditions to life under normal or operating conditions into three types: statistics-based models, physics–statistics-based models, and physics–experimental-based models, as shown in Figure 22.2.

The statistics-based models are further classified into parametric models and non-parametric models. The underlying assumption in relating the failure data, when using any of these models, is that the components/products operating under normal conditions experience the same failure mechanism as those operating at the accelerated conditions.

Figure 22.2. Classification of accelerated failure data models [3]

Statistics-based models are generally used when the exact relationship between the applied stresses and the failure time of the component or product is difficult to determine from physical or chemical principles. In this case, components are tested at different stress levels and the failure times are then used to determine the most appropriate failure time distribution and its parameters. The most commonly used failure time distributions are the exponential, Weibull, normal, lognormal, gamma, and extreme value distributions. The failure times follow the same general distribution at all stress levels, including the normal operating conditions. When the failure time data involve a complex lifetime distribution, or when the number of observations is small, making it difficult to fit the failure time distribution accurately, semiparametric or non-parametric models are a very attractive approach for predicting reliability at different stress levels. The advantage of these models is that they are essentially “distribution free”.

In the following sections, we present the details of some of these models and provide numerical examples that demonstrate the procedures for reliability prediction under normal operating conditions.

22.3.1 Parametric Statistics-based Models

As stated above, statistics-based models are generally used when the exact relationship between the applied stresses (temperature, humidity, voltage, etc.) and the failure time of the component (or product) is difficult to determine from physical or chemical principles. In this case, components are tested at different accelerated stress levels s1, s2, . . . , sn. The failure times at each stress level are then used to determine the most appropriate failure time probability distribution, along with its parameters. Under the parametric statistics-based model assumptions, the failure times at different stress levels are linearly related to each other. Moreover, the failure time distribution at stress level s1 is expected to be of the same family at the other stress levels s2, s3, . . . , as well as under normal operating conditions. In other words, the shape parameters of the distributions are the same for all stress levels (including normal conditions), but the scale parameters may differ. Thus, the fundamental relationships between the operating conditions and the stress conditions are summarized as follows [4]:

• failure times:

t_o = A_F t_s (22.1)

where t_o is the failure time under operating conditions, t_s is the failure time under stress conditions, and A_F is the acceleration factor (the ratio between product life under normal conditions and life under accelerated conditions);

• cumulative distribution functions (CDFs):

F_o(t) = F_s(t/A_F) (22.2)

• probability density functions:

f_o(t) = (1/A_F) f_s(t/A_F) (22.3)

• failure rates:

h_o(t) = (1/A_F) h_s(t/A_F) (22.4)
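In code, these identities amount to a single rescaling of the time axis. The minimal sketch below (with purely illustrative parameter values) builds the operating-condition distribution from a stress-level Weibull and checks Equation 22.2:

from scipy.stats import weibull_min

AF = 20.0                            # hypothetical acceleration factor
gamma, theta_s = 2.0, 1000.0         # illustrative Weibull fitted at stress

stress = weibull_min(gamma, scale=theta_s)
operating = weibull_min(gamma, scale=AF * theta_s)  # Eq. 22.2: Fo(t) = Fs(t/AF)

t = 5000.0
print(operating.sf(t), stress.sf(t / AF))           # identical by construction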

The most widely used parametric models are the exponential and Weibull models. Therefore, we derive the above equations for both models and demonstrate their use.

22.3.2 Acceleration Model for the Exponential Model

This is the case where the time to failure under stress conditions is exponentially distributed with a constant failure rate λs. The CDF at stress s is

F_s(t) = 1 − e^{−λs t} (22.5)

and the CDF under normal conditions is

F_o(t) = F_s(t/A_F) = 1 − e^{−λs t/A_F} (22.6)

Table 22.1. Failure times of the capacitors in hours

145 °C    240 °C    305 °C
75        179       116
359       407       189
701       466       300
722       571       305
738       755       314
1015      768       403
1388      1006      433
2285      1094      440
3157      1104      468
3547      1493      609
3986      1494      634
4077      2877      640
5447      3001      644
5735      3160      699
5869      3283      781
6242      4654      813
7804      5259      860
8031      5925      1009
8292      6229      1176
8506      6462      1184
8584      6629      1245
11512     6855      2071
12370     6983      2189
16062     7387      2288
17790     7564      2637
19767     7783      2841
20145     10067     2910
21971     11846     2954
30438     13285     3111
42004     28762     4617

mean = 9287    mean = 5244    mean = 1295

The failure rates are related as

λ_o = λ_s/A_F (22.7)

Example 1. In recent years, silicon carbide (SiC) has been used as an optional material for semiconductor devices, especially for devices operating under high-temperature and high-electric-field conditions. An extensive accelerated life experiment is conducted by subjecting 6H-SiC metal–oxide–silicon (MOS) capacitors to temperatures of 145, 240, and 305 °C. The failure times are recorded in Table 22.1.

Determine the mean time to failure (MTTF) of the capacitors at 25 °C and plot the reliability function.

Table 22.2. Temperatures and the 50th percentiles

Temperature (°C)    145     240     305
50th percentile     6437    3635    898

Solution. The data for each temperature are fitted by an exponential distribution, and the resulting means are shown in Table 22.1. In order to estimate the acceleration factor we choose some percentile of the failed population, which can be done non-parametrically using the rank distribution or parametrically using a fitted model. In this example the exponential distribution is used, and the time at which 50% of the population fails is

t = λ(−ln 0.5) (22.8)

where λ here denotes the mean of the fitted exponential distribution. The 50th percentiles are given in Table 22.2.

We use the Arrhenius model to estimate the acceleration factor:

t = k e^{c/T}

where t is the time at which a specified portion of the population fails, k and c are constants, and T is the absolute temperature (in kelvin). Therefore

ln t = ln k + c/T

Using the values in Table 22.2 and least-squares regression, we obtain c = 2730.858 and k = 15.844 32. Therefore, the estimated 50th percentile at 25 °C is

t_25°C = 15.844 32 e^{2730.858/(25+273)} = 151 261 h

The acceleration factor at 25 °C, relative to the 145 °C test condition, is

A_F = 151 261/6437 = 23.49

so the failure rate under normal operating conditions is 1/(9287 × 23.49) = 4.58 × 10⁻⁶ failures/h, the mean time to failure is approximately 218 200 h, and the plot of the reliability function is shown in Figure 22.3.

Figure 22.3. Reliability function for the capacitors
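This fit can be reproduced with a short script. The sketch below applies ordinary least squares to the three (temperature, median) pairs; it recovers c closely, although the constant k quoted above was evidently obtained somewhat differently, so the extrapolated figures should be read as illustrative.

import numpy as np

T = np.array([145.0, 240.0, 305.0]) + 273.0     # kelvin
t50 = np.array([6437.0, 3635.0, 898.0])         # medians from Table 22.2

c, lnk = np.polyfit(1.0 / T, np.log(t50), 1)    # ln t = ln k + c/T
t25 = np.exp(lnk + c / (25.0 + 273.0))          # extrapolated median at 25 C
AF = t25 / t50[0]                               # acceleration vs. 145 C
print(c, np.exp(lnk), t25, AF)
print("MTTF at 25 C:", t25 / np.log(2), "h")    # exponential: mean = median/ln 2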

22.3.3 Acceleration Model for the Weibull Model

Again, we consider the true linear acceleration case. Therefore, the relationships between the failure time distributions at the accelerated and normal conditions can be derived using Equations 22.2–22.4. Thus

F_s(t) = 1 − e^{−(t/θs)^{γs}}   t ≥ 0, γs ≥ 1, θs > 0 (22.9)

where γs is the shape parameter of the Weibull distribution under stress conditions and θs is the scale parameter under stress conditions. The CDF under normal operating conditions is

F_o(t) = F_s(t/A_F) = 1 − e^{−[t/(A_F θs)]^{γs}} = 1 − e^{−(t/θo)^{γo}} (22.10)

The underlying failure time distributions under both the accelerated stress and operating conditions have the same shape parameter, i.e. γs = γo, and θo = A_F θs. If the shape parameters at different stress levels are significantly different, then either the assumption of true linear acceleration is invalid or the Weibull distribution is inappropriate for the analysis of such data.

Let γs = γo = γ ≥ 1. Then the probability density function under normal operating conditions is

f_o(t) = [γ/(A_F θs)][t/(A_F θs)]^{γ−1} e^{−[t/(A_F θs)]^γ}   t ≥ 0, θs ≥ 0 (22.11)

The MTTF under normal operating conditions is

MTTF_o = θo Γ(1 + 1/γ) (22.12)

The failure rate under normal operating conditions is

h_o(t) = [γ/(A_F θs)][t/(A_F θs)]^{γ−1} = h_s(t)/A_F^γ (22.13)

Example 2. A manufacturer of Bourdon tubes (used as part of pressure sensors in avionics) wishes to determine their MTTF. The manufacturer defines failure as a leak in the tube. The tubes are manufactured from 18 Ni (250) maraging steel and operate with dry 99.9% nitrogen or hydraulic fluid as the internal working agent. Tubes fail as a result of hydrogen embrittlement arising from pitting corrosion attack. Because of the criticality of these tubes, the manufacturer decides to conduct an ALT by subjecting them to different levels of pressure and determining the time for a leak to occur. The units are continuously examined using an ultrasound method for detecting leaks, a leak indicating failure of the tube. Units are subjected to three stress levels of gas pressure, and the times for the tubes to show a leak are recorded in Table 22.3.

Determine the mean lives and plot the reliability functions for design pressures of 80 and 90 psi.

Solution. We fit the failure times to Weibull distributions, which results in the following parameters for pressure levels of 100, 120, and 140 psi:

For 100 psi: γ1 = 2.87, θ1 = 10 392
For 120 psi: γ2 = 2.67, θ2 = 5375
For 140 psi: γ3 = 2.52, θ3 = 943

Since γ1 ≈ γ2 ≈ γ3 ≈ 2.65, the Weibull model is appropriate for describing the relationship between

Table 22.3. Time (hours) to detect leak

100 psi   120 psi   140 psi
1557      1378      215
4331      2055      426
5725      2092      431
5759      2127      435
6207      2656      451
6529      2801      451
6767      3362      496
6930      3377      528
7146      3393      565
7277      3433      613
7346      3477      651
7668      3947      670
7826      4101      708
7885      4333      710
8095      4545      743
8468      4932      836
8871      5030      865
9652      5264      894
9989      5355      927
10471     5570      959
11458     5760      966
11728     5829      1067
12102     5968      1124
12256     6200      1139
12512     6783      1158
13429     6952      1198
13536     7329      1293
14160     7343      1376
14997     8440      1385
17606     9183      1780

Table 22.4. Percentiles at different pressures

Pressure (psi)      100     120     140
50th percentile     9050    4681    821

failure times under accelerated conditions and normal operating conditions. Moreover, we have true linear acceleration. Following Example 1, we determine the time at which 50% of the population fails as

t = θ[−ln(0.5)]^{1/γ}

The 50th percentiles are shown in Table 22.4. The relationship between the failure time t and the applied pressure P can be assumed to be


similar to the Arrhenius model; thus

t = k e^{c/P}

where k and c are constants. By making a logarithmic transformation, the above expression can be written as

ln t = ln k + c/P

ho(t)= γ

AFθs

(t

AFθs

)γ−1

= hs(t)

AγF

or

h80(t)= 2.65

1.633 82× 1013t1.65

and

h90(t)= 2.65

8.246 52× 1013t1.65

The reliability functions are shown in Figure 22.4. The MTTFs for 80 and 90 psi are calculated as

MTTF_80 = A_F θs Γ(1 + 1/γ) = (1.633 82 × 10^{13})^{1/2.65} Γ(1 + 1/2.65) = 96 853.38 × 0.885 = 85 715 h

and

MTTF_90 = 31 383.829 × 0.885 = 27 775 h

Figure 22.4. Reliability functions at 80 and 90 psi
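These computations are straightforward to script. The sketch below fits the 140 psi sample with scipy (location fixed at zero) and repeats the regression of the medians on 1/P; the maximum likelihood shape estimate should land near the value quoted above, though it need not match a plotting-based fit exactly.

import numpy as np
from scipy.stats import weibull_min

# Leak times at 140 psi (Table 22.3).
t140 = np.array([215, 426, 431, 435, 451, 451, 496, 528, 565, 613,
                 651, 670, 708, 710, 743, 836, 865, 894, 927, 959,
                 966, 1067, 1124, 1139, 1158, 1198, 1293, 1376, 1385, 1780])

shape, _, scale = weibull_min.fit(t140, floc=0)   # fix location at zero
print(shape, scale)              # roughly gamma = 2.5, theta = 943

# Regression of the medians (Table 22.4) on 1/P:  ln t = ln k + c/P.
P = np.array([100.0, 120.0, 140.0])
t50 = np.array([9050.0, 4681.0, 821.0])
c, lnk = np.polyfit(1.0 / P, np.log(t50), 1)
print(np.exp(lnk), c)            # about k = 3.319 and c = 811.456
for Po in (80.0, 90.0):          # extrapolated medians at design pressures
    print(Po, np.exp(lnk + c / Po))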

22.3.4 The Arrhenius Model

Elevated temperature is the most commonly used environmental stress in accelerated life testing of microelectronic devices. The effect of temperature on a device is generally modeled using the Arrhenius reaction rate equation

r = A e^{−Ea/(kT)} (22.14)

where r is the speed of reaction, A is an unknown non-thermal constant, Ea (eV) is the activation energy (i.e. the energy that a molecule must have before it can take part in the reaction), k is the Boltzmann constant (8.623 × 10⁻⁵ eV K⁻¹), and T (K) is the temperature.

The activation energy Ea determines the slope of the reaction rate curve with temperature, i.e. it describes the accelerating effect that temperature has on the rate of a reaction, and it is expressed in electron volts (eV). For most applications, Ea is treated as the slope of a curve rather than as a specific energy level. A low value of Ea indicates a small slope, i.e. a reaction with little dependence on temperature, whereas a large value of Ea indicates a high degree of temperature dependence [3].

Assuming that device life is proportional to the inverse reaction rate of the process, Equation 22.14 can be rewritten in terms of life as

L = A e^{Ea/(kT)}

The median lives of the units at the normal operating temperature, L_o, and at the accelerated temperature, L_s, are related by

L_o/L_s = A e^{Ea/(kT_o)} / A e^{Ea/(kT_s)}

or

L_o = L_s exp[(Ea/k)(1/T_o − 1/T_s)] (22.15)

The thermal acceleration factor is

A_F = exp[(Ea/k)(1/T_o − 1/T_s)]


Table 22.5. Failure time data (hours) for oxide breakdown

180 °C    150 °C
112       162
260       188
298       288
327       350
379       392
487       681
593       969
658       1303
701       1527
720       2526
734       3074
736       3652
775       3723
915       3781
974       4182
1123      4450
1157      4831
1227      4907
1293      6321
1335      6368
1472      7489
1529      8312
1545      13778
2029      14020
4568      18640

The calculation of the median life (or of a percentile of failed units) depends on the failure time distribution. When the sample size is small, it becomes difficult to obtain accurate results. In this case, it is advisable to use several different percentiles of failures and obtain a weighted average of the median lives. One of the drawbacks of this model is the inability to obtain a reliability function that relates the failure times under stress conditions to the failure times under normal conditions; we can only obtain a point estimate of life. We now illustrate the use of the Arrhenius model in predicting median life under normal operating conditions.

Example 3. The gate oxide in MOS devices is often a source of device failure, especially for high-density device arrays that require thin gate oxides. The reliability of MOS devices on bulk silicon and the gate oxide integrity of these devices have been the subject of investigation over the years. A producer of MOS devices conducts an accelerated test to determine the expected life at 30 °C. Two samples of 25 devices each are subjected to stress levels of 150 °C and 180 °C. Oxide breakdown is declared when the potential across the oxide reaches a threshold value. The breakdown times are recorded in Table 22.5. The activation energy of the device is 0.42 eV. Obtain the reliability function of these devices.

Figure 22.5. Weibull probability plot indicating appropriate fit of data

Solution. Figure 22.5 shows that it is appropriate to fit the failure data using Weibull distributions with shape parameters approximately equal to unity. This means that the data can be represented by exponential distributions, with means of 1037 h and 4787 h at 180 °C and 150 °C respectively. We therefore determine the 50th percentiles for these temperatures using Equation 22.8 as 719 h and 3318 h respectively. Then

3318 = 719 exp[(0.42/k)(1/(150 + 273) − 1/(180 + 273))]

which results in k = 4.2998 × 10⁻⁵.


The median life under normal conditions of 30 °C is

L_30 = 719 exp[(0.42/(4.2998 × 10⁻⁵))(1/(30 + 273) − 1/(180 + 273))] = 31.0918 × 10⁶ h

The mean life is 44.8561 × 10⁶ h and the reliability function is

R(t) = e^{−t/(44.8561 × 10⁶)} = e^{−2.23 × 10⁻⁸ t}
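The two-point calculation above reduces to a few lines (a sketch):

import numpy as np

Ea = 0.42                          # activation energy (eV)
L180, L150 = 719.0, 3318.0         # 50th percentiles at 180 and 150 C

# Solve L150 = L180 * exp[(Ea/k)(1/423 - 1/453)] for the fitted constant k.
k = Ea * (1 / (150 + 273) - 1 / (180 + 273)) / np.log(L150 / L180)
L30 = L180 * np.exp((Ea / k) * (1 / (30 + 273) - 1 / (180 + 273)))
print(k, L30, L30 / np.log(2))     # about 4.3e-5, 3.11e7 h and 4.49e7 h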

22.3.5 Non-parametric Accelerated Life Testing Models: Cox’s Model

Non-parametric models relax the assumption about the failure time distribution, i.e. no predetermined failure time distribution is required. Cox's proportional hazards (PH) model [5, 6] is the most widely used among the non-parametric models. It has been used successfully in the medical area for survival analysis, and in the past few years Cox's model and its extensions have been applied in the reliability analysis of ALT. Cox made two important contributions: the first is the model now referred to as the PH model; the second is a new parameter estimation method, later referred to as maximum partial likelihood.

The PH model is structured around the hazard function λ(t; z), the hazard at time t for a vector of covariates (or stresses) z. It is the product of an unspecified baseline hazard function λ0(t) and a relative risk ratio: λ0(t) is a function of time t alone and is independent of z, whereas the relative risk ratio is a function of z only. Thus, when the function exp(·) is taken as the relative risk ratio, the PH model has the form

λ(t; z) = λ0(t) exp(βz) = λ0(t) exp(Σ_{j=1}^p βj zj) (22.16)

where

z = (z1, z2, . . . , zp)^T is the vector of covariates (or applied stresses), “T” indicating transpose; for ALT it is the column vector of the stresses used in the test and/or their interactions;
β = (β1, β2, . . . , βp)^T is the vector of unknown regression coefficients;
p is the number of covariates.

The PH model implies that the ratio of the hazard rate functions for any two units associated with different vectors of covariates, z1 and z2, is constant with respect to time. In other words, λ(t; z1) is directly proportional to λ(t; z2).

If the baseline hazard function λ0(t) is specified, then the usual maximum likelihood method can be used to estimate β. However, for the PH model the regression coefficients can be estimated without specifying the form of λ0(t); the partial likelihood method is used to estimate β.

Suppose that we have a random sample of n units; let t1 < t2 < · · · < tk represent the k (k ≤ n) distinct failure times, with corresponding covariates z1, z2, . . . , zk. If k < n, then the remaining n − k units are assumed censored. The partial likelihood has the form

L(β) = Π_{i=1}^k [ exp(βz_i) / Σ_{l∈R(t_i)} exp(βz_l) ] (22.17)

where R(t_i), the risk set at t_i, denotes the set of units that have survived and are uncensored just prior to t_i.

To accommodate tied survival times, Breslow [7] modified the above form to

L(β) = Π_{i=1}^k { exp(βS_i) / [ Σ_{l∈R(t_i)} exp(βz_l) ]^{d_i} } (22.18)

where d_i is the number of failure times equal to t_i, and S_i is the sum of the vectors z for these d_i units.

Though λ0(t) is not specified in the PH model, the only unknown parameter in the partial likelihood function is β. Therefore, the vector β can be estimated by maximizing the log partial likelihood using numerical methods, such as the Newton–Raphson method. The estimate of β should be checked for significance using analytical or graphical methods. Insignificant covariates can be discarded from the model, and a new estimate of β, corresponding to the remaining covariates, recalculated.

After estimating the unknown parameters β1, β2, . . . , βp by the maximum partial likelihood method, the remaining unknown in the PH model is λ0(t). Some authors [8–10] suggest the use of a piecewise step-function estimate, in which λ0(t) is assumed constant between subdivisions of the time scale. The time scale can be subdivided at arbitrary convenient points or at the points where failures occur. This idea of estimating λ0(t) is a variation of the Kaplan–Meier product limit estimate; the only difference is that, for the PH model, failure data are collected at different stresses and need to be weighted and transformed by the relative risk ratio to a comparable scale before the Kaplan–Meier method is used. Another method to estimate λ0(t), proposed by Anderson and Senthilselvan [11], gives a piecewise smooth estimate. They assumed λ0(t) to be a quadratic spline instead of being constant between time intervals. They also suggest subtracting a roughness penalty function, K0 ∫ [λ0′(t)]² dt, from the log-likelihood and then maximizing the penalized log-likelihood in order to obtain a smooth estimate of λ0(t). To determine K0, Anderson and Senthilselvan [11] plotted the estimates for several values of K0 and chose the value that seemed to “best represent the data”; K0 may be adjusted if the hazard function is negative for some t.

After the unknown parameters β1, β2, . . . , βp and the unknown baseline hazard rate function λ0(t) in the PH model are estimated, these estimates can be used in the PH model to estimate the reliability at any stress level and time.

Example 4. The need for long-term reliability in electronic systems, which are subject to time-varying environmental conditions and experience various operating modes or duty cycles, introduces complexity into the development of a thermal control strategy. Coupling this with other applied stresses, such as the electric field, results in significant deterioration of reliability. Therefore, the designer of a new electronic packaging system wishes to determine the expected life of the design at design conditions of 30 °C and 26.7 V. The designer conducted an accelerated test at several temperatures and voltages and recorded the failure times shown in Table 22.6. Plot the reliability function and determine the MTTF under design conditions.

Solution. This ALT has two types of stress: temperature and voltage. The sample sizes are relatively small, so it is appropriate to use a non-parametric method to estimate the reliability and the MTTF. We utilize the PH model and express it as

λ(t; T, V) = λ0(t) exp(β1/T + β2 V)

where T (K) is the temperature and V is the voltage. The parameters β1 and β2 are determined using SAS PROC PHREG [12]; their values are −24 538 and 30.332 739 respectively. The unspecified baseline hazard function can be estimated using any of the methods described by Kalbfleisch and Prentice [9] and Elsayed [3]. Using these estimates, we can estimate the reliability function under design conditions. The reliability values are then plotted and a regression model is fit accordingly, as shown in Figure 22.6.

We assume a Weibull baseline hazard function and fit the reliability values obtained under design conditions to the Weibull function, which results in

R(t; 30 ◦C, 26.7 V) = exp(−1.550 65 × 10^−8 t²)

The MTTF is obtained using the mean time to failure expression for the Weibull model as shown by Elsayed [3]:

MTTF = θ^(1/γ) Γ(1 + 1/γ) = 7166 h
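Worked numerically, a minimal sketch assuming the standard Weibull parameterization R(t) = exp(−t^γ/θ) implied by the fitted curve (so γ = 2 and θ = 1/1.550 65 × 10^−8):

```python
from math import gamma

# Sketch of the final step: MTTF = theta**(1/g) * Gamma(1 + 1/g).
g = 2.0
theta = 1.0 / 1.55065e-8
mttf = theta ** (1.0 / g) * gamma(1.0 + 1.0 / g)
print(mttf)   # ~7.1e3 h; within about 1% of the quoted 7166 h,
              # the difference being rounding of the reported coefficient
```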


Table 22.6. Failure times (hours) at different temperatures and voltages

Failure time    Temperature (◦C)    Z1 = 1/(Temperature + 273.16)    Z2 (V)

  1             25                  0.003 353 9                      27
  1             25                  0.003 353 9                      27
  1             25                  0.003 353 9                      27
 73             25                  0.003 353 9                      27
101             25                  0.003 353 9                      27
103             25                  0.003 353 9                      27
148             25                  0.003 353 9                      27
149             25                  0.003 353 9                      27
153             25                  0.003 353 9                      27
159             25                  0.003 353 9                      27
167             25                  0.003 353 9                      27
182             25                  0.003 353 9                      27
185             25                  0.003 353 9                      27
186             25                  0.003 353 9                      27
214             25                  0.003 353 9                      27
214             25                  0.003 353 9                      27
233             25                  0.003 353 9                      27
252             25                  0.003 353 9                      27
279             25                  0.003 353 9                      27
307             25                  0.003 353 9                      27

  1             225                 0.002 007 4                      26
 14             225                 0.002 007 4                      26
 20             225                 0.002 007 4                      26
 26             225                 0.002 007 4                      26
 32             225                 0.002 007 4                      26
 42             225                 0.002 007 4                      26
 42             225                 0.002 007 4                      26
 43             225                 0.002 007 4                      26
 44             225                 0.002 007 4                      26
 45             225                 0.002 007 4                      26
 46             225                 0.002 007 4                      26
 47             225                 0.002 007 4                      26
 53             225                 0.002 007 4                      26
 53             225                 0.002 007 4                      26
 55             225                 0.002 007 4                      26
 56             225                 0.002 007 4                      26
 59             225                 0.002 007 4                      26
 60             225                 0.002 007 4                      26
 60             225                 0.002 007 4                      26
 61             225                 0.002 007 4                      26

22.4 Extensions of the Proportional Hazards Model

The PH model implies that the ratio of the hazard functions for any two units associated with different vectors of covariates, z1 and z2, is constant with respect to time. In other words, λ(t; z1) is directly proportional to λ(t; z2). This is the so-called PH model's hazard rate proportionality assumption. The PH model is a valid model to analyze ALT data only when the data satisfy the proportional hazard rate assumption. So, checking the validity of the PH model and the assumption of the covariates' multiplicative effect is critical for applying the model to the failure data.


Table 22.6. Continued.

Failure time    Temperature (◦C)    Z1 = 1/(Temperature + 273.16)    Z2 (V)

1365            125                 0.002 511 6                      25.7
1401            125                 0.002 511 6                      25.7
1469            125                 0.002 511 6                      25.7
1776            125                 0.002 511 6                      25.7
1789            125                 0.002 511 6                      25.7
1886            125                 0.002 511 6                      25.7
1930            125                 0.002 511 6                      25.7
2035            125                 0.002 511 6                      25.7
2068            125                 0.002 511 6                      25.7
2190            125                 0.002 511 6                      25.7
2307            125                 0.002 511 6                      25.7
2309            125                 0.002 511 6                      25.7
2334            125                 0.002 511 6                      25.7
2556            125                 0.002 511 6                      25.7
2925            125                 0.002 511 6                      25.7
2997            125                 0.002 511 6                      25.7
3076            125                 0.002 511 6                      25.7
3140            125                 0.002 511 6                      25.7
3148            125                 0.002 511 6                      25.7
3736            125                 0.002 511 6                      25.7

Figure 22.6. Reliability function under design conditions

When the proportionality assumption is violated, several extensions of the PH model have been proposed to handle the situation.

If the proportionality assumption is violated and there are one or more covariates that take values on q levels, a simple extension of the PH model is stratification [13], as given below:

λj(t; z) = λ0j(t) exp(βz)   j = 1, 2, . . . , q

A partial likelihood function can be obtained for each of the q strata and β is estimated by maximizing the product of the partial likelihoods of the q strata. The baseline hazard rate λ0j(t), estimated as before, is completely unrelated among the strata. This model is most useful when the covariate is categorical and of no direct interest.
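A small self-contained sketch (not from the chapter) may make the stratified partial likelihood concrete: each stratum contributes its own Cox partial likelihood with a common β, and the stratum-specific baselines λ0j(t) cancel within strata. The data are simulated with a single covariate; Breslow's convention is used for ties.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_partial_likelihood(beta, times, events, Z, strata):
    """Negative stratified Cox partial log-likelihood (Breslow ties)."""
    beta = np.atleast_1d(beta)
    total = 0.0
    for s in np.unique(strata):
        in_s = strata == s
        t, d, eta = times[in_s], events[in_s], Z[in_s] @ beta
        for i in np.where(d == 1)[0]:      # each failure in stratum s
            risk = t >= t[i]               # risk set at that failure time
            total += eta[i] - np.log(np.exp(eta[risk]).sum())
    return -total

rng = np.random.default_rng(1)
n = 60
Z = rng.uniform(0.0, 1.0, size=(n, 1))           # one covariate, e.g. stress
strata = np.repeat([0, 1], n // 2)               # two strata, own baselines
times = rng.exponential(1.0 / np.exp(1.5 * Z[:, 0]))  # true beta = 1.5
events = np.ones(n, dtype=int)                   # complete (uncensored) data

fit = minimize(neg_log_partial_likelihood, x0=[0.0],
               args=(times, events, Z, strata), method="BFGS")
print(fit.x)   # pooled estimate of beta across the strata
```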

Another extension of the PH model includes the use of time-dependent covariates. Kalbfleisch and Prentice [13] classified time-dependent covariates as internal and external. An internal covariate is the output of a stochastic process generated by the unit under study; it can be observed as long as the unit survives and is uncensored. An external covariate has a fixed value or is defined in advance for each unit under study.

Many other extensions exist in the literature [14]. However, one of the most generalized extensions is the extended linear hazard regression model proposed by Elsayed and Wang [15]. Both accelerated failure time models and PH models are indeed special cases of the generalized model, whose hazard function is

λ(t; z) = λ0(t e^((β0+β1t)z)) e^((α0+α1t)z)

where λ(t; z) is the hazard rate at time t and stress vector z, λ0(·) is the baseline hazard function, and β0, β1, α0, and α1 are constants. This model has been validated through extensive experimentation and simulation testing.

References

[1] Nelson WB. Accelerated life testing step-stress models and data analysis. IEEE Trans Reliab 1980;R-29:103–8.
[2] Meeker WQ, Hahn GJ. How to plan an accelerated life test: some practical guidelines. In: Statistical techniques, vol. 10, ASQC basic reference in QC. Milwaukee (WI): American Society for Quality Control; 1985.
[3] Elsayed EA. Reliability engineering. Massachusetts: Addison-Wesley; 1996.
[4] Tobias PA, Trindade D. Applied reliability. New York: Van Nostrand Reinhold; 1986.
[5] Cox DR. Regression models and life tables (with discussion). J R Stat Soc Ser B 1972;34:187–220.
[6] Cox DR. Partial likelihood. Biometrika 1975;62:267–76.
[7] Breslow NE. Covariance analysis of censored survival data. Biometrika 1974;66:188–90.
[8] Oakes D. Comment on "Regression models and life tables (with discussion)" by D. R. Cox. J R Stat Soc Ser B 1972;34:208.
[9] Kalbfleisch JD, Prentice RL. Marginal likelihood based on Cox's regression and life model. Biometrika 1973;60:267–78.
[10] Breslow NE. Comment on "Regression models and life tables (with discussion)" by D. R. Cox. J R Stat Soc Ser B 1972;34:187–202.
[11] Anderson JA, Senthilselvan A. Smooth estimates for the hazard function. J R Stat Soc Ser B 1980;42(3):322–7.
[12] Allison PD. Survival analysis using the SAS® system: a practical guide. Cary (NC): SAS Institute Inc.; 1995.
[13] Kalbfleisch JD, Prentice RL. Statistical analysis of failure time data. New York: Wiley; 1980.
[14] Wang X. An extended hazard regression model for accelerated life testing with time varying coefficients. PhD dissertation, Department of Industrial Engineering, Rutgers University, New Brunswick, NJ, 2001.
[15] Elsayed EA, Wang X. Confidence limits on reliability estimates using the extended linear hazards regression model. In: Ninth Industrial Engineering Research Conference, Cleveland, Ohio, May 21–23, 2000.


Accelerated Test Models with the Birnbaum–Saunders Distribution

Chapter 23

W. Jason Owen and William J. Padgett

23.1 Introduction
23.1.1 Accelerated Testing
23.1.2 The Birnbaum–Saunders Distribution
23.2 Accelerated Birnbaum–Saunders Models
23.2.1 The Power-law Accelerated Birnbaum–Saunders Model
23.2.2 Cumulative Damage Models
23.2.2.1 Additive Damage Models
23.2.2.2 Multiplicative Damage Models
23.3 Inference Procedures with Accelerated Life Models
23.4 Estimation from Experimental Data
23.4.1 Fatigue Failure Data
23.4.2 Micro-composite Strength Data

23.1 Introduction

When modeling the time to failure T of a device or component whose lifetime is considered a random variable, the reliability measure R(t), defined as

R(t) = Pr(T > t)

describes the probability of the component surviving to or exceeding a time t. In basic statistical inference procedures, various statistical cumulative distribution functions (CDFs), denoted by F(t), can be used to model a lifetime characteristic for a device on test. In doing so, the relationship between the reliability and CDF for T is straightforward: R(t) = 1 − F(t). Choosing a model for F is akin to assuming the behavior of observations of T, and frequent choices for CDFs in reliability modeling include the Weibull, lognormal, and gamma distributions. A description of these models is given by Mann et al. [1] and Meeker and Escobar [2], for example. This chapter will focus on the Birnbaum–Saunders distribution (B–S) and its applications in reliability and life testing, and this distribution will be described later in this section. The B–S fatigue life distribution has applications in the general area of accelerated life testing, and two approaches given here yield various three-parameter generalizations or extensions to the original form of their model. The first approach uses the (inverse) power law model to develop an accelerated form of the distribution using standard methods. The second approach uses different assumptions of cumulative damage arguments (with applications to strengths of complex systems) to yield various strength distributions that are similar in form. Estimation and asymptotic theory for the models are discussed with some applications to various data sets from engineering laboratories.

The following is a list of notation used throughout this chapter:

s, sij      strength variable, observed strength
t, tij      time to failure variable, observed failure time
F           CDF
f           probability density function (PDF)
E(T)        expected value of random variable T
Φ           standard normal (Gaussian) CDF
α, β, δ, γ, η, θ, ρ, ψ   various CDF model parameters
V           level of accelerated stress
L, Li       system size, for the ith system, i = 1, . . . , k
L           likelihood function
sp          100pth percentile of the distribution

23.1.1 Accelerated Testing

Many high-performance items made today can have extremely large reliabilities (and thus tend not to wear out as easily) when they are operating as intended. For example, through advances in engineering science, a measurement such as time to failure for an electronic device could be quite large. For these situations, calculating the reliability is still of interest to the manufacturer and to the consumer, but a long period of time may be necessary to obtain sufficient data to estimate reliability. This is a major issue, since it is often the case with electronic components that the length of time required for the test may actually eclipse the "useful lifetime" of the component. In addition, performing a test over a long time period (possibly many years) could be very costly. One solution to this issue of obtaining meaningful life test data in a timely fashion is to perform an accelerated life test (ALT). This procedure involves observing the performance of the component at a higher than usual level of operating stress to obtain failures more quickly. Types of stress or acceleration variables that have been suggested in the literature include temperature, voltage, pressure, and vibration [1, 3].

Obviously, it can be quite a challenge to use the failure data observed at the conditions of higher stress to predict the reliability (or even the average lifetime) under the "normal use" conditions. To extrapolate from the observed accelerated stresses to the normal-use stress, an acceleration model is used to describe the relationship between the parameters of the reliability model and the accelerated stresses. This functional relationship usually involves some unknown parameters that will need to be estimated in order to draw inference at the normal-use stress. An expository discussion of ALTs is given by Padgett [4].

Many different parametric acceleration models have been derived from the physical behavior of the material under the elevated stress level. Common acceleration models include the (inverse) power law, Arrhenius, and Eyring models [4]. The power law model has applications in the fatigue failure of metals, dielectric failure of capacitors under increased voltage, and the aging of multi-component systems. This model will be used in this chapter, and it is given by

h(V) = γV^(−η)   (23.1)

where V represents the level of accelerated stress. Typically, h(V) is substituted for a mean or scale parameter in a baseline lifetime distribution. Thus, by performing this substitution, it is assumed that the increased environmental stress has the effect of changing the mean or scale of the lifetime model, but the distribution "family" remains intact. The parameters γ and η in h(V) are (usually) unknown and are to be estimated (along with any other parameters in the baseline failure model) using life-test data observed at elevated stress levels V1, V2, . . . , Vk. ALT data are often represented in the following way: let tij characterize the observed failure (time) for an item under an accelerated stress level Vi, where j = 1, 2, . . . , ni and i = 1, 2, . . . , k. After parameter estimation using the ALT data, inferences can be drawn for other stress levels not observed, such as the "normal use" stress V0. Padgett et al. [5] provide an example using the Weibull distribution with the power-law model (Equation 23.1) substituted for the Weibull scale parameter; this accelerated model was used for fitting tensile strength data from an experiment testing various gage lengths of carbon fibers and tows (micro-composites). In this application, "length" is considered an acceleration variable, since longer specimens of these materials tend to exhibit lower strengths due to increased occurrences of inherent flaws that ultimately weaken the material [5].

23.1.2 The Birnbaum–Saunders Distribution

While considering the physical properties of fatigue failure for materials subjected to cyclic stresses and strains, Birnbaum and Saunders [6] derived a new family of lifetime distributions. In their derivation, they assume (as is often the case with metals and concrete structures) that failure occurs by the progression of a dominant crack within the material to an eventual "critical length", which ultimately results in fatigue failure. Their derivation resulted in the CDF

FB–S(t; α, β) = Φ[(1/α)(√(t/β) − √(β/t))]   t > 0; α, β > 0   (23.2)

where Φ[·] represents the standard normal CDF. A random variable T following the distribution in Equation 23.2 will be denoted as T ∼ B–S(α, β). The parameter α is a shape parameter and β is a scale parameter; also, β is the median for the distribution. This CDF has some interesting qualities, and many are listed by Birnbaum and Saunders [6, 7]. Of interest for this discussion are the first and first reciprocal moments

E(T) = β(1 + α²/2)   E(T^−1) = (1 + α²/2)/β   (23.3)

and the PDF for T given by

F′B–S(t; α, β) = f(t; α, β) = (1/(2αβ)) [√(β/t) + (β/t)^(3/2)] (1/√(2π)) exp[−(1/(2α²))(t/β − 2 + β/t)]
t > 0; α, β > 0   (23.4)
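The CDF and PDF are straightforward to evaluate numerically. A minimal sketch transcribing Equations 23.2–23.4 (the parameter values are arbitrary illustrations):

```python
import numpy as np
from scipy.stats import norm

def bs_cdf(t, alpha, beta):
    # Equation 23.2
    return norm.cdf((np.sqrt(t / beta) - np.sqrt(beta / t)) / alpha)

def bs_pdf(t, alpha, beta):
    # Equation 23.4, written as phi(xi) * d(xi)/dt
    xi = (np.sqrt(t / beta) - np.sqrt(beta / t)) / alpha
    dxi = (np.sqrt(beta / t) + (beta / t) ** 1.5) / (2.0 * alpha * beta)
    return norm.pdf(xi) * dxi

alpha, beta = 0.5, 100.0
print(bs_cdf(beta, alpha, beta))        # 0.5, since beta is the median
print(beta * (1.0 + alpha ** 2 / 2.0))  # E(T) from Equation 23.3

t = np.linspace(1e-3, 2000.0, 200001)   # numerical check: pdf integrates to ~1
print(np.sum(bs_pdf(t, alpha, beta)) * (t[1] - t[0]))
```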

Desmond [8] showed that Equation 23.2 also describes the failure time that is observed when any amount of accumulating damage (not unlike crack propagation) exceeds a critical threshold. This research has been the foundation of the "cumulative damage" models that will be illustrated in Section 23.2.2.

Statistical inference for data following Equation 23.2 has been developed for single samples. To estimate the parameters in Equation 23.2 for a data set, the maximum likelihood estimators (MLEs) of α and β must be found using an iterative, root-finding technique. The equations for the MLEs are given in Birnbaum and Saunders [7]. Achcar [9] derived an approximation to the Fisher information matrix for the Birnbaum–Saunders distribution by using the Laplace approximation [10] for complicated integrals, and used the information matrix to provide some Bayesian approaches to estimation using Jeffreys' (non-informative) priors for α and β. The Fisher information matrix for Equation 23.2 is also useful for other inferences (e.g. confidence intervals) using the asymptotic normality theory for MLEs (see Lehmann and Casella [11] for a comprehensive discussion on this topic).

The remainder of this chapter is outlined as follows. In Section 23.2, the power-law accelerated Birnbaum–Saunders distribution will be presented, using the methods discussed in this section. In addition, a family of cumulative damage models developed for estimation of the strength of complex systems will be presented. These models assume various damage processes for material failure, and it will be shown that they possess a generalized Birnbaum–Saunders form, not unlike the ALT Birnbaum–Saunders form. Estimation techniques and large-sample theory will be discussed for all of these models in Section 23.3. Applications of the models will be discussed in Section 23.4, along with some conclusions.

23.2 Accelerated Birnbaum–Saunders Models

In this section, various accelerated life models will be presented that follow a generalized Birnbaum–Saunders form. The five models that will be presented are similar in that they represent three-parameter generalizations of the original two-parameter form given by Equation 23.2. A more detailed derivation for each model, along with other theoretical aspects, can be found in the accompanying references.

23.2.1 The Power-law Accelerated Birnbaum–Saunders Model

Owen and Padgett [12] describe an accelerated life model using the models in Equations 23.1 and 23.2. The power law model (Equation 23.1) was substituted for the scale parameter β in Equation 23.2 to yield

G(t; V) = FB–S[t; α, h(V)] = Φ[(1/α)(√(t/(γV^−η)) − √(γV^−η/t))]   t > 0; α, γ, η > 0   (23.5)

Note that Equation 23.5 is a distribution with three parameters; it would represent an overparameterization of Equation 23.2 unless at least two distinct levels of V are considered in an experiment (i.e. k ≥ 2). The mean for the distribution in Equation 23.5 can easily be calculated from the equation given in Equation 23.3 by substituting the power law model (Equation 23.1) for β. Thus, it can be seen that as V (the level of accelerated stress) increases, the expected value for T decreases. Owen and Padgett [12] used the accelerated model (Equation 23.5) to fit observed cycles-to-failure data of metal "coupons" (rectangular strips) that were stressed and strained at three different levels (see Birnbaum and Saunders [7]). A discussion of those results will be given in Section 23.4.
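A short sketch of Equation 23.5 and the resulting mean life, obtained by substituting h(V) for β in Equation 23.3; the parameter values below are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def accelerated_bs_cdf(t, V, alpha, gam, eta):
    # Equation 23.5: h(V) = gam * V**(-eta) substituted for beta
    beta_V = gam * V ** (-eta)
    return norm.cdf((np.sqrt(t / beta_V) - np.sqrt(beta_V / t)) / alpha)

def mean_life(V, alpha, gam, eta):
    # E(T) = h(V) * (1 + alpha^2/2), Equation 23.3 with beta = h(V)
    return gam * V ** (-eta) * (1.0 + alpha ** 2 / 2.0)

alpha, gam, eta = 0.4, 5.0e6, 2.5          # hypothetical values
for V in (10.0, 20.0, 40.0):               # raising the stress shortens life
    print(V, mean_life(V, alpha, gam, eta),
          accelerated_bs_cdf(1000.0, V, alpha, gam, eta))
```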

The ALT model (Equation 23.5) is related to the investigation by Rieck and Nedelman [13], where a log–linear form of the Birnbaum–Saunders model (Equation 23.2) was considered. If T ∼ B–S(α, β), then the random variable can be expressed as T = βX, where X ∼ B–S(α, 1) since β is a scale parameter. Therefore, if β is replaced by the acceleration model exp[a + bV], the random variable has the log–linear form since ln(T) = a + bV + ln(X), where the error term ln(X) follows the sinh–normal distribution [13]. Thus, least-squares estimates (LSEs) of a and b can be found quite simply using the tij data defined previously, but the parameter α (confounded in the error term) can only be estimated using alternative methods. These are described by Rieck and Nedelman [13], along with the properties of the LSEs of a and b. Their approach and the acceleration model selected, although functionally related to Equation 23.1, make the log–linear model attractive in the general linear model (GLM) framework. Owen and Padgett [12] use the more common parameterization (Equation 23.1) of the power-law model and likelihood-based inference. The discussion of inference for Equation 23.5 is given in the next section.

23.2.2 Cumulative Damage Models

Many high-performance materials are actually complex systems whose strength is a function of many components. For example, the strength of a fibrous carbon composite (carbon fibers embedded in a polymer material) is essentially determined by the strength of the fibers. When a composite specimen is placed under tensile stress, the fibers themselves may break within the material, which ultimately weakens the material. Since industrially made carbon fibers are known to contain random flaws in the form of cracks, voids, etc., and these flaws tend to weaken the fibers [14] and ultimately the composite, statistical distributions have often been proposed to model composite strength data. It is important to note that since carbon fibers are often on the order of 8 µm in diameter, it is nearly impossible to predict when a specimen is approaching failure based on "observable damage" (like cracks or abrasions, as is the case with metals or concrete structures). The models that have been used in the past are based on studies of "weakest link" [5, 15] or competing risk models [16]. Recent investigations on the tensile strengths for fibrous carbon composite materials [17, 18] have yielded new distributions that are based on "cumulative damage" arguments. This method models the (random) damage a material aggregates as stress is steadily increased; the damage accumulates until it exceeds the strength of the material. The models also contain a length or "size" variable, given by L, which accounts for the scale of the system, and the derivation for these models will now be summarized. Four models using the cumulative damage approach will be given here, and for simplicity, the CDFs will be distinguished by Fl(s; L), where s represents the strength of the specimen and the model type l = 1, 2, 3, 4. The simplicity of this notation will be evident later.

Suppose that a specimen of length or "size" L is placed under a tensile load that is steadily increased until failure. Assume that:

1. conceptually, the load is incremented by small, discrete amounts until system failure;
2. each increment of stress causes a random amount of damage D > 0, with D having CDF FD;
3. the specimen has a fixed theoretical strength, denoted by ψ. The initial strength of the specimen is a random quantity, however, that represents the reduction of the specimen's theoretical strength by an amount of "initial damage" from the most severe flaw in the specimen. The random variable W represents the initial strength of the system.

As the tensile load is incremented in this fashion, the cumulative damage after n + 1 increments of stress is given by [8, 17]

Xn+1 = Xn + Dn h(Xn)   (23.6)

where Dn > 0 is the damage incurred at load increment n + 1, for n = 0, 1, 2, . . . , and the Dn values are independent and distributed according to FD. The function h(u) is called the damage model function. If h(u) = 1 for all u, the incurred damage is said to be additive; if h(u) = u for all u, the incurred damage to the specimen is said to be multiplicative.

Let N be the number of increments of tensile stress applied to the system until failure. Then, N = sup_n{n : X1 < ψ, . . . , Xn−1 < ψ}, and N = 1 if the set is empty. Let Fn(w) = P(N > n | W = w). Now, since ψ is only the theoretical strength, we alternately define N according to W, the initial strength.

Let X0 denote the "initial damage" due to flaws or other pre-existing weakening damage. With additive damage, W = ψ − X0, so that N = sup_n{n : X1 − X0 < ψ − X0, . . . , Xn−1 − X0 < ψ − X0}. With multiplicative damage, W = ψ/X0 with X0 > 1, so that N = sup_n{n : X1/X0 < ψ/X0, . . . , Xn−1/X0 < ψ/X0}. Regardless of how W is defined, the survival probability of the specimen after n increments of stress is simply [17]

P(N > n) = ∫0^∞ Fn(w) dGW(w)   (23.7)

with GW(w) representing the distribution for initial strength of the specimen. The distribution function Fn(w) depends on the type of damage model h(u). In either case, Fn(w) can be approximated for large n for both additive and multiplicative damage.

23.2.2.1 Additive Damage Models

For an additive damage model, h(u) = 1 in Equation 23.6. Since Dn = Xn+1 − Xn, the damage incurred to the specimen after n increments of stress is Σ_(i=0)^(n−1) Di = Σ_(i=0)^(n−1) (Xi+1 − Xi) = Xn − X0. Using the idea that the increments of stress are small (so that the number of increments will often be very large), by the central limit theorem, Xn − X0 has an approximate normal (Gaussian) distribution with mean nµ and variance nσ², where µ and σ² are the mean and variance of D. Therefore, for large n

Fn(w) = P(Xn − X0 ≤ w) ≅ Φ[(w − nµ)/(n^(1/2)σ)]   (23.8)

Substituting Equation 23.8 and supplying an appropriate distribution for the initial strength in Equation 23.7 gives an expression for the survival probability of the specimen after n increments of stress.

Durham and Padgett [17] proposed two distributions for GW(w), the initial strength of the specimen of length L. The first is a three-parameter Weibull distribution given by

GW(w) = 1 − exp{−L[(w − w0)/δ]^ρ}   (23.9)

Using Equations 23.9 and 23.8 in Equation 23.7 and letting S be the total tensile load after N increments, the "additive Gauss–Weibull" strength distribution for the specimen of length L is obtained and is given by

F1(s; L) = Φ[(1/α)(√s/β − L^(−1/ρ)/√s)]   s > 0   (23.10)

where α and β represent two new parameters obtained in the derivation of Equation 23.10 and are actually functions of the parameters δ, ρ, µ, and σ. See Padgett and coworkers [17, 19] for the specifics of the derivation.

The second initial strength model assumes a "flaw process" over the specimen of length L that accounts for the reduction of theoretical strength due to the most severe flaw. This process is described by a standard Brownian motion, which yields an initial strength distribution found by integrating the tail areas of a Gaussian PDF. The PDF for W is given by [17]

G′W(w) = gW(w) = (2/√(2πL)) exp[−(w − ψ)²/(2L)]   0 < w < ψ   (23.11)

Substituting Equations 23.11 and 23.8 in Equation 23.7 and letting S be the total tensile load after N increments (again, see Durham and Padgett [17] for the mathematical details), the "additive Gauss–Gauss" strength distribution for the specimen of length L is given by

F2(s; L) = Φ[(1/α)(√s/β − (ψ − √(2L/π))/√s)]   s > 0   (23.12)

23.2.2.2 Multiplicative Damage Models

For a multiplicative damage model, W = ψ/X0 and h(u) = u in Equation 23.6. Since Dn = (Xn+1 − Xn)/Xn, the damage incurred to the specimen after n increments of stress is

Σ_(i=0)^(n−1) Di = Σ_(i=0)^(n−1) (Xi+1 − Xi)/Xi ≈ ln(Xn) − ln(X0)

since n will be large. Thus, for large n, Xn/X0 has an approximate log-normal distribution (by the central limit theorem) with parameters nµ and nσ². Therefore, under multiplicative damage, we have

Fn(w) = P(Xn/X0 ≤ w) ≅ Φ{[ln(w) − nµ]/(n^(1/2)σ)}   (23.13)

As with the additive damage case, two models have been proposed for the distribution of initial strength [18]. The first is the two-parameter Weibull distribution (this is Equation 23.9 with w0 = 0) and the second is the "flaw process" model given by the PDF (Equation 23.11). Using each of these with Equation 23.13 in Equation 23.7, the "multiplicative Gauss–Weibull" model and the "multiplicative Gauss–Gauss" model [18] are

F3(s; L) = Φ[(1/α)(√s/β − (γ − ln(L))/√s)]   s > 0; α, β, γ > 0   (23.14)

and

F4(s; L) = Φ[(1/α)(√s/β − (ln(ψ) − √(2L/π)/ψ − L/(2ψ²))/√s)]   s > 0; α, β, ψ > 0   (23.15)

respectively.

The cumulative damage models presented in this section have been used by Padgett and coworkers [17–19] to model strength data from an experiment with 1000-fiber tows (micro-composite specimens) tested at several gage lengths. A discussion of these will be given in Section 23.4. It is important to note that although the cumulative damage models presented in this section are not "true" ALT models, in that they were not derived from a baseline model, they do resemble them, since the length variable L acts as a level of accelerated stress (equivalent to the V variable from Section 23.1.2). Obviously, the longer the specimen, the weaker it becomes due to an increased occurrence of flaws and defects. Estimation and other inference procedures for these models are discussed by Padgett and coworkers [17–19]. These matters will be summarized in the next section.

23.3 Inference Procedures with Accelerated Life Models

Parameter estimation based on likelihood methods has been derived for the models given in Section 23.2. For the model in Equation 23.5 the MLEs for α, γ, and η are given by the values that jointly maximize the likelihood function

L(α, γ, η) = Π_(i=1)^k Π_(j=1)^(ni) g(tij; Vi)

where g(t; V) represents the PDF for the distribution (Equation 23.5) and the significance of the values i, j, k, ni and tij are defined in Section 23.1.1. For simplicity, it is often the case that the natural logarithm of L is maximized, since this function is usually more tractable. Owen and Padgett [12] provide the likelihood equations for maximization of the likelihood function, along with other inferences (e.g. approximate variances of the MLEs), but since the methods are similar for the cumulative damage models, these will not be presented here.

Durham and Padgett [17] provide estimation procedures for the additive damage models (Equations 23.10 and 23.12). The authors use the fact that the Birnbaum–Saunders models can be approximated by the first term of an inverse Gaussian CDF [20, 21]. This type of approximation is very useful, since the inverse Gaussian distribution is an exponential form and parameter estimates are (generally) based on simple statistics of the data. Owen and Padgett [19] observed that the models in Equations 23.10, 23.12, 23.14, and 23.15 all have a similar structure, and they derived a generalized three-parameter Birnbaum–Saunders model of the form

Fl(s; L) = Φ[(1/α)(√s/β − λl(θ; L)/√s)]   s > 0; α, β, θ > 0; l = 1, 2, 3, 4   (23.16)

where the "acceleration function" λl(θ; L) corresponds to a known function of L, the length variable, and θ, a model parameter. For example, for the "additive Gauss–Weibull" model (Equation 23.10), the acceleration function is λ1(θ; L) = L^(−1/θ). These functions are different for the four cumulative damage models, but the overall functional form for F is similar. This fact allows for a unified theory of inference using Equation 23.16, which we will now describe [19].

The PDF for Equation 23.16 is

fl(s; L) = [(s + βλl(θ; L))/(2√(2π) αβ s^(3/2))] exp{−[s − βλl(θ; L)]²/(2α²β²s)}   s > 0   (23.17)

Using Equation 23.17 to form the likelihood function L(α, β, θ) for strength data observed over the k gage lengths L1, L2, . . . , Lk, the log-likelihood can be expressed as

ℓ = ln[L(α, β, θ)]
  = −m ln(2√(2π)) − m ln(α) − m ln(β)
    + Σ_(i=1)^k Σ_(j=1)^(ni) ln{[sij + βλl(θ; Li)]/sij^(3/2)}
    − [1/(2α²β²)] Σ_(i=1)^k Σ_(j=1)^(ni) [sij − βλl(θ; Li)]²/sij

where m = Σ_(i=1)^k ni. The three partial derivatives of ℓ with respect to each of α, β, and θ are nonlinear likelihood equations that must be solved numerically. The equation for α can be solved in terms of β and θ, but each equation for β and θ must be solved numerically (i.e. the roots must be found, given values for the other two parameters), as follows:

α = {[1/(mβ²)] Σ_(i=1)^k Σ_(j=1)^(ni) [sij − βλl(θ; Li)]²/sij}^(1/2)   (23.18)

0 = −m/β + Σ_(i=1)^k λl(θ; Li) Σ_(j=1)^(ni) [sij + βλl(θ; Li)]^(−1)
    + [1/(α²β³)] Σ_(i=1)^k Σ_(j=1)^(ni) [sij − βλl(θ; Li)]²/sij
    + [1/(α²β²)] Σ_(i=1)^k λl(θ; Li) Σ_(j=1)^(ni) [sij − βλl(θ; Li)]/sij   (23.19)

0 = β Σ_(i=1)^k [∂λl(θ; Li)/∂θ] Σ_(j=1)^(ni) [sij + βλl(θ; Li)]^(−1)
    + [1/(α²β)] Σ_(i=1)^k ni [∂λl(θ; Li)/∂θ]
    − (1/α²) Σ_(i=1)^k λl(θ; Li) [∂λl(θ; Li)/∂θ] Σ_(j=1)^(ni) 1/sij   (23.20)

Iteration over these three equations with the specific acceleration function λl(θ; L) can be performed using a nonlinear root-finding technique until convergence to an approximate solution for the MLEs α̂, β̂, and θ̂. A procedure for obtaining the starting values for the parameters that are necessary for initiating the iterative root-finding procedure is detailed by Owen and Padgett [19]. From Equation 23.20, we see that an additional prerequisite for the acceleration function λl(θ; L) in the generalized distribution (Equation 23.17) is that λl(θ; L) must be differentiable with respect to θ for the MLEs to exist.
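In practice, the same MLEs can also be obtained by maximizing ℓ directly with a general-purpose optimizer instead of iterating Equations 23.18–23.20. The sketch below does this for the "additive Gauss–Weibull" acceleration function λ1(θ; L) = L^(−1/θ), on data simulated from Equation 23.16 by solving (1/α)(√s/β − λ/√s) = z for s with z standard normal (this is the percentile formula, Equation 23.21 below, with zp replaced by a random draw). It is an illustration under hypothetical parameter values, not the authors' procedure:

```python
import numpy as np
from scipy.optimize import minimize

def lam(theta, L):
    return L ** (-1.0 / theta)        # additive Gauss-Weibull: lambda_1(theta; L)

def neg_loglik(params, s, L):
    alpha, beta, theta = params
    if min(alpha, beta, theta) <= 0.0:
        return np.inf
    lm = lam(theta, L)
    logf = (np.log(s + beta * lm)                       # log of Equation 23.17
            - np.log(2.0 * np.sqrt(2.0 * np.pi) * alpha * beta)
            - 1.5 * np.log(s)
            - (s - beta * lm) ** 2 / (2.0 * alpha ** 2 * beta ** 2 * s))
    return -np.sum(logf)

rng = np.random.default_rng(3)
a, b, th = 0.15, 0.8, 4.0                      # hypothetical true values
L = np.repeat([20.0, 50.0, 150.0, 300.0], 30)  # four gage lengths
z = rng.standard_normal(L.size)
s = (b ** 2 / 4.0) * (a * z + np.sqrt((a * z) ** 2 + 4.0 * lam(th, L) / b)) ** 2

fit = minimize(neg_loglik, x0=(0.2, 1.0, 3.0), args=(s, L), method="Nelder-Mead")
print(fit.x)   # should land near (0.15, 0.8, 4.0)
```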

The 100pth percentile for the generalized Birnbaum–Saunders distribution (Equation 23.17) for a given value of L and specified acceleration function λl(θ; L) can be expressed as [19]

sp(L) = (β²/4)[αzp + √((αzp)² + 4λl(θ; L)/β)]²   (23.21)

where zp represents the 100pth percentile of the standard normal distribution. The MLE for sp(L) can be obtained by substituting the MLEs α̂, β̂, and θ̂ into Equation 23.21. Estimated values for the lower percentiles of a strength distribution are often of interest to engineers for design issues with composite materials. In addition, approximate lower confidence bounds on sp(L) can be calculated using its limiting distribution via the Cramér delta method, and this method is outlined by Owen and Padgett [19].
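As an illustration of Equation 23.21, the sketch below evaluates the 10th-percentile estimate ŝ0.1(L) for the "multiplicative Gauss–Gauss" model, whose acceleration function is read off Equation 23.15, using the micro-composite MLEs reported later in Section 23.4.2. The printed values are recomputations for illustration, not the published table:

```python
import numpy as np
from scipy.stats import norm

def lam4(psi, L):
    # acceleration function of Equation 23.15 (theta is psi for l = 4)
    return np.log(psi) - np.sqrt(2.0 * L / np.pi) / psi - L / (2.0 * psi ** 2)

def s_p(p, L, alpha, beta, psi):
    zp = norm.ppf(p)                  # Equation 23.21
    return (beta ** 2 / 4.0) * (alpha * zp
            + np.sqrt((alpha * zp) ** 2 + 4.0 * lam4(psi, L) / beta)) ** 2

alpha, beta, psi = 0.14436, 0.83251, 34.254   # MLEs quoted in Section 23.4.2
for L in (20.0, 50.0, 150.0, 300.0):          # gage lengths (mm)
    print(L, round(s_p(0.10, L, alpha, beta, psi), 3))  # 10th percentile (GPa)
```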

For a large value of m, approximate confidence intervals for the parameters α, β, and θ can be based on the asymptotic theory for MLEs (conditions for this are given by Lehmann and Casella [11]). As m increases without bound (and ni/m → pi > 0 for all i = 1, . . . , k), the standardized distribution of τ̂ = (α̂, β̂, θ̂)′ approaches that of the trivariate normal distribution (TVN) with zero mean vector and identity variance–covariance matrix. Thus, the sampling distribution of τ̂ can be approximately expressed as τ̂ ∼ TVN(τ, Im^(−1)(τ)), which is the trivariate normal distribution with mean vector τ and variance–covariance matrix Im^(−1)(τ), the mathematical inverse of the Fisher information matrix for Equation 23.16 based on m independent observations. The negative expected values for the six second partial derivatives of the log-likelihood ℓ = ln L(α, β, θ) for Equation 23.16 are quite complicated but can be obtained. The quantities E(−∂²ℓ/∂α²), E(−∂²ℓ/∂α∂β) and E(−∂²ℓ/∂α∂θ) can be found explicitly, but the other three terms require an approximation for the intractable integral. The method used is explained in detail by Owen and Padgett [19], but here we simply present the six entries for Im(τ):

E(−∂²ℓ/∂α²) = 2m/α²   (23.22a)

E(−∂²ℓ/∂α∂β) = m/(αβ)   (23.22b)

E(−∂²ℓ/∂α∂θ) = −(1/α) Σ_(i=1)^k ni [(∂/∂θ)λl(θ; Li)]/λl(θ; Li)   (23.22c)


E(−∂²ℓ/∂β²) ≅ 3m/(4β²) + [1/(α²β³)] Σ_(i=1)^k ni λl(θ; Li)   (23.22d)

E(−∂²ℓ/∂β∂θ) ≅ −[1/(4β)] Σ_(i=1)^k ni [(∂/∂θ)λl(θ; Li)]/λl(θ; Li) + [1/(α²β²)] Σ_(i=1)^k ni (∂/∂θ)λl(θ; Li)   (23.22e)

E(−∂²ℓ/∂θ²) ≅ (3/4) Σ_(i=1)^k ni [(∂/∂θ)λl(θ; Li)/λl(θ; Li)]² + [1/(α²β)] Σ_(i=1)^k ni [(∂/∂θ)λl(θ; Li)]²/λl(θ; Li)   (23.22f)

Using these expressions, the asymptotic variances of the three MLEs can be found from the diagonal elements of Im^(−1)(τ), and these variances can be estimated by substitution of the MLEs for the three unknown parameters involved. The variances can be used to calculate approximate confidence intervals for the parameters and lower confidence bounds (LCBs) for sp(L) from Equation 23.21.

In the next section, the power-law accelerated Birnbaum–Saunders model (Equation 23.5) and the cumulative damage models from Section 23.2.2 will be used to fit accelerated life data taken from two different industrial experiments for materials testing. A discussion of the results will follow.

23.4 Estimation from Experimental Data

Here, we will look at two applications of the three-parameter accelerated life models from Section 23.2, using the methods of inference given in Section 23.3.

23.4.1 Fatigue Failure Data

In the original collection of articles by Birnbaum and Saunders [6, 7], a data set was published on the fatigue failure of aluminum coupons (rectangular strips cut from 6061-T6 aluminum sheeting). The coupons were subjected to repeated cycles of alternating stresses and strains, and the recorded measurement for a specimen on test was the number of cycles until failure. In the experiment, three levels of maximum stress for the periodic loading scheme were investigated (i.e. k = 3), and these stress levels were V1 = 21 000, V2 = 26 000, and V3 = 31 000 psi with respective sample sizes n1 = 101, n2 = 102, and n3 = 101. The data are given by Birnbaum and Saunders [7], and a scaled version of the data set is given by Owen and Padgett [12]. This data set was the motivation for the derivation of Birnbaum and Saunders' original model (Equation 23.2), and they used their distribution to fit each data set corresponding to one of the three maximum stress levels to attain three estimated lifetime distributions. Although the fitted distributions matched well with the empirical results, no direct inference could be drawn on how the level of maximum stress functionally affected the properties of the distribution. Owen and Padgett [12] used the power-law accelerated Birnbaum–Saunders model (Equation 23.5) to fit across all levels of maximum stress. For the specifics of the analyses, which include parameter estimates and approximate confidence intervals, see Owen and Padgett [12].

23.4.2 Micro-Composite Strength Data

Bader and Priest [22] obtained strength data (measured in giga-pascals, GPa) for 1000-fiber impregnated tows (micro-composite specimens) that were tested for tensile strength. In their experiment, several gage lengths were investigated: 20 mm, 50 mm, 150 mm, and 300 mm with 28, 30, 32, and 29 observed specimens tested at the respective gage lengths. These data are given explicitly by Smith [23]. The strength measurements range between 1.889 and 3.080 GPa over the entire data set, and, in general, it is seen that the specimens with longer gage lengths tend to have smaller strength measurements.


All of the cumulative damage models presented in Section 23.2.2 have been used to model these data, and the details of the procedures are given by Padgett and coworkers [17–19]. Owen and Padgett [19] compared the fits of the four models by observing their respective overall mean squared error (MSE); this is calculated by averaging the squared distance of the estimated model (using the MLEs) to the empirical CDF for each of the gage lengths over all of the m = 119 observations. A comparison was also made with the "power-law Weibull model" that was used to fit these strength data by Padgett et al. [5]. Overall, the "multiplicative Gauss–Gauss" strength model (Equation 23.15) had the smallest MSE of 0.004 56, compared with the other models, and this casts some degree of doubt on the standard "weakest link" theory with the Weibull model for these types of material. From [19], the MLEs of the parameters in Equation 23.15 were calculated to be α̂ = 0.144 36, β̂ = 0.832 51, and ψ̂ = 34.254 (recall that the parameter ψ represented the "theoretical strength" of the system). Using these MLEs in Equations 23.22a–f, the (approximate) quantities of the Fisher information matrix for Equation 23.15 can be estimated (for this case, the parameter θ is given by ψ). These estimates of Equations 23.22a–f are, respectively, 11 421.1037, 990.1978, −10.1562, 32 165.7520, 324.8901, and 3.4532. The diagonal elements of the inverse of the Fisher information matrix give the asymptotic variances of the MLEs, and the asymptotic standard deviations (ASDs) of the MLEs are estimated to be ASD(α̂) = 0.0105, ASD(β̂) = 0.0281 and ASD(ψ̂) = 2.7125. Using the asymptotic normality result stated in Section 23.3, the individual 95% confidence intervals for the three parameters are: (0.1238, 0.1649) for α, (0.7774, 0.8876) for β, and (28.9377, 39.5703) for ψ. An estimation of s0.1(L), the 10th percentile for the strength distribution (Equation 23.15) for the given values of L, and the 90% LCBs on s0.1(L) are presented by Owen and Padgett [19]. Thus, not only does the model in Equation 23.15 provide an excellent fit, it also gives insight into the physical nature of the failure process due to the scientific nature of the cumulative damage process at the microscopic level.
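The quoted intervals follow from the ASDs via the large-sample normal approximation, MLE ± z0.975 · ASD; a quick check:

```python
# Check of the 95% intervals quoted above: MLE +/- z_{0.975} * ASD.
z = 1.959964          # 97.5th standard normal percentile
for name, est, asd in [("alpha", 0.14436, 0.0105),
                       ("beta", 0.83251, 0.0281),
                       ("psi", 34.254, 2.7125)]:
    print(name, round(est - z * asd, 4), round(est + z * asd, 4))
# reproduces (0.1238, 0.1649), (0.7774, 0.8876), (28.9377, 39.5704)
# to the rounding used in the text
```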

In summary, it is seen that the Birnbaum–Saunders distribution (Equation 23.2) has great utility in the general area of accelerated life testing. The power-law accelerated model (Equation 23.5) was derived using fairly standard techniques from the ALT literature, and is a three-parameter generalization of Equation 23.2. The cumulative damage models from Section 23.2 are also three-parameter generalizations of Equation 23.2 and are accelerated life models in the general sense. They are unique, however, since nowhere in the individual development of the models was a baseline Birnbaum–Saunders model assumed. In addition, given that the models were derived from physical arguments, the parameters involved have specific meanings and interpretations in the physical process of material failure.

Acknowledgments

The second author's work was partially supported by the National Science Foundation under grant number DMS-9877107.

References

[1] Mann NR, Schafer RE, Singpurwalla ND. Methods for statistical analysis of reliability and life data. New York: John Wiley and Sons; 1974.
[2] Meeker WQ, Escobar LA. Statistical methods for reliability data. New York: John Wiley and Sons; 1998.
[3] Nelson W. Accelerated testing: statistical models, test plans, and data analysis. New York: John Wiley and Sons; 1990.
[4] Padgett WJ. Inference from accelerated life tests. In: Abdel-Hameed MS, Cinlar E, Quinn J, editors. Reliability theory and models. New York: Academic Press; 1984. p.177–98.
[5] Padgett WJ, Durham SD, Mason AM. Weibull analysis of the strength of carbon fibers using linear and power law models for the length effect. J Compos Mater 1995;29:1873–84.
[6] Birnbaum ZW, Saunders SC. A new family of life distributions. J Appl Prob 1969;6:319–27.
[7] Birnbaum ZW, Saunders SC. Estimation for a family of life distributions with applications to fatigue. J Appl Prob 1969;6:328–47.
[8] Desmond AF. Stochastic models of failure in random environments. Can J Stat 1985;13:171–83.
[9] Achcar JA. Inferences for the Birnbaum–Saunders fatigue life model using Bayesian methods. Comput Stat Data Anal 1993;15:367–80.
[10] Tierney L, Kass RE, Kadane JB. Fully exponential Laplace approximations to expectations and variances of nonpositive functions. J Am Stat Assoc 1989;84:710–6.
[11] Lehmann EL, Casella G. Theory of point estimation. New York: Springer; 1998.
[12] Owen WJ, Padgett WJ. A Birnbaum–Saunders accelerated life model. IEEE Trans Reliab 2000;49:224–9.
[13] Rieck JR, Nedelman JR. A log–linear model for the Birnbaum–Saunders distribution. Technometrics 1991;33:51–60.
[14] Stoner EG, Edie DD, Durham SD. An end-effect model for the single-filament tensile test. J Mater Sci 1994;29:6561–74.
[15] Wolstenholme LC. A non-parametric test of the weakest link property. Technometrics 1995;37:169–75.
[16] Goda K, Fukunaga H. The evaluation of strength distribution of silicon carbide and alumina fibres by a multi-modal Weibull distribution. J Mater Sci 1986;21:4475–80.
[17] Durham SD, Padgett WJ. A cumulative damage model for system failure with application to carbon fibers and composites. Technometrics 1997;39:34–44.
[18] Owen WJ, Padgett WJ. Birnbaum–Saunders-type models for system strength assuming multiplicative damage. In: Basu AP, Basu SK, Mukhopadhyay S, editors. Frontiers in reliability. River Edge (NJ): World Scientific; 1998. p.283–94.
[19] Owen WJ, Padgett WJ. Acceleration models for system strength based on Birnbaum–Saunders distributions. Lifetime Data Anal 1999;5:133–47.
[20] Bhattacharyya GK, Fries A. Fatigue failure models: Birnbaum–Saunders vs. inverse Gaussian. IEEE Trans Reliab 1982;31:439–41.
[21] Chhikara RS, Folks JL. The inverse Gaussian distribution. New York: Marcel Dekker; 1989.
[22] Bader S, Priest A. Statistical aspects of fibre and bundle strength in hybrid composites. In: Hayashi T, Katawa K, Umekawa S, editors. Progress in Science and Engineering of Composites, ICCM-IV, Tokyo, 1982; p.1129–36.
[23] Smith RL. Weibull regression models for reliability data. Reliab Eng Syst Saf 1991;34:55–76.


Multiple-steps Step-stress Accelerated Life Test

Chapter 24

Loon-Ching Tang

24.1 Introduction
24.2 Cumulative Exposure Models
24.3 Planning a Step-stress Accelerated Life Test
24.3.1 Planning a Simple Step-stress Accelerated Life Test
24.3.1.1 The Likelihood Function
24.3.1.2 Setting a Target Accelerating Factor
24.3.1.3 Maximum Likelihood Estimator and Asymptotic Variance
24.3.1.4 Nonlinear Programming for Joint Optimality in Hold Time and Low Stress
24.3.2 Multiple-steps Step-stress Accelerated Life Test Plans
24.4 Data Analysis in the Step-stress Accelerated Life Test
24.4.1 Multiply Censored, Continuously Monitored Step-stress Accelerated Life Test
24.4.1.1 Parameter Estimation for Weibull Distribution
24.4.2 Read-out Data
24.5 Implementation in Microsoft Excel™
24.6 Conclusion

24.1 Introduction

The step-stress accelerated life test (SSALT) is a reliability test in which not only are products tested at higher than usual stress, but the stress applied also changes, usually increases, at some predetermined time intervals during the course of the test. The stress levels and the time epochs at which the stress changes constitute a step-stress pattern (SSP) that resembles a staircase with uneven step and length. There are many advantages for this type of stress loading. Not only can the test time be reduced by increasing the stress during the course of a test, but experimenters need not start with a high stress that could be too harsh for the product, hence avoiding excessive extrapolation of test results. Moreover, only a single piece of test equipment, e.g. a temperature chamber, is required for each SSP; and a single SSP is sufficient to validate the assumed stress–life model (see Section 24.4).

Having said that, the obvious drawback is that it requires stronger assumptions and more complex analysis compared with a constant-stress accelerated life test (ALT). In this chapter, we look at two related issues for an SSALT, i.e. how to plan/design a multiple-steps SSALT and how to analyze data obtained from an SSALT. To facilitate subsequent presentation and discussion, we shall adopt the following further acronyms:

AF      acceleration factor
AV      asymptotic variance
c.d.f.  cumulative distribution function
FFL     failure-free life
LCEM    linear cumulative exposure model
MLE     maximum likelihood estimator
MTTF    mean time to failure
N-M     Nelson cumulative exposure model
NLP     nonlinear programming
p.d.f.  probability density function


Figure 24.1. An example of an SSALT stress pattern and a possible translation of test time to a constant reference stress

The challenge presented in the SSALT can be illustrated as follows. Consider the situation where some specimens of a product are tested under an SSP, as depicted in Figure 24.1. Most specimens are expected to survive after the low stress regime, and a significant fraction may survive after the middle stress. When the test is finally terminated at high stress, we have failure data that cannot be described by a single distribution, as they come from three different stress levels. In particular, if there is indeed a stress–life model that adequately relates a percentile of the failure time to the stress, we would have to translate the failure times to equivalent failure times at a reference stress before the model could be applied. The translation of SSALT failure times must be done piecewise, as the acceleration factors are different at different stress levels. This means that, unlike a constant-stress ALT, failure data obtained under an SSALT may not be analyzed independently of the stress–life model without further assumptions. In particular, for specimens that fail at high stress, assumptions on how the "damage" due to exposure to lower stress levels is accumulated must be made. We shall defer the discussion on this to the next section.

The simplest form of SSALT is the partial ALT considered by Degroot and Goel [1], in which products are first tested under use condition for a period of time before the stress is increased and maintained at a level throughout the test. This is a special case of a simple SSALT, where only two stress levels are used. Much work has been done in this area since that by Nelson [2]. Literature appearing before 1989 has been covered by Nelson [3], in which a chapter has been devoted to step-stress and progressive-stress ALT assuming exponential failure time. Here, we give a brief review of the subsequent work.

We first begin with literature dealing with the planning problem in an SSALT. The commonly adopted optimality criterion is to minimize the AV of the log(MTTF) or some percentile of interest. Using the estimates for probabilities of failure at design and high stress levels by the end of the test as inputs, Bai and coworkers [4, 5] obtained the optimum plan for simple SSALTs under type I censoring assuming exponential [4] and Weibull [5] failure times. Khamis and Higgins [6] considered a quadratic stress–life relation and obtained the associated optimum plan for a three-step SSALT using parameters of the assumed quadratic function as inputs. They then proposed a compromised test plan for a three-step SSALT by combining the results of the optimum quadratic plan with that of the optimum simple SSALT obtained by Bai et al. [4]. Assuming complete knowledge of the stress–life relation with multiple stress variables, Khamis [7] proposed an optimal plan for a multiple-step SSALT. Park and Yum [8] developed statistically optimal modified ALT plans for exponential lifetimes, taking into consideration the rate of change in stress, under continuous inspection and type I censoring, with similar assumptions as in Bai et al. [4]. The above plans solve for optimal hold time and assumed that low stress levels are given. Yeo and Tang [9] presented an NLP for planning a two-step SSALT, with the target acceleration factor and the expected proportion of failure at high stress by the end of the test as inputs, in which both the low stress levels and the respective hold times are optimized.


Table 24.1. A summary of the characteristics in the literature on optimal design of an SSALT

Reference                Problem addressed      Input                                    Output                              Lifetime distribution
Bai et al. [4]           Planning 2-step SSALT  pd, ph                                   Optimal hold time                   Exponential
Bai and Kim [5]          Planning 2-step SSALT  pd, ph, shape parameter                  Optimal hold time                   Weibull
Khamis and Higgins [6]   Planning 3-step SSALT  All parameters of stress–life relation   Optimal hold time                   Exponential with no censoring
Khamis [7]               Planning M-step SSALT  All parameters of stress–life relation   Optimal hold time                   Exponential with no censoring
Yeo and Tang [9]         Planning M-step SSALT  ph and target acceleration factor        Optimal hold time and lower stress  Exponential
Park and Yum [8]         Planning 2-step SSALT  pd, ph, ramp rate                        Optimal hold time                   Exponential with ramp rate

They proposed an algorithm for planning a multiple-step SSALT in which the number of steps was increased sequentially. Unfortunately, the algorithm inadvertently omitted a crucial step that would ensure optimality. A revised algorithm that addresses this will be presented in a later section. A summary of the above literature is presented in Table 24.1.

In the literature on modeling and analysis of an SSALT beyond Nelson [2], van Dorp et al. [10] developed a Bayes model and the related inferences of data from an SSALT assuming exponential failure time. Tang et al. [11] generalized the Nelson cumulative exposure model (N-M), the basic model for data analysis of an SSALT. They presented an LCEM, which, unlike N-M, is capable of analyzing SSALT data with FFL. Their model also overcomes the mathematical tractability problem of N-M under a multiple-step SSALT with Weibull failure time. An attempt to overcome such difficulty is presented by Khamis and Higgins [12], who proposed a time transformation of the exponential cumulative exposure (CE) model for analyzing Weibull data arising from an SSALT. Recently, Chang et al. [13] used neural networks for modeling and analysis of step-stress data.

Interest in SSALTs appears to be gathering momentum, as evidenced by the recent literature, where a variety of other problems and case studies associated with SSALTs have been reported. Xiong and Milliken [14] considered an SSALT with random stress-change times for exponential data. Tang and Sun [15] addressed ranking and selection problems under an SSALT. Tseng and Wen [16] presented degradation analysis under step stress. McLinn [17] summarized some practical ground rules for conducting an SSALT. Sun and Chang [18] presented a reliability evaluation on an application-specific integrated circuit (ASIC) flash RAM under an SSALT using a Peck temperature–RH model and Weibull analysis. Gouno [19] presented an analysis of an SSALT using exponential failure time in conjunction with an Arrhenius model, where the activation energy and failure rate under operational conditions were estimated both graphically and using an MLE.

In the following, we first discuss the two basic cumulative exposure models, i.e. N-M and LCEM, that are pivotal in the analysis of SSALT data. This is followed by solving the optimal design problem of an SSALT with exponential lifetime. We then derive the likelihood functions using an LCEM for multiply censored data under continuous monitoring and for read-out data with censoring. Finally, these methodologies are implemented in Microsoft Excel™ with a numerical example. The notation used in each section is defined at the beginning of each section.

24.2 Cumulative Exposure Models

The notation used in this section is as follows:

h          total number of stress levels in an SSP
Si         stress level i, i = 1, 2, . . . , h
N          index for stress level when failure occurs
ti         stress change times, i = 1, 2, . . . , h − 1
Δti        sojourn time at Si (= ti − ti−1, for survivors)
R          reliability
F, Fi(·)   c.d.f., c.d.f. at Si
t(i)       lifetime of a sample tested at Si
τe:i       equivalent start time of any sample at Si+1, τe:0 ≡ 0
te(i)      equivalent total operating time of a sample at Si
θi         MTTF at stress Si

As depicted in Figure 24.1, to facilitate the analysis of an SSALT we need to translate the test times over different stress levels to a common reference stress; say, the design stress. In other words, one needs to relate the life under an SSP to the life under constant stress. To put this into statistical perspective, Nelson [2] proposed a CE model with the assumption that "the remaining life of specimens depends only on the current cumulative fraction failed and current stress—regardless of how the fraction accumulated. Moreover, if held at the current stress, survivors will fail according to the c.d.f. for the stress but starting at the previously accumulated fraction failed". By this definition, we have

F(y) = F1(y)                      0 ≤ y < t1
       Fi(τe:i−1 + y − ti−1)      ti−1 ≤ y < ti; 1 < i < h
       Fh(τe:h−1 + y − th−1)      th−1 ≤ y < ∞     (24.1)

where the equivalent start time τe:i at Si is the solution of the following:

Fi+1(τe:i) = Fi(τe:i−1 + ti − ti−1)   for i = 1, . . . , h − 1

This N-M model has been widely accepted and commonly adopted by many authors [4–9, 14]. The problem, however, is that the model is not general enough to cater for life distributions with a location parameter representing FFL. This is because it is constructed by matching the probability of failure across various stresses. For the N-M model, when ti < FFL, F(ti) = 0, implying that CE is zero. Consequently, no CE is registered. This does not seem reasonable, as there should be CE even if the test time is below the FFL. Having an FFL is not uncommon among engineering products, particularly for products that fail through degradation and/or wear-out processes. For example, some semiconductors may take a few years before failure emerges; other examples include the life of batteries and the storage life of printer cartridges, photo-films, etc.

Another drawback of N-M is that it is defined purely on statistical grounds, and thus it is hard to give it a physical interpretation. Here, we present an alternative CE model that is based on the exposure times that a sample experienced under a test, called the LCEM, which has been shown to include N-M as its special case when failure time follows a Weibull distribution [11].

Suppose that the lifetime of a product can be described by a distribution and a sample is randomly selected from its population. The lifetime of the sample is an unknown constant at a fixed stress. Let the life of the sample operating at Si be t(i). The LCEM is defined as follows.

1. For a sample having been tested for Δti at Si, its fractional exposure accumulated at this Si is Δti/t(i).
2. For a sample tested under an SSP, its fractional exposure is linearly accumulated. That is, if a sample has operated for Δtk at Sk, k = 1, 2, . . . , i, its accumulated fractional exposure is Σ_(k=1)^i Δtk/t(k).
3. For a survivor at any stress level, its equivalent start time at that stress depends only on the previously accumulated fractional exposure, regardless of how that exposure is accumulated:

[Δti + τe:i−1]/t(i) = τe:i/t(i + 1)   τe:0 ≡ 0, i = 1, 2, . . . , h − 1


From the definition, it follows that the equivalent start time is given by:

$$\tau_{e:i}=t(i+1)\sum_{k=1}^{i}\Delta t_k/t(k)\qquad i=1,2,\dots,h-1\qquad(24.2)$$

Note that, compared with Equation 24.1, the equivalent time is independent of the distribution of the failure time. The LCEM allows the translation of test time to any reference stress quite effortlessly. From Equation 24.2, the equivalent test time of the sample at Si after going through Δtk, k = 1, ..., i, is:

$$t_e(i)=\tau_{e:i-1}+\Delta t_i=t(i)\sum_{k=1}^{i-1}\Delta t_k/t(k)+\Delta t_i=t(i)\sum_{k=1}^{i}\Delta t_k/t(k)\qquad(24.3)$$

From Equation 24.2, Equation 24.3 can be written as the equivalent test time at any stress, say Sj, where Δtj = 0:

$$t_e(j)=t(j)\sum_{k=1}^{i}\Delta t_k/t(k)\;\Rightarrow\;\frac{t_e(j)}{t(j)}=\frac{t_e(i)}{t(i)}=\sum_{k=1}^{i}\Delta t_k/t(k)\qquad(24.4)$$

Equation 24.4 holds as long as te(j) ≤ t(j), because failure occurs when te(j) = t(j). The interpretation of Equation 24.4 is interesting: the equivalent test time at a stress is directly proportional to the failure time at that stress and, more importantly, the total fractional CE is invariant across stress levels. The failure condition, i.e.

$$\sum_{i=1}^{N}\Delta t_i/t(i)=1\qquad(24.5)$$

is identical to the well-known Miner's rule. It can be interpreted as follows: once we have picked a sample, regardless of the step-stress spectrum it undergoes, the cumulative damage that triggers a failure is a deterministic unknown. Moreover, for all samples, the exposure is linearly accumulated until failure. In fact, this seems to be the original intent of Nelson [2], and it is only natural that the N-M model is a special case of the LCEM (shown by Tang et al. [11]).

From Equations 24.2 to 24.4, it is noted that LCEMs need estimates of t(i), i = 1, 2, ..., N. These must be estimated from the empirical c.d.f./reliability and the assumed stress–life model. For example, in the case of exponential failure time, suppose that the lifetime of a sample corresponds to the (1 − R)-quantile. Then

$$R=\exp[-t(i,R)/\theta_i]\;\Rightarrow\;t(i,R)=-\theta_i\ln(R)\qquad(24.6)$$

The MTTF at Si, θi, will depend on the stress–life model. We shall use the LCEM in the following sections.
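To make the LCEM bookkeeping of Equations 24.2–24.5 concrete, the following is a minimal Python sketch assuming exponential lifetimes, so that t(i, R) = −θi ln R as in Equation 24.6. The MTTF values and sojourn times are illustrative assumptions, not values from this chapter.

```python
import numpy as np

def fractional_exposure(sojourns, lives):
    # Accumulated fractional exposure sum_k dt_k / t(k)  (Equation 24.4)
    return float(np.sum(np.asarray(sojourns) / np.asarray(lives)))

# Illustrative sample at the R = 0.5 quantile over three stress levels
theta = np.array([1000.0, 400.0, 150.0])   # assumed MTTFs at S1, S2, S3 (hours)
lives = -theta * np.log(0.5)               # t(i, R) from Equation 24.6
sojourns = [200.0, 100.0]                  # hold times spent at S1 and S2

ce = fractional_exposure(sojourns, lives[:2])
tau_e2 = lives[2] * ce     # equivalent start time at S3  (Equation 24.2)
print(ce, tau_e2)          # failure occurs once the exposure reaches 1 (Eq. 24.5)
```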

24.3 Planning a Step-stress Accelerated Life Test

The notation adopted in this section is as follows:

  S0, Si, Sh    stress (design, low, high) levels, i = 1, 2, ..., h − 1
  xi            (Si − S0)/(Sh − S0): normalized low stress levels, i = 1, 2, ..., h − 1
  h             total number of stress levels
  n             number of test units
  p             expected failure proportion when items are tested only at Sh
  pi            expected failure proportion at Si
  ni            number of failed units at Si, i = 1, 2, ..., h
  nc            number of censored units at Sh (at end of test)
  yi,j          failure time j of test units at Si, i = 1, 2, ..., h
  te(j, i)      equivalent failure time for the jth failure at Si
  R(j, i)       reliability of the jth failure at Si
  θi            mean life at stress Si, i = 1, 2, ..., h
  τi            hold time at low Si, i = 1, 2, ..., h − 1
  T             censoring time
  V(x, τ)       n · AV[log(mean-life estimate)]

Table 24.1 shows a list of the work related to the optimal design of an SSALT. The term "optimal" usually refers to minimizing the asymptotic variance of the log(MTTF), or that of a percentile, under the use condition.

As we can see from Table 24.1, with the exception of Bai and Kim [5], all of the work deals with exponential failure times. This is due to simplicity as well as practicality, for it is hard to know the shape parameter of a Weibull distribution in advance.

The typical design problem for a two-step SSALT is to determine the optimal hold time for a given low stress level. This solution is not satisfactory because, in practice, the selection of the low stress is not obvious. Choosing a low stress close to the design stress might result in too few failures at low stress for meaningful statistical inference; choosing a low stress close to the high stress could result in too much extrapolation of the stress–life model. In fact, it is well known that in the optimal design of a two-stress constant-stress ALT, both the sample allocation and the low stress are optimally determined. It is thus not an unreasonable demand for the same outputs in the design of an SSALT. This motivated the work of Yeo and Tang [9], who used p and an AF as inputs so that both the hold time and the low stress can be optimally determined.

The use of an AF as input is more practical than using the probability of failure at the design stress, as the latter is typically hard to estimate. A common interpretation of the AF is that it is the time compression factor. Given the time constraint that determines the maximum test duration, and some guess of how much time the test would have taken if run under use conditions, the target AF is simply the ratio of the two. In fact, as we shall see in Equation 24.9, the target AF can easily be derived using an LCEM once the test duration is given.

24.3.1 Planning a Simple Step-stress Accelerated Life Test

We consider a two-level SSALT in which n test units are initially placed on S1. The stress is changed to S2 at τ1 = τ, after which the test is continued until all units fail or until a predetermined censoring time T. For each stress level, the life distribution of the test units is exponential with mean θi.

24.3.1.1 The Likelihood Function

From Equations 24.3 and 24.6, under the exponential failure time assumption, the equivalent failure time of the jth failure at high stress is given by:

$$t_e(j,h)=-\theta_2\ln[R(j,h)]\left\{\frac{\tau}{-\theta_1\ln[R(j,h)]}+\frac{y_{2,j}-\tau}{-\theta_2\ln[R(j,h)]}\right\}=\frac{\theta_2}{\theta_1}\tau+(y_{2,j}-\tau)$$

As the term ln[R(j, h)] cancels with itself, we shall omit it whenever this happens. The contribution to the likelihood function for the jth failure at high stress is

$$\frac{1}{\theta_2}\exp\left[-\frac{t_e(j,h)}{\theta_2}\right]=\frac{1}{\theta_2}\exp\left(-\frac{y_{2,j}-\tau}{\theta_2}-\frac{\tau}{\theta_1}\right)$$

For the nc survivors, the equivalent test time at high stress is

$$t_e(h)=\theta_2\left(\frac{\tau}{\theta_1}+\frac{T-\tau}{\theta_2}\right)$$

so the contribution to the likelihood function is

$$\exp\left(-\frac{t_e(h)}{\theta_2}\right)=\exp\left(-\frac{T-\tau}{\theta_2}-\frac{\tau}{\theta_1}\right)$$

Putting these together, the likelihood function is

$$L(\theta_1,\theta_2)=\prod_{j=1}^{n_1}\left[\frac{1}{\theta_1}\exp\left(-\frac{y_{1,j}}{\theta_1}\right)\right]\times\prod_{j=1}^{n_2}\left[\frac{1}{\theta_2}\exp\left(-\frac{y_{2,j}-\tau}{\theta_2}-\frac{\tau}{\theta_1}\right)\right]\times\prod_{j=1}^{n_c}\exp\left(-\frac{\tau}{\theta_1}-\frac{T-\tau}{\theta_2}\right)\qquad(24.7)$$
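As a check on Equation 24.7, the following Python sketch codes the corresponding log-likelihood; the argument names are ours, and the function simply sums the logarithms of the three products above.

```python
import numpy as np

def loglik(theta1, theta2, y1, y2, tau, T, n_c):
    """Log of Equation 24.7: y1, y2 are failure times at S1 and S2,
    n_c the number of units still unfailed (censored) at T."""
    y1, y2 = np.asarray(y1), np.asarray(y2)
    ll = -len(y1) * np.log(theta1) - np.sum(y1) / theta1            # failures at S1
    ll += -len(y2) * np.log(theta2) \
          - np.sum((y2 - tau) / theta2 + tau / theta1)              # failures at S2
    ll += -n_c * ((T - tau) / theta2 + tau / theta1)                # survivors at T
    return ll
```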


24.3.1.2 Setting a Target Accelerating Factor

The definition of an AF can be expressed as

$$\phi=\frac{\text{equivalent time to failure at design stress}}{\text{time to failure under test plan}}\qquad(24.8)$$

From Equation 24.3, the equivalent operating time for the test duration T at the design stress is

$$t_e(0)=\theta_0\left(\frac{\tau}{\theta_1}+\frac{T-\tau}{\theta_2}\right)$$

Substituting this into Equation 24.8 with the test time T as the denominator, the AF is

$$\phi=\frac{\tau(\theta_0/\theta_1)+(T-\tau)(\theta_0/\theta_2)}{T}\qquad(24.9)$$

Suppose the mean life of a test unit is a log-linear function of stress:

$$\log(\theta_i)=\alpha+\beta S_i\qquad(24.10)$$

where α and β (β < 0) are unknown parameters. (This is a common choice for the life–stress relationship because it includes both the power law and the Arrhenius relation as special cases.) Without loss of generality, let S0 = 0, S1 = x, S2 = 1, and T = 1. Then, substituting Equation 24.10 into Equation 24.9 yields

$$\phi=(1-\tau)\exp(-\beta)+\tau\exp(-\beta x)\qquad(24.11)$$

24.3.1.3 Maximum Likelihood Estimator and Asymptotic Variance

The MLE of log θ0 can be obtained by differentiating the log-likelihood function in Equation 24.7:

$$\log(\hat\theta_0)=\frac{\log(U_1/n_1)-x\log(U_2/n_2)}{1-x}$$

where

$$U_1=\sum_{j=1}^{n_1}y_{1,j}+(n-n_1)\tau\qquad U_2=\sum_{j=1}^{n_2}(y_{2,j}-\tau)+n_c(T-\tau)$$

From the Fisher information matrix, the AV of the MLE of the log(mean life) at the design stress, AV[log(θ̂0)], is

$$V(x,\tau)=\frac{[1/(1-x)]^2}{1-\exp(-\tau/\theta_1)}+\frac{[x/(1-x)]^2}{\exp(-\tau/\theta_1)\,[1-\exp(-(1-\tau)/\theta_2)]}\qquad(24.12)$$

We need to express Equation 24.12 in terms of x, τ, p, and β. From the log-linear relation for the mean in Equation 24.10, we have

$$\frac{\theta_2}{\theta_1}=\left(\frac{\theta_2}{\theta_0}\right)^{1-x}=\exp[\beta(1-x)]$$

and with

$$p=1-\exp\left(-\frac{1}{\theta_2}\right)$$

it follows that V(x, τ) is

$$V(x,\tau)=\frac{[1/(1-x)]^2}{1-(1-p)^{\omega}}+\frac{[x/(1-x)]^2}{(1-p)^{\omega}[1-(1-p)^{1-\tau}]}\qquad(24.13)$$

where

$$\omega\equiv\tau\left(\frac{\theta_2}{\theta_0}\right)^{1-x}=\tau\exp[\beta(1-x)]$$

24.3.1.4 Nonlinear Programming for Joint Optimality in Hold Time and Low Stress

Given the numerical values of p and φ, the optimal (x, τ) can be obtained by solving the NLP:

$$\min_{x,\tau}\ V(x,\tau)\quad\text{subject to}\quad(1-\tau)\exp(-\beta)+\tau\exp(-\beta x)=\phi\qquad(24.14)$$
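The same NLP can be solved outside Excel. Below is a hedged Python sketch using scipy's SLSQP, mirroring the Excel-solver set-up described in Section 24.5: the objective is Equation 24.13 (with β entering through ω) and the equality constraint is Equation 24.11. The starting values and bounds are our own choices, not prescribed by the chapter.

```python
import numpy as np
from scipy.optimize import minimize

def V(v, p):
    beta, x, tau = v
    w = tau * np.exp(beta * (1.0 - x))        # omega in Equation 24.13
    q = 1.0 - p
    return (1.0 / (1.0 - x))**2 / (1.0 - q**w) \
         + (x / (1.0 - x))**2 / (q**w * (1.0 - q**(1.0 - tau)))

def af_constraint(v, phi):                    # Equation 24.11 minus the target AF
    beta, x, tau = v
    return (1.0 - tau) * np.exp(-beta) + tau * np.exp(-beta * x) - phi

def solve_nlp(p, phi, v0=(-5.0, 0.5, 0.7)):
    res = minimize(V, v0, args=(p,),
                   constraints=[{'type': 'eq', 'fun': af_constraint, 'args': (phi,)}],
                   bounds=[(-20.0, -1e-6), (1e-3, 0.999), (1e-3, 0.999)])
    return res.x                              # (beta*, x*, tau*)

beta, x, tau = solve_nlp(p=0.9, phi=50.0)     # the inputs of Section 24.5
```

For (p, φ) = (0.9, 50) this should, up to solver tolerances, reproduce the stage #1 solution reported in Table 24.3.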


[Figure 24.2. Contours of the optimal hold time at low stress for a two-step SSALT, drawn over the expected failure proportion p (horizontal axis, 0.1–0.9) and the acceleration factor φ (vertical axis, 10–100); the contour values range from about 0.646 to 0.711. For a given (p, φ), the optimal hold time can be read off by interpolation between contours.]

The results are given graphically in Figures 24.2 and 24.3, with (p, φ) on the x–y axes, for φ ranging from 10 to 100 and p ranging from 0.1 to 0.9. An upper limit of 100 is placed on φ because, in practice, an AF beyond 100 may be too harsh. The range of p is chosen based on the consideration that (1 − p) gives the upper bound for the fraction of censored data. For a well-planned experiment, the degree of censoring is usually less than 90% (p = 0.1), and it is not common to have more than 90% failure. Figure 24.2 shows the contours of the optimal normalized hold time τ and Figure 24.3 gives the contours of the optimal normalized low stress x. Given a pair of (p, φ), the simultaneous optimal low stress and hold time can be read from the graphs. An extensive grid search on the [0, 1] × [0, 1] solution space of (x, τ) has been conducted for selected combinations of input parameters (p, φ), from which the global optimality of the graphs has been verified.

Both sets of contours, for the optimal hold time and the optimal low stress, show an upward convex trend. The results can be interpreted as follows. For the same p, in situations where time constraints call for a higher AF, it is better to increase the low stress and extend the hold time at low stress. For a fixed AF, a longer test time (so that p increases) will result in a smaller optimal low stress and a shorter optimal hold time.

[Figure 24.3. Contours of the optimal low stress level for a two-step SSALT, drawn over the expected failure proportion p (0.1–0.9) and the acceleration factor φ (10–100); the contour values range from about 0.354 to 0.603. For a given (p, φ), the optimal low stress can be read off by interpolation between contours.]

24.3.2 Multiple-steps Step-stress Accelerated Life Test Plans

From Figure 24.2, a higher AF gives a higher optimal τ. This allows for further splitting of the hold time if an intermediate stress is desirable. The idea is to treat testing at low stress as a test independent of that at high stress. In doing so, we need to ensure that, after splitting this low stress into a two-step SSALT with the optimal hold time as the censoring time, the AF achieved in the two-step SSALT is identical to that contributed by the low stress alone. This is analogous to dynamic programming, where the stage is the number of stress steps. At each stage, we solve the optimal design problem for a two-step SSALT while maintaining the AF contributed by the low-stress test phase. For a three-step SSALT, this will give a middle stress that is slightly higher than the optimal low stress. Use this middle stress as the high stress and solve for the optimal hold time and low stress using Equation 24.14. Next, we also need to ensure that the solutions are consistent with the stress–life model in that its coefficients must be the same as before. For an m-step SSALT, there are m − 1 cascading stages of SSALT. The resulting plan is "optimal", for a given number of steps and a target AF, whenever a test needs to be terminated prematurely at lower stress.

To make the idea illustrated above concrete, consider a three-step SSALT, which has two cascading stages of a simple SSALT. The stage #1 simple SSALT, which uses the maximum permissible stress as the high stress level, gives the optimal low stress level and its hold time. Stage #2 consists of splitting this low stress level into a simple SSALT that maintains the overall target AF. Since the AF is additive under an LCEM, the new target AF for stage #2 is the AF contributed by the stage #1 low stress.

From Equation 24.9 with T = 1, and since θ0/θ1 = exp(−βx) from Equation 24.10, it follows that

$$\text{AF contributed by low stress}=\text{New target AF}=\tau\exp(-\beta x)\qquad(24.15)$$

Note that, to meet this target AF, xm should be higher than the optimal low stress in stage #1. To solve the stage #2 NLP, however, this target AF needs to be normalized by the hold time τ, due to the change of time scale. The other input needed is p2, the expected proportion of failures at xm by time τ. Moreover, the solutions of the NLP must also satisfy the condition that the stress–life models at stages #1 and #2 are consistent, i.e. they have the same β. The resulting algorithm, which iteratively solves for p2, xm, β, the new optimal low stress, and the hold time, is summarized as follows.

Algorithm 1. This algorithm is for a multiple-step SSALT. With the values of (τ, x, β) from stage #1, do the following:

1. Compute the normalized AF φ2 at stage #2 using Equation 24.15:

$$\phi_2=(\text{New Target AF})/\tau=\exp(-\beta x)\qquad(24.16)$$

2. Compute p₂⁽⁰⁾, the expected failure proportion at the low stress of stage #1:

$$p_2^{(0)}=1-\exp\left(-\frac{\tau}{\theta_x}\right)=1-\exp\left\{-\tau\log\left(\frac{1}{1-p}\right)\exp[\beta(1-x)]\right\}\qquad(24.17)$$

3. Solve the constrained NLP in Equation 24.14 to obtain (τ*₍₁₎, x*₍₁₎, β*₍₁₎).

4. Compute the new middle stress x_m⁽¹⁾:

$$x_m^{(1)}=\beta^*_{(1)}\left\{\log\left[\frac{\log(1/(1-p_2^{(0)}))}{\tau\log(1/(1-p))}\right]+\beta^*_{(1)}\right\}^{-1}\qquad(24.18)$$

which is the solution of

$$p_2^{(0)}=1-\exp\left\{-\tau\log\left(\frac{1}{1-p}\right)\exp\left[\beta^*_{(1)}\,\frac{1-x_m^{(1)}}{x_m^{(1)}}\right]\right\}$$

5. Update p2 using x_m⁽¹⁾:

$$p_2^{(1)}=1-\exp\left\{-\tau\log\left(\frac{1}{1-p}\right)\exp[\beta(1-x_m^{(1)})]\right\}\qquad(24.19)$$

6. Repeat steps 3 to 5 with (p₂⁽¹⁾, φ₂), (p₂⁽²⁾, φ₂), (p₂⁽³⁾, φ₂), ... until |β*⁽ᵏ⁾ − x_m⁽ᵏ⁾β| < ε for some pre-specified ε > 0.
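Algorithm 1 can be scripted directly. The Python sketch below follows steps 1–6 and reuses solve_nlp from the earlier sketch (so it is not self-contained on its own); the variable names are ours.

```python
import numpy as np

def multi_step_plan(p, tau, x, beta, eps=1e-8, max_iter=100):
    """Stage #2 of Algorithm 1, given the stage #1 solution (tau, x, beta).
    Requires solve_nlp() from the previous sketch."""
    phi2 = np.exp(-beta * x)                                    # step 1, Eq. 24.16
    lam = np.log(1.0 / (1.0 - p))                               # 1/theta_h with T = 1
    p2 = 1.0 - np.exp(-tau * lam * np.exp(beta * (1.0 - x)))    # step 2, Eq. 24.17
    for _ in range(max_iter):
        b_s, x_s, t_s = solve_nlp(p2, phi2)                     # step 3, Eq. 24.14
        xm = b_s / (np.log(np.log(1.0 / (1.0 - p2)) / (tau * lam)) + b_s)  # Eq. 24.18
        p2 = 1.0 - np.exp(-tau * lam * np.exp(beta * (1.0 - xm)))          # Eq. 24.19
        if abs(b_s - xm * beta) < eps:                          # step 6
            break
    # Combine stages #1 and #2 (Equation 24.20)
    return {'x1': xm * x_s, 'x2': xm, 'tau1': tau * t_s, 'tau2': tau}
```

Started from the stage #1 solution for (p, φ) = (0.9, 50), this loop should trace out a trajectory like the one in Table 24.2, converging in a few dozen iterations.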

Let (τ*⁽ᵏ⁾, x*⁽ᵏ⁾, β*⁽ᵏ⁾) be the solutions from this scheme. Combining the stage #1 and #2 results, the plan for a three-step SSALT (h = 3) is

$$x_1=x_m^{(k)}x^{*(k)}\qquad x_2=x_m^{(k)}\qquad \tau_1=\tau\,\tau^{*(k)}\qquad \tau_2=\tau\qquad(24.20)$$

Given the above parameters, the asymptotic variance for log(θ̂0) is given by

$$n\,\mathrm{AV}[\log(\hat\theta_0)]=\frac{\sum_{i=1}^{3}A_ix_i^2}{\sum_{(i,j)\in\{(1,2),(1,3),(2,3)\}}A_iA_j(x_i-x_j)^2}$$


where

$$A_1=1-\exp(-\tau_1/\theta_1)$$
$$A_2=\exp(-\tau_1/\theta_1)\{1-\exp[-(\tau_2-\tau_1)/\theta_2]\}$$
$$A_3=\exp(-\tau_1/\theta_1)\exp[-(\tau_2-\tau_1)/\theta_2]\{1-\exp[-(1-\tau_2)/\theta_3]\}$$
$$\theta_i=\frac{\exp[-\beta(1-x_i)]}{-\log(1-p)}$$

From the above equations, the sample size can be determined given a desirable level of precision.

Algorithm 1 aims to adjust xm upwards so as to maintain the AF computed in Equation 24.16 at each iteration. In doing so, p2 will also increase. This adjustment is repeated until the β of the stress–life model is identical to that obtained in stage #1. Note that the algorithm is a revised version of that of Yeo and Tang [9], which erroneously omitted step 1; as a result, the AF attained there is lower than the target AF. The updating of xm in Equation 24.18 also differs from [9].

For a four-step SSALT, the results from stage #2 are used to solve for stage #3 using the above algorithm. This updating scheme can be applied recursively, if necessary, to generate the test plan for a multiple-step SSALT.

24.4 Data Analysis in the Step-stress Accelerated Life Test

The following notation is used in this section:

  ri                number of failures at Si
  r                 number of failures in an SSP (= Σi ri)
  Nj                the stress level at which sample j failed or is being censored
  Si                stress level i, i = 1, 2, ..., h
  ti                stress change times, i = 1, 2, ..., h − 1
  Δti,j             sojourn time of sample j at Si
  Rj                reliability of sample j
  t(i), t(i, Rj)    life (of sample j) at Si
  te(i), te(i, Rj)  equivalent operating time (of sample j) at Si; te(0) = 0

24.4.1 Multiply Censored, Continuously Monitored Step-stress Accelerated Life Test

We consider a random sample of size n tested under an SSALT with an SSP. The test duration is fixed (time censored) and multiple censoring is permitted at other stress levels. Suppose that there are r failures and rc survivors at the end of the test. All n samples have the same initial stress and stress increment for one SSP. Their hold times are the same at the same stress level, except at the stress level at which they fail or are censored. The parameter estimation procedure using an LCEM is described as follows.

From Equation 24.4, the total test time corresponding to failed sample j at stress level Nj can be converted to the equivalent test time at S_{Nj} as:

$$t_e(N_j,R_j)=t(N_j,R_j)\left[\sum_{i=1}^{N_j-1}\Delta t_{i,j}/t(i,R_j)\right]+\Delta t_{N_j,j}\qquad j=1,2,\dots,r\qquad(24.21)$$

Similarly, the total test time corresponding to survivor j at stress level Nj can be converted to the equivalent test time at S_{Nj} as

$$t_e(N_j,R_j)=t(N_j,R_j)\sum_{i=1}^{N_j}\Delta t_{i,j}/t(i,R_j)\qquad j=r+1,\dots,n\qquad(24.22)$$

From Equation 24.6, it can be seen that, given a stress-dependent c.d.f., i.e. F(S, t, θ), the t(i, Rj) in Equations 24.21 and 24.22 can be estimated from

$$F(S_i,t(i,R_j),\theta)=1-R_j\qquad j=1,2,\dots,n\qquad(24.23)$$

where θ is the set of parameters to be estimated.

The empirical c.d.f. can be estimated using the well-known Kaplan–Meier estimate or the cumulative hazard estimate for multiply censored data. For example, by sorting the failure and censored times in ascending order, the cumulative hazard estimate for Rj is

$$\hat R_j=\exp\left[-\sum_{k=1}^{j}(1/m_k)\right]\qquad(24.24)$$

where mk is the number of samples remaining on test at the time just before sample k fails.
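A short Python sketch of Equation 24.24 (a Nelson–Aalen-type estimate) follows; the data layout at the bottom is illustrative only.

```python
import numpy as np

def reliability_estimates(times, is_failure):
    """R_j of Equation 24.24 for each failed sample, processed in time order."""
    order = np.argsort(times)
    n, H, R = len(times), 0.0, {}
    for rank, idx in enumerate(order):
        m_k = n - rank                 # samples still on test just before this time
        if is_failure[idx]:
            H += 1.0 / m_k             # cumulative hazard increment
            R[idx] = np.exp(-H)
    return R

# Illustrative: five samples, the third and fifth censored
print(reliability_estimates([12, 5, 20, 9, 14], [True, True, False, True, False]))
```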

The parameter set θ can be estimated by maximizing the likelihood function:

$$\left[\prod_{j=1}^{r}f(S_{N_j},t_e(N_j,R_j),\theta)\right]\times\left[\prod_{j=r+1}^{n}R(S_{N_j},t_e(N_j,R_j),\theta)\right]\qquad(24.25)$$

For multiple SSPs, the parameters can be estimated from the joint likelihood function, which is simply the product of the likelihood functions of all patterns under the usual s-independence assumption.

24.4.1.1 Parameter Estimation for the Weibull Distribution

The following additional notation is used:

  ηi    scale parameter of the Weibull distribution at Si
  γ     shape parameter of the Weibull distribution

For illustration purposes, we use a Weibull distribution with a constant shape parameter throughout the SSP and stress-dependent ηi as an example. For an illustration using the three-parameter Weibull, refer to Tang et al. [11]. This model is widely used for describing failure data under mechanical stress, particularly for fatigue failure.

From Equation 24.23, the life corresponding to sample j and Si, t(i, Rj), is given by:

$$t(i,R_j)=[-\ln(R_j)]^{1/\gamma}\eta_i\qquad j=1,2,\dots,n$$

Substituting into Equations 24.21 and 24.22 yields

$$t_e(N_j,R_j)=\eta_{N_j}\left(\sum_{i=1}^{N_j-1}\Delta t_{i,j}/\eta_i\right)+\Delta t_{N_j,j}=\eta_{N_j}\sum_{i=1}^{N_j}\Delta t_{i,j}/\eta_i\qquad j=1,2,\dots,n$$

This is the same expression given by Nelson [3] using the N-M model. We can convert these test times into the equivalent test time at the highest stress, where we can expect the largest number of cancellations of the ηi:

$$t_e(h,R_j)=\eta_h\sum_{i=1}^{N_j}\Delta t_{i,j}/\eta_i\qquad j=1,\dots,n$$

Now we can analyze these failure/censored times as if they came from the same Weibull distribution at the highest stress. The resulting likelihood function is given by

$$\left\{\prod_{j=1}^{r}\frac{\gamma}{\eta_h}\left(\sum_{i=1}^{N_j}\Delta t_{i,j}/\eta_i\right)^{\gamma-1}\exp\left[-\left(\sum_{i=1}^{N_j}\Delta t_{i,j}/\eta_i\right)^{\gamma}\right]\right\}\times\left\{\prod_{j=r+1}^{n}\exp\left[-\left(\sum_{i=1}^{N_j}\Delta t_{i,j}/\eta_i\right)^{\gamma}\right]\right\}$$

When the number of steps h > 2, but is small compared with the number of failures, the ηi can be estimated individually so that the stress–life model can be identified. A set of constraints can be included to ensure that ηi > ηj for Si < Sj. Alternatively, a stress–life model, say log(ηi) = α + βSi, can be used to replace the ηi above, giving a likelihood function in terms of the unknown parameters α, β, and γ. The MLE of these parameters can be obtained via the usual means. This is left as an exercise.
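As one concrete version of this exercise, the Python sketch below maximizes the likelihood above numerically with the log-linear substitution log(ηi) = a + b·Si; the data layout, starting values, and bounds are illustrative assumptions, not values from this chapter.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, S, sojourns, failed):
    a, b, gamma = params
    eta = np.exp(a + b * np.asarray(S))      # stress-life model for eta_i
    u = sojourns @ (1.0 / eta)               # sum_i dt_{i,j} / eta_i, per sample
    ll = np.where(failed,
                  np.log(gamma) - np.log(eta[-1])
                  + (gamma - 1.0) * np.log(u) - u**gamma,   # failed samples
                  -u**gamma)                                # censored samples
    return -np.sum(ll)

# Illustrative data: S ascending; sojourns[j, i] = time sample j spent at S_i
S = [1.0, 1.5, 2.0]
sojourns = np.array([[100., 50., 0.], [100., 100., 20.], [100., 100., 60.]])
failed = np.array([True, True, False])
res = minimize(neg_loglik, x0=(6.0, -1.0, 1.5), args=(S, sojourns, failed),
               bounds=[(None, None), (None, 0.0), (1e-3, None)])   # b < 0, gamma > 0
```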

24.4.2 Read-out Data

Suppose that the read-out times of the SSALT coincide with the stress change times and that there are ri failures at Si. The equivalent read-out time at the highest stress, after stress level i, can be expressed as

$$t_{e:i}(h)=t(h)\left[\sum_{k=1}^{i}\Delta t_k/t(k)\right]\qquad i=1,2,\dots,h$$


[Figure 24.4. A Microsoft Excel™ template for designing an optimal two-step SSALT. The use of the solver allows for optimization of a nonlinear function with a nonlinear constraint.]

The likelihood function is given by

$$\left\{\prod_{i=1}^{h}[F_h(t_{e:i}(h))-F_h(t_{e:i-1}(h))]^{r_i}\right\}\times[1-F_h(t_{e:h}(h))]^{n-r}$$

where Fh is the c.d.f. of the failure time at the highest stress.

In the case of Weibull failure time, we have

$$t_{e:i}(h)=\eta_h\sum_{k=1}^{i}\Delta t_k/\eta_k\qquad i=1,2,\dots,h$$

and the likelihood function is given by

$$\left(\prod_{i=1}^{h}\left\{\exp\left[-\left(\sum_{k=1}^{i-1}\Delta t_k/\eta_k\right)^{\gamma}\right]-\exp\left[-\left(\sum_{k=1}^{i}\Delta t_k/\eta_k\right)^{\gamma}\right]\right\}^{r_i}\right)\times\left\{\exp\left[-\left(\sum_{k=1}^{h}\Delta t_k/\eta_k\right)^{\gamma}\right]\right\}^{n-r}\qquad(24.26)$$

Similarly, one could proceed with the estimation in the usual way.

In this section, we have confined our analysis to one SSP. For multiple SSPs, we simply multiply the marginal likelihood functions, assuming independence between the data obtained under each SSP.

Table 24.2. A numerical illustration of the algorithm for planning a multiple-step SSALT. The equations from which the results are computed are indicated under each column heading.

  k    p2(k−1)       β(k)        τ(k)       x(k)       xm(k)       β(k) − β·xm(k)
       (24.17/24.19) (24.14)     (24.14)    (24.14)    (24.18)
  1    0.141065      −3.39532    0.66767    0.41906    0.591777    −0.50607
  2    0.193937      −3.39520    0.66683    0.41688    0.630109    −0.31881
  3    0.228920      −3.39381    0.66567    0.41532    0.652689    −0.20718
  4    0.251933      −3.39395    0.66538    0.41435    0.666837    −0.13825
  5    0.267300      −3.39405    0.66517    0.41366    0.676017    −0.09352
  ...  ...           ...         ...        ...        ...         ...
  42   0.300379      −3.39419    0.66467    0.41208    0.695201    −2.02341 × 10⁻⁸
  43   0.300379      −3.39419    0.66467    0.41208    0.695201    −1.40668 × 10⁻⁸

24.5 Implementation in Microsoft Excel™

The NLP in Equation 24.14 can easily be implemented using the solver tool in Microsoft Excel™, as illustrated in Figure 24.4. For illustration purposes, the inputs are in cells C3 and C4, where p = 0.9 and φ = 50. Equation 24.13 is stored in cell C9. Besides the two inputs p and φ, some initial guess values for β, x, and τ are required to compute the AV in Equation 24.13. The solver minimizes the cell containing the equation of the AV by changing cells C5–C7, which contain β, x, and τ. A constraint is added to the solver by equating cell C11, which contains Equation 24.11, to cell C4, which contains the target AF φ. After running the solver, cells C5–C7 contain the optimal solutions for β, x, and τ.

Algorithm 1 for solving the design of a multiple-step SSALT can also be implemented in Microsoft Excel™. With the optimal solutions from above, Table 24.2 shows the equations used in generating the results for planning a three-step SSALT and how the numerical results converge at each iteration. The inputs and results are summarized in Table 24.3.

Table 24.3. A summary of the inputs and results for the numerical example

  Input            p                    0.9
                   φ                    50
  Final solution   β                    −4.8823
                   τ2                   0.6871
                   x (from stage #1)    0.5203
                   τ1                   0.4567
                   x1                   0.2865
                   x2                   0.6952

Suppose that the above test plan is for evaluating the breakdown time of insulating oil. The high and use stress levels are 50 kV and 20 kV respectively, and the stress is in log(kV). The total test time is 20 h; recall that p = 0.9 and φ = 50. From Table 24.3, we have

$$S_1=\exp[S_0+x_1(S_h-S_0)]=\exp\{\log(20)+0.2865\times[\log(50)-\log(20)]\}=26.0\ \text{kV}$$
$$S_2=\exp[S_0+x_2(S_h-S_0)]=\exp\{\log(20)+0.6952\times[\log(50)-\log(20)]\}=37.8\ \text{kV}$$
$$t_1=\tau_1\times 20=9.134\ \text{h}=548.0\ \text{min}$$
$$t_2=\tau_2\times 20=13.74\ \text{h}=824.5\ \text{min}$$

Suppose that 50 samples were tested, resulting in 2, 10, and 21 failures after t1, t2, and T respectively. The data are analyzed using the solver in Microsoft Excel™, as presented in Figure 24.5. The equation in cell G7 is the log-likelihood function derived from Equation 24.26. One could proceed to fit a stress–life model for the ηi; we shall leave this as an exercise.

[Figure 24.5. A Microsoft Excel™ template for the analysis of read-out data from an SSALT. A set of constraints is included to ensure that ηi > ηj for Si < Sj and that all parameters are positive.]
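For readers working outside Excel, the following Python sketch codes the log-likelihood of Equation 24.26 for this example (n = 50; 2, 10, and 21 failures at the read-outs 548.0, 824.5, and 1200 min). Unlike the Excel template, the sketch does not impose the ηi ordering constraints explicitly and relies on an ordered starting point; the starting values are our own guesses.

```python
import numpy as np
from scipy.optimize import minimize

read_outs = np.array([548.0, 824.5, 1200.0])   # t1, t2, T in minutes (T = 20 h)
r = np.array([2, 10, 21])                      # failures observed at each read-out
n = 50

def neg_loglik(params):
    eta, gamma = np.asarray(params[:3]), params[3]
    dt = np.diff(np.concatenate(([0.0], read_outs)))   # sojourn times dt_k
    u = np.cumsum(dt / eta)                            # te:i(h)/eta_h in Eq. 24.26
    surv = np.exp(-u**gamma)
    interval = np.concatenate(([1.0], surv[:-1])) - surv   # interval probabilities
    return -(np.sum(r * np.log(interval)) + (n - r.sum()) * np.log(surv[-1]))

res = minimize(neg_loglik, x0=(3000.0, 1500.0, 800.0, 1.5),
               bounds=[(1.0, None)] * 3 + [(0.1, None)])
print(res.x)   # (eta1, eta2, eta3, gamma); the fit depends on the start point
```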

24.6 Conclusion

In this chapter, some issues related to the modeling, planning, and analysis of a multiple-step SSALT have been addressed. In particular, a model that is additive in cumulative fractional exposure is presented and shown to be useful both for solving the optimal design problem and for the data analysis of an SSALT. A new algorithm is provided for designing a multiple-step SSALT test plan, assuming that a target acceleration factor and a predetermined number of steps are given. Likelihood functions for two common testing situations under an SSALT are derived. The methods presented here are also implemented in Microsoft Excel™ with a numerical example.

References

[1] Degroot MH, Goel PK. Bayesian estimation and optimal designs in partially accelerated life testing. Nav Res Logist Q 1979;26:223–35.
[2] Nelson W. Accelerated life testing: step-stress models and data analysis. IEEE Trans Reliab 1980;29:103–8.
[3] Nelson W. Accelerated testing: statistical models, test plans and data analysis. New York: John Wiley & Sons; 1990.
[4] Bai DS, Kim MS, Lee SH. Optimum simple step-stress accelerated life tests with censoring. IEEE Trans Reliab 1989;38:528–32.
[5] Bai DS, Kim MS. Optimum simple step-stress accelerated life test for Weibull distribution and type I censoring. Nav Res Logist Q 1993;40:193–210.
[6] Khamis IH, Higgins JJ. Optimum 3-step step-stress tests. IEEE Trans Reliab 1996;45:341–5.
[7] Khamis IH. Optimum M-step, step-stress design with k stress variables. Commun Stat Comput Sim 1997;26:1301–13.
[8] Park SJ, Yum BJ. Optimal design of accelerated life tests under modified stress loading methods. J Appl Stat 1998;25:41–62.
[9] Yeo KP, Tang LC. Planning step-stress life-test with a target acceleration-factor. IEEE Trans Reliab 1999;48:61–7.
[10] VanDorp JR, Mazzuchi TA, Fornell GE, Pollock LR. A Bayes approach to step-stress accelerated life testing. IEEE Trans Reliab 1996;45:491–8.
[11] Tang LC, Sun YS, Goh TN, Ong HL. Analysis of step-stress accelerated-life-test data: a new approach. IEEE Trans Reliab 1996;45:69–74.
[12] Khamis IH, Higgins JJ. A new model for step-stress testing. IEEE Trans Reliab 1998;47:131–4.
[13] Chang DS, Chiu CC, Jiang ST. Modeling and reliability prediction for the step-stress degradation measurements using neural networks methodology. Int J Reliab Qual Saf Eng 1999;6:277–88.
[14] Xiong CJ, Milliken GA. Step-stress life-testing with random stress-change times for exponential data. IEEE Trans Reliab 1999;48:141–8.
[15] Tang LC, Sun Y. Selecting the most reliable population under step-stress accelerated life test. Int J Reliab Qual Saf Eng 1999;4:347–59.
[16] Tseng ST, Wen ZC. Step-stress accelerated degradation analysis for highly reliable products. J Qual Technol 2000;32:209–16.
[17] McLinn JA. Ways to improve the analysis of multi-level accelerated life testing. Qual Reliab Eng Int 1998;14:393–401.
[18] Sun FB, Chang WC. Reliability evaluation of a flash RAM using step-stress test results. In: Annual Reliability and Maintainability Symposium Proceedings, 2000; p. 254–9.
[19] Gouno E. An inference method for temperature step-stress accelerated life testing. Qual Reliab Eng Int 2001;17:11–8.


Step-stress Accelerated Life Testing

Chapter 25
Chengjie Xiong

25.1 Introduction
25.2 Step-stress Life Testing with Constant Stress-change Times
25.2.1 Cumulative Exposure Model
25.2.2 Estimation with Exponential Data
25.2.3 Estimation with Other Distributions
25.2.4 Optimum Test Plan
25.3 Step-stress Life Testing with Random Stress-change Times
25.3.1 Marginal Distribution of Lifetime
25.3.2 Estimation
25.3.3 Optimum Test Plan
25.4 Bibliographical Notes

25.1 Introduction

Accelerated life testing consists of a variety of test methods for shortening the life of test items or hastening the degradation of their performance. The aim of such testing is to obtain data quickly which, when properly modeled and analyzed, yield the desired information on product life or performance under normal use conditions. Accelerated life testing can be carried out using constant stress, step-stress, or linearly increasing stress. The step-stress test has been widely used in electronics applications to reveal failure modes. The step-stress scheme applies stress to test units in such a way that the stress setting of the test units is changed at certain specified times. Generally, a test unit starts at a low stress. If the unit does not fail by a specified time, the stress on it is raised to a higher level and held for a specified time. The stress is repeatedly increased and held until the test unit fails. A simple step-stress accelerated life test uses only two stress levels.

In this chapter, we give a summary of a collection of results, by the author and some others, dealing with the statistical models and estimation based on data from step-stress accelerated life testing. We consider two cases: when the stress-change times are either constants or random variables. More specifically, we use data from step-stress life testing to estimate the unknown parameters in the stress–response relationship and the reliability function at the design stress. We also discuss the optimum test plan associated with the estimation of the reliability function, under the assumption that the lifetime at a constant stress is exponential. Step-stress life testing with constant stress-change times is considered in Section 25.2. In Section 25.3, results are given when the stress-change times are random variables.

25.2 Step-stress Life Testing with Constant Stress-change Times

25.2.1 Cumulative Exposure Model

Since a test unit in a step-stress test is exposed to several different stress levels, its lifetime distribution combines the lifetime distributions from all stress levels used in the test. The cumulative exposure model of the lifetime in a step-stress life testing continuously pieces these lifetime distributions together in the order in which the stress levels are applied. More specifically, the step-stress cumulative exposure model assumes that the remaining lifetime of a test unit depends only on the current cumulative fraction failed and the current stress, regardless of how the fraction is accumulated. In addition, if held at the current stress, survivors will fail according to the cumulative distribution for that stress, but starting at the previously accumulated fraction failed. Moreover, the cumulative exposure model assumes that the change in stress has no effect on life; only the level of stress does. The cumulative exposure model has been used by many authors to model data from step-stress accelerated life testing [1–5]. Nelson [3], chapter 10, extensively studied cumulative exposure models when the stress is changed at prespecified constant times.

We denote the stress variable by x and assume that x > 0. In general, the variable x represents a particular scale for the stress derived from the mechanism of the physics of failure. For example, x can be the voltage in a study of the voltage endurance of some transformer oil, or the temperature in a study of the life of certain semiconductors. We assume that F is a strictly increasing and continuously differentiable probability distribution function on (0, ∞). We also assume that the scale family F(t/θ(x)) is the cumulative distribution function of the lifetime under constant stress x. Let f(t) = dF(t)/dt for t ≥ 0. Note that, at a constant stress x, the lifetime depends on the stress only through the parameter θ(x). Since

$$\int_0^\infty t\,dF\!\left(\frac{t}{\theta(x)}\right)=\theta(x)\int_0^\infty y\,dF(y)\qquad(25.1)$$

θ(x) is a multiple of the mean lifetime under constant stress x. We call θ(x) the stress–response relation. Further, we assume that ln θ(x) is linearly and inversely related to the stress x, i.e. ln θ(x) = α + βx with α > 0 and β < 0. The parameters α and β are characteristics of the products and test methods. This simple stress–response relationship covers some of the most important models in engineering, such as the power law model, the Eyring model, and the Arrhenius model (e.g. see Mann et al. [6]).

engineering, such as the power law model, theEyring model, and the Arrhenius model (e.g. seeMann et al. [6]).

We use x0 to denote the design stress thatis the stress level under normal use conditions.We use x1 < x2 < · · ·< xk to denote the k

stress levels gradually applied in that order ina step-stress testing. Let ci be the time thatstress is changed from xi to xi+1, 1≤ i ≤ k −1; c1 < c2 < · · ·< ck−1. Let c0 = 0 and ck =∞.Let T be the lifetime under the step-stresstesting. Denote θi = θ(xi) and Fi(t)= F(t/θi),i = 0, 1, 2, . . . , k. According to the cumulativeexposure model, the cumulative distributionfunction of T is given by

G(t)={F1(t) for 0≤ t < c1

Fi(t − ci−1 + si−1) for ci−1 ≤ t < ci

i = 2, 3, . . . , k (25.2)

where si−1 is determined by Fi(si−1)= Fi−1(ci−1 − ci−2 + si−2), i = 2, 3, . . . , k; s0 = 0.Since Fi(t)= F(t/θi), it follows that si−1/

θi = (ci−1 − ci−2 + si−2)/θi−1, i = 2, 3, . . . , k.Solving these equations inductively for si gives

si−1 = θi(ci−1 − ci−2 + si−2)

θi−1

= θi

i−1∑j=1

cj − cj−1

θj

i = 2, 3, . . . , k. Therefore

G(t)=

F

(t

θ1

)for 0≤ t < c1

F

(t − ci−1

θi+

i−1∑j=1

cj − cj−1

θj

)for ci−1 ≤ t < ci

i = 2, 3, . . . , k (25.3)


Thus, the probability density function of T is

$$g(t)=\begin{cases}\dfrac{1}{\theta_1}f\!\left(\dfrac{t}{\theta_1}\right) & \text{for } 0\le t<c_1\\[1ex] \dfrac{1}{\theta_i}f\!\left(\dfrac{t-c_{i-1}}{\theta_i}+\sum\limits_{j=1}^{i-1}\dfrac{c_j-c_{j-1}}{\theta_j}\right) & \text{for } c_{i-1}\le t<c_i,\ i=2,3,\dots,k\end{cases}\qquad(25.4)$$

Notice that, at the design stress x0, the reliability function of the lifetime is

$$R_0(t)=1-F\!\left(\frac{t}{\theta(x_0)}\right)\qquad(25.5)$$

One of the most important goals in carrying out a step-stress life testing is to obtain a statistical estimate of the reliability of the lifetime under the design stress x0 by analyzing data from the test. When the functional form of F is given, the estimation of the reliability function reduces to that of α and β.

25.2.2 Estimation with Exponential Data

When the lifetime under a constant stress is assumed exponential, i.e. F(t) = 1 − exp(−t), some exact estimates are available. We demonstrate this using the simple step-stress testing; many of these results can be generalized to general step-stress testing. Suppose that n test units are initially placed on the low stress level x1 and run until time c1, when the stress is changed to x2 and the test is continued until the first r (r > 1) lifetimes are observed. We assume that n1 test units fail before time c1 and their failure times T1j, j = 1, 2, ..., n1, are observed under stress x1; n − n1 test units survive time c1 and n2 failure times T2j, j = 1, 2, ..., n2, are observed under stress x2 after time c1 (n2 = r − n1). Let $T_{i\cdot}=\sum_{j=1}^{n_i}T_{ij}$ be the total lifetime under xi, i = 1, 2. Let Mt and mt be the observed maximum and minimum lifetimes respectively. Note that n1 may be zero or r with positive probability.

The assumptions of the cumulative exposure model and exponentially distributed life at any constant stress imply that the probability density function of a test unit under the simple step-stress test is

$$g(t)=\begin{cases}\dfrac{1}{\theta_1}\exp\!\left(-\dfrac{t}{\theta_1}\right) & \text{if } 0\le t<c_1\\[1ex] \dfrac{1}{\theta_2}\exp\!\left(-\dfrac{t-c_1}{\theta_2}-\dfrac{c_1}{\theta_1}\right) & \text{if } c_1\le t<\infty\end{cases}\qquad(25.6)$$

The likelihood function from the first r lifetime observations Tij is then

$$L(\alpha,\beta)=\frac{n!}{(n-r)!}\,\Pi_{ij}\,g(T_{ij})\,[1-G(M_t)]^{n-r}$$

Setting ∂ log L/∂α = 0 and ∂ log L/∂β = 0 yields the maximum likelihood estimators for α and β when r > n1 and n1U2 < n2U1:

$$\hat\alpha=\frac{1}{x_2-x_1}\left(x_1\ln\frac{n_2}{U_2}-x_2\ln\frac{n_1}{U_1}\right)\qquad(25.7)$$

$$\hat\beta=\frac{1}{x_2-x_1}\ln\left(\frac{n_1U_2}{n_2U_1}\right)\qquad(25.8)$$

where

$$U_1=T_{1\cdot}+(n-n_1)c_1\qquad(25.9)$$
$$U_2=T_{2\cdot}-n_2c_1+(n-r)(M_t-c_1)\qquad(25.10)$$
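These closed forms translate directly into code. The following Python sketch (valid, per the text, when r > n1 and n1U2 < n2U1) uses argument names of our own choosing.

```python
import numpy as np

def mle_alpha_beta(t1, t2, c1, Mt, n, r, x1, x2):
    """Equations 25.7-25.10: t1, t2 are the failure times observed under
    x1 and x2; Mt is the largest observed lifetime."""
    n1, n2 = len(t1), len(t2)
    U1 = np.sum(t1) + (n - n1) * c1                    # Equation 25.9
    U2 = np.sum(t2) - n2 * c1 + (n - r) * (Mt - c1)    # Equation 25.10
    alpha = (x1 * np.log(n2 / U2) - x2 * np.log(n1 / U1)) / (x2 - x1)
    beta = np.log((n1 * U2) / (n2 * U1)) / (x2 - x1)
    return alpha, beta
```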

Parameter β describes the relationship between the mean lifetime of a test unit and the stress level. One may be interested in testing whether β equals zero, especially when the stress levels x1 and x2 are close to each other. To test H0: β = 0 versus H1: β < 0, we use the likelihood ratio test. When H0 is true, the maximum likelihood estimator for α is

$$\tilde\alpha=\ln\left(\frac{U_1+U_2}{r}\right)$$

The likelihood ratio is then

$$\Lambda=\frac{L(\tilde\alpha,0)}{L(\hat\alpha,\hat\beta)}=\left(\frac{r}{U_1+U_2}\right)^{r}\left(\frac{U_1}{n_1}\right)^{n_1}\left(\frac{U_2}{n_2}\right)^{n_2}$$

A size γ (0 < γ < 1) test of H0 rejects H0 if Λ < c, where c is a constant such that

$$P(\Lambda\le c\mid H_0)=\gamma$$


Since the distribution of Λ is difficult to obtain, computer simulation can be used to approximate the distribution of Λ when H0 is true.

Next, we construct confidence interval estimates for α, β, θ0, and R0(t) at a given time t. We first observe that, if a random variable T has probability density function as in Equation 25.6, then it is easy to verify that the random variable

$$S=\begin{cases}\dfrac{T}{\theta_1} & \text{if } 0\le T<c_1\\[1ex] \dfrac{T-c_1}{\theta_2}+\dfrac{c_1}{\theta_1} & \text{if } c_1\le T<\infty\end{cases}\qquad(25.11)$$

is exponentially distributed with mean unity (see Xiong [7]).

The following lemma is used in the derivation of the confidence intervals for α, β, θ0, and R0(t) at a given time t (see Lawless [8], theorem 3.5.1, p. 127).

Lemma 1. Suppose that Si, i = 1, 2, ..., n, are independent and identically distributed exponential random variables with mean unity. Let S(1) ≤ S(2) ≤ ··· ≤ S(r) be the r smallest ordered observations and $S_\cdot=\sum_{i=1}^{r}S_{(i)}$. Then 2nS(1) has a χ² distribution with two degrees of freedom and 2[S· + (n − r)S(r) − nS(1)] has a χ² distribution with 2r − 2 degrees of freedom. Further, 2nS(1) and 2[S· + (n − r)S(r) − nS(1)] are independent.

Let 0 < γ < 1. Now we transform all random variables Tij into Sij through Equation 25.11. Denote

$$m_s=\min\{S_{ij},\ i=1,2,\ j=1,\dots,n_i\}=\begin{cases}\dfrac{m_t}{\theta_1} & \text{if } n_1>0\\[1ex] \dfrac{m_t-c_1}{\theta_2}+\dfrac{c_1}{\theta_1} & \text{if } n_1=0\end{cases}$$

and

$$M_s=\max\{S_{ij},\ i=1,2,\ j=1,\dots,n_i\}=\begin{cases}\dfrac{M_t}{\theta_1} & \text{if } r=n_1\\[1ex] \dfrac{M_t-c_1}{\theta_2}+\dfrac{c_1}{\theta_1} & \text{if } r>n_1\end{cases}$$

A direct application of Lemma 1 implies that

$$2nm_s\sim\chi^2(2)\qquad\text{and}\qquad D\sim\chi^2(2r-2)$$

where

$$D=\begin{cases}\dfrac{2V_0}{\theta_1} & \text{if } r=n_1>0\\[1ex] 2\left(\dfrac{V_1}{\theta_1}+\dfrac{V_2}{\theta_2}\right) & \text{if } r>n_1>0\\[1ex] \dfrac{2[V_2-n(m_t-c_1)]}{\theta_2} & \text{if } n_1=0\end{cases}$$

and

$$V_0=T_{1\cdot}+(n-r)M_t-nm_t\qquad V_1=U_1-nm_t\qquad V_2=U_2$$

where U1 and U2 are given by Equations 25.9 and 25.10 respectively.

For any 0 < γ < 1, let Fγ(2, 2r − 2) be the upper 100γ% point of the F distribution with degrees of freedom 2 and 2r − 2. Denote K1 = F1−γ/2(2, 2r − 2) and K2 = Fγ/2(2, 2r − 2). The independence of 2nms and D implies that

$$P\left[K_1\le\frac{2n(r-1)m_s}{D}\le K_2\right]=1-\gamma$$

Therefore

$$P\left\{\frac{[V_2-n(m_t-c_1)]K_1}{n(r-1)c_1}-\frac{m_t-c_1}{c_1}\le\exp[\beta(x_2-x_1)]\le\frac{[V_2-n(m_t-c_1)]K_2}{n(r-1)c_1}-\frac{m_t-c_1}{c_1},\ n_1=0\right\}$$
$$+\,P\left\{\frac{n(r-1)m_t-V_1K_2}{V_2K_2}\le\exp[\beta(x_1-x_2)]\le\frac{n(r-1)m_t-V_1K_1}{V_2K_1},\ r>n_1>0\right\}$$
$$+\,P\left\{K_1\le\frac{n(r-1)m_t}{V_0}\le K_2,\ r=n_1>0\right\}=1-\gamma$$

Thus, if r > n1 > 0, a 100(1 − γ)% confidence interval for β is [l1, l2] ∩ (−∞, 0) when [n(r − 1)mt − V1K2]/(V2K2) > 0, and [l1, ∞) ∩ (−∞, 0) otherwise, where

$$l_i=\frac{1}{x_2-x_1}\ln\left[\frac{V_2K_i}{n(r-1)m_t-V_1K_i}\right]\qquad i=1,2$$

If n1 = 0, a 100(1 − γ)% confidence interval for β is [l3, l4] ∩ (−∞, 0) when [V2 − n(mt − c1)]K1/[n(r − 1)c1] − (mt − c1)/c1 > 0, and (−∞, l4] ∩ (−∞, 0) otherwise, where

$$l_i=\frac{1}{x_2-x_1}\ln\left[\frac{[V_2-n(m_t-c_1)]K_{i-2}}{n(r-1)c_1}-\frac{m_t-c_1}{c_1}\right]\qquad i=3,4$$

If r = n1 > 0, a 100(1 − γ)% confidence interval for β is (−∞, 0) when K1 ≤ n(r − 1)mt/V0 ≤ K2, and the empty set otherwise.
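As an illustration, the r > n1 > 0 case can be computed with scipy's F quantiles. The sketch below returns the endpoints of [l1, l2] ∩ (−∞, 0); the function and argument names are ours.

```python
import numpy as np
from scipy.stats import f

def beta_interval(V1, V2, mt, n, r, x1, x2, gamma=0.10):
    """Confidence interval for beta when r > n1 > 0."""
    K1 = f.ppf(gamma / 2.0, 2, 2 * r - 2)        # K1 = F_{1-gamma/2}: lower quantile
    K2 = f.ppf(1.0 - gamma / 2.0, 2, 2 * r - 2)  # K2 = F_{gamma/2}: upper quantile
    def l(K):
        return np.log(V2 * K / (n * (r - 1) * mt - V1 * K)) / (x2 - x1)
    if n * (r - 1) * mt - V1 * K2 > 0:
        return l(K1), min(l(K2), 0.0)            # [l1, l2] intersected with (-inf, 0)
    return l(K1), 0.0                            # [l1, inf) intersected with (-inf, 0)
```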

For 0 < γ < 1, let χ²γ(k) be the upper 100γ% point of the χ² distribution with k degrees of freedom. Let 0 < γ* < 1. Denote B1 = χ²1−γ/2(2), B2 = χ²γ/2(2), B3 = χ²1−γ*/2(2r − 2) and B4 = χ²γ*/2(2r − 2). From the independence of the random variables 2nms and D, it follows that

$$(1-\gamma)(1-\gamma^*)=P(B_1\le 2nm_s\le B_2,\ B_3\le D\le B_4)$$
$$=P\left(B_1\le\frac{2nm_t}{\theta_1}\le B_2,\ B_3\le\frac{2V_0}{\theta_1}\le B_4,\ r=n_1>0\right)$$
$$+\,P\left(\frac{2nm_t}{B_2}\le\theta_1\le\frac{2nm_t}{B_1},\ \frac{B_3}{2}-\frac{V_1}{\theta_1}\le\frac{V_2}{\theta_2}\le\frac{B_4}{2}-\frac{V_1}{\theta_1},\ r>n_1>0\right)$$
$$+\,P\left(\frac{B_1}{2n}-\frac{m_t-c_1}{\theta_2}\le\frac{c_1}{\theta_1}\le\frac{B_2}{2n}-\frac{m_t-c_1}{\theta_2},\ \frac{V_2-n(m_t-c_1)}{B_4}\le\frac{\theta_2}{2}\le\frac{V_2-n(m_t-c_1)}{B_3},\ n_1=0\right)$$
$$\le P\left(\exp(\alpha)\ge\max\left(\frac{2nm_t}{B_2},\frac{2V_0}{B_4}\right),\ r=n_1>0\right)$$
$$+\,P\left(\frac{2nm_t}{B_2}\le\theta_1\le\frac{2nm_t}{B_1},\ \frac{B_3}{2}-\frac{V_1B_2}{2nm_t}\le\frac{V_2}{\theta_2}\le\frac{B_4}{2}-\frac{V_1B_1}{2nm_t},\ r>n_1>0\right)$$
$$+\,P\left(\frac{B_1}{n}-\frac{(m_t-c_1)B_4}{V_2-n(m_t-c_1)}\le\frac{2c_1}{\theta_1}\le\frac{B_2}{n}-\frac{(m_t-c_1)B_3}{V_2-n(m_t-c_1)},\ \frac{V_2-n(m_t-c_1)}{B_4}\le\frac{\theta_2}{2}\le\frac{V_2-n(m_t-c_1)}{B_3},\ n_1=0\right)$$

Thus, if r > n1 > 0, a confidence interval for α of confidence level at least 100(1 − γ)(1 − γ*)% is [l5, l6] when B3 − V1B2/(nmt) > 0, and (−∞, l6] otherwise, where

$$l_i=\frac{1}{x_2-x_1}\left[x_2\ln\left(\frac{2nm_t}{B_{7-i}}\right)-x_1\ln\left(\frac{2nm_tV_2}{nm_tB_{i-2}-V_1B_{7-i}}\right)\right]\qquad i=5,6$$

If n1 = 0, a confidence interval for α of confidence level at least 100(1 − γ)(1 − γ*)% is [l7, l8] when B1/n − (mt − c1)B4/[V2 − n(mt − c1)] > 0, and [l7, ∞) otherwise, where

$$l_i=\frac{1}{x_2-x_1}\left(x_2\ln\left\{\frac{2nc_1[V_2-n(m_t-c_1)]}{[V_2-n(m_t-c_1)]B_{9-i}-n(m_t-c_1)B_{i-4}}\right\}-x_1\ln\left\{\frac{2[V_2-n(m_t-c_1)]}{B_{i-4}}\right\}\right)\qquad i=7,8$$

If r = n1 > 0, a confidence interval for α of confidence level at least 100(1 − γ)(1 − γ*)% is [max(ln(2nmt/B2), ln(2V0/B4)), ∞).

Note that θi = exp(α + βxi) = exp[(α + βx0) + β(xi − x0)], i = 1, 2. Using an argument very similar to the derivation of the confidence interval for α, we can also set up confidence intervals for the mean lifetime θ(x0) = exp(α + βx0) at the design stress. If r > n1 > 0, a confidence interval for θ(x0) of confidence level at least 100(1 − γ)(1 − γ*)% is [l9, l10] when B3 − V1B2/(nmt) > 0, and (−∞, l10] otherwise, where

$$l_i=\exp\left[\frac{x_2-x_0}{x_2-x_1}\ln\left(\frac{2nm_t}{B_{11-i}}\right)-\frac{x_1-x_0}{x_2-x_1}\ln\left(\frac{2nm_tV_2}{nm_tB_{i-6}-V_1B_{11-i}}\right)\right]\qquad i=9,10$$

If n1 = 0, a confidence interval for θ(x0) of confidence level at least 100(1 − γ)(1 − γ*)% is [l11, l12] when B1/n − (mt − c1)B4/[V2 − n(mt − c1)] > 0, and [l11, ∞) otherwise, where

$$l_i=\exp\left(\frac{x_2-x_0}{x_2-x_1}\ln\left\{\frac{2nc_1[V_2-n(m_t-c_1)]}{[V_2-n(m_t-c_1)]B_{13-i}-n(m_t-c_1)B_{i-8}}\right\}-\frac{x_1-x_0}{x_2-x_1}\ln\left\{\frac{2[V_2-n(m_t-c_1)]}{B_{i-8}}\right\}\right)\qquad i=11,12$$

If r = n1 > 0, a confidence interval for θ(x0) of confidence level at least 100(1 − γ)(1 − γ*)% is [exp(max(ln(2nmt/B2), ln(2V0/B4))), ∞).

A confidence interval for the reliability function R0(t) = exp[−t/θ(x0)] at any given survival time t can be obtained from the confidence interval for θ(x0). More specifically, if [lL, lU] is a confidence interval for θ(x0) of confidence level at least 100(1 − γ)(1 − γ*)%, then [exp(−t/lL), exp(−t/lU)] is a confidence interval for R0(t) of the same confidence level.

25.2.3 Estimation with Other Distributions

For most other life distributions in step-stress life testing, there are usually no closed-form estimates. Standard asymptotic theory can be used to set up asymptotic confidence intervals for the model parameters. More specifically, the likelihood function from the first r observations of a simple step-stress life testing is

$$L(\alpha,\beta)=\frac{n!}{(n-r)!}\,\Pi_{ij}\,g(T_{ij})\,[1-G(M_t)]^{n-r}$$

A standard routine, such as the Newton–Raphson method, can be applied to find the maximum likelihood estimates of α and β by solving the estimating equations ∂ log L/∂α = 0 and ∂ log L/∂β = 0. We assume that

$$\lim_{n\to\infty}\frac{r}{n}>F\!\left(\frac{c_1}{\theta_1}\right)$$

Under certain regularity conditions on f(t) (see Halperin [9]), the maximum likelihood estimate (α̂, β̂)ᵀ is asymptotically normally distributed with mean (α, β)ᵀ and covariance matrix Σ = (σrs), r, s = 1, 2, where Σ = I⁻¹, I = (irs), and

$$i_{11}=-\left.\frac{\partial^2\ln L}{\partial\alpha^2}\right|_{(\alpha,\beta)=(\hat\alpha,\hat\beta)}\qquad i_{22}=-\left.\frac{\partial^2\ln L}{\partial\beta^2}\right|_{(\alpha,\beta)=(\hat\alpha,\hat\beta)}\qquad i_{12}=i_{21}=-\left.\frac{\partial^2\ln L}{\partial\alpha\,\partial\beta}\right|_{(\alpha,\beta)=(\hat\alpha,\hat\beta)}$$

Therefore, a 100(1 − γ)% (0 < γ < 1) asymptotic confidence interval for α is

$$\hat\alpha\pm z_{\gamma/2}\sqrt{\sigma_{11}}$$

where zγ/2 is the upper γ/2 point of the standard normal distribution. A 100(1 − γ)% asymptotic confidence interval for β is

$$\hat\beta\pm z_{\gamma/2}\sqrt{\sigma_{22}}$$

A 100(1 − γ)% asymptotic confidence interval for θ0 = θ(x0) = exp(α + βx0) is

$$\hat\theta_0\pm z_{\gamma/2}\,\hat\theta_0\sqrt{(1,x_0)\,\Sigma\,(1,x_0)^{\mathrm T}}$$

where θ̂0 = exp(α̂ + β̂x0). A 100(1 − γ)% asymptotic confidence interval for R0(t) = 1 − F(t/θ(x0)) at time t can be obtained by applying the transformation R0(t) = 1 − F(t/θ(x0)) to the 100(1 − γ)% asymptotic confidence interval of θ(x0) = exp(α + βx0).


25.2.4 Optimum Test Plan

At the design stage of a step-stress life testing, the choices of stress levels and stress-change times are very important decisions. The optimum test plan chooses these parameters by minimizing the asymptotic variance of the maximum likelihood estimate of ln θ0 = α + βx0. For a simple step-stress life testing with exponential lifetime at constant stress, Miller and Nelson [4] showed that the optimum test plan chooses x1 as low as possible and x2 as high as possible, as long as these choices do not cause failure modes different from those at the design stress. When all units are run to failure, they also proved that the optimum choice of the stress-change time is

$$c_1=\theta_1\ln\left(\frac{2\xi+1}{\xi}\right)$$

where ξ = (x1 − x0)/(x2 − x1) is called the amount of stress extrapolation. When a constant censoring time c (c > c1) is used in the simple step-stress life testing, Bai et al. [10] showed that the optimum stress-change time c1 is the unique solution of the equation

$$\left[\frac{F_1(c_1)}{F_2(c-c_1)(1-F_1(c_1))}\right]^2\left\{F_2(c-c_1)+\frac{\theta_1}{\theta_2}[1-F_2(c-c_1)]\right\}=\left(\frac{1+\xi}{\xi}\right)^2$$

where Fi(t) = 1 − exp(−t/θi), i = 1, 2. Xiong [11] used a different optimization criterion and also obtained results for the optimum choice of the stress-change time c1. All these optimum test plans are called locally optimum by Chernoff [12], since they depend on the unknown parameters α and β, which can only be guessed or estimated from a pilot study. Bai et al. [10] and Xiong [11] also discussed the variance increase in the maximum likelihood estimation of ln θ0 when incorrect α and β are used.

25.3 Step-stress Life Testing with Random Stress-change Times

In this section we consider the situation where the stress-change times in a step-stress life testing are random. Random stress-change times occur frequently in real-life applications. For example, in a simple step-stress testing, instead of increasing the stress at a prespecified constant time, the stress can be increased right after a certain number of test units fail. The stress-change time in this case is an order statistic from the lifetime distribution under the first stress level. In general, when the stress-change times are random, there are two sources of variation in the lifetime under a step-stress testing: one is the variation due to the different stress levels used in the test; the other is due to the randomness of the stress-change times. We study the marginal lifetime distribution under a step-stress life testing with random stress-change times in Section 25.3.1. The estimation and the optimum test plan are presented in Sections 25.3.2 and 25.3.3.

25.3.1 Marginal Distribution of Lifetime

We first consider the case of a simple step-stress life testing for which the stress-change time is a random variable. Let us assume that C1 is the random stress-change time with probability density function hC1(c1) and that T is the lifetime under such a simple step-stress testing. We also assume that the conditional cumulative distribution function GT|C1 of T, given the stress-change time C1 = c1, is given by the classic cumulative exposure model (Equation 25.2):

$$G_{T|C_1}(t)=\begin{cases}F_1(t) & \text{for } 0\le t<c_1\\ F_2(t-c_1+s_1) & \text{for } c_1\le t<\infty\end{cases}$$

where s1 = c1θ2/θ1. The conditional probability density function g(t | c1) of T, given C1 = c1, is


then

$$g(t\mid c_1)=\begin{cases}\dfrac{1}{\theta_1}f\!\left(\dfrac{t}{\theta_1}\right) & \text{for } 0\le t<c_1\\[1ex] \dfrac{1}{\theta_2}f\!\left(\dfrac{t-c_1}{\theta_2}+\dfrac{c_1}{\theta_1}\right) & \text{for } c_1\le t<\infty\end{cases}$$

The joint probability density function d(t, c1) of T and C1 is now

$$d(t,c_1)=g(t\mid c_1)\,h_{C_1}(c_1)$$

Thus, the marginal probability density function q(t) of T is

$$q(t)=\int_0^\infty d(t,c_1)\,dc_1=\frac{1}{\theta_2}\int_0^t f\!\left(\frac{t-c_1}{\theta_2}+\frac{c_1}{\theta_1}\right)h_{C_1}(c_1)\,dc_1+\frac{1}{\theta_1}f\!\left(\frac{t}{\theta_1}\right)P(C_1>t)\qquad(25.12)$$

The integration is replaced by a sum if the stress-change time C1 is discrete.

The integration is replaced by a sum if the stress-change time C1 is discrete.

We now assume that n test units are initiallyplaced on low stress level x1 and run untilr test units fail, 1≤ r < n. The stress is thenchanged to x2 and is continued until all unitsfail. The stress-change time C1 is the rth orderstatistic of a sample of size n from the lifetimedistribution under stress x1. Thus, the probabilitydensity function of C1 (see Lawless [8], p. 518) is

hC1(c1)=(n− 1r − 1

)n

θ1f

(c1

θ1

)Fr−1

(c1

θ1

)× Rn−r

(c1

θ1

)where R(t) = 1− F(t), t > 0. The marginal prob-ability density function of the lifetime under sucha test plan is

q(t)= n

(n− 1r − 1

) [1

θ1θ2

×∫ t

0f

(t − c1

θ2+ c1

θ1

)f

(c1

θ1

)Fr−1

(c1

θ1

)

× Rn−r(c1

θ1

)dc1

+ 1

θ21

f

(t

θ1

) ∫ ∞t

f

(c1

θ1

)Fr−1

(c1

θ1

)× Rn−r

(c1

θ1

)dc1

](25.13)

When the lifetime distribution at a constant stress is assumed exponential, i.e. f(t) = exp(−t) for t > 0, q(t) can be further simplified to

$$q(t)=n\binom{n-1}{r-1}\sum_{i=0}^{r-1}\frac{(-1)^i\binom{r-1}{i}}{(\xi_r^{n,i}+2)\theta_2-\theta_1}\left[\exp\left(-\frac{t}{\theta_2}\right)-\exp\left(-\frac{t(\xi_r^{n,i}+2)}{\theta_1}\right)\right]+n\binom{n-1}{r-1}\sum_{i=0}^{r-1}\frac{(-1)^i\binom{r-1}{i}}{\theta_1(\xi_r^{n,i}+1)}\exp\left(-\frac{t(\xi_r^{n,i}+2)}{\theta_1}\right)\qquad(25.14)$$

where $\xi_r^{n,i}=n+i-r$.

The first two moments of T based on the marginal distribution (Equation 25.14) are

$$ET=n\binom{n-1}{r-1}\sum_{i=0}^{r-1}\left\{\frac{(-1)^i\binom{r-1}{i}\theta_2^2}{(\xi_r^{n,i}+2)\theta_2-\theta_1}-\frac{(-1)^i\binom{r-1}{i}\theta_1^2}{(\xi_r^{n,i}+2)^2[(\xi_r^{n,i}+2)\theta_2-\theta_1]}+\frac{(-1)^i\binom{r-1}{i}\theta_1}{(\xi_r^{n,i}+2)^2(\xi_r^{n,i}+1)}\right\}\qquad(25.15)$$

$$ET^2=2n\binom{n-1}{r-1}\sum_{i=0}^{r-1}\left\{\frac{(-1)^i\binom{r-1}{i}\theta_2^3}{(\xi_r^{n,i}+2)\theta_2-\theta_1}-\frac{(-1)^i\binom{r-1}{i}\theta_1^3}{(\xi_r^{n,i}+2)^3[(\xi_r^{n,i}+2)\theta_2-\theta_1]}+\frac{(-1)^i\binom{r-1}{i}\theta_1^2}{(\xi_r^{n,i}+2)^3(\xi_r^{n,i}+1)}\right\}\qquad(25.16)$$

Next we consider the general step-stress life testing in which k stress levels x1 < x2 < ··· < xk are used. Test units are initially placed under stress x1. The stress is increased gradually from x1 to xk at random times C1 < C2 < ··· < Ck−1 (continuous or discrete). Let C0 = c0 = 0 and Ck = ck = ∞. Let T be the lifetime of a test unit under such a step-stress life testing plan. We assume that the conditional cumulative distribution function G_{T|C1C2···Ck−1}(t) of T, given the stress-change times C1 = c1, C2 = c2, ..., Ck−1 = ck−1, follows the cumulative exposure model (Equation 25.3), i.e.

$$G_{T|C_1C_2\cdots C_{k-1}}(t)=\begin{cases}F\!\left(\dfrac{t}{\theta_1}\right) & \text{for } 0\le t<c_1\\[1ex] F\!\left(\dfrac{t-c_{i-1}}{\theta_i}+\sum\limits_{j=1}^{i-1}\dfrac{c_j-c_{j-1}}{\theta_j}\right) & \text{for } c_{i-1}\le t<c_i,\ i=2,3,\dots,k\end{cases}$$

The conditional probability density function g(t | c1, c2, ..., ck−1) of T, given the stress-change times C1 = c1, C2 = c2, ..., Ck−1 = ck−1, is then

$$g(t\mid c_1,\dots,c_{k-1})=\begin{cases}\dfrac{1}{\theta_1}f\!\left(\dfrac{t}{\theta_1}\right) & \text{for } 0\le t<c_1\\[1ex] \dfrac{1}{\theta_i}f\!\left(\dfrac{t-c_{i-1}}{\theta_i}+\sum\limits_{j=1}^{i-1}\dfrac{c_j-c_{j-1}}{\theta_j}\right) & \text{for } c_{i-1}\le t<c_i,\ i=2,3,\dots,k\end{cases}$$

Let h_{Ci|C1C2···Ci−1}(ci) be the conditional probability density function of Ci given C1, C2, ..., Ci−1, for i = 2, 3, ..., k − 1, and let h_{C1|C0}(c1) = h_{C1}(c1), the probability density function of C1. The joint probability density function d(t, c1, c2, ..., ck−1) of T and C1, C2, ..., Ck−1 is now

$$d(t,c_1,\dots,c_{k-1})=g(t\mid c_1,\dots,c_{k-1})\prod_{l=1}^{k-1}h_{C_l|C_1\cdots C_{l-1}}(c_l)$$

$$=\begin{cases}\dfrac{1}{\theta_1}f\!\left(\dfrac{t}{\theta_1}\right)\prod\limits_{l=1}^{k-1}h_{C_l|C_1\cdots C_{l-1}}(c_l) & \text{for } 0\le t<c_1\\[1ex] \dfrac{1}{\theta_i}f\!\left(\dfrac{t-c_{i-1}}{\theta_i}+\sum\limits_{j=1}^{i-1}\dfrac{c_j-c_{j-1}}{\theta_j}\right)\prod\limits_{l=1}^{k-1}h_{C_l|C_1\cdots C_{l-1}}(c_l) & \text{for } c_{i-1}\le t<c_i,\ i=2,3,\dots,k\end{cases}$$

Thus, the marginal probability density function q(t) of T is

$$q(t)=\int_0^\infty\!\cdots\!\int_0^\infty d(t,c_1,\dots,c_{k-1})\,dc_1\cdots dc_{k-1}$$
$$=\sum_{i=2}^{k-1}\int_0^t\!\cdots\!\int_{c_{i-3}}^t\!\int_{c_{i-2}}^t\left[\int_t^\infty\!\cdots\!\int_{c_{k-2}}^\infty\frac{1}{\theta_i}\,f\!\left(\frac{t-c_{i-1}}{\theta_i}+\sum_{j=1}^{i-1}\frac{c_j-c_{j-1}}{\theta_j}\right)\prod_{l=1}^{k-1}h_{C_l|C_1\cdots C_{l-1}}(c_l)\,dc_{k-1}\cdots dc_i\right]dc_{i-1}\,dc_{i-2}\cdots dc_1$$
$$+\int_t^\infty\!\int_{c_1}^\infty\!\cdots\!\int_{c_{k-2}}^\infty\frac{1}{\theta_1}\,f\!\left(\frac{t}{\theta_1}\right)\prod_{l=1}^{k-1}h_{C_l|C_1\cdots C_{l-1}}(c_l)\,dc_{k-1}\cdots dc_2\,dc_1$$
$$+\int_0^t\!\int_{c_1}^t\!\cdots\!\int_{c_{k-2}}^t\frac{1}{\theta_k}\,f\!\left(\frac{t-c_{k-1}}{\theta_k}+\sum_{j=1}^{k-1}\frac{c_j-c_{j-1}}{\theta_j}\right)\prod_{l=1}^{k-1}h_{C_l|C_1\cdots C_{l-1}}(c_l)\,dc_{k-1}\cdots dc_2\,dc_1\qquad(25.17)$$

where the ith summand of the first term corresponds to the event c_{i−1} ≤ t < c_i, the second term to 0 ≤ t < c_1, and the third term to c_{k−1} ≤ t.

The integrations need to be replaced by sums if the stress-change times C1, C2, ..., Ck−1 are discrete. Even though the marginal probability density function q(t) is always theoretically available, it becomes very complex and mathematically intractable, even for simple lifetime distributions, when k is large. We consider several particular cases in the following.

We first note that, when the stress-change times are prespecified constants c1 < c2 < ··· < ck−1, i.e. P(C1 = c1) = P(C2 = c2) = ··· = P(Ck−1 = ck−1) = 1, q(t) becomes

$$q(t)=\begin{cases}\dfrac{1}{\theta_1}f\!\left(\dfrac{t}{\theta_1}\right) & \text{for } 0\le t<c_1\\[1ex] \dfrac{1}{\theta_i}f\!\left(\dfrac{t-c_{i-1}}{\theta_i}+\sum\limits_{j=1}^{i-1}\dfrac{c_j-c_{j-1}}{\theta_j}\right) & \text{for } c_{i-1}\le t<c_i,\ i=2,3,\dots,k\end{cases}$$

Next we study the case k = 3. Two stress-changetimes C1 and C2 are used. Assume that n testunits are initially placed on low stress level x1and run until r1 test units fail. The stress is thenchanged to x2 and run until another r2 units fail.The stress is finally changed to x3 and continueduntil all units fail. The stress-change time C1is the r1th-order statistic of a sample of size n

from the lifetime distribution under stress x1.The probability density function of C1 is

hC1(c1)=(n− 1r1 − 1

)n

θ1f

(c1

θ1

)Fr1−1

(c1

θ1

)× Rn−r1

(c1

θ1

)The stress-change time C2, given C1 = c1, is ther2th-order statistic of a sample of size n− r1 fromthe conditional probability distribution

1

θ2f

(t − c1

θ2+ c1

θ1

)R−1

(c1

θ1

)for t > c1. Thus, the conditional probabilitydensity function of C2, given C1 = c1, is

hC2|C1(c2)=(n− r1)(

n−r1−1r2−1 )

θ2Rn−r1(c1θ1

) f

(c2 − c1

θ2+ c1

θ1

)

×[F

(c2−c1

θ2+ c1

θ1

)−F(c1

θ1

)]r2−1

× Rn−r1−r2(c2 − c1

θ2+ c1

θ1

When the lifetime distribution at a constant stress is exponential, straightforward integration yields the marginal probability density function q(t) of the lifetime T under this step-stress life testing as

$$q(t)=\int_0^\infty\!\!\int_0^\infty g(t\mid c_1,c_2)\,h_{C_2|C_1}(c_2)\,h_{C_1}(c_1)\,dc_2\,dc_1$$
$$=n(n-r_1)\binom{n-1}{r_1-1}\binom{n-r_1-1}{r_2-1}\sum_{i=0}^{r_1-1}\sum_{j=0}^{r_2-1}\Bigg(\frac{(-1)^{i+j}\binom{r_1-1}{i}\binom{r_2-1}{j}}{(\eta_{r_1,r_2}^{n,j}+1)(\xi_{r_1}^{n,i}+1)\theta_1}\exp\left[-\frac{(\xi_{r_1}^{n,i}+2)t}{\theta_1}\right]$$
$$+\frac{(-1)^{i+j}\binom{r_1-1}{i}\binom{r_2-1}{j}\,\theta_2}{[(\eta_{r_1,r_2}^{n,j}+2)\theta_1-(\xi_{r_1}^{n,i}+2)\theta_2][\theta_2-(\eta_{r_1,r_2}^{n,j}+2)\theta_3]}\left\{\exp\left[-\frac{(\xi_{r_1}^{n,i}+2)t}{\theta_1}\right]-\exp\left[-\frac{(\eta_{r_1,r_2}^{n,j}+2)t}{\theta_2}\right]\right\}$$
$$-\frac{(-1)^{i+j}\binom{r_1-1}{i}\binom{r_2-1}{j}\,\theta_3}{[\theta_2-(\eta_{r_1,r_2}^{n,j}+2)\theta_3][\theta_1-(\xi_{r_1}^{n,i}+2)\theta_3]}\left\{\exp\left[-\frac{(\xi_{r_1}^{n,i}+2)t}{\theta_1}\right]-\exp\left[-\frac{t}{\theta_3}\right]\right\}$$
$$+\frac{(-1)^{i+j}\binom{r_1-1}{i}\binom{r_2-1}{j}}{[(\eta_{r_1,r_2}^{n,j}+2)\theta_1-(\xi_{r_1}^{n,i}+2)\theta_2](\eta_{r_1,r_2}^{n,j}+1)}\left\{\exp\left[-\frac{(\xi_{r_1}^{n,i}+2)t}{\theta_1}\right]-\exp\left[-\frac{(\eta_{r_1,r_2}^{n,j}+2)t}{\theta_2}\right]\right\}\Bigg)\qquad(25.18)$$

where $\eta_{r_1,r_2}^{n,j}=n+j-r_1-r_2$.


25.3.2 Estimation

When the functional form of f(t) is given and independent failure times T1, T2, ..., Tn are observed, the maximum likelihood estimates of α and β can be obtained by maximizing $\prod_{i=1}^{n}q(T_i)$ over α and β, where q(t) is given in Equation 25.17. The maximization can be carried out using numerical methods such as the Newton–Raphson method and net search methods (Bazaraa and Shetty [13], chapter 8).

Consider the case of a simple step-stress accelerated life testing where the stress is changed when the first r lifetimes are observed, so that the only stress-change time C1 is the rth order statistic from a lifetime sample of size n under stress x1. When the lifetime at a constant stress is exponential, the method of moments estimates for α and β can be found by numerically solving the following system for α and β:

$$ET=\frac{\sum_{i=1}^{n}T_i}{n}\qquad ET^2=\frac{\sum_{i=1}^{n}T_i^2}{n}$$

where ET and ET² are given in Equations 25.15 and 25.16 respectively.
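A Python sketch of this moment-matching step follows: moments() codes Equations 25.15 and 25.16, and fsolve searches for (α, β). The starting guess and data layout are illustrative assumptions.

```python
import numpy as np
from scipy.special import comb
from scipy.optimize import fsolve

def moments(theta1, theta2, n, r):
    """ET and ET^2 from Equations 25.15 and 25.16."""
    i = np.arange(r)
    xi = n + i - r                                  # xi_r^{n,i}
    w = (-1.0)**i * comb(r - 1, i) * n * comb(n - 1, r - 1)
    ET = np.sum(w * (theta2**2 / ((xi + 2) * theta2 - theta1)
                     - theta1**2 / ((xi + 2)**2 * ((xi + 2) * theta2 - theta1))
                     + theta1 / ((xi + 2)**2 * (xi + 1))))
    ET2 = 2 * np.sum(w * (theta2**3 / ((xi + 2) * theta2 - theta1)
                          - theta1**3 / ((xi + 2)**3 * ((xi + 2) * theta2 - theta1))
                          + theta1**2 / ((xi + 2)**3 * (xi + 1))))
    return ET, ET2

def mom_estimate(T, n, r, x1, x2, guess=(7.0, -1.0)):
    m1, m2 = np.mean(T), np.mean(np.asarray(T)**2)
    def eqs(ab):
        a, b = ab
        e1, e2 = moments(np.exp(a + b * x1), np.exp(a + b * x2), n, r)
        return e1 - m1, e2 - m2
    return fsolve(eqs, guess)                       # (alpha-hat, beta-hat)
```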

If the lifetime distribution at any constant stress is exponential, all the confidence interval estimates for α, β, θ0, and R0(t) given in Section 25.2.2 from a simple step-stress life testing are now conditional confidence intervals for the corresponding parameters, given C1 = c1. However, these conditional confidence intervals are also unconditional confidence intervals with the same confidence levels. For example, if [a1, a2] is a conditional 100(1 − γ)% (0 < γ < 1) confidence interval for β, i.e.

$$ P(a_1 \le \beta \le a_2 \mid C_1) = 1 - \gamma $$

then

$$ P(a_1 \le \beta \le a_2) = E_{C_1}[P(a_1 \le \beta \le a_2 \mid C_1)] = 1 - \gamma $$

where $E_{C_1}$ is the expectation taken with respect to the distribution of C1. Therefore, [a1, a2] is a 100(1 − γ)% unconditional confidence interval for β. A similar argument will prove that if [a3, a4] is an at least 100(1 − γ)(1 − γ*)% (0 < γ < 1, 0 < γ* < 1) conditional confidence interval for α, or θ0, or R0(t) at a given t, then [a3, a4] is also an at least 100(1 − γ)(1 − γ*)% unconditional confidence interval for the corresponding parameter. In the general case of k > 2, given C1, C2, . . . , Ck−1, conditional confidence intervals for α, β, θ0, and R0(t) can be constructed by using a similar transformation as in Equation 25.11 and Lemma 1. Again, these conditional confidence intervals are also unconditional for the corresponding parameters.

For most other lifetime distributions in a step-stress life testing with random stress-change times, the conditional asymptotic confidence intervals for α, β, θ0, and R0(t) can be constructed using the results in Section 25.2.3. These conditional asymptotic confidence intervals are also unconditional asymptotic confidence intervals for the corresponding parameters.

25.3.3 Optimum Test Plan

We consider a simple step-stress life testing where the stress is changed when the first r lifetimes are observed. We assume that the two stress levels are already chosen and discuss the optimum test plan for choosing r when all the test units run to failure. Suppose that n test units are tested under a simple step-stress life testing which uses only two stress levels x1 < x2. The only stress-change time C1 is the rth-order statistic from a lifetime sample of size n under stress x1. Suppose that the lifetimes at constant stress x1 and x2 are exponential with means θ1 and θ2 respectively, where θi = exp(α + βxi), i = 1, 2.

Conditional on C1 = c1, the maximum likelihood estimators α̂ and β̂ are given by Equations 25.7 and 25.8. The Fisher information matrix I, conditional on C1 = c1, is (see also


Bai et al. [10])

$$ I = n \begin{pmatrix} 1 & F_1(c_1)x_1 + [1 - F_1(c_1)]x_2 \\ F_1(c_1)x_1 + [1 - F_1(c_1)]x_2 & F_1(c_1)x_1^2 + [1 - F_1(c_1)]x_2^2 \end{pmatrix} $$

where

$$ F_1(c_1) = 1 - \exp\!\left(-\frac{c_1}{\theta_1}\right) $$

Since the mean life at design stress x0 is θ0 = exp(α + βx0), the conditional maximum likelihood estimator of θ0 is θ̂0 = exp(α̂ + β̂x0). The conditional asymptotic variance Asvar(ln θ̂0 | C1) is then given by

$$ n\,\mathrm{Asvar}(\ln\hat\theta_0 \mid C_1) = \frac{(1+\xi)^2}{F_1(c_1)} + \frac{\xi^2}{1 - F_1(c_1)} $$

where ξ = (x1 − x0)/(x2 − x1). Thus, the unconditional asymptotic variance Asvar(ln θ̂0) of ln θ̂0 can be found as

$$ \mathrm{Asvar}(\ln\hat\theta_0) = E_{C_1}[\mathrm{Asvar}(\ln\hat\theta_0 \mid C_1)] $$

When the distribution of C1 degenerates at c1, Asvar(ln θ̂0) = Asvar(ln θ̂0 | C1 = c1). Miller and Nelson [4] obtained the asymptotic optimum proportion of test units failed under stress x1 as (1 + ξ)/(1 + 2ξ) by minimizing the asymptotic variance Asvar(ln θ̂0 | C1 = c1). We now let C1 be the rth smallest failure time at stress x1. A simple integration yields

$$ n\,\mathrm{Asvar}(\ln\hat\theta_0) = n E_{C_1}[\mathrm{Asvar}(\ln\hat\theta_0 \mid C_1)] = \frac{n(1+\xi)^2}{r-1} + \frac{n\xi^2}{n-r} $$

Our optimum criterion is to find the optimum number of test units failed under stress x1 such that Asvar(ln θ̂0) is minimized. The minimization of Asvar(ln θ̂0) over r yields

$$ r^* = \frac{n(1+\xi) + \xi}{1 + 2\xi} $$

Since r must be an integer, the optimum r should take either ⌊r*⌋ − 1, ⌊r*⌋, or ⌊r*⌋ + 1, whichever gives the smallest Asvar(ln θ̂0). (⌊r*⌋ is the greatest integer lower bound of r*.) The asymptotic optimum proportion of test units failed under stress x1 is then

$$ \lim_{n\to\infty} \frac{r^*}{n} = \frac{1+\xi}{1+2\xi} $$

which is the same as given by Miller and Nelson [4] when the stress-change time is a prespecified constant.

As an example, we use the data studied by Nelson and coworkers [4, 14]. The accelerated life testing consists of 76 times (in minutes) to breakdown of an insulating fluid at constant voltage stresses (kilovolts). The two test stress levels are the log-transformed voltages: x1 = ln(26.0) = 3.2581, x2 = ln(38.0) = 3.6376. The design stress is x0 = ln(20.0) = 2.9957. The amount of stress extrapolation is ξ = 0.6914. Miller and Nelson [4] gave the maximum likelihood estimates of model parameters from those data: α̂ = 64.912, β̂ = −17.704, θ̂0 ≈ 143 793 min. Now, a future simple step-stress life testing is to be designed to estimate θ0 using x1, x2, and the same number of test units used in the past study; the question is to find the optimum test plan. Treating the stress-change time as a prespecified constant, Miller and Nelson [4] found the optimum stress-change time as c1 = 1707 min. Using the rth smallest lifetime under stress x1 as the stress-change time, we found that the optimum number of test units failed under stress x1 is r* = 54.
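The calculations for this example are easily reproduced; the following Python lines, using only the quantities quoted above, recover ξ and r*:

    import numpy as np

    # Optimum plan for the insulating-fluid example: log-transformed
    # voltages, n = 76 test units, all run to failure.
    x0, x1, x2 = np.log(20.0), np.log(26.0), np.log(38.0)
    n = 76

    xi = (x1 - x0) / (x2 - x1)           # stress extrapolation, 0.6914
    r_star = (n * (1 + xi) + xi) / (1 + 2 * xi)
    print(round(xi, 4), r_star)          # 0.6914, about 54.2 -> r* = 54
    print((1 + xi) / (1 + 2 * xi))       # asymptotic optimum proportion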

25.4 Bibliographical Notes

The problem of statistical data analysis and optimum test plan in a step-stress life testing has been studied by many authors. Yurkowski et al. [15] gave a survey of the work on the early statistical methods of step-stress life testing data. Meeker and Escobar [16] surveyed more recent work on accelerated life testing. Nelson [1] first implemented the method of maximum likelihood to data from a step-stress testing. Shaked and Singpurwalla [17] proposed a model based on shock and wear processes


and obtained non-parametric estimators for the life distribution at the use condition. Barbosa et al. [18] analyzed exponential data from a step-stress life testing using the approach of a generalized linear model. Tyoskin and Krivolapov [19] presented a non-parametric model for the analysis of step-stress test data. Schäbe [20] used an axiomatic approach to construct accelerated life testing models for non-homogeneous Poisson processes. DeGroot and Goel [21] and Dorp et al. [22] developed Bayes models for data from a step-stress testing and studied their inferences. Khamis and Higgins [23, 24] obtained results on optimum test plan for a three-step test and studied another variation of the cumulative exposure model. Xiong [11] gave inference results on type-II censored exponential step-stress data. Xiong and Milliken [25] investigated the statistical inference and optimum test plan when the stress-change times are random in a step-stress testing. Xiong [26] introduced a threshold parameter into the cumulative exposure models and studied the corresponding estimation problem.

References

[1] Nelson WB. Accelerated life testing—step-stress models and data analysis. IEEE Trans Reliab 1980;R-29:103–8.
[2] Nelson WB. Applied life data analysis. New York: John Wiley & Sons; 1982.
[3] Nelson WB. Accelerated life testing, statistical models, test plans, and data analysis. New York: John Wiley & Sons; 1990.
[4] Miller RW, Nelson WB. Optimum simple step-stress plans for accelerated life testing. IEEE Trans Reliab 1983;R-32:59–65.
[5] Yin XK, Sheng BZ. Some aspects of accelerated life testing by progressive stress. IEEE Trans Reliab 1987;R-36:150–5.
[6] Mann NR, Schafer RE, Singpurwalla ND. Methods for statistical analysis of reliability and life data. New York: John Wiley & Sons; 1974.
[7] Xiong C. Optimum design on step-stress life testing. In: Proceedings of 1998 Kansas State University Conference on Applied Statistics in Agriculture, 1998; p. 214–25.
[8] Lawless JF. Statistical models and methods for lifetime data. New York: John Wiley & Sons; 1982.
[9] Halperin M. Maximum likelihood estimation in truncated samples. Ann Math Stat 1952;23:226–38.
[10] Bai DS, Kim MS, Lee SH. Optimum simple step-stress accelerated life tests with censoring. IEEE Trans Reliab 1989;38:528–32.
[11] Xiong C. Inferences on a simple step-stress model with type II censored exponential data. IEEE Trans Reliab 1998;47:142–6.
[12] Chernoff H. Optimal accelerated life designs for estimation. Technometrics 1962;4:381–408.
[13] Bazaraa MS, Shetty CM. Nonlinear programming: theory and algorithms. New York: John Wiley & Sons; 1979.
[14] Meeker WQ, Nelson WB. Optimum accelerated life tests for Weibull and extreme value distributions and censored data. IEEE Trans Reliab 1975;R-24:321–32.
[15] Yurkowski W, Schafer RE, Finkelstein JM. Accelerated testing technology. Rome Air Development Center Technical Report 67-420, Griffiss AFB, New York, 1967.
[16] Meeker WQ, Escobar LA. A review of recent research and current issues in accelerated testing. Int Stat Rev 1993;61:147–68.
[17] Shaked M, Singpurwalla ND. Inference for step-stress accelerated life tests. J Stat Plan Infer 1983;7:694–99.
[18] Barbosa EP, Colosimo EA, Louzada-Neto F. Accelerated life tests analyzed by a piecewise exponential distribution via generalized linear models. IEEE Trans Reliab 1996;45:619–23.
[19] Tyoskin OI, Krivolapov SY. Nonparametric model for step-stress accelerated life testing. IEEE Trans Reliab 1996;45:346–50.
[20] Schäbe H. Accelerated life testing models for nonhomogeneous Poisson processes. Stat Pap 1998;39:291–312.
[21] DeGroot MH, Goel PK. Bayesian estimation and optimal designs in partially accelerated life testing. Nav Res Logist Q 1979;26:223–35.
[22] Dorp JR, Mazzuchi TA, Fornell GE, Pollock LR. A Bayes approach to step-stress accelerated life testing. IEEE Trans Reliab 1996;45:491–8.
[23] Khamis IH, Higgins JJ. Optimum 3-step step-stress tests. IEEE Trans Reliab 1996;45:341–5.
[24] Khamis IH, Higgins JJ. A new model for step-stress testing. IEEE Trans Reliab 1998;47:131–4.
[25] Xiong C, Milliken GA. Step-stress life-testing with random stress-change times for exponential data. IEEE Trans Reliab 1999;48:141–8.
[26] Xiong C. Step-stress model with threshold parameter. J Stat Comput Simul 1999;63:349–60.


Part V. Practices and Emerging Applications

26 Statistical Methods for Reliability Data Analysis
    26.1 Introduction
    26.2 Nature of Reliability Data
    26.3 Probability and Random Variables
    26.4 Principles of Statistical Methods
    26.5 Censored Data
    26.6 Weibull Regression Model
    26.7 Accelerated Failure-time Model
    26.8 Proportional Hazards Model
    26.9 Residual Plots for the Proportional Hazards Model
    26.10 Non-proportional Hazards Models
    26.11 Selecting the Model and Variables
    26.12 Discussion

27 The Application of Capture-recapture Methods in Reliability Studies
    27.1 Introduction
    27.2 Formulation of the Problem
    27.3 A Sequential Procedure
    27.4 Real Examples
    27.5 Simulation Studies
    27.6 Discussions

28 Reliability of Electric Power Systems: An Overview
    28.1 Introduction
    28.2 System Reliability Performance
    28.3 System Reliability Prediction
    28.4 System Reliability Data

29 Human and Medical Device Reliability
    29.1 Introduction
    29.2 Human and Medical Device Reliability Terms and Definitions
    29.3 Human Stress-Performance Effectiveness, Human Error Types and Causes of Human Error
    29.4 Human Reliability Analysis Methods
    29.5 Human Unreliability Data Sources
    29.6 Medical Device Reliability Related Facts and Figures
    29.7 Medical Device Recalls and Equipment Classification
    29.8 Human Error in Medical Devices
    29.9 Tools for Medical Device Reliability Assurance
    29.10 Data Sources for Performing Medical Device Reliability
    29.11 Guidelines for Reliability Engineers with Respect to Medical Devices

30 Probabilistic Risk Assessment
    30.1 Introduction
    30.2 Historical Comments
    30.3 PRA Methodology
    30.4 Engineering Risk vs Environmental Risk
    30.5 Risk Measures and Public Impact
    30.6 Transition to Risk-informed Regulation
    30.7 Some Successful PRA Applications
    30.8 Comments on Uncertainty
    30.9 Deterministic, Probabilistic, Prescriptive, Performance-based
    30.10 Outlook

31 Total Dependability Management
    31.1 Introduction
    31.2 Background
    31.3 Total Dependability Management
    31.4 Management System Components
    31.5 Conclusions

32 Total Quality for Software Engineering Management
    32.1 Introduction
    32.2 The Practice of Software Engineering
    32.3 Software Quality Models
    32.4 Total Quality Management for Software Engineering
    32.5 Conclusions

33 Software Fault Tolerance
    33.1 Introduction
    33.2 Software Fault-Tolerant Methodologies
    33.3 NVP Modeling
    33.4 Generalized NHPP Model Formulation
    33.5 NHPP Reliability Model for NVP Systems
    33.6 NVP-SRG
    33.7 Maximum Likelihood Estimation
    33.8 Conclusion

34 Dependability/Performability Modeling of Fault-Tolerant Systems
    34.1 Introduction
    34.2 Measures
    34.3 Model Specification
    34.4 Model Solution
    34.5 The Largeness Problem
    34.6 A Case Study
    34.7 Conclusions

35 Random-Request Availability
    35.1 Introduction
    35.2 System Description and Definition
    35.3 Mathematical Expression for the Random-Request Availability
    35.4 Numerical Examples
    35.5 Simulation Results
    35.6 Approximation
    35.7 Concluding Remarks

Chapter 26

Statistical Methods for Reliability Data Analysis

Michael J. Phillips

26.1 Introduction
26.2 Nature of Reliability Data
26.3 Probability and Random Variables
26.4 Principles of Statistical Methods
26.5 Censored Data
26.6 Weibull Regression Model
26.7 Accelerated Failure-time Model
26.8 Proportional Hazards Model
26.9 Residual Plots for the Proportional Hazards Model
26.10 Non-proportional Hazards Models
26.11 Selecting the Model and the Variables
26.12 Discussion

26.1 Introduction

The objective of this chapter is to describe statistical methods for reliability data analysis, in a manner which gives the flavor of modern approaches. The chapter commences with a description of five examples of different forms of data encountered in reliability studies. These examples are from recently published papers in reliability journals and include right censored data, accelerated failure data, and data from repairable systems. However, before commencing any discussion of statistical methods, a formal definition of reliability is required. The reliability of a system (or component) is defined as the probability that the system operates (performs a function under stated conditions) for a stated period of time. Usually the period of time is the initial interval of length t, which is denoted by [0, t). In this case the reliability is a function of t, so that the reliability function R(t) can be defined as:

$$ R(t) = P(\text{System operates during } [0, t)) $$

where P(A) denotes the probability of an event A, say. This enables the various features of continuous distributions, which are used to model failure times, to be introduced. The relationships between the reliability function, the probability density function, and the hazard function are detailed. Then there is an account of statistical methods and how they may be used in achieving the objectives of a reliability study. This covers the use of parametric, semi-parametric, and non-parametric models.

26.2 Nature of Reliability Data

To achieve the objectives of a reliability study, it is necessary to obtain facts from which the other facts required for the objectives may be inferred. These facts, or data, obtained for reliability studies


Table 26.1. Times to failure (in hours) for pressure vessels

274    28.5   1.7    20.8   871    363    1311   1661   236    828
458    290    54.9   175    1787   970    0.75   1278   776    126

may come from testing in factory/laboratory conditions or from field studies. A number of examples of different kinds of reliability data obtained in reliability studies will be given, which are taken from published papers in the following well-known reliability journals: IEEE Transactions on Reliability, Journal of Quality Technology, Reliability Engineering and Safety Systems, and Technometrics.

Example 1. (Single sample failure data) The simplest example of failure time data is a single sample taken from a number of similar copies of a system. An example of this was given by Keating et al. [1] for 20 pressure vessels, constructed of fiber/epoxy composite materials wrapped around metal liners. The failure times (in hours) are given in Table 26.1 for 20 similarly constructed vessels subjected to a certain constant pressure.

This is an example of a complete sample obtained from a factory/laboratory test which was carried out on 20 pressure vessels and continued until failure was observed on all vessels. There is no special order for the failure times. These data were used by Keating et al. [1] to show that the failure distribution could have a decreasing hazard function and that the average residual life for a vessel having survived a wear-in period exceeds the expected life of a new vessel.

Example 2. (Right censored failure data) If one of the objectives of collecting data is to observe the time of failure of a system, it may not always be possible to achieve this due to a limit on the time or resources spent on the data collection process. This will result in incomplete data as only partial information will be collected, which will occur if a system is observed for a period and then observation ceases at time C, say. Though the time of failure T is not observed, it is known that T must exceed C. This form of incomplete data is known as right censored data.

An example of this kind of data was given by Kim and Proschan [2] for telecommunication systems installed for 125 customers by a telephone operating company. The failure times (in days) are given in Table 26.2 for 16 telecommunication systems installed in 1985 and the censored times (when the system was withdrawn from service without experiencing a failure or the closing date of the observation period) for the remaining 109 systems. The systems could be withdrawn from service at any time during the observation period on the customer's request. As well as failure to function, an unacceptable level of static, interference or noise in the transmission of the telecommunication system was considered as a system failure.

This situation, where a high proportion of the observations are censored, is often met in practice. These data were used by Kim and Proschan [2] to obtain a smooth non-parametric estimate of the reliability (survivor) function for the systems.

Example 3. (Right censored failure data with a covariate) Ansell and Ansell [3] analyzed the data given in Table 26.3 in a study of the performance of sodium sulfur batteries. The data consist of lifetimes (in cycles) of two batches of batteries. There were 15 batteries in the first batch and four of the lifetimes are right censored observations. These are the four largest observations and all have the same value. There were 20 batteries in the second batch and only one of the lifetimes is right censored. It is not the largest observation. The principal interest in this study is to investigate any difference in performance between batteries from the two different batches.

Example 4. (Accelerated failure data with covariates) Elsayed and Chan [4] presented failure times for time-dependent dielectric breakdown of metal–oxide–semiconductor integrated circuits for accelerated failure testing. The objective was


Table 26.2. Times to failure and censored times (in days) for telecommunication systems during five months in 1985

164+   2      45     147+   139+   135+   3      155+   150+   101+
139+   135+   164+   155+   150+   146+   139+   135+   164+   155+
150+   1      139+   7      163+   139+   149+   143+   138+   134+
163+   152+   149+   143+   40     13     163+   152+   149+   143+
138+   134+   163+   152+   149+   142+   138+   134+   163+   152+
149+   10     138+   134+   163+   94     149+   141+   138+   133+
77     151+   149+   141+   138+   133+   162+   151+   149+   141+
138+   133+   162+   151+   149+   34     138+   133+   73+    151+
115    140+   138+   133+   63     151+   138+   140+   137+   133+
161+   151+   148+   140+   137+   64+    160+   151+   147+   140+
137+   133+   160+   151+   147+   140+   137+   133+   67     90+
147+   140+   137+   133+   141+   151+   147+   140+   137+   133+
156+   151+   147+   54     137+

Note: Figures with + are right censored observations not failures.

Table 26.3. Lifetimes (in cycles) of sodium sulfur batteries

Batch 1   164    164    218    230    263    467    538    639    669    917
          1148   1678+  1678+  1678+  1678+

Batch 2   76     82     210    315    385    412    491    504    522    646+
          678    775    884    1131   1446   1824   1827   2248   2385   3077

Note: Lifetimes with + are right censored observations not failures.

Table 26.4. Times to failure and censored times (in hours) for three different temperatures (170 °C, 200 °C, 250 °C)

Temperature (170 °C)
0.2      5.0      27.0     51.0     103.5    192.0    192.0    429.0    429.0    954.0
2495.0   2495.0   2495.0   2495.0   2495.0   2495.0   2495.0   2495.0   2948.0   2948.0

Temperature (200 °C)
0.1      0.2      0.2      2.0      5.0      5.0      25.0     52.5     52.5     52.5
52.5     123.5    123.5    123.5    219.0    219.0    524.0    524.0    1145.0   1145.0

Temperature (250 °C)
0.1      0.2      0.2      0.2      0.5      2.0      5.0      5.0      5.0      10.0
10.0     23.0     23.0     23.0     23.0     23.0     50.0     50.0     120.0

Note: Figure with + is right censored observation not failure.

to estimate the dielectric reliability and hazard functions at operating conditions from the data obtained from the tests. Tests were performed to induce failures at three different elevated temperatures, where temperature is used as the stress variable (or covariate). These data for silicon dioxide breakdown can be used to investigate failure time models which have parameters dependent on the value of the covariate (temperature), and are given in Table 26.4. The nominal electric field was the same for all three temperatures. Elsayed and Chan [4] used these data to estimate the parameters in a semi-parametric proportional hazards model.


Table 26.5. System failure times (in hours)

System A   452   752   967   1256   1432   1999   2383   (3000)
System B   233   302   510   911    1717   2107   (3000)
System C   783   1805  (2000)
System D   782   (2000)
System E   (2000)

Note: Figures in parentheses are current cumulative times.

Example 5. (Repairable system failure data) Examples 1 to 4 are of failure data that are records of the time to first failure of a system. Some of these systems, such as the pressure vessels described in Example 1 and the integrated circuits described in Example 4, may be irreparably damaged, though others, such as the telecommunication systems described in Example 2, will be repaired after failure and returned to service. This repair is often achieved by replacing part of the system to produce a repairable system.

Newton [5] presented failure times for a repairable system. Five copies of a system were put into operation at 500-hour intervals. When failures occurred the component which failed was instantaneously replaced by a new component. Failures were logged at cumulative system hours and are given in Table 26.5. The times of interest are the times between the times of system failure.

The objective for which Newton [5] used these data was to consider the question of whether the failure rate for the systems was increasing or decreasing with time as the systems were repaired.

26.3 Probability and Random Variables

Reliability has been defined as a probability, which is a function of time, used for the analysis of the problems encountered when studying the occurrence of events in time. Several aspects of time may be important: (i) age, (ii) calendar time, (iii) time since repair. However, for simplicity, problems are often specified in terms of one time scale, values of which will be denoted by t. In order to study the reliability of a component, define a random variable T denoting the time of occurrence of an event of interest (failure). Then the event "system operates during [0, t)" used in Section 26.1 to define the reliability function is equivalent to "T ≥ t". Then the reliability function of the failure time distribution can be defined by:

$$ R_T(t) = P(T \ge t) $$

and is a non-increasing function of t. The reliability of a component cannot improve with time! The distribution of T can be defined using the probability density function of the failure time distribution, and this is defined by:

$$ f_T(t) = \lim_{h\to 0}\frac{P(t \le T < t+h)}{h} = -\frac{\mathrm{d}R_T(t)}{\mathrm{d}t} $$

Another function which can be used to define the distribution of T is the hazard function (age-specific failure rate or force of mortality) of time to failure, which is defined by:

$$ \lambda_T(t) = \lim_{h\to 0}\frac{P(t \le T < t+h \mid T \ge t)}{h} = \frac{f_T(t)}{R_T(t)} $$

A related function is the cumulative (integrated) hazard function of the failure time distribution, which is closely connected with the reliability function and is defined by:

$$ \Lambda_T(t) = \int_0^t \lambda_T(u)\, \mathrm{d}u = -\log(R_T(t)) $$

So the reliability function can be obtained from the cumulative hazard function by:

$$ R_T(t) = \exp(-\Lambda_T(t)) $$

These results can be applied to various families of distributions, such as the exponential, Weibull,


extreme value or lognormal. Details of these families of distributions can be found in such texts as Ansell and Phillips [6].
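These identities can be checked numerically; the short Python sketch below does so for a Weibull distribution, with θ = 2 and κ = 1.5 as arbitrary illustrative values rather than values from the text:

    import numpy as np

    # Numerical check of R(t) = exp(-Lambda(t)) and lambda(t) = f(t)/R(t)
    # for a Weibull distribution with assumed scale theta, shape kappa.
    theta, kappa = 2.0, 1.5
    t = np.linspace(0.01, 5.0, 500)

    R = np.exp(-(t / theta) ** kappa)                      # reliability
    f = (kappa / theta) * (t / theta) ** (kappa - 1) * R   # density
    haz = f / R                                            # hazard
    Lam = (t / theta) ** kappa                             # cumulative hazard

    assert np.allclose(R, np.exp(-Lam))
    # integrating the hazard recovers the cumulative hazard (trapezoidal rule)
    Lam_num = np.concatenate(([0.0], np.cumsum((haz[1:] + haz[:-1]) / 2 * np.diff(t))))
    assert np.allclose(Lam - Lam[0], Lam_num, atol=1e-2)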

26.4 Principles of Statistical Methods

Statistical methods are mainly based on using parametric models, though non-parametric methods and semi-parametric models are also used.

When considering possible failure-time distributions to model the behavior of observed failure times, it is not usual to choose a fully specified function to give the reliability or probability density function of the distribution, but instead to use a class or "family" of functions. This family of functions will be a function of some variables known as parameters, whose values have to be specified to "index" the particular function of the family to be used to define the failure-time distribution. More than one parameter may be used and then the r parameters, say, will in general be denoted by the vector β = (β1, β2, . . . , βr)ᵀ and the reliability or probability density function of the parametric distribution for a continuous failure time will be denoted by RT(t; β) and fT(t; β), respectively.

One of the objectives of a reliability study may be to study the behavior of the failure time distribution of T through the variation of k other variables represented by the vector z = (z1, z2, z3, . . . , zk)ᵀ, say. These variables are usually referred to as covariates. In order to perform any parametric statistical analysis, it is necessary to know the joint distribution of the failure time T and the covariates z, whose probability density function will be denoted by fT,Z(t, z; β), for parameters β.

Statistical methods based on using parametric models, with information from a random sample, are concerned with such questions as:

1. what is the best choice of the values of the parameters β, or some function of β, on the basis of the data observed? (estimation);

2. are the data observed consistent with the parameters β having the value β0, say? (hypothesis testing).

There have been a variety of ways of answering the first question over the years, though it is generally accepted that there are essentially three main approaches: Classical (due to Neyman and Pearson), Likelihood (due to Fisher), and Bayesian (developed by the followers of Bayes).

To achieve the objectives of the reliability study, it may be decided to use a parametric model with parameters β, say. This may be to obtain the reliability of the system or the expected value of the failure time, and in order to do this it will be necessary to specify the values of the parameters of the model. This will be done by using the "best" value obtained from the reliability data, which are the observations collected (in a possibly censored or truncated form) from a random sample. Estimation is the process of choosing a function of these observations (known as a statistic) which will give a value (or values) close to the "true" (unknown) value of the parameters β. Such a statistic, known as an estimator, will be a random variable denoted by β̂. A point estimator gives one value for each of the parameters, whilst an interval estimator gives intervals for each of the parameters. For interval estimators a probability (or confidence level, usually expressed as a percentage by multiplying the probability by 100) is specified, which is the probability that the interval contains the "true" values of the parameter. One method for obtaining the interval is to use the standard error of an estimator, which is the standard deviation of the estimator with any parameters replaced by their estimates. Often the estimator can be assumed to have a normal distribution (at least for a large sample size, though this depends on the proportion of censored observations). Then an approximate 95% interval estimator for β using the point estimator β̂ with a standard error se(β̂) would be given by the interval (β̂ − 1.96 se(β̂), β̂ + 1.96 se(β̂)).


As an alternative to choosing a parametric failure-time distribution, it is possible to make no specification of the reliability or probability density function of this distribution. This has advantages and disadvantages. While not being restricted to use specific statistical methods, which may not be valid for the model being considered, the price to be paid for this is that it may be difficult to make very powerful inferences from the data. This is because a non-parametric model can be effectively thought of as a parametric model with an infinity of parameters. Hence, with an infinity of parameters it is not surprising that statistical methods may be unable to produce powerful results. However, this approach to models and the resulting statistical methods which follow from it are often successful and hence have been very popular.

As a compromise to choosing a parametric or a non-parametric failure-time distribution it is possible to make only a partial specification of the reliability, probability density, or hazard function of this distribution. This has the advantage of introducing parameters which model the covariates of interest while leaving the often uninteresting remainder of the model to have a non-parametric form. Such models are referred to as semi-parametric models. Probably the most famous example of a semi-parametric model is the hazard model introduced by Cox [7], known as the proportional hazards model. The idea behind using such a model is to be able to use the standard powerful approach of parametric methods to make inferences about the covariate parameters while taking advantage of the generality of non-parametric methods to cope with any problems produced by the remainder of the model.

26.5 Censored Data

It is not always possible to collect data for lifetime distributions in as complete a form as required. Example 1 is an example of a complete sample where a failure time was observed for every system. This situation is unusual and only likely to occur in the case of laboratory testing. Many studies lead to the collection of incomplete data, which are either censored or truncated. One of the most frequent ways that this occurs is because of a limit of time (or resources) for the study, producing the case of right censored data. This case will be considered first.

Non-parametric estimation methods have mainly been developed to deal with the case of right censored data for lifetime distributions, as this is the case most widely encountered in all branches of applied statistics. Consider right censored failure-time data for n components (or systems). Such data can be represented by the pairs (ti, di) for the ith component, i = 1, 2, . . . , n. For the most part the case in which the Ti are independent and identically distributed is considered, and where it is necessary to estimate some characterization of their distribution. Sometimes it is necessary to refer to the ith-ordered observed event time (the time of failure or censoring of the ith event), when the notation t(i) (rather than ti) will be used. On the other hand, it is sometimes necessary to refer to the ith observed failure time (rather than the time of the ith event, which could be a failure or censoring). Then the notation t[i] (rather than ti) will be used. Corresponding to each observed failure time is a risk set: the set comprising those components which were under observation at that time, i.e. the set of components which could have been the observed failure. The risk set corresponding to t[i] will be denoted by Ri. The number of components in the risk set Ri will be denoted by ri. These definitions are illustrated in Example 8 (Table 26.6).

Non-parametric estimators have been proposed for the reliability (survivor) function and the cumulative hazard function of the failure-time distribution. The Kaplan–Meier (KM) estimator R̂T(t) of the reliability function, RT(t), which was proposed by Kaplan and Meier [8], is a step function. In the case of ties of multiplicity mi at the failure time t[i], the estimator is modified. If censoring times tie with failure times, censoring is assumed to occur just following the failure. The KM estimator gives a "point" estimator, or


a single value for the reliability function at any time t. If it is desired to obtain a measure of the variation of this estimator over different samples, then an estimate of the variance of the KM estimator is needed. Greenwood's formula provides an estimate of the variance of log(R̂T(t)), denoted by Var(log(R̂T(t))). Though this gives a measure of the variation of the log of the KM estimator, it is possible to use it to estimate the variance of R̂T(t) by (R̂T(t))² Var(log(R̂T(t))).

The Nelson–Altschuler (NA) estimator Λ̂T(t) of the cumulative hazard function, ΛT(t), which was proposed by Nelson [9], is also a step function. Using the basic relation between the reliability function and the cumulative hazard function given in Section 26.3, it is possible to use the NA estimator to estimate the reliability function by:

$$ \tilde R_T(t) = \exp(-\hat\Lambda_T(t)) $$
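A minimal Python implementation of the KM and NA estimators, together with Greenwood's estimate of Var(log R̂T(t)), might look as follows (km_na is an illustrative name, not a library function):

    import numpy as np

    def km_na(times, events):
        """Kaplan-Meier and Nelson-Altschuler estimates from right censored data.

        times  -- observed event times (failures or censorings)
        events -- 1 for an observed failure, 0 for a censored observation
        Returns, at each distinct failure time: the KM estimate of R(t), the
        NA estimate exp(-cumulative hazard), and Greenwood's estimate of
        Var(log R_KM(t)).
        """
        times = np.asarray(times, dtype=float)
        events = np.asarray(events, dtype=int)
        out = []
        km, na, greenwood = 1.0, 0.0, 0.0
        for t in np.unique(times[events == 1]):
            r = np.sum(times >= t)                    # size of the risk set
            m = np.sum((times == t) & (events == 1))  # multiplicity (ties) at t
            km *= 1.0 - m / r                         # KM step
            na += m / r                               # NA hazard increment
            greenwood += m / (r * (r - m)) if r > m else np.inf
            out.append((t, km, np.exp(-na), greenwood))
        return out

Applied to a complete sample such as Table 26.1, this produces the KM step function with equal steps of 1/n, as described in Example 6 below.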

A number of examples are now presented to illustrate the features of these non-parametric estimators of the reliability function.

Example 6. An example of a single sample from a laboratory test of 20 failure times for pressure vessels was presented in Example 1. In that example all the observations were distinct failure times, so that the sample was complete and there was no truncation or censoring. In this case the KM estimator is a step function with equal steps of 0.05 (= 1/20) at the 20 failure times. A plot of the KM estimator is given in Figure 26.1, with 95% confidence limits obtained using Greenwood's formula.

Figure 26.1. Plot of the Kaplan–Meier estimator of the reliability for the pressure vessels laboratory test data with 95% confidence limits

Example 7. An example of a sample with right censored failure times for 125 telecommunication systems was presented in Example 2. In that example the observations consisted of 16 failure times and 109 right censored observations. In this case the KM estimator is a step function with equal steps of 0.008 (= 1/125) for the first 11 failure times. After that time (63 days) right censored observations occur and the step size changes until the largest failure time (139 days) is reached. After that time the KM estimator remains constant until the largest censored time (164 days), after which the estimator is not defined. So in this example, unlike Example 6, the KM estimator is never zero. A plot of the KM estimator is given in Figure 26.2 with 95% confidence limits obtained using Greenwood's formula. (The vertical lines on the plot of the reliability represent the censored times.)

Figure 26.2. Plot of the Kaplan–Meier estimator of the reliability for the telecommunication systems data with 95% confidence limits


Table 26.6. Ordered event times (in hours)

Event    Censor      Event    Fail      Multiplicity   Risk set   Λ̂(t[i])   R̂(t[i])   R̂(t[i])
number   indicator   time     number
(i)      d(i)        t(i)     [i]       m[i]           r[i]                  NA         KM

1        1           69       1         1              21         0.048     0.953      0.952
2        1           176      2         1              20         0.098     0.907      0.905
3        0           196+     –         –              –          –         –          –
4        1           208      3         1              18         0.153     0.858      0.854
5        1           215      4         1              17         0.212     0.809      0.804
6        1           233      5         1              16         0.274     0.760      0.754
7        1           289      6         1              15         0.341     0.711      0.704
8        1           300      7         1              14         0.413     0.662      0.653
9        1           384      8         1              13         0.490     0.613      0.603
10       1           390      9         1              12         0.573     0.564      0.553
11       0           393+     –         –              –          –         –          –
12       1           401      10        1              10         0.673     0.510      0.498
13       1           452      11        1              9          0.784     0.457      0.442
14       1           567      12        1              8          0.909     0.403      0.387
15       0           617+     –         –              –          –         –          –
16       0           718+     –         –              –          –         –          –
17&18    1           783      13        2              5          1.309     0.270      0.232
19       1           806      14        1              3          1.642     0.194      0.155
20       0           1000+    –         –              –          –         –          –
21       1           1022     15        1              1          2.642     0.071      0.000

Note: Times with + are right censored times.

Example 8. This example is adapted from Newton [5], from the data which were described in Example 5. Five copies of a system were put into operation at 500-hour intervals. (For the purpose of this example the system will be considered to be equivalent to a single component, and will be referred to as a component.) When failures occurred the component was instantaneously replaced by a new component. Failures were logged at cumulative system hours and are given in Table 26.5, except that the failure time of system D has been changed from 782 to 783, for illustrative purposes, as this will produce two tied failure times and a multiplicity of 2. The details of the calculation of both the KM and NA estimators will now be illustrated.

For analysis of the distribution of the component failure times, the observed lifetimes (times between failures) are ranked as shown in Table 26.6. The censored times in Table 26.6 (marked +) are times from the last failure of a component to the current time. Such times are right censorings (times to non-failure) and are typical of failure data from field service. As most reliability data come from such sources, it is important not to just analyze the failure times and ignore the censoring times by treating the data as if they came from a complete sample as in a laboratory test. Ignoring censorings will result in pessimistic estimates for reliability. Data are often encountered, as in Example 7, where there are as many as four or five times more censorings than failures. The consequences of ignoring the censorings would result in wildly inaccurate conclusions about the lifetime distribution.

An estimate can be given for the reliability function at each observed component failure time. The estimate of the reliability function in the last column of Table 26.6 is that obtained by using the KM estimator. Alternatively, the NA estimator can be used. The estimate of the hazard function at each failure time is obtained as the number of failures (the multiplicity m[i]) at each

Page 516: Handbook of Reliability Engineering - pdi2.org

Statistical Methods for Reliability Data Analysis 483

event time divided by the number of survivors immediately prior to that time (the number in the risk set, r[i]). The cumulative sum of these values is the NA estimate, Λ̂(t), of the cumulative hazard function Λ(t). Then exp(−Λ̂(t)) provides an estimate of R(t), the reliability function, and is given in the penultimate column of Table 26.6. It can be seen that the NA estimate of the reliability function is consistently slightly higher than the KM estimate.
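Using the km_na sketch from earlier in this section, the KM and NA columns of Table 26.6 can be reproduced directly from the (time, censor indicator) pairs:

    # Lifetimes and censor indicators from Table 26.6; 783 appears twice
    # because of the tie of multiplicity 2.  Reuses km_na defined above.
    times  = [69, 176, 196, 208, 215, 233, 289, 300, 384, 390, 393,
              401, 452, 567, 617, 718, 783, 783, 806, 1000, 1022]
    events = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,
              1, 1, 1, 0, 0, 1, 1, 1, 0, 1]

    for t, km, r_na, _ in km_na(times, events):
        print(f"{t:6.0f}   KM = {km:.3f}   NA = {r_na:.3f}")
    # At t = 783 this gives KM = 0.232 and NA = exp(-1.309) = 0.270,
    # matching the corresponding row of Table 26.6.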

The improbability of tied lifetimes should lead to consideration of the possibility of the times not being independent, as would be the case in, for example, common cause failures. Both these estimates have been derived assuming that the lifetimes are independent and identically distributed (i.i.d.).

26.6 Weibull Regression Model

One of the most popular families of parametric models is that given by the Weibull distribution. It is usual for regression models to describe one or more of the distribution's parameters in terms of the covariates z. The relationship is usually linear, though this is not always the case. The Weibull distribution has a reliability function which can be given by:

$$ R_T(t) = \exp(-\lambda t^\kappa) \qquad \text{for } t > 0 $$

where λ = 1/θ^κ and θ is the scale parameter, and κ is the shape parameter. Each of these parameters could be described in terms of the covariates z, though it is more usual to define either the scale or the shape parameter in terms of z. For example, if the scale parameter was chosen then a common model would be to have λ(z; β) = exp(βᵀz), where the number of covariates k = r, the number of parameters. Then the reliability function would be:

$$ R_T(t \mid z; \beta) = \exp(-\exp(\beta^{\mathrm{T}} z)\, t^\kappa) \qquad \text{for } t > 0 $$

The probability density function is given by:

$$ f_T(t \mid z; \beta) = \kappa \exp(\beta^{\mathrm{T}} z)\, t^{\kappa-1} \exp(-\exp(\beta^{\mathrm{T}} z)\, t^\kappa) \qquad \text{for } t > 0 $$

This model is commonly referred to as the Weibull regression model, but there are alternatives which have been studied, see Smith [10], where the shape parameter is dependent also on the covariate.

There are advantages to reparameterizing this model by taking logs, so that the model takes the form of Gumbel's extreme value distribution. A reason for this is to produce a model more akin to the normal regression model, but also it allows a more natural extension of the model and hence greater flexibility. Define Y = log T so that the reliability function is given by:

$$ R_Y(y \mid z; \beta) = \exp(-\exp(\kappa y + \beta^{\mathrm{T}} z)) $$

so that

$$ E(\log T) = -\frac{\gamma}{\kappa} - \frac{\beta^{\mathrm{T}} z}{\kappa} $$

where γ is Euler's constant. It is usual to estimate the parameters by using the maximum likelihood approach.

Standard errors may be obtained by the use of second derivatives to obtain the observed information matrix, Io, and this matrix is usually calculated in the standard statistical software packages.
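A sketch of this fit in Python is given below; the function names are illustrative, and the inverse Hessian returned by BFGS is only an approximation to the inverse observed information matrix Io:

    import numpy as np
    from scipy.optimize import minimize

    def weibull_reg_nll(params, t, d, Z):
        """Negative log-likelihood for R(t | z) = exp(-exp(beta'z) t^kappa)
        with right censoring; params = (log kappa, beta_1, ..., beta_r),
        d = 1 for a failure and 0 for a censored time."""
        kappa, beta = np.exp(params[0]), params[1:]
        eta = Z @ beta                        # linear predictor beta'z
        cum_haz = np.exp(eta) * t ** kappa    # cumulative hazard at t
        log_dens = np.log(kappa) + eta + (kappa - 1.0) * np.log(t)
        return -(np.sum(d * log_dens) - np.sum(cum_haz))

    def fit_weibull_regression(t, d, Z):
        t, d, Z = (np.asarray(a, dtype=float) for a in (t, d, Z))
        res = minimize(weibull_reg_nll, np.zeros(Z.shape[1] + 1),
                       args=(t, d, Z), method="BFGS")
        # The BFGS inverse-Hessian approximates the inverse observed
        # information, so its diagonal gives rough squared standard errors
        # (for log kappa in the first position, not kappa itself).
        return res.x, np.sqrt(np.diag(res.hess_inv))

With Z holding a column of ones and the batch indicator of Example 9 below, such a fit should give values close to the estimates reported there.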

Example 9. Ansell and Ansell [3] analyzed the data given in Table 26.3 in a study of the performance of sodium sulfur batteries. The data consist of lifetimes (in cycles) of two batches of batteries. The covariate vector for the ith battery is given by zi = (zi1, zi2)ᵀ, where zi1 = 1 and zi2 represents whether the battery comes from batch 1 or batch 2, so that:

$$ z_{i2} = \begin{cases} 0 & \text{if battery } i \text{ is from batch 1} \\ 1 & \text{if battery } i \text{ is from batch 2} \end{cases} $$

Hence β2 represents the difference in performance between batteries from batch 2 and batch 1.

Fitting the Weibull regression model results in β̂2 = 0.0156 and κ̂ = 1.127. Using the observed information matrix the standard error of β̂2 is 0.386. Hence a 95% confidence interval for β2 is (−0.740, 0.771).

An obvious test to perform is to see if β2 is non-zero. If it is non-zero this would imply there is


a difference between the two batches of batteries. The hypotheses are:

H0: β2 = 0
H1: β2 ≠ 0

The log likelihood evaluated under H0 is −49.7347 and under H1 is −49.7339. Hence the likelihood ratio test statistic has a value of 0.0016. Under the null hypothesis the test statistic has a χ² distribution with one degree of freedom. Therefore the hypothesis that β2 is zero, which is equivalent to no difference between the batches, is accepted. Using the estimate β̂2 of β2 and its standard error gives a Wald statistic of 0.0016, almost the same value as the likelihood ratio test statistic, and hence leads to the same inference. Both these inferences are consistent with the confidence interval, as it includes zero.

Examination of the goodness of fit and the appropriateness of the assumptions made in fitting the regression model can be based on graphical approaches using residuals, see Smith [10]. The Cox and Snell generalized residuals, see Cox and Snell [11], are defined as:

$$ e_i = -\log \hat R_T(t_i \mid z_i; \hat\beta) $$

where R̂T(ti | zi; β̂) is the reliability function evaluated at ti and zi with estimates β̂.

Obviously one problem that arises is with the residuals for censored observations and authors generally, see Lawless [12], suggest using:

$$ e_i = -\log \hat R_T(t_i \mid z_i; \hat\beta) + 1 $$

Cox and Snell residuals should be i.i.d. random variables with a unit exponential distribution, i.e. with an expectation of one. Then the cumulative hazard function is a linear function with a slope of one. Also the cumulative hazard function is minus the log of the reliability function. Hence a plot of minus the log of the estimated reliability function for the residuals, −log R̂(ei), against ei should be roughly linear with a slope of one when the model is adequate.
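A small helper for these residuals under the Weibull regression model, following the definitions above (the function name is illustrative):

    import numpy as np

    def cox_snell_residuals(t, d, Z, kappa, beta):
        """Generalized residuals e_i = -log R(t_i | z_i; beta) for the
        Weibull regression model, with 1 added for censored observations
        as suggested by Lawless [12]."""
        t, d, Z = (np.asarray(a, dtype=float) for a in (t, d, Z))
        e = np.exp(Z @ beta) * t ** kappa    # -log R(t | z) for this model
        return np.where(d == 1, e, e + 1.0)

A plot of −log R̂(ei) against ei, with R̂ computed from the residuals by the km_na sketch of Section 26.5, should then lie roughly on a unit-slope line when the model fits.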

Example 9. (cont.) Using the data in Example 3 and fitting the Weibull regression model, the generalized residuals have been calculated and are presented in Figure 26.3. The plot is of minus the log of the reliability function of the generalized residuals against the generalized residuals. Since some of the points are far from the line with slope one, it would seem that the current model is not necessarily appropriate. Further investigation would be required to clarify if this was due to the choice of the distribution or the current model.

Figure 26.3. Plot of the generalized residuals of the Weibull regression model for the sodium sulfur battery data

The approach described above is applicable to many distributions. A special case of the Weibull regression model is the exponential regression model, when the shape parameter is taken to be 1 (κ = 1). This has been studied by Glasser [13], Cox and Snell [11] and Lawless [12]. Other lifetime regression models which can be used include: gamma, log-logistic, and lognormal, see Lawless [12]. Many of these models are covered by the general term location-scale models, or accelerated failure-time models.

Selection of an appropriate distribution model depends both on the physical context and on the actual fit achieved. There ought to be good physical justification for fitting a distribution using the context under study. Basing the decision about the distribution purely on fit can be very misleading, especially if there are a number of


possible covariates to be chosen. It is always possible by fitting extra variables to achieve a better fit to a set of data, though such a fit will have little predictive power.

26.7 Accelerated Failure-time Model

The Weibull regression model discussed in the last section can be regarded as an example of an accelerated failure-time model. Accelerated failure-time models were originally devised to relate the performance of components put through a severe testing regime to a component's more usual lifetime. It was assumed that a variable or factor, such as temperature or number of cycles, could be used to describe the severity of the testing regime. This problem has been considered by a number of authors, see Nelson [14] for a comprehensive account of this area.

Suppose in a study that the covariate is z, which can take the values 0 and 1, and that it is assumed that the hazard functions are:

$$ \lambda(t \mid z=0) = \lambda_0 \qquad \text{and} \qquad \lambda(t \mid z=1) = \varphi\lambda_0 $$

so that φ is the relative risk for z = 1 versus z = 0. Then

$$ R(t \mid z=1) = R(\varphi t \mid z=0) $$

and, in particular,

$$ E(T \mid z=1) = \frac{E(T \mid z=0)}{\varphi} $$

So the time for components with z = 1 is passing at a rate φ faster than for the components with z = 0. Hence the name of the model.

The model can be extended as follows. Suppose φ is replaced by φ(z) with φ(0) = 1, then

$$ R(t \mid z) = R(\varphi(z)\, t \mid z=0) $$

and hence

$$ \lambda(t \mid z) = \varphi(z)\, \lambda(\varphi(z)\, t \mid z=0) $$

and

$$ E(T \mid z) = \frac{E(T \mid z=0)}{\varphi(z)} $$

In using the model for analysis a parametric model is specified for φ(z) with β as the parameter, which will be denoted by φ(z; β). A typical choice would be

$$ \varphi(z; \beta) = \exp(\beta z) $$

This choice leads to a linear regression model for log T as exp(βz)T has a distribution which does not depend on z. Hence log T is given by:

$$ \log T = \mu_0 - \beta z + \varepsilon $$

where µ0 is E(log T | z = 0) and ε is a random variable whose distribution does not depend on the covariate z.

To estimate β there is the need to specify the distribution. If the distribution of T is lognormal then least squares estimation may be used as µ0 + ε has a normal distribution. If the distribution of T is Weibull with a shape parameter κ, then κ(µ0 + ε) has a standard extreme value distribution. Hence, as was stated at the beginning of this section, an example of the accelerated failure-time model is the Weibull regression model. Other such models have been widely applied in reliability, see Cox [15], Fiegl and Zelen [16], Nelson and Hahn [17], Kalbfleisch [18], Farewell and Prentice [19], and Nelson [14]. Plotting techniques, such as using Cox and Snell generalized residuals as defined for the Weibull regression model, may be used for assessing the appropriateness of the model.
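The relation E(T | z) = E(T | z = 0)/φ(z) is easily checked by simulation; the sketch below assumes β = 0.7 and a Weibull baseline, both arbitrary illustrative choices:

    import numpy as np

    # Simulation check of E(T | z = 1) = E(T | z = 0) / phi for the
    # accelerated failure-time model; phi = exp(beta) is the relative
    # risk at z = 1.
    rng = np.random.default_rng(0)
    beta, kappa, n = 0.7, 1.5, 200_000
    phi = np.exp(beta)

    t0 = rng.weibull(kappa, n)         # baseline lifetimes (z = 0)
    t1 = rng.weibull(kappa, n) / phi   # time runs phi times faster at z = 1

    print(t0.mean() / t1.mean())       # close to phi = exp(0.7) = 2.014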

Example 10. Elsayed and Chan [4] presented data collected from tests for time-dependent dielectric breakdown of metal–oxide–semiconductor integrated circuits, which was described in Example 4 with the data given in Table 26.4. The data consist of times to failure (in hours) for three different temperatures (170 °C, 200 °C and 250 °C). Elsayed and Chan [4] suggest a model where the covariate of interest is the inverse of the absolute temperature. So the covariate vector for the ith circuit is given by zi = (zi1, zi2)ᵀ, where zi1 = 1 and zi2 represents the inverse absolute temperature at

which the test was performed, and takes the three values 0.001 911, 0.002 113, and 0.002 256. Hence β2 represents the coefficient of the inverse absolute temperature covariate.

Figure 26.4. Plot of the generalized residuals of the Weibull regression model for the semiconductor integrated circuit data

Fitting the Weibull regression model results in β̂2 = −7132.4 and κ̂ = 0.551. Using the observed information matrix the standard error of β̂2 is 1222.0. Hence a 95% confidence interval for β2 is (−9527.5, −4737.1). This interval indicates that β2 is non-zero, and this would imply there is a difference in the performance of the circuits at the different temperatures. This can be confirmed by performing a hypothesis test to see if β2 is non-zero. The hypotheses are:

H0: β2 = 0
H1: β2 ≠ 0

The log likelihood evaluated under H0 is −148.614 and under H1 is −130.112. Hence the likelihood ratio test statistic has a value of 37.00. Under the null hypothesis the test statistic has a χ² distribution with one degree of freedom. Therefore the hypothesis that β2 is zero is not accepted and this implies there is a difference between the circuits depending on the temperatures of the tests. Using the estimate β̂2 of β2 and its standard error gives a Wald statistic of 34.06, almost the same value as the likelihood ratio test statistic, and hence leads to the same inference.

The generalized residuals are presented in Figure 26.4. The plot is of minus the log of the reliability function of the generalized residuals against the generalized residuals. The points lie closer to a line with slope one than was the case in Figure 26.3 for Example 9. So the current model may be appropriate.

26.8 Proportional Hazards Model

This model has been widely used in reliability studies by a number of authors: Bendell and Wightman [20], Ansell and Ansell [3] and Jardine and Anderson [21]. The model Cox [7] proposed assumed that the hazard function for a component could be decomposed into a baseline hazard function and a function dependent on the covariates. The hazard function at time t with covariates z, λ(t | z), would be expressed as:

$$ \lambda(t \mid z) = \psi[\lambda_0(t), \varphi(z; \beta)] $$

where ψ would be an arbitrary function, λ0(t) would be the baseline hazard function, φ would be another arbitrary function of the covariates, z, and β the parameters of the function φ. It was suggested by Cox [7] that ψ might be a multiplicative function and that φ should be the exponential with a linear predictor for the argument. This proportional hazards model has the advantage of being well defined and is given by:

$$ \lambda(t \mid z) = \lambda_0(t) \exp(\beta^{\mathrm{T}} z) $$

However, this is only one possible selection for ψ and φ. Using the multiplicative formulation it is usual to define φ(z; β) so that φ(0; β) = 1, so that φ(z; β) is the relative risk for a component with covariate z compared to a component with covariate z = 0. Thus the reliability function is given by:

$$ R(t \mid z) = R(t \mid z=0)^{\varphi(z;\beta)} $$


where R(t | z = 0), often denoted simply as R0(t), is the baseline reliability function. Etezardi-Amoli and Ciampi [22], amongst others, have considered an alternative additive model.

Cox [7] considered the case in which the hazard function is a semi-parametric model; λ0(t) is modeled non-parametrically (or more accurately using infinitely many parameters). It is possible to select a specific parametric form for λ0(t), which could be a hazard function from one of a number of families of distributions. In the case of the semi-parametric model, Cox [23] introduced the concept of partial likelihood to tackle the problems of statistical inference. This has been further supported by the work of Andersen and Gill [24]. In Cox's approach the partial likelihood function is formed by considering the components at risk at each of the n0 failure times t[1], t[2], t[3], . . . , t[n0], as defined in Section 26.5. This produces a function which does not depend on the underlying distribution and can therefore be used to obtain estimates of β. Ties often occur in practice in data, and adjustment should be made to the estimation procedure to account for ties, as suggested by Breslow [25] and Peto [26].
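A sketch of this partial likelihood for a single covariate, with Breslow's adjustment for ties, is given below; t, d, and z are placeholder arrays for the event times, failure indicators, and covariate values:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def neg_partial_loglik(beta, t, d, z):
        """Cox partial log-likelihood (negated) for one covariate, using
        Breslow's adjustment for tied failure times; d = 1 marks a failure."""
        t, d, z = (np.asarray(a, dtype=float) for a in (t, d, z))
        ll = 0.0
        for s in np.unique(t[d == 1]):
            failing = (t == s) & (d == 1)
            risk = t >= s                       # the risk set at time s
            ll += beta * z[failing].sum()
            ll -= failing.sum() * np.log(np.exp(beta * z[risk]).sum())
        return -ll

    # beta_hat = minimize_scalar(neg_partial_loglik, args=(t, d, z)).x

Applied to the battery data of Table 26.3 with the 0/1 batch indicator, this should give an estimate near the value quoted in Example 11 below, up to the choice of tie adjustment.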

Example 11. Returning to the sodium sulfur batteries data used in Example 9, put z = z2, which was defined in Example 9 to indicate whether the battery comes from batch 1 or batch 2. (Note: the variable z1 = 1 is not required as the arbitrary term λ0(t) will contain any arbitrary constant which was provided in the Weibull regression model in Example 9 by the parameter β1.) Then β̂ = −0.0888, dropping the redundant suffix from β. Using the information matrix, the standard error of β̂ is 0.4034. Hence a 95% confidence interval for β is (−0.879, 0.702).

A test can be performed to see if β is non-zero. If it is non-zero this would imply there is a difference between the two batches of batteries. The hypotheses are:

H0: β = 0
H1: β ≠ 0

It is possible to use the partial log likelihood, which when evaluated under H0 is −81.262 and under H1 is −81.238. Hence the "likelihood" ratio test statistic (twice the difference between these partial log likelihoods) has a value of 0.048. Under the null hypothesis this test statistic can be shown to have a χ² distribution with one degree of freedom, in the same way as for likelihood. Therefore the hypothesis that β is zero is accepted and this implies there is no difference between the batches.

An alternative non-parametric approach can be taken in the case when comparing two distributions to see if they are the same. In the case of two groups a test of whether β = 0 is equivalent to testing whether the two reliability functions, R1(t) for group 1 and R2(t) for group 2, are the same. The hypotheses are:

H0: R2(t) = R1(t)
H1: R2(t) = R1(t)^{exp(β)} for β not equal to 0

A test statistic was proposed by Mantel [27], which under the null hypothesis can be shown to have a χ² distribution with one degree of freedom.

Example 12. Returning to Example 11, a test of whether β = 0 is equivalent to testing whether the reliability functions are the same for both batches of batteries. The statistic proposed by Mantel [27] is equal to 0.0485. This statistic, though not identical, is similar to that obtained from the likelihood ratio in Example 11, and hence leads to the same conclusion. Therefore the null hypothesis that β is zero is accepted.
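Mantel's statistic is the familiar two-sample log-rank statistic; a textbook form (not the chapter's own implementation) can be sketched as:

    import numpy as np

    def logrank_stat(t, d, g):
        """Two-sample log-rank (Mantel) statistic; g is a 0/1 group label.
        Under H0 the statistic is approximately chi-squared with one
        degree of freedom."""
        t, d, g = (np.asarray(a) for a in (t, d, g))
        diff, var = 0.0, 0.0
        for s in np.unique(t[d == 1]):
            risk = t >= s
            r, r1 = risk.sum(), (risk & (g == 1)).sum()
            m = ((t == s) & (d == 1)).sum()          # failures at time s
            m1 = ((t == s) & (d == 1) & (g == 1)).sum()
            diff += m1 - r1 * m / r                  # observed minus expected
            if r > 1:
                var += r1 * (r - r1) * m * (r - m) / (r ** 2 * (r - 1))
        return diff ** 2 / var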

These examples have used a categorical covariate which indicates different groups of observations. The next example uses a continuous covariate.

Example 13. Returning to the semiconductor integrated circuit data used in Example 10, put z = z2, which was defined in Example 10 as the inverse of the absolute temperature. Then β̂ = −7315.0. Using the information matrix, the standard error of β̂ is 1345.0. Hence a 95% confidence interval for β is (−9951.2, −4678.8).


This interval indicates that β is non-zero, and this would imply there is a difference in the performance of the circuits at the different temperatures. This can be confirmed by performing a test to see if β is non-zero. The hypotheses are:

H0: β = 0
H1: β ≠ 0

It is possible to use the partial log likelihood, which when evaluated under H0 is −190.15 and under H1 is −173.50. Hence the "likelihood" ratio test statistic (twice the difference between these partial log likelihoods) has a value of 33.31. Under the null hypothesis this test statistic can be shown to have a χ² distribution with one degree of freedom, in the same way as for likelihood. Therefore the hypothesis that β is zero is not accepted, and this implies there is a difference between the circuits depending on the temperatures of the tests.

Using the estimate β̂ of β and its standard error gives a Wald statistic of 29.60, almost the same value as the likelihood ratio test statistic, and hence leads to the same inference.

The reliability function has to be estimated, and the usual approach is to first estimate the baseline reliability function, R0(t).

Example 14. The estimate of the baseline reliability function for the data on sodium sulfur batteries introduced in Example 9, with the estimate β̂ of β as given in Example 11, is given in Figure 26.5 with 95% confidence limits.

Baseline has been defined as the case with covariate z = 0. However, this is not always a sensible choice of the covariate. This is illustrated in the next example.

Example 15. For the semiconductor integrated circuit data used in Example 10, z = z2 was defined as the inverse of the absolute temperature. This will only be zero if the temperature is infinitely large. Hence it makes more sense to take the baseline value to be one of the temperatures used in the tests.

Figure 26.5. Plot of the baseline reliability function for the proportional hazards model for the sodium sulfur battery data with 95% confidence limits (estimate of baseline reliability against lifetime in cycles)

The smallest temperature will be chosen, which corresponds to 0.002 256, the largest value of the inverse absolute temperature. Then the estimate of the baseline reliability function for the data, with the estimate β̂ of β as given in Example 13, is given in Figure 26.6 with 95% confidence limits.
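Re-centering the covariate before fitting is all that is required; the following is a hedged sketch using the Python lifelines package (the file name and column names are hypothetical):

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical data: one row per circuit, with lifetime, a censoring
# indicator (1 = failed, 0 = censored) and the inverse absolute temperature.
df = pd.read_csv("circuit_lifetimes.csv")

# Centre the covariate at the largest inverse temperature (the smallest
# test temperature), so that "baseline" corresponds to a real condition.
df["z"] = df["inv_temp"] - df["inv_temp"].max()

cph = CoxPHFitter()
cph.fit(df[["time", "failed", "z"]], duration_col="time", event_col="failed")

# Baseline reliability R0(t) now refers to the coolest temperature tested.
R0 = cph.baseline_survival_
```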

Figure 26.6. Plot of the baseline reliability function for the proportional hazards model for the semiconductor integrated circuit data with 95% confidence limits (estimate of baseline reliability against time in hours)


Figure 26.7. Plot of the Schoenfeld residuals for the batch covariate of the proportional hazards model for the sodium sulfur battery data

26.9 Residual Plots for the Proportional Hazards Model

Given the estimates of the reliability function it is then possible to obtain the Cox and Snell generalized residuals. However, it has been found that other residuals are more appropriate. Schoenfeld [28] suggested partial residuals to examine the proportionality assumption made in the Cox model. These residuals can then be plotted against time, and if the proportional hazards assumption holds then the residuals should be randomly scattered about zero, with no time trend.
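Packages such as lifelines expose both the residuals and a formal test; continuing the hedged sketch above (same hypothetical data frame and fitted model):

```python
from lifelines.statistics import proportional_hazard_test

# Schoenfeld residuals, one row per observed failure time.
schoenfeld = cph.compute_residuals(df[["time", "failed", "z"]],
                                   kind="schoenfeld")

# Formal test for a time trend in the residuals; a small p-value
# is evidence against the proportional hazards assumption.
result = proportional_hazard_test(cph, df[["time", "failed", "z"]],
                                  time_transform="log")
result.print_summary()
```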

Example 16. Using the sodium sulfur battery data from Example 3 and fitting the proportional hazards model, the Schoenfeld residuals are calculated and presented in Figure 26.7 as a plot of the residuals against time. A non-parametric estimate of the regression line, the "lowess" line (see Cleveland [29]), is included on the plot as well as the zero residual line. There seems to be no significant trend. A test for linear trend can be used. For this test the statistic is 0.0654, which would be from a χ2 distribution with one degree of freedom if there was no linear trend. Hence there is no evidence of a linear trend, and it is probably safe to accept the proportional hazards assumption.

Example 17. Using the semiconductor integrated circuit data from Example 4 and fitting the proportional hazards model, the Schoenfeld residuals are calculated and presented in Figure 26.8 as a plot of the residuals against time. A non-parametric estimate of the regression line, the "lowess" line, is included on the plot as well as the zero residual line. For the test for linear trend the statistic is 1.60, which would be from a χ2 distribution with one degree of freedom if there was no linear trend. Hence there is no evidence of a linear trend. However, it is not clear whether to accept the proportional hazards assumption.

A number of statistical packages facilitate proportional hazards modeling, including SAS and S-PLUS. Therneau et al. [30] suggest two alternative residuals, a martingale residual and a deviance residual. Except in the case of discrete covariates, these residuals are far from simple to calculate; however, statistical software is available for their estimation. In the case of discrete covariates the martingale residuals are a transformation of the Cox and Snell generalized residuals.

Figure 26.8. Plot of the Schoenfeld residuals for the inverse of absolute temperature of the proportional hazards model for the semiconductor integrated circuit data

Figure 26.9. Plot of the martingale residuals versus the inverse of the absolute temperature of the null proportional hazards model for the semiconductor integrated circuit data (lowess smooth line included)

The deviance residuals are a transformation of the martingale residuals to correct for skewness. Therneau et al. [30] suggest that the martingale residuals are useful in deciding about (a) appropriate functional relationships between the covariates and survival, (b) proportionality of the hazard functions, and (c) the influence of observations. They suggest the deviance residuals are more useful in identifying observations which may be outliers.
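Both residual types come from the same fitted model object; again a hedged sketch with the hypothetical data frame used earlier:

```python
# Martingale residuals: check the functional form of covariates.
mart = cph.compute_residuals(df[["time", "failed", "z"]], kind="martingale")

# Deviance residuals: a symmetrizing transformation of the martingale
# residuals, more useful for spotting outlying observations.
dev = cph.compute_residuals(df[["time", "failed", "z"]], kind="deviance")
```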

Example 18. Using the semiconductor integrated circuit data as in Example 4 and fitting the proportional hazards model, the martingale residuals are calculated and presented in Figure 26.9 as a plot of the residuals against the covariate, the inverse of the absolute temperature. A non-parametric estimate of the regression line, the "lowess" line, is included. There is some suggestion that a linear fit might not be best and a quadratic function might be an improvement, but this was found not to be the case.

The deviance residuals are calculated and can be used with the standardized score residuals to identify outliers. The plot of these residuals in Figure 26.10 suggests that there are some outliers with standardized score residuals larger than 0.3.

Figure 26.10. Plot of the deviance residuals versus the standardized score residuals for the inverse of the absolute temperature of the proportional hazards model for the semiconductor integrated circuit data

26.10 Non-proportional Hazards Models

An assumption of the Cox regression model was that the hazards are proportional. This can be interpreted as the distance between the log(−log) of the reliability functions not varying with time. Cox [7] suggested a test for this proportionality by adding an extra covariate of log time to the model with parameter β2. In the case of two groups with zi indicating membership of a group (zi = 0 if the ith individual belongs to group 1, zi = 1 if it belongs to group 2) with coefficient β1, the hazard functions become:

for group 1: λ0(t)

for group 2: λ0(t) exp(β1 + β2 log t) = λ0(t) t^β2 exp(β1)

If the coefficient of log time (β2) is different from zero, the hazard functions are non-proportional, as their ratio t^β2 exp(β1) varies with t; otherwise they are proportional, as the hazard function for group 2 reduces to λ0(t) exp(β1). The use of such models with time-dependent covariates was justified by Andersen and Gill [24].


Example 19. Using the sodium sulfur batteries data in Example 3 and fitting Cox's non-proportional hazards model, β̂2 = 0.0900 with standard error 0.523. These give a Wald statistic of 0.0296, which is not significantly different from zero for a χ2 distribution with one degree of freedom. Hence the null hypothesis of proportionality of the hazards is accepted, which agrees with the conclusion made in Example 16 when using the Schoenfeld residuals.

Many authors have extended this use of time-dependent variables in studies of lifetimes, for example Gore et al. [31]. The other obvious problem is selection of an appropriate model. A plot of log(−log) of the reliability function or a log hazard plot may reveal some suitable function, or the nature of the problem may suggest some particular structure. Alternatively one might use the smoothed martingale residuals suggested by Therneau et al. [30], plotted against the covariate, to seek the functional form. The danger is in producing too complex a model which does not increase insight. When the variables are quantitative rather than qualitative the problem is exacerbated, and greater caution is necessary.

When predicting future performance there may be difficulties with time-dependent covariates, and then it may be necessary to resort to simulation.

26.11 Selecting the Model and the Variables

The number of variables included in an analysis should be considerably less than the number of data points. Overfitting the data can be a major problem. Adding extra variables should improve the fit of the model to the data. However, this will not necessarily improve the ability of the model to predict future observations. Some authors have suggested that data in regression analyses should be split into two parts: with the first half one derives the model, and with the second half judgment about the model should be made. Often there is too little data to allow such an approach. In the case of limited data one should therefore be wary of too good a fit to the data.

Most of the assessment of which variable to fit will be based on the analysis performed. In normal regression modeling considerable attention has been paid to variable selection, though no generally accepted methodology has been devised to obtain the best set. Two approaches taken are forward selection, in which variables are added to the model, and backward selection, in which variables are deleted until a "good" model is fitted. These approaches can be applied together, allowing for the inclusion and deletion of variables.

For the models considered in this chapter an equivalent to the F-value, used for models with normal errors, would be the change in deviance, which for the ith variable is

Devi = −2(l(βi = 0) − l(βi ≠ 0))

where l(βi = 0) is the log likelihood without the ith variable fitted and l(βi ≠ 0) is with it fitted. This statistic is asymptotically distributed as χ2 with one degree of freedom. Hence the addition, or deletion, of a variable given the current model can be decided on whether this value is significantly large or not. Whether the approach results in the "best" fit is always doubtful, but if there are a large number of variables it may be a sensible approach. Again, if the above method is used then the model produced should be capable of physical explanation within the context. The model ought to seem plausible to others. If the model cannot be interpreted then there is a danger of treating the technique as a black box.
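A forward selection loop built on this change-in-deviance criterion might look as follows; a minimal Python sketch in which fit_loglik, a routine returning the maximized log likelihood for a given covariate set, is hypothetical:

```python
from scipy.stats import chi2

def forward_select(candidates, fit_loglik, alpha=0.05):
    """Greedy forward selection using the change in deviance."""
    selected = []
    ll_current = fit_loglik(selected)
    while candidates:
        # Try each remaining variable; keep the one giving the best fit.
        best = max(candidates, key=lambda v: fit_loglik(selected + [v]))
        dev = 2 * (fit_loglik(selected + [best]) - ll_current)
        if chi2.sf(dev, df=1) >= alpha:
            break                      # no significant improvement left
        selected.append(best)
        candidates.remove(best)
        ll_current = fit_loglik(selected)
    return selected
```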

26.12 Discussion

Given the variety of models discussed it may seem crucial that the appropriate model is selected. Whilst in some specific cases this may be so, generally the desire is to identify factors or variables which are having an effect on the component's or system's performance. In such cases most of the models will at least be able to detect effects of factors which are strongly associated with lifetimes, though they may overlook those with weak association, provided the proportion of censored observations is small. Solomon [32] has also supported these comments when comparing accelerated failure-time and proportional hazards models. Further work is necessary in the case of highly censored data. Hence model selection may not be as crucial as initially suggested. Obviously if the desire is to model lifetimes of the component in detail then more care in choice should be taken.

This chapter has covered the analysis of lifetimes of components and systems. However, the data in Example 5 were for a repairable system, so it would be appropriate to estimate the rate of occurrence of failures (ROCOF). A non-parametric approach to this problem is given by Phillips [33]; it is beyond the scope of this chapter.

References

[1] Keating JP, Glaser RE, Ketchum NS. Testing hypotheses about the shape parameter of a gamma distribution. Technometrics 1990;32:67–82.
[2] Kim JS, Proschan F. Piecewise exponential estimator of the survivor function. IEEE Trans Reliab 1991;40:134–9.
[3] Ansell RO, Ansell JI. Modelling the reliability of sodium sulphur cells. Reliab Eng 1987;17:127–37.
[4] Elsayed EA, Chan CK. Estimation of thin-oxide reliability using proportional hazards models. IEEE Trans Reliab 1990;39:329–35.
[5] Newton DW. Some pitfalls in reliability data analysis. Reliab Eng 1991;34:7–21.
[6] Ansell JI, Phillips MJ. Practical methods for reliability data analysis. Oxford: Oxford University Press; 1994.
[7] Cox DR. Regression models and life tables. J R Stat Soc B 1972;34:187–220.
[8] Kaplan EL, Meier P. Non-parametric estimation from incomplete observations. J Am Stat Assoc 1958;53:457–81.
[9] Nelson W. Hazard plotting for incomplete failure data. J Qual Tech 1969;1:27–52.
[10] Smith RL. Weibull regression models for reliability analysis. Reliab Eng Safety Syst 1991;34:55–77.
[11] Cox DR, Snell EJ. A general definition of residuals (with discussion). J R Stat Soc B 1968;30:248–75.
[12] Lawless JF. Statistical models and methods for lifetime data. New York: Wiley; 1982.
[13] Glasser M. Exponential survival with covariance. J Am Stat Assoc 1967;62:561–8.
[14] Nelson W. Accelerated life testing. New York: Wiley; 1993.
[15] Cox DR. Some applications of exponential ordered scores. J R Stat Soc B 1964;26:103–10.
[16] Fiegl P, Zelen M. Estimation of exponential survival probabilities with concomitant information. Biometrics 1965;21:826–38.
[17] Nelson WB, Hahn GJ. Linear estimation of a regression relationship from censored data. Part 1. Simple methods and their applications. Technometrics 1972;14:247–76.
[18] Kalbfleisch JD. Some efficiency calculations for survival distributions. Biometrika 1974;61:31–8.
[19] Farewell VT, Prentice RL. A study of distributional shape in life testing. Technometrics 1977;19:69–76.
[20] Bendell A, Wightman DM. The practical application of proportional hazards modelling. Proc 5th Nat Reliab Conf, Birmingham, 1985. Culcheth: National Centre for Systems Reliability.
[21] Jardine AKS, Anderson M. Use of concomitant variables for reliability estimation and setting component replacement policies. Proc 8th Adv Reliab Tech Symp, Bradford, 1984. Culcheth: UKAEA.
[22] Etezardi-Amoli J, Ciampi A. Extended hazard regression for censored survival data with covariates: a spline approximation for the baseline hazard function. Biometrics 1987;43:181–92.
[23] Cox DR. Partial likelihood. Biometrika 1975;62:269–76.
[24] Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Ann Stat 1982;10:1100–20.
[25] Breslow NE. Covariate analysis of censored survival data. Biometrics 1974;30:89–99.
[26] Peto R. Contribution to the discussion of regression models and life tables. J R Stat Soc B 1972;34:205–7.
[27] Mantel N. Evaluation of survival data and two new rank order statistics arising from its consideration. Cancer Chemother Rep 1966;50:163–70.
[28] Schoenfeld D. Partial residuals for the proportional hazards regression model. Biometrika 1982;69:239–41.
[29] Cleveland WS. Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 1979;74:829–36.
[30] Therneau TM, Grambsch PM, Fleming TR. Martingale based residuals for survival models. Biometrika 1990;77:147–60.
[31] Gore SM, Pocock SJ, Kerr GR. Regression models and non-proportional hazards in the analysis of breast cancer survival. Appl Stat 1984;33:176–95.
[32] Solomon PJ. Effect of misspecification of regression models in the analysis of survival data. Biometrika 1984;71:291–8.
[33] Phillips MJ. Bootstrap confidence regions for the expected ROCOF of a repairable system. IEEE Trans Reliab 2000;49:204–8.


Chapter 27

The Application of Capture–Recapture Methods in Reliability Studies

Paul S. F. Yip, Yan Wang and Anne Chao

27.1 Introduction
27.2 Formulation of the Problem
27.2.1 Homogeneous Model with Recapture
27.2.2 A Seeded Fault Approach Without Recapture
27.2.3 Heterogeneous Model
27.2.3.1 Non-parametric Case: λi(t) = γiαt
27.2.3.2 Parametric Case: λi(t) = γi
27.3 A Sequential Procedure
27.4 Real Examples
27.5 Simulation Studies
27.6 Discussion

27.1 Introduction

Suppose there are an unknown number ν of faults in a computer program. A fault is some defect in the code which (under at least one set of conditions) will produce an incorrect output, i.e. a failure. The program is executed for a total time τ under a range of conditions to simulate the operating environment. Failures occur on Nτ occasions at times 0 < t1 < t2 < · · · < tNτ < τ. The time until the detection of fault i is a positive random variable with hazard function λi(t) (i = 1, 2, . . . , ν). The first column of Table 27.1 gives a set of failure data for an information system (seconds CPU time). The data can be found in Moek [1]. The main purpose of this study is to estimate the total number of remaining faults in the system.

Estimation methods are available based on the observed number of faults. Information about ν comes from detecting new faults and the rate at which faults are being detected. However, it has been shown that a large proportion of faults must be detected in order for the method to perform satisfactorily [2]. Alternatively, some other designs have been suggested to improve the estimation. For example:

• A recapture approach: the recapture idea is to assume a fault is detected, corrected, and a counter inserted to record potential recurrences of the fault. The setting up of a counter where the fault is found, and the recording of the number of times the fault is re-detected, is assumed possible without causing system failure [3, 4]. The recapture approach assumes that both the detection times and the fault type are recorded. Information about ν comes from (a) detecting new faults and (b) observing the proportion of revisited faults in subsequent testing. An estimator based on the optimal estimating equation is recommended, as suggested by Lloyd et al. [4]. The estimator generated by the optimal estimating equation can be shown to be equivalent to the maximum likelihood estimator.


Table 27.1. Real and simulated failure times of Moek's data [1]

Fault #   First failure   Subsequent simulated failure times
 1            880         64 653, 329 685, 520 016
 2           4310         305 937, 428 364, 432 134, 576 243
 3           7170         563 910
 4         18 930         186 946, 195 476, 473 206
 5         23 680         259 496, 469 180
 6         23 920         126 072, 252 204, 371 939
 7         26 220         251 385
 8         34 790         353 576
 9         39 410         53 878, 147 409, 515 884
10         40 470         371 824, 466 719
11         44 290         83 996, 327 296, 352 035, 395 324, 494 037
12         59 090         61 435, 222 288, 546 577
13         60 860         75 630, 576 386
14         85 130
15         89 930         205 224, 292 321, 294 935, 342 811, 536 573, 553 312
16         90 400         228 283, 334 152, 360 218, 368 811, 377 529, 547 048
17         90 440         511 836, 511 967
18        100 610         367 520, 429 213
19        101 730         162 480, 534 444
20        102 710         194 399, 294 708, 295 030, 360 344, 511 025
21        127 010         555 065
22        128 760
23        133 210         167 108, 370 739
24        138 070         307 101, 451 668
25        138 710         232 743
26        142 700         215 589
27        169 540         299 094, 428 902, 520 533
28        171 810         404 887
29        172 010         288 750
30        211 190
31        226 100         378 185, 446 070, 449 665
32        240 770         266 322, 459 440
33        257 080         374 384
34        295 490         364 952
35        296 610
36        327 170         374 032, 430 077
37        333 380
38        333 500         480 020
39        353 710
40        380 110         433 074
41        417 910         422 153, 479 514, 511 308
42        492 130
43        576 570


• A seeded fault approach without recapture: the seeded fault approach assumes a known number of faults are deliberately seeded in the system. Faults are detected and the number of real and seeded faults recorded. Information about ν comes from the proportion of real and seeded faults observed in subsequent testing. With sufficient seeded faults in the system, recaptures do not improve the performance significantly [4]. An estimator based on the optimal estimating equation is also recommended in the present setting [5]. The estimator generated by the optimal estimating equation can again be shown to be equivalent to the maximum likelihood estimator.

The two approaches suggested above make the homogeneity assumption for the failure intensities, i.e. λi(t) = αt for all i. For heterogeneous models we consider the following three situations.

1. Non-parametric case (failure intensity λi(t) = γiαt) with recapture. The γi values are individual effects and can be unspecified.
2. Parametric case (failure intensity λi(t) = γi) without recapture, i.e. the Littlewood model [6], which also includes the Jelinski–Moranda model as a special case.
3. Parametric case (failure intensity λi(t) = γi) with recapture. The γi values are assumed to come from a gamma density.

For the homogeneous model we use the optimal estimating equation, which is well established as an alternative and efficient method to estimate population size [7, 8]. We shall introduce a sample coverage estimator and a Horvitz–Thompson type estimator [9] to estimate ν for heterogeneous models.

In this chapter we also discuss a sequential procedure to estimate the number of faults in a system. A procedure of this type is useful in the preliminary stages of testing. In practice it is expensive and generally infeasible to detect all the faults in the system in the initial stages of software development. The idea of the sequential procedure proposed here is not to find all the faults in the system. The goal is to detect designs of questionable quality earlier in the process by estimating the number of faults that go undetected after a design review. This can be valuable for deciding whether the design is of sufficient quality for the software to be developed further, as suggested by Van der Wiel and Votta [10]. In the early stages, we only need a rough idea how many faults there are before proceeding further. Usually, an estimate with some specified minimum level of accuracy would be sufficient [11]. Here we consider as accuracy criterion the ratio of the standard error of the estimate to the population size estimate, sometimes called the coefficient of variation.

The purpose of this chapter is to provide estimating methods to make inference on ν. This chapter is different from other reviews on the use of recapture in software reliability and epidemiology, which for example examine the case of using a number of inspectors to estimate the number of faults in a system, or estimating the number of diseased persons based on a number of lists, respectively [12, 13]. Here we concentrate on a continuous time framework. Section 27.2 gives a formulation of the problem which includes homogeneous and heterogeneous cases, treated parametrically and non-parametrically. A sequential procedure is suggested in Section 27.3 and a real example is given in Section 27.4 to illustrate the procedure. Section 27.5 gives a wide range of simulation results for the proposed estimators. A discussion can be found in Section 27.6.

27.2 Formulation of the Problem

Notation

ν            number of faults in a system
Ni(t)        counting process of the number of detections of fault i before time t
dt           small time interval after time t
Ft           σ-algebra generated by Ni(s), s ∈ [0, t], i = 1, . . . , ν
λi(t)        intensity process for Ni(t)
τ            stopping time of the experiment
γi           individual effect of fault i in the heterogeneous model
ℳt, ℳ*t      zero-mean martingales
Mt           number of distinct faults detected by time t
Nt           total number of detections by time t
Kt           number of re-detections of the already found faults by time t
Wt, k(t)     weight functions
W*t          optimal weight function
ν̂W           estimate of ν with weight function W
se(ν̂)        estimated standard error of the estimate ν̂
D            number of seeded faults inserted at the beginning of the experiment
Ut           number of real faults detected in the system before time t
αt           failure intensity of the real faults
βt           failure intensity of the seeded faults
θ            proportionality constant between αt and βt
Rt, R*t      zero-mean martingales
Ct           proportion of the total individual effects of detected faults
I            indicator function
Ai(t)        event that fault i has not been detected up to t but is detected at time t + dt
Λ            cumulative hazard function
fi(t)        number of individuals detected exactly i times by time t
R            remainder term in the expansion of the expectation of νCt
Yi(t)        indicator of whether fault i is at risk of being detected
G(φ, β)      gamma distribution with shape parameter φ and scale parameter β
ε, ω         parameters after reparameterizing φ, β
n            number of distinct faults detected during [0, τ]
δi           indicator that at least one failure caused by fault i occurs in [0, τ]
pi           probability that δi takes the value 1
Σ(θ)         limiting covariance matrix of the parameter θ = (ε, ω)′
d0           given accuracy level
ave(ν̂)       average estimated population size based on simulation
ave(se(ν̂))   average estimated standard error based on simulation

In this chapter we consider various assumptions about the intensity process, λi(t).

• Homogeneous model: assume all the faults are homogeneous with respect to failure intensity, i.e. λi(t) = αt, t ≥ 0, for i = 1, 2, . . . , ν.
• Heterogeneous model: consider λi(t) = γiαt (where the αt are specified).

The individual effects {γ1, γ2, . . . , γν} can be modeled in the following two ways.

1. Fixed-effect model: {γ1, γ2, . . . , γν} are regarded as fixed parameters.
2. Frailty model: {γ1, γ2, . . . , γν} are a random sample assumed from a gamma distribution whose parameters are unknown.

The optimal estimating function methods proposed here are based on results for continuous martingales and follow from the work of Aalen [14, 15]. For our purposes a zero-mean martingale (ZMM) is a stochastic process {ℳt; t ≥ 0} such that E(ℳ0) = 0 and for all t ≥ 0 we have E|ℳt| < ∞ as well as E(ℳt+h | Ft) = ℳt for every h > 0.

27.2.1 Homogeneous Model with Recapture

For a homogeneous model with intensity λi(t) = αt, t ≥ 0, for i = 1, 2, . . . , ν, we make use of the estimating function to make inference about ν. The important feature is to identify an estimating function which involves the parameter ν. Let Mt denote the number of faults that are detected by time t and Nt = ∑_{i=1}^ν Ni(t) the total number of detections by time t. Then Kt = Nt − Mt denotes the number of re-detections of the already found faults by time t. Define

dℳu = Mu dMu − (ν − Mu) dKu

Then dℳu is a martingale difference with respect to Fu, where Fu is the history generated by {N1(s), N2(s), . . . , Nν(s); 0 ≤ s ≤ u}. Since we have

dMu | Fu ∼ Bin(ν − Mu, αu)

and

dKu | Fu ∼ Bin(Mu, αu)

where Bin denotes "binomially distributed", and

E(dMu | Fu) = (ν − Mu)αu

and

E(dKu | Fu) = Mu αu

it follows that E(dℳu | Fu) = 0. Integrating dℳu from 0 to t, it follows that ℳt is a ZMM [8]. For any predictable process Wu, we obtain the following ZMM:

ℳ*τ = ∫_0^t Wu dℳu = ∫_0^t Wu{Mu dMu − (ν − Mu) dKu}   (27.1)

A class of estimators of ν, evaluated at time τ, can be obtained by solving ℳ*τ = 0, i.e.

ν̂W(τ) = ∫_0^τ Wu Mu dNu / ∫_0^τ Wu dKu

The variance of ℳ*τ is given by

Var(ℳ*τ) = E( ∫_0^τ Wu² {Mu² dMu + (ν − Mu)² dKu} )

The variance expression follows from standard results given by, for example, Andersen and Gill [16] or Yip [5]. The covariance is zero by virtue of the orthogonality of the martingales, since dMu and dKu cannot jump simultaneously.

Using the result

ℳ*τ / √Var(ℳ*τ) −→ N(0, 1)

and the fact that ℳ*τ = (ν̂W(τ) − ν) ∫_0^τ Wu dKu, an estimate of the standard error of ν̂W(τ) is given by

se(ν̂W(τ)) = [ ∫_0^τ Wu² {Mu² dMu + (ν̂W(τ) − Mu)² dKu} ]^{1/2} / ∫_0^τ Wu dKu

Here we consider two types of weight function: Wu = 1 and the optimal weight suggested by Godambe [17], i.e.

W*u = E[∂dℳu/∂ν | Fu] / E[dℳu² | Fu]   (27.2)

The optimal property is in terms of giving the tightest asymptotic confidence intervals for the estimate. Note that ν is an integer value, so instead of taking the derivative w.r.t. ν, we compute the first difference of dℳt.

If Wu = 1 we have

ν̂1 = ∫_0^τ Mu dNu / Kτ

which is the Schnabel estimator [8, 18]. In the event that Kτ = 0, we define ν̂(τ) = ∞, se(ν̂(τ)) = ∞, and se(ν̂(τ))/ν̂(τ) = ∞. The optimal weight suggested in Equation 27.2 can readily be shown to be

W*u = 1/(ν − Mu)

Hence the optimal estimator ν̂* is the solution of the following equation:

∫_0^τ [1/(ν − Mu)] {Mu dMu − (ν − Mu) dKu} = 0   (27.3)

which is equivalent to the MLE [8, 19]. An estimate of the standard error of the estimator ν̂* is given by

se(ν̂*) = ( ν̂* / ∫_0^τ [Mu/(ν̂* − Mu)²] dMu )^{1/2}

The assumption of homogeneous intensity is not used in deriving the estimator from the martingale given in Equation 27.1. The intensity function may depend on time, t. It follows that the estimator is robust to time heterogeneity. For details, see Yip et al. [8].
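As a concrete illustration, both ν̂1 and ν̂* can be computed directly from a list of detections; a minimal Python sketch (the event-list representation and the root-search bracket are assumptions of this illustration):

```python
from scipy.optimize import brentq

def recapture_estimates(events):
    """Schnabel and optimal (MLE) estimates of the number of faults
    from continuous-time recapture data.

    events : iterable of (time, fault_id) pairs, sorted by time.
    """
    seen = set()
    schnabel_num = 0.0   # accumulates M_{u-} over every detection dN_u
    K = 0                # number of re-detections of already found faults
    first_marks = []     # M_{u-} at each first detection
    for _, fault in events:
        m = len(seen)
        schnabel_num += m
        if fault in seen:
            K += 1
        else:
            first_marks.append(m)
            seen.add(fault)

    nu1 = schnabel_num / K if K > 0 else float("inf")   # Schnabel

    # Estimating equation (27.3): sum over first detections of
    # M_{u-}/(nu - M_{u-}) minus the number of recaptures.
    def ee(nu):
        return sum(m / (nu - m) for m in first_marks) - K

    n = len(seen)
    if K == 0:
        nu_star = float("inf")
    elif ee(n + 1e-6) <= 0:
        nu_star = float(n)           # no root above n: pinned at n
    else:
        nu_star = brentq(ee, n + 1e-6, 1e8)
    return nu1, nu_star
```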

27.2.2 A Seeded Fault Approach Without Recapture

A known number of faults D are seeded in the system at the beginning of the testing process [5]. The failure intensities of the seeded and real faults are allowed to be time dependent. The failure intensity of the seeded faults is assumed to differ from that of the real faults by a constant proportion, θ, assumed known. The same failure intensity is assumed to apply to each fault of a given type in the system. Let Ut and Mt denote the number of real and seeded faults detected in the system in [0, t], assuming that once a fault is detected it is immediately removed from the system. No recapture information is available from detected faults (seeded or real). In the presence of the unknown parameter αt, an identifiability problem occurs when we want to estimate ν by observing the failure times only: the information on ν and αt is confounded. The extra effort of inserting a known number D of faults into the system at the beginning of the testing provides an extra equation to estimate ν. However, an identifiability problem can still occur with the extra unknown parameter βt, the failure intensity of the seeded faults at time t. Here we assume a constant proportionality between the two intensities αt and βt, i.e. αt = θβt for all t, with θ assumed known. Yip et al. [20] have examined the sensitivity to a misspecification of θ and the estimation performance with an unknown θ. If θ is unknown, similar estimating functions can be obtained for making inference on θ and ν.

Here, we define a "martingale difference" with the intent to estimate ν:

dRu = (D − Mu) dUu − θ(ν − Uu) dMu

Note that E[dRu | Fu] = 0. Let Wu be a measurable function w.r.t. Fu. It follows that the process R* = {R*t; t ≥ 0},

R*t = ∫_0^t Wu{(D − Mu) dUu − θ(ν − Uu) dMu}   (27.4)

is a ZMM. By equating Equation 27.4 to zero and evaluating at time τ, a class of estimators for ν is obtained, i.e.

ν̂w = [ ∫_0^τ Wu{(D − Mu) dUu + θUu dMu} ] × [ θ ∫_0^τ Wu dMu ]^{−1}   (27.5)

which depends on the choice of Wu. The conditional variance of R*τ in Equation 27.4 is given by

Var(R*τ) = E[ ∫_0^τ Wu² {(D − Mu)² dUu + θ²(ν − Uu)² dMu} ]

Hence we can deduce a standard error of ν̂w for the seeded fault approach, given by

se(ν̂w) = [ ∫_0^τ Wu² {(D − Mu)² dUu + θ²(ν̂w − Uu)² dMu} ]^{1/2} × [ θ ∫_0^τ Wu dMu ]^{−1}   (27.6)

For the choice Wu = 1, we have explicit expressions for the estimate ν̂ and its standard error, i.e.

ν̂1 = ∫_0^τ {(D − Mu) dUu + θUu dMu} / (θMτ)   (27.7)

and

se(ν̂1) = [ ∫_0^τ {(D − Mu)² dUu + θ²(ν̂1 − Uu)² dMu} ]^{1/2} × {θMτ}^{−1}   (27.8)

Also, the optimal weight corresponding to Equation 27.4 is

W*u = 1 / {(ν − Uu)[(D − Mu) + (ν − Uu)θ]}


Accordingly, an optimal estimating equation, in the sense of giving the tightest confidence interval for the estimator of ν, is given by

R*t = ∫_0^t [1/{(ν − Uu)[(D − Mu) + (ν − Uu)θ]}] {(D − Mu) dUu − θ(ν − Uu) dMu}   (27.9)

The optimal estimate ν̂* is the solution of Equation 27.9 set equal to zero at time τ. An explicit expression is not available and an iterative procedure is required. From Equation 27.6 we have an estimate of the standard error of ν̂*:

se(ν̂*) = { ∫_0^τ [{(D − Mu)² dUu + θ²(ν̂* − Uu)² dMu} / {(ν̂* − Uu)[(D − Mu) + (ν̂* − Uu)θ]}²] }^{1/2} × { θ ∫_0^τ dMu / {(ν̂* − Uu)[(D − Mu) + (ν̂* − Uu)θ]} }^{−1}
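For the simple weight Wu = 1, Equations 27.7 and 27.8 reduce to sums over the observed detections; a minimal Python sketch (the event-list representation is an assumption of this illustration):

```python
def seeded_fault_estimate(events, D, theta):
    """Estimate (27.7) and standard error (27.8) for the seeded
    fault design with Wu = 1.

    events : list of (time, kind) pairs sorted by time, where kind is
             "real" or "seeded"; every detected fault is removed.
    D      : number of seeded faults inserted at the start.
    theta  : known proportionality constant alpha_t / beta_t.
    """
    U = M = 0            # real and seeded faults removed so far
    numerator = 0.0      # integral in the numerator of (27.7)
    history = []         # (U_{u-}, M_{u-}, kind) at each detection
    for _, kind in events:
        history.append((U, M, kind))
        if kind == "real":
            numerator += D - M         # (D - M_{u-}) dU_u
            U += 1
        else:
            numerator += theta * U     # theta * U_{u-} dM_u
            M += 1

    nu_hat = numerator / (theta * M)

    s = 0.0                            # integrand of (27.8)
    for u, m, kind in history:
        if kind == "real":
            s += (D - m) ** 2
        else:
            s += theta ** 2 * (nu_hat - u) ** 2
    return nu_hat, s ** 0.5 / (theta * M)
```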

27.2.3 Heterogeneous Model

Here we introduce a sample coverage estimator and a two-step procedure to estimate ν for a heterogeneous model. The sample coverage method was initiated by Chao and Lee [21] and the two-step procedure was suggested by Huggins [22, 23] and Yip et al. [24]. The individual effects can be specified or unspecified (parametrically or non-parametrically).

27.2.3.1 Non-parametric Case: λi(t) = γiαt

The individual effects {γ1, γ2, . . . , γν} can be modeled as either a fixed effect or a random effect. For a fixed-effect model, {γ1, γ2, . . . , γν} are regarded as fixed (nuisance) parameters. Instead of trying to estimate each of the failure intensities, the important parameters in the sample coverage approach are the mean, γ̄ = ∑_{i=1}^ν γi/ν, and the coefficient of variation (CV), γ = [∑(γi − γ̄)²/ν]^{1/2}/γ̄. The value of the CV is a measure of the heterogeneity of the individual effects: CV = 0 is equivalent to all the γi values being equal, and a larger CV implies the population is more heterogeneous in individual effects. For a random-effects model we assume {γ1, γ2, . . . , γν} are a random sample from an unspecified and unknown distribution F with mean γ̄ = ∫u dF(u) and CV given by γ² = ∫(u − γ̄)² dF(u)/γ̄². Both models lead to exactly the same estimator of population size using the sample coverage idea [25, 26]. The derivation procedure for the random-effects model is nearly parallel to that for the fixed-effects model. Thus in the following we only illustrate the approach using the fixed-effects model.

Let us define the sample coverage, Ct, as the proportion of the total individual fault effects of detected faults, given by

Ct = [ ∑_{i=1}^ν γi I(the ith fault is detected at least once up to t) ] / [ ∑_{i=1}^ν γi ]

where I is an indicator function. Let Mt and Kt be the number of marked faults in the system and the number of re-detections of those faults at time t, respectively, and let Ai(t) be the event that the ith fault has not been detected up to t but is detected at time t + dt. We have

E[dMt | Ft] = E[ ∑_{i=1}^ν I{Ai(t)} | Ft ] = ∑_{i=1}^ν γi αt dt I(ith fault is not detected up to t)

It follows from the definition of Ct that

E[dMt | Ft] = (1 − Ct)( ∑_{i=1}^ν γi ) αt dt

and

E[dKt | Ft] = Ct ( ∑_{i=1}^ν γi ) αt dt

for t > 0.


Consider the following ZMM:

ℳt = ∫_0^t {(νCu)[dMu − (1 − Cu)( ∑_{i=1}^ν γi ) αu du] − ν(1 − Cu)[dKu − Cu( ∑_{i=1}^ν γi ) αu du]}
   = ∫_0^t {(νCu) dMu − (ν − νCu) dKu}

where E[ℳt] = 0 for all t. By eliminating the unknown parameters ∑_{i=1}^ν γi and αu in ℳt we can make inference about ν in the presence of the nuisance parameter νCu, which can be estimated accordingly. By integrating a predictable function Wu w.r.t. ℳ = {ℳt; t ≥ 0} we have another martingale, i.e.

ℳ*t = ∫_0^t Wu{(νCu) dMu − (ν − νCu) dKu}   (27.10)

where ℳ*t involves the parameters ν and νCu. Similarly, we consider two types of weight function: Wu = 1 and the optimal weight suggested in Equation 27.2.

Suppose all the γi values are the same, so that νCu = Mu and ν − νCu = ν − Mu as in the homogeneous case of Section 27.2.1; the estimating function in Equation 27.10 then reduces to

ℳ*t = ∫_0^t Wu[Mu dMu − (ν − Mu) dKu]   (27.11)

which is equivalent to Equation 27.1. If we can find a "predictor" ν̂Cu of νCu, we have the following estimator for ν:

ν̂h = ∫_0^t Wu (ν̂Cu) dNu / ∫_0^t Wu dKu   (27.12)

The optimal weight can be shown to be

W*u = 1/(1 − Cu)

Also, this weight reduces to the optimal weight used in the homogeneous case where all the γi values are the same. Here we suggest a procedure to find an estimator for E[νCu] and subsequently use it to replace νCu in the estimating function. Note that

E[νCu] = (1/γ̄) ∑_{i=1}^ν γi E[I(the ith fault is detected at least once up to u)] = (1/γ̄) ∑_{i=1}^ν γi[1 − e^{−γiΛ}]

where Λ = ∫_0^u αs ds, and we also have

E[Mu] = ∑_{i=1}^ν [1 − e^{−γiΛ}]

It then follows that

E[νCu] = E(Mu) + G(γ1, . . . , γν)

where

G(γ1, . . . , γν) = ∑_{i=1}^ν (1 − γi/γ̄) exp(−γiΛ)

Define γ*i = γ̄ + θ(γi − γ̄), for some 0 < θ < 1. Expanding G(γ1, . . . , γν) about (γ̄, . . . , γ̄) we have

E[νCu] = E[Mu] + γ²νγ̄Λ e^{−γ̄Λ} − (1/2) ∑_{i=1}^ν Λ² e^{−γ*iΛ}(γi − γ̄)³/γ̄
       = E[Mu] + γ² ∑_{i=1}^ν (γiΛ) e^{−γiΛ} + γ² ∑_{i=1}^ν (γ̄Λ e^{−γ̄Λ} − γiΛ e^{−γiΛ}) − (1/2) ∑_{i=1}^ν Λ² e^{−γ*iΛ}(γi − γ̄)³/γ̄
       = E[Mu] + γ² E[f1(u)] + R

where fi(u) is the number of individuals that have been caught exactly i times in the sample by time u, and


R = γ² ∑_{i=1}^ν (γ̄Λ e^{−γ̄Λ} − γiΛ e^{−γiΛ}) − (1/2) ∑_{i=1}^ν Λ² e^{−γ*iΛ}(γi − γ̄)³/γ̄

The remainder term, R, is usually negligible, as shown by Chao and Lee [25], when the variation of the γi values is not relatively large, say γ < 1. One theoretical justification for the approximation is that when γ1, γ2, . . . , γν are a random sample from a gamma distribution, the remainder term R satisfies R/ν → 0 as ν → ∞. Further justifications can be found in Lin et al. [27]. The estimation of νCu will then be based on the approximation ignoring R, as in Chao and Lee [21]. Thus we have the following estimator for νCu:

ν̂Cu = Mu + γ̂² f1(u)

where γ̂² is an estimator of the CV given by Chao and Lee [25]:

γ̂² = max{ (Mt/Ĉt) ∑_i i(i − 1)fi(t) / [∑_i ifi(t)]² − 1, 0 }   (27.13)

where

Ĉt = 1 − f1(t)/∑_i ifi(t)

An estimator of ν based on the sample coverage estimate in Chao and Lee [25] is

ν̂2 = Mt/Ĉt + (f1(t)/Ĉt) γ̂²   (27.14)
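From the detection frequencies alone, the sample coverage estimate takes a few lines of code; a minimal Python sketch (it assumes at least one fault was detected more than once, so that Ĉt > 0):

```python
def chao_lee_estimate(freqs):
    """Sample coverage estimator (27.13)-(27.14).

    freqs : dict mapping i -> f_i(t), the number of faults detected
            exactly i times by the end of testing.
    """
    M = sum(freqs.values())                       # distinct faults found
    total = sum(i * f for i, f in freqs.items())  # total detections
    C_hat = 1 - freqs.get(1, 0) / total           # sample coverage
    pairs = sum(i * (i - 1) * f for i, f in freqs.items())
    cv2 = max((M / C_hat) * pairs / total**2 - 1, 0.0)  # (27.13)
    nu2 = M / C_hat + (freqs.get(1, 0) / C_hat) * cv2   # (27.14)
    return nu2, cv2

# Example: 30 faults seen once, 10 seen twice, 3 seen three times.
print(chao_lee_estimate({1: 30, 2: 10, 3: 3}))
```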

Substituting ν̂Cu into Equation 27.12, we obtain the following two new estimators:

ν̂3 = ∫_0^t {Mu + γ̂² f1(u)} dNu / Kt   for Wu = 1

ν̂4 = [ ∫_0^t {Mu + γ̂² f1(u)}/(1 − Ĉu) dNu ] × [ ∫_0^t dKu/(1 − Ĉu) ]^{−1}   for Wu = 1/(1 − Ĉu)

The estimation of γ̂² is based on all the data at the end of the experiment. However, we need to estimate Ĉu and ν̂Cu sequentially for 0 < u ≤ τ. It is clear that the CV value plays an important role in the estimation procedure.

An estimate of the standard error for ν̂3 and ν̂4 is given by

se(ν̂h) = √Var(ℳ*t) / ∫_0^t Wu dKu   (27.15)

where Var(ℳ*t) = ∫_0^t Wu² {(ν̂Cu)² dMu + (ν̂h − ν̂Cu)² dKu}. Simulation results show that se(ν̂3) and se(ν̂4) severely underestimate the variability of ν̂3 and ν̂4. The expression in Equation 27.15 does not take into account the additional variation introduced by substituting estimates for parameters such as νCu and ν in the estimating function. The resulting expression for the estimate of se(ν̂h) would be complicated if the variability of these estimates were taken into account. An alternative bootstrap procedure is used to assess the variability instead [26].

27.2.3.2 Parametric Case: λi(t) = γi

Let Ni(t) be the counting process for fault i and γi its occurrence rate, a stochastic variable; i.e. the intensity process for the counting process depends on an unobservable random variable. Let Yi(t) indicate, by the value 1 or 0, whether fault i has not been removed before t; thus Yi(t) is an observable non-negative predictable process. Let τ be the duration of the study and Ft denote the history generated by {Ni(s), Yi(s), 0 ≤ s ≤ t}.

Suppose that γi has a gamma distribution with parameters φ and β, denoted by G(φ, β), with density function

f(γi | φ, β) = e^{−βγi} γi^{φ−1} β^φ / Γ(φ),   γi > 0

Using the innovation theorem [15], the compensator of the counting process Ni(t) is given by

λi(t) = Yi(t) {(φ + Ni(t−)) / (β + t)}   (27.16)


The intensity function λi(t) depends on time and the number of times the ith individual fault has been detected. The formulation in Equation 27.16 includes both the removal and recapture sampling methods. For the case of removal, in which Ni(t) equals zero since the ith fault is removed from the system after being detected, the model becomes the Littlewood model [2, 6]. In addition, Equation 27.16 also includes the recapture experiment in which a counter is inserted at the location after the fault is detected, and the counter registers the number of re-detections of a particular fault without causing system failure [3]. The re-detection information has been shown to be important in determining the performance when estimating the number of faults in a system [4]. If we reparameterize the intensity in Equation 27.16 by letting

ε = 1/φ,   ω = β/φ

then we have

λi(t) = Yi(t) {(1 + εNi(t−)) / (ω + εt)}   (27.17)

This extension allows the case of ε ≤ 0 to have a meaningful interpretation [28]. Note that when ε = 0 (i.e. φ = ∞) for a removal experiment, the model reduces to the Jelinski–Moranda model in software reliability studies [29].

The full likelihood function is given by

L(θ) = [ν!/(ν − n)!] ∏_{0≤t≤τ} ∏_{i=1}^ν λi(t)^{dNi(t)} {1 − λi(t) dt}^{1−dNi(t)}   (27.18)

(see Andersen et al. [2], p. 402), which reduces to

L(θ) = [ν!/(ν − n)!] ∏_{i=1}^ν [ {∏_{0≤t≤τ} λi(t)^{dNi(t)}} exp{−∫_0^τ λi(t) dt} ]   (27.19)

where n denotes the number of distinct faults detected in the experiment.

For the removal experiment, Littlewood [6] suggested using maximum likelihood (ML) to estimate the parameters ω, ε (or φ, β) and ν. However, due to the complexity of the three highly nonlinear likelihood equations, it is difficult if not impossible to determine consistent estimates out of the possible multiple solutions of the likelihood equations. A similar problem exists for alternative estimation methods such as the M-estimator. An estimating function can be used to estimate φ, β and ν by solving the equations

∑_{i=1}^ν ∫_0^τ kj(t) [dNi(t) − Yi(t){(φ + Ni(t−))/(β + t)} dt] = 0   (27.20)

for j = 1, 2, 3, with weight functions kj(t) = (β + t)t^{j−1}; see Andersen et al. [2]. Accordingly, we obtain a linear system of equations for determining φ, β and ν. For the special case Ni(t) = 0 for all t, Andersen et al. [2] provided estimates of φ, β and ν for the data in Table 27.1. However, this ad hoc approach does not provide stable estimates in simulation. Also, for the recapture case, the parameter ν cannot be separated from the estimating Equation 27.20, and so we are unable to solve for ν. Furthermore, the optimal weights involve the unknown parameters in a nonlinear form, which makes Equation 27.20 intractable.

Here an alternative two-step estimation procedure is suggested. Firstly, with the specified form of intensity in Equation 27.16 or 27.17, we are able to compute the conditional likelihood of ω and ε (or φ and β) using the observed failure information. The likelihood equation in the first stage reduces to a lower dimension, so that it is easier to obtain unique consistent solutions. The second stage employs a Horvitz–Thompson [9] type estimator, which is the minimum variance unbiased estimator if the failure intensity is known. For the removal experiment, the conditional MLEs are asymptotically equivalent to the unconditional MLEs suggested by Littlewood [6].

Let pi = Pr(δi = 1) denote the probability of the ith fault being detected during the course of the experiment. These probabilities are the same for all faults under the model assumption, and with the application of the Laplace transform it can be shown that

pi = p(ω, ε) = 1 − (ω/(ω + ετ))^{1/ε}

The likelihood function for the given data can be rewritten as

L(θ) = L1 × L2

where

L1 = ∏_{i=1}^ν [ {∏_{0≤t≤τ} λi(t)^{dNi(t)}} exp(−∫_0^τ λi(t) dt) / pi ]^{δi}   (27.21)

L2 = [ν!/(ν − n)!] ∏_{i=1}^ν { exp[−∫_0^τ λi(t) dt]^{1−δi} pi^{δi} }   (27.22)

where δi indicates, by the value of 1 versus 0, whether or not the ith fault has ever been detected during the experiment. Since the marginal likelihood L2 depends on the unknown parameter ν, we propose to make inference about θ = (ω, ε)′ based on the conditional likelihood L1, which does not depend on ν. The corresponding score function of the conditional likelihood L1 is given by

U(θ) = ∂ log(L1)/∂θ = ∑_{i=1}^ν δi { ∫_0^τ [∂ log λi(t)/∂θ] dNi(t) − ∫_0^τ Yi(t) [∂λi(t)/∂θ] dt − ∂ log(pi)/∂θ }

where

λi(t) = Yi(t) {(1 + εNi(t−)) / (ω + εt)}

Let θ̂ = (ω̂, ε̂)′ be the solution to U(θ) = 0.

The usual arguments ensure consistency, and ν^{1/2}(θ̂ − θ) converges to N(0, Σ^{−1}(θ)), where Σ(θ) = lim_{ν→∞} ν^{−1}I(θ) and I(θ) = −∂U(θ)/∂θ′. The variance of the estimator θ̂ can thus be estimated from the negative of the first derivative of the score function, I(θ̂).

Since the probability of being detected is p(θ), it is natural to estimate the population size ν by the Horvitz–Thompson estimator [9], i.e.

ν̂ = ∑_{i=1}^ν δi/p(θ̂) = n/p(θ̂)   (27.23)

where n denotes the number of distinct faults detected. To compute the variance of ν̂, we have

ν̂(θ) − ν = ∑_{i=1}^ν [δi/p − 1]

By a Taylor series expansion and some simple probabilistic arguments,

ν^{−1/2}{ν̂(θ̂) − ν} = ν^{−1/2} ∑_{i=1}^ν [δi/p − 1] + H′(θ) ν^{1/2}(θ̂ − θ) + op(1)   (27.24)

where

H(θ) = E{Ĥ(θ)},   Ĥ(θ) = ν^{−1} ∂ν̂(θ)/∂θ = ν^{−1} {n (∂p/∂θ)/p²}

It follows from the multivariate central limit theorem and the Cramér–Wold device that ν^{−1/2}{ν̂(θ̂) − ν} converges in distribution to a zero-mean normal random variable. The variance of the first term on the right-hand side of Equation 27.24 is

ν^{−1} ∑_{i=1}^ν Var(δi)/p² = ν^{−1} ∑_{i=1}^ν p(1 − p)/p²

which can be consistently estimated by

ν̂^{−1} ∑_{i=1}^ν δi(1 − p̂)/p̂² = ν̂^{−1} n(1 − p̂)/p̂²

The variance of the second term is H′(θ)Σ^{−1}(θ)H(θ). The covariance between the two terms is zero. A consistent variance estimator for ν^{−1/2}{ν̂(θ̂) − ν} is then given by

s² = ν̂^{−1} n(1 − p̂)/p̂² + Ĥ′(θ̂) ν̂ I^{−1}(θ̂) Ĥ(θ̂)   (27.25)

where

Ĥ(θ̂) = ν̂^{−1} ∂ν̂/∂θ   and   p̂ = 1 − (ω̂/(ω̂ + ε̂τ))^{1/ε̂}
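The second stage of the two-step procedure is a direct application of Equation 27.23; a minimal Python sketch (the numerical inputs are placeholders, not estimates from the text):

```python
def horvitz_thompson_nu(n, omega, eps, tau):
    """Horvitz-Thompson estimate (27.23) of the number of faults,
    given first-stage estimates omega and eps (eps nonzero)."""
    p = 1.0 - (omega / (omega + eps * tau)) ** (1.0 / eps)  # detection prob.
    return n / p

# Placeholder values purely for illustration:
print(horvitz_thompson_nu(n=43, omega=0.2, eps=0.1, tau=1.0))
```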

Figure 27.1. Scaled TTT plot of the failure data set from [1] (scaled TTT against i/n)

27.3 A Sequential Procedure

Here a sequential procedure is suggested to estimate the number of faults in a system and to examine the efficiency of such a procedure. A procedure of this type is useful in the preliminary stages of testing. The testing time needs to be long enough to allow sufficient information to be gathered to obtain an accurate estimate, but at the same time we do not want to waste testing effort, as testing can be quite expensive.

In practice it is expensive and generally infeasible to detect all the faults in the system in the initial stages of software development. The idea of the sequential procedure proposed here is not to find all the faults in the system. The goal is to detect designs of questionable quality earlier in the process by estimating the number of faults that go undetected after a design review. Here we consider as accuracy criterion the ratio of the standard error of the estimate to the population size estimate, sometimes called the coefficient of variation. For a given accuracy level, d0, the testing experiment stops at time τ0 when the accuracy condition is met. Thus

τ0 = inf[ t : se{ν̂(t)}/ν̂(t) < d0 ]   (27.26)

The criterion is applicable to any other estimator, so long as we can compute the ratio se{ν̂(t)}/ν̂(t) at any time t ∈ (0, τ).
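Operationally the rule just monitors the running coefficient of variation; a minimal sketch, assuming a function that returns (ν̂(t), se(ν̂(t))) at any inspection time:

```python
def stopping_time(inspection_times, estimate_fn, d0=0.10):
    """First inspection time at which se/nu falls below d0 (27.26).

    estimate_fn : function t -> (nu_hat, se_hat); stands in for any
                  of the estimators discussed in this chapter.
    """
    for t in sorted(inspection_times):
        nu_hat, se_hat = estimate_fn(t)
        if nu_hat > 0 and se_hat / nu_hat < d0:
            return t
    return None   # desired accuracy never attained in the test budget
```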

27.4 Real Examples

The proposed methods are applied to the aircraft movements data from Moek [1], as given in Table 27.1. The data comprise 43 occurrence times (in seconds CPU time) of first failures caused by 43 distinct faults of an information system for registering aircraft movements; they are shown in the second column of Table 27.1. Figure 27.1 gives the total time on test (TTT) plot, which suggests that the failure times are identically exponentially distributed. The maximum likelihood estimate of the failure intensity λ is 5.4 × 10^−6. Using the M-estimates suggested by Andersen et al. [2], we obtain ω̂ = 0.196 (0.007), ε̂ = 0.089 (0.102), and ν̂ = 46.4 (10.14) based on the 43 failures reported. Here we illustrate how recapture improves the estimation. For a recapture experiment, each fault is detected, corrected, and a counter inserted to record the number of re-detections of the particular fault. In order to use the data set for a recapture experiment, we generated interfailure times for each of the detected faults from an exponential distribution, based on the maximum likelihood estimate of the failure intensity γ = 5.4 × 10^−6 [2]. The third column of Table 27.1 contains the simulated failure times after the first detection time. The end of the experiment is at 576 570 seconds of CPU time. The estimated number of total faults in the system using Equation 27.3 based on the recapture data is ν̂ = 47.4 (2.5). The two-step procedure in Section 27.2.3.2 gives the estimate ν̂ = 45.2 (1.6). Chao's coverage estimate in Equation 27.14 gives ν̂ = 45.9 (2.4). The numbers inside the brackets are standard errors. The recapture information improves the estimation, and the two-step procedure and the sample coverage estimate are almost the same.

Table 27.2 shows the results of applying the proposed sequential procedure of Section 27.3 when estimating ν for the data in Table 27.1. As τ increases, the standard error of the estimate decreases, with some fluctuation. The improvement of d0 from 0.33 to 0.32 requires doubling the testing time; the number of distinct faults detected increases from 13 to 26.


Table 27.2. Results of applying the sequential procedure to Moek's data in Table 27.1

d0     τ0        Nτ0   Mτ0   ν̂       se(ν̂)
0.05   576 570   126    43   47.40    2.54
0.06   520 016   114    42   47.56    2.82
0.07   459 440   101    41   48.05    3.33
0.08   428 902    93    41   49.13    3.91
0.09   374 032    83    39   48.95    4.38
0.10   368 811    79    39   49.95    4.89
0.11   353 576    73    38   50.40    5.48
0.12   329 685    67    36   49.58    5.80
0.13   295 030    59    33   48.27    6.21
0.14   294 708    57    33   49.54    6.85
0.15   288 750    55    33   51.05    7.62
0.16   266 322    54    33   51.90    8.07
0.17   252 204    51    32   52.16    8.68
0.18   232 743    48    31   52.65    9.45
0.19   228 283    47    31   54.00   10.21
0.20   222 288    45    30   53.47   10.47
0.21   215 589    44    30   55.14   11.44
0.22   205 224    42    29   54.77   11.87
0.23   205 224    42    29   54.77   11.87
0.24   195 476    41    29   56.92   13.16
0.25   194 399    40    29   59.45   14.74
0.26   194 399    40    29   59.45   14.74
0.27   186 946    39    29   62.50   16.71
0.28   167 108    35    26   56.89   15.81
0.29   167 108    35    26   56.89   15.81
0.30   167 108    35    26   56.89   15.81
0.31   162 480    34    26   60.75   18.49
0.32   162 480    34    26   60.75   18.49
0.33    83 996    18    13   30.80    9.96
0.34    83 996    18    13   30.80    9.96
0.35    83 996    18    13   30.80    9.96
0.36    83 996    18    13   30.80    9.96
0.37    83 996    18    13   30.80    9.96
0.38    83 996    18    13   30.80    9.96
0.39    75 630    17    13   35.25   13.44
0.40    75 630    17    13   35.25   13.44

Notes: Nτ0 = total number of detections by time τ0; Mτ0 = distinct number of faults detected by time τ0.

Also, the estimates during the period [83 996, 162 480] produce a larger estimate for ν with a larger standard error as well, and the ratio d0 remains rather stable over the period. The standard error of the estimate improves from 4.89 to 2.54 between times 368 811 and 576 570, respectively, and the d0 achieved at the end of the experiment is 0.054. If the desired accuracy level is 0.1, the testing experiment could stop at 368 811 with the estimate 49.9 and an estimated standard error of 4.9. The information at that time is not much different from the estimate obtained at 576 570. We could save nearly one third of the testing time.

27.5 Simulation Studies

A number of Monte Carlo simulations were performed to evaluate the performance of the proposed estimators. Table 27.3 gives the simulation results for the seeded fault approach using the optimal estimator ν̂* from Equation 27.9. Different values of the proportionality constant have been used. The capture proportions are assumed to be 0.5 and 0.9, respectively. Here we assume that the stopping time is determined by the removed proportion of the seeded faults. It is of interest to investigate the effects of θ, the stopping time, and the proportion of seeded faults placed in the system on the performance of the estimators. Table 27.3 lists all the different sets of parameter values used in the simulation study. One thousand repetitions were completed for each trial. The statistics computed were: average population size, ave(ν̂); standard deviation of the 1000 observations, sd(ν̂); average standard error, ave(se(ν̂)); and the coverage (cov), the proportion of replications in which the 95% confidence interval contained the true ν.

As the removed proportion of seeded faults increased, the standard error of the estimates reduced, and the coverage is satisfactory. A higher proportion of seeded faults in the system also improves the estimation performance. As the value of θ increases from 0.5 to 1.5, the estimation performance also improves, since more data on real faults are available for θ greater than 1.

Table 27.4 gives simulation results for ν̂(t) using different stopping times, and other related quantities, for the case αt = 1 for the sequential procedure suggested in Section 27.3. The number of simulations used was 500 and the number of faults in the system was 400. The stopping time τ0 increased from 0.64 for d0 = 0.1 to 4.49 for d0 = 0.01.


Table 27.3. Simulation results of the seeded faults approach using p = 0.5 and 0.9 and ν = 400

D     θ     ave(ν̂)   sd(ν̂)   ave{se(ν̂)}   cov

p = 0.5
100   0.5   402.2    56.9    58.0         0.95
100   1.0   401.9    44.2    45.4         0.95
100   1.5   401.4    35.7    36.4         0.94
400   0.5   398.4    39.0    39.2         0.95
400   1.0   398.6    28.0    28.2         0.95
400   1.5   398.9    22.4    22.0         0.94
800   0.5   400.1    34.8    35.5         0.95
800   1.0   399.9    23.6    24.5         0.95
800   1.5   399.5    17.9    18.7         0.96

p = 0.9
100   0.5   399.7    29.6    30.0         0.94
100   1.0   400.2    14.4    15.0         0.94
100   1.5   400.0     6.9     6.8         0.90
200   0.5   398.9    19.0    19.1         0.95
200   1.0   399.5     9.2     9.4         0.94
200   1.5   399.4     4.5     4.5         0.93
400   0.5   399.5    16.0    16.6         0.95
400   1.0   399.5     8.0     8.1         0.95
400   1.5   399.4     4.0     4.0         0.93

Notes: θ = αu/βu, where αu and βu are the failure intensities of the real and seeded faults respectively. p = capture proportion for seeded faults.

Meanwhile, the average number of fault detections increased from 1022 to 1795 when d0 decreased from 0.02 to 0.01. Also, the proportion of faults detected was 47% for d0 = 0.1, whereas 70% were detected when using d0 = 0.05. Figure 27.2 gives a plot of the stopping time versus the desired accuracy level d0. The mean, Eτ, increased rapidly when d0 decreased below 0.1.

Table 27.5 gives the simulation results for Chao's coverage estimator in Equation 27.14 and the two-step procedure proposed in Section 27.2.3.2. The failure intensities are generated from gamma distributions with different CV values. The population size is 500 and the overall capture proportion is about 84%. It is shown in Table 27.5 that the sample coverage estimator underestimates the population size, with a larger relative mean square error, while the two-step procedure performs satisfactorily with virtually no bias for ν̂ and se(ν̂).

Figure 27.2. Average stopping time with different accuracy level d0 (accuracy level plotted against stopping time)


Table 27.4. Simulation results for the sequential procedure with αt = 1 and ν = 400

d0     ave(τ0)   Pτ0    ave(Nτ0)   ave(ν̂)   sd(ν̂)    ave{se(ν̂)/ν̂}   rel.MSE
0.01   4.49      0.99   1795.76    400.29     4.06   0.01           0.01
0.02   2.57      0.92   1022.99    400.88     8.23   0.02           0.02
0.03   1.84      0.84    733.15    401.08    11.90   0.03           0.03
0.04   1.44      0.76    575.99    401.39    15.77   0.04           0.04
0.05   1.19      0.70    475.26    401.16    20.01   0.05           0.05
0.06   1.01      0.64    404.58    400.61    24.33   0.06           0.06
0.07   0.89      0.59    352.91    400.64    27.87   0.07           0.07
0.08   0.79      0.54    313.21    400.71    32.15   0.08           0.08
0.09   0.71      0.51    281.32    400.38    35.62   0.09           0.09
0.10   0.64      0.47    255.72    400.75    40.14   0.10           0.10
0.15   0.44      0.36    176.26    401.98    61.35   0.15           0.15
0.20   0.33      0.28    133.93    399.97    81.03   0.20           0.20
0.25   0.27      0.24    108.36    399.65    99.52   0.24           0.25
0.30   0.23      0.20     90.34    394.00   116.27   0.29           0.29
0.35   0.19      0.18     77.67    390.47   132.18   0.34           0.33
0.40   0.17      0.16     68.47    393.15   152.93   0.38           0.38
0.45   0.15      0.14     61.84    393.91   174.55   0.42           0.44
0.50   0.14      0.13     54.98    394.65   192.51   0.47           0.48
0.60   0.12      0.11     46.82    395.62   223.85   0.54           0.56
0.70   0.09      0.09     37.53    400.17   278.94   0.67           0.70
0.80   0.09      0.09     36.96    395.08   283.20   0.67           0.71
0.90   0.09      0.08     34.94    376.65   293.51   0.69           0.74
1.00   0.06      0.06     25.52    422.32   411.85   0.94           1.03

Notes: ave(ν̂) = ∑ν̂i/R; ave(τ) = ∑τi/R; ave(Nτ) = ∑Nτi/R; sd(ν̂)² = ∑{ν̂i − ave(ν̂)}²/(R − 1); ave{se(ν̂)/ν̂} = {∑ se(ν̂i)/ν̂i}/R; Pτ = 1 − exp{ave(−τ0)}; rel.MSE² = ∑{(ν̂i − ν)/ν}²/R. R = 500 is the number of simulations.

Table 27.5. Simulation results of the sample coverage estimator and the two-step procedure for heterogeneous models with gamma intensity for ν = 500

CV     Estimator    ave(ν̂)   sd(ν̂)   ave(se(ν̂))   RMSE
0.58   Two-step     500.3    18.6    17.9         17.8
       Chao & Lee   486.6    13.9    13.8         19.2
0.71   Two-step     500.3    18.8    18.6         18.6
       Chao & Lee   483.8    13.5    13.7         21.2
1.00   Two-step     500.9    21.5    20.8         20.9
       Chao & Lee   477.7    14.0    14.5         26.6
1.12   Two-step     500.5    22.9    23.6         23.6
       Chao & Lee   474.7    14.3    15.0         29.4

Note: Two-step = conditional likelihood with Horvitz–Thompson estimator; Chao & Lee = sample coverage estimator.


It should be noted here that the capture probability is fairly high in this simulation. Additional simulation results not shown here revealed that larger population sizes are required for the proposed procedure to perform well if the capture probabilities are low.

27.6 Discussion

Estimators based on the optimal estimating equation are recommended for the homogeneous models with or without recapture. The estimator is equivalent to the usual unconditional MLE derived by Becker and Heyde [19]. For the seeded fault model without recapture it can also be shown that the estimator using an optimal weight is identical to the unconditional MLE [30]. For the non-parametric case with the γi values not specified, the coverage estimators of Chao and Lee [25] and Yip and Chao [26] appear to be the best choices. However, for the parametric case the γi values depend on specified distributions, and the two-step procedure should be highly reliable provided that the model holds. Also, the recommended Horvitz–Thompson estimator is equivalent to the conditional MLE suggested by Sanathanan [31, 32] in the case of a homogeneous failure intensity.

The approach adopted in this chapter differs from the recent reviews by Chao et al. [13] and Briand and Freimut [12], who examined the problem from a rather different perspective: estimating the size of a target population based on several incomplete lists of individuals. Here, it is assumed that there is only one tester (or its equivalent), and the data are the failure times of the faults (either seeded or re-detected). It has been shown that the recapture and seeded-fault approaches can provide substantial information about ν. Heterogeneity can be a cause for concern; it is interesting to note, however, that most of the failure data sets can be modeled by the Jelinski–Moranda model rather than the Littlewood model. Nevertheless, ignoring heterogeneity has been shown to produce estimates that are biased downward [20]. The present formulation allows removals during the testing process, which the other existing methods cannot handle.

Sequential procedures have been obtained for estimating the number of faults in a testing experiment using a fixed accuracy criterion. With a greater failure intensity and a greater number of faults in the system, a shorter testing time is needed to achieve the desired accuracy level for the estimates. The accuracy is specified by d0, which prescribes an upper limit on the coefficient of variation of the estimator. Requiring greater accuracy results in an experiment of longer duration; thus, a small value of d0 results in a larger mean testing time, τ, and some accuracy levels may be unattainable. The mean testing time τ was found to increase exponentially when d0 is smaller than 0.1. The value d0 = 0.1 (i.e. a coefficient of variation of 10%) seems to be an optimal choice in terms of accuracy and the cost of the experiment. The given results are consistent with the findings of Lloyd et al. [4] that when the proportion of faults found reaches 70% of the total number of faults in the system, further testing adds little value to the experiment in terms of reducing the standard error of the estimate. Sequential estimation is an important tool in testing experiments. This chapter provides a practical way to use sequential estimation, which can be applied to other sampling designs and estimators.
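To make the stopping rule concrete, the following minimal sketch (not the estimating-function procedure developed in this chapter) simulates a seeded-fault testing experiment and stops as soon as the estimated coefficient of variation se(ν̂)/ν̂ falls below d0. The simple ratio estimator, the parametric bootstrap standard error, and all parameter values (ν = 400, S = 100, the per-occasion detection probability) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def seeded_estimate(n_real, m_seed, S):
    """Lincoln-Petersen-type seeded-fault estimate: assumes real and
    seeded faults are equally detectable, nu_hat = n * S / m."""
    return n_real * S / m_seed

def bootstrap_se(nu_hat, q_hat, S, rng, B=200):
    """Parametric bootstrap standard error of the seeded-fault estimate
    (a stand-in for an analytical standard error)."""
    N = max(int(round(nu_hat)), 1)
    est = []
    for _ in range(B):
        m_b = rng.binomial(S, q_hat)   # seeded faults detected so far
        n_b = rng.binomial(N, q_hat)   # real faults detected so far
        if m_b > 0:
            est.append(n_b * S / m_b)
    return float(np.std(est, ddof=1))

nu_true = 400     # number of real faults (unknown in practice; illustrative)
S = 100           # number of seeded faults (known)
p_occ = 0.02      # per-occasion detection probability (illustrative)
d0 = 0.10         # stop when se(nu_hat)/nu_hat <= d0

detected_real = np.zeros(nu_true, dtype=bool)
detected_seed = np.zeros(S, dtype=bool)

for t in range(1, 2001):
    # each still-undetected fault is found on this occasion with prob p_occ
    detected_real |= rng.random(nu_true) < p_occ
    detected_seed |= rng.random(S) < p_occ
    n, m = int(detected_real.sum()), int(detected_seed.sum())
    if m < 10:                        # wait for a stable denominator
        continue
    nu_hat = seeded_estimate(n, m, S)
    se = bootstrap_se(nu_hat, m / S, S, rng)
    if se / nu_hat <= d0:             # fixed-accuracy stopping rule
        print(f"stop at occasion {t}: nu_hat = {nu_hat:.1f}, se = {se:.1f}")
        break
```

With these assumed rates the rule typically stops once roughly half to two-thirds of the faults have been seen, mirroring the 70% saturation effect noted above.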

Another potential application of the suggested procedure is in ecology [33]. In capture–recapture experiments in ecological studies, individuals are caught, marked, released back into the population, and subject to recapture again. The capture–recapture process is conducted for a period of time τ. Capturing all the animals to determine the abundance is infeasible, and attempting to detect the total number of individuals is unwise and costly. The important issue here is not the estimator, but the use of the accuracy criterion to determine the stopping time. It is not our intention in this chapter to compare different estimators; some comparisons of estimators can be found in Yip et al. [34, 35].

The conditional likelihood with a Horvitz–Thompson estimator in Section 27.2.3.2 can provide stable estimates. The proposed formulation


includes both removal and recapture models. It also allows random removals during the recapturing process. The resulting conditional likelihood estimators have the consistency and asymptotic normality properties, which have also been confirmed by further simulation studies [36]. However, asymptotic normality for the proposed frailty model requires a comparably large ν. For small values of ν and low capture probabilities, the distribution of the estimator of ν is skewed. A similar situation occurs in removal studies with the use of auxiliary variables [37, 38] and in coverage estimators [30]. Ignoring frailty or heterogeneity in the population would seriously underestimate the population size, giving a misleadingly small standard error [35]. The likelihood or estimating-function approach to estimating an unknown population size with heterogeneity has been shown to be unsatisfactory [39, 40]. The proposed two-step estimating procedure is a promising alternative.
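For reference, the Horvitz–Thompson step mentioned above has a very simple form: given fitted detection probabilities p̂i for the D faults actually observed (the output of the conditional-likelihood step), the population size is estimated as ν̂ = Σi 1/p̂i. A minimal sketch, with illustrative probabilities standing in for fitted values:

```python
import numpy as np

def horvitz_thompson(p_detect):
    """Horvitz-Thompson population-size estimate from the detection
    probabilities of the faults that were actually observed:
    nu_hat = sum_i 1 / p_i."""
    p = np.asarray(p_detect, dtype=float)
    if np.any((p <= 0) | (p > 1)):
        raise ValueError("detection probabilities must lie in (0, 1]")
    return float(np.sum(1.0 / p))

# Illustrative use: 3 observed faults with hypothetical fitted probabilities
print(horvitz_thompson([0.9, 0.5, 0.25]))  # 1/0.9 + 1/0.5 + 1/0.25 = 7.11...
```

Faults that are hard to detect (small p̂i) each stand for several undetected faults, which is how the estimator corrects the raw count D upward.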

Acknowledgment

This work is supported by the Research Grants Council of Hong Kong (HKU 7267/2000M).

References

[1] Moek G. Comparison of software reliability models for simulated and real failure data. Int J Model Simul 1984;4:29–41.
[2] Andersen PK, Borgan Ø, Gill RD, Keiding N. Statistical models based on counting processes. New York: Springer-Verlag; 1993.
[3] Nayak TK. Estimating population sizes by recapture sampling. Biometrika 1988;75:113–20.
[4] Lloyd C, Yip PSF, Chan KS. A comparison of recapturing, removal and resighting designs for population size estimation. J Stat Plan Inf 1998;71:363–73.
[5] Yip PSF. Estimating the number of errors in a system using a martingale approach. IEEE Trans Reliab 1995;44:322–6.
[6] Littlewood B. Theories of software reliability: how good are they and how can they be improved? IEEE Trans Software Eng 1980;6:489–500.
[7] Lloyd C, Yip PSF. A unification of inference from capture–recapture studies through martingale estimating functions. In: Godambe VP, editor. Estimating functions. Oxford: Oxford University Press; 1991. p. 65–88.
[8] Yip PSF, Fong YT, Wilson K. Estimating population size by recapture via estimating function. Commun Stat–Stochast Models 1993;9:179–93.
[9] Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 1952;47:663–85.
[10] Van der Wiel SA, Votta LG. Assessing software designs using capture–recapture methods. IEEE Trans Software Eng 1993;SE-19(11):1045–54.
[11] Huggins RM. Fixed accuracy estimation for chain binomial models. Stochast Proc Appl 1992;41:273–80.
[12] Briand LC, Freimut BG. A comprehensive evaluation of capture–recapture models for estimating software defect content. IEEE Trans Software Eng 2000;26:518–38.
[13] Chao A, Tsay PK, Lin SH, Shau WY, Chao DY. The application of capture–recapture models to epidemiological data. Stat Med 2001;20:3123–57.
[14] Aalen OO. Statistical inference for a family of counting processes. Ph.D. thesis, University of California, Berkeley; 1975.
[15] Aalen OO. Non-parametric inference for a family of counting processes. Ann Stat 1978;6:701–26.
[16] Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Ann Stat 1982;10:1100–20.
[17] Godambe VP. The foundations of finite sample estimation in stochastic processes. Biometrika 1985;72:419–28.
[18] Schnabel ZE. The estimation of the total fish population of a lake. Am Math Monthly 1938;45:348–52.
[19] Becker NG, Heyde CC. Estimating population size from multiple recapture experiments. Stochast Proc Appl 1990;36:77–83.
[20] Yip PSF, Xi L, Fong DYT, Hayakawa Y. Sensitivity analysis and estimating number of faults in removal debugging. IEEE Trans Reliab 1999;48:300–5.
[21] Chao A, Lee SM. Estimating the number of classes via sample coverage. J Am Stat Assoc 1992;87:210–7.
[22] Huggins RM. On the statistical analysis of capture experiments. Biometrika 1989;76:133–40.
[23] Huggins RM. Some practical aspects of a conditional likelihood approach to capture–recapture models. Biometrics 1991;47:725–32.
[24] Yip PSF, Wan E, Chan KS. A unified approach to capture–recapture methods in discrete time. J Agric Biol Environ Stat 2001;6:183–94.
[25] Chao A, Lee SM. Estimating population size for continuous time capture–recapture models via sample coverage. Biometric J 1993;35:29–45.
[26] Yip PSF, Chao A. Estimating population size from capture–recapture studies via sample coverage and estimating functions. Commun Stat–Stochast Models 1996;12(1):17–36.
[27] Lin HS, Chao A, Lee SM. Consistency of an estimator of the number of species. J Chin Stat Assoc 1993;31:253–70.
[28] Nielsen GG, Gill RD, Andersen PK, Sørensen TIA. A counting process approach to maximum likelihood estimation in frailty models. Scand J Stat 1992;19:25–43.
[29] Jelinski Z, Moranda P. Software reliability research. Stat Comp Perform Eval 1972;465–84.
[30] Chao A, Yip PSF, Lee SM, Chu W. Population size estimation based on estimating functions for closed capture–recapture models. J Stat Plan Inf 2001;92:213–32.
[31] Sanathanan L. Estimating the size of a multinomial population. Ann Math Stat 1972;43:142–52.
[32] Sanathanan L. Estimating the size of a truncated sample. J Am Stat Assoc 1977;72:669–72.
[33] Pollock KH. Modeling capture, recapture and removal statistics for estimation of demographic parameters for fish and wildlife populations: past, present and future. J Am Stat Assoc 1991;86:225–38.
[34] Yip PSF, Huggins RM, Lin DY. Inference for capture–recapture experiments in continuous time with variable capture rates. Biometrika 1996;83:477–83.
[35] Yip PSF, Zhou Y, Lin DY, Fang XZ. Estimation of population size based on additive hazards models for continuous time recapture experiment. Biometrics 1999;55:904–8.
[36] Wang Y, Yip PSF, Hayakawa Y. A frailty model for detecting number of faults in a system. Department of Statistics and Actuarial Science Research Report 287, The University of Hong Kong; 2001. p. 1–15.
[37] Huggins R, Yip P. Statistical analysis of removal experiments with the use of auxiliary variables. Stat Sin 1997;7:705–12.
[38] Lin DY, Yip PSF. Parametric regression models for continuous time recapture and removal studies. J R Stat Soc Ser B 1999;61:401–13.
[39] Burnham KP, Overton WS. Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika 1978;65:625–33.
[40] Becker NG. Estimating population size from capture–recapture experiments in continuous time. Austr J Stat 1984;26:1–7.

Chapter 28

Reliability of Electric Power Systems: An Overview

Roy Billinton and Ronald N. Allan

28.1 Introduction
28.2 System Reliability Performance
28.3 System Reliability Prediction
28.3.1 System Analysis
28.3.2 Predictive Assessment at HLI
28.3.3 Predictive Assessment at HLII
28.3.4 Distribution System Reliability Assessment
28.3.5 Predictive Assessment at HLIII
28.4 System Reliability Data
28.4.1 Canadian Electricity Association Database
28.4.2 Canadian Electricity Association Equipment Reliability Information System Database for HLI Evaluation
28.4.3 Canadian Electricity Association Equipment Reliability Information System Database for HLII Evaluation
28.4.4 Canadian Electricity Association Equipment Reliability Information System Database for HLIII Evaluation
28.5 System Reliability Worth
28.6 Guide to Further Study

28.1 Introduction

The primary function of a modern electric power system is to supply its customers with electrical energy as economically as possible and with an acceptable degree of reliability. Modern society, because of its pattern of social and working habits, has come to expect the supply to be continuously available on demand. This degree of expectation requires electric power utilities to provide an uninterrupted power supply to their customers. It is not possible to design a power system with 100% reliability. Power system managers and engineers therefore strive to obtain the highest possible system reliability within their socio-economic constraints. Consideration of the two important aspects of continuity and quality of supply, together with other important elements in the planning, design, control, operation and maintenance of an electric power system network, is usually designated as reliability assessment. In the power system context, reliability can therefore be defined as concern regarding the system's ability to provide an adequate supply of electrical energy. Many utilities quantitatively assess the past performance of their systems and utilize the resultant indices in a wide range of managerial activities and decisions. All utilities attempt to recognize reliability implications in system planning, design and operation through a wide range of techniques [1–12].

Reliability evaluation methods can generally be classified into two categories: deterministic and probabilistic. Deterministic techniques are slowly being replaced by probabilistic methods [1–12]. System reliability can be divided into the two distinct categories of system security and system adequacy.


Figure 28.1. Hierarchical levels in an electric power system (generation facilities – HLI; generation plus transmission facilities – HLII; generation, transmission and distribution facilities – HLIII)

The concept of security is associated with the dynamic response of the system to the perturbations to which it is subjected. Adequacy is related to static system conditions and the existence of sufficient facilities within the system to meet the system load demand. Overall adequacy evaluation of an electric power system involves a comprehensive analysis of its three principal functional zones of generation, transmission and distribution. The basic techniques for adequacy assessment are generally categorized in terms of their application to each of these zones.

The functional zones can be combined as shown in Figure 28.1 to obtain three distinct hierarchical levels (HL), and adequacy assessment can be conducted at each of these hierarchical levels [2]. At HLI, the adequacy of the generating system to meet the load requirements is examined. Both generation and the associated transmission facilities are considered in HLII adequacy assessment. This activity is sometimes referred to as composite or bulk system adequacy evaluation. HLIII adequacy assessment involves the consideration of all three functional zones in order to evaluate customer load point indices. HLIII studies are not normally conducted on a practical system because of the enormity of the problem.

Electric power utilities are experiencing considerable change with regard to their structure, operation and regulation. This is particularly true in those countries with highly developed systems. The traditional vertically integrated utility structure consisting of generation, transmission and distribution functional zones, as shown in Figure 28.1, has in many cases been decomposed into separate and distinct utilities, each of which performs a single function in the overall electrical energy delivery system. Deregulation and competitive pricing will also make it possible for electricity consumers to select their supplier based on cost effectiveness. Requests for use of the transmission network by third parties are extending the traditional analysis of transmission capability far beyond institutional boundaries. There is also a growing utilization of local generation embedded in the distribution functional zone. The individual unit capacities may be quite small, but the combined total could represent a significant component of the total required generation capacity. Electric power system structures have changed considerably over the last decade and are still changing. The basic principles and concepts of reliability evaluation developed over many decades remain equally valid in the new operating environment. The objectives of reliability evaluation may, however, have to be reformulated and restructured to meet the new paradigms.

28.2 System Reliability Performance

Any attempt to perform quantitative reliability evaluation invariably leads to an examination of data availability and the data requirements to support such studies. Valid and useful data are expensive to collect, but it should be recognized that in the long run it would be more expensive not to collect them. It is sometimes argued as to which comes first: reliability data or reliability methodology. In reality, data collection and reliability evaluation must evolve together, and the process is therefore iterative.


The principles of data collection are described in [13, 14], and various schemes have been implemented worldwide, for instance by the Canadian Electricity Association (CEA) in Canada [15, 16] and the North American Electric Reliability Council (NERC) in the USA, together with others which are well established but less well documented or less publicly available, e.g. the UK fault-reporting scheme (NAFIRS) [17] operated nationally and those of various individual North American Reliability Councils.

In conceptual terms [13, 14], data can be collected for one or both of two reasons: assessment of past performance and/or prediction of future system performance. The past performance assessment looks back at the past behavior of the system, whereas predictive assessment forecasts how the system is going to behave in the future. In order to perform predictive studies, however, it is essential to transform past experience into the required future prediction. Consistent collection of data is therefore essential, as it forms the input to the relevant reliability models, techniques and equations. There are many reasons for collecting system and equipment performance data. Some of the more popular applications are as follows.

1. To furnish management with performance data regarding the quality of customer service on the electrical system as a whole and for each voltage level and operating area.

2. To provide data for an engineering comparison of electrical system performance among consenting companies.

3. To provide a basis for individual companies to establish service continuity criteria. Such criteria can be used to monitor system performance and to evaluate general policies, practices, standards and design.

4. To provide data for analysis to determine reliability of service in a given area (geographical, political, operating, etc.) and to determine how factors such as design differences, environment or maintenance methods, and operating practices affect performance.

5. To provide the reliability history of individual circuits for discussion with customers or prospective customers.

6. To identify substations and circuits with substandard performance and to ascertain the causes.

7. To obtain the optimum improvement in reliability per dollar expended for the purchase, maintenance and operation of specific equipment/plant.

8. To provide the equipment performance data necessary for a probabilistic approach to reliability studies. The purpose is to determine the design, operating and maintenance practices that provide optimum reliability per dollar expended and, in addition, to use this information to predict the performance of future generation, transmission and distribution system arrangements.

9. To provide performance data to regulatory bodies so that comparative performance records can be established between regulated monopolies and between planned and actual achievements within a regulated monopoly.

There is a wide range of data gathering systems in existence throughout the world. These systems vary from the very simple to the exceedingly complex. However, for the purposes of identifying basic concepts, the protocols established by the CEA are used in this chapter to illustrate a possible framework and the data that can be determined. In the CEA system, component reliability data are collected under the Equipment Reliability Information System (ERIS). These protocols are structured using the functional zones of generation, transmission and distribution shown in Figure 28.1. The ERIS is described in more detail in a later section.

In addition to ERIS, the CEA has also created the Electric Power System Reliability Assessment (EPSRA) procedure, which is designed to provide data on the past performance of the system. This procedure is still evolving and at the present time contains systems for compiling information on bulk system disturbances, bulk system delivery point performance, and customer service continuity statistics.


Customer satisfaction is a very important consideration for Canadian electric power utilities. The measured performance at HLIII and the ability to predict future performance are important factors in providing acceptable customer service and therefore customer satisfaction. The HLIII performance indices have been collected for many years by a number of Canadian utilities. The following text defines the basic customer performance indices and presents, for illustration purposes, some Canadian and UK indices for 1999.

Service performance indices can be calculated for the entire system, a specific region or voltage level, designated feeders or groups of customers. The most common indices [3] are as follows (a short computational sketch follows these definitions).

a. System Average Interruption Frequency Index (SAIFI)
This index is the average number of interruptions per customer served per year. It is determined by dividing the accumulated number of customer interruptions in a year by the number of customers served. A customer interruption is considered to be one interruption to one customer:

SAIFI = (total number of customer interruptions)/(total number of customers)

In the UK, this index is defined as security: supply interruptions per 100 connected customers [19].

b. Customer Average Interruption Frequency Index (CAIFI)
This index is the average number of interruptions per customer interrupted per year. It is determined by dividing the number of customer interruptions observed in a year by the number of customers affected. The customers affected should be counted only once, regardless of the number of interruptions that they may have experienced during the year:

CAIFI = (total number of customer interruptions)/(total number of customers affected)

c. System Average Interruption Duration Index (SAIDI)
This index is the average interruption duration for customers served during a year. It is determined by dividing the sum of all customer interruption durations during a year by the number of customers served during the year:

SAIDI = (sum of customer interruption durations)/(total number of customers)

In the UK, this index is defined as availability: minutes lost per connected customer [19].

d. Customer Average Interruption Duration Index (CAIDI)
This index is the average interruption duration for customers interrupted during a year. It is determined by dividing the sum of all customer-sustained interruption durations by the number of sustained customer interruptions over a one-year period:

CAIDI = (sum of customer interruption durations)/(total number of customer interruptions) = SAIDI/SAIFI

This index is not explicitly published in the UK but can easily be calculated as availability/security.

e. Average Service Availability Index (ASAI)
This is the ratio of the total number of customer hours that service was available during a year to the total customer hours demanded. Customer hours demanded are determined as the 12-month average number of customers served times 8760 hours. This is sometimes known as the "Index of Reliability" (IOR):

ASAI = (customer hours of available service)/(customer hours demanded)

The complementary value to this index, i.e. the Average Service Unavailability Index (ASUI), may also be used. This is the ratio of the total number of customer hours that service was unavailable during a year to the total customer hours demanded.

Tables 28.1a and 28.1b show the overall Canadian [18] and UK [19] system performance indices for the 1999 period.


Table 28.1a. Overall Canadian service continuity statistics for 1999

SAIFI   2.59 int/yr
SAIDI   4.31 hr/yr
CAIDI   1.67 hr/int
IOR     0.999 508

Table 28.1b. Overall UK service continuity statistics for 1998/99

Security (SAIFI)       0.78 int/yr
Availability (SAIDI)   81 min/yr
Calculated CAIDI       104 min/int
Calculated ASAI/IOR    0.999 846

Table 28.2a. Differentiated Canadian service continuity statistics for 1999

        Urban utilities   Urban/rural utilities   Overall
SAIFI   2.28              2.88                    2.59
SAIDI   2.68              5.06                    4.31
CAIDI   1.17              1.76                    1.67
IOR     0.999 694         0.999 422               0.999 508

The ASAI is known as the IOR in the EPSRA protocol. Very few utilities can provide the CAIFI parameter, and therefore it is not part of EPSRA. This parameter is also not published in the UK performance statistics [19].

Tables 28.1a and 28.1b show the aggregate indices for the participating utilities. There are considerable differences between the system performance indices for urban utilities and those containing a significant rural component; this is true not only for Canada, but also for the UK and elsewhere. This is shown in Table 28.2a for Canada [18] and Table 28.2b for the UK [19].

The indices shown in Tables 28.2a and 28.2b are overall aggregate values. Individual customer values will obviously vary quite widely depending on the individual utility, the system topology and the company operating philosophy. The indices shown in Tables 28.2a and 28.2b do, however, present a clear picture of the general level of customer reliability.

Table 28.2b. Typical UK service continuity statistics for 1998/99

                                Typical urban   Typical urban/rural
                                utility         utility
Security (SAIFI) (int/yr)       0.4             1.2
Availability (SAIDI) (min/yr)   50              100
Calculated CAIDI (min/int)      125             83.3
Calculated ASAI/IOR             0.999 905       0.999 810

Similar statistics can be obtained at the bulk system delivery level, i.e. HLII. These indices [20, 21] are expressed as global values or in terms of delivery points rather than customers, as in the case of the HLIII indices.

28.3 System Reliability Prediction

28.3.1 System Analysis

There is a wide range of techniques available and in use to assess the reliability of the functional zones and hierarchical levels shown in Figure 28.1. References [1–12] clearly illustrate the literature available on this subject. The following sections provide further discussion of the available techniques. The basic concepts associated with electric power system reliability evaluation are illustrated using a small hypothetical test system known as the RBTS [22–24]. This application illustrates the utilization of data such as that contained in the CEA-ERIS database to assess the adequacy of the system at all three hierarchical levels. The basic system data necessary for adequacy evaluation at HLI and HLII are provided in [22]. The capabilities of the RBTS have been extended to include failure data pertaining to station configurations [25] and distribution networks that contain the main elements found in practical systems [24]. The RBTS is sufficiently small to permit a large number of reliability studies to be conducted with


a reasonable solution time, yet sufficiently detailed to reflect the actual complexities involved in a practical reliability analysis. The RBTS described in [22] has two generation buses, five load buses (one of which is also a generation bus), nine transmission lines and 11 generating units. An additional line (Line 10) is connected between the 138-kV substations of Buses 2 and 4. A single-line diagram of the RBTS at HLII is shown in Figure 28.2. A more complete diagram, which shows the generation, transmission, station and distribution facilities, is given in Figure 28.3. In order to reduce the complexity of the overall system, distribution networks are shown at Buses 2, 4 and 6; bulk loads are used at Buses 3 and 5. The total installed capacity of the system is 240 MW with a system peak load of 185 MW. The transmission voltage level is 230 kV and the voltage limits are 1.05 and 0.97 pu. The studies reported in this chapter assume that there are no lines on a common right of way and/or a common tower. In order to obtain annual indices for the RBTS at HLII and HLIII, the load data described in [22] were arranged in descending order and divided into seven steps in increments of 10% of peak load, as given in Table 28.3 (a sketch of this construction is given below).
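The seven-step construction referenced above can be sketched as follows: the hourly loads are sorted in descending order, each hour is assigned to a step whose level is a multiple of 10% of the peak, and each step's probability is its duration divided by the number of hours. This is a hedged reconstruction of the procedure described in the text; the synthetic hourly series and the rounding-up assignment are illustrative assumptions, not the RBTS data of [22].

```python
import numpy as np

def step_load_model(hourly_load_mw, step_frac=0.10):
    """Build a multi-step load model: sort loads in descending order and
    assign each hour to a level that is a multiple of step_frac * peak
    (rounding up is one plausible assignment rule).
    Returns (level_mw, probability, duration_hr) per step."""
    load = np.sort(np.asarray(hourly_load_mw, float))[::-1]
    peak = load[0]
    step = step_frac * peak
    levels = np.ceil(load / step) * step      # round each hour up to a step
    out = []
    for lvl in np.unique(levels)[::-1]:       # steps in descending order
        dur = int(np.sum(levels == lvl))
        out.append((lvl, dur / len(load), dur))
    return out

# Illustrative use with a synthetic 8760-hour load series
rng = np.random.default_rng(1)
hours = np.arange(8760)
series = 130 + 40 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, 8760)
series = np.clip(series, 70, 185)
for level, prob, dur in step_load_model(series):
    print(f"{level:6.1f} MW  p = {prob:.4f}  {dur} hr")
```

The probabilities sum to one and the durations to 8760 hours, matching the structure of Table 28.3.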

28.3.2 Predictive Assessment at HLI

The basic reliability indices [3] used to assess the adequacy of generating systems are the loss of load expectation (LOLE) and the loss of energy expectation (LOEE). The main concern in an HLI study is to estimate the generating capacity required to satisfy the perceived system demand and to have sufficient capacity to perform corrective and preventive maintenance on the generation facilities. The generation and load models are convolved as shown in Figure 28.4 to create risk indices.

One possible casualty in the new deregulated environment is that of overall generation planning. There appears, however, to be a growing realization, particularly in view of the difficulties in California, that adequate generating capacity is an integral requirement of an acceptable and effective power supply.

Figure 28.2. Single line diagram of the RBTS at HLII (generating units of 2 × 40, 1 × 20 and 1 × 10 MW at one generator bus and 1 × 40, 4 × 20 and 2 × 5 MW at the other; bus loads of 85.0, 40.0 and three of 20.0 MW, totaling 185 MW; transmission lines L1–L9)

The capacity model can take several forms; in analytical methods it is usually represented by a capacity outage probability table [3]. The LOLE is the most widely used probabilistic index at the present time. It gives the expected number of days (or hours) in a period during which the load will exceed the available generating capacity. It does not, however, indicate the severity of the shortage. The LOEE, also referred to as the expected energy not supplied (EENS), is the expected unsupplied energy due to conditions resulting in the load demand exceeding the available generating capacity. Normalized values of LOEE in the form of system-minutes (SM) or units per million (UPM) are used by some power utilities.

The basic adequacy indices of the RBTS resulting from an HLI study are given below:

LOLE = 1.09161 hr/yr
LOEE = 9.83 MWh/yr
SM = 3.188
UPM = 9.9


Figure 28.3. Single line diagram of the RBTS at HLIII


Table 28.3. Seven-step load data for the RBTS

Load (MW)   Probability     Duration (hr)
185.0       0.013 163 92     115.0
166.5       0.111 034 78     970.0
148.0       0.165 407 52    1445.0
129.5       0.232 028 37    2027.0
111.0       0.226 304 89    1883.0
92.5        0.226 304 89    1997.0
74.0        0.036 515 59     319.0

Figure 28.4. Conceptual model for HLI evaluation (the generation model and the load model are combined to form the risk model)

The SM and UPM indices are obtained by dividing the LOEE by the system peak load and the annual energy requirement respectively. Both are normalized LOEE indices and can be used to compare systems of different sizes, or within a single system over time as the load changes. Both indices are in actual utility use.
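A minimal analytical sketch of the HLI evaluation described above follows: a capacity outage probability table is built by convolving two-state units (each characterized by its forced outage rate), and LOLE and LOEE follow from combining that table with a multi-step load model. The three-unit system, the load steps and the rounding-free exact convolution are illustrative assumptions, not the RBTS data.

```python
import numpy as np

def capacity_outage_table(units):
    """Convolve two-state units (capacity_mw, FOR) into a distribution of
    total capacity on outage: returns {outage_mw: probability}."""
    table = {0.0: 1.0}
    for cap, q in units:
        new = {}
        for out, p in table.items():
            new[out] = new.get(out, 0.0) + p * (1.0 - q)      # unit in service
            new[out + cap] = new.get(out + cap, 0.0) + p * q  # unit on outage
        table = new
    return table

def hl1_indices(units, load_steps):
    """LOLE (hr/yr) and LOEE (MWh/yr) from a capacity outage table and a
    multi-step load model [(load_mw, duration_hr), ...]."""
    installed = sum(cap for cap, _ in units)
    table = capacity_outage_table(units)
    lole = loee = 0.0
    for load, dur in load_steps:
        for out, p in table.items():
            margin = installed - out - load
            if margin < 0:                 # load exceeds available capacity
                lole += p * dur
                loee += p * dur * (-margin)
    return lole, loee

# Illustrative 3-unit system and 2-step load model (not the RBTS data)
units = [(40.0, 0.03), (30.0, 0.03), (30.0, 0.02)]
steps = [(80.0, 2000.0), (50.0, 6760.0)]
lole, loee = hl1_indices(units, steps)
peak, energy = 80.0, sum(l * d for l, d in steps)
print(f"LOLE = {lole:.2f} hr/yr, LOEE = {loee:.2f} MWh/yr")
print(f"SM = {60.0 * loee / peak:.2f} system-minutes, "
      f"UPM = {1e6 * loee / energy:.1f}")
```

The final two lines apply the normalizations just described: SM converts LOEE/peak into minutes, and UPM expresses LOEE as a fraction of the annual energy requirement in parts per million.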

A per-unitized value of LOLE gives LOLP, i.e. the probability of loss of load. This is the originally defined term, which was soon converted into LOLE because of the latter's more physical interpretation. However, LOLP was used in the UK [26] from privatization to determine the trading price of electricity, known as the pool input price (PIP) and given by:

PIP = SMP + (VOLL − SMP) × LOLP

where SMP = system marginal price and VOLL = value of lost load (set at £2.389/kWh for 1994/95). The pool was replaced in March 2001 by bilateral trading and a balancing market known as NETA (the New Electricity Trading Arrangements), when the use of LOLP in energy pricing ceased.

28.3.3 Predictive Assessment at HLII

HLII adequacy evaluation techniques are concerned with the composite problem of assessing the generation and transmission facilities with regard to their ability to supply adequate, dependable and suitable electric energy at the bulk load points [2–6]. A basic objective of HLII adequacy assessment is to provide quantitative input to the economic development of the generation and transmission facilities required to satisfy the customer load demands at acceptable levels of quality and availability. A number of digital computer programs have been developed to perform this type of analysis [6–12]. Extensive development work in the area of HLII adequacy assessment has been performed at the University of Saskatchewan and at UMIST. This has resulted in the development of several computer programs, including COMREL (analytical) and MECORE (simulation) at the University of Saskatchewan [27, 28], and RELACS (analytical) and COMPASS (sequential simulation) at UMIST [28, 29]. The analytical programs are based on contingency enumeration algorithms and the simulation programs on Monte Carlo approaches. All include selection of disturbances, classification of these into specified failure events, and calculation of load point and system reliability indices. A range of other computer programs have been published [28], some of which are in-house assessment tools and others more widely available. These include MEXICO (EdF), SICRIT and SICIDR (ENEL), CREAM and TRELSS (EPRI, USA), ESCORT (NGC, UK), ZANZIBAR (EdP), TPLan (PTI) and CONFTRA (Brazil), amongst others. These programs are not all equivalent; the modeling procedures, solution methods and calculated results vary quite considerably. Reliability evaluation of a bulk power system normally includes the independent outages of generating units, transformers and transmission lines. A major simplification that is normally used in HLII studies is that terminal stations are modeled as single busbars.


Table 28.4. Annual HLII load point adequacy indices for the RBTS including station effects

Bus   a          b       c       d       e
2     0.000 292  0.1457   0.314    4.528   2.55
3     0.001 737  0.4179  12.823   683.34  15.17
4     0.000 323  0.1738   0.746    10.611   2.82
5     0.001 387  0.2530   2.762   115.26  12.12
6     0.001 817  1.6518  22.017   327.23  24.61

This representation does not consider the actual station configurations as an integral part of the analysis. The HLII system used in this analysis is shown in Figure 28.2. Station-originated outages can have a significant impact on HLII indices. This chapter therefore considers, in addition to independent overlapping outages of generators and lines, the effects of station-originated outages on the reliability indices. The HLIII system used in this analysis is shown in Figure 28.3.

HLII adequacy indices are usually expressed and calculated on an annual basis, but they can be determined for any period, such as a season, a month, and/or a particular operating condition. The indices can also be calculated for a particular load level and expressed on an annual basis; such indices are designated as annualized values. The calculated adequacy indices at HLII can be obtained for each load point or for the entire system. Both sets of indices are necessary to obtain a complete picture of HLII adequacy, i.e. these indices complement rather than substitute for each other. Individual load point indices can be used to identify weak points in the system, while overall indices provide an appreciation of global HLII adequacy and can be used by planners and managers to compare the relative adequacies of different systems. The load point indices assist in quantifying the benefits associated with individual reinforcement investments. The annual load point and system indices for the RBTS were calculated using COMREL and are presented in this section. The load point indices designated a, b, c, d and e in Table 28.4 are defined below:

a = probability of failure
b = frequency of failure (f/yr)
c = total load shed (MW)
d = total energy curtailed (MWh)
e = total duration (hr)

It is also possible to produce overall system indices by aggregating the individual bus adequacy values. The system indices do not replace the bus values but complement them to provide a complete assessment at HLII and a global picture of the overall system behavior. The system indices of the RBTS are as follows:

Bulk Power Supply Disturbances = 2.642 (occ/yr)
Bulk Power Interruption Index = 0.209 (MW/MW-yr)
Bulk Power Energy Curtailment Index = 6.1674 (MWh/MW-yr)
Bulk Power Supply Average MW Curtailment = 14.633 (MW/disturbance)
Modified Bulk Power Energy Curtailment Index = 0.000 71
SM = 370.04 (system-minutes)

The Modified Bulk Power Energy Curtailment Index of 0.000 71 can be multiplied by 10^6 to give 71 UPM. The SM and UPM indices at HLII are extensions of those obtained at HLI and illustrate the bulk system inadequacy.

28.3.4 Distribution System Reliability Assessment

The bulk of the outages seen by an individual customer occur in the distribution network, and practical evaluation techniques [3] have therefore been developed for this functional zone.


Table 28.5. System indices for the distribution systems in the RBTS at Buses 2, 4 and 6

Bus   SAIFI      SAIDI      CAIDI       ASAI
2     0.304 21   3.670 65   12.066 07   0.999 581
4     0.377 31   3.545 50    9.396 70   0.999 595
6     1.061 90   3.724 34    3.507 25   0.999 575

The basic load point indices at a customer load point are λ, r and U, which denote the failure rate (frequency), average outage time and average annual outage time respectively. The individual load point indices can be combined with the number of customers at each load point to produce feeder or area SAIFI, SAIDI, CAIDI and ASAI predictions (a sketch of this aggregation is given below). The distribution functional zones shown in Figure 28.3 have been analyzed and the results are shown in Table 28.5. Additional indices [3] can also be calculated.
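The aggregation referred to above can be sketched in a few lines: the predicted load point indices λ and U are weighted by the number of customers at each load point. The three load points below are illustrative assumptions, not RBTS data.

```python
# Aggregate predicted load point indices (lambda_, U) and customer counts
# into feeder-level SAIFI/SAIDI/CAIDI/ASAI predictions.
load_points = [
    # (lambda_ [failures/yr], U [hr/yr], customers) -- illustrative values
    (0.30, 3.6, 200),
    (0.25, 2.1, 450),
    (0.40, 4.8, 120),
]

n_total = sum(n for _, _, n in load_points)
saifi = sum(lam * n for lam, _, n in load_points) / n_total
saidi = sum(u * n for _, u, n in load_points) / n_total
caidi = saidi / saifi
asai = 1.0 - saidi / 8760.0

print(f"SAIFI={saifi:.3f}  SAIDI={saidi:.3f}  CAIDI={caidi:.3f}  ASAI={asai:.6f}")
```

The per-load-point average outage time r is recovered as U/λ; the feeder indices are simply customer-weighted means of the load point predictions.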

28.3.5 Predictive Assessment at HLIII

HLIII adequacy evaluation includes all three segments of an electric power system in an overall assessment of actual consumer load point adequacy. The primary adequacy indices at HLIII are the expected failure rate, λ, the average duration of failure, and the annual unavailability, U, of the customer load points [3]. The individual customer indices can then be aggregated with the average connected load of the customer and the number of customers at each load point to obtain the HLIII system adequacy indices: the SAIFI, the SAIDI, the CAIDI and the ASAI. Analysis of actual customer failure statistics indicates that the distribution functional zone makes the greatest individual contribution to the overall customer supply unavailability. Statistics collected by most, if not all, utilities indicate that, in general, the bulk power system contributes only a relatively small component to the overall HLIII customer indices. The conventional customer load point indices are performance parameters obtained from historical event reporting. The following illustrates how similar indices can be predicted. The HLIII adequacy assessment presented includes the independent outages of generating units and transmission lines, outages due to station-originated failures, sub-transmission element failures and distribution element failures. The method used is summarized in the three steps given below.

1. The COMREL computer program was used to obtain the probability, expected frequency and duration of each contingency at HLII that leads to load curtailment for each system bus. The contingencies considered include outages of up to four generating units, three transmission lines, two lines with one generator, and two generators with one line. All outages due to station-related failures and the isolation of load buses due to station failures are also considered. If a contingency results in load curtailment, a basic question is how the electric utility would distribute this interrupted load among its customers. Different power utilities will obviously take different actions based on their experience, judgment and other criteria. The method used in this chapter assumes that load is curtailed proportionately across all the customers: for each contingency j that leads to a load curtailment of Lkj at Bus k, the ratio of Lkj to the bus peak load is determined and used to weight that event's contribution (see the sketch after this list). The failure probability and frequency due to an isolation case are not modified, as isolation affects all the customers.

2. At the sub-transmission system level, the impact of all outages is obtained in terms of the average failure rate and average annual outage time at each distribution system supply point.


Table 28.6. Modified annual bus adequacy indices for the RBTS using Step 1

Bus   Failure       Failure            Total
      probability   frequency (f/yr)   duration (hr)
2     0.000 026     0.016 50            0.227
3     0.001 254     0.200 26           10.985
4     0.000 030     0.019 85            0.269
5     0.001 330     0.204 20           11.650
6     0.002 509     1.441 02           21.981

3. At the radial distribution level, the effects due to outages of system components, such as primary mains, laterals and low-voltage transformers, are considered.
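A minimal sketch of the Step 1 weighting referred to above: each curtailment event's probability and frequency are scaled by the fraction of the bus peak load curtailed, while isolation events count in full. The event list is illustrative, and this is one plausible reading of the allocation rule rather than the COMREL implementation itself.

```python
# Events at one bus: (probability, frequency_per_yr, curtailed_mw, is_isolation)
events = [
    (2.0e-4, 0.10, 12.0, False),   # partial curtailment event (illustrative)
    (5.0e-5, 0.02, 40.0, True),    # bus isolation: affects all customers
]
bus_peak_mw = 40.0

prob = freq = 0.0
for p, f, l_kj, isolation in events:
    w = 1.0 if isolation else l_kj / bus_peak_mw   # Step 1 weighting
    prob += w * p
    freq += w * f

print(f"modified probability = {prob:.2e}, modified frequency = {freq:.4f} /yr")
```

This scaling is what makes the modified bus indices of Table 28.6 smaller than the unmodified HLII indices of Table 28.4.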

The modified annual indices of the RBTS were obtained using the load model given in Table 28.3, and the results are presented in Table 28.6. The indices in Table 28.6 are smaller than those shown in Table 28.4 because a significant number of the outage events included in Table 28.4 do not affect all the customers at a load point. The index modification approach described in Step 1 above was used to allocate these event contributions.

Table 28.7 presents a sample of the HLIII indices at Buses 2, 4 and 6. It can be seen from this table that the contribution of the distribution system to the HLIII indices is significant for Buses 2 and 4. Table 28.7 also shows that, for Bus 6, the contribution from HLII to the HLIII indices is very significant. The obvious reason is that customers at Bus 6 are supplied through a single-circuit transmission line and through a single 230/33-kV transformer. Reinforcing the supply to Bus 6 would therefore be an important consideration. Table 28.8 presents the HLIII system indices at Buses 2, 4 and 6. The percentage contributions from the distribution functional zones to the total HLIII indices are given in Table 28.9.

The contribution of the distribution system to the HLIII indices is significant for Buses 2 and 4, while for Bus 6 the contribution from HLII to the HLIII indices is considerably larger. Table 28.8 shows the annual customer indices at the buses. These values can be aggregated to provide overall system indices, assuming that the customers at load points 2, 4 and 6 constitute all the system customers:

SAIFI = 1.226 (failures/system customer)
SAIDI = 12.588 (hours/system customer)
CAIDI = 10.296 (hours/customer interrupted)
ASAI = 0.998 563

It can be seen from the above that the overall system indices, while important, mask the actual indices at any particular load bus. Reliability evaluation at each hierarchical level provides valuable input to the decision-making process associated with system planning, design, operation and management. Both past and predictive assessments are required to obtain a complete picture of system reliability.

28.4 System Reliability Data

It should be recognized that the data requirements of predictive methodologies should reflect the needs of these methods. This means that the data must be sufficiently comprehensive to ensure that the methods can be applied, but restrictive enough to ensure that unnecessary data are not collected and irrelevant statistics are not evaluated. The data should therefore reflect and respond to the factors that affect system reliability and enable it to be modeled and analyzed. This means that the data should relate to the two main processes involved in component behavior, namely the failure and restoration processes. It cannot be stressed too strongly that, in deciding which data are to be collected, a utility must make decisions on the basis of the factors that have an impact on its own planning and design considerations.

The quality of the data and the resulting indices depends on two important factors: confidence and relevance. The quality of the data, and thus the confidence that can be placed in it, is clearly dependent upon the accuracy and completeness of the information compiled by operating and maintenance personnel. It is therefore essential that they are made fully aware of the future use of the data and the importance it will play in later developments of the system.


Table 28.7. HLIII load point reliability indices for the RBTS Buses 2, 4 and 6

Load point   Failure rate   Outage time   Annual unavailability   Contribution of
             (f/yr)         (hr)          (hr/yr)                 distribution (%)

Bus 2
1            0.3118         12.3832        3.8605                 94.11
9            0.2123          3.7173        0.7890                 71.20
11           0.3248         12.0867        3.9255                 94.21
15           0.3150         12.2657        3.8638                 94.12
20           0.3280         11.9778        3.9288                 94.22

Bus 4
1            0.3708         10.1410        3.7605                 92.91
10           0.2713          2.6832        0.7280                 63.39
19           0.4157          9.1589        3.8076                 93.00
28           0.2935          2.3938        0.7025                 61.75

Bus 6
1            1.8248         12.5237       22.8530                  3.81
16           1.7350         13.2443       22.9792                  4.34
31           4.0580          7.8714       31.9422                 31.18
40           3.9930          8.6799       34.6592                 36.58

Table 28.8. HLIII system indices at Buses 2, 4 and 6 of the RBTS

Bus   SAIFI      SAIDI       CAIDI       ASAI
2     0.320 72    3.897 86   12.153 60   0.999 555
4     0.397 14    3.812 02    9.598 72   0.999 565
6     2.502 92   25.705 58   10.270 24   0.997 066

Table 28.9. Percentage contribution to HLIII inadequacy of the RBTS from the distribution system

Bus   SAIFI      Distribution (%)   SAIDI      Distribution (%)
2     0.320 72   94.85               3.897 86  94.17
4     0.397 14   95.01               3.812 02  93.01
6     2.502 92   92.43              25.7055    14.49

The quality of the statistical indices is also dependent upon how the data is processed, how much pooling is done, and the age of the data currently stored.

Many utilities throughout the world have established comprehensive procedures for assessing the performance of their systems. In Canada these procedures have been formulated and data is collected through the CEA. As this system is very much in the public domain, it is useful to use it as a means of illustrating general principles.

28.4.1 Canadian Electricity Association Database

All the major electric power utilities in Canadaparticipate in a single data collection and analysis


system of the CEA called ERIS. The CEA started collecting data on generation outages in 1977 and on transmission outages in 1978, and since then has published a number of reports in both areas [14–16, 18, 20]. The third stage, dealing with distribution equipment, is being implemented. This procedure is illustrated in Figure 28.5.

28.4.2 Canadian Electricity Association Equipment Reliability Information System Database for HLI Evaluation

The simplest representation of a generating unit is the two-state model shown in Figure 28.6, where λ and µ are the unit failure and repair rates respectively. Additional derated states can be added to the model to provide a more accurate representation. This is shown in Figure 28.7, where a, b, c, d, e and f are the various state transition rates. The transition rates for these models can be calculated from the generation equipment status data collected by utilities. These data are available in CEA-ERIS, which provides a detailed reporting procedure for collecting generating unit operating and outage data. This database contains information regarding the various states in which a generating unit can reside and outage codes to explain why a unit is in a particular outage or derated state. The data provided by CEA-ERIS are quite adequate for performing reliability analysis at HLI. The basic parameters required in performing adequacy analysis are presented in Table 28.10 for typical generating unit classes using data for 1994/98 [15].
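For the two-state model, the steady-state unavailability (the forced outage rate, FOR) follows directly from λ and µ. A minimal sketch with illustrative rates (the numeric values are assumptions, not CEA-ERIS data):

```python
def two_state_for(lam, mu):
    """Steady-state unavailability (FOR) of a two-state unit with failure
    rate lam and repair rate mu (same time units):
    FOR = lam / (lam + mu); availability = mu / (lam + mu)."""
    return lam / (lam + mu)

# Illustrative rates: 3 failures/yr, mean repair time 100 hr
lam = 3.0                    # failures per year
mu = 8760.0 / 100.0          # repairs per year (1 / mean repair time)
print(f"FOR = {two_state_for(lam, mu):.4f}")   # approximately 0.0331
```

The FOR column of Table 28.10 is exactly this quantity expressed as a percentage; the derated-state model of Figure 28.7 generalizes the same balance equations to three states.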

28.4.3 Canadian Electricity Association Equipment Reliability Information System Database for HLII Evaluation

Reliability evaluation at HLII requires outage statistics pertaining to both generation and transmission equipment. The generation outage statistics are available from [15], as discussed earlier. The basic equipment outage statistics required are the failure rates and repair times of transmission lines and associated equipment such as circuit breakers and transformers.

Figure 28.5. Basic components of ERIS (generation equipment status reporting system; transmission equipment outage reporting system; distribution equipment outage reporting system)

Figure 28.6. Basic two-state unit model (Up and Down states)

Figure 28.7. Basic three-state unit model (Up, Derated and Down states with transition rates a, b, c, d, e and f)

The transmission stage of CEA-ERIS was implemented in 1978, when Canadian utilities began supplying data on transmission equipment. The thirteenth CEA report on the performance of transmission equipment is based on data for the period from January 1, 1994 to December 31, 1998 [16]. This report covers transmission equipment in Canada with an operating voltage of 110 kV and above.


Table 28.10. Outage data for different generating unit types based on 1994/98 data

Unit type    FOR (%)   Failure rate (f/yr)
CTU           9.02*     8.72
Fossil        7.02     11.04
Hydraulic     2.00      2.81
Nuclear      11.46      3.33

CTU = combustion turbine unit.
* Indicates utilization forced outage probability (UFOP) used rather than forced outage rate. The UFOP of a generating unit is defined as the probability that the unit will not be available when required. The CEA-ERIS database contains unavailability data on units of different sizes and ages, in addition to failure and repair rate information for frequency and duration reliability evaluation.

The transmission line performance statistics are given on a per-100-km-yr basis for line-related outages, and a summary of these statistics is provided in Table 28.11. The report also provides performance statistics for transformers and circuit breakers.

Tables 28.12 and 28.13 provide summaries of transformer and circuit breaker statistics by voltage classification for forced outages involving integral subcomponents and terminal equipment. The data presented in all three tables can be used to calculate HLII adequacy indices. It should be noted, however, that in order to include the effect of station-originated outages, additional data in the form of active failure rates, maintenance rates and switching times are required. Some of these parameters can also be obtained from the CEA database.

The reports produced by the CEA on the forced outage performance of transmission equipment [16] also provide summaries of a number of other statistics, including the following:

1. transmission line statistics for line-related transient forced outages;
2. cable statistics for cable-related and terminal-related forced outages;
3. synchronous compensator statistics;
4. static compensator statistics by voltage classification;
5. shunt reactor statistics by voltage classification;
6. shunt capacitor statistics by voltage classification;
7. series capacitor bank statistics by voltage classification.

28.4.4 Canadian Electricity Association Equipment Reliability Information System Database for HLIII Evaluation

HLIII adequacy assessment requires outage statistics for equipment in all parts of the system, i.e. generation, transmission and distribution. The outage statistics for the generation and transmission equipment are available in [15, 16], as discussed earlier. The distribution reliability data system recently commissioned by the CEA completes the basic equipment reliability information system, which contains generic performance and reliability information on all three major constituent areas of generation, transmission and distribution equipment. The required distribution component failure statistics are presently being collected. These data can be used for quantitative assessment of power system reliability and to assist in the planning, design, operation and maintenance of distribution systems. They will also support the quantitative appraisal of alternatives and the optimization of cost/reliability worth.

In the reporting system adopted for CEA-ERIS, distribution equipment has been divided into a number of major components, and the system can therefore be referred to as component-oriented. Every event involves one major component (unless it is a common-mode failure), and the outage of a major component due to that of another component is not recorded. In this way it is possible to reconstruct the outage performance of any given configuration from the records of the various components.

The following devices have been selected as major components:

1. distribution line;
2. distribution cable;
3. distribution transformer;
4. power transformer;
5. regulator;
6. capacitor.


Table 28.11. Summary of transmission line statistics

           Line-related sustained outages                Terminal-related sustained outages
Voltage    Frequency   Mean duration   Unavailability    Frequency   Mean duration   Unavailability
(kV)       (per yr)    (hr)            (%)               (per yr)    (hr)            (%)
110–149    2.4373       13.4           0.373             0.1074       5.8            0.007
150–199    0.8007        0.3           0.054             0.0263      21.8            0.007
200–299    1.0510        8.9           0.107             0.1572       9.7            0.017
300–399    0.2363       49.7           0.134             0.0593      15.8            0.011
500–599    1.7224        3.1           0.061             0.2032      16.0            0.037
600–799    0.2808      119.2           0.382             0.0143      13.9            0.023

Table 28.12. Summary of transformer bank statistics by voltage classification for forced outages

           Involving integral subcomponents   Involving terminal equipment
Voltage    Frequency    Mean duration         Frequency    Mean duration
(kV)       (per yr)     (hr)                  (per yr)     (hr)
110–149    0.0504       163.8                 0.0914        63.7
150–199    0.1673       113.9                 0.0662       420.3
200–299    0.0433       169.4                 0.0706        52.2
300–399    0.0505       119.4                 0.0385        58.0
500–599    0.0389       277.3                 0.0429        34.6
600–799    0.0320       176.9                 0.0233       354.5

Table 28.13. Summary of circuit breaker statistics by voltage classification for forced outages

           Involving integral subcomponents   Involving terminal equipment
Voltage    Frequency    Mean duration         Frequency    Mean duration
(kV)       (per yr)     (hr)                  (per yr)     (hr)
110–149    0.0373       136.4                 0.0468        55.4
150–199    0.0313        74.9                 0.0251       145.5
200–299    0.0476       164.2                 0.0766        37.7
300–399    0.0651       134.6                 0.0322        59.5
500–599    0.0795       125.1                 0.0948        37.7
600–799    0.1036       198.8                 0.0566       118.1


There are other distribution equipment components which could also have been selected, but it was decided that they were either too difficult to define or that the added complexity was not justified. It was therefore decided to use a limited number of clearly defined major components.

28.5 System Reliability Worth

Adequacy studies of a system are only part of a required overall assessment. The economics of alternative facilities play a major role in the decision-making process. The simplest approach


which can be used to relate economics with reliability is to consider the investment cost only. In this approach, the increase in reliability due to the various alternative reinforcement or expansion schemes is evaluated, together with the investment cost associated with each scheme. Dividing this cost by the increase in reliability gives the incremental cost of reliability, i.e. how much it will cost for a per-unit increase in reliability. This approach is useful for comparing alternatives when it is known for certain that the reliability of a section of the power system must be increased, the lowest incremental cost of reliability being the most cost-effective. This is a significant step forward compared with assessing alternatives and making major capital investment decisions using deterministic techniques.

The weakness of the approach is that it is not related to either the likely return on investment or the real benefit accruing to the consumer, utility and society. In order to make a consistent appraisal of economics and reliability, albeit only the adequacy, it is necessary to compare the adequacy cost (the investment cost needed to achieve a certain level of adequacy) with the adequacy worth (the benefit derived by the utility, consumer and society). A step in this direction is achieved by setting a level of incremental cost which is believed to be acceptable to consumers. Schemes costing less than this level would be considered viable, but schemes costing more than this level would be rejected. A complete solution, however, requires a detailed knowledge of adequacy worth.

This type of economic appraisal is a fundamental and important area of engineering application, and it is possible to perform this kind of evaluation at the three hierarchical levels discussed. A goal for the future should be to extend this adequacy comparison within the same hierarchical structure to include security and therefore to arrive at reliability–cost and reliability–worth evaluation. The basic concept is relatively simple and can be illustrated using the cost/reliability curves of Figure 28.8.

Figure 28.8. Consumer, utility and total cost as a function of system reliability (axes: annual cost versus system reliability; curves for utility, customer, and total cost)

The curves in Figure 28.8 show that the utility cost will generally increase as consumers are provided with higher reliability. On the other hand, the consumer costs associated with supply interruptions will decrease as the reliability increases. The total cost to society will therefore be the sum of these two individual costs. This total cost exhibits a minimum, and so an "optimum" or target level of reliability is achieved. In this approach, the reliability level is not a fixed value and results from the cost minimization process.
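The optimum in Figure 28.8 can be located numerically once the two cost curves are known. In the Python sketch below, both cost functions are hypothetical stand-ins chosen only to reproduce the qualitative shapes of the figure (utility cost rising steeply with reliability, customer interruption cost falling), and the target reliability is simply the level at which their sum is smallest.

# Sketch: locate the "optimum" reliability level of Figure 28.8 as the minimum
# of total annual cost = utility cost + customer interruption cost.
# Both cost functions below are hypothetical, chosen only for their shape.

def utility_cost(r: float) -> float:
    return 1.0 / (1.0 - r)          # grows without bound as reliability r -> 1

def customer_cost(r: float) -> float:
    return 500.0 * (1.0 - r)        # interruption cost shrinks as r rises

levels = [0.90 + 0.001 * k for k in range(100)]   # candidate reliability levels
total = {r: utility_cost(r) + customer_cost(r) for r in levels}
r_opt = min(total, key=total.get)
print(f"optimum reliability ~ {r_opt:.3f}, total cost ~ {total[r_opt]:.2f}")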

This concept is quite valid. Two difficulties arise in its assessment. First, the calculated indices are assessments at the various hierarchical levels. Second, there are great problems in assessing consumer perceptions of outage costs.

There have been many studies concerning interruption and outage costs [7–12]. These studies show that, although trends are similar in virtually all cases, the costs vary over a wide range and depend on the country of origin and the type of consumer. It is apparent therefore that there is still considerable research needed on the subject of the cost of an interruption. This research should consider the direct and indirect costs associated with the loss of supply, both on a local and a widespread basis. A recent CIGRE report [30] provides a wide range of international data on interruption costs and their utilization, together with a comprehensive bibliography of relevant material.

28.6 Guide to Further Study

As previously noted, there are a large number of excellent publications [7–12] on the subject of power system reliability evaluation. Reference [5] is a compendium of important publications selected from [7–11] and includes some seminal references. The basic techniques required in HLI, HLII and distribution system adequacy evaluation are described in detail in [3]. This reference contains many numerical examples and is used as a basic textbook on power system reliability evaluation. Reference [3] illustrates HLI adequacy evaluation in terms of the basic probability methods which provide LOLE and LOEE indices and in terms of frequency (F) and duration (D) indices. The F&D approach provides additional information which can prove useful in capacity planning and evaluation. The recursive algorithms required when developing these indices for practical systems are illustrated by application to numerical examples. The text then extends the single system analysis by application to interconnected power systems.

Reliability evaluation at HLI includes recognition of operating or spinning reserve requirements. Reference [3] provides a chapter on this subject which includes both operating and response risk evaluation. The subject of HLII evaluation using contingency enumeration is presented by application to selected system configurations. The required calculations are shown in detail. Reference [3] also presents detailed examples of distribution system reliability evaluation in both radial and meshed networks. These applications include many practical system effects such as lateral distributor protection, disconnects and load transfer capability in radial networks. The meshed analysis includes the recognition of permanent and transient outages, maintenance considerations, adverse weather and common-mode outages. The meshed network analysis is also extended to include partial in addition to total loss of continuity. The concepts utilized in transmission and distribution system evaluation are then extended and applied to substations and switching stations. Reference [3] concludes by illustrating how basic reliability techniques can be used to assess general plants and equipment configurations. Reference [3] was extended in the second edition to include chapters dealing with reliability evaluation of power systems using Monte Carlo simulation and incorporating outage costs and reliability worth assessment. Reference [6] is focused entirely on the utilization of Monte Carlo simulation in power system reliability assessment. These two texts jointly provide the basic understanding required to read and appreciate the wide range of techniques described in [7–12]. The concepts associated with the inclusion of embedded generation in distribution system reliability evaluation are described in [31], and an extended review of the application of reliability concepts in the new competitive environment is detailed in [32]. A careful examination of [3] will provide the basic understanding required to read many of the papers detailed in References [7–12].

References

[1] Billinton R, Allan RN. Reliability evaluation of engineering systems, 2nd ed. New York: Plenum Press; 1992.
[2] Billinton R, Allan RN. Power system reliability in perspective. IEE Electron Power 1984;30(3):231–6.
[3] Billinton R, Allan RN. Reliability evaluation of power systems, 2nd ed. New York: Plenum Press; 1996.
[4] Billinton R, Allan RN. Reliability assessment of large electric power systems. Boston: Kluwer Academic; 1988.
[5] Billinton R, Allan RN, Salvaderi L. Applied reliability assessment in electric power systems. New York: IEEE Press; 1991.
[6] Billinton R, Li W. Reliability assessment of electric power systems using Monte Carlo methods. New York: Plenum Press; 1994.
[7] Billinton R. Bibliography on the application of probability methods in power system reliability evaluation. IEEE Trans Power App Syst 1972;PAS-91(2):649–60.
[8] IEEE Sub-Committee on Application of Probability Methods. Bibliography on the application of probability methods in power system reliability evaluation 1971–1977. IEEE Trans Power App Syst 1978;PAS-97(6):2235–42.
[9] Allan RN, Billinton R, Lee SH. Bibliography on the application of probability methods in power system reliability evaluation 1977–1982. IEEE Trans Power App Syst 1984;PAS-103(2):275–82.
[10] Allan RN, Billinton R, Shahidehpour SM, Singh C. Bibliography on the application of probability methods in power system reliability evaluation 1982–1987. IEEE Trans Power Syst 1988;3(4):1555–64.
[11] Allan RN, Billinton R, Breipohl AM, Grigg C. Bibliography on the application of probability methods in power system reliability evaluation 1987–1991. IEEE Trans Power Syst 1994;9(1):41–9.
[12] Allan RN, Billinton R, Breipohl AM, Grigg C. Bibliography on the application of probability methods in power system reliability evaluation 1992–1996. IEEE Trans Power Syst 1999;PWRS-14:51–7.
[13] CIGRE WG 38.03. Power system reliability analysis, vol. 1. Application guide. Paris: CIGRE Publication; 1987. Ch. 4, 5.
[14] Allan RN, Billinton R. Concepts of data for assessing the reliability of transmission and distribution equipment. 2nd IEE Conf on Reliability of Transmission and Distribution Equipment. Conference Publication 406; March 1995. p. 1–6.
[15] Canadian Electrical Association Equipment Reliability Information System Report. 1998 Annual Report—Generation equipment status. Montreal: CEA; September 1999.
[16] Canadian Electrical Association Equipment Reliability Information System Report. Forced outage performance of transmission equipment for the period January 1994 to December 1998. Montreal: CEA; February 2000.
[17] The Electricity Association: National Fault and Interruption Reporting Scheme (NAFIRS). London: Electricity Association.
[18] Canadian Electrical Association Equipment Reliability Information System Report. 1999 Annual Service Continuity Report on distribution system performance in Canadian Electrical Utilities. May 2000.
[19] Office of Gas and Electricity Markets (OFGEM). Report on distribution and transmission system performance 1998/99. London: OFGEM.
[20] Canadian Electrical Association Equipment Reliability Information System Report. Bulk electricity system delivery point interruptions 1998–1990 Report. October 1992.
[21] National Grid Company (UK). Performance of the Transmission System. Annual Reports to the Director General of Electricity Supply, OFFER.
[22] Billinton R, Kumar S, Chowdhury N, Chu K, Debnath K, Goel L, et al. A reliability test system for educational purposes—basic data. IEEE Trans PWRS 1989;4(3):1238–44.
[23] Billinton R, Kumar S, Chowdhury N, Chu K, Goel L, Khan E, et al. A reliability test system for educational purposes—basic results. IEEE Trans PWRS 1990;5(1):319–25.
[24] Allan RN, Billinton R, Sjarief I, Goel L. A reliability test system for educational purposes—basic distribution system data and results. IEEE Trans PWRS 1991;6(2):813–20.
[25] Billinton R, Vohra PK, Kumar S. Effect of station originated outages in composite system adequacy evaluation of the IEEE reliability test system. IEEE Trans PAS 1985;104:2649–56.
[26] NGC Settlements. An introduction to the initial pool rules. Nottingham; 1991.
[27] Medicherla TKP, Billinton R. Overall approach to the reliability evaluation of composite generation and transmission systems. IEE Proc C 1980;127(2):72–81.
[28] CIGRE WG 38.03. Power system reliability analysis, vol. 2. Composite power system reliability evaluation. Paris: CIGRE Publication; 1992.
[29] Roman Ubeda J, Allan RN. Sequential simulation applied to composite system reliability evaluation. Proc IEE C 1992;139:81–6.
[30] CIGRE TF 38.06.01. Methods to consider customer interruption costs in power system analysis. CIGRE; 2001.
[31] Jenkins N, Allan RN, Crossley P, Kirschen DS, Strbac G. Embedded generation. Stevenage: IEE Publishing; 2000.
[32] Allan RN, Billinton R. Probabilistic assessment of power systems. Proc IEEE 2000;88(2):140–62.


Human and Medical Device Reliability

Chapter 29
B. S. Dhillon

29.1 Introduction
29.2 Human and Medical Device Reliability Terms and Definitions
29.3 Human Stress—Performance Effectiveness, Human Error Types, and Causes of Human Error
29.4 Human Reliability Analysis Methods
29.4.1 Probability Tree Method
29.4.2 Fault Tree Method
29.4.3 Markov Method
29.5 Human Unreliability Data Sources
29.6 Medical Device Reliability Related Facts and Figures
29.7 Medical Device Recalls and Equipment Classification
29.8 Human Error in Medical Devices
29.9 Tools for Medical Device Reliability Assurance
29.9.1 General Method
29.9.2 Failure Modes and Effect Analysis
29.9.3 Fault Tree Method
29.9.4 Markov Method
29.10 Data Sources for Performing Medical Device Reliability Studies
29.11 Guidelines for Reliability Engineers with Respect to Medical Devices

29.1 Introduction

The beginning of the human reliability field may be taken as the middle of the 1950s, when the problem of human factors was seriously considered and the first human-engineering standards for the US Air Force were developed [1, 2]. In 1958, the need for human reliability was clearly identified [3] and some findings relating to human reliability were documented in two Convair-Astronautics reports [4, 5].

In 1962, a database (i.e. Data Store) containing time and human performance reliability estimates for human-engineering design features was established [6]. In 1986, the first book on human reliability was published [7]. All in all, over the years many other people have contributed to the field of human reliability, and a comprehensive list of publications on the subject is given in [8].

The latter part of the 1960s may be regarded as the beginning of the medical device reliability field. During this period many publications on the subject appeared [9–13]. In 1980, an article listed most of the publications on the subject, and in 1983, a book on reliability devoted an entire chapter to medical device/equipment reliability [14, 15]. A comprehensive list of publications on medical device reliability is given in [16].

This chapter presents various aspects of human and medical device reliability.

29.2 Human and Medical Device Reliability Terms and Definitions

Some of the common terms and definitions concerning human and medical device reliability are as follows [7, 15, 17–23].


Figure 29.1. A hypothetical curve showing the human performance effectiveness versus stress relationship

• Human reliability. This is the probability of performing a given task successfully by the human at any required stage in system operations within a specified minimum time (if such a time requirement is stated).
• Human error. This is the failure to perform a task (or the performance of a forbidden action) that could lead to the disruption of scheduled operations or damage to property and equipment.
• Human factors. This is the body of scientific facts concerning the characteristics of humans.
• Human performance. This is a measure of human functions and actions subject to specified conditions.
• Medical device. This is any machine, apparatus, instrument, implant, in vitro reagent, implement, contrivance, or other related or similar article, including any component, part, or accessory, that is intended for application in diagnosing diseases or other conditions or in the cure, treatment, mitigation, or prevention of disease, or intended to affect the structure or any function of the body.
• Reliability. This is the probability that an item will perform its specified function satisfactorily for the stated time period when used according to the designed conditions.
• Failure. This is the inability of an item/equipment to function within the specified guidelines.
• Safety. This is the conservation of human life and its effectiveness, and the prevention of damage to item/equipment according to operational requirements.
• Quality. There are many definitions used in defining quality, and one of these is conformance to requirements.
• Mean time to failure. This is the sum of the operating times of given items over the total number of failures (i.e. when times to failure are exponentially distributed).

29.3 Human Stress—Performance Effectiveness, Human Error Types, and Causes of Human Error

Human performance varies under different conditions, and some of the factors that affect a person's performance are time at work, reaction to stress, social interaction, fatigue, social pressure, morale, supervisor's expectations, idle time, repetitive work, and group interaction and identification [24].

Stress is probably the most important factor that affects human performance, and in the past many researchers have studied the relationship between human performance effectiveness and stress. Figure 29.1 shows the resulting conclusion of their efforts [18, 25]. This figure basically shows that the human performance for a specific task follows a curvilinear relation to the imposed stress. At a very low stress level, the task is dull and unchallenging, thus most humans' performance will not be at the optimal level. However, at a moderate stress level, humans usually perform their task at the optimal level. As the stress increases beyond a moderate level, human performance begins to decline, and Figure 29.1 shows that in the highest stress region, human reliability is at its lowest.


Figure 29.2. Types of human error

A human error may be categorized into many different classifications, as shown in Figure 29.2. These are design error, operator error, maintenance error, fabrication error, inspection error, handling error, and miscellaneous error [7, 26]. Although all these categories or types are self-explanatory, the miscellaneous error represents the situation where it is difficult to differentiate between human failure and equipment failure. Sometimes, the miscellaneous error is also referred to as the contributory error.

Human errors occur for various reasons. Some of the important causes for the occurrence of human error are task complexity, poor equipment design, improper work tools, poorly written maintenance and operating procedures, inadequate training or skill, inadequate lighting in the work area, high noise level, crowded work space, high temperature in the work area, inadequate work layout, poor motivation, poor verbal communication, and inadequate handling of equipment [7, 26].

29.4 Human Reliability Analysis Methods

There are many methods used in performing human reliability analysis [27]. These include the probability tree method, fault tree method, Markov method, throughput ratio method, block diagram method, personnel reliability index, technique for human error rate prediction (THERP), and Pontecorvo's method of predicting human reliability [7, 27]. The first three methods are described below.

29.4.1 Probability Tree Method

This approach is concerned with performing task analysis diagrammatically. More specifically, diagrammatic task analysis is represented by the branches of the probability tree. The tree branching limbs denote outcomes (i.e. success or failure) of each event, and in turn an occurrence probability is assigned to each branch of the tree.

The probability tree method has many advantages: it serves as a visibility tool, it lowers the probability of computational error through simplification, it allows the quantitative effects of errors to be predicted, and it can incorporate, with some modifications, factors such as interaction stress, emotional stress, and interaction effects. The following example demonstrates this method.

Example 1. Assume that the task of an operator is composed of two subtasks, y and z. Each of these subtasks can either be performed successfully or unsuccessfully, and subtask y is accomplished before subtask z. The unsuccessful performance of a subtask is the only error that can occur, and the performance of the two subtasks is independent of each other.

Develop a probability tree and obtain an expression for the probability of performing the overall task incorrectly.

Figure 29.3. Probability tree diagram for Example 1 (branch y or ȳ followed by branch z or z̄; outcome yz is task success, while outcomes yz̄, ȳz, and ȳz̄ are task failures)

A probability tree for Example 1 is shown in Figure 29.3. The meanings of the symbols used in Figure 29.3 are defined below:

y = the subtask y is performed correctly
ȳ = the subtask y is performed incorrectly
z = the subtask z is performed correctly
z̄ = the subtask z is performed incorrectly

Using Figure 29.3, the probability of performing the overall task incorrectly is:

Ftin = Py Pz̄ + Pȳ Pz + Pȳ Pz̄   (29.1)

where

Ftin is the probability of performing the overall task incorrectly
Py is the probability of performing subtask y correctly
Pz is the probability of performing subtask z correctly
Pȳ is the probability of performing subtask y incorrectly
Pz̄ is the probability of performing subtask z incorrectly.
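Equation 29.1 is easy to exercise numerically. In the Python sketch below, the subtask error probabilities (0.02 and 0.03) are hypothetical values chosen only for illustration; the task failure probability is computed both by summing the three failure branches and, as a check, as the complement of the single success branch.

# Sketch of Equation 29.1 with hypothetical subtask error probabilities.
P_y_bar, P_z_bar = 0.02, 0.03          # probability of performing y, z incorrectly
P_y, P_z = 1 - P_y_bar, 1 - P_z_bar    # probability of performing y, z correctly

# Sum of the three failure branches of the probability tree (Equation 29.1).
F_tin = P_y * P_z_bar + P_y_bar * P_z + P_y_bar * P_z_bar

# Equivalent check: failure is the complement of the single success branch y z.
assert abs(F_tin - (1 - P_y * P_z)) < 1e-12
print(round(F_tin, 6))  # -> 0.0494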

29.4.2 Fault Tree Method

This is a widely used method in reliability analysis of engineering systems, and it can also be used in performing human reliability analysis. This method is described in detail in [24]. Nonetheless, this approach makes use of the following symbols in developing a basic fault tree:

• Rectangle. This is used to denote a fault event that results from the combination of failure events through the input of a logic gate, e.g. an AND or OR gate.
• AND gate. This gate denotes that an output fault event occurs if all the input fault events occur. Figure 29.4 shows a symbol for an AND gate.
• OR gate. This gate denotes that an output fault event occurs if any one or more of the input fault events occur. Figure 29.4 shows a symbol for an OR gate.
• Circle. This is used to denote basic fault events; more specifically, those events that need not be developed any further.

The following example demonstrates this method:

Figure 29.4. Symbols for basic logic gates: (a) OR gate; (b) AND gate

Example 2. A person is required to perform a job Z composed of four distinct and independent tasks A, B, C, and D. All of these tasks must be carried out correctly for the job to be accomplished successfully. Tasks A and B are composed of two and three independent steps, respectively. More specifically, for the successful performance of task A all of its steps must be performed correctly, whereas for task B at least one of its steps must be accomplished correctly.

Develop a fault tree for this example by using the symbols defined above. Assume that the top fault event is "The person will not accomplish job Z successfully". Calculate the probability of occurrence of the top event, if the probabilities of failure to perform tasks C, D, and the steps involved in tasks A and B are 0.04, 0.05, and 0.1, respectively.

Figure 29.5. A fault tree for Example 2 (the top event T, "The person will not accomplish job Z successfully", is the OR of: intermediate event I1, "Task B will not be performed correctly", the AND of basic events E1, E2, E3, i.e. failure to perform steps w, x, and y correctly; intermediate event I2, "Task A will not be performed correctly", the OR of basic events E4, E5, i.e. failure to perform steps 1 and 2 correctly; basic event E6, "Task C will not be performed correctly"; and basic event E7, "Task D will not be performed correctly")

A fault tree for Example 2 is shown in Figure 29.5. The basic fault events are denoted by Ei, for i = 1, 2, . . . , 7, and the intermediate fault events by Ii, for i = 1, 2. The letter T denotes the top fault event.

The probability of occurrence of intermediate event I1 is given by [24]:

P(I1) = P(E1)P(E2)P(E3) = (0.1)(0.1)(0.1) = 0.001


Similarly, the probability of occurrence of intermediate event I2 is given by [24]:

P(I2) = P(E4) + P(E5) − P(E4)P(E5) = 0.1 + 0.1 − (0.1)(0.1) = 0.19

The probability of occurrence of the top fault event T is expressed by [24]:

P(T) = 1 − [1 − P(E6)][1 − P(E7)][1 − P(I1)][1 − P(I2)]
     = 1 − [1 − 0.04][1 − 0.05][1 − 0.001][1 − 0.19]
     = 0.2620

It means the probability of occurrence of the top event T (i.e. the person will not accomplish job Z successfully) is 0.2620.
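The small calculation for Example 2 translates directly into code. The sketch below is a minimal Python rendering of the quantification, relying on the independence assumptions of the example: AND gates multiply input probabilities, and OR gates combine them via complements.

# Sketch: quantify the fault tree of Example 2 (independent events assumed).

def and_gate(*probs):
    # Output fault occurs only if all input faults occur.
    p = 1.0
    for q in probs:
        p *= q
    return p

def or_gate(*probs):
    # Output fault occurs if at least one input fault occurs.
    p_none = 1.0
    for q in probs:
        p_none *= 1.0 - q
    return 1.0 - p_none

step = 0.1                      # failure probability of each step in tasks A and B
I1 = and_gate(step, step, step) # task B fails only if all three steps fail -> 0.001
I2 = or_gate(step, step)        # task A fails if either of its two steps fails -> 0.19
E6, E7 = 0.04, 0.05             # failure probabilities of tasks C and D

T = or_gate(I1, I2, E6, E7)     # top event: job Z not accomplished
print(round(T, 4))              # -> 0.262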

29.4.3 Markov Method

This is a powerful tool that can also be used in performing various types of human reliability analysis [24]. The following example demonstrates the application of this method in human reliability analysis [28].

Example 3. Assume that an operator is performing his/her tasks in a fluctuating environment: normal or abnormal. Although human error can occur in both environments, it is assumed that the human error rate in the abnormal environment is greater than in the normal environment because of increased stress. The system state space diagram is shown in Figure 29.6. The numerals in the boxes denote system states. Obtain an expression for the mean time to human error (MTTHE).

The following symbols were used to develop equations for the diagram in Figure 29.6:

Pi(t) is the probability of the human being in state i at time t; for i = 0 (human performing task correctly in the normal environment), i = 1 (human performing task correctly in the abnormal environment), i = 2 (human committed error in the normal environment), and i = 3 (human committed error in the abnormal environment)
λh is the constant human error rate from state 0
λah is the constant human error rate from state 1
αn is the constant transition rate from the normal to the abnormal environment
αa is the constant transition rate from the abnormal to the normal environment.

By applying the Markov method, we obtain the following system of differential equations for Figure 29.6:

dP0(t)/dt + (λh + αn)P0(t) = P1(t)αa   (29.2)

dP1(t)/dt + (λah + αa)P1(t) = P0(t)αn   (29.3)

dP2(t)/dt − P0(t)λh = 0   (29.4)

dP3(t)/dt − P1(t)λah = 0   (29.5)

At time t = 0, P0(0) = 1 and P1(0) = P2(0) = P3(0) = 0. Solving Equations 29.2–29.5, we get:

P0(t) = (s2 − s1)^(−1) [(s2 + λah + αa) e^(s2 t) − (s1 + λah + αa) e^(s1 t)]   (29.6)

where

s1 = [−a1 + (a1^2 − 4a2)^(1/2)]/2
s2 = [−a1 − (a1^2 − 4a2)^(1/2)]/2
a1 = λh + λah + αn + αa
a2 = λh(λah + αa) + λah αn

P2(t) = a4 + a5 e^(s2 t) − a6 e^(s1 t)   (29.7)

where

a3 = 1/(s2 − s1)
a4 = λh(λah + αa)/(s1 s2)
a5 = a3(λh + a4 s1)
a6 = a3(λh + a4 s2)

P1(t) = a3 αn [e^(s2 t) − e^(s1 t)]   (29.8)

P3(t) = a7[1 + a3(s1 e^(s2 t) − s2 e^(s1 t))]   (29.9)

where

a7 = λah αn/(s1 s2)


Figure 29.6. System state space diagram (normal environment: state 0, human performing task correctly, and state 2, human committed error; abnormal environment: state 1, human performing task correctly, and state 3, human committed error; transition rates: λh from state 0 to state 2, λah from state 1 to state 3, αn from state 0 to state 1, and αa from state 1 to state 0)


The reliability of the human is given by:

Rh(t) = P0(t) + P1(t)   (29.10)

The mean time to human error is given by:

MTTHE = ∫₀^∞ Rh(t) dt = (λah + αn + αa)/a2   (29.11)
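Equations 29.6–29.11 can be exercised numerically. The Python sketch below evaluates the closed-form MTTHE of Equation 29.11 for hypothetical error and transition rates (the numerical values are invented for illustration) and cross-checks it by numerically integrating Rh(t) built from Equations 29.6 and 29.8.

# Sketch: MTTHE for Example 3 with hypothetical rates (per hour).
import math

lam_h, lam_ah = 0.001, 0.005   # human error rates: normal, abnormal environment
alpha_n, alpha_a = 0.01, 0.02  # transition rates: normal->abnormal, abnormal->normal

a1 = lam_h + lam_ah + alpha_n + alpha_a
a2 = lam_h * (lam_ah + alpha_a) + lam_ah * alpha_n
s1 = (-a1 + math.sqrt(a1**2 - 4 * a2)) / 2
s2 = (-a1 - math.sqrt(a1**2 - 4 * a2)) / 2

# Closed form, Equation 29.11.
mtthe = (lam_ah + alpha_n + alpha_a) / a2
print(f"MTTHE = {mtthe:.1f} hours")

# Cross-check: numerically integrate Rh(t) = P0(t) + P1(t) (Equations 29.6, 29.8).
def rh(t):
    p0 = ((s2 + lam_ah + alpha_a) * math.exp(s2 * t)
          - (s1 + lam_ah + alpha_a) * math.exp(s1 * t)) / (s2 - s1)
    p1 = alpha_n * (math.exp(s2 * t) - math.exp(s1 * t)) / (s2 - s1)
    return p0 + p1

dt, t, integral = 1.0, 0.0, 0.0
while t < 20 * mtthe:                 # integrate far into the tail
    integral += rh(t + dt / 2) * dt   # midpoint rule
    t += dt
print(f"numerical check = {integral:.1f} hours")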

29.5 Human Unreliability Data Sources

The United States Department of Defense and Atomic Energy Commission (AEC) were probably the first to provide an impetus to develop formal data collection and analysis methods for quantifying human performance reliability [29]. In 1962, the American Institute for Research (AIR) was probably the first organization to establish a human unreliability data bank, called Data Store [6].

Over the years many other data banks have been established [7, 29]. Needless to say, today there are many sources for obtaining human error-related data, and Table 29.1 presents some of these sources. References [7, 29] list many more such sources.

29.6 Medical Device Reliability Related Facts and Figures

Some of the facts and figures directly or indirectly related to medical device reliability are as follows.

• In 1983, the United States medical device industry employed around 200 000 persons, and at least 3000 companies were involved with medical devices [38].
• In 1997, a total of 10 420 registered manufacturers were involved in the manufacture of medical devices in the United States alone [39].
• In 1969, 10 000 medical device-related injuries resulting in 731 deaths over a 10-year period were reported by the US Department of Health, Education, and Welfare special committee [40].


Table 29.1. Sources for obtaining human unreliability data

No.  Source
1    Data Store [6]
2    Book: Human reliability with human factors [7]
3    Book: Human reliability and safety analysis data handbook [30]
4    Operational Performance Recording and Evaluation Data System (OPREDS) [31]
5    Bunker–Ramo tables [32]
6    Aviation Safety Reporting System [33]
7    Technique for Establishing Personnel Performance Standards (TEPPS) [34]
8    Aerojet General Method [35]
9    Book: Mechanical reliability: theory, models, and applications [36]
10   Air Force Inspection and Safety Center Life Sciences Accident and Incident Reporting System [37]

• In 1997, the world market for medical devices was estimated to be around $120 billion [41].
• A study reported that more than 50% of all technical medical equipment problems were due to operator errors [42].
• In 1990, a study conducted by the Food and Drug Administration (FDA) over the period from October 1983 to September 1989 reported that approximately 44% of the quality-related problems led to the voluntary recall of medical devices [43]. The study also stated that effective design controls could have prevented such problems.
• In 1990, the Safe Medical Device Act (SMDA) was passed by the US Congress and, in turn, it strengthened the FDA to implement the Preparation Quality Assurance Program [16].
• In 1969, a study reported that faulty instrumentation accounts for 1200 deaths per year in the United States [44, 45].

29.7 Medical Device Recalls and Equipment Classification

Defective medical devices are subject to recalls and repairs in the United States. The passage of the Medical Device Amendments of 1976 has helped to increase the public's awareness of these activities and their number. The FDA Enforcement Report publishes data on medical device recalls [46]. For example, this report for the period 1980–1982 announced 230 medical device-related recalls and the following categories of problem areas:

• faulty product design;
• contamination;
• mislabeling;
• defects in material selection and manufacturing;
• defective components;
• no premarket approval and failure to comply with good manufacturing practices (GMPs);
• radiation (X-ray) violations;
• misassembly of parts;
• electrical problems.

The numbers of recalls associated with each of the above nine categories of problem areas were 40, 40, 27, 54, 13, 4, 25, 10, and 17, respectively. It is important to note that around 70% of the medical device recalls were due to faulty design, product contamination, mislabeling, and defects in material selection and manufacturing.

Six components of the faulty design category were premature failure, potential for malfunction, electrical interference, alarm defects, potential for leakage of fluids into electrical components, and failure to perform as required. There were four subcategories of product contamination: defective package seals, non-sterility, other package defects, and other faults. The mislabeling category included components such as incomplete labeling, misleading or incorrect labeling, disparity between label and product, and inadequate labeling. Five components of defects in material selection and manufacturing were manufacturing defects, material deterioration, inappropriate materials, separation of bonded components, and actual or potential breakage/cracking.

After the passage of the Medical Device Amendments of 1976, the FDA categorized medical devices prior to 1976 into three classifications [40].

• Category I. This included devices for which general controls such as GMPs were considered adequate in relation to efficacy and safety.
• Category II. This included devices for which general controls were deemed inadequate in relation to efficacy and safety but for which performance standards could be established.
• Category III. This included devices for which the manufacturer must submit evidence of efficacy and safety through well-designed studies.

Nonetheless, the health care system uses a large variety of electronic equipment, and it can be classified into three groups [47].

• Group X. This contains equipment/devices that are considered responsible for the patient's life or may become so during an emergency. More specifically, when such items fail, there is seldom sufficient time for repair, thus these items must have high reliability. Some examples of the equipment/devices belonging to this group are respirators, cardiac defibrillators, electro-cardiographic monitors, and cardiac pacemakers.
• Group Y. This includes the vast majority of equipment, used for routine or semi-emergency diagnostic or therapeutic purposes. More specifically, the failure of equipment belonging to this group does not lead to the same emergency as in the case of group X equipment. Some examples of group Y equipment are gas analyzers, spectrophotometers, ultrasound equipment, diathermy equipment, electro-cardiograph and electro-encephalograph recorders and monitors, and colorimeters.
• Group Z. This includes equipment not critical to a patient's life or welfare. Wheelchairs and bedside television sets are two typical examples of group Z equipment.

29.8 Human Error in Medical Devices

Past experience indicates that a large percentage of medical device-related problems are due to human error. For example, 60% of deaths or serious injuries associated with medical devices reported through the FDA Center for Devices and Radiological Health (CDRH) were due to user error [48]. Nonetheless, various studies conducted over the years have identified the most error-prone medical devices [49]. Some of these devices (in order of most error-prone to least error-prone) are as follows [49]:

• glucose meter;
• balloon catheter;
• orthodontic bracket aligner;
• administration kit for peritoneal dialysis;
• permanent pacemaker electrode;
• implantable spinal cord stimulator;
• intra-vascular catheter;
• infusion pump;
• urological catheter;
• electrosurgical cutting and coagulation device;
• non-powered suction apparatus;
• mechanical/hydraulic impotence device;
• implantable pacemaker;
• peritoneal dialysate delivery system;
• catheter introducer;
• catheter guidewire.

29.9 Tools for Medical Device Reliability Assurance

There are many methods and techniques that can be used to improve the reliability of medical devices [16, 24]. Some of the commonly used ones are described below.


29.9.1 General Method

This method is composed of a number of steps that can be used to improve medical equipment reliability. These steps are as follows [42]:

• specify required reliability parameters and their values in design specifications;
• allocate the specified reliability parameter values;
• evaluate medical equipment reliability and compare the specified and theoretical values;
• modify the design if necessary;
• collect field failure data and perform the necessary analysis;
• make recommendations if necessary to improve weak areas of the medical equipment.

29.9.2 Failure Modes and Effect Analysis

The history of failure modes and effect analysis (FMEA) goes back to the early 1950s, when the technique was used in the design and development of flight control systems. Since then the method has received widespread acceptance in industry. FMEA is a tool for evaluating a design at the initial stage from the reliability aspect. It helps to identify the need for, and the effects of, design change. Furthermore, the procedure demands listing the potential failure modes of each and every component on paper, together with their effects on the listed subsystems. There are seven main steps involved in performing failure modes and effect analysis: (i) establishing system definition; (ii) establishing ground rules; (iii) describing system hardware; (iv) describing functional blocks; (v) identifying failure modes and their effects; (vi) compiling critical items; (vii) documenting.

This method is described in detail in [24], and a comprehensive list of publications on the technique is given in [50].
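Steps (v)–(vii) of the FMEA procedure lend themselves to a simple tabular record. Below is a minimal, hypothetical Python sketch of such a worksheet; the component names, failure modes, effects, and criticality labels are invented purely for illustration.

# Sketch: a minimal FMEA worksheet for a hypothetical infusion-pump subsystem.
# Each record lists a component, one failure mode, its effect, and a criticality
# label assigned by the analyst (steps v-vii of the FMEA procedure).

worksheet = [
    {"component": "occlusion sensor", "failure_mode": "stuck reading",
     "effect": "undetected line blockage", "critical": True},
    {"component": "keypad", "failure_mode": "bounced keystroke",
     "effect": "wrong dose entry", "critical": True},
    {"component": "status LED", "failure_mode": "burned out",
     "effect": "loss of visual indication", "critical": False},
]

# Step (vi): compile the critical items list for design follow-up.
critical_items = [rec for rec in worksheet if rec["critical"]]
for rec in critical_items:
    print(f"{rec['component']}: {rec['failure_mode']} -> {rec['effect']}")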

29.9.3 Fault Tree Method

As in the case of human reliability analysis, the fault tree method can also be used in medical device reliability analysis. This method was developed in the early 1960s at Bell Laboratories to perform analysis of the Minuteman Launch Control System. Fault tree analysis starts by identifying an undesirable event, called the top event, associated with a system. Events which could cause the top event are generated and connected by logic operators such as AND or OR. The fault tree construction proceeds by generation of events in a successive manner until the events (basic fault events) need not be developed further. The fault tree itself is the logic structure relating the top event to the basic events.

This method is described in detail in [16, 24].

29.9.4 Markov Method

This is a very general approach, and it can also be used in performing reliability analysis of medical devices. The method is quite appealing when the system failure and repair rates are constant. However, a problem may arise when solving a set of linear algebraic equations for systems with a large number of states. This method is described in detail in [24].

The application of the Markov method in reliability analysis of medical devices is demonstrated through the following example [16].

Example 4. An X-ray machine was observed for a long period and it was concluded that its failure and repair rates are constant. The state space diagram of the machine is given in Figure 29.7. The numerals in the boxes denote the X-ray machine states. Develop an expression for the X-ray machine unavailability by using the Markov method.

Figure 29.7. X-ray machine transition diagram (state 0: X-ray machine working normally; state 1: X-ray machine failed)


Table 29.2. Sources for obtaining data to conduct medical device reliability studies

No.  Source
1    MIL-HDBK-217: Reliability prediction of electronic equipment [51]
2    Medical Device Reporting System (MDRS) [52]
3    Universal Medical Device Registration and Regulatory Management System (UMDRMS) [53]
4    Hospital Equipment Control System (HECS™) [53]
5    Health Devices Alerts (HDA) [53]
6    Problem Reporting Program (PRP) [53]
7    Medical Device Manufacturers Database (MDMD) [53]
8    User Experience Network (UEN) [53]

The following symbols were used to develop equations for the state space diagram in Figure 29.7:

Pi(t) is the probability that the X-ray machine is in state i at time t; for i = 0 (the X-ray machine is operating normally) and i = 1 (the X-ray machine has failed)
λ is the X-ray machine constant failure rate
µ is the X-ray machine constant repair rate.

With the aid of the Markov method, we write the following equations:

dP0(t)/dt + λP0(t) = µP1(t)   (29.12)

dP1(t)/dt + µP1(t) = λP0(t)   (29.13)

At time t = 0, P0(0) = 1 and P1(0) = 0. Solving Equations 29.12 and 29.13, we obtain:

P0(t) = µ/(λ + µ) + [λ/(λ + µ)] e^(−(λ+µ)t)   (29.14)

P1(t) = λ/(λ + µ) − [λ/(λ + µ)] e^(−(λ+µ)t)   (29.15)

The X-ray machine unavailability is given by:

UAx(t) = P1(t) = λ/(λ + µ) − [λ/(λ + µ)] e^(−(λ+µ)t)   (29.16)

where UAx(t) is the X-ray machine unavailability at time t.

29.10 Data Sources for Performing Medical Device Reliability Studies

As in the case of any other engineering product, failure data are very important in medical device reliability studies. There are many data sources that can be quite useful in performing various types of reliability studies. Some of these sources are listed in Table 29.2.

29.11 Guidelines for Reliability Engineers with Respect to Medical Devices

Some of the guidelines for reliability engineers working in the area of medical devices are as follows [54].

• Remember that during the design, development, and manufacturing phases of a medical device, the responsibility for reliability rests with the manufacturer, but in the field it rests with the user.
• Remember that not all device failures are equal in importance; thus direct your attention to critical failures for maximum returns.
• Avoid using sophisticated methods and techniques developed to improve aerospace systems reliability. For the time being, consider yourself an expert on failure correction, simple fault tree analysis, failure mode and effects analysis, and so forth.
• Aim to become an effective cost-conscious reliability engineer; thus perform cost versus reliability trade-off analyses. Past experience indicates that some reliability improvement decisions require very little or no additional expenditure.
• Remember that even a simple failure reporting system in a health care organization will be more beneficial in terms of improving medical device reliability than having a large inventory of spares or standbys.
• Use design review, failure mode and effect analysis, qualitative fault tree analysis, and parts review for immediate results.

References

[1] Meister D. Human reliability. In: Muckler FA, editor. Human factors review. Santa Monica, CA: Human Factors Society; 1984. p. 13–53.
[2] Swain AD. Overview and status of human factors reliability analysis. Proc 8th Annual Reliability and Maintainability Conf; 1969. p. 251–4.
[3] Williams HL. Reliability evaluation of the human component in man–machine systems. Electr Manuf 1958;April:78–82.
[4] Hettinger CW. Analysis of equipment failures caused by human errors. Report No. REL-R-005. Los Angeles: Convair-Astronautics; April 15, 1958.
[5] Meister D. Human engineering survey of difficulties in reporting failure and consumption data. Report No. ZX-7-O35. Los Angeles: Convair-Astronautics; September 1958.
[6] Munger SJ, Smith RW, Payne D. An index of electronic equipment operability: data store. Report No. AIR-C43-1/62 RP(29.1). Pittsburgh: American Institute for Research; 1962.
[7] Dhillon BS. Human reliability with human factors. New York: Pergamon Press; 1986.
[8] Dhillon BS. Reliability and quality control: bibliography on general and specialized areas. Gloucester, Canada: Beta Publishers; 1992.
[9] Johnson JP. Reliability of ECG instrumentation in a hospital. Proc Ann Symp Reliability; 1967. p. 314–8.
[10] Crump JF. Safety and reliability in medical electronics. Proc Ann Symp Reliability; 1969. p. 320–2.
[11] Taylor EF. The effect of medical test instrument reliability on patient risks. Proc Ann Symp Reliability; 1969. p. 328–30.
[12] Meyer JL. Some instrument induced errors in the electrocardiogram. J Am Med Assoc 1967;201:351–8.
[13] Grechman R. Tiny flaws in medical design can kill. Hosp Top 1968;46:23–4.
[14] Dhillon BS. Bibliography of literature on medical equipment reliability. Microelectron Reliab 1980;20:737–42.
[15] Dhillon BS. Reliability engineering in systems design and operation. New York: Van Nostrand-Reinhold Company; 1983.
[16] Dhillon BS. Medical device reliability and associated areas. Boca Raton, FL: CRC Press; 2000.
[17] MIL-STD-721B. Definitions of effectiveness terms for reliability, maintainability, human factors and safety. Washington, DC: Department of Defense; 1966.
[18] Hagen EW. Human reliability analysis. Nucl Safety 1976;17:315–26.
[19] Meister D. Human factors in reliability. In: Ireson WG, editor. Reliability handbook. New York: McGraw-Hill; 1966. p. 12.2–37.
[20] Federal Food, Drug, and Cosmetic Act (as Amended, Sec. 201 (h)). Washington, DC: US Government Printing Office; 1993.
[21] Naresky JJ. Reliability definitions. IEEE Trans Reliab 1970;19:198–200.
[22] Omdahl TP. Reliability, availability and maintainability (RAM) dictionary. Milwaukee, WI: ASQC Quality Press; 1988.
[23] McKenna T, Oliverson R. Glossary of reliability and maintenance terms. Houston, TX: Gulf Publishing Company; 1997.
[24] Dhillon BS. Design reliability: fundamentals and applications. Boca Raton, FL: CRC Press; 1999.
[25] Beech HR, Burns LE, Sheffield BF. A behavioural approach to the management of stress. New York: John Wiley and Sons; 1982.
[26] Meister D. The problem of human-initiated failures. Proc 8th Nat Symp Reliability and Quality Control; 1962. p. 234–9.
[27] Meister D. Comparative analysis of human reliability models. Report No. AD 734-432; 1971. Available from the National Technical Information Service, Springfield, VA 22151.
[28] Dhillon BS. Stochastic models for predicting human reliability. Microelectron Reliab 1982;21:491–6.
[29] Dhillon BS. Human error data banks. Microelectron Reliab 1990;30:963–71.
[30] Gertman DI, Blackman HS. Human reliability and safety analysis data handbook. New York: John Wiley and Sons; 1994.
[31] Urmston R. Operational performance recording and evaluation data system (OPREDS). Descriptive Brochure, Code 3400. San Diego, CA: Navy Electronics Laboratory Center; November 1971.
[32] Hornyak SJ. Effectiveness of display subsystems measurement prediction technique. Report No. TR-67-292; September 1967. Available from the Rome Air Development Center (RADC), Griffiss Air Force Base, Rome, NY.
[33] Aviation Safety Reporting Program. FAA Advisory Circular No. 00-4613; June 1, 1979. Washington, DC: Federal Aviation Administration (FAA).
[34] Topmiller DA, Eckel JS, Kozinsky EJ. Human reliability data bank for nuclear power plant operations: a review of existing human reliability data banks. Report No. NUREG/CR 2744/1; 1982. Available from the US Nuclear Regulatory Commission, Washington, DC.
[35] Irwin IA, Levitz JJ, Freed AM. Human reliability in the performance of maintenance. Report No. LRP 317/TDR-63-218; 1964. Available from the Aerojet-General Corporation, Sacramento, CA.
[36] Dhillon BS. Mechanical reliability: theory, models, and applications. Washington, DC: American Institute of Aeronautics and Astronautics; 1988.
[37] Life Sciences Accident and Incident Classification Elements and Factors. AFISC Operating Instruction No. AMISCM-127-6; December 1971. Available from the Department of the Air Force, Washington, DC.
[38] Federal Policies and the Medical Devices Industry. Office of Technology Assessment. Washington, DC: US Government Printing Office; 1984.
[39] Allen D. California home to almost one-fifth of US medical device industry. Med Dev Diag Ind Mag 1997;19:64–7.
[40] Banta HD. The regulation of medical devices. Prevent Med 1990;19:693–9.
[41] Murray K. Canada's medical device industry faces cost pressures, regulatory reform. Med Dev Diag Ind Mag 1997;19:30–9.
[42] Dhillon BS. Reliability technology in health care systems. Proc IASTED Int Symp Computers and Advanced Technology in Medicine, Health Care, Bioengineering; 1990. p. 84–7.
[43] Schwartz AP. A call for real added value. Med Ind Exec 1994;February/March:5–9.
[44] Walter CW. Instrumentation failure fatalities. Electron News, January 27, 1969.
[45] Micco LA. Motivation for the biomedical instrument manufacturer. Proc Ann Reliability and Maintainability Symp; 1972. p. 242–4.
[46] Hyman WA. An evaluation of recent medical device recalls. Med Dev Diag Ind Mag 1982;4:53–5.
[47] Crump JF. Safety and reliability in medical electronics. Proc Ann Symp Reliability; 1969. p. 320–2.
[48] Bogner MS. Medical devices: a new frontier for human factors. CSERIAC Gateway 1993;IV:12–4.
[49] Wiklund ME. Medical device and equipment design. Buffalo Grove, IL: Interpharm Press Inc.; 1995.
[50] Dhillon BS. Failure mode and effects analysis: bibliography. Microelectron Reliab 1992;32:719–31.
[51] MIL-HDBK-217. Reliability prediction of electronic equipment. Washington, DC: Department of Defense.
[52] Medical Device Reporting System (MDRS). Center for Devices and Radiological Health, Food and Drug Administration (FDA), Rockville, MD.
[53] Emergency Care Research Institute (ECRI). 5200 Butler Parkway, Plymouth Meeting, PA.
[54] Taylor EF. The reliability engineer in the health care system. Proc Ann Reliability and Maintainability Symp; 1972. p. 245–8.


Probabilistic Risk Assessment

Chapter 30
Robert A. Bari

30.1 Introduction
30.2 Historical Comments
30.3 Probabilistic Risk Assessment Methodology
30.4 Engineering Risk Versus Environmental Risk
30.5 Risk Measures and Public Impact
30.6 Transition to Risk-informed Regulation
30.7 Some Successful Probabilistic Risk Assessment Applications
30.8 Comments on Uncertainty
30.9 Deterministic, Probabilistic, Prescriptive, Performance-based
30.10 Outlook

30.1 Introduction

Probabilistic risk assessment (PRA) is an analytical methodology for computing the likelihood of health, environmental, and economic consequences of complex technologies caused by equipment failure or operator error. It can also be used to compute the risks resulting from normal, intended operation of these technologies. PRAs are performed for various end uses. These include: understanding the safety characteristics of a particular engineering design, developing emergency plans for potential accidents, improving the regulatory process, and communicating the risks to interested parties. These are a few broadly stated examples of applications of the methodology.

The requirements of the end uses should determine the scope and depth of the particular assessment. A PRA often requires the collaboration of experts from several disciplines. For example, it could require the designers and operators of a facility to provide a characterization of the normal and potential off-normal modes of behavior of the facility; it could require analysts who develop and codify scenarios for the off-normal events; it could require statisticians to gather data and develop databases related to the likelihood of the occurrence of events; it could require physical scientists to compute the physical consequences of off-normal events; it could require experts in the transport of effluents from the facility to the environment; it could require experts in the understanding of health, environmental, or economic impacts of effluents. Experts in all phases of human behavior related to the normal and off-normal behavior of a facility and its environment could be required.

PRA is a methodology that developed and matured within the commercial nuclear power reactor industry. Therefore the main emphasis and examples in this chapter will be from that industry. Other industries also use disciplined methodologies to understand and manage the safe operation of their facilities and activities. The chemical process, aviation, space, railway, health, environmental, and financial industries have developed methodologies in parallel with the nuclear industry. Over the past decade there has been much activity aimed at understanding the commonalities, strengths, and weaknesses of methodological approaches across industries. Examples of these efforts are contained in the proceedings [1] of a series of conferences sponsored by the International Association for Probabilistic Safety Assessment and Management. In a series of six conferences from 1991 to 2002, risk assessment experts from over 30 countries have gathered to exchange information on accomplishments and challenges from within their respective fields and to forge new collaborations with practitioners across disciplines. Another useful reference in this regard is the proceedings [2] of a workshop by the European Commission on the identification of the need for further standardization and development of a top-level risk assessment standard across different technologies. This workshop included experts from the chemical process, nuclear power, civil structures, transport, food, and health care industries.

Even within the nuclear power industry there have been various approaches and refinements to the methodologies. One characteristic (but mostly symbolic) sign of the disparity in this community is the term PRA itself. Some practitioners prefer and use the term "PSA", which stands for probabilistic safety assessment. The methodologies are identical. Perhaps the emphasis on safety rather than risk reflects an interest in end use and communication. A closely associated nomenclature distinction exists for the field of risk assessment for high-level radioactive waste repositories. Here PRA is called "PA", which stands for performance assessment. Reference [3] provides a good discussion of the evolution of this terminology distinction and of how PRA is practiced in the areas of high-level radioactive waste, low-level radioactive waste, and decommissioning of nuclear facilities.

30.2 Historical Comments

The defining study for PRA is the Reactor Safety Study [4], which was performed by the Atomic Energy Commission (under the direction of Norman Rasmussen of the Massachusetts Institute of Technology) to provide a risk perspective to gauge insurance and liability costs for nuclear power plants. The US Congress was developing legislation, known as the Price–Anderson Act, which would specify the liabilities of owners and operators of nuclear facilities. Prior to the Reactor Safety Study, during the period 1957 to 1972, reactor safety and licensing were developed according to a deterministic approach. This approach is contained in the current version of the Federal Code of Regulations, Title 10.

Reactors were designed according to conservative engineering principles, with a strong containment around the plants, and built-in engineered safety features. Deterministic analyses were performed by calculating the temperatures and pressures in a reactor or in a containment building, or the forces on pipes that would result as a consequence of credible bounding accidents. This approach became the framework for licensing in the United States, and it is basically built into the regulations for nuclear power plant operation. During the same time period, there was informal usage of probabilistic ideas. In addressing events and how to deal with their consequences, one looked at events that were very likely to occur, events that might occur once a month, once a year, or once during the plant's lifetime. Systems were put into place to accommodate events on such a qualitative scale. Also, probabilistic ideas were introduced in an informal way to look at such questions as the probability of the reactor vessel rupturing. However, the studies were not yet integrated analyses of these probabilistic notions. The essential idea was to provide protection against the potential damage (or melting) of the reactor core and against the release of fission products to the environment.

On the consequence side, a study was made called WASH 740 [5], which was an analysis of a severe accident (one that would melt the reactor core and lead to the release of fission products to the environment) postulated to occur at a nuclear power plant. It was performed to gauge insurance and liability costs for nuclear power plants, and to evaluate how the law protects the public. The postulate was that a very large amount of radioactivity was released from the building, and the consequences were calculated on that basis. The results showed very large numbers of deaths and health effects from latent cancers. That study lacked a probabilistic perspective. It gave only the consequence side, not the probabilistic side. But by 1972, and probably before that, people were thinking about how to blend the probabilistic notions into a more formal and systematic approach. This culminated in the Reactor Safety Study [4], mentioned above.

In this study the Atomic Energy Commission tried to portray the risks from nuclear power plant operation. It was a landmark study in probabilistic risk assessment. To draw an analogy with the field of physics, this time period for probabilistic risk assessment is roughly analogous to the years 1925 to 1928 for the atomic theory. In a short period of time it laid the foundation for what we call the PRA technology, and the systematic integration of probabilistic and consequential ideas. It also shed light on where the significant safety issues were in nuclear power plants. For example, the conventional licensing approach advocated the deterministic analysis of a guillotine pipe rupture in the plant, where one of the large pipes feeding water to the vessel is severed, as if with a hatchet, and then the consequences are calculated in a deterministic way. Using probabilistic assessment instead, the Reactor Safety Study showed that small pipe breaks are the dominant ones in the risk profile.

Transient events, such as loss of power to part of the plant, were found to be important. The Reactor Safety Study challenged the notion that one always looks only at single failures in the licensing approach. There is a dictum that one looks at the response of the plant to a transient event in the presence of a single failure. Thus, the failure of an active component is postulated and it is demonstrated that the plant can safely withstand such an event. The Reactor Safety Study showed that this is not the limiting (or bounding) case from the perspective of risk; multiple failures can occur, and, indeed, these have been studied since. The analysis also highlighted the role of the operator of the plant. In looking at failure rates in the plant, it was not the hardware failures for many systems that gave the lead terms in the failure rates; it was such terms as the incorrect performance of a maintenance act, or a failure on the part of an operator to turn a valve to the correct position. Another important piece of work in the Reactor Safety Study was the physical analysis of the core-melt sequence, coupling it with an evaluation of the response of the containment. Before that time, analysts, designers, and regulators did not seem to recognize the close coupling between how the core melts in the vessel and the responses of the containment. The Reactor Safety Study was the first publication to integrate these factors. Its overall message was that the risk was low for nuclear power plant operation.

The Reactor Safety Study was criticized, extensively analyzed, and reviewed. Congress commissioned a report now called the Lewis Report [6] (after Harold Lewis, University of California, Santa Barbara, who chaired the committee). This report made certain conclusions about how the Executive Summary of the Reactor Safety Study presented the data, and how the uncertainties in the study were not stated correctly or, as they put it, were “understated”. They commented on the scrutability of the report but, overall, they endorsed the methodology used in the study and advocated its further usage in the licensing, regulatory, and safety areas. Shortly after the Lewis Report, the accident at Three Mile Island occurred, which led some people to ask: Where have we gone wrong? An event occurred which was beyond what was expected to occur in the commercial reactor industry. The regulators reacted to Three Mile Island by imposing many new requirements on nuclear power plants.

A curious thing happened. The people in the probabilistic risk community went back, looked at the Reactor Safety Study, and asked: Where did we go wrong in our analysis? A major study of risk had been performed and it seemed to have missed Three Mile Island. But, upon closer inspection, it was, in principle, in the study. The Reactor Safety Study was done for two specific plants. One was a Westinghouse plant, the Surry Plant, and the other was Peach Bottom, a General Electric plant. It was thought at the time that two power plants, one a boiling water reactor and one a pressurized water reactor, were fairly representative of the hundred or more nuclear plants that would be in place during the last part of the 20th century. However, each plant is unique. The risk must be assessed for each individual power plant, delineating the dominant contributors to risk at each plant. The Three Mile Island plant had certain design features and operational procedures for which the particular data and quantification in the Reactor Safety Study were not representative. Qualitatively, however, the event, and many other sequences, were delineated in the study, and by using failure data and plant modeling appropriate to TMI, the event could be explained within the framework of the study. This provided an impetus for further analysis in the area of probabilistic risk assessment by the regulators and also by the nuclear industry.

Industry itself took the big initiative in probabilistic risk assessment. They performed full plant-specific probabilistic risk assessments. One of the first studies was made for the Big Rock Point Plant, located out in the Midwest. This small plant was at that time rather old, producing about 67 MW of electricity. It is now permanently shut down. Many requirements were put on it since the Three Mile Island accident, and, looked at from the risk perspective, it did not make sense to do the types of things that the regulators were promoting on a deterministic basis. Following that study, full-scale studies were done for three other plants, which had the special consideration that they were in areas of high population density: Indian Point, for example, is about 36 miles from New York City, Zion is near Chicago, and Limerick is close to Philadelphia. The basic conclusions, based on plant-specific features, were that the risks were low. Some features of each plant were identified as the risk outliers, and they were correctable. For example, one plant was found to be vulnerable to a potential seismic event, in which two buildings would move and damage each other. This was very simple to fix: a bumper was placed between the two buildings. The control room ceiling of another plant was predicted to be vulnerable to collapse, and it was reinforced appropriately.

30.3 Probabilistic Risk Assessment Methodology

The PRA methodology has several aspects. One is the logic model, which identifies events that lead to failures of subsystems and systems. There are two types of logic trees that are in widespread use. One is the event tree approach, which is inductive. It moves forward in time to delineate events through two-level logic-type trees: yes/no, fail/success. The other is a fault-tree approach, where one starts with a top event, which is the undesired event, and goes backward in time to find out what has led to this event. Current PRA combines both approaches, to go simultaneously backward and forward in time. The principal advantage is that it reduces the combination of events that would be present if one or the other type of approach were used alone.
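To make the combined use of the two logic-model types concrete, the following Python sketch evaluates a small fault tree for a mitigating system under independence assumptions and then propagates the result through a two-branch event tree to enumerate accident sequences. It is illustrative only; every event name, probability, and frequency in it is invented, not taken from any actual study.

    # Minimal sketch of the two PRA logic-model types.
    # All event names and probabilities are hypothetical illustrations.

    def gate_or(*probs):
        """Probability that at least one independent input fails."""
        p = 1.0
        for q in probs:
            p *= (1.0 - q)
        return 1.0 - p

    def gate_and(*probs):
        """Probability that all independent inputs fail."""
        p = 1.0
        for q in probs:
            p *= q
        return p

    # Fault tree (deductive): top event = cooling function fails.
    p_pump  = 1e-3   # one pump fails to start (per demand)
    p_valve = 5e-4   # valve fails to open
    p_power = 2e-4   # support power unavailable
    p_cooling_fails = gate_or(gate_and(p_pump, p_pump),  # two redundant pumps
                              p_valve, p_power)

    # Event tree (inductive): initiating event, then binary branches.
    f_initiator = 0.1        # initiating-event frequency (per year)
    p_scram_fails = 1e-5     # reactor trip fails

    sequences = {
        ("scram ok", "cooling ok"):
            f_initiator * (1 - p_scram_fails) * (1 - p_cooling_fails),
        ("scram ok", "cooling fails"):
            f_initiator * (1 - p_scram_fails) * p_cooling_fails,
        ("scram fails", "-"):
            f_initiator * p_scram_fails,  # would transfer to a separate analysis
    }

    for seq, freq in sequences.items():
        print(seq, f"{freq:.2e} per year")

In a real study the fault trees contain thousands of basic events and are solved via minimal cut sets by codes such as SETS [14], rather than by the direct multiplication used in this toy example.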

Another very important feature of probabilistic risk assessment concerns the source of the data used to quantify risk assessments. The data is obtained in part from the nuclear power plant experience itself, and partly from other industries. For example, one could look at valve failures in fossil fuel plants and ask how the database from another part of the sample space applies to the events that we want to quantify. Another part of the database is judgment. Sometimes events are of very low probability or of very high uncertainty, so that judgment is used to supplement data. There is nothing wrong with this; in fact, the Bayesian approach in probabilistic theory easily accommodates judgment. The physical models describe how the core melts down, how the containment behaves, and how off-site consequences progress. Health effects and economic losses are then assessed on the basis of this information.
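As a concrete illustration of how the Bayesian approach folds judgment into sparse data, the sketch below applies the standard conjugate gamma prior for a Poisson failure rate. The prior parameters and the plant operating experience are hypothetical numbers chosen only for illustration.

    # Sketch of a Bayesian failure-rate update (gamma prior, Poisson data).
    # Prior parameters and plant data are hypothetical.

    # Judgment-based prior: mean rate alpha/beta failures per year.
    alpha_prior = 0.5     # shape (weak prior, reflecting high uncertainty)
    beta_prior  = 100.0   # "equivalent experience" in years

    # Observed plant experience: n failures in T years of operation.
    n_failures = 2
    T_years    = 350.0

    # Conjugate update: the posterior is again a gamma distribution.
    alpha_post = alpha_prior + n_failures
    beta_post  = beta_prior + T_years

    post_mean = alpha_post / beta_post
    print(f"posterior mean rate: {post_mean:.2e} per year")

    # Optional credible interval via scipy, if it is installed.
    try:
        from scipy.stats import gamma
        lo, hi = gamma.ppf([0.05, 0.95], a=alpha_post, scale=1.0 / beta_post)
        print(f"90% credible interval: ({lo:.2e}, {hi:.2e}) per year")
    except ImportError:
        pass

The prior dominates when operating experience is thin and is progressively overwhelmed as failure counts and exposure time accumulate, which is exactly the behavior wanted when judgment must supplement data.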

PRAs are sometimes categorized in terms of the end products or results that are reported in the evaluation. A Level 1 PRA determines the accident initiators and failure sequences that can lead to core damage. The Level 1 PRA reports risk results typically in terms of the (annualized) probabilities of accident sequences and the overall probability of core damage. Usually the uncertainty in the core damage probability is given, as well as the sensitivity of this prediction to major assumptions of the analysis.
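A minimal numerical sketch of such a Level 1 summary is shown below: the core damage frequency is the sum of the sequence frequencies, and its uncertainty is propagated by Monte Carlo sampling. The three sequences, their median frequencies, and their lognormal error factors are all invented for illustration.

    # Sketch of a Level 1 summary: core damage frequency (CDF) as the sum
    # of accident-sequence frequencies, with uncertainty propagated by
    # Monte Carlo. Medians and error factors are hypothetical.
    import random, math

    # (median frequency per year, lognormal error factor) per sequence
    sequences = {
        "small LOCA + injection failure": (2e-6, 10.0),
        "transient + loss of feedwater":  (5e-6, 5.0),
        "station blackout":               (1e-6, 10.0),
    }

    N = 100_000
    random.seed(1)
    samples = []
    for _ in range(N):
        total = 0.0
        for median, ef in sequences.values():
            sigma = math.log(ef) / 1.645      # error factor = 95th pct / median
            total += random.lognormvariate(math.log(median), sigma)
        samples.append(total)

    samples.sort()
    print(f"mean CDF:   {sum(samples) / N:.2e} per year")
    print(f"median CDF: {samples[N // 2]:.2e} per year")
    print(f"95th pct:   {samples[int(0.95 * N)]:.2e} per year")

This treats the sequence frequencies as independent; an actual quantification would sample the underlying basic-event parameters so that correlations between sequences sharing equipment are preserved.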

When the PRA analysis also includes an assessment of the physical processes in the reactor vessel, the containment structure, and in associated buildings at the plant, the study is referred to as a Level 2 PRA. Typical products of this analysis are the probability of release (both airborne and to liquid pathways) of radionuclides from the plant to the environment. The analysis includes the mode of failure of the reactor vessel, the resulting temperature and pressure loads on the containment boundaries, the timing of these loads, the quantity and character of radionuclides released to the containment and subsequently to the atmosphere, and the timing of the vessel failure relative to the containment failure. Again, uncertainty analyses are typically performed, as well as sensitivity analyses related to major assumptions.

A Level 3 PRA extends the results of a Level 2 analysis to health effects and economic impacts outside the facility. Effects of weather, population conditions, evacuation options, terrain, and buildings are also considered. The results are reported in terms of probabilities of fatalities (acute and latent cancers), property damage, and land contamination.
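Level 3 results of this kind are often summarized as exceedance-frequency curves over consequence magnitude. The sketch below, in which the release categories, frequencies, and consequence values are all invented, shows the basic bookkeeping behind such a curve and behind an annual expected-consequence point estimate.

    # Sketch of a Level 3 risk summary: exceedance frequency versus
    # consequence magnitude, built from release categories. All
    # frequencies and consequence values are hypothetical.

    # (name, frequency per year, latent cancer fatalities) per category
    release_categories = [
        ("early containment failure", 1e-7, 900.0),
        ("late containment failure",  2e-6, 40.0),
        ("containment bypass",        5e-8, 2000.0),
        ("intact containment",        4e-6, 0.5),
    ]

    def exceedance(c):
        """Summed frequency of all categories whose consequence exceeds c."""
        return sum(f for _, f, x in release_categories if x > c)

    for level in (1, 10, 100, 1000):
        print(f"freq of > {level:>4} fatalities: {exceedance(level):.2e} /yr")

    # Point estimate of annual expected consequences (risk integral).
    expected = sum(f * x for _, f, x in release_categories)
    print(f"expected fatalities per year: {expected:.2e}")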

Overall descriptions of the methodologies and tools for performing PRAs are given in procedures guides [7, 8] for this purpose. There are also several textbooks on this subject. A recent example is [9], which contains ample references to the specialized subtopics of PRA and to much of the recent literature. There are several technical reports that the reader might want to consult on special topics. Some of these are noted in [7–9], but are given here for convenience.

For data analysis, in particular the failure rates for many components and systems used in the nuclear power plant risk studies performed during the 1980s and 1990s, the reader should see [10, 11]. These were a key data source for the NUREG-1150 risk study [12], which updated the early WASH-1400 study [4]. This data was also used in the Individual Plant Examination studies that were performed by the US nuclear industry over the last decade.

The use of logic models to delineate and describe accident sequences is given in [7, 8], and good examples of the implementation of these tools are given in [4, 12]. The reader should consult the Fault Tree Handbook [13] for an excellent discussion of how this methodology is used to decompose a system failure into its contributory elements. The SETS computer model [14] is an example of how large fault trees can be handled and how some dependent failures can be analyzed. Reference [9] provides a useful survey of the various tools that have been developed in this area. Event trees are typically used to delineate an accident sequence by moving forward in time and indicating, through binary branching, the success or failure of a specified system or action to mitigate the accident outcome. These are described and illustrated in [7, 8]. Reference [9] provides summary information on commercially available integrated computer code packages for accident sequence analysis, along with the particular features of the various tools.

The understanding and modeling of human behavior related to accident evolution has received much attention in the PRA area. One of the key insights from [4], which has been substantiated in many subsequent studies, is that the human is an important element of the overall risk profile of a nuclear power plant. Actions (or inactions) by the operations and maintenance crews, training programs, and the attitudes and cultures of management can all play a role in the aggravation or mitigation of an accident. Research continues to be done on various aspects of the role of the human in safe operation of power plants. An introduction to some of this work can be found in [15, 16]. More recent state-of-the-art developments and applications are contained in [1].

In order to obtain as complete as possible a portrayal of the risks of a complex technological system, both the failures that are “internal” to the system and the events that are “external” to the system must be identified, and their likelihoods must be quantified. Examples of internal initiating events are stochastic failures of mechanical or electrical components that assure the safe operation of the facility. Examples of external initiating events are earthquakes, fires, floods, and hurricanes. In order to model an external event in a PRA, its likelihood and severity must be calculated, its impact on the plant systems must be described, and its consequences must be evaluated. This includes an assessment of the impact of the external event on emergency planning in the vicinity of the facility. There is much literature on the wide-ranging subject of risks due to external events. Some of the more recent work can be found in [1, 17].

The procedures described above are the essential elements of a Level 1 PRA for nuclear power reactors. They are sufficient to compute the frequency of core damage as well as of less severe events, such as the unavailability of individual systems within the facility. This information is useful to a plant operator or owner because it provides a measure of overall facility performance and an indication of the degree of safety that is attained. The operator can also use this prediction of the core damage frequency to understand which systems, components, structures, and operational activities are important contributors to the core damage frequency. This will aid the operator in prioritizing resources for test, maintenance, upgrades, and other activities related to the safe operation of the facility. For the regulator, the computed core damage frequency provides a metric for the facility, which can be compared to a safety goal. This prediction of risk, especially if it is accompanied by uncertainty ranges for the calculated parameters, can be helpful ancillary information to the regulations when new decisions need to be made with regard to a safety issue.

There is additional information that can be derived from a PRA, which requires the calculation of the physical damage that would result from the predicted failures. This additional information would, in turn, be the input to the analysis of the health, environmental, and economic consequences of the predicted events. The physical damage assessment is usually referred to as the Level 2 part of the PRA, and the consequence assessment is termed the Level 3 part of the PRA. These elements of the PRA are discussed below.

The typical results of the Level 2 portion of the PRA may include: (1) the extent of damage to the reactor core; (2) the failure mode of the reactor vessel; (3) the disposition of fuel debris in the containment building; (4) the temperature and pressure histories of loads on the containment due to gas generation associated with the accident; (5) the containment failure mode and its timing relative to previous events during the accident; (6) the magnitude and timing of the release of fission products from the containment; and (7) the radionuclide composition of the fission product “source term”. This information is very useful to the operator and regulator in several ways. It provides the necessary characteristics of the accident progression that would be needed to formulate an accident management strategy [18] for the facility. The information derived from the Level 2 PRA could be useful to a regulator in the assessment of the containment performance of a particular plant in response to a severe accident.

The melt progression of core damage accidents is typically analyzed with complex computer codes. These, in turn, have been developed with the aid of experimental information from accident simulations with surrogate and/or prototypic materials and from the (thankfully) limited actual data in this area. There has been much international cooperation in this area, and the programs have been sponsored by both government and private funds. In the USA, the principal melt progression computer codes are MELCOR [19] and MAAP [20]. MELCOR was developed under the sponsorship of the USNRC; the current version of the code requires plant design and operational information and an initiating accident event. It then computes the accident progression and yields the time histories of key accident events, as indicated above. MAAP was developed under industry sponsorship and serves as an alternate choice of melt progression description. As with any computer code, it is important for the user to understand the underlying assumptions of the code and to make sure that the application is commensurate with the physical situation that is being modeled. Specific accident sequences and containment failure modes have been given special attention in this regard. Two difficult areas and how they were analyzed are discussed in detail in [21, 22].

The third major element of the integrated PRA is the consequence analysis, or the Level 3 portion of the assessment. The radiological source term from the damaged plant is the essential input to this analysis. The key factors are: the radionuclide composition and chemical forms of each radionuclide, the energy with which they are expelled from the containment building, alternative pathways for source term release (e.g. downward penetration through the basemat or through various compartments or pipes), and the time of release (especially relative to initial awareness of the accident).

Weather conditions can play a crucial role in determining the disposition of airborne releases. Offsite population distributions and the effectiveness of emergency response (e.g. evacuation, sheltering) are factored into the consequence analysis. Large radiological dose exposures can lead to health effects in a very short term. Lower doses may lead to cancers after a period of many years. The accident consequence code that has been in widespread international use is the MACCS code [23]. This code has been the subject of many peer reviews and standard problem exercises, and currently receives the attention of an international users group.

Risk assessment tools can vary, depending upon the end user and the styles and interests within a particular group. Assumptions can easily be built into models and codes, and the user must be aware of this for a specific application. Many engineered systems can be analyzed on the basis of the general construct outlined above. There is a class of hazards that do not arise from a sudden or catastrophic failure of an engineered system. Rather, these hazards are continuously present in our environment and pose a risk in their own right. These risks are discussed below.

30.4 Engineering Risk Versus Environmental Risk

In order to bring risk insights to a particular assessment or decision, a method is needed to calculate the risk of the condition or activity in question. There is a wide range of activities and conditions for which risk tools have been used to calculate risk. The scope, depth of analysis, end products, and mode of inquiry are essential aspects of a risk assessment that define the particular methodology. For example, in the area of health science (or environmental) risk assessment, the top-level approach is to address:

• hazard identification;
• dose response assessment;
• exposure assessment;
• risk characterization.

These four activities are well suited to a situation in which a hazard is continuously present (this could be regarded as a “chronic” risk) and the risk needs to be evaluated. For engineered systems, the favored top-level approach to risk assessment addresses the so-called risk triplet:

• What can go wrong?
• How likely is it?
• What are the consequences?

This approach is naturally suited to “episodic” risks, where something fails or breaks, a wrong action is taken, or a random act of nature (such as a tornado or a seismic event) occurs. This is the approach that has been taken, with much success, in probabilistic risk assessments of nuclear power reactors. A White Paper on Risk-Informed and Performance-Based Regulation [24] that was developed by the US Nuclear Regulatory Commission (NRC) has an evident orientation toward reactor risks that are episodic in origin.

The environmental risk assessment methodology and the engineered system methodology do have elements in common, particularly in the dose and exposure areas. Further, in the development of physical and biomedical models in the environmental area, uncertainties in the parameters of the models and in the models themselves are expressed probabilistically. This is also sometimes the case in the engineered system area. Is it more appropriate to use an environmental risk approach or an engineered system risk approach for the risk assessments that are associated with waste management conditions? Is the distinction worth making?


To understand these questions a bit more, it is instructive to introduce some related concepts. In the waste and materials areas, the USNRC notes [25] that there are methodologies (not quite probabilistic risk assessments) that have been useful. Notably, performance assessment (PA) is the method of choice for evaluation of the risk posed by a high-level waste repository. The USNRC paper defines PA to be a type of systematic safety assessment that characterizes the magnitude and likelihood of health, safety, and environmental effects of creating and using a nuclear waste facility. A review of some applications of PA indicates that PRA and PA appear to be very similar methodologies. In a recent paper, Eisenberg et al. [3] state the connection very succinctly and directly: “performance assessment is a probabilistic risk assessment method applied to waste management”. Further, in a paper on this subject, Garrick, in [2], notes that: “(PRA) is identified with the risk assessment of nuclear power plants; probabilistic performance assessment (PPA), or just PA, is its counterpart in the radioactive waste field”. He adds that in the mid-1990s the USNRC began to equate PRA with PA.

30.5 Risk Measures and Public Impact

One benefit of performing a PRA is to have measures to compare with a risk standard or goal. This, of course, is not the only value of PRA. It can be especially beneficial in measuring changes in risk related to modifications in design or operational activities. Risk measures can be used as an aid to performing risk management and for structuring risk management programs that address unlikely, extreme consequence events and also less extreme, more likely events. In order to implement an effective risk management program, an appropriate set of risk measures (or indices) must be defined.

One approach to establishing risk criteria is to define and characterize a set of extreme consequence conditions (e.g. a large, uncontrolled release of a hazardous material). In this approach, if the operation is managed to prevent, control, or mitigate the extreme conditions, then less extreme conditions and events ought to be managed through this process as well. There are at least two potential flaws in this approach. One is that less extreme conditions or events may not be captured easily, or at all, by the approach. This is sometimes referred to as the “completeness” issue. Another problem is that failures and consequences that are necessary but not sufficient conditions for the extreme event could occur individually or in combination with other less extreme events. In both situations the less extreme events that might occur could have significant public impact or reaction in their own right.

An example of the first problem is the tritium leakage into the groundwater at Brookhaven National Laboratory from its High Flux Beam Reactor that was reported in 1997. Risks from the operation of this reactor were managed through several mechanisms: procedures, guidelines, orders, and oversight by the US Department of Energy; the Laboratory’s own environment, safety and health program; and requirements and agreements with other federal (US Environmental Protection Agency), state (New York Department of Environmental Conservation), and local (Suffolk County Department of Health) organizations.

As part of the Laboratory’s program, a PRA was performed approximately 10 years ago. The study focussed on events that could damage the reactor core and potentially lead to onsite and offsite radiological doses arising from the release of fission products from the damaged fuel. The hypothetical airborne doses were extremely low both onsite and offsite. Nevertheless, the risk assessment was used by the Laboratory to aid in decision-making regarding operation and potential safety upgrades.

The PRA did not focus on potential groundwater contamination because the downward penetration of damaged fuel was assessed to be very unlikely for the scenarios considered. The study, however, also did not address the routine contamination of the water in the spent fuel pool due to tritium (a byproduct of normal reactor operation). Nor did it address the potential leakage of water from the spent fuel pool. The leakage of tritiated water from the spent fuel pool resulted in the detection of concentrations of tritium in the groundwater which exceeded the safe drinking water standard determined by the US EPA.

The contamination of the groundwater, in this case a sole source aquifer, led to a large public impact, which included the firing of the contractor for the Laboratory and sustained public concern and scrutiny of the Laboratory. In principle, the probabilistic risk assessment performed 10 years earlier could have identified this failure scenario if it had: (1) included the EPA safe drinking water standard for tritium in its set of risk criteria; (2) identified the spent fuel pool as a source of contamination; (3) recognized the potential for leakage; and (4) identified the tritium source term in the pool water. However, the criteria and assessment tools were not focussed in this area and thus were not used to manage this risk and prevent the ensuing scenario.

Events occur that are not catastrophic but nevertheless capture the public attention. This can be due to the lack of public awareness of the significance (or insignificance) of the event. It could also be due to the perception that a less significant event undermines confidence in a technology and thereby portends a more severe, dreaded event. In either case, the occurrence of a less significant event undermines the case for the manager of the risks of the given technology. Thus the risk manager is faced with the task of managing both the risk of the less likely, higher consequence accidents and the risk of the more likely, lower consequence accidents.

One option would be to set stringent criteria for failure of systems, structures, components, and operations that define the performance and safety envelope for the technology in question. This would minimize the likelihood of the occurrence of small accidents (say, below once in the lifetime of the facility’s operation or, more stringently, once in the lifetime of the technology’s operation). This option may be effective in eliminating the small accident, but it may be prohibitively costly.

Another option would be to develop a public education program that would sufficiently describe the technology and its potential off-normal behavior so that the small accidents would not result in a large public impact. Instilling a sense of trust in the public for the management of a complex technology would be key to assuring that reactions to off-normal events would not be disproportionate to their severity. The communication of risks is a challenging process, and the acceptance of a small or insignificant risk by the public is difficult, especially when the benefits of a technology are not easily perceived.

These extreme options address the overall issue from either the cause (small accident) or the effect (large public impact). Let us further explore the issue from the perspective of the cause, i.e. consideration of risk measures that would aid in the management of the risks.

As the example from Brookhaven National Laboratory illustrates, it is important, when building models of the risk of a facility, to assure that they are able to address the relevant compliance criteria. This is an example in which the risk models must be as complete as practical to capture the risk envelope.

The case in which a subsystem failure leads to a small accident can perhaps be managed through an approach based on the allocation of reliabilities. Systems, components, structures, and operational activities typically have a finite failure probability (or conversely, a finite reliability). A decision-theoretic approach to the allocation of reliability and risk was developed by Brookhaven National Laboratory [26, 27]. This was done in the context of the formulation of safety goals for nuclear power plants. In this approach the authors [26, 27] started with top-level safety goals for core damage and health risk and derived a subordinate set of goals for systems, components, structures, and operation. The methodology was based on optimization of costs (as a discriminator of options) for the allocation of subordinate goals that are consistent with the top-level goals. The method required that a probabilistic risk assessment model be available for the plant and that the costs of improving (or decreasing) the reliability of systems, components, structures, and operations could be specified. Because there are usually far more of these lower-tier elements than there are top-level goals, there will be a multiplicity of ways to be consistent with the top-level goals. The methodology searches for and finds those solutions that are non-inferior in the top-level goals. Thus, it will yield those sets of subordinate goals, consistent with the top-level goals, that do not yield less desirable results across the entire set of top-level goals. The decision-maker must perform a preference assessment among the non-inferior solutions.
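The idea of non-inferior solutions can be illustrated with a brute-force Pareto filter over a handful of candidate allocations. In this toy sketch the subsystems, failure probabilities, costs, and the crude series risk model are all invented; it is not the multiobjective programming formulation of [26, 27], only an illustration of what “non-inferior” means.

    # Toy sketch of non-inferior (Pareto) filtering for reliability
    # allocation. Candidate allocations and their scores are invented.
    from itertools import product

    # Three subsystems, each with candidate (failure prob, cost) options.
    options = {
        "injection": [(1e-3, 1.0), (1e-4, 4.0)],
        "power":     [(5e-4, 2.0), (5e-5, 9.0)],
        "cooling":   [(2e-3, 0.5), (2e-4, 3.0)],
    }

    def dominates(a, b):
        """a dominates b if no worse in both objectives and better in one."""
        return all(x <= y for x, y in zip(a, b)) and a != b

    candidates = []
    for combo in product(*options.values()):
        ok = 1.0
        for p, _ in combo:          # crude series model: any subsystem
            ok *= (1.0 - p)         # failure fails the overall function
        risk = 1.0 - ok
        cost = sum(c for _, c in combo)
        candidates.append(((risk, cost), combo))

    non_inferior = [c for c in candidates
                    if not any(dominates(o[0], c[0]) for o in candidates)]

    for (risk, cost), combo in sorted(non_inferior):
        print(f"risk={risk:.2e}  cost={cost:.1f}")

With only two options per subsystem the full enumeration is trivial; the cited methodology instead solves the allocation as an optimization problem, but the surviving set plays the same role: it is the menu over which the decision-maker expresses preferences.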

Once these solutions are obtained, it is then possible to manage the lower-tier risks and hopefully minimize the likelihood of a “smaller” accident occurring. The methodology presents an interesting possibility for events that would have a large public impact. In addition to recognizing the need to minimize the economic penalties of changes in reliability of subtier elements, it may be possible to account for the psychological value of subtier element failures. In principle, one would give weightings to those failures that result in different public impact. The resulting solutions will still be consistent with the top-level goals of protecting public physical health.

An area in which risk-informed criteria setting is currently being considered [25] for less severe events is the regulation of nuclear materials uses and disposal. In this area, the risks to workers can be greater than the risk to the public, particularly for the use of nuclear byproduct materials [28]. Nevertheless, events that have occurred while causing no health impact to the public have raised concern. For example, in [28], the authors’ objective was to consider both actual risks and risks perceived by the general public. The byproduct material systems considered in [28] included medical therapy systems, well logging, radiography, and waste disposal. The results of this study suggest that, because of the high variability between radiological doses for the highest-dose and lowest-dose systems analyzed, the approaches to management and regulation of these systems should be qualitatively different. The authors observed that risk for these systems was defined in terms of details of design, source strength, type of source, radiotoxicity, chemical form, and where and how the system is used. This, in turn, makes it difficult to characterize the risk of the system, to pose criteria based on this risk, and then to manage the risk.

Another area that has received attention recently is electrical energy system reliability. This was underscored during the summer of 1999, when power outages disrupted the lives of millions of people and thousands of businesses in regions of the USA. In response to public concerns, the US Department of Energy commissioned a study of these events. This is detailed in a report [29] that was authored by various experts on energy transmission and delivery systems. This report makes recommendations that are very relevant to the subject of this chapter. In particular, the authors recommend that there be mandatory standards for electric power systems. They note that our interconnected electric power system is being transformed from one that was designed to serve customers of full-service utilities (each integrated over generation, transmission, and distribution functions) to one that will support a competitive market. Because of this change, they observe that the current voluntary compliance with reliability standards is inadequate for ensuring reliability, and recommend mandatory standards for bulk-power systems to manage the emerging restructured electric power industry. The study also recommends that the US government consider the creation of a self-regulated reliability organization with federal oversight to develop and enforce reliability standards for bulk-power systems as part of a comprehensive plan for restructuring the electric industry.

Adverse public impact is always an undesirable outcome of a technological enterprise. This will be the case if there is a direct physical effect on the public or if there is only the perception of risk. In order to use risk management tools effectively to minimize adverse impact, it is important to understand the range of undesirable outcomes and then set criteria (or adopt them) against which performance can be measured. Clearly, resources would have to be allocated to a risk management program that accounts for a broad range of risks, from the high-consequence/low-likelihood to the low-consequence/high-likelihood end of the spectrum. Both pose risks to the successful operation of a facility or operation, and the prudent risk manager will assure adequate protection across the entire range.

30.6 Transition to Risk-informed Regulation

PRAs have been performed for many nuclear power plants. Individual Plant Examinations [30] were performed by the plant owners beginning in the late 1980s and continuing through much of the 1990s. These studies were specifically scoped PRAs [31] designed to provide an increased understanding of the safety margins contained within the individual plants. These studies led to the identification of specific vulnerabilities at some of the plants and to greater insight, by the owners, into improved operation of their facilities. Over the same time frame, studies were undertaken of the risk posed by power reactors while in shutdown or low power conditions [32]. These studies resulted in an enhanced understanding of the overall risk envelope for reactor operation and led to improvements in activities associated with these other modes of operation.

More recently, the NRC has been sponsoring research on the development of approaches to risk-informing 10 CFR 50. This includes a framework document [33] and specific applications of an approach to risk-informing 10 CFR 50.44 and 10 CFR 50.46. The former refers to combustible gas control in containment and the latter refers to the requirements on the emergency core cooling function. Progress on these activities will be reported in a future publication by the NRC and its contractors.

There has also been work performed recently [34] on the development of standards for the performance of probabilistic risk assessments. The need for standards has long been recognized. A fundamental dilemma has been how to develop a standard in an area that is still emerging and for which improved methods may yet be in the offing. Early attempts at standardization can be found in NUREG/CR-2300 and NUREG/CR-2815. The advantage of a standard to the industry is that if it submits a request for a regulatory review to the NRC that is based on PRA methods, and if the work is done within an accepted standard, then the review would not be encumbered by a case-specific review of the methodology itself.

30.7 Some Successful Probabilistic Risk Assessment Applications

There have been many accomplishments, which have in turn had a lasting impact on improved safety. For example, the focus on the impact of human performance on nuclear power plant risk has been underscored in many PRAs. These have led the way to the formulation and execution of research programs on human performance by the NRC, the industry, and organizations in many countries. This has led to valuable insights that have affected emergency plans, accident management programs, and the day-to-day efficient operation of each unit.

The “Level 2” portion of the PRAs has been invaluable in providing great insight into the performance of containment under severe accident loads. This includes the timing of pressure and temperature loads as well as the behavior of combustible gases and core debris within containment. The relative benefits of each containment type are now far better understood because the phenomenology was evaluated and prioritized within the context of a Level 2 PRA.

Risk perspectives from performing PRAs for various modes of power operation were also of great benefit to the regulator and to the industry. These studies led to utilities having enhanced flexibility in managing outages and in extending allowed outage times. Further, the studies led to better decision-making by the NRC staff and to more effective incorporation of risk information in that process [35].

The Individual Plant Examination Program [30], which was essentially a PRA enterprise, led to many improvements, made directly by the owners, in the procedures and operations of their plants. The program also led to improvements in systems, structures, and components, and to insights for improved decision-making by the regulator and the industry.

30.8 Comments on Uncertainty

Much has been said and written over the past two decades on the topic of uncertainty. Hopefully we are coming to closure on new insights on this subject. It has been a strength of PRA that it lends itself well to the expression of uncertainties. After all, uncertainty is at the very heart of risk. On the other hand, the elucidation of uncertainties by PRA has been taken by some to be a limitation of the PRA methodology (i.e. it deals in vagaries and cannot be trusted). It is also unfortunate that PRA practitioners and advocates have so often felt the need to dwell on the limitations of PRA. However, it should be recognized that deterministic methodologies are also fraught with uncertainties, but they are not as explicitly expressed as they are by the PRA methodologists. PRA allows decision-makers to be informed about what they know about what they do not know. And this is valuable information to have.

A good state-of-the-art presentation of papers on the subject of uncertainty in risk analysis can be found in a special issue of the journal Reliability Engineering and System Safety [36]. Among the 14 papers in this special volume are discussions of aleatory and epistemic uncertainties and applications to reactor technology and to waste management. In a recent paper, Chun et al. [37] consider various measures of uncertainty importance and give examples of their use in particular problems.
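The flavor of an uncertainty importance measure of this kind can be conveyed by a small Monte Carlo sketch: each input is scored by how far the output distribution shifts when that input’s uncertainty is removed. The model, the distributions, and the simple quantile-distance metric below are invented, and only loosely echo the CDF-based measure of [37].

    # Sketch of an uncertainty importance measure: how much does fixing
    # one input at its median shift the output distribution? The model
    # y = x1 * x2 + x3 and all distributions are hypothetical, and the
    # distance metric (mean absolute quantile shift) is only loosely
    # modeled on the CDF-based measure of Chun et al. [37].
    import random, math

    random.seed(2)
    N = 20_000
    DISTS = {  # lognormal(mu, sigma) parameters for each input
        "x1": (math.log(1e-3), 0.8),
        "x2": (math.log(2.0), 0.3),
        "x3": (math.log(5e-4), 1.2),
    }

    def sample_output(fixed=None):
        """Sorted output samples; 'fixed' holds one input at its median."""
        out = []
        for _ in range(N):
            v = {}
            for name, (mu, sigma) in DISTS.items():
                v[name] = (math.exp(mu) if name == fixed
                           else random.lognormvariate(mu, sigma))
            out.append(v["x1"] * v["x2"] + v["x3"])
        return sorted(out)

    base = sample_output()
    for name in DISTS:
        shifted = sample_output(fixed=name)
        # Distance between the two empirical quantile functions.
        d = sum(abs(a - b) for a, b in zip(base, shifted)) / N
        print(f"importance of {name}: {d:.2e}")

Inputs whose removal barely moves the output distribution are unimportant to the overall uncertainty, so resources for better data or analysis can be directed at the inputs with the largest scores.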

30.9 Deterministic, Probabilistic, Prescriptive, Performance-based

Four concepts that are central to the use of PRA in regulation are: deterministic, probabilistic, prescriptive, and performance-based. These concepts are sometimes used conjunctively, and sometimes interchanged. Figure 30.1 provides a two-dimensional view of how they should be correctly related. Prescriptive and deterministic are sometimes confused and used interchangeably. They tend to be within the comfort zone of some participants in regulatory matters, and perhaps that is the source of the interchangeability. Deterministic is really an approach to analysis, and its opposite is probabilistic. Prescriptive is an approach to decision-making, and its opposite is performance-based. Most analyses are not purely probabilistic or deterministic but an admixture of the two extremes. A classic “Chapter 15” safety analysis (required by the NRC for nuclear power plant applications) really starts by determining, however informally, what is likely and unlikely. Similarly, a probabilistic analysis as embodied in a PRA usually has some form of deterministic analysis, e.g. a heat transfer calculation. The important point is that this is a two-dimensional space, with various regulatory activities falling in different parts of the plane. For example, if one were interested in compliance with a numerical safety criterion related to the likelihood of an event, it would fall in the lower right quadrant.

Figure 30.1. Two-dimensional view of risk-related concepts

30.10 Outlook

In the early days of PRA, both regulators and the industry for the most part were not very comfortable with PRA and the use of risk management concepts to guide decisions with regard to the safe operation of plants. This has changed, in an evolutionary way, over the past quarter century. Now it is quite the custom to see, hear, and read of exchanges between the industry and the NRC (and among the organizations themselves) in which concepts like risk assessment and risk management form the currency for their exchange of “safety thought”. There has been an ever-increasing attempt by the NRC and the industry to bring other stakeholders (states, public interest groups, and concerned citizens) into the communication loop on safety matters. However, the process, while moving forward, involves the exchange of a different currency with regard to safety. This can be termed risk perception or, more broadly, risk communication (see Figure 30.2). These are very important aspects of the safety enterprise and require much attention and development. It should be noted that the simple figure does not capture the fact that the indicated modes of currency are not exclusive to any of the interchanges. Each interchange is really an admixture of the two indicated types. The figure just indicates the predominant exchange mode.

Safety research, being an inductive discipline, will always have its frontiers. The following are three areas that might benefit from the now vast repertoire of knowledge and methodology that is contained in the PRA field.

Figure 30.2. Interactions among participants in the risk-related enterprise

Much of the work in PRA has focussed on the severe accident. This type of accident (core melt, large radiological release to the environment) tends to be unacceptable to all parties concerned. Fortunately, they are also very unlikely events, not expected to occur in the lifetime of a facility. There are events, however, that are much more likely, that do occur and, fortunately, have little or no health or environmental impact. Yet they do attract public attention and require much attention by all parties involved. A recent example of such an event is the steam generator tube rupture at a particular plant. The offsite radiological release was insignificant, but the event drew much attention from the press and the political representatives (and therefore the NRC and the industry). From the PRA perspective, this was a very low risk event. While the PRA methodology can easily express the risk of such a small consequence event, can it be used to flag such events? Can these risks be managed? How should they be managed, and by whom?

A related topic is risk perception. It is well known that an analytical quantification of health and environmental risks, an objective expression of reality, does not necessarily represent the risk that an individual or a group perceives. Further, it is sometimes said that perception is reality. The question is: can psychosocial measures of risk be defined and calculated in a way that is comparable to physical risks? If so, can risk management programs be developed that recognize these measures?

Finally, it should be recognized that the NRC and the civilian nuclear power industry have been at the vanguard of PRA methods development and applications. This has, no doubt, contributed to the safety and reliability of this technology. Other organizations and industries would benefit from the advances made in the uses of PRA, and there should be a wider-scale adoption (and adaptation) of these methods and applications.

Acknowledgment

I am pleased to thank J. Lehner for a review of this manuscript and for helpful comments and suggestions.

References

[1] See the website for this organization for information on past conference proceedings: www.iapsam.org
[2] Kirchsteiger C, Cojazzi G, editors. Proceedings of the Workshop on the Promotion of Technical Harmonization on Risk-Based Decision-Making. Organized by the European Commission, Stresa, Italy; May 22–24, 2000.
[3] Eisenberg NA, Lee MP, McCartin J, McConnell KI, Thaggard M, Campbell AC. Development of a performance assessment capability in waste management programs of the USNRC. Risk Anal 1999;19:847–75.
[4] WASH-1400, Reactor Safety Study: An assessment of accident risks in commercial nuclear power plants. US Nuclear Regulatory Commission. NUREG-75/014; October 1975.
[5] Theoretical possibilities and consequences of major nuclear accidents at large nuclear plants. US Atomic Energy Commission. WASH-740; March 1957.
[6] Lewis HW, Budnitz RJ, Kouts HJC, Loewenstein WB, Rowe WD, von Hippel F, et al. Risk assessment review group report to the US Nuclear Regulatory Commission. NUREG/CR-0400; September 1978.
[7] PRA Procedures Guide: A guide to the performance of probabilistic risk assessments for nuclear power plants. US Nuclear Regulatory Commission. NUREG/CR-2300; January 1983.
[8] Bari RA, Papazoglou IA, Buslik AJ, Hall RE, Ilberg D, Samanta PK. Probabilistic safety analysis procedures guide. US Nuclear Regulatory Commission. NUREG/CR-2815; January 1984.
[9] Fullwood RR. Probabilistic safety assessment in chemical and nuclear industries. Boston: Butterworth-Heinemann; 2000.
[10] Drouin MT, Harper FT, Camp AL. Analysis of core damage frequency from internal events: methodological guides. US Nuclear Regulatory Commission. NUREG/CR-4550, vol. 1; September 1987.
[11] Wheeler TA, et al. Analysis of core damage frequency from internal events: expert judgement elicitation. US Nuclear Regulatory Commission. NUREG/CR-4550, vol. 2; 1989.
[12] Reactor risk reference document. US Nuclear Regulatory Commission. NUREG-1150; February 1987.
[13] Vesely WE, Goldberg FF, Roberts NH, Haasl DF. Fault tree handbook. US Nuclear Regulatory Commission. NUREG-0492; January 1981.
[14] Worrell WB. SETS reference manual. Electric Power Research Institute. NP-4213; 1985.
[15] Swain AD, Gutmann HE. Handbook of human reliability analysis with emphasis on nuclear power plant applications. US Nuclear Regulatory Commission. NUREG/CR-1278; August 1983.
[16] Hannaman GW, Spurgin AJ. Systematic human action reliability procedure (SHARP). Electric Power Research Institute. NP-3583; June 1984.
[17] Bohn MP, Lambright JA. Recommended procedures for simplified external event risk analyses. US Nuclear Regulatory Commission. NUREG/CR-4840; February 1988.
[18] Bari RA, Pratt WT, Lehner J, Leonard M, DiSalvo R, Sheron B. Accident management for severe accidents. Proc Int ANS/ENS Conf on Thermal Reactor Safety. Avignon, France; October 1988. pp. 269–76.
[19] MELCOR 1.8.5. Available by request from the US Nuclear Regulatory Commission at www.nrc.gov/RES/MELCOR/obtain.html
[20] Henry RE, et al. MAAP4—Modular accident analysis program for LWR power plant. Computer code manual 1-4. Electric Power Research Institute; May 1994.
[21] Theophanous TG, et al. The probability of Mark I containment failure by melt-attack of the liner. US Nuclear Regulatory Commission. NUREG/CR-6025; 1993.
[22] Pilch MM, et al. US Nuclear Regulatory Commission. NUREG/CR-6075; 1994.
[23] Chanin D, Young ML. Code manual for MACCS2. US Nuclear Regulatory Commission. NUREG/CR-6613; May 1998.
[24] Risk-informed and performance-based regulation. US Nuclear Regulatory Commission. www.nrc.gov/NRC/COMMISSION/POLICY/whiteppr.html
[25] Framework for risk-informed regulation in the Office of Nuclear Material Safety and Safeguards. US Nuclear Regulatory Commission. SECY-99-100; March 31, 1999.
[26] Papazoglou IA, Cho NZ, Bari RA. Reliability and risk allocation in nuclear power plants: a decision-theoretic approach. Nucl Technol 1986;74:272–86.
[27] Cho NZ, Papazoglou IA, Bari RA. Multiobjective programming approach to reliability allocation in nuclear power plants. Nucl Sci Eng 1986;95:165–87.
[28] Schmidt ER, et al. Risk analysis and evaluation of regulatory options for nuclear byproduct material systems. US Nuclear Regulatory Commission. NUREG/CR-6642; February 2000.
[29] Report of the US Department of Energy’s Power Outage Study Team. Findings and recommendations to enhance reliability from the summer of 1999. US Department of Energy; March 2000. www.policy.energy.gov/electricity/postfinal.pdf
[30] Individual Plant Examinations Program. Perspectives on reactor safety and plant performance. US Nuclear Regulatory Commission. NUREG-1560; December 1997.
[31] Individual Plant Examination of severe accident vulnerabilities. 10 CFR 50.54(f). US Nuclear Regulatory Commission. Generic Letter GL-88-20; November 23, 1988.
[32] Perspectives report on low power and shutdown risk. PDF attachment to proposed staff plan for low power and shutdown risk analysis research to support risk-informed regulatory decision-making. US Nuclear Regulatory Commission. SECY 00-0007; January 12, 2000. http://www.nrc.gov/NRC/COMMISSION/SECYS/2000-0007scy.html
[33] Framework for risk-informed regulations. US Nuclear Regulatory Commission. Draft for Public Comment, Rev. 1.0; February 10, 2000. http://nrc-part50.sandia.gov/Document/framework_(4_21_2000).pdf
[34] Standard for probabilistic risk assessment for nuclear power plant applications. Rev. 12 of Proposed American National Standard. American Society of Mechanical Engineers; May 30, 2000.
[35] An approach for using probabilistic risk assessment in risk-informed decisions on plant-specific changes to the licensing basis. US Nuclear Regulatory Commission. Regulatory Guide 1.174; July 1998.
[36] Reliab Eng Syst Safety 1996;54:91–262.
[37] Chun M-H, Han S-J, Tak N-I. An uncertainty importance measure using a distance metric for the change in a cumulative distribution function. Reliab Eng Syst Safety 2000;70:313–21.


Chapter 31

Total Dependability Management

Per Anders Akersten and Bengt Klefsjö

31.1 Introduction
31.2 Background
31.3 Total Dependability Management
31.4 Management System Components
31.5 Conclusions

31.1 Introduction

Different methodologies and tools are available for the management and analysis of system dependability and safety. In the different phases of a system’s lifecycle, some of these methodologies and tools are appropriate, and some are not. The choice is often made by the dependability or safety professional, according to personal experience and knowledge. However, this choice should be influenced by some management philosophy, based on a number of core values. In the literature, core values are sometimes referred to as principles, dimensions, elements, or cornerstones.

The core values constitute a basis for the culture of the organization. They have a great influence on the choice of strategies for accomplishing different kinds of goals, e.g. regarding dependability, safety, and other aspects of customer satisfaction. The strategies will in turn include the use of appropriate methodologies and tools.

In this chapter, a number of core values are discussed and their influence on the choice of methodologies and tools briefly analyzed. The aim is to present one approach to the application of a holistic view to the choice of methodologies and tools in dependability and safety management. Moreover, the aim is to give a conceptual base for the choice of appropriate methodologies and tools. The approach described is very general. The sets of core values, methodologies, and tools presented are by no means complete. They should be regarded as examples only.

The main objective is to make clear the usefulness of clearly expressed core values in the choice and design of strategies for the management of dependability and safety in all phases of a system’s lifecycle. It is important that an implementation of a dependability and safety management system focuses on the totality. Just picking a few methodologies or tools will not be sufficient.

A comparison is made to the structure of the IEC Dependability Standards, consisting of “Standards on Dependability Management”, “Application Guides”, and “Tools”. However, the concept of “Tools”, used in the IEC Dependability Standards structure, is wider than the concept of “Tool” used in this chapter.

31.2 Background

Dependability management has been an important issue for many years. Many articles and books, both in the academic and popular press, have been written on the subject of dependability; not very many, however, on dependability management. A number of authors have addressed the subject of safety management. Still this number is small, compared to the number of articles and books on total quality management (TQM).

Figure 31.1. The core values which are the basis for the Malcolm Baldrige National Quality Award, the European Quality Award, and the Swedish Quality Award (SQA, established in 1992 by the Swedish Institute for Quality and at that time strongly based on the Malcolm Baldrige Award). The values are taken from the versions of 2002.

Interesting parallels can be drawn. Both dependability management and TQM are often identified as necessary to reach competitiveness, and there are also examples of companies that have failed in their implementation of dependability management or TQM. However, the concept of TQM is more generally known, and it has been discussed on very many levels, sometimes clarifying the concept, sometimes not.

Still there exist different opinions about TQM. There are many vague descriptions and few definitions of what TQM really is. Dependability management, by contrast, has not been described in as many different ways. According to the authors’ view, the most comprehensive description is the result of work performed by the International Electrotechnical Commission (IEC) within its Technical Committee 56 “Dependability”. See e.g. the IEC Standards [1–3]. (Note: these standards are currently under revision.)

Here we will discuss some of the problems with dependability management and describe and discuss our own view of dependability management as a management system consisting of values, methodologies, and tools.

31.3 Total Dependability Management

The concept of TQM is generally understood, and often also described, as some form of “management philosophy” based on a number of core values, such as customer focus, fact-based decisions, process orientation, everybody’s commitment, fast response, result orientation, and learning from others; see Figure 31.1. What are here called core values are in the literature also named principles, dimensions, elements, or cornerstones. We prefer the term core value, since it is a way to emphasize that these statements should work together to constitute the culture of the organization, and that they accordingly are basic concepts.


Figure 31.2. A dependability management system seen as a continuously evolving management system consisting of core values, methodologies, and tools. It is important to note that the methodologies and tools in the figure are just examples and not a complete list. In the same way, the values may also vary a little between different organizations and over time.

A literature study [4] shows that a number of core values seem to be common in most descriptions of TQM, namely: “focus on customers”, “management commitment”, “everybody’s commitment”, “focus on processes”, “continuous improvements”, and “fact-based decisions” [5]. Further literature studies and informal contacts with a number of Swedish companies indicate that the same core values are common also in the context of dependability management.

31.4 Management System Components

Dependability management contains much more than core values. Like TQM it is a management system in the sense of Deming, i.e. as “a network of interdependent components that work together to try to accomplish the aim of the system” [6], p. 50.

A very important component is constituted by the core values. These core values are the basis for the culture of the organization and also the basis for the goals set by the organization. Another component is a set of methodologies, i.e. ways to work within the organization to reach the goals. A methodology consists of a number of activities performed in a certain order. Sometimes it is appropriate to describe a methodology as a process. A third component is a set of tools, i.e. rather concrete and well-defined tools, which sometimes have a statistical basis, to support decision-making or facilitate analysis of data. These three components are interdependent and support each other; see Figure 31.2.

We believe that it is important to classify different terms related to dependability management according to any of the three components. For instance, FMECA (Failure Mode, Effects, and Criticality Analysis [7]) has often been considered as a tool. One reason for this is the very strong connection to the tabular FMECA sheet. Another example is QFD (Quality Function Deployment [8]), which often has been looked upon as a tool, mainly due to the confusion between QFD and the Quality House. However, QFD is “a system for translating consumer requirements into appropriate company requirements at each stage from research and product development to engineering and manufacturing to marketing/sales and distribution” [9], and is therefore, in our opinion, a methodology. The Quality House, on the other hand, is a tool to be used within that methodology. Another example is risk analysis [10], a methodology consisting of a number of steps, involving the use of different tools, e.g. hazard identification. By consistently using a terminology based on core values, methodologies, and tools, the “concepts” used within dependability will be clarified.

Figure 31.3. The potential for mutual support between core values and methodologies. It must be emphasized that the lists of core values and methodologies are by no means complete, and the strengths of relations are based on the authors’ present judgments. The table here is just an illustration of the idea.

This management system is continuously evolving. Over time, some core values might change, and in particular the interpretation of some of them might be developed. New methodologies will also appear or be transferred from other management theories. New tools will be developed or taken from other disciplines.

One of the things that is important to notice is that the management system really should be looked upon as a system. It is often the case that a certain core value is necessary for the successful implementation of a chosen methodology. At the same time, the systematic use of a methodology gives more or less support for the acceptance and establishing of different core values. For example, the core value of “everybody’s commitment” cannot be implemented without the appropriate use of suitable methodologies. There is a potential for implementation of this core value, e.g. in the methodologies of FMECA, Reliability Centered Maintenance (RCM [11]) and Loss Control Management (LCM [12]). However, the methodologies chosen will not work efficiently without the skillful use of specific tools. As another example we can mention the value of “focus on customers”. Here QFD is one useful methodology and the Quality House is then one useful tool for a systematic transformation. The potential of different methodologies to support core values is illustrated in Figure 31.3. In this figure we indicate a weak or strong relationship. Usually we can identify a mutual support in the sense that a certain core value is necessary for the successful implementation of a chosen technique. At the same time, the systematic use of the technique gives strong support for the acceptance and establishing of the core value.

Figure 31.4. Scheme for the choice and implementation of various tools. It must be emphasized that the lists of methodologies and tools are by no means complete, and the usefulness of tools is based on the authors’ present judgments. The table here is just an illustration of the idea.

There are several benefits of this system view of dependability management. One is that it emphasizes the role of top management. However, it is not obvious that the core value of top management commitment is automatically supported by the formal use of a single methodology. The way work is performed within this methodology is decisive. As a parallel, there is evidence that many of the organizations that have failed with TQM have not had sufficient top management commitment [13–15].

An important consequence of the use of a dependability management system, as described above, is that it focuses on the totality. Hopefully it will reduce the risk that an organization picks up just a few parts of the system. We have experienced companies failing to implement dependability work due to the use of only small parts of the system. They pick up one or a few tools or methodologies and believe that these will solve their problems. Illustrations of this are the non-systematic use of FMECA sheets and reliability prediction handbooks some decades ago.

We have to start with the core values and ask: which core values should characterize our organization? When that is decided we have to identify methodologies that are suitable for our organization to use and which support our values. Finally, from that decision the suitable tools have to be identified and used in an efficient way to support the methodologies. Figure 31.4 gives an example of a scheme, useful for the choice and systematic implementation of appropriate tools. It is, of course, important to note that a particular methodology can support different core values and the same tool can be useful within many methodologies. If we can use such methodologies and tools we support several values, which of course benefits the culture.

The idea of using this type of system approach is not quite new. A related approach is described in [16], giving a TQM definition based on a management system perspective.

31.5 Conclusions

The authors strongly believe that the system view of dependability management will make it easier for organizations to work with dependability matters, since things are “put together to create a whole”. Total dependability management builds upon the availability and use of all three management system components: a set of supporting core values, an appropriate methodology, and the skillful use of contextually adapted tools.

References

[1] IEC Standard 60300-1. Dependability management. Part 1. Dependability programme management. Geneva: International Electrotechnical Commission; 1993.
[2] IEC Standard 60300-2. Dependability management. Part 2. Dependability programme elements. Geneva: International Electrotechnical Commission; 1995.
[3] IEC Standard 60300-3-1. Dependability management. Part 3. Application guide. Section 1. Analysis techniques for dependability: guide on methodology. Geneva: International Electrotechnical Commission; 1991.
[4] Hellsten U. The Springboard—a TQM-based tool for self-assessment. Licentiate Thesis 1997: No. 42. Division of Quality Technology & Statistics, Luleå University of Technology; 1997.
[5] Bergman B, Klefsjö B. Quality from customer needs to customer satisfaction. London: McGraw-Hill and Lund: Studentlitteratur; 1994.
[6] Deming WE. The new economics for industry, government, education; 2nd Edition. Massachusetts Institute of Technology, Center for Advanced Studies; 1994.
[7] Stamatis DH. Failure mode and effects analysis. FMEA from theory to execution. Milwaukee: ASQ Press; 1995.
[8] Akao Y. Quality Function Deployment. Integrating customer requirements into product design. Cambridge: Productivity Press; 1990.
[9] Slabey WR. In: The Second Symposium on Quality Function Deployment/QPC, Automotive Division—ASQC and the American Supplier Institute, Novi, MI; 1990: 21–86.
[10] IEC Standard 60300-3-9. Dependability management. Part 3. Application guide. Section 9. Risk analysis of technological systems. Geneva: International Electrotechnical Commission; 1995.
[11] Anderson RT, Neri L. Reliability-centered maintenance. Management and engineering methods. London: Elsevier Applied Science; 1990.
[12] Bird FE, Germain GL. Practical loss control leadership; Revised Edition. Loganville: Det Norske Veritas (USA); 1996.
[13] Dahlgaard JJ, Kristensen K, Kanji GK. The quality journey: a journey without an end. Abingdon: Carfax Publishing Company; 1994.
[14] Oakland JS. Total quality management. The route to improving performance; 2nd Edition. Oxford: Butterworth-Heinemann; 1993.
[15] Tenner AR, DeToro IJ. Total Quality Management: three steps to continuous improvement. Reading, MA: Addison-Wesley; 1992.
[16] Hellsten U, Klefsjö B. TQM as a management system consisting of values, techniques and tools. TQM Magazine 2000;12:238–44.


Chapter 32

Total Quality for Software Engineering Management

G. Albeanu and Fl. Popentiu Vladicescu

32.1 Introduction
32.1.1 The Meaning of Software Quality
32.1.2 Approaches in Software Quality Assurance
32.2 The Practice of Software Engineering
32.2.1 Software Lifecycle
32.2.2 Software Development Process
32.2.3 Software Measurements
32.3 Software Quality Models
32.3.1 Measuring Aspects of Quality
32.3.2 Software Reliability Engineering
32.3.3 Effort and Cost Models
32.4 Total Quality Management for Software Engineering
32.4.1 Deming’s Theory
32.4.2 Continuous Improvement
32.5 Conclusions

32.1 Introduction

32.1.1 The Meaning of Software Quality

Defining quality, in general, is a difficult task. Different people and organizations define quality in a number of different ways. Most people associate quality with a product or service in order to satisfy the customer’s requirements. According to Uselac [1], “Quality is not only products and services but also includes processes, environment, and people”. Also, Deming [2], in his famous work, says “that quality has many different criteria and that these criteria change continually”. This is the main reason “to measure consumer preferences and to remeasure them frequently”, as Goetsch and Davis [3] observe. Finally, they conclude: “Quality is a dynamic state associated with products, services, people, processes and environments that meets or exceeds expectations”.

What about software? Software is the component that allows a computer or digital device to perform specific operations and process data. Mainly, software consists of computer programs. However, databases, files, and operating procedures are closely related to software. Any user deals with two categories of software: the operating system controlling the basic operations of the digital system, and the application software controlling the data processing for specific computer applications like computer-aided drafting, designing and manufacturing, text processing, and so on.

Like any other product, quality software is that which does what the customer expects it to do. The purpose for which a software system is intended is described in a document usually known as the user requirements specification. User requirements fall into two categories [4]:


1. capabilities needed by users to achieve an objective or to solve a problem;

2. constraints on how the objective is to be achieved or the problem solved.

Capability requirements describe functions and operations needed by customers. These include at least performance and accuracy attributes. Constraint requirements place restrictions on how software can be built and operated, and specify quality attributes such as reliability, availability, portability, and security. All these user requirements are translated into software requirements. In order to build a high-quality software product, these requirements should be rigorously described. The software requirements can be classified into the following categories: functional, performance, interface, operational, resource, portability, security, documentation, verification and acceptance testing, quality, reliability, safety, and maintainability.

Functional requirements specify the purpose of the software and may include performance attributes. Performance requirements specify numerical values for measurable variables (in general as a range of values).

A user may specify interface requirements for hardware, software, and communications. Software interfaces consist of operating systems and software environments, database management systems, and file formats. Hardware configuration is determined by the hardware interface requirements. The usage of special network protocols and other constraints of the communication interface are recorded as communication interface requirements.

Some users specify how the system will run and how it will communicate with human operators. Such user interface, man–machine, or human–computer interaction requirements (the help system, screen layout, content of error messages, etc.) are operational requirements.

For an end user, the specification of upper limits on physical resources (processing power, internal memory, disk space, etc.) is necessary, especially when future extensions are very expensive. These constraints are classified as resource requirements. The degree to which computing resources are used by the software explains an important quality factor: efficiency.

Following Ince [5], another quality factor is portability. “This is the effort required to transfer a system from one hardware platform to another”. Possible computers and operating systems, other than those of the target platform, should be clearly outlined as portability requirements.

Confidentiality, integrity, and availability are fundamental security requirements. According to Ince, “Integrity is used to describe the extent to which the system and its data are immune to access by unauthorized users”. Mathematically, integrity is the probability that a system operates without security penetration for a specified time when a rate of arrival of threats and a threat profile are specified. However, such a quality factor cannot be precisely measured. To increase security, interlocking operator commands, inhibiting of commands, read-only access, a password system, and computer virus protection are important techniques. As Musa stated [6], software availability is “the expected fraction of operating time during which a software component or system is functioning acceptably”.
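
The probabilistic reading of integrity above, and Musa’s availability notion, can be illustrated numerically. The following minimal Python sketch assumes, purely for illustration, that threats arrive as a Poisson process and that each threat independently penetrates with a fixed probability; neither the model nor the numbers come from the chapter:

    import math

    def integrity(threat_rate, p_penetration, t):
        # P(no penetration in [0, t]) when threats arrive as a Poisson
        # process with the given rate and each one independently
        # penetrates with probability p_penetration (illustrative model).
        return math.exp(-threat_rate * p_penetration * t)

    def availability(acceptable_hours, total_hours):
        # Musa-style availability: expected fraction of operating time
        # during which the software functions acceptably.
        return acceptable_hours / total_hours

    # Invented figures: 2 threats/week with a 1% penetration chance each,
    # over a 52-week year; 719.2 acceptable hours in a 720-hour month.
    print(f"integrity over a year: {integrity(2.0, 0.01, 52):.3f}")  # ~0.354
    print(f"availability: {availability(719.2, 720.0):.4f}")         # 0.9989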

In order to have reliable software, the constraints on how the software is to be verified have to be stated as verification requirements. Simulation, emulation, tests with simulated inputs, tests with real inputs, and interfacing with the real environment are common verification requirements. It is also necessary to include the acceptance tests for software validation. However, testability is very rarely specified by the customer.

Reliability requirements may have to be derived from the user’s availability requirements. It is important to mention that reliability is user-oriented. For Ince, software reliability “describes the ability of a software system to carry on executing with little interruption to its functioning”. Musa [6] says that “Software reliability is the most important aspect of quality to the user because it quantifies how well the software product will function with respect to his or her needs”. We mention here that although hardware and software reliability have the same definition, the probability of failure-free operation for a specified interval of time, there are differences between them (a failure being any behavior in operation that will cause user dissatisfaction). The main difference is that software reliability changes with time, as faults are removed during reliability growth testing or when new code (possibly containing additional faults) is added.

Safety requirements may identify critical functions in order to reduce the possibility of damage that can follow from software failure. As Musa [6] mentioned, software safety can be defined as “freedom from mishaps, where a mishap is an event that causes loss of human life, injury, or property damage”.

The maintainability or modifiability “describes the ease with which a software system can be changed”, according to Ince. He outlines three categories of modifications: corrective changes (due to error fixing), adaptive changes (due to the response to changes in requirements), and perfective changes (which improve a system). The ease of performing such tasks is quantified by the mean time to repair a fault. Maintainability requirements are derived from a user’s availability and adaptability requirements.

An important quality factor is correctness: the software system has to conform to requirement specifications. Usability, another quality factor, is “the effort required to learn, operate and interrupt a functioning system”, as Ince says; a system that satisfies all the functions in its requirement specification but has a poor interface is highly unusable.

For any software project a correct and clear documentation called the “software user manual” has to be provided. Two styles of user documentation are useful: the tutorial and the reference manual. The tutorial style is adequate for new users and the reference style is better for more experienced users searching for specific information.

There are some important reasons for establishing a software quality program. According to Breisford [7], these include: reduction of the risk of producing a low-quality product, reduction of the cost of software development, increased customer satisfaction, and limiting company responsibility. For some industrial software the emphasis is on reliability and lifecycle costs, while for others safety is more important. An effective quality program is necessary when the customer requires it, as in defense, space, and nuclear industry operations. As mentioned in [8], “Safety management forms an integral part of project risk management and should be given equal consideration alongside Quality and Reliability management”. This remark is also valid for computer software. It is enough to mention the report [9] on the Ariane 5 rocket failure in 1996, where the problems were caused by a few lines of Ada code containing three unprotected variables. According to Ince [5], a quality system “contains procedures, standards and guidelines”. A procedure is “a set of instructions which describe how a particular software task should be carried out”, and a standard is a set of compulsory tasks or items to be considered when implementing a procedure or a package of procedures. Guidelines only advise taking into consideration some procedures. The literature on quality uses the term standard with a modified meaning, but in the same manner.

32.1.2 Approaches in Software Quality Assurance

The IEEE defines software quality assurance as a “planned and systematic pattern of all actions necessary to provide adequate confidence that the item or product conforms to established technical requirements” [10]. A relevant standard family for the software industry is ISO 9000, developed by the International Organization for Standardization. More precisely, ISO 9001 [11] “Quality Systems—Model for Quality Assurance in Design, Development, Production, Installation and Servicing” describes the quality system used to support the development of a product which involves design; ISO 9003 [12] “Guidelines for the Application of ISO 9001 to the Development, Supply and Maintenance of Software” interprets ISO 9001 for the software developer; ISO 9004-2 [13] “Quality Management and Quality System Elements—Part 2” provides guidelines for the servicing of software.

There are 20 items to describe the requirements: management responsibility; quality system; contract review; design control; document and data control; purchasing; purchaser supplied product; product identification and traceability; process control; inspection and testing; control of inspection, measuring, and test equipment; inspection and test status; control of non-conforming product; corrective and preventive action; handling, storage, packaging, preservation, and delivery; quality records; internal quality audits; training; servicing; statistical techniques.

Another approach is the “Capability Maturity Model for Software” (CMM) [14], developed by the Software Engineering Institute (SEI). The SEI has suggested that there are five levels of process maturity, ranging from the initial stage (the least predictable and controllable) to repeatable, defined, managed, and optimizing (the most predictable and controllable).

The fundamental concepts underlying process maturity applied to a software organization are: the software process; the capability; the performance; and the maturity. The software process is a “set of activities, methods, practices and transformations” used by people to develop and maintain the software and associated products (user manuals, architectural document, detailed design document, code and test plans). The capability of the software process run by an organization provides a basis to predict the expected results that can be achieved by following the software process. While the software process capability focuses on the results expected, the software process performance represents the actual results achieved. Maturity is a concept which implies the ability for growth in capability and gives the degree of maturation in defining, managing, measuring, controlling, and running. As a corollary, a software organization gains in software process maturity by running its software process via policies, standards, and organizational structures. A brief description of the CMM follows.

Except for Level 1, each of the four capability levels has a set of key process areas that an organization should focus on to improve its software process. The first level describes a software development process that is chaotic, an ad hoc approach based on individual efforts (a genius who has a bright idea), not on team accomplishments. The capability maturity model emphasizes quantitative control of the process, and the higher levels direct an organization to use measurement and quantitative analysis.

Each key process area comprises a set of key practices that show the degree (effective, repeatable, lasting) of the implementation and institutionalization of that area. The key process area of the second level contains the following key practices: requirements management; software project planning; software project tracking and oversight; software subcontract management; software quality assurance; and software configuration management. At this level “basic project-management processes are established to track cost, schedule, and functionality”. There is a discipline among team members, so that the team can repeat earlier successes on projects with similar applications. At the defined level (Level 3), “the software process for both management and engineering activities is documented, standardized, and integrated into a standard software process for the organization”. Since some projects differ from others, the standard process, with management approval, is tailored to the special needs. The key process areas at this maturity level are oriented to organizational aspects: organization process focus; organization process definition; training program; integrated software management; software product engineering; intergroup coordination; and peer reviews. A managed process (Level 4) adds management oversight to a defined process: “detailed measures of the software process and product quality are collected”. There is quantitative information about both the software process and products to understand and control them. Key process areas focus on quantitative process management and software quality management. An optimizing process is the ultimate level of process maturity, where quantitative feedback is incorporated into the process to produce continuous process improvements. Key process areas include: defect prevention; technology change management; and process change management.

Finally, the following items summarize the CMM structure:

1. the maturity levels indicate the process capability and contain key process areas;

2. the key process areas achieve goals and are organized by common features;

3. the common features address implementation and institutionalization and contain key practices describing infrastructure or activities.

The CMM was the starting point in producing new process assessment methods. We note only the bootstrap approach [15] (an extension of the CMM developed by a European Community ESPRIT project) and the new standard called SPICE [16] (for Software Process Improvement and Capability dEtermination). Similar to the CMM, SPICE can be used both for process improvement and capability determination. The SPICE model considers five activities: customer–supplier, engineering, project, support, and organization, while the generic practices, applicable to all processes, are arranged in six levels of capability: 0—“not performed”, 1—“performed informally”, 2—“planned and tracked”, 3—“well-defined”, 4—“quantitatively controlled”, 5—“continuously improving”. An assessment report, a profile, describes the capability level for each process area.

To end this section, the reader should observe that the CMM tends to address the requirement of continuous process improvement more explicitly than ISO 9001, and that the CMM addresses software organizations while SPICE addresses processes. However, no matter which model is used, the assessment must be administered so as to minimize the subjectivity in the ratings.

32.2 The Practice of Software Engineering

32.2.1 Software Lifecycle

The increasing complexity of software, together with the cost and time constraints weighing against software reliability maximization, has made the usage of a standardized approach to producing such items mandatory for most software projects.

The collection of techniques that apply an engineering approach to the construction and support of software is known as “software engineering”. The theoretical foundations for designing and developing software are provided by computer science. However, software engineering provides the mechanism to implement the software in a controlled and scientific way. According to [17], software engineering is “the application of a systematic, disciplined, quantifiable approach to the development, operations, and maintenance of software, that is, the application of engineering to software”. It is possible to say, as in [18, 19], that “Software engineering is evolving from an art to a practical engineering discipline”.

This approach was also adopted by ESA [4], which establishes and maintains software engineering standards for the procurement, development, and maintenance of software. The ESA standard PSS-05-0 describes the processes involved in the complete lifecycle of the software. Organized in two parts, the standard defines a set of practices for making software products. Such practices can also be used to implement the requirements of ISO 9001. The ESA’s Board for Software Standardization and Control says that [20]:

1. the ESA Software Engineering Standards are an excellent basis for a software quality management system;

2. the ESA standards cover virtually all the requirements for software developments—two thirds of the requirements in ISO 9001 are covered by ESA Software Engineering Standards, and the uncovered requirements are not related to software development;

3. the ESA standards do not contradict those of ISO 9001.

According to [4], a software lifecycle consists of the following six successive phases: (1) definition of the user requirements; (2) definition of the software requirements; (3) definition of the architectural design; (4) detailed design and production of code; (5) transfer of the software to operations; (6) operations and maintenance. These phases are mandatory whatever the size, application type, hardware configuration, operating system, or programming language used for coding. The software lifecycle begins with the delivery of the “user requirements document” to the developer for review. When this document is approved, three important phases have to be traversed before the software is transferred to users for operation. In order to start a new phase, the results of the previous phase are reviewed and approved. The software lifecycle ends after a period of operation, when the software is retired.

Pham [19], considering the software reliability realm, identifies five successive phases for the software lifecycle: analysis (requirements and functional specifications), design, coding, testing, and operating. From the reliability point of view, “a well-developed specification can reduce the incidence of faults in the software and minimize rework”. It is well known (see [19]) that a well-done analysis will “generate significant rewards in terms of dependability, maintainability, productivity, and general software quality”. The design phase consists of two stages: system architecture design and detailed design. The principal activities in the system architecture design stage are: (1) construction of the physical model; (2) specifying the architectural design; (3) selecting the programming language (or software development environments); and (4) reviewing the design.

Using implementation terminology, the developer will derive a physical model starting from the logical model developed in the analysis phase (software requirements stage). To build a physical model some activities are important: the decomposition of the software into components (a top-down approach, with care for the bottom-level components); the implementation of non-functional requirements; the design of quality criteria; and special attention paid to alternative designs. The last activity is needed because, in general, there is no unique design for a software system. The design has to be easy to modify and maintain, make only minimal use of available resources, and must be understandable in order to be effectively built, operated, and maintained.

Other quality criteria related to design are: simplicity in form and function, and modularity.

The architectural design document contains diagrams showing the data flow and the control flow between components. For each component the following are specified: data input, functions to be performed (derived from the software requirements document), and data output. Data input, data output, and temporary memory cells are clearly defined as data structures (with name, type, dimension, relationships between the elements, range of possible values of each element, and initial values of each element). The definition of the control flow between components describes sequential and parallel operations, including synchronous and asynchronous behavior. About this phase, Pham says that “the system architecture document describes system components, subsystems and interfaces”.

The selected data structures and algorithms, in the detailed design phase, will be implemented in a particular programming language on a particular platform (hardware and software system). The selected programming languages must support top-down decomposition, object-oriented programming, and concurrent production of the software. Also, the choice of a programming language depends on the non-functional requirements. Finally, the architectural design and the programming language should be compatible.

Detailed design is about designing the project and the algorithmic details. Lower-level components of the architectural design are decomposed until they can be presented as modules or functions in the selected programming language. For many recent projects, an object-oriented (OO) analysis and design approach has been considered. The goals of the OO analysis are to identify all major objects in the problem domain, including all data and major operations mandatory for establishing the software functions, and to produce a class diagram containing all the software project semantics in a set of concise but detailed definitions. A class specification and a data dictionary are also delivered. The OO design maps the analysis products developed during the analysis phase to a structure that will allow the coding and execution. Various methodologies call this approach OOA/OOD because, in a practical analysis/design activity, it is almost impossible to find the boundaries between OOA and OOD.

Producing software, after the detailed design phase, starts with coding. The code should be consistent (to reduce complexity) and structured (to reduce errors and enhance maintainability). From the structured programming point of view, each module has a single entry and a single exit point, and the control flow proceeds from the beginning to the end. According to [4], “as the coding of a module proceeds, documentation of the design assumptions, function, structure, interface, internal data and resource utilisation should proceed concurrently”. As Pham says, the coding process consists of the following activities: the identification of the “reusable modules”, the code editing, the code “inspection”, and the “final test planning”. Existing modules of other systems or projects which are similar to the current system can be reused with modifications. This is an effective way to save time and effort. “Code inspection includes code reviews, quality, and maintainability”. The aim of the code review process is to check module (software) logic and readability. “Quality verification ensures that all the modules perform the functionality as described in detailed design. Quality check focuses on reliability, performance, and security”. Maintainability is checked to ensure the software project is easy to maintain. The final test plan provides the input to the testing phase: what needs to be tested, testing strategies and methods, testing schedules, etc.

Also, an important step in the production phase, even before testing, is integration. As ESA [4] mentions, “integration is the process of building a software system by combining components into a working entity” and shall proceed in an orderly function-by-function sequence.

Testing consists of the verification and validation of the software product. The goals of these activities are related to software quality. The most important goal is to affirm the quality by detecting and removing faults in the software project. Next, all the specified software functionalities shall be present in the product. A final, but considerable, goal is the estimation of the operational reliability of the software product.

The process of testing an integrated software system is called “system testing”. Previous processes related to the testing phase are unit testing and integration testing. Unit tests check the design and implementation of all components, “from the lowest level defined in the detailed design phase up to the lowest defined in the architectural design”, according to ESA standards. Integration testing is directed at the interfacing part: to verify that major components interface correctly (both at the data structure and control flow levels). The testing phase shall include a large set of activities, like: end-to-end system tests (the input–execute–output pipe works properly); verification that user requirements will be entirely covered (acceptance tests); measurements of performance limits (stress tests); preliminary estimation of reliability and maintainability; and the verification of the “software user manual”. According to Pham [19], the acceptance test consists of an internal test and a field test. “The internal test includes capability test and guest test, both performed in-house. The capability test tests the system in an environment configured similar to the customer environment. The guest test is conducted by the users in their software organization sites”. The field test, also called the “beta test”, allows the user to test the installed software, defining and developing the test cases. “Testing by an independent group provides assurance that the system satisfies the intent of the original requirements”.

Considering the transfer phase, the developer will “install the software in the operational environment and demonstrate to the initiator and users that the software has all the capabilities” formulated in the user requirements phase. Acceptance tests will demonstrate the capabilities of the software in its operational environment and are necessary for provisional acceptance. The final acceptance decision is made by the customer.


The final phase in the software lifecycle is operation, consisting of the following activities: training, support, and maintenance. Following Pham [19], maintenance is defined as “any change made to the software, either to correct a deficiency in performance, to compensate for environmental changes, or to enhance its operation”. After a warranty period, the maintenance of the software is transferred from the developer to a dedicated maintenance organization.

32.2.2 Software Development Process

A software process model is different from a software methodology. The process model determines the tasks or phases involved in product development and all criteria for moving from one task to the next. The software methodology defines the outputs of each task and describes how they should be achieved. The ESA approach, Part I, describes a software process model and, for each phase, describes the inputs, outputs, and activities. The second part of the ESA standard describes the procedures used to manage a software project. These aspects will be covered in Section 32.4.

It is difficult to clearly define which methods should be used to complete a phase of product development and production. As Cederling remarks [21], “hardly any method is suitable to use in development of all kinds of applications and there are only a few methods that pretend to cover the whole life-cycle of the software”. From a software quality point of view, the steps could be organized into three elements: software quality management, software quality engineering, and software quality control. Such aspects will be discussed in Section 32.3.

There are three favored software development processes: the waterfall model, the spiral model, and the generic integrated software development environment. ESA [20] also suggests three lifecycle models, called waterfall, incremental, and evolutionary. The waterfall model is the most used; it is a sequential approach and the implementation steps are the activities in the sequence in which they appear. This is the main reason that development schedules can readily be based on it. In the spiral model, each cycle involves a progression through the same series of steps, which are applied to everything from the overall concept to the detailed coding of software. Such an approach applies to development and maintenance. According to [20], the incremental model “splits the detailed design phase into manageable chunks”. The generic integrated environment is used when an overall integration of the project’s residual tools is necessary. The impact of object-oriented methods on software quality is quite important. Capper et al. [22] show how and when to use object orientation to improve software quality by taking advantage of reuse and code modularity. ESA [20] explains the advantages of an object-oriented approach: “the analysis more closely corresponds to the application domain”, “the design (physical model) is an elaboration of the requirements analysis model (the logical model)”, and “object-oriented methods use the concept of inheritance which, if properly used, permits the building of reusable software”. Object-oriented methods that recommend feedback loops between the analysis, design, and implementation phases require special treatment when included in a software engineering standard, in order to maintain the PSS-05-0 concept of well-defined development phases.

Software development with prototypes is a new approach. As mentioned in [5], “prototyping is the process of developing a working model of a system early on in a project”. The working model is shown to the customer, who suggests improvements. Incorporating the improvements gives rise to an upgraded prototype, which is shown again to the customer, and the process can continue. Producing a prototype is based on the following techniques: the relaxation of the quality assurance standards of the project; partial implementation; the table-driven processor approach; the usage of an application-oriented programming language, etc. Object-oriented programming languages are also an excellent medium for prototyping (see [23]). There are three popular prototyping models: throw-away prototyping, evolutionary prototyping, and incremental prototyping. In throw-away prototyping, effective for short projects, the developer builds a prototype and then starts the iterative process of showing and modifying already described. When the “game” is finished, the prototype is archived and conventional software development is started, based on a requirement specification written by examining the detailed functionality of the prototype. Evolutionary prototyping permits the developer “to keep the prototype alive”, as Ince says. After the customer has decided that all requirements are included, the developer bases the remainder of the work on the prototype already generated. Incremental prototyping is effective for projects where the requirements can be partitioned into functionally separate areas, the system being developed as a series of small parts, each of them implementing a subset of functions.

A theory-based, team-oriented engineering process for developing high-quality software is the cleanroom approach. According to [24], the cleanroom process “combines formal methods of object-based box structure specification and design, function-theoretical correctness verification, and statistical usage testing for reliability certification to produce software approaching zero defects”. The cleanroom software development method has three main attributes [25]: a set of attitudes; a series of carefully prescribed processes; and a rigorous mathematical basis. “Cleanroom processes are quite specific and lend themselves to process control and statistical measures”. As Ince mentioned, “a formal method of software developments makes use of mathematics for specification and design of a system, together with mathematical proof as a validation method”. Formal methods require the same conventional development models, unlike prototyping and object-oriented technology. A new useful mathematical framework important for software developers is the generalized analytic hierarchy process [26], which provides qualitative and quantitative information for resource allocation.

The late 1980s brought new graphically based software tools for requirements specification and design, called computer-aided software engineering (CASE) tools, with the following abilities: (1) creation of the diagrams describing the functions of the software system; (2) creation of the diagrams describing the system design of both the processes and the data in the system; (3) processing the graphical requirements specification and simulating its actions like a rudimentary prototype; (4) the identification of errors in the flow of data; (5) the generation of code from the design; and (6) the storage and manipulation of information related to project management. These tools improve the quality management system of software development companies by enforcing their own requirements specification and system design standards, and by automating much of the bureaucratic checking.

32.2.3 Software Measurements

As Fenton and Pfleeger mention [27], we can neither predict nor control what we cannot measure. Measurement “helps us to understand what is happening during development and maintenance”, “allows us to control”, and “encourages us to improve our processes and products”. Measurement is the first step towards software quality improvement. Due to the variety and complexity of software products, processes, and operating conditions, no single metric system can be identified as a measure of the software quality. Several attributes have to be measured, related to:

• the product itself: size, complexity, modularity, reuse, control flow, data flow, etc.;
• the development process: cost and effort estimation, productivity, schedule, etc.;
• quality and reliability: failures, corrections, times to failures, fault density, etc.;
• maintenance and upgrading: documentation, etc.

The quality of any measurement program is dependent on careful data collection. For instance, in order to perform reliability analysis, the following types of data should be collected: internal attributes (size, language, functions, verification and validation methods and tools, etc.) and external attributes (time of failure occurrence, nature of the failures, consequence, the current version of the software environment in which the fault has been activated, conditions under which the fault has been activated, type of faults, fault location, etc.). According to [27], “an internal attribute can be measured by examining the product, process or resource on its own, separate from its behavior”, while external attributes are related to the behavior of the process, product, or resource and “can be measured only with respect to how the product, process or resource relates to its environment”. Considering the process of constructing a specification, internal attributes include time, effort, number of requirements changes, etc., while quality, cost, and stability are some external attributes. Any artefacts, deliverables, or documents that result from a process activity are identified as products: the specifications, designs, code, prototypes, documents, etc. For example, if we are interested in measuring the code of one software product, the internal attributes cover: size, reuse, modularity, functionality, coupling, algorithmic complexity, control-flow structures, hierarchical class structures (the number of classes, functions, and class interactions), etc. Some popular software-process metrics are: number of “problems” found by the developer, number of “problems” found during beta testing, number of “problems” fixed that were found by the developer in the prior release, number of “problems” fixed that were found by beta testing in the prior release, number of “problems” fixed that were found by the customer in the prior release, number of changes to the code due to new requirements, total number of changes to the code for any reason, number of distinct requirements that caused changes to the module, net increase in lines of code, number of team members making changes, number of updates to a module, etc.

Measuring the size of software products is consistent with measurement theory principles [27]. Software size can be described with three attributes: length (the physical size of the product), functionality (the functions supplied by the product to the user), and complexity of the problem, algorithms, software structure, and cognitive effort.

The most used measure of source code program length is the LOC (number of lines of code), sometimes interpreted as NCLOC (only non-commented lines) or ELOC (effective lines of code). However, the dependence on the programming language and the impact of visual programming windowing environments are important factors which have a great influence on the code. Length is also a measure for specifications and design, which consist of both text (the text length) and diagrams (the diagram length). Other statement metrics are: number of distinct included files, number of control statements, number of declarative statements, number of executable statements, number of global variables used, etc.
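
A minimal Python sketch of these size counts follows. It assumes, for illustration only, a language with # line comments, and it deliberately ignores the complications noted above (block comments, string literals, language dependence):

    def size_metrics(source):
        # LOC counts every physical line; NCLOC drops blank and
        # comment-only lines (a simplification of the definitions above).
        lines = source.splitlines()
        loc = len(lines)
        ncloc = sum(1 for ln in lines
                    if ln.strip() and not ln.strip().startswith("#"))
        return {"LOC": loc, "NCLOC": ncloc}

    sample = "# demo\n\nx = 1  # an inline comment still counts\nprint(x)\n"
    print(size_metrics(sample))  # {'LOC': 4, 'NCLOC': 2}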

The functionality captures “the amount of function contained in a delivered product or in a description of how the product is supposed to be” [27]. There are three commonly used approaches: Albrecht’s effort estimation method (based on function points), COCOMO 2.0 (based on object points), and DeMarco’s specification weight.
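
As a rough illustration of the function point idea, the sketch below uses the average-complexity weights and the usual value adjustment factor from Albrecht-style function point analysis; the component counts and the TDI rating are invented, and a real count would distinguish low, average, and high complexity per component:

    # Average-complexity weights for the five component types of
    # Albrecht-style function point analysis (EI external inputs,
    # EO external outputs, EQ external inquiries, ILF internal
    # logical files, EIF external interface files).
    WEIGHTS = {"EI": 4, "EO": 5, "EQ": 4, "ILF": 10, "EIF": 7}

    def function_points(counts, tdi):
        # Unadjusted FP = sum(count * weight); the value adjustment
        # factor is 0.65 + 0.01 * TDI, where TDI sums 14 general system
        # characteristics each rated 0..5 (so 0 <= TDI <= 70).
        ufp = sum(WEIGHTS[k] * n for k, n in counts.items())
        return ufp * (0.65 + 0.01 * tdi)

    # Invented product: 20 inputs, 15 outputs, 8 inquiries, 6 internal
    # files, 2 external interfaces, TDI = 42.
    fp = function_points({"EI": 20, "EO": 15, "EQ": 8, "ILF": 6, "EIF": 2}, 42)
    print(f"{fp:.1f} function points")  # 279.3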

Complexity is difficult to measure. The complexity of an approach is given in terms of the resources needed to implement it (the computer running time or the computer memory used for processing). As mentioned by Pham [19], “Halstead’s theory of software metric is probably the best-known technique to measure the complexity in a software program and amount of difficulty involved in testing and debugging the software”. However, the focus of Halstead’s measurements is the source code of an imperative language; such an approach is not suitable for software based on declarative programming. The following metrics, used by Halstead, give rise to empirical models: the number of distinct operators and the number of distinct operands in a program, used to develop expressions for the overall program length, the volume, and the number of remaining defects in a program. Another complexity measure of software is the cyclomatic number, proposed by McCabe (see [19, 27] and references cited therein), which provides a quantitative measure of the logical complexity of a program by counting the decision points. The remark above about imperative languages also applies to visual software environments, especially for object-oriented programming, which needs special complexity measures.
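
Both measures reduce to simple arithmetic once the basic counts are extracted from the code. A minimal Python sketch with the standard textbook formulas and invented counts:

    import math

    def halstead(n1, n2, N1, N2):
        # n1/n2: distinct operators/operands; N1/N2: total occurrences.
        vocabulary = n1 + n2
        length = N1 + N2
        estimated_length = n1 * math.log2(n1) + n2 * math.log2(n2)
        volume = length * math.log2(vocabulary)
        difficulty = (n1 / 2) * (N2 / n2)
        effort = difficulty * volume
        return {"est_length": estimated_length, "volume": volume,
                "difficulty": difficulty, "effort": effort}

    def cyclomatic(edges, nodes, components=1):
        # McCabe's cyclomatic number V(G) = e - n + 2p: the number of
        # linearly independent paths through the control-flow graph.
        return edges - nodes + 2 * components

    print(halstead(n1=10, n2=15, N1=40, N2=55))
    print(cyclomatic(edges=11, nodes=9))  # 4 basis paths to cover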


The software structure has at least three parts: control-flow structure (addressing the sequence in which instructions are executed), data-flow structure (showing the behavior of the data as it interacts with the software), and data structure (including data arrangements and algorithms for creating, modifying, or deleting them). The complexity of the control-flow structures is obtained using hierarchical measures, every computer program being built up in a unique way from the so-called prime structures according to the structured programming principles. Some popular control-flow graph metrics (for flowcharts) are: number of arcs that are not conditional arcs, number of non-loop conditional arcs (if–then constructs), number of loop constructs, number of internal nodes, number of entry nodes, number of exit nodes, etc.

External software product attributes will be considered in the next section. Without presenting equations, the above descriptive approach shows what developers and software managers have to measure in order to understand, control, and evaluate. Special attention has to be given to data collection, not only for different internal measurements, but also for the investigation of relationships and future trends.

32.3 Software Quality Models

32.3.1 Measuring Aspects of Quality

In the first section we pointed out that quality is a composite of many characteristics. Early models of Boehm et al. [28] and McCall et al. [29] described quality by decomposition. Both models identify key attributes of quality from the customer’s perspective, called quality factors, which are decomposed into lower-level attributes, called quality criteria, that are directly measurable. For example, McCall’s model considers the following quality factors: usability, integrity, efficiency, correctness, reliability, maintainability, testability, flexibility, reusability, portability, and interoperability. We observe that most of these factors also describe the software requirements which were outlined in the first section.

A derivation of McCall’s model, called “Software Product Evaluation: Quality Characteristics and Guidelines for their Use”, is known as the ISO 9126 standard quality model, decomposing quality into only six quality factors: functionality, reliability, efficiency, usability, maintainability, and portability [30]. Unfortunately, the definitions of these attributes differ from one standard to another.

Another approach extends Gilb’s ideas [31] and provides an automated way to use them: the COQUAMO model of Kitchenham and Walker [32]. Other approaches view software quality as an equation based on defect-based measures, usability measures, and maintainability measures. Some popular measures are: the defect density (given as the division of the number of known defects by the product size), the system spoilage (the division of the time to fix post-release defects by the total system development time), the user performance measures defined by the MUSiC project [33], the total effort expended on maintenance [34], Gunning’s readability measure [35] (for written documentation), etc. Other measures will appear in the next sections. Recently, sophisticated software tools with software-quality models have appeared to support software developers. For example, [36] describes the decision support tool EMERALD (Enhanced Measurement for Early Risk Assessment of Latent Defects system). As [37] remarks, such tools “are the key to improve software-quality” and can “predict which modules are likely to be fault-prone” using the database of available measurements.
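
The first two of these measures are simple ratios, and Gunning’s index is a short formula; a small Python sketch with invented project figures (the fog-index form shown is the commonly cited one, assumed here rather than taken from [35]):

    def defect_density(known_defects, size_kloc):
        # Known defects divided by product size (defects per KLOC).
        return known_defects / size_kloc

    def system_spoilage(post_release_fix_time, total_dev_time):
        # Time to fix post-release defects over total development time.
        return post_release_fix_time / total_dev_time

    def gunning_fog(words, sentences, complex_words):
        # 0.4 * (average sentence length + percentage of "complex"
        # words, i.e. words of three or more syllables).
        return 0.4 * (words / sentences + 100.0 * complex_words / words)

    print(defect_density(57, 38.0))        # 1.5 defects/KLOC
    print(system_spoilage(420.0, 5600.0))  # 0.075
    print(gunning_fog(1800, 120, 190))     # ~10.2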

32.3.2 Software Reliability Engineering

Improving the reliability of a software product can be achieved by measuring different attributes during the software development phase. Measurement uses the outputs of a data collection process. Two types of data are related to software reliability: time-domain data and interval-domain data. The time-domain data approach requires that the individual times at which failures occurred be recorded. For the interval-domain approach it is necessary to count the number of failures occurring during a fixed period. From an accuracy point of view, the time-domain approach is adequate for the reliability estimation process, but more data collection effort is necessary. In the following, reliability will be presented as an engineering concept, enriching the framework of total quality for software engineering management.

According to Musa [6], “engineering software reliability means developing a product in such a way that the product reaches the market at the right time, at an acceptable cost, and with satisfactory reliability”. Reliability incorporates all those properties of the software that can be associated with the software running (correctness, safety, usability, and user friendliness). There are two approaches to measure software reliability: a developer-oriented approach which counts the faults or defects found in a software product, and a user-oriented approach taking into account the frequency with which problems occur.

A basic approach to the software reliability engineering (SRE) process, according to [6], requires the following activities.

1. Identify the modules that should have a separate test.

2. Define the failures, with severity classes (a failure severity class is a set of failures that have the same per-failure impact on users, based on a quality attribute).

3. Choose the natural or time unit (a natural unit is one related to the output of a software-based product).

4. Set the system failure intensity objective for each module to be tested (failures per natural unit); see the sketch after this list.

5. Determine the developed software failure intensity objective.

6. Apply, in an engineering manner, the reliability strategies in order to meet the failure intensity objective.

7. Determine the operational modes of each module to be tested.

8. For each module to be tested, develop the module and the operational profile (by identifying the operation initiators, creating the operations list, and determining occurrence rates and occurrence probabilities).

9. Prepare test cases (involving the following steps: estimation of the number of new test cases needed for the current release, allocation of the test cases among the modules to be tested, allocation, for each module, of new test cases among its new operations, specification of the new test cases, and addition of these new test cases to those from previous releases).

10. Prepare test procedures (one test procedure for each operational mode).

11. Execute the test by allocating the test time, testing the systems (acquired components, product and variations, and supersystems), and carefully identifying the failures (by analyzing the test output for deviations, determining which deviations are failures, establishing when the failures occurred, and assigning failure severity classes).

12. Apply failure data to guide decisions.
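
Step 4 turns a reliability requirement into a failure intensity objective. A minimal Python sketch of one common conversion follows; it assumes an approximately constant failure intensity over the window, and the product and numbers are invented for illustration:

    import math

    def failure_intensity_objective(reliability, natural_units):
        # Failure intensity implied by a reliability requirement R over
        # a window of natural units, assuming a roughly constant
        # intensity: lambda = -ln(R)/t, close to (1 - R)/t for R near 1.
        return -math.log(reliability) / natural_units

    # Invented requirement: probability 0.999 of no failure over any
    # window of 1000 transactions (the natural unit).
    lam = failure_intensity_objective(0.999, 1000)
    print(f"{lam:.2e} failures per transaction")  # ~1.00e-06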

Introduction of SRE into an organization is a strong function of the software process maturity of that organization. The necessary period for introduction can range from six months to several years. An incremental approach is recommended. The process has to start with the activities needed for establishing a baseline and learning about the product and about customer expectations.

An important set of SRE activities are con-cerned with measurement and prediction of soft-ware reliability and availability. This includesmodeling of software failure behavior and mod-eling of the process that develops and removesfaults. According to [6, 21, 38–40] and referencescited therein, a large set of metrics and models areavailable for such a purpose.

There are many types of software reliabilitymodels. Some well-known models are: Halstead’ssoftware metric and McCabe’s cyclomatic com-plexity metric (both deterministic models). Hal-stead’s metric can be used to estimate the numberof errors in the program, while McCabe’s cyclo-matic complexity metric can be used to deter-mine an upper bound on the number of tests


Models included in the second group, called “error seeding models”, estimate the number of errors in a program by using a multistage sampling technique, where the errors are divided into indigenous errors and induced errors (seeded errors). Examples of such models are [19]: Mills’ error seeding model, the hypergeometric distribution model, and Cai’s model. Another group of models is used to study the software failure rate per fault at the failure intervals. Models included in this group are: Jelinski and Moranda (J-M), negative-binomial Poisson, Goel–Okumoto imperfect debugging, and a large number of variations of the J-M model. Other models, namely curve fitting models, use statistical regression analysis to study the relationship between software complexity and the number of faults in a program, the number of changes, or the failure rate. When the future of the process is assumed statistically independent of its past, NHPP (non-homogeneous Poisson process) models are proposed (see [19, ch. 4]): Duane, Musa exponential, Goel–Okumoto, S-shaped growth models, hyperexponential growth models, generalized NHPP, etc. The last group, in Pham’s classification, includes Markov models [19]. Unfortunately, because reliability is defined in terms of failures, it is impossible to measure before development is complete. Even when carefully collecting data on interfailure times and using software reliability growth models, it is difficult to produce accurate predictions on all data sets in all environments; these techniques work effectively only if the software’s future operational environment is similar to that in which the failure data was collected. The advice of Fenton and Pfleeger [27] is therefore important: “If we must predict reliability for a system that has not been released to a user, then we must simulate the target operational environment in our testing”.
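As a hedged illustration of the NHPP family mentioned above, the sketch below evaluates the Goel–Okumoto mean value function m(t) = a(1 − e^{−bt}) and the reliability over a mission of length x after testing up to time t, R(x | t) = exp{−[m(t + x) − m(t)]}; the parameter values are purely hypothetical.

    import math

    def go_mean_value(t, a, b):
        """Expected cumulative number of failures by time t (Goel-Okumoto)."""
        return a * (1.0 - math.exp(-b * t))

    def go_reliability(x, t, a, b):
        """Probability of no failure in (t, t+x] under the NHPP assumption."""
        return math.exp(-(go_mean_value(t + x, a, b) - go_mean_value(t, a, b)))

    a, b = 100.0, 0.02  # hypothetical: 100 initial faults, detection rate 0.02 per hour
    print(go_reliability(10.0, 200.0, a, b))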

From an SRE point of view, it is essential that a person selecting models and making reliability predictions be appropriately trained in both software (reliability) engineering and statistics (see the cleanroom approach [25]). Data filtering and outlier identification are fundamental steps of data validation. Data partitioning is also important: applying reliability growth models to the most severe failures allows, for example, evaluation of the software failure rate corresponding to the most critical behavior. Such a rate is more significant than the overall software failure rate, which incorporates failures that do not have a major impact on system behavior.

Reliability improvement programs will help companies raise the maturity of their software production by adopting the CMM approach, the ESA standard, or other standards specially developed for producing safety-critical applications.

32.3.3 Effort and Cost Models

One of the major objections to applying programs for developing under reliability requirements is the total effort, which has long been considered to increase with the reliability level required. This is why effort estimation is crucial for the software development organization. Overestimated effort may convince general management to disapprove proposed systems that might significantly contribute to the quality of the development process. Underestimated effort may convince general management to approve, but exceeded budgets and final failure are then the common outcome. A model that considers the testing cost, the cost of removing errors detected during the testing phase, the cost of removing errors detected during the warranty period, and the risk cost due to software failure is proposed by Pham and Zhang [41]. As the authors mention, “this model can be used to estimate realistic total software cost” for some software products, and “to determine the optimal testing release policies of the software system”. Other software cost models based on NHPP software reliability functions can be found in [19].

There are many proposed solutions but, in general, it is very restrictive to apply them across a large set of software projects. Applying data analysis to empirical data indicates that effort trends depend on certain measurable parameters. There are two classes of models for effort estimation: cost models and constraint models.


COCOMO is an empirical cost model (a composite one) which provides direct estimates of effort or duration. Often, cost models have one primary input parameter and a number of cost drivers (characteristics of the project, process, product, or resources having a significant influence on effort or duration). Constraint models describe a relationship over time between two or more parameters of effort, duration, or staffing level. Because these mathematical models are defined in terms of an algorithm, they are termed algorithmic models. Boehm [42] classifies algorithmic models used for software cost estimation as follows:

1. linear models that try to fit a simple line to the observed data;

2. multiplicative models that describe effort as a product of constants with various cost drivers as their exponents;

3. analytic models that usually describe effort as a function that is neither linear nor multiplicative;

4. tabular models that represent the relationship between cost drivers and development effort in matrix form;

5. composite models using a combination of all or some of the above-mentioned approaches, which are generic enough to represent a large class of situations.

Composite models are those mostly used in practice. Another composite model, widely used in industry, is Putnam’s SLIM (software lifecycle management) model. Both the COCOMO and Putnam models use the Rayleigh distribution as an approximation to the smoothed labor distribution curve. SLIM uses separate Rayleigh curves for design and code, test and validation, maintenance, and management. Boehm and his colleagues [43] have defined an updated COCOMO, useful for a wider collection of software development techniques and technologies, including re-engineering, applications generators, and object-oriented approaches.
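As a minimal sketch of the cost-model flavor (assuming the basic COCOMO form Effort = a · KLOC^b with Boehm’s published mode coefficients; a real estimate would also multiply in the cost drivers):

    # Basic COCOMO effort estimation; coefficients (a, b) per project mode [42].
    BASIC_COCOMO = {
        "organic":      (2.4, 1.05),
        "semidetached": (3.0, 1.12),
        "embedded":     (3.6, 1.20),
    }

    def effort_person_months(kloc, mode="organic"):
        a, b = BASIC_COCOMO[mode]
        return a * kloc ** b

    print(effort_person_months(32, "embedded"))  # hypothetical 32 KLOC embedded project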

However, practitioners also use informal considerations: an expert’s subjective opinion, availability of resources, analogy, etc. To succeed, they use more than one technique simultaneously. For quality improvement reasons, it is useful to assign the responsibility for cost estimation to a specific group of people [44]. According to [27], “the group members are familiar with the estimation and calibration techniques”, “their estimating experience gives them a better sense of when a project is deviating from the average or standard one”, “by monitoring the database, the group can identify trends and perform analyses that are impossible for single projects”, and, with the advantage of being separated from the project staff, they can periodically re-estimate different external attributes.

32.4 Total Quality Management for Software Engineering

32.4.1 Deming’s Theory

The US Department of Defense defines the total-quality approach (also called total quality management, or TQM) as follows (from ref. 11, ch. 1, cited in [3]): “TQM consists of continuous improvement activities involving everyone in the organization—managers and workers—in a totally integrated effort toward improving performance at every level”. A definition of TQM has two components: “the what and the how of total quality”. The how component has 10 critical elements: customer focus, obsession with quality, scientific approach, long-term commitment, teamwork, continual improvement of systems, education and training, freedom through control, unity of purpose, and employee involvement and empowerment. Many people and organizations have contributed to developing the various concepts collectively known as TQM. The best known quality pioneer is Dr W. Edwards Deming [2]. Deming’s “fourteen points” philosophy is also applicable to software quality. According to Goetsch and Davis [3] and Zultner [45], following the plan–do–check–act–analyze Deming cycle, the 14 rules are as follows (the numbers represent neither an order of progression nor relative priorities).


1. Create constancy of purpose for the improvement of development, with the aim of becoming excellent (the optimizing level in the SEI approach).

2. Adopt the new philosophy: software projects fail because of a lack of management and control. Software engineering management must learn from experience that quality is vital for survival.

3. Achieving quality has to be independent of the inspection process. Develop for quality.

4. Stop awarding contracts based on the price tag alone.

5. Improve constantly and forever the system development process, to improve quality and productivity, creating increasing value at lower cost.

6. Institute training on the job. Training is the best way to improve people on a continual basis.

7. Institute leadership (to help people and technology work better).

8. Drive out fear, so that everyone may work effectively.

9. Break down barriers between departments. Teamwork, especially at the front end of the project, improves the final product quality.

10. Eliminate slogans and targets that ask for new levels of productivity or competitiveness. They create adversarial relationships.

11. Eliminate quotas.

12. Remove barriers that rob employees of their pride of workmanship.

13. Institute a vigorous program of education/training and self-improvement.

14. Put everyone to work to accomplish the transformation.

The above rules summarize Deming’s views on what the software organization must do to effect a positive transition from “business-as-usual to world-class quality”. To build quality software, the software house must make substantial changes in the way systems are developed and managed (see Zultner [45]). The factors that may inhibit such a transformation are termed “Deming’s seven deadly diseases”.

By adapting them to software quality, the following list is obtained.

1. Lack of constancy of purpose to plan software products that satisfy the user requirements and keep the company in business. Excessive maintenance costs.

2. Emphasis on the short-term schedule.

3. Reviewing systems and managing by objectives without providing methods or resources to accomplish the objectives.

4. Mobility of systems professionals and management.

5. Managing by “visible figures” alone, with little or no consideration given to the figures that are unknown or cannot be known.

6. Excessive personnel costs.

7. Excessive costs of legal liabilities.

TQM can eliminate or reduce the impact of a lack of constancy, the reviewing system, mobility, and the use of visible figures. However, total quality will not free the organization from excessive personnel costs or legal liabilities arising from pressure to produce short-term profits, these being “the diseases of the nation’s financial, health care, and legal systems”, as mentioned by Goetsch and Davis [3].

32.4.2 Continuous Improvement

Continuous improvement is the most fundamental element of TQM. It applies not only to products but also to processes and the people who operate them. The process under improvement has to be defined (all the activities to be performed are clearly described). Also, the process has to be used (applied) over many projects to create experience. To decide on process behavior, data and metrics have to be collected and analyzed. The above section on measurement and the cited literature explain the important role of measurement for TQM. Other models and references can be found in Popentiu and Burschy [46].

Process improvement can be addressed at an organizational level, at a project level, and at an individual level.


Both CMM and ISO 9001 identify processes at the organizational level. The CMM identifies key process areas mandatory for achieving a certain level of maturity. According to Paulk [14], in the CMM process improvement is implied in the concept of maturity itself, being more specifically addressed in the key process areas of Level 5, while ISO 9001 just states the minimum criteria to achieve ISO certification. At the project level, the SPICE model deals with process improvement, while the personal software process model, developed by Humphrey [47], addresses process improvement at the individual level. In Humphrey’s model, each individual goes through a series of four levels, in which new steps are added that make for a more mature process. Let us observe that even though some quality models incorporate the concept of software improvement, they do not explain how it should be achieved. The ESA standard, and its updates for new technologies, is more specific: it provides an algorithm from both a software engineering and a management point of view. A collaborative ESPRIT project involving nine European centers of excellence, called “ami” [48], is an adaptation of the goal–question–metric (GQM) method devised by Basili [49]. The “ami” approach had the support of the ESA. Another system engineering process, which transforms the desires of the customer/user into the language required at all project levels, is quality function deployment (QFD). QFD, according to Goetsch and Davis [3], brings a number of benefits to organizations trying to enhance their competitiveness by continually improving quality and productivity. The process has the benefits of being customer-focused, time-efficient, teamwork-oriented, and documentation-oriented.

The goal of the management activities is to build the software product within budget, according to schedule, and with the required quality. To achieve this, according to ESA [4], an organization must establish plans for software project management, software configuration management, software verification and validation, and software quality assurance.

The software project management plan (SPMP) contains “the technical and managerial project functions, activities and tasks necessary to satisfy the requirements of a software project”. Following the continuous improvement paradigm, the SPMP has to be updated and refined throughout the lifecycle, “as more accurate estimates of the effort involved become possible”, and whenever changes of some attributes or goals occur.

Software configuration management, essential for proper software quality control, is both a managerial and a technical activity. Ince [5] identifies the following activities which make up the process of configuration management: configuration item identification, configuration control, status accounting, and configuration auditing.

All software verification and validation activities shall be documented in the corresponding plan. According to ESA [4], the verification activities include reviewing, inspecting, testing, formal checking, and auditing. By validation, the organization determines whether the final product meets the user requirements.

The software quality assurance plan defines how the standards adopted for project development will be monitored. Such a plan is a checklist for activities that have to be carried out to ensure the quality of the product. For each document, the ESA approach provides a clear layout, with strict format and content. Other software management practices exist, and Kenet and Baker [50] give a pragmatic view of the subject.

32.5 Conclusions

Strategies to obtain software quality have been examined. The basic approaches to software quality assurance were covered, with the ISO 9000 and CMM standards considered. The practice of software engineering was illustrated by the ESA standards and modern software development techniques. Other methodologies are software reliability engineering, software engineering economics, and software management. Putting them together, and using the adapted Deming philosophy, a software organization can implement the total quality management paradigm.


References

[1] Uselac S. Zen leadership: the human side of total quality team management. Londonville, OH: Mohican Publishing Company; 1993.

[2] Deming WE. Out of the crisis. Cambridge, MA: Massachusetts Institute of Technology, Center for Advanced Engineering Study; 1986.

[3] Goetsch DL, Davis S. Introduction to total quality: quality, productivity, competitiveness. Englewood Cliffs, NJ: Prentice Hall; 1994.

[4] ESA PSS-05-0: ESA software engineering standards, Issue 2; February 1991.

[5] Ince D. An introduction to quality assurance and its implementation. London: McGraw-Hill; 1994.

[6] Musa JD. Software reliability engineering. New York: McGraw-Hill; 1999.

[7] Breisford JJ. Establishing a software quality program. Qual Prog 1988; Nov.

[8] CODERM. Is it safe? The Newsletter 1998/99;15(2):6–7.

[9] Lions JL. Ariane 5 Flight 501 failure: report of the Inquiry Board. Paris; July 19, 1996.

[10] IEEE standard for software quality assurance plans. IEEE Std 730.1-1989.

[11] International Standards Organization. ISO 9001: Quality systems—model for quality assurance in design, development, production, installation and servicing. International Standards Organization; 1987.

[12] International Standards Organization. Quality management and quality assurance standards—Part 3. Guidelines for the application of ISO 9001 to the development, supply and maintenance of software. ISO/IS 9000-3. Geneva; 1990.

[13] International Standards Organization. Quality management and quality system elements guidelines—Part 2. ISO 9004-2. Geneva; 1991.

[14] Paulk MC. How ISO 9001 compares with the CMM. IEEE Software 1995;12(1):74–83.

[15] Kuvaja P, Bicego A. Bootstrap: a European assessment methodology. Software Qual J 1994;3(3):117–28.

[16] International Standards Organization. SPICE baseline practice guide, product description, Issue 0.03 (draft); 1993.

[17] IEEE. Glossary of software engineering terminology. IEEE Std 610.12-1990.

[18] Lyu MR. Handbook of software reliability engineering. New York: McGraw-Hill; 1996.

[19] Pham H. Software reliability. Singapore: Springer; 2000.

[20] Jones M, Mazza C, Mortensen UK, Scheffer A. 1977–1997: twenty years of software engineering standardisation in ESA. ESA Bull 1997;90.

[21] Cederling U. Industrial software development—a case study. In: Sommerville I, Paul M, editors. Software engineering—ESEC’93. Berlin: Springer-Verlag; 1993. p. 226–37.

[22] Capper NP, Colgate RJ, Hunter JC, James. The impact of object-oriented technology on software quality: three case histories. Software Qual 1994;33(1):131–58.

[23] Blaschek G. Object-oriented programming with prototypes. Berlin: Springer-Verlag; 1994.

[24] Hausler PA, Linger RC, Trammell CJ. Adopting cleanroom software engineering with a phased approach. Software Qual 1994;33(1):89–110.

[25] Mills HD, Poore JH. Bringing software under statistical quality control. Qual Prog 1988; Nov.

[26] Lee M, Pham H, Zhang X. A methodology for priority setting with application to software development process. Eur J Operat Res 1999;118(2):375–89.

[27] Fenton NE, Pfleeger SL. Software metrics: a rigorous & practical approach, 2nd ed. International Thomson Computer Press; 1996.

[28] Boehm BW, Brown JR, Kaspar H, Lipow M, Macleod G, Merrit M. Characteristics of software quality. TRW series of software technology. Amsterdam: North Holland; 1978.

[29] McCall JA, Richards PK, Walters GF. Factors in software quality. RADC TR-77-369; 1977.

[30] International Standards Organization. Information technology—software product evaluation—quality characteristics and guidelines for their use. ISO/IEC IS 9126. Geneva; 1991.

[31] Gilb T. Software metrics. Cambridge, MA: Chartwell-Bratt; 1976.

[32] Kitchenham BA, Walker JG. A quantitative approach to monitoring software development. Software Eng J 1989;4(1):2–13.

[33] Bevan N. Measuring usability as quality of use. Software Qual J 1995;4(2):115–30.

[34] Belady LA, Lehman MM. A model of large program development. IBM Syst J 1976;15(3):225–52.

[35] Gunning R. The technique of clear writing. New York: McGraw-Hill; 1968.

[36] Hudepohl JP, Aud SJ, Khoshgoftaar TM, et al. EMERALD: software metrics and models on the desktop. IEEE Software 1996;13(Sept):56–60.

[37] Khoshgoftaar TM, Allen EB, Jones WD, et al. Classification-tree models of software-quality over multiple releases. IEEE Trans Reliab 2000;49(1):4–11.

[38] Musa JD, Iannino A, Okumoto K. Software reliability: measurement, prediction, application. New York: McGraw-Hill; 1987.

[39] Xie M. Software reliability modelling. Singapore: World Scientific; 1991.

[40] Burtschy B, Albeanu G, Boros DN, Popentiu FL, Nicola V. Improving software reliability forecasting. Microelectron Reliab 1997;37(6):901–7.

[41] Pham H, Zhang X. A software cost model with warranty and risk costs. IEEE Trans Comput 1999;48(1):71–5.

[42] Boehm BW. Software engineering economics. Englewood Cliffs, NJ: Prentice Hall; 1981.

[43] Boehm BW, Clark B, Horowitz E, Westland JC, Madachy RJ, Selby RW. Cost models for future life cycle processes: COCOMO 2.0. Ann Software Eng 1995;1(1):1–24.


[44] DeMarco T. Controlling software projects. New York: Yourdon Press; 1982.

[45] Zultner R. The Deming approach to software quality engineering. Qual Prog 1988; Nov.

[46] Popentiu Fl, Burschy B. Total quality for software engineering management. Final report for NATO—HTECH.LG 941434; September 1997.

[47] Humphrey WS. A discipline for software engineering. Reading, MA: Addison Wesley; 1995.

[48] Kuntzmann-Combelles A. Quantitative approach to software management: the “ami” method. In: Sommerville I, Paul M, editors. Software engineering—ESEC’93. Berlin: Springer-Verlag; 1993. p. 238–50.

[49] Basili VR, Rombach D. The TAME project: towards improvement oriented software environments. IEEE Trans Software Eng 1988;14(6):758–73.

[50] Kenet RS, Baker ER. Software process quality: management and control. New York: Marcel Dekker; 1999.


Chapter 33

Software Fault Tolerance

Xiaolin Teng and Hoang Pham

33.1 Introduction
33.2 Software Fault-tolerant Methodologies
33.2.1 N-version Programming
33.2.2 Recovery Block
33.2.3 Other Fault-tolerance Techniques
33.3 N-version Programming Modeling
33.3.1 Basic Analysis
33.3.1.1 Data-domain Modeling
33.3.1.2 Time-domain Modeling
33.3.2 Reliability in the Presence of Failure Correlation
33.3.3 Reliability Analysis and Modeling
33.4 Generalized Non-homogeneous Poisson Process Model Formulation
33.5 Non-homogeneous Poisson Process Reliability Model for N-version Programming Systems
33.5.1 Model Assumptions
33.5.2 Model Formulations
33.5.2.1 Mean Value Functions
33.5.2.2 Common Failures
33.5.2.3 Concurrent Independent Failures
33.5.3 N-version Programming System Reliability
33.5.4 Parameter Estimation
33.6 N-version Programming–Software Reliability Growth
33.6.1 Applications of N-version Programming–Software Reliability Growth Models
33.6.1.1 Testing Data
33.7 Conclusion

33.1 Introduction

Software fault tolerance is achieved through special programming techniques that enable the software to detect and recover from failures. This requires redundant software elements that provide alternative means of fulfilling the same specifications. The different versions must be developed independently, so that they share few if any common faults, to avoid simultaneous failures in response to the same inputs.

The difference between software fault tolerance and hardware fault tolerance is that software faults are usually design faults, while hardware faults are usually physical faults. Hardware physical faults can be tolerated by redundant (spare) copies of a component that are identical to the original, since it is commonly assumed that hardware components fail independently. However, software design faults cannot generally be tolerated in this way, because the error is likely to recur on the spare component if it is identical to the original [1].

Two of the best known fault-tolerant software schemes are N-version programming (NVP) and recovery block (RB). Both schemes are based on software component redundancy and the assumption that coincident failures of components are rare.


[Figure 33.1. N-version programming scheme: versions 1 to N run in parallel on the same input and send their outputs to a voter, which delivers the correct output or signals system failure.]

33.2 Software Fault-tolerant Methodologies

33.2.1 N-version Programming

NVP was proposed by Chen and Avizienis. In this scheme, there are N ≥ 2 functionally equivalent programs, called versions, which are developed independently from the same initial specification. Independent development means that each version is developed by a different group of people; these individuals and groups do not communicate with each other during the software development. Whenever possible, different algorithms, techniques, programming languages, and tools are used in each version. NVP is actually a software generalization of the N-modular redundancy (NMR) approach used in hardware fault tolerance. All N versions are usually executed simultaneously (i.e. in parallel) and send their outputs to a voter, which determines the “correct” or best output, if one exists, by a voting mechanism, as shown in Figure 33.1.

Assume that all N versions are statistically independent of each other and have the same reliability r. If majority voting is used, then the reliability of the NVP scheme (R_NVP) can be expressed as:

R_{NVP} = \sum_{i=\lceil N/2 \rceil}^{N} \binom{N}{i} r^i (1 - r)^{N-i}    (33.1)

Several different voting techniques have been proposed. The simplest one is majority voting, which has been investigated by a number of researchers [3–6]. In majority voting, N is usually odd, and the voter needs at least ⌈N/2⌉ software versions to produce the same output in order to determine a “correct” result. Consensus voting is designed for multiversion software with a small output space, in which case software versions can give identical but incorrect outputs [7]. The voter selects the output that the most software versions give. For example, suppose there are six software versions and three possible outputs. If three versions give output A, two versions give output B, and one version gives output C, then the voter will take output A as the correct result. Leung [8] proposes a maximum likelihood voting method, which applies a maximum likelihood function to decide the most likely correct result. Just like consensus voting, this maximum likelihood voting method also needs a small output space.
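A minimal sketch of the two voters just described (the version outputs are hypothetical, chosen to reproduce the six-version example above):

    from collections import Counter

    def majority_vote(outputs):
        """Return the output produced by more than half of the versions, else None."""
        value, count = Counter(outputs).most_common(1)[0]
        return value if count > len(outputs) // 2 else None  # None = system failure

    def consensus_vote(outputs):
        """Return the most frequent output (plurality); ties broken arbitrarily."""
        return Counter(outputs).most_common(1)[0][0]

    outputs = ["A", "A", "B", "C", "A", "B"]  # six versions, three distinct outputs
    print(majority_vote(outputs))   # None: no output reaches more than 3 of 6
    print(consensus_vote(outputs))  # 'A': the consensus result, as in the example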

33.2.2 Recovery Block

The RB scheme was proposed by Randell [9]. This scheme consists of three elements: a primary module, acceptance tests (AT), and alternate modules for a given task. The simplest recovery block scheme is shown in Figure 33.2.

The process begins when the output of the primary module is tested for acceptability. If the acceptance test determines that the output of the primary module is not acceptable, it restores, recovers, or “rolls back” the state of the system to that before the primary module was executed.


[Figure 33.2. Recovery block scheme: modules 1 to N execute sequentially; each output is checked by an acceptance test (AT), and the system fails only if every module’s output is rejected.]

It then allows the second module to execute and evaluates its output; if that output is not acceptable, it starts the third module and evaluates its output, and so on. If all modules execute and none gives an acceptable result, then the system fails.

One problem with this scheme is how to find a simple and highly reliable AT, because the AT is quite dependent on the application. There may also be a different acceptance test for each module. In some cases, the AT will be very simple. An interesting example is given in Fairley [10]: an implicit specification of the square root function SQRT can serve as an AT:

(ABS((SQRT(x) * SQRT(x)) − x) < Error)    for 0 ≤ x ≤ y

where x and y are real numbers and Error is the permissible error range.

In this case, it is much easier to perform the AT than to compute SQRT itself, so we can easily set up a reliable AT for SQRT. Unfortunately, sometimes it is not easy to develop a reliable and simple AT; an AT can be complex and as costly as the full problem solution. In some applications, it is enough for the AT to check only the output of the software, but in other applications this may not be sufficient: the software’s internal states also need to be checked to detect a failure. Compared with the voter in an NVP scheme, the AT is much more difficult to build.
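A minimal recovery-block sketch built around the SQRT acceptance test quoted above; the primary and alternate modules and the error bound are hypothetical stand-ins for independently developed implementations:

    import math

    ERROR = 1e-6  # permissible error range for the acceptance test

    def acceptance_test(x, result):
        """The implicit-specification AT from Fairley's SQRT example."""
        return abs(result * result - x) < ERROR

    def primary_sqrt(x):     # hypothetical primary module
        return math.sqrt(x)

    def alternate_sqrt(x):   # hypothetical alternate module (Newton's method)
        r = x if x > 1 else 1.0
        for _ in range(60):
            r = 0.5 * (r + x / r)
        return r

    def recovery_block(x, modules=(primary_sqrt, alternate_sqrt)):
        for module in modules:       # modules execute sequentially (Figure 33.2);
            result = module(x)       # state rollback is omitted for brevity
            if acceptance_test(x, result):
                return result
        raise RuntimeError("system failure: no module passed the acceptance test")

    print(recovery_block(2.0))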

One of the differences between RB and NVP is that the modules are executed sequentially in RB, and simultaneously in NVP. Also, the AT in RB can determine whether a version gives the correct result, but no such mechanism is built into NVP, which can only rely on the voter to decide the best output. Furthermore, since RB needs to switch to the backup modules when a primary software module fails, RB takes much longer to finish the computation than NVP if many modules fail in a task. Therefore, the recovery block scheme is generally not applicable to critical systems where real-time response is of great concern.

33.2.3 Other Fault-tolerance Techniques

Some studies combine RB and NVP to create new hybrid techniques, such as the consensus recovery block [5] and acceptance voting [11].


[Figure 33.3. Consensus recovery block: the input first goes to NVP; on NVP failure the system reverts to RB, and only when both fail does the system fail.]

The consensus recovery block is a hybrid system that combines NVP and RB, in that order (Figure 33.3). If NVP fails, the system reverts to RB using the same modules (the same module results can be used). Only when both NVP and RB fail does the system fail.

Acceptance voting is also an NVP scheme, but it adds an AT to each version (Figure 33.4). The output of each version is first sent to its AT; if the AT accepts the output, the output is passed to the voter. The voter sees only those outputs that have been passed by the acceptance test.

[Figure 33.4. Acceptance voting: each module M1, ..., MN has its own acceptance test A1, ..., AN, and only accepted outputs reach the voter; the system fails if the voter receives no outputs.]

This implies that the voter may not process the same number of outputs at each invocation, and hence the voting algorithm must be dynamic. The system fails if no outputs are submitted to the voter. If only one output is submitted, the voter must assume it to be correct and pass it to the next stage.

33.3 N-version Programming Modeling

A number of systematic experimental studies of fault-tolerant software issues have been conducted over the last 20 years by both academia and industry [6, 12, 13]. Modeling of an NVP scheme gives insight into its behavior and allows quantification of its merit. The reliability modeling of fault-tolerant software systems falls into two categories: data-domain modeling and time-domain modeling. Both analyses use the assumption that failure events are independent between or among different versions.

33.3.1 Basic Analysis

33.3.1.1 Data-domain Modeling

Data-domain modeling uses the assumption that failure events among different software versions are independent of each other. This abstraction simplifies the modeling and provides insight into the behavior of NVP schemes.

If r_i represents the reliability of the ith software version, and R_V represents the reliability of the voter, which is also assumed independent of software version failures, then the system reliability of a three-version NVP fault-tolerant system is:

R_{NVP3}(r_1, r_2, r_3, R_V) = R_V (r_1 r_2 + r_1 r_3 + r_2 r_3 - 2 r_1 r_2 r_3)    (33.2)

If all versions have the same reliability r, then the probability that an NVP system will operate successfully under a majority voting strategy is given by the following expression:


R_{NVP}(r, R_V) = R_V \sum_{i=m}^{N} \binom{N}{i} r^i (1 - r)^{N-i}    (33.3)

where m (m ≤ N) is the lower bound on the required number of versions that produce the same output. Figure 33.5 shows various system reliability functions versus version reliability for R_V = 1.

[Figure 33.5. Comparison of system reliability and version reliability: system reliability versus version reliability for the 2-out-of-3, 3-out-of-5, and 4-out-of-7 schemes.]

To achieve higher reliability, the reliability of a single version must be greater than 0.5; only then does the system reliability increase with the number of versions involved. If the output space has cardinality ρ, then NVP will result in a system that is more reliable than a single component only if r > 1/ρ [7].
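A short sketch of Equation 33.3 with a perfect voter (R_V = 1), reproducing the m-out-of-N curves of Figure 33.5 at a sample version reliability (r = 0.9 is arbitrary):

    from math import comb

    def nvp_reliability(r, n, m, r_voter=1.0):
        """Probability that at least m of n independent versions give the
        same correct output (Equation 33.3)."""
        return r_voter * sum(comb(n, i) * r**i * (1 - r)**(n - i)
                             for i in range(m, n + 1))

    for n, m in [(3, 2), (5, 3), (7, 4)]:
        print(f"{m}-out-of-{n}: {nvp_reliability(0.9, n, m):.6f}")

Consistent with the observation above, rerunning the loop with r < 0.5 shows the system reliability falling below the single-version reliability as N grows.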

33.3.1.2 Time-domain Modeling

Time-domain modeling is concerned with the behavior of system reliability over time. The simplest time-dependent failure model assumes that failures arrive randomly, with interarrival times exponentially distributed with constant rate λ. The reliability of a single software version will then be:

r(t) = e^{-λt}    (33.4)

If we assume that all versions have the same reliability level, then the system reliability of NVP is given by:

R_{NVP}(r(t), R_V) = R_V \sum_{i=m}^{N} \binom{N}{i} r(t)^i (1 - r(t))^{N-i}    (33.5)

If N = 3, then

R_{NVP3}(t) = 3 e^{-2λt} - 2 e^{-3λt}    (33.6)

Given λ = 0.05 per hour, we can plot the r(t) and R_NVP3(t) curves and compare the reliability improvement (Figure 33.6). It is easy to see that when t ≤ t* = 14 hours ≈ 0.7/λ, the system is more reliable than a single version, but when t > t*, the system is less reliable than a single version.
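A small numerical check of this crossover (a sketch, not from the original text): setting 3 e^{−2λt} − 2 e^{−3λt} = e^{−λt} and substituting u = e^{−λt} gives u = 1/2, i.e. t* = ln 2 / λ ≈ 0.7/λ, about 13.9 hours for λ = 0.05 per hour.

    import math

    lam = 0.05  # failures per hour, as in the text

    def r_single(t):
        return math.exp(-lam * t)                                       # Equation 33.4

    def r_nvp3(t):
        return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)  # Equation 33.6

    print(math.log(2) / lam)  # t* ~ 13.86 hours
    for t in (5, 14, 20):     # before, near, and after the crossover
        print(t, round(r_single(t), 4), round(r_nvp3(t), 4))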

The above analysis is very simple, and the actual problem is far more complex. It has been shown that independently developed versions may not fail independently, due to the failure correlations between software versions [12, 14]. The next section discusses several existing models that consider the failure correlation between software versions.


[Figure 33.6. Single-version reliability r(t) versus three-version system reliability R_NVP3(t) over time (hours); the two curves cross at t*.]

33.3.2 Reliability in the Presence of Failure Correlation

Experiments show that the incidence of correlated failures of NVP system components may not be negligible in the context of current software development and testing techniques [5, 15, 16]. Although software versions are developed independently, many researchers have revealed that independently developed software versions do not necessarily fail independently [12, 17]. According to Knight and Leveson [14], experiments have shown that the use of different languages and design philosophies has little effect on reliability in NVP, because developers tend to make similar logical mistakes in a difficult-to-program part of the software.

In NVP systems, a coincident failure occurs when a related fault between versions is activated by some input, or when two unrelated faults in different versions are activated by the same input.

Laprie et al. [18] classify software faults according to their independence into either related or independent. Related faults manifest themselves as similar errors and lead to common-mode failures, whereas independent faults usually cause distinct errors and separate failures.

Due to the difficulty of distinguishing related faults from independent faults, in this chapter we simplify this fault classification as follows. If two or more versions give identical but wrong results, then the failures are caused by related faults between versions; if two or more versions give dissimilar but wrong results, then the failures are caused by unrelated, or independent, software faults.

For the rest of the chapter, we refer to related faults as common faults for simplicity. Figure 33.7 illustrates the common faults and the independent faults in NVP systems.

Common faults are located in the functionally equivalent modules of two or more software versions, because programmers are prone to making the same or similar mistakes even though they develop the versions independently. These faults are activated by the same inputs and cause those versions to fail simultaneously; the failures caused by common faults are called common failures. Independent faults are usually located in different or functionally unequivalent modules between or among different software versions. Since they are independent of each other, and their resulting failures are typically distinguishable by the decision mechanism, they are often considered harmless to fault-tolerant systems. However, there is still a probability, though very small compared with that of common failures (CFs), that an unforeseeable input activates two independent faults in different software versions and leads these two versions to fail at the same time.


[Figure 33.7. Common faults and independent faults: common faults lie in functionally equivalent modules of different versions, while independent faults lie in different modules.]

Table 33.1. Common failures and concurrent independent failures

                              Common failures            Concurrent independent failures
Fault type                    Common faults              Independent faults
Output                        Usually the same           Usually different
Fault location (logically)    Same                       Different
Voting result (majority)      Chooses a wrong solution   Unable to choose the correct solution

These failures caused by independent faults are called concurrent independent failures (CIFs). Table 33.1 shows the differences between common failures and concurrent independent failures.

33.3.3 Reliability Analysis and Modeling

The reliability model of NVP systems should consider not only the common failures, but also the concurrent independent failures among or between software versions. Some reliability models have been proposed to incorporate the interversion failure dependency.

Eckhardt and Lee [4] developed a theoretical model that provides a probabilistic framework for empirically evaluating the effectiveness of a general multiversion strategy when component versions are subject to coincident errors, by assuming statistical distributions of input choice and program choice. Their work is among the first that showed independently developed program versions to fail dependently.

Littlewood and Miller [6] further showed that there is a precise duality between input choice and program choice, and considered a generalization in which different versions may be developed using diverse methodologies. The use of diverse methodologies is shown to decrease the probability of simultaneous failure of several versions.


Nicola and Goyal [12] proposed the use of a beta-binomial distribution to model correlated failures in multi-version software systems, and presented a combinatorial model to predict the reliability of a multi-version software configuration.

The above models focus on software diversity modeling, and are based on a detailed analysis of the dependencies in diversified software. Other researchers focus on modeling fault-tolerant software system behavior.

Dugan and Lyu [19] presented a quantitative reliability analysis for an N-version programming application. The software systems of this application were developed and programmed by 15 teams at the University of Iowa and the Rockwell/Collins Avionics Divisions. The overall model is a Markov reward model in which the states of the Markov chain represent the long-term evolution of the structure of the system.

Figure 33.8 illustrates the Markov model in Dugan and Lyu [19]. In the initial state, three independently developed software versions are running on three active processors. The processors have the same failure rate λ. After the first hardware failure, the TMR system is reconfigured to a simplex system successfully with probability c. So the transition rate to the reconfiguration state is 3λc, and the transition rate to the failure state caused by an uncovered failure is 3λ(1 − c). The system fails when the single remaining processor fails; thus the transition rate from the reconfiguration state to the failure state is λ.

Their model considers independent software faults, related software faults, transient hardware faults, permanent hardware faults, and imperfect coverage. The experimental results from this application were used to estimate the probabilities associated with the activation of software faults. The overall NVP reliability is obtained by combining the hardware reliability and the 3VP software reliability.

Tai et al. [20] proposed a performance and dependability model to assess and improve the effectiveness of NVP systems. In their research, the performance model of NVP is a renewal process, and the dependability model is a fault-manifestation model. They propose a scheme to enhance the performance and dependability of NVP systems.

[Figure 33.8. Markov model of system structure in Dugan and Lyu [19]: from the initial state (three versions on three processors) the system moves to a reconfigured simplex state at rate 3λc, to the failure state at rate 3λ(1 − c), and from the simplex state to the failure state at rate λ.]

This scheme is a 3VP scheme in which the slowest version is utilized as a tie-breaker (TB) when the first two versions “tie” by disagreeing. An acceptance test (AT) is added in the NVP-AT scheme, where the AT is used to decide whether the result is correct after the decision function reaches a consensus decision. In this research, the failures caused by related faults and unrelated faults were well modeled, and their effects on the effectiveness of these NVP schemes were evaluated.

While the modeling methods of Eckhardt and Lee [4], Littlewood and Miller [6], and Nicola and Goyal [12] are quite different from the other modeling methods [19, 20], Goseva-Popstojanova and Grnarov [21] incorporate their methodologies into a Markovian performance model and present a unified approach aimed at modeling the joint behavior of the N-version system and its operational environment. This new approach provides some insight into how reliability is affected by version characteristics and the operational environment.


Lin and Chen [22] proposed two software-debugging models with linear dependence, which appears to be the first effort to apply a non-homogeneous Poisson process (NHPP) to an NVP system. These two models define a correlating parameter α_i for the failure intensity of software version i, and the mean value function of version i is a logarithmic Poisson execution-time function:

m_i(θ_i, λ_i, t) = (1/θ_i) \log(λ_i θ_i t + 1)    (33.7)

In model I, the effect of s-dependency among software versions is modeled by increasing the initial failure intensity of each version:

λ_i' = α_1 λ_1 + \cdots + α_{i-1} λ_{i-1} + λ_i + α_{i+1} λ_{i+1} + \cdots + α_N λ_N    (33.8)

In model II, the effect of s-dependency among software versions is modeled by increasing the nominal mean value function of each version, to:

m_i'(θ, λ, t) = α_1 m_1(θ_1, λ_1, t) + \cdots + α_{i-1} m_{i-1}(θ_{i-1}, λ_{i-1}, t) + m_i(θ_i, λ_i, t) + α_{i+1} m_{i+1}(θ_{i+1}, λ_{i+1}, t) + \cdots + α_N m_N(θ_N, λ_N, t)    (33.9)

Although these two models use the non-homogeneous Poisson process method, they are not software reliability growth models (SRGMs): they do not consider the software reliability growth due to continuous removal of faults from the components of the NVP system. The progressive removal of residual design faults from the software versions causes their reliability to grow, which in turn causes the reliability of the NVP software system to grow. Kanoun et al. [23] proposed a reliability growth model for NVP systems using the hyperexponential model. The failure intensity function is given by:

h(t) = \frac{ω ξ_{sup} e^{-ξ_{sup} t} + \bar{ω} ξ_{inf} e^{-ξ_{inf} t}}{ω e^{-ξ_{sup} t} + \bar{ω} e^{-ξ_{inf} t}}    (33.10)

where 0 ≤ ω ≤ 1, \bar{ω} = 1 − ω, and ξ_{sup} and ξ_{inf} are, respectively, the hyperexponential model parameters characterizing reliability growth due to the removal of faults. It should be noted that the function h(t) in Equation 33.10 is non-increasing with time t for 0 ≤ ω ≤ 1, decreasing from h(0) = ω ξ_{sup} + \bar{ω} ξ_{inf} to h(∞) = ξ_{inf}.

This model is the first that considers the impact on software reliability of the reliability growth resulting from the progressive removal of faults from each software version. By interpreting the hyperexponential model as a Markov model that can handle reliability growth, this model allows the reliability growth of an NVP system to be modeled from the reliability growth of its components.
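A minimal sketch evaluating the failure intensity of Equation 33.10; the parameter values below (ω, ξ_sup, ξ_inf) are purely illustrative:

    import math

    def h(t, omega, xi_sup, xi_inf):
        """Hyperexponential failure intensity, Equation 33.10."""
        om_bar = 1.0 - omega
        num = omega * xi_sup * math.exp(-xi_sup * t) + om_bar * xi_inf * math.exp(-xi_inf * t)
        den = omega * math.exp(-xi_sup * t) + om_bar * math.exp(-xi_inf * t)
        return num / den

    # h(t) decreases from omega*xi_sup + (1 - omega)*xi_inf at t = 0 toward xi_inf.
    for t in (0.0, 10.0, 100.0, 1000.0):
        print(t, h(t, omega=0.3, xi_sup=0.1, xi_inf=0.001))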

Sha [24] investigated the relationship between complexity, reliability, and development resources within an N-version programming system, and presented an approach to building a system that can manage upgrades and repair itself when complex software components fail. His result counters the belief that diversity results in improved reliability under limited development resources: the reliability of single-version programming with undivided effort is superior to three-version programming over a wide range of development effort. He also pointed out that single-version programming might not always be superior to its N-version counterpart, because some additional versions can be obtained inexpensively.

Little research has been performed in the reliability modeling of NVP systems. Most NVP system research either focuses on modeling software diversity [4, 6, 12], or aims primarily to evaluate some dependability measures for specific types of software systems [13, 19]. Most of these proposed models assume stable reliability, i.e. they do not consider reliability growth due to corrective maintenance; thus they may not be applicable to the developing and testing phases of the software lifecycle, where the reliability of NVP systems grows as a result of progressive removal of residual faults from the software components.

During the developing and testing phases, some important questions are:

• How reliable is the software?
• How many remaining faults are in the software?


• How much testing does it still need to attain the required reliability level (i.e. when to stop testing)?

For traditional single-version software, an SRGM can be used to answer these questions. Kanoun et al. [23] developed the first SRGM for NVP systems, which does not consider the impact of imperfect debugging on either independent failures or common failures. The role of faults in NVP systems may change due to imperfect debugging: some (potential) common faults may reduce to low-level common faults or even to independent faults. Because Kanoun’s model is not based on the characteristics of the software and the testing/debugging process, users cannot obtain much information about the NVP system and its components, such as the initial number of independent and common faults. Finally, this model is extremely complicated, which prevents it from being successfully applied to large-N (N > 3) versions of programming applications.

Thus, there is a great need to develop a new reliability model for NVP systems that provides insight into the development process of NVP systems and is able to answer the above questions. In other words, the motive of this research is to develop an SRGM for NVP systems. Differently from the research of Kanoun et al. [23], this model is the first attempt to establish a software reliability model for NVP systems that considers error removal efficiency and the error introduction rate during testing and debugging. In this chapter, we present a generalized NHPP model for a single software program, then apply this generalized NHPP model to NVP systems to develop an NVP software reliability growth model.

33.4 Generalized Non-homogeneous Poisson Process Model Formulation

As a general class of well-developed stochastic process models in reliability engineering, NHPP models have been successfully applied to software reliability modeling. NHPP models are especially useful to describe failure processes that possess certain trends such as reliability growth or deterioration. Zhang et al. [25] proposed a generalized NHPP model with the following assumptions.

1. A software program can fail during execution.
2. The occurrence of software failures follows an NHPP with mean value function m(t).
3. The software failure detection rate at any time is proportional to the number of faults remaining in the software at that time.
4. When a software failure occurs, a debugging effort takes place immediately. This effort removes the fault immediately with probability p, where p ≫ 1 − p. The debugging is s-independent at each location of the software failures.
5. For each debugging effort, whether the fault is successfully removed or not, some new faults may be introduced into the software system with probability β(t), β(t) ≪ p.

From the above assumptions, we can formulate the following equations:

m'(t) = b(t)(a(t) - p m(t))
a'(t) = β(t) m'(t)    (33.11)

where

m(t) = expected number of software failures by time t, m(t) = E[N(t)]
a(t) = expected number of initial software errors plus errors introduced by time t
b(t) = failure detection rate per fault at time t
p = probability that a fault will be successfully removed from the software
β(t) = fault introduction rate at time t.

If the initial conditions are given as m(0) = 0 and a(0) = a, the solutions to Equation 33.11 are as follows:

m(t) = a \int_0^t b(u) \, e^{-\int_0^u (p - \beta(\tau)) b(\tau) \, d\tau} \, du    (33.12)

a(t) = a \left( 1 + \int_0^t \beta(u) b(u) \, e^{-\int_0^u (p - \beta(\tau)) b(\tau) \, d\tau} \, du \right)    (33.13)
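As a numerical sketch of Equations 33.12 and 33.13 (not from the original text), assume the simplest case b(t) = b and β(t) = β, for which Equation 33.12 has the closed form m(t) = a(1 − e^{−(p−β)bt})/(p − β); the code checks a rectangle-rule quadrature against it, with all parameter values hypothetical:

    import math

    a, b, p, beta = 100.0, 0.02, 0.95, 0.05  # hypothetical constants

    def m_closed(t):
        """Closed-form m(t) for constant b(t) = b and beta(t) = beta."""
        return a * (1.0 - math.exp(-(p - beta) * b * t)) / (p - beta)

    def m_numeric(t, steps=10_000):
        """Rectangle-rule evaluation of Equation 33.12."""
        dt = t / steps
        total, inner = 0.0, 0.0  # inner accumulates the exponent integral
        for _ in range(steps):
            total += b * math.exp(-inner) * dt
            inner += (p - beta) * b * dt
        return a * total

    print(m_closed(100.0), m_numeric(100.0))  # the two values agree closely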


This model can be used to derive most of the known NHPP models: by changing the assumptions on b(t), β(t), and p, we obtain them as special cases. Table 33.2, for example, shows some well-known NHPP software reliability growth models that can be derived from this generalized software reliability model [26].

33.5 Non-homogeneous Poisson Process Reliability Model for N-version Programming Systems

NVP is designed to attain high system reliability by tolerating software faults. In this section we present only the modeling of NVP where N = 3, based on the results of Teng and Pham [30], but the methodology can be applied directly to the modeling of NVP where N > 3.

Let us assume that we have three independently developed software versions 1, 2, and 3, that majority voting is used, and that the reliability of the voter is 1. The following notation is used:

CF: common failure
CIF: concurrent independent failure
A: independent faults in version 1
B: independent faults in version 2
C: independent faults in version 3
AB: common faults between version 1 and version 2
AC: common faults between version 1 and version 3
BC: common faults between version 2 and version 3
ABC: common faults among version 1, version 2, and version 3
N_x(t): counting process which counts the number of type x faults discovered up to time t, x = A, B, C, AB, AC, BC, ABC
N_d(t): N_d(t) = N_{AB}(t) + N_{AC}(t) + N_{BC}(t) + N_{ABC}(t); counting process which counts the common faults discovered in the NVP system up to time t
m_x(t): mean value function of counting process N_x(t), m_x(t) = E[N_x(t)], x = A, B, C, AB, AC, BC, ABC, d
a_x(t): total number of type x faults in the system plus those type x faults already removed from the system at time t; a_x(t) is a non-decreasing function, and a_x(0) denotes the initial number of type x faults in the system, x = A, B, C, AB, AC, BC, ABC
b(t): failure detection rate per fault at time t
β_1, β_2, β_3: probability that a new fault is introduced into version 1, 2, and 3, respectively, during debugging
p_1, p_2, p_3: probability that a fault is successfully removed from version 1, 2, and 3, respectively, during debugging
X_A(t), X_B(t), X_C(t): number of type A, B, and C faults remaining in the system at time t, respectively, i.e. X_A(t) = a_A(t) − p_1 m_A(t), X_B(t) = a_B(t) − p_2 m_B(t), X_C(t) = a_C(t) − p_3 m_C(t)
R(x | t): software reliability function for given mission time x and time to stop testing t; R(x | t) = Pr{no failure during mission x | stop testing at t}
K_AB, K_AC, K_BC: failure intensity per pair of faults for CIFs between versions 1 and 2, versions 1 and 3, and versions 2 and 3, respectively


Table 33.2. Some well-known NHPP software reliability models

Goel–Okumoto (G-O) [27] (concave):
  m(t) = a(1 - e^{-bt}); a(t) = a, b(t) = b.
  Also called the exponential model.

Delayed S-shaped (S-shaped):
  m(t) = a(1 - (1 + bt) e^{-bt}).
  Modification of the G-O model to make it S-shaped.

Inflection S-shaped SRGM (S-shaped):
  m(t) = a(1 - e^{-bt}) / (1 + β e^{-bt}); a(t) = a, b(t) = b / (1 + β e^{-bt}).
  Solves a technical condition with the G-O model; becomes the same as G-O if β = 0.

Yamada exponential (S-shaped):
  m(t) = a(1 - e^{-rα(1 - e^{-βt})}); a(t) = a, b(t) = rαβ e^{-βt}.
  Attempts to account for testing effort.

Yamada Rayleigh (S-shaped):
  m(t) = a(1 - e^{-rα(1 - e^{-βt²/2})}); a(t) = a, b(t) = rαβt e^{-βt²/2}.
  Attempts to account for testing effort.

Yamada imperfect debugging model 1 (concave):
  m(t) = ab(e^{αt} - e^{-bt}) / (α + b); a(t) = a e^{αt}, b(t) = b.
  Assumes an exponential fault content function and a constant fault detection rate.

Yamada imperfect debugging model 2 (concave):
  m(t) = a[1 - e^{-bt}][1 - α/b] + αat; a(t) = a(1 + αt), b(t) = b.
  Assumes a constant fault introduction rate α and a constant fault detection rate.

P-N-Z model [28] (S-shaped and concave):
  m(t) = (a[1 - e^{-bt}][1 - α/b] + αat) / (1 + β e^{-bt}); a(t) = a(1 + αt), b(t) = b / (1 + β e^{-bt}).
  Assumes the introduction rate is a linear function of testing time, and a non-decreasing, inflexion S-shaped fault detection rate function.

P-Z model [29] (S-shaped and concave):
  m(t) = [(c + a)(1 - e^{-bt}) - (ab/(b - α))(e^{-αt} - e^{-bt})] / (1 + β e^{-bt}); a(t) = c + a(1 - e^{-αt}), b(t) = b / (1 + β e^{-bt}).
  Assumes the introduction rate is an exponential function of testing time, and a non-decreasing, inflexion S-shaped fault detection rate function.

Z-T-P model [25] (S-shaped):
  m(t) = (a/(p - β))[1 - ((1 + α) e^{-bt} / (1 + α e^{-bt}))^{(c/b)(p - β)}]; a'(t) = β(t)m'(t), b(t) = c / (1 + α e^{-bt}), β(t) = β.
  Assumes a constant fault introduction rate, and a non-decreasing, inflexion S-shaped fault detection rate function.


\bar{N}_{AB}(t), \bar{N}_{AC}(t), \bar{N}_{BC}(t): counting processes that count the number of CIFs involving versions 1 and 2, versions 1 and 3, and versions 2 and 3 up to time t, respectively
N_I(t): counting process that counts the total number of CIFs up to time t; N_I(t) = \bar{N}_{AB}(t) + \bar{N}_{AC}(t) + \bar{N}_{BC}(t)
\bar{m}_{AB}(t), \bar{m}_{AC}(t), \bar{m}_{BC}(t), m_I(t): mean value functions of the corresponding counting processes, m(t) = E[N(t)]; e.g. \bar{m}_{AB}(t) = E[\bar{N}_{AB}(t)]
h_AB(t), h_AC(t), h_BC(t): failure intensity functions of CIFs involving versions 1 and 2, versions 1 and 3, and versions 2 and 3; h_AB(t) = (d/dt) \bar{m}_{AB}(t)
MLE: maximum likelihood estimation
NVP-SRGM: software reliability growth model for NVP systems
R_NVP-SRGM(x | t): NVP system reliability function for given mission time x and time to stop testing t, with consideration of common failures in the NVP system
R_Ind(x | t): NVP system reliability function for given mission time x and time to stop testing t, assuming no CFs in the NVP system, i.e. the versions fail independently

Different types of faults and their relations are shown in Figure 33.9. Generally, this notation scheme uses numbers to refer to software versions and letters to refer to software fault types. For example, the process N1(t) counts the number of failures in software version 1 up to time t, therefore:

N1(t) = N_A(t) + N_AB(t) + N_AC(t) + N_ABC(t)

Similarly:

N2(t) = N_B(t) + N_AB(t) + N_BC(t) + N_ABC(t)

N3(t) = N_C(t) + N_AC(t) + N_BC(t) + N_ABC(t)
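As a quick illustration of this bookkeeping (a minimal sketch of ours, with a hypothetical event list), the helper below tallies per-version failure counts from fault-type labels, exactly as in the decomposition above.

```python
from collections import Counter

# Fault-type label -> versions affected; letters A, B, C map to
# versions 1, 2, 3 as in Figure 33.9.
VERSIONS_AFFECTED = {
    "A": {1}, "B": {2}, "C": {3},
    "AB": {1, 2}, "AC": {1, 3}, "BC": {2, 3},
    "ABC": {1, 2, 3},
}

def per_version_counts(fault_events):
    """N_i = count of all fault-type events whose label involves version i."""
    counts = Counter()
    for label in fault_events:
        for v in VERSIONS_AFFECTED[label]:
            counts[v] += 1
    return counts

# Hypothetical observed fault types up to some time t.
events = ["A", "C", "A", "BC", "B", "AB", "ABC"]
print(per_version_counts(events))  # Counter({1: 4, 2: 4, 3: 3})
```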

Figure 33.9. Different software faults in the three-version software system (a Venn diagram over versions 1, 2, and 3 showing the independent fault regions A, B, C and the common fault regions AB, AC, BC, and ABC)

There are two kinds of coincident failures in NVP systems: common failures (CFs) and concurrent independent failures (CIFs). CFs are caused by the common faults between or among versions, and CIFs are caused by the independent (unrelated) faults between versions. A CIF occurs when two or more versions fail at the same input on independent faults, i.e. on faults of type A, B, or C, not on AB, AC, etc.

N_AB(t) and N^I_AB(t) both count the number of coincident failures between version 1 and version 2. The difference between them is that N_AB(t) counts the number of CFs of versions 1 and 2, whereas N^I_AB(t) counts the number of CIFs of versions 1 and 2. Table 33.3 illustrates the different failure types between or among the software versions.

To establish the reliability growth model for an NVP fault-tolerant software system, we need to analyze the reliability of the components as well as the correlation among the software versions.

33.5.1 Model Assumptions

To develop an NHPP model for NVP systems, we make the following assumptions.

1. N = 3.
2. The voter is assumed perfect at all times, i.e. R_voter = 1.
3. Faster versions have to wait for the slowest version to finish (prior to voting).


Table 33.3. Failure types in the three-version programming software system

Failure time   Version 1   Version 2   Version 3   Failure type
t1             ×                                   A
t2                                     ×           C
t3             ×                                   A
t4                         ×           ×           BC (CF) or B and C together (CIF)
t5                         ×                       B
t6             ×                                   A
t7                                     ×           C
t8             ×                       ×           AC (CF) or A and C together (CIF)
t9                         ×                       B
t10            ×           ×                       AB (CF) or A and B together (CIF)
...            ...         ...         ...         ...
ti             ×           ×           ×           ABC
...            ...         ...         ...         ...

Note: the 3VP system fails when two or more versions fail at the same time.

4. Each software version can fail during execution, caused by the faults in the software.

5. Two or more software versions may fail on the same input; this can be caused by either the common faults or the independent faults between or among the different versions.

6. The occurrence of software failures (by independent faults, two-version common faults, or three-version common faults) follows an NHPP.

7. The software failure detection rate at any time is proportional to the number of remaining faults in the software at that time.

8. The unit error detection rates for all kinds of faults A, B, C, AB, AC, BC, and ABC are the same and constant, i.e. b(t) = b.

9. When a software failure occurs in any of the three versions, a debugging effort is executed immediately. That effort removes the corresponding fault(s) immediately with probability p_i, p_i ≥ 1 − p_i (i = 1, 2, 3 is the version number).

10. For each debugging effort, whether the fault(s) are successfully removed or not, some new independent faults may be introduced into that version with probability β_i, β_i ≤ p_i (i = 1, 2, 3), but no new common fault is introduced into the NVP system.

11. Some common faults may reduce to lower-level common faults or to independent faults because of unsuccessful removal efforts.

12. The CIFs are caused by the activation of independent faults in different versions, and the probability that a CIF involves three versions is zero; such failures involve only two versions.

13. Any pair of remaining independent faults between two versions has the same probability of being activated.

14. The intensity of CIFs involving any two versions is proportional to the number of remaining pairs of independent faults in those two versions.

Some further explanations for assumptions 8–11 follow.

All versions accept the same inputs and run simultaneously, and the fastest version has to wait for the slowest version to finish. Therefore, the unit error detection rate, b, can be taken to be the same for all kinds of faults (assumption 8). N groups are assigned to perform testing and debugging on the N versions independently, so the error removal efficiency p and the error introduction rate β differ from version to version (assumption 9). When two groups are trying to remove an AB fault from version 1 and version 2, each may introduce a new fault into its own version, but the probability that the two groups make the same mistake and introduce a new AB fault into both versions is zero (assumption 10); this means that only independent faults can be introduced into the system. An AB fault may be removed successfully from version 1 but not from version 2; the previously common fault (AB) is then no longer common to versions 1 and 2, and it reduces to a B fault, which exists only in version 2 (assumption 11).

Figure 33.10. Independent fault pairs between versions 1 and 2

Figure 33.10 shows the pairs of independent faults between software version 1 and version 2. There are three independent faults (type A) in version 1 and two independent faults (type B) in version 2, so there are 3 × 2 = 6 pairs of independent faults between version 1 and version 2. It is assumed that each of these six pairs has the same probability of being activated.

In this study the voter is not expected to decide whether a software version fails during testing; rather, we assume that advanced methods or tools are available during testing to determine exactly whether and when a software version fails or succeeds. Such failure detection methods are highly application-dependent, and they are too awkward or too impractical to be built into an NVP system. Therefore, after the NVP software system is released we no longer have these tools to decide whether a version of the NVP system fails, and we can rely only on the voter to decide the best or the correct output.
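The failure processes postulated in assumptions 6 and 7 can be simulated directly. The sketch below (ours, not from the chapter) draws the failure epochs of an NHPP by Lewis–Shedler thinning for any intensity function bounded above by lam_max on the horizon of interest; the example intensity and its parameters are hypothetical.

```python
import math
import random

def simulate_nhpp(intensity, lam_max, horizon, seed=1):
    """Lewis-Shedler thinning: sample candidates from a homogeneous
    Poisson process of rate lam_max, keep each with prob intensity(t)/lam_max."""
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(lam_max)        # next candidate arrival
        if t > horizon:
            return events
        if rng.random() < intensity(t) / lam_max:
            events.append(t)                 # accepted failure epoch

# Hypothetical G-O-type intensity lambda(t) = a*b*exp(-b*t); it is
# proportional to the remaining fault count, matching assumption 7.
a, b = 30.0, 0.05
lam = lambda t: a * b * math.exp(-b * t)
times = simulate_nhpp(lam, lam_max=a * b, horizon=100.0)
print(len(times), "failures; expected ~", round(a * (1 - math.exp(-b * 100)), 1))
```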

33.5.2 Model Formulations

Based on the assumptions given in the last section, we can establish the following NHPP equations for the different types of software faults and failures.

33.5.2.1 Mean Value Functions

1. Type ABC

m′_ABC(t) = b(a_ABC − p_1 p_2 p_3 m_ABC(t))   (33.14)

with initial conditions m_ABC(0) = 0 and a_ABC(0) = a_ABC, where a_ABC is the initial number of type ABC faults in the 3VP software system. Equation 33.14 can be solved directly as:

m_ABC(t) = (a_ABC/(p_1 p_2 p_3))(1 − e^{−b p_1 p_2 p_3 t})   (33.15)

2. Type AB

m′_AB(t) = b(a_AB(t) − p_1 p_2 m_AB(t))
a′_AB(t) = (1 − p_1)(1 − p_2) p_3 m′_ABC(t)   (33.16)

Type AC

m′_AC(t) = b(a_AC(t) − p_1 p_3 m_AC(t))
a′_AC(t) = (1 − p_1)(1 − p_3) p_2 m′_ABC(t)   (33.17)

Type BC

m′_BC(t) = b(a_BC(t) − p_2 p_3 m_BC(t))
a′_BC(t) = (1 − p_2)(1 − p_3) p_1 m′_ABC(t)   (33.18)

with initial conditions

m_AB(0) = 0, a_AB(0) = a_AB
m_AC(0) = 0, a_AC(0) = a_AC
m_BC(0) = 0, a_BC(0) = a_BC

By applying Equation 33.15 to Equations 33.16–33.18, we obtain the mean value functions for failure types AB, AC, and BC, respectively:

m_AB(t) = C_1^AB − C_2^AB e^{−b p_1 p_2 t} + C_3^AB e^{−b p_1 p_2 p_3 t}   (33.19)

m_AC(t) = C_1^AC − C_2^AC e^{−b p_1 p_3 t} + C_3^AC e^{−b p_1 p_2 p_3 t}   (33.20)

m_BC(t) = C_1^BC − C_2^BC e^{−b p_2 p_3 t} + C_3^BC e^{−b p_1 p_2 p_3 t}   (33.21)

where

C_1^AB = a_AB/(p_1 p_2) + a_ABC (1 − p_1)(1 − p_2)/(p_1^2 p_2^2)
C_2^AB = C_1^AB − a_ABC (1 − p_1)(1 − p_2)/(p_1^2 p_2^2 (1 − p_3))
C_3^AB = −a_ABC (1 − p_1)(1 − p_2)/(p_1^2 p_2^2 (1 − p_3))   (33.22)

C_1^AC = a_AC/(p_1 p_3) + a_ABC (1 − p_1)(1 − p_3)/(p_1^2 p_3^2)
C_2^AC = C_1^AC − a_ABC (1 − p_1)(1 − p_3)/(p_1^2 p_3^2 (1 − p_2))
C_3^AC = −a_ABC (1 − p_1)(1 − p_3)/(p_1^2 p_3^2 (1 − p_2))   (33.23)

C_1^BC = a_BC/(p_2 p_3) + a_ABC (1 − p_2)(1 − p_3)/(p_2^2 p_3^2)
C_2^BC = C_1^BC − a_ABC (1 − p_2)(1 − p_3)/(p_2^2 p_3^2 (1 − p_1))
C_3^BC = −a_ABC (1 − p_2)(1 − p_3)/(p_2^2 p_3^2 (1 − p_1))   (33.24)

3. Type A

m′_A(t) = b(a_A(t) − p_1 m_A(t))
a′_A(t) = β_1 (m′_A(t) + m′_AB(t) + m′_AC(t) + m′_ABC(t)) + (1 − p_1) p_2 m′_AB(t) + (1 − p_1) p_3 m′_AC(t) + (1 − p_1) p_2 p_3 m′_ABC(t)   (33.25)

Type B

m′_B(t) = b(a_B(t) − p_2 m_B(t))
a′_B(t) = β_2 (m′_B(t) + m′_AB(t) + m′_BC(t) + m′_ABC(t)) + (1 − p_2) p_1 m′_AB(t) + (1 − p_2) p_3 m′_BC(t) + (1 − p_2) p_1 p_3 m′_ABC(t)   (33.26)

Type C

m′_C(t) = b(a_C(t) − p_3 m_C(t))
a′_C(t) = β_3 (m′_C(t) + m′_AC(t) + m′_BC(t) + m′_ABC(t)) + (1 − p_3) p_1 m′_AC(t) + (1 − p_3) p_2 m′_BC(t) + (1 − p_3) p_1 p_2 m′_ABC(t)   (33.27)

with initial conditions

m_A(0) = m_B(0) = m_C(0) = 0
a_A(0) = a_A, a_B(0) = a_B, a_C(0) = a_C

The solutions to these equations are very complicated and can be obtained from Teng and Pham [30]; a numerical sketch is given below.
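As a computational companion (a minimal sketch under the chapter's assumptions, with hypothetical parameter values and helper names of our own), the code below evaluates the closed-form mean value functions of Equations 33.15 and 33.19–33.24, and integrates Equation 33.25 numerically for type A rather than reproducing the closed-form solution of Teng and Pham [30].

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical parameter values (for illustration only).
a0 = {"A": 20.0, "AB": 4.0, "AC": 3.0, "BC": 5.0, "ABC": 2.0}  # initial faults
b = 0.1                       # unit error detection rate (assumption 8)
p1, p2, p3 = 0.9, 0.85, 0.9   # fault removal efficiencies per version
beta1 = 0.05                  # fault introduction rate, version 1
P = p1 * p2 * p3

def m_ABC(t):                 # Equation 33.15
    return a0["ABC"] / P * (1 - np.exp(-b * P * t))

def dm_ABC(t):                # its derivative, needed in Equation 33.25
    return a0["ABC"] * b * np.exp(-b * P * t)

def pair_coeffs(a_pair, pa, pb, pc):
    # Equations 33.22-33.24: (pa, pb) are the removal efficiencies of the
    # two versions sharing the fault; pc belongs to the third version.
    q = pa * pb
    C1 = a_pair / q + a0["ABC"] * (1 - pa) * (1 - pb) / q**2
    Z = a0["ABC"] * (1 - pa) * (1 - pb) / (q**2 * (1 - pc))
    return C1, C1 - Z, -Z, q

def m_pair(t, c):             # Equations 33.19-33.21
    C1, C2, C3, q = c
    return C1 - C2 * np.exp(-b * q * t) + C3 * np.exp(-b * P * t)

def dm_pair(t, c):            # derivative of Equations 33.19-33.21
    C1, C2, C3, q = c
    return b * q * C2 * np.exp(-b * q * t) - b * P * C3 * np.exp(-b * P * t)

cAB = pair_coeffs(a0["AB"], p1, p2, p3)
cAC = pair_coeffs(a0["AC"], p1, p3, p2)

def rhs(t, y):                # Equation 33.25 for the pair (m_A, a_A)
    mA, aA = y
    dmA = b * (aA - p1 * mA)
    daA = (beta1 * (dmA + dm_pair(t, cAB) + dm_pair(t, cAC) + dm_ABC(t))
           + (1 - p1) * p2 * dm_pair(t, cAB)
           + (1 - p1) * p3 * dm_pair(t, cAC)
           + (1 - p1) * p2 * p3 * dm_ABC(t))
    return [dmA, daA]

sol = solve_ivp(rhs, (0.0, 100.0), [0.0, a0["A"]], rtol=1e-8)
print("m_AB(100) =", m_pair(100.0, cAB))
print("m_A(100)  =", sol.y[0, -1])
```

Types B and C follow from Equations 33.26 and 33.27 by the same pattern with the indices permuted.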

33.5.2.2 Common Failures

After we obtain the common failure counting processes N_AB(t), N_AC(t), N_BC(t), and N_ABC(t), we can define a new counting process N_d(t):

N_d(t) = N_AB(t) + N_AC(t) + N_BC(t) + N_ABC(t)

In a 3VP system that uses a majority voting mechanism, common failures in two or three versions lead to system failures. Therefore, N_d(t) counts the number of NVP software system failures due to the CFs among the software versions. Similarly, we have the mean value function:

m_d(t) = m_AB(t) + m_AC(t) + m_BC(t) + m_ABC(t)   (33.28)

Given the release time t and mission time x, the probability that the NVP software system survives the mission without a CF is:

Pr{no CF during x | t} = e^{−(m_d(t+x) − m_d(t))}   (33.29)

This is not yet the final system reliability function, since we must also consider the probability that the system fails on independent faults in the software system.
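Continuing the earlier sketch (same hypothetical parameters; m_pair, pair_coeffs, m_ABC, cAB, and cAC are the helpers defined above), Equation 33.29 becomes a one-liner once m_d(t) is assembled.

```python
import math

cBC = pair_coeffs(a0["BC"], p2, p3, p1)

def m_d(t):
    """Equation 33.28: mean number of CF-induced system failures by time t."""
    return m_pair(t, cAB) + m_pair(t, cAC) + m_pair(t, cBC) + m_ABC(t)

def prob_no_CF(x, t):
    """Equation 33.29: survive mission x with no common failure."""
    return math.exp(-(m_d(t + x) - m_d(t)))

print(prob_no_CF(x=10.0, t=100.0))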

33.5.2.3 Concurrent Independent Failures

Usually the NVP software system fails through CFs involving multiple versions. However, there is still a small probability that two software versions fail on the same input because of independent faults. This kind of failure is a concurrent independent failure.

From assumptions 12, 13, and 14, the failure intensity h_AB(t) for the CIFs between version 1 and version 2 is given by:

h_AB(t) = K_AB X_A(t) X_B(t)   (33.30)

where X_A(t) and X_B(t) denote the numbers of remaining independent faults in version 1 and version 2 respectively, and K_AB is the failure intensity per pair of faults for CIFs between version 1 and version 2. We then have another non-homogeneous Poisson process for CIFs, N^I_AB(t), with mean value function:

m^I_AB(t) = ∫_0^t h_AB(τ) dτ   (33.31)

Given the release time t and mission time x, we can obtain the probability that there is no type AB CIF during (t, t + x):

Pr{no type AB CIF during x} = e^{−(m^I_AB(t+x) − m^I_AB(t))}   (33.32)

Similarly, we have two more NHPPs for CIFs between versions 1 and 3 and between versions 2 and 3, with mean value functions:

m^I_AC(t) = ∫_0^t h_AC(τ) dτ   (33.33)

m^I_BC(t) = ∫_0^t h_BC(τ) dτ   (33.34)

where

h_AC(t) = K_AC X_A(t) X_C(t)   (33.35)

h_BC(t) = K_BC X_B(t) X_C(t)   (33.36)

The corresponding conditional probabilities are

Pr{no type AC CIF during x} = e^{−(m^I_AC(t+x) − m^I_AC(t))}

Pr{no type BC CIF during x} = e^{−(m^I_BC(t+x) − m^I_BC(t))}

If we define a new counting process

N_I(t) = N^I_AB(t) + N^I_AC(t) + N^I_BC(t)   (33.37)

with mean value function

m_I(t) = m^I_AB(t) + m^I_AC(t) + m^I_BC(t)   (33.38)

then the probability that there is no CIF for the NVP system during the interval (t, t + x) is:

Pr{no CIF during x} = e^{−(m_I(t+x) − m_I(t))}   (33.39)
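The integrals in Equations 33.31–33.34 rarely have a convenient closed form, so numerical quadrature is the natural route. The self-contained sketch below (ours, with hypothetical values for K_AB and the fault counts) approximates the expected remaining independent faults by neglecting fault introduction and common-fault reduction, in which case the remaining count a − p m(t) from m′(t) = b(a − p m(t)) decays as a e^{−b p t}.

```python
import math
from scipy.integrate import quad

# Hypothetical values; K_AB is the CIF intensity per pair of faults.
aA0, aB0 = 20.0, 18.0
b, p1, p2 = 0.1, 0.9, 0.85
K_AB = 1e-4

# Expected remaining independent faults under the simplification above.
X_A = lambda t: aA0 * math.exp(-b * p1 * t)
X_B = lambda t: aB0 * math.exp(-b * p2 * t)

def h_AB(t):                       # Equation 33.30
    return K_AB * X_A(t) * X_B(t)

def mI_AB(t):                      # Equation 33.31, adaptive quadrature
    val, _ = quad(h_AB, 0.0, t)
    return val

def prob_no_AB_CIF(x, t):          # Equation 33.32
    return math.exp(-(mI_AB(t + x) - mI_AB(t)))

print(prob_no_AB_CIF(x=10.0, t=100.0))
```

In a full implementation, X_A(t) would instead be taken as a_A(t) − p_1 m_A(t), the bracketed term in Equation 33.25, evaluated from the numerically integrated solution.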

33.5.3 N-version Programming System Reliability

After we obtain the probabilities of common failures and concurrent independent failures, we can determine the reliability of an NVP (N = 3) fault-tolerant software system:

R_NVP-SRGM(x | t) = Pr{no CF and no CIF during x | t}

Because the CFs and CIFs are independent of each other:

R_NVP-SRGM(x | t) = Pr{no CF during x | t} × Pr{no CIF during x | t}

From Equations 33.29 and 33.39, the reliability of the NVP system can be determined by:

R_NVP-SRGM(x | t) = e^{−(m_d(t+x) + m_I(t+x) − m_d(t) − m_I(t))}   (33.40)
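Putting the pieces together (a sketch continuing the hypothetical examples above: m_d comes from the closed forms of Section 33.5.2.1, b, p1, p2, K_AB, X_A, and X_B from the CIF sketch, and the third-version inputs added here are equally hypothetical), Equation 33.40 is just the product of the two survival probabilities.

```python
import math
from scipy.integrate import quad

# Hypothetical third-version inputs mirroring the earlier sketches.
aC0, p3, K_AC, K_BC = 22.0, 0.9, 1e-4, 1e-4
X_C = lambda t: aC0 * math.exp(-b * p3 * t)

def mI_pair(K, Xa, Xb, t):
    """Equations 33.31/33.33/33.34: integrate K*Xa(s)*Xb(s) over [0, t]."""
    val, _ = quad(lambda s: K * Xa(s) * Xb(s), 0.0, t)
    return val

def m_I(t):                        # Equation 33.38
    return (mI_pair(K_AB, X_A, X_B, t) + mI_pair(K_AC, X_A, X_C, t)
            + mI_pair(K_BC, X_B, X_C, t))

def R_NVP_SRGM(x, t):
    """Equation 33.40: the product of Equations 33.29 and 33.39."""
    return math.exp(-((m_d(t + x) + m_I(t + x)) - (m_d(t) + m_I(t))))

print(R_NVP_SRGM(x=10.0, t=100.0))
```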
