33
1 Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling in Industry Daniel R. Jeske and Xuemei Zhang Bell Laboratories, Lucent Technologies Holmdel, NJ 1. Introduction and Context 2. Software Reliability Growth Models 3. Architecture-Based Software Reliability Models 4. Case Studies 5. Summary ASA Conference on Quality and Productivity May 21, 2003

Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

1

Lucent TechnologiesBell Labs Innovations

Some Successful Approaches to Software Reliability Modeling in Industry

Daniel R. Jeske and Xuemei Zhang Bell Laboratories, Lucent Technologies

Holmdel, NJ

1. Introduction and Context2. Software Reliability Growth Models3. Architecture-Based Software Reliability Models4. Case Studies5. Summary

ASA Conference on Quality and ProductivityMay 21, 2003

Page 2: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

2

Lucent TechnologiesBell Labs Innovations

When are Software Reliability Models Typically Applied?

Software Reliability Growth Models......reliability is quantified and influences the release decision

Unit Testing

SoftwareDevelopment

IntegrationTesting

SystemTesting

GeneralAvailability

RequirementsFormulation

ArchitectureDesign

Architecture-Based Reliability Models.....decisions are being made as to how to design reliability into the system

FirstOffice

Application

Software Reliability Growth Models......reliability predictions are verified......parameters useful for modeling reliability of next release are estimated

Page 3: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

3

Lucent TechnologiesBell Labs Innovations

Software Reliability Growth Models (SRGMs)

Page 4: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

4

Lucent TechnologiesBell Labs Innovations

SRGMs: Questions to Answer

•What would the user-perceived failure rate of the software be ifif the software was to be released now?

•How much more testing is needed to achieve the reliability targets?

•How many faults exist in the code at the end of system test?

Page 5: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

5

Lucent TechnologiesBell Labs Innovations

Goel-Okumoto Model

Cumulative Test Time (days)

CumulativeFailures

T

( ) (1 )d tM t c e−= −( ) ( )

dT

T M T

cde

λ−

′==

Failure Rate Estimate

x

x

xx

A Well-Known SRGM

= initial number of faults = average per-fault failure rate

= cumulative test time, aggregated across all installations

= current value of

cdt

T t

Page 6: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

6

Lucent TechnologiesBell Labs Innovations

The expected number of faults detected in a time interval is proportional to the number of faults remaining in the software at time t.

Thus,

or equivalently,

implying,

Goel and Okumoto further assume that the number of faults observed in (0 , t) is Poisson{ M(t) }.

),( ttt ∆+

( ) ( ) [ ( )]M t t M t d c M t t+ ∆ − = − ∆

( ) (1 )dtM t c e −= −

( ) ( )M t dM t cd′ + =

Motivation for Goel-Okumoto Model

Page 7: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

7

Lucent TechnologiesBell Labs Innovations

( )1 d t

dd teβ −=

+

Cumulative Test Time (days)

CumulativeFailures

T

x

x x

x

1( ) (1 )(1 )

dtdtM t c e

eβ−

−= × −+

S-shaped Model

Page 8: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

8

Lucent TechnologiesBell Labs Innovations

Challenges With Applications

Traditional SRGMs are applied to system test data with the hope of obtaining an estimate of software failure rate that will be observed in the field. The following issues have to be taken into consideration:

• The usage profile in the field is typically very different than the testing profile

• Fault removal in field environments is not instantaneous

• Quality of software reliability data

Page 9: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

9

Lucent TechnologiesBell Labs Innovations

G-O Model

( ) [1 ]

( )

dt

dt

M t c e

t cdeλ

= −=

where: c initial number of faults at time of field deploymentd average per fault failure rate in a field environmentp probability that a detected fault is successfully removed

( ) [1 ]

( )

dpt

dpt

cM t ep

t cdeλ

= −

=

G-O Model with Imperfect Debugging

Non-Instantaneous Fault Removal TimesIn Field Environments

Page 10: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

10

Lucent TechnologiesBell Labs Innovations

1 1 n dp

µ= +

•Perfect debugging assumption is an acceptable assumption (based on thorough regression tests)

•We relate non-instantaneous fault removal to imperfect debugging

Under the imperfect debugging model:

E(# of occurrences of each fault)= 1/p

Under the situation of non-instantaneous fault removals, if µ denotes the mean time to remove a fault and there are n systemsin the field

E(# of occurrences of each fault)= 1+ nµd

Non-Instantaneous Fault Removal TimesIn Field Environments

Page 11: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

11

Lucent TechnologiesBell Labs Innovations

Non-Instantaneous Fault Removal TimesIn Field Environments

1

1

( ) (1 )[1 ]

( )

d tn d

d tn d

M t c n d e

t cde

µ

µ

µ

λ

− ×+

− ×+

= + −

=

where: c initial number of faults at time of field deployment d average per fault failure rate in the field environmentµ average time to remove a fault (µ = 0 gives back the G-O model)n number of systems in the fieldt cumulative exposure time, aggregated across all installations

Page 12: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

12

Lucent TechnologiesBell Labs Innovations

Field Failure Rate Prediction if Testing Profile Matches the Field Usage Profile

The number of initial faults at time of field deployment time is the same as the number of residual faults after testing has completed

The average per fault failure rate in the field environment is identical to the average per fault failure rate in the test environment

For the field failure rate model, we can use the G-O model replacing with and adjusting for non-instantaneous removal times

1( ) ( )d t

dT n dt ce de µλ− ×

− +=

dTce−c

Page 13: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

13

Lucent TechnologiesBell Labs Innovations

Field Failure Rate Prediction When Testing Profile Does Not Match the Field Usage Profile

1) Testing and field usage profiles drive the average per fault failure rate. Usually thefailure rate of faults is smaller in field environments than in test environments

2) Define , where is the average per fault failure rate in the field environment

3) K is the “per fault failure rate” calibration factor

4) Estimate K from previous releases of the software, or from related projects

*K /d d=

/1 /( ) ( ) ( / )

d Kt

dT n d Kadj t ce d K e µλ

− ×− +=

*d

Page 14: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

14

Lucent TechnologiesBell Labs Innovations

BCF - Base Station Controller FrameBTS - Base Transceiver Station MSC - Mobile Switching CenterPSTN - Public Switched Telephone Network

BCF

MSC

PSTN

BTSs

BCF Software•setup calls•tear down calls•negotiate cell handoffs•dynamically control transmit

power levels•alarm handling•overload control•equipment provisioning

Case Study: GSM Wireless System

Page 15: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

15

Lucent TechnologiesBell Labs InnovationsObjective

Estimate the field failure rate of R3 BCF software which is currently

in system test

1) R3 Test Data in a non-operational profile environment

2) Field failure data for R1 and R2

Goals

Data Available

Page 16: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

16

Lucent TechnologiesBell Labs Innovations

Field Failure Data for R1

System Days FailuresMonth Days Cumulative Month Cumulative

1 1,249 1,249 4 42 3,472 4,721 6 103 4,065 8,786 4 144 4,883 13,669 3 175 5,425 19,094 6 236 5,656 24,750 1 247 7,549 32,299 2 268 8,295 40,594 4 309 8,882 49,476 1 3110 6,120 55,596 0 3111 2,465 58,061 1 3212 527 58,588 1 3313 45 58,633 0 33

Page 17: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

17

Lucent TechnologiesBell Labs Innovations

0

30

60

90

120

0 35,000 70,000 105,000 140,000 175,000

Cumulative System Days

Cum

ulat

ive

Failu

res

0

8

16

24

32

0 10,000 20,000 30,000 40,000 50,000 60,000

Cumulative System Days

Cum

ulat

ive

Failu

res 1

1

1

-5

58,633 system-days (13 months , 167 systems)45 days

ˆ 22.3 faultsˆ 0.0000746 failures/day/fault

(StdError: 3.07 10 )

T

c

d

µ====

×

R1 Field

R2 Field

Field Failure Data Analysis

2

2

2

-6

167,900 system-days (13 months , 370 systems)45 days

ˆ 114.01 faultsˆ 0.0000128 failures/day/fault

(StdError: 3.07 10 )

T

c

d

µ====

×

Page 18: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

18

Lucent TechnologiesBell Labs InnovationsR3 Test Data

Number of Days of Each SoftwareInstallationCalendar

Week1 2 3 4 5 6 7 8 9 10 11

CumulativeSystem-Days

of TestingCumulative

Failures

1 5 5 52 4 9 63 4 13 134 5 18 135 5 5 28 226 5 33 247 5 5 43 298 5 5 5 5 63 349 5 5 5 5 5 88 4010 5 5 5 5 5 5 5 123 46

33 5 5 5 6 955 17034 5 5 6 6 977 17635 5 5 6 6 999 18036 2 1001 181

As many as 11 frames were being used in parallel during system test.

‘Failure’ is defined as a severity 1, 2 or 3 MR.

Page 19: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

19

Lucent TechnologiesBell Labs Innovations

G-O Model Fit to the R3 Test Data

0

50

100

150

200

250

0 500 1000 1500 2000 2500

Cumulative System Days of Testing

Cum

ulat

ive

Failu

res

239-181=58

ˆˆ( ) (1 )dtcM t e −= −

1 2

2

2

ˆ 238.3 faults ˆ 0.00142 fails/day/fault

ˆ 31ˆ ˆ( ) / 2

ˆ 27.02 0.00552ˆˆ 0.00552 0.00027

c

d

dK

d d

cVar

d

=

= =+

−= −

Page 20: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

20

Lucent TechnologiesBell Labs Innovations

0

0.5

1

1.5

2

2.5

0 0.1 0.2 0.3 0.4 0.5

Calendar Time (Years)

R3

Failu

re R

ate

R3 Field Failure Rate Prediction

0.00142 / 31365 345

1 345 45 0.00142 / 31ˆ ( ) 58 (365 0.00142 / 31)

(fails/year, after calendar-years of field exposure)

t

adj t e

t

λ− × ×

+ × ×= × × ×

95% Prediction Limits

Smoothed NonparametricEstimate of R3 Failure Rate

Page 21: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

21

Lucent TechnologiesBell Labs Innovations

Architecture-Based Software Reliability Models

Page 22: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

22

Lucent TechnologiesBell Labs Innovations

Questions to Answer

•What type of software redundancy, if any, is needed in order to achieve the reliability target?

•How fast do fault recovery times need to be?

•Should software processes try to restart before a failover is attempted?

•How thorough do system fault diagnostics need to be?

•What is the required software failure rate?

Page 23: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

23

Lucent TechnologiesBell Labs Innovations

BCF - Base Station Controller FrameBTS - Base Transceiver Station MSC - Mobile Switching CenterPSTN - Public Switched Telephone Network

BCF

MSC

PSTN

BTSs

BTS Controller Software•allocate radio resources•negotiate hand-offs•manage connections between BTS and BCF

•provision and maintenance

Case Study - Continued

Alternative Architectures•Simplex, no process restart capability•Simplex, with restart capability•Active-Standby, with fail-over capability

Page 24: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

24

Lucent TechnologiesBell Labs Innovations

0

1

2

34

6

7

812

13

15

16

179

18

10

20

19

14

5

11

Detectedrestartable SW failure

Restart Success

Restart Fails System Detects

AutoFailoverFails, System Detects

Parallel Reboot

Auto Failover SucceedsAutoFailover

Fails, System Does not Detect

Detected non-restartable SW failure

UndetectedSW failure

ManualDetection

Manual Failover Fails

ParallelReboot

Manual FailoverSuccess

ManualDetection

Restartable orNon-restartable SW failure

UndetectedMonSW failure

DetectedMonSWfailure

Restart Success

Restart Fails No Escalation

Restart Fails, System Does not Detect

ManualDetection

Restart Fails and Escalationto AutoFailover

Undetected Stdby MonSWfailure

Restartable failure on Active side

Restartable orNon-restartable

Stdby SW failureManualDetection

Restartable Failure on Active side

Non-restartableSW failure on Active side

Restart Success

ManualDetection

Restart Fails

Restart Success

Restart Fails Undetected

Restartableor Non-restartableStdby SW failure

RebootManualDetection

Restartable failure on Active side

Non-restartableSW failure on Active side

Restart Success

Parallel Reboot

Non-restartable SW failure on Active side

Reboot

Restart Fails

AutoFailoverFails

1cλ

2cλ

3(1 )c λ−

1 2(1 )( )s sc λ λ− +

3(1 ) sc λ−

m mr µ

(1 ) r sr q µ−

fpµ

(1 ) f fp q µ−(1 )(1 )f fp q µ− −

m mfp µ(1 )m mfp µ−

3dµ λ+

3d sµ λ+Rµ

1 2s sλ λ+

3cλ

srµ

(1 )m mr wµ−

srµ

(1 ) sr µ−

srµ

(1 ) sr µ−

srµ

(1 ) sr µ−

(1 )(1 )r sr q µ− −

3dµ λ+

3dµ λ+

21 λλ +

))(1( 21 λλ +− c

(1 )(1 )m mr w µ− −

21

1λMonSW failure on Active side3λ

Restart Success

m mr µ

Restart Fails

22

MonSW failure on Active side

3λRestart Success

Restart Fails

m mr µ

23

m mr µ

MonSW failure on Active

side

mmr µ)1( −

mmr µ)1( −

mmr µ)1( −General Active-Standby

Software Availability Model

Page 25: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

25

Lucent TechnologiesBell Labs Innovations

Special Case: Cold Standby

0

2

34

6

7

5

Detectedrestartable SW failure

Restart Success

Restart Fails System Detects

AutoFailoverFails, System Detects

Parallel Reboot

Auto Failover SucceedsAutoFailover

Fails, System Does not Detect

Detected non-restartable SW failure

UndetectedSW failure

ManualDetection

Manual Failover Fails

ParallelReboot

Manual FailoverSuccess

ManualDetection

Restart Fails, System Does not Detect

ManualDetection

Reboot

11

8

1

1cλ

2cλ

(1 ) r sr q µ−

fpµ

(1 ) f fp q µ− (1 )(1 )f fp q µ− −

m mfp µ

(1 )m mfp µ−

3dµ λ+

srµ

(1 )(1 )r sr q µ− −

3dµ λ+

3dµ λ+

))(1( 21 λλ +− cRµ

Page 26: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

26

Lucent TechnologiesBell Labs Innovations

Simplex Systems as Special Cases

1

1 2 3

0 (no restartable software)0 (no standby side)

0 and (no restart attempts)0 and (no automatic failover attempts)

0 and (no manual failover attempts)

s s s

s

f

m mf

m f

rp

p

q q

λλ λ λ

µµµ

== = =

= = ∞= = ∞= = ∞= 1 (no restart or failover attempts)

1 and manual recovery duration ofm mr µ=

= =

1 2 3 0 (no standby)0 and (no automatic failover attempts)

0 and (no manual failover attempts)

1 (no automatic failover attempts)

s s s

f

m mf

f

p

p

q

λ λ λµµ

= = == = ∞= = ∞=

Simplex System With No Restart Capability

Simplex System With Restart Capability

Page 27: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

27

Lucent TechnologiesBell Labs Innovations

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5 6 7 8 9 10 11 1213 14 1516 1718 19 2021 2223 24

Hour of Day

% o

f Bus

y H

our

Tra

ffic

Network Traffic Profile

Page 28: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

28

Lucent TechnologiesBell Labs Innovations

0

20

40

60

80

100

0 20 40 60 80 100

%Capacity Remaining

%T

raff

ic S

erve

d

% Capacity Remaining After the Failure Capacity Required for Traffic Volume Offered in Hour-i = % Offered Traffic Served if Failure Occurs in Hour-i

= Min ( , ) /

Expected %

i

i

i i

CCS

C C C

S

==

= Traffic Served, Given a Failure That Leaves Capacity at C

Reward Model

Page 29: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

29

Lucent TechnologiesBell Labs Innovations

0.1

1

10

0.5 0.6 0.7 0.8 0.9 1

Coverage Factor

Failu

res/

year

No Restart

Restart

Active-Standby

0.5

4

Feasibility Regions for Critical Parameters

Level Curves for 99.999% Availability(all other parameters are fixed)

1 2λ λ+

c

Page 30: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

30

Lucent TechnologiesBell Labs Innovations

Identifying Influential Parameters

•Availability formula is straight-forward to obtain, but is complicated in that it ishighly non-linear and it depends on numerous (23) parameters

•Analysis of derivatives, while also straight-forward is not particularly insightful, since they themselves depend on the same set of parameters.

•“Tornado Graphs” depends too much on the fixed values for the other parameters.

•Identify likely ranges for each parameter to define the domain of the availability function

•Uniformly sample N times from the domain to obtain A1, A2, . . . , AN

• Use an efficient second-order polynomial to approximate the availability formula

•Use the t-statistics as the associated measure of influence

Challenges

Approach

Page 31: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

31

Lucent TechnologiesBell Labs Innovations

Case Study – Continued(Likely Ranges for Parameter Values in Active-Standby Architecture)

Detections/yr

26,280Reboots/yr

3,153,600Failovers/yr

31,536,000Restarts/yr

Fails/yr

Fails/yr

0Fails/yr

Fails/yr

17,520Recoveries/yrFails/yr

26,280Failovers/yrFails/yr1λ

1sλ

2sλ

3sλ

rµdµ

mfµ

mµr

mr

rq

fq

p

mp

c

w

[ 0.1 , 1.0 ]

[ 0.1 , 1.0 ]

[ 0.1 , 0.5 ]

[ 0.1 , 0.5 ]

[ 0.1 , 0.5 ]

[ 1095 , 2190 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

Page 32: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

32

Lucent TechnologiesBell Labs Innovations

Case Study – Continued(Efficient Second-Order Regression Model)

1 2 1 2

1 3 2 3

3

(min/yr) 497 379 378 358 356 284( /1000) 268( /1000) 46112.2 ( /1000) 319 12 ( /1000) 29822 11.6 ( /1000) 11.7 9.6 9 5.8 5 2.1

d d

d d

d f r m m

D c c c cc

p r q q r w p

λ λ λ λ µ µλ µ λ λ µ λ

λ µ

= + + − − + − −− + − −− − − − − − − −

0

20

40

60

80

100

0 20 40 60 80 100

Predicted

Obs

erve

d

Influential Parameters•Software failure rates on active side•Silent failure detection time•Coverage factor•Cross-product of coverage factor with

software failure rates and silent failuredetection time

Weakly Influential Parameters•All of the success probabilities

InterpretationAdequacy of Fit

Page 33: Some Successful Approaches to Software Reliability Modeling in … · 2018-12-21 · Lucent Technologies Bell Labs Innovations Some Successful Approaches to Software Reliability Modeling

33

Lucent TechnologiesBell Labs Innovations

Needed Research

•SRGM fitting tools

•Small sample inference procedures

•Failure rate template

•Improved methods for linking SRGMsto architecture-based models