
QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL

Qual. Reliab. Engng. Int. 2006; 22:177–189

Published online 27 June 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/qre.694

Research

Are New Versions of PC Operating Systems More or Less Reliable than Older Versions?

John Lawson1,∗,†, Jeremy Sudweeks2 and Del Scott1

1Department of Statistics, Brigham Young University, Provo, UT 84604, U.S.A.
2Department of Industrial and Systems Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061-0118, U.S.A.

Off-the-shelf operating systems (OSs) are increasingly being used in critical systems. These OSs include members of the Microsoft Windows family and various distributions of the increasingly popular Linux OS. Reliability estimates of these standard off-the-shelf OSs are needed, so in order to evaluate the reliability of these systems, data were collected from product support Web sites. In contrast to previous studies, the reliability of Windows NT Workstation 4.0 was found to exceed that of Windows 2000 Professional. In addition it was found that the reliability of these systems degraded over the product lifecycle. The reliability of four versions of Red Hat Linux was evaluated, and although the Red Hat OSs were more reliable than the Windows products, they exhibited the same pattern of reliability degradation across versions and, to a lesser extent, within the product lifecycle. The potential causes of these trends are examined as well as some of their implications. Copyright © 2005 John Wiley & Sons, Ltd.

KEY WORDS: software reliability; software patches; failure intensity; HPP; NHPP

1. INTRODUCTION

The quantification of software reliability has many applications. In addition to providing a level of assurance that critical systems will operate correctly, it can also be used to improve software development efforts. Software reliability estimates provide developers with a tool to assist in the planning and delegation of resources during the development and testing processes. The reliability measure can be used to determine when testing can be stopped, and when software should be released to the user1.

If reliability estimates were available, comparisons of these estimates would also be useful for choosing a software vendor or a software package. Off-the-shelf operating systems (OSs) are now being used in critical systems. These OSs include members of the Microsoft Windows family and various distributions of the increasingly popular Linux OS. Reliability estimates of standard off-the-shelf OSs are needed, yet independent estimates are rarely available. Software producers consider this to be proprietary information.

In this paper we attempt to make reliability estimates of PC OSs using data publicly available on product support pages. We collected data on the Windows family of OSs, particularly Windows NT Workstation 4.0 (NT 4) and Windows 2000 Professional (Windows 2000).

∗Correspondence to: John Lawson, Department of Statistics, Brigham Young University, Provo, UT 84604, U.S.A. †E-mail: [email protected]

Copyright © 2005 John Wiley & Sons, Ltd. Received 7 May 2003; Revised 17 September 2003


These data were used to make reliability estimates, and then we compared our reliability estimates of NT 4 and Windows 2000 with estimates from previous studies that were obtained in a completely different way. We also obtained data on Red Hat Linux versions 6.2, 7.0, 7.1, and 7.2 and used it to evaluate the reliability of these OSs. Comparison of estimated reliability between older versions and later versions of the software raised interesting questions regarding the meaning of OS reliability.

2. EVALUATING WINDOWS RELIABILITY

The OSs under consideration were evaluated using the following methodology. Data, in the form of the release date of software patches, were collected from the product support pages for each OS. Care was taken to ensure that only patches that applied directly to the OS under investigation were evaluated. A homogeneous Poisson process (HPP) model and several non-homogeneous Poisson process (NHPP) models were fitted to the data using the method of maximum likelihood. The predictive abilities of these models were then used to select a best model for each OS, and the failure intensities were used to compare the different OSs.

From the software producer's point of view, the date of a software patch release does not represent the actual time of failure, but the time of failure plus the time taken to correct the failure in the software and then post the fix for all users to download. To use this information for modeling failure rates, we must assume the time taken to fix and post fixes to software errors is relatively constant for all discovered problems. If the assumption is incorrect it may affect the local accuracy of the fitted model, but the overall trend should be accurate because the total number of failures found is equal to the number of patches posted.

From the user's viewpoint, however, the software failure does not occur until they are made aware of it through the posted patch (unless they are one of the rare users who actually experience the failure). This is especially relevant to security failures. Recent computer virus attacks occurred after security patches informed hackers of OS vulnerability. Therefore, from the user's perspective, the posted patch dates may actually be the most relevant information for modeling failure rates.

2.1. Data collection

Most software reliability models have been developed from the perspective of the producer. That is, many models have been proposed with the assumption that the data to be analyzed was produced from in-house testing or field data. However, many companies consider failure data proprietary information and are unwilling to make these data available.

Wesselman2 demonstrated that software patches posted on product support pages are a viable source of data for modeling software reliability. From the user's perspective this may be the most relevant information regarding reliability. We therefore used the product support pages for each of the OSs as sources of data to model reliability from the user's perspective. It is our belief that if a software company such as Microsoft expends resources to create a software patch, then that patch is a surrogate measure for a failure in the system. If the same patch was used to correct multiple errors, the date of each posting was recorded. Software patches for applications related to the OSs, such as Web appliances, were present on some of the product support pages. However, care was taken to ensure that only patches that applied to the relevant OSs were considered.

A software failure is defined as an unacceptable departure of program operation caused by a software error remaining in the system3. Underlying the failure is a fault in the software. Many faults may exist but if that section of code is not activated or users are not aware of the fault, then a failure is not realized. A failure may occur even when the defining user requirements are met. Thus, a definition of software reliability is not only whether requirements are fulfilled, but also whether user expectations are met. A perceived failure is a failure in the mind of the user whether or not an actual failure exists4.

Healy et al.5 embrace this philosophy by advocating the idea that customer satisfaction should drive reliability assessment. Customer satisfaction encompasses three key concepts: failures are defined relative to the customer's perspective; it is expected that the system will satisfy requirements for functional performance; and the system documentation and operation should be consistent. Producer costs are usually proportional to the number of faults in the software while customer costs are proportional to the number of failures. Several of the error types observed for the various OSs were not errors that, if left unresolved, would cause serious stability issues in the OS. Rather, some of the errors were related to OS periphery such as multimedia devices. Many of the errors were related to security concerns. Although these errors probably would not force an unplanned rebooting of the system, the user may perceive them as faults that need to be corrected. It was therefore appropriate to consider them when evaluating the reliability of the OS. Microsoft provided a great deal of information about each of the errors fixed in the seven service packs that were released for NT 4. Patches that were intended solely for NT Server 4.0 were not included in the data set. There were 1091 observations collected for NT 4. Windows 2000 support pages clearly indicated what aspects of the OS were being altered with the patches. Although we are not considering which component of the OS failed, e.g. security, networking, or I/O operations, this information could potentially be useful in analysis. A total of 708 observations were collected for this OS.

Table I. Functional forms for λ(t)

Model                  Parameters   λ(t)                M(t)
Musa Basic             a, b         a e^{-bt}           (a/b)(1 - e^{-bt})
Duane                  a, b         abt^{b-1}           at^{b}
Littlewood             a, b, c      a/(bt + 1)^{c}      [a/(b - bc)][(bt + 1)^{1-c} - 1]
Littlewood–Verrall     a, b         a/(bt + 1)^{1/2}    (2a/b)[(bt + 1)^{1/2} - 1]
Logarithmic Poisson    a, b         a/(bt + 1)          (a/b) log(bt + 1)

In order to make use of the data collected, the system age at each failure had to be determined. The system age at the time of failure was determined by subtracting the official release date of the software from the date of the failure. NT 4 and Windows 2000 had failures with dates prior to the official release date. It is possible that these patches represent errors that were observed and corrected early in the software development process. In each case these patches were deleted. In addition, patches whose release date was the same as the official release date of the OS were deleted. The interarrival times, defined as the time between successive failures, were also calculated.
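A minimal SAS sketch of this preparation step is given below. The input data set patches, its variable patch_date, and the release-date value are assumed names and placeholders, not the authors' actual code.

*data preparation (sketch, assumed names and placeholder date);
data nt1;
  set patches;                     /* assumed input: one row per posted patch, patch_date a SAS date */
  release = '01AUG1996'd;          /* placeholder for the official release date */
  days = patch_date - release;     /* system age (days) at each posted patch */
  if days <= 0 then delete;        /* drop patches dated on or before the release date */
run;

proc sort data = nt1;
  by days;
run;

data nt1;
  set nt1;
  failures + 1;                    /* cumulative failure count */
  interarrival = dif(days);        /* time between successive failures */
run;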

2.2. Models

Due to the type of data collected, namely the number of failures and the time of failure, different types of failure count models were fitted to the data. A HPP and several variations of the NHPP were used. For a HPP the probability of n events, in this case software failures, in the interval from t to t + s is independent of t and is given by the Poisson distribution

P(n) = e^{-λs}(λs)^n / n!    (1)

where λ is the failure intensity function or failure rate6. If the failure rate, λ, varies over time then a NHPP model should be used. The NHPP varies from the HPP by accounting for this time dependence by varying the form of λ(t). The probability of n events, as given in (1), is modified with λs replaced by ∫_t^{t+s} λ(τ) dτ. Healy et al.5 provided several functional forms of λ(t), which are outlined in Table I. These functional forms are the Musa Basic (MB), Duane, Littlewood, Littlewood–Verrall (LV), and Logarithmic Poisson (LP). These models are decreasing functions that approach zero asymptotically. This property seemed appropriate for modeling software reliability because as bugs are detected and removed one would expect to see fewer failures. This seemed reasonable as it is assumed that the number of failures in any given piece of software is finite6.


2.3. Plots and trend tests

A very useful tool for initial data exploration, particularly with data from a repairable system, is the mean cumulative function (MCF) plot. The MCF plot consists of the cumulative number of repairs (errors) plotted against the system age. This plot has discrete jumps on the vertical axis, which produce a staircase function. However, the population average, denoted by M(t), is a smooth function of t and represents the mean cumulative function6. A given cumulative function is therefore an estimate of the population cumulative function. M(t) is the expected number of repairs by time t and the derivative of M(t), dM(t)/dt, is λ(t), the repair rate7. Table I provides the form of M(t) for each of the models fit.

The appearance of a linear trend in the MCF plot indicates that the repair rate, λ(t), for the software is basically constant; the reliability of the software is not improving with the removal of faults. A concave (downward-curving) shape indicates improving reliability, while a convex (upward-curving) shape demonstrates worsening performance. The cases in which the repair rate is constant can be modeled with HPP models while the nonlinear trends can be modeled with NHPP models.
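As an illustration, an MCF plot of this kind can be produced directly from the prepared data; the sketch below assumes the data set nt1 from the earlier step, with cumulative failures and system age days.

*MCF plot (sketch): cumulative failures versus system age;
proc sgplot data = nt1;
  step x = days y = failures;      /* staircase estimate of M(t) */
  xaxis label = 'System age (days)';
  yaxis label = 'Cumulative failures';
run;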

In addition to using plots to check for trends in the data, formal tests are also available. We used the reverse arrangement test (RAT) and the Laplace test7. The RAT test is a non-parametric test that is used to distinguish between independent and identically distributed (i.i.d.) interarrival times and a monotonic trend. This test simply indicates if a trend is present in the data, not what type of model should be used. The Laplace test determines whether observations are a Poisson-type sequence of i.i.d. exponential random variables. A rejection of this test indicates that the HPP is not an appropriate model and a trend is present.
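For reference, a commonly used form of the Laplace test statistic for n event times t1, ..., tn observed over (0, T] is shown below; the paper does not spell out the exact variant used, so this is only a sketch of the standard version:

U = ( (1/n) ∑_{i=1}^{n} t_i − T/2 ) / ( T √(1/(12n)) )

Under the HPP hypothesis U is approximately standard normal; a significantly positive value suggests an increasing repair rate (worsening reliability) and a significantly negative value a decreasing one.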

2.4. Comparing models

As several models were fit to the data it was necessary to determine which model provided the best fit and was the most reliable in prediction. Likelihood ratio tests do not exist for comparing the models presented in Table I, because many of the models have the same number of parameters and they are not hierarchical6.

The prequential likelihood ratio (PLR)8 is used to make pairwise comparisons among competing models by predicting the next time to failure tj and comparing how close each model is to the actual time of failure. The PLR methodology6 requires that a reference model be arbitrarily selected and used in all pairwise comparisons. Several observations are removed from the data set and each of the models is fit to the remaining data. The estimated model parameters are then used to calculate the estimated pdfs for each of the observation times, tj, left out of the model fitting process. If the predicted values for the first model, A, are closer to the actual values than the predicted values for the reference model, B, A's pdf fA(tj) at tj will be larger than B's pdf fB(tj), and the ratio of these quantities, fA(tj)/fB(tj), will be larger than one. The PLR is a running product of such terms over many successive predictions:

PLR_AB,i = ∏_{j=s}^{i} fA(tj)/fB(tj)    (2)

The PLR should tend to increase with i if the A predictions are better than the B predictions. Conversely, if the predictions of model B are better, then the PLR tends to decrease. The log of the PLR for Poisson data is

log(PLR_AB,j) = ∑_{i=1}^{j} [Mi log λA(ti) − Mi log λB(ti) + λB(ti) − λA(ti)]    (3)

which reveals trends in PLR_AB when plotted against time.
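The running sum in (3) can be accumulated with a short SAS data step, sketched below; the data set holdout and its variables (the observed cumulative count M and the predicted intensities lambdaA and lambdaB at each held-out failure time) are assumed names, not the authors' code.

*running log prequential likelihood ratio (sketch, assumed input data set);
data plr;
  set holdout;
  term = M*log(lambdaA) - M*log(lambdaB) + lambdaB - lambdaA;   /* summand in (3) */
  logPLR + term;                                                /* running sum over the held-out times */
run;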

2.5. Results

Windows 2000 and NT 4 both exhibited a pattern in their MCF plots that required further inspection. In each plot there was a period of time in which several patches were released in rapid succession. However, at the end of this period the number of patches posted rapidly declined and the length of time between postings significantly increased. The sudden decline in failures corresponds closely with the release of subsequent OSs. The NT 4 MCF plot began to tail off very near the release of Windows 2000. A similar trend is seen in the Windows 2000 plot corresponding to the release of Windows XP. It is possible that the testing and correction efforts in the supplanted OSs were curtailed and consequently the final observations did not give an accurate indication of the reliability of the system. These observations were removed from the data sets.

Each OS underwent the same analysis even though only intermediate results for NT 4 are presented. The MCF plot for NT 4 demonstrated a strong linear pattern. Although there are areas of local variation (trends are present over brief periods of time), it seemed likely that a HPP model might adequately describe the data. Therefore, in addition to the NHPP models described previously, HPP models were fit. All OSs demonstrated a strong linear pattern in the MCF plot.

As mentioned previously, the RAT test was used to test between i.i.d. interarrival times and a monotonic trend. The RAT test rejected the hypothesis of i.i.d. interarrival times for both NT 4 and Windows 2000, perhaps indicating that a trend was present. However, due to the obvious linear trend observed in the MCF plots, it was possible that this trend test produced significant results because of local fluctuations rather than the overall pattern of the data. The assumption of i.i.d. interarrival times is required to use the Laplace test. Since this assumption was rejected by the RAT test, the Laplace test was not used on the NT 4 or Windows 2000 data.

Initial estimates were found using SAS PROC NLIN and then used in PROC NLMIXED to arrive at final solutions. PROC NLMIXED performs nonlinear Poisson regression. The sample SAS commands for fitting the MB model to NT 4 data are:

*musa basic;
proc nlin data = nt1;
  parms a = 0 b = 0.1;
  model failures = (a/b)*(1 - exp(-b*days));
run;

proc nlmixed data = nt1;
  parms a = 0.7155 b = -0.00039;
  M = (a/b)*(1 - exp(-b*days));
  model failures ~ poisson(M);
  predict M out = ntmb;
run;

The variable 'failures' represents the count of the cumulative failures, and the term 'days' represents the time of occurrence of the corresponding failure.

We were not able to fit the Littlewood model to either NT 4 or Windows 2000 data, because the convergence criteria were not met. This did not come as a surprise, however. The Littlewood model is complex and the estimation of its three parameters required significantly more iterations than other models in a previous study2. All five remaining models were successfully fit to the NT 4 and Windows 2000 data. Table II lists the parameter estimates, standard errors, confidence limits, and the maximized log likelihood value for the five models fit to NT 4 data. There is little difference in the values of the log likelihood for the two parameter models, but no formal likelihood ratio tests could be done to compare models since they all have the same number of parameters and they are not hierarchical.

In order to assess the fit of each model to the data, the models were superimposed onto the MCF plot. The MCF plot with models for NT 4 data is shown in Figure 1. In this plot the Duane, LV, MB, and LP models are very similar and hard to distinguish, as would be guessed by the log likelihood values in Table II.

Although this plot was not useful in distinguishing between the two parameter models, it was useful in illustrating that the HPP model did not fit the data well. For example, it can be seen in Figure 1 that after approximately day 900 the HPP model underestimated the number of failures in the NT 4 OS.

The PLR plot was used to determine which of the models provided the best fit to the data. The MB functional form was arbitrarily selected as the reference model for both NT 4 and Windows 2000. The last 20 points were removed from each data set and used to evaluate the predictive ability of each model.


Table II. Windows NT Workstation 4.0 parameter estimates

Musa Basic: a = 0.7024 (s_a = 0.0002958; 95% CI 0.6966, 0.7082), b = −0.00043 (s_b = 9.13×10^{-6}; 95% CI −0.00045, −0.00041); maximized log likelihood −5639.199; λ(t) = 0.7024e^{0.00043t}; M(t) = −1633(1 − e^{0.00043t})

Duane (Power): a = 0.3267 (s_a = 0.006487; 95% CI 0.3140, 0.3394), b = 1.1431 (s_b = 0.002986; 95% CI 1.1372, 1.1489); maximized log likelihood −5476.483; λ(t) = 0.373t^{0.1431}; M(t) = 0.3267t^{1.1431}

Littlewood–Verrall: a = 0.7246 (s_a = 0.002283; 95% CI 0.7201, 0.7291), b = −0.00057 (s_b = 7.90×10^{-6}; 95% CI −0.00059, −0.00056); maximized log likelihood −5673.241; λ(t) = 0.7246/(−0.00057t + 1)^{1/2}; M(t) = −2542[(−0.00057t + 1)^{1/2} − 1]

Logarithmic Poisson: a = 1.6440 (s_a = 0.005981; 95% CI 1.6323, 1.6558), b = −0.00035 (s_b = 5.971×10^{-6}; 95% CI −0.00036, −0.00034); maximized log likelihood −5655.278; λ(t) = 1.644/(−0.00035t + 1); M(t) = −4697 log(−0.00035t + 1)

Homogeneous Poisson Model: a = 0.8424 (s_a = 0.001121; 95% CI 0.8402, 0.8446); maximized log likelihood −6675.752; λ(t) = 0.8424; M(t) = 0.8424t


Figure 1. Windows NT Workstation 4.0 MCF plot with reliability models (cumulative failures versus days, with the Duane, HPP, LV, MB, and LP fits superimposed)

Figure 2. Windows NT Workstation 4.0 PLR plot (log(PLR) over the 20 held-out points for the Duane, Littlewood–Verrall, Logarithmic Poisson, and HPP models, relative to the Musa Basic reference)

The model parameters were estimated using the remaining data. The predicted number of failures at the time of failure for the 20 removed faults was then computed for each model. These predicted values served as the basis of comparison between the models. Figure 2 shows the PLR plot for NT 4.

The straight line at zero in Figure 2 represented the reference model, in this case the MB model. Any model whose log(PLR) was positive gave a more accurate prediction than the reference model, and conversely any log(PLR) that was negative indicated the reference model gave a more accurate prediction. From Figure 2 we observe that the LV and LP models were making more accurate predictions than the MB. The LP model, with parameters a = 1.644 and b = −0.00035, appeared to be the best fitting model for NT 4. In a similar plot for Windows 2000, the LP model, with parameters a = 2.9369 and b = −0.00045, was found to be the best fitting model.


3. EVALUATING RED HAT RELIABILITY

Red Hat provided an errata sheet for each version of the OS that listed all of the software patches available, and we used these sheets to record the date of each posting. Version 6.2 had 69 patches, 7.0 had 83 patches, 7.1 had 104 patches, and 7.2 had 91 patches. Red Hat version 7.3 was recently released, and consequently few software patches were available. As with NT 4 and Windows 2000, Red Hat 7.1 and Red Hat 7.2 had failures with dates prior to the official release date. Once again these failures were deleted.

Fitting the selected NHPP and HPP models to some versions of the Red Hat OS proved to be more challenging than it was for the Windows data. This was due to the relatively low failure count and periods of time in which no failures were reported. At least two models were fit to each version of the OS.

The RAT test rejected the hypothesis of i.i.d. interarrival times for all versions of Red Hat except 6.2. The hypothesis of exponentially distributed interarrival times was rejected for version 6.2 using the Laplace test. However, as with the Windows OSs, local trends may have influenced the test results, and possibly an HPP model could adequately describe the results globally.

All five NHPP models were fit to Red Hat 7.1 and Red Hat 7.2. Only two models, the LP and a non-zero intercept HPP, were fit to Red Hat 6.2. Three models, LV, LP, and a non-zero intercept HPP, were fit to Red Hat 7.0. As with the Littlewood model for the NT 4 and Windows 2000 data, convergence problems in the algorithms prevented the other models from being fit to these two data sets. The non-zero intercept HPP was fit to Red Hat 6.2 and Red Hat 7.0 because of the long period of time between the release of the software and the first reported errors. In the case of Red Hat 6.2 the first error was over a year from the official release date and the first error for Red Hat 7.0 was over six months from the official release date. The intercept term was added as an additional parameter to the HPP model and both parameters were estimated using the SAS procedures cited above.
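A minimal sketch of fitting such a non-zero intercept HPP with the same PROC NLMIXED pattern is shown below; the parameterization M(t) = a(t − c), with a constant repair rate a starting after a delay c, the data set name rh62, and the starting values are all assumptions rather than the authors' exact formulation.

*non-zero intercept HPP (sketch, assumed parameterization);
proc nlmixed data = rh62;
  parms a = 0.1 c = 300;
  M = max(a*(days - c), 1e-8);     /* mean function, kept positive for the Poisson likelihood */
  model failures ~ poisson(M);
  predict M out = rh62hpp;
run;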

For Red Hat 6.2, the best fitting model was the modified HPP with a = 0.1563; for 7.0 the LP with a = 0.1294 and b = −0.00148 provided the best fit; for 7.1 the LP with a = 0.555 and b = 0.00014 was the best fit; and for 7.2 the LP with a = 1.0894 and b = 0.003723 was the best fit. However, inspection of the MCF plot with models revealed that even the best fitting models did not model the data exceptionally well.

4. COMPARISON OF RED HAT AND WINDOWS RELIABILITY

4.1. Comparing Red Hat and Windows

Care must be taken when comparing the different families of OSs. Although the best fitting model for most of the systems was the same, the model may not have provided an excellent fit to each dataset, but was simply the best approximation available. Drawing definitive conclusions was therefore tenuous at best.

In addition, the testing policies and procedures used by each company may differ significantly. The way in which a company responded to errors discovered in testing and field use may also differ significantly. Microsoft seems to have made extensive use of its product support page to release patches while Red Hat released new versions of its OS regularly. We were not attempting to make definitive statements about which OS is superior in terms of reliability, but merely to present comparisons based on the research we conducted.

Direct comparisons between the two families of OSs were difficult to make due to the staggered release dates. NT 4 was released almost four years before Red Hat 6.2. Following the release of Windows 2000 and prior to the release of Windows XP, Red Hat released versions 6.2, 7.0, 7.1 and 7.2. We therefore have to speak generally, without making specific comparisons. In order to make realistic comparisons it may be necessary to treat each service pack released by Microsoft as a new version of the OS, thereby restarting the count of cumulative failures. However, the service packs released by Microsoft simply fixed errors in the OS without making substantial upgrades. Each released version of Red Hat contained substantial upgrades over previous versions. Sending the cumulative count of failures to zero for each service pack of the Windows OSs may therefore bias direct comparisons of Windows and Red Hat systems.


Based upon the repair rates reported earlier, it appears that each version of the Red Hat OS was more reliable than both of the Windows OSs. However, the Red Hat OSs appeared to generally follow the same pattern as the Windows systems in that the reliability degraded over the product lifecycle, and newer versions of the OSs were less reliable than the older ones. These two patterns lead to inconclusive results.

4.2. Windows baseline studies

In 1999 Microsoft commissioned two comparative reliability studies. Both studies compared the reliability of Microsoft Windows 98 Second Edition (SE), Windows NT 4, and Windows 2000. Although the studies differed in testing strategies the conclusions were similar: Windows 2000 is more reliable than either Windows 98 SE or NT 4.

The first study was conducted by ZDLABS9. The technicians configured six identical Dell OptiPlex GX110 systems into two matching test beds of three systems each. Each OS was subject to testing via test scripts for 30 consecutive days. The test scripts were meant to imitate common functions performed by a typical worker during a typical workday. The functions included browsing the Internet, modifying a document, modifying a spreadsheet, and querying a database. Different I/O functions were required with each loop of the test script, including accessing a CD-ROM and a file server, and negotiating Web downloads. The total number of reboots and the total amount of time the test stressed the system were recorded. The average uptime for each OS was calculated. The unit of measure was the standard 8 h workday. In this particular study, Windows 2000 did not require a single reboot, leading ZDLABS to conclude that it is far more reliable than either of the other OSs.

The second study was conducted by NSTL10. This study attempted to evaluate the reliability of the same three operating systems in actual usage environments. In order to accomplish this, Microsoft developed a testing tool that resided on test machines and sent data about a given system's performance to NSTL. The tool was installed in academic and business settings, including the NSTL labs. The operating time for the system and any occurrence of an unexpected shutdown were recorded and reported to NSTL. An unexpected shutdown was considered an error. The mean time to failure (MTTF) was then calculated for each OS. Similar to the ZDLABS study, Windows 2000 was found to be much more reliable than either NT 4 or Windows 98 SE. NSTL concluded that in production environments the average system uptime between failures for Windows 2000 is 13 times greater than Windows 98 and three times greater than that of NT 4.

4.3. Comparison of results with Windows baseline studies

Our results seemed to contradict these studies. Due to a lack of data, Windows 98 was not evaluated, but we found that NT 4 is more reliable than Windows 2000. Perhaps the most vivid depiction of this is the side-by-side MCF plot, found in Figure 3.

The steeper slope of the Windows 2000 line in Figure 3 indicated that it had a higher repair rate than NT 4. This difference can also be seen with numerical comparisons: at 180 days the repair rate (λ) for NT 4 was 1.75 and for Windows 2000 it was 3.2; at 365 days λ for NT 4 was 1.88 while λ for Windows 2000 was 3.51; and at the last failure for each system, λ for NT 4 (at 1175 days) was 2.79 while λ for Windows 2000 (at 468 days) was 3.72. In addition each OS seemed to degrade over its product lifecycle. If one were to think of the estimated repair rate at the last observation as the final estimated repair rate, then both systems are less reliable after testing and correction efforts. That is, as more problems were identified and corrected, the reliability of the OS actually declined. We discuss further the implications and possible reasons for the different results in Section 5.
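These values follow directly from the fitted LP intensity λ(t) = a/(bt + 1). For example, for NT 4 at 180 days

λ(180) = 1.644/(1 − 0.00035 × 180) = 1.644/0.937 ≈ 1.75

and the same calculation with a = 2.9369 and b = −0.00045 gives λ(180) ≈ 3.2 for Windows 2000.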

4.4. Comparing versions of Red Hat

Although there is a great deal of discussion amongst computing professionals and enthusiasts about the reliability of different distributions of the Linux OS, we were unable to find formal studies. In particular, we were unable to find formal studies quantifying the reliability of any version of Red Hat Linux. Therefore we do not have previous studies with which to compare our results. However, we can make comparisons among the four versions of Red Hat for which we collected data.


Figure 3. Side-by-side MCF plot for Windows 2000 and NT 4

A comparative evaluation of the Red Hat versions revealed a surprising trend. Similar to NT 4 and Windows 2000, the reliability of the Red Hat OS seemed to degrade with each new release. Figure 4, a composite MCF where all of the versions are plotted together, revealed this trend. In some of the systems, the reliability seemed to degrade over the lifecycle of the product, while in others it remained fairly constant or improved slightly.

Certainly one reason for the contrast on this MCF was the large amount of time between the official release dates of Red Hat 6.2 and 7.0 and the first patches posted. However, the slopes of the plots for the later versions appeared to be somewhat steeper than the earlier versions, indicating that they have a higher repair rate. Numerical comparisons bear this out: 6.2 had a constant repair rate of λ = 0.1563; at 180 days 7.0 had a repair rate of λ = 0.13, 7.1 had a repair rate of λ = 0.54, and 7.2 had a repair rate of λ = 1.02; and at the last failure for each system λ = 1.3 for 7.0 (610 days), λ = 0.53 for 7.1 (408 days), and λ = 1.01 for 7.2 (219 days). Possible reasons for the apparent degradation across the versions of Red Hat will be discussed in Section 5.

5. DISCUSSION

The discrepancy between our results and previous studies may be due to several factors. One of the most likely factors was how a failure was defined. Inherent in our data collection process was the idea that any patch posted represented a failure. The failure being fixed may not have caused the OS to become unstable to the point of requiring a reboot, but it is a failure nonetheless. The failures that we modeled include things such as security patches, plug and play components, and audio/visual features. The only failures that were considered in the previous studies were those that required the computer to reboot to return to a stable state.

Perhaps the most important points of discussion center on the following questions: does the trend observed in these data sets really mean that the newer versions of the OS are less reliable than the older versions? Is the OS reliability actually degrading throughout the product lifecycle?


Figure 4. Side-by-side MCF plot for Red Hat 6.2, 7.0, 7.1, and 7.2

We are only able to speculate about what the observed trends mean. It seems likely that one part of the software production process that directly affected our assessment is software testing. It is possible that the software testing methods used on later OSs are more extensive and better able to detect errors. It is also possible that software engineers are testing aspects of the software that previously were left untested. It may now seem necessary to correct errors that were previously known but not corrected. Perhaps in trying to fix a known error, programmers unwittingly introduce new errors or implement code that adversely affects another component of the OS. As new product release dates approach, perhaps testing efforts are greatly increased on current OSs to fix as many bugs as possible before efforts are shifted to the new product development. The size of OSs has grown with each release and it certainly seems logical that more lines of code translate to more errors.

The increased size of the OSs points to another aspect of the current state of computing that should be considered when trying to determine if newer systems are less reliable than older systems. Desktop computers have become an integral part of many people's lives. The convergence of many activities to desktop computers seems to be increasing. Early operating systems had conceptually simple tasks to perform, as opposed to the myriad of tasks that current OSs have to perform. These new tasks include managing a wide array of hardware, ensuring a secure computing environment (preventing unauthorized access to a user's system), digital imaging, and providing a medium for entertainment. Perhaps the OSs are simply too complex to be exhaustively tested before release. Therefore field-testing and revision may be necessary.

The competitive business environment may also contribute to these trends. In order to get a new and complex product to market quickly, the OSs may have been prematurely released with known errors remaining in the system. Releasing the software under these conditions may result more from a business decision than from the inability of the software testing effort to discover and correct the majority of the errors in the system.

The implications of these trends seem to be extensive. Is it better to purchase a new OS later in its lifecycle so that many of the errors discovered in the field have been corrected? Is it possible that many of the patches posted have little to do with a given user's computing needs, thereby not affecting their system's reliability?


Figure 5. Windows XP Professional MCF plot with an s-shaped model

Perhaps switching from one OS to another may not provide expected productivity gains. Which customer service model is better: providing patches on a product support page or releasing new versions of the OS regularly?

It is clear that estimates of the reliability of off-the-shelf operating systems are needed. This approach attempted to use patches posted on product support pages to accomplish this. The results contradict previous studies, but provide estimates for OSs for which no formal studies could be found. Further work in this field needs to be done in order to allow for unbiased and realistic estimation of off-the-shelf OS reliability.

6. FUTURE RESEARCH

Several areas for future research have been identified, with perhaps the most compelling being the reliability evaluation of the most recently released Microsoft and Red Hat OSs.

There was not enough data available for Red Hat 7.3 to begin initial assessments. However, initial assessments for Windows XP Professional (Windows XP) were possible. The plot of failures versus system age for Windows XP revealed a trend common in software testing: the s-shaped curve. The MCF plot for Windows XP can be seen in Figure 5.

The s-shaped curve often reflects testing efforts. Initially testing appears to be a low priority for the software vendor as few errors are discovered and corrected. This is followed by a period of extensive testing and revision, demonstrated by the marked increase in errors detected and corrected. The conclusion of the testing cycle is marked by a period of less rigorous testing where the number of errors discovered and corrected decreases considerably. The s-shaped trend was also apparent in the MCF plots for Windows 2000 and NT 4. With NT 4, service packs were often released, and the data between each service pack appears to follow an s-shaped curve, resulting in a chain of s-shaped curves that were described earlier as local fluctuations in an otherwise linear MCF plot. It is reasonable to believe that this trend will continue with Windows XP. Models have been developed for the s-shaped curve and we refer the reader to Lawson et al.6 for a brief introduction. The plot in Figure 5 includes an s-shaped model fit to the Windows XP data.


Windows XP is still early in its product lifecycle, so accurate reliability estimates are not possible. Further comparisons with other Microsoft OSs are premature at this time.
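One commonly used s-shaped NHPP, given here only as an illustration (the specific form fitted in Figure 5 is not identified in the text), is the delayed s-shaped mean value function

M(t) = a(1 − (1 + bt)e^{-bt}),    λ(t) = ab^2 t e^{-bt}

whose repair rate first rises and then falls, matching the slow start, rapid middle, and tapering end of the curve described above.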

The s-shaped curves within NT 4 and Windows 2000 may present another area for future research. Rather than fitting only the global or overall trend and ignoring some of the local fluctuations, models that chain together several s-shaped curves could be developed.

Other areas for future research include considering the functional area in which patches occur and the possibility of the cumulative failure count for a given OS returning to zero after the release of a service pack. The functional areas of the OS include security, networking, memory management, and I/O functions. Analysis that takes these factors into account may require several models for each OS, or perhaps models can be developed that take service packs and functional areas of the OS repaired into consideration.
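As a sketch of the last idea, the cumulative failure count could be restarted at each service pack with a BY-group data step; the variable service_pack identifying the service-pack era of each patch is hypothetical.

*restart the cumulative count at each service pack (sketch, hypothetical grouping variable);
proc sort data = nt1;
  by service_pack days;
run;

data nt_by_sp;
  set nt1;
  by service_pack;
  if first.service_pack then failures_sp = 0;   /* reset the counter when a new service pack begins */
  failures_sp + 1;                              /* cumulative failures within the service pack */
run;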

REFERENCES

1. Goel AL. Software reliability models: Assumptions, limitations, and applicability. IEEE Transactions on Software Engineering 1985; 11:1409–1423.
2. Wesselman CW. Software reliability prediction. MS Thesis, Department of Statistics, Brigham Young University, 2000.
3. Yamada S, Osaki S. Software reliability growth modeling: Models and applications. IEEE Transactions on Software Engineering 1985; 11:1431–1437.
4. Zeiter D. Realistic assumptions for software reliability models. Proceedings of the 1991 International Symposium on Software Reliability Engineering. IEEE Computer Society Press: Los Alamitos, CA, 1991; 67–74.
5. Healy JD, Jain AK, Bennett JM. Reliability prediction. Proceedings of the 1996 Annual Reliability and Maintainability Symposium. IEEE Press: Piscataway, NJ, 1996.
6. Lawson JS, Wesselman CW, Scott DT. Simple plots improve software reliability prediction models. Quality Engineering 2003; 15:409–415.
7. Tobias PA, Trindade DC. Applied Reliability. Van Nostrand Reinhold: New York, 1995.
8. Brocklehurst S, Littlewood B. Techniques for prediction analysis and recalibration. Handbook of Software Reliability Engineering, Lyu MR (ed.). McGraw-Hill: New York, 1996; 119–164.
9. ZDLABS. Microsoft Windows 2000 Professional reliability: A comparative study of the reliability of Windows 2000 Professional, Microsoft Windows 98 and Microsoft Windows NT Workstation 4.0, 2000. http://www.microsoft.com/windows2000/professional/evaluation/news/zdwindows.asp [February 2001].
10. NSTL. Microsoft Windows 2000 Professional—Comparison of the reliability of desktop operating systems, 2000. http://www.microsoft.com/windows2000/professional/evaluation/news/nstl.asp [February 2001].

Authors’ biographies

John Lawson is an associate professor in the Department of Statistics at Brigham Young University. He received his PhD in Applied Statistics from the Polytechnic Institute of New York. He currently teaches courses in engineering statistics, reliability engineering, and design of experiments. He is a senior member of ASQ.

Jeremy Sudweeks obtained his MS degree in Statistics from Brigham Young University in 2002. He is continuing graduate studies at Virginia Polytechnic Institute and State University.

Del T. Scott is currently professor in the Department of Statistics at Brigham Young University. He received his PhD in Statistics at Pennsylvania State University. He currently teaches courses in mathematical statistics and statistical computing. His interests lie in the area of statistical computing.
