
Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?

Bianca Schroeder, Garth A. Gibson
Computer Science Department, Carnegie Mellon University
{bianca, garth}@cs.cmu.edu

In FAST'07: 5th USENIX Conference on File and Storage Technologies, San Jose, CA, Feb. 14-16, 2007.

    Abstract

Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million.

In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI and FC, as well as SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%.

We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF.

We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years.

Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component-specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks.

Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.

    1 Motivation

Despite major efforts, both in industry and in academia, high reliability remains a major challenge in running large-scale IT systems, and disaster prevention and cost of actual disasters make up a large fraction of the total cost of ownership. With ever larger server clusters, maintaining high levels of reliability and availability is a growing problem for many sites, including high-performance computing systems and internet service providers. A particularly big concern is the reliability of storage systems, for several reasons. First, failure of storage can not only cause temporary data unavailability, but in the worst case it can lead to permanent data loss. Second, technology trends and market forces may combine to make storage system failures occur more frequently in the future [24]. Finally, the size of storage systems in modern, large-scale IT installations has grown to an unprecedented scale with thousands of storage devices, making component failures the norm rather than the exception [7].

Large-scale IT systems, therefore, need better system design and management to cope with more frequent failures. One might expect increasing levels of redundancy designed for specific failure modes [3, 7], for example. Such designs and management systems are based on very simple models of component failure and repair processes [22]. Better knowledge about the statistical properties of storage failure processes, such as the distribution of time between failures, may empower researchers and designers to develop new, more reliable and available storage systems.

Unfortunately, many aspects of disk failures in real systems are not well understood, probably because the owners of such systems are reluctant to release failure data or do not gather such data. As a result, practitioners usually rely on vendor-specified parameters, such as mean time to failure (MTTF), to model failure processes, although many are skeptical of the accuracy of those models [4, 5, 33]. Too much academic and corporate research is based on anecdotes and back-of-the-envelope calculations, rather than empirical data [28].

The work in this paper is part of a broader research agenda with the long-term goal of providing a better understanding of failures in IT systems by collecting, analyzing and making publicly available a diverse set of real failure histories from large-scale production systems. In our pursuit, we have spoken to a number of large production sites and were able to convince several of them to provide failure data from some of their systems.

In this paper, we provide an analysis of seven data sets we have collected, with a focus on storage-related failures. The data sets come from a number of large-scale production systems, including high-performance computing sites and large internet services sites, and consist primarily of hardware replacement logs. The data sets vary in duration from one month to five years and cover in total a population of more than 100,000 drives from at least four different vendors. Disks covered by this data include drives with SCSI and FC interfaces, commonly represented as the most reliable types of disk drives, as well as drives with SATA interfaces, common in desktop and nearline systems. Although 100,000 drives is a very large sample relative to previously published studies, it is small compared to the estimated 35 million enterprise drives, and 300 million total drives, built in 2006 [1]. Phenomena such as bad batches caused by fabrication line changes may require much larger data sets to fully characterize.

We analyze three different aspects of the data. We begin in Section 3 by asking how disk replacement frequencies compare to replacement frequencies of other hardware components. In Section 4, we provide a quantitative analysis of disk replacement rates observed in the field and compare our observations with common predictors and models used by vendors. In Section 5, we analyze the statistical properties of disk replacement rates. We study correlations between disk replacements and identify the key properties of the empirical distribution of time between replacements, and compare our results to common models and assumptions. Section 6 provides an overview of related work and Section 7 concludes.

    2 Methodology

    2.1 What is a disk failure?

While it is often assumed that disk failures follow a simple fail-stop model (where disks either work perfectly or fail absolutely and in an easily detectable manner [22, 24]), disk failures are much more complex in reality. For example, disk drives can experience latent sector faults or transient performance problems. Often it is hard to correctly attribute the root cause of a problem to a particular hardware component.

Our work is based on hardware replacement records and logs, i.e. we focus on disk conditions that lead a drive customer to treat a disk as permanently failed and to replace it. We analyze records from a number of large production systems, which contain a record for every disk that was replaced in the system during the time of the data collection. To interpret the results of our work correctly it is crucial to understand the process of how this data was created. After a disk drive is identified as the likely culprit in a problem, the operations staff (or the computer system itself) perform a series of tests on the drive to assess its behavior. If the behavior qualifies as faulty according to the customer's definition, the disk is replaced and a corresponding entry is made in the hardware replacement log.

The important thing to note is that there is not one unique definition for when a drive is faulty. In particular, customers and vendors might use different definitions. For example, a common way for a customer to test a drive is to read all of its sectors to see if any reads experience problems, and decide that it is faulty if any one operation takes longer than a certain threshold. The outcome of such a test will depend on how the thresholds are chosen. Many sites follow a "better safe than sorry" mentality, and use even more rigorous testing. As a result, it cannot be ruled out that a customer may declare a disk faulty, while its manufacturer sees it as healthy. This also means that the definition of "faulty" that a drive customer uses does not necessarily fit the definition that a drive manufacturer uses to make drive reliability projections. In fact, a disk vendor has reported that for 43% of all disks returned by customers they find no problem with the disk [1].
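As an illustration only, here is a minimal sketch of the kind of customer health test described above: sequentially read the whole device and flag the drive if any single read exceeds a latency threshold. The device path, chunk size, and threshold are hypothetical choices, not values from the paper.

```python
# Minimal sketch of a read-scan health test: scan the disk sequentially and
# declare it faulty if any single read takes longer than a chosen threshold.
import time

def scan_disk(device="/dev/sdX", chunk_size=1 << 20, threshold_s=5.0):
    """Return True if every read of the device completes under the threshold."""
    with open(device, "rb", buffering=0) as disk:
        while True:
            start = time.monotonic()
            chunk = disk.read(chunk_size)
            elapsed = time.monotonic() - start
            if not chunk:               # reached end of device
                return True
            if elapsed > threshold_s:   # one slow read is enough to fail the test
                return False
```

How strict such a test is depends entirely on the chosen threshold, which is exactly why customer and vendor verdicts can differ.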

It is also important to note that the failure behavior of a drive depends on the operating conditions, and not only on component-level factors. For example, failure rates are affected by environmental factors, such as temperature and humidity, data center handling procedures, workloads, and duty cycles or "powered-on hours" patterns.

We would also like to point out that the failure behavior of disk drives, even if they are of the same model, can differ, since disks are manufactured using processes and parts that may change. These changes, such as a change in a drive's firmware or a hardware component or even the assembly line on which a drive was manufactured, can change the failure behavior of a drive. This effect is often called the effect of batches or vintage. A bad batch can lead to unusually high drive failure rates or unusually high rates of media errors. For example, in the HPC3 data set (Table 1) the customer had 11,000 SATA drives replaced in Oct. 2006 after observing a high frequency of media errors.


…distribution function (CDF) and how well it is fit by four probability distributions commonly used in reliability theory: the exponential distribution; the Weibull distribution; the gamma distribution; and the lognormal distribution. We parameterize the distributions through maximum likelihood estimation and evaluate the goodness of fit by visual inspection, the negative log-likelihood and the chi-square tests.
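To make the fitting procedure concrete, here is a minimal sketch using SciPy's maximum likelihood fitting; the sample times below are made up (the paper's replacement logs are not public), and pinning the location parameter at zero is our choice, not the paper's.

```python
# Fit candidate distributions to times between disk replacements by maximum
# likelihood and compare their negative log-likelihoods (lower is a better fit).
import numpy as np
from scipy import stats

# hypothetical times between replacements, in hours
times = np.array([120.0, 340.0, 55.0, 980.0, 410.0, 2300.0, 75.0, 660.0])

candidates = {
    "exponential": stats.expon,
    "Weibull": stats.weibull_min,
    "gamma": stats.gamma,
    "lognormal": stats.lognorm,
}

for name, dist in candidates.items():
    params = dist.fit(times, floc=0)             # MLE with location fixed at 0
    nll = -np.sum(dist.logpdf(times, *params))   # negative log-likelihood
    print(f"{name:12s} params={tuple(float(p) for p in params)}  NLL={nll:.1f}")
```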

We will also discuss the hazard rate of the distribution of time between replacements. In general, the hazard rate of a random variable $t$ with probability distribution $f(t)$ and cumulative distribution function $F(t)$ is defined as [25]

$$h(t) = \frac{f(t)}{1 - F(t)}.$$

Intuitively, if the random variable $t$ denotes the time between failures, the hazard rate $h(t)$ describes the instantaneous failure rate as a function of the time since the most recently observed failure. An important property of $t$'s distribution is whether its hazard rate is constant (which is the case for an exponential distribution) or increasing or decreasing. A constant hazard rate implies that the probability of failure at a given point in time does not depend on how long it has been since the most recent failure. An increasing hazard rate means that the probability of a failure increases if the time since the last failure has been long. A decreasing hazard rate means that the probability of a failure decreases if the time since the last failure has been long.
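As a standard reliability-theory illustration (not specific to the paper's data), the Weibull distribution with shape $k$ and scale $\lambda$ covers all three cases:

$$h(t) = \frac{f(t)}{1 - F(t)} = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1},$$

which is decreasing for $k < 1$, constant for $k = 1$ (the exponential special case), and increasing for $k > 1$.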

The hazard rate is often studied for the distribution of lifetimes. It is important to note that we will focus on the hazard rate of the time between disk replacements, and not the hazard rate of disk lifetime distributions.

Since we are interested in correlations between disk failures, we need a measure for the degree of correlation. The autocorrelation function (ACF) measures the correlation of a random variable with itself at different time lags $l$. The ACF, for example, can be used to determine whether the number of failures in one day is correlated with the number of failures observed $l$ days later. The autocorrelation coefficient can range between 1 (high positive correlation) and -1 (high negative correlation). A value of zero would indicate no correlation, supporting independence of failures per day.
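A minimal sketch of the lag-$l$ autocorrelation coefficient of a daily failure-count series; the counts are invented for illustration.

```python
# Sample autocorrelation of a daily failure-count series at a given lag.
import numpy as np

def autocorrelation(x, lag):
    """Sample autocorrelation coefficient of the series x at the given lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    # covariance at the given lag, normalized by the variance (lag 0)
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

daily_failures = np.array([0, 2, 1, 0, 3, 1, 0, 0, 2, 4, 1, 0, 1, 2])
for lag in (1, 2, 7):
    print(f"lag {lag}: ACF = {autocorrelation(daily_failures, lag):+.2f}")
```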

Another aspect of the failure process that we will study is long-range dependence. Long-range dependence measures the memory of a process, in particular how quickly the autocorrelation coefficient decays with growing lags. The strength of the long-range dependence is quantified by the Hurst exponent. A series exhibits long-range dependence if the Hurst exponent, H, satisfies 0.5 < H < 1. We use the Selfis tool [14] to obtain estimates of the Hurst parameter using five different methods: the absolute value method, the variance method, the R/S method, the periodogram method, and the Whittle estimator. A brief introduction to long-range dependence and a description of the Hurst parameter estimators is provided in [15].

HPC1
Component         %
CPU               44
Memory            29
Hard drive        16
PCI motherboard    9
Power supply       2

Table 2: Node outages that were attributed to hardware problems, broken down by the responsible hardware component. This includes all outages, not only those that required replacement of a hardware component.
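For intuition about one of the estimators listed above, here is a simplified rescaled-range (R/S) sketch; it is a stand-in for the Selfis tool's estimators, and the Poisson input series is synthetic, a memoryless series for which H is 0.5 in theory (simple R/S estimates are known to be somewhat biased on short series).

```python
# Simplified rescaled-range (R/S) estimate of the Hurst exponent.
import numpy as np

def rescaled_range(block):
    """R/S statistic of one block of observations."""
    z = np.cumsum(block - block.mean())   # cumulative deviations from the mean
    r = z.max() - z.min()                 # range of the cumulative deviations
    s = block.std()                       # standard deviation of the block
    return r / s if s > 0 else np.nan

def hurst_rs(series, block_sizes=(8, 16, 32, 64)):
    """Slope of log(mean R/S) versus log(block size) approximates H."""
    series = np.asarray(series, dtype=float)
    log_n, log_rs = [], []
    for n in block_sizes:
        blocks = [series[i:i + n] for i in range(0, len(series) - n + 1, n)]
        log_n.append(np.log(n))
        log_rs.append(np.log(np.nanmean([rescaled_range(b) for b in blocks])))
    slope, _ = np.polyfit(log_n, log_rs, 1)
    return slope

daily_failures = np.random.default_rng(0).poisson(2.0, size=512)
print(f"estimated H = {hurst_rs(daily_failures):.2f}")
```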

3 Comparing disk replacement frequency with that of other hardware components

The reliability of a system depends on all its components, and not just the hard drive(s). A natural question is therefore what the relative frequency of drive failures is, compared to that of other types of hardware failures. To answer this question we consult data sets HPC1, COM1, and COM2, since these data sets contain records for all types of hardware replacements, not only disk replacements. Table 3 shows, for each data set, a list of the ten most frequently replaced hardware components and the fraction of replacements made up by each component. We observe that while the actual fraction of disk replacements varies across the data sets (ranging from 20% to 50%), it makes up a significant fraction in all three cases. In the HPC1 and COM2 data sets, disk drives are the most commonly replaced hardware component, accounting for 30% and 50% of all hardware replacements, respectively. In the COM1 data set, disks are a close runner-up, accounting for nearly 20% of all hardware replacements.

While Table 3 suggests that disks are among the most commonly replaced hardware components, it does not necessarily imply that disks are less reliable or have a shorter lifespan than other hardware components. The number of disks in the systems might simply be much larger than that of other hardware components. In order to compare the reliability of different hardware components, we need to normalize the number of component replacements by the component's population size.

Unfortunately, we do not have, for any of the systems, exact population counts of all hardware components. However, we do have enough information in HPC1 to estimate counts of the four most frequently replaced hardware components.


[Figure 1: bar chart of annual replacement rate (%), from 0 to 6%, for data sets HPC1, HPC2, HPC3, HPC4, COM1, COM2, and COM3, with horizontal reference lines at ARR = 0.88% and ARR = 0.58% and a line for the average ARR.]

Figure 1: Comparison of datasheet AFRs (solid and dashed line in the graph) and ARRs observed in the field. Each bar in the graph corresponds to one row in Table 1. The dotted line represents the weighted average over all data sets. Only disks within the nominal lifetime of five years are included, i.e. there is no bar for the COM3 drives that were deployed in 1998. The third bar for COM3 in the graph is cut off; its ARR is 13.5%.

Since at this point we are not interested in wear-out effects after the end of a disk's nominal lifetime, we have included in Figure 1 only data for drives within their nominal lifetime of five years. In particular, we do not include a bar for the fourth type of drives in COM3 (see Table 1), which were deployed in 1998 and were more than seven years old at the end of the data collection. These possibly obsolete disks experienced an ARR, during the measurement period, of 24%. Since these drives are well outside the vendor's nominal lifetime for disks, it is not surprising that the disks might be wearing out. All other drives were within their nominal lifetime and are included in the figure.

Figure 1 shows a significant discrepancy between the observed ARR and the datasheet AFR for all data sets. While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs, by data set and type, are up to a factor of 15 higher than datasheet AFRs.

Most commonly, the observed ARR values are in the 3% range. For example, the data for HPC1, which covers almost exactly the entire nominal lifetime of five years, exhibits an ARR of 3.4% (significantly higher than the datasheet AFR of 0.88%). The average ARR over all data sets (weighted by the number of drives in each data set) is 3.01%. Even after removing all COM3 data, which exhibits the highest ARRs, the average ARR was still 2.86%, 3.3 times higher than 0.88%.
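For reference, the nominal AFR implied by a datasheet MTTF follows from a back-of-the-envelope conversion (our arithmetic, using 8,760 hours per year):

$$\mathrm{AFR} \approx \frac{8760\ \text{h/yr}}{\mathrm{MTTF}}, \qquad \frac{8760}{1{,}000{,}000} \approx 0.88\%, \qquad \frac{8760}{1{,}500{,}000} \approx 0.58\%,$$

consistent with the 0.88% and 0.58% reference values used in Figure 1.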

It is interesting to observe that for these data sets there is no significant discrepancy between replacement rates for SCSI and FC drives, commonly represented as the most reliable types of disk drives, and SATA drives, frequently described as lower quality. For example, the ARRs of drives in the HPC4 data set, which are exclusively SATA drives, are among the lowest of all data sets. Moreover, the HPC3 data set includes both SCSI and SATA drives (as part of the same system in the same operating environment) and they have nearly identical replacement rates. Of course, these HPC3 SATA drives were decommissioned because of media error rates attributed to lubricant breakdown (recall Section 2.1), our only evidence of a bad batch, so perhaps more data is needed to better understand the impact of batches on overall quality.

It is also interesting to observe that the only drives that have an observed ARR below the datasheet AFR are the second and third type of drives in data set HPC4. One possible reason might be that these are relatively new drives, all less than one year old (recall Table 1). Also, these ARRs are based on only 16 replacements, perhaps too little data to draw a definitive conclusion.

A natural question arises: why are the observed disk replacement rates so much higher in the field data than the datasheet MTTF would suggest, even for drives in the first years of operation? As discussed in Sections 2.1 and 2.2, there are multiple possible reasons.

First, customers and vendors might not always agree on the definition of when a drive is faulty. The fact that a disk was replaced implies that it failed some (possibly customer-specific) health test. When a health test is conservative, it might lead to replacing a drive that the vendor's tests would find to be healthy. Note, however, that even if we scale down the ARRs in Figure 1 to 57% of their actual values, to estimate the fraction of drives returned to the manufacturer that fail the latter's health test [1], the resulting AFR estimates are still more than a factor of two higher than datasheet AFRs in most cases.
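A quick check of that scaling, applied to the weighted average ARR from above (the arithmetic is ours):

$$3.01\% \times 0.57 \approx 1.72\% \approx 2 \times 0.88\%,$$

so even under this generous correction the average field rate remains roughly twice the datasheet figure.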


Second, datasheet MTTFs are typically determined based on accelerated (stress) tests, which make certain assumptions about the operating conditions under which the disks will be used (e.g. that the temperature will always stay below some threshold), the workloads and duty cycles or powered-on hours patterns, and that certain data center handling procedures are followed. In practice, operating conditions might not always be as ideal as assumed in the tests used to determine datasheet MTTFs. A more detailed discussion of factors that can contribute to a gap between expected and measured drive reliability is given by Elerath and Shah [6].

Below we summarize the key observations of this section.

Observation 1: Variance between datasheet MTTF and disk replacement rates in the field was larger than we expected. The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.

Observation 2: For older systems (5-8 years of age), data sheet MTTFs underestimated replacement rates by as much as a factor of 30.

Observation 3: Even during the first few years of a system's lifetime (< 3 years), when wear-out is not expected to be a significant factor, the difference between datasheet MTTF and observed time to disk replacement was as large as a factor of 6.

Observation 4: In our data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. This may indicate that disk-independent factors, such as operating conditions, usage and environmental factors, affect replacement rates more than component-specific factors. However, the only evidence we have of a bad batch of disks was found in a collection of SATA disks experiencing high media error rates. We have too little data on bad batches to estimate the relative frequency of bad batches by type of disk, although there is plenty of anecdotal evidence that bad batches are not unique to SATA disks.

4.2 Age-dependent replacement rates

One aspect of disk failures that single-value metrics such as MTTF and AFR cannot capture is that in real life failure rates are not constant [5]. Failure rates of hardware products typically follow a "bathtub curve" with high failure rates at the beginning (infant mortality) and the end (wear-out) of the lifecycle. Figure 2 shows the failure rate pattern that is expected for the life cycle of hard drives [4, 5, 33]. According to this model, the first year of operation is characterized by early failures (or infant mortality). In years 2-5, the failure rates are approximately in steady state, and then, after years 5-7, wear-out starts to kick in.

Figure 2: Lifecycle failure pattern for hard drives [33].
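A stylized encoding of that lifecycle model, following the breakpoints described above; the numeric rates are placeholders for illustration, not values from the paper or from [33].

```python
# Stylized bathtub-shaped failure-rate curve for the hard-drive lifecycle model:
# infant mortality in year 1, rough steady state in years 2-5, wear-out afterwards.
def nominal_failure_rate(age_years: float) -> float:
    """Relative failure rate as a function of drive age in years (toy model)."""
    if age_years < 1.0:
        return 2.0 - age_years              # early failures, decaying over year 1
    if age_years < 5.0:
        return 1.0                          # approximate steady state, years 2-5
    return 1.0 + 0.5 * (age_years - 5.0)    # wear-out after the nominal lifetime

for age in (0.25, 2, 4, 6, 8):
    print(f"age {age} yr: relative rate {nominal_failure_rate(age):.2f}")
```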

The common concern that MTTFs do not capture infant mortality has led the International Disk drive Equipment and Materials Association (IDEMA) to propose a new standard for specifying disk drive reliability, based on the failure model depicted in Figure 2 [5, 33]. The new standard requests that vendors provide four different MTTF estimates, one for the first 1-3 months of operation, one for months 4-6, one for months 7-12, and one for months 13-60.

The goal of this section is to study, based on our field replacement data, how disk replacement rates in large-scale installations vary over a system's life cycle. Note that we only see customer-visible replacements. Any infant mortality failures caught in manufacturing, system integration or installation testing are probably not recorded in production replacement logs.

The best data sets for studying replacement rates across the system life cycle are HPC1 and the first type of drives of HPC4. The reason is that these data sets span a long enough time period (5 and 3 years, respectively) and each covers a reasonably homogeneous hard drive population, allowing us to focus on the effect of age.

We study the change in replacement rates as a function of age at two different time granularities, on a per-month and a per-year basis, to make it easier to detect both short-term and long-term trends. Figure 3 shows the annual replacement rates for the disks in the compute nodes of system HPC1 (left), the file system nodes of system HPC1 (middle) and the first type of HPC4 drives (right), at a yearly granularity.
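For concreteness, here is a minimal sketch of how such age-binned annual replacement rates can be computed from a replacement log; the event list, population size, and the assumption of a roughly constant population (failed drives are replaced) are hypothetical simplifications, not the paper's actual procedure.

```python
# Annual replacement rate (ARR) per year of drive age, from a replacement log.
from collections import Counter

# hypothetical drive ages (in years) at the time of each replacement
replacement_ages = [0.4, 1.2, 1.7, 2.1, 2.3, 2.8, 3.5, 4.1, 4.6, 4.8]
population = 1000          # drives in the system; assumed roughly constant
observation_years = 5      # length of the observation window

per_year = Counter(int(age) for age in replacement_ages)

for year in range(observation_years):
    replacements = per_year.get(year, 0)
    arr = 100.0 * replacements / population   # each drive contributes ~1 drive-year
    print(f"year {year + 1}: {replacements} replacements, ARR = {arr:.2f}%")
```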

We make two interesting observations. First, replacement rates in all years, except for year 1, are larger than the datasheet MTTF would suggest. For example, in HPC1's second year, replacement rates are 20% larger than expected for the file system nodes, and a factor of two larger than expected for the compute nodes.


[Figure 3: three bar-chart panels of ARR (%), axis 0-8%, versus years of operation (1-5) for the HPC1 compute nodes and the HPC1 filesystem nodes, and versus age in years (1-3) for HPC4.]

Figure 3: ARR for the first five years of system HPC1's lifetime, for the compute nodes (left) and the file system nodes (middle). ARR for the first type of drives in HPC4 as a function of drive age in years (right).

[Figure 4: three panels of ARR (%) versus months of operation (0-60) for the HPC1 compute nodes and the HPC1 filesystem nodes, and versus age in months (0-35) for HPC4.]

Figure 4: ARR per month over the first five years of system HPC1's lifetime, for the compute nodes (left) and the file system nodes (middle). ARR for the first type of drives in HPC4 as a function of drive age in months (right).

In year 4 and year 5 (which are still within the nominal lifetime of these disks), the actual replacement rates are 7-10 times higher than the failure rates we expected based on datasheet MTTF.

The second observation is that replacement rates are rising significantly over the years, even during early years in the lifecycle. Replacement rates in HPC1 nearly double from year 1 to 2, or from year 2 to 3. This observation suggests that wear-out may start much earlier than expected, leading to steadily increasing replacement rates during most of a system's useful life. This is an interesting observation because it does not agree with the common assumption that after the first year of operation, failure rates reach a steady state for a few years, forming the bottom of the bathtub.

Next, we move to the per-month view of replacement rates, shown in Figure 4. We observe that for the HPC1 file system nodes there are no replacements during the first 12 months of operation, i.e. there is no detectable infant mortality. For HPC4, the ARR of drives is not higher in the first few months of the first year than the last few months of the first year. In the case of the HPC1 compute nodes, infant mortality is limited to the first month of operation and is not above the steady state estimate of the datasheet MTTF. Looking at the lifecycle after month 12, we again see continuously rising replacement rates, instead of the expected bottom of the bathtub.

Below we summarize the key observations of this section.

Observation 5: Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead, replacement rates seem to steadily increase over time.

Observation 6: Early onset of wear-out seems to have a much stronger impact on lifecycle replacement rates than infant mortality, as experienced by end customers, even when considering only the first three or five years of a system's lifetime. We therefore recommend that wear-out be incorporated into new standards for disk drive reliability. The new standard suggested by IDEMA does not take wear-out into account [5, 33].


[Figure: empirical distribution of the number of replacements per month; y-axis Pr(X ≤ x) from 0 to 1, x-axis number of replacements per month from 0 to 40.]