Hard Disk Drive - Reliability Overview
Dr. Amit Chattopadhyay
Sr. Engineering Manager, Recording Sub-Systems
Advanced Reliability Engineering
Western Digital, San Jose
Outline
1. State of affairs in HDD field reliability
a. HDD complexity leads to a myriad of possible failure modes
b. Macroscopic objects at molecular dimensions: the head-disk interface
c. Critical parameters: clearance and workload
2. HDD reliability modeling
a. Historical model: derivation and limitations
b. Importance of workload and temperature
c. Drive parametric measurements: virtual failures
3. Data center fleet management: in-field health monitor
Core Problem in HDD Reliability
1. Avoid something analogous to a "BP" moment (the Deepwater Horizon oil spill of 2010): an unmanageable catastrophe midway through, or beyond, a product's warranty life. In reliability terms this is the "wear-out" mode, an unbounded failure rate at long times.
2. For the drives already out in the field, develop a predictive analysis for failures.
QUESTIONS
How do we define a product's "lifetime"?
How do we get visibility into what happens during this lifetime?
Can we help our customers perform effective fleet management in real time?
[Figure: clearance (nm) vs. areal density (Gb/in²) for recent WD mobile and desktop products; clearance falls from roughly 5 nm toward 1 nm as areal density rises from 100 to 500 Gb/in².]
Current HDD Reliability Landscape
Clearance trend in recent WD products: increases in areal density have largely come by decreasing the head-disk spacing (the magnetics vs. tribology dilemma).
Current head-media spacings are at molecular dimensions: 1-2 nm.
Operating at these clearances can impact HDD failure rates and the failure pareto; as much as 70% of the failure pareto is attributable to low-clearance operation.
Operation at ever-decreasing spacings requires special attention to:
anomalous physical phenomena in ultrathin films,
non-linear response functions (e.g. adhesive and frictional forces),
mechanical tolerances, etc.
Tribology problems explode.
Head-Media Interface Failures: A Conceptual Model
The clearance between slider and disk is a dominant factor dictating HDI reliability (~70% of all failures). Insufficient clearance results in an increased probability of head-disk interaction and eventual failure due to HFW, modulation, or head degradation.
To ensure acceptable reliability in this spacing regime, one must minimize the probability of contact.
The working metric is Z = (mean clearance) / (standard deviation of clearance).
[Figure: model of the head-disk interface: a clearance distribution (0.5-2.5 nm scale) above the disk. Reducing the mean clearance from 2.1 nm to 1.6 nm pushes the low-clearance tail into the head-disk interaction region, increasing failure probability.]
Influence of Clearance Mean / Sigma
Toy model: failure rates as a function of both the mean clearance and the required standard deviation in clearance.
[Figure: clearance-related AFR (%) vs. mean clearance (2.4 nm down to 0.8 nm) for σ = 0.35, 0.5, and 0.7 nm; AFR rises steeply as the mean drops and as σ grows.]
HDDs require control of macro-scale mechanics to atomic dimensions!
WD produces 200 million drives per year, each with multiple heads: something on the order of 1 billion head-disk interfaces per year.
The standard deviation in clearance for all of these interfaces must be held to within the van der Waals diameter of a single carbon atom, 0.34 nm.
Design criterion: hold Z = (mean clearance) / σ at or above 3.
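The design criterion can be checked against a minimal sketch of the toy model, assuming a Gaussian clearance distribution and a hypothetical contact threshold (the 0.5 nm threshold, the Gaussian form, and the sweep values are illustrative assumptions, not WD's actual model):

```python
from scipy.stats import norm

def contact_probability(mean_nm, sigma_nm, threshold_nm=0.5):
    """P(clearance < threshold) for a Gaussian clearance distribution.
    The 0.5 nm contact threshold is a hypothetical placeholder."""
    return norm.cdf(threshold_nm, loc=mean_nm, scale=sigma_nm)

# Sweep the same mean / sigma ranges shown in the toy-model figure
for sigma in (0.35, 0.5, 0.7):
    for mean in (2.4, 2.0, 1.6, 1.2, 0.8):
        p = contact_probability(mean, sigma)
        print(f"mean={mean:.1f} nm  sigma={sigma:.2f} nm  P(contact)={p:.2e}")

# Design criterion from the slide: Z = mean clearance / sigma >= 3
print("Z =", 1.6 / 0.5)  # 3.2: barely meets the criterion
```

Even this crude sketch reproduces the qualitative message of the figure: contact probability grows by orders of magnitude as the mean drops or the spread widens.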
HDI: the Head-Disk Interface, a Major Challenge Area
HDDs are complex devices that push the envelope of our physics understanding in a number of areas; the interface operates at the edge of classical thermodynamics.
A monolayer of lubricant is used on the disk surface to reduce adhesion and friction between slider and disk.
[Figure: slider pole tip over the disk; lubricant chains roughly 0.8 nm × 80 nm with polar "end-groups".]
Lubricant pick-up by the slider is normal, but pick-up in excess of one monolayer can pose a serious reliability risk: HFW, OTW, modulation, and head degradation.
[Figure: optical vs. chemical image of a slider; red = excess lube.]
Monomolecular films can show anomalous behavior:
Fractal formation: lube fractals roughly 4 nm in height.
Terraced spreading: film-thickness profiles (Å) show a stable first monolayer and stable bilayers out to the 3rd and 5th monolayers.
Non-classical thermodynamics: physical properties depend on the amount of material present.
…with Some Chemistry and Materials Science
Protective diamond-like carbon (DLC) overcoats are used on both the head and disk surfaces. There is constant pressure to thin these overcoats to decrease the magnetic separation distance and thereby improve magnetic signal and areal density. Current DLC thicknesses are on the order of 2 nm.
As these thicknesses are reduced, various reliability risks can arise:
Disk example: migration of magnetic material to the head-disk interface. Corrective action required modification of the carbon deposition energetics and process (optimization of the sp³/sp² ratio).
Head example: contact stress can result in wear of the pole tip. Corrective action required stricter process-control measures.
[Figure: cross-sections of the ~2 nm carbon overcoat above the magnetic layer, illustrating Co²⁺ migration ("Co smear") through the overcoat and carbon wear at the pole tip.]
Maintaining sufficient clearance, and hence ensuring interface reliability, is a multi-dimensional problem. HDD reliability is certainly not boring: new issues seem to accompany the development of each new product generation. Because HDD complexity complicates reliability modeling, we must focus on the most likely failure modes; any simple reliability model is bound to have some weaknesses given this complexity.
HDI failure rate: critical component attributes
[Diagram: factors feeding the HDI failure rate]
Media overcoat: material composition, deposition technique, active sites, microwaviness, hardness, polarity, thickness, D:G ratio, density, surface roughness.
Lubricant: molecular weight, PFPE backbone (C1/C2/C3), surface mobility, bonding ratio, bonding kinetics, end-group structure, humidity sensitivity, polarity, additives (concentration, type), film thickness, surface energy, disjoining pressure, polydispersity.
Operating environment: humidity, temperature, particulate loading, pressure/altitude, contaminants.
Drive parameters: filter type and location, rotational velocity, defect mapping, channel, seek profile.
Head design: ABS, positive pressure, fly-height profile, minimum FH, pole-tip protrusion, pad size, DFH bulge size/roughness, overcoat composition, overcoat thickness, stiffness, roll, pitch.
Time to Failure: The "Bathtub" Reliability Model
Classical reliability model, "the bathtub curve": failure probability vs. time.
Infant-mortality region: the failure rate decreases with increasing time. It results from defects either designed into, or inadvertently built into, a product and is indicative of quality "escapes": marginal materials, drives with the least margin for some critical design tolerance, manufacturing anomalies.
Steady-state region: after the weak drives are removed from the population, the failure rate settles to a fixed value for the service life of the drive.
Wear-out region: at long times, normal wear and tear of the system components results in a failure rate that increases with time. This behavior causes costly excursions for both WD and our customers; this regime must be avoided at all costs.
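For illustration only (this construction is not in the deck), the bathtub shape can be reproduced by superposing Weibull hazard rates for the three regions: β < 1 for infant mortality, β = 1 for the steady state, and β > 1 for wear-out. A minimal sketch with invented parameters:

```python
import numpy as np

def weibull_hazard(t, beta, eta):
    """Weibull hazard rate: h(t) = (beta / eta) * (t / eta)**(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

t = np.linspace(0.01, 10.0, 500)  # arbitrary time units
h = (weibull_hazard(t, beta=0.5, eta=2.0)      # infant mortality (beta < 1)
     + weibull_hazard(t, beta=1.0, eta=5.0)    # steady state (constant rate)
     + weibull_hazard(t, beta=4.0, eta=12.0))  # wear-out (beta > 1)
# h(t) falls, flattens, then rises: the bathtub shape
print(h[0], h[len(h) // 2], h[-1])
```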
Classical Reliability Theory: Time Dependence
The standard Weibull treatment gives the time dependence of the unreliability, F(t):
$F(t) = 1 - \exp\!\left[-\left(t/\eta\right)^{\beta}\right]$
where t = time (POH, power-on hours), η = time constant, and β = shape parameter.
In the limit of small failure rates (<10%), the unreliability simplifies to a power function of time:
$F(t) \approx \left(t/\eta\right)^{\beta}$
Field reliability vs. time:
[Figure: cumulative field failures vs. POH over the 1st, 2nd, and 3rd years; β < 1, so the failure rate decreases with time.]
Since the failure rate is assumed to depend on time alone, MTTF has been used to quantify the intrinsic reliability of HDDs; MTTF is defined as the time required for 63.2% of all drives in a given population to fail. But use of MTTF alone is meaningless; this is the reason behind the workload specs introduced by all HDD manufacturers.
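A quick numeric check of the two expressions above (η and β here are illustrative, not fitted WD values):

```python
import numpy as np

def unreliability(t, eta, beta):
    """Weibull CDF: F(t) = 1 - exp(-(t / eta)**beta)."""
    return 1.0 - np.exp(-(t / eta) ** beta)

eta, beta = 1.0e6, 0.8                  # illustrative POH scale; beta < 1
t = np.array([1.0, 2.0, 3.0]) * 8760.0  # 1st, 2nd, 3rd year of POH
print(unreliability(t, eta, beta))      # exact F(t)
print((t / eta) ** beta)                # power-law limit, valid for F < ~10%
print(unreliability(eta, eta, beta))    # 0.632: F(MTTF) per the definition
```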
Historical Reliability Model: Impact of Temperature
Numerous studies have been conducted on HDD failure rates at elevated temperature, with consistent reports of failure rates increasing exponentially with temperature.
Arrhenius equation: $F(T) = A\,e^{-E_a/RT}$, where Ea = activation energy, A = a constant, R = the gas constant, and T = temperature (kelvin).
Rule of thumb: the failure rate doubles for approximately every 15 °C increase in temperature.
Note: we exploit this acceleration by testing all products at the upper temperature spec limit.
[Figure: acceleration factor vs. ambient temperature (30-70 °C), showing the 2x-per-15 °C trend and published activation energies: Maxtor Ea = 51.9 kJ/mol, Microsoft Ea = 44.4 kJ/mol, WD Ea = 39.3 kJ/mol, Seagate Ea = 36.4 kJ/mol, Samsung Ea = 30.6 kJ/mol.]
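A sketch of the Arrhenius acceleration factor between two temperatures, using the eV form of the constant (the comparison temperatures are arbitrary; WD's quoted 39.3 kJ/mol is about 0.41 eV):

```python
import numpy as np

K_BOLTZMANN = 8.617e-5  # eV/K

def acceleration_factor(ea_ev, t_use_c, t_stress_c):
    """Arrhenius acceleration: AF = exp(Ea/k * (1/T_use - 1/T_stress))."""
    t_use, t_stress = t_use_c + 273.15, t_stress_c + 273.15
    return np.exp(ea_ev / K_BOLTZMANN * (1.0 / t_use - 1.0 / t_stress))

# Doubling over ~15 C checks out for Ea ~ 0.4 eV:
print(acceleration_factor(0.41, 30.0, 45.0))  # ~2x
```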
Historical Reliability Model: Duty Cycle
A duty-cycle term is typically added to the general reliability expression. It qualitatively accounts for the intuitive notion that HDD reliability should scale with how much the drive does, and it is usually defined as a power law.
Combining all three effects (POH, DC, and temperature) yields the classic reliability model:
$F(t,T) = \mathrm{POH}^{\beta} \times \mathrm{DC}^{\alpha} \times A\exp\!\left(-E_a/kT\right)$
(usage terms × thermal term)
This model has been used in the HDD industry for decades. How good is it at describing reality?
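A sketch of the classic model as reconstructed above; every constant here (A, α, β, Ea) is a placeholder, since the deck gives no fitted values:

```python
import numpy as np

K = 8.617e-5  # Boltzmann constant, eV/K

def classic_model(poh, duty_cycle, temp_c,
                  a=100.0, beta=0.8, alpha=1.0, ea_ev=0.4):
    """Historical model: F ~ A * POH**beta * DC**alpha * exp(-Ea / kT).
    All constants are illustrative placeholders, not WD parameters."""
    return (a * poh ** beta * duty_cycle ** alpha
            * np.exp(-ea_ev / (K * (temp_c + 273.15))))

# One year of POH at 95% duty cycle and 40 C
print(classic_model(8760.0, 0.95, 40.0))
```

Note that the model contains no workload term: a drive moving 55 TB/yr and one moving 800 TB/yr at the same duty cycle are scored identically, which is exactly the gap exposed by the DOE below.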
Failure Rate vs. Expanded Temperature Range
$F(T) \propto \exp\!\left(-E_a/kT\right)$, where Ea = activation energy, k = 8.62 × 10⁻⁵ eV/K, and T = absolute temperature (kelvin).
The failure-rate dependence at cold shows inverse-Arrhenius behavior:
magnetic challenges due to higher media coercivity,
stronger adhesive forces between heads and media/lube.
Failure rates are relatively independent of temperature between 15 and 40 °C.
[Figure: normalized failure rate vs. 1/T for drive basecast temperatures from −5 to +70 °C: conventional Arrhenius behavior (+0.4 eV) at the hot end, inverse-Arrhenius behavior (−0.9 eV) at the cold end.]
Impact of Workload
Is the concept of "duty cycle" valid? A DOE was run with the same drives, built at the same time: two tests with equivalent duty cycles (>95%) but differing workloads (1.5:1; 180 GB/hr "Tribo" vs. 120 GB/hr "Universal Enterprise" scripts).
The results clearly show that failure rates scale with workload, not duty cycle.
[Figure: standard (time-based) Weibull analysis (ReliaSoft Weibull++ 7): unreliability F(t) vs. RDT test time, cumulative failures vs. test time, and ln(cumulative failures) vs. ln(test time). Both populations show convex, decreasing-rate curvature (β1 = β2 = 0.8), but the time-to-failure plots are offset: MTTF1 ≠ MTTF2.]
In the classic model, $F(t,T) = \mathrm{POH}^{\beta} \times \mathrm{DC}^{\alpha} \times A\exp(-E_a/kT)$, nothing distinguishes the two cells: same drive, same duty cycle, two workloads = different MTTFs.
Conclusions: MTTF is typically used to specify the reliability of HDDs, but since MTTF is not uniquely defined, MTTF alone is an insufficient measure of drive reliability!
Historical Reliability Model: Duty Cycle
Functional duty cycle is a useless concept for predicting HDD failure rates: it is nebulous and ill-defined. Every data center claims to run at 100% duty cycle, yet actual usage varies dramatically between data centers and within a given data center, and no quantitative data are available.
Workload is a much better measure of drive usage:
defined as the total data transfer (sectors written + read);
tightly coupled to the dominant failure modes in test and field, which are caused by close proximity between heads and media;
quantifiable from internal drive logs, thereby facilitating accurate field AFR predictions.
All HDD manufacturers now explicitly specify workload ratings.
Reliability Landscape and Trends
Rough drive quality scale: introduced to allow comparison of basic quality requirements and to set a priority list for enablers. It reflects the intrinsic quality spec (MTTF, workload, temperature), normalized to desktop.
[Chart: relative quality requirement by segment]
XE: 15.3
RE (Nearline Enterprise): 8.6
SE (Scaled Enterprise): 2.7
Desktop; cold storage: 1 (reference)
Mobile: 0.6
Reliability Landscape and Trends
AFR = annualized failure rate.
Typical workloads for different market segments:
XE: 800 TB/yr
RE (Nearline Enterprise): 550 TB/yr
Cloud: 180 TB/yr
Desktop: 55 TB/yr
That is roughly a 10x increase in usage from a typical desktop drive to enterprise.
Test results are scaled to a field AFR by the workload ratio and a temperature derating (Ea = 0.4 eV):
$\mathrm{AFR} \approx (\text{test failure rate}) \times \dfrac{\text{field TB/yr}}{\text{test TB}} \times \exp\!\left[\dfrac{0.4\ \mathrm{eV}}{k}\left(\dfrac{1}{T_{\text{field}}} - \dfrac{1}{T_{\text{test}}}\right)\right]$
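Under one reading of the reconstructed scaling, with the temperature factor applied as a test-to-field derating at Ea = 0.4 eV (the test failure rate, test TB, and both temperatures below are made-up placeholders):

```python
import numpy as np

K = 8.617e-5  # eV/K

def field_afr(test_fail_rate, test_tb, field_tb_per_yr,
              t_field_c, t_test_c, ea_ev=0.4):
    """Derate a test failure rate to a field AFR: scale by the workload
    ratio and by an Arrhenius factor between field and test temperatures."""
    t_field, t_test = t_field_c + 273.15, t_test_c + 273.15
    temp_derate = np.exp(-ea_ev / K * (1.0 / t_field - 1.0 / t_test))
    return test_fail_rate * (field_tb_per_yr / test_tb) * temp_derate

# Illustrative: 1% test fallout over 180 TB at 60 C, fielded at 30 C
for name, wl in [("Desktop", 55), ("Cloud", 180), ("RE", 550), ("XE", 800)]:
    print(name, f"{field_afr(0.01, 180.0, wl, 30.0, 60.0):.2%}")
```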
The "Dynamic Fly Height" (DFH) Concept
Head-disk spacing is controlled by resistive heating of the pole-tip region of the head; thermal actuation is performed only during read and write operations. Head-disk interface failures are negligible when flying at 10 nm, so the goal is minimization of the time spent in close proximity.
Head-disk interactions, if they occur, are limited to times of DFH actuation:
contact frequency is limited,
contact area is decreased,
clearance sigma is much improved.
Since contact occurs only during DFH actuation, and DFH actuation occurs only during reads and writes, workload is critical to HDD reliability. Modification of our reliability prediction and testing methods is needed.
[Figure: DFH off vs. on: thermal expansion of the element reduces the nominal fly height from 10 nm to 1.5 nm over the disk.]
Validation of Workload Impact on HDD Reliability
Standard (time-based) Weibull treatment:
[Figure: unreliability F(t) vs. RDT test time (POH) in ReliaSoft Weibull++ 7 for the 180 GB/hr "Tribo" and 120 GB/hr "Universal Enterprise" scripts. Same drive, 100% DC, two workloads = different MTTFs (MTTF1 ≠ MTTF2).]
Workload-based treatment:
[Figure: the same data re-plotted as unreliability vs. terabytes transferred (WL × POH): the two populations collapse onto a single distribution. Same drive, 100% DC, two workloads = a unique MPbF (MPbF1 = MPbF2).]
AFR = F(TB): failure rates scale with the total TB transferred.
The results demonstrate that TB transferred, not time (POH), is the critical reliability parameter.
The natural reliability metric is Mean Petabytes to Failure (MPbF).
This naturally leads to a DWM (Drive Workload Monitor), much like an odometer.
Minimum requirement: simultaneously define a maximum workload spec and an MTTF. This is now done by all HDD manufacturers.
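A synthetic illustration of why the two populations collapse in the TB domain: if lifetime is governed by TB transferred, the time-domain medians separate by the workload ratio while the TB-domain distribution is shared. All parameters here are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, mpbf_tb = 0.8, 5000.0   # illustrative shape and characteristic TB

# Simulate failures whose lifetime is set by TB transferred, not by POH
tb_at_failure = mpbf_tb * rng.weibull(beta, size=2000)
for wl_gb_per_hr in (120.0, 180.0):
    hours = tb_at_failure / (wl_gb_per_hr / 1000.0)  # convert TB to POH
    print(f"{wl_gb_per_hr:.0f} GB/hr: median POH = {np.median(hours):.0f}")
# The time-domain medians differ by the 1.5:1 workload ratio;
# in TB transferred both populations share one distribution (same MPbF).
```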
Workload Specifications: An Example (Weibull Projections)
A workload Weibull analysis is performed on RDT data (1200 drives / 1000 hrs / high-workload script), and the cumulative failure rate is plotted against total TB transferred. A temperature derating from test to field is then applied; the resulting plot gives expected field reliability as a function of usage.
The data validate that this drive can meet a 1.2M-hr MTTF at a specified workload of 550 TB/yr. If the HDD is used at lower workloads, a commensurate increase in MTTF (decrease in failure rate) is realized.
[Figure: cumulative failures (%) vs. host TB transferred for Verdi C-RDT3 (Weibull-1P, MLE SRM K-M FM, F=15/S=1183), with projections marked at 3.0 PB (560 TB/yr × 5 yr, MTTF spec = 1.2M hrs) and at 120 TB/yr × 5 yr (MTTF = 3.0M hrs).]
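A back-of-envelope reproduction of the two MTTF figures, approximating MTTF as total POH divided by the fraction failed over the service life; the characteristic TB and β below were chosen to land on the slide's numbers and are not WD's fitted values:

```python
import numpy as np

def mttf_hours(wl_tb_per_yr, eta_tb, beta, years=5.0):
    """Approximate MTTF as (total POH) / (fraction failed) over the
    service life, with unreliability driven by TB transferred."""
    poh = years * 8760.0
    tb = wl_tb_per_yr * years
    frac_failed = 1.0 - np.exp(-(tb / eta_tb) ** beta)
    return poh / frac_failed

# eta_tb and beta are illustrative, tuned to match the slide's projections
print(f"{mttf_hours(550.0, 664_000.0, 0.6):.2e}")  # ~1.2e6 hr at spec WL
print(f"{mttf_hours(120.0, 664_000.0, 0.6):.2e}")  # ~3.0e6 hr at low WL
```

Because β < 1, the MTTF gain at lower workload is real but sub-proportional: a 4.6x drop in workload buys roughly a 2.5x MTTF increase.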
Reliability Test Enhancements (Last Few Years)
Problem statement: field failure rates scale with usage and workload, and RDT workloads can be insufficient to "see" the entire drive warranty period in many applications. This leaves WD and customers vulnerable to late-in-life field excursions. A need existed to enhance our reliability testing to ensure future products display the required field quality.
Testing options:
Perform RDT for longer times (brute force): the Extended Lifetime Test (19 weeks).
Or use high-workload test scripts: Tribo RDT (highest-workload script), Client (the new version has a 4x workload increase), Universal Enterprise (double the workload of its predecessor).
Use head parametric data to identify "virtual" failures (virtual failures = increased failure probability).
Are There Precursors to HDD Failure? "Virtual" Failures
Controlling the clearance distribution is critical for long-term field reliability: heads in the low-clearance wing are far more prone to failure, so a 3σ process must be maintained.
[Figure: clearance distribution (mean = 1.6 nm); the low-clearance tail in head-disk interaction with the disk is the degrading population.]
Contact between the head and disk degrades the magnetic and electrical properties of the head. Eventually the degradation becomes severe enough that the head can no longer function, and a failure (typically ECC) results. Teardowns confirm wear of the carbon on the pole tips due to head-disk contacts.
Monitoring degradation in the parametric data can be used to identify "virtual" failures.
[Figure: critical parameter A vs. RDT run hours, showing progressive degradation.]
Overall strategy: the TMR reader is the best sensor we have in the drive. Use it to characterize the head-media interface through degradation of magnetic "critical parameters"; we use about seven critical parameters for in-situ characterization during test.
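A minimal sketch of what a degradation trigger of this kind could look like; the readings, baseline window, and trigger limit are all invented for illustration (the deck does not describe the actual algorithm):

```python
import numpy as np

def virtual_failure(history, trigger_drop):
    """Flag a 'virtual failure' once a critical parameter has degraded
    by more than trigger_drop from its early-life baseline.
    A real implementation would de-noise and track several parameters."""
    baseline = np.median(history[:5])          # early-life baseline
    return (baseline - np.asarray(history)) > trigger_drop

# Hypothetical readings of "critical parameter A" (dB) vs. RDT hours
param_a = [20.1, 20.0, 20.2, 19.9, 20.1, 19.2, 18.0, 16.4, 15.0]
print(virtual_failure(param_a, trigger_drop=3.0))
# -> the last two points exceed the trigger: counted as a virtual failure
```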
Degradation Precedes the Vast Majority of Real Failures
Example: a drive triggered all three degradation limits (parameters A, B, and C) at 400-510 hrs in factory RDT, including a four-order-of-magnitude drop in BER and a 4.5 dB drop past the Tribo-RDT trigger limit. The drive then failed for head degradation at 933 hours.
Second-level / teardown FA: the scope signal on the failed head showed baseline popping; SEM on the failed head showed DLC wear (the other heads showed no wear).
"Live" Parametric Monitoring During ORT/RDT
A web-based tool is available to monitor, in real time, the progress of any drive running RDT/ORT in the factory. It gives temporal information on drive health that was previously unavailable and provides much clearer insight into the root cause leading to failure. It allows tracking of (a) individual drives and parameters, (b) an entire drive population, or (c) various drive configurations.
[Figure: representative results: critical-parameter degradation vs. RDT test time for an individual drive against its trigger level; degradation rate (%) vs. test time for an entire RDT population, broken out by parameter (A-E plus total); and a comparison of RDTs across build configurations (Head A + Media A, Head A + Media B, Head B + Media B).]
Product Improvements Driven by Parametric Measurements
Parametric data provide far more granularity on product quality than "real" drive failures. The quality team continually monitors virtual failures and uses this data to qualify any new component or design enhancement, and engineering uses it to drive improvements into each HDD design.
[Figure: successive reductions in parametric degradation driven by head design, media design, clearance, and a new reader design.]
Failure Probability Contour Plots
Field failure probability can be plotted as a function of both a critical parameter and the degradation in that parameter; both are important.
[Figure: failure-probability contour plot (1x, 5x, 10x contours) over critical parameter A and its degradation.]
For the vast majority of the population, degradation in critical parameter A is more important than the absolute value of A; this is not universally true for all parameters.
Degradation patterns are being used to perform virtual FA, since some degradation scenarios are more serious than others. The methodology is being extended for WD's "In-field Health Monitor".
Evolution of Parametric Monitoring
Decreasing head-media clearance means more head-disk contact events and therefore higher levels of magnetic critical-parameter degradation, the precursor to real failure of the drive.
Internal benchmarking: measurement of critical parametric data is now routinely performed in reliability testing (2+ years of experience).
The natural progression is to measure and store critical parametrics at a regular interval during field operation.
External dry run: partnership with enterprise customers on small fleet sizes.
Why Extend Parametric Measurements to the Field?
After nearly two decades, it is generally accepted that the SMART attributes are not strongly correlated with HDD failures: "...models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures" (Google: Pinheiro et al., Proc. 5th USENIX Conf. on File and Storage Technologies, February 2007).
The obvious question arises: can we use our parametric-degradation approach to improve the predictability of drive failures in the field?
In-field parametric monitoring (IFPM): WD is pursuing parametric data collection for data-center fleet management. The potential to identify poorly performing drives and heads before actual failure is the "holy grail" of HDD reliability. Drives with the capability to perform critical-parameter measurements (during drive idle time) have been developed and are just now becoming available; the health-monitor algorithms still need to be optimized (6-9 months).
WD's First Fleet-Management System: IFPM
The In-Field Parametric Monitoring (IFPM) system will measure and store critical parameters periodically throughout the HDD lifetime and output a relative health index derived from those parameters.
[Figure: drive count and real failure rate vs. IFPM health index, from low score (bad) to high score (good).]
A "gas gauge": failure probability increases as the health index gets worse.
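One way such a gas-gauge index could be composed from critical-parameter degradation; the parameter set, weights, and normalization are guesses for illustration, since WD's actual algorithm is not described here:

```python
import numpy as np

def health_index(measured, baseline, trigger, weights):
    """Map per-parameter degradation (as a fraction of its trigger limit)
    into a 0-100 'gas gauge'; 100 = no degradation, 0 = at/over limits.
    Weighted-average composition is an assumption, not WD's method."""
    deg = (np.asarray(baseline) - np.asarray(measured)) / np.asarray(trigger)
    deg = np.clip(deg, 0.0, 1.0)                 # ignore apparent improvements
    w = np.asarray(weights) / np.sum(weights)
    return 100.0 * (1.0 - float(np.dot(w, deg)))

# Hypothetical: reader amplitude, overwrite, error-rate margin (arb. units)
print(health_index(measured=[18.5, 29.0, 7.5],
                   baseline=[20.0, 30.0, 8.0],
                   trigger=[3.0, 4.5, 2.0],
                   weights=[2.0, 1.0, 1.0]))   # ~63: partially degraded
```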
WD's First Fleet-Management System: IFPM
[Figure: failure-probability contour plot (1x, 6x, 10x contours) over critical parameters A and B.]
Phase 1: drives equipped with initial parametric monitoring; a subset of parameters is available now; field data collection starts with partners.
Phase 2: the "complete" set of critical parameters becomes available shortly.
Phase 3: algorithm development; data collected from select customers; refinement of the failure-probability algorithms; definition of a "relative health index".
Phase 4: ability to check the relative health index of any drive on demand.
Since the health index will change with usage, IFPM will be coupled to the DWM (Drive Workload Monitor).
Meaningful Predictive Modeling
Powerful models that predict failures of individual drives need to make use of signals beyond SMART. One of the main motivators for SMART was to provide enough insight into disk-drive behavior to enable such models to be built. IFPM + the workload monitor + SMART may prove effective at overcoming our current inability to build predictive models.
Thank You
How Does Workload Enter into the Classical Analysis?
All time is not equivalent in driving HDD failures: the critical time is the time spent in close proximity to the disk, and significant levels of head-disk interaction are limited to times of DFH actuation. This relatively subtle realization has significantly changed the way in which HDD reliability is viewed.
Since actuation occurs only during reads and writes, actuation time is directly proportional to the total data transferred:
$\text{actuation time} \propto \mathrm{WL} \times \mathrm{POH} \propto \mathrm{TB\ transferred}$
so failure rates depend on the total data transferred:
$F(t) = F(\mathrm{WL} \times t) = F(\mathrm{TB})$
The overwhelming majority of failures occur when reading, writing, or transferring data: the failure probability for HDI issues drops to near zero when the pole tip is retracted, and failures of the motor, PCBA, etc., which depend on absolute POH, constitute only a small percentage of the total.
Virtual Failure Rates: Definition
Virtual failure rates (parametric degradation) are monitored in all WD reliability testing, with very clearly defined pass/fail criteria. Virtual failures are defined in terms of a number of critical parameters, identified from analysis of field-return data:
writer health;
reader health (magnetic and electrical measurements);
magnetic element spacing change;
physical head-disk spacing change;
bit error rate.
[Figure: cumulative distribution functions of overwrite (OW, dB) for failed vs. good heads in the return volume, with the VF trigger level marked; and relative failure probability vs. writer degradation, rising from 2x to 20x with increasing degradation.]
VF parameters = higher field failure probabilities
Field and test data were used to identify the parameters critical to HDD reliability.
[Figure: for each critical parameter (writer health / overwrite in dB, reader health / EM in dB, error rate, physical spacing / TD power (field − cal), and magnetic spacing / resolution in %), the field-return-volume distribution is shown with its VFR trigger level alongside the relative failure probability vs. degradation; failure probability rises steeply with increasing degradation, roughly 2x to 20x across the panels. For physical spacing, annotations attribute shifts in one direction to wear/Co smears and in the other to contamination.]