48
© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 1 Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering Western Digital, San Jose

Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 1

Hard Disk Drive - Reliability Overview

Dr. Amit Chattopadhyay

Sr. Engineering Manager, Recording Sub-Systems

Advanced Reliability Engineering

Western Digital, San Jose

Page 2: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 2

Outline

State of affairs in HDD Field Reliability 1

a

Macroscopic objects at molecular dimensions: Head-disk Interface b

Critical Parameters: Clearance and Workload c

HDD reliability modeling 2

Drive parametric measurements: Virtual Failures

a

Importance of workload and temperature b

c

3 Data Center Fleet Management: In-field Health Monitor

HDD complexity leads to a myriad of possible failure modes

Historical model: derivation and limitations

Page 3: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 3

Core problem in HDD reliability

1. Try to avoid something analogous to a ‘BP’ moment (Deepwater Horizon

oil spill of 2010) - Avoid an un-manageable catastrophe midway (or

beyond) through a product’s warranty life Referred to as ‘wear-out’ mode in reliability unbounded Failure Rate at large

times

2. For the drives out in the field, develop a predictive analysis for failures

QUESTIONS

How do we define a product’s “lifetime”?

How do we get visibility into what happens during this lifetime?

Can we help our customers perform effective Fleet Management in real

time?

Page 4: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 4

100 150 200 250 300 350 400 450 5000.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

5.0

5.5

mobile

desktop

Cle

ara

nce (

nm

)

Areal Density (Gb-in-2)Page 4

Current HDD Reliability Landscape

Clearance trend in recent WD products

Increases in areal density have largely come

by decreasing the head-disk spacing

The Magnetics vs Tribology dilemma

Current head-media spacings are at

molecular dimensions: 1 – 2nm

Operating at these clearances can impact

the HDD failure rates, and the failure pareto

As much as 70% of the failure pareto is

attributable to low-clearance operation

Operation at ever-decreasing spacings

requires special attention be paid to:

Anamolous physical phenomenon in ultrathin

films

Non-linear response functions, e.g. Adhesive

and frictional forces

Mechanical tolerances

etc

Tribology problems explode

Page 5: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 5

Head-Media Interface Failures: A Conceptual Model

The clearance between slider and disk is a dominant

factor dictating HDI reliability ( 70% of all failures).

Insufficient clearance results in an increased

probability of head-disk interaction and eventual

failure due to HFW, modulation, or head degradation

In order to ensure acceptable reliability in this spacing regime, one must

minimize the probability of contact……

devstd

clearancemeanZ

1 – 2 nm

Head-disk

Interaction 0.5nm

1.5nm

2.0nm

2.5nm

1.0nm

DISK

1.6nm

Increased failure

probability Reduce mean

clearance

= 2.1nm

0.5nm

1.5nm

2.0nm

2.5nm

1.0nm

Clearance Distribution

Head-disk

Interaction

DISK

Model

Page 6: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 6

Influence of Clearance Mean / Sigma

2.4 2.0 1.6 1.2 0.8

0

1

2

3

Cle

ara

nce-r

ela

ted

AF

R (

%)

Clearance (nm)

= 0.5nm

= 0.35nm

= 0.7nm

HDDs require control of macro-scale mechanics to atomic dimensions!

Toy model: Failure rates as a function of both the mean clearance and the

required standard deviation in clearance are shown below

3. constZ

Design Criteria

Page 7: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 7

= 0.35nm

Influence of Clearance Mean / Sigma

2.4 2.0 1.6 1.2 0.8

0

1

2

3

Cle

ara

nce-r

ela

ted

AF

R (

%)

Clearance (nm)

= 0.5nm

= 0.7nm

HDDs require control of macro-scale mechanics to atomic dimensions!

Toy model: Failure rates as a function of both the mean clearance and the

required standard deviation in clearance are shown below

WD produces 200 million drives per year, with each having multiple heads……..

…..something on the order of 1 billion head-disk interfaces / year

The Std dev in clearance for these interfaces must all be held to within the Van

der Waals diameter of a single carbon atom = 0.34nm

3. constZ

Design Criteria

Page 8: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 8

Slider

HDI: Head-Disk Interface a major challenge area

HDD are complex devices that push the envelope

of our physics understanding in a number of areas

A monolayer of lubricant is used on disk surface to

reduce adhesion / friction between slider and disk

Lubricant pick-up by the slider is normal

Pole Tip

Disk

Lube chains

(0.8nm x 80nm) “end-groups”

Page 9: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 9

Slider

Head-Disk Interface:

HDD are complex devices that push the envelope

of our physics understanding in a number of areas

A monolayer of lubricant is used on disk surface to

reduce adhesion / friction between slider and disk

Lubricant pick-up by the slider is normal, but

lubricant pick-up in excess of one layer can pose

a serious reliability risk

HFW, OTW, modulation, and head degradation

Optical image Chemical image

Red = excess lube

Page 10: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 10

Page 10

Slider

Interface: at the edge of classical thermodynamics

HDD are complex devices that push the envelope

of our physics understanding in a number of areas

A monolayer of lubricant is used on disk surface to

reduce adhesion / friction between slider and disk

Lubricant pick-up by the slider is normal, but

lubricant pick-up in excess of one layer can pose a

serious reliability risk

HFW, OTW, modulation, and head degradation

Monomolecular films can show anamolous behavior

Fractal Formation

Terraced Spreading

4nm

height

Film

Thic

kn

ess (

A) 5th monolayer 3rd monolayer

Stable

bilayer

Stable

bilayer

1st Stable monolayer

Film

Th

ick

nes

s (

A)

1 monolayer

3 monolayers

Lube Fractals Terraced Spreading

Non-classical thermodynamics: physical properties are dependent on the amount of

material present

Optical image Chemical image

Red = excess lube

Page 11: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

®

Protective diamond-like carbon (DLC) overcoats are used on both the head and disk surfaces Constant pressure to thin these overcoats for the purpose

of decreasing the magnetic separation distance …..and thereby improving magnetic signal / areal density

Current DLC thicknesses on the order of 2nm

As these thicknesses are reduced, various reliability risks can arise

Disk example: migration of magnetic material to head-disk interface Corrective action required modification to carbon

deposition energetics / process

Head example: contact stress can result in wear of the pole tip Corrective action required stricter process control

measures

Process optimization of sp3 / sp2

…..with some Chemistry and Materials Science

Magnetics

Carbon

Co+2

Co+2

Co+2

Co+2 Co+2

Co+2 Co+2 Co+2

Co+2

Co+2

Co+2

N+ N+ N+ N+ Co Co

Co+2

Co+2

Co smear

Carbon wear

2nm

Page 12: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 12

Maintaining sufficient clearance, and hence ensuring interface reliability is a multi-dimensional problem.

HDD reliability is certainly not boring

New issues seem to accompany development of each new product generation

Since HDD complexity complicates reliability modeling, must focus on most likely failure modes

Any simple reliability model is bound to have some weaknesses given this complexity

HDI failure rates: critical component attributes

Contaminants

Pad Size DFH Bulge

Size / Roughness

HDI Failure Rate

Media

Overcoat

Material

composition

Deposition

technique

Active sites

Microwaviness

hardness

Polarity

Thickness

D:G Ratio

Density

Surface

Roughness

Lubricant

Molecular Weight

PFPE Backbone

C1/C2/C3

Surface mobility

Bonding Ratio

Bonding kinetics

End Group Structure

Humidity sensitivity

Polarity

Additives

Concentration type

Film Thickness

Surface Energy

Disjoining Pressure

Polydispersity

Operating

Environment

Drive Parameters

Filter type

/ location

Humidity Temperature

Rotational Velocity

Defect Mapping

Channel

Seek Profile

Head Design

Positive

Pressure

Fly Height

Profile

Min.FH Pole tip

protrusion

ABS

Overcoat

composition

Overcoat

Thickness

stiffness

Roll Pitch

Particulate loading Pressure/

altitude

Page 13: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 13

Time to Failure: The “Bathtub” Reliability Model

Classical Reliability Model

“The Bathtub Curve"

Time

Fa

ilure

Pro

ba

bili

ty

Infant mortality region Failure rate decreases with increasing time

Result of defects etiher designed into, or inadvertently built into a product

Indicative of quality “escapes”

Marginal materials

Drives with the least margin for some critical design tolerance.

Manufacturing anomaly

Steady State region After the weak drives are removed from the

population, the failure rate reaches a fixed value

for the service life of the drive

Wear-out Region At long times, one enters the wear-out region

where normal wear and tear of the system

components results in an increasing failure rate

with time

Page 14: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 14

Time to Failure: The “Bathtub” Reliability Model

Classical Reliability Model

“The Bathtub Curve"

Time

Fa

ilure

Pro

ba

bili

ty

Infant mortality region Failure rate decreases with increasing time

Result of defects etiher designed into, or inadvertently built into a product

Indicative of quality “escapes”

Marginal materials

Drives with the least margin for some critical design tolerance.

Manufacturing anomaly

Steady State region After the weak drives are removed from the

population, the failure rate reaches a fixed value

for the service life of the drive

Wear-out Region At long times, one enters the wear-out region

where normal wear and tear of the system

components results in an increasing failure rate

with time

Page 15: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 15

Time to Failure: The “Bathtub” Reliability Model

Classical Reliability Model

“The Bathtub Curve"

Time

Fa

ilure

Pro

ba

bili

ty

Infant mortality region Failure rate decreases with increasing time

Result of defects etiher designed into, or inadvertently built into a product

Indicative of quality “escapes”

Marginal materials

Drives with the least margin for some critical design tolerance.

Manufacturing anomaly

Steady State region After the weak drives are removed from the

population, the failure rate reaches a fixed value

for the service life of the drive

Wear-out Region At long times, one enters the wear-out region

where normal wear and tear of the system

components results in an increasing failure rate

with time

Page 16: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 16

Time to Failure: The “Bathtub” Reliability Model

Classical Reliability Model

“The Bathtub Curve"

Time

Fa

ilure

Pro

ba

bili

ty

Infant mortality region Failure rate decreases with increasing time

Result of defects either designed into, or inadvertently built into a product

Indicative of quality “escapes”

Marginal materials

Drives with the least margin for some critical design tolerance.

Steady State region After the weak drives are removed from the

population, the failure rate reaches a fixed value

for the service life of the drive

Wear-out Region At long times, normal wear and tear of the

system components results in an increasing

failure rate with time

This type of behavior results in costly

excursions to both WD and our customers

This regime must be avoided at all costs

Page 17: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 17

Standard Weibull Treatment Gives time dependence of unreliability, F(t)

Where: t = time

= time constant

= shape parameter

In the limit of small failure rates (<10%) the failure rate vs time simplifies to a

power function of time:

Classical reliability theory – time dependence

POH

ttF

)(

ttF exp1)(

Field Reliability vs Time

Since failure rate assumed to depend on time

alone, MTTF has been used to quantify the

intrinsic reliability of HDD’s

MTTF defined by the time required for 63.2%

of all drives in a given population to fail

But…..use of MTTF alone is meaningless

Reason behind the workload specs

introduced by all HDD manufacturers

POH = power-on hrs

1st year

2nd year

3rd year

β < 1: Failure rate decreases with time

Page 18: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 18

Numerous studies conducted on HDD failure rates at elevated temperature

Consistent reports of HDD failure rates increasing exponentially with increasing

temperature

Arrhenius equation:

Ea = activation energy

A, R = constants

T = Temperature (kelvin)

Rule of thumb:

Failure rate doubles for approximately 15 C increase in temp

Note: We make use of this acceleration by testing all products at the upper

temp spec limit

RT

Ea

AeTF )(

Historical reliability model – impact of temp

30 35 40 45 50 55 60 65 70

0

2

4

6

8

10

Accele

rati

on

Facto

r

Ambient Temperature

Maxtor:

Ea = 51.9 kJ mole-1

Seagate:

Ea = 36.4 kJ mole-1

WD: Ea = 39.3 kJ mole-1

Samsung:

Ea = 30.6 kJ mole-1

Microsoft:

Ea = 44.4 kJ mole-1

2X

15C

Page 19: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 19

A duty cycle term is typically added to general reliability expression

Qualitatively accounts for the intuitive notion that HDD reliability should scale with

how much the drive does

Usually defined as a power law

Combination of all three effects, i.e. POH, DC and Temp, yields the classic

reliability model

This model has been used in the HDD industry for decades

How good is this at describing reality?

Historical reliability model – duty cycle

Usage Term Thermal Term

kT

ETtF Aexp x POH DC),( x

Page 20: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 20 20

Failure Rate vs Expanded Temp Range

Failure rate dependence at cold shows inverse Arrhenius behavior

Magnetic challenges due to higher media coercivity

Stronger adhesive forces between heads and media/lube

Failure rates are relatively independent of temp between 15 – 40C

kT

ETF aexp)(

0.1

1.0

10.0

0.00290.0030.00310.00320.00330.00340.00350.00360.00370.0038

No

rmal

ize

d F

ailu

re R

ate

1/T[K]

-5C +5C +15C +30C +40C +50C +60C +70C

Drive Basecast Temperature [C]

+0.4 eV

-0.9 eV

Inverse Arrhenius

Conventional Behavior

Ea = activation energy

k = 8.62 x 10-5 eV/K

T = absolute temperature (Kelvin)

Page 21: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 21

ReliaSoft Weibull++ 7 - www.ReliaSoft.com

Impact of Workload

Folio2\Data 2:

Folio2\Data 1:

RDT Test Time (hrs)

Un

re

lia

bilit

y, F

(t

)

20 1000100

0.500

1.000

5.000

10.000

20.000

0.300

Univ. Ent.

Tribo

180 GB/hr

120 GB/hr

Ln

(cu

mu

lati

ve f

ailu

res)

Ln (test time)

β1 = 0.8

β2 = 0.8

Is the concept of “Duty Cycle” valid?

DOE with same drives built at the same time

Two tests with equivalent duty cycles (>95%)

……..but differing workloads (1.5:1)

Results clearly show that failure rates

scale with workload…..not duty cycle

Standard (time-based) Weibull Analysis

Duty Cycle

0 200 400 600 800 1000

Cu

mu

lati

ve

Fail

ure

s

RDT Test Time (hrs)

121 GB/hr

182 GB/hr

Time to Failure Plot

Convex curvature

(β < 1)

kT

ETtF Aexp x POH DC),( x

Usage Thermal Term

MTTF1

MTTF2

Same drive / same DC / two workloads = different MTTFs

Conclusions:

MTTF typically used to specify reliability of HDDs

Since MTTF is not uniquely defined……

MTTF alone is an insufficient measure of

drive reliability!

Page 22: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 22

Functional duty cycle is a useless concept for predicting HDD failure rates Nebulous / ill-defined

Every data center claims they run at 100% duty cycle, however, actually “usage” varies dramatically between data centers and within a given data center

No quantitative data available

Workload is a much better measure for drive usage

Defined as the total data transfer (sectors written + read)

Tightly coupled to the dominate failure modes in test and field that are caused by

close proximity between heads and media.

Quantifiable from internal drive logs, thereby facilitating accurate field AFR predictions

All HDD manufacturers now explicitly specify workload ratings

Historical Reliability Model – Duty Cycle

Usage Term Thermal Term

kT

ETtF Aexp x POH DC),( x

Page 23: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 23 23

Rough Drive Quality Scale

Introduced to allow

comparison of basic quality

requirements

Priority list for enablers

Reflective of

Intrinsic quality spec:

MTTF

Workload

Temperature

Normalized to Desktop

Reliability Landscape and Trends

8.6 RE (Nearline Enterprise)

2.7 SE (Scaled Enterprise)

15.3 XE

1 == Desktop; Cold Storage

0.6 Mobile 0

5

10

15

Page 24: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 24 24

AFR = Annualized Failure Rate

Typical Workloads for different market segments –

XE (800 TB/yr)

RE: Nearline Enterprise (550 TB/yr)

Cloud (180 TB/yr)

Desktop usage (55 TB/yr)

10x increase in usage from typical DT drive to Enterprise

Reliability Landscape and Trends

kT

eV

TBtest

yrTBfieldratefailuretestAFR

4.0exp

/6.0

Page 25: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 25

“Dynamic Fly Height” (DFH) concept

Head-disk spacing is controlled by resistive heating of the pole tip region of the head

Thermal actuation is only performed during read and write operations

Head-disk interface failures when flying at 10nm

Minimization of time spent in close proximity

Thermal

Expansion

Element spacing

Nominal FH

DFH Off DFH On

1.5nm

10nm

Disk

Page 26: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 26

“Dynamic Fly Height” (DFH) concept

Head-disk spacing is controlled by resistive heating of the pole tip region of the head

Thermal actuation is only performed during read and write operations

Head-disk interface failures when flying at 10nm

Minimization of time spent in close proximity

Thermal

Expansion

Element spacing

Nominal FH

DFH Off DFH On

1.5nm

10nm

Disk

Page 27: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 27

“Dynamic Fly Height” (DFH) concept

Head-disk spacing is controlled by resistive heating of the pole tip region of the head

Thermal actuation is only performed during read and write operations

Head-disk interface failures when flying at 10nm

Head – disk interactions, if they occur, will be limited to times of DFH actuation

Limits contact frequency

Contact area decreased

Much improved clearance sigma

Since contact only occurs during DFH actuation, and since DFH actuation only occurs during reads / writes, then workload will be critical to HDD reliability

Modification of our reliability prediction / testing methods needed

Minimization of time spent in close proximity

Thermal

Expansion

Element spacing

Nominal FH

DFH Off DFH On

1.5nm

10nm

Disk

Page 28: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 28

Validation of Workload Impact on HDD reliability

Weibull Analysis

Standard (time-based) Treatment

Same drive + 100% DC / two workloads = different MTTFs

ReliaSoft Weibull++ 7 - www.ReliaSoft.com

Impact of Workload

Folio2\Data 2:

Folio2\Data 1:

RDT Test Time (hrs)

Un

re

lia

bilit

y, F

(t

)

20 1000100

0.500

1.000

5.000

10.000

20.000

0.300

Univ. Ent.

Tribo

180 GB/hr

120 GB/hr

Fa

ilu

re R

ate

(%

)

POH

MTTF1

MTTF2

120 GB/hr

ReliaSoft Weibull++ 7 - www.ReliaSoft.com

Sirius: Universal Server vs Tribo RDT

Folio2\Data 4:

Folio2\Data 3:

Terabytes Transferred

Un

re

lia

bilit

y, F

(t

)

1.0 1000.010.0 100.00.100

0.500

1.000

5.000

10.000

50.000

0.100

(WL x t)

180 GB/hr

120 GB/hr

MPbF1 = MPbF2

Fa

ilu

re R

ate

(%

)

Same drive +100% DC / two workloads = unique MPbF

Workload-based Treatment

TB Transferred (WL x POH)

)( TBAFR Failure rates scale with the total TB transferred

Results demonstrate that TB transferred is the critical reliability parameter….not time POH

Natural reliability metric: Mean Petabytes to Failure (MPbF)

This naturally leads to a DWM (Drive Workload Monitor) (like an odometer)

Minimum requirement: Simultaneously define max workload spec and MTTF

This is now done by all HDD manufacturers

Page 29: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 29 29

A workload Weibull analysis is performed on RDT data (1200 drives / 1000hrs

/ high workload script)

Cumulative failure rate vs total TB transferred is plotted

Weibull Projections

Workload Specifications: An example

Verdi C-RDT3

Host TB Transferred

Cu

mu

lat

ive

Fa

ilu

re

s [

%]

10 10000100 10000.1

0.5

1.0

5.0

10.0

20.0

0.1

Data 1Weibull-1PMLE SRM K-M FMF=15/S=1183

Data PointsProbability Line

Page 30: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 30 30

Workload Specifications: An example Weibull Projections

A workload Weibull analysis is performed on RDT data (1200 drives / 1000hrs

/ high workload script)

Cumulative failure rate vs total TB transferred is plotted Verdi C-RDT3

Host TB Transferred

Cu

mu

lat

ive

Fa

ilu

re

s [

%]

10 10000100 10000.1

0.5

1.0

5.0

10.0

20.0

0.1

Data 1Weibull-1PMLE SRM K-M FMF=15/S=1183

Data PointsProbability Line

Temperature derating from test

field is then applied

Resulting plot gives expected field

reliability as a function of usage

Page 31: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 31 31

Workload Specifications: An example Weibull Projections

A workload Weibull analysis is performed on RDT data (1200 drives / 1000hrs

/ high workload script)

Cumulative failure rate vs total TB transferred is plotted

Temperature derating from test

field is then applied

Resulting plot gives expected field

reliability as a function of usage

Data validates that this drive can

meet a 1.2M hr MTTF at a

specified workload of 550TB/yr

Verdi C-RDT3

Host TB Transferred

Cu

mu

lat

ive

Fa

ilu

re

s [

%]

10 10000100 10000.1

0.5

1.0

5.0

10.0

20.0

0.1

Data 1Weibull-1PMLE SRM K-M FMF=15/S=1183

Data PointsProbability Line

3.0 PB

560TB/yr x 5yr

MTTF spec = 1.2M hrs

Page 32: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 32 32

Workload Specifications: An example Weibull Projections

A workload Weibull analysis is performed on RDT data (1200 drives / 1000hrs

/ high workload script)

Cumulative failure rate vs total TB transferred is plotted

Temperature derating from test

field is then applied

Resulting plot gives expected field

reliability as a function of usage

Data validates that this drive can

meet a 1.2M hr MTTF at a

specified workload of 550TB/yr

If HDD is used at lower workloads,

a commensurate increase in

MTTF (decrease in Failure Rate)

will be realized

Verdi C-RDT3

Host TB Transferred

Cu

mu

lat

ive

Fa

ilu

re

s [

%]

10 10000100 10000.1

0.5

1.0

5.0

10.0

20.0

0.1

Data 1Weibull-1PMLE SRM K-M FMF=15/S=1183

Data PointsProbability Line

3.0 PB

560TB/yr x 5yr

MTTF spec = 1.2M hrs

120TB/yr x 5yr

MTTF = 3.0M hrs

Page 33: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 33

Reliability Test Enhancements (last few years)

Field failure rates scale with usage

/ workload

RDT workloads can be insufficient

to “see” the entire drive warranty

period in many applications

Leaves WD and customers

vulnerable to late-in-life field

excursions

Need existed to enhance our

reliability testing to ensure future

products display the required field

quality?

Problem Statement

RDT testing must be performed for

longer times (brute force),

Extended Lifetime Test (19 week)

or …….

High workload test scripts

Tribo RDT (highest WL script)

Client (new version has 4X WL

increase)

Universal Enterprise (double

workload of predecessor)

Use of head parametric data to

identify “virtual” failures

Virtual failures = increased failure

probability

Testing Options

Page 34: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 34

Are there precursors to HDD failure?

Controlling the clearance

distribution is critical for long-

term field reliability

Those heads in the low-

clearance wing are far more

prone to failure

Must maintain 3 process

Degradation gets severe

enough that the head can no

longer function.

Failure (typically ECC) then

results

Tear down confirms wear of

carbon on pole tips due to

occurrence of head-disk

contacts

Head-disk

Interaction 0.5nm

1.5nm

2.0nm

2.5nm

1.0nm

DISK

=1.6nm

Degrading population

Page 35: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 35

“Virtual” Failures

Page 35

Controlling the clearance

distribution is critical for long-

term field reliability

Those heads in the low-

clearance wing are far more

prone to failure

Must maintain 3 process

Degradation gets severe

enough that the head can no

longer function.

Failure (typcially ECC) then

results

Tear down confirms wear of

carbon on pole tips due to

occurrence of head-disk

contacts

Head-disk

Interaction 0.5nm

1.5nm

2.0nm

2.5nm

1.0nm

DISK

=1.6nm

Degrading population

Contact between the head and

disk leads to degradation of

the magnetic / electrical

propeties of the head

Monitoring degradation in the

parametric data can be used

to identify “virtual” failures

Cri

tical P

ara

mete

r A

0

1

2

3

4

5

RDT Run Hours

Overall Strategy: The TMR reader is the best sensor we have in the drive

Use it to calibrate the head-media interface through magnetic ‘critical parameter’

degradation

We use about 7 critical parameters to perform in-situ characterization during test

Degradation vs time

Page 36: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 36

Degradation precedes vast majority of real failures

Drive triggered all three degradation limits at 400-500 hrs in factory RDT

Drive then failed for head-degradation at 933 hours

2nd Level / Teardown FA

Scope signal on failed head showed base line

popping

SEM on failed head showed DLC wear (others

showed no wear).

Parameter C Parameter A Parameter B

Tribo-RDT trigger limit

4 order drop in BER 4.5 dB drop

510hrs

460hrs 400hrs

Page 37: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 37

“Live” Parametric Monitoring During ORT/RDT

A web-based tool is available to monitor the progress of any drive

running RDT/ORT in the factory in real time

Gives temporal information on drive health that was previously unavailable

Provides much clearer insight into nature of root cause leading to failure

Allows tracking of: a) individual drives / parameters, b) entire drive population, or c) various

drive configurations.

Representative Results

Comparison of RDTs

Head A + Media A

Head A + Media B

Head B + Media B

RDT Test Time (days)

Individual Drive (any parameter)

Trigger Level

Entire RDT Population

Total

A

B

C

E

D De

gra

dati

on

Ra

te (

%)

RDT Test Time (days) RDT Test Time (hrs)

Cri

tic

al P

ara

me

ter

De

gra

dati

on

Page 38: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 38

Improvement driven by

Parametric

measurement

•Head design

•Media design

•Clearance

• Head design

(new reader)

Product Improvements Driven by Parametric Measurements

Parametric data provides far more granularity on product quality compared

to “real” drive failures

Quality team continually monitors Virtual Failures and uses this data to

qualify any new component and / or design enhancement

Engineering uses data to drive improvements into each HDD design

Page 39: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 39

Failure probability contour plots

Field failure probability can be plotted as a function of both the critical parameter and the

degradation in the critical parameter.

Both parameters are important

Failure Probability Contour Plot

1x

10x

5x

For the vast majority of the

population, degradation in critical

parameter A is more important

than the absolute value of A

This is not universally true for all

parameters

Degradation patterns are being used

to perform virtual FA

Some degradation scenarios are

more serious than others

Methodology is being extended for

WD’s “In-field Health Monitor”

Page 40: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 40

Evolution of Parametric Monitoring

Decreasing head-media clearance

More Head-Disk contact events

Higher levels of magnetic critical parameter degradation

Precursor to real failure of the drive

Internal Benchmarking: Measurement of critical parametric data now

routinely performed in Reliability testing (2+ years experience)

Natural progression is to measure, and store critical parametrics at a regular interval during field operation

External dry-run: Partnership with Enterprise customers on small fleet sizes.

Page 41: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 41

After nearly two decades, it is generally accepted that the SMART attributes

are not strongly correlated to HDD failures

“....models based on SMART parameters alone are unlikely to be useful for

predicting individual drive failures”.

(Google: Pinheiro et al., Proc. 5th USENIX Conf. on File and Storage Technologies, February 2007)

The obvious question arises…..“Can we use our parametric degradation

approach to improve predictability of drive failures in the field”?

In-field parametric monitoring (IFPM)

WD is pursuing parametric data collection for data center fleet management

Potential to identify poorly performing drives / heads before actual failure is the

“Holy Grail” of HDD reliability

Drives with the capability to perform critical parameter measurements (during

drive idle time) have been developed and are just now becoming available

Health monitor algorithms need to be optimized (6 – 9 months)

Why Extend parametric measurements to the field?

Page 42: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 42 42

The In-Field Parametric Monitoring (IFPM) system will measure, and store, critical

parameters periodically throughout the HDD lifetime, and output a relative health

index.

IFPM Health Index (derived from critical parameters)

Dri

ve C

ou

nt

Real

Fail

ure

Rate

Low Score (Bad) High Score (good)

A ‘Gas Gauge’ - Failure Probability Increases as ‘Health Index’ gets worse

WD’s 1st fleet management system: IFPM

Page 43: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 43

WD’s 1st fleet management system: IFPM

WD Confidential

25 30 35 40

1x

10x

6x

Failure Probability Contour Plot

Cri

tic

al P

ara

mete

r B

Cri

tic

al P

ara

mete

r B

Critical Parameter A

Phase 1 – Drives equipped with

initial parametric monitoring

Subset of parameters available now

Start field data collection with

partners

Phase 2 – “Complete” set of critical

parameters will become available

shortly

Phase 3 – Algorithm development

Data collected from select customers

Refinement of failure probability

algorithms

Define a “relative health index”

Phase 4 – Ability to check relative

health index of any drive on demand

Since the health index with change with usage, IFPM will be

coupled to the DWM (Drive Workload Monitor)

Page 44: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 44

Meaningful Predictive Modeling

Powerful models that predict failures of individual drives need to

make use of signals beyond SMART

One of the main motivators for SMART was to provide enough

insight into disk drive behavior to enable such models to be built

IFPM + workload monitor + SMART may prove effective at

overcoming our current inability to build predictive models.

Page 45: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 45

Thank You

Page 46: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 46

All time is not equivalent in driving HDD failures

The critical time for is the time spent in close proximity to the disk

Significant levels of head-disk interactions are limited to times of DFH actuation

This relatively subtle modification / realization has significantly changed the way in

which HDD reliability is viewed

Since actuation only occurs during reads and writes….actuation time is directly

proportion to the total data transferred

Failure rates will be dependent on the total data transferred

Overwhelming majority of failures occur when reading / writing / transferring data

Failure probability for HDI issues drops to near zero when pole tip is retracted

Failure probability for motor, PCBA, etc issues that depend on absolute POH time

constitute only small percentage of total failures

)()( TBtimeWLtF

How does workload enter into the classical analysis?

) POH ( TBWLtimeActuation

Page 47: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 47 47

Virtual failure rates (parametric degradation) are monitored in all WD reliability testing

Very clearly defined pass / fail criteria

Virtual failures defined in terms a number of critical parameters

Critical parameters identified from analysis of field return data

Writer health

Reader health (magnetic and electrical measurements)

Magnetic element spacing change

Physical head-disk spacing change

Bit error rate

Virtual Failure Rates - Definition

Cumulative Distribution Functions Failure Probability vs Degradation

Failed Heads

Good Heads

OW (dB)

Pe

rce

nt

of

Vo

lum

e

Retu

rn V

olu

me

Rela

tive F

ailu

re P

rob

ab

ility

Overwrite (dB)

Increasing

Degradation

Failu

re P

rob

ab

ility2x

4x

6x

8x

10x

20x

18x

16x

14x

12x

Failu

re P

rob

ab

ility

OW Trigger LevelVF

Writer Degradation Writer Degradation

Page 48: Hard Disk Drive - Reliability Overview...Hard Disk Drive - Reliability Overview Dr. Amit Chattopadhyay Sr. Engineering Manager, Recording Sub-Systems Advanced Reliability Engineering

© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED 48 48

VF parameters = higher field failure probabilities

Cumulative Distribution Functions Failure Probability vs Degradation

Failed Heads

Good Heads

OW (dB)

Pe

rce

nt

of

Vo

lum

e

Re

turn

Vo

lum

e

Re

lativ

e F

ailu

re P

rob

ab

ility

Overwrite (dB)

Increasing

Degradation

Failu

re P

rob

ab

ility2x

4x

6x

8x

10x

20x

18x

16x

14x

12x

Failu

re P

rob

ab

ility

Writer Health

Writer Health

VFR Trigger

Field / test data used to identify parameters critical to HDD reliability

1x

2x

3x

4x

5x

6x

7x

8x

9x

10x

EM (dB)

1x

2x

3x

4x

5x

6x

7x

8x

9x

10x

EM (dB)

Retu

rn V

olu

me

Re

lativ

e F

ailu

re P

rob

ab

ility

VFR Trigger

VFR Trigger

Fa

ilure

Pro

ba

bility

Retu

rn V

olu

me

Error Rate

Error Rate

1x

2x

3x

4x

5x

TD Power (Field – Cal)

Re

lativ

e F

ailu

re P

rob

ab

ility

Re

turn

Vo

lum

e

5Q6

WearCo Smears

Physical Spacing

10x

4x

2x

Resolution (%)

Re

lativ

e F

ailu

re P

rob

ab

ility

Retu

rn V

olu

me

6x

8x

Re

lativ

e F

ailu

re P

rob

ab

ility

2x

4x

6x

8x

10x

Resolution (%)

Mag Spacing

VFR Trigger

Mag Spacing Physical Spacing

wear

Increase Decrease

contamination

Increase

Increasing

Degradation