
Page 1: NAND Flash and Solid-State Drive Reliability

NAND Flash and Solid-State Drive Reliability

Al Fazio
Intel Fellow

Director, Memory Technology Development
Intel Corporation

September 17th, 2008

Page 2: Flash: A License to Disrupt

Flash: A License to Disrupt

Flash has already disrupted 35mm film, floppy drives, audio tape…
– Flash use in consumer electronics characterized by:
– Large block files (.jpg, mp3…)
– # writes determined by human interaction (e.g., photos taken)

To disrupt HDD, flash must accommodate PC characteristics:
– Small random writes, # writes determined by OS
– Add to this: …


Flash requires high fields to overcome energy barriers for non-volatility
Flash reliability dominated by oxide degradation; a result of program/erase cycling

Page 3: Charge Storage: Program and Erase

Charge Storage: Program and Erase

[Figure: cell bias diagrams — ~20V gate bias with 0V on the N+ junctions for program; approximately -20V for erase]

Programming (NAND) means injecting electrons into the FG
– Fowler-Nordheim tunneling
Erase: Fowler-Nordheim tunneling in the reverse direction

[Figure: Vt distributions — 2 levels => 1 bit/cell (states "1", "0"); 4 levels => 2 bits/cell (states "11", "10", "01", "00")]
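As an illustration of the "4 levels => 2 bits/cell" idea above, the sketch below (Python) resolves a sensed threshold voltage against three read references. The reference voltages and the level-to-bits assignment here are made-up examples, not an actual device's mapping.

# Illustrative only: resolve a sensed cell Vt into a 2-bit value using three
# hypothetical read reference voltages (R1 < R2 < R3). Real NAND uses
# vendor-specific references and page-level bit mappings.
READ_REFS = (1.0, 2.5, 4.0)                 # hypothetical R1, R2, R3 (volts)
LEVEL_TO_BITS = ("11", "10", "01", "00")    # example assignment for L0..L3

def sense_cell(vt: float) -> str:
    """Return the 2-bit pattern stored in a cell with threshold voltage vt."""
    level = sum(vt > ref for ref in READ_REFS)   # 0..3, one compare per reference
    return LEVEL_TO_BITS[level]

print(sense_cell(0.2))   # below R1 -> L0 -> "11" (erased state)
print(sense_cell(3.1))   # between R2 and R3 -> L2 -> "01"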

Page 4: Reliability and Oxide Traps

Reliability and Oxide Traps

Normally, F-N tunneling occurs only during accelerated stresses done by engineers trying to study oxide degradation…
– In Flash memories it is the basis for device operation itself
This fact has two fundamental implications:
– Flash reliability is dominated by oxide-degradation effects, notably trap buildup in the tunnel oxide, which occurs as a result of program/erase cycling
– More than any other IC technology, developing a Flash technology centers on obtaining acceptable reliability

Over time, charges can detrap
– This effect will cause VT to shift and can cause data loss

[Figures: energy-band diagram (energy vs. distance) showing trapped charge Q' between channel and FG; cell cross-section with top gate, FG, and N+ junctions]

Page 5: Bit Errors: Overview

Bit Errors: Overview

[Figure: Vt distributions for levels L0–L3 after write vs. after time, with Vpass Read marked]

At any instant, some fraction of bits are in the wrong data state, typically 1E-9 to 1E-6, called the "raw bit error rate" or RBER
These failing bits develop with use:
– During write, some bits program when they shouldn't, or program higher than they should
– Cells shift in VT over time, because of simply time ("data retention") or of repetitive read operations ("read disturb")
– Both kinds increase with more program/erase cycles
– Several mechanisms cause bit errors, each with its own dependence on cycles, time, temperature, etc.
• This complexity means that RBER is a number, but not like pi:
– like temperature: a number only for a specific set of conditions, location, and instant
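Because RBER only has meaning for a specific set of conditions, any measured value needs those conditions attached to it. A minimal sketch of that bookkeeping (the structure, field names, and numbers below are illustrative, not from the presentation):

from dataclasses import dataclass

@dataclass
class RberSample:
    """One RBER observation, valid only for the conditions attached to it."""
    failing_bits: int
    bits_read: int
    pe_cycles: int           # program/erase cycles at the time of measurement
    retention_hours: float   # unbiased storage time since the last write
    reads_since_write: int   # accumulated read-disturb exposure
    temperature_c: float

    @property
    def rber(self) -> float:
        return self.failing_bits / self.bits_read

# Hypothetical example: 37 failing bits out of 512 MiB read, after 10K cycles
sample = RberSample(failing_bits=37, bits_read=512 * 2**20 * 8,
                    pe_cycles=10_000, retention_hours=0.0,
                    reads_since_write=1, temperature_c=55.0)
print(f"RBER = {sample.rber:.1e}")   # ~8.6e-09 for these made-up numbers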

Page 6: Erratic Nature of Write Errors

Erratic Nature of Write Errors

[Figure: RBER (1E-9 to 1E-5) vs. P/E cycles, shown on a linear axis (4000–10,000 cycles) and a log axis (1–10,000 cycles); bit failpoints, IMFT data vs. earlier data]

Errors are erratic: most bits failing at 5K didn't fail at 10K
Explanation: oxide traps are transient
Data verified only at symbols: did we miss errors in between?
Ran an experiment to verify data after every cycle
– Example bit failed 11 times, never at previous verify points
– Previous verifies detected only 0.6% of failing bits
Standard "test after stress" qualifications miss most errors!


The next several slides are based on 70nm results from: Mielke, N., et al., "Bit error rate in NAND Flash memories", IEEE International Reliability Physics Symposium, 2008.

Page 7: Data-Retention Errors

Data-Retention Errors

[Figure: raw bit error rate (1E-9 to 1E-4) vs. retention time (0–10,000 hours), post 10K cycles; Vt distributions L0–L3 with read levels R1–R3; cell cross-section (CG, FG, n+ junctions) showing SILC and detrapping charge-loss paths]

After cycling, RBER increases over time without bias
Error transitions show cells are losing VT ("charge loss")
Two products dominated by upper state (L3), others by L1 & L2
Characteristics:
– L1 & L2: detrapping from the tunnel oxide
– L3: SILC (trap-assisted tunneling) leakage off the FG

Page 8: Read Disturb Errors

Read Disturb Errors

[Figure: raw bit error rate (0 to 1.5E-6) vs. number of reads (0–10,000), post 10K cycles; cell diagram with levels L0–L3, read levels R1–R3, ~6V pass bias, and SILC leakage path]

After cycling, RBER increases with repetitive reading
Error transitions show erased cells gaining VT
Mechanism is well known: SILC under read bias

Page 9: Effect of ECC

Effect of ECC

[Figure: cumulative fraction of sectors failing for four conditions (5K cycles; 10K cycles; 10K cycles + 0.5 year or 5K reads; 10K cycles + 1 year or 10K reads), with 1-bit ECC (scale 0 to ~2E-4) and 4-bit ECC (scale 0 to ~4E-13)]

Failures drop several orders of magnitude, ~1E12x over no ECC
Curves get steeper (because of the ECC power law)
Dominant mechanism switches to retention (because of the underlying error distribution)
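The ~1E12x improvement and the steeper curves follow from the "ECC power law": with t-bit correction per sector, a sector is lost only when more than t bits fail, so the sector failure rate scales roughly as RBER^(t+1). A sketch of that scaling under the simplifying (and not strictly true) assumption of independent bit errors; the sector size is an example:

from math import comb

def sector_fail_prob(rber: float, bits_per_sector: int, t: int) -> float:
    """P(more than t bit errors in one sector), assuming independent bit errors.

    Real NAND bit errors are correlated, so this illustrates the scaling only.
    """
    p_ok = sum(comb(bits_per_sector, k) * rber**k * (1 - rber)**(bits_per_sector - k)
               for k in range(t + 1))        # P(0..t errors)
    return 1.0 - p_ok

bits = (512 + 16) * 8    # example sector: 512B data + 16B spare
for t in (0, 1, 4):
    print(f"t={t}: {sector_fail_prob(1e-6, bits, t):.1e}")
# Each extra correctable bit cuts the failure rate by roughly another factor
# of bits_per_sector * RBER, which is why the curves steepen with stronger ECC.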

Page 10: Workable UBER Definition for NAND

Workable UBER Definition for NAND

UBER = Uncorrectable Bit Error Rate

UBER = (Cum Fraction Sectors Failing) / [(bits per sector) × (# reads per sector)]
     = (Cum Fraction Sectors Failing) / [(bits per sector) × (N_CYC × #Reads/Cycle + N_Post-Cyc)]

Worst-case accounting for # reads per sector:
– Write errors: 1
– Read disturb: # reads in stress
– Unbiased (retention): impute the same read rate as in cycling
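The definition above translates directly into a small helper (parameter names are mine, not from the paper):

def uber(cum_fraction_sectors_failing: float,
         bits_per_sector: int,
         reads_per_sector: float) -> float:
    """Uncorrectable bit error rate per the definition above.

    reads_per_sector follows the worst-case accounting on this slide:
    write errors -> 1; read disturb -> # reads in the stress;
    unbiased retention -> N_CYC * (#reads/cycle) + post-cycling reads.
    """
    return cum_fraction_sectors_failing / (bits_per_sector * reads_per_sector)

# Example in the spirit of the next slide: ~2e-13 of sectors failing (4-bit ECC)
# after ~1e8 bits read per sector gives a UBER on the order of 1e-21.
print(uber(2e-13, bits_per_sector=512 * 8, reads_per_sector=1e8 / (512 * 8)))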

Page 11: UBER Estimate

UBER Estimate

[Figure: cumulative fraction of sectors failing (4-bit ECC, up to ~2E-13) vs. bits read per sector (0 to ~1E8), with write, read-disturb, and retention points; worst-case slope ~3E-21]

Data re-plotted vs. # bits read
UBER at any point is the slope of the line to the origin
UBER is very low, 3E-21 at the worst-case point (retention)
UBER increases with greater use, so the use range must be stated when UBER is specified

Page 12: Write Amplification

Write Amplification

Write amplification is the amount of data written to NAND for a requested amount of data written by the host

[Figure: read-modify-write of an erase block (EB) — 64 pages (Page 0–Page 63) are copied to DRAM, modified, and written back to a freshly erased block]

Example amplification is 32 (128KB NAND write for a 4KB host request). Traditional schemes have amplification of approx 20-40X.

*Simplified example to illustrate the write amplification effect. Specific algorithms vary greatly.
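The simplified read-modify-write example above reduces to a single ratio; a sketch of that arithmetic, with page and block sizes taken from this slide's example rather than any particular device:

def write_amplification(nand_bytes_written: int, host_bytes_written: int) -> float:
    """Bytes physically written to NAND per byte the host asked to write."""
    return nand_bytes_written / host_bytes_written

# Simplified example: a 4KB host write forces a full 64-page erase block
# (64 x 2KB pages = 128KB) to be rewritten.
host_write = 4 * 1024
nand_write = 64 * 2 * 1024
print(write_amplification(nand_write, host_write))   # -> 32.0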

Page 13: Client Workload Write Amplification

Client Workload Write Amplification

[Figure: XP Mobile Workload writes — data written (MB, 0.1–1000, log scale) vs. workload duration (0–190 minutes); measured host data written vs. measured NAND data written]

Intel® High-Performance SATA SSDs: typical write amplification <1.1 for client workloads (this example <1.05)


Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.

*Third party marks and brands are the property of their respective owners

Page 14: Traditional Wear Leveling

Traditional Wear Leveling

Many different algorithms & variations exist

• One class of algorithms is regioned wear leveling

Some traditional schemes have as much as a ~3X factor between maximum wear and average wear

[Figure: per-region wear distribution — blocks with maximum cycles much higher than the average block]


Page 15: Intel Wear Level Efficiency

Intel Wear Level Efficiency

[Figure: wear leveling — P/E cycles (0–2500) vs. sorted erase block (min to max, ~6145+ blocks), showing a <4% delta between least-worn and most-worn blocks]

Intel® High-Performance SATA SSDs have a typical factor of <1.1 between maximum wear and average wear
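The ~3X figure for traditional schemes and the <1.1 figure here are the same metric: the most-worn block's P/E count divided by the average. A minimal sketch of computing it from per-block counters (the counts below are made up):

def wear_level_factor(pe_cycles_per_block: list[int]) -> float:
    """Ratio of maximum block wear to average block wear."""
    average = sum(pe_cycles_per_block) / len(pe_cycles_per_block)
    return max(pe_cycles_per_block) / average

# Hypothetical per-block P/E counters read back from a drive
cycles = [2400, 2410, 2390, 2405, 2420, 2395]
print(f"wear-leveling factor = {wear_level_factor(cycles):.3f}")   # close to 1.0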


Page 16: Putting it Together: SSD Reliability Metrics

Putting it together: SSD Reliability Metrics

SSD UBER values can be << 1E-15
UBER ∝ usage: program/erase/read & subsequent retention

Intel® X18-M and X25-M Mainstream SATA SSDs (80GB)
– 10-channel architecture with 50nm MLC ONFI 1.0 NAND
– 5 years usage, 1,000G, 1.2 million hrs MTBF
– Client workload @ 1e-15 UBER: >>100GB/day for 5 years
Intel® X25-M and X18-M Mainstream SATA SSDs deliver >5X the accepted requirement for clients (20GB/day)

Intel® X25-E Extreme SATA SSD (32GB)
– 10-channel architecture with 50nm SLC ONFI 1.0 NAND
– 1,000G, 2 million hrs MTBF
– Intel SLC SSDs support >7,000 8K 2:1 R/W random IOPS, 24/7, for 5 years
Intel X25-E SLC SSDs support the endurance required to replace many 15K RPM HDDs for IOPS applications
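One way to see how drive-level endurance claims follow from the component-level numbers: total host writes over life ≈ capacity × rated P/E cycles ÷ (write amplification × wear-leveling factor). A back-of-envelope sketch; the 10K-cycle rating below is an assumed placeholder, not a figure quoted in this deck:

def host_gb_per_day(capacity_gb: float, rated_pe_cycles: int,
                    write_amplification: float, wear_factor: float,
                    lifetime_years: float = 5.0) -> float:
    """Sustainable host writes per day over the drive's lifetime."""
    total_nand_writes_gb = capacity_gb * rated_pe_cycles
    total_host_writes_gb = total_nand_writes_gb / (write_amplification * wear_factor)
    return total_host_writes_gb / (lifetime_years * 365)

# Placeholder inputs: 80GB drive, assumed 10K-cycle rating, write amplification
# ~1.1 and wear-leveling factor ~1.1 as on the preceding slides.
print(f"{host_gb_per_day(80, 10_000, 1.1, 1.1):.0f} GB/day")   # well above 100 GB/day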
