NAND Flash and Solid-State Drive Reliability
Al Fazio, Intel Fellow
Director, Memory Technology Development, Intel Corporation
September 17th, 2008
Flash: A License to Disrupt
35mm film, floppy drives, audio tape…
– Flash use in consumer electronics is characterized by:
  – Large block files (.jpg, mp3…)
  – # writes determined by human interaction (i.e. photos taken)
To disrupt the HDD, flash must accommodate PC characteristics:
– Small random writes, # writes determined by the OS
– Add to this:
A Be-
Control
Flash requires high fields to overcome energy barriers for non-volatility
Flash reliability dominated by oxide-degradation; result of program/erase
Charge Storage: Program and Erase
[Figure: two cell cross-sections with N+ regions; bias labels 20V and 0V for program, -20V and 0V for erase]
Programming: NAND
– Programming means injecting electrons to the FG (Fowler-Nordheim tunneling)
Erase: Fowler-Nordheim tunneling in reverse direction
[Figure: Vt distributions. 2 levels (“1”, “0”) => 1 bit/cell; 4 levels (“11”, “10”, “01”, “00”) => 2 bits/cell]
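The relation between Vt levels and bits per cell can be sketched in a few lines. This is only an illustration: the read reference voltages, the low-to-high ordering of the state labels, and the function names are assumptions made for the example, not values from the slides.

```python
import bisect
import math

def bits_per_cell(levels: int) -> int:
    """Number of bits stored when a cell distinguishes `levels` Vt states."""
    return int(math.log2(levels))

# Hypothetical read reference voltages (volts) separating four Vt states.
READ_REFS = [0.0, 1.5, 3.0]
# Assumed state labels from lowest Vt (erased) to highest Vt.
STATES = ["11", "10", "01", "00"]

def read_cell(vt: float) -> str:
    """Return the label of the state whose Vt window contains `vt`."""
    # bisect_right counts how many reference levels lie at or below vt,
    # which is the index of the state window.
    return STATES[bisect.bisect_right(READ_REFS, vt)]
```

With four levels, each read compares the cell against three references, so a small Vt shift across any one of them flips the decoded state.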
Reliability and Oxide Traps
Normally, F-N tunneling occurs only during accelerated stresses done by engineers trying to study oxide degradation…
– Flash memories: basis for device operation itself
This fact has two fundamental implications:
– Flash reliability is dominated by oxide-degradation effects, notably trap buildup in the tunnel oxide, which occur as a result of program/erase cycling
– More than any other IC technology, developing a Flash technology centers around obtaining acceptable reliability
Over time, charges can detrap
– Effect will cause VT to shift, with possible data loss
[Figure: energy vs. distance diagram (channel, FG, charge Q’) and cell cross-section (N+ regions, FG, top gate)]
Bit Errors: Overview
[Figure: Vt distributions of states L0 to L3 after write and after time, with read and Vpass levels marked]
At any instant, some fraction of bits are in the wrong data state, typically 1E-9 to 1E-6, called the “raw bit error rate” or RBER
These failing bits develop with use
– During write, some bits program when they shouldn’t, or program higher than they should
– Cells shift in VT over time, either simply with time (“data retention”) or because of repetitive read operations (“read disturb”)
– Both kinds increase with more program/erase cycles
– Several mechanisms cause bit errors, each with its own dependence on cycles, time, temperature, etc.
• This complexity means that RBER is a number, but not like pi:
  – like temperature: a number only for a specific set of conditions, location, instant
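As a rough illustration of the RBER definition, the fraction of wrong bits can be counted by comparing the data written with the data read back before any error correction. The function name and the byte-buffer interface are assumptions for this sketch.

```python
def raw_bit_error_rate(written: bytes, read_back: bytes) -> float:
    """Fraction of bits that differ between written and read-back data."""
    if len(written) != len(read_back) or not written:
        raise ValueError("buffers must be non-empty and of equal length")
    # XOR leaves a 1 in every bit position where the two buffers disagree.
    wrong_bits = sum(bin(a ^ b).count("1") for a, b in zip(written, read_back))
    return wrong_bits / (8 * len(written))
```

The result applies only to the conditions under which the buffers were captured (cycle count, retention time, number of reads), which is the point of the temperature analogy above.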
Erratic Nature of Write Errors
[Figure: RBER (1.0E-09 to 1.0E-05) vs. P/E cycles, plotted on a linear axis (4000 to 10000) and on a log axis (1 to 10000); annotations: bit fail points, earlier data. Source: IMFT]
Errors are erratic: most bits failing at 5K didn’t fail at 10K
Explanation: oxide traps are transient
Data verified only at symbols: did we miss errors in between?
Ran experiment to verify data after every cycle
– Example bit failed 11 times, never at previous verify points
– Previous verifies detected only 0.6% of failing bits
Standard “test after stress” qualifications miss most errors!
The next several slides are based on 70nm results from: Mielke, N., et al., “Bit error rate in NAND Flash memories”, IEEE International Reliability Physics Symposium, 2008
Data-Retention Errors
[Figure: raw bit error rate (10^-9 to 10^-4) vs. retention time (0 to 10000 hours), post 10K cycles; insets show states L0 to L3 with read levels R1 to R3 and cell diagrams (CG, FG, n+) labeled SILC and detrapping]
After cycling, RBER increases over time without bias
Error transitions show cells are losing VT (“charge loss”)
Two products dominated by upper state (L3), others by L1 & L2
Characteristics:
– L1 & L2: detrapping from the tunnel oxide
– L3: SILC (trap-assisted tunneling) leakage off FG
Read Disturb Errors
[Figure: raw bit error rate (0 to 1.5x10^-6) vs. number of reads (0 to 10000), post 10K cycles; insets show states L0 to L3 with read levels R1 to R3 and a cell diagram with ~6V bias labeled SILC]
After cycling, RBER increases with repetitive reading
Error transitions show erased cells gaining VT
Mechanism is well known: SILC under read bias
Effect of ECC
[Figure: cumulative fraction of sectors failing with 1-bit ECC (axis 0 to 2x10^-4) and with 4-bit ECC (axis 0 to 4x10^-13), at 5K cycles, 10K cycles, 10K cycles + 0.5 year or 5K reads, and 10K cycles + 1 year or 10K reads]
Failures drop several orders of magnitude, ~10^12x over no ECC
Curves get steeper (because of ECC power law)
Dominant mechanism switches to retention (because of underlying error distribution)
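The ECC power law can be illustrated with a simple binomial model. This is a sketch under assumptions that are not in the slides: bit errors are taken to be independent with a uniform RBER, a sector is taken to be 4096 data bits, and the function name is hypothetical. Real error distributions are not uniform, which the slide notes is why the dominant mechanism changes.

```python
from math import exp, lgamma, log, log1p

def sector_fail_probability(rber: float, bits_per_sector: int, t: int) -> float:
    """Probability that a sector holds more than t bit errors.

    Assumes each bit fails independently with probability `rber`
    (0 < rber < 1), so the error count is binomially distributed.
    """
    n = bits_per_sector
    total = 0.0
    # Sum the binomial tail (k = t+1 .. n) in log space so that very small
    # probabilities are not lost to rounding, as they would be in 1 - P(ok).
    for k in range(t + 1, n + 1):
        log_term = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                    + k * log(rber) + (n - k) * log1p(-rber))
        total += exp(log_term)
    return total
```

Under this model the leading term scales as RBER^(t+1), so doubling RBER with 4-bit ECC raises the sector failure probability by roughly 2^5 = 32x, compared with about 4x for 1-bit ECC. That steeper dependence is one way to read the "curves get steeper" observation.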
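Workable UBER Definition for NAND
UBER = Uncorrectable Bit Error Rate

$$
\mathrm{UBER}
= \frac{\text{Cum Fraction Sectors Failing}}{(\text{bits per sector}) \cdot (\text{\# reads per sector})}
= \frac{\text{Cum Fraction Sectors Failing}}{(\text{bits per sector}) \cdot \left(N_{\mathrm{CYC}} \cdot \text{\# Reads/Cycle} + N_{\text{Post-Cyc}}\right)}
$$

Annotations on the read-count terms:
– Worst case: 1
– Read disturb: # reads in stress
– Unbiased: impute same rate as in cycling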
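The definition can be evaluated directly. The sketch below uses hypothetical function names and treats the worst-case value of 1 as the default number of reads per cycle, which is one reading of the slide annotations. The numbers in the usage note are illustrative and are not measurements from the talk.

```python
def reads_per_sector(n_cyc: int, reads_per_cycle: int = 1, n_post_cyc: int = 0) -> int:
    """Total reads seen by a sector: reads during cycling plus reads afterwards."""
    return n_cyc * reads_per_cycle + n_post_cyc

def uber(cum_fraction_sectors_failing: float, bits_per_sector: int, n_reads: int) -> float:
    """Uncorrectable bit error rate: failing-sector fraction per bit read."""
    return cum_fraction_sectors_failing / (bits_per_sector * n_reads)
```

For example, `uber(2e-13, 4096, reads_per_sector(10_000, 1, 10_000))` divides a failing fraction of 2x10^-13 by 4096 x 20000 bits read per sector. Because the denominator grows with use, a quoted UBER is only meaningful together with the cycle and read counts it was computed for.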
UBER Estimate
[Figure: cumulative fraction of sectors failing with 4-bit ECC (0 to 2x10^-13) vs. bits read per sector (0 to 10^8), with points for write, read disturb, and retention; the retention point is labeled 3x10^-21]
Data re-plotted vs. # bits read
UBER at any point is the slope of the line to the origin
UBER is very low: 3x10^-21 at the worst-case point (retention)
UBER increases with greater use, so the use range must be stated when UBER is specified
Write Amplification
Write amplification is the ratio of the amount of data written to NAND to the amount of data the host requested to write
[Figure: four views of a 64-page erase block (Page 0 to Page 63), including the erase block (EB) and its DRAM copy]
Example amplification is 32 (128KB NAND write for 4KB host request). Traditional schemes have amplification of approx. 20-40X.
*Simplified example to illustrate the write amplification effect. Specific algorithms vary greatly.
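The arithmetic of the simplified example can be written out as follows. The 2KB page size is an inference from the figures on the slide (128KB spread over 64 pages), and the function name is hypothetical.

```python
KB = 1024
PAGES_PER_BLOCK = 64   # Page 0 .. Page 63, as drawn on the slide
PAGE_SIZE = 2 * KB     # assumed: 128KB erase block / 64 pages

def write_amplification(nand_bytes_written: int, host_bytes_written: int) -> float:
    """Ratio of bytes physically written to NAND to bytes requested by the host."""
    return nand_bytes_written / host_bytes_written

# Rewriting a whole erase block to service one 4KB host write.
example = write_amplification(PAGES_PER_BLOCK * PAGE_SIZE, 4 * KB)
```

Here `example` evaluates to 32.0, matching the slide. A scheme that only writes the changed pages plus a small amount of bookkeeping would approach 1.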
Client Workload Write Amplification
[Figure: XP Mobile Workload Writes. Data written (MB, log scale 0.1 to 1000) vs. workload duration (0 to 190 minutes); series: host data written (measured writes from host) and NAND data written (measured writes to NAND)]
Intel® High-Performance SATA SSDs: typical write amplification <1.1 for client workloads (this example <1.05)
Performance measurements are made using specific computer systems and/or components and reflect the approximate performance of the technology as measured by those tests. Any difference in system hardware or software design or configuration may affect actual results.
*Third party marks and brands are the property of their respective owners
Traditional Wear Leveling
Many different algorithms & variations exist
• One class of algorithms is regioned wear leveling
Some traditional schemes have as much as ~3X factor between maximum wear and average wear
[Figure annotation: block with max cycles much higher than average block]
Intel Wear Level Efficiency
[Figure: Wear Leveling. P/E cycles (0 to 2500) vs. sorted erase block, min to max (1 to 6145); annotation: <4% delta]
Intel® High-Performance SATA SSDs have a typical factor of <1.1 between maximum wear and average wear
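The maximum-to-average wear factor quoted on these slides can be computed from per-block erase counts. A minimal sketch, with a hypothetical function name and made-up cycle counts in the usage note:

```python
from typing import Sequence

def wear_factor(pe_cycles: Sequence[int]) -> float:
    """Ratio of the most-worn block's P/E count to the average P/E count."""
    average = sum(pe_cycles) / len(pe_cycles)
    return max(pe_cycles) / average
```

For instance, `wear_factor([100, 100, 100, 300])` gives 2.0 because the hottest block has twice the average wear, while perfectly even wear gives 1.0. The drive wears out when its most-worn blocks reach their cycle limit, so a lower factor means more of the NAND's rated cycles are usable.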
Putting it together: SSD Reliability Metrics
SSD UBER values can be << 10^-15
UBER ∝ usage: program/erase/read & subsequent retention
Intel® X18-M and X25-M Mainstream SATA SSD (80GB)
– 10-channel architecture with 50nm MLC ONFI 1.0 NAND
– 5 years usage, 1000G, 1.2 million hrs MTBF
– GB/day client workload @ 1e-15 UBER: >>100GB/day, 5 years
Intel® X25-M and X18-M Mainstream SATA SSDs deliver >5X the accepted requirement for clients (20GB/day)
Intel® X25-E Extreme SATA SSD (32GB)
– 10-channel architecture with 50nm SLC ONFI 1.0 NAND
– 1000G, 2 million hrs MTBF
– Intel SLC SSD supports > 7000 8K 2:1 R/W random IOPS 24/7, 5 years
Intel X25-E SLC SSDs support the endurance required to replace many 15K RPM HDDs for IOPS applications
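One way to see how write amplification and wear leveling combine with a daily-write target is a back-of-envelope cycle budget. The formula below is an assumption introduced for illustration. It is not stated in the slides, and it ignores spare area, retention limits, and UBER targets. The function name is hypothetical.

```python
def required_pe_cycles(host_gb_per_day: float, years: float, capacity_gb: float,
                       write_amplification: float, wear_factor: float) -> float:
    """Approximate P/E cycles the most-worn block must survive."""
    host_gb_total = host_gb_per_day * 365 * years
    nand_gb_total = host_gb_total * write_amplification
    average_cycles = nand_gb_total / capacity_gb   # full-drive writes to NAND
    return average_cycles * wear_factor            # scale up to the worst block
```

With the slide's figures of 100GB/day for 5 years on an 80GB drive, and factors of 1.1 for both write amplification and wear, this gives about 2760 cycles. The same workload with a write amplification of 20 and a wear factor of 3 would need about 136,875 cycles, which shows why both factors matter for endurance.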