Flash Reliability in Production:
The Importance of Measurement and Analysis in Improving System Reliability
Bianca Schroeder University of Toronto
(Currently on sabbatical at Microsoft Research Redmond)
• System reliability
• Why and how do systems fail in the wild?
Data from a large number of large-scale production systems at different organizations:
▪ Different hardware failure events
▪ Hardware replacements
▪ Correctable and uncorrectable errors in DRAM
▪ Server outages
▪ Hard disk drive failures
▪ Sector errors in hard disk drives
▪ Data corruption in storage systems
▪ Failures/errors in solid state drives
▪ Job logs
• Google, OpenCloud (Hadoop cluster at CMU), Yahoo! Hadoop trace
▪ Observations often different from expectations
▪ Surprising to operators as well as manufacturers
¡ Why flash reliability?
§ More and more data is living on flash => data reliability depends on flash reliability
§ Worry about flash wear-out
¡ For a long time, only lab studies using accelerated testing
¡ Recently, some field studies:
▪ Sigmetrics'15 (Facebook) Meza et al.
▪ FAST'16 (Google) Schroeder et al.
▪ Systor'17 (Microsoft) Narayanan et al.
Google fleet
10 drive models
4 chip vendors
MLC, SLC, eMLC
6 years of data
Data on workload and variety of error types
Data on repairs, replacements, bad blocks & bad chips
• Custom drives based on commodity chips (but custom firmware and FTL)
• Drives are reporting counters many times per day
¡ Percentage of drives replaced annually due to suspected hardware problems over the first 4 years in the field:
[Figure: annual replacement rates (percentage, 0-6%) for drive models MLC-A through MLC-D and SLC-A through SLC-D; for comparison, average annual replacement rates for hard disks are 2-20%]
Consistent with [Narayanan’17]
§ Good news:
§ Only 1-2% of drives replaced annually -- much lower than hard disks!
§ Drives benefitted from ability to tolerate chip failure
§ 0.5-1.5% of drives developed bad chips per year
§ Bad news: Uncorrectable errors common
§ 26-60% of drives see uncorrectable errors in their life (Google)
§ 2-6 out of 1,000 drive days experience uncorrectable errors
§ 0.2-75% of drives at Facebook [Meza et al. 2015]
§ Rates at Microsoft 10X higher than target rate [Narayanan et al. 2016]
§ Much worse than for hard disk drives (3.5% experiencing sector errors)!
§ These errors are insidious, as they are latent.
¡ Wear-out (limited program-erase cycles)
¡ Technology (MLC, SLC)
¡ Lithography
¡ Age
¡ Workload
¡ Temperature
¡ Other factors?
¡ What reliability metric to use?
§ Raw bit error rate (RBER)
▪ Assumption: as raw bit errors accumulate, they turn uncorrectable
§ Probability of uncorrectable errors
▪ Why not UBER -- we will see …
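As a rough illustration, a minimal sketch of how these two metrics could be computed from per-drive-day counter logs; the file and column names are hypothetical, not the actual schema used in any of the studies:

```python
import pandas as pd

# Hypothetical log: one row per drive per day with raw counters.
log = pd.read_csv("drive_days.csv")  # drive_id, corrected_bit_errors, bits_read, ue_count

# Raw bit error rate (RBER): corrected (raw) bit errors per bit read.
log["rber"] = log["corrected_bit_errors"] / log["bits_read"]

# Probability of uncorrectable errors: fraction of drive-days with at least one UE ...
p_ue_day = (log["ue_count"] > 0).mean()

# ... and fraction of drives that ever saw an uncorrectable error.
p_ue_drive = log.groupby("drive_id")["ue_count"].max().gt(0).mean()

print(f"median daily RBER:         {log['rber'].median():.2e}")
print(f"drive-days with UEs:       {p_ue_day:.2%}")
print(f"drives that ever saw a UE: {p_ue_drive:.2%}")
```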
Common expectation: Exponential increase of RBER with PE cycles … or maybe polynomial … or otherwise super-linear?
[Figure: RBER as a function of PE cycles, with a dashed exponential-growth reference curve]
§ Big differences across models (all drives use the same ECC & FTL, so differences are not due to ECC)
§ Linear increase (for the range of PE cycles in our data)
§ No sudden increase after the PE cycle limit
Common expectation: Lower error rates for SLC ($$$) than for MLC
§ RBER is lower for SLC drives than for MLC drives
§ Uncorrectable errors are not lower for SLC drives (all drives use the same ECC, FTL, etc., so differences are not due to ECC)
§ SLC drives don't have a lower rate of repairs or replacements
[Figure: red lines indicate SLC drives]
Common expectation: Higher error rates for smaller feature size
§ Smaller lithography => higher RBER
§ Lithography has less impact on uncorrectable errors
[Figure panels: 43nm versus 50nm; 34nm versus 50nm]
Common expectation: Exponential increase in hardware failures with temperature (Arrhenius equation)
[Figure: RBER as a function of temperature, with a dashed exponential-growth reference curve]
§ Uncorrectable errors might increase, decrease, or not be affected by temperature.
§ Drive-internal mechanisms protect against temperature, e.g. through throttling.
§ Other effects might dominate.
[Figure panels: Increase; Decrease; Little effect]
¡ Lab studies find workload-induced error modes
§ Read & program disturb errors, incomplete erases
¡ Field data: no correlation between read/write/erase operations and errors (Google, Facebook)
§ Possibly because data at the per-drive level is too coarse
¡ Consequence: For field studies, UBER is not a meaningful metric.

UBER = (# uncorrectable bits) / (total # bits read)
Metrics from lab studies do not always make sense for field data.
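To make the concern concrete, a small sketch (with a hypothetical per-drive table) of the check behind this observation: if read volume does not correlate with uncorrectable errors, then dividing error counts by bits read, as UBER does, normalizes by a largely unrelated quantity.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-drive lifetime totals: bits_read, uncorrectable_bits.
drives = pd.read_csv("per_drive_totals.csv")

# Rank correlation between read volume and uncorrectable errors per drive.
rho, pval = spearmanr(drives["bits_read"], drives["uncorrectable_bits"])
print(f"Spearman rho = {rho:.2f} (p = {pval:.3g})")

# UBER as defined above; with no correlation, two drives with the same
# error count but different read volumes get very different UBER values.
drives["uber"] = drives["uncorrectable_bits"] / drives["bits_read"]
```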
§ Different RBER for the same model in different clusters
§ Other factors at play …
¡ The main purpose of RBER is as a metric for overall drive reliability
¡ Allows for projections of uncorrectable errors [Mielke2008]
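One common way to make such a projection is a codeword-level model: assume independent bit errors at the measured RBER and an ECC that corrects up to t bit errors per N-bit codeword. Below is a simplified sketch of that idea; the exact model and parameters in [Mielke2008] may differ, and the independence assumption is precisely what the field data calls into question.

```python
from math import comb

def p_uncorrectable(rber, n_bits=8 * 1024, t=16):
    """Probability that an n_bits codeword has more than t bit errors,
    assuming independent bit errors at rate rber (a strong assumption)."""
    p_le_t = sum(comb(n_bits, k) * rber**k * (1 - rber)**(n_bits - k)
                 for k in range(t + 1))
    return 1.0 - p_le_t

# Illustrative projection for a few RBER values, with 16-bit-correcting
# ECC over 1 KiB codewords (both parameters are made up for the sketch).
for rber in (1e-3, 3e-3, 1e-2):
    print(rber, p_uncorrectable(rber))
```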
§ Drive models with higher RBER don’t have higher frequency of uncorrectable errors
§ Drives (or drive days) with higher RBER don’t have higher frequency of uncorrectable errors
§ RBER is not a good predictor of field reliability
§ Uncorrectable errors caused by mechanisms other than correctable errors?
§ Prior errors are highly predictive of later uncorrectable errors
§ Can we predict uncorrectable errors?
¡ Drives report many operational statistics, e.g. through S.M.A.R.T.:
§ Workload, temperature, power-on hours, prior errors, etc.
¡ Based on data from interval n, will there be uncorrectable errors in interval n+1?
[Figure: timeline of per-interval SMART counters (SMART1, SMART2, …, SMART254) up to now, with the next interval to be predicted]
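A minimal sketch of how this question can be framed as a supervised-learning problem, assuming a hypothetical table with one row per drive per interval (the file and column names are illustrative, not the actual counters used in the study):

```python
import pandas as pd

df = pd.read_csv("drive_intervals.csv").sort_values(["drive_id", "interval"])

feature_cols = ["read_ops", "write_ops", "erase_ops", "temperature",
                "power_on_hours", "prior_correctable_errors"]

# Label each interval with the uncorrectable-error count of the *next*
# interval for the same drive; the last interval per drive has no label.
df["next_ue"] = df.groupby("drive_id")["uncorrectable_errors"].shift(-1)
data = df.dropna(subset=["next_ue"])

X = data[feature_cols]
y = (data["next_ue"] > 0).astype(int)   # 1 = uncorrectable error in interval n+1
```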
¡ Common machine learning techniques for classification problems (in roughly increasing order of complexity):
§ Classification and regression trees
§ Random forests
§ Logistic regression
§ Support vector machines
§ Neural networks
¡ How does prediction accuracy compare?
¡ How can we use predictions in practice?
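A hedged sketch of how a couple of these classifiers could be trained and compared on a feature matrix like the one above, using scikit-learn (the split, hyperparameters, and data are illustrative assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X, y as built in the previous sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "random forest":       RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```

Because uncorrectable errors are rare, plain accuracy says little on its own; the false positive and false negative rates discussed next are the more useful metrics.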
¡ Trained models for Google SSDs & Backblaze HDD data
¡ Metrics:
§ False positive rate (rate of false alarms)
§ False negative rate (fraction of errors not caught)
¡ A simple classifier performs best: random forests
¡ Accuracy is very good
§ Catch 90% of errors at a false alarm rate of 2-8%.
§ Catch 50% of errors at a false alarm rate of 1%.
[Figure: catch 90% of errors at a 2% false alarm rate (Seagate HDS722020ALA330)]
¡ Accuracy lower than for HDDs.
¡ Could be due to differences in hardware or error reporting.
¡ We will see that it's still good enough for the applications we consider.
¡ What if only little data is available?
§ Accuracy nearly identical.
¡ What if only data for a different model is available?
§ Accuracy slightly lower, but still very good.
[Figure: Seagate error predictions using a Hitachi-trained model]
¡ Use case: Optimize disk scrubbing
¡ Why scrubbing?
§ Periodic disk scan to detect and correct errors early.
§ Standard: Scrub continuously at a fixed speed (e.g. once per week).
§ Expensive, in particular with increasing capacities and impact on the foreground workload.
¡ Idea:
§ Increase scrub rate when an error is predicted => reduce mean time to error detection (MTTD)
¡ Approach:
§ When an error is predicted for the next interval, increase the scrub speed by a factor X for this interval.
¡ Benefits:
§ Errors that were predicted will be caught X times faster, on average.
¡ Cost:
§ Every time an error is predicted, X times more scrub operations will be issued.
¡ The cost & benefit trade-off can be controlled by choosing X and tuning the FPR of the predictor.
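A back-of-the-envelope sketch of this trade-off under a simplified per-interval model; both the model and the numbers are illustrative assumptions, not the evaluation from the talk:

```python
def scrub_tradeoff(x, recall, fpr, p_error):
    """Expected cost/benefit of scrubbing X times faster whenever an error
    is predicted for the next interval (simplified, per-interval model).

    x       : acceleration factor for flagged intervals
    recall  : fraction of error intervals the predictor flags (1 - FNR)
    fpr     : fraction of error-free intervals falsely flagged
    p_error : base rate of intervals containing an uncorrectable error
    """
    flagged = recall * p_error + fpr * (1 - p_error)   # time spent accelerated
    extra_ops = flagged * (x - 1)                      # extra scrub work overall
    # Predicted errors are detected x times faster; missed errors are not.
    mttd_speedup = 1.0 / (recall / x + (1.0 - recall))
    return flagged, extra_ops, mttd_speedup

print(scrub_tradeoff(x=10, recall=0.9, fpr=0.02, p_error=0.004))
```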
¡ Significant reduction in mean time to error detection at limited overhead.
¡ Effective even for the SSD model with the weakest predictor.
Nearly 2X improvement for 2% time spent accelerated.
Nearly 1.5X improvement for 1% time spent accelerated.
¡ Results when using the Hitachi predictor on Seagate drives:
¡ Even a classifier trained on a different model is useful.
¡ Dynamically adapt the ECC strength of SSDs.
¡ Proactively retire blocks or chips.
¡ Tune the cache policy of SSD caches, switching from write-back to write-through when the chance of errors is high.
¡ Tune file system protection mechanisms.
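As one concrete example of the cache-policy idea, a minimal sketch with a hypothetical predictor output and threshold (the interfaces are not from any real cache implementation):

```python
WRITE_BACK, WRITE_THROUGH = "write-back", "write-through"

def choose_cache_policy(p_error_next_interval, threshold=0.5):
    """Fall back to write-through when the predicted chance of uncorrectable
    errors is high, so dirty data is not held only on a risky drive."""
    if p_error_next_interval >= threshold:
        return WRITE_THROUGH
    return WRITE_BACK

# Hypothetical use with the classifier from the earlier sketches:
# policy = choose_cache_policy(models["random forest"].predict_proba(latest)[:, 1][0])
```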
¡ Predicting whole-SSD failures [Narayanan 16]
¡ Problem: false positives are much more expensive (e.g. unnecessary drive replacements)
¡ False positive rates of 13% are too high.
¡ More work left to be done …
¡ Many aspects differ from expectations
§ Linear increase with PE cycles
§ RBER not predictive of uncorrectable errors
§ SLC not generally more reliable than MLC
§ UBER not a great metric in the field
⇒ Importance of collecting field data
¡ Errors are often predictable using simple ML techniques
§ Can be used to improve storage system reliability
§ Need to think about what data to report/collect
§ We have shown some use cases; probably many more …
¡ Still many open questions …
¡ For more information:
▪ [Sigmetrics'15] (Facebook) "A large-scale study of flash memory failures in the field", J. Meza, Q. Wu, S. Kumar, O. Mutlu.
▪ [FAST'16] (Google) "Flash reliability in production: The expected and the unexpected", B. Schroeder, R. Lagisetty, A. Merchant.
▪ [Systor'17] (Microsoft) "SSD failures in data centers: The what, when and why?", I. Narayanan, D. Wang, M. Jeon, B. Sharma, L. Caulfield, A. Sivasubramaniam, B. Cutler, J. Liu, B. Khessib, K. Vaid.
▪ [IEEE Proceedings'17] (Survey) "Reliability of NAND-based SSD: What field studies tell us", B. Schroeder, R. Lagisetty, A. Merchant.