Flash Reliability in Production:
The Importance of Measurement and Analysis in Improving System Reliability
Bianca Schroeder University of Toronto
(Currently on sabbatical at Microsoft Research Redmond)
• System reliability
• Why and how do systems fail in the wild?
Data from a large number of large-scale production systems at different organizations:
▪ Different hardware failure events
▪ Hardware replacements
▪ Correctable and uncorrectable errors in DRAM
▪ Server outages
▪ Hard disk drive failures
▪ Sector errors in hard disk drives
▪ Data corruption in storage systems
▪ Failures/errors in solid state drives
▪ Job logs
• Google, OpenCloud (Hadoop cluster at CMU), Yahoo! Hadoop trace
▪ Observations often different from expectations
▪ Surprising to operators as well as manufacturers
¡ Why flash reliability?
§ More and more data is living on flash => data reliability depends on flash reliability
§ Worry about flash wear-out
¡ For a long time, only lab studies using accelerated testing
¡ Recently, some field studies:
▪ Sigmetrics'15 (Facebook) Meza et al.
▪ FAST'16 (Google) Schroeder et al.
▪ Systor'17 (Microsoft) Narayanan et al.
Google fleet
10 drive models
4 chip vendors
MLC, SLC, eMLC
6 years of data
Data on workload and variety of error types
Data on repairs, replacements, bad blocks & bad chips
• Custom drives based on commodity chips (but custom firmware and FTL)
• Drives are reporting counters many times per day
¡ Percentage of drives replaced annually due to suspected hardware problems over the first 4 years in the field:
[Figure: annual replacement rates (percentage, 0-6%) for drive models MLC-A through MLC-D and SLC-A through SLC-D; for comparison, average annual replacement rates for hard disks are 2-20%]
Consistent with [Narayanan’17]
§ Good news:
§ Only 1-2% of drives replaced annually -- much lower than hard disks!
§ Drives benefitted from ability to tolerate chip failure
§ 0.5-1.5% of drives developed bad chips per year
§ Bad news: Uncorrectable errors common
§ 26-60% of drives see uncorrectable errors in their life (Google)
§ 2-6 out of 1,000 drive days experience uncorrectable errors
§ 0.2-75% of drives at Facebook [Meza et al. 2015]
§ Rates at Microsoft 10X higher than target rate [Narayanan et al. 2016]
§ Much worse than for hard disk drives (3.5% experiencing sector errors)!
§ These errors are insidious, as they are latent.
¡ Wear-out (limited program-erase cycles)
¡ Technology (MLC, SLC)
¡ Lithography
¡ Age
¡ Workload
¡ Temperature
¡ Other factors?
¡ What reliability metric to use?
§ Raw bit error rate (RBER)
▪ Assumption: as raw bit errors accumulate, they turn uncorrectable
§ Probability of uncorrectable errors
▪ Why not UBER -- we will see …
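As a rough illustration, a minimal sketch of how these two metrics could be computed from per-drive-day counter logs; the file and column names are hypothetical, not the actual schema used in any of the studies:

```python
import pandas as pd

# Hypothetical log: one row per drive per day with raw counters.
log = pd.read_csv("drive_days.csv")  # drive_id, corrected_bit_errors, bits_read, ue_count

# Raw bit error rate (RBER): corrected (raw) bit errors per bit read.
log["rber"] = log["corrected_bit_errors"] / log["bits_read"]

# Probability of uncorrectable errors: fraction of drive-days with at least one UE ...
p_ue_day = (log["ue_count"] > 0).mean()

# ... and fraction of drives that ever saw an uncorrectable error.
p_ue_drive = log.groupby("drive_id")["ue_count"].max().gt(0).mean()

print(f"median daily RBER:         {log['rber'].median():.2e}")
print(f"drive-days with UEs:       {p_ue_day:.2%}")
print(f"drives that ever saw a UE: {p_ue_drive:.2%}")
```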
Common expectation: Exponential increase of RBER with PE cycles … or maybe polynomial … or otherwise super-linear?
[Figure: RBER as a function of PE cycles, with a dashed exponential-growth reference curve]
§ Big differences across models (all drives use the same ECC & FTL, so differences are not due to ECC)
§ Linear increase (for the range of PE cycles in our data)
§ No sudden increase after the PE cycle limit
Common expectation: Lower error rates for SLC ($$$) than for MLC
§ RBER is lower for SLC drives than for MLC drives
§ Uncorrectable errors are not lower for SLC drives (all drives use the same ECC, FTL, etc., so differences are not due to ECC)
§ SLC drives don't have a lower rate of repairs or replacements
[Figure: red lines indicate SLC drives]
Common expectation: Higher error rates for smaller feature size
§ Smaller lithography => higher RBER
§ Lithography has less impact on uncorrectable errors
[Figure panels: 43nm versus 50nm; 34nm versus 50nm]
Common expectation: Exponential increase in hardware failures with temperature (Arrhenius equation)
[Figure: RBER as a function of temperature, with a dashed exponential-growth reference curve]
§ Uncorrectable errors might increase, decrease, or not be affected by temperature.
§ Drive-internal mechanisms protect against temperature, e.g. through throttling.
§ Other effects might dominate.
[Figure panels: Increase; Decrease; Little effect]
¡ Lab studies find workload-induced error modes
§ Read & program disturb errors, incomplete erases
¡ Field data: no correlation between read/write/erase operations and errors (Google, Facebook)
§ Possibly because data at the per-drive level is too coarse
¡ Consequence: For field studies, UBER is not a meaningful metric.

UBER = (# uncorrectable bits) / (total # bits read)
Metrics from lab studies do not always make sense for field data.
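To make the concern concrete, a small sketch (with a hypothetical per-drive table) of the check behind this observation: if read volume does not correlate with uncorrectable errors, then dividing error counts by bits read, as UBER does, normalizes by a largely unrelated quantity.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-drive lifetime totals: bits_read, uncorrectable_bits.
drives = pd.read_csv("per_drive_totals.csv")

# Rank correlation between read volume and uncorrectable errors per drive.
rho, pval = spearmanr(drives["bits_read"], drives["uncorrectable_bits"])
print(f"Spearman rho = {rho:.2f} (p = {pval:.3g})")

# UBER as defined above; with no correlation, two drives with the same
# error count but different read volumes get very different UBER values.
drives["uber"] = drives["uncorrectable_bits"] / drives["bits_read"]
```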
§ Different RBER for the same model in different clusters
§ Other factors at play …
¡ The main purpose of RBER is as a metric for overall drive reliability
¡ Allows for projections of uncorrectable errors [Mielke2008]
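One common way to make such a projection is a codeword-level model: assume independent bit errors at the measured RBER and an ECC that corrects up to t bit errors per N-bit codeword. Below is a simplified sketch of that idea; the exact model and parameters in [Mielke2008] may differ, and the independence assumption is precisely what the field data calls into question.

```python
from math import comb

def p_uncorrectable(rber, n_bits=8 * 1024, t=16):
    """Probability that an n_bits codeword has more than t bit errors,
    assuming independent bit errors at rate rber (a strong assumption)."""
    p_le_t = sum(comb(n_bits, k) * rber**k * (1 - rber)**(n_bits - k)
                 for k in range(t + 1))
    return 1.0 - p_le_t

# Illustrative projection for a few RBER values, with 16-bit-correcting
# ECC over 1 KiB codewords (both parameters are made up for the sketch).
for rber in (1e-3, 3e-3, 1e-2):
    print(rber, p_uncorrectable(rber))
```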
§ Drive models with higher RBER don’t have higher frequency of uncorrectable errors
§ Drives (or drive days) with higher RBER don’t have higher frequency of uncorrectable errors
§ RBER is not a good predictor of field reliability
§ Uncorrectable errors caused by mechanisms other than correctable errors?
§ Prior errors are highly predictive of later uncorrectable errors
§ Can we predict uncorrectable errors?
¡ Drives report many operational statistics, e.g. through S.M.A.R.T.:
§ Workload, temperature, power-on hours, prior errors, etc.
¡ Based on data from interval n, will there be uncorrectable errors in interval n+1?
[Figure: timeline of per-interval SMART counters (SMART1, SMART2, …, SMART254) up to now, with the next interval to be predicted]
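A minimal sketch of how this question can be framed as a supervised-learning problem, assuming a hypothetical table with one row per drive per interval (the file and column names are illustrative, not the actual counters used in the study):

```python
import pandas as pd

df = pd.read_csv("drive_intervals.csv").sort_values(["drive_id", "interval"])

feature_cols = ["read_ops", "write_ops", "erase_ops", "temperature",
                "power_on_hours", "prior_correctable_errors"]

# Label each interval with the uncorrectable-error count of the *next*
# interval for the same drive; the last interval per drive has no label.
df["next_ue"] = df.groupby("drive_id")["uncorrectable_errors"].shift(-1)
data = df.dropna(subset=["next_ue"])

X = data[feature_cols]
y = (data["next_ue"] > 0).astype(int)   # 1 = uncorrectable error in interval n+1
```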
¡ Common machine learning techniques for classification problems (in roughly increasing order of complexity):
§ Classification and regression trees
§ Random forests
§ Logistic regression
§ Support vector machines
§ Neural networks
¡ How does prediction accuracy compare?
¡ How can we use predictions in practice?
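A hedged sketch of how a couple of these classifiers could be trained and compared on a feature matrix like the one above, using scikit-learn (the split, hyperparameters, and data are illustrative assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X, y as built in the previous sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "random forest":       RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```

Because uncorrectable errors are rare, plain accuracy says little on its own; the false positive and false negative rates discussed next are the more useful metrics.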
¡ Trained models for Google SSDs & Backblaze HDD data
¡ Metrics:
§ False positive rate (rate of false alarms)
§ False negative rate (fraction of errors not caught)
¡ A simple classifier performs best: random forests
¡ Accuracy is very good
§ Catch 90% of errors at a false alarm rate of 2-8%.
§ Catch 50% of errors at a false alarm rate of 1%.
[Figure: catch 90% of errors at a 2% false alarm rate (Seagate HDS722020ALA330)]
¡ Accuracy lower than for HDDs.
¡ Could be due to differences in hardware or error reporting.
¡ We will see that it's still good enough for the applications we consider.
¡ What if only little data is available?
§ Accuracy nearly identical.
¡ What if only data for a different model is available?
§ Accuracy slightly lower, but still very good.
[Figure: Seagate error predictions using a Hitachi-trained model]
¡ Use case: Optimize disk scrubbing
¡ Why scrubbing?
§ Periodic disk scan to detect and correct errors early.
§ Standard: Scrub continuously at a fixed speed (e.g. once per week).
§ Expensive, in particular with increasing capacities and impact on the foreground workload.
¡ Idea:
§ Increase scrub rate when an error is predicted => reduce mean time to error detection (MTTD)
¡ Approach:
§ When an error is predicted for the next interval, increase the scrub speed by a factor X for this interval.
¡ Benefits:
§ Errors that were predicted will be caught X times faster, on average.
¡ Cost:
§ Every time an error is predicted, X times more scrub operations will be issued.
¡ The cost & benefit trade-off can be controlled by choosing X and tuning the FPR of the predictor.
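A back-of-the-envelope sketch of this trade-off under a simplified per-interval model; both the model and the numbers are illustrative assumptions, not the evaluation from the talk:

```python
def scrub_tradeoff(x, recall, fpr, p_error):
    """Expected cost/benefit of scrubbing X times faster whenever an error
    is predicted for the next interval (simplified, per-interval model).

    x       : acceleration factor for flagged intervals
    recall  : fraction of error intervals the predictor flags (1 - FNR)
    fpr     : fraction of error-free intervals falsely flagged
    p_error : base rate of intervals containing an uncorrectable error
    """
    flagged = recall * p_error + fpr * (1 - p_error)   # time spent accelerated
    extra_ops = flagged * (x - 1)                      # extra scrub work overall
    # Predicted errors are detected x times faster; missed errors are not.
    mttd_speedup = 1.0 / (recall / x + (1.0 - recall))
    return flagged, extra_ops, mttd_speedup

print(scrub_tradeoff(x=10, recall=0.9, fpr=0.02, p_error=0.004))
```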
¡ Significant reduction in mean time to error detection at limited overhead.
¡ Effective even for the SSD model with the weakest predictor.
Nearly 2X improvement for 2% time spent accelerated.
Nearly 1.5X improvement for 1% time spent accelerated.
¡ Results when using the Hitachi predictor on Seagate drives:
¡ Even a classifier trained on a different model is useful.
¡ Dynamically adapt the ECC strength of SSDs.
¡ Proactively retire blocks or chips.
¡ Tune the cache policy of SSD caches, switching from write-back to write-through when the chance of errors is high.
¡ Tune file system protection mechanisms.
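As one concrete example of the cache-policy idea, a minimal sketch with a hypothetical predictor output and threshold (the interfaces are not from any real cache implementation):

```python
WRITE_BACK, WRITE_THROUGH = "write-back", "write-through"

def choose_cache_policy(p_error_next_interval, threshold=0.5):
    """Fall back to write-through when the predicted chance of uncorrectable
    errors is high, so dirty data is not held only on a risky drive."""
    if p_error_next_interval >= threshold:
        return WRITE_THROUGH
    return WRITE_BACK

# Hypothetical use with the classifier from the earlier sketches:
# policy = choose_cache_policy(models["random forest"].predict_proba(latest)[:, 1][0])
```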
¡ Predicting whole-SSD failures [Narayanan 16]
¡ Problem: false positives are much more expensive (e.g. unnecessary drive replacements)
¡ False positive rates of 13% are too high.
¡ More work left to be done …
¡ Many aspects differ from expectations
§ Linear increase with PE cycles
§ RBER not predictive of uncorrectable errors
§ SLC not generally more reliable than MLC
§ UBER not a great metric in the field
⇒ Importance of collecting field data
¡ Errors are often predictable using simple ML techniques
§ Can be used to improve storage system reliability
§ Need to think about what data to report/collect
§ We have shown some use cases; probably many more …
¡ Still many open questions …
¡ For more information:
▪ [Sigmetrics'15] (Facebook) "A large-scale study of flash memory failures in the field", J. Meza, Q. Wu, S. Kumar, O. Mutlu.
▪ [FAST'16] (Google) "Flash reliability in production: The expected and the unexpected", B. Schroeder, R. Lagisetty, A. Merchant.
▪ [Systor'17] (Microsoft) "SSD failures in data centers: The what, when and why?", I. Narayanan, D. Wang, M. Jeon, B. Sharma, L. Caulfield, A. Sivasubramaniam, B. Cutler, J. Liu, B. Khessib, K. Vaid.
▪ [IEEE Proceedings'17] (Survey) "Reliability of NAND-based SSD: What field studies tell us", B. Schroeder, R. Lagisetty, A. Merchant.