Storage Developer Conference 2008
An Analysis of Data Corruption in the Storage Stack
Garth Goodson, NetApp, Inc
Bianca Schroeder, University of Toronto
Lakshmi Bairavasundaram, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, University of Wisconsin-Madison
Corruption Anecdote
There is much anecdotal evidence of data corruption
  E.g., this is a photo stored on an author’s laptop
System designers know of similar occurrences
  Data protection often based on anecdotes
Anecdotes: interesting, but not enough for system design
  A more rigorous understanding is needed
Our Analysis
First large scale study of data corruption
  1.53 million disks in 1000s of NetApp systems
Time period
  41 months (Jan 2004 – Jun 2007)
Corruption detection
  Using various data protection techniques
Data from NetApp Autosupport Database
  Also used in latent sector error [Bairavasundaram07], disk and storage failure [Jiang08] studies
Questions we had about corruption
What kinds of corruption occur and how often?
Does disk class matter?
  Expensive enterprise (FC) disks versus cheaper nearline (SATA) disks
Does disk drive family/product matter?
Are corruption instances independent?
Do corruption instances have spatial locality?
Talk Outline
Introduction
Background
  Data corruption
  Protection techniques
Results
Lessons
Conclusion
Should we care about disk errors?
[Figure: two pie charts of storage failure causes (disk failure, physical interconnect failure, protocol failure, performance failure), one for Nearline Systems and one for High-End Systems*]
Joint UIUC/NetApp system failure analysis
  44 months; 39,000 systems; 1.8 million disks
*W. Jiang et al., “Are disks the dominant contributor of storage failures?”, USENIX FAST, 2008
Disk system failure rates
From failure rate pie charts:
  High-end: 29% of system errors are disk errors
  Nearline: 57% of system errors are disk errors
What’s going on?
  Software is generally the same
  Hardware platforms are somewhat different
  But real difference is in the type of disk in use
    i.e., Fibre-channel vs SATA
Types of disk errors
Operational/component failures
  Fundamental problem with the drive hardware
    Bad servo, head, electronics, etc.
  Firmware bugs
    Failure to flush cache on power-down, etc.
Partial failures
  Only affects small subset of disk sectors
  Errors during writing
    Bad media, high-fly write, vibration, etc.
  Errors during reading (write was successful)
    Scratches, corrosion, thermal asperities, etc.
Unreported Disk Errors
Operational failures are easy to detect
  Usually fail-stop; something stops working
Latent sector errors are reported via SCSI errors
  Occur when a disk sector is read
What about errors that go undetected?
  Observed errors not corrected by disk’s ECC
  Cannot correct them unless detected first
  Result is usually some form of corruption
Data Corruption
Data stored on a disk block is incorrect
Many sources
  Software bugs
    File system, software RAID, device drivers, etc.
  Firmware bugs
    Disk drives, shelf controllers, adapters, etc.
Corruption is silent
  Not reported by the disk drive
  Could have greater impact than other errors
Forms of Data Corruption
Bit corruption
  Contents of existing disk block are modified
  Data being written to a disk block is corrupted
Lost writes
  Data not written but completion is reported
Misdirected writes
  Data is written to the wrong disk block
Torn writes
  Data partially written but completion is reported
In all cases, data passes disk’s internal ECC
Detecting data corruption
Basic idea:
  1. Generate checksum of data (64 Bytes/4KB)
  2. Store checksum along with data (4KB FS block)
  3. Verify checksum whenever reading data
Simple checksum has limited protection
  Detects bit corruption and torn (partial) writes
  No protection against lost or misdirected writes
    Since data was not overwritten
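The three steps of the basic idea can be sketched as follows. This is a minimal illustration, not the NetApp implementation: CRC-32 from Python's `zlib` stands in for the real checksum, and `write_block`, `read_block`, and the in-memory `store` are hypothetical names.

```python
import zlib

BLOCK_SIZE = 4096  # 4KB file system block, as on the slide


def write_block(store, block_no, data):
    """Steps 1 and 2: compute a checksum and store it along with the data.

    CRC-32 is an assumption made for this sketch.
    """
    assert len(data) == BLOCK_SIZE
    store[block_no] = (bytes(data), zlib.crc32(data))


def read_block(store, block_no):
    """Step 3: verify the checksum on every read."""
    data, stored_cksum = store[block_no]
    if zlib.crc32(data) != stored_cksum:
        raise IOError(f"checksum mismatch on block {block_no}")
    return data


store = {}
write_block(store, 7, b"A" * BLOCK_SIZE)

# Bit corruption: the stored data changes, the stored checksum does not.
data, cksum = store[7]
store[7] = (b"B" + data[1:], cksum)

try:
    read_block(store, 7)
    detected = False
except IOError:
    detected = True
```

Here `detected` ends up `True`: the corrupted contents no longer match the checksum that was stored at write time.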
Checksum problems: lost writes
[Figure: a stripe with Data 1 (A), Data 2 (B), Data 3 (C) and Parity P(ABC), each stored with its block checksum. The overwrite C→C’ updates parity to P(ABC’), but the write of C’ is lost. A later read of file ABC’ finds C with cksum(C); the checksum verifies, so the read returns ABC, i.e. corrupt data (C instead of C’).]
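The lost-write scenario on this slide can be reproduced with a toy model. The `Disk` class and its `drop_next_write` flag are hypothetical, and CRC-32 again stands in for the block checksum. The sketch only shows why a per-block checksum cannot notice a lost write: the stale data and the stale checksum still agree with each other.

```python
import zlib


class Disk:
    """Toy disk: each block holds (data, checksum). Hypothetical model."""

    def __init__(self):
        self.blocks = {}
        self.drop_next_write = False  # set to simulate one lost write

    def write(self, block_no, data):
        if self.drop_next_write:
            self.drop_next_write = False
            return True  # completion is reported, nothing is written
        self.blocks[block_no] = (data, zlib.crc32(data))
        return True

    def read(self, block_no):
        data, cksum = self.blocks[block_no]
        if zlib.crc32(data) != cksum:
            raise IOError("checksum mismatch")
        return data


disk = Disk()
disk.write(3, b"C")           # original contents
disk.drop_next_write = True
disk.write(3, b"C'")          # the overwrite is lost but reported complete
stale = disk.read(3)          # checksum verifies; the old data comes back
```

After this runs, `stale` is the old block `b"C"`, returned without any error.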
Write verify: a partial solution
Attempt to solve lost write problem
  Costly solution, expect good protection
Procedure:
  1. Write data to disk
  2. Read back to verify
  3. If lost write detected, write again or remap to new location
[Figure: the overwrite C→C’ is lost, leaving C and cksum(C) on disk. The read back returns C, so the lost write is detected and C’ is written again. Success: the disk now holds C’ and cksum(C’).]
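The write-verify procedure can be sketched as a retry loop. `FlakyDisk` and `write_verify` are hypothetical names for this illustration; the sketch rewrites in place and leaves out the remap-to-new-location alternative.

```python
class FlakyDisk:
    """Toy disk that silently drops its first `lost` writes (hypothetical)."""

    def __init__(self, lost=1):
        self.blocks = {}
        self.lost = lost

    def write(self, block_no, data):
        if self.lost > 0:
            self.lost -= 1
            return  # completion reported, data never reaches the media
        self.blocks[block_no] = data

    def read(self, block_no):
        return self.blocks.get(block_no)


def write_verify(disk, block_no, data, max_attempts=3):
    """Write, read back, and rewrite until the read-back matches.

    Returns the number of write attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        disk.write(block_no, data)
        if disk.read(block_no) == data:
            return attempt
    raise IOError(f"block {block_no}: not verified after {max_attempts} attempts")


disk = FlakyDisk(lost=1)
disk.blocks[3] = b"C"                     # old contents
attempts = write_verify(disk, 3, b"C'")   # first write is lost, second lands
```

Every write pays for an extra read, which is the cost the slide refers to.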
Lost write protection: a better way
Need logical information pertaining to block identity
  Something external to data being stored
Store inode, FS block number within checksum
  Verified by file system at read time
We also add a checksum of checksum structure
[Figure: a 4KB file system block is stored in eight 520-byte sectors (8 × 520 = 4160 bytes: 4096 bytes of data plus a 64B checksum). The 64B checksum holds the block checksum (protects the 4KB FS block), the block identity data (protects against lost writes), and an embedded checksum (protects the checksum structure).]
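A sketch of the identity check follows. The field layout, the CRC-32 checksums, and the names `ChecksumStruct`, `make_struct`, and `verify` are assumptions for illustration; the slides do not give the actual 64-byte format. The example also assumes new data is written to a freshly allocated location, so a lost write leaves behind a block that belongs to some other inode or offset.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecksumStruct:
    block_cksum: int     # protects the 4KB data block
    inode: int           # block identity: owning inode ...
    fbn: int             # ... and block number within the file
    embedded_cksum: int  # protects this structure itself


def _embedded(block_cksum, inode, fbn):
    # Checksum over the other fields of the structure (format is hypothetical).
    return zlib.crc32(f"{block_cksum}:{inode}:{fbn}".encode())


def make_struct(data, inode, fbn):
    c = zlib.crc32(data)
    return ChecksumStruct(c, inode, fbn, _embedded(c, inode, fbn))


def verify(data, cs, want_inode, want_fbn):
    """Classify a read using the mismatch classes named in the talk."""
    if _embedded(cs.block_cksum, cs.inode, cs.fbn) != cs.embedded_cksum:
        return "checksum structure corrupt"
    if zlib.crc32(data) != cs.block_cksum:
        return "checksum mismatch"
    if (cs.inode, cs.fbn) != (want_inode, want_fbn):
        return "identity mismatch"
    return "ok"


# The location still holds an old block of inode 5; the write that should
# have replaced it with block 2 of inode 9 was lost.
old_data = b"old contents"
on_disk = (old_data, make_struct(old_data, inode=5, fbn=0))
result = verify(on_disk[0], on_disk[1], want_inode=9, want_fbn=2)
```

The block checksum alone would pass here; only the identity comparison flags the stale block.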
Summary: Data Corruption Classes
Checksum mismatch
  Causes: bit corruption, torn/misdirected write
  Detection: block checksum mismatch
Identity mismatch
  Causes: lost or misdirected write
  Detection: block identity mismatch
Parity mismatch
  Causes: lost write, bad parity
  Detection: RAID parity computation mismatch
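The parity mismatch class can be illustrated with a small stripe check. The sketch assumes simple byte-wise XOR parity over one stripe; `xor_parity` and `stripe_is_consistent` are hypothetical helper names, and the block contents are made up.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity):
    """Recompute parity from the data blocks and compare with stored parity."""
    return xor_parity(data_blocks) == parity


a, b, c, c_new = b"AAAA", b"BBBB", b"CCCC", b"cccc"

parity = xor_parity([a, b, c_new])  # parity was updated for the overwrite C -> C'
on_disk = [a, b, c]                 # but the write of C' was lost

consistent_before = stripe_is_consistent([a, b, c], xor_parity([a, b, c]))
consistent_after = stripe_is_consistent(on_disk, parity)
```

`consistent_before` is `True` and `consistent_after` is `False`: the stale data block no longer matches the parity computed for the new contents.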
Talk Outline
Introduction
Background
Results
  System architecture
  Overall results
  Checksum mismatch results
Lessons and Conclusion
NetApp® System
[Figure: layers of the NetApp system, top to bottom, with Autosupport alongside]
Client interface (NFS)
WAFL® file system
  • Store, verify block identity (Inode X, offset Y)
  • Detect identity discrepancy
  • Lost or misdirected writes
RAID layer
  • Parity generation
  • Reconstruction on failure
  • Data scrubbing
    – read blocks, verify parity
    – Detect parity inconsistency
    – Lost or misdirected writes, parity miscalculations
Storage layer
  • Store, verify checksum
  • Detect checksum mismatch
  • Bit corruptions, torn writes
Disk drives
Overall Numbers
What percentage of disks are affected by the different kinds of corruption?
Overall Numbers (% disks affected in 17 months of use)
~10 times fewer disks than latent sector errors
Higher % of Nearline disks affected
  Order of magnitude more than enterprise disks
Bit corruptions or torn writes affect more disks than lost or misdirected writes

Corruption type          Nearline (SATA)   Enterprise (FC)
Checksum mismatches      0.661%            0.059%
Parity inconsistencies   0.147%            0.017%
Identity discrepancies   0.042%            0.006%
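As a quick check of the "order of magnitude" comparison, the nearline-to-enterprise ratio can be computed for each row. The percentages are copied from the table above; the variable names are only for illustration.

```python
# Percentage of disks affected: (Nearline, Enterprise), from the table above.
affected = {
    "Checksum mismatches": (0.661, 0.059),
    "Parity inconsistencies": (0.147, 0.017),
    "Identity discrepancies": (0.042, 0.006),
}

ratios = {
    name: nearline / enterprise
    for name, (nearline, enterprise) in affected.items()
}

for name, ratio in ratios.items():
    print(f"{name}: nearline/enterprise = {ratio:.1f}x")
# prints 11.2x, 8.6x and 7.0x respectively
```

The ratios fall between roughly 7x and 11x, which is consistent with the slide's rounded claim.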
Checksum Mismatch (CM) Analysis
1. Factors
  • Disk class (Nearline / Enterprise)
  • Disk model
  • Disk age
  • Disk size (capacity)
  • Workload
2. Characteristics
  • CMs per corrupt disk
  • Independence
  • Spatial locality
  • Temporal locality
3. Correlations with other errors
  • Not ready conditions
  • Latent sector errors
  • System reset
4. Request type
  • Scrubs vs. FS reads etc.
Checksum Mismatch (CM) Analysis
1. Factors
  • Disk class
  • Disk model
  • Disk age
  • Disk size
  • Workload
2. Characteristics
3. Correlations with other errors
4. Request type
Factors
Do disk class, model, or age affect development of checksum mismatches?
Disk class: Nearline (SATA) or Enterprise (FC)
Disk model: Specific disk drive product (say Vendor V’s disk product P of capacity 80 GB)
Disk age: Time in the field since ship date
Can we use these factors to determine corruption handling policies or mechanisms?
Ex: Aggressive scrubbing for some disks
Class, Model, Age – Nearline
Fraction of disks affected varies across models
  From 0.27% to 3.51%
  More than 3%
    4 out of 6 models
Response to age also varies
[Figure: % of disks with at least 1 CM (0.0% to 4.0%) vs. disk age in months (0 to 18), Nearline disk models]
Class, Model, Age – Enterprise
Fraction of disks affected varies across models
  From 0% to 0.17%
  All less than lowest Nearline (0.27%)
Response to age also varies
[Figure: % of disks with at least 1 CM (0.00% to 0.18%) vs. disk age in months (0 to 18), Enterprise disk models]
Factors – Summary
Class, Model matter
  Nearline disks require greater attention
Effect of age is unclear
  Cannot use age-specific corruption handling
Checksum Mismatch (CM) Analysis
1. Factors
2. Characteristics
  • CMs per corrupt disk
  • Independence
  • Spatial locality
  • Temporal locality
3. Correlations with other errors
4. Request type
Checksum Mismatches per Corrupt Disk
Corrupt disk: A disk with at least 1 checksum mismatch (CM)
How many CMs does a corrupt disk have?
Should we “fail-out” disks when one corruption is detected?
CMs per Corrupt Disk – Nearline
CMs per corrupt disk is low
  50% of corrupt disks have ≤ 2 CMs
  90% of corrupt disks have ≤ 100 CMs
Anomaly: E-1
  Develops many CMs
[Figure: % of corrupt disks with ≤ X CMs (0% to 100%) vs. number of checksum mismatches (1 to 1K), Nearline disk models]
CMs per Corrupt Disk – Enterprise
CMs per corrupt disk is higher
  50% of corrupt disks have ≤ 10 CMs (2 for Nearline)
  90% of corrupt disks have ≤ 200 CMs (100 for Nearline)
[Figure: % of corrupt disks with ≤ X CMs (0% to 100%) vs. number of checksum mismatches (1 to 1K), Enterprise disk models]
CMs per Corrupt Disk – Summary
Class and model matter
Fewer enterprise disks have CMs, but corrupt disks have more CMs
  Fail-out enterprise disks on first CM
Corrupt nearline disks develop fewer CMs
  There can be anomalies (Disk model E-1)
Other Characteristics
Very high spatial locality
  When multiple checksum mismatches occur, they are often for consecutive disk blocks
High temporal locality
Not independent
  Over different disks in same system
  Defect may be in common hardware components (Example: shelf controller)
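One way to put a number on spatial locality is the fraction of mismatched blocks on a disk that have another mismatched block nearby. The metric below is an assumption made for illustration, not necessarily the measure used in the study, and the block numbers are hypothetical.

```python
def fraction_with_neighbor(block_numbers, radius):
    """Fraction of mismatched blocks with another mismatch within `radius` blocks."""
    blocks = sorted(set(block_numbers))
    if not blocks:
        return 0.0
    hits = 0
    for i, b in enumerate(blocks):
        near_prev = i > 0 and b - blocks[i - 1] <= radius
        near_next = i + 1 < len(blocks) and blocks[i + 1] - b <= radius
        if near_prev or near_next:
            hits += 1
    return hits / len(blocks)


# Hypothetical mismatch locations on one disk:
# a run of consecutive blocks plus one isolated block.
cms = [1000, 1001, 1002, 1003, 90000]
locality = fraction_with_neighbor(cms, radius=1)
```

With `radius=1` the function counts only mismatches on directly adjacent blocks, which matches the "consecutive disk blocks" wording; here four of the five mismatches qualify.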
Checksum Mismatch (CM) Analysis
1. Factors
2. Characteristics
3. Correlations with other errors
4. Request type
  • Scrubs vs. FS reads etc.
Request Type
What types of disk requests detect checksum mismatches?
Is data scrubbing useful?
Request Type
Data scrubbing finds most CMs
  Nearline: 49%
  Enterprise: 73%
Reconstruction finds CMs
  Nearline: 9%
  Enterprise: 4%
[Figure: % of CMs discovered (0% to 100%) by request type, per disk model]
Request Type – Summary
Data scrubbing appears to be very useful
  Study of scrub rates, workload needed
Mismatches found during reconstruction
  Data loss without double disk failure protection [Alvarez97, Blaum94, Corbett04, Park95, Hafner05]
  More aggressive scrubbing may be needed
Interesting Behavior
Do system designers need to factor in any abnormal behavior?
Block numbers are not created equal!
Typically, each block number has 1 disk where it is corrupt
A series of block numbers are corrupt in many disks
  A block-number specific bug?
[Figure: number of disks with CM at block X (0 to 120) across the block number space, disk model E-1]
Lessons
Data corruption does occur
  Even rare errors like lost writes do occur
  Corruption handling mechanisms are essential
Very few enterprise disks develop corruption
  “Fail-out” these disks on first corruption detection
High spatial locality
  Spread out redundant data within the same disk
Lessons (contd.)
Temporal locality, consecutive blocks affected
  Corruption may occur during the same write op
  Write redundant data with separate disk requests, spaced out over time
Conclusion
Our analysis
  First large scale study of data corruption
  Corruptions detected by NetApp production systems
Data corruptions do occur
  Affect ~10 times fewer disks than latent sector errors
  Nearline (SATA) disks are most affected
  Corruption handling mechanisms are essential
Data corruption characteristics
  Depend on disk class and disk model
  Not independent (both within disk and within system)
  High spatial and temporal locality
  May occur at specific block numbers
Thank You!
Advanced Systems Lab (ADSL), University of Wisconsin-Madison
http://www.cs.wisc.edu/adsl
Advanced Technology Group (ATG), NetApp, Inc
http://www.netapp.com/company/research/
Department of Computer Science, University of Toronto
http://www.cs.toronto.edu/~bianca