An Analysis of Data Corruption in the Storage Stack

Storage Developer Conference 2008

Garth Goodson, NetApp, Inc

Bianca Schroeder, University of Toronto

Lakshmi Bairavasundaram, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, University of Wisconsin-Madison


Corruption Anecdote

There is much anecdotal evidence of data corruption
  E.g., this is a photo stored on an author’s laptop

System designers know of similar occurrences
  Data protection often based on anecdotes

Anecdotes: interesting, but not enough for system design
  A more rigorous understanding is needed


Our Analysis

First large-scale study of data corruption
  1.53 million disks in 1000s of NetApp systems

Time period
  41 months (Jan 2004 – Jun 2007)

Corruption detection
  Using various data protection techniques

Data from NetApp Autosupport Database
  Also used in latent sector error [Bairavasundaram07], disk and storage failure [Jiang08] studies


Questions we had about corruption

What kinds of corruption occur and how often?
Does disk class matter?
  Expensive enterprise (FC) disks versus cheaper nearline (SATA) disks
Does disk drive family/product matter?
Are corruption instances independent?
Do corruption instances have spatial locality?


Talk Outline

Introduction
Background
  Data corruption
  Protection techniques
Results
Lessons
Conclusion


Should we care about disk errors?

[Figure: two pie charts breaking down storage failures into disk failure, physical interconnect failure, protocol failure, and performance failure, one for nearline systems and one for high-end systems]

Source: W. Jiang et al., “Are disks the dominant contributor of storage failures?”, USENIX FAST, 2008
  Joint UIUC/NetApp system failure analysis
  44 months; 39,000 systems; 1.8 million disks


Disk system failure rates

From failure rate pie charts:
  High-end: 29% of system errors are disk errors
  Nearline: 57% of system errors are disk errors

What’s going on?
  Software is generally the same
  Hardware platforms are somewhat different
  But real difference is in the type of disk in use
    i.e., Fibre Channel vs SATA


Types of disk errors

Operational/component failures
  Fundamental problem with the drive hardware
    Bad servo, head, electronics, etc.
  Firmware bugs
    Failure to flush cache on power-down, etc.

Partial failures
  Only affects small subset of disk sectors
  Errors during writing
    Bad media, high-fly write, vibration, etc.
  Errors during reading (write was successful)
    Scratches, corrosion, thermal asperities, etc.


Unreported Disk Errors

Operational failures are easy to detect
  Usually fail-stop; something stops working

Latent sector errors are reported via SCSI errors
  Occur when a disk sector is read

What about errors that go undetected?
  Observed errors not corrected by disk’s ECC
  Cannot correct them unless detected first
  Result is usually some form of corruption


Data Corruption

Data stored on a disk block is incorrect

Many sources
  Software bugs
    File system, software RAID, device drivers, etc.
  Firmware bugs
    Disk drives, shelf controllers, adapters, etc.

Corruption is silent
  Not reported by the disk drive
  Could have greater impact than other errors


Forms of Data Corruption

Bit corruption
  Contents of existing disk block are modified
  Data being written to a disk block is corrupted

Lost writes
  Data not written but completion is reported

Misdirected writes
  Data is written to the wrong disk block

Torn writes
  Data partially written but completion is reported

In all cases, data passes disk’s internal ECC


Detecting data corruption

Basic idea:
  1. Generate checksum of data (64 Bytes/4KB)
  2. Store checksum along with data (4KB FS block)
  3. Verify checksum whenever reading data

Simple checksum has limited protection
  Detects bit corruption and torn (partial) writes
  No protection against lost or misdirected writes
    Since data was not overwritten
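The three steps above can be sketched in a few lines of Python. This is only an illustration: CRC32 from the standard library stands in for the real checksum, which the slides do not specify, and the names `write_block`, `read_block`, and `flip_bit` are hypothetical.

```python
import zlib
from typing import Tuple

BLOCK_SIZE = 4096  # 4KB file system block, as on the slide


def make_checksum(data: bytes) -> int:
    """CRC32 of one block (an assumed stand-in for the real checksum)."""
    return zlib.crc32(data)


def write_block(data: bytes) -> Tuple[bytes, int]:
    """Steps 1 and 2: generate the checksum and store it with the data."""
    assert len(data) == BLOCK_SIZE
    return data, make_checksum(data)


def read_block(stored: Tuple[bytes, int]) -> bytes:
    """Step 3: verify the checksum on every read."""
    data, cksum = stored
    if make_checksum(data) != cksum:
        raise IOError("checksum mismatch")
    return data


def flip_bit(data: bytes, bit: int) -> bytes:
    """Simulate bit corruption by flipping a single bit of the block."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)
```

Reading back an unmodified block succeeds, while reading a block passed through `flip_bit` raises `IOError`, because the stored checksum no longer matches the data.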


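Checksum problems: lost writes

[Figure: a RAID stripe with data blocks A, B, C on disks Data 1 to Data 3 and P(ABC) on the Parity disk, each stored with its block checksum. Overwrite C→C’ updates the parity to P(ABC’), but the write of C’ is lost. On a read of file ABC’, the disk still holds C with the matching cksum(C), so the checksum passes and the system returns corrupt data (ABC, with C instead of C’).]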
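The lost write scenario can be reproduced with a toy model. The sketch below assumes a hypothetical in-memory `Disk` class with a fault-injection flag; it is not how any real drive or NetApp system is structured. It shows why a per-block checksum alone does not help: the stale block and its stale checksum still agree with each other.

```python
import zlib


class Disk:
    """Toy disk mapping block number -> (data, checksum). Hypothetical."""

    def __init__(self):
        self.blocks = {}
        self.drop_next_write = False  # fault injection for a lost write

    def write(self, blkno, data):
        if self.drop_next_write:
            self.drop_next_write = False
            return True  # completion is reported, but nothing is stored
        self.blocks[blkno] = (data, zlib.crc32(data))
        return True

    def read(self, blkno):
        data, cksum = self.blocks[blkno]
        if zlib.crc32(data) != cksum:
            raise IOError("checksum mismatch")
        return data


disk = Disk()
disk.write(2, b"C")           # original contents
disk.drop_next_write = True
disk.write(2, b"C'")          # overwrite is lost, yet reported as done
stale = disk.read(2)          # checksum verifies; old data comes back
```

After the lost overwrite, `disk.read(2)` raises no error and `stale` holds the old contents.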


Write verify: a partial solution

Attempt to solve lost write problem
  Costly solution, expect good protection

Procedure:
  1. Write data to disk
  2. Read back to verify
  3. If lost write detected, write again or remap to new location

[Figure: overwrite C→C’ is lost, leaving C and cksum(C) on disk; the read back returns C, so the lost write is detected and C’ is written again; the disk then holds C’ and cksum(C’)]
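A minimal sketch of the write-verify procedure, assuming a hypothetical `FlakyDisk` that silently drops a configurable number of writes. A real implementation would also have to make sure the read back comes from the media rather than from a cache, which this toy model ignores.

```python
class FlakyDisk:
    """Toy disk that silently drops its first `lost` writes (hypothetical)."""

    def __init__(self, lost=1):
        self.blocks = {}
        self.lost = lost

    def write(self, blkno, data):
        if self.lost > 0:
            self.lost -= 1  # write is dropped but still "completes"
            return
        self.blocks[blkno] = data

    def read(self, blkno):
        return self.blocks.get(blkno)


def write_verify(disk, blkno, data, max_attempts=3):
    """Write, read back, and write again if the write was lost.

    Returns the number of write attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        disk.write(blkno, data)        # 1. write data to disk
        if disk.read(blkno) == data:   # 2. read back to verify
            return attempt
        # 3. lost write detected: the loop writes again
    raise IOError("write not verified after %d attempts" % max_attempts)
```

Every write now costs at least one extra read, which is the cost the slide refers to.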


Lost write protection: a better way

Need logical information pertaining to block identity
  Something external to data being stored

Store inode, FS block number within checksum
  Verified by file system at read time

We also add a checksum of checksum structure

[Figure: a 4KB file system block stored as eight 520-byte sectors, with a 64B checksum area holding the block checksum (protects the 4KB FS block), the block identity data (protects against lost writes), and the embedded checksum (protects the checksum structure)]
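The arithmetic behind the layout is 8 × 520 = 4160 = 4096 + 64 bytes. The sketch below packs a 64-byte checksum area with a block checksum, an identity (inode and FS block number), and an embedded checksum over the structure itself. The field sizes, their order, and the use of CRC32 are assumptions made for illustration; the slides give only the 64-byte total.

```python
import struct
import zlib

SECTOR = 520
FS_BLOCK = 4096
CKSUM_AREA = 64
assert 8 * SECTOR == FS_BLOCK + CKSUM_AREA  # 4160 bytes on disk

# Assumed layout: block checksum (4B), inode (8B), FS block number (8B),
# 40 bytes of padding, then a 4B embedded checksum over the first 60 bytes.
_BODY = struct.Struct("<IQQ40x")


def pack_checksum(data, inode, fbn):
    """Build the 64-byte checksum area for one 4KB block."""
    body = _BODY.pack(zlib.crc32(data), inode, fbn)
    return body + struct.pack("<I", zlib.crc32(body))


def verify(data, area, inode, fbn):
    """Check a block read from disk against the identity the FS expects."""
    body = area[:60]
    (embedded,) = struct.unpack("<I", area[60:])
    if zlib.crc32(body) != embedded:
        return "checksum structure corrupt"
    cksum, got_inode, got_fbn = _BODY.unpack(body)
    if zlib.crc32(data) != cksum:
        return "checksum mismatch"
    if (got_inode, got_fbn) != (inode, fbn):
        return "identity mismatch"
    return "ok"
```

If a write for (inode 9, block 0) is lost and the location still holds a block that belonged to (inode 7, block 3), the data and its checksum agree, but `verify` reports an identity mismatch because the file system supplies the identity it expects.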


Summary: Data Corruption Classes

Checksum mismatch
  Causes: bit corruption, torn/misdirected write
  Detection: block checksum mismatch

Identity mismatch
  Causes: lost or misdirected write
  Detection: block identity mismatch

Parity mismatch
  Causes: lost write, bad parity
  Detection: RAID parity computation mismatch
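Parity mismatch detection can be illustrated with single XOR parity across a stripe. The slides do not state the parity scheme, so plain XOR is an assumption here, and `xor_parity` and `scrub_stripe` are hypothetical names.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def scrub_stripe(data_blocks, parity):
    """Recompute parity from the data blocks and compare to stored parity."""
    return xor_parity(data_blocks) == parity


a, b, c, c_new = b"AAAA", b"BBBB", b"CCCC", b"C'C'"
parity = xor_parity([a, b, c_new])          # parity updated for the overwrite
consistent = scrub_stripe([a, b, c_new], parity)
lost_write = scrub_stripe([a, b, c], parity)  # data disk still holds old C
```

Here `consistent` is true and `lost_write` is false: the stale block passes its own checksum, but the stripe no longer matches its parity.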


Talk Outline

Introduction
Background
Results
  System architecture
  Overall results
  Checksum mismatch results
Lessons and Conclusion


NetApp® System

[Figure: NetApp storage stack, top to bottom: client interface (NFS), WAFL® file system, RAID layer, storage layer, disk drives, with Autosupport shown alongside the stack]

WAFL® file system
  • Store, verify block identity (Inode X, offset Y)
  • Detect identity discrepancy
  • Lost or misdirected writes

RAID layer
  • Parity generation
  • Reconstruction on failure
  • Data scrubbing
    – Read blocks, verify parity
    – Detect parity inconsistency
    – Lost or misdirected writes, parity miscalculations

Storage layer
  • Store, verify checksum
  • Detect checksum mismatch
  • Bit corruptions, torn writes


Overall Numbers

What percentage of disks are affected by the different kinds of corruption?


Overall Numbers (% disks affected in 17 months of use)

~10 times fewer disks than latent sector errors

Higher % of Nearline disks affected
  Order of magnitude more than enterprise disks

Bit corruptions or torn writes affect more disks than lost or misdirected writes

Corruption type           Nearline (SATA)   Enterprise (FC)
Checksum mismatches       0.661%            0.059%
Parity inconsistencies    0.147%            0.017%
Identity discrepancies    0.042%            0.006%
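The "order of magnitude" claim can be checked directly from the table. The short calculation below uses only the percentages shown on the slide; the variable names are illustrative.

```python
# (% of Nearline disks affected, % of Enterprise disks affected)
rates = {
    "Checksum mismatches": (0.661, 0.059),
    "Parity inconsistencies": (0.147, 0.017),
    "Identity discrepancies": (0.042, 0.006),
}

# Ratio of affected Nearline disks to affected Enterprise disks
ratios = {kind: near / ent for kind, (near, ent) in rates.items()}

for kind, ratio in ratios.items():
    print(f"{kind}: nearline/enterprise = {ratio:.1f}x")
```

The ratios come out at roughly 11x, 9x, and 7x, so the gap is largest for checksum mismatches.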


Checksum Mismatch (CM) Analysis

1. Factors
  • Disk class (Nearline / Enterprise)
  • Disk model
  • Disk age
  • Disk size (capacity)
  • Workload
2. Characteristics
  • CMs per corrupt disk
  • Independence
  • Spatial locality
  • Temporal locality
3. Correlations with other errors
  • Not ready conditions
  • Latent sector errors
  • System reset
4. Request type
  • Scrubs vs. FS reads etc.







Factors

Do disk class, model, or age affect development of checksum mismatches?

Disk class: Nearline (SATA) or Enterprise (FC)
Disk model: Specific disk drive product (say Vendor V’s disk product P of capacity 80 GB)
Disk age: Time in the field since ship date

Can we use these factors to determine corruption handling policies or mechanisms?
  Ex: Aggressive scrubbing for some disks


Class, Model, Age – Nearline

Fraction of disks affected varies across models
  From 0.27% to 3.51%
  More than 3%: 4 out of 6 models

Response to age also varies

[Figure: % of disks with at least 1 CM (0.0% to 4.0%) versus disk age (0 to 18 months), Nearline disk models]


Class, Model, Age – Enterprise

Fraction of disks affected varies across models
  From 0% to 0.17%
  All less than lowest Nearline (0.27%)

Response to age also varies

[Figure: % of disks with at least 1 CM (0.00% to 0.18%) versus disk age (0 to 18 months), Enterprise disk models]


Factors – Summary

Class, Model matter
  Nearline disks require greater attention

Effect of age is unclear
  Cannot use age-specific corruption handling







Checksum Mismatches per Corrupt Disk

Corrupt disk: A disk with at least 1 checksum mismatch (CM)

How many CMs does a corrupt disk have?

Should we “fail-out” disks when one corruption is detected?


CMs per Corrupt Disk – Nearline

CMs per corrupt disk is low
  50% of corrupt disks have ≤ 2 CMs
  90% of corrupt disks have ≤ 100 CMs

Anomaly: E-1
  Develops many CMs

[Figure: % of corrupt disks with ≤ X CMs (0% to 100%) versus number of checksum mismatches X (1 to 1K), Nearline disk models]


CMs per Corrupt Disk – Enterprise

CMs per corrupt disk higher
  50% of corrupt disks have ≤ 10 CMs (2 for Nearline)
  90% of corrupt disks have ≤ 200 CMs (100 for Nearline)

[Figure: % of corrupt disks with ≤ X CMs (0% to 100%) versus number of checksum mismatches X (1 to 1K), Enterprise disk models]


CMs per Corrupt Disk – Summary

Class and model matter

Fewer enterprise disks have CMs, but corrupt disks have more CMs
  Fail-out enterprise disks on first CM

Corrupt nearline disks develop fewer CMs
  There can be anomalies (Disk model E-1)


Other Characteristics

Very high spatial locality
  When multiple checksum mismatches occur, they are often for consecutive disk blocks

High temporal locality

Not independent
  Over different disks in same system
  Defect may be in common hardware components (Example: shelf controller)
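One way to quantify spatial locality on a single disk is the fraction of mismatched blocks that have another mismatched block within a small distance. The sketch below is an assumed formulation with a hypothetical `radius` parameter, and the block numbers in the usage note are made up for illustration; neither comes from the slides.

```python
def spatial_locality(block_numbers, radius=1):
    """Fraction of mismatched blocks with a mismatched neighbor
    within `radius` blocks on the same disk."""
    blocks = sorted(set(block_numbers))
    if len(blocks) < 2:
        return 0.0
    near = 0
    for i, blk in enumerate(blocks):
        left = i > 0 and blk - blocks[i - 1] <= radius
        right = i + 1 < len(blocks) and blocks[i + 1] - blk <= radius
        near += left or right
    return near / len(blocks)
```

For example, `spatial_locality([100, 101, 102, 5000])` gives 0.75, since three of the four blocks are consecutive, while widely scattered block numbers give 0.0.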






Request Type

What types of disk requests detect checksum mismatches?

Is data scrubbing useful?


Request Type

Data scrubbing finds most CMs
  Nearline: 49%
  Enterprise: 73%

Reconstruction finds CMs
  Nearline: 9%
  Enterprise: 4%

[Figure: % of CMs discovered (0% to 100%) by request type, per disk model]


Request Type – Summary

Data scrubbing appears to be very useful
  Study of scrub rates, workload needed

Mismatches found during reconstruction
  Data loss without double disk failure protection [Alvarez97, Blaum94, Corbett04, Park95, Hafner05]
  More aggressive scrubbing may be needed


Interesting Behavior

Do system designers need to factor in any abnormal behavior?


Block numbers are not created equal!

Typically, each block number has 1 disk where it is corrupt
A series of block numbers are corrupt in many disks
  A block-number specific bug?

[Figure: number of disks with a CM at block X (0 to 120) across the block number space, disk model E-1]


Talk Outline

Introduction
Background
Results
Lessons
Conclusion


Lessons

Data corruption does occur
  Even rare errors like lost writes do occur
  Corruption handling mechanisms are essential

Very few enterprise disks develop corruption
  “Fail-out” these disks on first corruption detection

High spatial locality
  Spread out redundant data within the same disk


Lessons (contd.)

Temporal locality, consecutive blocks affected
  Corruption may occur during the same write op
  Write redundant data with separate disk requests, spaced out over time


Conclusion

Our analysis
  First large-scale study of data corruption
  Corruptions detected by NetApp production systems

Data corruptions do occur
  Affect ~10 times fewer disks than latent sector errors
  Nearline (SATA) disks are most affected
  Corruption handling mechanisms are essential

Data corruption characteristics
  Depend on disk class and disk model
  Not independent (both within disk and within system)
  High spatial and temporal locality
  May occur at specific block numbers


Thank You!

Advanced Systems Lab (ADSL), University of Wisconsin-Madison
http://www.cs.wisc.edu/adsl

Advanced Technology Group (ATG), NetApp, Inc
http://www.netapp.com/company/research/

Department of Computer Science, University of Toronto
http://www.cs.toronto.edu/~bianca
