Tornado Codes for Archival Storage

Matthew Woitaszek
NCAR Computational Science
1850 Table Mesa Drive
Boulder, CO 80305
Phone: 303/497-1279
Fax: 303/497-1286
E-mail: [email protected]

Presented at the THIC meeting at NCAR, Boulder CO, 19 July 2006
Key Points
Tornado Codes can provide fault-tolerant storage
- Single-site applications: better than RAID
- Distributed applications: better than replication
- Verify, then trust

Tornado Codes can work with existing Data Grids
- Low computational overhead
- Behind-the-scenes server-side fault tolerance
- No reason to alter interfaces; use existing Data Grid tools

Tornado Codes are an excellent match for MAID
- Leverage redundant data
- Optimize device mount count and limit power consumption
Outline
Background and Motivation
Experimental Method
Fault Tolerance Results
- Single Tornado Code Fault Tolerance
- Distributed and Federated System Fault Tolerance
Applications to MAID Systems
Conclusions and Future Work
Storage for High Performance Computing

Archival Storage: tape silo systems
Working Storage: disk systems
Grid Front End: GridFTP servers
Archive Management
Collaborative Storage
- Massive disk or tape arrays
- Shared using Grid technology
- Distributed data stewarding

[Figure: HPC storage architecture, including Enterprise 6000 and Origin 3200 systems]
Motivation – Filesystem Features
Typical desired features
- Performance (no waiting)
- Availability (no downtime)
- Reliability (no data loss)
- Inexpensive (no cost)
- Scalability (no limits)

Apply Tornado Codes to distributed archival storage

Optimize performance and resource utilization
- Leverage existing legacy technology
- Support emerging technologies and solutions (Grid, MAID)
Prime directive: Never lose data
LDPC Codes and Tornado Codes

Low Density Parity Check (LDPC) codes
- Gallager, 1963
- Luby, 1990s

Cascaded irregular LDPC graphs are Tornado Codes
- Average degree of connectivity
- Distribution of degree
- Randomly generated

The simple XOR operation has low computational overhead.

[Figure: bipartite graph of data nodes and check nodes, labeled A–H]
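To make the check-node mechanics concrete, here is a minimal runnable sketch; the four data blocks, node names, and edge lists below are hypothetical, not taken from the graphs in this work.

```python
# Minimal sketch of XOR-based check nodes (hypothetical toy graph).

def xor_blocks(*blocks):
    """XOR equal-length byte blocks: the only arithmetic Tornado Codes need."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

data = {"A": b"\x11", "B": b"\x22", "C": b"\x33", "D": b"\x44"}
edges = {"C1": ["A", "B", "C"], "C2": ["B", "C", "D"]}  # check -> data neighbors

# Encode: each check node stores the XOR of its neighboring data nodes.
checks = {c: xor_blocks(*(data[d] for d in nbrs)) for c, nbrs in edges.items()}

# Decode: a check node with exactly one missing neighbor recovers it by
# XOR-ing the check block with the surviving neighbors.
recovered_B = xor_blocks(checks["C1"], data["A"], data["C"])
assert recovered_B == data["B"]
```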
Storage Applications – Safety and Speed

Tornado Code advantages
- Fault tolerance
- Performance optimizations

Retrieve C or G?
- If one node is dead, the decision is easy!
- If both nodes are available, choose the node with less waiting

[Figure: data nodes and check nodes, highlighting the choice between retrieving node C or node G]
Related Work
Typhoon (Weatherspoon, 1999) – Performance
- Higher-performance alternative to Reed-Solomon coding for OceanStore's archival storage class
- Simple algorithms for Tornado Code generation

RobuSTore (Xia, 2006) – Hide latency
- LDPC codes and speculative access
- Hide latency from slow devices
This work examines the risk of data loss in distributed filesystems using Tornado Codes
Experimental Method
Construct and evaluate a Tornado Coding scheme
[Diagram: potential graphs and device failure cases are run through simulated failures to select the best graphs]

- Initial worst-case failure identification
- Extensive final graph profiling

Bound "probabilistically successful" data reconstruction
- Combine with empirical device failure rates for reliability

Final graphs used for filesystem implementation
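As an illustration of the simulated-failure step, here is a minimal Python sketch, assuming the bipartite graph is stored as a dict from each check node to the data nodes it covers; the presentation's actual harness is not shown. The exhaustive enumeration mirrors the kind of profiling reported later (e.g., all 61,124,064 five-node loss cases of a 96-node graph).

```python
# Minimal sketch of simulated-failure graph profiling; the graph
# representation ({check_node: [data_node, ...]}) is an assumption.
from itertools import combinations

def can_reconstruct(check_edges, data_nodes, lost):
    """Peeling decoder: True if every lost data node is recoverable."""
    lost = set(lost)
    missing = lost & set(data_nodes)
    usable = {c: nbrs for c, nbrs in check_edges.items() if c not in lost}
    progress = True
    while missing and progress:
        progress = False
        for nbrs in usable.values():
            unknown = [d for d in nbrs if d in missing]
            if len(unknown) == 1:           # one unknown in the XOR equation
                missing.discard(unknown[0])
                progress = True
    return not missing                      # leftovers form a closed set

def count_failures(check_edges, data_nodes, k):
    """Exhaustively test every k-node loss pattern over data and check nodes."""
    nodes = list(data_nodes) + list(check_edges)
    return sum(1 for lost in combinations(nodes, k)
               if not can_reconstruct(check_edges, data_nodes, lost))
```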
Evaluation Metrics
Reconstruction Efficiency
- Overhead: average data retrieval required to reconstruct
- Example: 48 data and 48 check nodes
  - Minimum: 48 data nodes
  - Random: average of 62 nodes (overhead 62/48 ≈ 1.29)
  - Other authors: between 1.2 and 1.4

Fault Tolerance
- First failure: how many nodes/devices can be lost?
- Reliability: probability of failure over time

Missing blocks may be…
- available but not retrieved – reconstruction efficiency
- permanently unavailable – fault tolerance
Testing Tornado Codes
Fault tolerance analysis of 96-node graphs
- Small enough to explicitly manage device state
- Large enough to take advantage of selective retrievals

Storing files using Tornado Code graphs
- Per-file encoding
- Using a log-based filesystem

[Diagram: files are encoded into data and parity stripes on object storage]
Tornado Codes for Archival Storage
Single site fault tolerance
Federated system fault tolerance
Optimization for MAID systems
First Experiences with Tornado Codes
Tornado Code graphs can fail to reconstruct because of a closed set of right nodes (subgraph identification).

Example failure cases, written as left node [right nodes]:

Failure with 2 lost nodes:
- 17 [48, 57]
- 22 [48, 57]

Failure with 3 lost nodes:
- 6 [48, 51, 57]
- 28 [57, 66, 68]
- 42 [48, 51, 66, 68]

Possible to correct by expanding the sets. Example: a corrected 96-node graph survives all 4-node failures, and only 6 of the 61,124,064 5-node cases fail.
Tornado Code Defect Detection Results
[Plot: fraction of reconstruction failures (0–1) vs. missing nodes (0–45) for the prototype graph (without defect correction) and Tornado Graph 3 before and after adjustment]

First Failure:
- Prototype graph: 2
- Tornado Graph 3, before adjustment: 4
- Tornado Graph 3, after adjustment: 5
Never trust a randomly generated Tornado Code graph without testing first!
Comparison of Single-Site Disk Configurations

Compared to familiar RAID systems (8 drawers); all configurations have 96 devices:
- RAID 5
- RAID 6
- Mirrored (RAID 10)
- Tornado Code graphs

Results describe three specific Tornado Code graphs ("Tornado 1", "Tornado 2", "Tornado 3").

[Diagram: data and parity device layouts for each configuration]
96-Device Tornado Code Fault Tolerance
First Failure:
- RAID 5: 2
- RAID 6: 3
- Mirrored: 2
- Tornado 1: 5
- Tornado 3: 5

[Plot: fraction of reconstruction failures (0–1) vs. missing nodes (0–45) for RAID 5 (*), RAID 6 (*), Mirrored, Tornado 1, and Tornado 3]
Tornado codes demonstrate better fault tolerance than mirroring with the same capacity overhead
Calculating Fault Tolerance
The probability that a given number of independent drives is lost is given by the binomial distribution:

$$P(k \text{ drives lost}) = \binom{n}{k} \, p^{k} (1-p)^{n-k}$$

Because the events "k drives lost" are mutually exclusive, the experimentally measured conditional probabilities combine by the law of total probability:

$$P(\text{fail}) = \sum_{k=0}^{n_{\text{drives}}} P(\text{fail} \mid k \text{ drives lost}) \, P(k \text{ drives lost})$$

P( n nodes offline ) with device AFR p = 0.01:

n offline   P( n offline )
0           0.38105
1           0.36950
2           0.17729
3           0.05611
4           0.01318
5           0.00245
6           0.00038
7           4.873E-05
8           5.476E-06
9           5.408E-07
10          4.753E-08
…
95          9.504E-189
96          1.000E-192

This is a worst-case fault tolerance calculation: it ignores all repairs.
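For reference, the whole slide fits in a few lines of Python; this is a hedged sketch in which the simulated conditional probabilities are supplied as a plain dict.

```python
# Sketch of this slide's calculation: binomial device losses combined with
# simulated conditional failure probabilities P(fail | k drives lost).
from math import comb

def p_k_lost(n, k, p):
    """Binomial probability that exactly k of n independent drives are lost."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_fail(n, p, cond_fail):
    """Law of total probability over the disjoint events "k drives lost".

    cond_fail maps k -> P(fail | k drives lost); untabulated counts are
    pessimistically treated as certain failure (worst case, no repairs).
    """
    return sum(cond_fail.get(k, 1.0) * p_k_lost(n, k, p)
               for k in range(n + 1))

# Spot checks against the table (96 devices, AFR p = 0.01):
print(round(p_k_lost(96, 0, 0.01), 5))       # 0.38105
print(round(p_fail(96, 0.01, {0: 0.0}), 5))  # striping: 0.61895 (any loss fails)
```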
Single Site Stripe Fault Tolerance
Method            Data   Parity   P( fail )
Individual Disk   –      –        0.01000
Striping          96     0        0.61895
RAID 5            88     8        0.04834
RAID 6            80     16       0.00164
Mirrored          48     48       0.00479
Tornado 1         48     48       1.345E-09
Tornado 2         48     48       5.947E-10
Tornado 3         48     48       5.857E-10
Tornado Codes for Archival Storage
Single site fault tolerance
Federated system fault tolerance
Optimization for MAID systems
Distributed and Federated Storage
Federated Storage
- Distributed sites maintain subsets or complete copies
- Replica selection and placement are important
  - Data locality (close to researchers)
  - Data permanence (stewarding of historical data)

[Diagram: active storage sites and a deep storage archival site connected by a high-throughput wide-area network]
Multi-graph Federated Storage
Federated storage with Tornado Codes
- Each site uses a Tornado Code storage scheme
- A different Tornado Code graph is used on each site
- Sites may exchange data blocks to aid in reconstruction

[Diagram: shared data nodes protected by private check nodes at Site 1 and Site 2]
Multi-graph Federated Storage
System                  First Failure Detected (guided search)
Mirrored (4 copies)     4
Tornado 1 + Tornado 1   10
Tornado 1 + Tornado 2   17
Tornado 1 + Tornado 3   17
Tornado 2 + Tornado 3   19
Cooperative graphs greatly improve the first-failure point. Not all sites need to use Tornado Codes, just the ability to exchange missing data nodes.
Federated Storage Applications
Federated Tornado Code storage systems can reconstruct data even when no single site can do so independently.
Applications
- Data stewarding
- Collaborative storage and data grids

Fits into existing technologies
- Grid Bricks, MAID, tape systems
- Data Grid interfaces (SRB, SRM/DataMover, GridFTP)
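A toy, self-contained illustration of the reconstruction claim above; the three data blocks, the two site graphs, and the surviving check nodes are all hypothetical.

```python
# Minimal sketch of multi-graph federated reconstruction: two sites protect
# the SAME data nodes with different private check graphs, and decoding may
# merge whatever check equations survive at both sites.

def peel(checks, known):
    """Recover data nodes from check equations that have a single unknown."""
    known, changed = set(known), True
    while changed:
        changed = False
        for neighbors in checks.values():
            unknown = [d for d in neighbors if d not in known]
            if len(unknown) == 1:
                known.add(unknown[0])
                changed = True
    return known

data = {"a", "b", "c"}
site1 = {"c1": ["a", "b"], "c2": ["b", "c"]}       # Site 1's private checks
site2 = {"d1": ["a", "c"], "d2": ["a", "b", "c"]}  # Site 2's private checks

known = {"b"}               # data nodes "a" and "c" were lost
# Suppose only check nodes c2 and d1 survive:
print(peel({"c2": site1["c2"]}, known) == data)    # False: Site 1 alone fails
print(peel({"d1": site2["d1"]}, known) == data)    # False: Site 2 alone fails
merged = {"c2": site1["c2"], "d1": site2["d1"]}
print(peel(merged, known) == data)                 # True: exchanging blocks succeeds
```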
Tornado Codes for Archival Storage
Single site fault tolerance
Federated system fault tolerance
Optimization for MAID systems
MAID Storage – Between Disk and Tape
Massive Arrays of Idle Disks (MAID)
- Disks stored with electrical power off
- Devices activated by software to access data
- Usually archival: write once, read occasionally

Research work at CU Boulder by Grunwald et al.; recent product release from Copan Systems:
- ~1000 disks
- ~10 second spin-up
- "Highly unique RAID" supporting power management
Tornado Codes provide high reliability and work well in archival storage applications
Tornado Codes and MAID Storage
MAID objectives
- Data permanence
- Power conservation

Conserve power by limiting online devices
- Leverage redundant data
- Selective device activation

Device selection policies and optimizations (see the sketch below):
1. Already-online devices
2. Random selection (collectively useful but prioritized)
3. Graph retrieval planning
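A hedged sketch of how the first two policies could be ordered in code; node_device, the needed_nodes list, and the online set are assumed data structures, not the presentation's implementation.

```python
# Sketch of MAID-aware retrieval ordering following the three priorities
# above; all names here are illustrative assumptions.
import random

def order_retrievals(needed_nodes, node_device, online):
    """Order node retrievals to limit disk spin-ups."""
    # Policy 1: prefer nodes on devices that are already spinning.
    ready = [n for n in needed_nodes if node_device[n] in online]
    # Policy 2: randomize the remainder (collectively useful but prioritized).
    rest = [n for n in needed_nodes if node_device[n] not in online]
    random.shuffle(rest)
    # Policy 3 (graph retrieval planning, next slides) would refine `rest`.
    return ready + rest
```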
Graph Retrieval Planning Example
Planning takes time (5 s for 40 prior retrievals) but can be performed in parallel with the first-stage retrievals.
[Plot: average additional retrievals (0–20) vs. random nodes retrieved before planning (32–92)]

Graph Retrieval Plan example:
- 40 initial random retrievals
- +14 average additional retrievals
- 54 total retrievals
- Overhead: 6 extra nodes (54/48 = 1.125)
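A greedy, self-contained sketch of what a retrieval plan computes; the presentation's planners use best-first and limited breadth-first search, so this only illustrates the goal: choose extra nodes until the peeling decoder can finish.

```python
# Greedy sketch of graph retrieval planning (illustrative only; the actual
# planners are best-first and limited breadth-first searches).

def peel(checks, known):
    """Data nodes recoverable from check equations with a single unknown."""
    known, changed = set(known), True
    while changed:
        changed = False
        for neighbors in checks.values():
            unknown = [d for d in neighbors if d not in known]
            if len(unknown) == 1:
                known.add(unknown[0])
                changed = True
    return known

def plan_retrievals(check_edges, data_nodes, have_data, have_checks):
    """Pick additional nodes to fetch until every data node is recoverable."""
    data_nodes = set(data_nodes)
    have_data, have_checks = set(have_data), set(have_checks)

    def recoverable(hd, hc):
        return peel({c: check_edges[c] for c in hc}, hd)

    plan = []
    while not data_nodes <= recoverable(have_data, have_checks):
        candidates = (data_nodes - have_data) | (set(check_edges) - have_checks)
        # Greedy step: fetch the node whose retrieval recovers the most data.
        best = max(candidates, key=lambda n: len(
            recoverable(have_data | ({n} & data_nodes),
                        have_checks | ({n} & set(check_edges)))))
        plan.append(best)
        (have_data if best in data_nodes else have_checks).add(best)
    return plan
```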
Graph Retrieval Planning Example
Graph retrieval planning can reduce the number of device mounts required to retrieve data
[Plot: probability of reconstruction (0–1) vs. random nodes retrieved before planning (32–92)]

- Initial 40-node random retrieval: reconstruction not possible
- Initial retrieval plus planning (40 random + 14 planned = 54): guaranteed reconstruction
- Initial 54-node random retrieval: 2.4% probability of reconstruction
Current Work
Stripe retrieval simulations
- Fixed-size 960-disk system
- Arbitrary retrieval requests

Parameters (see the sketch below)
- Number of stripes used for planning
- Number of stripes passively retrieving
- Percentage of disks online

Metrics
- Stripe retrieval time
- Device mount count
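A hypothetical configuration record for these simulations; every field name and default below is an assumption for illustration, not the study's actual values.

```python
# Hypothetical parameter record for the stripe retrieval simulations
# described above; names and defaults are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class StripeSimulationConfig:
    total_disks: int = 960         # fixed-size system
    planning_stripes: int = 4      # stripes using graph retrieval planning
    passive_stripes: int = 8       # stripes passively retrieving
    online_fraction: float = 0.25  # fraction of disks kept spinning

# Metrics collected per run: stripe retrieval time and device mount count.
```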
Conclusions and Future Work
Quantified “probabilistically successful” reconstruction for real Tornado Code graphs
- First failure is important
- Verify, then trust
- Useful for distributed and federated storage systems

Framework supports alternatives to Tornado Codes
- MIT Lincoln Erasure Codes (Cooley, 2003)
- Plank's small LDPC codes (2005, 2006)
- … and other related LDPC-based codes
Future Work
Emulated Filesystem
- Stores encoded files
- Simulates MAID behavior for evaluation
- Works on real disks for data storage

Goal: reliable archival storage in a data grid
1. Fault tolerance of real Tornado Code graphs
2. Optimizations for efficient data reconstruction
3. Operate within existing Data Grid standards
Construct a working federated archival filesystem using Tornado Codes
Acknowledgements
Thanks to:
- Jason Cope
- Dirk Grunwald
- Sean McCreary
- Michael Oberg

Computer time was provided on equipment from:
- NSF ARI Grant #CDA-9601817
- NASA AIST grant #NAG2-1646
- DOE SciDAC grant #DE-FG02-04ER63870
- NSF sponsorship of the National Center for Atmospheric Research
- IBM Shared University Research (SUR) program
96-Device Tornado vs. Alternate Fault Tolerance
[Plot: fraction of reconstruction failures (0–1) vs. missing nodes (0–45) for regular graphs (degree 4 and degree 11), altered Tornado graphs (distribution doubled; distribution shifted +1), and a Tornado graph]

First Failure:
- Regular, degree 4: 4
- Regular, degree 11: 4
- Altered Tornado (distribution doubled): 5
- Altered Tornado (distribution shifted +1): 5
- Tornado: 5
Tornado Code theory shows better results than token adjustments or simple regular bipartite graphs
96-Device Tornado vs. Alternate Fault Tolerance
[Plot: fraction of reconstruction failures (0–1) vs. missing nodes (0–45) for cascaded fixed-degree graphs (degree 3, 4, and 6) and a Tornado graph]

First Failure:
- Cascaded, degree 3: 4
- Cascaded, degree 4: 4
- Cascaded, degree 6: 5
- Tornado: 5
Cascaded fixed-degree graphs with random edge permutations: the curve shape matches, but first failure is the important characteristic.
Fault Tolerance I
Experimentally calculated results: P( fail | n nodes offline )

n Offline   RAID5      RAID6      Striping   Mirroring   g096-test02   Tornado 1   Tornado 2   Tornado 3
0           0          0          0          0           0             0           0           0
1           0          0          1          0           0             0           0           0
2           0.115789   0          1          0.010526    0.000219      0           0           0
3           0.322732   0.012318   1          0.031579    0.000665      0           0           0
4           0.562957   0.045718   1          0.062845    0.001344      0           0           0
5           0.771965   0.105712   1          0.103540    0.002263      0.000000    0.000000    0.000000
6           0.909780   0.194707   1          0.152613    0.003440      0.000001    0.000000    0.000001
7           0.975944   0.309697   1          0.209144    0.004884      0.000005    0.000003    0.000002
8           0.996782   0.442486   1          0.271360    0.006577      0.000012    0.000007    0.000006
9           1          0.581352   1          0.337706    0.008584      0.000033    0.000018    0.000019
10          1          0.711812   1          0.406102    0.010938      0.000067    0.000036    0.000037
95          1          1          1          1           1             1           1           1
96          1          1          1          1           1             1           1           1

(Tornado 1 = g096-05-03, Tornado 2 = g096-06-03, Tornado 3 = g096-tc05a)
Fault Tolerance II
The probability that a given number of independent drives is lost is given by the binomial distribution (p = 0.01):

$$P(k \text{ drives lost}) = \binom{n}{k} \, p^{k} (1-p)^{n-k}$$

Because the events "k drives lost" are mutually exclusive, the experimentally measured conditional probabilities combine by the law of total probability:

$$P(\text{fail}) = \sum_{k=0}^{n_{\text{drives}}} P(\text{fail} \mid k \text{ drives lost}) \, P(k \text{ drives lost})$$

P( n nodes offline ):

n Offline   P( n offline )
0           0.38105
1           0.36950
2           0.17729
3           0.05611
4           0.01318
5           0.00245
6           0.00038
7           4.873E-05
8           5.476E-06
9           5.408E-07
10          4.753E-08
…
95          9.504E-189
96          1.000E-192

This is a worst-case fault tolerance calculation: it ignores all repairs.
Fault Tolerance III
Independent conditional probabilities are combined per device count:

$$P(\text{fail}) = \sum_{n} P(n \text{ nodes offline}) \cdot P(\text{fail} \mid n \text{ nodes offline})$$

n Offline   RAID5       RAID6       Striping    Mirroring   g096-test02   Tornado 1   Tornado 2   Tornado 3
0           0           0           0           0           0             0           0           0
1           0           0           0.36950     0           0             0           0           0
2           2.053E-02   0.000E+00   1.773E-01   1.866E-03   3.888E-05     0           0           0
3           1.811E-02   6.912E-04   5.611E-02   1.772E-03   3.731E-05     0           0           0
4           7.418E-03   6.025E-04   1.318E-02   8.281E-04   1.771E-05     0           0           0
5           1.891E-03   2.589E-04   2.449E-03   2.536E-04   5.543E-06     4.791E-10   2.395E-10   2.395E-10
6           3.414E-04   7.306E-05   3.752E-04   5.726E-05   1.291E-06     5.466E-10   1.608E-10   1.929E-10
7           4.756E-05   1.509E-05   4.873E-05   1.019E-05   2.380E-07     2.304E-10   1.430E-10   1.073E-10
8           5.458E-06   2.423E-06   5.476E-06   1.486E-06   3.602E-08     6.743E-11   3.952E-11   3.393E-11
9           5.408E-07   3.144E-07   5.408E-07   1.826E-07   4.642E-09     1.764E-11   9.843E-12   1.001E-11
10          4.753E-08   3.383E-08   4.753E-08   1.930E-08   5.198E-10     3.168E-12   1.705E-12   1.757E-12
95          9.504E-189  9.504E-189  9.504E-189  9.504E-189  9.504E-189    9.504E-189  9.504E-189  9.504E-189
96          1.000E-192  1.000E-192  1.000E-192  1.000E-192  1.000E-192    1.000E-192  1.000E-192  1.000E-192

P( fail )   0.04834     0.00164     0.61895     0.00479     0.00010       1.34E-09    5.95E-10    5.86E-10

(Tornado 1 = g096-05-03, Tornado 2 = g096-06-03, Tornado 3 = g096-tc05a)
Graph Retrieval Plan Generation Time
[Plot: reconstruction plan generation time vs. nodes retrieved before planning (32–92), log scale roughly 1–400, comparing Best First Search and Limited Breadth First Search; average for the three final Tornado Code graphs]
Additional Retrievals by Initial Retrieval
[Plot: node retrievals for reconstruction (0–20) vs. nodes retrieved before planning (32–92), showing average, data node, and check node retrievals for Best First Search; average for the three final Tornado Code graphs]
Additional Retrievals by Data Nodes Missing
[Plot: node retrievals for reconstruction (0–20) vs. data nodes missing (0–40), showing average additional, data node, and check node retrievals for Best First Search; average for the three final Tornado Code graphs]