Tornado Codes for Archival Storage

Matthew Woitaszek
NCAR Computational Science
1850 Table Mesa Drive
Boulder, CO 80305
Phone: 303/497-1279
Fax: 303/497-1286
E-mail: [email protected]

Presented at the THIC meeting at NCAR, Boulder CO, 19 July 2006
Key Points
Tornado Codes can provide fault-tolerant storage
- Single-site applications: better than RAID
- Distributed applications: better than replication
- Verify, then trust

Tornado Codes can work with existing Data Grids
- Low computational overhead
- Behind-the-scenes server-side fault tolerance
- No reason to alter interfaces; use existing Data Grid tools

Tornado Codes are an excellent match for MAID
- Leverage redundant data
- Optimize device mount count and limit power consumption
Outline
Background and Motivation
Experimental Method
Fault Tolerance Results
- Single Tornado Code Fault Tolerance
- Distributed and Federated System Fault Tolerance
Applications to MAID Systems
Conclusions and Future Work
Storage for High Performance Computing

Archival Storage: tape silo systems
Working Storage: disk systems
Grid Front End: GridFTP servers
Archive Management
Collaborative Storage
- Massive disk or tape arrays
- Shared using Grid technology
- Distributed data stewarding

[Figure: HPC storage architecture, including Enterprise 6000 and Origin 3200 systems]
Motivation – Filesystem Features
Typical desired features
- Performance (no waiting)
- Availability (no downtime)
- Reliability (no data loss)
- Inexpensive (no cost)
- Scalability (no limits)

Apply Tornado Codes to distributed archival storage

Optimize performance and resource utilization
- Leverage existing legacy technology
- Support emerging technologies and solutions (Grid, MAID)
Prime directive: Never lose data
LDPC Codes and Tornado Codes

Low Density Parity Check (LDPC) codes
- Gallager, 1963
- Luby, 1990s

Cascaded irregular LDPC graphs are Tornado Codes
- Average degree of connectivity
- Distribution of degree
- Randomly generated

The simple XOR operation has low computational overhead.

[Figure: bipartite graph of data nodes and check nodes, labeled A–H]
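To make the check-node mechanics concrete, here is a minimal runnable sketch; the four data blocks, node names, and edge lists below are hypothetical, not taken from the graphs in this work.

```python
# Minimal sketch of XOR-based check nodes (hypothetical toy graph).

def xor_blocks(*blocks):
    """XOR equal-length byte blocks: the only arithmetic Tornado Codes need."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

data = {"A": b"\x11", "B": b"\x22", "C": b"\x33", "D": b"\x44"}
edges = {"C1": ["A", "B", "C"], "C2": ["B", "C", "D"]}  # check -> data neighbors

# Encode: each check node stores the XOR of its neighboring data nodes.
checks = {c: xor_blocks(*(data[d] for d in nbrs)) for c, nbrs in edges.items()}

# Decode: a check node with exactly one missing neighbor recovers it by
# XOR-ing the check block with the surviving neighbors.
recovered_B = xor_blocks(checks["C1"], data["A"], data["C"])
assert recovered_B == data["B"]
```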
Storage Applications – Safety and Speed

Tornado Code advantages
- Fault tolerance
- Performance optimizations

Retrieve C or G?
- If one node is dead, the decision is easy!
- If both nodes are available, choose the node with less waiting

[Figure: data nodes and check nodes, highlighting the choice between retrieving node C or node G]
Related Work
Typhoon (Weatherspoon, 1999) – Performance
- Higher-performance alternative to Reed-Solomon coding for OceanStore's archival storage class
- Simple algorithms for Tornado Code generation

RobuSTore (Xia, 2006) – Hide latency
- LDPC codes and speculative access
- Hide latency from slow devices
This work examines the risk of data loss in distributed filesystems using Tornado Codes
Experimental Method
Construct and evaluate a Tornado Coding scheme
[Diagram: potential graphs and device failure cases are run through simulated failures to select the best graphs]

- Initial worst-case failure identification
- Extensive final graph profiling

Bound "probabilistically successful" data reconstruction
- Combine with empirical device failure rates for reliability

Final graphs used for filesystem implementation
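As an illustration of the simulated-failure step, here is a minimal Python sketch, assuming the bipartite graph is stored as a dict from each check node to the data nodes it covers; the presentation's actual harness is not shown. The exhaustive enumeration mirrors the kind of profiling reported later (e.g., all 61,124,064 five-node loss cases of a 96-node graph).

```python
# Minimal sketch of simulated-failure graph profiling; the graph
# representation ({check_node: [data_node, ...]}) is an assumption.
from itertools import combinations

def can_reconstruct(check_edges, data_nodes, lost):
    """Peeling decoder: True if every lost data node is recoverable."""
    lost = set(lost)
    missing = lost & set(data_nodes)
    usable = {c: nbrs for c, nbrs in check_edges.items() if c not in lost}
    progress = True
    while missing and progress:
        progress = False
        for nbrs in usable.values():
            unknown = [d for d in nbrs if d in missing]
            if len(unknown) == 1:           # one unknown in the XOR equation
                missing.discard(unknown[0])
                progress = True
    return not missing                      # leftovers form a closed set

def count_failures(check_edges, data_nodes, k):
    """Exhaustively test every k-node loss pattern over data and check nodes."""
    nodes = list(data_nodes) + list(check_edges)
    return sum(1 for lost in combinations(nodes, k)
               if not can_reconstruct(check_edges, data_nodes, lost))
```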
Evaluation Metrics
Reconstruction Efficiency
- Overhead: average data retrieval required to reconstruct
- Example: 48 data and 48 check nodes
  - Minimum: 48 data nodes
  - Random: average of 62 nodes (overhead 62/48 ≈ 1.29)
  - Other authors: between 1.2 and 1.4

Fault Tolerance
- First failure: how many nodes/devices can be lost?
- Reliability: probability of failure over time

Missing blocks may be…
- available but not retrieved – reconstruction efficiency
- permanently unavailable – fault tolerance
Testing Tornado Codes
Fault tolerance analysis of 96-node graphs
- Small enough to explicitly manage device state
- Large enough to take advantage of selective retrievals

Storing files using Tornado Code graphs
- Per-file encoding
- Using a log-based filesystem

[Diagram: files are encoded into data and parity stripes on object storage]
Tornado Codes for Archival Storage
Single site fault tolerance
Federated system fault tolerance
Optimization for MAID systems
First Experiences with Tornado Codes
Tornado Code graphs can fail to reconstruct because of a closed set of right nodes (subgraph identification).

Example failure cases, written as left node [right nodes]:

Failure with 2 lost nodes:
- 17 [48, 57]
- 22 [48, 57]

Failure with 3 lost nodes:
- 6 [48, 51, 57]
- 28 [57, 66, 68]
- 42 [48, 51, 66, 68]

Possible to correct by expanding the sets. Example: a corrected 96-node graph survives all 4-node failures, and only 6 of the 61,124,064 5-node cases fail.
Tornado Code Defect Detection Results
[Plot: fraction of reconstruction failures (0–1) vs. missing nodes (0–45) for the prototype graph (without defect correction) and Tornado Graph 3 before and after adjustment]

First Failure:
- Prototype graph: 2
- Tornado Graph 3, before adjustment: 4
- Tornado Graph 3, after adjustment: 5
Never trust a randomly generated Tornado Code graph without testing first!
Comparison of Single-Site Disk Configurations

Compared to familiar RAID systems (8 drawers); all configurations have 96 devices:
- RAID 5
- RAID 6
- Mirrored (RAID 10)
- Tornado Code graphs

Results describe three specific Tornado Code graphs ("Tornado 1", "Tornado 2", "Tornado 3").

[Diagram: data and parity device layouts for each configuration]
96-Device Tornado Code Fault Tolerance
First Failure:
- RAID 5: 2
- RAID 6: 3
- Mirrored: 2
- Tornado 1: 5
- Tornado 3: 5

[Plot: fraction of reconstruction failures (0–1) vs. missing nodes (0–45) for RAID 5 (*), RAID 6 (*), Mirrored, Tornado 1, and Tornado 3]
Tornado codes demonstrate better fault tolerance than mirroring with the same capacity overhead
Calculating Fault Tolerance
The probability that a given number of independent drives is lost is given by the binomial distribution:

$$P(k \text{ drives lost}) = \binom{n}{k} \, p^{k} (1-p)^{n-k}$$

Because the events "k drives lost" are mutually exclusive, the experimentally measured conditional probabilities combine by the law of total probability:

$$P(\text{fail}) = \sum_{k=0}^{n_{\text{drives}}} P(\text{fail} \mid k \text{ drives lost}) \, P(k \text{ drives lost})$$

P( n nodes offline ) with device AFR p = 0.01:

n offline   P( n offline )
0           0.38105
1           0.36950
2           0.17729
3           0.05611
4           0.01318
5           0.00245
6           0.00038
7           4.873E-05
8           5.476E-06
9           5.408E-07
10          4.753E-08
…
95          9.504E-189
96          1.000E-192

This is a worst-case fault tolerance calculation: it ignores all repairs.
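For reference, the whole slide fits in a few lines of Python; this is a hedged sketch in which the simulated conditional probabilities are supplied as a plain dict.

```python
# Sketch of this slide's calculation: binomial device losses combined with
# simulated conditional failure probabilities P(fail | k drives lost).
from math import comb

def p_k_lost(n, k, p):
    """Binomial probability that exactly k of n independent drives are lost."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_fail(n, p, cond_fail):
    """Law of total probability over the disjoint events "k drives lost".

    cond_fail maps k -> P(fail | k drives lost); untabulated counts are
    pessimistically treated as certain failure (worst case, no repairs).
    """
    return sum(cond_fail.get(k, 1.0) * p_k_lost(n, k, p)
               for k in range(n + 1))

# Spot checks against the table (96 devices, AFR p = 0.01):
print(round(p_k_lost(96, 0, 0.01), 5))       # 0.38105
print(round(p_fail(96, 0.01, {0: 0.0}), 5))  # striping: 0.61895 (any loss fails)
```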
Single Site Stripe Fault Tolerance
Method            Data   Parity   P( fail )
Individual Disk   –      –        0.01000
Striping          96     0        0.61895
RAID 5            88     8        0.04834
RAID 6            80     16       0.00164
Mirrored          48     48       0.00479
Tornado 1         48     48       1.345E-09
Tornado 2         48     48       5.947E-10
Tornado 3         48     48       5.857E-10
Tornado Codes for Archival Storage
Single site fault tolerance
Federated system fault tolerance
Optimization for MAID systems
Distributed and Federated Storage
Federated Storage
- Distributed sites maintain subsets or complete copies
- Replica selection and placement are important
  - Data locality (close to researchers)
  - Data permanence (stewarding of historical data)

[Diagram: active storage sites and a deep storage archival site connected by a high-throughput wide-area network]
Multi-graph Federated Storage
Federated storage with Tornado Codes
- Each site uses a Tornado Code storage scheme
- A different Tornado Code graph is used on each site
- Sites may exchange data blocks to aid in reconstruction

[Diagram: shared data nodes protected by private check nodes at Site 1 and Site 2]
Multi-graph Federated Storage
System                  First Failure Detected (guided search)
Mirrored (4 copies)     4
Tornado 1 + Tornado 1   10
Tornado 1 + Tornado 2   17
Tornado 1 + Tornado 3   17
Tornado 2 + Tornado 3   19
Cooperative graphs greatly improve the first-failure point. Not all sites need to use Tornado Codes, just the ability to exchange missing data nodes.
Federated Storage Applications
Federated Tornado Code storage systems can reconstruct data even when no single site can do so independently.
Applications
- Data stewarding
- Collaborative storage and data grids

Fits into existing technologies
- Grid Bricks, MAID, tape systems
- Data Grid interfaces (SRB, SRM/DataMover, GridFTP)
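A toy, self-contained illustration of the reconstruction claim above; the three data blocks, the two site graphs, and the surviving check nodes are all hypothetical.

```python
# Minimal sketch of multi-graph federated reconstruction: two sites protect
# the SAME data nodes with different private check graphs, and decoding may
# merge whatever check equations survive at both sites.

def peel(checks, known):
    """Recover data nodes from check equations that have a single unknown."""
    known, changed = set(known), True
    while changed:
        changed = False
        for neighbors in checks.values():
            unknown = [d for d in neighbors if d not in known]
            if len(unknown) == 1:
                known.add(unknown[0])
                changed = True
    return known

data = {"a", "b", "c"}
site1 = {"c1": ["a", "b"], "c2": ["b", "c"]}       # Site 1's private checks
site2 = {"d1": ["a", "c"], "d2": ["a", "b", "c"]}  # Site 2's private checks

known = {"b"}               # data nodes "a" and "c" were lost
# Suppose only check nodes c2 and d1 survive:
print(peel({"c2": site1["c2"]}, known) == data)    # False: Site 1 alone fails
print(peel({"d1": site2["d1"]}, known) == data)    # False: Site 2 alone fails
merged = {"c2": site1["c2"], "d1": site2["d1"]}
print(peel(merged, known) == data)                 # True: exchanging blocks succeeds
```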
Tornado Codes for Archival Storage
Single site fault tolerance
Federated system fault tolerance
Optimization for MAID systems
MAID Storage – Between Disk and Tape
Massive Arrays of Idle Disks (MAID)
- Disks stored with electrical power off
- Devices activated by software to access data
- Usually archival: write once, read occasionally

Research work at CU Boulder by Grunwald et al.; recent product release from Copan Systems:
- ~1000 disks
- ~10 second spin-up
- "Highly unique RAID" supporting power management
Tornado Codes provide high reliability and work well in archival storage applications
Tornado Codes and MAID Storage
MAID objectives
- Data permanence
- Power conservation

Conserve power by limiting online devices
- Leverage redundant data
- Selective device activation

Device selection policies and optimizations (see the sketch below):
1. Already-online devices
2. Random selection (collectively useful but prioritized)
3. Graph retrieval planning
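A hedged sketch of how the first two policies could be ordered in code; node_device, the needed_nodes list, and the online set are assumed data structures, not the presentation's implementation.

```python
# Sketch of MAID-aware retrieval ordering following the three priorities
# above; all names here are illustrative assumptions.
import random

def order_retrievals(needed_nodes, node_device, online):
    """Order node retrievals to limit disk spin-ups."""
    # Policy 1: prefer nodes on devices that are already spinning.
    ready = [n for n in needed_nodes if node_device[n] in online]
    # Policy 2: randomize the remainder (collectively useful but prioritized).
    rest = [n for n in needed_nodes if node_device[n] not in online]
    random.shuffle(rest)
    # Policy 3 (graph retrieval planning, next slides) would refine `rest`.
    return ready + rest
```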
Graph Retrieval Planning Example
Planning takes time (5 s for 40 prior retrievals) but can be performed in parallel with the first-stage retrievals.
[Plot: average additional retrievals (0–20) vs. random nodes retrieved before planning (32–92)]

Graph Retrieval Plan example:
- 40 initial random retrievals
- +14 average additional retrievals
- 54 total retrievals
- Overhead: 6 extra nodes (54/48 = 1.125)
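A greedy, self-contained sketch of what a retrieval plan computes; the presentation's planners use best-first and limited breadth-first search, so this only illustrates the goal: choose extra nodes until the peeling decoder can finish.

```python
# Greedy sketch of graph retrieval planning (illustrative only; the actual
# planners are best-first and limited breadth-first searches).

def peel(checks, known):
    """Data nodes recoverable from check equations with a single unknown."""
    known, changed = set(known), True
    while changed:
        changed = False
        for neighbors in checks.values():
            unknown = [d for d in neighbors if d not in known]
            if len(unknown) == 1:
                known.add(unknown[0])
                changed = True
    return known

def plan_retrievals(check_edges, data_nodes, have_data, have_checks):
    """Pick additional nodes to fetch until every data node is recoverable."""
    data_nodes = set(data_nodes)
    have_data, have_checks = set(have_data), set(have_checks)

    def recoverable(hd, hc):
        return peel({c: check_edges[c] for c in hc}, hd)

    plan = []
    while not data_nodes <= recoverable(have_data, have_checks):
        candidates = (data_nodes - have_data) | (set(check_edges) - have_checks)
        # Greedy step: fetch the node whose retrieval recovers the most data.
        best = max(candidates, key=lambda n: len(
            recoverable(have_data | ({n} & data_nodes),
                        have_checks | ({n} & set(check_edges)))))
        plan.append(best)
        (have_data if best in data_nodes else have_checks).add(best)
    return plan
```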
Graph Retrieval Planning Example
Graph retrieval planning can reduce the number of device mounts required to retrieve data
[Plot: probability of reconstruction (0–1) vs. random nodes retrieved before planning (32–92)]

- Initial 40-node random retrieval: reconstruction not possible
- Initial retrieval plus planning (40 random + 14 planned = 54): guaranteed reconstruction
- Initial 54-node random retrieval: 2.4% probability of reconstruction
Current Work
Stripe retrieval simulations
- Fixed-size 960-disk system
- Arbitrary retrieval requests

Parameters (see the sketch below)
- Number of stripes used for planning
- Number of stripes passively retrieving
- Percentage of disks online

Metrics
- Stripe retrieval time
- Device mount count
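A hypothetical configuration record for these simulations; every field name and default below is an assumption for illustration, not the study's actual values.

```python
# Hypothetical parameter record for the stripe retrieval simulations
# described above; names and defaults are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class StripeSimulationConfig:
    total_disks: int = 960         # fixed-size system
    planning_stripes: int = 4      # stripes using graph retrieval planning
    passive_stripes: int = 8       # stripes passively retrieving
    online_fraction: float = 0.25  # fraction of disks kept spinning

# Metrics collected per run: stripe retrieval time and device mount count.
```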
Conclusions and Future Work
Quantified “probabilistically successful” reconstruction for real Tornado Code graphs
- First failure is important
- Verify, then trust
- Useful for distributed and federated storage systems

Framework supports alternatives to Tornado Codes
- MIT Lincoln Erasure Codes (Cooley, 2003)
- Plank's small LDPC codes (2005, 2006)
- … and other related LDPC-based codes
Future Work
Emulated Filesystem
- Stores encoded files
- Simulates MAID behavior for evaluation
- Works on real disks for data storage

Goal: reliable archival storage in a data grid
1. Fault tolerance of real Tornado Code graphs
2. Optimizations for efficient data reconstruction
3. Operate within existing Data Grid standards
Construct a working federated archival filesystem using Tornado Codes
Acknowledgements
Thanks to:
- Jason Cope
- Dirk Grunwald
- Sean McCreary
- Michael Oberg

Computer time was provided on equipment from:
- NSF ARI Grant #CDA-9601817
- NASA AIST grant #NAG2-1646
- DOE SciDAC grant #DE-FG02-04ER63870
- NSF sponsorship of the National Center for Atmospheric Research
- IBM Shared University Research (SUR) program
96-Device Tornado vs. Alternate Fault Tolerance
[Plot: fraction of reconstruction failures (0–1) vs. missing nodes (0–45) for regular graphs (degree 4 and degree 11), altered Tornado graphs (distribution doubled; distribution shifted +1), and a Tornado graph]

First Failure:
- Regular, degree 4: 4
- Regular, degree 11: 4
- Altered Tornado (distribution doubled): 5
- Altered Tornado (distribution shifted +1): 5
- Tornado: 5
Tornado Code theory shows better results than token adjustments or simple regular bipartite graphs
96-Device Tornado vs. Alternate Fault Tolerance
[Plot: fraction of reconstruction failures (0–1) vs. missing nodes (0–45) for cascaded fixed-degree graphs (degree 3, 4, and 6) and a Tornado graph]

First Failure:
- Cascaded, degree 3: 4
- Cascaded, degree 4: 4
- Cascaded, degree 6: 5
- Tornado: 5
Cascaded fixed-degree graphs with random edge permutations: the curve shape matches, but first failure is the important characteristic.
Fault Tolerance I
Experimentally calculated results: P( fail | n nodes offline )

n Offline   RAID5      RAID6      Striping   Mirroring   g096-test02   Tornado 1   Tornado 2   Tornado 3
0           0          0          0          0           0             0           0           0
1           0          0          1          0           0             0           0           0
2           0.115789   0          1          0.010526    0.000219      0           0           0
3           0.322732   0.012318   1          0.031579    0.000665      0           0           0
4           0.562957   0.045718   1          0.062845    0.001344      0           0           0
5           0.771965   0.105712   1          0.103540    0.002263      0.000000    0.000000    0.000000
6           0.909780   0.194707   1          0.152613    0.003440      0.000001    0.000000    0.000001
7           0.975944   0.309697   1          0.209144    0.004884      0.000005    0.000003    0.000002
8           0.996782   0.442486   1          0.271360    0.006577      0.000012    0.000007    0.000006
9           1          0.581352   1          0.337706    0.008584      0.000033    0.000018    0.000019
10          1          0.711812   1          0.406102    0.010938      0.000067    0.000036    0.000037
95          1          1          1          1           1             1           1           1
96          1          1          1          1           1             1           1           1

(Tornado 1 = g096-05-03, Tornado 2 = g096-06-03, Tornado 3 = g096-tc05a)
Fault Tolerance II
The probability that a given number of independent drives is lost is given by the binomial distribution (p = 0.01):

$$P(k \text{ drives lost}) = \binom{n}{k} \, p^{k} (1-p)^{n-k}$$

Because the events "k drives lost" are mutually exclusive, the experimentally measured conditional probabilities combine by the law of total probability:

$$P(\text{fail}) = \sum_{k=0}^{n_{\text{drives}}} P(\text{fail} \mid k \text{ drives lost}) \, P(k \text{ drives lost})$$

P( n nodes offline ):

n Offline   P( n offline )
0           0.38105
1           0.36950
2           0.17729
3           0.05611
4           0.01318
5           0.00245
6           0.00038
7           4.873E-05
8           5.476E-06
9           5.408E-07
10          4.753E-08
…
95          9.504E-189
96          1.000E-192

This is a worst-case fault tolerance calculation: it ignores all repairs.
Fault Tolerance III
Independent conditional probabilities are combined per device count:

$$P(\text{fail}) = \sum_{n} P(n \text{ nodes offline}) \cdot P(\text{fail} \mid n \text{ nodes offline})$$

n Offline   RAID5       RAID6       Striping    Mirroring   g096-test02   Tornado 1   Tornado 2   Tornado 3
0           0           0           0           0           0             0           0           0
1           0           0           0.36950     0           0             0           0           0
2           2.053E-02   0.000E+00   1.773E-01   1.866E-03   3.888E-05     0           0           0
3           1.811E-02   6.912E-04   5.611E-02   1.772E-03   3.731E-05     0           0           0
4           7.418E-03   6.025E-04   1.318E-02   8.281E-04   1.771E-05     0           0           0
5           1.891E-03   2.589E-04   2.449E-03   2.536E-04   5.543E-06     4.791E-10   2.395E-10   2.395E-10
6           3.414E-04   7.306E-05   3.752E-04   5.726E-05   1.291E-06     5.466E-10   1.608E-10   1.929E-10
7           4.756E-05   1.509E-05   4.873E-05   1.019E-05   2.380E-07     2.304E-10   1.430E-10   1.073E-10
8           5.458E-06   2.423E-06   5.476E-06   1.486E-06   3.602E-08     6.743E-11   3.952E-11   3.393E-11
9           5.408E-07   3.144E-07   5.408E-07   1.826E-07   4.642E-09     1.764E-11   9.843E-12   1.001E-11
10          4.753E-08   3.383E-08   4.753E-08   1.930E-08   5.198E-10     3.168E-12   1.705E-12   1.757E-12
95          9.504E-189  9.504E-189  9.504E-189  9.504E-189  9.504E-189    9.504E-189  9.504E-189  9.504E-189
96          1.000E-192  1.000E-192  1.000E-192  1.000E-192  1.000E-192    1.000E-192  1.000E-192  1.000E-192

P( fail )   0.04834     0.00164     0.61895     0.00479     0.00010       1.34E-09    5.95E-10    5.86E-10

(Tornado 1 = g096-05-03, Tornado 2 = g096-06-03, Tornado 3 = g096-tc05a)
Graph Retrieval Plan Generation Time
[Plot: reconstruction plan generation time vs. nodes retrieved before planning (32–92), log scale roughly 1–400, comparing Best First Search and Limited Breadth First Search; average for the three final Tornado Code graphs]
Additional Retrievals by Initial Retrieval
[Plot: node retrievals for reconstruction (0–20) vs. nodes retrieved before planning (32–92), showing average, data node, and check node retrievals for Best First Search; average for the three final Tornado Code graphs]
Additional Retrievals by Data Nodes Missing
[Plot: node retrievals for reconstruction (0–20) vs. data nodes missing (0–40), showing average additional, data node, and check node retrievals for Best First Search; average for the three final Tornado Code graphs]