Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Coding Techniques for Efficient, Reliable g q ,Networked Distributed Storage in Data Centers
Anwitaman DattaJoint work with Frédérique Oggier (SPMS)
School of Computer EngineeringNanyang Technological University
Infocomm Professional Development Forum 7th July 2011, Singaporey , g p
Self-* Aspects of Networked Distributed SystemsSelf Aspects of Networked Distributed Systems
Me, myself & SANDS& SANDS
htt //sands sce ntu d sg/
© 2011 A. DattaNTU Singapore
http://sands.sce.ntu.edu.sg/
http://www.ntu.edu.sg/home/anwitaman/
OutlineOo Preliminaries Data Center is at the center of the Cloud Data Center is at the center of the Cloud Failure is not an option, but …
• Known unknowns and unknown unknowns• Existing contingencies• Existing contingencies
o Fault-tolerance using erasure codes (ECs) Background
• RAID vs. ECs used as “Blackbox” o Erasure codes tailor-made for distributed networked
storageg Pyramid & hierarchical codes Regenerating codes Self-repairing codes
© 2011 A. DattaNTU Singapore
Self-repairing codes o Wrap-up
Preliminaries
o CloudPreliminaries- Cloud o Cloud
o Data center/Storage systemsDi t ib t d fil t
Cloud- Storage systems
Fault-tolerance
Tailor-made d Distributed file system
• Distributed data management– Key-Value store NoSQL Hadoop etc
codes
Wrap up
– Key-Value store, NoSQL, Hadoop, etc.
Reliable storage
© 2011 A. DattaNTU Singapore
What is the Cloud?Wo At least we can all agreePreliminaries
- Cloud o At least, we can all agree Cloud is something “big” and happening! It’s all of these
Cloud- Storage systems
Fault-tolerance
Tailor-made d Old It s all of these …
• … and some more!codes
Wrap up
SaaS
Old wine
IaaSData
center
PaaS %*$aaS
© 2011 A. DattaNTU Singapore
NIST Definition for Cloud ComputingN p gPreliminaries- Cloud o Cloud computing is a model for enablingCloud- Storage systems
Fault-tolerance
Tailor-made d
o Cloud computing is a model for enablingubiquitous, convenient, on-demand networkaccess to a shared pool of configurablecodes
Wrap up
access to a shared pool of configurablecomputing resources (e.g., networks,servers storage applications andservers, storage, applications, andservices) that can be rapidly provisionedand released with minimal managementand released with minimal managementeffort or service provider interaction.
© 2011 A. DattaNTU Singapore
Two Sides of the Cloud CoinPreliminaries- Cloud o Outside viewCloud- Storage systems
Fault-tolerance
Tailor-made d
o Outside view A “single/exclusive” entity
• Access through a “demilitarized zone”codes
Wrap up
• Access through a demilitarized zone …• API based• Agnostic to multi-tenancyg y
Infinite/elastic resources• Pay per use, on-demand, … y p
Browser based access (often)• Anytime, anywhere, any device …
© 2011 A. DattaNTU Singapore
Two Sides of the Cloud CoinPreliminaries- Cloud o Inside viewCloud- Storage systems
Fault-tolerance
Tailor-made d
o Inside view Pool of resources
• In flux: New compute units joining, old ones codes
Wrap up
p j gretiring
• Self-*: Load-balancing, fault-tolerance, auto-configuration,configuration, …
Multi-tenancy• Virtualization, transparent migration, …
Distributed file system and data-management• Google’s GFS, Amazon’s Dynamo, Yahoo!’s Pnuts
© 2011 A. DattaNTU Singapore
• Map Reduce/Hadoop, Pig, Chubby, …
The New Stack Preliminaries- Cloud
ApplicationsCloud
- Storage systems
Fault-tolerance
Tailor-made d
NoSQLe g Map‐Reduce
SQL Implementationse g PIG (relationalcodes
Wrap up
e.g., Map Reduce, Hadoop, BigTable, Hbase, Cassandra …
e.g., PIG (relational algebra), HIVE, …
Distributed File System (e.g., Key‐Value Store)
Distributed Physical Infrastructure: Storage/Compute Nodes
Reliable storage service
© 2011 A. DattaNTU Singapore
Distributed Physical Infrastructure: Storage/Compute Nodes
Disclaimer: This is a personal “view”, and may not be standard/universal
Data CenterPreliminaries- Cloud o Essentially a networked distributed storageCloud- Storage systems
Fault-tolerance
Tailor-made d
o Essentially a networked distributed storagesystem
codes
Wrap up
© 2011 A. DattaNTU Singapore
Source of topology: http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/vmware/VMware.html
Data Center Design EvolutionGen 1 DC Collocation
Gen 2 Gen 3 Gen 4 (future)Modular Data Center
g
Deployment Scale UnitDeployment Scale Unit
ServerServer RackRack ContainersContainers PrePre‐‐Assembled Assembled
Density and
ScalabilityTh d f
Right Time to Market,
ComponentsComponents
© 2011 A. DattaNTU Singapore
Capacityand Sustainability
Thousands of Servers
Lower TCO (PUE)Scalable Data Centers
Slide courtesy Roger Barga (Microsoft) from his P2P 2009 Keynote talk
Failure (of individual nodes) is inevitableFailure (of individual nodes) is inevitable
o But failure of the system is not an option!Preliminaries- Cloud o But, failure of the system is not an option!
o Solution: RedundancyCloud
- Storage systems
Fault-tolerance
Tailor-made dcodes
Wrap up
© 2011 A. DattaNTU Singapore
Is the Danger Real? Yesg
Preliminaries- CloudCloud- Storage systems
Fault-tolerance
Tailor-made dcodes
Wrap up
Cloud is
© 2011 A. DattaNTU Singapore
NSFW
Is the Danger Real? Yeso There is also the danger of data falling in
“wrong hands”, e.g. due to security breach
g
Preliminaries- Cloud g , g yCloud- Storage systems
Fault-tolerance
Tailor-made dcodes
Wrap up
o Security/privacy issues are out of the scope ofo Security/privacy issues are out of the scope of this talk@ SANDS we work on those issues also
© 2011 A. DattaNTU Singapore
A*Star TSRP project pCloud• http://sands.sce.ntu.edu.sg/pCloud/
Data Center Fault-TolerancePreliminaries o Faults are omnipresentFault-tolerance- Existing approaches- Has EC a role?
o Faults are omnipresent Hardware, network,
software, human, Tailor-made codes
Wrap up
misconfiguration, …o Cascade of failures in
interdependent networks Power failure => Network
switches stop workingswitches stop working Network failure => Control
system for power system i ff ti
© 2011 A. DattaNTU Singapore
ineffective
Redundancy Based Fault ToleranceRedundancy Based Fault Tolerance
o Replicate dataPreliminarieso Replicate data e.g., 3 or more copies In nodes on different racks
Fault-tolerance- Existing approaches- Has EC a role?
In nodes on different racks• Can deal with switch failures
o Power back up using battery between racks
Tailor-made codes
Wrap up
o Power back-up using battery between racks (Google)
© 2011 A. DattaNTU Singapore
Redundancy Based Fault ToleranceRedundancy Based Fault Tolerance
o Using “independent” physical infrastructurePreliminarieso Using independent physical infrastructure Over different availability zones (Amazon AZ)
• How independent are components in a complex
Fault-tolerance- Existing approaches- Has EC a role?
• How independent are components in a complex network?
Over multiple geographical regions
Tailor-made codes
Wrap up
p g g p g
© 2011 A. DattaNTU Singapore
Amazon’s AWS: Availability Zones yPreliminaries
Fault-tolerance- Existing approaches- Has EC a role?
Tailor-made codes
Wrap up
© 2011 A. DattaNTU Singapore Note: The recent (April 2011) AWS outage was the first region‐wide failure
Five Levels of RedundancyFive Levels of Redundancy
o PhysicalPreliminarieso Physicalo Virtual resource
A il bili
Fault-tolerance- Existing approaches- Has EC a role?
o Availability zoneo Region
Tailor-made codes
Wrap up
o Cloud
© 2011 A. DattaNTU Singapore
From: http://broadcast.oreilly.com/2011/04/the‐aws‐outage‐the‐clouds‐shining‐moment.html
At What Cost?At What Cost?Preliminaries
o Failure is not an option butFault-tolerance- Existing approaches- Has EC a role?
o Failure is not an option, but … … are the overheads acceptable?
Tailor-made codes
Wrap up
© 2011 A. DattaNTU Singapore
Reducing the Overheads of Redundancy
o Erasure codes
Reducing the Overheads of RedundancyPreliminaries
o Erasure codes Much lower storage overhead High level of fault tolerance
Fault-tolerance- Existing approaches- Has EC a role?
High level of fault-tolerance • In contrast to replication or RAID based systems
o Has the potential to significantly improve
Tailor-made codes
Wrap up
o Has the potential to significantly improve the “bottomline”
C it h t h th f d ? Can it however match the performance needs?• An open question
© 2011 A. DattaNTU Singapore
Erasure Codes (ECs)Erasure Codes (ECs)
o Originally designed for communication
B1
B
o Originally designed for communication EC(n,k)
Receive any k’ (≥ k) blocks
ge ata
O1
O2
B2 O1
O2
Encoding Decoding…… …
a = messag
nstruct D
a
Data
Reco
Ok Ok
Lost blocks
© 2011 A. DattaNTU Singapore
n encoded blocksOriginalk blocks k blocks
k
Bn
k
Erasure Codes for Networked StorageErasure Codes for Networked Storage
OB1 Retrieve any O
ject
O1
O2B2
…
k’ (≥ k) blocks
Data
O1
O2
Data = Obj
Encoding… …
construct
DecodingBl
D
Ok
… Lost blocks O i i l
Re
Ok
k blocks Bn
n encoded blocks
… Lost blocks Originalk blocks
© 2011 A. DattaNTU Singapore
(stored in storage devices in a network)
Static ResilienceStatic Resilienceo Replicated r times
F lt th t b t l t d 1
objectFor f=0.1its 10‐3Preliminaries
Faults that can be tolerated: r-1 Probability of failure: f r
Storage efficiency: 1/r
replic
replic
replic
Fault-tolerance- Existing approaches- Has EC a role?
g y Access: Find any one good replica
o Erasure coded (k of n)
ca ca ca
object
For f=0.13 of 9 code
Tailor-made codes
Wrap up
Faults that can be tolerated: n-k Probability of failure:
k k
object
Blk Blk Blk
its ~3*10‐6
Storage efficiency: k/n
jkjkn
k
jff
jknn
)1(
1
jkjkn
k
jff
jknn
)1(
1
Blk
Blk
Blk
Blk
Blk
Blk
© 2011 A. DattaNTU Singapore
Storage efficiency: k/n Access: Find k good blocks
o Assumption: Peer failure is i.i.d. with failure probability f
Replenishing Lost Redundancy for ECs p g yo Repairs needed for long-term resilience
B2
B1
…
Retrieve any k’ (≥ k) blocks
O1
O2Recreate lost blocks…
…Decoding EncodingBl
lost blocks
Re‐insert
Reinsert in (new) d i
Bn
… Lost blocks Originalk blocks
Ok storage devices, so that there is (again) n encoded blocks
© 2011 A. DattaNTU Singapore
k blocksn encoded blocks
o Repairs are expensive!
Can We Do Better?
o What is the best one can do (w.r.to repairs)?
Can We Do Better?Preliminaries
Minimize bandwidth usage per repair Minimize number of live nodes used per repair
Fault-tolerance- Existing approaches- Has EC a role?
o Erasure codes have some other drawbacks Coding/Decoding is “Expensive”
Tailor-made codes
Wrap upg g p
• In contrast to replication or RAID/XOR based systems• Systematic codes can help (with decoding/access)!
N t d t h l d b l i i l i !!– Not adequate when load-balancing is also an issue!!
More complex system design We do not attempt to address these explicitly
© 2011 A. DattaNTU Singapore
We do not attempt to address these explicitly• But, some solution we will arrive at will be amenable!
Can We Do Better?Can We Do Better?Preliminaries o Erasure codes tailor-made for distributed Fault-tolerance
Tailor-made codes- Pyramid codes
networked storage
- Regenerating Codes - Self-repairing codes
Wrap upPyramid codes
Wrap up
Regenerating codes
Self‐repairing codes
© 2011 A. DattaNTU Singapore
Pyramid & Hierarchical CodesPyramid & Hierarchical CodesPreliminaries o Essentially, “nested” use of erasure codesFault-tolerance
Tailor-made codes- Pyramid codes
Oa1
…Oa2
Bla1 loca
redund
subgro
- Regenerating Codes - Self-repairing codes
Wrap up
…Bl
an’
a2
Bg1
aldancy
gred
oup
Rep
global rethe c
Wrap up
…
Bgr’
1 globaldundancy
…
…
plicate cod
…
edundancy code group
Os1
Os2
localredundan…
Bls1
subgrou
…de group
from
ps
© 2011 A. DattaNTU Singapore
ncyBlsn’
up
Code group Multi-hierarchical extension
Pyramid & Hierarchical CodesPyramid & Hierarchical CodesPreliminaries
o Pros If `small’ number of faults
Fault-tolerance
Tailor-made codes- Pyramid codes
• Communication restricted within the `hierarchy’ suffice• Progressively go higher-up for larger number of faults• Isolated faults can be repaired independently
- Regenerating Codes - Self-repairing codes
Wrap up
p p y Naturally maps to hierarchical data-center design?
o ConsWrap up Asymmetry
• Different encoded blocks have different importance• Difficult to analyze• Complex algorithm (for decoding/repair) and system design
© 2011 A. DattaNTU Singapore
Another (essentially identical, but independent) proposal: Hierarchical Codes: How to Make Erasure Codes Attractive for Peer‐to‐Peer Storage Systems from Eurecom
Regenerating CodesRegenerating CodesPreliminaries o Network information flow based argumentsFault-tolerance
Tailor-made codes- Pyramid codes
o Network information flow based arguments to determine “optimal” trade-off of storage/repair-bandwidth
- Regenerating Codes - Self-repairing codes
Wrap up
sto age/ epa ba dw dt
Wrap up
© 2011 A. DattaNTU Singapore
Regenerating CodesRegenerating CodesPreliminaries o Example code (w/ Functional Repair)Fault-tolerance
Tailor-made codes- Pyramid codes
p ( p ) Based on random linear network coding
• Exact repairing codes have also been proposed- Regenerating Codes - Self-repairing codes
Wrap upWrap up
© 2011 A. DattaNTU Singapore
Regenerating Codesg g
Preliminaries o ProsFault-tolerance
Tailor-made codes- Pyramid codes
Network information flow analysis determines optimal (w.r.to repair bandwidth)
• Catch: Subject to MDS property of code (more on this, later)- Regenerating Codes - Self-repairing codes
Wrap up
Some proposed codes apply network coding on top of ECs
• Inherit the properties for EC for de/codingWrap up Inherit the properties for EC for de/coding
o Cons Codes for only specific points on trade-off curve
• Information flow analysis itself does not suggest any code Restrictive
• Some proposed codes can carry out repair only for one fault
© 2011 A. DattaNTU Singapore
p p y p y• Needs to contact all live nodes for repair to be optimal
Not simple: Algorithmic as well as system design
Self-repairing Codes (SRC)p g ( )
Preliminaries o Self-repairing codes are (n; k) codes s.t.Fault-tolerance
Tailor-made codes- Pyramid codes
p g ( ) encoded fragments can be repaired directly
from other subsets of encoded fragments.- Regenerating Codes - Self-repairing codes
Wrap up
a fragment can be repaired from a fixed number of encoded fragments, independently of which
Wrap up specific blocks are missing• Analogous to erasure codes supporting
reconstruction using any n k losses independentlyreconstruction using any n - k losses, independently of which
© 2011 A. DattaNTU Singapore
Self-repairing Codes: Blackbox Viewp g
Preliminaries B1 Retrieve some Fault-tolerance
Tailor-made codes- Pyramid codes
B2
…
k” (< k) blocks (e.g. k”=2)to recreate a lost block
- Regenerating Codes - Self-repairing codes
Wrap up
Bl
Re‐insert
Wrap up
… Lost blocksReinsert in (new) storage devices, so h h ( )
Bn
n encoded blocks
… Lost blocks that there is (again) n encoded blocks
© 2011 A. DattaNTU Singapore
n encoded blocks(stored in storage
devices in a network)
Self-repairing Codesp g
Preliminaries o Two families of SRCs have been proposedFault-tolerance
Tailor-made codes- Pyramid codes
p p
- Regenerating Codes - Self-repairing codes
Wrap upWrap up
© 2011 A. DattaNTU Singapore
Self-repairing Codesp g
Preliminaries o There is at least one pair to repair a node, Fault-tolerance
Tailor-made codes- Pyramid codes
p pfor up to (n -1)/2 simultaneous failures Parallel & fast repair of multiple fairs
- Regenerating Codes - Self-repairing codes
Wrap up
p po PSRC (projective geometric SRC) example Data object split in four parts: PSRC(5 3)Wrap up Data object split in four parts: PSRC(5,3)
© 2011 A. DattaNTU Singapore
Toy Example: PSRC(5,3) repairy p ( , ) p
Preliminaries
Fault-tolerance
Tailor-made codes- Pyramid codes- Regenerating Codes - Self-repairing codes
Wrap up
(o1+o2+o4) + (o1) => o2+o4R i i t dWrap up (o3) + (o2+o3) => o2
(o1) + (o2) => o1+ o2
Repair using two nodes
Four pieces needed to regenerate two pieces
Say N1 and N3
(o +o +o ) + (o ) => o +o(o2) + (o4) => o2+ o4Repair using three nodes
© 2011 A. DattaNTU Singapore
(o1+o2+o4) + (o4) => o1+o2
Three pieces needed to regenerate two pieces
Say N2, N3 and N4
Toy Example: PSRC(5,3) reconstructiony p ( , )
Preliminaries
Fault-tolerance
Tailor-made codes- Pyramid codes- Regenerating Codes - Self-repairing codes
Wrap upWrap up o3o4(o3) + (o1+o3) => o1(o1) +(o4)+(o1+o2+o4) => o2(o1) +(o4)+(o1+o2+o4) > o2
Reconstruction, say using N3, N4 and N5
© 2011 A. DattaNTU Singapore
Symmetry in SRCsy y
o All encoded blocks have symmetric rolePreliminarieso All encoded blocks have symmetric role Equivalent importance of all blocks
• for both data reconstruction & repair
Fault-tolerance
Tailor-made codes- Pyramid codes • for both data reconstruction & repair
o Symmetry is goodE t l d t d d i l t
- Regenerating Codes - Self-repairing codes
Wrap up Easy to analyze, understand and implement Simpler algorithm and system design
Wrap up
© 2011 A. DattaNTU Singapore
Maximum Distance Separable (MDS)?Maximum Distance Separable (MDS)?o SRC is not MDS (and can not be!)
D i ?Preliminaries
Does it matter?• Not much• In practice access will be “planned”
Fault-tolerance
Tailor-made codes- Pyramid codes • In practice, access will be “planned” …
PSRC needs less bandwidth than `optimal’ RGC!- Regenerating Codes - Self-repairing codes
Wrap up
This is with random access
Wrap up
© 2011 A. DattaNTU Singapore
PSRC(21,3)
Practical propertiesp po (Current) SRCs are not systematic PSRC is “like systematic”
Preliminaries PSRC is like systematic Need to contact more nodes (than k)
• To obtain systematic `pieces’
Fault-tolerance
Tailor-made codes- Pyramid codes
• Same total bandwidth usage– Parallel download for access can even be an `advantage’
• `mixed’ strategies for access, i.e. get some systematic
- Regenerating Codes - Self-repairing codes
Wrap up g , g ypieces, and some others …
– Power saving (by switching off nodes) strategies possible
Wrap up
© 2011 A. DattaNTU Singapore
Practical propertiesp po Self-repair implies
h “l ll d d bl ”Preliminaries
somewhat “locally decodable”• If access to only part of the whole object is desired
C di /d di i PSRC b h i
Fault-tolerance
Tailor-made codes- Pyramid codes
o Coding/decoding in PSRC are both using XOR operations only
- Regenerating Codes - Self-repairing codes
Wrap upWrap up
© 2011 A. DattaNTU Singapore
Outlook
o 2020: Self-repairing codes in a data-center near Preliminaries
you?o Ongoing:
Fault-tolerance
Tailor-made codes g g
Concepts/Implementation Prototype miniature
data center
Wrap up
data-center Template for
preassembled component of a modular 4G+ data center
o Interested to Follow:
© 2011 A. DattaNTU Singapore
http://sands.sce.ntu.edu.sg/CodingForNetworkedStorage/ Get involved: {anwitaman,frederique}@ntu.edu.sg