1
High-Availability LH* Schemes with Mirroring
W. Litwin, M.-A. Neimat
U. Paris 9 & HPL Palo-Alto
[email protected]
2
LH* with mirroring
A Scalable Distributed Data Structure
Data are in distributed RAM of server nodes of a multicomputer
Uses mirroring to survive:
– every single-node failure
– most multiple-node failures
Moderate performance deterioration with respect to basic LH*
3
Plan
Introduction
– multicomputers & SDDSs
– need for high availability
Principles of LH* with mirroring
Design issues
Performance
Conclusion
4
Multicomputers
A collection of loosely coupled computers
– common and/or preexisting hardware
– shared-nothing architecture
– message passing through a high-speed net
Network multicomputers
– use general-purpose nets
» LANs: Ethernet, Token Ring, Fast Ethernet, SCI, FDDI...
» WANs: ATM...
Switched multicomputers
– use a bus, e.g., Transputer & Parsytec
5
[Figure: clients and servers on a network multicomputer]
6
Why multicomputers?
Potentially unbeatable price-performance ratio
– Much cheaper and more powerful than supercomputers
» 1500 WSs at HPL with 500+ GB of RAM & TBs of disks
Potential computing power
– file size
– access and processing time
– throughput
For more pros & cons:
– NOW project (UC Berkeley)
– Tanenbaum: "Distributed Operating Systems", Prentice Hall, 1995
7
Why SDDSs
Multicomputers need data structures and file systems
Trivial extensions of traditional structures are not best:
– hot spots
– scalability
– parallel queries
– distributed and autonomous clients
8
What is an SDDS
A scalable data structure where:
Data are on servers
– always available for access
Queries come from autonomous clients
– available for access only on their own initiative
There is no centralized directory
Clients sometimes make addressing errors
» Clients have a more or less adequate image of the actual file structure
Servers are able to forward the queries to the correct address
– perhaps in several messages
Servers send Image Adjustment Messages (IAMs)
» Clients do not make the same error twice
9
An SDDS
Servers
10
An SDDS
Servers
growth through splits under inserts
11
An SDDS
growth through splits under inserts
Servers
12
An SDDS
Clients
Servers
13
An SDDS
Clients
14
Clients
An SDDS
15
Clients
IAM
An SDDS
16
Clients
An SDDS
17
Clients
An SDDS
18
Known SDDSs

[Classification diagram — Classic SDDSs (1993): Hashing → LH* schemes, DDH; 1-d tree → RP* schemes, Kroll & Widmayer, Breitbart & al; k-d tree → k-RP* schemes]
19
Known SDDSs

[Same classification diagram, with "You are here" marking the LH* schemes branch]
20
LH* (A classic)
Allows for key-based hash files
– generalizes the LH addressing schema
Load factor 70–90 %
At most 2 forwarding messages
– regardless of the size of the file
In practice, 1 msg/insert and 2 msgs/search on average
4 messages in the worst case
Search time of a ms (10 Mb/s net) and of µs (Gb/s net)
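The at-most-2-forwardings guarantee follows from the LH* addressing rules. Below is a minimal sketch in Python, assuming the classic linear-hashing functions h_i(c) = c mod 2^i, a client image (i', n'), and a per-bucket level j as in the LH* literature; function and variable names are illustrative, not the paper's notation.

```python
def h(i, key):
    # linear-hashing family: h_i(c) = c mod 2^i
    return key % (2 ** i)

def client_address(key, i_img, n_img):
    # the client computes the bucket address from its possibly
    # outdated image (i_img, n_img) of the file state
    a = h(i_img, key)
    if a < n_img:
        a = h(i_img + 1, key)
    return a

def server_forward(key, a, j):
    # bucket a with level j checks the address and, if wrong,
    # returns the next bucket to try; LH* guarantees at most
    # two such forwardings regardless of file size
    a1 = h(j, key)
    if a1 != a:
        a2 = h(j - 1, key)
        if a < a2 < a1:
            a1 = a2
    return a1  # == a when the address was already correct
```

For example, with all buckets at level j = 2 and a client whose image is still (0, 0), key 7 is first sent to bucket 0, forwarded to bucket 1, then to its correct bucket 3: two forwardings, the worst case.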
21
[Chart: 10,000 inserts — global cost vs. client's cost]
22
High-availability LH* schemes
In a large multicomputer, it is unlikely that all servers are up
Consider the probability that a bucket is up to be 99 %
– the bucket is then unavailable about 3 days per year
One stores every key in 1 bucket
– the case of typical SDDSs, LH* included
Probability that an n-bucket file is entirely up:
» 37 % for n = 100
» 0 % for n = 1000
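These figures are just P = 0.99^n, the probability that all n buckets are simultaneously up, assuming independent failures; a quick check:

```python
p_up = 0.99  # probability that a single bucket is up

def file_up(n, p=p_up):
    # an n-bucket file is entirely up only if every bucket is up
    return p ** n

print(round(file_up(100), 2))  # 0.37
print(file_up(1000))           # ≈ 4e-05, i.e. essentially 0 %
```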
23
High-availability LH* schemes
Using 2 buckets to store a key, one may expect:
– 99 % for n = 100
– 91 % for n = 1000
High-availability SDDSs
– make sense
– are the only way to make large SDDS files reliable
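With two copies of every key, data are lost only when both of its buckets are down at once, so the file survives with probability (1 − (1 − p)²)^n; a quick check of the slide's figures, again assuming independent failures:

```python
p_up = 0.99
p_pair_ok = 1 - (1 - p_up) ** 2  # at least one of the two mirrors is up

def mirrored_file_up(n):
    # the whole file is available if every key has a live copy
    return p_pair_ok ** n

print(round(mirrored_file_up(100), 3))   # 0.99
print(round(mirrored_file_up(1000), 3))  # 0.905, the slide's ~91 %
```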
24
High-availability LH* schemes
High-availability LH* schemes keep data available despite server failures
– any single server failure
– most two-server failures
– some catastrophic failures
Three types of schemes are currently known
– with mirroring
– with striping or grouping
25
LH* with Mirroring
There are two files called mirrors
Every insert propagates to both
– the propagation is done by the servers
Splits are autonomous
Every search is directed towards one of the mirrors
– the primary mirror for the corresponding client
If a bucket failure is detected, the spare is produced instantly at some site
– the storage for the failed bucket is reclaimed
– it can be allocated to another bucket
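The server-side propagation can be sketched as follows (illustrative Python; the class and field names are assumptions, not the paper's notation): each bucket applies a client insert locally, then forwards it to its mirror, and a flag stops the insert from echoing back.

```python
class Bucket:
    def __init__(self):
        self.records = {}
        self.mirror = None  # the corresponding bucket in the mirror file

    def insert(self, key, value, from_mirror=False):
        self.records[key] = value
        # the server, not the client, propagates the insert to the mirror;
        # from_mirror=True marks a propagated insert and stops the echo
        if self.mirror is not None and not from_mirror:
            self.mirror.insert(key, value, from_mirror=True)

# two mirrored buckets, one in each file
b1, b2 = Bucket(), Bucket()
b1.mirror, b2.mirror = b2, b1
b1.insert(42, "rec")    # client insert into file F1
print(b2.records[42])   # the mirror bucket in F2 received it: rec
```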
26
Basic configuration

Site 1 with file F1  ←→  Site 2 with file F2
(Mirrors)

Protection against a catastrophic failure
27
High-availability LH* schemes
Two types of LH* schemes with mirroring appear
Structurally-alike (SA) mirrors
– same file parameters
» keys are presumably in the same buckets
Structurally-dissimilar (SD) mirrors
» keys are presumably in different buckets
– loosely coupled = same LH-functions h_i
– minimally coupled = different LH-functions h_i
SA-Mirrors

[Figures (two slides): example SA-mirror files C1 (i' = 0) and C2 (i' = 3) holding the same keys in the same buckets, and the new forwarding paths created between the mirrors after splits]
30
Failure management
A bucket failure can be discovered
– by the client
– by the forwarding or mirroring server
– by the LH* split coordinator
The failure discovery triggers the instant creation of a spare bucket
– a copy of the failed bucket, constructed from the mirror file
» from one or more buckets
31
Spare creation
The spare creation process is managed by the coordinator
– choice of the node for the spare
– transfer of the records from the mirror file
» the algorithm is in the paper
– propagation of the spare node's address to the node of the failed bucket
» when the node recovers, it contacts the coordinator
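For SA-mirrors the record transfer is straightforward to sketch: since both files share the same parameters, the coordinator can rebuild the failed bucket by collecting, from the mirror file, every record that hashes to the failed address. This illustrative Python assumes h_i(c) = c mod 2^i and a dict per bucket; the exact algorithm is in the paper.

```python
def build_spare(failed_addr, mirror_buckets, level):
    # scan the mirror file for all records belonging to the failed
    # bucket; for SA-mirrors they sit in the matching bucket, but the
    # scan also covers spares rebuilt from more than one bucket
    spare = {}
    for bucket in mirror_buckets:
        for key, value in bucket.items():
            if key % (2 ** level) == failed_addr:
                spare[key] = value
    return spare

# mirror file with 4 buckets (level 2): bucket a holds keys c with c mod 4 == a
mirror = [{0: "r0", 4: "r4"}, {1: "r1"}, {2: "r2"}, {3: "r3", 7: "r7"}]
print(build_spare(3, mirror, 2))  # {3: 'r3', 7: 'r7'}
```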
32
And the client?
The client can be unaware of the failure
– it may then send the message to the failed node
» which perhaps recovered and now holds another bucket n'
Problem
– bucket n' should recognize the addressing error
– and should forward the query to the spare
» a case that did not exist for basic LH*
33
Solution
Every client sends with the query Q the address n of the bucket Q should reach
If n <> n', then bucket n' resends the query to bucket n
– which must be the spare
Bucket n sends an IAM to the client to adjust its allocation table
– a new kind of IAM
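The check can be sketched as follows (illustrative Python; the spare-address table stands in for the addresses the coordinator propagated): the receiving bucket compares the shipped address n with its own address n' and, on a mismatch, resends toward the spare, which then serves Q and sends the new kind of IAM.

```python
def receive(query, n, my_addr, spare_addr_of):
    # n is the bucket address the client shipped with query Q;
    # my_addr is the address n' of the bucket actually reached
    if n != my_addr:
        # addressing error: resend to the spare that took over
        # address n; the spare serves Q and IAMs the client so
        # its allocation table gets adjusted
        return ("resend", spare_addr_of[n])
    return ("serve", query)

# a recovered node now holds bucket 5, but the client still aims at 3
print(receive("Q", 3, 5, {3: "node-17"}))  # ('resend', 'node-17')
```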
34
SA / SD mirrors

[Figure: panels (a)–(c) comparing bucket layouts of example files F1 and F2 under SA-mirrors and SD-mirrors, with b = 8, b = 4, b = 6]
35
Comparison
SA-mirrors
– most efficient for access and spare production
– but maximal loss in the case of a two-bucket failure
Loosely-coupled SD-mirrors
– less efficient for access and spare production
– lesser loss of data for a two-bucket failure
Minimally-coupled SD-mirrors
– least efficient for access and spare production
– minimal loss for a two-bucket failure
36
Conclusion
LH* with mirroring is the first SDDS for high availability
– for large multicomputer files
– for high-availability DBs
» avoids creating fragment replicas
Variants adapted to the importance of different kinds of failures
– How important is a multiple-bucket failure?
37
Price to pay
Moderate access performance deterioration as compared to basic LH*
– an additional message to the mirror per insert
– a few messages when failures occur
Double storage for the file
– can be a drawback
38
Future directions
Implementation
Performance analysis
– in the presence of failures
Concurrency & transaction management
Other high-availability schemes
– RAID-like
40