1
High-Availability LH* Schemes with Mirroring
W. Litwin, M.-A. Neimat
U. Paris 9 & HPL Palo-Alto
[email protected]
2
LH* with mirroring
A Scalable Distributed Data Structure
Data are in distributed RAM of server nodes of a multicomputer
Uses mirroring to survive:
– every single-node failure
– most multiple-node failures
Moderate performance deterioration with respect to basic LH*
3
Plan
Introduction
– multicomputers & SDDSs
– need for high availability
Principles of LH* with mirroring
Design issues
Performance
Conclusion
4
Multicomputers
A collection of loosely coupled computers
– common and/or preexisting hardware
– shared-nothing architecture
– message passing through a high-speed net
Network multicomputers
– use general-purpose nets
» LANs: Ethernet, Token Ring, Fast Ethernet, SCI, FDDI...
» WANs: ATM...
Switched multicomputers
– use a bus, e.g., Transputer & Parsytec
5
[Figure: clients and servers on a network multicomputer]
6
Why multicomputers?
Potentially unbeatable price-performance ratio
– Much cheaper and more powerful than supercomputers
» 1500 WSs at HPL with 500+ GB of RAM & TBs of disks
Potential computing power
– file size
– access and processing time
– throughput
For more pros & cons:
– NOW project (UC Berkeley)
– Tanenbaum: "Distributed Operating Systems", Prentice Hall, 1995
7
Why SDDSs
Multicomputers need data structures and file systems
Trivial extensions of traditional structures are not best:
– hot spots
– scalability
– parallel queries
– distributed and autonomous clients
8
What is an SDDS
A scalable data structure where:
Data are on servers
– always available for access
Queries come from autonomous clients
– available for access only on their own initiative
There is no centralized directory
Clients sometimes make addressing errors
» Clients have a more or less adequate image of the actual file structure
Servers are able to forward the queries to the correct address
– perhaps in several messages
Servers send Image Adjustment Messages (IAMs)
» Clients do not make the same error twice
9
An SDDS
Servers
10
An SDDS
Servers
growth through splits under inserts
11
An SDDS
growth through splits under inserts
Servers
12
An SDDS
Clients
Servers
13
An SDDS
Clients
14
Clients
An SDDS
15
Clients
IAM
An SDDS
16
Clients
An SDDS
17
Clients
An SDDS
18
Known SDDSs

[Classification diagram — Classic SDDSs (1993): Hashing → LH* schemes, DDH; 1-d tree → RP* schemes, Kroll & Widmayer, Breitbart & al; k-d tree → k-RP* schemes]
19
Known SDDSs

[Same classification diagram, with "You are here" marking the LH* schemes branch]
20
LH* (A classic)
Allows for key-based hash files
– generalizes the LH addressing schema
Load factor 70–90 %
At most 2 forwarding messages
– regardless of the size of the file
In practice, 1 msg/insert and 2 msgs/search on average
4 messages in the worst case
Search time of a ms (10 Mb/s net) and of µs (Gb/s net)
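The at-most-2-forwardings guarantee follows from the LH* addressing rules. Below is a minimal sketch in Python, assuming the classic linear-hashing functions h_i(c) = c mod 2^i, a client image (i', n'), and a per-bucket level j as in the LH* literature; function and variable names are illustrative, not the paper's notation.

```python
def h(i, key):
    # linear-hashing family: h_i(c) = c mod 2^i
    return key % (2 ** i)

def client_address(key, i_img, n_img):
    # the client computes the bucket address from its possibly
    # outdated image (i_img, n_img) of the file state
    a = h(i_img, key)
    if a < n_img:
        a = h(i_img + 1, key)
    return a

def server_forward(key, a, j):
    # bucket a with level j checks the address and, if wrong,
    # returns the next bucket to try; LH* guarantees at most
    # two such forwardings regardless of file size
    a1 = h(j, key)
    if a1 != a:
        a2 = h(j - 1, key)
        if a < a2 < a1:
            a1 = a2
    return a1  # == a when the address was already correct
```

For example, with all buckets at level j = 2 and a client whose image is still (0, 0), key 7 is first sent to bucket 0, forwarded to bucket 1, then to its correct bucket 3: two forwardings, the worst case.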
21
[Chart: 10,000 inserts — global cost vs. client's cost]
22
High-availability LH* schemes
In a large multicomputer, it is unlikely that all servers are up
Consider the probability that a bucket is up to be 99 %
– the bucket is then unavailable about 3 days per year
One stores every key in 1 bucket
– the case of typical SDDSs, LH* included
Probability that an n-bucket file is entirely up:
» 37 % for n = 100
» 0 % for n = 1000
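These figures are just P = 0.99^n, the probability that all n buckets are simultaneously up, assuming independent failures; a quick check:

```python
p_up = 0.99  # probability that a single bucket is up

def file_up(n, p=p_up):
    # an n-bucket file is entirely up only if every bucket is up
    return p ** n

print(round(file_up(100), 2))  # 0.37
print(file_up(1000))           # ≈ 4e-05, i.e. essentially 0 %
```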
23
High-availability LH* schemes
Using 2 buckets to store a key, one may expect:
– 99 % for n = 100
– 91 % for n = 1000
High-availability SDDSs
– make sense
– are the only way to make large SDDS files reliable
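With two copies of every key, data are lost only when both of its buckets are down at once, so the file survives with probability (1 − (1 − p)²)^n; a quick check of the slide's figures, again assuming independent failures:

```python
p_up = 0.99
p_pair_ok = 1 - (1 - p_up) ** 2  # at least one of the two mirrors is up

def mirrored_file_up(n):
    # the whole file is available if every key has a live copy
    return p_pair_ok ** n

print(round(mirrored_file_up(100), 3))   # 0.99
print(round(mirrored_file_up(1000), 3))  # 0.905, the slide's ~91 %
```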
24
High-availability LH* schemes
High-availability LH* schemes keep data available despite server failures
– any single server failure
– most two-server failures
– some catastrophic failures
Three types of schemes are currently known
– with mirroring
– with striping or grouping
25
LH* with Mirroring
There are two files called mirrors
Every insert propagates to both
– the propagation is done by the servers
Splits are autonomous
Every search is directed towards one of the mirrors
– the primary mirror for the corresponding client
If a bucket failure is detected, the spare is produced instantly at some site
– the storage for the failed bucket is reclaimed
– it can be allocated to another bucket
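The server-side propagation can be sketched as follows (illustrative Python; the class and field names are assumptions, not the paper's notation): each bucket applies a client insert locally, then forwards it to its mirror, and a flag stops the insert from echoing back.

```python
class Bucket:
    def __init__(self):
        self.records = {}
        self.mirror = None  # the corresponding bucket in the mirror file

    def insert(self, key, value, from_mirror=False):
        self.records[key] = value
        # the server, not the client, propagates the insert to the mirror;
        # from_mirror=True marks a propagated insert and stops the echo
        if self.mirror is not None and not from_mirror:
            self.mirror.insert(key, value, from_mirror=True)

# two mirrored buckets, one in each file
b1, b2 = Bucket(), Bucket()
b1.mirror, b2.mirror = b2, b1
b1.insert(42, "rec")    # client insert into file F1
print(b2.records[42])   # the mirror bucket in F2 received it: rec
```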
26
Basic configuration

Site 1 with file F1  ←→  Site 2 with file F2
(Mirrors)

Protection against a catastrophic failure
27
High-availability LH* schemes
Two types of LH* schemes with mirroring appear
Structurally-alike (SA) mirrors
– same file parameters
» keys are presumably in the same buckets
Structurally-dissimilar (SD) mirrors
» keys are presumably in different buckets
– loosely coupled = same LH-functions h_i
– minimally coupled = different LH-functions h_i
SA-Mirrors

[Figures (two slides): example SA-mirror files C1 (i' = 0) and C2 (i' = 3) holding the same keys in the same buckets, and the new forwarding paths created between the mirrors after splits]
30
Failure management
A bucket failure can be discovered
– by the client
– by the forwarding or mirroring server
– by the LH* split coordinator
The failure discovery triggers the instant creation of a spare bucket
– a copy of the failed bucket, constructed from the mirror file
» from one or more buckets
31
Spare creation
The spare creation process is managed by the coordinator
– choice of the node for the spare
– transfer of the records from the mirror file
» the algorithm is in the paper
– propagation of the spare node's address to the node of the failed bucket
» when the node recovers, it contacts the coordinator
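For SA-mirrors the record transfer is straightforward to sketch: since both files share the same parameters, the coordinator can rebuild the failed bucket by collecting, from the mirror file, every record that hashes to the failed address. This illustrative Python assumes h_i(c) = c mod 2^i and a dict per bucket; the exact algorithm is in the paper.

```python
def build_spare(failed_addr, mirror_buckets, level):
    # scan the mirror file for all records belonging to the failed
    # bucket; for SA-mirrors they sit in the matching bucket, but the
    # scan also covers spares rebuilt from more than one bucket
    spare = {}
    for bucket in mirror_buckets:
        for key, value in bucket.items():
            if key % (2 ** level) == failed_addr:
                spare[key] = value
    return spare

# mirror file with 4 buckets (level 2): bucket a holds keys c with c mod 4 == a
mirror = [{0: "r0", 4: "r4"}, {1: "r1"}, {2: "r2"}, {3: "r3", 7: "r7"}]
print(build_spare(3, mirror, 2))  # {3: 'r3', 7: 'r7'}
```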
32
And the client?
The client can be unaware of the failure
– it may then send the message to the failed node
» which perhaps recovered and now holds another bucket n'
Problem
– bucket n' should recognize the addressing error
– and should forward the query to the spare
» a case that did not exist for basic LH*
33
Solution
Every client sends with the query Q the address n of the bucket Q should reach
If n <> n', then bucket n' resends the query to bucket n
– which must be the spare
Bucket n sends an IAM to the client to adjust its allocation table
– a new kind of IAM
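The check can be sketched as follows (illustrative Python; the spare-address table stands in for the addresses the coordinator propagated): the receiving bucket compares the shipped address n with its own address n' and, on a mismatch, resends toward the spare, which then serves Q and sends the new kind of IAM.

```python
def receive(query, n, my_addr, spare_addr_of):
    # n is the bucket address the client shipped with query Q;
    # my_addr is the address n' of the bucket actually reached
    if n != my_addr:
        # addressing error: resend to the spare that took over
        # address n; the spare serves Q and IAMs the client so
        # its allocation table gets adjusted
        return ("resend", spare_addr_of[n])
    return ("serve", query)

# a recovered node now holds bucket 5, but the client still aims at 3
print(receive("Q", 3, 5, {3: "node-17"}))  # ('resend', 'node-17')
```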
34
SA / SD mirrors

[Figure: panels (a)–(c) comparing bucket layouts of example files F1 and F2 under SA-mirrors and SD-mirrors, with b = 8, b = 4, b = 6]
35
Comparison
SA-mirrors
– most efficient for access and spare production
– but maximal loss in the case of a two-bucket failure
Loosely-coupled SD-mirrors
– less efficient for access and spare production
– lesser loss of data for a two-bucket failure
Minimally-coupled SD-mirrors
– least efficient for access and spare production
– minimal loss for a two-bucket failure
36
Conclusion
LH* with mirroring is the first SDDS for high availability
– for large multicomputer files
– for high-availability DBs
» avoids creating fragment replicas
Variants adapted to the importance of different kinds of failures
– How important is a multiple-bucket failure?
37
Price to pay
Moderate access performance deterioration as compared to basic LH*
– an additional message to the mirror per insert
– a few messages when failures occur
Double storage for the file
– can be a drawback
38
Future directions
Implementation
Performance analysis
– in the presence of failures
Concurrency & transaction management
Other high-availability schemes
– RAID-like
40