88
1 Scalable Distributed Scalable Distributed Data Structures Data Structures S S tate-of-the-art tate-of-the-art Part 1 Part 1 Witold Litwin Witold Litwin Paris 9 Paris 9 [email protected] [email protected]

1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 [email protected]

Embed Size (px)

Citation preview

Page 1: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

1

Scalable Distributed Scalable Distributed Data StructuresData Structures

SState-of-the-arttate-of-the-art Part 1Part 1

Scalable Distributed Scalable Distributed Data StructuresData Structures

SState-of-the-arttate-of-the-art Part 1Part 1

Witold LitwinWitold Litwin

Paris 9Paris 9

[email protected]@dauphine.fr

Page 2: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

2

PlanPlanPlanPlan

What are SDDSs ?What are SDDSs ? Why they are needed ?Why they are needed ? Where are we in 1996 ?Where are we in 1996 ?

– Existing SDDSsExisting SDDSs– Gaps & On-going workGaps & On-going work

ConclusionConclusion– Future workFuture work

Page 3: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

3

What is an SDDSWhat is an SDDSWhat is an SDDSWhat is an SDDS

A new type of data structureA new type of data structure– Specifically for Specifically for multicomputersmulticomputers

Designed for Designed for high-performancehigh-performance files : files :– horizontalhorizontal scalability to very large sizes scalability to very large sizes

» larger than any single-site filelarger than any single-site file

– parallel and distributed processing parallel and distributed processing » especially in (distributed) RAMespecially in (distributed) RAM

– access time better than for any disk fileaccess time better than for any disk file– 200 200 s under NT (100 Mb/s net, 1KB records)s under NT (100 Mb/s net, 1KB records)

– distributed autonomous clientsdistributed autonomous clients

Page 4: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

4

Killer appsKiller appsKiller appsKiller apps Storage serversStorage servers

– software & hardware scalable & HA serverssoftware & hardware scalable & HA servers– commodity component based commodity component based

» Do-It-Yourself-RAIDDo-It-Yourself-RAID

Object storage serversObject storage servers Object-relational databasesObject-relational databases WEB serversWEB servers

– like Inktomilike Inktomi

Video serversVideo servers Real-time systemsReal-time systems HP Scientific data processingHP Scientific data processing

Page 5: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

5

MulticomputersMulticomputersMulticomputersMulticomputers A collection of loosely coupled computersA collection of loosely coupled computers

– common and/or preexisting hardwarecommon and/or preexisting hardware– share nothing architectureshare nothing architecture– message passing through message passing through high-speedhigh-speed net ( net (Mb/s)Mb/s)

Network Network multicomputersmulticomputers– use general purpose netsuse general purpose nets

» LANs: Ethernet, Token Ring, Fast Ethernet, SCI, FDDI...LANs: Ethernet, Token Ring, Fast Ethernet, SCI, FDDI...» WANs: ATM...WANs: ATM...

SwitchedSwitched multicomputers multicomputers– use a bus, or a switchuse a bus, or a switch

» e.g., IBM-SP2, Parsytece.g., IBM-SP2, Parsytec

Page 6: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

6

Client Server

Network multicomputer

Page 7: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

7

Why multicomputers ?Why multicomputers ?Why multicomputers ?Why multicomputers ? Potentially unbeatable price-performance ratioPotentially unbeatable price-performance ratio

– Much cheaper and more powerful than supercomputersMuch cheaper and more powerful than supercomputers» 1500 WSs at HPL with 500+ GB of RAM & TBs of disks1500 WSs at HPL with 500+ GB of RAM & TBs of disks

Potential computing powerPotential computing power– file sizefile size– access and processing timeaccess and processing time– throughputthroughput

For more pro & cons :For more pro & cons :– Bill Gates at Microsoft Scalability DayBill Gates at Microsoft Scalability Day– NOW project (UC Berkeley)NOW project (UC Berkeley)– Tanenbaum: "Distributed Operating Systems", Prentice Hall, 1995Tanenbaum: "Distributed Operating Systems", Prentice Hall, 1995– www.microoft.com White Papers from Business Syst. Div. www.microoft.com White Papers from Business Syst. Div.

Page 8: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

8

Why SDDSsWhy SDDSsWhy SDDSsWhy SDDSs

Multicomputers need data structures and file Multicomputers need data structures and file systemssystems

Trivial extensions of traditional structures are Trivial extensions of traditional structures are not bestnot best

hot-spotshot-spots scalabilityscalability parallel queriesparallel queries distributed and autonomous clientsdistributed and autonomous clients distributed RAM & distance to datadistributed RAM & distance to data

Page 9: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

9

Distance to dataDistance to data(Jim Gray)(Jim Gray)

Distance to dataDistance to data(Jim Gray)(Jim Gray)

100 ns

1 sec

10 msec

RAM

distant RAM(gigabit net)

local disk

100 sec distant RAM(Ethernet)

Page 10: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

10

Distance to dataDistance to dataDistance to dataDistance to data

100 nsec

1 sec

10 msec

RAM

distant RAM(gigabit net)

local disk

100 sec distant RAM(Ethernet)

1 min

Page 11: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

11

Distance to dataDistance to dataDistance to dataDistance to data

100 ns

1 sec

10 msec

RAM

distant RAM(gigabit net)

local disk

100 sec distant RAM(Ethernet)

1 min

10 min

Page 12: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

12

Distance to dataDistance to dataDistance to dataDistance to data

100 ns

1 sec

10 msec

RAM

distant RAM(gigabit net)

local disk

100 sec distant RAM(Ethernet)

1 min

10 min

2 hours

Page 13: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

13

Distance to dataDistance to dataDistance to dataDistance to data

100 ns

1 sec

10 msec

RAM

distant RAM(gigabit net)

local disk

100 sec distant RAM(Ethernet)

1 min

10 min

8 days

lune

2 hours

Page 14: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

14

Economy etc.Economy etc.Economy etc.Economy etc. Price of RAM storage dropped in 1996 Price of RAM storage dropped in 1996

almost almost 1010 times times !!– $10 for 16 MB (production price)$10 for 16 MB (production price)– $30-40 for 16 MB RAM (end user price)$30-40 for 16 MB RAM (end user price)

» $47 for 32 MB (Fry’s price, Aug. 1997)$47 for 32 MB (Fry’s price, Aug. 1997)

– $1000 for 1GB$1000 for 1GB RAM storage is eternal (no mech. parts)RAM storage is eternal (no mech. parts) RAM storage can grow incrementallyRAM storage can grow incrementally NT plans for 64b addressing for VLMNT plans for 64b addressing for VLM MS plans for VLM-DBMSMS plans for VLM-DBMS

Page 15: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

15

What is an SDDSWhat is an SDDSWhat is an SDDSWhat is an SDDS A A scalablescalable data structure where: data structure where: Data are on Data are on serversservers

– always available for accessalways available for access

Queries come from autonomous Queries come from autonomous clientsclients– available for access only on their initiativeavailable for access only on their initiative

There is no centralized directoryThere is no centralized directory Clients may make Clients may make addressing errorsaddressing errors

» Clients have less or more adequate Clients have less or more adequate image image of the actual file structureof the actual file structure

Servers are able to Servers are able to forwardforward the queries to the correct address the queries to the correct address– perhaps in several messagesperhaps in several messages

Servers may send Servers may send Image Adjustment MessagesImage Adjustment Messages» Clients do not make same error twiceClients do not make same error twice

Page 16: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

16

An SDDSAn SDDSAn SDDSAn SDDS

Clients

growth through splits under inserts

Servers

Page 17: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

17

An SDDSAn SDDSAn SDDSAn SDDS

Clients

growth through splits under inserts

Servers

Page 18: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

18

An SDDSAn SDDSAn SDDSAn SDDS

Clients

growth through splits under inserts

Servers

Page 19: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

19

An SDDSAn SDDSAn SDDSAn SDDS

Clients

growth through splits under inserts

Servers

Page 20: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

20

An SDDSAn SDDSAn SDDSAn SDDS

Clients

growth through splits under inserts

Servers

Page 21: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

21

An SDDSAn SDDSAn SDDSAn SDDS

Clients

Page 22: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

22

Clients

An SDDSAn SDDSAn SDDSAn SDDS

Page 23: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

23

Clients

IAM

An SDDSAn SDDSAn SDDSAn SDDS

Page 24: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

24

Clients

An SDDSAn SDDSAn SDDSAn SDDS

Page 25: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

25

Clients

An SDDSAn SDDSAn SDDSAn SDDS

Page 26: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

26

Performance measuresPerformance measuresPerformance measuresPerformance measures

Storage costStorage cost– load factorload factor

» same definitions as for the traditional DSssame definitions as for the traditional DSs

Access costAccess cost messagingmessaging

– number of messages (rounds)number of messages (rounds)» network independentnetwork independent

– access timeaccess time

Page 27: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

27

Access performance measuresAccess performance measuresAccess performance measuresAccess performance measures

Query costQuery cost– key searchkey search

» forwarding costforwarding cost– insertinsert

» split costsplit cost– deletedelete

» merge costmerge cost– Parallel search, range search, partial match search, bulk Parallel search, range search, partial match search, bulk

insert... insert... Average & worst-case costsAverage & worst-case costs Client image convergence costClient image convergence cost New or less active client costsNew or less active client costs

Page 28: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

28

Known SDDSsKnown SDDSsKnown SDDSsKnown SDDSs

DS

Classics

Page 29: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

29

Known SDDSsKnown SDDSsKnown SDDSsKnown SDDSs

Hash

SDDS(1993)

LH* DDH

Breitbart & al

DS

Classics

Page 30: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

30

Known SDDSsKnown SDDSsKnown SDDSsKnown SDDSs

Hash

SDDS(1993)

1-d tree

LH* DDH

Breitbart & al RP* Kroll & Widmayer

DS

Classics

Page 31: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

31

Known SDDSsKnown SDDSsKnown SDDSsKnown SDDSs

Hash

SDDS(1993)

1-d tree

LH* DDH

Breitbart & al RP* Kroll & Widmayer

m-d trees

k-RP*dPi-tree

DS

Classics

Page 32: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

32

Known SDDSsKnown SDDSsKnown SDDSsKnown SDDSs

Hash

SDDS(1993)

1-d tree

LH* DDH

Breitbart & al RP* Kroll & Widmayer

m-d trees

DS

Classics

H-Avail.

LH*m, LH*gSecurity

LH*s

k-RP*dPi-tree

Page 33: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

33

Known SDDSsKnown SDDSsKnown SDDSsKnown SDDSs

Hash

SDDS(1993)

1-d tree

LH* DDH

Breitbart & al RP* Kroll & Widmayer

m-d trees

DS

Classics

H-Avail.

LH*m, LH*gSecurity

LH*s

k-RP*dPi-tree

s-availabilityLH*sa

Page 34: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

34

LH* LH* ((A classic)A classic)LH* LH* ((A classic)A classic)

Allows for the primary key (OID) based hash filesAllows for the primary key (OID) based hash files– generalizes the LH addressing schemageneralizes the LH addressing schema

Load factor 70 - 90 %Load factor 70 - 90 % At most 2 forwarding messagesAt most 2 forwarding messages

– regardless of the size of the fileregardless of the size of the file

In practice, 1 m/insert and 2 m/search on the In practice, 1 m/insert and 2 m/search on the averageaverage

4 messages in the worst case4 messages in the worst case Search time of 1 ms (10 Mb/s net), of 150 ms Search time of 1 ms (10 Mb/s net), of 150 ms

(100 Mb/s net) and of 30 us (Gb/s net)(100 Mb/s net) and of 30 us (Gb/s net)

Page 35: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

35

Overview of LHOverview of LHOverview of LHOverview of LH Extensible hash algorithmExtensible hash algorithm

– used, e.g., used, e.g., » Netscape browser (100M copies)Netscape browser (100M copies)

» LH-Server by AR (700K copies sold)LH-Server by AR (700K copies sold)– tought in most DB and DS classestought in most DB and DS classes– address space expandsaddress space expands

» to avoid overflows & access performance to avoid overflows & access performance deterioration deterioration

the file has buckets with capacity the file has buckets with capacity bb >> 1 >> 1 Hash by division Hash by division hhii : c -> c : c -> c mod 2 mod 2i i NN provides the address provides the address

h (c)h (c) of key of key cc.. Buckets split through the replacement of hBuckets split through the replacement of hii with with h h ii+1 +1 ; ; i = i =

0,1,..0,1,.. On the average, On the average, b/2 b/2 keys move towards new bucketkeys move towards new bucket

Page 36: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

36

Overview of LHOverview of LHOverview of LHOverview of LH

Basically, a split occurs when some bucket Basically, a split occurs when some bucket m m overflowsoverflows

One splits bucket One splits bucket n, n, pointed by pointer pointed by pointer nn..– usually usually m m nn

n n évolue : 0, 0,1, 0,1,..,2, 0,1..,3, 0,..,7, 0,..,2évolue : 0, 0,1, 0,1,..,2, 0,1..,3, 0,..,7, 0,..,2i i NN, 0.., 0.. One consequence => no index One consequence => no index

– characteristic of other EH schemescharacteristic of other EH schemes

Page 37: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

37

LH File EvolutionLH File EvolutionLH File EvolutionLH File Evolution

351271524

h0 ; n = 0

N = 1b = 4i = 0

hh00 : c -> : c -> 2200

0

Page 38: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

38

LH File EvolutionLH File EvolutionLH File EvolutionLH File Evolution

351271524

h1 ; n = 0

N = 1b = 4i = 0

hh11 : c -> : c -> 2211

0

Page 39: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

39

LH File EvolutionLH File EvolutionLH File EvolutionLH File Evolution

1224

h1 ; n = 0

N = 1b = 4i = 1

hh11 : c -> : c -> 2211

0

35715

1

Page 40: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

40

LH File EvolutionLH File EvolutionLH File EvolutionLH File Evolution

32581224

N = 1b = 4i = 1

hh11 : c -> : c -> 2211

0

211135715

1

hh11 hh11

Page 41: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

41

LH File EvolutionLH File EvolutionLH File EvolutionLH File Evolution

321224

N = 1b = 4i = 1

hh22 : c -> : c -> 2222

0

211135715

1

58

2

hh22 hh11 hh22

Page 42: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

42

LH File EvolutionLH File EvolutionLH File EvolutionLH File Evolution

321224

N = 1b = 4i = 1

hh22 : c -> : c -> 2222

0

33211135715

1

58

2

hh22 hh11 hh22

Page 43: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

43

LH File EvolutionLH File EvolutionLH File EvolutionLH File Evolution

321224

N = 1b = 4i = 1

hh22 : c -> : c -> 2222

0

3321

1

58

2

hh22 hh22 hh22

1135715

3

hh22

Page 44: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

44

LH File EvolutionLH File EvolutionLH File EvolutionLH File Evolution

321224

N = 1b = 4i = 2

hh22 : c -> : c -> 2222

0

3321

1

58

2

hh22 hh22 hh22

1135715

3

hh22

Page 45: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

45

LH File EvolutionLH File EvolutionLH File EvolutionLH File Evolution

EtcEtc– One starts One starts hh3 3 then then hh4 4 ......

The file can expand as much as needed The file can expand as much as needed

– without too many overflows everwithout too many overflows ever

Page 46: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

46

Addressing AlgorithmAddressing AlgorithmAddressing AlgorithmAddressing Algorithm

a <- h (i, c) a <- h (i, c)

if if n n = 0 alors exit = 0 alors exit

elseelse

if a < n then a <- h (i+1, c) ;if a < n then a <- h (i+1, c) ;

endend

Page 47: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

47

LH*LH*LH*LH*

Property of LH :Property of LH :– Given Given j = i j = i or or j = i j = i + 1, key + 1, key c c is in bucket is in bucket m m iffiff

hhj j ((cc) = ) = m ; j = i m ; j = i ou ou j = i j = i + 1+ 1» Verify yourselfVerify yourself

Ideas for LH* :Ideas for LH* :– LH addresing rule = global rule for LH* fileLH addresing rule = global rule for LH* file– every bucket at a serverevery bucket at a server– bucket level bucket level j j in the header in the header – Check the LH property when the key comes form a Check the LH property when the key comes form a

clientclient

Page 48: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

48

LH* : file structure LH* : file structure LH* : file structure LH* : file structure

j = 4

0

j = 4

1

j = 3

2

j = 3

7

j = 4

8

j = 4

9

n = 2 ; i = 3

n' = 0, i' = 0 n' = 3, i' = 2 Coordinator

Client Client

servers

Page 49: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

49

LH* : file structureLH* : file structureLH* : file structureLH* : file structure

j = 4

0

j = 4

1

j = 3

2

j = 3

7

j = 4

8

j = 4

9

n = 2 ; i = 3

n' = 0, i' = 0 n' = 3, i' = 2 Coordinator

Client Client

servers

Page 50: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

50

LH* : splitLH* : splitLH* : splitLH* : split

j = 4

0

j = 4

1

j = 3

2

j = 3

7

j = 4

8

j = 4

9

n = 2 ; i = 3

n' = 0, i' = 0 n' = 3, i' = 2 Coordinator

Client Client

servers

Page 51: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

51

LH* : splitLH* : splitLH* : splitLH* : split

j = 4

0

j = 4

1

j = 3

2

j = 3

7

j = 4

8

j = 4

9

n = 2 ; i = 3

n' = 0, i' = 0 n' = 3, i' = 2 Coordinator

Client Client

servers

Page 52: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

52

LH* : splitLH* : splitLH* : splitLH* : split

j = 4

0

j = 4

1

j = 4

2

j = 3

7

j = 4

8

j = 4

9

n = 3 ; i = 3

n' = 0, i' = 0 n' = 3, i' = 2 Coordinator

Client Client

servers

j = 4

10

Page 53: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

53

LH* Addressing SchemaLH* Addressing SchemaLH* Addressing SchemaLH* Addressing Schema

Client Client – computes the LH address computes the LH address m m of of c c using its image, using its image, – send send c c to bucket to bucket mm

ServerServer– Server Server a a getting key getting key cc, , a a = = m m in particular,in particular, computes :computes :

a' := ha' := hjj (c) ; (c) ;

if if a' = a a' = a thenthen accept accept c ; c ;

else else a'' := ha'' := hj - j - 11 (c) ;(c) ;

if if a'' > a a'' > a and and a'' a'' < < a' a' then then a' a' := := a'' ; a'' ;

send send c c to bucket to bucket a' ;a' ;

Page 54: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

54

LH* Addressing SchemaLH* Addressing SchemaLH* Addressing SchemaLH* Addressing Schema

Client Client – computes the LH address computes the LH address m m of of c c using its image, using its image,

– send send c c to bucket to bucket mm

ServerServer– Server Server a a getting key getting key cc, , a a = = m m in particular,in particular, computes :computes :

a' := ha' := hjj (c) ; (c) ;

if if a' = a a' = a thenthen accept accept c ; c ;

else else a'' := ha'' := hj - j - 11 (c) ;(c) ;

if if a'' > a a'' > a and and a'' a'' < < a' a' then then a' a' := := a'' ; a'' ;

send send c c to bucket to bucket a' ;a' ; See [LNS93] for the (long) proof

Simple ?

Page 55: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

55

Client Image AdjustementClient Image AdjustementClient Image AdjustementClient Image Adjustement The IAM consists of address The IAM consists of address a a where the client sent where the client sent c c and of and of

j j ((aa))– i' i' is presumed is presumed i i in client's in client's image.image.– n' n' is preumed value of pointer is preumed value of pointer n n in client's in client's image.image.

– initially, initially, i' i' = = n' n' = 0.= 0.

if if j j > > i' i' then then i' i' := := j - j - 1, 1, n' n' := := a +a +1 ; 1 ;

if if n' n' 2^2^i' i' then then n' n' = 0, = 0, i' i' := := i' i' +1 ;+1 ;

The algo. garantees that client image is within the file The algo. garantees that client image is within the file [LNS93][LNS93]– if there is no file contractions (merge)if there is no file contractions (merge)

Page 56: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

56

LH* : addressingLH* : addressingLH* : addressingLH* : addressing

j = 4

0

j = 4

1

j = 4

2

j = 3

7

j = 4

8

j = 4

9

n = 3 ; i = 3

n' = 0, i' = 0 n' = 3, i' = 2 Coordinateur

Client Client

servers

j = 4

10

15

Page 57: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

57

LH* : addressingLH* : addressingLH* : addressingLH* : addressing

j = 4

0

j = 4

1

j = 4

2

j = 3

7

j = 4

8

j = 4

9

n = 3 ; i = 3

n' = 0, i' = 0 n' = 3, i' = 2 Coordinateur

Client Client

servers

j = 4

10

15

Page 58: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

58

LH* : addressingLH* : addressingLH* : addressingLH* : addressing

j = 4

0

j = 4

1

j = 4

2

j = 3

7

j = 4

8

j = 4

9

n = 3 ; i = 3

n' = 0, i' = 3 n' = 3, i' = 2 Coordinateur

Client Client

servers

j = 4

10

15

j = 3

Page 59: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

59

LH* : addressingLH* : addressingLH* : addressingLH* : addressing

j = 4

0

j = 4

1

j = 4

2

j = 3

7

j = 4

8

j = 4

9

n = 3 ; i = 3

n' = 0, i' = 0 n' = 3, i' = 2 Coordinateur

Client Client

servers

j = 4

10

9

Page 60: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

60

LH* : addressingLH* : addressingLH* : addressingLH* : addressing

j = 4

0

j = 4

1

j = 4

2

j = 3

7

j = 4

8

j = 4

9

n = 3 ; i = 3

n' = 0, i' = 0 n' = 3, i' = 2 Coordinateur

Client Client

servers

j = 4

10

9

Page 61: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

61

LH* : addressingLH* : addressingLH* : addressingLH* : addressing

j = 4

0

j = 4

1

j = 4

2

j = 3

7

j = 4

8

j = 4

9

n = 3 ; i = 3

n' = 0, i' = 0 n' = 3, i' = 2 Coordinateur

Client Client

servers

j = 4

10

9

Page 62: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

62

LH* : addressingLH* : addressingLH* : addressingLH* : addressing

j = 4

0

j = 4

1

j = 4

2

j = 3

7

j = 4

8

j = 4

9

n = 3 ; i = 3

n' = 1, i' = 3 n' = 3, i' = 2 Coordinateur

Client Client

servers

j = 4

10

9

j = 4

Page 63: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

63

ResultResultResultResult

The distributed file can grow to even whole The distributed file can grow to even whole Internet so that :Internet so that :– every insert and search are done in four every insert and search are done in four

messages (IAM included)messages (IAM included)– in general an insert is done in one message in general an insert is done in one message

and search in two messagesand search in two messages– proof in [LNS 93]proof in [LNS 93]

Page 64: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

64

10,000 inserts

Global cost

Client's cost

Page 65: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

65

Page 66: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

66

Page 67: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

67

Inserts by two clients

Page 68: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

68

Parallel QueriesParallel QueriesParallel QueriesParallel Queries A query A query QQ for all buckets of file for all buckets of file FF with independent local with independent local

executionsexecutions– everyevery buckets should get buckets should get Q Q exactly once exactly once

The basis for function shippingThe basis for function shipping– fundamental for high-perf. DBMS appl.fundamental for high-perf. DBMS appl.

Send Mode :Send Mode :– multicastmulticast

» not always possible or convenientnot always possible or convenient

– unicastunicast» client may not know all the serversclient may not know all the servers

» severs have to forward the querysevers have to forward the query

– how ??how ??Image

File

Q

Page 69: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

69

LH* Algorithm for LH* Algorithm for Parallel QueriesParallel Queries(unicast)(unicast)

LH* Algorithm for LH* Algorithm for Parallel QueriesParallel Queries(unicast)(unicast)

Client sends Client sends Q Q to every bucket to every bucket a a in the imagein the image The message with The message with QQ has the has the message level message level jj' :' :

– initialy initialy j' =j' = i' i' if if n' n' i'i' elseelse j' = j' = i' + i' + 11

– bucket bucket a a (of level (of level j j ) copies ) copies Q Q to all its children using to all its children using the alg. :the alg. :

while while j' j' < < j j dodo

j' j' := := j' j' + 1+ 1

forward (forward (QQ, , j' j' ) à case ) à case a a + 2 + 2 j' - j' - 1 1 ; ;

endwhileendwhile Prove it !Prove it !

Page 70: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

70

Termination of Termination of Parallel QueryParallel Query (multicast or unicast)(multicast or unicast)

Termination of Termination of Parallel QueryParallel Query (multicast or unicast)(multicast or unicast)

How client How client C C knows that last reply came ?knows that last reply came ? Deterministic Solution (expensive)Deterministic Solution (expensive)

– Every bucket sends its Every bucket sends its j, m j, m and selected records if anyand selected records if any

» m m is its (logical) addressis its (logical) address

– The client terminates when it has every The client terminates when it has every m m fullfiling the fullfiling the condition ;condition ;

» m = m = 0,1..., 2 0,1..., 2 i i + n + n wherewhere– i i = min (= min (jj)) and and n n = min (= min (mm) where ) where j = i j = i

i+1 i i+1n

Page 71: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

71

Termination of Termination of Parallel QueryParallel Query (multicast or unicast)(multicast or unicast)

Termination of Termination of Parallel QueryParallel Query (multicast or unicast)(multicast or unicast)

Probabilistic Termination ( may need less messaging)Probabilistic Termination ( may need less messaging)

– all and only buckets with selected records replyall and only buckets with selected records reply

– after each reply after each reply C C reinitialises a time-out reinitialises a time-out TT

– C C terminates when terminates when T T expiresexpires Practical choice of Practical choice of T is T is network and query dependentnetwork and query dependent

– ex. 5 times Ethernet everage retry timeex. 5 times Ethernet everage retry time

» 1-2 msec ?1-2 msec ?– experiments neededexperiments needed

Which termination is finally more useful in practice ?Which termination is finally more useful in practice ?

– an open probleman open problem

Page 72: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

72

LH* variantsLH* variantsLH* variantsLH* variants

With/without load (factor) controlWith/without load (factor) control With/without the (split) coordinatorWith/without the (split) coordinator

– the former one was discussedthe former one was discussed

– the latter one is a token-passing schemathe latter one is a token-passing schema

» bucket with the token is next to splitbucket with the token is next to split– if an insert occurs, and file overload is guessedif an insert occurs, and file overload is guessed

– several algs. for the decisionseveral algs. for the decision

– use cascading splitsuse cascading splits

Page 73: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

73

Load factor for uncontrolled splitting

Page 74: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

74

Load factor for different load control strategies and threshold t = 0.8

Page 75: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

75

Page 76: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

76

LH* for LH* for switched multicomputersswitched multicomputersLH* for LH* for switched multicomputersswitched multicomputers

LH*LH*LHLH

– implemented on Parsytec machineimplemented on Parsytec machine» 32 Power PCs32 Power PCs» 2 GB of RAM (128 GB / CPU)2 GB of RAM (128 GB / CPU)

– uses uses » LH for the bucket managementLH for the bucket management» conurrent LH* splitting (described later on)conurrent LH* splitting (described later on)

– access times : < 1 msaccess times : < 1 ms Presented at EDBT-96Presented at EDBT-96

Page 77: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

77

LH* with presplittingLH* with presplittingLH* with presplittingLH* with presplitting

(Pre)splits are done "internally" immediately when an (Pre)splits are done "internally" immediately when an overflow occursoverflow occurs

Become visible to clients, only when LH* split should be Become visible to clients, only when LH* split should be normally performednormally performed

AdvantagesAdvantages– less overflows on sitesless overflows on sites

– parallel splitsparallel splits

DrawbacksDrawbacks– Load factorLoad factor

– Possibly longer forwardingsPossibly longer forwardings

Analysis remains to be doneAnalysis remains to be done

Page 78: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

78

LH* with concurrent LH* with concurrent splittingsplitting

LH* with concurrent LH* with concurrent splittingsplitting

Inserts and searches can be done Inserts and searches can be done concurrently with the splitting in progressconcurrently with the splitting in progress– used by LH*used by LH*LHLH

AdvantagesAdvantages– obviousobvious– and see EDBT-96and see EDBT-96

DrawbacksDrawbacks– + alg. complexity+ alg. complexity

Page 79: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

79

Research FrontierResearch FrontierResearch FrontierResearch Frontier Actual implementationActual implementation

– the SDDS protocolsthe SDDS protocols» Reuse the MS CFIS protocolReuse the MS CFIS protocol» + record types, forwarding, splitting, IAMs...+ record types, forwarding, splitting, IAMs...

– system architecturesystem architecture» client, server, sockets, UDP, TCP/IP, NT, Unix...client, server, sockets, UDP, TCP/IP, NT, Unix...» ThreadsThreads

Actual performanceActual performance» 250 us per search250 us per search

– 1 KB records, 100 mb AnyLan Ethernet1 KB records, 100 mb AnyLan Ethernet

– 40 times faster than a disk40 times faster than a disk

– e.g. response time of a join improves from 1m to 1.5 s.e.g. response time of a join improves from 1m to 1.5 s.

Page 80: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

80

Research FrontierResearch FrontierResearch FrontierResearch Frontier

Use within a DBMS Use within a DBMS » scalable AMOS, DB2 Parallel, Accessscalable AMOS, DB2 Parallel, Access

– replace the traditional disk access methodsreplace the traditional disk access methods» DBMS is the single SDDS clientDBMS is the single SDDS client

– LH* and perhaps other SDDSsLH* and perhaps other SDDSs

– use function shippinguse function shipping– use from multiple distributed SDDS clientsuse from multiple distributed SDDS clients

» concurrency, transactions, recovery...concurrency, transactions, recovery...

Other applicationsOther applications– A scalable WEB server (like INKTOMI)A scalable WEB server (like INKTOMI)

Page 81: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

81

TraditionalTraditionalTraditionalTraditional

DBMSDBMS

FMSFMS

Page 82: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

82

SDDS 1st stageSDDS 1st stageSDDS 1st stageSDDS 1st stage

DBMSDBMS

S S SS

Client

40 - 80 times faster recordaccess

Memorymappedfiles

Page 83: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

83

SDDS 2nd stageSDDS 2nd stageSDDS 2nd stageSDDS 2nd stage

DBMSDBMS

S S SS

Client

40 - 80 times faster recordaccess

n times faster non-keysearch

Page 84: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

84

SDDS 3rd stageSDDS 3rd stageSDDS 3rd stageSDDS 3rd stage

DBMSDBMS

S S SS

Client

DBMSDBMS

S

Client

40 - 80 times faster recordaccess

n times faster non-keysearch

larger fileshigher throughput

Page 85: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

85

ConclusionConclusionConclusionConclusion

Since their inception, in 1993, SDDS were Since their inception, in 1993, SDDS were subject to important research effortsubject to important research effort

In a few years, several schemes appearedIn a few years, several schemes appeared– with the basic functions of the traditional fileswith the basic functions of the traditional files

» hash, primary key ordered, multi-attribute k-d hash, primary key ordered, multi-attribute k-d accessaccess

– providing for much faster and larger filesproviding for much faster and larger files– confirming inital expectationsconfirming inital expectations

Page 86: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

86

Future workFuture workFuture workFuture work Deeper analysisDeeper analysis

– formal methods, simulations & experimentsformal methods, simulations & experiments Prototype implementationPrototype implementation

– SDDS protocol (on-going in Paris 9)SDDS protocol (on-going in Paris 9) New schemesNew schemes

– High-Availability & SecurityHigh-Availability & Security– R* - trees ?R* - trees ?

Killer appsKiller apps– large storage server & object serverslarge storage server & object servers– object-relational databasesobject-relational databases

» Schneider, D & al (COMAD-94)Schneider, D & al (COMAD-94)– video serversvideo servers– real-timereal-time– HP scientific data processingHP scientific data processing

Page 87: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

87

ENDEND(Part 1)(Part 1)

ENDEND(Part 1)(Part 1)

Thank you for your attentionThank you for your attention

Witold [email protected]@cs.berkeley.edu

Page 88: 1 Scalable Distributed Data Structures S tate-of-the-art Part 1 Witold Litwin Paris 9 litwin@dauphine.fr

88