
1

WDAS 2003, 13 – 14 June, THESSALONIKI (Greece)

Range Queries to Scalable Distributed Data Structure RP*

Samba Ndiaye¹, Mianakindila Tsangou¹,

Mouhamed T. Seck, Witold Litwin

[email protected]


2

Plan
What are SDDSs?
– LH*: Scalable Distributed Hash Partitioning
– RP*: Scalable Distributed Range Partitioning
– High-Availability: LH*RS & RP*RS
SDDS Range Query Data Reception
Experimental performance results
– Environment and Measurements
– Results
Conclusion
Current and Future work


3

What is an SDDS?
A new type of data structure
– Specifically for multicomputers
Designed for data-intensive files:
– horizontal scalability to very large sizes
  » larger than any single-site file
– parallel and distributed processing
  » especially in (distributed) RAM
– Record access time better than for any disk file
  » 100-300 µs usually under Win 2000 (100 Mb/s net, 700 MHz CPU, 100 B – 1 KB records)
– Queries come from multiple autonomous clients
Data are stored on servers
Data have a well-defined structure
– records with keys
– objects with OIDs
Overflowing servers split into new servers


4

Multicomputers
A collection of loosely coupled computers
– common and/or preexisting hardware
– shared-nothing architecture
– message passing through a high-speed net (Mb/s)
Network multicomputers
– use general-purpose nets
  » LANs: Ethernet, Token Ring, Fast Ethernet, SCI, FDDI...
  » WANs: ATM...
Switched multicomputers
– use a bus, or a switch
  » e.g., IBM-SP2, Parsytec


5

Why multicomputers?
Potentially unbeatable price-performance ratio
– Much cheaper and more powerful than supercomputers
  » 1500 WSs at HPL with 500+ GB of RAM & TBs of disks
Potential computing power
– file size
– access and processing time
– throughput
For more pros & cons:
– Bill Gates at Microsoft Scalability Day
– NOW project (UC Berkeley)
– Tanenbaum: "Distributed Operating Systems", Prentice Hall, 1995
– www.microsoft.com White Papers from the Business Syst. Div.


6

Why SDDSs?
Multicomputers need data structures and file systems
Trivial extensions of traditional structures are not best:
– hot-spots
– scalability
– parallel queries
– distributed and autonomous clients
– distributed RAM & distance to data
See the SDDS talk & papers for more
– ceria.dauphine.fr


7

SDDS FAMILIES
LH*: Scalable distributed hash partitioning
– Transparent for the applications
  » Unlike the current static schemes (e.g. DB2)
– Generalizes the LH addressing schema
  » variants used in Netscape products, LH-Server, Unify, Frontpage, IIS, MsExchange, Berkeley DB Library...
RP* schemes: Produce scalable distributed 1-d ordered files
– for range search
– Each bucket (server) has the unique range of keys it may contain
– Ranges partition the key space
– Ranges evolve dynamically through splits
  » Transparently for the application
Use RAM m-ary trees at each server
– Like B-trees
– Optimized for the RP* split efficiency
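To make the bucket mechanics concrete, here is a minimal sketch of an RP* bucket, assuming an in-RAM java.util.TreeMap in place of the optimized m-ary tree the slides mention; the class, field names and split policy are illustrative, not the SDDS-2000 implementation.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch of an RP* bucket: records kept ordered by key in RAM,
// covering the half-open range ]lambda, LAMBDA], split when it overflows.
class Bucket {
    final int capacity;              // maximum number of records b
    String lambda, LAMBDA;           // bucket range ]lambda, LAMBDA]
    final TreeMap<String, byte[]> records = new TreeMap<>();

    Bucket(int capacity, String lambda, String LAMBDA) {
        this.capacity = capacity; this.lambda = lambda; this.LAMBDA = LAMBDA;
    }

    boolean insert(String key, byte[] record) {
        records.put(key, record);
        return records.size() > capacity;    // caller triggers a split when true
    }

    // Move the upper half of the keys to a new bucket; the two ranges stay disjoint.
    Bucket split() {
        String middle = records.navigableKeySet()
                               .toArray(new String[0])[records.size() / 2];
        Bucket upper = new Bucket(capacity, middle, LAMBDA);
        SortedMap<String, byte[]> moved = records.tailMap(middle, false); // keys > middle
        upper.records.putAll(moved);
        moved.clear();                       // removes the moved keys from this bucket
        this.LAMBDA = middle;                // this bucket now covers ]lambda, middle]
        return upper;
    }
}
```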


8

SDDS FAMILIES
High-availability SDDS schemes
Data remain available despite:
– any single server failure & most two-server failures
– or any failure of up to n servers
– and some catastrophic failures
n scales with the file size
– To offset the reliability decline which would otherwise occur
Three principles for high-availability SDDS schemes are currently known:
– mirroring (LH*m)
– striping (LH*s)
– grouping (LH*g, LH*sa, LH*rs, RP*rs)
They realize different performance trade-offs


9

SDDS Range Query Data Reception
An RP* file is distributed into buckets.
Each bucket contains at most b records, with keys within some interval ]λ, Λ] called the bucket range.
SDDS RP* supports range queries. A range query R requests all records whose keys are in the query interval [Rmin, Rmax].
The application submits R, like any of its queries, to the SDDS client at its node.
The buckets concerned by the query, i.e. those whose interval ]λi, Λi] satisfies ]λi, Λi] ∩ [Rmin, Rmax] ≠ ∅, send the relevant data to the client.
The number N of those servers may be small or very high. N may potentially reach thousands, perhaps soon millions.
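As an illustration of the selection rule above, this sketch shows how a server could test whether its bucket range ]λ, Λ] intersects the query interval [Rmin, Rmax]; it assumes string keys compared lexicographically and finite bounds, which is a simplification.

```java
// Does the bucket range ]lambda, LAMBDA] intersect the query interval [rMin, rMax]?
// Illustrative only: keys compared lexicographically, finite bounds assumed
// (a real RP* bucket may use -infinity / +infinity as range limits).
final class RangeTest {
    static boolean mustReply(String lambda, String LAMBDA, String rMin, String rMax) {
        boolean queryEntirelyBelow = rMax.compareTo(lambda) <= 0;  // every query key <= lambda
        boolean queryEntirelyAbove = rMin.compareTo(LAMBDA) > 0;   // every query key > LAMBDA
        return !(queryEntirelyBelow || queryEntirelyAbove);
    }
}
```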


10

SDDS Range Query Data Reception

Figure 1: Range query in an SDDS RP*. The client sends the range query [Rmin, Rmax]; the servers whose bucket ranges ]λ1, Λ1], ]λ2, Λ2], ..., ]λn, Λn] intersect it send their relevant records back to the client.

The basic strategy for a range query is that every server concerned requests a TCP connection with the client.
If the number of connections to manage is too high, the client may get swamped in various ways.
First, some demands may be refused, forcing the servers to repeat their requests, perhaps many times.
Next, managing many simultaneous connections may lead to a loss of some incoming data that cannot be accommodated fast enough in the buffers available to the TCP/IP protocol.


11

SDDS Range Query Data Reception

The following three strategies appear suitable for implementation:
Strategy 1

The client accepts as many TCP/IP connection requests as it can. In practice, it may accept at most M requests, for some positive integer M; M is the system parameter usually called the backlog. A refused server drops out, abandoning its reply. The client may then need to find out which servers dropped out and request the missing data specifically from each of them. This recovery is not part of Strategy 1 itself.
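A minimal client-side sketch of Strategy 1, assuming a plain java.net.ServerSocket; M is passed as the TCP backlog argument, so connection requests beyond it are refused by the TCP layer, which is the refusal the strategy relies on. The port, wire format and one-connection-at-a-time handling (the real client was multithreaded) are illustrative assumptions.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Strategy 1 client sketch: accept as many TCP connections as the backlog allows;
// servers refused by the TCP layer simply drop out (the recovery step is not shown).
public class Strategy1Client {
    public static void main(String[] args) throws IOException {
        int port = 4000;          // illustrative port
        int backlogM = 50;        // the system parameter M ("backlog")
        try (ServerSocket listener = new ServerSocket(port, backlogM)) {
            while (true) {        // in practice: until the whole range is covered or a timeout
                try (Socket server = listener.accept();
                     DataInputStream in = new DataInputStream(server.getInputStream())) {
                    int recordCount = in.readInt();          // illustrative wire format
                    for (int i = 0; i < recordCount; i++) {
                        String key = in.readUTF();
                        String value = in.readUTF();
                        // store or process the received record here
                    }
                }
            }
        }
    }
}
```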

Strategy 2
Every server starts as for Strategy 1. The refused servers repeat their connection requests. The repetitions follow the CSMA/CD policy.
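The slides only name the CSMA/CD repetition policy; this sketch shows one plausible server-side reading of it, retrying the TCP connect after a randomized delay drawn from an exponentially growing window, in the spirit of binary exponential back-off. All constants and names are assumptions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.ThreadLocalRandom;

// Strategy 2 server sketch: if the client refuses the connection (backlog full),
// retry after a random delay from an exponentially growing window.
public class Strategy2Server {
    static Socket connectWithBackoff(String clientHost, int clientPort,
                                     int maxAttempts) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Socket s = new Socket();
            try {
                s.connect(new InetSocketAddress(clientHost, clientPort), 2000 /* ms */);
                return s;                        // connected: start sending the records
            } catch (IOException refused) {
                try { s.close(); } catch (IOException ignored) { }
                // back-off window doubles each attempt: up to 2^(attempt+1) slots of 10 ms
                int slots = ThreadLocalRandom.current().nextInt(1 << Math.min(attempt + 1, 10));
                Thread.sleep(slots * 10L);
            }
        }
        return null;                             // give up after maxAttempts
    }
}
```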

Strategy 3
Each server that should reply sends its connection demand to the client through a UDP service message. Each message contains the interval and the IP address of the server. After receiving all the intervals and IP addresses, the client establishes TCP connections with the servers to receive the data. It may initiate as many simultaneous connections as it can accommodate given its storage and processing resources. However, we study only the case of a single connection at a time.
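A minimal client-side sketch of Strategy 3: collect the UDP service messages carrying each server's interval and address, then open the TCP connections one at a time. The textual message format, ports and identifiers are assumptions, not the paper's actual wire format.

```java
import java.io.DataInputStream;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Strategy 3 client sketch: collect UDP service messages "lambda;LAMBDA;ip;port"
// from the N servers concerned, then fetch the data over TCP, one connection at a time.
public class Strategy3Client {
    public static void main(String[] args) throws Exception {
        int expectedServers = Integer.parseInt(args[0]);   // known, or bounded by a timeout in practice
        List<String[]> replies = new ArrayList<>();
        try (DatagramSocket udp = new DatagramSocket(5000)) {       // illustrative port
            byte[] buf = new byte[512];
            while (replies.size() < expectedServers) {
                DatagramPacket p = new DatagramPacket(buf, buf.length);
                udp.receive(p);
                String msg = new String(p.getData(), 0, p.getLength(), StandardCharsets.UTF_8);
                replies.add(msg.split(";"));                 // [lambda, LAMBDA, ip, port]
            }
        }
        for (String[] r : replies) {                         // a single connection at a time
            try (Socket s = new Socket(InetAddress.getByName(r[2]), Integer.parseInt(r[3]));
                 DataInputStream in = new DataInputStream(s.getInputStream())) {
                int n = in.readInt();                        // illustrative wire format
                for (int i = 0; i < n; i++) {
                    String key = in.readUTF();
                    String value = in.readUTF();
                    // process the record
                }
            }
        }
    }
}
```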


12

Experimental performance results
Environment and Measurements

We have developed a multithreaded client able to process multiple data streams from the servers in parallel. We used Java so that the software can run on different platforms.

Our hardware platform consisted of eight Pentium II 400 MHz PCs under Windows NT Server 4.0. These nodes were linked by a 10 Mb/s Ethernet network. Each PC had 64 MB of RAM and 450 MB of virtual memory.

Our main performance measure of a strategy was the response time of a range query. We measured the difference between the time of reception of the last data item and the time the query was sent out by multicast. We call this measure the total time. We also measured the corresponding per (received) record time.
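For clarity, the two measures relate as in this small helper; the class and method names are illustrative.

```java
// The "total time" is the time of reception of the last data item minus the time the
// multicast query was sent; the per record time divides it by the number of records received.
final class ResponseTimes {
    static double perRecordTimeMs(long querySentAtMs, long lastItemReceivedAtMs,
                                  int recordsReceived) {
        long totalTimeMs = lastItemReceivedAtMs - querySentAtMs;
        return (double) totalTimeMs / recordsReceived;
    }
}
```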

For each strategy, our application built range queries whose intervals [Rmin, Rmax] scaled up progressively. At the end, the query addressed all the buckets in the file. For each experiment, we measured the response time for the given [Rmin, Rmax]. We repeated each experiment five times to get the average value.


13

Experimental performance results
Results

Strategy 1

Figure 2 below shows the total response time, measured up to forty responding servers. The figure shows that the response time scales linearly with the number of servers.

Figure 3 shows the per record response time of Strategy 1. This time increases with the number of servers, almost linearly between 10 and 30 servers. For 5 servers, it is about 0.16 ms per record; it reaches about 0.8 ms already for 15 servers. It then oscillates around this value, with a peak of 1 ms for 30 servers. This is quite scalable behavior. However, as for the total response time, this behavior could only be observed up to 40 servers in our experimentation. Because of this limitation, one cannot qualify Strategy 1 as scalable for our purpose.


14

Experimental performance results

Strategy 1
Figure 2: Total response time of Strategy 1 (1 to 40 servers; response time in ms; least-squares fit y = 248.6x - 430.98, R² = 0.9241).
Figure 3: Per record response time of Strategy 1 (1 to 40 servers; per record search time in ms).


15

Experimental performance results
Strategy 2

Figure 4 shows the total response time for Strategy 2. As before, the points are our measurements and the line is the least-squares linear fit.

This strategy scales up linearly to 140 servers. Over 140 servers, the same phenomenon as in Strategy 1 occurs: 80% of the measurements fail. Thus, 140 servers is the limit for the second strategy.

Figure 4: Total response time of Strategy 2 (1 to 140 servers; response time in ms; least-squares fit y = 183.21x + 22.317, R² = 0.97).


Figure 5 shows the per record response time of Strategy 2. It rapidly becomes almost constant around 0.8 ms. It even shows a slightly decreasing tendency after reaching 1 ms already for 20 servers. This behavior also demonstrates the scalability of Strategy 2.

Figure 5: Per record response time of Strategy 2 (1 to 140 servers; per record search time in ms).



16

Experimental performance results
Strategy 3

Figure 6 shows the measured and interpolated total response time of Strategy 3. This time also scales linearly up to our 140-server limit. One could expect such behaviour, since the client keeps the number of simultaneous connections constant at all times.

Figure 6: Total response time of Strategy 3 (1 to 140 servers; response time in ms).


The per record time, Figure 7, starts by decreasing steeply. It drops from the initial 5 ms for one server to 1 ms for 5 servers, and reaches 0.49 ms for 20 servers. It then decreases only very slowly to the final 0.33 ms for 140 servers. The curve appears in fact nearly flat between 20 and 140 servers, with small oscillations.

Figure 7: Per record response time of Strategy 3 (1 to 140 servers; per record search time in ms).



17

Experimental performance results

Comparative Analysis

Figure 8 shows the actual total response times of all the strategies. From 1 to 35 servers, Strategies 1 and 2 appear practically equal, as they should. The differences result only from the experimental nature of the collected measures.

Figure 8: Comparison of total response times (1 to 140 servers; response time in ms).

Figure 9 shows the per record time of all the strategies. Up to 15 servers, Strategy 1 gives the best results.

Strategy 3 is by far the worst performing for a small number of buckets, up to five buckets in our campaign. It is reasonably good in the 5 to 15-bucket range. It turns out to be the best performing beyond 15 servers. For 140 servers, it is almost twice as fast as Strategy 2.

Figure 9: Comparison of per record response times (1 to 140 servers; per record time in ms; Strategies 1, 2 and 3).



18

Experimental performance results

For Strategy 3, the backlog is not reached when only a few servers connect to the client. Therefore the simultaneous treatment of server responses is not activated by the TCP/IP layer, and server responses are treated sequentially. This is why the per record response times are very high at the beginning of our measurements.

Comparative Analysis

The client reaches its full treatment speed at about 10 servers, so the per record response time does not grow fast and stabilizes after 10 servers.

The waiting time of the client, i.e. the time to receive messages from the servers concerned by the range query, has an effect on the per record response time.


19

Conclusion
Strategy 3 appears the best choice for range queries, and in fact for parallel queries to an SDDS in general. It offers linear scale-up and the best response time.
For a faster CPU, the performance of Strategy 3 could also improve by opening at the client a few simultaneous connections instead of the single one used in our experiments. This generalization of Strategy 3 was integrated for operational use into SDDS-2000 [D01].
Using a faster PC and a popular 100 Mb/s Ethernet, or an increasingly popular Gigabit Ethernet, should improve the response times of all the strategies. It should not change, however, our basic conclusion about the superiority of Strategy 3.
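The generalization mentioned above, a few simultaneous connections instead of one, can be sketched with a bounded thread pool; the pool size and every identifier are assumptions, not the SDDS-2000 code.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Generalized Strategy 3 sketch: open at most K TCP connections at the same time
// by dispatching the collected (interval, address) replies to a fixed thread pool.
public class Strategy3Parallel {
    static void fetchAll(List<String[]> replies, int maxSimultaneousConnections)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(maxSimultaneousConnections);
        for (String[] reply : replies) {
            pool.submit(() -> fetchOverTcp(reply));   // same per-server TCP fetch as before
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void fetchOverTcp(String[] reply) {
        // connect to reply[2]:reply[3] and read the records, as in the single-connection sketch
    }
}
```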


20

Current and Future work

An alternative approach could be to first group partial results at some servers. This reduces the number of connections to manage at the client. Such strategies appear especially interesting for a range query with an aggregate function.
The process can be organized into a hierarchy where each level reduces the number of servers aggregating the result. Ultimately, one server could then get the final result, delivering the reply to the client in a single fast message. Such strategies constitute our current work.
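As a toy illustration of this hierarchy, and not the authors' design, the sketch below reduces the partial results level by level, with SUM standing in for the aggregate function; the fan-out and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the hierarchical idea for an aggregate range query (SUM as the example):
// at each level, groups of up to 'fanout' servers send their partial results to one
// aggregator per group, until a single value remains and reaches the client in one message.
public class HierarchicalAggregation {
    // Assumes partialResults is non-empty and fanout >= 2.
    static long aggregate(List<Long> partialResults, int fanout) {
        List<Long> level = partialResults;
        while (level.size() > 1) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += fanout) {
                long groupSum = 0;                                   // one aggregator per group
                for (int j = i; j < Math.min(i + fanout, level.size()); j++) {
                    groupSum += level.get(j);
                }
                next.add(groupSum);
            }
            level = next;                                            // number of servers shrinks
        }
        return level.get(0);                                         // the final result
    }
}
```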


21

END

Thank you for your attention

Mianakindila [email protected]