Early Hash Join: A Configurable Algorithm for the Efficient and Early Production of Join Results Ramon Lawrence University of Iowa [email protected]

www.cs.uiowa.edu/~rlawrenc/

The University of Iowa. Copyright © 2005

Introduction

Interactive user querying requires that the DBMS produce the first few query answers quickly as well as minimize the total query execution time.

Queries that produce many results using large hash joins have a slow response time, because the smaller input must be completely partitioned before any output can be generated.

It is desirable to have a hash-based join algorithm for centralized databases that:
- Has a rapid response time to produce the first few results
- Has an overall execution time comparable to hybrid hash join
- Can be dynamically configured by the optimizer



Previous Work

Hash joins:
- hybrid hash join [DeWitt84] - standard join used in most DBMSs
- dynamic hash join [DeWitt95, Nakayama88] - dynamic partitioning
- symmetric hash join [Hong93, Wilschut91] - dual hash table
- ripple join [Haas99, Luo02] - online aggregation, reading policies
- MJoin [Ding03] - purges join state using stream punctuation

Mediator-based joins: improve overall execution time by executing during delays instead of plan re-ordering/query scrambling [Raman99, Urhan98].
- double pipelined hash join [Ives99] - Tukwila system
- XJoin [Urhan00] - probe in-memory partitions when blocked
- hash-merge join [Mokbel04] - sort-merge partitions when blocked
- progressive merge join [Dittrich02] - dual sort-based join



Motivation

Interactive users of a centralized DBMS can benefit from the fast response time inherent in dual-hash table joins.

The challenge is to ensure that overall performance is not significantly sacrificed for this fast response time.

A dual-hash table join has other benefits, as the operator is more easily pipelined (since it is symmetric).

This is valuable for federated joins when one or more of the inputs may not be local to the database engine.



Reading Strategy

A reading strategy is the set of rules an algorithm uses to decide how to read from the two inputs when both inputs have tuples available.
- Reading strategies do NOT apply to streaming (push-based) inputs.
- They are useful when the inputs are on a local hard drive or a fast network source (pull-based).

Reading strategies have been used before for processing top-k queries and in ripple joins.

The reading strategy for hybrid hash join is to read the entire smaller input and then the larger input. Another strategy is to read alternately from the inputs.
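
A reading strategy for pull-based inputs can be sketched as a generator that interleaves two iterators in a fixed ratio. The function name `read_with_policy`, the parameters `a` and `b`, and the "R"/"S" tags are hypothetical names chosen for illustration. The rule of draining the remaining input once the other is exhausted is an assumption, not something the slides specify.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, Tuple


def read_with_policy(r: Iterable[Any], s: Iterable[Any],
                     a: int, b: int) -> Iterator[Tuple[str, Any]]:
    """Yield ("R", t) or ("S", t), reading up to `a` tuples of R, then up to
    `b` tuples of S, and repeating until both inputs are exhausted."""
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("a and b must be non-negative and not both zero")
    sources = [("R", iter(r), a), ("S", iter(s), b)]
    done = {"R": False, "S": False}
    while not all(done.values()):
        for name, it, quota in sources:
            if done[name]:
                continue
            other_done = all(done[n] for n in done if n != name)
            # Assumption: once the other input is exhausted, drain this one.
            limit = None if other_done else quota
            batch = list(islice(it, limit))
            for t in batch:
                yield name, t
            # A short batch (or a full drain) means this input is exhausted.
            if limit is None or len(batch) < limit:
                done[name] = True


if __name__ == "__main__":
    # Alternate reading (1:1) from two small inputs.
    for side, t in read_with_policy([1, 2, 3], [10, 20], 1, 1):
        print(side, t)
```

With `a` at least as large as the smaller input, this sketch reads all of R before most of S, which approximates the hybrid hash join order. A 1:1 policy gives alternate reading.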



Flushing Policy

The flushing policy determines which tuples in memory are written to disk when memory must be released to accept new input.

Previous flushing policies:
- Flush the largest single partition (XJoin)
- Coordinated flushing of a partition pair (hash-merge join)

The flushing policy affects the duplicate detection strategy of the join algorithm. It also affects performance in two ways:
1) Join output rate - the number of results generated as input is being received. This depends on the tuples in memory.
2) Overall execution time - the total time may change depending on the cost of flushing and post-join cleanup.



Early Hash Join (EHJ) Algorithm

The Early Hash Join (EHJ) algorithm uses a dual hash table approach. It is specifically designed for a centralized DBMS where overall execution time is dictated by the flushing and partitioning speed and not by the input arrival rates.

EHJ uses:
- a variable reading strategy that changes when memory is full
- a biased flushing policy to favor the smaller input
- optimizations to flush join memory state for 1:* joins
- simplified duplicate detection that requires no timestamps for 1:* joins and only one timestamp for *:* joins
- a background process when used for mediator joins or with slow network-based inputs



Early Hash Join (EHJ) Algorithm

[Flowchart: EHJ algorithm. R is the smaller relation, S is the larger relation.]

Main loop:
- Start join.
- Read a tuple from R or S (according to the reading policy).
- Tuple of R? Yes: insert in R table, probe S table, output results. No: insert in S table, probe R table, output results.
- Memory full? Yes: bias flush.
- Input left? Yes: read the next tuple. No: initialize the 1st cleanup phase.

Cleanup nodes:
- Decisions: On-disk R? In phase 1? On-disk S, no R? Input left in S file?
- Actions: Load R to memory. Initialize probe file for S partition. Read S tuple TS, probe R table, output results. Close S file and delete on-disk partitions. Initialize 2nd cleanup phase. Join complete.
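
The insert-then-probe step of a dual hash table join can be sketched as below. This is a minimal in-memory illustration only: it leaves out the memory-full check, flushing, and both cleanup phases. The function name `symmetric_hash_join`, the key extractors `r_key` and `s_key`, and the tagged-input format are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Hashable, Iterable, Iterator, Tuple


def symmetric_hash_join(
    tagged_input: Iterable[Tuple[str, Any]],
    r_key: Callable[[Any], Hashable],
    s_key: Callable[[Any], Hashable],
) -> Iterator[Tuple[Any, Any]]:
    """Yield (r_tuple, s_tuple) results as soon as both sides have arrived.

    `tagged_input` yields ("R", t) or ("S", t) in whatever order the
    reading policy produces them.
    """
    r_table = defaultdict(list)  # join key -> R tuples seen so far
    s_table = defaultdict(list)  # join key -> S tuples seen so far
    for side, t in tagged_input:
        if side == "R":
            k = r_key(t)
            r_table[k].append(t)           # insert in R table
            for match in s_table.get(k, ()):  # probe S table
                yield (t, match)
        elif side == "S":
            k = s_key(t)
            s_table[k].append(t)           # insert in S table
            for match in r_table.get(k, ()):  # probe R table
                yield (match, t)
        else:
            raise ValueError(f"unknown side tag: {side!r}")


if __name__ == "__main__":
    stream = [("R", (1, "a")), ("S", (1, "x")), ("S", (2, "y")),
              ("R", (2, "b")), ("S", (1, "z"))]
    for pair in symmetric_hash_join(stream, lambda t: t[0], lambda t: t[0]):
        print(pair)
```

Because each tuple probes only the tuples that arrived before it, every matching pair is produced exactly once while everything stays in memory. Duplicate detection only becomes necessary once partitions are flushed.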



Biased Flushing Policy

The biased flushing policy is designed to keep as much of the smaller input in memory as possible (similar to hybrid hash join).

Biased flushing policy:
- Flush the largest non-frozen partition of S (larger input).
- If no such partition of S exists, flush the smallest non-frozen partition of R (smaller input).

The idea of freezing a partition is from dynamic hash join [DeWitt95].
- A frozen partition does not accept input once it has been flushed and is not probed.
- XJoin and HMJ do not freeze partitions.
- Freezing partitions and using biased flushing simplifies the duplicate detection strategy.
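
The victim selection rule above can be sketched as a small function. The name `choose_flush_victim`, the dictionaries of in-memory partition sizes, and the sets of frozen partition numbers are hypothetical. Two further assumptions are not in the slides: partitions with no tuples in memory are skipped, and ties are broken by the lowest partition number.

```python
from typing import Dict, Optional, Set, Tuple


def choose_flush_victim(
    r_sizes: Dict[int, int],
    s_sizes: Dict[int, int],
    r_frozen: Set[int],
    s_frozen: Set[int],
) -> Optional[Tuple[str, int]]:
    """Return ("S", p) or ("R", p) for the partition to flush, or None.

    Sizes are the number of tuples currently held in memory per partition.
    """
    # Prefer the largest non-frozen partition of S (the larger input).
    # Assumption: empty partitions are not eligible, since flushing them
    # would release no memory.
    s_candidates = [p for p, n in s_sizes.items() if p not in s_frozen and n > 0]
    if s_candidates:
        # Assumption: ties go to the lowest partition number.
        return ("S", max(s_candidates, key=lambda p: (s_sizes[p], -p)))
    # Otherwise fall back to the smallest non-frozen partition of R.
    r_candidates = [p for p, n in r_sizes.items() if p not in r_frozen and n > 0]
    if r_candidates:
        return ("R", min(r_candidates, key=lambda p: (r_sizes[p], p)))
    return None


if __name__ == "__main__":
    r = {0: 5, 1: 2, 2: 7}
    s = {0: 3, 1: 9, 2: 9}
    print(choose_flush_victim(r, s, set(), set()))
    print(choose_flush_victim(r, s, set(), {0, 1, 2}))
```

Flushing S first keeps the smaller input resident for as long as possible, which is the stated purpose of the bias.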



Duplicate Detection

Duplicate detection is required so that join results are not regenerated during the cleanup pass.

For common 1:* joins, no timestamps are needed:
- A *-side probe tuple is discarded if it is matched.
- With a 1-side probe tuple, delete from the hash table any matching tuples on the *-side.

For *:* joins, a single timestamp representing the tuple's arrival order is kept. In the cleanup pass, a result tuple (TR, TS) passes the timestamp check (and is output) if one of these is true:
1) TS arrived before its partition of S was flushed and TR arrived after its corresponding partition of S was flushed.
2) TS arrived after its partition of S was flushed but before the matching partition of R was flushed, and TR arrived after TS.
3) TS arrived after the matching partition of R was flushed.
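
The three timestamp conditions for *:* joins translate directly into a predicate. This sketch assumes that arrival and flush times are distinct values on one logical clock, and that an R partition that was never flushed can be represented by an infinite flush time. The function and parameter names are hypothetical.

```python
import math


def passes_timestamp_check(
    ts_arrival: float,
    tr_arrival: float,
    s_flush: float,
    r_flush: float = math.inf,
) -> bool:
    """Return True if the cleanup pass should output the result (TR, TS).

    ts_arrival / tr_arrival: arrival timestamps of the S and R tuples.
    s_flush: time at which TS's partition of S was flushed.
    r_flush: time at which the matching partition of R was flushed
             (math.inf if it never was; an assumption of this sketch).
    """
    # 1) TS was in memory before the S flush, TR arrived after that flush.
    cond1 = ts_arrival < s_flush and tr_arrival > s_flush
    # 2) TS arrived between the S flush and the R flush, TR arrived after TS.
    cond2 = s_flush < ts_arrival < r_flush and tr_arrival > ts_arrival
    # 3) TS arrived after the matching R partition was flushed.
    cond3 = ts_arrival > r_flush
    return cond1 or cond2 or cond3


if __name__ == "__main__":
    # S partition flushed at time 10, matching R partition at time 20.
    print(passes_timestamp_check(5, 3, 10, 20))    # both arrived before any flush
    print(passes_timestamp_check(5, 15, 10, 20))   # TR arrived after the S flush
```

In the first call both tuples were in memory together, so the pair would already have been produced during the main join and the check rejects it. In the second call TR arrived after TS had gone to disk, so the pair is output during cleanup.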



Performance Analysis

Parameters:
- Two input relations R and S with |R| ≤ |S|.
- Join memory M where M ≤ |R|. Let f = M / |R|.
- Reading policy before memory is full is A1:B1. Let q1 = A1/(A1+B1).
- Reading policy after memory is full is A2:B2. Let q2 = A2/(A2+B2).

Number of I/O operations (not counting reading inputs):

$$2 \cdot \left( |R| + |S| - f \cdot |R| - f \cdot \mathit{leftS} \right)$$

where

$$\mathit{leftS} = |S| - M \cdot (1 - q_1) - \frac{(1 - q_2) \cdot (|R| - M \cdot q_1)}{q_2}$$

Note that for hybrid hash join, leftS = |S|.
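
The cost model can be checked numerically with a short script. It assumes the I/O count reads as 2 * (|R| + |S| - f*|R| - f*leftS) with leftS = |S| - M*(1 - q1) - (1 - q2)*(|R| - M*q1)/q2, which is a reconstruction from the parameter definitions on this slide. The function names and the 1:1 reading policy in the example are hypothetical choices. Results are in the same units as |R|, |S| and M.

```python
def left_s(r_size: float, s_size: float, m: float, q1: float, q2: float) -> float:
    """Part of S still unread when R is exhausted (reconstructed formula)."""
    if not 0 < q2 <= 1 or not 0 <= q1 <= 1:
        raise ValueError("q1 must be in [0, 1] and q2 in (0, 1]")
    # S read while memory fills: M * (1 - q1).
    # S read while the rest of R is read: (1 - q2) * (|R| - M * q1) / q2.
    return s_size - m * (1 - q1) - (1 - q2) * (r_size - m * q1) / q2


def join_io(r_size: float, s_size: float, m: float, q1: float, q2: float) -> float:
    """I/O operations, not counting the reading of the inputs."""
    f = m / r_size
    return 2 * (r_size + s_size - f * r_size - f * left_s(r_size, s_size, m, q1, q2))


if __name__ == "__main__":
    # Sizes taken from the one-to-many experiment: |R| = 150,000,
    # |S| = 1,500,000, M = 75,000 tuples. The policies are illustrative.
    print(join_io(150_000, 1_500_000, 75_000, q1=0.5, q2=0.5))  # 1:1 reading
    print(join_io(150_000, 1_500_000, 75_000, q1=1.0, q2=1.0))  # all of R first
```

With q1 = q2 = 1 (all of R read first), leftS equals |S|, matching the note for hybrid hash join, and the script gives 1,650,000. With 1:1 reading, leftS is 1,350,000 and the result is 1,800,000.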



Background Process

A background process can be used when the inputs are from sources other than the hard drive used for flushing.
- This includes mediator and federated joins.
- As shown in previous work, it is most valuable for slow or bursty networks and not as useful for high-speed networks.

Similar to XJoin, it uses an on-disk partition of S to probe the matching partition of R currently in memory.

It is designed as a background process that runs concurrently with the main join process.
- This can boost the join output rate, but care is needed not to needlessly tie up the CPU when the background process may only generate a few results.
- Duplicate detection is slightly modified when using the background process.



Experimental Evaluation

The performance of early hash join was compared with dynamic hash join, XJoin, and hash-merge join.
- All algorithms were implemented in Java and tested on a TPC-H 1 GB data set (raw text files).
- All dual hash table algorithms used the same table structure.

Summary of results:
- EHJ is 10-35% faster than HMJ/XJoin for many-to-many joins and 25-75% faster for one-to-many joins.
- EHJ is faster over all memory sizes except for very small memory (less than 10% of the smaller relation size).
- EHJ performs better when the difference in the relative sizes of the relations is large.
- EHJ is within 10% of the overall time of DHJ, but with a response time that is an order of magnitude faster. Intelligent buffering may be able to further reduce this difference.



Many-to-Many Join Experiment

Query:
SELECT *
FROM PartSupp P1, PartSupp P2
WHERE P1.p_partkey = P2.p_partkey

P1 and P2 were randomly permuted because PartSupp is sorted on p_partkey.
Memory size = 300,000 tuples (37.5% of 800,000 tuples)

[Figure: Join Output by Time. x-axis: Results * 1000; y-axis: Time (sec); series: DHJ, EHJ1, EHJ2, HMJ, XJoin.]

[Figure: I/Os Performed. x-axis: Results * 1000; y-axis: I/Os * 1000; series: DHJ, EHJ1, EHJ2, HMJ, XJoin.]



One-to-Many Join Experiment

Query:
SELECT *
FROM Customer C, Orders O
WHERE C.c_custkey = O.o_custkey

Memory size = 75,000 tuples (50% of 150,000 tuples)

[Figure: I/Os Performed. x-axis: Results * 1000; y-axis: I/Os * 1000; series: DHJ, EHJ1, EHJ2, HMJ, XJoin.]

[Figure: Join Output by Time. x-axis: Results * 1000; y-axis: Time (sec); series: DHJ, EHJ1, EHJ2, HMJ, XJoin.]



Multi-Join Experiment

Query:
SELECT c_custkey, c_name, c_address, o_orderkey, o_custkey, o_totalprice, o_orderdate,
       l_orderkey, l_partkey, l_suppkey, l_quantity, l_extendedprice
FROM Customer C, Orders O, LineItem LI
WHERE C.c_custkey = O.o_custkey AND O.o_orderkey = LI.l_orderkey

Memory size = 90,000 tuples (60% of 150,000 tuples) (C+O)
Memory size = 450,000 tuples (30% of 1,500,000) (C/O + LI)

[Figure: Join Output by Time. x-axis: Results * 1000; y-axis: Time (sec); series: DHJ, EHJ1, EHJ2, HMJ, XJoin.]

[Figure: I/Os Performed. x-axis: Results * 1000; y-axis: I/Os * 1000; series: DHJ, EHJ1, EHJ2, HMJ, XJoin.]



Mediator Experimental Evaluation

The performance of early hash join was compared with dynamic hash join, XJoin, and hash-merge join for mediator joins.
- All algorithms were implemented in Java and tested on a TPC-H 100 MB data set with queries processed by SQL Server.
- DHJ downloaded from both inputs in parallel.

Summary of results:
- Both overall execution time and join output rate are dictated by the speed of the inputs. There is little variation in execution time among the algorithms.
- All early algorithms have a response time an order of magnitude faster than DHJ, especially when the left input is slow.
- For 1:* joins, EHJ is 5%-15% faster overall than HMJ/XJoin and equivalent to DHJ. It also has a slightly faster join output rate.
- For *:* joins, EHJ was only marginally faster in overall time, with a very similar join output rate.



Applications

The primary application is interactive querying on a centralized database. EHJ has a response time an order of magnitude faster than hybrid hash join with little execution overhead.

EHJ is also more suitable for pipelining within and outside a DBMS, as it is a symmetric operator that tolerates source delays. This may be especially valuable for federated queries.

EHJ can be used with LIMIT queries to produce the first few results without the overhead of partitioning the smaller input. However, any query with blocking operators such as ordering and grouping cannot benefit from its fast response time. Further, it is not order preserving without additional modifications.



Future Work and Conclusions

EHJ is a useful algorithm for interactive querying and a good candidate for inclusion into the set of join algorithms for a centralized DBMS.

EHJ is dynamically configurable using a reading policy and can adapt to slow input arrival. In a centralized environment, it significantly outperforms previous early join algorithms.

Future work:
- Implement and test the performance of EHJ in PostgreSQL.
- Expand the algorithm for an N-way join.
- Investigate the possibility of making it order preserving and optimized for distributed/mediator joins.






Thank You!



Extra Slides...



Mediator Join Experiment: 1:* Join

Join of Customer and Orders tables
Memory size = 7,500 tuples (50% of 15,000 tuples)
Slow network = 250 KBps, Fast network = 1000 KBps

[Figure: two panels, "Slow Left, Fast Right" and "Fast Left, Slow Right". x-axis: Results; y-axis: Time (ms); series: DHJ, EHJ2, HMJ, XJoin.]



Mediator Join Experiment: *:* Join

Join of two randomized copies of PartSupp relation
Memory size = 30,000 tuples (37.5% of 80,000 tuples)
Slow network = 250 KBps, Fast network = 1000 KBps

[Figure: two panels, "Slow Left, Fast Right" and "Fast Left, Slow Right". x-axis: Results; y-axis: Time (ms); series: DHJ, EHJ2, HMJ, XJoin.]



Mediator Join Experiment: *:* Join

Join of two randomized copies of PartSupp relation
Memory size = 16,000 tuples (20% of 80,000 tuples)
Slow network = 1000 KBps, Fast network = 4000 KBps

[Figure: two panels, "Slow Left, Fast Right" and "Fast Left, Slow Right". x-axis: Results; y-axis: Time (ms); series: DHJ, EHJ2, HMJ, XJoin.]



Dual Hash Table Structure

[Diagram: an R table and an S table, each with hash buckets 0-7 holding chains of tuples, with a hash algorithm shown between the two tables.]
