Revisiting Pipelined Parallelism in Multi-Join Query Processing

VLDB 2005 1


Bin Liu and Elke A. Rundensteiner

Worcester Polytechnic Institute

(binliu|rundenst)@cs.wpi.edu

http://www.davis.wpi.edu/dsrg

Multi-Join Queries

Data integration over distributed data sources, e.g., Extract-Transform-Load (ETL) services

[Figure: data sources, persistent storage, data warehouses]

(1) High I/O costs given large intermediate results

(2) Disk access is undesirable, since this is a one-time process

Applying Parallelism

Processed in main memory of a PC cluster

Make use of aggregated resources (main memory, CPU)

[Figure: clusters of machines connected by a network]

Three Types of Parallelism

Pipelined: operators are composed into producer-consumer relationships

Independent: independent operators run simultaneously on distinct machines

Partitioned: a single operator is replicated and run on multiple machines

Basics of Hash Join

Two-Phase Hash Join [SD89, LTS90]

Demonstrated high performance potential

High degree of parallelism

Orders:

ID    Date  …
0011  5/13  …
0012  5/14  …
…     …     …

LineItems:

OID   Item  …
0011  IPC   …
0012  HPS   …
…     …     …

(1) Build hash tables (key, value) of Orders based on ID

(2) Probe hash tables with LineItems and output results
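The two phases can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the authors' implementation: the function name `hash_join` and the dictionary-based row format are assumptions, and the sample rows are the ones visible in the Orders and LineItems tables.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Two-phase hash join: build on one input, probe with the other."""
    # Phase 1 (build): hash every build-side row on its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Phase 2 (probe): look up each probe-side row and emit the matches.
    results = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            results.append({**match, **row})
    return results

# Sample rows taken from the slide; only the visible columns are modeled.
orders = [{"ID": "0011", "Date": "5/13"}, {"ID": "0012", "Date": "5/14"}]
line_items = [{"OID": "0011", "Item": "IPC"}, {"OID": "0012", "Item": "HPS"}]
joined = hash_join(orders, line_items, "ID", "OID")
```

Building on the smaller input keeps the hash table small, which is why the choice of building relation matters in the tree shapes discussed later.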

Partitioned Hash Join

Partition (inputs) hash tables across processors

Have each processing node run in parallel

(1) Build hash tables of Orders based on ID

(2) Probe hash tables and output results

[Figure: Orders tuples pass through a Split operator into three (key, value) hash tables; LineItems tuples probe them]
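A minimal sketch of the partitioning idea follows, assuming rows are Python dictionaries and that both inputs are split with the same hash function so matching keys land in the same partition. The helper names (`partition`, `local_join`, `partitioned_hash_join`) are hypothetical, and the k nodes are simulated by a sequential loop rather than real machines.

```python
from collections import defaultdict

def partition(rows, key, k):
    """Split rows into k partitions by hashing the join key."""
    parts = [[] for _ in range(k)]
    for row in rows:
        parts[hash(row[key]) % k].append(row)
    return parts

def local_join(build_part, probe_part, build_key, probe_key):
    """Ordinary build/probe hash join over one pair of partitions."""
    table = defaultdict(list)
    for row in build_part:
        table[row[build_key]].append(row)
    return [{**match, **row}
            for row in probe_part
            for match in table.get(row[probe_key], [])]

def partitioned_hash_join(build_rows, probe_rows, build_key, probe_key, k):
    build_parts = partition(build_rows, build_key, k)
    probe_parts = partition(probe_rows, probe_key, k)
    results = []
    # Each (build, probe) pair is independent, so each could run on its own node.
    for b, p in zip(build_parts, probe_parts):
        results.extend(local_join(b, p, build_key, probe_key))
    return results

orders = [{"ID": "0011", "Date": "5/13"}, {"ID": "0012", "Date": "5/14"}]
line_items = [{"OID": "0011", "Item": "IPC"}, {"OID": "0012", "Item": "HPS"}]
joined = partitioned_hash_join(orders, line_items, "ID", "OID", k=3)
```

Because the partitions share no state, the result is the same as an unpartitioned join, only the output order may differ.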

Left-Deep Tree [SD90]

[Figure: example join graph over R1–R9 and the corresponding left-deep query tree with build operators B1–B8 and probe operators P1–P8]

Left-deep query tree steps:

(1) Scan R1 – Build B1

(2) Scan R2 – Probe P1 – Build B2

…

(9) Scan R9 – Probe P8 – Output
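The step sequence of a left-deep plan can be sketched as a loop in which each join result becomes the build input of the next join. This is an illustrative sketch only: the function names and the `conditions` format (pairs of attribute names) are assumptions, and attribute names are assumed to be unique across relations.

```python
from collections import defaultdict

def build(rows, key):
    """Build a hash table over rows, keyed on the given attribute."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table

def left_deep_join(relations, conditions):
    """relations: [R1, ..., Rn]; conditions[i] = (attribute of the result so far,
    attribute of relations[i + 1])."""
    current = relations[0]                      # (1) scan R1
    for rel, (build_key, probe_key) in zip(relations[1:], conditions):
        table = build(current, build_key)       # build from the previous result
        current = [                             # scan the next relation and probe
            {**match, **row}
            for row in rel
            for match in table.get(row[probe_key], [])
        ]                                       # materialized before the next build
    return current

r1 = [{"a": 1}, {"a": 2}]
r2 = [{"b": 1, "c": 10}, {"b": 2, "c": 20}]
r3 = [{"d": 10, "e": "x"}]
result = left_deep_join([r1, r2, r3], [("a", "b"), ("c", "d")])
```

Only one hash table is live at a time, but every intermediate result has to be fully built before the next probe can start, which limits pipelining.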

Right-Deep Tree [SD90]

[Figure: example join graph over R1–R9 and the corresponding right-deep query tree with build operators B1–B8 and probe operators P1–P8]

Right-deep query tree steps:

(1) Scan R2 – Build B1, Scan R3 – Build B2, …, Scan R9 – Build B8

(2) Scan R1 – Probe P1 – Probe P2 – … – Probe P8
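A right-deep plan can be sketched with Python generators: all hash tables are built first, then the probing relation is streamed through a chain of probe stages. The names below (`probe`, `right_deep_join`, the `conditions` format) are illustrative assumptions, not part of the original system.

```python
from collections import defaultdict

def build(rows, key):
    """Build a hash table over rows, keyed on the given attribute."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table

def probe(stream, table, probe_key):
    """Pipeline stage: consume tuples and emit joined tuples one at a time."""
    for row in stream:
        for match in table.get(row[probe_key], []):
            yield {**row, **match}

def right_deep_join(probing, building, conditions):
    """conditions[i] = (attribute of the streamed tuple, attribute of building[i])."""
    # (1) Build phase: every hash table is built up front and held in memory.
    tables = [build(rel, bkey) for rel, (_, bkey) in zip(building, conditions)]
    # (2) Probe phase: chain the probes into one pipeline over the probing relation.
    stream = iter(probing)
    for table, (pkey, _) in zip(tables, conditions):
        stream = probe(stream, table, pkey)
    return stream

r1 = [{"a": 1}, {"a": 2}]
r2 = [{"b": 1, "c": 10}, {"b": 2, "c": 20}]
r3 = [{"d": 10, "e": "x"}]
result = list(right_deep_join(r1, [r2, r3], [("a", "b"), ("c", "d")]))
```

Intermediate tuples flow from stage to stage without being stored, at the price of keeping all hash tables resident at once.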

Tradeoffs Between Left and Right Trees

Right-Deep:

Good potential for pipelined parallelism

Intermediate results exist only as a stream

Size of building relations can be predicted accurately

Large memory consumption

Left-Deep:

Less memory consumption

Less pipelined parallelism

State-of-the-Art Solutions

Implicit assumption: prefer maximal pipelined parallelism!

[Figure: query tree with build/probe operator pairs B1/P1 through B8/P8]

State-of-the-Art Solutions

What if the environment is memory constrained?

Strategy: break the tree into several pieces and process one piece at a time (as a pipeline)

E.g., Static Right-Deep [SD90], ZigZag [ZZBS93], Segmented Right-Deep [CLYY92]

[Figure: the query tree with build/probe operator pairs, shown twice; one piece is labeled "Pipeline"]

Pipelined Execution

Optimal degree of parallelism? E.g., it may not be necessary to partition R2 over a large number of machines if it has only 1000 tuples.

Redirection cost: the intermediate results generated may need to be partitioned to a different machine.

[Figure: pipelined join of R1–R4 on the computation machines; each relation is partitioned across the machines for building and probing, and a tuple t is redirected between partitions]

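Pipelined Cost Model

Compute an n-way join over k machines

Probing relation R0; building relations R1, R2, …, Rn

Ii represents the intermediate results after joining with Ri

Total work (Wb + Wp) and total processing time (Tb + Tp):

\[ W_b = (t_{read} + t_{partition} + t_{network} + t_{build}) \cdot \sum_{i=1}^{n} |R_i| \]

\[ W_p = (t_{read} + t_{partition} + t_{network} + t_{probe}) \cdot |R_0| + \frac{k-1}{k} \cdot t_{network} \cdot \sum_{i=1}^{n-1} |I_i| + t_{probe} \cdot \sum_{i=1}^{n-1} |I_i| \]

\[ T_b = (t_{read} + t_{partition} + t_{network} + t_{build}) \cdot f(k) \cdot \max_{1 \le i \le n} \frac{|R_i|}{k} \]

\[ T_p = I_{setup} + \frac{W_p}{k} + I_{delete} \]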
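As a worked illustration, the two total-work terms can be evaluated numerically. The sketch assumes the formulas read as Wb = (t_read + t_partition + t_network + t_build) * sum of |Ri|, and Wp = (t_read + t_partition + t_network + t_probe) * |R0| + ((k-1)/k) * t_network * sum of |Ii| + t_probe * sum of |Ii|, with the intermediate sums running over I1 to I(n-1). The function names and all numeric inputs are hypothetical.

```python
def build_work(t_read, t_partition, t_network, t_build, building_sizes):
    """Wb: every building tuple is read, partitioned, shipped, and inserted once."""
    return (t_read + t_partition + t_network + t_build) * sum(building_sizes)

def probe_work(t_read, t_partition, t_network, t_probe,
               r0_size, intermediate_sizes, k):
    """Wp: scan and probe R0, then redirect and probe the intermediate results."""
    scan = (t_read + t_partition + t_network + t_probe) * r0_size
    # Under this reading, a fraction (k - 1) / k of intermediate tuples changes machine.
    redirect = (k - 1) / k * t_network * sum(intermediate_sizes)
    probe = t_probe * sum(intermediate_sizes)
    return scan + redirect + probe

# Hypothetical unit costs (all 1) and cardinalities, chosen only for easy arithmetic.
w_b = build_work(1, 1, 1, 1, building_sizes=[100, 200])        # 4 * 300 = 1200
w_p = probe_work(1, 1, 1, 1, r0_size=1000,
                 intermediate_sizes=[500], k=4)                # 4000 + 375 + 500 = 4875.0
```

With these numbers the redirection term is small, but it grows with both k and the size of the intermediate results, which is the effect the slides question.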

Break Pipelined Parallelism

Large number of small pipelines

High interdependence between pipelined segments, e.g., P1 > P2, P3 > P4, P2 > P4

[Figure: query tree broken into pipelined segments P1–P4]

To break the long pipeline and introduce independent parallelism

Segmented Bushy Tree

Basic idea:

Compose large pipelined segments

Run pipelined segments independently

Compose bushy tree with minimal interdependency

[Figure: example join graph over R0–R9 and the resulting segmented bushy tree with pipelines P1–P3 and intermediate results I1 and I2]

To balance pipelined and independent parallelism

Composing Segmented Tree

[Slide labels: Cost-Based, Heuristics]

Input: A connected join graph G with n nodes. The number m specifies the maximum number of nodes in each subgraph.

Output: Segmented bushy tree with at least n/m subtrees.

completed = false;
WHILE (!completed) {
    Choose node V with largest cardinality that has not yet been grouped as probing relation;
    Enumerate all subgraphs starting from V with at most m nodes;
    Choose best subgraph, mark nodes in this group as having been selected in original join graph;
    IF !(exists K: K is a connected subgraph of G with unselected nodes && K.size() >= 2) {
        completed = true;
    }
}
Compose segmented bushy tree from all groups;
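The grouping loop can be sketched in runnable Python. This is an illustration under stated assumptions, not the authors' algorithm: the slide does not give the cost function used to pick the "best" subgraph, so the sketch substitutes a hypothetical one (prefer more nodes, then smaller total cardinality of the non-probing nodes). The function names, the adjacency-list graph format, and the sample graph are also assumptions, and the final step of composing the bushy tree from the groups is omitted.

```python
def connected_subgraphs(graph, start, allowed, m):
    """All connected node sets that contain start, stay inside allowed, and have <= m nodes."""
    found = set()

    def grow(nodes):
        if nodes in found:
            return
        found.add(nodes)
        if len(nodes) == m:
            return
        for v in nodes:
            for n in graph[v]:
                if n in allowed and n not in nodes:
                    grow(nodes | {n})

    grow(frozenset([start]))
    return found

def group_segments(graph, cardinality, m):
    unselected = set(graph)
    groups = []
    while True:
        # Heuristic from the slide: start from the largest unselected relation.
        v = max(unselected, key=lambda n: (cardinality[n], n))
        candidates = connected_subgraphs(graph, v, unselected, m)
        # Hypothetical cost: more nodes first, then less building-side cardinality.
        best = min(candidates, key=lambda s: (
            -len(s), sum(cardinality[n] for n in s if n != v), sorted(s)))
        groups.append((v, set(best)))
        unselected -= best
        # Stop when no two unselected nodes are still connected.
        if not any(n in unselected for u in unselected for n in graph[u]):
            break
    return groups, unselected

# Hypothetical 5-node path graph a - b - c - d - e.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
cardinality = {"a": 100, "b": 10, "c": 20, "d": 90, "e": 5}
groups, leftover = group_segments(graph, cardinality, m=3)
# groups == [("a", {"a", "b", "c"}), ("d", {"d", "e"})]; leftover == set()
```

Enumerating all connected subgraphs is exponential in general; bounding the size by m keeps the candidate set manageable.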

Example

[Figure: example join graph over R0–R9, shown as its nodes are grouped first into G1 and then into G2]

Candidate subgraphs starting from R7:

(1) R7, R8, R9, R6

(2) R7, R9, R6, R8

(3) R7, R4, R8, R5

...

Candidate subgraphs starting from R1:

(1) R1, R0, R2, R3

(2) R1, R2, R0, R3

(3) R1, R2, R3, R4

...

Example: Segmented Bushy Tree

[Figure: join graph over R0–R9 grouped into G1, G2, and G3, and the resulting segmented bushy tree with intermediate results I1 and I2]

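Machine Allocation

Based on building relation sizes of each segment

Nb: total amount of building work

ki: number of machines allocated to pipeline i

[Figure: segmented bushy tree with k1, k2, and k3 machines assigned to its three pipelines]

\[ N_b = \sum_{0 \le i \le 9,\ i \ne 1,7} |R_i| + |I_1| \]

\[ k_1 = k \cdot \frac{|R_6| + |R_8| + |R_9|}{N_b} \]

\[ k_2 = k \cdot \frac{|R_2| + |R_3| + |R_4|}{N_b} \]

\[ k_3 = k - (k_1 + k_2) \]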
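The allocation rule can be illustrated with a short sketch, assuming each pipeline receives a share of the k machines proportional to the total size of its building relations and the last pipeline receives the remainder. The function name, the rounding, the one-machine minimum, and the sample sizes are assumptions added for the illustration.

```python
def allocate_machines(k, building_sizes):
    """building_sizes[i]: total size of the building relations of pipeline i."""
    n_b = sum(building_sizes)  # total amount of building work
    # Proportional share; rounding and the minimum of one machine are assumptions.
    shares = [max(1, round(k * size / n_b)) for size in building_sizes[:-1]]
    # The last pipeline takes what is left, mirroring k3 = k - (k1 + k2).
    shares.append(k - sum(shares))
    return shares

# Hypothetical building sizes for three pipelines on a 10-machine cluster.
allocation = allocate_machines(10, [300, 500, 200])   # [3, 5, 2]
```

With very skewed sizes the remainder for the last pipeline can shrink to zero or below, so a real allocator would need an extra balancing step.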

Insufficient Main Memory

Break query based on main memory availability

Compose segmented bushy tree for each part

[Figure: join graph over R0–R19 broken into parts]

Experimental Setup

10-machine cluster

Each machine has two 2.4 GHz Xeon CPUs and 2 GB memory

Connected by a gigabit Ethernet switch

[Figure: testbed components: Oracle 8i, controller, application, 10-machine cluster. Hardware labels: PIII 800 MHz PC with 256 MB memory (twice, one next to Application); 2 PIII 1 GHz CPUs with 1 GB memory]

Experimental Setup (cont.)

Generated data set with integer join values

Around 40 bytes per tuple

Randomly generated join queries

Acyclic join graph with 8, 12, or 16 nodes

Each node represents one join relation

Each edge represents one join condition

Average join ratio is 1

Cardinality of each relation ranges from 1K to 100K tuples

Up to 600 MB per query

Pipelined vs. Segmented (I)

[Figure: processing time (ms), axis 0 to 700,000, across sample queries; series: Right-Deep Tree and Segmented Bushy Tree (3)]

Pipelined vs. Segmented (II)

[Figure: processing time (ms), axis 0 to 800,000, versus number of relations in a query (8, 12, 16); series: Right-Deep and Segmented Bushy]

Insufficient Main Memory

[Figure: processing time (ms), axis 0 to 2,000,000, for example queries 1–10; series: Segmented Right-Deep and Segmented Bushy Tree]

Related Work

[SD90] Tradeoffs in processing complex join queries via hashing in multiprocessor database machines. VLDB 1990.

[CLYY92] Using segmented right deep trees for execution of pipelined hash joins. VLDB 1992.

[MLD94] Parallel hash based join algorithms for a shared everything environment. TKDE 1994.

[MD97] Data placement in shared nothing parallel database systems. VLDB 1997.

[WFA95] Parallel evaluation of multi-join queries. SIGMOD 1995.

[HCY94] On parallel execution of multiple pipelined hash joins. SIGMOD 1994.

[DNSS92] Practical skew handling in parallel joins. VLDB 1992.

[SHCF03] Flux: an adaptive partitioning operator for continuous query systems. ICDE 2003.

Conclusions

Observation: maximal pipelined hash join processing

Redirection costs? Optimal degree of parallelism?

Hypothesis: worthwhile to incorporate independent parallelism into processing

Use both, i.e., run several shorter pipelines in parallel

Solution: segmented bushy tree processing

Heuristics and cost-driven algorithm developed

Validation: extensive experimental studies

Achieve around 50% improvement over pure pipelined processing
