IBM Research
© 2005 IBM Corporation
Parallel Querying with Non-Dedicated Nodes
Vijayshankar Raman, Wei Han, Inderpal Narang
IBM Almaden Research Center
Parallel Querying with Non-Dedicated Computers, Aug 30 2005
Properties of a relational database
Ease of schema evolution
Declarative Querying
Transparent scalability does not quite work
Today: Partitioning is basis for parallelism
static partitioning (on the base tables)
Dynamic partitioning via exchange operators
Claim: partitioning does not handle non-dedicated nodes well
[Diagram: identical per-node plans over partitions: (L1, O1, Sa), (L2, O2, Sb), (L3, O3, Sc)]
Problems of partitioning
Hard to scale incrementally
– Data must be re-partitioned
– Disk and CPU must be scaled together
• DBA must ensure partition-CPU affinity
Homogeneity assumptions
– Same plan runs on each node
– Identical software needed on all nodes
Susceptible to load variations, node failures / stalls, …
– Response time is dictated by the speed of the slowest processor
– Bad for transient compute resources
• E.g., we want the ability to interrupt query work for higher-priority local work
[Diagram: exchange operators and initial partitioning across nodes]
GOAL: A more graceful scale-out solution
Sacrifice partitioning for scalability
– Avoid initial partitioning
– No exchange
New means for work allocation in the absence of partitioning
– Handles heterogeneity and load variations better
Two design features
– Data In The Network (DITN)
• Shared files on high-speed networks (e.g., SAN)
– Intra-fragment parallelism
• Send SQL fragments to heterogeneous join processors: each performs the same join, over a different subset of the cross-product space
• Easy fault-tolerance
• Can use heterogeneous nodes -- whatever is available at that time
Outline
Motivation
DITN design
Experimental Results
Summary
DITN Architecture
1. Find idle co-processors P1, P2, P3, P4, P5, P6
2. Prepare O, L, C
3. Logically divide O x L x C into work-units Wi
4. In parallel, run SQL queries for the Wi at the Pi
5. Property: SPJAG(O x L x C) = AG(∪i SPJAG(Wi))
[Diagram: a CPU-wrapper layer of co-processors P1..P6, each running one work-unit (O1 L1 C, O1 L2 C, O2 L1 C, O2 L2 C, O1 L3 C, O2 L3 C); shared storage holds /tmp/customers, /tmp/orders, /tmp/lineitem as pieces L1, L2, L3 and O1, O2; an Information Integrator issues the query plan and UNIONs the results]
Restrictions (we will return to this at the end)
– Pi cannot use indexes at the Information Integrator
– Isolation issues
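Steps 3-5 above can be illustrated with a toy in-memory sketch (hypothetical data and column names; the real system sends SQL fragments to join processors, not Python functions):

```python
from itertools import product

def make_workunits(o_parts, l_parts, c_parts):
    """Logically divide O x L x C into work-units: one per
    combination of one piece from each table."""
    return list(product(o_parts, l_parts, c_parts))

def spj(workunit):
    """Select-project-join over one work-unit: join O and L on
    orderkey, project (orderkey, price)."""
    o_part, l_part, c_part = workunit
    return [(o["orderkey"], l["price"])
            for o in o_part for l in l_part
            if o["orderkey"] == l["orderkey"]]

# Toy data: O cut into 2 pieces, L into 2, C into 1 (unreplicated cut).
O = [[{"orderkey": 1}], [{"orderkey": 2}]]
L = [[{"orderkey": 1, "price": 10}], [{"orderkey": 2, "price": 20}]]
C = [[{}]]

# Property: the union of SPJ over the work-units covers exactly the
# SPJ over the full cross-product; aggregation then runs on the union.
union = [row for wu in make_workunits(O, L, C) for row in spj(wu)]
total = sum(price for _, price in union)  # the final AG step
```

Each work-unit is an independent SQL query over shared files, which is what makes the Pi interchangeable.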
Why Data in the Network
Observation: Network bandwidth >> Query Operator Bandwidth
– N/W bandwidth is in Gbps (SAN/LAN); scan: 10-100 Mbps; sort: about 10 Mbps
– Interconnect transfers data faster than query operators can process it
But, exploiting this fast interconnect via SQL is tricky
– E.g. ODBC Scan: 10x slower than local scan
Instead, keep temp files in a shared storage system (e.g. SAN-FS)
– Allows exploitation of full n/w bandwidth
Immediate benefits
– Fast data transfer
– DBMS doesn’t have to worry about disks, I/O parallelism, parallel scans, etc.
– Independent scaling of CPU and I/O
Work Allocation without Partitioning
For each join, we now have to join the off-diagonal rectangles also
Minimize response time: RT = max(RT of each work-unit) = max over i,j of JoinCost(|Li|, |Oj|)
How to optimize the work allocation?
– Roughly: cut the join hyper-rectangle into n pieces so as to minimize the maximum perimeter
– Simplification: assume the join is cut into a grid
• Choices: number of cuts on each table, size of each cut, allocation of work-units to processors
[Diagrams: two work-allocation grids over Orders x Lineitem. Left: Orders partitioned by o_orderkey vs. Lineitem partitioned by l_orderkey, with work-units P1..P10. Right: Orders clustered by RID vs. Lineitem clustered by RID, with work-units P1..P10 tiling the full cross-product space]
Allocation to homogeneous processors
Theorem: For monotonic JoinCost, RT is minimized when each cut (on a table) is of the same size
So allocation is done into rectangles of size |T1|/p1, |T2|/p2, …, |Tn|/pn
Theorem: For symmetric JoinCost, RT is minimized when |T1|/p1 = |T2|/p2 = … = |Tn|/pn
E.g., with 10 processors, cut Lineitem into 5 parts and Orders into 2
Note: cutting each table into same number of partitions (as is done usually) is sub-optimal
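The two-table case can be sketched by brute-force enumeration, assuming JoinCost grows with the "perimeter" |T1|/p1 + |T2|/p2 of a work-unit (a simplification; the paper's cost model is more detailed). The table sizes below are TPC-H scale-factor-1 cardinalities:

```python
def best_grid_cut(n, sizes):
    """Enumerate grid cuts p1 x p2 with p1 * p2 <= n processors and
    return the one minimizing the max work-unit perimeter."""
    t1, t2 = sizes
    best = None
    for p1 in range(1, n + 1):
        p2 = n // p1              # use as many processors as fit the grid
        cost = t1 / p1 + t2 / p2  # perimeter of one grid cell
        if best is None or cost < best[0]:
            best = (cost, p1, p2)
    return best

# 10 processors; |Lineitem| = 6M rows, |Orders| = 1.5M rows (TPC-H SF 1)
cost, p1, p2 = best_grid_cut(10, (6_000_000, 1_500_000))
```

With these sizes the optimum is the slide's allocation, Lineitem into 5 cuts and Orders into 2, rather than the "equal number of partitions per table" choice.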
Allocation to heterogeneous co-processors
Response time of query: RT = max(RT of each work-unit)
Choose the size of each work-unit, and the allocation of work-units to co-processors, so as to minimize RT
Like a bin packing problem
– Solve for number of cuts on each table, assuming homogeneity
– Then solve a Linear Program to find the optimal size of each cut
– Have to make some approximations in order to avoid Integer Program (see paper)
[Diagram: heterogeneous allocation grid over Orders x Lineitem, with unequally sized work-units P1..P10]
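The effect of the LP can be approximated by a much simpler rule of thumb: size each cut in proportion to its co-processor's speed, so all nodes finish at roughly the same time. This is a sketch with made-up speed numbers, not the paper's actual formulation:

```python
def asymmetric_cuts(table_size, speeds):
    """Size each cut proportionally to the co-processor's relative
    speed, approximating the LP objective of minimizing the maximum
    work-unit response time."""
    total = sum(speeds)
    return [table_size * s / total for s in speeds]

# 2 fast nodes (relative speed 2.0) and 4 slow nodes (speed 0.5)
cuts = asymmetric_cuts(6_000_000, [2.0, 2.0, 0.5, 0.5, 0.5, 0.5])
```

A fast node gets a 2M-row cut while each slow node gets 500K rows; a symmetric 1M-row allocation would leave the query waiting on the slow nodes.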
Failure/Stall Resiliency by Work-Unit Reassignment
Without tuple shipping between plans, failure handling is easy
If co-processors A, B, C have finished by time X, and co-processor D has not finished by time X(1+f)
– Take D’s work-unit and assign it to the fastest among A, B, C – say A
– When either D or A returns, close the cursor on the other
Can generalize to a work-stealing scheme
– E.g. with 10 coprocessors, assign each to 1/20th of the cross-product space
– When a coprocessor returns with a result, assign it more work
Tradeoff: Finer work allocation => more flexible work-stealing BUT, more redundant work
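The timeout-and-reassign policy above can be simulated in a few lines (hypothetical finish times; the real system closes the losing cursor rather than computing with timestamps):

```python
def reassign(finish_times, f):
    """If the slowest co-processor has not returned by (1+f) times the
    point X when all others are done, re-run its work-unit on the
    fastest finished node; the query completes when either copy
    returns. Returns the resulting query response time."""
    slowest = max(finish_times, key=finish_times.get)
    others = {n: t for n, t in finish_times.items() if n != slowest}
    x = max(others.values())          # time when A, B, C are all done
    deadline = x * (1 + f)
    if finish_times[slowest] <= deadline:
        return finish_times[slowest]  # no reassignment needed
    fastest = min(others, key=others.get)
    rerun = deadline + others[fastest]  # duplicate starts at the deadline
    return min(finish_times[slowest], rerun)

# D stalls until t=100 while A, B, C finish by t=12; with f=0.25 the
# work-unit is reassigned at t=15 and the duplicate finishes on A.
rt = reassign({"A": 10, "B": 11, "C": 12, "D": 100}, f=0.25)
```

Here the query finishes at t=25 instead of waiting until t=100 for the stalled node.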
Analysis: What do we lose by not partitioning?
Say a join of L x O x C (TPC-H) with 12 processors: 12 = p1 p2 p3
RT without partitioning ~ JoinCost(|L|/p1, |O|/p2, |C|/p3)
RT with partitioning ~ JoinCost(|L|/p1p2p3, |O|/p1p2p3, |C|/p1p2p3)
At p1=6, p2=2, p3=1, the loss in CPU speedup is JoinCost(|L|/6, |O|/2, |C|) ~ 2 x JoinCost(|L|/12, |O|/12, |C|/12)
Note: I/O speedup is unaffected
Can close the gap with partitioning further: sort the largest tables of the join (e.g., L, O) on their join column
– Now the loss is JoinCost(|L|/12, |O|/12, |C|) / JoinCost(|L|/12, |O|/12, |C|/12)
– Still avoids exchange => can use heterogeneous, non-dedicated nodes, but causes problems with isolation
Optimization: selective clustering
Lightweight Join Processor
Work Allocation via Query Fragments => co-processors can be heterogeneous
Need not have a full DBMS; a join processor is enough
E.g., a screen saver for join processing
We use a trimmed-down version of Apache Derby
– Parse CSV files
– Predicates, projections, sort-merge joins, aggregates, group by
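A join processor of this kind needs little more than a CSV parser and a sort-merge join. A minimal sketch in Python (the actual processor is a trimmed-down Derby, in Java; data and column names are invented):

```python
import csv
import io

def sort_merge_join(left_rows, right_rows, key):
    """Minimal sort-merge equi-join over parsed CSV rows (dicts)."""
    left = sorted(left_rows, key=lambda r: r[key])
    right = sorted(right_rows, key=lambda r: r[key])
    out, j = [], 0
    for l in left:
        while j < len(right) and right[j][key] < l[key]:
            j += 1                       # advance past smaller right keys
        k = j
        while k < len(right) and right[k][key] == l[key]:
            out.append({**l, **right[k]})  # emit one row per match
            k += 1
    return out

orders = list(csv.DictReader(io.StringIO(
    "o_orderkey,o_orderpriority\n1,HIGH\n2,LOW\n")))
lineitem = list(csv.DictReader(io.StringIO(
    "o_orderkey,l_price\n1,10\n1,20\n2,5\n")))
rows = sort_merge_join(orders, lineitem, "o_orderkey")
```

Adding predicates, projections, and group-by on top of this loop is what turns a bare node into a usable co-processor.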
Outline
Motivation
DITN design
Experimental Results
Summary
Performance degradation due to not partitioning
[Charts: response time (ms) vs. number of nodes (2-10) for PBP, DITN, and DITN2PART, on three join queries: O-L, S-O-L, and S-O-L-C-N-R (TPC-H tables)]
At 10 nodes on S x O x L x C x N x R, DITN is about 2.1x slower than PBP (work allocation: L/5, O/2, S, C, N, R)
DITN2PART has very little slowdown
– But it needs total clustering
Slowdown oscillates due to the discreteness of work allocation
Failure/Stall Resiliency by Work-Unit Reassignment
Orders x Lineitem, group by o_orderpriority, 5 co-processors
Impose a high load on one co-processor as soon as the query begins
At 60% load (50% wait), DITN times out and switches to an alternative node
[Chart: response time (s) vs. extent of CPU load burst (0-0.9) for shared-nothing PBP, DITN, DITN-OPT, and DITN2PART]
Importance of Asymmetric Allocation
Initially 2 fast nodes; then add 4 slow nodes
With symmetric allocation, adding slow nodes can slow down the system
[Charts: response time (s) vs. number of co-processors (1-6) for DITN with symmetric vs. asymmetric allocation; and per-node response times (Nodes 1-6) for symmetric vs. asymmetric allocation at 4, 5, and 6 nodes]
Contrast between DITN-symmetric and DITN-asymmetric
Danger of Tying a Partition to a CPU
Repeated execution of O x L
Impose a 75% CPU load on one of the 5 co-processors during the 3rd iteration
PBP continues to use this slow node throughout
DITN switches to another node after two iterations
[Chart: response time (seconds) vs. query iteration (1-8) for DITN and PBP]
Related Work
Parallel query processing: Gamma, XPRS, many commercial systems
– Mostly shared-nothing
– Shared-disk: IBM Sysplex
• Queries done via tuple shipping between co-processors
– Oracle
• Shared disk, but hash joins done via partitioning (static/dynamic)
Mariposa: similar query-fragment-level work allocation
Load balancing: Exchange, Flux, River, skew-avoidance in hash joins
Fault-tolerant exchange (FLUX)
Polar*, OGSA-DQP
Distributed Eddies
Query execution on P2P systems
Summary and Future work
Partitioning-based parallelism does not handle non-dedicated nodes
Proposal: avoid partitioning
– Share data via the storage system
– Intra-fragment parallelism instead of exchange
– Careful work allocation to optimize response time
Promising initial results: only about 2x slowdown with 10 nodes
Open questions
– Index scans: want shared reads without latching
– Isolation: DITN: uncommitted read; DITN2PART: read-only
– Scaling to large numbers of nodes
– Multi-query optimization to reuse shared temp tables
Backup Slides