IBM Research
© 2005 IBM Corporation
Parallel Querying with Non-Dedicated Nodes
Vijayshankar Raman, Wei Han, Inderpal Narang
IBM Almaden Research Center
Parallel Querying with Non-Dedicated Computers, Aug 30 2005
Properties of a relational database
Ease of schema evolution
Declarative Querying
Transparent scalability does not quite work
Today: Partitioning is basis for parallelism
static partitioning (on the base tables)
Dynamic partitioning via exchange operators
Claim: partitioning does not handle non-dedicated nodes well
[Diagram: identical per-node plans over partitions: (L1, O1, Sa), (L2, O2, Sb), (L3, O3, Sc)]
Problems of partitioning
Hard to scale incrementally
– Data must be re-partitioned
– Disk and CPU must be scaled together
• DBA must ensure partition-CPU affinity
Homogeneity assumptions
– Same plan runs on each node
– Identical software needed on all nodes
Susceptible to load variations, node failures / stalls, …
– Response time is dictated by the speed of the slowest processor
– Bad for transient compute resources
• E.g., we want the ability to interrupt query work for higher-priority local work
[Diagram: exchange operators and initial partitioning across nodes]
GOAL: A more graceful scale-out solution
Sacrifice partitioning for scalability
– Avoid initial partitioning
– No exchange
New means for work allocation in the absence of partitioning
– Handles heterogeneity and load variations better
Two design features
– Data In The Network (DITN)
• Shared files on high-speed networks (e.g., SAN)
– Intra-fragment parallelism
• Send SQL fragments to heterogeneous join processors: each performs the same join, over a different subset of the cross-product space
• Easy fault-tolerance
• Can use heterogeneous nodes -- whatever is available at that time
Outline
Motivation
DITN design
Experimental Results
Summary
DITN Architecture
1. Find idle co-processors P1, P2, P3, P4, P5, P6
2. Prepare O, L, C
3. Logically divide O x L x C into work-units Wi
4. In parallel, run SQL queries for the Wi at the Pi
5. Property: SPJAG(O x L x C) = AG(∪i SPJAG(Wi))
[Diagram: a CPU-wrapper layer of co-processors P1..P6, each running one work-unit (O1 L1 C, O1 L2 C, O2 L1 C, O2 L2 C, O1 L3 C, O2 L3 C); shared storage holds /tmp/customers, /tmp/orders, /tmp/lineitem as pieces L1, L2, L3 and O1, O2; an Information Integrator issues the query plan and UNIONs the results]
Restrictions (we will return to this at the end)
– Pi cannot use indexes at the Information Integrator
– Isolation issues
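Steps 3-5 above can be illustrated with a toy in-memory sketch (hypothetical data and column names; the real system sends SQL fragments to join processors, not Python functions):

```python
from itertools import product

def make_workunits(o_parts, l_parts, c_parts):
    """Logically divide O x L x C into work-units: one per
    combination of one piece from each table."""
    return list(product(o_parts, l_parts, c_parts))

def spj(workunit):
    """Select-project-join over one work-unit: join O and L on
    orderkey, project (orderkey, price)."""
    o_part, l_part, c_part = workunit
    return [(o["orderkey"], l["price"])
            for o in o_part for l in l_part
            if o["orderkey"] == l["orderkey"]]

# Toy data: O cut into 2 pieces, L into 2, C into 1 (unreplicated cut).
O = [[{"orderkey": 1}], [{"orderkey": 2}]]
L = [[{"orderkey": 1, "price": 10}], [{"orderkey": 2, "price": 20}]]
C = [[{}]]

# Property: the union of SPJ over the work-units covers exactly the
# SPJ over the full cross-product; aggregation then runs on the union.
union = [row for wu in make_workunits(O, L, C) for row in spj(wu)]
total = sum(price for _, price in union)  # the final AG step
```

Each work-unit is an independent SQL query over shared files, which is what makes the Pi interchangeable.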
Why Data in the Network
Observation: Network bandwidth >> Query Operator Bandwidth
– N/W bandwidth is in Gbps (SAN/LAN); scan: 10-100 Mbps; sort: about 10 Mbps
– Interconnect transfers data faster than query operators can process it
But, exploiting this fast interconnect via SQL is tricky
– E.g. ODBC Scan: 10x slower than local scan
Instead, keep temp files in a shared storage system (e.g. SAN-FS)
– Allows exploitation of full n/w bandwidth
Immediate benefits
– Fast data transfer
– DBMS doesn’t have to worry about disks, I/O parallelism, parallel scans, etc.
– Independent scaling of CPU and I/O
Work Allocation without Partitioning
For each join, we now have to join the off-diagonal rectangles also
Minimize response time: RT = max(RT of each work-unit) = max over i,j of JoinCost(|Li|, |Oj|)
How to optimize the work allocation?
– Roughly: cut the join hyper-rectangle into n pieces so as to minimize the maximum perimeter
– Simplification: assume the join is cut into a grid
• Choices: number of cuts on each table, size of each cut, allocation of work-units to processors
[Diagrams: two work-allocation grids over Orders x Lineitem. Left: Orders partitioned by o_orderkey vs. Lineitem partitioned by l_orderkey, with work-units P1..P10. Right: Orders clustered by RID vs. Lineitem clustered by RID, with work-units P1..P10 tiling the full cross-product space]
Allocation to homogeneous processors
Theorem: For monotonic JoinCost, RT is minimized when each cut (on a table) is of the same size
So allocation is done into rectangles of size |T1|/p1, |T2|/p2, …, |Tn|/pn
Theorem: For symmetric JoinCost, RT is minimized when |T1|/p1 = |T2|/p2 = … = |Tn|/pn
E.g., with 10 processors, cut Lineitem into 5 parts and Orders into 2
Note: cutting each table into same number of partitions (as is done usually) is sub-optimal
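The two-table case can be sketched by brute-force enumeration, assuming JoinCost grows with the "perimeter" |T1|/p1 + |T2|/p2 of a work-unit (a simplification; the paper's cost model is more detailed). The table sizes below are TPC-H scale-factor-1 cardinalities:

```python
def best_grid_cut(n, sizes):
    """Enumerate grid cuts p1 x p2 with p1 * p2 <= n processors and
    return the one minimizing the max work-unit perimeter."""
    t1, t2 = sizes
    best = None
    for p1 in range(1, n + 1):
        p2 = n // p1              # use as many processors as fit the grid
        cost = t1 / p1 + t2 / p2  # perimeter of one grid cell
        if best is None or cost < best[0]:
            best = (cost, p1, p2)
    return best

# 10 processors; |Lineitem| = 6M rows, |Orders| = 1.5M rows (TPC-H SF 1)
cost, p1, p2 = best_grid_cut(10, (6_000_000, 1_500_000))
```

With these sizes the optimum is the slide's allocation, Lineitem into 5 cuts and Orders into 2, rather than the "equal number of partitions per table" choice.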
Allocation to heterogeneous co-processors
Response time of query: RT = max(RT of each work-unit)
Choose the size of each work-unit, and the allocation of work-units to co-processors, so as to minimize RT
Like a bin packing problem
– Solve for number of cuts on each table, assuming homogeneity
– Then solve a Linear Program to find the optimal size of each cut
– Have to make some approximations in order to avoid Integer Program (see paper)
[Diagram: heterogeneous allocation grid over Orders x Lineitem, with unequally sized work-units P1..P10]
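The effect of the LP can be approximated by a much simpler rule of thumb: size each cut in proportion to its co-processor's speed, so all nodes finish at roughly the same time. This is a sketch with made-up speed numbers, not the paper's actual formulation:

```python
def asymmetric_cuts(table_size, speeds):
    """Size each cut proportionally to the co-processor's relative
    speed, approximating the LP objective of minimizing the maximum
    work-unit response time."""
    total = sum(speeds)
    return [table_size * s / total for s in speeds]

# 2 fast nodes (relative speed 2.0) and 4 slow nodes (speed 0.5)
cuts = asymmetric_cuts(6_000_000, [2.0, 2.0, 0.5, 0.5, 0.5, 0.5])
```

A fast node gets a 2M-row cut while each slow node gets 500K rows; a symmetric 1M-row allocation would leave the query waiting on the slow nodes.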
Failure/Stall Resiliency by Work-Unit Reassignment
Without tuple shipping between plans, failure handling is easy
If co-processors A, B, C have finished by time X, and co-processor D has not finished by time X(1+f)
– Take D’s work-unit and assign it to the fastest among A, B, C – say A
– When either D or A returns, close the cursor on the other
Can generalize to a work-stealing scheme
– E.g. with 10 coprocessors, assign each to 1/20th of the cross-product space
– When a coprocessor returns with a result, assign it more work
Tradeoff: Finer work allocation => more flexible work-stealing BUT, more redundant work
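The timeout-and-reassign policy above can be simulated in a few lines (hypothetical finish times; the real system closes the losing cursor rather than computing with timestamps):

```python
def reassign(finish_times, f):
    """If the slowest co-processor has not returned by (1+f) times the
    point X when all others are done, re-run its work-unit on the
    fastest finished node; the query completes when either copy
    returns. Returns the resulting query response time."""
    slowest = max(finish_times, key=finish_times.get)
    others = {n: t for n, t in finish_times.items() if n != slowest}
    x = max(others.values())          # time when A, B, C are all done
    deadline = x * (1 + f)
    if finish_times[slowest] <= deadline:
        return finish_times[slowest]  # no reassignment needed
    fastest = min(others, key=others.get)
    rerun = deadline + others[fastest]  # duplicate starts at the deadline
    return min(finish_times[slowest], rerun)

# D stalls until t=100 while A, B, C finish by t=12; with f=0.25 the
# work-unit is reassigned at t=15 and the duplicate finishes on A.
rt = reassign({"A": 10, "B": 11, "C": 12, "D": 100}, f=0.25)
```

Here the query finishes at t=25 instead of waiting until t=100 for the stalled node.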
Analysis: What do we lose by not partitioning?
Say a join of L x O x C (TPC-H) with 12 processors: 12 = p1 p2 p3
RT without partitioning ~ JoinCost(|L|/p1, |O|/p2, |C|/p3)
RT with partitioning ~ JoinCost(|L|/p1p2p3, |O|/p1p2p3, |C|/p1p2p3)
At p1=6, p2=2, p3=1, the loss in CPU speedup is JoinCost(|L|/6, |O|/2, |C|) ~ 2 x JoinCost(|L|/12, |O|/12, |C|/12)
Note: I/O speedup is unaffected
Can close the gap with partitioning further: sort the largest tables of the join (e.g., L, O) on their join column
– Now the loss is JoinCost(|L|/12, |O|/12, |C|) / JoinCost(|L|/12, |O|/12, |C|/12)
– Still avoids exchange => can use heterogeneous, non-dedicated nodes, but causes problems with isolation
Optimization: selective clustering
Lightweight Join Processor
Work Allocation via Query Fragments => co-processors can be heterogeneous
Need not have a full DBMS; a join processor is enough
E.g., a screen saver for join processing
We use a trimmed-down version of Apache Derby
– Parse CSV files
– Predicates, projections, sort-merge joins, aggregates, group by
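A join processor of this kind needs little more than a CSV parser and a sort-merge join. A minimal sketch in Python (the actual processor is a trimmed-down Derby, in Java; data and column names are invented):

```python
import csv
import io

def sort_merge_join(left_rows, right_rows, key):
    """Minimal sort-merge equi-join over parsed CSV rows (dicts)."""
    left = sorted(left_rows, key=lambda r: r[key])
    right = sorted(right_rows, key=lambda r: r[key])
    out, j = [], 0
    for l in left:
        while j < len(right) and right[j][key] < l[key]:
            j += 1                       # advance past smaller right keys
        k = j
        while k < len(right) and right[k][key] == l[key]:
            out.append({**l, **right[k]})  # emit one row per match
            k += 1
    return out

orders = list(csv.DictReader(io.StringIO(
    "o_orderkey,o_orderpriority\n1,HIGH\n2,LOW\n")))
lineitem = list(csv.DictReader(io.StringIO(
    "o_orderkey,l_price\n1,10\n1,20\n2,5\n")))
rows = sort_merge_join(orders, lineitem, "o_orderkey")
```

Adding predicates, projections, and group-by on top of this loop is what turns a bare node into a usable co-processor.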
Outline
Motivation
DITN design
Experimental Results
Summary
Performance degradation due to not partitioning
[Charts: response time (ms) vs. number of nodes (2-10) for PBP, DITN, and DITN2PART, on three join queries: O-L, S-O-L, and S-O-L-C-N-R (TPC-H tables)]
At 10 nodes on S x O x L x C x N x R, DITN is about 2.1x slower than PBP (work allocation: L/5, O/2, S, C, N, R)
DITN2PART has very little slowdown
– But it needs total clustering
Slowdown oscillates due to the discreteness of work allocation
Failure/Stall Resiliency by Work-Unit Reassignment
Orders x Lineitem, group by o_orderpriority, 5 co-processors
Impose a high load on one co-processor as soon as the query begins
At 60% load (50% wait), DITN times out and switches to an alternative node
[Chart: response time (s) vs. extent of CPU load burst (0-0.9) for shared-nothing PBP, DITN, DITN-OPT, and DITN2PART]
Importance of Asymmetric Allocation
Initially 2 fast nodes; then add 4 slow nodes
With symmetric allocation, adding slow nodes can slow down the system
[Charts: response time (s) vs. number of co-processors (1-6) for DITN with symmetric vs. asymmetric allocation; and per-node response times (Nodes 1-6) for symmetric vs. asymmetric allocation at 4, 5, and 6 nodes]
Contrast between DITN-symmetric and DITN-asymmetric
Danger of Tying a Partition to a CPU
Repeated execution of O x L
Impose a 75% CPU load on one of the 5 co-processors during the 3rd iteration
PBP continues to use this slow node throughout
DITN switches to another node after two iterations
[Chart: response time (seconds) vs. query iteration (1-8) for DITN and PBP]
Related Work
Parallel query processing: Gamma, XPRS, many commercial systems
– Mostly shared-nothing
– Shared-disk: IBM Sysplex
• Queries done via tuple shipping between co-processors
– Oracle
• Shared disk, but hash joins done via partitioning (static/dynamic)
Mariposa: similar query-fragment-level work allocation
Load balancing: Exchange, Flux, River, skew-avoidance in hash joins
Fault-tolerant exchange (FLUX)
Polar*, OGSA-DQP
Distributed Eddies
Query execution on P2P systems
Summary and Future work
Partitioning-based parallelism does not handle non-dedicated nodes
Proposal: avoid partitioning
– Share data via the storage system
– Intra-fragment parallelism instead of exchange
– Careful work allocation to optimize response time
Promising initial results: only about 2x slowdown with 10 nodes
Open questions
– Index scans: want shared reads without latching
– Isolation: DITN: uncommitted read; DITN2PART: read-only
– Scaling to large numbers of nodes
– Multi-query optimization to reuse shared temp tables
Backup Slides