SKEW IN PARALLEL QUERY PROCESSING

Paraschos Koutris, Paul Beame, Dan Suciu
University of Washington
PODS 2014


MOTIVATION

• Understand the complexity of parallel query processing on big data
– on shared-nothing architectures (e.g. MapReduce)
– even in the presence of data skew

• Dominating parameters of computation:
– Communication cost
– Number of communication rounds

THE MPC MODEL

• Computation proceeds in synchronous rounds
– Local Computation
– Global Communication

• #bits received by each server at each round ≤ L

[Figure: the input (size = M) is distributed over servers 1, 2, …, p; the servers communicate in rounds 1, …, r and then produce the output.]

THE MPC MODEL

What is the minimum load L of an MPC algorithm that computes a Conjunctive Query Q in one round?

[Beame, Koutris, Suciu, PODS 2013] Tight upper and lower bounds for relations of equal size (M bits) and no skew

[Figure: the range of the maximum load L. At one extreme the data is evenly distributed, which maximizes parallelism; at the other extreme the computation is equivalent to sequential computation, with no parallelism.]

RESULTS

Computing a Conjunctive Query Q in the MPC model in one round for relations with different sizes and skew

• Matching upper and lower bounds for any skew-free input database and different relation sizes

• Almost matching upper and lower bounds in the presence of skew
– Matching bounds in the case of simple joins

CONJUNCTIVE QUERIES

• Full Conjunctive Queries w/o self-joins:
– Q(x, y, z) = R(x, y), S(y, z), T(z, x) [the triangle query]

• The hypergraph of the query Q:
– Variables as vertices
– Atoms as hyperedges

[Figure: the hypergraph of the triangle query, with vertices x, y, z and edges R, S, T.]

EXAMPLE: CARTESIAN PRODUCT

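• The cartesian product: Q(x, y) = S1(x), S2(y) with cardinalities m1, m2

• ALGORITHM
– Organize the p servers in a $p_1 \times p_2$ rectangle, with $p = p_1 p_2$
– Send S1(x) → (h1(x), *) and S2(y) → (*, h2(y)), where * stands for every server along that dimension
– The load will be $L = m_1/p_1 + m_2/p_2$
– To minimize L choose $p_1 = \sqrt{p\, m_1 / m_2}$ and $p_2 = \sqrt{p\, m_2 / m_1}$, which gives $L = 2\sqrt{m_1 m_2 / p}$

• The algorithm is optimal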
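The rectangle layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `rectangle_shares` and `route_cartesian` are hypothetical, the shares are found by brute force over integers rather than by the closed form, and Python's built-in `hash` stands in for the hash functions h1 and h2.

```python
def rectangle_shares(m1, m2, p):
    """Brute-force integer shares (p1, p2), p1 * p2 <= p, minimizing m1/p1 + m2/p2."""
    best = None
    for p1 in range(1, p + 1):
        p2 = p // p1
        load = m1 / p1 + m2 / p2
        if best is None or load < best[0]:
            best = (load, p1, p2)
    return best[1], best[2]


def route_cartesian(s1, s2, p1, p2, h1=hash, h2=hash):
    """Send S1(x) to row h1(x) and S2(y) to column h2(y) of a p1 x p2 grid."""
    servers = {(i, j): ([], []) for i in range(p1) for j in range(p2)}
    for x in s1:
        i = h1(x) % p1
        for j in range(p2):          # replicate along the second dimension
            servers[(i, j)][0].append(x)
    for y in s2:
        j = h2(y) % p2
        for i in range(p1):          # replicate along the first dimension
            servers[(i, j)][1].append(y)
    return servers


p1, p2 = rectangle_shares(8, 8, 16)
servers = route_cartesian(range(8), range(8), p1, p2)
# Each server joins its two local fragments; together they cover S1 x S2.
pairs = [(x, y) for xs, ys in servers.values() for x in xs for y in ys]
print(p1, p2, len(pairs))
```

Every output pair (x, y) is produced at exactly one server, the one at point (h1(x), h2(y)), so no deduplication step is needed.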


LOWER BOUNDS (1)

• For a cartesian product Q = S1 × S2 × … × Su with cardinalities m1, …, mu, the lower bound for the load is $L = \Omega\left( (m_1 m_2 \cdots m_u / p)^{1/u} \right)$

• For a Conjunctive Query Q(x1,…, xk) = S1(…), …, Sl(…), any subset of relations $S_{j_1}, S_{j_2}, \ldots, S_{j_u}$ without shared variables (an edge packing for the hypergraph of Q) gives a lower bound for the load

• The lower bound also holds with any fractional edge packing

LOWER BOUNDS (2)

Theorem. For a Conjunctive Query Q, where relation Sj has size Mj (in bits), any MPC algorithm that computes Q in one round with maximum load L must satisfy, for some constant c and for any fractional edge packing u = (u1, …, ul):

$$L \ge c \cdot \left( \frac{\prod_j M_j^{u_j}}{p} \right)^{1/\sum_j u_j}$$

Proof techniques:
• Using entropy to bound knowledge
• Friedgut’s inequality to bound the maximum size of a query

HYPERCUBE ALGORITHM

• Q(x1,…, xk) = S1(…), …, Sl(…)

• For each variable xi define the share to be an integer pi such that: p = p1 × … × pk

• Assign each of the p servers to a point on the k-dimensional hypercube: [p] = [p1] × … × [pk]

• Hash each tuple to the appropriate subcube, e.g. S3(x3, x4) → (*, *, h3(x3), h4(x4), *, …)

EXAMPLE: THE TRIANGLE QUERY

• Algorithm: [Ganguly ’92, Afrati ’10, Suri ’11]
– The p servers form a cube: $[p^{1/3}] \times [p^{1/3}] \times [p^{1/3}]$
– Send each tuple to servers (each tuple is replicated $p^{1/3}$ times):
• R(a, b) → (hx(a), hy(b), -)
• S(b, c) → (-, hy(b), hz(c))
• T(c, a) → (hx(a), -, hz(c))
– The three tuples of a triangle (a, b, c) meet at the server (hx(a), hy(b), hz(c))
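The routing rule for the triangle query can be simulated on one machine as follows. This is a minimal sketch, not the authors' implementation: `hypercube_triangle` is a hypothetical name, a single hash function (Python's `hash` modulo the cube side) is assumed for all three dimensions, and the local join is a naive nested loop.

```python
from itertools import product


def hypercube_triangle(R, S, T, side):
    """Simulate one-round HyperCube for Q(x,y,z) = R(x,y), S(y,z), T(z,x)
    on a side x side x side cube of servers (so p = side ** 3)."""
    def h(v):
        return hash(v) % side

    servers = {pt: {"R": [], "S": [], "T": []}
               for pt in product(range(side), repeat=3)}
    for a, b in R:                      # R(a, b) -> (hx(a), hy(b), -)
        for k in range(side):
            servers[(h(a), h(b), k)]["R"].append((a, b))
    for b, c in S:                      # S(b, c) -> (-, hy(b), hz(c))
        for i in range(side):
            servers[(i, h(b), h(c))]["S"].append((b, c))
    for c, a in T:                      # T(c, a) -> (hx(a), -, hz(c))
        for j in range(side):
            servers[(h(a), j, h(c))]["T"].append((c, a))

    triangles = set()
    for frag in servers.values():       # local join at every server
        t_set = set(frag["T"])
        for a, b in frag["R"]:
            for b2, c in frag["S"]:
                if b2 == b and (c, a) in t_set:
                    triangles.add((a, b, c))
    return triangles, servers


R = [(1, 2), (2, 3), (1, 3)]
S = [(2, 3), (3, 1)]
T = [(3, 1), (1, 2)]
triangles, servers = hypercube_triangle(R, S, T, side=2)
print(sorted(triangles))
```

Each tuple is copied `side` times, which is the $p^{1/3}$ replication factor of the slide.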


ANALYSIS OF HYPERCUBE (1)

• For a vector of shares p = (p1, …, pk), how is relation Sj distributed to the servers?

• Ideally, each server receives $M_j / \prod_{i: x_i \in vars(S_j)} p_i$ tuples

• Example: relation R(x, y) of the triangle query
– Ideal load L = M / #cells = $M/p^{2/3}$ (R is hashed onto a $p^{1/3} \times p^{1/3}$ grid of cells)
– If R has a single value in the x-column, the load will instead be $M/p^{1/3}$
– The load will be $O(M/p^{2/3})$ if each value appears in the x and y columns at most $M/p^{1/3}$ times
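The gap between the ideal load and the skewed load for R(x, y) can be checked with a small simulation. This is a sketch under simplifying assumptions: `grid_load` is a hypothetical helper, and the hash functions are replaced by a plain modulus on integer values so that the result is deterministic.

```python
from collections import Counter


def grid_load(R, side):
    """Maximum number of R-tuples landing in one cell of a side x side grid,
    using x % side and y % side as stand-ins for hx and hy."""
    cells = Counter((x % side, y % side) for x, y in R)
    return max(cells.values())


side = 4                                            # plays the role of p^(1/3)
uniform = [(x, y) for x in range(8) for y in range(8)]   # M = 64, no skew
skewed = [(0, y) for y in range(64)]                     # M = 64, one x-value

print(grid_load(uniform, side))   # M / side**2
print(grid_load(skewed, side))    # M / side
```

With a single x-value only one row of the grid is used, so the load degrades from M / side² to M / side, matching the two cases listed above.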


ANALYSIS OF HYPERCUBE (2)

• In general, a relation Sj is skew-free w.r.t. p if for any subset of variables x of vars(Sj), every value appears at most $M_j / \prod_{i: x_i \in x} p_i$ times

• If every relation is skew-free w.r.t. p then the maximum load of the HYPERCUBE algorithm is: $O\left( \max_j M_j / \prod_{i: x_i \in vars(S_j)} p_i \right)$

ANALYSIS OF HYPERCUBE (3)

• The maximum load of the HYPERCUBE algorithm is always bounded by $O\left( \max_j M_j / \min_{i: x_i \in vars(S_j)} p_i \right)$

• Example: the triangle join with shares $p_x = p_y = p_z = p^{1/3}$
– For a skew-free database, the load is $O(M/p^{2/3})$
– Otherwise, the load is always bounded by $O(M/p^{1/3})$

COMPUTING THE SHARES

• The optimal shares are computed by solving a Linear Program (LP)
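The LP itself is not preserved in this transcript. One plausible form, derived here from the skew-free load formula $\max_j M_j / \prod_{i: x_i \in vars(S_j)} p_i$ by taking logarithms base p, is sketched below. The symbols $e_i$, $\mu_j$ and $\lambda$ are notation assumed for this sketch: $p_i = p^{e_i}$, $M_j = p^{\mu_j}$ and $L = p^{\lambda}$.

```latex
\begin{align*}
\text{minimize}\quad   & \lambda \\
\text{subject to}\quad & \sum_{i=1}^{k} e_i \le 1
                         && \text{(the shares multiply to at most } p\text{)} \\
                       & \sum_{i:\, x_i \in vars(S_j)} e_i + \lambda \ge \mu_j
                         && \text{for every relation } S_j \\
                       & e_i \ge 0, \quad \lambda \ge 0
\end{align*}
```

For the triangle query with three relations of equal size M, the solution $e_x = e_y = e_z = 1/3$ and $\lambda = \mu - 2/3$ satisfies every constraint with equality, which corresponds to the load $M/p^{2/3}$ of the cube layout.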


ANALYSIS OF HYPERCUBE

Theorem. For a conjunctive query Q, where relation Sj has size Mj and is skew-free, there exist shares such that the HYPERCUBE algorithm runs with maximum load

$$O\left( \max_{u \in pk(Q)} \left( \frac{\prod_j M_j^{u_j}}{p} \right)^{1/\sum_j u_j} \right)$$

where pk(Q) is the set of all fractional edge packings of Q.

By using an LP duality argument, we can prove that the load matches the lower bound.

EDGE PACKINGS FOR THE TRIANGLE

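Q(x, y, z) = R(x, y), S(y, z), T(z, x)

| Edge packing u | Load (asymptotic) |
| --- | --- |
| (1/2, 1/2, 1/2) | $(M_R M_S M_T)^{1/3} / p^{2/3}$ |
| (1, 0, 0) | $M_R / p$ |
| (0, 1, 0) | $M_S / p$ |
| (0, 0, 1) | $M_T / p$ |

[Figure: the hypergraph of the triangle query, with vertices x, y, z and edges R, S, T.]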
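The entries of this table follow the pattern $(\prod_j M_j^{u_j} / p)^{1/\sum_j u_j}$, which can be evaluated numerically. The sketch below assumes that pattern; the helper names `is_edge_packing` and `packing_load` and the example sizes are illustrative and not taken from the slides.

```python
from math import prod

# Hypergraph of the triangle query: each atom lists the variables it contains.
TRIANGLE = {"R": "xy", "S": "yz", "T": "zx"}


def is_edge_packing(edges, u):
    """True if, for every variable, the weights of the atoms containing it sum to <= 1."""
    variables = set("".join(edges.values()))
    return all(
        sum(w for name, w in u.items() if v in edges[name]) <= 1
        for v in variables
    )


def packing_load(sizes, u, p):
    """Evaluate (prod_j M_j^{u_j} / p)^(1 / sum_j u_j) for one packing u."""
    total = sum(u.values())
    return (prod(sizes[name] ** w for name, w in u.items()) / p) ** (1 / total)


sizes = {"R": 4096, "S": 4096, "T": 4096}
p = 64
packings = [
    {"R": 0.5, "S": 0.5, "T": 0.5},
    {"R": 1, "S": 0, "T": 0},
    {"R": 0, "S": 1, "T": 0},
    {"R": 0, "S": 0, "T": 1},
]
for u in packings:
    assert is_edge_packing(TRIANGLE, u)
    print(u, packing_load(sizes, u, p))
```

With equal sizes the packing (1/2, 1/2, 1/2) gives the largest value, so it determines the bound; if one relation is much larger than the others, the corresponding unit packing takes over.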


THE PRESENCE OF SKEW

• A simple join Q(x, y, z) = S1(x, z), S2(y, z)

• Optimal shares px = py = 1, pz = p
– Standard parallel hash-join
– If the database has no skew, L = O(max{M1, M2} / p)
– If it is skewed, the load can be as bad as O(M) (all tuples are sent to the same server)

• For any value h of z, mj(h) = frequency of h in Sj

SKEW-AWARE JOIN (1)

Q(x,y,z) = S1(x, z), S2(y, z)

• Idea: identify the heavy hitters and treat them differently
• h is a heavy hitter in Sj if mj(h) > Mj/p
• h is light otherwise

CASE 1 (LIGHT)
• For all light values h, run the HyperCube algorithm (hash-join on z) on all p servers

SKEW-AWARE JOIN (2)

CASE 2 (HEAVY)

For any heavy hitter h (either in S1 or S2)

• Compute the residual query (a cartesian product) Q[z\h] = S1(x, h), S2(y, h) using ph exclusive servers.

• Choose ph such that
– The sum of the ph is O(p)
– The load for every residual query Q[z\h] is the same
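One way to pick the ph is sketched below. The allocation rule is an assumption made for illustration and is not stated on the slides: each heavy hitter h receives servers in proportion to the largest of m1(h)/M1, m2(h)/M2 and its share of the total residual output size, rounded up. The function name `heavy_hitter_servers` is hypothetical.

```python
from collections import Counter


def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def heavy_hitter_servers(S1, S2, p):
    """S1 holds (x, z) tuples, S2 holds (y, z) tuples.
    Returns {h: p_h} for every value h of z that is heavy in S1 or S2."""
    m1 = Counter(z for _, z in S1)
    m2 = Counter(z for _, z in S2)
    M1, M2 = len(S1), len(S2)
    heavy = {h for h in m1 if m1[h] * p > M1} | {h for h in m2 if m2[h] * p > M2}
    total = sum(m1[h] * m2[h] for h in heavy)   # residual output size over heavy values
    alloc = {}
    for h in heavy:
        by_output = ceil_div(p * m1[h] * m2[h], total) if total else 1
        alloc[h] = max(1, ceil_div(p * m1[h], M1), ceil_div(p * m2[h], M2), by_output)
    return alloc


# z = 0 appears 60 times, z = 1 appears 30 times, ten other values appear once.
S1 = [(i, 0) for i in range(60)] + [(i, 1) for i in range(30)] + [(i, 100 + i) for i in range(10)]
S2 = list(S1)
print(heavy_hitter_servers(S1, S2, p=10))
```

Under this rule the allocations sum to at most 3p plus the number of heavy hitters, which is O(p) because each relation has fewer than p heavy values.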


SKEW: SIMPLE JOIN

Theorem. Any MPC algorithm that computes the join query in one round must satisfy:

$$L = \Omega\left( \max\left\{ \frac{M_1}{p},\ \frac{M_2}{p},\ \sqrt{\frac{\sum_h m_1(h)\, m_2(h)}{p}} \right\} \right)$$

The skew-aware join achieves the above optimal load.

SKEW IN CONJUNCTIVE QUERIES

• For any conjunctive query Q, our algorithm computes the light values using HYPERCUBE
– Since there is no skew, this part is optimal

• For the heavy hitters, it considers the residual queries and assigns to each an appropriate number of exclusive servers
– The values of the heavy hitters and their frequency must be known to the algorithm

CONCLUSION

Summary
• Upper and lower bounds for computing Conjunctive Queries in the MPC model in the presence of skew

Open Problems
• What is the load L when we consider more rounds?
• How do other classes of queries behave?

Thank you!

DUALITY: EDGE PACKING

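Fractional edge packing: assign a weight uj ≥ 0 to each relation Sj such that, for each variable xi, the sum of the weights of the edges that contain it is at most 1

[Figure: the hypergraph of q(x, y, z) = R(x, y), S(y, z), T(z, x), with weight 1/2 on each of the edges R, S, T.]

By duality, the minimum value of the LP is equal to the maximum value, over all edge packings u in pk(q), of $\log_p \left( \prod_j M_j^{u_j} / p \right)^{1/\sum_j u_j}$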