
SKEW IN PARALLEL QUERY PROCESSING

Paraschos Koutris, Paul Beame, Dan Suciu

University of Washington, PODS 2014

MOTIVATION

• Understand the complexity of parallel query processing on big data
  – on shared-nothing architectures (e.g. MapReduce)
  – even in the presence of data skew

• Dominating parameters of computation:
  – Communication cost
  – Number of communication rounds

THE MPC MODEL

• Computation proceeds in synchronous rounds
  – Local Computation
  – Global Communication

[Figure: p servers (1, 2, …, p) take the INPUT (size = M) through round 1 to round r, exchanging data between rounds, and produce the OUTPUT]

• The number of bits each server receives in each round is ≤ L

THE MPC MODEL

What is the minimum load L of an MPC algorithm that computes a Conjunctive Query Q in one round?

[Beame, Koutris, Suciu, PODS 2013] Tight upper and lower bounds for relations of equal size (M bits) and no skew

The maximum load L ranges between two extremes:
  – Lowest load: the data is evenly distributed, which maximizes parallelism
  – Highest load: equivalent to sequential computation, with no parallelism

RESULTS

Computing a Conjunctive Query Q in the MPC model in one round for relations with different sizes and skew

• Matching upper and lower bounds for any skew-free input database and different relation sizes

• Almost matching upper and lower bounds in the presence of skew
  – Matching bounds in the case of simple joins

CONJUNCTIVE QUERIES

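• Full Conjunctive Queries without self-joins:
  – Q(x, y, z) = R(x, y), S(y, z), T(z, x) [the triangle query]

• The hypergraph of the query Q:
  – Variables as vertices
  – Atoms as hyperedges

[Figure: hypergraph of the triangle query, with vertices x, y, z and edges R = {x, y}, S = {y, z}, T = {z, x}]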

EXAMPLE: CARTESIAN PRODUCT

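• The cartesian product: Q(x, y) = S1(x), S2(y) with cardinalities m1, m2

• ALGORITHM
  – Organize the p servers in a p1 × p2 rectangle, with p = p1 · p2
  – Send each tuple along one dimension of the rectangle:
      S1(x) → (h1(x), *)
      S2(y) → (*, h2(y))
  – The load will be L = O(m1/p1 + m2/p2)
  – To minimize L choose p1 = (p · m1/m2)^{1/2} and p2 = (p · m2/m1)^{1/2}, which gives L = O((m1 · m2 / p)^{1/2})

• The algorithm is optimal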
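The choice of rectangle can be checked numerically. The sketch below assumes the load of a p1 × p2 layout is m1/p1 + m2/p2 tuples per server (one block of S1 plus one block of S2) and treats the shares as real numbers; the function names are illustrative and do not come from the slides.

```python
import math


def rectangle_shares(m1: int, m2: int, p: int) -> tuple[float, float]:
    """Real-valued shares (p1, p2) with p1 * p2 = p that balance m1/p1 and m2/p2."""
    p1 = math.sqrt(p * m1 / m2)
    p2 = math.sqrt(p * m2 / m1)
    return p1, p2


def rectangle_load(m1: int, m2: int, p1: float, p2: float) -> float:
    """Assumed per-server load: one block of S1 plus one block of S2."""
    return m1 / p1 + m2 / p2


if __name__ == "__main__":
    m1, m2, p = 400, 100, 16
    p1, p2 = rectangle_shares(m1, m2, p)
    print(p1, p2, rectangle_load(m1, m2, p1, p2))
    # Compare against every integer factorization of p.
    for a in (1, 2, 4, 8, 16):
        print(a, p // a, rectangle_load(m1, m2, a, p // a))
```

In practice the shares must be integers, so an implementation would round them; the sketch ignores rounding.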

LOWER BOUNDS (1)

• For a cartesian product Q = S1 × S2 × … × Su, where Sj has size Mj, the lower bound for the load is L = Ω((M1 · M2 · … · Mu / p)^{1/u})

• For a Conjunctive Query Q(x1,…, xk) = S1(…), …, Sl(…) any subset of relations Sj1, Sj2, …, Sju without shared variables (an edge packing for the hypergraph of Q) gives a lower bound for the load

• The lower bound also holds with any fractional edge packing


LOWER BOUNDS (2)

Theorem: For a Conjunctive Query Q, where relation Sj has size Mj (in bits), any MPC algorithm that computes Q in one round with maximum load L must satisfy, for some constant c and for any fractional edge packing u = (u1, …, ul):

  L ≥ c · (M1^{u1} · M2^{u2} · … · Ml^{ul} / p)^{1/(u1 + … + ul)}

Proof techniques:
• Using entropy to bound knowledge
• Friedgut’s inequality to bound the maximum size of a query

HYPERCUBE ALGORITHM

• Q(x1,…, xk) = S1(…), …, Sl(…)

• For each variable xi define the share to be an integer pi such that: p = p1 × … × pk

• Assign each of the p servers to a point on the k-dimensional hypercube: [p] = [p1] × … × [pk]

• Hash each tuple to the appropriate subcube, e.g. S3(x3, x4) → (*, *, h3(x3), h4(x4), *, …)
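The routing step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the dictionaries `shares` and `hashes`, the function name, and the use of modulo as a hash function are all assumptions made for the example.

```python
from itertools import product


def hypercube_destinations(atom_vars, tup, shares, hashes):
    """Return the server coordinates that receive `tup`, a tuple of an atom over `atom_vars`.

    shares: dict variable -> share p_i (dict order fixes the hypercube dimensions)
    hashes: dict variable -> hash function mapping a value into range(p_i)
    """
    bound = dict(zip(atom_vars, tup))
    axes = []
    for var, p_i in shares.items():
        if var in bound:
            # Dimension fixed by hashing the tuple's value for this variable.
            axes.append([hashes[var](bound[var])])
        else:
            # Variable not in the atom: replicate along this dimension ("*").
            axes.append(range(p_i))
    return list(product(*axes))


if __name__ == "__main__":
    shares = {"x1": 2, "x2": 2, "x3": 3, "x4": 2}  # p = 24 servers
    hashes = {v: (lambda a, n=n: a % n) for v, n in shares.items()}
    # S3(x3, x4) tuple (7, 5) goes to every server of the form (*, *, h3(7), h4(5)).
    for server in hypercube_destinations(("x3", "x4"), (7, 5), shares, hashes):
        print(server)
```

Each tuple of S3 is replicated p1 · p2 = 4 times here, once for every combination of the dimensions its atom does not mention.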


EXAMPLE: THE TRIANGLE QUERY

• Algorithm: [Ganguly ’92, Afrati ’10, Suri ’11]

  – The p servers form a cube: [p^{1/3}] × [p^{1/3}] × [p^{1/3}]
  – Send each tuple to servers (each tuple is replicated p^{1/3} times):
      • R(a, b) → (hx(a), hy(b), -)
      • S(b, c) → (-, hy(b), hz(c))
      • T(c, a) → (hx(a), -, hz(c))
  – Every triangle (a, b, c) is found at server (hx(a), hy(b), hz(c))
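The following sketch simulates this one-round triangle algorithm on a small instance. It assumes integer values, a single modulo hash shared by all three dimensions, and a naive local join; these choices and all names are illustrative only.

```python
from collections import defaultdict


def triangle_hypercube(R, S, T, side):
    """Simulate one round on a side x side x side cube of servers; return all triangles."""

    def h(v):
        return v % side  # stand-in for hx, hy, hz

    servers = defaultdict(lambda: {"R": set(), "S": set(), "T": set()})
    for a, b in R:  # R(a, b) -> (hx(a), hy(b), *)
        for k in range(side):
            servers[(h(a), h(b), k)]["R"].add((a, b))
    for b, c in S:  # S(b, c) -> (*, hy(b), hz(c))
        for i in range(side):
            servers[(i, h(b), h(c))]["S"].add((b, c))
    for c, a in T:  # T(c, a) -> (hx(a), *, hz(c))
        for j in range(side):
            servers[(h(a), j, h(c))]["T"].add((c, a))

    triangles = set()
    for local in servers.values():  # local computation, no further communication
        for a, b in local["R"]:
            for b2, c in local["S"]:
                if b2 == b and (c, a) in local["T"]:
                    triangles.add((a, b, c))
    return triangles


if __name__ == "__main__":
    R = {(1, 2), (2, 3), (4, 5)}
    S = {(2, 3), (3, 1), (5, 6)}
    T = {(3, 1), (1, 2), (6, 4), (6, 1)}
    print(sorted(triangle_hypercube(R, S, T, side=2)))  # p = 8 servers
```

No triangle is missed because the three tuples of a triangle (a, b, c) all reach the server with coordinates (h(a), h(b), h(c)).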

ANALYSIS OF HYPERCUBE (1)

• For a vector of shares p = (p1, …, pk), how is relation Sj distributed to the servers?

• Ideally, each server receives an equal share of Sj, of size Mj / ∏_{i : xi ∈ vars(Sj)} pi

• Example: relation R(x, y) of the triangle query
  – Ideal load L = M / #cells = M/p^{2/3}
  – If R has a single value in the x-column, the load will instead be M/p^{1/3}
  – The load will be O(M/p^{2/3}) if each value appears in the x and y columns at most M/p^{1/3} times

[Figure: the tuples of R(x, y) hashed onto a p^{1/3} × p^{1/3} grid of cells]
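The effect of skew on R(x, y) can be reproduced with a small experiment. The sketch assumes p = 64 (a 4 × 4 grid of cells for R), counts the load in tuples rather than bits, and uses modulo as an idealized hash function; none of these choices come from the slides.

```python
from collections import Counter


def max_cell_load(R, side):
    """Largest number of R(x, y) tuples hashed to a single (hx, hy) cell."""
    cells = Counter((x % side, y % side) for x, y in R)
    return max(cells.values())


if __name__ == "__main__":
    side = 4  # p^(1/3) for p = 64
    M = 1600
    skew_free = [(i, j) for i in range(40) for j in range(40)]
    skewed = [(0, j) for j in range(M)]  # a single value in the x-column
    print(max_cell_load(skew_free, side), M // side**2)  # ideal load M / p^(2/3)
    print(max_cell_load(skewed, side), M // side)  # degraded load M / p^(1/3)
```

With a single x-value only one row of the grid is used, so the 1600 tuples are spread over 4 cells instead of 16.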

ANALYSIS OF HYPERCUBE (2)

• In general, a relation Sj is skew-free w.r.t. p if for any subset of variables x of vars(Sj), every value of x appears at most Mj / ∏_{xi ∈ x} pi times

• If every relation is skew-free w.r.t. p then the maximum load of the HYPERCUBE algorithm is: O(max_j Mj / ∏_{i : xi ∈ vars(Sj)} pi)

ANALYSIS OF HYPERCUBE (3)

• The maximum load of the HYPERCUBE algorithm is always bounded by O(max_j Mj / min_{i : xi ∈ vars(Sj)} pi)

• Join with shares px = py = pz = p^{1/3}
  – For a skew-free database, the load is O(M/p^{2/3})
  – Otherwise, the load is always bounded by O(M/p^{1/3})

COMPUTING THE SHARES

• The optimal shares are computed by solving a Linear Program (LP)
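One way to write such an LP is sketched below, under notation that is assumed here rather than taken from the slide: each share is written as pi = p^{ei}, each relation size as Mj = p^{μj}, and the target load as L = p^{λ}. Requiring that the ideal per-server share of every relation is at most L, and that the shares multiply to at most p, becomes linear in the exponents:

```latex
\begin{aligned}
\text{minimize}\quad   & \lambda \\
\text{subject to}\quad & \sum_{i=1}^{k} e_i \le 1, \\
                       & \lambda + \sum_{i \,:\, x_i \in \mathrm{vars}(S_j)} e_i \;\ge\; \mu_j
                         \quad \text{for every atom } S_j, \\
                       & e_i \ge 0 \ \ (i = 1, \dots, k), \qquad \lambda \ge 0 .
\end{aligned}
```

The second constraint is the logarithm (base p) of Mj / ∏ pi ≤ L, taken over the variables of Sj. The exponents ei found this way are generally fractional, so the resulting shares still have to be rounded to integers.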


ANALYSIS OF HYPERCUBE

Theorem: For a conjunctive query Q, where relation Sj has size Mj and is skew-free, there exist shares such that the HYPERCUBE algorithm runs with maximum load

  L = O(max_{u ∈ pk(Q)} (M1^{u1} · … · Ml^{ul} / p)^{1/(u1 + … + ul)})

where pk(Q) = set of all fractional edge packings of Q

By using an LP duality argument, we can prove that the load matches the lower bound

EDGE PACKINGS FOR THE TRIANGLE

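Q(x, y, z) = R(x, y), S(y, z), T(z, x)

Edge packing u = (uR, uS, uT)    Load (asymptotic)
(1/2, 1/2, 1/2)                  (MR · MS · MT)^{1/3} / p^{2/3}
(1, 0, 0)                        MR / p
(0, 1, 0)                        MS / p
(0, 0, 1)                        MT / p

[Figure: hypergraph of the triangle query, with vertices x, y, z and edges R, S, T]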
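The loads for these four packings all follow from one expression, (∏j Mj^{uj} / p)^{1/Σj uj}, and the overall load is their maximum. The sketch below evaluates it; the function names are illustrative and constant factors are ignored.

```python
def packing_load(u, M, p):
    """Evaluate (prod_j M_j^{u_j} / p) ** (1 / sum_j u_j) for one edge packing u."""
    prod = 1.0
    for uj, Mj in zip(u, M):
        prod *= Mj ** uj
    return (prod / p) ** (1.0 / sum(u))


# The four packings listed for the triangle query, in the order (uR, uS, uT).
TRIANGLE_PACKINGS = [(0.5, 0.5, 0.5), (1, 0, 0), (0, 1, 0), (0, 0, 1)]


def triangle_load(M, p):
    """Maximum of the packing loads for relation sizes M = (MR, MS, MT)."""
    return max(packing_load(u, M, p) for u in TRIANGLE_PACKINGS)


if __name__ == "__main__":
    p = 64
    # Equal sizes: the packing (1/2, 1/2, 1/2) dominates and gives M / p^(2/3).
    print(triangle_load((10**6, 10**6, 10**6), p))
    # One tiny relation: a single large relation dominates and gives MR / p.
    print(triangle_load((10**6, 10**6, 1), p))
```

Which packing dominates depends on the relative relation sizes, which is why the bound is stated as a maximum over packings.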

THE PRESENCE OF SKEW

• A simple join: Q(x, y, z) = S1(x, z), S2(y, z)

• Optimal shares px = py = 1, pz = p
  – Standard parallel hash-join
  – If the database has no skew, L = O(max{M1, M2} / p)
  – If it is skewed, the load can be as bad as O(M) (all tuples are sent to the same server)

• For any value h of z, mj(h) = frequency of h in Sj
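A small experiment shows how the standard hash-join degrades. The sketch assumes integer join values, modulo p as the hash function, and load counted in tuples; these are simplifications for illustration.

```python
from collections import Counter


def hash_join_max_load(S1, S2, p):
    """Largest number of tuples a server receives when both relations are hashed on z."""
    load = Counter()
    for _, z in S1:
        load[z % p] += 1
    for _, z in S2:
        load[z % p] += 1
    return max(load.values())


if __name__ == "__main__":
    p = 4
    uniform = [(i, i) for i in range(100)]  # every z value appears once
    skewed = [(i, 0) for i in range(100)]  # a single z value in every tuple
    print(hash_join_max_load(uniform, uniform, p))  # (M1 + M2) / p
    print(hash_join_max_load(skewed, skewed, p))  # all M1 + M2 tuples on one server
```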


SKEW-AWARE JOIN (1)

Q(x,y,z) = S1(x, z), S2(y, z)

• Idea: identify the heavy hitters and treat them differently
  – h is a heavy hitter in Sj if mj(h) > Mj/p
  – h is light otherwise

CASE 1 (LIGHT)
• For all light values h, run the HyperCube algorithm (hash-join on z) on all p servers
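The heavy/light split is a frequency count against the threshold Mj/p. The sketch below assumes each relation is a list of (value, z) pairs and measures Mj in tuples; the function name is illustrative.

```python
from collections import Counter


def heavy_hitters(Sj, p):
    """Return the z values h whose frequency m_j(h) exceeds M_j / p."""
    freq = Counter(z for _, z in Sj)
    threshold = len(Sj) / p
    return {h for h, m in freq.items() if m > threshold}


if __name__ == "__main__":
    # 50 tuples share z = 0; the other 50 tuples have distinct z values.
    S1 = [(i, 0) for i in range(50)] + [(i, i) for i in range(1, 51)]
    print(heavy_hitters(S1, p=10))  # threshold is 100 / 10 = 10 occurrences
```

Each relation has fewer than p heavy hitters, since each one accounts for more than a 1/p fraction of its tuples.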


SKEW-AWARE JOIN (2)

CASE 2 (HEAVY)

For any heavy hitter h (either in S1 or S2)

• Compute the residual query (a cartesian product) Q[z\h] = S1(x, h), S2(y, h) using ph exclusive servers.

• Choose ph such that
  – The sum of the ph is O(p)
  – The load for every residual query Q[z\h] is the same
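One allocation that meets both conditions can be sketched as follows. It is a simplified illustration, not necessarily the authors' exact rule: it assumes a residual cartesian product on ph servers has load 2 · (m1(h) · m2(h) / ph)^{1/2}, as for a balanced rectangle, so equal loads require ph proportional to m1(h) · m2(h). It also assumes every heavy hitter occurs in both relations and ignores the case where the two frequencies are too unbalanced for a rectangle.

```python
import math


def allocate_servers(m1, m2, p):
    """Map each heavy hitter h to a server count p_h proportional to m1[h] * m2[h].

    m1, m2: dicts heavy hitter -> frequency in S1 and S2 (assumed nonzero).
    """
    total = sum(m1[h] * m2[h] for h in m1)
    return {h: max(1, math.ceil(p * m1[h] * m2[h] / total)) for h in m1}


def residual_load(m1h, m2h, ph):
    """Assumed load of the residual cartesian product on a balanced rectangle."""
    return 2 * math.sqrt(m1h * m2h / ph)


if __name__ == "__main__":
    m1 = {"a": 400, "b": 100}
    m2 = {"a": 100, "b": 100}
    alloc = allocate_servers(m1, m2, p=10)
    print(alloc)
    for h, ph in alloc.items():
        print(h, residual_load(m1[h], m2[h], ph))
```

Rounding up adds at most one server per heavy hitter, and there are fewer than 2p heavy hitters in total, so the sum of the ph stays O(p).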


SKEW: SIMPLE JOIN


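Theorem: Any MPC algorithm that computes the join query in one round must satisfy:

  L = Ω(max{ M1/p, M2/p, (Σ_h m1(h) · m2(h) / p)^{1/2} })

where the sum ranges over the heavy hitters h.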

The skew-aware join achieves the above optimal load

SKEW IN CONJUNCTIVE QUERIES

• For any conjunctive query Q, our algorithm computes the light values using HYPERCUBE
  – Since there is no skew, this part is optimal

• For the heavy hitters, it considers the residual queries and assigns an appropriate number of exclusive servers to each
  – The values of the heavy hitters and their frequency must be known to the algorithm

CONCLUSION

Summary
• Upper and lower bounds for computing Conjunctive Queries in the MPC model in the presence of skew

Open Problems
• What is the load L when we consider more rounds?
• How do other classes of queries behave?

Thank you!

DUALITY: EDGE PACKING

Fractional edge packing: assign a weight uj ≥ 0 to each atom Sj such that for each variable xi, the sum of the weights of the atoms that contain xi is at most 1

[Figure: hypergraph of the triangle query q(x, y, z) = R(x, y), S(y, z), T(z, x), with weight 1/2 on each of the edges R, S, T]
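The packing condition is easy to check mechanically. The sketch below assumes a query is given as a dictionary from atom names to their variables; this representation and the function name are illustrative.

```python
def is_fractional_edge_packing(atoms, u, tol=1e-9):
    """Check that the weights u are nonnegative and sum to at most 1 at every variable.

    atoms: dict atom name -> tuple of variables
    u:     dict atom name -> weight
    """
    if any(w < -tol for w in u.values()):
        return False
    weight_at = {}
    for name, variables in atoms.items():
        for v in variables:
            weight_at[v] = weight_at.get(v, 0.0) + u[name]
    return all(total <= 1 + tol for total in weight_at.values())


if __name__ == "__main__":
    triangle = {"R": ("x", "y"), "S": ("y", "z"), "T": ("z", "x")}
    print(is_fractional_edge_packing(triangle, {"R": 0.5, "S": 0.5, "T": 0.5}))
    print(is_fractional_edge_packing(triangle, {"R": 1, "S": 1, "T": 0}))
```

In the triangle every variable lies in exactly two atoms, so weight 1/2 on each atom meets every constraint with equality, while weight 1 on both R and S puts total weight 2 on y.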

By duality, the minimum value of the LP is equal to the maximum value, over all edge packings u in pk(q), of log_p (M1^{u1} · … · Ml^{ul} / p)^{1/(u1 + … + ul)}
