Efficient Skyline Computation in MapReduce


Kasper Mullesgaard, Jens Laurits Pedersen, Hua Lu

Aalborg University

Yongluan Zhou

University of Southern Denmark

Skyline Query

• Application: multi-criteria decision making
• Tuple dominance: t1 dominates t2 (t1 ⊰ t2)
– iff t1 is not worse than t2 in all dimensions, and
– t1 is better than t2 in at least one dimension

• Skyline query:
– Given a dataset, returns all tuples that are not dominated by others
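The two definitions above can be sketched in a few lines of Python. This is only an illustration: it assumes that smaller values are better in every dimension (the slide does not fix a direction), and the `hotels` data is made up.

```python
def dominates(t1, t2):
    """True if t1 is not worse than t2 in all dimensions and better in at least one.

    Assumption: smaller values are better in every dimension.
    """
    not_worse = all(a <= b for a, b in zip(t1, t2))
    better_somewhere = any(a < b for a, b in zip(t1, t2))
    return not_worse and better_somewhere


def skyline(tuples):
    """Naive quadratic skyline: keep every tuple that no other tuple dominates."""
    return [t for t in tuples if not any(dominates(o, t) for o in tuples)]


# Hypothetical data: (price, distance to the beach)
hotels = [(50, 8), (60, 2), (80, 1), (70, 5), (90, 9)]
print(skyline(hotels))  # [(50, 8), (60, 2), (80, 1)]
```

Here (70, 5) is dominated by (60, 2) and (90, 9) by (50, 8), so neither is part of the skyline.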

Scaling Skyline Computation

• Customized solutions:
– Require arbitrary inter-node communication
– Need a software stack to harness a large cluster
– Unproven scalability
– Lack of fault tolerance

• General MapReduce platforms:
– Availability of scalable systems, such as Hadoop
– A strict communication/synchronization model

MapReduce

Challenges of Skyline Computation using MapReduce

• To maximize parallelization:
– Push more work to mappers, i.e., let mappers filter out more non-skyline points
– Ability to utilize multiple reducers

• However, global skylines cannot be determined by local information
– Without global information, mappers have very limited capabilities to filter out non-skyline points
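The limitation can be seen in a small simulation. This is a sketch, not Hadoop code: `map_local_skyline` and `reduce_global_skyline` are hypothetical names standing in for a mapper and a single reducer, smaller values are assumed to be better, and the input splits are made up.

```python
def dominates(t1, t2):
    """t1 dominates t2, assuming smaller values are better."""
    return (all(a <= b for a, b in zip(t1, t2))
            and any(a < b for a, b in zip(t1, t2)))


def skyline(tuples):
    return [t for t in tuples if not any(dominates(o, t) for o in tuples)]


def map_local_skyline(split):
    """Mapper stand-in: it sees only its own split, so it can only filter locally."""
    return skyline(split)


def reduce_global_skyline(local_skylines):
    """Reducer stand-in: merges all local skylines and filters again."""
    candidates = [t for local in local_skylines for t in local]
    return skyline(candidates)


splits = [
    [(1, 9), (5, 5), (6, 6)],   # split seen by mapper 0
    [(2, 2), (9, 1), (8, 8)],   # split seen by mapper 1
]
local = [map_local_skyline(s) for s in splits]
print(local[0])                      # [(1, 9), (5, 5)]
print(reduce_global_skyline(local))  # [(1, 9), (2, 2), (9, 1)]
```

Mapper 0 has to keep (5, 5) because nothing in its split dominates it; only the reducer, which also sees (2, 2), can discard it.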

Grid Partitioning and Bit String Representation

Partition Dominance: pi ⊰ pj iff pi.max ⊰ pj.min

Partition numbering in the example grid:
2 5 8
1 4 7
0 3 6

BSR = 011110100
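A bit string of this form could be derived as in the sketch below. Several things are assumptions rather than statements from the slides: a bit is taken to be 1 when the partition is non-empty and not dominated by another non-empty partition, values lie in [0, 1), smaller is better, and partitions are numbered with the first dimension as the most significant digit (which matches the 0 3 6 bottom row of the grid). The sample `points` were chosen by hand so that the result coincides with the bit string on the slide.

```python
from itertools import product


def dominates(a, b):
    """a dominates b, assuming smaller values are better."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def cell_of(t, ppd):
    """Grid coordinates of tuple t; values are assumed to lie in [0, 1)."""
    return tuple(min(int(v * ppd), ppd - 1) for v in t)


def bit_string(tuples, ppd, dims):
    """One bit per partition: 1 if non-empty and not dominated by a non-empty partition."""
    occupied = {cell_of(t, ppd) for t in tuples}
    bits = []
    # product() enumerates cells in index order 0, 1, 2, ... (first dimension
    # most significant), i.e. column by column in the 2-D grid.
    for cell in product(range(ppd), repeat=dims):
        # In grid units, a cell's min corner is `cell` and its max corner is `cell + 1`.
        dominated = any(
            dominates(tuple(c + 1 for c in other), cell) for other in occupied
        )
        bits.append("1" if cell in occupied and not dominated else "0")
    return "".join(bits)


# Hand-picked points falling into partitions 1, 2, 3, 4, 6 and 8.
points = [(0.1, 0.5), (0.2, 0.9), (0.5, 0.1), (0.5, 0.5), (0.9, 0.2), (0.9, 0.9)]
print(bit_string(points, ppd=3, dims=2))  # 011110100
```

Partition 8 is non-empty but its bit is 0, because the max corner of partition 1 dominates the min corner of partition 8.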

Bit String Generation

Determining Partitions Per Dimension (PPD)

• PPD is too high → very few tuples in each partition and too many partitions

• PPD is too low → too many tuples in each partition and less effective pruning

• Idea: generate bit strings for PPD from 2 to

– then choose the one with the most desirable number of tuples per partition
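The selection loop might look like the sketch below. The upper bound of the PPD range and the notion of "most desirable" are not recoverable from the slide, so `max_ppd` and `target` are hypothetical parameters, and measuring partition size as the mean number of tuples per non-empty partition is likewise an assumption.

```python
from collections import Counter


def cell_of(t, ppd):
    """Grid coordinates of tuple t; values are assumed to lie in [0, 1)."""
    return tuple(min(int(v * ppd), ppd - 1) for v in t)


def choose_ppd(tuples, max_ppd, target):
    """Try every PPD in [2, max_ppd]; keep the one whose mean number of tuples
    per non-empty partition is closest to `target` (both are assumed inputs)."""
    best_ppd, best_gap = None, None
    for ppd in range(2, max_ppd + 1):
        counts = Counter(cell_of(t, ppd) for t in tuples)
        mean_per_partition = len(tuples) / len(counts)
        gap = abs(mean_per_partition - target)
        if best_gap is None or gap < best_gap:
            best_ppd, best_gap = ppd, gap
    return best_ppd


# 16 made-up points, one at the centre of each cell of a 4 x 4 grid.
points = [(i / 4 + 0.125, j / 4 + 0.125) for i in range(4) for j in range(4)]
print(choose_ppd(points, max_ppd=4, target=4))  # 2
print(choose_ppd(points, max_ppd=4, target=1))  # 4
```

A coarse grid (PPD = 2) puts four points in each partition, while PPD = 4 leaves one point per partition, which illustrates the trade-off listed above.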

Single Reducer

Multi-Reducer

• The single reducer still performs significant work to detect the global skyline
– Limits the degree of parallelization

• Idea: independent partition group
– Anti-Dominating Region (ADR):

– Independent Partition Group: A set of partitions Pi is an IPG iff holds

– One reducer is responsible for each IPG.

Multi-Reducer

Generation of IPGs

• Idea: a partition pm is a maximum partition iff ∀p, pm ∉ p.ADR

• Procedure:
1. Find a maximum partition pm
2. Generate IPG = {pm} ∪ pm.ADR
3. Remove pm and repeat from step 1
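The procedure can be sketched as follows, under two assumptions that are not spelled out on the slides. First, the ADR of a partition p is taken to be every other non-empty partition whose grid coordinates are no larger than those of p in all dimensions (the partitions that could hold tuples dominating tuples in p when smaller is better). Second, the loop is assumed to stop once every partition belongs to at least one group, so partitions already covered are not picked as a new pm.

```python
def adr(p, partitions):
    """Assumed ADR of p: other partitions with coordinates <= p in every dimension."""
    return {q for q in partitions
            if q != p and all(a <= b for a, b in zip(q, p))}


def independent_groups(partitions):
    """Repeatedly pick a maximum partition pm and emit {pm} | pm.ADR."""
    partitions = set(partitions)
    uncovered = set(partitions)
    groups = []
    while uncovered:
        # A maximum partition lies in no other partition's ADR.
        # Scanning in reverse sorted order only makes the result deterministic.
        pm = next(p for p in sorted(uncovered, reverse=True)
                  if not any(p in adr(q, partitions) for q in uncovered))
        group = {pm} | adr(pm, partitions)
        groups.append(group)
        uncovered -= group   # assumption: covered partitions need no group of their own
    return groups


# Partitions 1, 2, 3, 4 and 6 of a 3 x 3 grid, written as (x, y) coordinates.
parts = [(0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]
for g in independent_groups(parts):
    print(sorted(g))
```

With this input the sketch yields three groups, {(2,0), (1,0)}, {(1,1), (0,1), (1,0)} and {(0,2), (0,1)}. Partitions (1,0) and (0,1) each land in two groups, so their data would be sent to more than one reducer.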

Implementation Issues

• More independent groups than reducers
– Need to allocate them to the reducers; two options:
1. Load balancing
2. Minimizing duplicate data transmission

• Elimination of duplicate skyline outputs
– A grid partition appears in multiple IPGs
– Designate one IPG as the responsible group
  • Load balancing
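For the load-balancing option, one simple possibility is a greedy heuristic that hands each group, largest first, to the currently least-loaded reducer. The slides do not say which allocation algorithm is used, so this is a swapped-in textbook heuristic; `assign_groups`, the group names, and the sizes (estimated tuple counts) are hypothetical.

```python
import heapq


def assign_groups(group_sizes, num_reducers):
    """Greedy load balancing: largest group first, always to the least-loaded reducer.

    group_sizes maps a group id to its estimated number of tuples (assumed input).
    Returns a dict mapping each group id to a reducer index.
    """
    heap = [(0, r) for r in range(num_reducers)]   # entries are (current load, reducer)
    heapq.heapify(heap)
    assignment = {}
    for group, size in sorted(group_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[group] = reducer
        heapq.heappush(heap, (load + size, reducer))
    return assignment


sizes = {"g0": 50, "g1": 30, "g2": 20, "g3": 10}
print(assign_groups(sizes, num_reducers=2))
# {'g0': 0, 'g1': 1, 'g2': 1, 'g3': 0}
```

The resulting loads are 60 and 50 tuples. The second option, minimizing duplicate data transmission, would instead favor placing groups that share partitions on the same reducer; it is not covered by this sketch.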

Experimental Setup

• 13 commodity machines
• Datasets with independent and anti-correlated distributions
• Comparisons:
– MR-BNL
– MR-Angle

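Figure: varying the number of dimensions (independent data, cardinality 1×10^5)

Figure: varying the number of dimensions (anti-correlated data, cardinality 1×10^5)

Figure: varying the cardinality (independent data; panels for 3 and 8 dimensions)

Figure: varying the cardinality (anti-correlated data; panels for 3 and 8 dimensions)

Figure: varying the number of reducers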

Summary

• Grid partitioning and bit strings
– Choose an appropriate number of partitions per dimension (PPD)
• Exploit independent groups to enable multiple reducers
– Good for cases with a large number of skyline tuples
– Merging independent groups
– Eliminating duplicate outputs
