Efficient Skyline Computation in MapReduce Kasper Mullesgaard, Jens Laurits Pedersen, Hua Lu Aalborg University Yongluan Zhou University of Southern Denmark


Page 1: Efficient Skyline Computation in  MapReduce

Efficient Skyline Computation in MapReduce

Kasper Mullesgaard, Jens Laurits Pedersen, Hua Lu

Aalborg University

Yongluan Zhou

University of Southern Denmark

Page 2: Efficient Skyline Computation in  MapReduce

Skyline Query

• Application: multi-criteria decision making
• Tuple dominance: t1 dominates t2 (t1 ⊰ t2)
– iff t1 is not worse than t2 in all dimensions, and
– t1 is better than t2 in at least one dimension

• Skyline query:
– Given a dataset, returns all tuples that are not dominated by others
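The dominance test and the skyline definition above can be sketched in a few lines of Python. This is a naive quadratic illustration, not the method of the talk; it assumes smaller values are better in every dimension, and the `hotels` data is made up for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(t1: Sequence[float], t2: Sequence[float]) -> bool:
    """True if t1 dominates t2 (assumption: smaller is better in every dimension)."""
    not_worse = all(a <= b for a, b in zip(t1, t2))
    strictly_better = any(a < b for a, b in zip(t1, t2))
    return not_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every tuple that no other tuple dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (price, distance) pairs; lower is better for both.
hotels = [(50, 8), (80, 2), (60, 5), (90, 9), (70, 5)]
print(skyline(hotels))
```

Here (90, 9) is dominated by (50, 8) and (70, 5) is dominated by (60, 5), so neither is part of the skyline.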

Page 3: Efficient Skyline Computation in  MapReduce

Scaling Skyline Computation

• Customized solutions:
– Require arbitrary inter-node communication
– Need software stacks to harness a large cluster
– Unproven scalability
– Lack of fault tolerance

• General MapReduce platforms:
– Availability of scalable systems, such as Hadoop
– A strict communication/synchronization model

Page 4: Efficient Skyline Computation in  MapReduce

MapReduce

Page 5: Efficient Skyline Computation in  MapReduce

Challenges of Skyline Computation using MapReduce

• To maximize parallelization
– Push more work to mappers, i.e. let mappers filter out more non-skyline points
– Ability to utilize multiple reducers
• However, global skylines cannot be determined by local information
– Without global information, mappers have very limited capabilities to filter out non-skyline points

Page 6: Efficient Skyline Computation in  MapReduce

Grid Partitioning and Bit String Representation

Partition Dominance: pi ⊰ pj iff pi.max ⊰ pj.min

2 5 8

1 4 7

0 3 6

BSR = 011110100

Page 7: Efficient Skyline Computation in  MapReduce

Bit String Generation
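One way to picture bit string generation is the sketch below. It rests on several assumptions that the slides do not spell out: coordinates lie in [0, 1], smaller is better, partitions are numbered as in the 3×3 grid of the previous slide (0, 1, 2 going up the first column), each mapper marks its non-empty partitions, and a single merge step ORs the local bit strings and clears partitions dominated by another non-empty partition. The function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Cell = Tuple[int, ...]


def cell_of(point: Sequence[float], ppd: int) -> Cell:
    """Grid cell of a point whose coordinates lie in [0, 1]."""
    return tuple(min(int(v * ppd), ppd - 1) for v in point)


def cell_id(cell: Cell, ppd: int) -> int:
    """Partition number; the last dimension varies fastest (0, 1, 2 up a column)."""
    idx = 0
    for c in cell:
        idx = idx * ppd + c
    return idx


def local_bits(points: List[Sequence[float]], ppd: int, dims: int) -> List[bool]:
    """Mapper side: mark every partition that holds at least one local tuple."""
    bits = [False] * (ppd ** dims)
    for p in points:
        bits[cell_id(cell_of(p, ppd), ppd)] = True
    return bits


def merge_and_prune(local: List[List[bool]], ppd: int, dims: int) -> str:
    """Merge side: OR the local bit strings, then clear dominated partitions."""
    merged = [any(col) for col in zip(*local)]
    cells = list(product(range(ppd), repeat=dims))  # cells[i] has partition number i
    non_empty = [c for i, c in enumerate(cells) if merged[i]]

    def dominated(c: Cell) -> bool:
        # A cell is dominated if a non-empty cell is strictly lower in every dimension.
        return any(all(a < b for a, b in zip(q, c)) for q in non_empty)

    return "".join("1" if merged[i] and not dominated(c) else "0"
                   for i, c in enumerate(cells))


# Two hypothetical input splits, one per mapper.
split_a = [(0.1, 0.5), (0.2, 0.9), (0.4, 0.1)]
split_b = [(0.5, 0.5), (0.8, 0.2), (0.9, 0.9), (0.5, 0.8)]
bsr = merge_and_prune([local_bits(split_a, 3, 2), local_bits(split_b, 3, 2)], 3, 2)
print(bsr)
```

With this made-up data, partitions 5 and 8 are non-empty but dominated by partition 1, so their bits are cleared.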

Page 8: Efficient Skyline Computation in  MapReduce

Determining Partitions Per Dimension (PPD)

• PPD is too high → very few tuples in each partition and too many partitions

• PPD is too low → too many tuples in each partition and less effective pruning

• Idea: generate bit strings for PPD from 2 to

– then choose the one with the most desirable number of tuples per partition
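A minimal sketch of this selection, under stated assumptions: the upper bound of the PPD range is missing from the slide, so it appears here as a hypothetical parameter `max_ppd`; "most desirable" is interpreted as "closest to a caller-supplied target", `target_tpp`; and tuples per partition is estimated as the number of tuples divided by the number of non-empty partitions.

```python
from typing import List, Sequence


def non_empty_cells(points: List[Sequence[float]], ppd: int) -> int:
    """Number of grid partitions that contain at least one tuple (coordinates in [0, 1])."""
    return len({tuple(min(int(v * ppd), ppd - 1) for v in p) for p in points})


def choose_ppd(points: List[Sequence[float]], max_ppd: int, target_tpp: float) -> int:
    """Try PPD = 2..max_ppd and keep the one whose tuples-per-partition is nearest the target."""
    best_ppd, best_gap = 2, float("inf")
    for ppd in range(2, max_ppd + 1):
        tpp = len(points) / non_empty_cells(points, ppd)
        gap = abs(tpp - target_tpp)
        if gap < best_gap:
            best_ppd, best_gap = ppd, gap
    return best_ppd


# 100 hypothetical tuples spread evenly over the unit square.
points = [((i + 0.5) / 10, (j + 0.5) / 10) for i in range(10) for j in range(10)]
print(choose_ppd(points, max_ppd=5, target_tpp=10))
```

In a MapReduce setting the per-PPD counts would come from the merged bit strings rather than from a pass over the raw tuples; the loop above only illustrates the selection rule.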

Page 9: Efficient Skyline Computation in  MapReduce

Single Reducer

Page 10: Efficient Skyline Computation in  MapReduce

Multi-Reducer

• The single reducer still performs significant work for detecting the global skyline
– limits the degree of parallelization

• Idea: independent partition group
– Anti-Dominating Region (ADR):

– Independent Partition Group: A set of partitions Pi is an IPG iff holds

– One reducer is responsible for each IPG.

Page 11: Efficient Skyline Computation in  MapReduce

Multi-Reducer

Page 12: Efficient Skyline Computation in  MapReduce

Generation of I.P.G.

• Idea: a partition pm is a maximum partition iff ∀p, pm ∉ p.ADR

• Procedure:
1. Find a maximum partition pm
2. Generate IPG = {pm} ∪ pm.ADR
3. Remove pm and repeat from step 1
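The procedure can be sketched as follows. The ADR definition is not recoverable from the slides, so the sketch assumes that p.ADR is the set of other non-empty partitions whose grid index is no larger than p's in every dimension (the partitions whose tuples could dominate tuples in p, with smaller values being better). It also assumes a simplified form of step 3: every partition that lies in no other partition's ADR is treated as a maximum partition and yields one group. Partition numbering follows the 3×3 grid shown earlier; all function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, ...]


def cells_from_bits(bits: str, ppd: int, dims: int) -> Dict[int, Cell]:
    """Map each partition number whose bit is set to its grid cell."""
    cells = list(product(range(ppd), repeat=dims))
    return {i: cells[i] for i, b in enumerate(bits) if b == "1"}


def adr(p: int, cells: Dict[int, Cell]) -> Set[int]:
    """Assumed ADR: other partitions that are <= p in every dimension."""
    return {q for q, c in cells.items()
            if q != p and all(a <= b for a, b in zip(c, cells[p]))}


def independent_groups(bits: str, ppd: int, dims: int) -> List[Set[int]]:
    """One group {pm} | pm.ADR per maximum partition pm."""
    cells = cells_from_bits(bits, ppd, dims)
    adrs = {p: adr(p, cells) for p in cells}
    covered = set().union(*adrs.values())        # partitions inside some ADR
    maxima = [p for p in cells if p not in covered]
    return [{pm} | adrs[pm] for pm in maxima]


groups = independent_groups("011110100", ppd=3, dims=2)
print([sorted(g) for g in groups])
```

For the bit string 011110100 the maximum partitions under these assumptions are 2, 4 and 6; partitions 1 and 3 each end up in two groups, which is why duplicate outputs need handling.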

Page 13: Efficient Skyline Computation in  MapReduce

Implementation Issues

• More independent groups than reducers
– Need to allocate them to the reducers; two options:
1. Load balancing
2. Minimizing duplicate data transmission
• Elimination of duplicated skyline outputs
– A grid partition may appear in multiple IPGs
– Designate one IPG as the responsible group
  • Load balancing
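A rough sketch of the load-balancing option and of designating a responsible group, assuming per-partition tuple counts are available as a load estimate (the `sizes` dictionary is hypothetical, as are the function names). Groups are placed greedily, heaviest first, on the least-loaded reducer; each shared partition is then assigned to whichever containing group currently owns the least output. The slides do not describe the exact heuristics, so both rules are illustrative choices.

```python
import heapq
from typing import Dict, List, Set


def assign_groups(groups: List[Set[int]], sizes: Dict[int, int],
                  n_reducers: int) -> List[List[int]]:
    """Greedy load balancing: heaviest group first, onto the least-loaded reducer."""
    def load(g: Set[int]) -> int:
        return sum(sizes[p] for p in g)

    order = sorted(range(len(groups)), key=lambda i: load(groups[i]), reverse=True)
    heap = [(0, r) for r in range(n_reducers)]   # (current load, reducer id)
    plan: List[List[int]] = [[] for _ in range(n_reducers)]
    for i in order:
        total, r = heapq.heappop(heap)
        plan[r].append(i)
        heapq.heappush(heap, (total + load(groups[i]), r))
    return plan


def responsible_group(groups: List[Set[int]], sizes: Dict[int, int]) -> Dict[int, int]:
    """For each partition, pick the one group that outputs its skyline tuples."""
    owned = [0] * len(groups)                    # tuples each group is responsible for
    owner: Dict[int, int] = {}
    for p in sorted({p for g in groups for p in g}):
        candidates = [i for i, g in enumerate(groups) if p in g]
        best = min(candidates, key=lambda i: (owned[i], i))
        owner[p] = best
        owned[best] += sizes[p]
    return owner


groups = [{1, 2}, {1, 3, 4}, {3, 6}]             # group index -> partition numbers
sizes = {1: 10, 2: 5, 3: 8, 4: 20, 6: 4}         # hypothetical tuples per partition
print(assign_groups(groups, sizes, n_reducers=2))
print(responsible_group(groups, sizes))
```

The second option on the slide, minimizing duplicate data transmission, would instead favour placing groups that share partitions on the same reducer; that trade-off is not modelled here.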

Page 14: Efficient Skyline Computation in  MapReduce

Experimental Setup

• 13 commodity machines
• Datasets with independent and anti-correlated distributions
• Comparisons:
– MR-BNL
– MR-Angle

Page 15: Efficient Skyline Computation in  MapReduce

#Dimensions

Independent data, cardinality: 1×10^5

Page 16: Efficient Skyline Computation in  MapReduce

#Dimensions

Anti-correlated data, cardinality: 1×10^5

Page 17: Efficient Skyline Computation in  MapReduce

Cardinality (independent data)

Dimensions: 3 Dimensions: 8

Page 18: Efficient Skyline Computation in  MapReduce

Cardinality (Anti-corr. data)

Dimensions: 3 Dimensions: 8

Page 19: Efficient Skyline Computation in  MapReduce

Number of Reducers

Page 20: Efficient Skyline Computation in  MapReduce

Summary

• Grid partitioning and bit strings
– Choose an appropriate number of partitions per dimension
• Exploit independent groups to enable multiple reducers
– Good for cases with a large number of skyline tuples
– Merging independent groups
– Eliminate duplicate outputs