EFFICIENT PROCESSING OF RANK-AWARE QUERIES IN MAP/REDUCE OIKONOMAKIS SPYRIDON SOFTWARE / ENGINEER AT PEOPLEPERHOUR

DESCRIPTION

Through an experimental comparison of three different algorithms, this presentation aims to show the disadvantages of the default operation of the Map/Reduce programming model for Top-K queries, and to propose a solution for processing such queries efficiently. It addresses two of the major shortcomings that occur: the lack of Early Termination and the lack of Load Balancing. The proposed solution has been implemented in code.

Page 1: Efficient processing of Rank-aware queries in Map/Reduce

EFFICIENT PROCESSING OF RANK-AWARE QUERIES IN MAP/REDUCE

OIKONOMAKIS SPYRIDON SOFTWARE / ENGINEER AT PEOPLEPERHOUR

Page 2: Efficient processing of Rank-aware queries in Map/Reduce

Need for a new model

Exponential data growth

Need for analysis, utilization and scalability of more and more data

Need for parallel processing

Need to reduce data reading and recovery time

Need for convenience for the programmer

Cost

Page 3: Efficient processing of Rank-aware queries in Map/Reduce

What is Map/Reduce?

A programming model and runtime environment for distributed data processing that runs in parallel across large clusters of machines
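To make the model concrete, the following is a minimal single-process sketch of the map, shuffle and reduce phases, using word counting as the job. It is an illustration only and not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Call the reducer once per key with all values grouped under that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    return sum(values)


counts = run_job(["a b a", "b c"], word_mapper, sum_reducer)
```

In a real cluster the mapper and reducer calls run on different machines; here they run in sequence so that the data flow is easy to follow.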

Page 4: Efficient processing of Rank-aware queries in Map/Reduce

Is the Map/Reduce model reliable?

Page 5: Efficient processing of Rank-aware queries in Map/Reduce

Map/Reduce

Page 6: Efficient processing of Rank-aware queries in Map/Reduce

Weaknesses in Top-K Join Queries

What is a Top-K Join?

Weaknesses

All the data must be read in order to retrieve only K results

Uneven distribution of workload across Reducers
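The first weakness can be seen in a naive Top-K join, sketched below under stated assumptions: each table holds (join_key, score) records, and the score of a joined pair is the sum of the two scores. The scoring function and the name naive_top_k_join are assumptions made for illustration, not taken from the slides.

```python
import heapq
from collections import defaultdict


def naive_top_k_join(table_a, table_b, k):
    """Join every matching pair, then keep the k highest combined scores."""
    scores_by_key = defaultdict(list)
    for key, score in table_b:
        scores_by_key[key].append(score)

    joined = []
    for key, score_a in table_a:
        # Every record of both tables is read, however low its score.
        for score_b in scores_by_key.get(key, []):
            joined.append((score_a + score_b, key))

    return heapq.nlargest(k, joined)


table_a = [("x", 9), ("y", 7), ("x", 2)]
table_b = [("x", 8), ("y", 1), ("z", 10)]
top2 = naive_top_k_join(table_a, table_b, 2)
```

The full join is materialized before any result is discarded, so the cost grows with the size of the input rather than with K.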

Page 7: Efficient processing of Rank-aware queries in Map/Reduce

Goals of the experiment

Efficient implementation of Top-K Join queries in the Map/Reduce model

Addressing the weaknesses observed in Map/Reduce with:

Early Termination

Load Balancing

Page 8: Efficient processing of Rank-aware queries in Map/Reduce

Design

Comparison of three algorithms (1 default and 2 new)

Naive

EarlyTermination (using bounds)

EarlyTermination & LoadBalancing (using bounds and Longest Processing Time)

Preprocessing

Production of two data tables with Join attributes

Statistics for the data in the form of histograms

Processing

Calculating bounds from the histograms for each table

Run Map/Reduce
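One way score bounds could be derived from histograms is sketched below. It is not necessarily the method used in the presentation. The assumptions are: records are (join_key, score) pairs, the combined score is the sum of the two scores, both tables share the same maximum score, and each histogram bin stores a count per join key. The names build_histogram and score_bounds are hypothetical.

```python
from collections import Counter


def build_histogram(records, num_bins, max_score):
    """Per equal-width score bin, count the records of each join key."""
    width = max_score / num_bins
    bins = [Counter() for _ in range(num_bins)]
    for key, score in records:
        index = min(int(score / width), num_bins - 1)  # max_score goes in the top bin
        bins[index][key] += 1
    return bins


def score_bounds(hist_a, hist_b, k, max_score):
    """Return (bound_a, bound_b); records scoring below them cannot reach the top k."""
    num_bins = len(hist_a)
    width = max_score / num_bins
    seen_a, seen_b = Counter(), Counter()
    # Walk the bins from the highest scores downwards in both tables.
    for i in range(num_bins - 1, -1, -1):
        seen_a.update(hist_a[i])
        seen_b.update(hist_b[i])
        guaranteed = sum(seen_a[key] * seen_b[key] for key in seen_a)
        if guaranteed >= k:
            # At least k join results score 2 * i * width or more.
            threshold = 2 * i * width
            # A record below threshold - max_score cannot reach the threshold
            # even when paired with the best possible partner.
            bound = max(threshold - max_score, 0)
            return bound, bound
    return 0, 0


table_a = [("x", 95), ("y", 91), ("x", 40), ("z", 5)]
table_b = [("x", 99), ("y", 92), ("z", 93), ("x", 10)]
hist_a = build_histogram(table_a, num_bins=10, max_score=100)
hist_b = build_histogram(table_b, num_bins=10, max_score=100)
bounds = score_bounds(hist_a, hist_b, k=2, max_score=100)
```

With these sample tables the top bin alone already guarantees two join results, so records scoring below 80 in either table can be skipped for K = 2.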

Page 9: Efficient processing of Rank-aware queries in Map/Reduce

Design(2)

Page 10: Efficient processing of Rank-aware queries in Map/Reduce

Early Termination

Diagram components: HDFS (Generated Sorted Data, Histograms), EarlyTermInputFormat, EarlyTermRecordReader (Check Bounds), Mapper, Reducers (Process). Edge labels: Send Data (twice).
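The bound check in the record reader can be sketched as a plain Python generator. This is not Hadoop's RecordReader interface; it assumes the input is sorted by score in descending order and that a bound has already been computed. The names early_term_reader and counting are hypothetical.

```python
def early_term_reader(sorted_records, bound):
    """Yield records from a descending-sorted input until one falls below bound."""
    for key, score in sorted_records:
        if score < bound:
            break  # all remaining records score lower, so reading can stop
        yield key, score


def counting(records, counter):
    """Wrap an input so the number of records actually read can be observed."""
    for record in records:
        counter["read"] += 1
        yield record


records = [("x", 95), ("y", 91), ("x", 40), ("z", 5)]
counter = {"read": 0}
kept = list(early_term_reader(counting(records, counter), bound=80))
```

Only three of the four records are read here: the first one below the bound stops the scan, and the rest of the input is never touched.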

Page 11: Efficient processing of Rank-aware queries in Map/Reduce

Early Termination & Load Balancing

Diagram components: HDFS (Generated Sorted Data, Histograms), EarlyTermInputFormat, EarlyTermRecordReader (Check Bounds), Mapper, CustomPartitioner, three Reducers. Edge labels: Send Data (twice).
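The Longest Processing Time heuristic named in the design can be sketched as follows: join keys are sorted by estimated cost, and each key goes to the reducer with the smallest load so far. The cost estimates and the names lpt_assign and reducer_loads are assumptions for illustration; in practice a cost could be estimated from the histograms.

```python
import heapq


def lpt_assign(key_costs, num_reducers):
    """Greedy LPT: heaviest key first, always to the least-loaded reducer."""
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending cost; ties are broken by key for a stable result.
    for key, cost in sorted(key_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    return assignment


def reducer_loads(key_costs, assignment, num_reducers):
    """Total estimated cost assigned to each reducer."""
    loads = [0] * num_reducers
    for key, reducer in assignment.items():
        loads[reducer] += key_costs[key]
    return loads


key_costs = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
assignment = lpt_assign(key_costs, num_reducers=2)
loads = reducer_loads(key_costs, assignment, num_reducers=2)
```

A custom partitioner could then look up the reducer for each join key in such an assignment instead of hashing the key.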

Page 12: Efficient processing of Rank-aware queries in Map/Reduce

Experiment (1)

Parameters Values

Data Distribution: Zipfian

Number of records: 1,000,000 per table

Number of reducers: 10, 6

Number of K results: 10

Data skew: 0, 0.5, 1

Number of Joining Attributes: 10

Max value for data: 10000

Sorting: By score

Histograms: 10 bins

Cluster: 8 machines
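A data set with these parameters could be generated along the following lines. This sketch assumes that the skew parameter is the Zipf exponent and that it applies to the distribution of join values; the slides do not state either point. The names zipf_weights and generate_table are hypothetical.

```python
import random


def zipf_weights(num_values, skew):
    """Unnormalized Zipf weights: the value of rank r gets weight 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, num_values + 1)]


def generate_table(num_records, num_join_values, max_score, skew, seed):
    """Generate (join_value, score) records, sorted by score in descending order."""
    rng = random.Random(seed)
    weights = zipf_weights(num_join_values, skew)
    keys = rng.choices(range(num_join_values), weights=weights, k=num_records)
    records = [(key, rng.randint(0, max_score)) for key in keys]
    records.sort(key=lambda record: record[1], reverse=True)
    return records


# A small table; the experiment used 1,000,000 records per table.
table = generate_table(num_records=1000, num_join_values=10,
                       max_score=10000, skew=0.5, seed=42)
```

A skew of 0 gives equal weights, so join values are uniformly distributed; larger values concentrate more records on the first few join values.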

Page 13: Efficient processing of Rank-aware queries in Map/Reduce

Experiment Part – Comparison of algorithms (2)

Figure: Running time (h:mm:ss) per algorithm for each skew value, REDUCERS = 10

Skew    Naive      Early Termination    Early Termination & Load Balancing
0       0:42:54    0:07:22              0:07:06
0.5     0:41:19    0:02:00              0:02:15
1       0:43:50    0:02:10              0:02:13

Page 14: Efficient processing of Rank-aware queries in Map/Reduce

Experiment Part – Comparison of algorithms (3)

Figure: Number of records per algorithm for each skew value, REDUCERS = 10

Skew    Naive      Early termination    Early termination & Load Balancing
0       2000000    599726               599726
0.5     2000000    330018               330018
1       2000000    330018               330018

Page 15: Efficient processing of Rank-aware queries in Map/Reduce

Experiment Part – Comparison of algorithms (4)

Figure: Running time (h:mm:ss) by number of reducers (6 and 10) for Early Termination and Early Termination & Load Balancing. Chart label: REDUCERS = 6. Data labels as extracted: 0:16:10, 0:07:06, 0:10:10, 0:07:22.

Page 16: Efficient processing of Rank-aware queries in Map/Reduce

Conclusion

By using the proposed techniques:

Early Termination

Load Balancing

it is possible to implement rank-aware (Top-K) queries in Map/Reduce efficiently and to overcome the disadvantages of the Map/Reduce model

Page 17: Efficient processing of Rank-aware queries in Map/Reduce

Questions

????

Thank you.