Efficient processing of Rank-aware queries in Map/Reduce

Oikonomakis Spyridon, Software Engineer at PeoplePerHour

Need for a new model

Exponential data growth

Need to analyze and use ever-growing volumes of data at scale

Need for parallel processing

Need to reduce data read and retrieval time

Need for programmer convenience

Cost

What is the Map/Reduce?

A programming model and runtime environment for distributed data processing that runs in parallel across large clusters of machines
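The model can be illustrated with a minimal single-process sketch. It is not a distributed runtime: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the word-count job are assumptions chosen only to show the three stages a Map/Reduce runtime performs.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, as the runtime does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Call the reducer once per key with all values grouped under that key."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical job: count word occurrences.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def sum_reducer(key, values):
    return sum(values)

lines = ["map reduce map", "reduce map"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the sketch keeps only the data flow.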

Is the Map/Reduce model reliable?

Map/Reduce

Weaknesses in Top-K Join Queries

What is the Top-K Join?

Weaknesses

Read all the data to retrieve only K results

Unequal distribution of workload per Reducer
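The first weakness can be seen in a small sketch of a naive Top-K join. It assumes each table is a list of (join key, score) records and that a joined pair is ranked by the sum of its two scores; the slides do not state the scoring function, so the sum and the name `naive_top_k_join` are illustrative assumptions.

```python
import heapq
from collections import defaultdict

def naive_top_k_join(table_a, table_b, k):
    """Join on the key, rank joined pairs by summed score, keep the k best.

    Every record of both tables is read, even though only k results are kept.
    """
    by_key = defaultdict(list)
    for key, score in table_b:
        by_key[key].append(score)
    records_read = len(table_a) + len(table_b)
    # Assumption: the score of a joined pair is the sum of the two scores.
    joined = ((key, sa + sb) for key, sa in table_a for sb in by_key.get(key, ()))
    top = heapq.nlargest(k, joined, key=lambda pair: pair[1])
    return top, records_read

table_a = [("x", 9), ("y", 8), ("x", 3), ("z", 1)]
table_b = [("y", 10), ("x", 7), ("z", 2), ("y", 1)]
top, records_read = naive_top_k_join(table_a, table_b, k=2)
print(top, records_read)
```

Here all 8 records are read to produce 2 results, which is the cost that early termination tries to avoid.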

Goals of the experiment

Efficient implementation of Top-K Join queries in the Map/Reduce model

Addressing the weaknesses shown in Map/Reduce with:

Early Termination

Load Balancing

Design

Comparison of three algorithms (1 default and 2 new)

Naive

EarlyTermination (using bounds)

EarlyTermination & LoadBalancing (using bounds and Longest Processing Time)

Pre-processing

Generation of two data tables with join attributes

Statistics for the data in the form of histograms

Processing

Calculating bounds from the histograms for each table

Running Map/Reduce
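One way bounds could be derived from histograms is sketched below. The slides do not give the actual procedure, so everything here is an assumption: each histogram maps a join value to record counts per equal-width score bin, a joined pair is scored by the sum of its two scores, and the names `score_threshold` and `read_bounds` are hypothetical.

```python
def score_threshold(hist_a, hist_b, bin_width, k):
    """Largest score T for which the histograms guarantee >= k join results scoring >= T."""
    candidates = []  # (guaranteed minimum score of the bin pair, number of join results)
    for value, bins_a in hist_a.items():
        bins_b = hist_b.get(value, [])
        for i, count_a in enumerate(bins_a):
            for j, count_b in enumerate(bins_b):
                if count_a and count_b:
                    # Lower bin edges are i*bin_width and j*bin_width (sum scoring assumed).
                    candidates.append(((i + j) * bin_width, count_a * count_b))
    candidates.sort(reverse=True)
    total = 0
    for low_score, n in candidates:
        total += n
        if total >= k:
            return low_score
    return 0  # fewer than k join results exist, so nothing can be pruned

def max_upper_edge(hist, bin_width):
    """Upper edge of the highest non-empty bin across all join values."""
    top = 0
    for bins in hist.values():
        for i, count in enumerate(bins):
            if count:
                top = max(top, (i + 1) * bin_width)
    return top

def read_bounds(hist_a, hist_b, bin_width, k):
    """Per-table score below which a record cannot contribute to a top-k result."""
    t = score_threshold(hist_a, hist_b, bin_width, k)
    bound_a = max(0, t - max_upper_edge(hist_b, bin_width))
    bound_b = max(0, t - max_upper_edge(hist_a, bin_width))
    return bound_a, bound_b

# Hypothetical histograms: 3 bins of width 10 per join value.
hist_a = {"x": [5, 0, 2], "y": [1, 3, 0]}
hist_b = {"x": [0, 4, 1], "y": [2, 0, 1]}
print(score_threshold(hist_a, hist_b, 10, k=2), read_bounds(hist_a, hist_b, 10, k=2))
```

With k = 2 the two results guaranteed by the top bins of "x" score at least 40, so under these assumptions records scoring below 10 in either table can be skipped.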

Design (2)

Early Termination

[Diagram: HDFS (Generated Sorted Data, Histograms); EarlyTermInputFormat; EarlyTermRecordReader (Check Bounds); Send Data; Mapper; Send Data; Reducers (Process)]
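The bounds check in the record reader can be sketched as a generator that stops reading once scores fall below the bound. This is not Hadoop's RecordReader API and not the code of EarlyTermRecordReader; it assumes the input is sorted by score in descending order, and the name `early_term_reader` is hypothetical.

```python
def early_term_reader(sorted_records, bound):
    """Yield (key, score) records until the score drops below the bound.

    Assumes records are sorted by score, highest first, so everything
    after the first record below the bound can be skipped unread.
    """
    for key, score in sorted_records:
        if score < bound:
            break  # early termination: no later record can reach the bound
        yield key, score

records = [("x", 29), ("y", 17), ("x", 12), ("z", 9), ("y", 4)]
passed_to_mapper = list(early_term_reader(records, bound=10))
print(passed_to_mapper)
```

Only the records at or above the bound reach the Mapper; the rest of the input is never read.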

Early Termination & Load Balancing

[Diagram: HDFS (Generated Sorted Data, Histograms); EarlyTermInputFormat; EarlyTermRecordReader (Check Bounds); Send Data; Mapper; Send Data; CustomPartitioner; Reducer, Reducer, Reducer]
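The Longest Processing Time rule that a custom partitioner could apply is sketched below: the most expensive join values are placed first, each on the currently least-loaded reducer. The per-value cost estimates and the name `lpt_assignment` are assumptions; the slides do not show how CustomPartitioner estimates cost.

```python
import heapq

def lpt_assignment(costs, num_reducers):
    """Assign each join value to a reducer with the Longest Processing Time rule."""
    heap = [(0, reducer) for reducer in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    # Most expensive values first; ties broken by key for a deterministic result.
    for key, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)  # least-loaded reducer so far
        assignment[key] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    return assignment

# Hypothetical cost estimates per join value (for example, expected join results).
costs = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
assignment = lpt_assignment(costs, num_reducers=2)
loads = [sum(costs[k] for k, r in assignment.items() if r == reducer) for reducer in range(2)]
print(assignment, loads)
```

In this example both reducers end up with a load of 10, whereas hashing the keys gives no such guarantee under skew.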

Experiment (1)

Parameters Values

Data Distribution: Zipfian

Number of records: 1,000,000 per table

Number of reducers: 10, 6

Number of K results: 10

Data skew: 0, 0.5, 1

Number of Joining Attributes: 10

Max value for data: 10000

Sorting: By score

Histograms: 10 bins

Cluster: 8 machines
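A data set with these parameters could be generated along the following lines. This is a sketch under assumptions: the slides do not say which field follows the Zipfian distribution, so it is applied here to the join attribute; scores are drawn uniformly up to the maximum value; and the names `zipf_weights` and `generate_table` are hypothetical.

```python
import random

def zipf_weights(n, skew):
    """Unnormalized Zipfian weights 1 / rank**skew; skew = 0 gives a uniform distribution."""
    return [1.0 / (rank ** skew) for rank in range(1, n + 1)]

def generate_table(num_records, num_join_values, skew, max_score, seed):
    """Build (join value, score) records, sorted by score from highest to lowest."""
    rng = random.Random(seed)
    weights = zipf_weights(num_join_values, skew)
    keys = rng.choices(range(num_join_values), weights=weights, k=num_records)
    # Assumption: scores are uniform integers in [0, max_score].
    table = [(key, rng.randint(0, max_score)) for key in keys]
    table.sort(key=lambda record: record[1], reverse=True)
    return table

# Small sample using the slide's join-value count, skew and maximum value.
table = generate_table(num_records=1000, num_join_values=10, skew=1, max_score=10000, seed=42)
print(table[:3])
```

The record count is reduced to 1000 here to keep the example quick; the experiment itself uses 1,000,000 records per table.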

Experiment Part – Comparison of algorithms (2)

Running time by skew (REDUCERS = 10)

Skew                                  0          0.5        1
Naive                                 0:42:54    0:41:19    0:43:50
Early Termination                     0:07:22    0:02:00    0:02:10
Early Termination & Load Balancing    0:07:06    0:02:15    0:02:13


Number of records by skew (REDUCERS = 10)

Skew                                  0          0.5        1
Naive                                 2000000    2000000    2000000
Early Termination                     599726     330018     330018
Early Termination & Load Balancing    599726     330018     330018


[Chart: running time (y-axis) by number of reducers (x-axis: 6, 10); series: Early Termination, Early Termination & Load Balancing; data labels: 0:16:10, 0:07:06, 0:10:10, 0:07:22; panel title: REDUCERS = 6]

Conclusion

By using the proposed techniques:

Early Termination

Load Balancing

it is possible to implement rank-aware (Top-K) queries in Map/Reduce efficiently while addressing the weaknesses of the Map/Reduce model

Questions

????

Thank you.
