Through an experimental evaluation of three different algorithms, this presentation shows the disadvantages of the default operation of the Map/Reduce programming model for Top-K queries and proposes a solution for processing such queries efficiently. Two major shortcomings are addressed: the lack of Early Termination and poor Load Balancing. The solution has been implemented in code.
EFFICIENT PROCESSING OF RANK-AWARE QUERIES IN MAP/REDUCE
OIKONOMAKIS SPYRIDON
SOFTWARE ENGINEER AT PEOPLEPERHOUR
Need for a new model
Exponential data growth
Need for analysis, utilization and scalability of more and more data
Need for parallel processing
Need to reduce data read and retrieval time
Need for programmer convenience
Cost
What is Map/Reduce?
A programming model and runtime environment for distributed data processing that runs in parallel across large clusters of machines
Is the Map/Reduce model reliable?
Map/Reduce
Weaknesses in Top-K Join Queries
What is the Top-K Join?
Weaknesses
All the data must be read to retrieve only K results
Uneven distribution of workload across Reducers
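A Top-K Join is commonly understood as: join two tables on a shared attribute, rank every joined pair with a scoring function, and return the K best pairs. The sketch below is a naive single-machine illustration of that idea, not the Map/Reduce implementation from the slides. It assumes records are (join_key, score) pairs and that a joined pair is ranked by the sum of its two scores; the record layout, the sum function and the name naive_top_k_join are all assumptions made for illustration. It makes the first weakness visible: every record of both tables is read to produce only K results.

```python
import heapq
from collections import defaultdict


def naive_top_k_join(table_a, table_b, k):
    """Join two lists of (join_key, score) on join_key and return the k
    joined pairs with the highest summed score as (total, join_key)."""
    # Index table B by join key; this reads all of table B.
    scores_b = defaultdict(list)
    for key, score in table_b:
        scores_b[key].append(score)

    # Enumerate every joined pair; this reads all of table A.
    joined = (
        (score_a + score_b, key)
        for key, score_a in table_a
        for score_b in scores_b.get(key, ())
    )
    # Keep only the k best pairs out of the full join result.
    return heapq.nlargest(k, joined)


table_a = [("x", 9), ("y", 8), ("x", 2)]
table_b = [("x", 7), ("y", 3), ("z", 10)]
top_two = naive_top_k_join(table_a, table_b, 2)
```

Here the full join has three pairs, but only the two best are returned.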
Goals of the experiment
Efficient implementation of Top-K Join queries in the Map/Reduce model
Addressing the weaknesses of Map/Reduce with:
Early Termination
Load Balancing
Design
Comparison of three algorithms (1 default and 2 new)
  Naive
  EarlyTermination (using bounds)
  EarlyTermination & LoadBalancing (using bounds and Longest Processing Time)
Pre-processing
  Generation of two data tables with join attributes
  Statistics on the data in the form of histograms
Processing
  Calculation of bounds from the histograms of each table
  Map/Reduce run
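The slides do not spell out how the bounds are derived from the histograms, so the following is only one plausible sketch. It assumes equi-width score histograms kept per join value for each table, and a ranking function equal to the sum of the two scores. Walking the bins from the highest scores downward, it finds the first point where the histograms guarantee at least K join results; every such result scores at least twice the lower edge of that bin, so any record whose score plus the maximum possible partner score falls below that threshold cannot be needed. The names score_bound and count_guaranteed_joins and all parameters are hypothetical.

```python
def count_guaranteed_joins(hist_a, hist_b, first_bin):
    """Number of join results formed only by records in bins >= first_bin.
    hist_x maps join value -> list of record counts per score bin (ascending)."""
    total = 0
    for key, bins_a in hist_a.items():
        bins_b = hist_b.get(key)
        if bins_b is not None:
            total += sum(bins_a[first_bin:]) * sum(bins_b[first_bin:])
    return total


def score_bound(hist_a, hist_b, k, num_bins, bin_width, max_score):
    """Return a score below which records can be skipped (0 = read everything)."""
    for first_bin in range(num_bins - 1, -1, -1):
        if count_guaranteed_joins(hist_a, hist_b, first_bin) >= k:
            # At least k results have a summed score >= this threshold.
            threshold = 2 * first_bin * bin_width
            # A record needs score + max_score >= threshold to still matter.
            return max(0, threshold - max_score)
    return 0


# Three bins of width 10 covering scores 0..30 (toy numbers, not from the slides).
hist_a = {"x": [5, 0, 3], "y": [0, 4, 0]}
hist_b = {"x": [1, 2, 2], "z": [9, 9, 9]}
bound = score_bound(hist_a, hist_b, k=5, num_bins=3, bin_width=10, max_score=30)
```

In the toy data the top bin alone already guarantees 3 × 2 = 6 results for join value "x", so with K = 5 records scoring below 10 can be skipped.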
Design (2)
Early Termination
HDFS (Generated Sorted Data, Histograms) -> EarlyTermInputFormat -> EarlyTermRecordReader (Check Bounds) -> Send Data -> Mapper -> Send Data -> Reducers (Process)
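The bound check performed by EarlyTermRecordReader can be pictured with a small sketch. This is not the Hadoop RecordReader code from the project; it is a plain Python generator that assumes the input is already sorted by score in descending order (as the generated data is) and that a score bound has been computed beforehand. The function name early_term_reader is hypothetical.

```python
def early_term_reader(sorted_records, bound):
    """Yield (join_key, score) records until the first score below the bound.
    Because the input is sorted by descending score, nothing after that
    point can qualify, so reading stops early."""
    for key, score in sorted_records:
        if score < bound:
            break  # early termination: the rest of the input is never read
        yield key, score


records = [("a", 9), ("b", 7), ("c", 3), ("d", 1)]
passed_to_mapper = list(early_term_reader(records, bound=5))
```

Only the records at or above the bound reach the Mapper; the remainder of the split is left unread.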
Early Termination & Load Balancing
HDFS (Generated Sorted Data, Histograms) -> EarlyTermInputFormat -> EarlyTermRecordReader (Check Bounds) -> Send Data -> Mapper -> CustomPartitioner -> Send Data -> Reducers
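Load balancing in this design relies on Longest Processing Time (LPT) scheduling. As a rough sketch of what a custom partitioner could precompute: sort the join values by estimated cost, largest first, and give each one to the reducer with the smallest load so far. How the cost of a join value is estimated is an assumption here (for example, the number of join results predicted from the histograms); lpt_assign and its arguments are hypothetical names, not the project's CustomPartitioner.

```python
import heapq


def lpt_assign(key_costs, num_reducers):
    """Assign join values to reducers with the LPT greedy rule.
    key_costs maps join value -> estimated processing cost.
    Returns a dict join value -> reducer index."""
    # Min-heap of (current load, reducer index).
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)

    assignment = {}
    # Place the most expensive join values first.
    for key, cost in sorted(key_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, reducer = heapq.heappop(heap)  # least loaded reducer
        assignment[key] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    return assignment


costs = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
assignment = lpt_assign(costs, num_reducers=2)
```

With these toy costs both reducers end up with a load of 10, whereas a hash-based split could leave one reducer with most of the work.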
Experiment (1)
Parameter: Value
Data Distribution: Zipfian
Number of records: 1,000,000 per table
Number of reducers: 10, 6
Number of K results: 10
Data skew: 0, 0.5, 1
Number of Joining Attributes: 10
Max value for data: 10000
Sorting: By score
Histograms: 10 bins
Cluster: 8 machines
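A data set with these parameters could be generated along the following lines. The slides do not say which attribute follows the Zipfian distribution; this sketch assumes it is the join value, with skew 0 giving a uniform distribution, and draws scores uniformly up to the maximum value. The functions zipf_weights and generate_table, the seed parameter and the (join_value, score) record layout are assumptions for illustration.

```python
import random


def zipf_weights(num_values, skew):
    """Unnormalised Zipfian weights: the value at rank r gets 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, num_values + 1)]


def generate_table(num_records, num_join_values, max_score, skew, seed=0):
    """Generate (join_value, score) records, sorted by score descending."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    weights = zipf_weights(num_join_values, skew)
    keys = rng.choices(range(num_join_values), weights=weights, k=num_records)
    records = [(key, rng.randint(0, max_score)) for key in keys]
    records.sort(key=lambda record: record[1], reverse=True)
    return records


# Small sample using the slide's join-value count and maximum value.
sample = generate_table(num_records=1000, num_join_values=10, max_score=10000, skew=1)
```

Raising the skew concentrates more records on the first few join values, which is what makes the reducer workload uneven.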
Experiment Part – Comparison of algorithms (2)
Running time (h:mm:ss) by data skew, REDUCERS = 10
Skew 0: Naive 0:42:54, Early Termination 0:07:22, Early Termination & Load Balancing 0:07:06
Skew 0.5: Naive 0:41:19, Early Termination 0:02:00, Early Termination & Load Balancing 0:02:15
Skew 1: Naive 0:43:50, Early Termination 0:02:10, Early Termination & Load Balancing 0:02:13
Experiment Part – Comparison of algorithms (3)
Number of records by data skew, REDUCERS = 10
Skew 0: Naive 2000000, Early Termination 599726, Early Termination & Load Balancing 599726
Skew 0.5: Naive 2000000, Early Termination 330018, Early Termination & Load Balancing 330018
Skew 1: Naive 2000000, Early Termination 330018, Early Termination & Load Balancing 330018
Experiment Part – Comparison of algorithms (4)
Running time (h:mm:ss) by number of reducers (6 and 10), Early Termination vs. Early Termination & Load Balancing
Data labels on the chart: 0:16:10, 0:07:06, 0:10:10, 0:07:22
REDUCERS = 6
Conclusion
By using the proposed techniques:
Early Termination
Load Balancing
it is possible to implement rank-aware (Top-K) queries in Map/Reduce efficiently and to overcome the disadvantages of the Map/Reduce model
Questions
????
Thank you.