Optimizing SQL Query Execution over Map-Reduce
Thesis submitted in partial fulfillment
of the requirements for the degree of
MS by Research
in
Computer Science
by
Bharath Vissapragada
200702012
bharat [email protected]
Center for Data Engineering
International Institute of Information Technology
Hyderabad - 500 032, INDIA
September 2014
Copyright © Bharath Vissapragada, 2013
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Optimizing SQL Query Execution over Map-
Reduce” by Bharath Vissapragada, has been carried out under my supervision and is not submitted
elsewhere for a degree.
Date Adviser: Prof. Kamalakar Karlapalem
To my late uncle, Ravi Sanker Ganti
Acknowledgments
Firstly, I would like to thank my dad, mom, sister and grandparents for believing in me and letting me pursue my interests. I would never have completed my thesis without the support of my advisers Kamal sir and Satya sir. They were always open for discussions and I am really lucky to have their support. I would like to thank my closest pals Abilash, Chaitanya, Phani, Ravali, Ronanki and Vignesh for their constant support, especially when I was let down by something. I really miss my late uncle Ravi Sanker Ganti, who was responsible for what I am today. Thanks to the Almighty for blessing me with good luck and mental peace.
Abstract
Query optimization in relational database systems is a topic that has been studied in depth, in both standalone and distributed settings. Modern optimizers have become complex, focusing on improving the quality of optimization and reducing query execution times. They employ a wide range of heuristics and consider a set of possible plans, called the plan space, to find the best plan to execute. However, with the advent of big data, more and more organizations are moving towards map-reduce based processing systems for managing their large databases, since these systems outperform traditional techniques for processing very large amounts of data while still running on commodity hardware, thus reducing maintenance costs.
In this thesis, we describe the design and implementation of a query optimizer tailor-made for efficient execution of SQL queries over the map-reduce framework. We rely on traditional relational database query optimization principles and extend them to address this problem. Our major contributions can be summarized as follows:
1. We propose a statistics-based approach for optimizing SQL queries on top of map-reduce.
2. We design cost formulae to predict the run time of joins before executing the query.
3. We extend the traditional plan space to include the bushy plan space, which leverages the massively parallel architecture of map-reduce systems. We design three algorithms to explore this plan space and use our cost formulae to select the plan with the least execution cost.
4. We develop a task scheduler for the map-reduce shuffle, based on the max-flow min-cut algorithm, that minimizes the overall network IO during joins. It uses the statistics collected from the data to formulate the task assignment problem as a max-flow min-cut problem on a flow graph over the nodes; solving this problem yields the minimal overall IO in the shuffle phase.
5. We show the performance enhancements from the above features using TPCH workloads of scales 100 and 300, on both TPCH benchmark queries and synthetic queries.
Our experiments show improvements of up to 2x in query execution time and up to 33% reduction in shuffle network IO during map-reduce jobs.
Contents
Chapter Page
1 Introduction and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Map-Reduce and Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Hadoop Distributed File System . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 HDFS Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Replication factor and replica placement . . . . . . . . . . . . . . . . . . . . . 4
1.1.4 Data reads, writes and deletes . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5 Map-Reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Hive Anatomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Joins in Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Query Optimization in Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Problem statement and scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Organization of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Query Plan Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Overview of Query Planspace - An Example . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Exploring Bushy Planspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Finding minimum cost n-fully balanced tree - FindMin . . . . . . . . . . . . . 20
3.2.2 Finding a minimum cost n-balanced tree recursively - FindRecMin . . . . . . 21
3.2.3 Finding a minimum cost n-balanced tree exhaustively - FindRecDeepMin . . . 21
3.3 Choosing the value of n for an n-balanced tree . . . . . . . . . . . . . . . . . . . . . 23
4 Cost Based Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 Distributed Statistics Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Cost Formulae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.1 Join map phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.2 Join shuffle phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.3 Join Reducer phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Scheduling - An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Scheduling strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Shuffle algorithm - Proof of minimality . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.6 Shuffle algorithm - A working example . . . . . . . . . . . . . . . . . . . . . . . . . 34
5 Experimental Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 Plan space evaluation and Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2.1 Algorithm FindMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2.2 Algorithm FindRecMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2.3 Algorithm FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 Algorithms performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.4 Cost Formulae accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 Efficiency of scheduling algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1 Conclusions and contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Appendix A: Query execution plans for TPCH queries . . . . . . . . . . . . . . . . . . . . . 56
A.1 q2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
A.1.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
A.1.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.1.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
A.2 q3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.2.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.2.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
A.2.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.3 q10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.3.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.3.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.3.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.4 q11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.4.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.4.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.4.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
A.5 q16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.5.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.5.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.5.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.6 q18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.6.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.6.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.6.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
List of Figures
Figure Page
1.1 Hive query - Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Pig script - Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 HDFS architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Map Reduce Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Hive Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Block diagram of a query optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Possible join orderings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Time line for the execution of left deep QEP . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Time line for the execution of bushy Query plan . . . . . . . . . . . . . . . . . . . . . 18
3.4 An example of 4-fullybalanced tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 An example of 4-balanced tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.6 8-balanced tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.7 4-balanced tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.8 Running Algorithm1 on 9 tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.9 Running Algorithm2 on 9 tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1 Statistics store architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Query scheduling example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Flow network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Query execution plan for q2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Query execution plan for q3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Query execution plan for q10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.4 Query execution plan for q11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 Query execution plan for q16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.6 Query execution plan for q18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.13 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance . . . . . . . . 47
5.14 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance . . . . . . . 48
5.7 Query execution plan for q20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.8 Algorithms runtime in seconds on tpch benchmark queries on a 10 node cluster on a
TPCH 300 GB dataset. FindRecMin and FindRecDeepMin are tested on 4-balanced
trees since most of these queries have less number of join tables . . . . . . . . . . . . 50
5.9 Algorithms performance evaluation on 100GB dataset and synthetic queries . . . . . . 50
5.10 Map phase cost formulae evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.11 Reduce and shuffle cost formulae evaluation . . . . . . . . . . . . . . . . . . . . . . . 51
5.12 Comparison of default vs optimal shuffle IO . . . . . . . . . . . . . . . . . . . . . . . 52
List of Tables
Table Page
4.1 Notations for cost formulae - Map phase . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Notations for cost formulae - Shuffle phase . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Key to Node assignment with optimized scheduler . . . . . . . . . . . . . . . . . . . . 35
4.4 Key to Node assignment with default scheduler . . . . . . . . . . . . . . . . . . . . . 35
5.1 Plan space evaluation for the algorithms on queries with number of tables from 3 to 14 38
5.2 Algorithms average runtime in seconds on queries with increasing number of joins on a
10 node cluster and a TPCH 100 GB dataset using 4 and 8-balanced trees . . . . . . . 38
5.3 Algorithms runtime in seconds on tpch benchmark queries on a 10 node cluster on a
TPCH 300 GB dataset. FindRecMin and FindRecDeepMin are tested on 4-balanced
trees since most of these queries have less number of join tables . . . . . . . . . . . . 39
5.4 Summary of query plans for TPCH Dataset . . . . . . . . . . . . . . . . . . . . . . . 39
5.5 Shuffle data size comparison of default and optimized algorithm tested on a TPCH scale
100 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.6 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance . . . . . . . . 47
5.7 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance . . . . . . . 48
Chapter 1
Introduction and Background
In the internet age, data is wealth. Most organizations rely on their data warehouses for analytics, based on which management takes strategic decisions that are important to the organization's growth. A decade after the internet bubble, the amount of data these organizations possess has scaled up to tens of petabytes, which cannot be handled by centralized servers. Owing to these needs, data warehouse infrastructure has rapidly changed over the past few years from high-end servers holding vast amounts of data to sets of commodity machines holding data in a distributed fashion. The map-reduce programming paradigm [15, 1] from Google has facilitated this transformation by providing highly scalable and fault-tolerant distributed systems for production-quality application software.
In the internet age, big data has become a buzzword. Every company, ranging from small startups to internet giants like Google and Facebook, manages data of unbelievable scale [5]. The data sources are mainly web crawlers, user forms on websites, data uploaded to social networking sites, webserver logs, etc. For example, Facebook, the largest photo-sharing site today, holds about 10 billion photos, growing at the rate of 2-3 TB per day [6]. This holds true in other areas of science as well. The Large Hadron Collider (LHC) produces data of enormous size from its detectors, and this data is stored in a grid consisting of 200,000 processing cores and 150 PB of disk space in a distributed network [4]. Yahoo has built a Hadoop cluster of 4000 nodes [11] with 16 PB of storage to perform its daily data-crunching tasks, and these numbers clearly show the power of map-reduce programming. These huge data sizes and distributed systems create a whole new set of challenges for managing the data and performing large computations such as data mining and machine learning tasks. Most firms spend a lot of money on managing and extracting the important information present in this data. So, performing these large-scale computations efficiently is very important, and even the slightest improvement in these processing techniques will save a lot of money and time.
Processing huge datasets is a complex task, as it cannot be done on a single machine, and distributed implementations always pose a variety of problems in terms of synchronization, fault tolerance and reliability. Fortunately, with the introduction of Google's Map-Reduce programming paradigm [16, 15], the process has become fairly simple: the user needs only to think that he is programming for a single machine, and the framework takes care of distributing the work across the cluster while providing the additional features of fault tolerance and reliability. Hadoop [1] is an open-source implementation of Google's map-reduce model that has been widely accepted in academia and industry [3] over the past few years for its ability to process large amounts of data using commodity hardware while hiding the inner details of parallelism from the end users. Hadoop is widely used in production for building search indexes, crunching web server logs, recommender systems and a variety of other tasks that require huge processing capability on data of large scale. A number of packages have been developed on top of the Hadoop infrastructure to provide SQL or similar interfaces, so that users can perform analytics on the data using traditional SQL queries and retrieve results. Such efforts include Pig Latin and Hive [42, 33]. All these packages rely on the same basic principles for converting a SQL-like query into map-reduce jobs, but there are minor differences in the way they work. For example, Hive takes SQL-like input from the user and converts the query into a directed acyclic graph (DAG) of map-reduce jobs, whereas Pig is a scripting language: the user supplies the entire execution plan as a script, but the goal is still to convert the user's tasks into a set of map-reduce jobs. An example showing the difference in joining 3 tables in Hive and Pig is shown in figures 1.1 and 1.2.
select * from A join B on A.a = B.b join C on B.b = C.c;
Figure 1.1 Hive query - Example
temp = join A by a, B by b;
result = join temp by B::b, C by c;
dump result;
Figure 1.2 Pig script - Example
1.1 Map-Reduce and Hadoop
In this section we describe map-reduce in detail, in terms of its open-source release Hadoop and the file system it relies on, the Hadoop Distributed File System (HDFS). HDFS [38] is very similar to the Google File System, the storage base for the Map-Reduce paradigm described in the Google paper. Map-Reduce is a parallel programming paradigm that relies on a basic principle: “moving computation is cheaper than moving data”.
1.1.1 Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a highly fault-tolerant distributed file system designed to provide high-throughput access for applications with large data sets, while still running on cheap commodity hardware. It is designed to handle hardware failures well by replicating data across machines in a distributed fashion. HDFS does not follow the full POSIX requirements and relaxes a few of them to enable streaming access to file system data. HDFS has been built with the following goals [7] (figure 1.3).
• Hardware Failure Since HDFS runs on a large set of commodity machines, there is a high probability that a subset of them fails at any moment. HDFS is designed to overcome this by intelligently placing replicas of the data, and it restores the replica count in case of data loss by copying data to other machines in the cluster.
• Streaming Data Access Since HDFS has been designed for applications requiring high through-
put, a few of the unnecessary POSIX requirements have been relaxed to increase efficiency.
• Large amounts of Data HDFS is tailor-made for holding very large amounts of data, ranging from tens of terabytes to a few petabytes. It also scales linearly with the amount of data simply by adding new machines on the fly, while still providing fault tolerance and fast access. This makes it a de-facto choice for big-data needs.
• Simple Coherency Model HDFS applications need a write-once-read-many access model for
files. A file once created, written, and closed need not be changed. This assumption simplifies
data coherency issues and enables high throughput data access. A map-reduce application or a
web crawler application fits perfectly with this model. There is a plan to support appending-writes
to files in the future.
• Moving Computation is Cheaper than Moving Data Since the data we are dealing with is huge, it is better to move the code to the location where the data resides instead of moving data across machines. HDFS supports this efficiently by exposing the locations of data blocks and providing support for moving code executables.
• Portability Across Heterogeneous Hardware and Software Platforms HDFS supports a very diverse set of platforms and is written in Java, so users can include a wide variety of platforms in their cluster as long as the basic generic requirements are met.
1.1.2 HDFS Architecture
HDFS has a master/slave architecture. An HDFS cluster consists of a master node (the namenode), responsible for maintaining the filesystem namespace, and a set of worker nodes called datanodes, where the actual data is stored. The namenode stores the filesystem namespace, exposes the data to the end user as a file and directory hierarchy, and provides a set of basic utilities to add, modify and delete data. It also exposes a Java API for all these features, and bindings for other languages are in widespread use. All instructions are passed via the namenode to the datanodes, which execute them in an orderly fashion. All the datanodes report their block information to the master and send a constant heartbeat to notify it of their health. If the namenode does not receive heartbeats from a datanode for some time, that datanode is declared dead, and the blocks that fall below the minimum replication factor are replicated to other nodes. All this information about the block mapping is stored in the namenode's local file system as the “fsimage”. All changes to it are tracked by recording them in a file called the editLog.
Figure 1.3 HDFS architecture
1.1.3 Replication factor and replica placement
A replication factor is set by the user per file; it is the number of replicas maintained for each block of data and ensures fault tolerance. The higher the replication factor, the greater the fault tolerance of the cluster. The location of the replicas for each block is decided by the namenode and chosen intelligently so as to maximize fault tolerance. With a replication factor of three, the first replica is written to the local node, ensuring maximum speed; the second to a node on a different rack; and the third to another node on that same remote rack. The second replica is useful when a whole rack goes down because of a switch failure. Greater replication can also enable faster, parallel reads, since the namenode has the option of choosing the replica closest to the client.
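The three-replica placement above can be sketched in a few lines. The following Python sketch is purely illustrative: the node and rack names are invented, it assumes at least two racks with at least two nodes on the remote rack, and the real policy lives inside the namenode (Hadoop's default block placement policy).

```python
def place_replicas(writer_node, topology):
    """topology maps rack name -> list of node names on that rack."""
    # First replica: the node where the writer runs (fastest, local write).
    replicas = [writer_node]
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    # Second replica: a node on a different rack, so that a whole-rack
    # failure (e.g. a switch going down) cannot lose every copy.
    remote_rack = next(r for r in topology if r != local_rack)
    replicas.append(topology[remote_rack][0])
    # Third replica: another node on that same remote rack, saving one
    # cross-rack transfer while still keeping two racks involved.
    replicas.append(next(n for n in topology[remote_rack] if n != replicas[1]))
    return replicas
```

For example, with two racks of two nodes each, a write from the first node of rack 1 yields one local replica and two replicas on rack 2.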
1.1.4 Data reads, writes and deletes
• Data is written to HDFS in a pipelined fashion. When a client writes data to a file, the data is first accumulated in a local file; once it amounts to a full block of user data, the client contacts the namenode for a list of datanodes to hold the replicas of this block. The block of data is then flushed to the first datanode in streams of 4 KB. The first datanode writes this data locally and forwards it to the next datanode. This process continues until the last node in the list has written the whole block; the namenode is then notified and the block map is returned. Thus the whole process is pipelined and parallel.
• When the namenode receives a read request, it tries to fulfill it by selecting a replica closest to the client, to save bandwidth and reduce the response time. A same-rack replica is preferred most of the time if one exists. The information about the network topology is fed to the system via rack-awareness scripts.
• During file deletes, the data is only marked as deleted; it is not removed from the datanodes immediately but is moved to the /trash folder. The data can be restored as long as it remains in /trash (this period is configurable). Once it crosses the configured time limit, the data is deleted and all its blocks are freed.
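The pipelined write in the first bullet can be modeled in-process. This is a toy sketch, not the HDFS protocol: the datanode names are invented, each "node" is just a local buffer, and real HDFS adds packet acknowledgements, checksums and failure handling.

```python
PACKET = 4 * 1024  # the 4 KB streaming unit mentioned above

def pipelined_write(block, pipeline, storage):
    """Stream `block` through the chain of datanodes in `pipeline`.

    storage maps datanode name -> bytearray holding its local copy.
    """
    for off in range(0, len(block), PACKET):
        packet = block[off:off + PACKET]
        # Each node in the pipeline receives the packet from its
        # predecessor, writes it locally, and forwards it downstream;
        # here that forwarding is modeled as appending to each buffer.
        for node in pipeline:
            storage[node].extend(packet)

storage = {"dn1": bytearray(), "dn2": bytearray(), "dn3": bytearray()}
pipelined_write(b"x" * 10000, ["dn1", "dn2", "dn3"], storage)
# All three replicas now hold identical copies of the block.
```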
1.1.5 Map-Reduce
A map-reduce program consists of three main phases: map, shuffle and reduce. The user specifies the map and reduce functionality via an API and submits the executable to the processing framework. A map-reduce job takes a file or a set of files stored in HDFS as input, and the map function is applied to each block of input (in reality, the map function is applied to a FileSplit, which can span multiple blocks; for simplicity we use FileSplit and block interchangeably). Each instance of the map function takes (key, value) pairs as input and emits a new set of (key, value) pairs as output, and the framework shuffles all the pairs with the same key to a single machine. This is called the shuffle phase. The user can control which keys go to the same machine via a partitioner function that can be plugged into the executable. All the (key, value) pairs shuffled to the same machine are then sorted, and a reducer function is applied to each group; the output is emitted by the reducers as a new set of (key, value) pairs, which is written to HDFS. The intermediate data of each map phase is sorted, merged in multiple rounds, and written to the disk local to the map execution. The following equations outline the map and reduce phases, and the whole job flow is summarized in figure 1.4 [8].
map(k1, v1) → list(k2, v2) (1.1)
reduce(k2, list(v2)) → list(v2) (1.2)
Figure 1.4 Map Reduce Architecture
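The two equations can be made concrete with a minimal in-process sketch, using word count as the usual example. This illustrates only the data flow of the three phases, not Hadoop's actual API; the function names are ours.

```python
from collections import defaultdict

def map_fn(key, line):
    # map(k1, v1) -> list(k2, v2): emit (word, 1) for every word; the
    # input key (e.g. a byte offset) is unused for word count.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # The framework groups all values with the same key onto one
    # machine between the map and reduce phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_fn(key, values):
    # reduce(k2, list(v2)) -> list(v2): sum the counts for one word.
    return [sum(values)]

def run_job(inputs):
    mapped = [p for k, line in inputs for p in map_fn(k, line)]
    return {k: reduce_fn(k, vs)[0] for k, vs in shuffle(mapped).items()}
```

Running `run_job([(0, "a b a"), (1, "b")])` groups the pairs by word and produces `{"a": 2, "b": 2}`.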
Following are the salient features of the map-reduce framework.
• The map-reduce programming framework lets a programmer write code as if for a single machine; the framework takes care of distributing the logic and scaling it to thousands of machines.
• Users can write the logic for the map and reduce functions and control various parts of the framework, such as file splitting, secondary sorting, the partitioner and the combiner, via pluggable components.
• Map-Reduce is highly fault tolerant in the sense that failed tasks (map, reduce, or a subset of them) do not stop the whole job. Only the tasks that failed are restarted, and the job resumes. This saves a lot of time for jobs on large amounts of data. The framework can be made to tolerate both namenode and datanode failures.
• One more notable feature of the map-reduce framework is task localization. The task scheduler always tries to reduce network IO by assigning map tasks as close to the input splits as possible. Setting the split size and the HDFS block size to the same value makes this easier still and gives 100% task localization.
• The scale at which map-reduce programs work is enormous, and the framework has been shown to scale to tens of thousands of nodes. This brings a very high degree of parallelism to data processing, resulting in faster throughput.
1.2 Hive
Hive [2] is a data warehouse infrastructure built on top of Hadoop. It provides the tools to perform offline batch-processing ETL tasks on structured data of petabyte scale. It provides an SQL-like query interface to the user, called Hive-QL, through which the user can query the data. Since Hive is built on top of Hadoop, the user can expect a latency of a few minutes even on small amounts of data, as Hadoop takes time to submit the jobs, schedule them across the cluster and initialize their JVMs. This prevents Hive from being used as an OLTP engine; it does not answer real-time queries or support individual row-level updates as in a normal relational database. The functionality of Hive can be extended using user-defined functions (UDFs). The notion of UDFs is not new to relational databases and has long been in practice. In Hive we can plug in our own custom mapper and reducer scripts to perform operations on the results of the original query. These functionalities of Hive, along with its ability to scale to tens of thousands of nodes, make it a very useful ETL engine.
1.2.1 Hive Anatomy
Hive stores tables in the warehouse as flat files on HDFS, and they can be partitioned based on the value of a particular column. Each partition can be further bucketed based on other columns (other than the one used for partitioning). A query executed by a user is parsed and converted into an abstract syntax tree (AST), where each node corresponds to a logical operator, which is then mapped to a physical operator. Figure 1.5 depicts the Hive architecture.
1.2.2 Joins in Hive
As in most relational database systems, executing a join in Hive is costlier than executing other operators, in terms of both query execution time and resource consumption. The problem is more significant in Hive because the data is sharded and distributed across a network; performing a join requires matching tuples to be moved from one machine to another, which results in a lot of network IO overhead. Hive implements joins over the map-reduce framework, and the following three types of joins are supported.
Figure 1.5 Hive Architecture

• Common Join : The common join is the default join mechanism in Hive and is implemented as a single map-reduce job. It can be thought of as a distributed union of cartesian products. Suppose we are joining table A on column ’a’ with table B on column ’b’, and d_1, ..., d_n are the distinct values taken by the join columns a and b. Then the common join operator can be described by the following equation, where A_{a=d_i} is the set of all rows of A whose column a has value d_i:

⋃_{i=1}^{n} ( A_{a=d_i} ⋈ B_{b=d_i} )

The data of both tables is read in the mappers, and the rows are then shuffled across the network in such a way that all rows with the same join key reach the same machine. To identify the table a row belongs to, rows are tagged during the map phase and differentiated in the reduce phase according to these tags. A cartesian product is then applied to the rows of the two tables with the same join-column value, and the output is written to disk for further processing.
• Map Join : The map join is an optimization over the common join, used when one of the tables is small enough to fit in the main memory of the slaves. In a map join, the smaller table is copied into the distributed cache before the map-reduce job, and the larger table is fed to the mappers. The larger table is streamed row by row, the join is done against the rows of the smaller table, and the results are written to disk. The map join eliminates the shuffle and reduce phases of the map-reduce job, which makes it very fast compared to the other join types.
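The map-join idea (hash the small table, stream the large one, no shuffle) can be sketched roughly as follows outside Hive; the table contents and names are made up for illustration.

```python
from collections import defaultdict

def map_join(small_table, big_table_stream, key_small, key_big):
    """Simulate Hive's map join: build an in-memory hash table from the
    small table (standing in for the distributed cache) and stream the
    large table through it, so no shuffle/reduce phase is needed."""
    # 'Distributed cache': hash the small table on its join key.
    hashed = defaultdict(list)
    for row in small_table:
        hashed[row[key_small]].append(row)

    # Map phase: stream the large table row by row and probe the hash table.
    for big_row in big_table_stream:
        for small_row in hashed.get(big_row[key_big], []):
            yield {**big_row, **small_row}

dims = [{'d': 1, 'name': 'one'}, {'d': 2, 'name': 'two'}]
facts = [{'f': 1}, {'f': 2}, {'f': 2}, {'f': 9}]
joined = list(map_join(dims, facts, 'd', 'f'))  # 3 matching rows
```

The generator mirrors the streaming nature of the mapper: each large-table row is joined and emitted immediately, without materializing intermediate state.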
• Bucket Join : The bucket map join is a special case of the map join in which both tables to be joined are bucketed (all rows for a given join-column value are stored in one place). The larger table is given as input to the mappers; for each value of the join column, the corresponding buckets of the smaller table are fetched and the join is performed. This improves on the map join in that, instead of copying the whole of the smaller table, we copy just the required buckets to the mappers of the larger table.

All the join conditions in the parse tree are converted into operators corresponding to one of the common, map, or bucket joins. Users can provide hints about the sizes of the tables as part of the query, and this information is used at query processing time.
1.3 Query Optimization in Databases

In recent times, SQL has become the de-facto standard for writing analytical queries over data. The process of query optimization takes an SQL query, builds the basic query plan, applies transformations, and gives a query execution plan (QEP) as output. The transformations applied to the query plan depend on the logic of the query optimizer. In general, a query optimizer first determines the logical query plans, which are generic, and then decides the physical operators to be executed. Overall, the procedure of query optimization can be broken down into the following steps, shown in figure 1.6 [10]:

• Generating the search space for the query
• Building a cost model for query execution
• Devising a search strategy to enumerate the plans in the search space
Search Space : There exist many ways of executing the same query, and query optimizers consider only a subset of the possible plans, assigning a cost to each candidate in order to find the best plan for execution. Since this problem has been proved to be NP-hard [22], heuristics are applied to reduce the search space and prune non-optimal plans. The whole set of plans that a query optimizer considers in order to pick the best plan is called the search space of that query optimizer. A lot of research exists on the search space of query optimizers, from left-deep trees [14, 25, 18] to bushy trees [27]. Each plan space has its own merits and drawbacks, but finding the truly optimal plan is NP-hard and infeasible for optimizers in general.
Cost Model : Since we consider a search space and select the best plan, we need a function that quantifies the cost of executing a plan in terms of known parameters of the cluster. We minimize or maximize an objective function based on the costs given by the cost model. This cost model relies on (i) the statistics on the relations and indexes, (ii) the formulae to estimate the selectivity of various predicates and the sizes of the output of each operator in the query plan, and (iii) the formulae to estimate the CPU and IO cost of every operator in the query plan.

Figure 1.6 Block diagram of a query optimizer
The statistics on the tables include various metadata about the actual data, such as the number of rows, the number of disk pages of the relations and indexes, and the number of distinct values of a column. Query optimizers use these statistics to estimate the selectivity factor of a given query or predicate, that is, the fraction of rows that actually qualify the predicate. The best-known way of doing this is by using histograms [35, 31, 23]. Using statistics, query optimizers predict the overall cost of executing a query, which mainly comprises CPU, IO and network costs. Estimating these costs for a query operator is also non-trivial, since it must take into consideration various properties of the system and many system-level implementation details, such as data flow from cache to buffer to disk [14]. Other factors, like the number of concurrently running queries and the available buffer space, also affect the cost values. Many detailed IO cost models have been developed to estimate various IO operations such as seek, latency and data transfer [21, 20].
Search Strategy : Various approaches have been studied in theory to search a given space of plans. The dynamic programming algorithm proposed in [37] is an exhaustive search strategy that enumerates the query plans in a bottom-up manner and prunes expensive plans based on the cost formulae. Work has been done on heuristic and random optimizations to walk through the search space [40, 39], and much work has compared various search strategies such as left-deep, right-deep and bushy trees [27, 40, 41, 24]. Some query optimizers employ a dynamic query optimization technique where decisions about the type of query operator to be used and its physical algorithm are taken at run time; the optimizer designs a decision tree which is evaluated at run time. This type of plan enumeration is better suited to top-down architectures like Volcano [19, 17].
1.4 Problem statement and scope

In this thesis, we design and implement a cost-based query optimizer for join queries over map-reduce, one that modifies the naive query plans given by SQL-to-map-reduce translators based on statistics collected about the data. Our query optimizer calculates the query execution cost of various possible plans for a given query, based on the statistics and our cost formulae, and then chooses the plan with the least cost to execute on a cluster of machines. It uses a linear combination of the communication, IO and CPU costs involved in the query execution to compute the total cost. Our query optimizer considers a new plan space that suits the highly parallelizable map-reduce framework on huge datasets. It then follows an optimized approach to assigning tasks to the slave machines which minimizes the total network shuffle of data and also distributes data processing evenly among the slaves to increase the total query throughput.
1.5 Contributions
Contributions of the thesis can be summed up as follows.
• Proposed a statistics-based approach for optimizing SQL queries on top of map-reduce. We borrowed this approach from existing query optimization techniques in relational database systems and extended it to work for the current use case.

• Extended the traditional plan space to include bushy trees, which leverage the massively parallel architecture of map-reduce systems. We explore a subset of the bushy plan space that provides parallelism and exploits the massively parallel map-reduce framework.

• Designed cost formulae to predict the runtime of joins before executing the query. We use these cost formulae to find the optimal query plan from the above plan space.

• Designed and implemented a new task scheduler for map-reduce that minimizes the overall network IO during joins. This scheduler is based on our statistics store and converts the task-assignment problem into a maxflow-mincut formulation. Our experiments showed up to a 33% reduction in shuffle data size.

• Our experiments show that this query optimizer runs join queries up to 2 times faster on TPC-H datasets of scales 100 and 300, on both TPC-H benchmark queries and synthetic queries. We ran various SQL queries with select, project and join predicates and tested all our approaches by building a distributed statistics store on the dataset.
1.6 Organization of thesis

The rest of the thesis is organized as follows. Chapter 2 discusses related work, and chapter 3 discusses in depth the query plan space we deal with and its advantages. Chapter 4 describes the statistics engine and the cost formulae required to evaluate joins in the map-reduce scenario, as well as our scheduling algorithm for map-reduce. We then discuss the results of our research in chapter 5 and conclude the thesis with our observations and possible directions for future work in chapter 6.
Chapter 2
Related Work
2.1 Related Work
A lot of research has gone into the field of query optimization in databases, both stand-alone and distributed. Many techniques to estimate the cost of query plans using statistics have been proposed. The well-known System R query optimizer from IBM has been extended in System R* [30] to work for distributed databases. We extended the ideas from these systems to work for join optimization over map-reduce based systems. Many heuristics have been developed to reduce the plan space for joins. Not much work has gone into optimizing SQL queries over map-reduce using traditional database techniques. However, work has been done on improving the runtime of map-reduce based SQL systems by reducing the number of map-reduce jobs per query and removing redundant IOs during scans in [28] and [43].
In [28] Lee et al. developed a correlation-aware translator from SQL to map-reduce which considers various correlations among the query operators to build a graph of map-reduce operators instead of using a one-operation-to-one-job translation. This reduces the overall number of map-reduce jobs required to run a query and also minimizes scans of the datasets by considering correlations. In [43] Vernica et al. worked on improving set-similarity joins on the map-reduce framework. The main idea is to balance the workload across the nodes in the cluster by efficiently partitioning the data.
Afrati et al. worked on optimizing chain and star joins in a map-reduce environment and on optimizing the shares between map and reduce tasks [13]. They also consider special cases of chain and star joins to test their theory of optimal allocation of shares, and they observed that their approach is useful for star-schema joins, where a single large fact table is joined with a large number of small dimension tables. Work has been done on theta-joins over map-reduce [32] using randomized algorithms, and [12] addresses a special case of fuzzy joins over map-reduce and quantifies its cost. Google has implemented a framework called Tenzing [29] with built-in optimizations for improving the runtime of queries in a map-reduce environment, using techniques such as sort avoidance, block shuffle and local execution. These techniques result in efficient execution of joins at query runtime: depending on parameters such as the size of the relations and the volume of shuffle data, the operators are scheduled to execute on the nodes.
Chapter 3
Query Plan Space
3.1 Overview of Query Planspace - An Example
Let us consider a simple query with three joins and four relations A, B, C, D as follows.
select * from A,B,C,D where A.a = B.b and B.b = C.c and C.c = D.d
Figure 3.1 lists two possible join orderings for the query.
Figure 3.1 Possible join orderings
The plan on the left, (((A join B) join C) join D), is the left deep plan considered by the Hive optimizer for execution, whereas the plan on the right, ((A join B) join (C join D)), is a bushy tree. Let us assume, for the purposes of this analysis, that the sizes of the intermediate relations are large and that the map join is not a possible operator for the plan execution. Figure 3.1 is then also the operator tree, where CJ represents a common join operator. A left deep plan is by nature serial and must be executed level after level, whereas bushy trees are inherently parallel and the left and right children of the root node can be executed in parallel.
Suppose the above query is executed using Hive on a Hadoop cluster. Hive chooses the left deep plan for execution, and it is broken down into 3 joins as follows:
A common join B -> result:temp1
temp1 common join C -> result:temp2
temp2 common join D -> result:temp3 (final output)
The entire query execution is serial, and there are three map-reduce jobs corresponding to the three common join operators. The relations A and B are joined and the result is stored in a table temp1 in the first map-reduce job, and the rest of the map-reduce jobs must wait until the entire result of temp1 is written to HDFS. Given this scenario, two cases can occur.

• Case 1 : The map-reduce job takes up all the mapper and reducer slots in the cluster, or
• Case 2 : The job takes up a subset of the map and reduce slots in the cluster.

In both cases there is an underutilization of cluster resources, because the mapper slots remain idle until the job completes its shuffle and reduce phases; this follows from the restriction of the left deep plan that the entire procedure is serial. However, in case 2 the underutilization is more pronounced because many slots remain idle right from the beginning of the job. Considering the example query above, the map-reduce job corresponding to the join (temp1 join C) cannot start even though there are free mapper slots in the cluster. This increases the query runtime, and most of the machines remain idle until the whole job is completed. The query execution timeline is depicted in figure 3.2. It is clear that none of the phases overlap and everything is perfectly sequential. From t = a to t = c all the map slots in the cluster and the CPU remain idle.
If instead the bushy tree execution plan is considered, it can be broken down into 3 joins as follows:
A common join B -> result:temp1
C common join D -> result:temp2
temp1 common join temp2 -> result:temp3 (final output)
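To illustrate the intra-query parallelism of this breakdown, here is a toy sketch in which a thread pool stands in for parallel map-reduce jobs; run_join is a hypothetical stand-in that simply pairs rows rather than performing a real equi-join.

```python
from concurrent.futures import ThreadPoolExecutor

def run_join(name, left, right):
    """Stand-in for launching one common-join map-reduce job.
    In a real cluster this would submit a job; here we just pair rows."""
    return [(l, r) for l in left for r in right]

A, B, C, D = [1, 2], [2, 3], [3, 4], [4, 5]

# The bushy plan runs (A join B) and (C join D) as independent parallel jobs...
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(run_join, 'temp1', A, B)
    f2 = pool.submit(run_join, 'temp2', C, D)
    temp1, temp2 = f1.result(), f2.result()

# ...and only the final join must wait for both intermediate results.
temp3 = run_join('temp3', temp1, temp2)
```

The point of the sketch is the dependency structure: temp1 and temp2 have no dependency on each other, so they can occupy cluster slots concurrently, while the left deep plan forces all three joins into a single serial chain.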
Figure 3.2 Time line for the execution of left deep QEP

Even in this execution plan, the two cases discussed above are valid. Suppose that the two map-reduce jobs corresponding to (A join B) and (C join D) are launched in parallel, and the map-reduce job for (A join B) is executed first under Hadoop's default FIFO scheduler. In case 1, the tasks corresponding to (C join D) will be waiting in the queue since there are no slots available for them to be launched. As soon as the mappers corresponding to the job (A join B) complete, tasks belonging to the job (C join D) get launched. This increases cluster resource utilization and also increases intra-query parallelism. Case 2 is even simpler and faster because both map-reduce jobs run in parallel owing to the abundance of resources, which improves performance. The query execution timeline for this QEP is depicted in figure 3.3. We can clearly see the overlap in task assignment. The figure represents the worst-case task assignment where each job waits for all of its reducers; in practice the situation is better because some reducers complete quickly and give way to others.
Figure 3.3 Time line for the execution of bushy Query plan

Given the benefits of bushy tree parallelization, we might be tempted to say that bushy trees are always more efficient than left deep trees. Though this is partially true, the plan space of bushy trees is huge, and exploring all of it takes the optimizer a lot of time, sometimes much more than the query runtime itself. This is why most modern optimizers do not consider bushy trees in their search space and still rely on left deep or right deep trees. In the rest of the chapter, we describe three different algorithms to build bushy trees for joins in map-reduce and explore their planspace in detail.
3.2 Exploring Bushy Planspace
In this section we present 3 novel approaches for building bushy trees that benefit join queries over the map-reduce framework. We explore a subset of the bushy planspace called n-balanced trees, defined as follows.
Definition (n-fullybalanced tree). An n-fullybalanced tree is a perfectly balanced binary tree with n leaf nodes. This forces n to be a power of 2. An example of a 4-fullybalanced tree is shown in figure 3.4.

Figure 3.4 An example of 4-fullybalanced tree

Definition (n-balanced tree). An n-balanced tree is obtained by replacing the left-most leaf node of an n-fullybalanced tree with another n-fullybalanced tree and repeating this procedure with the newly obtained tree. However, the condition that the first tree is n-fullybalanced is relaxed in the case of query plans, depending on the number of joins. In the resulting n-balanced tree all the internal nodes are join operators and all the leaf nodes are relations.
An example of a 4-balanced tree is shown in figure 3.5. The number of levels in the tree is decided by the number of joins in the query.
Figure 3.5 An example of 4-balanced tree
The rationale behind selecting n-balanced trees is that, at any level during execution there can be
n/2 parallel mapreduce jobs (unlike 1 for left deep trees) that can be executed in the cluster. Choosing
the value of n carefully will give a very good resource utilization in the cluster and also increases
the query performance. At the same time, too much parallelization is an overkill for the query: the waiting queue becomes very long, non-FIFO schedulers such as the fair scheduler cannot keep up, and there are too many task context switches. For example, figures 3.6 and 3.7 show two possible execution plans for a query with 8 tables. Figure 3.6 is an 8-balanced tree and figure 3.7 is a 4-balanced tree; we ran both query plans on the cluster with the fair scheduler configured, and their runtimes were 923 seconds and 613 seconds respectively. Since we are running on a 10 node cluster with large input data, the cluster cannot run 4 mapreduce jobs at a time, so the jobs wait in the queue until sufficient slots become available.

Figure 3.6 8-balanced tree

Figure 3.7 4-balanced tree

In the 4-balanced tree, since only 2 jobs run at a time, the waiting time is lower and the query completes more quickly. Carefully choosing the value of n also reduces the plan space of bushy trees, and this solution has the benefits of both parallelization and higher query performance, as we see in the later sections. We first describe the algorithm FindMin, which builds an n-fullybalanced bushy tree, and then FindRecMin and FindRecDeepMin, which explore the n-balanced bushy plan space.
3.2.1 Finding minimum cost n-fully balanced tree - FindMin
In this section, we describe an algorithm FindMin to build an n-fullybalanced bushy tree. We build it level by level, finding the minimum-cost pairs at each level. The algorithm takes as input a query tree Q corresponding to the parsed SQL statement given by the user and the value n, and gives an n-fullybalanced join operator tree as output. The algorithm is described below.
Algorithm 1: FindMin to build an n-fullybalanced bushy tree

Input : A query tree Q corresponding to the parsed SQL statement
Output: An operator tree to be executed
begin
    J ← getJoinTables(Q)
    s ← sizeOf(J)
    while s ≠ 1 do
        x ← getPowerOf2(n)        // fetches the power of 2 less than n
        P ← selectMinPairs(x)     // selects the x/2 pairs of joins with least cost
        for y ∈ P do
            Remove tables in y from J
            Add y to J
        end
        s ← sizeOf(J)             // update the size of J
    end
    return makeOperatorTree(J)
end
The heart of the algorithm is the function selectMinPairs(x), which selects the top x/2 pairs from the table list based on the cost functions we define in the next chapter. We then remove the individual tables in those minimum pairs and add each pair as a whole. This is the greedy step of the algorithm: the local minimum cost may not yield the global minimum cost over the whole query tree, yet we proceed assuming it does. An example run of this algorithm on a query with 8 joins (9 tables) to build 4-balanced trees is shown in figure 3.8.
Figure 3.8 Running Algorithm1 on 9 tables
At each step, the bracketed tables are the minimum cost x/2 join pairs. Once they are bracketed, the
whole bracket is considered as a single table for the subsequent step and each bracket converts to a join
in the query tree.
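A minimal sketch of FindMin's greedy bracketing follows, assuming a stand-in pair-cost function (the product of estimated table sizes) in place of the cost formulae of the next chapter; the tables and sizes are illustrative.

```python
from itertools import combinations

def find_min(tables, sizes, x):
    """Greedy FindMin sketch: at each level, bracket the x/2 cheapest
    disjoint pairs of tables (cost here is a stand-in: product of sizes)."""
    J = list(tables)
    while len(J) > 1:
        k = min(x // 2, len(J) // 2)          # pairs we can form at this level
        # Rank all candidate pairs by the stand-in cost function.
        ranked = sorted(combinations(J, 2),
                        key=lambda p: sizes[p[0]] * sizes[p[1]])
        picked, used = [], set()
        for a, b in ranked:                   # take k disjoint cheapest pairs
            if a not in used and b not in used:
                picked.append((a, b))
                used.update((a, b))
                if len(picked) == k:
                    break
        for a, b in picked:                   # replace each pair by its join
            J.remove(a); J.remove(b)
            name = f'({a}⋈{b})'
            sizes[name] = sizes[a] * sizes[b] # naive output-size estimate
            J.append(name)
    return J[0]

sizes = {'A': 10, 'B': 2, 'C': 5, 'D': 3}
plan = find_min(['A', 'B', 'C', 'D'], sizes, 4)  # brackets (B,D) and (A,C) first
```

With x = 4, two pairs are bracketed per level, mirroring the figure: each bracket becomes a single "table" for the next level until one operator tree remains.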
3.2.2 Finding a minimum cost n-balanced tree recursively - FindRecMin
The algorithm we describe in this section generates n-balanced trees. The algorithm takes the query
plan Q that corresponds to the parsed SQL query and a value n as input and gives out the join operator
tree as output.
This algorithm considers all possible combinations of size n at each level; once it finds a minimum combination it finalizes it and reflects it in the final result, even though this local minimum combination of size n may not produce a global minimum. An example run of this algorithm on 8 joins (9 tables) is shown in figure 3.9. At each stage of bracketing, all possible n-combinations are considered and the combination with the least execution cost is shown in the figure. Not all combinations are shown in the figure due to space constraints.
3.2.3 Finding a minimum cost n-balanced tree exhaustively - FindRecDeepMin
We now describe an approach to generate n-balanced trees that is similar to algorithm FindRecMin; however, instead of finalizing on a local minimum of size n, we recursively go up the order to see if this
Algorithm 2: FindRecMin algorithm for building n-balanced trees

Input : A query tree Q corresponding to the parsed SQL statement and n for generating n-balanced trees
Output: An operator tree to be executed
begin
    J ← getJoinTables(Q)
    breakFlag ← True
    while breakFlag do
        s ← sizeOf(J)
        comb ← n
        if s < n then
            comb ← s
            breakFlag ← False        // last iteration of the loop since no tables are left
        end
        C ← generateCombinations(J, comb)   // generate combinations of size comb from set J
        min_cost ← ∞
        min_comb ← null
        for x ∈ C do
            P ← generatePermutations(x)     // check all permutations of each combination; can be skipped for linear joins
            for y ∈ P do
                if isValidJoinOrder(y) ∧ executionCost(y) < min_cost then
                    min_comb ← y            // update min_comb: a plan with lesser cost is found
                    min_cost ← executionCost(y)
                end
            end
        end
        Remove individual tables in min_comb from J
        Add min_comb as a single table to J  // reflect the changes in J
    end
    return J
end
Figure 3.9 Running Algorithm2 on 9 tables
n-combination gives a global minimum. The algorithm is described in Algorithm 3. An example run of this exhaustive algorithm is similar to figure 3.9, except that for every bracketing a further recursive call is made to check whether it yields a global minimum. So we might get a set of bracketings different from those in the figure, but the overall shape of the tree remains the same.
We analyze the runtime, search space and performance of each of these algorithms in detail in the results chapter.
3.3 Choosing the value of n for an n-balanced tree
Choosing the value of n is not as trivial as it seems. Setting the wrong value may not produce
desirable results as one of the following two cases may occur.
• A very small value of n might leave the cluster underutilized: there will be empty task slots in the cluster but no more jobs that can be run in parallel.

• A very large value of n might keep many jobs in the cluster waiting in the queue and also bring down cluster efficiency through too much multitasking on the nodes.
These can be avoided by using a few simple techniques. One way is to try a few values and fix the value that gives the best results. This takes experimentation and may take some time before
Algorithm 3: FindRecDeepMin algorithm for building n-balanced trees

Input : A query tree Q corresponding to the parsed SQL statement and n for generating n-balanced trees
Output: An operator tree to be executed
begin
    J ← getJoinTables(Q)
    breakFlag ← True
    while breakFlag do
        s ← sizeOf(J)
        comb ← n
        if s < n then
            comb ← s
            breakFlag ← False        // last iteration of the loop since no tables are left
        end
        C ← generateCombinations(J, comb)   // generate combinations of size comb from set J
        min_cost ← ∞
        min_comb ← null
        for x ∈ C do
            P ← generatePermutations(x)     // check all permutations of each combination; can be skipped for linear joins
            for y ∈ P do
                if isValidJoinOrder(y) then
                    J′ ← J
                    Remove individual tables in y from J′
                    Add combination y as a single table to J′
                    Q′ ← makeQueryTree(J′)
                    JTemp ← FindRecDeepMin(Q′, n)    // recursively call the same function with the new join tree
                    if executionCost(JTemp) < min_cost then
                        min_comb ← JTemp             // update min_comb: a plan with lesser cost is found
                        min_cost ← executionCost(JTemp)
                    end
                end
            end
        end
        Remove individual tables in min_comb from J
        Add min_comb as a single table to J  // reflect the changes in J
    end
    return J
end
we find a suitable value. Another way is to estimate n by calculating the maximum capacity of the cluster as follows.

1. Get the total number of map slots in the cluster (M). This can be done by summing up the map slots on each machine of the cluster.

2. Get the average input size for a single map-reduce job (I). This is twice the average table size (since two tables are involved per join).

3. Get the average number of map tasks per map-reduce job (A) by dividing the average input size (I) by the block size (B).

4. A good estimate of n is then the power of 2 closest to the value of M/A.

This estimate of n can be further improved using the past history of queries to derive the average number of jobs that can run at any time. The number of reduce tasks is not involved in the computation, since the map tasks are the ones that determine the input size and the number of reducers is generally constant per task.
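The four steps above can be sketched as follows; the cluster figures in the example are hypothetical.

```python
def estimate_n(map_slots_per_node, avg_table_size, block_size):
    """Estimate n for an n-balanced tree from cluster capacity,
    following the four steps above."""
    M = sum(map_slots_per_node)       # 1. total map slots in the cluster
    I = 2 * avg_table_size            # 2. average input per join job (two tables)
    A = max(1, I // block_size)       # 3. average map tasks per job
    target = M / A                    # 4. jobs that fit in the cluster at once
    # Find the power of two closest to M/A (at least 2, since n is a power of 2).
    n = 2
    while n * 2 <= target:
        n *= 2
    return n * 2 if abs(2 * n - target) < abs(n - target) else n

# Hypothetical 10-node cluster: 8 map slots per node, 1 GB tables, 128 MB blocks.
GB, MB = 1 << 30, 1 << 20
n = estimate_n([8] * 10, 1 * GB, 128 * MB)  # M = 80, I = 2 GB, A = 16, M/A = 5
```

With these illustrative numbers M/A is 5, so the estimate settles on n = 4: at most two joins of a 4-balanced tree run concurrently without overflowing the slot capacity.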
Chapter 4
Cost Based Optimization Problem
All the algorithms described in the previous chapter assume that we have a cost estimator for the join
operators for joins over map-reduce framework. In this chapter we describe in detail
1. Estimating the cost of the operators in terms of disk IO operations, CPU usage and network
movement of data and
2. Optimizing the data movement across machines to reduce the network IO
To give accurate cost estimates, the majority of traditional relational database systems rely on histograms of the data. Histograms give an overview of the data by partitioning its range into a set of bins. The number of bins and the partitioning method determine the accuracy of the histograms, and this has been studied in depth. Traditionally, two types of histograms have been in common use [36]:

• Equi-width histograms : Equi-width histograms, or frequency histograms, are the most common histograms used in databases; the value range is divided into n buckets of equal size. Values are bucketed based on the range they fall in, and the number of rows per bucket depends on the data distribution.

• Equi-depth histograms : In equi-depth histograms the number of rows per bucket is fixed and the size of the buckets is determined by the data distribution.
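A minimal sketch of an equi-depth histogram and the kind of selectivity estimate it supports (assuming uniformity within each bucket) follows; this is an illustration, not the statistics store described later, and the toy column is made up.

```python
def equi_depth_histogram(values, n_buckets):
    """Equi-depth: a fixed number of rows per bucket, so the bucket
    boundaries follow the data distribution."""
    vs = sorted(values)
    depth = len(vs) // n_buckets
    buckets = []                       # each bucket is (low, high, row_count)
    for i in range(n_buckets):
        stop = (i + 1) * depth if i < n_buckets - 1 else len(vs)
        chunk = vs[i * depth:stop]     # the last bucket takes the remainder
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

def estimate_le_selectivity(buckets, x):
    """Estimate the fraction of rows with value <= x, assuming values
    are spread uniformly inside each bucket."""
    total = sum(count for _, _, count in buckets)
    qualifying = 0.0
    for lo, hi, count in buckets:
        if x >= hi:                    # entire bucket qualifies
            qualifying += count
        elif x >= lo:                  # bucket partially qualifies
            qualifying += count * (x - lo) / max(hi - lo, 1)
    return qualifying / total

data = list(range(100))                # toy column with values 0..99
hist = equi_depth_histogram(data, 4)
sel = estimate_le_selectivity(hist, 49)  # predicate: value <= 49
```

An equi-width variant would instead fix the bucket boundaries and let the counts vary; the selectivity formula stays the same, only the bucket construction changes.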
Each of the techniques has its own set of advantages and disadvantages. We borrow the idea of using histograms for accurate cost estimation from relational databases and extend it to build our own distributed statistics store, with cost estimator functions based on it. The rest of the chapter is organized as follows. First we describe our distributed statistics store and the methods it exposes to the query optimizer, and then we describe the cost formulae for each of the join operators based on these methods.
4.1 Distributed Statistics Store
We designed and implemented a statistics store distributed across the machines in the Hadoop cluster. Each node maintains an equi-depth histogram for the data local to that machine. These histograms are used for cardinality estimations local to that site. Methods have been written to compute these histograms from the data during map-reduce jobs. Since the data is likely to be updated, we update the histograms too, but not by re-reading all the data: we plug our code in so that when the modified data is read as part of other map-reduce jobs, the histograms also get updated. Though this may slightly increase the runtime of that query, we need not put extra load on the cluster by reading the whole data again, and our experiments show that this slight increase in runtime is not significant. Initial statistics can be computed while loading the data into the cluster, or by separately running a job that reads the whole data and updates all the local histograms. Further, we consolidate the distributed statistics to maintain the global per-table statistics that we use in our cost model.

These histograms are serialized to disk as tables in MySQL, and APIs have been written on top of them to fetch cardinality estimates. The query optimizer uses these APIs to obtain the cardinality estimates needed to calculate the cost of executing a query. The architecture is summarized in figure 4.1.
Figure 4.1 Statistics store architecture
4.2 Cost Formulae
In this section we describe the cost formulae we use for estimating the cost of executing the join operators on top of the map-reduce framework. The actual procedure inside the map-reduce framework is quite complex: its implementation makes multiple scans of the disk while shuffling and sorting. For example, the Hadoop framework exposes many settings to the user, and a small tweak can greatly affect the performance of a map-reduce job. One such example is io.sort.mb, the amount of buffer memory used by the map task JVM to sort the incoming streams and spill them to disk. Tuning this parameter has been shown to greatly improve job performance depending on the size of the map output data. There are many such knobs that can be tuned on a per-job basis and that greatly affect job execution time; they can be set inside the job configuration classes. However, for the problem of query optimization it is sufficient if our cost model predicts a runtime proportional to the actual execution time. Our cost model divides the whole join map-reduce job into three phases, map, shuffle and reduce, evaluates the cost of each phase separately, and adds them up for the whole cost. We now describe each of these phases in detail along with the cost formulae we use to predict their runtime.
4.2.1 Join map phase
In the map phase, each HDFS block is read, parsed with an appropriate RecordReader implementation, and the select predicates are applied to prune unnecessary rows. The rows are divided into partitions based on the join column value and are spilled to disk whenever memory fills up. So, once all the data is read, there may be multiple spills on disk, which are merged in multiple passes. This whole process is complex and involves multiple round trips of data between disk and memory. However, we identified the parts of this phase that incur the most cost (in terms of runtime) and included them in our cost analysis. They are as follows.
1. Reading the whole data block from HDFS through network IO. Even when the data is local, HDFS uses sockets to read it. If the data is not local, it is read from a remote machine; however, HDFS tries to maintain data locality most of the time by reading local replicas, which speeds up the whole process. The cost for this step is BLOCK_SIZE/R_hdfs.

2. Once the data is read, the selectivity filters are applied and the remaining data is written back to the disk so that it can be merged later. The cost of spilling this to disk is (BLOCK_SIZE ∗ selectivity_factor)/W_local_write. We calculate the selectivity factor from the selection predicates in the input query.

3. All the written data is read back to do an in-memory merge in multiple rounds. The cost of doing this is (BLOCK_SIZE ∗ selectivity_factor)/R_local_read.

4. The merged data is written back to the local disk so that it can be read in the reduce part. We include this cost as (BLOCK_SIZE ∗ selectivity_factor)/W_local_write.
BLOCK_SIZE — configured block size in HDFS
R_hdfs — read throughput from HDFS
selectivity_factor — selectivity factor of the select predicates in the query for that block
R_local_read — read throughput of the local machine where the map task runs
W_local_write — write throughput of the local machine where the map task runs

Table 4.1 Notations for cost formulae - Map phase
Summing up all the costs, the total cost of the map phase is written as follows:

    BLOCK_SIZE / R_hdfs
    + (BLOCK_SIZE * selectivity_factor) / W_local_write
    + (BLOCK_SIZE * selectivity_factor) / R_local_read
    + (BLOCK_SIZE * selectivity_factor) / W_local_write
The values of R_hdfs, R_local_read and W_local_write are computed beforehand by running a few
experiments on the cluster, and BLOCK_SIZE is the user-configured block size. These are computed only
once, provided the cluster is not changed by adding or removing nodes. The value of selectivity_factor
is calculated from the distributed statistics using the selection predicates in the query. The above
formula assumes that all the spills are merged in a single pass. This is valid for join map-reduce jobs,
since not much extra data (apart from the block content scaled by the selectivity factor) is written to
disk, and current-day memory sizes are big enough that the setting io.sort.mb (the Hadoop setting
controlling the heap available for this merging process) can be configured appropriately to make this
possible. Since this cost function is tailored to the map phase of join jobs, necessary modifications
should be made before extending it to other map-reduce jobs. Multiple such maps run on each node in
parallel, in batches. Since the total map time of the job is limited by the node with the last and
slowest map task, we take the total map time to be the time taken to complete the last map task on the
slowest node.
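As a sketch, the four cost terms above can be wrapped in a small function; the throughput and block
size arguments here are illustrative placeholders for the values measured on a real cluster.

```python
def map_phase_cost(block_size, selectivity_factor, r_hdfs, r_local_read, w_local_write):
    """Estimated map-task runtime for a join job, mirroring the formula above:
    read the block from HDFS, spill the selected rows, read them back for the
    in-memory merge, and write the merged map output."""
    selected = block_size * selectivity_factor
    return (block_size / r_hdfs            # 1. HDFS read of the whole block
            + selected / w_local_write     # 2. spill after select predicates
            + selected / r_local_read      # 3. read back for the merge
            + selected / w_local_write)    # 4. write merged map output

# Example: 128 MB block, 50% selectivity, throughputs in MB/s (all made up).
print(map_phase_cost(128, 0.5, r_hdfs=80, r_local_read=120, w_local_write=90))
```

The estimated total map time of a job would then be the maximum of these values over the map waves
on the slowest node, as described above.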
4.2.2 Join shuffle phase
Reducers start copying data after a configured number of map tasks are complete. We quantify the
total time taken to complete the shuffle phase for a reducer running on machine k that has been assigned
partition i as follows, with the notation given in Table 4.2. The value of NR_jk is calculated beforehand
by performing a series of experiments that transfer data from each node to every other node. SP_ij can
be calculated easily from the histograms and a given partition assignment.

    ∑_{j ∈ nodes} SP_ij / NR_jk
SP_ij   Size of partition i on machine j
NR_jk   Network read throughput of data on machine j from machine k

Table 4.2 Notations for cost formulae - Shuffle phase
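A minimal sketch of this shuffle-cost estimate; the SP and NR values below are made up for
illustration, not measured numbers.

```python
def shuffle_cost(SP, NR, i, k):
    """Time for the reducer on machine k to fetch partition i:
    sum over source machines j of SP[i][j] / NR[j][k]."""
    return sum(SP[i][j] / NR[j][k] for j in range(len(NR)))

# Two machines, one partition: 10 MB of the partition on machine 0 and
# 20 MB on machine 1, with illustrative throughputs NR[j][k] in MB/s
# (local reads are fast, cross-machine reads slow).
SP = [[10.0, 20.0]]
NR = [[100.0, 2.0],
      [1.0, 100.0]]
print(shuffle_cost(SP, NR, i=0, k=1))  # 10/2 + 20/100 = 5.2
```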
4.2.3 Join Reducer phase
In the reduce phase, all the shuffled data is read and the actual join of the tables is performed. All
this data is then written to HDFS so that the subsequent join task can read it. Since HDFS replicates
each block to multiple nodes, the entire process is limited by the HDFS write throughput. So we estimate
the total cost of the reduce phase to be the sum of the time taken to write the whole partition data
(SP_i) from the shuffle, read it back into the reducer's memory, and write the result (result_j) of the
join from reducer j back to HDFS. The equation can be written as follows:

    SP_i / W_write_local + SP_i / R_read_local + result_j / W_write_hdfs

W_write_local and W_write_hdfs are the write throughputs of the local machine and HDFS respectively.
We estimate the value of result_j using the global histograms as follows:

    result_j = ∑_{keys k ∈ partition i}
               (sel_factor(k, join_table1) * size(join_table1)) *
               (sel_factor(k, join_table2) * size(join_table2))

For map joins, the costs of scheduling and shuffle are automatically set to 0, and the mapper cost
includes writing the result but excludes the cost of writing intermediate data to local disks.
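The reduce-phase estimate and the histogram-based result-size formula above can be sketched together;
all sizes, selectivities and throughputs below are illustrative assumptions.

```python
def estimate_result_size(keys, sel1, size1, sel2, size2):
    """Join result rows for a partition from per-key selectivities,
    following the histogram-based formula above."""
    return sum(sel1[k] * size1 * sel2[k] * size2 for k in keys)

def reduce_phase_cost(sp_i, result_j, w_local, r_local, w_hdfs):
    """Reducer runtime estimate: write the shuffled partition locally,
    read it back into memory, then write the join result to HDFS."""
    return sp_i / w_local + sp_i / r_local + result_j / w_hdfs

# Hypothetical partition with two join keys and per-key selectivities.
rows = estimate_result_size(["a", "b"],
                            {"a": 0.01, "b": 0.02}, 1000,
                            {"a": 0.05, "b": 0.01}, 2000)
print(reduce_phase_cost(sp_i=512, result_j=rows,
                        w_local=90, r_local=120, w_hdfs=40))
```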
4.3 Scheduling - An example
Consider the execution plan (on a two-machine cluster) of the CommonJoin operator in MapReduce,
using the tables A and B in Figure 4.2 joined according to the following query:

SELECT * FROM A JOIN B ON (A.a = B.b)

Both tables A and B are read into mappers, and triples of the format <join column value, row key
and data, tag> are emitted. So in the given example the triples <x,1,0>, <x,2,0>, <y,3,0>, <x,4,1>,
<y,5,1>, <y,6,1> are emitted. The tag 0 or 1 classifies a triple as belonging to table A or B in the
reducers so that the join can be done. All the triples with the same join column value are moved to the
same machine in the shuffle phase. There they are classified according to their tags, a cartesian
product is taken, and the result is written to disk. The join operator is considered costly because it
involves the network movement of data from one node to another, which incurs a large latency. In the
above example there are the two following ways of scheduling the join keys in the reduce phase.
Figure 4.2 Query scheduling example
1. Join value x on machine 1 and y on machine 2
2. Join value x on machine 2 and y on machine 1
In case 1 the total network cost (in terms of rows) is 2 ((3,y) from machine 1 to 2 and (4,x) from
machine 2 to 1), whereas in case 2 the total network cost is 4, twice that of case 1. Considering the
size of data the Hadoop ecosystem manages, these figures become very large for tables at terabyte scale,
and such communication costs can heavily impact query performance. So the problem finally boils down to
partitioning m reduce keys across n machines so as to minimize the network movement of data.
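The two schedules above can be checked with a tiny enumeration over key-to-machine assignments,
using the row placement from the example (rows 1-3 on machine 1, rows 4-6 on machine 2).

```python
from itertools import product

# (join key, row id, machine holding the row), as in the example
rows = [("x", 1, 1), ("x", 2, 1), ("y", 3, 1),
        ("x", 4, 2), ("y", 5, 2), ("y", 6, 2)]

def network_cost(assignment):
    """Rows whose key is reduced on a different machine must cross the network."""
    return sum(1 for key, _, machine in rows if assignment[key] != machine)

# Enumerate every key-to-machine assignment and report its cost in rows.
for x_m, y_m in product([1, 2], repeat=2):
    cost = network_cost({"x": x_m, "y": y_m})
    print(f"x -> machine {x_m}, y -> machine {y_m}: {cost} rows shuffled")
# Case 1 (x -> 1, y -> 2) moves 2 rows; case 2 (x -> 2, y -> 1) moves 4.
```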
4.4 Scheduling strategy
In this section we show a novel approach for scheduling a CommonJoin operator by modeling it as a
min-cost max-flow problem. Consider a Hive query which joins two tables A and B on columns a and
b respectively. The rest of the description assumes that there are m distinct values of A.a and B.b
combined and n machines in the cluster.

We define a variable X_ij as follows:

    X_ij = 1 if reducer key i is assigned to machine j, and 0 otherwise.
Since a key can only be assigned to a single machine,

    ∀i: ∑_j X_ij = 1    (4.1)

Also, we put a limit on the number of keys a reducer can process. This depends on the processing
capability of the machine; let it be l_j for node j.

    ∀j: ∑_i X_ij ≤ l_j    (4.2)
The n×n matrix C is obtained by calculating the average time taken to transfer a unit of data from
every machine to every other machine in a simple experiment. Suppose the key i is assigned to the
reducer on machine k and P_ij is the size of key i on machine j. Then the total cost of data transfer
from all machines to machine k because of key i, which serves as the runtime estimate W_ik of the
reducer, can be written as

    W_ik = ∑_j P_ij · C_jk

The total shuffling cost can now be written as

    C_total = ∑_i ∑_k W_ik · X_ik    (4.3)

The above cost function can be generalized according to the scheduling requirement. For example,
to schedule tasks in a heterogeneous environment, we can add additional parameters that signify the
cost of running a query on machine k. Since we are optimizing communication costs in this case, only
network latencies are considered; but the model can be extended to any general scheduling problem on
the map-reduce framework. We now model this problem as a flow network by following the steps below
[26].
1. Create two nodes, source (S) and sink (T).

2. Create n nodes S1 to Sn, one for each machine.

3. Create m nodes K1 to Km, one for each key.

4. Create n edges from the source S to each of S1 to Sn, with capacities l1 to ln and cost 0.

5. Connect every pair of Si (1 ≤ i ≤ n) and Kj (1 ≤ j ≤ m) by an edge with cost W_ji and
capacity 1. X_ij is the flow on the edge connecting Si and Kj.

6. Create m edges from every node Kj (1 ≤ j ≤ m) to the target T, with capacity 1 and cost 0.
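Because the optimum the network computes is just the cheapest feasible assignment, on a tiny instance
it can be cross-checked by brute force. The sketch below enumerates assignments directly, enforcing
constraints (4.1) and (4.2) and minimizing the cost (4.3); the W and l values are made up for
illustration. A production implementation would instead solve the flow network itself, e.g. with a
min-cost max-flow routine such as networkx's max_flow_min_cost.

```python
from itertools import product

# Illustrative inputs (made up): W[i][k] = cost of reducing key i on
# machine k; l[k] = key capacity of machine k.
W = [[2, 5], [4, 1], [3, 3]]
l = [2, 2]
n_keys, n_machines = len(W), len(l)

def best_assignment():
    """Brute-force the min-cost feasible assignment -- the same optimum
    the min-cost max-flow network computes, usable only for tiny inputs."""
    best, best_cost = None, float("inf")
    for assign in product(range(n_machines), repeat=n_keys):
        # Constraint (4.2): no machine may exceed its key capacity l_j.
        # Constraint (4.1) holds by construction: each key gets one machine.
        if any(assign.count(j) > l[j] for j in range(n_machines)):
            continue
        cost = sum(W[i][assign[i]] for i in range(n_keys))  # cost (4.3)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

print(best_assignment())  # ((0, 1, 0), 6): keys 0, 2 -> machine 0, key 1 -> machine 1
```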
The maximum flow in the above network is clearly m, since the capacities of the inbound edges of the
target T add up to m. At maximum flow, since the capacity of each outgoing edge from K1 to Km is 1,
only one of the multiple incoming edges of each key node carries flow and the rest carry 0. This
corresponds to assigning the key Kj to the machine Si whose incoming edge carries flow 1. The above
construction makes sure that each key is assigned to a single machine, and since the capacities of the
incoming edges of the machine nodes S1 to Sn are l1 to ln, the number of keys assigned to a particular
machine does not exceed its limit. The flow values on these edges give the number of keys assigned to
each machine once the network is solved for minimum cost at maximum flow. Once we have the values of
P_ij from our statistics, we solve the above min-cost max-flow network to obtain the optimal key
allocation and feed it to the map-reduce program. This flow network can be solved using an algorithm
that has a strongly
Figure 4.3 Flow network
polynomial complexity[34]. In the next chapter we discuss the experimentation and results for the
approaches discussed so far.
4.5 Shuffle algorithm - Proof of minimality
We use proof by contradiction to show that our algorithm gives the optimal shuffle. Assume that the
algorithm outputs a non-optimal partition assignment; that is, over keys i and machines j there exists
another allocation of X_ij with a smaller shuffle cost than the allocation chosen by the algorithm.
Call this Plan_x, and call the plan chosen by the algorithm Plan_opt. If we can show that Plan_x has a
lower flow cost in the network than Plan_opt, then the min-cost max-flow solver did not choose the plan
with minimum cost, which is a contradiction. From equation 4.3, the total shuffle cost for a given
allocation X_ij is

    C_total = ∑_i ∑_k W_ik · X_ik    (4.4)

From the flow network in figure 4.3, the total flow cost can be written as

    C_flow = ∑_{i ∈ nodes(K)} ∑_{k ∈ nodes(S)} W_ik · X_ik    (4.5)

which is the same as 4.4, since nodes(K) represent keys and nodes(S) represent machines by the way the
flow network is built. This implies that the total cost of an allocation and the corresponding cost in
the flow network are the same; hence the flow cost of Plan_x is less than that of Plan_opt, which is the
desired contradiction. So the shuffle algorithm always chooses the allocation with minimum shuffle cost.
4.6 Shuffle algorithm - A working example
Let us consider a query joining the two tables customer and supplier from the TPCH dataset:

select * from customer join supplier on customer.c_nationkey = supplier.s_nationkey
where s_nationkey < 5;

We add a where clause to reduce the key set size so that the flow diagram is small and easy to
understand. We ran the query on a 7 node cluster; the values in the matrices P_ij for the tables
customer and supplier are obtained from the histograms.
P_ij(customer) =
    135464 163672 206312 225336 179908   3280 397208
    242064 256660  15252 111028 586464 271748 133496
    163672 184664 169904  19680 188928  22632 182696
     94300  53300 234192  49364  73636 127264 131364
    197948 581052  29520  69372 191880 123492 113816

P_ij(supplier) =
    256878 318364 0 0 559906 0  18318
     66030  83354 0 0  29678 0 232312
     27122   4118 0 0  40044 0 208030
    312542 188008 0 0 503106 0  60918
      2272  66314 0 0 123398 0  89176
So, the total P_ij for the flow matrix is the sum of the above two matrices, which is as follows:

P_ij =
    392342 482036 206312 225336 739814   3280 415526
    308094 340014  15252 111028 616142 271748 365808
    190794 188782 169904  19680 228972  22632 390726
    406842 241308 234192  49364 576742 127264 192282
    200220 647366  29520  69372 315278 123492 202992
In our testing, all the machines are under the same switch, and because of this we have stable ping
times across all machines. So the normalized matrix C_ij looks as follows:

C_ij =
    0 1 1 1 1 1 1
    1 0 1 1 1 1 1
    1 1 0 1 1 1 1
    1 1 1 0 1 1 1
    1 1 1 1 0 1 1
    1 1 1 1 1 0 1
    1 1 1 1 1 1 0
So, the matrix W_ij evaluates to the following:

W_ij =
    2072304 1982610 2258334 2239310 1724832 2461366 2049120
    1719992 1688072 2012834 1917058 1411944 1756338 1662278
    1020696 1022708 1041586 1191810  982518 1188858  820764
    1421152 1586686 1593802 1778630 1251252 1700730 1635712
    1388020  940874 1558720 1518868 1272962 1464748 1385248
Solving the flow network with these W_ij values, we get the optimal assignment of keys shown in table
4.3, with a shuffle volume of 618 megabytes. As we can see, multiple keys are assigned to a single node,
and this limit (node capacity) can be configured per node while solving the flow graph.
Key name  Assigned node
0         node5
1         node5
2         node7
3         node5
4         node2

Table 4.3 Key to Node assignment with optimized scheduler
The same query, when run with the default Hadoop scheduler, had the assignment in table 4.4, with a
shuffle volume of 726 megabytes.
Key name  Assigned node
0         node3
1         node2
2         node4
3         node5
4         node7

Table 4.4 Key to Node assignment with default scheduler
For building the matrix W_ij, we run the standard matrix multiplication algorithm, where n, the number
of nodes in the cluster, is generally in the lower hundreds even for medium to large clusters. So this
whole process takes a fraction of a second on a normal dual-core CPU machine. We discuss the performance
evaluation of our approach in detail in the next chapter, where the best, average and worst case
performances of our algorithm are presented along with the results.
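The W = P · C product described above can be sketched with the standard triple loop; the P and C values
below are small illustrative stand-ins, not the 5×7 matrices from the worked example.

```python
def matmul(P, C):
    """W[i][k] = sum_j P[i][j] * C[j][k] -- the standard triple loop."""
    n_keys, n_mach = len(P), len(C)
    return [[sum(P[i][j] * C[j][k] for j in range(n_mach))
             for k in range(n_mach)] for i in range(n_keys)]

# Illustrative instance: P[i][j] = size of key i on machine j; C is a
# uniform unit-cost matrix with zero diagonal (local data is free to read),
# like the C_ij used in the example.
P = [[10, 0, 5],
     [0, 8, 2]]
C = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
W = matmul(P, C)
print(W)  # [[5, 15, 10], [10, 2, 8]]
```

With a zero-diagonal uniform C, W[i][k] is simply the total size of key i minus its size already on
machine k, which matches the first entries of the example's W matrix.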
Chapter 5
Experimental Evaluation and Results
In this chapter we discuss the experimental evaluation of the approaches presented so far and present
the results obtained.
5.1 Experimental setup
We conducted all the experiments on a 10 node cluster comprising 1 master and 9 slaves. We used
TPCH datasets of scales 100 (100 gigabytes) and 300 (300 gigabytes) for testing the join queries. The
input queries we tested include both synthetic queries and the TPCH benchmark query set. Each machine
is equipped with 3.5 gigabytes of RAM and a 2.4 GHz dual-core CPU, and the machines are connected by a
10 Gbps network. This setup qualifies as a network of commodity hardware machines, loaded with Linux
and running a stable version of Hadoop. We patched Hive with each of the above algorithms to test them.
We used MySQL to serialize the histograms per node, as discussed.
5.2 Plan space evaluation and Complexity
In this section we describe in detail the plan space that each of our algorithms explores and compare it
with the overall bushy plan space. For the purpose of this discussion, we assume the queries to be
multiway joins over m tables, and we are trying to build n-balanced trees.
5.2.1 Algorithm FindMin
In each iteration of the algorithm with i nodes remaining, we consider i^2 pairs to find 2^⌊log2 i⌋
(the largest power of 2 not exceeding i) tables whose joins have the least cost. This means after an
iteration with i tables, the number of tables remaining will be i − 2^⌊log2 i⌋ + 1. So the total plan
space of this algorithm is

    ∑ (2^⌊log2 i⌋)^2, taken over the sequence starting at i = m and
    updated as i ← i − 2^⌊log2 i⌋ + 1 while i > 0
Coming to the complexity, with n input tables the algorithm calls selectMinPairs() ⌊log2 n⌋ times, and
each call takes O(n^2). So the total complexity of the algorithm is O(n^2 · log2 n).
5.2.2 Algorithm FindRecMin
In this algorithm, we find the minimum n-fully-balanced tree and fix it to build the rest of the tree.
In each iteration the number of tables decreases by n − 1, starting from the initial value of m. The
number of ways of finding an n-fully-balanced subtree from x tables is C(x, n) · n!, where C(x, n) is
the binomial coefficient. So the total plan space of the algorithm is as follows:

    ∑_{i ≥ 0, (m − i(n−1)) > n} C(m − i(n−1), n) · n!
In the algorithm, each iteration of the main loop selects an n-fully-balanced tree. Given m tables, this
loop runs ⌈(m−1)/(n−1)⌉ times, reducing the number of tables by n − 1 in each iteration. Each iteration
calls generateCombinations(x, n), which generates all n-sized combinations of the join set x. The
complexity of this call is O(x · n) for linear chain joins, since we need to consider only the n
adjacent tables for each input table. However, for cyclic or non-linear joins, the complexity of
generateCombinations(x, n) is C(x, n), and we iterate through all possible permutations to find the
best plan, so each iteration of the loop costs C(x, n) · n!, which is O(m^n). So the total complexity
of the algorithm is O(m^n · ⌈(m−1)/(n−1)⌉) for cyclic or non-linear joins and O(mn · ⌈(m−1)/(n−1)⌉)
for linear chain joins. n generally takes small values like 4 or 8, depending on the size of the
cluster.
5.2.3 Algorithm FindRecDeepMin
In this algorithm we build the tree bottom-up, making it n-fully balanced in each iteration. In the
beginning we have m tables and we build an n-fully-balanced tree (n < m); the number of ways of doing
this is C(m, n) · n!. The number of tables then decreases by n − 1 in each iteration, and we recursively
apply the same procedure. So the total plan space for this algorithm is

    ∏_{i ≥ 0, (m − i(n−1)) > n} C(m − i(n−1), n) · n!
Using the above formulae, the following table lists the plan spaces for various values of m, for n = 4
and n = 8.
m    FindMin  FindRecMin  FindRecMin  FindRecDeepMin  FindRecDeepMin  Total bushy trees
              (n = 4)     (n = 8)     (n = 4)         (n = 8)
3    4        6           6           6               6               12
4    7        25          24          24              24              120
5    14       122         120         240             120             1680
6    22       366         720         2160            720             30240
7    35       865         5040        20160           5040            665280
8    35       1802        40321       403200          40320           17297280
9    50       3390        362882      6531840         725760          518918400
10   67       5905        1814406     101606400       10886400        17643225600
11   90       9722        6652824     3193344000      159667200       670442572800
12   101      15270       19958520    77598259200     2395008000      28158588057600
13   128      23065       51892560    1743565824000   37362124800     1.30E+15
14   158      33746       121086000   76716896256000  610248038400    6.48E+16

Table 5.1 Plan space evaluation for the algorithms on queries with number of tables from 3 to 14
Number of tables  Hive     FindMin  FindRecMin  FindRecDeepMin  FindRecMin  FindRecDeepMin
                                    (n = 4)     (n = 4)         (n = 8)     (n = 8)
3                 178      121.2    121.2       121.2           121.2       121.2
4                 318      199      199         199             199         199
5                 342.199  293      255         255             255         255
6                 504.452  396      363         305             300         300
7                 637.73   547      547         547             580         580
8                 718.808  590      590         590             602         602
9                 773.317  488      496         458             378         378
10                870.155  644      626         561             402         402
11                994.457  684      656         656             664         664
12                995.565  712      698         608             428         428

Table 5.2 Algorithms' average runtime in seconds on queries with an increasing number of joins on a 10
node cluster and a TPCH 100 GB dataset, using 4- and 8-balanced trees
5.3 Algorithms performance
In this section, we describe in detail the performance of our algorithms on queries with an increasing
number of joins on our experimental setup. For our evaluation, we ran SQL queries with the number of
joined tables ranging from 3 to 12. We ran each such query on Hive using the default left-deep
execution plans, and also with the 3 algorithms mentioned above. Execution times are tabulated in
table 5.2 and plotted in figure 5.9.

From the graph we can see that the n-balanced trees perform well compared to the default Hive
execution, especially as the number of joins increases. Since we are working on 4-balanced trees, the
performance of all the algorithms remains the same while the number of tables is at most 4. Once this
number goes up, algorithms 2 and 3 perform well, as they consider a bigger plan space than algorithm 1.
It is interesting to see the performance of the 8-balanced trees: the graph for n = 8 is relatively
ragged compared to n = 4, owing to the fact that sometimes too much parallelization is overkill in some
TPCH Query  Hive      FindMin   FindRecMin  FindRecDeepMin
q2          362.443   321.634   266.234     266.234
q3          1238.635  1069.725  1069.725    1069.725
q10         1247.157  995.087   995.087     995.087
q11         321.236   274       274         274
q16         325.129   325.129   305.966     305.966
q18         1004.821  1002.882  1002.882    1002.882
q20         994.485   832.663   622.234     622.234

Table 5.3 Algorithms' runtime in seconds on TPCH benchmark queries on a 10 node cluster with a TPCH
300 GB dataset. FindRecMin and FindRecDeepMin are tested on 4-balanced trees since most of these
queries have few join tables
cases and suits the cluster in other cases, as described in the previous chapters. This is reflected in
the performance figures too.
We ran the standard TPCH benchmark queries on the same cluster of 10 nodes with the scale 300 dataset,
and the performance figures are tabulated in table 5.3 and plotted in figure 5.8. We included only the
queries with joins and selects, since they are the only ones relevant to our current work. Also, since
all these queries have 4 or fewer joined tables, we need not test on 8-balanced trees, as they
essentially give the same output as n = 4; hence all the results in figure 5.8 correspond to n = 4.
This is the reason queries like q16 and q18 show similar performance between Hive and our algorithms:
there are only 3 tables (2 joins) and there is no search space for our algorithms to explore. We show
the execution plans for each of these queries in figures 5.1 to 5.7 for the Commercial DBMS [9],
Postgres, Hive and FindRecDeepMin query planners, and the 'EXPLAIN' command output in appendix A. Each
figure is followed by a description that explains the performance improvements of our approach compared
to the non-parallelizable left-deep plans chosen by the other optimizers, and the whole summary is
tabulated in table 5.4.
q2   Commercial DBMS: left deep ((((ps ⋈ s) ⋈ n) ⋈ r) ⋈ p)
     Postgres:        left deep ((((r ⋈ n) ⋈ s) ⋈ ps) ⋈ p)
     Hive:            left deep ((((r ⋈ n) ⋈ s) ⋈ ps) ⋈ p)
     FindRecDeepMin:  4-balanced (((r ⋈ n) ⋈ (ps ⋈ p)) ⋈ s)

q3   Commercial DBMS: left deep ((c ⋈ o) ⋈ l)
     Postgres:        left deep ((l ⋈ o) ⋈ c)
     Hive:            left deep ((c ⋈ o) ⋈ l)
     FindRecDeepMin:  left deep ((l ⋈ o) ⋈ c)

q10  Commercial DBMS: left deep (((l ⋈ o) ⋈ c) ⋈ n)
     Postgres:        left deep (((c ⋈ n) ⋈ o) ⋈ l)
     Hive:            left deep (((c ⋈ o) ⋈ n) ⋈ l)
     FindRecDeepMin:  4-fully balanced ((c ⋈ n) ⋈ (l ⋈ o))

q11  Commercial DBMS: left deep ((s ⋈ n) ⋈ p)
     Postgres:        left deep ((p ⋈ s) ⋈ n)
     Hive:            left deep ((p ⋈ s) ⋈ n)
     FindRecDeepMin:  left deep ((s ⋈ n) ⋈ p)

q16  Commercial DBMS: left deep ((p ⋈ ps) ⋈ s)
     Postgres:        left deep ((p ⋈ ps) ⋈ s)
     Hive:            left deep ((p ⋈ s) ⋈ ps)
     FindRecDeepMin:  left deep ((p ⋈ ps) ⋈ s)

q18  Commercial DBMS: left deep ((l ⋈ o) ⋈ c)
     Postgres:        left deep ((c ⋈ o) ⋈ l)
     Hive:            left deep ((c ⋈ o) ⋈ l)
     FindRecDeepMin:  left deep ((c ⋈ o) ⋈ l)

q20  Commercial DBMS: left deep (((((ps ⋈ p) ⋈ l) ⋈ p) ⋈ s) ⋈ n)
     Postgres:        left deep (((((ps ⋈ p) ⋈ l) ⋈ p) ⋈ s) ⋈ n)
     Hive:            left deep (((((ps ⋈ p) ⋈ l) ⋈ p) ⋈ s) ⋈ n)
     FindRecDeepMin:  4-balanced (((ps ⋈ p) ⋈ (ps ⋈ l)) ⋈ (s ⋈ n))

Table 5.4 Summary of query plans for the TPCH dataset
(c = customer, s = supplier, ps = partsupp, r = region, l = lineitem, o = orders, n = nation)
(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.1 Query execution plan for q2

We can notice that the execution plans chosen by the Commercial DBMS, Postgres and Hive are all left
deep, whereas the algorithm FindRecDeepMin has chosen a 4-balanced tree with more parallelization, which
can run two joins (and hence two map-reduce jobs, (region join nation) and (partsupp join part)) at the
same time, thus resulting in effective utilization of the cluster. Also, at each join node we optimize
the reducer allocation based on the statistics.
(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.2 Query execution plan for q3

In query q3, we have only 2 joins, so there are few possible join orders and no parallelization is
possible. Each of the optimizers chooses a plan based on its own statistics, and for the FindRecDeepMin
algorithm we optimally schedule the reduce tasks based on the statistics.
(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.3 Query execution plan for q10

This query shows the difference between the parallel and non-parallel plans by building a fully
balanced tree with four tables. As we can see from the figure, the Commercial DBMS, Postgres and Hive
build left-deep plans, but FindRecDeepMin's plan can be easily parallelized, with two joins running at
the same time ((customer join nation) and (lineitem join orders)). This way we can optimally utilize
the task slots, scheduling tasks as and when a slot is available and thus using the cluster resources
properly.
(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.4 Query execution plan for q11

This query is similar to query q3 in figure 5.2. It has three tables and two joins. So each of the
optimizers chooses its best plan, and the FindRecDeepMin algorithm optimally schedules the reduce tasks
based on statistics.
5.4 Cost Formulae accuracy
In this section we analyze in detail the accuracy of the cost formulae we designed for the query
optimizer. Our main aim in designing the cost formulae is to quantify the cost of each phase of a join
map-reduce job. As explained already, a real map-reduce job is quite complex under the hood; we picked
the parts of each phase that incur significant costs, included them in the cost formulae, and assumed
worst-case scenarios to get an approximate upper bound for each phase. We ran 5 different map-reduce
join jobs with increasing input sizes, measured the runtimes of the phases, and mapped them against
the results from the cost formulae. The results are in figures 5.10 and 5.11.
From graph 5.10, it is clear that the estimated runtime from the cost formulae is roughly proportional
to the actual map phase execution time, although for all inputs the estimated cost is high. This is
because HDFS exploits block locality in most cases when reading the data, whereas the formulae use a
read throughput whose measurement includes reading remote blocks too. We also take the time of the
slowest machine to complete its map wave, since the slowest machine limits the total speed of the
cluster; in reality, tasks may get assigned to faster machines, resulting in faster execution times.

Unlike the map phase, the estimates of the shuffle and reduce phases are lower than the actual
runtimes, since the actual runtime includes the JVM startup time, which is not present in the cost
formulae. It varies from machine to machine and depends on factors such as JVM reuse, caching, etc.

(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.5 Query execution plan for q16

This query is similar to queries q3 and q11 in figures 5.2 and 5.4. It has three tables and two joins.
So each of the optimizers chooses its best plan, and the FindRecDeepMin algorithm optimally schedules
the reduce tasks based on statistics.
5.5 Efficiency of scheduling algorithm
In this section we discuss the efficiency of our scheduling algorithm from chapter 4. We ran various
join queries on a TPCH scale 100 dataset with input sizes up to 100 gigabytes. We ran eighteen different
queries to demonstrate the best-case and worst-case behaviour of the algorithm.
(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.6 Query execution plan for q18

This query is similar to queries q3, q11 and q16 in figures 5.2, 5.4 and 5.5. It has three tables and
two joins. So each of the optimizers chooses its best plan, and the FindRecDeepMin algorithm optimally
schedules the reduce tasks based on statistics.

This best or worst case of the algorithm (as compared to the default scheduler) depends on the
distribution of keys across the machines in the dataset, since that is what determines the shuffling of
keys across the
machines. To demonstrate the worst/best cases, we explain the data distribution that results in such
performance and show it experimentally on a sample distribution. We ran these queries twice, once with
the default Hadoop scheduling and once with the optimal scheduling allocation in place. We measured the
data shuffled into each machine in the cluster from the other machines in both executions and plotted
the graph in figure 5.12. The X-axis denotes the size of the input tables, which reaches up to 100
gigabytes, whereas the Y-axis measures the shuffle size. The red plot corresponds to the default
scheduler in Hadoop and the blue plot to the optimized shuffle. The actual shuffle values are tabulated
in table 5.5.

From the graph and data, we can see that the optimized shuffle data size is less than that of the
default Hadoop shuffle. This is because the optimized shuffle algorithm minimizes the network IO by
assigning
keys so as to minimize the network movement of data. We can also notice that the advantage of the
algorithm grows with the input data size, because the algorithm gains flexibility to try more possible
allocations as the data is spread out.
Input size  Optimized Shuffle  Default Shuffle  Num Shuffles  Num Shuffles  H/A
(GB)        Volume, MB (A)     Volume, MB (H)   (Optimized)   (Default)
2.42 24.07 37.33 177787 275688 1.55
12.1 122.39 181.28 903841 1338694 1.48
24.2 218.43 314.43 1612961 2321897 1.43
36.3 361.81 462.07 2671741 3412088 1.27
48.4 424.99 690.52 3138280 5099067 1.62
60.5 495.46 686.55 3658709 5069751 1.38
72.6 597.86 891.49 4414868 6583109 1.49
84.7 682.94 1373.80 5043081 10144671 2.01
96.8 958.08 1476.46 7074826 10902697 1.54
Table 5.5 Shuffle data size comparison of default and optimized algorithm tested on a TPCH scale 100
dataset
This algorithm works the best when all the values corresponding to the join key from both the join
tables lie on the same system so that the algorithm allocates the reduce task on the same machine and if
this holds for all the join keys, the total shuffle IO is zero since no join key should be moved to any other
system. We ran 9 different queries demonstrating this best case working and the results are in table 5.6
and plotted in figure 5.13 (Shuffle algorithm IO bytes are coincide with the X-axis as the shuffle volume
is zero due to perfect data locality). Since hadoop does not care about data locality while assigning
reduce tasks, we can clearly see that the data is being unnecessarily shipped to other machines resulting
in shuffle IO.
Figure 5.13 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance (shuffle size in
megabytes vs. input sizes in gigabytes, for Hadoop and the shuffle algorithm)
Input size  Default Shuffle  Num Shuffles  Optimized Shuffle  Num Shuffles
(GB)        Volume (bytes)   (Default)     Volume (bytes)     (Optimized)
2.42 131917296 821508 0 0
12.1 659591856 4107576 0 0
24.2 1539049806 9584358 0 0
36.3 1978778388 12322746 0 0
48.4 3078100760 19168723 0 0
60.5 3847626734 23960909 0 0
72.6 3957557628 24645498 0 0
84.7 5386676540 33545267 0 0
96.8 4397286800 27383890 0 0
Table 5.6 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance
Figure 5.14 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance (shuffle size in
megabytes vs. input table sizes in gigabytes; the Hadoop and shuffle algorithm curves coincide)
Input size  Default Shuffle   Num Shuffles  Optimized Shuffle  Num Shuffles  H/A
(GB)        Volume, bytes (H) (Default)     Volume, bytes (A)  (Optimized)
2.42 131901300 821400 131901300 821400 1
12.1 659573700 4107450 659573700 4107450 1
24.2 1319168700 8215050 1319168700 8215050 1
36.3 1978763700 12322650 1978763700 12322650 1
48.4 2638358700 16430250 2638358700 16430250 1
60.5 3297932400 20537700 3297932400 20537700 1
72.6 3957527400 24645300 3957527400 24645300 1
84.7 4617122400 28752900 4617122400 28752900 1
96.8 5276717400 32860500 5276717400 32860500 1
Table 5.7 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance
Coming to the worst-case performance of the algorithm, it occurs when each key of the join column is
equally distributed across all the machines. In this case it does not matter which machine we ship a
key to; the overall shuffle IO is the same. This is equivalent to a perfectly uniform distribution of
rows per machine per join column value. We ran queries exhibiting this worst case, and our scheduler
performs exactly the same as Hadoop's default scheduler. The results are tabulated in table 5.7 and
plotted in figure 5.14.
(a) Commercial DBMS (b) Postgres
(c) Hive(d) FindRecDeepMin
Figure 5.7 Query execution plan for q20
This query highlights the parallelization achieved by the algorithm FindRecDeepMin compared to the other optimizers. As seen in the figure, the commercial DBMS, Postgres, and Hive all stick to the left-deep tree approach, whereas FindRecDeepMin provides multiple levels of parallelization. Instead of two parallel joins in the first step, its plan launches three join map-reduce jobs in parallel: (partsupp join part), (partsupp join lineitem), and (supplier join nation). This utilizes the cluster to the maximum, taking up slots as soon as other jobs release them. At the next level there are two join jobs in parallel (between the intermediate output tables). In addition, we schedule the tasks optimally based on the statistics, which together gave this query a substantial performance improvement over the other plans.
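The stage-level parallelism described above can be sketched as follows (a hypothetical helper, not the optimizer's actual scheduler): grouping map-reduce join stages by dependency depth shows the bushy q20-style plan finishing in three waves, where a left-deep chain of the same five joins needs five.

```python
# Sketch: independent join stages at the same dependency depth can be
# submitted to the cluster as concurrent map-reduce jobs ("waves").
from collections import defaultdict

def parallel_levels(deps):
    """deps[stage] = set of stages that must finish first.
    Returns stages grouped into waves that can run concurrently."""
    level = {}
    def depth(s):
        if s not in level:
            level[s] = 0 if not deps[s] else 1 + max(depth(p) for p in deps[s])
        return level[s]
    for s in deps:
        depth(s)
    waves = defaultdict(list)
    for s, d in level.items():
        waves[d].append(s)
    return [sorted(waves[d]) for d in sorted(waves)]

# q20-style bushy plan: three independent joins, then one, then one.
bushy = {
    "ps_part": set(), "ps_lineitem": set(), "supp_nation": set(),
    "mid_join": {"ps_part", "ps_lineitem"},
    "top_join": {"mid_join", "supp_nation"},
}
# Left-deep chain of the same number of joins.
left_deep = {f"j{i}": ({f"j{i-1}"} if i else set()) for i in range(5)}
print(len(parallel_levels(bushy)), len(parallel_levels(left_deep)))  # 3 5
```

With enough free slots, runtime is governed by the number of waves rather than the number of joins, which is the advantage the bushy plan exploits here.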
Figure 5.8 Runtime in seconds of the algorithms on TPCH benchmark queries (q2, q3, q10, q11, q16, q18, q20) on a 10-node cluster with a 300 GB TPCH dataset, comparing Hive, FindMin, FindRecMin, and FindRecDeepMin. FindRecMin and FindRecDeepMin are tested on 4-balanced trees since most of these queries join only a few tables.
Figure 5.9 Performance of the algorithms on a 100 GB dataset with synthetic queries: query runtime in seconds versus the number of tables in the join query (3 to 12), comparing Hive, FindMin, FindRecMin (n=4, n=8), and FindRecDeepMin (n=4, n=8)
Figure 5.10 Map phase cost formulae evaluation: actual versus estimated map-phase runtime as the input data size grows from 0 to 90 gigabytes
Figure 5.11 Reduce and shuffle cost formulae evaluation: actual versus estimated shuffle and reduce runtime as the input table sizes grow from 0 to 90 gigabytes
Figure 5.12 Comparison of default (Hadoop) versus optimal shuffle IO: shuffle size in megabytes as the input table sizes grow to 100 gigabytes
Chapter 6
Conclusions and Future Work
6.1 Conclusions and contributions
Database infrastructure has changed rapidly over the past few years, from high-end servers holding vast amounts of data to sets of commodity machines holding data in a distributed fashion. The map-reduce programming paradigm from Google has facilitated this transformation by providing highly scalable, fault-tolerant distributed systems of production quality. The increase in dataset sizes has posed various problems to system designers in terms of complexity and scalability. Since many companies still rely on the SQL standard for analytics, SQL has been ported to map-reduce based systems as well and has become a de-facto standard there. This poses new problems for query optimizers owing to the complexity of such systems and the scale at which they operate. As part of this work, we have built a new optimizer from scratch for joins over the map-reduce framework, based on traditional relational database style optimizations. Our contributions can be summed up as follows.
• We built a query optimizer from scratch for map-reduce based SQL systems, following traditional database-style optimizations using statistics and cost formulae.
• We designed a distributed statistics store for data in HDFS and built cost formulae for the join operators in Hive.
• We explored a new subset of the bushy plan space, called n-balanced trees, based on the fact that their inherent parallelization suits the map-reduce framework.
• We formulated the shuffle in a map-reduce job as a max-flow min-cut problem and found the optimal assignments to reduce network IO.
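As a toy illustration of the shuffle formulation (hypothetical code, not the thesis solver, and reducer capacity constraints are ignored here), note that without capacities the flow problem degenerates to gathering each join key on the machine that already holds most of its rows:

```python
# Toy version of the shuffle-assignment idea. Capacity constraints,
# which motivate the full max-flow/min-cut formulation, are ignored:
# absent capacities, minimizing shipped rows means sending each key
# to the machine that already holds most of its data.

def assign_keys(rows_per_machine):
    """rows_per_machine[key][machine] -> rows of that key on that machine."""
    return {
        key: max(counts, key=counts.get)  # machine with most local data
        for key, counts in rows_per_machine.items()
    }

def shipped(rows_per_machine, assignment):
    """Total rows moved over the network for a given assignment."""
    return sum(
        c
        for key, counts in rows_per_machine.items()
        for m, c in counts.items()
        if m != assignment[key]
    )

skewed = {"k1": {"m0": 900, "m1": 50}, "k2": {"m0": 10, "m1": 400}}
opt = assign_keys(skewed)
default = {"k1": "m0", "k2": "m0"}  # e.g. a hash-partitioned default
print(shipped(skewed, default), shipped(skewed, opt))  # 450 60
```

The full formulation adds per-reducer capacities, which is what turns this greedy choice into a flow problem.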
Chapter 3 discusses the plan space of n-balanced trees and their applicability to the map-reduce framework based on statistics. The results show that accurate statistics are very useful for predicting the sizes of intermediate results and help us pick the best plan from the search space of n-balanced trees. The results also show the benefits of using n-balanced trees on a massively parallel framework like map-reduce compared to the traditional left-deep join trees used in Hive. One should, however, be careful while choosing the value of n, as too much parallelization can be an overkill, with every job waiting for the others to complete. In addition, our optimal shuffle max-flow min-cut formulation reduces the network IO from other nodes and improves query performance. The ideas presented in this work can be used in any standard map-reduce based SQL system with a few changes; as a proof of concept we modified Hive for our experimental analysis, and a similar approach is possible for other systems.
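The intuition behind n-balanced trees can be sketched as follows (hypothetical helper functions, not the thesis implementation): splitting the join tables into at most n groups per level shrinks the number of sequential join levels from len(tables)-1 for a left-deep plan to roughly log base n of the table count.

```python
# Sketch: build an n-balanced grouping of join tables. Sibling groups
# at each level are independent and can be joined as concurrent
# map-reduce jobs; only the tree depth is sequential.
import math

def n_balanced(tables, n):
    """Recursively split tables into at most n groups per level."""
    if len(tables) <= n:
        return tables
    size = math.ceil(len(tables) / n)
    groups = [tables[i:i + size] for i in range(0, len(tables), size)]
    return [n_balanced(g, n) for g in groups]

def depth(tree):
    """Number of sequential join levels in the grouped plan."""
    if isinstance(tree, str) or all(isinstance(t, str) for t in tree):
        return 1
    return 1 + max(depth(t) for t in tree)

tables = [f"t{i}" for i in range(8)]
print(depth(n_balanced(tables, 4)))  # 2 join levels instead of 7 left-deep
```

The n=4 value mirrors the 4-balanced trees used in the experiments; larger n gives wider but shallower plans, which is exactly the over-parallelization trade-off noted above.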
6.2 Future work
In this section we discuss possible directions to extend this work.
• This work focuses on the most basic problem of query optimization, joins with selection and projection operators, since a join is considered costly relative to other operators. The work can be extended to other SQL operators such as aggregation, group-by, and subqueries; for that we need to design appropriate cost formulae for each of them and follow similar techniques from the relational world.
• Another important direction is to improve the cost formulae for join operators. Currently we obtain an upper-bound cost based on statistics. Since a map-reduce system is very complex in terms of design, we need to take into account many factors, ranging from cache and buffer sizes to network speeds, and also track all the round trips from disk to memory in each phase of a job. Taking every such factor into account and designing accurate cost formulae for each operator would enhance the query optimizer.
• A map-reduce job is designed to be fault tolerant and can sustain task failures. The current optimizer does not take failures into account and calculates operator costs without considering such scenarios. However, failures can increase the cost of each SQL operator, and including them in the optimizer might give more accurate costs and thus better plans. To model failure scenarios we need to consider factors like the load on each machine (number of mappers and reducers), scheduler properties such as preemption, and the mean time to failure (MTTF) of nodes.
• The performance of a map-reduce cluster depends on how well the scheduler works. Many schedulers are available today that decide whether to launch or kill a task or job based on factors like priority, load, and fairness. Since the cost of each operator depends on how the scheduler behaves, we need to take such scheduler-specific properties into consideration when building the optimizer.
• Block placement in a distributed system plays an important role in the cost of processing. Placing blocks closer to the code reduces network IO and increases query performance. The same principle applies to HDFS: since data is split into blocks, we can design an optimal block allocation for a set of queries that reduces the overall cost of processing them. The problem then translates to finding an optimal HDFS block allocation given a set of queries and their processing costs.
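As a rough illustration of this last direction (all names are hypothetical, and a real allocator must also handle replication, rebalancing, and full nodes), even a greedy placement that puts each block on the node whose tasks read it most captures the flavor of the optimization problem:

```python
# Hypothetical sketch: given per-node expected read frequencies for each
# HDFS block (derived from a query workload), place blocks so that more
# reads stay node-local. Assumes total capacity suffices for all blocks.

def place_blocks(access_freq, capacity):
    """access_freq[block][node] -> expected reads of block by tasks on node.
    capacity[node] -> how many blocks the node may hold.
    Greedily place each block on its hottest node that still has room."""
    load = {n: 0 for n in capacity}
    placement = {}
    # Place the most-read blocks first so they get their preferred node.
    for block in sorted(access_freq, key=lambda b: -sum(access_freq[b].values())):
        candidates = sorted(access_freq[block], key=access_freq[block].get,
                            reverse=True)
        node = next(n for n in candidates if load[n] < capacity[n])
        placement[block] = node
        load[node] += 1
    return placement

freq = {"b1": {"n1": 9, "n2": 1}, "b2": {"n1": 8, "n2": 2},
        "b3": {"n1": 1, "n2": 5}}
print(place_blocks(freq, {"n1": 2, "n2": 2}))
```

Turning this greedy heuristic into an optimal allocation for a given query set is the open problem proposed above.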
Appendix A
Query execution plans for TPCH queries
A.1 q2
A.1.1 Postgres
Nested Loop (cost=36.91..61.34 rows=1 width=730)
Join Filter: (n.n_nationkey = s.s_nationkey)
-> Hash Join (cost=12.14..24.48 rows=1 width=108)
Hash Cond: (n.n_regionkey = r.r_regionkey)
-> Seq Scan on nation n (cost=0.00..11.70 rows=170 width=112)
-> Hash (cost=12.12..12.12 rows=1 width=4)
-> Seq Scan on region r (cost=0.00..12.12 rows=1 width=4)
Filter: (r_name = ’EUROPE’::bpchar)
-> Hash Join (cost=24.77..36.84 rows=1 width=630)
Hash Cond: (s.s_suppkey = ps.ps_suppkey)
-> Seq Scan on supplier s (cost=0.00..11.50 rows=150 width=510)
-> Hash (cost=24.76..24.76 rows=1 width=128)
-> Hash Join (cost=12.41..24.76 rows=1 width=128)
Hash Cond: (ps.ps_partkey = p.p_partkey)
-> Seq Scan on partsupp ps (cost=0.00..11.70 rows=170 width=24)
-> Hash (cost=12.40..12.40 rows=1 width=108)
-> Seq Scan on part p (cost=0.00..12.40 rows=1 width=108)
Filter: (((p_type)::text ~~ ’\%BRASS’::text) AND (p_size = 15))
A.1.2 Hive
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-1 depends on stages: Stage-4
Stage-2 depends on stages: Stage-1
Stage-3 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-4
Map Reduce
Alias -> Map Operator Tree:
nation
TableScan alias: n
Reduce Output Operator
region
TableScan alias: r
Filter Operator (r_name = ’EUROPE’)
Reduce Operator Tree: Join Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
supplier
TableScan alias: s
Reduce Output Operator
Reduce Operator Tree: Join Operator
condition map:
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
partsupp
TableScan alias: ps
Reduce Output Operator
Stage: Stage-3
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
part
TableScan alias: p
Filter Operator
predicate: expr: ((p_size = 15) and (p_type like ’\%BRASS’))
Reduce Output Operator
Reduce Operator Tree:
Join Operator, File Output Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.1.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 is a root stage
Stage-3 depends on stages - Stage-1,Stage-2
Stage-4 depends on stages - Stage-3
Stage-0 is a root stage
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
nation
TableScan alias: n
Reduce Output Operator
region
TableScan alias: r
Filter Operator (r_name = ’EUROPE’)
Reduce Operator Tree: Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
partsupp
TableScan alias: ps
Reduce Output Operator
part
TableScan alias: p
Filter Operator predicate: expr: ((p_size = 15) and (p_type like ’\%BRASS’))
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-3
Map Reduce
Alias -> Map Operator Tree:
Stage-1-out
TableScan alias: s1
Stage-2-out
TableScan alias: s2
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-4
Map Reduce
Alias -> Map Operator Tree:
Stage-3-out
TableScan alias: s3
supplier
TableScan alias: s
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.2 q3
A.2.1 Postgres
Hash Join (cost=24.67..37.43 rows=1 width=12)
Hash Cond: (l.l\_orderkey = o.o\_orderkey)
-> Seq Scan on lineitem l (cost=0.00..12.50 rows=67 width=4)
Filter: (l_shipdate > ’1995-03-15’::date)
-> Hash (cost=24.66..24.66 rows=1 width=12)
-> Hash Join (cost=11.76..24.66 rows=1 width=12)
Hash Cond: (o.o\_custkey = c.c\_custkey)
-> Seq Scan on orders o (cost=0.00..12.62 rows=70 width=16)
Filter: (o\_orderdate < ’1995-03-15’::date)
-> Hash (cost=11.75..11.75 rows=1 width=4)
-> Seq Scan on customer c (cost=0.00..11.75 rows=1 width=4)
Filter: (c\_mktsegment = ’BUILDING’::bpchar)
A.2.2 Hive
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
customer
TableScan alias: c
Filter Operator (c\_mktsegment = ’BUILDING’)
Reduce Output Operator
orders
TableScan alias: o
Filter Operator (o\_orderdate < ’1995-03-15’)
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
lineitem
TableScan alias: l
Filter Operator (l\_shipdate > ’1995-03-15’)
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.2.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
lineitem
TableScan alias: l
Filter Operator (l\_shipdate > ’1995-03-15’)
Reduce Output Operator
orders
TableScan alias: o
Filter Operator (o\_orderdate < ’1995-03-15’)
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
customer
TableScan alias: customer
Filter Operator (c\_mktsegment = ’BUILDING’)
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.3 q10
A.3.1 Postgres
Nested Loop (cost=25.11..49.97 rows=1 width=606)
Join Filter: (o.o\_orderkey = l.l\_orderkey)
-> Hash Join (cost=25.11..37.46 rows=1 width=610)
Hash Cond: (n.n\_nationkey = c.c\_nationkey)
-> Seq Scan on nation n (cost=0.00..11.70 rows=170 width=108)
-> Hash (cost=25.10..25.10 rows=1 width=510)
-> Hash Join (cost=13.16..25.10 rows=1 width=510)
Hash Cond: (c.c\_custkey = o.o\_custkey)
-> Seq Scan on customer c (cost=0.00..11.40 rows=140 width=506)
-> Hash (cost=13.15..13.15 rows=1 width=8)
-> Seq Scan on orders o (cost=0.00..13.15 rows=1 width=8)
Filter: ((o\_orderdate >= ’1993-10-01’::date) AND (o\_orderdate < ’1994-01-01’::date))
-> Seq Scan on lineitem l (cost=0.00..12.50 rows=1 width=4)
Filter: (l.l\_returnflag = ’R’::bpchar)
A.3.2 Hive
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-3 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
customer
TableScan alias: c
Reduce Output Operator
orders
TableScan alias: o
Filter Operator ((o\_orderdate >= ’1993-10-01’) and (o\_orderdate <
’1994-01-01’))
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
nation
TableScan alias: n
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-3
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
lineitem
TableScan alias: l
Filter Operator (l\_returnflag = ’R’)
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.3.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-3 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
customer
TableScan alias: c
Reduce Output Operator
nation
TableScan alias: n
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
lineitem
TableScan alias: l
Filter Operator (l\_returnflag = ’R’)
Reduce Output Operator
orders
TableScan alias: o
Filter Operator ((o\_orderdate >= ’1993-10-01’) and (o\_orderdate < ’1994-01-01’))
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-3
Map Reduce
Alias -> Map Operator Tree:
Stage-1-out
TableScan alias: s1
Stage-2-out
TableScan alias: s2
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.4 q11
A.4.1 Postgres
Hash Join (cost=24.22..36.57 rows=1 width=20)
Hash Cond: (ps.ps\_suppkey = s.s\_suppkey)
-> Seq Scan on partsupp ps (cost=0.00..11.70 rows=170 width=24)
-> Hash (cost=24.21..24.21 rows=1 width=4)
-> Hash Join (cost=12.14..24.21 rows=1 width=4)
Hash Cond: (s.s\_nationkey = n.n\_nationkey)
-> Seq Scan on supplier s (cost=0.00..11.50 rows=150 width=8)
-> Hash (cost=12.12..12.12 rows=1 width=4)
-> Seq Scan on nation n (cost=0.00..12.12 rows=1 width=4)
Filter: (n\_name = ’GERMANY’::bpchar)
A.4.2 Hive
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
partsupp
TableScan alias: ps
Reduce Output Operator
supplier
TableScan alias: s
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
nation
TableScan alias: n
Filter Operator (n\_name = ’GERMANY’)
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.4.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
supplier
TableScan alias: s
Reduce Output Operator
nation
TableScan alias: n
Filter Operator (n\_name = ’GERMANY’)
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
partsupp
TableScan alias: ps
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.5 q16
A.5.1 Postgres
Hash Join (cost=27.57..44.13 rows=139 width=120)
Hash Cond: (ps.ps\_suppkey = s.s\_suppkey)
-> Hash Join (cost=13.82..28.40 rows=158 width=120)
Hash Cond: (p.p\_partkey = ps.ps\_partkey)
-> Seq Scan on part p (cost=0.00..12.40 rows=158 width=120)
Filter: ((p\_brand <> ’Brand45’::bpchar) AND ((p\_type)::text !~~ ’MEDIUM POLISHED%’::text))
-> Hash (cost=11.70..11.70 rows=170 width=8)
-> Seq Scan on partsupp ps (cost=0.00..11.70 rows=170 width=8)
-> Hash (cost=11.88..11.88 rows=150 width=4)
-> Seq Scan on supplier s (cost=0.00..11.88 rows=150 width=4)
Filter: ((s\_comment)::text !˜˜ ’\%Customer%Complaints\%’::text)
A.5.2 Hive
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
part
TableScan alias: p
Filter Operator ((p\_brand <> ’Brand45’) and (not (p\_type like ’MEDIUM POLISHED\%’)))
Reduce Output Operator
supplier
TableScan
alias: s
Filter Operator (not (s\_comment like ’\%Customer\%Complaints\%’))
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
partsupp
TableScan
alias: ps
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.5.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
partsupp
TableScan alias: ps
Reduce Output Operator
part
TableScan alias: p
Filter Operator ((p\_brand <> ’Brand45’) and (not (p\_type like ’MEDIUM POLISHED\%’)))
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
supplier
TableScan alias: s
Filter Operator (not (s\_comment like ’\%Customer\%Complaints\%’))
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.6 q18
A.6.1 Postgres
Hash Join (cost=43.02..57.26 rows=49 width=96)
Hash Cond: (l.l\_orderkey = o.o\_orderkey)
-> Seq Scan on lineitem l (cost=0.00..12.00 rows=200 width=4)
-> Hash (cost=42.40..42.40 rows=49 width=100)
-> Hash Join (cost=26.49..42.40 rows=49 width=100)
Hash Cond: (o.o\_custkey = c.c\_custkey)
-> Hash Join (cost=13.34..28.50 rows=70 width=32)
Hash Cond: (o.o\_orderkey = t.l\_orderkey)
-> Seq Scan on orders o (cost=0.00..12.10 rows=210 width=28)
-> Hash (cost=12.50..12.50 rows=67 width=4)
-> Seq Scan on lineitem t (cost=0.00..12.50 rows=67
width=4)
Filter: (l\_quantity > 300::numeric)
-> Hash (cost=11.40..11.40 rows=140 width=72)
-> Seq Scan on customer c (cost=0.00..11.40 rows=140 width
=72)
A.6.2 Hive
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
customer
TableScan
alias: c
Reduce Output Operator
orders
TableScan
alias: o
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
l
TableScan
alias: l
Reduce Output Operator
key expressions:
expr: l\_orderkey
type: int
sort order: +
Map-reduce partition columns:
expr: l\_orderkey
type: int
tag: 2
t
TableScan
alias: t
Filter Operator (l\_quantity > 300.0)
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.6.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
customer
TableScan
alias: c
Reduce Output Operator
orders
TableScan
alias: o
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
l
TableScan
alias: l
Reduce Output Operator
key expressions:
expr: l\_orderkey
type: int
sort order: +
Map-reduce partition columns:
expr: l\_orderkey
type: int
tag: 2
t
TableScan
alias: t
Filter Operator (l\_quantity > 300.0)
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
Bibliography
[1] Apache Hadoop. http://hadoop.apache.org/.
[2] Apache Hive. http://hive.apache.org/.
[3] Apache PoweredBy. http://wiki.apache.org/hadoop/PoweredBy.
[4] Cern data scale. http://home.web.cern.ch/about/updates/2013/04/
animation-shows-lhc-data-processing.
[5] Data Scale. http://www.comparebusinessproducts.com/fyi/
10-largest-databases-in-the-world.
[6] Facebook scale. https://www.facebook.com/notes/paul-yang/
moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/
10150246275318920.
[7] HDFS architecture. http://hadoop.apache.org/docs/stable1/hdfs_design.html.
[8] Map Reduce architecture diagram . http://biomedicaloptics.spiedigitallibrary.org/
article.aspx?articleid=1167145.
[9] SQL server tpch plans.
http://researchweb.iiit.ac.in/˜bharath.v/sql_server_plans.pdf.
[10] Towards Join Order Templates based Query Optimization: An Empirical Evaluation. http://web2py.
iiit.ac.in/research_centres/publications/view_publication/phdthesis/12.
[11] Yahoo scale. http://developer.yahoo.com/blogs/hadoop/
scaling-hadoop-4000-nodes-yahoo-410.html.
[12] F. Afrati, A. Sarma, D. Menestrina, A. Parameswaran, and J. Ullman. Fuzzy joins using mapreduce. In
Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages 498–509, 2012.
[13] F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In Proceedings of the 13th
International Conference on Extending Database Technology, EDBT ’10, pages 99–110, New York, NY,
USA, 2010. ACM.
[14] S. Chaudhuri. An overview of query optimization in relational systems. In Proceedings of the Seventeenth
ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’98, pages 34–43,
New York, NY, USA, 1998. ACM.
[15] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM,
51(1):107–113, Jan. 2008.
[16] S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. SIGOPS Oper. Syst. Rev.,
37(5):29–43, Oct. 2003.
[17] G. Graefe. Volcano— an extensible and parallel query evaluation system. IEEE Trans. on Knowl. and
Data Eng., 6(1):120–135, Feb. 1994.
[18] G. Graefe and W. J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In
Proceedings of the Ninth International Conference on Data Engineering, pages 209–218, Washington, DC,
USA, 1993. IEEE Computer Society.
[19] G. Graefe and W. J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In
Proceedings of the Ninth International Conference on Data Engineering, pages 209–218, Washington, DC,
USA, 1993. IEEE Computer Society.
[20] L. M. Haas, M. J. Carey, M. Livny, and A. Shukla. Seeking the truth about ad hoc join costs. The VLDB
Journal, 6(3):241–256, 1997.
[21] E. P. Harris and K. Ramamohanarao. Join algorithm costs revisited. The VLDB Journal, 5(1):064–084,
Jan. 1996.
[22] T. Ibaraki and T. Kameda. On the optimal nesting order for computing n-relational joins. ACM Trans.
Database Syst., 9(3):482–502, Sept. 1984.
[23] Y. Ioannidis. The history of histograms (abridged). In Proceedings of the 29th International Conference on
Very Large Data Bases - Volume 29, VLDB ’03, pages 19–30. VLDB Endowment, 2003.
[24] Y. E. Ioannidis. Query optimization. ACM Comput. Surv., 28(1):121–123, Mar. 1996.
[25] Y. E. Ioannidis and Y. C. Kang. Left-deep vs. bushy trees: An analysis of strategy spaces and its
implications for query optimization. SIGMOD Rec., 20(2):168–177, Apr. 1991.
[26] K. Karlapalem and N. M. Pun. Query driven data allocation algorithms for distributed database systems. In
in 8th International Conference on Database and Expert Systems Applications (DEXA’97), Toulouse,
Lecture Notes in Computer Science 1308, pages 347–356, 1997.
[27] R. S. G. Lanzelotte, P. Valduriez, and M. Zaı̈t. On the effectiveness of optimization search strategies for
parallel execution spaces. In Proceedings of the 19th International Conference on Very Large Data Bases,
VLDB ’93, pages 493–504, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.
[28] R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. Ysmart: Yet another sql-to-mapreduce translator.
In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, pages 25–36, 2011.
[29] L. Lin, V. Lychagina, W. Liu, Y. Kwon, S. Mittal, and M. Wong. Tenzing a sql implementation on the
mapreduce framework.
[30] L. F. Mackert and G. M. Lohman. R* optimizer validation and performance evaluation for distributed
queries. In Proceedings of the 12th International Conference on Very Large Data Bases, VLDB ’86, pages
149–159, San Francisco, CA, USA, 1986. Morgan Kaufmann Publishers Inc.
[31] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. SIGMOD Rec.,
27(2):448–459, June 1998.
[32] A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In Proceedings of the 2011 ACM
SIGMOD International Conference on Management of Data, SIGMOD ’11, pages 949–960, New York,
NY, USA, 2011. ACM.
[33] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for
data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of
Data, SIGMOD ’08, pages 1099–1110, New York, NY, USA, 2008. ACM.
[34] J. B. Orlin. A faster strongly polynomial minimum cost flow algorithm. In OPERATIONS RESEARCH,
pages 377–387, 1988.
[35] V. Poosala. Histogram-based Estimation Techniques in Database Systems. PhD thesis, Madison, WI,
USA, 1997. UMI Order No. GAX97-16074.
[36] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, Inc., New York, NY,
USA, 3 edition, 2003.
[37] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a
relational database management system. In Proceedings of the 1979 ACM SIGMOD International
Conference on Management of Data, SIGMOD ’79, pages 23–34, New York, NY, USA, 1979. ACM.
[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In Mass Storage
Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10, 2010.
[39] M. Steinbrunn, G. Moerkotte, and A. Kemper. Heuristic and randomized optimization for the join ordering
problem. The VLDB Journal, 6(3):191–208, Aug. 1997.
[40] A. Swami. Optimization of large join queries: Combining heuristics and combinatorial techniques.
SIGMOD Rec., 18(2):367–376, June 1989.
[41] A. Swami and B. Iyer. A polynomial time algorithm for optimizing join queries. In Data Engineering,
1993. Proceedings. Ninth International Conference on, pages 345–354, 1993.
[42] A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a
petabyte scale data warehouse using hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International
Conference on, pages 996–1005, 2010.
[43] R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In Proceedings
of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pages
495–506, New York, NY, USA, 2010. ACM.