Optimizing SQL Query Execution over Map-Reduce
Thesis submitted in partial fulfillment
of the requirements for the degree of
MS by Research
in
Computer Science
by
Bharath Vissapragada
200702012
bharat [email protected]
Center for Data Engineering
International Institute of Information Technology
Hyderabad - 500 032, INDIA
September 2014
Copyright © Bharath Vissapragada, 2013
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Optimizing SQL Query Execution over Map-
Reduce” by Bharath Vissapragada, has been carried out under my supervision and is not submitted
elsewhere for a degree.
Date Adviser: Prof. Kamalakar Karlapalem
To my late uncle, Ravi Sanker Ganti
Acknowledgments
Firstly, I would like to thank my dad, mom, sister and grandparents for believing in me and letting me pursue my interests. I would never have completed my thesis without the support of my advisers Kamal sir and Satya sir. They were always open for discussions and I am really lucky to have their support. I would like to thank my closest pals Abilash, Chaitanya, Phani, Ravali, Ronanki and Vignesh for their constant support, especially when I was let down by something. I really miss my late uncle Ravi Sanker Ganti, who was responsible for what I am today. Thanks to the Almighty for blessing me with good luck and mental peace.
Abstract
Query optimization in relational database systems is a topic that has been studied in depth, in both standalone and distributed settings. Modern optimizers have become complex, focusing on improving the quality of optimization and reducing query execution times. They employ a wide range of heuristics and consider a set of possible plans, called the plan space, to find the best plan to execute. However, with the advent of big data, more and more organizations are moving towards map-reduce based processing systems for managing their large databases, since these systems outperform traditional techniques for processing very large amounts of data while still running on commodity hardware, thus reducing maintenance costs.
In this thesis, we describe the design and implementation of a query optimizer tailor-made for efficient execution of SQL queries over the map-reduce framework. We rely on traditional relational database query optimization principles and extend them to address this problem. Our major contributions can be summarized as follows:
1. We propose a statistics-based approach for optimizing SQL queries on top of map-reduce.
2. We design cost formulae to predict the run time of joins before executing the query.
3. We extend the traditional plan space to include the bushy plan space, which leverages the massively parallel architecture of map-reduce systems. We design three algorithms to explore this plan space and use our cost formulae to select the plan with the least execution cost.
4. We develop a task scheduler for the map-reduce shuffle, based on the max-flow min-cut algorithm, that minimizes the overall network IO during joins. It uses the statistics collected from the data to formulate the task assignment problem as a max-flow min-cut problem on a flow graph over the nodes; solving this problem yields the minimal overall IO in the shuffle phase.
5. We show the performance enhancements from the above features using TPCH workloads of scales 100 and 300, on both TPCH benchmark queries and synthetic queries.
Our experiments show improvements of up to 2x in query execution time and up to 33% reduction in shuffle network IO during map-reduce jobs.
Contents
Chapter Page
1 Introduction and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Map-Reduce and Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Hadoop Distributed File System . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 HDFS Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Replication factor and replica placement . . . . . . . . . . . . . . . . . . . . . 4
1.1.4 Data reads, writes and deletes . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5 Map-Reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Hive Anatomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Joins in Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Query Optimization in Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Problem statement and scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Organization of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Query Plan Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Overview of Query Planspace - An Example . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Exploring Bushy Planspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Finding minimum cost n-fully balanced tree - FindMin . . . . . . . . . . . . . 20
3.2.2 Finding a minimum cost n-balanced tree recursively - FindRecMin . . . . . . 21
3.2.3 Finding a minimum cost n-balanced tree exhaustively - FindRecDeepMin . . . 21
3.3 Choosing the value of n for an n-balanced tree . . . . . . . . . . . . . . . . . . . . . 23
4 Cost Based Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 Distributed Statistics Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Cost Formulae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.1 Join map phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.2 Join shuffle phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.3 Join Reducer phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Scheduling - An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Scheduling strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Shuffle algorithm - Proof of minimality . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.6 Shuffle algorithm - A working example . . . . . . . . . . . . . . . . . . . . . . . . . 34
5 Experimental Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 Plan space evaluation and Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2.1 Algorithm FindMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2.2 Algorithm FindRecMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2.3 Algorithm FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 Algorithms performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.4 Cost Formulae accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 Efficiency of scheduling algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1 Conclusions and contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Appendix A: Query execution plans for TPCH queries . . . . . . . . . . . . . . . . . . . . . 56
A.1 q2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
A.1.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
A.1.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.1.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
A.2 q3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.2.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.2.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
A.2.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.3 q10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.3.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.3.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.3.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.4 q11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.4.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.4.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.4.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
A.5 q16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.5.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.5.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.5.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.6 q18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.6.1 Postgres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.6.2 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.6.3 FindRecDeepMin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
List of Figures
Figure Page
1.1 Hive query - Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Pig script - Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 HDFS architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Map Reduce Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Hive Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Block diagram of a query optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Possible join orderings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Time line for the execution of left deep QEP . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Time line for the execution of bushy Query plan . . . . . . . . . . . . . . . . . . . . . 18
3.4 An example of 4-fullybalanced tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 An example of 4-balanced tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.6 8-balanced tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.7 4-balanced tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.8 Running Algorithm1 on 9 tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.9 Running Algorithm2 on 9 tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1 Statistics store architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Query scheduling example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Flow network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Query execution plan for q2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Query execution plan for q3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Query execution plan for q10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.4 Query execution plan for q11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 Query execution plan for q16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.6 Query execution plan for q18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.13 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance . . . . . . . . 47
5.14 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance . . . . . . . 48
5.7 Query execution plan for q20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.8 Algorithms runtime in seconds on tpch benchmark queries on a 10 node cluster on a
TPCH 300 GB dataset. FindRecMin and FindRecDeepMin are tested on 4-balanced
trees since most of these queries have less number of join tables . . . . . . . . . . . . 50
5.9 Algorithms performance evaluation on 100GB dataset and synthetic queries . . . . . . 50
5.10 Map phase cost formulae evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.11 Reduce and shuffle cost formulae evaluation . . . . . . . . . . . . . . . . . . . . . . . 51
5.12 Comparison of default vs optimal shuffle IO . . . . . . . . . . . . . . . . . . . . . . . 52
List of Tables
Table Page
4.1 Notations for cost formulae - Map phase . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Notations for cost formulae - Shuffle phase . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Key to Node assignment with optimized scheduler . . . . . . . . . . . . . . . . . . . . 35
4.4 Key to Node assignment with default scheduler . . . . . . . . . . . . . . . . . . . . . 35
5.1 Plan space evaluation for the algorithms on queries with number of tables from 3 to 14 38
5.2 Algorithms average runtime in seconds on queries with increasing number of joins on a
10 node cluster and a TPCH 100 GB dataset using 4 and 8-balanced trees . . . . . . . 38
5.3 Algorithms runtime in seconds on tpch benchmark queries on a 10 node cluster on a
TPCH 300 GB dataset. FindRecMin and FindRecDeepMin are tested on 4-balanced
trees since most of these queries have less number of join tables . . . . . . . . . . . . 39
5.4 Summary of query plans for TPCH Dataset . . . . . . . . . . . . . . . . . . . . . . . 39
5.5 Shuffle data size comparison of default and optimized algorithm tested on a TPCH scale
100 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.6 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance . . . . . . . . 47
5.7 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance . . . . . . . 48
Chapter 1
Introduction and Background
In the internet age, data is wealth. Most organizations rely on their data warehouses for analytics, based on which management takes strategic decisions that are important to the organization's growth. A decade after the internet bubble, the amount of data these organizations possess has scaled up to tens of petabytes, which cannot be handled by centralized servers. Owing to these needs, data warehouse infrastructure has rapidly changed over the past few years from high-end servers holding vast amounts of data to sets of commodity machines holding data in a distributed fashion. The map-reduce programming paradigm [15, 1] from Google has facilitated this transformation by providing highly scalable and fault-tolerant distributed systems for production-quality application software.
In the internet age, big data has become a buzzword. Every company, ranging from small startups to internet giants like Google and Facebook, manages data of unbelievable scale [5]. The data sources are mainly web crawlers, user forms on websites, data uploaded to social networking sites, webserver logs, etc. For example, Facebook, the largest photo-sharing site today, holds about 10 billion photos, growing at the rate of 2-3 TB per day [6]. This holds true in other areas of science as well. The Large Hadron Collider (LHC) produces data of enormous size from its detectors, and this data is stored in a grid consisting of 200,000 processing cores and 150 PB of disk space in a distributed network [4]. Yahoo has built a Hadoop cluster of 4000 nodes [11] with 16 PB of storage to perform its daily data-crunching tasks, and these numbers clearly show the power of map-reduce programming. These huge data sizes and distributed systems create a whole new set of challenges for managing the data and performing large computations such as data mining and machine learning tasks. Most firms spend a lot of money on managing and extracting the important information present in this data. So, performing these large-scale computations efficiently is very important, and even the slightest improvement in these processing techniques will save a lot of money and time.
Processing huge datasets is a complex task, as it cannot be done on a single machine, and distributed implementations always pose a variety of problems in terms of synchronization, fault tolerance and reliability. Fortunately, with the introduction of Google's Map-Reduce programming paradigm [16, 15], the process has become fairly simple: the user needs only to think that he is programming for a single machine, and the framework takes care of distributing the work across the cluster while providing the additional features of fault tolerance and reliability. Hadoop [1] is an open-source implementation of Google's map-reduce model that has been widely accepted in academia and industry [3] over the past few years for its ability to process large amounts of data using commodity hardware while hiding the inner details of parallelism from the end users. Hadoop is widely used in production for building search indexes, crunching web server logs, recommender systems and a variety of other tasks that require huge processing capability on data of large scale. A number of packages have been developed on top of the Hadoop infrastructure to provide SQL or similar interfaces, so that users can perform analytics on the data using traditional SQL queries and retrieve results. Such efforts include Pig Latin and Hive [42, 33]. All these packages rely on the same basic principles for converting a SQL-like query into map-reduce jobs, but there are minor differences in the way they work. For example, Hive takes SQL-like input from the user and converts the query into a directed acyclic graph (DAG) of map-reduce jobs, whereas Pig is a scripting language: the user supplies the entire execution plan as a script, but the goal is still to convert the user's tasks into a set of map-reduce jobs. An example showing the difference in joining 3 tables in Hive and Pig is shown in figures 1.1 and 1.2.
select * from A join B on A.a = B.b join C on B.b = C.c;
Figure 1.1 Hive query - Example
temp = join A by a, B by b;
result = join temp by B::b, C by c;
dump result;
Figure 1.2 Pig script - Example
1.1 Map-Reduce and Hadoop
In this section we describe map-reduce in detail, in terms of its open-source release Hadoop and the file system it relies on, the Hadoop Distributed File System (HDFS). HDFS [38] is very similar to the Google File System, the storage base for the Map-Reduce paradigm described in the Google paper. Map-Reduce is a parallel programming paradigm that relies on a basic principle: “moving computation is cheaper than moving data”.
1.1.1 Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a highly fault-tolerant distributed file system designed to provide high-throughput access for applications with large data sets, while still running on cheap commodity hardware. It is designed to handle hardware failures well by replicating data across machines in a distributed fashion. HDFS does not follow the full POSIX requirements and relaxes a few of them to enable streaming access to file system data. HDFS has been built with the following goals [7] (figure 1.3).
• Hardware Failure Since HDFS runs on a large set of commodity machines, there is a high probability that a subset of them fails at any moment. HDFS is designed to overcome this by intelligently placing replicas of the data, and it restores the replica count in case of data loss by copying data to other machines in the cluster.
• Streaming Data Access Since HDFS has been designed for applications requiring high through-
put, a few of the unnecessary POSIX requirements have been relaxed to increase efficiency.
• Large amounts of Data HDFS is tailor-made for holding very large amounts of data, ranging from tens of terabytes to a few petabytes. It also scales linearly with the amount of data simply by adding new machines on the fly, while still providing fault tolerance and fast access. This makes it a de-facto choice for big-data needs.
• Simple Coherency Model HDFS applications need a write-once-read-many access model for
files. A file once created, written, and closed need not be changed. This assumption simplifies
data coherency issues and enables high throughput data access. A map-reduce application or a
web crawler application fits perfectly with this model. There is a plan to support appending-writes
to files in the future.
• Moving Computation is Cheaper than Moving Data Since the data we are dealing with is huge, it is better to move the code to the location where the data resides instead of moving data across machines. HDFS supports this efficiently by exposing the locations of data blocks and providing support for moving code executables.
• Portability Across Heterogeneous Hardware and Software Platforms HDFS supports a very diverse set of platforms and is written in Java, so users can include a wide variety of platforms in their cluster as long as the basic generic requirements are met.
1.1.2 HDFS Architecture
HDFS has a master/slave architecture. An HDFS cluster consists of a master node (the namenode), responsible for maintaining the filesystem namespace, and a set of worker nodes called datanodes, where the actual data is stored. The namenode stores the filesystem namespace, exposes the data to the end user as a file and directory hierarchy, and provides a set of basic utilities to add, modify and delete data. It also exposes a Java API for all these features, and bindings for other languages are in widespread use. All instructions are passed via the namenode to the datanodes, which execute them in an orderly fashion. All the datanodes report their block information to the master and send a constant heartbeat to notify it of their health. If the namenode does not receive heartbeats from a datanode for some time, that datanode is declared dead, and the blocks that fall below the minimum replication factor are replicated to other nodes. All this information about the block mapping is stored in the namenode's local file system as the “fsimage”. All changes to it are tracked by recording them in a file called the editLog.
Figure 1.3 HDFS architecture
1.1.3 Replication factor and replica placement
A replication factor is set by the user per file; it is the number of replicas maintained for each block of data and ensures fault tolerance. The higher the replication factor, the greater the fault tolerance of the cluster. The location of the replicas for each block is decided by the namenode and chosen intelligently so as to maximize fault tolerance. With a replication factor of three, the first replica is written to the local node, ensuring maximum speed; the second to a node on a different rack; and the third to another node on that same remote rack. The second replica is useful when a whole rack goes down because of a switch failure. Greater replication can also enable faster, parallel reads, since the namenode has the option of choosing the replica closest to the client.
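The three-replica placement above can be sketched in a few lines. The following Python sketch is purely illustrative: the node and rack names are invented, it assumes at least two racks with at least two nodes on the remote rack, and the real policy lives inside the namenode (Hadoop's default block placement policy).

```python
def place_replicas(writer_node, topology):
    """topology maps rack name -> list of node names on that rack."""
    # First replica: the node where the writer runs (fastest, local write).
    replicas = [writer_node]
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    # Second replica: a node on a different rack, so that a whole-rack
    # failure (e.g. a switch going down) cannot lose every copy.
    remote_rack = next(r for r in topology if r != local_rack)
    replicas.append(topology[remote_rack][0])
    # Third replica: another node on that same remote rack, saving one
    # cross-rack transfer while still keeping two racks involved.
    replicas.append(next(n for n in topology[remote_rack] if n != replicas[1]))
    return replicas
```

For example, with two racks of two nodes each, a write from the first node of rack 1 yields one local replica and two replicas on rack 2.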
1.1.4 Data reads, writes and deletes
• Data is written to HDFS in a pipelined fashion. When a client writes data to a file, the data is first accumulated in a local file; once it amounts to a full block of user data, the client contacts the namenode for a list of datanodes to hold the replicas of this block. The block of data is then flushed to the first datanode in streams of 4 KB. The first datanode writes this data locally and forwards it to the next datanode. This process continues until the last node in the list has written the whole block; the namenode is then notified and the block map is returned. Thus the whole process is pipelined and parallel.
• When the namenode receives a read request, it tries to fulfill it by selecting a replica closest to the client, to save bandwidth and reduce the response time. A same-rack replica is preferred most of the time if one exists. The information about the network topology is fed to the system via rack-awareness scripts.
• During file deletes, the data is only marked as deleted; it is not removed from the datanodes immediately but is moved to the /trash folder. The data can be restored as long as it remains in /trash (this period is configurable). Once it crosses the configured time limit, the data is deleted and all its blocks are freed.
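The pipelined write in the first bullet can be modeled in-process. This is a toy sketch, not the HDFS protocol: the datanode names are invented, each "node" is just a local buffer, and real HDFS adds packet acknowledgements, checksums and failure handling.

```python
PACKET = 4 * 1024  # the 4 KB streaming unit mentioned above

def pipelined_write(block, pipeline, storage):
    """Stream `block` through the chain of datanodes in `pipeline`.

    storage maps datanode name -> bytearray holding its local copy.
    """
    for off in range(0, len(block), PACKET):
        packet = block[off:off + PACKET]
        # Each node in the pipeline receives the packet from its
        # predecessor, writes it locally, and forwards it downstream;
        # here that forwarding is modeled as appending to each buffer.
        for node in pipeline:
            storage[node].extend(packet)

storage = {"dn1": bytearray(), "dn2": bytearray(), "dn3": bytearray()}
pipelined_write(b"x" * 10000, ["dn1", "dn2", "dn3"], storage)
# All three replicas now hold identical copies of the block.
```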
1.1.5 Map-Reduce
A map-reduce program consists of three main phases: map, shuffle and reduce. The user specifies the map and reduce functionality via an API and submits the executable to the processing framework. A map-reduce job takes a file or a set of files stored in HDFS as input, and the map function is applied to each block of input (in reality, the map function is applied to a FileSplit, which can span multiple blocks; for simplicity we use FileSplit and block interchangeably). Each instance of the map function takes (key, value) pairs as input and emits a new set of (key, value) pairs as output, and the framework shuffles all the pairs with the same key to a single machine. This is called the shuffle phase. The user can control which keys go to the same machine via a partitioner function that can be plugged into the executable. All the (key, value) pairs shuffled to the same machine are then sorted, and a reducer function is applied to each group; the output is emitted by the reducers as a new set of (key, value) pairs, which is written to HDFS. The intermediate data of each map phase is sorted, merged in multiple rounds, and written to the disk local to the map execution. The following equations outline the map and reduce phases, and the whole job flow is summarized in figure 1.4 [8].
map(k1, v1) → list(k2, v2) (1.1)
reduce(k2, list(v2)) → list(v2) (1.2)
Figure 1.4 Map Reduce Architecture
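The two equations can be made concrete with a minimal in-process sketch, using word count as the usual example. This illustrates only the data flow of the three phases, not Hadoop's actual API; the function names are ours.

```python
from collections import defaultdict

def map_fn(key, line):
    # map(k1, v1) -> list(k2, v2): emit (word, 1) for every word; the
    # input key (e.g. a byte offset) is unused for word count.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # The framework groups all values with the same key onto one
    # machine between the map and reduce phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_fn(key, values):
    # reduce(k2, list(v2)) -> list(v2): sum the counts for one word.
    return [sum(values)]

def run_job(inputs):
    mapped = [p for k, line in inputs for p in map_fn(k, line)]
    return {k: reduce_fn(k, vs)[0] for k, vs in shuffle(mapped).items()}
```

Running `run_job([(0, "a b a"), (1, "b")])` groups the pairs by word and produces `{"a": 2, "b": 2}`.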
Following are the salient features of the map-reduce framework.
• The map-reduce programming framework lets a programmer write code as if for a single machine; the framework takes care of distributing the logic and scaling it to thousands of machines.
• Users can write the logic for the map and reduce functions and control various parts of the framework, such as file splitting, secondary sorting, the partitioner and the combiner, via pluggable components.
• Map-Reduce is highly fault tolerant in the sense that failed tasks (map, reduce, or a subset of them) do not stop the whole job. Only the tasks that failed are restarted, and the job resumes. This saves a lot of time for jobs on large amounts of data. The framework can be made to tolerate both namenode and datanode failures.
• One more notable feature of the map-reduce framework is task localization. The task scheduler always tries to reduce network IO by assigning map tasks as close to the input splits as possible. Setting the split size and the HDFS block size to the same value makes this easier still and gives 100% task localization.
• The scale at which map-reduce programs work is enormous, and the framework has been shown to scale to tens of thousands of nodes. This brings a very high degree of parallelism to data processing, resulting in faster throughput.
1.2 Hive
Hive [2] is a data warehouse infrastructure built on top of Hadoop. It provides the tools to perform offline batch-processing ETL tasks on structured data of petabyte scale. It provides an SQL-like query interface to the user, called Hive-QL, through which the user can query the data. Since Hive is built on top of Hadoop, the user can expect a latency of a few minutes even on small amounts of data, as Hadoop takes time to submit the jobs, schedule them across the cluster and initialize their JVMs. This prevents Hive from being used as an OLTP engine; it does not answer real-time queries or support individual row-level updates as in a normal relational database. The functionality of Hive can be extended using user-defined functions (UDFs). The notion of UDFs is not new to relational databases and has long been in practice. In Hive we can plug in our own custom mapper and reducer scripts to perform operations on the results of the original query. These functionalities of Hive, along with its ability to scale to tens of thousands of nodes, make it a very useful ETL engine.
1.2.1 Hive Anatomy
Hive stores tables in the warehouse as flat files on HDFS, and they can be partitioned based on the value of a particular column. Each partition can be further bucketed based on other columns (other than the one used for partitioning). A query executed by a user is parsed and converted into an abstract syntax tree (AST), where each node corresponds to a logical operator, which is then mapped to a physical operator. Figure 1.5 depicts the Hive architecture.
1.2.2 Joins in Hive
As in most relational database systems, executing a join in Hive is costlier than executing other operators, in terms of both query execution time and resource consumption. The problem is more significant in Hive because the data is sharded and distributed across a network; performing a join requires matching tuples to be moved from one machine to another, which results in a lot of network IO overhead. Hive implements joins over the map-reduce framework, and the following three types of joins are supported.
Figure 1.5 Hive Architecture

• Common Join : The common join is the default join mechanism in Hive and is implemented as a single map-reduce job. It can be thought of as a distributed union of cartesian products. Suppose we are joining table A on column ’a’ with table B on column ’b’, and d_1, ..., d_n are the distinct values taken by the join columns a and b. Then the common join operator can be described by the following equation, where A_{a=d_i} is the set of all rows of A whose column a has value d_i:

⋃_{i=1}^{n} ( A_{a=d_i} ⋈ B_{b=d_i} )

The data of both tables is read in the mappers, and the rows are then shuffled across the network in such a way that all rows with the same join key reach the same machine. To identify the table a row belongs to, rows are tagged during the map phase and differentiated in the reduce phase according to these tags. A cartesian product is then applied to the rows of the two tables with the same join-column value, and the output is written to disk for further processing.
• Map Join : The map join is an optimization over the common join, used when one of the tables is small enough to fit in the main memory of the slaves. In a map join, the smaller table is copied into the distributed cache before the map-reduce job, and the larger table is fed to the mappers. The larger table is streamed row by row, the join is done against the rows of the smaller table, and the results are written to disk. The map join eliminates the shuffle and reduce phases of the map-reduce job, which makes it very fast compared to the other join types.
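The map-join idea (hash the small table, stream the large one, no shuffle) can be sketched roughly as follows outside Hive; the table contents and names are made up for illustration.

```python
from collections import defaultdict

def map_join(small_table, big_table_stream, key_small, key_big):
    """Simulate Hive's map join: build an in-memory hash table from the
    small table (standing in for the distributed cache) and stream the
    large table through it, so no shuffle/reduce phase is needed."""
    # 'Distributed cache': hash the small table on its join key.
    hashed = defaultdict(list)
    for row in small_table:
        hashed[row[key_small]].append(row)

    # Map phase: stream the large table row by row and probe the hash table.
    for big_row in big_table_stream:
        for small_row in hashed.get(big_row[key_big], []):
            yield {**big_row, **small_row}

dims = [{'d': 1, 'name': 'one'}, {'d': 2, 'name': 'two'}]
facts = [{'f': 1}, {'f': 2}, {'f': 2}, {'f': 9}]
joined = list(map_join(dims, facts, 'd', 'f'))  # 3 matching rows
```

The generator mirrors the streaming nature of the mapper: each large-table row is joined and emitted immediately, without materializing intermediate state.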
• Bucket Join : The bucket map join is a special case of the map join in which both tables to be joined are bucketed (all rows for a given join-column value are stored in one place). The larger table is given as input to the mappers; for each value of the join column, the corresponding buckets of the smaller table are fetched and the join is performed. This improves on the map join in that, instead of copying the whole of the smaller table, we copy just the required buckets to the mappers of the larger table.

All the join conditions in the parse tree are converted into operators corresponding to one of the common, map, or bucket joins. Users can provide hints about the sizes of the tables as part of the query, and this information is used at query processing time.
1.3 Query Optimization in Databases

In recent times, SQL has become the de-facto standard for writing analytical queries over data. The process of query optimization takes an SQL query, builds the basic query plan, applies transformations, and gives a query execution plan (QEP) as output. The transformations applied to the query plan depend on the logic of the query optimizer. In general, a query optimizer first determines the logical query plans, which are generic, and then decides the physical operators to be executed. Overall, the procedure of query optimization can be broken down into the following steps, shown in figure 1.6 [10]:

• Generating the search space for the query
• Building a cost model for query execution
• Devising a search strategy to enumerate the plans in the search space
Search Space : There exist many ways of executing the same query, and query optimizers consider only a subset of the possible plans, assigning a cost to each candidate in order to find the best plan for execution. Since this problem has been proved to be NP-hard [22], heuristics are applied to reduce the search space and prune non-optimal plans. The whole set of plans that a query optimizer considers in order to pick the best plan is called the search space of that query optimizer. A lot of research exists on the search space of query optimizers, from left-deep trees [14, 25, 18] to bushy trees [27]. Each plan space has its own merits and drawbacks, but finding the truly optimal plan is NP-hard and infeasible for optimizers in general.
Cost Model : Since we consider a search space and select the best plan, we need a function that quantifies the cost of executing a plan in terms of known parameters of the cluster. We minimize or maximize an objective function based on the costs given by the cost model. This cost model relies on (i) the statistics on the relations and indexes, (ii) the formulae to estimate the selectivity of various predicates and the sizes of the output of each operator in the query plan, and (iii) the formulae to estimate the CPU and IO cost of every operator in the query plan.

Figure 1.6 Block diagram of a query optimizer
The statistics on the tables include various metadata about the actual data, such as the number of rows, the number of disk pages of the relations and indexes, and the number of distinct values of a column. Query optimizers use these statistics to estimate the selectivity factor of a given query or predicate, that is, the fraction of rows that actually qualify the predicate. The best-known way of doing this is by using histograms [35, 31, 23]. Using statistics, query optimizers predict the overall cost of executing a query, which mainly comprises CPU, IO and network costs. Estimating these costs for a query operator is also non-trivial, since it must take into consideration various properties of the system and many system-level implementation details, such as data flow from cache to buffer to disk [14]. Other factors, like the number of concurrently running queries and the available buffer space, also affect the cost values. Many detailed IO cost models have been developed to estimate various IO operations such as seek, latency and data transfer [21, 20].
Search Strategy : Various approaches have been studied in theory to search a given space of plans. The dynamic programming algorithm proposed in [37] is an exhaustive search strategy that enumerates the query plans in a bottom-up manner and prunes expensive plans based on the cost formulae. Work has been done on heuristic and random optimizations to walk through the search space [40, 39], and much work has compared various search strategies such as left-deep, right-deep and bushy trees [27, 40, 41, 24]. Some query optimizers employ a dynamic query optimization technique where decisions about the type of query operator to be used and its physical algorithm are taken at run time; the optimizer designs a decision tree which is evaluated at run time. This type of plan enumeration is better suited to top-down architectures like Volcano [19, 17].
1.4 Problem statement and scope

In this thesis, we design and implement a cost-based query optimizer for join queries over map-reduce, one that modifies the naive query plans given by SQL-to-map-reduce translators based on statistics collected about the data. Our query optimizer calculates the query execution cost of various possible plans for a given query, based on the statistics and our cost formulae, and then chooses the plan with the least cost to execute on a cluster of machines. It uses a linear combination of the communication, IO and CPU costs involved in the query execution to compute the total cost. Our query optimizer considers a new plan space that suits the highly parallelizable map-reduce framework on huge datasets. It then follows an optimized approach to assigning tasks to the slave machines which minimizes the total network shuffle of data and also distributes data processing evenly among the slaves to increase the total query throughput.
1.5 Contributions
Contributions of the thesis can be summed up as follows.
• Proposed a statistics-based approach for optimizing SQL queries on top of map-reduce. We borrowed this approach from existing query optimization techniques in relational database systems and extended it to work for the current use case.

• Extended the traditional plan space to include bushy trees, which leverage the massively parallel architecture of map-reduce systems. We explore a subset of the bushy plan space that provides parallelism and exploits the massively parallel map-reduce framework.

• Designed cost formulae to predict the runtime of joins before executing the query. We use these cost formulae to find the optimal query plan from the above plan space.

• Designed and implemented a new task scheduler for map-reduce that minimizes the overall network IO during joins. This scheduler is based on our statistics store and converts the task-assignment problem into a maxflow-mincut formulation. Our experiments showed up to a 33% reduction in shuffle data size.

• Our experiments show that this query optimizer runs join queries up to 2 times faster on TPC-H datasets of scales 100 and 300, on both TPC-H benchmark queries and synthetic queries. We ran various SQL queries with select, project and join predicates and tested all our approaches by building a distributed statistics store on the dataset.
1.6 Organization of thesis

The rest of the thesis is organized as follows. Chapter 2 discusses related work, and chapter 3 discusses in depth the query plan space we deal with and its advantages. Chapter 4 describes the statistics engine and the cost formulae required to evaluate joins in the map-reduce scenario, as well as our scheduling algorithm for map-reduce. We then discuss the results of our research in chapter 5 and conclude the thesis with our observations and possible directions for future work in chapter 6.
Chapter 2
Related Work
2.1 Related Work
A lot of research has gone into the field of query optimization in databases, both stand-alone and distributed. Many techniques to estimate the cost of query plans using statistics have been proposed. The well-known System R query optimizer from IBM has been extended in System R* [30] to work for distributed databases. We extended the ideas from these systems to work for join optimization over map-reduce based systems. Many heuristics have been developed to reduce the plan space for joins. Not much work has gone into optimizing SQL queries over map-reduce using traditional database techniques. However, work has been done on improving the runtime of map-reduce based SQL systems by reducing the number of map-reduce jobs per query and removing redundant IOs during scans in [28] and [43].
In [28] Lee et al. developed a correlation-aware translator from SQL to map-reduce which considers various correlations among the query operators to build a graph of map-reduce operators instead of using a one-operation-to-one-job translation. This reduces the overall number of map-reduce jobs required to run a query and also minimizes scans of the datasets by considering correlations. In [43] Vernica et al. worked on improving set-similarity joins on the map-reduce framework. The main idea is to balance the workload across the nodes in the cluster by efficiently partitioning the data.
Afrati et al. worked on optimizing chain and star joins in a map-reduce environment and on optimizing the shares between map and reduce tasks [13]. They also consider special cases of chain and star joins to test their theory of optimal allocation of shares, and they observed that their approach is useful for star-schema joins, where a single large fact table is joined with a large number of small dimension tables. Work has been done on theta-joins over map-reduce [32] using randomized algorithms, and [12] addresses a special case of fuzzy joins over map-reduce and quantifies its cost. Google has implemented a framework called Tenzing [29] with built-in optimizations for improving the runtime of queries in a map-reduce environment, using techniques such as sort avoidance, block shuffle and local execution. These techniques result in efficient execution of joins at query runtime: depending on parameters such as the size of the relations and the volume of shuffle data, the operators are scheduled to execute on the nodes.
Chapter 3
Query Plan Space
3.1 Overview of Query Planspace - An Example
Let us consider a simple query with three joins and four relations A, B, C, D as follows.
select * from A,B,C,D where A.a = B.b and B.b = C.c and C.c = D.d
Figure 3.1 lists two possible join orderings for the query.
Figure 3.1 Possible join orderings
The plan on the left, (((A join B) join C) join D), is the left deep plan considered by the Hive optimizer for execution, whereas the plan on the right, ((A join B) join (C join D)), is a bushy tree. Let us assume, for the purposes of this analysis, that the sizes of the intermediate relations are large and that the map join is not a possible operator for the plan execution. Figure 3.1 is then also the operator tree, where CJ represents a common join operator. A left deep plan is by nature serial and must be executed level after level, whereas bushy trees are inherently parallel and the left and right children of the root node can be executed in parallel.
Suppose the above query is executed using Hive on a Hadoop cluster. Hive chooses the left deep plan for execution, and it is broken down into 3 joins as follows:
A common join B -> result:temp1
temp1 common join C -> result:temp2
temp2 common join D -> result:temp3 (final output)
The entire query execution is serial, and there are three map-reduce jobs corresponding to the three common join operators. The relations A and B are joined and the result is stored in a table temp1 in the first map-reduce job, and the rest of the map-reduce jobs must wait until the entire result of temp1 is written to HDFS. Given this scenario, two cases can occur.

• Case 1 : The map-reduce job takes up all the mapper and reducer slots in the cluster, or
• Case 2 : The job takes up a subset of the map and reduce slots in the cluster.

In both cases there is an underutilization of cluster resources, because the mapper slots remain idle until the job completes its shuffle and reduce phases; this follows from the restriction of the left deep plan that the entire procedure is serial. However, in case 2 the underutilization is more pronounced because many slots remain idle right from the beginning of the job. Considering the example query above, the map-reduce job corresponding to the join (temp1 join C) cannot start even though there are free mapper slots in the cluster. This increases the query runtime, and most of the machines remain idle until the whole job is completed. The query execution timeline is depicted in figure 3.2. It is clear that none of the phases overlap and everything is perfectly sequential. From t = a to t = c all the map slots in the cluster and the CPU remain idle.
If instead the bushy tree execution plan is considered, it can be broken down into 3 joins as follows:
A common join B -> result:temp1
C common join D -> result:temp2
temp1 common join temp2 -> result:temp3 (final output)
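To illustrate the intra-query parallelism of this breakdown, here is a toy sketch in which a thread pool stands in for parallel map-reduce jobs; run_join is a hypothetical stand-in that simply pairs rows rather than performing a real equi-join.

```python
from concurrent.futures import ThreadPoolExecutor

def run_join(name, left, right):
    """Stand-in for launching one common-join map-reduce job.
    In a real cluster this would submit a job; here we just pair rows."""
    return [(l, r) for l in left for r in right]

A, B, C, D = [1, 2], [2, 3], [3, 4], [4, 5]

# The bushy plan runs (A join B) and (C join D) as independent parallel jobs...
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(run_join, 'temp1', A, B)
    f2 = pool.submit(run_join, 'temp2', C, D)
    temp1, temp2 = f1.result(), f2.result()

# ...and only the final join must wait for both intermediate results.
temp3 = run_join('temp3', temp1, temp2)
```

The point of the sketch is the dependency structure: temp1 and temp2 have no dependency on each other, so they can occupy cluster slots concurrently, while the left deep plan forces all three joins into a single serial chain.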
Figure 3.2 Time line for the execution of left deep QEP

Even in this execution plan, the two cases discussed above are valid. Suppose that the two map-reduce jobs corresponding to (A join B) and (C join D) are launched in parallel, and the map-reduce job for (A join B) is executed first under Hadoop's default FIFO scheduler. In case 1, the tasks corresponding to (C join D) will be waiting in the queue since there are no slots available for them to be launched. As soon as the mappers corresponding to the job (A join B) complete, tasks belonging to the job (C join D) get launched. This increases cluster resource utilization and also increases intra-query parallelism. Case 2 is even simpler and faster because both map-reduce jobs run in parallel owing to the abundance of resources, which improves performance. The query execution timeline for this QEP is depicted in figure 3.3. We can clearly see the overlap in task assignment. The figure represents the worst-case task assignment where each job waits for all of its reducers; in practice the situation is better because some reducers complete quickly and give way to others.
Figure 3.3 Time line for the execution of bushy Query plan

Given the benefits of bushy tree parallelization, we might be tempted to say that bushy trees are always more efficient than left deep trees. Though this is partially true, the plan space of bushy trees is huge, and exploring all of it takes the optimizer a lot of time, sometimes much more than the query runtime itself. This is why most modern optimizers do not consider bushy trees in their search space and still rely on left deep or right deep trees. In the rest of the chapter, we describe three different algorithms to build bushy trees for joins in map-reduce and explore their planspace in detail.
3.2 Exploring Bushy Planspace
In this section we present 3 novel approaches for building bushy trees that benefit join queries over the map-reduce framework. We explore a subset of the bushy planspace called n-balanced trees, defined as follows.
Definition (n-fullybalanced tree). An n-fullybalanced tree is a perfectly balanced binary tree with n leaf nodes. This forces n to be a power of 2. An example of a 4-fullybalanced tree is shown in figure 3.4.

Figure 3.4 An example of 4-fullybalanced tree

Definition (n-balanced tree). An n-balanced tree is obtained by replacing the left-most leaf node of an n-fullybalanced tree with another n-fullybalanced tree and repeating this procedure with the newly obtained tree. However, the condition that the first tree is n-fullybalanced is relaxed in the case of query plans, depending on the number of joins. In the resulting n-balanced tree all the internal nodes are join operators and all the leaf nodes are relations.
An example of a 4-balanced tree is shown in figure 3.5. The number of levels in the tree is decided by the number of joins in the query.
Figure 3.5 An example of 4-balanced tree
The rationale behind selecting n-balanced trees is that, at any level during execution there can be
n/2 parallel mapreduce jobs (unlike 1 for left deep trees) that can be executed in the cluster. Choosing
the value of n carefully will give a very good resource utilization in the cluster and also increases
the query performance. At the same time, too much parallelization is an overkill for the query: the waiting queue becomes very long, non-FIFO schedulers such as the fair scheduler cannot keep up, and there are too many task context switches. For example, figures 3.6 and 3.7 show two possible execution plans for a query with 8 tables. Figure 3.6 is an 8-balanced tree and figure 3.7 is a 4-balanced tree; we ran both query plans on the cluster with the fair scheduler configured, and their runtimes were 923 seconds and 613 seconds respectively. Since we are running on a 10 node cluster with large input data, the cluster cannot run 4 mapreduce jobs at a time, so the jobs wait in the queue until sufficient slots become available.

Figure 3.6 8-balanced tree

Figure 3.7 4-balanced tree

In the 4-balanced tree, since only 2 jobs run at a time, the waiting time is lower and the query completes more quickly. Carefully choosing the value of n also reduces the plan space of bushy trees, and this solution has the benefits of both parallelization and higher query performance, as we see in the later sections. We first describe the algorithm FindMin, which builds an n-fullybalanced bushy tree, and then FindRecMin and FindRecDeepMin, which explore the n-balanced bushy plan space.
3.2.1 Finding minimum cost n-fully balanced tree - FindMin
In this section, we describe an algorithm FindMin to build an n-fullybalanced bushy tree. We build it level by level, finding the minimum-cost pairs at each level. The algorithm takes as input a query tree Q corresponding to the parsed SQL statement given by the user and the value n, and gives an n-fullybalanced join operator tree as output. The algorithm is described below.
Algorithm 1: FindMin to build an n-fullybalanced bushy tree

Input : A query tree Q corresponding to the parsed SQL statement
Output: An operator tree to be executed
begin
    J ← getJoinTables(Q)
    s ← sizeOf(J)
    while s ≠ 1 do
        x ← getPowerOf2(n)        // fetches the power of 2 less than n
        P ← selectMinPairs(x)     // selects the x/2 pairs of joins with least cost
        for y ∈ P do
            Remove tables in y from J
            Add y to J
        end
        s ← sizeOf(J)             // update the size of J
    end
    return makeOperatorTree(J)
end
The heart of the algorithm is the function selectMinPairs(x), which selects the top x/2 pairs from the table list based on the cost functions we define in the next chapter. We then remove the individual tables in those minimum pairs and add each pair as a whole. This is the greedy step of the algorithm: the local minimum cost may not yield the global minimum cost over the whole query tree, yet we proceed assuming it does. An example run of this algorithm on a query with 8 joins (9 tables) to build 4-balanced trees is shown in figure 3.8.
Figure 3.8 Running Algorithm1 on 9 tables
At each step, the bracketed tables are the minimum cost x/2 join pairs. Once they are bracketed, the
whole bracket is considered as a single table for the subsequent step and each bracket converts to a join
in the query tree.
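A minimal sketch of FindMin's greedy bracketing follows, assuming a stand-in pair-cost function (the product of estimated table sizes) in place of the cost formulae of the next chapter; the tables and sizes are illustrative.

```python
from itertools import combinations

def find_min(tables, sizes, x):
    """Greedy FindMin sketch: at each level, bracket the x/2 cheapest
    disjoint pairs of tables (cost here is a stand-in: product of sizes)."""
    J = list(tables)
    while len(J) > 1:
        k = min(x // 2, len(J) // 2)          # pairs we can form at this level
        # Rank all candidate pairs by the stand-in cost function.
        ranked = sorted(combinations(J, 2),
                        key=lambda p: sizes[p[0]] * sizes[p[1]])
        picked, used = [], set()
        for a, b in ranked:                   # take k disjoint cheapest pairs
            if a not in used and b not in used:
                picked.append((a, b))
                used.update((a, b))
                if len(picked) == k:
                    break
        for a, b in picked:                   # replace each pair by its join
            J.remove(a); J.remove(b)
            name = f'({a}⋈{b})'
            sizes[name] = sizes[a] * sizes[b] # naive output-size estimate
            J.append(name)
    return J[0]

sizes = {'A': 10, 'B': 2, 'C': 5, 'D': 3}
plan = find_min(['A', 'B', 'C', 'D'], sizes, 4)  # brackets (B,D) and (A,C) first
```

With x = 4, two pairs are bracketed per level, mirroring the figure: each bracket becomes a single "table" for the next level until one operator tree remains.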
3.2.2 Finding a minimum cost n-balanced tree recursively - FindRecMin
The algorithm we describe in this section generates n-balanced trees. The algorithm takes the query
plan Q that corresponds to the parsed SQL query and a value n as input and gives out the join operator
tree as output.
This algorithm considers all possible combinations of size n at each level; once it finds a minimum combination it finalizes it and reflects it in the final result, even though this local minimum combination of size n may not produce a global minimum. An example run of this algorithm on 8 joins (9 tables) is shown in figure 3.9. At each stage of bracketing, all possible n-combinations are considered and the combination with the least execution cost is shown in the figure. Not all combinations are shown in the figure due to space constraints.
3.2.3 Finding a minimum cost n-balanced tree exhaustively - FindRecDeepMin
We now describe an approach to generate n-balanced trees that is similar to algorithm FindRecMin; however, instead of finalizing on a local minimum of size n, we recursively go up the order to see if this
Algorithm 2: FindRecMin algorithm for building n-balanced trees

Input : A query tree Q corresponding to the parsed SQL statement and n for generating n-balanced trees
Output: An operator tree to be executed
begin
    J ← getJoinTables(Q)
    breakFlag ← True
    while breakFlag do
        s ← sizeOf(J)
        comb ← n
        if s < n then
            comb ← s
            breakFlag ← False        // last iteration of the loop since no tables are left
        end
        C ← generateCombinations(J, comb)   // generate combinations of size comb from set J
        min_cost ← ∞
        min_comb ← null
        for x ∈ C do
            P ← generatePermutations(x)     // check all permutations of each combination; can be skipped for linear joins
            for y ∈ P do
                if isValidJoinOrder(y) ∧ executionCost(y) < min_cost then
                    min_comb ← y            // update min_comb: a plan with lesser cost is found
                    min_cost ← executionCost(y)
                end
            end
        end
        Remove individual tables in min_comb from J
        Add min_comb as a single table to J  // reflect the changes in J
    end
    return J
end
Figure 3.9 Running Algorithm2 on 9 tables
n-combination gives a global minimum. The algorithm is described in Algorithm 3. An example run of this exhaustive algorithm is similar to figure 3.9, except that for every bracketing a further recursive call is made to check whether it yields a global minimum. So we might get a set of bracketings different from those in the figure, but the overall shape of the tree remains the same.
We analyze the runtime, search space and performance of each of these algorithms in detail in the results chapter.
3.3 Choosing the value of n for an n-balanced tree
Choosing the value of n is not as trivial as it seems. Setting the wrong value may not produce
desirable results as one of the following two cases may occur.
• A very small value of n might leave the cluster underutilized: there will be empty task slots in the cluster but no more jobs that can be run in parallel.

• A very large value of n might keep many jobs in the cluster waiting in the queue and also bring down cluster efficiency through too much multitasking on the nodes.
These can be avoided by using a few simple techniques. One way is to try a few values and fix the value that gives the best results. This takes experimentation and may take some time before
Algorithm 3: FindRecDeepMin algorithm for building n-balanced trees

Input : A query tree Q corresponding to the parsed SQL statement and n for generating n-balanced trees
Output: An operator tree to be executed
begin
    J ← getJoinTables(Q)
    breakFlag ← True
    while breakFlag do
        s ← sizeOf(J)
        comb ← n
        if s < n then
            comb ← s
            breakFlag ← False        // last iteration of the loop since no tables are left
        end
        C ← generateCombinations(J, comb)   // generate combinations of size comb from set J
        min_cost ← ∞
        min_comb ← null
        for x ∈ C do
            P ← generatePermutations(x)     // check all permutations of each combination; can be skipped for linear joins
            for y ∈ P do
                if isValidJoinOrder(y) then
                    J′ ← J
                    Remove individual tables in y from J′
                    Add combination y as a single table to J′
                    Q′ ← makeQueryTree(J′)
                    JTemp ← FindRecDeepMin(Q′, n)    // recursively call the same function with the new join tree
                    if executionCost(JTemp) < min_cost then
                        min_comb ← JTemp             // update min_comb: a plan with lesser cost is found
                        min_cost ← executionCost(JTemp)
                    end
                end
            end
        end
        Remove individual tables in min_comb from J
        Add min_comb as a single table to J  // reflect the changes in J
    end
    return J
end
we find a suitable value. Another way is to estimate n by calculating the maximum capacity of the cluster as follows.

1. Get the total number of map slots in the cluster (M). This can be done by summing up the map slots on each machine of the cluster.

2. Get the average input size for a single map-reduce job (I). This is twice the average table size (since two tables are involved per join).

3. Get the average number of map tasks per map-reduce job (A) by dividing the average input size (I) by the block size (B).

4. A good estimate of n is then the power of 2 closest to the value of M/A.

This estimate of n can be further improved using the past history of queries to derive the average number of jobs that can run at any time. The number of reduce tasks is not involved in the computation, since the map tasks are the ones that determine the input size and the number of reducers is generally constant per task.
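The four steps above can be sketched as follows; the cluster figures in the example are hypothetical.

```python
def estimate_n(map_slots_per_node, avg_table_size, block_size):
    """Estimate n for an n-balanced tree from cluster capacity,
    following the four steps above."""
    M = sum(map_slots_per_node)       # 1. total map slots in the cluster
    I = 2 * avg_table_size            # 2. average input per join job (two tables)
    A = max(1, I // block_size)       # 3. average map tasks per job
    target = M / A                    # 4. jobs that fit in the cluster at once
    # Find the power of two closest to M/A (at least 2, since n is a power of 2).
    n = 2
    while n * 2 <= target:
        n *= 2
    return n * 2 if abs(2 * n - target) < abs(n - target) else n

# Hypothetical 10-node cluster: 8 map slots per node, 1 GB tables, 128 MB blocks.
GB, MB = 1 << 30, 1 << 20
n = estimate_n([8] * 10, 1 * GB, 128 * MB)  # M = 80, I = 2 GB, A = 16, M/A = 5
```

With these illustrative numbers M/A is 5, so the estimate settles on n = 4: at most two joins of a 4-balanced tree run concurrently without overflowing the slot capacity.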
Chapter 4
Cost Based Optimization Problem
All the algorithms described in the previous chapter assume that we have a cost estimator for the join
operators for joins over map-reduce framework. In this chapter we describe in detail
1. Estimating the cost of the operators in terms of disk IO operations, CPU usage and network
movement of data and
2. Optimizing the data movement across machines to reduce the network IO
To give accurate cost estimates, the majority of traditional relational database systems rely on histograms of the data. Histograms give an overview of the data by partitioning its range into a set of bins. The number of bins and the partitioning method determine the accuracy of the histograms, and this has been studied in depth. Traditionally, two types of histograms have been in common use [36]:

• Equi-width histograms : Equi-width histograms, or frequency histograms, are the most common histograms used in databases; the value range is divided into n buckets of equal size. Values are bucketed based on the range they fall in, and the number of rows per bucket depends on the data distribution.

• Equi-depth histograms : In equi-depth histograms the number of rows per bucket is fixed and the size of the buckets is determined by the data distribution.
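A minimal sketch of an equi-depth histogram and the kind of selectivity estimate it supports (assuming uniformity within each bucket) follows; this is an illustration, not the statistics store described later, and the toy column is made up.

```python
def equi_depth_histogram(values, n_buckets):
    """Equi-depth: a fixed number of rows per bucket, so the bucket
    boundaries follow the data distribution."""
    vs = sorted(values)
    depth = len(vs) // n_buckets
    buckets = []                       # each bucket is (low, high, row_count)
    for i in range(n_buckets):
        stop = (i + 1) * depth if i < n_buckets - 1 else len(vs)
        chunk = vs[i * depth:stop]     # the last bucket takes the remainder
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

def estimate_le_selectivity(buckets, x):
    """Estimate the fraction of rows with value <= x, assuming values
    are spread uniformly inside each bucket."""
    total = sum(count for _, _, count in buckets)
    qualifying = 0.0
    for lo, hi, count in buckets:
        if x >= hi:                    # entire bucket qualifies
            qualifying += count
        elif x >= lo:                  # bucket partially qualifies
            qualifying += count * (x - lo) / max(hi - lo, 1)
    return qualifying / total

data = list(range(100))                # toy column with values 0..99
hist = equi_depth_histogram(data, 4)
sel = estimate_le_selectivity(hist, 49)  # predicate: value <= 49
```

An equi-width variant would instead fix the bucket boundaries and let the counts vary; the selectivity formula stays the same, only the bucket construction changes.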
Each of the techniques has its own set of advantages and disadvantages. We borrow the idea of using histograms for accurate cost estimation from relational databases and extend it to build our own distributed statistics store, with cost estimator functions based on it. The rest of the chapter is organized as follows. First we describe our distributed statistics store and the methods it exposes to the query optimizer, and then we describe the cost formulae for each of the join operators based on these methods.
4.1 Distributed Statistics Store
We designed and implemented a statistics store distributed across the machines in the Hadoop cluster. Each node maintains an equi-depth histogram for the data local to that machine. These histograms are used for cardinality estimations local to that site. Methods have been written to compute these histograms from the data during map-reduce jobs. Since the data is likely to be updated, we update the histograms too, but not by re-reading all the data: we plug our code in so that when the modified data is read as part of other map-reduce jobs, the histograms also get updated. Though this may slightly increase the runtime of that query, we need not put extra load on the cluster by reading the whole data again, and our experiments show that this slight increase in runtime is not significant. Initial statistics can be computed while loading the data into the cluster, or by separately running a job that reads the whole data and updates all the local histograms. Further, we consolidate the distributed statistics to maintain the global per-table statistics that we use in our cost model.

These histograms are serialized to disk as tables in MySQL, and APIs have been written on top of them to fetch cardinality estimates. The query optimizer uses these APIs to obtain the cardinality estimates needed to calculate the cost of executing a query. The architecture is summarized in figure 4.1.
Figure 4.1 Statistics store architecture
4.2 Cost Formulae
In this section we describe the cost formulae we use for estimating the cost of executing the join operators on top of the map-reduce framework. The actual procedure inside the map-reduce framework is quite complex: its implementation makes multiple scans of the disk while shuffling and sorting. For example, the Hadoop framework exposes many settings to the user, and a small tweak can greatly affect the performance of a map-reduce job. One such example is io.sort.mb, the amount of buffer memory used by the map task JVM to sort the incoming streams and spill them to disk. Tuning this parameter has been shown to greatly improve job performance depending on the size of the map output data. There are many such knobs that can be tuned on a per-job basis and that greatly affect job execution time; they can be set inside the job configuration classes. However, for the problem of query optimization it is sufficient if our cost model predicts a runtime proportional to the actual execution time. Our cost model divides the whole join map-reduce job into three phases, map, shuffle and reduce, evaluates the cost of each phase separately, and adds them up for the whole cost. We now describe each of these phases in detail along with the cost formulae we use to predict their runtime.
4.2.1 Join map phase
In the map phase, each HDFS block is read, parsed with an appropriate RecordReader implementation, and the select predicates are applied to prune unnecessary rows. The rows are divided into partitions based on the join column value and are spilled to disk whenever memory fills up. So, once all the data is read, there may be multiple spills on disk, which are merged in multiple passes. This whole process is complex and involves multiple round trips of data between disk and memory. However, we identified the parts of this phase that incur the most cost (in terms of runtime) and included them in our cost analysis. They are as follows.
1. Reading the whole data block from HDFS through network IO. Even when the data is local, HDFS uses sockets to read it. If the data is not local, it is read from a remote machine; however, HDFS tries to maintain data locality most of the time by reading local replicas, which speeds up the whole process. The cost for this step is BLOCK_SIZE/R_hdfs.

2. Once the data is read, the selectivity filters are applied and the remaining data is written back to the disk so that it can be merged later. The cost of spilling this to disk is (BLOCK_SIZE ∗ selectivity_factor)/W_local_write. We calculate the selectivity factor from the selection predicates in the input query.

3. All the written data is read back to do an in-memory merge in multiple rounds. The cost of doing this is (BLOCK_SIZE ∗ selectivity_factor)/R_local_read.

4. The merged data is written back to the local disk so that it can be read in the reduce part. We include this cost as (BLOCK_SIZE ∗ selectivity_factor)/W_local_write.
BLOCK_SIZE — configured block size in HDFS
R_hdfs — read throughput from HDFS
selectivity_factor — selectivity factor of the select predicates in the query for that block
R_local_read — read throughput of the local machine where the map task runs
W_local_write — write throughput of the local machine where the map task runs

Table 4.1 Notations for cost formulae - Map phase
Summing up all the costs, the total cost of the map phase is written as follows:

    BLOCK_SIZE / R_hdfs
    + (BLOCK_SIZE * selectivity_factor) / W_local_write
    + (BLOCK_SIZE * selectivity_factor) / R_local_read
    + (BLOCK_SIZE * selectivity_factor) / W_local_write
The values of R_hdfs, R_local_read and W_local_write are computed beforehand by running a few
experiments on the cluster, and BLOCK_SIZE is the user-configured block size. These are computed only
once, provided the cluster is not changed by adding or removing nodes. The value of selectivity_factor
is calculated from the distributed statistics using the selection predicates in the query. The above
formula assumes that all the spills are merged in a single pass. This is valid for join map-reduce jobs,
since not much extra data (apart from the block content scaled by the selectivity factor) is written to
disk, and current-day memory sizes are big enough that the setting io.sort.mb (the Hadoop setting
controlling the heap available for this merging process) can be configured appropriately to make this
possible. Since this cost function is tailored to the map phase of join jobs, necessary modifications
should be made before extending it to other map-reduce jobs. Multiple such maps run on each node in
parallel, in batches. Since the total map time of the job is limited by the node with the last and
slowest map task, we take the total map time to be the time taken to complete the last map task on the
slowest node.
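As a sketch, the four cost terms above can be wrapped in a small function; the throughput and block
size arguments here are illustrative placeholders for the values measured on a real cluster.

```python
def map_phase_cost(block_size, selectivity_factor, r_hdfs, r_local_read, w_local_write):
    """Estimated map-task runtime for a join job, mirroring the formula above:
    read the block from HDFS, spill the selected rows, read them back for the
    in-memory merge, and write the merged map output."""
    selected = block_size * selectivity_factor
    return (block_size / r_hdfs            # 1. HDFS read of the whole block
            + selected / w_local_write     # 2. spill after select predicates
            + selected / r_local_read      # 3. read back for the merge
            + selected / w_local_write)    # 4. write merged map output

# Example: 128 MB block, 50% selectivity, throughputs in MB/s (all made up).
print(map_phase_cost(128, 0.5, r_hdfs=80, r_local_read=120, w_local_write=90))
```

The estimated total map time of a job would then be the maximum of these values over the map waves
on the slowest node, as described above.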
4.2.2 Join shuffle phase
Reducers start copying data after a configured number of map tasks are complete. We quantify the
total time taken to complete the shuffle phase for a reducer running on machine k that has been assigned
partition i as follows, with the notation given in Table 4.2. The value of NR_jk is calculated beforehand
by performing a series of experiments that transfer data from each node to every other node. SP_ij can
be calculated easily from the histograms and a given partition assignment.

    ∑_{j ∈ nodes} SP_ij / NR_jk
SP_ij   Size of partition i on machine j
NR_jk   Network read throughput of data on machine j from machine k

Table 4.2 Notations for cost formulae - Shuffle phase
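A minimal sketch of this shuffle-cost estimate; the SP and NR values below are made up for
illustration, not measured numbers.

```python
def shuffle_cost(SP, NR, i, k):
    """Time for the reducer on machine k to fetch partition i:
    sum over source machines j of SP[i][j] / NR[j][k]."""
    return sum(SP[i][j] / NR[j][k] for j in range(len(NR)))

# Two machines, one partition: 10 MB of the partition on machine 0 and
# 20 MB on machine 1, with illustrative throughputs NR[j][k] in MB/s
# (local reads are fast, cross-machine reads slow).
SP = [[10.0, 20.0]]
NR = [[100.0, 2.0],
      [1.0, 100.0]]
print(shuffle_cost(SP, NR, i=0, k=1))  # 10/2 + 20/100 = 5.2
```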
4.2.3 Join Reducer phase
In the reduce phase, all the shuffled data is read and the actual join of the tables is performed. All
this data is then written to HDFS so that the subsequent join task can read it. Since HDFS replicates
each block to multiple nodes, the entire process is limited by the HDFS write throughput. So we estimate
the total cost of the reduce phase to be the sum of the time taken to write the whole partition data
(SP_i) from the shuffle, read it back into the reducer's memory, and write the result (result_j) of the
join from reducer j back to HDFS. The equation can be written as follows:

    SP_i / W_write_local + SP_i / R_read_local + result_j / W_write_hdfs

W_write_local and W_write_hdfs are the write throughputs of the local machine and HDFS respectively.
We estimate the value of result_j using the global histograms as follows:

    result_j = ∑_{keys k ∈ partition i}
               (sel_factor(k, join_table1) * size(join_table1)) *
               (sel_factor(k, join_table2) * size(join_table2))

For map joins, the costs of scheduling and shuffle are automatically set to 0, and the mapper cost
includes writing the result but excludes the cost of writing intermediate data to local disks.
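The reduce-phase estimate and the histogram-based result-size formula above can be sketched together;
all sizes, selectivities and throughputs below are illustrative assumptions.

```python
def estimate_result_size(keys, sel1, size1, sel2, size2):
    """Join result rows for a partition from per-key selectivities,
    following the histogram-based formula above."""
    return sum(sel1[k] * size1 * sel2[k] * size2 for k in keys)

def reduce_phase_cost(sp_i, result_j, w_local, r_local, w_hdfs):
    """Reducer runtime estimate: write the shuffled partition locally,
    read it back into memory, then write the join result to HDFS."""
    return sp_i / w_local + sp_i / r_local + result_j / w_hdfs

# Hypothetical partition with two join keys and per-key selectivities.
rows = estimate_result_size(["a", "b"],
                            {"a": 0.01, "b": 0.02}, 1000,
                            {"a": 0.05, "b": 0.01}, 2000)
print(reduce_phase_cost(sp_i=512, result_j=rows,
                        w_local=90, r_local=120, w_hdfs=40))
```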
4.3 Scheduling - An example
Consider the execution plan (on a two-machine cluster) of the CommonJoin operator in MapReduce,
using the tables A and B in Figure 4.2 joined according to the following query:

SELECT * FROM A JOIN B ON (A.a = B.b)

Both tables A and B are read into mappers, and triples of the format <join column value, row key
and data, tag> are emitted. So in the given example the triples <x,1,0>, <x,2,0>, <y,3,0>, <x,4,1>,
<y,5,1>, <y,6,1> are emitted. The tag 0 or 1 classifies a triple as belonging to table A or B in the
reducers so that the join can be done. All the triples with the same join column value are moved to the
same machine in the shuffle phase. There they are classified according to their tags, a cartesian
product is taken, and the result is written to disk. The join operator is considered costly because it
involves the network movement of data from one node to another, which incurs a large latency. In the
above example there are the two following ways of scheduling the join keys in the reduce phase.
Figure 4.2 Query scheduling example
1. Join value x on machine 1 and y on machine 2
2. Join value x on machine 2 and y on machine 1
In case 1 the total network cost (in terms of rows) is 2 ((3,y) from machine 1 to 2 and (4,x) from
machine 2 to 1), whereas in case 2 the total network cost is 4, twice that of case 1. Considering the
size of data the Hadoop ecosystem manages, these figures become very large for tables at terabyte scale,
and such communication costs can heavily impact query performance. So the problem finally boils down to
partitioning m reduce keys across n machines so as to minimize the network movement of data.
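The two schedules above can be checked with a tiny enumeration over key-to-machine assignments,
using the row placement from the example (rows 1-3 on machine 1, rows 4-6 on machine 2).

```python
from itertools import product

# (join key, row id, machine holding the row), as in the example
rows = [("x", 1, 1), ("x", 2, 1), ("y", 3, 1),
        ("x", 4, 2), ("y", 5, 2), ("y", 6, 2)]

def network_cost(assignment):
    """Rows whose key is reduced on a different machine must cross the network."""
    return sum(1 for key, _, machine in rows if assignment[key] != machine)

# Enumerate every key-to-machine assignment and report its cost in rows.
for x_m, y_m in product([1, 2], repeat=2):
    cost = network_cost({"x": x_m, "y": y_m})
    print(f"x -> machine {x_m}, y -> machine {y_m}: {cost} rows shuffled")
# Case 1 (x -> 1, y -> 2) moves 2 rows; case 2 (x -> 2, y -> 1) moves 4.
```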
4.4 Scheduling strategy
In this section we show a novel approach for scheduling a CommonJoin operator by modeling it as a
min-cost max-flow problem. Consider a Hive query which joins two tables A and B on columns a and
b respectively. The rest of the description assumes that there are m distinct values of A.a and B.b
combined and n machines in the cluster.

We define a variable X_ij as follows:

    X_ij = 1 if reducer key i is assigned to machine j, and 0 otherwise.
Since a key can only be assigned to a single machine,

    ∀i: ∑_j X_ij = 1    (4.1)

Also, we put a limit on the number of keys a reducer can process. This depends on the processing
capability of the machine; let it be l_j for node j.

    ∀j: ∑_i X_ij ≤ l_j    (4.2)
The n×n matrix C is obtained by calculating the average time taken to transfer a unit of data from
every machine to every other machine in a simple experiment. Suppose the key i is assigned to the
reducer on machine k and P_ij is the size of key i on machine j. Then the total cost of data transfer
from all machines to machine k because of key i, which serves as the runtime estimate W_ik of the
reducer, can be written as

    W_ik = ∑_j P_ij · C_jk

The total shuffling cost can now be written as

    C_total = ∑_i ∑_k W_ik · X_ik    (4.3)

The above cost function can be generalized according to the scheduling requirement. For example,
to schedule tasks in a heterogeneous environment, we can add additional parameters that signify the
cost of running a query on machine k. Since we are optimizing communication costs in this case, only
network latencies are considered; but the model can be extended to any general scheduling problem on
the map-reduce framework. We now model this problem as a flow network by following the steps below
[26].
1. Create two nodes, source (S) and sink (T).

2. Create n nodes S1 to Sn, one for each machine.

3. Create m nodes K1 to Km, one for each key.

4. Create n edges from the source S to each of S1 to Sn, with capacities l1 to ln and cost 0.

5. Connect every pair of Si (1 ≤ i ≤ n) and Kj (1 ≤ j ≤ m) by an edge with cost W_ji and
capacity 1. X_ij is the flow on the edge connecting Si and Kj.

6. Create m edges from every node Kj (1 ≤ j ≤ m) to the target T, with capacity 1 and cost 0.
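Because the optimum the network computes is just the cheapest feasible assignment, on a tiny instance
it can be cross-checked by brute force. The sketch below enumerates assignments directly, enforcing
constraints (4.1) and (4.2) and minimizing the cost (4.3); the W and l values are made up for
illustration. A production implementation would instead solve the flow network itself, e.g. with a
min-cost max-flow routine such as networkx's max_flow_min_cost.

```python
from itertools import product

# Illustrative inputs (made up): W[i][k] = cost of reducing key i on
# machine k; l[k] = key capacity of machine k.
W = [[2, 5], [4, 1], [3, 3]]
l = [2, 2]
n_keys, n_machines = len(W), len(l)

def best_assignment():
    """Brute-force the min-cost feasible assignment -- the same optimum
    the min-cost max-flow network computes, usable only for tiny inputs."""
    best, best_cost = None, float("inf")
    for assign in product(range(n_machines), repeat=n_keys):
        # Constraint (4.2): no machine may exceed its key capacity l_j.
        # Constraint (4.1) holds by construction: each key gets one machine.
        if any(assign.count(j) > l[j] for j in range(n_machines)):
            continue
        cost = sum(W[i][assign[i]] for i in range(n_keys))  # cost (4.3)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

print(best_assignment())  # ((0, 1, 0), 6): keys 0, 2 -> machine 0, key 1 -> machine 1
```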
The maximum flow in the above network is clearly m, since the capacities of the inbound edges of the
target T add up to m. At maximum flow, since the capacity of each outgoing edge from K1 to Km is 1,
only one of the multiple incoming edges of each key node carries flow and the rest carry 0. This
corresponds to assigning the key Kj to the machine Si whose incoming edge carries flow 1. The above
construction makes sure that each key is assigned to a single machine, and since the capacities of the
incoming edges of the machine nodes S1 to Sn are l1 to ln, the number of keys assigned to a particular
machine does not exceed its limit. The flow values on these edges give the number of keys assigned to
each machine once the network is solved for minimum cost at maximum flow. Once we have the values of
P_ij from our statistics, we solve the above min-cost max-flow network to obtain the optimal key
allocation and feed it to the map-reduce program. This flow network can be solved using an algorithm
that has a strongly
Figure 4.3 Flow network
polynomial complexity[34]. In the next chapter we discuss the experimentation and results for the
approaches discussed so far.
4.5 Shuffle algorithm - Proof of minimality
We use proof by contradiction to show that our algorithm gives the optimal shuffle. Assume that the
algorithm outputs a non-optimal partition assignment; that is, over keys i and machines j there exists
another allocation of X_ij with a smaller shuffle cost than the allocation chosen by the algorithm.
Call this Plan_x, and call the plan chosen by the algorithm Plan_opt. If we can show that Plan_x has a
lower flow cost in the network than Plan_opt, then the min-cost max-flow solver did not choose the plan
with minimum cost, which is a contradiction. From equation 4.3, the total shuffle cost for a given
allocation X_ij is

    C_total = ∑_i ∑_k W_ik · X_ik    (4.4)

From the flow network in figure 4.3, the total flow cost can be written as

    C_flow = ∑_{i ∈ nodes(K)} ∑_{k ∈ nodes(S)} W_ik · X_ik    (4.5)

which is the same as 4.4, since nodes(K) represent keys and nodes(S) represent machines by the way the
flow network is built. This implies that the total cost of an allocation and the corresponding cost in
the flow network are the same; hence the flow cost of Plan_x is less than that of Plan_opt, which is the
desired contradiction. So the shuffle algorithm always chooses the allocation with minimum shuffle cost.
4.6 Shuffle algorithm - A working example
Let us consider a query joining the two tables customer and supplier from the TPCH dataset:

select * from customer join supplier on customer.c_nationkey = supplier.s_nationkey
where s_nationkey < 5;

We add a where clause to reduce the key set size so that the flow diagram is small and easy to
understand. We ran the query on a 7 node cluster; the values in the matrices P_ij for the tables
customer and supplier are obtained from the histograms.
P_ij(customer) =
    135464 163672 206312 225336 179908   3280 397208
    242064 256660  15252 111028 586464 271748 133496
    163672 184664 169904  19680 188928  22632 182696
     94300  53300 234192  49364  73636 127264 131364
    197948 581052  29520  69372 191880 123492 113816

P_ij(supplier) =
    256878 318364 0 0 559906 0  18318
     66030  83354 0 0  29678 0 232312
     27122   4118 0 0  40044 0 208030
    312542 188008 0 0 503106 0  60918
      2272  66314 0 0 123398 0  89176
So, the total P_ij for the flow matrix is the sum of the above two matrices, which is as follows:

P_ij =
    392342 482036 206312 225336 739814   3280 415526
    308094 340014  15252 111028 616142 271748 365808
    190794 188782 169904  19680 228972  22632 390726
    406842 241308 234192  49364 576742 127264 192282
    200220 647366  29520  69372 315278 123492 202992
In our testing, all the machines are under the same switch, and because of this we have stable ping
times across all machines. So the normalized matrix C_ij looks as follows:

C_ij =
    0 1 1 1 1 1 1
    1 0 1 1 1 1 1
    1 1 0 1 1 1 1
    1 1 1 0 1 1 1
    1 1 1 1 0 1 1
    1 1 1 1 1 0 1
    1 1 1 1 1 1 0
So, the matrix W_ij evaluates to the following:

W_ij =
    2072304 1982610 2258334 2239310 1724832 2461366 2049120
    1719992 1688072 2012834 1917058 1411944 1756338 1662278
    1020696 1022708 1041586 1191810  982518 1188858  820764
    1421152 1586686 1593802 1778630 1251252 1700730 1635712
    1388020  940874 1558720 1518868 1272962 1464748 1385248
Solving the flow network with these W_ij values, we get the optimal assignment of keys shown in table
4.3, with a shuffle volume of 618 megabytes. As we can see, multiple keys are assigned to a single node,
and this limit (node capacity) can be configured per node while solving the flow graph.
Key name  Assigned node
0         node5
1         node5
2         node7
3         node5
4         node2

Table 4.3 Key to Node assignment with optimized scheduler
The same query, when run with the default Hadoop scheduler, had the assignment in table 4.4, with a
shuffle volume of 726 megabytes.
Key name  Assigned node
0         node3
1         node2
2         node4
3         node5
4         node7

Table 4.4 Key to Node assignment with default scheduler
For building the matrix W_ij, we run the standard matrix multiplication algorithm, where n, the number
of nodes in the cluster, is generally in the lower hundreds even for medium to large clusters. So this
whole process takes a fraction of a second on a normal dual-core CPU machine. We discuss the performance
evaluation of our approach in detail in the next chapter, where the best, average and worst case
performances of our algorithm are presented along with the results.
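The W = P · C product described above can be sketched with the standard triple loop; the P and C values
below are small illustrative stand-ins, not the 5×7 matrices from the worked example.

```python
def matmul(P, C):
    """W[i][k] = sum_j P[i][j] * C[j][k] -- the standard triple loop."""
    n_keys, n_mach = len(P), len(C)
    return [[sum(P[i][j] * C[j][k] for j in range(n_mach))
             for k in range(n_mach)] for i in range(n_keys)]

# Illustrative instance: P[i][j] = size of key i on machine j; C is a
# uniform unit-cost matrix with zero diagonal (local data is free to read),
# like the C_ij used in the example.
P = [[10, 0, 5],
     [0, 8, 2]]
C = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
W = matmul(P, C)
print(W)  # [[5, 15, 10], [10, 2, 8]]
```

With a zero-diagonal uniform C, W[i][k] is simply the total size of key i minus its size already on
machine k, which matches the first entries of the example's W matrix.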
Chapter 5
Experimental Evaluation and Results
In this chapter we discuss the experimental evaluation of the approaches presented so far and present
the results obtained.
5.1 Experimental setup
We conducted all the experiments on a 10 node cluster comprising 1 master and 9 slaves. We used
TPCH datasets of scales 100 (100 gigabytes) and 300 (300 gigabytes) for testing the join queries. The
input queries we tested include both synthetic queries and the TPCH benchmark query set. Each machine
is equipped with 3.5 gigabytes of RAM and a 2.4 GHz dual-core CPU, and the machines are connected by a
10 Gbps network. This setup qualifies as a network of commodity hardware machines, loaded with Linux
and running a stable version of Hadoop. We patched Hive with each of the above algorithms to test them.
We used MySQL to serialize the histograms per node, as discussed.
5.2 Plan space evaluation and Complexity
In this section we describe in detail the plan space that each of our algorithms explores and compare it
with the overall bushy plan space. For the purpose of this discussion, we assume the queries to be
multiway joins over m tables, and we are trying to build n-balanced trees.
5.2.1 Algorithm FindMin
In each iteration of the algorithm with i nodes remaining, we consider i^2 pairs to find 2^⌊log2 i⌋
(the largest power of 2 not exceeding i) tables whose joins have the least cost. This means after an
iteration with i tables, the number of tables remaining will be i − 2^⌊log2 i⌋ + 1. So the total plan
space of this algorithm is

    ∑ (2^⌊log2 i⌋)^2, taken over the sequence starting at i = m and
    updated as i ← i − 2^⌊log2 i⌋ + 1 while i > 0
Coming to the complexity, with n input tables the algorithm calls selectMinPairs() ⌊log2 n⌋ times, and
each call takes O(n^2). So the total complexity of the algorithm is O(n^2 · log2 n).
5.2.2 Algorithm FindRecMin
In this algorithm, we find the minimum n-fully-balanced tree and fix it to build the rest of the tree.
In each iteration the number of tables decreases by n − 1, starting from the initial value of m. The
number of ways of finding an n-fully-balanced subtree from x tables is C(x, n) · n!, where C(x, n) is
the binomial coefficient. So the total plan space of the algorithm is as follows:

    ∑_{i ≥ 0, (m − i(n−1)) > n} C(m − i(n−1), n) · n!
In the algorithm, each iteration of the main loop selects an n-fully-balanced tree. Given m tables, this
loop runs ⌈(m−1)/(n−1)⌉ times, reducing the number of tables by n − 1 in each iteration. Each iteration
calls generateCombinations(x, n), which generates all n-sized combinations of the join set x. The
complexity of this call is O(x · n) for linear chain joins, since we need to consider only the n
adjacent tables for each input table. However, for cyclic or non-linear joins, the complexity of
generateCombinations(x, n) is C(x, n), and we iterate through all possible permutations to find the
best plan, so each iteration of the loop costs C(x, n) · n!, which is O(m^n). So the total complexity
of the algorithm is O(m^n · ⌈(m−1)/(n−1)⌉) for cyclic or non-linear joins and O(mn · ⌈(m−1)/(n−1)⌉)
for linear chain joins. n generally takes small values like 4 or 8, depending on the size of the
cluster.
5.2.3 Algorithm FindRecDeepMin
In this algorithm we build the tree bottom-up, making it n-fully balanced in each iteration. In the
beginning we have m tables and we build an n-fully-balanced tree (n < m); the number of ways of doing
this is C(m, n) · n!. The number of tables then decreases by n − 1 in each iteration, and we recursively
apply the same procedure. So the total plan space for this algorithm is

    ∏_{i ≥ 0, (m − i(n−1)) > n} C(m − i(n−1), n) · n!
Using the above formulae, the following table lists the plan spaces for various values of m, for n = 4
and n = 8.
m    FindMin  FindRecMin  FindRecMin  FindRecDeepMin  FindRecDeepMin  Total bushy trees
              (n = 4)     (n = 8)     (n = 4)         (n = 8)
3    4        6           6           6               6               12
4    7        25          24          24              24              120
5    14       122         120         240             120             1680
6    22       366         720         2160            720             30240
7    35       865         5040        20160           5040            665280
8    35       1802        40321       403200          40320           17297280
9    50       3390        362882      6531840         725760          518918400
10   67       5905        1814406     101606400       10886400        17643225600
11   90       9722        6652824     3193344000      159667200       670442572800
12   101      15270       19958520    77598259200     2395008000      28158588057600
13   128      23065       51892560    1743565824000   37362124800     1.30E+15
14   158      33746       121086000   76716896256000  610248038400    6.48E+16

Table 5.1 Plan space evaluation for the algorithms on queries with number of tables from 3 to 14
Number of tables  Hive     FindMin  FindRecMin  FindRecDeepMin  FindRecMin  FindRecDeepMin
                                    (n = 4)     (n = 4)         (n = 8)     (n = 8)
3                 178      121.2    121.2       121.2           121.2       121.2
4                 318      199      199         199             199         199
5                 342.199  293      255         255             255         255
6                 504.452  396      363         305             300         300
7                 637.73   547      547         547             580         580
8                 718.808  590      590         590             602         602
9                 773.317  488      496         458             378         378
10                870.155  644      626         561             402         402
11                994.457  684      656         656             664         664
12                995.565  712      698         608             428         428

Table 5.2 Algorithms' average runtime in seconds on queries with an increasing number of joins on a 10
node cluster and a TPCH 100 GB dataset, using 4- and 8-balanced trees
5.3 Algorithms performance
In this section, we describe in detail the performance of our algorithms on queries with an increasing
number of joins on our experimental setup. For our evaluation, we ran SQL queries with the number of
joined tables ranging from 3 to 12. We ran each such query on Hive using the default left-deep
execution plans, and also with the 3 algorithms mentioned above. Execution times are tabulated in
table 5.2 and plotted in figure 5.9.

From the graph we can see that the n-balanced trees perform well compared to the default Hive
execution, especially as the number of joins increases. Since we are working on 4-balanced trees, the
performance of all the algorithms remains the same while the number of tables is at most 4. Once this
number goes up, algorithms 2 and 3 perform well, as they consider a bigger plan space than algorithm 1.
It is interesting to see the performance of the 8-balanced trees: the graph for n = 8 is relatively
ragged compared to n = 4, owing to the fact that sometimes too much parallelization is overkill in some
TPCH Query  Hive      FindMin   FindRecMin  FindRecDeepMin
q2          362.443   321.634   266.234     266.234
q3          1238.635  1069.725  1069.725    1069.725
q10         1247.157  995.087   995.087     995.087
q11         321.236   274       274         274
q16         325.129   325.129   305.966     305.966
q18         1004.821  1002.882  1002.882    1002.882
q20         994.485   832.663   622.234     622.234

Table 5.3 Algorithms' runtime in seconds on TPCH benchmark queries on a 10 node cluster with a TPCH
300 GB dataset. FindRecMin and FindRecDeepMin are tested on 4-balanced trees since most of these
queries have few join tables
cases and suits the cluster in other cases, as described in the previous chapters. This is reflected in
the performance figures too.
We ran the standard TPCH benchmark queries on the same cluster of 10 nodes with the scale 300 dataset,
and the performance figures are tabulated in table 5.3 and plotted in figure 5.8. We included only the
queries with joins and selects, since they are the only ones relevant to our current work. Also, since
all these queries have 4 or fewer joined tables, we need not test on 8-balanced trees, as they
essentially give the same output as n = 4; hence all the results in figure 5.8 correspond to n = 4.
This is the reason queries like q16 and q18 show similar performance between Hive and our algorithms:
there are only 3 tables (2 joins) and there is no search space for our algorithms to explore. We show
the execution plans for each of these queries in figures 5.1 to 5.7 for the Commercial DBMS [9],
Postgres, Hive and FindRecDeepMin query planners, and the 'EXPLAIN' command output in appendix A. Each
figure is followed by a description that explains the performance improvements of our approach compared
to the non-parallelizable left-deep plans chosen by the other optimizers, and the whole summary is
tabulated in table 5.4.
q2   Commercial DBMS: left deep ((((ps ⋈ s) ⋈ n) ⋈ r) ⋈ p)
     Postgres:        left deep ((((r ⋈ n) ⋈ s) ⋈ ps) ⋈ p)
     Hive:            left deep ((((r ⋈ n) ⋈ s) ⋈ ps) ⋈ p)
     FindRecDeepMin:  4-balanced (((r ⋈ n) ⋈ (ps ⋈ p)) ⋈ s)

q3   Commercial DBMS: left deep ((c ⋈ o) ⋈ l)
     Postgres:        left deep ((l ⋈ o) ⋈ c)
     Hive:            left deep ((c ⋈ o) ⋈ l)
     FindRecDeepMin:  left deep ((l ⋈ o) ⋈ c)

q10  Commercial DBMS: left deep (((l ⋈ o) ⋈ c) ⋈ n)
     Postgres:        left deep (((c ⋈ n) ⋈ o) ⋈ l)
     Hive:            left deep (((c ⋈ o) ⋈ n) ⋈ l)
     FindRecDeepMin:  4-fully balanced ((c ⋈ n) ⋈ (l ⋈ o))

q11  Commercial DBMS: left deep ((s ⋈ n) ⋈ p)
     Postgres:        left deep ((p ⋈ s) ⋈ n)
     Hive:            left deep ((p ⋈ s) ⋈ n)
     FindRecDeepMin:  left deep ((s ⋈ n) ⋈ p)

q16  Commercial DBMS: left deep ((p ⋈ ps) ⋈ s)
     Postgres:        left deep ((p ⋈ ps) ⋈ s)
     Hive:            left deep ((p ⋈ s) ⋈ ps)
     FindRecDeepMin:  left deep ((p ⋈ ps) ⋈ s)

q18  Commercial DBMS: left deep ((l ⋈ o) ⋈ c)
     Postgres:        left deep ((c ⋈ o) ⋈ l)
     Hive:            left deep ((c ⋈ o) ⋈ l)
     FindRecDeepMin:  left deep ((c ⋈ o) ⋈ l)

q20  Commercial DBMS: left deep (((((ps ⋈ p) ⋈ l) ⋈ p) ⋈ s) ⋈ n)
     Postgres:        left deep (((((ps ⋈ p) ⋈ l) ⋈ p) ⋈ s) ⋈ n)
     Hive:            left deep (((((ps ⋈ p) ⋈ l) ⋈ p) ⋈ s) ⋈ n)
     FindRecDeepMin:  4-balanced (((ps ⋈ p) ⋈ (ps ⋈ l)) ⋈ (s ⋈ n))

Table 5.4 Summary of query plans for the TPCH dataset
(c = customer, s = supplier, ps = partsupp, r = region, l = lineitem, o = orders, n = nation)
(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.1 Query execution plan for q2

We can notice that the execution plans chosen by the Commercial DBMS, Postgres and Hive are all left
deep, whereas the algorithm FindRecDeepMin has chosen a 4-balanced tree with more parallelization, which
can run two joins (and hence two map-reduce jobs, (region join nation) and (partsupp join part)) at the
same time, thus resulting in effective utilization of the cluster. Also, at each join node we optimize
the reducer allocation based on the statistics.
(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.2 Query execution plan for q3

In query q3, we have only 2 joins, so there are few possible join orders and no parallelization is
possible. Each of the optimizers chooses a plan based on its own statistics, and for the FindRecDeepMin
algorithm we optimally schedule the reduce tasks based on the statistics.
(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.3 Query execution plan for q10

This query shows the difference between the parallel and non-parallel plans by building a fully
balanced tree with four tables. As we can see from the figure, the Commercial DBMS, Postgres and Hive
build left-deep plans, but FindRecDeepMin's plan can be easily parallelized, with two joins running at
the same time ((customer join nation) and (lineitem join orders)). This way we can optimally utilize
the task slots, scheduling tasks as and when a slot is available and thus using the cluster resources
properly.
(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.4 Query execution plan for q11

This query is similar to query q3 in figure 5.2. It has three tables and two joins. So each of the
optimizers chooses its best plan, and the FindRecDeepMin algorithm optimally schedules the reduce tasks
based on statistics.
5.4 Cost Formulae accuracy
In this section we analyze in detail the accuracy of the cost formulae we designed for the query
optimizer. Our main aim in designing the cost formulae is to quantify the cost of each phase of a join
map-reduce job. As explained already, a real map-reduce job is quite complex under the hood; we picked
the parts of each phase that incur significant costs, included them in the cost formulae, and assumed
worst-case scenarios to get an approximate upper bound for each phase. We ran 5 different map-reduce
join jobs with increasing input sizes, measured the runtimes of the phases, and mapped them against
the results from the cost formulae. The results are in figures 5.10 and 5.11.
From graph 5.10, it is clear that the estimated runtime from the cost formulae is roughly proportional
to the actual map phase execution time, although for all inputs the estimated cost is high. This is
because HDFS exploits block locality in most cases when reading the data, whereas the formulae use a
read throughput whose measurement includes reading remote blocks too. We also take the time of the
slowest machine to complete its map wave, since the slowest machine limits the total speed of the
cluster; in reality, tasks may get assigned to faster machines, resulting in faster execution times.

Unlike the map phase, the estimates of the shuffle and reduce phases are lower than the actual
runtimes, since the actual runtime includes the JVM startup time, which is not present in the cost
formulae. It varies from machine to machine and depends on factors such as JVM reuse, caching, etc.

(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.5 Query execution plan for q16

This query is similar to queries q3 and q11 in figures 5.2 and 5.4. It has three tables and two joins.
So each of the optimizers chooses its best plan, and the FindRecDeepMin algorithm optimally schedules
the reduce tasks based on statistics.
5.5 Efficiency of scheduling algorithm
In this section we discuss the efficiency of our scheduling algorithm from chapter 4. We ran various
join queries on a TPCH scale 100 dataset with input sizes up to 100 gigabytes. We ran eighteen different
queries to demonstrate the best-case and worst-case behaviour of the algorithm.
(a) Commercial DBMS (b) Postgres (c) Hive (d) FindRecDeepMin

Figure 5.6 Query execution plan for q18

This query is similar to queries q3, q11 and q16 in figures 5.2, 5.4 and 5.5. It has three tables and
two joins. So each of the optimizers chooses its best plan, and the FindRecDeepMin algorithm optimally
schedules the reduce tasks based on statistics.

This best or worst case of the algorithm (as compared to the default scheduler) depends on the
distribution of keys across the machines in the dataset, since that is what determines the shuffling of
keys across the
machines. To demonstrate the worst/best cases, we explain the data distribution that results in such
performance and show it experimentally on a sample distribution. We ran these queries twice, once with
the default Hadoop scheduling and once with the optimal scheduling allocation in place. We measured the
data shuffled into each machine in the cluster from the other machines in both executions and plotted
the graph in figure 5.12. The X-axis denotes the size of the input tables, which reaches up to 100
gigabytes, whereas the Y-axis measures the shuffle size. The red plot corresponds to the default
scheduler in Hadoop and the blue plot to the optimized shuffle. The actual shuffle values are tabulated
in table 5.5.

From the graph and data, we can see that the optimized shuffle data size is less than that of the
default Hadoop shuffle. This is because the optimized shuffle algorithm minimizes the network IO by
assigning
keys so as to minimize the network movement of data. We can also notice that the advantage of the
algorithm grows with the input data size, because the algorithm gains flexibility to try more possible
allocations as the data is spread out.
Input size  Optimized Shuffle  Default Shuffle  Num Shuffles  Num Shuffles  H/A
(GB)        Volume, MB (A)     Volume, MB (H)   (Optimized)   (Default)
2.42 24.07 37.33 177787 275688 1.55
12.1 122.39 181.28 903841 1338694 1.48
24.2 218.43 314.43 1612961 2321897 1.43
36.3 361.81 462.07 2671741 3412088 1.27
48.4 424.99 690.52 3138280 5099067 1.62
60.5 495.46 686.55 3658709 5069751 1.38
72.6 597.86 891.49 4414868 6583109 1.49
84.7 682.94 1373.80 5043081 10144671 2.01
96.8 958.08 1476.46 7074826 10902697 1.54
Table 5.5 Shuffle data size comparison of default and optimized algorithm tested on a TPCH scale 100
dataset
This algorithm works the best when all the values corresponding to the join key from both the join
tables lie on the same system so that the algorithm allocates the reduce task on the same machine and if
this holds for all the join keys, the total shuffle IO is zero since no join key should be moved to any other
system. We ran 9 different queries demonstrating this best case working and the results are in table 5.6
and plotted in figure 5.13 (Shuffle algorithm IO bytes are coincide with the X-axis as the shuffle volume
is zero due to perfect data locality). Since hadoop does not care about data locality while assigning
reduce tasks, we can clearly see that the data is being unnecessarily shipped to other machines resulting
in shuffle IO.
Figure 5.13 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance (shuffle size in
megabytes vs. input sizes in gigabytes, for Hadoop and the shuffle algorithm)
Input size  Default Shuffle  Num Shuffles  Optimized Shuffle  Num Shuffles
(GB)        Volume (bytes)   (Default)     Volume (bytes)     (Optimized)
2.42 131917296 821508 0 0
12.1 659591856 4107576 0 0
24.2 1539049806 9584358 0 0
36.3 1978778388 12322746 0 0
48.4 3078100760 19168723 0 0
60.5 3847626734 23960909 0 0
72.6 3957557628 24645498 0 0
84.7 5386676540 33545267 0 0
96.8 4397286800 27383890 0 0
Table 5.6 Shuffle algorithm on a TPCH scale 100 dataset - Best case performance
Figure 5.14 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance (shuffle size in
megabytes vs. input table sizes in gigabytes; the Hadoop and shuffle algorithm curves coincide)
Input size  Default Shuffle   Num Shuffles  Optimized Shuffle  Num Shuffles  H/A
(GB)        Volume, bytes (H) (Default)     Volume, bytes (A)  (Optimized)
2.42 131901300 821400 131901300 821400 1
12.1 659573700 4107450 659573700 4107450 1
24.2 1319168700 8215050 1319168700 8215050 1
36.3 1978763700 12322650 1978763700 12322650 1
48.4 2638358700 16430250 2638358700 16430250 1
60.5 3297932400 20537700 3297932400 20537700 1
72.6 3957527400 24645300 3957527400 24645300 1
84.7 4617122400 28752900 4617122400 28752900 1
96.8 5276717400 32860500 5276717400 32860500 1
Table 5.7 Shuffle algorithm on a TPCH scale 100 dataset - Worst case performance
Coming to the worst-case performance of the algorithm, it occurs when each key of the join column is
equally distributed across all the machines. In this case it does not matter which machine we ship a
key to; the overall shuffle IO is the same. This is equivalent to a perfectly uniform distribution of
rows per machine per join column value. We ran queries exhibiting this worst case, and our scheduler
performs exactly the same as Hadoop's default scheduler. The results are tabulated in table 5.7 and
plotted in figure 5.14.
(a) Commercial DBMS (b) Postgres
(c) Hive(d) FindRecDeepMin
Figure 5.7 Query execution plan for q20
This query highlights the parallelization achieved by the algorithm FindRecDeepMin compared to the other optimizers. As seen in the figure, the commercial DBMS, Postgres, and Hive all stick to the left-deep tree approach, whereas FindRecDeepMin provides multiple levels of parallelization. Instead of two parallel joins in the first step, its plan launches three join map-reduce jobs in parallel: (partsupp join part), (partsupp join lineitem), and (supplier join nation). This utilizes the cluster to the maximum, taking up slots as soon as other jobs release them. At the next level there are two join jobs in parallel (between the intermediate output tables). In addition, we schedule the tasks optimally based on the statistics, which together gave this query a substantial performance improvement over the other plans.
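The stage-level parallelism described above can be sketched as follows (a hypothetical helper, not the optimizer's actual scheduler): grouping map-reduce join stages by dependency depth shows the bushy q20-style plan finishing in three waves, where a left-deep chain of the same five joins needs five.

```python
# Sketch: independent join stages at the same dependency depth can be
# submitted to the cluster as concurrent map-reduce jobs ("waves").
from collections import defaultdict

def parallel_levels(deps):
    """deps[stage] = set of stages that must finish first.
    Returns stages grouped into waves that can run concurrently."""
    level = {}
    def depth(s):
        if s not in level:
            level[s] = 0 if not deps[s] else 1 + max(depth(p) for p in deps[s])
        return level[s]
    for s in deps:
        depth(s)
    waves = defaultdict(list)
    for s, d in level.items():
        waves[d].append(s)
    return [sorted(waves[d]) for d in sorted(waves)]

# q20-style bushy plan: three independent joins, then one, then one.
bushy = {
    "ps_part": set(), "ps_lineitem": set(), "supp_nation": set(),
    "mid_join": {"ps_part", "ps_lineitem"},
    "top_join": {"mid_join", "supp_nation"},
}
# Left-deep chain of the same number of joins.
left_deep = {f"j{i}": ({f"j{i-1}"} if i else set()) for i in range(5)}
print(len(parallel_levels(bushy)), len(parallel_levels(left_deep)))  # 3 5
```

With enough free slots, runtime is governed by the number of waves rather than the number of joins, which is the advantage the bushy plan exploits here.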
Figure 5.8 Runtime in seconds of the algorithms on TPCH benchmark queries (q2, q3, q10, q11, q16, q18, q20) on a 10-node cluster with a 300 GB TPCH dataset, comparing Hive, FindMin, FindRecMin, and FindRecDeepMin. FindRecMin and FindRecDeepMin are tested on 4-balanced trees since most of these queries join only a few tables.
Figure 5.9 Performance of the algorithms on a 100 GB dataset with synthetic queries: query runtime in seconds versus the number of tables in the join query (3 to 12), comparing Hive, FindMin, FindRecMin (n=4, n=8), and FindRecDeepMin (n=4, n=8)
Figure 5.10 Map phase cost formulae evaluation: actual versus estimated map-phase runtime as the input data size grows from 0 to 90 gigabytes
Figure 5.11 Reduce and shuffle cost formulae evaluation: actual versus estimated shuffle and reduce runtime as the input table sizes grow from 0 to 90 gigabytes
Figure 5.12 Comparison of default (Hadoop) versus optimal shuffle IO: shuffle size in megabytes as the input table sizes grow to 100 gigabytes
Chapter 6
Conclusions and Future Work
6.1 Conclusions and contributions
Database infrastructure has changed rapidly over the past few years, from high-end servers holding vast amounts of data to sets of commodity machines holding data in a distributed fashion. The map-reduce programming paradigm from Google has facilitated this transformation by providing highly scalable, fault-tolerant distributed systems of production quality. The increase in dataset sizes has posed various problems to system designers in terms of complexity and scalability. Since many companies still rely on the SQL standard for analytics, SQL has been ported to map-reduce based systems as well and has become a de-facto standard there. This poses new problems for query optimizers owing to the complexity of such systems and the scale at which they operate. As part of this work, we have built a new optimizer from scratch for joins over the map-reduce framework, based on traditional relational database style optimizations. Our contributions can be summed up as follows.
• We built a query optimizer from scratch for map-reduce based SQL systems, following traditional database-style optimizations using statistics and cost formulae.
• We designed a distributed statistics store for data in HDFS and built cost formulae for the join operators in Hive.
• We explored a new subset of the bushy plan space, called n-balanced trees, based on the fact that their inherent parallelization suits the map-reduce framework.
• We formulated the shuffle in a map-reduce job as a max-flow min-cut problem and found the optimal assignments to reduce network IO.
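As a toy illustration of the shuffle formulation (hypothetical code, not the thesis solver, and reducer capacity constraints are ignored here), note that without capacities the flow problem degenerates to gathering each join key on the machine that already holds most of its rows:

```python
# Toy version of the shuffle-assignment idea. Capacity constraints,
# which motivate the full max-flow/min-cut formulation, are ignored:
# absent capacities, minimizing shipped rows means sending each key
# to the machine that already holds most of its data.

def assign_keys(rows_per_machine):
    """rows_per_machine[key][machine] -> rows of that key on that machine."""
    return {
        key: max(counts, key=counts.get)  # machine with most local data
        for key, counts in rows_per_machine.items()
    }

def shipped(rows_per_machine, assignment):
    """Total rows moved over the network for a given assignment."""
    return sum(
        c
        for key, counts in rows_per_machine.items()
        for m, c in counts.items()
        if m != assignment[key]
    )

skewed = {"k1": {"m0": 900, "m1": 50}, "k2": {"m0": 10, "m1": 400}}
opt = assign_keys(skewed)
default = {"k1": "m0", "k2": "m0"}  # e.g. a hash-partitioned default
print(shipped(skewed, default), shipped(skewed, opt))  # 450 60
```

The full formulation adds per-reducer capacities, which is what turns this greedy choice into a flow problem.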
Chapter 3 discusses the plan space of n-balanced trees and their applicability to the map-reduce framework based on statistics. The results show that accurate statistics are very useful for predicting the sizes of intermediate results and help us pick the best plan from the search space of n-balanced trees. The results also show the benefits of using n-balanced trees on a massively parallel framework like map-reduce compared to the traditional left-deep join trees used in Hive. One should, however, be careful while choosing the value of n, as too much parallelization can be an overkill, with every job waiting for the others to complete. In addition, our optimal shuffle max-flow min-cut formulation reduces the network IO from other nodes and improves query performance. The ideas presented in this work can be used in any standard map-reduce based SQL system with a few changes; as a proof of concept we modified Hive for our experimental analysis, and a similar approach is possible for other systems.
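The intuition behind n-balanced trees can be sketched as follows (hypothetical helper functions, not the thesis implementation): splitting the join tables into at most n groups per level shrinks the number of sequential join levels from len(tables)-1 for a left-deep plan to roughly log base n of the table count.

```python
# Sketch: build an n-balanced grouping of join tables. Sibling groups
# at each level are independent and can be joined as concurrent
# map-reduce jobs; only the tree depth is sequential.
import math

def n_balanced(tables, n):
    """Recursively split tables into at most n groups per level."""
    if len(tables) <= n:
        return tables
    size = math.ceil(len(tables) / n)
    groups = [tables[i:i + size] for i in range(0, len(tables), size)]
    return [n_balanced(g, n) for g in groups]

def depth(tree):
    """Number of sequential join levels in the grouped plan."""
    if isinstance(tree, str) or all(isinstance(t, str) for t in tree):
        return 1
    return 1 + max(depth(t) for t in tree)

tables = [f"t{i}" for i in range(8)]
print(depth(n_balanced(tables, 4)))  # 2 join levels instead of 7 left-deep
```

The n=4 value mirrors the 4-balanced trees used in the experiments; larger n gives wider but shallower plans, which is exactly the over-parallelization trade-off noted above.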
6.2 Future work
In this section we discuss possible directions to extend this work.
• This work focuses on the most basic problem of query optimization, joins with selection and projection operators, since a join is considered costly relative to other operators. The work can be extended to other SQL operators such as aggregation, group-by, and subqueries; for that we need to design appropriate cost formulae for each of them and follow similar techniques from the relational world.
• Another important direction is to improve the cost formulae for join operators. Currently we obtain an upper-bound cost based on statistics. Since a map-reduce system is very complex in terms of design, we need to take into account many factors, ranging from cache and buffer sizes to network speeds, and also track all the round trips from disk to memory in each phase of a job. Taking every such factor into account and designing accurate cost formulae for each operator would enhance the query optimizer.
• A map-reduce job is designed to be fault tolerant and can sustain task failures. The current optimizer does not take failures into account and calculates operator costs without considering such scenarios. However, failures can increase the cost of each SQL operator, and including them in the optimizer might give more accurate costs and thus better plans. To model failure scenarios we need to consider factors like the load on each machine (number of mappers and reducers), scheduler properties such as preemption, and the mean time to failure (MTTF) of nodes.
• The performance of a map-reduce cluster depends on how well the scheduler works. Many schedulers are available today that decide whether to launch or kill a task or job based on factors like priority, load, and fairness. Since the cost of each operator depends on how the scheduler behaves, we need to take such scheduler-specific properties into consideration when building the optimizer.
• Block placement in a distributed system plays an important role in the cost of processing. Placing blocks closer to the code reduces network IO and increases query performance. The same principle applies to HDFS: since data is split into blocks, we can design an optimal block allocation for a set of queries that reduces the overall cost of processing them. The problem then translates to finding an optimal HDFS block allocation given a set of queries and their processing costs.
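As a rough illustration of this last direction (all names are hypothetical, and a real allocator must also handle replication, rebalancing, and full nodes), even a greedy placement that puts each block on the node whose tasks read it most captures the flavor of the optimization problem:

```python
# Hypothetical sketch: given per-node expected read frequencies for each
# HDFS block (derived from a query workload), place blocks so that more
# reads stay node-local. Assumes total capacity suffices for all blocks.

def place_blocks(access_freq, capacity):
    """access_freq[block][node] -> expected reads of block by tasks on node.
    capacity[node] -> how many blocks the node may hold.
    Greedily place each block on its hottest node that still has room."""
    load = {n: 0 for n in capacity}
    placement = {}
    # Place the most-read blocks first so they get their preferred node.
    for block in sorted(access_freq, key=lambda b: -sum(access_freq[b].values())):
        candidates = sorted(access_freq[block], key=access_freq[block].get,
                            reverse=True)
        node = next(n for n in candidates if load[n] < capacity[n])
        placement[block] = node
        load[node] += 1
    return placement

freq = {"b1": {"n1": 9, "n2": 1}, "b2": {"n1": 8, "n2": 2},
        "b3": {"n1": 1, "n2": 5}}
print(place_blocks(freq, {"n1": 2, "n2": 2}))
```

Turning this greedy heuristic into an optimal allocation for a given query set is the open problem proposed above.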
Appendix A
Query execution plans for TPCH queries
A.1 q2
A.1.1 Postgres
Nested Loop (cost=36.91..61.34 rows=1 width=730)
Join Filter: (n.n_nationkey = s.s_nationkey)
-> Hash Join (cost=12.14..24.48 rows=1 width=108)
Hash Cond: (n.n_regionkey = r.r_regionkey)
-> Seq Scan on nation n (cost=0.00..11.70 rows=170 width=112)
-> Hash (cost=12.12..12.12 rows=1 width=4)
-> Seq Scan on region r (cost=0.00..12.12 rows=1 width=4)
Filter: (r_name = ’EUROPE’::bpchar)
-> Hash Join (cost=24.77..36.84 rows=1 width=630)
Hash Cond: (s.s_suppkey = ps.ps_suppkey)
-> Seq Scan on supplier s (cost=0.00..11.50 rows=150 width=510)
-> Hash (cost=24.76..24.76 rows=1 width=128)
-> Hash Join (cost=12.41..24.76 rows=1 width=128)
Hash Cond: (ps.ps_partkey = p.p_partkey)
-> Seq Scan on partsupp ps (cost=0.00..11.70 rows=170 width=24)
-> Hash (cost=12.40..12.40 rows=1 width=108)
-> Seq Scan on part p (cost=0.00..12.40 rows=1 width=108)
Filter: (((p_type)::text ~~ ’\%BRASS’::text) AND (p_size = 15))
A.1.2 Hive
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-1 depends on stages: Stage-4
Stage-2 depends on stages: Stage-1
Stage-3 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-4
Map Reduce
Alias -> Map Operator Tree:
nation
TableScan alias: n
Reduce Output Operator
region
TableScan alias: r
Filter Operator (r_name = ’EUROPE’)
Reduce Operator Tree: Join Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
supplier
TableScan alias: s
Reduce Output Operator
Reduce Operator Tree: Join Operator
condition map:
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
partsupp
TableScan alias: ps
Reduce Output Operator
Stage: Stage-3
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
part
TableScan alias: p
Filter Operator
predicate: expr: ((p_size = 15) and (p_type like ’\%BRASS’))
Reduce Output Operator
Reduce Operator Tree:
Join Operator, File Output Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.1.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 is a root stage
Stage-3 depends on stages - Stage-1,Stage-2
Stage-4 depends on stages - Stage-3
Stage-0 is a root stage
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
nation
TableScan alias: n
Reduce Output Operator
region
TableScan alias: r
Filter Operator (r_name = ’EUROPE’)
Reduce Operator Tree: Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
partsupp
TableScan alias: ps
Reduce Output Operator
part
TableScan alias: p
Filter Operator predicate: expr: ((p_size = 15) and (p_type like ’\%BRASS’))
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-3
Map Reduce
Alias -> Map Operator Tree:
Stage-1-out
TableScan alias: s1
Stage-2-out
TableScan alias: s2
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-4
Map Reduce
Alias -> Map Operator Tree:
Stage-3-out
TableScan alias: s3
supplier
TableScan alias: s
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.2 q3
A.2.1 Postgres
Hash Join (cost=24.67..37.43 rows=1 width=12)
Hash Cond: (l.l\_orderkey = o.o\_orderkey)
-> Seq Scan on lineitem l (cost=0.00..12.50 rows=67 width=4)
Filter: (l_shipdate > ’1995-03-15’::date)
-> Hash (cost=24.66..24.66 rows=1 width=12)
-> Hash Join (cost=11.76..24.66 rows=1 width=12)
Hash Cond: (o.o\_custkey = c.c\_custkey)
-> Seq Scan on orders o (cost=0.00..12.62 rows=70 width=16)
Filter: (o\_orderdate < ’1995-03-15’::date)
-> Hash (cost=11.75..11.75 rows=1 width=4)
-> Seq Scan on customer c (cost=0.00..11.75 rows=1 width=4)
Filter: (c\_mktsegment = ’BUILDING’::bpchar)
A.2.2 Hive
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
customer
TableScan alias: c
Filter Operator (c\_mktsegment = ’BUILDING’)
Reduce Output Operator
orders
TableScan alias: o
Filter Operator (o\_orderdate < ’1995-03-15’)
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
lineitem
TableScan alias: l
Filter Operator (l\_shipdate > ’1995-03-15’)
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.2.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
lineitem
TableScan alias: l
Filter Operator (l\_shipdate > ’1995-03-15’)
Reduce Output Operator
orders
TableScan alias: o
Filter Operator (o\_orderdate < ’1995-03-15’)
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
customer
TableScan alias: customer
Filter Operator (c\_mktsegment = ’BUILDING’)
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.3 q10
A.3.1 Postgres
Nested Loop (cost=25.11..49.97 rows=1 width=606)
Join Filter: (o.o\_orderkey = l.l\_orderkey)
-> Hash Join (cost=25.11..37.46 rows=1 width=610)
Hash Cond: (n.n\_nationkey = c.c\_nationkey)
-> Seq Scan on nation n (cost=0.00..11.70 rows=170 width=108)
-> Hash (cost=25.10..25.10 rows=1 width=510)
-> Hash Join (cost=13.16..25.10 rows=1 width=510)
Hash Cond: (c.c\_custkey = o.o\_custkey)
-> Seq Scan on customer c (cost=0.00..11.40 rows=140 width=506)
-> Hash (cost=13.15..13.15 rows=1 width=8)
-> Seq Scan on orders o (cost=0.00..13.15 rows=1 width=8)
Filter: ((o\_orderdate >= ’1993-10-01’::date) AND (o\_orderdate < ’1994-01-01’::date))
-> Seq Scan on lineitem l (cost=0.00..12.50 rows=1 width=4)
Filter: (l.l\_returnflag = ’R’::bpchar)
A.3.2 Hive
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-3 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
customer
TableScan alias: c
Reduce Output Operator
orders
TableScan alias: o
Filter Operator ((o\_orderdate >= ’1993-10-01’) and (o\_orderdate <
’1994-01-01’))
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
nation
TableScan alias: n
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-3
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
lineitem
TableScan alias: l
Filter Operator (l\_returnflag = ’R’)
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.3.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-3 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
customer
TableScan alias: c
Reduce Output Operator
nation
TableScan alias: n
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
lineitem
TableScan alias: l
Filter Operator (l\_returnflag = ’R’)
Reduce Output Operator
orders
TableScan alias: o
Filter Operator ((o\_orderdate >= ’1993-10-01’) and (o\_orderdate < ’1994-01-01’))
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-3
Map Reduce
Alias -> Map Operator Tree:
Stage-1-out
TableScan alias: s1
Stage-2-out
TableScan alias: s2
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.4 q11
A.4.1 Postgres
Hash Join (cost=24.22..36.57 rows=1 width=20)
Hash Cond: (ps.ps\_suppkey = s.s\_suppkey)
-> Seq Scan on partsupp ps (cost=0.00..11.70 rows=170 width=24)
-> Hash (cost=24.21..24.21 rows=1 width=4)
-> Hash Join (cost=12.14..24.21 rows=1 width=4)
Hash Cond: (s.s\_nationkey = n.n\_nationkey)
-> Seq Scan on supplier s (cost=0.00..11.50 rows=150 width=8)
-> Hash (cost=12.12..12.12 rows=1 width=4)
-> Seq Scan on nation n (cost=0.00..12.12 rows=1 width=4)
Filter: (n\_name = ’GERMANY’::bpchar)
A.4.2 Hive
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
partsupp
TableScan alias: ps
Reduce Output Operator
supplier
TableScan alias: s
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
nation
TableScan alias: n
Filter Operator (n\_name = ’GERMANY’)
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.4.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
supplier
TableScan alias: s
Reduce Output Operator
nation
TableScan alias: n
Filter Operator (n\_name = ’GERMANY’)
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
partsupp
TableScan alias: ps
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.5 q16
A.5.1 Postgres
Hash Join (cost=27.57..44.13 rows=139 width=120)
Hash Cond: (ps.ps\_suppkey = s.s\_suppkey)
-> Hash Join (cost=13.82..28.40 rows=158 width=120)
Hash Cond: (p.p\_partkey = ps.ps\_partkey)
-> Seq Scan on part p (cost=0.00..12.40 rows=158 width=120)
Filter: ((p\_brand <> ’Brand45’::bpchar) AND ((p\_type)::text !~~ ’MEDIUM POLISHED%’::text))
-> Hash (cost=11.70..11.70 rows=170 width=8)
-> Seq Scan on partsupp ps (cost=0.00..11.70 rows=170 width=8)
-> Hash (cost=11.88..11.88 rows=150 width=4)
-> Seq Scan on supplier s (cost=0.00..11.88 rows=150 width=4)
Filter: ((s\_comment)::text !˜˜ ’\%Customer%Complaints\%’::text)
A.5.2 Hive
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
part
TableScan alias: p
Filter Operator ((p\_brand <> ’Brand45’) and (not (p\_type like ’MEDIUM POLISHED\%’)))
Reduce Output Operator
supplier
TableScan
alias: s
Filter Operator (not (s\_comment like ’\%Customer\%Complaints\%’))
Reduce Output Operator
Reduce Operator Tree: Join Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
partsupp
TableScan
alias: ps
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.5.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
partsupp
TableScan alias: ps
Reduce Output Operator
part
TableScan alias: p
Filter Operator ((p\_brand <> ’Brand45’) and (not (p\_type like ’MEDIUM POLISHED\%’)))
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
Reduce Output Operator
supplier
TableScan alias: s
Filter Operator (not (s\_comment like ’\%Customer\%Complaints\%’))
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.6 q18
A.6.1 Postgres
Hash Join (cost=43.02..57.26 rows=49 width=96)
Hash Cond: (l.l\_orderkey = o.o\_orderkey)
-> Seq Scan on lineitem l (cost=0.00..12.00 rows=200 width=4)
-> Hash (cost=42.40..42.40 rows=49 width=100)
-> Hash Join (cost=26.49..42.40 rows=49 width=100)
Hash Cond: (o.o\_custkey = c.c\_custkey)
-> Hash Join (cost=13.34..28.50 rows=70 width=32)
Hash Cond: (o.o\_orderkey = t.l\_orderkey)
-> Seq Scan on orders o (cost=0.00..12.10 rows=210 width=28)
-> Hash (cost=12.50..12.50 rows=67 width=4)
-> Seq Scan on lineitem t (cost=0.00..12.50 rows=67
width=4)
Filter: (l\_quantity > 300::numeric)
-> Hash (cost=11.40..11.40 rows=140 width=72)
-> Seq Scan on customer c (cost=0.00..11.40 rows=140 width
=72)
A.6.2 Hive
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
customer
TableScan
alias: c
Reduce Output Operator
orders
TableScan
alias: o
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
l
TableScan
alias: l
Reduce Output Operator
key expressions:
expr: l\_orderkey
type: int
sort order: +
Map-reduce partition columns:
expr: l\_orderkey
type: int
tag: 2
t
TableScan
alias: t
Filter Operator (l\_quantity > 300.0)
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
A.6.3 FindRecDeepMin
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
customer
TableScan
alias: c
Reduce Output Operator
orders
TableScan
alias: o
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
\$INTNAME
l
TableScan
alias: l
Reduce Output Operator
key expressions:
expr: l\_orderkey
type: int
sort order: +
Map-reduce partition columns:
expr: l\_orderkey
type: int
tag: 2
t
TableScan
alias: t
Filter Operator (l\_quantity > 300.0)
Reduce Operator Tree: Join Operator
Stage: Stage-0
Fetch Operator
limit: -1
Bibliography
[1] Apache Hadoop. http://hadoop.apache.org/.
[2] Apache Hive. http://hive.apache.org/.
[3] Apache PoweredBy. http://wiki.apache.org/hadoop/PoweredBy.
[4] Cern data scale. http://home.web.cern.ch/about/updates/2013/04/
animation-shows-lhc-data-processing.
[5] Data Scale. http://www.comparebusinessproducts.com/fyi/
10-largest-databases-in-the-world.
[6] Facebook scale. https://www.facebook.com/notes/paul-yang/
moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/
10150246275318920.
[7] HDFS architecture. http://hadoop.apache.org/docs/stable1/hdfs_design.html.
[8] Map Reduce architecture diagram . http://biomedicaloptics.spiedigitallibrary.org/
article.aspx?articleid=1167145.
[9] SQL server tpch plans.
http://researchweb.iiit.ac.in/˜bharath.v/sql_server_plans.pdf.
[10] Towards Join Order Templates based Query Optimization: An Empirical Evaluation. http://web2py.
iiit.ac.in/research_centres/publications/view_publication/phdthesis/12.
[11] Yahoo scale. http://developer.yahoo.com/blogs/hadoop/
scaling-hadoop-4000-nodes-yahoo-410.html.
[12] F. Afrati, A. Sarma, D. Menestrina, A. Parameswaran, and J. Ullman. Fuzzy joins using mapreduce. In
Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages 498–509, 2012.
[13] F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In Proceedings of the 13th
International Conference on Extending Database Technology, EDBT ’10, pages 99–110, New York, NY,
USA, 2010. ACM.
[14] S. Chaudhuri. An overview of query optimization in relational systems. In Proceedings of the Seventeenth
ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’98, pages 34–43,
New York, NY, USA, 1998. ACM.
[15] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM,
51(1):107–113, Jan. 2008.
[16] S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. SIGOPS Oper. Syst. Rev.,
37(5):29–43, Oct. 2003.
[17] G. Graefe. Volcano— an extensible and parallel query evaluation system. IEEE Trans. on Knowl. and
Data Eng., 6(1):120–135, Feb. 1994.
[18] G. Graefe and W. J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In
Proceedings of the Ninth International Conference on Data Engineering, pages 209–218, Washington, DC,
USA, 1993. IEEE Computer Society.
[19] G. Graefe and W. J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In
Proceedings of the Ninth International Conference on Data Engineering, pages 209–218, Washington, DC,
USA, 1993. IEEE Computer Society.
[20] L. M. Haas, M. J. Carey, M. Livny, and A. Shukla. Seeking the truth about ad hoc join costs. The VLDB
Journal, 6(3):241–256, 1997.
[21] E. P. Harris and K. Ramamohanarao. Join algorithm costs revisited. The VLDB Journal, 5(1):064–084,
Jan. 1996.
[22] T. Ibaraki and T. Kameda. On the optimal nesting order for computing n-relational joins. ACM Trans.
Database Syst., 9(3):482–502, Sept. 1984.
[23] Y. Ioannidis. The history of histograms (abridged). In Proceedings of the 29th International Conference on
Very Large Data Bases - Volume 29, VLDB ’03, pages 19–30. VLDB Endowment, 2003.
[24] Y. E. Ioannidis. Query optimization. ACM Comput. Surv., 28(1):121–123, Mar. 1996.
[25] Y. E. Ioannidis and Y. C. Kang. Left-deep vs. bushy trees: An analysis of strategy spaces and its
implications for query optimization. SIGMOD Rec., 20(2):168–177, Apr. 1991.
[26] K. Karlapalem and N. M. Pun. Query driven data allocation algorithms for distributed database systems. In
in 8th International Conference on Database and Expert Systems Applications (DEXA’97), Toulouse,
Lecture Notes in Computer Science 1308, pages 347–356, 1997.
[27] R. S. G. Lanzelotte, P. Valduriez, and M. Zaı̈t. On the effectiveness of optimization search strategies for
parallel execution spaces. In Proceedings of the 19th International Conference on Very Large Data Bases,
VLDB ’93, pages 493–504, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.
[28] R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. Ysmart: Yet another sql-to-mapreduce translator.
In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, pages 25–36, 2011.
[29] L. Lin, V. Lychagina, W. Liu, Y. Kwon, S. Mittal, and M. Wong. Tenzing a sql implementation on the
mapreduce framework.
[30] L. F. Mackert and G. M. Lohman. R* optimizer validation and performance evaluation for distributed
queries. In Proceedings of the 12th International Conference on Very Large Data Bases, VLDB ’86, pages
149–159, San Francisco, CA, USA, 1986. Morgan Kaufmann Publishers Inc.
[31] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. SIGMOD Rec.,
27(2):448–459, June 1998.
[32] A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In Proceedings of the 2011 ACM
SIGMOD International Conference on Management of Data, SIGMOD ’11, pages 949–960, New York,
NY, USA, 2011. ACM.
[33] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for
data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of
Data, SIGMOD ’08, pages 1099–1110, New York, NY, USA, 2008. ACM.
[34] J. B. Orlin. A faster strongly polynomial minimum cost flow algorithm. In OPERATIONS RESEARCH,
pages 377–387, 1988.
[35] V. Poosala. Histogram-based Estimation Techniques in Database Systems. PhD thesis, Madison, WI,
USA, 1997. UMI Order No. GAX97-16074.
[36] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, Inc., New York, NY,
USA, 3 edition, 2003.
[37] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a
relational database management system. In Proceedings of the 1979 ACM SIGMOD International
Conference on Management of Data, SIGMOD ’79, pages 23–34, New York, NY, USA, 1979. ACM.
[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In Mass Storage
Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10, 2010.
[39] M. Steinbrunn, G. Moerkotte, and A. Kemper. Heuristic and randomized optimization for the join ordering
problem. The VLDB Journal, 6(3):191–208, Aug. 1997.
[40] A. Swami. Optimization of large join queries: Combining heuristics and combinatorial techniques.
SIGMOD Rec., 18(2):367–376, June 1989.
[41] A. Swami and B. Iyer. A polynomial time algorithm for optimizing join queries. In Data Engineering,
1993. Proceedings. Ninth International Conference on, pages 345–354, 1993.
[42] A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a
petabyte scale data warehouse using hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International
Conference on, pages 996–1005, 2010.
[43] R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In Proceedings
of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pages
495–506, New York, NY, USA, 2010. ACM.