DISTRIBUTED QUERY EXECUTION FRAMEWORK
FOR BIG SPATIAL DATA
By
Bharat Singhvi
Roll Number: 10305912
Under Guidance of
Prof. N. L. Sarda
MTP Stage I Report
Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Technology in Computer Science and Engineering
in the Kanwal Rekhi School of Information Technology at Indian Institute of Technology, Bombay
Mumbai, Maharashtra
‘If there’s one thing I like, it’s a quiet life. I’m not one of those fellows who get all restlessand depressed if things aren’t happening to them all the time. You can’t make it too placidfor me. Give me regular meals, a good show with decent music every now and then, andone or two pals to totter round with, and I ask no more.’
– Bertie Wooster
Acknowledgments
I would like to express my deepest gratitude towards Prof. N. L. Sarda for his constant guid-
ance and support in my pursuits. Without his insight and suggestions, it would not have been
possible to shape this project correctly. I would like to thank him for his valuable time and his
great involvement in the project.
I would also like to thank Dr. Smita Sengupta for hearing my excruciating thoughts countless
times and for making me believe that it is okay to try and fail.
I would never have been able to work without the support of my friends and colleagues from
GISE Lab. Thank you all for bearing with my presence and creating a positive work environment.
Bharat Singhvi
10305912
DISTRIBUTED QUERY EXECUTION FRAMEWORK FOR BIG
SPATIAL DATA
Bharat Singhvi
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai, Maharashtra
2012
ABSTRACT
An incredible amount of data is being generated these days, with sources as diverse as
satellites, scientific setups like the LHC, social networks and sensor networks; so much so
that the term "big data" has been coined specifically to describe it. Spatial extensions of
this data have become very common, with people sharing their location on social networks
and spatio-temporal data coming in from sensors and scientific setups spanning diverse
domains like oceanography, remote sensing and intelligent traffic management. Various
technologies have been designed specifically to mine this huge amount of data for meaningful
information. Several business intelligence tools use spatial data mining to solve complex
problems like facility placement, entering new markets and diversifying businesses.
The literature survey has revealed that while advanced systems have been built to handle
large-scale non-spatial data, these systems lack efficient mechanisms for analyzing spatial data.
Distributed database management systems have been a popular choice for managing big data.
While the face of computing has changed from distributed systems to cloud computing, which
provides highly reliable, available and fault-tolerant services, the underlying mechanisms have
remained the same. Data-oriented cloud services are defining this age of "petabyte data" by
providing facilities for data storage as well as computing, harnessing the power of distributed
systems and resource virtualization.
The aim of this thesis is to design a seamless framework for distributed query processing over
spatial data which enables the community to handle big spatial data efficiently. We propose to
design the architecture for such a system and implement the system over existing technologies.
The central idea of this thesis is to enable distributed database management systems to perform
complex geometrical operations over spatial data efficiently. We intend to bring the positives
of distributed and parallel database computing paradigms together and implement the overall
solution.
Table of Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Distributed Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Fragmentation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Placement of Fragments . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Distributed Query Processing . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Existing Distributed Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Literature Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Technology Overview for Big Data Processing . . . . . . . . . . . . . . . 15
3.1.2 Parallel DBMS vs. Distributed DBMS . . . . . . . . . . . . . . . . . . . 19
3.1.3 Distributed Processing in Geospatial Data . . . . . . . . . . . . . . . . . 20
3.2 Gap Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Thesis Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5 Distributed Query Translation Framework . . . . . . . . . . . . . . . . . . . . 25
5.1 Introduction to Translation Framework . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Query Translation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3.1 Classification of fragments . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3.2 Locating the data sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.4 Genetic Algorithm for Query Translation . . . . . . . . . . . . . . . . . . . . . . 30
5.4.1 Identifying Fragment Class and Generating Fragment Trees . . . . . . . . 30
5.4.2 Mapping Fragment Query Trees to Data Sites . . . . . . . . . . . . . . . 32
6 Conclusion and Future Work Plan . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.2 Work Plan for MTP Stage II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 1
Introduction
In the past few years, applications like Location Aware Services and WebGIS which thrive on
spatial data have gained significance in academic research as well as industry. These applications
are heavily dependent on spatial processing capabilities of the system. However, with the huge
amount of data to be processed, the existing techniques in query processing of spatial data are no
longer sufficient to meet the processing requirements. Query processing techniques for spatial
data mainly rely on the functions of spatial databases and key-value stores. Spatial databases
like Oracle Spatial and PostgreSQL with the PostGIS spatial extension provide query language
support for processing spatial data and are typically used to deal with relatively small data
sets. Spatial queries are known to be both I/O intensive as well as compute intensive which
means that a single query can take several minutes or even hours to produce results when
executed in a spatial database. As an experiment, the road network of Washington State,
available from OpenStreetMap, which consists of half a million nodes (point data) and over
1.2 million edges (polyline data), was stored in PostgreSQL installed on a Windows 7 system
with 4 GB of memory and 2 processing cores. A spatial index was created over the geometry
data. A simple buffer operation and intersection computation, finding all edges which lie
within a 5 km radius of a particular node, took 13 minutes to produce results. This simple
illustration shows that spatial databases would not suffice in a multi-user environment.
Another technique employs key-value stores like BigTable, HBase and Cassandra, which use
distributed query processing to perform spatial computations. While key-value stores can
execute queries faster, their indexing mechanisms do not use spatial indexes [39], which
results in inefficient query execution.
This requires one to model a system which is capable of processing big spatial data efficiently by
harnessing the inherent features of spatial data. Since spatial queries require high computational
capabilities, parallel and distributed computing paradigms form an ideal platform for system
design.
1.1 Motivating Example
To formulate a motivating example for designing an efficient spatial query processing system,
let us consider Location Aware Services. These services rely on GPS-enabled mobile devices
which collect information about users' whereabouts and use this information for a plethora
of applications like intelligent traffic management, contextual advertising, emergency services,
and locating nearby restaurants, hotels or friends in the vicinity. According to a
study conducted by McKinsey Global Institute, it has been estimated that by using location
data, consumers can save over 600 billion dollars annually by the year 2020 in terms of time
and fuel savings [23]. IBM's intelligent transportation project uses location data to increase
awareness of the situation across the road network and uses predictive paradigms to compute
future traffic volumes. It centralizes traffic management and traffic operations by collecting
information across geographically diverse locations [22]. Looking at the spatial processing
aspect of these examples, we can quickly conclude that intelligent traffic management requires
analysis of spatio-temporal data, while contextual advertising and the other services require
buffer and hotspot spatial analyses. Another application of spatial data processing is found
in the facility placement problem, where aggregation queries over spatial data and non-spatial
attributes like annual salary and spending habits of consumers are performed in order to find
the best location for opening a new facility.
Let us formulate a concrete example. Consider the city of Mumbai, which has over 71 million
cellphone users. Without loss of generality, let us assume that 50 million of these users
have GPS-enabled devices. The road network data of Mumbai roughly consists of 6000 nodes
and 14000 edges. If 50000 of these users simultaneously wish to locate places in their vicinity,
the system must perform spatial aggregation for distance-based classification of the geographical
region for all 50000 users at once. How much time, on average, would it take a single database
system to respond in this situation?
An experiment was conducted to answer this question. The road network of Mumbai was stored
in a PostgreSQL database with the PostGIS and pgRouting extensions. 10000 points spread
across Mumbai were generated randomly to serve as places of interest. The database consisted
of three tables: nodes, edges and places. The query was formulated as follows: "Find restaurants
within a 1 km radius of my current location such that the actual network distance between the
place and the current location is less than 2 km". 500 users were simulated by running 500
threads executing this query over the database. The average response time for each thread
turned out to be 7 seconds. As the number of users increases, we can expect the average
response time to go up further. This illustrates the scalability issue of single-node spatial
databases.
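The structure of such a load simulation can be sketched as follows. This is a hypothetical reconstruction, not the code used in the experiment above: the actual runs issued the PostGIS query over a database connection per thread, which is stubbed out here with a placeholder `run_query` function so that the threading and timing skeleton stands on its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(user_location):
    """Placeholder for the nearest-places query; the real experiment
    executed a PostGIS query over a database connection here."""
    time.sleep(0.01)  # stand-in for query latency
    return []

def simulate_users(num_users):
    """Issue one query per simulated user and report the mean response time."""
    def timed_query(uid):
        start = time.perf_counter()
        run_query((72.8 + uid * 1e-4, 19.0))  # hypothetical lon/lat near Mumbai
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=50) as pool:
        latencies = list(pool.map(timed_query, range(num_users)))
    return sum(latencies) / len(latencies)

avg = simulate_users(100)
print(f"average response time: {avg:.3f}s")
```

With a real database behind `run_query`, the reported average grows as the thread count rises, which is exactly the scalability effect observed in the experiment.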
“Would it help to use distributed paradigm for big data analysis? How would the scenario change
if we used a distributed spatial database system as opposed to a centralized relational database
management system? Can the system be made independent of the number of users? How can
the queries be executed efficiently over such systems?” are some of the questions which have
been explored in this thesis.
1.2 Problem Formulation
The foremost question to be answered is: what computing paradigm should be employed to
handle datasets significantly larger than what can be stored on a single hard disk? Secondly,
how can this data be managed? Finally, since the overall requirement is to run computationally
intensive operations over this data, it becomes important to analyse the features which a system
capable of handling big data must have. Such a system must be capable of performing data- and
compute-intensive operations over big data and hence is required to be scalable, autonomic and
fault-tolerant. By this, we mean that the system must respond quickly to changes in input load,
must be capable of tuning itself according to the requirements, and must continue operating
even if some of its components fail. Distributed systems are known to offer these features [31]
and hence are the obvious choice for investigation.
If we dig a little deeper into the use of distributed systems for data management and analysis,
it becomes important to consider how the data is placed over disks. In the context of spatial data, which
resides in the core of this thesis, if we consider the nature of spatial data and its relation with
geographic location - features intrinsic to spatial data like bounding box and locality reference
can serve as important parameters for distribution of data. This requires the overall system to
be adaptive and change dynamically with respect to changes in number of data sites and the
data which they hold. Thus, one of the areas of focus is to devise an optimal data placement
strategy based on the characteristics of spatial data.
Once the data has been placed, there must be an efficient way to access it. The lookup
mechanism should be independent of, or only lightly dependent on, changes in the data being
served by nodes or the addition/deletion of data nodes to/from the database federation. This
means that even with a large number of changes in the data at various data nodes, there should
be no need to update the lookup mechanism every time such changes occur. This forms another
area of research focus.
The main research focus is to develop a distributed data management framework for big spatial
data which is capable of handling the needs of scalability, high computing power and reliability.
The objective of this thesis is to provide such a framework, which is capable of:
1. Processing large scale spatial data sets efficiently.
2. Adopting dynamic data placement mechanisms for efficient data management.
3. Implementing the proposed framework using existing technologies.
4. Benchmarking the implemented system against centralized relational database management
systems for computations over large spatial data sets.
We restrict our problem to vector datasets. Henceforth, whenever we talk about spatial data,
we implicitly assume that it is vector data and not raster data.
1.3 Thesis Outline
The report has been organized as follows: Chapter 2 presents the concepts of distributed
databases which form the fundamental part of this thesis. Chapter 3 presents the literature
survey of related work in the field of GIS processing using distributed and parallel computing
paradigms. Chapter 4 presents the thesis proposal. Chapter 5 presents a query translation
framework for distributed execution of spatial queries. Chapter 6 summarizes the conclusions
and presents the work plan for the second stage of the thesis.
Chapter 2
Background Study
This chapter presents the background study introducing the concepts which are essential for
understanding the work planned in this thesis. Distributed query planning and execution are
fundamental to the thesis, and both are discussed briefly in this chapter. Existing distributed
data storage systems are also presented.
2.1 Distributed Databases
2.1.1 Introduction
Distributed Database Systems [DDBMS] bring together the processing technologies of database
systems and computer networks. A distributed database is composed of several interrelated
data files which are distributed over separate disks and are accessible over a computer network.
The key design features of distributed databases are:
• Data is distributed over several data nodes. The union of the data over these nodes forms
the entire database.
• Data nodes are capable of interacting with each other over a computer network. However,
each data node is independently capable of processing queries over the data it hosts.
• There is a logical relationship between the distributed data. The different data placement
mechanisms discussed subsequently ensure this relationship.
DDBMS are more reliable and available as compared to centralized database management
systems. This is because if any component in a centralized DBMS fails, the entire system is
affected by the failure. In a DDBMS, however, failures are local to sites and do not affect the
functionality of the overall system. Moreover, features like data replication over different data
nodes make DDBMS fault tolerant. In the case of large data sets, DDBMS offer improved
performance since the data is divided into a number of smaller datasets spread across data
nodes capable of processing this data in parallel. Figure 2.1 illustrates the architecture of
distributed database management systems.
Fig. 2.1: Architecture of Distributed DBMS
2.1.2 Fragmentation Techniques
In designing a DDBMS, the first question which needs to be answered is how the data must
be distributed and placed over geographically distributed data nodes. The data may be stored
on a central data node, it may be split and stored at several data nodes, or the same data may
be replicated at all data nodes. For improved performance, however, the data must be placed
in such a way that it can be accessed and processed efficiently. Several data fragmentation and
storage schemes generally employed in DDBMS are presented in this section.
Example
Let us formulate an example which will be referenced throughout the thesis. Consider spatial
data containing 4 features: Road Links, Road Nodes, Lakes and Buildings. Let us assume that
each of these features has the following table definition.
LinkID RoadName StartNodeID EndNodeID Width Geometry
0 A Road 1 2 10 Polyline((1,2),(3,4),(4,5))
1 B Road 0 4 8 Polyline((1,7),(8,4),(12,9))
2 C Road 3 7 4 Polyline((1,3),(3,8),(6,5))
3 D Road 2 5 15 Polyline((5,2),(8,5),(3,13))
4 E Road 4 7 9 Polyline((11,21),(13,4),(14,15))
5 F Road 5 6 6 Polyline((17,11),(6,17),(8,5))
Table 2.1: Road Link Feature Attributes
NodeID Geometry
0 POINT(1,2)
1 POINT(3,4)
2 POINT(5,6)
3 POINT(7,8)
4 POINT(9,10)
5 POINT(11,12)
6 POINT(13,14)
Table 2.2: Road Node Feature Attributes
LakeID WaterVolume Geometry
0 3400 POLYGON((1,2),(3,4),(5,6),(1,2))
1 6400 POLYGON((7,8),(9,10),(11,16),(7,8))
2 2300 POLYGON((2,7),(13,14),(25,66),(2,7))
3 1700 POLYGON((9,12),(24,43),(15,32),(9,12))
4 8600 POLYGON((21,32),(44,54),(53,61),(21,32))
5 9100 POLYGON((12,12),(33,14),(65,36),(13,63),(12,12))
Table 2.3: Lake Feature Attributes
BuildingID BuildingName Height Geometry
0 Building A 3400 POLYGON((1,2),(3,4),(5,6),(1,2))
1 Building B 6400 POLYGON((7,8),(9,10),(11,16),(7,8))
2 Building C 2300 POLYGON((2,7),(13,14),(25,66),(2,7))
3 Building D 1700 POLYGON((9,12),(24,43),(15,32),(9,12))
4 Building E 8600 POLYGON((21,32),(44,54),(53,61),(21,32))
5 Building F 9100 POLYGON((12,12),(33,14),(65,36),(13,63),(12,12))
Table 2.4: Building Feature Attributes
*Data is representative and is not accurate
We will assume that the data in these tables is to be distributed across several data nodes. We
will refer to these relations whenever required.
Horizontal Data Fragmentation
In the horizontal data fragmentation scheme, the rows of a global relation are divided and stored
over separate data nodes based on one or more attributes of the relation. The global relation
can be reconstructed by taking a UNION of all the rows from all data nodes. The Road Link
table after applying horizontal fragmentation for distribution over 3 sites looks as follows:
LinkID RoadName StartNodeID EndNodeID Width Geometry
0 A Road 1 2 10 Polyline((1,2),(3,4),(4,5))
1 B Road 0 4 8 Polyline((1,7),(8,4),(12,9))
Table 2.5: Horizontal Fragmentation: Fragment 1
LinkID RoadName StartNodeID EndNodeID Width Geometry
2 C Road 3 7 4 Polyline((1,3),(3,8),(6,5))
3 D Road 2 5 15 Polyline((5,2),(8,5),(3,13))
Table 2.6: Horizontal Fragmentation: Fragment 2
LinkID RoadName StartNodeID EndNodeID Width Geometry
4 E Road 4 7 9 Polyline((11,21),(13,4),(14,15))
5 F Road 5 6 6 Polyline((17,11),(6,17),(8,5))
Table 2.7: Horizontal Fragmentation: Fragment 3
In terms of spatial data, fragmentation based on MBRs [Minimum Bounding Rectangles] is
essentially a horizontal fragmentation technique. In MBR-based fragmentation, the entire
geometrical region bounding the spatial data set is divided into several rectangles such that
all the rectangles taken together reconstruct the entire area. The data can then be fragmented
by finding the MBR(s) to which a particular feature belongs and partitioning the data into
multiple tables accordingly.
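A minimal sketch of MBR-based horizontal fragmentation over the Road Link relation of Table 2.1 follows. It assumes, for simplicity, that each feature is assigned to exactly one fragment by the centroid of its bounding box (a real system might instead replicate features that span multiple MBRs); the helper names and the split line are illustrative.

```python
def bbox(coords):
    """Minimum bounding rectangle of a list of (x, y) points."""
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)

def mbr_fragment(features, x_split):
    """Assign each feature to one of two fragments by comparing the
    x-coordinate of its bounding-box centroid against a split line."""
    left, right = [], []
    for fid, coords in features.items():
        xmin, _, xmax, _ = bbox(coords)
        (left if (xmin + xmax) / 2 < x_split else right).append(fid)
    return left, right

# Road link geometries from Table 2.1, keyed by LinkID
road_links = {
    0: [(1, 2), (3, 4), (4, 5)],
    1: [(1, 7), (8, 4), (12, 9)],
    2: [(1, 3), (3, 8), (6, 5)],
    3: [(5, 2), (8, 5), (3, 13)],
    4: [(11, 21), (13, 4), (14, 15)],
    5: [(17, 11), (6, 17), (8, 5)],
}
left, right = mbr_fragment(road_links, x_split=9)
# The union of the fragments reconstructs the full relation
assert sorted(left + right) == sorted(road_links)
```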
Vertical Fragmentation
In vertical fragmentation, the global relation is divided on the basis of its attributes, which is
achieved by grouping the columns. The global relation can be reconstructed by taking a JOIN
of all the fragments. The Lake Feature table from the example discussed before, vertically
fragmented for division over 2 data nodes, will look as follows:
LakeID WaterVolume
0 3400
1 6400
2 2300
3 1700
4 8600
5 9100
Table 2.8: Vertical Fragmentation: Fragment 1
LakeID Geometry
0 POLYGON((1,2),(3,4),(5,6),(1,2))
1 POLYGON((7,8),(9,10),(11,16),(7,8))
2 POLYGON((2,7),(13,14),(25,66),(2,7))
3 POLYGON((9,12),(24,43),(15,32),(9,12))
4 POLYGON((21,32),(44,54),(53,61),(21,32))
5 POLYGON((12,12),(33,14),(65,36),(13,63),(12,12))
Table 2.9: Vertical Fragmentation: Fragment 2
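The split and its JOIN-based reconstruction can be sketched in a few lines. The dictionaries stand in for the Lake relation of Table 2.3, with the geometry strings abbreviated for readability.

```python
# Global Lake relation: LakeID -> (WaterVolume, Geometry)
lakes = {0: (3400, "POLYGON A"), 1: (6400, "POLYGON B"), 2: (2300, "POLYGON C")}

# Vertical fragmentation: each fragment keeps the key plus one attribute group
frag1 = {lake_id: vol for lake_id, (vol, _) in lakes.items()}    # LakeID, WaterVolume
frag2 = {lake_id: geom for lake_id, (_, geom) in lakes.items()}  # LakeID, Geometry

# Reconstruction: join the fragments on the shared key LakeID
rejoined = {lake_id: (frag1[lake_id], frag2[lake_id]) for lake_id in frag1}
assert rejoined == lakes
```

Note that the key (LakeID) is replicated in every vertical fragment; without it, the join that reconstructs the global relation would be impossible.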
Both the fragmentation schemes have been summarized in Figure 2.2.
Fig. 2.2: Horizontal and Vertical Fragmentation
Mixed Fragmentation
A combination of vertical and horizontal fragmentation techniques over a global relation may
also be employed. This means that vertical fragments of horizontally fragmented data [or vice
versa] may be generated. The original relation can be reconstructed by first applying a join over
all vertically divided fragments and then taking a union of the resultant relations [or vice versa].
2.1.3 Placement of Fragments
Fragment Allocation
Once the relations have been fragmented, the next step is to allocate the fragments to data
sites. The fundamental principle governing fragment allocation is to distribute the fragments
over data sites in such a way that most data access remains local. This is a complex problem,
and several algorithms exist which compute optimal fragmentation and allocation strategies;
some of them are presented in [4, 6, 17]. There are two strategies for the allocation of
fragments, static and dynamic, which are discussed next.
Fragment Placement
One strategy to place fragments is to analyze the expected load on the database and produce
an allocation based on this analysis. The expected load is generally expressed in terms of a set
of queries gathered from active systems. This is termed the static fragment placement method.
Its disadvantage is that it is not adaptable to changes in the load. Dynamic fragment placement
methods are used to make distributed databases adaptable to changing load; they involve
continuous monitoring of the database in order to dynamically tune the system for load
balancing.
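As an illustration of the dynamic approach, the sketch below rebalances by moving the most frequently accessed fragment from the most loaded site to the least loaded one. The load model (raw access counts per fragment) is a deliberate simplification of what a real monitoring component would collect; the site and fragment names are hypothetical.

```python
def rebalance(placement, access_counts):
    """Move the hottest fragment from the busiest site to the least busy
    one, where a site's load is the sum of its fragments' access counts."""
    load = {site: sum(access_counts[f] for f in frags)
            for site, frags in placement.items()}
    busiest = max(load, key=load.get)
    idlest = min(load, key=load.get)
    if busiest == idlest:
        return placement  # already balanced
    hottest = max(placement[busiest], key=access_counts.get)
    placement[busiest].remove(hottest)
    placement[idlest].append(hottest)
    return placement

placement = {"site1": ["F1", "F2"], "site2": ["F3"]}
counts = {"F1": 90, "F2": 40, "F3": 10}
rebalance(placement, counts)
print(placement)  # F1 moves from site1 (load 130) to site2 (load 10)
```

A production system would run such a step periodically, and would also weigh the cost of moving a fragment against the expected benefit.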
Decision Making
One more aspect associated with the placement of fragments is the distribution of control.
There are two ways in which decision making can be implemented in a DDBMS: either a
dedicated server gathers information and makes decisions for all nodes, or each data node is
autonomous and makes its own decisions. A hybrid scheme may also be employed, using
several central systems which each serve a number of data nodes.
2.1.4 Distributed Query Processing
This section presents the general query processing techniques employed in distributed databases.
Details of the discussed techniques can be found in [29]. There are four steps involved in
distributed query processing. First, the SQL query which is passed to the data site is translated
into algebraic notation; this step is termed query decomposition. Next, the decomposed query
is analyzed to locate the data fragments required for processing the query. This step generates
an outline of the query execution plan, which is passed on to the query optimizer for
improvement. Finally, the query is executed at the data node. Figure 2.3 shows the steps
involved in query processing.
Fig. 2.3: Steps in Query Processing
Localization Step
This step is not present in centralized DBMS since a single site hosts all data. However, in
a distributed system, data required for query processing needs to be located since it may be
present across several data sites. The system locates the data required for processing the query
by analyzing the tables which need to be accessed to answer it. In the case of a DDBMS for
big data, an important function of the data localization step is to eliminate the data sites
which are not relevant for answering the active query. This is important because otherwise
the query needs to be broadcast to all data sites, which leads to inefficient query execution.
This step generates a query plan which contains information about the data sites holding the
data needed for processing the active query.
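For spatial data, this site elimination can be driven by spatial extents: each data site advertises the MBR of the data it hosts, and the localization step keeps only the sites whose MBR intersects the query window. The following sketch assumes such a per-site catalog exists; the site names and extents are hypothetical.

```python
def intersects(a, b):
    """True if two rectangles (xmin, ymin, xmax, ymax) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def relevant_sites(site_mbrs, query_window):
    """Localization: prune sites whose extent cannot contribute to the query."""
    return [site for site, mbr in site_mbrs.items()
            if intersects(mbr, query_window)]

# Hypothetical catalog: per-site extents of the hosted fragments
catalog = {
    "site1": (0, 0, 10, 10),
    "site2": (10, 0, 20, 10),
    "site3": (0, 10, 10, 20),
}
print(relevant_sites(catalog, (2, 2, 8, 8)))  # only site1 needs the query
```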
Optimization Step
A detailed survey of query optimization techniques for distributed databases is presented
in [24]. The main function of this step is to improve the query execution plan generated in
the localization step. The tasks of the optimizer include finding an optimal join order and
using appropriate access strategies while executing the query. The optimizer may choose to
use the indices or discard them entirely and engage in a fragment scan. This step also takes
care of using replicated data, either by employing a parallel access mechanism or by choosing
the data site known to have the fastest access among those where the data is located.
Query Execution
In the context of distributed systems, there are two basic methods to execute a query. The
first uses data shipping, which involves moving data from remote sites to the site where the
query is being executed. The other strategy is to ship the query to the sites which contain the
appropriate data. Both techniques can be illustrated using the following example.
Let us consider that we need to compute the join of three different tables T, U and V. For
simplicity, let us assume that table T has been divided into two fragments while the other two
tables are stored without fragmenting the data. Also, let us assume that this data is hosted
at four different sites. Figure 2.4 gives an overview of the two strategies using this example.
Fig. 2.4: Query Execution Strategies
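A back-of-the-envelope comparison of the two strategies can be sketched by counting bytes moved over the network. The fragment sizes and query size below are made-up, and the model ignores intermediate join results, but it captures why query shipping usually wins when results are much smaller than the inputs.

```python
def data_shipping_cost(fragment_sizes):
    """All remote fragments are moved to the executing site."""
    return sum(fragment_sizes.values())

def query_shipping_cost(fragment_sizes, selectivity, query_size=1_000):
    """The query is sent to each site; only filtered results come back."""
    return sum(query_size + int(size * selectivity)
               for size in fragment_sizes.values())

# Hypothetical sizes in bytes of the fragments of T, U and V at remote sites
fragments = {"T1": 50_000_000, "T2": 50_000_000, "U": 80_000_000, "V": 20_000_000}
print(data_shipping_cost(fragments))         # 200,000,000 bytes moved
print(query_shipping_cost(fragments, 0.01))  # 2,004,000 bytes moved
```

With a selectivity of 1%, query shipping moves roughly two orders of magnitude less data; data shipping becomes attractive only when most of the data is needed at the executing site anyway.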
This concludes our discussion on distributed databases. Next, we look into existing systems
which provide distributed storage of, and distributed query processing over, data.
2.2 Existing Distributed Systems
Several systems facilitate distributed data storage and allow one to run queries over data
distributed across several disks. While some systems provide only data storage facilities and
others only query processing facilities, several provide a combination of both. Some of these
systems are presented in this section.
Google’s BigTable
Google’s BigTable [12] is a distributed storage system which is capable of storing large data
sets that may be of several petabytes in size and distributed among thousands of data nodes.
BigTable is based on a semi-structured data storage technique: data can be viewed as a large
map indexed by a row key, a column key and a timestamp. Each value within the map is an
array of bytes that is interpreted by the application. Every read/write operation over a row is
atomic, regardless of the number of columns read or written within that row. BigTable is thus
a collection of (key, value) pairs where the key identifies a row and the value is the set of
columns. BigTable's data is distributed among many independent machines; it uses a
distributed file system, the Google File System, for managing data access and updates. The
components of BigTable include a client library, a master server for coordination of activity,
and several tablet servers, which may be added or removed dynamically. The master server is
responsible for assigning tablets to tablet servers, balancing the load on tablet servers, garbage
collection, and managing changes in schema.
Apache HBase, developed over Hadoop core, is an opensource implementation of Google’s
BigTable.
Apache CouchDB
Apache CouchDB [5] is a schema-free, peer-based, document-oriented distributed storage
system. The primary data unit in CouchDB is a semistructured document which may consist
of numerous columns, associated metadata and attachments. The size and format of documents
are variable, and each document is uniquely identified by a document ID. CouchDB is a
web-based database system primarily used in document aggregation and filtering. Optimization
in CouchDB relies heavily on its view management capabilities: views are created using
JavaScript functions which, for each input document, produce any number of rows in the output
view. The database is kept in a consistent state at all times, minimizing the rollbacks which are
necessary after a system crash. CouchDB also provides multi-version concurrency control,
allowing different applications to simultaneously read and write the same document data.
Piazza System
The Piazza system [20] provides a combination of data storage and query processing capabilities
for data management in peer systems hosting heterogeneous data. Piazza allows complete
autonomy at the data sites in terms of schema implementation and the kind of data hosted.
The system provides strong schema mediation functionality, reformulating the input query in
accordance with the schema mappings. Data is stored in XML format at the data sites, and an
XQuery-based query language is used for processing the data. Queries which are sent over the
network are modified to reflect the schema mapping.
HadoopDB
HadoopDB [3] is a database middleware system designed for the cloud. It has been developed
over Hadoop, the open-source implementation of the MapReduce computing paradigm.
HadoopDB provides data storage as well as query processing capabilities. Data storage is
enabled by using a local database management system at each site, and communication between
data sites is managed using the Hadoop framework. The local database management systems
are integrated to work with the MapReduce framework, and hence these DBMSs serve in place
of the distributed file system. Information about the local sites, the datasets present at each
site, and details about the partitioning and replication of data is maintained in HDFS, which
serves as the metadata catalog used in query decomposition. For distributed query processing,
the SQL query is converted into MapReduce jobs using Hive, and the Hadoop framework is
then used to coordinate amongst the data sites and process the query on the local DBMSs.
MongoDB
MongoDB [1], like CouchDB, is a document-oriented database management system, but it provides
more advanced query processing services than CouchDB. In MongoDB, the data unit is again a
semistructured document, and documents are grouped to form collections; generally, there is one
collection per document type. MongoDB can fragment the data and store it at several data sites.
It uses replication for increased availability and also provides automatic load balancing for
efficient query processing. An imperative language is used for querying the underlying data,
which initiates a scan over collections followed by filtering. A B-tree-based indexing mechanism
is used for performance enhancement.
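The scan-and-filter query model can be illustrated with a small sketch. The collection contents and the Python-callable predicate below are invented for the example; MongoDB itself takes JSON-style predicates such as `{"height": {"$gt": 20}}`:

```python
# Illustrative simulation of scan-then-filter over a document collection.

buildings = [  # one collection per document type
    {"id": 1, "name": "Library", "height": 25},
    {"id": 2, "name": "Hostel", "height": 18},
    {"id": 3, "name": "Tower", "height": 60},
]

def find(collection, predicate):
    """Scan the collection and keep documents matching the predicate."""
    return [doc for doc in collection if predicate(doc)]

tall = find(buildings, lambda d: d["height"] > 20)
# Without an index this is a full scan; a B-tree index on "height"
# would let the system seek directly to the qualifying documents.
```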
This section presented some of the existing systems that support distributed database
management. However, it should be noted that none of these systems provides functionality for
efficient handling of, and query processing over, large spatial datasets. These systems do,
however, provide insight into the architectures used for distributed query processing. We intend
to utilize features from these systems in building an efficient distributed query processing
system for big spatial data.
Chapter 3
Literature Survey
This chapter presents a concise summary of related work in distributed spatial databases. The
survey has been categorized into three classes: current trends and technologies in large-scale
data analysis, a comparison of the distributed and parallel computing paradigms, and related
work on GIS processing in distributed systems. A gap analysis is then presented and an attempt
is made to formulate the problem in its entirety.
3.1 Related Work
3.1.1 Technology Overview for Big Data Processing
In this section, we describe current research in the field of data-oriented computing by surveying
the technologies used for compute-intensive problems over big data and the research directions
being pursued to solve them.
First, we present the components of the computing architecture generally employed in
data-oriented computing technologies. Although the technologies described later are diverse,
they share a number of common features. The computing architecture presented in Figure 3.1
captures the common elements of these technologies.
Fig. 3.1: General Architecture for Data Oriented Computing
In this architecture, coordination exists at the lowest level and forms the basis of distributed
data access and querying. The distributed data layer provides data abstraction through
sophisticated techniques and interfaces for data access. The computing layer, which is the
topmost layer, is responsible for providing distributed processing capabilities and offers
numerous high-level languages through which it can be accessed. Some of the widely used
technologies for big data management and analytics are summarized in the following table.
Layer         Google      Yahoo!            Microsoft             Miscellaneous
Languages     Sawzall     Pig               DryadLINQ and SCOPE   Hive
Computation   MapReduce   Hadoop            Dryad                 -
Database      BigTable    HBase and PNUTS   -                     Cassandra
File System   GFS         HDFS              Cosmos                Dynamo
Coordination  Chubby      Zookeeper         -                     -

Table 3.1: Widely Used Technologies in Big Data Analysis
From the above table, we find that two technologies are used in the coordination layer, namely
Chubby [9] and Zookeeper [21]. These technologies are responsible for maintaining information
pertaining to configuration and data management. They are also responsible for synchronizing
the distributed services provided by the data distribution layer. The primary characteristic of
the services provided by Chubby and Zookeeper is high reliability and availability. Next, we
describe the technologies used for storing distributed data.
The Google File System [18], employed by Google, is based on the master-slave technique. In
GFS, the master node is responsible for metadata operations, whereas the slave nodes are used
for the actual computation over the data. GFS does not offer strict consistency; for example, a
record may be appended to the data multiple times. GFS runs as a user-space process which
lazily stores data as files over the local file system. The Hadoop Distributed File System is based
on the same architecture as GFS; further details on HDFS can be found in [33]. It should be
noted that in both HDFS and GFS, the master node constitutes a single point of failure. To
ensure high availability, replicas of the master are kept ready to take over in case of failures.
Microsoft uses Cosmos [11] as its file system for distributed computing. While the details of this
technology have not been shared in the paper, it is mentioned that Cosmos is an append-only
file system optimized for petabyte-scale data, and that it can be regarded as Microsoft's internal
counterpart of GFS. Data is replicated and compressed for increased fault tolerance and higher
efficiency. Dynamo [16] is a data storage technology developed by Amazon and is heavily used
in Amazon's services (such as the shopping cart on Amazon's website and the Amazon S3
service). Dynamo is a highly available and reliable key-value store which ensures low-latency
access to data. It is based on a P2P architecture which uses consistent hashing for load
balancing and implements a gossip-based protocol to keep replicas consistent. A further
description of this technology can be found in [36].
An important point to note here is that, except for Dynamo, the file systems discussed above
are stream-oriented, append-only systems. In order to make these systems robust and
application friendly, data abstraction mechanisms are developed over them.
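The consistent hashing that Dynamo uses for load balancing can be sketched as follows. The node names are hypothetical, and the ring omits Dynamo's virtual nodes and replication for brevity: nodes and keys are hashed onto a ring, and each key is owned by the first node clockwise from its position.

```python
# Minimal consistent-hashing sketch (illustrative; not Dynamo's code).
import bisect
import hashlib

def ring_hash(s):
    """Map a string to a position on the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # place each node on the ring, sorted by hash position
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def owner(self, key):
        """First node clockwise from the key's position owns the key."""
        hashes = [h for h, _ in self.ring]
        i = bisect.bisect_right(hashes, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
assignments = {k: ring.owner(k) for k in ["lake-3", "road-7", "bldg-2"]}
# Adding or removing one node relocates only the keys in its arc of the
# ring, which is why consistent hashing rebalances gracefully.
```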
Databases provide the functionality of data abstraction. An introduction to Google's BigTable
and Apache HBase was presented in Section 2.2. Both these databases are non-relational in
nature and are used to store semistructured or unstructured datasets. The data is stored in
column-wise fashion, which facilitates better representation and compression. The SSTable is
the fundamental structure employed in BigTable and HBase; it is used to store data and is
designed so that a lookup through it requires a single disk access. PNUTS [14] is a storage
technology developed by Yahoo! which is similar to BigTable and HBase. It allows data lookup
based on key-value pairs. The distinguishing feature of PNUTS is that it provides per-record
consistency by guaranteeing that operations on a single record will be applied in the same order
across all replicas. Apache Cassandra [27], which is used to power Facebook, is an innovative
data storage solution that brings the strengths of BigTable and Dynamo together and offers a
structured storage system over a peer-to-peer network. Voldemort [34] is a non-relational,
open-source, key-value database system developed by LinkedIn which can be seen as a
persistent distributed hash table.
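The single-disk-access property of the SSTable comes from keeping records sorted by key with an in-memory index over them. A minimal sketch, with the data held in a list instead of an on-disk file:

```python
# SSTable-style lookup sketch: sorted immutable records plus an index
# that locates a key's offset, so only one "disk access" is needed.
import bisect

records = sorted([("k03", "v3"), ("k01", "v1"), ("k07", "v7"), ("k05", "v5")])
index_keys = [k for k, _ in records]  # a real sparse index keeps fewer entries

def sstable_get(key):
    """Binary-search the index, then read the record at that offset."""
    i = bisect.bisect_left(index_keys, key)
    if i < len(records) and records[i][0] == key:
        return records[i][1]  # single read at the located offset
    return None

value = sstable_get("k05")  # "v5"
```

In BigTable and HBase the index block is loaded into memory when the SSTable is opened, so the binary search costs no I/O at all.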
Next, we move to the computation layer shown in Figure 3.1. Different computing paradigms
are deployed for data-intensive computing. Their common characteristic is that they concentrate
on the dataflow while automating the parallelization of the computation.
Google uses the MapReduce [15] computing paradigm, inspired by functional programming
languages, for distributed data processing. MapReduce is based on two functions, map and
reduce. The map function reads a list of key-value pairs and generates lists of intermediate
key-value pairs, which are sorted and grouped by key. The reduce function then accepts an
intermediate key along with the values associated with it and merges them to produce the
final result. An execution overview of the MapReduce paradigm is illustrated in Figure 3.2. It
should be noted that the map and reduce phases are independent of each other and do not
overlap at any point during execution. The map and reduce functions are purely functional and
hence easy to parallelize, and fault tolerance can be achieved by re-executing the failed portion
of the computation.
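The map/shuffle/reduce dataflow described above can be emulated in a few lines with the classic word-count example. This single-process sketch only mirrors the sort-and-group-by-key step that real MapReduce performs across machines:

```python
# Single-process emulation of the MapReduce dataflow (illustrative).
from collections import defaultdict

def map_fn(_, line):
    """Map: one (key, value) input -> list of intermediate pairs."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: intermediate key + its grouped values -> final pair."""
    return (word, sum(counts))

def mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):   # map phase
            intermediate[k].append(v)     # shuffle: group values by key
    # reduce phase over the sorted intermediate keys
    return dict(reduce_fn(k, vs) for k, vs in sorted(intermediate.items()))

counts = mapreduce([(0, "big spatial data"), (1, "big data")],
                   map_fn, reduce_fn)
# counts == {"big": 2, "data": 2, "spatial": 1}
```

Because `map_fn` and `reduce_fn` are side-effect free, the framework is free to run any subset of them in parallel or to re-execute them after a failure.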
Fig. 3.2: Execution Overview in MapReduce [15]
Hadoop is a reimplementation of the MapReduce paradigm in Java and has been reported to
have performance superior to that of Google's original implementation. Dryad is Microsoft's
alternative to MapReduce. Specific details about these technologies are reported in [7] and [25]
respectively.
Finally, we present the languages used for interfacing user requirements with the computing
systems discussed above; these languages facilitate writing programs for those systems.
Sawzall [32] is a parallel data processing language built on top of MapReduce. Aggregation
operations like counting, representative sampling and histogram computation are predefined in
the language for processing the underlying datasets. The language allows programmers to
declare the desired operations in terms of aggregators and to implement the logic for filtering
datasets as required. It forces programmers to think in terms of processing a single record at a
time, which in turn allows the system to completely parallelize the computation. Pig [28] is
another sophisticated language capable of performing complex database operations like
aggregation, filtering and joins. Statements written in Pig are translated into corresponding
Hadoop jobs which are executed in parallel. It allows programmers to define operators which
correspond to the relational algebra primitives found in database systems. DryadLINQ is a
programming language which translates programs into Language Integrated Query (LINQ)
expressions, a query facility developed in Microsoft's .NET framework. Hive [35] is an attempt
by Facebook to build a data warehousing solution on top of Hadoop; it is still nascent and lacks
an efficient optimizer.
This completes the survey of existing technologies which employ distributed data management
and query processing for petabyte-scale data. An important point to note is that all these
systems treat spatial data the same as non-spatial data when it comes to query processing; they
do not harness the intrinsic features of spatial data, which makes them inefficient for performing
spatial computations.
3.1.2 Parallel DBMS vs. Distributed DBMS
Parallel database management systems are the direct competitors of distributed systems. In
order to put this comparison in the correct perspective, it is important to understand the CAP
theorem [19], which states that a system can guarantee at most two out of three properties:
consistency, availability and partition tolerance. For distributed systems meant to handle large
datasets, partition tolerance is highly desirable, since network failures cannot be prevented. On
the other hand, parallel database management systems have evolved from traditional DBMSs
and hence prefer consistency. Thus, in comparing PDBMS with DDBMS we are essentially
comparing ACID (Atomicity, Consistency, Isolation and Durability) with BASE (Basically
Available, Soft state and Eventually consistent). With BASE, systems become more scalable at
the cost of consistency, while with ACID, systems are less scalable because of consistency
constraints. Thus, both PDBMS and DDBMS have their own merits. The following table
presents the advantages of each.
Fig. 3.3: Parallel DBMS vs. Distributed DBMS
Generally, a hybrid approach is recommended which selects the best of these two paradigms.
This calls for creating a new DBMS tailored to match all requirements. Such considerations are
being taken seriously with the advent of cloud computing in recent times [2].
3.1.3 Distributed Processing in Geospatial Data
[39] presents a parallel spatial query execution model for big spatial data. The paper presents a
scheme to partition the data based on geographical space, together with a corresponding spatial
object file which is used to refer to the data across the cluster nodes. It also presents a
distributed indexing mechanism based on the data partitioning hierarchy and uses this index
within the MapReduce paradigm for spatial query computation. The experimental results show
that the proposed scheme for data management and processing is faster than a PostgreSQL
cluster, HDFS, Cassandra and HBase in terms of reads and bulk loading. The results also
indicate that the scheme is faster than Oracle Spatial and PostgreSQL with PostGIS in terms of
processing spatial operations like spatial selection and nearest-neighbour computation.
[38] presents a method for distributed geospatial data processing by proposing a middleware
which uses a spatial web services search engine and a metadata repository to achieve distributed
processing. The search engine is used to find existing geospatial web services (WMS/WPS/WCS)
over the internet and to collect the available metadata, which is stored in the metadata
repository of the middleware. Any query which comes to the middleware is first matched against
the information available in the metadata store, and the sites which are suitable to answer the
query are selected. The query is then routed to all these sites, and the resulting
WMS/WPS/WCS responses are merged into a single result.
In [40], the authors present a framework for performing parallelized spatial join operations.
They describe the filter-and-refine strategy typically involved in geometric computations and
present a framework which decomposes the filtering task to perform parallel filtering over the
dataset, and decomposes the refinement task by proposing an object redistribution scheme.
Optimization strategies to improve processing and communication costs are discussed, achieved
by manipulating the size of the data structures used in parallelization and by choosing
appropriate data partitioning schemes. Experimental results validate the optimized cost model.
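The filter-and-refine strategy can be sketched as follows. The bounding-box filter and the stand-in "exact" test below are simplified choices for the example, not the scheme of [40]; geometries are point lists and boxes are (minx, miny, maxx, maxy) tuples:

```python
# Filter-and-refine sketch for a spatial join (all data invented).

def bbox(points):
    """Axis-aligned bounding box of a geometry's vertices."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def exact_intersects(g1, g2):
    # stand-in for a costly exact geometric test; here: any shared vertex
    return bool(set(g1) & set(g2))

def spatial_join(left, right):
    # FILTER: cheap bounding-box test selects candidate pairs
    candidates = [(i, j) for i, g1 in enumerate(left)
                  for j, g2 in enumerate(right)
                  if boxes_overlap(bbox(g1), bbox(g2))]
    # REFINE: expensive exact test runs only on the candidates
    return [(i, j) for i, j in candidates
            if exact_intersects(left[i], right[j])]

roads = [[(0, 0), (2, 0)], [(10, 10), (12, 10)]]
lakes = [[(2, 0), (3, 1)], [(50, 50), (51, 51)]]
# only the pair (road 0, lake 0) survives both filter and refinement
```

The filter step is what parallelizes cheaply, which is why [40] decomposes it separately from refinement.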
[8] presents a collaborative mapping and feature extraction strategy which collects geospatial
data from distributed sources and integrates and analyses it for meaningful visualization. The
paper focuses primarily on the image processing aspect and is relevant for raster data; however,
the architecture presented for integrating large image datasets can be extended to vector data
as well. The steps involved in the integration process include reprojection of the data to a single
reference system and creation of image tiles from the reprojected data. It further presents a
feature capturing algorithm using image processing techniques, eventually exporting the
analyzed data as KML which can be visualized over Google Earth. This paper provides insight
into the standards in geospatial computing.
[37] is a survey paper which investigates the relevance of grid computing for supporting
geospatial applications which require real-time response. The paper presents the results of
simulating computation over large geospatial datasets. The simulations indicate that the
response time of the applications decreases continuously as the number of cores in the cluster
hosting the application increases. However, the paper does not discuss the implementation
challenges and does not pose any guidelines to be followed while designing a spatial computation
system over grid computing technology.
[13] discusses a distributed computing mechanism for geospatial data which brings data from
heterogeneous sources together. The paper presents the complete transformation process of a
global geospatial query into local site queries: global queries are converted into algebraic form
and resolved using a lexical analyzer and parser to generate the local queries, which are
eventually executed to gather results. The paper also presents a distributed query execution
manager which takes the resolved global query and collaborates with the data nodes in the
geospatial repository to generate results for the local queries. These results are later merged by
the query execution manager and presented back as a single response. The paper presents an
example transformation of a simple global query into local queries. However, experimental
results have not been included, which makes it difficult to validate the correctness of the
proposed schemes. Moreover, the scheme is not complete and lacks query optimization, cost
estimation and dynamic data partitioning.
[26] discusses a new alternative to distributed DBMSs by leveraging the cloud computing
paradigm. The paper explains why cloud computing would be an ideal choice for geostreaming
applications like location-based services and intelligent transportation systems by analyzing the
features required in these applications: parallelization, dynamic load and resource-intensive
nature. The paper presents the ElaStream framework, which is based on the MapReduce
paradigm but fundamentally different from it, in that the framework is independent of
persistent data storage and is designed as a push-based model. Though the paper does not
validate the correctness of the presented framework and there are no experimental results to
deduce whether or not the system works, it identifies the challenges in scalable stream
processing and discusses how these challenges are avoided by the functionality of the proposed
framework.
3.2 Gap Analysis
From the literature survey of related work presented in Section 3.1, the following can be
concluded:
• Much research focuses on exploiting the MapReduce paradigm for spatial computations.
However, the MapReduce architecture is suited to one-time data processing: once the
processing is complete, the data is permanently offloaded from the system. Such systems do
not care about performance as much as they care about availability and reliability. Spatial
data, on the other hand, is not suited to one-time use, since a user may want to analyze the
same data repeatedly to produce ever more meaningful results. Hence, there is a need to
design a scalable system capable of efficiently processing spatial data by using indexes and
the other techniques found in traditional spatial DBMSs.
• While the strategies discussed in the literature survey concentrate on providing parallel
computation of spatial operations, they do not use distributed data partitioning schemes and
do not consider the network costs incurred in accessing remote data. While the central idea
of performing the computations is present, the underlying structure for efficient data access
and processing is missing.
• These gaps require us to design a framework for distributed query processing which brings
together the merits of traditional spatial databases, such as efficient data access and query
optimization, with the merits of distributed database management systems, such as no single
point of failure, scalability and autonomous data sites.
Chapter 4
Thesis Proposal
The ultimate objective of this thesis is to provide a robust framework capable of handling big
spatial data efficiently and effectively. We aim to provide a solution in the area of spatial data
mining and large-scale spatial data analysis by building over the existing knowledge in
distributed database management systems and empowering them to manage spatial data. As
discussed earlier in Section 1.2, the main research objectives of this thesis are to:
1. Conduct a comparative study of existing technologies in distributed database management.
2. Study various techniques of spatial data processing in parallel and distributed environments.
3. Provide a framework capable of processing large-scale spatial data.
4. Implement the framework using existing technologies and integrating the studied techniques.
5. Benchmark the performance of the developed system.
The methodology which has been employed so far, and which will be used to complete the
thesis, is discussed briefly below.
In order to understand the context of the thesis, it was necessary to get a background on
distributed databases and the state of the art in current distributed systems. This study has
been presented in Chapter 2.
Secondly, the existing technologies for big data management over non-spatial data have been
studied. Several technologies developed and used by leading industries have been summarized
in Chapter 3. Furthermore, the competing paradigms for large-scale data analysis have been
studied: parallel DBMSs, which are directly extended from relational systems, were concluded
to be more efficient, while distributed DBMSs are more reliable and fault tolerant. Cloud
computing allows us to bring the positive aspects of both the parallel and distributed computing
paradigms under one roof, which is one of the objectives of this thesis. Figure 3.1 shows the
different layers of the distributed computing architecture, which is very similar to the cloud
architecture, and hence the discussion in the literature survey can be extended to the cloud.
The desired outcome of this study was to become acquainted with the technologies employed
for big data analysis.
Next, the parallel and distributed techniques of query processing in geospatial information
systems were surveyed. Several papers presented efficient techniques of data partitioning and
storage, data retrieval, and optimization of storage access using spatial indexing techniques as
well as parallelized operations. Some interesting research papers have been summarized in
Chapter 3.
The background study and literature survey were used to put the problem in the correct context
and to start designing the distributed database management system for spatial data. The
components which must be studied and implemented in order to achieve the objective have
been identified. Together, they project the complete distributed framework for big spatial data
analysis:
• Data Partition Manager, which dynamically fragments the data based on its characteristics
and allocates fragments to data sites.
• Query Translation Framework, which parses the global query tree and generates site-specific
query execution plans.
• Communication Manager, which handles all communication between data sites (data
shipping, collecting results, etc.).
• Query Optimizer, which optimizes the site queries for efficient processing.
• Replication and Duplication Manager, which enables parallelization in the system and makes
it fault tolerant.
All these components must work in complete synchronization to achieve the overall goal. The
core of the system must contain algorithms capable of performing all spatial computations over
geometry data. A point to be noted here is that the functional and non-functional requirements
of the proposed framework may change over the course of the thesis. The task at hand now is
to design these components to work hand in hand and move closer to the objective.
Chapter 5
Distributed Query Translation Framework
In an attempt to build an efficient and scalable distributed spatial database system, we first
present the query translation framework which is responsible for taking the global query tree as
input and generating query execution plans for local sites hosting the actual data. The basic
structure of the query translation framework is based on the theory of query processing discussed
in [10, 30].
5.1 Introduction to Translation Framework
In a distributed database application, data is partitioned into fragments which are stored at
distributed data sites. The placement of data is a complex issue, as discussed in Section 2.1.3,
and there are several allocation algorithms which strategically place data in order to reduce
data access time and improve query performance. In this chapter, we build on query processing
in distributed systems to construct a framework for efficient translation of a global query into
local site query execution plans. Figure 5.1 gives an overview of query processing in distributed
databases. We discuss our contributions to this general architecture subsequently.
Fig. 5.1: Query Processing in Distributed Databases
5.2 Assumptions
The query translation framework assumes that:
1. Data placement has already been done.
2. Information about the placement of each relation in the database is available. This
information includes: (a) the physical address of the local data nodes where the data has been
placed, (b) the type of fragmentation used to partition the data, (c) the size of the relation on
each data node, (d) the attributes of the relation contained in each fragment, and (e) the data
range present in each fragment.
3. Information about communication costs between any two data nodes is available.
4. The global query execution plan is available.
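The per-fragment metadata assumed in point 2 can be pictured as a small catalog structure. The field names, node addresses and values below are hypothetical illustrations, not a prescribed schema:

```python
# Sketch of the metadata catalog the framework assumes is available.
from dataclasses import dataclass

@dataclass
class FragmentInfo:
    relation: str        # global relation the fragment belongs to
    node: str            # physical address of the hosting data node (a)
    fragmentation: str   # "horizontal", "vertical" or "mixed"       (b)
    size_tuples: int     # size of the fragment on this node         (c)
    attributes: tuple    # attributes contained in the fragment      (d)
    data_range: tuple    # (low, high) key range held                (e)

catalog = [
    FragmentInfo("RoadLinks", "node-1:5432", "horizontal",
                 2, ("Road-ID", "Road-Name", "Geometry"), (0, 1)),
    FragmentInfo("RoadLinks", "node-2:5432", "horizontal",
                 2, ("Road-ID", "Road-Name", "Geometry"), (2, 3)),
]
# Assumption 3 would add a communication-cost matrix between the nodes:
comm_cost = {("node-1:5432", "node-2:5432"): 1.0}
```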
5.3 Query Translation Framework
5.3.1 Classification of fragments
Let us consider a simple query Q over a single global relation R which has been fragmented
into three relations R1, R2 and R3, stored at three different data nodes using some
fragmentation scheme. Several possibilities must be considered in order to answer this query.
One possibility is that the data required to answer the query is distributed over all three data
sites, and the results from each site have to be merged to answer the query completely. Another
possibility is that the data required is present on only 2 of the 3 data nodes; in this case, the
third node plays no role in answering the query, and executing the query over it would be
inefficient. Yet another possibility is that the data required to answer Q exists completely on
each of the data nodes. Thus, it becomes important to classify the fragments according to their
importance for answering a specific query, and then to derive the relationship between the
physical location of each fragment and the parts of the query which need to be executed at each
data site. Based on the different fragmentation schemes presented in Section 2.1.2, we can
classify fragments with respect to a query in the following ways:
• Completely Capable Fragment: A Completely Capable Fragment (CCF) with respect to
a query Q is a fragment which is capable of answering Q completely. This means the fragment
contains all the attributes of the global relation R which are projected in Q, and all the tuples
of R which satisfy the selection condition in Q.
• Not Capable Fragment: A Not Capable Fragment (NCF) with respect to a query Q is a
fragment which is incapable of answering any part of the query. A fragment can be considered
an NCF if it returns zero results for the query.
• Partially Capable Fragment: A Partially Capable Fragment (PCF) is one which does not
contain complete information for answering a query Q, but contains some part of the
information useful in answering Q. Depending upon the fragmentation technique employed
over the global relation R, a PCF can be further classified into three kinds:
1. PCF-Horizontal: A PCF-H fragment with respect to query Q is a fragment which contains
ALL the attributes of relation R mentioned in the query, but only some (not all) of the tuples
which satisfy the selection condition in Q.
2. PCF-Vertical: A PCF-V fragment with respect to query Q is a fragment which holds a
proper subset of the attributes mentioned in the query and contains the projection of ALL
the rows that would answer the query.
3. PCF-Mixed: A PCF-M fragment with respect to query Q is a fragment which contains a
proper subset of the attributes of the global relation R mentioned in the query, and the
projection of only some (not all) of the tuples which satisfy Q.
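The classification above can be sketched for a single-variable range query. The helper and its set-based inputs are hypothetical simplifications; a real classifier would work from the metadata catalog and general predicates rather than explicit ID sets:

```python
# Sketch of CCF/NCF/PCF classification (illustrative only).

def classify(frag_attrs, frag_ids, query_attrs, query_ids):
    """frag_ids / query_ids: sets of tuple IDs the fragment holds /
    the query's selection condition qualifies."""
    has_all_attrs = query_attrs <= frag_attrs
    if not (frag_ids & query_ids):
        return "NCF"                      # answers no part of the query
    if has_all_attrs:
        return "CCF" if query_ids <= frag_ids else "PCF-H"
    return "PCF-V" if query_ids <= frag_ids else "PCF-M"

# Query: Road-Name of links with Road-ID < 4 (qualifying IDs 0..3).
q_attrs, q_ids = {"Road-ID", "Road-Name"}, {0, 1, 2, 3}
link_attrs = {"Road-ID", "Road-Name", "Geometry"}

cls_a = classify(link_attrs, {0, 1}, q_attrs, q_ids)            # "PCF-H"
cls_b = classify(link_attrs, {4, 5}, q_attrs, q_ids)            # "NCF"
cls_c = classify(link_attrs, set(range(6)), q_attrs, q_ids)     # "CCF"
cls_d = classify({"Road-ID", "Geometry"}, set(range(6)),
                 q_attrs, q_ids)                                # "PCF-V"
```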
Example
Let us consider the same example as discussed in Section 2.1.2. We have 4 relations: Road Nodes,
Road Links, Lakes and Buildings. Furthermore, let us assume that there are 3 data nodes over
which these relations have been placed, with the fragmentation details as shown below:
Fig. 5.2: Horizontally Fragmented Link Table
Fig. 5.3: Vertically Fragmented Lake Table
Fig. 5.4: Mixed Fragmented Building Table
From Figure 5.2 it is clear that the road link table has been horizontally fragmented into 3
parts, namely link-1-F, link-2-F and link-3-F, which contain the information of all attributes of
the link table for IDs 0 and 1, 2 and 3, and 4 and 5 respectively. From Figure 5.3 it is evident
that the lake table has been vertically fragmented into two components, lake-1-F and lake-2-F,
which contain information about the attributes Water Volume and Geometry respectively, with
Lake-ID serving as the common attribute. Finally, the building table is first horizontally
fragmented into two components, each containing all attributes, with the tuples divided on the
basis of Building-ID: the first component contains information about IDs 0, 1 and 2, while the
other contains information about IDs 3, 4 and 5. Next, these two fragments are vertically
fragmented. The first fragment contains information about the building name; the second
contains information about the building height and geometry. Building-ID serves as the
common attribute between these components. As a result, we have a total of four components,
as shown in Figure 5.4.
Now, consider a query Q which selects the Road Name of all road links with Road-ID < 4. With
respect to this query, the fragments of the links table can be classified as follows: Road-1-F and
Road-2-F are PCF-H, and Road-3-F is NCF. If the parameter value in Q were < 3, then
Road-1-F would be classified as CCF while Road-2-F and Road-3-F would be NCF. Next,
consider a query Q on the lake table which requires the water volume and geometry of the lake
whose ID = 3. In this case, both fragments, Lake-1-F and Lake-2-F, fall under the PCF-V
category. Finally, for a query on the building table which requires the name and height of
buildings whose ID < 5, all four fragments fall under the PCF-M class. However, if the query
required only the height of buildings whose ID < 5, then fragments Building-1-F and
Building-4-F would fall under the NCF category and fragments Building-2-F and Building-3-F
under the PCF-H category.
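The fragmentations used in this example can be reproduced with two small helpers; the table contents below are invented stand-ins for the figures:

```python
# Sketch of horizontal and vertical fragmentation (illustrative data).

links = [
    {"Road-ID": i, "Road-Name": f"road-{i}", "Geometry": f"LINESTRING {i}"}
    for i in range(6)
]

def horizontal(table, pred):
    """Horizontal fragment: a subset of whole tuples."""
    return [row for row in table if pred(row)]

def vertical(table, attrs, key):
    """Vertical fragment: a projection that keeps the key for rejoining."""
    return [{a: row[a] for a in (key,) + attrs} for row in table]

link_1_F = horizontal(links, lambda r: r["Road-ID"] in (0, 1))
link_2_F = horizontal(links, lambda r: r["Road-ID"] in (2, 3))
link_3_F = horizontal(links, lambda r: r["Road-ID"] in (4, 5))

lakes = [{"Lake-ID": 3, "Water Volume": 9.5, "Geometry": "POLYGON ..."}]
lake_1_F = vertical(lakes, ("Water Volume",), "Lake-ID")
lake_2_F = vertical(lakes, ("Geometry",), "Lake-ID")
# Reconstructing the lake table joins lake_1_F and lake_2_F on Lake-ID;
# the mixed fragmentation of the building table composes the two helpers.
```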
In practice, classification of fragments can be done based on the metadata information about
the fragmentation techniques used on relations. It is one of the assumptions of query translation
framework and it is also realistic in the sense that all this information should be readily available.
Two important points about fragment classification which must be mentioned are as follows:
1. We have discussed classification of fragments for queries which contain single variable.
However, with queries containing more than one variable in the selection condition, the
same explanation is valid with classification becoming specific to each variable in the query.
2. We have only discussed cases where each relation has been fragmented and stored once,
without any replication. In reality, however, the same relation might be stored at a number of
sites using different fragmentation schemes. In such cases, we may obtain several fragments
which fall in the same class, and the fragment chosen for executing the query would depend
on the cost of executing the query plus the cost of communication. The cost of executing
the query will generally follow the order PCF-M > PCF-V > PCF-H > CCF > NCF, because
join operations are generally costlier than selection operations. The cost of communication,
however, also plays an important role in deciding which fragment should be chosen for
query execution.
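The selection rule above can be sketched as follows. The numeric class weights and candidate costs below are illustrative assumptions, not values from the report; they only show how the class-based execution cost and the communication cost combine.

```python
# Sketch of fragment selection among replicated candidates. The relative
# execution costs per class are illustrative assumptions reflecting the
# ordering PCF-M > PCF-V > PCF-H > CCF > NCF, not measured values.
EXEC_COST = {"NCF": 0, "CCF": 1, "PCF-H": 2, "PCF-V": 3, "PCF-M": 4}

def total_cost(fragment_class, comm_cost):
    """Combined cost: class-based execution cost plus communication cost."""
    return EXEC_COST[fragment_class] + comm_cost

def choose_fragment(candidates):
    """candidates: list of (name, fragment_class, comm_cost) tuples.
    Returns the candidate with the lowest combined cost."""
    return min(candidates, key=lambda c: total_cost(c[1], c[2]))

# A CCF on a remote site can still beat a local PCF-V that needs a join:
picked = choose_fragment([("Lake-1-F", "PCF-V", 0.5),
                          ("Lake-replica", "CCF", 1.0)])
print(picked[0])  # -> Lake-replica
```

Communication cost is what keeps the ordering from being decisive: a sufficiently distant CCF replica can lose to a local partial fragment.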
5.3.2 Locating the Data Sites
Once the fragments have been classified, the information about fragment allocation to data sites
has to be used to locate the data sites which will eventually serve the queries. Locating the
data sites is also important because we need to determine which part of the query will be
answered from which data site. In general, one data node may contain none, several, or all of the
fragments necessary for evaluating the query. If a data node does not contain any fragment
falling under the CCF or PCF categories, then no query should be passed to that data site. If
one data site contains all the fragments necessary to evaluate the query, i.e. all the PCF or
CCF fragments identified for answering the query lie on that site, then the query needs to be
executed only on that data site and all other sites can be dropped. In the case where a data
site contains only some of the PCF fragments required to answer the query, we need to locate
the other data sites which host the required data, consider the cost of communication between
these data sites as well as the cost of query evaluation, and choose the best execution plan from
the multiple possible plans (as arise, for instance, with replicated data).
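These site-location rules can be sketched as below. The catalog structure and all site and fragment names are hypothetical; a real system would read them from the fragment allocation metadata.

```python
# Sketch of data-site pruning. Site and fragment names are illustrative.
def prune_sites(site_catalog, needed):
    """site_catalog: {site: set of fragment names hosted there}.
    needed: set of CCF/PCF fragment names required by the query.
    Drops sites holding no relevant fragment; if a single site hosts
    every needed fragment, only that site is returned."""
    relevant = {s: frags & needed for s, frags in site_catalog.items()}
    relevant = {s: f for s, f in relevant.items() if f}  # drop irrelevant sites
    for site, frags in relevant.items():
        if frags == needed:            # one site can answer the whole query
            return {site: frags}
    return relevant

catalog = {"site-A": {"Road-1-F", "Road-2-F"},
           "site-B": {"Road-2-F"},
           "site-C": {"Building-1-F"}}
print(sorted(prune_sites(catalog, {"Road-1-F", "Road-2-F"})))  # -> ['site-A']
```

Here site-C is dropped outright, and site-B is dropped because site-A alone already covers every needed fragment; otherwise the surviving sites would all be handed to the cost model.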
5.4 Generic Algorithm for Query Translation
In this section, we describe a step-by-step procedure to generate fragment queries from the global
query tree, and then we present the procedure to convert the fragment queries into queries for
each data site.
5.4.1 Identifying Fragment Class and Generating Fragment Trees
First, we consider the procedure to generate fragment queries from the global query tree. The steps
are outlined in Algorithm 1. The query processing steps described in the algorithm are
based on the fragment classification described in Section 5.3.1. As discussed briefly in Section
5.3.1, any given query Q may be decomposed into a combination of the fragment classes
explained below:
1. Composed only of NCFs. In this case, no fragment tree should be returned, since none
of the fragments is capable of answering the query.
2. Composed of a CCF and several NCFs. In this case, the query must be run only
over the fragment which has been classified as CCF. If there are several CCFs due to data
replication, the global tree must be replaced with each CCF individually, generating several
trees. This information may be used to parallelize the computation of the result set.
3. Composed of PCF-Hs. In cases where the fragment classes are composed of several PCF-Hs,
the global relation in the query must be replaced with the union of all the PCF-Hs. Replication
of data will result in duplicates, which must be eliminated. Multiple trees can be formed,
when enough detail about the fragments is available, by finding all combinations which
would form a CCF for query Q. The cases of multiple PCF-Vs and PCF-Ms can be handled in
a similar way.
Algorithm 1: Translation of Global Query Tree (Q) to Fragment Query Trees
01. TreeList = null
02. FHorizontal = FVertical = FMixed = null
03. For each relation R in query Q
04.     Classify all fragments F of R
05.     For each fragment f in F
06.         If f is NCF, drop f
07.         If f is CCF, replace R in Q with f; add the generated tree to TreeList
08.         If f is PCF-H, add f to FHorizontal
09.         If f is PCF-V, add f to FVertical
10.         If f is PCF-M, add f to FMixed
11.     UnitedH = null
12.     For each h in FHorizontal
13.         UnitedH = UnitedH UNION h
14.     JoinedV = null
15.     For each v in FVertical
16.         JoinedV = JoinedV JOIN v
17.     CombinedM = combine(FMixed)
18.     TreeList.add(Replace R in Q with permute(UnitedH))
19.     TreeList.add(Replace R in Q with permute(JoinedV))
20.     TreeList.add(Replace R in Q with permute(CombinedM))
21. Return TreeList
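A runnable rendering of Algorithm 1 is sketched below. It assumes the fragments arrive pre-classified (the classification of step 04 is abstracted into the input) and represents query trees simply as strings with the relation name as a placeholder; both simplifications are ours, not part of the algorithm.

```python
# Runnable sketch of Algorithm 1 under simplifying assumptions: fragments
# are already classified, and a "tree" is a string template.
def combine(mixed):
    """Placeholder for the combine() step: union of the mixed fragments."""
    return " UNION ".join(mixed)

def translate(query, relations, classified):
    """classified: {relation: [(fragment_name, class), ...]}.
    Returns the list of fragment query trees (classes per Section 5.3.1)."""
    tree_list = []
    for r in relations:
        horizontal, vertical, mixed = [], [], []
        for frag, cls in classified[r]:
            if cls == "NCF":
                continue                       # drop: cannot serve the query
            if cls == "CCF":                   # answers the query on its own
                tree_list.append(query.replace(r, frag))
            elif cls == "PCF-H":
                horizontal.append(frag)
            elif cls == "PCF-V":
                vertical.append(frag)
            elif cls == "PCF-M":
                mixed.append(frag)
        if horizontal:                         # UNION of horizontal pieces
            tree_list.append(query.replace(r, " UNION ".join(horizontal)))
        if vertical:                           # JOIN of vertical pieces
            tree_list.append(query.replace(r, " JOIN ".join(vertical)))
        if mixed:
            tree_list.append(query.replace(r, combine(mixed)))
    return tree_list

trees = translate("SELECT * FROM Road", ["Road"],
                  {"Road": [("Road-1-F", "CCF"), ("Road-2-F", "PCF-H"),
                            ("Road-3-F", "PCF-H")]})
print(trees)
```

With a replicated CCF plus two PCF-Hs, two trees come out: one executing directly over the CCF, and one over the union of the horizontal fragments.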
The above algorithm generates several fragment query trees from the global query tree. Its
main task is to identify the class of every fragment of each relation used in the query. To
classify a fragment, it must be checked against the attributes, and their parameters, used in the
query. Among our assumptions, we state that the attributes and ranges associated with each
fragment are known. Thus, to check whether a fragment is an NCF, we check whether it contains
none of the attributes requested by the query; if so, the fragment is an NCF. However, the fragment
may contain some or all of the attributes requested by the query. If it contains all the attributes,
then the fragment may be a CCF, a PCF-H, or an NCF, which can be verified from the data range
information of the fragment. If the fragment contains only some of the attributes, then it may be
a PCF-V, a PCF-M, or an NCF, which is again evident from the metadata associated with the
fragment. The complexity of fragment classification is of the order of O(n²), where n is the
number of attributes present in the projection and selection portions of the global query
corresponding to relation R.
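The classification check described above can be sketched as a small decision function. We assume, purely for illustration, that fragment metadata supplies an attribute set and a single (lo, hi) id range, and that the query supplies the attributes it touches plus an id range; real metadata would be richer.

```python
# Sketch of the fragment classification check. Metadata shape (attribute
# set plus one (lo, hi) id range) is an illustrative assumption.
def classify(frag_attrs, frag_range, q_attrs, q_range):
    if not (q_attrs & frag_attrs):
        return "NCF"                 # holds none of the requested attributes
    lo = max(frag_range[0], q_range[0])
    hi = min(frag_range[1], q_range[1])
    if lo > hi:
        return "NCF"                 # data ranges do not overlap
    has_all_attrs = q_attrs <= frag_attrs
    covers_range = (frag_range[0] <= q_range[0] and
                    q_range[1] <= frag_range[1])
    if has_all_attrs:                # all attributes: CCF or PCF-H
        return "CCF" if covers_range else "PCF-H"
    return "PCF-V" if covers_range else "PCF-M"   # only some attributes

# A fragment holding ids 1..10 with every queried attribute, query on 1..3:
print(classify({"id", "geometry"}, (1, 10), {"id", "geometry"}, (1, 3)))  # -> CCF
```

The two booleans mirror the text: attribute coverage separates the CCF/PCF-H branch from the PCF-V/PCF-M branch, and range coverage picks within each pair.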
5.4.2 Mapping Fragment Query Trees to Data Sites
Once the fragments have been classified and all the fragment trees have been generated, the next
step is to generate the local site plans which will be executed on the data nodes to obtain the query
results. This requires us to intelligently choose the best fragment tree and generate a corresponding
query execution plan. Depending upon the fragment classes, different fragment queries can be
generated. How do we choose the optimal fragment query for eventual execution? The criteria
that must be taken into account are discussed below.
Fragment Query Tree containing CCF
Suppose that the fragment query tree has been formed from a combination of a CCF and several
NCFs. Although the NCFs are not part of the fragment query tree, it is important to note that if the
CCF and the NCFs are present on the same data site, the processing time would be higher than
in cases where the CCF and the NCFs reside on separate sites. Thus, in order to select the
data site for local query execution optimally, the cost model should consider this aspect and choose
the best data site available. The complexity of choosing the best available data site is of the order of
O(n), where n is the number of data sites containing the CCF corresponding to the fragment
tree.
Fragment Query Tree containing PCF-H
If the global query is converted into a fragment query containing several PCF-Hs, then we
need to compare all the query plans by comparing the cost of executing each of them. For
PCF-Hs, the communication cost between the data nodes hosting the PCF-H fragments
becomes important. If a fragment query is formed by combining PCF-Hs distributed among a
large number of data sites, the communication cost will be much higher than for a query plan
in which the PCF-Hs are spread over fewer data sites. This implies that the communication cost
goes down when many PCF-Hs reside on the same data site; in that case, however, the processing
time will increase, particularly if a large number of NCFs are present alongside the relevant
PCF-Hs. Thus, in order to select the optimal fragment query tree, an average execution cost
must be computed which takes into account, and normalizes, both the communication cost between
data sites and the processing and input/output cost at each local site. The same discussion extends
to selecting the optimal fragment tree in the cases of PCF-Vs and PCF-Ms, or any combination of
these classes.
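The normalized average execution cost described above can be sketched as a weighted sum. The weights and the per-plan communication, processing, and I/O estimates below are illustrative assumptions, not values derived in the report.

```python
# Sketch of the combined cost used to rank candidate fragment query trees.
# Weights and estimates are illustrative.
def plan_cost(comm_cost, proc_cost, io_cost, w_comm=0.5, w_local=0.5):
    """Blend inter-site communication cost with local processing and
    input/output cost into a single comparable score."""
    return w_comm * comm_cost + w_local * (proc_cost + io_cost)

# Many scattered PCF-Hs: cheap locally, expensive to communicate.
# Few co-located PCF-Hs: cheaper to communicate, costlier locally.
plans = {"many-sites": plan_cost(comm_cost=8.0, proc_cost=1.0, io_cost=1.0),
         "few-sites": plan_cost(comm_cost=2.0, proc_cost=3.0, io_cost=2.0)}
print(min(plans, key=plans.get))  # -> few-sites
```

The weights would in practice be tuned to the network and hardware profile of the deployment; with equal weights, the plan touching fewer sites wins in this example despite its higher local cost.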
Chapter 6
Conclusion and Future Work Plan
This chapter presents the conclusions of MTP Stage I and the work to be done in MTP Stage II.
6.1 Conclusion
In the first stage, the thesis proposal has been presented. Areas related to distributed database
management systems and parallel spatial data processing have been studied in order to understand
the existing approaches to handling large-scale data and parallel systems. The domain of
distributed databases and existing solutions for big data analysis have been extensively surveyed,
and the problem has been formulated by identifying the components of distributed database
systems which need to be extended to support spatial data analytics.
From the literature survey, it is evident that the current state of the art in distributed database
systems aims to provide highly reliable and available services without paying due attention to
efficient query execution. On the other hand, parallel systems which have been extended from
traditional database management systems parallelize computations to generate result sets
while keeping the query efficiency of traditional database systems intact. Our aim is to bring
the advantages of these two domains together and develop an efficient, reliable, and available
distributed system capable of performing spatial analysis over petabyte-scale data.
The first attempt at efficient data handling has been made by optimally translating global
query execution plans into data site execution plans, as presented in Chapter 5. A generic
algorithm, independent of data placement strategies, for generating site-specific queries has been
presented, assuming that data partition strategies are in place and using the metadata about
fragmentation techniques and the attributes present in each fragment.
6.2 Work Plan for MTP Stage II
In Stage II, we aim to complete the design of the distributed query execution framework for spatial
data by considering design specifics in query translation, query optimization, and integration
with existing parallel spatial computation techniques. Once the design is complete, we intend
to implement the proposed system design by developing a middleware over an existing distributed
database management solution surveyed in Chapter 3.
We will start by designing the query optimization framework, employing spatial indexing
techniques for data access and considering the aspect of parallelization. Once the design is complete,
we will implement the spatial query support operators over Apache Hive and use Hadoop's
MapReduce paradigm to achieve distributed parallelization.
The goal for Stage II remains to design a spatially enabled, efficient distributed database
management system capable of adapting to dynamic data placement techniques, supporting the
addition, deletion, and manipulation of spatial data, and performing complex spatial operations
over the underlying data.
Bibliography
[1] Mongodb. http://www.mongodb.org/. Online. Last Accessed October 18, 2012. 14
[2] D. J. Abadi. Data management in the cloud: Limitations and opportunities. In IEEE Data
Engineering Bulletin, volume 32, pages 3–12, 2009. 20
[3] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexan-
der Rasin. Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for
analytical workloads. Proc. VLDB Endow., 2(1):922–933, August 2009. 14
[4] Ishfaq Ahmad, Kamalakar Karlapalem, Yu-Kwong Kwok, and Siu-Kai So. Evolutionary
algorithms for allocating data in distributed database systems. Distrib. Parallel Databases,
11(1):5–32, January 2002. 10
[5] Apache. Couchdb. http://wiki.apache.org/couchdb/Technical%20Overview. Online.
Last Accessed October 18, 2012. 13
[6] Peter M. G. Apers. Data allocation in distributed database systems. ACM Trans. Database
Syst., 13(3):263–304, September 1988. 10
[7] Andrzej Bialecki, Christophe Taton, and Jim Kellerman. Apache Hadoop: a framework for
running applications on large clusters built of commodity hardware. 2010. 18
[8] D. Brunner, G. Lemoine, F.-X. Thoorens, and L. Bruzzone. Distributed geospatial data
processing functionality to support collaborative and rapid emergency response. Selected
Topics in Applied Earth Observations and Remote Sensing, IEEE Journal of, 2(1):33–46,
March 2009. 20
[9] Mike Burrows. The chubby lock service for loosely-coupled distributed systems. In Pro-
ceedings of the 7th symposium on Operating systems design and implementation, OSDI ’06,
pages 335–350, Berkeley, CA, USA, 2006. USENIX Association. 16
[10] Stefano Ceri and Giuseppe Pelagatti. Distributed Databases: Principles and Systems.
McGraw-Hill Book Company, 1984. 25
[11] Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon
Weaver, and Jingren Zhou. Scope: easy and efficient parallel processing of massive data
sets. Proc. VLDB Endow., 1(2):1265–1276, August 2008. 16
[12] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike
Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed
storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1–4:26, June 2008.
13
[13] Bin Chen, Fengru Huang, Yu Fang, Zhou Huang, and Hui Lin. An approach for het-
erogeneous and loosely coupled geospatial data distributed computing. Computers and
Geosciences, 36(7):839 – 847, 2010. 21
[14] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bo-
hannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni. Pnuts:
Yahoo!’s hosted data serving platform. Proc. VLDB Endow., 1(2):1277–1288, August 2008.
17
[15] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clus-
ters. Commun. ACM, 51(1):107–113, January 2008. 17, 18, 39
[16] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash
Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vo-
gels. Dynamo: amazon’s highly available key-value store. In Proceedings of twenty-first
ACM SIGOPS symposium on Operating systems principles, SOSP ’07, pages 205–220, New
York, NY, USA, 2007. ACM. 16
[17] Tor Didriksen, Cesar A. Galindo-Legaria, and Eirik Dahle. Database de-centralization -
a practical approach. In Proceedings of the 21th International Conference on Very Large
Data Bases, VLDB ’95, pages 654–665, San Francisco, CA, USA, 1995. Morgan Kaufmann
Publishers Inc. 10
[18] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file system. In
Proceedings of the nineteenth ACM symposium on Operating systems principles, SOSP ’03,
pages 29–43, New York, NY, USA, 2003. ACM. 16
[19] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent,
available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, June 2002. 19
[20] Alon Y. Halevy, Zachary G. Ives, Jayant Madhavan, Peter Mork, Dan Suciu, and Igor
Tatarinov. The piazza peer data management system, 2004. 14
[21] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. Zookeeper: wait-
free coordination for internet-scale systems. In Proceedings of the 2010 USENIX conference
on USENIX annual technical conference, USENIXATC’10, pages 11–11, Berkeley, CA, USA,
2010. USENIX Association. 16
[22] IBM. IBM Intelligent Transportation. 2012. 2
[23] McKinsey Global Institute. Big data: The next frontier for innovation, competition and
productivity. 2011. 2
[24] Yannis E. Ioannidis. Query optimization. ACM Comput. Surv., 28(1):121–123, March 1996.
11
[25] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: dis-
tributed data-parallel programs from sequential building blocks. SIGOPS Oper. Syst. Rev.,
41(3):59–72, March 2007. 18
[26] Seyed Jalal Kazemitabar, Farnoush Banaei-Kashani, and Dennis McLeod. Geostreaming in
cloud. In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on GeoStream-
ing, IWGS ’11, pages 3–9, New York, NY, USA, 2011. ACM. 21
[27] Avinash Lakshman and Prashant Malik. Cassandra: structured storage system on a p2p
network. In Proceedings of the 28th ACM symposium on Principles of distributed computing,
PODC ’09, pages 5–5, New York, NY, USA, 2009. ACM. 17
[28] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew
Tomkins. Pig latin: a not-so-foreign language for data processing. In Proceedings of the
2008 ACM SIGMOD international conference on Management of data, SIGMOD ’08, pages
1099–1110, New York, NY, USA, 2008. ACM. 18
[29] M. Tamer Ozsu. Principles of Distributed Database Systems. Prentice Hall Press, Upper
Saddle River, NJ, USA, 3rd edition, 2007. 10
[30] M. Tamer Ozsu and Patrick Valduriez. Principles of Distributed Database Systems, Third
Edition. Springer, 2011. 25
[31] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel
Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis.
In Proceedings of the 35th SIGMOD International Conference on Management of Data,
pages 165–178, New York, NY, USA, 2009. ACM Press. 3
[32] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data:
Parallel analysis with sawzall. Sci. Program., 13(4):277–298, October 2005. 18
[33] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The hadoop
distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage
Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA, 2010.
IEEE Computer Society. 16
[34] Roshan Sumbaly, Jay Kreps, Lei Gao, Alex Feinberg, Chinmay Soman, and Sam Shah.
Serving large-scale batch computed data with project voldemort. In Proceedings of the 10th
USENIX conference on File and Storage Technologies, FAST’12, pages 18–18, Berkeley, CA,
USA, 2012. USENIX Association. 17
[35] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh
Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehousing solution
over a map-reduce framework. Proc. VLDB Endow., 2(2):1626–1629, August 2009. 19
[36] Werner Vogels. Eventually consistent. Commun. ACM, 52(1):40–44, January 2009. 17
[37] Jibo Xie, Chaowei Yang, Qunying Huang, Ying Cao, and M. Kafatos. Utilizing grid com-
puting to support near real-time geospatial applications. In Geoscience and Remote Sensing
Symposium, 2008. IGARSS 2008. IEEE International, volume 2, pages II-1290–II-1293,
July 2008. 20
[38] Chaowei Yang, Wenwen Li, Jibo Xie, and Bin Zhou. Distributed geospatial information
processing: sharing distributed geospatial resources to support digital earth. International
Journal of Digital Earth, 1(3):259–278, 2008. 20
[39] Yunqin Zhong, Jizhong Han, Tieying Zhang, Zhenhua Li, Jinyun Fang, and Guihai Chen.
Towards parallel spatial query processing for big spatial data. In Parallel and Distributed
Processing Symposium Workshops PhD Forum (IPDPSW), 2012 IEEE 26th International,
pages 2085–2094, May 2012. 1, 20
[40] Xiaofang Zhou, David J. Abel, and David Truffet. Data partitioning for parallel spatial join
processing. Geoinformatica, 2(2):175–204, June 1998. 20
List of Figures
2.1 Architecture of Distributed DBMS . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Horizontal and Vertical Fragmentation . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Steps in Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Query Execution Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 General Architecture for Data Oriented Computing . . . . . . . . . . . . . . . . 15
3.2 Execution Overview in MapReduce [15] . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Parallel DBMS vs. Distributed DBMS . . . . . . . . . . . . . . . . . . . . . . . 19
5.1 Query Processing in Distributed Databases . . . . . . . . . . . . . . . . . . . . . 25
5.2 Horizontally Fragmented Link Table . . . . . . . . . . . . . . . . . . . . . . . . 27
5.3 Vertically Fragmented Lake Table . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.4 Mixed Fragmented Building Table . . . . . . . . . . . . . . . . . . . . . . . . . . 28
List of Tables
2.1 Road Link Feature Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Road Node Feature Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Lake Feature Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Building Feature Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Horizontal Fragmentation: Fragment 1 . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 Horizontal Fragmentation: Fragment 2 . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 Horizontal Fragmentation: Fragment 3 . . . . . . . . . . . . . . . . . . . . . . . 8
2.8 Vertical Fragmentation: Fragment 1 . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9 Vertical Fragmentation: Fragment 2 . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 Widely Used Technologies in Big Data Analysis . . . . . . . . . . . . . . . . . . 16