DISTRIBUTED QUERY EXECUTION FRAMEWORK
FOR BIG SPATIAL DATA
By
Bharat Singhvi
Roll Number: 10305912
Under Guidance of
Prof. N. L. Sarda
MTP Stage I Report
Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Technology in Computer Science and Engineering
in the Kanwal Rekhi School of Information Technology at Indian Institute of Technology, Bombay
Mumbai, Maharashtra
‘If there’s one thing I like, it’s a quiet life. I’m not one of those fellows who get all restlessand depressed if things aren’t happening to them all the time. You can’t make it too placidfor me. Give me regular meals, a good show with decent music every now and then, andone or two pals to totter round with, and I ask no more.’
– Bertie Wooster
Acknowledgments
I would like to express my deepest gratitude towards Prof. N. L. Sarda for his constant guid-
ance and support in my pursuits. Without his insight and suggestions, it would not have been
possible to shape this project correctly. I would like to thank him for his valuable time and his
great involvement in the project.
I would also like to thank Dr. Smita Sengupta for hearing my excruciating thoughts countless
times and for making me believe that it is okay to try and fail.
I would never have been able to work without the support of my friends and colleagues from
GISE Lab. Thank you all for bearing with my presence and creating a positive work environment.
Bharat Singhvi
10305912
DISTRIBUTED QUERY EXECUTION FRAMEWORK FOR BIG
SPATIAL DATA
Bharat Singhvi
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai, Maharashtra
2012
ABSTRACT
An incredible amount of data is being generated these days, with sources as diverse as
satellites, scientific setups like the LHC, social networks and sensor networks; so much so
that the term "big data" has been coined specifically to describe it. Spatial extensions of
this data have become very common, with people sharing their location on social networks
and spatio-temporal data coming in from sensors and scientific setups spanning diverse
domains like oceanography, remote sensing and intelligent traffic management. Various
technologies have been designed specifically to mine this huge amount of data for meaningful
information. Several business intelligence tools use spatial data mining to solve complex
problems like facility placement, entering new markets and diversifying businesses.
The literature survey has revealed that while advanced systems have been built to handle
large-scale non-spatial data, these systems lack efficient mechanisms for analyzing spatial data.
Distributed database management systems have been a popular choice for managing big data.
While the face of computing has changed from distributed systems to cloud computing, which
provides highly reliable, available and fault-tolerant services, the underlying mechanisms have
remained the same. Data-oriented cloud services are defining this age of "petabyte data" by
providing facilities for data storage as well as computing, harnessing the power of distributed
systems and resource virtualization.
The aim of this thesis is to design a seamless framework for distributed query processing over
spatial data which enables the community to handle big spatial data efficiently. We propose to
design the architecture for such a system and implement the system over existing technologies.
The central idea of this thesis is to enable distributed database management systems to perform
complex geometrical operations over spatial data efficiently. We intend to bring the positives
of distributed and parallel database computing paradigms together and implement the overall
solution.
Table of Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Distributed Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Fragmentation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Placement of Fragments . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Distributed Query Processing . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Existing Distributed Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Literature Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Technology Overview for Big Data Processing . . . . . . . . . . . . . . . 15
3.1.2 Parallel DBMS vs. Distributed DBMS . . . . . . . . . . . . . . . . . . . 19
3.1.3 Distributed Processing in Geospatial Data . . . . . . . . . . . . . . . . . 20
3.2 Gap Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Thesis Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5 Distributed Query Translation Framework . . . . . . . . . . . . . . . . . . . . 25
5.1 Introduction to Translation Framework . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Query Translation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3.1 Classification of fragments . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3.2 Locating the data sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.4 Genetic Algorithm for Query Translation . . . . . . . . . . . . . . . . . . . . . . 30
5.4.1 Identifying Fragment Class and Generating Fragment Trees . . . . . . . . 30
5.4.2 Mapping Fragment Query Trees to Data Sites . . . . . . . . . . . . . . . 32
6 Conclusion and Future Work Plan . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.2 Work Plan for MTP Stage II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 1
Introduction
In the past few years, applications like Location Aware Services and WebGIS which thrive on
spatial data have gained significance in academic research as well as industry. These applications
are heavily dependent on spatial processing capabilities of the system. However, with the huge
amount of data to be processed, the existing techniques in query processing of spatial data are no
longer sufficient to meet the processing requirements. Query processing techniques for spatial
data mainly rely on the functions of spatial databases and key-value stores. Spatial databases
like Oracle Spatial and PostgreSQL with the PostGIS spatial extension provide query language
support for processing spatial data and are typically used to deal with relatively small data
sets. Spatial queries are known to be both I/O intensive as well as compute intensive which
means that a single query can take several minutes or even hours to produce results when
executed in a spatial database. As an experiment, the road network of Washington State,
available from OpenStreetMap, which consists of half a million nodes (point data) and over
1.2 million edges (polyline data), was stored in PostgreSQL installed on a Windows 7 system
with 4 GB of memory and 2 processing cores. A spatial index was created over the geometry
data. A simple buffer operation and intersection computation, finding all edges which lie
within a 5 km radius of a particular node, took 13 minutes to produce results. This simple
illustration shows that spatial databases would not suffice in a multi-user environment.
Another technique employs key-value stores like BigTable, HBase and Cassandra, which use
distributed query processing to perform spatial computations. While key-value stores can
execute queries faster, their indexing mechanisms do not use spatial indexes [39], which
results in inefficient query execution.
This requires one to model a system which is capable of processing big spatial data efficiently by
harnessing the inherent features of spatial data. Since spatial queries require high computational
capabilities, parallel and distributed computing paradigms form an ideal platform for system
design.
1.1 Motivating Example
To formulate a motivating example for designing an efficient spatial query processing system,
let us consider Location Aware Services. These services rely on GPS-enabled mobile devices
which collect information about users' whereabouts and use this information for a plethora
of applications like intelligent traffic management, contextual advertising, emergency services,
and locating nearby restaurants, hotels or friends in the vicinity. According to a
study conducted by McKinsey Global Institute, it has been estimated that by using location
data, consumers can save over 600 billion dollars annually by the year 2020 in terms of time
and fuel savings [23]. IBM's intelligent transportation project uses location data to increase
awareness of the situation across the road network and uses predictive paradigms to compute
future traffic volumes. It centralizes traffic management and traffic operations by collecting
information across geographically diverse locations [22]. Looking at the spatial processing
aspect of these examples, we can quickly conclude that intelligent traffic management requires
analysis of spatio-temporal data, while contextual advertising and the other services require
buffer and hotspot spatial analyses. Another application of spatial data processing is found
in the facility placement problem, where aggregation queries over spatial data and non-spatial
attributes like annual salary and spending habits of consumers are performed in order to find
the best location for opening a new facility.
Let us formulate a concrete example. Consider the city of Mumbai, which has over 71 million
cellphone users. Without loss of generality, let us assume that 50 million of these users
have GPS-enabled devices. The road network data of Mumbai roughly consists of 6000 nodes
and 14000 edges. If 50000 of these users simultaneously wish to locate places in their vicinity,
the system must perform spatial aggregation for distance-based classification of the geographical
region for all 50000 users at once. How much time, on average, would it take a single database
system to respond in this situation?
An experiment was conducted to answer this question. The road network of Mumbai was stored
in a PostgreSQL database with the PostGIS and pgRouting extensions. 10000 points spread
across Mumbai were generated randomly to serve as places of interest. The database consisted
of three tables: nodes, edges and places. The query was formulated as follows: "Find restaurants
within a 1 km radius of my current location such that the actual network distance between the
place and the current location is less than 2 km". 500 users were simulated by running 500
threads executing this query over the database. The average response time for each thread
turned out to be 7 seconds. As the number of users increases, we can expect the average
response time to go up further. This illustrates the scalability issue of single-node spatial
databases.
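The structure of such a load simulation can be sketched as follows. This is a hypothetical reconstruction, not the code used in the experiment above: the actual runs issued the PostGIS query over a database connection per thread, which is stubbed out here with a placeholder `run_query` function so that the threading and timing skeleton stands on its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(user_location):
    """Placeholder for the nearest-places query; the real experiment
    executed a PostGIS query over a database connection here."""
    time.sleep(0.01)  # stand-in for query latency
    return []

def simulate_users(num_users):
    """Issue one query per simulated user and report the mean response time."""
    def timed_query(uid):
        start = time.perf_counter()
        run_query((72.8 + uid * 1e-4, 19.0))  # hypothetical lon/lat near Mumbai
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=50) as pool:
        latencies = list(pool.map(timed_query, range(num_users)))
    return sum(latencies) / len(latencies)

avg = simulate_users(100)
print(f"average response time: {avg:.3f}s")
```

With a real database behind `run_query`, the reported average grows as the thread count rises, which is exactly the scalability effect observed in the experiment.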
“Would it help to use distributed paradigm for big data analysis? How would the scenario change
if we used a distributed spatial database system as opposed to a centralized relational database
management system? Can the system be made independent of the number of users? How can
the queries be executed efficiently over such systems?” are some of the questions which have
been explored in this thesis.
1.2 Problem Formulation
The foremost question to be answered is: what computing paradigm should be employed to
handle datasets significantly larger than what can be stored on a single hard disk? Secondly,
how can this data be managed? Finally, since the overall requirement is to run computationally
intensive operations over this data, it becomes important to analyse the features which a system
capable of handling big data must have. Such a system must be capable of performing data- and
compute-intensive operations over big data and hence is required to be scalable, autonomic and
fault-tolerant. By this, we mean that the system must respond quickly to changes in input load,
must be capable of tuning itself according to the requirements, and must continue operating
even if some of its components fail. Distributed systems are known to offer these features [31]
and hence are the obvious choice for investigation.
If we dig a little deeper into the use of distributed systems for data management and analysis,
it becomes important to consider how the data is placed over disks. In the context of spatial data, which
resides in the core of this thesis, if we consider the nature of spatial data and its relation with
geographic location - features intrinsic to spatial data like bounding box and locality reference
can serve as important parameters for distribution of data. This requires the overall system to
be adaptive and change dynamically with respect to changes in number of data sites and the
data which they hold. Thus, one of the areas of focus is to devise an optimal data placement
strategy based on the characteristics of spatial data.
Once the data has been placed, there must be an efficient way to access it. The lookup
mechanism should be independent of, or only lightly dependent on, changes in the data being
served by nodes or the addition/deletion of data nodes to/from the database federation. This
means that even with a large number of changes in the data at various data nodes, there should
be no need to update the lookup mechanism every time such changes occur. This forms another
area of research focus.
The main research focus is to develop a distributed data management framework for big spatial
data which is capable of handling the needs of scalability, high computing power and reliability.
The objective of this thesis is to provide such a framework, which is capable of:
1. Processing large scale spatial data sets efficiently.
2. Adopting dynamic data placement mechanisms for efficient data management.
3. Implementing the proposed framework using existing technologies.
4. Benchmarking the implemented system against centralized relational database management
systems for computations over large spatial data sets.
We restrict our problem to vector datasets. Henceforth, whenever we talk about spatial data,
we implicitly assume that it is vector data and not raster data.
1.3 Thesis Outline
The report has been organized as follows: Chapter 2 presents the concepts of distributed
databases which form the fundamental part of this thesis. Chapter 3 presents the literature
survey of related work in the field of GIS processing using distributed and parallel computing
paradigms. Chapter 4 presents the thesis proposal. Chapter 5 presents a query translation
framework for distributed execution of spatial queries. Chapter 6 summarizes the conclusions
and presents the work plan for the second stage of the thesis.
Chapter 2
Background Study
This chapter presents the background study introducing the concepts which are essential for
understanding the work planned in this thesis. Distributed query planning and execution are
fundamental to the thesis, and both are discussed briefly in this chapter. Existing distributed
data storage systems are also presented.
2.1 Distributed Databases
2.1.1 Introduction
Distributed Database Systems [DDBMS] bring together the processing technologies of database
systems and computer networks. A distributed database is composed of several interrelated
data files which are distributed over separate disks and are accessible over a computer network.
The key design features of distributed databases are:
• Data is distributed over several data nodes. The union of the data over these nodes forms
the entire database.
• Data nodes are capable of interacting with each other over a computer network. However,
each data node is independently capable of processing queries over the data it hosts.
• There is a logical relationship between the distributed data. The different data placement
mechanisms discussed subsequently ensure this relationship.
DDBMS are more reliable and available as compared to centralized database management
systems. This is because if any component in a centralized DBMS fails, the entire system is
affected by the failure. In a DDBMS, however, failures are local to sites and do not affect the
functionality of the overall system. Moreover, features like data replication over different data
nodes make DDBMS fault tolerant. In the case of large data sets, DDBMS offer improved
performance since the data is divided into a number of smaller datasets spread across data
nodes capable of processing this data in parallel. Figure 2.1 illustrates the architecture of
distributed database management systems.
Fig. 2.1: Architecture of Distributed DBMS
2.1.2 Fragmentation Techniques
In designing a DDBMS, the first question which needs to be answered is how the data must
be distributed and placed over geographically distributed data nodes. The data may be stored
on a central data node, it may be split and stored at several data nodes, or the same data may
be replicated at all data nodes. For improved performance, however, the data must be placed
in such a way that it can be accessed and processed efficiently. Several data fragmentation and
storage schemes generally employed in DDBMS are presented in this section.
Example
Let us formulate an example which will be referenced throughout the thesis. Consider spatial
data containing 4 features: Road Links, Road Nodes, Lakes and Buildings. Let us assume that
each of these features has the following table definition.
LinkID RoadName StartNodeID EndNodeID Width Geometry
0 A Road 1 2 10 Polyline((1,2),(3,4),(4,5))
1 B Road 0 4 8 Polyline((1,7),(8,4),(12,9))
2 C Road 3 7 4 Polyline((1,3),(3,8),(6,5))
3 D Road 2 5 15 Polyline((5,2),(8,5),(3,13))
4 E Road 4 7 9 Polyline((11,21),(13,4),(14,15))
5 F Road 5 6 6 Polyline((17,11),(6,17),(8,5))
Table 2.1: Road Link Feature Attributes
NodeID Geometry
0 POINT(1,2)
1 POINT(3,4)
2 POINT(5,6)
3 POINT(7,8)
4 POINT(9,10)
5 POINT(11,12)
6 POINT(13,14)
Table 2.2: Road Node Feature Attributes
LakeID WaterVolume Geometry
0 3400 POLYGON((1,2),(3,4),(5,6),(1,2))
1 6400 POLYGON((7,8),(9,10),(11,16),(7,8))
2 2300 POLYGON((2,7),(13,14),(25,66),(2,7))
3 1700 POLYGON((9,12),(24,43),(15,32),(9,12))
4 8600 POLYGON((21,32),(44,54),(53,61),(21,32))
5 9100 POLYGON((12,12),(33,14),(65,36),(13,63),(12,12))
Table 2.3: Lake Feature Attributes
BuildingID BuildingName Height Geometry
0 Building A 3400 POLYGON((1,2),(3,4),(5,6),(1,2))
1 Building B 6400 POLYGON((7,8),(9,10),(11,16),(7,8))
2 Building C 2300 POLYGON((2,7),(13,14),(25,66),(2,7))
3 Building D 1700 POLYGON((9,12),(24,43),(15,32),(9,12))
4 Building E 8600 POLYGON((21,32),(44,54),(53,61),(21,32))
5 Building F 9100 POLYGON((12,12),(33,14),(65,36),(13,63),(12,12))
Table 2.4: Building Feature Attributes
*Data is representative and is not accurate
We will assume that the data in these tables is to be distributed across several data nodes. We
will refer to these relations whenever required.
Horizontal Data Fragmentation
In the horizontal data fragmentation scheme, the rows of a global relation are divided and stored
over separate data nodes based on one or more attributes of the relation. The global relation
can be reconstructed by taking a UNION of all the rows from all data nodes. The Road Link
table after applying horizontal fragmentation for distribution over 3 sites looks as follows:
LinkID RoadName StartNodeID EndNodeID Width Geometry
0 A Road 1 2 10 Polyline((1,2),(3,4),(4,5))
1 B Road 0 4 8 Polyline((1,7),(8,4),(12,9))
Table 2.5: Horizontal Fragmentation: Fragment 1
LinkID RoadName StartNodeID EndNodeID Width Geometry
2 C Road 3 7 4 Polyline((1,3),(3,8),(6,5))
3 D Road 2 5 15 Polyline((5,2),(8,5),(3,13))
Table 2.6: Horizontal Fragmentation: Fragment 2
LinkID RoadName StartNodeID EndNodeID Width Geometry
4 E Road 4 7 9 Polyline((11,21),(13,4),(14,15))
5 F Road 5 6 6 Polyline((17,11),(6,17),(8,5))
Table 2.7: Horizontal Fragmentation: Fragment 3
In terms of spatial data, fragmentation based on MBRs [Minimum Bounding Rectangles] is
essentially a horizontal fragmentation technique. In MBR-based fragmentation, the entire
geometrical region bounding the spatial data set is divided into several rectangles such that
all the rectangles taken together reconstruct the entire area. The data can then be fragmented
by finding the MBR(s) to which a particular feature belongs and partitioning the data into
multiple tables accordingly.
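A minimal sketch of MBR-based horizontal fragmentation over the Road Link relation of Table 2.1 follows. It assumes, for simplicity, that each feature is assigned to exactly one fragment by the centroid of its bounding box (a real system might instead replicate features that span multiple MBRs); the helper names and the split line are illustrative.

```python
def bbox(coords):
    """Minimum bounding rectangle of a list of (x, y) points."""
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)

def mbr_fragment(features, x_split):
    """Assign each feature to one of two fragments by comparing the
    x-coordinate of its bounding-box centroid against a split line."""
    left, right = [], []
    for fid, coords in features.items():
        xmin, _, xmax, _ = bbox(coords)
        (left if (xmin + xmax) / 2 < x_split else right).append(fid)
    return left, right

# Road link geometries from Table 2.1, keyed by LinkID
road_links = {
    0: [(1, 2), (3, 4), (4, 5)],
    1: [(1, 7), (8, 4), (12, 9)],
    2: [(1, 3), (3, 8), (6, 5)],
    3: [(5, 2), (8, 5), (3, 13)],
    4: [(11, 21), (13, 4), (14, 15)],
    5: [(17, 11), (6, 17), (8, 5)],
}
left, right = mbr_fragment(road_links, x_split=9)
# The union of the fragments reconstructs the full relation
assert sorted(left + right) == sorted(road_links)
```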
Vertical Fragmentation
In vertical fragmentation, the global relation is divided on the basis of its attributes, which is
achieved by grouping the columns. The global relation can be reconstructed by taking a JOIN
of all the fragments. The Lake Feature table from the example discussed before, vertically
fragmented for division over 2 data nodes, will look as follows:
LakeID WaterVolume
0 3400
1 6400
2 2300
3 1700
4 8600
5 9100
Table 2.8: Vertical Fragmentation: Fragment 1
LakeID Geometry
0 POLYGON((1,2),(3,4),(5,6),(1,2))
1 POLYGON((7,8),(9,10),(11,16),(7,8))
2 POLYGON((2,7),(13,14),(25,66),(2,7))
3 POLYGON((9,12),(24,43),(15,32),(9,12))
4 POLYGON((21,32),(44,54),(53,61),(21,32))
5 POLYGON((12,12),(33,14),(65,36),(13,63),(12,12))
Table 2.9: Vertical Fragmentation: Fragment 2
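The split and its JOIN-based reconstruction can be sketched in a few lines. The dictionaries stand in for the Lake relation of Table 2.3, with the geometry strings abbreviated for readability.

```python
# Global Lake relation: LakeID -> (WaterVolume, Geometry)
lakes = {0: (3400, "POLYGON A"), 1: (6400, "POLYGON B"), 2: (2300, "POLYGON C")}

# Vertical fragmentation: each fragment keeps the key plus one attribute group
frag1 = {lake_id: vol for lake_id, (vol, _) in lakes.items()}    # LakeID, WaterVolume
frag2 = {lake_id: geom for lake_id, (_, geom) in lakes.items()}  # LakeID, Geometry

# Reconstruction: join the fragments on the shared key LakeID
rejoined = {lake_id: (frag1[lake_id], frag2[lake_id]) for lake_id in frag1}
assert rejoined == lakes
```

Note that the key (LakeID) is replicated in every vertical fragment; without it, the join that reconstructs the global relation would be impossible.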
Both the fragmentation schemes have been summarized in Figure 2.2.
Fig. 2.2: Horizontal and Vertical Fragmentation
Mixed Fragmentation
A combination of vertical and horizontal fragmentation techniques over a global relation may
also be employed. This means that vertical fragments of horizontally fragmented data [or vice
versa] may be generated. The original relation can be reconstructed by first applying a join over
all vertically divided fragments and then taking a union of the resultant relations [or vice versa].
2.1.3 Placement of Fragments
Fragment Allocation
Once the relations have been fragmented, the next step is to allocate the fragments to data
sites. The fundamental principle governing fragment allocation is to distribute the fragments
over data sites in such a way that most data access remains local. This is a complex problem,
and several algorithms exist which compute optimal fragmentation and allocation strategies;
some of them are presented in [4, 6, 17]. There are two strategies for the allocation of
fragments, static and dynamic, which are discussed next.
Fragment Placement
One strategy to place fragments is to analyze the expected load on the database and produce
an allocation based on this analysis. The expected load is generally expressed in terms of a set
of queries gathered from active systems. This is termed the static fragment placement method.
Its disadvantage is that it is not adaptable to changes in the load. Dynamic fragment placement
methods are used to make distributed databases adaptable to changing load; they involve
continuous monitoring of the database in order to dynamically tune the system for load
balancing.
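As an illustration of the dynamic approach, the sketch below rebalances by moving the most frequently accessed fragment from the most loaded site to the least loaded one. The load model (raw access counts per fragment) is a deliberate simplification of what a real monitoring component would collect; the site and fragment names are hypothetical.

```python
def rebalance(placement, access_counts):
    """Move the hottest fragment from the busiest site to the least busy
    one, where a site's load is the sum of its fragments' access counts."""
    load = {site: sum(access_counts[f] for f in frags)
            for site, frags in placement.items()}
    busiest = max(load, key=load.get)
    idlest = min(load, key=load.get)
    if busiest == idlest:
        return placement  # already balanced
    hottest = max(placement[busiest], key=access_counts.get)
    placement[busiest].remove(hottest)
    placement[idlest].append(hottest)
    return placement

placement = {"site1": ["F1", "F2"], "site2": ["F3"]}
counts = {"F1": 90, "F2": 40, "F3": 10}
rebalance(placement, counts)
print(placement)  # F1 moves from site1 (load 130) to site2 (load 10)
```

A production system would run such a step periodically, and would also weigh the cost of moving a fragment against the expected benefit.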
Decision Making
One more aspect associated with the placement of fragments is the distribution of control.
There are two ways in which decision making can be implemented in a DDBMS: either a
dedicated server gathers information and makes decisions for all nodes, or each data node is
autonomous and makes its own decisions. A hybrid scheme may also be employed, using
several central systems which each serve a number of data nodes.
2.1.4 Distributed Query Processing
This section presents the general query processing techniques employed in distributed databases.
Details of the discussed techniques can be found in [29]. There are four steps involved in
distributed query processing. First, the SQL query which is passed to the data site is translated
into algebraic notation; this step is termed query decomposition. Next, the decomposed query
is analyzed to locate the data fragments required for processing the query. This step generates
an outline of the query execution plan, which is passed on to the query optimizer for
improvement. Finally, the query is executed at the data node. Figure 2.3 shows the steps
involved in query processing.
Fig. 2.3: Steps in Query Processing
Localization Step
This step is not present in centralized DBMS since a single site hosts all data. However, in
a distributed system, data required for query processing needs to be located since it may be
present across several data sites. The system locates the data required for processing the query
by analyzing the tables which need to be accessed to answer it. In the case of a DDBMS for
big data, an important function of the data localization step is to eliminate the data sites
which are not relevant for answering the active query. This is important because otherwise
the query needs to be broadcast to all data sites, which leads to inefficient query execution.
This step generates a query plan which contains information about the data sites holding the
data needed for processing the active query.
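For spatial data, this site elimination can be driven by spatial extents: each data site advertises the MBR of the data it hosts, and the localization step keeps only the sites whose MBR intersects the query window. The following sketch assumes such a per-site catalog exists; the site names and extents are hypothetical.

```python
def intersects(a, b):
    """True if two rectangles (xmin, ymin, xmax, ymax) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def relevant_sites(site_mbrs, query_window):
    """Localization: prune sites whose extent cannot contribute to the query."""
    return [site for site, mbr in site_mbrs.items()
            if intersects(mbr, query_window)]

# Hypothetical catalog: per-site extents of the hosted fragments
catalog = {
    "site1": (0, 0, 10, 10),
    "site2": (10, 0, 20, 10),
    "site3": (0, 10, 10, 20),
}
print(relevant_sites(catalog, (2, 2, 8, 8)))  # only site1 needs the query
```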
Optimization Step
A detailed survey of query optimization techniques for distributed databases is presented
in [24]. The main function of this step is to improve the query execution plan generated in
the localization step. The tasks of the optimizer include finding an optimal join order and
using appropriate access strategies while executing the query. The optimizer may choose to
use the indices or discard them entirely and engage in a fragment scan. This step also takes
care of using replicated data, either by employing a parallel access mechanism or by choosing
the data site known to have the fastest access among those where the data is located.
Query Execution
In the context of distributed systems, there are two basic methods to execute a query. The
first uses data shipping, which involves moving data from remote sites to the site where the
query is being executed. The other strategy is to ship the query to the sites which contain the
appropriate data. Both techniques can be illustrated using the following example.
Let us consider that we need to compute the join of three different tables T, U and V. For
simplicity, let us assume that table T has been divided into two fragments while the other two
tables are stored without fragmenting the data. Also, let us assume that this data is hosted
at four different sites. Figure 2.4 gives an overview of the two strategies using this example.
Fig. 2.4: Query Execution Strategies
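A back-of-the-envelope comparison of the two strategies can be sketched by counting bytes moved over the network. The fragment sizes and query size below are made-up, and the model ignores intermediate join results, but it captures why query shipping usually wins when results are much smaller than the inputs.

```python
def data_shipping_cost(fragment_sizes):
    """All remote fragments are moved to the executing site."""
    return sum(fragment_sizes.values())

def query_shipping_cost(fragment_sizes, selectivity, query_size=1_000):
    """The query is sent to each site; only filtered results come back."""
    return sum(query_size + int(size * selectivity)
               for size in fragment_sizes.values())

# Hypothetical sizes in bytes of the fragments of T, U and V at remote sites
fragments = {"T1": 50_000_000, "T2": 50_000_000, "U": 80_000_000, "V": 20_000_000}
print(data_shipping_cost(fragments))         # 200,000,000 bytes moved
print(query_shipping_cost(fragments, 0.01))  # 2,004,000 bytes moved
```

With a selectivity of 1%, query shipping moves roughly two orders of magnitude less data; data shipping becomes attractive only when most of the data is needed at the executing site anyway.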
This concludes our discussion on distributed databases. Next, we look into existing systems
which provide distributed storage of, and distributed query processing over, data.
2.2 Existing Distributed Systems
Several systems facilitate distributed data storage and allow one to run queries over data
distributed across several disks. While some systems provide only data storage facilities and
others only query processing facilities, several provide a combination of both. Some of these
systems are presented in this section.
Google’s BigTable
Google’s BigTable [12] is a distributed storage system which is capable of storing large data
sets that may be of several petabytes in size and distributed among thousands of data nodes.
BigTable is based on a semi-structured data storage technique: data can be viewed as a large
map indexed by a row key, a column key and a timestamp. Each value within the map is an
array of bytes that is interpreted by the application. Every read/write operation over a row is
atomic, regardless of the number of columns read or written within that row. BigTable is thus
a collection of (key, value) pairs where the key identifies a row and the value is the set of
columns. BigTable's data is distributed among many independent machines; it uses a
distributed file system, the Google File System, for managing data access and updates. The
components of BigTable include a client library, a master server for coordination of activity,
and several tablet servers, which may be added or removed dynamically. The master server is
responsible for assigning tablets to tablet servers, balancing the load on tablet servers, garbage
collection, and managing changes in schema.
Apache HBase, developed over Hadoop core, is an opensource implementation of Google’s
BigTable.
Apache CouchDB
Apache CouchDB [5] is a schema-free, peer-based, document-oriented distributed storage
system. The primary data unit in CouchDB is a semistructured document which may consist
of numerous columns, associated metadata and attachments. The size and format of documents
are variable, and each document is uniquely identified by a document ID. CouchDB is a
web-based database system primarily used in document aggregation and filtering. Optimization
in CouchDB relies heavily on its view management capabilities: views are created using
JavaScript functions which, for each input document, produce any number of rows in the output
view. The database is kept in a consistent state at all times, minimizing the rollbacks which are
necessary after a system crash. CouchDB also provides multi-version concurrency control,
allowing different applications to simultaneously read and write the same document data.
Piazza System
The Piazza system [20] provides a combination of data storage and query processing capabilities
for data management in peer systems hosting heterogeneous data. Piazza allows complete
autonomy at the data sites in terms of schema implementation and the kind of data hosted.
The system provides strong schema mediation functionality, reformulating the input query in
accordance with the schema mappings. Data is stored in XML format at the data sites, and an
XQuery-based query language is used for processing the data. Queries which are sent over the
network are modified to reflect the schema mapping.
HadoopDB
HadoopDB [3] is a database middleware system designed for the cloud. It has been developed
over Hadoop, the open-source implementation of the MapReduce computing paradigm.
HadoopDB provides data storage as well as query processing capabilities. Data storage is
enabled by using a local database management system at each site, and communication between
data sites is managed using the Hadoop framework. The local database management systems
are integrated to work with the MapReduce framework, and hence these DBMSs serve in place
of the distributed file system. Information about the local sites, the datasets present at each
site, and details about the partitioning and replication of data is maintained in HDFS, which
serves as the metadata catalog used in query decomposition. For distributed query processing,
the SQL query is converted into MapReduce jobs using Hive, and the Hadoop framework is
then used to coordinate amongst the data sites and process the query on the local DBMSs.
MongoDB
MongoDB [1], like CouchDB, is a document-oriented database management system, but it provides
more advanced query processing services than CouchDB. In MongoDB, the data unit is again a
semistructured document, and documents are grouped to form collections; generally, there is one
collection per document type. MongoDB can fragment the data and store it at several data sites.
It uses replication for increased availability and also provides automatic load balancing for
efficient query processing. An imperative language is used for querying the underlying data,
which initiates a scan over collections followed by filtering. A B-tree-based indexing mechanism
is used for performance enhancement.
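The scan-and-filter query model can be illustrated with a small sketch. The collection contents and the Python-callable predicate below are invented for the example; MongoDB itself takes JSON-style predicates such as `{"height": {"$gt": 20}}`:

```python
# Illustrative simulation of scan-then-filter over a document collection.

buildings = [  # one collection per document type
    {"id": 1, "name": "Library", "height": 25},
    {"id": 2, "name": "Hostel", "height": 18},
    {"id": 3, "name": "Tower", "height": 60},
]

def find(collection, predicate):
    """Scan the collection and keep documents matching the predicate."""
    return [doc for doc in collection if predicate(doc)]

tall = find(buildings, lambda d: d["height"] > 20)
# Without an index this is a full scan; a B-tree index on "height"
# would let the system seek directly to the qualifying documents.
```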
This section presented some of the existing systems that support distributed database
management. However, it should be noted that none of these systems provides functionality for
efficient handling of, and query processing over, large spatial datasets. These systems do,
however, provide insight into the architectures used for distributed query processing. We intend
to utilize features from these systems in building an efficient distributed query processing
system for big spatial data.
Chapter 3
Literature Survey
This chapter presents a concise summary of related work in distributed spatial databases. The
survey has been categorized into three classes: current trends and technologies in large-scale
data analysis, a comparison of the distributed and parallel computing paradigms, and related
work on GIS processing in distributed systems. A gap analysis is then presented and an attempt
is made to formulate the problem in its entirety.
3.1 Related Work
3.1.1 Technology Overview for Big Data Processing
In this section, we describe current research in the field of data-oriented computing by surveying
the technologies used for compute-intensive problems over big data and the research directions
being pursued to solve them.
First, we present the components of the computing architecture generally employed in
data-oriented computing technologies. Although the technologies described later are diverse,
they share a number of common features. The computing architecture presented in Figure 3.1
captures the common elements of these technologies.
Fig. 3.1: General Architecture for Data Oriented Computing
In this architecture, coordination exists at the lowest level and forms the basis of distributed
data access and querying. The distributed data layer provides data abstraction through
sophisticated techniques and interfaces for data access. The computing layer, which is the
topmost layer, is responsible for providing distributed processing capabilities and offers
numerous high-level languages through which it can be accessed. Some of the widely used
technologies for big data management and analytics are summarized in the following table.
Layer         Google      Yahoo!            Microsoft             Miscellaneous
Languages     Sawzall     Pig               DryadLINQ and SCOPE   Hive
Computation   MapReduce   Hadoop            Dryad                 -
Database      BigTable    HBase and PNUTS   -                     Cassandra
File System   GFS         HDFS              Cosmos                Dynamo
Coordination  Chubby      Zookeeper         -                     -

Table 3.1: Widely Used Technologies in Big Data Analysis
From the above table, we find that two technologies are used in the coordination layer, namely
Chubby [9] and Zookeeper [21]. These technologies are responsible for maintaining information
pertaining to configuration and data management. They are also responsible for synchronizing
the distributed services provided by the data distribution layer. The primary characteristic of
the services provided by Chubby and Zookeeper is high reliability and availability. Next, we
describe the technologies used for storing distributed data.
The Google File System [18], employed by Google, is based on the master-slave technique. In
GFS, the master node is responsible for metadata operations, whereas the slave nodes are used
for the actual computation over the data. GFS does not offer strict consistency; for example, a
record may be appended to the data multiple times. GFS runs as a user-space process which
lazily stores data as files over the local file system. The Hadoop Distributed File System is based
on the same architecture as GFS; further details on HDFS can be found in [33]. It should be
noted that in both HDFS and GFS, the master node constitutes a single point of failure. To
ensure high availability, replicas of the master are kept ready to take over in case of failures.
Microsoft uses Cosmos [11] as its file system for distributed computing. While the details of this
technology have not been shared in the paper, it is mentioned that Cosmos is an append-only
file system optimized for petabyte-scale data, and that it can be regarded as Microsoft's internal
counterpart of GFS. Data is replicated and compressed for increased fault tolerance and higher
efficiency. Dynamo [16] is a data storage technology developed by Amazon and is heavily used
in Amazon's services (such as the shopping cart on Amazon's website and the Amazon S3
service). Dynamo is a highly available and reliable key-value store which ensures low-latency
access to data. It is based on a P2P architecture which uses consistent hashing for load
balancing and implements a gossip-based protocol to keep replicas consistent. A further
description of this technology can be found in [36].
An important point to note here is that, except for Dynamo, the file systems discussed above
are stream-oriented, append-only systems. In order to make these systems robust and
application friendly, data abstraction mechanisms are developed over them.
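The consistent hashing that Dynamo uses for load balancing can be sketched as follows. The node names are hypothetical, and the ring omits Dynamo's virtual nodes and replication for brevity: nodes and keys are hashed onto a ring, and each key is owned by the first node clockwise from its position.

```python
# Minimal consistent-hashing sketch (illustrative; not Dynamo's code).
import bisect
import hashlib

def ring_hash(s):
    """Map a string to a position on the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # place each node on the ring, sorted by hash position
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def owner(self, key):
        """First node clockwise from the key's position owns the key."""
        hashes = [h for h, _ in self.ring]
        i = bisect.bisect_right(hashes, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
assignments = {k: ring.owner(k) for k in ["lake-3", "road-7", "bldg-2"]}
# Adding or removing one node relocates only the keys in its arc of the
# ring, which is why consistent hashing rebalances gracefully.
```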
Databases provide the functionality of data abstraction. An introduction to Google's BigTable
and Apache HBase was presented in Section 2.2. Both these databases are non-relational in
nature and are used to store semistructured or unstructured datasets. The data is stored in
column-wise fashion, which facilitates better representation and compression. The SSTable is
the fundamental structure employed in BigTable and HBase; it is used to store data and is
designed so that a lookup through it requires a single disk access. PNUTS [14] is a storage
technology developed by Yahoo! which is similar to BigTable and HBase. It allows data lookup
based on key-value pairs. The distinguishing feature of PNUTS is that it provides per-record
consistency by guaranteeing that operations on a single record will be applied in the same order
across all replicas. Apache Cassandra [27], which is used to power Facebook, is an innovative
data storage solution that brings the strengths of BigTable and Dynamo together and offers a
structured storage system over a peer-to-peer network. Voldemort [34] is a non-relational,
open-source, key-value database system developed by LinkedIn which can be seen as a
persistent distributed hash table.
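The single-disk-access property of the SSTable comes from keeping records sorted by key with an in-memory index over them. A minimal sketch, with the data held in a list instead of an on-disk file:

```python
# SSTable-style lookup sketch: sorted immutable records plus an index
# that locates a key's offset, so only one "disk access" is needed.
import bisect

records = sorted([("k03", "v3"), ("k01", "v1"), ("k07", "v7"), ("k05", "v5")])
index_keys = [k for k, _ in records]  # a real sparse index keeps fewer entries

def sstable_get(key):
    """Binary-search the index, then read the record at that offset."""
    i = bisect.bisect_left(index_keys, key)
    if i < len(records) and records[i][0] == key:
        return records[i][1]  # single read at the located offset
    return None

value = sstable_get("k05")  # "v5"
```

In BigTable and HBase the index block is loaded into memory when the SSTable is opened, so the binary search costs no I/O at all.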
Next, we move to the computation layer shown in Figure 3.1. Different computing paradigms
are deployed for data-intensive computing. Their common characteristic is that they concentrate
on the dataflow while automating the parallelization of the computation.
Google uses the MapReduce [15] computing paradigm, inspired by functional programming
languages, for distributed data processing. MapReduce is based on two functions, map and
reduce. The map function reads a list of key-value pairs and generates lists of intermediate
key-value pairs, which are sorted and grouped by key. The reduce function then accepts an
intermediate key along with the values associated with it and merges them to produce the
final result. An execution overview of the MapReduce paradigm is illustrated in Figure 3.2. It
should be noted that the map and reduce phases are independent of each other and do not
overlap at any point during execution. The map and reduce functions are purely functional and
hence easy to parallelize, and fault tolerance can be achieved by re-executing the failed portion
of the computation.
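The map/shuffle/reduce dataflow described above can be emulated in a few lines with the classic word-count example. This single-process sketch only mirrors the sort-and-group-by-key step that real MapReduce performs across machines:

```python
# Single-process emulation of the MapReduce dataflow (illustrative).
from collections import defaultdict

def map_fn(_, line):
    """Map: one (key, value) input -> list of intermediate pairs."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: intermediate key + its grouped values -> final pair."""
    return (word, sum(counts))

def mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):   # map phase
            intermediate[k].append(v)     # shuffle: group values by key
    # reduce phase over the sorted intermediate keys
    return dict(reduce_fn(k, vs) for k, vs in sorted(intermediate.items()))

counts = mapreduce([(0, "big spatial data"), (1, "big data")],
                   map_fn, reduce_fn)
# counts == {"big": 2, "data": 2, "spatial": 1}
```

Because `map_fn` and `reduce_fn` are side-effect free, the framework is free to run any subset of them in parallel or to re-execute them after a failure.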
Fig. 3.2: Execution Overview in MapReduce [15]
Hadoop is a reimplementation of the MapReduce paradigm in Java and has been reported to
have performance superior to that of Google's original implementation. Dryad is Microsoft's
alternative to MapReduce. Specific details about these technologies are reported in [7] and [25]
respectively.
Finally, we present the languages used for interfacing user requirements with the computing
systems discussed above; these languages facilitate writing programs for those systems.
Sawzall [32] is a parallel data processing language built on top of MapReduce. Aggregation
operations like counting, representative sampling and histogram computation are predefined in
the language for processing the underlying datasets. The language allows programmers to
declare the desired operations in terms of aggregators and to implement the logic for filtering
datasets as required. It forces programmers to think in terms of processing a single record at a
time, which in turn allows the system to completely parallelize the computation. Pig [28] is
another sophisticated language capable of performing complex database operations like
aggregation, filtering and joins. Statements written in Pig are translated into corresponding
Hadoop jobs which are executed in parallel. It allows programmers to define operators which
correspond to the relational algebra primitives found in database systems. DryadLINQ is a
programming language which translates programs into Language Integrated Query (LINQ)
expressions, a query facility developed in Microsoft's .NET framework. Hive [35] is an attempt
by Facebook to build a data warehousing solution on top of Hadoop; it is still nascent and lacks
an efficient optimizer.
This completes the survey of existing technologies which employ distributed data management
and query processing for petabyte-scale data. An important point to note is that all these
systems treat spatial data the same as non-spatial data when it comes to query processing; they
do not harness the intrinsic features of spatial data, which makes them inefficient for performing
spatial computations.
3.1.2 Parallel DBMS vs. Distributed DBMS
Parallel database management systems are the direct competitors of distributed systems. In
order to put this comparison in the correct perspective, it is important to understand the CAP
theorem [19], which states that a system can guarantee at most two out of three properties:
consistency, availability and partition tolerance. For distributed systems meant to handle large
datasets, partition tolerance is highly desirable, since network failures cannot be prevented. On
the other hand, parallel database management systems have evolved from traditional DBMSs
and hence prefer consistency. Thus, in comparing PDBMS with DDBMS we are essentially
comparing ACID (Atomicity, Consistency, Isolation and Durability) with BASE (Basically
Available, Soft state and Eventually consistent). With BASE, systems become more scalable at
the cost of consistency, while with ACID, systems are less scalable because of consistency
constraints. Thus, both PDBMS and DDBMS have their own merits. The following table
presents the advantages of each.
Fig. 3.3: Parallel DBMS vs. Distributed DBMS
Generally, a hybrid approach is recommended which selects the best of these two paradigms.
This calls for creating a new DBMS tailored to match all requirements. Such considerations are
being taken seriously with the advent of cloud computing in recent times [2].
3.1.3 Distributed Processing in Geospatial Data
[39] presents a parallel spatial query execution model for big spatial data. The paper presents a
scheme to partition the data based on geographical space, together with a corresponding spatial
object file which is used to refer to the data across the cluster nodes. It also presents a
distributed indexing mechanism based on the data partitioning hierarchy and uses this index
within the MapReduce paradigm for spatial query computation. The experimental results show
that the proposed scheme for data management and processing is faster than a PostgreSQL
cluster, HDFS, Cassandra and HBase in terms of reads and bulk loading. The results also
indicate that the scheme is faster than Oracle Spatial and PostgreSQL with PostGIS in terms of
processing spatial operations like spatial selection and nearest-neighbour computation.
[38] presents a method for distributed geospatial data processing by proposing a middleware
which uses a spatial web services search engine and a metadata repository to achieve distributed
processing. The search engine is used to find existing geospatial web services (WMS/WPS/WCS)
over the internet and to collect the available metadata, which is stored in the metadata
repository of the middleware. Any query which comes to the middleware is first matched against
the information available in the metadata store, and the sites which are suitable to answer the
query are selected. The query is then routed to all these sites, and the resulting
WMS/WPS/WCS responses are merged into a single result.
In [40], the authors present a framework for performing parallelized spatial join operations.
They describe the filter-and-refine strategy typically involved in geometric computations and
present a framework which decomposes the filtering task to perform parallel filtering over the
dataset, and decomposes the refinement task by proposing an object redistribution scheme.
Optimization strategies to improve processing and communication costs are discussed, achieved
by manipulating the size of the data structures used in parallelization and by choosing
appropriate data partitioning schemes. Experimental results validate the optimized cost model.
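The filter-and-refine strategy can be sketched as follows. The bounding-box filter and the stand-in "exact" test below are simplified choices for the example, not the scheme of [40]; geometries are point lists and boxes are (minx, miny, maxx, maxy) tuples:

```python
# Filter-and-refine sketch for a spatial join (all data invented).

def bbox(points):
    """Axis-aligned bounding box of a geometry's vertices."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def exact_intersects(g1, g2):
    # stand-in for a costly exact geometric test; here: any shared vertex
    return bool(set(g1) & set(g2))

def spatial_join(left, right):
    # FILTER: cheap bounding-box test selects candidate pairs
    candidates = [(i, j) for i, g1 in enumerate(left)
                  for j, g2 in enumerate(right)
                  if boxes_overlap(bbox(g1), bbox(g2))]
    # REFINE: expensive exact test runs only on the candidates
    return [(i, j) for i, j in candidates
            if exact_intersects(left[i], right[j])]

roads = [[(0, 0), (2, 0)], [(10, 10), (12, 10)]]
lakes = [[(2, 0), (3, 1)], [(50, 50), (51, 51)]]
# only the pair (road 0, lake 0) survives both filter and refinement
```

The filter step is what parallelizes cheaply, which is why [40] decomposes it separately from refinement.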
[8] presents a collaborative mapping and feature extraction strategy which collects geospatial
data from distributed sources and integrates and analyses it for meaningful visualization. The
paper focuses primarily on the image processing aspect and is relevant for raster data; however,
the architecture presented for integrating large image datasets can be extended to vector data
as well. The steps involved in the integration process include reprojection of the data to a single
reference system and creation of image tiles from the reprojected data. It further presents a
feature capturing algorithm using image processing techniques, eventually exporting the
analyzed data as KML which can be visualized over Google Earth. This paper provides insight
into the standards in geospatial computing.
[37] is a survey paper which investigates the relevance of grid computing for supporting
geospatial applications which require real-time response. The paper presents the results of
simulating computation over large geospatial datasets. The simulations indicate that the
response time of the applications decreases continuously as the number of cores in the cluster
hosting the application increases. However, the paper does not discuss the implementation
challenges and does not pose any guidelines to be followed while designing a spatial computation
system over grid computing technology.
[13] discusses a distributed computing mechanism for geospatial data which brings data from
heterogeneous sources together. The paper presents the complete transformation process of a
global geospatial query into local site queries: global queries are converted into algebraic form
and resolved using a lexical analyzer and parser to generate the local queries, which are
eventually executed to gather results. The paper also presents a distributed query execution
manager which takes the resolved global query and collaborates with the data nodes in the
geospatial repository to generate results for the local queries. These results are later merged by
the query execution manager and presented back as a single response. The paper presents an
example transformation of a simple global query into local queries. However, experimental
results have not been included, which makes it difficult to validate the correctness of the
proposed schemes. Moreover, the scheme is not complete and lacks query optimization, cost
estimation and dynamic data partitioning.
[26] discusses a new alternative to distributed DBMSs by leveraging the cloud computing
paradigm. The paper explains why cloud computing would be an ideal choice for geostreaming
applications like location-based services and intelligent transportation systems by analyzing the
features required in these applications: parallelization, dynamic load and resource-intensive
nature. The paper presents the ElaStream framework, which is based on the MapReduce
paradigm but fundamentally different from it, in that the framework is independent of
persistent data storage and is designed as a push-based model. Though the paper does not
validate the correctness of the presented framework and there are no experimental results to
deduce whether or not the system works, it identifies the challenges in scalable stream
processing and discusses how these challenges are avoided by the functionality of the proposed
framework.
3.2 Gap Analysis
From the literature survey of related work presented in Section 3.1, the following can be
concluded:
• Much research focuses on exploiting the MapReduce paradigm for spatial computations.
However, the MapReduce architecture is suited to one-time data processing: once the
processing is complete, the data is permanently offloaded from the system. Such systems do
not care about performance as much as they care about availability and reliability. Spatial
data, on the other hand, is not suited to one-time use, since a user may want to analyze the
same data repeatedly to produce ever more meaningful results. Hence, there is a need to
design a scalable system capable of efficiently processing spatial data by using indexes and
the other techniques found in traditional spatial DBMSs.
• While the strategies discussed in the literature survey concentrate on providing parallel
computation of spatial operations, they do not use distributed data partitioning schemes and
do not consider the network costs incurred in accessing remote data. While the central idea
of performing the computations is present, the underlying structure for efficient data access
and processing is missing.
• These gaps require us to design a framework for distributed query processing which brings
together the merits of traditional spatial databases, such as efficient data access and query
optimization, with the merits of distributed database management systems, such as no single
point of failure, scalability and autonomous data sites.
Chapter 4
Thesis Proposal
The ultimate objective of this thesis is to provide a robust framework capable of handling big
spatial data efficiently and effectively. We aim to provide a solution in the area of spatial data
mining and large-scale spatial data analysis by building over the existing knowledge in
distributed database management systems and empowering them to manage spatial data. As
discussed earlier in Section 1.2, the main research objectives of this thesis are to:
1. Conduct a comparative study of existing technologies in distributed database management.
2. Study various techniques of spatial data processing in parallel and distributed environments.
3. Provide a framework capable of processing large-scale spatial data.
4. Implement the framework using existing technologies and integrating the studied techniques.
5. Benchmark the performance of the developed system.
The methodology which has been employed so far, and which will be used to complete the
thesis, is discussed briefly below.
In order to understand the context of the thesis, it was necessary to get a background on
distributed databases and the state of the art in current distributed systems. This study has
been presented in Chapter 2.
Secondly, the existing technologies for big data management over non-spatial data have been
studied. Several technologies developed and used by leading industries have been summarized
in Chapter 3. Furthermore, the competing paradigms for large-scale data analysis have been
studied: parallel DBMSs, which are directly extended from relational systems, were concluded
to be more efficient, while distributed DBMSs are more reliable and fault tolerant. Cloud
computing allows us to bring the positive aspects of both the parallel and distributed computing
paradigms under one roof, which is one of the objectives of this thesis. Figure 3.1 shows the
different layers of the distributed computing architecture, which is very similar to the cloud
architecture, and hence the discussion in the literature survey can be extended to the cloud.
The desired outcome of this study was to become acquainted with the technologies employed
for big data analysis.
Next, the parallel and distributed techniques of query processing in geospatial information
systems were surveyed. Several papers presented efficient techniques of data partitioning and
storage, data retrieval, and optimization of storage access using spatial indexing techniques as
well as parallelized operations. Some interesting research papers have been summarized in
Chapter 3.
The background study and literature survey were used to put the problem in the correct context
and to start designing the distributed database management system for spatial data. The
components which must be studied and implemented in order to achieve the objective have
been identified. Together, they project the complete distributed framework for big spatial data
analysis:
• Data Partition Manager, which dynamically fragments the data based on its characteristics
and allocates fragments to data sites.
• Query Translation Framework, which parses the global query tree and generates site-specific
query execution plans.
• Communication Manager, which handles all communication between data sites (data
shipping, collecting results, etc.).
• Query Optimizer, which optimizes the site queries for efficient processing.
• Replication and Duplication Manager, which enables parallelization in the system and makes
it fault tolerant.
All these components must work in complete synchronization to achieve the overall goal. The
core of the system must contain algorithms capable of performing all spatial computations over
geometry data. A point to be noted here is that the functional and non-functional requirements
of the proposed framework may change over the course of the thesis. The task at hand now is
to design these components to work hand in hand and move closer to the objective.
Chapter 5
Distributed Query Translation Framework
In an attempt to build an efficient and scalable distributed spatial database system, we first
present the query translation framework which is responsible for taking the global query tree as
input and generating query execution plans for local sites hosting the actual data. The basic
structure of the query translation framework is based on the theory of query processing discussed
in [10, 30].
5.1 Introduction to Translation Framework
In a distributed database application, data is partitioned into fragments which are stored at
distributed data sites. The placement of data is a complex issue, as discussed in Section 2.1.3,
and there are several allocation algorithms which strategically place data in order to reduce
data access time and improve query performance. In this chapter, we build on query processing
in distributed systems to construct a framework for efficient translation of a global query into
local site query execution plans. Figure 5.1 gives an overview of query processing in distributed
databases. We discuss our contributions to this general architecture subsequently.
Fig. 5.1: Query Processing in Distributed Databases
5.2 Assumptions
The query translation framework assumes that:
1. Data placement has already been done.
2. Information about the placement of each relation in the database is available. This
information includes: (a) the physical address of the local data nodes where the data has been
placed, (b) the type of fragmentation used to partition the data, (c) the size of the relation on
each data node, (d) the attributes of the relation contained in each fragment, and (e) the data
range present in each fragment.
3. Information about communication costs between any two data nodes is available.
4. The global query execution plan is available.
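The per-fragment metadata assumed in point 2 can be pictured as a small catalog structure. The field names, node addresses and values below are hypothetical illustrations, not a prescribed schema:

```python
# Sketch of the metadata catalog the framework assumes is available.
from dataclasses import dataclass

@dataclass
class FragmentInfo:
    relation: str        # global relation the fragment belongs to
    node: str            # physical address of the hosting data node (a)
    fragmentation: str   # "horizontal", "vertical" or "mixed"       (b)
    size_tuples: int     # size of the fragment on this node         (c)
    attributes: tuple    # attributes contained in the fragment      (d)
    data_range: tuple    # (low, high) key range held                (e)

catalog = [
    FragmentInfo("RoadLinks", "node-1:5432", "horizontal",
                 2, ("Road-ID", "Road-Name", "Geometry"), (0, 1)),
    FragmentInfo("RoadLinks", "node-2:5432", "horizontal",
                 2, ("Road-ID", "Road-Name", "Geometry"), (2, 3)),
]
# Assumption 3 would add a communication-cost matrix between the nodes:
comm_cost = {("node-1:5432", "node-2:5432"): 1.0}
```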
5.3 Query Translation Framework
5.3.1 Classification of fragments
Let us consider a simple query Q over a single global relation R which has been fragmented
into three relations R1, R2 and R3, stored at three different data nodes using some
fragmentation scheme. Several possibilities must be considered in order to answer this query.
One possibility is that the data required to answer the query is distributed over all three data
sites, and the results from each site have to be merged to answer the query completely. Another
possibility is that the data required is present on only 2 of the 3 data nodes; in this case, the
third node plays no role in answering the query, and executing the query over it would be
inefficient. Yet another possibility is that the data required to answer Q exists completely on
each of the data nodes. Thus, it becomes important to classify the fragments according to their
importance for answering a specific query, and then to derive the relationship between the
physical location of each fragment and the parts of the query which need to be executed at each
data site. Based on the different fragmentation schemes presented in Section 2.1.2, we can
classify fragments with respect to a query in the following ways:
• Completely Capable Fragment: A Completely Capable Fragment (CCF) with respect to
a query Q is a fragment which is capable of answering Q completely. This means the fragment
contains all the attributes of the global relation R which are projected in Q, and all the tuples
of R which satisfy the selection condition in Q.
• Not Capable Fragment: A Not Capable Fragment (NCF) with respect to a query Q is a
fragment which is incapable of answering any part of the query. A fragment can be considered
an NCF if it returns zero results for the query.
• Partially Capable Fragment: A Partially Capable Fragment (PCF) is one which does not
contain complete information for answering a query Q, but contains some part of the
information useful in answering Q. Depending upon the fragmentation technique employed
over the global relation R, a PCF can be further classified into three kinds:
1. PCF-Horizontal: A PCF-H fragment with respect to query Q is a fragment which contains
ALL the attributes of relation R mentioned in the query, but only some (not all) of the tuples
which satisfy the selection condition in Q.
2. PCF-Vertical: A PCF-V fragment with respect to query Q is a fragment which holds a
proper subset of the attributes mentioned in the query and contains the projection of ALL
the rows that would answer the query.
3. PCF-Mixed: A PCF-M fragment with respect to query Q is a fragment which contains a
proper subset of the attributes of the global relation R mentioned in the query, and the
projection of only some (not all) of the tuples which satisfy Q.
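The classification above can be sketched for a single-variable range query. The helper and its set-based inputs are hypothetical simplifications; a real classifier would work from the metadata catalog and general predicates rather than explicit ID sets:

```python
# Sketch of CCF/NCF/PCF classification (illustrative only).

def classify(frag_attrs, frag_ids, query_attrs, query_ids):
    """frag_ids / query_ids: sets of tuple IDs the fragment holds /
    the query's selection condition qualifies."""
    has_all_attrs = query_attrs <= frag_attrs
    if not (frag_ids & query_ids):
        return "NCF"                      # answers no part of the query
    if has_all_attrs:
        return "CCF" if query_ids <= frag_ids else "PCF-H"
    return "PCF-V" if query_ids <= frag_ids else "PCF-M"

# Query: Road-Name of links with Road-ID < 4 (qualifying IDs 0..3).
q_attrs, q_ids = {"Road-ID", "Road-Name"}, {0, 1, 2, 3}
link_attrs = {"Road-ID", "Road-Name", "Geometry"}

cls_a = classify(link_attrs, {0, 1}, q_attrs, q_ids)            # "PCF-H"
cls_b = classify(link_attrs, {4, 5}, q_attrs, q_ids)            # "NCF"
cls_c = classify(link_attrs, set(range(6)), q_attrs, q_ids)     # "CCF"
cls_d = classify({"Road-ID", "Geometry"}, set(range(6)),
                 q_attrs, q_ids)                                # "PCF-V"
```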
Example
Let us consider the same example as discussed in Section 2.1.2. We have 4 relations: Road Nodes,
Road Links, Lakes and Buildings. Furthermore, let us assume that there are 3 data nodes over
which these relations have been placed, with the fragmentation details as shown below:
Fig. 5.2: Horizontally Fragmented Link Table
Fig. 5.3: Vertically Fragmented Lake Table
Fig. 5.4: Mixed Fragmented Building Table
From Figure 5.2 it is clear that the road link table has been horizontally fragmented into 3
parts, namely link-1-F, link-2-F and link-3-F, which contain the information of all attributes of
the link table for IDs 0 and 1, 2 and 3, and 4 and 5 respectively. From Figure 5.3 it is evident
that the lake table has been vertically fragmented into two components, lake-1-F and lake-2-F,
which contain information about the attributes Water Volume and Geometry respectively, with
Lake-ID serving as the common attribute. Finally, the building table is first horizontally
fragmented into two components, each containing all attributes, with the tuples divided on the
basis of Building-ID: the first component contains information about IDs 0, 1 and 2, while the
other contains information about IDs 3, 4 and 5. Next, these two fragments are vertically
fragmented. The first fragment contains information about the building name; the second
contains information about the building height and geometry. Building-ID serves as the
common attribute between these components. As a result, we have a total of four components,
as shown in Figure 5.4.
Now, consider a query Q which selects the Road Name of all road links with Road-ID < 4. With
respect to this query, the fragments of the links table can be classified as follows: Road-1-F and
Road-2-F are PCF-H, and Road-3-F is NCF. If the parameter value in Q were < 3, then
Road-1-F would be classified as CCF while Road-2-F and Road-3-F would be NCF. Next,
consider a query Q on the lake table which requires the water volume and geometry of the lake
whose ID = 3. In this case, both fragments, Lake-1-F and Lake-2-F, fall under the PCF-V
category. Finally, for a query on the building table which requires the name and height of
buildings whose ID < 5, all four fragments fall under the PCF-M class. However, if the query
required only the height of buildings whose ID < 5, then fragments Building-1-F and
Building-4-F would fall under the NCF category and fragments Building-2-F and Building-3-F
under the PCF-H category.
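The fragmentations used in this example can be reproduced with two small helpers; the table contents below are invented stand-ins for the figures:

```python
# Sketch of horizontal and vertical fragmentation (illustrative data).

links = [
    {"Road-ID": i, "Road-Name": f"road-{i}", "Geometry": f"LINESTRING {i}"}
    for i in range(6)
]

def horizontal(table, pred):
    """Horizontal fragment: a subset of whole tuples."""
    return [row for row in table if pred(row)]

def vertical(table, attrs, key):
    """Vertical fragment: a projection that keeps the key for rejoining."""
    return [{a: row[a] for a in (key,) + attrs} for row in table]

link_1_F = horizontal(links, lambda r: r["Road-ID"] in (0, 1))
link_2_F = horizontal(links, lambda r: r["Road-ID"] in (2, 3))
link_3_F = horizontal(links, lambda r: r["Road-ID"] in (4, 5))

lakes = [{"Lake-ID": 3, "Water Volume": 9.5, "Geometry": "POLYGON ..."}]
lake_1_F = vertical(lakes, ("Water Volume",), "Lake-ID")
lake_2_F = vertical(lakes, ("Geometry",), "Lake-ID")
# Reconstructing the lake table joins lake_1_F and lake_2_F on Lake-ID;
# the mixed fragmentation of the building table composes the two helpers.
```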
In practice, classification of fragments can be done based on the metadata information about
the fragmentation techniques used on relations. It is one of the assumptions of query translation
framework and it is also realistic in the sense that all this information should be readily available.
Two important points about fragment classification which must be mentioned are as follows:
1. We have discussed classification of fragments for queries which contain single variable.
However, with queries containing more than one variable in the selection condition, the
same explanation is valid with classification becoming specific to each variable in the query.
2. We have only discussed cases where each relation has been fragmented and stored once,
without any replication. In reality, however, the same relation might be stored at a number of
sites using different fragmentation schemes. In such cases, we may obtain several fragments
which fall in the same class, and the fragment chosen for executing the query would depend
on the cost of executing the query plus the cost of communication. The cost of executing
the query will generally follow the order PCF-M > PCF-V > PCF-H > CCF > NCF, because
join operations are generally costlier than selection operations. The cost of communication,
however, also plays an important role in deciding which fragment should be chosen for
query execution.
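The selection rule above can be sketched as follows. The numeric class weights and candidate costs below are illustrative assumptions, not values from the report; they only show how the class-based execution cost and the communication cost combine.

```python
# Sketch of fragment selection among replicated candidates. The relative
# execution costs per class are illustrative assumptions reflecting the
# ordering PCF-M > PCF-V > PCF-H > CCF > NCF, not measured values.
EXEC_COST = {"NCF": 0, "CCF": 1, "PCF-H": 2, "PCF-V": 3, "PCF-M": 4}

def total_cost(fragment_class, comm_cost):
    """Combined cost: class-based execution cost plus communication cost."""
    return EXEC_COST[fragment_class] + comm_cost

def choose_fragment(candidates):
    """candidates: list of (name, fragment_class, comm_cost) tuples.
    Returns the candidate with the lowest combined cost."""
    return min(candidates, key=lambda c: total_cost(c[1], c[2]))

# A CCF on a remote site can still beat a local PCF-V that needs a join:
picked = choose_fragment([("Lake-1-F", "PCF-V", 0.5),
                          ("Lake-replica", "CCF", 1.0)])
print(picked[0])  # -> Lake-replica
```

Communication cost is what keeps the ordering from being decisive: a sufficiently distant CCF replica can lose to a local partial fragment.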
5.3.2 Locating the Data Sites
Once the fragments have been classified, the information about fragment allocation to data sites
has to be used to locate the data sites which will eventually serve the queries. Locating the
data sites is also important because we need to determine which part of the query will be
answered from which data site. In general, one data node may contain none, several, or all of the
fragments necessary for evaluating the query. If a data node does not contain any fragment
falling under the CCF or PCF categories, then no query should be passed to that data site. If
one data site contains all the fragments necessary to evaluate the query, i.e. all the PCF or
CCF fragments identified for answering the query lie on that site, then the query needs to be
executed only on that data site and all other sites can be dropped. In the case where a data
site contains only some of the PCF fragments required to answer the query, we need to locate
the other data sites which host the required data, consider the cost of communication between
these data sites as well as the cost of query evaluation, and choose the best execution plan from
the multiple possible plans (as arise, for instance, with replicated data).
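These site-location rules can be sketched as below. The catalog structure and all site and fragment names are hypothetical; a real system would read them from the fragment allocation metadata.

```python
# Sketch of data-site pruning. Site and fragment names are illustrative.
def prune_sites(site_catalog, needed):
    """site_catalog: {site: set of fragment names hosted there}.
    needed: set of CCF/PCF fragment names required by the query.
    Drops sites holding no relevant fragment; if a single site hosts
    every needed fragment, only that site is returned."""
    relevant = {s: frags & needed for s, frags in site_catalog.items()}
    relevant = {s: f for s, f in relevant.items() if f}  # drop irrelevant sites
    for site, frags in relevant.items():
        if frags == needed:            # one site can answer the whole query
            return {site: frags}
    return relevant

catalog = {"site-A": {"Road-1-F", "Road-2-F"},
           "site-B": {"Road-2-F"},
           "site-C": {"Building-1-F"}}
print(sorted(prune_sites(catalog, {"Road-1-F", "Road-2-F"})))  # -> ['site-A']
```

Here site-C is dropped outright, and site-B is dropped because site-A alone already covers every needed fragment; otherwise the surviving sites would all be handed to the cost model.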
5.4 Generic Algorithm for Query Translation
In this section, we describe a step-by-step procedure to generate fragment queries from the global
query tree, and then we present the procedure to convert the fragment queries into queries for
each data site.
5.4.1 Identifying Fragment Class and Generating Fragment Trees
First, we consider the procedure to generate fragment queries from the global query tree. The steps
are outlined in Algorithm 1. The query processing steps described in the algorithm are
based on the fragment classification described in Section 5.3.1. As discussed briefly in Section
5.3.1, any given query Q may be decomposed into a combination of the fragment classes
explained below:
1. Composed only of NCFs. In this case, no fragment tree should be returned, since none
of the fragments is capable of answering the query.
2. Composed of a CCF and several NCFs. In this case, the query must be run only
over the fragment which has been classified as CCF. If there are several CCFs due to data
replication, the global tree must be replaced with each CCF individually, generating several
trees. This information may be used to parallelize the computation of the result set.
3. Composed of PCF-Hs. In cases where the fragment classes are composed of several PCF-Hs,
the global relation in the query must be replaced with the union of all the PCF-Hs. Replication
of data will result in duplicates, which must be eliminated. Multiple trees can be formed,
when enough detail about the fragments is available, by finding all combinations which
would form a CCF for query Q. The cases of multiple PCF-Vs and PCF-Ms can be handled in
a similar way.
Algorithm 1: Translation of Global Query Tree (Q) to Fragment Query Trees
01. TreeList = null
02. FHorizontal = FVertical = FMixed = null
03. For each relation R in query Q
04.     Classify all fragments F of R
05.     For each fragment f in F
06.         If f is NCF, drop f
07.         If f is CCF, replace R in Q with f; add the generated tree to TreeList
08.         If f is PCF-H, add f to FHorizontal
09.         If f is PCF-V, add f to FVertical
10.         If f is PCF-M, add f to FMixed
11.     UnitedH = null
12.     For each h in FHorizontal
13.         UnitedH = UnitedH UNION h
14.     JoinedV = null
15.     For each v in FVertical
16.         JoinedV = JoinedV JOIN v
17.     CombinedM = combine(FMixed)
18.     TreeList.add(Replace R in Q with permute(UnitedH))
19.     TreeList.add(Replace R in Q with permute(JoinedV))
20.     TreeList.add(Replace R in Q with permute(CombinedM))
21. Return TreeList
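A runnable rendering of Algorithm 1 is sketched below. It assumes the fragments arrive pre-classified (the classification of step 04 is abstracted into the input) and represents query trees simply as strings with the relation name as a placeholder; both simplifications are ours, not part of the algorithm.

```python
# Runnable sketch of Algorithm 1 under simplifying assumptions: fragments
# are already classified, and a "tree" is a string template.
def combine(mixed):
    """Placeholder for the combine() step: union of the mixed fragments."""
    return " UNION ".join(mixed)

def translate(query, relations, classified):
    """classified: {relation: [(fragment_name, class), ...]}.
    Returns the list of fragment query trees (classes per Section 5.3.1)."""
    tree_list = []
    for r in relations:
        horizontal, vertical, mixed = [], [], []
        for frag, cls in classified[r]:
            if cls == "NCF":
                continue                       # drop: cannot serve the query
            if cls == "CCF":                   # answers the query on its own
                tree_list.append(query.replace(r, frag))
            elif cls == "PCF-H":
                horizontal.append(frag)
            elif cls == "PCF-V":
                vertical.append(frag)
            elif cls == "PCF-M":
                mixed.append(frag)
        if horizontal:                         # UNION of horizontal pieces
            tree_list.append(query.replace(r, " UNION ".join(horizontal)))
        if vertical:                           # JOIN of vertical pieces
            tree_list.append(query.replace(r, " JOIN ".join(vertical)))
        if mixed:
            tree_list.append(query.replace(r, combine(mixed)))
    return tree_list

trees = translate("SELECT * FROM Road", ["Road"],
                  {"Road": [("Road-1-F", "CCF"), ("Road-2-F", "PCF-H"),
                            ("Road-3-F", "PCF-H")]})
print(trees)
```

With a replicated CCF plus two PCF-Hs, two trees come out: one executing directly over the CCF, and one over the union of the horizontal fragments.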
The above algorithm generates several fragment query trees from the global query tree. Its
main task is to identify the class of every fragment of each relation used in the query. To
classify a fragment, it must be checked against the attributes, and their parameters, used in the
query. Among our assumptions, we state that the attributes and ranges associated with each
fragment are known. Thus, to check whether a fragment is an NCF, we check whether it contains
none of the attributes requested by the query; if so, the fragment is an NCF. However, the fragment
may contain some or all of the attributes requested by the query. If it contains all the attributes,
then the fragment may be a CCF, a PCF-H, or an NCF, which can be verified from the data range
information of the fragment. If the fragment contains only some of the attributes, then it may be
a PCF-V, a PCF-M, or an NCF, which is again evident from the metadata associated with the
fragment. The complexity of fragment classification is of the order of O(n²), where n is the
number of attributes present in the projection and selection portions of the global query
corresponding to relation R.
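The classification check described above can be sketched as a small decision function. We assume, purely for illustration, that fragment metadata supplies an attribute set and a single (lo, hi) id range, and that the query supplies the attributes it touches plus an id range; real metadata would be richer.

```python
# Sketch of the fragment classification check. Metadata shape (attribute
# set plus one (lo, hi) id range) is an illustrative assumption.
def classify(frag_attrs, frag_range, q_attrs, q_range):
    if not (q_attrs & frag_attrs):
        return "NCF"                 # holds none of the requested attributes
    lo = max(frag_range[0], q_range[0])
    hi = min(frag_range[1], q_range[1])
    if lo > hi:
        return "NCF"                 # data ranges do not overlap
    has_all_attrs = q_attrs <= frag_attrs
    covers_range = (frag_range[0] <= q_range[0] and
                    q_range[1] <= frag_range[1])
    if has_all_attrs:                # all attributes: CCF or PCF-H
        return "CCF" if covers_range else "PCF-H"
    return "PCF-V" if covers_range else "PCF-M"   # only some attributes

# A fragment holding ids 1..10 with every queried attribute, query on 1..3:
print(classify({"id", "geometry"}, (1, 10), {"id", "geometry"}, (1, 3)))  # -> CCF
```

The two booleans mirror the text: attribute coverage separates the CCF/PCF-H branch from the PCF-V/PCF-M branch, and range coverage picks within each pair.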
5.4.2 Mapping Fragment Query Trees to Data Sites
Once the fragments have been classified and all the fragment trees have been generated, the next
step is to generate the local site plans which will be executed on the data nodes to obtain the query
results. This requires us to intelligently choose the best fragment tree and generate a corresponding
query execution plan. Depending upon the fragment classes, different fragment queries can be
generated. How do we choose the optimal fragment query for eventual execution? The criteria
that must be taken into account are discussed below.
Fragment Query Tree containing CCF
Suppose that the fragment query tree has been formed from a combination of a CCF and several
NCFs. Although the NCFs are not part of the fragment query tree, it is important to note that if the
CCF and the NCFs are present on the same data site, the processing time would be higher than
in cases where the CCF and the NCFs reside on separate sites. Thus, in order to select the
data site for local query execution optimally, the cost model should consider this aspect and choose
the best data site available. The complexity of choosing the best available data site is of the order of
O(n), where n is the number of data sites containing the CCF corresponding to the fragment
tree.
Fragment Query Tree containing PCF-H
If the global query is converted into a fragment query containing several PCF-Hs, then we
need to compare all the query plans by comparing the cost of executing each of them. For
PCF-Hs, the communication cost between the data nodes hosting the PCF-H fragments
becomes important. If a fragment query is formed by combining PCF-Hs distributed among a
large number of data sites, the communication cost will be much higher than for a query plan
in which the PCF-Hs are spread over fewer data sites. This implies that the communication cost
goes down when many PCF-Hs reside on the same data site; in that case, however, the processing
time will increase, particularly if a large number of NCFs are present alongside the relevant
PCF-Hs. Thus, in order to select the optimal fragment query tree, an average execution cost
must be computed which takes into account, and normalizes, both the communication cost between
data sites and the processing and input/output cost at each local site. The same discussion extends
to selecting the optimal fragment tree in the cases of PCF-Vs and PCF-Ms, or any combination of
these classes.
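The normalized average execution cost described above can be sketched as a weighted sum. The weights and the per-plan communication, processing, and I/O estimates below are illustrative assumptions, not values derived in the report.

```python
# Sketch of the combined cost used to rank candidate fragment query trees.
# Weights and estimates are illustrative.
def plan_cost(comm_cost, proc_cost, io_cost, w_comm=0.5, w_local=0.5):
    """Blend inter-site communication cost with local processing and
    input/output cost into a single comparable score."""
    return w_comm * comm_cost + w_local * (proc_cost + io_cost)

# Many scattered PCF-Hs: cheap locally, expensive to communicate.
# Few co-located PCF-Hs: cheaper to communicate, costlier locally.
plans = {"many-sites": plan_cost(comm_cost=8.0, proc_cost=1.0, io_cost=1.0),
         "few-sites": plan_cost(comm_cost=2.0, proc_cost=3.0, io_cost=2.0)}
print(min(plans, key=plans.get))  # -> few-sites
```

The weights would in practice be tuned to the network and hardware profile of the deployment; with equal weights, the plan touching fewer sites wins in this example despite its higher local cost.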
Chapter 6
Conclusion and Future Work Plan
This chapter presents the conclusions of MTP Stage I and the work to be done in MTP Stage II.
6.1 Conclusion
In the first stage, the thesis proposal has been presented. Areas related to distributed database
management systems and parallel spatial data processing have been studied in order to understand
the existing approaches to handling large-scale data and parallel systems. The domain of
distributed databases and existing solutions for big data analysis have been extensively surveyed,
and the problem has been formulated by identifying the components of distributed database
systems which need to be extended to support spatial data analytics.
From the literature survey, it is evident that the current state of the art in distributed database
systems aims to provide highly reliable and available services without paying due attention to
efficient query execution. On the other hand, parallel systems which have been extended from
traditional database management systems parallelize computations to generate result sets
while keeping the query efficiency of traditional database systems intact. Our aim is to bring
the advantages of these two domains together and develop an efficient, reliable, and available
distributed system capable of performing spatial analysis over petabyte-scale data.
The first attempt at efficient data handling has been made by optimally translating global
query execution plans into data site execution plans, as presented in Chapter 5. A generic
algorithm, independent of data placement strategies, for generating site-specific queries has been
presented, assuming that data partition strategies are in place and using the metadata about
fragmentation techniques and the attributes present in each fragment.
6.2 Work Plan for MTP Stage II
In Stage II, we aim to complete the design of the distributed query execution framework for spatial
data by considering design specifics in query translation, query optimization, and integration
with existing parallel spatial computation techniques. Once the design is complete, we intend
to implement the proposed system design by developing a middleware over an existing distributed
database management solution surveyed in Chapter 3.
We will start by designing the query optimization framework, employing spatial indexing
techniques for data access and considering the aspect of parallelization. Once the design is complete,
we will implement the spatial query support operators over Apache Hive and use Hadoop's
MapReduce paradigm to achieve distributed parallelization.
The goal for Stage II remains to design a spatially enabled, efficient distributed database
management system capable of adapting to dynamic data placement techniques, supporting the
addition, deletion, and manipulation of spatial data, and performing complex spatial operations
over the underlying data.
Bibliography
[1] Mongodb. http://www.mongodb.org/. Online. Last Accessed October 18, 2012. 14
[2] D. J. Abadi. Data management in the cloud: Limitations and opportunities. In IEEE Data
Engineering Bulletin, volume 32, pages 3–12, 2009. 20
[3] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexan-
der Rasin. Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for
analytical workloads. Proc. VLDB Endow., 2(1):922–933, August 2009. 14
[4] Ishfaq Ahmad, Kamalakar Karlapalem, Yu-Kwong Kwok, and Siu-Kai So. Evolutionary
algorithms for allocating data in distributed database systems. Distrib. Parallel Databases,
11(1):5–32, January 2002. 10
[5] Apache. Couchdb. http://wiki.apache.org/couchdb/Technical%20Overview. Online.
Last Accessed October 18, 2012. 13
[6] Peter M. G. Apers. Data allocation in distributed database systems. ACM Trans. Database
Syst., 13(3):263–304, September 1988. 10
[7] Andrzej Bialecki, Christophe Taton, and Jim Kellerman. Apache Hadoop: a framework for
running applications on large clusters built of commodity hardware. 2010. 18
[8] D. Brunner, G. Lemoine, F.-X. Thoorens, and L. Bruzzone. Distributed geospatial data
processing functionality to support collaborative and rapid emergency response. Selected
Topics in Applied Earth Observations and Remote Sensing, IEEE Journal of, 2(1):33–46,
March 2009. 20
[9] Mike Burrows. The chubby lock service for loosely-coupled distributed systems. In Pro-
ceedings of the 7th symposium on Operating systems design and implementation, OSDI ’06,
pages 335–350, Berkeley, CA, USA, 2006. USENIX Association. 16
[10] Stefano Ceri and Giuseppe Pelagatti. Distributed Databases: Principles and Systems.
McGraw-Hill Book Company, 1984. 25
[11] Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon
Weaver, and Jingren Zhou. Scope: easy and efficient parallel processing of massive data
sets. Proc. VLDB Endow., 1(2):1265–1276, August 2008. 16
[12] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike
Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed
storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1–4:26, June 2008.
13
[13] Bin Chen, Fengru Huang, Yu Fang, Zhou Huang, and Hui Lin. An approach for het-
erogeneous and loosely coupled geospatial data distributed computing. Computers and
Geosciences, 36(7):839 – 847, 2010. 21
[14] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bo-
hannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni. Pnuts:
Yahoo!’s hosted data serving platform. Proc. VLDB Endow., 1(2):1277–1288, August 2008.
17
[15] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clus-
ters. Commun. ACM, 51(1):107–113, January 2008. 17, 18, 39
[16] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash
Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vo-
gels. Dynamo: amazon’s highly available key-value store. In Proceedings of twenty-first
ACM SIGOPS symposium on Operating systems principles, SOSP ’07, pages 205–220, New
York, NY, USA, 2007. ACM. 16
[17] Tor Didriksen, Cesar A. Galindo-Legaria, and Eirik Dahle. Database de-centralization -
a practical approach. In Proceedings of the 21th International Conference on Very Large
Data Bases, VLDB ’95, pages 654–665, San Francisco, CA, USA, 1995. Morgan Kaufmann
Publishers Inc. 10
[18] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file system. In
Proceedings of the nineteenth ACM symposium on Operating systems principles, SOSP ’03,
pages 29–43, New York, NY, USA, 2003. ACM. 16
[19] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent,
available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, June 2002. 19
[20] Alon Y. Halevy, Zachary G. Ives, Jayant Madhavan, Peter Mork, Dan Suciu, and Igor
Tatarinov. The piazza peer data management system, 2004. 14
[21] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. Zookeeper: wait-
free coordination for internet-scale systems. In Proceedings of the 2010 USENIX conference
on USENIX annual technical conference, USENIXATC’10, pages 11–11, Berkeley, CA, USA,
2010. USENIX Association. 16
[22] IBM. IBM Intelligent Transportation. 2012. 2
[23] McKinsey Global Institute. Big data: The next frontier for innovation, competition and
productivity. 2011. 2
[24] Yannis E. Ioannidis. Query optimization. ACM Comput. Surv., 28(1):121–123, March 1996.
11
[25] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: dis-
tributed data-parallel programs from sequential building blocks. SIGOPS Oper. Syst. Rev.,
41(3):59–72, March 2007. 18
[26] Seyed Jalal Kazemitabar, Farnoush Banaei-Kashani, and Dennis McLeod. Geostreaming in
cloud. In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on GeoStream-
ing, IWGS ’11, pages 3–9, New York, NY, USA, 2011. ACM. 21
[27] Avinash Lakshman and Prashant Malik. Cassandra: structured storage system on a p2p
network. In Proceedings of the 28th ACM symposium on Principles of distributed computing,
PODC ’09, pages 5–5, New York, NY, USA, 2009. ACM. 17
[28] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew
Tomkins. Pig latin: a not-so-foreign language for data processing. In Proceedings of the
2008 ACM SIGMOD international conference on Management of data, SIGMOD ’08, pages
1099–1110, New York, NY, USA, 2008. ACM. 18
[29] M. Tamer Ozsu. Principles of Distributed Database Systems. Prentice Hall Press, Upper
Saddle River, NJ, USA, 3rd edition, 2007. 10
[30] M. Tamer Ozsu and Patrick Valduriez. Principles of Distributed Database Systems, Third
Edition. Springer, 2011. 25
[31] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel
Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis.
In Proceedings of the 35th SIGMOD International Conference on Management of Data,
pages 165–178, New York, NY, USA, 2009. ACM Press. 3
[32] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data:
Parallel analysis with sawzall. Sci. Program., 13(4):277–298, October 2005. 18
[33] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The hadoop
distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage
Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA, 2010.
IEEE Computer Society. 16
[34] Roshan Sumbaly, Jay Kreps, Lei Gao, Alex Feinberg, Chinmay Soman, and Sam Shah.
Serving large-scale batch computed data with project voldemort. In Proceedings of the 10th
USENIX conference on File and Storage Technologies, FAST’12, pages 18–18, Berkeley, CA,
USA, 2012. USENIX Association. 17
[35] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh
Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehousing solution
over a map-reduce framework. Proc. VLDB Endow., 2(2):1626–1629, August 2009. 19
[36] Werner Vogels. Eventually consistent. Commun. ACM, 52(1):40–44, January 2009. 17
[37] Jibo Xie, Chaowei Yang, Qunying Huang, Ying Cao, and M. Kafatos. Utilizing grid com-
puting to support near real-time geospatial applications. In Geoscience and Remote Sensing
Symposium, 2008. IGARSS 2008. IEEE International, volume 2, pages II-1290–II-1293,
July 2008. 20
[38] Chaowei Yang, Wenwen Li, Jibo Xie, and Bin Zhou. Distributed geospatial information
processing: sharing distributed geospatial resources to support digital earth. International
Journal of Digital Earth, 1(3):259–278, 2008. 20
[39] Yunqin Zhong, Jizhong Han, Tieying Zhang, Zhenhua Li, Jinyun Fang, and Guihai Chen.
Towards parallel spatial query processing for big spatial data. In Parallel and Distributed
Processing Symposium Workshops PhD Forum (IPDPSW), 2012 IEEE 26th International,
pages 2085–2094, May 2012. 1, 20
[40] Xiaofang Zhou, David J. Abel, and David Truffet. Data partitioning for parallel spatial join
processing. Geoinformatica, 2(2):175–204, June 1998. 20
List of Figures
2.1 Architecture of Distributed DBMS . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Horizontal and Vertical Fragmentation . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Steps in Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Query Execution Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 General Architecture for Data Oriented Computing . . . . . . . . . . . . . . . . 15
3.2 Execution Overview in MapReduce [15] . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Parallel DBMS vs. Distributed DBMS . . . . . . . . . . . . . . . . . . . . . . . 19
5.1 Query Processing in Distributed Databases . . . . . . . . . . . . . . . . . . . . . 25
5.2 Horizontally Fragmented Link Table . . . . . . . . . . . . . . . . . . . . . . . . 27
5.3 Vertically Fragmented Lake Table . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.4 Mixed Fragmented Building Table . . . . . . . . . . . . . . . . . . . . . . . . . . 28
List of Tables
2.1 Road Link Feature Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Road Node Feature Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Lake Feature Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Building Feature Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Horizontal Fragmentation: Fragment 1 . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 Horizontal Fragmentation: Fragment 2 . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 Horizontal Fragmentation: Fragment 3 . . . . . . . . . . . . . . . . . . . . . . . 8
2.8 Vertical Fragmentation: Fragment 1 . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9 Vertical Fragmentation: Fragment 2 . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 Widely Used Technologies in Big Data Analysis . . . . . . . . . . . . . . . . . . 16