GeoSharding: Optimization of data partitioning in sharded georeferenced databases

João Graca, Universidade de Lisboa, Instituto Superior Tecnico, INESC-ID

Lisbon, [email protected]

ABSTRACT
Currently, the volume of geo-referenced data on the web is continuously growing, not only in classical domains but also in others, such as social networks.

NoSQL databases use sharding and partitioning algorithms to load balance and optimize data accesses. These mechanisms rely on the database structure to define the boundaries of the sets stored by each node, and assume some sort of access locality to data. However, some of these partitioning and spatial indexing mechanisms are not very efficient and could be optimized.

In this document we propose a partitioning policy that takes advantage of the geographical positions of the data to define a new sharding mechanism, called GeoSharding, which uses Voronoi diagrams as its basic spatial indexing structure. The results obtained show that this method can improve data partitioning with respect to spatial query performance and load balancing across the network shards.

Keywords
Sharding; Georeferenced data; NoSQL; Big data

1. INTRODUCTION
Nowadays, great amounts of data are starting to be georeferenced, which means that they can be associated with a physical space or location. One example of georeferenced data is a picture associated with a geographic location that is posted on a social network. In fact, the ever increasing observations and measurements of geo-sensor networks, satellite images, point clouds from laser scanning, geospatial data of Location Based Services (LBS) and location-based social networks have become a serious challenge for data management and systems' analysis [21], as they result in large scale georeferenced data sets.

The huge development of computer networks promotes data decentralization, and distributed databases allow sharing this data, making it accessible by all units and storing it where it is most frequently used. A distributed database is a logically interrelated collection of shared data, physically distributed over a network [12]. However, there are still systems that process spatial data in a centralized manner, and current spatial partitioning mechanisms used during spatial data processing are not efficient (for instance, in the case of geohashing, or r-tree hashing, objects close to each other geographically may end up distributed in different shards and, consequently, in different hosts).

In current distributed databases, complex spatial queries cannot take advantage of the georeferenced characteristics of data, and spatial range queries require the retrieval of data from a large number of different servers, which is not efficient.

Sharding is used to partition data among different servers in order to obtain system scalability. Since the current sharding mechanisms are inadequate for storing georeferenced data, we propose GeoSharding, a new partitioning policy/algorithm based on the geographical positions assigned to the objects to store. The mechanism uses Voronoi diagrams [8] to perform spatial partitioning, as they present important advantages with respect to the spatial division mechanisms already in use (geohashing, r-tree hashing). Our results show that this Voronoi solution presents better access performance and better load balancing than the other spatial division mechanisms.

In section 2 we present indexing mechanisms currently used in data storage systems, along with the need for novel partitioning and indexing mechanisms suitable for georeferenced data. Section 3 presents GeoSharding, which is evaluated in section 4. Finally, section 5 presents the conclusions of this work.

2. RELATED WORK

2.1 Georeferenced Big Data
The amount of georeferenced data is increasing every day. These data cover a large variety of domains, from weather data and socioeconomic data to vegetation indexes, and more [23]. To understand the relevance of the work presented, it is important to mention some cases where manipulating and managing georeferenced Big Data is important.

When discussing geological phenomena, every occurrence of an event must be stored for further analysis or study. For example, the United States Geological Survey¹ is a scientific government agency responsible for the study of the landscape and the natural resources of the United States. This agency stores spatial data sets of earthquakes, water pits, volcanoes and landslides, among others, and delivers this information in real time. Besides this agency, the British Geological Survey² is a natural environment research council that also works with large georeferenced data sets related to geological research. This includes climate change, groundwater, minerals and energy, among others.

Georeferenced big data is also discussed when mentioning social networks, as the huge volume of data these applications have to manage and store is increasing every day.

¹ www.usgs.gov
² www.bgs.ac.uk


The fact that nowadays users can pair a location dimension to the information they share in these types of networks adds georeferenced characteristics to this data volume and originates different location-based social networking services [2].

In the scope of the Internet of Things, once more it is important to note that geography can be considered an important binding principle, since all the sensor data gathered by physical objects have an associated position, and there are spatial relationships between these objects. Consequently, there is a direct link between georeferences and Big Data in IoT since, to spatially process the "data provenient from internet connected devices for real-time spatial decision making" [13], this data has to be georeferenced, creating "Spatial Big Data".

The mentioned entities that deal with Spatial Big Data have to rely on storage processing. Regarding spatial data processing, there are several extensions and optimization applications being developed to optimize already existing technologies in order for them to work efficiently with georeferenced data. GeoSpark is one example [23], and its role is to extend Apache Spark Resilient Distributed Datasets to support geometrical and spatial objects. This way, it is possible to create spatial indexes (discussed further in this section) that boost spatial data processing performance when performing data partitioning.

Other important aspects when discussing spatial data access optimization are statistical methods that use geographical data corresponding to nearby positions. One example of such techniques is kriging [6], an interpolation method used to predict the value of a function at a given point, using a weighted average of the known values in the neighborhood of the selected point. This is a very common method in spatial analysis and, like many others, it takes advantage of data locality.
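As a point of reference for how such neighborhood-based estimators are typically written (a standard textbook form, not a formula taken from [6]), the ordinary kriging predictor at an unsampled location $s_0$ is a weighted average of the $n$ neighboring observations:

\[
\hat{Z}(s_0) = \sum_{i=1}^{n} \lambda_i \, Z(s_i), \qquad \sum_{i=1}^{n} \lambda_i = 1,
\]

where the weights $\lambda_i$ are derived from the spatial correlation structure (the variogram), so that nearby observations typically receive larger weights.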

2.2 Database Systems
The following subsections review two main types of data storage systems, as well as GIS, in the scope of Big Data and georeferenced mechanisms.

2.2.1 Relational Database Management Systems
Nowadays, the predominant systems used to store structured data in web and business applications are the Relational Database Management Systems (RDBMSs), which rely on relational calculus and provide comprehensive ad hoc querying facilities through SQL [3]. In the past, these systems have been treated as the only reliable way to store data by many companies and their respective clients. Regarding sharding, one big disadvantage is that joins between data partitions are not possible, which means that the client application or the proxy layer inside or outside the database has to issue several requests and post-process results. This results in the loss of many relational features of RDBMSs, since sharding was not originally designed within current RDBMSs but rather added on top [9]. However, many NoSQL databases (presented next) embrace sharding as one key feature, and some of them even provide automatic partitioning and balancing of data among nodes.

2.2.2 NoSQL Systems
In recent years, the "one size fits all" idea has been questioned by science and web companies, which resulted in the appearance of a great variety of alternatives to RDBMSs. These new datastore technologies, which rely on non-relational databases, may be called NoSQL (Not Only SQL) databases [18]. Recently, NoSQL users came to share how they had surpassed the slow and expensive systems based on RDBMSs. Some of the main motives and advantages of NoSQL databases are presented and discussed in this subsection.

Relational systems present multiple features to ensure data consistency that are sometimes unnecessary. For example, for some clients or specific applications, there are consistency checks that could be avoided and which are responsible for a greater complexity of the system. Therefore, one advantage NoSQL systems may provide is the avoidance of unneeded complexity.

Many times, in big data storage systems, it is important to increase the data processing rate in order to satisfy users' needs, and there are NoSQL systems able to process 20 petabytes per day, a higher processing throughput than that of a traditional relational system. So, another motive for using an alternative system can be high throughput.

As the volume of data increases with time, the focus on scaling and sharding increases too. In contrast to relational database management systems, most NoSQL databases are designed to scale well in the horizontal direction and not rely on highly available hardware. Running on commodity hardware and presenting horizontal scalability can be considered another of the main drivers behind the development of this type of technology.

Some types of data structures that present low complexity, when stored by an RDBMS, require an expensive entity-relational mapping, which can be unnecessary. The avoidance of expensive object-relational mapping is also important. Sometimes, NoSQL systems can also compromise reliability for better performance. This is important because, once again, the need for higher system speed may overcome the need for reliability. As described, the continuous growth of data volumes to be stored and the growing need to process larger amounts of data in shorter time motivated the appearance of NoSQL databases.

When discussing data partitioning in a NoSQL database, another important feature is the replication mechanism. In a database of this type there must always exist replicas of the objects stored, since these database systems need to be fault tolerant. By replicating the objects stored in the original shard owner, the shard's data availability is guaranteed even if the shard is temporarily down. Furthermore, the replication algorithms in NoSQL databases may allow data availability optimization, which will be further discussed in this document.

2.2.3 GIS
A geographic information system or geographical information system (GIS) is a system designed to capture, store, manipulate, analyze, manage, and present all types of spatial or geographical data [24]. There is one international organization committed to making quality open standards for the global geospatial community, called the OGC (Open Geospatial Consortium). These standards are made through a consensus process and are freely available for anyone to use to improve sharing of the world's geospatial data.

Current GISs mainly work with relational databases, which makes the spatial data processing centralized (relying on relational mechanisms) and not distributed. Examples of database systems used by Geographical Information Systems are MySQL or Oracle. One of the goals of this article is to show the advantages of working with spatial data in distributed/partitioned systems, which does not happen with traditional GIS systems.

Table 1: Spatial Index Strategies

Indexing               DB         Partition Blocks      Partition Center Location   Partition Depth
GeoHashing             Dynamo DB  Regular Rectangle     Regular                     Regular
Quad-tree Hashing      MongoDB    Irregular Square      "Ruled"                     Sub-partitions
R-tree Hashing (CAN)   CouchDB    Irregular Rectangle   "Ruled"                     Sub-partitions
VAST                   -          Irregular Polygon     "Random"                    Irregular

2.3 Sharding and Spatial Indexing
Sharding is the partitioning of data in such a way that data typically requested and updated together resides on the same node and that load and storage volume are roughly evenly distributed among the servers (in relation to their storage volume and processing power) [18]. Data shards may also be replicated for reasons of reliability and load balancing, writing to a dedicated replica or to all the existing replicas, depending on the system. A data sharding system requires a mapping between data partitions, the shards, and the storage nodes that are responsible for these shards. This mapping may be configured as static or dynamic, and this configuration is usually executed at the client application level of the database.

In order to distribute data among the shards of the database, a shard key must be defined: a measurable value of an object's field. There are two main strategies in sharding, considering the object's shard key and the distribution of the objects among the shards: Range Based Partitioning and Hash Based Partitioning. The first one consists in dividing "the data set into ranges determined by the shard key value to provide range based partitioning" [11], supporting more efficient range queries. Given a range query on the shard key, the query router can easily determine which shards overlap that range and route the query to only those shards. However, Range Based Partitioning may cause unbalanced load distributions, as some shards may end up with much more data stored than others. This negates some benefits of sharding because, in this situation, a small set of shards may receive the majority of requests and the "system would not scale very well".

Hash Based Partitioning consists in hashing the shard key into a value and using this pseudo-random value to locate the shard where the data is saved. With this type of partitioning, two objects with close shard key values are unlikely to be saved in the same shard. This fixes the load balancing problem because it introduces a randomness factor, but range queries become inefficient, as the system would most likely query every shard in order to return a result.
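A minimal sketch contrasting the two routing strategies just described, with assumed helper names (hash_shard, range_shards) and a purely illustrative four-shard key space:

import hashlib
from typing import List, Tuple

def hash_shard(shard_key: str, num_shards: int) -> int:
    """Hash Based Partitioning: map the shard key to a pseudo-random shard id."""
    digest = hashlib.sha1(shard_key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def range_shards(query: Tuple[float, float],
                 boundaries: List[Tuple[float, float]]) -> List[int]:
    """Range Based Partitioning: ids of shards whose key ranges overlap the query."""
    lo, hi = query
    return [i for i, (start, end) in enumerate(boundaries)
            if start <= hi and end >= lo]

# With hash based partitioning a range query has to contact every shard; with
# range based partitioning only the overlapping ranges are contacted.
boundaries = [(0.0, 25.0), (25.0, 50.0), (50.0, 75.0), (75.0, 100.0)]
print(range_shards((30.0, 60.0), boundaries))   # -> [1, 2]
print(hash_shard("object-42", num_shards=4))    # some shard id in 0..3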

Independently of the sharding strategy, which uses its own mapping system, there exist several possible indexing mechanisms in a storage system. P2P applications require the capability to respond to more complex queries (such as range queries involving numerous data types, including those that have a spatial component), which results in defining different indexing strategies. In this context, it is important to have the notion of a spatial object and a spatial query. By the definition in [19], spatial objects are objects with extents in a multidimensional setting, and spatial queries are often executed by recursively subdividing the underlying space and then solving possibly simpler intersection problems. In terms of dimensional settings, this work focuses on systems, mechanisms and algorithms for bi-dimensional spaces.

Currently, there are NoSQL systems which already process spatial data [16]. Table 1 presents an overview of the possible spatial indexing strategies to be used with NoSQL systems, their partition block shape, the partition center location of each block (the central node responsible for each spatial block) and the partition depth (whether sub-partitions are allowed) of each mechanism. The following subsections describe current geographical data indexing mechanisms.

2.3.1 Geohashing
Geohashing is a latitude/longitude geocode system. It is a hierarchical spatial data structure which subdivides space into grid-shaped buckets, and a convenient dimensionality reduction mechanism for coordinates that uses a Z-curve [10] to decide the visiting order of the partitioned quads. A simple implementation is to interleave the bits of a (latitude, longitude) pair and base32 encode the result.
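A minimal sketch of this bit-interleaving idea, assuming the base32 alphabet commonly used by geohash implementations; the function name and the example coordinates are illustrative:

# Standard geohash base32 alphabet (omits a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Alternately bisect longitude and latitude, packing every 5 bits into base32."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, code, use_lon = [], [], True
    while len(code) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if value >= mid:
            bits.append(1)
            rng[0] = mid              # keep the upper half of the interval
        else:
            bits.append(0)
            rng[1] = mid              # keep the lower half of the interval
        use_lon = not use_lon
        if len(bits) == 5:            # every 5 bits become one base32 character
            code.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(code)

# Nearby points tend to share prefixes: coordinates in the Lisbon area, for
# example, encode to geohashes starting with "ey".
print(geohash_encode(38.7223, -9.1393))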

Geohashing features advantages such as ease of calculation and reversal, its representation as bounding boxes, and the ability to truncate bits from the end of a geohash, resulting in a larger geohash bounding the original [10]. Geohashes offer properties like arbitrary precision and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision). As a consequence of the gradual precision degradation, nearby places will often (but not always) present similar prefixes. The longer a shared prefix is, the closer the two places are.

However, there are significant disadvantages in GeoHashing, also referred to in [10], which criticizes the Z-curve form. Z-curves are not necessarily the most efficient space-filling curve for range queries, since points on either end of the Z's diagonal appear to be close together when they are actually not. The spatial partitioning and the Z-curves are represented in Figure 1.

Another inefficiency of Z-curves is that points next to each other on the spherical Earth may end up on opposite sides of the plane. These drawbacks mean that it is sometimes necessary to run multiple queries, or to expand bounding box queries to cover very large areas.

2.3.2 Quad-tree Hashing
Another indexing solution, presented in [19] and also focused on the use of P2P networks for applications with spatial data and queries, is a distributed spatial index based on the quad-tree data structure.

Figure 1: Geohashing Spatial Partitioning and Z-Curve Order
Figure 2: Quad-tree Hashing Spatial Division
Figure 3: Voronoi Diagram

One case consists in recursively decomposing an underlying two-dimensional square-shaped space into four congruent square blocks until each object is contained in one of the blocks. MongoDB is one of the database systems that can use a quad-tree as its indexing mechanism. One possible advantage of this first variant is that it reduces the complexity of the intersection process by enabling the pruning of certain objects or portions of objects from the query. Figure 2 represents the Quad-tree Hashing standard spatial partitioning.

The MX-CIF quad-tree [19] is another alternative to implement a quad-tree index. It consists of the following: for each object o, the decomposition of the underlying space halts upon encountering a block b such that o overlaps at least two of the child blocks of b, or upon reaching a maximum level of decomposition of the underlying space. In both variants presented, the object is associated with b upon halting.
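A minimal sketch of this halting rule, under assumed types (an axis-aligned Rect and objects represented by their bounding rectangles); the names are illustrative and not taken from [19]:

from dataclasses import dataclass

@dataclass
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float
    def intersects(self, other: "Rect") -> bool:
        # closed rectangles: touching edges count as overlapping
        return not (other.x0 > self.x1 or other.x1 < self.x0 or
                    other.y0 > self.y1 or other.y1 < self.y0)
    def children(self):
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        return [Rect(self.x0, self.y0, mx, my), Rect(mx, self.y0, self.x1, my),
                Rect(self.x0, my, mx, self.y1), Rect(mx, my, self.x1, self.y1)]

def mxcif_block(obj: Rect, block: Rect, depth: int, max_depth: int) -> Rect:
    """Return the quad-tree block the object is associated with."""
    if depth == max_depth:
        return block                      # halt: maximum decomposition level
    overlapping = [c for c in block.children() if c.intersects(obj)]
    if len(overlapping) >= 2:
        return block                      # halt: object spans several children
    return mxcif_block(obj, overlapping[0], depth + 1, max_depth)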

When a peer is attached to a region of space, it is responsible for the resolution of all queries that intersect that region, and for storing the objects that are associated with that same region. In order to do this, the authors state that each quad-tree block can be uniquely identified by its centroid, named a control point. Then, the Chord [17] method is used to hash these control points so that the responsibility for a quad-tree block is associated with a peer.

To determine the control points, the system uses the globally known quad-tree subdivision method to recursively subdivide the underlying space. It is stated that multiple control points, and hence quad-tree blocks, can be hashed to the same peer, and multiple objects can be stored with each control point.

The use of a control point in this algorithm [19] is analogous to that of a bucket for storing objects and also for performing the intersection calculations associated with that block.

2.3.3 CAN
CAN (Content Addressable Network) [14] is a P2P network protocol that uses consistent hashing based on bi-dimensional indexes. The CAN algorithm partitions space among all the nodes in the network. Every node "owns" a zone in the overall space, and the system stores data at the network's nodes. CAN allows routing from one "point" (a node that owns an enclosing zone) to another.

Space division in CAN is made in rectangular or quadrangular forms. In order for a new node to join the network, the joining node has to perform the following steps [14]: 1) Find a node already in the overlay network; 2) Identify a zone that can be split; and 3) Join the network, updating the neighbors' information about the newly split zone.

In order to handle a departing node, CAN performs the following steps: 1) Identify the departing node; 2) Have the departing node's zone merged or taken over by a neighboring node; and 3) Update the routing information across the network.

Routing in CAN is done with the information about each node's neighbors. Every node maintains information about its direct neighbors and its virtual coordinate zone. A node routes a message towards a destination point in the coordinate space. To do this, the node first determines which neighboring zone is closest to the destination point, and then looks up that zone's node's IP address using the stored network information.
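A minimal sketch of this greedy routing step, assuming each node knows only its own zone and its direct neighbors; names such as Zone, CanNode and route are illustrative, and the real protocol in [14] additionally handles joins, departures and failures:

from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class Zone:
    x0: float
    y0: float
    x1: float
    y1: float
    def contains(self, p: Point) -> bool:
        return self.x0 <= p[0] < self.x1 and self.y0 <= p[1] < self.y1
    def distance_to(self, p: Point) -> float:
        # distance from the point to the closest point of the rectangle
        dx = max(self.x0 - p[0], 0.0, p[0] - self.x1)
        dy = max(self.y0 - p[1], 0.0, p[1] - self.y1)
        return (dx * dx + dy * dy) ** 0.5

@dataclass
class CanNode:
    name: str
    zone: Zone
    neighbors: List["CanNode"] = field(default_factory=list)

def route(start: CanNode, destination: Point, max_hops: int = 64) -> CanNode:
    """Greedily forward towards the neighbor whose zone is closest to the key."""
    node = start
    for _ in range(max_hops):
        if node.zone.contains(destination):
            return node                               # zone owner reached
        node = min(node.neighbors,
                   key=lambda n: n.zone.distance_to(destination))
    raise RuntimeError("routing did not converge")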

CAN is one of the spatial mechanisms evaluated in this paper. Like the r-tree indexing mechanism, it performs spatial division using rectangular/quadrangular forms.

2.3.4 VAST
VAST is a Voronoi Overlay Network [15] that also relies on consistent hashing based on a bi-dimensional index, using Voronoi polygons to perform spatial division. Its open source implementation and description are available at [20]. The work related to this implementation and description was originally presented in [7].

Before presenting VAST, it is important to introduce the mathematical notion of a Voronoi diagram. In [8], the authors describe a Voronoi diagram as a division of a space into disjoint polygons, one polygon per node, where the nearest node inside a polygon is the generator of the polygon. Considering a limited set of points, called generator points, in the Euclidean plane, every location in the plane is associated with its closest generator. The set of locations assigned to each generator forms a region called a Voronoi polygon or Voronoi cell. The set of Voronoi polygons associated with all the generators is called the Voronoi diagram with respect to the generator set. It is also stated [8] that the boundaries of the polygons, called Voronoi edges, are the set of locations that can be assigned to more than one generator (as these points are at the same distance from more than one generator). The Voronoi polygons that share the same edges are called adjacent polygons, and their generators are called adjacent generators. Figure 3 depicts a Voronoi diagram.
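A minimal sketch of this assignment rule, with illustrative generator names and coordinates; in GeoSharding, each generator would correspond to a shard:

from math import hypot
from typing import Dict, Tuple

Point = Tuple[float, float]

def closest_generator(location: Point, generators: Dict[str, Point]) -> str:
    """Return the generator whose Voronoi cell contains the given location."""
    return min(generators,
               key=lambda g: hypot(location[0] - generators[g][0],
                                   location[1] - generators[g][1]))

# Example: three shards acting as generators of a Voronoi diagram.
shards = {"shard-A": (0.0, 0.0), "shard-B": (10.0, 0.0), "shard-C": (5.0, 8.0)}
print(closest_generator((6.0, 1.0), shards))  # -> 'shard-B'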

Each one of VAST’s nodes builds the Voronoi diagram ofthe overlay network knowing the coordinates of its neighborsand its neighbors’ Voronoi polygons, maintaining a connec-tion with them. In order to optimize system’s scalability,VAST includes the nodes’ Area Of Interest concept. By

Page 5: GeoSharding: Optimization of data partitioning in … Optimization of data partitioning in sharded georeferenced databases ... Indexing DB Partition Blocks Partition Center Location

using this, as stated by the authors [7], VON (Voronoi-based Overlay Network) meets the requirements of message-efficiency, as the neighbor discovery is embedded in regularposition updates, scalability, as each node only maintainsa limited number of connections and responsiveness, as la-tency is low due to the direct connections between neigh-bors.

3. GEOSHARDING
Nowadays, the volume of georeferenced data on the web is continuously growing, and the need to store and access this data in efficient ways is a top priority regarding network technologies. The redefinition of data locality (storing data in servers "near" it) requires a new solution that exploits georeferenced data to allow better system performance and efficiency. As mentioned before, index strategies are a very important topic regarding spatial networks. These indexes can be used to efficiently access data on P2P networks, and there are already NoSQL databases whose indexing strategies use the georeferenced characteristics of data.

The solution proposed consists in using Voronoi spatial partitioning as the spatial indexing/sharding mechanism. Each shard of the database system would be attributed a coordinate in space and, using the object's georeference as the sharding key, the data would be stored in the "closest" server. The coordinates attributed to each server can be random or chosen in such a way as to optimize data load balancing. This sharding mechanism, introduced by us, is called GeoSharding.

The goals of such a sharding strategy are to optimize the spatial partitioning (Voronoi diagrams are not used in NoSQL databases), the range queries and the load balancing. Since the idea of using spatial coordinates as the sharding key could also be implemented using other spatial mechanisms (CAN, Geohashing, Quad-tree Hashing), we will show why Voronoi diagrams are the optimal solution.

3.1 Spatial Partitioning Optimization
One existing problem is that mechanisms such as geohashing, quad-trees or CAN present some disadvantages that may affect the system's efficiency. As mentioned before, the Z-curves of geohashing are not necessarily the most efficient space-filling curve for range queries, since points on either end of the Z's diagonal seem close together when they are actually not. Besides range queries, nearest neighbor discovery using Z-curves may also be inefficient, which presents another problem in using geohashing.

The rectangular and quadrangular forms of quad-tree hashing and CAN may also present inefficiencies. These systems' scalability requires space subdivision, which may not be, once again, an optimal way to scale a system.

One solution to overcome these space-filling mechanisms' drawbacks may be the usage of Voronoi diagrams. Since Voronoi diagrams consist of irregular polygons, they can lead to a more efficient division of space than the rectangular/quadrangular shapes used in geohashing, quad-tree hashing and CAN.

Voronoi nodes have knowledge of their direct neighbors and of their neighbors' boundaries (which does not happen in CAN and the other mechanisms), which facilitates nearest neighbor discovery and makes spatial queries easier to perform (for instance, [5] presents a solution that supports range queries over Voronoi overlays). In addition, this mechanism allows a spherical representation of space, which is not possible using CAN.

Four different algorithms (presented in a centralized version) that can be used in a sharded geo-spatial database system are presented below. The simplified data access pseudo-code of these algorithms is presented in Figure 4.

The Chord algorithm represents the hash based sharding that some NoSQL database systems use, where hashing the key value generates a random number that defines the identifier of the server responsible for storing this value. In this case, the hash is computed using all the fields of the object, which aims to represent a completely random partitioning policy.

The Chord algorithm with Geohashing represents the same hash based sharding, where the value resulting from hashing the shard key is a random value. However, instead of using the hash of the whole object, the shard key used is the pair of coordinates of the object. By hashing both coordinates, the result is a distribution that is not completely random, but based on the geographical coordinates of the object (like a minimal version of GeoHashing, but without a spatial division assigned to the servers/shards).

In the CAN algorithm, the spatial partitioning is based on a Binary Space Partitioning Tree, dividing the space into quadrangular blocks, as described in section 2.3.3. Defining the objects' coordinates as their sharding key, CAN searches for the node responsible for the rectangular zone that contains the key, in which it stores the objects. This search may require more than one iteration.

In the VAST algorithm, the spatial division is based on Voronoi diagrams, as described in section 2.3.4. Also defining the objects' coordinates as their sharding key, VAST identifies the Voronoi polygon which contains the key, by analyzing the overall Voronoi diagram, and the objects are stored in the shard that is the Voronoi generator of the corresponding polygon.

3.2 Range Query and Load Balancing Optimization
The GeoSharding solution proposed suits NoSQL system environments, as it could replace the already existing sharding mechanisms. In order to proceed with these changes and apply the GeoSharding algorithm, it would be necessary to modify the objects' sharding keys in the system's Application Programming Interface. The new keys would be based on the georeferenced characteristics of the data. For example, considering a bi-dimensional map, the sharding key would be the pair of coordinates (x, y) of the georeferenced objects. Afterwards, the system's data partitioning and replication mechanisms would also be modified based on the new keys, resulting in the intended GeoSharding solution.

The work developed intends to show that data access using spatial range queries is optimized, as the number of shards contacted in order to retrieve the requested data is significantly lower compared to the conventional Hash Based Partitioning used in sharding strategies. This way, sharding would allow optimized spatial range queries, since it would be the first sharding mechanism using georeferenced coordinates as the sharding key.

Furthermore, the general idea behind Hash Based Partitioning in sharding is that the system's load balancing is severely affected when Range Based Partitioning is used. This work intends to prove that it is possible to have range partitioning that also optimizes load balancing. By choosing the spatial coordinates assigned to each shard, it is possible to optimize both the range queries and the load balancing of the system. This way, the GeoSharding solution would respect the requirements already addressed by existing systems and would also optimize some of them.

Figure 4: Key access pseudo-code (centralized infrastructure)

Chord:
    key = hash(obj.key_field)
    do:
        shard = shards[i++]
    while (shard.key < key)
    shard.access_obj(key)

CAN:
    key = obj.coordinates
    shard = shards.closest(key)
    while (!shard.responsible(key)):
        shard = shard.neighbors.closest(key)
    shard.access_obj(key)

VAST:
    key = obj.coordinates
    shard = shards.closest(key)
    shard.access_obj(key)

3.3 Replication Mechanism
Since there is no existing sharding mechanism based on the spatial coordinates of an object, it is also important to propose a suitable replication mechanism that works with the spatial partitioning system used.

The replication strategy proposed consists in copying each data object to the n nearest Voronoi generators (shards) besides the responsible Voronoi generator (the closest one to the object's coordinates), where n is the number of replicas desired in the system.
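A minimal sketch of this placement rule, under the assumption that the shard positions are known to the client or router; the function name and shard map are illustrative:

from math import hypot
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def placement(obj_coords: Point, shards: Dict[str, Point], n: int) -> List[str]:
    """Return [primary, replica_1, ..., replica_n] ordered by distance."""
    ranked = sorted(shards,
                    key=lambda s: hypot(obj_coords[0] - shards[s][0],
                                        obj_coords[1] - shards[s][1]))
    return ranked[:n + 1]   # nearest generator plus the n next nearest ones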

The goal of such a replication system is also to optimize the number of shards requested to solve a spatial query, since the objects are replicated taking into account their spatial position.

3.3.1 Multi-Point Geometry Objects
Geometry objects composed of multiple points or regions should also be taken into account. For instance, the route information of a given vehicle, such as city bus routes or taxi routes, is composed of a set of thousands of points.

If a distributed system and its sharding mechanism do not take into account the fact that such objects are composed of multiple coordinates, each object will be split across various shards. The access and processing of a complete object will then lead to overhead. This overhead may be reduced if each shard responsible for some points of such a geometry object contains a complete copy of it.

In order to do so, we propose a second replication mechanism, specifically designed for multi-point geometry objects. In this replication policy, all the shards responsible for some point of a geometry object will contain a replica of the object. By doing so, the insertion of this class of objects will incur some costs, but it is expected that spatial range queries will suffer no degradation and that queries requesting a complete object will be optimized.
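A minimal sketch of this policy, mirroring the illustrative nearest-generator helper shown earlier; shard names and coordinates are assumptions:

from math import hypot
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]

def closest_generator(p: Point, shards: Dict[str, Point]) -> str:
    return min(shards, key=lambda s: hypot(p[0] - shards[s][0],
                                           p[1] - shards[s][1]))

def geometry_replica_set(route: List[Point], shards: Dict[str, Point]) -> Set[str]:
    """Every shard owning at least one point keeps a full copy of the object."""
    return {closest_generator(p, shards) for p in route}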

4. RESULTS

4.1 Setup
In order to test the GeoSharding hypothesis, the tests developed used real spatial datasets, the first of global dimension, the second of continental dimension and the last of country dimension, all presented in Figure 5. The global perspective consists of an earthquake dataset [1] from the Northern California Earthquake Data Center, with 686,987 data points. The continental perspective test used a dataset of underground water resources from USGS [21], with 4,644 data points. Finally, the local perspective tests used a dataset of taxi routes in Beijing [25, 26], with 328,653 data points. By testing this sharding mechanism in three different environments, it is possible to evaluate its feasibility in cases that work with spatial data at different ranges, and still conclude that it is an efficient mechanism. In order to perform the tests, we used the PeerSim [12] peer-to-peer network simulator, where 10 peers represented the servers/shards of the distributed database systems.

Figure 5: Tested Datasets (Global Earthquakes; USA Underground Water Resources; Beijing Taxi Road-trips' Routes; Dublin Buses Trips)

A fourth data set was used to test the geometry objects' replication mechanism proposed in section 3.3.1. This one is from the Dublin City Council's traffic control database [4], with the routes of the city buses from a single day. The advantage of this data set is that bus journeys are composed of multiple data points, allowing the evaluation of the proposed replication. The data set is composed of 796,298 data points, belonging to 5,387 different journeys.

4.2 Tests Description
VAST and CAN implement the GeoSharding technique, where the georeferenced objects are stored in the servers associated with the geographical coordinates nearest to the object. The tests consisted of performing 100 quadratic range queries, where the center of the query square is generated randomly inside the space where there is data (for example, in the case of the Earthquakes dataset, the query centers are generated randomly across the world). The area of the query region depends on its side measure, which was tested with different values (from smaller range queries to queries that request all the data across the space). This way the tests consist of queries requesting different amounts of data, and the goal is to evaluate how many servers/shards need to be reached in order to retrieve the requested data.

In the case of the GeoSharding technique, the shard responsible for the query center is contacted first, and results can also be transferred by other shards. The positions across the space assigned to each shard may assume very different locations, which have a great impact on the final results. This way, both the CAN and VAST algorithms were tested with three different configurations regarding the shards' assigned positions. Regarding CAN, there is a configuration where the positions assigned to the servers are randomly chosen across the bi-dimensional space where there is data, and another configuration that positions the shards uniformly across the map.

The same applies to the tests with VAST, where there is also a third configuration that allows positioning the shards wherever desired: Hand Picked. This way, it is possible to assign positions to the servers where great amounts of georeferenced data are assumed to be.

Another important aspect to take into account when implementing a sharding mechanism is the load balancing system. In order to guarantee that some of the system's servers are not overloaded while others store little data, the partitioning policy must be careful with this particular aspect. The Chord algorithm and the Chord algorithm with Geohashing used in the simulations represent the systems that use a Hash Based Partitioning policy (discussed in section 2), where load balancing is assumed to be guaranteed as there exists a random distribution of the data among the shards.

On the other hand, CAN and VAST, implementing the GeoSharding policy, represent a two-dimensional Range Based Partitioning (also reviewed in section 2). In the scope of load balancing, these two algorithms must not assign the geographical positions of each server randomly, as this may cause an uneven distribution of data among the shards. This may happen in the case where some servers are assigned a position in a division of the map where there is no georeferenced data, and other servers are assigned positions in the divisions where all the data is located.

4.2.1 Replication Tests
To test the standard replication mechanism proposed, only two algorithms were tested: Chord (representing Hash Based Replication) and VAST (representing the proposed GeoSharding replication). In the first case, each object was stored in the shard of the Chord ring responsible for the hash of the object and in that shard's next neighbor in the Chord ring, simulating a random replication system where each object is stored twice in the database (the original and one copy). With VAST, each object was stored in the two nearest Voronoi generators, one of them being the shard whose Voronoi polygon contains the coordinates of the stored object (the original owner of the object). As in the previous tests, 100 spatial queries were performed, and for this purpose the Earthquakes data set was used, with VAST using a random distribution of its shards.

To test the geometry object’s replication mechanism, withthe Dublin buses routes’ data set, we used VAST with ran-domly distributed shards and queried the database to knowwhat was the average number of shards storing each journey.

4.3 Spatial Queries Performance
The following charts present the results of the range query tests performed on the three different datasets. They include three different charts for each set, regarding the mechanism assigning positions to each server/shard (random, uniform and hand picked). In the case of the Earthquakes data set, the results of Chord and Chord+Geohashing were the same, so only one of the lines (the one for the second case) is represented in the charts.

Regarding the range query optimization, the outcome of the tests was as predicted. By evaluating the charts in Figure 6, it is possible to state that the solutions implementing GeoSharding (CAN and VAST) required significantly less communication with the remote servers to retrieve the data requested by the square range queries, for all query dimensions.

For example, with the Earthquakes data set and randomly distributed shards, the average number of requested servers in a system which uses VAST to perform spatial operations is 2 for a bounding box query with a square side of about 3200 km, while in a system with a Hash Based Sharding mechanism the average number of requested servers is 9 (almost all the shards in the system). This proves that sharding based on georeference optimizes spatial range queries, which is very useful in the case of GIS systems that display huge datasets of georeferenced data.

Observing the test results, the CAN solution is slightly better than the VAST solution in range query optimization when implementing GeoSharding. This is explainable because, VAST being constructed with irregular polygons, the regular square queries intersect more of its blocks in space than they intersect the regular blocks of CAN (which are rectangles). However, this small difference is outmatched by the performance of VAST regarding load balancing, shown in the next subsection.

When testing the distribution of global datasets (as in the Earthquakes dataset case), we modified VAST by changing the bi-dimensional distance formula of the Voronoi diagrams. Instead, the distance was calculated using the Haversine formula [22] applied to the decimal latitude and longitude coordinates of the objects, which creates irregular spherical Voronoi blocks, treating the Earth as a sphere instead of a bi-dimensional plane. This is useful since, when requesting points on opposite sides of the 2-dimensional Earth map which are actually near each other in reality, those points would otherwise be stored in nodes spatially far away from each other in the system. This does not happen in this Spherical Voronoi case, with VAST combined with the Haversine formula, since the Earth is treated in a 3-dimensional way.
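A minimal sketch of the Haversine great-circle distance such a spherical variant would rank shards with, assuming a mean Earth radius of 6371 km; the helper name is illustrative:

from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Two points near the 180th meridian are close on the sphere even though they
# sit on opposite edges of a flat map:
print(round(haversine_km(0.0, 179.5, 0.0, -179.5)))  # ~111 km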


Figure 6: Spatial Query Performance of Sharding Algorithms. Charts of Number of Requested Servers vs. Query Square's Side (km) for a) Earthquakes, b) Underground Water Resources and c) Taxi Routes, under Randomly Distributed, Uniformly Distributed and VAST Hand-Picked Distributed servers.

We performed spatial square queries centered on the North Pole region with the bounding box's side measuring 4442 km. The results showed that, on average, CAN requested results from 9 shards (almost all the network shards) and VAST requested results from 5 shards (half of the network's shards). This proves the advantage of treating the Earth as a sphere when processing global spatial data.

4.4 Load Balancing
The load balancing tests consisted in comparing the standard deviation of the number of stored objects in each shard of the system. As can be observed, the charts in Figure 7 show that the Hash Based Partitioning solutions of systems sharded using Chord or Chord+GeoHashing were sometimes significantly better than the solutions implemented by CAN. For example, for the Taxi Routes data set, the worst performance in terms of standard deviation of stored objects belongs to CAN (uniformly distributed servers), and for the Underground Water Resources data set the worst performance is once again from CAN (randomly distributed servers).

In order to have good load balancing with GeoSharding, the positions assigned to the geoshards would preferably be spread across the regions where huge amounts of data are expected. For example, in the water resources dataset, spreading the center nodes of the spatial network across the region of the United States of America makes sense, as almost all the data points in the set belong to this country.

The spatial partitioning of CAN, along with the centered positioning of each rectangle's center node, does not allow positioning each shard exactly where it is desired. In the best case scenario, to position the centers exactly as preferred for better load balancing, the space would have to be sub-partitioned several times (making it necessary to have more shards to assign to the multiple new positions). And even so, the regular (rectangular) form does not suit the irregular distribution of the data points. For these reasons, CAN presents an unbalanced system with overloaded shards and other shards semi-empty, as it is impossible to "pick" the shard positions.
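A minimal sketch of the balance metric used here, the standard deviation of per-shard object counts (lower means more balanced); the shard counts are purely illustrative:

from statistics import pstdev
from typing import Dict

def load_imbalance(objects_per_shard: Dict[str, int]) -> float:
    """Population standard deviation of the number of objects stored per shard."""
    return pstdev(objects_per_shard.values())

print(load_imbalance({"s1": 1000, "s2": 950, "s3": 1050}))   # well balanced
print(load_imbalance({"s1": 2900, "s2": 50, "s3": 50}))      # heavily skewed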


Figure 7: Load Balance between Sharding Algorithms. Charts of the Standard Deviation of Objects Stored per shard for a) Earthquakes, b) Water Resources and c) Taxi Routes.

Figure 8: Spatial Query Performance with Replication Mechanism (Earthquakes data set). Charts of Number of Requested Servers vs. Query Square's Side (km) for a) Random, b) Uniform and c) Hand-Picked shard distributions.

On the other hand, by observing the results' charts, the solutions implemented by VAST compare very favorably to the Hash Based Partitioning methods. In the specific case of the hand-picked positions assigned to the shards, there is a significant optimization, since the standard deviation of the objects stored is clearly lower in the three environments (local, continental and global data). Thanks to the irregular partitioning among the shards using Voronoi polygon blocks, the locations attributed to the shards can be exactly what is desired. The load balancing test results support the Hand Picked VAST solution as the optimal solution to implement GeoSharding.

4.5 Replication
Figure 8 presents the charts resulting from the tests performed using the standard replication mechanism proposed. Compared with the results presented in Figure 6 (for the Earthquakes data set, also used in Figure 8), the spatial query performance was improved (when using VAST) by the new spatial replication strategy. This is clearly evident in the larger queries (query square sides with high values), where the number of shards requested to resolve the queries is lower in Figure 8 than in Figure 6. The random replication implemented using Chord did not improve the spatial query performance at all. This proves that replication based on the geographical characteristics of data can improve spatial data access performance.

Regarding the geometry object’s replication tests, eachbus journey was replicated in average by 2 shards , where theminimum number of shards storing a journey was 1 and themaximum number of shards storing the same journey was

5. This shows that is possible to replicate the data points ofa route database in a efficient way to allow requesting onlyone shard to retrieve a whole journey and at the same timeoptimize the spatial query performance tested before withdifferent data sets.

5. CONCLUSIONS
In this document we presented an approach to optimize the database partitioning mechanisms of systems that work with georeferenced Big Data. GeoSharding was defined and tested in a simulated environment, and its implementation on top of Voronoi diagrams was the only solution that optimized not only the spatial range queries but also the load balancing in a sharded system, resolving the trade-off between Hash Based Partitioning and Range Based Partitioning when working with georeferenced datasets.

This way, in GIS systems, it is possible to optimize data access and data storage by exploiting the georeferenced locality characteristics of data. The sharding strategies used by current sharded systems that manage spatial data are not adequate, since they do not take any advantage of the georeferenced characteristics. GeoSharding is particularly useful in these scenarios.

We tested different options available to implement GeoSharding, where the ones using common existing spatial mechanisms (CAN, Quad-tree) presented inefficiencies. By using a Voronoi-based solution, we proved that spatial indexing mechanisms can be further optimized, especially when load balance and the avoidance of multiple sub-partitioning are a priority. Using Voronoi diagrams, it is possible to avoid the fixed sub-partitioning of space and adapt the spatial network to the irregularities of the georeferenced datasets, by positioning each shard exactly where it is desired (allowing load balancing optimization).

VAST also allows a spherical implementation using the Haversine formula, which is very useful for global georeferenced data sets, since it implements a spatial index that treats the Earth as a sphere and not as a bi-dimensional plane. This guarantees continuous data locality, even at the Earth's poles and around the 180th meridian.

We also proved that it is possible to implement a replication system based on Voronoi diagrams that is simple and efficient.

VAST allows server locality optimization and, in the future, it is possible to adapt a Voronoi-based sharding mechanism to work with "server relocation", since VAST was originally designed for gaming environments where each node moves through the space. By using this relocation of the positions assigned to the shards, load balancing could be maintained over time even if the high volumes of georeferenced points start to change place. This can be considered another important advantage of using Voronoi polygons as the main structure to implement GeoSharding.

6. REFERENCES
[1] Northern California Earthquake Data Center, UC Berkeley Seismological Laboratory (2014). http://www.quake.geo.berkeley.edu/anss/catalog-search.html. Accessed: 2016-6-17.
[2] J. Bao, Y. Zheng, and M. F. Mokbel. Location-based and preference-aware recommendation using sparse geo-social networking data. In SIGSPATIAL '12 - Proc. of the 20th Int. Conf. on Advances in Geographic Information Systems. ACM, 2012.
[3] D. D. Chamberlin and R. F. Boyce. Sequel: A structured english query language. In Proc. of the 1974 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control, SIGFIDET '74. ACM, 1974.
[4] Dublin City Council. Dublin bus gps sample data from dublin city. www.data.dublinked.ie. Accessed: 2016-6-17.
[5] L. Ferrucci, L. Ricci, M. Albano, R. Baraglia, and M. Mordacchini. Multidimensional range queries on hierarchical voronoi overlays. Journal of Computer and System Sciences, 2016.
[6] T. Hengl, G. B. M. Heuvelink, and D. G. Rossiter. About regression-kriging: From equations to case studies. Comput. Geosci., 33(10):1301-1315, Oct. 2007.
[7] S.-Y. Hu and G.-M. Liao. Scalable peer-to-peer networked virtual environment. In Proc. of 3rd ACM SIGCOMM Workshop on Network and System Support for Games, NetGames '04. ACM, 2004.
[8] M. Kolahdouzan and C. Shahabi. Voronoi-based k nearest neighbor search for spatial network databases. In Proc. of the Thirtieth Int. Conf. on Very Large Data Bases. VLDB Endowment, 2004.
[9] T. Lipcon. Design patterns for distributed non-relational databases. Presented 2009-06-11, 2009.
[10] M. Malone. Scaling gis data in nonrelational data stores. San Jose, USA, 2010. Presented at Where 2.0 2010.
[11] MongoDB. Sharding introduction. www.docs.mongodb.com. Accessed: 2016-6-17.
[12] A. Montresor and M. Jelasity. PeerSim: A scalable P2P simulator. In P2P'09 - Proc. of the 9th Int. Conf. on Peer-to-Peer, Seattle, WA, Sept. 2009.
[13] N. Bessis and C. Dobre. Big Data and Internet of Things: A Roadmap for Smart Environments. Springer, 2014.
[14] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proc. of the 2001 Conf. on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM '01, pages 161-172. ACM, 2001.
[15] L. Ricci and A. Salvadori. Nomad: Virtual environments on p2p voronoi overlays. In OTM Confederated Int. Conf. "On the Move to Meaningful Internet Systems". Springer, 2007.
[16] R. Singh. The nosql geo landscape. Presented by IBM Cloud Data Services. IBM Corporation, December 2014.
[17] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review, 31(4):149-160, 2001.
[18] C. Strauch, U.-L. S. Sites, and W. Kriha. Nosql databases. Lecture Notes, Stuttgart Media University, 2011.
[19] E. Tanin, A. Harwood, and H. Samet. Using a distributed quadtree index in peer-to-peer networks. The VLDB Journal, 16(2):165-178, 2007.
[20] VAST Development Team. Voronoi-based overlay network (von). www.vast.sourceforge.net. Accessed: 2016-6-17.
[21] U.S. Geological Survey. Groundwater levels for the nation. nwis.waterdata.usgs.gov/usa/nwis/gwlevels. Accessed: 2016-6-27.
[22] G. Van Brummelen. Heavenly Mathematics: The Forgotten Art of Spherical Trigonometry. Princeton University Press, 2013.
[23] J. Yu, J. Wu, and M. Sarwat. Geospark: A cluster computing framework for processing large-scale spatial data. In SIGSPATIAL '15 - Proc. of the 23rd Int. Conf. on Advances in Geographic Information Systems. ACM, 2015.
[24] Y. Zheng and X. Zhou. Computing with Spatial Trajectories. Springer, 2011.
[25] J. Yuan, Y. Zheng, X. Xie, and G. Sun. Driving with knowledge from the physical world. In Proc. of the 17th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining. ACM, 2011.
[26] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang. T-drive: driving directions based on taxi trajectories. In SIGSPATIAL '10 - Proc. of the 18th Int. Conf. on Advances in Geographic Information Systems. ACM, 2010.