36

Click here to load reader

A dissimilarity function for geospatial polygons

Embed Size (px)

Citation preview

Page 1: A dissimilarity function for geospatial polygons

Knowl Inf SystDOI 10.1007/s10115-013-0666-2

REGULAR PAPER

A dissimilarity function for geospatial polygons

Deepti Joshi · Leen-Kiat Soh · Ashok Samal · Jing Zhang

Received: 12 December 2011 / Revised: 27 May 2013 / Accepted: 28 May 2013© Springer-Verlag London 2013

Abstract Similarity plays an important role in many data mining tasks and informationretrieval processes. Most of the supervised, semi-supervised, and unsupervised learning algo-rithms depend on using a dissimilarity function that measures the pair-wise similarity betweenthe objects within the dataset. However, traditionally most of the similarity functions fail toadequately treat all the spatial attributes of the geospatial polygons due to the incompletequantitative representation of structural and topological information contained within thepolygonal datasets. In this paper, we propose a new dissimilarity function known as thepolygonal dissimilarity function (PDF) that comprehensively integrates both the spatial andthe non-spatial attributes of a polygon to specifically consider the density, distribution, andtopological relationships that exist within the polygonal datasets. We represent a polygonas a set of intrinsic spatial attributes, extrinsic spatial attributes, and non-spatial attributes.Using this representation of the polygons, PDF is defined as a weighted function of thedistance between two polygons in the different attribute spaces. In order to evaluate our dis-similarity function, we compare and contrast it with other distance functions proposed in theliterature that work with both spatial and non-spatial attributes. In addition, we specificallyinvestigate the effectiveness of our dissimilarity function in a clustering application using apartitional clustering technique (e.g. k-medoids) using two characteristically different setsof data: (a) Irregular geometric shapes determined by natural processes, i.e., watersheds and(b) semi-regular geometric shapes determined by human experts, i.e., counties.

Keywords Dissimilarity function · Polygons · Spatial data mining · Regionalization ·Polygonal clustering

D. Joshi (B)Department of Mathematics and Computer Science, The Citadel, Charleston, SC 29409-0001, USAe-mail: [email protected]

L.-K. Soh · A. Samal · J. ZhangDepartment of Computer Science and Engineering, University of Nebraska-Lincoln,Lincoln, NE 68588-0115, USA

123

Page 2: A dissimilarity function for geospatial polygons

D. Joshi et al.

1 Introduction

Similarity plays an important role in most of the data mining tasks and information retrievalprocess. Most of the supervised, semi-supervised, and unsupervised learning algorithmsdepend on using a dissimilarity function that measures the pair-wise similarity/dissimilaritybetween the objects within the dataset [3,29,30]. Similarity also plays an important role inthe classification process. Similarity-based classifiers estimate the class label of a test sam-ple based on the similarities between the test sample and a set of labeled training samples,and the pair-wise similarities between the training samples [6,19]. Examples of some spe-cific application areas of dissimilarity functions include—clustering [2,33], training of aninstance-based learning system [28,39], and prediction and trend analysis [4,15,37]. Clus-tering, a common data mining task, is a prime application for a dissimilarity function sinceit is based on separation of dissimilar objects and grouping of similar objects (Joshi et al., toappear). Other applications such as region growing in which objects are ranked based on thedegree of similarity to add neighboring polygon require a function that orders polygons inincreasing similarity.

In this paper, our emphasis is on measuring the dissimilarity between complex geospatialpolygons. We define complex geospatial polygons as polygons that have varying structuresand may have other geospatial objects within their spatial extent. Please note that the polygonsreferred to here are the same as regions [8] or tessellations in space. Polygonal datasets aremuch more complicated in nature as compared to point datasets. For example, the length of aboundary shared between two polygons—which may be used to determine spatial proximityof the two polygons—is lost when polygons are represented as points. Moreover, for aconcave-shaped polygon, the centroid of the polygon may lie outside the boundary of thepolygon. Thus, if one tries to spatially analyze polygons simply by representing them aspoints (typically their centroids), the result may not be accurate, and the underlying spatialstructure is lost [24]. Furthermore, when considering geospatial polygons, there may be otherspatial objects that lie within the polygons or may be shared by two or more polygons. Forexample, lakes, rivers, and even manmade structures such as highways lie within geospatialpolygons such as counties and watersheds. There is no appropriate representation of thistype of information when using the current state-of-the-art in spatial analysis. For example,if one were to perform watershed analysis—where watersheds are naturally formed polygonswithin the river basins—based on their relationship with a set of rivers, say, cutting throughthe watersheds, there is no current spatial analysis technique that would allow us to do so.

Several similarity/dissimilarity functions have been proposed in the past that allow us toestimate the distance between two objects. However, the similarity functions that have beenproposed for polygons fail to comprehensively treat all the spatial attributes (see Sect. 2 for anoverview of the most commonly used distance functions for polygons) due to the inadequaterepresentation of structural and topological information contained in the polygons. This leadsto inaccuracy in the computed results. In order to comprehensively integrate both the spatialand the non-spatial attributes of a polygon, we propose a new dissimilarity function calledthe polygonal dissimilarity function (PDF). This is based on our earlier work presented inJoshi et al. [25].

We hypothesize that, in order to accurately represent polygons in the geospatial domain,the attributes of the polygons should accurately capture both its spatial structure (intrinsicto the polygon) and its spatial organization (extrinsic to the polygon) along with the non-spatial attributes of the polygons. The spatial attributes of the polygons are therefore dividedinto two categories—intrinsic spatial attributes and extrinsic spatial attributes. The intrinsicspatial attributes of a polygon are the features or properties of the polygon that cannot be

123

Page 3: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

separated from the polygon and therefore capture the structure of the polygons, for example,the location of the polygon, the area covered by polygon, and its shape, etc. By taking theintrinsic attributes of the polygon into account, we can find out, for example, the extent ofthe boundary shared by two polygons, the information as mentioned before that is lost byrepresenting the polygon as a point. The extrinsic spatial attributes, on the other hand, refer tothe properties of the polygons that exist independent of the polygon itself. Thus, they enableus to capture the spatial organization of a polygon by taking into account the topologicalrelationship between the polygon and its neighboring polygons within the dataset as well asother spatial objects present within a polygon itself. Examples of extrinsic spatial attributesof a polygon include the spatial distributedness of other spatial objects present within thepolygons, the density of the spatial objects present within the polygons, etc. Please note thatthe spatial objects present within the polygons may be point objects (for example, schools andhospitals), linear objects (for example, rivers and highways), and areal or polygonal objects(for example, lakes and parks).

Using the aforementioned representation of the polygons, we define PDF as a weightedfunction of the distance between two polygons in the different attribute spaces. In other words,PDF is a combination of a number of distance functions each pertaining to a different class ofattributes describing a polygon. Furthermore, the weights in the dissimilarity function allowthe users to customize their use of PDF based on the significance of the attributes in theirapplication domain. For example, in order to find the similar lakes based on their topologicalrelationship—such as “adjacent to”—with watersheds, one would assign a greater weight tothe spatial distance that measures topological relationships. On the other hand, in order todiscover the lakes high in nitrogen content, a higher weight must be assigned to the non-spatial attributes. In Sect. 3, we describe the distance functions for the underlying attributesof the polygons along with the guidelines for combining them effectively.

In order to evaluate our dissimilarity function, we first compare and contrast it withother distance functions proposed in the literature that also use both spatial and non-spatialattributes. In particular, we compare our algorithms to the distance functions proposed byWebster and Burrough [40], Cliff et al. [8], and Perruchet [31]. These distance functions havebeen described in Sect. 2, and the comparative analysis has been presented in Sect. 4. Next,we specifically investigate the effectiveness of our dissimilarity function in spatial clusteringsince distance-based functions play a central role in this application. We have applied ourdissimilarity function to the k-medoids clustering algorithm to cluster geospatial regions rep-resented as polygons in two different domains with diverse characteristics—namely, environ-mental analysis using watersheds and political applications using counties. Our results showthat PDF outperforms other distance functions in ranking the similarity between polygonsand results in the maximum range between the pair-wise distances computed. Furthermore,our results for the clustering application show that with the use of the intrinsic and extrinsicspatial attributes of the polygons along with the non-spatial attributes results in more cohesiveclusters.

Finally, we use the term “dissimilarity” instead of “distance” because our dissimilarityfunction does not satisfy the symmetry and triangular inequality properties of a distancemetric [1].

2 Related work

In this section, we present an overview of the various distance functions proposed in the lit-erature for measuring the distance between two polygons along with the problems associated

123

Page 4: A dissimilarity function for geospatial polygons

D. Joshi et al.

with their use. Polygons in general can be concave or convex, small or large, elongated orcompact. Furthermore, completely disjoint polygons can have overlapping bounding boxes;adjacent polygons can share a single point, a segment on the boundary or even multiplesegments. Based on these properties of the polygons, the following distance functions havebeen proposed.

Centroid distance One way to approximate polygon objects is to represent each object bya representative point, such as the centroid of each object, and then find the distance betweenthe centroids of the polygons. However, this approach is generally not effective since theobjects may have very different sizes and shapes. For instance, a rectangular building mayhave a size of 500 square meters, whereas a lake may have a size of 300,000 square meterswith irregular elongated shape. Simply representing each of these objects by its centroid, orany single point, does not take into the account the extents of the polygons. Another problemwith this approach is that the centroid may not be inside the polygon (e.g., for some concaveobjects) and may indeed be inside another object.

Minimum bounding rectangle distance There is a large body of work in shape analysis(e.g., [17,35]). For example, the minimum bounding rectangle (MBR) of a polygon can beused as a first-order approximation of the shape and orientation of the polygon: It is thesmallest rectangle that encloses an object. Distance of two polygons can be measured byfinding the distance between the centers of their respective MBRs. However, many of thesame problems described for centroid-based distances remain. For example, the center ofthe MBR of a polygon may not fall within the polygon, or the MBRs of two polygons mayoverlap.

Separation distance The distance between a point P and a line L is defined by the per-pendicular distance, between the point and the line, i.e., min{d(P, Q)|Q is a point on L}.Thus, given two polygons A and B, we can define the distance between these twopolygons to be the minimum distance between any pair of points in A and B, i.e.,min{d(P, Q)|P, Q are points in A, B, respectively}. This distance is called the separationdistance (e.g., distance between polygons P1 and P3 as shown in Fig. 2 and is exactly thesame as the minimum distance between any pair of points on the boundaries of A and B [9].

However, if two polygons intersect or share boundaries or even a point, their separationdistance is zero. This definition of distance is quite unsatisfactory for geospatial applications,e.g., the distances between P1 and P2 and between P2 and P3, as shown in Fig. 1. Theseparation distance between two adjacent polygons will always be zero and is an inappropriatemeasure since all polygons will have shared boundaries with their neighbors. The transitiverelationship in terms of separation distance does not hold: In Fig. 1, for example, the separationdistance between P1 and P3 is nonzero, even though each has a zero separation distancewith P2.

Fig. 1 Separation distancewhere the transitive relation doesnot hold

123

Page 5: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

Fig. 2 Minimum and maximum distance between vertices

Min–max distance Another way to measure the distance between polygons is to find theminimum or maximum distance between each pair of vertices of the polygons. However, thismethod either overestimates or underestimates the true distance between two polygons asshown in Fig. 2a. It shows the separation distance (a), the minimum distance between vertices(b), the maximum distance between vertices (c), and the distance between the centroids (d).It is clear that both b and c do not match the intuitive notion of the distance between thetwo polygons. If we only consider the minimum or the maximum distance between vertices,we overlook the shape of the polygons as shown in Fig. 2b, where the shortest and longestdistances between any pair of vertices are shown in red and blue, respectively. Clearly, thesedistances are independent of the shape of the polygons, i.e., many polygons with differentshapes can have the same distance as long as we maintain the two extreme (minimum ormaximum) points in the two polygons. Hence, these are inappropriate as distance measures.

Hausdorff distance The Hausdorff distance between two sets of points [34] is defined asthe maximum distance of points in one set to the nearest point in the other set. Formally, theHausdorff distance from set A to set B is defined as

h(A, B) = maxa∈A

(minb∈B

d(a, b)

)

where a and b are points of sets A and B, respectively, and d(a, b) is any distance metricbetween the two points a and b; for simplicity, we can take d(a, b) as the Euclidian distancebetween a and b. If the boundaries of the polygons Pi and Pj are represented by two sets ofpoints A and B, respectively, we can use this as a distance measure between two polygons.

Dh(Pi , Pj

) = max (h(A, B), h(B, A))

Figure 3 presents a comparison between the Centroid distance and Hausdorff distance oftwo polygons. For convex polygons, the Hausdorff distance, defined on the set of vertices ofpolygons, usually gives as good an estimate of distance as the Centroid distance. However,using the centroids to measure the distance between two polygons may not give us the“true” distance for concave polygons. As shown in Fig. 3, the Centroid distance Dc mayunderestimate or overestimate the exact distance when the centroid of a concave polygonfalls outside the polygon. The Hausdorff distance, Dh , defined on the two sets of vertices ofpolygons, gives a more accurate measurement.

Fréchet distance In order to measure the distance between polygons based on their shape,Fréchet distance [5] is considered to be more appropriate than Hausdorff distance [34].An intuitive definition of the Fréchet distance is to imagine that a dog and its handler arewalking on their respective polygon boundaries. Both can control their speed but can onlygo forward. The Fréchet distance of these two polygon boundaries is the minimal length of

123

Page 6: A dissimilarity function for geospatial polygons

D. Joshi et al.

Fig. 3 Comparison of Hausdorff distance with centroid distance

any leash necessary for the dog and the handler to move from the starting points of the twocurves to their respective endpoints. It is formally defined below:

Let f, g be parameterizations of curves or polygons, i.e., continuous functions

f, g : [0, 1]k → Rd , k ∈ {1, 2}, d ≤ k

Then their Fréchet distance (DF ) is

DF ( f, g) = infσ :[0,1]k t∈[0,1]k

max ‖ f (t) − g(σ (t))‖

where the re-parameterization σ ranges over all orientation preserving homeomorphisms.It is important to note that Fréchet distance is used only for shape matching. It does not

measure the geographic distance between two polygons in the geospatial applications forinstance. For such purposes, Hausdorff distance is more appropriate as shown in Fig. 3.

In addition to the distance functions defined above, several ways to combine geographicaldistances and non-geographical dissimilarities into a single pair-wise similarity value havebeen proposed in the literature. Webster and Burrough [40], Cliff et al. [8], and Perruchet[31] proposed different multiplicative and additive forms to combine such elements. Theseare defined below:

WB distance Webster and Burrough [40] proposed to compute the dissimilarity betweenpairs of polygons using the Canberra metric [27]. The Canberra metric between the i th andthe j th sites is computed as follows:

Di j =p∑

k=1

[∣∣gik − g jk∣∣

gik + g jk

]/p

where gik and g jk are the values of the kth property for the i th and j th polygons, respectively,and p is the number of properties. They further proposed to add the geographic distancebetween the sites to the Canberra metric coefficient as follows:

DW B = Di j + di jdmax

× w

1 + w

where Di j is the Canberra metric between polygons i and j , di j is the geographic distancebetween the polygons i , and j , dmax is the distance between the most distant pair of polygons,and w is a weighting factor.

123

Page 7: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

CXY distance Cliff et al. [8] propose a combined distance metric (Dcliff ) to measure thedistance between two polygons i and j as:

Dcliff = λdi j + (1 − λ)ti j

where di j is some distance metric that measures the spatial separation between the i th and j thregions, ti j is the distance metric that measures the distance between the non-spatial attributesof the two regions, and λ represents a weighing constant (0 ≤ λ ≤ 1). λ = 0, representsa purely non-spatial strategy, and λ = 1 represents a purely spatial strategy. λ = 0.33 andλ = 0.66 signify a mixed strategy which has been shown by the authors to yield intermediateresults with an average efficiency about twenty percent greater than that of the extremes.

PXY distance Perruchet [31] defines the aggregation index of dissimilarity, DP , betweentwo polygons i and j as follows:

DP (i, j) = f (δ(i, j), d(i, j))

where f (x, y) = xy, d(i, j) is the geographic distance between the two polygons and iscomputed using the Euclidean distance function, and δ(i, j) is the aggregation index definedas the dissimilarity between the polygons based on their non-spatial attributes. An exampleof δ(i, j) is given as:

δ(i, j) = μiμ j

μi + μ j

∥∥vi − v j∥∥2

where μi is the mass of i , and vi is the representation of i in the descriptor space.In summary, all the distance functions defined above focus on one or two aspects (dis-

tance and/or shape) of polygons. Our representation of a polygon includes their structuraland organizational properties which are fundamentally different and thus need to be treateddifferently. These properties are not incorporated in any of the functions proposed in theliterature in a comprehensive manner. This serves as the motivation of our work to define acomprehensive dissimilarity function that effectively unifies the distance functions for eachtype of attribute of a polygon.

3 Dissimilarity function for polygons

Consider a set of polygons P = {P1, P2, . . . , Pn} where each polygon Pi is defined by a setof spatial and non-spatial attributes.

The non-spatial attributes of a polygon include all the attributes of the polygon that areindependent of the spatial location of the polygon, for example, population, average income,number of hospitals, and number of major cities.

The spatial attributes of a polygon can be further divided into two categories: (1) intrinsicand (2) extrinsic. The intrinsic spatial attributes describe the geometric properties of thepolygon without any contextual information in a domain independent way, for example—location, shape, area, and aspect ratio of the polygon. Location is represented as a set ofvertices, specified in some spatial coordinate frame.

The extrinsic spatial attributes encompass the various spatial objects that may exist withina polygon, or may be shared by two or more polygons, which may however be defined inde-pendent of the polygon. In other words, the elements or spatial objects that are either embed-ded into or intersect with the polygon are represented as extrinsic spatial attributes of thepolygon. These spatial objects are divided into three classes: point, linear, and areal. Exam-ples of point spatial objects include buildings, shopping complexes, etc. Examples of linear

123

Page 8: A dissimilarity function for geospatial polygons

D. Joshi et al.

spatial objects include rivers, roads, and mountain ranges. Examples of areal objects includereservoirs, crop areas, forests, and large lakes. Please note that areal objects themselves arepolygons as well.

Given two polygons, Pi and Pj , the PDF that measures the distance between two polygonsin all the attribute spaces described above is defined as follows:

DPDF(Pi , Pj ) = f (dns(Pi , Pj ), ds(Pi , Pj )) (1)

where dns is a function that computes the distance between two polygons based on the non-spatial attributes—see Eq. 3, and ds is a function that computes the distance based on thespatial attributes—see Eq. 4.

The function f in Eq. 1 can be any non-spatial function that combines the two distances.We use a weighted sum that can easily adjust the contribution (i.e., the weight) of both thedistances.

DPDF(Pi , Pj ) = wnsdns(Pi , Pj ) + wsds(Pi , Pj ) (2)

where wns + ws = 1 and wns > 0;ws > 0.The weights wns and ws are domain dependent, i.e., they should be tuned for the appli-

cations using experiential or expert knowledge. Therefore, we cannot explicitly assign themany fixed values. These weights play an important role in defining the contribution of thedifferent types of attributes. For guidelines on weight assignment, see Sect. 3.3.

3.1 Distance between non-spatial attributes

The distance between the polygons in the non-spatial attribute space (dns) can be definedusing any distance measure such as the Euclidean distance function or the Manhattan distancefunction. We use the standard Euclidean distance as our distance measure as shown in Eq. 3.

dns(Pi , Pj ) =√∑m

k=1(gik − g jk)2 (3)

where gik and g jk represent the kth non-spatial attribute of polygons Pi and Pj , respectively,and m is the total number of non-spatial attributes. Please note that all the non-spatial attributesmust be represented as ordered numerical attributes so that they can be integrated together.Furthermore, all the attributes must be normalized before the computation of the distance.The normalization can be performed by dividing all the values in the dataset by the largestvalue in the dataset [18]. We assign an equal weight to all the non-spatial attributes. However,if desired, different weights may be assigned to the various non-spatial attributes. In this case,the equation for the distance function for non-spatial attributes will be as follows:

dns(Pi , Pj ) =√∑m

k=1wk(gik − g jk)2 (3-1)

3.2 Distance between spatial attributes

The distance between the polygons based on their spatial attributes (ds) is defined as afunction of the distance between their intrinsic spatial attributes (dins) and their extrinsicspatial attributes (dexs) as reflected in Eq. 4. The function dins is defined in Eq. 6, and thefunction dexs is defined in Eq. 15.

ds(Pi , Pj ) = winsdins(Pi , Pj ) + wexsdexs(Pi , Pj ) (4)

123

Page 9: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

where wins + wexs = 1 and wins > 0;wexs > 0. Please note that once again the values forwins and wexs are domain dependent and need to be determined by the user. See Sect. 3.3 foradditional guidelines.

3.2.1 Distance between intrinsic attributes

Among the intrinsic attributes of polygons, location is the most important. The location ofa polygon is defined as a vector of its vertices. Intuitively, we expect the distance betweentwo polygons with shared boundaries to be shorter than the distance between two polygonsthat do not have a common border. This is based on the assumption that two regions thatshare a boundary are closer than two regions—with everything else being equal—that donot, an assumption that has been used in domains dealing with spatial data such as imageprocessing and structural organization [21]. The importance of geographic distance and theshared boundary length between two regions in various political applications have beendemonstrated in Furlong and Gleditsh [16].

The Hausdorff distance function as defined in Section 2.1 is a suitable distance functionto measure the distance between the vertices of two polygons as it neither under-estimatesnor over-estimates the distance between two polygons. However, the standard Hausdorffdistance is defined on the set of points and does not incorporate any shared boundary. In orderto incorporate this, we define a new distance measure, called boundary adjusted Hausdorffdistance that is inversely proportional to the length of the shared boundary between twopolygons Pi and Pj as follows:

dhs(Pi , Pj ) =(

1 − 2si j

si + s j

)× dh

(Pi , Pj

)(5)

where dh is the original standard Hausdorff distance, si and s j are the perimeter lengths ofpolygons Pi and Pj , respectively, and si j is the length of their shared boundary. This distance,dhs , is smaller than the standard Hausdorff distance when two polygons have shared boundaryand becomes the standard Hausdorff distance when two polygons have no shared boundary,i.e., when si j = 0. We use twice the shared distance in the definition to balance the effect ofthe denominator.

Other than location, for the other intrinsic attributes, we compute the Euclidean distancebetween the individual attributes of the polygons in order to measure the distance between thepolygons. Finally, the distance between polygons Pi and Pj based on their intrinsic attributes(dins) is defined as:

dins(Pi , Pj

) = whsdhs(Pi , Pj

) + wst

√∑r

k=1(tik − t jk)2 (6)

where tik and t jk represent the kth structural attribute of polygons Pi and Pj , respectively,and r is the total number of structural attributes, whs represents the weight assigned to themodified Hausdorff distance function, wst is the weight assigned to the remaining intrinsicspatial attributes, and whs + wst = 1 and whs > 0;wst > 0.

3.2.2 Distance between extrinsic attributes

Extrinsic attributes incorporate the spatial objects present within the polygons or shared bytwo or more polygons. Given below is a framework that is used for defining the distancebetween two polygons based on their extrinsic attributes. The distance is based on the fol-lowing properties of the various spatial objects with respect to the polygon—(1) density,

123

Page 10: A dissimilarity function for geospatial polygons

D. Joshi et al.

Table 1 Different characteristics of spatial object attributes

Density Extent Distribution Topology Direction

Linear object dn = n|A| N/A zi Defined below Defined below

Areal object dn = n|A| e = |ai ||A| zi Defined below Defined below

Point object dn = n|A| N/A zi N/A N/A

(2) extent (the area covered by the object within the polygon), (3) spatial distribution, (4)topology, and (5) direction.

The density, extent, and distribution of a spatial object within a polygon are indicativeof the underlying forces (e.g. climate or other biological or geophysical or chemical) whichinfluence the polygon. In the geospatial domain, for example, the presence of clusters ofoak trees in two polygons is indicative of similar soil and/or climate regime, and therefore,both the polygons are likely to be more similar to each other. Therefore, two polygons withsimilar object density and distribution are more likely to be similar. The topology of spatialobjects, on the other hand, especially of linear spatial objects, is important as it captures thebinary relationship between the polygons with respect to other spatial objects. For example,a physical barrier between the polygons (e.g., a mountain range) can potentially increase thephysical distance between the polygons and hence discourage the polygons to be clusteredtogether.

Due the wide differences in their construction, e.g., an areal object extends over a largearea, whereas a point object is simply a single point within the polygon, not all the differentaspects mentioned above are applicable to every type of spatial object. Table 1 lists thedifferent types of characteristics applicable to the different types of spatial objects.

In Table 1, n is the number of times the spatial object occurs within the polygon, |A| isthe total area of the polygon, |ai | is the total extent of the areal object i within the polygon,and zi is the test statistic obtained from the Mean Nearest Neighbor test for complete spatialrandomness (CSR) [10], and N/A stands for not applicable. Next, we define the functionsthat are used to find the distance between two polygons on the basis of the above-mentionedproperties of the spatial objects present within the polygons.

Density and extent Density is the number of times an object occurs within a polygondivided by the area of the polygon. Extent is the total area covered by the object within thepolygon. We measure the distance between two polygons on the basis of the density of theobjects using Eq. 7 and on the basis of extent using Eq. 8.

ddensity =∣∣dmi − dm j

∣∣max(dmi , dm j )

(7)

where dni is the density of point object m in polygon Pi , dn j is the density of point objectm in polygon Pj .

dextent =∣∣ei − e j

∣∣max(ei , e j )

(8)

where ei is the total extent of an areal object within polygon Pi , e j is the total extent of theareal object within polygon Pj .

Distribution The spatial distribution of an object refers to the pattern of their distributionin space. This pattern may be random, uniform, or aggregated. In spatial statistics, several

123

Page 11: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

methods have been proposed to measure the spatial distribution of a set of points in space. Inthis paper, we employ the Mean Nearest Neighbor test for CSR [10] to measure the spatialdistribution of the various spatial objects. A brief description of the test is as follows:

Given a set of points, P = {P1, P2, . . . , Pn}, this test is based on the mean distance (y)

from each point to its nearest neighbor. Small values of y indicate an aggregated distributionof points and large values indicate regularity. If there are n points in a region of area A, thenthe test statistic proposed by Clark and Evans [7] is

(y − E(y))/SE(y)

where the expected value and the standard error of y are given by

E(y) = 0.5√

A/n and SE(y) = √0.0683 A/n

respectively. This test statistic is compared with the standard normal distribution, with spatialrandomness being rejected in favor of aggregation or regularity. This test however does nottake into account edge effects and assumes that the observations are independent of eachother. A revised version was proposed by Donnelly [10] which accounted for both the edgeeffects and the correlations among nearest neighbor distances. The precise expected valueand the standard error value of y were given by as follows:

E(y) = 0.5√

A/n +(

0.051 + 0.041√n

)L/n

and

SE(y) =

√0.07A + 0.073L

√An

n

where L is the length of the intrinsic spatial attributes of the region.The statistic produced as the output of this test is a fair indicator of the presence of aggre-

gation, regularity or randomness of events located within a polygon. This information aboutthe polygons helps us in identifying the polygons that have a similar underlying structure.

Please note that the spatial distribution test is only applicable for point dataset. Therefore,in order to measure the distribution of areal objects, some methodology needs to be followedto represent areal objects as a set of points. While more complicated methods can be devisedfor this purpose, as the areal objects present within the polygons are an order smaller inmagnitude, for simplification purposes we represent each areal object by its centroid. Tomeasure the distribution of linear objects, we take a fixed number of points from each linearobject, and use these points for the spatial randomness test. We measure the distance betweentwo polygons on the basis of the distribution of the spatial objects using Eq. 9.

ddistribution =∣∣zi − z j

∣∣max(zi , z j )

(9)

where zi is the distribution of the point object m in polygon Pi computed using the CSR testdescribed above, and z j is the distribution of the point object m in polygon Pj .

Topology Relationships between a pair of spatial objects (points, lines, and regions) canbe characterized as topological relations that describe how two such objects interact in a2D space. The 4-intersection model and the 9-intersection model describe an object as itsinterior, boundary, and exterior [11]. The relationship between two objects is then based onthe intersection of their interior, exterior, or boundary. The topological relationship between

123

Page 12: A dissimilarity function for geospatial polygons

D. Joshi et al.

two objects helps us in computing the distance function in between the two objects—twoobjects with similar topology are more likely to be similar than two objects with differenttopology. Here, we provide an extension of the framework proposed by Egenhofer and hiscolleagues in [12–14], so that the topological relationship between two spatial objects canbe defined with reference to a third spatial object. The topology of a polygon with respect tothe linear objects is defined as follows.

A polygon (A) can be divided into three segments—boundary (∂ A), interior (A◦), andexterior (A−). A linear feature (l) may intersect the boundary, the interior, or the exterior ofthe polygon. Furthermore, the polygons may lie either on the same side of the linear feature, orthey may be on opposite sides of the linear feature. Table 2 illustrates the different scenariosthat may arise and define the topological relationships between the two polygons basedon a linear feature. These scenarios are also demonstrated in Fig. 4. Once the relationship

Table 2 Different possible scenarios based on topological relationship of a linear feature (l) with two polygons(A and B)

∂ A A◦ A− ∂ B B◦ B− Direction Figure

Scenario 1 l X X Yes Fig. 4a, b

Scenario 2 l X X No Fig. 4c

Scenario 3 l X X Yes Fig. 4d, e

Scenario 4 l X X No Fig. 4c

Scenario 5 l X X No Fig. 4f

Scenario 6 l X X No Fig. 4g

Scenario 7 l X X Yes Fig. 4d, e

Scenario 8 l X X No Fig. 4g

Scenario 9 l X X Yes Fig. 4h, i

Fig. 4 Topological relationship between two polygons based on a linear feature—linear feature may intersectthe interior, exterior or the boundary of a polygon

123

Page 13: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

between two polygons with respect to a linear feature is determined, the distance betweenthe two polygons is computed on the basis of the following two rules: (1) If the linear featureintersects the interior of both the polygons, then the distance between them is the smallest.(2) If the linear feature intersects only the exterior of both the polygons, then the distance isthe greatest.

We generalize the distance function between a polygon and the linear feature based onthe topological relationship as follows.

d (Pi , l) =⎧⎨⎩

β if P◦i ∩ l = φ (interior)

α if ∂ Pi ∩ l = φ (boundary)α + f (P−

i , l) if P−i ∩ l = φ (exterior)

(10)

where Pi is any polygon, and l is any linear feature, 0 ≤ β < α ≤ 1, and f (P−i , l) = Nearest

distance of Pi to l. Here, β is defined as the constant minimum distance that any polygonwill have to a linear feature that intersects its interior and α is defined as the constant distancethat any polygon will have to a linear feature that intersects its boundary.

The topology of a polygon with respect to an areal object will follow the same design aspresented for the linear object. The scenarios illustrated in Table 2 and Fig. 4 can be extendedfor areal objects by replacing the linear feature with an areal object. For example, if we takeinto consideration underground aquifers which are shared by two watersheds, then the arealobject (underground aquifers) intersects the interior of both the polygons Pi and Pj (the twowatersheds). The distance, based on the topology of the polygon with respect to the arealobject, between the areal object Pi will be equal to β and the distance of polygon Pj to theareal object will also be β. Similarly, the rest of the cases can be extended from the linearobjects to the areal spatial objects.

Direction The linear and areal spatial objects may also impose directional constraints onthe polygons, i.e., the polygons may be on the same side of the linear feature, or on oppositesides. Furthermore, a linear or an areal feature present between two polygons may increase ordecrease the distance between the polygons. Based on these two factors, the distance betweentwo polygons Pi and Pj based on the linear feature l is given by the following function:

d(Pi , Pj |l

) = [oppg

(Pi , Pj , l

) + γ] × opps

(Pi , Pj , l

) + β (11)

where opps(Pi , Pj , l

)is 1 if the linear feature l is defined to be opposing (i.e. the presence

of the linear feature makes the polygons dissimilar), and 0 otherwise; γ is defined as aconstant that ensures that the linear feature which is opposing in nature increases the distancebetween two polygons to an extent so that they are not categorized as similar polygons; andopps

(Pi , Pj , l

)is an indicator of the location of the polygons with respect to the linear

feature. It has a value of 1 if the polygons are on the opposite sides of the linear feature, or 0if the polygons are on the same side.

It should be noted that these values are also domain dependent. If the domain is suchwhere the polygons are considered to be closer to each other if they are on opposite sides ofthe linear feature rather than on the same side, then the values defined for oppg

(Pi , Pj , l

)can be reversed. This same argument also applies to oppg

(Pi , Pj , l

). Similarly, we can apply

the above equation for an areal object when it is shared by two polygons.Due to the difference in the characteristics of the three types of spatial objects, we treat

them separately and define three different functions, dτ , dϕ , and dω to compute the distancefor point, linear, and areal objects defined in Eqs. 12, 13, and 14, respectively.

123

Page 14: A dissimilarity function for geospatial polygons

D. Joshi et al.

The distance based on spatial point objects is as follows:

dτ (Pi , Pj ) = 1

k

k∑m=1

(ddensity + ddistribution) (12)

where k is the total number of different point objects present within both polygons Pi andPj .

The distance based on spatial linear objects is as follows:

dϕ(Pi , Pj ) = 1

l

l∑m=0

(ddensity + ddistribution + [d(Pi , m) + d(Pj , m)] × [1 + d(Pi , Pj |m)])

(13)

where l is the total number of different linear objects present within both polygons Pi and Pj ,Note: There may exist a scenario where the linear feature m is exterior to both the polygons,and the nearest distance between m and both the polygons is fairly large. In this case, wepropose to not include this linear feature while computing the distance between the twopolygons based on the set of the linear features.

The distance based on spatial areal objects is as follows:

dω(Pi , Pj ) = 1

a

(a∑

m=1

(ddensity + ddistribution + dextent + [d(Pi , m) + d(Pj , m)]

×[1 + d(Pi , Pj |m)]))

(14)

where a is the total number of different types of areal objects within polygons Pi and Pj , isthe extent of the areal.

Finally, we compute a weighted sum of the above three distances (dτ , dϕ and dω) to derivethe overall distance between two polygons Pi and Pj based on the extrinsic spatial attributesas follows:

dexs(Pi , Pj

) = wτ dτ (Pi , Pj ) + wϕdϕ(Pi , Pj ) + wωdω(Pi , Pj ) (15)

where wτ ,wϕ,wω are the weights associated with the three spatial object types and wτ +wϕ + wω = 1. wτ > 0;wϕ > 0;wω > 0. These weights provide flexibility for the domainexpert to emphasize any set of the spatial object features.

3.3 Guidelines for weight assignments

As stated previously, PDF is defined as a weighted function of distance across the differentattribute spaces of the polygons being considered. The role of an attribute in the similaritybetween two polygons is defined by its assigned weight in the distance function. Figure 5presents a summary of the different weight assignments for different parts of the PDF. Pleasenote that if an attribute is unavailable or undefined for any polygon in a given dataset, thedistance function permits the attribute and its associated weight to be completely removedfrom the equation.

Assignment of appropriate weights is an important exercise in effectively using the simi-larity function. However, since the weight assignment must be guided by domain knowledge,universal rules for this task difficult to specify. We do provide a set of general guidelines thatmay be followed in determining the weights. These guidelines are motivated assign greater

123

Page 15: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

Fig. 5 Different weight assignments and their inter-relationships

weights to attributes with higher entropy so that their effect is more pronounced in the sim-ilarity/dissimilarity comparison, and vice versa. For example, suppose that the pair-wisesimilarity needs to be computed for a set of lake polygons where the nitrogen content inthe water varies greatly. In this case, a greater weight should be assigned to the non-spatialattributes of the lakes to enable the nitrogen content to play a significant role. Presence ofdomain knowledge can also provide commonsense-based guidelines for weight assignment.We have outlined six different scenarios in Table 3 to help illustrate the use of domainknowledge in determining the appropriate weights.

3.4 Mathematical properties of PDF

1. The modified Hausdorff distance function satisfies the three axioms for a distance functionas follows:

Minimality:db(Pi , Pj

)> db (Pi , Pi ) = 0

Symmetry:db(Pi , Pj

) = db(Pj , Pi

) = max (h (A, B) , h (B, A))

The triangular inequality:db (Pi , Pk) ≤ db(Pi , Pj

) + db(Pj , Pk

)2. The distance between two polygons based on density, distribution, and extent also satisfies

the three axioms stated above.3. The distance between two polygons based on their topological relationship with a linear

feature does not satisfy the three axioms for a distance function, as the relationship definedin Sect. 3.2 is between a polygon and a linear feature, and there is no direct relationshipbetween two polygons. Hence, the symmetry and triangular inequality properties of adistance metric are not satisfied.

As a result of the third mathematical property of PDF defined above, PDF has been definedas a “dissimilarity” function instead of “distance” function.

3.5 Computational complexity of PDF

As PDF is a weighted sum function combining the various distance functions that computethe dissimilarity between two polygons based on different attributes, the computational com-

123

Page 16: A dissimilarity function for geospatial polygons

D. Joshi et al.

Tabl

e3

Dif

fere

ntw

eigh

tass

ignm

ents

cena

rios

Scen

ario

Wei

ght

Ass

ignm

ent

Gui

delin

eSa

mpl

eW

eigh

tA

ssig

nmen

t

Non

-Spa

tialA

ttrib

ute

Wei

ght

Spat

ialA

ttrib

ute

Wei

ght

1A

llat

trib

utes

are

cons

ider

edto

beeq

ually

impo

rtan

t.A

llth

eat

trib

utes

shou

ldbe

assi

gned

equa

lwei

ght

wns

=0.

5w

s=

0.5

win

s=

0.5

wex

s=

0.5

wh

s=

0.5

wst

=0.

5w

τ=

0.33

=0.

33w

ω=

0.33

2N

on-s

patia

lat

trib

utes

play

the

mos

tim

port

antr

ole

with

very

little

emph

a-si

son

spat

iala

ttrib

utes

.

Agr

eate

rw

eigh

tmus

tbe

assi

gned

toth

eno

n-sp

atia

lat

trib

utes

wns

=0.

8w

s=

0.2

Exa

mpl

e:C

ities

need

tobe

com

pare

dba

sed

onth

eir

popu

latio

nin

orde

rto

stud

yor

mon

itor

the

tren

dof

popu

latio

nm

ovem

ents

from

rura

lto

urba

nar

eas.

win

s=

0.5

wex

s=

0.5

wh

s=

0.5

wst

=0.

5w

τ=

0.33

=0.

33w

ω=

0.33

3Sp

atia

lattr

ibut

espl

aya

very

impo

rtan

trol

ew

ithve

rylit

tleem

phas

ison

non-

spat

iala

ttrib

utes

.

Agr

eate

rw

eigh

tmus

tbe

assi

gned

toth

esp

atia

lat

trib

utes

wns

=0.

2w

s=

0.8

Exa

mpl

e:C

ount

ies

need

tobe

com

pare

dba

sed

onth

enu

mbe

rof

wat

erbo

dies

pres

entw

ithin

each

coun

ty,

alon

gw

ithth

ear

eaco

vere

dan

ddi

stri

butio

nof

the

wat

erbo

dies

inw

ater

man

agem

enta

ndG

ISap

plic

atio

ns.

win

s=

0.5

wex

s=

0.5

wh

s=

0.5

wst

=0.

5w

τ=

0.33

=0.

33w

ω=

0.33

123

Page 17: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

Tabl

e3

cont

inue

d

Scen

ario

Wei

ghtA

ssig

nmen

tGui

delin

eSa

mpl

eW

eigh

tAss

ignm

ent

4G

eogr

aphi

ccl

osen

ess

isve

ryim

port

anti

nde

term

inin

gth

esi

mila

rity

crite

ria.

Agr

eate

rw

eigh

tmus

tbe

assi

gned

toth

elo

catio

nw

ns=

0.1

ws

=0.

9

Exa

mpl

e:Sp

atia

lcon

tigui

tyis

the

mos

tdes

ired

fact

or,

for

exam

ple

indi

stri

ctfo

rmat

ion

appl

icat

ions

tost

udy,

for

exam

ple,

gerr

yman

deri

ngis

sues

.

win

s=

0.9

wex

s=

0.1

wh

s=

0.9

wst

=0.

1w

τ=

0.33

=0.

33w

ω=

0.33

5St

ruct

ure/

shap

eof

the

poly

gons

are

impo

rtan

tcri

teri

aal

ong

with

thei

rno

n-sp

atia

lattr

ibut

es.

Agr

eate

rw

eigh

tmus

tbe

assi

gned

toot

her

intr

insi

csp

atia

lattr

ibut

es

wns

=0.

5w

s=

0.5

Exa

mpl

e:L

akes

need

tobe

com

pare

dba

sed

onth

eir

size

,sha

pe,a

ndth

esp

ecie

sof

fish

livin

gin

each

lake

for

vari

ous

biod

iver

sity

appl

icat

ions

.

win

s=

0.9

wex

s=

0.1

wh

s=

0.1

wst

=0.

9w

τ=

0.33

=0.

33w

ω=

0.33

6In

tern

alst

ruct

ure

ofpo

lygo

nsis

anim

port

antf

acto

r,al

ong

with

thei

rto

polo

gica

lrel

atio

nshi

pto

each

othe

r.

Agr

eate

rw

eigh

tmus

tbe

assi

gned

toth

eex

trin

sic

spat

iala

ttrib

utes

wns

=0.

2w

s=

0.8

Exa

mpl

e:W

ater

shed

sne

edto

beco

mpa

red

toea

chot

her

onth

eba

sis

ofth

eir

rela

tions

hip

toa

seto

fri

vers

,and

the

num

ber

ofla

kes

pres

entw

ithin

each

wat

ersh

edfo

rw

ater

allo

catio

nan

dm

anag

emen

tre

late

dap

plic

atio

ns.

win

s=

0.3

wex

s=

0.7

wh

s=

0.7

wst

=0.

3w

τ=

0.33

=0.

33w

ω=

0.33

123

Page 18: A dissimilarity function for geospatial polygons

D. Joshi et al.

plexity for computing the pair-wise dissimilarity between all the polygons will be k(O(n×n))

where n is the number of polygons within the dataset. In the worst case scenario, k = 7 asduring the pre-processing of the dataset and computing the dissimilarity of polygons based ontheir different attribute sets, the polygons will need to be compared to each other 7 differenttimes—once for the difference in non-spatial attributes, five times for intrinsic and extrinsicspatial attributes, and finally once for combining all the differences between non-spatial andspatial attributes to get the final result.

4 Experimental analyses

In order to evaluate our PDF, we first compare and contrast it with other distance functionsproposed in the literature that also use both spatial and non-spatial attributes. In particular, wecompare our algorithms to the distance functions proposed by Webster and Burrough [40],Cliff et al. [8], and Perruchet [31]. For this study, we have made use of a subset of the censusblock groups from the city of Lincoln, NE along with the number of liquor licenses assignedto each census block group. The goal is to study the differences in the pair-wise distancescomputed for the polygons (census block groups) using the various distance functions.

Next, we specifically investigate the effectiveness of our dissimilarity function in spatialclustering since distance-based functions play a central role in this application. We haveapplied our dissimilarity function to the k-medoids clustering algorithm to cluster geospatialregions represented as polygons in two different domains with diverse characteristics. We firstapply our dissimilarity function to the hydrology domain where we examine the formation ofclusters of watersheds. In hydrology, watersheds are polygons that serve as the basic unit foranalysis. For example, watersheds are often clustered together to perform frequency analysisof floods, or determine regional trends [32]. In a second experiment, we form clusters ofcounties that are often used for the organization of higher level political or managementdistricts. In contrast to watersheds which represent natural units of area (polygons) with nodefined geometric shape, counties are man-made polygons that have more regular geometricshape and spatial relationships.

4.1 Comparative analysis

In this section, we compare the performance of our algorithm with three distance functionsthat make use of both the spatial and non-spatial attributes, namely, the WB Distance, theCXY Distance, and PXY Distance. These are described in Sect. 2. We use a set of six polygonswhich are census blocks in the city of Lincoln, NE (USA). For non-spatial attribute, we usethe locations of liquor licenses within the census blocks. The census blocks and the sites forliquor licenses are shown in Fig. 6.

The WB distance function is computed using the following—(1) the Canberra Metricon the number of liquor licenses, (2) the Euclidean distance between the centroids of thepolygons, and (3) w = 0.66. The CXY distance is computed by as follows: (1) taking thenormalized Euclidean distance between the centroids of the polygons as the spatial distancefunction, (2) the normalized Euclidean distance between the number of liquor licenses, and(3) λ = 0.66. The PXY distance function is computed as the product of the spatial distancebetween the centroids of the polygons and the non-spatial distance using the metric definedin Sect. 2 [31]. We further compute a revised version of the PXY distance function where weforce the mass of any attribute being considered to be at least 0.01. The results of this versionof PXY distance are listed in Table under PXY’. Finally, we compute our proposed PDF

123

Page 19: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

Fig. 6 A set of census blocks inLincoln, NE and the locations ofthe sites for liquor licenses

Table 4 Statistics and the distances between polygons using different distance functions (WB Distance, CXYDistance, PXY Distance, PXY’ Distance and PDF)

P1 P2 NoLL1 NoLL2 WB CXY PXY PXY’ PDF

0 1 8 0 2.252 0.932 0.000 0.572 1.202

0 2 8 0 2.224 0.904 0.000 0.544 1.207

0 3 8 5 1.551 0.738 27.692 27.692 0.807

0 4 8 4 1.320 0.440 21.134 21.134 0.446

0 5 8 13 1.302 0.485 75.839 75.839 0.646

1 2 0 0 0.831 0.171 0.000 0.000 0.061

1 3 0 5 1.993 0.673 0.000 0.125 1.365

1 4 0 4 1.928 0.608 0.000 0.065 0.896

1 5 0 13 2.063 0.743 0.000 1.031 1.217

2 3 0 5 1.832 0.512 0.000 0.065 1.286

2 4 0 4 1.949 0.629 0.000 0.070 0.937

2 5 0 13 1.927 0.607 0.000 0.683 1.133

3 4 5 4 1.209 0.476 1.476 1.476 0.582

3 5 5 13 1.384 0.431 97.957 97.957 0.864

4 5 4 13 1.497 0.488 115.444 115.444 0.844

using the—(1) non-spatial distance function computed as the normalized Euclidean distancebetween the number of liquor licenses, (2) intrinsic spatial distance function computed usingour modified Hausdorff distance function, and (3) extrinsic spatial distance function computedusing the density and distribution of the liquor license locations within each polygon. Further,the spatial attributes are assigned an overall weight of 0.66, and non-spatial attributes areassigned a weight of 0.34. Within the spatial attributes, intrinsic spatial attributes are assigneda weight of 0.5, and extrinsic spatial attributes are assigned a weight of 0.5. The number ofliquor licenses within each block (NoLL1 and NoLL2) and the distances between each pairof polygons (P1 and P2) using the aforementioned functions for the census blocks shown inFig. 6 are presented in Table 4.

Observing the results obtained, we find that PXY distance function fails to compute thedistance between two polygons when the mass of an attribute of either of the two polygons

123

Page 20: A dissimilarity function for geospatial polygons

D. Joshi et al.

Table 5 Ranking of pair-wise distances between polygons

P1 P2 WB Rank CXY Rank PDF Rank WB–CXY WB–PDF CXY–PDF

1 2 1 1 1 0 0 0

3 4 2 4 3 2 1 1

0 5 3 5 4 2 1 1

0 4 4 3 2 1 2 1

3 5 5 2 7 3 2 5

4 5 6 6 6 0 0 0

0 3 7 12 5 5 2 7

2 3 8 7 14 1 6 7

2 5 9 8 10 1 1 2

1 4 10 9 8 1 2 1

2 4 11 10 9 1 2 1

1 3 12 11 15 1 3 4

1 5 13 13 13 0 0 0

0 2 14 14 12 0 2 2

0 1 15 15 11 0 4 4

Sum: 18 28 36

Average: 1.2 1.87 2.4

is 0. Moreover, it favors small attribute values and heavily penalizes the attributes with largevalues. This can be seen in the low distance computed for polygons with fewer liquor licenses,and a very large distance computed for the polygons with more liquor licenses.

Comparison based on range Next, we compared the results obtained using the WB andCXY distance functions with PDF. The range of the pair-wise distance computed by theWB distance function is 0.63, that of the CXY distance function is 0.82, and that of PDFis 0.95. The low range of the WB distance function will make it difficult to implementthis distance function where polygons need to be grouped based on pair-wise similarity.The CXY distance function which has the same structure PDF improves upon the range ofdistance values; however, PDF offers the best range of pair-wise distances between polygonsamong these distance functions. This makes it more suitable for clustering type applications.

Comparison based on ordering If we look at the ordering of the pair-wise distancescomputed using the three distance functions we find that while they all agree on the pair ofpolygons that are most similar to each other (Polygons 1 and 2—Rank 1), they do not agreeon the pair that are most dissimilar. The complete ranking of pair-wise distances between thepolygons is presented in Table 5. In this case, the WB distance function is more bent towardthe difference in the non-spatial attributes and as a result Polygons 3 and 4 get assigned ahigher rank (Rank 2—meaning they are more similar to each other) as compared to the CXYdistance function which assigns this pair a lower rank (Rank 4) because the spatial distancebetween them is greater. On the other hand, as PDF is more rounded, using more spatial andnon-spatial attributes, it assigns this pair a more appropriate rank (Rank 3) as these polygonsare very similar in their non-spatial attributes, and their geographic distance is also not as bigas between some other polygons.

Further, while the WB and CXY distance functions agree on the pair of polygons that arefurthest apart (Polygons 0 and 1), PDF assigns this pair a slightly higher rank and categorizes

123

Page 21: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

Fig. 7 Subset sample dataset 2, along with the pair-wise distances between the various polygons

the pair of polygons 1 and 3 as the furthest apart. This is because the attributes of spatialdistribution and spatial density also play an important role. Because of slight clustering withinthe distribution of liquor licenses in polygon 3, the extrinsic spatial distance between polygons1 and 3 and polygons 1 and 2 is greater than the extrinsic spatial distance between polygons 0and 1 and polygons 0 and 2. It is due to this factor that some large shifts in ranking occur withinPDF as compared to the WB and CXY distance functions. If we remove the pair of polygonswhich cause the biggest shifts, we can see that the average shift has dropped considerablyfor both the WB distance function versus PDF, and for the CXY distance function versusPDF. This suggests that we retain the properties of the WB distance function and the CXYdistance function for simple polygons, and we add to it, our additional considerations forcomplex polygons.

Comparison based on correlation The correlation between the WB distance function andthe CXY distance function is 0.88, between the WB distance function and PDF is 0.9, andbetween CXY distance function and PDF is 0.76. A low correlation between PDF and theCXY distance function despite the same structure of the distance functions is once againindicative of the importance of the other spatial attributes of the polygons. A high correlationbetween PDF and the WB distance function exists for this test dataset because both thedistance functions are basically the same in terms of how spatial and non-spatial attributesare combined. The main difference is that distribution & density are used in PDF, but notin the WB distance function. The distribution and density try to capture variations in spatialobjects within the polygons. Due to the normalization factor which is variable, and morelocalized, within the WB distance function, those variations were captured quite well forthe sample dataset considered. However, the WB distance function’s way of handling localvariations is limited. For example, consider the set of polygons (once again a subset of censusblock groups of city of Lincoln, NE, along with their liquor licenses) shown in Fig. 7. Whenthe pair-wise distance was computed for these polygons using the WB distance function andPDF, we found the correlation between these two distance functions dropped significantlyand was 0.68.

4.1.1 Comparative analysis between PDF, WB, and CXY using a watershed dataset

For this experiment, we have performed pair-wise comparison of watersheds within the stateof Nebraska to be used in water management applications where it is necessary to analyze thewatersheds based on the characteristics of the water bodies present within each watershed. Thedataset consists of 84 watersheds in all. These are shown in Fig. 8. In addition to the watershedboundaries, the dataset also contains the lakes present within each watershed. These are shownin Fig. 1 as well. Following the Cliff et al. [8] and Webster and Burrough [40] methodologies—

123

Page 22: A dissimilarity function for geospatial polygons

D. Joshi et al.

Fig. 8 Watersheds within the state of NE, USA along with the lakes present within each watershed

CXY and WB, respectively, they both guide us to consider the geographic distance betweenthe objects being compared and the non-spatial attributes of the objects. And the only non-spatial attribute present in this dataset is the number of lakes for each watershed. On theother hand, our proposed PDF methodology guides us to derive and incorporate additionalintrinsic and extrinsic spatial attributes for the watersheds, to complement their geographiclocation and number of lakes. For intrinsic spatial attributes, in addition to the location inthe form of a set of vertices for each watershed, we take into account the area covered byeach watershed. For the set of extrinsic attributes, the lakes are treated as areal objects withineach watershed, and we take into account the density, extent, and distribution of the lakeswithin each watershed. In order to make a fair comparison between the three functions—PDF, W&B, and CXY, we use the modified Hausdorff distance function to compute thegeographic distance between two watersheds. Furthermore, we assign an equal weight of 0.5to all attributes being used in the different distance functions.

In our experimental analysis, we computed pair-wise distance between all the 84 polygonsfor all three functions being compared, which resulted in 84 × 84 comparisons done threetimes. To summarize our results and give a sense of the performance of the three functions,we computed the average distance—based on each function—computed for each polygon toits k-nearest neighbors with k = 5. These results are shown in Fig. 2 in the form of a graphshowing the watershed IDs along the x-axis, and the normalized distance computed, rangingbetween 0 and 1 along the y-axis.

From Fig. 9, we can see that CXY reports a much smaller distance between all the water-sheds and their nearest neighbors respectively. Both PDF and WB result in greater distancesbetween the watersheds and their nearest neighbors, respectively. While CXY and PDF fol-low very similar patterns, W&B and PDF do not always follow a similar pattern. The reasonfor the similar patterns in CXY and PDF is because both the distance functions are a weightedsum of the attributes being considered, both the functions consider the geographic locationand the number of lakes within each watershed. The geographic distance combined withthe number of lakes dominate the attribute space for this dataset, as the other attributes aredependent on these two attributes and thus are the two most dominant attributes within thedataset. In addition, PDF also considers the extent, density, and distribution of the lakes withineach watershed. This consideration of additional attributes increases the distance betweenneighboring watersheds, thereby creating a greater disparity between various watersheds and

123

Page 23: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

Fig. 9 Comparison between PDF, W&B, and CXY using the average nearest neighbor distance with k = 5

Fig. 10 Watershed with ID 45and its nearest neighbors

in turn helping in easily identifying dissimilar watersheds as compared to their neighbors.For example, if we look at watershed 45, as per CXY the watershed is very similar to itsneighboring watersheds. PDF, on the other hand, results in a much higher value. In order toverify this result, we visually inspect watershed 45 and its neighboring watersheds (shownin Fig. 10). From Fig. 10, we can see that watershed 45 is indeed very different from itsneighbors in—(1) it is the smallest watershed, covering the least amount of area, (2) it hasthe fewest number of lakes, and finally, (3) the lakes in watershed 45 do not follow the samedistribution pattern as its neighbors. Similar observations can also be made for watershed 77,and watershed 15.

Upon comparing the PDF and WB function results in Fig. 9, we can see that for somewatersheds, PDF results in a greater distance as compared to WB, and for a few otherwatersheds W&B functions results in a greater distance. For example, as per PDF, watershed19 is most dissimilar from its nearest neighbors, while W&B and CXY do not mark it asthe most dissimilar watershed. To verify this result, we visually inspect watershed 19 and itsnearest neighbors (shown in Fig. 11a). From the Fig. 11a, we can see that watershed 19 isindeed very different from its nearest neighbors as (1) it is the largest watershed, coveringthe maximum area, (2) it has a significantly greater number of lakes that any of its nearest

123

Page 24: A dissimilarity function for geospatial polygons

D. Joshi et al.

Fig. 11 Selected watersheds with their nearest neighbors

neighbors, (3) there is a huge cluster of lakes within this watershed, and finally (4) the area ofthe watershed covered by lakes is the maximum as compared to its neighboring watershed.

Another example that we closely inspect is watershed 39. For this watershed, W&B returnsa greater distance to its nearest neighbors than PDF and CXY. To verify this result, we onceagain visually inspect watershed 39 and its nearest neighbors (shown in Fig. 11b). FromFig. 11b, we can see that watershed 39 is in fact not all that different from its neighbors as(1) it has similar size as compared to its nearest neighbors, (2) it has a comparable numberof lakes to some of its nearest neighbors, and finally (3) the lakes have similar distributionpatterns in watershed 39 and its nearest neighbors.

The last example that we look at is watershed 84. For this watershed, W&B returns agreater distance to its nearest neighbors than PDF and CXY. This watershed is a very smallwatershed within the state of NE as can be seen in Fig. 11c. Upon a closer inspection, we findthat this watershed is indeed very different from its neighboring watersheds. Thus, in thiscase, W&B results in a more accurate result as compared to PDF and W&B. This is becausethere is not sufficient information available for this watershed. There is no lake informationavailable for this watershed; hence, PDF is not able to accurately compare this watershedwith its neighboring watersheds.

4.2 Spatial clustering application

Our novel dissimilarity function, PDF, can be seamlessly integrated within any algorithm thatuses a distance measure in order to analyze polygons. Here, we demonstrate the applicationof PDF to the k-medoids clustering algorithm in order to perform polygon-based spatial

123

Page 25: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

Fig. 12 a Polygons (subset of watersheds in Nebraska) used to form a cluster, b polygons along with linearspatial objects

Fig. 13 a Polygons (subset of watersheds in Nebraska) used to form a cluster, b polygons along with arealspatial objects

clustering. In our clustering, no explicit cluster centre exists. The mean distance betweenpolygon Pi and all polygons within each cluster determines the membership of Pi . Thealgorithm terminates when it reaches the maximal number of iterations or the membership ofeach polygon no longer changes. In order to measure the validity of the clusters produced, weadapt the existing distance-based validity measure which is measured as a ratio of the inter-cluster distance versus the intra-cluster distance [38]. A good clustering result should have alow-intra-cluster distance and a high-inter-cluster distance. The intra-cluster distance is theaverage distance between each pair of polygons in a cluster, and the inter-cluster distanceis the average distance between each pair of polygons in two clusters. Thus, a high-qualitycluster is one whose validity index is large. The clustering result with the maximum validitymeasure gives us the optimal number of clusters.

We first give two examples that show the benefits of the addition of organizational (extrinsicspatial) attributes in computing the similarity between two polygons. The role of linear objectfeatures is illustrated by the following intuitive example where we analyze the clusteringresults. When the polygons shown in Fig. 12a are clustered into one cluster, the validityindex of the cluster is 76.92. When the linear feature that is shared by all the polygons isadded, as shown in Fig. 12b, the validity index increases to 111.11. This increase is directlyattributable to the addition of the linear feature which is shared by all the polygons presentin the cluster.

Another example shown in Fig. 13 illustrates the importance of using areal objects indetermining the dissimilarity of polygons. When polygons are grouped into one cluster(Fig. 13a), the validity index obtained is 83.33. After the addition of the lakes present withineach polygon, as shown in Fig. 13b, the validity index rises to 166.66. This increase in thevalidity index shows that the addition of areal objects makes the clusters more cohesive, i.e.,it increases the similarity within the polygons that belong to the same cluster.

123

Page 26: A dissimilarity function for geospatial polygons

D. Joshi et al.

Fig. 14 Dataset for the first clustering experiment—watersheds in the state of Nebraska along with selectedstreams and lakes used as spatial objects

4.2.1 Watershed analysis

The dataset comprises of 69 watersheds within the state of Nebraska (Fig. 14). A watershedis a geographic region draining into a river, river system, or other body of water. It is auseful areal unit of analysis for many applications including drought and water resourcemonitoring. The main goal for this set of experiment was to cluster together watershedsthat exhibit similar hydrological behavior and are spatially contiguous. Spatially contiguousclusters of watersheds are useful in many applications. For example, the improvement of theeconomic efficiency in the reduction of diffused water pollution rests on the identification andformation of homogenous groups of contiguous administrative units of a watershed. By jointlyimplementing pollution reduction measures, these homogenous groups are able to diminishnegative spillover effects and externalities. To identify homogenous groups of administrativeunits within this watershed cluster analysis methods are used. As the implementation of jointpollution mitigation measures is only sensible and manageable in contiguous areas, the spatialrelationship among administrative units is an essential variable for this cluster analysis [20].Data processing and feature selectionTable 6 lists the attributes that are used for clustering watersheds. There are over eight hundredhydrological observation stations including surface water stations, ground water stations, andweather stations. The measurements taken at the various stations—surface water, groundwater, and precipitation, cannot be— directly used in the clustering process. Therefore,taking the time series data collected from these stations, we found the correlation betweenwatersheds based on their respective surface water, ground water, and weather stations.

Figure 14 shows the watersheds in the state of Nebraska along with the various spatialobjects used in the process of clustering. The linear objects are the lines (rivers) goingacross several watersheds, and the areal objects are the polygons (lakes) present within thewatersheds.Clustering results for the watershed datasetIn order to observe the effects of the inclusion of different polygonal attributes in the clus-tering process, we conducted experiments by using different combinations of the attributesof the polygons/watersheds. When the watersheds are clustered using only their non-spatial

123

Page 27: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

Table 6 Attributes forwatersheds Non-spatial attributes Correlation between surface water

stations

Correlation between ground waterstations

Correlation between precipitationstations

Intrinsic spatial attributes Set of vertices of the watershed, elon-gation of the watershed, orientation ofthe watershed

Extrinsic spatial attributes

Point object attributes None

Linear object attributes Major streams

Areal object attributes Lakes

Fig. 15 Result of clustering watersheds with k = 3 (a,c,e) and k = 4 (b,d,f) using different combinations ofnon-spatial, structural and organizational attributes

attributes, the watersheds with similar correlation indexes are clustered together. Their loca-tion in space has no relation with the clustering process. Therefore, we get disjoint clusters inspace. In Fig. 15, the left side (a, c, e) shows the clustering result when k = 3 and the right side(b, d, f) shows the clustering results when k = 4. Clusters formed using only the non-spatialattributes (Fig. 15a, b) are widely dispersed in space. With the addition of organizationalattributes, the spatial organization of the watersheds has its affect in the form of the density

123

Page 28: A dissimilarity function for geospatial polygons

D. Joshi et al.

of lakes within the watersheds, and the location of the streams—within the watersheds orexterior—changes the similarity of the watersheds, and that in turn results in better qualityof clustering (Fig. 15c, d). If we compare Fig. 15a with c, we can see that the distributionand density of lakes have a significant impact on the clustering process. The watershedswith a high density of lakes clustered together belong to the same cluster in Fig. 15c whilethey were clustered into different clusters in Fig. 15a when the organizational attributes werenot taken into account. Finally, when we add the structural attributes, watersheds locatedadjacent to each other in space and sharing a boundary are clustered together (Fig. 15e, f)because the spatial structure of the polygons within the geographic space plays an importantrole in clustering adjacent watersheds together. Therefore, we not only get the clusters withthe highest quality, but they are spatially contiguous as well with the addition of the spatialstructure and organization of the watersheds along with their correlation indices.

Table 7 lists validity indexes for different combinations of k (number of clusters) anddifferent combinations of non-spatial, structural attributes, and organizational attributes usingthe polygonal data for the watersheds. In Table 7, the highest validity index is obtained whenk = 3 and all the three types of attributes, non-spatial, structural attributes, and organizationalattributes, are used for clustering. This suggests that the best quality clusters are formed whenthe number of clusters is equal to 3, and all the three categories are attributes are taken intoaccount. As for the remaining validity indexes, there is no visible pattern, and therefore, noclear conclusions can be made. Note: If the number of polygons within a cluster is less thantwo, then the validity index will be undefined.

Next, we also test the validity of our clustering results using the gap statistic [36] thatis used to discover the number of clusters that exist in the dataset. The gap statistic wascomputed using the gap function defined in the statistical package SAGx written in R. Theresults for the watershed dataset are shown in Table 8. For the watershed dataset k = 3,which matches the result of our validity index.

Table 7 Clustering results for watershed dataset

k Validity index, ηGDF

Non-spatial attributes only Non-spatial andextrinsic spatialattributes

Non-spatial andintrinsic spatialattributes

Non-spatial, intrinsicspatial and extrinsicspatial attributes

3 76.92 83.33 111.11 125.00

4 0.55 3.80 0.47 8.85

5 35.71 3.34 66.66 1.56

6 0.76 0.43 2.39 1.24

Table 8 Gap statistic results forthe watershed dataset

k Gap statistic

2 −0.28

3 −0.23

4 −0.47

5 −0.38

6 −0.31

123

Page 29: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

Fig. 16 Dataset for the second experiment—counties in the state of Nebraska along with the point and linearspatial objects

Table 9 Attributes for countiespolygons Non-spatial attributes Total population

Intrinsic spatial attributes Set of vertices of county polygons, area

Extrinsic spatial attributes

Point object attributes Towns

Linear object attributes Selected highways

Areal object attributes None

4.2.2 Grouping counties

For the second experiment, we have taken the 93 counties of the state of Nebraska (Fig. 16) asthe set of polygons. The motivation behind selecting this dataset is the fact that counties havebeen partitioned into clusters for a long time. There are several applications where countiesare divided into groups by the government for jurisdiction purposes, such as congressionaldistricts, natural resource districts that pertains to non-spatial and spatial attributes. The goalis once again to cluster counties that are similar to each other and are spatially contiguous.Spatially contiguous clusters of counties are important for applications involving resourcedistribution and allocation. For example, in resource allocation problems involving redistrict-ing, zones or districts need to be defined where each district is a spatially contiguous clusterof counties or census tracts or some other underlying spatial structure. If one tries to formthese districts using a traditional distance function such as Euclidean distance function orthe Manhattan distance function, without any additional constraints added to the clusteringprocess, the result would be districts that are spatially disjoint [26].Data processing and feature selectionEach county is represented by the set of attributes listed in Table 9. Figure 16 shows thecounties along with the organizational attributes used in clustering. For point objects, we usethe cities (towns). In order to show the effect of linear features, we selected three highways—State route 2, 6, and 20, running across the state.Clustering results for county datasetSeveral experiments were conducted using different combinations of the attributes of thepolygons. Our aim was to see the effect of the structural attributes and organizational attributes

123

Page 30: A dissimilarity function for geospatial polygons

D. Joshi et al.

Fig. 17 Result of clustering counties with k = 3 and k = 4 using different combinations of non-spatial,structural and organizational attributes

in clustering process. In Fig. 17, the left side (a, c, e) shows the clustering result when k = 3and the right side (b, d, f) shows the clustering results when k = 4. It is observed once againthat the clusters become more compact and spatially contiguous as we add more spatialfeatures (structural attributes and organizational attributes) to the clustering process. Theaddition of structural attributes plays a big role in producing contiguous clusters. A closerinspection shows that while clustering using non-spatial attributes only (Fig. 17a), whenk = 3, one of the clusters consists of only one county with the largest area. Therefore, areaturns out to be the dominating attribute in this clustering process. Using non-spatial attributesonly, when k = 4, we see that Douglas and Lancaster counties—the counties containing themost populated cities in Nebraska are clustered into one cluster, and no other county isclustered along with them. Cherry county—the county with the largest area also remains ina cluster of its own. This is clearly not the way we would want to cluster counties when wewant to partition them for any jurisdiction purpose where we would want a more uniformnumber of counties (or population) in each cluster.

With the addition of organizational attributes in the clustering process (Fig. 17c), we seethat the result of clustering improves—number of counties in the clusters is more even, andzero singleton clusters are produced. Even though, area still seems to be dominating theclustering of the largest counties together, the point object density and spatial distributionhas an even bigger impact as it breaks the previously large clusters into smaller compactclusters. The counties with a larger number of towns are now clustered together. As most ofthe towns are located along the highways, they have both an indirect and direct impact on

123

Page 31: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

Table 10 Clustering results for County Dataset

k Validity index, ηGDF

Non-spatial attributes only Non-spatial andextrinsic spatialattributes

Non-spatial andintrinsic spatialattributes

Non-spatial,intrinsic spatialand extrinsicspatial attributes

2 0.004 8.55 18.52 18.18

3 Undefined† 29.41 21.28 19.23

4 Undefined† 0.61 7.35 22.72

5 Undefined† 0.43 4.72 20.00

6 Undefined† Undefined† 16.39 16.39

† The validity index has a value of undefined when the number of polygons within a cluster is one

the process of clustering. They directly influence clustering since the highways are used aslinear spatial object attributes in clustering. They also indirectly influence the process sincethe locations of many of the towns are guided by highways. Finally, when we add structuralattributes to the clustering process, we get clusters that are a lot more uniform (Fig. 17e) withalmost the same number of counties in each cluster, and that are spatially contiguous. Thereason behind spatial contiguity is our boundary adjusted Hausdorff distance that also takesinto account the extent of boundary shared between two polygons. As a result, the distancebetween two polygons sharing a boundary reduces, and we get clusters of polygons locatedadjacent to each other in space. Table 10 lists validity indexes obtained for different valuesof k with different combinations of the attributes.

As can be seen from Table 10, using non-spatial attributes only produces clusters thathave a very low validity index, implying low quality clusters. As we add spatial attributes—structural attributes and organizational attributes, to the non-spatial attributes the validityindex increases significantly. When clustering is performed taking into account the non-spatialand organizational attributes, an increase in the validity index is observed. This suggests thatmore information about the polygons is needed to increase the quality of clustering. Whenclustering is performed using non-spatial and structural attributes, the validity index increasesfurther and we do not get any more bad clusters (with validity index of infinity). This impliesthat good quality clusters are formed using these two sets of attributes. However, whenclustering was performed using all three attribute sets—non-spatial, structural attributes,and organizational attributes, we found that the validity index increased even further, andthe largest validity index ever was produced for k = 4. This suggests that the best qualityclusters are formed using all the three types of attributes together.

To further evaluate our clustering results, we have applied the gap statistic to the countydataset to discover the number of clusters that exist in the dataset. The results of gap statisticare shown in Table 11. For the county dataset k = 4, which once again matches the result ofour validity index.

4.2.3 Summary

Our experiments in this section serve a dual purpose. First, we show the effect of varyingthe weights for different attributes of the polygons, and the resultant effect on the com-puted pair-wise dissimilarity. Secondly, based on our experimental analysis on the clus-

123

Page 32: A dissimilarity function for geospatial polygons

D. Joshi et al.

Table 11 Gap statistic results forthe county dataset

k Gap statistic

3 0.13

4 0.16

5 0.13

6 0.26

tering application, we have observed that with the addition of the spatial attributes—boththe structural and organizational attributes—within the dissimilarity function allows fora more accurate comparison of the polygons, and helps us achieve our goal of formingspatially contiguous and compact clusters using the k-medoids clustering algorithm. Forexample, when clustering is based upon non-spatial attributes alone, the result is spatiallydisjoint clusters (Figs. 15a, b, 17a, b). Therefore, the effect of spatial autocorrelation islost. The addition of spatial attributes produces clusters that are spatially contiguous andcompact (Figs. 15e, f, 17e, f). Thus, we have shown that even though k-medoids is nota favored algorithm in this domain, with the use of the appropriate distance/dissimilarityfunction, the desired results can be produced using any distance-based algorithm such ask-medoids.

Furthermore, we have demonstrated the robustness of our dissimilarity function by apply-ing it on two real and yet spatially different datasets—the watershed dataset, and the countydataset. The watershed dataset had spatial attributes resulting primarily from natural factors.The county dataset, on the other hand, is derived from human decisions. Due to the underly-ing similarity within the behavior of every watershed, there do not exist any well-separatedclusters. As a result, all the gap values for the watershed dataset in Table 7 are below zero.On the other hand, for the county dataset, since each county has distinct dissimilarities withrespect to number of towns and population, there exist well-separated clusters. As a result, allthe gap values in Table 10 are above zero. However, as can be observed from Tables 6 and 9,using our adapted validity index, we have clear notion on the improvement of the clusteringquality with the addition of the spatial attributes. For example, in the watershed dataset, whenk = 3, and clustering is performed using only the non-spatial attributes, the validity indexis 76.92. With the addition of the organizational attributes to the non-spatial attributes, thevalidity index now increases to 83.33. When, on the other hand, the structural attributes wereadded to the non-spatial attributes, the validity index increases by a bigger margin and isnow at 111.11. Further improvement is observed in the validity index of the clusters when allthree types of attributes were used within the dissimilarity function, and the index has nowincreased to 125.00. This result matches with the visual inspection of the clusters, and theclusters become more spatially contiguous and compact in Fig. 10e. A similar result is alsoobserved in the county dataset experiment.

5 Conclusions and future work

While dissimilarity functions play a central role in many applications/domains, such functionsin the context of polygons are not well studied. In such domains, ignoring the spatial aspectsin dissimilarity is neither correct nor does it lead to accurate results. In this research, wehave developed a new similarity/dissimilarity measure for polygons known as the PDF thatintegrates the non-spatial attributes of the polygon with the spatial attributes encoded in theform of the structural and organizational information. In order to incorporate these properties,

123

Page 33: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

the polygons are represented using three sets of attributes: non-spatial attributes, intrinsicspatial attributes, and extrinsic spatial attributes. In the process of defining our dissimilarityfunction, we have not only addressed the unary properties of the polygons, but also theirbinary properties, by taking into account the linear and areal features that may be shared bymultiple polygons.

We have shown that our proposed PDF performs better than the other distance functionsproposed in the literature as it allows the incorporation of additional information about thepolygon structure and other spatial attributes. It reduces the distance between the polygonsthat are most similar and increases the distance between the polygons that are most dissimilar.It takes a balanced yet flexible approach to incorporate the influence of spatial and non-spatial aspects. Our comparative analysis demonstrates that we retain the properties of thepreviously used well-established distance functions such as the WB distance function and theCXY distance function for simple polygons yet is able to include additional considerationsfor complex polygons.

In essence, we have proposed a framework to include all the attributes of the polygons inorder to measure the similarity/dissimilarity between polygons. This framework will allowus to administer a systematic approach to experimenting with different combinations ofattributes by giving the freedom to the user to change the weights of the different parts of thedissimilarity function. Further, by taking into account computational complexity, with sucha framework, whether and how to develop an automated system for carrying out the cluster-ing task or identifying the appropriate attributes for clustering can be informed. Moreover,domain knowledge and human expertise can be factored into the framework, for example,to determine the relative weights of the various attributes. Thus, with this framework, wealso aim to lend more structure to the search process for the correct combination, informalgorithm developers of computational complexity for automation, and better represent andengineer domain knowledge and human expertise. Our novel dissimilarity function can thusbe seamlessly integrated into any polygon analysis algorithm that uses a distance functionwithout added computational complexity.

While we have developed a dissimilarity function that takes the spatial analysis of poly-gons a step further, there are many open questions that have not been addressed by ourwork. Additional distance measures in the object space as well as more complex form ofthe combination function that integrates the distances in the three spaces can be explored.One could design a distance measure for linear objects using their density and a distancemeasure for areal objects that takes into account the amount of sharing, and topology. Ourapproach can be extended to fuzzy algorithms that will also have use in many geospatialapplications. Another direction is to include nominal, categorical, and ordinal attributes inour dissimilarity function to simplify our methodology. We plan to investigate how to flexiblymap the differences in values for these different attributes onto a comparable scale. In addi-tion, we will work to develop a semi-automatic approach to assign weights to the differentcomponents of our dissimilarity function more systematically. In particular, we will lookinto nonlinear strategies for weight assignment. Finally, we will also be working on morecomprehensive experiments in order to study the effect of different weight assignments tothe various attributes used within PDF.

Acknowledgments This material is based upon work supported by the National Science Foundation underGrants No. 0219970 and 0535255. We would like to thank David Marx for his valuable insight and feedback.We would also like to extend our thanks to Lei Fu, Bill Waltman, and Tao Hong for their assistance in dataprocessing and preparation for this research.

123

Page 34: A dissimilarity function for geospatial polygons

D. Joshi et al.

References

1. Arkhangel’skii AV, Pontryagin LS (1990) General topology I: basic concepts and constructions dimensiontheory. Encyclopaedia of Mathematical Sciences, Springer, Berlin

2. Balcan M-F, Blum A, Vempala S (2009) Clustering via similarity functions: theoretical foundations andalgorithms, Vol. 5. Citeseer, Available from http://citeseerx.ist.psu.edu

3. Bilenko M (2003) Learnable similarity functions and their applications to record linkage and clustering,appears as technical report, UT-AI-TR-03-305

4. Bonchi F, Gionis A, Ukkonen A (2011) Overlapping correlation clustering. In: 2011 IEEE 11th interna-tional conference on data mining (ICDM), Barcelona, Spain, pp 51–60.

5. Buchin K, Buchin M, Wenk C (2006) Computing the Fréchet distance between simple polygons inpolynomial time. In Proceedings of the Twenty-Second Annual Symposium on Computational Geometry(Sedona, Arizona, USA, June 05–07 2006) SCG ’06. ACM, New York, NY, pp 80–87

6. Chen Y, Garcia EK, Gupta MR, Rahimi A, Cazzanti L (2009) Similarity-based classification: conceptsand algorithms. J Mach Learn Res 10:747–776

7. Clark PJ, Evans FC (1954) Distance to nearest neighbor as a measure of spatial relationships in populations.Ecology 35(4):445–453

8. Cliff AD et al (1975) Elements of spatial structure. A quantitative approach. Cambridge University Press,Cambridge

9. Dobkin DP, Kirkpatrick DG (1985) A Linear algorithm for determining the separation of convex polyhe-dra. J Algorithm 6:381–392

10. Donnelly KP (1978) Simulations to determine the variance and edge effect of total nearest neighbourhooddistance. In: Hodder I (ed) Simulation methods in archeology. Cambridge University Press, Cambridge,pp 91–95

11. Egenhofer MJ, Sharma J, Mark DM (1993) A critical comparison of the 4-intersection and 9-intersectionmodels for spatial relations: formal analysis. Auto-Carto 11:1–11

12. Egenhofer MJ, Franzosa R (1994) On the equivalence of topological relations. Int J Geogr Inf Syst9(2):133–152

13. Egenhofer MJ, Mark DM (1995) Modeling conceptual neighborhoods of topological line-region relations.Int J Geogr Inf Syst 9(5):555–565

14. Egenhofer MJ, Clementini E, Felice PD (1994) Topological relations between regions with holes. Int JGeogr Inf Syst 129–144

15. Ester M, Frommelt A, Kriegel H-P, Sander J (1998) Algorithms for characterization and trend detectionin spatial databases. Published in proceedings of 4th international conference on knowledge discoveryand data mining (KDD-98)

16. Furlong K, Gleditsch NP (2003) Geographic opportunity and neomalthusian willingness: boundaries,shared rivers, and conflict. The Joint Sessions of Workshops European Consortium for Political Research,Edinburgh, UK

17. Gardoll SJ, Groves DI, Knox-Robinson CM, Yun GY, Elliott N (2000) Developing the tools for geologicalshape analysis, with regional- to local-scale examples from the Kalgoorlie Terrane of Western Australia.Aust J Earth Sci 47(5):943–953

18. Han J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann Publishers, SanFransisco

19. Huang H, He Q, Chiew K, Qian F, Ma L (2012) CLOVER: a faster prior-free approach to rare-categorydetection. Knowl Inf Syst doi:10.1007/s10115-012-0530-9

20. Huchtemann D, Frondel M (2010) Increasing the efficiency of transboundary water management: aregionalization approach. J Water Resource Prot 2(6):501–506

21. Jiao L, Liu Y (2008) Knowledge discovery by spatial clustering based on self-organizing feature mapand a composite distance measure. The International Archives of the Photogrammetry, Remote Sensingand Spatial Information Sciences. Vol. XXXVII. Part B2

22. Joshi D, Soh L, Samal AK (to appear). Spatio-Temporal Polygonal Clustering with Space and Time asFirst-Class Citizens, accepted to Geoinformatica

23. Joshi D, Soh L, Samal AK (to appear) Redistricting Using Constrained Polygonal Clustering, acceptedto IEEE Transactions on Knowledge and Data Engineering

24. Joshi D, Samal AK, Soh L (2009a) Density-based clustering of polygons. IEEE symposium series oncomputational intelligence and data mining, Nashville, TN, pp 171–178

25. Joshi D, Samal AK, Soh L (2009b) A dissimilarity function for clustering geospatial polygons. In: 17thinternational conference on advances in geographic information systems (ACM SIGSPATIAL GIS 2009).Seattle, WA, pp 384–387

123

Page 35: A dissimilarity function for geospatial polygons

A dissimilarity function for geospatial polygons

26. Joshi D, Soh L, Samal AK (2009c) Redistricting using heuristic-based polygonal clustering. In: IEEEInternational Conference on Data Mining. Miami, FL, pp 830–836

27. Lance GN, Williams WT (1967) Mixed-data classificatory programs I. Agglomerative systems. AustComput J 1:15–20

28. Luxberg UV (2004) Statistical Learning with similarity and dissimilarity functions. Ph. D. thesis, Tech-nische. Universität Berlin, Berlin

29. Mandl T (1998) Learning similarity functions in information retrieval. Zimmermann, Hans-Jürgen (ed.):EUFIT ‘98. 6th European congress on intelligent techniques and soft computing. Aachen, Germany,8.-10.9.1998. S. 771–775

30. Morvant E, Habrard A, Ayache S (2012) Parsimonious unsupervised and semi-supervised domain adapta-tion with good similarity functions, Knowledge and Information Systems (5 July 2012), pp 1–41. doi:10.1007/s10115-012-0516-7

31. Perruchet C (1983) Constrained agglomerative hierarchical classification. Patt Recognit 16(2):213–21732. Rao A, Srinivas V (2005) Regionalization of watersheds by hybrid-cluster analysis. J Hydrol 318(1):37–5633. Rakthanmanon T, Keogh EJ, Lonardi S, Evans S (2012) MDL-based time series clustering. Knowl Inf

Syst. doi:10.1007/s10115-012-0508-734. Rote G (1991) Computing the minimum Hausdorff distance between two point sets on a line under

translation. Inf Process Lett 38(3):123–12735. Shapiro L, Stockman G (2001) Computer Vision. Prentice Hall, Englewood Cliffs36. Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a dataset via the Gap

Statistic. J Royal Stat Soc (Series B), 63(2):411–42337. Tombari F, Mattoccia S, Stefano LD (2009) Full-search-equivalent pattern matching with incremental

dissimilarity approximations. IEEE Trans Patt Anal Mach Intell 31(1):129–14138. Turi R, Ray S (1998) K -means clustering for colour image segmentation with automatic detection of

k. In: Proceedings of international conference on signal and image processing. Las Vegas, Nevada,pp 345–349

39. Wang L, Yang C, Feng J (2007) On learning with dissimilarity functions. In: Proceedings of the 24thinternational conference on machine learning. Corvallis, OR

40. Webster R, Burrough PA (1972) Computer-based soil mapping of small areas from sample data. J SoilSci 23(2):210–234

Author Biographies

Deepti Joshi received the Bachelor of Art in English Literature fromthe Delhi University, Masters of Science in Applied Computer Sci-ence from the Northwest Missouri State University, and a Ph.D. degreein Computer Science from the University of Nebraska-Lincoln. Sheis currently an assistant professor at The Citadel. Her research inter-ests include geospatial computing, data mining, and Volunteered Geo-graphic Information systems.

123

Page 36: A dissimilarity function for geospatial polygons

D. Joshi et al.

Leen-Kiat Soh received the B.Sc. (with highest distinction), M.Sc.,and Ph.D. (with honors) degrees in electrical engineering from theUniversity of Kansas. He is currently an associate professor in theDepartment of Computer Science and Engineering at the Universityof Nebraska. His primary research interests are in multiagent systemsand intelligent agents, especially in coalition formation and multia-gent learning. He has applied his research to computer-aided education,intelligent decision support, and distributed GIS. He is a member of theIEEE, the ACM, and the AAAI

Ashok Samal received the Bachelor of Technology degree in com-puter science from the Indian Institute of Technology, Kanpur, India,in 1983, and the Ph.D. degree in computer science from the Univer-sity of Utah, Salt Lake City. He is a Professor with the Department ofComputer Science and Engineering at the University of Nebraska. Hiscurrent research interests include geospatial computing, data mining,image analysis, and computer science education.

123