Feature Layer Similarity Analysis

8/9/2019 Feature Layer Similarity Analysis


Overview

In order to process feature layers for predictive analysis, it is advantageous to remove layers that function as homologs (duplicate layers). This concept is much more complex than it appears, in that the representation of spatial information can take many forms, at many resolutions and precisions.

The Problem

Having multiple layers that fundamentally capture the same characteristics in space (or spatially linked characteristics) will provide a harmonic reinforcement to the output of any analytical process that operates over the similar layers. This reinforcement of output signals can contribute to a false elevation of output values in areas of redundant signals while masking subtle features in discrete and distinct areas.

Similarity of layers is composed of multiple concepts:

1. Spatial overlap: In order for 2 layers to be similar, they must cover a common area of space

2. Distribution: If the spatial distribution of the data is dissimilar between the 2 layers, they are not truly similar

3. Generalization: If the spatial data is of differing geometry types or resolutions, the layers may still be similar

Each of these concepts can be elaborated along the 2 primary facets of similarity relevant to predictive analysis. The first facet is the similarity of the layers at the global level. The second facet applies when the analysis is performed using a grid for analytical resolution of output; it is based on the resolution of the grid and the similarity of the layers at the cellular level of the grid.

Spatial Overlap

If the layers do not overlap at all, they cannot be similar; however, they may map the same category of data in disjoint areas. The mapping of similar categorical data in disjoint areas cannot be resolved using geometrical means and will not be addressed any further in this document.

If the layers share a common area, but the extent of each layer has some portion that is distinct to itself and not shared with the other layer, then the layers can only be similar within the area they have in common. The distinct areas of each layer may, however, hold only a small portion of the data contained within that layer, so the layers must still be evaluated in their entirety unless a threshold percentage (representative sample) can be determined, relative to the whole layer, that would make layer similarity impossible at the finer level of per-feature analysis.



If the layers have spatial domains such that the extent of one layer is entirely within the extent of the other, then it is possible that one layer is a proper subset of the other in terms of spatial similarity. This provides different opportunities for comparison:

1. If the spatially smaller layer is similar to the larger layer overall (i.e. the actual data in the larger layer is predominantly contained within the extent of the smaller layer), then the layers are wholly similar.

2. If a significant proportion of the data in the spatially larger layer lies outside the extent of the smaller layer, enough to preclude similarity at the overall level, then the layers cannot be wholly similar.

3. If the spatially smaller layer is similar to the larger layer within the bounds of the spatially smaller layer, then the layers are locally similar at the extent of the smaller layer; that is, the smaller layer is a similar subset of the larger layer. This may permit the exclusion of one of the layers for the constrained region.
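The extent relationships discussed in this section (disjoint, partially overlapping, one extent within the other) can be sketched as a simple bounding-box classification. This is a minimal sketch, assuming each layer's extent is an axis-aligned rectangle; the `Extent` type and the `classify_extents` function are hypothetical names introduced for illustration, and extents that merely touch along an edge are treated here as overlapping.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    """Axis-aligned bounding box of a layer (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def contains(outer: Extent, inner: Extent) -> bool:
    """True when `inner` lies entirely within `outer`."""
    return (outer.xmin <= inner.xmin and outer.ymin <= inner.ymin
            and outer.xmax >= inner.xmax and outer.ymax >= inner.ymax)


def classify_extents(a: Extent, b: Extent) -> str:
    """Return 'disjoint', 'equal', 'a_within_b', 'b_within_a' or 'partial'."""
    if a.xmax < b.xmin or b.xmax < a.xmin or a.ymax < b.ymin or b.ymax < a.ymin:
        return "disjoint"      # no common area: the layers cannot be similar
    if a == b:
        return "equal"
    if contains(b, a):
        return "a_within_b"    # a may be a similar subset of b
    if contains(a, b):
        return "b_within_a"    # b may be a similar subset of a
    return "partial"           # similarity is limited to the common area
```

A result of 'a_within_b' or 'b_within_a' only indicates that a subset relationship is possible; the feature-level comparison still decides whether the layers are wholly or locally similar.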

For grid-based analysis, if the common extent of the layers is greater than or equal to the extent of the grid, then local similarity is all that is important. Similarly, if the overlap of the layer extents intersects the extent of the grid, similarity may be evaluated within the extent of the grid alone, irrespective of the extents of the layers themselves.
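One way to read the grid constraint is as a comparison window: the area shared by both layers, clipped to the grid. The sketch below assumes rectangular extents given as `(xmin, ymin, xmax, ymax)` tuples; `intersect` and `comparison_window` are hypothetical helper names, and an overlap of zero area is treated as no overlap.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the rectangle shared by a and b, or None if they share no area."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    if xmin >= xmax or ymin >= ymax:
        return None
    return (xmin, ymin, xmax, ymax)


def comparison_window(layer_a: Box, layer_b: Box, grid: Box) -> Optional[Box]:
    """Area in which similarity would be evaluated for a grid-based analysis."""
    common = intersect(layer_a, layer_b)
    if common is None:
        return None
    return intersect(common, grid)
```

When the window equals the grid extent, the common extent of the layers covers the grid and only local similarity within the grid matters.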

Distribution

Given that the layers of interest have a common extent, it is possible that the layers are

not similar in the distribution of the features within them. This is the concept of the

spatial distribution of features, where the count of features is potentially different

between the layers, but the distribution of the features is similar in space.

If a layer's features are all proximal to features in the other layer, the layers may be similar; however, if a significant proportion of the features in a layer are disjoint from any feature in the other layer, the layers cannot be wholly similar. The definition of proximal is based upon the resolution of the data itself, whereas the significant proportion is based on a more subjective target threshold.
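The proximity test can be illustrated for the simplest case of 2 point layers. This is a minimal sketch, assuming features are `(x, y)` tuples in a planar coordinate system and that "proximal" means closer than a chosen radius; `proximal_fraction` is a hypothetical name, and the radius and the acceptable fraction are the thresholds the text leaves to the analyst.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def proximal_fraction(layer: Sequence[Point], other: Sequence[Point],
                      radius: float) -> float:
    """Fraction of features in `layer` with at least one feature of `other`
    closer than `radius`."""
    if not layer:
        return 0.0
    near = sum(1 for p in layer
               if any(math.dist(p, q) < radius for q in other))
    return near / len(layer)
```

Evaluating the fraction in both directions shows whether a significant share of either layer is disjoint from the other, even when the feature counts differ.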

In the grid-based analytical model, the distribution of features is where similarity differs most significantly from the similarity of data layers at the global level. In the grid-based approach, the distribution of geometries within the layers is based upon cellular membership, so layers that are generally dissimilar may be identical on a cellular level, depending on the resolution and extent of the analysis grid.
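Cellular membership can be sketched by mapping each feature to the grid cell that contains it and comparing the sets of occupied cells. The sketch assumes point features and a square grid defined by an origin and a cell size; the function names are hypothetical, and the use of a Jaccard ratio (shared cells divided by all occupied cells) as the cell-level score is an assumption of this example, not something the document prescribes.

```python
import math
from typing import Iterable, Set, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def occupied_cells(points: Iterable[Point], origin: Point,
                   cell_size: float) -> Set[Cell]:
    """Set of (column, row) grid cells that contain at least one point."""
    ox, oy = origin
    return {(math.floor((x - ox) / cell_size), math.floor((y - oy) / cell_size))
            for x, y in points}


def cell_similarity(a: Iterable[Point], b: Iterable[Point], origin: Point,
                    cell_size: float) -> float:
    """Jaccard ratio of occupied cells (assumed metric): 1.0 means identical."""
    cells_a = occupied_cells(a, origin, cell_size)
    cells_b = occupied_cells(b, origin, cell_size)
    union = cells_a | cells_b
    return len(cells_a & cells_b) / len(union) if union else 1.0
```

With layers `[(0.1, 0.1), (0.2, 0.3), (1.5, 1.5)]` and `[(0.9, 0.9), (1.1, 1.9)]`, a cell size of 1 makes the layers identical at the cellular level, while a cell size of 0.5 leaves them with no occupied cells in common, which is the resolution dependence described above.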

Generalization

Layers often represent spatially linked or spatially identical features in different ways, either to capture varying resolutions of data or to serve different purposes. As an example, buildings may be represented as points in one layer and as polygon boundaries in another to provide different data resolutions. Additionally, at an even coarser resolution, a cluster of buildings may be represented as a single point. As a second example, rivers may be managed as high-resolution polygons following the waterline along the course in one layer, and as coarse polygons representing the maximum flood line in another layer. These layers represent the same basic data at differing resolutions or for different purposes.

Determining similarity for these types of relationships is the core challenge of this process. This is accomplished at the feature level and, because each feature in one layer is compared to each feature in the other, requires at least n*m comparisons, where n and m are the numbers of records in the 2 respective layers to be compared.

The methodology for comparison is based upon the incident geometry types of the

corresponding layers, which may be any combination of:

Point - simple 2d locations, i.e. lat lon

Line - 2d linear features represented as a group of points connected by line segments

Polygon - 2d areas defined by a line that defines the boundary of the area and any holes within that area

Comparison methodologies

Point Point

Comparing 2 point layers requires a threshold distance (radius) for linkage and a percentage linkage to define similarity. Each point in the first layer is compared to each point in the second layer by distance. A count of feature pairs that are closer than the threshold distance is built in the process. If this count is greater than the percentage threshold of the features in each layer, then the layers are similar. If the count is greater than the threshold for only a single layer, then there is a directional similarity, i.e. A is similar to B but B is not similar to A.

\[
\mathrm{link}(x_a, x_b) =
\begin{cases}
1 & \text{if } D(x_a, x_b) < D_t \\
0 & \text{otherwise}
\end{cases}
\]

Note that the count of linked pairs may be greater than the total number of points

in either layer, if multiple points link to multiple points. If this behavior is not

desired, a secondary methodology can be added to stop after the first link is

found. If completeness is desired, the secondary methodology would increase

runtime to a worst-case of 2(n*m).
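The point-to-point comparison can be sketched as follows. This is an illustrative sketch, assuming points are `(x, y)` tuples in a planar coordinate system and that the percentage threshold is given as a fraction between 0 and 1; the function names and the `first_link_only` flag are hypothetical. The default path counts every linked pair, and the flag switches to the stop-after-first-link variant.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def count_links(a: Sequence[Point], b: Sequence[Point], d_t: float) -> int:
    """Count every (a, b) pair closer than d_t; may exceed len(a) or len(b)."""
    return sum(1 for p in a for q in b if math.dist(p, q) < d_t)


def count_linked_features(layer: Sequence[Point], other: Sequence[Point],
                          d_t: float) -> int:
    """Count features in `layer` having at least one link; `any` stops
    scanning `other` as soon as the first link is found."""
    return sum(1 for p in layer if any(math.dist(p, q) < d_t for q in other))


def point_layers_similar(a: Sequence[Point], b: Sequence[Point], d_t: float,
                         pct: float,
                         first_link_only: bool = False) -> Tuple[bool, bool]:
    """Return (a is similar to b, b is similar to a)."""
    if first_link_only:
        count_a = count_linked_features(a, b, d_t)
        count_b = count_linked_features(b, a, d_t)
    else:
        count_a = count_b = count_links(a, b, d_t)
    return count_a > pct * len(a), count_b > pct * len(b)
```

A result of `(True, False)` corresponds to the directional case: A is similar to B, but B is not similar to A.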

Point Line

To compare lines to points, a threshold distance (radius) for linkage and a percentage to define similarity is required. Each point is compared to each line by distance. A count of feature pairs that are closer than the threshold distance is built in the process. If this count is greater than the percentage threshold of the features in each layer, then the layers are similar.

\[
\mathrm{link}(p_a, l_b) =
\begin{cases}
1 & \text{if } D(p_a, l_b) < D_t \\
0 & \text{otherwise}
\end{cases}
\]



A single point may link to multiple lines, and a line may link to multiple points.

If either behavior is not desired, secondary methodologies can be added, at

increased runtimes (2(n*m) worst case).
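The distance from a point to a line is the smallest distance from the point to any of the line's segments. The sketch below assumes planar coordinates and lines given as lists of at least 2 `(x, y)` vertices; all function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def point_segment_distance(p: Point, s0: Point, s1: Point) -> float:
    """Shortest distance from p to the segment s0-s1."""
    dx, dy = s1[0] - s0[0], s1[1] - s0[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: both ends coincide
        return math.hypot(p[0] - s0[0], p[1] - s0[1])
    # Project p onto the segment and clamp the projection to its ends.
    t = ((p[0] - s0[0]) * dx + (p[1] - s0[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (s0[0] + t * dx), p[1] - (s0[1] + t * dy))


def point_line_distance(p: Point, line: Sequence[Point]) -> float:
    """Shortest distance from p to a polyline with at least 2 vertices."""
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))


def count_point_line_links(points: Sequence[Point],
                           lines: Sequence[Sequence[Point]],
                           d_t: float) -> int:
    """Count every (point, line) pair closer than the threshold distance."""
    return sum(1 for p in points for line in lines
               if point_line_distance(p, line) < d_t)
```

As with the point comparison, a single point near a crossing of 2 lines contributes 2 links to the count.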

Point Polygon

Comparing points to polygons is a 2-fold process in terms of the nature of the compare. If a point is inside a polygon, it is linked by inclusion; if a point is not inside, but is instead within the threshold distance of the polygon, it is linked by distance.

For this methodology a threshold distance and 2 percentage thresholds are

required. The first percentage threshold maps to the percentage of points that are

contained by polygons to define layers as similar. The second percentage

threshold maps to the percentage of points that must be contained or within the

threshold distance of polygons to define layers as similar.

Currently, there is no metric that compares the proportion of polygons that are near to points. If that is desirable, it can be added as a secondary methodology at no additional runtime cost.
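The 2 linkage types can be sketched with a ray-casting inclusion test and a boundary distance test. This sketch assumes simple polygons without holes, each given as a list of `(x, y)` vertices in a planar coordinate system; the function names are hypothetical. It returns the 2 proportions that the 2 percentage thresholds would be applied to.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Ring = Sequence[Point]


def point_segment_distance(p: Point, s0: Point, s1: Point) -> float:
    """Shortest distance from p to the segment s0-s1."""
    dx, dy = s1[0] - s0[0], s1[1] - s0[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(p[0] - s0[0], p[1] - s0[1])
    t = max(0.0, min(1.0, ((p[0] - s0[0]) * dx + (p[1] - s0[1]) * dy) / seg_len_sq))
    return math.hypot(p[0] - (s0[0] + t * dx), p[1] - (s0[1] + t * dy))


def point_in_polygon(p: Point, ring: Ring) -> bool:
    """Ray casting: count boundary crossings of a ray running in +x from p."""
    x, y = p
    inside = False
    for i in range(len(ring)):
        (x0, y0), (x1, y1) = ring[i], ring[(i + 1) % len(ring)]
        if (y0 > y) != (y1 > y):  # edge spans the ray's y coordinate
            if x < x0 + (y - y0) * (x1 - x0) / (y1 - y0):
                inside = not inside
    return inside


def boundary_distance(p: Point, ring: Ring) -> float:
    """Shortest distance from p to the polygon boundary."""
    return min(point_segment_distance(p, ring[i], ring[(i + 1) % len(ring)])
               for i in range(len(ring)))


def point_polygon_proportions(points: Sequence[Point], polygons: Sequence[Ring],
                              d_t: float) -> Tuple[float, float]:
    """Return (share linked by inclusion, share linked by inclusion or distance)."""
    if not points:
        return 0.0, 0.0
    contained = linked = 0
    for p in points:
        if any(point_in_polygon(p, ring) for ring in polygons):
            contained += 1
            linked += 1
        elif any(boundary_distance(p, ring) < d_t for ring in polygons):
            linked += 1
    return contained / len(points), linked / len(points)
```

Each proportion would then be compared against its own percentage threshold to decide whether the layers are similar.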

Line Line

Comparing lines to lines requires a distance threshold, a length proportion similarity threshold and a proportion of features to classify as similar. The methodology compares each line in one layer to every line in the other by buffering the first line by the threshold distance and comparing the length of the intersection of the second line with that buffer to the length of the second line. If this proportion is greater than the threshold proportion, the lines are considered similar. If the length proportion is less than the threshold, then the reverse is performed, buffering the second line and comparing the intersection against the length of the first line.

This methodology is partially incomplete at present as it does not account for

fractioned lines (broken lines). Also, shortened lines may be given directional

preference depending on the comparison. A bidirectional compare could be

performed at each feature pair to ensure that both intersecting proportions meet

the threshold constraint.
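Without a geometry library, the buffer-and-intersect step can be approximated by sampling: a point on one line lies inside the other line's buffer exactly when it is within the threshold distance of that line. The sketch below substitutes this sampling approximation for an exact buffer intersection, so the proportion is only as accurate as the sample spacing. It assumes planar coordinates and lines with at least 2 vertices; all names and the default `step` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Line = Sequence[Point]


def point_line_distance(p: Point, line: Line) -> float:
    """Shortest distance from p to a polyline."""
    best = math.inf
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        dx, dy = x1 - x0, y1 - y0
        len_sq = dx * dx + dy * dy
        t = 0.0 if len_sq == 0.0 else max(
            0.0, min(1.0, ((p[0] - x0) * dx + (p[1] - y0) * dy) / len_sq))
        best = min(best, math.hypot(p[0] - (x0 + t * dx), p[1] - (y0 + t * dy)))
    return best


def sample_line(line: Line, step: float) -> List[Point]:
    """Points spaced at most `step` apart along the polyline."""
    samples: List[Point] = []
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        n = max(1, math.ceil(math.hypot(x1 - x0, y1 - y0) / step))
        samples.extend((x0 + i / n * (x1 - x0), y0 + i / n * (y1 - y0))
                       for i in range(n))
    samples.append(line[-1])
    return samples


def fraction_within(line: Line, other: Line, d_t: float, step: float) -> float:
    """Approximate share of `line` lying inside the d_t buffer of `other`."""
    pts = sample_line(line, step)
    return sum(1 for p in pts if point_line_distance(p, other) <= d_t) / len(pts)


def lines_similar(a: Line, b: Line, d_t: float, min_fraction: float,
                  step: float = 0.1, bidirectional: bool = False) -> bool:
    """Forward compare, then reverse; `bidirectional` requires both to pass."""
    forward = fraction_within(b, a, d_t, step) > min_fraction
    reverse = fraction_within(a, b, d_t, step) > min_fraction
    return (forward and reverse) if bidirectional else (forward or reverse)
```

A short line lying along a long one passes the one-directional test but fails when `bidirectional=True`, which illustrates the directional preference for shortened lines noted above.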

Line Polygon

To compare lines to polygons, each polygon is reduced to its perimeter line and treated from there as a line. For details on this comparison, see the Line Line section above.

Polygon Polygon

Comparing polygon layers requires the definition of a proportion overlap threshold and a percentage of features to classify as similar. The comparison is performed by comparing each feature in one layer to every feature in the other. The area of the intersection of the 2 polygons is compared to the area of each of the polygons. If the ratio of the intersection area to the original polygon area is greater than the threshold proportion for both polygons, the polygons are considered similar. If the total count of similar polygon pairs is greater than the threshold percentage of features, the layers are considered similar.

In this methodology, it is possible for the count of similar pairs to exceed the count of features in either layer, as all possible pairs are enumerated. If this is not acceptable, a secondary methodology can be employed to stop compares after the first match.
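The pairwise polygon test can be sketched with the shoelace formula for area and Sutherland-Hodgman clipping for the intersection. Note the restriction: this sketch assumes convex polygons without holes, given as lists of `(x, y)` vertices in a planar coordinate system, because Sutherland-Hodgman requires a convex clip polygon; general polygons would need a full geometry library. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Ring = Sequence[Point]


def signed_area(ring: Ring) -> float:
    """Shoelace formula; positive for counter-clockwise vertex order."""
    n = len(ring)
    return sum(ring[i][0] * ring[(i + 1) % n][1]
               - ring[(i + 1) % n][0] * ring[i][1] for i in range(n)) / 2.0


def clip_convex(subject: Ring, clip: Ring) -> List[Point]:
    """Sutherland-Hodgman: the part of `subject` inside convex `clip`."""
    if signed_area(clip) < 0:  # the inside test below expects CCW order
        clip = list(reversed(clip))

    def inside(p: Point, a: Point, b: Point) -> bool:
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def crossing(p: Point, q: Point, a: Point, b: Point) -> Point:
        dx1, dy1, dx2, dy2 = q[0] - p[0], q[1] - p[1], b[0] - a[0], b[1] - a[1]
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / (dx1 * dy2 - dy1 * dx2)
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        current, output = output, []
        for j in range(len(current)):
            p, q = current[j - 1], current[j]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(crossing(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(crossing(p, q, a, b))
        if not output:
            break  # nothing left: the polygons do not overlap
    return output


def polygons_similar(a: Ring, b: Ring, min_ratio: float) -> bool:
    """True when the intersection exceeds min_ratio of BOTH polygon areas."""
    overlap = abs(signed_area(clip_convex(a, b)))
    return (overlap / abs(signed_area(a)) > min_ratio
            and overlap / abs(signed_area(b)) > min_ratio)


def count_similar_pairs(layer_a: Sequence[Ring], layer_b: Sequence[Ring],
                        min_ratio: float) -> int:
    """Enumerate all pairs; the count may exceed the size of either layer."""
    return sum(1 for a in layer_a for b in layer_b
               if polygons_similar(a, b, min_ratio))
```

Requiring the ratio to pass for both polygons keeps a small polygon that sits inside a much larger one from being counted as similar to it.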

Determining Appropriate Threshold Values

This set of methodologies requires the definition of threshold values for all calculations. A cursory review of related literature found no established values to use, so an experimental approach to determining these values should be taken. Additionally, the intended use or the source of the data may alter the desired values for any threshold.

Uses For Similarity Measures

The determination of layer similarity comes into play across the GIS field. In general, most agencies using GIS software acquire data from a myriad of sources, which often provide layers representing the same ground features at different resolutions or in different projections. Often the tabular attributes differ between these sources, with little hope for textual correlations across the feature layers. This leaves geometric comparison as the only alternative to manual examination.

While the approach defined in this document cannot contend with gross spatial dissimilarities caused by massive misalignments or projection distortions, it can detect similarities between layers mapping similar features over a common area. Further investigation into possible improvements in these capabilities should prove worthwhile in other areas. For the purposes of grid-based analysis, cell-based similarity approaches will most likely provide the best results, possibly by hybridizing some of the methodologies described here into the cellular world.
