Feature Layer Similarity Analysis

  • 8/9/2019 Feature Layer Similarity Analysis

    Overview

    In order to process feature layers for predictive analysis, it is advantageous to remove

    layers that function as homologs (duplicate layers). This concept is much more complex than it appears, in that the representation of spatial information can take many forms, at

    many resolutions and precisions.

    The Problem

    Having multiple layers that fundamentally capture the same characteristics in space (or

    spatially linked characteristics) will provide a harmonic reinforcement to the output of

    any analytical processes that operate over the similar layers. This reinforcement of

    output signals can contribute to a false elevation of output values in areas of redundant

    signals while masking subtle features in discrete and distinct areas.

    Similarity of layers is composed of multiple concepts:

    1. Spatial overlap: for 2 layers to be similar, they must cover a common area of space.

    2. Distribution: if the spatial distribution of the data is dissimilar between the 2 layers, they are not truly similar.

    3. Generalization: if the spatial data is of differing geometry types or resolutions, the layers may still be similar.

    Each of these concepts can be examined along the 2 primary facets of similarity that
    matter to predictive analysis. The first facet is the similarity of the layers as a
    whole. The second applies when the analysis is performed using a grid for the analytical
    resolution of output: it is based on the resolution of the grid and the similarity of
    the layers at the cellular level of the grid.

    Spatial Overlap

    If the layers do not overlap at all, they cannot be similar; however, they may map the

    same category of data in disjoint areas. The mapping of similar categorical data in

    disjoint areas cannot be resolved using geometrical means and will not be addressed any

    further in this document.

    If the layers share a common area, where the extent of each layer has some portion that
    is distinct to itself and not shared with the other layer, then the layers can only be
    similar over the area they have in common. The distinct areas of each layer may,
    however, hold only a small portion of the data contained within that layer, so the
    entirety of each layer must still be evaluated unless a threshold percentage
    (representative sample) can be determined, relative to the whole layer, that would make
    layer similarity impossible at the finer level of per-feature analysis.

    If the layers have spatial domains such that the extent of one layer is entirely within the

    extent of the other, then it is possible that one layer is a proper subset of the other in

    terms of spatial similarity. This can provide for different opportunities for comparison.

    - If the spatially smaller layer is similar to the larger layer overall (i.e. the actual
      data in the larger layer is predominantly contained within the extent of the smaller
      layer), then the layers are wholly similar.

    - If a significant proportion of the data in the spatially larger layer lies outside
      the extent of the smaller layer, enough to preclude similarity at the overall level,
      then the layers cannot be wholly similar.

    - If the spatially smaller layer is similar to the larger layer within the bounds of
      the spatially smaller layer, then the layers are locally similar at the extent of
      the smaller layer; that is, the smaller layer is a similar subset of the larger
      layer. This may permit the exclusion of one of the layers for the constrained
      region.

    For the purposes of grid-based analysis, if the common extent of the layers is greater
    than or equal to the extent of the grid, then local similarity is all that is important.
    Similarly, if the common extent of the layers only intersects the extent of the grid,
    similarity may be considered as constrained to the extent of the grid, irrespective of
    the extents of the layers themselves.
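    The extent cases described in this section can be sketched with axis-aligned bounding
    boxes. This is only an illustration: the names Extent, classify_overlap and
    comparison_extent are hypothetical, and it assumes layer and grid extents are available
    as (xmin, ymin, xmax, ymax) rectangles in a shared coordinate system.

```python
from typing import NamedTuple, Optional


class Extent(NamedTuple):
    """Axis-aligned bounding box of a layer or grid (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersection(a: Extent, b: Extent) -> Optional[Extent]:
    """Return the common extent of a and b, or None if they are disjoint."""
    xmin, ymin = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xmax, ymax = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    if xmin > xmax or ymin > ymax:
        return None
    return Extent(xmin, ymin, xmax, ymax)


def contains(outer: Extent, inner: Extent) -> bool:
    """True if inner lies entirely within outer."""
    return (outer.xmin <= inner.xmin and outer.ymin <= inner.ymin
            and outer.xmax >= inner.xmax and outer.ymax >= inner.ymax)


def classify_overlap(a: Extent, b: Extent) -> str:
    """Classify two layer extents as 'disjoint', 'contained' or 'partial'."""
    if intersection(a, b) is None:
        return "disjoint"
    if contains(a, b) or contains(b, a):
        return "contained"
    return "partial"


def comparison_extent(a: Extent, b: Extent, grid: Extent) -> Optional[Extent]:
    """Restrict the common extent of two layers to the analysis grid."""
    common = intersection(a, b)
    if common is None:
        return None
    return intersection(common, grid)
```

    A "contained" result corresponds to the possible proper-subset case, and
    comparison_extent gives the region over which a grid-constrained comparison would run.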

    Distribution

    Given that the layers of interest have a common extent, it is possible that the layers are

    not similar in the distribution of the features within them. This is the concept of the

    spatial distribution of features, where the count of features is potentially different

    between the layers, but the distribution of the features is similar in space.

    If a layer's features are all proximal to features in the other layer, the layers may be
    similar; however, if a significant proportion of the features in a layer are disjoint
    from any feature in the other layer, the layers cannot be wholly similar. The definition
    of proximal is based upon the resolution of the data itself, whereas the significant
    proportion is based on a more subjective target threshold.

    In the grid-based analytical model, the treatment of feature distribution differs most
    significantly from the similarity of data layers at the global level. In the grid-based
    approach, the distribution of geometries within the layers is based upon cellular
    membership, so layers that are generally dissimilar may be identical on a cellular
    level, depending on the resolution and extent of the analysis grid.
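    A minimal sketch of cellular membership follows, assuming point features, a square-celled
    grid and hypothetical function names. The agreement score (shared occupied cells divided
    by all occupied cells) is an assumption made for illustration; the document does not
    prescribe a particular cell-level measure.

```python
import math


def occupied_cells(points, origin, cell_size, cols, rows):
    """Return the set of (col, row) grid cells that contain at least one point."""
    ox, oy = origin
    cells = set()
    for x, y in points:
        col = math.floor((x - ox) / cell_size)
        row = math.floor((y - oy) / cell_size)
        if 0 <= col < cols and 0 <= row < rows:  # ignore points outside the grid
            cells.add((col, row))
    return cells


def cell_agreement(cells_a, cells_b):
    """Shared occupied cells divided by all occupied cells (0.0 if none are occupied)."""
    union = cells_a | cells_b
    if not union:
        return 0.0
    return len(cells_a & cells_b) / len(union)


layer_a = [(0.5, 0.5), (0.6, 0.4), (2.5, 2.5)]
layer_b = [(0.1, 0.9), (2.9, 2.1)]

# Coarse grid: 3 x 3 cells of size 1.0
coarse = cell_agreement(occupied_cells(layer_a, (0, 0), 1.0, 3, 3),
                        occupied_cells(layer_b, (0, 0), 1.0, 3, 3))

# Fine grid: 6 x 6 cells of size 0.5 over the same extent
fine = cell_agreement(occupied_cells(layer_a, (0, 0), 0.5, 6, 6),
                      occupied_cells(layer_b, (0, 0), 0.5, 6, 6))
```

    In this example the two layers occupy exactly the same cells on the coarse grid and no
    common cells on the fine grid, which illustrates how the outcome depends on grid
    resolution.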

    Generalization

    Spatially linked, or spatially identical, features are often represented in different
    layers to capture varying resolutions of data or to serve different purposes.

    As an example, buildings may be represented as points in one layer and as polygon

    boundaries in another to provide different data resolutions. Additionally, at an even

    coarser resolution, a single point may represent a cluster of

    buildings. As a second example, rivers may be managed as polygons of high resolution,

    following the waterline along the course in one layer, and managed as coarse polygons

    representing the maximum flood line in another layer. These layers represent the same

    basic data at differing resolutions or for different purposes.

    Determining similarity for these types of relationships is the core challenge of this

    process. This is accomplished at the feature level, and requires a runtime of at least n*m,

    where n and m are the number of records in each of the 2 respective layers to be

    compared.

    The methodology for comparison is based upon the incident geometry types of the

    corresponding layers, which may be any combination of:

    Point - simple 2D locations, i.e. latitude/longitude

    Line - 2D linear features represented as a group of points connected by line segments

    Polygon - 2D areas defined by a line that defines the boundary of the area and any holes within that area

    Comparison methodologies

    Point-Point

    Comparing 2 point layers requires a threshold distance (radius) for linkage and a

    percentage linkage to define similarity. Each point in the first layer is compared

    to each point in the second layer by distance. A count of feature pairs that are

    closer than the threshold distance is built in the process. If this count is greater

    than the percentage threshold of the features in each layer, then the layers are

    similar. If the count is greater than the threshold for only a single layer, then there is a directional similarity, i.e. A is similar to B but B is not similar to A.

    \[ \forall\, x_a, x_b :\quad \text{if } D(x_a, x_b) < D_t \text{ then } 1 \text{ else } 0 \]

    Note that the count of linked pairs may be greater than the total number of points

    in either layer, if multiple points link to multiple points. If this behavior is not

    desired, a secondary methodology can be added to stop after the first link is

    found. If completeness is desired, the secondary methodology would increase

    runtime to a worst-case of 2(n*m).
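    The point-to-point comparison can be sketched as follows. This is an illustration under
    stated assumptions rather than a reference implementation: points are (x, y) tuples in a
    planar coordinate system, distance is Euclidean, and the function and parameter names
    (count_linked_pairs, layer_similarity, d_t, pct) are hypothetical.

```python
import math


def count_linked_pairs(layer_a, layer_b, d_t, first_match_only=False):
    """Count (a, b) point pairs whose distance is below the threshold d_t."""
    count = 0
    for ax, ay in layer_a:
        for bx, by in layer_b:
            if math.hypot(ax - bx, ay - by) < d_t:
                count += 1
                if first_match_only:
                    break  # stop after the first link found for this point of layer_a
    return count


def layer_similarity(layer_a, layer_b, d_t, pct, first_match_only=False):
    """Return (meets_a, meets_b): whether the pair count exceeds pct of each layer's size."""
    count = count_linked_pairs(layer_a, layer_b, d_t, first_match_only)
    return count > pct * len(layer_a), count > pct * len(layer_b)


layer_a = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
layer_b = [(0.0, 0.1)]
meets_a, meets_b = layer_similarity(layer_a, layer_b, d_t=0.5, pct=0.5)
```

    Here the single linked pair exceeds the percentage threshold for layer_b only, which is
    the directional case described above. With first_match_only set, each point of layer_a
    contributes at most one link.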

    Point-Line

    To compare lines to points, a threshold distance (radius) for linkage and a percentage to define similarity are required. Each point is compared to each line by

    distance. A count of feature pairs that are closer than the threshold distance is

    built in the process. If this count is greater than the percentage threshold of the

    features in each layer, then the layers are similar.

    \[ \forall\, p, l :\quad \text{if } D(p, l) < D_t \text{ then } 1 \text{ else } 0 \]

    A single point may link to multiple lines, and a line may link to multiple points.

    If either behavior is not desired, secondary methodologies can be added, at

    increased runtimes (2(n*m) worst case).
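    A sketch of the point-to-line linkage count is given below. It assumes each line is a
    list of at least 2 (x, y) vertices in planar coordinates, and takes the distance from a
    point to a line as the minimum distance to any of its segments. All names are
    hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: both ends coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its end points
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_line_distance(p, line):
    """Shortest distance from point p to a polyline with at least 2 vertices."""
    return min(point_segment_distance(p, line[i], line[i + 1])
               for i in range(len(line) - 1))


def count_point_line_links(points, lines, d_t):
    """Count (point, line) pairs closer than the threshold distance d_t."""
    return sum(1 for p in points for line in lines
               if point_line_distance(p, line) < d_t)


points = [(1.0, 0.2), (10.0, 10.0)]
lines = [[(0.0, 0.0), (2.0, 0.0)], [(1.0, 0.0), (1.0, 5.0)]]
links = count_point_line_links(points, lines, d_t=0.5)
```

    In the example the first point links to both lines, so one point accounts for 2 pairs,
    as noted above.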

    Point-Polygon

    Comparing points to polygons is a 2-fold process in terms of the nature of the
    comparison. If a point is inside a polygon, it is linked by inclusion; if a point is
    not inside, but is within the threshold distance of the polygon, it is linked by
    distance.

    For this methodology a threshold distance and 2 percentage thresholds are

    required. The first percentage threshold maps to the percentage of points that are

    contained by polygons to define layers as similar. The second percentage

    threshold maps to the percentage of points that must be contained or within the

    threshold distance of polygons to define layers as similar.

    Currently, there is no metric that compares the proportion of polygons that are
    near to points. If that is desirable, it can be added as a secondary methodology at
    no additional runtime cost.
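    The 2-fold point-to-polygon linkage might be sketched as below. As a simplification,
    each polygon is a single exterior ring of (x, y) vertices (holes are ignored), inclusion
    uses a standard ray-casting test, and the function returns the two fractions that would
    be compared against the two percentage thresholds. All names are hypothetical.

```python
import math


def point_in_ring(p, ring):
    """Ray-casting test: True if p lies inside the closed ring of vertices."""
    x, y = p
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through p
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    len_sq = dx * dx + dy * dy
    if len_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    t = max(0.0, min(1.0, ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / len_sq))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def ring_distance(p, ring):
    """Shortest distance from p to the boundary of the ring."""
    n = len(ring)
    return min(segment_distance(p, ring[i], ring[(i + 1) % n]) for i in range(n))


def classify_point(p, polygons, d_t):
    """Return 'inclusion', 'distance' or 'none' for one point against all polygons."""
    if any(point_in_ring(p, ring) for ring in polygons):
        return "inclusion"
    if any(ring_distance(p, ring) < d_t for ring in polygons):
        return "distance"
    return "none"


def linkage_fractions(points, polygons, d_t):
    """Return (fraction contained, fraction contained or within d_t)."""
    if not points:
        return 0.0, 0.0
    kinds = [classify_point(p, polygons, d_t) for p in points]
    contained = kinds.count("inclusion")
    linked = contained + kinds.count("distance")
    return contained / len(points), linked / len(points)


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
pts = [(2.0, 2.0), (4.5, 2.0), (10.0, 10.0)]
frac_contained, frac_linked = linkage_fractions(pts, [square], d_t=1.0)
```

    The two returned fractions correspond to the first and second percentage thresholds
    respectively.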

    Line-Line

    Comparing lines to lines requires a distance threshold, a length proportion
    similarity threshold and a proportion of features to classify as similar. The
    methodology compares each line in one layer to every line in the other by
    buffering the first line by the threshold distance and comparing the length of the
    intersection of the second line with that buffer to the full length of the second
    line. If this proportion is greater than the threshold proportion, the lines are
    considered similar. If the length proportion is less than the threshold, then the
    reverse is performed, buffering the second line and comparing against the length
    of the first line.

    This methodology is partially incomplete at present as it does not account for

    fractioned lines (broken lines). Also, shortened lines may be given directional

    preference depending on the comparison. A bidirectional compare could be

    performed at each feature pair to ensure that both intersecting proportions meet

    the threshold constraint.
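    The length-proportion test can be approximated without a geometry library by sampling.
    The sketch below does not construct a true buffer polygon: it splits one line into short
    pieces and sums the length of the pieces whose midpoints fall within the threshold
    distance of the other line. The sampling step and all names are assumptions made for
    illustration.

```python
import math


def segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    len_sq = dx * dx + dy * dy
    if len_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    t = max(0.0, min(1.0, ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / len_sq))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def line_distance(p, line):
    """Shortest distance from p to a polyline with at least 2 vertices."""
    return min(segment_distance(p, a, b) for a, b in zip(line, line[1:]))


def fraction_within(line, other, d_t, step=0.1):
    """Approximate fraction of line's length lying within d_t of other."""
    total = inside = 0.0
    for (x1, y1), (x2, y2) in zip(line, line[1:]):
        seg_len = math.hypot(x2 - x1, y2 - y1)
        if seg_len == 0.0:
            continue
        pieces = max(1, math.ceil(seg_len / step))
        piece_len = seg_len / pieces
        for k in range(pieces):
            t = (k + 0.5) / pieces  # midpoint of the k-th piece
            mid = (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
            total += piece_len
            if line_distance(mid, other) <= d_t:
                inside += piece_len
    return inside / total if total else 0.0


def lines_similar(line_a, line_b, d_t, min_fraction, bidirectional=False):
    """Apply the proportion test in one direction, falling back to the reverse.

    With bidirectional=True both proportions must exceed min_fraction.
    """
    b_in_a = fraction_within(line_b, line_a, d_t)
    a_in_b = fraction_within(line_a, line_b, d_t)
    if bidirectional:
        return b_in_a > min_fraction and a_in_b > min_fraction
    return b_in_a > min_fraction or a_in_b > min_fraction


long_line = [(0.0, 0.0), (10.0, 0.0)]
short_line = [(0.0, 0.1), (5.0, 0.1)]
```

    With these two lines, the short line lies entirely within the threshold distance of the
    long one, while only a little over half of the long line lies near the short one. The
    one-directional test therefore reports similarity and the bidirectional test does not,
    which is the directional preference for shortened lines mentioned above.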

    Line-Polygon

    To compare lines to polygons, each polygon is reduced to its perimeter line and
    treated from there as a line. For details on this comparison, see the Line-Line
    section above.

    Polygon-Polygon

    Comparing polygon layers requires the definition of a proportion overlap

    threshold and a percentage of features to classify as similar. The comparison is

    performed by comparing each feature in one layer to every feature in the other.

    The area of the intersection of the 2 polygons is compared to the area of each of

    the polygons. If the ratio of the intersection to the original polygon is greater than

    the threshold proportion for both polygons, the polygons are considered similar.

    If the total count of similar polygon pairs, taken as a proportion of the features,
    is greater than the threshold percentage, the layers are considered similar.

    In this methodology, it is possible for the count of similar pairs to exceed the

    count of features in either layer, as all possible pairs are enumerated. If this is not

    acceptable, a secondary methodology can be employed to stop compares after the first match.
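    The overlap-ratio test for a pair of polygons can be sketched as follows. To stay
    self-contained, the sketch is restricted to convex polygons given as counter-clockwise
    vertex lists without holes, and it computes the intersection with Sutherland-Hodgman
    clipping. That choice, and every name used, is an assumption for illustration; general
    polygons would need a full geometry library.

```python
def area(poly):
    """Shoelace area of a simple polygon given as a list of (x, y) vertices."""
    total = 0.0
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def _inside(p, a, b):
    """True if p is on or to the left of the directed edge from a to b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0


def _line_intersection(p, q, a, b):
    """Intersection of the line through p, q with the line through a, b."""
    x1, y1 = p
    x2, y2 = q
    x3, y3 = a
    x4, y4 = b
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    c1 = x1 * y2 - y1 * x2
    c2 = x3 * y4 - y3 * x4
    return ((c1 * (x3 - x4) - (x1 - x2) * c2) / denom,
            (c1 * (y3 - y4) - (y1 - y2) * c2) / denom)


def clip(subject, clipper):
    """Sutherland-Hodgman: clip subject against a convex, counter-clockwise clipper."""
    output = list(subject)
    n = len(clipper)
    for i in range(n):
        a, b = clipper[i], clipper[(i + 1) % n]
        current, output = output, []
        if not current:
            break
        prev = current[-1]
        for cur in current:
            if _inside(cur, a, b):
                if not _inside(prev, a, b):
                    output.append(_line_intersection(prev, cur, a, b))
                output.append(cur)
            elif _inside(prev, a, b):
                output.append(_line_intersection(prev, cur, a, b))
            prev = cur
    return output


def polygons_similar(poly_a, poly_b, min_ratio):
    """True if the intersection covers more than min_ratio of BOTH polygons."""
    area_a, area_b = area(poly_a), area(poly_b)
    if area_a == 0.0 or area_b == 0.0:
        return False
    common = area(clip(poly_a, poly_b))
    return common / area_a > min_ratio and common / area_b > min_ratio


square_a = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
square_b = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
```

    The two example squares share a quarter of their area, so they pass with a threshold
    ratio of 0.2 and fail with 0.5.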

    Determining Appropriate Threshold Values

    This set of methodologies requires the definition of threshold values for all calculations.

    There are no set values to use based on a cursory review of related literature, so an

    experimental approach to determining these values should be taken. Additionally, the

    intended use for or source of the data may alter the desired values for any threshold.
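    One way to begin the experimental approach is to sweep a candidate threshold and observe
    how the linkage changes. The sketch below does this for the distance threshold of a
    point-to-point comparison. It reports the fraction of points in one layer that have at
    least one link, a per-feature measure chosen here for readability rather than the pair
    count described earlier. Names and sample values are hypothetical.

```python
import math


def linked_fraction(layer_a, layer_b, d_t):
    """Fraction of points in layer_a with at least one layer_b point closer than d_t."""
    if not layer_a:
        return 0.0
    linked = sum(
        1 for ax, ay in layer_a
        if any(math.hypot(ax - bx, ay - by) < d_t for bx, by in layer_b)
    )
    return linked / len(layer_a)


def sweep(layer_a, layer_b, thresholds):
    """Return (d_t, linked fraction) for each candidate distance threshold."""
    return [(d_t, linked_fraction(layer_a, layer_b, d_t)) for d_t in thresholds]


layer_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
layer_b = [(0.0, 0.2), (1.0, 0.6), (2.0, 1.5)]
results = sweep(layer_a, layer_b, [0.5, 1.0, 2.0])
```

    Plotting or tabulating such a sweep for layers whose similarity is already known by
    inspection is one possible way to choose a threshold that separates similar from
    dissimilar pairs.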

    Uses For Similarity Measures

    The determination of layer similarity comes into play across the GIS field. In general,

    most agencies using GIS software acquire data from a myriad of sources which often

    provide layers representing the same ground features at different resolutions or in

    different projections. Often the tabular attributes differ between these sources, with
    little hope for textual correlations across the feature layers. This leaves geometric
    comparison as the only alternative to manual examination.

    While the approach defined in this document cannot contend with gross spatial

    dissimilarities caused by massive misalignments or projection distortions, it can detect

    similarities between layers mapping similar features over a common area. Further

    investigation into possible improvements in these capabilities should prove worthwhile

    in other areas. For the purposes of grid-based analysis, cell-based similarity approaches

    will most likely provide the best results, possibly by adapting some of the
    methodologies here to operate at the cell level.