
Mining Spatial Data: Opportunities and Challenges of a Relational Approach

Donato Malerba
Department of Computer Science

University of Bari, Italy

August 30th - September 1st, 2007 - AVEIRO, PORTUGAL

Spatial Data Exploration: A Historical Example

1848: An epidemic of the ‘Asiatic cholera’ hit London.
John Snow observed the distribution of deaths throughout the city and hypothesized that river water contaminated by cholera evacuations explained the spatial variations in mortality throughout London.

John Snow


August 1854: the cholera epidemic hit an area of North London.
• J. Snow obtained the names and addresses listed on 83 death certificates from the Registry Office.
• He marked the cholera cases on a map.
• He also inventoried potential sources of contamination (pumps) and combined this information on the map.
• He observed that nearly all the deaths had taken place within a short distance of the pump in Broad Street.

Snow persuaded the parish council to remove the pump handle.
• Not easy: the water provided by this pump was held in such high esteem that people came from neighboring streets for it.
• Result: the epidemic subsided.

The council did not really believe Snow, so a curate repeated Snow’s work and considered other factors (cleanliness/filthiness of houses).
• The curate, who was initially biased against Snow’s theory, located 700 deaths within a 250-yard radius and showed that the use of water from the Broad Street pump was strongly correlated with death from Asiatic cholera.

A curiosity: Snow’s theory was supported by two pieces of ‘negative data’:
• No infection in the workhouse (it had its own well)
• No cases in the Lion Brewery (workers drank the beer)

Lessons Learned
Key elements of this success story:
• Identification of relevant spatial objects
    • Reference spatial objects (buildings where cholera cases occurred)
    • Task-relevant spatial objects (water pumps, wells, etc.)
• Identification of the properties of, and relationships between, relevant spatial objects (distance of buildings from water pumps, presence of wells)

Spatial Data Mining
The goal of spatial data mining is to automate the discovery of such correlations, which can then be examined by specialists for further validation and verification.

Agenda
• Modeling spatial information
• Spatial pattern
• Spatial data mining: main issues
• Opportunities for a relational approach
• A case study: spatial model trees
• Challenges for a relational approach
• Summary

Modeling Spatial Information
Two major approaches to conceptual modeling of space:
• Field-based model
• Object-based model

Field-based Model
The world is seen as a continuous surface over which features vary.
Spatial variation is defined by a number of field functions:
    f: R^n → Attribute Domain
Examples: elevation, temperature, precipitation

Field Operations
Examples: addition (+) and composition (∘).
    f + g : x → f(x) + g(x)
    f ∘ g : x → f(g(x))
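The two field operations can be sketched in a few lines of Python by treating a field as a function from a location to a value. This is only an illustration: the fields `elevation` and `snow_depth` and the conversion `to_centimetres` are hypothetical and do not come from the slides.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Field = Callable[[Point], float]


def add(f: Field, g: Field) -> Field:
    """Pointwise addition: (f + g)(x) = f(x) + g(x)."""
    return lambda p: f(p) + g(p)


def compose(f: Callable[[float], float], g: Field) -> Field:
    """Composition: (f o g)(x) = f(g(x))."""
    return lambda p: f(g(p))


# Hypothetical fields defined over (x, y) locations.
def elevation(p: Point) -> float:
    return 100.0 + 2.0 * p[0] + p[1]


def snow_depth(p: Point) -> float:
    return 0.5 * p[1]


def to_centimetres(metres: float) -> float:
    return metres * 100.0


surface = add(elevation, snow_depth)              # a new field: f + g
elevation_cm = compose(to_centimetres, elevation)  # a new field: f o g
```

Both results are again fields, so they can be combined further with the same operations.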

Object-based Model
The world is seen as a surface littered with distinct, identifiable and relevant things or entities, called objects, which exist independently of their locations.
Objects can be:
• Zero-dimensional or punctual
• One-dimensional or linear
• Two-dimensional or surfacic

Operations on spatial objects:
• Topological
• Directional
• Distance-based

Field-based vs. Object-based

[Figure: (a) map of three forest stands (Pine, Fir, Oak) on a grid with x and y ranging from 0 to 4; (b) object viewpoint; (c) field viewpoint]

(b) Object Viewpoint of Forest Stands

Area-ID   Dominant Tree Species   Area/Boundary
FS1       Pine                    [(0,2),(4,2),(4,4),(0,4)]
FS2       Fir                     [(0,0),(2,0),(2,2),(0,2)]
FS3       Oak                     [(2,0),(4,0),(4,2),(2,2)]

(c) Field Viewpoint of Forest Stands

f(x,y) = "Pine", 0 ≤ x ≤ 4; 2 ≤ y ≤ 4
         "Fir",  0 ≤ x ≤ 2; 0 ≤ y ≤ 2
         "Oak",  2 ≤ x ≤ 4; 0 ≤ y ≤ 2


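Spatial Pattern (Field-based view)
A function obtained as a combination of field functions according to field operations.

Example: f(x,y) = precipitation in (x,y)

    (1/size(A)) ∫∫_A f(x,y) dx dy  >  (1/size(B)) ∫∫_B f(x,y) dx dy

(the average precipitation over area A is greater than the average precipitation over area B)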
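A field-based pattern of this kind can be checked numerically. The sketch below approximates the average of a field over a rectangular area with the midpoint rule; the `precipitation` field and the two rectangles are hypothetical, and real areas would usually be arbitrary polygons rather than rectangles.

```python
def average_over(f, xmin, xmax, ymin, ymax, n=50):
    """Approximate (1/size(R)) * double integral of f over the rectangle R
    using the midpoint rule on an n-by-n grid."""
    dx = (xmax - xmin) / n
    dy = (ymax - ymin) / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            x = xmin + (i + 0.5) * dx
            y = ymin + (j + 0.5) * dy
            total += f(x, y)
    return total / (n * n)


def precipitation(x, y):
    # Hypothetical field: precipitation grows linearly towards the east.
    return 10.0 + 3.0 * x


avg_a = average_over(precipitation, 2, 4, 0, 2)  # area A: 2 <= x <= 4, 0 <= y <= 2
avg_b = average_over(precipitation, 0, 2, 0, 2)  # area B: 0 <= x <= 2, 0 <= y <= 2
pattern_holds = avg_a > avg_b
```

Dividing the summed samples by the number of grid cells is equivalent to dividing the approximated integral by the area of the rectangle.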

Spatial Pattern (Object-based view)
It expresses a spatial relationship among spatial objects.
• If the wetlands area is near open water then there is a nest of a red-winged blackbird (classification rule, specifically location prediction rule)
• The price of a house near the river is 2000*Size + 5000 (regression rule)
• Trajectories of monitored cars group along the direction from downtown to residential suburbs (cluster)
• A country that is adjacent to the Mediterranean Sea is a wine exporter (association rule)

Spatial Data Mining
Spatial data mining: extraction of interesting and useful but implicit spatial patterns (adapted from the definition of KDD).
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery: an overview. In: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.): Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press (1996) 1-35.

What’s Special about Spatial Data Mining? (1/4)
The formulation of a spatial data mining method must take into account the logical representation of spatial information.
Two modes:
• Tessellation
• Vector

Tessellation Mode
Regular or irregular
• Field-based data in tessellation mode
• Object-based data in tessellation mode, e.g., ⟨2,3,6,7,8,12,13,14,18,19⟩

[Figure: a regular grid with cells 2, 3, 6, 7, 8, 12, 13, 14, 18 and 19 marked]

Vector Mode
Ordered sets of xy-coordinates defining points, lines, or polygons
• Field-based data in vector mode
• Object-based data in vector mode

Tessellation vs. Vector

Tessellation
☺ Large volumes of spatial data (e.g., remote sensing images) are available
☹ Large memory space required
☹ Operations on objects are time consuming

Vector
☺ Storage efficient
☺ Easy to scale (though some operations are complex)
☺ Supported by spatial DBMS
☹ Difficult to convert remote sensing images into this format
☹ Difficult to check a large number of constraints (is a polygon convex?)

Hybrid Modes also Possible
Spatial objects in a regular grid of cells
• Tessellation
    • Single pixels are classified
    • Image processing operators are needed
• Vector
    • Images are transformed
    • Spatial relationships are computed

[Figure: an industrial area shown in tessellation mode and in vector mode]

Representation of Spatial Data in a Spatial DBMS
• Spatial information is represented in different layers, one for each type of spatial object.
• Layer: a database relation R_i with a number of elementary attributes A_i^1, …, A_i^{m_i} and possibly a geometry attribute G_i
• Vector representation for G_i, which can be a point, a line or a polygon
• A reference system defines the coordinates of single points and vertices of lines and polygons

Layer
• A spatial object or a field function is represented by one or more tuples of a …
• The geometry attribute is represented in vector mode.

wells
ID     Location   Depth   Type
2357   (15,22)    20      drilled

Different types of spatial objects → several layers in a spatial DB

buildings
ID     Surface   Size   Owner
AD18             250    Smith


Spatial objects have a locational property which implicitly defines spatial relationships between objects:
• Topological
• Distance-based
• Directional

Topological Relations
• Invariant under homeomorphisms (e.g., rotation, translation and scaling)
• Semantics defined by the 9-intersection model
• For regions: disjoint, meet, overlaps, equal, contains, inside, covers, covered by

Distance Relations
• Metric
    • Euclidean distance between two points
    • For polygons it is an aggregate function (e.g., minimum)
• Non-metric
    • Typically defined on the basis of a cost function (e.g., drive time)

Directional Relations
• Based on an angle α
• Based on the extension of Allen’s algebra

Spatial Relations
• In a spatial DB, different spatial relations ρ implicitly define spatial joins between two layers R_i and R_j: R_i ⋈_ρ R_j
• Too many spatial joins are implicitly defined → efficient computation of spatial relations is a must when developing spatial data systems

Abstraction from Physical Representation
• Interest in properties not related to the physical representation
• Well-defined semantics (e.g., the 9-intersection model) is not enough.
• Example: two roads can cross each other, run parallel, or be confluent, regardless of whether they are represented as “lines” or “regions”


Spatial autocorrelation: the value of a property observed at a location depends on the values of properties observed at neighboring locations.
• Positive autocorrelation: values are more similar (highly uniform) among spatial objects in the neighborhood
• Negative autocorrelation: values are less similar among spatial objects in the neighborhood

Tobler’s First Law of Geography
“everything is related to everything else, but near things are more related than distant things” (Tobler, 1970)

Spatial Error and Spatial Lag
Two primary types of autocorrelation:
• Spatial error
• Spatial lag
    • Spatially lagged explanatory variables
    • Spatially lagged response variables

[Figure: dependency diagrams among x_i, x_j, y_i, y_j, ε_i and ε_j for two neighboring locations i and j]

Violated Assumptions
• Spatial error: violates the assumption that error terms are uncorrelated
• Spatial lag: violates the assumption that observations are independent (as well as the assumption that error terms are uncorrelated)

“Anyone seriously interested in prediction when the sample data exhibit spatial dependence should consider a spatial model” (LeSage & Pace, 2001)

Limits of Traditional Data Mining Methods
• Attributes are of numeric and discrete type only (no geometry)
• Observations cannot be of different types (e.g., wells, buildings, etc.)
• Spatial relationships between observations are not represented / considered

Spatial Statistics
Spatial dependence is typically modeled by linear models:
    y = Xα + βDy + γDX + ε
• y: vector of observations of the dependent variable
• X: matrix of observations of the independent variables
• α: strength of local influence
• β: strength of spatial dependence on response variables
• γ: strength of spatial dependence on explanatory variables
• D: spatial weight matrix (or neighborhood matrix)

Spatial Weight Matrix
• Contains a ‘d’ term for every combination of observations in the data set
• ‘d’ may be the inverse distance between observations, or 0/1 depending on whether they share a border and/or vertex.
• The choice of spatial weight matrix is often made ad hoc and a priori.
• Provides the ‘structure’ of assumed spatial relationships.
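As an illustration of the inverse-distance option, the following sketch builds a weight matrix D from point locations and uses it to compute the spatially lagged response Dy. The coordinates and values are hypothetical, and scaling each row to sum to 1 is an added assumption (a common convention, not something stated on the slide). The sketch also assumes at least two observations at distinct locations.

```python
import math


def inverse_distance_weights(points):
    """D[i][j] = 1 / distance(i, j) for i != j and 0 on the diagonal,
    with each row scaled to sum to 1 (assumed convention)."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                D[i][j] = 1.0 / math.dist(points[i], points[j])
        row_sum = sum(D[i])
        D[i] = [d / row_sum for d in D[i]]
    return D


def spatial_lag(D, values):
    """Matrix-vector product D * values: a weighted average of the neighbours."""
    return [sum(d * v for d, v in zip(row, values)) for row in D]


# Hypothetical observations located along a line.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
y = [10.0, 20.0, 40.0]

D = inverse_distance_weights(points)
Dy = spatial_lag(D, y)  # spatially lagged response, the Dy term of the model
```

Replacing `math.dist` with a 0/1 contiguity test would give the border-sharing variant of the matrix.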

Problems with Spatial Models
• D has to be carefully defined
• How can D express the contribution of different spatial relationships?
• Spatial dependencies are all handled in a pre-processing or feature extraction step
• All spatial objects involved in the spatial phenomena (rows of X) are uniformly represented by the same set of attributes
• No clear difference between reference (target) and task-relevant objects.

Reference vs. Task-relevant Objects
• Reference objects: the main subject of analysis
• Task-relevant objects: objects in the neighborhood that can contribute to explaining the spatial variation

Example: find associations involving large towns of Apulia (Italy)
• Reference objects: large towns
• Task-relevant objects: water bodies, roads, province boundaries

Towards a Multi-Relational Representation
• In spatial data mining the units of analysis are typically composed of several spatial objects with different properties.
• Their spatial structure cannot be accommodated in a classical double-entry table.
• A better representation: a set of relations R1, …, Rj, some of which are layers.
• Foreign key constraints and spatial relations define possible joins.

Example
Problem: investigate social effects of public transportation in a British city
Spatial data set:

ED
ID         Area
03bsfc01

BL
Name   Line   Type
15a           main

CE
ED         #households no car   #households 1 car   #households ≥2 cars
03bsfc01   80 67143

Example
A unit of analysis corresponds to an ED (reference object) described in terms of the number of cars per household and the crossing bus lines (task-relevant objects).
Relational pattern:
“the enumeration districts with a high percentage of households which own less than two cars, are served by at least two bus lines, one of which is a main bus line”

Multi-relational Data Mining
• MRDM tools can be applied directly to data distributed over several relations to find relational patterns which involve multiple relations.
• Relational patterns can be expressed in SQL but also in first-order logic

Relational Patterns in SQL and Logic

SELECT DISTINCT ED.ID
FROM ED, CE, BL AS BL1, BL AS BL2
WHERE CE.ED = ED.ID
  AND HH2CARS / (HHNOCARS + HH1CAR + HH2CARS) * 100 > 60
  AND INTERSECTS(ED.AREA, BL1.LINE)
  AND INTERSECTS(ED.AREA, BL2.LINE)
  AND BL1.NAME <> BL2.NAME
  AND BL1.TYPE = 'MAIN'

ed(X), ce(X, HHNOCARS, HH1CAR, HH2CARS), bl(BL1), bl(BL2),
HH2CARS / (HHNOCARS + HH1CAR + HH2CARS) * 100 > 60,
intersects(X, BL1), intersects(X, BL2), BL1 ≠ BL2, main(BL1)
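To make the semantics of the pattern concrete, the sketch below evaluates the same conditions over a toy, in-memory version of the three relations. Everything in it is illustrative: the identifiers and values are invented, ED areas are simplified to axis-aligned boxes, and `intersects` is a crude stand-in (a bus line is taken to cross an ED if one of its vertices falls inside the box) for a real spatial operator.

```python
from itertools import permutations

# Toy relations (hypothetical values).
ED = {"ed1": (0, 0, 10, 10), "ed2": (20, 0, 30, 10)}  # ID -> (xmin, ymin, xmax, ymax)
CE = {"ed1": (10, 20, 70), "ed2": (5, 5, 90)}         # ED -> (no car, 1 car, >=2 cars)
BL = {
    "15a": {"type": "main", "line": [(5, -5), (5, 5), (5, 15)]},
    "22": {"type": "local", "line": [(-5, 5), (5, 5), (15, 5)]},
    "7": {"type": "main", "line": [(25, -5), (25, 5)]},
}


def intersects(area, line):
    """Simplified spatial test: some vertex of the line lies inside the box."""
    xmin, ymin, xmax, ymax = area
    return any(xmin <= x <= xmax and ymin <= y <= ymax for x, y in line)


def matching_eds(threshold=60):
    """Return the IDs of the EDs that satisfy every condition of the pattern."""
    result = set()
    for ed_id, area in ED.items():
        no_car, one_car, two_cars = CE[ed_id]
        share = two_cars / (no_car + one_car + two_cars) * 100
        if share <= threshold:
            continue
        # Ordered pairs of distinct bus lines play the role of BL1 and BL2.
        for (_, bl1), (_, bl2) in permutations(BL.items(), 2):
            if (bl1["type"] == "main"
                    and intersects(area, bl1["line"])
                    and intersects(area, bl2["line"])):
                result.add(ed_id)
    return sorted(result)


selected = matching_eds()
```

Here `ed1` is selected because it passes the percentage test and is crossed by two distinct bus lines, one of them a main line, while `ed2` passes the percentage test but is crossed by a single line only.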

Two Settings for MRDM
• Finding relational patterns within units of analysis represented as sets of tuples
    • Each unit of analysis includes a single reference object and is represented by a sub-database of the original one
• Finding patterns within the whole database

Individual-Centered Representations
Several advantages:
• Positive PAC-learnability results
• Methods working under the single-table assumption are easier to upgrade (e.g., the notion of unit of analysis simplifies the sampling)
• More efficient (one unit of analysis is processed at a time)

Individual-Centered Representations
But …
• Units of analysis with a single reference object might not be easy to define
• In spatial data mining, the unit of analysis should be carefully selected so that
    • autocorrelation is considered
    • the size of the neighborhood is limited

Example of Autocorrelation
Spatially lagged explanatory variables: there is no communal establishment (schools, hospitals) in an ED, but many of them are located in the nearby EDs.
• Reference object: ED
• Task-relevant objects: communal establishments in the nearby EDs

[Figure: a reference ED i surrounded by neighboring EDs, each labeled with X_r,j; the unit of analysis is (x_1,i, …, x_k,i, x_r,j)]

Example of Autocorrelation
Spatially lagged response variables: the price level for a good at a retail outlet in a city depends on the price for the same good in the neighborhood.
• Reference object: ED
• Task-relevant objects: EDs in the neighborhood

[Figure: a reference ED i surrounded by neighboring EDs, each labeled with X_r,j and Y_j; the unit of analysis is (x_1,i, …, x_k,i, y_j)]

A Recipe for MRDM Systems
• Start from a well-known data mining system working on the classical double-entry table representation
• Upgrade
    • Generality order of patterns (e.g., θ-subsumption)
    • Generalization/specialization operators
    • Similarity measure, …
  to deal with several relations
• Build the new system, retaining as much as possible from the original one

Additional Ingredients for a Relational Approach to Spatial Data Mining
• Define a representation of spatial objects
• Define operators for spatial joins
• Optimize the computation of spatial joins with spatial indexes
• Distinguish reference from task-relevant objects
• Visualize spatial patterns (e.g., on a map)


Relational Systems for Spatial DM
Spatial Association Rule Discovery
• SPADA system (Malerba & Lisi, ILP 2001)
• Task-relevant data are organized hierarchically
• Spatial patterns are found at different granularity levels

[Figure: three hierarchies, each with granularity levels 1 to 3. Road net: road_net; Motorway, A_road, B_road, PrimaryRoad. Water net: water_net; canal, river, water. Rail net: rail_net; rail]

• Spatial pattern: conjunction of first-order logic atoms
• The space of spatial patterns is ordered by θ-subsumption
• Monotonicity of support w.r.t. θ-subsumption → pruning of patterns at the same granularity level in the candidate generation phase
• Monotonicity of pattern frequency w.r.t. granularity level → pruning of patterns at different granularity levels in the candidate generation phase

• Efficiency improvement of pattern evaluation by caching support objects for each stored pattern
• Definition of a declarative bias to filter out rules on the basis of users’ preferences → efficiency improvement is a byproduct
• Integration of SPADA in the ARES system, which interfaces with a spatial DB (Oracle Spatial)

Relational Systems for Spatial DM
Spatial Classification
• Based on associative classification (Ceci et al., ECML/PKDD 2004)
• SPADA is used to extract strong multi-level spatial association rules, with exactly one literal representing the class label in the consequent
• Strong rules are then used to build a relational naive Bayesian classifier.

Relational Systems for Spatial DM
Spatial Clustering
• CORSO (presented at this conference)
• Units of analysis are described by several relations
• A relational distance measure is used to cluster them
• Units of analysis are themselves spatially related

Discrete spatial structure: a directed graph where
• nodes correspond to units of analysis
• links correspond to spatial relations between units of analysis

[Figure: graph of the “neighbouring regions” relation among Italian regions: Apulia, Molise, Basilicata, Campania, Calabria, Abruzzo, Latium, Tuscany, Sicily, …]

CORSO combines
• graph-based partitioning
• with multi-relational clustering
Clusters are described by means of a logical theory.

Related work: GDBSCAN (Sander et al., 1998)
• Pros
    • Spatial relations between the objects to be clustered are considered
• Cons
    • Data stored in a single double-entry table
    • No description of clusters

Relational Systems for Spatial DM
Spatial Regression
• Mrs-SMOTI (Malerba et al., ECML/PKDD 2005)
• It generates relational model trees from a collection of tables (some of which are layers).
• It extends its predecessor SMOTI:
    • tight integration with a spatial database to mine spatial relationships and properties implicit in the data
    • search strategy modified to capture the implicit relational structure of spatial data (intra-layer and inter-layer)
    • intra-layer relationships make available spatially lagged response attributes in addition to spatially lagged explanatory attributes

Classical Regression Problem
Given
• m independent (or explanatory) attributes Xi (both continuous and discrete)
• a continuous dependent (or response) attribute Y to be predicted
• a set of n training cases (x1, x2, …, xm, y)
Build
• a function y = g(x) that correctly predicts the value of the response attribute for each m-tuple (x1, x2, …, xm)

Problem 1: Spatial Arrangement
If spatial heterogeneity of the response is anticipated, then allow
• the constant
• one or more of the other regression parameters
to vary spatially.
Example: residential areas have a higher number of migrants
    Yi = β0 + β1 x1,i + … + βk xk,i + γ Di + ei
Di is a dummy variable: 1 if site i is in a residential area, 0 otherwise

Problem 1: Spatial Arrangement
    Yi = β0 + (β1 + γ Di) x1,i + … + βk xk,i + ei
In this case the slope parameter associated with variable x1 varies.

Issues:
• clumsy generalization to more than two influence areas
• difficult to establish a priori which variables (if any) are actually affected by spatial arrangement

Model Tree
• A tree structure is generated according to a top-down strategy
    • partitioning of the training set
    • local regression models

[Figure: a model tree with the split test X1 ≤ 3 and a leaf model Y = 3 + 2X1]

Spatial Regression Problem
• The response attribute Y (e.g., number of migrants) is associated with a location (e.g., ED)
• Explanatory variables Xi are also associated with locations

In standard model tree learning methods:
• the spatial arrangement of objects is disregarded
• observations are assumed to be independent

Model Trees: State of the Art
Statistics
• Ciampi (1991): RECPAM
• Siciliano & Mola (1994)
• …
Data Mining
• Karalic (1992): RETIS
• Quinlan (1992): M5
• Wang & Witten (1997): M5’
• Lubinsky (1994): TSIR
• Torgo (1997): HTL
• Malerba et al. (2004): SMOTI
• …

No state-of-the-art method tries to mine model trees dealing with spatial structure!

Dealing with local & global effects
Some explanatory attributes can have a spatially global effect on the response attribute, while others have only a spatially local effect.

[Figure: a model tree with leaf models Y = 0.9, Y = 3 + 1.1X1 and Y = 3X1 + 1.1X2]

• The model tree does not show the possibly global effect of X1

Dealing with local & global effects
A tree structure with splitting and regression nodes
• Splitting nodes perform a Boolean test.
    • Continuous variable: node t tests Xi ≤ α; its children tL and tR carry, e.g., Y = a + bXu and Y = c + dXw
    • Discrete variable: node t tests Xi ∈ {xi1, …, xih}; its children tL and tR carry, e.g., Y = a + bXu and Y = c + dXw
• Regression nodes compute only a straight-line regression. They have only one child.
    • Regression node t computes Y = a + bXi; its child t’ tests X’j ≤ α, where X’j = Xj − (aj + bjXi), and the children t’L and t’R carry Y = c + dX’u and Y = e + fX’w

Dealing with local & global effects
• Leaves are associated with a straight-line regression function
• The multiple regression model associated with a leaf is the composition of the straight-line regression functions found along the path from the root to the leaf

SMOTI: Stepwise Model Tree Induction
(Malerba et al., IEEE Trans. Pattern Analysis & Mach. Intell. 2004)

[Figure: a model tree T with nodes 0 to 7: regression node 0 (Y = a + bX1), split node 1 (X’3 ≤ α), further split tests X’2 ≤ β and X’4 ≤ γ, and straight-line models Y’ = c + dX’3, Y’ = e + fX’2, Y’ = g + hX’3 and Y’ = i + lX’4]
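The way straight-line regressions compose along a path can be illustrated with a two-step example: a regression node on X1, followed by a leaf that regresses the residual of Y on the residual of X2. This is a minimal sketch of the residual-based stepwise idea under invented data; it is not the SMOTI implementation, and the helper `fit_line` is a plain least-squares fit introduced only for this example.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical training data generated from Y = 1 + 2*X1 + 3*X2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

# Regression node: Y = a + b*X1.
a, b = fit_line(x1, y)
# Remove the effect of X1 from Y and from X2: X2' = X2 - (a2 + b2*X1).
a2, b2 = fit_line(x1, x2)
y_res = [yi - (a + b * u) for yi, u in zip(y, x1)]
x2_res = [v - (a2 + b2 * u) for v, u in zip(x2, x1)]

# Leaf: Y' = c + d*X2' fitted on the residuals.
c, d = fit_line(x2_res, y_res)


def predict(u, v):
    """Compose the straight-line models found along the path."""
    return (a + b * u) + c + d * (v - (a2 + b2 * u))
```

On this noise-free data the composed model reproduces the generating multiple regression exactly, and the slope `d` fitted on the residuals equals the coefficient of X2.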

Dealing with Spatial Autocorrelation
• Augment ED information by exploiting intra-layer relationships (e.g., neighborhood)

ED   #MigrInWards   #Establishments   #Employees on 10% sample population   #Migrants
Italy 3 1 9 4382…
03BSFA18   10   1   45
03BSFN01   18   5   73
…   …   …   …

Reference ED   Neighbouring ED
03BSFA04       03BSFA05
03BSFA04       03BSFB18
03BSFA04       03BSFQ01
…              …

Dealing with Spatial Autocorrelation
Other spatial objects, which are different from the areas where Y is measured, can be easily accommodated in this framework through inter-layer relationships.

Model Trees + Spatial Data Structure = Mrs-SMOTI
Mrs-SMOTI is the spatial extension of SMOTI (Stepwise Model Tree Induction)
INPUT: spatial objects, possibly belonging to separate layers, stored in a spatial database S
• reference objects (main subject of analysis)
• task-relevant objects
OUTPUT: a spatial model tree T, built by
• partitioning training spatial data according to intra-layer and inter-layer relationships
• associating different regression models with disjoint spatial areas

Split Node
Binary split nodes involve:
1. Boolean tests on spatial relationships (either intra-layer or inter-layer)
   Example: partitioning EDs according to the presence/absence of roads (EDs crossed by some road vs. EDs not crossed by any road).
   An extra layer is added to the spatial model.
2. Boolean tests on
   • thematic attributes of a layer
   • spatial properties implicitly defined by the geometry of a layer (e.g., area for polygons, extension for lines)
   These Boolean tests involve only layers already included in the model.

Spatial Regression Node
It performs a straight-line regression on either a continuous thematic attribute or a continuous spatial property.
• The response attribute and the continuous explanatory attributes are replaced with residuals → stepwise regression
• The regression attribute comes from a layer already included in the model
• When a new layer is added to the model, continuous thematic and spatial attributes are replaced with the corresponding residuals.

Spatial Database Integration
How?
• Object-relational data representation (Oracle Spatial)
• Spatial patterns associated with splitting and regression nodes are expressed by spatial queries.

Example:
SELECT * FROM EDs x, ROADS y
WHERE SDO_GEOM.RELATE(x.geometry, 'ANYINTERACT', y.geometry, 0.001) = 'TRUE'

Mining Stockport Census Data
GOAL: investigate social phenomena related to unemployment
SPATIAL DATASET: Stockport (Greater Manchester, UK)
• Reference objects: 578 EDs in Stockport
• Task-relevant objects:
    • shopping areas (53 objects)
    • employment areas (30 objects)
    • housing areas (9 objects)

Mining Stockport Census Data
Two experimental settings:
• B0 is obtained by considering only the ED layer
• B1 is obtained by considering all layers (L1+L2)

Mining Stockport Census Data
Portion of the spatial model tree built by Mrs-SMOTI on the entire dataset (B1 setting):

-- split on EDs’ number of migrants [≤ 47] (578 EDs)   [Boolean test on a thematic attribute]
---- regression on EDs’ area (458 EDs)
------ split on EDs-Shopping areas spatial relationship (94 EDs)   [test on an inter-layer relationship]
…
------ split on EDs’ number of migrants (364 EDs)
…
---- split on EDs’ area (120 EDs)   [Boolean test on a spatial property]
------ leaf on EDs’ area (22 EDs)
------ regression on EDs’ area
…

Mining Stockport Census Data
• 10-fold cross-validation
• Average mean square error (Avg.MSE)
• Systems: Mrs-SMOTI vs. SMOTI and M5’
• Two transformations of the original multi-relational data into a classical double-entry table, obtained by computing:
    • P1: spatial joins according to all possible intra-layer and inter-layer relationships → multiple tuples are generated for the same reference object
    • P2: average values for continuous attributes → one tuple for each reference object
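The difference between the two transformations can be seen on a toy example. The sketch below assumes that the spatial join between EDs and shopping areas has already been computed and is available as a list of pairs; all identifiers, attribute names and values are hypothetical.

```python
from statistics import mean

# Hypothetical layers and a precomputed spatial join between them.
eds = {"ed1": {"unemployed": 12}, "ed2": {"unemployed": 30}}
shopping_areas = {"s1": {"area": 100.0}, "s2": {"area": 300.0}, "s3": {"area": 50.0}}
related = [("ed1", "s1"), ("ed1", "s2"), ("ed2", "s3")]  # (ED, shopping area) pairs


def p1_join(eds, areas, related):
    """P1: one tuple per (reference object, related object) pair."""
    return [{"ed": e, **eds[e], "shop_area": areas[s]["area"]} for e, s in related]


def p2_average(eds, areas, related):
    """P2: one tuple per reference object, continuous attributes averaged."""
    rows = []
    for e, attrs in eds.items():
        linked = [areas[s]["area"] for ed, s in related if ed == e]
        avg = mean(linked) if linked else None
        rows.append({"ed": e, **attrs, "avg_shop_area": avg})
    return rows


table_p1 = p1_join(eds, shopping_areas, related)     # ed1 appears twice
table_p2 = p2_average(eds, shopping_areas, related)  # exactly one row per ED
```

P1 keeps every related object but repeats the reference object across rows, while P2 keeps one row per reference object at the cost of summarizing its neighbors.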

Mining Stockport Census Data
Mrs-SMOTI vs. SMOTI and M5’
• Average (avg) and standard deviation (std) of the mean square error (MSE) and number of leaves (#L) of the learned models
• Mrs-SMOTI always performs better (lower Avg.MSE) than SMOTI and M5’


Spatial Relationships Not Explicitly Modeled
• MRDM methods take advantage of the information on the data model reported in the DB schema (e.g., foreign keys) in order to guide the search process.
• But … spatial relationships are not explicitly modeled in the schema of a spatial DB.

Spatial Relationships Not Explicitly Modeled
Pre-compute spatial relationships
• Spatial weight matrix D for spatial linear models
• Single DB relation in GeoMiner (Han et al., 1997)
• Materialize distance, direction and topological relations (Ester et al., 1999)
• Extract spatial relations and represent them as first-order predicates (Appice et al., 2005)

Spatial Relationships Not Explicitly Modeled
Pros
• Spatial DBs are rather static
Cons
• Very large number of spatial relationships between two layers
• Some of them might be unnecessarily extracted

Dynamically compute spatial relationships, but which of them?

Feature Selection Bias
Concentrated linkage (Jensen & Neville, 2002)
• High concentration of objects linked to a common neighbor

[Figure: examples of low (0) and high (1) linkage between EDs and bus lines (BL)]

Feature Selection Bias
Relational autocorrelation
• The values of a given attribute are highly uniform among objects that share a common neighbor

[Figure: examples of low (0) and high (1) autocorrelation of +/− values among EDs that share a common BL]

Feature Selection Bias
High linkage and autocorrelation (frequent in true spatial phenomena)
→ decreased effective sample size
→ increased variance of the estimated scores
→ bias increases as variance increases

Feature selection algorithms are biased in favor of features with large variance (even when they are not related to the class attribute).

Feature Selection Bias
• The χ2-test for independence fails to discard uninformative features (it is based on the i.i.d. assumption)
• Most MRDM algorithms do not account for this bias
• Exception: relational probability tree learning uses a randomization test to adjust for feature selection bias (Neville et al., 2003)
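The general idea of a randomization test can be sketched as follows: the observed score of a feature is compared with the scores obtained after repeatedly shuffling the class labels. This is a generic, simplified illustration on invented data, not the procedure used in relational probability trees; in particular, a plain shuffle of the labels does not by itself preserve linkage or autocorrelation. The `score` function is also just one possible choice.

```python
import random


def score(feature, labels):
    """Absolute difference between the positive rate with and without the feature."""
    on = [l for f, l in zip(feature, labels) if f]
    off = [l for f, l in zip(feature, labels) if not f]
    if not on or not off:
        return 0.0
    return abs(sum(on) / len(on) - sum(off) / len(off))


def randomization_p_value(feature, labels, n_perm=1000, seed=0):
    """Fraction of label shuffles whose score is at least the observed score."""
    rng = random.Random(seed)
    observed = score(feature, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score(feature, shuffled) >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical binary feature and class labels for ten objects.
feature = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
p_value = randomization_p_value(feature, labels)
```

A feature would be retained only if its p-value falls below a chosen significance level, so that scores inflated by chance are less likely to drive feature selection.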

Use Unlabelled Data
• In a spatial domain the (semi-supervised) smoothness assumption is implied by the positive autocorrelation of high-density regions
• The transductive setting is appropriate for spatial classification and regression
• Currently only one work on the transductive relational setting (Ceci et al., 2007)
• Promising results for spatial domains (Appice et al., 2007)

Collective Inference
In predictive data mining tasks, patterns may take the form:
    yi = f(xi, xN(i), yN(i))
where yi is the dependent variable in space i and yN(i) is the dependent variable in space N(i).
Both yi and yN(i) have to be inferred collectively.

Collective Inference
A possible approach:
• Locally-learned individual inference models
+
• Joint inference procedure (e.g., relaxation labelling)
Example: iterative classification (Neville & Jensen, 2000).
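The interplay between a local model and a joint inference procedure can be sketched with a very small iterative classification loop. This is an illustrative simplification, not the algorithm of Neville & Jensen: the local model is represented by precomputed scores, and the rule that averages the local score with the share of positive neighbors is an assumption made for this example.

```python
def iterative_classification(neighbors, local_score, known, max_iter=10):
    """neighbors: node -> list of neighboring nodes
    local_score: node -> score in [0, 1] from a locally learned model
    known: node -> 0/1 label for the nodes whose label is observed"""
    labels = dict(known)
    # Bootstrap: label the unknown nodes with the local model only.
    for node in neighbors:
        if node not in labels:
            labels[node] = int(local_score[node] >= 0.5)
    # Re-estimate each unknown label from its own score and its neighbors' labels.
    for _ in range(max_iter):
        changed = False
        for node, nbrs in neighbors.items():
            if node in known:
                continue
            rate = sum(labels[m] for m in nbrs) / len(nbrs) if nbrs else 0.5
            new_label = int(0.5 * local_score[node] + 0.5 * rate >= 0.5)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Hypothetical neighborhood graph: a, e and d are labelled, b and c are not.
neighbors = {"a": ["b"], "e": ["b"], "b": ["a", "e", "c"], "c": ["b", "d"], "d": ["c"]}
known = {"a": 1, "e": 1, "d": 1}
local_score = {"b": 0.4, "c": 0.3}
labels = iterative_classification(neighbors, local_score, known)
```

With the local scores alone, `b` and `c` would both be labelled 0; once the labels of their neighbors are taken into account, both flip to 1, which is the effect of positive autocorrelation on the prediction.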

Collective Inference
Joint relational model: it estimates the joint probability distribution over the variables both in i and N(i), and then jointly infers the values of both yi and yN(i).
• Probabilistic relational models (Getoor et al., 2001; Neville & Jensen, 2003)
• Autocorrelation is exploited to improve predictions

This inference procedure should be investigated in the context of spatial data mining

Hierarchies of Spatial Objects
• Spatial objects are often organized in hierarchies
• A hierarchy of areal objects may also be induced by the spatial relationship of containment

[Figure: a spatial hierarchy for UK census data: County; District1, District2, …, Districtn; Ward1, Ward2, …, Wardn within each district]

Hierarchies of Spatial Objects
• Spatial patterns involving the most abstract spatial objects are well supported, but less confident
• Spatial data mining methods should be able to explore the search space at different granularity levels.

Hierarchies of Spatial Objects
Naive approach:
• Level-by-level analysis
• Information on patterns found at a level is not used to make the search more efficient at a higher/lower level
More sophisticated approaches:
• GeoAssociator (Koperski & Han, 1995)
• SPADA (Malerba & Lisi, 2001)

Knowledge Rich Data Mining
• Knowledge is available on spatial phenomena
• In geography, there are many natural geographic dependencies
    • A port is adjacent to a water body
• Many non-novel and uninteresting patterns with very high support and confidence.
• Use known dependencies to prune uninteresting patterns (SPADA)

Knowledge Rich Data Mining
Stockport (UK): characterising the area served by the M63 motorway
• 12,466 strong association rules
• Many purely spatial patterns
    ed_on_M63(X), can_reach(X,Y) → is_a(Y, ward_on_m63_ED) (90.0 %, 100.0 %)

Embedding Spatial Reasoning
The process by which information about objects in space is used to arrive at valid conclusions regarding the objects’ relationships.
• Example: recursive definition of site accessibility
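A recursive definition of accessibility can be sketched as a reachability check over an adjacency relation. The slide does not spell out the definition, so the form assumed here is: X can reach Y if X is adjacent to Y, or X is adjacent to some Z that can reach Y. The `adjacent` relation and its identifiers are hypothetical.

```python
def can_reach(site, target, adjacent, visited=None):
    """can_reach(X, Y) holds if adjacent(X, Y), or if adjacent(X, Z) and
    can_reach(Z, Y) for some Z. The visited set prevents endless loops
    when the adjacency relation contains cycles."""
    if visited is None:
        visited = set()
    visited.add(site)
    for nxt in adjacent.get(site, ()):
        if nxt == target:
            return True
        if nxt not in visited and can_reach(nxt, target, adjacent, visited):
            return True
    return False


# Hypothetical directed adjacency relation between sites.
adjacent = {"ed1": ["ed2"], "ed2": ["ed3", "ed1"], "ed3": [], "ed4": ["ed1"]}

reachable = can_reach("ed1", "ed3", adjacent)    # ed1 -> ed2 -> ed3
unreachable = can_reach("ed3", "ed1", adjacent)  # ed3 has no outgoing links
```

In a logic-based system the same definition would typically be stated as two clauses in the background knowledge rather than as a procedure.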

Embedding Spatial Reasoning
Quantitative approach:
• Based on coordinates and distances
• More akin to machine reasoning
Qualitative approach (Freksa, 1991):
• Abstract representations (‘northwest’, ‘far’, …)
• Closely related to human reasoning
• Efficient
• Deals with imprecision, uncertainty and incompleteness

Embedding Spatial Reasoning
• Embedding spatial inference engines in spatial data mining systems: promising, but still unexplored.
• SPADA: a limited form of spatial inference is possible if rules of spatial reasoning are reported in the background knowledge


Summary
Spatial data mining presents several issues:
• Spatial objects have a geometry
• They are related
• They are of different types
• Autocorrelation affects spatial phenomena
Solutions offered in spatial statistics are limited:
• Double-entry table representation
• The choice of neighborhood matrix is critical
• Spatial dependencies are handled in pre-processing

Summary
The relational approach is the most appropriate
• Several methods have already been proposed for different tasks
But there are still many challenges:
• Dynamic handling of spatial dependencies & scalability
• Bias caused by autocorrelation
• Transductive inference
• Collective inference
• Hierarchies of objects
• Use of spatial knowledge
• Use of specific spatial reasoners
• …

Outlook
To develop effective solutions to spatial data analysis, it is necessary to develop synergies between researchers working on different research topics:
• Spatial statistics
• Multi-relational data mining
• Spatial databases and GIS
• Visualization
Will this happen?
Motivation for optimism: real applications (e.g., sales prediction for individual shops, urban data analysis, location-based services) demand this collaboration.

Spatial Data Mining @ CS Dept., Univ. of Bari
• Annalisa Appice
• Michelangelo Ceci
• Antonietta Lanza
• Antonio Turi
• Antonio Varlaro
Thanks to them for their valuable contribution to this research topic.
