Mining Spatial Data: Opportunities and Challenges of a Relational Approach
Donato Malerba
Department of Computer Science
University of Bari, Italy
August 30th - September 1st, 2007 - AVEIRO, PORTUGAL
Spatial Data Exploration: A Historical Example
• 1848: an epidemic of ‘Asiatic cholera’ hit London.
• John Snow observed the distribution of deaths throughout the city and hypothesized that river water contaminated by cholera evacuations explained the spatial variations in mortality throughout London.
Spatial Data Exploration: A Historical Example
• August 1854: the cholera epidemic hit an area of North London.
• J. Snow obtained the names and addresses listed on 83 death certificates from the Registry Office.
• He marked the cholera cases on a map.
Spatial Data Exploration: A Historical Example
• He also inventoried potential sources of contamination (pumps) and combined this information on the map.
• He observed that nearly all the deaths had taken place within a short distance of the pump in Broad Street.
Spatial Data Exploration: A Historical Example
• Snow persuaded the parish council to remove the pump handle.
• Not easy: the water provided by this pump was held in such high esteem that people came from neighboring streets for it.
• Result: the epidemic subsided.
Spatial Data Exploration: A Historical Example
• The council did not really believe Snow, so a curate repeated Snow’s work and considered other factors (cleanliness/filthiness of houses).
• The curate, who was initially biased against Snow’s theory, located 700 deaths within a 250-yard radius and showed that the use of water from the Broad Street pump was strongly correlated with death from Asiatic cholera.
Spatial Data Exploration: A Historical Example
A curiosity: Snow’s theory was supported by two pieces of ‘negative data’:
• No infection in the workhouse (it had its own well)
• No cases in the Lion Brewery (workers drank the beer)
Lessons Learned
Key elements of this success story:
• Identification of relevant spatial objects
  - Reference spatial objects (buildings where cholera cases occurred)
  - Task-relevant spatial objects (water pumps, wells, etc.)
• Identification of the properties of, and relationships between, relevant spatial objects (distance of buildings from water pumps, presence of wells)
Spatial Data Mining
The goal of spatial data mining is to automate the discovery of such correlations, which can then be examined by specialists for further validation and verification.
Agenda
• Modeling spatial information
• Spatial pattern
• Spatial data mining: main issues
• Opportunities for a relational approach
• A case study: spatial model trees
• Challenges for a relational approach
• Summary
Modeling Spatial Information
Two major approaches to the conceptual modeling of space:
• Field-based model
• Object-based model
Field-based Model
• The world is seen as a continuous surface over which features vary.
• Spatial variation is defined by a number of field functions:
  $$f: \mathbb{R}^n \to \text{Attribute Domain}$$
• Examples: elevation, temperature, precipitation

Field Operations
• Examples: addition (+) and composition (∘).
  $$f + g: x \mapsto f(x) + g(x)$$
  $$f \circ g: x \mapsto f(g(x))$$
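The two field operations can be illustrated with ordinary functions. This is a minimal sketch, assuming fields are Python callables over (x, y) points; the fields `elevation_m` and `snow_depth_m` and the unit conversion are hypothetical and serve only as examples.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Field = Callable[[Point], float]


def add_fields(f: Field, g: Field) -> Field:
    """Pointwise addition: (f + g)(p) = f(p) + g(p)."""
    return lambda p: f(p) + g(p)


def compose(f: Callable[[float], float], g: Field) -> Field:
    """Composition: (f o g)(p) = f(g(p))."""
    return lambda p: f(g(p))


# Hypothetical field functions, invented for illustration only.
def elevation_m(p: Point) -> float:
    x, y = p
    return 100.0 + 10.0 * x + 5.0 * y


def snow_depth_m(p: Point) -> float:
    _, y = p
    return 0.5 * y


def metres_to_feet(v: float) -> float:
    return v * 3.28084


surface_m = add_fields(elevation_m, snow_depth_m)   # f + g
elevation_ft = compose(metres_to_feet, elevation_m)  # f o g

print(surface_m((1.0, 2.0)))
print(elevation_ft((0.0, 0.0)))
```

Both results are again field functions, so they can be combined further with the same operations.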
Object-based Model
• The world is seen as a surface littered with distinct, identifiable and relevant things or entities, called objects, which exist independently of their locations.
• Objects can be:
  - Zero-dimensional (point-like)
  - One-dimensional (linear)
  - Two-dimensional (areal)
• Operations on spatial objects:
  - Topological
  - Directional
  - Distance-based
Field-based vs. Object-based
[Figure: a forest area over 0 ≤ x ≤ 4, 0 ≤ y ≤ 4 divided into three stands (Pine, Fir, Oak), shown from the object viewpoint and from the field viewpoint]

Object Viewpoint of Forest Stands
Area-ID   Dominant Tree Species   Area/Boundary
FS1       Pine                    [(0,2),(4,2),(4,4),(0,4)]
FS2       Fir                     [(0,0),(2,0),(2,2),(0,2)]
FS3       Oak                     [(2,0),(4,0),(4,2),(2,2)]

Field Viewpoint of Forest Stands
$$f(x,y) = \begin{cases} \text{"Pine"}, & 0 \le x \le 4;\ 2 \le y \le 4 \\ \text{"Fir"}, & 0 \le x \le 2;\ 0 \le y \le 2 \\ \text{"Oak"}, & 2 \le x \le 4;\ 0 \le y \le 2 \end{cases}$$
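Spatial Pattern (Field-based view)
A function obtained as a combination of field functions according to field operations.
Example, with f(x,y) = precipitation in (x,y): the average precipitation over region A is greater than the average over region B.
$$\frac{1}{size(A)} \iint_A f(x,y)\,dx\,dy > \frac{1}{size(B)} \iint_B f(x,y)\,dx\,dy$$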
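The field-based pattern can be checked numerically by approximating each average with a midpoint rule over a grid. This is only a sketch: the `precipitation` field and the two rectangular regions are hypothetical, and real regions would rarely be axis-aligned rectangles.

```python
def mean_over_rectangle(f, x0, x1, y0, y1, n=100):
    """Approximate (1/size(R)) * double integral of f over the rectangle R
    using the midpoint rule on an n x n grid."""
    dx = (x1 - x0) / n
    dy = (y1 - y0) / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += f(x0 + (i + 0.5) * dx, y0 + (j + 0.5) * dy)
    return total / (n * n)


# Hypothetical precipitation field (mm/year), invented for illustration.
def precipitation(x, y):
    return 500.0 + 100.0 * x - 20.0 * y


# Region A: 2 <= x <= 4, 0 <= y <= 2; region B: 0 <= x <= 2, 0 <= y <= 2.
mean_a = mean_over_rectangle(precipitation, 2.0, 4.0, 0.0, 2.0)
mean_b = mean_over_rectangle(precipitation, 0.0, 2.0, 0.0, 2.0)
pattern_holds = mean_a > mean_b

print(mean_a, mean_b, pattern_holds)
```

Because the example field is linear, the midpoint rule is exact up to rounding; for a general field the grid resolution `n` controls the approximation error.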
Spatial Pattern (Object-based view)
It expresses a spatial relationship among spatial objects.
• If the wetlands area is near open water then there is a nest of a red-winged blackbird (classification rule, specifically a location prediction rule)
• The price of a house near the river is 2000*Size + 5000 (regression rule)
• Trajectories of monitored cars group along the direction from downtown to the residential suburbs (cluster)
• A country that is adjacent to the Mediterranean Sea is a wine exporter (association rule)
Spatial Data Mining
Spatial data mining: the extraction of interesting and useful but implicit spatial patterns (adapted from the definition of KDD).
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery: an overview. In: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.): Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press (1996) 1-35.
What’s Special about Spatial Data Mining? (1/4)
The formulation of a spatial data mining method must take into account the logical representation of spatial information.
Two modes:
• Tessellation
• Vector
Tessellation Mode
• Regular or irregular
• Field-based data in tessellation mode
• Object-based data in tessellation mode, e.g. the cell list ⟨2,3,6,7,8,12,13,14,18,19⟩
[Figure: grid in which cells 2, 3, 6, 7, 8, 12, 13, 14, 18 and 19 are highlighted]
Vector Mode
• Ordered sets of xy-coordinates defining points, lines, or polygons
• Field-based data in vector mode
• Object-based data in vector mode
Tessellation vs. Vector
Vector:
☺ Storage efficient
☺ Easy to scale (though some operations are complex)
☹ Difficult to convert remote sensing images into this format
☹ Difficult to check a large number of constraints (is a polygon convex?)
☺ Supported by spatial DBMS
Tessellation:
☹ Large memory space required
☹ Operations on objects are time consuming
☺ Large volumes of spatial data (e.g., remote sensing images) are available
Hybrid Modes also Possible
• Spatial objects in a regular grid of cells
• Tessellation: single pixels are classified; image processing operators are needed
• Vector: images are transformed; spatial relationships are computed
[Figure: an industrial area shown in tessellation mode and in vector mode]
Representation of Spatial Data in a Spatial DBMS
• Spatial information is represented in different layers, one for each type of spatial object.
• Layer: a database relation $R_i$ with a number of elementary attributes $A_i^1, \dots, A_i^{m_i}$ and possibly a geometry attribute $G_i$.
• Vector representation for $G_i$, which can be a point, a line or a polygon.
• A reference system defines the coordinates of single points and of the vertices of lines and polygons.

Layer
• A spatial object or a field function is represented by one or more tuples of a …
• The geometry attribute is represented in vector mode.

wells
ID     Location   Depth   Type
2357   (15,22)    20      drilled
What’s Special about Spatial Data Mining? (2/4)
Different types of spatial objects ⇒ several layers in a spatial DB

wells
ID     Location   Depth   Type
2357   (15,22)    20      drilled

buildings
ID     Surface     Size   Owner
AD18   (polygon)   250    Smith
What’s Special about Spatial Data Mining? (3/4)
Spatial objects have a locational property which implicitly defines spatial relationships between objects:
• Topological
• Distance-based
• Directional
Topological Relations
• Invariant under homeomorphisms (e.g., rotation, translation and scaling)
• Semantics defined by the 9-intersection model
• For regions: disjoint, meet, contains, covers, overlaps, equal, inside, covered by
Distance Relations
• Metric
  - Euclidean distance between two points
  - For polygons it is an aggregate function (e.g., the minimum)
• Non-metric
  - Typically defined on the basis of a cost function (e.g., drive time)
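A metric distance relation between polygons can be sketched as an aggregate over point-to-point Euclidean distances. The version below takes the minimum over vertex pairs only, which is an assumption made for brevity: the true minimum distance between two polygons may be attained on an edge rather than at a vertex. The two example polygons are hypothetical.

```python
import math
from itertools import product


def euclidean(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def min_vertex_distance(poly_a, poly_b):
    """Aggregate (minimum) of the Euclidean distances over all vertex pairs.
    Simplification: edges are ignored, so this is an upper bound on the
    true minimum distance between the two polygons."""
    return min(euclidean(p, q) for p, q in product(poly_a, poly_b))


# Hypothetical polygons, each a list of vertices.
stand = [(0, 0), (2, 0), (2, 2), (0, 2)]
lake = [(5, 0), (7, 0), (7, 2), (5, 2)]

print(min_vertex_distance(stand, lake))
```

A non-metric relation would replace `euclidean` with a cost function such as an estimated drive time, which need not be symmetric.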
Directional Relations
• Based on an angle α
• Based on the extension of Allen’s algebra
Spatial Relations
• In a spatial DB, different spatial relations ρ implicitly define spatial joins between two layers $R_i$ and $R_j$: $R_i \bowtie_\rho R_j$
• Too many spatial joins are implicitly defined.
• Efficient computation of spatial relations is a must when developing spatial data systems.
Abstraction from Physical Representation
• Interest in properties that are not related to the physical representation
• Well-defined semantics (e.g., the 9-intersection model) is not enough.
• Example: two roads can cross each other, run parallel, or be confluent, regardless of whether they are represented as “lines” or “regions”.
What’s Special about Spatial Data Mining? (4/4)
• Spatial autocorrelation: the value of a property observed at a location depends on the values of properties observed at neighboring locations.
  - Positive autocorrelation: neighboring values are more similar; the values of a given property are highly uniform among similar spatial objects in the neighborhood.
  - Negative autocorrelation: neighboring values are less similar.
• Tobler’s First Law of Geography: “everything is related to everything else, but near things are more related than distant things” (Tobler, 1970)
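One common way to quantify positive versus negative autocorrelation is Moran's I. The slides do not name this statistic, so it is brought in here purely as an illustration; the four-cell chain and the attribute values are hypothetical.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weight matrix.
    Positive results indicate that neighbors tend to be similar,
    negative results that neighbors tend to differ."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


# Hypothetical layout: four cells in a row, each adjacent to the next one.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 1, 5, 5], chain)    # similar values side by side
alternating = morans_i([1, 5, 1, 5], chain)  # dissimilar values side by side

print(clustered, alternating)
```

The clustered arrangement yields a positive value and the alternating one a negative value, matching the two cases listed above.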
Spatial Error and Spatial Lag
Two primary types of autocorrelation:
• Spatial error
• Spatial lag
  - Spatially lagged explanatory variables
  - Spatially lagged response variables
[Figure: dependency diagrams among x_i, x_j, y_i, y_j, ε_i and ε_j for each case]
Violated Assumptions
• Spatial error violates the assumption that error terms are uncorrelated.
• Spatial lag violates the assumption that observations are independent (as well as the assumption that error terms are uncorrelated).
“Anyone seriously interested in prediction when the sample data exhibit spatial dependence should consider a spatial model” (LeSage & Pace, 2001)
Limits of Traditional Data Mining Methods
• Attributes are of numeric and discrete type (no geometry).
• Observations cannot be of different types (e.g., wells, buildings, etc.).
• Spatial relationships between observations are not represented or considered.
Spatial Statistics
Spatial dependence is typically modeled by linear models of the form
$$y = X\alpha + \beta D y + \gamma D X + \varepsilon$$
• y: vector of observations of the dependent variable
• X: matrix of observations of the independent variables
• α: strength of local influence
• β: strength of spatial dependence on response variables
• γ: strength of spatial dependence on explanatory variables
• D: spatial weight matrix (or neighborhood matrix)
Spatial Weight Matrix
• Contains a ‘d’ term for every pair of observations in the data set.
• ‘d’ may be the inverse distance between observations, or 0/1 according to whether they share a border and/or a vertex.
• The choice of spatial weight matrix is often made ad hoc and a priori.
• Provides the ‘structure’ of assumed spatial relationships.
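As a rough illustration of how D enters the model, the sketch below builds an inverse-distance weight matrix and computes the spatially lagged response Dy. Row standardization (each row of D summing to 1) is an assumption added here, not something the slides prescribe, and the three locations and response values are hypothetical.

```python
import math


def inverse_distance_weights(points):
    """d[i][j] = 1 / distance(i, j) for i != j, and 0 on the diagonal."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d[i][j] = 1.0 / math.dist(points[i], points[j])
    return d


def row_standardize(d):
    """Scale each row to sum to 1 (an assumed, commonly used convention)."""
    out = []
    for row in d:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def spatial_lag(d, y):
    """Compute the vector D y: a weighted average of neighboring responses."""
    return [sum(w * v for w, v in zip(row, y)) for row in d]


# Hypothetical observation sites and response values.
sites = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
y = [10.0, 20.0, 40.0]

D = row_standardize(inverse_distance_weights(sites))
lagged_y = spatial_lag(D, y)

print(lagged_y)
```

Swapping `inverse_distance_weights` for a 0/1 contiguity matrix changes the assumed structure of spatial relationships without touching the rest of the computation, which is exactly why the choice of D matters.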
Problems with Spatial Models
• D has to be carefully defined.
• How can D express the contribution of different spatial relationships?
• Spatial dependencies are all handled in a pre-processing or feature extraction step.
• All spatial objects involved in the spatial phenomenon (rows of X) are uniformly represented by the same set of attributes.
• No clear difference between reference (target) and task-relevant objects.
Reference vs. Task-relevant Objects
• Reference objects: the main subject of analysis
• Task-relevant objects: objects in the neighborhood that can contribute to explaining the spatial variation
Example: find associations involving large towns of Apulia (Italy)
• Reference objects: large towns
• Task-relevant objects: water bodies, roads, province boundaries
Towards a Multi-Relational Representation
• In spatial data mining the units of analysis are typically composed of several spatial objects with different properties.
• Their spatial structure cannot be accommodated in a classical double-entry table.
• A better representation: a set of relations R1 … Rj, some of which are layers.
• Foreign key constraints and spatial relations define the possible joins.
Example
Problem: investigate the social effects of public transportation in a British city
Spatial data set:
• ED (ID, Area), e.g. ID 03bsfc01
• BL (Name, Line, Type), e.g. Name 15a, Type main
• CE (ED, #households no car, #households 1 car, #households ≥2 cars), e.g. ED 03bsfc01 with counts 80 67143
Example
• A unit of analysis corresponds to an ED (reference object) described in terms of the number of cars per household and the crossing bus lines (task-relevant objects).
• Relational pattern: “the enumeration districts with a high percentage of households which own fewer than two cars are served by at least two bus lines, one of which is a main bus line”
Multi-relational Data Mining
• MRDM tools can be applied directly to data distributed over several relations to find relational patterns which involve multiple relations.
• Relational patterns can be expressed in SQL but also in first-order logic.
Relational Patterns in SQL and Logic
SQL:
  SELECT DISTINCT ED.ID
  FROM ED, CE, BL AS BL1, BL AS BL2
  WHERE CE.ED = ED.ID
    AND (HHNOCARS + HH1CAR) / (HHNOCARS + HH1CAR + HH2CARS) * 100 > 60
    AND INTERSECTS(ED.AREA, BL1.LINE)
    AND INTERSECTS(ED.AREA, BL2.LINE)
    AND BL1.NAME <> BL2.NAME
    AND BL1.TYPE = 'MAIN'
First-order logic:
  ed(X), ce(X, HHNOCARS, HH1CAR, HH2CARS), bl(BL1), bl(BL2),
  (HHNOCARS + HH1CAR) / (HHNOCARS + HH1CAR + HH2CARS) * 100 > 60,
  intersects(X, BL1), intersects(X, BL2), BL1 ≠ BL2, main(BL1)
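The relational pattern can also be evaluated directly on in-memory relations. This is a toy sketch, assuming geometries are represented as sets of grid cells and that `intersects` means "shares at least one cell"; a real spatial DBMS would use true geometric predicates. All tuples are invented, and the percentage is computed over households owning fewer than two cars, following the wording of the pattern.

```python
from itertools import permutations

# Hypothetical relations; geometries are sets of grid cells (an assumption).
ED = [
    {"id": "ed1", "area": {(0, 0), (0, 1), (1, 0), (1, 1)}},
    {"id": "ed2", "area": {(5, 5), (5, 6)}},
]
CE = [
    {"ed": "ed1", "hh_no_car": 50, "hh_1_car": 30, "hh_2_cars": 20},
    {"ed": "ed2", "hh_no_car": 10, "hh_1_car": 10, "hh_2_cars": 80},
]
BL = [
    {"name": "A", "line": {(0, 0), (0, 1), (0, 2)}, "type": "main"},
    {"name": "B", "line": {(1, 1), (2, 1)}, "type": "secondary"},
    {"name": "C", "line": {(5, 5), (6, 5)}, "type": "main"},
]


def intersects(geom_a, geom_b):
    """Toy spatial predicate: two geometries share at least one grid cell."""
    return bool(geom_a & geom_b)


def matching_eds(ed, ce, bl, threshold=60.0):
    """IDs of EDs where more than `threshold` percent of households own
    fewer than two cars and which are crossed by at least two distinct
    bus lines, one of them a main line."""
    census = {row["ed"]: row for row in ce}
    result = []
    for e in ed:
        c = census.get(e["id"])
        if c is None:
            continue
        total = c["hh_no_car"] + c["hh_1_car"] + c["hh_2_cars"]
        if total == 0:
            continue
        pct = 100.0 * (c["hh_no_car"] + c["hh_1_car"]) / total
        if pct <= threshold:
            continue
        # Ordered pairs of distinct bus lines play the roles of BL1 and BL2.
        for bl1, bl2 in permutations(bl, 2):
            if (bl1["type"] == "main"
                    and bl1["name"] != bl2["name"]
                    and intersects(e["area"], bl1["line"])
                    and intersects(e["area"], bl2["line"])):
                result.append(e["id"])
                break
    return result


matches = matching_eds(ED, CE, BL)
print(matches)
```

The nested loop over pairs of bus lines mirrors the self-join of BL in the SQL formulation; the spatial predicate is the only part that needs geometry.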
Two Settings for MRDM
• Finding relational patterns within units of analysis represented as sets of tuples
  - Each unit of analysis includes a single reference object and is represented by a sub-database of the original one.
• Finding patterns within the whole database
Individual-Centered Representations
Several advantages:
• Positive PAC-learnability results
• Methods working under the single-table assumption are easier to upgrade (e.g., the notion of unit of analysis simplifies sampling)
• More efficient (one unit of analysis is processed at a time)

Individual-Centered Representations
But …
• Units of analysis with a single reference object might not be easy to define.
• In spatial data mining, the unit of analysis should be carefully selected so that
  - autocorrelation is considered
  - the size of the neighborhood is limited
Example of Autocorrelation
Spatially lagged explanatory variables
• There is no communal establishment (schools, hospitals) in an ED, but many of them are located in the nearby EDs.
• Reference object: ED
• Task-relevant objects: communal establishments in the nearby EDs
[Figure: an ED i surrounded by neighboring EDs j, each labeled $X_{r,j}$; ED i is described by $(x_{1,i}, \dots, x_{k,i}, x_{r,j})$]
Example of Autocorrelation
Spatially lagged response variables
• The price level for a good at a retail outlet in a city depends on the price for the same good in the vicinity.
• Reference object: ED
• Task-relevant objects: EDs in the neighborhood
[Figure: an ED i surrounded by neighboring EDs j, each labeled $X_{r,j}$ and $Y_j$; ED i is described by $(x_{1,i}, \dots, x_{k,i}, y_j)$]
A Recipe for MRDM Systems
• Start from a well-known data mining system working on the classical double-entry table representation.
• Upgrade
  - the generality order of patterns (e.g., θ-subsumption)
  - the generalization/specialization operators
  - the similarity measure, …
  to deal with several relations.
• Build the new system, retaining as much as possible from the original one.
Additional Ingredients for a Relational Approach to Spatial Data Mining
• Define a representation of spatial objects
• Define operators for spatial joins
• Optimize the computation of spatial joins with spatial indexes
• Distinguish reference from task-relevant objects
• Visualize spatial patterns (e.g., on a map)
Relational Systems for Spatial DM
Spatial Association Rule Discovery
• SPADA system (Malerba & Lisi, ILP 2001)
• Task-relevant data are organized hierarchically.
• Spatial patterns are found at different granularity levels.
[Figure: three-level hierarchies (levels 1 to 3) of task-relevant objects]
• Road net: road_net, with Motorway, A_road, B_road and PrimaryRoad
• Water net: water_net, with canal, river and water
• Rail net: rail_net, with rail
Relational Systems for Spatial DM
Spatial Association Rule Discovery
• Spatial pattern: a conjunction of first-order logic atoms
• The space of spatial patterns is ordered by θ-subsumption.
• Monotonicity of support w.r.t. θ-subsumption ⇒ pruning of patterns at the same granularity level in the candidate generation phase
• Monotonicity of pattern frequency w.r.t. granularity level ⇒ pruning of patterns at different granularity levels in the candidate generation phase
Relational Systems for Spatial DM
Spatial Association Rule Discovery
• Efficiency improvement of pattern evaluation by caching the support objects of each stored pattern
• Definition of a declarative bias to filter out rules on the basis of users’ preferences; efficiency improvement is a byproduct
• Integration of SPADA in the ARES system, which interfaces a spatial DB (Oracle Spatial)
Relational Systems for Spatial DM
Spatial Classification
• Based on associative classification (Ceci et al., ECML/PKDD 2004)
• SPADA is used to extract strong multi-level spatial association rules with exactly one literal, representing the class label, in the consequent.
• Strong rules are then used to build a relational naive Bayesian classifier.
Relational Systems for Spatial DM
Spatial Clustering
• CORSO (presented at this conference)
• Units of analysis are described by several relations.
• A relational distance measure is used to cluster them.
• Units of analysis are themselves spatially related.
Relational Systems for Spatial DM
Spatial Clustering
• Discrete spatial structure: a directed graph where
  - nodes correspond to units of analysis
  - links correspond to spatial relations between units of analysis
[Figure: graph of the “neighbouring regions” relation among Apulia, Molise, Basilicata, Campania, Calabria, Abruzzo, Latium, Tuscany, Sicily, …]
Relational Systems for Spatial DM
Spatial Clustering
• CORSO combines graph-based partitioning with multi-relational clustering.
• Clusters are described by means of a logical theory.
[Figure: the neighbouring-regions graph (Apulia, Molise, Basilicata, Campania, Calabria, Abruzzo, Latium, Tuscany, Sicily, …)]
Relational Systems for Spatial DM
Spatial Clustering
Related work: GDBSCAN (Sander et al., 1998)
Pros
• Spatial relations between the objects to be clustered are considered
Cons
• Data stored in a single double-entry table
• No description of clusters
Relational Systems for Spatial DM
Spatial Regression
• Mrs-SMOTI (Malerba et al., ECML/PKDD 2005)
• It generates relational model trees from a collection of tables (some of which are layers).
• It extends its predecessor SMOTI:
  - tight integration with a spatial database to mine spatial relationships and properties implicit in the data
  - search strategy modified to capture the implicit relational structure of spatial data (intra-layer and inter-layer)
  - intra-layer relationships make spatially lagged response attributes available in addition to spatially lagged explanatory attributes
Classical Regression Problem
Given
• m independent (or explanatory) attributes Xi (both continuous and discrete)
• a continuous dependent (or response) attribute Y to be predicted
• a set of n training cases (x1, x2, …, xm, y)
Build
• a function y = g(x) that correctly predicts the value of the response attribute for each m-tuple (x1, x2, …, xm)
Problem 1: Spatial Arrangement
If spatial heterogeneity of the response is anticipated, then allow
• the constant
• one or more of the other regression parameters
to vary spatially.
Example: residential areas have a higher number of migrants
$$Y_i = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i} + \gamma D_i + e_i$$
$D_i$ is a dummy variable: 1 if site i is in a residential area, 0 otherwise.
Problem 1: Spatial Arrangement
$$Y_i = \beta_0 + (\beta_1 + \gamma D_i) x_{1,i} + \dots + \beta_k x_{k,i} + e_i$$
In this case the slope parameter associated with variable $x_1$ varies.
Issues:
• clumsy generalization to more than two influence areas
• difficult to establish a priori which variables (if any) are actually affected by spatial arrangement
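Writing out the two regimes of each dummy-variable model may make the first issue more concrete. The block below is only a worked restatement of the two equations of this slide pair, using the same symbols.

```latex
% Varying constant: the dummy shifts the intercept by \gamma
D_i = 0:\quad Y_i = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i} + e_i \\
D_i = 1:\quad Y_i = (\beta_0 + \gamma) + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i} + e_i

% Varying slope: the dummy shifts the coefficient of x_1 by \gamma
D_i = 0:\quad Y_i = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i} + e_i \\
D_i = 1:\quad Y_i = \beta_0 + (\beta_1 + \gamma)\, x_{1,i} + \dots + \beta_k x_{k,i} + e_i
```

Under this reading, each additional influence area would presumably need its own dummy variable and its own γ for every parameter that is allowed to vary, which is one way to see why the generalization becomes clumsy.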
Model Tree
• A tree structure is generated according to a top-down strategy:
  - partitioning of the training set
  - local regression models
[Figure: a tree whose root tests X1 ≤ 3, with a leaf model Y = 3 + 2X1]
Spatial Regression Problem
• The response attribute Y (e.g., number of migrants) is associated with a location (e.g., an ED).
• Explanatory variables Xi are also associated with locations.
In standard model tree learning methods:
• the arrangement properties of spatial objects are disregarded
• observations are assumed to be independent
Model Trees: State of the Art
Statistics
• Ciampi (1991): RECPAM
• Siciliano & Mola (1994)
• …
Data Mining
• Karalic (1992): RETIS
• Quinlan (1992): M5
• Wang & Witten (1997): M5’
• Lubinsky (1994): TSIR
• Torgo (1997): HTL
• Malerba et al. (2004): SMOTI
• …
No state-of-the-art method tries to mine model trees that deal with spatial structure!
Dealing with local & global effects
• Some explanatory attributes can have a spatially global effect on the response attribute, while others have only a spatially local effect.
[Figure: a model tree with leaf models Y = 0.9, Y = 3 + 1.1X1 and Y = 3X1 + 1.1X2]
• The model tree does not show the possibly global effect of X1.
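Dealing with local & global effects
A tree structure with splitting and regression nodes
• Splitting nodes perform a Boolean test.
  - Continuous variable: node t tests Xi ≤ α; its children tL and tR carry straight-line models such as Y = a + bXu and Y = c + dXw.
  - Discrete variable: node t tests Xi ∈ {xi1, …, xih}; its children tL and tR carry straight-line models such as Y = a + bXu and Y = c + dXw.
• Regression nodes compute only a straight-line regression. They have only one child.
  - Node t fits Y = a + bXi. In its child t’, each remaining variable is replaced by its residual, X’j = Xj - (aj + bjXi); a split X’j ≤ α then leads to t’L and t’R with models Y = c + dX’u and Y = e + fX’w.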
Dealing with local & global effects
SMOTI: Stepwise Model Tree Induction (Malerba et al., IEEE Trans. Pattern Analysis & Mach. Intell. 2004)
• Leaves are associated with a straight-line regression function.
• The multiple regression model associated with a leaf is the composition of the straight-line regression functions found along the path from the root to the leaf.
[Figure: a model tree T with nodes 0 to 7: regression node 0 (Y = a + bX1), split nodes X’3 ≤ α, X’2 ≤ β and X’4 ≤ γ, and straight-line models Y’ = c + dX’3, Y’ = e + fX’2, Y’ = g + hX’3 and Y’ = i + lX’4]
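The composition of straight-line regressions along a path can be sketched on a tiny example with two explanatory variables and no splits. This is an illustration of the residual-based stepwise idea only, not the SMOTI implementation; the data are synthetic and generated from Y = 1 + 2·X1 + 3·X2 so that the composed model can be checked.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


# Synthetic training data (hypothetical): Y = 1 + 2*X1 + 3*X2 exactly.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 3.0, 2.0, 5.0, 4.0]
y = [1.0 + 2.0 * u + 3.0 * v for u, v in zip(x1, x2)]

# Regression node: straight-line regression of Y on X1.
a, b = fit_line(x1, y)
# The response is replaced by its residual: Y' = Y - (a + b*X1).
y_res = [yi - (a + b * u) for yi, u in zip(y, x1)]
# The other variable is replaced by its residual: X2' = X2 - (a2 + b2*X1).
a2, b2 = fit_line(x1, x2)
x2_res = [v - (a2 + b2 * u) for v, u in zip(x2, x1)]

# Leaf: straight-line regression of Y' on X2'.
c, d = fit_line(x2_res, y_res)


def predict(u, v):
    """Compose the straight-line models found along the path root -> leaf."""
    return (a + b * u) + c + d * (v - (a2 + b2 * u))


print(d, predict(2.0, 3.0))
```

On this noise-free data the composed prediction reproduces the generating multiple regression, which is the sense in which a leaf's model is the composition of the straight-line functions above it.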
Dealing with Spatial Autocorrelation
• Augment ED information by exploiting intra-layer relationships (e.g., neighborhood)
ED #MigrInWards #Establishments #Employees on 10% sample population
#Migrants
Italy 3 1 9 4382…
03BSFA18 10 1 45
03BSFN01 18 5 73
… … … …
Reference ED   Neighbouring ED
03BSFA04       03BSFA05
03BSFA04       03BSFB18
03BSFA04       03BSFQ01
…              …
Dealing with Spatial Autocorrelation
Other spatial objects, different from the areas where Y is measured, can easily be accommodated in this framework through inter-layer relationships.
Example:
Model Trees + Spatial Data Structure = Mrs-SMOTI
• Mrs-SMOTI is the spatial extension of SMOTI (Stepwise Model Tree Induction).
• INPUT: spatial objects, possibly belonging to separate layers, stored in a spatial database S
  - reference objects (main subject of analysis)
  - task-relevant objects
• OUTPUT: a spatial model tree T, built by
  - partitioning the training spatial data according to intra-layer and inter-layer relationships
  - associating different regression models with disjoint spatial areas
Split Node
Binary split nodes involve:
1. Boolean tests on spatial relationships (either intra-layer or inter-layer)
   Example: partitioning EDs by the presence or absence of roads (EDs crossed by some road vs. EDs not crossed by any road). An extra layer is added to the spatial model.
Split Node
2. Boolean tests on
   - thematic attributes of a layer
   - spatial properties implicitly defined by the geometry of a layer (e.g., area for polygons, extension for lines)
   These Boolean tests involve only layers already included in the model.
Spatial Regression Node
It performs a straight-line regression on either a continuous thematic attribute or a continuous spatial property.
• The response attribute and the continuous explanatory attributes are replaced with residuals ⇒ stepwise regression.
• The regression attribute comes from a layer already included in the model.
• When a new layer is added to the model, its continuous thematic and spatial attributes are replaced with the corresponding residuals.
Spatial Database Integration
How?
• Object-relational data representation (Oracle Spatial)
• Spatial patterns associated with splitting and regression nodes are expressed by spatial queries.
Example:
  SELECT * FROM EDs x, ROADS y
  WHERE SDO_GEOM.RELATE(x.geometry, 'ANYINTERACT', y.geometry, 0.001) = 'TRUE'
Mining Stockport Census Data
GOAL: investigate social phenomena related to unemployment
SPATIAL DATASET: Stockport (Greater Manchester, UK)
• Reference objects: 578 EDs in Stockport
• Task-relevant objects:
  - shopping areas (53 objects)
  - employment areas (30 objects)
  - housing areas (9 objects)
Mining Stockport Census Data
Two experimental settings:
• B0 is obtained by considering only the ED layer
• B1 is obtained by considering all layers (L1+L2)
Mining Stockport Census Data
Portion of the spatial model tree built by Mrs-SMOTI on the entire dataset (B1 setting):

-- split on EDs’ number of migrants [≤ 47] (578 EDs)
---- regression on EDs’ area (458 EDs)
------ split on EDs-Shopping areas spatial relationship (94 EDs)
       …
------ split on EDs’ number of migrants (364 EDs)
       …
---- split on EDs’ area (120 EDs)
------ leaf on EDs’ area (22 EDs)
------ regression on EDs’ area
       …

Node types shown: Boolean test on a thematic attribute (number of migrants), test on an inter-layer relationship (EDs-Shopping areas), Boolean test on a spatial property (area).
Mining Stockport Census Data
• 10-fold cross validation
• Average Mean Square Error (Avg.MSE)
• Systems: Mrs-SMOTI vs. SMOTI and M5’
• Two transformations of the original multi-relational data into a classical double-entry table, obtained by computing:
  - P1: spatial joins according to all possible intra-layer and inter-layer relationships ⇒ multiple tuples are generated for the same reference object
  - P2: average values for continuous attributes ⇒ one tuple for each reference object
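The difference between the two transformations can be sketched on toy relations. Everything below is hypothetical: the relation contents, the attribute names and the assumption that the spatial relationship between EDs and shopping areas has already been computed and stored as a list of pairs.

```python
from collections import defaultdict

# Hypothetical reference objects (EDs) and task-relevant objects (shops).
eds = {"e1": {"migrants": 12}, "e2": {"migrants": 30}}
shops = {
    "s1": {"shop_area": 100.0},
    "s2": {"shop_area": 300.0},
    "s3": {"shop_area": 50.0},
}
# Assumed to be the result of a spatial join (which ED interacts with which shop).
ed_shop_links = [("e1", "s1"), ("e1", "s2"), ("e2", "s3")]


def p1_join(refs, related, links):
    """P1-style table: one row per (reference, related) pair, so a
    reference object may appear in several rows."""
    return [{"ed": r, **refs[r], **related[s]} for r, s in links]


def p2_average(refs, related, links, attr):
    """P2-style table: one row per reference object, with the continuous
    attribute of its related objects averaged. In this sketch, reference
    objects without any related object are simply left out."""
    grouped = defaultdict(list)
    for r, s in links:
        grouped[r].append(related[s][attr])
    return [{"ed": r, **refs[r], "avg_" + attr: sum(v) / len(v)}
            for r, v in grouped.items()]


p1_table = p1_join(eds, shops, ed_shop_links)
p2_table = p2_average(eds, shops, ed_shop_links, "shop_area")

print(p1_table)
print(p2_table)
```

P1 repeats the ED attributes across rows, which breaks the one-row-per-observation reading of a double-entry table, while P2 keeps one row per ED at the cost of collapsing the related objects into a summary value.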
Mining Stockport Census Data
Mrs-SMOTI vs. SMOTI and M5’
• Average (avg) and standard deviation (std) of the mean square error (MSE) and number of leaves (#L) of the learned models
• Mrs-SMOTI always performs better (in terms of Avg.MSE) than SMOTI and M5’.
Spatial Relationships Not Explicitly Modeled
• MRDM methods take advantage of the information on the data model reported in the DB schema (e.g., foreign keys) to guide the search process.
• But … spatial relationships are not explicitly modeled in the schema of a spatial DB.
Spatial Relationships Not Explicitly Modeled
Pre-compute spatial relationships:
• Spatial weight matrix D for spatial linear models
• Single DB relation in GeoMiner (Han et al., 1997)
• Materialize distance, direction and topological relations (Ester et al., 1999)
• Extract spatial relations and represent them as first-order predicates (Appice et al., 2005)
Spatial Relationships Not Explicitly Modeled
Pros
• Spatial DBs are rather static
Cons
• Very large number of spatial relationships between two layers
• Some of them might be extracted unnecessarily
Dynamically compute spatial relationships, but which of them?
Feature Selection Bias
Concentrated linkage (Jensen & Neville, 2002)
• High concentration of objects linked to a common neighbor
[Figure: ED and BL objects arranged along a linkage scale from 0 to 1]
Feature Selection Bias
Relational autocorrelation
• The values of a given attribute are highly uniform among objects that share a common neighbor.
[Figure: EDs sharing a common BL, with + and - attribute values, along an autocorrelation scale from 0 to 1]
Feature Selection Bias
• High linkage and autocorrelation (frequent in true spatial phenomena)
  ⇒ decreased effective sample size
  ⇒ increased variance of the estimated scores
  ⇒ bias increases as variance increases
• Feature selection algorithms are biased in favor of features with large variance (even when they are not related to the class attribute).
Feature Selection Bias
• The χ²-test for independence fails to discard uninformative features (it is based on the i.i.d. assumption).
• Most MRDM algorithms do not account for this bias.
• Exception: relational probability tree learning uses a randomization test to adjust for feature selection bias (Neville et al., 2003).
Use Unlabelled Data
• In a spatial domain the (semi-supervised) smoothness assumption is implied by positive autocorrelation of high density regions.
• The transductive setting is appropriate for spatial classification and regression.
• Currently only one work on the transductive relational setting (Ceci et al., 2007)
• Promising results for spatial domains (Appice et al., 2007)
Collective Inference
In predictive data mining tasks, patterns may take the form
$$y_i = f(x_i, x_{N(i)}, y_{N(i)})$$
where $y_i$ is the dependent variable in space i and $y_{N(i)}$ is the dependent variable in space N(i).
Both $y_i$ and $y_{N(i)}$ have to be inferred collectively.
Collective Inference
A possible approach:
• Locally learned individual inference models
  +
• Joint inference procedure (e.g., relaxation labelling)
Example: iterative classification (Neville & Jensen, 2000).
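The combination of a local model with a joint inference loop can be sketched as follows. This is a deliberately simplified illustration in the spirit of iterative classification, not the algorithm of Neville & Jensen: the local model, the mixing weight, the 0.5 threshold and the toy neighborhood graph are all assumptions.

```python
def iterative_classification(x, neighbors, known, local_score,
                             weight=0.5, max_iter=20):
    """Bootstrap unknown labels from a local model, then repeatedly
    re-label each unknown node using its own score and the current
    labels of its neighbors, until no label changes."""
    labels = {}
    for node in x:
        if node in known:
            labels[node] = known[node]
        else:
            labels[node] = int(local_score(x[node]) >= 0.5)
    for _ in range(max_iter):
        changed = False
        for node in x:
            if node in known:
                continue
            nbrs = neighbors.get(node, [])
            # Fraction of neighbors currently labeled 1 (0.5 if isolated).
            frac = sum(labels[j] for j in nbrs) / len(nbrs) if nbrs else 0.5
            score = (1 - weight) * local_score(x[node]) + weight * frac
            new_label = int(score >= 0.5)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Hypothetical sites: one attribute each, read directly as a local score.
x = {"a": 0.9, "e": 0.8, "b": 0.45, "c": 0.45, "d": 0.2}
neighbors = {
    "a": ["b"], "e": ["b"],
    "b": ["a", "e", "c"], "c": ["b", "d"], "d": ["c"],
}
known = {"a": 1, "e": 1}  # labels observed for sites a and e

labels = iterative_classification(x, neighbors, known,
                                  local_score=lambda v: v)
print(labels)
```

Sites b and c have the same local score, yet they end up with different labels because their neighborhoods differ, which is the effect collective inference is meant to capture.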
Collective Inference
• Joint relational model: estimates the joint probability distribution over the variables in both i and N(i), and then jointly infers the values of both $y_i$ and $y_{N(i)}$.
• Probabilistic relational models (Getoor et al., 2001; Neville & Jensen, 2003)
• Autocorrelation is exploited to improve predictions.
This inference procedure should be investigated in the context of spatial data mining.
Hierarchies of Spatial Objects
• Spatial objects are often organized in hierarchies.
• A hierarchy of areal objects may also be induced by the spatial relationship of containment.
[Figure: a spatial hierarchy for UK census data: a County contains District1, District2, …, Districtn; each district contains Ward1, Ward2, …, Wardn]
Hierarchies of Spatial Objects
• Spatial patterns involving the most abstract spatial objects are well supported, but have lower confidence.
• Spatial data mining methods should be able to explore the search space at different granularity levels.
Hierarchies of Spatial Objects
Naive approach:
• Level-by-level analysis
• Information on the patterns found at one level is not used to make the search more efficient at a higher/lower level
More sophisticated approaches:
• GeoAssociator (Koperski & Han, 1995)
• SPADA (Malerba & Lisi, 2001)
Knowledge Rich Data Mining
• Knowledge is available on spatial phenomena.
• In geography there are many natural geographic dependencies, e.g. a port is adjacent to a water body.
• Many non-novel and uninteresting patterns with very high support and confidence.
• Use known dependencies to prune uninteresting patterns (SPADA).
Knowledge Rich Data Mining
• Stockport (UK): characterising the area served by the M63 motorway
• 12,466 strong association rules
• Many purely spatial patterns, e.g.
  ed_on_M63(X), can_reach(X,Y) → is_a(Y,ward_on_m63_ED) (90.0 %, 100.0 %)
Embedding Spatial Reasoning
• The process by which information about objects in space is used to arrive at valid conclusions regarding the objects’ relationships.
• Example: recursive definition of site accessibility
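A recursive definition of accessibility can be sketched as reachability over an adjacency relation: a site can reach another if they coincide, or if it is adjacent to some site from which the other can be reached. The `adjacent` relation and the site names below are hypothetical; in a system such as SPADA this kind of rule would be stated as background knowledge rather than as Python code.

```python
def can_reach(adjacent, start, goal, _seen=None):
    """True if `goal` is accessible from `start` by following the
    `adjacent` relation zero or more times. Visited sites are tracked
    so that cycles in the relation do not cause infinite recursion."""
    seen = _seen if _seen is not None else set()
    if start == goal:
        return True
    seen.add(start)
    return any(
        nxt not in seen and can_reach(adjacent, nxt, goal, seen)
        for nxt in adjacent.get(start, ())
    )


# Hypothetical adjacency between sites (directed, with a cycle).
adjacent = {
    "ed1": ["ed2"],
    "ed2": ["ed3"],
    "ed3": ["ed1"],
    "ed4": [],
}

print(can_reach(adjacent, "ed1", "ed3"))
print(can_reach(adjacent, "ed1", "ed4"))
```

The base case and the recursive case correspond to the two clauses a logic program would use for the same predicate.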
Embedding Spatial Reasoning
Quantitative approach:
• Based on coordinates and distances
• More akin to machine reasoning
Qualitative approach (Freksa, 1991):
• Abstract representations (‘northwest’, ‘far’, …)
• Closely related to human reasoning
• Efficient
• Deals with imprecision, uncertainty and incompleteness
Embedding Spatial Reasoning
• Embedding spatial inference engines in spatial data mining systems: promising, but still unexplored.
• SPADA supports a limited form of spatial inference if the rules of spatial reasoning are reported in the background knowledge.
Summary
Spatial data mining presents several issues:
• Spatial objects have a geometry, are related, and are of different types
• Autocorrelation affects spatial phenomena
Solutions offered in spatial statistics are limited:
• Double-entry table representation
• The choice of the neighborhood matrix is critical
• Spatial dependencies are handled in pre-processing
Summary
• The relational approach is the most appropriate.
• Several methods have already been proposed for different tasks.
• But there are still many challenges:
  - Dynamic handling of spatial dependencies & scalability
  - Bias caused by autocorrelation
  - Transductive inference
  - Collective inference
  - Hierarchies of objects
  - Use of spatial knowledge
  - Use of specific spatial reasoners
  - …
Outlook
To develop effective solutions for spatial data analysis it is necessary to develop synergies between researchers working on different research topics:
• Spatial statistics
• Multi-relational data mining
• Spatial databases and GIS
• Visualization
Will this happen?
Motivation for optimism: real applications (e.g., sales prediction for individual shops, urban data analysis, location-based services) demand this collaboration.
Spatial Data Mining @ CS Dept., Univ. of Bari
• Annalisa Appice
• Michelangelo Ceci
• Antonietta Lanza
• Antonio Turi
• Antonio Varlaro
Thanks to them for their valuable contribution to this research topic.