Discovering Interesting Regions in Spatial Data Sets
Christoph F. Eick for the Data Mining Class
1. Motivation: Examples of Region Discovery
2. Region Discovery Framework
3. A Fitness Function for Hotspot Discovery
4. Other Fitness Functions
5. A Family of Clustering Algorithms for Region Discovery
6. Summary
Ch. Eick: Introduction Region Discovery
1. Motivation: Examples of Region Discovery
RD-Algorithm
Application 1: Supervised Clustering [EVJW07]
Application 2: Regional Association Rule Mining and Scoping [DEWY06, DEYWN07]
Application 3: Find Interesting Regions with respect to a Continuous Variable [CRET08]
Application 4: Regional Co-location Mining Involving Continuous Variables [EPWSN08]
Application 5: Find "representative" regions (Sampling)
Application 6: Regional Regression [CE09]
Application 7: Multi-Objective Clustering [JEV09]
Application 8: Change Analysis in Spatial Datasets [RE09]
Wells in Texas:
Green: safe well with respect to arsenic
Red: unsafe well
β=1.01
β=1.04
References: http://www2.cs.uh.edu/~ceick/pub.html
2. Region Discovery Framework
• We assume we have spatial or spatio-temporal datasets that have the following structure: (x,y,[z],[t];<non-spatial attributes>), e.g. (longitude, latitude, class_variable) or (longitude, latitude, continuous_variable)
• Clustering occurs in the (x,y,[z],[t])-space; regions are found in this space.
• The non-spatial attributes are used by the fitness function but neither in distance computations nor by the clustering algorithm itself.
• For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same.
Region Discovery Framework Continued
The algorithms we currently investigate solve the following problem:
Given:
• A dataset O with a schema R
• A distance function d defined on instances of R
• A fitness function q(X) that evaluates a clustering $X=\{c_1,\dots,c_k\}$ as follows:
$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} \mathrm{interestingness}(c) \cdot \mathrm{size}(c)^{\beta} \quad \text{with } \beta > 1$$
Objective: Find $c_1,\dots,c_k \subseteq O$ such that:
1. $c_i \cap c_j = \emptyset$ if $i \neq j$
2. $X=\{c_1,\dots,c_k\}$ maximizes q(X)
3. All clusters $c_i \in X$ are contiguous (each pair of objects belonging to $c_i$ has to be Delaunay-connected with respect to $c_i$ and to d)
4. $c_1 \cup \dots \cup c_k \subseteq O$
5. $c_1,\dots,c_k$ are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
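The additive fitness function above can be sketched in a few lines. This is only an illustration: the representation of a cluster as a plain Python sequence and the names `reward` and `q` as functions are assumptions, and the interestingness measure is passed in as a callable because the slides define several of them.

```python
from typing import Callable, Sequence

# Assumption: a cluster is any sequence of objects; size(c) is its length.
Cluster = Sequence[object]


def reward(cluster: Cluster,
           interestingness: Callable[[Cluster], float],
           beta: float) -> float:
    """reward(c) = interestingness(c) * size(c)**beta."""
    return interestingness(cluster) * len(cluster) ** beta


def q(clustering: Sequence[Cluster],
      interestingness: Callable[[Cluster], float],
      beta: float = 1.1) -> float:
    """Additive fitness q(X): the sum of the rewards of all clusters in X."""
    if beta <= 1:
        raise ValueError("the framework requires beta > 1")
    return sum(reward(c, interestingness, beta) for c in clustering)


# Two clusters of sizes 2 and 3 with a constant (hypothetical) interestingness of 1.
print(q([[1, 2], [3, 4, 5]], lambda c: 1.0, beta=2))
```

Because β > 1, a large cluster earns more reward than several small clusters of the same interestingness and the same total size, so the parameter controls how strongly larger regions are preferred.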
Challenges for Region Discovery
1. Recall and precision with respect to the discovered regions should be high
2. Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture “what domain experts find interesting in spatial datasets”
3. Detection of regions at different levels of granularity (from very local to almost global patterns)
4. Detection of regions of arbitrary shapes
5. Necessity to cope with very large datasets
6. Regions should be properly ranked by relevance (reward); in many applications only the top-k regions are of interest
7. Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5 and 6.
3. Fitness Function for Supervised Clustering
Class of Interest: Unsafe_Well
Prior Probability: 20%
γ1 = 0.5, γ2 = 1.5; R+ = 1, R- = 1; β = 1.1, η = 1
No reward is given between γ1*prior = 10% and γ2*prior = 30%.

              Cluster 1       Cluster 2       Cluster 3        Cluster 4          Cluster 5
|c|           50              200             200              350                200
P(c, Unsafe)  20/50 = 40%     40/200 = 20%    10/200 = 5%      30/350 = 8.6%      100/200 = 50%
Reward        1/7 * 50^1.1    0               1/2 * 200^1.1    0.143 * 350^1.1    2/7 * 200^1.1
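The rewards in this example can be reproduced with a short sketch. The slide does not spell out the interestingness formula, so the piecewise form below is an assumption inferred from the numbers: no reward between γ1*prior and γ2*prior, and a linearly scaled deviation raised to η outside that band. The function and parameter names are hypothetical.

```python
def interestingness(p: float, prior: float,
                    gamma1: float = 0.5, gamma2: float = 1.5,
                    r_plus: float = 1.0, r_minus: float = 1.0,
                    eta: float = 1.0) -> float:
    """Assumed piecewise measure; p is the fraction of the class of interest.

    Assumes 0 < gamma1 * prior and gamma2 * prior < 1.
    """
    low, high = gamma1 * prior, gamma2 * prior
    if p < low:    # class of interest is unusually rare in the cluster
        return ((low - p) / low) ** eta * r_minus
    if p > high:   # class of interest is unusually frequent in the cluster
        return ((p - high) / (1.0 - high)) ** eta * r_plus
    return 0.0     # close to the prior: a "boring" cluster


def reward(size: int, p: float, prior: float, beta: float = 1.1) -> float:
    """reward(c) = interestingness(c) * size(c)**beta."""
    return interestingness(p, prior) * size ** beta


# (cluster size, number of unsafe wells) for clusters 1 to 5 of the example.
clusters = [(50, 20), (200, 40), (200, 10), (350, 30), (200, 100)]
for number, (size, unsafe) in enumerate(clusters, start=1):
    p = unsafe / size
    print(number, round(interestingness(p, prior=0.2), 3),
          round(reward(size, p, prior=0.2), 2))
```

Under these assumptions the interestingness values come out as 1/7, 0, 1/2, about 0.143 and 2/7, matching the factors in the Reward row.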
4. Fitness Functions for Other Region Discovery Tasks
4.1 Creating Contour Maps for Water Temperature (Temp)
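1. Examples in the data set WT have the form (x,y,temp); var(c,temp) denotes the variance of variable temp in region c
2. $$\mathrm{interestingness}(c) = \begin{cases} 0 & \text{if } \mathrm{var}(c,temp) > \theta \\ (\theta - \mathrm{var}(c,temp))^{\eta} & \text{otherwise} \end{cases}$$
with $\eta \in \mathbb{R}^{+} \cup \{0\}$ being a form parameter (with default 1) and $\theta \geq 0$ being a threshold parameter.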
3. Many other possible fitness functions could be used.
Fig. 1: Sea Surface Temperature on July 7, 2002
A single region and its summary: Mean=11.2, Var=2.2, Reward: 48.5, Rank: 3
4.2 Finding Regions with High Water Temperature Differences
1. Examples in the data set WT have the form (x,y,temp); var(WT,temp) denotes the variance of the dataset for attribute temp.
2. Fitness function: Let c be a cluster to be evaluated
$$\mathrm{interestingness}(c) = \begin{cases} 0 & \text{if } \mathrm{var}(c,temp) < \mathrm{var}(WT,temp) + \theta \\ \left(\dfrac{\mathrm{var}(c,temp)}{\mathrm{var}(WT,temp) + \theta} - 1\right)^{\eta} & \text{otherwise} \end{cases}$$
with $\eta$ being a form parameter (with default 1) and $\theta \geq 0$ being a threshold parameter (with default 0)
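The two variance-based measures of Sections 4.1 and 4.2 can be sketched as follows. The parameter names `threshold` and `eta` and the use of the population variance (`statistics.pvariance`) are assumptions; the slides only say "variance".

```python
from statistics import pvariance
from typing import Sequence


def low_variance_interestingness(temps: Sequence[float],
                                 threshold: float,
                                 eta: float = 1.0) -> float:
    """Section 4.1: reward regions whose temperature variance is below a threshold."""
    v = pvariance(temps)
    return 0.0 if v > threshold else (threshold - v) ** eta


def high_variance_interestingness(temps: Sequence[float],
                                  dataset_temps: Sequence[float],
                                  threshold: float = 0.0,
                                  eta: float = 1.0) -> float:
    """Section 4.2: reward regions whose variance exceeds the dataset variance."""
    v = pvariance(temps)
    base = pvariance(dataset_temps) + threshold
    if base == 0 or v < base:   # guard against a constant dataset with threshold 0
        return 0.0
    return (v / base - 1.0) ** eta


# A homogeneous region suits contour maps; a heterogeneous one suits 4.2.
print(low_variance_interestingness([10.0, 10.0, 10.0], threshold=2.0))
print(high_variance_interestingness([0.0, 4.0], dataset_temps=[1.0, 3.0]))
```

Both functions return 0 for regions that fail the threshold test, so such regions contribute no reward to q(X).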
4.3 Programming Project Fitness Functions: Purity
We assume we have 3 classes; each region is described by its class counts, e.g. in r1 we have 6 objects of class 1, 2 objects of class 2, and 2 objects of class 3.
r1: (6, 2, 2)
r2: (0, 0, 5)
r3: (2, 2, 1)
pc(r) = (number of instances of class c in region r)/(number of instances in r)
We assume th=0.5 and η=2
i(r1) = (0.6-0.5)**2 = 0.01
i(r2) = (1-0.5)**2 = 0.25
i(r3) = 0, because the purity of r3 (0.4) is below th
q(X) = q({r1,r2,r3}) = 0.01*10 + 0.25*5 = 1.35
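A minimal sketch that reproduces the purity example. It assumes that the purity of a region is the largest class proportion pc(r), and that the region size enters q(X) with exponent 1, which is what the arithmetic 0.01*10 + 0.25*5 implies; the function names are hypothetical.

```python
from typing import Sequence


def purity(counts: Sequence[int]) -> float:
    """Largest class proportion in a region given its per-class counts."""
    return max(counts) / sum(counts)


def interestingness(counts: Sequence[int], th: float = 0.5, eta: float = 2.0) -> float:
    """i(r) = (purity(r) - th)**eta if purity(r) > th, otherwise 0."""
    p = purity(counts)
    return (p - th) ** eta if p > th else 0.0


def q(regions: Sequence[Sequence[int]], th: float = 0.5,
      eta: float = 2.0, beta: float = 1.0) -> float:
    """Sum of interestingness times size**beta (beta=1 matches the example)."""
    return sum(interestingness(r, th, eta) * sum(r) ** beta for r in regions)


regions = {"r1": (6, 2, 2), "r2": (0, 0, 5), "r3": (2, 2, 1)}
for name, counts in regions.items():
    print(name, round(interestingness(counts), 4))
print("q(X) =", round(q(list(regions.values())), 4))
```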
Programming Project 2008 Fitness Function: Variance
We assume η=1 and th=1.5
O: Var(O)=100
r1: Var(r1)=80
r2: Var(r2)=200
r3: Var(r3)=1100
r4: Var(r4)=20
i(r1) = 0
i(r2) = (2-1.5) = 0.5
i(r3) = (11-1.5) = 9.5
i(r4) = 0
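The four values above follow from comparing the ratio Var(r)/Var(O) with the threshold. A small sketch, assuming the form (ratio - th)**η above the threshold and 0 otherwise, which is what the worked numbers suggest (the function name is hypothetical):

```python
def variance_interestingness(region_var: float, dataset_var: float,
                             th: float = 1.5, eta: float = 1.0) -> float:
    """i(r) = (Var(r)/Var(O) - th)**eta if the ratio exceeds th, otherwise 0."""
    ratio = region_var / dataset_var
    return (ratio - th) ** eta if ratio > th else 0.0


dataset_var = 100.0
region_vars = {"r1": 80.0, "r2": 200.0, "r3": 1100.0, "r4": 20.0}
for name, var in region_vars.items():
    print(name, variance_interestingness(var, dataset_var))
```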
Co-location Interestingness Measure for 2 Continuous Variables
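• The values of attributes A1 and A2 are converted into z-scores
• Interestingness of an object o with respect to A={A1,A2}: $i(A,o) = z_{A_1}(o) \cdot z_{A_2}(o)$. Remark: i(A,o) can be negative
• Interestingness of a region r: $i(A,r) = \left(\frac{\left|\sum_{o \in r} i(A,o)\right|}{|r|} - z_{th}\right)^{\eta}$ if $\frac{\left|\sum_{o \in r} i(A,o)\right|}{|r|} > z_{th}$, and 0 otherwise
• Remark: Patterns {A1↑, A2↑} and {A1↓, A2↓} are treated as the same. The same is true for {A1↑, A2↓} and {A1↓, A2↑}
• Remark: i(A,r) will be called the Binary Co-location Interestingness Function in the following.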
Example: Using the Binary Co-location Fitness Function
We assume η=1, zth=0.1 and A={B1,B2}
Binary Co-location: i(o,{B1,B2}) = zB1(o)*zB2(o)
Each object is written as (z-value of B1, z-value of B2); e.g. (-1, -4) means: z-value of B1 is -1, and z-value of B2 is -4
r1: (1, 1), (-1, 1), (1, 0.6)
r2: (-1, -4), (-0.5, -1), (-0.5, 0)
r4: (1, -1), (1, 1), (0.3, -0.1)
i(r1) = |1-1-0.6|/3 - 0.1 = 0.1
i(r2) = |4+0.5+0|/3 - 0.1 = 1.4
i(r3) = …
i(r4) = 0 because |-1+1-0.03|/3 = 0.01 < 0.1
Remark: Let A be an attribute and a a value of that attribute; z-score(a) = (a - mean(A))/standard-deviation(A)
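The example can be checked with a short sketch. It assumes that objects are given directly as pairs of z-scores and that the region measure is the absolute mean of the per-object products, reduced by zth and raised to η; the function names are hypothetical.

```python
from typing import Sequence, Tuple

ZPair = Tuple[float, float]  # (z-score of B1, z-score of B2) for one object


def object_interestingness(o: ZPair) -> float:
    """i(o, {B1, B2}): product of the two z-scores; can be negative."""
    z1, z2 = o
    return z1 * z2


def region_interestingness(region: Sequence[ZPair],
                           zth: float = 0.1, eta: float = 1.0) -> float:
    """Absolute mean product minus zth, raised to eta; 0 if not above zth."""
    mean_abs = abs(sum(object_interestingness(o) for o in region)) / len(region)
    return (mean_abs - zth) ** eta if mean_abs > zth else 0.0


r1 = [(1, 1), (-1, 1), (1, 0.6)]
r2 = [(-1, -4), (-0.5, -1), (-0.5, 0)]
r4 = [(1, -1), (1, 1), (0.3, -0.1)]
for name, region in (("r1", r1), ("r2", r2), ("r4", r4)):
    print(name, round(region_interestingness(region), 3))
```

Taking the absolute value of the sum is what makes a region where both attributes are high and a region where both are low score the same.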
Finding Regional Co-location Patterns in Spatial Datasets
Objective: Find co-location regions using various clustering algorithms and novel fitness functions.
Applications:
1. Finding regions on planet Mars where shallow and deep ice are co-located, using point and raster datasets. In Figure 1, regions in red have very high co-location and regions in blue have anti co-location.
2. Finding co-location patterns involving chemical concentrations with values on the wings of their statistical distribution in Texas' ground water supply. Figure 2 indicates discovered regions and their associated chemical patterns.
Figure 1: Co-location regions involving deep and shallow ice on Mars
Figure 2: Chemical co-location patterns in Texas Water Supply
Programming Project Function: MSE
Assume Manhattan distance is used:
r1: (2,2), (4,4)
r2: (-1,-1), (-7,-7), (-4,-4)
outlier: (12,12)
MSE(r1) = (2**2+2**2)/2 = 4
MSE(r2) = (6**2+6**2+0**2)/3 = 24
X = {r1,r2}
MSE(X) = (8+72)/5 = 16
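The MSE numbers can be reproduced with the sketch below. It assumes that each cluster is represented by its centroid, that distances are Manhattan distances to that centroid, and that the outlier (12,12) is left out of X; the helper names are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def manhattan(p: Point, q: Point) -> float:
    """Manhattan (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def sse(points: Sequence[Point]) -> float:
    """Sum of squared Manhattan distances to the cluster centroid."""
    c = centroid(points)
    return sum(manhattan(p, c) ** 2 for p in points)


def mse(clusters: Sequence[Sequence[Point]]) -> float:
    """Total squared error divided by the number of clustered objects."""
    return sum(sse(c) for c in clusters) / sum(len(c) for c in clusters)


r1 = [(2, 2), (4, 4)]
r2 = [(-1, -1), (-7, -7), (-4, -4)]
print(mse([r1]), mse([r2]), mse([r1, r2]))
```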
4.4 Regional Co-location Mining
Task: Find co-location patterns for the following data set.
Global Co-location: two object types are co-located in the whole dataset
Regional Co-location: regions R1, R2, R3, R4
Categorical Binary Co-location
Task: Find regions in which the density of 2 or more classes is elevated. In general, multipliers $\lambda_C$ are computed for every region r, indicating how much the density of instances of class C is elevated in region r compared to C's density in the whole space, and the interestingness of a region with respect to two classes C1 and C2 is assessed proportional to the product $\lambda_{C_1} \lambda_{C_2}$.
Example: Binary Co-location Reward Framework
$\lambda_C(r) = p(C,r)/\mathrm{prior}(C)$
$\lambda^{max}_{C_1,C_2} = 1/(\mathrm{prior}(C_1)+\mathrm{prior}(C_2))$ ("maximum multiplier")
$\kappa_{C_1,C_2}(r) = 0$ if $\lambda_{C_1}(r) < 1$ or $\lambda_{C_2}(r) < 1$; otherwise $\kappa_{C_1,C_2}(r) = \sqrt{(\lambda_{C_1}(r)-1)(\lambda_{C_2}(r)-1)}\,/\,(\lambda^{max}_{C_1,C_2}-1)$
$\mathrm{interestingness}(r) = \max_{C_1,C_2;\, C_1 \neq C_2} \kappa_{C_1,C_2}(r)$
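A sketch of the binary co-location reward framework for categorical data. The class names, the priors and the region proportions are made up for illustration, the function names are hypothetical, and the guard for a maximum multiplier of at most 1 is an added assumption to avoid dividing by zero.

```python
from itertools import combinations
from math import sqrt
from typing import Dict


def multiplier(p_in_region: float, prior: float) -> float:
    """How much the density of a class is elevated in the region."""
    return p_in_region / prior


def pair_interestingness(m1: float, m2: float,
                         prior1: float, prior2: float) -> float:
    """Normalized geometric mean of the two elevations, 0 if either is below 1."""
    max_multiplier = 1.0 / (prior1 + prior2)
    if m1 < 1 or m2 < 1 or max_multiplier <= 1:
        return 0.0
    return sqrt((m1 - 1) * (m2 - 1)) / (max_multiplier - 1)


def interestingness(region_probs: Dict[str, float],
                    priors: Dict[str, float]) -> float:
    """Maximum pair interestingness over all pairs of distinct classes."""
    best = 0.0
    for c1, c2 in combinations(sorted(priors), 2):
        m1 = multiplier(region_probs.get(c1, 0.0), priors[c1])
        m2 = multiplier(region_probs.get(c2, 0.0), priors[c2])
        best = max(best, pair_interestingness(m1, m2, priors[c1], priors[c2]))
    return best


# Hypothetical priors and class proportions inside one region.
priors = {"A": 0.2, "B": 0.3, "C": 0.5}
region = {"A": 0.4, "B": 0.45, "C": 0.15}
print(round(interestingness(region, priors), 4))
```

In this made-up region only the pair (A, B) has both multipliers above 1, so it alone determines the interestingness.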
2006: The Ultimate Vision of the Presented Research
Architecture of the Region Discovery Engine (diagram). Components: Spatial Databases; Database Integration Tool; Data Set; Domain Expert; Measure of Interestingness Acquisition Tool; Fitness Function; Family of Clustering Algorithms; Ranked Set of Interesting Regions and their Properties; Visualization Tools; Region Discovery Display.
How to Apply the Suggested Methodology
1. With the assistance of domain experts determine structure of dataset to be used.
2. Acquire a measure of interestingness for the problem at hand (in the examples discussed before, this was purity, variance, MSE, and probability elevation of two or more classes).
3. Convert measure of interestingness into a reward-based fitness function. The designed fitness function should assign a reward of 0 to “boring” regions. It is also a good idea to normalize rewards by limiting the maximum reward to 1.
4. After the region discovery algorithm has been run, rank and visualize the top k regions with respect to the rewards obtained (interestingness(c) * size(c)^β), together with their properties, which are usually task specific.
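Steps 3 and 4 can be illustrated with a small ranking helper. This is a sketch under assumptions: regions are plain sequences of objects, the interestingness callable already returns 0 for "boring" regions, and the names `capped` and `top_k_regions` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Region = Sequence[object]


def capped(interestingness: Callable[[Region], float]) -> Callable[[Region], float]:
    """Wrap a measure so that its value is limited to the range [0, 1]."""
    return lambda region: min(1.0, max(0.0, interestingness(region)))


def top_k_regions(regions: Sequence[Region],
                  interestingness: Callable[[Region], float],
                  k: int, beta: float = 1.1) -> List[Tuple[float, Region]]:
    """Rank regions by interestingness * size**beta and keep the k best."""
    scored = [(interestingness(r) * len(r) ** beta, r) for r in regions]
    scored = [(rw, r) for rw, r in scored if rw > 0]  # drop boring regions
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy regions; the hypothetical measure is the first value in the region.
regions = [[1, 1, 1, 1], [0] * 10, [1, 1]]
measure = capped(lambda r: float(r[0]))
print(top_k_regions(regions, measure, k=2, beta=2))
```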
5. A Family of Clustering Algorithms for Region Discovery
1. Supervised Partitioning Around Medoids (SPAM)
2. Representative-based Clustering Using Randomized Hill Climbing (CLEVER)
3. Supervised Clustering using Evolutionary Computing (SCEC)
4. Single Representative Insertion/Deletion Hill Climbing with Restart (SRIDHCR)
5. Supervised Clustering using Multi-Resolution Grids (SCMRG)
6. Agglomerative Clustering (MOSAIC)
7. Supervised Clustering using Density Estimation Techniques (SCDE)
8. Clustering using Density Contouring (DCONTOUR)
Remark: For more details about SCEC, SPAM, and SRIDHCR see [EZZ04, ZEZ06]; the PKDD06 paper briefly discusses SCMRG
CLEVER
Separate Slideshow
Steps of Grid-based Clustering Algorithms
Basic Grid-based Algorithm
1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and compute the density of each cell.
3. Eliminate cells whose density is below a certain threshold.
4. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function)
Simple version of a grid-based algorithm: Merge cells greedily as long as merging improves q(X).
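The four steps of the basic grid-based algorithm can be sketched as follows. The sketch makes several assumptions: 2D points, square cells of a fixed size, density measured as the object count per cell, and clusters formed as groups of edge-adjacent dense cells. It does not implement the greedy q(X)-driven merging of the simple version; the name `grid_clusters` and its parameters are hypothetical.

```python
from collections import Counter, deque
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]


def grid_clusters(points: Sequence[Tuple[float, float]],
                  cell_size: float, min_count: int) -> List[List[Cell]]:
    """Return clusters as lists of contiguous dense grid cells."""
    # Steps 1 and 2: map each object to its cell and count objects per cell.
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    # Step 3: keep only cells whose count reaches the density threshold.
    dense = {cell for cell, n in counts.items() if n >= min_count}
    # Step 4: group edge-adjacent dense cells with a breadth-first search.
    clusters: List[List[Cell]] = []
    seen = set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            cx, cy = queue.popleft()
            component.append((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(sorted(component))
    return clusters


points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.5),
          (5.0, 5.0), (9.1, 9.1), (9.2, 9.3)]
print(grid_clusters(points, cell_size=1.0, min_count=2))
```

After the counting step the work depends only on the populated cells, and no distances between objects are computed.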
Advantages of Grid-based Clustering Algorithms
• Fast:
– No distance computations
– Clustering is performed on summaries and not individual objects; complexity is usually O(#populated-grid-cells) and not O(#objects)
– Easy to determine which clusters are neighboring
• Shapes are limited to unions of grid-cells
Ideas SCMRG (Divisive, Multi-Resolution Grids)
Cell Processing Strategy
1. If a cell receives a reward that is larger than the sum of the rewards of its ancestors: return that cell.
2. If a cell and its ancestor do not receive any reward: prune
3. Otherwise, process the children of the cell (drill down)
Code SCMRG
Parameters SCMRG
Separate Transparency!
6. Summary
1. A framework for region discovery that relies on additive, reward-based fitness functions and views region discovery as a clustering problem has been introduced.
2. The framework finds interesting places and their associated patterns.
3. The framework extracts regional knowledge from spatial datasets.
4. The ultimate vision of this research is the development of region discovery engines that assist earth scientists in finding interesting regions in spatial datasets.
Why should people use Region Discovery Engines (RDE)?
RDE: finds sub-regions with special characteristics in large spatial datasets and presents findings in an understandable form. This is important for:
• Focused summarization
• Finding interesting subsets in spatial datasets for further studies
• Identifying regions with unexpected patterns; because they are unexpected they deviate from global patterns; therefore, their regional characteristics are frequently important for domain experts
• Without powerful region discovery algorithms, finding regional patterns tends to be haphazard, and only leads to discoveries if ad-hoc region boundaries have enough resemblance with the true decision boundary
• Exploratory data analysis for a mostly unknown dataset
• Co-location statistics are frequently blurred when arbitrary region definitions are used, hiding the true relationship of two co-occurring phenomena: taking averages over regions that include objects that do not contribute to the relationship waters a strong relationship down until it becomes invisible (example: high crime rates along the major rivers in Texas)
• Data set reduction; focused sampling