Discovering Interesting Regions in Spatial Data Sets using Supervised Clustering
Christoph F. Eick, Banafsheh Vaezian, Dan Jiang, Jing Wang
PKDD Conference, Berlin, Sept. 21, 2006
Department of Computer Science
University of Houston, Texas, USA
Organization
1. Motivation: Examples of Region Discovery
2. Region Discovery Framework
3. A Family of Clustering Algorithms for Region Discovery
4. Experimental Evaluation
5. Related Work
6. Generalizability of the Region Discovery Framework
7. Conclusion
Ch. Eick: Discovering Interesting Regions in Spatial Data Sets using Supervised Clustering, PKDD 2006
1. Motivation: Examples of Region Discovery
Application 1: Hot-spot Discovery [this paper]
Application 2: Regional Association Rule Mining [DEWY06]
1. Find regions
2. Mine regional association rules
Application 3: Find Interesting Regions with respect to a Continuous Variable
Application 4: Regional Co-location Mining
Application 5: Find “representative” regions (Sampling)
Wells in Texas: Green: safe well with respect to arsenic; Red: unsafe well
[Figure: output of the RD-Algorithm on the Texas wells for two parameter settings, 1.01 and 1.04]
2. Region Discovery Framework
• We assume we have spatial or spatio-temporal datasets that have the following structure: (x,y,[z],[t];<non-spatial attributes>), e.g. (longitude, latitude, class_variable) or (longitude, latitude, continuous_variable)
• Clustering occurs in the (x,y,[z],[t])-space; regions are found in this space.
• The non-spatial attributes are used by the fitness function but neither in distance computations nor by the clustering algorithm itself.
• For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same.
Region Discovery Framework Continued
The algorithms we currently investigate solve the following problem:
Given:
A dataset O with a schema R
A distance function d defined on instances of R
A fitness function q(X) that evaluates clustering X={c1,…,ck} as follows:
q(X) = Σ_{c∈X} reward(c) · size(c)^β with β > 1
Objective:
Find c1,…,ck ⊆ O such that:
1. ci ∩ cj = ∅ if i ≠ j
2. X={c1,…,ck} maximizes q(X)
3. All clusters ci ∈ X are contiguous (each pair of objects belonging to ci has to be Delaunay-connected with respect to ci and to d)
4. c1 ∪ … ∪ ck ⊆ O
5. c1,…,ck are frequently ranked based on the reward each cluster receives, and low-reward clusters are not reported
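The additive form of q(X) can be illustrated with a small sketch. The names `fitness`, `is_disjoint` and `constant_reward` are hypothetical and not taken from the talk; the reward function is passed in as a parameter because the framework leaves it open. With a constant reward, the example shows why β > 1 favors fewer, larger clusters.

```python
from typing import Callable, List, Set

Cluster = Set[int]  # a cluster is a set of object ids drawn from the dataset O


def is_disjoint(clustering: List[Cluster]) -> bool:
    """Constraint 1: no object belongs to more than one cluster."""
    seen: Set[int] = set()
    for c in clustering:
        if seen & c:
            return False
        seen |= c
    return True


def fitness(clustering: List[Cluster],
            reward: Callable[[Cluster], float],
            beta: float = 1.1) -> float:
    """q(X) = sum over clusters c in X of reward(c) * size(c)**beta, beta > 1."""
    if beta <= 1:
        raise ValueError("beta must be greater than 1")
    if not is_disjoint(clustering):
        raise ValueError("clusters must be pairwise disjoint")
    return sum(reward(c) * len(c) ** beta for c in clustering)


def constant_reward(c: Cluster) -> float:
    """Toy reward (an assumption for illustration): every cluster earns 1."""
    return 1.0


two_small = [{0, 1}, {2, 3}]
one_large = [{0, 1, 2, 3}]
q_small = fitness(two_small, constant_reward)  # 2 * 2**1.1
q_large = fitness(one_large, constant_reward)  # 4**1.1
print(q_small, q_large)
```

Because size(c) is raised to a power above 1, the single cluster of four objects scores higher than two clusters of two, even though the reward per cluster is the same.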
Example of a Fitness Function for Hot Spot Discovery
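| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|---|---|---|---|---|---|
| size(c) | 50 | 200 | 200 | 350 | 200 |
| P(c, Unsafe) | 20/50 = 40% | 40/200 = 20% | 10/200 = 5% | 30/350 = 8.6% | 100/200 = 50% |
| Reward | 1/7 × 50^1.1 | 0 | 1/2 × 200^1.1 | 1/7 × 350^1.1 (1/7 ≈ 0.143) | 2/7 × 200^1.1 |

Class of Interest: Unsafe_Well
Prior Probability: 20%
γ1 = 0.5, γ2 = 1.5; R+ = 1, R- = 1; β = 1.1, η = 1
[Figure: reward as a function of P(c, Unsafe), with thresholds at 10% and 30%]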
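The example above can be recomputed under one assumption that the slide does not state explicitly: that the interestingness of a cluster is zero between γ1·prior and γ2·prior and grows linearly (raised to the power η) outside that band, scaled by R- below and R+ above. The function names are hypothetical; the parameter values are the ones listed on the slide.

```python
def interestingness(p, prior=0.2, gamma1=0.5, gamma2=1.5,
                    r_plus=1.0, r_minus=1.0, eta=1.0):
    """Assumed piecewise form: 0 inside [gamma1*prior, gamma2*prior]."""
    low, high = prior * gamma1, prior * gamma2   # 10% and 30% for this example
    if p < low:
        return ((low - p) / low) ** eta * r_minus
    if p > high:
        return ((p - high) / (1.0 - high)) ** eta * r_plus
    return 0.0


def cluster_reward(n_unsafe, size, beta=1.1):
    """reward(c) * size(c)**beta for one cluster."""
    return interestingness(n_unsafe / size) * size ** beta


# (number of unsafe wells, cluster size) for Clusters 1 to 5
clusters = [(20, 50), (40, 200), (10, 200), (30, 350), (100, 200)]
rewards = [cluster_reward(u, s) for u, s in clusters]
q = sum(rewards)  # fitness of the whole clustering

for i, r in enumerate(rewards, start=1):
    print(f"Cluster {i}: {r:.2f}")
print(f"q(X) = {q:.2f}")
```

Under this assumption the factors come out as 1/7, 0, 1/2, 1/7 and 2/7, so Cluster 2, whose unsafe rate equals the prior, contributes nothing to q(X).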
Challenges for Region Discovery
1. Recall and precision with respect to the discovered regions should be high
2. Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture “what domain experts find interesting in spatial datasets”
3. Detection of regions at different levels of granularity (from very local to almost global patterns)
4. Detection of regions of arbitrary shapes
5. Necessity to cope with very large datasets
6. Regions should be properly ranked by relevance (reward)
7. Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5 and 6.
3. A Family of Clustering Algorithms for Region Discovery
1. Supervised Partitioning Around Medoids (SPAM)
2. Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
3. Supervised Clustering using Evolutionary Computing (SCEC)
4. Agglomerative Hierarchical Supervised Clustering (SCAH)
5. Hierarchical Grid-based Supervised Clustering (SCHG)
6. Supervised Clustering using Multi-Resolution Grids (SCMRG)
7. Representative-based Clustering with Gabriel Graph Based Post-processing (SCEC+GGP / SRIDHCR+GGP)
8. Supervised Clustering using Density Estimation Techniques (SCDE)
Remark: For more details about SCEC, SPAM, SRIDHCR see [EZZ04, ZEZ06]; the PKDD06 paper briefly discusses SCAH, SCHG, SCMRG
SCAH (Agglomerative Hierarchical)
Inputs:
A dataset O={o1,...,on}
A distance matrix D = {d(oi,oj) | oi,oj ∈ O}
Output:
Clustering X={c1,…,ck}
Algorithm:
1) Initialize:
   Create single-object clusters: ci = {oi}, 1 ≤ i ≤ n
   Compute merge candidates based on “nearest clusters”
2) DO FOREVER
   a) Find the pair (ci, cj) of merge candidates that improves q(X) the most
   b) If no such pair exists, terminate, returning X={c1,…,ck}
   c) Delete the two clusters ci and cj from X and add the cluster ci ∪ cj to X
   d) Update inter-cluster distances incrementally
   e) Update merge candidates based on inter-cluster distances
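The loop above can be sketched in a few lines for very small inputs. This is a naive illustration, not the authors' implementation: it assumes single-link distance between clusters, takes each cluster's nearest cluster as its only merge candidate, and recomputes distances from scratch instead of updating them incrementally. The names `scah` and `purity_reward` are hypothetical, and the all-or-nothing purity reward is a stand-in for a real fitness function.

```python
import math


def purity_reward(cluster, labels):
    """Toy reward (assumption): 1 for a pure cluster, 0 for a mixed one."""
    return 1.0 if len({labels[i] for i in cluster}) == 1 else 0.0


def scah(points, labels, reward=purity_reward, beta=1.01):
    clusters = [frozenset([i]) for i in range(len(points))]

    def q(cs):
        return sum(reward(c, labels) * len(c) ** beta for c in cs)

    def dist(a, b):  # assumption: single-link distance between clusters
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Merge candidates: each cluster paired with its nearest cluster.
        candidates = set()
        for a in clusters:
            nearest = min((b for b in clusters if b is not a),
                          key=lambda b: dist(a, b))
            candidates.add(frozenset([a, nearest]))
        # Step a: find the candidate pair that improves q(X) the most.
        current = q(clusters)
        best_gain, best_pair = 0.0, None
        for pair in candidates:
            a, b = tuple(pair)
            merged = [c for c in clusters if c not in (a, b)] + [a | b]
            gain = q(merged) - current
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        # Step b: stop when no merge improves q(X).
        if best_pair is None:
            break
        # Step c: replace ci and cj by their union.
        a, b = best_pair
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["A", "A", "A", "B", "B", "B"]
result = scah(points, labels)
print(sorted(sorted(c) for c in result))
```

On this toy input the loop merges within each class and then stops, because the only remaining merge would produce a mixed cluster with zero reward.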
SCHG (Hierarchical Grid-based)
Remark: Same as SCAH, but uses grid cells as initial clusters
Inputs:
A dataset O={o1,...,on}
A grid structure G
Output:
Clustering X={c1,…,ck}
Algorithm:
1) Initialize:
   Create clusters, making each single non-empty grid cell a cluster
   Compute merge candidates (all pairs of neighboring grid cells)
2) DO FOREVER
   a) Find the pair (ci, cj) of merge candidates that improves q(X) the most
   b) If no such pair exists, terminate, returning X={c1,…,ck}
   c) Delete the two clusters ci and cj from X and add the cluster c’ = ci ∪ cj to X
   d) Update merge candidates: ∀c∈X (MC(c’,c) ⇔ MC(c, ci) ∨ MC(c, cj))
[Figure: example grid with cells numbered 1 to 7]
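The two places where SCHG differs from SCAH, the initialization from grid cells and the merge-candidate update, can be sketched as follows. The function names are hypothetical, and the sketch assumes a uniform square grid in which two cells are neighbors when they share an edge; the slide does not say which neighborhood is used.

```python
from collections import defaultdict


def grid_clusters(points, cell_size):
    """Map each non-empty grid cell (col, row) to the ids of its points."""
    cells = defaultdict(set)
    for i, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].add(i)
    return dict(cells)


def neighbor_candidates(cells):
    """Merge candidates: pairs of non-empty cells that share an edge."""
    pairs = set()
    for (cx, cy) in cells:
        for nb in ((cx + 1, cy), (cx, cy + 1)):
            if nb in cells:
                pairs.add(frozenset([(cx, cy), nb]))
    return pairs


def update_candidates(candidates, ci, cj, merged):
    """After merging ci and cj, any cluster that was a merge candidate of
    ci or cj becomes a merge candidate of the merged cluster."""
    updated = set()
    for pair in candidates:
        renamed = frozenset(merged if c in (ci, cj) else c for c in pair)
        if len(renamed) == 2:  # drops the (ci, cj) pair itself
            updated.add(renamed)
    return updated


points = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (3.5, 3.5)]
cells = grid_clusters(points, cell_size=1.0)
candidates = neighbor_candidates(cells)
after_merge = update_candidates(candidates, (0, 0), (1, 0), "m")
print(sorted(cells), after_merge)
```

Since the candidate set is derived from the grid once and then only relabeled, no distance computations are needed.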
Ideas SCMRG (Divisive, Multi-Resolution Grids)
Cell Processing Strategy
1. If a cell receives a reward that is larger than the sum of the rewards of its descendants: return that cell.
2. If a cell and its descendants do not receive any reward: prune
3. Otherwise, process the children of the cell (drill down)
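A minimal sketch of the cell processing strategy follows. It reads the comparison in rules 1 and 2 as one between a cell and the cells one level below it (its children), which is an assumption about the slide's wording and corresponds to a look-ahead depth of 1. The `Cell` class and the function name `scmrg` are hypothetical, and rewards are assumed to be non-negative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    name: str
    reward: float
    children: List["Cell"] = field(default_factory=list)


def scmrg(cell: Cell) -> List[str]:
    """Return the names of the cells reported as clusters (top-down)."""
    child_rewards = [ch.reward for ch in cell.children]
    # Rule 1: the cell beats the summed reward one level down -> report it.
    if cell.reward > sum(child_rewards):
        return [cell.name]
    # Rule 2: no reward here and none one level down -> prune the subtree.
    if cell.reward == 0 and all(r == 0 for r in child_rewards):
        return []
    # Rule 3: otherwise drill down into the children.
    found: List[str] = []
    for ch in cell.children:
        found.extend(scmrg(ch))
    return found


root = Cell("root", 1.0, [
    Cell("A", 4.0, [Cell("A1", 1.0), Cell("A2", 1.0)]),
    Cell("B", 0.0, [Cell("B1", 0.0), Cell("B2", 0.0)]),
    Cell("C", 0.0, [Cell("C1", 3.0), Cell("C2", 0.0)]),
])
regions = scmrg(root)
print(regions)
```

Here cell A is reported as a whole, B is pruned, and C is split so that only its rewarding child C1 is reported, which is how the approach ends up with grid cells of different sizes.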
4. Experimental Evaluation
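[Figures: Volcano and Earthquake datasets]

| # | Dataset Name | # of objects | # of classes |
|---|---|---|---|
| 1 | B-Complex9 | 3,031 | 2 |
| 2 | Volcano | 1,533 | 2 |
| 3 | Earthquake-1 | 3,161 | 3 |
| 4 | Earthquake-10 | 31,614 | 3 |
| 5 | Earthquake-100 | 316,148 | 3 |
| 6 | Wyoming-Poverty | 493,781 | 2 |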
Experimental Results
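| Dataset | Measure | SCAH | SCHG | SCMRG | SCAH | SCHG | SCMRG |
|---|---|---|---|---|---|---|---|
| Parameters | | β = 1.01, η = 6 | β = 1.01, η = 6 | β = 1.01, η = 6 | β = 3, η = 1 | β = 3, η = 1 | β = 3, η = 1 |
| B-Complex9 | Purity | 1 | 0.998 | 1 | 1 | 0.997 | 0.863 |
| B-Complex9 | Quality | 0.974 | 0.974 | 0.957 | 0.008 | 0.044 | 0.002 |
| B-Complex9 | Clusters | 17 | 15 | 132 | 17 | 9 | 22 |
| Volcano | Purity | 1 | 0.692 | 0.979 | 1 | 0.692 | 0.885 |
| Volcano | Quality | 0.940 | 0.091 | 0.822 | 1E-5 | 7E-4 | 1E-4 |
| Volcano | Clusters | 639 | 56 | 311 | 639 | 31 | 221 |
| Earthquake-1 | Purity | 1 | 0.844 | 0.938 | 0.853 | 0.840 | 0.814 |
| Earthquake-1 | Quality | 0.952 | 0.399 | 0.795 | 0.004 | 0.086 | 0.006 |
| Earthquake-1 | Clusters | 479 | 33 | 380 | 161 | 10 | 93 |
| Earthquake-10 | Purity | DNF | 0.840 | 0.912 | DNF | 0.834 | 0.807 |
| Earthquake-10 | Quality | DNF | 0.398 | 0.658 | DNF | 0.077 | 0.006 |
| Earthquake-10 | Clusters | DNF | 37 | 506 | DNF | 12 | 153 |
| Earthquake-100 | Purity | DNF | 0.842 | 0.909 | DNF | 0.837 | 0.808 |
| Earthquake-100 | Quality | DNF | 0.389 | 0.560 | DNF | 0.083 | 0.006 |
| Earthquake-100 | Clusters | DNF | 38 | 780 | DNF | 9 | 191 |
| Wyoming | Purity | DNF | 0.772 | 0.721 | DNF | 0.769 | 0.661 |
| Wyoming | Quality | DNF | 0.027 | 0.227 | DNF | 0 | 0.001 |
| Wyoming | Clusters | DNF | 489 | 89 | DNF | 391 | 78 |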
Experimental Evaluation
• SCAH outperforms SCHG and SCMRG when the penalty for the number of clusters is very low (β=1.01, η=6). However, when SCAH runs out of pure clusters to merge, it tends to terminate prematurely; therefore, it does quite poorly when the objective is to obtain large clusters (β=3, η=1).
• SCHG outperforms SCMRG and SCAH for β=3, η=1.
• SCMRG obtains better clusters than SCAH for the Volcano dataset for β=1.01, η=6, which can be attributed to the fact that SCMRG uses grid cells with different sizes.
• Average wall-clock time ratios for the smaller datasets: SCAH:SCMRG = 13:1 and SCAH:SCHG = 52:1.
• SCAH is not suitable for datasets with 10,000 or more objects, mainly because of the large number of distance computations, clusters, and merge steps needed.
• The quality of the clusterings produced by SCMRG depends strongly on the initial cluster sizes and on the look-ahead depth.
Problems with SCAH
• No look ahead
• Non-contiguous clusters: XXX OOO OOO XXX
• Too restrictive definition of merge candidates
5. Related Work
• In contrast to most work in spatial data mining, our work centers on creating regional knowledge and not global knowledge.
• A lot of work in spatial data mining centers on partitioning a spatial dataset into “transactions” so that apriori-style algorithms can be used. We claim that our work can contribute to “finding such transactions” [DEWY06].
• Our work has similarity to work in supervised clustering/semi-supervised clustering in that it uses class labels in evaluating clusters.
• Moreover, the goals of the algorithms presented in this paper are similar to hotspot discovery algorithms, a task that does not receive a lot of attention in spatial data mining, but more attention by scientists in earth sciences and related disciplines.
6. Generalizability
1. Find regions whose density/entropy/purity with respect to a class of interest is low/high (this talk)
2. Find regions whose variance with respect to a continuous variable is low (contour maps)
3. Find regions whose variance with respect to a continuous variable is high …
4. Find regions whose distribution is similar to the distribution of the whole dataset (spatial sampling)
5. Find regions in which the density of 2 or more classes is elevated (regional co-location mining)
7. Summary
1. A framework for region discovery that relies on additive, reward-based fitness functions and views region discovery as a clustering problem has been introduced.
2. Evidence concerning the usefulness of the framework for hot spot discovery problems has been presented.
3. As a by-product some known and not so well known flaws of hierarchical clustering algorithms have been identified.
4. The ultimate vision of this research is the development of region discovery engines that assist earth scientists in finding interesting regions in spatial datasets.
The Vision of the Presented Research
[Figure: Architecture of a Region Discovery Engine. Components shown: Spatial Databases; Database Integration Tool; Data Set; Domain Expert; Measure of Interestingness Acquisition Tool; Fitness Function; Family of Clustering Algorithms; Region Discovery Display; Visualization Tools; Ranked Set of Interesting Regions and their Properties]
Additional Transparencies
Not used for PKDD 2006 Talk
Code SCMRG
Why should people use Region Discovery Engines (RDE)?
RDE: finds sub-regions with special characteristics in large spatial datasets and presents findings in an understandable form. This is important for:
• Focused summarization
• Finding interesting subsets in spatial datasets for further studies
• Identifying regions with unexpected patterns; because they are unexpected, they deviate from global patterns; therefore, their regional characteristics are frequently important for domain experts
• Without powerful region discovery algorithms, finding regional patterns tends to be haphazard, and only leads to discoveries if ad-hoc region boundaries have enough resemblance with the true decision boundary
• Exploratory data analysis for a mostly unknown dataset
• Co-location statistics are frequently blurred when arbitrary region definitions are used. Averaging over regions that include objects which do not contribute to a relationship waters down a strong relationship and hides the true relationship of two co-occurring phenomena (example: high crime rates along the major rivers in Texas)
• Data set reduction; focused sampling
Experimental Results Volcano for β=1.01, η=6
[Figure panels: SCAH, SCHG, SCMRG]
Example Result SCMRG
Datasets Used
• Obtained from the Geosciences Department at the University of Houston.
• The Earthquake dataset contains all earthquake data worldwide collected by the United States Geological Survey (USGS) National Earthquake Information Center (NEIC).
• The modified Earthquake dataset contains the longitude, latitude and a class variable that indicates the depth of the earthquake, 0(shallow), 1(medium) and 2(deep).
Datasets Used
• Wyoming datasets were created from U.S. Census 2000 data.
• The Wyoming Modified Poverty Status in 1999 is a modified version of the original dataset, Wyoming Poverty Status.
• The Wyoming Poverty Datasets were created using county statistics. For each county, random population coordinates were generated using the complete spatial randomness (CSR) functions in S-PLUS.
• Then, the background information was attached to each individual county based on the county’s distribution for the class of interest. Finally, all counties were merged into a single dataset that describes the whole state.
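The construction described above can be sketched as follows. This is only an illustration of the idea: the original datasets were built with the CSR functions in S-PLUS and real county statistics, whereas the sketch uses Python's `random` module, approximates each county by a bounding box, and uses made-up population counts and class fractions. All names and values in it are hypothetical.

```python
import random


def synthesize_county(bbox, population, class_fraction, rng):
    """Place `population` points uniformly at random inside the county's
    bounding box and label a `class_fraction` share of them as class 1."""
    x_min, y_min, x_max, y_max = bbox
    n_positive = round(population * class_fraction)
    labels = [1] * n_positive + [0] * (population - n_positive)
    rng.shuffle(labels)
    return [(rng.uniform(x_min, x_max), rng.uniform(y_min, y_max), lab)
            for lab in labels]


def synthesize_state(counties, seed=0):
    """Merge all per-county samples into a single dataset."""
    rng = random.Random(seed)
    dataset = []
    for county in counties:
        dataset.extend(synthesize_county(county["bbox"],
                                         county["population"],
                                         county["class_fraction"], rng))
    return dataset


# Hypothetical toy counties; not real Census figures.
counties = [
    {"bbox": (0.0, 0.0, 1.0, 1.0), "population": 100, "class_fraction": 0.25},
    {"bbox": (1.0, 0.0, 2.0, 1.0), "population": 40, "class_fraction": 0.50},
]
state = synthesize_state(counties)
print(len(state), sum(lab for _, _, lab in state))
```

Each record has the (x, y, class_variable) structure assumed by the region discovery framework, so the output can be fed directly to a clustering algorithm.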
Datasets Used
• Obtained from the Geosciences Department at the University of Houston.
• The Volcano dataset contains basic geographic and geologic information for volcanoes thought to be active in the last 10,000 years.
• The original data include a unique volcano number, volcano name, location, latitude and longitude, summit elevation, volcano type, status and the time range of the last recorded eruption.
• The subset of the volcano dataset used in this thesis contains longitude, latitude and a class variable that indicates if a volcano is non-violent (blue) or violent (red).
Another Example: Regional Co-location Mining
Task: Find co-location patterns for the following data-set.
[Figure: example data-set with a global co-location and a regional co-location marked]
A Co-Location Reward Framework
• Task: Find regions in which the density of 2 or more classes is elevated.
• One approach to measure class density elevation: In general, multipliers λ_C(r) can be computed for every class C in a dataset, indicating how much the density of instances of class C is elevated in region r compared to their density in the whole space.
• Example: Binary Co-Location Reward Framework;
1. increase_C(r) = if λ_C(r) ≤ 1 then 0
   else ((λ_C(r) – 1)/(1/(prior(C) – 1)))
2. κ_{C1,C2}(r) = increase_{C1}(r) * increase_{C2}(r)
3. reward(r) = max_{C1,C2; C1≠C2} (κ_{C1,C2}(r))