Stratified K-means Clustering Over A Deep Web Data Source
Tantan Liu, Gagan Agrawal
Dept. of Computer Science & Engineering
Ohio State University
Aug. 14, 2012
Outline
• Introduction
– Deep Web
– Clustering on the deep web
• Stratified K-means Clustering
  – Stratification
– Sample Allocation
• Conclusion
Deep Web
• Data sources hidden behind online query interfaces
  – Online query interface vs. database
  – Database accessible only through the online interface
  – Input attributes vs. output attributes
• An example of Deep Web
Data Mining over the Deep Web
• High-level summary of the data
  – Scenario 1: a user wants to relocate to a county
    • Summary of the residences of the county: Age, Price, Square Footage
  – The county property assessor's web site only allows simple queries
Challenges
• Databases cannot be accessed directly
  – Sampling methods are needed for deep web mining
• Obtaining data is time-consuming
  – Need an efficient sampling method: high accuracy with low sampling cost
An Example of Deep Web for Real-Estate
K-means Clustering over a Deep Web Data Source
• Goal: estimate k centers for the underlying clusters, so that the k centers estimated from the sample are close to the k true centers of the whole population.
Overview of Method
[Figure: the population is divided into sub-populations (strata) 1 through n; a sample is drawn from each; the combined sample feeds stratified k-means clustering, which produces the clusters. The two key steps are Stratification and Sample Allocation.]
Stratification on the deep web
• Partitioning the entire population into strata
  – Stratifies the query space of the input attributes
  – Goal: homogeneous query subspaces
  – Radius of a query subspace: measures how spread out its data records are
  – Rule: choose the input attribute that most decreases the radius of a node
  – For each candidate input attribute, compute the resulting decrease in radius
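The splitting rule above can be sketched in Python. The slides omit the formulas, so two pieces are assumptions: the radius is taken to be the average Euclidean distance of a subspace's records to their mean, and pilot records are represented as (input-attribute dict, output vector) pairs.

```python
import numpy as np

def radius(points):
    # Average Euclidean distance of points to their mean -- one plausible
    # definition of a subspace's radius (the slides leave the formula out).
    if len(points) == 0:
        return 0.0
    center = points.mean(axis=0)
    return float(np.linalg.norm(points - center, axis=1).mean())

def best_split_attribute(pilot, input_attrs):
    # Pick the input attribute whose split most decreases the size-weighted
    # radius of the node, per the rule on the slide.
    # pilot: list of (inputs: dict, outputs: np.ndarray) pilot-sample records.
    outputs = np.array([o for _, o in pilot])
    base = radius(outputs)
    best_attr, best_decrease = None, 0.0
    for attr in input_attrs:
        # Partition the node by the attribute's values (query subspaces).
        groups = {}
        for inp, out in pilot:
            groups.setdefault(inp[attr], []).append(out)
        # Size-weighted average radius of the children.
        child = sum(len(g) / len(pilot) * radius(np.array(g))
                    for g in groups.values())
        decrease = base - child
        if decrease > best_decrease:
            best_attr, best_decrease = attr, decrease
    return best_attr, best_decrease
```

For the real-estate example, splitting on year of construction would win if houses of the same year have similar output attributes (price, square footage) while bedroom count does not separate them.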
[Figure: example stratification tree. The root splits on Year of construction (Y=1980, Y=1990, Y=2000, Y=2008, NULL); a node is further split on Bedrooms (B=3, B=4).]
Partition on the Space of Output Attributes
[Figure: the output-attribute space (Price vs. Square Feet) partitioned into subspaces, with data points from years 1980, 1990, 2000, and 2008.]
Sample Allocation Methods
• We create c*k partitions and c*k subspaces
  – Draw a pilot sample
  – c*k-means clustering on the pilot sample generates the c*k partitions
• Representative sampling
  – Good estimation of the statistics of the c*k subspaces
    • Centers
    • Proportions
Representative Sampling-Centers
• Center of a subspace
  – Mean vector of all data points belonging to the subspace
• Let the sample be S = {DR_1, DR_2, …, DR_n}
  – For the i-th subspace, the estimated center is
      c_i = (1/m_i) * Σ_{DR_j ∈ s_i} O(DR_j)
    where s_i is the set of sampled records in the i-th subspace, m_i = |s_i|, and O(DR_j) is the output-attribute vector of record DR_j
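The center estimate on this slide reduces to a per-subspace mean. A minimal sketch, assuming the sampled records' output vectors and their subspace labels are already available as arrays:

```python
import numpy as np

def subspace_centers(sample_outputs, labels, n_subspaces):
    # Estimate each subspace's center as the mean of the output-attribute
    # vectors of the sampled records assigned to it:
    # c_i = (1/m_i) * sum of O(DR_j) over records in subspace i.
    centers = np.zeros((n_subspaces, sample_outputs.shape[1]))
    for i in range(n_subspaces):
        members = sample_outputs[labels == i]
        if len(members):
            centers[i] = members.mean(axis=0)
    return centers
```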
Distance Function
• For the c*k estimated centers and the c*k true centers
• Using Euclidean distance: the distance function sums the distances between each estimated center and its true center
  – Its expected value decomposes over strata as Σ_j V_j / n_j
    • Integrated variance V_j: computed from the pilot sample
    • n_j: number of samples drawn from the j-th stratum
Optimized Sample Allocation
• Goal: minimize the expected distance Σ_j V_j / n_j subject to a fixed total sample size Σ_j n_j = n
• Using Lagrange multipliers: n_j ∝ √V_j
• Strata with large variance receive more samples
  – Their data are spread over a wide area, so more data are needed to represent the population
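Under the assumption that the objective is Σ_j V_j / n_j with Σ_j n_j = n (as the slides suggest), the Lagrange-multiplier solution allocates samples in proportion to √V_j. A minimal sketch:

```python
import numpy as np

def optimized_allocation(variances, total_samples):
    # Minimize sum_j V_j / n_j subject to sum_j n_j = n.
    # Lagrange multipliers give n_j proportional to sqrt(V_j), so
    # high-variance strata (data spread over a wide area) get more samples.
    v = np.sqrt(np.asarray(variances, dtype=float))
    return total_samples * v / v.sum()
```

For example, with stratum variances [4, 1, 1] and a budget of 100 samples, the allocation is [50, 25, 25]: the stratum with 4x the variance gets only 2x the samples, since the allocation scales with the square root.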
Active Learning based sampling Method
• In machine learning– Passive learning: data are randomly chosen – Active Learning
• Certain data are selected, to help build a better model• Obtaining data is costly and/or time-consuming
• Choosing stratum i, the estimated decrease of distance function is
• Iterative Sampling Process– At each iteration, stratum with largest decrease of distance function
is selected for sampling– Integrated variance is updated
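The iterative process above can be sketched as a greedy loop. The per-step gain V_i/n_i - V_i/(n_i+1) = V_i/(n_i(n_i+1)) follows from the assumed error decomposition Σ_j V_j/n_j; unlike the actual method, this sketch holds the pilot variance estimates fixed instead of re-estimating them as samples arrive.

```python
def active_allocation(variances, init_counts, budget):
    # Greedy active-learning allocation: at each iteration, give the next
    # sample to the stratum whose extra sample most decreases the estimated
    # distance function. With expected error sum_j V_j / n_j, one more sample
    # in stratum i decreases it by V_i / (n_i * (n_i + 1)).
    counts = list(init_counts)
    for _ in range(budget):
        gains = [v / (n * (n + 1)) for v, n in zip(variances, counts)]
        counts[gains.index(max(gains))] += 1
    return counts
```

Note how the high-variance stratum is favored early, but the gain shrinks quadratically in n_i, so low-variance strata eventually receive samples too.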
Representative Sampling-Proportion
• Proportion of a subspace
  – Fraction of data records belonging to the subspace
  – Depends on the proportion of the subspace in each stratum
• The proportion within the j-th stratum is estimated from that stratum's sample
• Risk function
  – Distance between the estimated fractions and their true values
• Iterative sampling process
  – At each iteration, the stratum with the largest decrease of the risk function is chosen for sampling
  – Parameters are updated
Stratified K-means Clustering
• Weight for data records in the i-th stratum
  – w_i = N_i / n_i, where N_i is the size of the stratum's population and n_i is the size of its sample
• Similar to k-means clustering
  – The center of the i-th cluster is the weighted mean of the sampled records assigned to it
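A sketch of the stratified variant, assuming each sampled point carries the weight w_i = N_i/n_i of its stratum and that the center update is the weighted mean of a cluster's members (the natural reading of the slide, not necessarily the authors' exact implementation):

```python
import numpy as np

def stratified_kmeans(points, weights, k, iters=50, seed=0):
    # k-means where each sampled point carries weight w = N_i / n_i for its
    # stratum, so the weighted sample stands in for the full population.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update each center as the weighted mean of its members.
        for j in range(k):
            m = labels == j
            if m.any():
                centers[j] = np.average(points[m], axis=0, weights=weights[m])
    return centers, labels
```

With uniform weights this reduces to ordinary k-means; unequal weights let points from under-sampled strata pull the centers proportionally harder.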
Experimental Results
• Data set: Yahoo! data set
  – Data on used cars
  – 8,000 data records
• Metric: Average Distance (AvgDist)
Representative Sampling-Yahoo! Data set
• Benefit of stratification
  – Compared with rand, the decreases in AvgDist are 7.2%, 13.2%, 15.0%, and 16.8%
• Benefit of representative sampling
  – Compared with rand_st, the decreases in AvgDist are 6.6%, 8.5%, and 10.5%
• Center-based sampling methods perform better
• The optimized sampling method performs better in the long run
Conclusion
• Clustering over a deep web data source is challenging
• A stratified k-means clustering method over the deep web
• Representative sampling
  – Centers
  – Proportions
• The experimental results show the effectiveness of our method