DENCLUE 2.0: Fast Clustering based on Kernel Density Estimation
Alexander Hinneburg, Martin-Luther-University Halle-Wittenberg, Germany
Hans-Henning Gabriel, 101tec GmbH, Halle, Germany
Overview
• Density-based clustering and DENCLUE 1.0
• Hill climbing as EM-algorithm
• Identification of local maxima
• Applications of general EM-acceleration
• Experiments
Density-Based Clustering
• Assumption
  – clusters are regions of high density in the data space
• How to estimate density?
  – parametric models
    • mixture models
  – non-parametric models
    • histogram
    • kernel density estimation
Kernel Density Estimation
• Idea
  – influence of a data point is modeled by a kernel
  – density is the normalized sum of all kernels
  – smoothing parameter h
Gaussian Kernel
Density Estimate
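The formulas behind these two labels do not survive the slide export; in standard notation, the d-dimensional Gaussian kernel and the kernel density estimate over data points x_1, …, x_N with bandwidth h read:

```latex
K(u) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\lVert u \rVert^{2}\right),
\qquad
\hat{p}(x) = \frac{1}{N h^{d}} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right)
```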
DENCLUE 1.0 Framework
• Clusters are defined by local maxima of the density estimate
  – find all maxima by hill climbing
• Problem
  – constant step size
Gradient and hill climbing with constant step size:
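The slide's equations are lost in the export; a reconstruction of the DENCLUE 1.0 hill climbing, assuming the Gaussian kernel with bandwidth h and a fixed step size δ, is:

```latex
\nabla \hat{p}(x) = \frac{1}{N h^{d+2}} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right)(x_i - x),
\qquad
x^{(t+1)} = x^{(t)} + \delta \, \frac{\nabla \hat{p}\bigl(x^{(t)}\bigr)}{\bigl\lVert \nabla \hat{p}\bigl(x^{(t)}\bigr) \bigr\rVert}
```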
Problem of Constant Step Size
• Not efficient
  – many unnecessary small steps
• Not effective
  – does not converge to a local maximum, just comes close
• Example
New Hill Climbing Approach
• General approach
  – differentiate the density estimate and set it to zero
  – no closed-form solution, but the equation can be used for iteration (see the derivation sketched below)
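A sketch of that derivation, assuming the Gaussian kernel: setting the gradient of the density estimate to zero and solving for the x that appears outside the kernel yields a fixed-point update whose step size adapts automatically:

```latex
\nabla \hat{p}(x) = 0
\;\Longrightarrow\;
x = \frac{\sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right) x_i}{\sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right)}
\;\Longrightarrow\;
x^{(t+1)} = \frac{\sum_{i=1}^{N} K\!\left(\frac{x^{(t)} - x_i}{h}\right) x_i}{\sum_{i=1}^{N} K\!\left(\frac{x^{(t)} - x_i}{h}\right)}
```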
New DENCLUE 2.0 Hill Climbing
• Efficient
  – automatically adjusted step size at no extra cost
• Effective
  – converges to a local maximum (proof follows; a code sketch is given below)
• Example
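A minimal sketch of this hill climbing in Python, assuming the Gaussian kernel with bandwidth h; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def denclue2_hill_climb(x0, data, h, tol=1e-5, max_iter=100):
    """Fixed-point hill climbing towards a local maximum of the
    Gaussian kernel density estimate (step size adapts automatically)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # kernel weights of all data points relative to the current position
        sq_dist = np.sum((data - x) ** 2, axis=1)
        w = np.exp(-0.5 * sq_dist / h**2)
        # next position = kernel-weighted mean of the data
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# usage: climb from every data point, then group the end points into clusters
# data = np.random.rand(500, 2)
# end_points = np.array([denclue2_hill_climb(p, data, h=0.1) for p in data])
```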
Proof of Convergence
• Cast the problem of maximizing the kernel density as maximizing the likelihood of a mixture model
• Introduce a hidden variable
Proof of Convergence
• The complete likelihood is maximized by the EM algorithm
• This also maximizes the original likelihood, which is the kernel density estimate
• Starting the EM iteration at a point performs the hill climbing for that point
E-Step
M-Step
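The formulas for both steps are lost in the export; under the mixture view above (equal-weight Gaussian kernels centered at the data points, bandwidth h), the standard EM steps would read as follows, with the M-step reproducing the DENCLUE 2.0 update:

```latex
\text{E-step:}\quad
q_i = \frac{K\!\left(\frac{x^{(t)} - x_i}{h}\right)}{\sum_{j=1}^{N} K\!\left(\frac{x^{(t)} - x_j}{h}\right)},
\qquad
\text{M-step:}\quad
x^{(t+1)} = \sum_{i=1}^{N} q_i \, x_i
```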
Identification of Local Maxima
• The EM algorithm iterates until the sum of the k last step sizes falls below a threshold; the point reached is taken as the end point
• Assumption
  – the true local maximum lies in a small ball around the end point
• Points whose end points are closer than the ball's size belong to the same maximum M (see the sketch below)
• In case of a non-unique assignment, do a few extra EM iterations
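A minimal sketch of the assignment step, assuming that end points within a distance threshold eps are merged into one maximum; the greedy merging and all names are illustrative, not from the paper:

```python
import numpy as np

def assign_clusters(end_points, eps):
    """Greedily group hill-climbing end points that lie within eps of an
    already discovered maximum; returns cluster labels and the maxima."""
    maxima = []                                   # one representative per cluster
    labels = np.empty(len(end_points), dtype=int)
    for idx, p in enumerate(end_points):
        for c, m in enumerate(maxima):
            if np.linalg.norm(p - m) <= eps:      # close enough: same maximum
                labels[idx] = c
                break
        else:                                     # no nearby maximum: new cluster
            maxima.append(p)
            labels[idx] = len(maxima) - 1
    return labels, np.array(maxima)
```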
Acceleration
• Sparse EM
  – update only the p% of points with the largest posterior (sketched below)
  – saves the remaining (1-p)% of kernel computations after the first iteration
• Data reduction
  – use only p% of the data as representative points
  – random sampling
  – k-means
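A rough sketch of one reading of the sparse EM idea, reusing the hill climbing above: after the first full iteration, only the kernels of the points with the largest posterior weights are recomputed, while the contribution of the remaining points stays frozen. The active/frozen split and all names are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def sparse_hill_climb(x0, data, h, p=0.2, tol=1e-5, max_iter=100):
    """Hill climbing that, after one full E-step, recomputes kernel weights
    only for the top-p fraction of points (largest posterior) and keeps the
    other points' contribution fixed (assumption)."""
    x = np.asarray(x0, dtype=float)
    w = np.exp(-0.5 * np.sum((data - x) ** 2, axis=1) / h**2)   # full E-step
    order = np.argsort(w)
    k = max(1, int(p * len(data)))
    active = order[-k:]                    # points kept up to date
    frozen = order[:-k]                    # contribution frozen after iteration 1
    frozen_w = w[frozen].sum()
    frozen_wx = (w[frozen, None] * data[frozen]).sum(axis=0)
    for _ in range(max_iter):
        w_act = np.exp(-0.5 * np.sum((data[active] - x) ** 2, axis=1) / h**2)
        x_new = (frozen_wx + (w_act[:, None] * data[active]).sum(axis=0)) \
                / (frozen_w + w_act.sum())
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```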
Experiments
• Comparison of DENCLUE 1.0 (FS) vs. 2.0 (SSA)
• 16-dim. artificial data
• both methods are tuned to find the correct clustering
Experiments
• Cluster quality (NMI) of DENCLUE 2.0 (SSA), its acceleration methods, and k-means on real data
  – sample sizes 0.8, 0.4, 0.2
Conclusion
• New hill climbing for DENCLUE
• Automatic step size adjustment
• Convergence proof by reduction to EM
• Allows the application of general EM accelerations
• Future work
  – automatic setting of the smoothing parameter h (so far tuned manually)