k-means
• Gaussian mixture model
• Maximize the likelihood
Data points: $\{x_1, x_2, \ldots, x_n\}$;  Centers: $\{c_1, c_2, \ldots, c_k\}$
$P(x_i \mid c_j) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\tfrac{1}{2}\lVert x_i - c_j \rVert^2\right)$
k-means
Minimize
Sum of squared errors (SSE) criterion (k clusters and n samples)
$\min \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2$
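For concreteness, a minimal NumPy sketch of Lloyd's algorithm for this SSE objective (function and variable names here are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points to the
    nearest center and recomputing centers; each step lowers the SSE."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iters):
        # squared distance of every point to every center (n x k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each center as the mean of its points (empty clusters not handled)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    sse = ((X - centers[labels]) ** 2).sum()  # the SSE criterion above
    return labels, centers, sse
```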
k-means
k-means works perfectly when clusters are “linearly separable” and spherical
k-means
SSE criterion doesn’t always work
k-means
What about data that contains arbitrarily shaped clusters of different densities?
The Kernel Trick Revisited
Map points to "feature space" using a basis function $\varphi(x)$
Replace the dot product $\varphi(x) \cdot \varphi(y)$ (similarity computation between points x and y) with the kernel entry $K(x, y)$
Mercer's condition: to expand the kernel function $K(x, y)$ into a dot product, i.e., $K(x, y) = \varphi(x) \cdot \varphi(y)$, $K(x, y)$ has to be a positive semi-definite function, i.e., for any function $f(x)$ whose $\int f(x)^2\,dx$ is finite, the following inequality holds: $\iint f(x)\,K(x, y)\,f(y)\,dx\,dy \ge 0$
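As a small illustration of Mercer's condition (code names are my own, not from the slides): build an RBF kernel matrix and verify that it is positive semi-definite on a finite sample.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a standard Mercer kernel."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(50, 2))
K = rbf_kernel(X, X)
# For a Mercer kernel the kernel matrix is PSD: all eigenvalues >= 0
# (up to numerical round-off).
print(np.linalg.eigvalsh(K).min())
```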
Kernel k-means
Minimize sum of squared error:
k-means: $\min \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \lVert x_i - c_j \rVert^2$,  where $u_{ij} \in \{0, 1\}$ and $\sum_{j=1}^{k} u_{ij} = 1$
Kernel k-means
Minimize sum of squared error:
k-means: $\min \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \lVert x_i - c_j \rVert^2$,  where $u_{ij} \in \{0, 1\}$ and $\sum_{j=1}^{k} u_{ij} = 1$
Replace $x$ with $\varphi(x)$: $\min \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \lVert \varphi(x_i) - \tilde{c}_j \rVert^2$
Kernel k-means
Cluster centers: $\tilde{c}_j = \frac{1}{n_j} \sum_{i=1}^{n} u_{ij}\, \varphi(x_i)$
Substitute for the centers:
$\sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \lVert \varphi(x_i) - \tilde{c}_j \rVert^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \left\lVert \varphi(x_i) - \frac{1}{n_j} \sum_{l=1}^{n} u_{lj}\, \varphi(x_l) \right\rVert^2$
Kernel k-means
• Use kernel trick: $\sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \lVert \varphi(x_i) - \tilde{c}_j \rVert^2 = \mathrm{trace}(K) - \mathrm{trace}(\tilde{U}' K \tilde{U})$
• Optimization problem: $\min \big[\mathrm{trace}(K) - \mathrm{trace}(\tilde{U}' K \tilde{U})\big] \;\Leftrightarrow\; \max\, \mathrm{trace}(\tilde{U}' K \tilde{U})$
• K is the n × n kernel matrix; $\tilde{U}$ is the optimal normalized cluster membership matrix
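A compact sketch of this formulation (illustrative names): kernel k-means operates directly on the n × n kernel matrix K, using the expansion of $\lVert \varphi(x_i) - \tilde{c}_j \rVert^2$ in terms of kernel entries only.

```python
import numpy as np

def kernel_kmeans(K, k, n_iters=50, seed=0):
    """Kernel k-means on a precomputed n x n kernel matrix K.
    ||phi(x_i) - c_j||^2 = K_ii - (2/n_j) sum_{l in C_j} K_il
                         + (1/n_j^2) sum_{l,m in C_j} K_lm"""
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(k, size=n)  # random initial labels
    for _ in range(n_iters):
        dist = np.full((n, k), np.inf)
        for j in range(k):
            members = labels == j
            nj = members.sum()
            if nj == 0:
                continue  # leave empty clusters at infinite distance
            dist[:, j] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / nj
                          + K[np.ix_(members, members)].sum() / nj ** 2)
        labels = dist.argmin(axis=1)
    return labels
```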
Example
(k = 2; figure: data points in the input space, axes $x_1$ and $x_2$)
Example
(k = 2; figure: k-means clusters in the input space, axes $x_1$ and $x_2$)
Example
(figure: data points in the input space, axes $x_1$ and $x_2$)
Example
Polynomial kernel: $K(x, y) = (x'y)^2$
$\varphi(x) = (z_1, z_2, z_3) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)$
(figure: data mapped from the input space $(x_1, x_2)$ to the feature space $(z_1, z_2, z_3)$)
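A quick numerical check of this mapping (illustrative snippet): the explicit feature map reproduces the polynomial kernel value without ever forming it explicitly.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (x'y)^2
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # 1.0
print((x @ y) ** 2)      # 1.0 -- identical: K(x, y) = phi(x).phi(y)
```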
k-means vs. Kernel k-means
(k = 2; figure: clustering results of k-means and kernel k-means on the same data, side by side)
Performance of Kernel k-means
Evaluation of the performance of clustering algorithms in kernel-induced feature space, Pattern Recognition, 2005
Limitations of Kernel k-means
• More complex than k-means
• Need to compute and store n x n kernel matrix
• Appropriate kernel function has to be determined
• Largest n that can be handled?
• Intel Xeon E7-8837 processor (Q2 2011), octa-core, 2.8 GHz, 4 TB max memory
• < 1 million points with single-precision numbers
• May take several days just to compute the kernel matrix
“Big data” Volume* – Big data comes in one size: large
*Defn. due to IBM
Data Volume
Application | Clustering task | Size of data | Number of features
Document retrieval | Group documents of similar topics | 10^9 | 10^4
Gene analysis | Group genes with similar expression levels | 10^6 | 10^2
Image retrieval | Quantize low-level features | 10^9 | 10^2
Earth science data analysis | Derive climate indices | 10^5 | 10^2
“Big data” Velocity – Often time-sensitive, big data must be
used as it is streaming
“Big data” Variety – Big data extends beyond structured data,
including unstructured data of all varieties: text, audio, video, click streams, log files and more
Large Scale Clustering
Deals with the first issue related to big data – the volume of data
Issues:
Computational Complexity
Hardware Limitations
Application Requirements
MapReduce Framework
How to distribute k-means?
Two methods
• Distribute distance computation
k-means Clustering with MapReduce - I
Distribute the cost of distance computation
Cluster centers maintained in global memory
Divide points among map tasks
Parallel k-means clustering based on MapReduce, Cloud computing, 2009
k-means Clustering with MapReduce - I
Map function
Find the closest center for data point
Intermediate output: Closest cluster index
Combine function
Partially sum the values of the points assigned to the same cluster, keep track of number of points in the cluster
Reduce function
Compute new centers from the output of combine function
Parallel k-means clustering based on MapReduce, Cloud computing, 2009
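A single-machine sketch of the map / combine / reduce logic described on this slide (the cited paper's implementation runs on Hadoop; these function names are illustrative):

```python
import numpy as np

def map_fn(point, centers):
    """Map: emit (index of the closest center, the point itself)."""
    j = int(((centers - point) ** 2).sum(axis=1).argmin())
    return j, point

def combine_fn(pairs):
    """Combine: per map task, partially sum points per cluster and count them."""
    sums, counts = {}, {}
    for j, p in pairs:
        sums[j] = sums.get(j, 0.0) + p
        counts[j] = counts.get(j, 0) + 1
    return sums, counts

def reduce_fn(partials, k, dim):
    """Reduce: merge the partial sums/counts and emit the new centers."""
    total = np.zeros((k, dim))
    count = np.zeros(k)
    for sums, counts in partials:
        for j in sums:
            total[j] += sums[j]
            count[j] += counts[j]
    return total / np.maximum(count, 1)[:, None]
```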
How to distribute k-means?
Two methods
• Distribute distance computation
• Distribute clustering task
k-means Clustering with MapReduce - II
Distribute the cost of clustering
Map function
Cluster the partition into k clusters
Intermediate output: Clusters of the partition
Reduce function
Cluster the cluster centers from the map output to obtain the new centers
Fast clustering using MapReduce, KDD, 2011
k-means Clustering with MapReduce - II
No global storage required
Approximate solution
Clustering error (SSE) < 2 * optimal clustering error
Fast clustering using MapReduce, KDD, 2011
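A rough single-machine sketch of this second scheme (illustrative; scikit-learn's KMeans stands in for a local clustering routine): each map task clusters its own partition, and the reduce step clusters the collected local centers.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for a local k-means routine

def map_partition(X_part, k):
    # Map: cluster one data partition into k local centers
    return KMeans(n_clusters=k, n_init=10).fit(X_part).cluster_centers_

def reduce_centers(local_centers, k):
    # Reduce: cluster all local centers to obtain the final k centers
    return KMeans(n_clusters=k, n_init=10).fit(np.vstack(local_centers)).cluster_centers_
```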
Machine Learning on MapReduce
Mahout – scalable implementation of major clustering and classification algorithms on Hadoop
Open source
Java and Maven based
Large Scale Kernel Clustering
A data set with n points requires an n × n kernel matrix $K_{n \times n}$
When n ~ 10^6, more than 1 TB of memory is required; computing the matrix is highly expensive
Approximate Kernel k-means
Low rank approximation
Use a small portion of the kernel matrix for clustering.
The (n − m) × (n − m) chunk of the kernel matrix need not be computed:
$K \approx K_B\, \hat{K}^{-1} K_B'$,  i.e., (n × n) ≈ (n × m)(m × m)(m × n)
Approximate Kernel k-means: Solution to Large Scale Kernel Clustering, KDD, 2011
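A sketch of this low-rank construction (illustrative names; a pseudo-inverse is used for numerical stability): only the n × m and m × m blocks are ever computed, never the full n × n matrix.

```python
import numpy as np

def low_rank_kernel_blocks(X, m, kernel_fn, seed=0):
    """Sample m points and compute only K_B (n x m) and K_hat (m x m).
    The full kernel is approximated as K_B @ pinv(K_hat) @ K_B.T,
    without ever forming the n x n matrix."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)   # the m sampled points
    K_B = kernel_fn(X, X[idx])                        # n x m block
    K_hat = kernel_fn(X[idx], X[idx])                 # m x m block
    return K_B, np.linalg.pinv(K_hat)

# e.g. K_B, K_hat_inv = low_rank_kernel_blocks(X, 1000, rbf_kernel)
```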
Approximate Kernel k-means
Cluster centers – linear combination of sampled points
Approximation error
$\tilde{c}_j = \sum_{i=1}^{m} \alpha_{ij}\, \varphi(\hat{x}_i)$
Clustering error $\le \left(1 + O\!\left(\tfrac{1}{m}\right)\right) \times$ Optimal clustering error
Approximate Kernel k-means: Solution to Large Scale Kernel Clustering, KDD, 2011
Approximate Kernel k-means
Performance of Approximate Kernel k-means
MNIST data set (70,000 data points)
Method | Kernel calculation | Clustering
Kernel k-means | 514 seconds | 3953 seconds
Approximate kernel k-means (m = 1000) | 8 seconds | 75 seconds
About 98% reduction in time
Almost the same clustering error as kernel k-means
Performance of Approximate Kernel k-means
Network Intrusion data set ( > 4 million data points)
• Kernel k-means not possible on a “normal” system
• Requires 64 TB of memory
• Approximate kernel k-means with just 40 GB memory
Method | Kernel calculation | Clustering
Approximate kernel k-means (m = 1000) | 52 seconds | 433 seconds
Summary
• Kernel k-means
• Performs better than k-means
• Kernel clustering algorithms, in general, are more complex than linear clustering algorithms
• Large scale clustering
• Distributed and approximate variants of existing algorithms are required for clustering large data sets