k-means
• Gaussian mixture model
• Maximize the likelihood
Data points: $\{x_1, x_2, \ldots, x_n\}$;  Centers: $\{c_1, c_2, \ldots, c_k\}$
$P(x_i \mid c_j) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\tfrac{1}{2}\lVert x_i - c_j \rVert^2\right)$
k-means
Minimize
Sum of squared errors (SSE) criterion (k clusters and n samples)
$\min \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2$
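For concreteness, a minimal NumPy sketch of Lloyd's algorithm for this SSE objective (function and variable names here are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points to the
    nearest center and recomputing centers; each step lowers the SSE."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iters):
        # squared distance of every point to every center (n x k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each center as the mean of its points (empty clusters not handled)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    sse = ((X - centers[labels]) ** 2).sum()  # the SSE criterion above
    return labels, centers, sse
```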
k-means
k-means works perfectly when clusters are “linearly separable” and spherical
k-means
SSE criterion doesn’t always work
k-means
What about data that contains arbitrarily shaped clusters of different densities?
The Kernel Trick Revisited
Map points to "feature space" using a basis function $\varphi(x)$
Replace the dot product $\varphi(x) \cdot \varphi(y)$ (similarity computation between points x and y) with the kernel entry $K(x, y)$
Mercer's condition: to expand the kernel function $K(x, y)$ into a dot product, i.e., $K(x, y) = \varphi(x) \cdot \varphi(y)$, $K(x, y)$ has to be a positive semi-definite function, i.e., for any function $f(x)$ whose $\int f(x)^2\,dx$ is finite, the following inequality holds: $\iint f(x)\,K(x, y)\,f(y)\,dx\,dy \ge 0$
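As a small illustration of Mercer's condition (code names are my own, not from the slides): build an RBF kernel matrix and verify that it is positive semi-definite on a finite sample.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a standard Mercer kernel."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(50, 2))
K = rbf_kernel(X, X)
# For a Mercer kernel the kernel matrix is PSD: all eigenvalues >= 0
# (up to numerical round-off).
print(np.linalg.eigvalsh(K).min())
```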
Kernel k-means
Minimize sum of squared error:
k-means: $\min \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \lVert x_i - c_j \rVert^2$,  where $u_{ij} \in \{0, 1\}$ and $\sum_{j=1}^{k} u_{ij} = 1$
Kernel k-means
Minimize sum of squared error:
k-means: $\min \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \lVert x_i - c_j \rVert^2$,  where $u_{ij} \in \{0, 1\}$ and $\sum_{j=1}^{k} u_{ij} = 1$
Replace $x$ with $\varphi(x)$: $\min \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \lVert \varphi(x_i) - \tilde{c}_j \rVert^2$
Kernel k-means
Cluster centers: $\tilde{c}_j = \frac{1}{n_j} \sum_{i=1}^{n} u_{ij}\, \varphi(x_i)$
Substitute for the centers:
$\sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \lVert \varphi(x_i) - \tilde{c}_j \rVert^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \left\lVert \varphi(x_i) - \frac{1}{n_j} \sum_{l=1}^{n} u_{lj}\, \varphi(x_l) \right\rVert^2$
Kernel k-means
• Use kernel trick: $\sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \lVert \varphi(x_i) - \tilde{c}_j \rVert^2 = \mathrm{trace}(K) - \mathrm{trace}(\tilde{U}' K \tilde{U})$
• Optimization problem: $\min \big[\mathrm{trace}(K) - \mathrm{trace}(\tilde{U}' K \tilde{U})\big] \;\Leftrightarrow\; \max\, \mathrm{trace}(\tilde{U}' K \tilde{U})$
• K is the n × n kernel matrix; $\tilde{U}$ is the optimal normalized cluster membership matrix
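A compact sketch of this formulation (illustrative names): kernel k-means operates directly on the n × n kernel matrix K, using the expansion of $\lVert \varphi(x_i) - \tilde{c}_j \rVert^2$ in terms of kernel entries only.

```python
import numpy as np

def kernel_kmeans(K, k, n_iters=50, seed=0):
    """Kernel k-means on a precomputed n x n kernel matrix K.
    ||phi(x_i) - c_j||^2 = K_ii - (2/n_j) sum_{l in C_j} K_il
                         + (1/n_j^2) sum_{l,m in C_j} K_lm"""
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(k, size=n)  # random initial labels
    for _ in range(n_iters):
        dist = np.full((n, k), np.inf)
        for j in range(k):
            members = labels == j
            nj = members.sum()
            if nj == 0:
                continue  # leave empty clusters at infinite distance
            dist[:, j] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / nj
                          + K[np.ix_(members, members)].sum() / nj ** 2)
        labels = dist.argmin(axis=1)
    return labels
```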
Example
(k = 2; figure: data points in the input space, axes $x_1$ and $x_2$)
Example
(k = 2; figure: k-means clusters in the input space, axes $x_1$ and $x_2$)
Example
(figure: data points in the input space, axes $x_1$ and $x_2$)
Example
Polynomial kernel: $K(x, y) = (x'y)^2$
$\varphi(x) = (z_1, z_2, z_3) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)$
(figure: data mapped from the input space $(x_1, x_2)$ to the feature space $(z_1, z_2, z_3)$)
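A quick numerical check of this mapping (illustrative snippet): the explicit feature map reproduces the polynomial kernel value without ever forming it explicitly.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (x'y)^2
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # 1.0
print((x @ y) ** 2)      # 1.0 -- identical: K(x, y) = phi(x).phi(y)
```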
k-means vs. Kernel k-means
(k = 2; figure: clustering results of k-means and kernel k-means on the same data, side by side)
Performance of Kernel k-means
Evaluation of the performance of clustering algorithms in kernel-induced feature space, Pattern Recognition, 2005
Limitations of Kernel k-means
• More complex than k-means
• Need to compute and store n x n kernel matrix
• Appropriate kernel function has to be determined
• Largest n that can be handled?
• Intel Xeon E7-8837 processor (Q2 2011), octa-core, 2.8 GHz, 4 TB max memory
• < 1 million points with single-precision numbers
• May take several days just to compute the kernel matrix
“Big data” Volume* – Big data comes in one size: large
*Defn. due to IBM
Data Volume
Application | Clustering task | Size of data | Number of features
Document retrieval | Group documents of similar topics | 10^9 | 10^4
Gene analysis | Group genes with similar expression levels | 10^6 | 10^2
Image retrieval | Quantize low-level features | 10^9 | 10^2
Earth science data analysis | Derive climate indices | 10^5 | 10^2
“Big data” Velocity – Often time-sensitive, big data must be
used as it is streaming
“Big data” Variety – Big data extends beyond structured data,
including unstructured data of all varieties: text, audio, video, click streams, log files and more
Large Scale Clustering
Deals with the first issue related to big data – the volume of data
Issues:
Computational Complexity
Hardware Limitations
Application Requirements
MapReduce Framework
How to distribute k-means?
Two methods
• Distribute distance computation
k-means Clustering with MapReduce - I
Distribute the cost of distance computation
Cluster centers maintained in global memory
Divide points among map tasks
Parallel k-means clustering based on MapReduce, Cloud computing, 2009
k-means Clustering with MapReduce - I
Map function
Find the closest center for data point
Intermediate output: Closest cluster index
Combine function
Partially sum the values of the points assigned to the same cluster, keep track of number of points in the cluster
Reduce function
Compute new centers from the output of combine function
Parallel k-means clustering based on MapReduce, Cloud computing, 2009
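A single-machine sketch of the map / combine / reduce logic described on this slide (the cited paper's implementation runs on Hadoop; these function names are illustrative):

```python
import numpy as np

def map_fn(point, centers):
    """Map: emit (index of the closest center, the point itself)."""
    j = int(((centers - point) ** 2).sum(axis=1).argmin())
    return j, point

def combine_fn(pairs):
    """Combine: per map task, partially sum points per cluster and count them."""
    sums, counts = {}, {}
    for j, p in pairs:
        sums[j] = sums.get(j, 0.0) + p
        counts[j] = counts.get(j, 0) + 1
    return sums, counts

def reduce_fn(partials, k, dim):
    """Reduce: merge the partial sums/counts and emit the new centers."""
    total = np.zeros((k, dim))
    count = np.zeros(k)
    for sums, counts in partials:
        for j in sums:
            total[j] += sums[j]
            count[j] += counts[j]
    return total / np.maximum(count, 1)[:, None]
```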
How to distribute k-means?
Two methods
• Distribute distance computation
• Distribute clustering task
k-means Clustering with MapReduce - II
Distribute the cost of clustering
Map function
Cluster the partition into k clusters
Intermediate output: Clusters of the partition
Reduce function
Cluster the cluster centers from the map output to obtain the new centers
Fast clustering using MapReduce, KDD, 2011
k-means Clustering with MapReduce - II
No global storage required
Approximate solution
Clustering error (SSE) < 2 * optimal clustering error
Fast clustering using MapReduce, KDD, 2011
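A rough single-machine sketch of this second scheme (illustrative; scikit-learn's KMeans stands in for a local clustering routine): each map task clusters its own partition, and the reduce step clusters the collected local centers.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for a local k-means routine

def map_partition(X_part, k):
    # Map: cluster one data partition into k local centers
    return KMeans(n_clusters=k, n_init=10).fit(X_part).cluster_centers_

def reduce_centers(local_centers, k):
    # Reduce: cluster all local centers to obtain the final k centers
    return KMeans(n_clusters=k, n_init=10).fit(np.vstack(local_centers)).cluster_centers_
```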
Machine Learning on MapReduce
Mahout – scalable implementation of major clustering and classification algorithms on Hadoop
Open source
Java and Maven based
Large Scale Kernel Clustering
A data set with n points requires an n × n kernel matrix $K_{n \times n}$
When n ~ 10^6, more than 1 TB of memory is required; computing the matrix is highly expensive
Approximate Kernel k-means
Low rank approximation
Use a small portion of the kernel matrix for clustering.
The (n − m) × (n − m) chunk of the kernel matrix need not be computed:
$K \approx K_B\, \hat{K}^{-1} K_B'$,  i.e., (n × n) ≈ (n × m)(m × m)(m × n)
Approximate Kernel k-means: Solution to Large Scale Kernel Clustering, KDD, 2011
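A sketch of this low-rank construction (illustrative names; a pseudo-inverse is used for numerical stability): only the n × m and m × m blocks are ever computed, never the full n × n matrix.

```python
import numpy as np

def low_rank_kernel_blocks(X, m, kernel_fn, seed=0):
    """Sample m points and compute only K_B (n x m) and K_hat (m x m).
    The full kernel is approximated as K_B @ pinv(K_hat) @ K_B.T,
    without ever forming the n x n matrix."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)   # the m sampled points
    K_B = kernel_fn(X, X[idx])                        # n x m block
    K_hat = kernel_fn(X[idx], X[idx])                 # m x m block
    return K_B, np.linalg.pinv(K_hat)

# e.g. K_B, K_hat_inv = low_rank_kernel_blocks(X, 1000, rbf_kernel)
```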
Approximate Kernel k-means
Cluster centers – linear combination of sampled points
Approximation error
$\tilde{c}_j = \sum_{i=1}^{m} \alpha_{ij}\, \varphi(\hat{x}_i)$
Clustering error $\le \left(1 + O\!\left(\tfrac{1}{m}\right)\right) \times$ Optimal clustering error
Approximate Kernel k-means: Solution to Large Scale Kernel Clustering, KDD, 2011
Approximate Kernel k-means
Performance of Approximate Kernel k-means
MNIST data set (70,000 data points)
Method | Kernel calculation | Clustering
Kernel k-means | 514 seconds | 3953 seconds
Approximate kernel k-means (m = 1000) | 8 seconds | 75 seconds
About 98% reduction in time
Almost the same clustering error as kernel k-means
Performance of Approximate Kernel k-means
Network Intrusion data set ( > 4 million data points)
• Kernel k-means not possible on a “normal” system
• Requires 64 TB of memory
• Approximate kernel k-means with just 40 GB memory
Method | Kernel calculation | Clustering
Approximate kernel k-means (m = 1000) | 52 seconds | 433 seconds
Summary
• Kernel k-means
• Performs better than k-means
• Kernel clustering algorithms, in general, are more complex than linear clustering algorithms
• Large scale clustering
• Distributed and approximate variants of existing algorithms are required for clustering large data sets