
k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory


k-means

• Gaussian mixture model

• Maximize the likelihood

$\{x_1, x_2, \ldots, x_n\}$

Centers: $c_1, c_2, \ldots, c_k$

$P(x_i \mid c_j, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\lVert x_i - c_j\rVert^2\right)$

k-means

$P(x_i \mid c_j, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\lVert x_i - c_j\rVert^2\right)$

Minimize $\lVert x_i - c_j\rVert^2$

Sum of squared errors (SSE) criterion (k clusters and n samples)

$\min \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j\rVert^2$
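
The SSE criterion is commonly minimized with Lloyd's iterations: assign each point to its nearest center, then move each center to the mean of its points. The following is a minimal NumPy sketch under stated assumptions, not code from the slides: the names `kmeans` and `sse`, the random initialization from data points, and the toy data are all illustrative.

```python
import numpy as np

def sse(X, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return float(((X - centers[labels]) ** 2).sum())

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize the centers with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        # Squared distances from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers, labels

# Two well-separated spherical blobs (illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
print(sse(X, centers, labels))
```

Each iteration costs on the order of n * k distance computations, which is the cost that the distributed variants later in the deck try to spread across machines.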


k-means

k-means works well when clusters are “linearly separable” and roughly spherical


k-means

SSE criterion doesn’t always work


k-means

What about data which contain arbitrarily shaped clusters of different densities?


The Kernel Trick Revisited

Map points to “feature space” using basis function $\varphi(x)$

Replace dot product $\varphi(x) \cdot \varphi(y)$ (for similarity computation between points x and y) with kernel entry $K(x, y)$

Mercer’s condition: To expand a kernel function $K(x, y)$ into a dot product, i.e., $K(x, y) = \varphi(x) \cdot \varphi(y)$, $K(x, y)$ has to be a positive semi-definite function, i.e., for any function $f(x)$ for which $\int f(x)^2\,dx$ is finite, the following inequality holds:

$\iint f(x)\,K(x, y)\,f(y)\,dx\,dy \ge 0$
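
On a finite sample, Mercer's condition implies that the kernel (Gram) matrix is symmetric positive semi-definite, which can be checked numerically through its eigenvalues. The sketch below is illustrative only: `gram`, `is_psd`, the tolerance, and the two test functions are assumptions, with the degree-2 polynomial kernel as a valid kernel and the negative squared distance as a similarity that is not one.

```python
import numpy as np

def gram(kernel, X):
    """n x n matrix of kernel values for all pairs of rows of X."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(K, tol=1e-8):
    """True if the symmetric matrix K has no eigenvalue below a small relative tolerance."""
    eigvals = np.linalg.eigvalsh(K)
    return bool(eigvals.min() >= -tol * max(1.0, abs(eigvals).max()))

def poly2(x, y):
    """Degree-2 polynomial kernel (x'y)^2: a valid (Mercer) kernel."""
    return float(x @ y) ** 2

def neg_dist(x, y):
    """Negative squared distance: a similarity, but not a Mercer kernel."""
    return -float(((x - y) ** 2).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
print(is_psd(gram(poly2, X)))     # True
print(is_psd(gram(neg_dist, X)))  # False
```

The second matrix has a zero diagonal, so its eigenvalues sum to zero and at least one must be negative.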

Kernel k-means

Minimize sum of squared error:

k-means: $\min \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}\,\lVert x_i - c_j\rVert^2$, with $u_{ij} \in \{0, 1\}$ and $\sum_{j=1}^{k} u_{ij} = 1$

Replace $x$ with $\varphi(x)$:

$\min \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}\,\lVert \varphi(x_i) - \tilde{c}_j\rVert^2$

Cluster centers:

$\tilde{c}_j = \frac{1}{n_j} \sum_{i=1}^{n} u_{ij}\,\varphi(x_i)$

Substitute for centers:

$\sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}\,\lVert \varphi(x_i) - \tilde{c}_j\rVert^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}\,\Bigl\lVert \varphi(x_i) - \frac{1}{n_j} \sum_{l=1}^{n} u_{lj}\,\varphi(x_l) \Bigr\rVert^2$
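
Expanding the squared norm shows why the feature map never has to be evaluated explicitly: every term is a dot product between mapped points, which the kernel supplies. A worked expansion follows, with $u_{ij}$ the 0/1 membership of point $i$ in cluster $j$ and $n_j$ the size of cluster $j$; the summation index $p$ is introduced here only for the derivation.

```latex
\begin{aligned}
\lVert \varphi(x_i) - \tilde{c}_j \rVert^2
  &= \varphi(x_i)\cdot\varphi(x_i)
     - \frac{2}{n_j}\sum_{l=1}^{n} u_{lj}\,\varphi(x_i)\cdot\varphi(x_l)
     + \frac{1}{n_j^2}\sum_{l=1}^{n}\sum_{p=1}^{n} u_{lj}\,u_{pj}\,\varphi(x_l)\cdot\varphi(x_p) \\
  &= K(x_i, x_i)
     - \frac{2}{n_j}\sum_{l=1}^{n} u_{lj}\,K(x_i, x_l)
     + \frac{1}{n_j^2}\sum_{l=1}^{n}\sum_{p=1}^{n} u_{lj}\,u_{pj}\,K(x_l, x_p)
\end{aligned}
```

The last term does not depend on $i$, so it can be computed once per cluster in each iteration.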

• Use kernel trick: $\sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}\,\lVert \varphi(x_i) - \tilde{c}_j\rVert^2 = \mathrm{trace}(K) - \mathrm{trace}(UKU')$

• Optimization problem: $\min\ \mathrm{trace}(K) - \mathrm{trace}(UKU') \;\equiv\; \max\ \mathrm{trace}(UKU')$

• K is the n x n kernel matrix, U is the optimal normalized cluster membership matrix
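
A compact sketch of kernel k-means that works only with entries of K, followed by a numerical check of the trace identity. It assumes the normalization U[j, i] = u_ij / sqrt(n_j), the form under which trace(UKU') equals the sum over clusters of (1/n_j) times the within-cluster kernel sum. The function names, the random initial labels, and the linear-kernel test data are illustrative assumptions.

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Kernel k-means on a precomputed n x n kernel matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)  # assumption: random initial assignment
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for j in range(k):
            mask = labels == j
            nj = mask.sum()
            if nj == 0:
                continue  # empty cluster: leave its distances at infinity
            # ||phi(x_i) - c_j||^2 = K_ii - (2/nj) sum_l K_il + (1/nj^2) sum_lp K_lp
            dist[:, j] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / nj
                          + K[np.ix_(mask, mask)].sum() / nj ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def sse_trace(K, labels, k):
    """trace(K) - trace(U K U') with U[j, i] = u_ij / sqrt(n_j)."""
    U = np.zeros((k, K.shape[0]))
    for j in range(k):
        mask = labels == j
        if mask.any():
            U[j, mask] = 1.0 / np.sqrt(mask.sum())
    return float(np.trace(K) - np.trace(U @ K @ U.T))

# With the linear kernel, phi(x) = x, so the SSE can also be computed directly.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
K = X @ X.T
labels = kernel_kmeans(K, k=3)
direct = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
             for j in range(3) if (labels == j).any())
print(np.isclose(direct, sse_trace(K, labels, 3)))  # True
```

Because trace(K) does not depend on the assignment, minimizing the SSE and maximizing trace(UKU') select the same clustering.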

Example

[Figure: data in the input space, axes $x_1$ and $x_2$, $k = 2$]

[Figure: k-means clusters in the input space, axes $x_1$ and $x_2$, $k = 2$]

Polynomial kernel: $K(x, y) = (x'y)^2$

$\varphi(x) = \varphi(x_1, x_2) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$

$z_1 = x_1^2,\quad z_2 = \sqrt{2}\,x_1 x_2,\quad z_3 = x_2^2$

[Figure: data mapped from the input space ($x_1$, $x_2$) to the feature space ($z_1$, $z_2$, $z_3$)]
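
The polynomial kernel can be checked against its explicit feature map on a pair of points. This is a small sketch with illustrative names `phi` and `poly_kernel`; the ring is a hypothetical data set chosen for the demonstration, since the slide's figure did not survive extraction.

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, y) = (x'y)^2 in two dimensions."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(x, y):
    """Kernel value computed in the input space, without the feature map."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly_kernel(x, y))         # (1*3 + 2*(-1))^2 = 1.0
print(float(phi(x) @ phi(y)))    # same value up to floating-point rounding

# Hypothetical ring of radius 3: every mapped point satisfies z1 + z3 = r^2 = 9,
# so rings of different radii fall on different parallel planes in (z1, z2, z3).
theta = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
ring = 3.0 * np.column_stack([np.cos(theta), np.sin(theta)])
Z = np.array([phi(p) for p in ring])
print(np.allclose(Z[:, 0] + Z[:, 2], 9.0))  # True
```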

k-means Vs. Kernel k-means

[Figure: clusters found by k-means and by kernel k-means, $k = 2$]


Performance of Kernel k-means

Evaluation of the performance of clustering algorithms in kernel-induced feature space, Pattern Recognition, 2005


Limitations of Kernel k-means

• More complex than k-means

• Need to compute and store n x n kernel matrix

• Appropriate kernel function has to be determined

• Largest n that can be handled?

• Intel Xeon E7-8837 Processor (Q2’11), 8-core, 2.8 GHz, 4 TB max memory

• < 1 million points with “single” precision numbers

• May take several days just to compute the kernel matrix
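
The memory figures follow from storing n x n entries at 4 bytes each. A short arithmetic sketch: `kernel_matrix_bytes` is an illustrative helper, and decimal terabytes are an assumption about the unit the slides use.

```python
def kernel_matrix_bytes(n, bytes_per_entry=4):
    """Memory for a dense n x n kernel matrix (4 bytes = single precision)."""
    return n * n * bytes_per_entry

TB = 10 ** 12  # decimal terabyte (assumed unit)

print(kernel_matrix_bytes(1_000_000) / TB)  # 4.0, the 4 TB limit quoted above
print(kernel_matrix_bytes(4_000_000) / TB)  # 64.0, the figure quoted for > 4 million points
```

Doubling n quadruples the memory, so the limit on n grows only with the square root of the available memory.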


“Big data” Volume* – Big data comes in one size: large

*Defn. due to IBM

Data Volume

Application                   Clustering task                              Size of data   Number of features
Document retrieval            Group documents of similar topics            $10^9$         $10^4$
Gene analysis                 Group genes with similar expression levels   $10^6$         $10^2$
Image retrieval               Quantize low-level features                  $10^9$         $10^2$
Earth science data analysis   Derive climate indices                       $10^5$         $10^2$

“Big data” Velocity – Often time-sensitive, big data must be used as it is streaming

“Big data” Variety – Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more

Large Scale Clustering

Deals with the first issue related to big data – the volume of data

Issues:

Computational Complexity

Hardware Limitations

Application Requirements


MapReduce Framework


How to distribute k-means?

Two methods

• Distribute distance computation


k-means Clustering with MapReduce - I

Distribute the cost of distance computation

Cluster centers maintained in global memory

Divide points among map tasks

Parallel k-means clustering based on MapReduce, Cloud computing, 2009


Map function

Find the closest center for data point

Intermediate output: Closest cluster index

Combine function

Partially sum the values of the points assigned to the same cluster, keep track of number of points in the cluster

Reduce function

Compute new centers from the output of combine function
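
The map, combine and reduce roles can be simulated in a single process to see how one k-means iteration decomposes. This is a sketch under stated assumptions, not Hadoop code and not the cited paper's implementation: `map_fn`, `combine_fn`, `reduce_fn`, `kmeans_iteration` and the split into four tasks are illustrative.

```python
import numpy as np

def map_fn(point, centers):
    """Map: emit (index of the closest center, point)."""
    j = int(((centers - point) ** 2).sum(axis=1).argmin())
    return j, point

def combine_fn(pairs, k, dim):
    """Combine: per-cluster partial sums and point counts for one map task."""
    sums, counts = np.zeros((k, dim)), np.zeros(k, dtype=int)
    for j, point in pairs:
        sums[j] += point
        counts[j] += 1
    return sums, counts

def reduce_fn(partials, old_centers):
    """Reduce: merge the partial sums of all tasks and compute the new centers."""
    sums = sum(s for s, _ in partials)
    counts = sum(c for _, c in partials)
    new_centers = old_centers.copy()
    nonempty = counts > 0  # an empty cluster keeps its old center
    new_centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return new_centers

def kmeans_iteration(X, centers, n_tasks=4):
    """One k-means iteration with the points divided among n_tasks map tasks."""
    k, dim = centers.shape
    partials = []
    for chunk in np.array_split(X, n_tasks):
        pairs = [map_fn(p, centers) for p in chunk]
        partials.append(combine_fn(pairs, k, dim))
    return reduce_fn(partials, centers)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
centers = X[:3].copy()  # current centers, shared by all map tasks
new_centers = kmeans_iteration(X, centers)
print(new_centers)
```

Only k partial sums and counts leave each task, so the traffic to the reducer is independent of the number of points per task.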


How to distribute k-means?

Two methods

• Distribute distance computation

• Distribute clustering task


k-means Clustering with MapReduce - II

Distribute the cost of clustering

Map function

Cluster the partition into k clusters

Intermediate output: Clusters of the partition

Reduce function

Cluster the cluster centers from the map output to obtain the new centers

Fast clustering using MapReduce, KDD, 2011
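
The two-level scheme can be sketched in one process: each map task clusters its own partition, and the reduce step clusters the collected centers. This follows only the outline on the slide, not the algorithm of the cited paper. The helper `lloyd`, the unweighted treatment of the collected centers, and the four partitions are assumptions made for illustration.

```python
import numpy as np

def lloyd(X, k, n_iter=50, seed=0):
    """Plain k-means (Lloyd's iterations); returns the k centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def map_fn(partition, k):
    """Map: cluster one partition into k clusters and emit its centers."""
    return lloyd(partition, k)

def reduce_fn(all_centers, k):
    """Reduce: cluster the collected centers to obtain the final k centers."""
    return lloyd(np.vstack(all_centers), k)

def two_level_kmeans(X, k, n_partitions=4):
    """Cluster each partition independently, then cluster the resulting centers."""
    partitions = np.array_split(X, n_partitions)
    return reduce_fn([map_fn(p, k) for p in partitions], k)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(5.0, 0.3, size=(100, 2))])
rng.shuffle(X)  # mix the two groups so every partition sees both
final_centers = two_level_kmeans(X, k=2)
print(final_centers)
```

Each map task needs only its own partition and no shared state; the price is that the result is an approximation of what k-means on the full data would return.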


No global storage required

Approximate solution

Clustering error (SSE) < 2 * optimal clustering error


Machine Learning on MapReduce

Mahout – scalable implementation of major clustering and classification algorithms on Hadoop

Open source

Java and Maven based

Large Scale Kernel Clustering

Data set with n points: the kernel matrix $K$ is $n \times n$.

When $n \sim 10^6$, more than 1 TB of memory is required, and the computation is highly expensive

Approximate Kernel k-means

Low rank approximation

Use a small portion of the kernel matrix for clustering.

(n-m) x (n-m) chunk of the kernel matrix need not be computed

$K \approx K_B\,\hat{K}^{-1} K_B'$, where $K$ is n x n, $K_B$ is n x m and $\hat{K}$ is m x m

Approximate Kernel k-means: Solution to Large Scale Kernel Clustering, KDD, 2011
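
The low rank approximation can be tried on a small data set where the full kernel matrix is still affordable for comparison. The sketch below makes several assumptions that are not from the slides: the degree-2 polynomial kernel, uniform random sampling of the m points, and a pseudo-inverse in place of the inverse because the m x m block can be singular.

```python
import numpy as np

def poly2(A, B):
    """Degree-2 polynomial kernel (a'b)^2 between rows of A and rows of B."""
    return (A @ B.T) ** 2

rng = np.random.default_rng(0)
n, m = 300, 20
X = rng.normal(size=(n, 2))
sample = rng.choice(n, size=m, replace=False)  # indices of the m sampled points

K_B = poly2(X, X[sample])             # n x m: all points vs. sampled points
K_hat = poly2(X[sample], X[sample])   # m x m: sampled points vs. each other
# Pseudo-inverse with an explicit cutoff, since K_hat is rank deficient here.
K_approx = K_B @ np.linalg.pinv(K_hat, rcond=1e-10) @ K_B.T

K_full = poly2(X, X)  # formed here only to measure the approximation error
rel_err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
print(K_B.shape, K_hat.shape, rel_err)
```

For this particular kernel in two dimensions the kernel matrix has rank at most 3, so the approximation is exact up to rounding. For kernels of higher rank the error is nonzero and depends on m and on which points are sampled. In practice only `K_B` and `K_hat` are stored, which takes n * m + m * m entries instead of n * n.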

Cluster centers – linear combination of sampled points

$c_j = \sum_{i=1}^{m} \alpha_{ij}\,\varphi(\hat{x}_i)$

Approximation error

[Equation: relation between the clustering error and the optimal clustering error, in terms of $m$]


Performance of Approximate Kernel k-means

MNIST data set (70,000 data points)

                                      Kernel calculation   Clustering
Kernel k-means                        514 seconds          3953 seconds
Approximate kernel k-means (m=1000)   8 seconds            75 seconds

About 98% reduction in time

Almost the same clustering error as kernel k-means
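
The 98% figure can be reproduced from the table by comparing total running times, assuming the reduction refers to kernel calculation plus clustering taken together.

```python
full = 514 + 3953     # kernel k-means: kernel calculation + clustering (seconds)
approx = 8 + 75       # approximate kernel k-means, m = 1000 (seconds)
reduction = 1 - approx / full
print(f"{reduction:.1%}")  # 98.1%
```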


Network Intrusion data set ( > 4 million data points)

• Kernel k-means not possible on a “normal” system

• Requires 64 TB of memory

• Approximate kernel k-means with just 40 GB memory

                                      Kernel calculation   Clustering
Approximate kernel k-means (m=1000)   52 seconds           433 seconds


Summary

• Kernel k-means

• Performs better than k-means

• Kernel clustering algorithms are, in general, more complex than linear clustering algorithms

• Large scale clustering

• Distributed and approximate variants of existing algorithms required for clustering large data