Analysis of High-Dimensional Data
Leif Kobbelt

Lec 6 Data Analysis

Page 1: Lec 6 Data Analysis

Analysis of High-Dimensional Data

Leif Kobbelt

Page 2: Lec 6 Data Analysis


Motivation

• Given: n samples in d-dimensional space

$X = (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$

Page 3: Lec 6 Data Analysis


Motivation

• Given: n samples in d-dimensional space

$X = (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$

• Decrease d → dimensionality reduction: PCA, MDS

Page 4: Lec 6 Data Analysis


Principal Component Analysis

• Idea: Compute an orthogonal linear transformation that transforms the data into a new coordinate system s.t.

  greatest variance on the first coordinate axis

  second greatest variance on the second axis

  etc.

• Optimal transform for a given data set in the least squares sense

• Dimensionality reduction: project data into the lower-dimensional space spanned by the first principal components

Page 5: Lec 6 Data Analysis


Principal Component Analysis

Given: n samples scattered in d-dimensional space,

written as a matrix

$X = (x_1, x_2, \dots, x_n) \in \mathbb{R}^{d \times n}$

compute the centered covariance matrix (interpretation as a map from $\mathbb{R}^d$ to $\mathbb{R}^d$):

$C = (X - \bar{X})(X - \bar{X})^T \in \mathbb{R}^{d \times d}$

where $\bar{X}$ contains the sample mean in every column.

Page 6: Lec 6 Data Analysis


Principal Component Analysis

computation of C with the “centering matrix” J:

$J = I - \frac{1}{n} \mathbb{1}\mathbb{1}^T$

$C = XJ\,(XJ)^T = X J J^T X^T$

principal component(s): eigenvector(s) $v_i$ to the largest eigenvalue(s) $\lambda_i$ of C (low rank approximation)
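Below is a minimal numpy sketch of this construction (the function name pca and all variable names are ours; data is stored column-wise as a d × n array, matching the slides):

```python
import numpy as np

def pca(X, q):
    """PCA of a d x n data matrix X: returns the first q principal
    components and the projected coordinates X' = V_q^T X J."""
    d, n = X.shape
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix J = I - (1/n) 1 1^T
    Xc = X @ J                            # centered data X J
    C = Xc @ Xc.T                         # covariance matrix C = XJ (XJ)^T, d x d
    lam, V = np.linalg.eigh(C)            # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:q]     # largest q eigenvalues first
    Vq = V[:, order]                      # first q principal components
    return Vq, Vq.T @ Xc                  # components and q x n projection
```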

Page 7: Lec 6 Data Analysis


Principal Component Analysis

$C = V D V^T = (v_1 \cdots v_d)\,\mathrm{diag}(\lambda_1, \dots, \lambda_d)\,(v_1 \cdots v_d)^T$

rank-q approximation:

$C^* := (v_1 \cdots v_q)\,\mathrm{diag}(\lambda_1, \dots, \lambda_q)\,(v_1 \cdots v_q)^T$

projection onto the first q principal components:

$X' = (v_1 \cdots v_q)^T X J \in \mathbb{R}^{q \times n}$

Page 8: Lec 6 Data Analysis


Relation to SVD

• singular value decomposition

$XJ = U \Sigma V^T$

$C = XJ\,(XJ)^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T$

i.e., the eigenvectors of C are the left singular vectors of XJ, and $\lambda_i = \sigma_i^2$.
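The same decomposition as a short numpy sketch, assuming a d × n array X as above (this avoids forming C explicitly, which is numerically preferable):

```python
import numpy as np

# sketch: PCA via the SVD of the centered data, XJ = U Sigma V^T
Xc = X - X.mean(axis=1, keepdims=True)             # equals X J
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components, eigenvalues = U, s**2                  # lambda_i = sigma_i^2
```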

Page 9: Lec 6 Data Analysis


… for very large dimension d

$C = (XJ)(XJ)^T \in \mathbb{R}^{d \times d}$ is large, but $\tilde{C} = (XJ)^T (XJ) \in \mathbb{R}^{n \times n}$ is small if $n \ll d$.

If $\tilde{C} w = \lambda w$, then $v = XJ\,w$ is an eigenvector of C:

$C\,v = (XJ)(XJ)^T (XJ)\,w = (XJ)\,\tilde{C}\,w = \lambda\,XJ\,w = \lambda v$
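A sketch of this trick for d ≫ n (names are ours; the nonzero eigenvalues of C̃ and C coincide):

```python
import numpy as np

def pca_large_d(X, q):
    """PCA for d >> n: eigendecompose the small n x n matrix
    C~ = (XJ)^T (XJ) and lift its eigenvectors back to R^d via v = XJ w."""
    Xc = X - X.mean(axis=1, keepdims=True)   # X J
    Ct = Xc.T @ Xc                           # C~, n x n instead of d x d
    lam, W = np.linalg.eigh(Ct)
    order = np.argsort(lam)[::-1][:q]
    V = Xc @ W[:, order]                     # v = XJ w are eigenvectors of C
    V /= np.linalg.norm(V, axis=0)           # normalize to unit length
    return V, lam[order]
```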

Page 10: Lec 6 Data Analysis


Example

10 points in $\mathbb{R}^2$ [scatter plot]

Page 11: Lec 6 Data Analysis


Example

10 points in $\mathbb{R}^2$:

$C = \begin{pmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{pmatrix}$

$e_1 = \begin{pmatrix} 0.68 \\ 0.74 \end{pmatrix}, \qquad e_2 = \begin{pmatrix} -0.74 \\ 0.68 \end{pmatrix}$

Page 12: Lec 6 Data Analysis


Multi-Dimensional Scaling

Given: For n unknown samples in high-dimensional space

$X = (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$

we are given a matrix of pairwise (squared) distances:

$D_{i,j} = \lVert x_i - x_j \rVert^2, \quad D \in \mathbb{R}^{n \times n}$

Page 13: Lec 6 Data Analysis


Multi-Dimensional Scaling

samples in some abstract space:

$X = (x_1, \dots, x_n), \quad x_i \in \mathcal{A}$

matrix of pairwise abstract distances:

$D_{i,j}, \quad D \in \mathbb{R}^{n \times n}$

Page 14: Lec 6 Data Analysis


Multi-Dimensional Scaling

Goal: find an embedding $\hat{X}$ of X in a low-dimensional space such that the pairwise (variations of) distances are preserved:

$\Phi(D, \hat{D}) = \lVert J (D - \hat{D}) J^T \rVert_F^2$

Other measures $\Phi(D, \hat{D})$ are possible, but they cannot be solved as easily.

Page 15: Lec 6 Data Analysis


Multi-Dimensional Scaling

closed form solution:

the first q eigenvectors $v_1, \dots, v_q$ of the matrix

$-\frac{1}{2}\, J D J^T \in \mathbb{R}^{n \times n}$

define the coordinates of a q-dimensional embedding:

$X' = \begin{pmatrix} \sqrt{\lambda_1}\, v_1^T \\ \vdots \\ \sqrt{\lambda_q}\, v_q^T \end{pmatrix} \in \mathbb{R}^{q \times n}$
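A sketch of classical MDS following the closed-form solution (it assumes D holds squared Euclidean-like distances; with strongly non-Euclidean D the top eigenvalues can turn negative, which this sketch ignores):

```python
import numpy as np

def classical_mds(D, q):
    """Embed n points in R^q from an n x n matrix of squared pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D @ J.T                  # double-centered matrix -1/2 J D J^T
    lam, V = np.linalg.eigh(B)
    order = np.argsort(lam)[::-1][:q]       # q largest eigenvalues
    return (np.sqrt(lam[order]) * V[:, order]).T   # q x n coordinates
```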

Page 16: Lec 6 Data Analysis


Multi-Dimensional Scaling

Page 17: Lec 6 Data Analysis


Motivation

• Given: n samples in d-dimensional space

$X = (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$

• Decrease n → clustering: k-means, EM, mean shift, spectral clustering, hierarchical clustering

Page 18: Lec 6 Data Analysis


Cluster Analysis

• Task: Given a set of observations / data samples, group them into clusters so that observations in the same cluster are similar.

Page 19: Lec 6 Data Analysis


Cluster Analysis


Page 20: Lec 6 Data Analysis


k-means Clustering

• Idea: partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

• Given: data samples $x_1, \dots, x_n \in \mathbb{R}^d$

• Goal: partition the n samples into k sets (k ≤ n) $S_1, S_2, \dots, S_k$ such that

$\sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$

is minimized, where $\mu_i$ is the mean of the points in $S_i$.

Page 21: Lec 6 Data Analysis


k-means Clustering

• Two-step algorithm:

  Assignment step: assign each sample to the cluster with the closest mean (Voronoi diagram)

  $S_i^{(t)} = \left\{ x_j : \lVert x_j - m_i^{(t)} \rVert^2 \le \lVert x_j - m_{i^*}^{(t)} \rVert^2 \;\; \forall\, i^* = 1, \dots, k \right\}$

  Update step: calculate the new means to be the centroids of the observations in the cluster

  $m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$

  Iterate until convergence (assignments no longer change)
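A minimal sketch of this two-step (Lloyd) iteration; here samples are rows of X (transposed relative to the slides' convention), and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """k-means on an n x d array X: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), k, replace=False)]       # init: k random samples
    for _ in range(iters):
        # assignment step: each sample joins the cluster with the closest mean
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: means become the centroids of their clusters
        new_means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_means, means):                 # assignments stable
            break
        means = new_means
    return labels, means
```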

Page 22–35: Lec 6 Data Analysis

k-means Clustering

[Figure sequence: the assignment and update steps alternate over several iterations until the clustering converges]

Page 36: Lec 6 Data Analysis


k-means Clustering - Comments

• Advantages:

  Efficient

  Always converges to a solution

• Drawbacks:

  Not necessarily a globally optimal solution

  #clusters k is an input parameter

  Sensitive to initial clusters

  Cluster model: data is split halfway between cluster means

Page 37: Lec 6 Data Analysis


Clustering Results


Page 38: Lec 6 Data Analysis


EM Algorithm

• Expectation Maximization (EM)

• Probabilistic assignments to clusters instead of deterministic assignments

• Multivariate Gaussian distributions instead of means

Page 39: Lec 6 Data Analysis


EM Algorithm

• Given: data samples $X = (x_1, \dots, x_n), \; x_i \in \mathbb{R}^d$

• Assumption: the data was generated by k Gaussians

• Goal: fit a Gaussian mixture model (GMM) to the data X

  Find means $\mu_j$

  covariances $\Sigma_j$ of the Gaussians

  probabilities (weights) $\pi_j$ that the samples come from the Gaussian j

  ($j = 1, \dots, k$)

Page 40: Lec 6 Data Analysis


EM Algorithm – Example (1D)

• Three samples drawn from each mixture component

• Means: $\mu_1 = -2, \; \mu_2 = 2$

Page 41–43: Lec 6 Data Analysis

EM Algorithm – Example (2D)

[Figures]

Page 44: Lec 6 Data Analysis


EM Algorithm

1. Initialization: Choose initial estimates

$\mu_j^{(0)}, \; \Sigma_j^{(0)}, \; \pi_j^{(0)}, \quad j = 1, \dots, k$

and compute the initial log-likelihood

$L^{(0)} = \frac{1}{n} \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j^{(0)}\, \mathcal{N}(x_i \mid \mu_j^{(0)}, \Sigma_j^{(0)})$

2. E-step: Compute

$\gamma_{ij}^{(m)} = \frac{\pi_j^{(m)}\, \mathcal{N}(x_i \mid \mu_j^{(m)}, \Sigma_j^{(m)})}{\sum_{l=1}^{k} \pi_l^{(m)}\, \mathcal{N}(x_i \mid \mu_l^{(m)}, \Sigma_l^{(m)})}, \quad i = 1, \dots, n, \;\; j = 1, \dots, k$

and

$n_j^{(m)} = \sum_{i=1}^{n} \gamma_{ij}^{(m)}, \quad j = 1, \dots, k$

Page 45: Lec 6 Data Analysis


EM Algorithm

3. M-step: Compute new estimates (j = 1, …, k)

$\mu_j^{(m+1)} = \frac{1}{n_j^{(m)}} \sum_{i=1}^{n} \gamma_{ij}^{(m)}\, x_i$

$\Sigma_j^{(m+1)} = \frac{1}{n_j^{(m)}} \sum_{i=1}^{n} \gamma_{ij}^{(m)}\, (x_i - \mu_j^{(m+1)})(x_i - \mu_j^{(m+1)})^T$

$\pi_j^{(m+1)} = \frac{n_j^{(m)}}{n}$

4. Convergence check: Compute the new log-likelihood

$L^{(m+1)} = \frac{1}{n} \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j^{(m+1)}\, \mathcal{N}(x_i \mid \mu_j^{(m+1)}, \Sigma_j^{(m+1)})$

and stop if the increase over $L^{(m)}$ falls below a threshold; otherwise continue with the E-step.
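A sketch of steps 1–4 (initialization here uses random samples rather than the k-means centroids of the following slides; scipy's multivariate_normal is assumed available for the Gaussian densities, and a small diagonal term keeps the covariances well-conditioned):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=100, tol=1e-6, seed=0):
    """Fit a k-component GMM to an n x d array X with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]              # initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    pi = np.full(k, 1.0 / k)
    L_old = -np.inf
    for _ in range(iters):
        # E-step: responsibilities gamma_ij and effective counts n_j
        p = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                      for j in range(k)], axis=1)        # n x k
        gamma = p / p.sum(axis=1, keepdims=True)
        nj = gamma.sum(axis=0)
        # M-step: re-estimate means, covariances, weights
        mu = (gamma.T @ X) / nj[:, None]
        for j in range(k):
            Xc = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * Xc).T @ Xc / nj[j] + 1e-6 * np.eye(d)
        pi = nj / n
        # convergence check on the mean log-likelihood
        L = np.log(p.sum(axis=1)).mean()
        if abs(L - L_old) < tol:
            break
        L_old = L
    return mu, Sigma, pi
```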

Page 46: Lec 6 Data Analysis


Example (2D)

Ground truth:

  Means:

  Covariance matrices:

  Weights:

• Input to the EM algorithm: 1000 samples

Page 47: Lec 6 Data Analysis


Initial Estimate

Initial density estimation: (centroids of the k-means result)

Page 48–50: Lec 6 Data Analysis

1st, 2nd and 3rd Iteration

Estimates after three iterations:

[Figures]

Page 51: Lec 6 Data Analysis


Mean Shift Clustering

• Non-parametric clustering technique

• No prior knowledge of #clusters

• No constraints on shape of clusters


Page 52: Lec 6 Data Analysis


Mean Shift Clustering - Idea

• Interpret points in feature space as samples from an empirical probability density function

• Dense regions in feature space correspond to local maxima of the underlying distribution

• For each sample: run a gradient ascent procedure on the locally estimated density until convergence

• Stationary points = maxima of the distribution

• Samples associated with the same stationary point are considered to be in the same cluster

Page 53: Lec 6 Data Analysis


Mean Shift Clustering

• Given: data samples $x_1, \dots, x_n \in \mathbb{R}^d$

• Multivariate kernel density estimate with radially symmetric kernel K(x) and window radius h:

$\hat{f}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$

• The radially symmetric kernel is defined as

$K(x) = c_{k,d}\, k(\lVert x \rVert^2)$

where $c_{k,d}$ is a normalization constant

• Modes of the density function are located at zeros of the gradient function: $\nabla \hat{f}(x) = 0$

Page 54: Lec 6 Data Analysis


Mean Shift Clustering

Gradient of the density estimator:

$\nabla \hat{f}(x) = \frac{2 c_{k,d}}{n h^{d+2}} \sum_{i=1}^{n} (x_i - x)\, g\!\left(\left\lVert \frac{x - x_i}{h} \right\rVert^2\right) = \frac{2 c_{k,d}}{n h^{d+2}} \left[ \sum_{i=1}^{n} g\!\left(\left\lVert \frac{x - x_i}{h} \right\rVert^2\right) \right] \left[ \frac{\sum_{i=1}^{n} x_i\, g\!\left(\left\lVert \frac{x - x_i}{h} \right\rVert^2\right)}{\sum_{i=1}^{n} g\!\left(\left\lVert \frac{x - x_i}{h} \right\rVert^2\right)} - x \right]$

where $g(x) = -k'(x)$ denotes the (negative) derivative of the kernel profile $k(x)$.

Page 55: Lec 6 Data Analysis


Mean Shift Clustering

Gradient of the density estimator:

$\nabla \hat{f}(x) = \underbrace{\frac{2 c_{k,d}}{n h^{d+2}} \left[ \sum_{i=1}^{n} g\!\left(\left\lVert \frac{x - x_i}{h} \right\rVert^2\right) \right]}_{\text{proportional to density estimate at } x} \; \underbrace{\left[ \frac{\sum_{i=1}^{n} x_i\, g\!\left(\left\lVert \frac{x - x_i}{h} \right\rVert^2\right)}{\sum_{i=1}^{n} g\!\left(\left\lVert \frac{x - x_i}{h} \right\rVert^2\right)} - x \right]}_{\text{mean shift vector } m_h(x)}$

The mean shift vector $m_h(x)$ points toward the direction of maximum increase in the density.

Page 56: Lec 6 Data Analysis


Mean Shift Clustering

Mean shift procedure for sample $x_i$:

1. Compute the mean shift vector $m(x_i^t)$

2. Translate the density estimation window: $x_i^{t+1} = x_i^t + m(x_i^t)$

Iterate 1. and 2. until convergence, i.e., $\nabla \hat{f}(x_i) = 0$
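A sketch of the procedure with a Gaussian kernel profile g(r) = exp(−r/2); the bandwidth h and the tolerance are free parameters, and clusters are then formed by grouping modes that end up close to each other:

```python
import numpy as np

def mean_shift(X, h, iters=200, tol=1e-5):
    """Shift every sample of the n x d array X to a mode of the KDE."""
    modes = X.copy()
    for _ in range(iters):
        shifted = np.empty_like(modes)
        for i, x in enumerate(modes):
            r2 = ((X - x) ** 2).sum(axis=1) / h**2       # ||(x - x_i)/h||^2
            g = np.exp(-0.5 * r2)                        # kernel profile weights
            shifted[i] = (g[:, None] * X).sum(axis=0) / g.sum()  # x + m_h(x)
        if np.abs(shifted - modes).max() < tol:          # mean shift vectors ~ 0
            break
        modes = shifted
    return modes
```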

Page 57–62: Lec 6 Data Analysis

Mean Shift Clustering

[Figure sequence: the density estimation window for one sample at positions $x_i^{(0)}, x_i^{(1)}, x_i^{(2)}, x_i^{(3)}, \dots, x_i^{(n)}$, moving uphill until convergence]

Page 63: Lec 6 Data Analysis


Mean Shift - Comments

• Advantages:

  No prior knowledge of #clusters

  No constraints on shape of clusters

• Drawbacks:

  Computationally expensive: the algorithm runs for every sample, and identifying a sample's neighborhood requires multi-dimensional range search

  How to choose the bandwidth parameter h?

Page 64: Lec 6 Data Analysis


Summary

• Given: n samples in d-dimensional space

$X = (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$

• Decrease d → dimensionality reduction: PCA, MDS

• Decrease n → clustering: k-means, EM, mean shift, spectral clustering, hierarchical clustering

Page 65: Lec 6 Data Analysis


Spectral Clustering

• Model similarity between data points as graph

• Clustering: Find connected components in graph

Page 66: Lec 6 Data Analysis


Spectral Clustering

• Model similarity between data points as graph

• (weighted) Adjacency Matrix W: $w_{ij} \ge 0$ encodes the similarity between vertices i and j

• Degree Matrix D: diagonal matrix with $d_{ii} = \sum_{j} w_{ij}$

Page 67: Lec 6 Data Analysis


Spectral Clustering

• Graphs:

  Similarity graph: fully connected, models local neighborhood relations

  Gaussian kernel similarity function: $w_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$

  K-nearest neighbour graph

  𝜀-neighbourhood graph

Page 68: Lec 6 Data Analysis


Spectral Clustering

• Model similarity between data points as graph

• (weighted) Adjacency Matrix W:

• Degree Matrix D:

• Graph Laplacian L = D – W:

Page 69: Lec 6 Data Analysis


Spectral Clustering

• Properties of the Graph Laplacian L:

  For every vector $f \in \mathbb{R}^n$: $f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$

  L is symmetric and positive semi-definite

  The smallest eigenvalue of L is 0

  The corresponding eigenvector is the constant one vector $\mathbb{1}$

  L has n non-negative, real-valued eigenvalues $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$

Page 70: Lec 6 Data Analysis


Spectral Clustering

• The multiplicity k of the eigenvalue 0 of L equals the number of connected components in the graph

  Consider k = 1. Assume f is an eigenvector with eigenvalue 0:

  $0 = f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$

  The sum only vanishes if all terms vanish

  If two vertices are connected (their edge weight > 0), then $f_i = f_j$

  f needs to be constant for all vertices which can be connected by a path

  All vertices of a connected component in an undirected graph can be connected by a path: f needs to be constant on the whole connected component

Page 71: Lec 6 Data Analysis


Spectral Clustering

• The Laplacian of a graph with 1 connected component has one constant vector with eigenvalue 0

• For k > 1: w.l.o.g. assume that the vertices are ordered according to connected components; then L has block diagonal form with blocks $L_1, \dots, L_k$

• Each $L_i$ is a graph Laplacian of a fully connected graph: each $L_i$ has one eigenvalue 0 with the constant one vector on the i-th connected component

• The spectrum of L is given by the union of the spectra of the $L_i$

Page 72: Lec 6 Data Analysis


Spectral Clustering

• Graph:

• Graph Laplacian

• Eigenvectors for eigenvalues

Page 73: Lec 6 Data Analysis


Spectral Clustering

• Graph:

• Project vertices into the subspace spanned by the k eigenvectors

• Projected vertices:

• K-means clustering recovers the connected components (see the sketch below)

  Embedding is the same regardless of data ordering
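A sketch of this pipeline with the unnormalized Laplacian L = D − W; it reuses the kmeans sketch from the k-means slides above, and W is assumed to be a symmetric similarity matrix:

```python
import numpy as np

def spectral_clustering(W, k):
    """Embed vertices with the first k eigenvectors of L = D - W, then cluster."""
    D = np.diag(W.sum(axis=1))       # degree matrix
    L = D - W                        # graph Laplacian
    lam, V = np.linalg.eigh(L)       # eigenvalues in ascending order
    U = V[:, :k]                     # first k eigenvectors, one row per vertex
    labels, _ = kmeans(U, k)         # k-means on the embedded vertices
    return labels
```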

Page 74: Lec 6 Data Analysis


Spectral Clustering

• Similarity Graph:

• W =

Page 75: Lec 6 Data Analysis


Spectral Clustering

• Similarity Graph:

• L =

• Eigenvalues: 0, 0.4, 2, 2

• Eigenvectors:

Page 76: Lec 6 Data Analysis


Spectral Clustering

• Similarity Graph:

• For a fully connected graph we want to find the Min-Cut:

  Partition the graph into 2 sets of vertices A and B such that the weight of the edges connecting them is minimal:

  $\mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$

  Vertices in each set should be similar to vertices in the same set, but dissimilar to vertices from the other set

  Partitions are often not balanced: isolated vertices

Page 77: Lec 6 Data Analysis


Spectral Clustering

• Similarity Graph:

• For a fully connected graph we want to find the Normalized Cut:

  Partition the graph into 2 sets of vertices such that the weight of the edges connecting them is minimal

  Partitions should have similar size

Page 78: Lec 6 Data Analysis


Spectral Clustering

• Min-Cut: minimize $\mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$

• Normalized Cut: minimize $\mathrm{Ncut}(A, B) = \mathrm{cut}(A, B) \left( \frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(B)} \right)$ with $\mathrm{vol}(A) = \sum_{i \in A} d_{ii}$

  the balancing factor is minimal if $\mathrm{vol}(A) = \mathrm{vol}(B)$

Page 79: Lec 6 Data Analysis


Spectral Clustering

• Reformulate with the Graph Laplacian

• Construct f:

  $f_i = \begin{cases} \sqrt{\mathrm{vol}(B)/\mathrm{vol}(A)} & i \in A \\ -\sqrt{\mathrm{vol}(A)/\mathrm{vol}(B)} & i \in B \end{cases}$

  then $f^T L f = \mathrm{vol}(V) \cdot \mathrm{Ncut}(A, B)$, $(Df)^T \mathbb{1} = 0$, and $f^T D f = \mathrm{vol}(V)$

Page 80: Lec 6 Data Analysis


Spectral Clustering

• Reformulate Ncut:

• Minimize $f^T L f$ subject to $Df \perp \mathbb{1}$, $f^T D f = \mathrm{vol}(V)$

  Partition (cluster) assignment by thresholding f at 0

  NP-hard to compute since f is discrete

  Relax the problem by allowing f to take arbitrary real values

  Solution: second eigenvector of $L' = D^{-1} L$ (normalized Graph Laplacian)

• For k > 2 we can similarly construct indicator vectors like f and relax the problem for minimization:

  Project the vertices into the subspace spanned by the first k eigenvectors of L'

  Clustering the embedded vertices yields the solution

• Spectral clustering (with the normalized Graph Laplacian) approximates Ncut

Page 81: Lec 6 Data Analysis


Spectral Clustering

[Comparison figures: Mean Shift, Spectral Clustering, K-Means]

Page 82: Lec 6 Data Analysis


Spectral Clustering

• Summary:

  Useful for non-convex clustering problems

  Computation intensive because of the eigenvalue computation (for large matrices)

  Choice of k necessary: a heuristic can be used that tries to find jumps in the eigenvalues (eigengap)

  Similarity has to be defined for graph construction: size of the Gaussian kernel? size of the neighbourhood?

Page 83: Lec 6 Data Analysis


Hierarchical Clustering

• Bottom up:

  Each data point is its own cluster

  Greedily merge clusters according to some criterion

Page 84: Lec 6 Data Analysis


Hierarchical Clustering

• Requirements:

  Metric: distance d(x, y) between data points

  Linkage: distance between data point sets A, B:

  Maximum linkage: $\max_{x \in A,\, y \in B} d(x, y)$

  Average linkage: $\frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} d(x, y)$

  Ward linkage: the increase in total within-cluster variance caused by merging A and B

Page 85: Lec 6 Data Analysis


Hierarchical Clustering

• Algorithm (see the sketch below):

  Start out with a cluster for each data point

  Merge the two clusters that result in the least increase in the linkage criterion

  Repeat until k clusters remain

• Maximum linkage: minimizes the maximal distance of data points in each cluster

• Average linkage: minimizes the average distance of data points in each cluster

• Ward linkage: minimizes intra-cluster variance
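A naive sketch of the bottom-up algorithm with average linkage (cubic time, fine for small n; the function name is ours):

```python
import numpy as np

def agglomerative(X, k):
    """Greedy average-linkage clustering of an n x d array X into k clusters."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    while len(clusters) > k:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average linkage between clusters a and b
                link = D[np.ix_(clusters[a], clusters[b])].mean()
                if link < best:
                    best, pair = link, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)                   # merge the closest pair
    return clusters
```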

Page 86: Lec 6 Data Analysis


Hierarchical Clustering

• We can add connectivity constraints that enforce which clusters can be merged

Page 87: Lec 6 Data Analysis


Hierarchical Clustering

• Summary:

  Flexible: any pairwise distance can be used

  Choice of k, distance, and linkage necessary

  Instead of specifying k we can use a heuristic which stops cluster merging if the linkage increases too much

  Given connectivity constraints, hierarchical clustering scales well for large numbers of data points

  How do we choose connectivity constraints? K-nearest neighbour graph, 𝜀-neighbourhood graph