
Page 1:

Machine Learning Problems
• Unsupervised Learning
  – Clustering
  – Density estimation
  – Dimensionality Reduction
• Supervised Learning
  – Classification
  – Regression

Page 2:

Clustering
• Clustering definition: partition a given set of objects into M groups (clusters) such that the objects within each group are ‘similar’ to each other and ‘different’ from the objects of the other groups.
• A distance (or similarity) measure is required.
• Unsupervised learning: no class labels.
• Clustering is NP-complete.
• Clustering examples: documents, images, time series, image segmentation, video analysis, gene clustering, motif discovery, web applications.
• Big issue: estimating the number of clusters.
• Solutions are difficult to evaluate.

Page 3:

Clustering
• Cluster assignments: hard vs. soft (fuzzy/probabilistic)
• Clustering methods
  – Hierarchical (agglomerative, divisive)
  – Density-based (non-parametric)
  – Parametric (k-means, mixture models, etc.)
• Clustering problems (type of input)
  – Data vectors
  – Similarity matrix

Page 4:

Agglomerative Clustering
• The simplest approach and a good starting point.
• Starting from singleton clusters, at each step we merge the two most similar clusters.
• A similarity (or distance) measure between clusters is needed.
• Output: a dendrogram.
• Drawback: merging decisions are permanent (they cannot be corrected at a later stage); see the sketch below.
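A minimal sketch of this procedure, assuming SciPy's hierarchical-clustering routines and a small toy dataset (the data, linkage choice, and cut level are illustrative assumptions, not part of the slides):

```python
# Hedged sketch: agglomerative clustering with SciPy (average linkage).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs as toy data (an assumption for illustration)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# At each step, merge the two clusters with the smallest average inter-cluster distance
Z = linkage(X, method="average", metric="euclidean")

# The dendrogram records the whole sequence of merges
dendrogram(Z, no_plot=True)

# Cut the dendrogram to obtain a flat partition into 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```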

Page 5:

Density-Based Clustering (e.g., DBSCAN)
• Identify ‘dense regions’ in the data space.
• Merge neighboring dense regions.
• Requires many data points.
• Complexity: O(n²).
• Points are labeled as core, border, or outlier (noise) according to two parameters: Eps (neighborhood radius, e.g., 1 cm) and MinPts (e.g., 5), which are set empirically (how?).
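A minimal sketch using scikit-learn's DBSCAN; the eps and min_samples values below are toy assumptions, not recommendations (a common empirical heuristic for eps is the ‘knee’ of the sorted k-nearest-neighbor distance plot):

```python
# Hedged sketch: DBSCAN via scikit-learn; eps/min_samples are illustrative values.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),      # dense region 1
               rng.normal(4, 0.2, (50, 2)),      # dense region 2
               rng.uniform(-2, 6, (5, 2))])      # a few sparse outliers

db = DBSCAN(eps=0.5, min_samples=5).fit(X)       # Eps and MinPts, as on the slide
labels = db.labels_                              # -1 marks outliers (noise)
print("clusters:", set(labels) - {-1}, "noise points:", (labels == -1).sum())
```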

Page 6:

Parametric Methods
• k-means (data vectors): O(n) (n: the number of objects to be clustered)
• k-medoids (similarity matrix): O(n²)
• Mixture models (data vectors): O(n)
• Spectral clustering (similarity matrix): O(n³)
• Kernel k-means (similarity matrix): O(n²)
• Affinity Propagation (similarity matrix): O(n²)

Page 7:

k-means
• Partition a dataset X of N vectors x_i into M subsets (clusters) C_k such that the intra-cluster variance is minimized.
• Intra-cluster variance: the (squared) distance of the cluster's objects from the cluster prototype m_k.
• k-means: prototype = cluster center.
• Finds local minima w.r.t. the clustering error (the sum of intra-cluster variances).
• Highly dependent on the initial positions (typically data examples) of the centers m_k; see the sketch below.
(embedded video: km.wmv)
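A minimal sketch of this dependence on initialization, assuming scikit-learn's KMeans and toy data (comparing a single random restart with several restarts, where the run with the lowest clustering error is kept):

```python
# Hedged sketch: k-means and its dependence on the initial centers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three Gaussian blobs (toy data, an assumption for illustration)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in [(0, 0), (4, 0), (2, 3)]])

# One random initialization: may converge to a poor local minimum
km1 = KMeans(n_clusters=3, init="random", n_init=1, random_state=1).fit(X)

# Several restarts: keep the run with the lowest clustering error (inertia_)
km10 = KMeans(n_clusters=3, init="random", n_init=10, random_state=1).fit(X)

print("1 restart  error:", km1.inertia_)
print("10 restarts error:", km10.inertia_)   # typically lower than (or equal to) the single run
```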

Page 8:

k-medoids
• Similar to k-means.
• The representative is the cluster medoid: the cluster object with the smallest average distance to the other objects of the cluster.
• At each iteration the medoid is computed instead of the centroid.
• Increased complexity: O(n²).
• The medoid is more robust to outliers.
• k-medoids can be used with a similarity matrix; see the sketch below.
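A minimal sketch of an alternating assignment/medoid-update loop that works directly on a precomputed distance matrix (an illustrative simplification, not the full PAM algorithm; the function name and toy data are assumptions):

```python
# Hedged sketch: simple alternating k-medoids on a distance matrix D (n x n).
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)        # random initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)         # assign each object to its nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # medoid = the member with the smallest total distance to the other members
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):           # converged
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Toy usage: pairwise Euclidean distances of random 2-D points
X = np.random.default_rng(1).normal(size=(60, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
medoids, labels = k_medoids(D, k=3)
print("medoid indices:", medoids)
```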

Page 9:

k-means (1) vs k-medoids (2)

Page 10:

Spectral Clustering (Ng, Jordan & Weiss, NIPS 2001)
• Input: similarity matrix between pairs of objects, number of clusters M.
• Example similarity: a(x,y) = exp(−||x−y||²/σ²) (RBF kernel).
• Spectral analysis of the similarity matrix: compute the top M eigenvectors and form the matrix U.
• The i-th object corresponds to a vector in R^M: the i-th row of U.
• The rows of U are clustered into M clusters using k-means.
• Ideal example (M = 3): the rows of U look like cluster indicators, e.g. (1 0 0), (1 0 0), (0 1 0), (0 1 0), (0 0 1), (0 0 1), which k-means separates trivially; see the sketch below.
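A minimal sketch of this procedure, assuming NumPy, scikit-learn's KMeans, the symmetric normalization and row normalization used in the NIPS 2001 paper, and a toy two-rings dataset (the σ value and function name are assumptions):

```python
# Hedged sketch of the Ng-Jordan-Weiss spectral clustering procedure.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, M, sigma=1.0):
    # RBF affinity a(x, y) = exp(-||x - y||^2 / sigma^2), with a zero diagonal
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / sigma**2)
    np.fill_diagonal(A, 0.0)

    # Symmetrically normalized affinity L = D^{-1/2} A D^{-1/2}
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ A @ D_inv_sqrt

    # Top M eigenvectors form the matrix U (one row per object)
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, -M:]

    # Normalize the rows to unit length, then run k-means on the rows of U
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=M, n_init=10, random_state=0).fit_predict(U)

# Toy usage on two concentric rings (where plain k-means fails)
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
r = np.repeat([1.0, 3.0], 100) + rng.normal(0, 0.05, 200)
X = np.c_[r * np.cos(t), r * np.sin(t)]
labels = spectral_clustering(X, M=2, sigma=1.0)
```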

Page 11:

Spectral Clustering
(Figure: k-means vs. spectral clustering with an RBF kernel, σ = 1, on a two-rings dataset)

Page 12:

Spectral Clustering ↔ Graph Cut
• Data graph G(V, A): vertices are the objects, edge weights a_ij are the pairwise similarities.
• Clustering = graph partitioning.
• Definitions: links(B,C) = Σ_{i∈B, j∈C} a_ij,  degree(B) = links(B, V).
• Ratio association:  RAss = max_{V_1,…,V_k} Σ_{c=1}^{k} links(V_c, V_c) / |V_c|
• Ratio cut:  RCut = min_{V_1,…,V_k} Σ_{c=1}^{k} links(V_c, V \ V_c) / |V_c|
• Normalized cut:  NCut = min_{V_1,…,V_k} Σ_{c=1}^{k} links(V_c, V \ V_c) / degree(V_c)

Page 13:

Spectral Clustering ↔ Graph Cut
• Cluster indicator vector z_i = (0,0,…,0,1,0,…,0)^T for object i.
• Indicator matrix Z = [z_1,…,z_n]^T (n×k, for k clusters), with Z^T Z = I for the suitably normalized indicator matrix.
• Graph partitioning = trace maximization w.r.t. Z: max_Z trace(Z^T Ã Z), where Z is an indicator matrix and Ã is a function of the affinity matrix A.
• The relaxed problem max_Y trace(Y^T Ã Y), with Y^T Y = I and y_ij ∈ R, is solved optimally by the spectral algorithm to obtain Y.
• k-means is then applied to the rows of Y (the y_ij) to obtain the discrete assignments z_ij.
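A brief numerical check of the relaxation (assumed notation: A is a symmetric affinity matrix; the optimum of the relaxed problem is attained by the top-k eigenvectors, and equals the sum of the k largest eigenvalues):

```python
# Hedged sketch: the relaxed problem max trace(Y^T A Y), Y^T Y = I,
# is solved by the top-k eigenvectors of a symmetric affinity matrix A.
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(30, 30))
A = B @ B.T                                  # a symmetric positive semidefinite "affinity"

k = 3
eigvals, eigvecs = np.linalg.eigh(A)         # eigenvalues in ascending order
Y = eigvecs[:, -k:]                          # top-k eigenvectors, Y^T Y = I

print(np.trace(Y.T @ A @ Y))                 # equals the sum of the k largest eigenvalues
print(eigvals[-k:].sum())
```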

Page 14:

Kernel-Based Clustering (non-linear cluster separation)
– Given a set of objects and the kernel matrix K = [K_ij] containing the similarities between each pair of objects.
– Goal: partition the dataset into subsets (clusters) C_k such that intra-cluster similarity is maximized.
– Kernel trick: data points are mapped from the input space to a higher-dimensional feature space through a transformation φ(x):
  K_ij = φ(x_i)^T φ(x_j),   ||φ(x_i) − φ(x_j)||² = K_ii + K_jj − 2K_ij
– RBF kernel: K(x,y) = exp(−||x−y||²/σ²)
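A small check of the kernel-trick identity above, assuming an RBF kernel and toy data (for this kernel K_ii = 1, so the feature-space distance reduces to 2 − 2K_ij; the helper name is an assumption):

```python
# Hedged sketch: feature-space distances computed from the kernel matrix only.
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma**2)

X = np.random.default_rng(0).normal(size=(5, 2))
K = rbf_kernel_matrix(X)

# ||phi(x_i) - phi(x_j)||^2 = K_ii + K_jj - 2 K_ij, without ever forming phi(x)
i, j = 0, 3
dist_sq = K[i, i] + K[j, j] - 2 * K[i, j]
print(dist_sq)
```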

Page 15:

Kernel k-means
• Kernel k-means = k-means in the feature space; it minimizes the clustering error in feature space.
• Differences from k-means:
  – The cluster centers m_k in feature space cannot be computed explicitly.
  – Each cluster C_k is instead described explicitly by its data objects.
  – Distances from the centers in feature space are computed through the kernel matrix:
    ||φ(x_i) − m_k||² = K_ii − (2/|C_k|) Σ_{j∈C_k} K_ij + (1/|C_k|²) Σ_{j,l∈C_k} K_jl
• Finds local minima; strong dependence on the initial partition (see the sketch below).
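A minimal kernel k-means sketch using the feature-space distance formula above (illustrative only; it assumes a precomputed kernel matrix, a random initial partition, and no special handling of empty clusters or weights):

```python
# Hedged sketch: kernel k-means on a precomputed kernel matrix K (n x n).
import numpy as np

def kernel_kmeans(K, M, n_iter=100, seed=0):
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(0, M, size=n)  # random initial partition
    for _ in range(n_iter):
        dist = np.zeros((n, M))
        for c in range(M):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                dist[:, c] = np.inf
                continue
            # ||phi(x_i) - m_c||^2 = K_ii - 2/|C| * sum_j K_ij + 1/|C|^2 * sum_{j,l} K_jl
            Kc = K[:, members]
            dist[:, c] = (np.diag(K)
                          - 2.0 * Kc.sum(axis=1) / len(members)
                          + K[np.ix_(members, members)].sum() / len(members) ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # converged
            break
        labels = new_labels
    return labels
```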

Page 16:

Spectral Relaxation of Kernel k-means
• The kernel k-means objective can be written as clustering error = trace(K) − trace(Y^T K Y), where trace(K) is constant; minimizing the error is therefore equivalent to maximizing trace(Y^T K Y) [1].
• Consequently, spectral methods can substitute kernel k-means and vice versa.

[1] Dhillon, I.S., Guan, Y., Kulis, B., "Weighted graph cuts without eigenvectors: A multilevel approach", IEEE TPAMI, 2007.

Page 17:

Exemplar-Based Methods
• Cluster data by identifying representative exemplars.
  – An exemplar is an actual dataset point, similar to a medoid.
  – All data points are considered as possible exemplars.
  – The number of clusters is decided during learning (but it depends on a user-defined parameter).
• Methods:
  – Convex Mixture Models
  – Affinity Propagation

Page 18:

Affinity Propagation (AP) (Frey & Dueck, Science 2007)
• Clusters data by identifying representative exemplars.
  – Exemplars are identified by transmitting messages between data points.
• Input to the algorithm:
  – A similarity matrix, where s(i,k) indicates how well data point x_k is suited to be an exemplar for data point x_i.
  – Self-similarities s(k,k), which control the number of identified clusters; a higher value means that x_k is more likely to become an exemplar.
    • Self-similarities are set independently of the other similarities.
    • Higher values result in more clusters.

Page 19:

Affinity Propagation
• Clustering criterion: the sum over all points of s(i, c_i), where c_i is the exemplar of x_i and s(i, c_i) is the similarity between the data point x_i and its exemplar; equivalently, the energy −Σ_i s(i, c_i) is minimized.
• The criterion is optimized by passing messages between data points, called responsibilities and availabilities.
• Responsibility r(i,k):
  – Sent from x_i to the candidate exemplar x_k; it reflects the accumulated evidence for how well suited x_k is to serve as the exemplar of x_i, taking into account other potential exemplars for x_i.

Page 20:

Affinity Propagation
• Availability a(i,k):
  – Sent from the candidate exemplar x_k to x_i; it reflects the accumulated evidence for how appropriate it would be for x_i to choose x_k as its exemplar, taking into account the support from other points that x_k should be an exemplar.
• The algorithm alternates between responsibility and availability updates (until convergence).
• The exemplars are the points with r(k,k) + a(k,k) > 0.
• Code and demos: http://www.psi.toronto.edu/index.php?q=affinity%20propagation
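A minimal sketch with scikit-learn's AffinityPropagation on a precomputed similarity matrix (negative squared Euclidean distances; the toy data and the preference value, which plays the role of the self-similarities s(k,k), are assumptions):

```python
# Hedged sketch: Affinity Propagation on a precomputed similarity matrix.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])

# s(i, k) = -||x_i - x_k||^2 ; s(k, k) (the "preference") controls the number of clusters
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

ap = AffinityPropagation(affinity="precomputed",
                         preference=np.median(S),   # a common default choice
                         random_state=0).fit(S)
print("exemplar indices:", ap.cluster_centers_indices_)
print("number of clusters:", len(ap.cluster_centers_indices_))
```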

Page 21:

Affinity Propagation

Page 22:

Incremental Clustering: Bisecting k-means (Steinbach, Karypis & Kumar, SIGKDD 2000)
• Start with k=1 (m_1 = data average).
• Given a solution with k clusters:
  – Find the ‘best’ cluster and split it into two subclusters.
  – Replace that cluster's center with the two subcluster centers.
  – Run k-means with the k+1 centers (optional).
  – k := k+1
• Repeat until M clusters have been obtained.
• A cluster is split using several random trials; in each trial:
  – Randomly initialize two centers from the cluster's points.
  – Run 2-means using the cluster's points only.
• Keep the split of the trial with the lowest clustering error (see the sketch below).
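A minimal bisecting k-means sketch (illustrative; here the cluster with the largest sum-of-squares error is assumed to be the ‘best’ one to split, and the optional refinement run of k-means over all centers is omitted):

```python
# Hedged sketch: bisecting k-means built on top of repeated 2-means splits.
import numpy as np
from sklearn.cluster import KMeans

def sse(X):
    # Sum-of-squares error of one cluster around its mean
    return ((X - X.mean(axis=0)) ** 2).sum()

def bisecting_kmeans(X, M, n_trials=5, seed=0):
    clusters = [np.arange(len(X))]                       # start with one cluster (k = 1)
    while len(clusters) < M:
        # Pick the cluster with the largest clustering error to split
        worst = max(range(len(clusters)), key=lambda c: sse(X[clusters[c]]))
        idx = clusters.pop(worst)
        # Several random 2-means trials; keep the split with the lowest error
        best = None
        for t in range(n_trials):
            km = KMeans(n_clusters=2, init="random", n_init=1,
                        random_state=seed + t).fit(X[idx])
            if best is None or km.inertia_ < best[0]:
                best = (km.inertia_, km.labels_)
        clusters += [idx[best[1] == 0], idx[best[1] == 1]]
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels

# Toy usage
X = np.random.default_rng(0).normal(size=(200, 2))
labels = bisecting_kmeans(X, M=4)
```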

Page 23:

Global k-means (Likas, Vlassis & Verbeek, PR 2003)
• Incremental, deterministic clustering algorithm that runs k-means several times.
• Finds near-optimal solutions w.r.t. the clustering error.
• Idea: a near-optimal solution with k clusters can be obtained by running k-means from the initial state (m_1, m_2, …, m_{k−1}, x_n), where
  – the k−1 centers (m_1, m_2, …, m_{k−1}) are taken from a near-optimal solution of the (k−1)-clustering problem, and
  – the k-th center is initialized at some data point x_n (which one?).
• Consider all possible initializations (one for each x_n).

Page 24:

Global k-means
• In order to solve the M-clustering problem:
  1. Solve the 1-clustering problem (trivial).
  2. Solve the k-clustering problem using the solution of the (k−1)-clustering problem:
     – Execute k-means N times, initialized at the n-th run (n=1,…,N) as (m_1, m_2, …, m_{k−1}, x_n).
     – Keep the solution corresponding to the run with the lowest clustering error as the solution with k clusters.
  3. k := k+1; repeat step 2 until k = M (see the sketch below).
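A minimal sketch of this procedure, assuming scikit-learn's KMeans with explicit initial centers (the function name and toy data are assumptions; the full algorithm performs one k-means run per candidate x_n for every added cluster, so it is expensive):

```python
# Hedged sketch: global k-means, adding one cluster at a time.
import numpy as np
from sklearn.cluster import KMeans

def global_kmeans(X, M):
    centers = X.mean(axis=0, keepdims=True)                    # k = 1: trivial solution
    best = KMeans(n_clusters=1, init=centers, n_init=1).fit(X)
    for k in range(2, M + 1):
        runs = []
        for x_n in X:                                          # one candidate run per data point
            init = np.vstack([centers, x_n])                   # (m_1, ..., m_{k-1}, x_n)
            runs.append(KMeans(n_clusters=k, init=init, n_init=1).fit(X))
        best = min(runs, key=lambda km: km.inertia_)           # lowest clustering error
        centers = best.cluster_centers_                        # near-optimal k-cluster solution
    return best

# Toy usage
X = np.random.default_rng(0).normal(size=(100, 2))
solution = global_kmeans(X, M=3)
print(solution.inertia_)
```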

Page 25:

(Figures: best initial position of m_2, m_3, m_4 and m_5 as each new center is added; embedded video: glkm.wmv)

Page 26:

(embedded video: 100_clusters.wmv)

Page 27:

Fast Global k-means
• How is the complexity reduced?
  – Among the candidate initial states (m*_1, m*_2, …, m*_{k−1}, x_n), we select the one with the greatest reduction in clustering error in the first iteration of k-means (this reduction can be computed analytically).
  – k-means is then executed only once, from this state.
• The set of candidate initial points can also be restricted (kd-tree, summarization).

Page 28:

Global Kernel k-means (Tzortzis & Likas, IEEE TNN 2009)
• In order to solve the M-clustering problem:
  1. Solve the 1-clustering problem with kernel k-means (trivial solution).
  2. Solve the k-clustering problem using the solution of the (k−1)-clustering problem:
     a) Let (C_1, C_2, …, C_{k−1}) denote the solution to the (k−1)-clustering problem.
     b) Execute kernel k-means N times, initialized during the n-th run as (C_1, …, C_l := C_l − {x_n}, …, C_{k−1}, C_k := {x_n}), where C_l is the cluster containing x_n.
     c) Keep the run with the lowest clustering error as the solution with k clusters.
     d) k := k+1
  3. Repeat step 2 until k = M.
• The fast Global kernel k-means variant can also be applied.
• Representative candidate data points can be selected using convex mixture models.

Page 29:

(Figures: best initial cluster C_2, C_3 and C_4; blue circles mark the optimal initialization of the cluster to be added; RBF kernel K(x,y) = exp(−||x−y||²/σ²))

Page 30:

Page 31:

Clustering Methods: Summary
• Usually we assume that the number of clusters is given.
• k-means is still the most widely used method.
• Mixture models can be used when lots of data are available.
• Spectral clustering (or kernel k-means) is the most popular choice when a similarity matrix is given.
• Beware of the parameter initialization problem!
• The absence of ground truth makes evaluation difficult.
• How could we estimate the number of clusters?