
Clustering algorithms
Machine Learning

Hamid Beigy

Sharif University of Technology

Fall 1394


Table of contents

1 Supervised vs. Unsupervised Learning

2 Hierarchical Clustering

3 Non-Hierarchical Clustering


Supervised vs. Unsupervised Learning

Supervised Learning

Classification: partition examples into groups according to pre-defined categories.
Regression: assign a value to each feature vector.
Requires labeled data for training.

Unsupervised Learning

Clustering: partition examples into groups when no pre-defined categories/classes are available.
Novelty detection: find new classes in the data.
Concept drift detection: find changes in the distribution of the data.
Outlier detection: find unusual events (e.g. hackers).
Only instances are required, no labels.


Why do Unsupervised Learning?

Cost

Raw data is cheap.
Labeled data is expensive.

Save memory/computation

Reduce noise in high-dimensional data

Useful in exploratory data analysis

Often pre-processing step for supervised learning


Clustering

Partition unlabeled examples into disjoint subsets (clusters), such that:

Examples within a cluster are similar.
Examples in different clusters are different.

Discover new categories in an unsupervised manner

(no sample category labels provided)


Similarity Measures

Euclidean distance ($L_2$ norm):

$L_2(\vec{x}, \vec{x}\,') = \sqrt{\sum_{i=1}^{N} (x_i - x'_i)^2}$

$L_1$ norm:

$L_1(\vec{x}, \vec{x}\,') = \sum_{i=1}^{N} |x_i - x'_i|$

Cosine similarity:

$\cos(\vec{x}, \vec{x}\,') = \dfrac{\vec{x} \cdot \vec{x}\,'}{\|\vec{x}\|\,\|\vec{x}\,'\|}$

Kernels
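
As a quick illustration (a sketch of my own, not from the slides), each of these measures is a few lines of NumPy:

```python
import numpy as np

def l2_distance(x, y):
    # Euclidean (L2) distance: square root of the sum of squared differences
    return np.sqrt(np.sum((x - y) ** 2))

def l1_distance(x, y):
    # L1 (Manhattan) distance: sum of absolute differences
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    # Cosine of the angle between x and y (1 = same direction, 0 = orthogonal)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 2.0])
print(l2_distance(x, y), l1_distance(x, y), cosine_similarity(x, y))
```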


Applications of Clustering

Cluster retrieved documents

to present more organized and understandable results to the user → "diversified retrieval"

Detecting near duplicates
Entity resolution

E.g. "Thorsten Joachims" == "Thorsten B Joachims"

Exploratory data analysis

Automated (or semi-automated) creation of taxonomies

E.g. Yahoo, DMOZ

Comparison


Applications of Clustering

Text Mining


Applications of Clustering

Image Segmentation

http://people.cs.uchicago.edu/~pff/segment


Clustering Example

[Figures: a worked clustering example, developed step by step across several slides]

Categories of Clustering Algorithms

Hierarchical clustering

E.g. Linkage clustering

Centroid-based clustering

E.g. K-means clustering

Distribution-based clustering

E.g. EM algorithm

Density-based clustering

E.g. DBSCAN

Other methods


Hierarchical Clustering

Organize the clusters in a hierarchical way

Produces a rooted tree (dendrogram):

Animal
├── Vertebrate: Fish, Reptile, Amphibian, Mammal
└── Invertebrate: Worm, Insect, Crustacean

Recursive application of a standard clustering algorithm can produce a hierarchical clustering.


Agglomerative vs. Divisive Clustering

Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine clusters to form larger and larger ones.

Divisive (top-down) methods recursively split the set of all examples into smaller and smaller clusters.


Agglomerative (bottom-up)

Assumes a similarity function for determining the similarity of two clusters.

Starts with each instance in a separate cluster and then repeatedly joins the two clusters that are most similar, until there is only one cluster.

The history of merging forms a binary tree, or hierarchy.

Basic algorithm:

Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two clusters $c_i$ and $c_j$ that are most similar.
  Replace $c_i$ and $c_j$ with the single cluster $c_i \cup c_j$.
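
In practice this loop is available off the shelf; a minimal sketch using SciPy (the toy data below is invented):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two well-separated groups
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# linkage() performs exactly the agglomerative loop above: start with
# singletons and repeatedly merge the two closest clusters.
# Z records the merge history (the binary tree / dendrogram).
Z = linkage(X, method='single')      # 'complete' and 'average' also work

# Cut the tree to obtain a flat clustering with two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)                        # e.g. [1 1 1 2 2 2]
```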


Cluster Similarity

How can we compute the similarity of two clusters, each possibly containing multiple instances?

Single Linkage: similarity of the two most similar members.
Complete Linkage: similarity of the two least similar members.
Group Average: average similarity between members.


Single-Link (bottom-up)

$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\ y \in c_j} \mathrm{sim}(x, y)$

(With a distance measure in place of the similarity, the max becomes a min.)


Complete-Link (bottom-up)

$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\ y \in c_j} \mathrm{sim}(x, y)$


Computational Complexity of HAC

In the first iteration, all HAC methods need to compute the similarity of all pairs of the $n$ individual instances, which is $O(n^2)$.

In each of the subsequent $O(n)$ merging iterations, we must find the closest pair of clusters; maintaining a heap makes the overall cost $O(n^2 \log n)$.

In each of the subsequent $O(n)$ merging iterations, we must also compute the similarity between the most recently created cluster and all other existing clusters. Can this be done in constant time per cluster, so that the overall cost stays $O(n^2 \log n)$?


Computing Cluster Similarity

After merging $c_i$ and $c_j$, the similarity of the resulting cluster to any other cluster $c_k$ can be computed by:

Single Linkage:

$\mathrm{sim}((c_i \cup c_j), c_k) = \max(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k))$

Complete Linkage:

$\mathrm{sim}((c_i \cup c_j), c_k) = \min(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k))$
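
This is exactly the constant-time update asked about on the previous slide. A minimal sketch over a similarity matrix (the function and its conventions are mine, not from the slides):

```python
import numpy as np

def update_after_merge(sim, i, j, linkage='single'):
    # The merged cluster c_i ∪ c_j reuses row/column i; each entry is
    # recomputed in O(1) from the old rows i and j, per the rules above.
    if linkage == 'single':
        sim[i, :] = np.maximum(sim[i, :], sim[j, :])
    else:  # 'complete'
        sim[i, :] = np.minimum(sim[i, :], sim[j, :])
    sim[:, i] = sim[i, :]
    # Invalidate row/column j so cluster j is never picked again.
    sim[j, :] = -np.inf
    sim[:, j] = -np.inf
    return sim
```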


Group Average Bottom-up Clustering

Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters.

$\mathrm{sim}(c_i, c_j) = \dfrac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{\vec{x} \in c_i \cup c_j}\ \sum_{\substack{\vec{y} \in c_i \cup c_j \\ \vec{y} \neq \vec{x}}} \mathrm{sim}(\vec{x}, \vec{y})$

Compromise between single and complete link.
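
For a sanity check, the formula can be evaluated directly in quadratic time; a small sketch with invented points and cosine similarity as the pairwise measure:

```python
import numpy as np

def group_average_sim(ci, cj, sim):
    # Average pairwise similarity over all ordered pairs of distinct
    # points in the merged cluster ci ∪ cj.
    merged = ci + cj
    m = len(merged)
    total = sum(sim(x, y) for x in merged for y in merged if y is not x)
    return total / (m * (m - 1))

cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
ci = [np.array([1.0, 0.0]), np.array([1.0, 0.2])]
cj = [np.array([0.8, 0.1])]
print(group_average_sim(ci, cj, cos))
```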


Non-Hierarchical Clustering

K-means clustering ("hard" assignments)

Mixtures of Gaussians trained via the Expectation-Maximization (EM) algorithm ("soft" assignments)


Clustering Criterion

An evaluation function that assigns a (usually real-valued) score to a clustering. The clustering criterion is typically a function of:

within-cluster similarity and
between-cluster dissimilarity

Optimization: find the clustering that maximizes the criterion.

Global optimization (often intractable)
Greedy search
Approximation algorithms


Centroid-Based Clustering

Assumes instances are real-valued vectors.

Clusters are represented by centroids, i.e. the average of the points in cluster $c$:

$\vec{\mu}(c) = \dfrac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$

Reassignment of instances to clusters is based on distance to the current cluster centroids.


K-Means Algorithm

Input: $k$ (the number of clusters) and a distance measure $d$.

Select $k$ random instances $s_1, s_2, \dots, s_k$ as seeds.

Until the clustering converges (or another stopping criterion holds):

For each instance $x_i$: assign $x_i$ to the cluster $c_j$ such that $d(x_i, s_j)$ is minimal.

For each cluster $c_j$: update its centroid, $s_j = \vec{\mu}(c_j)$.
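
A compact NumPy sketch of exactly these two alternating steps (the convergence test on centroid movement and the empty-cluster guard are my choices, not prescribed by the slide):

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Select k random instances as the initial seeds.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (keep the old seed if a cluster happens to go empty).
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:   # converged
            break
        centroids = new_centroids
    return labels, centroids
```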


Page 30: Machine Learning Hamid Beigyce.sharif.edu/courses/94-95/1/ce717-1/resources/... · Table of contents 1 Supservised vs. Unsupervised Learning 2 Hierarchical Clustering 3 Non-Hierarchical

Time Complexity

Assume computing the distance between two instances is $O(D)$, where $D$ is the dimensionality of the vectors.

Reassigning clusters for $N$ points: $O(kN)$ distance computations, i.e. $O(kDN)$.

Computing centroids: each instance gets added once to some centroid: $O(DN)$.

Assume these two steps are each done once in each of $i$ iterations: $O(ikDN)$.

This is linear in all relevant factors, so for a fixed number of iterations, K-means is more efficient than HAC.


Seed Choice

Problem

Results can vary based on random seed selection, especially for high-dimensional data.

Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.

Idea: Combine HAC and K-means clustering.

First, randomly take a sample of instances of size $\sqrt{n}$.

Run group-average HAC on this sample.

Use the results of HAC as the initial seeds for K-means.

The overall algorithm is efficient and avoids the problems of bad seed selection.
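
A sketch of this combination, assuming scikit-learn is available (this HAC-seeded K-means scheme is known in the IR literature as the Buckshot algorithm; the parameter choices below are illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def hac_seeded_kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly sample ~sqrt(n) instances (at least k of them).
    n = len(X)
    sample = X[rng.choice(n, size=max(k, int(np.sqrt(n))), replace=False)]
    # 2. Run group-average HAC on the sample.
    hac = AgglomerativeClustering(n_clusters=k, linkage='average').fit(sample)
    # 3. Use the HAC cluster centroids as the initial K-means seeds.
    seeds = np.array([sample[hac.labels_ == j].mean(axis=0) for j in range(k)])
    return KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)
```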


Gaussian Mixtures and EM

Gaussian Mixture Models

Assume each point is generated by first choosing a component $Y \in \{1, \dots, k\}$ with probability $P(Y = j)$ and then drawing $\vec{x}$ from the Gaussian $\mathcal{N}(\vec{\mu}_j, \Sigma)$.

EM Algorithm

Assume $P(Y)$ and $k$ are known, and $\Sigma = I$.

REPEAT:

M-step (re-estimate the means):

$\vec{\mu}_j = \dfrac{\sum_{i=1}^{n} P(Y=j \mid X=\vec{x}_i, \vec{\mu}_1, \dots, \vec{\mu}_k)\ \vec{x}_i}{\sum_{i=1}^{n} P(Y=j \mid X=\vec{x}_i, \vec{\mu}_1, \dots, \vec{\mu}_k)}$

E-step (re-estimate the responsibilities):

$P(Y=j \mid X=\vec{x}_i, \vec{\mu}_1, \dots, \vec{\mu}_k) = \dfrac{P(X=\vec{x}_i \mid Y=j, \vec{\mu}_j)\ P(Y=j)}{\sum_{l=1}^{k} P(X=\vec{x}_i \mid Y=l, \vec{\mu}_l)\ P(Y=l)} = \dfrac{e^{-\frac{1}{2}\|\vec{x}_i - \vec{\mu}_j\|^2}\ P(Y=j)}{\sum_{l=1}^{k} e^{-\frac{1}{2}\|\vec{x}_i - \vec{\mu}_l\|^2}\ P(Y=l)}$
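
A minimal NumPy sketch of this loop under the same assumptions (identity covariance, fixed $P(Y)$); the toy data and initialization are invented:

```python
import numpy as np

def em_means(X, mu, prior, n_iters=50):
    for _ in range(n_iters):
        # E-step: responsibilities r[i, j] = P(Y = j | x_i, mu_1..mu_k)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-0.5 * sq) * prior          # Gaussian likelihood times prior
        r = w / w.sum(axis=1, keepdims=True)
        # M-step: each mean is the responsibility-weighted average of the data
        mu = (r.T @ X) / r.sum(axis=0)[:, None]
    return mu, r

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
mu0 = X[rng.choice(len(X), size=2, replace=False)]
mu, r = em_means(X, mu0, prior=np.array([0.5, 0.5]))
print(mu)    # should approach the two true component means
```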
