
Clustering
Knowledge Discovery and Data Mining 1

Roman Kern

ISDS, TU Graz

2019-01-17


Big picture: KDDM

[Figure: the Knowledge Discovery Process with the stages Preprocessing, Transformation, and Data Mining, resting on Mathematical Tools (Linear Algebra, Information Theory, Statistical Inference, Probability Theory) and on Infrastructure (Hardware & Programming Model).]


Outline

1 Clustering

2 Iterative Clustering

3 Hierarchical Clustering

4 Density-Based Clustering

5 Other Clustering Approaches


Clustering

An unsupervised machine learning technique

Introduction

Identify groups of instances with the following objectives:
1 Maximize the similarity within a group (intra-group, cohesion)
2 Minimize the similarity between the groups (inter-group, separation)

Clustering is an unsupervised method

... in contrast to supervised methods, no training data is needed (i.e. no external teacher)

... the optimisation criterion is often task and domain-independent

Example

Use Cases & Applications

Group similar items

e.g. similar buyers

Text categorization

e.g. for search engine results

Pattern recognition

Supporting task, e.g. outlier detection

Vector quantisation

e.g. for lossy compression

Basics

Similarity is often expressed as a distance function, d(x, y)

... the higher the distance, the lower the similarity

Definition of the distance function depends on the types of features (categorical, numeric, ...)

All inter-instance distances form an n × n distance matrix (takes O(n²) time) for n instances

Instead of a distance function, often proximity or (dis-)similarity functions are used (often the Euclidean distance or the cosine similarity are used).
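As a small illustration (not from the slides), such a distance matrix can be computed with NumPy; the toy data and names are assumptions:

```python
import numpy as np

X = np.random.rand(5, 3)                                  # n = 5 instances with 3 numeric features
# n x n matrix of all pairwise Euclidean distances (O(n^2) entries)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
assert D.shape == (5, 5) and np.allclose(np.diag(D), 0)   # distance of an instance to itself is 0
```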

Exclusive clustering

Each instance is assigned to only a single cluster

... there is no overlap between the clusters

Let X be a set of instances. An exclusive clustering C of X, C = {C1, C2, ..., Cn}, Ci ⊆ X, is a partitioning of X into non-empty, mutually exclusive subsets Ci with ⋃_{Ci∈C} Ci = X

Fuzzy clustering

Instances are assigned to more than one cluster

... includes a weight that expresses how strong the relationship is

Thus clusters do overlap

Also known as soft clustering

Hierarchical clustering

Clusters are further refined with sub-clusters

... thus an instance is assigned to a cluster and all its parent clusters

Thus clusters overlap with their parent clusters


Clustering algorithms

Iterative
    Exemplar based (e.g. k-means)
    Exchange based (e.g. Kernighan-Lin)

Hierarchical
    Agglomerative (e.g. HAC)
    Divisive (e.g. min-cut)

Density based
    Point density based (e.g. DBSCAN)
    Attraction based (e.g. MajorClust)

Meta-search controlled
    Gradient based (e.g. simulated annealing)
    Competitive (e.g. evolutionary strategies)

Stochastic
    Gaussian mixtures (e.g. EM)

Information theory based

Subspace clustering


Clustering algorithms - on graphs

Clique based

Spectral clustering approaches

Markov clustering

Clustering approaches - on matrices

Co-clustering

Note: this is just a selection and one possible way to categorise clustering algorithms

Iterative Clustering

e.g. k-Means

Model-Based Clustering

Assume a model and then try to fit the parameters
→ optimize the fit between the model and the data

Clustering as a problem to infer these parameters

For example, each cluster (group) will be represented by a parametric distribution

→ for more than one cluster it will be a mixture of distributions

Clustering a Mixture of Gaussians

The clustering assumes that the data is generated by a mixture of Gaussians

Each cluster is then a Gaussian distribution

The task is to find the parameters µi and σi for each cluster Ci ∈ C
→ the mean and the standard deviation for each cluster

Plus, as an optional task, find the true number of clusters

Connection to noisy channel

Transmit two states over a noisy channel

Two states as input, a real value as output

The output will look like this: [Figure: the probability density P(x) of the observed output]

Knowing the parameters of the distributions, one could compute the probabilities

k-Means

Example for an iterative clustering algorithm

Input:

Set of instances V
Distance function d(vi, vj)
Minimization criteria, based on d
Number of desired clusters k

Output:

Cluster representatives r1, ..., rk, the so-called centroids

Each centroid is the mean of the instances within the cluster

k-Means is a partitioning clustering algorithm
In contrast to centroids, medoids are the best representative instance within a cluster

k-Means - Algorithm

1 Initialise the centroids r1, ..., rk (seed selection)

2 Iterate over all instances v ∈ V
    Assign each instance to the closest centroid (using d)

3 Update the centroids (using the minimization criterion, ri = minimize(e(Ci)))

4 GOTO 2 (until convergence)
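A minimal sketch of this loop in Python/NumPy, assuming Euclidean distance and the mean-based update described on the next slide; the names and the initialisation are illustrative, not part of the lecture:

```python
import numpy as np

def k_means(V, k, n_iter=100, seed=0):
    """Minimal k-means: V is an (n, d) array of instances, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1. Initialise the centroids by picking k random instances (seed selection)
    r = V[rng.choice(len(V), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each instance to the closest centroid (Euclidean distance d)
        d = np.linalg.norm(V[:, None, :] - r[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # 3. Update each centroid to the mean of its cluster members
        #    (minimises the sum of squared distances, i.e. the variance criterion)
        new_r = np.array([V[assign == i].mean(axis=0) if np.any(assign == i) else r[i]
                          for i in range(k)])
        # 4. Repeat until convergence (centroids stop moving)
        if np.allclose(new_r, r):
            break
        r = new_r
    return r, assign
```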

k-Means - Open questions

Which is the correct cluster number?

Which initialisation to choose?

Which distance measure?

Which minimization criteria?

k-Means - Basic algorithm

Which is the correct cluster number?

e.g. predefined

Which initialisation to choose?

e.g. random initialisation

Which distance measure?

e.g. Euclidean distance

Which minimization criteria?

e.g. component-wise arithmetic mean of the cluster members

If the data is from a metric space, then the sum of the squared distances to the cluster representatives (= variance criterion) is usually chosen as minimization criterion e → vector of minimum variance


k-Means - Advantages

Conceptually simple

Efficient, low computational complexity, roughly linear in the number of instances n (per iteration)

The basic algorithm produces a hard clustering, but can be extended to fuzzy clustering

Can be extended to work incrementally (online k-Means)

Partitions the instances → Voronoi diagram

k-Means - Disadvantages

Only finds local minima
    Start multiple times with random seeds
    Devise “clever” seeding methods

Needs fixed k (cluster number)

Does not work well for elongated clusters or density based clusters

Does not deal well with noisy data, outliers

k-Means - Advanced algorithms

Which is the correct cluster number?

e.g. Bisecting k-Means (start with 2 clusters, split the largest one)

Which initialisation to choose?

e.g. k-Means++ (try to cover the feature space)

Which distance measure?

e.g. Spherical k-Means (cosine similarity)

Which minimization criteria?

e.g. k-center (max_{v∈Ci} ||v − ri||)
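Several of these variants are available off the shelf; a small sketch with scikit-learn (the toy data and the parameter values are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                               # toy instances
# k-means++ seeding tries to cover the feature space; k is still fixed in advance
km = KMeans(n_clusters=4, init='k-means++', n_init=10).fit(X)
centroids, labels = km.cluster_centers_, km.labels_
```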

k-Means - Limitations

Nested clusters

Clusters with large differences in size

Expectation–Maximization (EM) algorithm

Model based clustering algorithm

... iteratively compute the Maximum Likelihood estimate for the model parameters, in the presence of hidden or missing data (latent variables)

Algorithm
1 Initialise parameters
2 E-step (estimate the missing data, given the observations and the current parameters)
3 M-step (maximize the likelihood function)
4 GOTO 2 (until convergence)

Used for mixture of Gaussians

Soft clustering
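A minimal sketch of EM for a mixture of two one-dimensional Gaussians (NumPy; the initialisation, names, and fixed iteration count are assumptions for illustration):

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Minimal EM for a mixture of two 1-D Gaussians; x is a 1-D array of observations."""
    # 1. Initialise parameters: means, standard deviations, mixing weights
    mu = np.percentile(x, [25, 75]).astype(float)
    sigma = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # 2. E-step: soft assignment P(cluster i | x_j) for every observation
        dens = np.array([w[i] * np.exp(-0.5 * ((x - mu[i]) / sigma[i]) ** 2)
                         / (sigma[i] * np.sqrt(2 * np.pi)) for i in range(2)])  # shape (2, n)
        resp = dens / dens.sum(axis=0, keepdims=True)
        # 3. M-step: re-estimate the parameters from the weighted observations
        nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
        w = nk / len(x)
        # 4. Repeat (a fixed number of iterations stands in for a convergence check)
    return mu, sigma, w, resp
```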

Hierarchical Clustering

e.g. HAC

Two basic approaches

1 Agglomerative: start with each instance as a single cluster and then merge (bottom-up)

2 Divisive: start with a single cluster of all instances and then split (top-down)

Hierarchical Agglomerative Clustering (HAC)

Input:

Distance measure for two clusters - dC

Output:

Cluster hierarchy (dendrogram)

Clusters are not represented by centroids, and there is no fixed number of clusters

HAC - Algorithm

1 Initialise clustering (one cluster per instance)

2 Update the distance matrix

3 Select the pair of clusters with the lowest distance (argmin_{{Ci,Cj}⊂C, Ci≠Cj} dC(Ci, Cj))

4 Merge the selected pair

5 GOTO 2 (until stop criteria has been reached)

Many implementations use an explicit distance (similarity, proximity) matrix (see step 2), but it is not strictly required
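A naive single-link version of this loop in Python/NumPy, with an explicit distance matrix and the cluster number as stop criterion (all names are illustrative, and the run time is cubic in n, so this is only a sketch):

```python
import numpy as np

def hac_single_link(X, n_clusters=1):
    """Naive single-link HAC on an (n, d) array X; stops at n_clusters clusters."""
    # 1. Initialise clustering: one cluster per instance
    clusters = [[i] for i in range(len(X))]
    # 2. Precompute the instance-level distance matrix (O(n^2))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        # 3. Select the pair of clusters with the lowest single-link (MIN) distance
        best, best_d = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d_ab = D[np.ix_(clusters[a], clusters[b])].min()
                if d_ab < best_d:
                    best, best_d = (a, b), d_ab
        # 4. Merge the selected pair
        a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        # 5. Repeat until the stop criterion (here: number of clusters) is reached
    return clusters
```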

HAC - Open questions

Which is the distance measure for instances?

Which is the distance measure for clusters?

Which stopping criteria?

HAC - Stop criterion

Alternatives:

Number of clusters
Threshold on the distance (e.g. infinity)
Size of clusters
Measure of cluster quality (e.g. silhouette coefficient)
...

HAC - Distance Measure

The distance measure represents the inter-cluster relationship

... also the term linkage is used for this distance measure

The most commonly used linkage types are

Single link (nearest neighbour, MIN)
    dC(Ci, Cj) = min_{u∈Ci, v∈Cj} d(u, v)

Complete link (furthest neighbour, MAX)
    dC(Ci, Cj) = max_{u∈Ci, v∈Cj} d(u, v)

Group average link
    dC(Ci, Cj) = (1 / (|Ci| |Cj|)) Σ_{u∈Ci, v∈Cj} d(u, v)

Ward criterion (variance)
    dC(Ci, Cj) = sqrt(2 |Ci| |Cj| / (|Ci| + |Cj|)) · ||ri − rj||

Centroid
    dC(Ci, Cj) = d(ri, rj)
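These linkage types are typically exercised via a library; a short sketch with SciPy (the toy data, the chosen method, and the cut are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # 20 toy instances in 2-D
# method can be 'single', 'complete', 'average' or 'ward'
Z = linkage(X, method='average', metric='euclidean')
# one possible stop criterion: cut the dendrogram into 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
```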

Hierarchical clustering - Single Linkage Example

Start state: single cluster for each instance

Iteratively merge clusters to build a dendrogram


Comparison of linkage types

                    single link    complete link    average link    Ward criterion
characteristic      contractive    dilating         conservative    conservative
cluster number      low            high             medium          medium
cluster form        extended       small            compact         spherical
chaining tendency   strong         low              low             low
outlier-detecting   very good      poor             medium          medium


Chaining issues with single link


Nesting issues with complete link


HAC - Advantages

Arbitrary measures for distances and similarities

No need to specify the true number of clusters

HAC - Disadvantages

Relatively high computational complexity, O(n²) up to O(n³), depending on the linkage

In many implementations HAC uses the distance matrix as input
Computing this matrix is already O(n²)

High memory requirements, if the distance matrix is stored in memory (O(n²))

Greedy nature and no backtracking

Density-Based Clustering

e.g. DBSCAN

Basic approaches

Density based clustering tries to partition the instances into groups of equal density

Two types:
1 Parameter-based: if the distribution is known
2 Parameterless: need to estimate the distribution and parameters

Example: DBSCAN

Tries to identify three types of instances, based on their neighbourhood Nε(v)

Core points: if |Nε(v)| ≥ tmin

Noise points: if not density-reachable from any core point

Border points: otherwise

Density-reachable

If at least a single point in the neighbourhood is a core point, or there is a chain of density-reachable points

In German these types are named: Kern-, Grenz-, and Geräuschpunkte
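A brief sketch using scikit-learn's DBSCAN, where eps corresponds to ε and min_samples to tmin (the toy data and parameter values are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)                     # toy 2-D instances
db = DBSCAN(eps=0.1, min_samples=5).fit(X)     # eps = neighbourhood radius, min_samples = tmin
labels = db.labels_                            # cluster index per instance, -1 marks noise points
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True           # core points; remaining non-noise points are border points
```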


Example of a DBSCAN clustering

DBSCAN - Advantages

Low computational complexity (near-linear, O(n log n), when using a spatial index)

No predefined number of clusters needed

Stable results

Deterministic for the most part (except border points)
Independent of the sequence of instances

Disadvantages

Does not work well with high dimensional data

Assumption of equal density does not always hold

Other Clustering Approaches

Alternatives & Additional Steps

Other Approaches

Spectral clustering

Starting with the similarity matrix

Compute the spectrum (eigenvalues)

Reduce the dimensionality

As a preprocessing step for further clustering
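A rough sketch of this spectral preprocessing step, assuming a dense symmetric similarity matrix S (the normalisation and names are illustrative, not from the slides):

```python
import numpy as np

def spectral_embedding(S, dim=2):
    """Similarity matrix S (n x n, symmetric, non-negative) -> low-dimensional embedding."""
    d = S.sum(axis=1)
    # symmetric normalised graph Laplacian: L = I - D^{-1/2} S D^{-1/2}
    L = np.eye(len(S)) - S / np.sqrt(np.outer(d, d))
    # spectrum: the eigenvectors of the smallest eigenvalues span the reduced representation
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :dim]          # e.g. fed into k-means afterwards
```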

Graph clustering

Make use of the graph structure

Many proposed algorithms

e.g. Affinity propagation, Markov clustering

Co-clustering

Cluster more than one dimension at the same time (mutual reinforcement)

Consider a matrix, e.g. document × terms

Clustering now tries to find topics

Co-clustering groups the documents to topics and the terms to topics

High-dimensionality

Subspace clustering targets high dimensional data

... as many approaches fail (curse of dimensionality)

Typically need techniques to keep the computational complexity low

Preprocessing for clustering

Prepare the data for clustering

Remove noise and outliers

Some algorithms do not cope well (e.g. k-Means)

Normalise the data

Compute an approximation of clustering to speed up the computation

e.g. Canopy clustering to initialise the seeds for k-means
Also helps to partition the data in a map/reduce environment

Estimate the number of clusters

Some algorithms need the number of clusters
Techniques from model selection

Reduce the dimensionality & transformation of features

e.g. Singular Value Decomposition (SVD)
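For the last point, a tiny sketch of SVD-based dimensionality reduction before clustering (NumPy; the centring and the target dimension are assumptions):

```python
import numpy as np

def svd_reduce(X, dim=10):
    """Project an (n, d) feature matrix onto its dim leading singular directions."""
    Xc = X - X.mean(axis=0)                         # centre the features
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T                          # reduced representation, e.g. input to k-means
```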

Postprocessing for clustering

Postprocessing the output of clustering

Remove spurious clusters (empty, small) - could be outliers, noise

Validation of results

e.g. reject solutions

Merge clusters (if close to each other)

Split clusters (if inhomogeneous)

Thank You!

Next up: Evaluation

Further information

Special thanks to Benno Stein for his slides:
http://www.uni-weimar.de/en/media/chairs/webis/teaching/lecturenotes/#machine-learning

http://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-06-clustering-introduction

http://educlust.dbvis.de/
