
Data Mining Course 2007

Eric Postma

Clustering

Overview

Three approaches to clustering

1. Minimization of reconstruction error

• PCA, nlPCA, k-means clustering

2. Distance preservation

• Sammon mapping, Isomap, SPE

3. Maximum likelihood density estimation

• Gaussian Mixtures

• These datasets have identical statistics up to 2nd order (identical means and covariances), yet they clearly differ in structure

1. Minimization of reconstruction error

Illustration of PCA (1)

• Face dataset (Rice database)

Illustration of PCA (2)

• Average face

Illustration of PCA (3)

• Top 10 Eigenfaces

Each 39-dimensional data item describes different aspects of the welfare and poverty of one country.

2D PCA projection
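To make these projections concrete, here is a minimal PCA sketch in Python (not from the course; names and data are illustrative): it computes the data mean (the "average face" for image data), the top principal directions (the eigenfaces), and a 2D projection.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components.

    X: (n_samples, n_features); rows are data items, e.g. faces as
    pixel vectors or 39-dimensional country welfare/poverty records.
    """
    mean = X.mean(axis=0)              # the "average face" for image data
    Xc = X - mean                      # centre the data
    # SVD of the centred data: rows of Vt are the principal
    # directions (the eigenfaces, when the rows are face images)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components, mean

# illustrative data: 100 random 39-dimensional items projected to 2D
X = np.random.randn(100, 39)
Y, components, mean = pca_project(X)
print(Y.shape)  # (100, 2)
```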

Non-linear PCA

• Using neural networks (to be discussed tomorrow)

2. Distance preservation

Sammon mapping

• Given a data set X, the distance between any two samples i and j is defined as D_ij

• We consider the projection onto a two-dimensional plane in which the projected points are separated by distances d_ij

• Define an error (stress) function:

E = \frac{1}{\sum_{i<j} D_{ij}} \sum_{i<j} \frac{(D_{ij} - d_{ij})^2}{D_{ij}}
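A minimal sketch of minimizing this stress (an assumed implementation: Sammon's original method uses a pseudo-Newton step; plain gradient descent is used here for brevity):

```python
import numpy as np

def sammon_map(X, n_iter=500, lr=0.1, eps=1e-9, seed=0):
    """Map X to 2D by gradient descent on the Sammon stress."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # input-space distances D_ij and the normalizing constant
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    c = D[np.triu_indices(n, 1)].sum()
    Y = rng.standard_normal((n, 2)) * 1e-2      # random 2D start
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.linalg.norm(diff, axis=-1)       # output distances d_ij
        W = (d - D) / (D * d + eps)             # per-pair stress weight
        np.fill_diagonal(W, 0.0)
        grad = (2.0 / c) * (W[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad                          # descend the gradient
    return Y

Y = sammon_map(np.random.randn(50, 5))          # toy 5D data to 2D
```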

Sammon mapping

Main limitations of Sammon

• The Sammon mapping procedure is a gradient descent method

• Main limitation: local minima

• Classical MDS may be preferred because it finds the global minimum in closed form (being based on an eigendecomposition, like PCA)

• Both methods have difficulty with “curved or curly subspaces”

Isomap

• Tenenbaum, de Silva, and Langford (2000)

• Build a graph in which each node represents a data point

• Compute shortest distances along the graph (e.g., Dijkstra’s algorithm)

• Store all distances in a matrix D

• Perform MDS on the matrix D (see the sketch below)
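A compact sketch of these steps (an assumed implementation; it builds a k-nearest-neighbour graph and assumes that graph is connected):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(X, k=7, n_components=2):
    """Isomap: k-NN graph, graph shortest paths, classical MDS."""
    n = X.shape[0]
    E = squareform(pdist(X))                  # Euclidean distances
    G = np.full((n, n), np.inf)               # inf marks "no edge"
    for i in range(n):
        nn = np.argsort(E[i])[1:k + 1]        # k nearest neighbours
        G[i, nn] = E[i, nn]
    G = np.minimum(G, G.T)                    # symmetrize the graph
    D = shortest_path(G, method='D')          # Dijkstra's algorithm
    # classical MDS on D via double centring
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]  # largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

Y = isomap(np.random.randn(200, 3), k=7)      # toy data, K=7 as above
```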

Illustration of Isomap (1)

• For two arbitrary points on the manifold, the Euclidean distance does not always reflect similarity (cf. the dashed blue line)

Illustration of Isomap (2)

• Isomap finds the appropriate shortest path along the graph (red curve, for K=7, N=1000)

Illustration of Isomap (3)

• Two-dimensional embedding (red line is the shortest path along the graph, blue line is the true distance in the embedding)

Illustration of Isomap (4)

• Isomap’s (●) ability to find the intrinsic dimensionality, compared to PCA and MDS (∆ and o)

Illustration of Isomap (5)

Illustration of Isomap (6)

Illustration of Isomap (7)

• Interpolation along a straight line

Stochastic Proximity Embedding

• SPE algorithm

• Agrafiotis, D.K. and Xu, H. (2002). A self-organizing principle for learning nonlinear manifolds. Proceedings of the National Academy of Sciences U.S.A.

Stress function

f(d_{ij}, r_{ij}) = (d_{ij} - r_{ij})^2

where d_{ij} is the output proximity between points i and j, and r_{ij} is the input proximity between points i and j
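A minimal sketch of the SPE procedure built on this stress (an assumed implementation of the basic update rule; the published algorithm adds a neighbourhood cutoff for learning manifolds, omitted here):

```python
import numpy as np

def spe(X, n_components=2, n_cycles=100, n_updates=5000,
        lam=1.0, lam_min=0.01, eps=1e-10, seed=0):
    """Stochastic Proximity Embedding: pairwise stochastic updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Y = rng.random((n, n_components))         # random initial embedding
    step = (lam - lam_min) / n_cycles         # learning-rate schedule
    for _ in range(n_cycles):
        for _ in range(n_updates):
            i, j = rng.choice(n, size=2, replace=False)
            r = np.linalg.norm(X[i] - X[j])   # input proximity r_ij
            d = np.linalg.norm(Y[i] - Y[j])   # output proximity d_ij
            # move the pair to reduce (d_ij - r_ij)^2
            delta = lam * 0.5 * (r - d) / (d + eps) * (Y[i] - Y[j])
            Y[i] += delta
            Y[j] -= delta
        lam -= step                           # anneal the learning rate
    return Y
```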

Swiss roll data set

Original 3D set and the 2D embedding obtained by SPE

Stress as a function of embedding dimension (averaged over 30 runs)

Scalability (# steps for four set sizes): linear scaling

Conformations of methylpropylether (C1–C2–C3–O4–C5)

Diamine combinatorial library

Clustering

• Minimize the total within-cluster variance (reconstruction error)

E = \sum_{c=1}^{C} \sum_{i=1}^{N} k_{ic} \, \lVert x_i - w_c \rVert^2

• k_ic = 1 if data point i belongs to cluster c, and 0 otherwise

K-means clustering

1. Random selection of C cluster centres
2. Partition the data by assigning each point to the nearest cluster centre
3. The mean of each partition is the new cluster centre

A distance threshold may be used…

• Effect of distance threshold on the number of clusters
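A minimal sketch of steps 1–3 above (an assumed implementation, without the distance-threshold variant):

```python
import numpy as np

def kmeans(X, C, n_iter=100, seed=0):
    """Basic k-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # 1. random selection of C cluster centres (from the data)
    centres = X[rng.choice(len(X), size=C, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # 2. partition: assign each point to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # 3. the mean of each partition is the new cluster centre
        new = np.array([X[labels == c].mean(axis=0)
                        if np.any(labels == c) else centres[c]
                        for c in range(C)])
        if np.allclose(new, centres):         # converged
            break
        centres = new
    return centres, labels

centres, labels = kmeans(np.random.randn(300, 2), C=3)
```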

Main limitation of k-means clustering

• The final partitioning and cluster centres depend on the initial configuration

• Discrete partitioning may introduce errors

• Instead of minimizing the reconstruction error, we may maximize the likelihood of the data (given some probabilistic model)

Neural algorithms related to k-means

• Kohonen self-organizing feature maps

• Competitive learning networks

3. Maximum likelihood

Gaussian Mixtures

• Model the pdf of the data using a mixture of distributions

p(x) = \sum_{i=1}^{K} p(x \mid i) \, P(i)

• K is the number of kernels (K << the number of data points)

• Common choice for the component densities p(x \mid i) is the spherical Gaussian:

p(x \mid i) = \frac{1}{(2\pi\sigma_i^2)^{d/2}} \exp\!\left( -\frac{\lVert x - \mu_i \rVert^2}{2\sigma_i^2} \right)

Illustration of EM applied to GM model

The solid line gives the initialization of the EM algorithm: two kernels, P(1) = P(2) = 0.5, μ1 = 0.0752, μ2 = 1.0176, σ1 = σ2 = 0.2356

Convergence after 10 EM steps.
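A minimal sketch of EM for a one-dimensional two-kernel mixture like the one illustrated (an assumed implementation; initialization here is random rather than the exact values above):

```python
import numpy as np

def em_gmm_1d(x, K=2, n_steps=10, seed=0):
    """EM for a 1D Gaussian mixture with K kernels."""
    rng = np.random.default_rng(seed)
    P = np.full(K, 1.0 / K)                    # mixing weights P(i)
    mu = rng.choice(x, size=K, replace=False)  # initial means
    sigma = np.full(K, x.std())                # initial std devs
    for _ in range(n_steps):
        # E-step: responsibilities p(i | x_n)
        dens = (np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
                / (np.sqrt(2 * np.pi) * sigma))
        resp = dens * P
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate P(i), mu_i, sigma_i
        Nk = resp.sum(axis=0)
        P = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return P, mu, sigma

# two well-separated 1D kernels, similar in spirit to the slide
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 0.24, 200), rng.normal(1.0, 0.24, 200)])
print(em_gmm_1d(x, K=2, n_steps=10))
```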

Relevant literature

• L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik (submitted). Dimensionality Reduction: A Comparative Review.

• http://www.cs.unimaas.nl/l.vandermaaten

Recommended