Geometrical Data Analysis
Nikos Laskaris, for the ΠΜΣ-ΗΕΠ postgraduate programme, University of Patras (Πανεπιστήμιο Πατρών)

Plato, 427-347 BC

N. Laskaris

Algorithms for Geometrical Data Analysis

Philosophical Inquiries

- Where does this course belong? (e.g. machine learning/vision, pattern recognition)

- What is it about? (multivariate data, multi-dimensional signals)

- Why is this course necessary? (generic character, simplicity, efficiency, user's idiosyncrasy)

- Scope of this short course & goals (How? vs. Why?)

What isn't Geometrical Data Analysis?

Statistical Data Analysis

Hypothesis Driven methodologies

A-priori (Top-Down) Data Modeling

Parametric (model fitting) approaches

A little Motivation

From my point of view


Information-Geometry

vs. Informative - Geometry

Roger Shepard (1929 - ), Prof. Emeritus of Social Science, Stanford University

A cognitive scientist (Ph.D. in psychology, 1955) and author of ‘‘Toward a Universal Law of Generalization for Psychological Science’’

He is considered the father of research on spatial relations

Science, vol. 237, Sept. 1987

Does psychological science have any hope of achieving a law that is comparable in generality (if not in predictive accuracy) to Newton's universal law of gravitation?


Michael Kirby
Professor of Mathematics and Computer Science

Graduate Program Director, Colorado State University

An Empirical Approach to Dimensionality Reduction and the Study of Patterns


Thus, researchers today are confronted with a modern dilemma.

Presumably the more information available concerning a phenomenon the better.

Yet a massive data set storing the information is, in and of itself, a potentially significant barrier to the investigation.

A time-honored approach for the investigation of unexplained phenomena is to attempt to infer laws, or explain processes, from the patterns present in collected data.

‘‘Our phenomenal ability to acquire data has outstripped our ability to analyze it’’


The book describes several mathematical tools for overcoming problems associated with analyzing high-dimensional and massive data sets.

Kirby's approach is geometric in nature, and its main tools are dimensionality-reducing mappings.

These mappings are required for the analysis and representation of information (patterns) in large data sets generated by physical or numerical experiments.

Sir Ronald Aylmer Fisher (1890-1962)

‘‘Let the Data Speak for itself’’

Basics of Geometrical Data Analysis

Introduction


Feature Extraction

Distance measure

Structure description

Embedding in Feature-Space

Partitional Clustering

Outlier

Hierarchical Clustering

Graph-theoretic Clustering

Feature-selection

Feature-normalization

Elementary Geometrical Data Analysis

- Vector-Quantization & Prototyping

- Distances & Visualization

- Ordering & Novelty/Outlier Detection

- Clustering, Dimensionality-reduction, Manifold-Learning

From Patterns to Distances

- Given an ensemble of N patterns, a p-dimensional vector $x_i$, $i = 1, 2, \dots, N$, is extracted from each one:

$$x_i = [\, x_i(1) \;\; x_i(2) \;\; \dots \;\; x_i(p) \,]$$

- With the feature-extraction step, the set of patterns is represented by a set of row-vectors.

- The N vectors are gathered in the so-called Data-Matrix $X^{data}$:

$$
X^{data}_{[N \times p]} =
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}
=
\begin{pmatrix}
x_1(1) & x_1(2) & \dots & x_1(p) \\
x_2(1) & x_2(2) & \dots & x_2(p) \\
\vdots & \vdots & & \vdots \\
x_N(1) & x_N(2) & \dots & x_N(p)
\end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1p} \\
x_{21} & x_{22} & \dots & x_{2p} \\
\vdots & \vdots & & \vdots \\
x_{N1} & x_{N2} & \dots & x_{Np}
\end{pmatrix}
$$

Two simple transformations of the Data-matrix $X^{data}_{[N \times p]}$ (rows: the N vectors, columns: the p variates) are:

- standardization of each one of the p variates (columns): after subtraction of its mean, the variate is normalized by its standard deviation (std); whitening based on PCA is an alternative;

- normalization of each one of the N vectors (rows) by dividing it by its norm, i.e. replacement of $x_i$ with $X_i = x_i / \|x_i\|$, where $\|x_i\| = [\, x_i(1)^2 + x_i(2)^2 + \dots + x_i(p)^2 \,]^{1/2}$.
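The two transformations can be sketched in a few lines of NumPy. This is only an illustration: the toy matrix `X_data` and the choice of the sample standard deviation (`ddof=1`) are assumptions, not something the slides specify.

```python
import numpy as np

# Hypothetical toy data matrix: N = 4 patterns (rows), p = 3 variates (columns).
X_data = np.array([[1.0, 2.0, 3.0],
                   [2.0, 4.0, 1.0],
                   [3.0, 6.0, 5.0],
                   [4.0, 8.0, 7.0]])

# Standardization of each variate (column): subtract its mean, divide by its std.
# ddof=1 (sample std) is an assumption made for this sketch.
X_std = (X_data - X_data.mean(axis=0)) / X_data.std(axis=0, ddof=1)

# Normalization of each vector (row): divide by its Euclidean norm.
row_norms = np.linalg.norm(X_data, axis=1, keepdims=True)
X_norm = X_data / row_norms
```

After the first step every column has zero mean and unit std; after the second every row has unit length, which is what makes the subsequent Euclidean distances sensitive to shape rather than amplitude.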

Note

Feature extraction and the above transformations play a decisive role in the subsequent mining of information from the input patterns.

For instance, the normalization trick is employed when dealing with time-series patterns and we want to highlight shape (phase) similarities during the subsequent computation of Euclidean distances.

The geometrical consideration (patterns → points in feature space) is very useful in order:

- to conceptualize morphological relationships between patterns (similar patterns are mapped onto nearby points);

- to search for natural groupings inside the sample of patterns.

Measuring the geometrical distance between vectors is a means of quantifying (inversely) the similarity between the corresponding patterns. A small distance means great similarity between two patterns, and this can be interpreted as common signal/information content.

The Euclidean distance in p-D space

$$d(x_1, x_2) = \|x_1 - x_2\| = [\, (x_1(1) - x_2(1))^2 + (x_1(2) - x_2(2))^2 + \dots + (x_1(p) - x_2(p))^2 \,]^{1/2}$$

For computational considerations, its squared form is usually utilized, i.e.

$$d(x_1, x_2) = \|x_1 - x_2\|^2$$

For an ensemble of N patterns $\{x_i\}_{i=1:N}$, all the pairwise distances are gathered in the so-called distance matrix $D_{[N \times N]}$.

A fast computation of this (symmetric) matrix is given via:

$$D = \mathrm{diag}(A)\,E + E\,\mathrm{diag}(A) - 2A \qquad (1)$$

where

$$
A_{(N \times N)} = X X^T, \qquad
E_{(N \times N)} = \begin{pmatrix} 1 & \dots & 1 \\ \vdots & & \vdots \\ 1 & \dots & 1 \end{pmatrix}, \qquad
X = X^{data}_{(N \times p)} = \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix}
$$

and $\mathrm{diag}(A)$ is the diagonal matrix that holds the diagonal of A, i.e. the squared norms $\|x_i\|^2$.
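A minimal NumPy sketch of formula (1) follows, checked against a brute-force double loop. The function name `squared_distance_matrix` and the random test data are illustrative assumptions; the result holds squared Euclidean distances, in line with the squared form adopted above.

```python
import numpy as np

def squared_distance_matrix(X):
    """Squared Euclidean distances between the rows of X, via formula (1)."""
    N = X.shape[0]
    A = X @ X.T                    # Gram matrix, (N x N)
    E = np.ones((N, N))            # all-ones matrix, (N x N)
    dA = np.diag(np.diag(A))       # diagonal matrix holding ||x_i||^2
    D = dA @ E + E @ dA - 2.0 * A
    return np.maximum(D, 0.0)      # clip tiny negative round-off values

# Hypothetical data: N = 5 points in p = 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
D = squared_distance_matrix(X)

# Brute-force reference for comparison.
D_ref = np.array([[np.sum((xi - xj) ** 2) for xj in X] for xi in X])
```

In practice the products with `E` are usually replaced by broadcasting (`sq[:, None] + sq[None, :] - 2 * A`), which avoids two N x N matrix multiplications; the explicit form is kept here to mirror the formula.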

From Patterns to Distances

Summary

Note

If the normalized versions $\{X_i\}$ of the vectors $\{x_i\}$ replace them in the Data matrix, then the corresponding pairwise (squared) Euclidean distances become

$$d(X_i, X_j) = 2\,(1 - \rho(x_i, x_j))$$

where $\rho(x_i, x_j)$ is the correlation coefficient between the two vectors:

$$\rho(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\|_2 \, \|x_j\|_2} = X_i \cdot X_j$$

Strictly, this is the uncentered correlation (cosine similarity); it coincides with the Pearson coefficient when each vector has zero mean.
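The identity can be verified numerically. The sketch below uses two hypothetical random vectors and also checks that, once each vector is mean-centered, the same ratio equals the Pearson coefficient returned by `np.corrcoef`.

```python
import numpy as np

# Two hypothetical p = 8 dimensional vectors.
rng = np.random.default_rng(1)
xi, xj = rng.normal(size=8), rng.normal(size=8)

# Unit-norm versions X_i, X_j.
Xi = xi / np.linalg.norm(xi)
Xj = xj / np.linalg.norm(xj)

rho = np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))
d_sq = np.sum((Xi - Xj) ** 2)          # squared distance between normalized vectors

# After mean-centering, the same ratio is the Pearson correlation coefficient.
xi_c, xj_c = xi - xi.mean(), xj - xj.mean()
rho_centered = np.dot(xi_c, xj_c) / (np.linalg.norm(xi_c) * np.linalg.norm(xj_c))
pearson = np.corrcoef(xi, xj)[0, 1]
```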

An insight into the structural information contained in the Distance-matrix can be obtained via a simple visualization scheme.

It is an efficient procedure for unmasking possible outliers: the corresponding rows/columns appear as white stripes in the produced layout.

Relating topological descriptors of point sets with the data

The description of a set of patterns through the topology of their representing points can lead to simple descriptors with a ready geometric interpretation, without losing the connection with the conventional approach for studying the data (statistics).

Geometrical concepts like the ‘local point-density’ or the outline/skeleton of a point-swarm can be utilized in building tools for understanding and handling the multi-D data.

The interpoint distances & the gravitational centre of a point set

The dispersion J expresses the compactness of a point set. It is the average squared distance from the geometrical mean (centroid):

$$J(\{x_i\}_{i=1:N}) = \frac{1}{N-1} \sum_{i=1}^{N} \|x_i - x_{ave}\|^2, \qquad x_{ave} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

Note: it is the p-dimensional analogue of the (squared) standard deviation for a set of scalars.

It can be expressed as a summation of pairwise distances:

$$J(\{x_i\}_{i=1:N}) = \frac{1}{2N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} \|x_i - x_j\|^2$$

and estimated via a simple matrix operation:

$$J(\{x_i\}_{i=1:N}) = \frac{1}{2N(N-1)} \; u \cdot D \cdot u^T, \qquad u_{[1 \times N]} = [1, 1, \dots, 1]$$
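The three expressions for the dispersion J (centroid form, pairwise form, matrix form) can be checked against each other. The sketch assumes D holds squared Euclidean distances; the random point set is hypothetical.

```python
import numpy as np

# Hypothetical point set: N = 6 points in p = 3 dimensions.
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
N = X.shape[0]

# Centroid form: average squared distance from the mean vector.
x_ave = X.mean(axis=0)
J_centroid = np.sum((X - x_ave) ** 2) / (N - 1)

# Pairwise form: sum over all squared interpoint distances.
diff = X[:, None, :] - X[None, :, :]
D = np.sum(diff ** 2, axis=2)               # (N x N) squared distances
J_pairwise = D.sum() / (2 * N * (N - 1))

# Matrix form: u D u^T with u a row of ones.
u = np.ones((1, N))
J_matrix = (u @ D @ u.T).item() / (2 * N * (N - 1))
```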

$$J(\{x_i\}_{i=1:N}) = \frac{1}{N-1} \sum_{i=1}^{N} \|x_i - x_{ave}\|^2$$

1. The dispersion is a measure of ‘noise’ in the data.

$$J(\{x_i\}_{i=1:N}) = \frac{1}{2N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} \|x_i - x_j\|^2$$

2. The contribution of the i-th vector to the total dispersion is the sum of its distances to the rest of the points, i.e. the i-th row-sum of the D matrix:

$$dist(x_i) = d(i,1) + d(i,2) + \dots + d(i,N)$$

It is a simple gauge for unmasking outlying points, and therefore for spotting unusual patterns.

3. Conversely, the notion of Vector Median can be introduced: the vector with the smallest aggregate distance to the rest of the points.
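Remarks 2 and 3 amount to taking row-sums of D and looking at the two extremes. The sketch below uses a hypothetical 2-D point set with one planted far-away point; it works with squared distances, as in the D matrix above, whereas the classical vector median is usually defined with unsquared distances.

```python
import numpy as np

# Hypothetical 2-D point set: a tight group plus one far-away point (index 4).
X = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [0.0, 0.1],
              [0.1, 0.1],
              [5.0, 5.0]])

diff = X[:, None, :] - X[None, :, :]
D = np.sum(diff ** 2, axis=2)          # squared pairwise distances

dist = D.sum(axis=1)                   # aggregate distance of each vector (row-sums)
outlier_idx = int(np.argmax(dist))     # largest contribution to the dispersion
median_idx = int(np.argmin(dist))      # vector-median candidate (most central point)
```

Here the planted point at (5, 5) has by far the largest row-sum, while the group member closest to it, (0.1, 0.1), has the smallest.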

Vector-Ordering schemes

Unmasking Outliers

Using simple functionals that take the pairwise distances as arguments, we build 1D-mappings that are informative about the ‘‘distinctiveness’’ of the corresponding patterns:

1. map each vector to a scalar,
2. locate the vectors with images lying at the extremes of the obtained scalar distributions,
3. identify the corresponding vectors and make a final judgment about the corresponding patterns.

In the case of Reduced-Ordering, the mapping is based on the aggregate distance:

- the estimated scalars are ordered,

$$[\, dist_1 \;\; dist_2 \;\; dist_3 \;\; \dots \;\; dist_N \,] \xrightarrow{\text{ordering}} [\, dist_{[1]} \;\; dist_{[2]} \;\; dist_{[3]} \;\; \dots \;\; dist_{[N]} \,]$$

- this ordering defines the ordering of the corresponding vectors,

$$[\, x_1 \;\; x_2 \;\; x_3 \;\; \dots \;\; x_N \,] \xrightarrow{\text{Reduced ordering}} [\, x_{[1]} \;\; x_{[2]} \;\; x_{[3]} \;\; \dots \;\; x_{[N]} \,]$$

A ranked list of patterns has been formed, in which the elements that deserve further consideration (due to their non-typicality) lie at one end (e.g. novelty detection).
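Reduced ordering is essentially an `argsort` of the aggregate distances. The following sketch assumes squared Euclidean distances and a hypothetical data set with one planted atypical pattern (index 7); the function name `reduced_ordering` is illustrative.

```python
import numpy as np

def reduced_ordering(X):
    """Rank the row-vectors of X by their aggregate (squared) distance to all others."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sum(diff ** 2, axis=2)
    dist = D.sum(axis=1)                      # one scalar per vector
    order = np.argsort(dist, kind="stable")   # indices [1], [2], ..., [N]
    return order, dist[order]

# Hypothetical data: 20 points around the origin, with point 7 moved far away.
rng = np.random.default_rng(5)
X = rng.normal(size=(20, 2))
X[7] = [10.0, 10.0]

order, dist_sorted = reduced_ordering(X)
X_ordered = X[order]                          # the ranked list of vectors
```

The atypical pattern ends up at the high end of the list (`order[-1]`); how many entries from that end to inspect or discard remains a user decision.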

From Patterns to Ordered-lists

Summary

Note: How many patterns to disregard/underline?

Alternative Vector-Ordering schemes

1.

2. Radial-ordering

3. Graph-theoretic (MST)

4. Manifold Ranking

5. Diffusion-network

Ranking in $R^p$

Cluster Analysis

Clusters everywhere

Clusters within our mind

Gestalt psychology (Berlin School) is a theory of mind and brain that proposes that the operational principle of the brain is holistic, parallel, and analog, with self-organizing tendencies.

The Gestalt effect refers to the form-forming capability of our senses, particularly with respect to the visual recognition of figures and whole forms instead of just a collection of simple lines and curves.

Gestalt is a German word meaning shape or form.


Emergence is explained in this way

The most basic rule of Gestalt is the law of Prägnanz:

‘‘we try to experience things in as good a gestalt way as possible’’

In this sense, "good" can mean several things, such as regular, orderly, simplistic, symmetrical, etc.

So, there is an inherent tendency in humans to perform clustering.

What is clustering?

The Art of identifying homogeneous groups in the data

Are there many algorithms?

‘‘There are as many clustering algorithms as there are (potential) users’’

Is there a single best one?

Can I design the ‘‘perfect’’ algorithm?

The clustering of clustering algorithms (Metaclustering)

Hierarchical, Partitional & Graph-Theoretic

Probabilistic, Possibilistic, Deterministic

Static, Adaptive, Dynamic

Statistical, Neuronal, Heuristic

{ Stochastic vs Batch-mode } { Parallel vs Serial }

Sampling clustering algorithms

Hierarchical Clustering Algorithms

- they work with a dissimilarity matrix (i.e. without using the patterns themselves)

- and have a deterministic character (e.g. the Single-linkage algorithm)

The end output is a Dendrogram

Given the distance matrix $D_{[N \times N]}$:

1. Pair the two points k and l with the smallest distance.

2. Delete the rows (& columns) in D corresponding to k & l.

3. Insert a new row (and the corresponding column) containing the distances of the new cluster (k,l) to the remaining N-2 points: $D_{(kl)i} = \min(D_{ki}, D_{li})$, $i \neq k, l$.

4. Repeat the procedure from (1.) for the new $[(N-1) \times (N-1)]$ distance matrix.
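The four steps can be sketched directly on a distance matrix. This is a minimal illustration, not an optimized implementation: the function name `single_linkage`, the stopping rule (a requested number of clusters `n_clusters`) and the 1-D toy points are assumptions made for the example.

```python
import numpy as np

def single_linkage(D, n_clusters):
    """Merge clusters with the min-update rule until n_clusters remain."""
    D = np.array(D, dtype=float)
    np.fill_diagonal(D, np.inf)                 # ignore self-distances
    clusters = [[i] for i in range(D.shape[0])]
    merges = []                                 # (members_k, members_l, distance)
    while len(clusters) > n_clusters:
        # 1. pair the two clusters k, l with the smallest distance
        k, l = np.unravel_index(np.argmin(D), D.shape)
        k, l = min(k, l), max(k, l)
        merges.append((clusters[k], clusters[l], D[k, l]))
        # 3. distances of the merged cluster: D_(kl)i = min(D_ki, D_li)
        new_row = np.minimum(D[k], D[l])
        # 2. delete rows/columns k and l ...
        keep = [i for i in range(D.shape[0]) if i not in (k, l)]
        D = D[np.ix_(keep, keep)]
        new_row = new_row[keep]
        # ... and insert the row/column of the merged cluster at the end
        D = np.pad(D, ((0, 1), (0, 1)), constant_values=np.inf)
        D[-1, :-1] = new_row
        D[:-1, -1] = new_row
        clusters = [clusters[i] for i in keep] + [clusters[k] + clusters[l]]
    return clusters, merges

# Hypothetical 1-D points forming two well-separated groups.
pts = np.array([0.0, 1.0, 2.5, 10.0, 11.0, 12.5])
D = np.abs(pts[:, None] - pts[None, :])
clusters, merges = single_linkage(D, n_clusters=2)
```

The list `merges` records which groups were joined and at what distance, which is the information a dendrogram displays.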


How do we define the number of clusters ?

dendrogram

Partitional Clustering Algorithms

- they work with a Data matrix (i.e. using the patterns themselves)

- and have a stochastic character (e.g. the C-means algorithm)


Minimization (maximization) of an objective (cost) function

that expresses the separability (compactness) of the produced groups.

Prototypes are emerging naturally.

Fast execution.

Large data-sets can be handled.

The partition matrix U is used to tabulate the results. It is a $[C \times N]$ matrix, with each row devoted to one of the C produced clusters.

The indicator function $u_{ji}$ has the value 1 if $x_i$ belongs to the j-th cluster; otherwise it is set to 0.

crisp clustering vs fuzzy clustering

In the case of the C-means algorithm, the objective function is the total intra-cluster dispersion:

$$E = \sum_{j=1}^{C} \sum_{i=1}^{N} u_{ji} \, \|x_i - o_j\|^2, \qquad o_j = \frac{\sum_{i=1}^{N} u_{ji} \, x_i}{\sum_{i=1}^{N} u_{ji}}$$

In matrix notation, the above cost function reads:

$$E = \sum_{j=1}^{C} \frac{1}{2\,pop_j} \; u_j \cdot D \cdot u_j^T, \qquad pop_j = \sum_{i=1}^{N} u_{ji}$$

or $E = \mathrm{trace}(\tilde{U} D \tilde{U}^T)$, where the j-th row of $\tilde{U}$ is $u_j / \sqrt{2\,pop_j}$ (with the plain indicator rows of U, the trace would omit the $1/(2\,pop_j)$ weights).

Here D is the (squared) distance matrix, $u_j$ the j-th row of U, and $pop_j$ the population of the j-th cluster.
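The equivalence between the centroid form and the distance-matrix form of E can be checked numerically. The sketch assumes a crisp partition, squared Euclidean distances in D, and hypothetical labels; the trace form uses rows of U scaled by $1/\sqrt{2\,pop_j}$.

```python
import numpy as np

# Hypothetical data and crisp partition: N = 7 points, C = 3 clusters.
rng = np.random.default_rng(3)
X = rng.normal(size=(7, 2))
labels = np.array([0, 0, 1, 1, 1, 2, 2])
C, N = 3, X.shape[0]

# Partition matrix U (C x N): u_ji = 1 if x_i belongs to cluster j.
U = np.zeros((C, N))
U[labels, np.arange(N)] = 1.0

pop = U.sum(axis=1)                     # cluster populations pop_j
O = (U @ X) / pop[:, None]              # centroids o_j

# Centroid form of the cost.
E_centroid = sum(U[j, i] * np.sum((X[i] - O[j]) ** 2)
                 for j in range(C) for i in range(N))

# Distance-matrix form of the cost.
diff = X[:, None, :] - X[None, :, :]
D = np.sum(diff ** 2, axis=2)
E_matrix = sum((U[j] @ D @ U[j]) / (2.0 * pop[j]) for j in range(C))

# Trace form, with each row of U scaled by 1 / sqrt(2 pop_j).
U_scaled = U / np.sqrt(2.0 * pop)[:, None]
E_trace = np.trace(U_scaled @ D @ U_scaled.T)
```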

The C-means (or K-means) algorithm starts by partitioning the input points into C initial sets, either at random or using some heuristic.

It then calculates the mean point, or centroid, of each set.

It constructs a new partition by associating each point with the closest centroid.

Then the centroids are recalculated for the new clusters, and the algorithm proceeds by alternate application of these two steps until convergence.
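A compact sketch of the two alternating steps is given below. It is not a reference implementation: the function name `c_means` and its parameters are illustrative, and, as one heuristic among many, the initial centroids are C distinct data points drawn at random rather than the centroids of a random initial partition.

```python
import numpy as np

def c_means(X, C, n_iter=100, seed=0):
    """Plain C-means (K-means): alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: C distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=C, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: associate each point with the closest centroid.
        d2 = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        new_labels = np.argmin(d2, axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # partition unchanged: converged
        labels = new_labels
        # Update step: recompute the centroid of each cluster.
        for j in range(C):
            members = X[labels == j]
            if len(members) > 0:                   # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    E = float(np.sum((X - centroids[labels]) ** 2))  # total intra-cluster dispersion
    return labels, centroids, E

# Hypothetical 1-D toy data with two obvious groups.
X_toy = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
labels, centroids, E = c_means(X_toy, C=2)
```

Because the outcome generally depends on the random start, real uses typically repeat the run with several seeds and keep the partition with the smallest E.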

Some remarks on Partitional Clustering

1. Since these algorithms always result in grouped data, a critical issue is whether their use really contributes to the understanding of the true point distribution.

A way to justify this is to compare the measure E with the corresponding dispersion of the overall point set (i.e. the value of E for a single cluster).

2. To alleviate the problem of initialization & insufficient convergence, the iterative algorithms are usually applied a few times and the best partition matrix is kept as the final outcome.

3. Outlying points tend to obscure the convergence and the accuracy of the resulting partition. It is suggested that they be isolated from the beginning.

4. ‘‘How many clusters are there’’ in the point set?

A simple strategy for estimating the number of clusters C is to apply the algorithm for increasing values of C and, by plotting the corresponding values of E as a function of C, to decide on the critical number $C_0$.

Notice that E is by default a monotonically decreasing function of C, reaching its absolute minimum (E = 0) at C = N, i.e. each point in its own cluster.
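The strategy can be sketched by running C-means for increasing C, keeping the best of a few random restarts for each value, and inspecting the resulting curve E(C). Everything below is an assumption made for illustration: the helper `c_means_E`, the three-blob data set, the five restarts, and the range of C.

```python
import numpy as np

def c_means_E(X, C, seed):
    """One C-means run from a random start; returns the final cost E."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=C, replace=False)]
    for _ in range(100):
        d2 = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        labels = np.argmin(d2, axis=1)
        new_centroids = centroids.copy()
        for j in range(C):
            if np.any(labels == j):
                new_centroids[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                              # centroids stopped moving
        centroids = new_centroids
    return float(np.sum((X - centroids[labels]) ** 2))

# Hypothetical data: three Gaussian blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
X = np.vstack([c + 0.3 * rng.normal(size=(30, 2)) for c in centers])

# Best-of-5-restarts cost for C = 1..6; the 'knee' of this curve suggests C0.
E_curve = {C: min(c_means_E(X, C, seed) for seed in range(5))
           for C in range(1, 7)}
```

For C = 1 the cost equals the total scatter of the point set, which is the reference value mentioned in remark 1; plotting `E_curve` against C and looking for the point where the decrease flattens is the usual way to read off $C_0$.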

5. The objective function has been modified many times in the Pattern Recognition literature, e.g. so as to bias the creation of highly populated clusters, or to favor specific cluster-shapes.

Subtractive Clustering

Mountain-clustering for delineating cores in a multimodal point distribution

A simple loop:

1. Detection of the most significant mode &

2. Subtraction of the subset of points that come from that mode.

The technique of Potential Functions is used so as to construct a mountain, with height proportional to the local point density:

$$PD(x_i) = \frac{1}{N \, (2\pi)^{p/2} \, r_o^{\,p}} \sum_{j=1}^{N} \exp\!\left[ -\frac{\|x_i - x_j\|^2}{2\, r_o^2} \right]$$

Remarks:

1. A mapping $x_i \rightarrow PD(x_i)$

2. $r_o$: radius of influence

3. $PD(x_i)$ can be estimated using D-matrix elements
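Remark 3 can be made concrete: given the matrix of squared distances, the point density of every point follows from one element-wise exponential and a row-sum. The Gaussian normalization constant used below is the standard Parzen-window one and is an assumption about the slide's garbled formula; the function name `point_density` and the toy data are illustrative.

```python
import numpy as np

def point_density(D, r_o, p):
    """PD(x_i) for every point, from squared distances D and radius of influence r_o."""
    N = D.shape[0]
    norm = N * (2.0 * np.pi) ** (p / 2.0) * r_o ** p
    return np.exp(-D / (2.0 * r_o ** 2)).sum(axis=1) / norm

# Hypothetical 2-D point set: a tight group plus one isolated point (index 4).
X = np.array([[0.0, 0.0],
              [0.2, 0.0],
              [0.0, 0.2],
              [0.2, 0.2],
              [3.0, 3.0]])
diff = X[:, None, :] - X[None, :, :]
D = np.sum(diff ** 2, axis=2)

PD = point_density(D, r_o=0.5, p=2)
```

The isolated point receives the lowest density, the group members the highest; a smaller `r_o` sharpens the mountain, a larger one smooths it.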

Mode-detection: the point of the set that lies closest to the dominant mode is identified as the point $x_{max}$ of maximum local point density $PD(x_{max})$, with

$$PD(x_i) = \frac{1}{N \, (2\pi)^{p/2} \, r_o^{\,p}} \sum_{j=1}^{N} \exp\!\left[ -\frac{\|x_i - x_j\|^2}{2\, r_o^2} \right]$$

Mode-delineation: points in the vicinity of $x_{max}$ are collected.

Each point $x_i$ in the point-set is ordered according to its distance $d(x_i, x_{max})$, i.e. the closer to $x_{max}$ the point is, the lower its rank [i] will be.

A portion of the lower ranked points will be averaged:

$$x_{sel} = \frac{1}{j_0} \sum_{i=[1]}^{[j_0]} x_i$$

This subset is removed and the procedure is repeated from the detection step.
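The detection/delineation/subtraction loop can be sketched as follows. The slides do not say how $j_0$ or the number of passes is chosen, so `j0` and `n_modes` are free parameters of this hypothetical helper, and the density uses the Gaussian potential with the standard Parzen normalization.

```python
import numpy as np

def subtractive_clustering(X, r_o, j0, n_modes):
    """Extract n_modes cores: find the densest point, average its j0 nearest
    points (itself included), remove them, and repeat."""
    remaining = X.copy()
    p = X.shape[1]
    cores = []
    for _ in range(n_modes):
        if len(remaining) == 0:
            break
        N = len(remaining)
        diff = remaining[:, None, :] - remaining[None, :, :]
        D = np.sum(diff ** 2, axis=2)
        # Mode detection: point of maximum local point density.
        PD = np.exp(-D / (2.0 * r_o ** 2)).sum(axis=1)
        PD /= N * (2.0 * np.pi) ** (p / 2.0) * r_o ** p
        i_max = int(np.argmax(PD))
        # Mode delineation: rank all points by distance to x_max, keep the j0 closest.
        order = np.argsort(D[i_max])
        selected = order[:j0]
        cores.append(remaining[selected].mean(axis=0))   # x_sel
        # Subtraction: remove the selected subset before the next pass.
        remaining = np.delete(remaining, selected, axis=0)
    return np.array(cores)

# Hypothetical data: two blobs of 10 points around (0, 0) and (5, 5).
rng = np.random.default_rng(4)
X = np.vstack([0.2 * rng.normal(size=(10, 2)),
               5.0 + 0.2 * rng.normal(size=(10, 2))])
cores = subtractive_clustering(X, r_o=0.5, j0=10, n_modes=2)
```

With well-separated blobs each pass recovers one blob centre; on real data the result depends strongly on `r_o` and `j0`, which is why the role of the radius of influence deserves separate attention.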


The role of ro

PD(xmax)

Classical References

A. K. Jain, R. C. Dubes, ‘‘Algorithms for Clustering Data’’, Prentice-Hall, 1988

L. Kaufman, P. J. Rousseeuw, ‘‘Finding Groups in Data: An Introduction to Cluster Analysis’’, Wiley Series in Probability and Statistics, 1990

Related issues

Model-validation: external vs. internal validation

Comparing two different clustering outputs (MI, Hubert-test, etc.)

Comparing a clustering output with a given classification

Automating selection of cluster numbers (Gap-statistic, BIC, MDL, MDE, etc.)

It's an Ever Expanding field

# 33 issue

Clustering Ensembles

Clustering Dynamics

1. Raw-data. 2. Feature-space. 3. Models

Kernel-based Clustering

Randomization

Class-projects

Marc, Yuko, Naru, Araik, Arman, Vahe, Armen, Aska, Yuka, Mihai, Horhe …. & Groupies

‘’The Group that groups’’

‘’Okinawa-blues’’

Nonlinear Dimensionality Reduction & Data-summarization

Why?

The curse of dimensionality

Human-machine interactions

‘‘Less is More’’

Multidimensional Scaling (MDS): motivations

- sometimes only proximity data are available (e.g. data from psychophysics / behavioral experiments)

- to take advantage of the ‘‘human gift for pattern-recognition tasks’’, like determining modes in a point distribution and recognizing trends in the data, when these are presented in the form of point-diagrams

MDS – definition

Any procedure that, given a dissimilarity matrix corresponding to a set of patterns, configures points in a low-dimensional space (usually 2-D) as images of the patterns, in such a way that the interpoint distances approximate the original pairwise dissimilarities as closely as possible.

MDS results in a 2-D “projection” of the objects, where neighboring relationships / clustering trends are prominent.

MDS – categories

metric vs. nonmetric MDS

Metric MDS is carried out via eigenvector analysis and has an analytical (closed-form) solution.

Nonmetric MDS algorithms are iterative in nature and computationally demanding, but usually (slightly to moderately) superior to the metric ones.

Metric-MDS [Torgerson, 1952]

Given a Distance-Matrix D[NxN] for a set of N objects (patterns?)

Negation: $A_{ij} = -\tfrac{1}{2} D_{ij}^2$ (element-wise), giving A[NxN]

Centering: $B_{ij} = A_{ij} - A_{i\cdot} - A_{\cdot j} + A_{\cdot\cdot}$, where $A_{i\cdot}$ and $A_{\cdot j}$ are the i-th row mean and the j-th column mean of A, and $A_{\cdot\cdot}$ is its overall mean

EigenAnalysis of B[NxN]: the first r characteristic roots (eigenvalues) $\lambda_1, \lambda_2, \dots, \lambda_r$ and the associated vectors $v_1$ [Nx1], $v_2, \dots, v_r$ are computed

Normalization of $v_i$ so that $v_i^T v_i = \lambda_i$ (the i-th characteristic root), and gathering in a [N x r] matrix: V[Nxr] = [ v1 v2 ….. vr ]

Output: the i-th row of this matrix contains the coordinates of the i-th point in the new r-dimensional space (r = 1, 2 or 3):

$$ X^{data}_{[N \times r]} = V_{[N \times r]} =
\begin{pmatrix}
\chi_{11} & \chi_{12} & \dots & \chi_{1r} \\
\chi_{21} & \chi_{22} & \dots & \chi_{2r} \\
\vdots & \vdots & & \vdots \\
\chi_{N1} & \chi_{N2} & \dots & \chi_{Nr}
\end{pmatrix}
=
\begin{pmatrix}
\chi_1 \\ \chi_2 \\ \vdots \\ \chi_N
\end{pmatrix} $$
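The metric-MDS steps (negation, centering, eigen-analysis, normalization) can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation: it assumes the squared form of the negation step, $A_{ij} = -\tfrac{1}{2} D_{ij}^2$, under which the recovered interpoint distances match D for Euclidean input, and the function name `classical_mds` is hypothetical.

```python
import numpy as np

def classical_mds(D, r=2):
    """Embed N objects in r dimensions from an [N x N] distance matrix D."""
    D = np.asarray(D, dtype=float)
    # Negation (assumed squared form): A_ij = -1/2 * D_ij^2
    A = -0.5 * D ** 2
    # Centering: B_ij = A_ij - A_i. - A_.j + A_..
    row_mean = A.mean(axis=1, keepdims=True)
    col_mean = A.mean(axis=0, keepdims=True)
    B = A - row_mean - col_mean + A.mean()
    # Eigen-analysis of the symmetric matrix B (eigh returns ascending roots)
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:r]
    lam = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative roots
    V = eigvecs[:, order]
    # Normalization: rescale each unit eigenvector so that v_i^T v_i = lambda_i
    return V * np.sqrt(lam)

# Four corners of a 3 x 4 rectangle; only their distances are given to MDS
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
X = classical_mds(D, r=2)
D_hat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
```

The recovered configuration `X` is only defined up to rotation, reflection and translation, so it is the distance matrix `D_hat`, not the coordinates, that should be compared with the input.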

MDS-quality

The normalized total discrepancy as a measure of mapping credibility:

$$ Stress = \frac{\sum_{i<j} \left| D_{ij} - \Delta_{ij} \right|}{\sum_{i<j} D_{ij}} $$

where Δ is the matrix of interpoint distances $\Delta_{ij} = \lVert \chi_i - \chi_j \rVert_2$ in the new space.


Note

- Possible outliers in the set tend to “dominate” the projection.

- A refined image can be obtained after their isolation and removal.

MDS - classical example

With standard psychophysical experimental procedures, the perceptual similarity (PS) between 14 selected colors was estimated and tabulated in a [14 x 14] matrix.

The 14 entries correspond to 14 different ‘hues’ with wavelengths (in nm):

Wavelength = [434, 445, 465, 472, 490, 504, 537, 555, 584, 600, 610, 628, 651, 674]

(e.g. bluish hue: 472, reddish hue: 674)

A point diagram was produced by applying the MDS algorithm to the distance matrix with entries d(i,j) = 1 - PS(i,j), i,j = 1:14.


The ‘homeomorphism’ of this plot with the well-known color-disk shown on the right is remarkable

Related issues

Sammon mapping

Procrustes Analysis

Correspondence-problem

Treating Graphs

MST-planing

Projection Pursuit

Data-Manifold Learning

Manifolds

• What is a ‘‘Manifold’’ ?

OXFORD Dictionary: n (techn) a pipe or an enclosed space with several openings that connects with other parts, e.g. for taking gases into or out of cylinders in a car engine: the exhaust/inlet manifold

Bernhard Riemann (1826-1866)

The name manifold comes from Riemann's original German term, Mannigfaltigkeit, which W. Clifford translated as "manifoldness".

In his Göttingen inaugural lecture, Riemann described the set of all possible values of a variable with certain constraints as a Mannigfaltigkeit.

(light) Mathematical Definition

A manifold is a space which, in a close-up view, resembles the spaces described by Euclidean geometry, but which may have a more complicated structure when viewed as a whole.

Manifold Examples

‘‘Swiss Roll’’

Our own Manifold …….

The Manifolds way of perception !

Human Cognition: ‘‘The Manifold Ways of Perception’’

H. Seung and D. Lee, Science, 22 Dec 2000, Vol. 290(5500), pp. 2268-2269

Recently (Science, vol. 290, Dec. 2000), the interest in manifolds has been renewed and extended well beyond the mathematicians’ community:

(1) Tenenbaum et al., ‘‘A global geometric framework for nonlinear dimensionality reduction’’

(2) Roweis & Saul, ‘‘Nonlinear Dimensionality Reduction by Locally Linear Embedding’’

‘‘Manifold learning’’

Nowadays, Manifold-Learning has become an individual scientific branch.

A well-informed Web-site is: http://www.cse.msu.edu/~lawhiu/manifold/

In a nutshell

A manifold is ‘a constrained (multidimensional) surface’.

This implies the existence of an ambient (vector) space in which the available data lie in a restricted way.

The famous Swiss-Roll

What is a DATA-Manifold & what is there to learn about it?

When the available data are multivariate observations from a high-dimensional space, the high dimensionality usually obscures the useful information and constitutes one of the major components of the ‘curse of dimensionality’.

for Data-Analysts: ‘‘Less is Better’’

efficient techniques for handling high-dimensional data

methods for data-abstraction and summarization

Visualization schemes are highly popular, since the user can immediately gain some insight into the data through low-dimensional plots and graphs.


Do we really need to learn the DATA-Manifold ?

YES !!! ……. for Moonwalking !


Radial-Ordering

Radial-Ordering

Ranking on Manifold

Results on a subset of the USPS data set [Zhou et al., 2004]. The top left-hand image is the query; the 99 top-ranked images are shown.

Manifolds are everywhere ……

Minimal Spanning Tree (MST)

A Graph-Theoretic tool to parameterize Data-Manifolds

Graph-Theoretic terminology

A graph is a set of nodes and a set of node pairs called edges.

An edge-weighted graph is a graph with a real number, called weight, assigned to each edge.

A connected graph has a path between any two distinct nodes.

A Spanning Tree is a connected graph, without loops, that includes all the nodes.

The MST is the spanning tree of minimum total weight.

How is Graph-Theory applied in feature space?

A node is dedicated to each data-point, and the corresponding pairwise distances (generalised dissimilarities) are assigned as weights to the formed edges.

The MST is the connected graph, formed from a collection of exactly (N-1) edges, that has minimum total length.
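A small Python sketch shows how such a tree can be built over a point sample. The slides do not name a construction algorithm; Prim's algorithm is used here purely as one standard choice, with Euclidean distance standing in for the generalised dissimilarity, and the function name `minimal_spanning_tree` is hypothetical.

```python
import math

def minimal_spanning_tree(points):
    """Prim's algorithm on the complete graph over the points.

    Edge weights are Euclidean distances (an assumption of this sketch).
    Returns a list of (i, j, weight) edges; there are N-1 of them.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known link from each node to the tree
    parent = [-1] * n
    best[0] = 0.0           # start growing the tree from node 0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree with the cheapest link to it
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Update the cheapest links of the nodes still outside the tree
        for v in range(n):
            if not in_tree[v]:
                w = math.dist(points[u], points[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges

pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (4.0, 1.0)]
mst = minimal_spanning_tree(pts)
total_length = sum(w for _, _, w in mst)
```

For this toy sample the tree follows the chain of consecutive points, which is the kind of "skeleton" the MST provides for a data manifold.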


A realistic example

What is magic about the MST?

It contains the NN-graph.

It can be used for ranking in RP (i.e. MST-ordering).

It can be used for visualizing the skeleton of pattern variation.

ISOMAP

A hybrid tool for visualizing Data-Manifolds

ISOMAP = Graph theory in feature space + Multidimensional Scaling

ISOMAP algorithm

Isomap comprises simple algorithmic steps that transform the original distance matrix D to GD, which contains the geodesic interpoint distances.

1. The nearest-neighbors graph over the given point sample is constructed.

2. The geodesic interpoint distances are computed as the shortest paths (on this graph) between each pair of points.

3. MDS is then applied: Y = MDS( GDε )
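Steps 1 and 2 can be sketched in plain Python as follows. This is an illustrative sketch under assumptions: the neighborhood is defined by a fixed number k of nearest neighbors (the slides do not fix the neighborhood rule), the graph is symmetrized, shortest paths are found with the Floyd-Warshall algorithm, and the name `geodesic_distances` is hypothetical. The resulting matrix is what would be handed to an MDS routine in step 3.

```python
import math

def geodesic_distances(points, k):
    """Approximate geodesic distances via shortest paths on a k-NN graph."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # Step 1: nearest-neighbors graph (edges kept in both directions)
    g = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: d[i][j])[:k]
        for j in nearest:
            g[i][j] = g[j][i] = d[i][j]
    # Step 2: all-pairs shortest paths (Floyd-Warshall, O(N^3))
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g

# Points sampled along an L-shaped curve
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
GD = geodesic_distances(pts, k=2)
straight = math.dist(pts[0], pts[4])   # Euclidean shortcut across the bend
along = GD[0][4]                       # distance measured along the curve
```

The two end points are about 2.83 apart in the ambient space but 4 apart along the curve, which is the difference that lets MDS on GD unfold a bent manifold. The cubic cost of the shortest-path step is also the practical bottleneck for large samples.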


It is an efficient graph-flattening technique and can learn a broad class of nonlinear manifolds.

MDS

ISOMAP

A very interesting example with many potential applications in computer vision (e.g. morphing)

ISOMAP restrictions

While Isomap is a very competent procedure for learning nonlinear manifolds, it is restricted by the computational demands of the geodesic-distance estimations.

The handling of more than a few thousand multidimensional points (i.e. patterns) becomes problematic.

A possible solution: the marriage of Isomap with unsupervised learning techniques (e.g. Kohonen Maps).

As a preprocessing step, efficient techniques can first be applied to reorganize the ensemble of patterns into data-chunks, which are then summarized via prototypes that are, in turn, fed to the ISOMAP routine.

Vector Quantization (VQ) based on Neural-Gas Network

VQ encodes the data manifold in the ambient (high-d) space by utilizing only a finite set of reference vectors, the code vectors.

It actually performs a parcellation of the ambient space known as Voronoi Tessellation.

A Voronoi-region is defined around each code vector: this is a section of the original space comprising all the points closer to a specific code vector than to any other.
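The Voronoi assignment can be illustrated with a short Python sketch: each point is labelled with the index of its nearest code vector, and the mean squared distance to that code vector measures how well the codebook encodes the data. The codebook here is simply given by hand (how it is designed is a separate question), Euclidean distance is assumed, and the names `voronoi_assign` and `quantization_error` are hypothetical.

```python
import math

def voronoi_assign(points, codebook):
    """Index of the nearest code vector (i.e. the Voronoi region) per point."""
    return [min(range(len(codebook)), key=lambda c: math.dist(p, codebook[c]))
            for p in points]

def quantization_error(points, codebook):
    """Mean squared distance between each point and its code vector."""
    labels = voronoi_assign(points, codebook)
    return sum(math.dist(p, codebook[c]) ** 2
               for p, c in zip(points, labels)) / len(points)

codebook = [(0.0, 0.0), (10.0, 10.0)]
pts = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0), (10.0, 8.0)]
labels = voronoi_assign(pts, codebook)
error = quantization_error(pts, codebook)
```

Replacing every point by its code vector is what turns thousands of patterns into a handful of prototypes that a routine such as ISOMAP can handle.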


A realistic example

The ‘Neural Gas’ algorithm

The codebook design is the most critical part in VQ. For this step the “neural-gas” algorithm is employed.

Neural Gas is an artificial neural network model, which converges efficiently to a small, user-defined number C<N of codebook vectors.

It is an extension of Kohonen’s self-organizing maps that shares some characteristics with Fuzzy C-means. Its name stems from the physics of the underlying optimization scheme.


Learning Dynamic-Manifolds from continuous stream data


‘Neural-Gas’ based dynamic prediction

Related issues

Laplacian Eigenmaps, LLE, etc.

Ranking on Manifolds

Semisupervised Learning

Class-projects

Thank U

The End