Linear and Non-Linear Dimensionality Reduction
Alexander Schulz
aschulz(at)techfak.uni-bielefeld.de
University of Pisa, Pisa
04.05.2015 and 07.05.2015
Overview
Dimensionality Reduction
I Motivation
I Linear Projections
I Linear Mappings in Feature Space
I Neighbor Embedding
I Manifold Learning

Advances in Dimensionality Reduction
I Speedup for Neighbor Embeddings
I Quality Assessment of DR
I Feature Relevance for DR
I Visualization of Classifiers
I Supervised Dimensionality Reduction
Curse of Dimensionality1

I high-dimensional spaces are almost empty
I hypervolume concentrates in a thin shell close to the surface

1[Lee and Verleysen, 2007]
Why use Dimensionality Reduction?
(raw pixel intensities of a handwritten digit: even a single small image is a vector with hundreds of dimensions)
Principal Component Analysis (PCA)

variance maximization

I var(w>xi) with ‖w‖ = 1 (for centered data xi)
I = 1/N ∑i (w>xi)²
I = 1/N ∑i w>xi xi>w
I = w>Cw
I → eigenvectors of the covariance matrix C are optimal
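The derivation above maps directly to code. A minimal sketch of PCA via the eigendecomposition of the covariance matrix (data and names here are illustrative, not from the slides):

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto the top principal components (variance maximization)."""
    Xc = X - X.mean(axis=0)              # center the data
    C = Xc.T @ Xc / len(Xc)              # covariance matrix C
    vals, vecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    W = vecs[:, ::-1][:, :n_components]  # top eigenvectors w
    return Xc @ W                        # projections w>xi

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
Y = pca(X)
print(Y.shape)  # (100, 2)
```

The first projected coordinate carries at least as much variance as the second, reflecting the eigenvalue ordering.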
PCA in Action

(figure omitted: a 3D toy data set and its 2D PCA projection; only axis ticks survived extraction)2

2[Gisbrecht and Hammer, 2015]
Distance Preservation (CMDS)

I distances δij = ‖xi − xj‖², dij = ‖yi − yj‖²
I objective δij ≈ dij
I distances dij and similarities sij can be transformed into each other
I S = UΛU> matrix of pairwise similarities
I best low-rank approximation of S in the Frobenius norm keeps the eigenvectors with the largest eigenvalues
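The distance-to-similarity transformation is double centering; the top eigenvectors of the resulting similarity matrix give the embedding. A minimal classical MDS sketch (function and variable names are my own):

```python
import numpy as np

def classical_mds(D2, n_components=2):
    """Embed points given squared pairwise distances D2 (classical MDS)."""
    n = len(D2)
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    S = -0.5 * J @ D2 @ J                        # distances -> similarities
    vals, vecs = np.linalg.eigh(S)               # S = U Lambda U>
    idx = np.argsort(vals)[::-1][:n_components]  # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# distances computed from 2D points should be reproduced exactly
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D2 = ((P[:, None] - P[None]) ** 2).sum(-1)
Y = classical_mds(D2)
D2y = ((Y[:, None] - Y[None]) ** 2).sum(-1)
print(np.allclose(D2, D2y))  # True
```

For truly 2D data the objective δij ≈ dij is met exactly; for higher-dimensional data the discarded eigenvalues measure the distortion.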
Manifold Learning

learn a linear manifold

I represent data as projections on an unknown w:
C = 1/(2N) ∑i (xi − yiw)²
I What are the best parameters yi and w?
PCA
three ways to obtain PCA
I maximize variance of a linear projection
I preserve distances
I find a linear manifold such that errors are minimal in an L2 sense
Kernel PCA3

Idea

I Apply a fixed nonlinear preprocessing φ(x)
I Perform standard PCA in feature space
I How to apply the kernel trick here?

3[Schölkopf et al., 1998]
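The kernel trick replaces the covariance eigendecomposition by one of the centered kernel matrix, so φ(x) never has to be computed explicitly. A sketch with an RBF kernel (kernel choice and parameter values are illustrative):

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=2.0):
    """Kernel PCA: standard PCA in feature space via the kernel trick."""
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                      # RBF kernel matrix k(xi, xj)
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                               # center in feature space
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    # projections are eigenvectors scaled by sqrt of their eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
Y = kernel_pca(X)
```

Because the eigenvectors of the kernel matrix are orthonormal, the resulting embedding coordinates are uncorrelated, just as in linear PCA.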
Kernel PCA in Action

(figure omitted: a 3D toy data set and its 2D kernel PCA embedding; only axis ticks survived extraction)4

4[Gisbrecht and Hammer, 2015]
Stochastic Neighbor Embedding (SNE)5

I introduce a probabilistic neighborhood in the input space

pj|i = exp(−‖xi − xj‖²/(2σi²)) / ∑k≠i exp(−‖xi − xk‖²/(2σi²))

I and in the output space

qj|i = exp(−‖yi − yj‖²/2) / ∑k≠i exp(−‖yi − yk‖²/2)

I optimize the sum of Kullback-Leibler divergences

C = ∑i KL(Pi ,Qi) = ∑i ∑j≠i pj|i log(pj|i / qj|i)

5[Hinton and Roweis, 2002]
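The two neighborhood distributions and the cost translate into a few lines. A sketch using one global σ for simplicity (real SNE tunes σi per point to a target perplexity; names here are my own):

```python
import numpy as np

def cond_probs(Z, sigma=1.0):
    """p_{j|i}: Gaussian neighborhood probabilities, row-normalized."""
    sq = ((Z[:, None] - Z[None]) ** 2).sum(-1)
    P = np.exp(-0.5 * sq / sigma ** 2)
    np.fill_diagonal(P, 0.0)                 # k != i in the sums
    return P / P.sum(axis=1, keepdims=True)

def sne_cost(X, Y, sigma=1.0):
    """C = sum_i KL(P_i, Q_i)."""
    P = cond_probs(X, sigma)
    Q = cond_probs(Y)                        # sigma fixed to 1 in output space
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
print(sne_cost(X, X))  # 0.0: identical neighborhoods incur no cost
```

Gradient descent on C with respect to the yi gives the actual embedding; the cost alone already shows the asymmetry of the KL divergence that NeRV later exploits.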
Neighbor Retrieval Visualizer (NeRV)6

I precision(i) = NTP,i/ki = 1 − NFP,i/ki,  recall(i) = NTP,i/ri = 1 − NMISS,i/ri

6[Venna et al., 2010]
Neighbor Retrieval Visualizer (NeRV)

I KL(Pi ,Qi) generalizes recall
I KL(Qi ,Pi) generalizes precision
I NeRV optimizes

C = λ ∑i KL(Pi ,Qi) + (1 − λ) ∑i KL(Qi ,Pi)
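Given precomputed neighborhood matrices P and Q, the NeRV tradeoff is a one-liner per term. A sketch (the ε smoothing and names are my own; λ = 1 recovers the SNE cost):

```python
import numpy as np

def nerv_cost(P, Q, lam=0.5, eps=1e-12):
    """lam * sum KL(P_i,Q_i) + (1-lam) * sum KL(Q_i,P_i)."""
    kl_pq = np.sum(P * np.log((P + eps) / (Q + eps)))  # penalizes misses (recall)
    kl_qp = np.sum(Q * np.log((Q + eps) / (P + eps)))  # penalizes false neighbors (precision)
    return lam * kl_pq + (1 - lam) * kl_qp

# identical neighborhood distributions incur zero cost for any lambda
identical = np.full((4, 4), 0.25)
print(nerv_cost(identical, identical))  # 0.0
```

Varying λ traces out the precision-recall tradeoff: small λ punishes false neighbors in the embedding, large λ punishes missed ones.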
t-Distributed SNE (t-SNE)7

I symmetrized probabilities p and q
I uses a Student-t distribution in the output space

7[van der Maaten and Hinton, 2008]
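The output-space change is small in code: the Gaussian kernel is replaced by a Student-t kernel with one degree of freedom, and the probabilities become joint rather than conditional. A sketch:

```python
import numpy as np

def tsne_q(Y):
    """Symmetrized output similarities with a Student-t (1 df) kernel."""
    sq = ((Y[:, None] - Y[None]) ** 2).sum(-1)
    num = 1.0 / (1.0 + sq)        # heavy tails counteract the crowding problem
    np.fill_diagonal(num, 0.0)
    return num / num.sum()        # joint (not conditional) probabilities

rng = np.random.default_rng(0)
Q = tsne_q(rng.normal(size=(20, 2)))
print(round(Q.sum(), 6))  # 1.0
```

The heavy tails let moderately distant points in the embedding model large input-space distances, which is what spreads clusters apart in t-SNE plots.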
Neighbor Embedding in Action

(figure omitted: a 3D toy data set and its 2D neighbor embedding; only axis ticks survived extraction)8

8[Gisbrecht and Hammer, 2015]
Maximum Variance Unfolding (MVU)9

I goal: 'unfold' a given manifold while keeping all the local distances and angles fixed
I maximize ∑ij ‖yi − yj‖² s.t. ∑i yi = 0 and ‖yi − yj‖² = ‖xi − xj‖² for all neighbors

9[Weinberger and Saul, 2006]
Manifold Learner in Action

(figure omitted: a 3D toy data set and its 2D MVU embedding; only axis ticks survived extraction)10

10[Gisbrecht and Hammer, 2015]
Overview
Dimensionality Reduction
I Motivation
I Linear Projections
I Linear Mappings in Feature Space
I Neighbor Embedding
I Manifold Learning

Advances in Dimensionality Reduction
I Speedup for Neighbor Embeddings
I Quality Assessment of DR
I Feature Relevance for DR
I Visualization of Classifiers
I Supervised Dimensionality Reduction
Barnes Hut Approach11

Complexity of NE

I Neighbor Embeddings have complexity O(N²)
I matrices P and Q are quadratic in N
I quadratic summation for the gradient

∂C/∂yi = ∑j≠i gij(yi − yj)

11[Yang et al., 2013, van der Maaten, 2013]
(figure omitted)12

12[van der Maaten, 2013]
Barnes Hut Approach

Barnes Hut

I approximate the gradient ∂C/∂yi = ∑j≠i gij(yi − yj) as

∑j≠i gij(yi − yj) ≈ ∑t |Git| · git(yi − yit)

where Git is the t-th cell of a space-partitioning tree and yit its center
I approximate P as a sparse matrix
I results in an O(N log N) algorithm
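The core idea, groups of distant points replaced by their cell center weighted by cell size, can be illustrated without a full quadtree. A sketch using a crude 1D-sorted partition as a stand-in for the tree, with a t-SNE-like weight g (all names are my own):

```python
import numpy as np

def g(d2):
    return 1.0 / (1.0 + d2)          # example pairwise weight g_ij

def grad_exact(Y, i):
    """Exact sum_j g_ij (y_i - y_j): O(N) per point, O(N^2) overall."""
    d = Y[i] - Y
    w = g((d ** 2).sum(-1))
    w[i] = 0.0
    return (w[:, None] * d).sum(0)

def grad_cells(Y, i, n_cells=10):
    """Cell-based approximation: each cell contributes |cell| * g * (y_i - center)."""
    order = np.argsort(Y[:, 0])      # crude partition; a real quadtree adapts to density
    total = np.zeros(Y.shape[1])
    for cell in np.array_split(order, n_cells):
        cell = cell[cell != i]
        if cell.size == 0:
            continue
        d = Y[i] - Y[cell].mean(0)   # cell center stands in for all its members
        total += cell.size * g((d ** 2).sum()) * d
    return total

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 2))
print(grad_exact(Y, 0).shape, grad_cells(Y, 0).shape)
```

A real Barnes-Hut tree only merges cells that are far from yi relative to their size, which is what controls the approximation error.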
Barnes Hut SNE

(figure omitted)13

13[van der Maaten, 2013]
Kernel t-SNE14

I O(N) algorithm: apply NE to a fixed subset, map the remainder with an out-of-sample projection
I how to obtain an out-of-sample extension?
I use a kernel mapping

x 7→ y(x) = ∑j αj · k(x,xj) / ∑l k(x,xl) = A k

I minimization of ∑i ‖yi − y(xi)‖² yields A = Y · K−1

14[Gisbrecht et al., 2015]
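The mapping is linear in the normalized kernel values, so fitting A is a linear solve. A sketch with an RBF kernel and a random stand-in for the landmark embedding (in practice Y would come from t-SNE on the subset; names are my own, and A is obtained by solving K A = Y in this row convention):

```python
import numpy as np

def norm_kernel(X, Z, gamma=0.5):
    """Rows k(x, x_j) / sum_l k(x, x_l) of an RBF kernel."""
    sq = ((X[:, None] - Z[None]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))   # landmark points, embedded by (say) t-SNE
Y = rng.normal(size=(40, 2))   # their 2D coordinates (stand-in embedding)

K = norm_kernel(X, X)
A = np.linalg.solve(K, Y)      # minimizer of sum_i ||y_i - y(x_i)||^2

def map_out_of_sample(Xnew):
    """y(x) = A k for arbitrary new points: O(1) per point in N."""
    return norm_kernel(Xnew, X) @ A

print(np.allclose(map_out_of_sample(X), Y))  # landmarks map back exactly
```

New points never touch the expensive NE optimization, which is where the overall O(N) cost comes from.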
Kernel t-SNE

(figure omitted)15

15[Gisbrecht et al., 2015]
Quality Assessment of DR16

I most popular measure

Qk(X ,Y ) = ∑i |Nk(~xi) ∩ Nk(~yi)| / (Nk)

I rescaling added recently
I recently used to compare many DR techniques

16[Lee et al., 2013]
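Qk is simply the average overlap of k-nearest-neighbor sets before and after projection. A direct sketch (O(N²) distances, fine for small N; names are my own):

```python
import numpy as np

def knn(Z, k):
    """Indices of the k nearest neighbors of each point."""
    D = ((Z[:, None] - Z[None]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    return np.argsort(D, axis=1)[:, :k]

def quality(X, Y, k):
    """Q_k: average overlap of k-neighborhoods in input and output space."""
    NX, NY = knn(X, k), knn(Y, k)
    shared = sum(len(set(a) & set(b)) for a, b in zip(NX, NY))
    return shared / (len(X) * k)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
print(quality(X, X[:, :2], 5))  # in [0, 1]; 1 means perfect neighborhood preservation
```

Sweeping k from small to large distinguishes local from global structure preservation, which is how such curves are used to compare DR methods.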
Quality Assessment of DR

(figure omitted)17

17[Peluffo-Ordóñez et al., 2014]
Feature Relevance for DR
I Which features are important for a given projection?
data set: ESANN participants

Partic. | University | ESANN paper | #publications | . . . | likes beer | . . .
1       | A          | 1           | 15            | . . . | 1          | . . .
2       | A          | 0           | 8             | . . . | -1         | . . .
3       | B          | 1           | 22            | . . . | -1         | . . .
4       | C          | 1           | 9             | . . . | 0          | . . .
5       | C          | 0           | 15            | . . . | -1         | . . .
6       | D          | . . .       |               |       |            |
Visualization of ESANN participants

(figure omitted: 2D projection of the participants, axes Dim 1 / Dim 2)
Relevance of features for the projection

(figure omitted: bar chart of relevance values for features 1-10)
Visualization of ESANN participants

(figure omitted: the same 2D projection, colored by the feature 'likes beer' with values 1, 0, -1)
Aim

I estimate the relevance of single features for nonlinear dimensionality reductions

Idea

I change the influence of a single feature and observe the change in the quality
Evaluation of Feature Relevance for DR19

NeRV cost function18 QkNeRV

I interpretation from an information retrieval perspective
I d(~xi , ~xj)² = ∑l (xil − xjl)² becomes ∑l λl² (xil − xjl)²
I λkNeRV(l) := λl² where λ optimizes QkNeRV(Xλ,Y) + δ ∑l λl²

18[Venna et al., 2010]
19[Schulz et al., 2014a]
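The weighted metric amounts to rescaling each feature column by λl before computing distances; λl = 0 removes feature l entirely. A sketch of that building block (the actual method optimizes λ against the NeRV quality with the δ-penalty above; names here are my own):

```python
import numpy as np

def weighted_knn(X, lam, k):
    """k-neighborhoods under d(xi,xj)^2 = sum_l lam_l^2 (xil - xjl)^2."""
    Xw = X * lam                           # column scaling realizes the metric
    D = ((Xw[:, None] - Xw[None]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)
    return np.argsort(D, axis=1)[:, :k]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
full = weighted_knn(X, np.ones(3), 5)                     # all features active
no_last = weighted_knn(X, np.array([1.0, 1.0, 0.0]), 5)   # feature 3 removed
```

Comparing neighborhood quality with and without a feature (or under an optimized λ) is what yields the relevance profiles shown on the following slides.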
Toy data set 1

(figures omitted: 3D scatter plot of classes C1-C3 over Dim 1-3; its 2D embedding; and relevance curves λNeRV over the neighborhood size for Dim 1-3)
Toy data set 2

(figures omitted: 3D scatter plot of classes C1-C3 over Dim 1-3, and two different 2D embeddings of the data)
Relevances for different projections

(figures omitted: the two 2D embeddings of toy data set 2 with their relevance curves λNeRV over the neighborhood size for Dim 1-3; one projection is annotated with 98.7%)
Why visualize Classifiers?
Where is the Difficulty?
Class borders are

I often nonlinear
I often not given in an explicit functional form (e.g. SVM)
I high-dimensional, which makes it infeasible to sample them for a projection
An illustration of the approach

(figures omitted: 3D scatter plots of class 1 and class 2 over dim1-dim3)

I project data to 2D
An illustration: dimensionality reduction

(figures omitted: the 2D projection with classes and SVs, and the 2D grid samples lifted back into 3D)

I sample the 2D data space
I project the samples up
I classify them
An illustration: border visualization

I color intensity codes the certainty of the classifier
Classifier visualization20

I project data to 2D
I sample the 2D data space
I project the samples up
I classify them

(figures omitted: the stages of the pipeline in 2D and 3D)

20[Schulz et al., 2014b]
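The four steps can be sketched end to end. This is a simplified stand-in, not the paper's method: PCA plays the role of the (invertible) projection and a nearest-centroid rule plays the role of the trained classifier (the paper uses discriminative NE projections and e.g. an SVM); all data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
labels = (X[:, 0] > 0).astype(int)            # toy two-class problem

# 1. project data to 2D (PCA as an invertible stand-in for the DR step)
mu = X.mean(0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
W = Vt[:2].T                                  # 3D -> 2D projection matrix
Y = (X - mu) @ W

# 2. sample the 2D embedding space on a regular grid
g = np.linspace(Y.min(), Y.max(), 20)
grid = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)

# 3. project the samples up (inverse of the linear map)
lifted = grid @ W.T + mu

# 4. classify the lifted samples (nearest centroid as a stand-in classifier)
c0, c1 = X[labels == 0].mean(0), X[labels == 1].mean(0)
pred = (np.linalg.norm(lifted - c1, axis=1)
        < np.linalg.norm(lifted - c0, axis=1)).astype(int)
# coloring the grid by pred (and by classifier certainty) draws the border in 2D
```

The crucial ingredient in the nonlinear case is a usable "project up" step, which is exactly what makes the problem hard for NE projections.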
Toy Data Set 1

(figure omitted: 3D plot of classes C1 and C2 with the resulting border visualization)

Toy Data Set 2

(figure omitted: 3D plot of classes C1 and C2 with the resulting border visualization)

Toy Data Set 2 with NE

(figure omitted: the same data visualized with a neighbor embedding)
DR: intrinsically 3D data

(figure omitted: classes C1 and C2)

DR: NE projection

(figure omitted)
Supervised dimensionality reduction

I Use the Fisher metric21 d(x,x + dx) = dx>J(x)dx
I J(x) = Ep(c|x){(∂/∂x log p(c|x))(∂/∂x log p(c|x))>}

21[Peltonen et al., 2004, Gisbrecht et al., 2015]
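For a two-class logistic model the expectation defining J(x) has a closed form, which makes the construction concrete. A sketch (the logistic model and all names are my own illustration): with p = p(c=1|x), the gradients of log p(c|x) are (1−p)w and −pw, so J(x) = p(1−p)²ww> + (1−p)p²ww> = p(1−p)ww>.

```python
import numpy as np

def fisher_metric(x, w, b=0.0):
    """J(x) for a two-class logistic model p(c=1|x) = sigmoid(w.x + b)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    # E over c of the outer product of grad log p(c|x) collapses to:
    return p * (1 - p) * np.outer(w, w)

J = fisher_metric(np.array([0.3, -0.1]), np.array([2.0, 1.0]))
dx = np.array([0.1, 0.2])
print(dx @ J @ dx >= 0)  # True: J is positive semi-definite
```

Directions along which the class posterior changes are stretched, directions irrelevant for the classes are shrunk, which is what injects the discriminative information into the embedding.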
DR: supervised NE projection

(figure omitted)

prototypes in original space

(figure omitted: prototypes for classes C1 and C2)
Summary

I different objectives of dimensionality reduction
I new approach to get insight into trained classification models
I discriminative information can yield major improvements
Thank You For Your Attention!
Alexander Schulz
aschulz (at) techfak.uni-bielefeld.de
Literature I
Gisbrecht, A. and Hammer, B. (2015). Data visualization by nonlinear dimensionality reduction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(2):51-73.

Gisbrecht, A., Schulz, A., and Hammer, B. (2015). Parametric nonlinear dimensionality reduction using kernel t-SNE. Neurocomputing, 147:71-82.

Hinton, G. and Roweis, S. (2002). Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15, pages 833-840. MIT Press.

Lee, J. A., Renard, E., Bernard, G., Dupont, P., and Verleysen, M. (2013). Type 1 and 2 mixtures of Kullback-Leibler divergences as cost functions in dimensionality reduction based on similarity preservation. Neurocomputing, 112:92-108.

Lee, J. A. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction. Springer.

Peltonen, J., Klami, A., and Kaski, S. (2004). Improved learning of Riemannian metrics for exploratory analysis. Neural Networks, 17:1087-1100.
Literature II
Peluffo-Ordóñez, D. H., Lee, J. A., and Verleysen, M. (2014). Recent methods for dimensionality reduction: A brief comparative analysis. In 22nd European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319.

Schulz, A., Gisbrecht, A., and Hammer, B. (2014a). Relevance learning for dimensionality reduction. In 22nd European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014.

Schulz, A., Gisbrecht, A., and Hammer, B. (2014b). Using discriminative dimensionality reduction to visualize classifiers. Neural Processing Letters, pages 1-28.

van der Maaten, L. (2013). Barnes-Hut-SNE. CoRR, abs/1301.3342.

van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579-2605.

Venna, J., Peltonen, J., Nybo, K., Aidos, H., and Kaski, S. (2010). Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11:451-490.
Literature III
Weinberger, K. Q. and Saul, L. K. (2006). An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In Proceedings of the 21st National Conference on Artificial Intelligence. AAAI.

Yang, Z., Peltonen, J., and Kaski, S. (2013). Scalable optimization of neighbor embedding for visualization. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 127-135. JMLR Workshop and Conference Proceedings.