
Linear and Non-Linear Dimensionality Reduction

Alexander Schulz
aschulz(at)techfak.uni-bielefeld.de

University of Pisa, Pisa

04.05.2015 and 07.05.2015

Overview

Dimensionality Reduction
- Motivation
- Linear Projections
- Linear Mappings in Feature Space
- Neighbor Embedding
- Manifold Learning

Advances in Dimensionality Reduction
- Speedup for Neighbor Embeddings
- Quality Assessment of DR
- Feature Relevance for DR
- Visualization of Classifiers
- Supervised Dimensionality Reduction

Curse of Dimensionality [Lee and Verleysen, 2007]

- high-dimensional spaces are almost empty
- hypervolume concentrates in a thin shell close to the surface


Why use Dimensionality Reduction?

[Figure: the raw pixel intensities of a small grayscale image printed as one long vector of numbers; in this raw form the high-dimensional data point is hard to interpret]


Principal Component Analysis (PCA)

Variance maximization:
- maximize $\mathrm{var}(w^\top x_i)$ with $\|w\| = 1$
- $= \frac{1}{N} \sum_i (w^\top x_i)^2$
- $= \frac{1}{N} \sum_i w^\top x_i x_i^\top w$
- $= w^\top C w$
- → the eigenvectors of the covariance matrix $C$ are optimal

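The variance-maximization view translates directly into a few lines of numpy. The following sketch (the function name `pca` and the `n_components` parameter are mine, not from the slides) centers the data, forms the covariance matrix and projects onto its leading eigenvectors:

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X (N x d) onto the top principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = Xc.T @ Xc / len(Xc)                 # covariance matrix C = 1/N sum_i x_i x_i^T
    eigval, eigvec = np.linalg.eigh(C)      # eigh returns ascending eigenvalues for symmetric C
    W = eigvec[:, ::-1][:, :n_components]   # leading eigenvectors = directions of maximal variance
    return Xc @ W                           # low-dimensional coordinates

# usage: Y = pca(np.random.randn(100, 10))
```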

PCA in Action

[Figure: PCA applied to an example data set; from [Gisbrecht and Hammer, 2015]]

Distance Preservation (CMDS)

- distances $\delta_{ij} = \|x_i - x_j\|^2$, $d_{ij} = \|y_i - y_j\|^2$
- objective: $\delta_{ij} \approx d_{ij}$
- distances $d_{ij}$ and similarities $s_{ij}$ can be transformed into each other
- $S = U \Lambda U^\top$: matrix of pairwise similarities
- the best low-rank approximation of $S$ in the Frobenius norm is $\hat{S} = U_d \Lambda_d U_d^\top$, built from the largest eigenvalues

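For reference, a minimal classical MDS sketch along these lines (the double-centering step that turns squared distances into similarities is standard; the helper name `classical_mds` is mine):

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Embed X in low dimension by preserving pairwise distances (classical MDS)."""
    D2 = np.square(np.linalg.norm(X[:, None] - X[None, :], axis=-1))  # squared distances
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    S = -0.5 * J @ D2 @ J                            # similarities (Gram matrix) from distances
    eigval, eigvec = np.linalg.eigh(S)
    idx = np.argsort(eigval)[::-1][:n_components]    # keep the largest eigenvalues
    scale = np.sqrt(np.clip(eigval[idx], 0, None))   # clip tiny negative values from numerics
    return eigvec[:, idx] * scale                    # Y = U_d * sqrt(Lambda_d)
```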

Manifold Learning

Learn a linear manifold:
- represent data as projections on an unknown $w$: $C = \frac{1}{2N} \sum_i \|x_i - y_i w\|^2$
- What are the best parameters $y_i$ and $w$?


PCA

Three ways to obtain PCA:
- maximize the variance of a linear projection
- preserve distances
- find a linear manifold such that errors are minimal in an L2 sense

Kernel PCA [Schölkopf et al., 1998]

Idea:
- apply a fixed nonlinear preprocessing $\phi(x)$
- perform standard PCA in feature space
- How can the kernel trick be applied here?

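A compact sketch of the kernel trick for PCA, assuming an RBF kernel with a hand-picked bandwidth `gamma` (both choices are mine, for illustration): the covariance matrix in feature space is never formed; only the centered kernel matrix is eigendecomposed.

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    """PCA in feature space via the kernel trick, using an RBF kernel."""
    sq = np.square(np.linalg.norm(X[:, None] - X[None, :], axis=-1))
    K = np.exp(-gamma * sq)                          # kernel matrix k(x_i, x_j)
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                   # implicit centering in feature space
    eigval, eigvec = np.linalg.eigh(Kc)
    idx = np.argsort(eigval)[::-1][:n_components]    # keep the leading eigenvalues
    alphas = eigvec[:, idx] / np.sqrt(np.maximum(eigval[idx], 1e-12))  # normalized coefficients
    return Kc @ alphas                               # projections of the training points
```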

Kernel PCA in Action

[Figure: kernel PCA applied to the same example data set; from [Gisbrecht and Hammer, 2015]]

Stochastic Neighbor Embedding (SNE) [Hinton and Roweis, 2002]

- introduce a probabilistic neighborhood in the input space:
  $p_{j|i} = \frac{\exp(-0.5 \|x_i - x_j\|^2 / \sigma_i^2)}{\sum_{k \neq i} \exp(-0.5 \|x_i - x_k\|^2 / \sigma_i^2)}$
- and in the output space:
  $q_{j|i} = \frac{\exp(-0.5 \|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-0.5 \|y_i - y_k\|^2)}$
- optimize the sum of Kullback-Leibler divergences:
  $C = \sum_i KL(P_i, Q_i) = \sum_i \sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$

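A direct transcription of these definitions into numpy (simplified: a single global bandwidth `sigma` instead of the per-point sigma_i that SNE tunes via perplexity):

```python
import numpy as np

def sne_cost(X, Y, sigma=1.0):
    """Sum of KL divergences between input and output neighborhoods (the SNE cost)."""
    def cond_probs(Z, s2):
        d2 = np.square(np.linalg.norm(Z[:, None] - Z[None, :], axis=-1))
        P = np.exp(-0.5 * d2 / s2)
        np.fill_diagonal(P, 0.0)                     # p_{i|i} is excluded
        return P / P.sum(axis=1, keepdims=True)      # normalize each row over k != i

    P = cond_probs(X, sigma ** 2)                    # input neighborhoods p_{j|i}
    Q = cond_probs(Y, 1.0)                           # output neighborhoods q_{j|i}
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))  # sum_i KL(P_i, Q_i)
```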

Neighbor Retrieval Visualizer (NeRV) [Venna et al., 2010]

- $\mathrm{precision}(i) = \frac{N_{TP,i}}{k_i} = 1 - \frac{N_{FP,i}}{k_i}$, $\quad \mathrm{recall}(i) = \frac{N_{TP,i}}{r_i} = 1 - \frac{N_{MISS,i}}{r_i}$


Neighbor Retrieval Visualizer (NeRV)

- $KL(P_i, Q_i)$ generalizes recall
- $KL(Q_i, P_i)$ generalizes precision
- NeRV optimizes $C = \lambda \sum_i KL(P_i, Q_i) + (1 - \lambda) \sum_i KL(Q_i, P_i)$


t-Distributed SNE (t-SNE) [van der Maaten and Hinton, 2008]

- symmetrized probabilities $p$ and $q$
- uses a Student-t distribution in the output space


Neighbor Embedding in Action

[Figure: a neighbor embedding of the same example data set; from [Gisbrecht and Hammer, 2015]]

Maximum Variance Unfolding (MVU) [Weinberger and Saul, 2006]

- goal: 'unfold' a given manifold while keeping all the local distances and angles fixed
- maximize $\sum_{ij} \|y_i - y_j\|^2$ subject to $\sum_i y_i = 0$ and $\|y_i - y_j\|^2 = \|x_i - x_j\|^2$ for all neighbors

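The optimization above is a semidefinite program over the Gram matrix K of the embedding. A direct, non-scalable transcription using cvxpy (practical MVU implementations rely on specialized solvers; this sketch only illustrates the constraints, and the neighborhood size is an arbitrary choice of mine):

```python
import numpy as np
import cvxpy as cp
from sklearn.neighbors import kneighbors_graph

def mvu(X, n_neighbors=5, n_components=2):
    """Maximum Variance Unfolding as an SDP over the Gram matrix K of the embedding."""
    n = len(X)
    G = kneighbors_graph(X, n_neighbors, mode="connectivity").toarray()
    K = cp.Variable((n, n), PSD=True)                # Gram matrix K_ij = y_i^T y_j
    constraints = [cp.sum(K) == 0]                   # centering: sum_i y_i = 0
    for i in range(n):
        for j in range(i + 1, n):
            if G[i, j] or G[j, i]:                   # preserve local distances
                d2 = np.sum((X[i] - X[j]) ** 2)
                constraints.append(K[i, i] - 2 * K[i, j] + K[j, j] == d2)
    # with centering, maximizing trace(K) maximizes the total spread sum_ij ||y_i - y_j||^2
    cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()
    eigval, eigvec = np.linalg.eigh(K.value)
    idx = np.argsort(eigval)[::-1][:n_components]
    return eigvec[:, idx] * np.sqrt(np.clip(eigval[idx], 0, None))
```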

Manifold Learner in Action

[Figure: a manifold learner applied to the same example data set; from [Gisbrecht and Hammer, 2015]]

Overview

Dimensionality Reduction
- Motivation
- Linear Projections
- Linear Mappings in Feature Space
- Neighbor Embedding
- Manifold Learning

Advances in Dimensionality Reduction
- Speedup for Neighbor Embeddings
- Quality Assessment of DR
- Feature Relevance for DR
- Visualization of Classifiers
- Supervised Dimensionality Reduction

Barnes-Hut Approach [Yang et al., 2013, van der Maaten, 2013]

Complexity of NE:
- neighbor embeddings have complexity $O(N^2)$
- the matrices $P$ and $Q$ have $N^2$ entries
- the gradient requires a summation over all pairs:
  $\frac{\partial C}{\partial y_i} = \sum_{j \neq i} g_{ij} (y_i - y_j)$


[Figure from [van der Maaten, 2013]]

Barnes-Hut Approach

Barnes-Hut:
- approximate the gradient $\frac{\partial C}{\partial y_i} = \sum_{j \neq i} g_{ij}(y_i - y_j)$ as
  $\sum_{j \neq i} g_{ij}(y_i - y_j) \approx \sum_t |G_i^t| \cdot g_{it}(y_i - y_i^t)$
- approximate $P$ as a sparse matrix
- results in an $O(N \log N)$ algorithm

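In practice the Barnes-Hut approximation is readily available, for example in scikit-learn's t-SNE; a usage sketch on placeholder data:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(5000, 50)   # placeholder data, stands in for a real high-dimensional set

# method='barnes_hut' uses the O(N log N) tree-based gradient approximation;
# 'angle' trades accuracy for speed (smaller = closer to the exact gradient).
Y = TSNE(n_components=2, method="barnes_hut", angle=0.5,
         perplexity=30).fit_transform(X)
```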

Barnes Hut SNE

[Figure from [van der Maaten, 2013]]

Kernel t-SNE [Gisbrecht et al., 2015]

- $O(N)$ algorithm: apply NE to a fixed subset, map the remainder with an out-of-sample projection
- How to obtain an out-of-sample extension?
- use a kernel mapping:
  $x \mapsto y(x) = \sum_j \alpha_j \cdot \frac{k(x, x_j)}{\sum_l k(x, x_l)} = A k$
- minimization of $\sum_i \|y_i - y(x_i)\|^2$ yields $A = Y \cdot K^{-1}$

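A minimal sketch of this out-of-sample map, assuming a Gaussian kernel with a fixed bandwidth `gamma` (the original work chooses the bandwidth relative to the data; the pseudo-inverse stands in for $K^{-1}$ for numerical stability):

```python
import numpy as np

def _norm_kernel(X, X_train, gamma):
    """Row-normalized Gaussian kernel: k(x, x_j) / sum_l k(x, x_l)."""
    d2 = np.square(np.linalg.norm(X[:, None] - X_train[None, :], axis=-1))
    K = np.exp(-gamma * d2)
    return K / K.sum(axis=1, keepdims=True)

def fit_kernel_tsne_map(X_train, Y_train, gamma=1.0):
    """Given a trained embedding Y_train of X_train, fit the mapping coefficients
    (A = Y K^{-1} in the slide's notation)."""
    K = _norm_kernel(X_train, X_train, gamma)
    return np.linalg.pinv(K) @ Y_train               # shape (N, 2)

def kernel_tsne_project(X_new, X_train, coeffs, gamma=1.0):
    """Out-of-sample extension: map unseen points into the existing embedding."""
    return _norm_kernel(X_new, X_train, gamma) @ coeffs
```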

Kernel t-SNE

[Figure from [Gisbrecht et al., 2015]]

Quality Assessment of DR [Lee et al., 2013]

- most popular measure:
  $Q_k(X, Y) = \sum_i |N_k(\vec{x}^i) \cap N_k(\vec{y}^i)| \,/\, (N k)$
- a rescaling was added recently
- recently used to compare many DR techniques

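A brute-force implementation of this neighborhood-agreement measure (it computes all pairwise distances, so it is only meant for small data sets):

```python
import numpy as np

def neighborhood_quality(X, Y, k=10):
    """Average overlap of the k-nearest neighborhoods before and after DR; Q_k in [0, 1]."""
    def knn_sets(Z):
        d = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)                    # a point is not its own neighbor
        return [set(np.argsort(row)[:k]) for row in d]

    NX, NY = knn_sets(X), knn_sets(Y)
    overlap = sum(len(a & b) for a, b in zip(NX, NY))  # |N_k(x_i) ∩ N_k(y_i)|, summed over i
    return overlap / (len(X) * k)                      # Q_k(X, Y)
```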

Quality Assessment of DR

[Figure from [Peluffo-Ordóñez et al., 2014]]

Feature Relevance for DR

- Which features are important for a given projection?

Data set: ESANN participants

Partic. | University | ESANN paper | #publications | ... | likes beer | ...
1       | A          | 1           | 15            | ... | 1          | ...
2       | A          | 0           | 8             | ... | -1         | ...
3       | B          | 1           | 22            | ... | -1         | ...
4       | C          | 1           | 9             | ... | 0          | ...
5       | C          | 0           | 15            | ... | -1         | ...
6       | D          | ...         | ...           | ... | ...        | ...

Visualization of ESANN participants

[Figure: 2D visualization of the ESANN participants data (Dim 1 vs. Dim 2)]

Relevance of features for the projection

[Figure: bar chart of the relevance values of features 1 to 10]

Visualization of ESANN participants

[Figure: the same 2D visualization (Dim 1 vs. Dim 2) with points colored by the 'likes beer' feature (values 1, 0, -1)]

Aim

- estimate the relevance of single features for nonlinear dimensionality reductions

Idea

- change the influence of a single feature and observe the change in the quality


Evaluation of Feature Relevance for DR [Schulz et al., 2014a]

NeRV cost function [Venna et al., 2010]: $Q_k^{NeRV}$
- interpretation from an information retrieval perspective
- $d(\vec{x}^i, \vec{x}^j)^2 = \sum_l (x_l^i - x_l^j)^2$ becomes $\sum_l \lambda_l^2 (x_l^i - x_l^j)^2$
- $\lambda_k^{NeRV}(l) := \lambda_l^2$, where $\lambda$ optimizes $Q_k^{NeRV}(X_\lambda, Y) + \delta \sum_l \lambda_l^2$

Evaluation of Feature Relevance for DR19

NeRV cost function18 QNeRVk

I interpretation from an information retrieval perspectiveI d(~x i , ~x j)2 =

∑l(x

il − x j

l )2 becomes

∑l λ

2l (x

il − x j

l )2

I λkNeRV(l) := λ2

l where λ optimizes QNeRVk (Xλ,Y ) + δ

∑l λ

2l

18[Venna et al., 2010]19[Schulz et al., 2014a]

Evaluation of Feature Relevance for DR19

NeRV cost function18 QNeRVk

I interpretation from an information retrieval perspectiveI d(~x i , ~x j)2 =

∑l(x

il − x j

l )2 becomes

∑l λ

2l (x

il − x j

l )2

I λkNeRV(l) := λ2

l where λ optimizes QNeRVk (Xλ,Y ) + δ

∑l λ

2l

18[Venna et al., 2010]19[Schulz et al., 2014a]
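A simplified sketch of this idea: reweight the per-feature distances by $\lambda_l^2$, measure how well the reweighted input neighborhoods match the given embedding, and optimize the weights with a penalty. Soft SNE-style neighborhoods and a plain KL cost are used here as a stand-in for the full NeRV quality $Q_k^{NeRV}$, so this is only an approximation of the slide's method:

```python
import numpy as np
from scipy.optimize import minimize

def _soft_neighbors(D2, scale=1.0):
    P = np.exp(-0.5 * D2 / scale)
    np.fill_diagonal(P, 0.0)
    return P / P.sum(axis=1, keepdims=True)

def feature_relevance(X, Y, delta=0.01):
    """Relevance lambda_l^2 of each input feature for a given embedding Y."""
    Q = _soft_neighbors(np.square(np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)))
    diffs2 = np.square(X[:, None] - X[None, :])        # (N, N, d) per-feature squared differences

    def cost(lam):
        D2 = (diffs2 * lam ** 2).sum(axis=-1)          # sum_l lambda_l^2 (x_l^i - x_l^j)^2
        P = _soft_neighbors(D2)
        kl = np.sum(P * np.log(np.maximum(P, 1e-12) / np.maximum(Q, 1e-12)))
        return kl + delta * np.sum(lam ** 2)           # match the embedding + L2 penalty

    res = minimize(cost, np.ones(X.shape[1]), method="Nelder-Mead")
    return res.x ** 2                                  # relevance profile lambda_l^2
```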

Toy data set 1

[Figure: a 3D toy data set (Dim 1-3) with three classes C1, C2, C3; its 2D projection (Dim I vs. Dim II); and the relevance profiles $\lambda^{NeRV}$ of Dim 1, Dim 2 and Dim 3 as a function of the neighborhood size]

Toy data set 2

[Figure: a 3D toy data set (Dim 1-3) with three classes C1, C2, C3 and two different 2D projections (Dim I vs. Dim II)]

Relevances for different projections

[Figure: the two 2D projections of toy data set 2 with their relevance profiles $\lambda^{NeRV}$ of Dim 1, Dim 2 and Dim 3 over the neighborhood size; one projection is annotated with 98.7%]

Why visualize Classifiers?


Where is the Difficulty?

Class borders are
- often nonlinear
- often not given in an explicit functional form (e.g. SVM)
- high-dimensional, which makes it infeasible to sample them for a projection

An illustration of the approach

[Figure: two views of a 3D data set (dim1-3) with two classes, class 1 and class 2]

- project data to 2D


An illustration: dimensionality reduction

[Figure: the 2D projection (dim I vs. dim II) with class 1, class 2 and the SVs marked, and a 3D view showing the sampled 2D points projected back into the data space]

- sample the 2D data space
- project the samples up
- classify them

An illustration: border visualization

- color intensity codes the certainty of the classifier

Classifier visualization [Schulz et al., 2014b]

- project data to 2D
- sample the 2D data space
- project the samples up
- classify them

[Figure: the four steps illustrated on the 3D toy data with classes class 1 and class 2, the SVs and the sampled 2D points]
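A rough sketch of these four steps. The slides and [Schulz et al., 2014b] use discriminative DR and a kernel-based inverse mapping; here, purely for illustration, a PCA projection is used because it has an explicit `inverse_transform`, and a binary SVM supplies the certainty values:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def visualize_classifier(X, labels, clf, grid_size=200):
    """Project data to 2D, sample the 2D plane, map samples back up, classify them."""
    pca = PCA(n_components=2).fit(X)
    Y = pca.transform(X)                                     # step 1: project data to 2D
    xs = np.linspace(Y[:, 0].min(), Y[:, 0].max(), grid_size)
    ys = np.linspace(Y[:, 1].min(), Y[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)                             # step 2: sample the 2D data space
    grid_hd = pca.inverse_transform(np.c_[gx.ravel(), gy.ravel()])  # step 3: project samples up
    certainty = clf.decision_function(grid_hd).reshape(gx.shape)    # step 4: classify them
    plt.contourf(gx, gy, certainty, levels=20, cmap="RdBu", alpha=0.6)  # certainty as intensity
    plt.scatter(Y[:, 0], Y[:, 1], c=labels, cmap="RdBu", edgecolors="k")
    plt.show()

# usage (with hypothetical data X, labels):
# clf = SVC().fit(X, labels); visualize_classifier(X, labels, clf)
```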

Toy Data Set 1

[Figure: a 3D toy data set (dim1-3) with two classes C1, C2 and the resulting classifier visualization]

Toy Data Set 2

[Figure: a 3D toy data set (dim1-3) with two classes C1, C2 and the resulting classifier visualization]

Toy Data Set 2 with NE

[Figure: toy data set 2 projected with a neighbor embedding and the resulting classifier visualization]

DR: intrinsically 3D data

[Figure: an intrinsically three-dimensional data set with two classes C1, C2]

DR: NE projection

Supervised dimensionality reduction [Peltonen et al., 2004, Gisbrecht et al., 2015]

- use the Fisher metric: $d(x, x + dx) = dx^\top J(x)\, dx$
- $J(x) = E_{p(c|x)} \left\{ \left( \frac{\partial}{\partial x} \log p(c|x) \right) \left( \frac{\partial}{\partial x} \log p(c|x) \right)^\top \right\}$

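A small sketch for estimating $J(x)$ from any probabilistic classifier that provides $p(c|x)$ (a logistic regression here; the finite-difference gradient and the step size `eps` are my simplifications):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fisher_matrix(x, model, eps=1e-4):
    """Estimate the Fisher information J(x) from a probabilistic classifier p(c|x)."""
    d = len(x)
    p = model.predict_proba(x[None, :])[0]               # p(c|x)
    grads = np.zeros((len(p), d))
    for l in range(d):                                    # numerical gradient of log p(c|x)
        e = np.zeros(d)
        e[l] = eps
        p_plus = model.predict_proba((x + e)[None, :])[0]
        p_minus = model.predict_proba((x - e)[None, :])[0]
        grads[:, l] = (np.log(np.maximum(p_plus, 1e-12))
                       - np.log(np.maximum(p_minus, 1e-12))) / (2 * eps)
    # J(x) = E_{p(c|x)} [ grad log p(c|x) grad log p(c|x)^T ]
    return sum(p[c] * np.outer(grads[c], grads[c]) for c in range(len(p)))

# usage: model = LogisticRegression().fit(X, labels); J = fisher_matrix(X[0], model)
# local distance under the Fisher metric: d(x, x + dx) = dx^T J(x) dx
```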

DR: supervised NE projection

Prototypes in original space

[Figure: prototypes of the classes C1 and C2 shown in the original data space]

Summary

- different objectives of dimensionality reduction
- a new approach to gain insight into trained classification models
- discriminative information can yield major improvements


Thank You For Your Attention!

Alexander Schulz
aschulz (at) techfak.uni-bielefeld.de

Literature I

Gisbrecht, A. and Hammer, B. (2015). Data visualization by nonlinear dimensionality reduction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(2):51-73.

Gisbrecht, A., Schulz, A., and Hammer, B. (2015). Parametric nonlinear dimensionality reduction using kernel t-SNE. Neurocomputing, 147:71-82.

Hinton, G. and Roweis, S. (2002). Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15, pages 833-840. MIT Press.

Lee, J. A., Renard, E., Bernard, G., Dupont, P., and Verleysen, M. (2013). Type 1 and 2 mixtures of Kullback-Leibler divergences as cost functions in dimensionality reduction based on similarity preservation. Neurocomputing, 112:92-108.

Lee, J. A. and Verleysen, M. (2007). Nonlinear dimensionality reduction. Springer.

Peltonen, J., Klami, A., and Kaski, S. (2004). Improved learning of Riemannian metrics for exploratory analysis. Neural Networks, 17:1087-1100.

Literature II

Peluffo-Ordóñez, D. H., Lee, J. A., and Verleysen, M. (2014). Recent methods for dimensionality reduction: A brief comparative analysis. In 22th European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319.

Schulz, A., Gisbrecht, A., and Hammer, B. (2014a). Relevance learning for dimensionality reduction. In 22th European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014.

Schulz, A., Gisbrecht, A., and Hammer, B. (2014b). Using discriminative dimensionality reduction to visualize classifiers. Neural Processing Letters, pages 1-28.

van der Maaten, L. (2013). Barnes-Hut-SNE. CoRR, abs/1301.3342.

van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579-2605.

Venna, J., Peltonen, J., Nybo, K., Aidos, H., and Kaski, S. (2010). Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11:451-490.

Literature III

Weinberger, K. Q. and Saul, L. K. (2006). An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In Proceedings of the 21st National Conference on Artificial Intelligence. AAAI.

Yang, Z., Peltonen, J., and Kaski, S. (2013). Scalable optimization of neighbor embedding for visualization. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 127-135. JMLR Workshop and Conference Proceedings.
