Linear and Non-Linear Dimensionality Reduction
Alexander Schulz
aschulz(at)techfak.uni-bielefeld.de
University of Pisa, Pisa
04.05.2015 and 07.05.2015
Overview
Dimensionality Reduction
I Motivation
I Linear Projections
I Linear Mappings in Feature Space
I Neighbor Embedding
I Manifold Learning

Advances in Dimensionality Reduction
I Speedup for Neighbor Embeddings
I Quality Assessment of DR
I Feature Relevance for DR
I Visualization of Classifiers
I Supervised Dimensionality Reduction
Curse of Dimensionality1

I high-dimensional spaces are almost empty
I hypervolume concentrates in a thin shell close to the surface

1[Lee and Verleysen, 2007]
Why use Dimensionality Reduction?
(raw pixel intensities of a handwritten digit: even a single small image is a vector with hundreds of dimensions)
Principal Component Analysis (PCA)

variance maximization

I var(w>xi) with ‖w‖ = 1 (for centered data xi)
I = 1/N ∑i (w>xi)²
I = 1/N ∑i w>xi xi>w
I = w>Cw
I → eigenvectors of the covariance matrix C are optimal
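The derivation above maps directly to code. A minimal sketch of PCA via the eigendecomposition of the covariance matrix (data and names here are illustrative, not from the slides):

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto the top principal components (variance maximization)."""
    Xc = X - X.mean(axis=0)              # center the data
    C = Xc.T @ Xc / len(Xc)              # covariance matrix C
    vals, vecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    W = vecs[:, ::-1][:, :n_components]  # top eigenvectors w
    return Xc @ W                        # projections w>xi

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
Y = pca(X)
print(Y.shape)  # (100, 2)
```

The first projected coordinate carries at least as much variance as the second, reflecting the eigenvalue ordering.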
PCA in Action

(figure omitted: a 3D toy data set and its 2D PCA projection; only axis ticks survived extraction)2

2[Gisbrecht and Hammer, 2015]
Distance Preservation (CMDS)

I distances δij = ‖xi − xj‖², dij = ‖yi − yj‖²
I objective δij ≈ dij
I distances dij and similarities sij can be transformed into each other
I S = UΛU> matrix of pairwise similarities
I best low-rank approximation of S in the Frobenius norm keeps the eigenvectors with the largest eigenvalues
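The distance-to-similarity transformation is double centering; the top eigenvectors of the resulting similarity matrix give the embedding. A minimal classical MDS sketch (function and variable names are my own):

```python
import numpy as np

def classical_mds(D2, n_components=2):
    """Embed points given squared pairwise distances D2 (classical MDS)."""
    n = len(D2)
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    S = -0.5 * J @ D2 @ J                        # distances -> similarities
    vals, vecs = np.linalg.eigh(S)               # S = U Lambda U>
    idx = np.argsort(vals)[::-1][:n_components]  # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# distances computed from 2D points should be reproduced exactly
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D2 = ((P[:, None] - P[None]) ** 2).sum(-1)
Y = classical_mds(D2)
D2y = ((Y[:, None] - Y[None]) ** 2).sum(-1)
print(np.allclose(D2, D2y))  # True
```

For truly 2D data the objective δij ≈ dij is met exactly; for higher-dimensional data the discarded eigenvalues measure the distortion.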
Manifold Learning

learn a linear manifold

I represent data as projections on an unknown w:
C = 1/(2N) ∑i (xi − yiw)²
I What are the best parameters yi and w?
PCA
three ways to obtain PCA
I maximize variance of a linear projection
I preserve distances
I find a linear manifold such that errors are minimal in an L2 sense
Kernel PCA3

Idea

I Apply a fixed nonlinear preprocessing φ(x)
I Perform standard PCA in feature space
I How to apply the kernel trick here?

3[Schölkopf et al., 1998]
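The kernel trick replaces the covariance eigendecomposition by one of the centered kernel matrix, so φ(x) never has to be computed explicitly. A sketch with an RBF kernel (kernel choice and parameter values are illustrative):

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=2.0):
    """Kernel PCA: standard PCA in feature space via the kernel trick."""
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                      # RBF kernel matrix k(xi, xj)
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                               # center in feature space
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    # projections are eigenvectors scaled by sqrt of their eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
Y = kernel_pca(X)
```

Because the eigenvectors of the kernel matrix are orthonormal, the resulting embedding coordinates are uncorrelated, just as in linear PCA.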
Kernel PCA in Action

(figure omitted: a 3D toy data set and its 2D kernel PCA embedding; only axis ticks survived extraction)4

4[Gisbrecht and Hammer, 2015]
Stochastic Neighbor Embedding (SNE)5

I introduce a probabilistic neighborhood in the input space

pj|i = exp(−‖xi − xj‖²/(2σi²)) / ∑k≠i exp(−‖xi − xk‖²/(2σi²))

I and in the output space

qj|i = exp(−‖yi − yj‖²/2) / ∑k≠i exp(−‖yi − yk‖²/2)

I optimize the sum of Kullback-Leibler divergences

C = ∑i KL(Pi ,Qi) = ∑i ∑j≠i pj|i log(pj|i / qj|i)

5[Hinton and Roweis, 2002]
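The two neighborhood distributions and the cost translate into a few lines. A sketch using one global σ for simplicity (real SNE tunes σi per point to a target perplexity; names here are my own):

```python
import numpy as np

def cond_probs(Z, sigma=1.0):
    """p_{j|i}: Gaussian neighborhood probabilities, row-normalized."""
    sq = ((Z[:, None] - Z[None]) ** 2).sum(-1)
    P = np.exp(-0.5 * sq / sigma ** 2)
    np.fill_diagonal(P, 0.0)                 # k != i in the sums
    return P / P.sum(axis=1, keepdims=True)

def sne_cost(X, Y, sigma=1.0):
    """C = sum_i KL(P_i, Q_i)."""
    P = cond_probs(X, sigma)
    Q = cond_probs(Y)                        # sigma fixed to 1 in output space
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
print(sne_cost(X, X))  # 0.0: identical neighborhoods incur no cost
```

Gradient descent on C with respect to the yi gives the actual embedding; the cost alone already shows the asymmetry of the KL divergence that NeRV later exploits.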
Neighbor Retrieval Visualizer (NeRV)6

I precision(i) = NTP,i/ki = 1 − NFP,i/ki,  recall(i) = NTP,i/ri = 1 − NMISS,i/ri

6[Venna et al., 2010]
Neighbor Retrieval Visualizer (NeRV)

I KL(Pi ,Qi) generalizes recall
I KL(Qi ,Pi) generalizes precision
I NeRV optimizes

C = λ ∑i KL(Pi ,Qi) + (1 − λ) ∑i KL(Qi ,Pi)
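Given precomputed neighborhood matrices P and Q, the NeRV tradeoff is a one-liner per term. A sketch (the ε smoothing and names are my own; λ = 1 recovers the SNE cost):

```python
import numpy as np

def nerv_cost(P, Q, lam=0.5, eps=1e-12):
    """lam * sum KL(P_i,Q_i) + (1-lam) * sum KL(Q_i,P_i)."""
    kl_pq = np.sum(P * np.log((P + eps) / (Q + eps)))  # penalizes misses (recall)
    kl_qp = np.sum(Q * np.log((Q + eps) / (P + eps)))  # penalizes false neighbors (precision)
    return lam * kl_pq + (1 - lam) * kl_qp

# identical neighborhood distributions incur zero cost for any lambda
identical = np.full((4, 4), 0.25)
print(nerv_cost(identical, identical))  # 0.0
```

Varying λ traces out the precision-recall tradeoff: small λ punishes false neighbors in the embedding, large λ punishes missed ones.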
t-Distributed SNE (t-SNE)7

I symmetrized probabilities p and q
I uses a Student-t distribution in the output space

7[van der Maaten and Hinton, 2008]
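The output-space change is small in code: the Gaussian kernel is replaced by a Student-t kernel with one degree of freedom, and the probabilities become joint rather than conditional. A sketch:

```python
import numpy as np

def tsne_q(Y):
    """Symmetrized output similarities with a Student-t (1 df) kernel."""
    sq = ((Y[:, None] - Y[None]) ** 2).sum(-1)
    num = 1.0 / (1.0 + sq)        # heavy tails counteract the crowding problem
    np.fill_diagonal(num, 0.0)
    return num / num.sum()        # joint (not conditional) probabilities

rng = np.random.default_rng(0)
Q = tsne_q(rng.normal(size=(20, 2)))
print(round(Q.sum(), 6))  # 1.0
```

The heavy tails let moderately distant points in the embedding model large input-space distances, which is what spreads clusters apart in t-SNE plots.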
Neighbor Embedding in Action

(figure omitted: a 3D toy data set and its 2D neighbor embedding; only axis ticks survived extraction)8

8[Gisbrecht and Hammer, 2015]
Maximum Variance Unfolding (MVU)9

I goal: 'unfold' a given manifold while keeping all the local distances and angles fixed
I maximize ∑ij ‖yi − yj‖² s.t. ∑i yi = 0 and ‖yi − yj‖² = ‖xi − xj‖² for all neighbors

9[Weinberger and Saul, 2006]
Manifold Learner in Action

(figure omitted: a 3D toy data set and its 2D MVU embedding; only axis ticks survived extraction)10

10[Gisbrecht and Hammer, 2015]
Overview
Dimensionality Reduction
I Motivation
I Linear Projections
I Linear Mappings in Feature Space
I Neighbor Embedding
I Manifold Learning

Advances in Dimensionality Reduction
I Speedup for Neighbor Embeddings
I Quality Assessment of DR
I Feature Relevance for DR
I Visualization of Classifiers
I Supervised Dimensionality Reduction
Barnes Hut Approach11

Complexity of NE

I Neighbor Embeddings have complexity O(N²)
I matrices P and Q are quadratic in N
I quadratic summation for the gradient

∂C/∂yi = ∑j≠i gij(yi − yj)

11[Yang et al., 2013, van der Maaten, 2013]
(figure omitted)12

12[van der Maaten, 2013]
Barnes Hut Approach

Barnes Hut

I approximate the gradient ∂C/∂yi = ∑j≠i gij(yi − yj) as

∑j≠i gij(yi − yj) ≈ ∑t |Git| · git(yi − yit)

where Git is the t-th cell of a space-partitioning tree and yit its center
I approximate P as a sparse matrix
I results in an O(N log N) algorithm
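The core idea, groups of distant points replaced by their cell center weighted by cell size, can be illustrated without a full quadtree. A sketch using a crude 1D-sorted partition as a stand-in for the tree, with a t-SNE-like weight g (all names are my own):

```python
import numpy as np

def g(d2):
    return 1.0 / (1.0 + d2)          # example pairwise weight g_ij

def grad_exact(Y, i):
    """Exact sum_j g_ij (y_i - y_j): O(N) per point, O(N^2) overall."""
    d = Y[i] - Y
    w = g((d ** 2).sum(-1))
    w[i] = 0.0
    return (w[:, None] * d).sum(0)

def grad_cells(Y, i, n_cells=10):
    """Cell-based approximation: each cell contributes |cell| * g * (y_i - center)."""
    order = np.argsort(Y[:, 0])      # crude partition; a real quadtree adapts to density
    total = np.zeros(Y.shape[1])
    for cell in np.array_split(order, n_cells):
        cell = cell[cell != i]
        if cell.size == 0:
            continue
        d = Y[i] - Y[cell].mean(0)   # cell center stands in for all its members
        total += cell.size * g((d ** 2).sum()) * d
    return total

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 2))
print(grad_exact(Y, 0).shape, grad_cells(Y, 0).shape)
```

A real Barnes-Hut tree only merges cells that are far from yi relative to their size, which is what controls the approximation error.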
Barnes Hut SNE

(figure omitted)13

13[van der Maaten, 2013]
Kernel t-SNE14

I O(N) algorithm: apply NE to a fixed subset, map the remainder with an out-of-sample projection
I how to obtain an out-of-sample extension?
I use a kernel mapping

x 7→ y(x) = ∑j αj · k(x,xj) / ∑l k(x,xl) = A k

I minimization of ∑i ‖yi − y(xi)‖² yields A = Y · K−1

14[Gisbrecht et al., 2015]
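The mapping is linear in the normalized kernel values, so fitting A is a linear solve. A sketch with an RBF kernel and a random stand-in for the landmark embedding (in practice Y would come from t-SNE on the subset; names are my own, and A is obtained by solving K A = Y in this row convention):

```python
import numpy as np

def norm_kernel(X, Z, gamma=0.5):
    """Rows k(x, x_j) / sum_l k(x, x_l) of an RBF kernel."""
    sq = ((X[:, None] - Z[None]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))   # landmark points, embedded by (say) t-SNE
Y = rng.normal(size=(40, 2))   # their 2D coordinates (stand-in embedding)

K = norm_kernel(X, X)
A = np.linalg.solve(K, Y)      # minimizer of sum_i ||y_i - y(x_i)||^2

def map_out_of_sample(Xnew):
    """y(x) = A k for arbitrary new points: O(1) per point in N."""
    return norm_kernel(Xnew, X) @ A

print(np.allclose(map_out_of_sample(X), Y))  # landmarks map back exactly
```

New points never touch the expensive NE optimization, which is where the overall O(N) cost comes from.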
Kernel t-SNE

(figure omitted)15

15[Gisbrecht et al., 2015]
Quality Assessment of DR16

I most popular measure

Qk(X ,Y ) = ∑i |Nk(~xi) ∩ Nk(~yi)| / (Nk)

I rescaling added recently
I recently used to compare many DR techniques

16[Lee et al., 2013]
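Qk is simply the average overlap of k-nearest-neighbor sets before and after projection. A direct sketch (O(N²) distances, fine for small N; names are my own):

```python
import numpy as np

def knn(Z, k):
    """Indices of the k nearest neighbors of each point."""
    D = ((Z[:, None] - Z[None]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    return np.argsort(D, axis=1)[:, :k]

def quality(X, Y, k):
    """Q_k: average overlap of k-neighborhoods in input and output space."""
    NX, NY = knn(X, k), knn(Y, k)
    shared = sum(len(set(a) & set(b)) for a, b in zip(NX, NY))
    return shared / (len(X) * k)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
print(quality(X, X[:, :2], 5))  # in [0, 1]; 1 means perfect neighborhood preservation
```

Sweeping k from small to large distinguishes local from global structure preservation, which is how such curves are used to compare DR methods.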
Quality Assessment of DR

(figure omitted)17

17[Peluffo-Ordóñez et al., 2014]
Feature Relevance for DR
I Which features are important for a given projection?
data set: ESANN participants

Partic. | University | ESANN paper | #publications | . . . | likes beer | . . .
1       | A          | 1           | 15            | . . . | 1          | . . .
2       | A          | 0           | 8             | . . . | -1         | . . .
3       | B          | 1           | 22            | . . . | -1         | . . .
4       | C          | 1           | 9             | . . . | 0          | . . .
5       | C          | 0           | 15            | . . . | -1         | . . .
6       | D          | . . .       |               |       |            |
Visualization of ESANN participants

(figure omitted: 2D projection of the participants, axes Dim 1 / Dim 2)
Relevance of features for the projection

(figure omitted: bar chart of relevance values for features 1-10)
Visualization of ESANN participants

(figure omitted: the same 2D projection, colored by the feature 'likes beer' with values 1, 0, -1)
Aim

I estimate the relevance of single features for nonlinear dimensionality reductions

Idea

I change the influence of a single feature and observe the change in the quality
Evaluation of Feature Relevance for DR19

NeRV cost function18 QkNeRV

I interpretation from an information retrieval perspective
I d(~xi , ~xj)² = ∑l (xil − xjl)² becomes ∑l λl² (xil − xjl)²
I λkNeRV(l) := λl² where λ optimizes QkNeRV(Xλ,Y) + δ ∑l λl²

18[Venna et al., 2010]
19[Schulz et al., 2014a]
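The weighted metric amounts to rescaling each feature column by λl before computing distances; λl = 0 removes feature l entirely. A sketch of that building block (the actual method optimizes λ against the NeRV quality with the δ-penalty above; names here are my own):

```python
import numpy as np

def weighted_knn(X, lam, k):
    """k-neighborhoods under d(xi,xj)^2 = sum_l lam_l^2 (xil - xjl)^2."""
    Xw = X * lam                           # column scaling realizes the metric
    D = ((Xw[:, None] - Xw[None]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)
    return np.argsort(D, axis=1)[:, :k]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
full = weighted_knn(X, np.ones(3), 5)                     # all features active
no_last = weighted_knn(X, np.array([1.0, 1.0, 0.0]), 5)   # feature 3 removed
```

Comparing neighborhood quality with and without a feature (or under an optimized λ) is what yields the relevance profiles shown on the following slides.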
Toy data set 1

(figures omitted: 3D scatter plot of classes C1-C3 over Dim 1-3; its 2D embedding; and relevance curves λNeRV over the neighborhood size for Dim 1-3)
Toy data set 2

(figures omitted: 3D scatter plot of classes C1-C3 over Dim 1-3, and two different 2D embeddings of the data)
Relevances for different projections

(figures omitted: the two 2D embeddings of toy data set 2 with their relevance curves λNeRV over the neighborhood size for Dim 1-3; one projection is annotated with 98.7%)
Why visualize Classifiers?
Where is the Difficulty?
Class borders are

I often nonlinear
I often not given in an explicit functional form (e.g. SVM)
I high-dimensional, which makes it infeasible to sample them for a projection
An illustration of the approach

(figures omitted: 3D scatter plots of class 1 and class 2 over dim1-dim3)

I project data to 2D
An illustration: dimensionality reduction

(figures omitted: the 2D projection with classes and SVs, and the 2D grid samples lifted back into 3D)

I sample the 2D data space
I project the samples up
I classify them
An illustration: border visualization

I color intensity codes the certainty of the classifier
Classifier visualization20

I project data to 2D
I sample the 2D data space
I project the samples up
I classify them

(figures omitted: the stages of the pipeline in 2D and 3D)

20[Schulz et al., 2014b]
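The four steps can be sketched end to end. This is a simplified stand-in, not the paper's method: PCA plays the role of the (invertible) projection and a nearest-centroid rule plays the role of the trained classifier (the paper uses discriminative NE projections and e.g. an SVM); all data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
labels = (X[:, 0] > 0).astype(int)            # toy two-class problem

# 1. project data to 2D (PCA as an invertible stand-in for the DR step)
mu = X.mean(0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
W = Vt[:2].T                                  # 3D -> 2D projection matrix
Y = (X - mu) @ W

# 2. sample the 2D embedding space on a regular grid
g = np.linspace(Y.min(), Y.max(), 20)
grid = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)

# 3. project the samples up (inverse of the linear map)
lifted = grid @ W.T + mu

# 4. classify the lifted samples (nearest centroid as a stand-in classifier)
c0, c1 = X[labels == 0].mean(0), X[labels == 1].mean(0)
pred = (np.linalg.norm(lifted - c1, axis=1)
        < np.linalg.norm(lifted - c0, axis=1)).astype(int)
# coloring the grid by pred (and by classifier certainty) draws the border in 2D
```

The crucial ingredient in the nonlinear case is a usable "project up" step, which is exactly what makes the problem hard for NE projections.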
Toy Data Set 1

(figure omitted: 3D plot of classes C1 and C2 with the resulting border visualization)

Toy Data Set 2

(figure omitted: 3D plot of classes C1 and C2 with the resulting border visualization)

Toy Data Set 2 with NE

(figure omitted: the same data visualized with a neighbor embedding)
DR: intrinsically 3D data

(figure omitted: classes C1 and C2)

DR: NE projection

(figure omitted)
Supervised dimensionality reduction

I Use the Fisher metric21 d(x,x + dx) = dx>J(x)dx
I J(x) = Ep(c|x){(∂/∂x log p(c|x))(∂/∂x log p(c|x))>}

21[Peltonen et al., 2004, Gisbrecht et al., 2015]
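For a two-class logistic model the expectation defining J(x) has a closed form, which makes the construction concrete. A sketch (the logistic model and all names are my own illustration): with p = p(c=1|x), the gradients of log p(c|x) are (1−p)w and −pw, so J(x) = p(1−p)²ww> + (1−p)p²ww> = p(1−p)ww>.

```python
import numpy as np

def fisher_metric(x, w, b=0.0):
    """J(x) for a two-class logistic model p(c=1|x) = sigmoid(w.x + b)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    # E over c of the outer product of grad log p(c|x) collapses to:
    return p * (1 - p) * np.outer(w, w)

J = fisher_metric(np.array([0.3, -0.1]), np.array([2.0, 1.0]))
dx = np.array([0.1, 0.2])
print(dx @ J @ dx >= 0)  # True: J is positive semi-definite
```

Directions along which the class posterior changes are stretched, directions irrelevant for the classes are shrunk, which is what injects the discriminative information into the embedding.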
DR: supervised NE projection

(figure omitted)

prototypes in original space

(figure omitted: prototypes for classes C1 and C2)
Summary

I different objectives of dimensionality reduction
I new approach to get insight into trained classification models
I discriminative information can yield major improvements
Thank You For Your Attention!
Alexander Schulz
aschulz (at) techfak.uni-bielefeld.de
Literature I
Gisbrecht, A. and Hammer, B. (2015). Data visualization by nonlinear dimensionality reduction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(2):51-73.

Gisbrecht, A., Schulz, A., and Hammer, B. (2015). Parametric nonlinear dimensionality reduction using kernel t-SNE. Neurocomputing, 147:71-82.

Hinton, G. and Roweis, S. (2002). Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15, pages 833-840. MIT Press.

Lee, J. A., Renard, E., Bernard, G., Dupont, P., and Verleysen, M. (2013). Type 1 and 2 mixtures of Kullback-Leibler divergences as cost functions in dimensionality reduction based on similarity preservation. Neurocomputing, 112:92-108.

Lee, J. A. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction. Springer.

Peltonen, J., Klami, A., and Kaski, S. (2004). Improved learning of Riemannian metrics for exploratory analysis. Neural Networks, 17:1087-1100.
Literature II
Peluffo-Ordóñez, D. H., Lee, J. A., and Verleysen, M. (2014). Recent methods for dimensionality reduction: A brief comparative analysis. In 22nd European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319.

Schulz, A., Gisbrecht, A., and Hammer, B. (2014a). Relevance learning for dimensionality reduction. In 22nd European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014.

Schulz, A., Gisbrecht, A., and Hammer, B. (2014b). Using discriminative dimensionality reduction to visualize classifiers. Neural Processing Letters, pages 1-28.

van der Maaten, L. (2013). Barnes-Hut-SNE. CoRR, abs/1301.3342.

van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579-2605.

Venna, J., Peltonen, J., Nybo, K., Aidos, H., and Kaski, S. (2010). Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11:451-490.
Literature III
Weinberger, K. Q. and Saul, L. K. (2006). An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In Proceedings of the 21st National Conference on Artificial Intelligence. AAAI.

Yang, Z., Peltonen, J., and Kaski, S. (2013). Scalable optimization of neighbor embedding for visualization. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 127-135. JMLR Workshop and Conference Proceedings.