Dimension reduction methods: Algorithms and Applications

Yousef Saad
Department of Computer Science and Engineering
University of Minnesota

Université du Littoral - Calais, July 11, 2016
First..

► ... to the memory of Mohammed Bellalij
Introduction, background, and motivation
Common goal of data mining methods: to extract meaningful information or patterns from data. Very broad area – includes: data analysis, machine learning, pattern recognition, information retrieval, ...

► Main tools used: linear algebra; graph theory; approximation theory; optimization; ...

► In this talk: emphasis on dimension reduction techniques and the interrelations between techniques
Introduction: a few factoids
► Data is growing exponentially at an "alarming" rate:

• 90% of the data in the world today was created in the last two years

• Every day, 2.3 million terabytes (2.3×10^18 bytes) of data are created

► Mixed blessing: opportunities & big challenges.

► The trend is re-shaping & energizing many research areas ...

► ... including my own: numerical linear algebra
Topics
► Focus on two main problems:

– Information retrieval

– Face recognition

► and two types of dimension reduction methods:

– Standard subspace methods [SVD, Lanczos]

– Graph-based methods
Major tool of Data Mining: Dimension reduction
► Goal is not so much to reduce size (& cost) but to:

• Reduce noise and redundancy in the data before performing a task [e.g., classification, as in digit/face recognition]

• Discover important 'features' or 'parameters'

The problem: Given X = [x_1, ..., x_n] ∈ ℝ^{m×n}, find a low-dimensional representation Y = [y_1, ..., y_n] ∈ ℝ^{d×n} of X

► Achieved by a mapping Φ : x ∈ ℝ^m → y ∈ ℝ^d such that

$$\Phi(x_i) = y_i, \quad i = 1, \dots, n$$
[Figure: the m × n data matrix X (columns x_i) is mapped to a d × n matrix Y (columns y_i), with d ≪ m]
► Φ may be linear: y_i = Wᵀx_i, i.e., Y = WᵀX, ...

► ... or nonlinear (implicit).

► Mapping Φ required to: Preserve proximity? Maximize variance? Preserve a certain graph?
Example: Principal Component Analysis (PCA)
In Principal Component Analysis, W is computed to maximize the variance of the projected data:

$$\max_{W \in \mathbb{R}^{m\times d},\; W^\top W = I}\ \sum_{i=1}^{n} \Big\| y_i - \frac{1}{n}\sum_{j=1}^{n} y_j \Big\|_2^2, \qquad y_i = W^\top x_i.$$
► Leads to maximizing

$$\mathrm{Tr}\left[ W^\top (X - \mu e^\top)(X - \mu e^\top)^\top W \right], \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
► Solution: W = { dominant eigenvectors } of the covariance matrix ≡ set of left singular vectors of X̄ = X − μeᵀ
SVD:
$$\bar{X} = U \Sigma V^\top, \qquad U^\top U = I, \quad V^\top V = I, \quad \Sigma \ \text{diagonal}$$
► Optimal W = U_d ≡ matrix of the first d columns of U
► Solution W also minimizes the 'reconstruction error':

$$\sum_i \| x_i - W W^\top x_i \|^2 = \sum_i \| x_i - W y_i \|^2$$
► In some methods the re-centering to zero is not done, i.e., X̄ is replaced by X.
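To make the construction concrete, here is a minimal NumPy sketch (not from the talk; names are illustrative) of PCA via the SVD of the centered data matrix, with the reconstruction error checked at the end:

```python
import numpy as np

def pca_via_svd(X, d):
    """X is m x n, one data point per column; returns W (m x d) and Y (d x n)."""
    mu = X.mean(axis=1, keepdims=True)            # mu = (1/n) sum_i x_i
    Xbar = X - mu                                 # centered data X - mu e^T
    U, s, Vt = np.linalg.svd(Xbar, full_matrices=False)
    W = U[:, :d]                                  # d dominant left singular vectors
    return W, W.T @ Xbar                          # low-dimensional Y = W^T Xbar

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))                # m = 20 features, n = 100 samples
W, Y = pca_via_svd(X, d=2)
Xbar = X - X.mean(axis=1, keepdims=True)
print(np.linalg.norm(Xbar - W @ Y)**2)            # sum_i ||x_i - W y_i||^2
```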
Unsupervised learning
"Unsupervised learning": methods that do not exploit known labels

► Example of digits: perform a 2-D projection

► Images of the same digit tend to cluster (more or less)

► Such 2-D representations are popular for visualization

► Can also try to find natural clusters in data, e.g., in materials

► Basic clustering technique: K-means (see the sketch after the figure below)
[Figure: 2-D PCA projection of digit images 5–7, one cluster per digit; and a 2-D map of materials classes: Superhard, Photovoltaic, Superconductors, Catalytic, Ferromagnetic, Thermo-electric, Multi-ferroics]
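As an illustration (not from the slides), a short scikit-learn sketch in the same spirit: project digits 5–7 to 2-D with PCA, then look for clusters with K-means without using the labels. Dataset and parameter choices are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# scikit-learn stores one sample per ROW (the transpose of the talk's convention)
digits = load_digits()
mask = (digits.target >= 5) & (digits.target <= 7)    # keep digits 5, 6, 7
X, labels = digits.data[mask], digits.target[mask]

Y = PCA(n_components=2).fit_transform(X)              # 2-D projection for visualization

# Unsupervised: find 3 natural clusters without looking at the labels
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Y)
print(np.bincount(clusters))                          # cluster sizes
```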
Example: The 'Swiss Roll' (2000 points in 3-D)
[Figure: the original Swiss-Roll data in 3-D]
2-D ‘reductions’:
[Figure: 2-D reductions of the Swiss Roll by PCA, LPP, Eigenmaps, and ONPP]
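For readers who want to reproduce a comparison of this kind, a scikit-learn sketch (an assumption; the original plots were presumably made with the author's own codes). LPP and ONPP are not in scikit-learn, so the sketch contrasts PCA with Laplacian eigenmaps (SpectralEmbedding):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import SpectralEmbedding

X, t = make_swiss_roll(n_samples=2000, random_state=0)     # 2000 points in 3-D

Y_pca = PCA(n_components=2).fit_transform(X)               # linear reduction
# Graph-based (Laplacian eigenmaps): builds a k-NN graph, then embeds it
Y_eig = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)
print(Y_pca.shape, Y_eig.shape)                            # both (2000, 2)
```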
Example: Digit images (a random sample of 30)
[Figure: a grid of 30 sample digit images]
2-D ‘reductions’:
[Figure: 2-D reductions of digits 0–4 by PCA, LLE, K-PCA, and ONPP]
APPLICATION: INFORMATION RETRIEVAL
Application: Information Retrieval
► Given: a collection of documents (the columns of a matrix A) and a query vector q.

► Representation: an m × n term-by-document matrix

► A query q is a (sparse) vector in ℝ^m (a 'pseudo-document')

Problem: find a column of A that best matches q

► Vector space model: use cos ∠(A(:, j), q), j = 1 : n

► Requires the computation of Aᵀq

► Literal matching → ineffective
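A minimal sketch of the vector space model (toy data; all names illustrative):

```python
import numpy as np

def vsm_scores(A, q):
    """Cosine of the angle between q and each column of A."""
    s = A.T @ q                                        # requires only A^T q
    return s / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Toy 5-term x 4-document matrix and a 2-term query
A = np.array([[1., 0., 0., 1.],
              [1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])
q = np.array([1., 1., 0., 0., 0.])
print(vsm_scores(A, q).argmax())                       # best-matching document
```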
Common approach: Dimension reduction (SVD)
► LSI: replace A by a low-rank approximation [from the SVD]:

$$A = U \Sigma V^\top \;\rightarrow\; A_k = U_k \Sigma_k V_k^\top$$

► Replace the similarity vector s = Aᵀq by s_k = A_kᵀ q

► Main issues: 1) computational cost, 2) updates
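A sketch of the LSI scoring step (assuming a matrix small enough for a dense SVD; real collections need iterative methods, which is exactly the cost issue above):

```python
import numpy as np

def lsi_scores(A, q, k):
    """s_k = A_k^T q for the rank-k truncated SVD A_k = U_k S_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # A_k^T q = V_k S_k (U_k^T q), computed without ever forming A_k
    return Vt[:k].T @ (s[:k] * (U[:, :k].T @ q))
```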
Idea: Replace A_k by A φ(AᵀA), where φ is a filter function.

Consider the step function (Heaviside):

$$\varphi(x) = \begin{cases} 0, & 0 \le x \le \sigma_k^2 \\ 1, & \sigma_k^2 < x \le \sigma_1^2 \end{cases}$$
► Would yield the same result as the TSVD, but is not practical
Use of polynomial filters
► Solution: use a polynomial approximation to φ

► Note: sᵀ = qᵀ A φ(AᵀA) requires only matrix-vector products

► Ideal for situations where the data must be explored only once or a small number of times

► Details skipped – see:

E. Kokiopoulou and YS, Polynomial Filtering in Latent Semantic Indexing for Information Retrieval, ACM SIGIR, 2004.
IR: Use of the Lanczos algorithm (J. Chen, YS ’09)
► Lanczos algorithm = projection method on the Krylov subspace Span{v, Av, ..., A^{m−1}v}

► Can get singular vectors with Lanczos, & use them in LSI

► Better: use the Lanczos vectors directly for the projection

► K. Blom and A. Ruhe [SIMAX, vol. 26, 2005] perform a Lanczos run for each query [expensive].

► Proposed: one Lanczos run from a random initial vector; then use the Lanczos vectors in place of the singular vectors.

► In short: results comparable to those of the SVD at a much lower cost (a sketch follows below).
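The following is a rough NumPy sketch of the idea (my reconstruction, not the authors' code): one Golub-Kahan/Lanczos bidiagonalization run from a random vector, whose left vectors U_k replace the singular vectors in the projector:

```python
import numpy as np

def lanczos_left_vectors(A, k, seed=0):
    """k left Lanczos (bidiagonalization) vectors of A from one random start.

    No reorthogonalization -- fine for a sketch; real code would add it."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(A.shape[0]); u /= np.linalg.norm(u)
    vecs = [u]
    w = A.T @ u; alpha = np.linalg.norm(w); v = w / alpha
    for _ in range(k - 1):
        w = A @ v - alpha * u                 # beta u_{j+1} = A v_j - alpha_j u_j
        beta = np.linalg.norm(w); u = w / beta
        vecs.append(u)
        w = A.T @ u - beta * v                # alpha v_{j+1} = A^T u_{j+1} - beta v_j
        alpha = np.linalg.norm(w); v = w / alpha
    return np.column_stack(vecs)

def lanczos_scores(A, q, k):
    """Similarity s = A^T (U_k U_k^T q): Lanczos vectors in place of the SVD's U_k."""
    Uk = lanczos_left_vectors(A, k)
    return A.T @ (Uk @ (Uk.T @ q))
```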
Tests: IR
Information retrieval datasets:

Dataset | # Terms | # Docs | # Queries | Sparsity
MED     | 7,014   | 1,033  | 30        | 0.735
CRAN    | 3,763   | 1,398  | 225       | 1.412
[Figure: preprocessing time vs. iterations for lanczos and tsvd, on the Med and Cran datasets]
Average retrieval precision
[Figure: average retrieval precision vs. iterations for lanczos, tsvd, lanczos-rf, and tsvd-rf, on the Med and Cran datasets]

Retrieval precision comparisons
Supervised learning: classification
Problem: Given labels (say "A" and "B") for each item of a given set, find a mechanism to classify an unlabelled item into either the "A" or the "B" class.

► Many applications.

► Example: distinguish SPAM and non-SPAM messages

► Can be extended to more than 2 classes.
Supervised learning: classification
► Best illustration: the handwritten digit recognition example

Given: a set of labeled samples (the training set), and an (unlabeled) test image.

Problem: find the label of the test image
[Figure: labeled training images (Digit 0, Digit 1, ..., Digit 9) and an unlabeled test image ("Digit ?") pass through dimension reduction]
► Roughly speaking: we seek a dimension reduction such that recognition is 'more effective' in the low-dimensional space
Supervised learning: Linear classification
Linear classifiers: Find a hyperplane which best separates the data in classes A and B.

► Example of application: distinguish between SPAM and non-SPAM e-mails

[Figure: a linear classifier separating the two classes]

► Note: The world is non-linear. Often this is combined with kernels – which amounts to changing the inner product
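A minimal linear-classifier sketch (scikit-learn on synthetic blobs; all names illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)  # classes "A"/"B"

clf = LinearSVC().fit(X, y)                   # fit a separating hyperplane
print(clf.coef_, clf.intercept_)              # hyperplane w^T x + b = 0
print(clf.predict([[0.0, 2.0]]))              # classify an unlabeled item
```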
A harder case:
[Figure: a dataset that spectral bisection (PDDP) cannot separate linearly]
► Use kernels to transform
[Figure: projection with kernels, σ² = 2.7463 – the transformed data with a Gaussian kernel]
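A sketch of this kernel trick using kernel PCA with a Gaussian (RBF) kernel (scikit-learn; the σ² in the figure is specific to the talk's dataset, so the gamma below is just an assumption):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Concentric circles: a linear method cannot separate the two classes
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Gaussian kernel k(x, z) = exp(-gamma ||x - z||^2); gamma plays the role of 1/(2 sigma^2)
Y = KernelPCA(n_components=2, kernel="rbf", gamma=2.0).fit_transform(X)
```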
GRAPH-BASED TECHNIQUES
Graph-based methods
► Start with a graph of the data, e.g., the graph of k nearest neighbors (k-NN graph). Want: perform a projection which preserves the graph in some sense

► Define a graph Laplacean: L = D − W
[Figure: data points x_i, x_j joined by graph edges]

with, e.g.,

$$w_{ij} = \begin{cases} 1 & \text{if } j \in \mathrm{Adj}(i) \\ 0 & \text{else} \end{cases}, \qquad D = \mathrm{diag}(d_{ii}), \quad d_{ii} = \sum_{j \ne i} w_{ij}$$

and Adj(i) = the neighborhood of i (excluding i)
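A sketch of this construction (scikit-learn for the k-NN graph; the symmetrization choice is an assumption):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph

def knn_laplacian(X, k=10):
    """L = D - W for a symmetrized k-NN graph; X has one point per row."""
    W = kneighbors_graph(X, n_neighbors=k, mode="connectivity",
                         include_self=False)
    W = W.maximum(W.T)                        # keep an edge if either endpoint selects it
    d = np.asarray(W.sum(axis=1)).ravel()     # degrees d_ii = sum_{j != i} w_ij
    return sp.diags(d) - W
```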
A side note: Graph partitioning
If x is a vector of signs (±1), then

$$x^\top L x = 4 \times (\text{number of edge cuts})$$

where an edge cut is a pair (i, j) with x_i ≠ x_j

► Consequence: can be used for partitioning graphs, or 'clustering' [take p = sign(u_2), where u_2 = the second smallest eigenvector – see the sketch below]
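In code, the sign-of-u_2 recipe is a few lines (a sketch; for large or disconnected graphs one would use shift-invert and handle eigenvalue multiplicities):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_bisection(L):
    """Partition labels +/-1 from the 2nd smallest eigenvector of a sparse L."""
    vals, vecs = eigsh(L, k=2, which="SM")    # two smallest eigenpairs
    u2 = vecs[:, np.argsort(vals)[1]]         # Fiedler vector
    return np.sign(u2)
```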
Example: The Laplacean eigenmaps approach
Laplacean Eigenmaps [Belkin-Niyogi '01] *minimizes*

$$\mathcal{F}(Y) = \sum_{i,j=1}^{n} w_{ij}\, \| y_i - y_j \|^2 \quad \text{subject to} \quad Y D Y^\top = I$$
Motivation: if ‖x_i − x_j‖ is small (original data), we want ‖y_i − y_j‖ to also be small (low-dimensional data)

► Original data used only indirectly, through its graph

► Leads to an n × n sparse eigenvalue problem [in 'sample' space]
[Figure: neighboring points x_i, x_j map to nearby points y_i, y_j]
► Problem translates to:

$$\min_{Y \in \mathbb{R}^{d\times n},\; Y D Y^\top = I} \ \mathrm{Tr}\left[ Y (D - W) Y^\top \right].$$

► Solution (sort eigenvalues increasingly):

$$(D - W)\, u_i = \lambda_i D u_i; \qquad y_i = u_i^\top; \quad i = 1, \dots, d$$
► Note: can assume D = I (amounts to rescaling the data). The problem becomes

$$(I - W)\, u_i = \lambda_i u_i; \qquad y_i = u_i^\top; \quad i = 1, \dots, d$$
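A sketch of the d-dimensional embedding from the weight matrix (SciPy; the plain "SM" solve is the slow-but-simple option, and real code would use shift-invert):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def eigenmaps(W, d):
    """Rows of Y are eigenvectors of (D - W) u = lambda D u for the d smallest
    nontrivial eigenvalues (the trivial constant eigenvector is skipped)."""
    deg = np.asarray(W.sum(axis=1)).ravel()
    D = sp.diags(deg).tocsc()
    vals, vecs = eigsh(sp.diags(deg) - W, k=d + 1, M=D, which="SM")
    order = np.argsort(vals)
    return vecs[:, order[1:d + 1]].T           # Y is d x n
```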
Locally Linear Embedding (Roweis-Saul '00)

► LLE is very similar to Eigenmaps. Main differences:

1) The graph Laplacean matrix is replaced by an 'affinity' graph

2) The objective function is changed.

1. Graph: Each x_i is written as a convex combination of its k nearest neighbors:

$$x_i \approx \sum_{j \in N_i} w_{ij} x_j, \qquad \sum_{j \in N_i} w_{ij} = 1$$

► Optimal weights computed ('local calculation') by minimizing ‖x_i − Σ_j w_ij x_j‖ for i = 1, ..., n
2. Mapping:
The y_i's should obey the same 'affinity' as the x_i's. Minimize:

$$\sum_i \Big\| y_i - \sum_j w_{ij} y_j \Big\|^2 \quad \text{subject to:} \quad Y\mathbf{1} = 0, \quad Y Y^\top = I$$
Solution:

$$(I - W^\top)(I - W)\, u_i = \lambda_i u_i; \qquad y_i = u_i^\top.$$

► (I − Wᵀ)(I − W) replaces the graph Laplacean of Eigenmaps (a sketch follows below)
ONPP (Kokiopoulou and YS ’05)
► Orthogonal Neighborhood Preserving Projections

► A linear (orthogonal) version of LLE, obtained by writing Y in the form Y = VᵀX

► Same graph as LLE. Objective: preserve the affinity graph (as in LLE) *but* with the constraint Y = VᵀX
► Problem solved to obtain the mapping:

$$\min_{V} \; \mathrm{Tr}\left[ V^\top X (I - W^\top)(I - W) X^\top V \right] \quad \text{s.t.} \quad V^\top V = I$$
► In LLE, VᵀX is replaced by a free Y
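A dense sketch of this eigenproblem (my reconstruction from the formula above, not the authors' code); X holds one data point per column:

```python
import numpy as np

def onpp(X, W, d):
    """V = bottom d eigenvectors of M = X (I - W)^T (I - W) X^T; Y = V^T X."""
    n = W.shape[0]
    E = np.eye(n) - W
    M = X @ E.T @ E @ X.T                      # m x m symmetric matrix
    vals, vecs = np.linalg.eigh(M)             # eigenvalues in ascending order
    V = vecs[:, :d]
    return V, V.T @ X                          # explicit linear mapping x -> V^T x
```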
Implicit vs explicit mappings
► In PCA the mapping Φ from the high-dimensional space (ℝ^m) to the low-dimensional space (ℝ^d) is explicitly known:

$$y = \Phi(x) \equiv V^\top x$$

► In Eigenmaps and LLE we only know

$$y_i = \Phi(x_i), \quad i = 1, \dots, n$$

► The mapping Φ is complex (implicit): difficult to get Φ(x) for an arbitrary x not in the sample.

► Inconvenient for classification

► This is "the out-of-sample extension" problem
Face Recognition – background

Problem: We are given a database of images [arrays of pixel values], and a test (new) image.
Question: Does this new image correspond to one of those in the database?

Difficulty: positions, expressions, lighting, ...
Example: Eigenfaces [Turk-Pentland, ’91]
► Idea identical with the one we saw for digits:

– Consider each picture as a (1-D) column of all its pixels

– Put the columns together into an array A of size #_pixels × #_images
[Figure: face images vectorized and stacked as the columns of A]
– Do an SVD of A and perform the comparison with any test image in the low-dimensional space (sketch below)
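A compact sketch of the eigenfaces pipeline (illustrative names; centering is included here, though some variants skip it):

```python
import numpy as np

def eigenfaces(A, d):
    """A: (#pixels x #images), one vectorized face per column."""
    mu = A.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(A - mu, full_matrices=False)
    Ud = U[:, :d]                              # the d 'eigenfaces'
    return mu, Ud, Ud.T @ (A - mu)             # database coordinates in d-D

def closest_face(mu, Ud, coords, test_image):
    """Index of the nearest database face in the reduced space."""
    y = Ud.T @ (test_image.reshape(-1, 1) - mu)
    return int(np.argmin(np.linalg.norm(coords - y, axis=0)))
```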
Graph-based methods in a supervised setting
Graph-based methods can be adapted to supervised mode. Idea: build G so that nodes in the same class are neighbors. If c = # classes, G consists of c cliques.

► Weight matrix W is block-diagonal:

$$W = \begin{pmatrix} W_1 & & & \\ & W_2 & & \\ & & \ddots & \\ & & & W_c \end{pmatrix}$$

► Note: rank(W) = n − c.

► As before, the graph Laplacean: L_c = D − W
► Can be used for ONPP and other graph-based methods

► Improvement: add a repulsion Laplacean [Kokiopoulou & YS '09]
[Figure: three class cliques: Class 1, Class 2, Class 3]

Leads to an eigenvalue problem with the matrix:

$$L_c - \rho L_R$$

• L_c = class Laplacean, • L_R = repulsion Laplacean, • ρ = parameter
Test: ORL – 40 subjects, 10 sample images each

# of pixels: 112 × 92; total # of images: 400
[Figure: recognition accuracy on ORL (5 training images per class) for onpp, pca, olpp-R, laplace, fisher, onpp-R, and olpp]
► Observation: some values of ρ yield better results than using the optimum ρ obtained from maximizing the trace ratio
Conclusion
► Interesting new matrix problems in areas that involve the effective mining of data

► Among the most pressing issues is that of reducing computational cost – [SVD, SDP, ..., too costly]

► Many online resources available

► Huge potential in areas like materials science, though inertia has to be overcome

► To a researcher in computational linear algebra: a big tide of change in the types of problems, algorithms, frameworks, culture, ...

► But change should be welcome
"When one door closes, another opens; but we often look so long and so regretfully upon the closed door that we do not see the one which has opened for us."

– Alexander Graham Bell (1847-1922)
► In the words of "Who Moved My Cheese?" [Spencer Johnson, 2002]:

"If you do not change, you can become extinct!"