Dimension reduction methods: Algorithms and Applications

Yousef Saad
Department of Computer Science and Engineering
University of Minnesota

Université du Littoral - Calais, July 11, 2016
First..

► ... to the memory of Mohammed Bellalij
Introduction, background, and motivation
Common goal of data mining methods: to extract meaningful information or patterns from data. Very broad area – includes: data analysis, machine learning, pattern recognition, information retrieval, ...

► Main tools used: linear algebra; graph theory; approximation theory; optimization; ...

► In this talk: emphasis on dimension reduction techniques and the interrelations between techniques
Introduction: a few factoids
► Data is growing exponentially at an "alarming" rate:

• 90% of the data in the world today was created in the last two years

• Every day, 2.3 million terabytes (2.3×10^18 bytes) of data are created

► Mixed blessing: opportunities & big challenges.

► The trend is re-shaping & energizing many research areas ...

► ... including my own: numerical linear algebra
Topics
► Focus on two main problems:

– Information retrieval

– Face recognition

► and two types of dimension reduction methods:

– Standard subspace methods [SVD, Lanczos]

– Graph-based methods
Major tool of Data Mining: Dimension reduction
► Goal is not so much to reduce size (& cost) but to:

• Reduce noise and redundancy in the data before performing a task [e.g., classification, as in digit/face recognition]

• Discover important 'features' or 'parameters'

The problem: Given X = [x_1, ..., x_n] ∈ ℝ^{m×n}, find a low-dimensional representation Y = [y_1, ..., y_n] ∈ ℝ^{d×n} of X

► Achieved by a mapping Φ : x ∈ ℝ^m → y ∈ ℝ^d such that

$$\Phi(x_i) = y_i, \quad i = 1, \dots, n$$
[Figure: the m × n data matrix X (columns x_i) is mapped to a d × n matrix Y (columns y_i), with d ≪ m]
► Φ may be linear: y_i = Wᵀx_i, i.e., Y = WᵀX, ...

► ... or nonlinear (implicit).

► Mapping Φ required to: Preserve proximity? Maximize variance? Preserve a certain graph?
Example: Principal Component Analysis (PCA)
In Principal Component Analysis, W is computed to maximize the variance of the projected data:

$$\max_{W \in \mathbb{R}^{m\times d},\; W^\top W = I}\ \sum_{i=1}^{n} \Big\| y_i - \frac{1}{n}\sum_{j=1}^{n} y_j \Big\|_2^2, \qquad y_i = W^\top x_i.$$
► Leads to maximizing

$$\mathrm{Tr}\left[ W^\top (X - \mu e^\top)(X - \mu e^\top)^\top W \right], \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
► Solution: W = { dominant eigenvectors } of the covariance matrix ≡ set of left singular vectors of X̄ = X − μeᵀ
SVD:
$$\bar{X} = U \Sigma V^\top, \qquad U^\top U = I, \quad V^\top V = I, \quad \Sigma \ \text{diagonal}$$
► Optimal W = U_d ≡ matrix of the first d columns of U
► Solution W also minimizes the 'reconstruction error':

$$\sum_i \| x_i - W W^\top x_i \|^2 = \sum_i \| x_i - W y_i \|^2$$
► In some methods the re-centering to zero is not done, i.e., X̄ is replaced by X.
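To make the construction concrete, here is a minimal NumPy sketch (not from the talk; names are illustrative) of PCA via the SVD of the centered data matrix, with the reconstruction error checked at the end:

```python
import numpy as np

def pca_via_svd(X, d):
    """X is m x n, one data point per column; returns W (m x d) and Y (d x n)."""
    mu = X.mean(axis=1, keepdims=True)            # mu = (1/n) sum_i x_i
    Xbar = X - mu                                 # centered data X - mu e^T
    U, s, Vt = np.linalg.svd(Xbar, full_matrices=False)
    W = U[:, :d]                                  # d dominant left singular vectors
    return W, W.T @ Xbar                          # low-dimensional Y = W^T Xbar

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))                # m = 20 features, n = 100 samples
W, Y = pca_via_svd(X, d=2)
Xbar = X - X.mean(axis=1, keepdims=True)
print(np.linalg.norm(Xbar - W @ Y)**2)            # sum_i ||x_i - W y_i||^2
```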
Unsupervised learning
"Unsupervised learning": methods that do not exploit known labels

► Example of digits: perform a 2-D projection

► Images of the same digit tend to cluster (more or less)

► Such 2-D representations are popular for visualization

► Can also try to find natural clusters in data, e.g., in materials

► Basic clustering technique: K-means (see the sketch after the figure below)
[Figure: 2-D PCA projection of digit images 5–7, one cluster per digit; and a 2-D map of materials classes: Superhard, Photovoltaic, Superconductors, Catalytic, Ferromagnetic, Thermo-electric, Multi-ferroics]
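As an illustration (not from the slides), a short scikit-learn sketch in the same spirit: project digits 5–7 to 2-D with PCA, then look for clusters with K-means without using the labels. Dataset and parameter choices are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# scikit-learn stores one sample per ROW (the transpose of the talk's convention)
digits = load_digits()
mask = (digits.target >= 5) & (digits.target <= 7)    # keep digits 5, 6, 7
X, labels = digits.data[mask], digits.target[mask]

Y = PCA(n_components=2).fit_transform(X)              # 2-D projection for visualization

# Unsupervised: find 3 natural clusters without looking at the labels
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Y)
print(np.bincount(clusters))                          # cluster sizes
```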
Example: The 'Swiss Roll' (2000 points in 3-D)
[Figure: the original Swiss-Roll data in 3-D]
2-D ‘reductions’:
[Figure: 2-D reductions of the Swiss Roll by PCA, LPP, Eigenmaps, and ONPP]
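For readers who want to reproduce a comparison of this kind, a scikit-learn sketch (an assumption; the original plots were presumably made with the author's own codes). LPP and ONPP are not in scikit-learn, so the sketch contrasts PCA with Laplacian eigenmaps (SpectralEmbedding):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import SpectralEmbedding

X, t = make_swiss_roll(n_samples=2000, random_state=0)     # 2000 points in 3-D

Y_pca = PCA(n_components=2).fit_transform(X)               # linear reduction
# Graph-based (Laplacian eigenmaps): builds a k-NN graph, then embeds it
Y_eig = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)
print(Y_pca.shape, Y_eig.shape)                            # both (2000, 2)
```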
Example: Digit images (a random sample of 30)
[Figure: a grid of 30 sample digit images]
2-D ‘reductions’:
[Figure: 2-D reductions of digits 0–4 by PCA, LLE, K-PCA, and ONPP]
APPLICATION: INFORMATION RETRIEVAL
Application: Information Retrieval
► Given: a collection of documents (the columns of a matrix A) and a query vector q.

► Representation: an m × n term-by-document matrix

► A query q is a (sparse) vector in ℝ^m (a 'pseudo-document')

Problem: find a column of A that best matches q

► Vector space model: use cos ∠(A(:, j), q), j = 1 : n

► Requires the computation of Aᵀq

► Literal matching → ineffective
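A minimal sketch of the vector space model (toy data; all names illustrative):

```python
import numpy as np

def vsm_scores(A, q):
    """Cosine of the angle between q and each column of A."""
    s = A.T @ q                                        # requires only A^T q
    return s / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Toy 5-term x 4-document matrix and a 2-term query
A = np.array([[1., 0., 0., 1.],
              [1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])
q = np.array([1., 1., 0., 0., 0.])
print(vsm_scores(A, q).argmax())                       # best-matching document
```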
Common approach: Dimension reduction (SVD)
► LSI: replace A by a low-rank approximation [from the SVD]:

$$A = U \Sigma V^\top \;\rightarrow\; A_k = U_k \Sigma_k V_k^\top$$

► Replace the similarity vector s = Aᵀq by s_k = A_kᵀ q

► Main issues: 1) computational cost, 2) updates
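A sketch of the LSI scoring step (assuming a matrix small enough for a dense SVD; real collections need iterative methods, which is exactly the cost issue above):

```python
import numpy as np

def lsi_scores(A, q, k):
    """s_k = A_k^T q for the rank-k truncated SVD A_k = U_k S_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # A_k^T q = V_k S_k (U_k^T q), computed without ever forming A_k
    return Vt[:k].T @ (s[:k] * (U[:, :k].T @ q))
```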
Idea: Replace A_k by A φ(AᵀA), where φ is a filter function.

Consider the step function (Heaviside):

$$\varphi(x) = \begin{cases} 0, & 0 \le x \le \sigma_k^2 \\ 1, & \sigma_k^2 < x \le \sigma_1^2 \end{cases}$$
► Would yield the same result as the TSVD, but is not practical
Use of polynomial filters
► Solution: use a polynomial approximation to φ

► Note: sᵀ = qᵀ A φ(AᵀA) requires only matrix-vector products

► Ideal for situations where the data must be explored only once or a small number of times

► Details skipped – see:

E. Kokiopoulou and YS, Polynomial Filtering in Latent Semantic Indexing for Information Retrieval, ACM SIGIR, 2004.
IR: Use of the Lanczos algorithm (J. Chen, YS ’09)
► Lanczos algorithm = projection method on the Krylov subspace Span{v, Av, ..., A^{m−1}v}

► Can get singular vectors with Lanczos, & use them in LSI

► Better: use the Lanczos vectors directly for the projection

► K. Blom and A. Ruhe [SIMAX, vol. 26, 2005] perform a Lanczos run for each query [expensive].

► Proposed: one Lanczos run from a random initial vector; then use the Lanczos vectors in place of the singular vectors.

► In short: results comparable to those of the SVD at a much lower cost (a sketch follows below).
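The following is a rough NumPy sketch of the idea (my reconstruction, not the authors' code): one Golub-Kahan/Lanczos bidiagonalization run from a random vector, whose left vectors U_k replace the singular vectors in the projector:

```python
import numpy as np

def lanczos_left_vectors(A, k, seed=0):
    """k left Lanczos (bidiagonalization) vectors of A from one random start.

    No reorthogonalization -- fine for a sketch; real code would add it."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(A.shape[0]); u /= np.linalg.norm(u)
    vecs = [u]
    w = A.T @ u; alpha = np.linalg.norm(w); v = w / alpha
    for _ in range(k - 1):
        w = A @ v - alpha * u                 # beta u_{j+1} = A v_j - alpha_j u_j
        beta = np.linalg.norm(w); u = w / beta
        vecs.append(u)
        w = A.T @ u - beta * v                # alpha v_{j+1} = A^T u_{j+1} - beta v_j
        alpha = np.linalg.norm(w); v = w / alpha
    return np.column_stack(vecs)

def lanczos_scores(A, q, k):
    """Similarity s = A^T (U_k U_k^T q): Lanczos vectors in place of the SVD's U_k."""
    Uk = lanczos_left_vectors(A, k)
    return A.T @ (Uk @ (Uk.T @ q))
```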
Tests: IR
Information retrieval datasets:

Dataset | # Terms | # Docs | # Queries | Sparsity
MED     | 7,014   | 1,033  | 30        | 0.735
CRAN    | 3,763   | 1,398  | 225       | 1.412
[Figure: preprocessing time vs. iterations for lanczos and tsvd, on the Med and Cran datasets]
Average retrieval precision
[Figure: average retrieval precision vs. iterations for lanczos, tsvd, lanczos-rf, and tsvd-rf, on the Med and Cran datasets]

Retrieval precision comparisons
Supervised learning: classification
Problem: Given labels (say "A" and "B") for each item of a given set, find a mechanism to classify an unlabelled item into either the "A" or the "B" class.

► Many applications.

► Example: distinguish SPAM and non-SPAM messages

► Can be extended to more than 2 classes.
Supervised learning: classification
► Best illustration: the handwritten digit recognition example

Given: a set of labeled samples (the training set), and an (unlabeled) test image.

Problem: find the label of the test image
[Figure: labeled training images (Digit 0, Digit 1, ..., Digit 9) and an unlabeled test image ("Digit ?") pass through dimension reduction]
► Roughly speaking: we seek a dimension reduction such that recognition is 'more effective' in the low-dimensional space
Supervised learning: Linear classification
Linear classifiers: Find a hyperplane which best separates the data in classes A and B.

► Example of application: distinguish between SPAM and non-SPAM e-mails

[Figure: a linear classifier separating the two classes]

► Note: The world is non-linear. Often this is combined with kernels – which amounts to changing the inner product
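A minimal linear-classifier sketch (scikit-learn on synthetic blobs; all names illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)  # classes "A"/"B"

clf = LinearSVC().fit(X, y)                   # fit a separating hyperplane
print(clf.coef_, clf.intercept_)              # hyperplane w^T x + b = 0
print(clf.predict([[0.0, 2.0]]))              # classify an unlabeled item
```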
A harder case:
[Figure: a dataset that spectral bisection (PDDP) cannot separate linearly]
► Use kernels to transform
[Figure: projection with kernels, σ² = 2.7463 – the transformed data with a Gaussian kernel]
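A sketch of this kernel trick using kernel PCA with a Gaussian (RBF) kernel (scikit-learn; the σ² in the figure is specific to the talk's dataset, so the gamma below is just an assumption):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Concentric circles: a linear method cannot separate the two classes
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Gaussian kernel k(x, z) = exp(-gamma ||x - z||^2); gamma plays the role of 1/(2 sigma^2)
Y = KernelPCA(n_components=2, kernel="rbf", gamma=2.0).fit_transform(X)
```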
GRAPH-BASED TECHNIQUES
Graph-based methods
► Start with a graph of the data, e.g., the graph of k nearest neighbors (k-NN graph). Want: perform a projection which preserves the graph in some sense

► Define a graph Laplacean: L = D − W
[Figure: data points x_i, x_j joined by graph edges]

with, e.g.,

$$w_{ij} = \begin{cases} 1 & \text{if } j \in \mathrm{Adj}(i) \\ 0 & \text{else} \end{cases}, \qquad D = \mathrm{diag}(d_{ii}), \quad d_{ii} = \sum_{j \ne i} w_{ij}$$

and Adj(i) = the neighborhood of i (excluding i)
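A sketch of this construction (scikit-learn for the k-NN graph; the symmetrization choice is an assumption):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph

def knn_laplacian(X, k=10):
    """L = D - W for a symmetrized k-NN graph; X has one point per row."""
    W = kneighbors_graph(X, n_neighbors=k, mode="connectivity",
                         include_self=False)
    W = W.maximum(W.T)                        # keep an edge if either endpoint selects it
    d = np.asarray(W.sum(axis=1)).ravel()     # degrees d_ii = sum_{j != i} w_ij
    return sp.diags(d) - W
```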
A side note: Graph partitioning
If x is a vector of signs (±1), then

$$x^\top L x = 4 \times (\text{number of edge cuts})$$

where an edge cut is a pair (i, j) with x_i ≠ x_j

► Consequence: can be used for partitioning graphs, or 'clustering' [take p = sign(u_2), where u_2 = the second smallest eigenvector – see the sketch below]
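In code, the sign-of-u_2 recipe is a few lines (a sketch; for large or disconnected graphs one would use shift-invert and handle eigenvalue multiplicities):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_bisection(L):
    """Partition labels +/-1 from the 2nd smallest eigenvector of a sparse L."""
    vals, vecs = eigsh(L, k=2, which="SM")    # two smallest eigenpairs
    u2 = vecs[:, np.argsort(vals)[1]]         # Fiedler vector
    return np.sign(u2)
```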
Example: The Laplacean eigenmaps approach
Laplacean Eigenmaps [Belkin-Niyogi '01] *minimizes*

$$\mathcal{F}(Y) = \sum_{i,j=1}^{n} w_{ij}\, \| y_i - y_j \|^2 \quad \text{subject to} \quad Y D Y^\top = I$$
Motivation: if ‖x_i − x_j‖ is small (original data), we want ‖y_i − y_j‖ to also be small (low-dimensional data)

► Original data used only indirectly, through its graph

► Leads to an n × n sparse eigenvalue problem [in 'sample' space]
[Figure: neighboring points x_i, x_j map to nearby points y_i, y_j]
► Problem translates to:

$$\min_{Y \in \mathbb{R}^{d\times n},\; Y D Y^\top = I} \ \mathrm{Tr}\left[ Y (D - W) Y^\top \right].$$

► Solution (sort eigenvalues increasingly):

$$(D - W)\, u_i = \lambda_i D u_i; \qquad y_i = u_i^\top; \quad i = 1, \dots, d$$
► Note: can assume D = I (amounts to rescaling the data). The problem becomes

$$(I - W)\, u_i = \lambda_i u_i; \qquad y_i = u_i^\top; \quad i = 1, \dots, d$$
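A sketch of the d-dimensional embedding from the weight matrix (SciPy; the plain "SM" solve is the slow-but-simple option, and real code would use shift-invert):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def eigenmaps(W, d):
    """Rows of Y are eigenvectors of (D - W) u = lambda D u for the d smallest
    nontrivial eigenvalues (the trivial constant eigenvector is skipped)."""
    deg = np.asarray(W.sum(axis=1)).ravel()
    D = sp.diags(deg).tocsc()
    vals, vecs = eigsh(sp.diags(deg) - W, k=d + 1, M=D, which="SM")
    order = np.argsort(vals)
    return vecs[:, order[1:d + 1]].T           # Y is d x n
```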
Locally Linear Embedding (Roweis-Saul '00)

► LLE is very similar to Eigenmaps. Main differences:

1) The graph Laplacean matrix is replaced by an 'affinity' graph

2) The objective function is changed.

1. Graph: Each x_i is written as a convex combination of its k nearest neighbors:

$$x_i \approx \sum_{j \in N_i} w_{ij} x_j, \qquad \sum_{j \in N_i} w_{ij} = 1$$

► Optimal weights computed ('local calculation') by minimizing ‖x_i − Σ_j w_ij x_j‖ for i = 1, ..., n
2. Mapping:
The y_i's should obey the same 'affinity' as the x_i's. Minimize:

$$\sum_i \Big\| y_i - \sum_j w_{ij} y_j \Big\|^2 \quad \text{subject to:} \quad Y\mathbf{1} = 0, \quad Y Y^\top = I$$
Solution:

$$(I - W^\top)(I - W)\, u_i = \lambda_i u_i; \qquad y_i = u_i^\top.$$

► (I − Wᵀ)(I − W) replaces the graph Laplacean of Eigenmaps (a sketch follows below)
ONPP (Kokiopoulou and YS ’05)
► Orthogonal Neighborhood Preserving Projections

► A linear (orthogonal) version of LLE, obtained by writing Y in the form Y = VᵀX

► Same graph as LLE. Objective: preserve the affinity graph (as in LLE) *but* with the constraint Y = VᵀX
► Problem solved to obtain the mapping:

$$\min_{V} \; \mathrm{Tr}\left[ V^\top X (I - W^\top)(I - W) X^\top V \right] \quad \text{s.t.} \quad V^\top V = I$$
► In LLE, VᵀX is replaced by a free Y
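A dense sketch of this eigenproblem (my reconstruction from the formula above, not the authors' code); X holds one data point per column:

```python
import numpy as np

def onpp(X, W, d):
    """V = bottom d eigenvectors of M = X (I - W)^T (I - W) X^T; Y = V^T X."""
    n = W.shape[0]
    E = np.eye(n) - W
    M = X @ E.T @ E @ X.T                      # m x m symmetric matrix
    vals, vecs = np.linalg.eigh(M)             # eigenvalues in ascending order
    V = vecs[:, :d]
    return V, V.T @ X                          # explicit linear mapping x -> V^T x
```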
Implicit vs explicit mappings
► In PCA the mapping Φ from the high-dimensional space (ℝ^m) to the low-dimensional space (ℝ^d) is explicitly known:

$$y = \Phi(x) \equiv V^\top x$$

► In Eigenmaps and LLE we only know

$$y_i = \Phi(x_i), \quad i = 1, \dots, n$$

► The mapping Φ is complex (implicit): difficult to get Φ(x) for an arbitrary x not in the sample.

► Inconvenient for classification

► This is "the out-of-sample extension" problem
Face Recognition – background

Problem: We are given a database of images [arrays of pixel values], and a test (new) image.
Question: Does this new image correspond to one of those in the database?

Difficulty: positions, expressions, lighting, ...
Example: Eigenfaces [Turk-Pentland, ’91]
► Idea identical with the one we saw for digits:

– Consider each picture as a (1-D) column of all its pixels

– Put the columns together into an array A of size #_pixels × #_images
[Figure: face images vectorized and stacked as the columns of A]
– Do an SVD of A and perform the comparison with any test image in the low-dimensional space (sketch below)
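A compact sketch of the eigenfaces pipeline (illustrative names; centering is included here, though some variants skip it):

```python
import numpy as np

def eigenfaces(A, d):
    """A: (#pixels x #images), one vectorized face per column."""
    mu = A.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(A - mu, full_matrices=False)
    Ud = U[:, :d]                              # the d 'eigenfaces'
    return mu, Ud, Ud.T @ (A - mu)             # database coordinates in d-D

def closest_face(mu, Ud, coords, test_image):
    """Index of the nearest database face in the reduced space."""
    y = Ud.T @ (test_image.reshape(-1, 1) - mu)
    return int(np.argmin(np.linalg.norm(coords - y, axis=0)))
```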
Graph-based methods in a supervised setting
Graph-based methods can be adapted to supervised mode. Idea: build G so that nodes in the same class are neighbors. If c = # classes, G consists of c cliques.

► Weight matrix W is block-diagonal:

$$W = \begin{pmatrix} W_1 & & & \\ & W_2 & & \\ & & \ddots & \\ & & & W_c \end{pmatrix}$$

► Note: rank(W) = n − c.

► As before, the graph Laplacean: L_c = D − W
► Can be used for ONPP and other graph-based methods

► Improvement: add a repulsion Laplacean [Kokiopoulou & YS '09]
[Figure: three class cliques: Class 1, Class 2, Class 3]

Leads to an eigenvalue problem with the matrix:

$$L_c - \rho L_R$$

• L_c = class Laplacean, • L_R = repulsion Laplacean, • ρ = parameter
Test: ORL – 40 subjects, 10 sample images each

# of pixels: 112 × 92; total # of images: 400
[Figure: recognition accuracy on ORL (5 training images per class) for onpp, pca, olpp-R, laplace, fisher, onpp-R, and olpp]
► Observation: some values of ρ yield better results than using the optimum ρ obtained from maximizing the trace ratio
Conclusion
► Interesting new matrix problems in areas that involve the effective mining of data

► Among the most pressing issues is that of reducing computational cost – [SVD, SDP, ..., too costly]

► Many online resources available

► Huge potential in areas like materials science, though inertia has to be overcome

► To a researcher in computational linear algebra: a big tide of change in the types of problems, algorithms, frameworks, culture, ...

► But change should be welcome
"When one door closes, another opens; but we often look so long and so regretfully upon the closed door that we do not see the one which has opened for us."

– Alexander Graham Bell (1847-1922)
► In the words of "Who Moved My Cheese?" [Spencer Johnson, 2002]:

"If you do not change, you can become extinct!"