
Dimension Reduction
CSE 6242 / CX 4242

Thanks: Prof. Jaegul Choo, Dr. Ramakrishnan Kannan, Prof. Le Song

What is Dimension Reduction?

[Diagram: a data matrix with dimension index (d) and data item index (n), columns as data items, mapped by dimension reduction to low-dim data. How big is this? Why?]

Attribute = Feature = Variable = Dimension

Image Data
- Raw images → pixel values → serialized/rasterized pixel values
- A 4K (4096 x 2160) image has about 8.8 million pixels in total

Video Data
- Each frame: raw image → pixel values → serialized/rasterized pixel values
- Huge dimensions: a 4096 x 2160 frame → 8,847,360 dimensions
- At 30 fps, a 2-minute video generates a matrix of size 8,847,360 x 3,600
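To make these sizes concrete, here is a minimal NumPy sketch (not from the lecture; the tiny 6x8 "frames" are a made-up stand-in for real 4K frames) that rasterizes each frame into a column vector and stacks the frames into a d x n matrix, following the columns-as-data-items convention used later in the lecture.

```python
import numpy as np

# Hypothetical toy "video": 4 grayscale frames of 6 x 8 pixels (a real 4K frame is 2160 x 4096).
frames = np.random.rand(4, 6, 8)

# Rasterize: each frame becomes one column vector of length d = 6 * 8 = 48.
n, h, w = frames.shape
X = frames.reshape(n, h * w).T          # shape (d, n) = (48, 4), columns as data items
print(X.shape)                          # (48, 4)

# The slide's arithmetic: a 4096 x 2160 frame has 4096 * 2160 = 8,847,360 dimensions,
# and 2 minutes at 30 fps gives 2 * 60 * 30 = 3,600 frames -> an 8,847,360 x 3,600 matrix.
print(4096 * 2160, 2 * 60 * 30)         # 8847360 3600
```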

Text Documents
- Bag-of-words vectors
  - Document 1 = "Life of Pi won Oscar"
  - Document 2 = "Life of Pi is also a book."

Vocabulary   Doc1   Doc2
Life          1      1
Pi            1      1
movies        0      0
also          0      1
oscar         1      0
book          0      1
won           1      0
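A minimal scikit-learn sketch (one possible tool, not necessarily what the lecture used) that reproduces the bag-of-words table above; the vocabulary is fixed to the seven terms on the slide.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Life of Pi won Oscar", "Life of Pi is also a book."]

# Fix the vocabulary to the slide's seven terms (CountVectorizer lowercases by default).
vocab = ["life", "pi", "movies", "also", "oscar", "book", "won"]
vectorizer = CountVectorizer(vocabulary=vocab)

X = vectorizer.fit_transform(docs)      # sparse (2 x 7) document-term matrix
print(X.toarray())
# [[1 1 0 0 1 0 1]   <- Doc1
#  [1 1 0 1 0 1 0]]  <- Doc2
```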

Two Axes of a Data Set
- Data items: how many data items?
- Dimensions: how many dimensions represent each item?
[Diagram: data matrix with dimension index (d) and data item index (n); columns as data items vs. rows as data items. We will use columns as data items during the lecture.]

Dimension Reduction

[Diagram: dimension reduction takes as input the high-dim data (d x n), the number of reduced dimensions k (user-specified), additional info about the data, and other parameters; it outputs the low-dim data (k x n) and a dim-reducing transformation for new data.]

Benefits of Dimension Reduction
Obviously,
- Compression
- Visualization
- Faster computation
  - e.g., computing distances between 100,000-dim vs. 10-dim vectors
More importantly,
- Noise removal (improving data quality)
  - Separates the data into general pattern + sparse + noise
  - Is the noise the important signal?
- Works as pre-processing for better performance
  - e.g., microarray data analysis, information retrieval, face recognition, protein disorder prediction, network intrusion detection, document categorization, speech recognition

Two Main Techniques
1. Feature selection
   - Selects a subset of the original variables as reduced dimensions relevant for a particular task
   - e.g., the number of genes responsible for a particular disease may be small
2. Feature extraction
   - Each reduced dimension combines multiple original dimensions
   - The original data set is transformed into other numbers

Feature = Variable = Dimension

Feature Selection
- What is the optimal subset of m features that maximizes a given criterion?
- Widely used criteria: information gain, correlation, ...
- Typically a combinatorial optimization problem, so greedy methods are popular (see the sketch below)
  - Forward selection: empty set → add one variable at a time
  - Backward elimination: entire set → remove one variable at a time
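A minimal sketch of greedy forward selection; the criterion here, cross-validated accuracy of a logistic regression, is an arbitrary illustrative choice rather than the lecture's.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, m):
    """Greedily add one feature at a time until m features are selected."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < m and remaining:
        # Score each candidate subset = current selection + one more feature.
        scores = [(cross_val_score(LogisticRegression(max_iter=1000),
                                   X[:, selected + [j]], y, cv=5).mean(), j)
                  for j in remaining]
        best_score, best_j = max(scores)          # pick the feature that helps most
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

X, y = load_iris(return_X_y=True)
print(forward_selection(X, y, m=2))               # indices of the two greedily chosen features
```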

Feature Extraction

Aspects of Dimension Reduction

- Linear vs. Nonlinear
- Unsupervised vs. Supervised
- Global vs. Local
- Feature vectors vs. Similarity (as an input)

Linear vs. Nonlinear
Linear
- Represents each reduced dimension as a linear combination of the original dimensions
- Of the form AX + b, where A, X, and b are vectors/matrices
  - e.g., Y1 = 3*X1 - 4*X2 + 0.3*X3 - 1.5*X4
          Y2 = 2*X1 + 3.2*X2 - X3 + 2*X4
- Naturally capable of mapping new data to the same space

Example dimension reduction (original dimensions X1..X4, reduced dimensions Y1, Y2, data items D1, D2):

      D1  D2               D1     D2
X1     1   1          Y1  1.75  -0.27
X2     1   0    →     Y2 -0.21   0.58
X3     0   2
X4     1   1
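A small NumPy sketch of such a linear mapping, using the illustrative coefficients for Y1 and Y2 from the slide and the D1, D2 data items from the table; note that the resulting numbers differ from the table above, which was produced with a different (unspecified) mapping.

```python
import numpy as np

# Rows of A hold the coefficients of Y1 and Y2 from the slide's example formulas.
A = np.array([[3.0, -4.0, 0.3, -1.5],
              [2.0,  3.2, -1.0,  2.0]])

# Data items D1, D2 as columns over the original dimensions X1..X4.
X = np.array([[1, 1],
              [1, 0],
              [0, 2],
              [1, 1]], dtype=float)

Y = A @ X                      # linear dimension reduction: (2 x 4) @ (4 x 2) -> (2 x 2)
print(Y)

# The same A maps new data into the same reduced space:
x_new = np.array([[0.5], [1.0], [2.0], [0.0]])
print(A @ x_new)
```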

Nonlinear
- More complicated, but generally more powerful
- Recently popular topics

Unsupervised vs. Supervised
Unsupervised
- Uses only the input data


Supervised
- Uses the input data + additional info
  - e.g., grouping labels


Global vs. Local

- Dimension reduction typically tries to preserve all the relationships/distances in the data
- Information loss is unavoidable! Then, what should we emphasize more?
Global
- Treats all pairwise distances as equally important
- Focuses on preserving large distances
Local
- Focuses on small distances and neighborhood relationships
- Active research area, e.g., manifold learning

Feature vectors vs. Similarity (as an input)

- Typical setup (feature vectors as an input): high-dim data (dimension index d, n items) → dimension reduction → low-dim data with reduced dimension k

- Alternatively, takes a similarity matrix instead
  - The (i, j)-th component indicates the similarity between the i-th and j-th data items
  - Assuming the distance is a metric, the similarity matrix is symmetric

- Internally, such a method converts the feature vectors (d x n) to a similarity matrix (n x n) before performing dimension reduction to the low-dim data (k x n)

Graph Embedding
- Why is this called graph embedding? The similarity matrix can be viewed as a graph in which similarity represents edge weight
- High-dim data (d x n) → similarity matrix (n x n) → dimension reduction → low-dim data (k x n)
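A minimal sketch of turning feature vectors into a similarity matrix; the Gaussian (RBF) kernel used here is one common choice of similarity, not necessarily the one a particular method uses internally.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical data: n = 5 items in d = 3 dimensions (rows as items, scikit-learn's convention).
X = np.random.rand(5, 3)

S = rbf_kernel(X, gamma=1.0)        # (n x n) matrix, S[i, j] = exp(-gamma * ||x_i - x_j||^2)
print(S.shape)                      # (5, 5)
print(np.allclose(S, S.T))          # True: the similarity matrix is symmetric

# Graph-embedding view: treat S as a weighted graph whose edge weights are the similarities.
```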

Methods
Traditional
- Principal component analysis (PCA)
- Multidimensional scaling (MDS)
- Linear discriminant analysis (LDA)
Advanced (nonlinear, kernelized, manifold learning)
- Isometric feature mapping (Isomap)

* Matlab code is available at http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html

Principal Component Analysis (PCA)
- Finds the axis showing the largest variation and projects all points onto that axis
- Reduced dimensions are orthogonal
- Algorithm: eigendecomposition
- Pros: fast
- Cons: limited performance
- Properties: Linear, Unsupervised, Global, Feature vectors
[Figure: data points with principal axes PC1 and PC2. Image source: http://en.wikipedia.org/wiki/Principal_component_analysis]

PCA – Some Questions

Algorithm (see the sketch below)
1. Subtract the mean from the dataset: (X - μ)
2. Form the covariance matrix: (X - μ)'(X - μ)
3. Perform eigendecomposition on this covariance matrix
Key Questions
- Why the covariance matrix?
- SVD on the original matrix vs. eigendecomposition on the covariance matrix
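A minimal NumPy sketch of the PCA steps above (rows as data items here, unlike the lecture's columns-as-items convention); it also touches the second key question by checking that SVD of the centered matrix gives the same axes as eigendecomposition of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 data items, 5 dimensions (rows as items)

# 1. Subtract the mean.
Xc = X - X.mean(axis=0)

# 2. Covariance matrix (X - mu)' (X - mu), with the usual 1/(n-1) factor.
C = Xc.T @ Xc / (Xc.shape[0] - 1)

# 3. Eigendecomposition of the (symmetric) covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]        # sort axes by decreasing variance
W = eigvecs[:, order[:2]]                # top-2 principal axes
Y = Xc @ W                               # reduced data, shape (100, 2)

# SVD on the centered matrix yields the same axes (up to sign):
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(np.abs(Vt[:2].T), np.abs(W)))                 # True
print(np.allclose(s**2 / (Xc.shape[0] - 1), eigvals[order]))    # eigenvalues from singular values
```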

Multidimensional Scaling (MDS)
Main idea
- Tries to preserve the given pairwise distances in the low-dimensional space, i.e., make each low-dim distance match the corresponding ideal (given) distance
- Metric MDS: preserves the given distance values
- Nonmetric MDS: when you only know/care about the ordering of distances; preserves only the ordering of the distance values
- Algorithm: gradient-descent type
- cf. classical MDS is the same as PCA
- Properties: Nonlinear, Unsupervised, Global, Similarity input
- Pros: widely used (works well in general)
- Cons: slow (an n-body problem)
  - Nonmetric MDS is even slower than metric MDS
  - Fast algorithms are available: the Barnes-Hut algorithm, GPU-based implementations
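A minimal scikit-learn sketch of metric and nonmetric MDS on a precomputed distance matrix; the toy data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X = np.random.rand(50, 10)                      # 50 items in 10 dimensions (rows as items)
D = pairwise_distances(X)                       # (n x n) "ideal" pairwise distances

# Metric MDS: tries to preserve the distance values themselves.
Y_metric = MDS(n_components=2, dissimilarity="precomputed",
               metric=True, random_state=0).fit_transform(D)

# Nonmetric MDS: tries to preserve only the ordering of the distances (slower).
Y_nonmetric = MDS(n_components=2, dissimilarity="precomputed",
                  metric=False, random_state=0).fit_transform(D)

print(Y_metric.shape, Y_nonmetric.shape)        # (50, 2) (50, 2)
```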

Linear Discriminant Analysis (LDA)
- What if clustering information is available?
- LDA tries to separate clusters by
  - Putting different clusters as far apart as possible
  - Making each cluster as compact as possible

(Recall: supervised dimension reduction uses the input data + additional info, e.g., a grouping label.)

Linear Discriminant Analysis (LDA) vs. Principal Component Analysis
[Figure: 2D visualizations of a mixture of 7 Gaussians in 1,000 dimensions, one produced by linear discriminant analysis (supervised) and one by principal component analysis (unsupervised).]

LDA algorithm (two classes)
1. Compute the mean of each class and the global mean (μ1, μ2, μ)
2. Compute the within-class scatter matrix Sw from the class-specific covariances
3. Compute the between-class scatter matrix Sb using the means
4. Find the eigenvectors corresponding to the (largest) eigenvalues of inv(Sw)*Sb
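A minimal NumPy sketch of these steps, written for the general C-class case (rows as data items; pinv is used in case Sw is singular).

```python
import numpy as np

def lda_directions(X, y, k):
    """Top-k LDA projection vectors for data X (rows = items) with class labels y."""
    mu = X.mean(axis=0)                               # global mean
    Sw = np.zeros((X.shape[1], X.shape[1]))           # within-class scatter
    Sb = np.zeros_like(Sw)                            # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - mu).reshape(-1, 1)
        Sb += Xc.shape[0] * (diff @ diff.T)
    # Eigenvectors of inv(Sw) * Sb, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:k]].real

# Toy example: 3 classes of 30 items each in 5 dimensions, reduced to 2 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(30, 5)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 30)
Y = X @ lda_directions(X, y, k=2)                     # (90, 2) reduced data
print(Y.shape)
```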

LDA, C classes (slide from CSCE 666 Pattern Analysis, Ricardo Gutierrez-Osuna, CSE@TAMU*)
- Fisher's LDA generalizes gracefully to C-class problems
  - Instead of one projection y, we now seek (C - 1) projections [y_1, y_2, ..., y_{C-1}] by means of (C - 1) projection vectors w_i arranged by columns into a projection matrix W = [w_1 | w_2 | ... | w_{C-1}]:
    y_i = w_i^T x  ⇒  y = W^T x
- Derivation
  - The within-class scatter generalizes as
    S_W = Σ_{i=1}^{C} S_i,  where S_i = Σ_{x∈ω_i} (x - μ_i)(x - μ_i)^T  and  μ_i = (1/N_i) Σ_{x∈ω_i} x
  - And the between-class scatter becomes
    S_B = Σ_{i=1}^{C} N_i (μ_i - μ)(μ_i - μ)^T,  where μ = (1/N) Σ_{∀x} x = (1/N) Σ_{i=1}^{C} N_i μ_i
  - The matrix S_T = S_B + S_W is called the total scatter
[Figure: three classes in the (x1, x2) plane with their within-class scatters S_W1, S_W2, S_W3 and between-class scatters S_B1, S_B2, S_B3 relative to the overall mean.]

* http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf

Linear Discriminant Analysis
- Maximally separates clusters by
  - Putting different clusters far apart
  - Shrinking each cluster compactly
- Algorithm: generalized eigendecomposition
- Pros: better at showing cluster structure
- Cons: may distort the original relationships in the data
- Properties: Linear, Supervised, Global, Feature vectors

Methods
Traditional
- Principal component analysis (PCA)
- Multidimensional scaling (MDS)
- Linear discriminant analysis (LDA)
Advanced (nonlinear, kernelized, manifold learning)
- Isometric feature mapping (Isomap)

* Matlab code is available at http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html

Manifold Learning: Swiss Roll Data
- Swiss roll data: originally in 3D
- What is the intrinsic dimensionality (allowing flattening)? → 2D
- What if your data has low intrinsic dimensionality but resides in high-dimensional space?

Isomap (Isometric Feature Mapping)
- Let's preserve pairwise geodesic distances (along the manifold)
- Compute each geodesic distance as the shortest-path length in a k-nearest-neighbor (k-NN) graph
- Eigendecomposition* on the pairwise geodesic distance matrix obtains an embedding that best preserves the given distances
- Algorithm: all-pairs shortest-path computation + eigendecomposition
- Pros: performs well in general
- Cons: slow (shortest paths), sensitive to parameters
- Properties: Nonlinear, Unsupervised, Global (all pairwise distances are considered), Feature vectors

* Eigendecomposition is the main algorithm of PCA
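A minimal scikit-learn sketch of Isomap on the Swiss roll data discussed above; n_neighbors is the parameter the "sensitive to parameters" caveat refers to, and 10 is an arbitrary illustrative value.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, t = make_swiss_roll(n_samples=1000, random_state=0)    # 3D Swiss roll
iso = Isomap(n_neighbors=10, n_components=2)               # k-NN graph + shortest paths + eigendecomposition
Y = iso.fit_transform(X)                                    # "unrolled" 2D coordinates
print(Y.shape)                                              # (1000, 2)
```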

Practitioner’s Guide: Caveats
- Trustworthiness of dimension reduction results
  - Inevitable distortion/information loss in 2D/3D
  - The best result of a method may not align with what we want, e.g., PCA visualization of facial image data
[Figure: PCA of facial images plotted on the (1st, 2nd) dimensions vs. the (3rd, 4th) dimensions.]

Practitioner’s Guide: General Recommendations
- Want something simple and fast to visualize data?
  - PCA, force-directed layout
- Want to first try a manifold learning method?
  - Isomap: it is the method that will empirically give the best result
- Have a cluster label to use (pre-given or computed)?
  - LDA (supervised)
  - A supervised approach is sometimes the only viable option when your data do not have clearly separable clusters

Practitioner’s Guide: Results Still Not Good?
- Try various pre-processing (see the sketch below)
  - Data centering: subtract the global mean from each vector
  - Normalization: make each vector have unit Euclidean norm
    - Otherwise, a few outliers can affect dimension reduction significantly
- Application-specific pre-processing
  - Documents: TF-IDF weighting; remove terms that are too rare and/or too short
  - Images: histogram normalization
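A minimal NumPy sketch of the two generic pre-processing steps above, centering followed by unit-norm normalization (rows as data items; the small constant guards against all-zero vectors).

```python
import numpy as np

X = np.random.rand(100, 20)                          # 100 items, 20 dimensions (rows as items)

# Data centering: subtract the global mean from each vector.
Xc = X - X.mean(axis=0)

# Normalization: make each vector have unit Euclidean norm.
norms = np.linalg.norm(Xc, axis=1, keepdims=True)
Xn = Xc / np.maximum(norms, 1e-12)                   # avoid division by zero

print(np.allclose(np.linalg.norm(Xn, axis=1), 1.0))  # True
```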

Practitioner’s Guide: Too Slow?
- Apply PCA to reduce to an intermediate number of dimensions before the main dimension reduction step (see the sketch below)
  - The results may even be better, because PCA removes noise
- See if there is an approximate but faster version
  - Landmark versions (use only a subset of data items), e.g., landmark Isomap
  - Linearized versions (the same criterion, but only linear mappings allowed), e.g., Laplacian Eigenmaps → Locality Preserving Projection
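A minimal scikit-learn sketch of the PCA-first idea: reduce to an intermediate 50 dimensions with PCA, then run the slower main method; the data, dimensions, and choice of Isomap as the main method are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline

X = np.random.rand(500, 1000)                        # 500 items in 1,000 dimensions

# PCA down to an intermediate 50 dimensions, then the main (slower) dimension reduction.
pipeline = make_pipeline(PCA(n_components=50), Isomap(n_neighbors=10, n_components=2))
Y = pipeline.fit_transform(X)
print(Y.shape)                                       # (500, 2)
```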

Practitioner’s Guide: Still Need More?
- Tweak dimension reduction for your own purpose
  - Play with its algorithm, convergence criteria, etc.
  - See if you can impose label information
  - Restrict the number of iterations to save computation time
- The main purpose of DR is to serve us in exploring data and solving complicated real-world problems

Take Away

                         PCA   MDS   LDA   Isomap
Supervised                ✖     ✖     ✔     ✖
Linear                    ✔     ✖     ✔     ✖
Global                    ✔     ✔     ✔     ✔
Feature vectors (input)   ✔     ✖     ✔     ✔

Useful Resources
- Tutorial on PCA: http://arxiv.org/pdf/1404.1100.pdf
- Tutorial on LDA: http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf
- Review article: http://www.iai.uni-bonn.de/~jz/dimensionality_reduction_a_comparative_review.pdf
- Matlab toolbox for dimension reduction: http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
- Matlab manifold learning demo: http://www.math.ucla.edu/~wittman/mani/
