Dimension Reduction
CSE 6242 / CX 4242
Thanks: Prof. Jaegul Choo, Dr. Ramakrishnan Kannan, Prof. Le Song
What is Dimension Reduction?
[Diagram: a data matrix with dimension index (d) as rows and data item index (n) as columns (columns as data items), reduced via Dimension Reduction to low-dim data. How big is this?]
Why?
Attribute = Feature = Variable = Dimension
Image Data
Serialized/rasterized pixel values
[Figure: raw images → pixel values → serialized pixels (e.g., 3, 80, 24, 58, 63, 45, 5, 34, 78, …), one long vector per image]
A 4K (4096×2160) image has roughly 8.8 million pixels in total.
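As a rough sketch of the serialization step (not from the slides), the pixel grid is simply flattened into one long vector; the frame below is randomly generated just to make the shape concrete.

```python
import numpy as np

# Hypothetical 4K grayscale frame: 2160 rows x 4096 columns of pixel intensities.
frame = np.random.randint(0, 256, size=(2160, 4096), dtype=np.uint8)

# Serialization/rasterization: flatten the 2D pixel grid into one long vector.
pixel_vector = frame.reshape(-1)

print(pixel_vector.shape)   # (8847360,) -- each image becomes one ~8.8M-dim data item
```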
Video Data
Serialized/rasterized pixel values
Huge dimensions: a 4096×2160 frame → 8,847,360 dimensions
At 30 fps, a 2-minute video therefore produces a matrix of size 8,847,360 × 3,600
[Figure: raw frames → pixel values → serialized pixels, one ~8.8M-dim column per frame]
Text Documents
Bag-of-words vector
Document 1 = “Life of Pi won Oscar”
Document 2 = “Life of Pi is also a book.”

Vocabulary   Doc1   Doc2
Life           1      1
Pi             1      1
movies         0      0
also           0      1
oscar          1      0
book           0      1
won            1      0
…
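A minimal bag-of-words sketch for the two example documents above; the lower-casing and naive whitespace tokenization are illustrative simplifications, not part of the lecture.

```python
# Vocabulary from the table above (order determines the vector layout).
vocabulary = ["life", "pi", "movies", "also", "oscar", "book", "won"]

doc1 = "Life of Pi won Oscar"
doc2 = "Life of Pi is also a book."

def bag_of_words(text, vocab):
    # Naive tokenization: lower-case, strip periods, split on whitespace.
    tokens = text.lower().replace(".", "").split()
    return [1 if word in tokens else 0 for word in vocab]

print(bag_of_words(doc1, vocabulary))   # [1, 1, 0, 0, 1, 0, 1]
print(bag_of_words(doc2, vocabulary))   # [1, 1, 0, 1, 0, 1, 0]
```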
Two Axes of Data Set
Data items: how many data items are there?
Dimensions: how many dimensions represent each item?
[Figure: data matrix with dimension index (d) as rows and data item index (n) as columns; columns as data items vs. rows as data items. We will use columns as data items during this lecture.]
Dimension Reduction
[Diagram: Dimension Reduction takes high-dim data (dimension index d × data items n), the number of output dimensions k (user-specified), additional info about the data, and other parameters; it produces low-dim data (reduced dimension k × n) and a dim-reducing transformation for new data.]
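In code, this black box typically looks like the sketch below (assuming scikit-learn; the random data, the choice of PCA, and k = 2 are only for illustration). Note that scikit-learn expects rows as data items, whereas the lecture's diagrams use columns.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50)        # 100 data items x 50 dimensions (rows as items)

dr = PCA(n_components=2)           # "no. of dimensions (k)" is user-specified
X_low = dr.fit_transform(X)        # low-dim data: 100 items x 2 dimensions

X_new = np.random.rand(5, 50)
X_new_low = dr.transform(X_new)    # the dim-reducing transformation applied to new data

print(X_low.shape, X_new_low.shape)   # (100, 2) (5, 2)
```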
Benefits of Dimension Reduction
Obviously:
Compression
Visualization
Faster computation, e.g., computing distances between 100,000-dim vs. 10-dim vectors
More importantly:
Noise removal (improving data quality): separates the data into general pattern + sparse + noise (or is the noise the important signal?)
Works as pre-processing for better performance, e.g., microarray data analysis, information retrieval, face recognition, protein disorder prediction, network intrusion detection, document categorization, speech recognition
Two Main Techniques
1. Feature selection: selects a subset of the original variables as reduced dimensions relevant for a particular task, e.g., the number of genes responsible for a particular disease may be small
2. Feature extraction: each reduced dimension combines multiple original dimensions; the original dataset is transformed into a new set of values
Feature = Variable = Dimension
Feature Selection
What is the optimal subset of m features that maximizes a given criterion?
Widely-used criteria: information gain, correlation, …
These are typically combinatorial optimization problems, so greedy methods are popular (a sketch follows below):
Forward selection: empty set → add one variable at a time
Backward elimination: entire set → remove one variable at a time
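A minimal sketch of greedy forward selection (not the lecture's code), assuming scikit-learn; the Iris data, the logistic-regression classifier, and cross-validated accuracy as the criterion are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))
m = 2                                   # target number of selected features

while len(selected) < m:
    # Greedily add the single feature that most improves the criterion.
    scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("Selected feature indices:", selected)
```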
Feature Extraction
Aspects of Dimension Reduction
Linear vs. Nonlinear
Unsupervised vs. Supervised
Global vs. Local
Feature vectors vs. Similarity (as an input)
Linear vs. Nonlinear
Linear
Represents each reduced dimension as a linear combination of original dimensions
Of the form aX + b, where a, X, and b are vectors/matrices
e.g., Y1 = 3*X1 – 4*X2 + 0.3*X3 – 1.5*X4
      Y2 = 2*X1 + 3.2*X2 – X3 + 2*X4
Naturally capable of mapping new data to the same space
Dimension Reduction (example: original dimensions X1–X4, data items D1, D2 → reduced dimensions Y1, Y2)

        D1   D2                 D1     D2
  X1     1    1           Y1   1.75  -0.27
  X2     1    0     →     Y2  -0.21   0.58
  X3     0    2
  X4     1    1
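To make the example concrete: a linear dimension reduction is just a matrix multiplication Y = A·X. The 2×4 matrix A below reuses the coefficients from the earlier formula and is hypothetical; it is not the map that actually produced the numbers in the table.

```python
import numpy as np

# Columns are data items D1, D2; rows are original dimensions X1..X4 (as in the table).
X = np.array([[1, 1],
              [1, 0],
              [0, 2],
              [1, 1]], dtype=float)

# Hypothetical 2x4 linear map; each row defines one reduced dimension (Y1, Y2).
A = np.array([[3.0, -4.0,  0.3, -1.5],
              [2.0,  3.2, -1.0,  2.0]])

Y = A @ X      # 2x2 matrix: reduced dimensions Y1, Y2 for data items D1, D2
print(Y)
```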
Linear vs. Nonlinear
Linear
Represents each reduced dimension as a linear combination of original dimensions
e.g., Y1 = 3*X1 – 4*X2 + 0.3*X3 – 1.5*X4, Y2 = 2*X1 + 3.2*X2 – X3 + 2*X4
Naturally capable of mapping new data to the same space
Nonlinear
More complicated, but generally more powerful
A recently popular topic
Unsupervised vs. Supervised
Unsupervised
Uses only the input data
[Diagram: high-dim data, no. of dimensions, and other parameters → Dimension Reduction → low-dim data and a dim-reducing transformer for new data; additional info about the data is not used]
Unsupervised vs. Supervised
Supervised
Uses the input data + additional info, e.g., grouping label
[Diagram: high-dim data, no. of dimensions, additional info about the data, and other parameters → Dimension Reduction → low-dim data and a dim-reducing transformer for new data]
Global vs. Local
Dimension reduction typically tries to preserve all the relationships/distances in the data
Information loss is unavoidable! So what should we emphasize more?
Global: treats all pairwise distances as equally important; focuses on preserving large distances
Local: focuses on small distances and neighborhood relationships; an active research area, e.g., manifold learning
Feature vectors vs. Similarity (as an input)
Typical setup (feature vectors as an input)
[Diagram: high-dim data (dimension index d × n items), no. of dimensions (k), additional info about the data, and other parameters → Dimension Reduction → low-dim data (reduced dimension k) and a dim-reducing transformer for new data]
Feature vectors vs. Similarity (as an input)
Typical setup (feature vectors as an input)
Alternatively, takes a similarity matrix instead
The (i, j)-th component indicates the similarity between the i-th and j-th data items
Assuming the distance is a metric, the similarity matrix is symmetric
[Diagram: similarity matrix, no. of dimensions, additional info about the data, and other parameters → Dimension Reduction → low-dim data and a dim-reducing transformer for new data]
Feature vectors vs. Similarity (as an input)
Typical setup (feature vectors as an input)
Alternatively, takes a similarity matrix instead
Internally, converts the feature vectors to a similarity matrix before performing dimension reduction
[Diagram: high-dim data (d×n) → similarity matrix (n×n) → Dimension Reduction (graph embedding) → low-dim data (k×n)]
Feature vectors vs. Similarity (as an input)
Why is this called graph embedding?
A similarity matrix can be viewed as a graph where similarity represents edge weight
[Diagram: high-dim data (d×n) → similarity matrix viewed as a weighted graph → Dimension Reduction → low-dim data (graph embedding)]
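A small sketch (assuming scikit-learn and numpy) of turning feature vectors into a symmetric similarity matrix that can be read as a weighted graph; the Gaussian (RBF) similarity and its bandwidth are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.rand(6, 10)              # 6 data items x 10 dimensions (rows as items)

D = pairwise_distances(X)              # Euclidean distances (a metric)
S = np.exp(-D**2 / (2 * 0.5**2))       # Gaussian similarity: entry (i, j) = edge weight

print(np.allclose(S, S.T))             # True: metric distances give a symmetric matrix
```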
Methods
Traditional: Principal component analysis (PCA), Multidimensional scaling (MDS), Linear discriminant analysis (LDA)
Advanced (nonlinear, kernelized, manifold learning): Isometric feature mapping (Isomap)
* Matlab codes are available at http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
Principal Component Analysis (PCA)
Finds the axis showing the largest variation and projects all points onto this axis
Reduced dimensions are orthogonal
Algorithm: eigendecomposition
Pros: fast
Cons: limited performance
Linear | Unsupervised | Global | Feature vectors
[Figure: 2D scatter with PC1 and PC2 axes; image source: http://en.wikipedia.org/wiki/Principal_component_analysis]
PCA – Some Questions
Algorithm:
Subtract the mean from the dataset (X − μ)
Compute the covariance matrix (X − μ)′(X − μ)
Perform eigendecomposition on this covariance matrix
Key questions (see the sketch below):
Why the covariance matrix?
SVD on the original matrix vs. eigendecomposition on the covariance matrix
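A minimal numpy sketch (not from the slides) of the algorithm above, plus a check that SVD of the centered data yields the same principal directions as eigendecomposition of the covariance matrix; the random data are only for illustration.

```python
import numpy as np

X = np.random.rand(200, 5)                 # 200 data items (rows) x 5 dimensions

Xc = X - X.mean(axis=0)                    # 1. subtract the mean (X - mu)
C = Xc.T @ Xc / (len(X) - 1)               # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)       # 3. eigendecomposition (eigh: C is symmetric)
order = np.argsort(eigvals)[::-1]          #    sort directions by decreasing variance
W = eigvecs[:, order[:2]]                  # top-2 principal directions
X_pca = Xc @ W                             # reduced (2-dim) data

# SVD route: right singular vectors of the centered data are the same directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(np.abs(Vt[:2].T), np.abs(W)))   # True (up to sign)
```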
Multidimensional Scaling (MDS)
Main idea: tries to preserve the given pairwise distances in the low-dimensional space (minimizing the discrepancy between each ideal distance and its low-dim distance)
Metric MDS: preserves the given distance values
Nonmetric MDS: when you only know/care about the ordering of distances; preserves only the orderings of the distance values
Algorithm: gradient-descent type
cf. classical MDS is the same as PCA
Nonlinear | Unsupervised | Global | Similarity input
Multidimensional Scaling
Pros: widely used (works well in general)
Cons: slow (an n-body problem); nonmetric MDS is much slower than metric MDS
Fast algorithms are available (see the sketch below): the Barnes-Hut algorithm, GPU-based implementations
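A rough sketch using scikit-learn's MDS (a SMACOF, i.e., gradient/majorization-type solver) on a precomputed distance matrix; the data and parameters are illustrative only.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X = np.random.rand(100, 20)
D = pairwise_distances(X)                       # the "ideal" pairwise distances

metric_mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y_metric = metric_mds.fit_transform(D)          # preserves the distance values

nonmetric_mds = MDS(n_components=2, metric=False,
                    dissimilarity="precomputed", random_state=0)
Y_rank = nonmetric_mds.fit_transform(D)         # preserves only the ordering of distances

print(Y_metric.shape, Y_rank.shape)             # (100, 2) (100, 2)
```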
Linear Discriminant Analysis (LDA)
What if clustering information is available?
LDA tries to separate clusters by:
Putting different clusters as far apart as possible
Making each cluster as compact as possible
[Figure: panels (a) and (b)]
Recall (supervised dimension reduction): uses the input data + additional info, e.g., grouping labels, alongside the no. of dimensions and other parameters.
Linear Discriminant Analysis (LDA) vs. Principal Component Analysis
[Figure: 2D visualizations of a mixture of 7 Gaussians in 1000 dimensions: linear discriminant analysis (supervised) vs. principal component analysis (unsupervised)]
LDA (two-class case)
Compute the mean of each of the two classes and the global mean (μ1, μ2, μ)
Compute the class-specific (within-class) covariance matrix S_W
Compute the between-class covariance matrix S_B using the means
Find the eigenvectors corresponding to the eigenvalues of inv(S_W) * S_B
(a sketch follows below)
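A minimal two-class sketch of the steps above on synthetic data (the Gaussian clusters are made up); for real use, scikit-learn's LinearDiscriminantAnalysis implements the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(50, 2))     # class 1
X2 = rng.normal(loc=[4.0, 2.0], size=(50, 2))     # class 2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter S_W and (two-class) between-class scatter S_B
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
S_B = np.outer(mu2 - mu1, mu2 - mu1)

# Discriminant direction: top eigenvector of inv(S_W) * S_B
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w = eigvecs[:, np.argmax(eigvals.real)].real

print(w / np.linalg.norm(w))      # projection vector that best separates the two classes
```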
(Source: CSCE 666 Pattern Analysis, Ricardo Gutierrez-Osuna, CSE@TAMU)
LDA, C classes
Fisher's LDA generalizes gracefully for C-class problems
Instead of one projection y, we now seek (C − 1) projections [y1, y2, …, y_{C−1}] by means of (C − 1) projection vectors w_i, arranged by columns into a projection matrix W = [w1 | w2 | … | w_{C−1}]:
    y_i = w_i^T x  ⇒  y = W^T x
Derivation
The within-class scatter generalizes as
    S_W = Σ_{i=1}^{C} S_i,   where S_i = Σ_{x∈ω_i} (x − μ_i)(x − μ_i)^T   and   μ_i = (1/N_i) Σ_{x∈ω_i} x
And the between-class scatter becomes
    S_B = Σ_{i=1}^{C} N_i (μ_i − μ)(μ_i − μ)^T,   where μ = (1/N) Σ_{∀x} x = (1/N) Σ_{i=1}^{C} N_i μ_i
The matrix S_T = S_B + S_W is called the total scatter
[Figure: three-class example in the (x1, x2) plane showing the per-class within-class scatters S_W1, S_W2, S_W3 and between-class scatters S_B1, S_B2, S_B3 around the class centers]
*http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf
Linear Discriminant Analysis
Maximally separates clusters by:
Putting different clusters far apart
Shrinking each cluster compactly
Algorithm: generalized eigendecomposition
Pros: better at showing cluster structure
Cons: may distort the original relationships in the data
Linear | Supervised | Global | Feature vectors
Methods
Traditional: Principal component analysis (PCA), Multidimensional scaling (MDS), Linear discriminant analysis (LDA)
Advanced (nonlinear, kernelized, manifold learning): Isometric feature mapping (Isomap)
* Matlab codes are available at http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
Manifold Learning: Swiss Roll Data
Swiss roll data, originally in 3D
What is the intrinsic dimensionality (allowing flattening)? → 2D
What if your data has low intrinsic dimensionality but resides in a high-dimensional space?
Isomap (Isometric Feature Mapping)
Let's preserve pairwise geodesic distances (along the manifold)
Compute each geodesic distance as the shortest-path length in a k-nearest-neighbor (k-NN) graph
Eigendecomposition* on the pairwise geodesic distance matrix yields the embedding that best preserves the given distances
* Eigendecomposition is the main algorithm of PCA
Isomap (Isometric Feature Mapping)
Algorithm: all-pairs shortest-path computation + eigendecomposition (see the sketch below)
Pros: performs well in general
Cons: slow (shortest paths), sensitive to parameters
Nonlinear | Unsupervised | Global (all pairwise distances are considered) | Feature vectors
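A short sketch, assuming scikit-learn, applying Isomap to its built-in Swiss-roll generator; n_neighbors is the parameter the method is most sensitive to.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)   # 3D Swiss roll

# k-NN graph -> all-pairs shortest (geodesic) paths -> eigendecomposition
iso = Isomap(n_neighbors=10, n_components=2)
X_2d = iso.fit_transform(X)

print(X_2d.shape)      # (1000, 2): the roll flattened into its intrinsic 2D coordinates
```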
Practitioner’s Guide: Caveats
Trustworthiness of dimension reduction results
Distortion/information loss in 2D/3D is inevitable
The best result of a method may not align with what we want, e.g., a PCA visualization of facial image data
[Figure: PCA of facial images plotted on dimensions (1, 2) vs. dimensions (3, 4)]
Practitioner’s Guide: General Recommendation
Want something simple and fast to visualize data? PCA, force-directed layout
Want to first try a manifold learning method? Isomap; empirically it tends to give the best result
Have cluster labels to use (pre-given or computed)? LDA (supervised)
A supervised approach is sometimes the only viable option when your data do not have clearly separable clusters
Practitioner’s Guide: Results Still Not Good?
Try various pre-processing steps (a sketch follows below):
Data centering: subtract the global mean from each vector
Normalization: make each vector have unit Euclidean norm; otherwise, a few outliers can affect the dimension reduction significantly
Application-specific pre-processing
Documents: TF-IDF weighting; remove terms that are too rare and/or too short
Images: histogram normalization
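A minimal numpy sketch of the two generic pre-processing steps above (centering and unit-norm scaling); the random data are placeholders, and whether these steps help depends on your data.

```python
import numpy as np

X = np.random.rand(100, 50)                     # rows as data items

# Data centering: subtract the global mean vector from every data item.
X_centered = X - X.mean(axis=0)

# Normalization: scale each data item to unit Euclidean norm.
norms = np.linalg.norm(X_centered, axis=1, keepdims=True)
X_normalized = X_centered / np.maximum(norms, 1e-12)   # guard against zero vectors

print(np.allclose(np.linalg.norm(X_normalized, axis=1), 1.0))   # True
```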
Practitioner’s Guide: Too Slow?
Apply PCA to reduce the data to an intermediate dimensionality before the main dimension reduction step (see the sketch below)
The results may even improve, thanks to the noise removed by PCA
See if there is an approximate but faster version
Landmark versions (use only a subset of data items), e.g., landmark Isomap
Linearized versions (same criterion, but only a linear mapping is allowed), e.g., Laplacian Eigenmaps → Locality Preserving Projection
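One way to realize the speed-up above, sketched with scikit-learn: PCA down to an intermediate dimensionality, then the slower nonlinear method; the 50 intermediate dimensions and the choice of Isomap as the main step are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline

X = np.random.rand(2000, 1000)             # many high-dimensional data items

# PCA to an intermediate dimensionality (also removes some noise),
# followed by the main (slower) dimension reduction step down to 2D.
pipeline = make_pipeline(PCA(n_components=50), Isomap(n_neighbors=10, n_components=2))
X_2d = pipeline.fit_transform(X)

print(X_2d.shape)                           # (2000, 2)
```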
Practitioner’s Guide: Still Need More?
Tweak dimension reduction for your own purpose
Play with its algorithm, convergence criteria, etc.
See if you can impose label information
Restrict the number of iterations to save computation time
The main purpose of DR is to serve us in exploring data and solving complicated real-world problems
Take Away

                           PCA   MDS   LDA   Isomap
Supervised                  ✖     ✖     ✔     ✖
Linear                      ✔     ✖     ✔     ✖
Global                      ✔     ✔     ✔     ✔
Feature vectors as input    ✔     ✖     ✔     ✔
Useful Resources
Tutorial on PCA: http://arxiv.org/pdf/1404.1100.pdf
Tutorial on LDA: http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf
Review article: http://www.iai.uni-bonn.de/~jz/dimensionality_reduction_a_comparative_review.pdf
Matlab toolbox for dimension reduction: http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
Matlab manifold learning demo: http://www.math.ucla.edu/~wittman/mani/