1
Distance Metric Learning in Data Mining
Fei Wang
2
What is a Distance Metric?
From Wikipedia: a distance metric on a set X is a function d: X × X → R satisfying, for all x, y, z: non-negativity d(x, y) ≥ 0; identity of indiscernibles d(x, y) = 0 iff x = y; symmetry d(x, y) = d(y, x); and the triangle inequality d(x, z) ≤ d(x, y) + d(y, z)
3
Distance Metric Taxonomy
Distance Metric
– Learned Metric
  • Distance characteristics: linear vs. nonlinear; global vs. local
  • Algorithms: unsupervised, supervised, semi-supervised
– Fixed Metric
  • Numeric: discrete; continuous (e.g., Euclidean)
  • Categorical
4
Categorization of Distance Metrics: Linear vs. Non-linear
Linear Distance
– First perform a linear mapping to project the data into some space, and then evaluate the pairwise data distance as their Euclidean distance in the projected space
– Generalized Mahalanobis distance
  D(x, y) = sqrt((x − y)^T M (x − y))
• If M is symmetric and positive definite, D is a distance metric;
• If M is symmetric and positive semi-definite, D is a pseudo-metric;
• If M = I, D reduces to the ordinary Euclidean distance;
• Any symmetric positive semi-definite M factors as M = L^T L, so D(x, y) equals the Euclidean distance between Lx and Ly, i.e., a linear mapping followed by the Euclidean distance.
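The generalized Mahalanobis distance, and its reduction to a Euclidean distance after a linear map, can be checked with a minimal numpy sketch (the data and the choice of L are illustrative):

```python
import numpy as np

def mahalanobis(x, y, M):
    """Generalized Mahalanobis distance d(x, y) = sqrt((x - y)^T M (x - y))."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

# With M = I this reduces to the ordinary Euclidean distance.
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(mahalanobis(x, y, np.eye(2)))  # 5.0

# Any M = L^T L (positive semi-definite) gives a valid (pseudo-)metric:
L = np.array([[2.0, 0.0], [0.0, 1.0]])
M = L.T @ L
# equivalent to the Euclidean distance after the linear map x -> L x
print(np.allclose(mahalanobis(x, y, M), np.linalg.norm(L @ x - L @ y)))  # True
```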
Nonlinear Distance
– First perform a nonlinear mapping to project the data into some space, and then evaluate the pairwise data distance as their Euclidean distance in the projected space
5
Categorization of Distance Metrics: Global vs. Local
Global Methods
– The distance metric satisfies some global properties of the data set
• PCA: find a direction on which the variation of the whole data set is the largest
• LDA: find a direction along which data from the same class cluster together while data from different classes are separated
• …
Local Methods
– The distance metric satisfies some local properties of the data set
• LLE: find the low dimensional embeddings such that the local neighborhood structure is preserved
• LSML: find the projection such that the local neighbors of different classes are separated
• ….
6
Related Areas
Metric learning
Dimensionality reduction
Kernel learning
7
Related area 1: Dimensionality reduction
Dimensionality reduction is the process of reducing the number of features, via feature selection or feature transformation.
– Feature selection: find a subset of the original features (correlation, mRMR, information gain, …)
– Feature transformation: transform the original features into a lower-dimensional space (PCA, LDA, LLE, Laplacian Embedding, …)
Every dimensionality reduction technique induces a distance metric: first perform the dimensionality reduction, then evaluate the Euclidean distance in the embedded space.
8
Related area 2: Kernel learning
What is Kernel?
Suppose we have a data set X with n data points. The kernel matrix K defined on X is an n×n symmetric positive semi-definite matrix whose (i, j)-th element represents the similarity between the i-th and j-th data points.
Kernel Learning vs. Distance Metric Learning
• Kernel learning constructs a new kernel from the data, i.e., an inner-product function in some feature space, while distance metric learning constructs a distance function from the data.
Kernel Learning vs. Kernel Trick
• Kernel learning infers the n×n kernel matrix from the data.
• The kernel trick applies a predetermined kernel to map the data from the original space to a feature space, such that a nonlinear problem in the original space becomes a linear problem in the feature space.
B. Scholkopf, A. Smola. Learning with Kernels - Support Vector Machines, Regularization, Optimization, and Beyond. 2001.
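As an illustration of the kernel matrix definition above, the following numpy sketch builds a kernel matrix and verifies it is symmetric positive semi-definite (the RBF kernel and the bandwidth gamma are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """RBF (Gaussian) kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * d2)

X = np.random.RandomState(0).randn(5, 3)
K = rbf_kernel_matrix(X)
assert np.allclose(K, K.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-8)   # positive semi-definite
```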
9
Different Types of Metric Learning
           Unsupervised              Supervised                       Semi-Supervised
Linear     PCA, UMMP, LPP            LDA, MMDA, Xing, RCA, ITML,      CMM, LRML
                                     NCA, ANMM, LMNN
Nonlinear  Kernelization, LE, LLE,   Kernelization                    Kernelization, SSDR
           ISOMAP, SNE

• Red marks local methods, blue marks global methods (coloring from the original slide)
10
Metric Learning Meta-algorithm
Embed the data in some space
– Usually formulated as an optimization problem
• Define the objective function
  – Separation based
  – Geometry based
  – Information theoretic based
• Solve the optimization problem
  – Eigenvalue decomposition
  – Semi-definite programming
  – Gradient descent
  – Bregman projection
Then use the Euclidean distance in the embedded space.
11
Unsupervised metric learning
Learn a pairwise distance metric purely from the data, i.e., without any supervision
           Global                          Local
Linear     Principal Component Analysis    Locality Preserving Projections
Nonlinear  Kernelization (e.g., KPCA)      Laplacian Eigenmap, Locally Linear Embedding,
                                           Isometric Feature Mapping, Stochastic Neighborhood Embedding
12
Outline
Part I - Applications
Motivation and Introduction
Patient similarity application
Part II - Methods
Unsupervised Metric Learning
Supervised Metric Learning
Semi-supervised Metric Learning
Challenges and Opportunities for Metric Learning
13
Principal Component Analysis (PCA)
Find successive projection directions which maximize the data variance
If all principal components are retained, the projection is a rotation, so the pairwise Euclidean distances do not change after PCA
[Figure: toy data with the 1st and 2nd principal components marked]
Geometry-based objective · Eigenvalue decomposition · Linear mapping · Global method
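The distance-preservation property of a full PCA projection can be verified with a small numpy sketch (the toy data set is made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 3) @ np.diag([3.0, 1.0, 0.3])  # anisotropic toy data
Xc = X - X.mean(axis=0)

# principal directions = eigenvectors of the covariance matrix
C = Xc.T @ Xc / len(Xc)
vals, vecs = np.linalg.eigh(C)          # ascending eigenvalues
W = vecs[:, ::-1]                       # columns sorted by decreasing variance

Z = Xc @ W                              # full projection (an orthogonal map)
# keeping ALL components, pairwise Euclidean distances are unchanged
i, j = 0, 1
d_orig = np.linalg.norm(Xc[i] - Xc[j])
d_proj = np.linalg.norm(Z[i] - Z[j])
assert np.isclose(d_orig, d_proj)
```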
14
Kernel Mapping
Map the data into some high-dimensional feature space, such that a nonlinear problem in the original space becomes a linear problem in the feature space. We do not need the explicit form of the mapping; we only need to define a kernel function.
Figure from http://omega.albany.edu:8008/machine-learning-dir/notes-dir/ker1/ker1.pdf
15
Kernel Principal Component Analysis (KPCA)
Perform PCA in the feature space. By the representer theorem, the projection directions can be expressed as linear combinations of the mapped data points, so the whole computation can be written in terms of the kernel matrix.
Figure from http://en.wikipedia.org/wiki/Kernel_principal_component_analysis
[Figure: original data vs. embedded data after KPCA]
Geometry-based objective · Eigenvalue decomposition · Nonlinear mapping · Global method
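A minimal numpy sketch of KPCA, assuming an RBF kernel (the kernel choice and the toy data are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 2)

# RBF kernel matrix
sq = np.sum(X**2, axis=1)
K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

n = len(X)
H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
Kc = H @ K @ H                        # center the data in feature space

vals, vecs = np.linalg.eigh(Kc)       # ascending eigenvalues
vals, vecs = vals[::-1], vecs[:, ::-1]
# embedding coordinates: unit eigenvectors scaled by sqrt of eigenvalues
k = 2
Y = vecs[:, :k] * np.sqrt(np.maximum(vals[:k], 0))
print(Y.shape)  # (30, 2)
```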
16
Locality Preserving Projections (LPP)
Find linear projections that preserve the localities of the data set
[Figure: projection directions found by PCA vs. LPP on the same data]
X. He and P. Niyogi. Locality Preserving Projections. NIPS 2003.
LPP solves min_a a^T X L X^T a subject to a^T X D X^T a = 1, where D is the degree matrix of the data graph and L = D − W is its Laplacian matrix
[Figure: for this data set the x-axis is the major PCA direction, while the y-axis is the major LPP projection direction]
Geometry-based objective · Generalized eigenvalue decomposition · Linear mapping · Local method
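A minimal numpy sketch of LPP, assuming heat-kernel affinities on a kNN graph (the parameters k and t and the toy data are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(40, 3)
n = len(X)

# heat-kernel weights on a symmetrized kNN graph
d2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
k, t = 5, 1.0
W = np.zeros((n, n))
for i in range(n):
    for j in np.argsort(d2[i])[1:k+1]:      # k nearest neighbors (skip self)
        W[i, j] = W[j, i] = np.exp(-d2[i, j] / t)

D = np.diag(W.sum(axis=1))
L = D - W                                    # graph Laplacian

# generalized eigenproblem  X^T L X a = lambda X^T D X a
A, B = X.T @ L @ X, X.T @ D @ X
Bv, BV = np.linalg.eigh(B)                   # reduce via B^{-1/2}
Bih = BV @ np.diag(1.0 / np.sqrt(Bv)) @ BV.T
vals, vecs = np.linalg.eigh(Bih @ A @ Bih)
proj = Bih @ vecs[:, :2]                     # two projection directions
print(proj.shape)  # (3, 2)
```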
17
Laplacian Embedding (LE)
LE: find embeddings that preserve the localities of the data set, by minimizing Σ_ij W_ij ||y_i − y_j||², where y_i is the embedded x_i and W_ij encodes the relationship between x_i and x_j
M. Belkin, P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373–1396, 2003.
Geometry-based objective · Generalized eigenvalue decomposition · Nonlinear mapping · Local method
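The construction can be sketched in numpy (the dense heat-kernel weights are an illustrative choice; practical implementations use sparse neighborhood graphs):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(30, 3)
n = len(X)

d2 = ((X[:, None] - X[None, :])**2).sum(-1)
W = np.exp(-d2)            # dense heat-kernel weights (illustrative choice)
np.fill_diagonal(W, 0)
deg = W.sum(axis=1)
L = np.diag(deg) - W       # graph Laplacian

# smallest nontrivial generalized eigenvectors of  L y = lambda D y
Di = np.diag(1.0 / np.sqrt(deg))
vals, vecs = np.linalg.eigh(Di @ L @ Di)   # normalized Laplacian spectrum
Y = Di @ vecs[:, 1:3]      # skip the trivial constant eigenvector
print(Y.shape)  # (30, 2)
```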
18
Locally Linear Embedding (LLE)
S. Roweis, L. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.
1. Obtain the data relationships (local reconstruction weights)
2. Obtain the data embeddings that preserve those weights
Geometry-based objective · Generalized eigenvalue decomposition · Nonlinear mapping · Local method
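The two steps can be sketched in numpy (the neighborhood size k and the regularization constant are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(40, 3)
n, k = len(X), 6

# Step 1: reconstruct each point from its k nearest neighbors
d2 = ((X[:, None] - X[None, :])**2).sum(-1)
W = np.zeros((n, n))
for i in range(n):
    nbrs = np.argsort(d2[i])[1:k+1]
    Z = X[nbrs] - X[i]                       # centered neighbors
    G = Z @ Z.T + 1e-3 * np.eye(k)           # regularized local Gram matrix
    w = np.linalg.solve(G, np.ones(k))
    W[i, nbrs] = w / w.sum()                 # weights sum to 1

# Step 2: embeddings = bottom eigenvectors of (I - W)^T (I - W)
M = (np.eye(n) - W).T @ (np.eye(n) - W)
vals, vecs = np.linalg.eigh(M)
Y = vecs[:, 1:3]                             # skip the constant eigenvector
print(Y.shape)  # (40, 2)
```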
19
Isometric Feature Mapping (ISOMAP)
1. Construct the neighborhood graph
2. Compute the shortest path length (geodesic distance) between pairwise data points
3. Recover the low-dimensional embeddings of the data by Multi-Dimensional Scaling (MDS), preserving those geodesic distances
J. B. Tenenbaum, V. de Silva and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000.
Geometry-based objective · Eigenvalue decomposition · Nonlinear mapping · Local method
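The three steps can be sketched in numpy (the noisy-curve data set and neighborhood size are illustrative; the sketch assumes the neighborhood graph is connected):

```python
import numpy as np

# points along a 3D curve, so geodesic distances differ from Euclidean ones
t = np.linspace(0, 3, 25)
X = np.c_[np.cos(t), np.sin(t), t]
n, k = len(X), 5

# 1. kNN neighborhood graph
d = np.sqrt(((X[:, None] - X[None, :])**2).sum(-1))
G = np.full((n, n), np.inf)
np.fill_diagonal(G, 0)
for i in range(n):
    for j in np.argsort(d[i])[1:k+1]:
        G[i, j] = G[j, i] = d[i, j]

# 2. geodesic distances via Floyd-Warshall shortest paths
for m in range(n):
    G = np.minimum(G, G[:, [m]] + G[[m], :])

# 3. classical MDS on the geodesic distances (double centering)
D2 = G**2
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ D2 @ H
vals, vecs = np.linalg.eigh(B)
vals, vecs = vals[::-1], vecs[:, ::-1]
Y = vecs[:, :2] * np.sqrt(np.maximum(vals[:2], 0))
print(Y.shape)  # (25, 2)
```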
20
Stochastic Neighbor Embedding (SNE)
Hinton, G. E. and Roweis, S. Stochastic Neighbor Embedding. NIPS 2003.
The probability that i picks j as its neighbor in the input space: p_{j|i} ∝ exp(−||x_i − x_j||² / 2σ_i²)
The neighborhood distribution q_{j|i} is defined analogously in the embedded space
Neighborhood distribution preservation: minimize Σ_i KL(P_i || Q_i)
Information-theoretic objective · Gradient descent · Nonlinear mapping · Local method
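The input-space neighbor distributions can be sketched in numpy (a single global σ is used here for simplicity; SNE actually tunes σ_i per point via a perplexity target):

```python
import numpy as np

def neighbor_probs(X, sigma=1.0):
    """Row i holds p_{j|i}, the probability that i picks j as its neighbor."""
    d2 = ((X[:, None] - X[None, :])**2).sum(-1)
    P = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(P, 0)                 # a point never picks itself
    return P / P.sum(axis=1, keepdims=True)

X = np.random.RandomState(0).randn(10, 3)
P = neighbor_probs(X)
assert np.allclose(P.sum(axis=1), 1.0)     # each row is a distribution
```

The same construction applied to the embedded points gives Q, and SNE follows the gradient of Σ_i KL(P_i || Q_i) with respect to the embedding.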
21
Summary: Unsupervised Distance Metric Learning
                        PCA   LPP   LLE   Isomap  SNE
Local                          √     √     √      √
Global                   √
Linear                   √     √
Nonlinear                            √     √      √
Separation
Geometry                 √     √     √     √
Information theoretic                             √
Extensibility            √     √
22
Outline
Unsupervised Metric Learning
Supervised Metric Learning
Semi-supervised Metric Learning
Challenges and Opportunities for Metric Learning
23
Supervised Metric Learning
Learn a pairwise distance metric from the data together with their supervision, i.e., data labels and pairwise constraints (must-links and cannot-links)
           Global                                         Local
Linear     Linear Discriminant Analysis, Eric Xing's      Locally Supervised Metric Learning
           method, Information Theoretic Metric Learning
Nonlinear  Kernelization                                  Kernelization
24
Linear Discriminant Analysis (LDA)
Figure from http://cmp.felk.cvut.cz/cmp/software/stprtool/examples/ldapca_example1.gif
Suppose the data are from C different classes
Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press Professional, Inc., 1990.
Y. Jia, F. Nie, C. Zhang. Trace Ratio Problem Revisited. IEEE Trans. Neural Networks, 20(4), 2009.
Separation-based objective · Eigenvalue decomposition · Linear mapping · Global method
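For the two-class case, the LDA direction can be sketched in numpy as w ∝ S_w^{-1}(m_1 − m_0), the direction separating the class means relative to the within-class scatter (the toy Gaussian classes are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X0 = rng.randn(30, 2)                       # class 0 around the origin
X1 = rng.randn(30, 2) + np.array([3.0, 3.0])  # class 1 shifted away

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# within-class scatter matrix
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
w = np.linalg.solve(Sw, m1 - m0)
w /= np.linalg.norm(w)

# projected class means are well separated along w
gap = (X1 @ w).mean() - (X0 @ w).mean()
assert gap > 1.0
```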
25
Distance Metric Learning with Side Information (Eric Xing’s method)
Must-link constraint set: each element is a pair of data that come from the same class or cluster
Cannot-link constraint set: each element is a pair of data that come from different classes or clusters
E.P. Xing, A.Y. Ng, M.I. Jordan and S. Russell, Distance Metric Learning, with application to Clustering with side-information. NIPS 16. 2002.
Separation-based objective · Semi-definite programming · No direct mapping · Global method
26
Information Theoretic Metric Learning (ITML)
Suppose we have an initial Mahalanobis distance parameterized by a matrix M0, a set of must-link constraints, and a set of cannot-link constraints. ITML solves the optimization problem
  min_M D_ld(M, M0) s.t. d_M(x_i, x_j) ≤ u for must-linked pairs, d_M(x_i, x_j) ≥ l for cannot-linked pairs
where D_ld is the LogDet divergence between the learned matrix M and the prior M0.
Jason Davis, Brian Kulis, Prateek Jain, Suvrit Sra, & Inderjit Dhillon. Information-Theoretic Metric Learning. ICML 2007
Best paper award
Scalable and efficient
Information-theoretic objective · Bregman projection · No direct mapping · Global method
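ITML measures closeness between Mahalanobis matrices with the LogDet (Burg) divergence; a numpy sketch of that divergence (assuming its standard definition, with illustrative matrices):

```python
import numpy as np

def logdet_div(M, M0):
    """LogDet divergence D_ld(M, M0) = tr(M M0^{-1}) - log det(M M0^{-1}) - d."""
    d = M.shape[0]
    A = M @ np.linalg.inv(M0)
    return float(np.trace(A) - np.log(np.linalg.det(A)) - d)

M0 = np.eye(3)
assert np.isclose(logdet_div(M0, M0), 0.0)   # zero when M equals the prior
M = np.diag([2.0, 1.0, 1.0])
assert logdet_div(M, M0) > 0                 # positive otherwise
```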
27
Summary: Supervised Metric Learning
                        LDA   Xing   ITML   LSML
Label                    √                    √
Pairwise constraints            √      √      √
Local                                         √
Global                   √      √      √
Linear                   √                    √
Nonlinear
Separation               √      √      √      √
Geometry
Information theoretic                  √
Extensibility            √                    √
28
Outline
Unsupervised Metric Learning
Supervised Metric Learning
Semi-supervised Metric Learning
Challenges and Opportunities for Metric Learning
29
Semi-Supervised Metric Learning
Learning pairwise distance metric using data with and without supervision
           Global                           Local
Linear     Constraint Margin Maximization   Laplacian Regularized Metric Learning
Nonlinear  Kernelization                    Semi-Supervised Dimensionality Reduction
30
Constraint Margin Maximization (CMM)
CMM maximizes the projected constraint margin: the scatterness of projected cannot-link pairs minus the compactness of projected must-link pairs, combined with a maximum-variance term that requires no label information
F. Wang, S. Chen, T. Li, C. Zhang. Semi-Supervised Metric Learning by Maximizing Constraint Margin. CIKM 2008.
Geometry & separation based objective · Eigenvalue decomposition · Linear mapping · Global method
31
Laplacian Regularized Metric Learning (LRML)
The LRML objective combines three terms:
– Smoothness term (over the data graph)
– Compactness term (must-link pairs)
– Scatterness term (cannot-link pairs)
S. Hoi, W. Liu, S. Chang. Semi-Supervised Distance Metric Learning for Collaborative Image Retrieval. CVPR08
Geometry & separation based objective · Semi-definite programming · No mapping · Local & global method
32
Semi-Supervised Dimensionality Reduction (SSDR)
Most nonlinear dimensionality reduction algorithms can be formalized as an optimization problem over the low-dimensional embeddings, involving an algorithm-specific symmetric matrix
X. Yang, H. Fu, H. Zha, J. Barlow. Semi-supervised Nonlinear Dimensionality Reduction. ICML 2006.
Separation-based objective · Eigenvalue decomposition · No mapping · Local method
33
Summary: Semi-Supervised Distance Metric Learning
                        CMM   LRML   SSDR
Label                    √     √
Pairwise constraints     √     √
Embedded coordinates                   √
Local                          √       √
Global                   √
Linear                   √     √
Nonlinear                              √
Separation               √     √
Locality-Preservation          √       √
Information theoretic
Extensibility            √
34
Outline
Unsupervised Metric Learning
Supervised Metric Learning
Semi-supervised Metric Learning
Challenges and Opportunities for Metric Learning
35
Scalability
– Online (sequential) learning (Shalev-Shwartz, Singer and Ng, ICML 2004; Davis et al., ICML 2007)
– Distributed learning
Efficiency
– Labeling efficiency (Yang, Jin and Sukthankar, UAI 2007)
– Updating efficiency (Wang, Sun, Hu and Ebadollahi, SDM 2011)
Heterogeneity
– Data heterogeneity (Wang, Sun and Ebadollahi, SDM 2011)
– Task heterogeneity (Zhang and Yeung, KDD 2010)
Evaluation
Evaluation
36
S. Shalev-Shwartz, Y. Singer, A. Y. Ng. Online and Batch Learning of Pseudo-Metrics. ICML 2004.
J. Davis, B. Kulis, P. Jain, S. Sra, I. Dhillon. Information-Theoretic Metric Learning. ICML 2007.
L. Yang, R. Jin, R. Sukthankar. Bayesian Active Distance Metric Learning. UAI 2007.
F. Wang, J. Sun, S. Ebadollahi. Integrating Distance Metrics Learned from Multiple Experts and its Application in Inter-Patient Similarity Assessment. SDM 2011.
F. Wang, J. Sun, J. Hu, S. Ebadollahi. iMet: Interactive Metric Learning in Healthcare Applications. SDM 2011.
Y. Zhang, D.-Y. Yeung. Transfer Metric Learning by Learning Task Relationships. KDD 2010, pp. 1199–1208.
37