1
Distance Metric Learning in Data Mining
Fei Wang
2
What is a Distance Metric?
From Wikipedia: a distance metric on a set X is a function d: X × X → R satisfying, for all x, y, z: non-negativity d(x, y) ≥ 0; identity of indiscernibles d(x, y) = 0 iff x = y; symmetry d(x, y) = d(y, x); and the triangle inequality d(x, z) ≤ d(x, y) + d(y, z)
3
Distance Metric Taxonomy
Distance Metric
– Learned Metric
  • Distance characteristics: linear vs. nonlinear; global vs. local
  • Algorithms: unsupervised, supervised, semi-supervised
– Fixed Metric
  • Numeric: discrete; continuous (e.g., Euclidean)
  • Categorical
4
Categorization of Distance Metrics: Linear vs. Non-linear
Linear Distance
– First perform a linear mapping to project the data into some space, and then evaluate the pairwise data distance as their Euclidean distance in the projected space
– Generalized Mahalanobis distance
  D(x, y) = sqrt((x − y)^T M (x − y))
• If M is symmetric and positive definite, D is a distance metric;
• If M is symmetric and positive semi-definite, D is a pseudo-metric;
• If M = I, D reduces to the ordinary Euclidean distance;
• Any symmetric positive semi-definite M factors as M = L^T L, so D(x, y) equals the Euclidean distance between Lx and Ly, i.e., a linear mapping followed by the Euclidean distance.
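The generalized Mahalanobis distance, and its reduction to a Euclidean distance after a linear map, can be checked with a minimal numpy sketch (the data and the choice of L are illustrative):

```python
import numpy as np

def mahalanobis(x, y, M):
    """Generalized Mahalanobis distance d(x, y) = sqrt((x - y)^T M (x - y))."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

# With M = I this reduces to the ordinary Euclidean distance.
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(mahalanobis(x, y, np.eye(2)))  # 5.0

# Any M = L^T L (positive semi-definite) gives a valid (pseudo-)metric:
L = np.array([[2.0, 0.0], [0.0, 1.0]])
M = L.T @ L
# equivalent to the Euclidean distance after the linear map x -> L x
print(np.allclose(mahalanobis(x, y, M), np.linalg.norm(L @ x - L @ y)))  # True
```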
Nonlinear Distance
– First perform a nonlinear mapping to project the data into some space, and then evaluate the pairwise data distance as their Euclidean distance in the projected space
5
Categorization of Distance Metrics: Global vs. Local
Global Methods
– The distance metric satisfies some global properties of the data set
• PCA: find a direction on which the variation of the whole data set is the largest
• LDA: find a direction along which data from the same class cluster together while data from different classes are separated
• …
Local Methods
– The distance metric satisfies some local properties of the data set
• LLE: find the low dimensional embeddings such that the local neighborhood structure is preserved
• LSML: find the projection such that the local neighbors of different classes are separated
• ….
6
Related Areas
Metric learning
Dimensionality reduction
Kernel learning
7
Related area 1: Dimensionality reduction
Dimensionality reduction is the process of reducing the number of features, via feature selection or feature transformation.
– Feature selection: find a subset of the original features (correlation, mRMR, information gain, …)
– Feature transformation: transform the original features into a lower-dimensional space (PCA, LDA, LLE, Laplacian Embedding, …)
Every dimensionality reduction technique induces a distance metric: first perform the dimensionality reduction, then evaluate the Euclidean distance in the embedded space.
8
Related area 2: Kernel learning
What is Kernel?
Suppose we have a data set X with n data points. The kernel matrix K defined on X is an n×n symmetric positive semi-definite matrix whose (i, j)-th element represents the similarity between the i-th and j-th data points.
Kernel Learning vs. Distance Metric Learning
• Kernel learning constructs a new kernel from the data, i.e., an inner-product function in some feature space, while distance metric learning constructs a distance function from the data.
Kernel Learning vs. Kernel Trick
• Kernel learning infers the n×n kernel matrix from the data.
• The kernel trick applies a predetermined kernel to map the data from the original space to a feature space, such that a nonlinear problem in the original space becomes a linear problem in the feature space.
B. Scholkopf, A. Smola. Learning with Kernels - Support Vector Machines, Regularization, Optimization, and Beyond. 2001.
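As an illustration of the kernel matrix definition above, the following numpy sketch builds a kernel matrix and verifies it is symmetric positive semi-definite (the RBF kernel and the bandwidth gamma are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """RBF (Gaussian) kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * d2)

X = np.random.RandomState(0).randn(5, 3)
K = rbf_kernel_matrix(X)
assert np.allclose(K, K.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-8)   # positive semi-definite
```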
9
Different Types of Metric Learning
           Unsupervised              Supervised                       Semi-Supervised
Linear     PCA, UMMP, LPP            LDA, MMDA, Xing, RCA, ITML,      CMM, LRML
                                     NCA, ANMM, LMNN
Nonlinear  Kernelization, LE, LLE,   Kernelization                    Kernelization, SSDR
           ISOMAP, SNE

• Red marks local methods, blue marks global methods (coloring from the original slide)
10
Metric Learning Meta-algorithm
Embed the data in some space
– Usually formulated as an optimization problem
• Define the objective function
  – Separation based
  – Geometry based
  – Information theoretic based
• Solve the optimization problem
  – Eigenvalue decomposition
  – Semi-definite programming
  – Gradient descent
  – Bregman projection
Then use the Euclidean distance in the embedded space.
11
Unsupervised metric learning
Learn a pairwise distance metric purely from the data, i.e., without any supervision
           Global                          Local
Linear     Principal Component Analysis    Locality Preserving Projections
Nonlinear  Kernelization (e.g., KPCA)      Laplacian Eigenmap, Locally Linear Embedding,
                                           Isometric Feature Mapping, Stochastic Neighborhood Embedding
12
Outline
Part I - Applications
Motivation and Introduction
Patient similarity application
Part II - Methods
Unsupervised Metric Learning
Supervised Metric Learning
Semi-supervised Metric Learning
Challenges and Opportunities for Metric Learning
13
Principal Component Analysis (PCA)
Find successive projection directions which maximize the data variance
If all principal components are retained, the projection is a rotation, so the pairwise Euclidean distances do not change after PCA
[Figure: toy data with the 1st and 2nd principal components marked]
Geometry-based objective · Eigenvalue decomposition · Linear mapping · Global method
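The distance-preservation property of a full PCA projection can be verified with a small numpy sketch (the toy data set is made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 3) @ np.diag([3.0, 1.0, 0.3])  # anisotropic toy data
Xc = X - X.mean(axis=0)

# principal directions = eigenvectors of the covariance matrix
C = Xc.T @ Xc / len(Xc)
vals, vecs = np.linalg.eigh(C)          # ascending eigenvalues
W = vecs[:, ::-1]                       # columns sorted by decreasing variance

Z = Xc @ W                              # full projection (an orthogonal map)
# keeping ALL components, pairwise Euclidean distances are unchanged
i, j = 0, 1
d_orig = np.linalg.norm(Xc[i] - Xc[j])
d_proj = np.linalg.norm(Z[i] - Z[j])
assert np.isclose(d_orig, d_proj)
```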
14
Kernel Mapping
Map the data into some high-dimensional feature space, such that a nonlinear problem in the original space becomes a linear problem in the feature space. We do not need the explicit form of the mapping; we only need to define a kernel function.
Figure from http://omega.albany.edu:8008/machine-learning-dir/notes-dir/ker1/ker1.pdf
15
Kernel Principal Component Analysis (KPCA)
Perform PCA in the feature space. By the representer theorem, the projection directions can be expressed as linear combinations of the mapped data points, so the whole computation can be written in terms of the kernel matrix.
Figure from http://en.wikipedia.org/wiki/Kernel_principal_component_analysis
[Figure: original data vs. embedded data after KPCA]
Geometry-based objective · Eigenvalue decomposition · Nonlinear mapping · Global method
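A minimal numpy sketch of KPCA, assuming an RBF kernel (the kernel choice and the toy data are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 2)

# RBF kernel matrix
sq = np.sum(X**2, axis=1)
K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

n = len(X)
H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
Kc = H @ K @ H                        # center the data in feature space

vals, vecs = np.linalg.eigh(Kc)       # ascending eigenvalues
vals, vecs = vals[::-1], vecs[:, ::-1]
# embedding coordinates: unit eigenvectors scaled by sqrt of eigenvalues
k = 2
Y = vecs[:, :k] * np.sqrt(np.maximum(vals[:k], 0))
print(Y.shape)  # (30, 2)
```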
16
Locality Preserving Projections (LPP)
Find linear projections that preserve the localities of the data set
[Figure: projection directions found by PCA vs. LPP on the same data]
X. He and P. Niyogi. Locality Preserving Projections. NIPS 2003.
LPP solves min_a a^T X L X^T a subject to a^T X D X^T a = 1, where D is the degree matrix of the data graph and L = D − W is its Laplacian matrix
[Figure: for this data set the x-axis is the major PCA direction, while the y-axis is the major LPP projection direction]
Geometry-based objective · Generalized eigenvalue decomposition · Linear mapping · Local method
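A minimal numpy sketch of LPP, assuming heat-kernel affinities on a kNN graph (the parameters k and t and the toy data are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(40, 3)
n = len(X)

# heat-kernel weights on a symmetrized kNN graph
d2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
k, t = 5, 1.0
W = np.zeros((n, n))
for i in range(n):
    for j in np.argsort(d2[i])[1:k+1]:      # k nearest neighbors (skip self)
        W[i, j] = W[j, i] = np.exp(-d2[i, j] / t)

D = np.diag(W.sum(axis=1))
L = D - W                                    # graph Laplacian

# generalized eigenproblem  X^T L X a = lambda X^T D X a
A, B = X.T @ L @ X, X.T @ D @ X
Bv, BV = np.linalg.eigh(B)                   # reduce via B^{-1/2}
Bih = BV @ np.diag(1.0 / np.sqrt(Bv)) @ BV.T
vals, vecs = np.linalg.eigh(Bih @ A @ Bih)
proj = Bih @ vecs[:, :2]                     # two projection directions
print(proj.shape)  # (3, 2)
```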
17
Laplacian Embedding (LE)
LE: find embeddings that preserve the localities of the data set, by minimizing Σ_ij W_ij ||y_i − y_j||², where y_i is the embedded x_i and W_ij encodes the relationship between x_i and x_j
M. Belkin, P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373–1396, 2003.
Geometry-based objective · Generalized eigenvalue decomposition · Nonlinear mapping · Local method
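The construction can be sketched in numpy (the dense heat-kernel weights are an illustrative choice; practical implementations use sparse neighborhood graphs):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(30, 3)
n = len(X)

d2 = ((X[:, None] - X[None, :])**2).sum(-1)
W = np.exp(-d2)            # dense heat-kernel weights (illustrative choice)
np.fill_diagonal(W, 0)
deg = W.sum(axis=1)
L = np.diag(deg) - W       # graph Laplacian

# smallest nontrivial generalized eigenvectors of  L y = lambda D y
Di = np.diag(1.0 / np.sqrt(deg))
vals, vecs = np.linalg.eigh(Di @ L @ Di)   # normalized Laplacian spectrum
Y = Di @ vecs[:, 1:3]      # skip the trivial constant eigenvector
print(Y.shape)  # (30, 2)
```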
18
Locally Linear Embedding (LLE)
S. Roweis, L. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.
1. Obtain the data relationships (local reconstruction weights)
2. Obtain the data embeddings that preserve those weights
Geometry-based objective · Generalized eigenvalue decomposition · Nonlinear mapping · Local method
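The two steps can be sketched in numpy (the neighborhood size k and the regularization constant are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(40, 3)
n, k = len(X), 6

# Step 1: reconstruct each point from its k nearest neighbors
d2 = ((X[:, None] - X[None, :])**2).sum(-1)
W = np.zeros((n, n))
for i in range(n):
    nbrs = np.argsort(d2[i])[1:k+1]
    Z = X[nbrs] - X[i]                       # centered neighbors
    G = Z @ Z.T + 1e-3 * np.eye(k)           # regularized local Gram matrix
    w = np.linalg.solve(G, np.ones(k))
    W[i, nbrs] = w / w.sum()                 # weights sum to 1

# Step 2: embeddings = bottom eigenvectors of (I - W)^T (I - W)
M = (np.eye(n) - W).T @ (np.eye(n) - W)
vals, vecs = np.linalg.eigh(M)
Y = vecs[:, 1:3]                             # skip the constant eigenvector
print(Y.shape)  # (40, 2)
```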
19
Isometric Feature Mapping (ISOMAP)
1. Construct the neighborhood graph
2. Compute the shortest path length (geodesic distance) between pairwise data points
3. Recover the low-dimensional embeddings of the data by Multi-Dimensional Scaling (MDS), preserving those geodesic distances
J. B. Tenenbaum, V. de Silva and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000.
Geometry-based objective · Eigenvalue decomposition · Nonlinear mapping · Local method
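The three steps can be sketched in numpy (the noisy-curve data set and neighborhood size are illustrative; the sketch assumes the neighborhood graph is connected):

```python
import numpy as np

# points along a 3D curve, so geodesic distances differ from Euclidean ones
t = np.linspace(0, 3, 25)
X = np.c_[np.cos(t), np.sin(t), t]
n, k = len(X), 5

# 1. kNN neighborhood graph
d = np.sqrt(((X[:, None] - X[None, :])**2).sum(-1))
G = np.full((n, n), np.inf)
np.fill_diagonal(G, 0)
for i in range(n):
    for j in np.argsort(d[i])[1:k+1]:
        G[i, j] = G[j, i] = d[i, j]

# 2. geodesic distances via Floyd-Warshall shortest paths
for m in range(n):
    G = np.minimum(G, G[:, [m]] + G[[m], :])

# 3. classical MDS on the geodesic distances (double centering)
D2 = G**2
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ D2 @ H
vals, vecs = np.linalg.eigh(B)
vals, vecs = vals[::-1], vecs[:, ::-1]
Y = vecs[:, :2] * np.sqrt(np.maximum(vals[:2], 0))
print(Y.shape)  # (25, 2)
```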
20
Stochastic Neighbor Embedding (SNE)
Hinton, G. E. and Roweis, S. Stochastic Neighbor Embedding. NIPS 2003.
The probability that i picks j as its neighbor in the input space: p_{j|i} ∝ exp(−||x_i − x_j||² / 2σ_i²)
The neighborhood distribution q_{j|i} is defined analogously in the embedded space
Neighborhood distribution preservation: minimize Σ_i KL(P_i || Q_i)
Information-theoretic objective · Gradient descent · Nonlinear mapping · Local method
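The input-space neighbor distributions can be sketched in numpy (a single global σ is used here for simplicity; SNE actually tunes σ_i per point via a perplexity target):

```python
import numpy as np

def neighbor_probs(X, sigma=1.0):
    """Row i holds p_{j|i}, the probability that i picks j as its neighbor."""
    d2 = ((X[:, None] - X[None, :])**2).sum(-1)
    P = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(P, 0)                 # a point never picks itself
    return P / P.sum(axis=1, keepdims=True)

X = np.random.RandomState(0).randn(10, 3)
P = neighbor_probs(X)
assert np.allclose(P.sum(axis=1), 1.0)     # each row is a distribution
```

The same construction applied to the embedded points gives Q, and SNE follows the gradient of Σ_i KL(P_i || Q_i) with respect to the embedding.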
21
Summary: Unsupervised Distance Metric Learning
                        PCA   LPP   LLE   Isomap  SNE
Local                          √     √     √      √
Global                   √
Linear                   √     √
Nonlinear                            √     √      √
Separation
Geometry                 √     √     √     √
Information theoretic                             √
Extensibility            √     √
22
Outline
Unsupervised Metric Learning
Supervised Metric Learning
Semi-supervised Metric Learning
Challenges and Opportunities for Metric Learning
23
Supervised Metric Learning
Learn a pairwise distance metric from the data together with their supervision, i.e., data labels and pairwise constraints (must-links and cannot-links)
           Global                                         Local
Linear     Linear Discriminant Analysis, Eric Xing's      Locally Supervised Metric Learning
           method, Information Theoretic Metric Learning
Nonlinear  Kernelization                                  Kernelization
24
Linear Discriminant Analysis (LDA)
Figure from http://cmp.felk.cvut.cz/cmp/software/stprtool/examples/ldapca_example1.gif
Suppose the data are from C different classes
Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press Professional, Inc., 1990.
Y. Jia, F. Nie, C. Zhang. Trace Ratio Problem Revisited. IEEE Trans. Neural Networks, 20(4), 2009.
Separation-based objective · Eigenvalue decomposition · Linear mapping · Global method
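For the two-class case, the LDA direction can be sketched in numpy as w ∝ S_w^{-1}(m_1 − m_0), the direction separating the class means relative to the within-class scatter (the toy Gaussian classes are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X0 = rng.randn(30, 2)                       # class 0 around the origin
X1 = rng.randn(30, 2) + np.array([3.0, 3.0])  # class 1 shifted away

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# within-class scatter matrix
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
w = np.linalg.solve(Sw, m1 - m0)
w /= np.linalg.norm(w)

# projected class means are well separated along w
gap = (X1 @ w).mean() - (X0 @ w).mean()
assert gap > 1.0
```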
25
Distance Metric Learning with Side Information (Eric Xing’s method)
Must-link constraint set: each element is a pair of data that come from the same class or cluster
Cannot-link constraint set: each element is a pair of data that come from different classes or clusters
E.P. Xing, A.Y. Ng, M.I. Jordan and S. Russell, Distance Metric Learning, with application to Clustering with side-information. NIPS 16. 2002.
Separation-based objective · Semi-definite programming · No direct mapping · Global method
26
Information Theoretic Metric Learning (ITML)
Suppose we have an initial Mahalanobis distance parameterized by a matrix M0, a set of must-link constraints, and a set of cannot-link constraints. ITML solves the optimization problem
  min_M D_ld(M, M0) s.t. d_M(x_i, x_j) ≤ u for must-linked pairs, d_M(x_i, x_j) ≥ l for cannot-linked pairs
where D_ld is the LogDet divergence between the learned matrix M and the prior M0.
Jason Davis, Brian Kulis, Prateek Jain, Suvrit Sra, & Inderjit Dhillon. Information-Theoretic Metric Learning. ICML 2007
Best paper award
Scalable and efficient
Information-theoretic objective · Bregman projection · No direct mapping · Global method
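ITML measures closeness between Mahalanobis matrices with the LogDet (Burg) divergence; a numpy sketch of that divergence (assuming its standard definition, with illustrative matrices):

```python
import numpy as np

def logdet_div(M, M0):
    """LogDet divergence D_ld(M, M0) = tr(M M0^{-1}) - log det(M M0^{-1}) - d."""
    d = M.shape[0]
    A = M @ np.linalg.inv(M0)
    return float(np.trace(A) - np.log(np.linalg.det(A)) - d)

M0 = np.eye(3)
assert np.isclose(logdet_div(M0, M0), 0.0)   # zero when M equals the prior
M = np.diag([2.0, 1.0, 1.0])
assert logdet_div(M, M0) > 0                 # positive otherwise
```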
27
Summary: Supervised Metric Learning
                        LDA   Xing   ITML   LSML
Label                    √                    √
Pairwise constraints            √      √      √
Local                                         √
Global                   √      √      √
Linear                   √                    √
Nonlinear
Separation               √      √      √      √
Geometry
Information theoretic                  √
Extensibility            √                    √
28
Outline
Unsupervised Metric Learning
Supervised Metric Learning
Semi-supervised Metric Learning
Challenges and Opportunities for Metric Learning
29
Semi-Supervised Metric Learning
Learning pairwise distance metric using data with and without supervision
           Global                           Local
Linear     Constraint Margin Maximization   Laplacian Regularized Metric Learning
Nonlinear  Kernelization                    Semi-Supervised Dimensionality Reduction
30
Constraint Margin Maximization (CMM)
CMM maximizes the projected constraint margin: the scatterness of projected cannot-link pairs minus the compactness of projected must-link pairs, combined with a maximum-variance term that requires no label information
F. Wang, S. Chen, T. Li, C. Zhang. Semi-Supervised Metric Learning by Maximizing Constraint Margin. CIKM 2008.
Geometry & separation based objective · Eigenvalue decomposition · Linear mapping · Global method
31
Laplacian Regularized Metric Learning (LRML)
The LRML objective combines three terms:
– Smoothness term (over the data graph)
– Compactness term (must-link pairs)
– Scatterness term (cannot-link pairs)
S. Hoi, W. Liu, S. Chang. Semi-Supervised Distance Metric Learning for Collaborative Image Retrieval. CVPR08
Geometry & separation based objective · Semi-definite programming · No mapping · Local & global method
32
Semi-Supervised Dimensionality Reduction (SSDR)
Most nonlinear dimensionality reduction algorithms can be formalized as an optimization problem over the low-dimensional embeddings, involving an algorithm-specific symmetric matrix
X. Yang, H. Fu, H. Zha, J. Barlow. Semi-supervised Nonlinear Dimensionality Reduction. ICML 2006.
Separation-based objective · Eigenvalue decomposition · No mapping · Local method
33
Summary: Semi-Supervised Distance Metric Learning
                        CMM   LRML   SSDR
Label                    √     √
Pairwise constraints     √     √
Embedded coordinates                   √
Local                          √       √
Global                   √
Linear                   √     √
Nonlinear                              √
Separation               √     √
Locality-Preservation          √       √
Information theoretic
Extensibility            √
34
Outline
Unsupervised Metric Learning
Supervised Metric Learning
Semi-supervised Metric Learning
Challenges and Opportunities for Metric Learning
35
Scalability
– Online (sequential) learning (Shalev-Shwartz, Singer and Ng, ICML 2004; Davis et al., ICML 2007)
– Distributed learning
Efficiency
– Labeling efficiency (Yang, Jin and Sukthankar, UAI 2007)
– Updating efficiency (Wang, Sun, Hu and Ebadollahi, SDM 2011)
Heterogeneity
– Data heterogeneity (Wang, Sun and Ebadollahi, SDM 2011)
– Task heterogeneity (Zhang and Yeung, KDD 2010)
Evaluation
Evaluation
36
S. Shalev-Shwartz, Y. Singer, A. Y. Ng. Online and Batch Learning of Pseudo-Metrics. ICML 2004.
J. Davis, B. Kulis, P. Jain, S. Sra, I. Dhillon. Information-Theoretic Metric Learning. ICML 2007.
L. Yang, R. Jin, R. Sukthankar. Bayesian Active Distance Metric Learning. UAI 2007.
F. Wang, J. Sun, S. Ebadollahi. Integrating Distance Metrics Learned from Multiple Experts and its Application in Inter-Patient Similarity Assessment. SDM 2011.
F. Wang, J. Sun, J. Hu, S. Ebadollahi. iMet: Interactive Metric Learning in Healthcare Applications. SDM 2011.
Y. Zhang, D.-Y. Yeung. Transfer Metric Learning by Learning Task Relationships. KDD 2010, pp. 1199–1208.
37