
Page 1: Metric Learning Survey Slides

Distance Metric Learning:

A Comprehensive Survey

Liu Yang

Advisor: Rong Jin

May 8th, 2006

Page 2: Metric Learning Survey Slides

Outline

Introduction

Supervised Global Distance Metric Learning

Supervised Local Distance Metric Learning

Unsupervised Distance Metric Learning

Distance Metric Learning based on SVM

Kernel Methods for Distance Metrics Learning

Conclusions

Page 3: Metric Learning Survey Slides

Introduction

Definition

Distance metric learning is the task of learning a distance metric over the input space from a given collection of pairs of similar/dissimilar points, such that the learned metric preserves the distance relations among the training pairs.

Importance

Many machine learning algorithms heavily rely on the distance metric over the input data patterns, e.g. kNN.

A learned metric can significantly improve the performance in classification, clustering and retrieval tasks:

e.g. KNN classifier, spectral clustering, content-based image retrieval (CBIR).

Page 4: Metric Learning Survey Slides

Contributions of this Survey

Review distance metric learning under different learning conditions

supervised learning vs. unsupervised learning

learning in a global sense vs. in a local sense

distance metric based on a linear kernel vs. a nonlinear kernel

Discuss central techniques of distance metric learning

K nearest neighbor

dimension reduction

semidefinite programming

kernel learning

large margin classification

Page 5: Metric Learning Survey Slides

Supervised Distance Metric Learning

Global: Global Distance Metric Learning by Convex Programming

Local: Local Adaptive Distance Metric Learning; Neighborhood Components Analysis; Relevant Component Analysis

Unsupervised Distance Metric Learning

Linear embedding: PCA, MDS

Nonlinear embedding: LLE, ISOMAP, Laplacian Eigenmaps

Distance Metric Learning based on SVM

Large Margin Nearest Neighbor Based Distance Metric Learning

Cast Kernel Margin Maximization into an SDP problem

Kernel Methods for Distance Metrics Learning

Kernel Alignment with SDP

Learning with Idealized Kernel

Page 6: Metric Learning Survey Slides

Outline

Introduction

Supervised Global Distance Metric Learning

Supervised Local Distance Metric Learning

Unsupervised Distance Metric Learning

Distance Metric Learning based on SVM

Kernel Methods for Distance Metrics Learning

Page 7: Metric Learning Survey Slides

Supervised Global Distance Metric

Learning (Xing et al. 2003)

Goal : keep all the data points within the same classes close,

while separating all the data points from different classes.

Formulate as a constrained convex programming problem

minimize the distance between the data pairs in S

subject to the constraint that the data pairs in D are well separated

Equivalence constraints: S = \{(x_i, x_j) \mid x_i and x_j belong to the same class\}

Inequivalence constraints: D = \{(x_i, x_j) \mid x_i and x_j belong to different classes\}

Distance metric: d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^T A (x - y)}, where A \in \mathbb{R}^{m \times m} is the distance metric.

Page 8: Metric Learning Survey Slides

Global Distance Metric Learning (Cont’d)

A is positive semi-definite

Ensures the non-negativity and the triangle inequality of the metric

The number of parameters is quadratic in the number of features

Difficult to scale to a large number of features

Simplify the computation

\min_{A \in \mathbb{R}^{m \times m}} \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \quad s.t. \quad A \succeq 0, \quad \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A^2 \ge 1
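A minimal NumPy sketch of this program (my own illustration, not code from the survey): gradient steps on the similar-pair objective, projection of A back onto the PSD cone, and a rescaling that keeps the dissimilar pairs spread out. The pair sets S and D and all parameter values below are made-up assumptions.

```python
import numpy as np

def mahalanobis_sq(x, y, A):
    """Squared distance ||x - y||_A^2 = (x - y)^T A (x - y)."""
    d = x - y
    return float(d @ A @ d)

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T

def learn_global_metric(X, S, D, n_iter=200, lr=0.01):
    """Rough projected-gradient sketch of the convex program above: shrink
    A-distances over similar pairs S while keeping dissimilar pairs D spread."""
    m = X.shape[1]
    A = np.eye(m)
    for _ in range(n_iter):
        # Gradient of the objective: sum over S of (x_i - x_j)(x_i - x_j)^T.
        grad = np.zeros((m, m))
        for i, j in S:
            d = X[i] - X[j]
            grad += np.outer(d, d)
        A = project_psd(A - lr * grad)
        # Rescale so that sum over D of ||x_i - x_j||_A^2 equals 1.
        spread = sum(mahalanobis_sq(X[i], X[j], A) for i, j in D)
        if spread > 1e-12:
            A /= spread
    return A

# Toy usage with illustrative pair lists.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
S = [(0, 1), (2, 3)]          # assumed similar pairs
D = [(0, 4), (1, 5), (2, 5)]  # assumed dissimilar pairs
print(np.round(learn_global_metric(X, S, D), 3))
```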

Page 9: Metric Learning Survey Slides

(a) Data distribution of the original dataset (b) Data scaled by the global metric

Global Distance Metric Learning:

Example I

Keep all the data points within the same classes close

Separate all the data points from different classes

Page 10: Metric Learning Survey Slides

Global Distance Metric Learning:

Example II

Diagonalizing the distance metric A can simplify computation, but may lead to disastrous results

(a) Original data (b) Rescaling by learned full A (c) Rescaling by learned diagonal A

Page 11: Metric Learning Survey Slides

Problems with Global Distance Metric Learning

Multimodal data distributions prevent global distance metrics from simultaneously satisfying the constraints on within-class compactness and between-class separability.

(a) Data distribution of the original dataset (b) Data scaled by the global metric

Page 12: Metric Learning Survey Slides

Outline

Introduction

Supervised Global Distance Metric Learning

Supervised Local Distance Metric Learning

Unsupervised Distance Metric Learning

Distance Metric Learning based on SVM

Kernel Methods for Distance Metrics Learning

Conclusions

Page 13: Metric Learning Survey Slides

Supervised Local Distance Metric

Learning

Local Adaptive Distance Metric Learning

Local Feature Relevance

Locally Adaptive Feature Relevance Analysis

Local Linear Discriminative Analysis

Neighborhood Components Analysis

Relevant Component Analysis

Page 14: Metric Learning Survey Slides

Local Adaptive Distance Metric

Learning

K Nearest Neighbor Classifier

(x_1, y_1), \ldots, (x_n, y_n): training examples

N(x_0): the K nearest neighbors of x_0

\delta(y, j) = 1 if y = j, and 0 otherwise

\Pr(y = j \mid x_0) = \frac{1}{K} \sum_{x_i \in N(x_0)} \delta(y_i, j)
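A minimal sketch of this neighborhood vote, assuming a plain Euclidean metric and toy data:

```python
import numpy as np

def knn_posterior(x0, X, y, K=3, n_classes=2):
    """Pr(y = j | x0) = (1/K) * sum over the K nearest neighbors of 1(y_i = j)."""
    dists = np.linalg.norm(X - x0, axis=1)    # Euclidean distances to x0
    neighbors = np.argsort(dists)[:K]         # indices of the K nearest neighbors
    counts = np.bincount(y[neighbors], minlength=n_classes)
    return counts / K

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0], [0.2, 0.1]])
y = np.array([0, 0, 1, 1, 0])
print(knn_posterior(np.array([0.15, 0.15]), X, y, K=3))   # -> [1. 0.]
```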

Page 15: Metric Learning Survey Slides

Local Adaptive Distance Metric Learning

Assumption of KNN: Pr(y|x) in the local neighborhood is constant or smooth.

However, this is not necessarily true!

Near class boundaries

Irrelevant dimensions

Modify the local neighborhood by a distance metric:

Elongate the distance along the dimensions where the class labels change rapidly

Squeeze the distance along the dimensions that are almost independent of the class labels

Page 16: Metric Learning Survey Slides

Local Feature Relevance

[J. Friedman,1994]

Assume the least-squared estimate for predicting f(x) is its expectation:

E f = \int f(x) p(x) dx

Conditioned on x_i = z, the least-squared estimate of f(x) is

E[f \mid x_i = z] = \int f(x) p(x \mid x_i = z) dx, \quad where \quad p(x \mid x_i = z) = \frac{p(x)\,\delta(x_i - z)}{\int p(x')\,\delta(x'_i - z)\,dx'}

The improvement in prediction error from knowing x_i = z:

I_i^2(z) = E[(f(x) - E f)^2 \mid x_i = z] - E[(f(x) - E(f(x) \mid x_i = z))^2 \mid x_i = z] = (E f - E[f \mid x_i = z])^2

At z = (z_1, \ldots, z_m), a measure of the relative influence of the i-th input variable on the variation of f(x) at x = z is given by

r_i(z) = I_i^2(z) \Big/ \sum_{k=1}^{p} I_k^2(z)

Page 17: Metric Learning Survey Slides

Locally Adaptive Feature Relevance

Analysis [C. Domeniconi, 2002]

Use a Chi-squared distance analysis to compute a metric for producing a neighborhood in which

the posterior probabilities are approximately constant

the metric is highly adaptive to the query location

Chi-squared distance between the true and estimated posteriors at the test point x_0:

r(X, x_0) = \sum_{j=1}^{J} \frac{[p(j \mid X) - \hat{p}(j \mid x_0)]^2}{\hat{p}(j \mid x_0)}

Use the Chi-squared distance for feature relevance: it tells to which extent the i-th dimension can be relied on for predicting p(j \mid x_0).

Page 18: Metric Learning Survey Slides

Local Relevance Measure

in ith Dimension

r_i(z) = \sum_{j=1}^{J} \frac{[\Pr(j \mid z) - \overline{\Pr}(j \mid x_i = z_i)]^2}{\overline{\Pr}(j \mid x_i = z_i)}

\overline{\Pr}(j \mid x_i = z_i) = E(\Pr(j \mid x) \mid x_i = z_i) is the conditional expectation of p(j \mid x)

The closer \overline{\Pr}(j \mid x_i = z_i) is to Pr(j \mid z), the more information the i-th dimension provides for predicting p(j \mid z)

r_i(z) measures the distance between Pr(j \mid z) and the conditional expectation of Pr(j \mid x) at location z

Calculate r_i(z) for each point z in the neighborhood of x_0

Page 19: Metric Learning Survey Slides

Locally Adaptive Feature Relevance Analysis

A local relevance measure in dimension i, averaged over the neighborhood N(x_0) of x_0:

\bar{r}_i(x_0) = \frac{1}{K} \sum_{z \in N(x_0)} r_i(z)

Relative relevance:

w_i(x_0) = \frac{R_i(x_0)^t}{\sum_{l=1}^{q} R_l(x_0)^t}, \quad where \quad R_i(x_0) = \max_j \bar{r}_j(x_0) - \bar{r}_i(x_0)

t = 1 or 2 corresponds to linear and quadratic weighting.

Weighted distance:

D(x, y) = \sqrt{\sum_{i=1}^{q} w_i (x_i - y_i)^2}
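A small sketch of the last two formulas (illustrative only; the averaged relevance values r̄_i around x_0 are assumed to be given):

```python
import numpy as np

def relevance_weights(r_bar, t=1):
    """w_i = R_i^t / sum_l R_l^t, with R_i = max_j r_bar_j - r_bar_i (t=1 linear, t=2 quadratic)."""
    R = np.max(r_bar) - r_bar
    return R**t / np.sum(R**t)

def weighted_distance(x, y, w):
    """D(x, y) = sqrt(sum_i w_i * (x_i - y_i)^2)."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

r_bar = np.array([0.05, 0.30, 0.10])   # assumed averaged relevance values around x0
w = relevance_weights(r_bar, t=1)
print(w, weighted_distance(np.array([1.0, 2.0, 0.0]), np.array([0.0, 2.5, 1.0]), w))
```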

Page 20: Metric Learning Survey Slides

Local Linear Discriminative Analysis

[T. Hastie et al. 1996]

Sb : the between-class covariance matrix

Sw : the within-class covariance matrix

T = S_w^{-1} S_b. LDA finds the principal eigenvectors of matrix T to

keep patterns from the same class close

separate patterns from different classes apart

LDA metric: formed by stacking the principal eigenvectors of T together

Page 21: Metric Learning Survey Slides

Local Linear

Discriminative Analysis

Need local adaptation of the nearest neighbor metric.

Initialize \Sigma as the identity matrix.

Given a testing point x_0, iterate the following two steps:

Estimate S_b and S_w based on the local neighborhood of x_0 measured by \Sigma

Form a local metric that behaves like the LDA metric:

\Sigma = S_w^{-1/2} \left[ S_w^{-1/2} S_b S_w^{-1/2} + \epsilon I \right] S_w^{-1/2}

\epsilon is a small tuning parameter that prevents neighborhoods from extending to infinity.
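A short sketch of forming this local metric from given S_w and S_b (the covariance values are toy assumptions; epsilon is the tuning parameter above):

```python
import numpy as np

def inv_sqrt(M, eps=1e-10):
    """Inverse matrix square root of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(M)
    return (V / np.sqrt(np.maximum(w, eps))) @ V.T

def local_lda_metric(S_w, S_b, epsilon=1.0):
    """Sigma = Sw^(-1/2) [Sw^(-1/2) Sb Sw^(-1/2) + epsilon*I] Sw^(-1/2)."""
    W = inv_sqrt(S_w)
    return W @ (W @ S_b @ W + epsilon * np.eye(S_w.shape[0])) @ W

# Toy within-class / between-class covariances (illustrative values only).
S_w = np.array([[1.0, 0.2], [0.2, 0.5]])
S_b = np.array([[0.8, 0.0], [0.0, 0.1]])
print(np.round(local_lda_metric(S_w, S_b, epsilon=1.0), 3))
```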

Page 22: Metric Learning Survey Slides

Local Linear Discriminative Analysis

Local S_b shows the inconsistency of the local class centroids.

The estimated metric

shrinks the neighborhood in directions in which the local class centroids differ, producing a neighborhood in which the class centroids coincide

shrinks neighborhoods in directions orthogonal to these local decision boundaries, and elongates them parallel to the boundaries.

Page 23: Metric Learning Survey Slides

Overfitting and scalability problems: the number of parameters is quadratic in the number of features.

Neighborhood Components Analysis

[J. Goldberger et al. 2005]

NCA learns a Mahalanobis distance metric for the kNN classifier by maximizing the leave-one-out cross-validation performance.

The probability of classifying x_i correctly (a soft, weighted count involving pairwise distances):

p_{ij} = \frac{\exp(-\|Ax_i - Ax_j\|^2)}{\sum_{k \ne i} \exp(-\|Ax_i - Ax_k\|^2)}, \quad p_i = \sum_{j \in C_i} p_{ij}, \quad where \quad C_i = \{j \mid c_j = c_i\}

The expected number of correctly classified points:

f(A) = \sum_{i=1}^{n} p_i
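A small NumPy sketch (not from the survey) that evaluates f(A) for a fixed linear map A; NCA itself maximizes this quantity over A, e.g. by gradient ascent:

```python
import numpy as np

def nca_objective(A, X, y):
    """f(A) = sum_i sum_{j in C_i} p_ij, with softmax p_ij over distances in the projected space."""
    Z = X @ A.T                                   # project the data: z_i = A x_i
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                  # exclude the point itself (p_ii = 0)
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)             # p_ij = exp(-||Ax_i - Ax_j||^2) / sum_{k != i} ...
    same = (y[:, None] == y[None, :])             # C_i: points sharing x_i's label
    return float(np.sum(P[same]))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
A = np.eye(3)                                     # identity metric as a starting point
print(nca_objective(A, X, y))                     # expected number of correctly classified points
```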

Page 24: Metric Learning Survey Slides

RCA (Relevant Component Analysis) [N. Shental et al., 2002]

Constructs a Mahalanobis distance metric based on a sum of in-chunklet covariance matrices.

Chunklet: a subset of data points that share the same, but unknown, class label. Chunklet j: \{x_{ji}\}_{i=1}^{n_j}, with mean \hat{m}_j.

Sum of in-chunklet covariance matrices for p points in k chunklets:

\hat{C} = \frac{1}{p} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ji} - \hat{m}_j)(x_{ji} - \hat{m}_j)^T

Apply the linear transformation y = \hat{C}^{-1/2} x
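A compact sketch of the RCA computation, assuming the chunklet index sets are given:

```python
import numpy as np

def rca_transform(X, chunklets):
    """Whitening transform from the average in-chunklet covariance: y = C^(-1/2) x."""
    d = X.shape[1]
    C = np.zeros((d, d))
    p = 0
    for idx in chunklets:                      # each chunklet: indices assumed to share a label
        Xc = X[idx] - X[idx].mean(axis=0)      # center the chunklet at its own mean
        C += Xc.T @ Xc
        p += len(idx)
    C /= p
    w, V = np.linalg.eigh(C)
    C_inv_sqrt = (V / np.sqrt(np.maximum(w, 1e-10))) @ V.T
    return X @ C_inv_sqrt, C_inv_sqrt

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
chunklets = [[0, 1, 2], [3, 4], [5, 6, 7]]     # illustrative chunklet index sets
Y, T = rca_transform(X, chunklets)
print(np.round(T, 3))
```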

Page 25: Metric Learning Survey Slides

Information Maximization under Chunklet Constraints [A. Bar-Hillel et al., 2003]

Maximize the mutual information I(X, Y), subject to within-chunklet compactness constraints:

\max_f I(X, Y) \quad s.t. \quad \frac{1}{p} \sum_{j=1}^{k} \sum_{i=1}^{n_j} \|y_{ji} - m_j^y\|^2 \le K    (*)

where m_j^y is the transformed mean of the j-th chunklet and K is a threshold constant.

Let B = A^T A; (*) can be further written as

\max_B |B| \quad s.t. \quad \frac{1}{p} \sum_{j=1}^{k} \sum_{i=1}^{n_j} \|x_{ji} - m_j\|_B^2 \le K, \quad B \succ 0

Page 26: Metric Learning Survey Slides

RCA algorithm applied to

synthetic Gaussian data

(a) The fully labeled data set with 3 classes.

(b) Same data unlabeled; classes' structure is less evident.

(c) The set of chunklets

(d) The centered chunklets, and their empirical covariance.

(e) The RCA transformation applied to the chunklets. (centered)

(f) The original data after applying the RCA transformation.

Page 27: Metric Learning Survey Slides

Outline

Introduction

Supervised Global Distance Metric Learning

Supervised Local Distance Metric Learning

Unsupervised Distance Metric Learning

Distance Metric Learning based on SVM

Kernel Methods for Distance Metrics Learning

Conclusions

Page 28: Metric Learning Survey Slides

Unsupervised Distance Metric Learning

A Unified Framework for Dimension Reduction

                      linear        nonlinear
Solution 1 (Global)   PCA, MDS      ISOMAP
Solution 2 (Local)                  LLE, Laplacian Eigenmap

Most dimension reduction approaches learn a distance metric without label information, e.g. PCA.

I will present five methods for dimensionality reduction.

Page 29: Metric Learning Survey Slides

Dimensionality Reduction Algorithms

PCA finds the subspace that best preserves the variance of the data.

MDS finds the subspace that best preserves the interpoint distances.

Isomap finds the subspace that best preserves the geodesic interpoint distances. [Tenenbaum et al, 2000].

LLE finds the subspace that best preserves the local linear structure of the data [Roweis and Saul, 2000].

Laplacian Eigenmap finds the subspace that best preserves local neighborhood information in the adjacency graph [M. Belkin and P. Niyogi,2003].

Page 30: Metric Learning Survey Slides

Multidimensional Scaling (MDS)

MDS finds the rank m projection that best preserves the inter-point distance given by matrix D

Converts distances to inner products: B = \tau(D) = X^T X

Eigendecomposition: [V_{MDS}, \Lambda_{MDS}] = eig(B)

Calculate X: X = V_{MDS} \Lambda_{MDS}^{1/2}

The rank-m projection Y closest to X: Y = V_{MDS}^{m} (\Lambda_{MDS}^{m})^{1/2}

Given the distance matrix among cities, MDS produces this map: [figure]
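A sketch of classical MDS, assuming tau(D) is the usual double-centering of the squared distance matrix:

```python
import numpy as np

def classical_mds(D, m=2):
    """Classical MDS: convert distances to inner products, then eigendecompose."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # B = tau(D) = X^T X (double centering)
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:m]              # the m largest eigenvalues
    return V[:, order] * np.sqrt(np.maximum(w[order], 0.0))   # Y = V_m Lambda_m^(1/2)

# Toy example: points on a line, recovered (up to sign) from their distance matrix.
X = np.array([[0.0], [1.0], [3.0]])
D = np.abs(X - X.T)
print(np.round(classical_mds(D, m=1), 3))
```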

Page 31: Metric Learning Survey Slides

PCA (Principal Component Analysis)

PCA finds the subspace that best preserves the data variance:

\Sigma = Var(X), \quad [V_{PCA}, \Lambda_{PCA}] = eig(\Sigma)

PCA projection of X with rank m: Y_{PCA} = V_{PCA}^{m} X

Relation to MDS: V_{PCA} = X V_{MDS} \Lambda_{MDS}^{-1/2}, \quad \Lambda_{PCA} = \Lambda_{MDS}, \quad Y_{PCA} = \Lambda_{PCA}^{1/2} Y_{MDS}

PCA vs. MDS

In the Euclidean case, MDS only differs from PCA by

starting with D and calculating X.

Page 32: Metric Learning Survey Slides


Isometric Feature Mapping (ISOMAP)

[Tenenbaum et al, 2000]

Geodesic :the shortest curve on a manifold

that connects two points on the manifold

e.g. on a sphere, geodesics are great circles

Geodesic distance: length of the geodesic

Points that are far apart as measured by geodesic distance may appear close as measured by Euclidean distance.

Page 33: Metric Learning Survey Slides

ISOMAP

Take a distance matrix as input

Construct a weighted graph G based on neighborhood relations

Estimate pairwise geodesic distance by

“a sequence of short hops” on G

Apply MDS to the geodesic distance matrix
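A self-contained sketch of these steps (k-nearest-neighbor graph, Floyd-Warshall shortest paths as the "short hops", then classical MDS); the spiral data and neighborhood size are illustrative assumptions:

```python
import numpy as np

def isomap(X, n_neighbors=2, m=2):
    """Isomap sketch: kNN graph -> shortest-path (geodesic) distances -> classical MDS."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Build the neighborhood graph: keep edges to the k nearest neighbors, symmetrized.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:n_neighbors + 1]:
            G[i, j] = G[j, i] = D[i, j]
    # Estimate geodesic distances by "a sequence of short hops" (Floyd-Warshall).
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    # Apply classical MDS to the geodesic distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:m]
    return V[:, order] * np.sqrt(np.maximum(w[order], 0.0))

t = np.linspace(0.5, 3 * np.pi, 30)
X = np.c_[t * np.cos(t), t * np.sin(t)]          # a 2-D spiral, "unrolled" by Isomap
print(np.round(isomap(X, n_neighbors=2, m=1)[:5].ravel(), 3))
```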

Page 34: Metric Learning Survey Slides

Locally Linear Embedding (LLE)

[Roweis and Saul, 2000]

LLE finds the subspace that best preserves the local

linear structure of the data

Assumption: manifold is locally “linear”

Each sample in the input space is a linearly weighted

average of its neighbors.

A good projection should best preserve this geometric

locality property

Page 35: Metric Learning Survey Slides

LLE

W: a linear representation of every data point by its neighbors.

Choose W by minimizing the reconstruction error:

\min_W \sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{K} W_{ij} x_j \Big\|^2 \quad s.t. \quad \sum_{j} W_{ij} = 1 \ \forall\, x_i; \quad W_{ij} = 0 \ \text{if } x_j \text{ is not a neighbor of } x_i

Calculate a neighborhood-preserving mapping Y by minimizing the reconstruction error:

\Phi(Y) = \sum_{i} \Big\| y_i - \sum_{j=1}^{K} W_{ij}^{*} y_j \Big\|^2, \quad where \quad W^{*} = \arg\min_W \varepsilon(W)

Y is given by the eigenvectors corresponding to the m lowest nonzero eigenvalues of the matrix (I - W)^T (I - W).
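A rough NumPy sketch of both steps (reconstruction weights, then the embedding from the bottom eigenvectors); the regularization constant and toy data are assumptions:

```python
import numpy as np

def lle(X, n_neighbors=5, m=2, reg=1e-3):
    """LLE sketch: local reconstruction weights, then the bottom eigenvectors of
    (I - W)^T (I - W), skipping the constant eigenvector with eigenvalue ~0."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        Z = X[nbrs] - X[i]                            # neighbors relative to x_i
        C = Z @ Z.T                                   # local Gram matrix
        C += reg * np.trace(C) * np.eye(len(nbrs))    # regularize for stability
        w = np.linalg.solve(C, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                      # weights sum to one
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:m + 1]                           # m eigenvectors with the lowest nonzero eigenvalues

t = np.linspace(0.5, 3 * np.pi, 40)
X = np.c_[t * np.cos(t), t * np.sin(t)]
print(np.round(lle(X, n_neighbors=4, m=1)[:5].ravel(), 3))
```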

Page 36: Metric Learning Survey Slides

Laplacian Eigenmap [M. Belkin and P. Niyogi, 2003]

Laplacian Eigenmap finds the subspace that best preserves local neighborhood information in the adjacency graph.

Graph Laplacian: given a graph G with weight matrix W,

D is a diagonal matrix with D_{ii} = \sum_{j} W_{ji}

L = D - W is the graph Laplacian

Detailed steps:

Construct the adjacency graph G.

Weight the edges: W_{ij} = 1 if nodes i and j are connected, and 0 otherwise.

Generalized eigen-decomposition of L f = \lambda D f

Embedding: the eigenvectors with the smallest m nonzero eigenvalues
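A short sketch of these steps with 0/1 edge weights; it assumes SciPy for the generalized eigenproblem and uses toy data:

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(X, n_neighbors=5, m=2):
    """Laplacian Eigenmap sketch: 0/1 kNN adjacency, L = D - W, solve L f = lambda D f,
    and keep the eigenvectors with the smallest nonzero eigenvalues."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(dist[i])[1:n_neighbors + 1]:
            W[i, j] = W[j, i] = 1.0              # simple 0/1 edge weights
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = eigh(L, D)                      # generalized eigenproblem L f = lambda D f
    return vecs[:, 1:m + 1]                      # skip the constant eigenvector (eigenvalue 0)

t = np.linspace(0.5, 3 * np.pi, 40)
X = np.c_[t * np.cos(t), t * np.sin(t)]
print(np.round(laplacian_eigenmap(X, n_neighbors=4, m=1)[:5].ravel(), 3))
```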

Page 37: Metric Learning Survey Slides

A Unified Framework for

Dimension Reduction Algorithms

All use an eigendecomposition to obtain a lower-dimensional embedding of data lying on a non-linear manifold.

Normalize the affinity matrix H to obtain \hat{H}.

[v_t, \lambda_t] = eig(\hat{H}): the m largest positive eigenvalues and eigenvectors of \hat{H}.

The embedding of x_i has two alternative solutions:

Solution 1 (MDS & Isomap): the embedding is y_i with y_{ti} = \sqrt{\lambda_t}\, v_{ti}; \langle y_i, y_j \rangle is the best approximation of \hat{H}_{ij} in the squared error sense.

Solution 2 (LLE & Laplacian Eigenmap): the embedding is e_i with e_{ti} = v_{ti}.

Page 38: Metric Learning Survey Slides

Outline

Introduction

Supervised Global Distance Metric Learning

Supervised Local Distance Metric Learning

Unsupervised Distance Metric Learning

Distance Metric Learning based on SVM

Kernel Methods for Distance Metrics Learning

Conclusions

Page 39: Metric Learning Survey Slides

Distance Metric Learning based on SVM

Large Margin Nearest Neighbor Based Distance Metric Learning

Objective Function

Reformulation as SDP

Cast Kernel Margin Maximization into a SDP Problem

Maximum Margin

Cast into SDP problem

Apply to Hard Margin and Soft Margin

Page 40: Metric Learning Survey Slides

Large Margin Nearest Neighbor Based Distance Metric Learning [K. Weinberger et al., 2006]

Learns a Mahalanobis distance metric for the kNN classification setting by SDP, which enforces that

the k nearest neighbors belong to the same class

examples from different classes are separated by a large margin

After training (with k = 3):

target neighbors lie within a smaller radius

differently labeled inputs lie outside this smaller radius, with a margin of at least one unit distance.

Page 41: Metric Learning Survey Slides

Large Margin Nearest Neighbor Based Distance Metric Learning

Cost function:

Penalize large distances between each input and its target neighbors.

The hinge loss is incurred by differently labeled inputs whose distance does not exceed, by at least one unit, the distance from input x_i to any of its target neighbors -> they do not threaten to invade each other's neighborhoods.

\varepsilon(L) = \sum_{ij} \eta_{ij} \|L(x_i - x_j)\|^2 + C \sum_{ijl} \eta_{ij} (1 - y_{il}) \left[ 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2 \right]_+

[z]_+ = \max(z, 0) denotes the standard hinge loss, and the constant C > 0.

y_{il} \in \{0, 1\} indicates whether or not the labels y_i and y_l match.

\eta_{ij} \in \{0, 1\} indicates whether x_j is a target neighbor of x_i.
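A direct sketch of evaluating this cost for a fixed L (the target-neighbor sets here are chosen crudely for illustration, not by an actual nearest-neighbor search):

```python
import numpy as np

def lmnn_loss(L, X, y, target_neighbors, C=1.0):
    """eps(L) = sum_ij eta_ij ||L(x_i - x_j)||^2
               + C * sum_ijl eta_ij (1 - y_il) [1 + ||L(x_i - x_j)||^2 - ||L(x_i - x_l)||^2]_+"""
    Z = X @ L.T
    pull, push = 0.0, 0.0
    for i, neighbors in enumerate(target_neighbors):   # eta_ij = 1 exactly for these pairs
        for j in neighbors:
            dij = np.sum((Z[i] - Z[j]) ** 2)
            pull += dij                                 # pull target neighbors close
            for l in range(len(X)):
                if y[l] != y[i]:                        # y_il = 0: differently labeled inputs
                    dil = np.sum((Z[i] - Z[l]) ** 2)
                    push += max(1.0 + dij - dil, 0.0)   # hinge: impostor inside the margin
    return pull + C * push

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = np.array([0] * 5 + [1] * 5)
target_neighbors = [[j for j in range(10) if y[j] == y[i] and j != i][:3] for i in range(10)]
print(round(lmnn_loss(np.eye(2), X, y, target_neighbors), 3))
```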

Page 42: Metric Learning Survey Slides

Reformulation as SDP

Let \|L(x_i - x_j)\|^2 = (x_i - x_j)^T M (x_i - x_j) with M = L^T L, and introduce slack variables \xi_{ijl}.

The resulting SDP is:

\min_M \ \sum_{ij} \eta_{ij} (x_i - x_j)^T M (x_i - x_j) + C \sum_{ijl} \eta_{ij} (1 - y_{il}) \xi_{ijl}

s.t. \quad (x_i - x_l)^T M (x_i - x_l) - (x_i - x_j)^T M (x_i - x_j) \ge 1 - \xi_{ijl}, \quad \xi_{ijl} \ge 0, \quad M \succeq 0

Page 43: Metric Learning Survey Slides

Cast Kernel Margin Maximization

into a SDP Problem

[G. R. G. Lanckriet et al, 2004]

Maximum margin: the decision boundary maximizes the minimum distance to the closest training points.

Hard margin: data linearly separable

Soft margin: data not linearly separable

The performance measure, generalized from the dual solutions of the different margin-maximization problems:

\omega_{C,\tau}(K) = \max_{\alpha} \left\{ 2\alpha^T e - \alpha^T (G(K) + \tau I)\alpha : 0 \le \alpha \le C, \ \alpha^T y = 0 \right\}

with \tau \ge 0; G is the Gram matrix on the training data w.r.t. K, G_{ij}(K) = y_i y_j K(x_i, x_j).

Page 44: Metric Learning Survey Slides

Cast into SDP Problem

General form: \min_{K \succeq 0} \omega_C(K_{tr}) \quad s.t. \quad trace(K) = c

Hard margin: \min_{K \succeq 0} \omega_H(K_{tr}) \ s.t. \ trace(K) = c, \ where \ \omega_H(K_{tr}) = \omega_{\infty,0}(K_{tr})

1-norm soft margin: \min_{K \succeq 0} \omega_{S1}(K_{tr}) \ s.t. \ trace(K) = c, \ where \ \omega_{S1}(K_{tr}) = \omega_{C,0}(K_{tr})

2-norm soft margin: \min_{K \succeq 0} \omega_{S2}(K_{tr}) \ s.t. \ trace(K) = c, \ where \ \omega_{S2}(K_{tr}) = \omega_{\infty,1/C}(K_{tr})

The 1-norm soft-margin case, for example, can be cast as the SDP

\min_{K, t, \lambda, \nu, \delta} \ t

s.t. \quad trace(K) = c, \quad K \succeq 0, \quad \nu \ge 0, \quad \delta \ge 0,

\begin{pmatrix} G(K_{tr}) & e + \nu - \delta + \lambda y \\ (e + \nu - \delta + \lambda y)^T & t - 2C\delta^T e \end{pmatrix} \succeq 0

Page 45: Metric Learning Survey Slides

Outline

Introduction

Supervised Global Distance Metric Learning

Supervised Local Distance Metric Learning

Unsupervised Distance Metric Learning

Distance Metric Learning based on SVM

Kernel Methods for Distance Metrics Learning

Conclusions

Page 46: Metric Learning Survey Slides

Kernel Methods for

Distance Metrics Learning

Learning a good kernel is equivalent to distance metric learning

Kernel Alignment

Kernel Alignment with SDP

Learning with Idealized Kernel

Ideal Kernel

The Idealized Kernel

Page 47: Metric Learning Survey Slides

Kernel Alignment

[N. Cristianini,2001]

A measure of similarity between two kernel functions, or between a kernel and a target function.

The inner product between two kernel matrices based on kernels k_1 and k_2:

\langle K_1, K_2 \rangle_F = \sum_{i,j=1}^{n} K_1(x_i, x_j) K_2(x_i, x_j)

The alignment of K_1 and K_2 w.r.t. the sample S:

\hat{A}(S, k_1, k_2) = \frac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F \langle K_2, K_2 \rangle_F}}

Taking the target yy^T as the second kernel measures the degree of agreement between a kernel and a given learning task:

\hat{A}(S, k_1, y) = \frac{\langle K_1, yy^T \rangle_F}{m \sqrt{\langle K_1, K_1 \rangle_F}}, \quad y \in \{-1, +1\}^m
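A minimal sketch of the alignment computation on a toy linear kernel:

```python
import numpy as np

def alignment(K1, K2):
    """A_hat(K1, K2) = <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F)."""
    num = np.sum(K1 * K2)
    return num / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)             # labels in {-1, +1}
K = X @ X.T                                      # a linear kernel matrix
target = np.outer(y, y)                          # the target matrix y y^T
print(round(alignment(K, target), 3))            # agreement between the kernel and the task
```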

Page 48: Metric Learning Survey Slides

Kernel Alignment with SDP

[G. R. G. Lanckriet et al, 2004]

Optimizing the alignment between a set of labels and a kernel matrix using SDP in a transductive setting.

Optimizing an objective function over the training data block -> automatic tuning of testing data block

The kernel matrix over the training and test data:

K = \begin{pmatrix} K_{tr} & K_{tr,t} \\ K_{tr,t}^T & K_{t} \end{pmatrix}, \quad where \quad K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle, \ i, j = 1, \ldots, n_{tr} + n_t

\max_K \ \hat{A}(S, K_{tr}, yy^T) \quad s.t. \quad K \succeq 0, \ trace(K) = 1

Introduce A with K^T K \preceq A and trace(A) \le 1; this reduces to

\max_{A, K} \ \langle K_{tr}, yy^T \rangle_F \quad s.t. \quad trace(A) \le 1, \quad \begin{pmatrix} A & K^T \\ K & I_n \end{pmatrix} \succeq 0, \quad K \succeq 0

Page 49: Metric Learning Survey Slides

Learning with Idealized Kernel

[J. T. Kwok and I.W. Tsang,2003]

Idealize a given kernel by making it more similar to the ideal kernel matrix.

Ideal kernel:

k^*(x_i, x_j) = \begin{cases} 1, & y(x_i) = y(x_j) \\ 0, & y(x_i) \ne y(x_j) \end{cases}

Idealized kernel:

\tilde{k} = k + \frac{\gamma}{2} k^*

The alignment of \tilde{k} with the ideal kernel is greater than that of k provided \gamma exceeds a threshold determined by \langle K, K^* \rangle_F and n_+^2 + n_-^2, where n_+ and n_- are the numbers of positive and negative samples.

Under the original distance metric M:

k(x_i, x_j) = x_i^T M x_j, \ M \succeq 0; \quad d_{ij}^2 = (x_i - x_j)^T M (x_i - x_j)

The idealized kernel changes the squared distances to

\tilde{d}_{ij}^2 = \tilde{K}_{ii} + \tilde{K}_{jj} - 2\tilde{K}_{ij} = \begin{cases} d_{ij}^2, & y_i = y_j \\ d_{ij}^2 + \gamma, & y_i \ne y_j \end{cases}
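A small sketch of the idealized kernel and the induced squared distances, showing the same-class / different-class behavior above:

```python
import numpy as np

def idealized_kernel(K, y, gamma=1.0):
    """K_tilde = K + (gamma / 2) * K*, where K*_ij = 1 iff y_i == y_j."""
    K_star = (y[:, None] == y[None, :]).astype(float)
    return K + 0.5 * gamma * K_star

def kernel_sq_distances(K):
    """d_ij^2 = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2 * K

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([0, 0, 0, 1, 1, 1])
K = X @ X.T                                   # a linear kernel as the starting kernel
d2 = kernel_sq_distances(K)
d2_tilde = kernel_sq_distances(idealized_kernel(K, y, gamma=1.0))
print(np.round(d2_tilde - d2, 3))             # 0 within a class, +gamma across classes
```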

Page 50: Metric Learning Survey Slides

Idealized kernel (continued)

We modify the squared distance to \tilde{d}_{ij}^2 = (x_i - x_j)^T A^T A (x_i - x_j), so that

\tilde{d}_{ij}^2 = \begin{cases} d_{ij}^2, & y_i = y_j \\ d_{ij}^2 + \gamma, & y_i \ne y_j \end{cases}

Search for a matrix A under which

different classes are pulled apart by an amount of at least \gamma

the same class gets closer together.

Introduce slack variables \xi_{ij} for error tolerance. With B = A^T A, the optimization becomes

\min_{B, \gamma, \xi} \ \frac{1}{2}\|B\|^2 + \frac{C_S}{N_S} \sum_{(x_i, x_j) \in S} \xi_{ij} + \frac{C_D}{N_D} \sum_{(x_i, x_j) \in D} \xi_{ij}

s.t. \quad d_{ij}^2 - \tilde{d}_{ij}^2 \ge -\xi_{ij}, \ (x_i, x_j) \in S; \quad \tilde{d}_{ij}^2 - d_{ij}^2 \ge \gamma - \xi_{ij}, \ (x_i, x_j) \in D; \quad \xi_{ij} \ge 0, \ \gamma \ge 0

Page 51: Metric Learning Survey Slides

Conclusions

A comprehensive review, covering:

Supervised distance metric learning

Unsupervised distance metric learning

Maximum margin based distance metric learning approaches

Kernel methods towards distance metrics

Challenges:

Unsupervised distance metric learning.

Going local in a principled manner.

Learning an explicit nonlinear distance metric in the local sense.

Efficiency issues.