    Metric and Kernel Learning

Inderjit S. Dhillon, University of Texas at Austin

Cornell University, Oct 30, 2007

    Joint work with Jason Davis, Prateek Jain, Brian Kulis and Suvrit Sra

    Metric Learning

    Goal: Learn Distance Metric between Data

    Important problem in Data Mining & Machine Learning

    Can govern success or failure of data mining algorithm

    Metric Learning: Example I

Similarity by Person (identity) or by Pose

    Metric Learning: Example II

    Consider a set of text documents

    Each document is a review of a classical music piece

    Might be clusterable in one of two ways

By Composer (Beethoven, Mozart, Mendelssohn)
By Form (Symphony, Sonata, Concerto)

    Similarity by Composer or by Form

    Mahalanobis Distances

    We restrict ourselves to learning Mahalanobis distances:

Distance parameterized by a positive definite matrix $\Sigma$:

$d_\Sigma(x_1, x_2) = (x_1 - x_2)^T \Sigma (x_1 - x_2)$

Often $\Sigma$ is the inverse of the covariance matrix

Generalizes squared Euclidean distance ($\Sigma = I$)

    Rotates and scales input data
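As a quick illustration (a minimal NumPy sketch, not from the talk; the helper name and test values are mine):

```python
import numpy as np

def mahalanobis_sq(x1, x2, Sigma):
    """Squared Mahalanobis distance (x1 - x2)^T Sigma (x1 - x2)."""
    diff = x1 - x2
    return float(diff @ Sigma @ diff)

x1, x2 = np.array([1.0, 2.0]), np.array([0.0, 0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # any symmetric positive definite matrix
print(mahalanobis_sq(x1, x2, np.eye(2)))     # Sigma = I: squared Euclidean distance
print(mahalanobis_sq(x1, x2, Sigma))         # rotated/rescaled distance
```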

    Example: Four Blobs

[Figure: four clusters ("blobs") of 2-D points, which can be grouped in two different ways]

Want to learn (to recover one grouping):

$\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$


Want to learn (to recover the other grouping):

$\Sigma = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$

    Problem Formulation

    Metric Learning Goal:

$\min_{\Sigma}\ \mathrm{loss}(\Sigma, \Sigma_0)$
$(x_i - x_j)^T \Sigma (x_i - x_j) \le u$ if $(i,j) \in S$ [similarity constraints]
$(x_i - x_j)^T \Sigma (x_i - x_j) \ge \ell$ if $(i,j) \in D$ [dissimilarity constraints]

Learn an SPD matrix $\Sigma$ that is close to a baseline SPD matrix $\Sigma_0$

Other linear constraints on $\Sigma$ are possible

Constraints can arise from various scenarios

Unsupervised: click-through feedback
Semi-supervised: must-link and cannot-link constraints
Supervised: points in the same class have small distance, etc.

    QUESTION: What should loss be?

    LogDet Divergence

We take $\mathrm{loss}(\Sigma, \Sigma_0)$ to be the Log-Determinant (LogDet) Divergence:

$D_{\ell d}(\Sigma, \Sigma_0) = \mathrm{trace}(\Sigma \Sigma_0^{-1}) - \log\det(\Sigma \Sigma_0^{-1}) - d$

Our Goal:

$\min_{\Sigma}\ D_{\ell d}(\Sigma, \Sigma_0)$
$(x_i - x_j)^T \Sigma (x_i - x_j) \le u$ if $(i,j) \in S$ [similarity constraints]
$(x_i - x_j)^T \Sigma (x_i - x_j) \ge \ell$ if $(i,j) \in D$ [dissimilarity constraints]
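For concreteness, a small NumPy sketch (mine, not from the talk) of the LogDet divergence defined above:

```python
import numpy as np

def logdet_div(X, Y):
    """LogDet divergence D_ld(X, Y) for symmetric positive definite X, Y."""
    d = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    sign, logdet = np.linalg.slogdet(XYinv)   # more stable than log(det(...))
    return np.trace(XYinv) - logdet - d

A = np.array([[2.0, 0.5], [0.5, 1.0]])
print(logdet_div(A, A))          # 0: a matrix has zero divergence from itself
print(logdet_div(A, np.eye(2)))  # > 0
```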

    Preview

    Salient points of our Approach:

    Metric Learning is equivalent to Kernel Learning

    Generalizes to Unseen Data Points

    Can improve upon an input metric or kernel

    No expensive eigenvector computation or semi-definite programming

    Most existing methods fail to satisfy one or more of the above

    Brief Digression

    Bregman Divergences

Let $\varphi : S \to \mathbb{R}$ be a differentiable, strictly convex function of Legendre type ($S \subseteq \mathbb{R}^d$)

The Bregman Divergence $D_\varphi : S \times \mathrm{relint}(S) \to \mathbb{R}$ is defined as

$D_\varphi(x, y) = \varphi(x) - \varphi(y) - (x - y)^T \nabla\varphi(y)$

Example: $\varphi(z) = \frac{1}{2} z^T z$ gives $D_\varphi(x, y) = \frac{1}{2}\|x - y\|^2$

    Squared Euclidean distance is a Bregman divergence
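A small NumPy sketch of the general definition (mine, not from the talk); the two choices of $\varphi$ below reproduce the squared Euclidean and relative entropy examples on these slides:

```python
import numpy as np

def bregman_div(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - (x - y)^T grad_phi(y)."""
    return phi(x) - phi(y) - (x - y) @ grad_phi(y)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])

# phi(z) = 1/2 z^T z  ->  squared Euclidean distance (with the 1/2 factor)
print(bregman_div(lambda z: 0.5 * z @ z, lambda z: z, x, y),
      0.5 * np.sum((x - y) ** 2))                       # equal

# phi(z) = sum_i z_i log z_i  ->  (unnormalized) relative entropy
print(bregman_div(lambda z: np.sum(z * np.log(z)),
                  lambda z: np.log(z) + 1.0, x, y),
      np.sum(x * np.log(x / y) - x + y))                # equal
```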

Example: $\varphi(z) = z \log z$ gives $D_\varphi(x, y) = x \log\frac{x}{y} - x + y$

    Relative Entropy (or KL-divergence) is another Bregman divergence

Example: $\varphi(z) = -\log z$ gives $D_\varphi(x, y) = \frac{x}{y} - \log\frac{x}{y} - 1$

Itakura-Saito distance (used in signal processing) is also a Bregman divergence

    Examples of Bregman Divergences

Function Name        $\varphi(x)$                            $D_\varphi(x, y)$

Squared norm         $\frac{1}{2}x^2$                        $\frac{1}{2}(x - y)^2$
Shannon entropy      $x \log x - x$                          $x \log\frac{x}{y} - x + y$
Bit entropy          $x \log x + (1 - x)\log(1 - x)$         $x \log\frac{x}{y} + (1 - x)\log\frac{1 - x}{1 - y}$
Burg entropy         $-\log x$                               $\frac{x}{y} - \log\frac{x}{y} - 1$
Hellinger            $-\sqrt{1 - x^2}$                       $(1 - xy)(1 - y^2)^{-1/2} - (1 - x^2)^{1/2}$
$\ell_p$ quasi-norm  $-\|x\|_p$ $(0 < p < 1)$                 ...


    Bregman Matrix Divergences

Define

$D_\varphi(X, Y) = \varphi(X) - \varphi(Y) - \mathrm{trace}\big((\nabla\varphi(Y))^T (X - Y)\big)$


Squared Frobenius norm: $\varphi(X) = \frac{1}{2}\|X\|_F^2$. Then

$D_\varphi(X, Y) = \frac{1}{2}\|X - Y\|_F^2$


von Neumann Divergence: For $X \succeq 0$, $\varphi(X) = \mathrm{trace}(X \log X)$. Then

$D_\varphi(X, Y) = \mathrm{trace}(X \log X - X \log Y - X + Y)$

    also called quantum relative entropy


LogDet divergence: For $X \succ 0$, $\varphi(X) = -\log\det X$. Then

$D_\varphi(X, Y) = \mathrm{trace}(XY^{-1}) - \log\det(XY^{-1}) - d$
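As an illustration (my sketch, not from the talk), the von Neumann divergence can be computed through an eigendecomposition; the helper names and test matrices are assumptions:

```python
import numpy as np

def matrix_log(X):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def von_neumann_div(X, Y):
    """trace(X log X - X log Y - X + Y) for symmetric positive definite X, Y."""
    return np.trace(X @ matrix_log(X) - X @ matrix_log(Y) - X + Y)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); X = A @ A.T + np.eye(3)
B = rng.standard_normal((3, 3)); Y = B @ B.T + np.eye(3)
print(von_neumann_div(X, X))   # ~0
print(von_neumann_div(X, Y))   # >= 0
```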

    LogDet Divergence: Properties I

$D_{\ell d}(X, Y) = \mathrm{trace}(XY^{-1}) - \log\det(XY^{-1}) - d$

Properties:

Not symmetric
Triangle inequality does not hold
Can be unbounded
Convex in the first argument (not in the second)
Pythagorean property holds:

$D_{\ell d}(X, Y) \ge D_{\ell d}(X, P(Y)) + D_{\ell d}(P(Y), Y)$

(where $P(Y)$ is the Bregman projection of $Y$ onto a convex set containing $X$)

Divergence between inverses:

$D_{\ell d}(X, Y) = D_{\ell d}(Y^{-1}, X^{-1})$

    LogDet Divergence: Properties II

$D_{\ell d}(X, Y) = \mathrm{trace}(XY^{-1}) - \log\det(XY^{-1}) - d$

$= \sum_{i=1}^{d} \sum_{j=1}^{d} (v_i^T u_j)^2 \left( \frac{\lambda_i}{\theta_j} - \log\frac{\lambda_i}{\theta_j} - 1 \right)$

where $(\lambda_i, v_i)$ are the eigenvalue/eigenvector pairs of $X$ and $(\theta_j, u_j)$ those of $Y$

Properties:

Scale invariance: $D_{\ell d}(\alpha X, \alpha Y) = D_{\ell d}(X, Y)$ for $\alpha > 0$

In fact, for any invertible $M$:

$D_{\ell d}(X, Y) = D_{\ell d}(M^T X M, M^T Y M)$
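A quick numeric check of these invariances (my sketch, not from the talk):

```python
import numpy as np

def logdet_div(X, Y):
    XYinv = X @ np.linalg.inv(Y)
    return np.trace(XYinv) - np.linalg.slogdet(XYinv)[1] - X.shape[0]

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)); X = A @ A.T + np.eye(4)
B = rng.standard_normal((4, 4)); Y = B @ B.T + np.eye(4)
M = rng.standard_normal((4, 4))                # invertible with probability 1

print(logdet_div(X, Y))
print(logdet_div(2.5 * X, 2.5 * Y))            # scale invariance
print(logdet_div(M.T @ X @ M, M.T @ Y @ M))    # invariance under congruence
# all three printed values agree up to floating-point error
```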

Definition can be extended to rank-deficient matrices (the sums over eigenvalues above then run to the rank $r$ instead of $d$)

Finiteness: $D_{\ell d}(X, Y)$ is finite iff $X$ and $Y$ have the same range space


    Information Theoretic Interpretation

    Differential Relative Entropy between two Multivariate Gaussians:

$\int N(x; \mu, \Sigma_0^{-1}) \log \frac{N(x; \mu, \Sigma_0^{-1})}{N(x; \mu, \Sigma^{-1})}\, dx = \frac{1}{2} D_{\ell d}(\Sigma, \Sigma_0)$

(the Gaussians here have the Mahalanobis matrices $\Sigma_0$ and $\Sigma$ as their inverse covariances)

    Thus, the following two problems are equivalent

Relative Entropy Formulation:

$\min_{\Sigma} \int N(x; \mu, \Sigma_0^{-1}) \log \frac{N(x; \mu, \Sigma_0^{-1})}{N(x; \mu, \Sigma^{-1})}\, dx$
s.t. $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \le u$,  $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \ge \ell$,  $\Sigma \succeq 0$

LogDet Formulation:

$\min_{\Sigma}\ D_{\ell d}(\Sigma, \Sigma_0)$
s.t. $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \le u$,  $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \ge \ell$,  $\Sigma \succeq 0$
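A small numeric check of the Gaussian/LogDet identity above (my sketch; the closed-form KL helper is standard, not from the talk):

```python
import numpy as np

def logdet_div(X, Y):
    XYinv = X @ np.linalg.inv(Y)
    return np.trace(XYinv) - np.linalg.slogdet(XYinv)[1] - X.shape[0]

def kl_gauss_same_mean(S0, S1):
    """KL( N(mu, S0) || N(mu, S1) ) for Gaussians that share a mean."""
    d = S0.shape[0]
    R = np.linalg.inv(S1) @ S0
    return 0.5 * (np.trace(R) - np.linalg.slogdet(R)[1] - d)

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3)); Sigma  = A @ A.T + np.eye(3)   # Mahalanobis matrix
B = rng.standard_normal((3, 3)); Sigma0 = B @ B.T + np.eye(3)   # baseline

# Gaussians whose inverse covariances are Sigma0 and Sigma:
kl = kl_gauss_same_mean(np.linalg.inv(Sigma0), np.linalg.inv(Sigma))
print(kl, 0.5 * logdet_div(Sigma, Sigma0))   # the two values match
```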

Stein's Loss

LogDet divergence is known as Stein's loss in the statistics community

Stein's loss is the unique scale-invariant loss function for which the uniform minimum variance unbiased estimator is also a minimum risk equivariant estimator

Quasi-Newton Optimization

    LogDet Divergence arises in the BFGS and DFP updates

    Quasi-Newton methods

    Approximate Hessian of the function to be minimized

    [Fletcher, 1991] BFGS update can be written as:

$\min_{B}\ D_{\ell d}(B, B_t)$  subject to  $B s_t = y_t$  (secant equation)

$s_t = x_{t+1} - x_t$, $y_t = \nabla f_{t+1} - \nabla f_t$

Closed-form solution:

$B_{t+1} = B_t - \dfrac{B_t s_t s_t^T B_t}{s_t^T B_t s_t} + \dfrac{y_t y_t^T}{s_t^T y_t}$

    Similar form for DFP update
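A minimal sketch of the BFGS update above (mine, not from the talk); the toy quadratic is only there to show that the secant equation holds after the update:

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation B from step s and gradient change y."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

H = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian of a toy quadratic f(x) = 1/2 x^T H x
B = np.eye(2)                            # initial Hessian approximation
s = np.array([1.0, -0.5])                # step x_{t+1} - x_t
y = H @ s                                # gradient change for the quadratic
B_next = bfgs_update(B, s, y)
print(B_next @ s, y)                     # secant equation: B_{t+1} s_t = y_t
```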

    Algorithm: Bregman Projections for LogDet

Algorithm: Cyclic Bregman Projections (successively onto each linear constraint) converges to the globally optimal solution

    Use Bregman projections to update the Mahalanobis matrix:

$\min_{\Sigma}\ D_{\ell d}(\Sigma, \Sigma_t)$    s.t. $(x_i - x_j)^T \Sigma (x_i - x_j) \le u$

Can be solved by a rank-one update (see the sketch below):

$\Sigma_{t+1} = \Sigma_t + \beta_t\, \Sigma_t (x_i - x_j)(x_i - x_j)^T \Sigma_t$

Advantages:

Automatic enforcement of positive semidefiniteness
Simple, closed-form projections
No eigenvector calculation
Easy to incorporate slack for each constraint
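The sketch below (mine, not the talk's code) shows one such rank-one projection for a single similarity constraint, treated as an equality and without the slack handling mentioned above; the closed-form multiplier follows from requiring $p^T \Sigma_{t+1} p = u$ with $p = x_i - x_j$:

```python
import numpy as np

def project_similarity(Sigma, xi, xj, u):
    """LogDet Bregman projection of Sigma onto {S : (xi - xj)^T S (xi - xj) = u}."""
    p = xi - xj
    d = p @ Sigma @ p              # current constraint value
    beta = (u - d) / (d * d)       # multiplier so the constraint holds with equality
    Sp = Sigma @ p
    return Sigma + beta * np.outer(Sp, Sp)   # rank-one update; stays PD for u > 0

Sigma = np.eye(2)
xi, xj = np.array([1.0, 1.0]), np.array([0.0, 0.5])
Sigma_new = project_similarity(Sigma, xi, xj, u=0.1)
p = xi - xj
print(p @ Sigma_new @ p)           # 0.1: the similarity constraint is now satisfied
```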

Example: Noisy XOR

[Figure: scatter plot of noisy XOR data in the plane]

    No linear transformation for XOR grouping

Kernel Methods

    Map input data to higher-dimensional feature space:

$x \to \phi(x)$

Idea: Run the machine learning algorithm in feature space

Noisy XOR Example:

$x \to \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}$

    Inderjit S. Dhillon University of Texas at Austin Metric and Kernel Learning

    Kernel Methods

    http://find/http://goback/
  • 8/4/2019 Metric Cornell

    34/48

    Map input data to higher-dimensional feature space:

    x (x)Idea: Run machine learning algorithm in feature space

    Noisy XOR Example:

    x x212x1x2

    x22

Kernel function: $K(x, y) = \langle \phi(x), \phi(y) \rangle$

Kernel trick: no need to explicitly form high-dimensional features

Noisy XOR Example: $\langle \phi(x), \phi(y) \rangle = (x^T y)^2$
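A two-line check (mine, not from the talk) that the explicit quadratic feature map and the kernel $(x^T y)^2$ agree:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([0.9, 0.1]), np.array([0.1, 1.1])
print(phi(x) @ phi(y))   # inner product in feature space
print((x @ y) ** 2)      # kernel evaluation: same value, no explicit features needed
```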

Connection to Kernel Learning

LogDet Formulation (1):

$\min_{\Sigma}\ D_{\ell d}(\Sigma, \Sigma_0)$
s.t. $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \le u$,  $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \ge \ell$,  $\Sigma \succeq 0$

Kernel Formulation (2):

$\min_{K}\ D_{\ell d}(K, K_0)$
s.t. $\mathrm{tr}(K (e_i - e_j)(e_i - e_j)^T) \le u$,  $\mathrm{tr}(K (e_i - e_j)(e_i - e_j)^T) \ge \ell$,  $K \succeq 0$

(1) optimizes w.r.t. the $d \times d$ Mahalanobis matrix $\Sigma$; (2) optimizes w.r.t. the $N \times N$ kernel matrix $K$

Let $K_0 = X^T \Sigma_0 X$, where $X$ is the input data matrix

Let $\Sigma^*$ be the optimal solution to (1) and $K^*$ the optimal solution to (2)

Theorem: $K^* = X^T \Sigma^* X$
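The identity behind this correspondence can be checked directly: with $K = X^T \Sigma X$, Mahalanobis distances between columns of $X$ equal the corresponding "kernelized" distances. A sketch with made-up data (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 3, 5
X = rng.standard_normal((d, N))          # data points as columns
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)              # an SPD Mahalanobis matrix
K = X.T @ Sigma @ X                      # corresponding kernel matrix

i, j = 0, 3
diff = X[:, i] - X[:, j]
e = np.zeros(N); e[i], e[j] = 1.0, -1.0  # e_i - e_j

print(diff @ Sigma @ diff)               # (x_i - x_j)^T Sigma (x_i - x_j)
print(e @ K @ e)                         # tr(K (e_i - e_j)(e_i - e_j)^T): same value
```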

Kernelization

    Metric learning in kernel space

Assume an input kernel function $\kappa(x, y) = \phi(x)^T \phi(y)$

Want to learn

$d_\Sigma(\phi(x), \phi(y)) = (\phi(x) - \phi(y))^T \Sigma\, (\phi(x) - \phi(y))$

Equivalently: learn a new kernel function of the form

$\tilde{\kappa}(x, y) = \phi(x)^T \Sigma\, \phi(y)$

How to learn this using only $\kappa(x, y)$?

Learned kernel can be shown to be of the form

$\tilde{\kappa}(x, y) = \kappa(x, y) + \sum_i \sum_j \sigma_{ij}\, \kappa(x, x_i)\, \kappa(y, x_j)$

Can update the $\sigma_{ij}$ parameters while optimizing the kernel formulation
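A sketch (mine, not the talk's code) of evaluating a learned kernel of the form above on new points; the input kernel, anchor points $x_i$, and coefficients $\sigma_{ij}$ below are illustrative stand-ins, not values produced by the actual optimization:

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Input kernel kappa(x, y); an RBF kernel is used here only for illustration."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def learned_kernel(x, y, anchors, S, kappa=rbf):
    """kappa(x, y) + sum_ij S[i, j] * kappa(x, x_i) * kappa(y, x_j)."""
    kx = np.array([kappa(x, xi) for xi in anchors])
    ky = np.array([kappa(y, xj) for xj in anchors])
    return kappa(x, y) + kx @ S @ ky

anchors = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]   # the x_i in the expansion
S = np.array([[0.3, -0.1], [-0.1, 0.2]])                 # stand-in for learned sigma_ij
print(learned_kernel(np.array([0.2, 0.1]), np.array([0.9, 1.2]), anchors, S))
```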

Related Work

    Distance Metric Learning [Xing, Ng, Jordan & Russell, 2002]

Large Margin Nearest Neighbor (LMNN) [Weinberger, Blitzer & Saul, 2005]

    Collapsing Classes (MCML) [Globerson & Roweis, 2005]

    Online Metric Learning (POLA) [Shalev-Shwartz, Singer & Ng, 2004]

    Many others!

Experimental Results

    Framework

    k-nearest neighbor (k = 4)

$\ell$ and $u$ determined by the 5th and 95th percentiles of the distribution of distances

$20c^2$ constraints, chosen randomly

    2-fold cross validation

    Algorithms

    Information-theoretic Metric Learning (offline and online)

    Large-Margin Nearest Neighbors (LMNN) [Weinberger et al.]

    Metric Learning by Collapsing Classes (MCML) [Globerson and Roweis]

    Baseline Metrics: Euclidean and Inverse Covariance

Results: UCI Data Sets

Ran ITML with $\Sigma_0 = I$ (ITML-MaxEnt) and with the inverse covariance (Inverse Covariance)

Ran the online algorithm for $10^5$ iterations

[Figure: k-NN error on UCI data sets (Wine, Ionosphere, Scale, Iris, Soybean) for ITML-MaxEnt, ITML-Online, Euclidean, Inverse Covariance, MCML, and LMNN]

Application 1: Clarify

Clarify improves error reporting for software that uses black-box components

    Motivation: Black box components complicate error messaging

    Solution: Error diagnosis via machine learning

    Representation: System collects program features during run-time

    Function counts

Call-site counts
Counts of program paths
Program execution represented as a vector of counts

    Class labels: Program execution errors

    Nearest neighbor software support

Match program executions with others
Underlying distance measure should reflect this similarity

Results: Clarify

    Very high dimensionality

Feature selection reduces the number of features to 20

[Figure: error on Clarify benchmarks (Latex, Mpg321, Foxpro, Iptables) for ITML-MaxEnt, ITML-Inverse Covariance, Euclidean, Inverse Covariance, MCML, and LMNN]

Application 2: Learning Image Similarity

    Goal: Learn a metric to compare images

    Start with a baseline measure

Use the pyramid match kernel [Grauman and Darrell]
Compares sets of image features
Efficient and effective measure of similarity between images

    Application of metric learning in kernel space

Other metric learning methods (LMNN, etc.) cannot be applied
Does metric learning work in kernel space?

Caltech 101 Results

    Data Set: Caltech 101

    Standard benchmark for multi-category image recognition

    101 classes of images

    Wide variance in pose etc.

    Challenging data set

Experimental Setup

15 images per class in training set; rest in test set (2454 images)

    Performed 1-NN using original PMK and learned PMK


[Figure: k-NN accuracy vs. number of constraint points (700-1500) for the learned PMK and the original PMK]

When constraints are drawn from all training data, kNN accuracy is 52%, versus 32% for the original PMK ([Jain, Kulis & Grauman, 2007])

Data set is well-studied; best performance with 15 training images per class is 60%

    Uses different features (geometric-blur)

Metric Learning in High Dimensions

    Text analysis & Software analysis: Feature sets larger than 1,000

    Learning full distance matrix requires over 1 million parameters!

    Overfitting problems, intractable

    Solution: Learning low-rank Mahalanobis matrices

    LogDet divergence can be generalized to low-rank matrices

$D_{\ell d}(X, Y)$ is finite iff $X$ and $Y$ have the same range space

Extending ITML to the low-rank case: if $\Sigma_0$ is low-rank, then $\Sigma$ is low-rank

Clustered Mahalanobis matrices: if $\Sigma_0$ is a block matrix, then $\Sigma$ will also be a block matrix

Low-rank Metric Learning: Preliminary Results

    Classification accuracy for Low-Rank ITML

    Classic3: Text dataset, PCA basis

    Mpg321, Gcc: Software analysis, Clustered basis

[Figure: classification accuracy vs. rank on Classic3, Mpg321, and Gcc for Low-Rank ITML, LMNN, MCML, and ITML]

Conclusions

    Metric Learning Formulation

    Uses LogDet divergence

Information-theoretic interpretation
Equivalent to kernel learning problem

Can improve upon input metric or kernel
Generalizes to unseen data points

Algorithm

Bregman projections result in simple rank-one updates

    Can be kernelized

    Online variant has provable regret bounds

Empirical Evaluation

Method is competitive with existing techniques

    Scalable to large data sets

    Applied to nearest-neighbor software support & image recognition

References

J. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, Information-Theoretic Metric Learning, International Conference on Machine Learning (ICML), pages 209-216, June 2007.

B. Kulis, M. Sustik, and I. S. Dhillon, Learning Low-Rank Kernel Matrices, International Conference on Machine Learning (ICML), pages 505-512, July 2006 (longer version submitted to JMLR).
