    Metric and Kernel Learning

Inderjit S. Dhillon, University of Texas at Austin

Cornell University, Oct 30, 2007

    Joint work with Jason Davis, Prateek Jain, Brian Kulis and Suvrit Sra

    Metric Learning

    Goal: Learn Distance Metric between Data

    Important problem in Data Mining & Machine Learning

    Can govern success or failure of data mining algorithm

    Metric Learning: Example I

Similarity by Person (identity) or by Pose

    Metric Learning: Example II

    Consider a set of text documents

    Each document is a review of a classical music piece

    Might be clusterable in one of two ways

By Composer (Beethoven, Mozart, Mendelssohn)
By Form (Symphony, Sonata, Concerto)

    Similarity by Composer or by Form

    Mahalanobis Distances

    We restrict ourselves to learning Mahalanobis distances:

Distance parameterized by a positive definite matrix $\Sigma$:

$d_\Sigma(x_1, x_2) = (x_1 - x_2)^T \Sigma (x_1 - x_2)$

Often $\Sigma$ is the inverse of the covariance matrix

Generalizes squared Euclidean distance ($\Sigma = I$)

    Rotates and scales input data
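As a quick illustration (a minimal NumPy sketch, not from the talk; the helper name and test values are mine):

```python
import numpy as np

def mahalanobis_sq(x1, x2, Sigma):
    """Squared Mahalanobis distance (x1 - x2)^T Sigma (x1 - x2)."""
    diff = x1 - x2
    return float(diff @ Sigma @ diff)

x1, x2 = np.array([1.0, 2.0]), np.array([0.0, 0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # any symmetric positive definite matrix
print(mahalanobis_sq(x1, x2, np.eye(2)))     # Sigma = I: squared Euclidean distance
print(mahalanobis_sq(x1, x2, Sigma))         # rotated/rescaled distance
```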

    Example: Four Blobs

[Figure: four clusters ("blobs") of 2-D points, which can be grouped in two different ways]

Want to learn (to recover one grouping):

$\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$


Want to learn (to recover the other grouping):

$\Sigma = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$

    Problem Formulation

    Metric Learning Goal:

$\min_{\Sigma}\ \mathrm{loss}(\Sigma, \Sigma_0)$
$(x_i - x_j)^T \Sigma (x_i - x_j) \le u$ if $(i,j) \in S$ [similarity constraints]
$(x_i - x_j)^T \Sigma (x_i - x_j) \ge \ell$ if $(i,j) \in D$ [dissimilarity constraints]

Learn an SPD matrix $\Sigma$ that is close to a baseline SPD matrix $\Sigma_0$

Other linear constraints on $\Sigma$ are possible

Constraints can arise from various scenarios

Unsupervised: click-through feedback
Semi-supervised: must-link and cannot-link constraints
Supervised: points in the same class have small distance, etc.

    QUESTION: What should loss be?

    LogDet Divergence

We take $\mathrm{loss}(\Sigma, \Sigma_0)$ to be the Log-Determinant (LogDet) Divergence:

$D_{\ell d}(\Sigma, \Sigma_0) = \mathrm{trace}(\Sigma \Sigma_0^{-1}) - \log\det(\Sigma \Sigma_0^{-1}) - d$

Our Goal:

$\min_{\Sigma}\ D_{\ell d}(\Sigma, \Sigma_0)$
$(x_i - x_j)^T \Sigma (x_i - x_j) \le u$ if $(i,j) \in S$ [similarity constraints]
$(x_i - x_j)^T \Sigma (x_i - x_j) \ge \ell$ if $(i,j) \in D$ [dissimilarity constraints]
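For concreteness, a small NumPy sketch (mine, not from the talk) of the LogDet divergence defined above:

```python
import numpy as np

def logdet_div(X, Y):
    """LogDet divergence D_ld(X, Y) for symmetric positive definite X, Y."""
    d = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    sign, logdet = np.linalg.slogdet(XYinv)   # more stable than log(det(...))
    return np.trace(XYinv) - logdet - d

A = np.array([[2.0, 0.5], [0.5, 1.0]])
print(logdet_div(A, A))          # 0: a matrix has zero divergence from itself
print(logdet_div(A, np.eye(2)))  # > 0
```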

    Preview

    Salient points of our Approach:

    Metric Learning is equivalent to Kernel Learning

    Generalizes to Unseen Data Points

    Can improve upon an input metric or kernel

    No expensive eigenvector computation or semi-definite programming

    Most existing methods fail to satisfy one or more of the above

    Brief Digression

    Bregman Divergences

Let $\varphi : S \to \mathbb{R}$ be a differentiable, strictly convex function of Legendre type ($S \subseteq \mathbb{R}^d$)

The Bregman Divergence $D_\varphi : S \times \mathrm{relint}(S) \to \mathbb{R}$ is defined as

$D_\varphi(x, y) = \varphi(x) - \varphi(y) - (x - y)^T \nabla\varphi(y)$

Example: $\varphi(z) = \frac{1}{2} z^T z$ gives $D_\varphi(x, y) = \frac{1}{2}\|x - y\|^2$

    Squared Euclidean distance is a Bregman divergence
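A small NumPy sketch of the general definition (mine, not from the talk); the two choices of $\varphi$ below reproduce the squared Euclidean and relative entropy examples on these slides:

```python
import numpy as np

def bregman_div(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - (x - y)^T grad_phi(y)."""
    return phi(x) - phi(y) - (x - y) @ grad_phi(y)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])

# phi(z) = 1/2 z^T z  ->  squared Euclidean distance (with the 1/2 factor)
print(bregman_div(lambda z: 0.5 * z @ z, lambda z: z, x, y),
      0.5 * np.sum((x - y) ** 2))                       # equal

# phi(z) = sum_i z_i log z_i  ->  (unnormalized) relative entropy
print(bregman_div(lambda z: np.sum(z * np.log(z)),
                  lambda z: np.log(z) + 1.0, x, y),
      np.sum(x * np.log(x / y) - x + y))                # equal
```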

Example: $\varphi(z) = z \log z$ gives $D_\varphi(x, y) = x \log\frac{x}{y} - x + y$

    Relative Entropy (or KL-divergence) is another Bregman divergence

Example: $\varphi(z) = -\log z$ gives $D_\varphi(x, y) = \frac{x}{y} - \log\frac{x}{y} - 1$

Itakura-Saito distance (used in signal processing) is also a Bregman divergence

    Examples of Bregman Divergences

Function Name        $\varphi(x)$                            $D_\varphi(x, y)$

Squared norm         $\frac{1}{2}x^2$                        $\frac{1}{2}(x - y)^2$
Shannon entropy      $x \log x - x$                          $x \log\frac{x}{y} - x + y$
Bit entropy          $x \log x + (1 - x)\log(1 - x)$         $x \log\frac{x}{y} + (1 - x)\log\frac{1 - x}{1 - y}$
Burg entropy         $-\log x$                               $\frac{x}{y} - \log\frac{x}{y} - 1$
Hellinger            $-\sqrt{1 - x^2}$                       $(1 - xy)(1 - y^2)^{-1/2} - (1 - x^2)^{1/2}$
$\ell_p$ quasi-norm  $-\|x\|_p$ $(0 < p < 1)$                 ...


    Bregman Matrix Divergences

Define

$D_\varphi(X, Y) = \varphi(X) - \varphi(Y) - \mathrm{trace}\big((\nabla\varphi(Y))^T (X - Y)\big)$


Squared Frobenius norm: $\varphi(X) = \frac{1}{2}\|X\|_F^2$. Then

$D_\varphi(X, Y) = \frac{1}{2}\|X - Y\|_F^2$


von Neumann Divergence: For $X \succeq 0$, $\varphi(X) = \mathrm{trace}(X \log X)$. Then

$D_\varphi(X, Y) = \mathrm{trace}(X \log X - X \log Y - X + Y)$

    also called quantum relative entropy


LogDet divergence: For $X \succ 0$, $\varphi(X) = -\log\det X$. Then

$D_\varphi(X, Y) = \mathrm{trace}(XY^{-1}) - \log\det(XY^{-1}) - d$
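As an illustration (my sketch, not from the talk), the von Neumann divergence can be computed through an eigendecomposition; the helper names and test matrices are assumptions:

```python
import numpy as np

def matrix_log(X):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def von_neumann_div(X, Y):
    """trace(X log X - X log Y - X + Y) for symmetric positive definite X, Y."""
    return np.trace(X @ matrix_log(X) - X @ matrix_log(Y) - X + Y)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); X = A @ A.T + np.eye(3)
B = rng.standard_normal((3, 3)); Y = B @ B.T + np.eye(3)
print(von_neumann_div(X, X))   # ~0
print(von_neumann_div(X, Y))   # >= 0
```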

    LogDet Divergence: Properties I

$D_{\ell d}(X, Y) = \mathrm{trace}(XY^{-1}) - \log\det(XY^{-1}) - d$

Properties:

Not symmetric
Triangle inequality does not hold
Can be unbounded
Convex in the first argument (not in the second)
Pythagorean property holds:

$D_{\ell d}(X, Y) \ge D_{\ell d}(X, P(Y)) + D_{\ell d}(P(Y), Y)$

(where $P(Y)$ is the Bregman projection of $Y$ onto a convex set containing $X$)

Divergence between inverses:

$D_{\ell d}(X, Y) = D_{\ell d}(Y^{-1}, X^{-1})$

    LogDet Divergence: Properties II

$D_{\ell d}(X, Y) = \mathrm{trace}(XY^{-1}) - \log\det(XY^{-1}) - d$

$= \sum_{i=1}^{d} \sum_{j=1}^{d} (v_i^T u_j)^2 \left( \frac{\lambda_i}{\theta_j} - \log\frac{\lambda_i}{\theta_j} - 1 \right)$

where $(\lambda_i, v_i)$ are the eigenvalue/eigenvector pairs of $X$ and $(\theta_j, u_j)$ those of $Y$

Properties:

Scale invariance: $D_{\ell d}(\alpha X, \alpha Y) = D_{\ell d}(X, Y)$ for $\alpha > 0$

In fact, for any invertible $M$:

$D_{\ell d}(X, Y) = D_{\ell d}(M^T X M, M^T Y M)$
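A quick numeric check of these invariances (my sketch, not from the talk):

```python
import numpy as np

def logdet_div(X, Y):
    XYinv = X @ np.linalg.inv(Y)
    return np.trace(XYinv) - np.linalg.slogdet(XYinv)[1] - X.shape[0]

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)); X = A @ A.T + np.eye(4)
B = rng.standard_normal((4, 4)); Y = B @ B.T + np.eye(4)
M = rng.standard_normal((4, 4))                # invertible with probability 1

print(logdet_div(X, Y))
print(logdet_div(2.5 * X, 2.5 * Y))            # scale invariance
print(logdet_div(M.T @ X @ M, M.T @ Y @ M))    # invariance under congruence
# all three printed values agree up to floating-point error
```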

Definition can be extended to rank-deficient matrices (the sums over eigenvalues above then run to the rank $r$ instead of $d$)

Finiteness: $D_{\ell d}(X, Y)$ is finite iff $X$ and $Y$ have the same range space


    Information Theoretic Interpretation

    Differential Relative Entropy between two Multivariate Gaussians:

$\int N(x; \mu, \Sigma_0^{-1}) \log \frac{N(x; \mu, \Sigma_0^{-1})}{N(x; \mu, \Sigma^{-1})}\, dx = \frac{1}{2} D_{\ell d}(\Sigma, \Sigma_0)$

(the Gaussians here have the Mahalanobis matrices $\Sigma_0$ and $\Sigma$ as their inverse covariances)

    Thus, the following two problems are equivalent

Relative Entropy Formulation:

$\min_{\Sigma} \int N(x; \mu, \Sigma_0^{-1}) \log \frac{N(x; \mu, \Sigma_0^{-1})}{N(x; \mu, \Sigma^{-1})}\, dx$
s.t. $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \le u$,  $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \ge \ell$,  $\Sigma \succeq 0$

LogDet Formulation:

$\min_{\Sigma}\ D_{\ell d}(\Sigma, \Sigma_0)$
s.t. $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \le u$,  $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \ge \ell$,  $\Sigma \succeq 0$
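A small numeric check of the Gaussian/LogDet identity above (my sketch; the closed-form KL helper is standard, not from the talk):

```python
import numpy as np

def logdet_div(X, Y):
    XYinv = X @ np.linalg.inv(Y)
    return np.trace(XYinv) - np.linalg.slogdet(XYinv)[1] - X.shape[0]

def kl_gauss_same_mean(S0, S1):
    """KL( N(mu, S0) || N(mu, S1) ) for Gaussians that share a mean."""
    d = S0.shape[0]
    R = np.linalg.inv(S1) @ S0
    return 0.5 * (np.trace(R) - np.linalg.slogdet(R)[1] - d)

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3)); Sigma  = A @ A.T + np.eye(3)   # Mahalanobis matrix
B = rng.standard_normal((3, 3)); Sigma0 = B @ B.T + np.eye(3)   # baseline

# Gaussians whose inverse covariances are Sigma0 and Sigma:
kl = kl_gauss_same_mean(np.linalg.inv(Sigma0), np.linalg.inv(Sigma))
print(kl, 0.5 * logdet_div(Sigma, Sigma0))   # the two values match
```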

Stein's Loss

LogDet divergence is known as Stein's loss in the statistics community

Stein's loss is the unique scale-invariant loss function for which the uniform minimum variance unbiased estimator is also a minimum risk equivariant estimator

Quasi-Newton Optimization

    LogDet Divergence arises in the BFGS and DFP updates

    Quasi-Newton methods

    Approximate Hessian of the function to be minimized

    [Fletcher, 1991] BFGS update can be written as:

$\min_{B}\ D_{\ell d}(B, B_t)$  subject to  $B s_t = y_t$  (secant equation)

$s_t = x_{t+1} - x_t$, $y_t = \nabla f_{t+1} - \nabla f_t$

Closed-form solution:

$B_{t+1} = B_t - \dfrac{B_t s_t s_t^T B_t}{s_t^T B_t s_t} + \dfrac{y_t y_t^T}{s_t^T y_t}$

    Similar form for DFP update
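A minimal sketch of the BFGS update above (mine, not from the talk); the toy quadratic is only there to show that the secant equation holds after the update:

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation B from step s and gradient change y."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

H = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian of a toy quadratic f(x) = 1/2 x^T H x
B = np.eye(2)                            # initial Hessian approximation
s = np.array([1.0, -0.5])                # step x_{t+1} - x_t
y = H @ s                                # gradient change for the quadratic
B_next = bfgs_update(B, s, y)
print(B_next @ s, y)                     # secant equation: B_{t+1} s_t = y_t
```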

    Algorithm: Bregman Projections for LogDet

Algorithm: Cyclic Bregman Projections (successively onto each linear constraint) converges to the globally optimal solution

    Use Bregman projections to update the Mahalanobis matrix:

$\min_{\Sigma}\ D_{\ell d}(\Sigma, \Sigma_t)$    s.t. $(x_i - x_j)^T \Sigma (x_i - x_j) \le u$

Can be solved by a rank-one update (see the sketch below):

$\Sigma_{t+1} = \Sigma_t + \beta_t\, \Sigma_t (x_i - x_j)(x_i - x_j)^T \Sigma_t$

Advantages:

Automatic enforcement of positive semidefiniteness
Simple, closed-form projections
No eigenvector calculation
Easy to incorporate slack for each constraint
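The sketch below (mine, not the talk's code) shows one such rank-one projection for a single similarity constraint, treated as an equality and without the slack handling mentioned above; the closed-form multiplier follows from requiring $p^T \Sigma_{t+1} p = u$ with $p = x_i - x_j$:

```python
import numpy as np

def project_similarity(Sigma, xi, xj, u):
    """LogDet Bregman projection of Sigma onto {S : (xi - xj)^T S (xi - xj) = u}."""
    p = xi - xj
    d = p @ Sigma @ p              # current constraint value
    beta = (u - d) / (d * d)       # multiplier so the constraint holds with equality
    Sp = Sigma @ p
    return Sigma + beta * np.outer(Sp, Sp)   # rank-one update; stays PD for u > 0

Sigma = np.eye(2)
xi, xj = np.array([1.0, 1.0]), np.array([0.0, 0.5])
Sigma_new = project_similarity(Sigma, xi, xj, u=0.1)
p = xi - xj
print(p @ Sigma_new @ p)           # 0.1: the similarity constraint is now satisfied
```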

Example: Noisy XOR

[Figure: scatter plot of noisy XOR data in the plane]

    No linear transformation for XOR grouping

Kernel Methods

    Map input data to higher-dimensional feature space:

$x \to \phi(x)$

Idea: Run the machine learning algorithm in feature space

Noisy XOR Example:

$x \to \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}$

    Inderjit S. Dhillon University of Texas at Austin Metric and Kernel Learning

    Kernel Methods

    http://find/http://goback/
  • 8/4/2019 Metric Cornell

    34/48

    Map input data to higher-dimensional feature space:

    x (x)Idea: Run machine learning algorithm in feature space

    Noisy XOR Example:

    x x212x1x2

    x22

Kernel function: $K(x, y) = \langle \phi(x), \phi(y) \rangle$

Kernel trick: no need to explicitly form high-dimensional features

Noisy XOR Example: $\langle \phi(x), \phi(y) \rangle = (x^T y)^2$
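A two-line check (mine, not from the talk) that the explicit quadratic feature map and the kernel $(x^T y)^2$ agree:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([0.9, 0.1]), np.array([0.1, 1.1])
print(phi(x) @ phi(y))   # inner product in feature space
print((x @ y) ** 2)      # kernel evaluation: same value, no explicit features needed
```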

Connection to Kernel Learning

LogDet Formulation (1):

$\min_{\Sigma}\ D_{\ell d}(\Sigma, \Sigma_0)$
s.t. $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \le u$,  $\mathrm{tr}(\Sigma (x_i - x_j)(x_i - x_j)^T) \ge \ell$,  $\Sigma \succeq 0$

Kernel Formulation (2):

$\min_{K}\ D_{\ell d}(K, K_0)$
s.t. $\mathrm{tr}(K (e_i - e_j)(e_i - e_j)^T) \le u$,  $\mathrm{tr}(K (e_i - e_j)(e_i - e_j)^T) \ge \ell$,  $K \succeq 0$

(1) optimizes w.r.t. the $d \times d$ Mahalanobis matrix $\Sigma$; (2) optimizes w.r.t. the $N \times N$ kernel matrix $K$

Let $K_0 = X^T \Sigma_0 X$, where $X$ is the input data matrix

Let $\Sigma^*$ be the optimal solution to (1) and $K^*$ the optimal solution to (2)

Theorem: $K^* = X^T \Sigma^* X$
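The identity behind this correspondence can be checked directly: with $K = X^T \Sigma X$, Mahalanobis distances between columns of $X$ equal the corresponding "kernelized" distances. A sketch with made-up data (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 3, 5
X = rng.standard_normal((d, N))          # data points as columns
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)              # an SPD Mahalanobis matrix
K = X.T @ Sigma @ X                      # corresponding kernel matrix

i, j = 0, 3
diff = X[:, i] - X[:, j]
e = np.zeros(N); e[i], e[j] = 1.0, -1.0  # e_i - e_j

print(diff @ Sigma @ diff)               # (x_i - x_j)^T Sigma (x_i - x_j)
print(e @ K @ e)                         # tr(K (e_i - e_j)(e_i - e_j)^T): same value
```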

Kernelization

    Metric learning in kernel space

Assume an input kernel function $\kappa(x, y) = \phi(x)^T \phi(y)$

Want to learn

$d_\Sigma(\phi(x), \phi(y)) = (\phi(x) - \phi(y))^T \Sigma\, (\phi(x) - \phi(y))$

Equivalently: learn a new kernel function of the form

$\tilde{\kappa}(x, y) = \phi(x)^T \Sigma\, \phi(y)$

How to learn this using only $\kappa(x, y)$?

Learned kernel can be shown to be of the form

$\tilde{\kappa}(x, y) = \kappa(x, y) + \sum_i \sum_j \sigma_{ij}\, \kappa(x, x_i)\, \kappa(y, x_j)$

Can update the $\sigma_{ij}$ parameters while optimizing the kernel formulation
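A sketch (mine, not the talk's code) of evaluating a learned kernel of the form above on new points; the input kernel, anchor points $x_i$, and coefficients $\sigma_{ij}$ below are illustrative stand-ins, not values produced by the actual optimization:

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Input kernel kappa(x, y); an RBF kernel is used here only for illustration."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def learned_kernel(x, y, anchors, S, kappa=rbf):
    """kappa(x, y) + sum_ij S[i, j] * kappa(x, x_i) * kappa(y, x_j)."""
    kx = np.array([kappa(x, xi) for xi in anchors])
    ky = np.array([kappa(y, xj) for xj in anchors])
    return kappa(x, y) + kx @ S @ ky

anchors = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]   # the x_i in the expansion
S = np.array([[0.3, -0.1], [-0.1, 0.2]])                 # stand-in for learned sigma_ij
print(learned_kernel(np.array([0.2, 0.1]), np.array([0.9, 1.2]), anchors, S))
```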

Related Work

    Distance Metric Learning [Xing, Ng, Jordan & Russell, 2002]

Large Margin Nearest Neighbor (LMNN) [Weinberger, Blitzer & Saul, 2005]

    Collapsing Classes (MCML) [Globerson & Roweis, 2005]

    Online Metric Learning (POLA) [Shalev-Shwartz, Singer & Ng, 2004]

    Many others!

Experimental Results

    Framework

    k-nearest neighbor (k = 4)

$\ell$ and $u$ determined by the 5th and 95th percentiles of the distribution of distances

$20c^2$ constraints, chosen randomly

    2-fold cross validation

    Algorithms

    Information-theoretic Metric Learning (offline and online)

    Large-Margin Nearest Neighbors (LMNN) [Weinberger et al.]

    Metric Learning by Collapsing Classes (MCML) [Globerson and Roweis]

    Baseline Metrics: Euclidean and Inverse Covariance

Results: UCI Data Sets

Ran ITML with $\Sigma_0 = I$ (ITML-MaxEnt) and with the inverse covariance (Inverse Covariance)

Ran the online algorithm for $10^5$ iterations

[Figure: k-NN error on UCI data sets (Wine, Ionosphere, Scale, Iris, Soybean) for ITML-MaxEnt, ITML-Online, Euclidean, Inverse Covariance, MCML, and LMNN]

Application 1: Clarify

Clarify improves error reporting for software that uses black-box components

    Motivation: Black box components complicate error messaging

    Solution: Error diagnosis via machine learning

    Representation: System collects program features during run-time

    Function counts

Call-site counts
Counts of program paths
Program execution represented as a vector of counts

    Class labels: Program execution errors

    Nearest neighbor software support

Match program executions with others
Underlying distance measure should reflect this similarity

Results: Clarify

    Very high dimensionality

Feature selection reduces the number of features to 20

[Figure: error on Clarify benchmarks (Latex, Mpg321, Foxpro, Iptables) for ITML-MaxEnt, ITML-Inverse Covariance, Euclidean, Inverse Covariance, MCML, and LMNN]

Application 2: Learning Image Similarity

    Goal: Learn a metric to compare images

    Start with a baseline measure

Use the pyramid match kernel [Grauman and Darrell]
Compares sets of image features
Efficient and effective measure of similarity between images

    Application of metric learning in kernel space

Other metric learning methods (LMNN, etc.) cannot be applied
Does metric learning work in kernel space?

Caltech 101 Results

    Data Set: Caltech 101

    Standard benchmark for multi-category image recognition

    101 classes of images

    Wide variance in pose etc.

    Challenging data set

Experimental Setup

15 images per class in training set; rest in test set (2454 images)

    Performed 1-NN using original PMK and learned PMK


[Figure: k-NN accuracy vs. number of constraint points (700-1500) for the learned PMK and the original PMK]

When constraints are drawn from all training data, kNN accuracy is 52%, versus 32% for the original PMK ([Jain, Kulis & Grauman, 2007])

Data set is well-studied; best performance with 15 training images per class is 60%

    Uses different features (geometric-blur)

Metric Learning in High Dimensions

    Text analysis & Software analysis: Feature sets larger than 1,000

    Learning full distance matrix requires over 1 million parameters!

    Overfitting problems, intractable

    Solution: Learning low-rank Mahalanobis matrices

    LogDet divergence can be generalized to low-rank matrices

$D_{\ell d}(X, Y)$ is finite iff $X$ and $Y$ have the same range space

Extending ITML to the low-rank case: if $\Sigma_0$ is low-rank, then $\Sigma$ is low-rank

Clustered Mahalanobis matrices: if $\Sigma_0$ is a block matrix, then $\Sigma$ will also be a block matrix

Low-rank Metric Learning: Preliminary Results

    Classification accuracy for Low-Rank ITML

    Classic3: Text dataset, PCA basis

    Mpg321, Gcc: Software analysis, Clustered basis

[Figure: classification accuracy vs. rank on Classic3, Mpg321, and Gcc for Low-Rank ITML, LMNN, MCML, and ITML]

Conclusions

    Metric Learning Formulation

    Uses LogDet divergence

Information-theoretic interpretation
Equivalent to kernel learning problem

Can improve upon input metric or kernel
Generalizes to unseen data points

Algorithm

Bregman projections result in simple rank-one updates

    Can be kernelized

    Online variant has provable regret bounds

Empirical Evaluation

Method is competitive with existing techniques

    Scalable to large data sets

    Applied to nearest-neighbor software support & image recognition

References

J. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, Information-Theoretic Metric Learning, International Conference on Machine Learning (ICML), pages 209-216, June 2007.

B. Kulis, M. Sustik, and I. S. Dhillon, Learning Low-Rank Kernel Matrices, International Conference on Machine Learning (ICML), pages 505-512, July 2006 (longer version submitted to JMLR).
