
Similarity Learning for High Dimensional Sparse Data

Motivation

Measuring distance or similarity is a key component of AI, machine learning, pattern recognition, data mining, etc. Examples: nearest neighbor classification, clustering, information retrieval...

How to define good distances between objects?

Metric learning [1]: learn distance or similarity function automatically from data (must-link/cannot-link relations)

[Figure: metric learning illustration]

Main contributions

- A similarity model that efficiently learns high-dimensional similarities in the original space, by parameterizing the similarity measure as a convex combination of rank-one matrices with a specific sparsity structure.
- Derivation of scalable algorithms for the proposed formulations, with time/memory cost independent of the data dimensionality.
- Appealing optimization and generalization guarantees.
- Experiments on high-dimensional real data showing the potential of the approach for classification, dimensionality reduction and data exploration.

Experiments

Feature Selection and Sparsity: a limited number of features is selected; as the iterations proceed, the number of selected features tends to converge. The learned similarity matrix is extremely sparse: only 0.0006% of its entries are non-zero.

Dimension Reduction: a single run of HDSL also serves as a dimensionality reduction step, outperforming PCA and random projection in terms of k-NN classification test error (a sketch of this use follows this block).

More experiments in the paper + MATLAB code available!
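A minimal sketch of how a model of this form can double as a dimensionality reducer, assuming the convex-combination-of-rank-one-bases structure described in the Approach section below; the basis vectors, weights and sizes here are made up for illustration, not taken from the paper.

```python
import numpy as np

def embed(X, basis_vecs, weights, lam):
    """Map n x d data to n x k using the rank-one decomposition
    M = sum_k weights[k] * lam * v_k v_k^T = L^T L, so that
    S_M(x, x') = (L x) . (L x') is preserved by the embedding x -> L x."""
    L = np.stack([np.sqrt(w * lam) * v for w, v in zip(weights, basis_vecs)])  # k x d
    return X @ L.T                                                             # n x k

# Illustrative "learned" model: 3 bases touching 6 of the d original features.
d, lam = 10_000, 1.0

def pair_vec(i, j, sign):                 # v = e_i + sign * e_j
    v = np.zeros(d)
    v[i], v[j] = 1.0, float(sign)
    return v

basis_vecs = [pair_vec(3, 17, +1), pair_vec(42, 99, -1), pair_vec(7, 3, +1)]
weights = [0.5, 0.25, 0.25]               # convex combination weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, d))
Z = embed(X, basis_vecs, weights, lam)    # 5 x 3 embedding, usable for k-NN
print(Z.shape)
```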

References

[1] Bellet, A., Habrard, A., and Sebban, M. A Survey on Metric Learning for Feature Vectors and Structured Data. Technical report, arXiv:1306.6709, 2013.
[2] Jaggi, M. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML, 2013.
[3] Gao, X., et al. SOML: Sparse Online Metric Learning with Application to Image Retrieval. In AAAI, 2014.
[4] Schultz, M. and Joachims, T. Learning a Distance Metric from Relative Comparisons. In NIPS, 2004.
[5] Goldberger, J., Roweis, S., Hinton, G., and Salakhutdinov, R. Neighbourhood Components Analysis. In NIPS, 2004.
[6] Weinberger, K. Q. and Saul, L. K. Distance Metric Learning for Large Margin Nearest Neighbor Classification. JMLR, 2009.
[7] Kedem, D., Tyree, S., Weinberger, K., Sha, F., and Lanckriet, G. Non-linear Metric Learning. In NIPS, 2012.
[8] Shen, C., et al. Positive Semidefinite Metric Learning Using Boosting-like Algorithms. JMLR, 2012.
[9] Shi, Y., Bellet, A., and Sha, F. Sparse Compositional Metric Learning. In AAAI, 2014.

Kuan Liu Aurélien Bellet Fei Sha

Approach

Formulation

Setting: high-dimensional, sparse data x_1, ..., x_n in R^d (d large, each x_i with few non-zero features), plus triplet constraints T = {(x, x', x'')} encoding "x is more similar to x' than to x''".

Goal: learn a bilinear similarity S_M(x, x') = x^T M x' parameterized by a matrix M in R^{d x d}.

Similarity bases: M is constrained to lie in D, the convex hull of rank-one bases (up to a global scale parameter); each basis is built from a single pair of features and thus contains only 4 non-zero entries.

Learning objective: with l the smoothed hinge loss,

    min_{M in D}  f(M) = (1/|T|) sum_{(x, x', x'') in T}  l(1 - x^T M x' + x^T M x'')        (1)

(a toy construction of the bases, the similarity and the loss is sketched below).

Optimization

The sparse similarity is learned efficiently with a Frank-Wolfe (FW) algorithm [2]: only one similarity basis (involving two features) is added or removed at each iteration (see the schematic iteration after the Theoretical analysis). This gives compact storage of M and efficient computation of the objective function, active constraints, etc. Time/memory cost is independent of the data dimensionality d.
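To make the formulation concrete, a minimal sketch assuming bases of the form lam * (e_i +- e_j)(e_i +- e_j)^T (consistent with the 4-non-zero / two-feature sparsity pattern stated in the Theoretical analysis below) and one common smoothed hinge; all names and values are illustrative, not the authors' code.

```python
import numpy as np
from scipy import sparse

def basis(i, j, sign, lam, d):
    """Rank-one basis lam * (e_i + sign*e_j)(e_i + sign*e_j)^T:
    4 non-zero entries, touching only features i and j."""
    v = sparse.csc_matrix(([1.0, float(sign)], ([i, j], [0, 0])), shape=(d, 1))
    return lam * (v @ v.T)

def similarity(M, x, x_prime):
    """Bilinear similarity S_M(x, x') = x^T M x'."""
    return float(x @ (M @ x_prime))

def smoothed_hinge(u):
    """A common smoothing of max(0, u): quadratic near 0, linear beyond 1."""
    if u <= 0.0:
        return 0.0
    return 0.5 * u * u if u < 1.0 else u - 0.5

d, lam = 10_000, 1.0
# M as a convex combination of two bases (weights sum to 1).
M = 0.5 * basis(3, 17, +1, lam, d) + 0.5 * basis(42, 99, -1, lam, d)

x, xp, xn = (sparse.random(1, d, density=0.001, random_state=r).toarray().ravel()
             for r in (1, 2, 3))
# Loss on one triplet "x more similar to xp than to xn" (margin 1).
violation = 1.0 - similarity(M, x, xp) + similarity(M, x, xn)
print(smoothed_hinge(violation))
```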

Theoretical analysis

Sparsity: at any iteration k, M^(k) has rank at most k+1, with at most 4(k+1) non-zero entries, using at most 2(k+1) distinct features.

Convergence: let M* be an optimal solution to (1). For each iteration k, the standard Frank-Wolfe analysis [2] gives f(M^(k)) - f(M*) = O(1/k).

Generalization: the number of iterations k gives a tradeoff between optimization error and model complexity: for each iteration k, the expected risk is bounded by the empirical risk plus a complexity term that grows with k (and shrinks with the number of training triplets).
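A schematic, toy-sized Frank-Wolfe iteration for an objective of the form (1), again under the illustrative basis construction above; the naive O(d^2) search over bases stands in for the sparsity-exploiting oracle of the actual algorithm, so this is a sketch of the idea rather than the released implementation.

```python
import numpy as np

def smoothed_hinge_grad(u):
    """Derivative of the smoothed hinge used above: 0 below 0, u in (0, 1), 1 beyond."""
    return 0.0 if u <= 0.0 else (u if u < 1.0 else 1.0)

def frank_wolfe_step(M, triplets, lam, k):
    """One iteration over the convex hull of lam*(e_i +- e_j)(e_i +- e_j)^T."""
    d = M.shape[0]
    # Gradient of the average triplet loss with respect to M.
    G = np.zeros((d, d))
    for x, xp, xn in triplets:
        u = 1.0 - x @ M @ xp + x @ M @ xn
        G += smoothed_hinge_grad(u) * np.outer(x, xn - xp)
    G /= len(triplets)
    # Linear minimization oracle: basis with the most negative <G, B>.
    best, best_val = None, np.inf
    for i in range(d):
        for j in range(i + 1, d):
            for s in (1.0, -1.0):
                val = lam * (G[i, i] + G[j, j] + s * (G[i, j] + G[j, i]))
                if val < best_val:
                    best, best_val = (i, j, s), val
    i, j, s = best
    v = np.zeros(d)
    v[i], v[j] = 1.0, s
    B = lam * np.outer(v, v)
    gamma = 2.0 / (k + 2.0)                 # standard Frank-Wolfe step size [2]
    return (1.0 - gamma) * M + gamma * B    # M stays a convex combination of bases

# Toy run (small d so the naive oracle search is affordable).
rng = np.random.default_rng(0)
d, lam = 20, 10.0
X = rng.standard_normal((30, d))
triplets = [(X[a], X[b], X[c]) for a, b, c in rng.integers(0, 30, size=(50, 3))]
M = np.zeros((d, d))                        # first step (gamma = 1) jumps to a vertex
for k in range(25):
    M = frank_wolfe_step(M, triplets, lam, k)
print("non-zero entries in M:", np.count_nonzero(M))
```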

Existing methods

Most algorithms learn a Mahalanobis distance
    d_M(x, x') = sqrt((x - x')^T M (x - x')),
where M is a D x D positive semi-definite (PSD) matrix.

This is expensive in high dimensions:
- Must learn D^2 parameters.
- Must ensure that M is PSD: O(D^3) time complexity.

Current approaches:
- Learn a diagonal matrix M [3][4].
- Low-dimensional projection based methods: explicit low-rank decomposition [5][6][7], decomposition into rank-one matrices [8][9].
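To make the cost argument concrete, a small numpy sketch (all sizes and names illustrative): the matrix M holds D^2 parameters, and keeping it PSD typically relies on an O(D^3) eigendecomposition.

```python
import numpy as np

def mahalanobis(x, x_prime, M):
    """d_M(x, x') = sqrt((x - x')^T M (x - x')) for a PSD matrix M."""
    diff = x - x_prime
    return np.sqrt(max(diff @ M @ diff, 0.0))

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone via an O(D^3) eigendecomposition."""
    A = (A + A.T) / 2.0
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T

D = 1_000                                   # M already holds 10^6 parameters
rng = np.random.default_rng(0)
M = project_psd(rng.standard_normal((D, D)))
x, y = rng.standard_normal(D), rng.standard_normal(D)
print(mahalanobis(x, y, M))
```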

[Figure: the learned matrix M as a convex combination of sparse rank-one bases, with weights 0.125, 0.25, ..., 0.125]

Classification: in k-NN classification on high-dimensional data sets, our methods achieve lower test errors than other state-of-the-art similarity learning approaches.
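For reference, a toy sketch of k-NN classification driven by a bilinear similarity instead of the Euclidean distance; here M is the identity as a stand-in for a learned matrix, and all data are random.

```python
import numpy as np

def knn_similarity_predict(S, y_train, k=3):
    """k-NN vote using an (n_train x n_test) similarity matrix: take the k MOST similar."""
    nn = np.argsort(-S, axis=0)[:k]                      # k x n_test neighbor indices
    votes = y_train[nn]                                  # k x n_test neighbor labels
    return np.array([np.bincount(col).argmax() for col in votes.T])

rng = np.random.default_rng(0)
d = 50
M = np.eye(d)                                            # stand-in for a learned sparse M
X_train, y_train = rng.standard_normal((40, d)), rng.integers(0, 2, 40)
X_test = rng.standard_normal((10, d))
S = X_train @ M @ X_test.T                               # bilinear similarities x^T M x'
print(knn_similarity_predict(S, y_train, k=3))
```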