Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning
University of Toronto Machine Learning Seminar, Feb 21, 2013
Kevin Swersky, Ilya Sutskever, Laurent Charlin, Richard Zemel, Danny Tarlow


Page 1: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

University of Toronto Machine Learning Seminar, Feb 21, 2013

Kevin Swersky, Ilya Sutskever, Laurent Charlin, Richard Zemel, Danny Tarlow

Page 2: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Distance Metric Learning

Distance metrics are everywhere.

But they're arbitrary! Dimensions are scaled weirdly, and even if they're normalized, it's not clear that Euclidean distance means much.

So learning sounds nice, but what you learn should depend on the task.

A really common task is kNN. Let's look at how to learn distance metrics for that.
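
As a concrete illustration (not from the slides), distance metric learning for kNN typically parameterizes the metric by a linear transform A and measures distances in the projected space; the sketch below shows that idea, with A standing in for whatever a learning procedure would produce.

```python
import numpy as np

def learned_distance(xi, xj, A):
    """Squared distance under a learned linear map A:
    d_A(xi, xj) = ||A xi - A xj||^2, i.e. a Mahalanobis metric with M = A^T A."""
    diff = A @ (xi - xj)
    return float(diff @ diff)

# Toy usage; this A is random, standing in for a learned transform.
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 5))              # project 5-D inputs down to 2-D
x1, x2 = rng.normal(size=5), rng.normal(size=5)
print(learned_distance(x1, x2, A))
```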

Page 3: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Popular Approaches for Distance Metric Learning

Large margin nearest neighbors (LMNN)

“Target neighbors” must be chosen ahead of time

[Weinberger et al., NIPS 2006]

Some satisfying properties:
• Based on local structure (doesn’t have to pull all points into one region)

Some unsatisfying properties:
• Initial choice of target neighbors is difficult
• Choice of objective function has reasonable forces (pushes and pulls), but beyond that, it is pretty heuristic
• No probabilistic interpretation

Page 4: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Probabilistic Formulations for Distance Metric Learning

Our goal: give a probabilistic interpretation of kNN and properly learn a model based upon this interpretation.

Related work that kind of does this: Neighborhood Components Analysis (NCA). Our approach is a direct generalization.

Page 5: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Generative Model

Page 6: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Generative Model

Page 7: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Neighborhood Component Analysis (NCA) [Goldberger et al., 2004]

Given a query point i, we select neighbors randomly according to d.

Question: what is the probability that a randomly selected neighbor will belong to the correct (blue) class?

Page 8: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Neighborhood Component Analysis (NCA) [Goldberger et al., 2004]

Question: what is the probability that a randomly selected neighbor will belong to the correct (blue) class?

Page 9: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Neighborhood Component Analysis (NCA) [Goldberger et al., 2004]

Question: what is the probability that a randomly selected neighbor will belong to the correct (blue) class?

Page 10: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Neighborhood Component Analysis (NCA) [Goldberger et al., 2004]

Question: what is the probability that a randomly selected neighbor will belong to the correct (blue) class?

Page 11: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Neighborhood Component Analysis (NCA) [Goldberger et al., 2004]

Another way to write this:

(y are the class labels)

Page 12: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Neighborhood Component Analysis (NCA) [Goldberger et al., 2004]

Objective: maximize the log-likelihood of stochastically selecting neighbors of the same class.
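
The formulas on these NCA slides were rendered as images. For reference, the standard NCA quantities from Goldberger et al. (2004), with d_{ij} = \|Ax_i - Ax_j\|^2, match the description above (this is a reconstruction, not the slide itself):

```latex
p_{ij} = \frac{\exp(-d_{ij})}{\sum_{l \neq i} \exp(-d_{il})}, \qquad
p(\text{same class} \mid i) = \sum_{j : y_j = y_i} p_{ij}, \qquad
\mathcal{L}(A) = \sum_i \log \sum_{j : y_j = y_i} p_{ij}.
```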

Page 13: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

After Learning

We might hope to learn a projection that looks like this.

Page 14: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Problem with 1-NCA

NCA is happy if points pair up and ignore global structure. This is not ideal if we want k > 1.

Page 15: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

k-Neighborhood Component Analysis (k-NCA)

NCA:

k-NCA:

(S is all sets of k neighbors of point i)

Setting k=1 recovers NCA.

[] is the indicator function
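
The k-NCA formulas here were also images. A reconstruction consistent with the description on this slide (a distribution over size-k neighbor sets, scored by summed distances, with an indicator on the majority label) would be:

```latex
P(S \mid i) = \frac{\exp\big(-\sum_{j \in S} d_{ij}\big)}{\sum_{S' \in \mathcal{S}_i} \exp\big(-\sum_{j \in S'} d_{ij}\big)}, \qquad
P\big(\mathrm{Maj}(y_S) = y_i \mid i\big) = \sum_{S \in \mathcal{S}_i} \big[\mathrm{Maj}(y_S) = y_i\big]\, P(S \mid i),
```

where \mathcal{S}_i is the set of all size-k subsets of neighbors of i. With k = 1 this reduces to the NCA probability above, and the learning objective analogous to NCA is \sum_i \log P(\mathrm{Maj}(y_S) = y_i \mid i).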

Page 16: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

k-Neighborhood Component Analysis (k-NCA)

Stochastically choose k neighbors such that the majority is blue.

Computing the numerator of the distribution

Page 17: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

k-Neighborhood Component Analysis (k-NCA)

Stochastically choose k neighbors such that the majority is blue.

Computing the numerator of the distribution

Page 18: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

k-Neighborhood Component Analysis (k-NCA)

Stochastically choose k neighbors such that the majority is blue.

Computing the numerator of the distribution

Page 19: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

k-Neighborhood Component Analysis (k-NCA)

Stochastically choose subsets of k neighbors.

Computing the denominator of the distribution

Page 20: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

k-Neighborhood Component Analysis (k-NCA)

Stochastically choose subsets of k neighbors.

Computing the denominator of the distribution
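
To make the numerator and denominator concrete, here is a brute-force sketch that enumerates all size-k subsets (illustrative only; the point of the factor-graph slides that follow is that this probability can be computed without enumeration). The function name and majority check are mine, not the authors':

```python
import numpy as np
from itertools import combinations
from collections import Counter

def knca_prob_brute_force(D, y, i, k):
    """Probability that the majority label of a stochastically chosen size-k neighbor
    set equals y[i], where set S is drawn with weight exp(-sum_{j in S} D[i, j]).
    Brute force over all k-subsets; only feasible for small N."""
    others = [j for j in range(len(y)) if j != i]
    num = den = 0.0
    for S in combinations(others, k):
        w = np.exp(-sum(D[i, j] for j in S))   # unnormalized weight of this subset
        den += w                               # denominator: all subsets
        counts = Counter(y[j] for j in S)
        top = max(counts.values())
        # strict majority: y[i] is the unique most frequent label in S
        if counts.get(y[i], 0) == top and sum(c == top for c in counts.values()) == 1:
            num += w                           # numerator: majority-correct subsets
    return num / den
```

With k = 1 this is exactly the NCA probability of selecting a same-class neighbor.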

Page 21: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

k-NCA Intuition

k-NCA puts more pressure on points to form bigger clusters.

Page 22: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

k-NCA Objective

Given: inputs X, labels y, neighborhood size k.

Learning: find A that (locally) maximizes this objective.

Technical challenge: efficiently compute the numerator and denominator sums (and their gradients).

Page 23: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Factor Graph Formulation

Focus on a single i

Page 24: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Factor Graph Formulation

Step 1: Split the Majority function into cases (i.e., use gates).

Switch on k' = |{chosen j : y_j = y_i}|   // number of chosen neighbors with label y_i

Maj(y_S) = 1 iff, for all c != y_i, |{chosen j : y_j = c}| < k'

Step 2: Constrain the total number of neighbors chosen to be k.

Page 25: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Assume y_i = “blue”.

Factor graph annotations:
• Binary variable: is j chosen as a neighbor?
• Total number of “blue” neighbors chosen
• Exactly k' “blue” neighbors are chosen
• Less than k' “pink” neighbors are chosen
• Exactly k total neighbors must be chosen
• Count total number of neighbors chosen

At this point, everything is just a matter of inference in these factor graphs:
• Partition functions give the objective
• Marginals give the gradients

Page 26: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Sum-Product Inference

Figure annotations:
• Number of neighbors chosen from first two “blue” points
• Number of neighbors chosen from first three “blue” points
• Number of neighbors chosen from “blue” or “pink” classes
• Total number of neighbors chosen

Lower-level messages: O(k) time each. Upper-level messages: O(k²) time each.

Total runtime: O(Nk + Ck²)*

* Although slightly better is possible asymptotically. See Tarlow et al., UAI 2012.
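
As one concrete piece of this computation, the denominator (the sum over all size-k subsets) for a query i is the elementary symmetric polynomial of the weights w_j = exp(-d_ij), which a simple O(Nk) dynamic program computes. The sketch below covers only that piece; the per-class counting needed for the numerator is what the factor-graph messages add. It is not the authors' code:

```python
import numpy as np

def subset_partition_function(w, k):
    """Sum over all size-k subsets S of prod_{j in S} w[j], the elementary
    symmetric polynomial e_k(w), computed by dynamic programming in O(N k)."""
    e = np.zeros(k + 1)
    e[0] = 1.0
    for wj in w:
        for m in range(k, 0, -1):   # descend so each w[j] contributes at most once
            e[m] += wj * e[m - 1]
    return e[k]

# e.g. the k-NCA denominator for query i would be
#   subset_partition_function(np.exp(-D[i, others]), k)
```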

Page 27: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Alternative Version

• Instead of the Majority(y_S) = y_i function, use All(y_S) = y_i.
  – Computation gets a little easier (just one k' needed).
  – Loses the kNN interpretation.
  – Exerts more pressure for homogeneity; tries to create a larger margin between classes.
  – Usually works a little better.
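
In the notation used earlier (again my reconstruction, not the slide's image), the All variant requires every selected neighbor to share label y_i, so the numerator collapses to a single subset sum restricted to same-class points:

```latex
P\big(\mathrm{All}(y_S) = y_i \mid i\big) =
\frac{\sum_{S \subseteq \{j : y_j = y_i\},\ |S| = k} \exp\big(-\sum_{j \in S} d_{ij}\big)}
     {\sum_{S \in \mathcal{S}_i} \exp\big(-\sum_{j \in S} d_{ij}\big)}.
```

Only one case is needed because the chosen set must contain exactly k same-class neighbors and none from any other class.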

Page 28: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Unsupervised Learning with t-SNE [van der Maaten and Hinton, 2008]

Visualize the structure of data in a 2D embedding.

• Each input point x maps to an embedding point e.
• SNE tries to preserve relative pairwise distances as faithfully as possible.

[Turian, http://metaoptimize.com/projects/wordreprs/] [van der Maaten & Hinton, JMLR 2008]

Page 29: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Problem with t-SNE (also based on k=1)

[van der Maaten & Hinton, JMLR 2008]

Page 30: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Unsupervised Learning with t-SNE [van der Maaten and Hinton, 2008]

Distances:

Data distribution:

Embedding distribution:

Objective (minimize wrt e):
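
The formulas themselves were images. For reference, the standard quantities from van der Maaten & Hinton (2008), written here in the per-query (conditional) form that matches the neighbor-selection view of this talk (the original t-SNE symmetrizes these into joint distributions), are:

```latex
p_{j|i} = \frac{\exp\big(-\|x_i - x_j\|^2 / 2\sigma_i^2\big)}{\sum_{l \neq i} \exp\big(-\|x_i - x_l\|^2 / 2\sigma_i^2\big)}, \qquad
q_{j|i} = \frac{\big(1 + \|e_i - e_j\|^2\big)^{-1}}{\sum_{l \neq i} \big(1 + \|e_i - e_l\|^2\big)^{-1}}, \qquad
C = \sum_i \sum_{j} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.
```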

Page 31: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

kt-SNE

Data distribution:

Embedding distribution:

Objective:

Minimize objective wrt e
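
By analogy with k-NCA (my reconstruction of the slide's image formulas, not a verbatim copy), kt-SNE replaces the per-point neighbor distributions with distributions over size-k neighbor sets in both spaces and minimizes the KL divergence between them with respect to the embedding e:

```latex
P(S \mid i) \propto \exp\Big(-\sum_{j \in S} \|x_i - x_j\|^2 / 2\sigma_i^2\Big), \qquad
Q(S \mid i) \propto \prod_{j \in S} \big(1 + \|e_i - e_j\|^2\big)^{-1}, \qquad
C = \sum_i \mathrm{KL}\big(P(\cdot \mid i)\,\|\,Q(\cdot \mid i)\big).
```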

Page 32: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

kt-SNE

• kt-SNE can potentially lead to better higher-order structure preservation (exponentially many more distance constraints).

• Gives another “dial to turn” in order to obtain better visualizations.

Page 33: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Experiments

Page 34: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

WINE embeddings: “All” method vs. “Majority” method

Page 35: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

IRIS: worst kNCA relative performance (full D). Panels: training accuracy and testing accuracy (y-axis: kNN accuracy).

Page 36: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

ION: best kNCA relative performance (full D). Panels: training accuracy and testing accuracy.

Page 37: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

USPS kNN Classification (0% noise, 2D). Panels: training accuracy and testing accuracy.

Page 38: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

USPS kNN Classification (25% noise, 2D). Panels: training accuracy and testing accuracy.

Page 39: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

USPS kNN Classification (50% noise, 2D). Panels: training accuracy and testing accuracy.

Page 40: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

NCA Objective Analysis on Noisy USPS

Panels: 0% noise, 25% noise, 50% noise.

Y-axis: objective of 1-NCA, evaluated at the parameters learned from k-NCA with varying k and neighbor method.

Page 41: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

t-SNE vs kt-SNE

Panels: t-SNE and 5t-SNE. Y-axis: kNN accuracy.

Page 42: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Discussion

• Local is good, but 1-NCA is too local.

• The objective is not quite expected kNN accuracy, but this doesn’t seem to change the results.

• The expected-Majority computation may be useful elsewhere?

Page 43: Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

Thank You!