

Page 1:

A Theory of Learning and Clustering via Similarity Functions

Maria-Florina Balcan

09/17/2007

Joint work with Avrim Blum and Santosh Vempala

Carnegie Mellon University

Page 3:

2-Minute Version

Generic classification problem: learn to distinguish men from women. Problem: pixel representation not so good.

Powerful technique: use a kernel K(·,·), a special kind of similarity function.

What if we don't have any labeled data? (i.e., clustering)

Can we develop a theory of conditions sufficient for K to be useful now?

Page 4:

Part I: On Similarity Functions for Classification

Page 6:

Kernels, Kernelizable Algorithms

• K is a kernel if there exists an implicit mapping φ s.t. K(x,y) = φ(x)·φ(y).

Point: many algorithms interact with data only via dot-products.

• If we replace x·y with K(x,y), an algorithm acts implicitly as if the data were in the higher-dimensional φ-space.
• If the data is linearly separable by a large margin in φ-space, we don't have to pay in terms of sample complexity or computation time.

If the margin is γ in φ-space, only ~1/γ² examples are needed to learn well.

[Figure: large-margin linear separator w for the points φ(x) in φ-space.]
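To make the implicit mapping concrete, here is a minimal illustration (not from the talk): the quadratic kernel K(x,y) = (x·y)² equals an explicit dot product φ(x)·φ(y) over degree-2 monomial features; the 2-dimensional example and function names are just for demonstration.

```python
import numpy as np

def quad_kernel(x, y):
    # Implicit form: K(x, y) = (x . y)^2
    return float(np.dot(x, y)) ** 2

def phi(x):
    # Explicit feature map for 2-d inputs: phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2),
    # so that phi(x) . phi(y) = (x . y)^2.
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(quad_kernel(x, y))              # 1.0
print(float(np.dot(phi(x), phi(y))))  # 1.0 as well: the same value, computed explicitly
```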

Page 7:

Kernels and Similarity Functions

Kernels: useful for many kinds of data, elegant SLT.

Our Work: analyze more general similarity functions.

Characterization of good similarity functions:

1) In terms of natural direct properties.

• no implicit high-dimensional spaces
• no requirement of positive-semidefiniteness

2) If K satisfies these, can be used for learning.

3) Is broad: includes the usual notion of a "good kernel" (i.e., has a large margin separator in φ-space).

Page 8:

A First Attempt: Definition Satisfying (1) and (2)

• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob. mass of x satisfy:

E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

Note: might not be a legal kernel.

P is a distribution over labeled examples (x, l(x)).

• E.g., K(x,y) ≥ 0.2 when l(x) = l(y); K(x,y) random in [-1,1] when l(x) ≠ l(y).
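The definition can be checked empirically on a labeled sample; a minimal sketch (the function name and the sample-based estimate of the conditional expectations are my own, not from the talk):

```python
import numpy as np

def empirical_goodness(K, X, labels, gamma):
    """Fraction of sample points x with
    mean K(x,y) over same-label y  >=  mean K(x,y) over different-label y + gamma
    (an empirical stand-in for the (eps, gamma)-goodness condition)."""
    n, good = len(X), 0
    for i in range(n):
        same = [K(X[i], X[j]) for j in range(n) if j != i and labels[j] == labels[i]]
        diff = [K(X[i], X[j]) for j in range(n) if labels[j] != labels[i]]
        if same and diff and np.mean(same) >= np.mean(diff) + gamma:
            good += 1
    return good / n   # roughly 1 - eps if K is (eps, gamma)-good
```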

Page 9:

A First Attempt: Definition Satisfying (1) and (2). How to use it?

• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob. mass of x satisfy:

Algorithm

• Draw S+ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw S- of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which gives the better average score.

Guarantee: with probability ≥ 1-δ, error ≤ ε + δ.

E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
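A minimal sketch of this algorithm (hypothetical helper name; K, the drawn samples S+ and S-, and the new point x are supplied by the caller):

```python
import numpy as np

def classify(K, S_plus, S_minus, x):
    """Label x by whichever drawn sample (S+ or S-) it is more similar to on average."""
    score_plus = np.mean([K(x, y) for y in S_plus])
    score_minus = np.mean([K(x, y) for y in S_minus])
    return +1 if score_plus >= score_minus else -1
```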

Page 10:

A First Attempt: Definition Satisfying (1) and (2). How to use it?

• Hoeffding: for any given "good" x, the prob. of error w.r.t. x (over the draw of S+, S-) is ≤ δ².

• At most a δ chance that the error rate over the GOOD points is ≥ δ.

• Overall error rate ≤ ε + δ.

Guarantee: with probability ≥ 1-δ, error ≤ ε + δ.

• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob. mass of x satisfy:

E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
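Filling in the two steps above (a sketch, not from the slides; the constants 8 and 2 are illustrative, and d denotes the number of examples drawn per class):

```latex
% A "good" x is misclassified only if the empirical average similarity to S+ or
% to S- deviates from its expectation by more than gamma/2. By Hoeffding
% (K takes values in [-1,1], d samples per class) and a union bound:
\Pr[\text{error on a good } x] \;\le\; 2\exp\!\big(-d\gamma^{2}/8\big) \;\le\; \delta^{2}
\qquad \text{once } d \;\ge\; \tfrac{8}{\gamma^{2}}\ln\tfrac{2}{\delta^{2}} .
% Let Z be the error rate over the good points (a 1-eps probability mass).
% Then E[Z] <= delta^2, so Markov's inequality gives
\Pr[\,Z \ge \delta\,] \;\le\; \mathbb{E}[Z]/\delta \;\le\; \delta ,
% and with probability at least 1-delta the overall error is at most eps + delta.
```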

Page 11:

A First Attempt: Not Broad Enough

• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob. mass of x satisfy:

E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

• K(x,y) = x·y has a large margin separator but doesn't satisfy our definition.

[Figure: points labeled + and -, with the annotation "more similar to + than to typical -".]

Page 12:

A First Attempt: Not Broad Enough

• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob. mass of x satisfy:

Broaden: OK if there exists a non-negligible set R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label.

E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ

[Figure: region R containing both + and - points.]

Page 13:

Broader/Main Definition

• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] s.t. a 1-ε prob. mass of x satisfy:

E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ

Algorithm

• Draw S+ = {y1, …, yd}, S- = {z1, …, zd}, d = O((1/γ²) ln(1/δ²)).

• "Triangulate" the data:

F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].

• Take a new set of labeled examples, project them into this space, and run any algorithm for learning linear separators.

Theorem: with probability ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
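A sketch of the triangulation step followed by an off-the-shelf linear-separator learner (scikit-learn's LinearSVC is my choice here; the slide only requires some linear-separator algorithm, and the helper names are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

def triangulate(K, X, S_plus, S_minus):
    """Map each x to F(x) = [K(x, y_1), ..., K(x, y_d), K(x, z_1), ..., K(x, z_d)]."""
    return np.array([[K(x, y) for y in S_plus] + [K(x, z) for z in S_minus] for x in X])

def learn(K, S_plus, S_minus, X_train, y_train):
    F_train = triangulate(K, X_train, S_plus, S_minus)
    clf = LinearSVC()                     # any linear-separator learner would do
    clf.fit(F_train, y_train)
    return lambda x: clf.predict(triangulate(K, [x], S_plus, S_minus))[0]
```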

Page 14:

Main Definition & Algorithm, Implications

• S+ = {y1, …, yd}, S- = {z1, …, zd}, d = O((1/γ²) ln(1/δ²)).
• "Triangulate" the data: F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].

Theorem: with prob. ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.

[Diagram: K may be an arbitrary similarity function, not necessarily a legal kernel; an (ε,γ)-good similarity function yields, via F, roughly an (ε+δ, γ/4)-good kernel.]

Theorem: any (ε,γ)-good kernel is an (ε',γ')-good similarity function.

(some penalty: ε' = ε + ε_extra, γ' = γ² ε_extra)

Page 15:

Similarity Functions for Classification, Summary

• Formal way of understanding kernels as similarity functions.

• Algorithms and guarantees for general similarity functions that aren’t necessarily PSD.

Page 16:

Part II: Can we use this angle to help think about Clustering?

Page 17:

What if only unlabeled examples are available?

Problem: we only have unlabeled data! [e.g., documents or images, to be grouped by topic: sports, fashion, ...]

S: a set of n objects. Each object has a true label l(x) in {1, …, t}. There is some (unknown) "ground truth" clustering.

Goal: produce h of low error up to isomorphism of label names:

Err(h) = min_σ Pr_{x~S}[σ(h(x)) ≠ l(x)]   (minimum over permutations σ of the label names)

But we have a similarity function!
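Read literally, the error measure minimizes over relabelings; a brute-force sketch, assuming labels in {0, …, t-1} (only practical for small t, and the function name is illustrative):

```python
from itertools import permutations

def clustering_error(h_labels, true_labels, t):
    """Err(h) = min over label permutations sigma of Pr_x[ sigma(h(x)) != l(x) ]."""
    n = len(true_labels)
    best = 1.0
    for sigma in permutations(range(t)):
        mistakes = sum(1 for hx, lx in zip(h_labels, true_labels) if sigma[hx] != lx)
        best = min(best, mistakes / n)
    return best
```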

Page 18:


What conditions on a similarity function would be enough to allow one to cluster well?

Page 19:

Contrast with "Standard" Approach

Traditional approach: the input is a graph or an embedding of points into R^d.
- closer to learning mixtures of Gaussians
- analyze algorithms to optimize various criteria; which criterion produces "better-looking" results?

We flip this perspective around:
- discriminative, not generative
- more natural, since the input graph/similarity is merely based on some heuristic

Page 20:

What conditions on a similarity function would be enough to allow one to cluster well?

Condition that trivially works:

K(x,y) > 0 for all x,y with l(x) = l(y).
K(x,y) < 0 for all x,y with l(x) ≠ l(y).
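To see why this condition trivially suffices, note that thresholding K at zero and taking connected components recovers the target clusters exactly; a minimal union-find sketch (my illustration, not an algorithm from the talk):

```python
def cluster_by_sign(K, items):
    """If K > 0 within clusters and K < 0 across clusters, the connected
    components of the graph {(i, j) : K(items[i], items[j]) > 0} are exactly
    the target clusters."""
    n = len(items)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if K(items[i], items[j]) > 0:
                parent[find(i)] = find(j)   # union the two components

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```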

Page 21:

What conditions on a similarity function would be enough to allow one to cluster well?

Strict Ordering Property: K is s.t. all x are more similar to points y in their own cluster than to any y' in other clusters.

Still strong. Problem: the same K can satisfy this for two very different clusterings of the same data!

[Figure: the same four topics (soccer, tennis, Lacoste, Coco Chanel) shown under two different clusterings, e.g., {sports, fashion} vs. finer per-topic clusters.]

Unlike learning, you can't even test your hypotheses!

Page 22:

Relax Our Goals

1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it.

[Figure: example topics: soccer, tennis, Lacoste, Coco Chanel.]

Page 23:

Relax Our Goals

1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it.

[Figure: hierarchy with root "All topics", internal nodes "sports" and "fashion", and leaves soccer, tennis, Lacoste, Coco Chanel.]


Page 28:

Relax Our Goals

1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it.

2. Produce a list of clusterings s.t. at least one has low error.

Tradeoff: strength of assumption vs. size of the list.

[Figure: hierarchy over All topics with sports and fashion branches.]

Page 29:

Start Getting Nice Algorithms/Properties

Strict Ordering Property: K is s.t. all x are more similar to points y in their own cluster than to any y' in other clusters. Sufficient for hierarchical clustering.

Weak Stability Property: for all clusters C, C' and all A in C, A' in C', at least one of A, A' is more attracted to its own cluster than to the other. Sufficient for hierarchical clustering.

Page 30:

Example Analysis for Strong Stability Property

Strong stability: K is s.t. for all clusters C, C', all A in C, A' in C': K(A, C-A) > K(A, A'), where K(A, A') denotes the average attraction between A and A'.

Algorithm: Average Single-Linkage.
• Merge the two "parts" whose average similarity is highest.

Analysis: all "parts" made are laminar w.r.t. the target clustering.
• Failure iff we merge P1, P2 s.t. P1 ⊂ C and P2 ∩ C = ∅.
• But there must exist P3 ⊂ C s.t. K(P1, P3) ≥ K(P1, C-P1), and K(P1, C-P1) > K(P1, P2). Contradiction.
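A minimal sketch of average single-linkage as described above: repeatedly merge the pair of current parts with the highest average similarity, recording merges to form the hierarchy (quadratic pair scan, purely illustrative):

```python
import numpy as np

def average_single_linkage(K, items):
    """Repeatedly merge the two current parts with the highest average pairwise
    similarity; the recorded merges define a hierarchy over the items."""
    parts = [[i] for i in range(len(items))]          # start from singletons
    merges = []

    def avg_sim(A, B):
        return np.mean([K(items[a], items[b]) for a in A for b in B])

    while len(parts) > 1:
        pairs = [(i, j) for i in range(len(parts)) for j in range(i + 1, len(parts))]
        i, j = max(pairs, key=lambda ij: avg_sim(parts[ij[0]], parts[ij[1]]))
        merges.append((list(parts[i]), list(parts[j])))
        parts[i] = parts[i] + parts[j]
        del parts[j]
    return merges
```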

Page 31:

Strong Stability Property, Inductive Setting

Assume for all clusters C, C', all A ⊂ C, A' ⊆ C': K(A, C-A) > K(A, A') + γ.

Inductive setting:
– Draw a sample S and hierarchically partition S.
– Insert new points as they arrive.

Analysis:
– Need to argue that sampling preserves stability.
– A sample-complexity-type argument using regularity-type results of [AFKK].

Page 32:

Weaker Conditions

Average Attraction Property: E_{x' ∈ C(x)}[K(x,x')] > E_{x' ∈ C'}[K(x,x')] + γ   (for all C' ≠ C(x))

• Not sufficient for hierarchy.
• Can produce a small list of clusterings.
• Upper bound of t^{O(t/γ²)} on the list size [doesn't depend on n]; lower bound ~ t^{Ω(1/γ)}.

Stability of Large Subsets Property:

• Sufficient for hierarchy (running time t^{O(t/γ²)}).
• Might cause bottom-up algorithms to fail; find the hierarchy using a learning-based algorithm.
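One natural way to generate such a list of clusterings under average attraction (a sketch in the spirit of the sampling-based approach; the sample size, tie-breaking, and all names are my own, and the t^sample_size enumeration is what gives a list whose size does not depend on n):

```python
import numpy as np
from itertools import product

def list_clusterings(K, items, t, sample_size, seed=0):
    """Draw a small random sample, try every assignment of the sampled points to
    the t clusters, and extend each guess by giving every item the label of the
    guessed cluster it is most attracted to on average. Returns the whole list
    (t ** sample_size candidate clusterings)."""
    n = len(items)
    rng = np.random.default_rng(seed)
    sample = rng.choice(n, size=min(sample_size, n), replace=False)
    clusterings = []
    for guess in product(range(t), repeat=len(sample)):
        labels = []
        for i in range(n):
            scores = []
            for c in range(t):
                members = [int(s) for s, g in zip(sample, guess) if g == c]
                scores.append(np.mean([K(items[i], items[m]) for m in members])
                              if members else -np.inf)
            labels.append(int(np.argmax(scores)))
        clusterings.append(labels)
    return clusterings
```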

Page 33:

Similarity Functions for Clustering, Summary

• Minimal conditions on K to be useful for clustering.

– List Clustering

– Hierarchical clustering

• A discriminative/SLT-style model for clustering with non-interactive feedback.

• Our notion of a property is the analogue of a data-dependent concept class in classification.
