of 16 /16
Maria-Florina Balcan Learning with Similarity Functions Maria-Florina Balcan & Avrim Blum CMU, CSD

# Maria-Florina Balcan Learning with Similarity Functions Maria-Florina Balcan & Avrim Blum CMU, CSD

• View
219

1

Embed Size (px)

### Text of Maria-Florina Balcan Learning with Similarity Functions Maria-Florina Balcan & Avrim Blum CMU,...

Maria-Florina Balcan

Learning with Similarity Functions

Maria-Florina Balcan & Avrim BlumCMU, CSD

Maria-Florina Balcan

Kernels and Similarity Functions

• Useful in practice for dealing with many different kinds of data.

• Elegant theory about what makes a given kernel good for a given learning problem.

Our Goal: analyze more general similarity functions.• In the process we describe ways of constructing good data dependent kernels.

Kernels have become a powerful tool in ML.

Maria-Florina Balcan

Kernels• A kernel K is a pairwise similarity function s.t. 9 an implicit mapping s.t. K(x,y)=(x) ¢ (y).

• Point is: many learning algorithms can be written so only interact with data via dot-products.

• If replace x¢y with K(x,y), it acts implicitly as if data was in higher-dimensional -space.

• If data is linearly separable by large margin in -space, don’t have to pay in terms of data or comp time.

If margin in -space, only need 1/2 examples to learn well.

w

(x)

1

Maria-Florina Balcan

General Similarity Functions

Goal: definition of good similarity function for a learning problem that:

1) Talks in terms of natural direct properties:

• no implicit high-dimensional spaces• no requirement of positive-

semidefiniteness2) If K satisfies these properties for our given

problem, then has implications to learning.

3) Is broad: includes usual notion of “good kernel”.(induces a large margin separator in -space)

Maria-Florina Balcan

A First Attempt: Definition satisfying properties (1) and (2)

• K:(x,y) ! [-1,1] is an (,)-good similarity for P if at least a 1- probability mass of x satisfy:

Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+

Note: this might not be a legal kernel.

• Suppose that positives have K(x,y) ¸ 0.2, negatives have K(x,y) ¸ 0.2, but for a positive and a negative K(x,y) are uniform random in [-1,1].

Let P be a distribution over labeled examples (x, l(x))

A

BC

+

--

Maria-Florina Balcan

A First Attempt: Definition satisfying properties (1) and (2). How to use it?

• K:(x,y) ! [-1,1] is an (,)-good similarity for P if at least a 1- probability mass of x satisfy:

Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+

Algorithm

• Draw S+ of O((1/2) ln(1/2)) positive examples.• Draw S- of O((1/2) ln(1/2)) negative examples.• Classify x based on which gives better score.

Maria-Florina Balcan

A First Attempt: How to use it?• K:(x,y) ! [-1,1] is an (,)-good similarity for P if at least a 1- probability mass of x satisfy:

Algorithm

• Draw S+ of O((1/2) ln(1/2)) positive examples.• Draw S- of O((1/2) ln(1/2)) negative examples.• Classify x based on which gives better score.

• Hoeffding: for any given “good x”, probability of error w.r.t. x (over draw of S+, S-) at most 2.

• By Markov, at most chance that the error rate over GOOD is more than . So overall error rate · + .

Guarantee: with probability ¸ 1-, error · + Proof

Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+

Maria-Florina Balcan

A First Attempt: Not Broad Enough• K:(x,y) ! [-1,1] is an (,)-good similarity for P if at least a 1- probability mass of x satisfy:

Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+

• K(x,y)=x ¢ y has good (large margin) separator but doesn’t satisfy our definition.

+ +++++

-- -- --

more similar to negs than to typical pos

Maria-Florina Balcan

A First Attempt: Not Broad Enough• K:(x,y) ! [-1,1] is an (,)-good similarity for P if at least a 1- probability mass of x satisfy:

Idea: would work if we didn’t pick y’s rom top-left.

Broaden to say: OK if 9 large region R s.t. most x are on average more similar to y2R of same label than to y2 R of other label.

Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+

R+ ++++

+

-- -- --

Maria-Florina Balcan

• K:(x,y) ! [-1,1] is an (,)-good similarity for P if exists a weighting function w(y) 2 [0,1] at least a 1- probability mass of x satisfy:

Ey~P[w(y)K(x,y)|l(y)=l(x)] ¸ Ey~P[w(y)K(x,y)|l(y)l(x)]+

Maria-Florina Balcan

Main Definition, How to Use It• K:(x,y) ! [-1,1] is an (,)-good similarity for P if exists a weighting function w(y) 2 [0,1] at least a 1- probability mass of x satisfy:

Ey~P[w(y)K(x,y)|l(y)=l(x)] ¸ Ey~P[w(y)K(x,y)|l(y)l(x)]+

Algorithm

• Draw S+={y1, , yd}, S-={z1, , zd}, d=O((1/2) ln(1/2)).

• Use to “triangulate” data:

F(x) = [K(x,y1), …,K(x,yd), K(x,zd),…,K(x,zd)].

• Take a new set of labeled examples, project to this space, and run your favorite alg for learning lin. separators.

Point is: with probability ¸ 1-, exists linear separator of error · + at margin /4.

(w = [w(y1), …,w(yd),-w(zd),…,-w(zd)])

Maria-Florina Balcan

Main Definition, Implications

Algorithm

• Draw S+={y1, , yd}, S-={z1, , zd}, d=O((1/2) ln(1/2)).• Use to “triangulate” data:F(x) = [K(x,y1), …,K(x,yd), K(x,zd),…,K(x,zd)].

Guarantee: with prob. ¸ 1-, exists linear separator of error · + at margin /4.

Implications legal kernel

K arbitrary sim. function

(,)-good sim. function

(+,/4)-good kernel function

Maria-Florina Balcan

Good Kernels are Good Similarity Functions

Main Definition: K:(x,y) ! [-1,1] is an (,)-good similarity for P if exists a weighting function w(y) 2 [0,1] at least a 1- probability mass of x satisfy:

Ey~P[w(y)K(x,y)|l(y)=l(x)] ¸ Ey~P[w(y)K(x,y)|l(y)l(x)]+

• An (,)-good kernel is an (’,’)-good similarity function under main definition.

Theorem

Our current proofs incur some penalty: ’ = + extra, ’ = 3extra.

Maria-Florina Balcan

Good Kernels are Good Similarity Functions

• An (,)-good kernel is an (’,’)-good similarity function under main definition, where

Theorem

’ = + extra, ’ = 3extra.

Proof Sketch

• Suppose K is a good kernel in usual sense.• Then, standard margin bounds imply:

– if S is a random sample of size Õ(1/(2)), then whp we can give weights wS(y) to all examples y 2 S so that the weighted sum of these examples defines a good LTF.

• But, we want sample-independent weights [and bounded].

– Boundedness not too hard (imagine a margin-perceptron run over just the good y).

– Get sample-independence using an averaging argument.

Maria-Florina Balcan

Learning with Multiple Similarity Functions

• Let K1, …, Kr be similarity functions s. t. some (unknown) convex combination of them is (,)-good.

• Draw S+={y1, , yd}, S-={z1, , zd}, d=O((1/2) ln(1/2)).

• Use to “triangulate” data:

F(x) = [K1(x,y1), …,Kr(x,yd), K1(x,zd),…,Kr(x,zd)].

Guarantee: The induced distribution F(P) in R2dr has a separator of error · + at margin at least

Algorithm

Sample complexity is roughly

Maria-Florina Balcan

Implications & Conclusions

• Develop theory that provides a formal way of understanding kernels as similarity function.

• Our algorithms work for similarity fns that aren’t necessarily PSD (or even symmetric).

Open ProblemsOpen Problems

• Better results for learning with multiple similarity functions. Extending [SB’06].

• Improve existing bounds.

Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents
Documents