Learning with Similarity Functions
Maria-Florina Balcan & Avrim Blum, CMU CSD
Kernels and Similarity Functions
Kernels have become a powerful tool in ML.
• Useful in practice for dealing with many different kinds of data.
• Elegant theory about what makes a given kernel good for a given learning problem.
Our Goal: analyze more general similarity functions.
• In the process, we describe ways of constructing good data-dependent kernels.
Kernels
• A kernel K is a pairwise similarity function s.t. ∃ an implicit mapping φ s.t. K(x,y) = φ(x)·φ(y).
• Point is: many learning algorithms can be written so that they only interact with the data via dot products.
• If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in the higher-dimensional φ-space.
• If the data is linearly separable by a large margin in φ-space, we don't have to pay for the dimension of that space in terms of data or computation time.
If the margin is γ in φ-space, we only need ~1/γ² examples to learn well.
[Figure: a large-margin separator w in φ-space.]
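To make the "only interact via dot products" point concrete, here is a minimal kernelized-perceptron sketch (this particular algorithm and all names are illustrative, not from the slides); note that the learner touches the data only through K:

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=5):
    """Perceptron that accesses data only via K(x_i, x_j); the weight
    vector w = sum_i alpha_i * y_i * phi(x_i) is kept implicitly."""
    y = np.asarray(y)  # labels in {+1, -1}
    n = len(X)
    alpha = np.zeros(n)
    # Precompute pairwise similarities: the only access to the data.
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            # sign(w . phi(x_i)), computed purely from kernel evaluations
            if y[i] * np.dot(alpha * y, G[:, i]) <= 0:
                alpha[i] += 1.0  # mistake: implicitly add y_i * phi(x_i) to w
    return alpha
```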
General Similarity Functions
Goal: a definition of a good similarity function for a learning problem that:
1) Talks in terms of natural, direct properties:
  • no implicit high-dimensional spaces
  • no requirement of positive-semidefiniteness
2) If K satisfies these properties for our given problem, then it has implications for learning.
3) Is broad: includes the usual notion of a "good kernel" (one that induces a large-margin separator in φ-space).
A First Attempt: Definition satisfying properties (1) and (2)
Let P be a distribution over labeled examples (x, l(x)).
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Note: such a K might not be a legal kernel.
• E.g., suppose any two positives have K(x,y) ≥ 0.2, any two negatives have K(x,y) ≥ 0.2, but for a positive and a negative, K(x,y) is uniform random in [-1,1]. This K satisfies the definition with γ = 0.2, yet it need not be PSD (or even symmetric).
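As a quick sanity check on this example, a small simulation (the sampling details beyond the stated 0.2 bound and the uniform cross-label noise are hypothetical) estimates the gap in the definition:

```python
import numpy as np

rng = np.random.default_rng(0)

def K(label_x, label_y):
    """Toy similarity from the slide: same-label pairs score >= 0.2,
    cross-label pairs are uniform noise in [-1, 1]."""
    if label_x == label_y:
        return rng.uniform(0.2, 1.0)  # some value >= 0.2 (exact choice arbitrary)
    return rng.uniform(-1.0, 1.0)

labels = rng.choice([+1, -1], size=4000)
same = [K(+1, l) for l in labels if l == +1]
diff = [K(+1, l) for l in labels if l != +1]

# E[K | same label] - E[K | different label] is >= 0.2 in expectation,
# so K is a good similarity here even though it is not PSD (or symmetric).
print(np.mean(same) - np.mean(diff))
```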
A First Attempt: Definition satisfying properties (1) and (2). How to use it?
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw a set S⁺ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw a set S⁻ of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which set gives the better average similarity score.
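A minimal sketch of this algorithm (the function and variable names are illustrative):

```python
import numpy as np

def classify(x, S_pos, S_neg, K):
    """First-attempt rule: predict the label of whichever sample
    (positives or negatives) x is more similar to on average."""
    pos_score = np.mean([K(x, y) for y in S_pos])
    neg_score = np.mean([K(x, y) for y in S_neg])
    return +1 if pos_score >= neg_score else -1
```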
A First Attempt: How to use it?
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S⁺ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw S⁻ of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which set gives the better average score.
Guarantee: with probability ≥ 1-δ, error ≤ ε + δ.
Proof
• Hoeffding: for any given "good x", the probability of error w.r.t. x (over the draw of S⁺, S⁻) is at most δ².
• By Markov, there is at most a δ chance that the error rate over the GOOD set is more than δ. So the overall error rate is ≤ ε + δ.
A First Attempt: Not Broad Enough
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
• K(x,y) = x·y can have a good (large-margin) separator and yet fail to satisfy our definition.
[Figure: positives in two clumps above a clump of negatives; the top-left positives are more similar to the negatives than to a typical positive.]
A First Attempt: Not Broad Enough
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Idea: the definition would work if we didn't pick y's from the top-left clump.
Broaden to say: it is OK if ∃ a large region R s.t. most x are on average more similar to points y ∈ R of the same label than to points y ∈ R of the other label.
[Figure: the same example with the region R drawn to exclude the top-left clump of positives.]
Broader/Main Definition
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
  E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Main Definition, How to Use It
• K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
  E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S⁺ = {y1, …, yd} and S⁻ = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data:
  F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
• Take a new set of labeled examples, project them into this space, and run your favorite algorithm for learning linear separators.
Point is: with probability ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
(Namely w* = [w(y1), …, w(yd), -w(z1), …, -w(zd)].)
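A minimal sketch of the triangulation step (LinearSVC is just one stand-in for "your favorite" linear-separator learner; all names are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC  # any linear-separator learner works here

def featurize(X, S_pos, S_neg, K):
    """Empirical map F(x) = [K(x,y1), ..., K(x,yd), K(x,z1), ..., K(x,zd)]."""
    landmarks = list(S_pos) + list(S_neg)
    return np.array([[K(x, y) for y in landmarks] for x in X])

# Hypothetical usage, with K any bounded similarity (it need not be PSD):
# clf = LinearSVC().fit(featurize(X_train, S_pos, S_neg, K), y_train)
# preds = clf.predict(featurize(X_test, S_pos, S_neg, K))
```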
Main Definition, Implications
Algorithm
• Draw S⁺ = {y1, …, yd}, S⁻ = {z1, …, zd}, d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data: F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
Guarantee: with prob. ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
Implication: even when K is an arbitrary similarity function rather than a legal kernel, an (ε,γ)-good similarity function induces, via the map F, a legal kernel that is an (ε+δ, γ/4)-good kernel function.
Good Kernels are Good Similarity Functions
Main Definition: K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
  E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Theorem
• An (ε,γ)-good kernel is an (ε',γ')-good similarity function under the main definition.
Our current proofs incur some penalty: ε' = ε + ε_extra, γ' = γ³ε_extra.
Good Kernels are Good Similarity Functions
Theorem
• An (ε,γ)-good kernel is an (ε',γ')-good similarity function under the main definition, where ε' = ε + ε_extra and γ' = γ³ε_extra.
Proof Sketch
• Suppose K is a good kernel in the usual sense.
• Then standard margin bounds imply: if S is a random sample of size Õ(1/(γ²ε)), then w.h.p. we can give weights wS(y) to all examples y ∈ S so that the weighted sum of these examples defines a good LTF (linear threshold function).
• But we want sample-independent weights [and bounded ones].
  – Boundedness is not too hard (imagine a margin perceptron run over just the good y's).
  – We get sample-independence using an averaging argument.
Learning with Multiple Similarity Functions
• Let K1, …, Kr be similarity functions s.t. some (unknown) convex combination of them is (ε,γ)-good.
Algorithm
• Draw S⁺ = {y1, …, yd}, S⁻ = {z1, …, zd}, d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data:
  F(x) = [K1(x,y1), …, Kr(x,yd), K1(x,z1), …, Kr(x,zd)].
Guarantee: the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + δ at margin at least γ/(4√r).
Since the margin shrinks by a factor of √r, the sample complexity grows only by roughly a factor of r relative to the single-similarity case.
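A sketch of the corresponding feature map (names illustrative); a linear learner run over these coordinates implicitly searches over convex combinations of the Ki:

```python
import numpy as np

def featurize_multi(X, S_pos, S_neg, Ks):
    """Concatenated map: one coordinate K_i(x, y_j) per similarity K_i in Ks
    and landmark y_j, giving vectors in R^(2dr) for r = len(Ks), 2d landmarks."""
    landmarks = list(S_pos) + list(S_neg)
    return np.array([[K(x, y) for K in Ks for y in landmarks] for x in X])
```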
Implications & Conclusions
• We develop theory that provides a formal way of understanding kernels as similarity functions.
• Our algorithms work for similarity functions that aren't necessarily PSD (or even symmetric).
Open Problems
• Better results for learning with multiple similarity functions; extending [SB'06].
• Improve the existing bounds.