Page 1

Learning with General Similarity Functions

Maria-Florina Balcan

Page 3

Kernel Methods

A kernel K is a legal definition of a dot product: i.e., there exists an implicit mapping φ such that K(x,y) = φ(x)·φ(y).

E.g., K(x,y) = (x·y + 1)^d

(n-dimensional space) → n^d-dimensional space

Why do kernels matter? Many algorithms interact with data only via dot products.

So, if we replace x·y with K(x,y), they act implicitly as if the data were in the higher-dimensional φ-space.

What is a Kernel?

A prominent method for supervised classification today. The learning algorithm interacts with the data via a similarity function.

Page 4

Example

E.g., for n = 2, d = 2, the kernel K(x,y) = (x·y)^d corresponds to an explicit map from the original (x1, x2) space to a (z1, z2, z3) space.

[Figure: X and O points in the original (x1, x2) space and in the mapped (z1, z2, z3) space.]
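To make this concrete, here is a minimal Python sketch (not from the slides; the specific vectors are illustrative) checking that for n = 2, d = 2 the kernel K(x,y) = (x·y)² equals the dot product of the explicit features φ(x) = (x1², x2², √2·x1·x2):

import numpy as np

def poly_kernel(x, y, d=2):
    # Polynomial kernel K(x, y) = (x . y)^d, computed entirely in the original space.
    return np.dot(x, y) ** d

def phi(x):
    # Explicit feature map for n = 2, d = 2: (x1^2, x2^2, sqrt(2)*x1*x2).
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])

# The two computations agree: the kernel implicitly works in the 3-dimensional phi-space.
print(poly_kernel(x, y))           # (x . y)^2          = 0.0576
print(np.dot(phi(x), phi(y)))      # phi(x) . phi(y)    = 0.0576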

Page 5

Generalize Well if Good Margin

• If data is linearly separable by margin γ in the φ-space (with |φ(x)| ≤ 1), then good sample complexity.

[Figure: positive (+) and negative (-) points separated by margin γ.]

If margin γ in φ-space, then need sample size of only Õ(1/γ²) to get confidence in generalization.

Page 6

Kernel Methods

Significant percentage of ICML, NIPS, COLT.

Very useful in practice for dealing with many different types of data.

Prominent method for supervised classification today

Page 7

Limitations of the Current Theory

Existing Theory: in terms of margins in implicit spaces.

In practice: kernels are constructed by viewing them as measures of similarity.

Kernel requirement rules out many natural similarity functions.

Difficult to think about, not great for intuition.

Better theoretical explanation?

Page 8

Better Theoretical Framework

Existing Theory: in terms of margins in implicit spaces.

In practice: kernels are constructed by viewing them as measures of similarity.

Kernel requirement rules out natural similarity functions.

Difficult to think about, not great for intuition.

Yes! We provide a more general and intuitive theory that formalizes the intuition that a good kernel is a good measure of similarity.

Better theoretical explanation?

[Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, MLJ 2008]

[Balcan-Blum-Srebro, COLT 2008]

Page 9

More General Similarity Functions

We provide a notion of a good similarity function:

1) Simpler, in terms of natural, direct quantities.
• no implicit high-dimensional spaces
• no requirement that K(x,y) = φ(x)·φ(y)

2) Is broad: includes usual notion of good kernel.

3) Allows one to learn classes that have no good kernels.

[Diagram relating three notions: "Good kernels" (has a large margin separator in the φ-space), the "First attempt", and the "Main notion" (K can be used to learn well).]

Page 10

A First Attempt

P is a distribution over labeled examples (x, l(x)). Goal: output a classification rule good for P.

K is (ε,γ)-good for P if at least a 1-ε prob. mass of x satisfy:

Ey~P[K(x,y) | l(y)=l(x)] ≥ Ey~P[K(x,y) | l(y)≠l(x)] + γ

(average similarity to points of the same label ≥ average similarity to points of the opposite label, plus a gap γ)

I.e., K is good if most x are on average more similar to points y of their own type than to points y of the other type.

Page 11

A First Attempt

Example: K(x,y) ≥ 0.2 when l(x) = l(y); K(x,y) random in {-1,1} when l(x) ≠ l(y).

K is (ε,γ)-good for P if at least a 1-ε prob. mass of x satisfy:

Ey~P[K(x,y) | l(y)=l(x)] ≥ Ey~P[K(x,y) | l(y)≠l(x)] + γ

[Figure: points with illustrative pairwise similarity values (0.5, 0.4, 0.3, 1, -1).]

Page 12

A First Attempt

Algorithm
• Draw sets S+, S- of positive and negative examples.
• Classify x based on average similarity to S+ versus to S- (a code sketch follows below).

K is (ε,γ)-good for P if at least a 1-ε prob. mass of x satisfy:

Ey~P[K(x,y) | l(y)=l(x)] ≥ Ey~P[K(x,y) | l(y)≠l(x)] + γ

[Figure: a test point x compared against the sets S+ and S-.]
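A minimal Python sketch of this first-attempt rule (not from the paper; the similarity function and data below are illustrative):

import numpy as np

def avg_similarity_classifier(K, S_plus, S_minus):
    # First-attempt rule: label x positive iff its average similarity to the positive
    # sample S_plus is at least its average similarity to the negative sample S_minus.
    def classify(x):
        avg_pos = np.mean([K(x, y) for y in S_plus])
        avg_neg = np.mean([K(x, y) for y in S_minus])
        return +1 if avg_pos >= avg_neg else -1
    return classify

# Any bounded similarity works here; it need not be PSD or even symmetric.
K = lambda x, y: np.tanh(np.dot(x, y))
S_plus = [np.array([1.0, 0.9]), np.array([0.8, 1.1])]
S_minus = [np.array([-1.0, -0.8]), np.array([-0.9, -1.2])]

h = avg_similarity_classifier(K, S_plus, S_minus)
print(h(np.array([0.7, 0.6])), h(np.array([-0.6, -0.7])))   # 1 -1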

Page 13

A First Attempt

K is (ε,γ)-good for P if at least a 1-ε prob. mass of x satisfy:

Ey~P[K(x,y) | l(y)=l(x)] ≥ Ey~P[K(x,y) | l(y)≠l(x)] + γ

Algorithm
• Draw sets S+, S- of positive and negative examples.
• Classify x based on average similarity to S+ versus to S-.

Theorem
If |S+| and |S-| are Ω((1/γ²) ln(1/(δε'))), then with probability ≥ 1-δ, error ≤ ε + ε'.

• For a fixed good x, the prob. of error w.r.t. x (over the draw of S+, S-) is at most δε'. [Hoeffding]
• At most δ chance that the error rate over GOOD points is ≥ ε'.
• Overall error rate ≤ ε + ε'.
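A sketch of the Hoeffding step, under the assumption (as in the slides' examples) that K(x,y) ∈ [-1,1]; constants are indicative only:

% For a fixed good x, the two conditional means differ by at least gamma.  If each empirical
% average (over S+ and over S-) is within gamma/2 of its mean, then x is classified correctly.
% By Hoeffding's inequality, for a sample S of size m and K(x,y) in [-1,1],
\Pr\!\left[\;\Bigl|\tfrac{1}{m}\textstyle\sum_{y \in S} K(x,y) - \mathbb{E}_{y}[K(x,y)]\Bigr| \ge \tfrac{\gamma}{2}\;\right] \;\le\; 2\exp\!\left(-\,m\,\gamma^{2}/8\right).
% Requiring this failure probability to be at most (delta * eps')/2 for each of S+ and S-
% gives m = O((1/gamma^2) ln(1/(delta * eps'))), matching the sample size in the theorem.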

Page 14

A First Attempt: Not Broad Enough

Ey~P[K(x,y) | l(y)=l(x)] ≥ Ey~P[K(x,y) | l(y)≠l(x)] + γ

• The data has a large margin separator;

[Figure: positive and negative points at 30° angles; a typical positive is more similar to the negatives than to a typical positive (½ versus ¼).]

The similarity function K(x,y) = x·y does not satisfy our definition: the average similarity of a positive to the negatives is ½, versus ½·1 + ½·(-½) = ¼ to the positives.

Page 15

A First Attempt: Not Broad Enough

Ey~P[K(x,y) | l(y)=l(x)] ≥ Ey~P[K(x,y) | l(y)≠l(x)] + γ

[Figure: the same positive and negative points, with a region R highlighted.]

Broaden: ∃ a non-negligible set R s.t. most x are on average more similar to points y ∈ R of the same label than to points y ∈ R of the other label. [Even if we do not know R in advance.]

Page 16

Broader Definition

K is (ε, γ, τ)-good if ∃ a set R of "reasonable" y (allowed to be probabilistic) s.t. at least a 1-ε fraction of x satisfy:

Ey~P[K(x,y) | l(y)=l(x), R(y)] ≥ Ey~P[K(x,y) | l(y)≠l(x), R(y)] + γ

and there is at least a τ prob. mass of reasonable positives and negatives.

Property
• Draw S = {y1, …, yd}, a set of landmarks, and re-represent the data: x → F(x) = [K(x,y1), …, K(x,yd)], mapping P to F(P) in R^d.
• If there are enough landmarks (d = Ω(1/(γ²τ))), then with high prob. there exists a good L1 large-margin linear separator, e.g. w = [0, 0, 1/n+, 1/n+, 0, 0, 0, -1/n-, 0, 0] (see the sketch below).
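A sketch of why that weight vector gives a large L1 margin, assuming K(x,y) ∈ [-1,1] (so ||F(x)||_∞ ≤ 1); this is an informal version of the argument, not the slides' exact statement:

% Put weight 1/n_+ on each reasonable positive landmark and -1/n_- on each reasonable
% negative landmark (zero elsewhere).  Then
w \cdot F(x) \;=\; \frac{1}{n_+}\sum_{\substack{y_i \text{ reasonable} \\ \ell(y_i)=+1}} K(x, y_i) \;-\; \frac{1}{n_-}\sum_{\substack{y_i \text{ reasonable} \\ \ell(y_i)=-1}} K(x, y_i).
% With enough landmarks these empirical averages approximate the conditional expectations
% in the definition, so a good x satisfies l(x) * (w . F(x)) >= gamma (approximately).
% Since ||w||_1 <= 2 and ||F(x)||_inf <= 1, this is an L1 margin on the order of gamma/2.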

Page 17

Broader Definition

K is (ε, γ, τ)-good if ∃ a set R of "reasonable" y (allowed to be probabilistic) s.t. at least a 1-ε fraction of x satisfy:

Ey~P[K(x,y) | l(y)=l(x), R(y)] ≥ Ey~P[K(x,y) | l(y)≠l(x), R(y)] + γ

and there is at least a τ prob. mass of reasonable positives and negatives.

Algorithm
• Draw S = {y1, …, yd}, a set of landmarks, and re-represent the data: x → F(x) = [K(x,y1), …, K(x,yd)], mapping P to F(P) in R^d.
• Take a new set of labeled examples, project them into this space, and run a good L1 linear separator algorithm (sketched in the code below).

Sample sizes: d_u = Õ(1/(γ²τ)) unlabeled landmarks, d_l = O((1/(γ²ε_acc)) ln(d_u)) labeled examples.

[Figure: X and O points re-represented in the landmark space.]
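A minimal sketch of the two-stage algorithm in Python, assuming scikit-learn is available; the choice of L1-regularized logistic regression as the "good L1 linear separator algorithm", and all names and data below, are illustrative assumptions rather than the slides' prescription:

import numpy as np
from sklearn.linear_model import LogisticRegression

def landmark_features(K, landmarks):
    # Re-representation: x -> F(x) = [K(x, y_1), ..., K(x, y_d)] for fixed landmarks.
    return lambda X: np.array([[K(x, y) for y in landmarks] for x in X])

# K may be any bounded similarity function; it need not be PSD or symmetric.
K = lambda x, y: np.tanh(np.dot(x, y))

rng = np.random.default_rng(0)
unlabeled_pool = rng.normal(size=(40, 5))     # unlabeled data used only to draw landmarks
landmarks = unlabeled_pool[:20]               # d_u landmarks (~ 1/(gamma^2 tau) in the theory)
F = landmark_features(K, landmarks)

X_train = rng.normal(size=(60, 5))            # d_l labeled examples (toy data)
y_train = np.sign(X_train[:, 0] + 0.1)

# L1-regularized linear separator learned in the landmark space.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(F(X_train), y_train)
print(clf.predict(F(rng.normal(size=(3, 5)))))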

Page 18

Kernels versus Similarity Functions

Main Technical Contributions (Our Work)

Theorem
If K is a good kernel, then K is also a good similarity function (but γ gets squared): if K has margin γ in the implicit space, then for any ε, K is (ε, γ², ε)-good in our sense.

[Diagram: Good Kernels contained within Good Similarities.]

Page 19

Kernels versus Similarity Functions

Main Technical Contributions (Our Work)

Theorem
If K is a good kernel, then K is also a good similarity function (but γ gets squared).

Theorem (strict separation)
For any class C of n pairwise uncorrelated functions, ∃ a similarity function good for all f in C, but no such good kernel function exists.

[Diagram: Good Kernels strictly contained within Good Similarities (strictly more general).]

Page 20

Kernels versus Similarity Functions

Can also show a strict separation.

Theorem
For any class C of n pairwise uncorrelated functions, ∃ a similarity function good for all f in C, but no such good kernel function exists.

• In principle, one should be able to learn from O((1/ε) log(|C|/δ)) labeled examples.

• Claim 1: can define a generic (0, 1, 1/|C|)-good similarity function achieving this bound. (Assume D is not too concentrated.)

• Claim 2: there is no (ε,γ)-good kernel in hinge loss, even for ε = 1/2 and γ = |C|^(-1/2). So the margin-based sample complexity is d = Ω(1/γ²) = Ω(|C|).

Page 21

Learning with Multiple Similarity Functions

• Let K1, …, Kr be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good.

Algorithm
• Draw S = {y1, …, yd}, a set of landmarks, and concatenate features:
F(x) = [K1(x,y1), …, Kr(x,y1), …, K1(x,yd), …, Kr(x,yd)].

Guarantee: Whp the induced distribution F(P) in R^(2dr) has a separator of error ≤ ε + ε_acc at L1 margin Ω(γ).

Sample complexity only increases by a log(r) factor!
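A minimal Python sketch of the concatenated re-representation (the similarity functions below are illustrative; the resulting features would then be fed to the same L1 linear learner as before):

import numpy as np

def multi_landmark_features(Ks, landmarks):
    # F(x) = [K_1(x,y_1), ..., K_r(x,y_1), ..., K_1(x,y_d), ..., K_r(x,y_d)]:
    # one block of r similarity values per landmark.
    def F(x):
        return np.array([K(x, y) for y in landmarks for K in Ks])
    return F

Ks = [lambda x, y: np.tanh(np.dot(x, y)),
      lambda x, y: np.exp(-np.sum((x - y) ** 2)),
      lambda x, y: np.dot(x, y) / (1.0 + np.linalg.norm(x) * np.linalg.norm(y))]

rng = np.random.default_rng(1)
landmarks = rng.normal(size=(10, 4))             # d landmarks
F = multi_landmark_features(Ks, landmarks)
print(F(rng.normal(size=4)).shape)               # (d * r,) = (30,)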

Page 22

Conclusions
• Theory of learning with similarity fns that provides a formal way of understanding good kernels as good similarity fns.

• Our algorithms work for similarity fns that aren’t necessarily PSD (or even symmetric).

Algorithmic Implications

• Can use non-PSD similarities, no need to “transform” them into PSD functions and plug into SVM.

E.g., Liao and Noble, Journal of Computational Biology
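As an illustration of that point (a hypothetical similarity, not from the cited work): an asymmetric, non-PSD function can be dropped straight into the landmark-based algorithms above with no kernelization step.

import numpy as np

def directed_similarity(x, y):
    # Asymmetric similarity: in general K(x, y) != K(y, x), so it is not a kernel,
    # yet it is a perfectly valid input to the landmark re-representation.
    return np.dot(x, y) / (1.0 + np.linalg.norm(x))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(directed_similarity(x, y), directed_similarity(y, x))   # generally different values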

Page 23

Conclusions
• Theory of learning with similarity fns that provides a formal way of understanding good kernels as good similarity fns.

• Our algorithms work for similarity fns that aren’t necessarily PSD (or even symmetric).

Open Questions
• Analyze other notions of good similarity fns.


Page 25

Similarity Functions for Classification

Algorithmic Implications
• Can use non-PSD similarities, no need to "transform" them into PSD functions and plug into SVM.

E.g., Liao and Noble, Journal of Computational Biology

• Give justification to the following rule:

• Also show that anything learnable with SVM is learnable this way!