Page 1: Kernels and Margins Maria Florina Balcan 10/13/2011

Kernels and Margins

Maria Florina Balcan, 10/13/2011

Page 2

Kernel Methods

Amazingly popular method in ML in recent years.

• Significant percentage of papers at ICML, NIPS, COLT.

• Lots of books, workshops.

• ICML 2007, business meeting.

Page 3

Linear Separators

• Instance space: X = R^n

• Hypothesis class of linear decision surfaces in R^n: h(x) = w · x; if h(x) > 0, then label x as +, otherwise label it as −.

[Figure: X and O points in the plane, separated by a linear decision surface with normal vector w.]

Page 4

Linear Separators: Perceptron algorithm

Start with the all-zeroes weight vector w. Given example x, predict positive iff w · x ≥ 0. On a mistake, update as follows:

• Mistake on positive: update w ← w + x
• Mistake on negative: update w ← w − x

Note: w is a weighted sum of the incorrectly classified examples.

Guarantee: the mistake bound is 1/γ².
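The update rule above can be sketched as follows (a minimal illustration; the function name and toy data are assumptions, not from the slides):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Perceptron: start with the all-zeroes weight vector w;
    predict positive iff w . x >= 0, and on a mistake update
    w <- w + x (missed positive) or w <- w - x (missed negative)."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(epochs):
        for x, label in zip(X, y):              # label in {+1, -1}
            pred = 1 if np.dot(w, x) >= 0 else -1
            if pred != label:                   # mistake: w <- w + label * x
                w = w + label * x
                mistakes += 1
    return w, mistakes

# Toy linearly separable data (illustrative).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, num_mistakes = perceptron(X, y)
```

After training, w classifies every example correctly, and w is exactly the signed sum of the examples on which mistakes occurred.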

Page 5

Geometric Margin

• If S is a set of labeled examples, then a vector w has margin γ w.r.t. S if every example x in S with label ℓ(x) ∈ {−1, +1} satisfies ℓ(x)(w · x)/(‖w‖‖x‖) ≥ γ.

[Figure: separating hyperplane h: w · x = 0 with normal vector w; γ is the distance from the closest example x to h.]

Page 7

Overview of Kernel Methods

What is a Kernel?

• A kernel K is a legal definition of dot-product: i.e., there exists an implicit mapping φ such that K(x, y) = φ(x) · φ(y).

Why do Kernels matter?

• Many algorithms interact with data only via dot-products. So, if we replace x · y with K(x, y), they act implicitly as if the data were in the higher-dimensional φ-space.

• If the data is linearly separable by a good margin in the φ-space, then we get good sample complexity.

Page 8

Kernels

• K(·,·) is a kernel if it can be viewed as a legal definition of inner product: ∃ φ : X → R^N such that K(x, y) = φ(x) · φ(y).

• The range of φ is the "φ-space"; N can be very large.

• But think of φ as implicit, not explicit!

Page 9

Example

E.g., for n = 2, d = 2, the kernel K(x, y) = (x · y)^d corresponds to the mapping φ(x) = (x1², x2², √2 x1x2).

[Figure: X/O data in the original (x1, x2) space that is not linearly separable becomes linearly separable in the (z1, z2, z3) φ-space.]
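The correspondence can be verified numerically for d = 2: computing K(x, y) = (x · y)² implicitly gives the same number as an explicit dot product in the φ-space (a small sketch; the variable names are illustrative):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel in n = 2:
    # phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def K(x, y):
    # Implicit computation: (x . y)^2, never materializing phi.
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
implicit = K(x, y)                  # (1*3 + 2*(-1))^2 = 1
explicit = np.dot(phi(x), phi(y))   # same value via the explicit map
```

The implicit computation costs O(n) regardless of the dimension of the φ-space.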


Page 11

Example

Note: the feature space need not be unique; different mappings φ can realize the same kernel K.

Page 12

Kernels

Theorem: K is a kernel iff

• K is symmetric, and

• for any set of training points x1, x2, …, xm and for any a1, a2, …, am ∈ R we have ∑_{i,j} a_i a_j K(x_i, x_j) ≥ 0, i.e., the Gram matrix is positive semi-definite.

More examples:

• Linear: K(x, y) = x · y
• Polynomial: K(x, y) = (x · y)^d or K(x, y) = (1 + x · y)^d
• Gaussian: K(x, y) = exp(−‖x − y‖² / 2σ²)
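The positive semi-definiteness condition in the theorem can be checked numerically for the example kernels above (a sketch; `gram_matrix` and `is_psd` are illustrative helpers, and σ = 1 is assumed for the Gaussian):

```python
import numpy as np

def gram_matrix(kernel, points):
    """Gram matrix G[i, j] = K(x_i, x_j) for a list of points."""
    m = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(m)]
                     for i in range(m)])

def is_psd(G, tol=1e-7):
    """Symmetric with all eigenvalues >= 0, i.e.
    sum_{i,j} a_i a_j K(x_i, x_j) >= 0 for every choice of a."""
    return bool(np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol)

linear = lambda x, y: np.dot(x, y)
poly = lambda x, y: (1 + np.dot(x, y)) ** 3
gauss = lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2 / 2.0)  # sigma = 1

rng = np.random.default_rng(0)
pts = [rng.standard_normal(4) for _ in range(10)]
```

All three Gram matrices pass the check, as the theorem requires.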

Page 13

Kernelizing a learning algorithm

• If all computations involving instances are in terms of inner products, then:

Conceptually, we work in a very high dimensional space, and the algorithm's performance depends only on linear separability in that extended space.

Computationally, we only need to modify the algorithm by replacing each x · y with K(x, y).

• Examples of kernelizable algorithms: Perceptron, SVM.

Page 14

Linear Separators: Perceptron algorithm

Start with the all-zeroes weight vector w. Given example x, predict positive iff w · x ≥ 0. On a mistake, update as follows:

• Mistake on positive: update w ← w + x
• Mistake on negative: update w ← w − x

Note: we need to store all the mistakes so far.

Easy to kernelize, since w is a weighted sum of examples: w = ∑_{i ∈ M} ℓ(x_i) x_i, where M is the set of mistakes.

Replace w · x = ∑_{i ∈ M} ℓ(x_i) (x_i · x) with ∑_{i ∈ M} ℓ(x_i) K(x_i, x).
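A sketch of the kernelized update (illustrative names; the XOR-style toy data is an assumption, chosen to show a case the linear perceptron cannot handle):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Kernelized perceptron: never form w explicitly; store the
    indices of mistakes M and predict sign(sum_{i in M} y_i K(x_i, x))."""
    M = []
    for _ in range(epochs):
        for t in range(len(X)):
            s = sum(y[i] * kernel(X[i], X[t]) for i in M)
            pred = 1 if s >= 0 else -1
            if pred != y[t]:
                M.append(t)     # storing the mistake plays the role of w <- w + y*x
    return M

def predict(X, y, M, kernel, x):
    s = sum(y[i] * kernel(X[i], x) for i in M)
    return 1 if s >= 0 else -1

K2 = lambda a, b: (1 + np.dot(a, b)) ** 2   # degree-2 polynomial kernel

# XOR-like labels: not linearly separable in the original space,
# but separable in the phi-space of K2.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])
M = kernel_perceptron(X, y, K2)
```

After a few epochs the stored mistakes classify all four points correctly, even though no linear separator in R² can.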

Page 15

Generalize Well if Good Margin

• If the data is linearly separable by margin γ in the φ-space (normalized so that ‖φ(x)‖ ≤ 1), then good sample complexity.

[Figure: + and − points separated by a margin γ in the φ-space.]

If the margin is γ in the φ-space, then we need a sample size of only Õ(1/γ²) to get confidence in generalization.

• We cannot rely on standard VC-bounds, since the dimension of the φ-space might be very large: the VC-dimension of the class of linear separators in R^m is m+1.

Page 16

Kernels & Large Margins

• If S is a set of labeled examples, then a vector w in the φ-space has margin γ if: ℓ(x)(w · φ(x))/(‖w‖‖φ(x)‖) ≥ γ for every (x, ℓ(x)) ∈ S.

• A vector w in the φ-space has margin γ with respect to distribution P if the condition above holds for every point in the support of P.

• A vector w in the φ-space has error ε at margin γ if: Pr_{(x, ℓ(x)) ∼ P}[ ℓ(x)(w · φ(x))/(‖w‖‖φ(x)‖) < γ ] ≤ ε.

Such a kernel is an (ε, γ)-good kernel.
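The error-at-margin definition can be instantiated on a finite sample drawn from P (a sketch; `margin_error` and the toy φ-space points are illustrative assumptions):

```python
import numpy as np

def margin_error(w, Phi, labels, gamma):
    """Fraction of examples whose normalized margin
    l(x) * (w . phi(x)) / (||w|| * ||phi(x)||) falls below gamma."""
    margins = labels * (Phi @ w) / (np.linalg.norm(w)
                                    * np.linalg.norm(Phi, axis=1))
    return float(np.mean(margins < gamma))

# Rows of Phi are phi(x) for a small sample; labels in {+1, -1}.
Phi = np.array([[1.0, 0.2], [0.9, -0.1], [-1.0, 0.1], [-0.8, 0.3]])
labels = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])
```

Here w has zero error at margin 0.9, but a quarter of the sample violates margin 0.95, so at γ = 0.95 the error is ε = 0.25.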

Page 17

Large Margin Classifiers

• If there is a large margin, then the amount of data we need depends only on 1/γ² and is independent of the dimension of the space!

If there is a large margin and our algorithm produces a large margin classifier, then the amount of data we need depends only on 1/γ² [Bartlett & Shawe-Taylor '99].

If there is a large margin, then the Perceptron also behaves well.

Another nice justification is based on Random Projection [Arriaga & Vempala '99].

Page 18

Kernels & Large Margins

• Powerful combination in ML in recent years!

A kernel implicitly allows mapping data into a high dimensional space and performing certain operations there without paying a high price computationally.

If data indeed has a large margin linear separator in that space, then one can avoid paying a high price in terms of sample size as well.

Page 19

Kernel Methods

Offer great modularity.

• No need to change the underlying learning algorithm to accommodate a particular choice of kernel function.

• Also, we can substitute a different algorithm while maintaining the same kernel.

Page 20

Kernels, Closure Properties

• Easily create new kernels using basic ones! For instance, if K1 and K2 are kernels, then so are K1 + K2, c·K1 for any c > 0, and K1·K2.
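These closure rules can be sanity-checked numerically: combining kernels by sum, positive scaling, or product keeps the Gram matrix positive semi-definite (a sketch with illustrative helper names):

```python
import numpy as np

base    = lambda x, y: np.dot(x, y)                     # linear kernel K1
K_sum   = lambda x, y: base(x, y) + np.dot(x, y) ** 2   # K1 + K2
K_scale = lambda x, y: 3.0 * base(x, y)                 # c * K1, c > 0
K_prod  = lambda x, y: base(x, y) * np.dot(x, y) ** 2   # K1 * K2

def min_gram_eigenvalue(kernel, pts):
    """Smallest eigenvalue of the Gram matrix; >= 0 for a legal kernel."""
    G = np.array([[kernel(a, b) for b in pts] for a in pts])
    return np.linalg.eigvalsh(G).min()

rng = np.random.default_rng(1)
pts = [rng.standard_normal(3) for _ in range(8)]
```

Each combined kernel yields a Gram matrix with no (numerically) negative eigenvalues, matching the closure properties.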

Page 22

What we really care about are good kernels, not merely legal kernels!

Page 23

Good Kernels, Margins, and Low Dimensional Mappings

Designing a kernel function is much like designing a feature space.

Given a good kernel K, we can reinterpret K as defining a new set of features.

[Balcan-Blum-Vempala, MLJ '06]