Some Mathematical Tools for Machine Learning (part 1)€¦ · Some Mathematical Tools for Machine Learning Chris Burges Microsoft Research February 13th, 2009. Contents –Part 1

Some Mathematical Tools

for Machine Learning

Chris Burges

Microsoft Research

February 13th, 2009

Contents – Part 1

1.Lagrange Multipliers: An Overview

2.Basic Concepts in Functional Analysis

3.Some Notes on Matrix Analysis

4.Convex Optimization: A Brief Tour

A Brief Diversion

Leslie Lamport: “How to Write a Proof”: DEC tech report, Feb 14, 1993

Abstract: A method of writing proofs is proposed that makes it much harder

to prove things that are not true. The method, based on hierarchical

structuring, is simple and practical.

“Anecdotal evidence suggests that as many as a third of all papers

published in mathematical journals contain mistakes – not just minor errors,

but incorrect theorems and proofs.”

Lagrange Multipliers:

An Overview, and Some Examples

Lagrange the Mathematician

• Born 1736 in Turin, one of two of 11 to survive infancy

• “Responsible for much fine mathematics published under the names

of other mathematicians”

• Believed that a mathematician has not thoroughly understood his

own work till he has made it so clear that he can go out and explain

it to the first person he meets on the street

• Worked on mechanics, calculus, the calculus of variations

astronomy, probability, group theory, and number theory

• At least partly responsible for the choice of base 10 for the metric

system, rather than 12

• Supported by Euler and d’Alembert, financed by Frederick and Louis

XIV, close to Lavoisier, Marie Antoinette

An indirect approach can be easier( ) ( ) 1,

0,

1,

Example: Minimize subject to

If could rotate to coordinate system and rescale so that

constraints take the form substitute with a parameterization

that encodes the c

nf x c x x Ax x

A

y y

1

1 1 2 1

2 1 2 1

:

sin sin sin

sin sin cos

onstraints that lives on

solve, and then apply the inverse mapping. But this can get

very complicated!

It is often not even possible to parameterize co

n

n

n

x

y

y

S

nstraints (for

example, polynomial constraints in several variables)

One equality constraint

2( ) ( ,) 0 .Minimize subject to f x xc x

0f 1f

2f

3f

( ) 0c x

f

c

One equality constraint, cont.

,

( ) 0

f c

f c

L f c

Hence at the optimum, we must have or:

0f 1f

2f

3f

( ) 0c x

f

c

Multiple equality constraints

( ) 0, 1, , .

( ) ( ).

,

constraints:

Define gradients:

Let be the subspace spanned by the and let be its

orthogonal complement.

Suppose that at some point, all constraints hold, and

i

i i

i

n c x i n

g x c x

S g S

0

0, ( ) : ( ) 0

Then can increase (or decrease) by moving along

Hence or: i i i i

i i

f

f f

f f c x L f c

Puzzle: why not multiple Lagrangians?

One inequality constraint

*

*

( ) ( ) 0.

( ) 0.

( ) ', ( ) 0.)

Find that minimizes subject to

What's new? At the solution, it's possible that

(Simple but not guaranteed: solve 'minimize check that

x f x c x

c x

f x c x

0f

1f

2f

3f

( ) 0c x

f

c

( ) 0c x

( ) 0c x

: ( ) 0, 0f c f c

Multiple inequality constraints

( ) 0, 0i i ii

f c

Suppose that at the solution

Then removing makes no difference, and we must drop

from the sum in

Equivalently we can set

Hence, always impose

A simple example

21 11 2 1 2

2 2

1 1 2 1 2 1 2 2

2 2 2

1 2 1 2 1 1 2 2

1 1 2 1 1

:

: , ,

( , ) 1 0, ( , ) 1

( , ) (1 ) (1 )

0 ( )

Extremize the distance between two points on

Embed in extremize

subject to

n

n n

i i

i

f x x x x

c x x x c x x x

L x x f c x x x x

L x x x

n

S

R

2 2 1 2 2

2 1 1 1 2 2

0

0 ( ) 0

(1 ) (1 )

2 0.

,

antipodal or equal: or i

L x x x

x x x x

Another simple example

c

Given a parallelogram whose sidelengths you can choose but

whose perimeter is fixed - what shape has the largest area?

h

b

2( )

( , ) (2( ) )

0

bh b h c

L b h bh b h c

L b h

Maximize subject to

Again, not explicitly needed: hence “method of

undetermined multipliers”

Simple exercises

2

2

1.

1 0 0

Puzzle: what coefficients maximize a convex sum of fixed numbers?

Puzzle: minimize subject to

Puzzle: maximize subject to and (hint: use )

i i

i i

i i i i i

i i

x x

x x x x

Resource Allocation

Distortion

Rate

inNumber of widgets

( )

Cost

i ic n

Fiber has bit errors per second, and sends bits total, per second. We wish to maximize the bit rate at a fixed distortion rate:

Puzzle: Does this work for economics, too?

A variational problem

An isoperimetric problem: find the curve of fixed length and

fixed endpoints {a,b} that encloses maximum area above [a,b].

x=1x=0

y

1 12 1/ 2

0 0

1 12 1/ 2

0 0

1 12 1/ 2

0 0

12 1/ 2

0

12 3/ 2

0

2 3/ 2

(1 ' )

(1 ' )

(1 ' ) ' '

1 '(1 ' )

1 ''(1 ' )

1 ''(1 ' ) 1 0

Area , length y dx y dx

L y dx y dx

L y dx y y y dx

dy y y dx

dx

y y y dx

y y

…straight line, or arc of circle.

Which univariate distribution has max entropy?

( ) log ( )f x f x dx

Minimize

1

2

2

( ) 0 ,

( ) 1,

( ) ,

( )

f x x

f x dx

x f x dx c

x f x dx c

subject to:

( )( )

( )

functional derivative

g xx y

g y

Need :

2

1 2( ) log ( ) (1 ( ) ) ( ) ( )L f x f x dx f x dx x f x dx x f x dx

2

1 20, log ( ) log( ) 0( )

Lx f y e y y

f y

Impose integrate w.r.t

→ f must be Gaussian!

Which univariate distribution has max

entropy?

Puzzle: I thought the uniform distribution has max entropy.

What’s going on?

Puzzle: What distribution do you get if you fix the mean,

but not the variance?

Puzzle: What distribution do you get if you fix only the

function’s support?

Max Entropy for Discrete Distribn + Linear

Constraints

: 1, 0

Have discrete distribution

Suppose you also have known linear constraints:

but you are maximally uncertain about everything else. So want max

entropy distribution subject to

i i ii

ij j ji

P P P

a P C

log 1

0 exp( 1 ) (1/ )exp( )

these constraints.

i i j ij j j i i ii j i i i

k k j kj j kjj j

L P P a P C P P

L P a Z a

→ logistic regression!

Are Lagrange Multipliers Really That Common?

Yes.

• Most flavors of Support Vector Machines

• Principal Component Analysis

• Canonical Correlation Analysis

• Locally Linear Embedding

• Laplacian Eigenmaps

• …

For example:

Basic Concepts in

Functional Analysis:

A Brief Tour

of Hilbert spaces, Norms,

and All That

What is a Field?

{ , , }: , { , } F F a set operations

{ , } is an Abelian group with identity denoted by 0F

{ 0, }F is an Abelian group with identity denoted by 1

( ) { , , }x y z x y x z x y z F

A field generalizes the notion of arithmetic on reals.

Field : Examples

F

F

F

R

Q

C

With + meaning addition and meaning multiplication,

the reals:

the rationals:

the complex numbers:

2

{0

{

,1

: 0

}

} { , }?

2

+

Puzzle 1: For , why doesn't using "

with + meaning "X

OR" for + work?

Pu

What's the s

zzle 2: Is

OR" a

mallest fiel

nd meaning "AND"

a field und

?

er

d

x x

Z

Z

How Many Fields Are There?

p

2

Actually infinitely many - e.g. , prime.

However, can define an 'ordering;' for fields (like <, =, > for ).

, can be ordered; and cannot; in fact all finite fields

cannot be ordered.

For orde

pZ

R

R Q C Z

.

red fields, 'supremum' and 'infimum' can be defined.

An ordered field is 'complete' iff every nonempty subset of

that has an upper bound in also has a supremum in

is not complete, but is. In

F

F F

Q R fact: every complete, ordered

field is isomorphic to !R

However, can define an ordering for some fields.

What is a Vector Space?

,

( , )

( , )

;

( ) (

A vector space is a nonempty set a field and operations 'addition'

( from ) and 'multiplication by a scalar'

( from into ) such that:

(a)

(b)

E F

x y x y E E E

x x F E E

x y y x

x y z z y

);

, , ;

) ( ) ;

) ;

( ) ;

1 .

(c) For every there exists such that

(d) (

(e) (

(f)

(g)

z

x y E z E x y z

x x

x x x

x y x y

x x

nA vector space generalizes the notion of vectors in nR

Vector Spaces: Field Matters!

{ , } { , } );

{ , } { , } );

{ , } { , } 2 );

(dimension

(dimension

(dimension

N

N

N

E F N

E F N

E F N

{ , }, 1

{ , }

:

For the vector space vectors and are linearly

independent.

For

'Linear

the vec

dependence' depends on field

tor space they are not.

i

F

Vector Spaces: More Examples

( )( ) ( ) ( );

( )( ) ( ).

f g x f x g x

f x f x

(2) (complex by matrices), over the field ;mnM m n

(1) Functions, whose range is a vector space, also

themselves form a vector space:

1

1 1

1,

( ),

(3) : for the space of all infinite sequences of

complex numbers such that

p

p

n

n

n n n

n n

l p

z

x y x y x x

What is an Inner Product?

, :

, , ,

, 0

, 0 0

, , ,

, ,

, ,

Let be a vector space over or . A function

is an if for all

(a)

(b)

(c)

(d) for all scalars

inner produ

)

ct

(e

V V V F

x y z V

x x

x x x

x y z x z y z

cx y c x y c F

x y y x

The inner product generalizes the notion of dot product

Inner Product: Examples

2 1

{ , } ,

,

, ( )

,

(1) Vector space with , defined by

(2) Vector space over with , defined by

(3) Vector space of matrices over with

( )

Ni ii

i ii

T

pm

x y x y

l x y x y

X Y Trace X Y

X Y M

Inner Product: Trace

2

2

( ) 0

( ) 0

(( ) , ) ( )

, 0

, 0 0

, , , :

,

(

,

)

(a)

(b)

(c)

(d) f

Ti

i

Ti

i

T T T

X X

X X x

x y

Tr X X X

Tr X X X

Tr Xz x z Yy z

cx y c

Z Tr X Z Tr Y Z

x y

( ) ( )

( ) ( ), ,

or all scalars

(e)

T T

T T

Tr X Y Tr X Y

Tr X Y Tr Yx X

c F

y y x

2

1

1/ 2

| , | , , ,

| , |cos : 0

( , , )

Cauchy Schartz inequality holds for any inner product , :

for all

"Angle" between two vectors can be defined for any inner product:

x y x x y y x y V

x y

x x y y

2

1 2 1 0

3 4 1 2E.g. the angle between and is 78.4 degrees !

Inner Product is General

What is a Norm?

:

, ,

0

0 0

| |

Let be a vector space over or . A function is

a norm if for all

(a)

(b)

(c) for all scalars

(d)

If condition (b) is droppe

Wh

d, it's a 'seminorm'.

at is the

V V

x y V

x

x x

cx c x c F

x y x y

simplest seminorm? : 0Ans x

Seminorm splits the space

0

1 0 1

1.

{ : 0}

,

0.

If is a seminorm on , then is a subspace

(called the null space).

If is a subspace of such that then is a norm

on

Define The corresponding cosets form a vec

V V v V v

V V V V

V

x y x y

5 , ' , (3,2,1,0,0),

(0,0,0,1,0) (0,0,0,0,1).

tor

space, and on that sp

Example seminorm on : null space

spanned by

ace, is a no

and

rm.

x Dx D diag

2 2 21 2 1 2

1/

( , , , )

{ } . .n=1

for is the Euclidean

norm.

Let The function defined by is a norm in

This works for finite sums too: define the norm on as:

nn n

pp

n p n p

np

x x x x x x x x

z z l z z l

l

x

1/

1 1 2| | | | | |

max({ })

{ }.

i=1

The norm is the 'Manhattan' norm

The norm is the 'max' norm

A normed vector space is a pair vector space, norm

np

p

i

n

i

x

l x x x x

l x x

Norm Generalizes Length

What is convexity?

1 2 1

1 2( ) ( ).

A function is if, between and , it lies below the

chord joi

convex

strictly convexning and It is if it lies strictly below.

x x x

f x f x

1 2 1 2((1 ) ) (1 ) ( ) ( )

0 1

f x x f x f x

.

A set is if, for all and , all points on the line

joining and

convex

lie in

S a S b S

a b S

1x

2x

When is a sphere a square?

2

1

: the 1-sphere

: the

Green

1-sphere

: the

Blu

Red

1-spher

e

e

l

l

l

Interesting... each ball looks convex

When is a sphere a cross?

0

0

0

0 0,

0

| | 1.0

In this context, never. The " norm" (count the number of

nonzero elements) is not a norm (for over or ).

if

if

unless

n

l

v

v

v

00? lim ?

p

p iix x

All norms are convex functions

1 2 1 2

1 2

1 2

(1 ) 1

| 1 | | |

1

x x x x

x x

x x

1 2

1 2 1 2

2

( ) ( ) 0

: ( ) 0 . ( 0 ( 0,

((1 ) ) (1 ) ( ) ( ) 0

1,

If is convex, then defines a convex set:

let Then if ) and )

the unit ball, for any norm is a convex region.

f x f x

R x f x f x f x

f x x f x f x

x

Open, Closed, Compact

, :

: , 0 . . ( , )

:

: 0 ( ,0)

Given a norm and a set in a vector space

n

n

i

n S V

Open x S s t B x S

Closed complement of open

Bounded r suchthat S B r

Compact : Every sequence x in S contains convergent subsequence

with limit in S

O

1 2, , , , , ,

Compact sets are closed and bounded.

For finite dimensional normed vector spaces, every closed bounded

set is compa

NN i i

R : Every cover has a finite sub - cover :

S S S S S N finite suchthat S S

,

ct.

If for a normed vector space the unit ball is compact, then has

finite dimension.

V V

Making your own norm

In fact, in real finite dimensional spaces, any symmetric,

compact, convex region centered on the origin defines

a norm (as the unit ball for that norm):

0Rescuing : the Hamming norml

2

2

0

.

:

( )

( ) ( )

1

.

0

Consider n-tuples taking values in These form a vector

space over the field for example,

Now define the norm to be the number of nonzero elements

of Is

x y x y

x x

x x

l x N

x

Z

Z

0

0

0 0

0 0 0

0

0 0

| |

this a norm?

(a)

(b)

(c)

(d)

x

x x

cx c x

x y x y

Hamming norm, cont.

2

2

1)

0

Puzzle: ( - how can this be correct?

Puzzle: What is the subtraction operation, in ?

Puzzle: What is the Hamming distance ?

N

x y

Z

Z

When does a norm come from an inner product?

2 2

,

1,

4

Every inner product defines a norm:

Does every norm define an inner product? If so, for real vector spaces,

No! A necessary and sufficient condition for a norm to correspond t

x x x

x y x y x y

2 2 2 2

o an

inner product is the paralellogram identity:

1

2

(Jordan and von Neumann, 1935)

x y x y x y

Lp norms, inner products

1/

2 /2

2 // 2

2 // 2

1 2 2 1 1 2 2 1 1 2 2

| | , | |

| | ? ,

, | | : , ,

, | ( ) | , ,

2.

1

where is integrable

E.g. try then but

unless

p

pp p

L

pp

pp

pp

f f f

f f f f

f g fg f g f g

f f g f f g f g f g

p

Lp norms, inner products cont.

2 2

1

)

1,

4

Maybe we could find some other inner product that works for

all ?

No: if a norm is derivable from an inner product (over , then

. Choose two functions:

p

x y x y x y

1

1

1

1

( ) 0 : 0

( ) 1 : 0 1

( ) 2 : 1 2

( ) 0 : 2

f x x

f x x

f x x

f x x

2

2

2

0 : 1

1 : 1 2

0 : 2

f x

f x

f x

1 2

1

Lp norms, inner products cont.

We’d like:

1 2 3 4 5 6 7 8 9 101.5

1.6

1.7

1.8

1.9

2

2.1

2.2

2.3

2.4

2.5

p

norm on Rn has no inner product

2

2 2 2 2

2 2 2 22 2

: [1,0], [0,2]

1?

2

1 1(2 2 ) 4 1 4 5

2 2

[1,0,0, ,0], [0,2,0, ,0]

Example in

Use parallelogram test:

Extend to : n

x y

x y x y x y

x y x y x y

x y

R

R

What is a Cauchy Sequence?

{ }

, .

Cauchy sequenc

A sequence of vectors in a normed vector space is called

a if for every >0 there exists a number

such that for all

e

n

m n

x

M

x x m n M

1x2x 3x

4x 5x 6x

Key idea: the Cauchy sequence

allows us to define notions of

convergence without ever

leaving the space

Cauchy sequences, cont.

Every convergent sequence is a Cauchy sequence.

Not every Cauchy sequence is a convergent sequence.

Note Cauchy sequence requires choice of norm!

[0,1]

2 3

(0,1) [0,1].

max | ( ) | .

( ) 12! 3! !

1,2, { }

(0,1)

: the space of polynomials on Choose the

norm Define

for . Then is a Cauchy sequence but it does

not converge in becaus

n

n

n

l

P P x

x x xP x x

n

n P

P

P e its limit is not a polynomial.

The notion of completeness

1 .

A normed vector space is called if every Cauchy

sequence in converges to an element of .

complete

Banach spaA complete normed space is called a .

with norm is complete, for all

The se

ce

q

np

E

E E

l p

2 1

2

1

[ , ]

[ , ]

uence space is complete, for all .

with norm is complete.

with norm or norm is not complete.

"All square integrable functions with norm" is complete.

A complete space "con

pl p

C a b L

C a b L L

L

tains all its limit points."

It is always possible to 'complete' a non-complete space.

Hilbert and Banach spaces

A Hilbert space is a complete inner product space.

Normed Inner Product

Banach Hilbert

[ , ] : ,C a b x y xy

[0,1]Polynomials on with max norm

Hilbert and Banach spaces, cont.

Normed

Banach

Hilbert

2[ , ] with

norm

C a b L

Polynomials on [0,1], max norm

[ , ] with normC a b L2pL

2pl

Finite dim. inner product spaces

2, all square

integrable fns

L

2l

How many kinds of Hilbert spaces are there?

1 2 1,2

1

: ,

) ( ) ( ) ,linear mappin

A mapping where are vector spaces, is called

a if ( and

all scalar .

g

s ,

T E E E

T x y T x T y x y E

1 2

1 2

1

( ), ( ) ,

, .

A Hilbert space is said to be to a Hilbert space

if there exists a 1-1 linear mapping from to such that

for every Such a

isomorphic

Hilbert spac is called a e i

H H

T H H

T x T y x y

x y H T

1 2.of

somo

onto

rp ism

h

H H

How many kinds of Hilbert spaces are there?

2 2[ , ]E.g. , are separable Hilbert spaces.l L a b

2.

, .

Let be separable:

If is infinite dimensional, then it is isomorphic to

If has dimension then it is is

Answe

omorp

r: 2..

hic to

.

N

H

H l

H N

A Hilbert space H is called separable if and only if it admits a

countable orthonormal basis. (So, all finite dimensional

Hilbert spaces are separable).

The Riesz Representation Theorem

1sup | ( ) |

. . ( ) )

The of a linear functional is defined:

A linear functional on a normed vector space is bounded

( if and only if its operator

norm

operato

is finite

r

.

norm

x

f

f f x

K s t f x K x x V

{ , }

: .

A linear on a normed vector space is

a linear mappi

functi

ng

onal V F

V F

The Riesz Representation Theorem

0 0

0

.

( ) ,

, .

Let be a bounded linear functional on a Hilbert space

Then there exists exactly one such that

for all and in fact

f H

x H f x x x

x H f x

2

2

[ , ], .

( ) ( )

? ( ) ( ) ( ) ( ) ( ) ( ) ( )

? ( ) ( ) ( ) ( )1

( )

Example: Define a linear functional by

b

a

b b b

a a a

b b b

a a a

b

a

H L a b a b

f x x t dt

Linear f x y x t y t dt x t dt y t dt f x f y

Bounded f x x t dt x t dt x t dt

x t dt

1 12 2

1

b

a

dt b a x

Riesz Representation Theorem, cont.

0 0

0

1/ 22

0

2

1 1

1: ,1 ( ).1 ( ) ( )

?

1

sup | ( ) | sup | ( ) | sup | ( ) | : ( ) 1

1/ ,

Can we find ? Try

Check:

The sup is found by choosing

b b

a a

b

a

b b b

x x

a a a

x x x x t dt x t dt f x

f x

x b a

f f x x t dt

x b a

x t dt x t dt

a

, 0 otherwise.

f

x

b a

t b

A Brief Look Ahead

The Riesz representation theorem can be used to show that any Hilbert

space for which the evaluation functional is continuous, is a Reproducing

Kernel Hilbert Space: there exists K such that

In particular:

Representer Theorem (Kimeldorf and Wahba, 1971; Schölkopf and

Smola, 2002): Let be a strictly monotonic increasing function

and let be an arbitrary loss function. Then each minimizer of

the “regularized risk”

admits a representation of the form

What is a metric space?

( , )

( , ) ,

( , ) ( , ) ;

For any set , let be a function (with range in ) defined on

the set of all ordered pairs of members of satisfying:

(i) is a finite real number for every pair of

(i

E x y

E E x y E

x y x y E E

( , ) 0 ;

( , ) ( , ) ( , ), { , , } .

:

i)

(iii)

Such a function is called a on ; a set with

metric is called a . Different choices of m

metric

metric sp etric on

give different me

ace

t

x y x y

y z x y x z x y z E

E E E E

E

ric spaces.

( , ) 0?Puzzle: What about x y

( , ) ( , )?Puzzle: How about x y y x

( , ) ( , ) ( , )?Puzzle: Where's the triangle inequality: x y x z z y

Metric versus norm

( , )Every normed vector space is a metric space: define

But metric spaces are much more general:

x y x y

Normed spaces

Metric spaces

2.2 1.0

3.4

4.5 3.1

6.0

1.1 0.5

5.9

1.1 6.4

2.3

9.6

19.5d

( ( )) ( )Metrics extend "continuity": f B x B f x

Some metrics on function spaces

12

[ , ]

1

2

2

:[ , ]

, , ( , ) sup ( ) ( )

[ , ] : ( , ) ( ) ( ) ,

( , ) ( ) ( ( , ) ?) ,

Let be the set of all bounded functions . For two

points is a metric.

Let be

Sup

x a b

b

ap

a

b

A f a b

f g A f g f x g x

A C a b f g f x g x dx

f g f x g x dx f g

R.

( ) ( ), [ , ]

[ , ] :

( , ) sup ( ) ( ) , '( ) '( ) , , ( ) ( )

pose instead thenr

r rr x a b

A C a b

f g f x g x f x g x f x g x

Topological Spaces

, :

,

,

A is a non-empty set, a fixed collection

of subsets of satisfying

(1) ,

(2) Intersection of any two sets in is in ,

(3) Unio

topological space

n of any collection of sets in is

T A A

A

A

S S a

S

S i S

S

1 1

,

.

,

in .

is called a topology for and the members of are called the

of

Topological spaces are more general than metric spaces!

Topological spaces extend "continuity": Given an

open

sets

A

T

T A 1

S.

S S

S 2 2

11 2 2 1

,

: , ( )

d and a

map is "continuous" if

T A

A A U A U A

2S

..

Continuity, convergence, connectedness… without distance!

Putting Spaces in their Places…

Hilbert spaces

Banach spaces

Normed vector spaces

Metric spaces

Topological spaces

, : the "indiscrete" topology on A A S

~ The Middle ~

Thanks!

Documents

Some Mathematical Tools for Machine Learning (part 1)€¦ · Some Mathematical Tools for Machine Learning Chris Burges Microsoft Research February 13th, 2009. Contents –Part 1