Linear Models for Classification
Machine Learning
Torsten Möller and Thomas Torsney-Weir
Reading
• Chapter 4 of “Pattern Recognition and Machine Learning” by Bishop
Classification
• goal in classification is to take input x and assign it to one of K discrete classes Ck
• Ck typically disjoint (unique class membership)
• input space divided into decision regions
• boundaries: decision boundaries or decision surfaces
• linear models: decision boundaries are hyperplanes; classes that can be separated this way are linearly separable
Classification: Hand-written Digit Recognition
xi =
[image of a handwritten digit; 28 × 28 = 784 pixels]
ti = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
• Each input vector classified into one of K discrete classes
• Denote classes by Ck
• Represent input image as a vector xi ∈ R784
• We have target vector ti ∈ {0, 1}10
• Given a training set {(x1, t1), . . . , (xN , tN )}, the learning problem is to construct a “good” function y(x) from these
• y : R784 → R10
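A minimal NumPy sketch of this representation (not from the original slides; the image array and label below are made-up placeholders, with the label chosen to match the example target above):

import numpy as np

image = np.zeros((28, 28))          # hypothetical 28x28 grayscale digit image
x = image.reshape(-1)               # input vector x in R^784
assert x.shape == (784,)

K = 10                              # number of classes (digits 0-9)
label = 3                           # hypothetical true class
t = np.zeros(K)
t[label] = 1.0                      # 1-of-K ("one-hot") target vector
print(t)                            # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]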
Generalized Linear Models
• Similar to previous chapter on linear models for regression, we will use a “linear” model for classification:
y(x) = f(wTx+ w0)
• This is called a generalized linear model
• f(·) is a fixed non-linear function, e.g.
f(u) = 1 if u ≥ 0, 0 otherwise
• Decision boundary between classes will be linear function of x
• Can also apply non-linearity to x, as in ϕi(x) for regression
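A minimal NumPy sketch of such a generalized linear model with a threshold non-linearity (the weights and data are illustrative values, not learned):

import numpy as np

def f(u):
    """Threshold non-linearity: 1 if u >= 0, else 0."""
    return np.where(u >= 0, 1, 0)

def predict(X, w, w0):
    """Classify each row of X; the decision boundary w^T x + w0 = 0 is linear."""
    return f(X @ w + w0)

w = np.array([1.0, -2.0])            # hypothetical weight vector
w0 = 0.5                             # hypothetical bias
X = np.array([[1.0, 1.0], [3.0, 0.5]])
print(predict(X, w, w0))             # class labels in {0, 1}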
Outline
Discriminant Functions
Generative Models
Discriminative Models
Discriminant Functions with Two Classes
[figure: geometry of a two-class linear discriminant: the decision boundary y = 0 separates regions R1 (y > 0) and R2 (y < 0); w is orthogonal to the boundary, y(x)/||w|| is the signed distance of x from it, and −w0/||w|| is the distance of the boundary from the origin]
• Start with 2 class problem, ti ∈ {0, 1}
• Simple linear discriminant
y(x) = wTx + w0
apply threshold function to get classification
• Projection of x in w dir. is wTx / ||w||
Multiple Classes
[figure: ambiguous regions (marked “?”) that arise when combining two-class discriminants: one-versus-the-rest classifiers (left) and one-versus-one classifiers (right)]
• A linear discriminant between two classes separates with a hyperplane
• How to use this for multiple classes?
• One-versus-the-rest method: build K − 1 classifiers, between Ck and all others
• One-versus-one method: build K(K − 1)/2 classifiers, between all pairs
Multiple Classes
[figure: decision regions Ri, Rj, Rk of a multi-class linear discriminant; any point x̂ on the line segment between xA and xB inside Rk also lies in Rk]
• A solution is to build K linear functions:
yk(x) = wkTx + wk0
assign x to class arg maxk yk(x)
• Gives connected, convex decision regions:
x̂ = λxA + (1 − λ)xB
yk(x̂) = λyk(xA) + (1 − λ)yk(xB)
⇒ yk(x̂) > yj(x̂), ∀j ≠ k
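A small NumPy sketch of this arg-max rule over K linear functions (W, w0 and the data are made-up illustrative values):

import numpy as np

K, D = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(K, D))          # row k holds w_k
w0 = rng.normal(size=K)              # biases w_k0

def classify(X):
    scores = X @ W.T + w0            # N x K matrix of scores y_k(x_n)
    return np.argmax(scores, axis=1) # assign each point to the highest-scoring class

X = rng.normal(size=(5, D))
print(classify(X))                   # class index in {0, ..., K-1} per point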
Least Squares for Classification
• How do we learn the decision boundaries (wk, wk0)?
• One approach is to use least squares, similar to regression
• Find W to minimize the squared error over all examples and all components of the label vector:
E(W) = (1/2) ∑n=1..N ∑k=1..K (yk(xn) − tnk)²
• Some algebra, we get a solution using the pseudo-inverse, as in regression
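A short NumPy sketch of this least-squares approach, using a bias feature, 1-of-K targets and the pseudo-inverse (synthetic data; not from the original slides):

import numpy as np

rng = np.random.default_rng(0)
N, D, K = 90, 2, 3
X = np.vstack([rng.normal(loc=m, size=(N // K, D)) for m in ([0, 0], [3, 3], [0, 4])])
labels = np.repeat(np.arange(K), N // K)
T = np.eye(K)[labels]                        # N x K one-hot target matrix

X_aug = np.hstack([np.ones((N, 1)), X])      # prepend a constant feature to absorb w_k0
W = np.linalg.pinv(X_aug) @ T                # minimizes sum_n sum_k (y_k(x_n) - t_nk)^2

pred = np.argmax(X_aug @ W, axis=1)          # classify by the largest output
print("training accuracy:", np.mean(pred == labels))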
Problems with Least Squares
[figure: two-class data with least-squares and logistic-regression decision boundaries; the right panel adds extra points far from the boundary, which shifts the least-squares boundary]
• Looks okay... least squares decision boundary
• Similar to logistic regression decision boundary (more later)
• Gets worse by adding easy points?!
• Why?
• If target value is 1, points far from boundary will have high value, say 10; this is a large error so the boundary is moved
More Least Squares Problems
[figure: synthetic three-class data illustrating the failure of least squares described below]
• Easily separated by hyperplanes, but not found using least squares!
• We’ll address these problems later with better models
• First, a look at a different criterion for a linear discriminant
Fisher’s Linear Discriminant
• The two-class linear discriminant acts as a projection:
y = wTx ≥ −w0
followed by a threshold
• In which direction w should we project?
• One which separates classes “well”
Fisher’s Linear Discriminant
[figure: samples from two classes projected onto the line joining the class means (left, considerable overlap) and onto the Fisher direction (right, better separation)]
• A natural idea would be to project in the direction of the line connecting the class means
• However, this is problematic if the classes have variance in this direction
• Fisher criterion: maximize the ratio of inter-class separation (between) to intra-class variance (within)
Math time - FLD
• Projection yn = wTxn
• Inter-class separation is distance between class means (good):
mk = (1/Nk) ∑n∈Ck wTxn
• Intra-class variance (bad):
sk² = ∑n∈Ck (yn − mk)²
• Fisher criterion:
J(w) = (m2 − m1)² / (s1² + s2²)
maximize wrt w
Math time - FLD
J(w) = (m2 − m1)² / (s1² + s2²) = (wTSBw) / (wTSWw)
Between-class covariance: SB = (m2 − m1)(m2 − m1)T
Within-class covariance: SW = ∑n∈C1 (xn − m1)(xn − m1)T + ∑n∈C2 (xn − m2)(xn − m2)T
Lots of math:
w ∝ SW−1(m2 − m1)
If the covariance SW is isotropic, this reduces to the class mean difference vector
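A small NumPy sketch of the closed-form Fisher direction w ∝ SW−1(m2 − m1) on synthetic two-class data:

import numpy as np

rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], size=100)
X2 = rng.multivariate_normal([3, 3], [[2.0, 1.5], [1.5, 2.0]], size=100)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class covariance
w = np.linalg.solve(S_W, m2 - m1)                          # Fisher direction (up to scale)

# Projected class means are well separated relative to the projected spread
y1, y2 = X1 @ w, X2 @ w
print("projected means:", y1.mean(), y2.mean())
print("projected std devs:", y1.std(), y2.std())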
FLD Summary
• FLD is a dimensionality reduction technique (more later in the course)
• Criterion for choosing the projection based on class labels
• Still suffers from outliers (e.g. the earlier least squares example)
Perceptrons
• “Perceptron” is used to refer to many neural network structures
• The classic type is a fixed non-linear transformation of the input, one layer of adaptive weights, and a threshold:
y(x) = f(wTϕ(x))
• Developed by Rosenblatt in the 50s
• The main difference compared to the methods we’ve seen so far is the learning algorithm
Perceptron Learning
• Two class problem
• For ease of notation, we will use t = 1 for class C1 and t = −1 for class C2
• We saw that squared error was problematic
• Instead, we’d like to minimize the number of misclassified examples
• An example is mis-classified if wTϕ(xn)tn < 0
• Perceptron criterion:
EP(w) = −∑n∈M wTϕ(xn)tn
sum over mis-classified examples only
Perceptron Learning Algorithm
• Minimize the error function using stochastic gradient descent (gradient descent per example):
w(τ+1) = w(τ) − η∇EP(w) = w(τ) + ηϕ(xn)tn   (update applied only if xn is mis-classified)
• Iterate over all training examples, only change w if the example is mis-classified
• Guaranteed to converge if data are linearly separable
• Will not converge if not
• May take many iterations
• Sensitive to initialization
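A minimal NumPy sketch of this perceptron learning loop on synthetic, linearly separable data (η, the iteration cap and the data are arbitrary choices; a zero score is treated as a mistake so learning can start from w = 0):

import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
               rng.normal([3, 3], 0.5, size=(50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])     # targets in {-1, +1}
Phi = np.hstack([np.ones((100, 1)), X])        # phi(x) = x plus a constant bias feature

w = np.zeros(Phi.shape[1])
eta = 1.0
for epoch in range(100):
    mistakes = 0
    for phi_n, t_n in zip(Phi, t):
        if w @ phi_n * t_n <= 0:               # mis-classified example
            w = w + eta * phi_n * t_n          # perceptron update
            mistakes += 1
    if mistakes == 0:                          # converged (data are separable)
        break
print("epochs:", epoch + 1, "final weights:", w)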
Perceptron Learning Illustration
[figure: four snapshots of the perceptron learning algorithm updating the decision boundary until the two classes are separated]
• Note there are many hyperplanes with 0 error
• Support vector machines (in a few weeks) have a nice way of choosing one
Limitations of Perceptrons
• Perceptrons can only solve linearly separable problems in feature space
• Same as the other models in this chapter
• Canonical example of a non-separable problem is X-OR
• Real datasets can look like this too
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc.
Represents a linear separator in input space:
∑j Wj xj > 0 or W · x > 0
[figure: input-space plots of (a) I1 and I2, (b) I1 or I2, (c) I1 xor I2; the xor case cannot be separated by a line]
Outline
Discriminant Functions
Generative Models
Discriminative Models
Generative vs. Discriminative
• Generative models
  • Can generate synthetic example data
  • Perhaps accurate classification is equivalent to accurate synthesis
  • e.g. vision and graphics
  • Tend to have more parameters
  • Require a good model of the class distributions
• Discriminative models
  • Only usable for classification
  • Don’t solve a harder problem than you need to
  • Tend to have fewer parameters
  • Require a good model of the decision boundary
Probabilistic Generative Models
• Up to now we’ve looked at learning classification by choosing parameters to minimize an error function
• We’ll now develop a probabilistic approach
• With 2 classes, C1 and C2:
p(C1|x) = p(x|C1)p(C1) / p(x)   (Bayes’ rule)
p(C1|x) = p(x|C1)p(C1) / (p(x, C1) + p(x, C2))   (sum rule)
p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))   (product rule)
• In generative models we specify the distribution p(x|Ck) which generates the data for each class
Probabilistic Generative Models - Example
• Let’s say we observe x, which is the current temperature
• Determine if we are in Vienna (C1) or Honolulu (C2)
• Generative model:
p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))
• p(x|C1) is the distribution over typical temperatures in Vienna, e.g. p(x|C1) = N(x; 10, 5)
• p(x|C2) is the distribution over typical temperatures in Honolulu, e.g. p(x|C2) = N(x; 25, 5)
• Class priors p(C1) = 0.1, p(C2) = 0.9
• p(C1|x = 15) = (0.0484 · 0.1) / (0.0484 · 0.1 + 0.0108 · 0.9) ≈ 0.33
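A tiny Python sketch of this example, assuming N(x; µ, σ) denotes a Gaussian with mean µ and standard deviation σ:

import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 15.0
prior = {"Vienna": 0.1, "Honolulu": 0.9}
lik = {"Vienna": gauss(x, 10, 5), "Honolulu": gauss(x, 25, 5)}

evidence = sum(lik[c] * prior[c] for c in prior)           # p(x)
posterior_vienna = lik["Vienna"] * prior["Vienna"] / evidence
print(round(posterior_vienna, 2))                          # approximately 0.33, as on the slide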
Generalized Linear Models
• We can write the classifier in another form
p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2)) = 1 / (1 + exp(−a)) ≡ σ(a)
where a = ln [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ]
• This looks like gratuitous math, but if a takes a simple form this is another generalized linear model, which we have been studying
• Of course, we will see how such a simple form a = wTx + w0 arises naturally
Logistic Sigmoid
[figure: plot of the logistic sigmoid σ(a), rising from 0 to 1 around a = 0]
• The function σ(a) = 1 / (1 + exp(−a)) is known as the logistic sigmoid
• It squashes the real axis down to [0, 1]
• It is continuous and differentiable
• It avoids the problems encountered with the “too correct” least-squares error fitting (later)
Multi-class Extension
• There is a generalization of the logistic sigmoid to K > 2 classes:
p(Ck|x) = p(x|Ck)p(Ck) / ∑j p(x|Cj)p(Cj) = exp(ak) / ∑j exp(aj)
where ak = ln p(x|Ck)p(Ck)
• a.k.a. softmax function
• If some ak ≫ aj, p(Ck|x) goes to 1
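A small NumPy sketch of the softmax; subtracting max(a) before exponentiating does not change the result but avoids overflow:

import numpy as np

def softmax(a):
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())          # shift for numerical stability
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))      # posteriors summing to 1
print(softmax([100.0, 1.0, 0.1]))    # one a_k >> the rest: that posterior goes to 1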
Gaussian Class-Conditional Densities
• Back to that a in the logistic sigmoid for 2 classes
• Let’s assume the class-conditional densities p(x|Ck) are Gaussians, and have the same covariance matrix Σ:
p(x|Ck) = (2π)−D/2 |Σ|−1/2 exp{ −(1/2)(x − µk)TΣ−1(x − µk) }
• a takes a simple form:
a = ln [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ] = wTx + w0
• Note that the quadratic terms xTΣ−1x cancel
Maximum Likelihood Learning
• We can fit the parameters to this model using maximum likelihood
• Parameters are µ1, µ2, Σ−1, p(C1) ≡ π, p(C2) ≡ 1 − π
• Refer to as θ
• For a datapoint xn from class C1 (tn = 1):
p(xn, C1) = p(C1)p(xn|C1) = πN(xn|µ1, Σ)
• For a datapoint xn from class C2 (tn = 0):
p(xn, C2) = p(C2)p(xn|C2) = (1 − π)N(xn|µ2, Σ)
Maximum Likelihood Learning
• The likelihood of the training data is:
p(t|π, µ1, µ2, Σ) = ∏n=1..N [πN(xn|µ1, Σ)]^tn [(1 − π)N(xn|µ2, Σ)]^(1−tn)
• As usual, ln is our friend:
ℓ(t; θ) = ∑n=1..N { tn ln π + (1 − tn) ln(1 − π) }  [terms involving π]  +  { tn ln N1 + (1 − tn) ln N2 }  [terms involving µ1, µ2, Σ]
where Nk is shorthand for N(xn|µk, Σ)
• Maximize for each group of parameters separately
Maximum Likelihood Learning - Class Priors
• Maximization with respect to the class prior parameter π is straightforward:
∂ℓ(t; θ)/∂π = ∑n=1..N [ tn/π − (1 − tn)/(1 − π) ]
setting this to zero gives π = N1 / (N1 + N2)
• N1 and N2 are the number of training points in each class
• Prior is simply the fraction of points in each class
Maximum Likelihood Learning - Gaussian Parameters
• The other parameters can also be found in the same fashion
• Class means:
µ1 = (1/N1) ∑n=1..N tn xn
µ2 = (1/N2) ∑n=1..N (1 − tn) xn
• Means of training examples from each class
• Shared covariance matrix:
Σ = (N1/N) · (1/N1) ∑n∈C1 (xn − µ1)(xn − µ1)T + (N2/N) · (1/N2) ∑n∈C2 (xn − µ2)(xn − µ2)T
• Weighted average of class covariances
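A NumPy sketch putting the pieces together on synthetic data: fit π, the class means and the shared covariance by maximum likelihood, then classify through the logistic sigmoid of a = wTx + w0 (the expressions for w and w0 below come from expanding the log-odds for shared-covariance Gaussians, where the quadratic terms cancel):

import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=30)   # class C1 (t = 1)
X2 = rng.multivariate_normal([2, 2], np.eye(2), size=70)   # class C2 (t = 0)

N1, N2 = len(X1), len(X2)
pi = N1 / (N1 + N2)                           # prior p(C1): fraction of points in C1
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # class means
S1 = (X1 - mu1).T @ (X1 - mu1) / N1
S2 = (X2 - mu2).T @ (X2 - mu2) / N2
Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)       # weighted average of class covariances

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(pi / (1 - pi)))

def p_c1(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))             # logistic sigmoid of a

print(p_c1(np.array([0.0, 0.0])), p_c1(np.array([2.0, 2.0])))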
Probabilistic Generative Models Summary
• Fitting a Gaussian using the ML criterion is sensitive to outliers
• The simple linear form for a in the logistic sigmoid occurs for more than just Gaussian distributions
• Arises for any distribution in the exponential family, a large class of distributions
Outline
Discriminant Functions
Generative Models
Discriminative Models
Probabilistic Discriminative Models
• Generative model made assumptions about the form of the class-conditional distributions (e.g. Gaussian)
• Resulted in a logistic sigmoid of a linear function of x
• Discriminative model - explicitly use the functional form
p(C1|x) = 1 / (1 + exp(−(wTx + w0)))
and find w directly
• For the generative model we had 2M + M(M + 1)/2 + 1 parameters
• M is the dimensionality of x
• Discriminative model will have M + 1 parameters
Maximum Likelihood Learning - Discriminative Model
• As usual we can use the maximum likelihood criterion for learning
p(t|w) = ∏n=1..N yn^tn (1 − yn)^(1−tn), where yn = p(C1|xn)
• Taking ln and the derivative gives:
∇ℓ(w) = ∑n=1..N (tn − yn) xn
• This time there is no closed-form solution, since yn = σ(wTxn)
• Could use (stochastic) gradient descent
• But there’s a better iterative technique
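A small NumPy sketch of maximum likelihood logistic regression by batch gradient ascent, using the gradient ∑n (tn − yn)xn above (synthetic data; the bias is absorbed by a constant feature and the step size is arbitrary):

import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([3, 3], 1.0, size=(50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])     # t = 1 for class C1
X_aug = np.hstack([np.ones((100, 1)), X])      # constant feature absorbs w0

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(X_aug.shape[1])
eta = 0.1
for _ in range(1000):
    y = sigmoid(X_aug @ w)                     # y_n = p(C1 | x_n)
    grad = X_aug.T @ (t - y)                   # gradient of the log-likelihood
    w = w + eta * grad / len(t)                # ascend (maximize the likelihood)

print("weights:", w)
print("training accuracy:", np.mean((sigmoid(X_aug @ w) > 0.5) == t))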
Iterative Reweighted Least Squares
• Iterative reweighted least squares (IRLS) is a descent method
• As in gradient descent, start with an initial guess and improve it
• Gradient descent - take a step (how large?) in the gradient direction
• IRLS is a special case of a Newton-Raphson method
• Approximate the function using a second-order Taylor expansion:
f̂(w + v) = f(w) + ∇f(w)Tv + (1/2) vT∇2f(w) v
• Closed-form solution to minimize this is straight-forward: it is quadratic, so the derivatives are linear
• In IRLS this second-order Taylor expansion ends up being a weighted least-squares problem, as in the regression case from last week
• Hence the name IRLS
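A NumPy sketch of the resulting IRLS updates for logistic regression, where each Newton step solves a weighted least-squares problem with weights R = diag(yn(1 − yn)) (update as in Bishop, Section 4.3.3; data are synthetic and overlap slightly so the maximum likelihood solution exists):

import numpy as np

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])          # design matrix with bias feature

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(Phi.shape[1])
for _ in range(10):                              # IRLS usually converges in a few steps
    y = sigmoid(Phi @ w)
    R = y * (1 - y)                              # diagonal of the weighting matrix R
    H = Phi.T @ (Phi * R[:, None])               # Hessian: Phi^T R Phi
    grad = Phi.T @ (y - t)                       # gradient of the negative log-likelihood
    w = w - np.linalg.solve(H, grad)             # Newton step

print("IRLS weights:", w)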
Newton-Raphson
[excerpt, Section 9.5 of the reference below: the Newton step ∆xnt = −∇2f(x)−1∇f(x) is a descent direction when ∇2f(x) is positive definite, and is exactly the step that minimizes the second-order Taylor approximation f̂(x + v) = f(x) + ∇f(x)Tv + (1/2) vT∇2f(x) v; if f is (nearly) quadratic, x + ∆xnt is (a very good estimate of) the minimizer]
• Figure from Boyd and Vandenberghe, Convex Optimization
• Excellent reference, free for download online: http://www.stanford.edu/~boyd/cvxbook/
Conclusion
• Readings: Ch. 4.1.1-4.1.4, 4.1.7, 4.2.1-4.2.2, 4.3.1-4.3.3
• Generalized linear models y(x) = f(wTx + w0)
• Threshold/max function for f(·)
• Minimize with least squares
• Fisher criterion - class separation
• Perceptron criterion - mis-classified examples
• Probabilistic models: logistic sigmoid / softmax for f(·)
• Generative model - assume class-conditional densities in the exponential family; obtain sigmoid
• Discriminative model - directly model the posterior using the sigmoid (a.k.a. logistic regression, though classification)
• Can learn either using maximum likelihood
• All of these models are limited to linear decision boundaries in feature space