Linear Models for Classification
Machine Learning
Torsten Möller and Thomas Torsney-Weir
Reading
• Chapter 4 of “Pattern Recognition and Machine Learning” by Bishop
Classification
• goal in classification is to take input x and assign it to one of K discrete classes Ck
• Ck typically disjoint (unique class membership)
• input space divided into decision regions
• boundaries: decision boundaries or decision surfaces
• linear models: decision boundaries are hyperplanes; classes that can be separated this way are linearly separable
Classification: Hand-written Digit Recognition
xi =
[image of a handwritten digit; 28 × 28 = 784 pixels]
ti = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
• Each input vector classified into one of K discrete classes
• Denote classes by Ck
• Represent input image as a vector xi ∈ R784
• We have target vector ti ∈ {0, 1}10
• Given a training set {(x1, t1), . . . , (xN , tN )}, the learning problem is to construct a “good” function y(x) from these
• y : R784 → R10
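A minimal NumPy sketch of this representation (not from the original slides; the image array and label below are made-up placeholders, with the label chosen to match the example target above):

import numpy as np

image = np.zeros((28, 28))          # hypothetical 28x28 grayscale digit image
x = image.reshape(-1)               # input vector x in R^784
assert x.shape == (784,)

K = 10                              # number of classes (digits 0-9)
label = 3                           # hypothetical true class
t = np.zeros(K)
t[label] = 1.0                      # 1-of-K ("one-hot") target vector
print(t)                            # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]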
Generalized Linear Models
• Similar to previous chapter on linear models for regression, we will use a “linear” model for classification:
y(x) = f(wTx+ w0)
• This is called a generalized linear model
• f(·) is a fixed non-linear function, e.g.
f(u) = 1 if u ≥ 0, 0 otherwise
• Decision boundary between classes will be linear function of x
• Can also apply non-linearity to x, as in ϕi(x) for regression
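A minimal NumPy sketch of such a generalized linear model with a threshold non-linearity (the weights and data are illustrative values, not learned):

import numpy as np

def f(u):
    """Threshold non-linearity: 1 if u >= 0, else 0."""
    return np.where(u >= 0, 1, 0)

def predict(X, w, w0):
    """Classify each row of X; the decision boundary w^T x + w0 = 0 is linear."""
    return f(X @ w + w0)

w = np.array([1.0, -2.0])            # hypothetical weight vector
w0 = 0.5                             # hypothetical bias
X = np.array([[1.0, 1.0], [3.0, 0.5]])
print(predict(X, w, w0))             # class labels in {0, 1}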
Outline
Discriminant Functions
Generative Models
Discriminative Models
Discriminant Functions with Two Classes
[figure: geometry of a two-class linear discriminant: the decision boundary y = 0 separates regions R1 (y > 0) and R2 (y < 0); w is orthogonal to the boundary, y(x)/||w|| is the signed distance of x from it, and −w0/||w|| is the distance of the boundary from the origin]
• Start with 2 class problem, ti ∈ {0, 1}
• Simple linear discriminant
y(x) = wTx + w0
apply threshold function to get classification
• Projection of x in w dir. is wTx / ||w||
Multiple Classes
[figure: ambiguous regions (marked “?”) that arise when combining two-class discriminants: one-versus-the-rest classifiers (left) and one-versus-one classifiers (right)]
• A linear discriminant between two classes separates with a hyperplane
• How to use this for multiple classes?
• One-versus-the-rest method: build K − 1 classifiers, between Ck and all others
• One-versus-one method: build K(K − 1)/2 classifiers, between all pairs
Multiple Classes
[figure: decision regions Ri, Rj, Rk of a multi-class linear discriminant; any point x̂ on the line segment between xA and xB inside Rk also lies in Rk]
• A solution is to build K linear functions:
yk(x) = wkTx + wk0
assign x to class arg maxk yk(x)
• Gives connected, convex decision regions:
x̂ = λxA + (1 − λ)xB
yk(x̂) = λyk(xA) + (1 − λ)yk(xB)
⇒ yk(x̂) > yj(x̂), ∀j ≠ k
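A small NumPy sketch of this arg-max rule over K linear functions (W, w0 and the data are made-up illustrative values):

import numpy as np

K, D = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(K, D))          # row k holds w_k
w0 = rng.normal(size=K)              # biases w_k0

def classify(X):
    scores = X @ W.T + w0            # N x K matrix of scores y_k(x_n)
    return np.argmax(scores, axis=1) # assign each point to the highest-scoring class

X = rng.normal(size=(5, D))
print(classify(X))                   # class index in {0, ..., K-1} per point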
Least Squares for Classification
• How do we learn the decision boundaries (wk, wk0)?
• One approach is to use least squares, similar to regression
• Find W to minimize the squared error over all examples and all components of the label vector:
E(W) = (1/2) ∑n=1..N ∑k=1..K (yk(xn) − tnk)²
• Some algebra, we get a solution using the pseudo-inverse, as in regression
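A short NumPy sketch of this least-squares approach, using a bias feature, 1-of-K targets and the pseudo-inverse (synthetic data; not from the original slides):

import numpy as np

rng = np.random.default_rng(0)
N, D, K = 90, 2, 3
X = np.vstack([rng.normal(loc=m, size=(N // K, D)) for m in ([0, 0], [3, 3], [0, 4])])
labels = np.repeat(np.arange(K), N // K)
T = np.eye(K)[labels]                        # N x K one-hot target matrix

X_aug = np.hstack([np.ones((N, 1)), X])      # prepend a constant feature to absorb w_k0
W = np.linalg.pinv(X_aug) @ T                # minimizes sum_n sum_k (y_k(x_n) - t_nk)^2

pred = np.argmax(X_aug @ W, axis=1)          # classify by the largest output
print("training accuracy:", np.mean(pred == labels))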
Problems with Least Squares
[figure: two-class data with least-squares and logistic-regression decision boundaries; the right panel adds extra points far from the boundary, which shifts the least-squares boundary]
• Looks okay... least squares decision boundary
• Similar to logistic regression decision boundary (more later)
• Gets worse by adding easy points?!
• Why?
• If target value is 1, points far from boundary will have high value, say 10; this is a large error so the boundary is moved
More Least Squares Problems
[figure: synthetic three-class data illustrating the failure of least squares described below]
• Easily separated by hyperplanes, but not found using least squares!
• We’ll address these problems later with better models
• First, a look at a different criterion for a linear discriminant
Fisher’s Linear Discriminant
• The two-class linear discriminant acts as a projection:
y = wTx ≥ −w0
followed by a threshold
• In which direction w should we project?
• One which separates classes “well”
Fisher’s Linear Discriminant
[figure: samples from two classes projected onto the line joining the class means (left, considerable overlap) and onto the Fisher direction (right, better separation)]
• A natural idea would be to project in the direction of the line connecting the class means
• However, this is problematic if the classes have variance in this direction
• Fisher criterion: maximize the ratio of inter-class separation (between) to intra-class variance (within)
Math time - FLD
• Projection yn = wTxn
• Inter-class separation is distance between class means (good):
mk = (1/Nk) ∑n∈Ck wTxn
• Intra-class variance (bad):
sk² = ∑n∈Ck (yn − mk)²
• Fisher criterion:
J(w) = (m2 − m1)² / (s1² + s2²)
maximize wrt w
Math time - FLD
J(w) = (m2 − m1)² / (s1² + s2²) = (wTSBw) / (wTSWw)
Between-class covariance: SB = (m2 − m1)(m2 − m1)T
Within-class covariance: SW = ∑n∈C1 (xn − m1)(xn − m1)T + ∑n∈C2 (xn − m2)(xn − m2)T
Lots of math:
w ∝ SW−1(m2 − m1)
If the covariance SW is isotropic, this reduces to the class mean difference vector
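A small NumPy sketch of the closed-form Fisher direction w ∝ SW−1(m2 − m1) on synthetic two-class data:

import numpy as np

rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], size=100)
X2 = rng.multivariate_normal([3, 3], [[2.0, 1.5], [1.5, 2.0]], size=100)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class covariance
w = np.linalg.solve(S_W, m2 - m1)                          # Fisher direction (up to scale)

# Projected class means are well separated relative to the projected spread
y1, y2 = X1 @ w, X2 @ w
print("projected means:", y1.mean(), y2.mean())
print("projected std devs:", y1.std(), y2.std())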
FLD Summary
• FLD is a dimensionality reduction technique (more later in the course)
• Criterion for choosing the projection based on class labels
• Still suffers from outliers (e.g. the earlier least squares example)
Perceptrons
• “Perceptron” is used to refer to many neural network structures
• The classic type is a fixed non-linear transformation of the input, one layer of adaptive weights, and a threshold:
y(x) = f(wTϕ(x))
• Developed by Rosenblatt in the 50s
• The main difference compared to the methods we’ve seen so far is the learning algorithm
Perceptron Learning
• Two class problem
• For ease of notation, we will use t = 1 for class C1 and t = −1 for class C2
• We saw that squared error was problematic
• Instead, we’d like to minimize the number of misclassified examples
• An example is mis-classified if wTϕ(xn)tn < 0
• Perceptron criterion:
EP(w) = −∑n∈M wTϕ(xn)tn
sum over mis-classified examples only
Perceptron Learning Algorithm
• Minimize the error function using stochastic gradient descent (gradient descent per example):
w(τ+1) = w(τ) − η∇EP(w) = w(τ) + ηϕ(xn)tn   (update applied only if xn is mis-classified)
• Iterate over all training examples, only change w if the example is mis-classified
• Guaranteed to converge if data are linearly separable
• Will not converge if not
• May take many iterations
• Sensitive to initialization
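A minimal NumPy sketch of this perceptron learning loop on synthetic, linearly separable data (η, the iteration cap and the data are arbitrary choices; a zero score is treated as a mistake so learning can start from w = 0):

import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
               rng.normal([3, 3], 0.5, size=(50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])     # targets in {-1, +1}
Phi = np.hstack([np.ones((100, 1)), X])        # phi(x) = x plus a constant bias feature

w = np.zeros(Phi.shape[1])
eta = 1.0
for epoch in range(100):
    mistakes = 0
    for phi_n, t_n in zip(Phi, t):
        if w @ phi_n * t_n <= 0:               # mis-classified example
            w = w + eta * phi_n * t_n          # perceptron update
            mistakes += 1
    if mistakes == 0:                          # converged (data are separable)
        break
print("epochs:", epoch + 1, "final weights:", w)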
Perceptron Learning Illustration
[figure: four snapshots of the perceptron learning algorithm updating the decision boundary until the two classes are separated]
• Note there are many hyperplanes with 0 error
• Support vector machines (in a few weeks) have a nice way of choosing one
Limitations of Perceptrons
• Perceptrons can only solve linearly separable problems in feature space
• Same as the other models in this chapter
• Canonical example of a non-separable problem is X-OR
• Real datasets can look like this too
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc.
Represents a linear separator in input space:
∑j Wj xj > 0 or W · x > 0
[figure: input-space plots of (a) I1 and I2, (b) I1 or I2, (c) I1 xor I2; the xor case cannot be separated by a line]
Outline
Discriminant Functions
Generative Models
Discriminative Models
Generative vs. Discriminative
• Generative models
  • Can generate synthetic example data
  • Perhaps accurate classification is equivalent to accurate synthesis
  • e.g. vision and graphics
  • Tend to have more parameters
  • Require a good model of the class distributions
• Discriminative models
  • Only usable for classification
  • Don’t solve a harder problem than you need to
  • Tend to have fewer parameters
  • Require a good model of the decision boundary
Probabilistic Generative Models
• Up to now we’ve looked at learning classification by choosing parameters to minimize an error function
• We’ll now develop a probabilistic approach
• With 2 classes, C1 and C2:
p(C1|x) = p(x|C1)p(C1) / p(x)   (Bayes’ rule)
p(C1|x) = p(x|C1)p(C1) / (p(x, C1) + p(x, C2))   (sum rule)
p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))   (product rule)
• In generative models we specify the distribution p(x|Ck) which generates the data for each class
Probabilistic Generative Models - Example
• Let’s say we observe x, which is the current temperature
• Determine if we are in Vienna (C1) or Honolulu (C2)
• Generative model:
p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))
• p(x|C1) is the distribution over typical temperatures in Vienna, e.g. p(x|C1) = N(x; 10, 5)
• p(x|C2) is the distribution over typical temperatures in Honolulu, e.g. p(x|C2) = N(x; 25, 5)
• Class priors p(C1) = 0.1, p(C2) = 0.9
• p(C1|x = 15) = (0.0484 · 0.1) / (0.0484 · 0.1 + 0.0108 · 0.9) ≈ 0.33
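A tiny Python sketch of this example, assuming N(x; µ, σ) denotes a Gaussian with mean µ and standard deviation σ:

import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 15.0
prior = {"Vienna": 0.1, "Honolulu": 0.9}
lik = {"Vienna": gauss(x, 10, 5), "Honolulu": gauss(x, 25, 5)}

evidence = sum(lik[c] * prior[c] for c in prior)           # p(x)
posterior_vienna = lik["Vienna"] * prior["Vienna"] / evidence
print(round(posterior_vienna, 2))                          # approximately 0.33, as on the slide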
Generalized Linear Models
• We can write the classifier in another form
p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2)) = 1 / (1 + exp(−a)) ≡ σ(a)
where a = ln [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ]
• This looks like gratuitous math, but if a takes a simple form this is another generalized linear model, which we have been studying
• Of course, we will see how such a simple form a = wTx + w0 arises naturally
Logistic Sigmoid
[figure: plot of the logistic sigmoid σ(a), rising from 0 to 1 around a = 0]
• The function σ(a) = 1 / (1 + exp(−a)) is known as the logistic sigmoid
• It squashes the real axis down to [0, 1]
• It is continuous and differentiable
• It avoids the problems encountered with the “too correct” least-squares error fitting (later)
Multi-class Extension
• There is a generalization of the logistic sigmoid to K > 2 classes:
p(Ck|x) = p(x|Ck)p(Ck) / ∑j p(x|Cj)p(Cj) = exp(ak) / ∑j exp(aj)
where ak = ln p(x|Ck)p(Ck)
• a.k.a. softmax function
• If some ak ≫ aj, p(Ck|x) goes to 1
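A small NumPy sketch of the softmax; subtracting max(a) before exponentiating does not change the result but avoids overflow:

import numpy as np

def softmax(a):
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())          # shift for numerical stability
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))      # posteriors summing to 1
print(softmax([100.0, 1.0, 0.1]))    # one a_k >> the rest: that posterior goes to 1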
Gaussian Class-Conditional Densities
• Back to that a in the logistic sigmoid for 2 classes
• Let’s assume the class-conditional densities p(x|Ck) are Gaussians, and have the same covariance matrix Σ:
p(x|Ck) = (2π)−D/2 |Σ|−1/2 exp{ −(1/2)(x − µk)TΣ−1(x − µk) }
• a takes a simple form:
a = ln [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ] = wTx + w0
• Note that the quadratic terms xTΣ−1x cancel
Maximum Likelihood Learning
• We can fit the parameters to this model using maximum likelihood
• Parameters are µ1, µ2, Σ−1, p(C1) ≡ π, p(C2) ≡ 1 − π
• Refer to as θ
• For a datapoint xn from class C1 (tn = 1):
p(xn, C1) = p(C1)p(xn|C1) = πN(xn|µ1, Σ)
• For a datapoint xn from class C2 (tn = 0):
p(xn, C2) = p(C2)p(xn|C2) = (1 − π)N(xn|µ2, Σ)
Maximum Likelihood Learning
• The likelihood of the training data is:
p(t|π, µ1, µ2, Σ) = ∏n=1..N [πN(xn|µ1, Σ)]^tn [(1 − π)N(xn|µ2, Σ)]^(1−tn)
• As usual, ln is our friend:
ℓ(t; θ) = ∑n=1..N { tn ln π + (1 − tn) ln(1 − π) }  [terms involving π]  +  { tn ln N1 + (1 − tn) ln N2 }  [terms involving µ1, µ2, Σ]
where Nk is shorthand for N(xn|µk, Σ)
• Maximize for each group of parameters separately
Maximum Likelihood Learning - Class Priors
• Maximization with respect to the class prior parameter π is straightforward:
∂ℓ(t; θ)/∂π = ∑n=1..N [ tn/π − (1 − tn)/(1 − π) ]
setting this to zero gives π = N1 / (N1 + N2)
• N1 and N2 are the number of training points in each class
• Prior is simply the fraction of points in each class
Maximum Likelihood Learning - Gaussian Parameters
• The other parameters can also be found in the same fashion
• Class means:
µ1 = (1/N1) ∑n=1..N tn xn
µ2 = (1/N2) ∑n=1..N (1 − tn) xn
• Means of training examples from each class
• Shared covariance matrix:
Σ = (N1/N) · (1/N1) ∑n∈C1 (xn − µ1)(xn − µ1)T + (N2/N) · (1/N2) ∑n∈C2 (xn − µ2)(xn − µ2)T
• Weighted average of class covariances
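A NumPy sketch putting the pieces together on synthetic data: fit π, the class means and the shared covariance by maximum likelihood, then classify through the logistic sigmoid of a = wTx + w0 (the expressions for w and w0 below come from expanding the log-odds for shared-covariance Gaussians, where the quadratic terms cancel):

import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=30)   # class C1 (t = 1)
X2 = rng.multivariate_normal([2, 2], np.eye(2), size=70)   # class C2 (t = 0)

N1, N2 = len(X1), len(X2)
pi = N1 / (N1 + N2)                           # prior p(C1): fraction of points in C1
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # class means
S1 = (X1 - mu1).T @ (X1 - mu1) / N1
S2 = (X2 - mu2).T @ (X2 - mu2) / N2
Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)       # weighted average of class covariances

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(pi / (1 - pi)))

def p_c1(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))             # logistic sigmoid of a

print(p_c1(np.array([0.0, 0.0])), p_c1(np.array([2.0, 2.0])))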
Probabilistic Generative Models Summary
• Fitting a Gaussian using the ML criterion is sensitive to outliers
• The simple linear form for a in the logistic sigmoid occurs for more than just Gaussian distributions
• Arises for any distribution in the exponential family, a large class of distributions
Outline
Discriminant Functions
Generative Models
Discriminative Models
Probabilistic Discriminative Models
• Generative model made assumptions about the form of the class-conditional distributions (e.g. Gaussian)
• Resulted in a logistic sigmoid of a linear function of x
• Discriminative model - explicitly use the functional form
p(C1|x) = 1 / (1 + exp(−(wTx + w0)))
and find w directly
• For the generative model we had 2M + M(M + 1)/2 + 1 parameters
• M is the dimensionality of x
• Discriminative model will have M + 1 parameters
Maximum Likelihood Learning - Discriminative Model
• As usual we can use the maximum likelihood criterion for learning
p(t|w) = ∏n=1..N yn^tn (1 − yn)^(1−tn), where yn = p(C1|xn)
• Taking ln and the derivative gives:
∇ℓ(w) = ∑n=1..N (tn − yn) xn
• This time there is no closed-form solution, since yn = σ(wTxn)
• Could use (stochastic) gradient descent
• But there’s a better iterative technique
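A small NumPy sketch of maximum likelihood logistic regression by batch gradient ascent, using the gradient ∑n (tn − yn)xn above (synthetic data; the bias is absorbed by a constant feature and the step size is arbitrary):

import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([3, 3], 1.0, size=(50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])     # t = 1 for class C1
X_aug = np.hstack([np.ones((100, 1)), X])      # constant feature absorbs w0

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(X_aug.shape[1])
eta = 0.1
for _ in range(1000):
    y = sigmoid(X_aug @ w)                     # y_n = p(C1 | x_n)
    grad = X_aug.T @ (t - y)                   # gradient of the log-likelihood
    w = w + eta * grad / len(t)                # ascend (maximize the likelihood)

print("weights:", w)
print("training accuracy:", np.mean((sigmoid(X_aug @ w) > 0.5) == t))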
Iterative Reweighted Least Squares
• Iterative reweighted least squares (IRLS) is a descent method
• As in gradient descent, start with an initial guess and improve it
• Gradient descent - take a step (how large?) in the gradient direction
• IRLS is a special case of a Newton-Raphson method
• Approximate the function using a second-order Taylor expansion:
f̂(w + v) = f(w) + ∇f(w)Tv + (1/2) vT∇2f(w) v
• Closed-form solution to minimize this is straight-forward: it is quadratic, so the derivatives are linear
• In IRLS this second-order Taylor expansion ends up being a weighted least-squares problem, as in the regression case from last week
• Hence the name IRLS
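A NumPy sketch of the resulting IRLS updates for logistic regression, where each Newton step solves a weighted least-squares problem with weights R = diag(yn(1 − yn)) (update as in Bishop, Section 4.3.3; data are synthetic and overlap slightly so the maximum likelihood solution exists):

import numpy as np

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])          # design matrix with bias feature

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(Phi.shape[1])
for _ in range(10):                              # IRLS usually converges in a few steps
    y = sigmoid(Phi @ w)
    R = y * (1 - y)                              # diagonal of the weighting matrix R
    H = Phi.T @ (Phi * R[:, None])               # Hessian: Phi^T R Phi
    grad = Phi.T @ (y - t)                       # gradient of the negative log-likelihood
    w = w - np.linalg.solve(H, grad)             # Newton step

print("IRLS weights:", w)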
Newton-Raphson
[excerpt, Section 9.5 of the reference below: the Newton step ∆xnt = −∇2f(x)−1∇f(x) is a descent direction when ∇2f(x) is positive definite, and is exactly the step that minimizes the second-order Taylor approximation f̂(x + v) = f(x) + ∇f(x)Tv + (1/2) vT∇2f(x) v; if f is (nearly) quadratic, x + ∆xnt is (a very good estimate of) the minimizer]
• Figure from Boyd and Vandenberghe, Convex Optimization
• Excellent reference, free for download online: http://www.stanford.edu/~boyd/cvxbook/
Conclusion
• Readings: Ch. 4.1.1-4.1.4, 4.1.7, 4.2.1-4.2.2, 4.3.1-4.3.3
• Generalized linear models y(x) = f(wTx + w0)
• Threshold/max function for f(·)
• Minimize with least squares
• Fisher criterion - class separation
• Perceptron criterion - mis-classified examples
• Probabilistic models: logistic sigmoid / softmax for f(·)
• Generative model - assume class-conditional densities in the exponential family; obtain sigmoid
• Discriminative model - directly model the posterior using the sigmoid (a.k.a. logistic regression, though classification)
• Can learn either using maximum likelihood
• All of these models are limited to linear decision boundaries in feature space