Lecture Slides for
INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

CHAPTER 10: LINEAR DISCRIMINATION
Likelihood- vs. Discriminant-based Classification

Likelihood-based: Assume a model for p(x|Ci), use Bayes' rule to calculate P(Ci|x)
  gi(x) = log P(Ci|x)
Discriminant-based: Assume a model for gi(x|Φi); no density estimation
Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries
Linear discriminant: $g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0}$
Advantages:
  Simple: O(d) space/computation
  Knowledge extraction: Weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
  Optimal when p(x|Ci) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable
Linear Discriminant

$$g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$$
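As a minimal sketch (with made-up weight values), the linear discriminant above is a dot product plus a bias, hence the O(d) cost:

```python
import numpy as np

# Evaluate g_i(x) = w_i^T x + w_i0 for one class.
# The weights and input below are made-up numbers for illustration.
w_i = np.array([0.5, -1.2, 2.0])   # one weight per attribute (d = 3)
w_i0 = 0.3                          # bias term
x = np.array([1.0, 2.0, 0.5])

g_i = w_i @ x + w_i0                # O(d) computation
print(g_i)                          # -0.6 (up to float rounding)
```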
Generalized Linear Model

Quadratic discriminant:
$$g_i(x \mid W_i, w_i, w_{i0}) = x^T W_i x + w_i^T x + w_{i0}$$

Higher-order (product) terms, e.g.:
$$z_1 = x_1,\quad z_2 = x_2,\quad z_3 = x_1^2,\quad z_4 = x_2^2,\quad z_5 = x_1 x_2$$

Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:
$$g_i(x) = \sum_{j=1}^{k} w_{ij}\, \phi_{ij}(x)$$
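A short sketch of the quadratic basis expansion named above: mapping x to z makes a quadratic discriminant in x a linear one in z.

```python
import numpy as np

# Quadratic basis expansion z = phi(x) for a 2-dimensional input,
# matching z1..z5 from the slide.
def phi(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

x = np.array([2.0, 3.0])
print(phi(x))    # [2. 3. 4. 9. 6.]
```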
Two Classes

$$\begin{aligned}
g(x) &= g_1(x) - g_2(x) \\
&= \left(w_1^T x + w_{10}\right) - \left(w_2^T x + w_{20}\right) \\
&= \left(w_1 - w_2\right)^T x + \left(w_{10} - w_{20}\right) \\
&= w^T x + w_0
\end{aligned}$$

choose $C_1$ if $g(x) > 0$, $C_2$ otherwise
Geometry

(figure)
Multiple Classes

$$g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0}$$

Choose $C_i$ if $g_i(x) = \max_{j=1}^{K} g_j(x)$

Classes are linearly separable
Pairwise Separation

$$g_{ij}(x \mid w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}$$

$$g_{ij}(x) = \begin{cases} > 0 & \text{if } x \in C_i \\ \le 0 & \text{if } x \in C_j \\ \text{don't care} & \text{otherwise} \end{cases}$$

choose $C_i$ if $\forall j \ne i,\ g_{ij}(x) > 0$
From Discriminants to Posteriors

When $p(x \mid C_i) \sim \mathcal{N}(\mu_i, \Sigma)$:

$$g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0}$$
$$w_i = \Sigma^{-1} \mu_i, \qquad w_{i0} = -\frac{1}{2}\, \mu_i^T \Sigma^{-1} \mu_i + \log P(C_i)$$

$$y = P(C_1 \mid x) \quad \text{and} \quad P(C_2 \mid x) = 1 - y$$

choose $C_1$ if $y > 0.5$, and $C_2$ otherwise; equivalently, choose $C_1$ if $\dfrac{y}{1-y} > 1$, i.e., if $\log \dfrac{y}{1-y} > 0$
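A minimal sketch of computing $w_i$ and $w_{i0}$ from class parameters; the values of $\mu_i$, $\Sigma$, and $P(C_i)$ below are made up for illustration:

```python
import numpy as np

# Build the linear discriminant from (made-up) Gaussian class
# parameters: shared covariance Sigma, mean mu_i, prior P(C_i).
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
mu_i = np.array([1.0, 2.0])
P_Ci = 0.4

Sigma_inv = np.linalg.inv(Sigma)
w_i = Sigma_inv @ mu_i                               # w_i = Sigma^{-1} mu_i
w_i0 = -0.5 * mu_i @ Sigma_inv @ mu_i + np.log(P_Ci) # bias term

x = np.array([0.5, 1.5])
g_i = w_i @ x + w_i0                                 # linear discriminant value
print(g_i)
```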
$$\operatorname{logit}(P(C_1 \mid x)) = \log \frac{P(C_1 \mid x)}{1 - P(C_1 \mid x)} = \log \frac{p(x \mid C_1)}{p(x \mid C_2)} + \log \frac{P(C_1)}{P(C_2)}$$

With $p(x \mid C_i) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\left[-\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)\right]$, this reduces to

$$\operatorname{logit}(P(C_1 \mid x)) = w^T x + w_0$$

where

$$w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad w_0 = -\frac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_1 - \mu_2) + \log \frac{P(C_1)}{P(C_2)}$$

The inverse of the logit is the sigmoid:

$$P(C_1 \mid x) = \operatorname{sigmoid}(w^T x + w_0) = \frac{1}{1 + \exp\left[-(w^T x + w_0)\right]}$$
Sigmoid (Logistic) Function

Calculate $g(x) = w^T x + w_0$ and choose $C_1$ if $g(x) > 0$, or
calculate $y = \operatorname{sigmoid}(w^T x + w_0)$ and choose $C_1$ if $y > 0.5$
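The two decision rules are equivalent, since the sigmoid is monotone and crosses 0.5 at 0. A quick check with made-up weights:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up parameters and input, for illustration only.
w, w0 = np.array([1.0, -0.5]), 0.2
x = np.array([0.8, 0.4])

g = w @ x + w0                  # g(x) > 0  <=>  sigmoid(g(x)) > 0.5
print(g > 0, sigmoid(g) > 0.5)  # both rules give the same choice
```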
Gradient-Descent

E(w|X) is error with parameters w on sample X
  w* = arg min_w E(w | X)

Gradient:
$$\nabla_w E = \left[\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_d}\right]^T$$

Gradient-descent: Start from a random w and update w iteratively in the negative direction of the gradient
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}, \qquad w_i \leftarrow w_i + \Delta w_i$$

(figure: one step from $w^t$ to $w^{t+1}$ with step size $\eta$, decreasing the error from $E(w^t)$ to $E(w^{t+1})$)
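The update rule can be sketched on a toy error function; the quadratic $E(w) = \lVert Aw - b \rVert^2$ and the values of A, b, and η below are made up for illustration:

```python
import numpy as np

# Gradient descent on a hypothetical quadratic error E(w) = ||Aw - b||^2,
# whose unique minimizer is the solution of Aw = b, i.e. w = [1, 3].
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 3.0])

def grad_E(w):
    # dE/dw for E(w) = ||Aw - b||^2 is 2 A^T (Aw - b)
    return 2.0 * A.T @ (A @ w - b)

eta = 0.05                      # step size (learning rate)
w = np.zeros(2)                 # starting point
for _ in range(500):
    w = w - eta * grad_E(w)     # move against the gradient
print(w)                        # approaches [1, 3]
```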
Logistic Discrimination

Two classes: Assume the log likelihood ratio is linear:

$$\log \frac{p(x \mid C_1)}{p(x \mid C_2)} = w^T x + w_0^o$$

Then

$$\operatorname{logit}(P(C_1 \mid x)) = \log \frac{p(x \mid C_1)}{p(x \mid C_2)} + \log \frac{P(C_1)}{P(C_2)} = w^T x + w_0$$

where $w_0 = w_0^o + \log \dfrac{P(C_1)}{P(C_2)}$, and

$$y = \hat{P}(C_1 \mid x) = \frac{1}{1 + \exp\left[-(w^T x + w_0)\right]}$$
Training: Two Classes

$$\mathcal{X} = \{x^t, r^t\}_t, \qquad r^t \mid x^t \sim \text{Bernoulli}(y^t)$$
$$y^t = P(C_1 \mid x^t) = \frac{1}{1 + \exp\left[-(w^T x^t + w_0)\right]}$$

Likelihood:
$$l(w, w_0 \mid \mathcal{X}) = \prod_t \left(y^t\right)^{r^t} \left(1 - y^t\right)^{1 - r^t}$$

Error = negative log likelihood (cross-entropy):
$$E(w, w_0 \mid \mathcal{X}) = -\sum_t r^t \log y^t + \left(1 - r^t\right) \log\left(1 - y^t\right)$$
Training: Gradient-Descent

$$E(w, w_0 \mid \mathcal{X}) = -\sum_t r^t \log y^t + \left(1 - r^t\right) \log\left(1 - y^t\right)$$

If $y = \operatorname{sigmoid}(a)$, then $\dfrac{dy}{da} = y(1 - y)$, so

$$\Delta w_j = -\eta \frac{\partial E}{\partial w_j} = \eta \sum_t \left(\frac{r^t}{y^t} - \frac{1 - r^t}{1 - y^t}\right) y^t \left(1 - y^t\right) x_j^t = \eta \sum_t \left(r^t - y^t\right) x_j^t, \quad j = 1, \ldots, d$$

$$\Delta w_0 = \eta \sum_t \left(r^t - y^t\right)$$
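The updates above can be sketched as batch training on toy data; the two Gaussian blobs, learning rate, and iteration count below are made-up choices, not part of the slides:

```python
import numpy as np

# Batch gradient descent for a two-class logistic discriminant.
# Toy data: two well-separated Gaussian blobs, labels r in {0, 1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
r = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w, w0, eta = np.zeros(2), 0.0, 0.1
for _ in range(200):
    y = sigmoid(X @ w + w0)      # y^t = P(C1 | x^t)
    w += eta * (r - y) @ X       # Δw_j = η Σ_t (r^t − y^t) x_j^t
    w0 += eta * np.sum(r - y)    # Δw_0 = η Σ_t (r^t − y^t)

acc = np.mean((sigmoid(X @ w + w0) > 0.5) == (r == 1))
print(acc)
```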
(figures: fitted discriminant after 10, 100, and 1000 training iterations)
K>2 Classes

$$\mathcal{X} = \{x^t, r^t\}_t, \qquad r^t \mid x^t \sim \text{Mult}_K(1, y^t)$$

Assume the log likelihood ratios are linear:
$$\log \frac{p(x \mid C_i)}{p(x \mid C_K)} = w_i^T x + w_{i0}^o$$

Softmax:
$$y_i = \hat{P}(C_i \mid x) = \frac{\exp\left(w_i^T x + w_{i0}\right)}{\sum_{j=1}^{K} \exp\left(w_j^T x + w_{j0}\right)}, \quad i = 1, \ldots, K$$

Likelihood:
$$l\left(\{w_i, w_{i0}\}_i \mid \mathcal{X}\right) = \prod_t \prod_i \left(y_i^t\right)^{r_i^t}$$

Error (cross-entropy):
$$E\left(\{w_i, w_{i0}\}_i \mid \mathcal{X}\right) = -\sum_t \sum_i r_i^t \log y_i^t$$

Gradient-descent updates:
$$\Delta w_j = \eta \sum_t \left(r_j^t - y_j^t\right) x^t, \qquad \Delta w_{j0} = \eta \sum_t \left(r_j^t - y_j^t\right)$$
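A minimal softmax sketch; subtracting the maximum before exponentiating is a common numerical-stability trick, not something the slides prescribe:

```python
import numpy as np

# Softmax, as used above for y_i = P(C_i | x).
def softmax(o):
    o = o - np.max(o)            # shift for numerical stability
    e = np.exp(o)
    return e / np.sum(e)

# Hypothetical linear outputs o_i = w_i^T x + w_i0 for K = 3 classes
o = np.array([2.0, 1.0, 0.1])
y = softmax(o)
print(y)                          # probabilities summing to 1
print(np.argmax(y))               # chosen class: index 0
```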
Example

(figure)
Generalizing the Linear Model

Quadratic:
$$\log \frac{p(x \mid C_i)}{p(x \mid C_K)} = x^T W_i x + w_i^T x + w_{i0}$$

Sum of basis functions:
$$\log \frac{p(x \mid C_i)}{p(x \mid C_K)} = w_i^T \phi(x) + w_{i0}$$

where φ(x) are basis functions. Examples:
  Hidden units in neural networks (Chapters 11 and 12)
  Kernels in SVM (Chapter 13)
Discrimination by Regression

Classes are NOT mutually exclusive and exhaustive

$$r^t = y^t + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
$$y^t = \operatorname{sigmoid}(w^T x^t + w_0) = \frac{1}{1 + \exp\left[-(w^T x^t + w_0)\right]}$$

Error:
$$E(w, w_0 \mid \mathcal{X}) = \frac{1}{2} \sum_t \left(r^t - y^t\right)^2$$

Update:
$$\Delta w = \eta \sum_t \left(r^t - y^t\right) y^t \left(1 - y^t\right) x^t$$
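The regression update differs from the cross-entropy update only by the extra $y(1-y)$ factor; a sketch on made-up toy data (blob means, η, and iteration count are arbitrary choices):

```python
import numpy as np

# Regression-based training: Δw = η Σ (r−y) y (1−y) x,
# versus Δw = η Σ (r−y) x for cross-entropy.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
r = np.concatenate([np.zeros(30), np.ones(30)])

w, w0, eta = np.zeros(2), 0.0, 0.5
for _ in range(1000):
    y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))
    d = (r - y) * y * (1 - y)    # extra y(1−y) factor from the squared error
    w += eta * d @ X
    w0 += eta * np.sum(d)

y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))
acc = np.mean((y > 0.5) == (r == 1))
print(acc)
```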
Learning to Rank

Ranking: A different problem than classification or regression
Let us say xu and xv are two instances, e.g., two movies
Preferring u to v implies that g(xu) > g(xv), where g(x) is a score function, here linear: g(x) = wTx
Find a direction w such that we get the desired ranks when instances are projected along w
Ranking Error

Preferring u to v implies that we want g(xu) > g(xv), so the error on a pair is g(xv) − g(xu) if g(xu) < g(xv), and 0 otherwise
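The pairwise error above can be sketched directly; the weight vector and the preference pairs below are made-up examples:

```python
import numpy as np

# Pairwise ranking error for a linear score g(x) = w^T x:
# for each (x_u, x_v) with u preferred to v, penalize g(x_v) - g(x_u)
# whenever the pair is ordered the wrong way.
def rank_error(w, pairs):
    """pairs: list of (x_u, x_v) with u preferred to v."""
    err = 0.0
    for x_u, x_v in pairs:
        g_u, g_v = w @ x_u, w @ x_v
        if g_u < g_v:            # wrong order: penalize by the margin
            err += g_v - g_u
    return err

w = np.array([1.0, 0.0])
pairs = [(np.array([2.0, 0.0]), np.array([1.0, 0.0])),   # correctly ordered
         (np.array([0.0, 1.0]), np.array([1.0, 1.0]))]   # wrongly ordered
print(rank_error(w, pairs))      # 1.0, all from the second pair
```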