Lecture Slides for
INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

CHAPTER 10: LINEAR DISCRIMINATION
Likelihood- vs. Discriminant-based Classification

Likelihood-based: Assume a model for p(x|Ci), use Bayes' rule to calculate P(Ci|x)
  gi(x) = log P(Ci|x)
Discriminant-based: Assume a model for gi(x|Φi); no density estimation
Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries
Linear discriminant: $g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0}$
Advantages:
  Simple: O(d) space/computation
  Knowledge extraction: Weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
  Optimal when p(x|Ci) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable
Linear Discriminant

$$g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$$
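As a minimal sketch (with made-up weight values), the linear discriminant above is a dot product plus a bias, hence the O(d) cost:

```python
import numpy as np

# Evaluate g_i(x) = w_i^T x + w_i0 for one class.
# The weights and input below are made-up numbers for illustration.
w_i = np.array([0.5, -1.2, 2.0])   # one weight per attribute (d = 3)
w_i0 = 0.3                          # bias term
x = np.array([1.0, 2.0, 0.5])

g_i = w_i @ x + w_i0                # O(d) computation
print(g_i)                          # -0.6 (up to float rounding)
```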
Generalized Linear Model

Quadratic discriminant:
$$g_i(x \mid W_i, w_i, w_{i0}) = x^T W_i x + w_i^T x + w_{i0}$$

Higher-order (product) terms, e.g.:
$$z_1 = x_1,\quad z_2 = x_2,\quad z_3 = x_1^2,\quad z_4 = x_2^2,\quad z_5 = x_1 x_2$$

Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:
$$g_i(x) = \sum_{j=1}^{k} w_{ij}\, \phi_{ij}(x)$$
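A short sketch of the quadratic basis expansion named above: mapping x to z makes a quadratic discriminant in x a linear one in z.

```python
import numpy as np

# Quadratic basis expansion z = phi(x) for a 2-dimensional input,
# matching z1..z5 from the slide.
def phi(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

x = np.array([2.0, 3.0])
print(phi(x))    # [2. 3. 4. 9. 6.]
```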
Two Classes

$$\begin{aligned}
g(x) &= g_1(x) - g_2(x) \\
&= \left(w_1^T x + w_{10}\right) - \left(w_2^T x + w_{20}\right) \\
&= \left(w_1 - w_2\right)^T x + \left(w_{10} - w_{20}\right) \\
&= w^T x + w_0
\end{aligned}$$

choose $C_1$ if $g(x) > 0$, $C_2$ otherwise
Geometry

(figure)
Multiple Classes

$$g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0}$$

Choose $C_i$ if $g_i(x) = \max_{j=1}^{K} g_j(x)$

Classes are linearly separable
Pairwise Separation

$$g_{ij}(x \mid w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}$$

$$g_{ij}(x) = \begin{cases} > 0 & \text{if } x \in C_i \\ \le 0 & \text{if } x \in C_j \\ \text{don't care} & \text{otherwise} \end{cases}$$

choose $C_i$ if $\forall j \ne i,\ g_{ij}(x) > 0$
From Discriminants to Posteriors

When $p(x \mid C_i) \sim \mathcal{N}(\mu_i, \Sigma)$:

$$g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0}$$
$$w_i = \Sigma^{-1} \mu_i, \qquad w_{i0} = -\frac{1}{2}\, \mu_i^T \Sigma^{-1} \mu_i + \log P(C_i)$$

$$y = P(C_1 \mid x) \quad \text{and} \quad P(C_2 \mid x) = 1 - y$$

choose $C_1$ if $y > 0.5$, and $C_2$ otherwise; equivalently, choose $C_1$ if $\dfrac{y}{1-y} > 1$, i.e., if $\log \dfrac{y}{1-y} > 0$
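A minimal sketch of computing $w_i$ and $w_{i0}$ from class parameters; the values of $\mu_i$, $\Sigma$, and $P(C_i)$ below are made up for illustration:

```python
import numpy as np

# Build the linear discriminant from (made-up) Gaussian class
# parameters: shared covariance Sigma, mean mu_i, prior P(C_i).
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
mu_i = np.array([1.0, 2.0])
P_Ci = 0.4

Sigma_inv = np.linalg.inv(Sigma)
w_i = Sigma_inv @ mu_i                               # w_i = Sigma^{-1} mu_i
w_i0 = -0.5 * mu_i @ Sigma_inv @ mu_i + np.log(P_Ci) # bias term

x = np.array([0.5, 1.5])
g_i = w_i @ x + w_i0                                 # linear discriminant value
print(g_i)
```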
$$\operatorname{logit}(P(C_1 \mid x)) = \log \frac{P(C_1 \mid x)}{1 - P(C_1 \mid x)} = \log \frac{p(x \mid C_1)}{p(x \mid C_2)} + \log \frac{P(C_1)}{P(C_2)}$$

With $p(x \mid C_i) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\left[-\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)\right]$, this reduces to

$$\operatorname{logit}(P(C_1 \mid x)) = w^T x + w_0$$

where

$$w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad w_0 = -\frac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_1 - \mu_2) + \log \frac{P(C_1)}{P(C_2)}$$

The inverse of the logit is the sigmoid:

$$P(C_1 \mid x) = \operatorname{sigmoid}(w^T x + w_0) = \frac{1}{1 + \exp\left[-(w^T x + w_0)\right]}$$
Sigmoid (Logistic) Function

Calculate $g(x) = w^T x + w_0$ and choose $C_1$ if $g(x) > 0$, or
calculate $y = \operatorname{sigmoid}(w^T x + w_0)$ and choose $C_1$ if $y > 0.5$
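The two decision rules are equivalent, since the sigmoid is monotone and crosses 0.5 at 0. A quick check with made-up weights:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up parameters and input, for illustration only.
w, w0 = np.array([1.0, -0.5]), 0.2
x = np.array([0.8, 0.4])

g = w @ x + w0                  # g(x) > 0  <=>  sigmoid(g(x)) > 0.5
print(g > 0, sigmoid(g) > 0.5)  # both rules give the same choice
```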
Gradient-Descent

E(w|X) is error with parameters w on sample X
  w* = arg min_w E(w | X)

Gradient:
$$\nabla_w E = \left[\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_d}\right]^T$$

Gradient-descent: Start from a random w and update w iteratively in the negative direction of the gradient
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}, \qquad w_i \leftarrow w_i + \Delta w_i$$

(figure: one step from $w^t$ to $w^{t+1}$ with step size $\eta$, decreasing the error from $E(w^t)$ to $E(w^{t+1})$)
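The update rule can be sketched on a toy error function; the quadratic $E(w) = \lVert Aw - b \rVert^2$ and the values of A, b, and η below are made up for illustration:

```python
import numpy as np

# Gradient descent on a hypothetical quadratic error E(w) = ||Aw - b||^2,
# whose unique minimizer is the solution of Aw = b, i.e. w = [1, 3].
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 3.0])

def grad_E(w):
    # dE/dw for E(w) = ||Aw - b||^2 is 2 A^T (Aw - b)
    return 2.0 * A.T @ (A @ w - b)

eta = 0.05                      # step size (learning rate)
w = np.zeros(2)                 # starting point
for _ in range(500):
    w = w - eta * grad_E(w)     # move against the gradient
print(w)                        # approaches [1, 3]
```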
Logistic Discrimination

Two classes: Assume the log likelihood ratio is linear:

$$\log \frac{p(x \mid C_1)}{p(x \mid C_2)} = w^T x + w_0^o$$

Then

$$\operatorname{logit}(P(C_1 \mid x)) = \log \frac{p(x \mid C_1)}{p(x \mid C_2)} + \log \frac{P(C_1)}{P(C_2)} = w^T x + w_0$$

where $w_0 = w_0^o + \log \dfrac{P(C_1)}{P(C_2)}$, and

$$y = \hat{P}(C_1 \mid x) = \frac{1}{1 + \exp\left[-(w^T x + w_0)\right]}$$
Training: Two Classes

$$\mathcal{X} = \{x^t, r^t\}_t, \qquad r^t \mid x^t \sim \text{Bernoulli}(y^t)$$
$$y^t = P(C_1 \mid x^t) = \frac{1}{1 + \exp\left[-(w^T x^t + w_0)\right]}$$

Likelihood:
$$l(w, w_0 \mid \mathcal{X}) = \prod_t \left(y^t\right)^{r^t} \left(1 - y^t\right)^{1 - r^t}$$

Error = negative log likelihood (cross-entropy):
$$E(w, w_0 \mid \mathcal{X}) = -\sum_t r^t \log y^t + \left(1 - r^t\right) \log\left(1 - y^t\right)$$
Training: Gradient-Descent

$$E(w, w_0 \mid \mathcal{X}) = -\sum_t r^t \log y^t + \left(1 - r^t\right) \log\left(1 - y^t\right)$$

If $y = \operatorname{sigmoid}(a)$, then $\dfrac{dy}{da} = y(1 - y)$, so

$$\Delta w_j = -\eta \frac{\partial E}{\partial w_j} = \eta \sum_t \left(\frac{r^t}{y^t} - \frac{1 - r^t}{1 - y^t}\right) y^t \left(1 - y^t\right) x_j^t = \eta \sum_t \left(r^t - y^t\right) x_j^t, \quad j = 1, \ldots, d$$

$$\Delta w_0 = \eta \sum_t \left(r^t - y^t\right)$$
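The updates above can be sketched as batch training on toy data; the two Gaussian blobs, learning rate, and iteration count below are made-up choices, not part of the slides:

```python
import numpy as np

# Batch gradient descent for a two-class logistic discriminant.
# Toy data: two well-separated Gaussian blobs, labels r in {0, 1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
r = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w, w0, eta = np.zeros(2), 0.0, 0.1
for _ in range(200):
    y = sigmoid(X @ w + w0)      # y^t = P(C1 | x^t)
    w += eta * (r - y) @ X       # Δw_j = η Σ_t (r^t − y^t) x_j^t
    w0 += eta * np.sum(r - y)    # Δw_0 = η Σ_t (r^t − y^t)

acc = np.mean((sigmoid(X @ w + w0) > 0.5) == (r == 1))
print(acc)
```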
(figures: fitted discriminant after 10, 100, and 1000 training iterations)
K>2 Classes

$$\mathcal{X} = \{x^t, r^t\}_t, \qquad r^t \mid x^t \sim \text{Mult}_K(1, y^t)$$

Assume the log likelihood ratios are linear:
$$\log \frac{p(x \mid C_i)}{p(x \mid C_K)} = w_i^T x + w_{i0}^o$$

Softmax:
$$y_i = \hat{P}(C_i \mid x) = \frac{\exp\left(w_i^T x + w_{i0}\right)}{\sum_{j=1}^{K} \exp\left(w_j^T x + w_{j0}\right)}, \quad i = 1, \ldots, K$$

Likelihood:
$$l\left(\{w_i, w_{i0}\}_i \mid \mathcal{X}\right) = \prod_t \prod_i \left(y_i^t\right)^{r_i^t}$$

Error (cross-entropy):
$$E\left(\{w_i, w_{i0}\}_i \mid \mathcal{X}\right) = -\sum_t \sum_i r_i^t \log y_i^t$$

Gradient-descent updates:
$$\Delta w_j = \eta \sum_t \left(r_j^t - y_j^t\right) x^t, \qquad \Delta w_{j0} = \eta \sum_t \left(r_j^t - y_j^t\right)$$
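A minimal softmax sketch; subtracting the maximum before exponentiating is a common numerical-stability trick, not something the slides prescribe:

```python
import numpy as np

# Softmax, as used above for y_i = P(C_i | x).
def softmax(o):
    o = o - np.max(o)            # shift for numerical stability
    e = np.exp(o)
    return e / np.sum(e)

# Hypothetical linear outputs o_i = w_i^T x + w_i0 for K = 3 classes
o = np.array([2.0, 1.0, 0.1])
y = softmax(o)
print(y)                          # probabilities summing to 1
print(np.argmax(y))               # chosen class: index 0
```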
Example

(figure)
Generalizing the Linear Model

Quadratic:
$$\log \frac{p(x \mid C_i)}{p(x \mid C_K)} = x^T W_i x + w_i^T x + w_{i0}$$

Sum of basis functions:
$$\log \frac{p(x \mid C_i)}{p(x \mid C_K)} = w_i^T \phi(x) + w_{i0}$$

where φ(x) are basis functions. Examples:
  Hidden units in neural networks (Chapters 11 and 12)
  Kernels in SVM (Chapter 13)
Discrimination by Regression

Classes are NOT mutually exclusive and exhaustive

$$r^t = y^t + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
$$y^t = \operatorname{sigmoid}(w^T x^t + w_0) = \frac{1}{1 + \exp\left[-(w^T x^t + w_0)\right]}$$

Error:
$$E(w, w_0 \mid \mathcal{X}) = \frac{1}{2} \sum_t \left(r^t - y^t\right)^2$$

Update:
$$\Delta w = \eta \sum_t \left(r^t - y^t\right) y^t \left(1 - y^t\right) x^t$$
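The regression update differs from the cross-entropy update only by the extra $y(1-y)$ factor; a sketch on made-up toy data (blob means, η, and iteration count are arbitrary choices):

```python
import numpy as np

# Regression-based training: Δw = η Σ (r−y) y (1−y) x,
# versus Δw = η Σ (r−y) x for cross-entropy.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
r = np.concatenate([np.zeros(30), np.ones(30)])

w, w0, eta = np.zeros(2), 0.0, 0.5
for _ in range(1000):
    y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))
    d = (r - y) * y * (1 - y)    # extra y(1−y) factor from the squared error
    w += eta * d @ X
    w0 += eta * np.sum(d)

y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))
acc = np.mean((y > 0.5) == (r == 1))
print(acc)
```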
Learning to Rank

Ranking: A different problem than classification or regression
Let us say xu and xv are two instances, e.g., two movies
Preferring u to v implies that g(xu) > g(xv), where g(x) is a score function, here linear: g(x) = wTx
Find a direction w such that we get the desired ranks when instances are projected along w
Ranking Error

Preferring u to v implies that we want g(xu) > g(xv), so the error on a pair is g(xv) − g(xu) if g(xu) < g(xv), and 0 otherwise
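The pairwise error above can be sketched directly; the weight vector and the preference pairs below are made-up examples:

```python
import numpy as np

# Pairwise ranking error for a linear score g(x) = w^T x:
# for each (x_u, x_v) with u preferred to v, penalize g(x_v) - g(x_u)
# whenever the pair is ordered the wrong way.
def rank_error(w, pairs):
    """pairs: list of (x_u, x_v) with u preferred to v."""
    err = 0.0
    for x_u, x_v in pairs:
        g_u, g_v = w @ x_u, w @ x_v
        if g_u < g_v:            # wrong order: penalize by the margin
            err += g_v - g_u
    return err

w = np.array([1.0, 0.0])
pairs = [(np.array([2.0, 0.0]), np.array([1.0, 0.0])),   # correctly ordered
         (np.array([0.0, 1.0]), np.array([1.0, 1.0]))]   # wrongly ordered
print(rank_error(w, pairs))      # 1.0, all from the second pair
```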