Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona


Page 1: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Elements of Pattern Recognition

CNS/EE-148 -- Lecture 5

M. Weber, P. Perona

Page 2: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

What is Classification?

• We want to assign objects to classes based on a selection of attributes (features).

• Examples:
  – (age, income) → {credit worthy, not credit worthy}
  – (blood cell count, body temp) → {flu, hepatitis B, hepatitis C}
  – (pixel vector) → {Bill Clinton, coffee cup}

• Feature vector can be continuous, discrete or mixed.

Page 3: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

What is Classification?

• Want to find a function from measurements to class labels → a decision boundary:

c : \mathbb{R}^2 \rightarrow \{C_0, C_1, C_2, \ldots\}

[Figure: the space of feature vectors (x1, x2), with regions labeled Signal 1, Signal 2, and Noise]

• Statistical methods use pdf: p(C,x)

• Assume p(C,x) known for now


Page 4: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Some Terminology

• p(C) is called a prior or a priori probability

• p(x|C) is called a class-conditional density

or likelihood of C with respect to x

• p(C|x) is called a posterior or

a posteriori probability

Page 5: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Examples

• One measurement, symmetric cost, equal priors

[Figure: class-conditional densities p(x|C1) and p(x|C2) over x, with a badly placed ("bad") decision threshold]

P(\mathrm{error} \mid x) = \begin{cases} P(C_1 \mid x) & \text{if } c(x) = C_2 \\ P(C_2 \mid x) & \text{if } c(x) = C_1 \end{cases}

P(\mathrm{error}) = \int P(\mathrm{error} \mid x)\, p(x)\, dx
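A minimal numerical sketch of this error integral for two 1-D Gaussian class conditionals with equal priors; the means and variances below are illustrative assumptions, not values from the lecture.

# Numerical estimate of P(error) for two assumed 1-D Gaussian class conditionals
# with equal priors, using the optimal (minimum-posterior-error) rule.
import numpy as np
from scipy.stats import norm

x = np.linspace(-10.0, 10.0, 20001)
p_x_c1 = norm.pdf(x, loc=-1.0, scale=1.0)   # p(x|C1), assumed
p_x_c2 = norm.pdf(x, loc=+1.0, scale=1.0)   # p(x|C2), assumed
prior = 0.5                                  # equal priors

# With the optimal threshold, P(error|x) is the smaller posterior at each x,
# so P(error) = integral of prior * min(p(x|C1), p(x|C2)) dx here.
p_error = np.trapz(prior * np.minimum(p_x_c1, p_x_c2), x)
print(f"Bayes error estimate: {p_error:.4f}")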

Page 6: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Examples

• One measurement, symmetric cost, equal priors

[Figure: the same densities p(x|C1) and p(x|C2), now with a good (optimal) threshold where the two densities cross]

Page 7: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

How to Make the Best Decision? (Bayes Decision Theory)

• Define a cost function for mistakes, e.g.

• Minimize expected loss (risk) over entire p(C,x).

• Sufficient to assure optimal decision for each individual x.

• Result: decide according to maximum posterior probability:

L(C_i, C_j) = 1 - \delta_{ij}

R = E\left[L(C, c(x))\right] = E\left[\,E[L(C, c(x)) \mid x]\,\right] = \int E[L(C, c(x)) \mid x]\; p(x)\, dx

E[L(C, c(x)) \mid x] = \sum_{i=1}^{N} L(C_i, c(x))\; p(C_i \mid x)

c(x) = \arg\max_i \; p(C_i \mid x)
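A small sketch of the resulting rule: under 0-1 loss the expected loss of deciding C_j at x is 1 − p(C_j|x), so picking the maximum posterior minimizes it. The posterior values used here are made-up placeholders.

# Bayes decision rule under 0-1 loss: pick the class with the largest posterior.
import numpy as np

def bayes_decide(posteriors):
    """posteriors: array of p(C_i|x) for one x; returns the chosen class index."""
    return int(np.argmax(posteriors))

def expected_loss(posteriors, j):
    # Expected loss of deciding class j at this x, with L(i,j) = 1 - delta_ij.
    return sum(p for i, p in enumerate(posteriors) if i != j)   # = 1 - p(C_j|x)

post = np.array([0.2, 0.7, 0.1])                # assumed posteriors for some x
print(bayes_decide(post))                        # -> 1
print([round(expected_loss(post, j), 2) for j in range(3)])   # minimized at j = 1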

Page 8: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Two Classes, C1, C2

• It is helpful to consider the likelihood ratio:

• Use known priors p(Ci) or ignore them.

• For a more elaborate loss function (the proof is easy):

• g(x) is called a discriminant function

\frac{p(C_1 \mid x)}{p(C_2 \mid x)} = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}

g(x) \equiv \frac{p(x \mid C_1)}{p(x \mid C_2)} \;\ge\; \frac{l_{12} - l_{22}}{l_{21} - l_{11}} \cdot \frac{p(C_2)}{p(C_1)} \;\;?
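A hedged sketch of the two-class likelihood-ratio test; the Gaussian class conditionals, losses, and priors are assumptions chosen only to make the threshold concrete.

# Likelihood-ratio test with the threshold from the slide; densities are assumed.
from scipy.stats import norm

def decide(x, l12=1.0, l21=1.0, l11=0.0, l22=0.0, p1=0.5, p2=0.5):
    g = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 2.0, 1.0)   # g(x) = p(x|C1)/p(x|C2)
    threshold = (l12 - l22) / (l21 - l11) * (p2 / p1)
    return "C1" if g >= threshold else "C2"

print(decide(0.5), decide(1.5))   # two points on either side of the boundary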

Page 9: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Discriminant Functions for Multivariate Gaussian Class Conditional Densities

• Two multivariate Gaussians in d dimensions
• Since log is monotonic, we can look at log g(x).

\log g(x) = \log \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = g_1(x) - g_2(x)

g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_i| + \log p(C_i)

(the first term is −1/2 times the squared Mahalanobis distance; the d/2·log 2π term is superfluous, since it is the same for every class)

Page 10: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Mahalanobis Distance

• iso-distance lines = iso-probability lines

• Decision surface:

d_i^2(x) = (x - \mu_i)^T \Sigma^{-1} (x - \mu_i), \qquad d_1^2(x) - d_2^2(x) = \text{const.}

[Figure: iso-distance ellipses around μ1 and μ2 in the (x1, x2) plane, with the decision surface between them]
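A minimal sketch of the squared Mahalanobis distance and of the sign of d1²(x) − d2²(x); the covariance and means are assumed for illustration.

# Squared Mahalanobis distance and the quantity defining the decision surface.
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))   # (x - mu)^T Sigma^{-1} (x - mu)

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
x = np.array([1.5, 0.4])
print(mahalanobis_sq(x, mu1, Sigma) - mahalanobis_sq(x, mu2, Sigma))  # sign picks the class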

Page 11: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona
Page 12: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Case 1: Σi = σ²I

• Discriminant functions…

g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\log|\Sigma_i| + \log p(C_i)

• …simplify to:

\begin{aligned}
g_1(x) - g_2(x) &= -\frac{1}{2\sigma^2}\left(\|x - \mu_1\|^2 - \|x - \mu_2\|^2\right) + \log\frac{p(C_1)}{p(C_2)} \\
&= -\frac{1}{2\sigma^2}\left(x^T x - 2\mu_1^T x + \mu_1^T \mu_1 - x^T x + 2\mu_2^T x - \mu_2^T \mu_2\right) + \log\frac{p(C_1)}{p(C_2)} \\
&= \frac{1}{\sigma^2}(\mu_1 - \mu_2)^T x - \frac{1}{2\sigma^2}\left(\mu_1^T \mu_1 - \mu_2^T \mu_2\right) + \log\frac{p(C_1)}{p(C_2)}
\end{aligned}

Page 13: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Decision Boundary

g_1(x) - g_2(x) = 0 \;\Rightarrow\; (\mu_1 - \mu_2)^T x = \frac{1}{2}\left(\mu_1^T \mu_1 - \mu_2^T \mu_2\right) - \sigma^2 \log\frac{p(C_1)}{p(C_2)}

• If μ2 = 0, we obtain…

\mu_1^T x = \frac{1}{2}\,\mu_1^T \mu_1 - \sigma^2 \log\frac{p(C_1)}{p(C_2)}

The matched filter! With an expression for the threshold.
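A small sketch of the resulting "matched filter" rule for Σi = σ²I, comparing (μ1 − μ2)ᵀx to the threshold above; the means, σ², and priors are illustrative assumptions.

# Linear (matched-filter) decision rule for equal spherical covariances.
import numpy as np

mu1, mu2 = np.array([2.0, 1.0]), np.array([0.0, 0.0])
sigma2, p1, p2 = 1.0, 0.5, 0.5

w = mu1 - mu2
threshold = 0.5 * (mu1 @ mu1 - mu2 @ mu2) - sigma2 * np.log(p1 / p2)

x = np.array([1.2, 0.3])
print("C1" if w @ x >= threshold else "C2")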

Page 14: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Two Signals and Additive White Gaussian Noise

[Figure: feature space (x1, x2) with Signal 1 at μ1, Signal 2 at μ2, an observation x, and the vectors μ1 − μ2 and x − μ2]

(\mu_1 - \mu_2)^T x = \frac{1}{2}\left(\mu_1^T \mu_1 - \mu_2^T \mu_2\right) - \sigma^2 \log\frac{p(C_1)}{p(C_2)}

equivalently

(\mu_1 - \mu_2)^T (x - \mu_2) = \frac{1}{2}\,\|\mu_1 - \mu_2\|^2 - \sigma^2 \log\frac{p(C_1)}{p(C_2)}

Page 15: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Case 2: Σi = Σ

• Two classes, 2D measurements, p(x|C) are multivariate Gaussians with equal covariance matrices.

• Derivation is similar:
  – The quadratic term vanishes since it is independent of the class
  – We obtain a linear decision surface (see the sketch below)

• Matlab demo
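The sketch mentioned above: a minimal Python version of the Case 2 linear discriminant w = Σ⁻¹(μ1 − μ2) with its bias term (not the lecture's Matlab demo); all parameter values are assumptions.

# Linear discriminant for two Gaussian classes sharing one covariance matrix.
import numpy as np

mu1, mu2 = np.array([2.0, 0.0]), np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
p1, p2 = 0.5, 0.5

w = np.linalg.solve(Sigma, mu1 - mu2)                    # Sigma^{-1} (mu1 - mu2)
w0 = -0.5 * (mu1 + mu2) @ w + np.log(p1 / p2)            # bias term

x = np.array([1.0, 0.5])
print("C1" if w @ x + w0 >= 0 else "C2")                 # linear decision surface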

Page 16: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Case 3: General Covariance Matrix

• See transparency

Page 17: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Isn’t this too simple?

• Not at all…

• It is true that images form complicated manifolds (from a pixel point of view, translation, rotation and scaling are all highly non-linear operations)

• The high dimensionality helps

Page 18: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Assume Unknown Class Densities

• In real life, we do not know the class conditional densities.

• But we do have example data.

• This puts us in the typical machine learning scenario: we want to learn a function, c(x), from examples.

• Why not just estimate class densities from examples and apply the previous ideas?
  – Learn a Gaussian (a simple density): in N dimensions you need at least N² samples!
    • 10x10 pixels → 10,000 examples!
  – Avoid estimating densities whenever you can! (too general)
  – The posterior is generally simpler than the class conditional (see transparency)

Page 19: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Remember PCA?

• Principal components are

eigenvectors of covariance matrix

• Use reconstruction error for recognition (e.g. Eigenfaces)
  – good:
    • reduces dimensionality
  – bad:
    • no model within subspace
    • linearity may be inappropriate
    • covariance not appropriate to optimize discrimination

\frac{1}{N}\sum_i (x_i - \mu)(x_i - \mu)^T = C = U S U^T

x \approx \hat{x} = \mu + U z

[Figure: data cloud in the (x1, x2) plane with mean μ, principal direction u1, and a sample x]
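A compact sketch of PCA as written above, on synthetic data: eigenvectors of the sample covariance, projection onto the leading component, and the reconstruction x̂ = μ + Uz.

# PCA via the eigendecomposition of the sample covariance; data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # assumed data

mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / len(X)          # covariance = U S U^T
S, U = np.linalg.eigh(C)                    # eigenvalues in ascending order
U = U[:, ::-1]                              # put the leading component first

k = 1                                       # keep the leading component
Z = (X - mu) @ U[:, :k]                     # project: z
X_hat = mu + Z @ U[:, :k].T                 # reconstruct: x_hat = mu + U z
print("mean reconstruction error:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))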

Page 20: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Fisher’s Linear Discriminant

• Goal: Reduce dimensionality before training classifiers etc. (Feature Selection)

• Similar goal as PCA!

• Fisher has classification in mind…

• Find projection directions such that separation is easiest

• Eigenfaces vs. Fisherfaces

[Figure: feature space (x1, x2)]

Page 21: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Fisher’s Linear Discriminant

• Assume we have n d-dimensional samples x1,…,xn

• n1 from set (class) X1 and n2 from set X2

• we form linear combinations:

• and obtain y1, …, yn

• only direction of w is important

y = w^T x

Page 22: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Objective for Fisher

• Measure the separation as the distance between the means

after projecting (k = 1,2):

• Measure the scatter after projecting:

• Objective becomes to maximize

\tilde{m}_k = \frac{1}{n_k}\sum_{y \in Y_k} y = \frac{1}{n_k}\sum_{x \in X_k} w^T x = w^T m_k

\tilde{s}_k^2 = \sum_{y \in Y_k} \left(y - \tilde{m}_k\right)^2

J(w) = \frac{\left(\tilde{m}_1 - \tilde{m}_2\right)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}

Page 23: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

• We need to make the dependence on w explicit:

• Defining the within-class scatter matrix, SW=S1+S2, we obtain

• Similarly for the separation (between-class scatter matrix)

• Finally we can write

\tilde{s}_k^2 = \sum_{x \in X_k} \left(w^T x - w^T m_k\right)^2 = \sum_{x \in X_k} w^T (x - m_k)(x - m_k)^T w \equiv w^T S_k w

\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w

\left(\tilde{m}_1 - \tilde{m}_2\right)^2 = \left(w^T m_1 - w^T m_2\right)^2 = w^T (m_1 - m_2)(m_1 - m_2)^T w = w^T S_B w

J(w) = \frac{w^T S_B w}{w^T S_W w}

Page 24: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Fisher’s Solution

• J(w) is called a generalized Rayleigh quotient. Any w that maximizes J must satisfy the generalized eigenvalue problem:

• Since SB is singular (rank 1) and SBw is always in the direction of (m1 − m2), we are done:

J(w) = \frac{w^T S_B w}{w^T S_W w}

S_B w = \lambda\, S_W w

w = S_W^{-1}\left(m_1 - m_2\right)
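A minimal sketch of Fisher's solution on synthetic two-class data: build S_W from the two class scatters and take w = S_W⁻¹(m1 − m2).

# Fisher's linear discriminant direction on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
X2 = rng.normal([3.0, 1.0], 1.0, size=(100, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)                 # within-class scatter of class 1
S2 = (X2 - m2).T @ (X2 - m2)
S_W = S1 + S2

w = np.linalg.solve(S_W, m1 - m2)            # direction maximizing J(w)
w /= np.linalg.norm(w)                       # only the direction matters
print("Fisher direction:", w)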

Page 25: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Comments on FLD

• We did not follow Bayes Decision Theory

• FLD is useful for many types of densities

• Fisher can be extended (see demo):
  – more than one projection direction
  – more than two clusters

• Let’s try it out: Matlab Demo

Page 26: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Fisher vs. Bayes

• Assume the class densities are Gaussian with identical covariance matrices; then Bayes says:

w^T x + w_0 = 0, \qquad w = \Sigma^{-1}(\mu_1 - \mu_2)

• while Fisher says:

w = S_W^{-1}(m_1 - m_2)

• Since SW is proportional to the covariance matrix, w points in the same direction in both cases.

• Comforting...
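A quick numerical check of this claim on synthetic data with a shared covariance: the Fisher direction S_W⁻¹(m1 − m2) and the Bayes direction Σ⁻¹(μ1 − μ2) should agree up to scale. All parameter values are assumptions.

# Compare the Fisher and Bayes directions for two classes with one covariance.
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[2.0, 0.7], [0.7, 1.0]])
L = np.linalg.cholesky(Sigma)
X1 = rng.normal(size=(2000, 2)) @ L.T + np.array([0.0, 0.0])
X2 = rng.normal(size=(2000, 2)) @ L.T + np.array([2.0, 1.0])

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w_fisher = np.linalg.solve(S_W, m1 - m2)
w_bayes = np.linalg.solve(Sigma, m1 - m2)
cos = w_fisher @ w_bayes / (np.linalg.norm(w_fisher) * np.linalg.norm(w_bayes))
print("cosine between directions:", cos)     # close to 1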

Page 27: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

What have we achieved?

• Found out that maximum posterior strategy is optimal. Always.

• Looked at different cases of Gaussian class densities, where we could derive simple decision rules.

• Gaussian classifiers do a reasonable job!

• Learned about FLD, which is useful and often preferable to PCA.

Page 28: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Just for Fun: Support Vector Machine

• Very fashionable… state of the art?

• Does not model densities

• Fits decision surface directly

• Maximizes the margin → reduces “complexity”

• Decision surface only depends on nearby samples

• Matlab Demo

[Figure: feature space (x1, x2)]
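A hedged sketch of a linear SVM on toy data, assuming scikit-learn is available (this is not the lecture's Matlab demo); it illustrates the points above: no density model, a margin-based boundary, and dependence only on the nearby (support) samples.

# Linear SVM on synthetic two-class data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([4, 3], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors per class:", clf.n_support_)
print("prediction for [2, 1.5]:", clf.predict([[2.0, 1.5]]))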

Page 29: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Learning Algorithms

[Diagram: a learning algorithm takes examples (xi, yi) drawn from p(x, y), together with a set of candidate functions, and outputs a learned function y = f(x); f = ?]

Page 30: Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona

Assume Unknown Class Densities

• SVM Examples

• Densities are hard to estimate → avoid it
  – example from Ripley

• Give intuitions on overfitting

• Need to learn
  – Standard machine learning problem
  – Training/test sets