Lectures 5 & 6: Classifiers
Hilary Term 2007, A. Zisserman
- Bayesian Decision Theory: Bayes decision rule, loss functions, likelihood ratio test
- Classifiers and Decision Surfaces: discriminant functions, Normal distributions
- Linear Classifiers: the Perceptron, logistic regression
Decision Theory
Suppose we wish to make measurements on a medical image and classify it as showing evidence of cancer or not
[Diagram: image → image processing → measurement x → decision rule → C1 (cancer) or C2 (no cancer)]
and we want to base this decision on the learnt joint distribution
$$p(x, C_i) = p(x|C_i)\, p(C_i)$$
How do we make the best decision?
Classification
Assign input vector to one of two or more classes
Any decision rule divides input space into decision regions separated by decision boundaries
x → Ck
Example: two class decision depending on a 2D vector measurement
Also, would like a confidence measure (how sure are we that the input belongs to the chosen category?)
Decision boundary for minimum average error
Consider a two class decision depending on a scalar variable x
[Figure: joint densities p(x, C1) and p(x, C2) over x, with decision regions R1 and R2; a candidate boundary x̂ and the optimal boundary x0 are marked]
The number of misclassifications is minimized if the decision boundary is at x0.
Bayes decision rule
Assign x to the class Ci for which p(x, Ci) is largest. Since p(x, Ci) = p(Ci|x) p(x), this is equivalent to:
Assign x to the class Ci for which p(Ci|x) is largest.
$$p(\text{error}) = \int p(\text{error}, x)\, dx = \int_{R_1} p(x, C_2)\, dx + \int_{R_2} p(x, C_1)\, dx$$
Bayes error
A classifier is a mapping from a vector x to class labels {C1, C2}
The Bayes error is the probability of misclassification
$$p(\text{error}) = \int p(\text{error}, x)\, dx = \int_{R_1} p(x, C_2)\, dx + \int_{R_2} p(x, C_1)\, dx$$
$$= \int_{R_1} p(C_2|x)\, p(x)\, dx + \int_{R_2} p(C_1|x)\, p(x)\, dx$$
[Figure: as above, the densities p(x, C1) and p(x, C2) over x with regions R1, R2 and boundaries x̂, x0]
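As a concrete illustration (not from the lecture), the Bayes error for two 1D Gaussian class densities can be approximated numerically; the means, variances, and priors below are arbitrary choices:

```python
import numpy as np

# Hypothetical 1D example: p(x|C1) = N(0, 1), p(x|C2) = N(2, 1),
# equal priors p(C1) = p(C2) = 0.5.
def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-6.0, 8.0, 10001)
joint1 = gauss(x, 0.0, 1.0) * 0.5   # p(x, C1) = p(x|C1) p(C1)
joint2 = gauss(x, 2.0, 1.0) * 0.5   # p(x, C2) = p(x|C2) p(C2)

# The Bayes rule assigns each x to the class with the larger joint density,
# so the error contributed at each x is the smaller of the two joints.
bayes_error = np.minimum(joint1, joint2).sum() * (x[1] - x[0])
print(bayes_error)   # ~0.159 here, since the optimal boundary is at x0 = 1
```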
Example: Iris recognition
How Iris Recognition Works, John Daugman, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 1, January 2004.
Posteriors
[Figure: left, class densities p(x|C1) and p(x|C2) as functions of x; right, the corresponding posterior probabilities p(C1|x) and p(C2|x)]
Assign x to the class Ci for which p(Ci|x) is largest, i.e. class 1 if p(C1|x) > 0.5, since the posteriors sum to 1: p(C1|x) + p(C2|x) = 1, so p(C2|x) = 1 - p(C1|x).
[Figure: posterior probabilities p(C1|x) and p(C2|x) over x, with a shaded reject region where neither posterior exceeds the threshold θ]
Reject option
Avoid making decisions if unsure: reject if the larger posterior probability p(Ci|x) < θ for some threshold θ (the reject region in the figure above).
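A minimal sketch of the reject option for a two-class problem, assuming the posterior is already available and an illustrative threshold θ = 0.9:

```python
import numpy as np

def classify_with_reject(post_c1, theta=0.9):
    """Return 1 or 2 for the chosen class, or 0 to reject.

    post_c1: posterior p(C1|x); for two classes p(C2|x) = 1 - post_c1.
    Reject when the larger posterior is below the threshold theta.
    """
    post = np.array([post_c1, 1.0 - post_c1])
    if post.max() < theta:
        return 0                    # unsure: x falls in the reject region
    return int(post.argmax()) + 1

print(classify_with_reject(0.95))   # 1: confident in C1
print(classify_with_reject(0.60))   # 0: rejected
```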
Example: skin detection in video
Objective: label skin pixels (as a means to detect humans)
Two stages:
1. Training: learn likelihood for pixel colour, given skin and non-skin pixels
2. Testing: classify the pixels of a new image as skin or non-skin
[Figure: training image; training skin-pixel mask; masked pixels]
Choice of colour space
- chromaticity colour space: r = R/(R+G+B), g = G/(R+G+B)
- invariant to scaling of R, G, B, plus 2D for visualisation
[Figure: training pixels plotted in chromaticity colour space, axes r and g]
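Converting an RGB image to chromaticity coordinates is a one-liner per channel; a sketch, assuming an H×W×3 array `img` (a hypothetical input, not named in the slides):

```python
import numpy as np

def to_chromaticity(img):
    """Map an H x W x 3 RGB image to (r, g) chromaticity coordinates.

    r = R/(R+G+B), g = G/(R+G+B); invariant to scaling of R, G, B.
    """
    img = img.astype(np.float64)
    s = img.sum(axis=2) + 1e-12          # avoid division by zero on black pixels
    r = img[:, :, 0] / s
    g = img[:, :, 1] / s
    return np.stack([r, g], axis=-1)     # H x W x 2 chromaticity image
```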
[Figure: skin pixels plotted in chromaticity space, axes r = R/(R+G+B) and g = G/(R+G+B)]
Represent likelihood as Normal Distribution
$$\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$
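Fitting this Normal distribution to a set of pixels amounts to computing the sample mean and covariance; a sketch, where `pixels` is an assumed N×2 array of (r, g) chromaticity values:

```python
import numpy as np

def fit_gaussian(pixels):
    """Maximum-likelihood mean and covariance of an N x d sample."""
    mu = pixels.mean(axis=0)
    sigma = np.cov(pixels, rowvar=False)
    return mu, sigma

def gaussian_pdf(x, mu, sigma):
    """Evaluate N(x | mu, sigma) for each row of an N x d array x."""
    d = mu.size
    diff = x - mu
    inv = np.linalg.inv(sigma)
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu), one value per row
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
    return norm * np.exp(-0.5 * quad)
```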
[Figure: p(x|background), a Gaussian fitted to the background pixels in chromaticity space (axes r, g)]
[Figure: p(x|skin), a Gaussian fitted to the skin pixels in chromaticity space (axes r, g)]
[Figure: contours of p(x|skin) and p(x|background) in chromaticity space; left, contours of the two Gaussians; right, a 3D view with likelihood on the vertical axis]
Posterior probability of skin given pixel colour
Assume equal prior probabilities, i.e. the probability of a pixel being skin is 0.5.
The posterior probability of skin is given by Bayes' rule:
$$P(\text{skin}|\mathbf{x}) = \frac{p(\mathbf{x}|\text{skin})\, P(\text{skin})}{p(\mathbf{x})}$$
where
$$p(\mathbf{x}) = p(\mathbf{x}|\text{skin})\, P(\text{skin}) + p(\mathbf{x}|\text{background})\, P(\text{background})$$
is the marginal pdf of x.
NB: the posterior depends on both the foreground and background likelihoods, i.e. it involves both distributions.
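Putting the pieces together, the posterior follows directly from Bayes' rule; a sketch reusing the hypothetical `gaussian_pdf` helper above, with equal priors as assumed in the lecture:

```python
# x: N x 2 array of (r, g) chromaticity values;
# (mu_s, cov_s), (mu_b, cov_b): Gaussians fitted to skin / background pixels.
def posterior_skin(x, mu_s, cov_s, mu_b, cov_b, prior_skin=0.5):
    lik_skin = gaussian_pdf(x, mu_s, cov_s)      # p(x|skin)
    lik_bg = gaussian_pdf(x, mu_b, cov_b)        # p(x|background)
    # marginal p(x) = p(x|skin) P(skin) + p(x|background) P(background)
    marginal = lik_skin * prior_skin + lik_bg * (1.0 - prior_skin)
    return lik_skin * prior_skin / marginal

# classify as skin where the posterior exceeds 0.5:
# mask = posterior_skin(x, mu_s, cov_s, mu_b, cov_b) > 0.5
```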
Assess performance on the training image
[Figure: input image; likelihoods P(x|skin) and P(x|background); posterior P(skin|x); decision P(skin|x) > 0.5]
The posterior depends on the likelihoods (Gaussians) of both classes.
Test performance on other frames
[Figure: test frames showing p(x|background), p(x|skin), the posterior p(skin|x), and the decision p(skin|x) > 0.5]
Receiver Operating Characteristic (ROC) curve
In many algorithms there is a threshold that affects performance.
e.g. true positive: a skin pixel classified as skin; false positive: a background pixel classified as skin.
[Figure: ROC curve plotting true positives against false positives, both on [0, 1]; the curve is traced out as the threshold decreases, and curves nearer the diagonal indicate worse performance]
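An ROC curve can be traced by sweeping the decision threshold over the posterior; a minimal sketch assuming arrays of posterior scores and ground-truth labels:

```python
import numpy as np

def roc_points(scores, labels, num=101):
    """True/false positive rates as the threshold sweeps from 1 down to 0.

    scores: p(skin|x) per pixel; labels: 1 for skin, 0 for background.
    """
    tprs, fprs = [], []
    for t in np.linspace(1.0, 0.0, num):          # decreasing threshold
        pred = scores > t
        tprs.append((pred & (labels == 1)).sum() / max((labels == 1).sum(), 1))
        fprs.append((pred & (labels == 0)).sum() / max((labels == 0).sum(), 1))
    return np.array(fprs), np.array(tprs)
```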
Loss function revisited
Consider again the cancer diagnosis example. The consequences of an incorrect classification vary between the following cases:
False positive: the patient does not have cancer but is classified as having it → distress, plus unnecessary further investigation.
False negative: the patient does have cancer but is classified as not having it → no treatment, premature death.
The two other cases are true positive and true negative.
Because the consequences of a false negative far outweigh the others, rather than simply minimizing the number of mistakes, a loss function is minimized instead.
Loss matrix
The risk of classifying x as class Ci is
$$R(C_i|\mathbf{x}) = \sum_j L_{ij}\, p(C_j|\mathbf{x})$$
with loss matrix Lij (rows: classification, columns: truth):

                        truth: cancer            truth: normal
classified cancer       0    (true +ve)          1    (false +ve)
classified normal       1000 (false -ve)         0    (true -ve)
Bayes Risk
The class-conditional risk of taking action ai given measurement x is
$$R(a_i|\mathbf{x}) = \sum_j L(a_i|C_j)\, p(C_j|\mathbf{x})$$
where L(ai|Cj) is the loss incurred if action i is taken and the true state is j.
Bayes decision rule: select the action for which R(ai|x) is minimum,
$$a_i = \arg\min_{a_i} R(a_i|\mathbf{x})$$
This decision minimizes the Bayes risk, i.e. the expected loss.
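A sketch of the minimum-risk rule using the cancer/normal loss matrix above; the posterior values in the usage line are placeholders:

```python
import numpy as np

# Loss matrix from the slide: rows = action (classify cancer / normal),
# columns = true class (cancer / normal).
L = np.array([[0.0,    1.0],
              [1000.0, 0.0]])

def min_risk_action(posteriors):
    """posteriors: [p(cancer|x), p(normal|x)]. Returns index of best action."""
    risks = L @ np.asarray(posteriors)   # R(a_i|x) = sum_j L_ij p(C_j|x)
    return int(np.argmin(risks))

# Even a small cancer posterior triggers the 'cancer' action:
print(min_risk_action([0.01, 0.99]))  # 0: classify as cancer (risk 0.99 < 10)
```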
Likelihood ratio
Two-category classification with a loss function. The conditional risks are
$$R(a_1|\mathbf{x}) = L_{11}\, p(C_1|\mathbf{x}) + L_{12}\, p(C_2|\mathbf{x})$$
$$R(a_2|\mathbf{x}) = L_{21}\, p(C_1|\mathbf{x}) + L_{22}\, p(C_2|\mathbf{x})$$
Thus for minimum risk, decide C1 if
$$L_{11}\, p(C_1|\mathbf{x}) + L_{12}\, p(C_2|\mathbf{x}) < L_{21}\, p(C_1|\mathbf{x}) + L_{22}\, p(C_2|\mathbf{x})$$
i.e. if
$$p(C_2|\mathbf{x})(L_{12} - L_{22}) < p(C_1|\mathbf{x})(L_{21} - L_{11})$$
$$p(\mathbf{x}|C_2)\, p(C_2)(L_{12} - L_{22}) < p(\mathbf{x}|C_1)\, p(C_1)(L_{21} - L_{11})$$
Assuming L21 - L11 > 0, decide C1 if
$$\frac{p(\mathbf{x}|C_1)}{p(\mathbf{x}|C_2)} > \frac{p(C_2)(L_{12} - L_{22})}{p(C_1)(L_{21} - L_{11})}$$
i.e. if the likelihood ratio exceeds a threshold that is independent of x.
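For instance (a worked example, not on the slides), plugging in the loss matrix above, with L11 = L22 = 0, L12 = 1, L21 = 1000 and equal priors, the threshold becomes
$$\frac{p(\mathbf{x}|C_1)}{p(\mathbf{x}|C_2)} > \frac{0.5 \times 1}{0.5 \times 1000} = \frac{1}{1000}$$
so even weak evidence for cancer (C1) exceeds the threshold, and the rule errs heavily on the side of diagnosing cancer.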
Discriminant functions
A two-category classifier can often be written in the form
g(x) > 0: assign x to C1
g(x) < 0: assign x to C2
where g(x) is a discriminant function, and g(x) = 0 is a discriminant surface. In 2D, g(x) = 0 is a set of curves.
Example
In the minimum average error classifier, the assignment rule is: decide C1 if the posterior p(C1|x) > p(C2|x). The equivalent discriminant function is
$$g(\mathbf{x}) = p(C_1|\mathbf{x}) - p(C_2|\mathbf{x})$$
or
$$g(\mathbf{x}) = \ln \frac{p(C_1|\mathbf{x})}{p(C_2|\mathbf{x})}$$
Note: these two functions are not equal, but the decision boundaries are the same.
Developing this further:
$$g(\mathbf{x}) = \ln\frac{p(C_1|\mathbf{x})}{p(C_2|\mathbf{x})} = \ln\frac{p(\mathbf{x}|C_1)}{p(\mathbf{x}|C_2)} + \ln\frac{p(C_1)}{p(C_2)}$$
Decision surfaces for Normal distributions
Suppose that the likelihoods are Normal:
$$p(\mathbf{x}|C_1) \sim \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1), \qquad p(\mathbf{x}|C_2) \sim \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$$
Then
$$g(\mathbf{x}) = \ln\frac{p(\mathbf{x}|C_1)}{p(\mathbf{x}|C_2)} + \ln\frac{p(C_1)}{p(C_2)} = \ln p(\mathbf{x}|C_1) - \ln p(\mathbf{x}|C_2) + \ln\frac{p(C_1)}{p(C_2)}$$
$$= -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^\top \Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^\top \Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) + c_0$$
where
$$c_0 = \ln\frac{p(C_1)}{p(C_2)} - \frac{1}{2}\ln|\Sigma_1| + \frac{1}{2}\ln|\Sigma_2|$$
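This quadratic discriminant is straightforward to evaluate; a sketch assuming fitted parameters (mu1, S1), (mu2, S2) and priors p1, p2 (all hypothetical names):

```python
import numpy as np

def quadratic_discriminant(x, mu1, S1, mu2, S2, p1=0.5, p2=0.5):
    """g(x) for Normal likelihoods; g(x) > 0 assigns x to C1."""
    d1, d2 = x - mu1, x - mu2
    q1 = d1 @ np.linalg.inv(S1) @ d1     # (x - mu1)^T S1^{-1} (x - mu1)
    q2 = d2 @ np.linalg.inv(S2) @ d2
    c0 = (np.log(p1 / p2)
          - 0.5 * np.log(np.linalg.det(S1))
          + 0.5 * np.log(np.linalg.det(S2)))
    return -0.5 * q1 + 0.5 * q2 + c0
```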
Case 1: Σi = σ²I
g(