
  • Lectures 5 & 6: Classifiers. Hilary Term 2007, A. Zisserman

    Bayesian Decision Theory: Bayes decision rule, loss functions, likelihood ratio test

    Classifiers and Decision Surfaces: discriminant function, Normal distributions

    Linear Classifiers: the Perceptron, logistic regression

    Decision Theory

    Suppose we wish to make measurements on a medical image and classify it as showing evidence of cancer or not

    [Pipeline: image → image processing → measurement x → decision rule → C1 (cancer) or C2 (no cancer)]

    and we want to base this decision on the learnt joint distribution

    p(x, Ci) = p(x|Ci) p(Ci)

    How do we make the best decision?

  • Classification

    Assign input vector to one of two or more classes

    Any decision rule divides input space into decision regions separated by decision boundaries

    x → Ck

    Example: two class decision depending on a 2D vector measurement

    Also, would like a confidence measure (how sure are we that the input belongs to the chosen category?)

  • Decision Boundary for average error

    Consider a two class decision depending on a scalar variable x

    [Figure: joint densities p(x, C1) and p(x, C2) plotted against x; a decision boundary at x̂ splits the axis into regions R1 and R2, and x0 marks the boundary where the two curves cross.]

    minimize number of misclassifications if the decision boundary is at x0

    Bayes decision rule: assign x to the class Ci for which p(x, Ci) is largest.

    Since p(x, Ci) = p(Ci|x) p(x), this is equivalent to: assign x to the class Ci for which p(Ci|x) is largest.

    p(error) = ∫ p(error, x) dx = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx

    Bayes error

    A classifier is a mapping from a vector x to class labels {C1, C2}

    The Bayes error is the probability of misclassification

    p(error) = ∫ p(error, x) dx
             = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx
             = ∫_{R1} p(C2|x) p(x) dx + ∫_{R2} p(C1|x) p(x) dx

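    To make the decision rule and the Bayes error concrete, here is a minimal numerical sketch (not from the lecture notes) for two 1D classes with Gaussian class-conditional densities; numpy and the particular means, standard deviations and priors are illustrative assumptions.

    ```python
    import numpy as np

    # Assumed toy setup: two 1D classes with Gaussian likelihoods and equal priors.
    mu1, sigma1, prior1 = 0.3, 0.10, 0.5   # class C1
    mu2, sigma2, prior2 = 0.6, 0.15, 0.5   # class C2

    def gaussian(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    x = np.linspace(-0.5, 1.5, 20001)
    joint1 = gaussian(x, mu1, sigma1) * prior1   # p(x, C1) = p(x|C1) p(C1)
    joint2 = gaussian(x, mu2, sigma2) * prior2   # p(x, C2) = p(x|C2) p(C2)

    # Bayes decision rule: assign x to the class with the largest joint p(x, Ci)
    # (equivalently, the largest posterior p(Ci|x)).
    decide_C1 = joint1 > joint2

    # Bayes error: integrate the joint density of the class NOT chosen in each region.
    dx = x[1] - x[0]
    bayes_error = np.sum(np.where(decide_C1, joint2, joint1)) * dx
    print(f"Bayes error ~ {bayes_error:.4f}")
    ```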

  • Example: Iris recognition

    How Iris Recognition Works, John Daugman, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 1, January 2004

  • Posteriors

    [Figure: left, class densities p(x|C1) and p(x|C2) plotted against x; right, the corresponding posterior probabilities p(C1|x) and p(C2|x).]

    Assign x to the class Ci for which p( Ci | x ) is largest

    i.e. class i if p(Ci|x) > 0.5

    The posteriors sum to 1: p(C1|x) + p(C2|x) = 1, so p(C2|x) = 1 − p(C1|x).

    [Figure: posterior probabilities p(C1|x) and p(C2|x) against x, with the reject region marked where neither posterior exceeds the threshold.]

    Reject option: avoid making decisions if unsure; reject if the largest posterior probability p(Ci|x) falls below a chosen threshold θ (the reject region in the figure above).
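    A small sketch of the two-class decision rule with a reject option; the helper name, the equal priors, and the 0.8 threshold are illustrative assumptions rather than values from the slides.

    ```python
    def classify_with_reject(lik1, lik2, prior1=0.5, prior2=0.5, threshold=0.8):
        """Return 1, 2, or 0 (reject) given the two class likelihoods p(x|C1), p(x|C2)."""
        post1 = lik1 * prior1 / (lik1 * prior1 + lik2 * prior2)  # p(C1|x) by Bayes' rule
        post2 = 1.0 - post1                                      # posteriors sum to 1
        best, post = (1, post1) if post1 >= post2 else (2, post2)
        return best if post >= threshold else 0  # reject if the winning posterior is too low

    # Example: p(x|C1) = 2.0, p(x|C2) = 1.5 gives p(C1|x) ~ 0.57 < 0.8, so the input is rejected.
    print(classify_with_reject(2.0, 1.5))   # 0 (reject)
    ```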

  • Example: skin detection in video

    Objective: label skin pixels (as a means to detect humans)

    Two stages:

    1. Training: learn likelihood for pixel colour, given skin and non-skin pixels

    2. Testing: classify a new image into skin regions

    [Figure: training image, training skin-pixel mask, and the resulting masked pixels.]

    [Figure: masked training pixels plotted in chromaticity colour space, with axes r = R/(R+G+B) and g = G/(R+G+B).]

    Choice of colour space: chromaticity colour space, r = R/(R+G+B), g = G/(R+G+B). It is invariant to scaling of R, G, B, and being 2D it is easy to visualise.
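    A possible implementation of the chromaticity conversion described above; numpy and the function name are assumptions, and the small epsilon guarding division by zero is an implementation detail.

    ```python
    import numpy as np

    def rgb_to_chromaticity(image, eps=1e-8):
        """Map an H x W x 3 RGB image to (r, g) chromaticity coordinates.

        r = R / (R + G + B), g = G / (R + G + B): invariant to scaling of (R, G, B)
        and only 2D, which makes the class distributions easy to visualise.
        """
        image = image.astype(np.float64)
        total = image.sum(axis=2) + eps          # R + G + B per pixel
        r = image[..., 0] / total
        g = image[..., 1] / total
        return r, g
    ```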

  • [Figure: skin pixels plotted in chromaticity space, r = R/(R+G+B) against g = G/(R+G+B).]

    Represent likelihood as Normal Distribution

    N(x | μ, Σ) = 1 / ( (2π)^{n/2} |Σ|^{1/2} ) · exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) )
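    As a sanity check on the formula, here is a direct numpy evaluation of the n-dimensional Normal density (a sketch; in practice scipy.stats.multivariate_normal.pdf computes the same quantity).

    ```python
    import numpy as np

    def normal_density(x, mu, cov):
        """Evaluate N(x | mu, cov) for an n-dimensional x using the formula above."""
        x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
        n = x.size
        diff = x - mu
        norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))
        quad = diff @ np.linalg.solve(cov, diff)     # (x - mu)^T cov^{-1} (x - mu)
        return np.exp(-0.5 * quad) / norm

    # Example in 2D (chromaticity space): density at the mean of a small diagonal covariance
    print(normal_density([0.4, 0.3], [0.4, 0.3], np.diag([0.01, 0.01])))
    ```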

    [Figure: Gaussian fitted to the background pixels, p(x|background), plotted over chromaticity space (r, g).]

    [Figure: Gaussian fitted to the skin pixels, p(x|skin), plotted over chromaticity space (r, g).]
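    A minimal sketch of the training stage: fit a mean and covariance to the chromaticity values of the skin and background pixels. The rgb_to_chromaticity helper from the earlier sketch, the mask convention (True = skin), and the function names are assumptions.

    ```python
    import numpy as np

    def fit_gaussian(points):
        """Maximum-likelihood mean and covariance of an N x 2 array of (r, g) samples."""
        mu = points.mean(axis=0)
        cov = np.cov(points, rowvar=False)
        return mu, cov

    def train_skin_model(image, skin_mask):
        """Fit p(x|skin) and p(x|background) from a training image and a boolean skin mask."""
        r, g = rgb_to_chromaticity(image)            # from the earlier sketch
        x = np.stack([r.ravel(), g.ravel()], axis=1)
        mask = skin_mask.ravel().astype(bool)
        return fit_gaussian(x[mask]), fit_gaussian(x[~mask])
    ```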

  • [Figure: contours of p(x|skin) and p(x|background) in chromaticity space (r, g), and a 3D view of the two Gaussians with likelihood on the vertical axis.]

    Posterior probability of skin given pixel colour

    Assume equal prior probabilities, i.e. the probability of a pixel being skin is 0.5.

    The posterior probability of skin is given by Bayes' rule:

    P(skin|x) = p(x|skin) P(skin) / p(x)

    where p(x) = p(x|skin) P(skin) + p(x|background) P(background), i.e. the marginal pdf of x.

    NB: the posterior depends on both foreground and background likelihoods, i.e. it involves both distributions.
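    A sketch of the posterior computation just described, reusing the hypothetical normal_density helper and the fitted (mean, covariance) pairs from the earlier sketches; equal priors of 0.5 are used, as on the slide.

    ```python
    def skin_posterior(x, skin_params, bg_params, prior_skin=0.5):
        """P(skin|x) by Bayes' rule, with x a 2-vector of (r, g) chromaticity values."""
        lik_skin = normal_density(x, *skin_params)     # p(x|skin)
        lik_bg = normal_density(x, *bg_params)         # p(x|background)
        evidence = lik_skin * prior_skin + lik_bg * (1 - prior_skin)   # p(x)
        return lik_skin * prior_skin / evidence

    # Classify a pixel as skin when P(skin|x) > 0.5
    ```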

  • Assess performance on training image

    [Figure: input training image; the likelihoods P(x|skin) and P(x|background); and the posterior P(skin|x), with values in [0, 1].]

    posterior depends on likelihoods (Gaussians) of both classes

  • Test data

    [Figure: on test frames, the likelihoods p(x|skin) and p(x|background), the posterior p(skin|x), and the thresholded classification p(skin|x) > 0.5.]

  • Test performance on other frames

    Receiver Operator Characteristic (ROC) Curve

    In many algorithms there is a threshold that affects performance

    e.g. true positive: a skin pixel classified as skin; false positive: a background pixel classified as skin.

    [Figure: ROC curve plotting true positives against false positives, both from 0 to 1; the operating point moves along the curve as the threshold decreases, and curves further from the top-left corner indicate worse performance.]
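    One way the ROC curve can be traced out by sweeping the threshold over sorted posterior scores; the function and variable names are illustrative, and numpy is assumed.

    ```python
    import numpy as np

    def roc_curve(scores, labels):
        """True/false positive rates as the decision threshold on `scores` decreases.

        scores: P(skin|x) for each pixel; labels: 1 for skin, 0 for background.
        """
        scores = np.asarray(scores, dtype=float)
        labels = np.asarray(labels).astype(bool)
        order = np.argsort(-scores)            # sort pixels by decreasing score
        labels = labels[order]
        tp = np.cumsum(labels)                 # true positives above each threshold
        fp = np.cumsum(~labels)                # false positives above each threshold
        tpr = tp / max(labels.sum(), 1)        # true positive rate
        fpr = fp / max((~labels).sum(), 1)     # false positive rate
        return fpr, tpr
    ```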

  • Loss function revisited

    Consider again the cancer diagnosis example. The consequences for an incorrect classification vary for the following cases:

    False positive: does not have cancer, but is classified as having it → distress, plus unnecessary further investigation.

    False negative: does have cancer, but is classified as not having it → no treatment, premature death.

    The two other cases are true positive and true negative.

    Because the consequences of a false negative far outweigh the others, rather than simply minimize the number of mistakes, a loss function can be minimized.

    Loss matrix

    The risk of assigning x to class Ci is

    R(Ci|x) = Σ_j L_ij p(Cj|x)

    with the loss matrix L_ij (rows: classification, columns: truth):

                              truth: cancer           truth: normal
    classified cancer         0     (true +ve)        1     (false +ve)
    classified normal         1000  (false -ve)       0     (true -ve)
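    A tiny numerical sketch of this risk computation with the loss matrix above; the 2% posterior probability of cancer is an invented value for illustration.

    ```python
    import numpy as np

    # Loss matrix: rows = classification (cancer, normal), columns = truth (cancer, normal)
    L = np.array([[0.0,    1.0],
                  [1000.0, 0.0]])

    # Suppose the posterior probability of cancer given the image is small, say 2%.
    posterior = np.array([0.02, 0.98])     # [p(cancer|x), p(normal|x)]

    risk = L @ posterior                   # R(Ci|x) = sum_j L_ij p(Cj|x)
    print(risk)                            # [0.98, 20.0]
    print("decide:", ["cancer", "normal"][int(np.argmin(risk))])
    ```

    With the heavy false-negative loss, the minimum-risk decision is "cancer" even though the posterior favours "normal", which is exactly the asymmetry the loss matrix is meant to encode.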

  • Bayes Risk

    The conditional risk of an action ai, given the measurement x, is

    R(ai|x) = Σ_j L(ai|Cj) p(Cj|x)

    where L(ai|Cj) is the loss incurred if action i is taken and the true state is j.

    Bayes decision rule: select the action for which R(ai|x) is minimum.

    Minimize Bayes risk

    This decision minimizes the expected loss:

    a* = argmin_{ai} R(ai|x)

    Likelihood ratio

    Two category classification with loss function

    Conditional risk:

    R(a1|x) = L11 p(C1|x) + L12 p(C2|x)
    R(a2|x) = L21 p(C1|x) + L22 p(C2|x)

    Thus for minimum risk, decide C1 if

    L11 p(C1|x) + L12 p(C2|x) < L21 p(C1|x) + L22 p(C2|x)
    p(C2|x) (L12 − L22) < p(C1|x) (L21 − L11)
    p(x|C2) p(C2) (L12 − L22) < p(x|C1) p(C1) (L21 − L11)

    Assuming L21 − L11 > 0, decide C1 if

    p(x|C1) / p(x|C2) > [ p(C2) (L12 − L22) ] / [ p(C1) (L21 − L11) ]

    i.e. the likelihood ratio exceeds a threshold that is independent of x.

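    A sketch of the resulting two-class likelihood-ratio test; the function name is hypothetical, and the likelihood values, priors and 2x2 loss matrix are inputs to be supplied.

    ```python
    def likelihood_ratio_decide(lik1, lik2, prior1, prior2, L):
        """Decide C1 iff p(x|C1)/p(x|C2) exceeds the x-independent threshold.

        lik1, lik2: likelihood values p(x|C1), p(x|C2) at the measurement x.
        L: 2x2 loss matrix, L[i][j] = loss of deciding class i+1 when the truth is class j+1.
        Assumes L[1][0] - L[0][0] > 0, as in the derivation above.
        """
        threshold = prior2 * (L[0][1] - L[1][1]) / (prior1 * (L[1][0] - L[0][0]))
        return "C1" if lik1 / lik2 > threshold else "C2"
    ```

    With the zero-one loss (L12 = L21 = 1, L11 = L22 = 0) the threshold reduces to p(C2)/p(C1), recovering the minimum-error Bayes decision rule.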

  • Discriminant functions

    A two category classifier can often be written in the form

    g(x) > 0: assign x to C1
    g(x) < 0: assign x to C2

    where g(x) is a discriminant function, and g(x) = 0 is a discriminant surface. In 2D, g(x) = 0 is a set of curves.


  • Example: In the minimum average error classifier, the assignment rule is: decide C1 if the posterior p(C1|x) > p(C2|x).

    The equivalent discriminant function is

    g(x) = p(C1|x) − p(C2|x)

    or

    g(x) = ln [ p(C1|x) / p(C2|x) ]

    Note, these two functions are not equal, but the decision boundaries are the same.

    Developing this further,

    g(x) = ln [ p(C1|x) / p(C2|x) ] = ln [ p(x|C1) / p(x|C2) ] + ln [ p(C1) / p(C2) ]

    Decision surfaces for Normal distributions

    Suppose that the likelihoods are Normal:

    p(x|C1) ~ N(μ1, Σ1),  p(x|C2) ~ N(μ2, Σ2)

    Then

    g(x) = ln [ p(x|C1) / p(x|C2) ] + ln [ p(C1) / p(C2) ]
         = ln p(x|C1) − ln p(x|C2) + ln [ p(C1) / p(C2) ]
         = −(1/2) (x − μ1)^T Σ1^{−1} (x − μ1) + (1/2) (x − μ2)^T Σ2^{−1} (x − μ2) + c0

    where c0 = ln [ p(C1) / p(C2) ] − (1/2) ln |Σ1| + (1/2) ln |Σ2|.
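    A sketch of evaluating this discriminant numerically for two fitted Gaussians; numpy and the parameter names are assumptions.

    ```python
    import numpy as np

    def gaussian_discriminant(x, mu1, cov1, mu2, cov2, prior1=0.5, prior2=0.5):
        """g(x) = ln p(x|C1)/p(x|C2) + ln p(C1)/p(C2) for Normal class likelihoods.

        g(x) > 0: assign x to C1;  g(x) < 0: assign x to C2;  g(x) = 0 is the decision surface.
        """
        x = np.asarray(x, dtype=float)
        d1 = x - np.asarray(mu1, dtype=float)
        d2 = x - np.asarray(mu2, dtype=float)
        quad = -0.5 * d1 @ np.linalg.solve(cov1, d1) + 0.5 * d2 @ np.linalg.solve(cov2, d2)
        c0 = (np.log(prior1 / prior2)
              - 0.5 * np.log(np.linalg.det(cov1))
              + 0.5 * np.log(np.linalg.det(cov2)))
        return quad + c0
    ```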

  • Case 1: Σi = σ²I

    g(