CSE489-02 & CSE589-02 Multimedia Processing
Lecture 09 Pattern Classifier and Evaluation for Multimedia
Applications
Spring 2009
New Mexico Tech
Basic Concepts and Definitions
                   Disease present (D+)    Disease absent (D−)     Total
Test positive      True positives (TP)     False positives (FP)    TP + FP
Test negative      False negatives (FN)    True negatives (TN)     FN + TN
Total              TP + FN                 FP + TN
Sensitivity: proportion of patients with disease who test positive
   P(T+|D+) = TP / (TP + FN)
Specificity: proportion of patients without disease who test negative
   P(T−|D−) = TN / (TN + FP)
In other words, sensitivity is the ability to detect "true positives", TP / (TP + FN), and specificity is the ability to avoid "false positives", TN / (TN + FP).
Positive Predictive Value (PPV): TP / (TP + FP)
Sensitivity and Specificity
Accuracy
   Definition: (TP + TN) / (TP + TN + FP + FN)
   Range: [0 … 1]
Matthews correlation coefficient (MCC)
   Definition: (TP·TN − FN·FP) / √((TN + FN)(TN + FP)(TP + FN)(TP + FP))
   Range: [−1 … 1]
An example: on a test set with 20 motif-containing (m+) and 47 motif-lacking (m−) proteins, the following results were obtained:

            predicted m+   predicted m−
true m+        17 (TP)         3 (FN)
true m−         8 (FP)        39 (TN)

sensitivity = TP / (TP + FN) = 17 / (17 + 3) = 0.85
specificity = TP / (TP + FP) = 17 / (17 + 8) = 0.68
   (note: this example uses the bioinformatics convention, under which "specificity" denotes the positive predictive value TP / (TP + FP), not TN / (TN + FP) as defined above)
MCC = ... = 0.64
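These measures follow directly from the four confusion-matrix counts. A minimal sketch in Python (the function name is mine; it uses the textbook definition specificity = TN / (TN + FP) and reports PPV separately):

```python
import math

def confusion_metrics(tp, fn, fp, tn):
    """Standard evaluation measures from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)                   # true positive rate
    specificity = tn / (tn + fp)                   # true negative rate
    ppv         = tp / (tp + fp)                   # positive predictive value
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fn * fp) / math.sqrt(
        (tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    return sensitivity, specificity, ppv, accuracy, mcc

# The motif example above: TP=17, FN=3, FP=8, TN=39
sens, spec, ppv, acc, mcc = confusion_metrics(17, 3, 8, 39)
print(sens, ppv, round(mcc, 2))   # 0.85, 0.68, 0.64
```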
An Example: Hypothyroidism
Hypothyroidism is the disease state in humans and other animals caused by insufficient production of thyroid hormone by the thyroid gland.
Clinical Response to Thyroxine Sodium in Clinically Hypothyroid but Biochemically Euthyroid Patients
Skinner GRB, Holmes D, Ahmad A, Davies JA, Benitez J. J Nutr Environ Med 2000;10:115-124.
An Example: Hypothyroidism
[Figures: 2×2 test-result tables at three successively less strict diagnostic cutoffs]
- Sensitivity = 18/32 = 0.56, Specificity = 92/93 = 0.99
- Sensitivity = 25/32 = 0.78, Specificity = 75/93 = 0.81
- Sensitivity = 29/32 = 0.91, Specificity = 39/93 = 0.42
Note the trade-off: as the cutoff is relaxed, sensitivity rises while specificity falls.
A Universal Illustration [figure omitted]
Comparison of ROC Curves [figure omitted]
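An ROC curve is just the set of (1 − specificity, sensitivity) pairs obtained by sweeping the decision cutoff over all test scores. A minimal sketch of that sweep (plain numpy; the scores and labels are made-up illustrative values):

```python
import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs from sweeping a cutoff over the scores.

    scores: classifier outputs, higher = more evidence for disease (D+)
    labels: 1 for D+, 0 for D-; assumes distinct scores for simplicity
    """
    order = np.argsort(-scores)                       # strictest cutoff first
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()            # sensitivity
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # 1 - specificity
    return fpr, tpr

# Made-up scores and true labels, just to exercise the sweep
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
for f, t in zip(*roc_points(scores, labels)):
    print(round(f, 2), round(t, 2))
```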
The idea behind artificial neural networks
The brain of a vertebrate is (in general) capable of learning things.
Example: having seen a number of trees, a normally gifted person will be able to recognise almost all types of trees.
The idea: to construct networks of artificial neurons and make them learn and generalize in a way similar to how physiological neural networks do.
The feed-forward neural network: the training principle
[Figure: a layered network with an input layer, hidden layer, and output layer connected by weights; sequence input data enters at the input layer]
- Data are presented at the input.
- The correct answer is fixed at the output.
- The difference between the correct and the actual output is used for weight adjustment (training).

When to stop training
[Figure: error vs. epoch, with separate curves for the training set and the test set]
We want to get good generalization performance and avoid over-fitting of the parameters to the training set (over-training).
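As a concrete illustration of this loop, here is a minimal one-hidden-layer network trained by gradient descent while monitoring test-set error for early stopping. All names, plus the sigmoid units and squared-error loss, are assumptions of this sketch, not the lecture's prescribed setup:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train(X, y, X_test, y_test, hidden=8, lr=0.5, epochs=2000):
    """One-hidden-layer feed-forward net, squared-error loss.

    X: (n, d) inputs; y: (n, 1) targets in [0, 1].
    Keeps the weights from the epoch with the lowest test-set error,
    i.e. stops at the bottom of the test-error curve.
    """
    n, d = X.shape
    W1 = rng.normal(0, 0.5, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    best_err, best_weights = np.inf, None

    for epoch in range(epochs):
        # forward pass: data presented at the input
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backward pass: difference between correct and actual output
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out / n;  b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * X.T @ d_h / n;    b1 -= lr * d_h.mean(axis=0)
        # early-stopping check on the held-out (test) set
        h_t = sigmoid(X_test @ W1 + b1)
        err = float(np.mean((sigmoid(h_t @ W2 + b2) - y_test) ** 2))
        if err < best_err:
            best_err = err
            best_weights = (W1.copy(), b1.copy(), W2.copy(), b2.copy())
    return best_err, best_weights
```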
The training set should be:
- large enough
- containing all possible classes in approximately equal amounts
- unbiased, i.e. no particular type within a class should be overrepresented; this is important for two reasons:
   - if the training set is biased towards a particular type, so will the ANN be
   - if the training and test sets contain too similar examples, the performance will be over-estimated
In short: the training set should be representative.
Cross-validation
N-fold cross-validation:
- divide the data set into n parts
- use n − 1 parts for training
- use 1 part for testing
- repeat n times, so that each part serves as the test set exactly once (a minimal split routine is sketched below)
E.g., split the total data set into 5 parts:
- 4 parts for training
- 1 part for testing
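A minimal sketch of the n-fold split (function names are mine; `train_and_score` stands in for whichever classifier-training routine is being evaluated):

```python
import numpy as np

def n_fold_cv(X, y, train_and_score, n=5, seed=0):
    """n-fold cross-validation: each of the n parts is the test set once.

    train_and_score(X_tr, y_tr, X_te, y_te) -> test-set performance.
    Returns the mean performance over the n folds.
    """
    idx = np.random.default_rng(seed).permutation(len(X))  # shuffle first
    folds = np.array_split(idx, n)
    scores = []
    for k in range(n):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n) if j != k])
        scores.append(train_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(scores))
```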
Support Vector Machines (SVM)
Linear Classifiers
[Figure: 2-D data where one marker denotes class +1 and the other denotes class −1]
f(x, w, b) = sign(w·x + b)
The line w·x + b = 0 is the decision boundary: points with w·x + b > 0 are classified +1, points with w·x + b < 0 are classified −1.
How would you classify this data? Any of several separating lines would be fine... but which is best?
Classifier Margin
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are those datapoints that the margin pushes up against.
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors are important; the other training examples are ignorable.
3. Empirically it works very, very well.
Maximum Margin
What we know:
   w · x+ + b = +1
   w · x− + b = −1
   therefore w · (x+ − x−) = 2
[Figure: the "Predict Class = +1" zone beyond the plane w·x + b = +1, the "Predict Class = −1" zone beyond the plane w·x + b = −1, and the decision boundary w·x + b = 0 between them; x+ and x− lie on the two margin planes]
M = margin width = (x+ − x−) · w/‖w‖ = 2/‖w‖
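Written out as one chain (a restatement of the facts above, no new assumptions):

```latex
\[
w\cdot x^{+} + b = +1,\quad w\cdot x^{-} + b = -1
\;\Rightarrow\; w\cdot(x^{+}-x^{-}) = 2
\;\Rightarrow\; M = \frac{w}{\lVert w\rVert}\cdot(x^{+}-x^{-})
               = \frac{2}{\lVert w\rVert}.
\]
```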
Linear SVM Mathematically
Goal:
1) Correctly classify all training data:
      wTxi + b ≥ +1  if yi = +1
      wTxi + b ≤ −1  if yi = −1
   i.e. yi (wTxi + b) ≥ 1 for all i
2) Maximize the margin M = 2/‖w‖, which is the same as minimizing Φ(w) = ½ wTw.
We can formulate a quadratic optimization problem and solve for w and b:
   Minimize Φ(w) = ½ wTw
   subject to yi (wTxi + b) ≥ 1 for all i
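This primal problem is a standard QP and can be handed to an off-the-shelf solver. A sketch using the cvxpy modeling library (cvxpy is my choice of tool here, not the lecture's; the four data points are made up):

```python
import numpy as np
import cvxpy as cp

# Made-up, linearly separable toy data with labels +1 / -1
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# Minimize (1/2) w'w  subject to  y_i (w'x_i + b) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)   # the maximum-margin separating hyperplane
```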
Solving the Optimization Problem
We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them. The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint in the primal problem:

Primal: Find w and b such that Φ(w) = ½ wTw is minimized, and for all {(xi, yi)}: yi (wTxi + b) ≥ 1
Dual: Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαjyiyjxiTxj is maximized and
   (1) Σαiyi = 0
   (2) αi ≥ 0 for all αi
The Optimization Problem Solution
The solution has the form:
   w = Σ αiyixi
   b = yk − wTxk for any xk such that αk ≠ 0
Each non-zero αi indicates that the corresponding xi is a support vector. The classifying function then has the form:
   f(x) = Σ αiyixiTx + b
Notice that it relies on an inner product between the test point x and the support vectors xi; we will return to this later. Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all pairs of training points.
Dataset with noise
Hard margin: so far we have required that all data points be classified correctly, i.e. no training error. What if the training set is noisy?
- Solution 1: use very powerful kernels. [Figure: a highly contorted boundary that separates every +1 point from every −1 point] OVERFITTING!
Slack variables ξi can be added to allow misclassification of difficult or noisy examples.
[Figure: the planes w·x + b = +1, 0, −1, with slack distances ξi marking points that fall on the wrong side of their margin]
Soft Margin Classification
What should our quadratic optimization criterion be? Minimize
   ½ wTw + C Σk=1…R ξk
Hard Margin vs. Soft Margin
The old formulation:
   Find w and b such that Φ(w) = ½ wTw is minimized
   and for all {(xi, yi)}: yi (wTxi + b) ≥ 1
The new formulation incorporating slack variables:
   Find w and b such that Φ(w) = ½ wTw + C Σξi is minimized
   and for all {(xi, yi)}: yi (wTxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i
Parameter C can be viewed as a way to control overfitting (illustrated below).
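To see C's role, one can train a linear soft-margin SVM at several values of C and watch which points become support vectors. This sketch assumes scikit-learn (my choice of tool; the data are made up, with the last point deliberately mislabeled to create noise):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 2], [3, 1], [2.5, 2.5], [-2, -2], [-3, -1], [1.8, 1.9]])
y = np.array([1, 1, 1, -1, -1, -1])   # last point sits deep in the +1 region

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C tolerates slack (soft margin); large C approaches hard margin
    print(C, "support vectors:", clf.support_.tolist())
```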
Linear SVMs: Overview
- The classifier is a separating hyperplane.
- The most "important" training points are support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points xi are support vectors: those with non-zero Lagrangian multipliers αi.
- Both in the dual formulation of the problem and in the solution, training points appear only inside dot products:
   Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαjyiyjxiTxj is maximized and
   (1) Σαiyi = 0
   (2) 0 ≤ αi ≤ C for all αi
   f(x) = Σ αiyixiTx + b
Non-linear SVMs
Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space?
[Figure: 1-D data on the x axis that no single threshold separates; plotting each point as (x, x²) makes the two classes linearly separable]

Non-linear SVMs: Feature spaces
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
   Φ: x → φ(x)
The “Kernel Trick”
The linear classifier relies on the dot product between vectors: K(xi, xj) = xiTxj.
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes:
   K(xi, xj) = φ(xi)Tφ(xj)
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)².
Need to show that K(xi, xj) = φ(xi)Tφ(xj):
   K(xi, xj) = (1 + xiTxj)²
   = 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
   = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]T [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
   = φ(xi)Tφ(xj),  where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
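The expansion above is easy to confirm numerically with made-up vectors:

```python
import numpy as np

def phi(x):
    """Explicit feature map from the example: x = [x1, x2]."""
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((1 + xi @ xj) ** 2)     # kernel evaluated in the input space -> 4.0
print(phi(xi) @ phi(xj))      # dot product in the feature space    -> 4.0
```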
Examples of Kernel Functions (each is sketched in code below)
- Linear: K(xi, xj) = xiTxj
- Polynomial of power p: K(xi, xj) = (1 + xiTxj)^p
- Gaussian (radial-basis function network): K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
- Sigmoid: K(xi, xj) = tanh(β0 xiTxj + β1)
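The four kernels translate directly into code (a sketch; the parameter defaults are arbitrary choices of mine):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    # radial-basis function: depends only on the distance ||xi - xj||
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```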
Non-linear SVMs Mathematically
Dual problem formulation:
   Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαjyiyjK(xi, xj) is maximized and
   (1) Σαiyi = 0
   (2) αi ≥ 0 for all αi
The solution is:
   f(x) = Σ αiyiK(xi, x) + b
Optimization techniques for finding the αi's remain the same!
Nonlinear SVM - Overview
- The SVM locates a separating hyperplane in the feature space and classifies points in that space.
- It does not need to represent the space explicitly; it simply defines a kernel function.
- The kernel function plays the role of the dot product in the feature space.
(See the example below.)
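Putting the pieces together on a problem no linear classifier can solve: an XOR-like dataset with a Gaussian-kernel SVM. This sketch again assumes scikit-learn, and the data are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])        # class = sign(x1 * x2), XOR-like

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)   # Gaussian kernel
print(clf.score(X, y))                # should be close to 1.0
```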
SVM Applications
SVM has been used successfully in many real-world problems:
- text (and hypertext) categorization
- image classification
- bioinformatics (protein classification, cancer classification)
- hand-written character recognition
Useful Toolboxes
- PRtools
- STPRtool
- OSU-SVM
- SVMlight
- MATLABArsenal
- DENFIS
- …