CSE489-02 & CSE589-02 Multimedia Processing
Lecture 09 Pattern Classifier and Evaluation for Multimedia
Applications
Spring 2009
New Mexico Tech
Basic Concepts and Definitions
                   Disease present (D+)    Disease absent (D−)     Total
Test positive      True positives (TP)     False positives (FP)    TP + FP
Test negative      False negatives (FN)    True negatives (TN)     FN + TN
Total              TP + FN                 FP + TN
Sensitivity: proportion of patients with disease who test positive
   P(T+|D+) = TP / (TP + FN)
Specificity: proportion of patients without disease who test negative
   P(T−|D−) = TN / (TN + FP)
In other words, sensitivity is the ability to detect "true positives", TP / (TP + FN), and specificity is the ability to avoid "false positives", TN / (TN + FP).
Positive Predictive Value (PPV): TP / (TP + FP)
Sensitivity and Specificity
Accuracy
   Definition: (TP + TN) / (TP + TN + FP + FN)
   Range: [0 … 1]
Matthews correlation coefficient (MCC)
   Definition: (TP·TN − FN·FP) / √((TN + FN)(TN + FP)(TP + FN)(TP + FP))
   Range: [−1 … 1]
An example: on a test set with 20 motif-containing (m+) and 47 motif-lacking (m−) proteins, the following results were obtained:

            predicted m+   predicted m−
true m+        17 (TP)         3 (FN)
true m−         8 (FP)        39 (TN)

sensitivity = TP / (TP + FN) = 17 / (17 + 3) = 0.85
specificity = TP / (TP + FP) = 17 / (17 + 8) = 0.68
   (note: this example uses the bioinformatics convention, under which "specificity" denotes the positive predictive value TP / (TP + FP), not TN / (TN + FP) as defined above)
MCC = ... = 0.64
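These measures follow directly from the four confusion-matrix counts. A minimal sketch in Python (the function name is mine; it uses the textbook definition specificity = TN / (TN + FP) and reports PPV separately):

```python
import math

def confusion_metrics(tp, fn, fp, tn):
    """Standard evaluation measures from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)                   # true positive rate
    specificity = tn / (tn + fp)                   # true negative rate
    ppv         = tp / (tp + fp)                   # positive predictive value
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fn * fp) / math.sqrt(
        (tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    return sensitivity, specificity, ppv, accuracy, mcc

# The motif example above: TP=17, FN=3, FP=8, TN=39
sens, spec, ppv, acc, mcc = confusion_metrics(17, 3, 8, 39)
print(sens, ppv, round(mcc, 2))   # 0.85, 0.68, 0.64
```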
An Example: Hypothyroidism
Hypothyroidism is the disease state in humans and other animals caused by insufficient production of thyroid hormone by the thyroid gland.
Clinical Response to Thyroxine Sodium in Clinically Hypothyroid but Biochemically Euthyroid Patients
Skinner GRB, Holmes D, Ahmad A, Davies JA, Benitez J. J Nutr Environ Med 2000;10:115-124.
An Example: Hypothyroidism
[Figures: 2×2 test-result tables at three successively less strict diagnostic cutoffs]
- Sensitivity = 18/32 = 0.56, Specificity = 92/93 = 0.99
- Sensitivity = 25/32 = 0.78, Specificity = 75/93 = 0.81
- Sensitivity = 29/32 = 0.91, Specificity = 39/93 = 0.42
Note the trade-off: as the cutoff is relaxed, sensitivity rises while specificity falls.
A Universal Illustration [figure omitted]
Comparison of ROC Curves [figure omitted]
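An ROC curve is just the set of (1 − specificity, sensitivity) pairs obtained by sweeping the decision cutoff over all test scores. A minimal sketch of that sweep (plain numpy; the scores and labels are made-up illustrative values):

```python
import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs from sweeping a cutoff over the scores.

    scores: classifier outputs, higher = more evidence for disease (D+)
    labels: 1 for D+, 0 for D-; assumes distinct scores for simplicity
    """
    order = np.argsort(-scores)                       # strictest cutoff first
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()            # sensitivity
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # 1 - specificity
    return fpr, tpr

# Made-up scores and true labels, just to exercise the sweep
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
for f, t in zip(*roc_points(scores, labels)):
    print(round(f, 2), round(t, 2))
```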
The idea behind artificial neural networks
The brain of a vertebrate is (in general) capable of learning things.
Example: having seen a number of trees, a normally gifted person will be able to recognise almost all types of trees.
The idea: to construct networks of artificial neurons and make them learn and generalize in a way similar to how physiological neural networks do.
The feed-forward neural network: the training principle
[Figure: a layered network with an input layer, hidden layer, and output layer connected by weights; sequence input data enters at the input layer]
- Data are presented at the input.
- The correct answer is fixed at the output.
- The difference between the correct and the actual output is used for weight adjustment (training).

When to stop training
[Figure: error vs. epoch, with separate curves for the training set and the test set]
We want to get good generalization performance and avoid over-fitting of the parameters to the training set (over-training).
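As a concrete illustration of this loop, here is a minimal one-hidden-layer network trained by gradient descent while monitoring test-set error for early stopping. All names, plus the sigmoid units and squared-error loss, are assumptions of this sketch, not the lecture's prescribed setup:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train(X, y, X_test, y_test, hidden=8, lr=0.5, epochs=2000):
    """One-hidden-layer feed-forward net, squared-error loss.

    X: (n, d) inputs; y: (n, 1) targets in [0, 1].
    Keeps the weights from the epoch with the lowest test-set error,
    i.e. stops at the bottom of the test-error curve.
    """
    n, d = X.shape
    W1 = rng.normal(0, 0.5, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    best_err, best_weights = np.inf, None

    for epoch in range(epochs):
        # forward pass: data presented at the input
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backward pass: difference between correct and actual output
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out / n;  b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * X.T @ d_h / n;    b1 -= lr * d_h.mean(axis=0)
        # early-stopping check on the held-out (test) set
        h_t = sigmoid(X_test @ W1 + b1)
        err = float(np.mean((sigmoid(h_t @ W2 + b2) - y_test) ** 2))
        if err < best_err:
            best_err = err
            best_weights = (W1.copy(), b1.copy(), W2.copy(), b2.copy())
    return best_err, best_weights
```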
The training set should be:
- large enough
- containing all possible classes in approximately equal amounts
- unbiased, i.e. no particular type within a class should be overrepresented; this is important for two reasons:
   - if the training set is biased towards a particular type, so will the ANN be
   - if the training and test sets contain too similar examples, the performance will be over-estimated
In short: the training set should be representative.
Cross-validation
N-fold cross-validation:
- divide the data set into n parts
- use n − 1 parts for training
- use 1 part for testing
- repeat n times, so that each part serves as the test set exactly once (a minimal split routine is sketched below)
E.g., split the total data set into 5 parts:
- 4 parts for training
- 1 part for testing
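A minimal sketch of the n-fold split (function names are mine; `train_and_score` stands in for whichever classifier-training routine is being evaluated):

```python
import numpy as np

def n_fold_cv(X, y, train_and_score, n=5, seed=0):
    """n-fold cross-validation: each of the n parts is the test set once.

    train_and_score(X_tr, y_tr, X_te, y_te) -> test-set performance.
    Returns the mean performance over the n folds.
    """
    idx = np.random.default_rng(seed).permutation(len(X))  # shuffle first
    folds = np.array_split(idx, n)
    scores = []
    for k in range(n):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n) if j != k])
        scores.append(train_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(scores))
```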
Support Vector Machines (SVM)
Linear Classifiers
[Figure: 2-D data where one marker denotes class +1 and the other denotes class −1]
f(x, w, b) = sign(w·x + b)
The line w·x + b = 0 is the decision boundary: points with w·x + b > 0 are classified +1, points with w·x + b < 0 are classified −1.
How would you classify this data? Any of several separating lines would be fine... but which is best?
Classifier Margin
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are those datapoints that the margin pushes up against.
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors are important; the other training examples are ignorable.
3. Empirically it works very, very well.
Maximum Margin
What we know:
   w · x+ + b = +1
   w · x− + b = −1
   therefore w · (x+ − x−) = 2
[Figure: the "Predict Class = +1" zone beyond the plane w·x + b = +1, the "Predict Class = −1" zone beyond the plane w·x + b = −1, and the decision boundary w·x + b = 0 between them; x+ and x− lie on the two margin planes]
M = margin width = (x+ − x−) · w/‖w‖ = 2/‖w‖
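Written out as one chain (a restatement of the facts above, no new assumptions):

```latex
\[
w\cdot x^{+} + b = +1,\quad w\cdot x^{-} + b = -1
\;\Rightarrow\; w\cdot(x^{+}-x^{-}) = 2
\;\Rightarrow\; M = \frac{w}{\lVert w\rVert}\cdot(x^{+}-x^{-})
               = \frac{2}{\lVert w\rVert}.
\]
```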
Linear SVM Mathematically
Goal:
1) Correctly classify all training data:
      wTxi + b ≥ +1  if yi = +1
      wTxi + b ≤ −1  if yi = −1
   i.e. yi (wTxi + b) ≥ 1 for all i
2) Maximize the margin M = 2/‖w‖, which is the same as minimizing Φ(w) = ½ wTw.
We can formulate a quadratic optimization problem and solve for w and b:
   Minimize Φ(w) = ½ wTw
   subject to yi (wTxi + b) ≥ 1 for all i
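This primal problem is a standard QP and can be handed to an off-the-shelf solver. A sketch using the cvxpy modeling library (cvxpy is my choice of tool here, not the lecture's; the four data points are made up):

```python
import numpy as np
import cvxpy as cp

# Made-up, linearly separable toy data with labels +1 / -1
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# Minimize (1/2) w'w  subject to  y_i (w'x_i + b) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)   # the maximum-margin separating hyperplane
```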
Solving the Optimization Problem
We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them. The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint in the primal problem:

Primal: Find w and b such that Φ(w) = ½ wTw is minimized, and for all {(xi, yi)}: yi (wTxi + b) ≥ 1
Dual: Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαjyiyjxiTxj is maximized and
   (1) Σαiyi = 0
   (2) αi ≥ 0 for all αi
The Optimization Problem Solution
The solution has the form:
   w = Σ αiyixi
   b = yk − wTxk for any xk such that αk ≠ 0
Each non-zero αi indicates that the corresponding xi is a support vector. The classifying function then has the form:
   f(x) = Σ αiyixiTx + b
Notice that it relies on an inner product between the test point x and the support vectors xi; we will return to this later. Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all pairs of training points.
Dataset with noise
Hard margin: so far we have required that all data points be classified correctly, i.e. no training error. What if the training set is noisy?
- Solution 1: use very powerful kernels. [Figure: a highly contorted boundary that separates every +1 point from every −1 point] OVERFITTING!
Slack variables ξi can be added to allow misclassification of difficult or noisy examples.
[Figure: the planes w·x + b = +1, 0, −1, with slack distances ξi marking points that fall on the wrong side of their margin]
Soft Margin Classification
What should our quadratic optimization criterion be? Minimize
   ½ wTw + C Σk=1…R ξk
Hard Margin vs. Soft Margin
The old formulation:
   Find w and b such that Φ(w) = ½ wTw is minimized
   and for all {(xi, yi)}: yi (wTxi + b) ≥ 1
The new formulation incorporating slack variables:
   Find w and b such that Φ(w) = ½ wTw + C Σξi is minimized
   and for all {(xi, yi)}: yi (wTxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i
Parameter C can be viewed as a way to control overfitting (illustrated below).
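To see C's role, one can train a linear soft-margin SVM at several values of C and watch which points become support vectors. This sketch assumes scikit-learn (my choice of tool; the data are made up, with the last point deliberately mislabeled to create noise):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 2], [3, 1], [2.5, 2.5], [-2, -2], [-3, -1], [1.8, 1.9]])
y = np.array([1, 1, 1, -1, -1, -1])   # last point sits deep in the +1 region

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C tolerates slack (soft margin); large C approaches hard margin
    print(C, "support vectors:", clf.support_.tolist())
```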
Linear SVMs: Overview
- The classifier is a separating hyperplane.
- The most "important" training points are support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points xi are support vectors: those with non-zero Lagrangian multipliers αi.
- Both in the dual formulation of the problem and in the solution, training points appear only inside dot products:
   Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαjyiyjxiTxj is maximized and
   (1) Σαiyi = 0
   (2) 0 ≤ αi ≤ C for all αi
   f(x) = Σ αiyixiTx + b
Non-linear SVMs
Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space?
[Figure: 1-D data on the x axis that no single threshold separates; plotting each point as (x, x²) makes the two classes linearly separable]

Non-linear SVMs: Feature spaces
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
   Φ: x → φ(x)
The “Kernel Trick”
The linear classifier relies on the dot product between vectors: K(xi, xj) = xiTxj.
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes:
   K(xi, xj) = φ(xi)Tφ(xj)
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)².
Need to show that K(xi, xj) = φ(xi)Tφ(xj):
   K(xi, xj) = (1 + xiTxj)²
   = 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
   = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]T [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
   = φ(xi)Tφ(xj),  where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
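The expansion above is easy to confirm numerically with made-up vectors:

```python
import numpy as np

def phi(x):
    """Explicit feature map from the example: x = [x1, x2]."""
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((1 + xi @ xj) ** 2)     # kernel evaluated in the input space -> 4.0
print(phi(xi) @ phi(xj))      # dot product in the feature space    -> 4.0
```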
Examples of Kernel Functions (each is sketched in code below)
- Linear: K(xi, xj) = xiTxj
- Polynomial of power p: K(xi, xj) = (1 + xiTxj)^p
- Gaussian (radial-basis function network): K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
- Sigmoid: K(xi, xj) = tanh(β0 xiTxj + β1)
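The four kernels translate directly into code (a sketch; the parameter defaults are arbitrary choices of mine):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    # radial-basis function: depends only on the distance ||xi - xj||
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```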
Non-linear SVMs Mathematically
Dual problem formulation:
   Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαjyiyjK(xi, xj) is maximized and
   (1) Σαiyi = 0
   (2) αi ≥ 0 for all αi
The solution is:
   f(x) = Σ αiyiK(xi, x) + b
Optimization techniques for finding the αi's remain the same!
Nonlinear SVM - Overview
- The SVM locates a separating hyperplane in the feature space and classifies points in that space.
- It does not need to represent the space explicitly; it simply defines a kernel function.
- The kernel function plays the role of the dot product in the feature space.
(See the example below.)
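Putting the pieces together on a problem no linear classifier can solve: an XOR-like dataset with a Gaussian-kernel SVM. This sketch again assumes scikit-learn, and the data are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])        # class = sign(x1 * x2), XOR-like

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)   # Gaussian kernel
print(clf.score(X, y))                # should be close to 1.0
```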
SVM Applications
SVM has been used successfully in many real-world problems:
- text (and hypertext) categorization
- image classification
- bioinformatics (protein classification, cancer classification)
- hand-written character recognition
Useful Toolboxes
- PRtools
- STPRtool
- OSU-SVM
- SVMlight
- MATLABArsenal
- DENFIS
- …