
Machine Learning Module

Week 5

Lecture Notes 9 & 10

Non-Probabilistic Classification Methods

Mark [email protected]

Department of Computing ScienceUniversity of Glasgow

May 14, 2006


1 Non-Probabilistic Classification

Last week we studied two approaches to probabilistic classification. The first explicitly defined a discriminant function using a linear model, and approximate Bayesian inference was employed in obtaining the model parameters. The second approach, the generative one, makes estimates of the class-conditional probability distributions or densities of the feature vectors. This week we will look at two very popular methods of classification which are not motivated from a probabilistic model of the data. The first is the classical K-Nearest Neighbour method and the second is the rather exotic sounding Support Vector Machine.

2 K-Nearest Neighbours

Consider again the example of classifying Male & Female based on measured height. We are given the heights of N individuals as our training data set, i.e. D = {(h_1, t_1), (h_2, t_2), · · · , (h_N, t_N)}, where each t_n takes on the value Male or Female. Now we get a new height measurement, h_new, for an individual and we have to assign a label (Male or Female) to that individual. To do this we could simply find the closest height value in the training set and adopt the target value associated with this closest value. In other words we find the Nearest Neighbour (h_NN, t_NN) of our test case h_new and make the assignment t_new = t_NN. This seems an incredibly simple decision rule but it is one which in fact will perform well in many applications.

Now let's say that we generalise our Nearest Neighbour rule and take the K Nearest Neighbours to make our decision. In this case we will have one vote from each of the K nearest neighbours, so the sensible thing to do is to take the Majority Vote as the imputed target value for our new classification. If there is no clear majority then a simple random decision could be taken.

This is a very simple classification rule and yet it is very effective in many cases. So now consider the more general case where our objects are represented by a D-dimensional feature vector x ∈ R^D and where we can compute some distance between any two vectors in this space; we will denote the distance between examples m & n by δ(x_m, x_n). So for a test point x_new the decision based on the majority vote obtained from the K smallest values of {δ(x_new, x_n)}_{n=1···N} will give us our predicted value t_new.


2.1 Implementation

There is no model as such in KNN and so no training is required. All the computational effort occurs when making classifications. Here is a naive implementation of the KNN classification method in Matlab.

function [error,tpred] = knn_multi_class(X,t,Xtest,ttest,k)
% K-nearest-neighbour classification of the rows of Xtest using the
% labelled training data (X,t). Returns the percentage test error and
% the predicted labels.

Ntest = size(Xtest,1);
N     = size(X,1);
tpred = zeros(Ntest,1);
C     = max(t);          % number of classes, labels assumed to lie in 1..C

for n=1:Ntest
    % Squared Euclidean distance from the n-th test point to every training point.
    Dist = sum(( repmat(Xtest(n,:),N,1) - X ).^2,2 );
    [sorted_list,sorted_index] = sort(Dist);
    % Majority vote amongst the k nearest neighbours.
    [max_k, index_max_k] = max(histc(t(sorted_index(1:k)),1:C));
    tpred(n) = index_max_k;
end

error = 100*sum(tpred ~= ttest)/Ntest;

2.1.1 Description of Code

A training set of features, in an N × D dimensional matrix X, and the N × 1 target vector t are passed to the KNN function along with an Ntest × D dimensional matrix Xtest of test data. The corresponding true target values of the test data are in the Ntest × 1 vector ttest. As this naturally accommodates multiple classes, the target values take on values in 1 · · · C, where C is the number of distinct classes. The value of k, the number of neighbours to be considered, is also passed to the function.

We loop over every test point to be considered, and for each one we compute the distance between the test point under consideration and every one of the training points. In this case the distance is simply the squared distance between each pair of points. In the code above an N × D matrix, every row of which is Xtest(n,:), is created using the repmat command, which simply repeats a matrix or vector by stacking in row or column format depending


on the argument passed to it (refer to the online help for details on repmat). We can then compute the squared Euclidean distance

δ(x_test, x_n)² = ∑_{d=1}^{D} (x_test,d − x_nd)²

for all n in one simple matrix operation, which Matlab is particularly efficient at. (Since the square root is monotonic, ranking points by squared distance gives the same nearest neighbours as ranking by the distance itself, so the square root is never needed.)

Now we have to find the K-nearest neighbours, which we do by sorting the vector of distances from x_new using, somewhat unsurprisingly, the Matlab sort command. The next step is to take the target values of the K-nearest neighbours and select the dominant class amongst them. We achieve this in one line of Matlab by using the original indices of the sorted distances returned by the sort command to select the target values of the K-nearest neighbours with t(sorted_index(1:k)). We can make a count of the number of occurrences of each of the C classes using the histc command and then simply find the maximum.

3 Computational Complexity

The computational complexity for a single prediction will be dominated by the computation of the distances from each training point. For the simple squared distance the scaling is linear in the dimensionality of the feature vector and the number of training points, O(DN). The sorting required will scale as roughly O(N log N), which, unless D is large, will tend to dominate the overall cost. So clearly as your training set gets larger the testing time for KNN will be adversely affected, although we would of course anticipate improved predictions.

There are a number of ways in which this cost can be reduced, for example using efficient search and data condensation methods. We will not consider these any further in this course; however, you should be aware that KNN computational scaling can be significantly improved over the naive implementation given here.
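As a concrete illustration of the O(DN) distance step, the distances from every test point to every training point can also be computed in a single matrix operation rather than looping over test points; a minimal sketch of this idea, using the same X and Xtest conventions as above and the expansion ||a − b||² = ||a||² + ||b||² − 2a^T b:

% Squared Euclidean distances between all test and training points at once.
% D2 is Ntest x N; element (i,j) is the squared distance from Xtest(i,:) to X(j,:).
D2 = repmat(sum(Xtest.^2,2),1,size(X,1)) ...
   + repmat(sum(X.^2,2)',size(Xtest,1),1) ...
   - 2*Xtest*X';
[sorted_dists,sorted_index] = sort(D2,2);  % first k columns of each row give the k nearest neighbours

The arithmetic cost is unchanged, but the per-test-point loop overhead disappears at the price of O(Ntest × N) memory for D2.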

4 Distances & Metrics

Whilst the KNN classifier is model free there are in fact two 'hyperparameters' associated with the method. The most obvious one is the number of


neighbours, K, to employ when making a classification. Selection of this can be made using Leave-One-Out Cross Validation; the other tunable term is the distance function employed. The KNN classifier relies on the definition of an appropriate metric in the defined feature space. This metric gives meaning to the notion of distance within the feature space and hence to how similar (or close) one feature vector is to another.
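As a rough illustration of how K might be chosen by Leave-One-Out Cross Validation, the following sketch (assuming the same squared Euclidean distance as above, with X being N × D and t holding labels in 1 · · · C) leaves each training point out in turn, classifies it using the remaining points, and records the error for each candidate K:

% Leave-one-out cross-validation over candidate values of K.
Kvals = 1:2:31;
N = size(X,1);
C = max(t);
loo_err = zeros(length(Kvals),1);
for ki = 1:length(Kvals)
    k = Kvals(ki);
    wrong = 0;
    for n = 1:N
        idx = [1:n-1, n+1:N]';                                  % leave point n out
        Dist = sum(( repmat(X(n,:),N-1,1) - X(idx,:) ).^2, 2);
        [sorted_list,sorted_index] = sort(Dist);
        [max_k,pred] = max(histc(t(idx(sorted_index(1:k))),1:C)); % majority vote
        wrong = wrong + (pred ~= t(n));
    end
    loo_err(ki) = 100*wrong/N;
end
[min_err,best] = min(loo_err);
best_K = Kvals(best);

The K with the lowest leave-one-out error would then be used when classifying genuinely new points.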

There are four properties that a metric must have for all vectors x, y, z ∈ R^D:

1. Non-Negativity: δ(x,y) ≥ 0

2. Reflexivity: δ(x,y) = 0 iff x = y

3. Symmetry: δ(x,y) = δ(y,x)

4. Triangle Inequality: δ(x,y) + δ(y, z) ≥ δ(x, z)

So the Euclidean distance δ(x, y) = √( ∑_{d=1}^{D} (x_d − y_d)² ) satisfies all the above properties and is hence a metric defining a distance in R^D. However, if the space is transformed such that each axis is scaled by some arbitrary constant then the distance relations between points can be quite different. Figure (1) shows a simple example of an axis being scaled and the impact this has on the distance relationships between three points.

As each dimension can have a different scale it is common practice to normalise, or standardise, the vectors. So if there are N data points, each point can be set to have a zero-mean value by simply subtracting the sample mean, i.e.

x ← x − (1/N) ∑_{n=1}^{N} x_n

Likewise the variance of the data can be set to one, so that each axis shares the same mean, zero, and a common variance, one. Applying this to the centred data,

x̃ ← x / √( (1/N) ∑_{n=1}^{N} x_n² )

where the division is understood elementwise along each dimension.

Instead of rescaling the data explicitly and then employing the squared distance metric



Figure 1: The left hand plot shows three points in a 2-d space defined by axes x and y. The pair of points with the shortest distance between them is joined by the dashed line. If we rescale the x-axis by a factor of 0.3, then we see that the closest pair of points changes.

δ(x, y)² = ∑_{d=1}^{D} (x_d − y_d)² = (x − y)^T (x − y)

we can actually change the metric directly such that

δ(x, y)² = (x − y)^T Σ (x − y)

where Σ is a local transformation based on a number of nearest neighbours around the point of interest, say x, which provides a local distortion. The definition of Σ is outwith the course content, but it does provide a nice way of locally adapting the metric used in KNN classification for each test point.
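To make the distance form concrete, one common global choice of Σ is the inverse of the sample covariance matrix of the training data, which gives the Mahalanobis distance; a minimal sketch of evaluating such a quadratic-form distance (this particular Σ is an illustrative assumption, not the locally adapted version referred to above):

% Quadratic-form distance delta(x,y)^2 = (x-y) * Sigma * (x-y)' for row vectors,
% here with Sigma chosen as the inverse sample covariance of the training data X (N x D).
Sigma = inv(cov(X));        % illustrative global choice of Sigma
d = x - y;                  % x and y are 1 x D row vectors
dist2 = d * Sigma * d';     % squared distance under the transformed metric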

Other metrics often used in KNN are the Minkowski family of metrics defined by

δ(x, y) = ( ∑_{d=1}^{D} |x_d − y_d|^p )^{1/p}

When p = 1 this metric is often referred to as the L1 norm, or the Manhattan or city-block distance, as it measures the shortest path based on segments which run parallel to the axes. When p = ∞ the L∞ norm defines the distance between x and y as the maximum distance along any of the D axes.
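These special cases are only a line each to compute; a small sketch for two D-dimensional row vectors x and y:

% Minkowski distances for particular values of p.
d1   = sum(abs(x - y));          % p = 1: Manhattan / city-block distance
d2   = sqrt(sum((x - y).^2));    % p = 2: the usual Euclidean distance
dinf = max(abs(x - y));          % p = infinity: largest coordinate-wise difference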


A set-theoretic metric which could be used when our objects are defined as sets, for example a bag-of-words representation of a document, is the Tanimoto metric defined below

δ(x, y) = (n_x + n_y − 2 n_xy) / (n_x + n_y − n_xy)

where n_x and n_y are the number of elements in sets x and y, and n_xy is the number of elements they share. So in last week's example of text classification, where we used a binary model, n_x would be the number of distinct words occurring in document x and n_xy the number of words that both documents have in common.
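For binary word-presence vectors the Tanimoto distance is only a few lines; a sketch, assuming x and y are 1 × V logical (0/1) vectors over a vocabulary of V words:

% Tanimoto distance between two binary (word presence/absence) vectors.
nx  = sum(x);             % number of distinct words in document x
ny  = sum(y);             % number of distinct words in document y
nxy = sum(x & y);         % number of words the two documents share
delta = (nx + ny - 2*nxy)/(nx + ny - nxy);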

The choice of metric in KNN is very important and this is an active area of research in Pattern Recognition.

5 KNN Performance

So how well can we expect the KNN classifier to perform? There is some analysis of the KNN rule which indicates that the error rate of a KNN classifier where K = 1 is never more than twice the Bayes error rate (this is of course an asymptotic result where N → ∞), but nevertheless it is indicative that good performance with the KNN can be expected. Intuitively we can think of KNN as making a local approximation of the posterior class probability in the region of the test point.

6 Experiments

Taking the two-dimensional data we used last week, we standardise the data by making it have zero mean and unit variance in both dimensions. This can be achieved simply using the Matlab commands

X = X - repmat(mean(X),size(X,1),1);
X = X./repmat(std(X),size(X,1),1);

and then we simply apply the KNN classifier to the test data and log the test error for a range of values of K from 1 to 100. The minimum test error rate achieved is 8.8%, which is rather impressive given that the theoretical optimum is about 8.0%. Logistic Regression using a polynomial of order 3 and a prior variance of α = 10 achieved an error rate of 9.2% on this data.
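A sketch of the experiment loop, assuming the knn_multi_class function given earlier and standardised matrices X, Xtest with targets t, ttest:

% Sweep K from 1 to 100 and record the percentage test error for each value.
Ks = 1:100;
test_err = zeros(length(Ks),1);
for ki = 1:length(Ks)
    test_err(ki) = knn_multi_class(X,t,Xtest,ttest,Ks(ki));
end
[min_err,best_ki] = min(test_err);
plot(Ks,test_err); xlabel('K - Number of Neighbours'); ylabel('Percentage Test Error');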


Figure 2: The plot shows the actual percentage of classification errors made on 1000 test points for values of K ranging from 1 to 100. A minimum test error of 8.8% is achieved using K = 33. This compares very favourably to the theoretical optimal Bayes Error rate achievable of around 8.0%.


Figure 3: The left hand plot shows the original data comprising two classes: one class is distributed as an annular ring whilst the second class is a simple Gaussian cloud. If we perform a simple transformation of the feature space into a polynomial of order 2, then we see that the transformed points of each of the two classes are such that they can be separated in a linear manner. This is in contrast to the highly nonlinear nature of the separating boundary in the original space. So by this simple transformation we have converted what was a nonlinear classification problem into a linear one.

7 Distance, Metrics & the Kernel Trick

Let us revisit the previous discussion about distances and metrics. We have already seen that by using polynomial or radial basis expansions of our feature vectors, such that x → φ(x), we can model functions of arbitrary complexity or define arbitrarily shaped decision surfaces for classification with a linear model. So what we are doing is replacing what is a nonlinear problem in the original feature space with a linear problem in our transformed feature space.

Figure (3) shows this effect rather nicely, where a simple polynomial of order two is used to transform the data.

Now when we compute a Euclidean distance between two objects in their original feature space, the distance between them is defined simply as

δ(x, y)² = (x − y)^T (x − y)


If we perform a transformation x → φ(x) then the distance between the objects in the transformed feature space will be defined as

δ(φ(x), φ(y))² = (φ(x) − φ(y))^T (φ(x) − φ(y))
               = φ(x)^T φ(x) + φ(y)^T φ(y) − 2 φ(x)^T φ(y)

We can see that the distance is simply defined by the inner-products computed in the transformed feature space. Now here is where the world famous kernel trick comes in.

A kernel function is defined for all x and y from a feature space X such that

K(x, y) = φ(x)^T φ(y)

where φ is a mapping from the original feature space X to an inner-product feature space F, i.e. φ : x → φ(x) ∈ F.

Let's think of a simple example. Let x and y ∈ R² and let's define the mapping

φ : x = (x_1, x_2)^T → φ(x) = (x_1², x_2², √2 x_1 x_2)^T ∈ F = R³

So we can evaluate the inner-product in F as

φ(x)^T φ(y) = (x_1², x_2², √2 x_1 x_2)(y_1², y_2², √2 y_1 y_2)^T
            = x_1² y_1² + x_2² y_2² + 2 x_1 x_2 y_1 y_2
            = (x_1 y_1 + x_2 y_2)²
            = (x^T y)²

So this is a pretty cool result as it shows that we can compute the inner-product in feature space F by simply computing the kernel function K(x, y) = (x^T y)² in the original feature space. This means that we do not need to explicitly define and compute the mapping φ in order to compute inner-products in F. We can therefore compute distances between points in F simply by computing the kernel function on the points in X.

δ(φ(x), φ(y))² = (φ(x) − φ(y))^T (φ(x) − φ(y)) = K(x, x) + K(y, y) − 2K(x, y)
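As a quick numerical check of the example above, the explicit mapping and the kernel shortcut can be compared directly; a small sketch:

% Verify that the degree-2 kernel (x'*y)^2 matches the explicit inner product
% in the transformed space phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)'.
x = randn(2,1); y = randn(2,1);
phi = @(z) [z(1)^2; z(2)^2; sqrt(2)*z(1)*z(2)];
lhs = phi(x)'*phi(y);          % inner product computed in F
rhs = (x'*y)^2;                % kernel evaluated in the original space
difference = abs(lhs - rhs);   % zero up to floating-point error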

There are a number of properties which have to be satisfied for a function K to be a valid kernel function. The main ones are that it is symmetric,


K(x_i, x_j) = K(x_j, x_i), and that the N × N dimensional matrix whose elements are K(x_i, x_j) ∀ i, j = 1 · · · N must be positive semi-definite. There is a whole class of kernel functions which can be employed, and we have already met the radial basis kernel K(x_i, x_j) = exp(−β ||x_i − x_j||²) in previous laboratory work.
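As a quick sanity check of these conditions, one can build the kernel matrix on a sample of data and inspect its eigenvalues; a sketch for the radial basis kernel above (β assumed positive):

% Build the N x N radial basis kernel matrix on the rows of X and check it is
% symmetric and positive semi-definite (no significantly negative eigenvalues).
beta = 1;
N = size(X,1);
K = zeros(N,N);
for i = 1:N
    for j = 1:N
        K(i,j) = exp(-beta*sum((X(i,:) - X(j,:)).^2));
    end
end
is_symmetric = isequal(K,K');
min_eigenvalue = min(eig((K + K')/2));   % should not be (noticeably) negative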

8 Support Vector Machines

We have already met discriminative classifiers which directly provide a discriminant function of the form

f(x; w) = w^T φ(x)

Consider a simple binary linear discriminant function (separating two classes) operating on a two-dimensional feature vector

g(x; w_2, w_1, w_0) = w_2 x_2 + w_1 x_1 + w_0 = w^T x + w_0

where we now denote by w the column vector comprising w_2 and w_1. If our class labels (target values) take on the values ±1 for the two classes then our decision function would simply test whether g(x; w_2, w_1, w_0) was positive or negative, in which case decisions are made based on the sign of the linear expansion of the feature vectors, i.e.

f(x; w_2, w_1, w_0) = sign(w_2 x_2 + w_1 x_1 + w_0) = sign(w^T x + w_0)

Now let us say that we have N data points in our available training sample, (x_1, t_1) · · · (x_N, t_N), and we assume that the two classes are completely linearly separable; then the training data will be correctly classified if

t_n(w^T x_n + w_0) > 0 ∀ n = 1 · · · N

Now if we cast our minds back to last week's lecture on probabilistic classification, we saw that there was a posterior distribution over the parameters, w, of our classifier, which of course means that there will be a number of w which could separate the training data perfectly; see Figure (4).

Whilst in the Bayesian methodology you would take the posterior average over w when making class predictions, the Support Vector Machine (SVM) exploits some results which come from Statistical Learning Theory to choose which of the possible decision functions should be used.


Figure 4: The samples of two classes, denoted by solid circles and squares, can be separated perfectly with no misclassifications by a number of possible w, some examples of which are drawn in this cartoon.

There are theoretical results from the analysis of the generalisation error of a binary classifier which show that an upper bound on the generalisation error is inversely proportional to the perpendicular distance between the separating hyperplane, w, and a hyperplane through the closest points from both classes; see Figure (5). This distance is called the margin in the SVM literature, and so to minimise the bound on the generalisation error we would seek to maximise the margin of our classifier.

So we will seek the hyperplane w which provides the maximum margin of separation between the two classes. Vector geometry shows that the distance of an arbitrary point x in some D-dimensional space to a hyper-plane H within this space, defined by all points that satisfy w^T x + w_0 = 0, is given by

(w^T x + w_0) / ||w||

So if x*_1 and x*_2 are the closest points from each class to the separating hyper-plane w, then the margin of separation is

(w^T x*_1 + w_0)/||w|| − (w^T x*_2 + w_0)/||w|| = (w^T/||w||)(x*_1 − x*_2)

Figure 5: The hyper-plane which maximises the margin.

As the SVM discriminant function is sign(w^T x + w_0), the decision made will clearly be invariant to an arbitrary rescaling of the argument w^T x + w_0, in which case we can define a canonical hyper-plane w such that w^T x*_1 + w_0 = 1 and w^T x*_2 + w_0 = −1, in which case the margin is simply 2/||w||. So to maximise this margin we need to minimise ||w|| subject to all the points being correctly classified. The SVM optimisation can be written as

min (1/2)||w||²

subject to

t_n(w^T x_n + w_0) ≥ 1 ∀ n = 1 · · · N

and by finding the solution to the above we will be using the w in our classifier which minimises the bound on the achievable generalisation error.

What we now have to do is find a way to identify the solution of the constrained optimisation above.


8.1 SVM Optimisation

Optimisation theory is a huge mathematical subject which has applications in just about every area of science, technology, economics... everywhere. We will have to use one result from this theory to make progress in obtaining our SVM classifier, and so the optimisation methods we now require are simply going to be stated and used in devising our SVM¹.

Given a constrained optimisation problem of the form

min f(w)

subject to

g_i(w) ≤ 0   i = 1 · · · K
h_i(w) = 0   i = 1 · · · M

we form the Lagrangian function as

L(w, α, β) = f(w) + ∑_{i=1}^{K} α_i g_i(w) + ∑_{i=1}^{M} β_i h_i(w)

We now find the minimum of L(w, α, β) with respect to w, which is denoted θ(α, β), and then we have to solve the following optimisation problem

max θ(α, β)

subject to

α_i ≥ 0 ∀ i = 1 · · · K

Following the above, let us now define the Lagrangian function we need for our SVM. Noting that there is only one set of inequality constraints and no equality constraints, then

L(w, w_0, α) = (1/2)||w||² + ∑_{n=1}^{N} α_n (1 − t_n(w^T x_n + w_0))

¹If you are interested in studying and understanding a bit more about the result we use here, the book Convex Optimisation is available online at http://www.stanford.edu/~boyd/cvxbook/. Be careful and don't print it out in your first flush of enthusiasm; it is a rather large book of some 730 pages.


where we have defined each g_n(w, w_0) = 1 − t_n(w^T x_n + w_0) ≤ 0, which comes from our original constraint.

We should know the drill by now: to find the stationary point of L(w, w_0, α) we take derivatives, and so

∂L(w, w_0, α)/∂w = w − ∑_{n=1}^{N} α_n t_n x_n = 0  ⇒  w = ∑_{n=1}^{N} α_n t_n x_n

and

∂L(w, w_0, α)/∂w_0 = −∑_{n=1}^{N} α_n t_n = 0  ⇒  ∑_{n=1}^{N} α_n t_n = 0

Now using the above we need to define our θ(α), so let us plug the results into L(w, w_0, α).

Using the result w = ∑_{n=1}^{N} α_n t_n x_n we should see that

(1/2)||w||² = (1/2) w^T w = (1/2) ( ∑_{n=1}^{N} α_n t_n x_n^T )( ∑_{m=1}^{N} α_m t_m x_m )
            = (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} α_n α_m t_n t_m x_n^T x_m

Now the second component of our Lagrangian needs to be considered

∑_{n=1}^{N} α_n (1 − t_n(w^T x_n + w_0)) = ∑_{n=1}^{N} α_n − ∑_{n=1}^{N} α_n t_n w^T x_n − w_0 ∑_{n=1}^{N} α_n t_n

  = ∑_{n=1}^{N} α_n − ∑_{n=1}^{N} α_n t_n w^T x_n

  = ∑_{n=1}^{N} α_n − ∑_{n=1}^{N} ∑_{m=1}^{N} α_n α_m t_n t_m x_n^T x_m

where the second line follows because ∑_{n=1}^{N} α_n t_n = 0, and the third by substituting w = ∑_{m=1}^{N} α_m t_m x_m.

So combining the two parts we obtain

θ(α) = ∑_{n=1}^{N} α_n − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} α_n α_m t_n t_m x_n^T x_m


Now this has to be maximised with respect to all α_n, subject to the constraints α_n ≥ 0 ∀ n = 1 · · · N and the additional constraint which emerges from our stationarity conditions, that is ∑_{n=1}^{N} α_n t_n = 0.

So at long last we arrive at the SVM optimisation problem

max ∑_{n=1}^{N} α_n − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} α_n α_m t_n t_m x_n^T x_m

subject to

α_n ≥ 0 ∀ n = 1 · · · N

∑_{n=1}^{N} α_n t_n = 0

There are a number of ways to solve this problem and we will employ a simple quadratic optimisation solver which is written in Matlab. We will not consider the details of such solvers, but suffice to say this has been an important area of development for SVMs so that the required optimisation can be solved efficiently for large data sets.

9 Support Vectors

What we find is that a number of the α_n parameters are returned from the optimisation as having zero value. The α_n which have non-zero values are important, and as they are associated with individual vectors x_n in the training sample these are referred to as the Support Vectors, as they support the decision boundary between the two classes.

Now as the discriminant function for a new point x_new is of the form w^T x_new + w_0, we can write the following by noting that w = ∑_{n=1}^{N} α_n t_n x_n, and that we only need to take the summation over the non-zero α values, in other words to sum over the Support Vectors. Also, noting that we are relying on an inner-product between x_n and x_new, we can employ the kernel trick and write the inner-product in the feature space as the kernel function K(x_n, x_new). Then our SVM discriminant function becomes


f(x_new; w, w_0) = sign(w^T x_new + w_0)
                 = sign( ∑_{n=1}^{N} t_n α_n x_n^T x_new + w_0 )
                 = sign( ∑_{n∈SV} t_n α_n x_n^T x_new + w_0 )
                 = sign( ∑_{n∈SV} t_n α_n K(x_n, x_new) + w_0 )

Now w_0 can be obtained by noting that, as we defined w^T x*_1 + w_0 = 1 and w^T x*_2 + w_0 = −1, adding these together gives w_0 = −0.5 w^T(x*_1 + x*_2), where x*_1 and x*_2 are support vectors from each class.

The optimisation problem which we require to solve can, of course, be written in matrix format as below, where we denote the diagonal matrix diag(t) by Λ and the N × 1 dimensional vector of ones by 1. The N × N dimensional kernel matrix K is the matrix whose elements are K(x_i, x_j), which in the linear case is simply XX^T. This of course opens up the door for more 'exotic' kernel functions to be used in our SVM classifier.

max α^T 1 − (1/2) α^T Λ K Λ α

subject to

α_n ≥ 0 ∀ n = 1 · · · N

α^T t = 0

A little code example is given below. Fifty examples of two well-separated classes are drawn from two Gaussians and the constrained quadratic solver monqp0 is called²; we pass the matrix ΛKΛ, the vectors 1 & t, a value C (which we shall discuss shortly) and a convergence threshold value as parameters. The function then passes back the non-zero α values (the support vectors), the value of w_0 and the indices of the support vectors. A contour grid of points x is created and the value of the SVM discriminant function is computed at

²This Matlab implementation was obtained from Stéphane Canu, http://asi.insa-rouen.fr/~scanu/


each one of these points, and the discriminant hyperplane is then plotted out. You should be able to see clearly that the solution is very sparse: only a very small portion of the original data points end up being used in the SVM. As usual the code is available at the class website.

9.1 Toy Implementation

clear

Step = 0.5;                 % grid spacing for the contour plot
N    = 50;                  % total number of training points
C    = 1000;                % box constraint (large => effectively hard margin)

% Two well separated Gaussian clouds, the second shifted by 6 along the first axis.
X = [randn(N/2,2); randn(N/2,2) + [ones(N/2,1).*6 zeros(N/2,1)]];
t = [ones(N/2,1); -ones(N/2,1)];

% Solve the dual QP: max alpha'*1 - 0.5*alpha'*(Lambda*K*Lambda)*alpha
% with the linear kernel K = X*X'. monqp0 returns the non-zero alphas,
% the bias w_0 and the indices of the support vectors.
[alpha,w_0,alpha_index] = monqp0(diag(t)*X*X'*diag(t),ones(N,1),t,C,1e-6);

% Define contour grid
mn = min(X);
mx = max(X);
[x1,x2] = meshgrid(floor(mn(1)):Step:ceil(mx(1)),floor(mn(2)):Step:ceil(mx(2)));
[n11,n12] = size(x1);
[n21,n22] = size(x2);
XG = [reshape(x1,n11*n12,1) reshape(x2,n21*n22,1)];

% Evaluate the discriminant function on the grid and plot its zero contour.
f = (t(alpha_index).*alpha)'*X(alpha_index,:)*XG' + w_0;

plot(X(alpha_index,1),X(alpha_index,2),'go'); hold on
plot(X(1:N/2,1),X(1:N/2,2),'.')
plot(X(N/2+1:N,1),X(N/2+1:N,2),'r.')
contour(x1,x2,reshape(f,[n11,n12]),[0 0]); hold off

Figure (6) shows the SVM decision plane and the support vectors for this little toy data set.
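For a nonlinear decision boundary the same pipeline applies, with XX^T replaced by a kernel matrix and the grid evaluation using kernel values between the grid points and the support vectors; a sketch, assuming the same monqp0 solver and a radial basis kernel with an assumed width parameter beta:

% Radial basis kernel matrix on the training points (beta is an assumed value).
beta = 1;
K = exp(-beta*( repmat(sum(X.^2,2),1,N) + repmat(sum(X.^2,2)',N,1) - 2*X*X' ));
[alpha,w_0,alpha_index] = monqp0(diag(t)*K*diag(t),ones(N,1),t,C,1e-6);

% Kernel values between each grid point and each support vector.
Xs = X(alpha_index,:);
Kg = exp(-beta*( repmat(sum(XG.^2,2),1,length(alpha_index)) ...
               + repmat(sum(Xs.^2,2)',size(XG,1),1) - 2*XG*Xs' ));
f  = Kg*(t(alpha_index).*alpha) + w_0;     % discriminant values on the grid

The contour plotting of f is then exactly as in the linear case above.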

For the case where the samples from the two classes may not be completely linearly separable, the SVM optimisation problem can be posed in such a way as to take these possible errors into account.


Figure 6: The SVM decision plane separating examples from two classes, along with the support vectors, which are highlighted. Note that there are only three non-zero α components and so only three points in the data set are supporting the decision surface.

It turns out that only a very simple change to the SVM optimisation is required: the positivity constraint changes from α_n ≥ 0 to 0 ≤ α_n ≤ C, for all n, where C is a box constraint parameter.

9.2 SVM Hyper-Parameters

It is clear then that the SVM has a number of hyper-parameters which will have to be tuned using, for example, LOOCV. The box constraint parameter C is one, and any parameters associated with the kernel function will be the other(s).

If we take the two-dimensional binary-class data set which was used last week and employ an SVM in classifying the test examples, we can study how well the SVM does on this example for varying values of C. Figure (7) shows the classification error as C is varied from C = 1 to C = 20 in unit steps when an SVM is trained using a third-order polynomial kernel function K(x_i, x_j) = (1 + x_i^T x_j)³. A minimum test error of 9.4% is achieved at C = 2. This is comparable with the best performance achieved with the Bayesian classifier using a cubic polynomial expansion.
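A sketch of this kind of sweep over C, assuming a hypothetical helper svm_test_error(X,t,Xtest,ttest,kfun,C) that builds the kernel matrices with the supplied kernel function, trains the SVM (for example via monqp0 as above) and returns the percentage test error; the helper itself is not given here:

% Sweep the box constraint C for a fixed third-order polynomial kernel.
kfun  = @(A,B) (1 + A*B').^3;     % third-order polynomial kernel
Cvals = 1:20;
err   = zeros(length(Cvals),1);
for ci = 1:length(Cvals)
    err(ci) = svm_test_error(X,t,Xtest,ttest,kfun,Cvals(ci));   % hypothetical wrapper
end
[best_err,best_ci] = min(err);
best_C = Cvals(best_ci);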


Figure 7: The left hand plot shows the test error achieved for varying values of C when using the third-order polynomial kernel function. The right hand plot shows the training data and the decision surface. The support vectors are highlighted and they can all be seen to be clumped around the decision surface.

Clearly there are many types of kernel function which could be used, and if we were to consider the radial basis style function, which we met in the Week 3 Laboratory, then we would need to search over all combinations of the width parameter β and the box constraint parameter C. Figure (8) shows the results of such a search.

10 Conclusions

We have introduced the KNN and SVM classifiers, along with the kernel trick, in this week's classes. There has been an explosion of research in the last ten years focused on SVM and kernel methods, and this class barely scratches the surface of the literature on these methods. There have been some impressive applications of SVMs in computational biology and Information Retrieval due to their excellent classification performance in general. However, the incredibly simple KNN classifier outperforms the SVM on the two-class problem we have been looking at. This is not always the case, but it should be said that for many applications the KNN classifier performs rather impressively.


Figure 8: The percentage error achieved by an SVM using a Radial Basis kernel function with a width parameter ranging from 0.01 to 4.0 in step sizes of 0.05. For each width value a value of C was selected from 1 to 4, and we can see that the minimum test error of 9.0% was achieved with hyper-parameter values of C = 1 and β = 1.4.
