Using SVMs for Scientists and Engineers - PRT Blog


    http://newfolder.github.io/blog/2013/07/24/using-svms/

Using SVMs for Scientists and Engineers

Jul 24th, 2013

In the mid-90s, support vector machines (SVMs) became extremely popular machine learning algorithms due to a number of very nice properties, and because they can achieve state-of-the-art performance on a number of data sets. Although the statistical underpinnings of why SVMs work rely on somewhat abstract statistical theory, modern packages (like LibSVM, and the PRT) make training and using SVMs almost trivial for the average engineer. That said, getting good performance out of an SVM is often not as easy as simply running pre-existing code on your data, and for some data sets, SVM classification may not be appropriate.

This blog entry will serve two purposes: 1) to provide an introduction to practical issues you (as an engineer or scientist) may encounter when using an SVM on your data, and 2) to be the first in a series of similar "for Engineers & Scientists" posts dedicated to helping engineers understand the tradeoffs, assumptions, and practical details of using various machine learning approaches on their data.

Contents

Quick Notes

    SVM Formulation

    Appropriate Data Sets

    SVM Parameters & Notes

    Parameter: Cost (Scalar)

    Parameter: Relative Class Error Weights

    Parameter: Kernel Choice & Associated Parameters

SVM Pre-Processing

    Optimizing Parameters

Some Rules-Of-Thumb

Concluding

Quick Notes

Throughout this post, we'll be using prtClassLibSvm, which is built directly on top of the fantastic LibSVM library, available here:

http://www.csie.ntu.edu.tw/~cjlin/libsvm/

The parameter nomenclature we're using matches theirs pretty closely, so feel free to leverage their

    documentation as well.

SVM Formulation

Typical SVM formulations assume that you have a set of n-dimensional real training vectors, {x_i} for i = 1, ..., N, and corresponding labels {y_i}, y_i \in {-1,1}. Let x_ik represent the kth element of the vector x_i. Also assume that you have a relevant kernel function (https://en.wikipedia.org/wiki/Kernel_methods), P, which takes two input arguments, both n-dimensional real vectors, and outputs a scalar metric: P(x_i,x_j) = z_ij. The most common choice of P is a radial basis function (http://en.wikipedia.org/wiki/Radial_basis_function):

P(x_i,x_j) = exp( -( \sum_{k} (x_ik - x_jk)^2 ) / s^2 )

SVMs perform prediction of new labels by calculating:

f(x) = \hat{y} = ( \sum_{i} w_i * P(x_i,x) - b ) > 0

i.e., the SVM learns a representation for the labels (y) with a linear combination (w) of a set of kernel functions of the training data (x_i) and the test data (x).
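As a concrete illustration, here is the decision rule above sketched in Python/NumPy. The weights and "support vectors" are hypothetical numbers chosen for the example, not output of a real SVM trainer:

```python
import numpy as np

def rbf_kernel(xi, xj, s=1.0):
    # P(x_i, x_j) = exp(-sum_k (x_ik - x_jk)^2 / s^2), as defined above
    return np.exp(-np.sum((xi - xj) ** 2) / s ** 2)

def svm_predict(x, train_x, w, b, s=1.0):
    # f(x) = (sum_i w_i * P(x_i, x) - b) > 0
    total = sum(wi * rbf_kernel(xi, x, s) for wi, xi in zip(w, train_x))
    return bool((total - b) > 0)

# Two hypothetical retained training vectors with opposite-sign weights:
train_x = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
w = [1.0, -1.0]

print(rbf_kernel(train_x[0], train_x[0]))                    # identical vectors -> 1.0
print(svm_predict(np.array([0.1, 0.0]), train_x, w, b=0.0))  # near the +1 vector -> True
```

A test point near the positively weighted vector gets a kernel response near 1 from it and nearly 0 from the distant negative vector, so the weighted sum is positive.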

Appropriate Data Sets

Binary/M-Ary: Typically, SVMs are appropriate for binary classification problems - multi-class problems require some extensions of SVMs, although in the PRT, SVMs can be used in prtClassBinaryToMaryOneVsAll to emulate multi-class classification.

Data: SVM formulations often assume vector-valued training data; however, as long as a suitable kernel function can be constructed, SVMs can be used on arbitrary data (e.g., string-match distances can be used as a kernel for calculating the distances between character strings). Note, however, that SVMs do assume that the kernel used is a Mercer kernel, so some functions are not appropriate as SVM kernels - see http://en.wikipedia.org/wiki/Mercer's_theorem.

Computational Considerations: Depending on the kernel and the particular algorithm under consideration, training an SVM can be very time-consuming for very large data sets. Proper selection of SVM parameters can significantly improve training time. At run-time, SVMs are typically very fast, with computational complexity that grows approximately linearly with the number of training vectors retained with non-zero weights.

SVM Parameters & Notes

As you might imagine, several SVM parameters will have a significant effect on overall classification performance. Good performance requires careful selection of each of these, though some general rules-of-thumb can help provide reasonable performance with a minimum of headaches.

Parameter: Cost (Scalar)

Internally, the SVM is going to try to ignore a whole bunch of your training data by setting the corresponding w_i to zero. This might sound counter-intuitive, but it's very important, because it makes for fast run-time, and also (it turns out) setting a bunch of ws to zero is fundamental to why the SVM performs so well in general (see any number of articles on V-C theory for more information).


Unfortunately, this presents a dilemma - how much should the SVM try to make the ws zero vs. how much should it try to classify your data absolutely perfectly? Fewer zero ws might improve performance on the training set, but reduce the performance of the SVM on an unseen testing set!

    The Cost parameter in the SVM enables you to control this trade off. Higher cost leads to more non-

    zero w vectors, and more correctly classified training points, while lower costs tend to generate w

    vectors with lots of zeros, and slightly worse performance on training data (though performance on

    testing data may be better).

We usually run a number of experiments for different cost values across a range of, say, 0.01 to 100, though if performance has not plateaued at the ends of that range it might make sense to extend it. The following figures show how the SVM decision boundaries change with varying costs in the PRT.

close all;
ds = prtDataGenUnimodal;
c = prtClassLibSvm;
count = 1;
for w = logspace(-2,2,4)
    c.cost = w;
    c = c.train(ds);
    subplot(2,2,count);
    plot(c);
    legend off;
    title(sprintf('Cost: %.2f',c.cost));
    count = count + 1;
end


Parameter: Relative Class Error Weights

In typical discussions of cost, errors in both classes are treated equally - e.g., it's equally bad to call a -1 a 1 and vice-versa. In realistic operations, that may not be the case - for example, failing to detect a landmine is significantly worse than calling a coke can a landmine.

Luckily, SVMs enable us to specify class-specific error costs, so if class 1 has an error cost of 1, and class -1 has an error cost of 100, it's 100x as bad to mistake a -1 for a 1 as the opposite.
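The net effect is easy to see with simple arithmetic: each training error contributes a penalty scaled by its class weight. A hedged Python sketch of that bookkeeping (illustrative numbers, not LibSVM internals):

```python
def weighted_error_penalty(errors_by_class, weights, cost=1.0):
    # Each training error contributes cost * weight[class] to the total
    # penalty the SVM tries to minimize during training.
    # errors_by_class / weights: dicts keyed by class label (-1 or +1).
    return cost * sum(errors_by_class[c] * weights[c] for c in errors_by_class)

weights = {-1: 100.0, +1: 1.0}
# 3 missed landmines (class -1) hurt far more than 3 false alarms (class +1):
print(weighted_error_penalty({-1: 3, +1: 0}, weights))  # 300.0
print(weighted_error_penalty({-1: 0, +1: 3}, weights))  # 3.0
```

With these weights the optimizer will happily trade many class-1 errors to avoid a single class-(-1) error.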

LibSVM implements these class-specific weights using parameters called w-1, w1, etc. In the PRT, these are implemented as a vector property, weight. The following example shows how changing the error weight on class 1 affects the overall SVM contours. Clearly, as the cost on class 1 increases, the SVM spends more effort to correctly classify red elements.

close all;
ds = prtDataGenUnimodal;
c = prtClassLibSvm;
count = 1;
for w = logspace(-1,1,4)
    c.weight = [1 w]; % Class 0: 1, Class 1: w
    c = c.train(ds);
    subplot(2,2,count);
    c.plot();
    legend off;
    title(sprintf('Weight: [%.2f,%.2f]',c.weight(1),c.weight(2)));
    count = count + 1;
end


Parameter: Kernel Choice & Associated Parameters

The proper choice of kernel makes a huge difference in the resulting performance of your classifier. We tend to stick with linear and RBF kernels (kernelType = 0 or 2 in prtClassLibSvm, respectively), but several other options (including hand-made kernels) are also possible. The linear kernel doesn't have any parameters to set, but the RBF has a parameter that can significantly impact performance. In most formulations the parameter is referred to as sigma, but in LibSVM the parameter is gamma, which is equivalent to 1/s^2 in the RBF formula above. For the RBF, you can set gamma to any positive value. You can also use the special character 'k' and specify a coefficient as a string; 'k' will evaluate to the number of features in the data set - e.g., '5k' evaluates to 10 for a 2-dimensional data set.

In general, we find that for normalized data (see below), the default gamma value of 'k' (the number of dimensions) works well.
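The s-vs-gamma bookkeeping is easy to get wrong, so it's worth checking the equivalence numerically. A quick NumPy sanity check, using the RBF definition from this post on one side and LibSVM's exp(-gamma * distance^2) form on the other:

```python
import numpy as np

xi = np.array([1.0, 2.0])
xj = np.array([0.5, 0.0])
s = 2.0

# This post's form: exp(-sum_k (x_ik - x_jk)^2 / s^2)
k_sigma = np.exp(-np.sum((xi - xj) ** 2) / s ** 2)

# LibSVM's form: exp(-gamma * sum_k (x_ik - x_jk)^2), with gamma = 1/s^2
gamma = 1.0 / s ** 2
k_gamma = np.exp(-gamma * np.sum((xi - xj) ** 2))

print(np.isclose(k_sigma, k_gamma))  # True
```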

    The following example code generates 4 example images for SVM decision boundaries for varying

    gamma parameters.

close all;
c = prtClassLibSvm;
count = 1;
d = prtDataGenUnimodal;
for kk = logspace(-1,.5,4)
    c.gamma = sprintf('%.2fk',kk);
    c = c.train(d);
    subplot(2,2,count);
    c.plot();
    title(sprintf('\\gamma = %s',c.gamma));
    legend off;
    count = count + 1;
end

SVM Pre-Processing

Note that for many kernel choices (e.g., RBF, and many others - see http://en.wikipedia.org/wiki/Kernel_methods#Popular_kernels), the kernel output P(x_i,x_j) depends strongly and non-linearly on the magnitudes of the data vectors. E.g., exp(-1000) is not equal to 1000*exp(-1). In fact, if you refer to the RBF equation above, you'll notice that if the squared distance between two vectors approaches 1000*s^2, P(x1,x2) will be dominated by a term like exp(-1000), which by any reasonable metric (and certainly in floating-point precision) is exactly 0. This is a bad thing.

In general, non-linear kernel functions should only be applied to data that is guaranteed to be in a reasonable range (e.g., -10 to 10), or data that has been pre-processed to remove outliers or control for data magnitude. The PRT makes several such techniques available - compare and contrast the performance in the following example:

    close all;

    ds = prtDataGenBimodal;

    ds.X = 100*ds.X; %scale the data

    yOutNaive = kfolds(prtClassLibSvm,ds,3);

    yOutNorm = kfolds(prtPreProcZmuv + prtClassLibSvm,ds,3);

    [pfNaive,pdNaive] = prtScoreRoc(yOutNaive);

    [pfNorm,pdNorm] = prtScoreRoc(yOutNorm);

    h = plot(pfNaive,pdNaive,pfNorm,pdNorm);

    set(h,'linewidth',3);

    legend(h,{'Naive','Pre-Proc'});

    title('ROC Curves for Naive and Pre-Processed Application of SVM to Bimodal Data');

Clearly, performance on un-normalized data is atrocious, but simple re-scaling achieves good results.
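ZMUV (zero-mean, unit-variance) normalization itself is simple: fit per-feature statistics on the training set, then apply those same statistics unchanged to any test data. A minimal NumPy sketch of the idea (not the PRT's prtPreProcZmuv implementation):

```python
import numpy as np

def zmuv_fit(train_x):
    # Per-feature mean and standard deviation, from the training set only
    return train_x.mean(axis=0), train_x.std(axis=0)

def zmuv_apply(x, mu, sigma):
    # Apply training-set statistics to any data (train or test)
    return (x - mu) / sigma

rng = np.random.default_rng(0)
train = 100 * rng.standard_normal((200, 2))  # badly scaled, like ds.X above
mu, sigma = zmuv_fit(train)
normed = zmuv_apply(train, mu, sigma)
print(np.allclose(normed.mean(axis=0), 0))  # True
print(np.allclose(normed.std(axis=0), 1))   # True
```

The key point, which kfolds handles for you in the PRT, is that mu and sigma must come from training folds only; re-fitting them on test data would leak information.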

Optimizing Parameters

The general procedure in developing an SVM is to optimize both the cost and gamma parameters for your particular data set. You can do this using two for-loops and the PRT:

close all;
gammaVec = logspace(-2,1,10);
costVec = logspace(-2,1,10);
ds = prtDataGenUnimodal;
auc = nan(length(gammaVec),length(costVec));
kfoldsInds = ds.getKFoldKeys(3);
for gammaInd = 1:length(gammaVec)


    for costInd = 1:length(costVec)
        c = prtClassLibSvm;
        c.cost = costVec(costInd);
        c.gamma = gammaVec(gammaInd);
        yOut = crossValidate(c,ds,kfoldsInds);
        auc(gammaInd,costInd) = prtScoreAuc(yOut);
        imagesc(auc,[.95 1]);
        colorbar; drawnow;
    end
end
title('AUC vs. Gamma Index (Vertical) and Cost Index (Horizontal)');
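The PRT snippet above visualizes the AUC surface; to actually select operating parameters you take the argmax over the grid. A toy Python sketch of that last step, with a stand-in scoring function in place of cross-validated AUC (the peak location is contrived for illustration):

```python
import numpy as np

gamma_vec = np.logspace(-2, 1, 10)
cost_vec = np.logspace(-2, 1, 10)

def score(gamma, cost):
    # Stand-in for cross-validated AUC; peaks near gamma = 1, cost = 1.
    return 1.0 - 0.01 * (np.log10(gamma) ** 2 + np.log10(cost) ** 2)

# Fill the grid, then pick the (gamma, cost) pair with the highest score
auc = np.array([[score(g, c) for c in cost_vec] for g in gamma_vec])
best_g, best_c = np.unravel_index(np.argmax(auc), auc.shape)
print(gamma_vec[best_g], cost_vec[best_c])  # the best-scoring pair
```

In practice you would refine the grid around the winning pair, and report performance on data never touched by the search.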

Some Rules-Of-Thumb

In general, you may not have time to optimize over your SVM parameters, or may simply not want to. In that case, you can usually get by using ZMUV pre-processing and the default SVM parameters (RBF kernel, cost = 1, gamma = 'k'):

algo = prtPreProcZmuv + prtClassLibSvm;


Concluding

We hope this entry helps you make sense of how to use an SVM in real-world scenarios, and how to optimize the SVM parameters for your particular data set. As always, proper cross-validation is fundamental to good generalizability.

Happy coding.

    Posted by Pete Jul 24th, 2013

9 Comments

    Sunil Dadhich

    Why do we need to optimize the C and g values in SVM?

    Peter Torrione Hi Sunil, The parameters in the SVM control the relative tradeoffs between sparsity and

    accuracy on the training data set - even though the default parameters may work well, they

    are not guaranteed to work ideally on all data sets. As a result, optimizing the parameters is

    recommended. Not sure if that answers your question...

    Mauro Baldi

    Hello and many thanks again for the previous help!

This time I tried to build and compare different classifiers with the fantastic PRT you developed, and the last classifier is an SVM.

I thoroughly read this guide and I tried, at first, to skip the "manual" pre-processing phase. Instead, I used the ZMUV pre-processing which, as stated in the guide, avoids the need to optimize the SVM parameters manually.

    Nevertheless, the resulting ROC curves are not as satisfactory as those coming from the

    other classifiers.

What I am wondering is whether this is normal (as I skipped a more detailed preprocessing) or whether there might be something wrong with my code.

    My code is:


%% CLASSIFIER (PREPROCZMUV + SVM) %%

    algoSVM = prtPreProcZmuv + prtClassLibSvm;

    algoSVM = algoSVM.train(TrainingSet);

    %% TEST %%

    yOutTest = algoSVM.run(TestSet);

    kennethmorton Mod

    I don't see anything immediately wrong with your code. The default options for

    LibSVM uses an RBF kernel. If your data is high dimensional you may need to use

    something to reduce the dimensionality first. Have any other kernel classifiers

    worked?

    Mauro Baldi

    Hello Kenny and thank you for your reply.

    My data set is not very big. It consists of 1393 rows, 3 columns (the features)

    and the corresponding target values (either 0 or 1).

    So far I used the RBF kernel as default. I am trying to change the kernel type.

    In particular, I read in the help that the kernel attribute is kernelType.

    But if I type

    algoSVM.kernelType = 0;

    to set a linear kernel the following error code appers:

    No public field kernelType exists for class prtAlgorithm.

    So this means that the kernelType attribute is a private one and might be

    changed through a set method.

    How can I do that?

I also have several questions about this procedure and I apologize in advance if the message is too long.

    kennethmorton Mod

    Mauro,

When you use the following line:


    >> algoSVM = prtPreProcZmuv + prtClassLibSvm;

    you are constructing a prtAlgorithm. This is why the properties of the

SVM cannot be set directly using algoSVM. Referencing the individual

    components of the algorithm can be done by accessing the actionCell

    property of prtAlgorithm

    >> algoSVM.actionCell{2}.kernelType = 0;

    In general I don't like to do things this way. I find it is cleaner to

    construct the algorithm with the properties you want using string value

    pairs. For example

    >> algoSVM = prtPreProcZmuv + prtClassLibSvm('kernelType',0);

1. I am confused by your code. Should there be two SVM algorithms.

    Mauro Baldi

    Hello Kenny and thank you very much for your, as always, fast and

    very detailed replies.

    My goal is this: I have a data set made up of a training set and a test

    set.

What I would like to do is to build many classifiers (including SVMs) and, at the end, pick the most promising one.

So far I have built RVM, KNN and SVM classifiers, all thanks to your PRT toolbox and help. So, I am really very grateful to you and Peter.

    Although this post is just devoted to SVM, I have questions both on

    SVMs but also on other issues I have encountered while trying to

    implement your suggestions.

Therefore, I'd like to ask you whether I can contact you or Peter privately.

    Anyway, here are my questions:

    1) You asked me " I am confused by your code. Should there be two

    SVM algorithms. One with a linear kernel and one with an RBF

    kennethmorton Mod

    Mauro,


    This is getting a bit detailed for the comments section. Let's talk this

    offline. Please feel free to email me at [email protected]

    Kenny

    Mauro Baldi

    Hello Kenny,

    this time I am writing here because the questions I am gonna ask

    might interest other people.

    In a previous post you said that it is not a problem if you calibrate a

    SVM with RBF kernel with or without any preprocessing.

    Just to check, I tried the following calibrations:

    1) Preprocessing with prtPreProcZmuv and automatic training (i.e.,

    without the double loop on parameters Cost and gamma)

    2) Manual calibration with prtPreProcPca preprocessing:

    algoSVManual = prtPreProcPca + prtClassLibSvm;

    3) Manual calibration without any preprocessing

    algoSVManual = prtClassLibSvm;

    4) Manual calibration with prtPreProcZmuv preprocessing:

Copyright 2013 - Kenneth Morton and Peter Torrione - Powered by Octopress - Theme by Brian Armstrong
