
IT 12 061

Examensarbete 30 hp
November 2012

Online learning of multi-class Support Vector Machines

Xuan Tuan Trinh

Institutionen för informationsteknologi
Department of Information Technology

Teknisk-naturvetenskaplig fakultet, UTH-enheten
Besöksadress: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postadress: Box 536, 751 21 Uppsala
Telefon: 018 – 471 30 03
Telefax: 018 – 471 30 00
Hemsida: http://www.teknat.uu.se/student

Abstract

Online learning of multi-class Support Vector Machines

Xuan Tuan Trinh

Support Vector Machines (SVMs) are state-of-the-art learning algorithms for classification problems due to their strong theoretical foundation and their good performance in practice. However, their extension from two-class to multi-class classification problems is not straightforward. While some approaches solve a series of binary problems, other, theoretically more appealing methods solve one single optimization problem. Training SVMs amounts to solving a convex quadratic optimization problem. But even with a carefully tailored quadratic program solver, training all-in-one multi-class SVMs takes a long time for large scale datasets. We first consider the problem of training the multi-class SVM proposed by Lee, Lin and Wahba (LLW), which is the first Fisher consistent multi-class SVM that has been proposed in the literature, and has recently been shown to exhibit good generalization performance on benchmark problems. Inspired by previous work on online optimization of binary and multi-class SVMs, a fast approximative online solver for the LLW SVM is derived. It makes use of recent developments for efficiently solving all-in-one multi-class SVMs without bias. In particular, it uses working sets of size two instead of following the paradigm of sequential minimal optimization. After successful implementation of the online LLW SVM solver, it is extended to also support the popular multi-class SVM formulation by Crammer and Singer. This is done using a recently established unified framework for a larger class of all-in-one multi-class SVMs. This makes it very easy in the future to adapt the novel online solver to even more formulations of multi-class SVMs. The results suggest that online solvers may provide for a robust and well-performing way of obtaining an approximative solution to the full problem, such that it constitutes a good trade-off between optimization time and reaching a sufficiently high dual value.

Tryckt av: Reprocentralen ITC
IT 12 061
Examinator: Lisa Kaati
Ämnesgranskare: Robin Strand
Handledare: Christian Igel

Acknowledgments

First of all, I would like to say many thanks to my supervisor Prof. Christian Igel from the Image Group, Department of Computer Science, University of Copenhagen. Thank you so much for giving me the opportunity to do my thesis with you, for your inspiring topic and for your guidance.

I would also like to thank my assistant supervisor Matthias Tuma very much for his invaluable suggestions and for his patience in teaching me how to write an academic report. I would also like to thank Oswin Krause for taking the time to proofread my thesis. Thanks to all my friends and colleagues in the Shark framework group for useful comments and interesting discussions during my thesis work.

I am very grateful to my reviewer Prof. Robin Strand for his encouragement and for kindly providing me with an account for running experiments on the UPPMAX system.

I would like to express my appreciation to my family and my girlfriend for their love, their constant encouragement and their support over the years.

Finally, many thanks to my friends who helped me directly or indirectly to complete this thesis.

Contents

1 Introduction
  1.1 Structure of this thesis
  1.2 Contributions

2 Background
  2.1 The Learning problem
    2.1.1 Supervised learning
    2.1.2 Generalization and regularization
    2.1.3 Consistency
    2.1.4 Fisher consistency
    2.1.5 Online learning and batch learning
  2.2 Support Vector Machines
    2.2.1 Linear classification
    2.2.2 Kernels
    2.2.3 Regularized risk minimization in RKHS
    2.2.4 Binary Support Vector Machines
    2.2.5 Support Vector Machine training
    2.2.6 Multi-class SVMs
    2.2.7 The unified framework for multi-class SVMs
    2.2.8 Online SVMs

3 Training Multi-class Support Vector Machines
  3.1 Sequential two-dimensional optimization
  3.2 Gain computation for the unconstrained sub-problem
  3.3 Parameter update for the constrained sub-problem

4 Online training of the LLW multi-class SVM
  4.1 The LLW multi-class SVM formulation
  4.2 Online learning for the LLW SVM
    4.2.1 Online step types
    4.2.2 Working set strategies for the online LLW SVM
    4.2.3 Gradient information storage
    4.2.4 Scheduling
    4.2.5 Stopping conditions
    4.2.6 Implementation details
  4.3 Experiments
    4.3.1 Experiment setup
    4.3.2 Model selection
    4.3.3 Comparing online LLW with batch LLW
    4.3.4 The impact of the cache size
    4.3.5 Comparison of working set selection strategies for online LLW

5 Unified framework for online multi-class SVMs
  5.1 The dual problem for the unified framework
  5.2 Online learning for the unified framework
    5.2.1 Online step types and working set selection
    5.2.2 Parameter update
  5.3 Experiments
    5.3.1 Comparing online CS with batch CS solvers
    5.3.2 Comparison of working set selection strategies

6 Conclusion and Future Work

List of Figures

2.1 The geometric interpretation of binary Support Vector Machines

4.1 Test error as a function of time for the LLW SVM
4.2 Test error as a function of computed kernel evaluations for the LLW SVM
4.3 Dual value as a function of time for the LLW SVM
4.4 The increasing dual curves with respect to time
4.5 The increasing dual curves with respect to kernel evaluations

5.1 Test error as a function of time for the CS SVM
5.2 Test error as a function of computed kernel evaluations for the CS SVM
5.3 Dual value as a function of time for the CS SVM

List of Tables

2.1 Coefficients νy,c,m for the different margins in the unified form
2.2 Margin-based loss functions in the unified form

4.1 Datasets used in the experiments
4.2 Best parameters for the LLW SVM found in grid search
4.3 Comparing online LLW with batch LLW solvers
4.4 Speed-up factor of online LLW × 1 over batch LLW solvers at the same dual value
4.5 Comparing working set strategies for the online LLW SVM

5.1 Best parameters for CS found in grid search
5.2 Comparing online CS with batch CS solvers
5.3 Comparing working set strategies for the online CS SVM

List of Algorithms

1 Sequential minimal optimization
2 LaSVM
3 PROCESS_NEW online step
4 PROCESS_OLD online step
5 OPTIMIZE online step
6 Sequential two-dimensional optimization
7 Gain computation for S2DO
8 Solving the two-dimensional sub-problem
9 Working set selection for online LLW SVM steps
10 Online LLW SVM solver with adaptive scheduling
11 Working set selection for TWO_NEWS for the unified framework
12 Working set selection for TWO_OLDS for the unified framework
13 Working set selection for TWO_OLD_SUPPORTS for the unified framework
14 Solving the two-dimensional sub-problem when $i = j$ and $s_{y_i}(p) = s_{y_j}(q)$
15 Solving the single variable optimization problem

Chapter 1

Introduction

Machine learning is an active discipline in Computer Science which aims to enable computers to learn from data and make predictions based on it. A large number of accurate and efficient learning algorithms exist, which have been applied in various areas from search engines and natural language processing to bioinformatics. However, with recent advances in technology, an ever increasing amount of data is collected and stored. As a consequence, there is a high demand for learning algorithms that are fast and have low memory requirements when trained on large-scale datasets.

Among learning algorithms, Support Vector Machines (SVMs) are state-of-the-art for classification problems due to their theoretical basis and their high accuracy in practice. Originally formulated for binary classification, SVMs have been extended to handle multi-category classification either by solving a series of binary problems, as in the one-versus-all (OVA) method, or by solving a single problem in so-called all-in-one methods. The OVA method combines separately trained binary SVM classifiers into a classifier for multiple classes. This method has the advantage of being simple to implement. However, it has the disadvantage that even if all binary classifiers are consistent, the resulting OVA classifier is inconsistent [13]. In all-in-one approaches, the multi-class problem is formulated and solved as one single optimization problem. Three all-in-one approaches will be presented in this thesis, namely those of Weston and Watkins (WW), Crammer and Singer (CS), and Lee, Lin and Wahba (LLW). These approaches are more theoretically involved and have been found to lead to better generalizing classifiers than OVA in a comparison study [13]. However, training multi-class SVMs in an all-in-one approach normally takes longer. Unfortunately, this is especially true for the LLW SVM, which is the first Fisher consistent multi-class SVM that has been proposed in the literature [24].

We will address the training time problem by employing online learning methods. Online learning methods use a stream of examples to adapt the solution iteratively. Over the past years, online variants of binary and multi-class SVM solvers have been proposed and refined [3, 4, 5, 6, 18, 32]. Most online SVMs provide an approximative solver for the full batch problem. Online SVMs can be preferable to standard SVMs even for batch learning. In particular, for large datasets we cannot afford to compute a close to optimal solution to the SVM optimization problem, but want to find a good approximation quickly. In the past, online SVMs were shown to be a good answer to these requirements.

The prominent online multi-class SVM LaRank [4] relies on the multi-class SVM formulation proposed by Crammer and Singer [10]. The CS SVM was found to perform worse than the theoretically more sound LLW SVM in one study [13]. We will derive a new online algorithm for SVMs; in particular, we will adapt the LaRank paradigm to the LLW machine. This will lead to an online multi-class SVM that quickly approximates a machine which is Fisher consistent. Recently, a unified framework has been established which unifies the three all-in-one approaches [14]. Based on this result, we will further develop an online learning solver that can be applied to the unified framework.

1.1 Structure of this thesis

This thesis aims at developing online learning algorithms for multi-class SVMs in all-in-one approaches by transferring LaRank's concepts. The thesis is organized as follows.

• Chapter 2 will provide the general background that is used throughout the thesis. It discusses some important concepts in statistical learning theory. It also introduces the formulation of the binary SVM, the all-in-one approaches and the unified framework for multi-class SVMs, as well as online algorithms for SVMs.

• Chapter 3 will introduce the generalized quadratic optimization problem without equality constraints to which the formulations of the WW and LLW SVMs can easily be converted. In addition, this chapter introduces the sequential two-dimensional optimization (S2DO) method that can solve the generalized optimization problem efficiently.

• Chapter 4 will derive a new online learning algorithm for the LLW SVM. It first introduces the dual form of the LLW SVM. After that, a new learning algorithm for the LLW SVM is established by combining the online learning paradigm with the S2DO method. We will assess the performance of the new online solver in comparison with the LLW batch solvers in this chapter.

• Chapter 5 will propose an online solver for the unified framework for all-in-one approaches. The unification is established through the unification of the margin concepts and the loss functions. In this chapter, a new online solver for the unified framework is then derived. The performance of the unified online solver is evaluated through a comparison between the online CS solver and the batch CS solvers.

• Chapter 6 concludes and summarizes this thesis and gives some suggestions for future work.


1.2 Contributions

The main contributions of this thesis are the derivations of online solvers for multi-class SVMs in all-in-one approaches. In detail, an online solver for the LLW SVM is derived and evaluated. This is the first online solver that approximates a machine which is Fisher consistent. Additionally, it is the first time an online solver is available for the unified framework that unifies the three all-in-one approaches. These online solvers are implemented in Shark, and several minor improvements of Shark are performed. The thesis also studies and compares the build-up behaviour of the online versus batch solvers for the LLW SVM, which gives insight into the open question of whether early stopping of a batch solver would be an equivalent alternative.


Chapter 2

Background

In this chapter, we will review some important concepts that are used throughout this thesis. We first discuss the learning problem, in particular supervised learning and its properties. After that, we review Support Vector Machines, presenting the formulations of SVMs for binary and multi-class classification problems. Finally, we discuss some existing online learning algorithms for SVMs.

2.1 The Learning problem

2.1.1 Supervised learning

Supervised learning algorithms infer a relation between an input space $X \subseteq \mathbb{R}^n$ and an output space $Y$ based on sample data. In detail, let such sample data $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_\ell, y_\ell)\} \subset X \times Y$ be drawn according to an unknown but fixed probability distribution $p$ over $X \times Y$. This sample dataset is normally referred to as the training set. Two important examples of supervised learning are the classification problem, where $Y$ is a finite set of classes $\{C_1, C_2, \dots, C_d\}$, $2 \le d < \infty$, and the regression problem, where $Y \subseteq \mathbb{R}^d$, $1 \le d < \infty$. This thesis will only focus on classification problems, thus $Y$ is restricted to $\{C_1, C_2, \dots, C_d\}$.
The goal of supervised learning is to determine a hypothesis $h : X \to Y$ that takes an input $x \in X$ and returns the corresponding output $y \in Y$. This hypothesis is selected from a hypothesis class $\mathcal{H}$, for example the set of all linear classifiers $\mathcal{H} = \{\operatorname{sign}(w^T x + b) \mid w \in X, b \in \mathbb{R}\}$. A hypothesis $h \in \mathcal{H}$ is sometimes represented as $h(x) = \operatorname{sign}(f(x))$ with $f : X \to \mathbb{R}$ in binary classification, or as $h(x) = \arg\max_{c \in Y} f_c(x)$ with $f_c : X \to \mathbb{R}$ and $f = [f_{C_1}, \dots, f_{C_d}]$ in multi-class classification. The function $f$ is called the scoring function, and $\mathcal{F}$ denotes the set of scoring functions. Since $h$ is determined by $f$, in the following $h$ may also stand for the scoring function except where the two need to be distinguished explicitly.
The quality of a prediction of $h$ is measured by a task-dependent loss function $L$, which fulfills $L : Y \times Y \to [0, \infty]$ and $L(y, y) = 0$. Thus $L(y, \hat{y})$ represents the cost of predicting $\hat{y}$ instead of $y$. Some common loss functions on a hypothesis are:

• 0-1 loss: $L(y, h(x)) = 0$ if $y = h(x)$, and $1$ otherwise. This is the ideal loss function in theory.

• Square loss: $L(y, h(x)) = (y - h(x))^2$. This loss function is normally used in regression.

• Hinge loss: $L(y, h(x)) = \max(1 - y h(x), 0)$. This is commonly used in SVMs.

• Exponential loss: $L(y, h(x)) = \exp(-y h(x))$. This loss function has a connection to AdaBoost.

• Margin-based loss: $L(y, h(x)) = L(y h(x))$. This is a generalized version of the hinge loss and the exponential loss.
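As a quick illustration, the losses listed above can be written compactly in code. The following minimal sketch is for illustration only and is not part of the thesis implementation; the function names are chosen here ad hoc:

```python
import math

def zero_one_loss(y, h_x):
    # 0-1 loss: 0 for a correct prediction, 1 otherwise
    return 0.0 if y == h_x else 1.0

def squared_loss(y, f_x):
    # square loss (y - f(x))^2, typically used for regression
    return (y - f_x) ** 2

def hinge_loss(y, f_x):
    # hinge loss max(1 - y*f(x), 0) on the functional margin, as used by SVMs
    return max(1.0 - y * f_x, 0.0)

def exponential_loss(y, f_x):
    # exponential loss exp(-y*f(x)), related to AdaBoost
    return math.exp(-y * f_x)
```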

Given such a loss function, the expected risk of a hypothesis $h$ over $p$ can be defined as:

$R_p(h) = \int L(y, h(x)) \, dp(x, y)$

The goal of learning is to find a hypothesis that minimizes the expected risk $R_p(h)$. In general, the underlying probability distribution $p$ is unknown, thus the empirical risk $R_S(h)$ of $h$ over the sample data $S$ is used as a substitute:

$R_S(h) = \frac{1}{\ell} \sum_{i=1}^{\ell} L(y_i, h(x_i))$

As a result, empirical risk minimization is normally used.
Let us define some important concepts that we deal with in the thesis. The best hypothesis over all possible hypotheses is called the Bayes classifier and its associated risk is called the Bayes risk:

$R_p^{Bayes} = \inf_h R_p(h), \qquad R_p(h_{Bayes}) = R_p^{Bayes}$

For the best hypothesis within a hypothesis set $\mathcal{H}$, we write:

$h_{\mathcal{H}} = \arg\min_{h \in \mathcal{H}} R_p(h)$

2.1.2 Generalization and regularization

One of the major problems of empirical risk minimization in supervised learning is overfitting. Overfitting typically arises when a learning algorithm chooses a non-smooth or overly complex function while minimizing the empirical risk $R_S(h)$. The selected function fits the training dataset very well, but it does not generalize well to unseen test data from the same generating distribution. Generalization can be considered as the ability of a hypothesis to make only a few mistakes on future test data. There are several ways to handle the overfitting problem. This thesis mainly focuses on the regularization framework, which is the basis of several important learning algorithms, for instance Support Vector Machines, which are at the center of this thesis. The regularization framework proposes to select a hypothesis from a normed function space $\mathcal{H}$, given a training set $S$, as:

$h^* = \arg\min_{h \in \mathcal{H}} \left( R_S(h) + \lambda\, g(\|h\|_{\mathcal{H}}) \right) \qquad (2.1)$


Here, $g : \mathbb{R} \to \mathbb{R}$ is a monotonically increasing regularization function which penalizes the complexity of $h$ as measured by $\|h\|_{\mathcal{H}}$. The parameter $\lambda > 0$ is fixed and controls a trade-off between the empirical risk and the smoothness of the hypothesis. In practice, $g(\|h\|_{\mathcal{H}}) = \|h\|_{\mathcal{H}}^2$ is normally employed.

2.1.3 Consistency

Consistency concerns the behaviour of the classifiers created by a learning algorithm as the number of training examples goes to infinity. While generalization is a property of a single classifier, consistency is a property of a family of classifiers.

Definition 1 (Definition 1 in [33]). Consider a sample dataset $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_\ell, y_\ell)\}$ of size $\ell$ drawn according to an underlying probability distribution $p$ over $X \times Y$. Let $\mathcal{H}$ be the hypothesis set from which the learning algorithm chooses its hypothesis. Then:

• a learning algorithm is called consistent with respect to $\mathcal{H}$ and $p$ if the risk $R_p(h)$ converges in probability to the risk $R_p(h_{\mathcal{H}})$, where $h$ is generated by this learning algorithm. That is, for all $\varepsilon > 0$:

$\Pr\left(R_p(h) - R_p(h_{\mathcal{H}}) > \varepsilon\right) \to 0 \quad \text{as } \ell \to \infty$

• a learning algorithm is called Bayes-consistent with respect to $p$ if the risk $R_p(h)$ converges in probability to the Bayes risk $R_p^{Bayes}$. That is, for all $\varepsilon > 0$:

$\Pr\left(R_p(h) - R_p(h_{Bayes}) > \varepsilon\right) \to 0 \quad \text{as } \ell \to \infty$

• a learning algorithm is called universally consistent with respect to $\mathcal{H}$ if it is consistent with respect to $\mathcal{H}$ for all underlying probability distributions $p$.

2.1.4 Fisher consistency

Another important concept in machine learning is Fisher consistency, which analyzes the behaviour of loss functions in the limit of infinite data. Fisher consistency of a loss function is a necessary condition for a classification algorithm based on that loss to be Bayes consistent.
Given a particular loss function $L(y, h(x))$, let $h_L^* : X \to Y$ be the minimizer of the expected risk:

$h_L^* = \arg\min_h R_p(h) = \arg\min_h \int L(y, h(x)) \, dp(x, y)$

In the case of binary classification, the Bayes optimal rule is $\operatorname{sign}(2p(y = 1 \mid x) - 1)$. A loss function $L(y, h(x))$ is called Fisher consistent if $h_L^*(x)$ has the same sign as $\operatorname{sign}(2p(y = 1 \mid x) - 1)$.
This definition for binary classification can be extended to the multi-category case, where $h_L^*(x)$ is represented as $h_L^*(x) = \arg\max_{c \in Y} f_c^*(x)$. A loss function $L(y, h(x))$ is called Fisher consistent if

$\arg\max_{c \in Y} \{f_c^*(x)\} \subset \arg\max_{y \in Y} \{p(y \mid x)\}$


In [23], Lin proved that the margin-based loss functions listed above, under some additional constraints, are Fisher consistent for binary classification problems. Independently, Zhang [36, 35] proved that a larger class of convex loss functions is Fisher consistent in binary classification tasks. Bartlett et al. [2] and Tewari et al. [31] extended Zhang's results to establish conditions for a loss function to satisfy Fisher consistency for binary and multi-category classification, respectively.

2.1.5 Online learning and batch learning

There are two common paradigms of learning algorithms for classification problems, namely online learning and batch learning. Batch learning is the paradigm underlying many existing supervised learning algorithms; it assumes that all training examples are available at once. Normally, a cost function or objective function is defined to estimate how well a hypothesis behaves on these examples. The learning algorithm then performs optimization steps towards reducing the cost function until some stopping conditions are reached. Common batch learning algorithms scale between quadratically and cubically in the number of examples, which can result in quite long training times on large datasets. In contrast, online learning methods use a stream of examples to adapt the solution iteratively [7]. Inspired by this 'real' online learning paradigm, there are optimization methods that replicate the strategy of online learning to approximate the solution of a fixed batch problem. When using the online learning paradigm as a heuristic for optimization, there is some freedom in how many computation steps on 'old' examples are done after the introduction of each 'new' example. If these steps are designed effectively, a single pass of online learning over the dataset takes much less time and computational power than a full batch optimization run [9, 7, 5], but still provides a well-performing approximation to the solution that would be obtained by a full batch solver. In this thesis, such an online optimization algorithm is derived to achieve accuracy comparable to a batch learning algorithm, while consuming significantly less time and computation.

2.2 Support Vector Machines

In this section, we will review some important concepts in Support Vector Machines. We discuss linear classification and kernels before describing the formulation and the training of binary Support Vector Machines. We also review the three all-in-one approaches and the unified framework for multi-class SVMs. The final part of this section presents some existing online SVM algorithms.

2.2.1 Linear classification

This part reviews some basic concepts in linear classification. These concepts are important since they will appear later when deriving the binary Support Vector Machine algorithm as well as in other sections. In linear classification, the hypothesis $h$ is defined via an affine linear function $f$ on the input space as:

$h(x) = \operatorname{sign}(f(x)), \qquad f(x) = w^T x + b$

Here, $w \in \mathbb{R}^n$ is the weight vector and $b$ is the bias. These parameters define a separating hyperplane that divides the input space into two half-spaces corresponding to the two classes.
Let us next introduce definitions around the concept of a margin between a separating hyperplane and one or more data instances. The functional margin of a data instance $(x_i, y_i)$ with respect to the hyperplane $(w, b)$ is defined as:

$\gamma_i = y_i (w^T x_i + b)$

The geometric margin of an example $(x_i, y_i)$ with respect to the hyperplane $(w, b)$ is:

$\rho_i = \frac{y_i (w^T x_i + b)}{\|w\|} = \frac{\gamma_i}{\|w\|}$

[Figure: a point $x_i$, its projection $x_\perp$ onto the separating hyperplane $f(x) = w^T x + b = 0$, and its distance $\frac{w^T x_i + b}{\|w\|}$ from the hyperplane.]

The absolute value of the geometric margin of an input pattern equals its distance from the separating hyperplane defined by $(w, b)$. Indeed, let $d$ be the distance from the hyperplane and $x_\perp$ the projection of an input pattern $x_i$ onto the hyperplane. Then:

$x_i = x_\perp + \frac{w}{\|w\|} d$

$\Rightarrow f(x_i) = w^T \left( x_\perp + \frac{w}{\|w\|} d \right) + b = w^T x_\perp + b + \|w\| d = f(x_\perp) + \|w\| d$

However, $x_\perp$ lies on the hyperplane, thus $f(x_\perp) = 0$. As a result:

$d = \frac{f(x_i)}{\|w\|} = \frac{w^T x_i + b}{\|w\|}$

The information about the distance of an input pattern $x_i$ from a hyperplane is thus encoded in the geometric margin. Furthermore, it can be seen that a hypothesis $h$ classifies an example correctly if the functional and geometric margins corresponding to this hypothesis are positive. A dataset $S$ is called linearly separable if there exist $w, b$ that satisfy the following constraints:

$w^T x_i + b > 0 \quad \text{if } y_i = +1$
$w^T x_i + b < 0 \quad \text{if } y_i = -1$

or, equivalently, $y_i (w^T x_i + b) \ge \gamma$ for some $\gamma > 0$ and all $1 \le i \le \ell$.

Here, $\gamma$ is called a target functional margin.
The margin concepts of a hyperplane for a single example $(x_i, y_i)$ can be extended to the training dataset $S$. The functional margin and geometric margin of a hyperplane with respect to a training set $S$ are defined as $\gamma = \min_{1 \le i \le \ell} \gamma_i$ and $\rho = \min_{1 \le i \le \ell} \rho_i$, respectively.
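To make the margin definitions concrete, the following minimal sketch (illustrative only, not thesis code) computes the functional margins and the geometric margin of a dataset with respect to a hyperplane $(w, b)$:

```python
import numpy as np

def functional_margins(X, y, w, b):
    # gamma_i = y_i * (w^T x_i + b) for every example in the dataset
    return y * (X @ w + b)

def geometric_margin(X, y, w, b):
    # rho = min_i gamma_i / ||w||: the smallest signed distance of an
    # example from the separating hyperplane defined by (w, b)
    return functional_margins(X, y, w, b).min() / np.linalg.norm(w)
```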

2.2.2 Kernels

Kernels are a flexible and powerful tool which allows linear methods to be easily modified such that they yield non-linear decision functions in the input space. The underlying idea is to transform a non-linearly separable dataset into a Hilbert space of higher or even infinite dimension, where the dataset can be separated by a hyperplane. The interesting point is that this transformation is performed directly through kernel evaluations, without explicit computation of any feature map. This strategy is called the kernel trick.

Definition 2. Let $X$ be a non-empty set. A function $k : X \times X \to \mathbb{R}$ is a kernel if there exists a Hilbert space $\mathcal{H}$ and a map $\phi : X \to \mathcal{H}$ satisfying:

$k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}} \quad \forall x, y \in X$

Here, $\phi : X \to \mathcal{H}$ is called a feature map and $\mathcal{H}$ is called a feature space; $\phi$ and $\mathcal{H}$ need not be unique. As a trivial example consider $k(x, y) = x \cdot y$ with $\phi(x) = x$ and $\mathcal{H} = \mathbb{R}$, or $\phi(x) = \left[\frac{x}{\sqrt{2}}, \frac{x}{\sqrt{2}}\right]^T$ with $\mathcal{H} = \mathbb{R}^2$.

A symmetric function $k : X \times X \to \mathbb{R}$ is called positive definite if for all $m \in \mathbb{N}$ and $x_1, x_2, \dots, x_m \in X$, the Gram matrix $K$ with entries $k_{ij} = k(x_i, x_j)$ is positive definite, that is, $a^T K a \ge 0$ for all $a \in \mathbb{R}^m$. It can be seen that a kernel is also a positive definite function because:

$\sum_{i,j=1}^{m} a_i a_j k(x_i, x_j) = \Big\langle \sum_{i=1}^{m} a_i \phi(x_i), \sum_{j=1}^{m} a_j \phi(x_j) \Big\rangle_{\mathcal{H}} = \Big\| \sum_{i=1}^{m} a_i \phi(x_i) \Big\|_{\mathcal{H}}^2 \ge 0$


Vice versa, Mercer's theorem [25] guarantees that for every positive definite symmetric function $k : X \times X \to \mathbb{R}$, there exist a Hilbert space $\mathcal{H}$ and a feature map $\phi : X \to \mathcal{H}$ such that the kernel is the inner product of features in the feature space: $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}$.

Other important concepts are those of reproducing kernels and reproducing kernel Hilbert spaces, which are at the theoretical core of the kernelization technique. They can be defined as follows:

Definition 3. Let $X$ be a non-empty set and $\mathcal{H}$ a Hilbert space of functions on $X$. A function $k : X \times X \to \mathbb{R}$ is a reproducing kernel and $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) if they satisfy:

• $\forall x \in X : k(\cdot, x) \in \mathcal{H}$

• $\forall x \in X, \forall f \in \mathcal{H} : \langle f, k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$

In particular, $\forall x, y \in X : k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}} = \langle k(\cdot, y), k(\cdot, x) \rangle_{\mathcal{H}}$. It is clear that a reproducing kernel is a kernel, since a feature map can be defined as $\phi(x) = k(\cdot, x)$. The Moore-Aronszajn theorem [1] states that for every positive definite function $k : X \times X \to \mathbb{R}$ over a non-empty set $X$, there is a unique RKHS $\mathcal{H}$ with reproducing kernel $k$. As a result, there are strong connections between the concepts of positive definite function, kernel and reproducing kernel.

A natural question is whether a kernel-based learning algorithm can be consistent given a certain kernel. In other words, given a kernel, can the classifier resulting from a kernel-based learning algorithm approach the Bayes optimal classifier as the sample size goes to infinity? The answer depends on a property of the kernel function, the so-called universality property. The RKHS $\mathcal{H}$ induced by a universal kernel is rich enough to approximate any optimal classifier or function arbitrarily well [30]. For example, the Gaussian kernel is a universal kernel whereas the linear kernel is not.
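As an illustration of the kernel trick in practice, the following minimal sketch (an illustrative assumption, not code from the thesis or from Shark) computes the Gram matrix of the Gaussian kernel; such a matrix is all a kernel-based solver needs in place of explicit feature maps:

```python
import numpy as np

def gaussian_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2), an example of a universal kernel
    return np.exp(-gamma * np.sum((x - z) ** 2))

def gram_matrix(X, kernel=gaussian_kernel):
    # K[i, j] = k(x_i, x_j); symmetric and positive (semi-)definite
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(X[i], X[j])
    return K
```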

2.2.3 Regularized risk minimization in RKHS

We now combine the framework of kernels and RKHS function spaces of the last section with the regularization framework presented in Section 2.1.2. An interesting result within this combination was established by Kimeldorf and Wahba in [21] and by Schölkopf, Herbrich and Smola in [28]: the representer theorem. The representer theorem says that if $\mathcal{H}$ is an RKHS induced by a reproducing kernel $k$ and $g$ is a monotonically increasing regularization function, then the optimal solution $f$ of the regularization framework (2.1) admits a representation of the form:

$f(\cdot) = \sum_{i=1}^{\ell} a_i k(\cdot, x_i), \qquad x_i \in S$

As a result, the optimal solution is expressed as a linear combination of a finite number of kernel functions on the dataset, regardless of the dimensionality of $\mathcal{H}$. The representer theorem therefore enables us to perform regularized empirical risk minimization in infinite-dimensional feature spaces, since by adding the regularization term the problem reduces to a finite number of variables. An example of an algorithm performing regularized risk minimization in an RKHS is the Support Vector Machine, to which the representer theorem thus naturally applies as well [15]; we will introduce SVMs in the next section. However, for ease of understanding, we derive the non-kernelized SVM first.

2.2.4 Binary Support Vector Machines

Support Vector Machines (SVMs) are state-of-the-art algorithms in machine learning for pattern recognition due to their theoretical properties and high accuracy in practice. In this part, Support Vector Machines are first formulated for linear classification through the large margin concept; the formulation is then extended to non-linear classification through the kernel trick.
For a linearly separable dataset $S$, there might be numerous hyperplanes that separate the dataset into two classes. The fundamental idea of large-margin classification, which is inherently related to SVM classification, is to choose the hyperplane which maximizes the geometric margin with respect to this dataset. In detail, a hyperplane is selected such that the geometric margin of the closest data points of both classes is maximized. The data points which lie on the solid lines in Figure 2.1 are called support vectors.

Figure 2.1: The geometric interpretation of binary Support Vector Machines (separating hyperplane, margin, and offset $b/\|w\|$).

The problem of large-margin classification can be formulated as the following optimization problem over $w$ and $b$:

$\text{maximize}_{w,b} \quad \rho = \frac{\gamma}{\|w\|}$
$\text{subject to} \quad y_i (w^T x_i + b) \ge \gamma, \quad i = 1, \dots, \ell$


Because $(w, b)$ can be rescaled to $(cw, cb)$, which results in the same hyperplane and optimization problem, $\gamma$ can be fixed to 1 (otherwise, we would have to fix $\|w\| = 1$). Therefore, $\|w\|$ should be minimized, which is equivalent to the convex optimization problem:

$\text{minimize}_{w,b} \quad \frac{1}{2}\|w\|^2$
$\text{subject to} \quad y_i (w^T x_i + b) \ge 1 \quad \forall (x_i, y_i) \in S$

When the Bayes risk of the dataset itself is not zero, which means that the dataset is not linearly separable, it becomes necessary to allow some examples to violate the margin condition. This idea can be integrated as $y_i (w^T x_i + b) \ge 1 - \xi_i$, with slack variables $\xi_i \ge 0$. As a result, the convex optimization problem becomes:

$\text{minimize}_{w,b} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} \xi_i \qquad (2.2)$
$\text{subject to} \quad y_i (w^T x_i + b) \ge 1 - \xi_i \quad \forall (x_i, y_i) \in S$
$\qquad\qquad\quad\; \xi_i \ge 0, \quad 1 \le i \le \ell$

Here, $C > 0$ is the regularization parameter. It is obvious that the formulation (2.2) fits into the regularization framework (2.1) with:

$g(\|f\|) = \|w\|^2, \qquad \lambda = \frac{1}{2C\ell}, \qquad R_S(f) = \frac{1}{\ell} \sum_{i=1}^{\ell} \max(1 - y_i (w^T x_i + b), 0)$

This is the reason why $C$ is called the regularization parameter.
In practice, the convex optimization problem (2.2) is converted into its so-called dual form by the technique of Lagrange multipliers for constrained convex optimization problems [11], which results in:

$\text{maximize}_{\alpha} \quad D(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \qquad (2.3)$
$\text{subject to} \quad \sum_{i=1}^{\ell} \alpha_i y_i = 0$
$\qquad\qquad\quad\; 0 \le \alpha_i \le C, \quad \forall i \in \{1, \dots, \ell\}$

Here, the $\alpha_i$ are Lagrange multipliers.
Another extension of the SVM for handling datasets which are not linearly separable is to apply the kernel trick introduced in section 2.2.2. This maps the original input space into a higher-dimensional feature space in which the data may be linearly separable. Kernelization is performed by replacing every inner product $\langle x_i, x_j \rangle$ in (2.3) by a kernel evaluation $k(x_i, x_j)$. As a result, the final dual formulation can be represented as follows:

$\text{maximize}_{\alpha} \quad D(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \qquad (2.4)$
$\text{subject to} \quad \sum_{i=1}^{\ell} \alpha_i y_i = 0 \qquad (2.5)$
$\qquad\qquad\quad\; 0 \le \alpha_i \le C, \quad \forall i \in \{1, \dots, \ell\} \qquad (2.6)$
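For concreteness, the dual objective (2.4) and its gradient can be evaluated directly from the Gram matrix. The sketch below is illustrative only (it reuses a precomputed Gram matrix K, e.g. from the kernel example above) and is not the decomposition solver used in the thesis:

```python
import numpy as np

def dual_objective(alpha, y, K):
    # D(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K[i, j]
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v

def dual_gradient(alpha, y, K):
    # g_i = dD/dalpha_i = 1 - y_i * sum_j y_j alpha_j K[i, j]
    return 1.0 - y * (K @ (alpha * y))
```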

2.2.5 Support Vector Machine training

Sequential minimal optimization

Here, we consider decomposition algorithms tailored to the SVM training problem in the dual form above. At every step, these algorithms choose an update direction $u$ and perform a line search along $\alpha + \lambda u$, i.e., they solve $\max_{\lambda} D(\alpha + \lambda u)$. One of the most effective algorithms for the above problem is Sequential Minimal Optimization (SMO) [26], in which the search direction vector has as few non-zero entries as possible; the set of the corresponding indices is called the working set. Because of the equality constraint (2.5), the smallest number of variables that can be changed at a time is 2. As a result, the search direction can be represented as:

$u = \pm(0, \dots, y_i, 0, \dots, -y_j, 0, \dots, 0)$

The sign $\pm$ is chosen to satisfy the equality constraint (2.5) based on the values of the labels $y_i, y_j$. Without loss of generality, assume that:

$u_{ij} = (0, \dots, y_i, 0, \dots, -y_j, 0, \dots, 0)$

To make the following equations easier to read, we define the following notation:

$k_{ij} = k(x_i, x_j)$

$g_i = \frac{\partial D}{\partial \alpha_i} = 1 - y_i \sum_{j=1}^{\ell} y_j \alpha_j k_{ij}$

$[a_i, b_i] = \begin{cases} [0, C] & \text{if } y_i = +1 \\ [-C, 0] & \text{if } y_i = -1 \end{cases}$

$I_{up} = \{ i \mid y_i \alpha_i < b_i \}, \qquad I_{down} = \{ j \mid y_j \alpha_j > a_j \}$

Taking the Taylor expansion of $D$ around $\alpha$ in the search direction $u_{ij}$, we have:

$D(\alpha + \lambda u_{ij}) = D(\alpha) + \lambda \frac{\partial D(\alpha)}{\partial \alpha} u_{ij} + \frac{1}{2} \lambda^2 u_{ij}^T \frac{\partial^2 D}{\partial \alpha^2} u_{ij}$


In addition, the second order derivatives of $D$ with respect to $\alpha_i$ and $\alpha_j$ are:

$\frac{\partial^2 D}{\partial \alpha_i \partial \alpha_j} = -y_i y_j k_{ij}$

And the first order derivative of $D$ with respect to $\alpha$ in the search direction $u_{ij}$ is:

$\frac{\partial D(\alpha)}{\partial \alpha} u_{ij} = y_i g_i - y_j g_j$

As a result, the function $D$ can be written as:

$D(\alpha + \lambda u_{ij}) = D(\alpha) + \lambda (y_i g_i - y_j g_j) - \frac{1}{2} (k_{ii} + k_{jj} - 2 k_{ij}) \lambda^2 \qquad (2.7)$

Solving the one-dimensional optimization problem (2.7) without the box constraint (2.6), the optimal value is:

$\hat{\lambda} = \frac{y_i g_i - y_j g_j}{k_{ii} + k_{jj} - 2 k_{ij}} \qquad (2.8)$

With the box constraint, the optimal value is determined as:

$\lambda^* = \min\{ b_i - y_i \alpha_i,\; y_j \alpha_j - a_j,\; \hat{\lambda} \}$

In decomposition algorithms, we have to specify the conditions under which the algorithm stops. In the following, we derive the stopping conditions of SMO. If only $\lambda \ge 0$ is considered, the first order Taylor expansion gives:

$D(\alpha + \lambda u_{ij}) = D(\alpha) + \lambda (y_i g_i - y_j g_j) + O(\lambda^2)$

Therefore, $D(\alpha)$ is optimal if there does not exist a pair $(i, j)$ with $i \in I_{up}$, $j \in I_{down}$ such that $y_i g_i - y_j g_j > 0$. As a result, the optimality condition is:

$\max_{i \in I_{up}} y_i g_i - \min_{j \in I_{down}} y_j g_j \le 0$

In practice, this condition is replaced by:

$\max_{i \in I_{up}} y_i g_i - \min_{j \in I_{down}} y_j g_j \le \varepsilon, \quad \text{for some } \varepsilon > 0 \qquad (2.9)$

From (2.8) and (2.9), the sequential minimal optimization method is derived as in Algorithm 1. The operations from step (4) to step (7) in Algorithm 1 form an SMO optimization step, which is frequently used later.

Algorithm 1 Sequential minimal optimization.
1: $\alpha \leftarrow 0$, $g \leftarrow 1$
2: repeat
3:   select indices $i \in I_{up}$, $j \in I_{down}$
4:   $\lambda \leftarrow \min\{ b_i - y_i \alpha_i,\; y_j \alpha_j - a_j,\; \frac{y_i g_i - y_j g_j}{k_{ii} + k_{jj} - 2 k_{ij}} \}$
5:   $\forall p \in \{1, \dots, \ell\} : g_p \leftarrow g_p - \lambda y_p k_{ip} + \lambda y_p k_{jp}$
6:   $\alpha_i \leftarrow \alpha_i + \lambda y_i$
7:   $\alpha_j \leftarrow \alpha_j - \lambda y_j$
8: until $\max_{i \in I_{up}} y_i g_i - \min_{j \in I_{down}} y_j g_j \le \varepsilon$
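The following minimal sketch (illustrative only, reusing the Gram matrix K and labels y from the dual example above) implements one SMO step on a chosen pair $(i, j)$, i.e., steps 4 to 7 of Algorithm 1:

```python
import numpy as np

def smo_step(alpha, g, y, K, C, i, j):
    # bounds of y*alpha: [0, C] for positive labels, [-C, 0] for negative ones
    b_i = C if y[i] > 0 else 0.0
    a_j = 0.0 if y[j] > 0 else -C
    # unconstrained optimum along the direction u_ij, equation (2.8)
    lam = (y[i] * g[i] - y[j] * g[j]) / (K[i, i] + K[j, j] - 2.0 * K[i, j])
    # clip the step so that both variables stay inside the box (2.6)
    lam = min(b_i - y[i] * alpha[i], y[j] * alpha[j] - a_j, lam)
    # steps 5-7 of Algorithm 1: update the gradient and the two dual variables
    g -= lam * y * K[:, i] - lam * y * K[:, j]
    alpha[i] += lam * y[i]
    alpha[j] -= lam * y[j]
    return alpha, g
```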

Working set selection

The performance of SMO heavily depends on how the working set is selected. There are two common strategies for working set selection.

• The most violating pair strategy, which is based on the gradient information:

$i = \arg\max_{p \in I_{up}} y_p g_p, \qquad j = \arg\min_{q \in I_{down}} y_q g_q$

• Another strategy [17, 16] employs second order information to select a working set that maximizes the gain. The gain is determined as follows. Rewriting equation (2.7), we have:

$D(\alpha + \lambda u_{ij}) - D(\alpha) = \lambda (y_i g_i - y_j g_j) - \frac{1}{2} (k_{ii} + k_{jj} - 2 k_{ij}) \lambda^2$

Substituting the optimal value $\hat{\lambda}$ into the equation yields:

$D(\alpha + \hat{\lambda} u_{ij}) - D(\alpha) = \frac{(y_i g_i - y_j g_j)^2}{2 (k_{ii} + k_{jj} - 2 k_{ij})}$

The idea is to select $i$ and $j$ such that this gain is maximized. However, $\frac{\ell(\ell-1)}{2}$ pairs would need to be checked, which results in long computation times. Therefore, a heuristic which only needs $O(\ell)$ computation steps is used:

– The index $i$ is picked based on the most violating pair strategy, $i = \arg\max_{p \in I_{up}} y_p g_p$.

– The index $j$ is selected to maximize the gain, $j = \arg\max_{q \in I_{down}} \frac{(y_i g_i - y_q g_q)^2}{2 (k_{ii} + k_{qq} - 2 k_{iq})}$.
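A minimal sketch of this second-order heuristic is given below. It is illustrative only and assumes the gradient vector g and Gram matrix K from the examples above, plus index collections I_up and I_down:

```python
def select_working_set(g, y, K, I_up, I_down):
    # first index: most violating example among I_up
    i = max(I_up, key=lambda p: y[p] * g[p])

    # second index: maximal second-order gain with respect to i
    def gain(q):
        num = (y[i] * g[i] - y[q] * g[q]) ** 2
        den = 2.0 * (K[i, i] + K[q, q] - 2.0 * K[i, q])
        return num / den if den > 0.0 else 0.0

    j = max(I_down, key=gain)
    return i, j
```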


2.2.6 Multi-class SVMs

In practice, pattern recognition problems often involve multiple classes. Support Vector Machines can be extended to handle such cases by training a series of binary SVMs or by solving a single optimization problem. There are two common ways of training a series of binary SVMs, namely the one-versus-one (OVO) method and the one-versus-all (OVA) method. Since they are quite similar, we restrict ourselves to discussing the OVA method. The OVA method combines separately trained binary SVM classifiers into a classifier for multiple classes. This method has the advantage of being simple to implement on top of an existing binary SVM solver. However, it has the disadvantage that even if all binary classifiers are consistent, the resulting OVA classifier is inconsistent [13]. Formulating a single optimization problem to train multi-class SVMs is called the all-in-one approach. These approaches are more theoretically involved and have been shown to give better accuracies than OVA on a benchmark set of datasets [13]. However, their time complexities, which depend on the dataset size, can be significantly higher than those of OVA.
The remainder of this section presents some all-in-one SVMs in detail. They all share the common property (together with the OVA machine) that the final classification decision is made according to a rule of the form:

$x \longmapsto \arg\max_{c \in \{1, \dots, d\}} \left( w_c^T \phi(x) + b_c \right)$

Just like in section 2.2.2, $\phi(x) = k(x, \cdot)$ is a feature map into a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $k$, $d$ is the number of classes, and $w_c, b_c$ are the parameters that define the decision function of class $c$. There are several ways to approach multi-class SVM classification via one single optimization problem (the all-in-one approach), namely those of Weston and Watkins (WW), Crammer and Singer (CS) and Lee, Lin and Wahba (LLW). The basic idea of the three all-in-one approaches is to extend the standard binary SVM to the multi-class case, in particular the margin concept and the loss function. In binary classification, the hinge loss is applied to the margin violation, which itself is obtained from the functional margin. Similarly, the loss functions in the three all-in-one approaches are created through margins. However, they employ different interpretations of the margin concept and different ways of formulating a loss function on these different margins. As a result, their final formulations differ, which will be discussed in more detail in the next section. Despite these differences, they can all be reduced to the standard binary SVM in binary classification. The formulations of the three all-in-one approaches are as follows:

• Weston and Watkins (WW) SVM [34]:

$\text{minimize}_{w,b,\xi} \quad \frac{1}{2} \sum_{c=1}^{d} \|w_c\|_{\mathcal{H}}^2 + C \sum_{i=1}^{\ell} \sum_{c=1}^{d} \xi_{i,c}$
$\text{subject to} \quad \forall n \in \{1, \dots, \ell\}, \forall c \in \{1, \dots, d\} \setminus \{y_n\} : \langle w_{y_n} - w_c, \phi(x_n) \rangle + b_{y_n} - b_c \ge 2 - \xi_{n,c}$
$\qquad\qquad\quad\; \forall n \in \{1, \dots, \ell\}, \forall c \in \{1, \dots, d\} : \xi_{n,c} \ge 0$

The objective function contains two terms. The first one is the sum of the $\|w_c\|^2$, which controls the smoothness of the scoring functions $w_c^T \phi(x) + b_c$. The second term is the sum of the slack variables $\xi_{n,c}$, each of which encodes the violation of the margin between the true class $y_n$ and class $c$. This differs from the OVA scheme, where the violation of the margin between the true class and all other classes is encoded and determined at once by a single variable.

• Crammer and Singer (CS) SVM [10]:

$\text{minimize}_{w,\xi} \quad \frac{1}{2} \sum_{c=1}^{d} \|w_c\|_{\mathcal{H}}^2 + C \sum_{i=1}^{\ell} \xi_i$
$\text{subject to} \quad \forall n \in \{1, \dots, \ell\}, \forall c \in \{1, \dots, d\} \setminus \{y_n\} : \langle w_{y_n} - w_c, \phi(x_n) \rangle \ge 1 - \xi_n$
$\qquad\qquad\quad\; \forall n \in \{1, \dots, \ell\} : \xi_n \ge 0$

There are two key differences between this approach and the WW SVM. The CS approach employs a different loss function, which results in fewer slack variables; this difference will be discussed more clearly in the next section. The other difference is that there are no bias terms in the decision functions of this approach.

• Lee, Lin and Wahba (LLW) SVM [22]:

$\text{minimize}_{w,b,\xi} \quad \frac{1}{2} \sum_{c=1}^{d} \|w_c\|_{\mathcal{H}}^2 + C \sum_{n=1}^{\ell} \sum_{c=1}^{d} \xi_{n,c}$
$\text{subject to} \quad \forall n \in \{1, \dots, \ell\}, c \in \{1, \dots, d\} \setminus \{y_n\} : \langle w_c, \phi(x_n) \rangle + b_c \le -\frac{1}{d-1} + \xi_{n,c}$
$\qquad\qquad\quad\; \forall n \in \{1, \dots, \ell\}, c \in \{1, \dots, d\} : \xi_{n,c} \ge 0$
$\qquad\qquad\quad\; \forall h \in \mathcal{H} : \sum_{c=1}^{d} \left( \langle w_c, h \rangle + b_c \right) = 0 \qquad (2.10)$

Instead of involving $\langle w_{y_n} - w_c, \phi(x_n) \rangle + b_{y_n} - b_c$ as in the WW SVM, the slack variable $\xi_{n,c}$ in this approach controls the deviation between the target margin and $\langle w_c, \phi(x_n) \rangle + b_c$. This approach also introduces the sum-to-zero constraint (2.10), which is argued to be a necessary condition for the LLW SVM to reduce to the standard binary SVM [13].

All these extensions fit into a general regularization framework:

$\frac{1}{2} \sum_{c=1}^{d} \|w_c\|_{\mathcal{H}}^2 + C \sum_{i=1}^{\ell} L(y_i, f(x_i))$

where:

• LLW SVM loss function: $L(y, f(x)) = \sum_{j \ne y} \max(0, 1 + f_j(x))$

• WW SVM loss function: $L(y, f(x)) = \sum_{j \ne y} \max(0, 1 - (f_y(x) - f_j(x)))$

• CS SVM loss function: $L(y, f(x)) = \max(0, 1 - \min_{j \ne y} (f_y(x) - f_j(x)))$

Here, $f = [f_1, f_2, \dots, f_d]$ is a vector of hypotheses $f_c(x) = w_c^T \phi(x) + b_c$.

Among these loss functions, Liu [24] proved that only the LLW SVM loss function is Fisher consistent. Therefore, it is highly desirable to derive online learning algorithms for the LLW SVM.
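To make the differences between these three losses concrete, the following small illustrative sketch (not thesis code) evaluates them for a score vector f and a true class index y:

```python
def llw_loss(f, y):
    # sum over wrong classes of max(0, 1 + f_j(x))
    return sum(max(0.0, 1.0 + f[j]) for j in range(len(f)) if j != y)

def ww_loss(f, y):
    # sum over wrong classes of max(0, 1 - (f_y(x) - f_j(x)))
    return sum(max(0.0, 1.0 - (f[y] - f[j])) for j in range(len(f)) if j != y)

def cs_loss(f, y):
    # max(0, 1 - min over wrong classes of (f_y(x) - f_j(x)))
    worst = min(f[y] - f[j] for j in range(len(f)) if j != y)
    return max(0.0, 1.0 - worst)
```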


2.2.7 The unified framework for multi-class SVMs

The margin is an important concept in binary classification, which is inherited by the binary SVM in order to select the maximum margin classifier. However, there are different ways to interpret this concept for multi-class classification. In addition, there are several strategies for constructing a loss function based on the margin concept. We will review these different concepts before re-stating the unified formulation recently established in [14].

Margin in multi-class SVM classification

As discussed in Section 2.2.1, the (functional or geometric) margin encodes whether a classification by a linear classifier $f^{bin}(x) = w^T x + b$ is correct. In binary classification, $Y = \{+1, -1\}$, a functional margin $y f^{bin}(x) = y(w^T x + b) \ge \gamma$ for a specific target functional margin $\gamma > 0$ corresponds to a correct classification of the example $(x, y)$ by $f^{bin}$. When the Bayes risk is not zero, it is necessary to allow margin violations, and we measure this margin violation by $\max(0, \gamma - y(w^T x + b))$ as in section 2.2.4.
In the multi-class case, the hypothesis or decision function is normally obtained as $\arg\max_{c \in Y} f_c(x)$; thus, there are several ways to interpret the margin concept. With two scoring functions $f_c$ and $f_e$ corresponding to two classes $c, e$, the difference $f_c(x) - f_e(x)$ encodes how strongly class $c$ is preferred over class $e$ in the decision function. This observation is developed in the CS and WW formulations. Accordingly, a margin can be measured as follows:

Definition 4 (relative margin). For a labeled point $(x, y) \in X \times Y$ the values of the margin function

$\mu_c^{rel}(f(x), y) = \frac{1}{2} \left( f_y(x) - f_c(x) \right)$

for all $c \in Y$ are called relative margins.

The following, different margin definition is the basis of the LLW SVM:

Definition 5 (absolute margin). For a labeled point $(x, y) \in X \times Y$ the values of the margin function

$\mu_c^{abs}(f(x), y) = \begin{cases} +f_c(x) & \text{if } c = y \\ -f_c(x) & \text{if } c \in Y \setminus \{y\} \end{cases}$

for all $c \in Y$ are called absolute margins.

With this definition, the decision function just selects the largest score value $f_c(x)$; it does not directly take into account the differences $f_y(x) - f_c(x)$.
From the multi-class perspective, the decision function in binary classification can be expressed as $\arg\max_{c \in Y} \{f_c(x)\}$ with $Y = \{\pm 1\}$, where $f_{+1}(x) = f^{bin}(x)$ and $f_{-1}(x) = -f^{bin}(x)$. It is obvious that $\arg\max_{c \in Y} \{f_c(x)\} = \operatorname{sign}(f^{bin}(x))$ if $f^{bin}(x) \ne 0$. In addition, with the relative margin, it can be seen that $\frac{1}{2}(f_{+1}(x) - f_{-1}(x)) = (+1) f^{bin}(x)$ and $\frac{1}{2}(f_{-1}(x) - f_{+1}(x)) = (-1) f^{bin}(x)$, which shows that the relative margin reduces to the functional margin in binary classification. Similarly, we can show the relationship between the absolute margin concept and the functional margin concept in binary classification. Furthermore, it can be seen that $f_{+1}(x) + f_{-1}(x) = 0$, which is generalized to the sum-to-zero constraint in the LLW SVM. In [13] it is argued that the sum-to-zero constraint is of vital importance for machines relying on absolute margins and that this constraint is a minimal condition for an all-in-one method to be compatible with the binary SVM.

Margin-based surrogate loss functions

In binary classification, the hinge loss is applied to the margin violation, which itself is obtained from the functional margin. Similarly, margin-based loss functions in multi-class SVMs are also created through the margin violation concept. Given a margin function $\mu_c(f(x), y)$ and a target margin $\gamma_{y,c}$ that possibly depends on both classes $y$ and $c$, the target margin violation is defined as:

$v_c(f(x), y) = \max(0, \gamma_{y,c} - \mu_c(f(x), y))$

However, in multi-category classification with $d$ classes, there are $d - 1$ possibilities to make a mistake when classifying an input pattern. Therefore, there are several ways to combine the $d$ different function scores into one single loss value.

Definition 6 (sum-loss). For a labeled point $(x, y) \in X \times Y$ the discriminative sum-loss is given by

$L_{sum}(f(x), y) = \sum_{c \in Y'} v_c(f(x), y)$

for $Y' = Y \setminus \{y\}$. Setting $Y' = Y$ results in the total sum-loss.

The LLW and WW machines use the discriminative sum-loss in their formulations.

Definition 7 (max-loss). For a labeled point $(x, y) \in X \times Y$ the discriminative max-loss is given by

$L_{max}(f(x), y) = \max_{c \in Y'} v_c(f(x), y)$

for $Y' = Y \setminus \{y\}$. Setting $Y' = Y$ results in the total max-loss.

The CS machine is based on the discriminative max-loss.

Unifying the margins and loss functions

Firstly, it is straightforward to see that the different margin concepts of the earlier section can be unified as:

$\mu_c(f(x), y) = \sum_{m=1}^{d} \nu_{y,c,m} f_m(x)$

margin type | $\nu_{y,c,m}$
relative | $\delta_{y,m} - \delta_{c,m}$
absolute | $-(-1)^{\delta_{c,y}} \delta_{m,c}$

Table 2.1: Coefficients $\nu_{y,c,m}$ for the different margins in the unified form. Here, $\delta_{y,m}$ denotes the Kronecker symbol.

Here, $\nu_{y,c,m}$ is the coefficient of the score function $f_m$, which also depends on the combination $(y, c)$. As can be seen, the relative margin and the absolute margin can be brought into the unified form by defining the coefficients $\nu_{y,c,m}$ as in Table 2.1.
Secondly, the different margin-based loss functions also need to be unified. It can be seen that all the loss functions in section 2.2.7 are linear combinations of target margin violations. Each target margin violation can be represented through a slack variable as in section 2.2.4. In particular, the violation of the margin between the true class $y$ and class $c$ is measured as $v_c(f(x), y) = \max(0, \gamma_{y,c} - \mu_c(f(x), y))$, which can be represented as $\min \xi_c$ such that $\xi_c \ge \gamma_{y,c} - \mu_c(f(x), y)$ and $\xi_c \ge 0$, or $\xi_c \ge v_c(f(x), y)$. Therefore, for an example $(x, y) \in X \times Y$, the unified loss function can be defined as follows:

$L(f(x), y) = \min_{\xi} \sum_{r \in R_y} \xi_r \qquad (2.11)$
$\text{subject to} \quad \xi_{s_y(p)} \ge v_p(f(x), y) \quad \forall p \in P_y$
$\qquad\qquad\quad\; \xi_r \ge 0 \quad \forall r \in R_y$

Here, $P_y \subset Y$ lists all the target margin violations that enter the loss, $R_y$ is the index set of all slack variables associated with the true class $y$, and $s_y : P_y \to R_y$ is a surjective function that assigns slack variables to margin components. Table 2.2 shows some examples of loss functions and their corresponding representation in the unified notation of equation (2.11).

loss type | $L(f(x), y)$ | $P_y$ | $R_y$ | $s_y$
discriminative max-loss | $\max_{c \in Y \setminus \{y\}} v_c(f(x), y)$ | $Y \setminus \{y\}$ | $*$ | $p \mapsto *$
total max-loss | $\max_{c \in Y} v_c(f(x), y)$ | $Y$ | $*$ | $p \mapsto *$
discriminative sum-loss | $\sum_{c \in Y \setminus \{y\}} v_c(f(x), y)$ | $Y \setminus \{y\}$ | $Y \setminus \{y\}$ | $p \mapsto p$
total sum-loss | $\sum_{c \in Y} v_c(f(x), y)$ | $Y$ | $Y$ | $p \mapsto p$

Table 2.2: Margin-based loss functions and their corresponding parameters in the unified form. Here, $*$ is a constant index associated with the true class $y$.
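As a quick illustration of the unified margin and the coefficients of Table 2.1, the following sketch (illustrative only, not thesis code) builds $\nu_{y,c,m}$ and combines the scores $f_m(x)$ accordingly:

```python
def kronecker(a, b):
    return 1.0 if a == b else 0.0

def nu(y, c, m, margin="relative"):
    # coefficients nu_{y,c,m} from Table 2.1
    if margin == "relative":
        return kronecker(y, m) - kronecker(c, m)
    # absolute margin: coefficient +1 on m = c if c == y, -1 on m = c otherwise
    return -((-1.0) ** kronecker(c, y)) * kronecker(m, c)

def unified_margin(f, y, c, margin="relative"):
    # mu_c(f(x), y) = sum_m nu_{y,c,m} * f_m(x)
    return sum(nu(y, c, m, margin) * f[m] for m in range(len(f)))
```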

Unifying the primal problems

All the multi-class SVM machines of section 2.2.6 fit into a regularization framework. Therefore, integrating the unified loss function above into the regularization framework, we obtain a unified primal problem:

$\min_{w,b,\xi} \quad \frac{1}{2} \sum_{c=1}^{d} \|w_c\|^2 + C \sum_{i=1}^{\ell} \sum_{r \in R_{y_i}} \xi_{i,r} \qquad (2.12)$
$\text{subject to} \quad \sum_{c=1}^{d} \nu_{y_i,p,c} \left( \langle w_c, \phi(x_i) \rangle + b_c \right) \ge \gamma_{y_i,p} - \xi_{i,s_{y_i}(p)} \quad 1 \le i \le \ell,\; p \in P_{y_i}$
$\qquad\qquad\quad\; \xi_{i,r} \ge 0 \quad 1 \le i \le \ell,\; r \in R_{y_i}$
$\qquad\qquad\quad\; \sum_{c=1}^{d} \left( \langle w_c, \phi(x) \rangle + b_c \right) = 0 \;\Leftrightarrow\; \sum_{c=1}^{d} w_c = 0 \text{ and } \sum_{c=1}^{d} b_c = 0 \qquad (2.13)$

The equality constraint (2.13) is the sum-to-zero constraint, which does not appear in all machines. Therefore, it is optional in the unified primal problem.

2.2.8 Online SVMs

Inspired by the potential advantages of online learning and the success of Support Vector Machines, there have been many scientific studies on designing effective online learning algorithms for Support Vector Machines. For linear SVMs, significant achievements have been made over the last decade. For example, Pegasos [29] is known as an effective and scalable online algorithm working on the primal representation. The reason for its efficiency is that for a linear kernel the weight vector $w$ can be computed explicitly. Therefore, the kernel matrix is no longer needed and gradients are cheap to compute in the primal formulation. This situation changes in the non-linear case, where the weight vector cannot be expressed explicitly. In that case, optimizing the dual is the most frequently followed approach.
LaSVM [5] is an algorithm in the dual representation that employs an optimization scheme inspired by online algorithms in order to reach accuracies close to those of batch solvers more quickly. The main idea of LaSVM is to operate in an online learning scheme: each time a new example is presented, the algorithm performs a so-called PROCESS operation, followed by a REPROCESS operation. In order to understand these two operations, it is important to know that LaSVM maintains an index set I of support vectors, which correspond to non-zero alpha coefficients. The PROCESS operation performs an SMO optimization step in which one variable corresponds to the new example and the other variable is selected from the set of current support vectors I to form a violating pair. The REPROCESS step performs an SMO optimization step on two old support vectors that form a most violating pair. The purpose of PROCESS is to add a new support vector to the set of current support vectors I, while REPROCESS tries to keep I as small as possible by removing examples which are no longer required as support vectors. The pseudo-code of LaSVM [5] is presented in Algorithm 2.
For multi-class SVMs, LaRank [4] is an extension of LaSVM to perform online learning with the CS SVM formulation.


Algorithm 2 LaSVM.
1: Initialization:
   Seed S with some examples from each class.
   Set $\alpha \leftarrow 0$ and initialize the gradient $g$.
2: Online iteration (repeat a predefined number of iterations):
   - Select an example $k_t$.
   - Run PROCESS($k_t$).
   - Run REPROCESS once.
3: Finishing:
   Repeat REPROCESS until the convergence criterion is reached.

The CS dual objective optimized by LaRank takes the form:

$\text{maximize}_{\beta} \quad \sum_{i=1}^{\ell} \beta_i^{y_i} - \frac{1}{2} \sum_{i,j=1}^{\ell} \sum_{y=1}^{d} \beta_i^{y} \beta_j^{y} k(x_i, x_j)$
$\text{subject to} \quad \beta_i^{y} \le C \delta_{y,y_i}, \quad 1 \le i \le \ell,\; 1 \le y \le d$
$\qquad\qquad\quad\; \sum_{y=1}^{d} \beta_i^{y} = 0, \quad 1 \le i \le \ell$

Here, $\beta_i^{y}$ is the coefficient associated with the pair $(x_i, y)$. It can be seen that this CS SVM formulation introduces one equality constraint per example. This means that an SMO update can only be performed on variables of the same example. LaRank defines two new concepts: support vectors and support patterns. Support vectors are all pairs $(x_i, y)$ whose coefficients $\beta_i^{y}$, $1 \le i \le \ell$, are non-zero. Similarly, support patterns are all patterns $x_i$ for which there exists some $y$, $1 \le y \le d$, such that $(x_i, y)$ is a support vector. As a result, the PROCESS and REPROCESS steps of LaSVM are extended to the PROCESS_NEW, PROCESS_OLD and OPTIMIZE operations, in which SMO steps are implemented with different strategies.

• PROCESS_NEW performs an SMO step on two variables of the new example, which is not yet a support pattern at this point. The two variables must define a feasible direction.

• PROCESS_OLD randomly selects a support pattern and then chooses the two variables that form the most violating pair associated with this support pattern.

• OPTIMIZE randomly selects a support pattern, similar to PROCESS_OLD. However, it selects the most violating pair among the support vectors associated with this support pattern.

Furthermore, the derivative of the dual objective function with respect to the variable $\beta_i^{y}$ is:

$g_i(y) = \delta_{y_i,y} - \sum_{j} \beta_j^{y} k(x_i, x_j)$


This equation shows that each gradient $g_i(y)$ only relies on coefficients of its own class $y$. Furthermore, because of the added example-wise equality constraints, PROCESS_OLD only requires $d$ gradient computations. These computations can also be sped up by exploiting the sparseness structure of the support vectors. This leads to an implementation in which the gradients needed in the PROCESS_NEW and PROCESS_OLD steps are computed efficiently. Motivated by this, LaRank chooses to only store and update the gradients of current support vectors. This lowers the memory requirements and makes the gradient updates much faster. As a downside, some gradients need to be computed from scratch in the PROCESS_OLD steps, namely those which do not belong to current support vectors.
The PROCESS_NEW, PROCESS_OLD and OPTIMIZE procedures are presented in Algorithms 3, 4 and 5.

Algorithm 3 PROCESS_NEW online step
1: Select a new, fresh pattern x_i
2: y⁺ ← y_i
3: y⁻ ← arg min_{y∈Y} g_i(y)
4: Perform an SMO step: SmoStep(β_i^{y⁺}, β_i^{y⁻})

Algorithm 4 PROCESS_OLD online step
1: Randomly select a support pattern x_i
2: y⁺ ← arg max_{y∈Y} g_i(y) subject to β_i^{y} < C δ_{y,y_i}
3: y⁻ ← arg min_{y∈Y} g_i(y)
4: Perform an SMO step: SmoStep(β_i^{y⁺}, β_i^{y⁻})

Algorithm 5 OPTIMIZE online step
1: Randomly select a support pattern x_i
2: Let Y_i = {y | (x_i, y) is a support vector}
3: y⁺ ← arg max_{y∈Y_i} g_i(y) subject to β_i^{y} < C δ_{y,y_i}
4: y⁻ ← arg min_{y∈Y_i} g_i(y)
5: Perform an SMO step: SmoStep(β_i^{y⁺}, β_i^{y⁻})


Chapter 3

Training Multi-class Support Vector Machines

Training multi-class Support Vector Machines often takes a long time, since the complexity scales between quadratically and cubically in the number of examples. Due to this time demand, previous work [13] has focused on making a solver for these problems as fast as possible, resulting in a novel decomposition method. The key elements of this solver strategy are considering hypotheses without bias, exploiting second-order information, and using working sets of size two instead of following the SMO principle. We will examine a general quadratic problem without equality constraints to which a large number of multi-class SVM formulations can easily be converted. After that, the parts of the efficient decomposition method for solving this generalized quadratic problem will be presented.

3.1 Sequential two-dimensional optimization

The optimization problems in section 2.2.6 are normally solved in dual form. Dogan et al. [13] suggested that the bias terms b_c should be removed. The reason is that the bias terms make these problems more difficult to solve, because they introduce equality constraints. These equality constraints require that at each optimization step at least d variables are updated simultaneously in the LLW and WW SVM formulations. In addition, the bias terms are of minor importance when working with universal kernels. The dual problems of all primal multi-class SVMs above without bias terms can be generalized to:
\[
\max_{\alpha} \quad D(\alpha) = v^T \alpha - \frac{1}{2} \alpha^T Q \alpha \tag{3.1}
\]
\[
\text{subject to} \quad \forall n \in \{1, \ldots, m\} : L_n \le \alpha_n \le U_n
\]

Here, $\alpha \in \mathbb{R}^m$ and $v \in \mathbb{R}^m$ are vectors, and $Q \in \mathbb{R}^{m \times m}$ is a symmetric positive definite matrix. The gradient $g = \nabla D(\alpha)$ has components $g_n = v_n - \sum_{i=1}^{m} \alpha_i Q_{in}$. As discussed in section 2.2.5, the state-of-the-art algorithms for solving the quadratic problem (3.1) are decomposition methods.


Decomposition methods iteratively solve small quadratic sub-problems in which the free variables are restricted to a working set B. However, instead of using the smallest possible working set size as in SMO, [13] suggests using working sets of size two whenever possible, in order to balance the complexity of solving the subproblems, the availability of well-motivated heuristics for working set selection, and the computational cost per decomposition iteration. This method is called sequential two-dimensional optimization (S2DO). The strategy of S2DO is to select the first index according to the gradient value, $i = \arg\max_n |g_n|$, and the second component by maximization of the gain. The sequential two-dimensional optimization method is presented in Algorithm 6.

Algorithm 6 Sequential two-dimensional optimization.
1: Initialize a feasible point α, g ← v − Qα
2: repeat
3:   i ← arg max_p |g_p|
4:   j ← arg max_q computeGain(i, q)
5:   Solve the sub-problem with respect to B = {i, j}: µ ← solve2D(i, j)
6:   Update alpha: α ← α + µ
7:   Update gradient: g ← g − Qµ
8: until stopping conditions are reached

In the next parts of this chapter, we will derive how the gain is computed and how the quadratic problem (3.1) is solved when using working sets of size two.
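For illustration, here is a minimal Python sketch of the S2DO outer loop of Algorithm 6; the helper arguments compute_gain and solve_2d correspond to Algorithms 7 and 8 (sketched after those algorithms below) and are assumptions for this sketch, not Shark code.

```python
import numpy as np

def s2do(v, Q, L, U, compute_gain, solve_2d, eps=1e-3, max_iter=100000):
    """Sketch of the S2DO decomposition loop (Algorithm 6) for
    max_a v^T a - 0.5 a^T Q a  subject to  L <= a <= U."""
    m = len(v)
    alpha = np.clip(np.zeros(m), L, U)          # feasible starting point
    g = v - Q @ alpha                            # dual gradient
    for _ in range(max_iter):
        i = int(np.argmax(np.abs(g)))            # first index: largest |g_i|
        if abs(g[i]) < eps:
            break                                # crude stopping condition
        # second index: maximal gain when optimized jointly with i
        j = max((q for q in range(m) if q != i),
                key=lambda q: compute_gain(g[i], g[q], Q[i, i], Q[i, q], Q[q, q]))
        mu_i, mu_j = solve_2d(g[i], g[j], Q[i, i], Q[i, j], Q[j, j],
                              alpha[i], alpha[j], L[i], U[i], L[j], U[j])
        alpha[i] += mu_i
        alpha[j] += mu_j
        g -= Q[:, i] * mu_i + Q[:, j] * mu_j     # incremental gradient update
    return alpha
```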

3.2 Gain computation for the unconstrained sub-problem

Let the optimization problem (3.1) be restricted to $\alpha_B = [\alpha_i, \alpha_j]^T$ with $B = \{i, j\}$. We want to find the optimum $\hat{\alpha}_B = [\hat{\alpha}_i, \hat{\alpha}_j]^T$ of $D$ without constraints with respect to the variables $\alpha_i, \alpha_j$, and the gain $D(\hat{\alpha}_B) - D(\alpha_B)$. Let $\mu_B = [\mu_i, \mu_j]^T = \hat{\alpha}_B - \alpha_B$ denote the update step and $g_B = [g_i, g_j]^T$ the gradient, where $g_i = \frac{\partial D}{\partial \alpha_i}(\alpha_B)$ and $g_j = \frac{\partial D}{\partial \alpha_j}(\alpha_B)$. Taking the Taylor expansion around $\hat{\alpha}_B$, we have:
\[
D(\alpha_B) = D(\hat{\alpha}_B) - \mu_B^T \nabla D(\hat{\alpha}_B) - \frac{1}{2} \mu_B^T Q_B \mu_B .
\]
Here, $Q_B$ is the restriction of $Q$ to the variables in $B$. It can be seen that $\nabla D(\hat{\alpha}_B) = 0$, since $\hat{\alpha}_B$ is the optimum of the unconstrained optimization problem. Therefore, the gain is $D(\hat{\alpha}_B) - D(\alpha_B) = \frac{1}{2} \mu_B^T Q_B \mu_B$. We have three cases to consider:

• If $Q_B = 0$, we have two cases. If $g_B = 0$, the gain is 0. If $g_B \neq 0$, then $D(\alpha_B)$ is a linear function with respect to $\alpha_B$, thus the gain is infinite.

• If $\det Q_B = 0$ and $Q_B \neq 0$, then at least one of the elements $Q_{ii}, Q_{jj}$ is non-zero. Without loss of generality, we can assume $Q_{ii} \neq 0$. Taking the Taylor expansion around $\alpha_B$, we have the objective function $D(\alpha_B + \mu_B) = D(\alpha_B) + g_B^T \mu_B - \frac{1}{2} \mu_B^T Q_B \mu_B$. This objective function is maximized when $\frac{\partial D}{\partial \mu_B}(\alpha_B + \mu_B) = 0$, which results in $g_B = Q_B \mu_B$. Therefore, the gain is $\frac{1}{2} g_B^T \mu_B$. Furthermore, $Q_{ii}\mu_i + Q_{ij}\mu_j = g_i \Rightarrow \mu_i = \frac{g_i - Q_{ij}\mu_j}{Q_{ii}}$. As a result, the gain is:
\[
\frac{1}{2}(\mu_i g_i + \mu_j g_j) = \frac{g_i(g_i - Q_{ij}\mu_j)}{2 Q_{ii}} + \frac{\mu_j g_j}{2} = \frac{g_i^2}{2 Q_{ii}} + \frac{\mu_j (g_j Q_{ii} - g_i Q_{ij})}{2 Q_{ii}} .
\]
If $(g_i Q_{ij} - g_j Q_{ii}) \neq 0$, the linear system $g_B = Q_B \mu_B$ has no solution and the objective is unbounded along the null direction of $Q_B$, so the gain is infinite. If $(g_i Q_{ij} - g_j Q_{ii}) = 0$, there are numerous ways to assign values to $\mu_B$. We can compute the gain by assuming $\mu_B = \lambda g_B$, $\lambda \in \mathbb{R}$, $\lambda \neq 0$; then $\frac{1}{\lambda} g_B = Q_B g_B$. From the Rayleigh quotient we have $\lambda = \frac{g_B^T g_B}{g_B^T Q_B g_B}$, and the gain is:
\[
\frac{(g_B^T g_B)^2}{2\, g_B^T Q_B g_B} = \frac{(g_i^2 + g_j^2)^2}{2 (g_i^2 Q_{ii} + 2 g_i g_j Q_{ij} + g_j^2 Q_{jj})} .
\]

• If $\det(Q_B) \neq 0$, then $\mu_B = Q_B^{-1} g_B$, thus the gain is:
\[
\frac{g_i^2 Q_{jj} - 2 g_i g_j Q_{ij} + g_j^2 Q_{ii}}{2 (Q_{ii} Q_{jj} - Q_{ij}^2)} .
\]

In summary, the gain computation can be represented as in Algorithm 7.
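The case analysis above, summarized in Algorithm 7, can also be written compactly in code. The following Python sketch is only an illustration of that case analysis; the numerical tolerance `tol` used to test for (near-)zero quantities is an added assumption.

```python
import math

def compute_gain(gi, gj, Qii, Qij, Qjj, tol=1e-12):
    """Gain of the unconstrained 2D sub-problem (sketch of Algorithm 7)."""
    if abs(gi) < tol and abs(gj) < tol:
        return 0.0                                  # already optimal on B
    det = Qii * Qjj - Qij * Qij
    if abs(Qii) < tol and abs(Qij) < tol and abs(Qjj) < tol:
        return math.inf                             # Q_B = 0, g_B != 0: linear, unbounded
    if abs(det) < tol:                              # singular but non-zero Q_B
        if (abs(Qii) > tol and abs(gi * Qij - gj * Qii) > tol) or \
           (abs(Qjj) > tol and abs(gj * Qij - gi * Qjj) > tol):
            return math.inf                         # system g_B = Q_B mu_B inconsistent
        denom = gi * gi * Qii + 2 * gi * gj * Qij + gj * gj * Qjj
        return (gi * gi + gj * gj) ** 2 / (2 * denom)
    # regular case: Newton step mu_B = Q_B^{-1} g_B
    return (gi * gi * Qjj - 2 * gi * gj * Qij + gj * gj * Qii) / (2 * det)
```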

3.3 Parameter update for the constrained sub-problem

After the working pair $B = \{i, j\}$ has been selected, the following sub-problem needs to be solved:
\[
\max_{\mu_B} \quad D(\alpha_B + \mu_B) = D(\alpha_B) + g_B^T \mu_B - \frac{1}{2} \mu_B^T Q_B \mu_B
\]
\[
\text{subject to} \quad L_i \le \alpha_i + \mu_i \le U_i, \qquad L_j \le \alpha_j + \mu_j \le U_j .
\]
Taking the derivative of $D(\alpha_B + \mu_B)$ with respect to $\mu_B$, we obtain:
\[
\frac{\partial D}{\partial \mu_B}(\alpha_B + \mu_B) = g_B - Q_B \mu_B .
\]
Without inequality constraints, $\frac{\partial D}{\partial \mu_B}(\alpha_B + \mu_B)$ vanishes at the optimum, in other words $g_B = Q_B \mu_B$. However, with constraints, there are three cases that need to be considered:


Algorithm 7 Gain computation for S2DO.
Input: B = {i, j}, g_i, g_j, Q_ii, Q_ij, Q_jj
Output: the gain with respect to the variables α_i, α_j
1: if Q_B = 0 then
2:   if g_B = 0 then
3:     return 0
4:   else return ∞
5:   end if
6: else if Q_B ≠ 0 and det(Q_B) = 0 then
7:   if (Q_ii ≠ 0 and g_i Q_ij − g_j Q_ii ≠ 0)
8:      or (Q_jj ≠ 0 and g_j Q_ij − g_i Q_jj ≠ 0) then
9:     return ∞
10:  else
11:    return (g_i² + g_j²)² / (2 (g_i² Q_ii + 2 g_i g_j Q_ij + g_j² Q_jj))
12:  end if
13: else
14:   return (g_i² Q_jj − 2 g_i g_j Q_ij + g_j² Q_ii) / (2 (Q_ii Q_jj − Q_ij²))
15: end if

1. If $Q_B = 0$, then the objective function becomes $\max_{\mu_B} D(\alpha_B + \mu_B) = D(\alpha_B) + g_i \mu_i + g_j \mu_j$ such that $\mu_i + \alpha_i \in [L_i, U_i]$ and $\mu_j + \alpha_j \in [L_j, U_j]$. Therefore, to find $\mu_i$, several cases are considered:

   • If $g_i > 0 \Rightarrow \mu_i + \alpha_i = U_i \Rightarrow \mu_i = U_i - \alpha_i$.
   • If $g_i < 0 \Rightarrow \mu_i + \alpha_i = L_i \Rightarrow \mu_i = L_i - \alpha_i$.
   • If $g_i = 0 \Rightarrow \mu_i = 0$.

   $\mu_j$ is found with the same procedure as $\mu_i$.

2. If $\det(Q_B) = 0$ and $Q_B \neq 0$, then the equality $g_B = Q_B \mu_B$ either has no solution or infinitely many solutions on the line $g_i = Q_{ii}\mu_i + Q_{ij}\mu_j$. In both cases, because the objective restricted to the null direction of $Q_B$ is linear, an optimum is attained on one of the four edges of the constraint box $L_i \le \alpha_i + \mu_i \le U_i$, $L_j \le \alpha_j + \mu_j \le U_j$. The optimum is obtained by solving four one-variable sub-problems and selecting the one that gives the best value of $D(\alpha_B + \mu_B)$:

   • If $\alpha_i + \mu_i = U_i$, then $\mu_j = \max\big(L_j - \alpha_j, \min\big(\frac{g_j - Q_{ij}(U_i - \alpha_i)}{Q_{jj}}, U_j - \alpha_j\big)\big)$.
   • If $\alpha_i + \mu_i = L_i$, then $\mu_j = \max\big(L_j - \alpha_j, \min\big(\frac{g_j - Q_{ij}(L_i - \alpha_i)}{Q_{jj}}, U_j - \alpha_j\big)\big)$.
   • If $\alpha_j + \mu_j = U_j$, then $\mu_i = \max\big(L_i - \alpha_i, \min\big(\frac{g_i - Q_{ij}(U_j - \alpha_j)}{Q_{ii}}, U_i - \alpha_i\big)\big)$.
   • If $\alpha_j + \mu_j = L_j$, then $\mu_i = \max\big(L_i - \alpha_i, \min\big(\frac{g_i - Q_{ij}(L_j - \alpha_j)}{Q_{ii}}, U_i - \alpha_i\big)\big)$.

3. If $Q_B$ has full rank, then the unconstrained optimum is:
\[
\hat{\mu}_B = Q_B^{-1} g_B = \frac{1}{\det(Q_B)} \begin{pmatrix} Q_{jj} g_i - Q_{ij} g_j \\ Q_{ii} g_j - Q_{ij} g_i \end{pmatrix} .
\]
   If $\hat{\mu}_B$ satisfies the inequality constraints, it is the optimal update. If not, by convexity of the problem, the task reduces to one-dimensional problems with one of the inequality constraints active. Therefore, the proceeding is the same as in the case before.

As a result, the procedure that solves the two-dimensional sub-problem is shown in Algorithm 8.


Algorithm 8 Solving the two-dimensional sub-problem.
Input: B = {i, j}, g_i, g_j, Q_ii, Q_ij, Q_jj, α_i, α_j, U_i, L_i, U_j, L_j
Output: µ_i, µ_j
1: Let Q_B be the matrix [Q_ii Q_ij; Q_ij Q_jj]
2: if Q_ii = Q_ij = Q_jj = 0 then
3:   if g_i > 0 then µ_i ← U_i − α_i
4:   else if g_i < 0 then µ_i ← L_i − α_i
5:   else µ_i ← 0
6:   end if
7:   if g_j > 0 then µ_j ← U_j − α_j
8:   else if g_j < 0 then µ_j ← L_j − α_j
9:   else µ_j ← 0
10:  end if
11: else if det(Q_B) = 0 and Q_B ≠ 0 then
12:  Let:
13:  (µ_i1, µ_j1) ← (U_i − α_i, max(L_j − α_j, min((g_j − Q_ij(U_i − α_i))/Q_jj, U_j − α_j)))
14:  (µ_i2, µ_j2) ← (L_i − α_i, max(L_j − α_j, min((g_j − Q_ij(L_i − α_i))/Q_jj, U_j − α_j)))
15:  (µ_i3, µ_j3) ← (max(L_i − α_i, min((g_i − Q_ij(U_j − α_j))/Q_ii, U_i − α_i)), U_j − α_j)
16:  (µ_i4, µ_j4) ← (max(L_i − α_i, min((g_i − Q_ij(L_j − α_j))/Q_ii, U_i − α_i)), L_j − α_j)
17:  (µ_i, µ_j) ← arg max_{k=1,…,4} (µ_ik g_i + µ_jk g_j − ½ (Q_ii µ_ik² + 2 Q_ij µ_ik µ_jk + Q_jj µ_jk²))
18: else
19:  µ̂_B ← (1/det(Q_B)) (Q_jj g_i − Q_ij g_j, Q_ii g_j − Q_ij g_i)
20:  if α_B + µ̂_B ∈ [L_i, U_i] × [L_j, U_j] then
21:    µ_i ← µ̂_i
22:    µ_j ← µ̂_j
23:  else
24:    Repeat steps 13–17
25:  end if
26: end if
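The following Python sketch solves the box-constrained two-dimensional sub-problem in the spirit of Algorithm 8. Instead of distinguishing the singular cases explicitly, it tries the Newton step and otherwise evaluates the four edge-restricted one-dimensional optima, keeping the best candidate; it is a simplified variant, not a literal transcription of the pseudo-code above.

```python
def solve_1d(a, b, lo, hi, tol=1e-12):
    """Maximize a*mu - (b/2)*mu^2 over [lo, hi] (b >= 0)."""
    if b > tol:
        return min(max(a / b, lo), hi)
    return hi if a > 0 else (lo if a < 0 else 0.0)

def solve_2d(gi, gj, Qii, Qij, Qjj, ai, aj, Li, Ui, Lj, Uj, tol=1e-12):
    """Return (mu_i, mu_j) maximizing g^T mu - 0.5 mu^T Q_B mu
    subject to L <= alpha + mu <= U (sketch of Algorithm 8)."""
    lo_i, hi_i = Li - ai, Ui - ai
    lo_j, hi_j = Lj - aj, Uj - aj

    def gain(mi, mj):
        return gi * mi + gj * mj - 0.5 * (Qii * mi * mi + 2 * Qij * mi * mj + Qjj * mj * mj)

    det = Qii * Qjj - Qij * Qij
    if det > tol:                                   # unconstrained Newton step
        mi = (Qjj * gi - Qij * gj) / det
        mj = (Qii * gj - Qij * gi) / det
        if lo_i <= mi <= hi_i and lo_j <= mj <= hi_j:
            return mi, mj                           # feasible: done
    # otherwise restrict one variable to an edge and solve the 1D problem
    candidates = []
    for mi in (lo_i, hi_i):
        candidates.append((mi, solve_1d(gj - Qij * mi, Qjj, lo_j, hi_j)))
    for mj in (lo_j, hi_j):
        candidates.append((solve_1d(gi - Qij * mj, Qii, lo_i, hi_i), mj))
    return max(candidates, key=lambda m: gain(*m))
```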


Chapter 4

Online training of the LLW multi-class SVM

Although the LLW SVM has the nice theoretical property of Fisher consistency, training an LLW SVM normally requires a long time and a high computational effort for large-scale datasets. The reason is that training the LLW SVM requires solving a quadratic optimization problem whose size is proportional to the number of training examples times the number of classes. As a consequence, this limits the applicability of the LLW SVM. A natural idea is to reduce the training time and computation by approximating the solution of the LLW SVM. The online learning paradigm, introduced in sections 2.1.5 and 2.2.8, can be used as an optimization heuristic to do that. In this chapter, the LLW SVM is first formulated in its dual form; then the online learning paradigm and the S2DO method, which were both established in earlier chapters, are employed to derive a new, effective learning algorithm.

4.1 The LLW multi-class SVM formulation

We introduced the LLW SVM [22] in its primal form in section 2.2.6. Here we will, for convenience, give the primal again, then derive the dual form, link it to regular SVM solver strategies, and finally, as one contribution of this thesis, derive all necessary elements for an online solver for the LLW SVM. The LLW SVM is modeled as follows:
\[
\min_{w, b, \xi} \quad \frac{1}{2} \sum_{c=1}^{d} \langle w_c, w_c \rangle + C \sum_{n=1}^{\ell} \sum_{c=1}^{d} \xi_{n,c} \tag{4.1}
\]
\[
\text{subject to} \quad \forall n \in \{1, \ldots, \ell\},\ c \in \{1, \ldots, d\} \setminus \{y_n\} : \langle w_c, \phi(x_n) \rangle + b_c \le -\frac{1}{d-1} + \xi_{n,c}
\]
\[
\forall n \in \{1, \ldots, \ell\},\ c \in \{1, \ldots, d\} : \xi_{n,c} \ge 0
\]
\[
\forall h \in \mathcal{H} : \sum_{c=1}^{d} (\langle w_c, h \rangle + b_c) = 0 \ \Leftrightarrow\ \sum_{c=1}^{d} w_c = 0 \ \text{and}\ \sum_{c=1}^{d} b_c = 0 \tag{4.2}
\]


Equation (4.2) is the sum-to-zero constraint, which is equivalent to $\sum_{c=1}^{d} w_c = 0$ and $\sum_{c=1}^{d} b_c = 0$. Next, we transform the primal optimization problem into the dual representation using the technique of Lagrange multipliers [8]:
\[
\mathcal{L} = \frac{1}{2} \sum_{c=1}^{d} \langle w_c, w_c \rangle + C \sum_{n=1}^{\ell} \sum_{c=1}^{d} \xi_{n,c}
+ \sum_{n=1}^{\ell} \sum_{c=1}^{d} \alpha_{n,c} \Big( \langle w_c, \phi(x_n) \rangle + b_c + \frac{1}{d-1} - \xi_{n,c} \Big)
- \sum_{n=1}^{\ell} \sum_{c=1}^{d} \beta_{n,c} \xi_{n,c}
- \Big\langle \eta, \sum_{c=1}^{d} w_c \Big\rangle
- \tau \Big( \sum_{c=1}^{d} b_c \Big)
\]

Here, $\alpha_{n,c} \ge 0$, $\alpha_{n,y_n} = 0$, $\beta_{n,c} \ge 0$, $\eta \in \mathcal{H}$ and $\tau \in \mathbb{R}$ are Lagrange multipliers. Taking derivatives with respect to the variables $w_c$, $b_c$ and $\xi_{n,c}$, respectively, we obtain the Karush-Kuhn-Tucker (KKT) conditions [8]:
\[
\frac{\partial \mathcal{L}}{\partial w_c} = w_c + \sum_{n=1}^{\ell} \alpha_{n,c} \phi(x_n) - \eta = 0
\;\Rightarrow\; w_c = \eta - \sum_{n=1}^{\ell} \alpha_{n,c} \phi(x_n) \tag{4.3}
\]
\[
\frac{\partial \mathcal{L}}{\partial b_c} = \sum_{n=1}^{\ell} \alpha_{n,c} - \tau = 0
\;\Rightarrow\; \tau = \sum_{n=1}^{\ell} \alpha_{n,c}
\]
\[
\frac{\partial \mathcal{L}}{\partial \xi_{n,c}} = C - \alpha_{n,c} - \beta_{n,c} = 0
\;\Rightarrow\; 0 \le \alpha_{n,c} \le C \quad \forall n \in \{1, \ldots, \ell\},\ \forall c \in \{1, \ldots, d\}, \qquad \alpha_{n,y_n} = 0
\]
Substituting (4.3) into the constraint $\sum_{c=1}^{d} w_c = 0$, we obtain:
\[
d \eta = \sum_{c=1}^{d} \sum_{n=1}^{\ell} \alpha_{n,c} \phi(x_n)
\;\Rightarrow\; \eta = \frac{1}{d} \sum_{c=1}^{d} \sum_{n=1}^{\ell} \alpha_{n,c} \phi(x_n)
\]


Therefore, the weight $w_c$ can be written as:
\[
w_c = \frac{1}{d} \Big( \sum_{p=1}^{d} \sum_{n=1}^{\ell} \alpha_{n,p} \phi(x_n) \Big) - \sum_{n=1}^{\ell} \alpha_{n,c} \phi(x_n)
= - \sum_{p=1}^{d} \sum_{n=1}^{\ell} \Big( \delta_{p,c} - \frac{1}{d} \Big) \alpha_{n,p} \phi(x_n) \tag{4.4}
\]

Here, $\delta_{p,c}$ denotes the Kronecker symbol. Applying the constraint $\sum_{c=1}^{d} w_c = 0$ to the Lagrangian, we have:
\[
\mathcal{L} = \frac{1}{2} \sum_{c=1}^{d} \langle w_c, w_c \rangle + \sum_{n=1}^{\ell} \sum_{c=1}^{d} \xi_{n,c} (C - \alpha_{n,c} - \beta_{n,c})
+ \sum_{n=1}^{\ell} \sum_{c=1}^{d} \alpha_{n,c} \langle w_c, \phi(x_n) \rangle
+ \sum_{c=1}^{d} b_c \Big( \sum_{n=1}^{\ell} \alpha_{n,c} - \tau \Big)
+ \frac{1}{d-1} \sum_{n=1}^{\ell} \sum_{c=1}^{d} \alpha_{n,c}
\]
\[
= \frac{1}{d-1} \sum_{n=1}^{\ell} \sum_{c=1}^{d} \alpha_{n,c}
+ \frac{1}{2} \sum_{c=1}^{d} \langle w_c, w_c \rangle
+ \sum_{n=1}^{\ell} \sum_{c=1}^{d} \alpha_{n,c} \langle w_c, \phi(x_n) \rangle
\]

We have this simplification because of the constraints $C - \alpha_{n,c} - \beta_{n,c} = 0$ and $\tau = \sum_{n=1}^{\ell} \alpha_{n,c}$. Incorporating the result from (4.4) yields:
\[
\mathcal{L} = \frac{1}{d-1} \sum_{n=1}^{\ell} \sum_{c=1}^{d} \alpha_{n,c}
+ \frac{1}{2} \sum_{c=1}^{d} \Big( \sum_{m,n=1}^{\ell} \sum_{p,q=1}^{d} \big( \delta_{p,c} - \tfrac{1}{d} \big)\big( \delta_{q,c} - \tfrac{1}{d} \big) \alpha_{m,p} \alpha_{n,q} \langle \phi(x_m), \phi(x_n) \rangle \Big)
- \sum_{m,n=1}^{\ell} \sum_{p,q=1}^{d} \big( \delta_{p,q} - \tfrac{1}{d} \big) \alpha_{m,p} \alpha_{n,q} \langle \phi(x_m), \phi(x_n) \rangle
\]


In addition, we can re-arrange:
\[
\sum_{c=1}^{d} \sum_{p,q=1}^{d} \big( \delta_{p,c} - \tfrac{1}{d} \big)\big( \delta_{q,c} - \tfrac{1}{d} \big)
= \sum_{c=1}^{d} \sum_{p,q=1}^{d} \Big( \delta_{p,c}\delta_{q,c} - \tfrac{1}{d} (\delta_{p,c} + \delta_{q,c}) + \tfrac{1}{d^2} \Big)
\]
\[
= \sum_{p,q=1}^{d} \Big( \sum_{c=1}^{d} \delta_{p,c}\delta_{q,c} - \tfrac{1}{d} \sum_{c=1}^{d} (\delta_{p,c} + \delta_{q,c}) + \tfrac{1}{d} \Big)
= \sum_{p,q=1}^{d} \Big( \delta_{p,q} - \tfrac{2}{d} + \tfrac{1}{d} \Big)
= \sum_{p,q=1}^{d} \Big( \delta_{p,q} - \tfrac{1}{d} \Big)
\]

In summary, we have the dual problem:
\[
\max_{\alpha} \quad \frac{1}{d-1} \sum_{n=1}^{\ell} \sum_{c=1}^{d} \alpha_{n,c}
- \frac{1}{2} \sum_{m,n=1}^{\ell} \sum_{p,q=1}^{d} \big( \delta_{p,q} - \tfrac{1}{d} \big) \alpha_{m,p} \alpha_{n,q} k(x_m, x_n) \tag{4.5}
\]
\[
\text{subject to} \quad 0 \le \alpha_{n,c} \le C \quad \forall n \in \{1, \ldots, \ell\},\ \forall c \in \{1, \ldots, d\}
\]
\[
\sum_{n=1}^{\ell} \alpha_{n,c} = \tau \quad \forall c \in \{1, \ldots, d\} \tag{4.6}
\]
\[
\alpha_{n,y_n} = 0 \quad \forall n \in \{1, \ldots, \ell\}
\]

Here, the value of the Lagrange multiplier $\tau \in \mathbb{R}$ can be set to some constant value. It can be seen that the equality constraints (4.6) are introduced by the bias terms. Section 3.1 discussed why the bias terms can be removed; doing so obviates the equality constraints (4.6). As a result, problem (4.5) is cast into the generalized problem, which can be solved as in chapter 3.
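To make the connection to problem (3.1) explicit, the Python sketch below assembles the linear term v, the box bounds and the quadratic matrix Q of the bias-free LLW dual; the flat index convention (variable (n, c) maps to n·d + c) and the dense construction are assumptions for illustration only (they are suitable for small problems, not for the cached, on-demand layout used inside Shark).

```python
import numpy as np

def llw_as_generic_problem(X, y, kernel, C, d):
    """Map the bias-free LLW dual (4.5) to max v^T a - 0.5 a^T Q a, L <= a <= U."""
    ell = len(X)
    m = ell * d                                     # one variable per (example, class)
    v = np.full(m, 1.0 / (d - 1))                   # linear term of (4.5)
    L = np.zeros(m)
    U = np.full(m, C)
    for n in range(ell):                            # alpha_{n, y_n} is fixed to zero
        U[n * d + y[n]] = 0.0
    # Q_{(m,p),(n,q)} = (delta_{p,q} - 1/d) * k(x_m, x_n)
    Q = np.empty((m, m))
    for a in range(ell):
        for b in range(ell):
            kab = kernel(X[a], X[b])
            for p in range(d):
                for q in range(d):
                    Q[a * d + p, b * d + q] = ((1.0 if p == q else 0.0) - 1.0 / d) * kab
    return v, Q, L, U
```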

4.2 Online learning for the LLW SVM

Inspired by LaSVM and LaRank, in this section an online algorithm for the LLW machine is proposed that adapts LaRank's concepts. Because S2DO has proven superior for batch learning [13], we design our online solver from the beginning to support the S2DO optimization scheme in order to preserve the advantages of this method. Similarly to LaRank, we define some concepts that are used in the online LLW SVM. To clarify, we recall the definitions from LaRank: a support vector is a pair $(x_i, y)$ whose alpha coefficient $\alpha_{i,y}$ is non-zero, and a support pattern is a pattern $x_i$ such that $(x_i, y)$ is a support vector for some $y$, $1 \le y \le d$. In addition, we define I and W as the sets of current support vectors and support patterns, respectively.


4.2.1 Online step types

In 'classical' online learning [27], online steps only update variables that correspond to new, fresh patterns. However, this leads to slow optimization if online learning is employed as an optimization heuristic in the dual representation. The reason is that the variables corresponding to the support vectors are not updated often enough [4]. Therefore, in LaSVM and LaRank, various online steps are defined and carried out that update not only variables corresponding to new patterns but also variables corresponding to old support patterns.

Furthermore, in LaRank the CS SVM [10] is employed to solve the multi-class classification problem. As a result, there is one equality constraint per pattern. This restricts the two working variables of SMO in LaRank to the same pattern. However, in the LLW SVM without bias there is no equality constraint, so the two working variables in S2DO might come from different patterns. These patterns might be support patterns or new patterns, since old non-support patterns are not stored and become new patterns again in the next epoch. Therefore, the two working variables might come from the same new, fresh pattern; from a new support pattern and an old support pattern; or from two old support patterns. The equivalent online operations for the LLW SVM can thus be defined as follows:

• TWO_NEWS: This online step selects two variables from a fresh pattern which is not a support pattern. This operation is roughly equivalent to PROCESS_NEW in LaRank.

• ONE_OLD_ONE_NEW: This online step selects one variable from the pattern that has just been inserted into the support pattern set. The other variable is chosen from the old support patterns.

• TWO_OLD: Two variables are selected from old support patterns. This operation is equivalent to PROCESS_OLD in LaRank.

• TWO_OLD_SUPPORTS: Two variables that are support vectors are selected from old support patterns. This operation is similar to OPTIMIZE in LaRank.

It can be seen that ONE_OLD_ONE_NEW is a special case of TWO_OLD, and in our experience ONE_OLD_ONE_NEW contributes very little to the training process. Therefore, we exclude ONE_OLD_ONE_NEW from the list of online steps.

4.2.2 Working set strategies for the online LLW SVM

Online steps are, in some sense, part of the working set selection, especially from the original perspective of solving the batch optimization problem, because they restrict the variables from which the working set can be selected. From the online learning perspective, however, online steps are operations the online solver performs to adapt to the changes when a new, fresh example arrives. Within a specific online step, the purpose of working set selection is to choose the variables that optimize this adaptation. It is also clear that the working set selection strategy used within the online steps might affect the performance of an online optimization method. For the LLW SVM, we examine several working set selection strategies. Let V be the set of possible variables determined by a specific online step. The working set selection strategies are defined as follows:

• S2DO_ALL: Two variables are selected by S2DO over the set of all possible variables V; in particular, the variable with the highest gradient value is chosen first, and the other variable is selected to give the highest gain over V.

• RANDOM_PATTERN_S2DO_ALL: First, a pattern is randomly selected from W, then the variable of this pattern with the highest gradient is chosen. After that, the remaining variable is selected from V to give the highest gain.

• RANDOM_PATTERN_S2DO_PATTERN: A pattern is randomly selected from W, then two variables are selected from this pattern by S2DO.

Because all working set strategies coincide for TWO_NEWS, these different strategies are only applied in the TWO_OLD and TWO_OLD_SUPPORTS online steps. The resulting working set selection algorithm for the online steps is presented in Algorithm 9.

4.2.3 Gradient information storage

As discussed earlier, the two working variables in the LLW SVM can come from different patterns. Therefore, more gradient calculations might be needed to find the two working variables in the online operations. For example, the combination of the TWO_OLD online step with the S2DO_ALL strategy needs $|W| \cdot d$ gradients, compared with only $d$ gradients in PROCESS_OLD in LaRank. Another point is that, from (4.5), we have:
\[
g_{m,c} = \frac{1}{d-1} - \sum_{n=1}^{|W|} \sum_{e=1}^{d} \Big( \delta_{c,e} - \frac{1}{d} \Big) \alpha_{n,e} k(x_m, x_n)
\]
Therefore, if gradients are computed on demand as in LaRank, they take more time, since each gradient computation involves $|W| \cdot d$ variables. As a result, gradients should be stored and updated for all variables, both support vectors and non-support vectors, that correspond to the set of current support patterns. The gradient update rule is:
\[
g_{i,c} \leftarrow g_{i,c} - \Big( \delta_{c,p} - \frac{1}{d} \Big) \mu_{m,p} k(x_m, x_i) - \Big( \delta_{c,q} - \frac{1}{d} \Big) \mu_{n,q} k(x_n, x_i)
\]
Consequently, we only compute the gradients that belong to the new pattern in the TWO_NEWS step, and we only keep and update these gradients if the new pattern becomes a support pattern. This strategy has the advantage that less time is needed to compute gradients. As a downside, it touches $|W| \cdot d$ variables in each update step, which means that more variables need to be treated and more memory is needed to store gradient information compared with the strategy in LaRank. However, since usually $|W| \gg d$, the number of variables needed grows linearly with the number of support patterns.
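The incremental update above can be written down compactly; the sketch below (plain Python with a dense per-pattern gradient table, not the Shark data structures) updates the stored gradients of all support patterns after a two-variable step on (m, p) and (n, q).

```python
def update_gradients(g, support_patterns, m, p, n, q, mu_mp, mu_nq, d, kernel, X):
    """Apply g_{i,c} -= (delta_{c,p} - 1/d) * mu_mp * k(x_m, x_i)
                      + (delta_{c,q} - 1/d) * mu_nq * k(x_n, x_i)
    for every stored support pattern i and every class c.
    `g` is a dict mapping pattern index i to a list of d gradient values."""
    for i in support_patterns:
        k_mi = kernel(X[m], X[i])
        k_ni = kernel(X[n], X[i])
        for c in range(d):
            g[i][c] -= ((1.0 if c == p else 0.0) - 1.0 / d) * mu_mp * k_mi
            g[i][c] -= ((1.0 if c == q else 0.0) - 1.0 / d) * mu_nq * k_ni
    return g
```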


Algorithm 9 Working set selection for online LLW SVM steps.
Input: step: online step, selectionMethod: working set strategy
Output: (m, p), (n, q)
1: if step = TWO_NEWS then
2:   Retrieve the new pattern x_i
3:   m ← i, p ← arg max_{c∈Y} g_{m,c}
4:   n ← i, q ← arg max_{c∈Y} computeGain((m, p), (n, c))
5: else if step = TWO_OLD then
6:   Retrieve the possible variables: V = {(m, p) | m ∈ W, p ∈ Y}
7:   if selectionMethod = S2DO_ALL then
8:     (m, p) ← arg max_{(i,c)∈V} g_{i,c}
9:     (n, q) ← arg max_{(j,e)∈V} computeGain((m, p), (j, e))
10:  else if selectionMethod = RANDOM_PATTERN_S2DO_ALL then
11:    Randomly select a support pattern i from W
12:    m ← i, p ← arg max_{c∈Y} g_{m,c}
13:    (n, q) ← arg max_{(n,q)∈V} computeGain((m, p), (n, q))
14:  else if selectionMethod = RANDOM_PATTERN_S2DO_PATTERN then
15:    Randomly select a pattern i from W
16:    m ← i, p ← arg max_{c∈Y} g_{m,c}
17:    n ← i, q ← arg max_{c∈Y} computeGain((m, p), (n, c))
18:  end if
19: else if step = TWO_OLD_SUPPORTS then
20:  Retrieve the possible variables: V = {(m, p) | m ∈ W, p ∈ Y, α_{m,p} ≠ 0}
21:  if selectionMethod = S2DO_ALL then
22:    (m, p) ← arg max_{(i,c)∈V} g_{i,c}
23:    (n, q) ← arg max_{(j,e)∈V} computeGain((m, p), (j, e))
24:  else if selectionMethod = RANDOM_PATTERN_S2DO_ALL then
25:    Randomly select a support pattern i from W
26:    m ← i, p ← arg max_{c∈Y} g_{m,c} subject to α_{m,p} ≠ 0
27:    (n, q) ← arg max_{(n,q)∈V} computeGain((m, p), (n, q))
28:  else if selectionMethod = RANDOM_PATTERN_S2DO_PATTERN then
29:    Randomly select a pattern i from W
30:    m ← i, p ← arg max_{c∈Y} g_{m,c} subject to α_{m,p} ≠ 0
31:    n ← i, q ← arg max_{c∈Y} computeGain((m, p), (n, c)) subject to α_{n,c} ≠ 0
32:  end if
33: end if


4.2.4 Scheduling

A natural question is how the online operations should be scheduled to approximate the batch LLW SVM solver efficiently. There are two common strategies for mixing different online steps, namely fixed scheduling and adaptive scheduling. Fixed scheduling [6] successively runs different atomic steps, each consisting of a number n_O of one type of online step. n_O can be called a 'multiplier', and different atomic steps can have different multipliers. These multipliers depend on the specific problem and would therefore be additional hyperparameters for the solver. In contrast, adaptive scheduling [4] takes into account both the computation time and the progress of the dual value to determine which atomic step should be run next, and the multipliers of the online steps are fixed for all datasets. It remains an open question which kind of strategy is better, because there are no solid comparison studies investigating the effect of different step selection strategies. Due to its effectiveness in LaRank, we adopt adaptive scheduling in our online solver, with the difference that the multipliers of the atomic steps are set to the number of classes of the training set, except for the multiplier of TWO_NEWS, which is set to 1. The main idea behind this strategy is to select as the next step an atomic step that achieves a higher gain in a shorter time. In particular, the strategy selects the next atomic online step randomly, with probabilities based on the ratio of dual increase to elapsed time. Our solver also replicates the LaRank strategy that guarantees each online step is selected with probability at least η. The adaptive scheduling strategy for the online LLW SVM is presented in Algorithm 10. Here, β is a decay parameter of the adaptive scheduling algorithm. The values of β and η are both set to 0.05, which gave good performance in LaRank and is inherited by our online solver.
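The following Python sketch illustrates the kind of adaptive scheduler described above: it keeps an exponentially smoothed estimate of dual increase per second for each step type and samples the next step roughly proportionally to these estimates, with a minimum selection probability. The interface (calling `update` after each atomic step with the observed dual increase and duration) is an assumption made for illustration, not the Shark implementation.

```python
import random

class AdaptiveScheduler:
    """Pick the next online step according to smoothed gain-per-time estimates."""

    def __init__(self, steps, beta=0.05, eta=0.05):
        self.beta = beta                      # decay of the running gain estimate
        self.eta = eta                        # (approximate) minimum selection probability
        self.gains = {s: 1.0 for s in steps}  # optimistic initialization

    def update(self, step, dual_increase, duration):
        rate = dual_increase / max(duration, 1e-12)
        self.gains[step] = self.beta * rate + (1.0 - self.beta) * self.gains[step]

    def next_step(self):
        steps = list(self.gains)
        total = sum(self.gains.values())
        probs = [max(self.gains[s] / total, self.eta) for s in steps]
        norm = sum(probs)
        r, acc = random.random() * norm, 0.0
        for s, p in zip(steps, probs):
            acc += p
            if r <= acc:
                return s
        return steps[-1]
```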

4.2.5 Stopping conditions

We did not specify any stopping conditions in Algorithm 10. Normally, for reasonable choices of multipliers for the adaptive scheme, the algorithm gives good results after a single pass over the training dataset. We call this variant onlineLLW×1 to indicate that the algorithm stops after one epoch. Otherwise, we use the same stopping criterion as the Shark framework [19] uses for multi-class SVMs: all gradient values of the variables that are inside the feasible region must be less than ε = 10⁻³. We use the name onlineLLW×∞ to indicate that the algorithm may take several epochs until this stopping condition is reached, and onlineLLW×2 to indicate that the online learning stops after two epochs.

4.2.6 Implementation details

For the implementation, we integrate our online LLW SVM algorithm into the Shark framework [19]. Shark is a modular C++ library for the design and optimization of machine learning algorithms such as SVMs and neural networks. Furthermore, our online solver employs the Shark SVM cache mechanism to store kernel values on demand, as in the batch SVM solver. The cache mechanism helps the online solver handle training datasets whose kernel matrix cannot fit into memory.


Algorithm 10 Online LLW SVM solver with adaptive scheduling.
Input: {(x_i, y_i)}_{i=1}^ℓ: dataset, O: array of online steps, multipliers: array of multiplier values
1: W ← ∅
2: β ← 0.05
3: ∀step ∈ O : stepGains[step] ← 1
4: repeat
5:   for k = 1, …, ℓ do
6:     step ← TWO_NEWS
7:     repeat
8:       for j = 1, …, multipliers[step] do
9:         (m, p) ← (0, 0), (n, q) ← (0, 0)
10:        Select two working variables: (m, p), (n, q) ← selectWorkingSet(step)
11:        Perform the two-variable optimization: solve2D((m, p), (n, q))
12:        Update the current support pattern set W according to the values of α_{m,p}, α_{n,q}
13:        Update the gradients ∀i ∈ W, ∀c ∈ Y:
14:          g_{i,c} ← g_{i,c} − (δ_{c,p} − 1/d) µ_{m,p} k(x_i, x_m) − (δ_{c,q} − 1/d) µ_{n,q} k(x_i, x_n)
15:        Update the gain estimate: stepGains[step] ← β · dual_increase/duration + (1 − β) · stepGains[step]
16:      end for
17:      Select a step with probability according to stepGains
18:    until step = TWO_NEWS
19:  end for
20: until the stopping condition is reached


4.3 Experiments

Our goal is an online LLW SVM solver that quickly approximates the solution of the batch LLW SVM solver while achieving comparable accuracy. Therefore, we performed experiments on the benchmark datasets established in [4]. They were used to establish the good performance of LaRank, so we also use them in our own studies. These benchmark datasets span an interesting range and cover several important settings: from mid-sized to larger-scale, and from few to many features. After reporting the optimal hyper-parameters found in model selection, we compare the performance of the online LLW SVM solver with its batch counterparts. We also investigate the impact of the kernel cache size on the performance of both the online and batch LLW solvers. Finally, we compare the efficiency of the three working set selection strategies.

4.3.1 Experiment setup

Experiments are carried out on four datasets, namely MNIST, INEX, LETTER and USPS. All four datasets are taken from [4]. Table 4.1 shows the properties of these datasets.

           Training size   Test size   Features   Classes
USPS       7291            2007        256        10
INEX       6053            6054        167295     18
LETTER     16000           4000        16         26
MNIST      60000           10000       780        10

Table 4.1: Datasets used in the experiments.

We run the experiments with a Gaussian kernel for USPS, LETTER and MNIST and with a linear kernel for INEX. This choice of kernels follows the original LaRank paper. During the experiments we compare the online LLW SVM solver with the batch LLW SVM solvers, so C and γ are kept the same for both solvers. We use the names batchLLW_shrink and batchLLW_unshrink to denote the batch LLW SVM solver of Shark with and without the shrinking technique, respectively. Furthermore, we found a non-optimal implementation of the cache mechanism and shrinking technique in Shark for the case that the kernel matrix can be fully stored in the kernel cache. Therefore, we use the name batchLLW_modified_shrink to denote the version of the batch LLW SVM solver of Shark with a modification that improves this non-optimal implementation. In addition, the kernel cache size is set to 3 GB for all experiments and the stopping condition is ε = 10⁻³.

4.3.2 Model selection

In order to find good parameters for online LLW SVM learning, grid search and Jaakkola's heuristic [20] were used. We performed the grid search with onlineLLW×∞ for all datasets except MNIST; MNIST was searched with onlineLLW×2 due to its intensive computation time with onlineLLW×∞, and onlineLLW×2 was normally enough to achieve stable accuracy. We performed five-fold cross-validation on each point of the grid, varying log₁₀(C) ∈ {−1, …, 3} and log₂(γ) ∈ {log₂(γ_Jaakkola) − 2, …, log₂(γ_Jaakkola) + 2} for the machines with Gaussian kernel, where γ_Jaakkola is estimated by Jaakkola's heuristic. With the linear kernel, the search grid was first created by C ∈ {10⁻¹, …, 10³} to find an initial value C₀; the search was then performed on the refined grid C₀ · 2^φ, φ ∈ {−2, −1, 0, 1, 2}. In the case of equal cross-validation errors, we preferred lower C and γ. If a chosen parameter hit the grid boundary, we shifted the grid such that the former boundary value would be inside the new grid, and the new row of the grid was computed and compared. We repeated the shifting until the values of the best parameters did not improve further. The best parameters found by grid search are reported in Table 4.2.

           Kernel k(x, x′)        C      γ
USPS       exp(−γ‖x − x′‖²)       100    0.0331574
INEX       xᵀ · x′                5000   -
LETTER     exp(−γ‖x − x′‖²)       1000   0.0512821
MNIST      exp(−γ‖x − x′‖²)       1000   0.0189274

Table 4.2: Best parameters for the LLW SVM found by grid search.

The model selection procedure here differs from that in [4], where the optimal hyper-parameters were chosen by manual tuning.
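For reference, the sketch below shows one common reading of Jaakkola's heuristic used to seed the γ grid above: σ is taken as the median distance from each training example to its nearest neighbour of a different class, and γ = 1/(2σ²). This is an assumed, typical formulation for illustration, not a quotation of [20]; it also assumes at least two classes are present.

```python
import numpy as np

def jaakkola_gamma(X, y):
    """Estimate gamma for a Gaussian kernel via Jaakkola's heuristic (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    nearest_other = []
    for i in range(len(X)):
        mask = y != y[i]                          # examples of a different class
        dists = np.linalg.norm(X[mask] - X[i], axis=1)
        nearest_other.append(dists.min())
    sigma = np.median(nearest_other)
    return 1.0 / (2.0 * sigma ** 2)
```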

4.3.3 Comparing online LLW with batch LLW

In this part, we compare onlineLLW×1 with batchLLW_shrink, batchLLW_unshrink and batchLLW_modified_shrink. The comparison is carried out on the test sets after training both the online and the batch LLW solvers on the training sets. We first evaluate the final performance of both solvers at their endpoints to compare the methods as a whole; then we look at the build-up curves and discuss the optimization behaviour in more detail.

As can be seen from Table 4.3, the test errors of onlineLLW×1 are comparable with those of batchLLW_shrink, batchLLW_unshrink and batchLLW_modified_shrink, while it consumes far less computation and training time on all datasets. In detail, onlineLLW×1 is equal to or better than batchLLW_shrink, batchLLW_unshrink and batchLLW_modified_shrink on three of the four datasets in terms of test error, although it uses dramatically less training time. The number of kernel evaluations of onlineLLW×1 is also significantly lower than that of its batch counterparts. These encouraging results are clearly reflected in Figures 4.1 and 4.2, where the evolution of the test errors as functions of time and of computed kernels is shown for the training sets. It is clear that a single pass over the training set is sufficient for online learning to nearly reach the performance of its batch counterparts with regard to test error.

A drawback of onlineLLW×1 is that it does not always approximate the dual values of the batch solvers well. This can be seen in Figure 4.3, where the differences between the dual values of onlineLLW×1 and its batch counterparts are significant on the INEX and LETTER datasets. Note that these two datasets have a significantly higher number of classes than the other two. Their dual values are also remarkably high compared with those on USPS and MNIST. The reason is the poor performance of the adaptive scheduler in these cases.

                                                     USPS      INEX        LETTER    MNIST
onlineLLW×1               Test error (%)             4.584     28.6257     2.25      1.49
                          Dual                       58136.7   74671100    2262460   701103
                          Training time (sec)        218.611   923.528     5349.79   26631.9
                          Kernel evaluations (×10⁶)  39.84     34.74       227.38    11416.63
batchLLW_shrink           Test error (%)             4.584     27.552      2.35      1.49
                          Dual                       58854.7   113078068   3527538   724837
                          Training time (sec)        946.16    18960.5     357575    322660
                          Kernel evaluations (×10⁶)  279.38    40.47       2658.46   167505.7
batchLLW_unshrink         Test error (%)             4.584     27.552      2.35      1.49
                          Dual                       58854.7   113078000   3527540   724837
                          Training time (sec)        608.896   7473.71     200402    359508
                          Kernel evaluations (×10⁶)  42.12     36.64       240.25    195585.42
batchLLW_modified_shrink  Test error (%)             4.584     27.552      2.35      1.49
                          Dual                       58854.7   113078068   3527538   724837
                          Training time (sec)        625.337   22724.5     494183    377070
                          Kernel evaluations (×10⁶)  38.70     36.64       247.14    154590.10

Table 4.3: Comparing online LLW with batch LLW solvers. Here, the S2DO_ALL working set selection is employed for online learning.

4.3.4 The impact of the cache size

Undoubtedly, kernel evaluations account for a large part of the computation time in SVMs. Therefore, the kernel cache heavily affects the training time and the number of kernel evaluations of SVM learning algorithms. Here, we investigate how the kernel cache size affects the performance of online and batch learning. We compare the curves of increasing dual values of both online and batch learning for the LLW SVM when the cache size is set to 20 and 120 percent of the kernel matrix size. The reason for selecting 120 percent of the kernel matrix size is to guarantee that all kernel values can be fully stored in the cache. Figures 4.4 and 4.5 show that if the kernel cache size is limited, in particular to 20% of the kernel matrix size, the dual value of onlineLLW×∞ increases significantly faster than that of its batch counterparts on all experimental datasets. At the same time or the same number of computed kernels, the marked points where the online solver stops after a single pass over the training set are higher than those of batchLLW_shrink, batchLLW_unshrink and batchLLW_modified_shrink. This can be seen more clearly in Table 4.4, which reports the speed-up factor of onlineLLW×1 over batchLLW_shrink, batchLLW_unshrink and batchLLW_modified_shrink at the points where the same dual value is reached (interpolated).

[Figure 4.1 (plot omitted): Test error as a function of time for the LLW SVM; panels USPS, INEX, LETTER and MNIST; curves for onlineLLW×1, onlineLLW×∞, batchLLW_shrink, batchLLW_unshrink and batchLLW_modified_shrink.]

[Figure 4.2 (plot omitted): Test error as a function of computed kernels for the LLW SVM; panels USPS, INEX, LETTER and MNIST; curves for onlineLLW×1, onlineLLW×∞, batchLLW_shrink, batchLLW_unshrink and batchLLW_modified_shrink.]


[Figure 4.3 (plot omitted): Dual value as a function of time for the LLW SVM; panels USPS, INEX, LETTER and MNIST; curves for onlineLLW×1, onlineLLW×∞, batchLLW_shrink, batchLLW_unshrink and batchLLW_modified_shrink.]

                           USPS   INEX   LETTER   MNIST
batchLLW_shrink            1.36   5.57   1.90     1.61
batchLLW_unshrink          1.69   5.44   1.88     2.31
batchLLW_modified_shrink   1.22   5.44   1.82     2.19

Table 4.4: Speed-up factor of onlineLLW×1 over the batch LLW solvers at the same dual value, when the kernel cache size is set to 20% of the kernel matrix size.

The reason is that the online optimization heuristic is a kind of bottom-up strategy that naturally fits the memory hierarchy design of computer architectures. Therefore, the online optimization heuristic exploits the cache mechanism better than the batch learning strategy when memory is limited. This demonstrates the applicability of the online heuristic for approximating the batch optimization problem on large-scale problems. Conversely, the batch learning heuristic can be considered a top-down strategy. As a result, batch learning has a more 'global' view, which can be advantageous if the kernel cache is larger. This explains why the dual values of batchLLW_shrink, batchLLW_unshrink and batchLLW_modified_shrink increase faster, and are sometimes even higher than the dual value of onlineLLW×1 at the same time or the same number of computed kernels, when the kernel cache size is set to 120% of the kernel matrix size.

4.3.5 Comparison of working set selection strategies for online LLW

An interesting question is how the working set selection strategies perform in the online LLW SVM. It is clear from Table 4.5 that there is not much difference in the performance of the three strategies. However, RANDOM_PATTERN_S2DO_PATTERN and RANDOM_PATTERN_S2DO_ALL often give slightly better test errors due to their randomization, but they tend to consume more time than the S2DO_ALL strategy.

                                                        USPS      INEX       LETTER    MNIST
S2DO_ALL                     Test error (%)             4.584     28.6257    2.25      1.49
                             Dual                       58136.7   74671100   2262460   701103
                             Training time (sec)        218.611   923.528    5349.79   26631.9
                             Kernel evaluations (×10⁶)  39.84     34.74      227.38    11416.63
RANDOM_PATTERN_S2DO_ALL      Test error (%)             4.5341    28.64      2.275     1.44
                             Dual                       58122.6   74946300   2265390   701059
                             Training time (sec)        220.266   925.101    5413.05   27589.7
                             Kernel evaluations (×10⁶)  39.88     34.86      227.54    11376.33
RANDOM_PATTERN_S2DO_PATTERN  Test error (%)             4.4843    28.5761    2.225     1.52
                             Dual                       58130.6   75572700   2263770   701042
                             Training time (sec)        220.317   914.479    5454.12   28262.6
                             Kernel evaluations (×10⁶)  39.69     34.91      227.36    11391.79

Table 4.5: Comparing working set strategies for the online LLW SVM.


[Figure 4.4 (plots omitted): The increasing dual curves with respect to time when the cache size is 20% (left column) and 120% (right column) of the kernel matrix size; panels USPS, INEX, LETTER and MNIST; curves for online learning (with the onlineLLW×1 endpoint marked) and batch learning with shrinking, without shrinking, and with modified shrinking.]

[Figure 4.5 (plots omitted): The dual curves with respect to kernel evaluations when the kernel cache size is 20% (left column) and 120% (right column) of the kernel matrix size; panels USPS, INEX, LETTER and MNIST; curves for online learning (with the onlineLLW×1 endpoint marked) and batch learning with shrinking, without shrinking, and with modified shrinking.]

Chapter 5

Unified framework for online multi-class SVMs

As discussed in chapter 2, there are various conceptually slightly different proposals for adapting SVMs to the multi-class case. This thesis presents three all-in-one approaches, namely CS, WW and LLW. They all fit into regularized risk minimization, and their differences come from different interpretations of the margin concept and different ways of formulating a loss function on the margins. A unified framework for these different concepts was recently established and is discussed in Section 2.2.7. In this chapter, we first review the dual problem of the unified framework and then apply the online optimization heuristic of Chapter 4 to the unified multi-class SVM. After that, we perform experiments with the CS SVM as one example of the versatility of the online solver for the unified framework for all-in-one multi-class SVMs.

5.1 The dual problem for the unified framework

The primal problem of the unified framework can be re-stated as:
\[
\min_{w, b, \xi} \quad \frac{1}{2} \sum_{c=1}^{d} \|w_c\|^2 + C \sum_{i=1}^{\ell} \sum_{r \in R_{y_i}} \xi_{i,r} \tag{5.1}
\]
\[
\text{subject to} \quad \sum_{c=1}^{d} \nu_{y_i,p,c} \big( \langle w_c, \phi(x_i) \rangle + b_c \big) \ge \gamma_{y_i,p} - \xi_{i,s_{y_i}(p)} \quad 1 \le i \le \ell,\ p \in P_{y_i}
\]
\[
\xi_{i,r} \ge 0 \quad 1 \le i \le \ell,\ r \in R_{y_i}
\]
\[
\sum_{c=1}^{d} (\langle w_c, \phi(x) \rangle + b_c) = 0 \ \Leftrightarrow\ \sum_{c=1}^{d} w_c = 0 \ \text{and}\ \sum_{c=1}^{d} b_c = 0 \tag{5.2}
\]


Normally, problem (5.1) is solved in its dual form. Transforming (5.1) using Lagrange multipliers as in chapter 4, we obtain the dual representation:
\[
\max_{\alpha} \quad \sum_{i=1}^{\ell} \sum_{p \in P_{y_i}} \gamma_{y_i,p}\, \alpha_{i,p}
- \frac{1}{2} \sum_{i=1}^{\ell} \sum_{p \in P_{y_i}} \sum_{j=1}^{\ell} \sum_{q \in P_{y_j}} M_{y_i,p,y_j,q} \cdot k(x_i, x_j) \cdot \alpha_{i,p} \cdot \alpha_{j,q}
\]
\[
\text{subject to} \quad \alpha_{i,p} \ge 0 \quad \forall i \in \{1, \ldots, \ell\},\ \forall p \in P_{y_i}
\]
\[
\sum_{p \in P^{r}_{y_i}} \alpha_{i,p} \le C \quad \forall i \in \{1, \ldots, \ell\},\ \forall r \in R_{y_i}
\]
\[
\sum_{i=1}^{\ell} \sum_{p \in P_{y_i}} \nu_{y_i,p,c}\, \alpha_{i,p} = 0 \quad \forall c \in \{1, \ldots, d\} \tag{5.3}
\]
Here, $M_{y_i,p,y_j,q}$ is defined as
\[
M_{y_i,p,y_j,q} = \sum_{m=1}^{d} \sum_{n=1}^{d} \Big( \delta_{m,n} - \frac{1}{d} \Big) \nu_{y_i,p,m}\, \nu_{y_j,q,n}
\]
and the sets $P^{r}_{y}$, $r \in R_y$, are defined as $P^{r}_{y} = s_y^{-1}(\{r\}) = \{ p \in P_y \mid s_y(p) = r \}$.
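Since M only depends on the margin coefficient vectors ν, it can be precomputed per pair of (class, constraint) indices. The short Python sketch below uses the identity Σ_{m,n}(δ_{m,n} − 1/d) ν_{p,m} ν_{q,n} = ⟨ν_p, ν_q⟩ − (1/d)(Σ_m ν_{p,m})(Σ_n ν_{q,n}), which follows directly from the definition above; the function name and list-based representation of ν are assumptions for illustration.

```python
def margin_coefficient(nu_p, nu_q, d):
    """M = sum_{m,n} (delta_{m,n} - 1/d) * nu_p[m] * nu_q[n]
         = <nu_p, nu_q> - (1/d) * sum(nu_p) * sum(nu_q)."""
    dot = sum(a * b for a, b in zip(nu_p, nu_q))
    return dot - (sum(nu_p) * sum(nu_q)) / d
```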

As discussed earlier, because of their minor importance with universal kernels and the difficulties they cause for the solver, the bias terms should be removed. Using machines without bias term results in dropping the equality constraints (5.3). As a result, the unified problem can be written as:
\[
\max_{\alpha} \quad D(\alpha) = \sum_{i=1}^{\ell} \sum_{p \in P_{y_i}} \gamma_{y_i,p}\, \alpha_{i,p}
- \frac{1}{2} \sum_{i=1}^{\ell} \sum_{p \in P_{y_i}} \sum_{j=1}^{\ell} \sum_{q \in P_{y_j}} M_{y_i,p,y_j,q} \cdot k(x_i, x_j) \cdot \alpha_{i,p} \cdot \alpha_{j,q} \tag{5.4}
\]
\[
\text{subject to} \quad \alpha_{i,p} \ge 0 \quad \forall i \in \{1, \ldots, \ell\},\ \forall p \in P_{y_i}
\]
\[
\sum_{p \in P^{r}_{y_i}} \alpha_{i,p} \le C \quad \forall i \in \{1, \ldots, \ell\},\ \forall r \in R_{y_i} \tag{5.5}
\]

5.2 Online learning for the unified framework

In this section, we derive a learning algorithm that uses the online optimization heuristic to approximate the solution of the unified problem (5.4). In detail, we adapt the online operations and the adaptive scheduling presented in chapter 4 to the unified problem. However, because of the more general formulation, the working set selection for the online operations and the procedure for solving the sub-problems change; both are presented in the following parts of this section.

5.2.1 Online step types and working set selection

It can be seen that the two working variables might come from different patterns due to the absence of equality constraints. Therefore, the online steps introduced in section 4.2.1 still apply.


For the working set selection strategies of the online operations, we still employ the strategies defined in section 4.2.2. However, there is a slight change in the way the two working variables are selected, due to the inequality constraints (5.5). This change is inherited from the modified S2DO used in batch learning for the unified problem, which is presented in [14]. In batch learning, the two working variables are selected as follows (see also the sketch after this list):

• If $\max\{|g_{m,c}|\} \ge \max\{g_{m,c} - g_{n,e}\}$, then the first working variable is selected as $(i, p) = \arg\max_{(m,c)} \{|g_{m,c}|\}$, and the second working variable is chosen as $\arg\max_{(n,e)} \text{computeGain}((i, p), (n, e))$. Here, the procedure computeGain is the same as in section 3.2.

• If $\max\{|g_{m,c}|\} < \max\{g_{m,c} - g_{n,e}\}$, then the two working variables are selected as $((i, p), (j, q)) = \arg\max_{(m,c),(n,e)} \{ g_{m,c} - g_{n,e} \}$.
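A compact Python sketch of this two-branch selection rule is given below; `grads` is assumed to be a dictionary mapping candidate variables (i, p) to their gradient values, and `compute_gain` stands for the gain routine of section 3.2. Both names are assumptions for illustration.

```python
def select_working_pair(grads, compute_gain):
    """Choose two working variables following the modified S2DO rule:
    compare the largest |gradient| with the largest violating-pair score
    g_{m,c} - g_{n,e} and pick the corresponding pair."""
    keys = list(grads)
    best_abs = max(keys, key=lambda v: abs(grads[v]))
    best_hi = max(keys, key=lambda v: grads[v])      # largest gradient
    best_lo = min(keys, key=lambda v: grads[v])      # smallest gradient
    if abs(grads[best_abs]) >= grads[best_hi] - grads[best_lo]:
        first = best_abs
        second = max((v for v in keys if v != first),
                     key=lambda v: compute_gain(first, v))
        return first, second
    return best_hi, best_lo
```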

As a result, let W be the set of current support patterns and V the set of possible variables. The working set selection for the online operations is presented in Algorithms 11, 12 and 13.

Algorithm 11 Working set selection for TWO_NEWS for the unified framework.
Input: selectionMethod: working set strategy
Output: (m, p), (n, q)
1: Retrieve the new pattern x_i
2: if |P^r_y| = 1 then
3:   m ← i, p ← arg max_{c∈Y} g_{m,c}
4:   n ← i, q ← arg max_{c∈Y} computeGain((m, p), (n, c))
5: else
6:   if max_{c∈Y} {g_{i,c}} ≥ max_{c,e∈Y} {g_{i,c} − g_{i,e}} then
7:     (m, p) ← arg max_{(i,c)} g_{i,c}
8:     (n, q) ← arg max_{(i,e)} computeGain((m, p), (i, e))
9:   else
10:    ((m, p), (n, q)) ← arg max_{(i,c),(i,e)∈V} {g_{i,c} − g_{i,e}}
11:  end if
12: end if

5.2.2 Parameter update

When the working set has been selected, we want to solve problem (5.4) with respect to the variables in the working set. Denote the working set of size two as $B = \{(i, p), (j, q)\}$ and the quadratic matrix restricted to the working set as
\[
Q_B = \begin{pmatrix} Q_{(i,p),(i,p)} & Q_{(i,p),(j,q)} \\ Q_{(j,q),(i,p)} & Q_{(j,q),(j,q)} \end{pmatrix}
\]
Here, $Q_{(i,p),(j,q)} = M_{y_i,p,y_j,q} \cdot k(x_i, x_j)$. Also, let the gradient be $g_B = (g_{(i,p)}, g_{(j,q)})$ where $g_{(i,p)} = \partial D(\alpha)/\partial \alpha_{i,p}$. The sub-problem is to find the update $\mu_B = (\mu_{i,p}, \mu_{j,q})$ that maximizes the gain.


Algorithm 12 Working set selection for TWO_OLD for the unified framework.
Input: step: online step, selectionMethod: working set strategy
Output: (m, p), (n, q)
1: Retrieve the possible variables: V = {(m, p) | m ∈ W, p ∈ Y}
2: if selectionMethod = S2DO_ALL then
3:   if |P^r_y| = 1 then
4:     (m, p) ← arg max_{(i,c)∈V} g_{i,c}
5:     (n, q) ← arg max_{(j,e)∈V} computeGain((m, p), (j, e))
6:   else
7:     if max_{(i,c)∈V} {g_{i,c}} ≥ max_{(i,c),(j,e)∈V} {g_{i,c} − g_{j,e}} then
8:       (m, p) ← arg max_{(i,c)∈V} g_{i,c}
9:       (n, q) ← arg max_{(j,e)∈V} computeGain((m, p), (j, e))
10:    else
11:      ((m, p), (n, q)) ← arg max_{(i,c),(j,e)∈V} {g_{i,c} − g_{j,e}}
12:    end if
13:  end if
14: else if selectionMethod = RANDOM_PATTERN_S2DO_ALL then
15:  Randomly select a support pattern i from W
16:  if |P^r_y| = 1 then
17:    m ← i, p ← arg max_{c∈Y} g_{m,c}
18:    (n, q) ← arg max_{(n,e)∈V} computeGain((m, p), (n, e))
19:  else
20:    if max_{c∈Y} {g_{i,c}} ≥ max_{c,e∈Y} {g_{i,c} − g_{i,e}} then
21:      m ← i, p ← arg max_{c∈Y} g_{i,c}
22:      (n, q) ← arg max_{(j,e)∈V} computeGain((m, p), (j, e))
23:    else
24:      ((m, p), (n, q)) ← arg max_{(i,c),(i,e)∈V} {g_{i,c} − g_{i,e}}
25:    end if
26:  end if
27: else if selectionMethod = RANDOM_PATTERN_S2DO_PATTERN then
28:  Randomly select a pattern i from W
29:  if |P^r_y| = 1 then
30:    m ← i, p ← arg max_{c∈Y} g_{m,c}
31:    n ← i, q ← arg max_{c∈Y} computeGain((m, p), (n, c))
32:  else
33:    if max_{c∈Y} {g_{i,c}} ≥ max_{c,e∈Y} {g_{i,c} − g_{i,e}} then
34:      (m, p) ← arg max_{(i,c)} g_{i,c}
35:      (n, q) ← arg max_{(i,e)} computeGain((m, p), (i, e))
36:    else
37:      ((m, p), (n, q)) ← arg max_{(i,c),(i,e)∈V} {g_{i,c} − g_{i,e}}
38:    end if
39:  end if
40: end if


Algorithm 13 Working set selection for TWO_OLD_SUPPORTS for the unified framework.
Input: step: online step, selectionMethod: working set strategy
Output: (m, p), (n, q)
1: Retrieve the possible variables: V = {(m, p) | m ∈ W, p ∈ Y, α_{m,p} ≠ 0}
2: if selectionMethod = S2DO_ALL then
3:   if |P^r_y| = 1 then
4:     (m, p) ← arg max_{(i,c)∈V} g_{i,c}
5:     (n, q) ← arg max_{(j,e)∈V} computeGain((m, p), (j, e))
6:   else
7:     if max_{(i,c)∈V} {g_{i,c}} ≥ max_{(i,c),(j,e)∈V} {g_{i,c} − g_{j,e}} then
8:       (m, p) ← arg max_{(i,c)∈V} g_{i,c}
9:       (n, q) ← arg max_{(j,e)∈V} computeGain((m, p), (j, e))
10:    else
11:      ((m, p), (n, q)) ← arg max_{(i,c),(j,e)∈V} {g_{i,c} − g_{j,e}}
12:    end if
13:  end if
14: else if selectionMethod = RANDOM_PATTERN_S2DO_ALL then
15:  Randomly select a support pattern i from W
16:  if |P^r_y| = 1 then
17:    m ← i, p ← arg max_{c∈Y} {g_{m,c} | (m, c) ∈ V}
18:    (n, q) ← arg max_{(n,e)∈V} computeGain((m, p), (n, e))
19:  else
20:    if max_{c∈Y} {g_{i,c} | (i, c) ∈ V} ≥ max_{c,e∈Y} {g_{i,c} − g_{i,e} | (i, c), (i, e) ∈ V} then
21:      m ← i, p ← arg max_{c∈Y} {g_{m,c} | (m, c) ∈ V}
22:      (n, q) ← arg max_{(n,e)∈V} computeGain((m, p), (n, e))
23:    else
24:      ((m, p), (n, q)) ← arg max_{(i,c),(i,e)∈V} {g_{i,c} − g_{i,e}}
25:    end if
26:  end if
27: else if selectionMethod = RANDOM_PATTERN_S2DO_PATTERN then
28:  Randomly select a pattern i from W
29:  if |P^r_y| = 1 then
30:    m ← i, p ← arg max_{c∈Y} {g_{m,c} | (m, c) ∈ V}
31:    n ← i, q ← arg max_{c∈Y} {computeGain((m, p), (n, c)) | (n, c) ∈ V}
32:  else
33:    if max_{c∈Y} {g_{i,c} | (i, c) ∈ V} ≥ max_{c,e∈Y} {g_{i,c} − g_{i,e} | (i, c), (i, e) ∈ V} then
34:      (m, p) ← arg max_{(i,c)∈V} g_{i,c}
35:      (n, q) ← arg max_{(i,e)∈V} computeGain((m, p), (i, e))
36:    else
37:      ((m, p), (n, q)) ← arg max_{(i,c),(i,e)∈V} {g_{i,c} − g_{i,e}}
38:    end if
39:  end if
40: end if


The gain to maximize is
\[
\max_{\mu_B} \ D(\alpha_B + \mu_B) - D(\alpha_B) = g_B^T \mu_B - \frac{1}{2} \mu_B^T Q_B \mu_B ,
\]
where $\alpha_B = (\alpha_{i,p}, \alpha_{j,q})$. There are two cases to be considered.

1. If $i \neq j$ or $s_{y_i}(p) \neq s_{y_j}(q)$, the sub-problem becomes:
\[
\max_{\mu_B} \quad D(\alpha_B + \mu_B) - D(\alpha) = g_B^T \mu_B - \frac{1}{2} \mu_B^T Q_B \mu_B
\]
\[
\text{subject to} \quad 0 \le \alpha_{i,p} + \mu_{i,p} \le U_1, \qquad 0 \le \alpha_{j,q} + \mu_{j,q} \le U_2 .
\]
Here, $U_1 = C - \sum_{e \in P^{r_1}_{y_i},\, e \neq p} \alpha_{i,e}$ with $r_1 = s_{y_i}(p)$, and $U_2 = C - \sum_{e \in P^{r_2}_{y_j},\, e \neq q} \alpha_{j,e}$ with $r_2 = s_{y_j}(q)$. As can be seen, this case is solved by the technique of section 3.3.

2. If $i = j$ and $s_{y_i}(p) = s_{y_j}(q)$, the sub-problem becomes:
\[
\max_{\mu_B} \quad D(\alpha_B + \mu_B) - D(\alpha) = g_B^T \mu_B - \frac{1}{2} \mu_B^T Q_B \mu_B \tag{5.6}
\]
\[
\text{subject to} \quad \alpha_{i,p} + \mu_{i,p} \ge 0 \tag{5.7}
\]
\[
\alpha_{j,q} + \mu_{j,q} \ge 0 \tag{5.8}
\]
\[
\alpha_{i,p} + \mu_{i,p} + \alpha_{j,q} + \mu_{j,q} \le U \tag{5.9}
\]
Here, $U = C - \sum_{e \in P^{r}_{y_i},\, e \neq p,\, e \neq q} \alpha_{i,e}$ with $r = s_{y_i}(p) = s_{y_j}(q)$. In this case, three smaller cases have to be considered:

• If $Q_B = 0$, the sub-problem becomes a linear optimization problem, whose optimum is attained at a point where two of the three inequality constraints are active. Therefore, by examining the feasible solutions at the points where two of the three inequality constraints are active, we select the one that maximizes the gain.

• If $\det Q_B = 0$ and $Q_B \neq 0$, the unconstrained optima, if any, lie on the line $g_B = Q_B \mu_B$. The constrained optimum can therefore be found where one of the three inequality constraints becomes an equality constraint. As a result, we can reduce the two-dimensional sub-problem to several single-variable problems; solving each single-variable problem gives one feasible candidate, and we select the one that maximizes the gain.

• If $\det Q_B > 0$, the unconstrained optimum is $\hat{\mu}_B = Q_B^{-1} g_B$. If this unconstrained optimum satisfies all three constraints, it is the optimal solution. Otherwise, the optimal solution again lies where one of the three inequality constraints becomes an equality constraint, as in the case above.

It can be seen that in all cases we can reduce problem (5.6) to single-variable optimization problems. The single-variable optimization can be stated in the general form:
\[
\text{maximize} \quad D(\mu) = a\mu - \frac{b}{2}\mu^2 \qquad \text{subject to} \quad d \le \mu \le c .
\]
There are several cases to be considered:

• If $b = 0$, we have two small cases: if $a > 0$ then $\mu = c$, else $\mu = d$.

• If $b \neq 0$, the unconstrained optimum is $\hat{\mu} = \frac{a}{b}$. If $d \le \hat{\mu} \le c$, then $\mu = \hat{\mu}$. Otherwise, $\mu = d$ if $\hat{\mu} \le d$, or $\mu = c$ if $\hat{\mu} \ge c$.

As a result, the procedure for solving problem (5.6) when $i = j$ and $s_{y_i}(p) = s_{y_j}(q)$ is presented in Algorithm 14.

Algorithm 14 Solving the two-dimensional sub-problem when i = j and s_{y_i}(p) = s_{y_j}(q).
Input: B = {(i, p), (j, q)}, g_{i,p}, g_{j,q}, Q_{(i,p),(i,p)}, Q_{(i,p),(j,q)}, Q_{(j,q),(j,q)}, α_{i,p}, α_{j,q}, U
Output: µ_{i,p}, µ_{j,q}
1: Let Q_B be the matrix [Q_{(i,p),(i,p)} Q_{(i,p),(j,q)}; Q_{(i,p),(j,q)} Q_{(j,q),(j,q)}]
2: µ_{i,p} ← (1/det Q_B) (Q_{(j,q),(j,q)} g_{i,p} − Q_{(i,p),(j,q)} g_{j,q})
3: µ_{j,q} ← (1/det Q_B) (Q_{(i,p),(i,p)} g_{j,q} − Q_{(i,p),(j,q)} g_{i,p})
4: if (µ_{i,p} + α_{i,p} ≥ 0) and (µ_{j,q} + α_{j,q} ≥ 0) and (µ_{i,p} + µ_{j,q} ≤ U − α_{i,p} − α_{j,q}) then
5:   keep (µ_{i,p}, µ_{j,q})
6: else
7:   µ¹_{i,p} ← −α_{i,p}
8:   µ¹_{j,q} ← solveSingleVariable(µ_{j,q}) subject to 0 ≤ µ_{j,q} + α_{j,q} ≤ U
9:   µ²_{j,q} ← −α_{j,q}
10:  µ²_{i,p} ← solveSingleVariable(µ_{i,p}) subject to 0 ≤ µ_{i,p} + α_{i,p} ≤ U
11:  µ³_{i,p} ← solveSingleVariable(µ_{i,p}) subject to 0 ≤ µ_{i,p} + α_{i,p} ≤ U − α_{j,q}
12:  µ³_{j,q} ← U − α_{i,p} − µ³_{i,p} − α_{j,q}
13:  (µ_{i,p}, µ_{j,q}) ← the candidate (µᵛ_{i,p}, µᵛ_{j,q}), v = 1, …, 3, that maximizes the gain gᵀ_B µᵛ_B − ½ (µᵛ_B)ᵀ Q_B µᵛ_B
14: end if

In addition, the procedure for solving the single-variable optimization problem is presented in Algorithm 15.


Algorithm 15 Solving the single-variable optimization problem.
Input: a, b, c, d
Output: µ
1: if b = 0 then
2:   if a > 0 then µ ← c
3:   else µ ← d
4:   end if
5: else
6:   Let µ̂ ← a/b
7:   if d ≤ µ̂ ≤ c then µ ← µ̂
8:   else
9:     if µ̂ ≤ d then µ ← d
10:    else µ ← c
11:   end if
12:  end if
13: end if
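A direct Python transcription of Algorithm 15 is given below; it maximizes a·µ − (b/2)·µ² over the interval [d, c] (with b ≥ 0) and is the building block assumed by the sketch of Algorithm 14 above.

```python
def solve_single_variable(a, b, d, c):
    """Maximize a*mu - (b/2)*mu**2 subject to d <= mu <= c (Algorithm 15)."""
    if b == 0:
        # linear objective: pick the boundary favoured by the sign of a
        return c if a > 0 else d
    mu_hat = a / b                  # unconstrained maximizer of the concave parabola
    if d <= mu_hat <= c:
        return mu_hat
    return d if mu_hat <= d else c  # clip to the nearer boundary
```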

5.3 Experiments

In this section, we perform experiments with the CS SVM as an example of the unified framework. The purpose of these experiments is to investigate the performance of online learning for the CS SVM as an approximative solver for the batch CS SVM. As in chapter 4, the experiments are carried out on four datasets, namely MNIST, INEX, LETTER and USPS, since they were established as a benchmark suite in the LaRank paper [4]. After showing the optimal hyper-parameters selected for each dataset, we report the performance of online learning through a comparison with batch learning for the CS SVM. We also investigate the effect of the working set selection strategies for the online steps on the performance of online learning for the CS SVM.

To distinguish the solvers easily, we use the name onlineCS×1 to indicate that online learning for the CS SVM stops after a single pass over the dataset, and onlineCS×∞ to indicate that online learning takes more epochs until the stopping conditions are reached. We use batchCS_shrink and batchCS_unshrink for the batch CS SVM solvers of Shark with and without the shrinking technique, respectively. In addition, we use batchCS_modified_shrink for the version of the CS SVM solver in Shark in which the cache mechanism and shrinking technique are modified as discussed in section 4.3.1.

It is clear that C and γ are crucial hyper-parameters that affect the performance of SVMs. Therefore, we employ the same strategy and heuristic as in section 4.3.2 to find the optimal hyper-parameters for each dataset. We also use the grid search with the shifting heuristic of section 4.3.2, shifting the boundary of the grid as necessary to make sure the optimal hyper-parameters always lie inside the grid. The optimal hyper-parameters are reported in Table 5.1.


Dataset   Kernel k(x, x')         C     γ
USPS      exp(−γ ‖x − x'‖²)       10    0.0331574
INEX      x^T x'                  500   –
LETTER    exp(−γ ‖x − x'‖²)       10    0.0512821
MNIST     exp(−γ ‖x − x'‖²)       100   0.0378548

Table 5.1: Best parameters for CS found in grid search.
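
To illustrate the grid search with the shifting heuristic described above, the following Python sketch searches a logarithmic grid over (C, γ) and shifts the grid whenever the best point lies on its boundary. The cross-validation routine evaluate, the step size and the grid ranges are placeholders chosen for illustration; the sketch does not reproduce the exact procedure of section 4.3.2.

import itertools

def grid_search_with_shift(evaluate, log2C_grid, log2gamma_grid, step=2, max_shifts=5):
    # evaluate(C, gamma) returns a validation error (lower is better).
    log2C = list(log2C_grid)
    log2g = list(log2gamma_grid)
    for _ in range(max_shifts):
        scores = {(c, g): evaluate(2.0 ** c, 2.0 ** g)
                  for c, g in itertools.product(log2C, log2g)}
        best_c, best_g = min(scores, key=scores.get)
        shifted = False
        # Shift an axis whenever its optimum sits on the boundary of the current grid.
        if best_c in (log2C[0], log2C[-1]):
            offset = -step if best_c == log2C[0] else step
            log2C = [v + offset for v in log2C]
            shifted = True
        if best_g in (log2g[0], log2g[-1]):
            offset = -step if best_g == log2g[0] else step
            log2g = [v + offset for v in log2g]
            shifted = True
        if not shifted:
            break
    return 2.0 ** best_c, 2.0 ** best_g

For example, evaluate could train a CS SVM on the training split with the given (C, γ) and return the cross-validation error, starting from log2C_grid = range(-5, 17, 2) and log2gamma_grid = range(-15, 5, 2); these ranges are illustrative only.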

5.3.1 Comparing online CS with batch CS solvers

In this section, we compare the performance of onlineCS×1 with the batch CS solvers batchCS_shrink, batchCS_unshrink and batchCS_modified_shrink. For a fair comparison, the kernel cache is set to 3 GB, and the RANDOM_PATTERN_S2DO_ALL working set selection strategy is used for online learning of the CS SVM.

As can be seen from Table 5.2, onlineCS×1 outperforms all batch CS solvers in terms of training time on three of the four benchmark datasets. The onlineCS×1 solver achieves nearly the same dual values as the batch CS solvers, and its test errors are comparable to those of the batch solvers. These results are also reflected in Figures 5.1 and 5.2, where the evolution of the test error is plotted as a function of time and as a function of the number of computed kernel evaluations, respectively.

                                USSP-placeholder
                                USPS       INEX       LETTER     MNIST

onlineCS×1
  Test error (%)                4.6836     26.3297    1.95       1.46
  Dual                          1734.71    998143     9235.1     9972.64
  Training time (sec)           56.5084    348.729    209.722    3301.65
  Kernel evaluations (×10^6)    27.95      33.09      146.87     1217.92

batchCS_shrink
  Test error (%)                4.6338     26.3297    2.0        1.45
  Dual                          1738.95    987710     9326.2     10001.3
  Training time (sec)           161.013    894.974    818.044    11249.6
  Kernel evaluations (×10^6)    90.75      85.51      1049.15    5512.04

batchCS_unshrink
  Test error (%)                4.6338     26.115     2.0        1.45
  Dual                          1738.96    1010980    9326.37    10001.3
  Training time (sec)           38.9461    378.035    429.231    7689
  Kernel evaluations (×10^6)    15.61      33.40      112.43     4288.74

batchCS_modified_shrink
  Test error (%)                4.6338     26.3297    2.0        1.45
  Dual                          1738.95    987710     9326.2     10001.3
  Training time (sec)           35.1277    415.184    319.4      8633.52
  Kernel evaluations (×10^6)    15.47      33.43      109.24     4676.34

Table 5.2: Comparing the online CS solver with the batch CS solvers.


[Figure 5.1: Test error as a function of time for the CS SVM. Four panels (USPS, INEX, LETTER, MNIST) plot the test error (%) against the training time (s) for onlineCS×1, onlineCS×∞, batchCS_shrink, batchCS_unshrink and batchCS_modified_shrink. Lines denote the evolution of the online solvers.]

[Figure 5.2: Test error as a function of computed kernel evaluations for the CS SVM. Four panels (USPS, INEX, LETTER, MNIST) plot the test error (%) against the number of computed kernels for onlineCS×1, onlineCS×∞, batchCS_shrink, batchCS_unshrink and batchCS_modified_shrink. Lines denote the evolution of the online solvers.]


[Figure 5.3: Dual value as a function of time for the CS SVM. Four panels (USPS, INEX, LETTER, MNIST) plot the dual objective against the training time (s) for onlineCS×1, onlineCS×∞, batchCS_shrink, batchCS_unshrink and batchCS_modified_shrink. Lines denote the evolution of the online solvers.]

5.3.2 Comparison of working set selection strategies

As can be seen from Table 5.3, S2DO_ALL achieves the highest dual values and, on two of the four datasets, the lowest test errors among the working set selection strategies, but RANDOM_PATTERN_S2DO_ALL provides the best trade-off between optimization time and reaching a sufficiently high dual value. This can be explained by the structure of the CS SVM dual problem. Some of the inequality constraints (5.5) become active, that is, they turn into equality constraints, which may restrict the two working variables to the same support pattern; others remain inactive, so that the two working variables may come from different support patterns. On the one hand, RANDOM_PATTERN_S2DO_ALL therefore saves time by restricting one of the two working variables to the variables of a randomly drawn support pattern. On the other hand, it still has a high chance of finding a pair of working variables with a fairly good gain, because the gain is maximized over the set of current support patterns. RANDOM_PATTERN_S2DO_PATTERN appears to be the fastest of the three working set selection strategies.


                                USPS       INEX       LETTER     MNIST

S2DO_ALL
  Test error (%)                4.6836     26.1315    1.925      1.43
  Dual                          1736.58    1007070    9264.13    9974.94
  Training time (sec)           78.917     402.396    425.409    3462.45
  Kernel evaluations (×10^6)    26.62      32.49      140.26     1205.31

RANDOM_PATTERN_S2DO_ALL
  Test error (%)                4.6836     26.3297    1.95       1.46
  Dual                          1734.71    998143     9235.1     9972.64
  Training time (sec)           56.5084    348.729    209.722    3301.65
  Kernel evaluations (×10^6)    27.95      33.09      146.87     1217.92

RANDOM_PATTERN_S2DO_PATTERN
  Test error (%)                4.3847     26.9409    2.175      1.41
  Dual                          1712.44    976498     9040.81    9951.65
  Training time (sec)           56.7444    340.407    144.106    3087.46
  Kernel evaluations (×10^6)    28.71      32.72      151.62     1225.84

Table 5.3: Comparing working set selection strategies for the online CS SVM.
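
To make the difference between the strategies concrete, the following Python sketch shows one plausible reading of RANDOM_PATTERN_S2DO_ALL: the first working variable is restricted to a uniformly drawn support pattern, and the second is chosen by maximizing a simple second-order gain estimate over all variables of the current support patterns. The data structures, the choice of the first variable by largest absolute gradient and the single-variable gain proxy g²/(2Q) are simplifications made for illustration; they do not mirror the Shark classes or the exact selection rule.

import random

def select_random_pattern_s2do_all(support_patterns, gradient, diagQ):
    # support_patterns: dict mapping a pattern index to the list of its active class indices
    # gradient[(i, c)]: dual gradient of variable (i, c)
    # diagQ[(i, c)]:    diagonal entry Q_{(i,c),(i,c)}
    # First variable: restricted to a randomly drawn support pattern (RANDOM_PATTERN part).
    i = random.choice(list(support_patterns))
    p = max(support_patterns[i], key=lambda c: abs(gradient[(i, c)]))
    first = (i, p)

    # Second variable: gain maximization over all variables of all current
    # support patterns (S2DO_ALL part), using g^2 / (2 Q) as a cheap gain proxy.
    def gain(var):
        g, q = gradient[var], diagQ[var]
        return 0.5 * g * g / q if q > 0.0 else abs(g)

    candidates = [(j, c) for j, classes in support_patterns.items()
                  for c in classes if (j, c) != first]
    second = max(candidates, key=gain)
    return first, second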


Chapter 6

Conclusion and Future Work

In this thesis, online learning algorithms were derived for the approximative training of all-in-one multi-class Support Vector Machines (SVMs). In particular, an online solver was developed for the machine proposed by Lee, Lin and Wahba (LLW), which possesses the desirable theoretical property of Fisher consistency. This online SVM algorithm was then extended to a general, unified formulation of all-in-one multi-class SVMs, which also includes the popular variants by Crammer & Singer as well as Weston & Watkins. For the experimental evaluation, our online framework was tested on an established benchmark suite, comparing the online solvers and the standard batch training algorithms with respect to run-time, dual value, test error, and number of kernel evaluations after training. The results show that our online solvers efficiently approximate the solutions reached by the batch solvers for multi-class SVMs. In particular, the online solvers achieve test errors that are competitive with their counterpart batch solvers while requiring less time, especially in the case of the LLW machine.

For the LLW SVM, we also compared the dual build-up curves of online and batch learning for different kernel cache sizes. In general, both types of curves follow a similar course. However, in the case of a limited cache size, the online LLW solver maintains higher dual values than the batch LLW solvers throughout its entire learning process. This implies that using a less strict stopping criterion for the batch LLW solvers may have a similar effect as using an online LLW solver, but two drawbacks remain: first, a suitable value for the stopping threshold has to be chosen, and second, the online algorithm retains a time advantage in the case of a limited cache size. For example, for a cache size of 20% of the overall kernel Gram matrix and after one epoch of online learning, the time advantage of the online LLW solver over its counterpart batch solvers reaching the same dual value was between a factor of 1.2 and a factor of 5.6 on the four datasets examined.

One direction for future work is to limit the number of support vectors during training, as proposed in [12]. The time complexity of several operations in online SVM training is linear in the number of support patterns, and the memory requirements also scale with the number of support vectors. Therefore, such a limit on the support vectors can further speed up the training process, albeit at the cost of a second approximation in the training algorithm.


Bibliography

[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 1950.

[2] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

[3] A. Bordes. New Algorithms for Large-Scale Support Vector Machines. PhD thesis, Pierre and Marie Curie University, 2010.

[4] A. Bordes, L. Bottou, P. Gallinari, and J. Weston. Solving multiclass Support Vector Machines with LaRank. In Proceedings of the 24th International Conference on Machine Learning, pages 89–96. ACM, 2007.

[5] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. The Journal of Machine Learning Research, 6:1579–1619, 2005.

[6] A. Bordes, N. Usunier, and L. Bottou. Sequence labelling SVMs trained in one pass. In W. Daelemans, B. Goethals, and K. Morik, editors, Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 146–161. Springer, 2008.

[7] L. Bottou and Y. LeCun. Large scale online learning. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems. MIT Press, 2003.

[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[9] V. R. Carvalho and W. W. Cohen. Single-pass online learning: performance, voting schemes and online feature selection. In T. Eliassi-Rad, L. H. Ungar, M. Craven, and D. Gunopulos, editors, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery, pages 548–553. ACM, 2006.

[10] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

[11] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, U.K., 2000.

[12] O. Dekel and Y. Singer. Support Vector Machines on a Budget. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, pages 345–352. MIT Press, 2006.

[13] Ü. Dogan, T. Glasmachers, and C. Igel. Fast training of multi-class Support Vector Machines. Technical report, Department of Computer Science, University of Copenhagen, 2011.

[14] Ü. Dogan, T. Glasmachers, and C. Igel. A unified view on multi-class support vector classification. In preparation, 2012.

[15] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and Support Vector Machines. Advances in Computational Mathematics, 13(1):1–50, 2000.

[16] R. E. Fan, P. Chen, and C. J. Lin. Working set selection using second order information for training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918, 2005.

[17] T. Glasmachers and C. Igel. Maximum-gain working set selection for SVMs. Journal of Machine Learning Research, 7:1437–1466, 2006.

[18] T. Glasmachers and C. Igel. Second-order SMO improves SVM online and active learning. Neural Computation, 20(2):374–382, 2008.

[19] C. Igel, V. Heidrich-Meisner, and T. Glasmachers. Shark. Journal of Machine Learning Research, 9:993–996, 2008.

[20] T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect remote protein homologies. In T. Lengauer, R. Schneider, P. Bork, D. Brutlag, J. Glasgow, H.-W. Mewes, and R. Zimmer, editors, Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 149–158. AAAI, 1999.

[21] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82–95, 1971.

[22] Y. Lee, Y. Lin, and G. Wahba. Multicategory Support Vector Machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, pages 67–81, 2004.

[23] Y. Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.

[24] Y. Liu. Fisher consistency of multicategory Support Vector Machines. In Proceedings of the Eleventh International Workshop on Artificial Intelligence and Statistics, pages 289–296, 2007.

[25] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, London, 1909.

[26] J. Platt. Fast training of Support Vector Machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 185–208, Cambridge, MA, 1999. MIT Press.

[27] F. Rosenblatt. Principles of Neurodynamics: Perceptron and Theory of Brain Mechanisms. Spartan Books, 1962.

[28] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, COLT '01, pages 416–426, London, UK, 2001. Springer-Verlag.

[29] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Z. Ghahramani, editor, Proceedings of the 24th International Conference on Machine Learning, volume 227 of ACM International Conference Proceeding Series, pages 807–814. ACM, 2007.

[30] B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.

[31] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. The Journal of Machine Learning Research, 8:1007–1025, 2007.

[32] M. Tuma and C. Igel. Improved working set selection for LaRank. In Computer Analysis of Images and Patterns, volume 6854 of Lecture Notes in Computer Science, pages 327–334. Springer, 2011.

[33] U. von Luxburg and B. Schölkopf. Statistical learning theory: Models, concepts, and results. In D. M. Gabbay and J. Woods, editors, Inductive Logic, volume 10 of Handbook of the History of Logic, pages 651–706. North-Holland, 2011.

[34] J. Weston and C. Watkins. Support Vector Machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks, pages 219–224, 1999.

[35] T. Zhang. Statistical analysis of some multi-category large margin classification methods. The Journal of Machine Learning Research, 5:1225–1251, 2004.

[36] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56–85, 2004.
