
Lecture Notes

Statistical and Machine Learning

Classical Methods ⇒ Kernelizing ⇐ Bayesian

↘ ⇓ ↙

Statistical Learning Theory

↗ ⇑ ↖

Information Theory    SVM    Neural Networks

Su-Yun Huang∗1, Kuang-Yao Lee1 and Horng-Shing Lu2

1 Institute of Statistical Science, Academia Sinica
2 Institute of Statistics, National Chiao-Tung University

contact: ∗ [email protected]
∗ http://www.stat.sinica.edu.tw/syhuang
course web pages: ∗ http://140.109.74.188

TAs: Pei-Chun Chen and Kuang-Yao Lee

COURSE INFORMATION & PREREQUISITES

• Prerequisites: calculus, linear algebra, elementary probability and statistics.

• Course plan: See Table of Contents (tentative). We will emphasize knowing “why” and the statistical aspects, rather than algorithms and programming. Still, you have to know “how”, either by writing your own implementation or by modifying others' code.

• Grading policy:

homework 30%; score for late homework = (full points) × 0.8^d, where d is the number of days late.

midterm 30% (April 27, 2007).

oral presentation 20% on assigned tasks (and your final project as well).

final project 20%: choose your own data analysis problem, write up a short report, and prepare an oral presentation. The report should include

– problem and data description,

– methodology (related to this course),

– data analysis, and

– conclusion.

The report can be in Chinese or in English.

• Web resources:

– UCI Repository for machine learning is probably the most popular repository. http://www.ics.uci.edu/∼mlearn/MLRepository.html

– UCI KDD Archive. It is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas. http://kdd.ics.uci.edu

– CiteSeer is a public search engine and digital library for scientific and academic papers in the fields of computer and information sciences. http://citeseer.ist.psu.edu

– Statlib for statistical software and data sets. http://lib.stat.cmu.edu

– SSVM toolbox. http://dmlab1.csie.ntust.edu.tw/downloads

– Kernel Statistics toolbox. http://www.stat.sinica.edu.tw/syhuang or http://140.109.74.188/kern stat toolbox/kernstat.html


– LIBSVM. http://www.csie.ntu.edu.tw/∼cjlin/libsvm/

– Wikipedia, a free online encyclopedia. http://en.wikipedia.org/wiki/Main Page

Contents

1 Introduction
  1.1 Aim and scope
  1.2 Notation and abbreviation

2 Support Vector Machines
  2.1 Linear support vector machines
  2.2 Kernel trick for nonlinear SVM extension
  2.3 Variants of SVM algorithms
    2.3.1 One-norm soft margin SVM
    2.3.2 Two-norm soft margin SVM
    2.3.3 Smooth SVM (SSVM)
  2.4 Multiclass classification

3 Two Auxiliary Techniques: Reduced Kernel and Model Selection for SVMs
  3.1 Reduced kernel: a low-rank approximation
    3.1.1 RSVM with random subset
    3.1.2 RSVM with optimal basis
    3.1.3 Some remarks
  3.2 Uniform design for SVM model selection
    3.2.1 Methodology: 2-staged UD for model selection

4 More on Kernel-Based Learning Algorithms
  4.1 Kernel principal components analysis
    4.1.1 Classical PCA
    4.1.2 Kernel PCA
  4.2 Kernel sliced inverse regression
    4.2.1 Classical SIR
    4.2.2 Kernel SIR
  4.3 Kernel canonical correlation analysis
    4.3.1 Classical CCA
    4.3.2 Kernel CCA
  4.4 Kernel Fisher discriminant analysis
    4.4.1 Classical FLDA
    4.4.2 Kernel FDA
  4.5 Support vector regression
    4.5.1 RLS-SVR: a ridge approach
    4.5.2 Smooth ε-support vector regression (ε-SSVR)
  4.6 SIR and KSIR revisited

5 Learning Theory with RKHS (optional)
  5.1 Introduction
  5.2 SVMs with RKHS
  5.3 Multivariate statistical analysis with RKHS
  5.4 More to come

6 Neural Networks

A Some Matrix Algebra
  A.1 Matrix diagonalization
    A.1.1 Eigenvalue decomposition and generalized eigenvalue decomposition
    A.1.2 Singular value decomposition (SVD) and generalized SVD

B Kernel Statistics Toolbox
  B.1 KGaussian
  B.2 Smooth SVM
  B.3 Lagrangian SVM
  B.4 Uniform designs for model selection
  B.5 Support vector regression
  B.6 Kernel PCA
  B.7 Kernel SIR
  B.8 Kernel Fisher linear discriminant analysis
  B.9 Kernel CCA

List of Figures

3.1 Spectrum comparison on full kernel vs. reduced kernel.
3.2 Three examples of search space of model selection.
3.3 The nested UD model selection with a 13-point UD at the first stage and a 9-point UD at the second stage.
3.4 The nested UD model selection with a 9-point UD at the first stage and a 5-point UD at the second stage.
4.1 SIR view over the 1st-and-2nd variates.
4.2 KSIR view over the 1st-and-2nd variates.
4.3 Scatter plot of pen-digits over CCA-found variates.
4.4 Scatter plots of pen-digits over KCCA-found variates.
4.5 Data scatter along some coordinates.
4.6 Data scatter, fitted values and residuals using 1st SIR variate.
4.7 Data scatter, fitted values and residuals using 1st KSIR variate.

Chapter 1

Introduction

1.1 Aim and scope

The development of statistical methods is often highly influenced and challenged by problems from other fields. Due to the rapid growth of computer technology, we easily encounter enormous amounts of data collected from diverse scientific fields, which has led to a great demand for innovative analytic tools for complex data. The process of extracting information from data and making statistical inferences for future observations is called learning from data. Statistical and machine learning is an interdisciplinary field drawing on theory from statistics, probability, mathematics and computer science, with plenty of applications in engineering science, biology, bioinformatics, medical studies, etc.

The aim of this monograph is to provide an overview of the development of machine learning with emphasis on its statistical aspects. It is intended for an introductory graduate-level course. The choice of material is rather subjective. A major part of it is based on some recent works by colleagues in the IE&SLR group.¹ This monograph starts with an introduction to support vector machines (SVMs), followed by other kernel-based learning algorithms and learning theory with reproducing kernel Hilbert spaces (RKHS), and ends with a chapter on neural networks. Bayesian theory will be blended in where necessary. Two auxiliary techniques for computation reduction and model selection (see reduced kernel and model selection for SVMs in Chapter 3) are also provided. Program codes can be found in Appendix B; they are part of the Kernel Statistics toolbox,² which is still under development.

¹ The IE&SLR (information extraction & statistical learning research) group consists of faculty and students from various institutes and departments of Academia Sinica and other universities.

1.2 Notation and abbreviation

All vectors are column vectors unless otherwise specified or transposed to a row vector by a prime superscript ′. For a matrix A ∈ ℝ^{n×p}, Ai is the ith row of A. We put the data {xi ∈ ℝ^p, i = 1, . . . , n} by row in a matrix denoted by A, and we call it the data design matrix. We will use the row vector Ai and the column vector xi for the same ith observation, depending on convenience. A column vector of ones of arbitrary dimension will be denoted by 1. For A ∈ ℝ^{n×p} and B ∈ ℝ^{n×p}, the kernel K(A,B) maps ℝ^{n×p} × ℝ^{n×p} into ℝ^{n×n}. We adopt the Matlab vector and matrix convention

[M1; M2] := (M1 stacked on top of M2),

where each Mi can be a vector or a matrix with the same number of columns, and the semicolon on the left-hand side of the equality means to stack M1 on top of M2 as shown on the right-hand side. Other notation conventions and terminology abbreviations used in this monograph are listed below.

• A: a matrix of training inputs, where input data are arranged by rows, i.e., A = [x′1; x′2; . . . ; x′n] ∈ ℝ^{n×p}.

• A+: a matrix of training inputs whose corresponding class label is 1.

• A−: a matrix of training inputs whose corresponding class label is −1.

• Ai: the ith row of a matrix A.

• C(M): the column space of a matrix M, i.e., the linear space spanned by the columns of M.

• CCA: canonical correlation analysis.

• CV: cross-validation.

• D: a diagonal matrix with the class labels on the diagonal, i.e., D = diag(y1, . . . , yn).

• FLDA: Fisher linear discriminant analysis.

² The Kernel Statistics toolbox is maintained at http://140.109.74.188/kern stat toolbox/kernstat.html by the first author. It is also available at http://www.stat.sinica.edu.tw/syhuang.

• I: an index set.

• |I|: the size of an index set I.

• K, K(A,B): kernel function or kernel matrix.

• K̄ = K(A, Ā): a reduced kernel matrix using some subset Ā ⊂ A.

• KKT: Karush-Kuhn-Tucker.

• n: full data size; n̄: reduced set size.

• PCA: principal components analysis.

• plus function t+: For t = (t1, . . . , tp) ∈ ℝ^p, the plus function t+ is defined componentwise as (t+)i = max{0, ti} for i = 1, . . . , p.

• 1: a column vector of ones of arbitrary dimension; 1p: a column vector of onesof dimensionality p;

• Φ : X → Z: feature map.

• ℝ: the set of real numbers.

• RK: reproducing kernel; RKHS: reproducing kernel Hilbert space.

• RSVM: reduced (kernel) support vector machine.

• SIR: sliced inverse regression.

• SVM: support vector machine; LSVM: Lagrangian SVM; SSVM: smooth SVM.

• SVR: support vector regression; SSVR: smooth support vector regression.

• UD: uniform design.

• X : input data space.

• Z: feature data design matrix, i.e. Z = [z′1; . . . ; z′n].

• Z: feature space.

• ∇g(x): gradient consisting of first derivatives.

• ∇2g(x): Hessian matrix consisting of second derivatives.

• 〈·, ·〉H: inner product in a Hilbert space H.

• ‖ · ‖q: the q-norm.


Chapter 2

Support Vector Machines

Classification, which is one kind of supervised learning, is a commonly encountered task in statistics. For a binary classification problem, we are given a training data set {(x1, y1), . . . , (xn, yn)}, where xi ∈ ℝ^p is the input data and yi ∈ Y = {−1, 1} is the corresponding class label. The main goal of classification is to find a decision function f(x), based on the training data, that can predict future class labels for new input data points.

Support vector machines in a wide sense refer to a body of statistical and computational techniques that build on the “kernel trick”. Observations are taken in the pattern space, while the classification (or any other learning task) is carried out in the so-called feature space, a very high dimensional Hilbert space in which populations can be more easily separated and/or predicted by linear methods. In recent years SVMs [5, 15, 53] have become one of the most promising learning algorithms for classification as well as for regression [16, 38, 39, 47, 31, 43], which are two fundamental tasks in data mining [55].

2.1 Linear support vector machines

The linear SVM for binary classification attempts to find a hyperplane with maximal margin to separate the training data into two classes, A− and A+. We consider the following linear decision function

f(x) = w′x + b. (2.1.1)

The decision boundary, given by {x : f(x) = 0}, for separating the positive and negative input instances is a hyperplane in the input space ℝ^p. The distance from the origin to the hyperplane w′x + b = 0 in ℝ^p is given by |b|/‖w‖₂. For the linearly separable case, we can arrange the scale of the normal vector w and the offset b so that the following constraints are satisfied:

w′xi + b ≥ +1 for yi = +1, (2.1.2)

w′xi + b ≤ −1 for yi = −1. (2.1.3)

These constraints can be reformulated into one set of inequalities:

1− yi(w′xi + b) ≤ 0 ∀i. (2.1.4)

The margin between the two bounding planes (2.1.2) and (2.1.3) is

|(b + 1) − (b − 1)|/‖w‖₂ = 2/‖w‖₂.

Maximizing the margin is equivalent to minimizing ‖w‖₂². Thus, the SVM for the linearly separable case can be formulated as a constrained optimization problem:

min_{w,b}  ‖w‖₂²/2   subject to   1 − yi(w′xi + b) ≤ 0  ∀i.     (2.1.5)

It is almost always the case that data are not completely linearly separable, so that there is no feasible solution to (2.1.5). Hence, it is necessary to consider the non-separable case. For the non-separable case, we further introduce nonnegative slack variables ξi into the constraints:

w′xi + b ≥ +1− ξi for yi = +1, (2.1.6)

w′xi + b ≤ −1 + ξi for yi = −1, (2.1.7)

ξi ≥ 0 ∀i. (2.1.8)

Inequalities (2.1.6)-(2.1.8) can be combined into

1− yi(w′xi + b)− ξi ≤ 0 and − ξi ≤ 0, ∀i. (2.1.9)

The nonnegativity constraints ξi ≥ 0 on the slack variables are expressed as −ξi ≤ 0 in (2.1.9); this is merely a convenience that makes the associated Lagrange multipliers nonnegative. With these slack variables, a natural way to assign an error cost is to add a penalty on the size of the ξi to the objective function to be minimized, e.g., ‖w‖₂²/2 + C ∑_{i=1}^n P(ξi), where P(ξ) is a nonnegative increasing function of ξ and C is a tuning parameter controlling the trade-off between goodness of fit to the data (data fidelity) and the regularity (flatness, smoothness) of the decision function.


Here we consider a penalty on the one-norm of ξ, leading to the following constrained optimization problem:

min_{w,b,ξ}  ‖w‖₂²/2 + C ∑_{i=1}^n ξi   subject to (2.1.9).     (2.1.10)

To solve the problem, Lagrange multipliers αi ≥ 0 and βi ≥ 0 are introduced, one for each inequality in (2.1.9). Then we have the following Lagrangian:

L(w, b, ξ, α, β) = (1/2)‖w‖₂² + C 1′ξ + α′[1 − D(Aw + b1) − ξ] − β′ξ,     (2.1.11)

where D is a diagonal matrix with the class labels yi on the diagonal. The variables w, b, ξ are called primal variables, and α, β are called dual variables. The task is to minimize (2.1.11) with respect to w, b and ξ, and to maximize it with respect to α and β. At the optimal point, we have the following saddle point equations:

∂L/∂w = 0,  ∂L/∂b = 0  and  ∂L/∂ξ = 0,

which translate to

w′ − α′DA = 0,  α′D1 = 0  and  C1 − α − β = 0,

or equivalently to

w = ∑_{i=1}^n αi yi xi,  ∑_{i=1}^n αi yi = 0  and  αi + βi = C  ∀i.     (2.1.12)

By substituting (2.1.12) into the Lagrangian, we get the dual problem:

max_α  ∑_{i=1}^n αi − (1/2) ∑_{i,j=1}^n αi αj yi yj x′i xj   subject to   ∑_{i=1}^n αi yi = 0  and  0 ≤ αi ≤ C.     (2.1.13)

By introducing the slack variables ξi, we get the box constraints that limit the size of the Lagrange multipliers: αi ≤ C. The KKT necessary and sufficient optimality conditions¹ can be summarized by

0 ≤ α ⊥ 1−D(Aw + b1)− ξ ≤ 0, and

0 ≤ β ⊥ ξ ≥ 0.

These KKT conditions imply that

αi = 0 ⇒ βi = C ⇒ yif(xi) ≥ 1 and ξi = 0,

0 < αi < C ⇒ 0 < βi < C ⇒ yif(xi) = 1 and ξi = 0,

αi = C ⇒ βi = 0 ⇒ yif(xi) ≤ 1 and ξi ≥ 0.

Points xi associated with nonzero αi are called support vectors (SVs). The offset term b is computed by

b = (1/|I|) ∑_{i∈I} ( yi − ∑_{j=1}^n αj yj x′j xi ),  where I = {i : 0 < αi < C}.

With w = ∑_{i=1}^n αi yi xi, the underlying decision function is then given by

f(x) = ∑_{i=1}^n αi yi x′i x + b = ∑_{xi : SV} αi yi x′i x + b.

Only the SVs contribute to the decision function for classification. In addition, the contribution of a data point as an SV to the final decision function is at most of scale C. The one-norm soft margin SVM is probably the most common formulation for SVMs.

Remark 1 The SVM introduced above provides a way of doing linear discrimination based on the one-norm soft margin criterion. There are many other alternatives in the statistical literature, e.g., Fisher linear discriminant analysis, logistic discriminant analysis, classification by least squares regression, the Gaussian mixture approach, etc. Think about why SVM, and why the kernel extension introduced in later chapters, as we proceed through this course.

¹ Theorem. Given a convex optimization problem: min_v ℓ(v) subject to inequality constraints gi(v) ≤ 0, i = 1, . . . , k, with ℓ ∈ C¹ and gi affine, necessary and sufficient conditions for a point v∗ to be an optimal solution are the existence of Lagrange multipliers η∗ such that

∂L(v∗, η∗)/∂v = 0,
η∗i gi(v∗) = 0,  i = 1, . . . , k,
gi(v∗) ≤ 0,  i = 1, . . . , k,
η∗i ≥ 0,  i = 1, . . . , k.

The second conditions are known as the KKT conditions.


2.2 Kernel trick for nonlinear SVM extension

How can we generalize the SVM to the nonlinear case, while still keeping the model simplicity of linearity? That is, we want a flexible decision boundary for separating the two classes, while still managing to maintain the simplicity of the estimation and/or prediction task.

The main idea is first to map the input data to a high dimensional space by a feature map Φ : X → Z,

x ↦ Φ(x) = (φ1(x), φ2(x), φ3(x), . . .) ∈ Z.²     (2.2.14)

(² Regard the collection of functions {φ1(x), φ2(x), φ3(x), . . .} as a basis set in a certain functional space H of X. The nonlinear SVM is in general a nonparametric method.)

Next, the linear SVM problem (2.1.10) is considered and nominally carried out in the feature space Z, with nominal input data {zi = Φ(xi)}_{i=1}^n. Similar to the constraints (2.1.9), we can form constraints in the feature space as follows:

1− yi(w′zi + b)− ξi ≤ 0 ∀i, (2.2.15)

−ξi ≤ 0 ∀i, (2.2.16)

where w is the normal vector to the separating hyperplane in Z. The corresponding optimization problem becomes

min_{w,b,ξ}  ‖w‖₂²/2 + C ∑_{i=1}^n ξi   subject to (2.2.15)–(2.2.16).     (2.2.17)

Similar to the derivation for (2.1.13), we start with the partial derivatives of the Lagrangian with respect to the primal variables:

∂L/∂w = 0,  ∂L/∂b = 0  and  ∂L/∂ξ = 0,

which translate to

w′ − α′DZ = 0,  α′D1 = 0  and  C1 − α − β = 0,  where Z = [z′1; z′2; · · · ; z′n],

or equivalently to

w = ∑_{i=1}^n αi yi zi,  ∑_{i=1}^n αi yi = 0  and  αi + βi = C  ∀i.     (2.2.18)

By substituting (2.2.18) into the Lagrangian, we get the dual problem:

max_α  ∑_{i=1}^n αi − (1/2) ∑_{i,j=1}^n αi αj yi yj Φ(xi)′Φ(xj)   subject to   ∑_{i=1}^n αi yi = 0  and  0 ≤ αi ≤ C.     (2.2.19)

Similarly, b is computed as

b = (1/|I|) ∑_{i∈I} ( yi − ∑_{j=1}^n αj yj Φ(xj)′Φ(xi) ),  where I = {i : 0 < αi < C}.

Note that the feature data zi are not observed. However, the problem (2.2.17) and the underlying decision function

f(x) = w′z + b, where z = Φ(x),

depend on w and z only through their values of inner products in Z. Define the inner product as a bivariate function on X × X:

K(x, u) = Φ(x)′Φ(u).

There is no need to know explicitly the functional form of Φ, nor the individual feature observations zi. It is sufficient to know the function K(x, u) representing the inner product values in Z. The function K(x, u) is called a kernel function. Alternatively, one can simply start with a given positive definite kernel K(x, u). (We will give a formal definition of a positive definite kernel later.) Then, solve the dual problem

max_α  ∑_{i=1}^n αi − (1/2) ∑_{i,j=1}^n αi αj yi yj K(xi, xj)   subject to   ∑_{i=1}^n αi yi = 0  and  0 ≤ αi ≤ C,     (2.2.20)

and b is computed from

b = (1/|I|) ∑_{i∈I} ( yi − ∑_{j=1}^n αj yj K(xj, xi) ),  where I = {i : 0 < αi < C}.

Finally, the underlying decision function is given by

f(x) = ∑_{i=1}^n αi yi K(xi, x) + b.

The kernel trick introduced above provides a powerful extension for nonlinear SVMs. In fact, the SVM does not solve the nonlinear problem directly. It maps the input data into a high dimensional space via an implicit transformation so that the data in this high dimensional space can be better separated by a hyperplane therein. One of the most popular kernel functions is the Gaussian kernel (also known as the radial basis function):

K(x, u) = exp(−γ‖x − u‖₂²),     (2.2.21)

where x and u are two vectors in ℝ^p and γ is the width parameter of the Gaussian kernel. (We do not care to normalize it to make the kernel integrate to one, as the normalization has no effect on SVM problems.) The value of K(x, u) represents the inner product of Φ(x) and Φ(u) in some high dimensional feature space Z and reflects the similarity between x and u. Values of K(x, u) drop off at a rate determined by ‖x − u‖₂² and γ.
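For concreteness, here is a minimal MATLAB sketch of a Gaussian kernel matrix K(A, B) in the convention used in these notes (rows of A and B are observations). It is only an illustration, not the KGaussian routine of Appendix B.

    % Gaussian kernel matrix K(A,B) with K(i,j) = exp(-gamma*||A(i,:)-B(j,:)||^2).
    % A is n-by-p, B is m-by-p; returns an n-by-m matrix.
    function K = gauss_kernel(A, B, gamma)
        sqA = sum(A.^2, 2);              % n-by-1 squared row norms
        sqB = sum(B.^2, 2);              % m-by-1 squared row norms
        sqdist = bsxfun(@plus, sqA, sqB') - 2 * (A * B');   % pairwise squared distances
        K = exp(-gamma * max(sqdist, 0));                   % clip tiny negatives from round-off
    end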

Before we end this section, we give a formal definition of a positive definite kernel and its spectrum.

Definition 1 (Positive definite kernel and its spectrum) A real-valued symmetric kernel function K defined on X × X is said to be positive definite if and only if, for any positive integer m, any sequence of real numbers {a1, . . . , am} and any sequence of points {u1, . . . , um} in X, we have ∑_{i,j=1}^m ai aj K(ui, uj) ≥ 0; it is strictly positive definite when equality holds only if a1 = · · · = am = 0. For a given positive definite kernel K, if it defines a continuous compact operator on L2(X), then K can be expanded in terms of its eigenvalues and eigenfunctions:

K(x, u) = ∑_q λq ψq(x) ψq(u).     (2.2.22)

With the above expansion, the associated feature map is given by

Φ(x) = (√λ1 ψ1(x), √λ2 ψ2(x), √λ3 ψ3(x), . . .).

2.3 Variants of SVM algorithms

There are two key ingredients in an SVM problem. One is the kernel trick for making the extension to a nonlinear decision function, which is a kernel mixture (i.e., a linear combination of kernels). The decision function is nonlinear in the pattern space but linear in an implicit feature space. The nonlinearity via the kernel mixture allows excellent flexibility and versatility for the decision function in the pattern space, while the simple linear structure, as well as the sparsity of SVs making up the decision function in the feature space, makes the implementation of the nonlinear model possible. The other key ingredient is the error criterion (or risk, cost criterion) for seeking (or estimating) the decision function.


2.3.1 One-norm soft margin SVM

The formulations (2.1.10) and (2.1.13) for the linear case, as well as (2.2.17) and (2.2.19) for the nonlinear case, fall into the category of one-norm soft margin SVM. The term one-norm refers to the one-norm penalty on the slack variables: ‖ξ‖₁ = ∑_{i=1}^n ξi.³ ⁴

³ There is another SVM variant, also called the one-norm SVM [57]; it penalizes the one-norm of the normal vector, i.e., ‖w‖₁.

⁴ The purpose of a penalty is to regularize the underlying model (to make it “smooth”, often by shrinking the parameter values toward zero). A one-norm penalty leads to a parsimonious model structure and enforces a keep-or-kill action. A two-norm penalty shrinks the parameter values but never kills them. In a Bayesian formulation, the one-norm penalty corresponds to a Laplacian (double exponential) prior and the two-norm penalty to a Gaussian prior on the parameters.

2.3.2 Two-norm soft margin SVM

Besides the one-norm ‖ξ‖₁, it is natural to consider the two-norm penalty ‖ξ‖₂². The corresponding formulation becomes

min_{w,b,ξ}  (1/2)‖w‖₂² + (C/2) ∑_{i=1}^n ξi²   subject to (2.2.15)–(2.2.16).     (2.3.23)

Homework 1 Formulate the 2-norm soft margin SVM in dual form.

Solution: The nonnegativity constraint (2.2.16) can be removed, since the term ‖ξ‖₂² is included in the objective function. Therefore, we have the following Lagrangian:

L(w, b, ξ, α) = (1/2)‖w‖₂² + (C/2) ∑_{i=1}^n ξi² + α′[1 − D(Zw + b1) − ξ],

and its partial derivatives with respect to the primal variables,

∂L/∂w = 0,  ∂L/∂b = 0  and  ∂L/∂ξ = 0,

translate to

w′ − α′DZ = 0,  α′D1 = 0  and  Cξ − α = 0,

or equivalently to

w = ∑_{i=1}^n αi yi zi,  ∑_{i=1}^n αi yi = 0  and  ξi = αi/C  ∀i.

Then, by substituting the above equations into the Lagrangian, we obtain

L(w, b, ξ, α)
= (1/2) ∑_{i,j=1}^n αi αj yi yj z′i zj + (1/(2C)) ∑_{i=1}^n αi² − ∑_{i,j=1}^n αi αj yi yj z′i zj + ∑_{i=1}^n αi − (1/C) ∑_{i=1}^n αi²
= ∑_{i=1}^n αi − (1/2) ∑_{i,j=1}^n αi αj yi yj z′i zj − (1/(2C)) ∑_{i=1}^n αi².

The 2-norm soft margin SVM problem can then be formulated in the following dual form:

max_α  ∑_{i=1}^n αi − (1/2) ∑_{i,j=1}^n αi αj yi yj K(xi, xj) − (1/(2C)) ∑_{i=1}^n αi²
subject to  ∑_{i=1}^n αi yi = 0,  αi ≥ 0  ∀i.  □
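In matrix form, this 2-norm dual is the same quadratic program as the 1-norm dual, except that the kernel term is augmented by I/C on the diagonal and α has no upper bound. A hedged MATLAB sketch, again using quadprog and assuming the kernel matrix K, labels y and cost C are given:

    % Dual of the 2-norm soft margin SVM: max_a 1'a - 1/2 a'(DKD + I/C)a
    % subject to y'a = 0 and a >= 0.  K is the n-by-n kernel matrix, y the labels.
    n = length(y);
    D = diag(y);
    H = D * K * D + eye(n) / C;                   % augmented quadratic term
    alpha = quadprog(H, -ones(n, 1), [], [], y', 0, zeros(n, 1), []);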

There is another variant of the 2-norm soft margin SVM, called the Lagrangian SVM [40] (LSVM for short). It also appends the term b²/2 to the margin criterion and solves the following reformulated problem:

min_{w,b,ξ}  (1/2)(‖w‖₂² + b²) + (C/2) ∑_{i=1}^n ξi²   subject to (2.2.15)–(2.2.16).     (2.3.24)

This reformulation leads to the simple constraints αi ≥ 0 in the dual problem (see Appendix B) without sacrificing testing accuracy. The LSVM implementation is based directly on the KKT necessary and sufficient optimality conditions for the dual problem. For the linear case, the LSVM implementation uses the Sherman-Morrison-Woodbury (SMW) identity to transform an n×n matrix inversion into a p×p matrix inversion. However, for the nonlinear case, the n×n kernel matrix [K(xi, xj)] is dense and of full (or nearly full) rank, and there is no advantage in using the SMW identity. In this case, a reduced kernel technique (introduced in Chapter 3) should be used to cut down the column rank of the kernel matrix. Refer to Appendix B for implementation details and Matlab code.

2.3.3 Smooth SVM (SSVM)

Traditionally, SVMs are solved via their dual problems. Here we introduce an alternative approach, which focuses on the primal problem and solves it directly by smoothing techniques [33]. Smoothing methods have been extensively used in the optimization literature [8, 9, 7, 11, 12, 18, 51]. The basic idea is to find a smooth (differentiable) approximation to the non-smooth objective function, here specifically the plus function, and then to use a Newton⁵ method to solve the smooth optimization problem.

The decision function of an SVM has the following form:

f(x) = ∑_{i=1}^n αi yi K(xi, x) + b  ( := ∑_{i=1}^n vi K(xi, x) + b,  with vi = αi yi ).

The SSVM works directly on the primal problem. It starts from a type of 2-norm soft margin SVM, appends the term b²/2 to the objective function to be minimized, and leads to the following minimization problem:

min_{v,b,ξ}  (C/2)‖ξ‖₂² + (1/2)(‖v‖₂² + b²)
subject to  1 − yi f(xi) − ξi ≤ 0  ∀i,
            ξi ≥ 0  ∀i.

Note that the nonnegativity constraint ξ ≥ 0 can be removed, as the term ‖ξ‖₂² is included in the objective function of the minimization problem. The problem above becomes

min_{v,b,ξ}  (C/2)‖ξ‖₂² + (1/2)(‖v‖₂² + b²)     (2.3.25)
subject to  1 − D(Kv + b1) − ξ ≤ 0,     (2.3.26)

where K = K(A, A) = [K(xi, xj)]_{i,j=1}^n. At a solution, ξ takes the form ξ = (1 − D(Kv + b1))+. The problem given by (2.3.25) and (2.3.26) converts to an equivalent unconstrained minimization problem:

min_{v,b}  (C/2)‖(1 − D(Kv + b1))+‖₂² + (1/2)(‖v‖₂² + b²).     (2.3.27)

It is a strongly convex minimization problem without any constraint and has a unique solution. However, the objective function is not differentiable, which hinders the use of a Newton method. A smooth approximation [33] is introduced below.

⁵ Also known as the Newton-Raphson method, an efficient algorithm for finding approximations to the zeros (roots) of a smooth function. It can also be used to find local maxima and minima of a function, by noting that a local optimum is a root of the derivative of the underlying function. For a twice-differentiable function f(x), an iterative scheme for a local optimal point can be prescribed as follows:

x_{n+1} = x_n − (∇²f(x_n))⁻¹ ∇f(x_n),  n ≥ 0,

where x0 is an initial point. Usually the Newton method is modified to include a small step size 0 < γ < 1 instead of γ = 1; this Newton-Armijo modification ensures convergence of the iterative scheme.

For t = (t1, . . . , tp) ∈ ℝ^p, the plus function t+ is defined componentwise as (t+)i = max{0, ti} for i = 1, . . . , p. An accurate smooth approximation to the plus function is given by p(t; ρ), defined componentwise by

p(ti; ρ) = ti + (1/ρ) log(1 + e^{−ρ ti}),  ti ∈ ℝ,  ρ > 0.     (2.3.28)

The smooth p-function is the integral of the sigmoid (logistic) function g(t) = 1/(1 + e^{−ρt}), t ∈ ℝ, which is used to approximate the step function h(t) = 1 for t ≥ 0 and h(t) = 0 for t < 0. As ρ → ∞, p(t; ρ) → t+ for all t ∈ ℝ^p. Replacing (1 − D(Kv + b1))+ by the smooth approximation (2.3.28), the SVM problem converts to

min_{v,b}  (C/2)‖p(1 − D(Kv + b1); ρ)‖₂² + (1/2)(‖v‖₂² + b²).     (2.3.29)

The problem (2.3.29) can be solved by the fast Newton-Armijo algorithm [33]. A Matlab implementation is given in Appendix B. The code takes advantage of the sparsity of the Hessian matrix and uses the limiting values of the sigmoid and p-functions when computing the gradient vector and the Hessian matrix; that is, ρ = ∞ in the implementation, which simplifies the computation of the gradient and the Hessian.
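To make the smoothed problem (2.3.29) concrete, the sketch below evaluates its objective and gradient and minimizes it by plain gradient descent with backtracking. This is a simple stand-in for the Newton-Armijo solver of [33] and Appendix B, not the SSVM implementation itself; K, y, C and ρ are assumed given, and the smooth plus function is evaluated in a numerically stable form.

    % Gradient-descent sketch for the smoothed SSVM objective (2.3.29).
    % Assumes: K (n-by-n kernel matrix), y (labels in {-1,+1}), C > 0, rho > 0.
    function [v, b] = ssvm_graddesc(K, y, C, rho, maxit)
        n = size(K, 1);
        v = zeros(n, 1); b = 0;
        for it = 1:maxit
            r = 1 - y .* (K * v + b);                          % 1 - D(Kv + b1)
            p = max(r, 0) + log(1 + exp(-rho * abs(r))) / rho; % stable form of (2.3.28)
            s = 1 ./ (1 + exp(-rho * r));                      % sigmoid = derivative of p
            gv = v - C * (K * (y .* p .* s));                  % gradient w.r.t. v
            gb = b - C * sum(y .* p .* s);                     % gradient w.r.t. b
            obj = C/2 * sum(p.^2) + (v'*v + b^2)/2;
            step = 1;                                          % backtracking line search
            while true
                vn = v - step * gv;  bn = b - step * gb;
                rn = 1 - y .* (K * vn + bn);
                pn = max(rn, 0) + log(1 + exp(-rho * abs(rn))) / rho;
                if C/2 * sum(pn.^2) + (vn'*vn + bn^2)/2 < obj || step < 1e-8
                    break;
                end
                step = step / 2;
            end
            v = vn;  b = bn;
        end
    end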

2.4 Multiclass classification

So far we have only discussed binary classification, as SVMs were originally designed for it. There are two commonly seen multiclass extensions for SVMs. One is the composition type, built upon a series of binary classifiers, e.g., one-against-one and one-against-rest; the other is the single machine type, often huge and solved in one optimization formulation. See Rifkin and Klautau [45] for a literature review, comparison study and references. There is no universally dominant classification rule for multiclass problems. Different methods have their own merits and advantages, leaving room for exploring alternative approaches.⁶ In this section we introduce only one-against-one (also known as pairwise) classification, which is probably the most commonly used scheme for multiclass SVMs. In the one-against-one scheme, we train a binary classifier for each pair of classes. For k classes, this results in k(k − 1)/2 binary classifiers. Though the number of classifiers to be trained is large, the individual problems are significantly smaller, and it is often faster to train a series of small classifiers than one huge single machine.

⁶ For instance, kernel Fisher discriminant analysis [41, 26] and classification via support vector regression [10].

To classify a test pattern, we evaluate all k(k − 1)/2 binary classifiers and assign the test point to the class with the most votes, where a vote for a given class means that a classifier assigns the test pattern to that class. Ties are broken by random assignment among the tied classes.
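A minimal MATLAB sketch of the voting step, assuming the k(k − 1)/2 pairwise predictions have already been computed and stored (the matrix preds and its layout below are hypothetical):

    % One-against-one voting, assuming preds(i, m) in {1,...,k} is the class
    % predicted for test point i by the m-th pairwise classifier.
    function yhat = ovo_vote(preds, k)
        ntest = size(preds, 1);
        yhat = zeros(ntest, 1);
        for i = 1:ntest
            votes = histc(preds(i, :), 1:k);          % vote count per class
            winners = find(votes == max(votes));      % classes with the most votes
            yhat(i) = winners(randi(numel(winners))); % random tie-break
        end
    end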

Chapter 3

Two Auxiliary Techniques: Reduced Kernel and Model Selection for SVMs

In this chapter we introduce two auxiliary techniques for dealing with the large-data computation and the model selection associated with SVMs.

3.1 Reduced kernel: a low-rank approximation

In nonlinear SVMs, as well as in other kernel-based learning algorithms, generating a nonlinear separating surface requires dealing with a fully dense kernel matrix of size n×n. When n is medium to large, several computational difficulties are encountered:

(P1) the size of the mathematical programming problem,

(P2) the size of memory usage,

(P3) the complexity of the resulting separating surface.

To cut down the model complexity, a reduced kernel technique was introduced [32, 34]. The original reduced kernel support vector machine (RSVM) uses a very small random subset of size n̄ (≪ n) to build up the separating surface for the SSVM algorithm. The reduced set plays a role similar to that of the support vectors. We denote this random subset by Ā ⊂ A; it is used to generate a much smaller rectangular kernel matrix of size n × n̄:

K(A, Ā) = [K(xi, xj)],  xi ∈ A,  xj ∈ Ā.

The reduced kernel matrix is used to replace the full kernel matrix K(A, A). Using a reduced kernel cuts the problem size, the associated computing time and the memory usage, and it also simplifies the characterization of the nonlinear separating surface. The basic idea of RSVM is to solve an approximate SVM problem instead of the full problem, by replacing the full kernel matrix with a low-rank approximation. The use of a reduced kernel is limited neither to random subsets nor to the SSVM algorithm; there are alternative choices. We divide the discussion of reduced kernel selection into two categories: the most computationally economical, a random subset, and the most expensive, optimal basis selection. There are also alternatives in between, which search for an optimal basis within a certain random subset. As for the SVM formulation, two basic types of RSVM formulations are introduced here: RSVM in the primal space and RSVM in the dual space.

3.1.1 RSVM with random subset

In this monograph, the term “RSVM” refers to a reduced kernel SVM of any form, not limited to a random subset nor to any specific SVM algorithm.

Primal RSVM using a random subset is straightforward to formulate. It works simply by replacing the full kernel matrix K with a reduced kernel, denoted by K̄. Using SSVM, which is solved in the primal space, as an illustrative example, its decision function is of the form f(x) = K(x, Ā)v + b, where v is solved directly from the following optimization problem:

min_{v,b}  (C/2)‖(1 − D(K̄v + b1))+‖₂² + (1/2)(‖v‖₂² + b²),  where K̄ = K(A, Ā).     (3.1.1)

In solving the RSVM (3.1.1), a smooth approximation p(t; ρ) to the plus function is used. The solution of the minimization problem (3.1.1) leads to a nonlinear separating surface of the form

K(x, Ā)v + b = 0,  i.e.,  ∑_{i=1}^{n̄} K(x, Āi)vi + b = 0.     (3.1.2)

In fact, the reduced set Ā need not be a subset of the training set. The RSVM minimization problem (3.1.1) retains the strong convexity and differentiability properties in the space ℝ^{n̄+1} of (v, b) for any arbitrary rectangular kernel. Again, the fast Newton-Armijo algorithm [33] can be applied directly to solve (3.1.1), and the existence and uniqueness of the optimal solution are guaranteed. Moreover, the computational complexity of solving problem (3.1.1) by the Newton-Armijo method is O(n̄³), while solving the full kernel SSVM is O(n³) [33].
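A minimal sketch of forming a reduced kernel from a random subset, reusing the gauss_kernel sketch of Chapter 2; A and gamma are assumed to be in the workspace, and the 10% subset size is only illustrative.

    % Reduced kernel by random subset: pick nbar of the n training rows and
    % form the n-by-nbar rectangular kernel K(A, Abar).
    n = size(A, 1);
    nbar = ceil(0.1 * n);                     % e.g. keep 10% of the rows (illustrative)
    idx = randperm(n);
    Abar = A(idx(1:nbar), :);                 % random reduced set
    Kbar = gauss_kernel(A, Abar, gamma);      % n-by-nbar reduced kernel matrix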


The reduced kernel serves as a low-rank approximation to the full kernel matrix. It is also known as the Nyström approximation, which has been proposed in many sophisticated ways [46, 54]. The Nyström approximation, denoted by R, has the following form:

K(A, A) ≈ K(A, Ā) K(Ā, Ā)⁻¹ K(Ā, A) := R.¹     (3.1.3)

Note that, for a given v, the least squares problem

min_{v̄ ∈ ℝ^{n̄}}  ‖Rv − K(A, Ā)v̄‖₂²     (3.1.4)

has the solution

v̄ = K(Ā, Ā)⁻¹ K(Ā, A) v.     (3.1.5)

That is, the Nyström approximation approximates Kv by K̄v̄, as given below:

K(A, A)v ≈ K(A, Ā) K(Ā, Ā)⁻¹ K(Ā, A) v = K(A, Ā) v̄.     (3.1.6)
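Continuing the sketch above, the Nyström approximation (3.1.3) can be formed as follows (a small ridge is added before the inverse purely for numerical safety; forming the full kernel to check the error is only sensible for small n):

    % Nystrom approximation of the full kernel matrix, using the random subset Abar.
    Kaa = gauss_kernel(Abar, Abar, gamma);            % nbar-by-nbar scaling block
    R   = Kbar * ((Kaa + 1e-8 * eye(nbar)) \ Kbar');  % K(A,Abar) * Kaa^{-1} * K(Abar,A)
    err = norm(gauss_kernel(A, A, gamma) - R, 'fro'); % approximation error (small n only)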

In the primal RSVM, v is directly determined by fitting the entire data set to the reduced problem (3.1.1), which can be regarded as an approximation to the full problem (2.3.27). For a more advanced theoretical study, consult Lee and Huang [34]. For a general primal SVM algorithm, its corresponding RSVM approximation works by replacing K with K̄. That is to say, the RSVM in the primal space is easy to work with and basically shares the same type of formulation as its full problem.

Dual RSVM using a random subset works by replacing the full kernel matrix K with the reduced kernel matrix R given by (3.1.3) and solving the dual problem (2.2.20). Note that the reduced-rank kernel matrix R can be expressed as R = BB′, where B = K(A, Ā)K(Ā, Ā)^{−1/2} (or B = K(A, Ā), if the scaling matrix discussed in the footnote is taken to be the identity for simplicity). Thus, the dual problem (2.2.20) with kernel R has the same solution as the dual linear SVM problem (2.1.13) with input data matrix B. That is, the dual RSVM can be treated as a linear SVM problem. We recommend the Lagrangian SVM, an efficient linear SVM algorithm, for the RSVM solved in the dual space; Lagrangian SVM code consisting of only a couple of lines of commands is given in Appendix B. The dual solution then has to be transformed back to the primal one, as the decision function is expressed in the primal space. Letting vi = yi αi, or equivalently v = Dα, the corresponding primal solution can be obtained via (3.1.4) and (3.1.5), and the decision function is then given by

f(x) = K(x, Ā)v̄ + b,  where v̄ = (scaling matrix)⁻¹ K(Ā, A)v,

and the scaling matrix is either K(Ā, Ā) or the identity.

¹ Note that K(Ā, Ā) serves as a scaling matrix. Some authors have used the approximation

R := K(A, Ā)K(Ā, A)

in the dual problem; that is, they take the identity matrix as the scaling matrix for simplicity. To get the corresponding primal solution, the least squares problem (3.1.4) is applied and the primal solution is given by v̄ = K(Ā, A)v. The difference between the approximation (3.1.3) and the one in this footnote lies in the scaling matrix being K(Ā, Ā) or the identity.

3.1.2 RSVM with optimal basis

RSVM using an optimal basis can be done through the singular value decomposition (SVD). One may start with the full kernel matrix K, or with a certain intermediate reduced kernel matrix K̄, and look for an optimal basis in the column space of K̄ for further column-rank reduction. Assume that we start with K̄ = K(A, Ā), where Ā ⊆ A. Let the SVD of K̄ be given by

K̄ = PΛG′,²

where P and G are orthogonal matrices of sizes n and n̄ respectively, and Λ is a diagonal matrix of size n × n̄.

Primal RSVM using an optimal basis works by further reducing the column rank, extracting the leading right singular vectors of K̄. Let G = [g1, . . . , gm] consist of the m leading right singular vectors. Projecting the rows of K̄ along these m right singular vectors leads to K̄G, which is of size n × m. We use K̄G as our new reduced kernel. The decision function is then given by

f(x) = K(x, Ā)Gv + b,     (3.1.7)

where v and b are solved from (3.1.1) with K̄ replaced by the new reduced kernel K̄G.

Dual RSVM using an optimal basis works by replacing the full kernel matrix K by the reduced kernel matrix R given by

R := K̄G (G′(scaling matrix)G)⁻¹ G′K̄′,     (3.1.8)

and then solving the dual problem (2.2.20),³ where K̄ can be the full kernel or an intermediate reduced kernel K(A, Ā), and the scaling matrix is K(Ā, Ā) (or can be replaced by a simple identity matrix). Letting v = Dα, the corresponding primal solution can be obtained by solving

min_{v̄}  ‖Rv − K̄Gv̄‖₂²,     (3.1.9)

which leads to v̄ = (G′(scaling matrix)G)⁻¹ G′K̄′v. The decision function is then given by f(x) = K(x, Ā)Gv̄ + b.

² As only the leading right singular vectors are needed, we can extract G from the leading eigenvectors of the smaller square matrix K̄′K̄ to save computing time. See the SVD section in Appendix A.

³ Again, R can be expressed as R = BB′, where B = K̄G{G′K(Ā, Ā)G}^{−1/2}, and the dual RSVM can be treated as a linear SVM problem with input data matrix B.
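A short sketch of the optimal-basis reduction: take the m leading right singular vectors of the (intermediate) reduced kernel K̄ and use K̄G as the new reduced kernel, as in (3.1.7). Here Kbar is assumed to be the rectangular kernel built earlier, and m is illustrative.

    % Optimal-basis reduction: keep the m leading right singular vectors of Kbar
    % and use Kbar*G as the new n-by-m reduced kernel.
    [~, ~, V] = svd(Kbar, 'econ');    % right singular vectors of the n-by-nbar Kbar
    m = 10;                           % number of leading directions to keep (illustrative)
    G = V(:, 1:m);                    % nbar-by-m
    KbarG = Kbar * G;                 % n-by-m new reduced kernel, as in (3.1.7)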

3.1.3 Some remarks

For many other kernel-based algorithms, one faces the same problems (P1)-(P3) stated above. Several authors have suggested the use of low-rank approximations to the full kernel matrix for very large problems [32, 46, 54]. These low-rank approximations all use a thin rectangular matrix K(A, Ā) consisting of a subset of n̄ (≪ n) columns drawn from the full kernel matrix K. Lee and Mangasarian [32] and Williams and Seeger [54] suggest picking the subset randomly and uniformly over all columns. Lee and Huang [34] give a further theoretical study of the random subset approach in terms of some statistical optimalities. Smola and Schölkopf [46] consider finding an optimal subset. Since finding an optimal subset is a combinatorial problem involving C(n, n̄) possibilities, and an exhaustive search is too expensive to carry out, they use a probabilistic speedup algorithm. The trick is to first draw a random subset of a fixed size and then pick the best basis (column) from this set; the search goes on until a stopping criterion is reached. The optimal basis introduced above consists of the leading right singular vectors of K̄ instead of mere columns of K. For a small-sized problem we may start with the full kernel K, while for a medium- to large-sized problem we may start with an intermediate reduced kernel built from a random subset, and then extract the leading right singular vectors for further column-rank reduction.

For problems working in the primal space, e.g., the proximal SVM [19], the least-squares SVM [48, 49], the kernel Fisher discriminant [42, 41], kernel principal component analysis [27] and kernel sliced inverse regression [56], the random subset method works straightforwardly, by replacing the full kernel matrix with the reduced kernel and cutting down the corresponding number of parameters in the decision function. For problems solved in the dual space, we form the approximate kernel matrix R = K(A, Ā)(scaling matrix)⁻¹K(Ā, A), or R = K̄G(G′(scaling matrix)G)⁻¹G′K̄′, to replace the full matrix K, where the scaling matrix is K(Ā, Ā) or the identity. The dual solution then has to be transformed to the primal form for the expression of the decision function: to obtain the primal solution v̄, we solve the least squares problem (3.1.4) or (3.1.9), where v is the dual solution.

Homework 2 (1) Do a numerical study of the quality of the reduced kernel approximation based on its spectral behavior. Use the Wine Recognition data set to build the full kernel matrix and the reduced kernel matrices, by random subset and by optimal basis respectively. (2) Analyze the Wine Recognition data set using both the full and reduced kernel SVM approaches with the Gaussian kernel. (Choose your own SVM algorithm: LIBSVM (or BSVM, a companion to LIBSVM for large data problems), SSVM, or another.)

Solution: For your reference, here is the solution using the Iris and Tree data sets. The spectral behavior of the reduced kernel approximation

R = K(A, Ā) K(Ā, Ā)⁻¹ K(Ā, A)

can be measured via a few quantities, such as the maximum difference of the eigenvalues and the relative difference of the traces of the full kernel matrix K(A, A) and R, defined as Maxi-diff and Rel-diff = Trace(K − R)/Trace(K), respectively. The results are summarized in Table 3.1. Moreover, let λk and λ̄k be the kth eigenvalues of K and R. The difference (λk − λ̄k) is plotted against k in Figure 3.1(a), while Figures 3.1(b) and 3.1(c) show the eigenvalues of K and R. From these figures, it appears that a faster decay rate of the eigenvalues yields a better quality of approximation. Finally, instead of all 150 eigenvalues, only the first 50 leading ones are considered here, as their sum is large enough (> 99% of the total).

In Table 3.2, numerical comparisons between the full kernel and a reduced kernel from a random subset are conducted, separately with SSVM and LIBSVM. The training error and validation error (with 5-fold cross-validation) are reported for SSVM. LIBSVM provides the validation accuracy and the number of support vectors due to its dual formulation. It can be seen that, though using a simpler model, RSVM has only slightly larger errors than the full kernel in training and in validation.

a. Use ‘hibiscus’ to tune for the best parameters separately for SSVM and LIBSVM.
b. Include LSVM for comparison? □

Table 3.1: Spectrum comparison on full kernel K(A, A) versus its Nyström approximation R.

Data set (full, reduced size) | Largest eigenvalue of K(A, A) | Largest eigenvalue of R | Maxi-diff of eigenvalues | Rel-diff of traces
Iris (150, 30)                | 40.8754                       | 40.5268                 | 1.3223                   | 0.1083

[Figure 3.1 about here: panel (a) shows the eigen-structure difference between the full kernel and the approximate kernel for the Iris data set; panels (b) and (c) show their leading eigenvalues. Legend: Full kernel, Approx. kernel.]

Figure 3.1: Spectrum comparison on full kernel vs. reduced kernel.

Table 3.2: The Gaussian kernel is used in all tests, with width parameter γ = 0.1 and tuning parameter C = 10.

Iris data set
Model (size) | SSVM training error | SSVM validation error | LIBSVM validation error | # of SVs
Full (150)   | 0.033               | 0.046                 | 0.03333                 | 32
Reduced (30) | 0.058               | 0.073                 | 0.03333                 | 32

Tree data set
Model (size) | SSVM training error | SSVM validation error | LIBSVM validation error | # of SVs
Full (700)   | 0.13405             | 0.14676               | 0.13143                 | 236
Reduced (69) | 0.14724             | 0.14676               | 0.14000                 | 222


3.2 Uniform design for SVM model selection

The problem of choosing a good parameter setting for better generalization ability is what we call model selection here. It is desirable to have an effective and automatic scheme for it; in particular, an automatic and data-driven scheme is helpful for people who are not familiar with parameter tuning in SVMs. To develop a model selection scheme, we need to set up

• a performance measure, and

• a search scheme to look for optimal performance point.

One of the most commonly used performance measures is cross-validation (CV) (Stone, 1974), either k-fold or leave-one-out. Both require that the learning engine be trained multiple times in order to obtain a performance measure for each parameter combination. As for the search scheme, one standard method is a simple grid search over the parameter domain of interest. It is obvious that an exhaustive grid search cannot effectively perform the task of automatic model selection due to its high computational cost. There are also gradient-based approaches in the literature (Chapelle et al. [6]; Larsen et al. [30]; Bengio [3]). Although gradient-based methods present an impressive gain in time complexity, they have a great chance of falling into bad local minima, especially for SVM model selection, since the objective function, the CV-based performance measure, has a low degree of regularity. Figure 3.2 plots the 5-fold average test set accuracy for three publicly available data sets, Banana, Waveform and Splice, as three-dimensional surfaces, where the x-axis and y-axis are log2 C and log2 γ, respectively, and the z-axis is the 5-fold average test accuracy. Each mesh point in the (x, y)-plane stands for a parameter combination, and the z-axis indicates the model performance measure.

As an alternative, we introduce a nested uniform design (UD) methodology for model selection, which can be regarded as an economical modification replacing grid search. The lattice grid points are intended to represent uniform points in a parameter domain; however, they are not uniform enough. A much better uniformity scheme, called the UD, is introduced below.

3.2.1 Methodology: 2-staged UD for model selection

[Figure 3.2 about here: 5-fold average test accuracy surfaces over (log2 C, log2 γ) for the Banana, Splice and Waveform data sets.]

Figure 3.2: Three examples of search space of model selection.

The uniform experimental design is a kind of space-filling design that can be used for computer and industrial experiments. The UD seeks design points that are uniformly scattered over the experimental domain. Suppose there are s parameters of interest over a domain C^s. The goal is to choose a set of m points P_m = {θ1, . . . , θm} ⊂ C^s such that these points are uniformly scattered over C^s. The search for UD points is a hard problem involving number theory and quasi-Monte Carlo methods (Niederreiter [44]; Fang and Wang [17]). We will simply refer to the UD tables on the UD-web given below. To implement the UD-based model selection, follow these steps:

1. Choose a parameter search domain and determine a suitable number of levels for each parameter (or factor, in design terminology).

2. Choose a suitable UD table to accommodate the number of parameters and levels. This can easily be done by visiting the UD-web: http://www.math.hkbu.edu.hk/UniformDesign

3. From the UD table, randomly determine the run order of the experiments and conduct the performance evaluation of each parameter combination in the UD.

4. Fit the SVM model. This is a knowledge discovery step from the built model: we find the combination of parameter values that maximizes the performance measure.

5. Repeat the whole procedure one more time in a smaller search domain centered about the optimal point found at the last stage.

The automatic UD-based model selection for SVMs [28] is implemented in Matlab, named “hibiscus”, in the SSVM toolbox available at http://dmlab1.csie.ntust.edu.tw/downloads/. Hibiscus focuses on selecting the regularization parameter C and the Gaussian kernel width parameter γ. It first determines a two-dimensional search box in the parameter space, automatically scaling the distance factor in the Gaussian kernel. Next, a 2-staged UD procedure is used for parameter selection within the search box. The first stage sets out a crude search for a highly likely candidate region of the global optimum; the second stage confines itself to a finer search within a smaller region centered around the optimal point found at the first stage. At each stage, a UD is used in place of the lattice grid as trial points. The method of nested UDs is not limited to 2 stages and can be applied sequentially. Two examples of UD schemes are presented below.
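The nesting idea can be sketched as follows. The toy MATLAB code below performs a two-stage coarse-to-fine search over (log2 C, log2 γ) using a small lattice of candidate points as a stand-in; an actual UD scheme would replace the lattice with a UD point set taken from the UD tables, and the cross-validation routine cvacc is a hypothetical user-supplied handle.

    % Nested coarse-to-fine search over (log2 C, log2 gamma); a lattice stand-in
    % for the 2-staged UD scheme.  cvacc(C, gamma) returns CV accuracy (user-supplied).
    function [bestC, bestG] = nested_search(cvacc, c_rng, g_rng, npts, nstage)
        for stage = 1:nstage
            lc = linspace(c_rng(1), c_rng(2), npts);   % candidate log2(C) values
            lg = linspace(g_rng(1), g_rng(2), npts);   % candidate log2(gamma) values
            best = -inf;
            for i = 1:npts
                for j = 1:npts
                    acc = cvacc(2^lc(i), 2^lg(j));
                    if acc > best
                        best = acc;  bestC = 2^lc(i);  bestG = 2^lg(j);
                    end
                end
            end
            % shrink the search box around the current best point for the next stage
            wc = (c_rng(2) - c_rng(1)) / 4;  wg = (g_rng(2) - g_rng(1)) / 4;
            c_rng = [log2(bestC) - wc, log2(bestC) + wc];
            g_rng = [log2(bestG) - wg, log2(bestG) + wg];
        end
    end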

[Figure 3.3 about here: first-stage and second-stage UD points over (log2 C, log2 γ); legend: best point, new UD point, duplicate point.]

Figure 3.3: The nested UD model selection with a 13-point UD at the first stage and a 9-point UD at the second stage.

[Figure 3.4 about here: first-stage and second-stage UD points over (log2 C, log2 γ); legend: best point, new UD point, duplicate point.]

Figure 3.4: The nested UD model selection with a 9-point UD at the first stage and a 5-point UD at the second stage.


Chapter 4

More on Kernel-Based Learning Algorithms

Besides the various variants of SVMs, there are other kernel-based learning algorithms. Many popular classical statistical data analysis tools are linear methods. For instance, principal components analysis (PCA) looks for an orthogonal decomposition of a sample space, say ℝ^p, into several one-dimensional linear subspaces ranked by the importance of their contribution to the data variation. The search for influential linear subspaces, or factors, is an important and frequently encountered task, and PCA has long been a successful tool for finding such linear subspaces or factors. However, it is not able to extract important nonlinear features. This chapter focuses on extending some classical statistical linear methods to the nonlinear setting via the kernel trick introduced earlier. We will include the following procedures: PCA, sliced inverse regression (SIR), canonical correlation analysis (CCA), Fisher linear discriminant analysis (FLDA), least-squares regression and ε-insensitive regression.

4.1 Kernel principal components analysis

As briefly mentioned above, classical PCA is not capable of finding nonlinear features, and there is a need for a nonlinear extension. Nonlinear PCA, e.g., functional PCA and curve PCA, is known in the statistical literature, but these are not our aim here. We restrict ourselves to the nonlinear extension via the kernel trick only.


4.1.1 Classical PCA

In the original data space X ⊂ ℝ^p, we are interested in finding successive one-dimensional directions which contain “the maximum information” about the data. Let u be the first such direction, scaled to unit length. The data projections along the direction u are given by Au. Here we adopt the variance as the criterion for information content. Then we look for u to maximize the following objective function:

max_{u∈ℝ^p}  var(Au)   subject to u′u = 1.     (4.1.1)

This problem is equivalent to

max_{u∈ℝ^p}  u′Σu   subject to u′u = 1,  where Σ is the covariance matrix of the data A.

Applying the Lagrange theory, the Lagrangian function is given by

L(u, α) = u′Σu − α(u′u − 1),  u ∈ ℝ^p,  α ∈ ℝ.

Taking the derivative with respect to u and setting it to zero, we have

2Σu− 2αu = 0 =⇒ Σu = αu. (4.1.2)

That is, u is an eigenvector (with unit length) and α is the associated eigenvalue. Plugging (4.1.2) back into the Lagrangian function, we have u′Σu = α. Therefore, u is obtained by finding the eigenvector associated with the leading eigenvalue α; denote them by u1 and α1, respectively. For the second principal component, we look for u again in the same problem (4.1.1) but with the extra constraint that u is orthogonal to u1. The Lagrangian then becomes

L(u, α, β) = u′Σu − α(u′u − 1) − β u1′u.     (4.1.3)

Using a similar procedure, we are able to find the second principal component and successive components sequentially.

Note that PCA first centers the data and then rotates the axes to line up with the directions of highest data variation. The action of rotating and lining up with the directions of highest variance amounts to finding a new coordinate system in which to record the data. Data recorded in the new system have a simple diagonal structure in their covariance matrix, with descending diagonal elements ranked by the importance (contribution to data variance) of the new coordinate axes.
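A compact MATLAB sketch of classical PCA as described above (center the data, eigendecompose the sample covariance, project on the leading directions); the data matrix A and the number of components r are assumed given.

    % Classical PCA sketch.  A is the n-by-p data matrix.
    Ac = A - repmat(mean(A, 1), size(A, 1), 1);   % center the data
    Sigma = (Ac' * Ac) / size(A, 1);              % sample covariance matrix
    [U, Lam] = eig(Sigma);
    [~, ord] = sort(diag(Lam), 'descend');        % rank eigenvalues in descending order
    r = 2;                                        % number of components (illustrative)
    Ur = U(:, ord(1:r));                          % r leading principal directions
    scores = Ac * Ur;                             % projections (principal components)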

Homework 3 You have seen the kernel trick in SVM. Apply it to PCA and think about how KPCA should be formulated.

Solution: Similar to classical PCA, we first have to center the input data Z in the feature space; that is, (In − 1n1n′/n)Z. Then the covariance matrix is of the form (1/n)Z′(In − 1n1n′/n)Z, and the optimization problem can be formulated as

max_{u∈Z}  u′Z′(In − 1n1n′/n)Zu   subject to u′u = 1.     (4.1.4)

The Lagrangian function is given by

L(u, α) = u′Z′(In − 1n1n′/n)Zu − α(u′u − 1).

Setting ∂L/∂u = 0 gives Z′(In − 1n1n′/n)Zu = αu, so u is in the column space of Z′, i.e., u = Z′v for some v ∈ ℝ^n. Plugging this into the problem above, the optimization problem becomes

max_v  v′ZZ′(In − 1n1n′/n)ZZ′v   subject to v′ZZ′v = 1.     (4.1.5)

By the kernel trick ZZ′ = K, and assuming K nonsingular, we can solve this problem by finding the leading eigenvectors v of the following eigenvalue problem:

(In − 1n1n′/n)Kv = αv     (4.1.6)

with normalization v′Kv = 1, or equivalently αv′v = 1. The normalization changes the scales of the principal components to have unit length in the sense of v′Kv = 1, but it does not change the orientation of the principal components. □

4.1.2 Kernel PCA

Full kernel PCA. From (4.1.6) in Homework 3 we see that the KPCA works by extracting the leading eigen-components of the centered kernel matrix. Note that the eigenvector, according to (4.1.5), should be normalized by

v′Kv = 1, or equivalently v′(In − 1n1′n/n)Kv = 1.    (4.1.7)

As (In − 1n1′n/n)Kv = αv, the normalization can be expressed as αv′v = 1. In summary, the KPCA solves the eigenvalue-decomposition problem for (In − 1n1′n/n)K. Denote its largest eigenvalue by α1 and its associated eigenvector by v1. We can sequentially find the second and the third principal components, and so on. From (4.1.7) we have that vj, j = 1, 2, . . ., should be normalized by v′jKvj = 1, or equivalently by αj v′jvj = 1.

From Homework 3 we see that Z′vj is the jth principal component in the feature space Z. For an x ∈ ℝ^p and its feature image z = Φ(x) ∈ Z, the projection of Φ(x) along the jth component Z′vj is given by

Φ(x)′Z′vj = z′[z1, z2, . . . , zn]vj = K(x,A)vj.    (4.1.8)

Therefore, the projection of the feature image z = Φ(x) onto the leading r eigen-components in Z is given by

[K(x,A)v1; . . . ; K(x,A)vr] ∈ ℝ^r.
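A minimal NumPy sketch of the eigenproblem (4.1.6) and the projection (4.1.8); here K is the precomputed n × n kernel matrix and kernel(X, A) stands for whatever routine produced it (both names are assumptions for illustration):

    import numpy as np

    def kpca(K, r):
        """Leading r KPCA directions from the n-by-n kernel matrix K = K(A, A)."""
        n = K.shape[0]
        C = np.eye(n) - np.ones((n, n)) / n       # centering matrix I_n - 1_n 1_n'/n
        alpha, V = np.linalg.eig(C @ K)           # eigenproblem (4.1.6); C @ K need not be symmetric
        alpha, V = alpha.real, V.real
        idx = np.argsort(alpha)[::-1][:r]         # keep the r leading (positive) eigenvalues
        alpha, V = alpha[idx], V[:, idx]
        V = V / np.sqrt(alpha)                    # rescale so that alpha_j * v_j' v_j = 1
        return V, alpha

    # projection of new points, as in (4.1.8): scores = kernel(X_new, A) @ V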

Reduced kernel PCA starts with a random sampling of a portion of the columns of the full kernel matrix. Next, by an SVD it extracts the leading right singular vectors ṽ of the reduced kernel matrix K̃ = K(A, Ã):

(In − 1n1′n/n)K(A, Ã)ṽ = αṽ, normalized so that αṽ′ṽ = 1.    (4.1.9)

The feature image Φ(x) projected along the leading r components is then given by [K(x, Ã)ṽ1; . . . ; K(x, Ã)ṽr] ∈ ℝ^r.

As we are interested in extracting only the right singular vectors, this can be done by solving the eigenvalue problem of K̃′(In − 1n1′n/n)K̃, whose eigenvalues and eigenvectors are α²j and ṽj, respectively. That is, if we work directly on the eigenvalue problem of K̃′(In − 1n1′n/n)K̃ and denote its eigenvalues by βj, then the normalization should be √βj ṽ′j ṽj = 1.

4.2 Kernel sliced inverse regression

The PCA approach can be used as an unsupervised dimension reduction tool. Often the input data come with class labels, and supervised dimension reduction using the category information becomes possible. Sliced inverse regression (SIR) [35] was originally proposed for dimension reduction through inverse regression. We will depart from the original inverse regression spirit for the time being and focus on a classification framework, which is easier for the first-time reader from a non-statistics background to get a grasp of the method.

4.2.1 Classical SIR

Apart from its statistical meaning, SIR, technically speaking, solves a generalized eigenvalue-eigenvector decomposition problem. We have seen the decomposition problem of a covariance matrix in the PCA section. In a classification problem with given data {(xi, yi)}ni=1, xi ∈ ℝ^p and yi ∈ {1, . . . , k}, here are the steps of SIR for finding the leading directions that contain "maximal information" about class membership (a small code sketch follows Remark 2 below):

1. Slice (classify) the input data A = [x′1; . . . ; x′n] into k many slices by their corresponding labels y.

2. For each slice s = 1, . . . , k, compute the slice mean, denoted by x̄s. Also compute the grand mean, denoted by x̄.

3. Use the slice mean x̄s as the representative for each slice to form a new input data matrix. That is, replace each individual input point xi by its slice mean to obtain a new input data matrix, denoted by AS (still of size n × p).

4. Compute Σb = Cov(AS), named the between-slice (or between-class) covariance matrix, as follows:

Σb = (1/n) Σ_{s=1}^{k} ns (x̄s − x̄)(x̄s − x̄)′,

where ns is the number of data points in the sth slice. Also compute the sample covariance matrix of x: Σ = Σ_{i=1}^{n} (xi − x̄)(xi − x̄)′/n.

5. Solve the generalized eigenvalue problem of Σb with respect to Σ (assume Σ has rank p). That is, find the leading, say r many (r ≤ k − 1), eigenvalues α and eigenvectors u for the problem:

Σb ui = αi Σ ui subject to u′iΣui = 1 and u′jΣui = 0 for 1 ≤ j < i ≤ r.

6. Collect the leading r SIR directions into a matrix U = [u1, . . . , ur]. U is orthogonal with respect to Σ, i.e., U′ΣU = Ir. Project the data x along the SIR directions, i.e., U′x = [u′1x; . . . ; u′rx]. We call u′1x the first SIR variate, u′2x the second SIR variate, and so on.

Remark 2 For an ordinary eigenvalue problem, the decomposition is done with respect to an identity matrix. For a generalized eigenvalue problem, the decomposition is done with respect to a strictly positive definite matrix. PCA solves an ordinary eigenvalue problem, while SIR solves a generalized one.
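Steps 4-5 amount to a few lines of linear algebra. A minimal sketch with NumPy/SciPy (X is the n × p data matrix, y the label vector; names are illustrative):

    import numpy as np
    from scipy.linalg import eigh

    def sir_directions(X, y, r):
        """Leading r SIR directions from the generalized eigenproblem Sigma_b u = alpha Sigma u."""
        n, p = X.shape
        xbar = X.mean(axis=0)
        Sigma = (X - xbar).T @ (X - xbar) / n            # total covariance
        Sigma_b = np.zeros((p, p))
        for s in np.unique(y):                           # between-slice covariance
            Xs = X[y == s]
            d = Xs.mean(axis=0) - xbar
            Sigma_b += len(Xs) * np.outer(d, d) / n
        alpha, U = eigh(Sigma_b, Sigma)                  # generalized symmetric eigenproblem
        order = np.argsort(alpha)[::-1][:r]
        return U[:, order], alpha[order]                 # eigh normalizes so that u' Sigma u = 1

    # SIR variates: X @ U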

From the steps above, we see that SIR is designed to find the leading directions with maximal discriminant power for the class information. Here the discriminant ability is judged by the between-class variance scaled by the total variance. Before we formally introduce the kernel generalization of SIR, we present a quick pictorial view of the effect of kernel SIR (KSIR) compared with the classical SIR. Figure 4.1 is the scatter plot of the first 2 SIR variates using the Pendigits data, and Figure 4.2 is the scatter plot of the first 2 KSIR variates.

Figure 4.1: SIR view over the 1st-and-2nd variates.

Homework 4 Prescribe a kernelized SIR algorithm and try it out on the Wine Recognition data set available at the UCI Repository. (Hint: Replace A by K and proceed the same as in the classical SIR. However, since the covariance matrix of K is singular, add a small diagonal matrix δI to Cov(K) for numerical stability.)

4.2.2 Kernel SIR

SIR is designed to find a linear dimension reduction subspace by extracting the leading variance components of the sliced means scaled by the overall variance across all classes. Define N(Σb) = {u : Σbu = 0}, the null column space of Σb. SIR is unable to find directions which reside in this null space N, nor directions which have a small angle with the null space. SIR is also limited to linear directions and is unable to extract nonlinear features, as you have seen in the Pendigits data demonstration. Kernelized SIR [56] can provide an easy remedy for the difficulties encountered by the classical SIR. A nonlinear parsimonious characterization of the data is the key goal of the kernel extension.

Figure 4.2: KSIR view over the 1st-and-2nd variates.

Full kernel SIR. As before, in looking for a nonlinear extension we first map the original input data to the feature space Z, and then nominally carry out the SIR procedure on the feature data in Z. That is, we look for the leading directions in the feature space that have the highest between-slice variance with respect to the variance of the whole data set. In other words, we intend to solve the following optimization problem:

max_{u∈Z} u′Z′H′(I − 1n1′n/n)HZu subject to u′Z′(I − 1n1′n/n)Zu = n,    (4.2.10)

where H is a block diagonal matrix with the sth block given by Bs = (1/ns) 1ns1′ns, with ns the number of observations in the sth slice. (Here we have implicitly assumed that the data are already sorted by slice.) Note that H is symmetric and idempotent, i.e., H² = H. It is a projection matrix, which maps each feature data input to the slice mean of its corresponding class. As in the KPCA case, we confine our search of eigenvectors to the column space of Z′, i.e., u = Z′v for some v ∈ ℝ^n. Then, by the kernel trick K = ZZ′, the optimization problem (4.2.10) becomes

max_v v′ZZ′H(In − 1n1′n/n)HZZ′v subject to v′ZZ′(In − 1n1′n/n)ZZ′v = n,    (4.2.11)

which is equivalent to

max_v v′Cov(HK)v subject to v′Cov(K)v = 1.    (4.2.12)

For successive KSIR directions, some extra orthogonality constraints are applied. The KSIR problem (4.2.12) is exactly the same as the classical SIR stated in Steps 1-6 above, except that the original input data matrix A is replaced by the kernel data matrix K. Note that Cov(K) is singular, and the singularity will cause numerical instability in the problem. An easy remedy is to add a small diagonal matrix δI to the covariance matrix. That is, we solve the following problem:

Cov(HK)v = α {Cov(K) + δI} v.    (4.2.13)

Let V = [v1, v2, . . . , vr] ∈ ℝ^{n×r} consist of the r leading KSIR directions and let C(V) be the column space spanned by V. Again, we say {v1, . . . , vr} are the KSIR directions, {Kv1, . . . , Kvr} are the KSIR variates, and C(V) is the KSIR dimension reduction subspace.
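A minimal sketch of (4.2.13), assuming K = K(A, A) has been precomputed and y holds the class labels (the slice-mean matrix HK is formed explicitly for clarity):

    import numpy as np
    from scipy.linalg import eigh

    def ksir_directions(K, y, r, delta=1e-3):
        """Leading r KSIR directions from Cov(HK) v = alpha (Cov(K) + delta I) v."""
        n = K.shape[0]
        HK = K.copy()
        for s in np.unique(y):                     # replace each row of K by its slice-mean row
            HK[y == s] = K[y == s].mean(axis=0)
        def cov(M):
            Mc = M - M.mean(axis=0)
            return Mc.T @ Mc / n
        alpha, V = eigh(cov(HK), cov(K) + delta * np.eye(n))
        order = np.argsort(alpha)[::-1][:r]
        return V[:, order], alpha[order]

    # KSIR variates for the training data: K @ V; for a new point x: K(x, A) @ V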

Reduced kernel SIR works directly on a reduced kernel matrix K̃ = K(A, Ã), or K̃ = K(A, Ã)G with an optimal-basis transform G as introduced in the reduced kernel method earlier. The RKSIR solves the following generalized eigenvalue problem:

Cov(HK̃)ṽ = α {Cov(K̃) + δI} ṽ.    (4.2.14)

We then say {ṽ1, . . . , ṽr} are the KSIR directions, {K̃ṽ1, . . . , K̃ṽr} are the KSIR variates, and C(Ṽ) is the KSIR dimension reduction subspace, where Ṽ = [ṽ1, ṽ2, . . . , ṽr]. Some common suggestions for implementing the RKSIR are: for a small-sized problem, use the leading optimal bases based on the full kernel to form the reduced kernel; for a medium- to large-sized problem, either use a random subset to form the reduced kernel, or start with an intermediate random subset and then search for leading optimal bases therein.

Combined with linear methods. As the input space has been mapped to Z, which has very high (often infinite) dimensionality, there is enough flexibility and versatility for linear models therein. It is thus sufficient to consider only simple linear models in Z for statistical inference. For instance, KSIR can be combined with various linear SVM algorithms or the classical linear discriminant analysis for classification problems, and with various linear SVR algorithms or the least-squares linear regression method for regression problems. In other words, various linear algorithms can act on top of the KSIR variates for dimension reduction and data analysis.

4.3 Kernel canonical correlation analysis

Canonical correlation analysis (CCA) studies the relation between two sets of variables. The classical CCA describes only the linear relation between the two sets, while its kernel extension, KCCA, can be used to describe a general nonlinear relation.

4.3.1 Classical CCA

Suppose the random variable x of p components can be partitioned into x = [x(1); x(2)] with p1 and p2 components, respectively. The CCA aims to study the relation between x(1) and x(2) and eventually defines a pair of new coordinate systems in ℝ^{p1} and ℝ^{p2} which make the description of the "relation" between x(1) and x(2) as "simple" as possible. One way to study and to simplify the relationship structure is through the factorization of the covariance matrix. Let X(1) = [x(1)′1; . . . ; x(1)′n] be the design matrix of the first set of variables, and let X(2) = [x(2)′1; . . . ; x(2)′n] be the design matrix of the second set of variables. The sample covariance matrix of X(i) and X(j), i, j = 1, 2, is given by

Σij = (1/n) X(i)′(In − 1n1′n/n)X(j).

The CCA finds the pair of vectors (u1, v1) ∈ ℝ^{p1+p2}, called the first pair of canonical variates, such that

max_{u,v} u′Σ12v subject to u′Σ11u = 1 and v′Σ22v = 1.    (4.3.15)

Successive pairs can be found by solving the same problem with extra orthogonality (uncorrelatedness) constraints. The classical CCA describes the linear relation by reducing the correlation structure between these two sets of variables to the simplest possible form by means of linear transformations on X(1) and X(2). It finds pairs (ui, vi) ∈ ℝ^{p1+p2} in the following way. The first pair maximizes the correlation between u′1X(1) and v′1X(2) subject to the unit variance constraints Var(u′1X(1)) = Var(v′1X(2)) = 1; the kth pair (uk, vk), which is uncorrelated with the first k − 1 pairs,¹ maximizes the correlation between u′kX(1) and v′kX(2), again subject to the unit variance constraints. The sequence of correlations between u′iX(1) and v′iX(2) are called the canonical correlation coefficients of X(1) and X(2), (ui, vi) are called the ith pair of canonical variates, and u′iX(1) and v′iX(2) are called the ith pair of canonical variables. The canonical variates² {ui}_{i=1}^{p1} and {vi}_{i=1}^{p2} can serve as new systems of coordinate axes for the two components x(1) and x(2). These new systems are simply linear systems of the original ones. Thus, the classical CCA can only be used to describe linear relation. Via such linear relation it finds only a linear dimension reduction subspace and a linear discriminant subspace, too.

¹Being uncorrelated is in terms of u′iΣ11uk = 0 and v′iΣ22vk = 0 for all i = 1, . . . , k − 1.
²Assume that p1 ≤ p2. After p1 − 1 pairs of canonical variates, one can always fill up the remaining orthogonal variates for both sets of variables.
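For reference, (4.3.15) can be solved through a symmetric generalized eigenproblem. A minimal NumPy/SciPy sketch (the small ridge term is an implementation convenience, not part of the classical formulation):

    import numpy as np
    from scipy.linalg import eigh

    def cca(X1, X2, r, reg=1e-8):
        """Leading r pairs of canonical variates for data matrices X1 (n x p1) and X2 (n x p2)."""
        n = X1.shape[0]
        X1c, X2c = X1 - X1.mean(0), X2 - X2.mean(0)
        S11 = X1c.T @ X1c / n + reg * np.eye(X1.shape[1])
        S22 = X2c.T @ X2c / n + reg * np.eye(X2.shape[1])
        S12 = X1c.T @ X2c / n
        M = S12 @ np.linalg.solve(S22, S12.T)             # S12 S22^{-1} S21
        rho2, U = eigh(M, S11)                            # M u = rho^2 S11 u, with u'S11u = 1
        order = np.argsort(rho2)[::-1][:r]
        U = U[:, order]
        V = np.linalg.solve(S22, S12.T @ U)               # v proportional to S22^{-1} S21 u
        V = V / np.sqrt(np.diag(V.T @ S22 @ V))           # rescale so that v'S22v = 1
        return U, V, np.sqrt(np.clip(rho2[order], 0, 1))  # canonical correlations

    # canonical variables: (X1 - X1.mean(0)) @ U and (X2 - X2.mean(0)) @ V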

4.3.2 Kernel CCA

Kernel methods provide a convenient way to generalize CCA nonlinearly. We form the kernel data

K(1) = K1(A(1), A(1)) and K(2) = K2(A(2), A(2)),    (4.3.16)

where A(i) = [x(i)′1; . . . ; x(i)′n] for i = 1, 2, and K1 and K2 are two positive definite kernel functions. The KCCA works by carrying out the classical CCA procedure on the kernel data [K(1) K(2)]. In summary, the KCCA procedure consists of two major steps:

(a) Transform the data points to a kernel representation as in (4.3.16).

(b) The classical CCA procedure acts on the kernel data matrix [K(1) K(2)]. Note that some sort of regularization is necessary here to solve the associated spectrum problem (4.3.15) of extracting the leading canonical variates and correlation coefficients. Here we use the reduced kernel technique again to cut down the problem size and to stabilize the numerical computation. Only partial columns are computed to form reduced kernel matrices, denoted by K̃(1) and K̃(2). The classical CCA procedure is then applied to [K̃(1) K̃(2)].³

³The reduced kernel K̃(i) can be formed either from a random subset, K̃(i) = Ki(A(i), Ã(i)), or from further optimal bases therein, K̃(i) = Ki(A(i), Ã(i))G(i).

As the KCCA is simply the classical CCA acting on kernel data, existing code from standard statistical packages is ready for use. In the example below we use the Matlab m-file "canoncorr", which implements the classical CCA, on the kernel data.
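In the same spirit, the cca sketch above can be applied directly to kernel data; a hypothetical usage (gaussian_kernel stands for whatever routine builds the kernel matrices in (4.3.16)):

    # KCCA = classical CCA on kernel data (sketch; gaussian_kernel is an assumed helper)
    K1 = gaussian_kernel(A1, A1)                 # K(1) = K_1(A(1), A(1)), or a reduced kernel
    K2 = gaussian_kernel(A2, A2)                 # K(2) = K_2(A(2), A(2))
    U, V, rho = cca(K1, K2, r=5, reg=1e-3)       # a larger ridge plays the role of the regularization
    # kernel canonical variables for new inputs X: gaussian_kernel(X, A1) @ U, etc.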

Example 1 We use the Pendigits data set from the UCI Repository for a visual demonstration of nonlinear discriminant analysis using KCCA. We use the 7494 training instances for explanatory purposes. For each instance there are 16 input measurements (i.e., xj is 16-dimensional) and a corresponding class label yj from {0, 1, 2, . . . , 9}. A Gaussian kernel with window width (√10S1, . . . , √10S16) is used to prepare the kernel data, where the Si's are the coordinate-wise sample variances. A reduced kernel of size 300, equally stratified over the 10 digit groups, is used and serves as K̃(1) in step (b) of the KCCA procedure. We use yj, the class labels, as our K(2) (no kernel transformation involved). Precisely,

K(2) = [Y′1; . . . ; Y′n]  (an n × 10 matrix), with Y′j = (0, . . . , 0, 1, 0, . . . , 0),

where Yj is a categorical variable for class membership. If yj = i, i = 0, 1, . . . , 9, then Yj has the entry 1 in the (i+1)th place and 0 elsewhere. Now we want to explore the relation between the input measurements and their associated class labels using CCA and KCCA. The training data are used to find the leading CCA- and KCCA-found variates. Next, 20 test samples from each digit group are drawn randomly from the test set.⁴ Scatter plots of the test data projected along the leading CCA-found variates (Figure 4.3) and the leading KCCA-found variates (Figure 4.4) are given below. Different classes are labeled with distinct digits. The superior discriminant power of KCCA over CCA is clear.

Figure 4.3: Scatter plot of pen-digits over CCA-found variates (pairwise panels of the 1st through 5th canonical variates).

⁴The test set has 3498 instances in total, with on average around 350 instances for each digit. For clarity of the plots and to avoid excess ink, we use only 20 test points per digit.

Figure 4.4: Scatter plots of pen-digits over KCCA-found variates (pairwise panels of the 1st through 5th kernel canonical variates).

4.4 Kernel Fisher discriminant analysis

Fisher linear discriminant analysis (FLDA, or simply LDA) is a commonly used and time-honored tool for multiclass classification because of its simplicity and probabilistic outputs. Motivated by the active development of statistical learning theory [53] and the popular and successful usage of various kernel machines [14, 15], a hybrid approach has emerged which combines the idea of feature mapping with the classical FLDA to prescribe a kernelized LDA algorithm.

4.4.1 Classical FLDA

For binary classification, FLDA finds an optimal normal direction of the separating hyperplane, w′x + b = 0, by maximizing the so-called Rayleigh coefficient:

max_w J(w) ≡ w′Σb w / w′Σw w,    (4.4.17)

where Σb and Σw are respectively the between- and within-class covariance matrices. The offset is determined by letting the hyperplane pass through the midpoint of the two class centroids, i.e., b = −w′(x̄1 + x̄2)/2, where x̄j = Σ_{i∈Ij} xi/nj with Ij the index set of the jth class.

Consider a k-class problem based on training data {(xi, yi)}ni=1, where xi ∈ X ⊂ ℝ^p is a p-variate input measurement and yi ∈ {1, . . . , k} indicates the corresponding class membership. With k − 1 ≤ p, the FLDA finds k − 1 canonical variates that are optimal in maximizing the between-class variation (for separating the classes) with respect to the within-class variation, and its decision boundaries are linear in these canonical variates. The FLDA canonical variates wj, j = 1, . . . , k − 1, can again be determined by maximizing the Rayleigh coefficient given below:

max_w J(w) ≡ w′Σb w / w′Σw w.    (4.4.18)

The optimality problem (4.4.18) of finding the FLDA canonical variates is a generalized eigenvalue decomposition problem. Data are then projected onto these (k − 1) canonical variates and class centroids are calculated in this (k − 1)-dimensional subspace. A test point is assigned to the class with the closest class centroid based on the Mahalanobis distance. The Mahalanobis distance between two points x and u projected along the FLDA canonical variates is given by

dM(x, u) = (x − u)′W Σ⁻¹w∗ W′(x − u),

where W = [w1, . . . , wk−1] is a p × (k − 1) matrix consisting of the FLDA canonical variates, and Σw∗ is the within-class covariance matrix of the data projections {W′xi}ni=1. FLDA can also be justified under a homoscedastic normal model assumption for maximum-likelihood based discriminant analysis.

4.4.2 Kernel FDA

With the kernel trick, it is natural to think of prescribing a hybrid approach combining FLDA with a kernel machine. Such a hybrid usage can be traced back to Mika, Ratsch, Weston, Scholkopf and Muller [42], Baudat and Anouar [2] and Mika [41]. The main idea is to first map the data in the input space X ⊂ ℝ^p into the spectrum-based feature space Z via the feature map Φ. Next, as before, the classical FLDA procedure is operated on the feature data. The KFDA finds (k − 1)-many (or fewer, if further dimension reduction is preferred) canonical variates that maximize the Rayleigh coefficient in the feature space Z:

max_{u∈Z} J(u) ≡ [u′Z′H′(I − 1n1′n/n)HZu/n] / [u′Z′(I − H)Zu/n + rP(u)],

where H is defined the same as in the KSIR section. The numerator and the first term in the denominator are respectively the between- and within-class sample covariances based on the feature data {z1, . . . , zn}. The term P(u) is a penalty functional on u with associated regularization parameter r. It is added to the denominator to cope with the singularity problem. It is shown (Mika et al., 1999) that the solution u can be expanded as

u = Σ_{i=1}^{n} vi Φ(xi) = Z′v,    (4.4.19)

where Z, as before, is the design matrix consisting of the feature data. The coefficient vector v can then be obtained as the solution to the following optimization problem:

max_{v∈ℝ^n} J(v) ≡ v′Cov(HK)v / [v′Cov(K − HK)v + rP(v)],    (4.4.20)

where a penalty functional P(v) is added to overcome the numerical problems. A new test point x is assigned to the class with the closest class centroid in the subspace spanned by the KFDA-found canonical variates. The distance measure here is the Mahalanobis distance using the within-class covariance as the scale matrix. For further theoretical study of KFDA, see Huang and Hwang [26].
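A minimal sketch of (4.4.20) with a ridge penalty P(v) = v′v (the penalty choice and the helper names are assumptions for illustration):

    import numpy as np
    from scipy.linalg import eigh

    def kfda_directions(K, y, r=None, ridge=1e-3):
        """Maximize v'Cov(HK)v / (v'Cov(K - HK)v + ridge * v'v) as a generalized eigenproblem."""
        n = K.shape[0]
        HK = K.copy()
        for s in np.unique(y):                     # replace each row of K by its class-mean row
            HK[y == s] = K[y == s].mean(axis=0)
        def cov(M):
            Mc = M - M.mean(axis=0)
            return Mc.T @ Mc / n
        B = cov(HK)                                # between-class part
        W = cov(K - HK) + ridge * np.eye(n)        # within-class part plus the ridge penalty
        alpha, V = eigh(B, W)
        r = r if r is not None else len(np.unique(y)) - 1
        return V[:, np.argsort(alpha)[::-1][:r]]

    # KFDA variates: K @ V; classify a test point by the nearest class centroid in this subspace.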

4.5 Support vector regression

Linear regression models have long been an important and widely used modelling scheme for data analysis in many scientific fields. Their popularity comes from the simplicity of the model structure and the rich theory of least squares behind it.

In this section we introduce two support vector regression (SVR) approaches. One is the regularized least squares (RLS) SVR [50, 20] and the other is the ε-SSVR [31]. In the classical regression approach, we often estimate the regression coefficients by least squares for simple linear systems, or by regularized least squares for complex linear systems. The ε-SSVR is a smoothing and adaptive strategy for implementing SVR with the ε-insensitive loss. It regresses the responses on support vectors, ignores the effect of small noise at a level less than ε, and is more robust than the RLS-SVR. Also, as it provides a further fine adaptation to the data, the ε-insensitive SVR is a bit more accurate than the RLS-SVR, but at the expense of computing time. For a general-purpose tutorial on SVR, we refer the reader to Smola and Scholkopf [47].

4.5.1 RLS-SVR: a ridge approach

The main idea of the least squares approach is to minimize the squared errors of regression. The sum of squared residuals for multiresponse regression is given by

SSR(V, b) = Σ_{j=1}^{k} ‖Y·j − K̃vj − bj1n‖²₂,

where Y is a multivariate response variable and ‖Y·j − K̃vj − bj1n‖²₂ = Σ_{i=1}^{n} (yij − K(xi, Ã)vj − bj)², with Ã ⊆ A. The unique solution to min_{V∈ℝ^{n×k}, b∈ℝ^{1×k}} SSR(V, b) is usually good enough, and the corresponding estimators are the best linear unbiased estimators for V and b under the assumption of a normal error distribution. However, if the columns of the matrix K̃ are highly correlated, then V will be poorly determined and exhibit high variance. This is because the information matrix [K̃ 1n]′[K̃ 1n] is ill-conditioned, a situation that often happens with kernel data. Ridge regression is a method to lessen the problem. It shrinks the regression coefficients by imposing a penalty on their norm. The regression coefficients are derived from the following RLS problem:

min_{V∈ℝ^{n×k}, b∈ℝ^{1×k}} Σ_{j=1}^{k} { (C/2)‖Y·j − K̃vj − bj1n‖²₂ + (1/2)‖vj‖²₂ }.    (4.5.21)

Notice that each pair of vj and bj can be determined independently of the others. The problem (4.5.21) can be decomposed into k-many RLS subproblems, each solving for one individual vj and bj. These k subproblems share a common information matrix on the left-hand side of the normal equations and differ only on the right-hand side. Thus, we only need to run a direct method for solving the linear system once. In contrast to the multicategory proximal SVM [20], which requires running the direct method k times for solving the linear systems, our version of RLS regression in equation (4.5.21) needs only one run on a same-sized linear system.
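A minimal sketch of (4.5.21), sharing one normal-equation solve across all k responses (Ktil and Y are assumed to be precomputed; names are illustrative):

    import numpy as np

    def rls_svr(Ktil, Y, C=1.0):
        """Solve (4.5.21): ridge regression of the responses Y (n x k) on [Ktil, 1]."""
        n, m = Ktil.shape
        M = np.hstack([Ktil, np.ones((n, 1))])        # augment with the offset column
        reg = np.diag(np.r_[np.ones(m), 0.0])         # penalize v but not the offset b
        G = C * M.T @ M + reg                         # common information matrix
        sol = np.linalg.solve(G, C * M.T @ Y)         # one factorization, k right-hand sides
        return sol[:-1, :], sol[-1, :]                # V (m x k) and b (1 x k)

    # prediction for new inputs X: K(X, A_tilde) @ V + b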

4.5.2 Smooth ε-support vector regression (ε-SSVR)

The ε-insensitive support vector machine for regression has been a popular tool for function estimation in recent years [47]. Applying the idea of SVMs, the SVR adopts a regression model of the form η(x) = K(x, Ã)v + b. The regression function η(x) is made as smooth as possible in fitting the training data by placing a penalty on ‖v‖²₂/2. The data-infidelity error is made ε-insensitive in order to tolerate errors within a certain tolerance level, say ε. For a univariate response regression, ε-SVR can be formulated as the following unconstrained minimization problem:

min_{[v;b]∈ℝ^{n+1}} C Σ_{i=1}^{n} |yi − η(xi)|ε + (1/2)‖v‖²₂,    (4.5.22)

where η(x) = K(x, Ã)v + b and |t|ε = (|t| − ε)₊ = max(0, |t| − ε). Conventionally, ε-SVR is reformulated as a constrained minimization problem [53, 15] by introducing more variables and constraints, which enlarges the problem size and computational complexity. The smooth ε-insensitive support vector regression [31] is a smoothing strategy for solving the ε-SVR. It reformulates the ε-SVR directly as an unconstrained minimization problem and solves the problem in the primal space. This unconstrained primal problem takes the form of a regularized ε-insensitive least squares:

min_{[v;b]∈ℝ^{n+1}} (C/2) Σ_{i=1}^{n} |yi − K(xi, Ã)v − b|²ε + (1/2)(‖v‖²₂ + b²)    (4.5.23)

= min_{[v;b]∈ℝ^{n+1}} (C/2) |y − K̃v − b1n|²ε + (1/2)(‖v‖²₂ + b²) in matrix form.

The plus function x₊ = max(0, x) can be approximated by a smooth p-function introduced earlier. The optimization problem (4.5.23) can then be approximated by solving the following one:

min_{[v;b]∈ℝ^{n+1}} (C/2) Σ_{i=1}^{n} p²ε(yi − K(xi, Ã)v − b, ρ) + (1/2)(‖v‖²₂ + b²).    (4.5.24)

By taking advantage of the strong convexity and twice differentiability of this new model, a globally and quadratically convergent Newton-Armijo algorithm is prescribed for the ε-SSVR [31]. Applying the ε-SSVR to our multiresponse regression setting, we solve k-many ε-SSVR problems defined as follows:

min_{[vj;bj]∈ℝ^{n+1}} (C/2) 1′n p²ε((Y·j − K̃vj − bj1n), ρ) + (1/2)(‖vj‖²₂ + b²j),  j = 1, . . . , k,    (4.5.25)

where 1′n p²ε((Y·j − K̃vj − bj1n), ρ) = Σ_{i=1}^{n} p²ε((yij − K(xi, Ã)vj − bj), ρ). By bringing in the idea of ε-insensitivity, one can improve the numerical stability as well as the prediction accuracy. The solution of the ε-SSVR is close to that of the RLS-SVR. The RLS regression has zero insensitivity, while the ε-SSVR provides a further fine adaptation to the training data. As the ε-SSVR is solved by the Newton-Armijo algorithm, which is an iterative method, we have to repeat this procedure k times. The running time of the Newton-Armijo algorithm depends heavily on the quality of the initial point. If the point is close enough to the optimal solution, the Newton-Armijo algorithm becomes the Newton method without turning on the Armijo step in the iterations. In practice, we use the solution generated by the RLS-SVR as the initial point, which is often close enough to the optimal one for the ε-SSVR.
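A rough sketch of (4.5.24) for a single response, using a generic quasi-Newton solver in place of the Newton-Armijo method; the smooth plus function p(x, ρ) = x + log(1 + e^{-ρx})/ρ is the form commonly used in the SSVM literature and is assumed here:

    import numpy as np
    from scipy.optimize import minimize

    def p_eps(t, eps, rho):
        """Smoothed eps-insensitive residual: p(|t| - eps, rho)."""
        x = np.abs(t) - eps
        return x + np.logaddexp(0.0, -rho * x) / rho

    def essvr(Ktil, y, C=1.0, eps=0.1, rho=5.0, w0=None):
        """Sketch of the smooth eps-SVR objective (4.5.24), minimized by L-BFGS."""
        n, m = Ktil.shape
        def obj(w):
            v, b = w[:m], w[m]
            r = y - Ktil @ v - b
            return 0.5 * C * np.sum(p_eps(r, eps, rho) ** 2) + 0.5 * (v @ v + b * b)
        w0 = np.zeros(m + 1) if w0 is None else w0    # in practice, warm-start from the RLS-SVR solution
        res = minimize(obj, w0, method="L-BFGS-B")
        return res.x[:m], res.x[m]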

4.6 SIR and KSIR revisited

In an earlier section we formulated SIR and KSIR as a generalized eigenvalue problem for classification. Here we give an account of the regression setup. SIR summarizes a regression model as follows [35]:

y = f(β′1x, . . . , β′rx; ε), βj, x ∈ ℝ^p,    (4.6.26)

where r (often ≪ p) is the effective dimensionality and {β1, . . . , βr} forms a basis set of the dimension reduction subspace. This model implies that most of the relevant information in x about y is contained in {β′1x, . . . , β′rx}. The model (4.6.26) does not impose any structure on f, which can be of free form. It is a very weak assumption for dimension reduction, as it only assumes the existence of some linear dimension reduction subspace. Under this model assumption, SIR estimates the dimension-reduction subspace by using the notion of inverse regression [35, 13]. The SIR procedure goes as follows. First, the regression responses yi are sorted in ascending (or descending) order and the corresponding explanatory input variables xi are rearranged accordingly. With a given number of slices, say k, the responses and explanatory input variables are sliced into k slices. Next, similar steps of forming the between- and within-slice covariances and extracting the leading eigenvectors of the between-slice covariance are carried out.

In the classical SIR, the model assumption (4.6.26) says that there exists a linear dimension reduction subspace and that the underlying objective function f can be of free form. In looking for a nonlinear extension, we work on the feature space Z by assuming the following dimension reduction model:

y = f(u′1z, . . . , u′rz; ε), z, u1, . . . , ur ∈ Z.    (4.6.27)

As the input space has been mapped to a high dimensional feature space Z, which contains rich enough local bases, linear models in Z are versatile enough for modelling nonlinear patterns. It is thus not necessary to consider a free-form f, and a simple linear model for f should be sufficient, i.e.,

y = Σ_{k=1}^{r} ck u′k z + ε, z, u1, . . . , ur ∈ Z.    (4.6.28)

Equivalently, the underlying model (4.6.28) can be expressed as

y = f(x) + ε = Σ_{k=1}^{r} ck u′k Φ(x) + ε, u1, . . . , ur ∈ Z, x ∈ X.    (4.6.29)

Similar to the classical approach, we look for the leading directions in the feature space that have the highest between-slice variance with respect to the variance of the whole data set. That is, we intend to solve the same optimization problem as (4.2.10). By confining the search of u ∈ Z to the subspace spanned by {z1, . . . , zn}, this leads to the same generalized eigenvalue problem as (4.2.12):

max_α α′Cov(HK)α subject to α′(Cov(K) + δI)α = 1,

where δI is added to stabilize the numerical problem for a singular Cov(K).

Example 2 For regression data visualization, we create a ten-dimensional synthetic regression data set with 200 instances. We take the first 100 instances as the training set and the remaining 100 instances as the test set. Each attribute is generated from a normal distribution with µ (mean) = 0 and σ (standard deviation) = 20, i.e., x ∼ N(0, 20²). The responses are generated from y = ‖x‖₂/10 + ε, where ε ∼ N(0, 0.1²). Some data analysis results are depicted as follows. Figure 4.5 gives the data scatter of (x, y). As x is 10-dimensional, we randomly pick 4 coordinates and plot (x, y) along these coordinates, one coordinate axis of x versus y at a time. It shows no obvious room for a linear dimension reduction subspace. We next look further into the comparison of SIR versus KSIR. The first subplot in Figure 4.6 plots the responses versus the training inputs projected along the first SIR direction. The second subplot shows the test data and the fitted test data using the first SIR variate. The third subplot plots the residuals from the second subplot. Figure 4.7 gives the KSIR counterpart of Figure 4.6. We can see from these pictures that, with one leading KSIR component, the regression function can be explained much better than by using the leading SIR variate.
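A sketch reproducing the synthetic data and the slicing step for a continuous response (the random seed and the number of slices are arbitrary choices, not from the notes):

    import numpy as np

    rng = np.random.default_rng(0)                     # arbitrary seed
    X = rng.normal(0.0, 20.0, size=(200, 10))          # ten attributes, x ~ N(0, 20^2)
    y = np.linalg.norm(X, axis=1) / 10 + rng.normal(0.0, 0.1, size=200)
    X_tr, X_te, y_tr, y_te = X[:100], X[100:], y[:100], y[100:]

    # For a continuous response, slice y_tr into (say) 10 ordered slices first,
    # then reuse sir_directions(X_tr, slices, r=1) or ksir_directions(K, slices, r=1) from above.
    cuts = np.quantile(y_tr, np.linspace(0, 1, 11)[1:-1])
    slices = np.digitize(y_tr, cuts)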

Figure 4.5: Data scatter along some coordinates ((x2, y), (x3, y), (x7, y) and (x8, y) scatter plots).

Figure 4.6: Data scatter, fitted values and residuals using 1st SIR variate.

Figure 4.7: Data scatter, fitted values and residuals using 1st KSIR variate.

Chapter 5

Learning Theory with RKHS

(optional)

In this chapter we give a survey of the theory of positive definite kernels and the associated reproducing kernel Hilbert spaces (RKHSs), with a focus on material relevant to statistical and machine learning. We refer the reader to [1, 4, 25] and references therein.

5.1 Introduction

We have given a formal definition of a positive definite kernel in Definition 1, Section 2.2. Positive definite kernels provide a powerful tool for function approximation. Building upon positive definite kernels, there are Hilbert spaces of functions on X, named RKHSs, which provide convenient, flexible and versatile statistical models.

Definition 2 (Reproducing kernel Hilbert space) An RKHS is a Hilbert space of real-valued functions on X satisfying the property that all its evaluation functionals are bounded linear functionals.¹ Denote this space by H. The boundedness is in the sense that, for an evaluation functional ℓx, sup_{f∈H} |ℓx(f)|/‖f‖H = sup_{f∈H} |f(x)|/‖f‖H < ∞.

The boundedness of an evaluation functional also implies its continuity. This definition emphasizes that an RKHS is a Hilbert space of pointwise defined functions, where H-norm convergence implies pointwise convergence. This is unlike the L2 space, where L2-norm convergence does not imply pointwise convergence.

¹An evaluation functional on a Hilbert space H at a point x, denoted by ℓx, is a mapping from H to ℝ defined by ℓx(h) = h(x) for h ∈ H. In other words, the evaluation functional ℓx associates with any h ∈ H its value at the point x ∈ X. Evaluation functionals are linear.

To every RKHS H of real-valued functions on X, there corresponds a unique symmetric positive-definite kernel K : X × X → ℝ such that K(x, ·) ∈ H and K has the reproducing property:

⟨f, K(x, ·)⟩H = f(x) ∀f ∈ H, ∀x ∈ X.    (5.1.1)

We say that this RKHS admits the kernel K. Conversely, given a positive-definite kernel K on X × X, there exists a unique Hilbert space admitting this kernel. We denote this Hilbert space by HK. As every positive definite kernel is associated with an RKHS in which the kernel reproduces function values, a positive-definite kernel is also called a reproducing kernel (RK).

Often we are dealing with a separable RKHS, where a countable orthonormal system of bases {φi, i = 1, 2, . . .} exists, and the unique kernel associated with this RKHS can be expanded as

K(x, u) = Σ_i φi(x)φi(u).

Berlinet and Thomas-Agnan's book (pages 31-37) [4] gives a clear account of the separability and continuity of RKHSs.

In statistical inference, we assume in general a probability measure µ on the sample space X. (The measure µ need not be the underlying probability distribution of the data.) Throughout this monograph we assume that all the reproducing kernels employed are (1) µ-measurable and (2) of trace type, i.e., ∫_X K(x, x)dµ < ∞. In previous sections we established the kernel trick through a feature map Φ : X → Z, where the space and the map Φ are both implicit. Below we introduce an isometric isomorphism. This isomorphism gives a more concrete representation of the feature map and has a direct link to the kernel function. Consider a map Γ : X → HK given by

x ↦ Γ(x) := K(x, ·).    (5.1.2)

The original input space X is then embedded into a new input space HK via the transformation Γ. Each input point x ∈ X is mapped to an element K(x, ·) ∈ HK. Let J : Z → HK be a map from the spectrum-based feature space Z to the kernel-associated Hilbert space Γ(X) ⊂ HK defined by J(Φ(x)) = K(x, ·). Notice that J is a one-to-one linear transformation satisfying

‖Φ(x)‖²_Z = K(x, x) = ‖K(x, ·)‖²_{HK}.

Thus, Φ(X) and Γ(X) are isometrically isomorphic, and the two feature representations (2.2.14) and (5.1.2) are equivalent in this sense. The latter representation is explicit and concrete.

5.2 SVMs with RKHS

SVMs are powerful computational methods for classification, regression, and function estimation in general. A major part of their computational power comes from the fact that they adopt a simple linear model in a high dimensional feature RKHS, where the inner product can be conveniently obtained by the kernel reproducing property. Though the feature map (5.1.2) represents each data point x by a function K(x, ·) ∈ HK, we do not have to go through the entire function domain to compute inner products; the similarity between two points x and u can be obtained by directly evaluating K(x, u).

In the Euclidean space ℝ^p, a linear model f(x) is determined, up to an offset term, by an element w ∈ ℝ^p so that f(x) := w′x + b, which falls into the category of parametric modelling. For parametric modelling, a single global model, e.g. a linear one, is assumed. To model f(x) as a linear model in an RKHS HK, it is similarly determined by an element h ∈ HK, serving as a functional normal direction, so that f(x) = ⟨h, K(x, ·)⟩_{HK} + b. The SVMs in general can be formulated as a regularization problem in HK. Specifically, if we take the 2-norm soft margin SVM (with b²/2 appended to the regularization as well) for study, it is formulated as:

min_{h∈HK} (C/2)‖ξ‖²₂ + (1/2)(‖h‖²_{HK} + b²) subject to 1 − yi(h(xi) + b) − ξi ≤ 0, ∀i.

At a solution, we have ξi = (1 − yi(h(xi) + b))₊ and h(x) = Σ_{i=1}^{n} K(x, xi)vi = K(x, A)v. Such a kernel-mixture model consists of many local bases and has the nature of nonparametric modelling. However, it is certainly a linear model from the HK viewpoint. Thus, being linear or not depends on where you stand: the feature space or the pattern space. Linear models in the Euclidean pattern space are very rigid, while linear models in the RKHS feature space are versatile and flexible. Back to the optimization problem; it has the following equivalent form:

min_{v∈ℝ^n} (C/2)‖(1 − D(K(A, A)v + b1))₊‖²₂ + (1/2)(‖K(·, A)v‖²_{HK} + b²),    (5.2.3)

where ‖K(·, A)v‖²_{HK} = v′Kv. Examples are the Lagrangian SVM [40], the active set SVM [37] and the SSVM [33], among others. (In the SSVM, v′v is used to replace v′Kv for problems solved in the primal space.) SVM algorithms can be regarded as regularization problems in HK. The RKHS feature space provides the SVM algorithms not only with computational power but also with a unified framework for theoretical study.

5.3 Multivariate statistical analysis with RKHS

In kernel-based SVMs, a linear model in the RKHS feature space is adopted and then fitted to the training data via the same notion as its linear companion in the Euclidean space. The same philosophy of adopting a linear model in the feature space and then fitting the data via a certain parametric notion can be extended to many classical multivariate statistical analysis tools, such as KPCA, KSIR, KCCA, KFDA and beyond. For a theoretical study of them, we need to introduce the definition of a Gaussian measure on an arbitrary Hilbert space H.

Definition 3 (Gaussian measure on a Hilbert space) Let H be an arbitrary real separable Hilbert space. A probability measure µH defined on H is said to be Gaussian if ⟨f, h⟩H has a one-dimensional normal distribution for any f ∈ H, where h denotes the random element having the probability measure µH.

It can be shown that for any k and any f1, . . . , fk ∈ H, the joint distribution of ⟨f1, h⟩H, . . . , ⟨fk, h⟩H is normal. For references on Gaussian measures on Hilbert spaces, see, e.g., Grenander [22], Vakhania, Tarieladze and Chobanyan [52], and Janson [29].

For simplicity we assume that E⟨h, h⟩H < ∞. For a probability measure µH on H, there exist m ∈ H, the mean, and an operator Σ, known as the covariance operator, such that

⟨m, f⟩H = E⟨h, f⟩H, ∀f ∈ H; and ⟨Σf, g⟩H = E{⟨h − m, f⟩H⟨h − m, g⟩H}, ∀f, g ∈ H.

The covariance operator Σ is of trace type with trace(Σ) = E⟨h − m, h − m⟩H.

Theorem 1 (Grenander [21]) Assume that µ1,H and µ2,H are two equivalent Gaussian measures on H with means m1 and m2 and a common nonsingular covariance operator Σ of trace type. Let L2,1 = log(dµ2,H/dµ1,H) and let h be an element in H. Let ma = (m1 + m2)/2 and md = m2 − m1. A necessary and sufficient condition for the log-likelihood ratio L2,1 to be linear is that md ∈ R(Σ^{1/2}), where R(Σ^{1/2}) is the range of Σ^{1/2}. The log-likelihood ratio is then given by²

L2,1(h) = ⟨h − ma, Σ⁻¹md⟩H.    (5.3.4)

²Notice that ⟨h − ma, Σ⁻¹md⟩H exists a.s. for h from the Gaussian measure with mean ma and covariance operator Σ. Let λi and ei be the eigenvalues and eigenvectors of Σ, respectively. Then ⟨h − ma, Σ⁻¹md⟩H = Σ_i ⟨h − ma, ei⟩H⟨md, ei⟩H/λi. As the {⟨h − ma, ei⟩H} are independent normal random variables with mean zero and variance E⟨h − ma, ei⟩²H = λi, we have Σ_i E(⟨h − ma, ei⟩H⟨md, ei⟩H/λi)² = Σ_i ⟨md, ei⟩²H/λi < ∞, since md ∈ R(Σ^{1/2}). Thus, ⟨h − ma, Σ⁻¹md⟩H exists a.s. (For independent random variables Xi having zero mean, Σ_i Xi converges a.s. if and only if Σ_i var(Xi) < ∞.)

To separate two Gaussian populations in H, the log-likelihood ratio in Theorem 1 leads to an ideal optimal linear decision boundary. Precisely, a binary KFDA looks for a functional normal direction g that is optimal in separating the two classes of kernel inputs {Γ(xj)}j∈I1 and {Γ(xj)}j∈I2. Heuristically, when the data patterns (conveyed in the realizations Γ(xj), j = 1, . . . , n) are projected along g, the class centers are far apart, while the spread within each class is small, making the overlap of these two classes as small as possible when projected along this functional direction. The optimality can be formalized in the sense of the maximum likelihood ratio of two Gaussian measures. There are parameters, including the population means and the covariance operator, involved in the log-likelihood ratio (5.3.4), which have to be estimated from the data. Below we derive their maximum likelihood estimates.

Theorem 2 (Maximum likelihood estimates) Let H be a Hilbert space of real-valued functions on X. Assume that {hj}nj=1 are iid random elements from a Gaussian measure on H with mean m and nonsingular covariance operator Σ of trace type. Then, for any g, f ∈ H, the maximum likelihood estimate for ⟨g, m⟩H is given by ⟨g, m̂⟩H with

m̂ = (1/n) Σ_{j=1}^{n} hj,    (5.3.5)

and the maximum likelihood estimate for ⟨g, Σf⟩H is given by ⟨g, Σ̂f⟩H with

Σ̂ = (1/n) Σ_{j=1}^{n} (hj − m̂) ⊗ (hj − m̂),    (5.3.6)

where ⊗ denotes the tensor product. In particular, for any given x, u ∈ X, by taking g and f to be the evaluation functionals at x and u respectively, the MLEs for m(x) and Σ(x, u) are given, respectively, by m̂(x) = n⁻¹ Σ_{j=1}^{n} hj(x) and

Σ̂(x, u) = (1/n) Σ_{j=1}^{n} (hj(x) − m̂(x))(hj(u) − m̂(u)).    (5.3.7)

For multiple populations sharing a common covariance operator, we pool together the sample covariance estimates from all populations according to their sizes to get a single pooled estimate.


5.4 More to come . . .

Chapter 6

Neural Networks

We will not write our own text on neural networks; instead we use "Part V: Neural Networks" from the book "Information Theory, Inference, and Learning Algorithms" by David J.C. MacKay. Consult the book website: http://www.inference.phy.cam.ac.uk/mackay/itila/

Here we only give a summary of the material to be covered in class. The numbering of chapters, sections, figures and equations is kept the same as in MacKay's book.

38. Introduction to Neural Networks

Neural networks are a class of learning methods that have been developed in many fields (statistics, mathematics, and artificial intelligence) based on essentially identical models. The central idea of neural networks relies on two things: network modelling (architecture, activity) and a learning rule. Neural networks provide a powerful tool for learning, with widespread applications in many fields.

♠ Modelling: a kind of approximation; the underlying class of functions should be large enough for versatile and flexible approximation.

♠ Learning rule: should be computationally feasible.

Compare the NN modelling and learning with SVMs as you continue on. Focus on (1) model simplicity, flexibility and versatility, and (2) computational ease.

38.2 Terminology

Here are the 3 major components of a neural network algorithm:

• Architecture. The architecture specifies what variables are involved in the network and their topological relationships.

• Activity rule. Most neural network models have short time-scale dynamics: local rules define how the activities of the neurons change in response to each other. Typically the activity rule depends on the weights (the parameters) in the network.

• Learning rule. The learning rule specifies the way in which the neural network's weights (parameters) change with time. Usually the learning rule will depend on the activities of the neurons. It may also depend on the target values supplied by a teacher and on the current value of the weights.

Neural network algorithms can be roughly divided into two classes: supervised neural networks (with a teacher's supervision of target values) and unsupervised neural networks (without a teacher's supervision of target values).

39. The Single Neuron as a Classifier

You have to understand the single neuron in good detail, as many neural networks are built out of single neurons. It is your first example of a supervised neural network.

• Architecture. It is a feedforward device.

Figure 39.1. Architecture of a single neuron.

• Activity rule. It has two steps.

1. Activation of the neuron: a = Σ_{i=1}^{I} wi xi + b = w′x + b. Or, to save notation, add a constant node x0 = 1 to the inputs, so that the neuron activation is given by a = Σ_{i=0}^{I} wi xi = w′x. The parameter w0 then carries the value of the bias b.

2. Output (also called the activity) of the neuron: y = f(a), where f is a function of the activation and is called the activation function (or transfer function in some other books). There are several possible activation functions; here are the most popular ones. (Also introduce the graphical notation used for each of the following activation functions.)

i. Linear: y = f(a) = a.

ii. Sigmoid (logistic): y = f(a) = 1/(1 + e^{−a}).

iii. Sigmoid (tanh): y = f(a) = tanh(a) = (e^a − e^{−a})/(e^a + e^{−a}).

iv. Threshold function: y = f(a) = 1 if a > 0, and −1 if a ≤ 0.

• Learning rule. Specify a learning rule in order to train the single neuron and obtain the weight parameters w.

39.3. Training the single neuron as a binary classifier

Assume we have a training data set of inputs {x(n)}N_{n=1} with binary labels {t(n)}N_{n=1}, where x(n) ∈ ℝ^{I+1} (including the bias term by setting x(n)_0 = 1) and t(n) ∈ {0, 1}. Below we introduce how to design and train a single neuron as a binary classifier.

• Architecture. Consider a single neuron with I (or I + 1, if counting the constant node for the bias) inputs and one output y ∈ (0, 1).

• Activity rule. Activation of the neuron: a = Σ_{i=0}^{I} wi xi. Transfer function of the neuron: y(a) = 1/(1 + e^{−a}), or y(x; w) = 1/(1 + e^{−x′w}).

• Learning rule, i.e., training the network as a binary classifier. The error function considered here is the cross-entropy¹, also known as the negative log-likelihood², or the Kullback-Leibler distance³:

G(w) = −Σ_{n=1}^{N} { t(n) ln y(x(n); w) + (1 − t(n)) ln(1 − y(x(n); w)) }    (39.11)
     = DKL({t(n)} ‖ {y(x(n); w)}) − Σ_{n=1}^{N} { t(n) ln t(n) + (1 − t(n)) ln(1 − t(n)) }.

¹Let P = {pi} be a discrete probability distribution, i.e., 0 ≤ pi ≤ 1 and Σ_i pi = 1. The entropy of P, defined by H(P) = −Σ_i pi log₂ pi, can be viewed as a measure of impurity, or a measure of uncertainty. The distribution P is pure, or certain, if −Σ_i pi log₂ pi = 0, which happens when one pi equals 1 and the rest are 0. Entropy is maximized if {pi} is uniform. Entropy is the minimum number of bits needed to encode the classification of an instance. Let Q = {qi} also be a discrete distribution. The cross-entropy between P and Q is defined by H(P, Q) = −Σ_i pi log qi.
²Interpret y(n) as the likelihood of being 1, and 1 − y(n) as the likelihood of being 0.
³The KL distance of P to Q is defined by DKL(P‖Q) = Σ_i pi log(pi/qi) = H(P, Q) − H(P).

The resulting classifier assigns a new test input x to the class with label

sign(y(x; w) − 1/2), which equals 1 if y − 1/2 > 0, and 0 if y − 1/2 ≤ 0.

39.4. Beyond descent on the error function: regularization

The objective function to be minimized is changed to

M(w) = G(w) + αEW(w),    (39.22)

where the simplest choice of regularizer is the weight decay regularizer EW = (1/2)Σ_{i=1}^{I} w²i, or EW = (1/2)Σ_{i=0}^{I} w²i (with the bias term also being regularized, as in the SSVM case). A backpropagation gradient-descent learning algorithm is prescribed below:

∇G = (∂G/∂y)(∂y/∂w) = −Σ_{n=1}^{N} (t(n) − y(n)) x(n) := −Σ_{n=1}^{N} e(n) x(n),

update: w ← w − η (∇G(w) + αw),

where η is a learning rate parameter and, together with α, can be determined by cross-validation. Explain the following concepts.

• Backpropagation algorithm. The error in the output units is back-propagated to the input units by being weighted by x(n).⁴

• Gradient-descent learning algorithm. First order gradient descent.

• Batch learning versus on-line learning. Batch learning: parameter updates are done with all the training cases. On-line learning: process one observation at a time, update the gradient after each training case, and cycle through the training cases many times. On-line training allows the network to handle very large data sets and also to update the weight parameters as new observations come in.

⁴Backpropagation is a chain rule for differentiation computed by a backward sweep over the network. It keeps track only of quantities local to each unit. The backward error propagation can be better seen in a multi-layer network.

Algorithm 39.5 (Batch version)

    global x ;                        # x is an N * (I + 1) matrix containing all the input vectors,
    global t ;                        #   with an extra column of all ones; t is a vector of length N
                                      #   containing all the targets
    for l = 1:L                       # loop L times
      a = x * w ;                     # compute all activations
      y = sigmoid(a) ;                # compute all outputs
      e = t - y ;                     # compute all errors
      g = - x' * e ;                  # compute the gradient vector
      w = w - eta * ( g + alpha * w ) ;   # make step, using learning rate eta
                                          # and weight decay alpha
    endfor

    function f = sigmoid(v)
      f = 1.0 ./ ( 1.0 + exp(-v) ) ;
    endfunction

42. Hopfield Networks

Here we introduce a fully connected feedback network called the Hopfield network. A Hopfield network is a form of recurrent artificial neural network, which can serve as a content-addressable memory system with binary threshold units. It is guaranteed to converge to a stable state.

The weights (representing the strength of connectivity) in the Hopfield network are constrained to be symmetric. The weight from neuron i to neuron j is equal to the weight from neuron j to neuron i, i.e., wij = wji. Also there are no self-connections, i.e., wii = 0.

Figure 42.1. (a) A feedforward network. (b) A feedback network.

Hopfield networks have two applications. (1) They can act as associative memories (also known as content-addressable memories). (2) They can be used to solve optimization problems.

• Architecture. A Hopfield network is a fully connected feedback network with weights wij = wji and wii = 0, i = 1, . . . , I (or i = 0, . . . , I, where wi0 = w0i are biases, also called intercepts or offsets). Denote the weight matrix by W.

• Activity rule. Each neuron updates its state as if it were a single neuron with the threshold activation function x(a) = 1 if a > 0, and −1 if a ≤ 0. As there is feedback in the network, every neuron's output x(a) will serve as an input to all the other neurons in the next update. The order of the updates can be synchronous or asynchronous (one at a time):

x_t → a_t = W x_t → x_{t+1}(a_t) = (1 if a_t > 0, −1 if a_t ≤ 0) → a_{t+1} = W x_{t+1} → · · · .

• Learning rule. Given a set of desired memories {x(n)}, the learning rule is intended to make the set of desired memories {x(n)} stable states of the network. Each memory is a binary pattern made of k bits of {0, 1}. Hopfield networks have a scalar value associated with each state of the network, referred to as the energy, E, of the network:

E = −(1/2) x′Wx  ( + (α/2)‖W‖²_F for regularization ),

where ‖W‖²_F = Σ_{ij} w²_{ij} is the Frobenius norm of the matrix (with or without including the bias term in the regularization). The network will converge to states which have minimum energy. To get the associated W for stationary states, see Algorithm 42.9 for the iterative steps.

Algorithm 42.9

    w = x' * x ;                      # initialize the weights using the Hebb rule;
                                      #   x is the data matrix of size N x (I + 1), including the constant
    for l = 1:L                       # loop L times
      for i = 1:I+1
        w(i,i) = 0 ;                  # ensure the self-weights are zero
      end
      a = x * w ;                     # compute all activations
      y = sigmoid(a) ;                # compute all outputs
      e = t - y ;                     # compute all errors
      gw = x' * e ;                   # compute the gradients
      gw = gw' + gw ;                 # make symmetric
      w = w + eta * ( gw - alpha * w ) ;  # make step
    endfor

42.3-42.4 Definition and convergence of the continuous Hopfield network

Continuous Hopfield networks have the same architecture and learning rule as the binary ones. The only difference is that continuous Hopfield networks use a continuous transfer function. The choice of transfer function can be the tanh-sigmoid (in MacKay's book), the saturating linear function (in Matlab Hopfield networks, see Example 3 below), among others.

We hope that the activity rule of a Hopfield network will take a partial memory or a corrupted memory, and perform pattern completion or error correction to restore the original memory. But why should we expect so?

The (regularized) energy function of a Hopfield network decreases under the dynamical evolution of the system, and the energy function is bounded below. There is a general name for such a function: a Lyapunov function for the system. If a system has a Lyapunov function, then its dynamics are bound to settle down to a fixed point, which is a local minimum of the Lyapunov function, or to a limit cycle, along which the function is constant. Chaotic behavior is not possible for a system with a Lyapunov function. If a system has a Lyapunov function, then its state space can be divided into basins of attraction, one basin for each attractor. This convergence proof depends crucially on the fact that the Hopfield network's connections are symmetric. It also depends on the updates being made asynchronously.


Example 3 (not from MacKay's book) Here is one simple example from the Matlab Hopfield network documentation. It adopts the so-called saturating linear transfer function as the activation function:

x(a) = 1 if a ≥ 1;   x(a) = a if −1 ≤ a ≤ 1;   x(a) = −1 if a ≤ −1.

Consider a Hopfield network with a design of two stable points, given by the columns of

T = [ -1  1
      -1 -1
       1  1 ].

We can create the design with Matlab command

net=newhop(T);

The associated weight matrix W and bias vector b for the stationary states can be retrieved by

W=net.LW{1}; b=net.b{1};

It leads to

W = [ 1.1618  0       0
      0       0.2231  0
      0       0       0.2231 ]   and   b = [ 0; -0.8546; 0.8546 ].

For a test input x = {[-0.9; -0.8; 0.7]}, we simulate the newly constructed Hopfield network using this input:

y=sim(net,{1 5},{},x); y{1}

We get y{1} = [-1; -1; 1]. With x = {[-0.6; -0.7; 0.6]}, it takes 4 steps to restore the pattern, i.e., y{4} = [-1; -1; 1].

Homework 5 Build a Hopfield network using the 4 given memories in Figure 42.3. (State the architecture, activation and training rule of the network that you are to build, and then give the resulting weight matrix and bias vector after training.) Test on patterns (b)-(d) given in the same figure. (Provide the states for all steps.) You may use your own implementation of Algorithm 42.9 (binary Hopfield network), or use existing Matlab code (continuous Hopfield network) or R code for this homework.



Figure 42.3. Binary Hopfield network storing 4 memories.

44. Supervised Learning in Multilayer Networks

44.1 Multilayer perceptrons

• Multilayer perceptron. It is a feedforward network and has input neurons, hidden neurons and output neurons. Such a feedforward network defines a nonlinear parametrized mapping from an input vector x to an output vector y = y(x; w, A), or simply y(x; w). The output of the net is a continuous function of the input x and the parameters w = (w(1), w(2)) for a 2-layer net; the architecture of the net is denoted by A = (a(1), a(2)). Between the input nodes and the output nodes there are hidden nodes y(H) = y(H)(x; w(1), a(1)), or simply y(H)(x; w(1)). Multi-layer networks are quite powerful for function approximation in pattern analysis and regression problems. For instance, a 2-layer network with biases, a sigmoid hidden layer and a linear output layer can approximate any "regular" function (with a finite number of discontinuities) arbitrarily well.


Figure 44.1. A typical 2-layer (not counting the input nodes) network.

Feedforward networks can be trained (often by backpropagation algorithms) to perform regression and classification tasks.

• Backpropagation algorithm. It is a supervised learning technique for training feedforward multilayer networks based on gradient descent. The term stands for "backwards propagation of errors". Assume we have training samples {(x(n), t(n))}, n = 1, . . . , N, and let E(w) denote the energy function (the error function, with or without a regularizer) to be minimized. A first-order gradient-based method is used to find the energy minimum. Here is how the parameters are updated:

new w ← w − η∇E(w),

where

∇E(w) = (∂E/∂y) · (∂y/∂y(H)) · (∂y(H)/∂w).

As the algorithm's name implies, the errors in the output layer propagate backwards from the output nodes to the inner hidden nodes. The term (∂E/∂y) · (∂y/∂y(H)) acts like the error term for the hidden layer: it is the error term in the output layer weighted by the "responsibility" of the hidden units, and it is used to adjust the weights of each neuron so as to lower the local error and hence also the total error. The network weights are moved along the negative of the gradient of the objective function to be minimized.

Backpropagation refers to the manner in which the gradient is computed for nonlinear feedforward multi-layer networks. There are other variations on the basic algorithm that are based on other standard optimization techniques.


• Regression networks. For a univariate-response 2-layer regression network with a nonlinear hidden layer and a linear output layer, the network architecture has the form:

♠ Hidden layer: y(H) = f(a(1)) with a(1) = x′w(1) (biases are included by appending the node x0 = 1 to the input nodes, to save notation). Note that w(1) is a matrix.

♠ Output layer: y = a(2) = y(H)′w(2) (biases are included by appending the node y(H)0 = 1 to the hidden layer).

44.2 Regression networks

Given a training data set D = {x(n), t(n)} (boldface t is used for a vector response, i.e., this can be a multivariate-response regression problem), the network is trained by adjusting the weights w so as to minimize an error function, for instance, the mean squared error

min_w { E_D(w) = (1/2) Σ_n ‖t(n) − y(x(n); w)‖_2^2 }.   (44.3)

Often, a regularization term is added to the error function, modifying the objective function to

M(w) = E_D + αE_W,   (44.4)

where, for instance, E_W = (1/2)(‖w(1)‖_F^2 + ‖w(2)‖_F^2). The regularization term favors small values of w and guards against overfitting the training data. The regression function is estimated by y(x; ŵ), where ŵ is the solution of the minimization problem min_w M(w).
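To tie the backpropagation update to the regularized criterion (44.4), here is a minimal gradient-descent sketch of ours (not the toolbox code) for the 2-layer regression network described above. It assumes the data matrix x (N x (I+1), with a column of ones appended), the target matrix t (N x K), weight matrices w1 ((I+1) x H) and w2 ((H+1) x K), a learning rate eta, a weight decay alpha, and a number of passes L are set up beforehand.

N = size(x,1);
for l = 1:L
    a1  = x * w1;                            # hidden activations
    yH  = 1.0 ./ (1.0 + exp(-a1));           # hidden outputs y^(H)
    yH1 = [yH ones(N,1)];                    # append the hidden bias node
    y   = yH1 * w2;                          # linear outputs
    e   = t - y;                             # output errors
    eH  = (e * w2(1:end-1,:)') .* yH .* (1 - yH);  # errors propagated back to the hidden layer
    g2  = -yH1' * e;                         # gradient for the output weights
    g1  = -x' * eH;                          # gradient for the hidden weights
    w2  = w2 - eta * (g2 + alpha * w2);      # steps with weight decay
    w1  = w1 - eta * (g1 + alpha * w1);
endfor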

44.3 Classification networks

Binary classification networks. The targets t(n) are binary labels in {0, 1}. A feedforward network, consisting of a linear hidden layer and a nonlinear log-sigmoid output layer, can be used for binary classification, with its output interpreted as a probability P(t = 1 | x, w, A). The network is trained to minimize the regularized negative log-likelihood G(w) + αE_W, where

G(w) = − Σ_n [ t(n) ln y(x(n); w) + (1 − t(n)) ln(1 − y(x(n); w)) ].   (44.9)

Multi-class classification networks. For a multi-class problem, we can represent the targets by a vector t in which a single element is set to 1, indicating the correct class, and all other elements are set to 0. The outputs y = (y1, . . . , yk) are interpreted as class probabilities, i.e., yi = P(ti = 1 | x, w, A), where k is the number of classes. Class membership is assigned by the so-called 'softmax' rule, which assigns a test point x to the class with the maximal class probability. That is, the classifier is given by

arg max_{1≤i≤k} yi(x; ŵ).

The negative log-likelihood in this case is

G(w) = − Σ_{n=1}^{N} Σ_{i=1}^{k} ti(n) ln yi(x(n); w).   (44.10)
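One common way to produce such probability outputs is a softmax transformation of the output-layer activations. The following is a small sketch of ours (the variable names are hypothetical: a2 is an N x k matrix of output activations from the forward pass, and T is the N x k 0/1 target-indicator matrix) evaluating the class probabilities, the negative log-likelihood (44.10), and the class assignments.

k    = size(a2, 2);
expA = exp(a2 - repmat(max(a2, [], 2), 1, k));   % subtract the row maximum for numerical stability
Y    = expA ./ repmat(sum(expA, 2), 1, k);       % softmax: y_i = P(t_i = 1 | x)
G    = -sum(sum(T .* log(Y)));                   % negative log-likelihood (44.10)
[maxprob, class] = max(Y, [], 2);                % assign each case to arg max_i y_i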

Homework 6 Design a 2-layer feedforward network to analyze the Pendigits data set. (Clearly specify the architecture, activation and transfer functions, learning rule, etc.) Use the training data to train the network and then test on the test data.

♣ Some Issues in Training Neural Networks

Material listed below is taken from Chapter 11 (Neural Networks) of Hastie, Tibshirani and Friedman [24].

• Initial values for weight parameters. If the weights are near zero, then the operative part of the sigmoid is roughly linear, and hence the neural network collapses into an approximately linear model.5 Usually the starting values for the weights are chosen to be random values near zero. Hence the model starts out nearly linear and becomes nonlinear as the weights increase. Individual units localize to directions and introduce nonlinearities where needed. Use of exact zero weights leads to zero derivatives and perfect symmetry, and the algorithm never moves. Starting with large weights often leads to poor solutions.

• Overfitting. Add a regularization (weight decay) term, αE_W(w(1), w(2)), to the cost function. This has the effect of shrinking w = (w(1), w(2)) toward zero, that is to say, shrinking the final model toward a linear model. A validation data set is useful for determining the tuning parameter α (as well as the learning rate η).

• Scaling of the inputs. Since the scaling of the inputs determines the effective scaling of the weights in the bottom layer, it can have a large effect on the quality of the final solution. At the outset it is best to standardize all inputs to have mean zero and standard deviation one. This ensures all inputs are treated equally in the regularization process, and allows one to choose a meaningful range for the random starting weights. With standardized inputs, it is typical to take random uniform weights over the range [-0.7, +0.7].

5Assume a 2-layer network with a linear output layer and a nonlinear sigmoid hidden layer. If the first-layer weights w(1) are nearly zero, the resulting network model is nearly linear in the inputs x.
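A short sketch of the preprocessing just described (the variable names X, the N x I raw input matrix, and nw, the number of weights, are ours): standardize each input to mean zero and standard deviation one, then draw uniform starting weights in [-0.7, +0.7].

N     = size(X, 1);
mu    = mean(X);
sd    = std(X);                                          % assumes no constant (zero-variance) columns
Xs    = (X - ones(N,1)*mu) ./ (ones(N,1)*sd);            % standardized inputs: mean 0, sd 1
w0    = -0.7 + 1.4*rand(nw, 1);                          % starting weights, uniform on [-0.7, +0.7]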

• Number of hidden units and layers. (For simplicity, use a 2-layer network with‘enough’ hidden units.)

♣ Some Remarks

• A neural network takes nonlinear functions of linear combinations of the inputs. This is a powerful and general approach for classification (pattern analysis) and regression.

• Many packages are available for fitting neural networks. They vary widely in quality, and the learning problem for neural networks is sensitive to issues such as input scaling, initial values, and the weight decay parameter. Therefore, the software should be carefully chosen and tested.


Appendix A

Some Matrix Algebra

A.1 Matrix diagonalization

Matrix diagonalization (factorization) has several important applications in statistics, including PCA, factor analysis, independent component analysis, dimension reduction, etc. Diagonalization is a process of finding new bases (i.e., new coordinate axes) so that a linear map can be represented in a simple form. The Matlab programs "eigshow" and "svdshow" provide an excellent visual demonstration of the eigenvalue decomposition and the singular value decomposition for 2-dimensional square matrices.

A.1.1 Eigenvalue decomposition and generalized eigenvalue decomposition

Given any symmetric p × p matrix M, we ask when Mu = λu. It can be shown that there exists an orthogonal matrix U = [u1, u2, . . . , up] such that

U′MU = Λ is a diagonal matrix (i.e., Λ = diag(λ1, . . . , λp)).   (A.1)

If M is positive semi-definite, λi ≥ 0 for all i; if M is strictly positive definite, λi > 0 for all i. Note that

MU = UΛ,  i.e.,  Mui = λi ui.

The columns of U are called eigenvectors and λi, i = 1, . . . , p, are called eigenvalues of M. For nonsingular M, the eigenvectors lie on the principal axes of the ellipsoid x′Mx = 1, and the eigenvalues are equal to the squares of the reciprocals of the lengths of the principal axes.



For a pair of symmetric matrices Σ (strictly p.d.) and M, we shall consider the simultaneous diagonalization of M and Σ so that

U′MU = Λ and U′ΣU = I.   (A.2)

(A.2) is equivalent to solving the generalized eigenvalue problem Mu = λΣu. The diagonalization of M is with respect to the inner product space R^p with inner product given by 〈u, v〉_Σ = u′Σv. In other words, we are looking for a new system U, which is orthogonal in the sense of U′ΣU = I, so that M possesses a simple representation.
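In Matlab, the simultaneous diagonalization (A.2) can be obtained from the generalized eigenvalue problem directly. A sketch of ours (M symmetric and Sigma strictly positive definite are assumed to be given); the rescaling step is made explicit since not every eig variant normalizes the eigenvectors so that U'*Sigma*U = I:

% Solve the generalized eigenvalue problem M*u = lambda*Sigma*u.
[U, Lambda] = eig(M, Sigma);
% Rescale the columns so that U'*Sigma*U = I.
s = sqrt(diag(U' * Sigma * U));
U = U ./ (ones(size(U,1),1) * s');
% Now U'*M*U is (approximately) diag(lambda_1,...,lambda_p) and U'*Sigma*U = I.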

A.1.2 Singular value decomposition (SVD) and generalized SVD

Given an n × p matrix A, there exist an n × n orthogonal matrix U, a p × p orthogonal matrix V and an n × p diagonal matrix Λ such that

A = UΛV′.   (A.3)

The expression (A.3) is called the singular value decomposition of A. The columns of U and V are called, respectively, left singular vectors and right singular vectors, and the diagonal elements of Λ are called singular values. Note that

A′A = VΛ′ΛV′ and AA′ = UΛΛ′U′.

A generalized SVD is to find U and V so that

A = UΛV′ with U′Σ1U = In, V′Σ2V = Ip,   (A.4)

where Σ1 and Σ2 are some given strictly positive definite matrices, and Λ is an n × p diagonal matrix.

It is often encountered in data preprocessing that we need to extract leading column principal components of A. For the case p ≪ n, we form the smaller p × p matrix A′A and find its eigenvalue decomposition A′A = VSV′, where S is a diagonal matrix consisting of the squared singular values. Without loss of generality, assume that S is nonsingular; if not, delete the row(s) and column(s) with zero diagonal from S and drop the corresponding column(s) from V. Then the leading left singular vectors of A associated with non-zero singular values can be recovered by U = AVS^{-1/2}. For the case n ≪ p, we form the smaller matrix AA′ and extract its eigenvectors, the left singular vectors of A, directly.
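A sketch of this two-step extraction for the case p ≪ n (variable names are ours, and the number q of leading components is assumed to be given); it recovers the leading q left singular vectors of A from the eigenvalue decomposition of the smaller p x p matrix A'A:

[V, S] = eig(A' * A);                    % S holds squared singular values of A
[s2, idx] = sort(diag(S), 'descend');    % order them from largest to smallest
V  = V(:, idx(1:q));                     % leading q right singular vectors
s2 = s2(1:q);                            % leading q squared singular values (assumed nonzero)
U  = A * V * diag(1 ./ sqrt(s2));        % U = A*V*S^{-1/2}: leading left singular vectors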

Appendix B

Kernel Statistics Toolbox

This toolbox consists of several core programs and some supplements. It is available at http://www.stat.sinica.edu.tw/syhuang and http://140.109.74.188/kern stat toolbox/kernstat.html.

B.1 KGaussian

This program is for making kernel data, full or reduced, with the Gaussian kernel.

KGaussian Matlab m-file for building Gaussian kernel data

function K = KGaussian(gamma, A, tilde_A)
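The body of the routine is not reproduced in these notes. A minimal sketch of what a Gaussian-kernel builder of this form might compute is given below (ours, not the toolbox implementation), assuming K(i,j) = exp(-gamma * ||A(i,:) - tilde_A(j,:)||^2), with A the n x p input matrix and tilde_A the m x p (reduced) set of kernel centers.

function K = KGaussian_sketch(gamma, A, tilde_A)
% Gaussian kernel data: K(i,j) = exp(-gamma * ||A(i,:) - tilde_A(j,:)||^2).
% A: n x p inputs; tilde_A: m x p reduced set (use tilde_A = A for the full kernel).
n  = size(A, 1);
m  = size(tilde_A, 1);
AA = sum(A.^2, 2);                                   % n x 1 squared norms
BB = sum(tilde_A.^2, 2);                             % m x 1 squared norms
D2 = AA*ones(1,m) + ones(n,1)*BB' - 2*A*tilde_A';    % matrix of squared distances
K  = exp(-gamma * D2);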

B.2 Smooth SVM

SSVM [33] is a type of 2-norm soft margin SVM (with b^2/2 also appended to the margin criterion). It uses a smooth approximation to the problem and solves the smoothed problem with the Newton-Armijo method. In the linear case, SSVM solves the following optimization problem:

min_{w,b}  (C/2) ‖p(1 − D(Aw + b1), ρ)‖_2^2 + (1/2)(‖w‖_2^2 + b^2).   (B.1)

It is shown in [33] that, as ρ → ∞, the solution of (B.1) converges to the solution of the original unsmoothed problem. The code takes advantage of the sparsity of the Hessian matrix and uses the limit values of the sigmoid function and the p-function in computing the gradient vector and the Hessian matrix. From the footnote of Section 2.3.3, we have the Newton update direction at the (k + 1)th iteration:

w^{k+1} = w^k − stepsize × (∇²f(w^k))^{−1} ∇f(w^k),

where ∇f is the gradient and ∇²f is the Hessian matrix of the objective function. For the objective function in (B.1), the gradient (at its limit value, i.e., ρ → ∞) is given by

[ ∂f/∂w ; ∂f/∂b ] = C [ −A′D(1 − D(Aw + b1))_+ + w/C ; −1′D(1 − D(Aw + b1))_+ + b/C ]
                  = C [ −A′D Hs(1 − D(Aw + b1)) + w/C ; −1′D Hs(1 − D(Aw + b1)) + b/C ];

and the Hessian matrix (at its limit value, i.e., ρ → ∞) is given by

[ ∂²f/∂w² , ∂²f/∂w∂b ; ∂²f/∂b∂w , ∂²f/∂b² ] = C [ A′HsA + Ip/C , A′Hs1 ; 1′HsA , 1′Hs1 + 1/C ],

where Hs is a diagonal matrix with diagonal values

Hs(i, i) = 1 if 1 − yi(xi′w^k + b) > 0,  and  Hs(i, i) = 0 if 1 − yi(xi′w^k + b) ≤ 0.

That is, only those instances with positive current slack will have a major contribution to the next update, especially when C is large.

As SSVM is solved in the primal space, the linear SSVM and its kernel extension share exactly the same formulation and the same implementation. The kernel extension works directly by feeding the kernel data, full or reduced, into the code, as if they were the input data in the pattern space. This is one of the advantages of working in the primal space.

SSVM2 is a Matlab program implementing the SSVM. It is modified from the original author's code SSVM_M. The original code and a companion C-code version SSVM_C, along with other supplementary programs, are available at http://dmlab1.csie.ntust.edu.tw/downloads.

SSVM2 Matlab m-file for implementing SSVM solved in the primal space

function [w, b] = SSVM2(A, B, C, w0, b0)


B.3 Lagrangian SVM

The Lagrangian SVM (LSVM) has the same formulation as the SSVM. That is, the one-norm ‖ξ‖_1 is changed to the 2-norm ‖ξ‖_2^2/2, which makes the constraint ξ ≥ 0 redundant; and the term b^2/2 is appended to the 2-norm of the normal vector ‖w‖_2^2/2, which simplifies the constraints of the dual problem:

max_α  Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j x_i′x_j − (1/(2C)) Σ_{i=1}^{n} α_i^2   subject to α_i ≥ 0 for all i.

Or equivalently, in matrix notation,

min_{α ≥ 0}  (1/2) α′( DAA′D + I_n/C )α − 1_n′α.

The primal variables can be recovered from the derivatives of the Lagrangian:

w = A′Dα,  b = 1_n′Dα,  ξ = α/C.

These simple changes have led to a simple primal formulation of the SVM problem, as seen in SSVM, and to a simple dual formulation as well, as seen in the LSVM, without sacrificing prediction accuracy. For a nonlinear RSVM in the dual form, the LSVM code given below can still be used by simply replacing the original input data A with the reduced kernel matrix B = K(A, Ã)K(Ã, Ã)^{−1/2}, which is derived from the Nystrom approximation. The LSVM algorithm is based directly on the KKT necessary and sufficient optimality conditions (see the Theorem in the footnote of Section 2.1) for the dual problem:

0 ≤ α ⊥ Qα − 1_n ≥ 0  (i.e., 1_n − DAw − ξ ≤ 0),  where Q = DAA′D + I_n/C.

The LSVM algorithm uses the iterative updates

Qα^{i+1} − 1_n = ( Qα^i − 1_n − γα^i )_+ ,  or  α^{i+1} = Q^{−1}( 1_n + ( Qα^i − 1_n − γα^i )_+ ).1

Implementing the LSVM involves the matrix inversion Q^{−1} = (DAA′D + I_n/C)^{−1}, where DA is an n × p matrix. When p ≪ n, Q^{−1} can be calculated by the Sherman-Morrison-Woodbury (SMW) identity. It goes as follows:

1Global linear convergence can be achieved from any starting point under the condition 0 < γ < 2/C. In the LSVM code, the authors impose this condition as 0 < γ < 1.9/C.


for an n × p matrix H,

( HH′ + I_n/C )^{−1} = C ( I_n − H( H′H + I_p/C )^{−1} H′ ).2

LSVM Matlab m-file for implementing a type of 2-norm SVM solved in the dual space

function [it, opt, w, b] = svml(A,D,C,itmax,tol)

% Inputs:
% A: a matrix of training inputs.
% D: a diagonal matrix with the class labels yi on the diagonal.
% C: the tuning parameter.
% itmax: the maximum number of iterations.
% tol: the tolerated error at termination.
%
% Outputs:
% it: the total number of iterations.
% opt: error at termination.
% w: the normal vector of the decision function.
% b: the offset.
%    Separating hyperplane: w'x - b = 0; the offset term is formulated as -b.
%
% Authors: O. L. Mangasarian and D. R. Musicant [40]

[m,n] = size(A); gamma = 1.9/C; e = ones(m,1); H = D*[A -e]; it = 0;
S = H*inv((speye(n+1)/C + H'*H));                % the Sherman-Morrison-Woodbury identity
alpha = C*(1 - S*(H'*e)); oldalpha = alpha + 1;  % initialization
while (it < itmax & norm(oldalpha - alpha) > tol)
    z = (1 + pl(((alpha/C + H*(H'*alpha)) - gamma*alpha) - 1));
    oldalpha = alpha;
    alpha = C*(z - S*(H'*z));
    it = it + 1;
end;
opt = norm(alpha - oldalpha); w = A'*D*alpha; b = -e'*D*alpha;

function pl = pl(x); pl = (abs(x) + x)/2;

2It can be easily checked that (CHH′ + I_n)( I_n − H( H′H + I_p/C )^{−1} H′ ) = I_n.


B.4 Uniform designs for model selection

We have several UD-based model selection programs collected in the hibiscus-plus collection.

hibiscus4SVM Matlab m-file for SVM model selection. It supports SSVM, LIBSVM and Lagrangian SVM.

hibiscus4RLSSVR Matlab m-file for model selection for multi-response regularized least-squares SVR.

hibiscus4mSSVR Matlab m-file for model selection for multi-response ε-SSVR.

hibiscus4KSIR Matlab m-file for model selection for KSIR combined with other linear algorithms for classification and regression. It now supports KSIR combined with FLDA, SSVM, LIBSVM, LIBSVM for regression, SSVR, and RLS-SVR.

B.5 Support vector regression

RLS-SVR Matlab m-file for regularized least-squares SVR. The response can be multivariate.

SSVR2 Matlab m-file for smooth ε-insensitive SVR. It is modified from the original authors' SSVR_M code.

B.6 Kernel PCA

KPCA Matlab m-file

function [EigenVectors, EigenValues, ratio] = KPCA(K, NumOfPC)

% KPCA: kernel principal component analysis for dimension reduction.
%
% Inputs
% K: kernel matrix (reduced or full)
% NumOfPC: If NumOfPC = r >= 1, it extracts the leading r-many eigenvectors.
%          If NumOfPC = r < 1, it extracts the leading eigenvectors whose corresponding
%          eigenvalues account for 100r% of the total sum of eigenvalues.
% [EigenVectors, EigenValues, ratio] = KPCA(K, NumOfPC) also keeps track
% of the extracted eigenvalues and their ratio to the total sum.
%
% Outputs
% EigenVectors: leading eigenvectors.
% EigenValues: leading eigenvalues.
% ratio: sum of the leading eigenvalues over the total sum of all eigenvalues.
%
% References
% Programmer: Yeh, Yi-Ren; [email protected]
% http://www.stat.sinica.edu.tw/syhuang
% or http://140.109.74.188/kern stat toolbox/kernstat.html
% Send your comments and inquiries to [email protected]

[n p] = size(K);
if (NumOfPC > p)
    error(['the number of leading eigenvalues must be less than ', num2str(p)]);
end
if (p < n)   % for reduced kernel, only right singular vectors are needed.
    K = (K - ones(n,1)*mean(K));
    K = K'*K;
    [EigenVectors EigenValues] = svd(K);
    EigenValues = sqrt(EigenValues);
else
    [EigenVectors EigenValues] = svd((K - ones(n,1)*mean(K))', 0);
end
clear K
EigenValues = diag(EigenValues);
ZeroEigenValue = length(find(EigenValues == 0));
if (NumOfPC > p - ZeroEigenValue)
    error(['the number of leading eigenvalues must be less than ', num2str(p-ZeroEigenValue)]);
end
Total = sum(EigenValues);
if (NumOfPC >= 1)
    % extract NumOfPC-many leading eigenvectors
    EigenVectors = EigenVectors(:,1:NumOfPC);
    ratio = sum(EigenValues(1:NumOfPC))/Total;
    EigenValues = EigenValues(1:NumOfPC);
else
    % extract leading eigenvectors that account for at least 100*NumOfPC% of data variation
    count = 1;
    Temp = EigenValues(count);
    ratio = Temp/Total;
    while (ratio < NumOfPC)
        count = count + 1;
        Temp = Temp + EigenValues(count);
        ratio = Temp/Total;
    end
    EigenVectors = EigenVectors(:,1:count);
    EigenValues = EigenValues(1:count);
end
% normalization
EigenVectors = EigenVectors./(ones(p,1)*sqrt(EigenValues)');

B.7 Kernel SIR

KSIR Matlab m-file

function [EigenVectors, EigenValues, ratio] = KSIR(K, y, NumOfSlice, NumOfPC)

% KSIR: kernel sliced inverse regression for dimension reduction.
%
% Inputs
% K: kernel matrix (reduced or full)
% y: class labels for classification; responses for regression.
% NumOfSlice: number of slices.
%          For a classification problem, NumOfSlice is a string variable 'class'.
%          For a regression problem, NumOfSlice is an integer. Responses y are sorted
%          and sliced into NumOfSlice slices, and so are the rows of K accordingly.
% NumOfPC: If NumOfPC = r >= 1, it extracts the leading r-many eigenvectors.
%          If NumOfPC = r < 1, it extracts the leading eigenvectors whose corresponding
%          eigenvalues account for 100r% of the total sum of eigenvalues.
% [EigenVectors, EigenValues, ratio] = KSIR(K, y, NumOfSlice, NumOfPC)
% also keeps track of the extracted eigenvalues and their ratio to the total sum.
%
% Outputs
% EigenVectors: leading eigenvectors of the between-slice covariance.
% EigenValues: corresponding leading eigenvalues.
% ratio: sum of the leading eigenvalues over the total sum of all eigenvalues.
%
% References
% Author: Yeh, Yi-Ren; [email protected]
% http://www.stat.sinica.edu.tw/syhuang
% or http://140.109.74.188/kern stat toolbox/kernstat.html
% Send your comments and inquiries to [email protected]

[n p] = size(K);
if (nargin < 4)
    NumOfPC = p;
end
if (NumOfPC > p)
    error(['the number of leading eigenvalues must be less than ', num2str(p)]);
end
[Sorty Index] = sort(y);
K = K(Index,:);
HK = [ ];
base = zeros(2,1);
if (ischar(NumOfSlice))
    Label = unique(y);
    for i = 1:length(Label)
        count = length(find(y == Label(i)));
        base(2) = base(2) + count;
        HK = [HK; ones(base(2)-base(1),1)*mean(K(base(1)+1:base(2),:))];
        base(1) = base(2);
    end
else
    SizeOfSlice = fix(n/NumOfSlice);
    m = mod(n,SizeOfSlice);
    for i = 1:NumOfSlice
        count = SizeOfSlice + (i < m+1);
        base(2) = base(2) + count;
        HK = [HK; ones(base(2)-base(1),1)*mean(K(base(1)+1:base(2),:))];
        base(1) = base(2);
    end
end
% solve the following generalized eigenvalue problem:
% (HK)'(In - (1n 1n')/n) HK beta = lambda K (In - (1n 1n')/n) K' beta
Cov_b = cov(HK);     % between-slice covariance matrix
clear HK
Cov_w = cov(K);      % within-slice covariance matrix
clear K
% [EigenVectors EigenValues] = eigs(Cov_b, Cov_w + eps*eye(p), NumOfPC);
[EigenVectors EigenValues] = eig(Cov_b, Cov_w + eps*eye(p));
clear Cov_b Cov_w
EigenValues = diag(EigenValues);
[EigenValues Index] = sort(EigenValues, 'descend');
EigenVectors = EigenVectors(:,Index);
% ZeroEigenValue = length(find(EigenValues <= 0));
% if (NumOfPC > p - ZeroEigenValue)
%     error(['the number of leading eigenvalues must be less than ', num2str(p-ZeroEigenValue)]);
% end
Total = sum(EigenValues);
if (NumOfPC >= 1)    % choose the leading NumOfPC eigenvectors.
    EigenVectors = EigenVectors(:,1:NumOfPC);
    ratio = sum(EigenValues(1:NumOfPC))/Total;
    EigenValues = EigenValues(1:NumOfPC);
else
    % extract leading eigenvectors that explain 100*NumOfPC% of data variation
    count = 1;
    Temp = EigenValues(count);
    ratio = Temp/Total;
    while (ratio < NumOfPC)
        count = count + 1;
        Temp = Temp + EigenValues(count);
        ratio = Temp/Total;
    end    % while loop ends
    EigenVectors = EigenVectors(:,1:count);
    EigenValues = EigenValues(1:count);
end
% normalization
EigenVectors = EigenVectors./(ones(p,1)*sqrt(EigenValues)');

B.8 Kernel Fisher linear discriminant analysis

B.9 Kernel CCA


Bibliography

[1] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.

[2] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, 12:2385-2404, 2000.

[3] Y. Bengio. Gradient-based optimization of hyper-parameters. Neural Computation, 12(8):1889-1900, 2000.

[4] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, Boston, 2004.

[5] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.

[6] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131-159, 2002.

[7] B. Chen and P. T. Harker. Smooth approximations to nonlinear complementarity problems. SIAM Journal on Optimization, 7:403-420, 1997.

[8] C. Chen and O. L. Mangasarian. Smoothing methods for convex inequalities and linear complementarity problems. Mathematical Programming, 71(1):51-69, 1995.

[9] C. Chen and O. L. Mangasarian. A class of smoothing functions for nonlinear and mixed complementarity problems. Computational Optimization and Applications, 5(2):97-138, 1996.

[10] P.-C. Chen, T.-J. Lee, Y.-J. Lee and S.-Y. Huang. Multiclass support vector classification via regression. Submitted.

[11] X. Chen, L. Qi, and D. Sun. Global and superlinear convergence of the smoothing Newton method and its application to general box constrained variational inequalities. Mathematics of Computation, 67:519-540, 1998.

[12] X. Chen and Y. Ye. On homotopy-smoothing methods for variational inequalities. SIAM Journal on Control and Optimization, 37:589-616, 1999.

[13] R. D. Cook. Regression Graphics: Ideas for Studying Regressions through Graphics. John Wiley and Sons, 1998.

[14] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-279, 1995.

[15] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, 2000.

[16] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support vector regression machines. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 155-161, Cambridge, MA, 1997. MIT Press.

[17] K.-T. Fang and Y. Wang. Number-theoretic Methods in Statistics. Chapman & Hall, London, 1994.

[18] M. Fukushima and L. Qi. Reformulation: Nonsmooth, Piecewise Smooth, Semismooth and Smoothing Methods. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1999.

[19] G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. In F. Provost and R. Srikant, editors, Proceedings KDD-2001: Knowledge Discovery and Data Mining, August 26-29, 2001, San Francisco, CA, pages 77-86, New York, 2001. Association for Computing Machinery. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps.

[20] G. M. Fung and O. L. Mangasarian. Multicategory proximal support vector machine classifiers. Machine Learning, 59:77-97, 2005.

[21] U. Grenander. Stochastic processes and statistical inference. Arkiv for Matematik, 1:195-277, 1950.

[22] U. Grenander. Probabilities on Algebraic Structures. Almqvist & Wiksells, Stockholm, and John Wiley & Sons, New York, 1963.

[23] U. Grenander. Abstract Inference. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York, 1981.

[24] T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2001.

[25] M. Hein and O. Bousquet. Kernels, associated structures and generalizations. Technical report, Max Planck Institute for Biological Cybernetics, Germany, 2004. http://www.kyb.tuebingen.mpg.de/techreports.html.

[26] S.-Y. Huang and C. R. Hwang. Kernel Fisher Discriminant Analysis in Gaussian Reproducing Kernel Hilbert Spaces. Manuscript. http://www.stat.sinica.edu.tw/syhuang

[27] S.-Y. Huang, K.-Y. Lee and H. H.-S. Lu. Lecture Notes on Statistical and Machine Learning. Course material maintained at http://140.109.74.188/.

[28] C.-M. Huang, Y.-J. Lee, D. K. J. Lin and S.-Y. Huang. Model selection for support vector machines via uniform design. A special issue on Machine Learning and Robust Data Mining, Journal of Computational Statistics and Data Analysis, to appear, 2007.

[29] S. Janson. Gaussian Hilbert Spaces. Cambridge University Press, Cambridge, 1997.

[30] J. Larsen, C. Svarer, L. N. Andersen, and L. K. Hansen. Adaptive regularization in neural network modeling. In G. B. Orr and K. R. Muller, editors, Neural Networks: Tricks of the Trade, 1998.

[31] Y.-J. Lee, W.-F. Hsieh, and C.-M. Huang. ε-SSVR: A smooth support vector machine for ε-insensitive regression. IEEE Transactions on Knowledge and Data Engineering, 17:678-685, 2005.

[32] Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. Technical Report 00-07, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, July 2000. Proceedings of the First SIAM International Conference on Data Mining, Chicago, April 5-7, 2001, CD-ROM Proceedings. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps.

[33] Y.-J. Lee and O. L. Mangasarian. SSVM: A smooth support vector machine. Computational Optimization and Applications, 20:5-22, 2001.

[34] Y.-J. Lee and S.-Y. Huang. Reduced support vector machines: A statistical theory. IEEE Transactions on Neural Networks, 18:1-13, 2007.

[35] K. C. Li. Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association, 86:316-342, 1991.

[36] K.-M. Lin and C.-J. Lin. A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14:1449-1459, 2003.

[37] O. L. Mangasarian and D. R. Musicant. Active set support vector machine classification. Advances in Neural Information Processing Systems, 13:577-583, 2001.

[38] O. L. Mangasarian and D. R. Musicant. Large scale kernel regression via linear programming. Technical Report 99-02, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, August 1999. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-02.ps.

[39] O. L. Mangasarian and D. R. Musicant. Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):950-955, 2000.

[40] O. L. Mangasarian and D. R. Musicant. Lagrangian support vector machines. Journal of Machine Learning Research, 1:161-177, 2001.

[41] S. Mika. Kernel Fisher Discriminants. PhD thesis, Electrical Engineering and Computer Science, Technische Universitat Berlin, 2002.

[42] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Muller. Fisher discriminant analysis with kernels. Neural Networks for Signal Processing, IX:41-48, 1999.

[43] D. R. Musicant and A. Feinberg. Active set support vector regression. IEEE Transactions on Neural Networks, 15:268-275, 2004.

[44] H. Niederreiter. Random Number Generation and Quasi-Monte Carlo Methods. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 1992.

[45] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.

[46] A. Smola and B. Scholkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning, pages 911-918. Morgan Kaufmann, San Francisco, CA, 2000.

[47] A. Smola and B. Scholkopf. A tutorial on support vector regression. Statistics and Computing, 14:199-222, 2004.

[48] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293-300, 1999.

[49] J. A. K. Suykens and J. Vandewalle. Multiclass least squares support vector machines. In Proceedings of IJCNN'99, CD-ROM, Washington, DC, 1999.

[50] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, New Jersey, 2002.

[51] P. Tseng. Analysis of a non-interior continuation method based on Chen-Mangasarian smoothing functions for complementarity problems. In M. Fukushima and L. Qi, editors, Reformulation: Nonsmooth, Piecewise Smooth, Semismooth and Smoothing Methods, pages 381-404, Dordrecht, The Netherlands, 1999. Kluwer Academic Publishers.

[52] N. N. Vakhania, V. I. Tarieladze and S. A. Chobanyan. Probability Distributions on Banach Spaces. Translated from the Russian by W. A. Woyczynski. Mathematics and Its Applications (Soviet Series), 14, D. Reidel Publishing Co., Dordrecht, Holland, 1987.

[53] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[54] C. K. I. Williams and M. Seeger. Using the Nystrom method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682-688, Cambridge, MA, 2001. MIT Press.

[55] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA, 1999.

[56] H. M. Wu. Kernel sliced inverse regression with applications to classification. Submitted.

[57] J. Zhu, S. Rosset, T. Hastie and R. Tibshirani. 1-norm support vector machines. In Advances in Neural Information Processing Systems 16, pages 49-56, 2003. MIT Press.