
2011 Seventh International Conference on Natural Computation

The Generalization Performance of Learning Algorithms Derived Simultaneously through Algorithmic Stability and Space Complexity

Jie Xu
Faculty of Mathematics and Computer Science
Hubei University, Wuhan, 430062, China

Bin Zou
Faculty of Mathematics and Computer Science
Hubei University, Wuhan, 430062, China

Abstract—A main issue in theoretical research on machine learning is to analyze the generalization performance of learning algorithms. Previous results describing the generalization performance of learning algorithms are based on either the complexity of the hypothesis space or the stability property of the learning algorithm. In this paper we go beyond these classical frameworks by establishing the first generalization bounds for learning algorithms expressed in terms of both uniform stability and the covering number of the function space, for regularized least squares regression and SVM regression. To provide a better understanding of the results obtained in this paper, we compare the obtained generalization bounds with previously known results.

I. INTRODUCTION

In recent years, Support Vector Machines (SVMs) [1] have become one of the most widely used learning algorithms. Besides their good generalization performance in practical applications, they also enjoy a good theoretical justification in terms of both consistency (see e.g. [2]-[4]) and generalization when the training samples come from an independent and identically distributed (i.i.d.) process. This has motivated theoretical research on the generalization and consistency of learning algorithms (e.g. [5]-[10]). Until recently, two main approaches have been proposed to study the generalization and consistency of learning algorithms. The first approach is based on the theory of uniform convergence of empirical risks to their expected risks (see e.g. [1], [8]). Such convergence theory provides ways to estimate the generalization bounds of a learning algorithm in terms of an empirical measurement of its accuracy and a measure of its complexity, such as the VC-dimension [1], the covering number [11], the Rademacher average [12], and the V_γ-dimension and P_γ-dimension [13]. In this framework, the main aim is to characterize the conditions on the hypothesis space that ensure the generalization ability of empirical risk minimization (ERM) learning algorithms; that is, all of the works based on the theory of uniform convergence are stated in terms of the complexity of the hypothesis space. In other words, these works answer what property the hypothesis space must have for learning algorithms to generalize well. However, how changes of the input samples influence the output of learning algorithms was not taken into account in that line of research.

The second approach is based on "sensitivity analysis" or perturbation analysis [14]. The aim of sensitivity analysis is to determine how much the variation of the input influences the output of a learning algorithm. A good learning algorithm should be stable with respect to its training samples; that is, any small change of a single element in the training samples should yield only a small change in the output of the learning algorithm (see e.g. [15]-[24]). For example, Bousquet and Elisseeff [15] introduced the definition of uniform stability and proved that uniform stability implies good generalization of learning algorithms and that Tikhonov regularization algorithms are uniformly stable. Kutin and Niyogi [18] introduced the definition of CV-stability and showed that it is adequate in the classical probably approximately correct (PAC) setting. Poggio et al. [16] introduced the definition of CVEEE_loo-stability and proved that CVEEE_loo-stability is sufficient for generalization of any algorithm and necessary and sufficient for generalization and consistency of ERM algorithms. All of these works based on sensitivity analysis are usually independent of the complexity of the hypothesis space.

However, in real applications of machine learning, the performance of a learning algorithm is affected not only by the complexity of the hypothesis space and the stability of the learning algorithm, but also by other factors such as the sampling mechanism and the sample quality. More importantly, we believe that the ways in which those factors determine the performance of learning algorithms are by no means independent of one another. Therefore, machine learning from samples should be viewed as the synthesized and simultaneous action of all the involved factors. From this point of view, more reasonable generalization bounds of learning algorithms should reflect such synthesized influence of all the factors. Obviously, the existing approaches for evaluating generalization performance do not proceed in this way. As a first step toward this goal, we derive in the present paper the first generalization bounds of learning algorithms through the combined use of the covering number of the function space and algorithmic stability.

The paper is organized as follows. In Section 2, we introduce the necessary notions and notations used in this paper. In Section 3 we establish the generalization bounds of the regularized least squares regression algorithm and the SVM regression algorithm based simultaneously on the covering number of the function space and uniform stability (and, respectively, uniform hypothesis stability). In Section 4 we prove the main results obtained in Section 3. Finally, we present the conclusions in Section 5.

II. PRELIMINARIES

In this section we introduce the definitions and notations used throughout the paper. Let (X, d) be a compact metric space and let Y = ℝ. We consider a training set

S = {z_1 = (x_1, y_1), z_2 = (x_2, y_2), ⋅⋅⋅, z_m = (x_m, y_m)}

of size m in Z = X × Y drawn independently and identically distributed (i.i.d.) from an unknown distribution ρ.

A learning algorithm is a process which takes as input a finite training set S ∈ Z^m and outputs a function f_S : X → Y such that f_S(x) is a good approximation to the output y of the target on input x. To avoid complex notation, in this paper we consider only deterministic learning algorithms, and we assume that the learning algorithm is symmetric with respect to S, that is, the learning algorithm does not depend on the order of the elements in the training set S.

For a given training set S, we build, for all i = 1, 2, ⋅⋅⋅, m, modified training sets as follows (see [15]):

∙ By replacing the i-th element of S:
S^{i,z} = {z_1, ⋅⋅⋅, z_{i−1}, z, z_{i+1}, ⋅⋅⋅, z_m}.

∙ By removing the i-th element of S:
S^{∖i} = {z_1, ⋅⋅⋅, z_{i−1}, z_{i+1}, ⋅⋅⋅, z_m},

where the sample z is assumed to be drawn from Z according to the distribution ρ and independently of S.

To measure the accuracy of the predictions of the learning algorithm, we will use a cost function c : Y × Y → ℝ_+. The loss of the hypothesis f_S with respect to an example z = (x, y) is defined as ℓ(f_S, z) := c(f_S(x), y). The (generalization) error of the hypothesis f_S is defined as

ℰ(f_S) = E[ℓ(f_S, z)] = ∫_Z ℓ(f_S, z) dρ.

Since the distribution ρ is unknown, we have to estimate ℰ(f_S) from the available sample set S. The simplest estimator of the error ℰ(f_S) is the so-called empirical error ℰ_S(f_S) = (1/m) Σ_{i=1}^m ℓ(f_S, z_i), which can be computed directly for a given function f_S.
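As a concrete illustration of this notation (the toy distribution, the quadratic loss and the fixed hypothesis f used below are chosen only for this example), the following Python sketch builds a training set S, the modified sets S^{i,z} and S^{∖i}, and the empirical error of a fixed hypothesis.

import numpy as np

rng = np.random.default_rng(0)

# Training set S = {(x_1, y_1), ..., (x_m, y_m)} drawn i.i.d. from a toy distribution.
m = 20
X = rng.uniform(-1.0, 1.0, size=m)
Y = np.sin(np.pi * X) + 0.1 * rng.standard_normal(m)
S = list(zip(X, Y))

def replace_ith(S, i, z):
    """S^{i,z}: replace the i-th example of S by a new example z."""
    return S[:i] + [z] + S[i + 1:]

def remove_ith(S, i):
    """S^{\\i}: remove the i-th example of S."""
    return S[:i] + S[i + 1:]

def empirical_error(f, S, loss=lambda fx, y: (fx - y) ** 2):
    """Empirical error (1/m) sum_i loss(f(x_i), y_i) for a fixed hypothesis f."""
    return sum(loss(f(x), y) for x, y in S) / len(S)

f = lambda x: 0.8 * np.sin(np.pi * x)      # an arbitrary fixed hypothesis
z_new = (0.3, np.sin(np.pi * 0.3))         # an independent example z
print(empirical_error(f, S))
print(empirical_error(f, replace_ith(S, 4, z_new)))
print(empirical_error(f, remove_ith(S, 4)))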

Different from what is usually studied in learning theory (see e.g. [15], [25]), the study in this paper intends to bound the error of learning algorithms based simultaneously on the covering number of the function space and algorithmic stability. Thus we present some basic assumptions on the hypothesis space, the covering number, and the definitions of algorithmic stability as follows.

First, we assume that the hypothesis space considered is a reproducing kernel Hilbert space (RKHS) ℋ_K (see [26]). Namely, let K : X × X → ℝ be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points {x_1, x_2, ⋅⋅⋅, x_k} ⊂ X, the matrix (K(x_i, x_j))_{i,j=1}^k is positive semidefinite. Such a function is called a Mercer kernel. The RKHS ℋ_K associated with the kernel K is defined to be the closure of the linear span of the set of functions {K_x := K(x, ⋅) : x ∈ X} with the inner product ⟨⋅, ⋅⟩_{ℋ_K} = ⟨⋅, ⋅⟩_K satisfying ⟨K_x, K_y⟩_K = K(x, y), that is,

⟨Σ_i a_i K_{x_i}, Σ_j b_j K_{y_j}⟩_K = Σ_{i,j} a_i b_j K(x_i, y_j).

The reproducing property takes the form

⟨K_x, f⟩_K = f(x), ∀x ∈ X, ∀f ∈ ℋ_K.

Denote by C(X) the space of continuous functions on X with the norm ||f||_∞ := sup_{x∈X} |f(x)|. Let κ = sup_{x∈X} √K(x, x); then the above reproducing property tells us that ||f||_∞ ≤ κ ||f||_K, ∀f ∈ ℋ_K. Let ℬ_R := {f ∈ ℋ_K : ||f||_K ≤ R}. It is a subset of C(X) and its covering number is well defined (see e.g. [11]).

Definition 1: For a subset ℱ of a metric space and ε > 0, the covering number N(ℱ, ε) of the function set ℱ is the minimal l ∈ ℕ such that there exist l disks in ℱ with radius ε covering ℱ.

We denote the covering number of the unit ball ℬ_1 as

N(ε) := N(ℬ_1, ε), ε > 0.
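Definition 1 can be made concrete for a finite family of functions drawn from a ball of ℋ_K. The following sketch (Gaussian kernel, random coefficient vectors, sup-norm approximated on a grid — all of these choices are only for illustration) estimates an ε-cover greedily; since it works on a finite sample of functions and a finite grid, it only gives a rough lower indication of the covering number in Definition 1.

import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(-1.0, 1.0, 200)            # points at which the sup-norm is approximated
centers = np.linspace(-1.0, 1.0, 10)          # kernel sections K(t_j, .) spanning a finite-dim. subspace
K = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / 0.5)   # Gram matrix of the sections

def sample_from_ball(n, radius=1.0):
    """Draw n functions f = sum_j a_j K(t_j, .) with ||f||_K = radius, evaluated on the grid."""
    basis = np.exp(-(grid[:, None] - centers[None, :]) ** 2 / 0.5)  # K(t_j, x) on the grid
    fs = []
    for _ in range(n):
        a = rng.standard_normal(len(centers))
        a *= radius / np.sqrt(a @ K @ a)      # rescale so that ||f||_K^2 = a^T K a = radius^2
        fs.append(basis @ a)
    return np.array(fs)

def greedy_cover_size(fs, eps):
    """Greedy estimate of the number of sup-norm balls of radius eps needed to cover fs."""
    uncovered = list(range(len(fs)))
    n_balls = 0
    while uncovered:
        c = uncovered[0]                       # take an uncovered function as a new center
        uncovered = [i for i in uncovered
                     if np.max(np.abs(fs[i] - fs[c])) > eps]
        n_balls += 1
    return n_balls

fs = sample_from_ball(500)
for eps in (0.5, 0.2, 0.1):
    print(eps, greedy_cover_size(fs, eps))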

Definition 2: The RKHS ℋ_K is said to have polynomial complexity exponent α > 0 if there is some C_0 > 0 such that

ln N(ε) ≤ C_0 (1/ε)^α, ∀ε > 0.   (1)

Remark 1: The covering number has been studied extensively in [19]-[22]. In particular, it is known that for the Gaussian kernel K(x, y) = exp{−|x − y|²/σ²} with σ > 0 on a bounded subset X of ℝⁿ, and more generally whenever the kernel K is Cˢ with s > 0 (Sobolev smoothness), inequality (1) is valid with α = 2n/s (see [23]).

In addition, we assume that there exists a constant C such that for any f_1, f_2 ∈ ℋ_K and any z ∈ Z,

|ℓ(f_1, z) − ℓ(f_2, z)| ≤ C ⋅ ||f_1 − f_2||_∞,   (2)

and we assume that there is a constant M such that |y| ≤ M for any y ∈ Y.

Remark 2: Assumption (2) is a general assumption used in learning theory (see e.g. [15], [27]). For example, suppose K is a bounded kernel, that is, K(x, x) ≤ κ². If the loss function is the ε-insensitive loss

ℓ(f, z) = |f(x) − y|_ε = { 0, if |f(x) − y| ≤ ε; |f(x) − y| − ε, otherwise, }

then (2) holds with C = 1. If Y = {−1, 1} and the loss function is the hinge loss

ℓ(f, z) = (1 − y f(x))_+ = { 1 − y f(x), if 1 − y f(x) ≥ 0; 0, otherwise, }

then again C = 1. Moreover, for any f_1, f_2 ∈ ℬ_R we have

|(f_1(x) − y)² − (f_2(x) − y)²| ≤ 2(κR + M) ⋅ ||f_1 − f_2||_∞.

This implies that for the least squares loss ℓ(f, z) = (f(x) − y)², assumption (2) holds on ℬ_R with C = 2(κR + M).
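To see where the constant 2(κR + M) comes from, note that for f_1, f_2 ∈ ℬ_R and z = (x, y) ∈ Z,

(f_1(x) − y)² − (f_2(x) − y)² = (f_1(x) − f_2(x)) ⋅ (f_1(x) + f_2(x) − 2y),

and, since ||f_i||_∞ ≤ κ ||f_i||_K ≤ κR and |y| ≤ M, the second factor is bounded in absolute value by 2κR + 2M; taking absolute values and the supremum over x ∈ X gives the inequality stated above.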


In this paper, we consider two notions of algorithmic stability: uniform stability and uniform hypothesis stability. We close this section by giving their definitions.

Definition 3: ([18]) A learning algorithm is said to be uniformly β-hypothesis stable, or to have uniform hypothesis stability β, if there is a nonnegative constant β such that

∀S ∈ Z^m, ∀z ∈ Z, ∀1 ≤ i ≤ m, |ℓ(f_S, z) − ℓ(f_{S^{i,z}}, z)| ≤ β,

with β going to zero as m → ∞.

Definition 4: ([15]) A learning algorithm is said to be β-uniformly stable, or to have uniform stability β, if there is a nonnegative constant β such that

∀S ∈ Z^m, ∀z ∈ Z, ∀1 ≤ i ≤ m, |ℓ(f_S, z) − ℓ(f_{S^{∖i}}, z)| ≤ β,

with β going to zero as m → ∞.

Remark 3: Comparing Definitions 3 and 4 with Definition 3.1 in [18] and Definition 6 in [15], respectively, one finds that f_S, f_{S^{i,z}} and f_{S^{∖i}} in this paper correspond to A_S, A_{S^i} and A_{S^{∖i}} in [15] and [18], respectively. The interested reader can consult [15] and [18] for the details. For the time being, the stability parameter β_m is denoted simply by β so as to reduce notational clutter; the dependence of β on m is restored near the end of the paper.
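A rough empirical proxy for the uniform stability of Definition 4 can be obtained by training on S and on each leave-one-out set S^{∖i} and recording the largest observed change of the loss on fresh points. The sketch below does this for kernel ridge regression, a standard closed-form minimizer of objectives of the form (3) below; the Gaussian kernel, the data and the grid of test points are chosen only for illustration, and the estimate is optimistic since the supremum over all S and z cannot be taken.

import numpy as np

def krr_fit(X, y, lam, gamma=2.0):
    """Minimize (1/m) sum (f(x_i)-y_i)^2 + lam ||f||_K^2.
    The minimizer is f = sum_j c_j K(x_j, .) with (K + m*lam*I) c = y."""
    m = len(X)
    K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
    c = np.linalg.solve(K + m * lam * np.eye(m), y)
    return X.copy(), c

def krr_predict(model, Xtest, gamma=2.0):
    Xtrain, c = model
    Ktest = np.exp(-gamma * (Xtest[:, None] - Xtrain[None, :]) ** 2)
    return Ktest @ c

rng = np.random.default_rng(0)
m, lam = 50, 0.1
X = rng.uniform(-1, 1, m)
y = np.sin(np.pi * X) + 0.1 * rng.standard_normal(m)
Xtest = np.linspace(-1, 1, 200)
ytest = np.sin(np.pi * Xtest)

f_S = krr_predict(krr_fit(X, y, lam), Xtest)
beta_hat = 0.0
for i in range(m):
    Xi, yi = np.delete(X, i), np.delete(y, i)
    f_Si = krr_predict(krr_fit(Xi, yi, lam), Xtest)
    # largest change of the squared loss on the test grid when example i is removed
    beta_hat = max(beta_hat, np.max(np.abs((f_S - ytest) ** 2 - (f_Si - ytest) ** 2)))
print(beta_hat)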

III. MAIN RESULTS

In this section, we establish the bounds on the error of two learning algorithms (regularized least squares regression and SVM regression) based simultaneously on the covering number of the function space and algorithmic stability. Our main results are stated as follows.

Theorem 1: Suppose the learning algorithm is defined as

f_S = arg min_{f ∈ ℋ_K} (1/m) Σ_{i=1}^m (f(x_i) − y_i)² + λ ||f||²_K,   (3)

and assume that the learning algorithm (3) has uniform stability β. Then for any δ ∈ (0, 1], with probability at least 1 − δ, the inequality

ℰ(f_S) ≤ ℰ_S(f_S) + ε(m, δ) + 2β   (4)

is valid, where

ε(m, δ) ≤ (8M²(κ + √λ)/λ) ⋅ max{ κ [ln(1/δ)/m]^{1/2}, [κ²C_0/m]^{1/(α+2)} }.

As an application of Theorem 1, we can easily establish the

following bound on the convergence rate of regularized least

squares regression algorithm (3).

Corollary 1: Suppose the learning algorithm (3) has uniform stability β. Then for any δ ∈ (0, 1], with probability at least 1 − δ, the inequality

ℰ(f_S) ≤ ℰ_S(f_S) + 2β + (8M²(κ + √λ)/λ) [κ²C_0/m]^{1/(α+2)}   (5)

is valid provided that the sample size satisfies m > κ² ln(1/δ) [ln(1/δ)/C_0]^{2/α}. The inequality

ℰ(f_S) ≤ ℰ_S(f_S) + 2β + (8κM²(κ + √λ)/λ) [ln(1/δ)/m]^{1/2}   (6)

is valid provided that the sample size satisfies m ≤ κ² ln(1/δ) [ln(1/δ)/C_0]^{2/α}.

Remark 4: (i) Bound (4) in Theorem 1 evaluates the risk of the chosen function based simultaneously on the covering number of a ball of ℋ_K and on the uniform stability parameter β. Different from the previously known bounds in [1], [15] and [8], the bound in Theorem 1 depends explicitly both on the uniform stability parameter β and on the complexity parameters C_0, α of the function space ℋ_K. As far as we know, this is the first result in this direction.

(ii) In order to better understand the significance and value of the results obtained in Theorem 1, we now compare them with the previously known bounds obtained by Bousquet and Elisseeff ([15]). First, in [15], Bousquet and Elisseeff established a sharper generalization bound (see Example 3 in [15]) for learning algorithm (3) based on uniform stability alone, while in Theorem 1 we establish the bound on the generalization error of learning algorithm (3) based simultaneously on the covering number of the function space and algorithmic stability. Second, comparing bound (4) obtained in Theorem 1 with the bound in Example 3 of [15], we find that when the size of the training sample satisfies m ≤ κ² ln(1/δ) [ln(1/δ)/C_0]^{2/α}, the generalization bound in Theorem 1 has the same rate as that obtained by Bousquet and Elisseeff in Example 3 of [15].
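The switch between the two regimes discussed above is easy to check numerically: for given values of κ, C_0, α and δ (the values below are arbitrary and serve only as an illustration), one can evaluate the two terms inside the max of Theorem 1 together with the sample-size threshold at which they cross.

import numpy as np

kappa, C0, alpha, delta = 1.0, 1.0, 0.5, 0.05
m_values = np.logspace(1, 6, 6).astype(int)

# threshold on m separating the two regimes of Corollary 1
threshold = kappa ** 2 * np.log(1 / delta) * (np.log(1 / delta) / C0) ** (2 / alpha)
print("threshold on m:", threshold)

for m in m_values:
    term_sqrt = kappa * np.sqrt(np.log(1 / delta) / m)        # the [ln(1/delta)/m]^(1/2) term
    term_slow = (kappa ** 2 * C0 / m) ** (1 / (alpha + 2))    # the [kappa^2 C0/m]^(1/(alpha+2)) term
    which = "slow term dominates" if term_slow > term_sqrt else "sqrt term dominates"
    print(m, term_sqrt, term_slow, which)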

In addition, if the learning algorithm is the SVM regression algorithm (see [1], [15]), that is, f_S is defined as

f_S = arg min_{f ∈ ℋ_K} (1/m) Σ_{i=1}^m |f(x_i) − y_i|_ε + λ ||f||²_K,   (7)

we also establish the following bound on the generalization error of learning algorithm (7), again based simultaneously on the covering number of a ball of ℋ_K and the uniform stability parameter β.

Theorem 2: Suppose the learning algorithm is defined as (7), and assume that the learning algorithm (7) has uniform stability β. Then for any δ ∈ (0, 1], with probability at least 1 − δ, the inequality

ℰ(f_S) ≤ ℰ_S(f_S) + ε′(m, δ) + 2β   (8)

is valid, where

ε′(m, δ) ≤ 8(M + κ√(M/λ)) ⋅ max{ [ln(1/δ)/m]^{1/2}, [κ²C_0/m]^{1/(α+2)} }.

Remark 5: By Definitions 3 and 4, an algorithm with uniform stability β also satisfies |ℓ(f_S, z) − ℓ(f_{S^{i,z}}, z)| ≤ 2β; that is, uniform stability β implies uniform hypothesis stability 2β. Thus, by the same argument as in Theorems 1 and 2, we can obtain analogous bounds on the generalization error of learning algorithms (3) and (7) based simultaneously on the covering number of the function space and uniform hypothesis stability.
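For completeness, algorithm (7) can be approximated directly in coefficient form: writing f = Σ_j c_j K(x_j, ⋅), the objective becomes (1/m) Σ_i |(Kc)_i − y_i|_ε + λ c^T K c, which can be minimized by subgradient descent on c. The following sketch is only an illustrative solver (Gaussian kernel, fixed step size, small synthetic data); it is not the specialized quadratic-programming solvers usually used for SVM regression.

import numpy as np

rng = np.random.default_rng(0)
m, lam, eps, gamma = 40, 0.01, 0.05, 2.0
X = rng.uniform(-1, 1, m)
y = np.sin(np.pi * X) + 0.1 * rng.standard_normal(m)
K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)

c = np.zeros(m)
step = 0.05
for _ in range(2000):
    r = K @ c - y                              # residuals f(x_i) - y_i
    s = np.sign(r) * (np.abs(r) > eps)         # subgradient of the eps-insensitive loss
    grad = K @ s / m + 2 * lam * K @ c         # subgradient of the full objective w.r.t. c
    c -= step * grad

train_obj = np.mean(np.maximum(np.abs(K @ c - y) - eps, 0)) + lam * c @ K @ c
print(train_obj)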


IV. PROOF OF MAIN RESULTS

In order to establish the bounds on the generalization error of the above two learning algorithms simultaneously based on the covering number of the function space and algorithmic stability, we need the following two useful lemmas.

Lemma 1: (Hoeffding's inequality) Let ξ be a random variable on a probability space Z with expectation μ = E(ξ). If |ξ(z) − μ| ≤ B_1 for almost all z ∈ Z, then for all ε > 0,

P{ |(1/m) Σ_{i=1}^m ξ(z_i) − μ| ≥ ε } ≤ 2 exp{ −mε²/(2B_1²) }.

Lemma 2: ([25]) Let c_1, c_2 > 0, and s > q > 0. Then the equation

x^s − c_1 x^q − c_2 = 0

has a unique positive zero x*. In addition,

x* ≤ max{ (2c_1)^{1/(s−q)}, (2c_2)^{1/s} }.
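The bound on x* can be checked directly: set x̄ := max{ (2c_1)^{1/(s−q)}, (2c_2)^{1/s} }. Then c_1 x̄^q ≤ (x̄^{s−q}/2) x̄^q = x̄^s/2 and c_2 ≤ x̄^s/2, so x̄^s − c_1 x̄^q − c_2 ≥ 0. Since the left-hand side equals −c_2 < 0 at x = 0 and has a unique positive zero x*, it follows that x* ≤ x̄.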

Proof of Theorem 1: We decompose the proof into three steps.

Step 1: By the definition of f_S, comparing its objective value in (3) with that of f = 0, we have for any S ∈ Z^m

λ ||f_S||²_K ≤ (1/m) Σ_{i=1}^m (f_S(x_i) − y_i)² + λ ||f_S||²_K ≤ (1/m) Σ_{i=1}^m (0 − y_i)² + 0 ≤ M².

The last inequality follows from the assumption |y| ≤ M for any y ∈ Y. We then have ||f_S||_K ≤ M/√λ for almost all S ∈ Z^m. Similarly, for any S ∈ Z^m and any 1 ≤ i ≤ m, ||f_{S^{∖i}}||_K ≤ M/√λ. These facts imply that f_S ∈ ℬ_{C_1} and f_{S^{∖i}} ∈ ℬ_{C_1} (1 ≤ i ≤ m) with C_1 = M/√λ, for almost all S ∈ Z^m.

Step 2: Let ℒ(f) = ℰ(f) − ℰ_S(f). For any i ∈ {1, 2, ⋅⋅⋅, m}, we have

|ℒ(f_S) − ℒ(f_{S^{∖i}})| ≤ |ℰ(f_S) − ℰ(f_{S^{∖i}})| + |ℰ_S(f_S) − ℰ_S(f_{S^{∖i}})|.

By Definition 4, each term on the right-hand side is at most β, so |ℒ(f_S) − ℒ(f_{S^{∖i}})| ≤ 2β. It follows that for any 1 ≤ i ≤ m,

|ℒ(f_S)| ≤ |ℒ(f_{S^{∖i}})| + 2β.

Then we have that for any ε > 0,

P{ |ℒ(f_S)| ≥ 2ε + 2β } ≤ P{ |ℒ(f_{S^{∖i}})| ≥ 2ε } ≤ P{ sup_{1≤i≤m} |ℒ(f_{S^{∖i}})| ≥ 2ε } ≤ P{ sup_{f ∈ ℬ_{C_1}} |ℒ(f)| ≥ 2ε }.   (9)

The final inequality follows from the fact that, for any S ∈ Z^m and any 1 ≤ i ≤ m, f_{S^{∖i}} ∈ ℬ_{C_1}.

Now we bound the term on the right-hand side of inequality (9). By an argument similar to that in [8], let the balls D_j, j ∈ {1, 2, ⋅⋅⋅, l_1}, with centers f_j and radius ε/(2C), be a cover of ℬ_{C_1}, where l_1 = N(ℬ_{C_1}, ε/(2C)). Then for any ε > 0, we have

P{ sup_{f ∈ ℬ_{C_1}} |ℒ(f)| ≥ 2ε } ≤ Σ_{j=1}^{l_1} P{ sup_{f ∈ D_j} |ℒ(f)| ≥ 2ε }.   (10)

In addition, for all f ∈ D_j, we have

|ℒ(f) − ℒ(f_j)| ≤ |ℰ(f) − ℰ(f_j)| + |ℰ_S(f) − ℰ_S(f_j)| ≤ 2C ⋅ ||f − f_j||_∞ ≤ 2C ⋅ ε/(2C) = ε.

It follows that for any f ∈ D_j,

sup_{f ∈ D_j} |ℒ(f)| ≥ 2ε ⟹ |ℒ(f_j)| ≥ ε.

We then conclude that for any j ∈ {1, 2, ⋅⋅⋅, l_1} and any ε > 0,

P{ sup_{f ∈ D_j} |ℒ(f)| ≥ 2ε } ≤ P{ |ℒ(f_j)| ≥ ε }.

By Lemma 1 and the fact that ℓ(f, z) ≤ C ||f||_∞ ≤ Cκ ||f||_K ≤ CκC_1 =: B_1 for any z ∈ Z and any f ∈ ℬ_{C_1}, we get that for any ε > 0,

P{ |ℰ(f_j) − ℰ_S(f_j)| ≥ ε } ≤ 2 exp{ −mε²/(2B_1²) }.

By inequality (10) and the above two inequalities, we have

P{ sup_{f ∈ ℬ_{C_1}} |ℒ(f)| ≥ 2ε } ≤ 2 N(ℬ_{C_1}, ε/(2C)) exp{ −mε²/(2B_1²) }.

Combining inequality (9) with the above inequality and replacing ε by ε/2, we have that for any ε > 0,

P{ |ℒ(f_S)| ≥ ε + 2β } ≤ 2 N(ℬ_{C_1}, ε/(4C)) exp{ −mε²/(8B_1²) }.   (11)

Step 3: By the fact that an ε-covering of ℬ_1 yields a C_1 ⋅ ε-covering of ℬ_{C_1} and vice versa (see e.g. [6], [7]), we have that for any ε > 0, N(ℬ_{C_1}, ε/(4C)) ≤ N(ε/(4CC_1)). By Definition 2 and inequality (11), we then get that for any ε > 0,

P{ |ℒ(f_S)| ≥ ε + 2β } ≤ 2 exp{ C_0 (4CC_1/ε)^α − mε²/(8B_1²) }.   (12)

We now rewrite the above inequality in an equivalent form. We equate the right-hand side of inequality (12) to a positive value δ (0 < δ ≤ 1):

exp{ C_0 (4CC_1/ε)^α − mε²/(8B_1²) } = δ.

It follows that

ε^{α+2} − ε^α ⋅ 8 ln(1/δ) B_1²/m − 8 B_1² C_0 (4CC_1)^α/m = 0.

By Lemma 2, the unique positive solution ε* of the above equation with respect to ε satisfies

ε* = ε(m, δ) ≤ 4CC_1 ⋅ max{ κ [ln(1/δ)/m]^{1/2}, [κ²C_0/m]^{1/(α+2)} }.


Then by inequality (12) we conclude that, for any δ ∈ (0, 1], with probability at least 1 − δ the inequality

ℰ(f_S) ≤ ℰ_S(f_S) + ε(m, δ) + 2β

holds true. Substituting C_1 = M/√λ and C = 2(κM/√λ + M), so that 4CC_1 = 8M²(κ + √λ)/λ, we finish the proof of Theorem 1.

Proof of Theorem 2: By the definition of f_S, comparing its objective value in (7) with that of f = 0, we have for any S ∈ Z^m

λ ||f_S||²_K ≤ (1/m) Σ_{i=1}^m |f_S(x_i) − y_i|_ε + λ ||f_S||²_K ≤ (1/m) Σ_{i=1}^m |0 − y_i|_ε + 0 ≤ M.

The last inequality follows from the assumption |y| ≤ M for any y ∈ Y. It follows that ||f_S||_K ≤ √(M/λ) for almost all S ∈ Z^m. Similarly, for any S ∈ Z^m and any 1 ≤ i ≤ m, ||f_{S^{∖i}}||_K ≤ √(M/λ). This implies that f_S ∈ ℬ_{C_2} and f_{S^{∖i}} ∈ ℬ_{C_2} (1 ≤ i ≤ m) with C_2 = √(M/λ). Thus, by an argument similar to that in the proof of Theorem 1 (with C = 1 for the ε-insensitive loss), we can finish the proof of Theorem 2.

V. CONCLUSION

In this paper, we explored how the stability property of a learning algorithm and the complexity of the function space simultaneously influence the generalization performance of learning algorithms. We first applied uniform stability and the covering number of the hypothesis space to establish bounds on the generalization error of the regularized least squares regression and SVM regression algorithms. The established results depend explicitly not only on the uniform stability parameter β, but also on the complexity parameters C_0, α of the function space. To our knowledge, these are the first generalization bounds of this kind. In order to better understand the significance and value of the established results, we also compared our main result with previously known work based on the algorithmic stability approach ([15]).

Further directions of research include establishing better bounds via weaker notions of algorithmic stability (e.g. CVEEE_loo-stability, see [16]; error stability, see [28]) and other complexity measures of the function space (e.g. the Rademacher average, see [12]), as well as establishing generalization bounds of learning algorithms based on more information (e.g. complexity of the hypothesis space, algorithmic stability, sampling mechanism and sample quality). All these problems are under our current investigation.

ACKNOWLEDGEMENTS

This work was supported in part by NSFC project

(61070225) and Foundation of Hubei Educational Committee

(Q20091003).

REFERENCES

[1] V. Vapnik. Statistical Learning Theory. John Wiley, New York, 1998.
[2] I. Steinwart. Support vector machines are universally consistent. J. Complexity, 18: 768-791, 2002.
[3] T. Zhang. Statistical behaviour and consistency of classification methods based on convex risk minimization. Ann. Statist., 32: 56-134, 2004.
[4] I. Steinwart. Consistency of support vector machines and other regularized kernel machines. IEEE Trans. Inform. Theory, 51: 128-142, 2005.
[5] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.
[6] D. R. Chen, Q. Wu, Y. M. Ying and D. X. Zhou. Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research, 5: 1143-1175, 2004.
[7] Q. Wu, Y. Ying and D. X. Zhou. Learning rates of least-squares regularized regression. Found. Comput. Math., 6: 171-192, 2006.
[8] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39: 1-49, 2001.
[9] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge, 2007.
[10] I. Steinwart and C. Scovel. Fast rates for support vector machines. 18th Ann. Conf. Learning Theory (COLT 2005), Bertinoro, Italy, Jun., 279-294, 2005.
[11] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. New York: Springer-Verlag, 1996.
[12] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3: 463-482, 2002.
[13] T. Evgeniou and M. Pontil. On the V-gamma dimension for regression in reproducing kernel Hilbert spaces. In Proc. of Algorithmic Learning Theory, Lecture Notes in Comput. Sci., Springer, Berlin, 1720: 106-117, 1999.
[14] J. F. Bonnans and A. Shapiro. Optimization problems with perturbation: a guided tour. SIAM Rev., 40: 228-264, 1998.
[15] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2: 499-526, 2002.
[16] T. Poggio, R. Rifkin, S. Mukherjee and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428: 419-422, 2004.
[17] L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25: 601-604, 1979.
[18] S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of Uncertainty in AI, Morgan Kaufmann, Univ. Alberta, Edmonton, 2002.
[19] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory, 44: 525-536, 1998.
[20] R. C. Williamson, A. J. Smola and B. Scholkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Trans. Inform. Theory, 47: 2516-2532, 2001.
[21] D. X. Zhou. The covering number in learning theory. Journal of Complexity, 18: 739-767, 2002.
[22] M. Pontil. A note on different covering numbers in learning theory. Journal of Complexity, 19: 665-671, 2003.
[23] D. X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inform. Theory, 49: 1743-1752, 2003.
[24] S. Agarwal and P. Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research, 10: 441-474, 2009.
[25] F. Cucker and S. Smale. Best choices for regularization parameters in learning theory: on the bias-variance problem. Found. Comput. Math., 2: 413-428, 2002.
[26] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68: 337-404, 1950.
[27] P. L. Bartlett, O. Bousquet and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33: 1497-1537, 2005.
[28] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Comput., 11: 1427-1453, 1999.