978-1-4244-9953-3/11/$26.00 ©2011 IEEE
2011 Seventh International Conference on Natural Computation
The Generalization Performance of Learning
Algorithms Derived Simultaneously through
Algorithmic Stability and Space Complexity
Jie Xu
Faculty of Mathematics and Computer Science
Hubei University, Wuhan, 430062, China
Bin Zou
Faculty of Mathematics and Computer Science
Hubei University, Wuhan, 430062, China
Abstract—A main issue in theoretical research on machine learning is to analyze the generalization performance of learning algorithms. Previous results describing the generalization performance of learning algorithms are based on either the complexity of the hypothesis space or the stability property of the learning algorithm. In this paper we go beyond these classical frameworks by establishing the first generalization bounds for learning algorithms stated in terms of both uniform stability and the covering number of the function space, for regularized least squares regression and SVM regression. To better understand the results obtained in this paper, we compare the obtained generalization bounds with previously known results.
I. INTRODUCTION
In recent years, Support Vector Machines (SVMs)[1] have
become one of the most widely utilized learning algorithms.
Besides their good generalization performance in practical
applications, they also enjoy a good theoretical justification
in terms of both consistency (see e.g. [2]-[4]) and generalization when the training samples come from an independent and identically distributed (i.i.d.) process. This has motivated interest in theoretical research on the generalization and consistency of learning algorithms (e.g. [5]-[10]). Until recently, two main approaches have been proposed to study
the generalization and consistency of learning algorithms. The
first approach is based on the theory of uniform convergence
of empirical risks to their expected risks (see e.g.[1],[8]).
Such convergence theory provides some ways to estimate the
generalization bounds of learning algorithms in terms of an
empirical measurement of its accuracy and a measure of its
complexity, such as the VC-dimension [1], covering number [11], Rademacher average [12], and the V_γ-dimension and P_γ-dimension [13]. In this framework, the main aim is to characterize the conditions on the hypothesis space that ensure the generalization ability of empirical risk minimization (ERM) learning algorithms; that is, all of these works based on the theory of uniform convergence are stated in terms of the complexity of the hypothesis space. In other words, these works answer the question of what properties the hypothesis space must have for learning algorithms to generalize well. However, how changes in the input samples influence the output of learning algorithms has not been taken into account in this line of research.
The second approach is based on “sensitivity analysis” or
perturbation analysis [14]. The aim of sensitivity analysis is
to determine how much the variation of the input influences
the output of a learning algorithm. A good learning algorithm should be stable with respect to its training samples; that is, any small change of a single element in the training samples should yield only a small change in the output of the learning algorithm (see e.g. [15]-[24]). For example, Bousquet and Elisseeff [15] introduced the definition of uniform stability and proved that uniform stability implies good generalization of learning algorithms and that Tikhonov regularization algorithms are uniformly stable. Kutin and Niyogi [18] introduced the definition of CV-stability and showed that it is adequate in the classical probably approximately correct (PAC) setting. Poggio et al. [16] introduced the definition of CVEEEloo-stability and proved that CVEEEloo-stability is sufficient for generalization of any algorithm, and necessary and sufficient for generalization and consistency of ERM algorithms. All of
these works based on sensitivity analysis are usually indepen-
dent of the complexity of hypothesis space.
However, in real applications of machine learning, the performance of a learning algorithm is not affected only by the complexity of the hypothesis space and the stability of the learning algorithm, but also by other factors such as the sampling mechanism and sample quality. More importantly, we believe that the ways in which those factors determine the performance of learning algorithms are by no means independent of one another. Therefore, machine learning from samples should be a consequence of the synthesized and simultaneous action of all the factors involved. From this point of view, more reasonable generalization bounds for learning algorithms should reflect such a synthesized influence of all the factors. Obviously, the existing approaches to evaluating generalization performance do not do this. As a first step toward this goal, in the present paper we derive the first generalization bounds for learning algorithms through the combined use of the covering number of the function space and algorithmic stability.
The paper is organized as follows. In Section 2 we introduce the necessary notions and notation used in this paper. In Section 3 we establish the generalization bounds of the regularized least squares regression algorithm and the SVM regression algorithm based simultaneously on the covering number of the function space and on uniform stability and uniform hypothesis stability, respectively. In Section 4 we prove the
main results obtained in Section 3. Finally, we present the
conclusions in Section 5.
II. PRELIMINARIES
In this section we introduce the definitions and notations
used throughout the paper. Let (X, d) be a compact metric space and let Y = R. We consider a training set

S = {z₁ = (x₁, y₁), z₂ = (x₂, y₂), ⋅⋅⋅, z_m = (x_m, y_m)}

of size m in Z = X × Y drawn independently and identically distributed (i.i.d.) from an unknown distribution ρ.

A learning algorithm is a process which takes as input a finite training set S ∈ Z^m and outputs a function f_S : X → Y such that f_S is a good approximation to the output of the target operator on the input S. To avoid complex notation, in this paper we consider only deterministic learning algorithms and we assume that the learning algorithm is symmetric with respect to S, that is, the learning algorithm does not depend on the order of the elements in the training set S.

For a given training set S, we build, for all i = 1, 2, ⋅⋅⋅, m, modified training sets as follows (see [15]):

∙ By replacing the i-th element of S,

S^i = {z₁, ⋅⋅⋅, z_{i−1}, z′, z_{i+1}, ⋅⋅⋅, z_m}.

∙ By removing the i-th element of S,

S^{∖i} = {z₁, ⋅⋅⋅, z_{i−1}, z_{i+1}, ⋅⋅⋅, z_m},

where the sample z′ is assumed to be drawn from Z according to the distribution ρ and independently of S.

To measure the accuracy of the predictions of the learning algorithm, we will use a cost function c : Y × Y → R₊. The loss of the hypothesis f with respect to an example z = (x, y) is defined as ℓ(f, z) := c(f(x), y). The (generalization) error of the hypothesis f is defined as

ℰ(f) = E_z[ℓ(f, z)] = ∫_Z ℓ(f, z) dρ(z).
Since the distribution ρ is unknown and we only know the sample set S, we have to estimate the error from the available sample set S. The simplest estimator of the error ℰ(f) is the so-called empirical error ℰ_S(f) = (1/m) Σ_{i=1}^m ℓ(f, z_i), which can be computed directly for a given function f.
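Since the empirical error is just an average of losses over the sample, it is straightforward to compute. The following sketch (a toy distribution and names of our own choosing, not from the paper) contrasts it with a Monte Carlo proxy for the generalization error ℰ(f):

```python
import random

def square_loss(fx, y):
    return (fx - y) ** 2

def empirical_error(f, sample):
    """Empirical error: the average loss of f over a finite sample."""
    return sum(square_loss(f(x), y) for x, y in sample) / len(sample)

random.seed(0)

def draw(m):
    """Toy distribution rho: x uniform on [0,1], y = 2x + bounded noise."""
    return [(x, 2 * x + random.uniform(-0.1, 0.1))
            for x in (random.random() for _ in range(m))]

f = lambda x: 2 * x          # a candidate hypothesis (the noiseless target)
emp = empirical_error(f, draw(50))        # computable from the sample alone
gen = empirical_error(f, draw(100_000))   # Monte Carlo proxy for E[l(f, z)]
print(emp, gen)  # both estimate E[noise^2] = 0.01/3, roughly 0.0033
```

With only 50 points the empirical error already concentrates near the expected loss, which is the phenomenon the bounds in this paper quantify.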
Different from what is usually studied in learning theory (see e.g. [15], [25]), the study in this paper intends to bound the error of learning algorithms based simultaneously on the covering number of the function space and algorithmic stability. Thus we present some basic assumptions on the hypothesis space, the covering number, and the definitions of algorithmic stability as follows.
First, we assume that the hypothesis space considered is a reproducing kernel Hilbert space (RKHS) ℋ_K (see [26]). Namely, let K : X × X → R be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points {x₁, x₂, ⋅⋅⋅, x_m} ⊂ X, the matrix (K(x_i, x_j))_{i,j=1}^m is positive semidefinite. Such a function is called a Mercer kernel. The RKHS ℋ_K associated with the kernel K is defined to be the closure of the linear span of the set of functions {K_x := K(x, ⋅) : x ∈ X} with the inner product ⟨⋅, ⋅⟩_{ℋ_K} = ⟨⋅, ⋅⟩_K satisfying ⟨K_x, K_y⟩_K = K(x, y), that is,

⟨Σ_i α_i K_{x_i}, Σ_j β_j K_{x_j}⟩_K = Σ_{i,j} α_i β_j K(x_i, x_j).

The reproducing property takes the form

⟨K_x, f⟩_K = f(x), ∀x ∈ X, ∀f ∈ ℋ_K.

Denote by C(X) the space of continuous functions on X with the norm ∣∣f∣∣_∞ := sup_{x∈X} ∣f(x)∣. Let κ = sup_{x∈X} √K(x, x); then the above reproducing property tells us that ∣∣f∣∣_∞ ≤ κ∣∣f∣∣_K, ∀f ∈ ℋ_K. Let ℬ_R := {f ∈ ℋ_K : ∣∣f∣∣_K ≤ R}. It is a subset of C(X) and its covering number is well defined (see e.g. [11]).
Definition 1: For a subset ℱ of a metric space and ε > 0, the covering number N(ℱ, ε) of the function set ℱ is the minimal ℓ ∈ N such that there exist ℓ disks in ℱ with radius ε covering ℱ.

We denote the covering number of the unit ball ℬ₁ by

N(ε) := N(ℬ₁, ε), ε > 0.
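Covering numbers are rarely computed exactly, but for a finite set of functions a greedy procedure gives a concrete upper bound under the sup-norm. A minimal sketch on a hypothetical one-parameter class (our own illustration, using exact integer arithmetic):

```python
def greedy_cover_size(points, radius, dist):
    """Greedy upper bound on the covering number N(F, radius):
    scan the points, opening a new disk at any point that is not yet
    within `radius` of an existing center."""
    centers = []
    for p in points:
        if all(dist(p, c) > radius for c in centers):
            centers.append(p)
    return len(centers)

# Toy class: f_a(x) = a*x sampled on an integer grid, sup-norm distance.
xs = list(range(11))                             # grid x = 0..10
fs = [[a * x for x in xs] for a in range(101)]   # slopes a = 0..100
sup = lambda f, g: max(abs(u - v) for u, v in zip(f, g))
n_cover = greedy_cover_size(fs, 100, sup)        # disks of sup-radius 100
print(n_cover)
```

Here the sup-distance between slopes a and a′ is 10·|a − a′|, so disks of radius 100 cover slopes within 10 of their center and the greedy scan opens 10 disks. The greedy count can overshoot the true covering number by a constant factor; for the RKHS balls ℬ_R of the paper one works instead with capacity bounds of the form (1).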
Definition 2: The RKHS ℋ_K is said to have polynomial complexity exponent s > 0 if there is some C₀ > 0 such that

log N(ε) ≤ C₀ (1/ε)^s, ∀ε > 0. (1)
Remark 1: The covering number has been extensively studied in [19]-[22]. In particular, we know that for the Gaussian kernel K(x, y) = exp{−∣x − y∣²/σ²} with σ > 0 on a bounded subset X of Rⁿ, if K is C^h with h > 0 (Sobolev smoothness), then inequality (1) is valid with s = 2n/h (see [23]).
In addition, we assume that there exists a constant L such that for any f₁, f₂ ∈ ℋ_K and any z ∈ Z,

∣ℓ(f₁, z) − ℓ(f₂, z)∣ ≤ L ⋅ ∣∣f₁ − f₂∣∣_∞, (2)

and we assume that there is a constant M such that ∣y∣ ≤ M for any y ∈ Y.

Remark 2: Assumption (2) is a general assumption used in learning theory (see e.g. [15], [27]). For example, suppose K is a bounded kernel, that is, K(x, x) ≤ κ². If the loss function is defined by

ℓ(f, z) = ∣f(x) − y∣_ε = 0 if ∣f(x) − y∣ ≤ ε, and ∣f(x) − y∣ − ε otherwise,

we have L = 1. If Y = {−1, 1} and the loss function is defined as

ℓ(f, z) = (1 − y f(x))₊ = 1 − y f(x) if 1 − y f(x) ≥ 0, and 0 otherwise,

we have L = 1. Since for any f₁, f₂ ∈ ℬ_R we have

∣(f₁(x) − y)² − (f₂(x) − y)²∣ ≤ 2(κR + M) ⋅ ∣∣f₁ − f₂∣∣_∞,

for the loss function ℓ(f, z) = (f(x) − y)² we have L = 2(κR + M).
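The Lipschitz constants above are easy to sanity-check numerically. A randomized probe with illustrative values κR = M = 1 and ε = 0.5 (a sketch under our own toy assumptions, not part of the paper's argument):

```python
import random

def eps_insensitive(fx, y, eps=0.5):
    """epsilon-insensitive loss |f(x) - y|_eps."""
    return max(abs(fx - y) - eps, 0.0)

def square(fx, y):
    return (fx - y) ** 2

random.seed(1)
kR, M = 1.0, 1.0                   # ||f||_inf <= kappa*R = 1 and |y| <= M = 1
L_abs, L_sq = 1.0, 2 * (kR + M)    # the constants claimed in Remark 2
violations = 0
for _ in range(10_000):
    f1 = random.uniform(-kR, kR)   # two candidate function values at one x
    f2 = random.uniform(-kR, kR)
    y = random.uniform(-M, M)
    gap = abs(f1 - f2)
    if abs(eps_insensitive(f1, y) - eps_insensitive(f2, y)) > L_abs * gap + 1e-12:
        violations += 1
    if abs(square(f1, y) - square(f2, y)) > L_sq * gap + 1e-12:
        violations += 1
print(violations)  # 0: no violation of the claimed Lipschitz constants
```

For the square loss the check mirrors the inequality displayed above: the difference factors as ∣f₁ − f₂∣ ⋅ ∣f₁ + f₂ − 2y∣, and the second factor is at most 2(κR + M).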
In this paper, we consider two definitions of algorithmic stability: uniform stability and uniform hypothesis stability. We close this section by giving these definitions.

Definition 3: ([18]) A learning algorithm is said to be uniformly β-hypothesis stable, or to have uniform hypothesis stability β, if there is a nonnegative constant β such that

∀S ∈ Z^m, ∀z ∈ Z, ∀1 ≤ i ≤ m, ∣ℓ(f_S, z) − ℓ(f_{S^i}, z)∣ ≤ β,

with β going to zero as m → ∞.

Definition 4: ([15]) A learning algorithm is said to be β-uniformly stable, or to have uniform stability β, if there is a nonnegative constant β such that

∀S ∈ Z^m, ∀z ∈ Z, ∀1 ≤ i ≤ m, ∣ℓ(f_S, z) − ℓ(f_{S^{∖i}}, z)∣ ≤ β,

with β going to zero as m → ∞.

Remark 3: Comparing Definitions 3 and 4 with Definition 3.1 in [18] and Definition 6 in [15] respectively, the notation f_S, S^i and S^{∖i} used in this paper corresponds to that of [15] and [18]; the interested reader can consult [15] and [18] for the details. For the time being, the stability parameter β_m is denoted simply by β so as to reduce notational clutter; the dependence of β on m is restored near the end of the paper.
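The removal-based quantity of Definition 4 can be probed empirically on a simple regularized least squares fit. The sketch below uses a one-dimensional linear model f(x) = w·x with a closed-form minimizer as a stand-in for the RKHS algorithms studied later (all names and constants are our own illustration):

```python
import random

def fit(S, lam):
    """Closed-form minimizer of (1/m) sum (w*x_i - y_i)^2 + lam*w^2
    over linear functions f(x) = w*x."""
    sxy = sum(x * y for x, y in S)
    sxx = sum(x * x for x, _ in S)
    return sxy / (sxx + len(S) * lam)

def loss(w, z):
    x, y = z
    return (w * x - y) ** 2

random.seed(2)
m, lam = 200, 0.1
S = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(m)]
w_full = fit(S, lam)
w_loo = [fit(S[:i] + S[i + 1:], lam) for i in range(m)]   # remove z_i

# Worst observed loss change over all leave-one-out sets and a grid of
# test points z -- an empirical lower estimate of beta in Definition 4.
zs = [(x / 5, y / 5) for x in range(-5, 6) for y in range(-5, 6)]
beta_hat = max(abs(loss(w_full, z) - loss(w, z)) for w in w_loo for z in zs)
print(beta_hat)  # small; for such regularized fits it shrinks as m grows
```

This only lower-bounds the true β (which is a supremum over all samples S), but it makes the regularization effect visible: increasing lam or m shrinks the probe.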
III. MAIN RESULTS

In this section, we establish bounds on the error of two learning algorithms (regularized least squares regression and SVM regression) based simultaneously on the covering number of the function space and algorithmic stability. Our main results are stated as follows.

Theorem 1: Let the learning algorithm be defined as

f_S = arg min_{f ∈ ℋ_K} { (1/m) Σ_{i=1}^m (f(x_i) − y_i)² + λ∣∣f∣∣²_K }, (3)

where λ > 0 is a regularization parameter, and assume that the learning algorithm has uniform stability β. Then for any δ ∈ (0, 1], with probability at least 1 − δ, the inequality

ℰ(f_S) ≤ ℰ_S(f_S) + ε(m, δ) + 2β (4)

is valid, where

ε(m, δ) ≤ (8M²(κ + √λ)²/λ) ⋅ max{ [ln(1/δ)/m]^{1/2}, [C₀/m]^{1/(s+2)} }.
As an application of Theorem 1, we can easily establish the following bound on the convergence rate of the regularized least squares regression algorithm (3).

Corollary 1: Assume that the learning algorithm (3) has uniform stability β. Then for any δ ∈ (0, 1], with probability at least 1 − δ, the inequality

ℰ(f_S) ≤ ℰ_S(f_S) + 2β + (8M²(κ + √λ)²/λ) [C₀/m]^{1/(s+2)} (5)

is valid provided that the sample size satisfies m > ln(1/δ)[ln(1/δ)/C₀]^{2/s}. The inequality

ℰ(f_S) ≤ ℰ_S(f_S) + 2β + (8M²(κ + √λ)²/λ) [ln(1/δ)/m]^{1/2} (6)

is valid provided that the sample size satisfies m ≤ ln(1/δ)[ln(1/δ)/C₀]^{2/s}.
Remark 4: (i) Bound (4) in Theorem 1 evaluates the risk of the chosen function simultaneously in terms of the covering number of a subset of ℋ_K and the parameter β of uniform stability. Different from the previously known bounds in [1], [15] and [8], the bound in Theorem 1 depends explicitly on both the parameter β of uniform stability and the complexity parameters C₀, s of the function space ℋ_K. As far as we know, this is the first result in this direction.

(ii) To better understand the significance and value of the results obtained in Theorem 1, we now compare them with the previously known bounds obtained by Bousquet and Elisseeff [15]. First, in [15], Bousquet and Elisseeff established a sharper generalization bound (see Example 3 in [15]) for learning algorithm (3) based on uniform stability alone, while in Theorem 1 we establish the bound on the generalization error of learning algorithm (3) based simultaneously on the covering number of the function space and algorithmic stability. Second, comparing bound (4) in Theorem 1 with the bound in Example 3 of [15], we find that when the size of the training sample satisfies m ≤ ln(1/δ)[ln(1/δ)/C₀]^{2/s}, the generalization bound in Theorem 1 has the same rate as that obtained by Bousquet and Elisseeff in Example 3 of [15].
In addition, if the learning algorithm is the SVM regression algorithm (see [1], [15]), that is, f_S is defined as

f_S = arg min_{f ∈ ℋ_K} { (1/m) Σ_{i=1}^m ∣f(x_i) − y_i∣ + λ∣∣f∣∣²_K }, (7)

we also establish the following bound on the generalization error of learning algorithm (7), again based simultaneously on the covering number of a subset of ℋ_K and the parameter of uniform stability.

Theorem 2: Let the learning algorithm be defined as in (7), and assume that the learning algorithm (7) has uniform stability β. Then for any δ ∈ (0, 1], with probability at least 1 − δ, the inequality

ℰ(f_S) ≤ ℰ_S(f_S) + ε′(m, δ) + 2β (8)

is valid, where

ε′(m, δ) ≤ 8(κ√(M/λ) + M) ⋅ max{ [ln(1/δ)/m]^{1/2}, [C₀/m]^{1/(s+2)} }.
Remark 5: By Definitions 3 and 4, an algorithm with uniform stability β also has the following property: ∣ℓ(f_S, z) − ℓ(f_{S^i}, z)∣ ≤ 2β. This shows that uniform stability β implies uniform hypothesis stability 2β. Thus, by the same arguments as in Theorems 1 and 2, we can obtain the same bounds on the generalization error of learning algorithms (3) and (7) based simultaneously on the covering number of the function space and uniform hypothesis stability, respectively.
IV. PROOF OF MAIN RESULTS

To establish the bounds on the generalization error of the above two learning algorithms based simultaneously on the covering number of the function space and algorithmic stability, our main tools are the following two useful lemmas.

Lemma 1: (Hoeffding's inequality) Let ξ be a random variable on a space Z with expectation μ = E(ξ). If ∣ξ(z) − μ∣ ≤ B for almost all z ∈ Z, then for all ε > 0,

P{ ∣(1/m) Σ_{i=1}^m ξ(z_i) − μ∣ ≥ ε } ≤ 2 exp{ −mε²/(2B²) }.

Lemma 2: ([25]) Let c₁, c₂ > 0, and d > q > 0. Then the equation

x^d − c₁x^q − c₂ = 0

has a unique positive zero x*. In addition,

x* ≤ max{ (2c₁)^{1/(d−q)}, (2c₂)^{1/d} }.
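Lemma 2 is easy to verify numerically for concrete parameters. A minimal sketch (our own illustration): locate the positive zero by bisection and check it against the stated bound.

```python
def positive_zero(d, q, c1, c2, tol=1e-12):
    """Bisection for the unique positive zero of g(x) = x^d - c1*x^q - c2
    with d > q > 0 and c1, c2 > 0 (g < 0 near 0, g > 0 for large x)."""
    g = lambda x: x ** d - c1 * x ** q - c2
    lo, hi = 0.0, 1.0
    while g(hi) < 0:
        hi *= 2.0                      # expand until the zero is bracketed
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi

d, q, c1, c2 = 4.0, 2.0, 3.0, 5.0      # illustrative parameters
x_star = positive_zero(d, q, c1, c2)
bound = max((2 * c1) ** (1 / (d - q)), (2 * c2) ** (1 / d))
print(x_star, bound)  # x_star respects the Lemma 2 bound: x_star <= bound
```

For d = 4, q = 2 the equation is a quadratic in x², so the zero x* = ((3 + √29)/2)^{1/2} can also be checked in closed form. In the proof below the lemma is applied with d = s + 2 and q = s.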
Proof of Theorem 1: We decompose the proof into three steps.

Step 1: By the definition of f_S, comparing with the choice f = 0, we have that for any S ∈ Z^m,

λ∣∣f_S∣∣²_K ≤ (1/m) Σ_{i=1}^m (f_S(x_i) − y_i)² + λ∣∣f_S∣∣²_K ≤ (1/m) Σ_{i=1}^m (0 − y_i)² + 0 ≤ M².

The second inequality above follows from the assumption ∣y∣ ≤ M for any y ∈ Y. We then have that ∣∣f_S∣∣_K ≤ M/√λ for almost all S ∈ Z^m. Similarly, we have that for any S ∈ Z^m and any 1 ≤ i ≤ m, ∣∣f_{S^{∖i}}∣∣_K ≤ M/√λ. These imply that f_S ∈ ℬ_{R₁} and f_{S^{∖i}} ∈ ℬ_{R₁} (1 ≤ i ≤ m) with R₁ = M/√λ for almost all S ∈ Z^m.
Step 2: Let ℒ(f) = ℰ(f) − ℰ_S(f). For any i ∈ {1, 2, ⋅⋅⋅, m}, we have

∣ℒ(f_S) − ℒ(f_{S^{∖i}})∣ ≤ ∣ℰ(f_S) − ℰ(f_{S^{∖i}})∣ + ∣ℰ_S(f_S) − ℰ_S(f_{S^{∖i}})∣.

By Definition 4, we have ∣ℒ(f_S) − ℒ(f_{S^{∖i}})∣ ≤ 2β. It follows that for any 1 ≤ i ≤ m,

∣ℒ(f_S)∣ ≤ ∣ℒ(f_{S^{∖i}})∣ + 2β.

Then we have that for any ε > 0,

P{∣ℒ(f_S)∣ ≥ 2ε + 2β} ≤ P{∣ℒ(f_{S^{∖i}})∣ ≥ 2ε} ≤ P{ sup_{1≤i≤m} ∣ℒ(f_{S^{∖i}})∣ ≥ 2ε } ≤ P{ sup_{f ∈ ℬ_{R₁}} ∣ℒ(f)∣ ≥ 2ε }. (9)

The final inequality follows from the fact that for any S ∈ Z^m and any 1 ≤ i ≤ m, f_{S^{∖i}} ∈ ℬ_{R₁}.
Now we bound the term on the right-hand side of inequality (9). By an argument similar to that in [8], we let the balls D_j, j ∈ {1, 2, ⋅⋅⋅, ℓ₁}, be a cover of ℬ_{R₁} with centers f_j and radius ε/(2L), where L = 2M(κ + √λ)/√λ is the constant of assumption (2) for the square loss on ℬ_{R₁} given by Remark 2. Then for any ε > 0, we have

P{ sup_{f ∈ ℬ_{R₁}} ∣ℒ(f)∣ ≥ 2ε } ≤ Σ_{j=1}^{ℓ₁} P{ sup_{f ∈ D_j} ∣ℒ(f)∣ ≥ 2ε }. (10)

In addition, for all f ∈ D_j, we have

∣ℒ(f) − ℒ(f_j)∣ ≤ ∣ℰ(f) − ℰ(f_j)∣ + ∣ℰ_S(f) − ℰ_S(f_j)∣ ≤ 2L ⋅ ∣∣f − f_j∣∣_∞ ≤ 2L ⋅ ε/(2L) = ε.

It follows that

sup_{f ∈ D_j} ∣ℒ(f)∣ ≥ 2ε =⇒ ∣ℒ(f_j)∣ ≥ ε.

We then conclude that for any j ∈ {1, 2, ⋅⋅⋅, ℓ₁} and any ε > 0,

P{ sup_{f ∈ D_j} ∣ℒ(f)∣ ≥ 2ε } ≤ P{ ∣ℒ(f_j)∣ ≥ ε }.

By Lemma 1 and the fact that 0 ≤ ℓ(f, z) = (f(x) − y)² ≤ (∣∣f∣∣_∞ + M)² ≤ (κR₁ + M)² =: B₁ for any z ∈ Z and any f ∈ ℬ_{R₁}, we get that for any ε > 0,

P{ ∣ℒ(f_j)∣ ≥ ε } ≤ 2 exp{ −mε²/(2B₁²) }.
By inequality (10) and the above inequality, we have

P{ sup_{f ∈ ℬ_{R₁}} ∣ℒ(f)∣ ≥ 2ε } ≤ 2 N(ℬ_{R₁}, ε/(2L)) exp{ −mε²/(2B₁²) }.

Combining inequality (9) with the above inequality, and replacing ε by ε/2, we have that for any ε > 0,

P{ ∣ℒ(f_S)∣ ≥ ε + 2β } ≤ 2 N(ℬ_{R₁}, ε/(4L)) exp{ −mε²/(8B₁²) }. (11)
Step 3: By the fact that an ε-covering of ℬ₁ yields an R₁ε-covering of ℬ_{R₁} and vice versa (see e.g. [6], [7]), we have that for any ε > 0, N(ℬ_{R₁}, ε/(4L)) ≤ N(ε/(4LR₁)). By Definition 2 and inequality (11), we then get that for any ε > 0,

P{ ∣ℒ(f_S)∣ ≥ ε + 2β } ≤ 2 exp{ C₀(4LR₁/ε)^s − mε²/(8B₁²) }. (12)

We rewrite the above inequality in an equivalent form. We equate the right-hand side of inequality (12) to a positive value δ (0 < δ ≤ 1),

exp{ C₀(4LR₁/ε)^s − mε²/(8B₁²) } = δ.

It follows that

ε^{s+2} − ε^s ⋅ 8 ln(1/δ)B₁²/m − 8B₁²C₀(4LR₁)^s/m = 0.

By Lemma 2 (applied with d = s + 2, q = s, c₁ = 8B₁² ln(1/δ)/m and c₂ = 8B₁²C₀(4LR₁)^s/m), the unique positive solution ε* of the above equation with respect to ε satisfies

ε* = ε(m, δ) ≤ max{ [16B₁² ln(1/δ)/m]^{1/2}, [16B₁²C₀(4LR₁)^s/m]^{1/(s+2)} }.
Then by inequality (12), we conclude that for any δ ∈ (0, 1], with probability at least 1 − δ the inequality

ℰ(f_S) ≤ ℰ_S(f_S) + ε(m, δ) + 2β

holds true. Replacing R₁ and L by M/√λ and 2M(κ + √λ)/√λ respectively, and simplifying the constants, we finish the proof of Theorem 1.
Proof of Theorem 2: By the definition of f_S, comparing with the choice f = 0, we have that for any S ∈ Z^m,

λ∣∣f_S∣∣²_K ≤ (1/m) Σ_{i=1}^m ∣f_S(x_i) − y_i∣ + λ∣∣f_S∣∣²_K ≤ (1/m) Σ_{i=1}^m ∣0 − y_i∣ + 0 ≤ M.

The second inequality follows from the assumption ∣y∣ ≤ M for any y ∈ Y. It follows that ∣∣f_S∣∣_K ≤ √(M/λ) for almost all S ∈ Z^m. Similarly, we have that for any S ∈ Z^m and any 1 ≤ i ≤ m, ∣∣f_{S^{∖i}}∣∣_K ≤ √(M/λ). This implies that f_S ∈ ℬ_{R₂} and f_{S^{∖i}} ∈ ℬ_{R₂} (1 ≤ i ≤ m) with R₂ = √(M/λ). Thus, by an argument similar to that in the proof of Theorem 1, we can finish the proof of Theorem 2.
V. CONCLUSION

In this paper, we explored how the stability property of learning algorithms and the complexity of the function space simultaneously influence the generalization performance of learning algorithms. We first applied uniform stability and the covering number of the hypothesis space to establish bounds on the generalization error of the regularized least squares regression and SVM regression algorithms. The established results depend explicitly not only on the parameter β of uniform stability, but also on the complexity parameters C₀, s of the function space. To our knowledge, these results are the first generalization bounds in this direction. In order to better understand the significance and value of the established results, we also compared our main result with previously known work in the algorithmic stability approach ([15]).

Further directions of research include establishing better bounds via weaker notions of algorithmic stability (e.g. CVEEEloo-stability, see [16]; error stability, see [28]) and other complexity measures of the space (e.g. the Rademacher average, see [12]), and establishing generalization bounds for learning algorithms based on more information (e.g. complexity of the hypothesis space, algorithmic stability, sampling mechanism and sample quality). All these problems are under our current investigation.
ACKNOWLEDGEMENTS
This work was supported in part by NSFC project
(61070225) and Foundation of Hubei Educational Committee
(Q20091003).
REFERENCES

[1] V. Vapnik. Statistical Learning Theory. John Wiley, New York, 1998.
[2] I. Steinwart. Support vector machines are universally consistent. J. Complexity, 18: 768-791, 2002.
[3] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32: 56-134, 2004.
[4] I. Steinwart. Consistency of support vector machines and other regularized kernel machines. IEEE Trans. Inform. Theory, 51: 128-142, 2005.
[5] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.
[6] D. R. Chen, Q. Wu, Y. M. Ying and D. X. Zhou. Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research, 5: 1143-1175, 2004.
[7] Q. Wu, Y. Ying and D. X. Zhou. Learning rates of least-squares regularized regression. Found. Comput. Math., 6: 171-192, 2006.
[8] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39: 1-49, 2001.
[9] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge, 2007.
[10] I. Steinwart and C. Scovel. Fast rates for support vector machines. 18th Ann. Conf. Learning Theory (COLT 2005), Bertinoro, Italy, Jun., 279-294, 2005.
[11] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
[12] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3: 463-482, 2002.
[13] T. Evgeniou and M. Pontil. On the V-gamma dimension for regression in reproducing kernel Hilbert spaces. In Proc. of Algorithmic Learning Theory, Lecture Notes in Comput. Sci., Springer, Berlin, 1720: 106-117, 1999.
[14] J. F. Bonnans and A. Shapiro. Optimization problems with perturbation: a guided tour. SIAM Rev., 40: 228-264, 1998.
[15] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2: 499-526, 2002.
[16] T. Poggio, R. Rifkin, S. Mukherjee and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428: 419-422, 2004.
[17] L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25: 601-604, 1979.
[18] S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of Uncertainty in AI, Morgan Kaufmann, Univ. Alberta, Edmonton, 2002.
[19] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory, 44: 525-536, 1998.
[20] R. C. Williamson, A. J. Smola and B. Scholkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Trans. Inform. Theory, 47: 2516-2532, 2001.
[21] D. X. Zhou. The covering number in learning theory. Journal of Complexity, 18: 739-767, 2002.
[22] M. Pontil. A note on different covering numbers in learning theory. Journal of Complexity, 19: 665-671, 2003.
[23] D. X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inform. Theory, 49: 1743-1752, 2003.
[24] S. Agarwal and P. Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research, 10: 441-474, 2009.
[25] F. Cucker and S. Smale. Best choices for regularization parameters in learning theory: on the bias-variance problem. Found. Comput. Math., 2: 413-428, 2002.
[26] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68: 337-404, 1950.
[27] P. L. Bartlett, O. Bousquet and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33: 1497-1537, 2005.
[28] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Comput., 11: 1427-1453, 1999.