Transcript
IEEE 2005 International Conference on Neural Networks and Brain, Beijing, China, 13-15 Oct. 2005

Learning Curves of Support Vector Machines

Kazushi Ikeda
Graduate School of Informatics, Kyoto University

Kyoto 606-8501, Japan
Email: kazushi@i.kyoto-u.ac.jp

Abstract- The support vector machine (SVM) is known as a pattern classifier with high generalization ability, and one of its advantages is that the generalization ability is theoretically guaranteed. However, many of the analyses are given in the framework of PAC learning, and the error bounds are rather loose compared with the practical generalization error. In this paper, we present some studies on the average generalization error of SVMs, which is a more practical criterion for generalization ability.

I. INTRODUCTION

Over the past decade, support vector machines (SVMs) have attracted

much attention as a new classification technique with good generalization ability [1]-[6]. One of the reasons is that SVMs have no local minima, from which multi-layer perceptrons (MLPs) suffer, and another is that they have a theoretical background on their generalization ability. In fact, they are known to outperform other learning machines, such as multi-layer perceptrons and RBF networks, in generalization ability. However, many theoretical results were given in the framework of probably approximately correct (PAC) learning [7], which evaluates a kind of worst case since its error bounds are distribution-free.

Compared to PAC learning, the average generalization

error is said to be a more practical criterion since it is defined as the average of the generalization error over given examples. The average generalization error is a function of the number of given examples and is called the learning curve. Recently, studies on the learning curve of SVMs have been developed intensively [8]-[14]. In this paper, we show how several properties of SVMs affect their learning curves.

II. SUPPORT VECTOR MACHINES

SVMs are a kind of kernel method [15]; they map input

vectors x ∈ X to feature vectors f(x) in a high-dimensional feature space F and separate them linearly¹. That is, the output y for an input x is expressed as

y = \mathrm{sgn}\left[ w^\top f(x) \right] \qquad (1)

where w ∈ F is a weight vector², sgn is the sign function, and ⊤ denotes transposition.

This formulation means that SVMs are a kind of dichotomy called a perceptron and have the same structure as

¹We do not treat regression here but only classification.

²The original SVM employs an inhomogeneous hyperplane w^T f(x) + b = 0, but this paper treats only a homogeneous one. An inhomogeneous hyperplane can be transformed to a homogeneous one \tilde{w}^\top \tilde{f} = 0 by the lifting-up \tilde{w} = (w; b) and \tilde{f} = (f(x); 1).

Fig. 1. Separating hyperplanes and margins.

MLPs. The difference between SVMs and MLPs is that the weight vector w of an SVM is limited to a weighted sum of the given examples, that is,

w = \sum_{n=1}^{N} \alpha_n f(x^{(n)}). \qquad (2)

This form has the advantage that an explicit expression in the feature space is not needed, since the output for x is calculated as

\mathrm{sgn}\left[ w^\top f(x) \right] = \mathrm{sgn}\left[ \sum_{n=1}^{N} \alpha_n f(x^{(n)})^\top f(x) \right] \qquad (3)

= \mathrm{sgn}\left[ \sum_{n=1}^{N} \alpha_n K(x^{(n)}, x) \right], \qquad (4)

where K(x, x') = f(x)^T f(x') is called the kernel function. Moreover, Mercer's theorem shows that there exists a feature space if and only if the kernel function is positive semidefinite. Hence we need not take care of the feature space explicitly.
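As a concrete illustration of (2)-(4), the following minimal Python sketch (our own illustration, not part of the paper; the toy data and the coefficients α_n are arbitrary assumptions) evaluates the SVM output through the kernel function alone, without ever constructing f(x) explicitly.

import numpy as np

def poly_kernel(x, xp, p=2):
    # polynomial kernel K(x, x') = (x^T x' + 1)^p
    return (np.dot(x, xp) + 1.0) ** p

def svm_output(x, X_train, alpha, p=2):
    # Eq. (4): sgn(sum_n alpha_n K(x^(n), x)); the labels are assumed
    # to be absorbed into the coefficients alpha_n as in Eq. (2)
    s = sum(a * poly_kernel(xn, x, p) for a, xn in zip(alpha, X_train))
    return np.sign(s)

X_train = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
alpha = np.array([0.5, 0.5, -1.0])          # hypothetical coefficients
print(svm_output(np.array([0.5, 0.5]), X_train, alpha))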

Perceptrons have a set of weight vectors that are consistent with the given N examples (x^(n), y^(n)), n = 1, ..., N [16]. In this set, called the admissible region, an SVM chooses the hyperplane that maximizes the margin, defined as the minimum distance of the examples from the hyperplane (Fig. 1), that is,

\min_{n} \frac{y^{(n)} w^\top f(x^{(n)})}{\|w\|}. \qquad (5)

Considering the scaling ambiguity due to the linearity of the above expression, the problem is formulated as the optimization problem

\min_{w} \frac{1}{2}\|w\|^2 \quad \text{s.t. } y^{(n)} w^\top f(x^{(n)}) \ge 1. \qquad (6)


Fig. 2. The input surface and a separating hyperplane. Their intersection does not necessarily change even when the hyperplane moves.

Problem (6) is equivalent to the Wolfe dual problem,

\min_{\alpha_n \ge 0} \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N} \alpha_n \quad \text{s.t. } w = \sum_{n=1}^{N} \alpha_n y^{(n)} f(x^{(n)}). \qquad (7)

This means that the solution always has the form of (2).
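To see the expansion (2) and the dual variables of (7) in practice, one can fit an off-the-shelf SVM and inspect its dual coefficients. The sketch below is only an illustration; it uses scikit-learn (which is not used in the paper) with a very large C to approximate the hard margin, and that library solves the inhomogeneous version with a bias term.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)           # a linearly separable toy rule

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6)  # large C approximates the hard margin
clf.fit(X, y)

# dual_coef_ stores alpha_n * y^(n) for the support vectors only, so the
# weight vector is a weighted sum of mapped examples, as in (2) and (7).
print("number of support vectors:", clf.support_vectors_.shape[0])
print("alpha_n * y^(n):", clf.dual_coef_)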

III. ASYMPTOTIC THEORY OF POLYNOMIAL SVMS

Due to the non-differentiability of the output with respect to the parameters, the exact learning curve of even simple perceptrons still remains open. However, statistical mechanical methods [17], [18] and statistical asymptotic theory [19]-[21] have revealed that the learning curve is proportional to the number of parameters, which corresponds to the dimension of the input space, and is inversely proportional to the number of examples. Since SVMs can be regarded as perceptrons with nonlinear pre-processors, the learning curve of SVMs is expected to have the same property. In fact, a statistical mechanical study has shown that SVMs with binary inputs and a polynomial kernel function have such a learning curve [8]. However, this contradicts the facts that SVMs have a high generalization ability in practice and that the high dimensionality of a feature space does not degrade the performance in the analysis of PAC learning, which gives an upper bound on the average generalization error [3]. In this section, we analyze polynomial SVMs to elucidate this apparent contradiction.

The difference between kernel methods, including SVMs,

and simple perceptrons is that the former have localized input vectors of dichotomies, since f(x) is a vector mapped from x. In other words, the set of f(x) forms a lower-dimensional submanifold in the feature space F, which we call the input surface in this paper. This implies that the dimension of the feature space does not necessarily coincide with the number of effective parameters of the dichotomies (Fig. 2). Since the learning curve is proportional not to the dimension of the space but to the number of effective parameters, we consider the relationship between the latter and the input surface in the following.

We call the intersection of the input surface and the true

separating hyperplane the separating curve. Assuming that N is sufficiently large, we concentrate on the neighborhood of the separating curve. In such a case, the number of effective parameters corresponds to the degrees of freedom of the hyperplanes in changing the separating curve, which we call the class³.

Note again that the learning curve is proportional to the class, not to the dimension of the feature space.

The class can be calculated by considering the dimension of the space spanned by the tangent spaces of the input surface at x, where x is located on the separating curve, because the component orthogonal to this space cannot change the intersection of the input surface and the separating hyperplane. This is mathematically expressed as the space spanned by the \Delta w satisfying

\frac{\partial\psi}{\partial w}\,\Delta w + \frac{\partial\psi}{\partial x}\,\Delta x = 0, \qquad (8)

where \psi(x; w) = 0 denotes the separating hyperplane, x satisfies \psi(x; w_0) = 0, and \Delta x and \Delta w are infinitesimal vectors.

In the following, we derive upper bounds and lower bounds of the class for the polynomial kernel K(x, x') = (x^T x' + 1)^p. The components of the feature vector f(x) ∈ R^M are represented as monomials of order less than or equal to p,

f(x) = \left\{ c_{d_1,\dots,d_m} \prod_{i=1}^{m} x_i^{d_i} \;:\; \sum_{i=1}^{m} d_i \le p \right\}, \qquad (9)

where M = {}_{m+p}C_m and c_{d_1,...,d_m} is a non-zero constant. Note that w^T f(x) can express any polynomial of order less than or equal to p. We assume that the true separating hyperplane is given by a polynomial of order p_0 (≤ p), w_0^T f(x), which we call the true polynomial.
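The dimension M is easy to check numerically. The short Python sketch below (an illustration, not from the paper) enumerates the exponent tuples of the monomials in (9) and verifies that their number equals {}_{m+p}C_m.

from itertools import product
from math import comb

def monomial_exponents(m, p):
    # all exponent tuples (d_1, ..., d_m) with d_1 + ... + d_m <= p, cf. Eq. (9)
    return [d for d in product(range(p + 1), repeat=m) if sum(d) <= p]

m, p = 3, 3
exponents = monomial_exponents(m, p)
print(len(exponents), comb(m + p, m))   # both give M = 20 for m = p = 3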

Since the first term of (8) represents a polynomial of order p, because ∂ψ/∂w(x; w_0) = f(x), and the second term of (8) is a linear combination of ∂ψ/∂x(x; w_0), the class is strongly related to the ideal ⟨∂ψ/∂x(x; w_0)⟩ under ψ(x; w_0) = 0. Actually, we can prove the following using the properties of the ideal [10].

* If p_0 = p and the true polynomial is irreducible, the class is M − 1, which is almost the same as the dimension of the feature space.

* If p_0 = p and the true polynomial is reducible to polynomials of order p_1, p_2, ..., p_r, then Σ_i (M_i − 1) (≤ M − 1) is a lower bound of the class, where M_i = {}_{m+p_i}C_m.

* If p_0 < p and the true polynomial is expressed as ψ(x; w_0) = ψ'(x; w_0)δ(x), where δ(x) > 0 for any x, then {}_{m+p_0}C_m − 1 (≤ M − 1) is a lower bound of the class.

When m = 1, the lower bounds given above coincide with the class, and the average generalization error depends only on p_0, not on p [11].

These bounds have been derived by considering vectors non-orthogonal to Δw. Another approach to the class is to calculate explicit components orthogonal to Δw [12], which leads to the following bounds.

* If p_0 < p, the class is less than

m\left( {}_{p+m-1}C_m - {}_{p-p_0+m-1}C_m \right). \qquad (10)

³This is named after its similarity to the class of Plücker's dual curve in algebraic geometry [22].


Fig. 3. Order of the polynomial kernel and learning curves with m = 3, where the dashed and solid lines are given by (10) and (11), respectively; the plotted symbols show the cases of N = 1000, 3000, 10000, respectively.

Fig. 4. Soft margins and slack variables.

* If p_0 < p/2, the class is less than

{}_{p+m}C_m - {}_{p-2p_0+m}C_m. \qquad (11)

When m = 2, both bounds coincide, and the latter is smaller when m ≥ 3. Computer simulations in Fig. 3 show that the latter agrees well with the experimental results when the order is small. Otherwise, the dimension of the feature space is rather large compared to the number of examples, and hence the asymptotic theory cannot be applied. The reason why the experimental result is so small remains to be clarified.
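For reference, bound (11) is straightforward to evaluate. The sketch below (our own illustration under the stated condition p_0 < p/2; the chosen values of m, p, p_0 are arbitrary) compares it with the naive count M − 1.

from math import comb

def class_bound_11(m, p, p0):
    # upper bound (11) on the class, valid for p0 < p/2
    assert 2 * p0 < p
    return comb(p + m, m) - comb(p - 2 * p0 + m, m)

m, p, p0 = 3, 5, 1
print(class_bound_11(m, p, p0), comb(p + m, m) - 1)   # bound vs. M - 1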

IV. EFFECT OF SOFT MARGINS

The discussion above is based on the linear separability of the examples. In fact, a given data set is often inseparable due to noise, and no solution exists that satisfies (6) in such a case. Even when the set is linearly separable, SVMs are too sensitive to outliers since they pick up the data near the separating hyperplane. To cope with these problems, the soft-margin technique has been proposed, which employs slack variables ξ_n allowing constraint violation [1]-[3]. The violation ξ_n is penalized in the cost function as

\min_{w,\xi_n} \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N}\xi_n \quad \text{s.t. } y^{(n)} w^\top f(x^{(n)}) \ge 1 - \xi_n,\ \ \xi_n \ge 0. \qquad (12)

See Fig. 4 for the geometrical interpretation. The Wolfe dual

of the above problem is expressed as

\min_{0 \le \alpha_n \le C} \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N}\alpha_n \quad \text{s.t. } w = \sum_{n=1}^{N}\alpha_n y^{(n)} f(x^{(n)}). \qquad (13)
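The box constraint 0 ≤ α_n ≤ C in (13) can be observed directly in a standard solver. The following sketch is only an illustration with assumed toy data; it uses scikit-learn, which is not used in the paper and solves the inhomogeneous formulation.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=60) > 0, 1, -1)   # noisy labels

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
# dual_coef_ holds alpha_n * y^(n); the box constraint implies |alpha_n| <= C
print(np.abs(clf.dual_coef_).max() <= C + 1e-8)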

Introducing soft margins guarantees the existence of a unique solution for any example set. However, how soft margins affect the generalization ability has been analyzed only in the framework of PAC learning [23]. Hence, we derive the asymptotic learning curve as a function of the number of examples N and the soft-margin parameter C, using order statistics. The results show that the average generalization error increases as C decreases. Note that we employ the ν-SVM, a variant of SVMs, and restrict the input space to be one-dimensional in the analysis.

The ν-SVM has a variable margin β, instead of fixing it

to unity, which is maximized in the cost function. The primal and dual problems are formulated as below⁴ [24]:

\min_{w,\xi_n,\beta} \frac{1}{2}\|w\|^2 - \beta + C\sum_{n=1}^{N}\xi_n \quad \text{s.t. } y^{(n)} w^\top f(x^{(n)}) \ge \beta - \xi_n,\ \ \xi_n \ge 0. \qquad (14)

\min_{0 \le \alpha_n \le C} \frac{1}{2}\|w\|^2 \quad \text{s.t. } w = \sum_{n=1}^{N}\alpha_n y^{(n)} f(x^{(n)}),\ \ \sum_{n=1}^{N}\alpha_n = 1. \qquad (15)

If we fix β to unity in (14), it becomes (12), while (13) equals (15) if we fix Σ_n α_n to unity. On the one hand, the original SVM has a clear geometrical meaning of margin maximization in the primal problem, but the meaning of C is not clear in soft margins. On the other hand, the ν-SVM has a clear geometrical meaning of finding the closest point to the origin in the reduced convex hull, where C expresses the degree of reduction (Fig. 5) [13], [25].

In other words, w is a non-negative weighted sum of the f̃^(n) from (15), where f̃^(n) = y^(n) f(x^(n)) and the sum of the weights is unity, which means that w lies in the convex hull of the f̃^(n). The condition 0 ≤ α_n ≤ C means that f̃^(n) cannot contribute to w with a weight larger than C and that the convex hull is reduced to some extent.
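This geometrical picture is easy to reproduce numerically: the dual (15) asks for the point w = Σ_n α_n f̃^(n) closest to the origin subject to 0 ≤ α_n ≤ C and Σ_n α_n = 1. A minimal sketch using scipy (our own illustration, assuming a linear kernel f(x) = x and arbitrary toy data, which is not the setting analyzed below) is:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
F = y[:, None] * X                      # f~(n) = y^(n) f(x^(n)) with f(x) = x
C = 0.2                                 # reduction parameter, 1/M with M = 5

def objective(alpha):
    w = F.T @ alpha
    return 0.5 * np.dot(w, w)           # (1/2)||w||^2 as in (15)

n = len(y)
res = minimize(objective, np.full(n, 1.0 / n), method="SLSQP",
               bounds=[(0.0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: np.sum(a) - 1.0}])
w = F.T @ res.x
print("closest point of the reduced convex hull:", w)

Shrinking C below 1 forces the weight to be spread over several examples, which corresponds to the reduction of the convex hull sketched in Fig. 5.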

Using the fact above, we can derive the average generalization error of the ν-SVM in the asymptotics of N → ∞, where the soft-margin parameter C is the reciprocal of an integer M and the feature space is a one-dimensional circle. This is because the positive examples f̃^(n) are distributed on a semicircle, the centroid of the M examples closest to each endpoint is a support vector, and the midpoint of the two support vectors is the closest point to the origin in the reduced convex hull (Fig. 6). Note that the centroid of the M examples closest to an endpoint has the properties of the so-called order statistics.

⁴The original ν-SVMs employ inhomogeneous hyperplanes.


Fig. 5. Reduced convex hull and the closest point to the origin, for C = 1, C = 1/2, and C = 1/3.

Fig. 6. The ν-SVM in the one-dimensional case.

The prediction error, a kind of generalization error, defined

as the probability that a learning machine mispredicts the output of a random test input, is expressed as |θ|/π, where θ is the angle between the closest point ŵ and the true weight vector w*. Hence, the average prediction error is written as

\frac{1}{N^M (M!)^2} \sum_{i=0}^{M}\sum_{j=0}^{M} (-1)^{i+j}\, {}_M C_i\, {}_M C_j\, S_{ij}, \qquad (16)

as shown in [13].

In the hard-margin case of M = 1, (16) takes the same value 1/(2N) as the existing result, and (16) increases as M increases, i.e., as C decreases; it equals 7/(12N) and 239/(360N) for M = 2 and M = 3, respectively. For a large M, we can show that (16) is approximated by \sqrt{M/(3\pi)}/N by applying the central limit theorem to the distribution of the centroid. Fig. 7 shows experimental results.
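The construction of Fig. 6 can also be simulated directly. The sketch below is our own Monte Carlo illustration; it additionally assumes that the examples f̃^(n) are uniformly distributed on the semicircle, which is not stated in the paper, and its estimates can be compared with the values quoted above.

import numpy as np

def prediction_error(N, M, trials=20000, rng=np.random.default_rng(2)):
    # Monte Carlo estimate of E|theta|/pi for the nu-SVM on the semicircle
    errs = np.empty(trials)
    for t in range(trials):
        # angles of f~(n) relative to the true weight w*, assumed uniform on (-pi/2, pi/2)
        phi = rng.uniform(-np.pi / 2, np.pi / 2, size=N)
        phi.sort()
        lo = phi[:M]                    # M examples closest to one endpoint
        hi = phi[-M:]                   # M examples closest to the other endpoint
        # support vectors = centroids of the two groups of points on the circle
        sv1 = np.array([np.cos(lo).mean(), np.sin(lo).mean()])
        sv2 = np.array([np.cos(hi).mean(), np.sin(hi).mean()])
        w_hat = 0.5 * (sv1 + sv2)       # midpoint = closest point to the origin
        errs[t] = abs(np.arctan2(w_hat[1], w_hat[0])) / np.pi
    return errs.mean()

for M in (1, 2, 3):
    print(M, prediction_error(N=200, M=M))   # compare with 1/(2N), 7/(12N), 239/(360N)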

In total, introducing soft margins into the ν-SVM is equivalent to changing the distribution of an example to that of the centroid of M = 1/C examples, which may lead to noise suppression for an appropriate C, as well as to degradation of the generalization ability for too small a C.

Fig. 7. Computer simulations for the average generalization error of the ν-SVM with soft margins (horizontal axis: M = 1/C).

V. EFFECT OF KERNEL FUNCTIONS

Due to Mercer's theorem, determining a kernel function is completely equivalent to selecting features, which is the most important step in pattern classification. Keerthi et al. [26] contributed to this problem by elucidating how the generalization ability depends on the parameters σ² and C of SVMs with the Gaussian kernel

K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right) \qquad (17)

when the parameters take extreme values, such as 0 or ∞. The results are listed below:

1) For a fixed σ² and C → 0, all examples are classified to the majority class and the SVM underfits the examples.

2) For a fixed σ² and C → ∞, the SVM solution approaches that with hard margins.

3) For a fixed C and σ² → 0, the SVM overfits or underfits the examples depending on C.

4) For a fixed C and σ² → ∞, the SVM solution approaches that with the linear kernel K(x, x') = x^T x'. At the same time, the effect of C decreases and the SVM underfits the examples.

5) For a fixed C/σ² and C, σ² → ∞, the SVM solution approaches that with the linear kernel.

Some of these results are interesting but others are unnatural, such as 1). This results from the imbalance of positive and negative examples. On the other hand, the ν-SVM with homogeneous hyperplanes treats the positive and negative examples (f(x^(n)), y^(n)) and (−f(x^(n)), −y^(n)) equivalently, since the examples only appear in the form f̃^(n) = y^(n) f(x^(n)). Hence, we consider the asymptotic properties of the ν-SVM in two extreme cases: one is that all feature vectors are almost orthogonal to each other, K(x^(i), x^(j)) ≈ δ_ij, and the other is that all feature vectors are almost the same, K(x^(i), x^(j)) ≈ 1, assuming normalized feature vectors, K(x, x) = 1. The results are as follows [27].

* If |K(x^(i), x^(j))| < 1/(4N) holds for any pair of distinct inputs x^(i) and x^(j), where N is the number of examples, then every example is a support vector. The ν-SVM solution α approaches (1/N, ..., 1/N)^T, that is, the weight vector w converges to the centroid of the f̃^(n) = y^(n) f(x^(n)), as N → ∞.

* Assume that K(x^(i), x^(j)) is written as K(x^(i), x^(j)) = Ψ(‖x^(i) − x^(j)‖²/σ²), where Ψ is a decreasing function. Then the ν-SVM solution approaches the solution of the ν-SVM with the linear kernel and inhomogeneous hyperplanes as σ² → ∞. This means that the property that the ν-SVM with homogeneous hyperplanes does not distinguish positive and negative examples is lost.

Due to Theorem 5.2 in [2], the ν-SVM has a low generalization ability in both cases.
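The two regimes can be seen directly from the Gram matrix of the Gaussian kernel: for a very small σ² it is close to the identity (nearly orthogonal feature vectors), and for a very large σ² it is close to the all-ones matrix (nearly identical feature vectors). A minimal sketch with arbitrary data (our own illustration):

import numpy as np

def gaussian_gram(X, sigma2):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma2))

X = np.random.default_rng(3).normal(size=(5, 2))
print(np.round(gaussian_gram(X, 1e-3), 3))   # ~ identity: K(x_i, x_j) ~ delta_ij
print(np.round(gaussian_gram(X, 1e3), 3))    # ~ all ones:  K(x_i, x_j) ~ 1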


TABLE I
THE ν-SVM SOLUTION α WITH THE GAUSSIAN KERNEL

σ²        α1      α2      α3      α4      α5
0.1       .2000   .2000   .2000   .2000   .2000
0.3       .2072   .1998   .1998   .2003   .1929
1         .2814   .1779   .1728   .2375   .1303
2         .3505   .1479   .1005   .3306   .0704
3         .3854   .1454   .0438   .4019   .0236
5         .4023   .1487   0       .4489   0
10        .3925   .1383   0       .4692   0
100       .3771   .1265   0       .4963   0
1000      .3752   .1252   0       .4996   0
Centroid  .2      .2      .2      .2      .2
Linear    .375    .125    0       .5      0

Table I shows how α depends on σ² of the Gaussian kernel in computer simulations. The experimental results agree well with the theoretical results.

Fig. 8. Relationship among v_p, w_p and w_2 for a hyperplane.

Fig. 9. Relationship among v_p, w_p and w_2 for the reduced convex hull.

VI. EFFECT OF NORMS

In the margin maximization of SVMs, the distance of the examples from the separating hyperplane is the Euclidean distance, that is, the 2-norm. Due to the 2-norm, the margin maximization results in a quadratic programming (QP) problem, and it is known that the 1-norm or the ∞-norm makes the margin maximization a linear programming (LP) problem, which requires less computational complexity [25], [28], [29]. However, it is still unknown how such a change of the norm affects the generalization ability of SVMs, except for an experimental study [29]. In this section, we discuss the geometrical properties of the ν-SVM with the p-norm to clarify how the p-norm affects the ν-SVM solution.

If we employ the p-norm

\|w\|_p = \left( \sum_{i=1}^{M} |w_i|^p \right)^{1/p} \ \ (1 \le p < \infty), \qquad \|w\|_\infty = \max_{1 \le i \le M} |w_i|, \qquad (18)

instead, the primal and dual problems of the ν-SVM are formulated as

\min_{w,\xi_n,\beta} \frac{1}{q}\|w\|_q^q - \beta + C\sum_{n=1}^{N}\xi_n \quad \text{s.t. } y^{(n)} w^\top f(x^{(n)}) \ge \beta - \xi_n,\ \ \xi_n \ge 0, \qquad (19)

\min_{0 \le \alpha_n \le C} \frac{1}{p}\|v\|_p^p \quad \text{s.t. } v = \sum_{n=1}^{N}\alpha_n y^{(n)} f(x^{(n)}),\ \ \sum_{n=1}^{N}\alpha_n = 1, \qquad (20)

respectively, where the q-norm satisfying 1/p + 1/q = 1 is called the dual norm of the p-norm. The norms satisfy the Hölder inequality, w^T v ≤ ‖w‖_p ‖v‖_q. Note that (20) is a problem of finding the closest point to the origin in the reduced convex hull, as well as (15), but the closest point is v, not w itself. The ν-SVM solution w is expressed as

w_i = \mathrm{sgn}(v_i)\,|v_i|^{p-1} \qquad (21)

using the closest point v. We discuss the properties of w in the following.

We first consider an inhomogeneous hyperplane instead of the reduced convex hull. Let the closest point in the p-norm be denoted by v_p, the ν-SVM solution derived from v_p through (21) by \tilde{w}_p, and the normalized \tilde{w}_p by w_p, that is,

w_p = \tilde{w}_p / \|\tilde{w}_p\|. \qquad (22)

Then, it is easily shown that w_p coincides with w_2 irrespective of p. In other words, the ν-SVM solution does not depend on p in this case (Fig. 8). This result implies that w_p and w_2 are the same when v_p and v_2 lie on the same hyperplane of the reduced convex hull, and they are different otherwise (Fig. 9). The reduced convex hull approaches the origin as the number of examples N increases, while the angle φ between v_2 and v_p has an upper bound given by

\cos\varphi \ge M^{-\left|\frac{1}{2} - \frac{1}{p}\right|} \qquad (23)

irrespective of p or N, where M is the dimension of the feature space [14]. Hence, the effect of p is rather small.
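To get a feel for how weak the dependence on p is, one can solve the dual (20) numerically for several p and map the minimizer v to w through (21). The sketch below is our own illustration only; it assumes a linear kernel, arbitrary toy data, and a general-purpose SLSQP solver, none of which comes from the paper.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
y = np.where(X[:, 0] + 0.3 * X[:, 1] > 0, 1.0, -1.0)
F = y[:, None] * X                       # f~(n) with a linear kernel f(x) = x
n, C = len(y), 0.25

def solve_dual(p):
    obj = lambda a: np.sum(np.abs(F.T @ a) ** p) / p          # (1/p)||v||_p^p
    res = minimize(obj, np.full(n, 1.0 / n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    v = F.T @ res.x
    w = np.sign(v) * np.abs(v) ** (p - 1)                      # Eq. (21)
    return w / np.linalg.norm(w)

w2 = solve_dual(2.0)
for p in (1.5, 3.0, 4.0):
    wp = solve_dual(p)
    print(p, float(np.dot(wp, w2)))      # cosine of the angle between w_p and w_2

The printed cosines are typically very close to one, in line with the discussion above.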

For a more quantitative analysis of the effects of p on the generalization ability, we should consider the properties of the admissible region. Ikeda and Murata [14] have shown that w_p for any p is included in the admissible region. Therefore, the average generalization error is bounded from above by that of the worst algorithm in [16], and the error converges to zero in the order of 1/N.

VII. CONCLUSIONS

In this paper, we discussed the properties of the average

generalization error, the so-called learning curves, of SVMs.


First, the generalization error of kernel methods, including SVMs, is determined by how the input space is embedded into the feature space, not by the dimension of the feature space itself. More specifically, the error strongly depends on the class, defined as the dimension of the space spanned by the tangent spaces on the separating curve in the feature space.

Second, the soft-margin technique, which is introduced to make SVMs available for noisy examples, increases the generalization error in a simple case. Combining this result with an appropriate noise model may lead to a reasonable method for selecting the parameter C.

Third, it is shown how the properties of the kernel function affect the SVM solution in extreme cases. The SVM solution approaches the centroid of the given examples when K(x^(i), x^(j)) ≈ δ_ij, and it is approximated by the SVM solution with the linear kernel and inhomogeneous hyperplanes when K(x^(i), x^(j)) ≈ 1.

Lastly, the SVM solution depends only slightly on the norm when the p-norm is employed instead of the Euclidean norm in margin maximization.

Many of the results in this paper have been derived under

the assumption of asymptotics, e.g., N → ∞. However, it is said that the superiority of SVMs over other learning classifiers appears in the case of few examples, as seen in Fig. 3. Analyses in such a case should be done in the future.

ACKNOWLEDGMENT

This study is supported in part by a Grant-in-Aid for Scientific Research (14084210, 15700130) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

REFERENCES

[1] C. Cortes and V. Vapnik, "Support vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[2] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY: Springer-Verlag, 1995.
[3] V. N. Vapnik, Statistical Learning Theory. New York, NY: John Wiley and Sons, 1998.
[4] B. Schölkopf, C. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999.
[5] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge, UK: Cambridge Univ. Press, 2000.
[6] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, Eds., Advances in Large Margin Classifiers. Cambridge, MA: MIT Press, 2000.
[7] L. G. Valiant, "A theory of the learnable," Communications of the ACM, vol. 27, pp. 1134-1142, 1984.
[8] R. Dietrich, M. Opper, and H. Sompolinsky, "Statistical mechanics of support vector networks," Physical Review Letters, vol. 82, no. 14, pp. 2975-2978, 1999.
[9] M. Opper and R. Urbanczik, "Universal learning curves of support vector machines," Physical Review Letters, vol. 86, no. 19, pp. 4410-4413, 2001.
[10] K. Ikeda, "Generalization error analysis for polynomial kernel methods - algebraic geometrical approach," Proc. ICANN/ICONIP, pp. 201-208, 2003.
[11] K. Ikeda, "Geometry and learning curves of kernel methods with polynomial kernels," Systems and Computers in Japan, vol. 35, no. 7, pp. 41-48, 2004.
[12] K. Ikeda, "An asymptotic statistical theory of polynomial kernel methods," Neural Computation, vol. 16, no. 8, pp. 1705-1719, 2004.
[13] K. Ikeda and T. Aoishi, "An asymptotic statistical analysis of support vector machines with soft margins," Neural Networks, vol. 18, no. 3, pp. 251-259, 2005.
[14] K. Ikeda and N. Murata, "Geometrical properties of nu support vector machines with different norms," Neural Computation, vol. 17, no. 11, pp. 2508-2529, 2005.
[15] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer, "Theoretical foundations of the potential function method in pattern recognition learning," Automation and Remote Control, vol. 25, pp. 821-837, 1964.
[16] K. Ikeda and S.-I. Amari, "Geometry of admissible parameter region in neural learning," IEICE Trans. Fundamentals, vol. E79-A, pp. 938-943, 1996.
[17] E. B. Baum and D. Haussler, "What size net gives valid generalization?" Neural Computation, vol. 1, pp. 151-160, 1989.
[18] M. Opper and D. Haussler, "Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise," Proc. COLT, vol. 4, pp. 75-87, 1991.
[19] S.-I. Amari, N. Fujita, and S. Shinomoto, "Four types of learning curves," Neural Computation, vol. 4, pp. 605-618, 1992.
[20] S.-I. Amari, "A universal theorem on learning curves," Neural Networks, vol. 6, pp. 161-166, 1993.
[21] S.-I. Amari and N. Murata, "Statistical theory of learning curves under entropic loss criterion," Neural Computation, vol. 5, pp. 140-153, 1993.
[22] K. Ueno, Introduction to Algebraic Geometry. Tokyo: Iwanami-Shoten, 1995 (in Japanese).
[23] J. Shawe-Taylor and N. Cristianini, "On the generalisation of soft margin algorithms," IEEE Trans. on Information Theory, vol. 48, no. 10, pp. 2721-2735, 2002.
[24] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett, "New support vector algorithms," Neural Computation, vol. 12, no. 5, pp. 1207-1245, 2000.
[25] K. P. Bennett and E. J. Bredensteiner, "Duality and geometry in SVM classifiers," Proc. ICML, pp. 57-64, 2000.
[26] S. S. Keerthi and C.-J. Lin, "Asymptotic behaviors of support vector machines with Gaussian kernel," Neural Computation, vol. 15, no. 7, pp. 1667-1689, 2003.
[27] K. Ikeda, "Effects of kernel function on nu support vector machines in asymptotics," IEEE Trans. on Neural Networks, vol. 17, in press, 2006.
[28] O. L. Mangasarian, "Arbitrary-norm separating plane," Operations Research Letters, vol. 24, pp. 15-23, 1999.
[29] J. P. Pedroso and N. Murata, "Support vector machines with different norms: Motivation, formulations, and results," Pattern Recognition Letters, vol. 22, no. 12, pp. 1263-1272, 2001.
