
Data mining with sparse grids

J. Garcke (+), M. Griebel (+) and M. Thess (*)

(+) Institut für Angewandte Mathematik, Abteilung für wissenschaftliches Rechnen und numerische Simulation, Universität Bonn, D-53115 Bonn, {garckej, griebel}@iam.uni-bonn.de
(*) Prudential Systems, Chemnitz, thess@prudsys.com

Abstract

We present a new approach to the classification problem arising in data mining. It is based on the regularization network approach but, in contrast to other methods which employ ansatz functions associated to data points, we use basis functions coming from a grid in the usually high-dimensional feature space for the minimization process. To cope with the curse of dimensionality, we employ sparse grids. Thus, only O(h_n^{-1} n^{d-1}) instead of O(h_n^{-d}) grid points and unknowns are involved. Here d denotes the dimension of the feature space and h_n = 2^{-n} gives the mesh size. To be precise, we suggest using the sparse grid combination technique, where the classification problem is discretized and solved on a certain sequence of conventional grids with uniform mesh sizes in each coordinate direction. The sparse grid solution is then obtained from the solutions on these different grids by linear combination. In contrast to other sparse grid techniques, the combination method is simpler to use and can be parallelized in a natural and straightforward way.

We describe the sparse grid combination technique for the classification problem in terms of the regularization network approach. We then give implementational details and discuss the complexity of the algorithm. It turns out that the method scales linearly with the number of instances, i.e. the amount of data to be classified. Finally we report on the quality of the classifier built by our new method. Here we consider standard test problems from the UCI repository and problems with huge synthetic data sets in up to 9 dimensions. It turns out that our new method achieves correctness rates which are competitive to those of the best existing methods.

AMS subject classification. 62H30, 65D10, 68T10

Key words. data mining, classification, approximation, sparse grids, combination technique

1 Introduction

Data mining is the process of finding hidden patterns, relations and trends in large data sets. It plays an increasing role in commerce and science. Typical scientific applications
are the post-processing of data in medicine (e.g. CT data), the evaluation of data in astrophysics (e.g. telescope and observatory data), the grouping of seismic data, or the evaluation of satellite pictures (e.g. NASA's Earth Observing System). Maybe even more important are financial and commercial applications. With the development of the internet and e-commerce, huge data sets are collected more or less automatically which can be used for business decisions and further strategic planning. Here, applications range from the termination and cancellation of contracts, the assessment of credit risks, the segmentation of customers for rate/tariff planning and advertising campaign letters, to fraud detection, stock analysis and turnover prediction.

Usually, the process of data mining (or knowledge discovery) can be separated into the planning step, the preparation phase, the mining phase (i.e. the machine learning) and the evaluation phase. To this end, association analysis, classification, clustering and prognostics are to be performed. For a thorough overview of the various tasks arising in the data mining process see [7, 18].

In this paper we consider the classification problem in detail. Here, a set of data points in d-dimensional feature space is given together with a class label, for example in {-1, 1}. From this data, a classifier must be constructed which allows the prediction of the class of any newly given data point for future decision making. Widely used approaches are nearest neighbor methods, decision tree induction, rule learning and memory-based reasoning. There are also classification algorithms based on adaptive multivariate regression splines, neural networks, support vector machines and regularization networks. Interestingly, these latter techniques can be interpreted in the framework of regularization networks [29]. This approach allows a direct description of the most important neural networks and it also allows for an equivalent description of support vector machines and n-term approximation schemes [27]. Here, the classification of data is interpreted as a scattered data approximation problem with certain additional regularization terms in high-dimensional spaces. With these techniques it is possible to treat quite high-dimensional problems, but the amount of data is limited due to complexity reasons. This situation is reversed in many practical applications. Here, after some preprocessing steps, the dimension of the resulting problem is moderate but the amount of data is usually huge. Thus, there is a strong need for methods which can be applied in this situation as well.

In this paper, we present a new approach to the classification problem arising in data mining. It is also based on the regularization network approach but, in contrast to the other methods which employ mostly global ansatz functions associated to data points, we use an independent grid with associated local ansatz functions in the minimization process. This is similar to the numerical treatment of partial differential equations. Here, a uniform grid would result in O(h_n^{-d}) grid points, where d denotes the dimension of the feature space, n is the refinement level and h_n = 2^{-n} gives the mesh size. Therefore the complexity of the problem would grow exponentially with d and we encounter the curse of dimensionality. This is probably the reason why conventional grid-based techniques have not been used in data mining up to now.

However, there is a special discretization technique using so-called sparse grids which allows us to cope with the complexity of the problem, at least to some extent.
This method has been originally developed for the solution of partial differential equations [4, 10, 40, 75] and is now used successfully also for integral equations [22, 39], interpolation and
approximation [5, 37, 47, 60, 66], eigenvalue problems [25] and integration problems [26]. In the information based complexity community it is also known as 'hyperbolic cross points', and the idea can even be traced back to [63]. For a d-dimensional problem, the sparse grid approach employs only O(h_n^{-1} (log(h_n^{-1}))^{d-1}) grid points in the discretization. The accuracy, however, is nearly as good as for the conventional full grid methods, provided that certain additional smoothness requirements are fulfilled. Thus a sparse grid discretization method can be employed also for higher-dimensional problems. The curse of dimensionality of full grid methods affects sparse grids much less. Note that there exist different variants of solvers working on sparse grids, each one with its specific advantages and drawbacks. One variant is based on finite difference discretization [33, 59], another approach uses a Galerkin finite element discretization [4, 10, 13], and the so-called combination technique [40] makes use of multivariate extrapolation [16].

In this paper, we apply the sparse grid combination method to the classification problem. The regularization network problem is discretized and solved on a certain sequence of conventional grids with uniform mesh sizes in each coordinate direction. The sparse grid solution is then obtained from the solutions on these different grids by linear combination. Thus the classifier is built on sparse grid points and not on data points. A discussion of the complexity of the method shows that the new method scales linearly with the number of instances, i.e. the amount of data to be classified. Therefore, our method is well suited for realistic data mining applications where the dimension of the feature space is moderately high after some preprocessing steps but the amount of data is very large. Furthermore, the quality of the classifier built by our new method seems to be very good. Here we consider standard test problems from the UCI repository and problems with huge synthetic data sets in up to 9 dimensions. It turns out that our new method achieves correctness rates which are competitive to those of the best existing methods. The combination method is simple to use and can be parallelized in a natural and straightforward way.

The remainder of this paper is organized as follows: In Section 2 we describe the classification problem in the framework of regularization networks as minimization of a (quadratic) functional. We then discretize the feature space and derive the associated linear problem. Here we focus on grid-based discretization techniques. Then, we introduce the concept of a sparse grid and its associated sparse grid space and discuss its properties. Furthermore, we introduce the sparse grid combination technique for the classification problem and present the overall algorithm. We then give implementational details and discuss the complexity of the algorithm. Section 3 presents the results of numerical experiments conducted with the new sparse grid combination method and demonstrates the quality of the classifier built by our new method. Some final remarks conclude the paper.

2 The problem

Classification of data can be interpreted as a traditional scattered data approximation problem with certain additional regularization terms. In contrast to conventional scattered data approximation applications, we now encounter quite high-dimensional spaces. To this end, the approach of regularization networks [29] gives a good framework.
This approach allows a direct description of the most important neural networks and it also allows for an equivalent description of support vector machines and n-term approximation
schemes [27].

Consider the given set of already classified data (the training set)

S = {(x_i, y_i) ∈ R^d × R}_{i=1}^{M},

with x_i representing the data points in the feature space and y_i their class labels. Assume now that these data have been obtained by sampling an unknown function f which belongs to some function space V defined over R^d, and that the sampling process was disturbed by noise. The aim is now to recover the function f from the given data as well as possible. This is clearly an ill-posed problem since there are infinitely many solutions possible. To get a well-posed, uniquely solvable problem we have to assume further knowledge of f. To this end, regularization theory [67, 72] imposes an additional smoothness constraint on the solution of the approximation problem, and the regularization network approach considers the variational problem

min_{f ∈ V} R(f)

with

R(f) = (1/M) Σ_{i=1}^{M} C(f(x_i), y_i) + λ Φ(f).   (1)

Here, C(.,.) denotes an error cost function which measures the interpolation error and Φ(f) is a smoothness functional which must be well defined for f ∈ V. The first term enforces closeness of f to the data, the second term enforces smoothness of f, and the regularization parameter λ balances these two terms. Typical examples are

C(x, y) = |x - y|   or   C(x, y) = (x - y)^2,

and

Φ(f) = ||Pf||_2^2   with   Pf = ∇f   or   Pf = Δf.

The value of λ can be chosen according to cross-validation techniques [2, 30, 68, 71] or according to some other principle, such as structural risk minimization [69]. We find exactly this type of formulation in the case d = 2, 3 in many scattered data approximation methods, see [1, 44], where the regularization term is usually physically motivated.

Now assume that we have a basis of V given by {φ_j(x)}_{j=1}^{∞}. Let also the constant function be in the span of the functions φ_j. We can then express a function f ∈ V as

f(x) = Σ_{j=1}^{∞} α_j φ_j(x)

with associated degrees of freedom α_j. In the case of a regularization term of the type

Φ(f) = Σ_{j=1}^{∞} α_j^2 / λ_j,
where {λ_j}_{j=1}^{∞} is a decreasing positive sequence, it is easy to show ([27], Appendix B) that, independent of the function C, the solution of the variational problem (1) always has the form

f(x) = Σ_{j=1}^{M} α_j K(x, x_j).

Here K is the symmetric kernel function

K(x, y) = Σ_{j=1}^{∞} λ_j φ_j(x) φ_j(y),

which can be interpreted as the kernel of a Reproducing Kernel Hilbert Space (RKHS), see [3, 27]. In other words, if certain functions K(x, x_j) are used in an approximation scheme which are centered at the locations of the data points x_j, then the approximate solution is a finite series and involves only M terms. For radially symmetric kernels we end up with radial basis function approximation schemes. But also many other approximation schemes like additive models, hyperbasis functions, ridge approximation models and several types of neural networks can be derived by a specific choice of the regularization operator [29]. For an overview of these different data-centered approximation schemes, see [19, 28, 29] and the references cited therein.

It is noteworthy that the support vector machine approach can be expressed equivalently in the form of the minimization problem (1), see [19, 27, 62] for further details. The special choice of the cost function

C(x, y) = |y - x|_ε = { 0 if |y - x| < ε;  |y - x| - ε otherwise }

results in a regularization formulation which has been shown [27, 62] to be equivalent to the support vector machine approach. It has the form

f(x; α, α*) = Σ_{i=1}^{M} (α*_i - α_i) K(x, x_i) + b

with constant b, where α*_i and α_i are positive coefficients which solve the following quadratic programming problem:

min_{α, α*} R(α*, α) = ε Σ_{i=1}^{M} (α*_i + α_i) - Σ_{i=1}^{M} y_i (α*_i - α_i) + (1/2) Σ_{i,k=1}^{M} (α*_i - α_i)(α*_k - α_k) K(x_i, x_k)

subject to the constraints

0 ≤ α*, α ≤ 1/(2λM)   and   Σ_{i=1}^{M} (α*_i - α_i) = 0,

see also [19], Problem 5.2. Due to the nature of this quadratic programming problem, only a number of the coefficients α*_i - α_i will be different from zero. The input data points x_i associated to them are called support vectors.
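
As a small, self-contained illustration of this cost function and of a data-centered expansion of the above form, the following Python sketch may help (it is our own illustration, not part of the paper; the Gaussian kernel is just one possible radially symmetric choice, and all names are made up):

```python
import numpy as np

def eps_insensitive(y, fx, eps=0.1):
    """epsilon-insensitive cost C(f(x), y): zero inside the eps-tube, linear outside."""
    return np.maximum(np.abs(y - fx) - eps, 0.0)

def rbf_kernel(x, z, sigma=1.0):
    """One possible radially symmetric kernel K(x, z)."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / (2.0 * sigma ** 2))

def kernel_classifier(x, centers, coeffs, b, sigma=1.0):
    """Evaluate f(x) = sum_i (alpha*_i - alpha_i) K(x, x_i) + b;
    coeffs[i] already stores the difference alpha*_i - alpha_i."""
    return sum(c * rbf_kernel(x, xi, sigma) for c, xi in zip(coeffs, centers)) + b

# Toy usage: two support vectors, evaluate the classifier and the loss at one point
f = kernel_classifier([0.3, 0.4], centers=[[0.2, 0.4], [0.8, 0.7]], coeffs=[1.0, -1.0], b=0.0)
print(f, eps_insensitive(1.0, f))
```
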

This formulation is the starting point for several improved schemes like the support vector machine based on the ||.||_1-norm, the smoothed support vector machine [48], the feature selection concave minimization algorithm [9] and the minimal support vector machine [23]. Also certain n-term approximation schemes and de-noising (basis pursuit) have been shown [27] to be equivalent to the regularization formulation.

2.1 Discretization

In the following we take a slightly different approach: We explicitly restrict the problem to a finite dimensional subspace V_N ⊂ V. The function f is then replaced by

f_N = Σ_{j=1}^{N} α_j φ_j(x).   (2)

Here the ansatz functions {φ_j}_{j=1}^{N} should span V_N and preferably should form a basis for V_N. The coefficients {α_j}_{j=1}^{N} denote the degrees of freedom. Note that the restriction to a suitably chosen finite-dimensional subspace involves some additional regularization (regularization by discretization) which depends on the choice of V_N.

In the remainder of this paper, we restrict ourselves to the choice

C(f_N(x_i), y_i) = (f_N(x_i) - y_i)^2

and

Φ(f_N) = ||P f_N||_{L2}^2   (3)

for some given linear operator P. This way we obtain from the minimization problem a feasible linear system. We thus have to minimize

R(f_N) = (1/M) Σ_{i=1}^{M} (f_N(x_i) - y_i)^2 + λ ||P f_N||_{L2}^2,   f_N ∈ V_N,   (4)

in the finite dimensional space V_N. We plug (2) into (4) and obtain

R(f_N) = (1/M) Σ_{i=1}^{M} ( Σ_{j=1}^{N} α_j φ_j(x_i) - y_i )^2 + λ || P Σ_{j=1}^{N} α_j φ_j ||_{L2}^2   (5)
       = (1/M) Σ_{i=1}^{M} ( Σ_{j=1}^{N} α_j φ_j(x_i) - y_i )^2 + λ Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j (P φ_i, P φ_j)_{L2}.   (6)

Differentiation with respect to α_k, k = 1, ..., N, gives

0 = ∂R(f_N)/∂α_k = (2/M) Σ_{i=1}^{M} ( Σ_{j=1}^{N} α_j φ_j(x_i) - y_i ) φ_k(x_i) + 2λ Σ_{j=1}^{N} α_j (P φ_j, P φ_k)_{L2}.   (7)

This is equivalent to (k = 1, ..., N)

λ Σ_{j=1}^{N} α_j (P φ_j, P φ_k)_{L2} + (1/M) Σ_{j=1}^{N} α_j Σ_{i=1}^{M} φ_j(x_i) φ_k(x_i) = (1/M) Σ_{i=1}^{M} y_i φ_k(x_i),   (8)
and we obtain (k = 1, ..., N)

Σ_{j=1}^{N} α_j [ Mλ (P φ_j, P φ_k)_{L2} + Σ_{i=1}^{M} φ_j(x_i) φ_k(x_i) ] = Σ_{i=1}^{M} y_i φ_k(x_i).   (9)

In matrix notation we end up with the linear system

(λ C + B·B^T) α = B y.   (10)

Here C is a square N × N matrix with entries C_{j,k} = M · (P φ_j, P φ_k)_{L2}, j, k = 1, ..., N, and B is a rectangular N × M matrix with entries B_{j,i} = φ_j(x_i), i = 1, ..., M, j = 1, ..., N. The vector y contains the data y_i and has length M. The unknown vector α contains the degrees of freedom α_j and has length N.

Depending on the regularization operator we obtain different minimization problems in d-dimensional space. For example, if we use the gradient P = ∇ in the regularization expression in (1), we obtain a Poisson problem with an additional term which resembles the interpolation problem. The natural boundary conditions for such a differential equation in, e.g., Ω = [0,1]^d are Neumann conditions. The discretization (2) then gives us the linear system (10), where C corresponds to a discrete Laplacian. To obtain the classifier f_N we now have to solve this system.

2.2 Grid based discrete approximation

Up to now we have not yet been specific about which finite-dimensional subspace V_N and which type of basis functions {φ_j}_{j=1}^{N} we want to use. In contrast to conventional data mining approaches which work with ansatz functions associated to data points, we now use a certain grid in the attribute space to determine the classifier with the help of basis functions associated to these grid points. This is similar to the numerical treatment of partial differential equations.

For reasons of simplicity, here and in the remainder of this paper, we restrict ourselves to the case x_i ∈ Ω = [0,1]^d. This situation can always be reached by a proper rescaling of the data space. A conventional finite element discretization would now employ an equidistant grid Ω_n with mesh size h_n = 2^{-n} for each coordinate direction. In the following we always use the gradient P = ∇ in the regularization expression (3). Let j denote the multi-index (j_1, ..., j_d) ∈ N^d. A finite element method with piecewise d-linear test and trial functions φ_{n,j}(x) on grid Ω_n would then give the classifier f_N as

f_n(x) = Σ_{j_1=0}^{2^n} ... Σ_{j_d=0}^{2^n} α_{n,j} φ_{n,j}(x),

and the variational procedure (4)-(9) would result in the discrete system

(λ C_n + B_n·B_n^T) α_n = B_n y   (11)

with the discrete (2^n + 1)^d × (2^n + 1)^d Laplacian

(C_n)_{j,k} = M · (∇φ_{n,j}, ∇φ_{n,k}),
j_t, k_t = 0, ..., 2^n, t = 1, ..., d, the (2^n + 1)^d × M matrix

(B_n)_{j,i} = φ_{n,j}(x_i),

j_t = 0, ..., 2^n, t = 1, ..., d, i = 1, ..., M, and the unknown vector (α_n)_j, j_t = 0, ..., 2^n, t = 1, ..., d. Note that f_n lives in the space

V_n := span{φ_{n,j}, j_t = 0, ..., 2^n, t = 1, ..., d}.

The discrete problem (11) might in principle be treated by an appropriate solver like the conjugate gradient method, a multigrid method or some other suitable efficient iterative method. However, this direct application of a finite element discretization and an appropriate linear solver for the arising system is clearly not possible for a d-dimensional problem if d is larger than four. The number of grid points would be of the order O(h_n^{-d}) = O(2^{nd}) and, in the best case, if the most effective technique like a multigrid method is used, the number of operations is of the same order. Here we encounter the so-called curse of dimensionality: The complexity of the problem grows exponentially with d. At least for d > 4 and a reasonable value of n, the arising system cannot be stored and solved even on the largest parallel computers today.

2.3 Sparse grid space

However, there is a special discretization technique using so-called sparse grids which allows us to cope with the complexity of the problem, at least to some extent. This method has been originally developed for the solution of partial differential equations [4, 10, 40, 75] and is now used successfully also for integral equations [22, 39], interpolation and approximation [5, 37, 47, 60, 66], eigenvalue problems [25] and integration problems [26]. In the information based complexity community it is also known as 'hyperbolic cross points', and the idea can even be traced back to [63]. For a d-dimensional problem, the sparse grid approach employs only O(h_n^{-1} (log(h_n^{-1}))^{d-1}) grid points in the discretization process. It can be shown that an accuracy of O(h_n^2 log(h_n^{-1})^{d-1}) can be achieved pointwise or with respect to the L2- or L∞-norm, provided that the solution is sufficiently smooth. Thus, in comparison to conventional full grid methods, which need O(h_n^{-d}) points for an accuracy of O(h_n^2), the sparse grid method can be employed also for higher-dimensional problems. The curse of dimensionality of full grid methods affects sparse grids much less.

Now, with the multi-index l = (l_1, ..., l_d) ∈ N^d, we consider the family of standard regular grids

{Ω_l, l ∈ N^d}   (12)

on Ω with mesh size h_l := (h_{l_1}, ..., h_{l_d}) := (2^{-l_1}, ..., 2^{-l_d}). That is, Ω_l is equidistant with respect to each coordinate direction but, in general, has different mesh sizes in the different coordinate directions. The grid points contained in a grid Ω_l are the points

x_{l,j} := (x_{l_1,j_1}, ..., x_{l_d,j_d})   (13)

with x_{l_t,j_t} := j_t · h_{l_t} = j_t · 2^{-l_t}, j_t = 0, ..., 2^{l_t}. On each grid Ω_l we define the space V_l of piecewise d-linear functions,

V_l := span{φ_{l,j}, j_t = 0, ..., 2^{l_t}, t = 1, ..., d},   (14)
which is spanned by the usual d-dimensional piecewise d-linear hat functions

φ_{l,j}(x) := Π_{t=1}^{d} φ_{l_t,j_t}(x_t).   (15)

Here, the one-dimensional functions φ_{l_t,j_t}(x_t) with support [x_{l_t,j_t} - h_{l_t}, x_{l_t,j_t} + h_{l_t}] = [(j_t - 1) h_{l_t}, (j_t + 1) h_{l_t}] (restricted to [0,1]) can be created from a single one-dimensional mother function φ(x),

φ(x) := { 1 - |x| if x ∈ (-1, 1);  0 otherwise },   (16)

by dilation and translation, i.e.

φ_{l_t,j_t}(x_t) := φ( (x_t - j_t · h_{l_t}) / h_{l_t} ).   (17)

In the previous definitions and in the following, the multi-index l ∈ N^d indicates the level of a grid, a space or a function, respectively, whereas the multi-index j ∈ N^d denotes the location of a given grid point x_{l,j} or of the respective basis function φ_{l,j}(x).

Now we can define the difference spaces

W_l := V_l ⊖ Σ_{t=1}^{d} V_{l-e_t},   (18)

where e_t denotes the t-th unit vector. To complete this definition, we formally set V_l := ∅ if l_t = -1 for at least one t ∈ {1, ..., d}. These hierarchical difference spaces allow the definition of a multilevel subspace splitting, i.e. the definition of the space V_n as a direct sum of subspaces,

V_n := Σ_{l_1=0}^{n} ... Σ_{l_d=0}^{n} W_{(l_1,...,l_d)} = ⊕_{|l|_∞ ≤ n} W_l.   (19)

Here and in the following, let ≤ denote the corresponding element-wise relation, and let |l|_∞ := max_{1≤t≤d} l_t and |l|_1 := Σ_{t=1}^{d} l_t denote the discrete L∞- and the discrete L1-norm of l, respectively. As can easily be seen from (14) and (18), the introduction of the index sets I_l,

I_l := { (j_1, ..., j_d) ∈ N^d :  j_t = 1, ..., 2^{l_t} - 1, j_t odd, t = 1, ..., d, if l_t > 0;  j_t = 0, 1, t = 1, ..., d, if l_t = 0 },   (20)

leads to

W_l = span{φ_{l,j}, j ∈ I_l}.   (21)

Therefore, the family of functions

{φ_{l,j}, j ∈ I_l},  |l|_∞ ≤ n,   (22)

is just a hierarchical basis [20, 73, 74] of V_n, which generalizes the one-dimensional hierarchical basis of [20] to the d-dimensional case by means of a tensor product approach. Note that the supports of all basis functions φ_{l,j}(x) in (21) spanning W_l are mutually disjoint.
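
As a small illustration of (15)-(17), the following Python sketch (ours, not part of the paper) evaluates the one-dimensional hat functions and their d-dimensional tensor product:

```python
import numpy as np

def mother_hat(x):
    """1D mother hat function (16): phi(x) = 1 - |x| on (-1, 1), zero elsewhere."""
    return np.maximum(1.0 - np.abs(x), 0.0)

def hat_1d(x, level, j):
    """phi_{l,j}(x) = phi((x - j*h_l) / h_l) with h_l = 2^-level, cf. (17)."""
    h = 2.0 ** (-level)
    return mother_hat((x - j * h) / h)

def hat_dlinear(x, levels, j):
    """d-linear basis function (15) as the tensor product of 1D hat functions."""
    return float(np.prod([hat_1d(xt, lt, jt) for xt, lt, jt in zip(x, levels, j)]))

# Example: basis function of level l = (2, 3), location j = (1, 5), evaluated at x = (0.3, 0.6)
print(hat_dlinear([0.3, 0.6], levels=[2, 3], j=[1, 5]))
```
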

[Figure 1: Two-dimensional sparse grid (left) and three-dimensional sparse grid (right), n = 5]

Now, any function f ∈ V_n can be split accordingly into

f(x) = Σ_{|l|_∞ ≤ n} f_l(x),   f_l ∈ W_l,   with   f_l(x) = Σ_{j ∈ I_l} α_{l,j} φ_{l,j}(x),   (23)

where α_{l,j} ∈ R are the coefficient values of the hierarchical product basis representation. It is this hierarchical representation which now allows us to consider the following subspace V_n^(s) of V_n, obtained by replacing |l|_∞ by |l|_1 in (19):

V_n^(s) := ⊕_{|l|_1 ≤ n+d-1} W_l.   (24)

Again, any function f ∈ V_n^(s) can be split accordingly into

f_n^(s)(x) = Σ_{|l|_1 ≤ n+d-1} Σ_{j ∈ I_l} α_{l,j} φ_{l,j}(x).   (25)

The grids corresponding to the approximation spaces V_n^(s) are called sparse grids and have been studied in detail in [10, 11, 12, 13, 31, 34, 40, 42, 75]. Examples of sparse grids for the two- and three-dimensional case are given in Figure 1.

Now, a straightforward calculation [14] shows that the dimension of the sparse grid space V_n^(s) is of the order O(n^{d-1} 2^n). For the interpolation problem, as well as for the approximation problem stemming from second order elliptic PDEs, it was proven that the sparse grid solution f_n^(s) is almost as accurate as the full grid function f_n, i.e. the discretization error satisfies

||f - f_n^(s)||_{Lp} = O(h_n^2 log(h_n^{-1})^{d-1}),

provided that a slightly stronger smoothness requirement on f holds than for the full grid approach. Here, we need the seminorm

|f|_∞ := || ∂^{2d} f / (Π_{t=1}^{d} ∂x_t^2) ||_∞   (26)

to be bounded.
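
To give a feeling for this reduction in the number of degrees of freedom, the following Python sketch (illustrative only, not from the paper) enumerates the levels with |l|_1 ≤ n+d-1, sums the sizes of the index sets I_l from (20), and compares the result with the full grid size (2^n + 1)^d:

```python
from itertools import product

def index_set_size(l):
    """|I_l| from (20): 2^(l_t - 1) odd indices per direction for l_t > 0, else the 2 boundary indices."""
    size = 1
    for lt in l:
        size *= 2 ** (lt - 1) if lt > 0 else 2
    return size

def sparse_grid_size(n, d):
    """Number of unknowns of V_n^(s): sum of |I_l| over all l with |l|_1 <= n + d - 1."""
    return sum(index_set_size(l) for l in product(range(n + d), repeat=d) if sum(l) <= n + d - 1)

def full_grid_size(n, d):
    """Number of unknowns of the full grid V_n: (2^n + 1)^d."""
    return (2 ** n + 1) ** d

for d in (2, 3, 4):
    print(d, sparse_grid_size(5, d), full_grid_size(5, d))
```
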

The idea is now to carry this discretization method and its advantages with respect to the degrees of freedom over to the minimization problem (1). The minimization procedure (4)-(9) with the discrete function f_n^(s) in V_n^(s),

f_n^(s) = Σ_{|l|_1 ≤ n+d-1} Σ_{j ∈ I_l} α^(s)_{l,j} φ_{l,j}(x),

would result in the discrete system

(λ C_n^(s) + B_n^(s)·(B_n^(s))^T) α_n^(s) = B_n^(s) y   (27)

with

(C_n^(s))_{(l,j),(r,k)} = M · (∇φ_{l,j}, ∇φ_{r,k}),   (B_n^(s))_{(l,j),i} = φ_{l,j}(x_i),

|l|_1 ≤ n+d-1, |r|_1 ≤ n+d-1, j ∈ I_l, k ∈ I_r, i = 1, ..., M, and the unknown vector (α_n^(s))_{(r,k)}, |r|_1 ≤ n+d-1, k ∈ I_r. The discrete problem (27) might in principle be treated by an appropriate iterative solver. Note that now the size of the problem is just of the order O(2^n n^{d-1}). Here, the explicit assembly of the matrices C_n^(s) and B_n^(s) should be avoided. These matrices are more densely populated than the corresponding full grid matrices and this would add further terms to the complexity. Instead, only the action of these matrices onto vectors, i.e. a matrix-vector multiplication, should be performed in an iterative method like the conjugate gradient method or a multigrid method. For example, for C_n^(s) this is possible in a number of operations which is proportional to the number of unknowns only. For details, see [4, 10]. However, the implementation of such a program is quite cumbersome and difficult. It should also be possible to avoid the assembly of B_n^(s) and (B_n^(s))^T and to program the respective matrix-vector multiplications in O(2^n n^{d-1}, M) operations, but this is complicated as well.

There also exists another variant of a solver working on the sparse grid, the so-called combination technique [40], which makes use of multivariate extrapolation [16]. In the following, we apply this method to the minimization problem (1). It is much simpler to use than the Galerkin approach (27), it avoids the matrix assembly problem mentioned above, and it can be parallelized in a natural and straightforward way, see [32, 36]. Note that an approach similar to ours was introduced recently in [43].

2.4 The sparse grid combination technique

For the sparse grid combination technique we proceed as follows: We discretize and solve the problem on a certain sequence of grids Ω_l with uniform mesh sizes h_t = 2^{-l_t} in the t-th coordinate direction. These grids may possess different mesh sizes for different coordinate directions. To this end, we consider all grids Ω_l with

l_1 + ... + l_d = n + (d-1) - q,   q = 0, ..., d-1,   l_t > 0.   (28)

In contrast to the definition (24), for reasons of efficiency we now restrict the level indices to l_t > 0. The finite element approach with piecewise d-linear test and trial functions
φ_{l,j}(x) on grid Ω_l would now give

f_l(x) = Σ_{j_1=0}^{2^{l_1}} ... Σ_{j_d=0}^{2^{l_d}} α_{l,j} φ_{l,j}(x),

and the variational procedure (4)-(9) would result in the discrete system

(λ C_l + B_l·B_l^T) α_l = B_l y   (29)

with the matrices

(C_l)_{j,k} = M · (∇φ_{l,j}, ∇φ_{l,k})   and   (B_l)_{j,i} = φ_{l,j}(x_i),

j_t, k_t = 0, ..., 2^{l_t}, t = 1, ..., d, i = 1, ..., M, and the unknown vector (α_l)_j, j_t = 0, ..., 2^{l_t}, t = 1, ..., d. We then solve these problems by a feasible method. To this end we use here a diagonally preconditioned conjugate gradient algorithm, but an appropriate multigrid method with partial semi-coarsening can also be applied. The discrete solutions f_l are contained in the spaces V_l, see (14), of piecewise d-linear functions on grid Ω_l.

Note that all these problems are substantially reduced in size in comparison to (11). Instead of one problem of size dim(V_n) = O(h_n^{-d}) = O(2^{nd}), we now have to deal with O(d n^{d-1}) problems of size dim(V_l) = O(h_n^{-1}) = O(2^n). Moreover, all these problems can be solved independently, which allows for a straightforward parallelization on a coarse grain level, see [32]. Also, a simple but effective static load balancing strategy is available [36].

Finally we linearly combine the results f_l(x) = Σ_j α_{l,j} φ_{l,j}(x) ∈ V_l from the different grids Ω_l as follows:

f_n^(c)(x) := Σ_{q=0}^{d-1} (-1)^q (d-1 choose q) Σ_{l_1+...+l_d = n+(d-1)-q} f_l(x).   (30)

The resulting function f_n^(c) lives in the above-defined sparse grid space V_n^(s) (but now with l_t > 0 in (24)). For the two-dimensional case, the grids needed in the combination formula of level 4 are shown in Figure 2.

The combination technique can be interpreted as a certain multivariate extrapolation method which works on a sparse grid, for details see [16, 40, 56]. The combination solution f_n^(c) is in general not equal to the Galerkin solution f_n^(s), but its accuracy is usually of the same order, see [40]. To this end, a series expansion of the error is necessary; its existence was shown for PDE model problems in [15].

Note that the summation of the discrete functions from the different spaces V_l in (30) involves d-linear interpolation, which resembles just the transformation to a representation in the hierarchical basis (22). For details see [33, 41]. However, we never explicitly assemble the function f_n^(c); instead we keep the solutions f_l on the different grids Ω_l which arise in the combination formula. Now, any linear operation F on f_n^(c) can easily be expressed by means of the combination formula (30) acting directly on the functions f_l, i.e.

F(f_n^(c)) = Σ_{q=0}^{d-1} (-1)^q (d-1 choose q) Σ_{l_1+...+l_d = n+(d-1)-q} F(f_l).   (31)
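
The enumeration of grids and coefficients in (30) and (31) can be sketched as follows in Python (our own illustration; evaluate_fl stands for whatever per-grid quantity F(f_l) is computed on grid Omega_l, e.g. the value of f_l at a test point, and has to be supplied by the caller):

```python
from itertools import product
from math import comb

def combination_levels(n, d):
    """Yield (coefficient, l) pairs of the combination formula (30): all level multi-indices l
    with l_1 + ... + l_d = n + (d-1) - q and l_t >= 1, for q = 0, ..., d-1."""
    for q in range(d):
        coeff = (-1) ** q * comb(d - 1, q)
        for l in product(range(1, n + 1), repeat=d):
            if sum(l) == n + (d - 1) - q:
                yield coeff, l

def combine(n, d, evaluate_fl):
    """Apply (30)/(31): sum coeff * F(f_l) over all grids of the combination technique."""
    return sum(coeff * evaluate_fl(l) for coeff, l in combination_levels(n, d))

# Example: the level-4 combination in two dimensions uses 4 + 3 = 7 anisotropic grids (cf. Figure 2)
print(sum(1 for _ in combination_levels(4, 2)))
```
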

[Figure 2: Combination technique for level 4 in two dimensions]

Therefore, if we now want to evaluate a newly given set of data points {x̃_i}_{i=1}^{M̃} (the test set) by

ỹ_i := f_n^(c)(x̃_i),   i = 1, ..., M̃,

we just form the combination of the associated values of the f_l according to (30). Altogether we obtain the algorithm of Figure 3.

The combination technique is only one of various methods to solve problems on sparse grids. Note that there also exist finite difference [33, 59] and Galerkin finite element approaches [4, 10, 13] which work directly in the hierarchical product basis on the sparse grid. But the combination technique is conceptually much simpler and easier to implement. Moreover, it allows the reuse of standard solvers for its different subproblems and is straightforwardly parallelizable.

2.5 Implementational details and complexity discussion

Up to now we did not address the question of how to assemble and treat the linear systems (29),

(λ C_l + B_l·B_l^T) α_l = B_l y (=: r_l),   (32)

|l|_1 = n + (d-1) - q, q = 0, ..., d-1, which arise in the combination method. Here, different approaches are possible: First, we can explicitly assemble and store the matrices C_l and G_l := B_l·B_l^T. Second, we can assemble and store C_l and B_l. Of course, all these matrices are sparsely populated; we only assemble the non-zero entries.

Given are data points {(x_i, y_i)}_{i=1}^{M} (the training set).

Computation of the sparse grid classifier:
  for q = 0 to d-1
    for l_1 = 1 to n-q
      for l_2 = 1 to n-q-(l_1-1)
        ...
        for l_{d-1} = 1 to n-q-(l_1-1)-...-(l_{d-2}-1)
          l_d = n-q-(l_1-1)-(l_2-1)-...-(l_{d-1}-1)
          solve the linear equation system (λ C_l + B_l·B_l^T) α_l = B_l y

Evaluation at newly given data points {x̃_i}_{i=1}^{M̃} (the test set):
  ỹ_i = 0, i = 1, ..., M̃
  for q = 0 to d-1
    for l_1 = 1 to n-q
      for l_2 = 1 to n-q-(l_1-1)
        ...
        for l_{d-1} = 1 to n-q-(l_1-1)-...-(l_{d-2}-1)
          l_d = n-q-(l_1-1)-(l_2-1)-...-(l_{d-1}-1)
          evaluate the solution f_l at the x̃_i by d-linear interpolation on grid Ω_l
          set ỹ_i := ỹ_i + (-1)^q (d-1 choose q) f_l(x̃_i), i = 1, ..., M̃

Figure 3: The overall algorithm

The resulting complexities are given in Table 1. Here, N = Π_{t=1}^{d} (2^{l_t} + 1) denotes the number of unknowns on grid Ω_l. Note that G_l is in principle an N × N matrix; of course, we take its non-zero entries into account only.

                      C_l          G_l                B_l              r_l              α_l
storage               O(3^d · N)   O(3^d · N)         O(2^d · M)       O(N)             O(N)
assembly              O(3^d · N)   O(d · 2^{2d} · M)  O(d · 2^d · M)   O(d · 2^d · M)   -
mv-multiplication     O(3^d · N)   O(3^d · N)         O(2^d · M)       -                -

Table 1: Complexities of the storage, the assembly and the matrix-vector multiplication for the different matrices and vectors arising in the combination method on one grid Ω_l

Interestingly, there is a difference in the complexities of the two approaches: The storage cost and the cost of the matrix-vector multiplication for the first approach scale with N, whereas they scale with M for the second approach. Especially in the case M >> N, i.e. for a moderate number of levels n in the sparse grid but a very large amount of data to be handled in the training phase of the classification problem, there is a strong difference in the resulting storage cost and run time.
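
As an illustration of this assembly on a single grid, here is a minimal one-dimensional Python sketch (our own illustration, not the authors' code): it assembles B_l and C_l for hat functions with mesh size h = 2^{-l} and P = d/dx and solves (32) directly, whereas the paper works on d-dimensional grids and uses a diagonally preconditioned conjugate gradient solver:

```python
import numpy as np

def solve_one_grid_1d(data_x, data_y, level, lam):
    """Assemble and solve (lambda*C + B B^T) alpha = B y on one 1D grid (a sketch of (32))."""
    h = 2.0 ** (-level)
    n_pts = 2 ** level + 1                      # grid points x_j = j*h, j = 0, ..., 2^level
    m = len(data_x)

    # B_{j,i} = phi_j(x_i): each data point lies in one cell and touches two hat functions
    B = np.zeros((n_pts, m))
    for i, x in enumerate(data_x):
        j = min(int(x / h), n_pts - 2)          # left grid index of the cell containing x
        t = (x - j * h) / h
        B[j, i] = 1.0 - t
        B[j + 1, i] = t

    # C_{j,k} = M * (phi_j', phi_k')_{L2}: 1D stiffness matrix (Neumann), assembled element-wise
    C = np.zeros((n_pts, n_pts))
    for j in range(n_pts - 1):
        C[j, j] += 1.0 / h
        C[j + 1, j + 1] += 1.0 / h
        C[j, j + 1] -= 1.0 / h
        C[j + 1, j] -= 1.0 / h
    C *= m

    rhs = B @ np.asarray(data_y, dtype=float)
    return np.linalg.solve(lam * C + B @ B.T, rhs)   # dense solve here; the paper uses CG

# Toy usage: two well separated classes on [0, 1]
print(solve_one_grid_1d([0.1, 0.2, 0.8, 0.9], [1, 1, -1, -1], level=3, lam=0.01))
```
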

Since we have O(d n^{d-1}) different problems to store, we obtain an overall storage complexity of O(d n^{d-1} · 3^d N) for the first approach and O(d n^{d-1} · 2^d · M) for the second approach. Clearly, for n and thus N fixed and a large amount of data, i.e. M >> N, the storage of the matrices G_l is more advantageous than the storage of the matrices B_l. With respect to the assembly cost (O(d n^{d-1} · (3^d N + d 2^{2d} M)) versus O(d n^{d-1} · (3^d N + d 2^d M))) there is not such a difference. However, when it finally comes to the overall cost of the matrix-vector multiplications for all |l|_1 = n + (d-1) - q, q = 0, ..., d-1, we obtain a complexity of the order O((3^d + 3^d) · N · d n^{d-1}) for the first approach, but a complexity of the order O((3^d N + 2^d M) d n^{d-1}) for the second method. Thus, analogously to the storage complexities, the matrix-vector multiplication for the first approach behaves advantageously in the practically relevant case M >> N of large data sets.

Note however that both the storage and the run time complexities depend exponentially on the dimension d. Presently, due to the limitations of the memory of modern workstations (512 MByte - 2 GByte), we can therefore only deal with the case d ≤ 8 if we assemble the matrices and keep them in the computer memory. Alternatively, we can avoid building these matrices explicitly and compute the necessary entries on the fly when needed in the CG iteration. This way we trade main memory storage for computing time. Then, depending on the chosen value of n, problems up to d = 15 fit into the memory of modern workstations. Now, since the cost of the assembly of the matrices arises in every matrix-vector multiplication of the CG iteration, there is no longer an advantage for the first approach. On the contrary, due to the difference in the d-dependency (2^{2d} versus 2^d), the second approach performs better.

Note that the solution of the various problems arising in the combination technique by an iterative method like conjugate gradients is surely not optimal. Here, the number of iterations necessary to obtain a prescribed accuracy depends on the condition numbers of the respective system matrices. It is possible to use an appropriate multigrid method to treat the different linear systems iteratively with a convergence rate which is independent of n. Also robustness, i.e. a convergence rate independent of the discrete anisotropy caused by the possible differences of the mesh sizes in the different coordinate directions, can be achieved by applying semi-coarsening techniques where necessary [38]. Furthermore, independence of the convergence rate from the regularization parameter is another issue. Here, since B_l·B_l^T resembles a somewhat smeared out local mixture between the identity and the zero matrix, there is hope to obtain an overall robust method along the lines of [53, 54, 65]. All these techniques are well understood in the two- and three-dimensional case and should work analogously in the higher-dimensional case as well. However, there is no implementation of a high-dimensional robust multigrid method yet. This involves quite some work and future effort. Therefore we stick to the diagonally preconditioned conjugate gradient method for the time being.

Finally, let us consider the evaluation of the computed sparse grid combination classifier at a newly given set of points {x̃_i}_{i=1}^{M̃}. From (28) we directly see that this involves O(d n^{d-1} M̃) operations.
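
The per-point evaluation step of Figure 3, i.e. d-linear interpolation of one f_l at a test point, can be sketched as follows (illustrative only; the coefficient array layout is our own choice):

```python
import numpy as np

def eval_dlinear(alpha, levels, x):
    """Evaluate f_l(x) on the anisotropic grid Omega_l by d-linear interpolation;
    alpha has shape (2^l_1 + 1, ..., 2^l_d + 1) and holds the coefficients alpha_{l,j}."""
    d = len(levels)
    h = [2.0 ** (-l) for l in levels]
    base = [min(int(x[t] / h[t]), 2 ** levels[t] - 1) for t in range(d)]   # cell containing x
    frac = [(x[t] - base[t] * h[t]) / h[t] for t in range(d)]
    value = 0.0
    for corner in range(2 ** d):                 # sum the contributions of the 2^d cell corners
        idx, weight = [], 1.0
        for t in range(d):
            bit = (corner >> t) & 1
            idx.append(base[t] + bit)
            weight *= frac[t] if bit else (1.0 - frac[t])
        value += weight * alpha[tuple(idx)]
    return value

# Toy usage on the grid with l = (2, 3): coefficients set to the first coordinate of each grid point,
# so d-linear interpolation reproduces that coordinate exactly
alpha = np.fromfunction(lambda i, j: i / 4.0, (5, 9))
print(eval_dlinear(alpha, levels=[2, 3], x=[0.3, 0.55]))
```
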
Presently it is not clear whether this complexity can be reduced somewhat by switching to the explicit assembly of the combination solution f_n^(c) in terms of the hierarchical basis (cost O(2^n n^{d-1})) and evaluating the M̃ points x̃_i in the hierarchical basis representation. There is a slight chance that certain common intermediate results might be exploited in a way analogous to other multiscale calculations on sparse grids,
compare [4, 10], but this is unclear up to now. In any case, the evaluation of the different f_l at the test points can be done completely in parallel; their summation according to (30) needs basically an all-reduce/gather operation.

Note finally that efficient data structures for sparse matrices are crucial to allow for an efficient matrix-vector multiplication. Since we are dealing with higher dimensional problems, we have to use data structures which are still fast for rows with many, i.e. 3^d, entries. In the case of the on-the-fly matrix-vector multiplication it is also very important to optimize this part of the algorithm in a similar way, since it consumes the main part of the computing time.

3 Numerical results

We now apply our approach to different test data sets. Here we use both synthetic data generated by DatGen [51] and real data from practical data mining applications. All data sets are rescaled to [0,1]^d. To evaluate our method we give the correctness rates on testing data sets, if available, and the ten-fold cross-validation results on training and testing data sets. For the ten-fold cross-validation we proceed as follows: We divide the training data into 10 equally sized disjoint subsets. For i = 1 to 10, we pick the i-th of these subsets as further testing set and build the sparse grid combination classifier with the data from the remaining 9 subsets. We then evaluate the correctness rates on the current training and testing set. This way we obtain ten different training and testing correctness rates. The ten-fold cross-validation result is then just the average of these ten correctness rates. For further details see [64]. For a critical discussion of the evaluation of the quality of classifier algorithms see [58].

3.1 Two-dimensional problems

We first consider two-dimensional problems with small sets of data which correspond to certain structures. Then we treat problems with huge sets of synthetic data with up to 5 million points. In the following figures we show the border between the two classes, i.e. a class label of zero, as a black line; values between 0 and 1 are presented bright, and values between 0 and -1 are shown from bright to dark to show the change in the classifier.

3.1.1 Checkerboard

The first example is taken from [45, 46]. Here, 1000 training data points were given which are more or less uniformly distributed in Ω = [0,1]^2. The associated class labels are plus one or minus one depending on their location in Ω such that a 4 × 4 checkerboard structure appears, see Figure 4 (left). We computed the 10-fold cross-validated training and testing correctness with the sparse grid combination method for different values of the regularization parameter λ and different levels n. The results are shown in Figure 4 (right).

We see that the 10-fold testing correctness is well around 95% for values of λ between 3·10^{-5} and 5·10^{-3}. Our best 10-fold testing correctness was 96.20% on level 10 with
λ = 4.54·10^{-5}. The checkerboard structure is thus reconstructed with less than 4% error.

[Figure 4: Left: checkerboard data set [45], combination technique with level 10, λ = 4.53999·10^{-5}. Right: 10-fold correctness (%) versus λ (log scale) for the training and testing sets on levels 7 to 10.]

3.1.2 Spiral

Another two-dimensional example with structure is the spiral data set, first proposed by Alexis Wieland of MITRE Corp., see also [21]. Here, 194 data points describe two intertwined spirals, see Figure 5. This is surely an artificial problem which does not appear in many practical applications. However, it serves as a hard test case for data mining algorithms. It is known that neural networks can have severe problems with this data set, and some neural networks cannot separate the two spirals at all [61]. In Table 2 we give the correctness rates achieved with the leave-one-out cross-validation method, i.e. a 194-fold cross-validation. The best testing correctness was achieved on level 6 with 89.69%, in comparison to 77.20% in [61].

level   λ         training correctness   testing correctness
4       0.00001   95.31 %                87.63 %
5       0.001     94.36 %                87.11 %
6       0.00075   100.00 %               89.69 %
7       0.00075   100.00 %               88.14 %
8       0.0005    100.00 %               87.63 %

Table 2: Leave-one-out cross-validation results for the spiral data set

In Figure 5 we show the corresponding results obtained with our sparse grid combination method for the levels 4 to 7. Already for level 6 the two spirals are clearly detected and resolved. Note that here only 447 grid points are contained in the sparse grid. For level
7 (1023 sparse grid points) the shape of the two reconstructed spirals gets smoother and the reconstruction gets more precise.

[Figure 5: Spiral data set, sparse grid with level 4 (top left) to 7 (bottom right)]

3.1.3 Ripley

This data set, taken from [57], consists of 250 training data and 1000 test points. The data set was generated synthetically and is known to exhibit 8% error. Thus no better testing correctness than 92% can be expected.

Since we now have training and testing data, we proceed as follows: First we use ten-fold cross-validation on the training set to determine the best regularization parameter λ. This value is given for the different levels n in Table 3. With this λ we then compute the sparse grid classifier from the 250 training data. Table 3 gives the performance of this classifier: we show the 10-fold cross-validation testing correctness using the training data set and the testing correctness using the (previously unknown) test data. We see that our method works well. Already level 5 is sufficient to obtain results of 90.9%. We also see that there is not much need to use any higher levels. The reason is surely the relative simplicity of the data, see Figure 6. Just a few hyperplanes should be enough to separate the classes quite
properly. This is achieved with the sparse grid already for a small number n. Additionally, we give in Table 3 the testing correctness which is achieved for the best possible λ. To this end we compute for all (discrete) values of λ the sparse grid classifiers from the 250 data points and evaluate them on the test set. We then pick the best result. We clearly see that there is not much of a difference. This indicates that our approach of determining the value of λ from the training set by cross-validation works well.

[Figure 6: Ripley data set, combination technique with level 5, λ = 0.01005 (left), and level 8, λ = 0.00166 (right)]

level   ten-fold testing   λ (from CV)     testing on test data   best λ     best testing
1       84.8 %             0.01005         89.8 %                 0.00370    90.3 %
2       85.2 %             9.166·10^{-6}   90.4 %                 0.00041    90.9 %
3       88.4 %             0.00166         90.6 %                 0.00370    91.2 %
4       87.6 %             0.00248         90.6 %                 0.01500    91.2 %
5       87.6 %             0.01005         90.9 %                 0.00673    91.1 %
6       86.4 %             0.00673         90.8 %                 0.00673    90.8 %
7       86.4 %             0.00075         88.5 %                 0.00673    91.0 %
8       88.0 %             0.00166         89.7 %                 0.00673    91.0 %
9       88.4 %             0.00203         90.9 %                 0.00823    91.0 %
10      88.4 %             0.00166         90.6 %                 0.00452    91.1 %
[57]                                       90.6 %
[55]                                       91.1 %

Table 3: Results for the Ripley data set [57]
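
The model selection loop used here, i.e. ten-fold cross-validation over a list of candidate values of λ, can be sketched as follows (a rough illustration with our own naming; train and correctness are placeholders for building the sparse grid combination classifier with a given λ and for measuring its correctness rate, and whether the data are shuffled before forming the folds is an implementation choice not specified in the paper):

```python
import numpy as np

def ten_fold_cv(data_x, data_y, lambdas, train, correctness, seed=0):
    """Pick the regularization parameter lambda by 10-fold cross-validation.
    train(x, y, lam) must return a fitted classifier, correctness(clf, x, y) its
    correctness rate in percent; data_x and data_y are numpy arrays."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data_x)), 10)   # 10 equally sized disjoint subsets
    best_lam, best_rate = None, -1.0
    for lam in lambdas:
        rates = []
        for k in range(10):
            train_idx = np.concatenate([folds[j] for j in range(10) if j != k])
            clf = train(data_x[train_idx], data_y[train_idx], lam)
            rates.append(correctness(clf, data_x[folds[k]], data_y[folds[k]]))
        rate = float(np.mean(rates))             # ten-fold testing correctness for this lambda
        if rate > best_rate:
            best_lam, best_rate = lam, rate
    return best_lam, best_rate
```
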

3.1.4 Synthetic huge data set in 2D

To test our method with very large data sets, we produced with DatGen [51] (datgen -r2 -X0/100,R,O:0/100,R,O -R2 -C2/4 -D0/1 -T10/70 -O2000000 -e0.15 -p) a data set with 5 million training points and 20 000 points for testing. The results obtained with our sparse grid combination technique are given in Table 4. Here we consider different levels n and different amounts of data points and use the value 0.01 for the regularization parameter λ. We also show the actual total run time of the combination technique on one processor (MIPS R10000, 195 MHz) of an SGI O200. Then the sum of the overall number of iterations is given, which was needed for the CG treatment of the different problems arising in the combination method. Here the accuracy in the termination criterion of the CG solver was set to 10^{-10}. We see that our method gives testing correctness results of up to 95%. Furthermore we see that really large data sets can be handled without any problems. The execution time scales linearly with the number of data points, as predicted by the complexity analysis of the previous section. Finally we see that the run times are reasonable also for huge data sets.

          # of points   training correctness   testing correctness   time (sec)   # of iterations
level 1   100 000       94.7 %                 94.7 %                2.9          9
          1 million     94.7 %                 94.7 %                26           9
          2 million     94.7 %                 94.7 %                54           9
          5 million     94.7 %                 94.7 %                132          9
level 2   100 000       94.9 %                 94.9 %                5.3          40
          1 million     94.9 %                 94.9 %                50           41
          2 million     94.9 %                 94.9 %                103          41
          5 million     94.9 %                 94.9 %                253          41
level 3   100 000       94.9 %                 94.9 %                7.8          113
          1 million     94.9 %                 94.9 %                75           115
          2 million     94.9 %                 94.8 %                150          117
          5 million     94.9 %                 94.9 %                373          117
level 4   100 000       95.0 %                 94.9 %                10           277
          1 million     95.0 %                 94.9 %                99           283
          2 million     95.0 %                 94.9 %                199          285
          5 million     95.0 %                 94.9 %                496          287
level 5   100 000       95.0 %                 95.0 %                12           650
          1 million     95.0 %                 94.9 %                124          666
          2 million     95.0 %                 94.9 %                247          669
          5 million     95.0 %                 95.0 %                620          676

Table 4: Results for a 2D huge synthetic data set, λ = 0.01

3.2 6-dimensional problems

3.2.1 BUPA Liver

The BUPA Liver Disorders data set from the Irvine Machine Learning Database Repository [8] consists of 345 data points with 6 features plus a selector field used to split the data into 2 sets with 145 instances and 200 instances, respectively. Here we only have training data and therefore can only report our ten-fold cross-validation results. No comparison with unused test data is possible.

We compare with the two best results from [48], the therein introduced smoothed support vector machine (SSVM) and the feature selection concave minimization (FSV) algorithm [9]. The results are given in Table 5.

                                             sparse grid combination method
                         SSVM     FSV       Level 1     Level 2    Level 3     Level 4
                         [48]     [48]      λ = 0.02    λ = 0.1    λ = 0.007   λ = 0.0004
10-fold train. corr., %  70.37    68.18     76.00       77.49      84.28       90.27
10-fold test. corr., %   70.33    65.20     67.87       67.84      70.34       70.92

Table 5: Results for the BUPA liver disorders data set

Our sparse grid combination approach performs on level 4 with λ = 0.0004 at 70.92% 10-fold testing correctness, but our other results were also in this range. Our method performs here slightly better than the SSVM and clearly better than FSV. Note that the results for the robust linear program (RLP) algorithm [6], the support vector machine using the 1-norm approach (SVM ||.||_1) and the classical support vector machine (SVM ||.||_2^2) [9, 17, 70] were reported to be somewhat worse in [48].

3.2.2 Synthetic massive data set in 6D

Now we produced with DatGen [51] a 6-dimensional data set with 5 million training points and 20 000 points for testing. We used the call datgen -r1 -X0/100,R,O:0/100,R,O:0/100,R,O:0/100,R,O:0/200,R,O:0/200,R,O -R2 -C2/4 -D2/5 -T10/60 -O5020000 -p -e0.15. The results are given in Table 6. Already on level 1 a testing correctness of 88% was achieved, which is quite satisfying for this data. We see that really huge data sets of 5 million points could be handled. We also give the CPU time which is needed for the computation of the matrices G_l = B_l·B_l^T. Here, more than 96% of the computation time is spent on the matrix assembly. Again, the execution times scale linearly with the number of data points.

3.3 8-dimensional problem

The Pima Indians Diabetes data set from the Irvine Machine Learning Database Repository consists of 768 instances with 8 features plus a class label which splits the data into 2 sets with 500 instances and 268 instances, respectively, see [8].

          # of points   training correctness   testing correctness   total time (sec)   data matrix time (sec)   # of iterations
level 1   50 000        90.8 %                 90.8 %                158                152                      41
          500 000       90.7 %                 90.8 %                1570               1528                     44
          5 million     90.7 %                 90.7 %                15933              15514                    46
level 2   50 000        91.9 %                 91.5 %                1155               1126                     438
          500 000       91.5 %                 91.6 %                11219              11022                    466
          5 million     91.4 %                 91.5 %                112656             110772                   490

Table 6: Results for a 6D synthetic massive data set, λ = 0.01

Again we compare with the two best results from [48], the smoothed support vector machine (SSVM) and the robust linear program (RLP) algorithm [6]. The results are given in Table 7.

                                             sparse grid combination method
                         SSVM     RLP       Level 1      Level 2     Level 3
                         [48]     [48]      λ = 0.075    λ = 0.25    λ = 1.0
10-fold train. corr., %  78.11    76.48     83.94        88.51       93.29
10-fold test. corr., %   78.12    76.16     77.47        75.01       72.93

Table 7: Results for the Pima Indians diabetes data set

Now our sparse grid combination approach performs already on level 1 with λ = 0.075 at 77.47% 10-fold testing correctness. Here the data set seems to be simple enough that one level gives enough resolution for building the necessary class separators. The results on higher levels do not improve but even deteriorate. Our method performs here only slightly worse than the SSVM but clearly better than RLP. Note that the results for the feature selection concave minimization (FSV) algorithm [9], the support vector machine using the 1-norm approach (SVM ||.||_1) and the classical support vector machine (SVM ||.||_2^2) [9, 17, 70] were reported to be slightly worse in [48].

3.4 9-dimensional problems

We finally consider two problems in 9-dimensional space. Here, due to the limitation of the main memory of our computer, we are no longer able to fully store the system matrices. Instead we only assemble and store the matrices B_l, which is possible for the considered problems since M and thus the storage needed for B_l is sufficiently limited. The operations of the matrices C_l and G_l on the vectors are then computed on the fly when needed in the conjugate gradient iteration.

3.4.1 Shuttle data

The shuttle data set comes from the StatLog Project [52]. It consists of 43 500 observations in the training set and 14 500 data points in the testing set and has 9 attributes and 7 classes in
the original version. To simplify the problem we use the original class 1, approximately 80% of the data, as one class and the other 6 classes as class -1. The aim is to get correctness rates on the testing data of more than 99%. The resulting correctness rates are given in Table 8. We achieve extremely good correctness rates. But note that the full 58 000 data set can be classified with only 19 errors by a rather simple linear decision tree using nine terminal nodes [52]. This shows that the problem is very simple. Finally, we see that the number of iterations needed in the CG solvers grows as λ → 0. This indicates that the condition number of the respective linear systems gets worse with declining values of λ.

# of points   λ         training correctness   testing correctness   total time (sec)   # of iterations
435           0.1       93.8 %                 92.8 %                1573               269
              0.01      96.1 %                 96.2 %                3719               645
              0.001     99.5 %                 99.1 %                11221              1966
              0.0001    99.8 %                 99.6 %                34010              5891
              0.00001   100 %                  99.6 %                83400              14719
4350          0.1       94.0 %                 93.8 %                1849               298
              0.01      97.6 %                 97.4 %                4008               653
              0.001     99.6 %                 99.4 %                12574              2059
              0.0001    99.7 %                 99.8 %                38897              6382
              0.00001   99.7 %                 99.7 %                107698             17572
43500         0.1       94.4 %                 94.8 %                4019               324
              0.01      98.0 %                 98.0 %                7491               692
              0.001     99.5 %                 99.5 %                32587              2137
              0.0001    99.7 %                 99.6 %                74642              6776
              0.00001   99.8 %                 99.8 %                203227             19930

Table 8: Results for the shuttle data set, level 1 only

3.4.2 Tic-Tac-Toe

This data set comes from the UCI Machine Learning Repository [8]. The 958 instances encode the complete set of possible board configurations at the end of tic-tac-toe games. We did 10-fold cross-validation and got a training correctness of 100% and a testing correctness of 98.33% already on level 1 for λ = 0.1, 0.01, 0.001 and 0.0001. In [49], testing correctness rates around 70% are reported with ASVM. With LSVM [50] and a quadratic kernel, 100% training correctness and 93.33% testing correctness were achieved.

4 Conclusions

We presented the sparse grid combination technique for the classification of data in moderate-dimensional spaces. Our new method gave good results for a wide range of problems. It works nicely for problems with a small amount of data, like those of the UCI repository, but it is also capable of handling huge data sets with 5 million points and more.
4 Conclusions

We presented the sparse grid combination technique for the classification of data in moderate-dimensional spaces. Our new method gave good results for a wide range of problems. It works nicely for problems with a small amount of data, like those of the UCI repository, but it is also capable of handling huge data sets with 5 million points and more. The run time scales only linearly with the number of data points. This is an important property for many practical applications, where the dimension of the problem can often be reduced substantially by certain preprocessing steps but the number of data points can be extremely large. We believe that our sparse grid combination method possesses great potential in such practical application problems. In principle, by computing the matrix entries on the fly in the CG iteration, classification problems in up to 15 dimensions can be dealt with.

We demonstrated for the Ripley data set how the best value of the regularization parameter λ can be determined. This is also of practical relevance.

It is surely necessary to improve on the implementation of the method. Here, instead of grids of d-dimensional brick-type elements with d-linear finite element basis functions, grids of d-simplices might be used with linear finite elements. The storage and run time complexities would then depend only linearly on d (2d + 1 versus 3^d), which would speed things up substantially. A parallel version of the sparse grid combination technique would also reduce the run time of the method significantly. Note that our new method is easily parallelizable on a coarse grain level; a schematic sketch is given at the end of this section.

Note furthermore that our approach delivers a continuous classifier function which approximates the data. It can therefore be used without modification for regression problems as well. This is in contrast to many other methods such as decision trees. Also, more than two classes can be handled by using isolines with different values.

Finally, for reasons of simplicity, we used the operator P = ∇. But other differential operators (e.g. P = Δ) or pseudo-differential operators can be employed here with their associated regular finite element ansatz functions.
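The coarse grain parallelism mentioned above stems from the fact that the partial problems of the combination technique are independent of one another; only the final linear combination needs all partial solutions. The following sketch is schematic and rests on our own assumptions: hypothetical routines solve_on_grid and combine, a list of grid levels with their combination coefficients, and Python's standard process pool. It is not the implementation used in this paper.

    from concurrent.futures import ProcessPoolExecutor

    def combined_solution(grids, solve_on_grid, combine):
        # grids: list of (level_vector, coefficient) pairs of the combination technique
        # solve_on_grid: assumed routine returning the partial solution on one grid;
        #                it must be a module-level function so the process pool can call it
        # combine: assumed routine forming the weighted sum of the partial solutions
        levels = [l for l, _ in grids]
        with ProcessPoolExecutor() as pool:
            partial = list(pool.map(solve_on_grid, levels))  # independent solves in parallel
        return combine([(c, u) for (_, c), u in zip(grids, partial)])

Since each grid of the combination is refined only in some coordinate directions, the individual problems are much smaller than a full grid of the same resolution and distribute naturally over processors.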


References

[1] E. Arge, M. Dæhlen, and A. Tveito. Approximation of scattered data using smooth grid functions. J. Comput. Appl. Math., 59:191-205, 1995.
[2] A. Allen. Regression and the Moore-Penrose Pseudoinverse. Academic Press, New York, 1972.
[3] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.
[4] R. Balder. Adaptive Verfahren für elliptische und parabolische Differentialgleichungen auf dünnen Gittern. PhD thesis, Technische Universität München, 1994.
[5] G. Baszenski. N-th order polynomial spline blending. In W. Schempp and K. Zeller, editors, Multivariate Approximation Theory III, ISNM 75, pages 35-46. Birkhäuser, Basel, 1985.
[6] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23-34, 1992.
[7] M. Berry and G. Linoff. Mastering Data Mining. Wiley, 2000.
[8] C. L. Blake and C. J. Merz. UCI Repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html
[9] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In J. Shavlik, editor, Machine Learning Proceedings of the Fifteenth International Conference (ICML '98), pages 82-90. Morgan Kaufmann, 1998.
[10] H.-J. Bungartz. Dünne Gitter und deren Anwendung bei der adaptiven Lösung der dreidimensionalen Poisson-Gleichung. Dissertation, Institut für Informatik, TU München, 1992.
[11] H.-J. Bungartz. An adaptive Poisson solver using hierarchical bases and sparse grids. In P. de Groen and R. Beauwens, editors, Iterative Methods in Linear Algebra, pages 293-310. Elsevier, Amsterdam, 1992.
[12] H.-J. Bungartz and T. Dornseifer. Sparse grids: Recent developments for elliptic partial differential equations. In W. Hackbusch et al., editors, Multigrid Methods V, volume 3 of Lecture Notes in Computational Science and Engineering, pages 45-70. Springer, 1998.
[13] H.-J. Bungartz, T. Dornseifer, and C. Zenger. Tensor product approximation spaces for the efficient numerical solution of partial differential equations. To appear in Proc. Int. Workshop on Scientific Computations, Konya, 1996. Nova Science Publishers, 1997.
[14] H.-J. Bungartz and M. Griebel. A note on the complexity of solving Poisson's equation for spaces of bounded mixed derivatives. J. Complexity, 15:167-199, 1999.
[15] H.-J. Bungartz, M. Griebel, D. Röschke, and C. Zenger. Pointwise convergence of the combination technique for Laplace's equation. East-West J. of Numerical Mathematics, 1(2):21-45, 1994.
[16] H.-J. Bungartz, M. Griebel, and U. Rüde. Extrapolation, combination, and sparse grid techniques for elliptic boundary value problems. Comput. Methods Appl. Mech. Eng., 116:243-252, 1994.
[17] V. Cherkassky and F. Mulier. Learning from Data - Concepts, Theory and Methods. John Wiley & Sons, 1998.
[18] K. Cios, W. Pedrycz, and R. Swiniarski. Data Mining Methods for Knowledge Discovery. Kluwer, 1998.
[19] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1-50, 2000.
[20] G. Faber. Über stetige Funktionen. Mathematische Annalen, 66:81-94, 1909.
[21] S. E. Fahlmann and C. Lebiere. The cascade-correlation learning architecture. In Touretzky, editor, Advances in Neural Information Processing Systems 2. Morgan-Kaufmann, 1990.
[22] K. Frank, S. Heinrich, and S. Pereverzev. Information complexity of multivariate Fredholm integral equations in Sobolev classes. J. of Complexity, 12:17-34, 1996.
[23] G. Fung and O. L. Mangasarian. Data selection for support vector machine classifiers. Technical Report 00-02, Data Mining Institute, Computer Sciences Department, University of Wisconsin, 2000. KDD-2000, Boston, August 20-23, 2000.
[24] J. Garcke. Classification with sparse grids, 2000. http://wissrech.iam.uni-bonn.de/people/garcke/SparseMining.html.
[25] J. Garcke and M. Griebel. On the computation of the eigenproblems of hydrogen and helium in strong magnetic and electric fields with the sparse grid combination technique. Journal of Computational Physics, submitted, 2000. Also as SFB 256 Preprint 670, Universität Bonn.
[26] T. Gerstner and M. Griebel. Numerical integration using sparse grids. Numer. Algorithms, 18:209-232, 1998.
[27] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455-1480, 1998.
[28] F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: from regularization to radial, tensor and additive splines. A.I. Memo No. 1430, Artificial Intelligence Laboratory, MIT, 1993.
[29] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219-265, 1995.
[30] G. Golub, M. Heath, and G. Wahba. Generalized cross validation as a method for choosing a good ridge parameter. Technometrics, 21:215-224, 1979.
[31] M. Griebel. A parallelizable and vectorizable multi-level algorithm on sparse grids. In W. Hackbusch, editor, Parallel Algorithms for Partial Differential Equations, Notes on Numerical Fluid Mechanics 31, pages 94-100. Vieweg, Braunschweig/Wiesbaden, 1991.
[32] M. Griebel. The combination technique for the sparse grid solution of PDEs on multiprocessor machines. Parallel Processing Letters, 2(1):61-70, 1992.
[33] M. Griebel. Adaptive sparse grid multilevel methods for elliptic PDEs based on finite differences. Computing, 61:151-179, 1998.


[34] M. Griebel. Multilevelmethoden als Iterationsverfahren über Erzeugendensystemen. Teubner Skripten zur Numerik, Teubner, Stuttgart, 1994.
[35] M. Griebel, W. Huber, U. Rüde, and T. Störtkuhl. The combination technique for parallel sparse-grid-preconditioning and -solution of PDEs on multiprocessor machines and workstation networks. In L. Bouge, M. Cosnard, Y. Robert, and D. Trystram, editors, Parallel Processing: CONPAR92-VAPP V, Lecture Notes in Computer Science 634, pages 217-228. Springer Verlag, 1992.
[36] M. Griebel, W. Huber, T. Störtkuhl, and C. Zenger. On the parallel solution of 3D PDEs on a network of workstations and on vector computers. In A. Bode and M. D. Cin, editors, Parallel Computer Architectures: Theory, Hardware, Software, Applications, volume 732 of Lecture Notes in Computer Science, pages 276-291. Springer Verlag, 1993.
[37] M. Griebel and S. Knapek. Optimized tensor-product approximation spaces. Constructive Approximation, to appear.
[38] M. Griebel and P. Oswald. Tensor product type subspace splittings and multilevel iterative methods for anisotropic problems. Advances in Computational Mathematics, 4:171-206, 1995.
[39] M. Griebel, P. Oswald, and T. Schiekofer. Sparse grids for boundary integral equations. Numer. Mathematik, 83(2):279-312, 1999.
[40] M. Griebel, M. Schneider, and C. Zenger. A combination technique for the solution of sparse grid problems. In P. de Groen and R. Beauwens, editors, Iterative Methods in Linear Algebra, pages 263-281. Elsevier, Amsterdam, 1992.
[41] M. Griebel and V. Thurner. The efficient solution of fluid dynamics problems by the combination technique. Int. J. Num. Meth. for Heat and Fluid Flow, 5(3):251-269, 1995.
[42] M. Griebel, C. Zenger, and S. Zimmer. Multilevel Gauß-Seidel-algorithms for full and sparse grid problems. Computing, 50:127-148, 1995.
[43] M. Hegland, O. M. Nielsen, and Z. Shen. High dimensional smoothing based on multilevel analysis. Technical report, Data Mining Group, The Australian National University, Canberra, November 2000.
[44] J. Hoschek and D. Lasser. Grundlagen der geometrischen Datenverarbeitung, Kapitel 9. Teubner, 1992.
[45] Tin Kam Ho and Eugene M. Kleinberg. Checkerboard dataset, 1996. http://www.cs.wisc.edu/math-prog/mpml.html.
[46] L. Kaufman. Solving the quadratic programming problem arising in support vector classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 146-167. MIT Press, 1999.
[47] S. Knapek. Approximation und Kompression mit Tensorprodukt-Multiskalen-Approximationsräumen. PhD thesis, Universität Bonn, Institut für Angewandte Mathematik, 2000.
[48] Y. J. Lee and O. L. Mangasarian. SSVM: A smooth support vector machine for classification. Technical Report 99-03, Data Mining Institute, University of Wisconsin, September 1999. Computational Optimization and Applications, to appear.
[49] O. L. Mangasarian and D. R. Musicant. Active set support vector machine classification. Technical Report 00-04, Data Mining Institute, University of Wisconsin, April 2000.
[50] O. L. Mangasarian and D. R. Musicant. Lagrangian support vector machines. Technical Report 00-06, Data Mining Institute, University of Wisconsin, June 2000.
[51] G. Melli. Datgen: A program that creates structured data. www.datasetgenerator.com.
[52] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
[53] M. A. Olshanskii and A. Reusken. On the convergence of a multigrid method for linear reaction-diffusion problems. Bericht Nr. 192, Institut für Geometrie und Praktische Mathematik, RWTH Aachen, 2000.
[54] P. Oswald. Two remarks on multilevel preconditioners. Forschungsergebnisse Math-1-91, FSU Jena, 1991.


[55] W. D. Penny and S. J. Roberts. Bayesian neural networks for classification: how useful is the evidence framework? Neural Networks, 12:877-892, 1999.
[56] C. Pflaum and A. Zhou. Error analysis of the combination technique. Numer. Math., 1999. To appear.
[57] B. Ripley. Neural networks and related methods for classification. Journal of the Royal Statistical Society B, 56:409-456, 1994.
[58] S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317-327, 1997.
[59] T. Schiekofer. Die Methode der Finiten Differenzen auf dünnen Gittern zur Lösung elliptischer und parabolischer partieller Differentialgleichungen. Dissertation, Institut für Angewandte Mathematik, Universität Bonn, 1998.
[60] W. Sickel and F. Sprengel. Interpolation on sparse grids and Nikol'skij-Besov spaces of dominating mixed smoothness. J. Comput. Anal. Appl., 1:263-288, 1999.
[61] S. Singh. 2D spiral pattern recognition with possibilistic measures. Pattern Recognition Letters, 19(2):141-147, 1998.
[62] A. Smola, B. Schölkopf, and K. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998.
[63] S. A. Smolyak. Quadrature and interpolation formulas for tensor products of certain classes of functions. Dokl. Akad. Nauk SSSR, 4:240-243, 1963.
[64] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36:111-147, 1974.
[65] K. Stüben. Algebraic Multigrid (AMG): An introduction with applications. GMD Report 53, GMD, 1999.
[66] V. N. Temlyakov. Approximation of functions with bounded mixed derivative. Proceedings of the Steklov Institute of Mathematics, 1, 1989.
[67] A. N. Tikhonov and V. A. Arsenin. Solutions of Ill-Posed Problems. W. H. Winston, Washington D.C., 1977.
[68] F. Utreras. Cross-validation techniques for smoothing spline functions in one or two dimensions. In T. Gasser and M. Rosenblatt, editors, Smoothing Techniques for Curve Estimation, pages 196-231. Springer-Verlag, Heidelberg, 1979.
[69] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin, 1982.
[70] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[71] G. Wahba. A comparison of GCV and GML for choosing the smoothing parameter in the generalized splines smoothing problem. The Annals of Statistics, 13:1378-1402, 1985.
[72] G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.
[73] H. Yserentant. On the multi-level splitting of finite element spaces. Numer. Math., 49:379-412, 1986.
[74] H. Yserentant. Hierarchical bases. In R. E. O'Malley, Jr. et al., editors, Proc. ICIAM'91, Washington, 1991. SIAM, Philadelphia, 1992.
[75] C. Zenger. Sparse grids. In W. Hackbusch, editor, Parallel Algorithms for Partial Differential Equations, Notes on Num. Fluid Mech. 31. Vieweg, Braunschweig, 1991.