
Optimization Methods and Software, 2002, Vol. 17, No. 5, pp. 913–929. DOI: 10.1080/1055678021000028375

A FINITE NEWTON METHOD FOR CLASSIFICATION

O.L. MANGASARIAN*

Computer Sciences Department, University of Wisconsin, Madison, WI 53706, USA

*E-mail: [email protected]

(Received 10 April 2002; In final form 6 August 2002)

A fundamental classification problem of data mining and machine learning is that of minimizing a strongly convex, piecewise quadratic function on the $n$-dimensional real space $R^n$. We show finite termination of a Newton method at the unique global solution starting from any point in $R^n$. If the function is well conditioned, then no stepsize is required from the start; if not, an Armijo stepsize is used. In either case, the algorithm finds the unique global minimum solution in a finite number of iterations.

Keywords: Newton method; Classification; Data mining; Support vector machines

1 INTRODUCTION

This article establishes finite termination of a Newton method for minimizing a strongly convex, piecewise quadratic function on the $n$-dimensional real space $R^n$. Such a problem is fundamental in generating a linear or nonlinear kernel classifier for data mining and machine learning [2,5,10,11,17–20,25]. This work is motivated by [11], where a smoothed version of the present algorithm was shown to converge globally and quadratically, but was observed to terminate in a few steps even when the smoothing and stepsize (line search) were removed. Another motivating work is [8], where finite termination of a Newton method was established for the least 2-norm solution of a linear program. In fact, our algorithm is similar to that of [8], but has a global finite termination property without a stepsize under a well conditioning property. This well conditioning property, (26) below, which is sufficient for finite termination without a stepsize, does not appear to be necessary for the wide range of classification problems tested in [11].

Other related work using a finite Newton method to solve quadratic programs with bound constraints appears in [13–16]. In [12,24], finite Newton methods with exact line search are given for minimizing strongly convex differentiable piecewise quadratic functions. Our approach differs from this previous work in its finite termination property with or without an Armijo stepsize, in the analysis we present, and in the application to support vector machine classification.

We outline now the contents of the paper. In Section 2, we describe the linear and nonlinear classification problems leading to a piecewise quadratic strongly convex minimization problem. In Section 3, we establish finite global termination of a Newton algorithm without a stepsize but under a well conditioning property. In Section 4, we remove the well conditioning assumption but add the Armijo stepsize and again obtain finite global termination at the unique solution. In Section 5, we briefly discuss some previous numerical results with the proposed method. Section 6 concludes the article.

A word about our notation. All vectors will be column vectors unless transposed to a row vector by a prime superscript $'$. For a vector $x$ in the $n$-dimensional real space $R^n$, the plus function $x_+$ is defined as $(x_+)_i = \max\{0, x_i\}$, $i = 1, \ldots, n$, while $x_*$ denotes the subgradient of $x_+$, which is the step function defined as $(x_*)_i = 1$ if $x_i > 0$, $(x_*)_i = 0$ if $x_i < 0$, and $(x_*)_i \in [0, 1]$ if $x_i = 0$, $i = 1, \ldots, n$. The scalar (inner) product of two vectors $x$ and $y$ in $R^n$ will be denoted by $x'y$ and the 2-norm of $x$ will be denoted by $\|x\|$. For a matrix $A \in R^{m \times n}$, $A_i$ is the $i$th row of $A$, which is a row vector in $R^n$, and $\|A\|$ is the 2-norm of $A$: $\max_{\|x\|=1} \|Ax\|$. A column vector of ones of arbitrary dimension will be denoted by $e$. For $A \in R^{m \times n}$ and $B \in R^{n \times l}$, the kernel $K(A, B)$ [2,17,25] is an arbitrary function which maps $R^{m \times n} \times R^{n \times l}$ into $R^{m \times l}$. In particular, if $x$ and $y$ are column vectors in $R^n$, then $K(x', y)$ is a real number, $K(x', A')$ is a row vector in $R^m$ and $K(A, A')$ is an $m \times m$ matrix. If $f$ is a real valued function defined on $R^n$, the gradient of $f$ at $x$ is denoted by $\nabla f(x)$, which is a column vector in $R^n$, and the $n \times n$ Hessian matrix of second partial derivatives of $f$ at $x$ is denoted by $\nabla^2 f(x)$. The convex hull of a set $S$ is denoted by $\mathrm{co}\{S\}$. The identity matrix of arbitrary order will be denoted by $I$.
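The plus function $x_+$ and its step-function subgradient $x_*$ are the only nonsmooth ingredients used below. As a minimal NumPy illustration (our own sketch, not part of the paper; the tie value 0.5 is just one admissible choice from $[0, 1]$):

```python
import numpy as np

def plus(x):
    """Componentwise plus function: (x_+)_i = max{0, x_i}."""
    return np.maximum(x, 0.0)

def step(x, tie=0.5):
    """A subgradient x_* of the plus function: 1 where x_i > 0, 0 where
    x_i < 0, and an arbitrary fixed value in [0, 1] (here `tie`) where x_i = 0."""
    s = (x > 0).astype(float)
    s[x == 0.0] = tie
    return s
```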

2 LINEAR AND NONLINEAR KERNEL CLASSIFICATION

We describe in this section the fundamental classification problems that lead to minimizing a piecewise quadratic strongly convex function. We consider the problem of classifying $m$ points in the $n$-dimensional real space $R^n$, represented by the $m \times n$ matrix $A$, according to membership of each point $A_i$ in the classes $+1$ or $-1$ as specified by a given $m \times m$ diagonal matrix $D$ with ones or minus ones along its diagonal. For this problem, the standard support vector machine with a linear kernel $AA'$ [2,25] is given by the following quadratic program for some $\nu > 0$:

$$\min_{(w,\gamma,y)\in R^{n+1+m}} \ \nu e'y + \frac{1}{2}w'w \quad \text{s.t.}\quad D(Aw - e\gamma) + y \ge e,\quad y \ge 0. \tag{1}$$

As depicted in Fig. 1, $w$ is the normal to the bounding planes:

$$x'w - \gamma = +1, \qquad x'w - \gamma = -1, \tag{2}$$

and $\gamma$ determines their location relative to the origin. The first plane above bounds the class $+1$ points and the second plane bounds the class $-1$ points when the two classes are strictly linearly separable, that is, when the slack variable $y = 0$. The linear separating surface is the plane

$$x'w = \gamma, \tag{3}$$

midway between the bounding planes (2). If the classes are linearly inseparable, then the two planes bound the two classes with a "soft margin" determined by a nonnegative slack variable $y$, that is:

$$\begin{aligned} x'w - \gamma + y_i &\ge +1, \quad \text{for } x' = A_i \text{ and } D_{ii} = +1, \\ x'w - \gamma - y_i &\le -1, \quad \text{for } x' = A_i \text{ and } D_{ii} = -1. \end{aligned} \tag{4}$$


The 1-norm of the slack variable $y$ is minimized with weight $\nu$ in (1). The quadratic term in (1), which is twice the reciprocal of the square of the 2-norm distance $2/\|w\|$ between the two bounding planes of (2) in the $n$-dimensional space of $w \in R^n$ for a fixed $\gamma$, maximizes that distance, often called the "margin". Figure 1 depicts the points represented by $A$, the bounding planes (2) with margin $2/\|w\|$, and the separating plane (3) which separates $A+$, the points represented by rows of $A$ with $D_{ii} = +1$, from $A-$, the points represented by rows of $A$ with $D_{ii} = -1$.

In many essentially equivalent formulations of the classification problem [4,5,10,11], the square of the 2-norm of the slack variable $y$ is minimized with weight $\nu/2$ instead of the 1-norm of $y$ as in (1). In addition, the distance between the planes (2) is measured in the $(n+1)$-dimensional space of $(w, \gamma) \in R^{n+1}$, that is, $2/\|(w, \gamma)\|$. Measuring the margin in this $(n+1)$-dimensional space instead of $R^n$ induces strong convexity and has little or no effect on the problem, as was shown in [18]. Thus, using twice the reciprocal squared of the margin instead yields our modified SVM problem as follows:

$$\min_{(w,\gamma,y)\in R^{n+1+m}} \ \frac{\nu}{2}y'y + \frac{1}{2}(w'w + \gamma^2) \quad \text{s.t.}\quad D(Aw - e\gamma) + y \ge e,\quad y \ge 0. \tag{5}$$

FIGURE 1 The bounding planes (2) with margin $2/\|w\|$, and the plane (3) separating $A+$, the points represented by rows of $A$ with $D_{ii} = +1$, from $A-$, the points represented by rows of $A$ with $D_{ii} = -1$.


It has been shown computationally [20] that this reformulation (5) of the conventional support vector machine formulation (1) yields similar results to (1). At a solution of problem (5), $y$ is given by

$$y = (e - D(Aw - e\gamma))_+, \tag{6}$$

where, as defined in the Introduction, $(\cdot)_+$ replaces negative components of a vector by zeros. Thus, we can replace $y$ in (5) by $(e - D(Aw - e\gamma))_+$ and convert the SVM problem (5) into an equivalent SVM which is an unconstrained optimization problem as follows:

$$\min_{(w,\gamma)\in R^{n+1}} \ \frac{\nu}{2}\|(e - D(Aw - e\gamma))_+\|^2 + \frac{1}{2}(w'w + \gamma^2). \tag{7}$$

This problem is a strongly convex piecewise quadratic minimization problem. Note, however, that its objective function is not twice differentiable, which precludes the use of a regular Newton method. In [11], we smoothed this problem and applied a fast Newton method to solve it. Other methods, such as Newton methods for nonsmooth functions with exact line searches [13,24], can also solve this problem. Problem (7) is one of the nonsmooth problems for which we shall provide a direct finite Newton method. The other nonsmooth problem that we will treat is the nonlinear kernel problem, (12) below, which generates a nonlinear classifier as described below.
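To make the objective concrete, here is a short NumPy sketch of (7); the identifiers (`A` for the data matrix, `d` for the $\pm 1$ diagonal of $D$, `nu` for $\nu$) are our own assumptions, not notation fixed by the paper:

```python
import numpy as np

def linear_svm_objective(w, gamma, A, d, nu):
    """Objective (7): (nu/2)*||(e - D(Aw - e*gamma))_+||^2
    + (1/2)*(w'w + gamma^2), with D = diag(d) and e a vector of ones."""
    r = 1.0 - d * (A @ w - gamma)          # e - D(Aw - e*gamma)
    return 0.5 * nu * np.sum(np.maximum(r, 0.0) ** 2) \
         + 0.5 * (w @ w + gamma ** 2)
```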

We now describe how the generalized support vector machine (GSVM) [17] generates a nonlinear separating surface by using a completely arbitrary kernel. The GSVM solves the following mathematical program for a general kernel $K(A, A')$, defined in the Introduction:

$$\min_{(u,\gamma,y)\in R^{2m+1}} \ \nu e'y + f(u) \quad \text{s.t.}\quad D(K(A, A')Du - e\gamma) + y \ge e,\quad y \ge 0. \tag{8}$$

Here $f(u)$ is some convex function on $R^m$ which suppresses the parameter $u$, and $\nu$ is some positive number that weights the classification error $e'y$ versus the suppression of $u$. A solution of this mathematical program for $u$ and $\gamma$ leads to the nonlinear separating surface

$$K(x', A')Du = \gamma. \tag{9}$$


The linear formulation (1) of Section 2 is obtained if we let $K(A, A') = AA'$, $w = A'Du$ and $f(u) = (1/2)u'DAA'Du$. We now use a different classification objective which not only suppresses the parameter $u$ but also suppresses $\gamma$ in our nonlinear formulation:

$$\min_{(u,\gamma,y)\in R^{2m+1}} \ \frac{\nu}{2}y'y + \frac{1}{2}(u'u + \gamma^2) \quad \text{s.t.}\quad D(K(A, A')Du - e\gamma) + y \ge e,\quad y \ge 0. \tag{10}$$

At a solution of (10), $y$ is given by

$$y = (e - D(K(A, A')Du - e\gamma))_+, \tag{11}$$

where, as defined earlier, $(\cdot)_+$ replaces negative components of a vector by zeros. Thus, we can replace $y$ in (10) by $(e - D(K(A, A')Du - e\gamma))_+$ and convert the SVM problem (10) into an equivalent SVM which is an unconstrained optimization problem as follows:

$$\min_{(u,\gamma)\in R^{m+1}} \ \frac{\nu}{2}\|(e - D(K(A, A')Du - e\gamma))_+\|^2 + \frac{1}{2}(u'u + \gamma^2). \tag{12}$$

Again, as in (7), this problem is a strongly convex unconstrained minimization problem which has a unique solution, but its objective function is not twice differentiable. It is for this problem, which generates the nonlinear separating surface (9), and for problem (7), which generates the linear separating surface (3), that we shall develop the finite Newton methods described in the next two sections of the article.
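Since the kernel $K(A, A')$ is completely arbitrary, any concrete implementation must fix one. The following sketch of (12) uses a Gaussian kernel, a common choice in this literature; the kernel parameter `mu` and all identifiers are our own assumptions:

```python
import numpy as np

def gaussian_kernel(A, B, mu=1.0):
    """One possible kernel: K(A, B')_{ij} = exp(-mu * ||A_i - B_j||^2)."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T)
    return np.exp(-mu * sq)

def nonlinear_svm_objective(u, gamma, A, d, nu, mu=1.0):
    """Objective (12): (nu/2)*||(e - D(K(A,A')Du - e*gamma))_+||^2
    + (1/2)*(u'u + gamma^2), with D = diag(d)."""
    K = gaussian_kernel(A, A, mu)
    r = 1.0 - d * (K @ (d * u) - gamma)    # e - D(K(A,A')Du - e*gamma)
    return 0.5 * nu * np.sum(np.maximum(r, 0.0) ** 2) \
         + 0.5 * (u @ u + gamma ** 2)
```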

3 FINITE STEPLESS NEWTON METHOD

We shall consider in the rest of the article the following piecewise quadratic strongly convex problem, which subsumes both problems (7) and (12):

$$\min_{z \in R^p} \ f(z) := \frac{\nu}{2}\|(Cz - h)_+\|^2 + \frac{1}{2}z'z, \tag{13}$$


where $C \in R^{m \times p}$, $h \in R^m$, and $\nu$ is a fixed positive parameter. We note immediately that, since $\|r_+ - t_+\| \le \|r - t\|$ for $r, t \in R^m$, the gradient of $f$:

$$\nabla f(z) = \nu C'(Cz - h)_+ + z, \tag{14}$$

is globally Lipschitz continuous with constant $K$ as follows:

$$\|\nabla f(s) - \nabla f(z)\| \le K\|s - z\|, \ \forall s, z \in R^p, \quad \text{where } K = \nu\|C'\|\|C\| + 1. \tag{15}$$
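In code, the gradient (14) and the Lipschitz constant $K$ of (15) are straightforward; a sketch under our earlier naming conventions, with $\|C\|$ computed as the largest singular value (so $\|C'\| = \|C\|$):

```python
import numpy as np

def grad_f(z, C, h, nu):
    """Gradient (14): nu * C'(Cz - h)_+ + z."""
    return nu * C.T @ np.maximum(C @ z - h, 0.0) + z

def lipschitz_K(C, nu):
    """Constant K = nu*||C'||*||C|| + 1 of (15)."""
    normC = np.linalg.norm(C, 2)   # spectral (2-) norm of C
    return nu * normC * normC + 1.0
```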

The ordinary Hessian of $f$ does not exist everywhere. However, since $\nabla f(z)$ is Lipschitzian, $f$ belongs to the class $LC^1$ of functions with locally Lipschitzian gradients, and a generalized Hessian exists everywhere [7]. $LC^1$ functions were introduced in [7] and used in subsequent papers such as [3,9,22,23]. The generalized Hessian is defined as follows [3,7]:

DEFINITION 1 (Generalized Hessian) Let $f: R^p \rightarrow R$ have a Lipschitz continuous gradient on $R^p$. The generalized Hessian of $f$ at $z$ is the set $\partial^2 f(z)$ of $p \times p$ matrices defined as [3,7]:

$$\partial^2 f(z) := \mathrm{co}\{H \in R^{p \times p} \mid \exists z^k \rightarrow z \text{ such that } \nabla f \text{ is differentiable at } z^k \text{ and } \nabla^2 f(z^k) \rightarrow H\}. \tag{16}$$

Furthermore, a generalization of the mean value theorem for the Lipschitzian gradient $\nabla f(z)$ holds [3,6]:

PROPOSITION 2 (Generalized Mean Value Theorem) Let $f: R^p \rightarrow R$ have a Lipschitz continuous gradient on $R^p$. Then for $z, s \in R^p$:

$$\nabla f(z) = \nabla f(s) + \sum_{j=1}^{p} t_j H^j(z - s), \tag{17}$$

where $H^j \in \partial^2 f(y^j)$ for some $y^j \in (s, z)$, $t_j \ge 0$, $j = 1, \ldots, p$, and $\sum_{j=1}^{p} t_j = 1$.


We shall also need the following lemma [21, 8.1.5], which gives a quadratic bound on a linear Taylor expansion:

LEMMA 3 (Quadratic Bound Lemma) Let $f: R^p \rightarrow R$ have a Lipschitz continuous gradient on $R^p$ with constant $K$. Then for $z, s \in R^p$:

$$|f(z) - f(s) - \nabla f(s)'(z - s)| \le \frac{K}{2}\|z - s\|^2. \tag{18}$$

We summarize now properties of the function $f(z)$ in the following lemma and omit the straightforward proof of these properties.

LEMMA 4 (Properties of $f(z)$) The function $f(z)$ defined by (13) has the following properties for all $s, z \in R^p$:

(i) $f(z)$ is strongly convex with constant $k = 1$:

$$(\nabla f(s) - \nabla f(z))'(s - z) \ge k\|s - z\|^2 = \|s - z\|^2. \tag{19}$$

(ii) $\nabla f(z)$ is Lipschitz continuous with constant $K$:

$$\|\nabla f(s) - \nabla f(z)\| \le K\|s - z\|, \quad \text{where } K = \nu\|C'\|\|C\| + 1. \tag{20}$$

(iii) The generalized Hessian of $f(z)$ [7] is:

$$\partial^2 f(z) = \nu C'\,\mathrm{diag}(Cz - h)_*\,C + I, \tag{21}$$

where $\mathrm{diag}(Cz - h)_*$ denotes the $m \times m$ diagonal matrix whose $j$th diagonal entry is the subgradient of the step function $(\cdot)_+$ as follows:

$$(\mathrm{diag}(Cz - h)_*)_{jj} \begin{cases} = 1 & \text{if } C_j z - h_j > 0, \\ \in [0, 1] & \text{if } C_j z - h_j = 0, \\ = 0 & \text{if } C_j z - h_j < 0, \end{cases} \qquad j = 1, \ldots, m. \tag{22}$$


(iv)

$$K\|s\|^2 \ge s'(\nu C'C + I)s \ge s'\partial^2 f(z)s \ge \|s\|^2, \quad \text{where } K = \nu\|C'\|\|C\| + 1. \tag{23}$$

(v)

$$\frac{1}{K}\|s\|^2 \le s'\partial^2 f(z)^{-1}s \le \|s\|^2. \tag{24}$$

(vi)

$$\|\partial^2 f(s) - \partial^2 f(z)\| \le \nu\|C'\|\|C\|. \tag{25}$$

We note that the inverse $\partial^2 f(z)^{-1}$ appearing in (24) and elsewhere refers to the inverse of $\partial^2 f(z)$ defined by (21)–(22) for an arbitrary but specific value of $(\mathrm{diag}(Cz - h)_*)_{jj}$, $j = 1, \ldots, m$, in the interval $[0, 1]$ when $C_j z - h_j = 0$.
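A sketch of one element of the generalized Hessian (21)–(22), in our earlier hypothetical notation; the value on the tie set $C_j z - h_j = 0$ (here 1.0) is arbitrary but fixed, as the note above requires:

```python
import numpy as np

def gen_hessian(z, C, h, nu, tie=1.0):
    """An element of (21): nu * C' diag((Cz - h)_*) C + I, with the
    diagonal entries (22) set to `tie` wherever C_j z - h_j = 0."""
    r = C @ z - h
    d = (r > 0).astype(float)
    d[r == 0.0] = tie
    return nu * (C.T * d) @ C + np.eye(C.shape[1])   # (C.T * d) = C' diag(d)
```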

We are ready now to state and establish global finite termination of a Newton algorithm without a stepsize, starting from any point.

ALGORITHM 5 (Stepless Newton Algorithm for (13)) Start with any $z^0 \in R^p$. For $i = 0, 1, \ldots$:

(i) $z^{i+1} = z^i - \partial^2 f(z^i)^{-1}\nabla f(z^i)$.

(ii) Stop if $\nabla f(z^{i+1}) = 0$.

(iii) $i = i + 1$. Go to (i).
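Algorithm 5 is only a few lines in NumPy. The sketch below reuses the hypothetical helpers `grad_f` and `gen_hessian` from above; the tolerance and iteration cap are our own numerical safeguards, not part of the algorithm, which in exact arithmetic stops when the gradient vanishes:

```python
import numpy as np

def stepless_newton(z0, C, h, nu, tol=1e-10, max_iter=50):
    """Algorithm 5: z^{i+1} = z^i - inv(d2f(z^i)) grad f(z^i)."""
    z = z0.copy()
    for _ in range(max_iter):
        g = grad_f(z, C, h, nu)
        if np.linalg.norm(g) <= tol:   # Step (ii): gradient vanishes
            break
        H = gen_hessian(z, C, h, nu)   # nonsingular: H - I is PSD by (23)
        z = z - np.linalg.solve(H, g)  # Step (i): full Newton step
    return z
```

For the linear classifier (7), one would call this with $z = (w, \gamma)$, $C = [-DA \;\; De]$ and $h = -e$, since then $Cz - h = e - D(Aw - e\gamma)$.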

THEOREM 6 (Finite Termination of Stepless Newton) Let $f(z)$ be well conditioned, that is:

$$\frac{K}{k} = \frac{\nu\|C'\|\|C\| + 1}{1} < 2, \quad \text{i.e., } \nu\|C'\|\|C\| < 1. \tag{26}$$

(i) The sequence $\{z^i\}$ of Algorithm 5 terminates at the global minimum solution $\bar{z}$ of (13).

(ii) The error decreases linearly at each step as follows:

$$\|z^{i+1} - \bar{z}\| \le \nu\|C'\|\|C\|\,\|z^i - \bar{z}\|. \tag{27}$$
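The well conditioning test (26) is inexpensive to check before choosing between the stepless and the Armijo variant of Section 4; a one-line sketch in our notation:

```python
import numpy as np

def well_conditioned(C, nu):
    """Condition (26): nu * ||C'|| * ||C|| < 1."""
    normC = np.linalg.norm(C, 2)
    return nu * normC * normC < 1.0
```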


Proof

(i) We first establish convergence and then finite termination of the sequence $\{z^i\}$. By the Quadratic Bound Lemma (18):

$$\begin{aligned} f(z^i) - f(z^{i+1}) &\ge \nabla f(z^i)'\partial^2 f(z^i)^{-1}\nabla f(z^i) - \frac{K}{2}\nabla f(z^i)'\partial^2 f(z^i)^{-2}\nabla f(z^i) \\ &= \nabla f(z^i)'\left(\partial^2 f(z^i)^{-1} - \frac{K}{2}\partial^2 f(z^i)^{-2}\right)\nabla f(z^i) \\ &= \frac{K}{2}\nabla f(z^i)'\partial^2 f(z^i)^{-1/2}\left(\frac{2I}{K} - \partial^2 f(z^i)^{-1}\right)\partial^2 f(z^i)^{-1/2}\nabla f(z^i) \\ &\ge \frac{K}{2}\cdot\frac{1 - \nu\|C'\|\|C\|}{1 + \nu\|C'\|\|C\|}\,\nabla f(z^i)'\partial^2 f(z^i)^{-1}\nabla f(z^i) \\ &\ge \frac{1}{2}\cdot\frac{1 - \nu\|C'\|\|C\|}{1 + \nu\|C'\|\|C\|}\,\|\nabla f(z^i)\|^2 \ge 0, \end{aligned} \tag{28}$$

where the first inequality above follows from the Quadratic Bound Lemma (18), the next to the last inequality follows from the well conditioned assumption (26) and (24), and the last inequality from (24).

By (28) and the strong convexity of $f(z)$ we have that:

$$0 \ge f(z^i) - f(z^0) \ge \nabla f(z^0)'(z^i - z^0) + \frac{1}{2}\|z^i - z^0\|^2 \ge -\|\nabla f(z^0)\|\|z^i - z^0\| + \frac{1}{2}\|z^i - z^0\|^2. \tag{29}$$

Hence,

$$\|z^i - z^0\| \le 2\|\nabla f(z^0)\|, \quad i = 1, 2, \ldots, \tag{30}$$

and consequently the bounded sequence $\{z^i\}$ has an accumulation point $\bar{z}$ such that $\lim_{j \rightarrow \infty} z^{i_j} = \bar{z}$. Since $\{f(z^i)\}$ is nonincreasing by (28) and bounded below by $\min_{z \in R^p} f(z)$, it converges and $\lim_{i \rightarrow \infty} f(z^i) = f(\bar{z})$. Thus:

$$0 = \lim_{j \rightarrow \infty}(f(z^{i_j}) - f(z^{i_j+1})) \ge \frac{1}{2}\cdot\frac{1 - \nu\|C'\|\|C\|}{1 + \nu\|C'\|\|C\|}\lim_{j \rightarrow \infty}\|\nabla f(z^{i_j})\|^2 \ge 0. \tag{31}$$


Hence $\lim_{j \rightarrow \infty}\|\nabla f(z^{i_j})\| = \|\nabla f(\bar{z})\| = 0$. Since all accumulation points are stationary, they must all equal the unique minimizer of $f(z)$ on $R^p$. It follows that the whole sequence $\{z^i\}$ must converge to the unique $\bar{z}$ such that $\nabla f(\bar{z}) = 0$. We now show finite termination by using an argument similar to that of [8].

Our Newton iteration is:

$$\nu C'(Cz^i - h)_+ + z^i + (\nu C'\mathrm{diag}(Cz^i - h)_*C + I)(z^{i+1} - z^i) = 0, \tag{32}$$

which we rewrite by subtracting from it the equality:

$$\nu C'(C\bar{z} - h)_+ + \bar{z} = \nabla f(\bar{z}) = 0. \tag{33}$$

This results in the equivalent iteration:

$$\nu C'[(Cz^i - h)_+ - (C\bar{z} - h)_+] + [z^i - \bar{z}] + (\nu C'\mathrm{diag}(Cz^i - h)_*C + I)(z^{i+1} - z^i) = 0. \tag{34}$$

We establish now that this Newton iteration is satisfied uniquely (since $\partial^2 f(z^i)$ is nonsingular) by $z^{i+1} = \bar{z}$ when $z^i$ is sufficiently close to $\bar{z}$, and hence the Newton iteration terminates at $\bar{z}$ at Step (ii) of Algorithm 5. Setting $z^{i+1} = \bar{z}$ in (34) and canceling terms gives:

$$\nu C'\big[(Cz^i - h)_+ - (C\bar{z} - h)_+ + \mathrm{diag}(Cz^i - h)_*\big((C\bar{z} - h) - (Cz^i - h)\big)\big] = 0. \tag{35}$$

We verify now that the term in the square bracket in (35) is zero when $z^i$ is sufficiently close to $\bar{z}$ by looking at each component $j$, $j = 1, \ldots, m$, of the vector enclosed in the square brackets. We consider the nine possible combinations:

(i) $C_j\bar{z} - h_j > 0$, $C_j z^i - h_j > 0$:
$$C_j z^i - h_j - (C_j\bar{z} - h_j) + 1\cdot\big((C_j\bar{z} - h_j) - (C_j z^i - h_j)\big) = 0.$$

(ii) $C_j\bar{z} - h_j > 0$, $C_j z^i - h_j = 0$: cannot occur when $z^i$ is sufficiently close to $\bar{z}$.

(iii) $C_j\bar{z} - h_j > 0$, $C_j z^i - h_j < 0$: cannot occur when $z^i$ is sufficiently close to $\bar{z}$.

(iv) $C_j\bar{z} - h_j = 0$, $C_j z^i - h_j > 0$:
$$C_j z^i - h_j - 0 + 1\cdot\big(0 - (C_j z^i - h_j)\big) = 0.$$

(v) $C_j\bar{z} - h_j = 0$, $C_j z^i - h_j = 0$:
$$0 - 0 + [0, 1]\cdot(0 - 0) = 0.$$

(vi) $C_j\bar{z} - h_j = 0$, $C_j z^i - h_j < 0$:
$$0 - 0 + 0\cdot\big(0 - (C_j z^i - h_j)\big) = 0.$$

(vii) $C_j\bar{z} - h_j < 0$, $C_j z^i - h_j > 0$: cannot occur when $z^i$ is sufficiently close to $\bar{z}$.

(viii) $C_j\bar{z} - h_j < 0$, $C_j z^i - h_j = 0$: cannot occur when $z^i$ is sufficiently close to $\bar{z}$.

(ix) $C_j\bar{z} - h_j < 0$, $C_j z^i - h_j < 0$:
$$0 - 0 + 0\cdot\big((C_j\bar{z} - h_j) - (C_j z^i - h_j)\big) = 0.$$

Hence, for $z^i$ sufficiently close to $\bar{z}$, the Newton iteration is uniquely satisfied by $\bar{z}$ and terminates.

(ii) We establish now the linear termination rate by using the generalized mean value theorem, Proposition 2, as follows:

$$\begin{aligned} z^{i+1} - \bar{z} &= z^i - \bar{z} - \partial^2 f(z^i)^{-1}\big(\nabla f(z^i) - \nabla f(\bar{z})\big) \\ &= z^i - \bar{z} - \partial^2 f(z^i)^{-1}\sum_{j=1}^{p} t_j H^j(z^i - \bar{z}) \\ &= (z^i - \bar{z}) - \partial^2 f(z^i)^{-1}\left[\sum_{j=1}^{p} t_j H^j + \partial^2 f(z^i) - \partial^2 f(z^i)\right](z^i - \bar{z}) \\ &= -\partial^2 f(z^i)^{-1}\left[\sum_{j=1}^{p} t_j\big(H^j - \partial^2 f(z^i)\big)\right](z^i - \bar{z}), \end{aligned} \tag{36}$$


where the second equality above follows from the generalization of the mean value theorem, Proposition 2. Taking norms of the first and last terms above gives:

$$\begin{aligned} \|z^{i+1} - \bar{z}\| &\le \|\partial^2 f(z^i)^{-1}\|\left\|\sum_{j=1}^{p} t_j\big(H^j - \partial^2 f(z^i)\big)\right\|\|z^i - \bar{z}\| \\ &\le \|\partial^2 f(z^i)^{-1}\|\,\nu\|C'\|\|C\|\,\|z^i - \bar{z}\| \\ &\le \nu\|C'\|\|C\|\,\|z^i - \bar{z}\|, \end{aligned} \tag{37}$$

where use has been made of (24) and (25) of Lemma 4 in the last two inequalities above. □

We turn now to a Newton algorithm with an Armijo stepsize.

4 FINITE ARMIJO NEWTON METHOD

We now drop the well conditioning assumption (26) but add an Armijo stepsize [1] in order to guarantee global finite termination of the following algorithm.

ALGORITHM 1 (Armijo Newton Algorithm for (13)) Start with any $z^0 \in R^p$. For $i = 0, 1, \ldots$:

(i) Stop if $\nabla f(z^i - \partial^2 f(z^i)^{-1}\nabla f(z^i)) = 0$.

(ii) $z^{i+1} = z^i - \lambda_i\partial^2 f(z^i)^{-1}\nabla f(z^i) = z^i + \lambda_i d^i$, where $\lambda_i = \max\{1, \frac{1}{2}, \frac{1}{4}, \ldots\}$ such that:

$$f(z^i) - f(z^i + \lambda_i d^i) \ge -\delta\lambda_i\nabla f(z^i)'d^i, \tag{38}$$

for some $\delta \in (0, 1/2)$, and $d^i$ is the Newton direction:

$$d^i = -\partial^2 f(z^i)^{-1}\nabla f(z^i). \tag{39}$$

(iii) $i = i + 1$. Go to (i).
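A sketch of Algorithm 1 in the same hypothetical setting as the earlier code, reusing `grad_f` and `gen_hessian`; the floor on the stepsize guards the backtracking loop in floating point and is not part of the algorithm:

```python
import numpy as np

def armijo_newton(z0, C, h, nu, delta=0.25, tol=1e-10, max_iter=100):
    """Algorithm 1: Newton direction (39) with Armijo stepsize (38),
    for some delta in (0, 1/2)."""
    def f(z):  # objective (13)
        return 0.5 * nu * np.sum(np.maximum(C @ z - h, 0.0) ** 2) + 0.5 * (z @ z)
    z = z0.copy()
    for _ in range(max_iter):
        g = grad_f(z, C, h, nu)
        d = -np.linalg.solve(gen_hessian(z, C, h, nu), g)   # direction (39)
        if np.linalg.norm(grad_f(z + d, C, h, nu)) <= tol:  # Step (i)
            return z + d
        lam = 1.0
        # Backtrack lam over {1, 1/2, 1/4, ...} until (38) holds.
        while f(z) - f(z + lam * d) < -delta * lam * (g @ d) and lam > 1e-12:
            lam *= 0.5
        z = z + lam * d                                     # Step (ii)
    return z
```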

THEOREM 2 (Finite Termination of Armijo Newton) The sequence $\{z^i\}$ of Algorithm 1 terminates at the global minimum solution $\bar{z}$ of (13).


Proof As in the proof of the Stepless Newton Algorithm 5, we first establish global convergence of $\{z^i\}$ to the global solution $\bar{z}$, but without using the well conditioned assumption (26). By the Quadratic Bound Lemma 3 we have the first inequality below:

$$\begin{aligned} f(z^i) - f(z^i + \lambda_i d^i) + \delta\lambda_i\nabla f(z^i)'d^i &\ge -\frac{K}{2}\lambda_i^2\|d^i\|^2 - (1 - \delta)\lambda_i\nabla f(z^i)'d^i \\ &= -\frac{K}{2}\lambda_i^2\nabla f(z^i)'\partial^2 f(z^i)^{-2}\nabla f(z^i) + (1 - \delta)\lambda_i\nabla f(z^i)'\partial^2 f(z^i)^{-1}\nabla f(z^i) \\ &\ge \lambda_i\left(-\frac{K}{2}\lambda_i + \frac{1 - \delta}{K}\right)\|\nabla f(z^i)\|^2, \end{aligned} \tag{40}$$

where the equality above makes use of the definition of $d^i$ and the last inequality makes use of (24) of Lemma 4. Hence, if $\lambda_i$ is small enough, that is, $\lambda_i \le 2(1 - \delta)/K^2$, then the Armijo inequality (38) is satisfied. However, by the definition of the Armijo stepsize, $2\lambda_i$ violates the Armijo inequality (38), and hence by (40):

$$-\frac{K}{2}\cdot 2\lambda_i + \frac{1 - \delta}{K} < 0, \quad \text{or equivalently} \quad \lambda_i > \frac{1 - \delta}{K^2}. \tag{41}$$

It follows that:

$$\lambda_i \ge \min\left\{1, \frac{1 - \delta}{K^2}\right\} =: \bar{\lambda} > 0, \tag{42}$$

and by the Armijo inequality (38) and (24) of Lemma 4 that:

$$f(z^i) - f(z^{i+1}) \ge -\delta\lambda_i\nabla f(z^i)'d^i \ge \delta\bar{\lambda}\,\nabla f(z^i)'\partial^2 f(z^i)^{-1}\nabla f(z^i) \ge \frac{\delta\bar{\lambda}}{K}\|\nabla f(z^i)\|^2. \tag{43}$$

ð43Þ

By exactly the same arguments following inequalities (28), we have that the sequence $\{z^i\}$ converges to the unique minimum solution $\bar{z}$ of (13). Having established convergence of $\{z^i\}$ to $\bar{z}$, finite termination of the Armijo Newton at $\bar{z}$ again follows exactly the finiteness proof of Theorem 6, because of Step (i) of the current algorithm, which checks whether the next iterate obtained by a stepless Newton step is stationary. □

5 NUMERICAL EXPERIENCE

Our numerical experience is based on the 16 test problems of [11], for which a smoothed version of (7) and (12) was solved using a regular Newton method with an Armijo stepsize. However, all of these test problems were also solved in [11] using the Stepless Newton Algorithm 5 proposed here, and gave the same results as the smoothed method, with the number of Newton steps varying between 5 and 8 for all problems. The test problems were all of type (7) or (12) and varied in size, with $m$ between 110 and 32,562 and $n$ between 2 and 123. Running times varied between 1 and 85 s on a 200 MHz Pentium Pro with 64 megabytes of RAM. Linear classifiers were used for the larger problems, such as those with $m = 32{,}562$, where the Newton method easily solved the minimization problem (7) in $R^{n+1}$ with $n = 123$.

6 CONCLUSION

We have presented fast, finitely terminating Newton methods for solving a fundamental classification problem of data mining and machine learning. The methods are simple and fast and can be applied to other problems such as linear programming [8]. An interesting open question is whether these finite methods can be shown to have polynomial run time in general, as is the case for one-dimensional piecewise quadratic functions [13], and whether they can be applied to an even broader class of problems.

Acknowledgments

The research described in this Data Mining Institute Report 01-11, December 2001, was supported by National Science Foundation Grants CCR-9729842, CCR-0138308 and CDA-9623632, by Air Force Office of Scientific Research Grant F49620-00-1-0085 and by the Microsoft Corporation. The author is grateful to the referees for encouraging constructive suggestions and for important references.

References

[1] L. Armijo (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16, 1–3.

[2] V. Cherkassky and F. Mulier (1998). Learning from Data – Concepts, Theory and Methods. John Wiley & Sons, New York.

[3] F. Facchinei (1995). Minimization of SC1 functions and the Maratos effect. Operations Research Letters, 17, 131–137.

[4] G. Fung and O.L. Mangasarian (2002). Incremental support vector machine classification. In: R. Grossman, H. Mannila and R. Motwani (Eds.), Proceedings of the Second SIAM International Conference on Data Mining, Arlington, Virginia, April 11–13, 2002, pp. 247–260. SIAM, Philadelphia. Technical Report 01-08, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, September 2001. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-08.ps.

[5] G. Fung and O.L. Mangasarian (2001). Proximal support vector machine classifiers. In: F. Provost and R. Srikant (Eds.), Proceedings KDD-2001: Knowledge Discovery and Data Mining, San Francisco, CA, August 26–29, pp. 77–86. Association for Computing Machinery, New York. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps.

[6] J.-B. Hiriart-Urruty (1982). Refinements of necessary optimality conditions in nondifferentiable programming II. Mathematical Programming Study, 19, 120–139.

[7] J.-B. Hiriart-Urruty, J.J. Strodiot and V.H. Nguyen (1984). Generalized Hessian matrix and second-order optimality conditions for problems with C^{1,1} data. Applied Mathematics and Optimization, 11, 43–56.

[8] C. Kanzow, H. Qi and L. Qi (2001). On the minimum norm solution of linear programs. Journal of Optimization Theory and Applications, to appear. Preprint, University of Hamburg, Hamburg. http://www.math.uni-hamburg.de/home/kanzow/paper.html.

[9] D. Klatte and K. Tammer (1988). On second-order sufficient optimality conditions for C^{1,1}-optimization problems. Optimization, 19, 169–179.

[10] Y.-J. Lee and O.L. Mangasarian (2001). RSVM: reduced support vector machines. Proceedings of the First SIAM International Conference on Data Mining, Chicago, April 5–7, CD-ROM Proceedings. Technical Report 00-07, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, July 2000. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps.

[11] Y.-J. Lee and O.L. Mangasarian (2001). SSVM: a smooth support vector machine. Computational Optimization and Applications, 20, 5–22. Technical Report 99-03, Data Mining Institute, University of Wisconsin. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.ps.

[12] W. Li and J. Swetits (1999). Regularized Newton methods for minimization of convex quadratic splines with singular Hessians. In: M. Fukushima and L. Qi (Eds.), Reformulation: Nonsmooth, Piecewise Smooth, Semismooth and Smoothing Methods, pp. 235–257. Kluwer Academic Publishers, Boston.

[13] W. Li and J.J. Swetits (1993). A Newton method for convex regression, data smoothing and quadratic programming with bounded constraints. SIAM Journal on Optimization, 3, 466–488.

[14] W. Li and J.J. Swetits (1997). A new algorithm for solving strictly convex quadratic programs. SIAM Journal on Optimization, 7, 595–619.

[15] K. Madsen and H.B. Nielsen (1998). A finite continuation algorithm for bound constrained quadratic programming. SIAM Journal on Optimization, 9, 62–83.

[16] K. Madsen, H.B. Nielsen and M.C. Pinar (1999). Bound constrained quadratic programming via piecewise quadratic functions. Mathematical Programming, 85, 135–156.

[17] O.L. Mangasarian (2000). Generalized support vector machines. In: A. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans (Eds.), Advances in Large Margin Classifiers, pp. 135–146. MIT Press, Cambridge, MA. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps.

[18] O.L. Mangasarian and D.R. Musicant (1999). Successive overrelaxation for support vector machines. IEEE Transactions on Neural Networks, 10, 1032–1037. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-18.ps.

[19] O.L. Mangasarian and D.R. Musicant (2001). Active support vector machine classification. In: T.K. Leen, T.G. Dietterich and V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, pp. 577–583. MIT Press, Cambridge, MA. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-04.ps.

[20] O.L. Mangasarian and D.R. Musicant (2001). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161–177. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-06.ps.

[21] J.M. Ortega (1972). Numerical Analysis, A Second Course. Academic Press, New York.

[22] L. Qi (1994). Superlinearly convergent approximate Newton methods for LC^1 optimization problems. Mathematical Programming, 64, 277–294.

[23] L. Qi and J. Sun (1993). A nonsmooth version of Newton's method. Mathematical Programming, 58, 353–368.

[24] J. Sun (1997). On piecewise quadratic Newton and trust region problems. Mathematical Programming, 76, 451–468.

[25] V.N. Vapnik (2000). The Nature of Statistical Learning Theory, 2nd Edn. Springer, New York.
