Optimization Methods and Software, Vol. 17, pp. 913–929
A FINITE NEWTON METHOD FOR CLASSIFICATION

O.L. MANGASARIAN*

Computer Sciences Department, University of Wisconsin, Madison, WI 53706, USA
(Received 10 April 2002; In final form 6 August 2002)
A fundamental classification problem of data mining and machine learning is that of minimizing a strongly convex, piecewise quadratic function on the n-dimensional real space $R^n$. We show finite termination of a Newton method to the unique global solution starting from any point in $R^n$. If the function is well conditioned, then no stepsize is required from the start; if not, an Armijo stepsize is used. In either case, the algorithm finds the unique global minimum solution in a finite number of iterations.
Keywords: Newton method; Classification; Data mining; Support vector machines
1 INTRODUCTION
This article establishes finite termination of a Newton method for minimizing a strongly convex, piecewise quadratic function on the n-dimensional real space $R^n$. Such a problem is a fundamental one in generating a linear or nonlinear kernel classifier for data mining and machine learning [2,5,10,11,17–20,25]. This work is motivated by [11], where a smoothed version of the present algorithm was shown to converge globally and quadratically, but was observed to terminate in a few steps even when the smoothing and stepsize (line search) were removed. Another motivating work is [8], where finite termination of a Newton method was established for the least 2-norm solution of a linear program. In fact, our algorithm is similar to that of [8], but has a global finite termination property without a stepsize under a well conditioning property. This well conditioning property, (26) below, which is sufficient for finite termination without a stepsize, does not appear to be necessary for a wide range of classification problems tested in [11].

*E-mail: [email protected]

ISSN 1055-6788 print: ISSN 1029-4937 online © 2002 Taylor & Francis Ltd
DOI: 10.1080/1055678021000028375
Other related work using a finite Newton method to solve quadratic programs with bound constraints appears in [13–16]. In [12,24], finite Newton methods with exact line search are given for minimizing strongly convex differentiable piecewise quadratic functions. Our approach differs from this previous work in its finite termination property with or without an Armijo stepsize, in the analysis we present, and in the application to support vector machine classification.

We outline now the contents of the paper. In Section 2, we describe the linear and nonlinear classification problems leading to a piecewise quadratic strongly convex minimization problem. In Section 3, we establish finite global termination of a Newton algorithm without a stepsize but under a well conditioning property. In Section 4, we remove the well conditioning assumption but add the Armijo stepsize and obtain again finite global termination at the unique solution. In Section 5, we briefly discuss some previous numerical results with the proposed method. Section 6 concludes the article.
A word about our notation. All vectors will be column vectors unless transposed to a row vector by a prime superscript $'$. For a vector $x$ in the n-dimensional real space $R^n$, the plus function $x_+$ is defined as $(x_+)_i = \max\{0, x_i\}$, $i = 1, \ldots, n$, while $x_*$ denotes the subgradient of $x_+$, which is the step function defined as $(x_*)_i = 1$ if $x_i > 0$, $(x_*)_i = 0$ if $x_i < 0$, and $(x_*)_i \in [0, 1]$ if $x_i = 0$, $i = 1, \ldots, n$. The scalar (inner) product of two vectors $x$ and $y$ in $R^n$ will be denoted by $x'y$ and the 2-norm of $x$ will be denoted by $\|x\|$. For a matrix $A \in R^{m \times n}$, $A_i$ is the $i$th row of $A$, which is a row vector in $R^n$, and $\|A\|$ is the 2-norm of $A$: $\max_{\|x\|=1} \|Ax\|$. A column vector of ones of arbitrary dimension will be denoted by $e$. For $A \in R^{m \times n}$ and $B \in R^{n \times l}$, the kernel $K(A, B)$ [2,17,25] is an arbitrary function which maps $R^{m \times n} \times R^{n \times l}$ into $R^{m \times l}$. In particular, if $x$ and $y$ are column vectors in $R^n$, then $K(x', y)$ is a real number, $K(x', A')$ is a row vector in $R^m$ and $K(A, A')$ is an $m \times m$ matrix. If $f$ is a real valued function defined on $R^n$, the gradient of $f$ at $x$ is denoted by $\nabla f(x)$, which is a column vector in $R^n$, and the $n \times n$ Hessian matrix of second partial derivatives of $f$ at $x$ is denoted by $\nabla^2 f(x)$. The convex hull of a set $S$ is denoted by $\text{co}\{S\}$. The identity matrix of arbitrary order will be denoted by $I$.
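To make the plus function $x_+$ and its step-function subgradient $x_*$ concrete, here is a minimal NumPy sketch (the helper names `plus` and `step` are ours, not the paper's; at a kink $x_i = 0$ we pick the admissible subgradient value 0):

```python
import numpy as np

def plus(x):
    # Componentwise plus function: (x_+)_i = max{0, x_i}.
    return np.maximum(0.0, x)

def step(x):
    # One admissible subgradient x_* of x_+: 1 where x_i > 0,
    # 0 where x_i < 0; at x_i = 0 any value in [0, 1] is valid,
    # and this choice returns 0 there.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(plus(x))   # [0. 0. 3.]
print(step(x))   # [0. 0. 1.]
```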
2 LINEAR AND NONLINEAR KERNEL CLASSIFICATION
We describe in this section the fundamental classification problems that lead to minimizing a piecewise quadratic strongly convex function. We consider the problem of classifying $m$ points in the n-dimensional real space $R^n$, represented by the $m \times n$ matrix $A$, according to membership of each point $A_i$ in the classes $+1$ or $-1$ as specified by a given $m \times m$ diagonal matrix $D$ with ones or minus ones along its diagonal. For this problem, the standard support vector machine with a linear kernel $AA'$ [2,25] is given by the following quadratic program for some $\nu > 0$:
$$\min_{(w,\gamma,y) \in R^{n+1+m}} \nu e'y + \frac{1}{2} w'w \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0. \tag{1}$$
As depicted in Fig. 1, $w$ is the normal to the bounding planes:
$$x'w - \gamma = +1, \qquad x'w - \gamma = -1, \tag{2}$$
and $\gamma$ determines their location relative to the origin. The first plane above bounds the class $+1$ points and the second plane bounds the class $-1$ points when the two classes are strictly linearly separable, that is, when the slack variable $y = 0$. The linear separating surface is the plane
$$x'w = \gamma, \tag{3}$$
midway between the bounding planes (2). If the classes are linearly inseparable then the two planes bound the two classes with a "soft margin" determined by a nonnegative slack variable $y$, that is:
$$\begin{aligned} x'w - \gamma + y_i &\ge +1, && \text{for } x' = A_i \text{ and } D_{ii} = +1, \\ x'w - \gamma - y_i &\le -1, && \text{for } x' = A_i \text{ and } D_{ii} = -1. \end{aligned} \tag{4}$$
The 1-norm of the slack variable $y$ is minimized with weight $\nu$ in (1). The quadratic term in (1), which is twice the reciprocal of the square of the 2-norm distance $2/\|w\|$ between the two bounding planes of (2) in the n-dimensional space of $w \in R^n$ for a fixed $\gamma$, maximizes that distance, often called the "margin". Figure 1 depicts the points represented by $A$, the bounding planes (2) with margin $2/\|w\|$, and the separating plane (3) which separates $A+$, the points represented by rows of $A$ with $D_{ii} = +1$, from $A-$, the points represented by rows of $A$ with $D_{ii} = -1$.
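To illustrate the geometry numerically, the sketch below (toy data of our own, not from the paper) classifies points by the separating plane (3) and computes the smallest nonnegative slack $y$ that makes the constraints of (1) hold:

```python
import numpy as np

# Toy data: four points in R^2, the first two in class +1, the rest in class -1.
A = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])        # diagonal of the label matrix D
D = np.diag(d)
e = np.ones(A.shape[0])

# A candidate normal w and location gamma for the separating plane (3).
w, gamma = np.array([1.0, 1.0]), 0.0

print(np.sign(A @ w - gamma))               # predicted classes: [ 1.  1. -1. -1.]

# Smallest nonnegative slack satisfying D(Aw - e*gamma) + y >= e, as in (1):
y = np.maximum(0.0, e - D @ (A @ w - e * gamma))
print(y)                                    # all zeros: strictly linearly separable
```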
In many essentially equivalent formulations of the classification problem [4,5,10,11], the square of the 2-norm of the slack variable $y$ is minimized with weight $\nu/2$ instead of the 1-norm of $y$ as in (1). In addition, the distance between the planes (2) is measured in the $(n+1)$-dimensional space of $(w, \gamma) \in R^{n+1}$, that is, $2/\|(w, \gamma)\|$. Measuring the margin in this $(n+1)$-dimensional space instead of $R^n$ induces strong convexity and has little or no effect on the problem, as was shown in [18]. Thus, using twice the reciprocal squared of the margin instead yields our modified SVM problem as follows:
$$\min_{(w,\gamma,y) \in R^{n+1+m}} \frac{\nu}{2} y'y + \frac{1}{2}(w'w + \gamma^2) \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0. \tag{5}$$
FIGURE 1  The bounding planes (2) with margin $2/\|w\|$, and the plane (3) separating $A+$, the points represented by rows of $A$ with $D_{ii} = +1$, from $A-$, the points represented by rows of $A$ with $D_{ii} = -1$.
It has been shown computationally [20] that this reformulation (5) of the conventional support vector machine formulation (1) yields similar results to (1). At a solution of problem (5), $y$ is given by
$$y = (e - D(Aw - e\gamma))_+, \tag{6}$$
where, as defined in the Introduction, $(\cdot)_+$ replaces negative components of a vector by zeros. Thus, we can replace $y$ in (5) by $(e - D(Aw - e\gamma))_+$ and convert the SVM problem (5) into an equivalent SVM which is an unconstrained optimization problem as follows:
$$\min_{(w,\gamma) \in R^{n+1}} \frac{\nu}{2} \|(e - D(Aw - e\gamma))_+\|^2 + \frac{1}{2}(w'w + \gamma^2). \tag{7}$$
This is a strongly convex, piecewise quadratic minimization problem. Note, however, that its objective function is not twice differentiable, which precludes the use of a regular Newton method. In [11], we smoothed this problem and applied a fast Newton method to solve it. Other methods, such as Newton methods for nonsmooth functions with exact line searches [13,24], can also solve this problem. Problem (7) is one of the nonsmooth problems for which we shall provide a direct finite Newton method. The other nonsmooth problem that we will treat is the nonlinear kernel problem, (12) below, which generates a nonlinear classifier as described below.
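As a sketch of how the unconstrained objective (7) is evaluated in practice (the function name is ours; the vector `d` stores the diagonal of $D$):

```python
import numpy as np

def svm_objective(w, gamma, A, d, nu):
    # Objective of (7):
    #   (nu/2) * ||(e - D(Aw - e*gamma))_+||^2 + (1/2) * (w'w + gamma^2).
    e = np.ones(A.shape[0])
    y = np.maximum(0.0, e - d * (A @ w - gamma))   # slack y recovered via (6)
    return 0.5 * nu * (y @ y) + 0.5 * (w @ w + gamma ** 2)
```

Minimizing this function over $(w, \gamma)$ is exactly problem (7); the Newton methods of Sections 3 and 4 do so in finitely many steps.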
We now describe how the generalized support vector machine (GSVM) [17] generates a nonlinear separating surface by using a completely arbitrary kernel. The GSVM solves the following mathematical program for a general kernel $K(A, A')$, defined in the Introduction:
$$\min_{(u,\gamma,y) \in R^{2m+1}} \nu e'y + f(u) \quad \text{s.t.} \quad D(K(A, A')Du - e\gamma) + y \ge e, \quad y \ge 0. \tag{8}$$
Here $f(u)$ is some convex function on $R^m$ which suppresses the parameter $u$, and $\nu$ is some positive number that weights the classification error $e'y$ versus the suppression of $u$. A solution of this mathematical program for $u$ and $\gamma$ leads to the nonlinear separating surface
$$K(x', A')Du = \gamma. \tag{9}$$
The linear formulation (1) of Section 2 is obtained if we let $K(A, A') = AA'$, $w = A'Du$ and $f(u) = (1/2)u'DAA'Du$. We now use a different classification objective which not only suppresses the parameter $u$ but also suppresses $\gamma$ in our nonlinear formulation:
$$\min_{(u,\gamma,y) \in R^{2m+1}} \frac{\nu}{2} y'y + \frac{1}{2}(u'u + \gamma^2) \quad \text{s.t.} \quad D(K(A, A')Du - e\gamma) + y \ge e, \quad y \ge 0. \tag{10}$$
At a solution of (10), $y$ is given by
$$y = (e - D(K(A, A')Du - e\gamma))_+, \tag{11}$$
where, as defined earlier, $(\cdot)_+$ replaces negative components of a vector by zeros. Thus, we can replace $y$ in (10) by $(e - D(K(A, A')Du - e\gamma))_+$ and convert the SVM problem (10) into an equivalent SVM which is an unconstrained optimization problem as follows:
$$\min_{(u,\gamma) \in R^{m+1}} \frac{\nu}{2} \|(e - D(K(A, A')Du - e\gamma))_+\|^2 + \frac{1}{2}(u'u + \gamma^2). \tag{12}$$
Again, as in (7), this problem is a strongly convex unconstrained minimization problem which has a unique solution, but its objective function is not twice differentiable. It is for this problem, which generates the nonlinear separating surface (9), and for problem (7), which generates the linear separating surface (3), that we shall develop the finite Newton methods described in the next two sections of the article.
3 FINITE STEPLESS NEWTON METHOD
We shall consider in the rest of the article the following piecewise quadratic strongly convex problem, which subsumes both problems (7) and (12):
$$\min_{z \in R^p} f(z) := \frac{\nu}{2} \|(Cz - h)_+\|^2 + \frac{1}{2} z'z, \tag{13}$$
where $C \in R^{m \times p}$, $h \in R^m$, and $\nu$ is a fixed positive parameter. We note immediately that, since $\|r_+ - t_+\| \le \|r - t\|$ for $r, t \in R^m$, the gradient of $f$:
$$\nabla f(z) = \nu C'(Cz - h)_+ + z, \tag{14}$$
is globally Lipschitz continuous with constant $K$ as follows:
$$\|\nabla f(s) - \nabla f(z)\| \le K \|s - z\|, \quad \forall s, z \in R^p, \quad \text{where } K = \nu \|C'\| \|C\| + 1. \tag{15}$$
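For instance, problem (7) fits the form (13) with $z = (w, \gamma)$, $C = -D[A, -e]$ and $h = -e$, since then $Cz - h = e - D(Aw - e\gamma)$; this reduction is our own reading of the text, spelled out as a sketch together with the gradient (14):

```python
import numpy as np

def linear_Ch(A, d):
    # Cast problem (7) into the form (13): with z = (w, gamma),
    # C = -D[A, -e] and h = -e give Cz - h = e - D(Aw - e*gamma).
    m = A.shape[0]
    e = np.ones(m)
    C = -np.diag(d) @ np.hstack([A, -e[:, None]])
    return C, -e

def grad_f(z, C, h, nu):
    # Gradient (14): grad f(z) = nu * C'(Cz - h)_+ + z.
    return nu * C.T @ np.maximum(0.0, C @ z - h) + z
```

The kernel problem (12) is obtained the same way with $K(A, A')D$ in place of $A$ and $z = (u, \gamma)$.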
The ordinary Hessian of $f$ does not exist everywhere. However, since $\nabla f(z)$ is Lipschitzian, $f$ belongs to the class $LC^1$ of functions with locally Lipschitzian gradients, and a generalized Hessian exists everywhere [7]. $LC^1$ functions were introduced in [7] and used in subsequent papers such as [3,9,22,23]. The generalized Hessian is defined as follows [3,7]:

DEFINITION 1 Generalized Hessian  Let $f: R^p \to R$ have a Lipschitz continuous gradient on $R^p$. The generalized Hessian of $f$ at $z$ is the set $\partial^2 f(z)$ of $p \times p$ matrices defined as [3,7]:
$$\partial^2 f(z) := \text{co}\{H \in R^{p \times p} \mid \exists z^k \to z \text{ such that } \nabla f \text{ is differentiable at } z^k \text{ and } \nabla^2 f(z^k) \to H\}. \tag{16}$$
Furthermore, a generalization of the mean value theorem for the Lipschitzian gradient $\nabla f(z)$ holds [3,6]:

PROPOSITION 2 Generalized Mean Value Theorem  Let $f: R^p \to R$ have a Lipschitz continuous gradient on $R^p$. Then for $z, s \in R^p$:
$$\nabla f(z) = \nabla f(s) + \sum_{j=1}^{p} t_j H^j (z - s), \tag{17}$$
where $H^j \in \partial^2 f(y^j)$ for some $y^j \in (s, z)$, $t_j \ge 0$, $j = 1, \ldots, p$, and $\sum_{j=1}^{p} t_j = 1$.
We shall also need the following lemma [21, 8.1.5], which gives a quadratic bound on a linear Taylor expansion:

LEMMA 3 Quadratic Bound Lemma  Let $f: R^p \to R$ have a Lipschitz continuous gradient on $R^p$ with constant $K$. Then for $z, s \in R^p$:
$$|f(z) - f(s) - \nabla f(s)'(z - s)| \le \frac{K}{2} \|z - s\|^2. \tag{18}$$
We summarize now properties of the function $f(z)$ in the following lemma and omit the straightforward proof of these properties.

LEMMA 4 Properties of $f(z)$  The function $f(z)$ defined by (13) has the following properties for all $s, z \in R^p$:

(i) $f(z)$ is strongly convex with constant $k = 1$:
$$(\nabla f(s) - \nabla f(z))'(s - z) \ge k \|s - z\|^2 = \|s - z\|^2. \tag{19}$$

(ii) $\nabla f(z)$ is Lipschitz continuous with constant $K$:
$$\|\nabla f(s) - \nabla f(z)\| \le K \|s - z\|, \quad \forall s, z \in R^p, \quad \text{where } K = \nu \|C'\| \|C\| + 1. \tag{20}$$
(iii) The generalized Hessian of $f(z)$ [7] is:
$$\partial^2 f(z) = \nu C' \text{diag}(Cz - h)_* C + I, \tag{21}$$
where $\text{diag}(Cz - h)_*$ denotes the $m \times m$ diagonal matrix whose $j$th diagonal entry is the step-function subgradient of the plus function $(\cdot)_+$ as follows:
$$(\text{diag}(Cz - h)_*)_{jj} \begin{cases} = 1, & \text{if } C_j z - h_j > 0, \\ \in [0, 1], & \text{if } C_j z - h_j = 0, \\ = 0, & \text{if } C_j z - h_j < 0, \end{cases} \qquad j = 1, \ldots, m. \tag{22}$$
(iv)
$$K \|s\|^2 \ge s'(\nu C'C + I)s \ge s' \partial^2 f(z) s \ge \|s\|^2, \quad \text{where } K = \nu \|C'\| \|C\| + 1. \tag{23}$$

(v)
$$\frac{1}{K} \|s\|^2 \le s' \partial^2 f(z)^{-1} s \le \|s\|^2. \tag{24}$$

(vi)
$$\|\partial^2 f(s) - \partial^2 f(z)\| \le \nu \|C'\| \|C\|. \tag{25}$$

We note that the inverse $\partial^2 f(z)^{-1}$ appearing in (24) and elsewhere refers to the inverse of $\partial^2 f(z)$ defined by (21)–(22) for an arbitrary but specific value of $(\text{diag}(Cz - h)_*)_{jj}$, $j = 1, \ldots, m$, in the interval $[0, 1]$ when $C_j z - h_j = 0$.
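In code, one member of the generalized Hessian (21)–(22) can be assembled as follows (a sketch; taking the subgradient value 0 at the kinks $C_j z = h_j$ is one arbitrary but specific choice, as the remark above permits):

```python
import numpy as np

def gen_hessian(z, C, h, nu):
    # A member of the generalized Hessian (21):
    #   nu * C' diag((Cz - h)_*) C + I,
    # with the diagonal (22) set to 0 at ties C_j z - h_j = 0.
    d_star = (C @ z - h > 0).astype(float)
    return nu * C.T @ (d_star[:, None] * C) + np.eye(C.shape[1])
```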
We are ready now to state and establish global finite termination of a Newton algorithm without a stepsize, starting from any point.

ALGORITHM 5 Stepless Newton Algorithm for (13)  Start with any $z^0 \in R^p$. For $i = 0, 1, \ldots$:

(i) $z^{i+1} = z^i - \partial^2 f(z^i)^{-1} \nabla f(z^i)$.
(ii) Stop if $\nabla f(z^{i+1}) = 0$.
(iii) $i = i + 1$. Go to (i).
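A direct transcription of Algorithm 5 into NumPy, reusing `grad_f` and `gen_hessian` from the sketches above; the tolerance `tol` stands in for the exact stopping test $\nabla f(z^{i+1}) = 0$, a practical concession of ours:

```python
import numpy as np

def stepless_newton(z0, C, h, nu, tol=1e-10, max_iter=50):
    # Algorithm 5: full Newton step z <- z - d2f(z)^{-1} grad f(z),
    # stopping once the gradient (numerically) vanishes.
    z = z0.copy()
    for _ in range(max_iter):
        g = grad_f(z, C, h, nu)
        if np.linalg.norm(g) <= tol:
            break
        z = z - np.linalg.solve(gen_hessian(z, C, h, nu), g)
    return z
```

Under the well conditioning property (26) below, no stepsize safeguard is needed; Section 4 adds an Armijo stepsize for the general case.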
THEOREM 6 Finite Termination of Stepless Newton  Let $f(z)$ be well conditioned, that is:
$$\frac{K}{k} = \frac{\nu \|C'\| \|C\| + 1}{1} < 2, \quad \text{i.e.,} \quad \nu \|C'\| \|C\| < 1. \tag{26}$$

(i) The sequence $\{z^i\}$ of Algorithm 5 terminates at the global minimum solution $\bar{z}$ of (13).
(ii) The error decreases linearly at each step as follows:
$$\|z^{i+1} - \bar{z}\| \le \nu \|C'\| \|C\| \, \|z^i - \bar{z}\|. \tag{27}$$
Proof
(i) We first establish convergence and then finite termination of the sequence $\{z^i\}$. By the Quadratic Bound Lemma (18):
$$\begin{aligned}
f(z^i) - f(z^{i+1}) &\ge \nabla f(z^i)' \partial^2 f(z^i)^{-1} \nabla f(z^i) - \frac{K}{2} \nabla f(z^i)' \partial^2 f(z^i)^{-2} \nabla f(z^i) \\
&= \nabla f(z^i)' \left( \partial^2 f(z^i)^{-1} - \frac{K}{2} \partial^2 f(z^i)^{-2} \right) \nabla f(z^i) \\
&= \frac{K}{2} \nabla f(z^i)' \partial^2 f(z^i)^{-1/2} \left( \frac{2I}{K} - \partial^2 f(z^i)^{-1} \right) \partial^2 f(z^i)^{-1/2} \nabla f(z^i) \\
&\ge \frac{K}{2} \cdot \frac{1 - \nu \|C'\| \|C\|}{1 + \nu \|C'\| \|C\|} \nabla f(z^i)' \partial^2 f(z^i)^{-1} \nabla f(z^i) \\
&\ge \frac{1}{2} \cdot \frac{1 - \nu \|C'\| \|C\|}{1 + \nu \|C'\| \|C\|} \|\nabla f(z^i)\|^2 \ge 0,
\end{aligned} \tag{28}$$
where the first inequality above follows from the Quadratic Bound Lemma (18), the next to the last inequality follows from the well conditioned assumption (26) and (24), and the last inequality from (24).
By (28) and the strong convexity of $f(z)$ we have that:
$$0 \ge f(z^i) - f(z^0) \ge \nabla f(z^0)'(z^i - z^0) + \frac{1}{2} \|z^i - z^0\|^2 \ge -\|\nabla f(z^0)\| \|z^i - z^0\| + \frac{1}{2} \|z^i - z^0\|^2. \tag{29}$$
Hence,
$$\|z^i - z^0\| \le 2 \|\nabla f(z^0)\|, \quad i = 1, 2, \ldots, \tag{30}$$
and consequently the bounded sequence $\{z^i\}$ has an accumulation point $\bar{z}$ such that $\lim_{j \to \infty} z^{i_j} = \bar{z}$. Since $\{f(z^i)\}$ is nonincreasing by (28) and bounded below by $\min_{z \in R^p} f(z)$, it converges and $\lim_{i \to \infty} f(z^i) = f(\bar{z})$. Thus:
$$0 = \lim_{j \to \infty} (f(z^{i_j}) - f(z^{i_j + 1})) \ge \frac{1}{2} \cdot \frac{1 - \nu \|C'\| \|C\|}{1 + \nu \|C'\| \|C\|} \lim_{j \to \infty} \|\nabla f(z^{i_j})\|^2 \ge 0. \tag{31}$$
Hence $\lim_{j \to \infty} \|\nabla f(z^{i_j})\| = \|\nabla f(\bar{z})\| = 0$. Since all accumulation points are stationary, they must all equal the unique minimizer of $f(z)$ on $R^p$. It follows that the whole sequence $\{z^i\}$ must converge to the unique $\bar{z}$ such that $\nabla f(\bar{z}) = 0$. We now show finite termination by using an argument similar to that of [8].

Our Newton iteration is:
$$\nu C'(Cz^i - h)_+ + z^i + (\nu C' \text{diag}(Cz^i - h)_* C + I)(z^{i+1} - z^i) = 0, \tag{32}$$
which we rewrite by subtracting from it the equality:
$$\nu C'(C\bar{z} - h)_+ + \bar{z} = \nabla f(\bar{z}) = 0. \tag{33}$$
This results in the equivalent iteration:
$$\nu C'[(Cz^i - h)_+ - (C\bar{z} - h)_+] + [z^i - \bar{z}] + (\nu C' \text{diag}(Cz^i - h)_* C + I)(z^{i+1} - z^i) = 0. \tag{34}$$
We establish now that this Newton iteration is satisfied uniquely (since $\partial^2 f(z^i)$ is nonsingular) by $z^{i+1} = \bar{z}$ when $z^i$ is sufficiently close to $\bar{z}$, and hence the Newton iteration terminates at $\bar{z}$ at Step (ii) of Algorithm 5. Setting $z^{i+1} = \bar{z}$ in (34) and canceling terms gives:
$$\nu C'[(Cz^i - h)_+ - (C\bar{z} - h)_+ + \text{diag}(Cz^i - h)_* ((C\bar{z} - h) - (Cz^i - h))] = 0. \tag{35}$$
We verify now that the term in the square brackets in (35) is zero when $z^i$ is sufficiently close to $\bar{z}$ by looking at each component $j$, $j = 1, \ldots, m$, of the vector enclosed in the square brackets. We consider the nine possible combinations:
(i) $C_j \bar{z} - h_j > 0$, $C_j z^i - h_j > 0$:
$(C_j z^i - h_j) - (C_j \bar{z} - h_j) + 1 \cdot ((C_j \bar{z} - h_j) - (C_j z^i - h_j)) = 0$.

(ii) $C_j \bar{z} - h_j > 0$, $C_j z^i - h_j = 0$: cannot occur when $z^i$ is sufficiently close to $\bar{z}$.

(iii) $C_j \bar{z} - h_j > 0$, $C_j z^i - h_j < 0$: cannot occur when $z^i$ is sufficiently close to $\bar{z}$.

(iv) $C_j \bar{z} - h_j = 0$, $C_j z^i - h_j > 0$:
$(C_j z^i - h_j) - 0 + 1 \cdot (0 - (C_j z^i - h_j)) = 0$.

(v) $C_j \bar{z} - h_j = 0$, $C_j z^i - h_j = 0$:
$0 - 0 + [0, 1] \cdot (0 - 0) = 0$.

(vi) $C_j \bar{z} - h_j = 0$, $C_j z^i - h_j < 0$:
$0 - 0 + 0 \cdot (0 - (C_j z^i - h_j)) = 0$.

(vii) $C_j \bar{z} - h_j < 0$, $C_j z^i - h_j > 0$: cannot occur when $z^i$ is sufficiently close to $\bar{z}$.

(viii) $C_j \bar{z} - h_j < 0$, $C_j z^i - h_j = 0$: cannot occur when $z^i$ is sufficiently close to $\bar{z}$.

(ix) $C_j \bar{z} - h_j < 0$, $C_j z^i - h_j < 0$:
$0 - 0 + 0 \cdot ((C_j \bar{z} - h_j) - (C_j z^i - h_j)) = 0$.

Hence, for $z^i$ sufficiently close to $\bar{z}$, the Newton iteration is uniquely satisfied by $\bar{z}$ and terminates.
(ii) We establish now the linear termination rate by using the generalized mean value theorem, Proposition 2, as follows:
$$\begin{aligned}
(z^{i+1} - \bar{z}) &= z^i - \bar{z} - \partial^2 f(z^i)^{-1} (\nabla f(z^i) - \nabla f(\bar{z})) \\
&= z^i - \bar{z} - \partial^2 f(z^i)^{-1} \sum_{j=1}^{p} t_j H^j (z^i - \bar{z}) \\
&= (z^i - \bar{z}) - \partial^2 f(z^i)^{-1} \left[ \sum_{j=1}^{p} t_j H^j + \partial^2 f(z^i) - \partial^2 f(z^i) \right] (z^i - \bar{z}) \\
&= -\partial^2 f(z^i)^{-1} \left[ \sum_{j=1}^{p} t_j (H^j - \partial^2 f(z^i)) \right] (z^i - \bar{z}),
\end{aligned} \tag{36}$$
where the second equality above follows from the generalized mean value theorem, Proposition 2. Taking norms of the first and last terms above gives:
$$\begin{aligned}
\|z^{i+1} - \bar{z}\| &\le \|\partial^2 f(z^i)^{-1}\| \left\| \sum_{j=1}^{p} t_j (H^j - \partial^2 f(z^i)) \right\| \|z^i - \bar{z}\| \\
&\le \|\partial^2 f(z^i)^{-1}\| \, \nu \|C'\| \|C\| \, \|z^i - \bar{z}\| \\
&\le \nu \|C'\| \|C\| \, \|z^i - \bar{z}\|,
\end{aligned} \tag{37}$$
where use has been made of (24) and (25) of Lemma 4 in the last two inequalities above. ∎
We turn now to a Newton algorithm with an Armijo stepsize.
4 FINITE ARMIJO NEWTON METHOD
We now drop the well conditioning assumption (26) but add an Armijo stepsize [1] in order to guarantee global finite termination of the following algorithm.

ALGORITHM 1 Armijo Newton Algorithm for (13)  Start with any $z^0 \in R^p$. For $i = 0, 1, \ldots$:

(i) Stop if $\nabla f(z^i - \partial^2 f(z^i)^{-1} \nabla f(z^i)) = 0$.
(ii) $z^{i+1} = z^i - \lambda_i \partial^2 f(z^i)^{-1} \nabla f(z^i) = z^i + \lambda_i d^i$, where $\lambda_i = \max\{1, \frac{1}{2}, \frac{1}{4}, \ldots\}$ such that:
$$f(z^i) - f(z^i + \lambda_i d^i) \ge -\delta \lambda_i \nabla f(z^i)' d^i, \tag{38}$$
for some $\delta \in (0, 1/2)$, and $d^i$ is the Newton direction:
$$d^i = -\partial^2 f(z^i)^{-1} \nabla f(z^i). \tag{39}$$
(iii) $i = i + 1$. Go to (i).
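A sketch of Algorithm 1 along the same lines (again reusing `grad_f` and `gen_hessian` from earlier; the numerical tolerance and the default $\delta = 0.25 \in (0, 1/2)$ are our choices):

```python
import numpy as np

def armijo_newton(z0, C, h, nu, delta=0.25, tol=1e-10, max_iter=100):
    # Algorithm 1: Newton direction d^i = -d2f(z^i)^{-1} grad f(z^i),
    # with stepsize lambda_i in {1, 1/2, 1/4, ...} satisfying (38).
    def f(z):
        return 0.5 * nu * np.sum(np.maximum(0.0, C @ z - h) ** 2) + 0.5 * (z @ z)

    z = z0.copy()
    for _ in range(max_iter):
        g = grad_f(z, C, h, nu)
        d = -np.linalg.solve(gen_hessian(z, C, h, nu), g)
        if np.linalg.norm(grad_f(z + d, C, h, nu)) <= tol:
            return z + d          # Step (i): the stepless iterate is stationary
        lam = 1.0
        while f(z) - f(z + lam * d) < -delta * lam * (g @ d):
            lam *= 0.5            # backtrack until the Armijo condition (38) holds
        z = z + lam * d
    return z
```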
THEOREM 2 Finite Termination of Armijo Newton  The sequence $\{z^i\}$ of Algorithm 1 terminates at the global minimum solution $\bar{z}$ of (13).
Proof  As in the proof of the Stepless Newton Algorithm 5, we first establish global convergence of $\{z^i\}$ to the global solution $\bar{z}$, but without using the well conditioned assumption (26). By the Quadratic Bound Lemma 3 we have the first inequality below:
$$\begin{aligned}
f(z^i) - f(z^i + \lambda_i d^i) + \delta \lambda_i \nabla f(z^i)' d^i &\ge -\frac{K}{2} \lambda_i^2 \|d^i\|^2 - (1 - \delta) \lambda_i \nabla f(z^i)' d^i \\
&= -\frac{K}{2} \lambda_i^2 \nabla f(z^i)' \partial^2 f(z^i)^{-2} \nabla f(z^i) + (1 - \delta) \lambda_i \nabla f(z^i)' \partial^2 f(z^i)^{-1} \nabla f(z^i) \\
&\ge \lambda_i \left( -\frac{K}{2} \lambda_i + \frac{1 - \delta}{K} \right) \|\nabla f(z^i)\|^2,
\end{aligned} \tag{40}$$
where the equality above makes use of the definition of $d^i$ and the last inequality makes use of (24) of Lemma 4. Hence, if $\lambda_i$ is small enough, that is, $\lambda_i \le 2(1 - \delta)/K^2$, then the Armijo inequality (38) is satisfied. However, by the definition of the Armijo stepsize, $2\lambda_i$ violates the Armijo inequality (38), and hence by (40):
$$-\frac{K}{2} 2\lambda_i + \frac{1 - \delta}{K} < 0, \quad \text{or equivalently} \quad \lambda_i > \frac{1 - \delta}{K^2}. \tag{41}$$
It follows that:
$$\lambda_i \ge \min\left\{1, \frac{1 - \delta}{K^2}\right\} =: \bar{\lambda} > 0, \tag{42}$$
and by the Armijo inequality (38) and (24) of Lemma 4 that:
$$f(z^i) - f(z^{i+1}) \ge -\delta \lambda_i \nabla f(z^i)' d^i \ge \delta \bar{\lambda} \nabla f(z^i)' \partial^2 f(z^i)^{-1} \nabla f(z^i) \ge \frac{\delta \bar{\lambda}}{K} \|\nabla f(z^i)\|^2. \tag{43}$$
By exactly the same arguments following inequalities (28), we have that the sequence $\{z^i\}$ converges to the unique minimum solution $\bar{z}$ of (13). Having established convergence of $\{z^i\}$ to $\bar{z}$, finite termination of the Armijo Newton algorithm at $\bar{z}$ again follows exactly the finiteness proof of Theorem 6, because of Step (i) of the current algorithm, which checks whether the next iterate obtained by a stepless Newton step is stationary. ∎
5 NUMERICAL EXPERIENCE
Our numerical experience is based on the 16 test problems of [11], for which a smoothed version of (7) and (12) was solved using a regular Newton method with an Armijo stepsize. However, all of these test problems were also solved in [11] using the Stepless Newton Algorithm 5 proposed here, and gave the same results as the smoothed method, with the number of Newton steps varying between 5 and 8 for all problems. The test problems were all of the type of (7) or (12) and varied in size, with $m$ between 110 and 32,562 and $n$ between 2 and 123. Running times varied between 1 and 85 s on a 200 MHz Pentium Pro with 64 megabytes of RAM. Linear classifiers were used for the larger problems, such as those with $m = 32{,}562$, where the Newton method easily solved the minimization problem (7) in $R^{n+1}$ with $n = 123$.
6 CONCLUSION
We have presented fast, finitely terminating Newton methods for solving a fundamental classification problem of data mining and machine learning. The methods are simple and fast and can be applied to other problems such as linear programming [8]. An interesting open question is whether these finite methods can be shown to have polynomial run time in general, as is the case for one-dimensional piecewise quadratic functions [13], and whether they can be applied to an even broader class of problems.
Acknowledgments
The research described in this Data Mining Institute Report 01-11, December 2001, was supported by National Science Foundation Grants CCR-9729842, CCR-0138308 and CDA-9623632, by Air Force Office of Scientific Research Grant F49620-00-1-0085 and by the Microsoft Corporation. The author is grateful to the referees for encouraging constructive suggestions and for important references.
References
[1] L. Armijo (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16, 1–3.
[2] V. Cherkassky and F. Mulier (1998). Learning from Data – Concepts, Theory and Methods. John Wiley & Sons, New York.
[3] F. Facchinei (1995). Minimization of SC1 functions and the Maratos effect. Operations Research Letters, 17, 131–137.
[4] G. Fung and O.L. Mangasarian (2002). Incremental support vector machine classification. In: R. Grossman, H. Mannila and R. Motwani (Eds.), Proceedings of the Second SIAM International Conference on Data Mining, Arlington, Virginia, April 11–13, 2002, pp. 247–260. SIAM, Philadelphia. Technical Report 01-08, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-08.ps.
[5] G. Fung and O.L. Mangasarian (2001). Proximal support vector machine classifiers. In: F. Provost and R. Srikant (Eds.), Proceedings KDD-2001: Knowledge Discovery and Data Mining, August 26–29, San Francisco, CA, pp. 77–86. Association for Computing Machinery, New York. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps.
[6] J.-B. Hiriart-Urruty (1982). Refinements of necessary optimality conditions in nondifferentiable programming II. Mathematical Programming Study, 9, 120–139.
[7] J.-B. Hiriart-Urruty, J.J. Strodiot and V.H. Nguyen (1984). Generalized Hessian matrix and second-order optimality conditions for problems with C^{1,1} data. Applied Mathematics and Optimization, 11, 43–56.
[8] C. Kanzow, H. Qi and L. Qi (2001). On the minimum norm solution of linear programs. Journal of Optimization Theory and Applications, to appear. Preprint, University of Hamburg, Hamburg. http://www.math.uni-hamburg.de/home/kanzow/paper.html.
[9] D. Klatte and K. Tammer (1988). On second-order sufficient optimality conditions for C^{1,1}-optimization problems. Optimization, 19, 169–179.
[10] Y.-J. Lee and O.L. Mangasarian (2001). RSVM: reduced support vector machines. Proceedings of the First SIAM International Conference on Data Mining, Chicago, April 5–7, CD-ROM Proceedings. Technical Report 00-07, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, July 2000. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps.
[11] Y.-J. Lee and O.L. Mangasarian (2001). SSVM: a smooth support vector machine. Computational Optimization and Applications, 20, 5–22. Technical Report 99-03, Data Mining Institute, University of Wisconsin. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.ps.
[12] W. Li and J. Swetits (1999). Regularized Newton methods for minimization of convex quadratic splines with singular Hessians. In: M. Fukushima and L. Qi (Eds.), Reformulation: Nonsmooth, Piecewise Smooth, Semismooth and Smoothing Methods, pp. 235–257. Kluwer Academic Publishers, Boston.
[13] W. Li and J.J. Swetits (1993). A Newton method for convex regression, data smoothing and quadratic programming with bounded constraints. SIAM Journal on Optimization, 3, 466–488.
[14] W. Li and J.J. Swetits (1997). A new algorithm for solving strictly convex quadratic programs. SIAM Journal on Optimization, 7, 595–619.
[15] K. Madsen and H.B. Nielsen (1998). A finite continuation algorithm for bound constrained quadratic programming. SIAM Journal on Optimization, 9, 62–83.
[16] K. Madsen, H.B. Nielsen and M.C. Pinar (1999). Bound constrained quadratic programming via piecewise quadratic functions. Mathematical Programming, 85, 135–156.
[17] O.L. Mangasarian (2000). Generalized support vector machines. In: A. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans (Eds.), Advances in Large Margin Classifiers, pp. 135–146. MIT Press, Cambridge, MA. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps.
[18] O.L. Mangasarian and D.R. Musicant (1999). Successive overrelaxation for support vector machines. IEEE Transactions on Neural Networks, 10, 1032–1037. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-18.ps.
[19] O.L. Mangasarian and D.R. Musicant (2001). Active support vector machine classification. In: Todd K. Leen, Thomas G. Dietterich and Volker Tresp (Eds.), Advances in Neural Information Processing Systems 13, pp. 577–583. MIT Press, Cambridge, MA. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-04.ps.
[20] O.L. Mangasarian and D.R. Musicant (2001). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161–177. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-06.ps.
[21] J.M. Ortega (1972). Numerical Analysis, A Second Course. Academic Press, New York.
[22] L. Qi (1994). Superlinearly convergent approximate Newton methods for LC^1 optimization problems. Mathematical Programming, 64, 277–294.
[23] L. Qi and J. Sun (1993). A nonsmooth version of Newton's method. Mathematical Programming, 58, 353–368.
[24] J. Sun (1997). On piecewise quadratic Newton and trust region problems. Mathematical Programming, 76, 451–468.
[25] V.N. Vapnik (2000). The Nature of Statistical Learning Theory, 2nd Edn. Springer, New York.