
IOE 519: NLP, Winter 2012 © Marina A. Epelman

6 Constrained optimization — optimality conditions

Recall that a constrained optimization problem is a problem of the form

\[
(P)\qquad \min\ f(x)\quad \text{s.t.}\quad g(x) \le 0,\ \ h(x) = 0,\ \ x \in X,
\]

where $X$ is an open set and $g(x) = (g_1(x), \ldots, g_m(x))^T : \mathbb{R}^n \to \mathbb{R}^m$, $h(x) = (h_1(x), \ldots, h_l(x))^T : \mathbb{R}^n \to \mathbb{R}^l$. Let $S$ denote the feasible set of (P), i.e.,
\[
S \triangleq \{x \in X : g(x) \le 0,\ h(x) = 0\}.
\]

Then the problem (P) can be written as $\min_{x \in S} f(x)$. Recall that $\bar x$ is a local minimizer of (P) if there exists $\epsilon > 0$ such that $f(\bar x) \le f(y)$ for all $y \in S \cap B_\epsilon(\bar x)$.

Local, global minimizers and maximizers, strict and non-strict, are defined analogously.

We will often use the following “shorthand” notation:

\[
\nabla g(x)^T = \begin{bmatrix} \nabla g_1(x)^T \\ \vdots \\ \nabla g_m(x)^T \end{bmatrix}
\quad\text{and}\quad
\nabla h(x)^T = \begin{bmatrix} \nabla h_1(x)^T \\ \vdots \\ \nabla h_l(x)^T \end{bmatrix},
\]

i.e., $\nabla g(x)^T \in \mathbb{R}^{m\times n}$ and $\nabla h(x)^T \in \mathbb{R}^{l\times n}$ are Jacobian matrices, whose $i$th row is the transpose of the gradient of the corresponding function.

6.1 Necessary Optimality Conditions

We define a set $C \subseteq \mathbb{R}^n$ to be a cone if for every $x \in C$, $\alpha x \in C$ for any $\alpha > 0$. A set $C$ is a convex cone if $C$ is a cone and $C$ is a convex set.

Suppose $\bar x \in S$. We make the following definitions:

$F_0 = \{d : \nabla f(\bar x)^T d < 0\}$ — “the cone of descent directions of $f$ at $\bar x$.”

$I = \{i : g_i(\bar x) = 0\}$ — the indices of binding/active inequality constraints at $\bar x$.

$G_0 = \{d : \nabla g_i(\bar x)^T d < 0\ \ \forall i \in I\}$ — “the cone of ‘inward’ directions of active inequality constraints.”

$H_0 = \{d : \nabla h_i(\bar x)^T d = 0\ \ \forall i = 1, \ldots, l\}$ — the set of tangent directions of equality constraints.

Important note: The descriptions of $F_0$ and $G_0$ are presented in quotes because they do not, in fact, provide precise descriptions of these sets; they are only intended to give some intuition about their content. For instance, recall that any direction $d$ such that $\nabla f(\bar x)^T d < 0$ is indeed a descent direction of $f$ at $\bar x$, but some descent directions may make a zero inner product with the gradient. Thus, in some cases $F_0$ is a proper subset of the cone of all descent directions. (The same is true of the set $G_0$.)

Theorem 6.1 Assume that $h(x)$ is an affine function, i.e., $h(x) = Ax - b$ for $A \in \mathbb{R}^{l\times n}$. If $\bar x$ is a local minimizer of (P), then $F_0 \cap G_0 \cap H_0 = \emptyset$.


Proof: Note that $\nabla h_i(x) = A_{i\cdot}$, i.e., $H_0 = \{d : Ad = 0\}$.

Suppose $d \in F_0 \cap G_0 \cap H_0$. Then for all $\lambda > 0$ sufficiently small, $g_i(\bar x + \lambda d) \le g_i(\bar x) = 0$ for all $i \in I$ (for $i \notin I$, since $\lambda$ is small, $g_i(\bar x + \lambda d) < 0$), and $h(\bar x + \lambda d) = (A\bar x - b) + \lambda A d = 0$. Therefore, $\bar x + \lambda d \in S$ for all $\lambda > 0$ sufficiently small. On the other hand, for all sufficiently small $\lambda > 0$, $f(\bar x + \lambda d) < f(\bar x)$. This contradicts the assumption that $\bar x$ is a local minimizer of (P).

The following is the extension of Theorem 6.1 to handle general nonlinear functions h(x).

Theorem 6.2 If $\bar x$ is a local minimizer of (P) and the gradient vectors $\nabla h_i(\bar x)$, $i = 1, \ldots, l$, are linearly independent, then $F_0 \cap G_0 \cap H_0 = \emptyset$.

The proof of Theorem 6.2 is somewhat involved, and is postponed until subsection 6.8 at the end of this section.

Note that Theorem 6.2 essentially says that if a point $\bar x$ is (locally) optimal, there is no direction $d$ which is simultaneously an improving direction (i.e., such that $f(\bar x + \lambda d) < f(\bar x)$ for small $\lambda > 0$) and a feasible direction (i.e., such that $g_i(\bar x + \lambda d) \le g_i(\bar x) = 0$ for $i \in I$ and $h(\bar x + \lambda d) = 0$), which makes sense intuitively. Observe, however, that the condition in Theorem 6.2 is somewhat weaker than the above intuitive explanation: indeed, we can have a direction $d$ which is an improving direction, but with $\nabla f(\bar x)^T d = 0$ (the same is true for the $g_i$).

6.2 Farkas’ Lemma and Extensions

We will shortly attempt to reduce the geometric necessary local optimality condition ($F_0 \cap G_0 \cap H_0 = \emptyset$) to an algebraic statement in terms of the gradients of the objective and constraint functions. For this we need the mathematical tools developed here.

Theorem 6.3 (Farkas’ Lemma) Exactly one of the following two systems has a solution:

(i) $Ax \le 0$, $c^T x > 0$

(ii) $A^T y = c$, $y \ge 0$.

Proof: First note that both systems cannot have a solution, since then we would have $0 < c^T x = y^T A x \le 0$.

Suppose that system (ii) has no solution. Then the following linear programming problem
\[
\min\ 0^T y \quad \text{s.t.}\quad A^T y = c,\ \ y \ge 0
\]
is infeasible. Using linear programming duality theory, we conclude that the dual LP is either infeasible or unbounded. The dual LP is
\[
\max\ c^T x \quad \text{s.t.}\quad Ax \le 0,
\]
and it is obviously feasible (e.g., $x = 0$). Therefore, it is unbounded, i.e., there exists a vector $x$ such that $Ax \le 0$ and $c^T x > 0$. Hence, system (i) has a solution.

For those of you not familiar with linear programming duality theory, an alternative proof ofTheorem 6.3 is provided in subsection 6.9 (in fact, the results of subsection 6.9 can be used toderive linear programming duality).
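To make the alternative concrete, here is a minimal numerical sketch (assuming NumPy and SciPy's `linprog` are available; the matrix `A` and vector `c` are made-up illustration data, not from the notes). It checks system (ii) as a feasibility LP and, if (ii) is infeasible, exhibits a solution of system (i) by maximizing $c^T x$ over $Ax \le 0$ with the normalization $c^T x \le 1$.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up data for illustration only.
A = np.array([[1.0, 2.0],
              [-1.0, 1.0]])
c = np.array([1.0, -3.0])

# System (ii): A^T y = c, y >= 0 -- a pure feasibility LP (zero objective).
res_ii = linprog(c=np.zeros(A.shape[0]), A_eq=A.T, b_eq=c,
                 bounds=[(0, None)] * A.shape[0], method="highs")

if res_ii.status == 0:
    print("System (ii) has a solution y =", res_ii.x)
else:
    # System (i): A x <= 0, c^T x > 0.  The feasible cone is unbounded, so we
    # add the normalization c^T x <= 1 and maximize c^T x (minimize -c^T x).
    res_i = linprog(c=-c,
                    A_ub=np.vstack([A, c.reshape(1, -1)]),
                    b_ub=np.append(np.zeros(A.shape[0]), 1.0),
                    bounds=[(None, None)] * A.shape[1], method="highs")
    # By Farkas' Lemma, this branch must produce c^T x > 0.
    print("System (i) has a solution x =", res_i.x, "with c^T x =", c @ res_i.x)
```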


Lemma 6.4 (Key Lemma) Exactly one of the two following systems has a solution:

(i) $Ax < 0$, $Bx \le 0$, $Hx = 0$

(ii) $A^T u + B^T v + H^T w = 0$, $u \ge 0$, $v \ge 0$, $e^T u = 1$.

Proof: It is easy to show that both (i) and (ii) cannot have a solution. Suppose (i) does not have a solution. Then the system
\[
Ax + e\theta \le 0,\ \ \theta > 0, \qquad Bx \le 0, \qquad Hx \le 0, \qquad -Hx \le 0
\]
has no solution. This system can be re-written in the form
\[
\begin{bmatrix} A & e \\ B & 0 \\ H & 0 \\ -H & 0 \end{bmatrix}
\begin{pmatrix} x \\ \theta \end{pmatrix} \le 0,
\qquad
(0, \ldots, 0, 1) \begin{pmatrix} x \\ \theta \end{pmatrix} > 0.
\]

From Farkas’ Lemma, there exists a vector $(u; v; w_1; w_2) \ge 0$ such that
\[
\begin{bmatrix} A & e \\ B & 0 \\ H & 0 \\ -H & 0 \end{bmatrix}^T
\begin{pmatrix} u \\ v \\ w_1 \\ w_2 \end{pmatrix}
=
\begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix},
\]
or $A^T u + B^T v + H^T(w_1 - w_2) = 0$, $e^T u = 1$. Letting $w = w_1 - w_2$ completes the proof of the lemma.

6.3 First Order Optimality Conditions

Theorem 6.5 (Fritz John Necessary Conditions) Let $\bar x$ be a feasible solution of (P). If $\bar x$ is a local minimizer, then there exists $(u_0, u, v)$ such that
\[
u_0 \nabla f(\bar x) + \sum_{i=1}^{m} u_i \nabla g_i(\bar x) + \sum_{i=1}^{l} v_i \nabla h_i(\bar x) = 0,
\]
\[
(u_0, u) \ge 0, \qquad (u_0, u, v) \ne 0,
\]
\[
u_i\, g_i(\bar x) = 0, \quad i = 1, \ldots, m.
\]

(Note that the first equation can be rewritten as $u_0 \nabla f(\bar x) + \nabla g(\bar x) u + \nabla h(\bar x) v = 0$.)

Proof: If the vectors $\nabla h_i(\bar x)$ are linearly dependent, then there exists $v \ne 0$ such that $\nabla h(\bar x) v = 0$. Setting $(u_0, u) = 0$ establishes the result.

Suppose now that the vectors $\nabla h_i(\bar x)$ are linearly independent. Then Theorem 6.2 applies, i.e., $F_0 \cap G_0 \cap H_0 = \emptyset$. Assume for simplicity that $I = \{1, \ldots, p\}$. Let
\[
A = \begin{bmatrix} \nabla f(\bar x)^T \\ \nabla g_1(\bar x)^T \\ \vdots \\ \nabla g_p(\bar x)^T \end{bmatrix},
\qquad
H = \begin{bmatrix} \nabla h_1(\bar x)^T \\ \vdots \\ \nabla h_l(\bar x)^T \end{bmatrix}.
\]


Then there is no $d$ that satisfies $Ad < 0$, $Hd = 0$. From the Key Lemma there exist $(u_0, u_1, \ldots, u_p)$ and $(v_1, \ldots, v_l)$ such that
\[
u_0 \nabla f(\bar x) + \sum_{i=1}^{p} u_i \nabla g_i(\bar x) + \sum_{i=1}^{l} v_i \nabla h_i(\bar x) = 0,
\]
with $u_0 + u_1 + \cdots + u_p = 1$ and $(u_0, u_1, \ldots, u_p) \ge 0$. Define $u_{p+1} = \cdots = u_m = 0$. Then $(u_0, u) \ge 0$, $(u_0, u) \ne 0$, and for any $i$, either $g_i(\bar x) = 0$ or $u_i = 0$. Finally,
\[
u_0 \nabla f(\bar x) + \nabla g(\bar x) u + \nabla h(\bar x) v = 0.
\]

Theorem 6.6 (KKT Necessary Conditions) Let $\bar x$ be a feasible solution of (P) and let $I = \{i : g_i(\bar x) = 0\}$. Further, suppose that the gradients of all constraints active at $\bar x$, i.e., $\nabla g_i(\bar x)$ for $i \in I$ and $\nabla h_i(\bar x)$ for $i = 1, \ldots, l$, are linearly independent. If $\bar x$ is a local minimizer, there exists $(u, v)$ such that
\[
\nabla f(\bar x) + \nabla g(\bar x) u + \nabla h(\bar x) v = 0,
\]
\[
u \ge 0, \qquad u_i\, g_i(\bar x) = 0, \quad i = 1, \ldots, m.
\]

Proof: $\bar x$ must satisfy the Fritz John conditions. If $u_0 > 0$, we can redefine $u \leftarrow u/u_0$ and $v \leftarrow v/u_0$. If $u_0 = 0$, it would imply that $\sum_{i \in I} u_i \nabla g_i(\bar x) + \sum_{i=1}^{l} v_i \nabla h_i(\bar x) = 0$ with $(u, v) \ne 0$, i.e., the above gradients are linearly dependent. This contradicts the assumptions of the theorem.

6.4 Examples

Example 1 Consider the problem:

\[
\begin{array}{rl}
\min & 6(x_1 - 10)^2 + 4(x_2 - 12.5)^2 \\
\text{s.t.} & x_1^2 + (x_2 - 5)^2 \le 50 \\
& x_1^2 + 3x_2^2 \le 200 \\
& (x_1 - 6)^2 + x_2^2 \le 37
\end{array}
\]

In this problem, we have:
\[
\begin{aligned}
f(x) &= 6(x_1 - 10)^2 + 4(x_2 - 12.5)^2 \\
g_1(x) &= x_1^2 + (x_2 - 5)^2 - 50 \\
g_2(x) &= x_1^2 + 3x_2^2 - 200 \\
g_3(x) &= (x_1 - 6)^2 + x_2^2 - 37
\end{aligned}
\]

We also have:
\[
\nabla f(x) = \begin{pmatrix} 12(x_1 - 10) \\ 8(x_2 - 12.5) \end{pmatrix}, \qquad
\nabla g_1(x) = \begin{pmatrix} 2x_1 \\ 2(x_2 - 5) \end{pmatrix}, \qquad
\nabla g_2(x) = \begin{pmatrix} 2x_1 \\ 6x_2 \end{pmatrix}, \qquad
\nabla g_3(x) = \begin{pmatrix} 2(x_1 - 6) \\ 2x_2 \end{pmatrix}.
\]


Let us determine whether or not the point $\bar x = (x_1, x_2) = (7, 6)$ is a candidate to be an optimal solution to this problem.

We first check for feasibility:
\[
g_1(\bar x) = 0 \le 0, \qquad g_2(\bar x) = -43 < 0, \qquad g_3(\bar x) = 0 \le 0.
\]

To check for optimality, we compute all gradients at $\bar x$:
\[
\nabla f(\bar x) = \begin{pmatrix} -36 \\ -52 \end{pmatrix}, \qquad
\nabla g_1(\bar x) = \begin{pmatrix} 14 \\ 2 \end{pmatrix}, \qquad
\nabla g_2(\bar x) = \begin{pmatrix} 14 \\ 36 \end{pmatrix}, \qquad
\nabla g_3(\bar x) = \begin{pmatrix} 2 \\ 12 \end{pmatrix}.
\]

Note that the first and third constraints are active at $\bar x$, and since their gradients evaluated at $\bar x$ are linearly independent, the assumptions of Theorem 6.6 are satisfied. So, we next check to see if the gradients “line up” (i.e., satisfy the KKT necessary conditions), by trying to solve for $u_1 \ge 0$, $u_2 = 0$, $u_3 \ge 0$ in the following system:
\[
\begin{pmatrix} -36 \\ -52 \end{pmatrix}
+ \begin{pmatrix} 14 \\ 2 \end{pmatrix} u_1
+ \begin{pmatrix} 14 \\ 36 \end{pmatrix} u_2
+ \begin{pmatrix} 2 \\ 12 \end{pmatrix} u_3
= \begin{pmatrix} 0 \\ 0 \end{pmatrix}.
\]
Notice that $u = (u_1, u_2, u_3) = (2, 0, 4)$ solves this system, and that $u \ge 0$ and $u_2 = 0$. Therefore $\bar x$ is a candidate to be an optimal solution of this problem.
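A quick numerical confirmation of the computation above (a NumPy sketch, not part of the original notes): evaluate the gradients at $\bar x = (7, 6)$, solve the $2 \times 2$ system for $(u_1, u_3)$ with $u_2 = 0$ fixed, and check nonnegativity.

```python
import numpy as np

x = np.array([7.0, 6.0])

grad_f  = np.array([12 * (x[0] - 10), 8 * (x[1] - 12.5)])   # (-36, -52)
grad_g1 = np.array([2 * x[0], 2 * (x[1] - 5)])              # ( 14,   2)
grad_g3 = np.array([2 * (x[0] - 6), 2 * x[1]])              # (  2,  12)

# KKT gradient condition with u2 = 0:  grad_f + u1*grad_g1 + u3*grad_g3 = 0.
G = np.column_stack([grad_g1, grad_g3])
u1, u3 = np.linalg.solve(G, -grad_f)

print("u1 =", u1, " u3 =", u3)                             # expect 2.0 and 4.0
print("residual:", grad_f + u1 * grad_g1 + u3 * grad_g3)   # ~ (0, 0)
assert u1 >= 0 and u3 >= 0   # multipliers of inequality constraints must be nonnegative
```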

Example 2 Consider the problem (P):
\[
\text{(P)}: \quad \max_x\ x^T Q x \quad \text{s.t.}\ \|x\| \le 1,
\]
where $Q$ is symmetric. This is equivalent to:
\[
\text{(P)}: \quad \min_x\ -x^T Q x \quad \text{s.t.}\ x^T x \le 1.
\]
The Fritz John conditions are:
\[
-2u_0 Q x + 2 u x = 0, \quad x^T x \le 1, \quad (u_0, u) \ge 0, \quad (u_0, u) \ne 0, \quad u(1 - x^T x) = 0.
\]


Suppose that $u_0 = 0$. Then $u \ne 0$. However, the gradient condition implies that $x = 0$, and then from complementarity we have $u = 0$, which is a contradiction. Therefore we can assume that $u_0 > 0$ and so instead work directly with the KKT conditions:
\[
-2Qx + 2ux = 0, \quad x^T x \le 1, \quad u \ge 0, \quad u(1 - x^T x) = 0.
\]

One solution to the KKT system is $x = 0$, $u = 0$, with objective function value $x^T Q x = 0$. Are there any better solutions to the KKT system?

If $x \ne 0$ is a solution of the KKT system together with some value $u$, then $x$ is an eigenvector of $Q$ with nonnegative eigenvalue $u$. Also, $x^T Q x = u x^T x = u$ (for $u > 0$, complementarity forces $x^T x = 1$), and so the objective value of this solution is $u$. Therefore the solution of (P) with the largest objective function value is $x = 0$ if the largest eigenvalue of $Q$ is nonpositive. If the largest eigenvalue of $Q$ is positive, then the optimal objective value of (P) is the largest eigenvalue, and the optimal solution is any eigenvector $x$ corresponding to this eigenvalue, normalized so that $\|x\| = 1$.
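This characterization is easy to confirm numerically. The sketch below (NumPy; the symmetric matrix `Q` is made-up illustration data) compares the largest eigenvalue of $Q$ with the value $x^T Q x$ at the corresponding unit eigenvector.

```python
import numpy as np

# Made-up symmetric matrix for illustration.
Q = np.array([[2.0, 1.0],
              [1.0, -3.0]])

eigvals, eigvecs = np.linalg.eigh(Q)        # eigh: eigendecomposition for symmetric matrices
lam_max = eigvals[-1]
v = eigvecs[:, -1]                          # unit eigenvector for lam_max

if lam_max > 0:
    x_opt, val = v, lam_max                 # optimal value is the top eigenvalue
else:
    x_opt, val = np.zeros(Q.shape[0]), 0.0  # otherwise x = 0 is optimal

print("largest eigenvalue:", lam_max)
print("x^T Q x at candidate:", x_opt @ Q @ x_opt, " (should equal", val, ")")
```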

Example 3 Consider the problem:

\[
\begin{array}{rl}
\min & (x_1 - 12)^2 + (x_2 + 6)^2 \\
\text{s.t.} & x_1^2 + 3x_1 + x_2^2 - 4.5x_2 \le 6.5 \\
& (x_1 - 9)^2 + x_2^2 \le 64 \\
& 8x_1 + 4x_2 = 20
\end{array}
\]

In this problem, we have:
\[
\begin{aligned}
f(x) &= (x_1 - 12)^2 + (x_2 + 6)^2 \\
g_1(x) &= x_1^2 + 3x_1 + x_2^2 - 4.5x_2 - 6.5 \\
g_2(x) &= (x_1 - 9)^2 + x_2^2 - 64 \\
h_1(x) &= 8x_1 + 4x_2 - 20
\end{aligned}
\]

Let us determine whether or not the point $\bar x = (x_1, x_2) = (2, 1)$ is a candidate to be an optimal solution to this problem.

We first check for feasibility:
\[
g_1(\bar x) = 0 \le 0, \qquad g_2(\bar x) = -14 < 0, \qquad h_1(\bar x) = 0.
\]

To check for optimality, we compute all gradients at $\bar x$:
\[
\nabla f(\bar x) = \begin{pmatrix} -20 \\ 14 \end{pmatrix}, \qquad
\nabla g_1(\bar x) = \begin{pmatrix} 7 \\ -2.5 \end{pmatrix}, \qquad
\nabla g_2(\bar x) = \begin{pmatrix} -14 \\ 2 \end{pmatrix}, \qquad
\nabla h_1(\bar x) = \begin{pmatrix} 8 \\ 4 \end{pmatrix}.
\]

Again, the KKT conditions are necessary here (the gradients of the constraints active at $\bar x$, namely $\nabla g_1(\bar x)$ and $\nabla h_1(\bar x)$, are linearly independent), so we next check to see if the gradients “line up” by trying to solve for $u_1 \ge 0$, $u_2 = 0$, $v_1$ in the following system:
\[
\begin{pmatrix} -20 \\ 14 \end{pmatrix}
+ \begin{pmatrix} 7 \\ -2.5 \end{pmatrix} u_1
+ \begin{pmatrix} -14 \\ 2 \end{pmatrix} u_2
+ \begin{pmatrix} 8 \\ 4 \end{pmatrix} v_1
= \begin{pmatrix} 0 \\ 0 \end{pmatrix}.
\]
Notice that $(u, v) = (u_1, u_2, v_1) = (4, 0, -1)$ solves this system and that $u \ge 0$ and $u_2 = 0$. Therefore $\bar x$ is a candidate to be an optimal solution of this problem.
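As in Example 1, the multipliers can be recovered numerically; here is a short NumPy sketch (not part of the original notes) solving for $(u_1, v_1)$ with $u_2 = 0$ fixed.

```python
import numpy as np

x = np.array([2.0, 1.0])

grad_f  = np.array([2 * (x[0] - 12), 2 * (x[1] + 6)])   # (-20, 14)
grad_g1 = np.array([2 * x[0] + 3, 2 * x[1] - 4.5])      # (  7, -2.5)
grad_h1 = np.array([8.0, 4.0])

# KKT gradient condition with u2 = 0:  grad_f + u1*grad_g1 + v1*grad_h1 = 0.
G = np.column_stack([grad_g1, grad_h1])
u1, v1 = np.linalg.solve(G, -grad_f)

print("u1 =", u1, " v1 =", v1)   # expect 4.0 and -1.0
assert u1 >= 0                    # only the inequality multiplier is sign-constrained
```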

6.5 Convexity and Sufficient Conditions

The problem
\[
\text{(P)} \quad \min\ f(x) \quad \text{s.t.}\quad g(x) \le 0,\ \ h(x) = 0,\ \ x \in X
\]
is a convex program if $f$ and $g_i$, $i = 1, \ldots, m$, are convex functions, $h_i$, $i = 1, \ldots, l$, are affine functions, and $X$ is an open convex set.

Theorem 6.7 (KKT First-order Sufficient Conditions) Let $\bar x$ be a feasible point of (P), and suppose it satisfies the KKT conditions, i.e., for some $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^l$,
\[
\nabla f(\bar x) + \nabla g(\bar x) u + \nabla h(\bar x) v = 0,
\]
\[
u \ge 0, \qquad u_i\, g_i(\bar x) = 0, \quad i = 1, \ldots, m.
\]
If (P) is a convex program, then $\bar x$ is a global minimizer of (P).

Proof: Because each $g_i$ is convex, the level sets
\[
C_i = \{x \in X : g_i(x) \le 0\}, \quad i = 1, \ldots, m,
\]
are convex sets. Also, because each $h_i$ is affine, the sets
\[
D_i = \{x \in X : h_i(x) = 0\}, \quad i = 1, \ldots, l,
\]
are convex. Thus, since the intersection of convex sets is convex, the feasible region
\[
S = \{x \in X : g(x) \le 0,\ h(x) = 0\}
\]
is a convex set.

Let $x \in S$ be any point different from $\bar x$. Then $\lambda x + (1 - \lambda)\bar x$ is feasible for all $\lambda \in (0, 1)$. Thus
\[
\forall i \in I, \quad g_i(\lambda x + (1 - \lambda)\bar x) = g_i(\bar x + \lambda(x - \bar x)) \le 0 = g_i(\bar x)
\]


for any $\lambda \in (0, 1)$, and since the value of $g_i$ does not increase by moving in the direction $x - \bar x$, we must have $\nabla g_i(\bar x)^T (x - \bar x) \le 0$ for all $i \in I$.

Similarly, $h_i(\bar x + \lambda(x - \bar x)) = 0$, and so $\nabla h_i(\bar x)^T (x - \bar x) = 0$ for all $i = 1, \ldots, l$.

Thus, from the KKT conditions,
\[
\nabla f(\bar x)^T (x - \bar x) = -\left(\nabla g(\bar x) u + \nabla h(\bar x) v\right)^T (x - \bar x) \ge 0,
\]
and by the gradient inequality, $f(x) \ge f(\bar x)$ for any feasible $x$.

Returning to our examples, in Example 1, note that $f(x)$, $g_1(x)$, $g_2(x)$, and $g_3(x)$ are all convex functions. Therefore the problem is a convex optimization problem, and the KKT conditions are sufficient. Therefore $\bar x = (7, 6)$ is the global minimizer.

In Example 3, note that $f(x)$, $g_1(x)$, $g_2(x)$ are all convex functions and that $h_1(x)$ is an affine function. Therefore the problem is a convex optimization problem, and the KKT conditions are sufficient. Therefore $\bar x = (2, 1)$ is the global minimizer.
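Since all functions in Examples 1 and 3 are quadratic, their Hessians are constant, and convexity can be verified by checking that each Hessian is positive semidefinite. A small NumPy sketch for Example 1 (illustration only, not part of the original notes):

```python
import numpy as np

# Constant Hessians of f, g1, g2, g3 from Example 1.
hessians = {
    "f":  np.diag([12.0, 8.0]),
    "g1": np.diag([2.0, 2.0]),
    "g2": np.diag([2.0, 6.0]),
    "g3": np.diag([2.0, 2.0]),
}

for name, H in hessians.items():
    # A symmetric matrix is positive semidefinite iff all eigenvalues are >= 0.
    print(name, "convex:", bool(np.all(np.linalg.eigvalsh(H) >= -1e-12)))
```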

6.6 Constraint qualifications, or when are necessary conditions really necessary?

Recall that the statement of the KKT necessary conditions established above has the form “if $\bar x$ is a local minimizer of (P) and some requirement for the constraints holds, then the KKT conditions must hold at $\bar x$.” This additional requirement for the constraints that enables us to proceed with the proof of the KKT conditions is called a constraint qualification.

We have already established (Theorem 6.6) that the following is a constraint qualification:

Linear Independence Constraint Qualification: The gradients $\nabla g_i(\bar x)$, $i \in I$, and $\nabla h_i(\bar x)$, $i = 1, \ldots, l$, are linearly independent.

We will now establish two other useful constraint qualifications.

Theorem 6.8 (Slater’s condition) Suppose the problem (P) satisfies the Slater condition, i.e., $g_i$, $i = 1, \ldots, m$, are convex, $h_i$, $i = 1, \ldots, l$, are affine, $\nabla h_i(x)$, $i = 1, \ldots, l$, are linearly independent, and there exists a point $x^0 \in X$ which satisfies $h(x^0) = 0$ and $g(x^0) < 0$. Then the KKT conditions are necessary to characterize an optimal solution.

Proof: Let $\bar x$ be a local minimizer. The Fritz John conditions are necessary for this problem, i.e., there must exist $(u_0, u, v) \ne 0$ such that $(u_0, u) \ge 0$ and
\[
u_0 \nabla f(\bar x) + \nabla g(\bar x) u + \nabla h(\bar x) v = 0, \qquad u_i\, g_i(\bar x) = 0.
\]
If $u_0 > 0$, dividing through by $u_0$ demonstrates the KKT conditions. Now suppose $u_0 = 0$. Let $d = x^0 - \bar x$. Then for each $i \in I$, $0 = g_i(\bar x) > g_i(x^0)$, and by the gradient inequality applied to $g_i$, $\nabla g_i(\bar x)^T d < 0$. Also, since the $h_i$ are affine, $\nabla h(\bar x)^T d = 0$. Thus,
\[
0 = 0^T d = \left(\nabla g(\bar x) u + \nabla h(\bar x) v\right)^T d < 0
\]
unless $u_i = 0$ for all $i \in I$. Therefore, $v \ne 0$ and $\nabla h(\bar x) v = 0$, violating the linear independence assumption. This is a contradiction, and so $u_0 > 0$.

It is useful to contrast the Slater condition as stated here to the linear independence constraint qualification. The linear independence constraint qualification (as most others) has a “local” flavor, i.e., all conditions depend on the particular point $\bar x$ we are considering to be the candidate for optimality. Therefore, we can’t really verify that the conditions hold until we have a particular point in mind (and thus to construct a complete list of candidates for local optimality, we need to consider all points that satisfy the KKT conditions, as well as all feasible points that do not satisfy the constraint qualification!). It simplifies our task tremendously if our problem satisfies some “global” constraint qualification, such as the Slater condition as stated in Theorem 6.8. Then we know that every candidate must satisfy the KKT conditions! The next constraint qualification is also of a “global” nature:

Theorem 6.9 (Linear constraints) If all constraints are linear, i.e., if all functions $g_i$ and $h_i$ are affine, the KKT conditions are necessary to characterize an optimal solution.

Proof: Our problem is $\min\{f(x) : Ax \le b,\ Mx = g\}$. Suppose $\bar x$ is a local optimum. Without loss of generality, we can partition the constraints $Ax \le b$ into groups $A_I x \le b_I$ and $A_{\bar I} x \le b_{\bar I}$ such that $A_I \bar x = b_I$ and $A_{\bar I} \bar x < b_{\bar I}$. Then at $\bar x$, the set $\{d : Md = 0,\ A_I d \le 0\}$ is precisely the set of feasible directions. Thus, in particular, for every such $d$, $\nabla f(\bar x)^T d \ge 0$, for otherwise $d$ would be a feasible descent direction at $\bar x$, violating its local optimality. Therefore, the linear system
\[
\begin{bmatrix} A_I \\ M \\ -M \end{bmatrix} d \le 0, \qquad -\nabla f(\bar x)^T d > 0
\]
has no solution.

From Farkas’ lemma, there exists $(u, v_1, v_2) \ge 0$ such that $A_I^T u + M^T v_1 - M^T v_2 = -\nabla f(\bar x)$. Taking $v = v_1 - v_2$, we obtain the KKT conditions.

6.7 Second order conditions

To describe the second order conditions for optimality, we will define the following function, known as the Lagrangian function, or simply the Lagrangian:

\[
L(x, u, v) = f(x) + \sum_{i=1}^{m} u_i g_i(x) + \sum_{i=1}^{l} v_i h_i(x).
\]

Using the Lagrangian, we can, for example, re-write the gradient conditions of the KKT necessary conditions as follows:
\[
\nabla_x L(\bar x, u, v) = 0, \tag{3}
\]
since $\nabla_x L(x, u, v) = \nabla f(x) + \nabla g(x) u + \nabla h(x) v$.
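For instance, for Example 3 at $\bar x = (2, 1)$ with $(u, v) = (4, 0, -1)$, the gradient of the Lagrangian can be assembled and checked numerically (a NumPy sketch, not part of the original notes):

```python
import numpy as np

def grad_L(x, u, v):
    """Gradient of the Lagrangian for Example 3: f + u1*g1 + u2*g2 + v1*h1."""
    grad_f  = np.array([2 * (x[0] - 12), 2 * (x[1] + 6)])
    grad_g1 = np.array([2 * x[0] + 3, 2 * x[1] - 4.5])
    grad_g2 = np.array([2 * (x[0] - 9), 2 * x[1]])
    grad_h1 = np.array([8.0, 4.0])
    return grad_f + u[0] * grad_g1 + u[1] * grad_g2 + v[0] * grad_h1

x_bar = np.array([2.0, 1.0])
print(grad_L(x_bar, u=np.array([4.0, 0.0]), v=np.array([-1.0])))  # ~ [0. 0.]
```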

Also, note that $\nabla^2_{xx} L(x, u, v) = \nabla^2 f(x) + \sum_{i=1}^{m} u_i \nabla^2 g_i(x) + \sum_{i=1}^{l} v_i \nabla^2 h_i(x)$. Here we use the standard notation: $\nabla^2 q(x)$ denotes the Hessian of the function $q(x)$, and $\nabla^2_{xx} L(x, u, v)$ denotes the submatrix of the Hessian of $L(x, u, v)$ corresponding to the partial derivatives with respect to the $x$ variables only.

Theorem 6.10 (KKT second order necessary conditions) Suppose $\bar x$ is a local minimizer of (P), and $\nabla g_i(\bar x)$, $i \in I$, and $\nabla h_i(\bar x)$, $i = 1, \ldots, l$, are linearly independent. Then $\bar x$ is a KKT point, and, in addition,
\[
d^T \nabla^2_{xx} L(\bar x, u, v)\, d \ge 0
\]
for all $d \ne 0$ such that $\nabla g_i(\bar x)^T d \le 0$, $i \in I$, and $\nabla h_i(\bar x)^T d = 0$, $i = 1, \ldots, l$.


Theorem 6.11 (KKT second order sufficient conditions) Suppose the point $\bar x \in S$ together with multipliers $(u, v)$ satisfies the first order KKT conditions. Let $I^+ = \{i \in I : u_i > 0\}$ and $I^0 = \{i \in I : u_i = 0\}$. If, in addition,
\[
d^T \nabla^2_{xx} L(\bar x, u, v)\, d > 0
\]
for all $d \ne 0$ such that $\nabla g_i(\bar x)^T d \le 0$, $i \in I^0$, $\nabla g_i(\bar x)^T d = 0$, $i \in I^+$, and $\nabla h_i(\bar x)^T d = 0$, $i = 1, \ldots, l$, then $\bar x$ is a (strict) local minimizer.
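When $I^0 = \emptyset$ (every active inequality constraint has a strictly positive multiplier), the set of directions in Theorem 6.11 is the null space of the gradients of the constraints in $I^+$ and of the equality constraints, and the condition reduces to positive definiteness of the Hessian of the Lagrangian restricted to that null space. The following NumPy/SciPy sketch handles only this special case; the helper name and its inputs are ours, not from the notes.

```python
import numpy as np
from scipy.linalg import null_space

def second_order_sufficient(H_L, active_grads, tol=1e-10):
    """Check d^T H_L d > 0 on {d : g^T d = 0 for each row g of active_grads}.

    H_L          -- Hessian of the Lagrangian at the KKT point (n x n).
    active_grads -- rows are gradients of constraints in I^+ and of all h_i.
    Assumes I^0 is empty, so the critical cone is a subspace.
    """
    Z = null_space(np.atleast_2d(active_grads))   # basis of the critical subspace
    if Z.shape[1] == 0:
        return True                               # subspace is {0}: condition holds vacuously
    reduced = Z.T @ H_L @ Z                       # Hessian restricted to the subspace
    return bool(np.all(np.linalg.eigvalsh(reduced) > tol))

# Example 3 at x_bar = (2, 1) with (u1, u2, v1) = (4, 0, -1):
# H_L = H_f + 4*H_g1 + 0*H_g2 + (-1)*H_h1, and H_h1 = 0 since h1 is affine.
H_L = np.diag([2.0, 2.0]) + 4 * np.diag([2.0, 2.0])
A_active = np.array([[7.0, -2.5],    # grad g1(x_bar), in I^+
                     [8.0,  4.0]])   # grad h1(x_bar)
print(second_order_sufficient(H_L, A_active))   # True
```

For Example 3 the critical subspace is $\{0\}$, so the condition holds vacuously, consistent with $\bar x = (2, 1)$ being a (global) minimizer.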

6.8 Proof of Theorem 6.2

To prove Theorem 6.2, which allows general (nonlinear) equality constraints, we need the following tool, known as the Implicit Function Theorem.

Example: Let $h(x) = Ax - b$, where $A \in \mathbb{R}^{l \times n}$ has full row rank (i.e., its rows are linearly independent). Then we can partition the columns of $A$ and the elements of $x$ as follows: $A = [B, N]$, $x = (y; z)$, so that $B \in \mathbb{R}^{l \times l}$ is non-singular, and $h(x) = By + Nz - b$.

Let $s(z) = B^{-1} b - B^{-1} N z$. Then for any $z$, $h(s(z), z) = B s(z) + N z - b = 0$, i.e., $x = (s(z), z)$ solves $h(x) = 0$. This idea of “invertibility” of a system of equations is generalized (although only locally) by the following theorem (we will preserve the notation used above):

Theorem 6.12 (IFT) Let $h(x) : \mathbb{R}^n \to \mathbb{R}^l$ and $\bar x = (\bar y_1, \ldots, \bar y_l, \bar z_1, \ldots, \bar z_{n-l}) = (\bar y, \bar z)$ satisfy:

1. $h(\bar x) = 0$

2. $h(x)$ is continuously differentiable in a neighborhood of $\bar x$

3. The $l \times l$ Jacobian matrix
\[
\begin{bmatrix}
\frac{\partial h_1(\bar x)}{\partial y_1} & \cdots & \frac{\partial h_1(\bar x)}{\partial y_l} \\
\vdots & \ddots & \vdots \\
\frac{\partial h_l(\bar x)}{\partial y_1} & \cdots & \frac{\partial h_l(\bar x)}{\partial y_l}
\end{bmatrix}
\]
is non-singular.

Then there exists $\epsilon > 0$ along with functions $s(z) = (s_1(z), \ldots, s_l(z))$ such that for all $z \in N_\epsilon(\bar z)$, $h(s(z), z) = 0$ and the $s_k(z)$ are continuously differentiable. Moreover,
\[
\sum_{k=1}^{l} \frac{\partial h_i(s(z), z)}{\partial y_k} \cdot \frac{\partial s_k(z)}{\partial z_j} + \frac{\partial h_i(s(z), z)}{\partial z_j} = 0.
\]
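A numerical sketch of the affine example above (NumPy; the matrix $A$ and vector $b$ are made-up illustration data): partition $A = [B, N]$ with $B$ non-singular, form $s(z) = B^{-1}(b - Nz)$, and confirm that $h(s(z), z) = 0$ for an arbitrary $z$.

```python
import numpy as np

# Made-up affine system h(x) = A x - b with full row rank A (l = 2, n = 4).
A = np.array([[1.0, 0.0, 2.0, -1.0],
              [0.0, 3.0, 1.0,  1.0]])
b = np.array([4.0, 6.0])

l = A.shape[0]
B, N = A[:, :l], A[:, l:]            # first l columns form a non-singular B here

def s(z):
    """y = s(z) such that h((y; z)) = B y + N z - b = 0."""
    return np.linalg.solve(B, b - N @ z)

z = np.array([0.7, -2.0])            # any z works in the affine case
x = np.concatenate([s(z), z])
print("h(x) =", A @ x - b)           # ~ [0. 0.]
```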

Proof of Theorem 6.2: Let $A = \nabla h(\bar x)^T \in \mathbb{R}^{l \times n}$. Then $A$ has full row rank, and its columns (along with the corresponding elements of $x$) can be re-arranged so that $A = [B, N]$ and $x = (y; z)$, where $B$ is non-singular. Let $z$ lie in a small neighborhood of $\bar z$. Then, from the IFT, there exists $s(z)$ such that $h(s(z), z) = 0$.

Suppose $d \in F_0 \cap G_0 \cap H_0$. Let $d = (q; p)$; then $0 = Ad = Bq + Np$, or $q = -B^{-1} N p$. Let $z(\theta) = \bar z + \theta p$, $y(\theta) = s(z(\theta)) = s(\bar z + \theta p)$, and $x(\theta) = (y(\theta), z(\theta))$. We will derive a contradiction by showing that $d$ is an improving feasible direction, i.e., for small $\theta > 0$, $x(\theta)$ is feasible and $f(x(\theta)) < f(\bar x)$.


To show feasibility of $x(\theta)$, note that for $\theta > 0$ sufficiently small, by the IFT,
\[
h(x(\theta)) = h(s(z(\theta)), z(\theta)) = 0.
\]

Furthermore,
\[
0 = \frac{\partial h_i(x(\theta))}{\partial \theta}
= \sum_{k=1}^{l} \frac{\partial h_i(s(z(\theta)), z(\theta))}{\partial y_k} \cdot \frac{\partial s_k(z(\theta))}{\partial \theta}
+ \sum_{k=1}^{n-l} \frac{\partial h_i(s(z(\theta)), z(\theta))}{\partial z_k} \cdot \frac{\partial z_k(\theta)}{\partial \theta}.
\]
Let $r_k = \frac{\partial s_k(z(\theta))}{\partial \theta}$, and recall that $\frac{\partial z_k(\theta)}{\partial \theta} = p_k$. The above equation can then be re-written as $0 = Br + Np$, or $r = -B^{-1} N p = q$. Therefore, $\frac{\partial x_k(\theta)}{\partial \theta} = d_k$ for $k = 1, \ldots, n$.

For $i \in I$,
\[
g_i(x(\theta)) = g_i(\bar x) + \theta \left.\frac{\partial g_i(x(\theta))}{\partial \theta}\right|_{\theta=0} + |\theta|\, \alpha_i(\theta)
= \theta \sum_{k=1}^{n} \left.\frac{\partial g_i(x(\theta))}{\partial x_k} \cdot \frac{\partial x_k(\theta)}{\partial \theta}\right|_{\theta=0} + |\theta|\, \alpha_i(\theta)
= \theta\, \nabla g_i(\bar x)^T d + |\theta|\, \alpha_i(\theta),
\]
where $\alpha_i(\theta) \to 0$ as $\theta \to 0$. Hence $g_i(x(\theta)) < 0$ for all $i = 1, \ldots, m$ for $\theta > 0$ sufficiently small, and therefore, $x(\theta)$ is feasible for any $\theta > 0$ sufficiently small.

On the other hand,
\[
f(x(\theta)) = f(\bar x) + \theta\, \nabla f(\bar x)^T d + |\theta|\, \alpha(\theta) < f(\bar x)
\]
for $\theta > 0$ sufficiently small, which contradicts local optimality of $\bar x$.

6.9 Proving Farkas’ Lemma without appealing to LP duality theory

First, some definitions:

• If $p \ne 0$ is a vector in $\mathbb{R}^n$ and $\alpha$ is a scalar, $H = \{x \in \mathbb{R}^n : p^T x = \alpha\}$ is a hyperplane, and $H^+ = \{x \in \mathbb{R}^n : p^T x \ge \alpha\}$, $H^- = \{x \in \mathbb{R}^n : p^T x \le \alpha\}$ are half-spaces.

• Let $S$ and $T$ be two non-empty sets in $\mathbb{R}^n$. A hyperplane $H = \{x : p^T x = \alpha\}$ is said to separate $S$ and $T$ if $p^T x \ge \alpha$ for all $x \in S$ and $p^T x \le \alpha$ for all $x \in T$, i.e., if $S \subseteq H^+$ and $T \subseteq H^-$. If, in addition, $S \cup T \not\subset H$, then $H$ is said to properly separate $S$ and $T$.

• $H$ is said to strictly separate $S$ and $T$ if $p^T x > \alpha$ for all $x \in S$ and $p^T x < \alpha$ for all $x \in T$.

• $H$ is said to strongly separate $S$ and $T$ if for some $\epsilon > 0$, $p^T x > \alpha + \epsilon$ for all $x \in S$ and $p^T x < \alpha - \epsilon$ for all $x \in T$.

Theorem 6.13 (Separating hyperplane theorem) Let $S$ be a nonempty closed convex set in $\mathbb{R}^n$, and $y \notin S$. Then there exist $p \ne 0$ and $\alpha$ such that $H = \{x : p^T x = \alpha\}$ strongly separates $S$ and $\{y\}$.

To prove the theorem, we need the following result:

Theorem 6.14 Let $S$ be a nonempty closed convex set in $\mathbb{R}^n$, and $y \notin S$. Then there exists a unique point $\bar x \in S$ with minimum distance from $y$. Furthermore, $\bar x$ is the minimizing point if and only if $(y - \bar x)^T (x - \bar x) \le 0$ for all $x \in S$.


Proof: Let $\hat x$ be an arbitrary point in $S$, and let $\hat S = S \cap \{x : \|x - y\| \le \|\hat x - y\|\}$. Then $\hat S$ is a compact set. Let $f(x) = \|x - y\|$. Then $f(x)$ attains its minimum over the set $\hat S$ at some point $\bar x \in \hat S$ (note: $\bar x \ne y$).

To show uniqueness, suppose that $x' \in S$ is such that $\|y - \bar x\| = \|y - x'\|$. By convexity of $S$, $\frac{1}{2}(\bar x + x') \in S$. But by the triangle inequality, we get
\[
\left\| y - \tfrac{1}{2}(\bar x + x') \right\| \le \tfrac{1}{2}\|y - \bar x\| + \tfrac{1}{2}\|y - x'\|.
\]
If strict inequality holds, we have a contradiction. Therefore, equality holds, and we must have $y - \bar x = \lambda(y - x')$ for some $\lambda$. Since $\|y - \bar x\| = \|y - x'\|$, $|\lambda| = 1$. If $\lambda = -1$, then $y = \frac{1}{2}(\bar x + x') \in S$, contradicting the assumption. Hence, $\lambda = 1$, i.e., $x' = \bar x$.

Finally, we need to establish that $\bar x$ is the minimizing point if and only if $(y - \bar x)^T (x - \bar x) \le 0$ for all $x \in S$. To establish sufficiency, note that for any $x \in S$,
\[
\|x - y\|^2 = \|(x - \bar x) - (y - \bar x)\|^2 = \|x - \bar x\|^2 + \|y - \bar x\|^2 - 2(x - \bar x)^T (y - \bar x) \ge \|\bar x - y\|^2.
\]
Conversely, assume that $\bar x$ is the minimizing point. For any $x \in S$, $\lambda x + (1 - \lambda)\bar x \in S$ for any $\lambda \in [0, 1]$. Also, $\|\lambda x + (1 - \lambda)\bar x - y\| \ge \|\bar x - y\|$. Thus,
\[
\|\bar x - y\|^2 \le \|\lambda x + (1 - \lambda)\bar x - y\|^2 = \lambda^2 \|x - \bar x\|^2 + 2\lambda(x - \bar x)^T(\bar x - y) + \|\bar x - y\|^2,
\]
or $\lambda^2 \|x - \bar x\|^2 \ge 2\lambda(y - \bar x)^T(x - \bar x)$. This implies that $(y - \bar x)^T(x - \bar x) \le 0$ for any $x \in S$, since otherwise the above inequality could be invalidated by choosing $\lambda > 0$ small.
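The characterization in Theorem 6.14 is easy to test numerically whenever the projection onto $S$ is available in closed form. The sketch below (NumPy; made-up data) projects a point onto a box, where the projection is coordinate-wise clipping, and checks the inequality $(y - \bar x)^T(x - \bar x) \le 0$ at random feasible points.

```python
import numpy as np

rng = np.random.default_rng(0)

lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])   # the box S = [lo, hi]
y = np.array([2.0, 0.5])                                 # a point outside S

x_bar = np.clip(y, lo, hi)    # projection of y onto the box (closest point in S)

# Theorem 6.14: (y - x_bar)^T (x - x_bar) <= 0 for every x in S.
for _ in range(1000):
    x = rng.uniform(lo, hi)   # random feasible point
    assert (y - x_bar) @ (x - x_bar) <= 1e-12

print("projection:", x_bar)   # [1.  0.5]
```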

Proof of Theorem 6.13: Let $\bar x \in S$ be the point minimizing the distance from the point $y$ to the set $S$. Note that $\bar x \ne y$. Let $p = y - \bar x$, $\alpha = \frac{1}{2}(y - \bar x)^T(y + \bar x)$, and $\epsilon = \frac{1}{2}\|y - \bar x\|^2$. Then for any $x \in S$, $(x - \bar x)^T(y - \bar x) \le 0$, and so
\[
p^T x = (y - \bar x)^T x \le (y - \bar x)^T \bar x = (y - \bar x)^T \bar x + \tfrac{1}{2}\|y - \bar x\|^2 - \epsilon = \alpha - \epsilon.
\]
That is, $p^T x \le \alpha - \epsilon$ for all $x \in S$. On the other hand, $p^T y = (y - \bar x)^T y = \alpha + \epsilon$, establishing the result.

Proof of Theorem 6.3: Suppose the system (ii) has no solution. Let $S = \{x : x = A^T y \text{ for some } y \ge 0\}$. Then $c \notin S$. $S$ is easily seen to be a convex set. Also, $S$ is a closed set (see, for example, Appendix B.3 of “Nonlinear Programming” by Dimitri Bertsekas). Therefore, by the separating hyperplane theorem, there exist $p$ and $\alpha$ such that $c^T p > \alpha$ and $(Ap)^T y \le \alpha$ for all $y \ge 0$.

If $(Ap)_i > 0$, one could set $y_i$ sufficiently large so that $(Ap)^T y > \alpha$, a contradiction. Thus $Ap \le 0$. Taking $y = 0$, we also have that $\alpha \ge 0$, and so $c^T p > 0$, and $p$ is a solution of (i).