7/24/2019 An Optimization Primer
1/149
AN OPTIMIZATION PRIMER
An Introduction to Linear, Nonlinear,
Large Scale, Stochastic Programming
and Variational Analysis
Roger J-B Wets
University of California, Davis
Graphics by Maria E. Wets
AMS Classification: 90C15, 90xxx, 49J99
Date: August 26, 2005
PROLOGUE
The primordial objective of these lectures is to prepare the reader to deal with a wide variety of applications that include optimal allocation of (limited) resources, finding best estimates or best fits, etc. Optimization problems of this type arise in almost all areas of human industry: engineering, economics, agriculture, logistics, ecology, finance, information and communication technology, and so on. Because the solution may have to respond to an evolutionary system, in time and/or space, or must take into account uncertainty about some of the problem's data, we usually end up having to solve a large-scale optimization problem, and this, to a large extent, conditions our overall approach. Consequently, the layout of the material doesn't follow the pattern of more traditional introductions to optimization textbooks.
The main thrust won't be on the detailed analysis of specific algorithms, but on setting up the tools to deal with these large-scale applications. This doesn't mean that, eventually, we won't describe, justify and even establish convergence of some basic algorithmic procedures. To achieve our goals, we proceed, more or less, on three parallel tracks:
(i) modeling,
(ii) theoretical foundations that will allow us to analyze the properties of solutions as well as hone our modeling skills to help us build stable, easier-to-solve optimization problems, and

(iii) some numerical experimentation that will highlight some of the difficulties inherent in numerical implementation, but mostly to illustrate the use of elementary algorithmic procedures as building blocks of more sophisticated solution schemes.
The lectures are designed to serve as an introduction to the field of optimization for students who have a background roughly equivalent to a bachelor's degree in science, engineering or mathematics. More specifically, it's
expected that the reader has a good foundation in Differential Calculus, Linear Algebra, and is familiar with the abstract notion of function¹. The presentation also includes an introduction to a plain version of Probability that will enable the non-initiated reader to follow the sections dealing with stochastic programming.
A novel feature of this book is that decision-making-under-uncertainty models are an integral part of the exposition. There are two basic reasons for this. Stochastic optimization motivates the study of linear and nonlinear optimization, large-scale optimization, non-differentiable functions and variational analysis. But, more significantly, given our concern with intelligent modeling, one is bound to realize that very few important decision problems do not involve some level of uncertainty about some of their parameters. It's thus imperative that, from the outset, the reader be aware of the potential pitfalls of simplifying a model so as to skirt uncertainty. Nonetheless, it's possible to skip the chapters or sections dealing with stochastic programming without compromising the continuity of the presentation, but not without being shortchanged on the insight that comes from including uncertainty in the modeling of optimization problems.
There are 16 chapters, each one corresponding to what could be covered in about a week's lectures (three to four hours). Proofs, constructive in nature whenever possible, have been provided so that (i) an instructor doesn't have to go through them in meticulous detail but can limit the discussion to the main idea(s) accompanied by some relevant examples and counterexamples, and (ii) the argumentation can serve as a guide to solving the exercises. The theoretical side comes with almost no compromises, but there are a few rare exceptions that would have required lengthy mathematical detours that are not germane to the subject at hand, and are more appropriately dealt with in other texts or lectures.
Numerical software
As already mentioned, although we end up describing a significant number of algorithmic procedures, we don't concern ourselves directly with
¹An appendix provides a review of some standard notation and terminology as well as some basic results in analysis that might not be familiar to a heterogeneous student body, a typical situation for such a course.
implementation issues. These are best dealt with in specialized courses and textbooks such as [16, 10, 20]. For example, in the case of linear programming, there is a description of both the simplex and the interior point methods in Chapters 7 and 6, but from the outset it's assumed that packages to solve mathematical programs of various types, including linear programs, are available (CPLEX, IBM Solutions, LOQO, . . . ). To allow for some experimentation with these solution procedures, it's assumed that the reader has access to Matlab², in particular to the functions found in the Matlab Optimization Toolbox, and will be able to use them to solve the numerical exercises. These Matlab functionalities were used to solve the examples, and in a number of instances, the corresponding m-file has been supplied.
2Matlab is distributed by The MathWorks, Inc.
Contents

1 PRELUDE
1.1 Mathematical curtain rise
1.2 Curve fitting I
1.3 Steepest Descent and Newton methods
1.4 The Quasi-Newton methods
1.5 Integral functionals
1.6 In conclusion. . .

2 FORMULATION
2.1 A product mix problem
2.2 Curve fitting II
2.3 A network capacity expansion problem
2.4 Discrete decision variables
2.5 The Broadway producer problem

3 PRELIMINARIES
3.1 Variational analysis I
3.2 Variational analysis II
3.3 Plain probability distributions
3.4 Expectation functionals I
3.5 Analysis of the producer's problem

4 LINEAR CONSTRAINTS
4.1 Linearly constrained programs
4.2 Variational analysis III
4.3 Variational analysis IV
4.4 Lagrange multipliers
4.5 Karush-Kuhn-Tucker conditions I

5 SIMPLE RECOURSE: RHS
5.1 Random right hand sides
5.2 Aircraft allocation to routes I
5.3 Separable simple recourse I
5.4 Aircraft allocation to routes II
5.5 Approximations I

6 LAGRANGIANS
6.1 Saddle functions
6.2 Primal and Dual problems
6.3 A primal-dual interior-point method
6.4 Monitoring functions
6.5 Lake Stoopt I
6.6 Separable simple recourse II
6.7 The Lagrangian finite generation method
6.8 Lake Stoopt II

7 POLYHEDRAL CONVEXITY
7.1 Polyhedral sets
7.2 Full duality: linear programs
7.3 Variational analysis V
7.4 The simplex method

8 OPTIMALITY & DUALITY
8.1 Variational analysis VI
8.2 Separation Theorems
8.3 Variational analysis VIII
8.4 Karush-Kuhn-Tucker conditions II
8.5 Variational analysis VII: Conjugacy
8.6 General duality theory
8.7 Geometric programming
8.8 Semi-definite programming

9 LINEAR RECOURSE
9.1 Fixed recourse and fixed costs
9.2 A Manufacturing model
9.3 Feasibility
9.4 Stochastic linear programs with recourse
9.5 Optimality conditions
9.6 Network capacity expansion II
9.7 Preprocessing
9.8 A summary
9.9 Practical probability II
9.10 Expectation functionals II
9.11 Disintegration Principle
9.12 Stochastic programming: duality

10 DECOMPOSITION
10.1 Lagrangian relaxation
10.2 Sequential linear programming
10.3 The L-shaped method
10.4 Dantzig-Wolfe decomposition
10.5 An optimal control problem
10.6 A targeting problem
10.7 Linear-quadratic control models
10.8 A hydro-power generation problem

11 APPROXIMATION THEORY
11.1 Formulation
11.2 Epi-convergence
11.3 Barrier & Penalty methods, exact?
11.4 Infinite dimensional theory
11.5 Approximation of control problems
11.6 Approximation of stochastic programs
11.7 Approximation of statistical estimation problems
11.8 Augmented Lagrangians
11.9 Variational Analysis ??
11.10 Proximal point algorithm
11.11 Method of multipliers: equalities
11.12 Method of multipliers: inequalities
11.13 Application to engineering design

12 NONLINEAR OPTIMIZATION
12.1 Statistical estimation: An introduction
12.1.1 The discrete case
12.2 Statistical estimation: parametric
12.3 Statistical estimation: non-parametric
12.4 Non-convex optimization
12.5 KKT-optimality conditions
12.6 Sequential quadratic programming
12.7 Trust regions

13 EQUILIBRIUM PROBLEMS
13.1 Convex-type equilibrium problems
13.2 Variational inequalities
13.3 Monotone Operators
13.4 Complementarity problem
13.5 Application in Mechanics
13.6 Pricing an American option
13.7 Market Equilibrium: Walras
13.8 Application to traffic, transportation
13.9 Non-cooperative games: Nash
13.10 Energy? Communications pricing

14 NON-DIFFERENTIABLE OPTIMIZATION
14.1 Bundle method
14.2 Example of the bundle method
14.3 Stochastic quasi-gradient method
14.4 Application: urn problem
14.5 Sampled gradient (Burke, Lewis & Overton)
14.6 Eigenvalue calculations

15 DYNAMIC PROBLEMS
15.1 Optimal control problems
15.2 Hamiltonian, Pontryagin's
15.3 Polak's minmax approach
15.4 Multistage stochastic programs
15.5 Progressive hedging algorithm
15.6 Water reservoirs management
15.7 Linear-quadratic stochastic control
15.8 Pricing a contingent claim

16 TOPICS IN STOCHASTIC PROGRAMMING
16.1 The distribution problem
16.2 Application to robotics
16.3 Chance constraints
16.4 Reliability of networks
16.5 Risk measures
16.6 Modeling decision making under uncertainty

A Notation and terminology
A.1 Existence: limits & minimizers
A.2 Function expansions
Chapter 1
PRELUDE
"How simple and clear this is," thought Pierre. "How could I not have known this before?" War and Peace, Leo Tolstoy.
Let's begin our journey in the classical landscape: all functions to be minimized are smooth and there are no side constraints! The mathematical foundations were laid down in the middle of the last millennium by two genial mathematical dabblers. As we shall see, the rules they formulated provide the guidelines for building an optimization theory, in the classical framework as well as in the non-classical framework that's going to be our main concern in this book. This chapter also covers the basic algorithmic procedures to find, at least numerically, the minimizer(s) of smooth multivariate functions. Although our interest will be primarily in the minimization of functions, subject or not to constraints, defined on IR^n, the last section shows that the rules to identify minimizers are also applicable to functions defined on infinite-dimensional spaces, like a space of arcs connecting two points, etc.
1.1 Mathematical curtain rise
The modern theory of optimization starts in the middle of the 14th century with Nicholas Oresme (1323-1382), part-time mathematician and full-time Bishop of Lisieux (France). In his treatise [15], he remarks that near a minimum, the increment of a variable quantity becomes 0. A present-day
version would read

    Oresme Rule:  x ∈ argmin f  ⟹  df(x; w) = 0 for all w ∈ IR,

where argmin f is the set of minimizers of f, i.e., the arguments that minimize f, and

    df(x; w) := lim_{τ↓0} [f(x + τw) − f(x)]/τ

is the derivative¹ of the function f : IR → IR at the point x in direction w, i.e., the limit of the incremental value of f at x in direction w.

[Figure 1.1: Derivative function df(x; ·) identifying incremental changes at x.]
About three centuries later, Pierre de Fermat (1601-1665), another part-time mathematician and full-time ??-lawyer at the Royal Court in Toulouse (France), while working on the long-standing tangent problem, observed that for x to be a minimizer of a function f, the tangent to the graph of the function f at the point (x, f(x)) must be parallel to the x-axis. In the notation of Differential Calculus², one would express this as

    Fermat Rule:  x ∈ argmin f  ⟹  f′(x) = 0,

where

    f′(x) := lim_{τ↓0} (1/τ)[f(x + τ) − f(x)]

¹called the Gateaux derivative when it's necessary to distinguish it from some alternative definitions of derivative.
²whose development can be viewed as a continuation and a formalization of Fermat's work on the tangent problem.
[Figure 1.2: Horizontal tangent to the graph of a function at a minimum, at the point (x*, f(x*)).]
is the slope³ of the tangent at x. Implicit in the formulation of these optimality criteria is the assumption: f is smooth, i.e., continuously differentiable; in those days, only smooth functions were considered to be of any interest. And for smooth functions, one has

    ∀w ∈ IR:  df(x; w) = f′(x) w,

as is immediate from the definitions. Consequently, the Fermat rule can be derived from Oresme's rule and vice versa.
To extend Oresme's rule to functions defined on IR^n, again assuming smoothness, one has to consider possible moves, or variations, not just to the right or the left, but in every possible direction. And the rule becomes:

    x ∈ argmin f  ⟹  df(x; w) = 0, ∀w ∈ IR^n,

whereas Fermat's rule now takes the form

    x ∈ argmin f  ⟹  ∇f(x) = 0.
Indeed, the slope of the tangent to the graph of a smooth function f at a point (x, f(x)) is given by the gradient of f at x:

    ∇f(x) = ( ∂f/∂x₁(x), ∂f/∂x₂(x), . . . , ∂f/∂xₙ(x) )

with the partial derivatives defined by

    for i = 1, . . . , n,   ∂f/∂xᵢ(x) := lim_{τ↓0} [f(x + τeᵢ) − f(x)]/τ = df(x; eᵢ),
³In Differential Calculus, one usually refers to f′(x) as the derivative of f at x, but we want to reserve this term for the more malleable function df(x; ·).
where eᵢ = (0, . . . , 1, . . . , 0) is the unit n-vector with a 1 in the ith position. In the 1-dimensional case, one has ∇f(x) = f′(x). Because f is smooth, it follows from the preceding definitions that

    ∀w ∈ IR^n:  df(x; w) = ⟨∇f(x), w⟩ = Σ_{j=1}^n ∂f/∂xⱼ(x) wⱼ,

and so, also for smooth functions defined on IR^n, one can derive Fermat's rule from Oresme's rule and vice versa. The fact that one can rely on either one of these rules to check for optimality turns out to be quite convenient.
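Since the identity df(x; w) = ⟨∇f(x), w⟩ underlies everything that follows, it is worth checking numerically. The sketch below uses Python/NumPy in place of the book's Matlab; the test function is the one minimized in Section 1.3, and the point and direction are arbitrary illustrative choices:

```python
import numpy as np

# the function minimized in Section 1.3: f(x1, x2) = (x2 - x1^2)^2 + (1 - x1)^2
def f(x):
    return (x[1] - x[0]**2)**2 + (1 - x[0])**2

def grad_f(x):
    # gradient computed by hand from the formula above
    return np.array([4*x[0]**3 - 4*x[0]*x[1] + 2*x[0] - 2,
                     2*(x[1] - x[0]**2)])

def df(x, w, tau=1e-6):
    # difference quotient approximating the directional derivative df(x; w)
    return (f(x + tau * w) - f(x)) / tau

x = np.array([0.5, -0.3])          # an arbitrary point and direction
w = np.array([1.0, 2.0])
print(df(x, w), grad_f(x) @ w)     # the two values agree up to O(tau)
```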
1.2 Curve fitting I
Given the values of an unknown function h : [0, 1] → IR at a finite number of distinct points z₁, z₂, . . . , z_L, one is interested in finding a polynomial p : [0, 1] → IR of degree n, i.e.,

    p(x) = aₙxⁿ + · · · + a₁x + a₀,

whose values at z₁, . . . , z_L are as close as possible to those of h. There are a number of ways to interpret "as close as possible", but for now, let it have the meaning of least squares, i.e., the sum of the squares of the distances between h(zₗ) and p(zₗ) is to be minimized. With a = (aₙ, . . . , a₁, a₀), the minimization problem is:

    min_{a ∈ IR^{n+1}}  Σ_{l=1}^L ( Σ_{j=0}^n aⱼ zₗ^j − h(zₗ) )².
With

    Z = [ z₁ⁿ  . . .  z₁  1
          z₂ⁿ  . . .  z₂  1
          . . . . . . . . . .
          z_Lⁿ . . .  z_L  1 ]    and    y = ( h(z₁), h(z₂), . . . , h(z_L) ),
the least squares problem can be written as:

    min_{a ∈ IR^{n+1}}  ⟨Za − y, Za − y⟩,
or still min_a f(a) = |Za − y|², i.e., the least squares solution will minimize the square of the norm of the error. Applying Fermat's rule, we see that the minimizer(s) a* must satisfy

    ∇f(a*) = 2Zᵀ(Za* − y) = 0,

or equivalently, the so-called normal equation,

    ZᵀZ a* = Zᵀy.

If we assume, as might be expected, that n + 1 ≤ L, and recalling that the points z₁, . . . , z_L are distinct, the columns of the matrix Z are linearly independent. Hence, ZᵀZ is invertible and

    a* = (ZᵀZ)⁻¹Zᵀy.

This is the solution calculated by the Matlab function polyfit. In Figure 1.3, a 5th degree polynomial has been fitted to the given data points; the plot has been obtained by polyval, another Matlab function.
[Figure 1.3: Fitting data with a 5th degree polynomial.]
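The normal-equation computation can be reproduced in a few lines; the sketch below uses Python/NumPy in place of the Matlab functions polyfit and polyval, and the sample data h(zₗ) is an illustrative assumption, not data from the text:

```python
import numpy as np

# sample data: L = 21 distinct points in [0, 1] and values of an "unknown" h
# (h and the degree n are illustrative choices)
z = np.linspace(0.0, 1.0, 21)
y = np.sin(2 * np.pi * z)

n = 5                              # degree of the fitted polynomial
Z = np.vander(z, n + 1)            # rows (z_l^n, ..., z_l, 1), as in the text

# normal equation  Z'Z a = Z'y, solved directly
a = np.linalg.solve(Z.T @ Z, Z.T @ y)

# np.polyfit solves the same least-squares problem (via a more stable route)
assert np.allclose(a, np.polyfit(z, y, n), atol=1e-6)

# evaluate the fitted polynomial, as Matlab's polyval would
mse = np.mean((np.polyval(a, z) - y) ** 2)
print("mean square error:", mse)
```

Solving ZᵀZ a = Zᵀy directly is fine for small, well-conditioned problems; for high degrees, the orthogonalization-based route taken by polyfit is numerically safer.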
1.1 Exercise (polyfit and polyval functions). Let x = (0, 0.05, . . . , 0.95, 1) and y = (0.95, 0.23, 0.61, 0.49, 0.89, 0.76, 0.46, 0.02, 0.82, 0.44, 0.62, 0.79, 0.92, 0.74, 0.18, 0.41, 0.94, 0.92, 0.41, 0.89, 0.06). Use the function polyfit to obtain a polynomial fit of degree n = 1, 2, . . . , 11 and n = 21. For each n, calculate the mean square error and plot your results so that you can visually inspect the fit and the graph of the polynomial.
Guide. For given n, let p = polyfit(x, y, n) be the coefficients of the polynomial of degree n. To graph the resulting polynomial and check the fit, use the command plot(z, polyval(p, z), x, y, 'xm') with z = (0.2 : 0.001 : 1.2)⁴.
1.3 Steepest Descent and Newton methods
When we applied Fermat's rule to the polynomial fit problem, the minimizer could be found by solving a system of linear equations, and there are very efficient procedures available to solve (n × n)-linear systems. But, in general, the function to be minimized won't be quadratic, and consequently, the Fermat rule may result in a system consisting of n nonlinear equations. For example, when

    f(x₁, x₂) = (x₂ − x₁²)² + (1 − x₁)²,

Fermat's rule yields

    2x₁³ − 2x₂x₁ + x₁ = 1,   −x₁² + x₂ = 0.

And that system is not any easier to solve than minimizing f. In fact, procedures to solve nonlinear equations and to minimize functions on IR^n go hand in hand⁵. In this section, and the next, we outline algorithmic procedures to find a point x that satisfies Fermat's rule. Such points will minimize the function f, at least locally, when the function f is locally convex; convexity is dealt with in Chapter 3, and locally convex means that f is convex in a neighborhood of x, say on a ball IB(x, δ) with δ > 0.
⁴In a Matlab figure, use the Export functionality to obtain printable files, for example, EPS files.
⁵Roughly speaking, a system of n equations, linear or nonlinear, in n variables can be thought of as the gradient of some nonlinear function defined on IR^n.
1.2 Definition (local minimizers). For f : IR^n → IR, x* is a local minimizer if f(x*) ≤ f(x) for all x ∈ IB(x*, δ) for some δ > 0, i.e.,

    x* ∈ argmin_{x ∈ IB(x*,δ)} f(x).

A global minimizer is simply a minimizer of f on IR^n.
[Figure 1.4: Local and global minimizers.]
The first step in the design of algorithmic procedures to minimize a function f is to identify directions of descent.
1.3 Lemma (direction of descent). Let f : IR^n → IR be smooth. Whenever df(x; d) < 0, the vector d is a direction of descent for f at x, i.e., there exists δ > 0 such that

    ∀τ ∈ (0, δ):  f(x + τd) < f(x).

Proof. Because f is smooth at x, f(x + τd) = f(x) + τ df(x; d) + o(τ). Hence, for every d ∈ IR^n such that df(x; d) < 0, there is a δ > 0 with f(x + τd) − f(x) < 0 for all τ ∈ (0, δ). The assertion involving the gradient simply follows from df(x; d) = ⟨∇f(x), d⟩ when f is smooth.
Steepest Descent Method.

Step 0. x⁰ ∈ IR^n, ν := 0.
Step 1. Stop if ∇f(x^ν) = 0; otherwise, d^ν := −∇f(x^ν).
Step 2. λ^ν := argmin_{λ ≥ 0} { f(x^ν + λd^ν) − f(x^ν) }.
Step 3. x^{ν+1} := x^ν + λ^ν d^ν, ν := ν + 1, go to Step 1.
1.4 Convergence (steepest descent). Suppose f : IR^n → IR is smooth. Then, the Steepest Descent algorithm stops after a finite number of steps at a point x^ν where ∇f(x^ν) = 0, or it generates a sequence of points {x^ν, ν ∈ IN} that either diverges, i.e., |x^ν| → ∞, or ∇f(x̄) = 0 for every cluster point x̄ of this sequence.
Proof. The algorithm stops, in Step 1, only when ∇f(x^ν) = 0, necessarily after a finite number of steps. Otherwise, excluding the case when the iterates diverge, the sequence {x^ν, ν ∈ IN} will have at least one cluster point, say x̄. By restricting our attention to the convergent subsequence, we may as well proceed as if x^ν → x̄, i.e., x̄ is a limit point.

If ∇f(x̄) ≠ 0, then d̄ = −∇f(x̄) is a direction of descent and, by Lemma 1.3, f(x̄ + λd̄) < f(x̄) for all λ ∈ (0, δ̄) for some δ̄ > 0. Since f is smooth⁶,

    f(x^{ν+1}) − f(x^ν) = −λ^ν |d^ν|² + o(λ^ν),

where d^ν = −∇f(x^ν), and x^ν → x̄ implies

    f(x^{ν+1}) − f(x^ν) → 0,   |d^ν|² → |d̄|² ≠ 0.

Hence, from the preceding identity, by letting ν → ∞, it follows that λ^ν → 0. This means that eventually λ^ν ∈ (0, δ̄), and then, by the definition of the step size in Step 2, one must have

    f(x^ν + λd^ν) − f(x^ν) ≥ f(x^ν + λ^ν d^ν) − f(x^ν) = −λ^ν |d^ν|² + o(λ^ν), ∀λ ∈ (0, δ̄).

Letting ν → ∞ and choosing λ close enough to 0, one obtains 0 ≤ −|d̄|², a contradiction. Hence ∇f(x̄) = 0.
That's an unrealistic premise. The best one can hope for is an approximate minimizer. When implementing this algorithm, the step size λ^ν in Step 2 is commonly calculated as follows: for parameters α, β ∈ (0, 1) selected at the outset in Step 0,

    A-Step 2. λ^ν := max_{k=0,1,...} { β^k | f(x^ν + β^k d^ν) − f(x^ν) ≤ −α β^k |d^ν|² }.

One then refers to λ^ν as the Armijo step size⁷; β^k is β to the power k, so, in particular, β⁰ = 1.
[Figure 1.5: Calculating the Armijo step size.]
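The Steepest Descent method with A-Step 2 can be sketched as follows (Python rather than the book's Matlab; the parameter values α = 0.1, β = 0.5 and the starting point are illustrative choices, and the test function is the one from the beginning of this section):

```python
import numpy as np

def armijo_step(f, x, d, alpha=0.1, beta=0.5):
    # largest beta**k with f(x + beta**k d) - f(x) <= -alpha * beta**k * |d|^2
    step = 1.0
    while f(x + step * d) - f(x) > -alpha * step * (d @ d):
        step *= beta
    return step

def steepest_descent(f, grad, x0, tol=1e-8, max_iter=100000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:       # Step 1: stop when the gradient vanishes
            break
        d = -g                             # Step 1: steepest-descent direction
        x = x + armijo_step(f, x, d) * d   # A-Step 2 and Step 3
    return x

# the function of this section: f(x1, x2) = (x2 - x1^2)^2 + (1 - x1)^2
f = lambda x: (x[1] - x[0]**2)**2 + (1 - x[0])**2
grad = lambda x: np.array([4*x[0]**3 - 4*x[0]*x[1] + 2*x[0] - 2,
                           2*(x[1] - x[0]**2)])
print(steepest_descent(f, grad, [-1.2, 1.0]))   # approaches the minimizer (1, 1)
```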
accommodate the Armijo step size selection.

One may interpret the iterations of the Steepest Descent method as follows: At x^ν, the function is approximated by

    f̂(x) = f(x^ν) + ⟨∇f(x^ν), x − x^ν⟩ + (1/2λ)|x − x^ν|²,

where λ > 0 is a parameter to be selected at some point. The next iterate x^{ν+1} is obtained by minimizing f̂, i.e.,

    x^{ν+1} = x^ν − λ∇f(x^ν).

If f̂ turns out to be a good approximation of f, at least locally, one should end up with a point x^{ν+1} near a local minimizer of f. However, it's obvious that this can only be the case for a very limited class of nonlinear functions f. A more trustworthy local approximation of f at x^ν is provided by

    f̂(x) = f(x^ν) + ⟨∇f(x^ν), x − x^ν⟩ + (1/2λ)⟨x − x^ν, ∇²f(x^ν)(x − x^ν)⟩,

⁷more appropriately, the Armijo-Goldstein step size
assuming that f is twice differentiable with Hessian matrix at x^ν,

    ∇²f(x^ν) = [ ∂²f/∂xᵢ∂xⱼ (x^ν) ]_{i,j=1}^{n,n},

again with λ > 0, a parameter to be selected appropriately. Assuming furthermore that ∇²f(x^ν) is positive definite⁸, and thus also invertible, the minimum of f̂ is attained at

    x^{ν+1} = x^ν − λ∇²f(x^ν)⁻¹∇f(x^ν).
The suggested descent direction,

    d^ν = −∇²f(x^ν)⁻¹∇f(x^ν),

is known as the Newton direction. It's certainly a direction of descent since

    ⟨∇f(x^ν), ∇²f(x^ν)⁻¹∇f(x^ν)⟩ > 0;

the inverse of a positive definite matrix is also positive definite. This leads us to the following variant of the Steepest Descent method:
Newton Method (with line minimization).

Step 0. x⁰ ∈ IR^n, ν := 0.
Step 1. Stop if ∇f(x^ν) = 0; otherwise, d^ν := −∇²f(x^ν)⁻¹∇f(x^ν).
Step 2. λ^ν := argmin_{λ ≥ 0} { f(x^ν + λd^ν) − f(x^ν) }.
Step 3. x^{ν+1} := x^ν + λ^ν d^ν, ν := ν + 1, go to Step 1.
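In its classical form, with step size λ ≡ 1, the method above reduces to a few lines of code. This Python sketch (NumPy in place of the book's Matlab) applies it to the function of this section, whose gradient and Hessian are computed by hand; the starting point is an illustrative choice:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-12, max_iter=50):
    # classical Newton method: step size 1, no line minimization
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        # Newton direction: solve the linear system instead of inverting
        x = x + np.linalg.solve(hess(x), -g)
    return x

# f(x1, x2) = (x2 - x1^2)^2 + (1 - x1)^2, with hand-computed gradient/Hessian
grad = lambda x: np.array([4*x[0]**3 - 4*x[0]*x[1] + 2*x[0] - 2,
                           2*(x[1] - x[0]**2)])
hess = lambda x: np.array([[12*x[0]**2 - 4*x[1] + 2, -4*x[0]],
                           [-4*x[0], 2.0]])
print(newton(grad, hess, [2.0, 3.0]))   # converges to (1, 1) in a few steps
```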
The convergence proof for the Newton method is essentially the same as that for the Steepest Descent method; the only difference is that a different descent direction is calculated in Step 1. And when implementing the Newton method, one would again replace Step 2 by A-Step 2, i.e., determine λ^ν by calculating the Armijo step size.

What really makes the Newton method very attractive is that it comes with particularly desirable local convergence properties! The proof that follows is for the classical version of Newton's method that doesn't include a line minimization step, i.e., with λ^ν ≡ 1.
⁸A matrix C is positive definite if ⟨x, Cx⟩ > 0 for all x ≠ 0. When the Hessian at x̄ of a twice continuously differentiable function f : IR^n → IR is positive definite, f is strictly convex on a neighborhood of x̄, cf. 3.15. We kick off our study of convexity in Chapter 3.
[Figure 1.6: Comparison of Newton and steepest descent directions.]
1.5 Convergence (Newton Method: local convergence). Let f : IR^n → IR be twice continuously differentiable and x ↦ ∇²f(x) locally Lipschitz continuous⁹, i.e., given any x̄,

    ∃ε > 0, κ ≥ 0 such that ‖∇²f(x) − ∇²f(x′)‖ ≤ κ|x − x′| for all x, x′ ∈ IB(x̄, ε).

Let x* be a local minimizer that satisfies the following second order sufficiency condition: there are constants 0 < l ≤ u < ∞ such that

    l|z|² ≤ ⟨z, ∇²f(x*) z⟩ ≤ u|z|² for all z ∈ IR^n.

Then there is a δ > 0 such that if Newton's method is started at a point x⁰ ∈ IB(x*, δ), the iterates {x^ν}_{ν∈IN} will converge quadratically to x*, i.e.,

    limsup_{ν→∞} |x^{ν+1} − x*| / |x^ν − x*|² < ∞.
Proof. Choose δ ≤ ε small enough that, for every x ∈ IB(x*, δ), ∇²f(x) is invertible with ‖∇²f(x)⁻¹‖ ≤ 1/l. For x^ν ∈ IB(x*, δ) and with z = x^ν − x*,

    ∇²f(x^ν)(x^{ν+1} − x^ν) = −∇f(x^ν) + ∇f(x*) = −∫₀¹ ∇²f(x* + tz) z dt,

as follows from the Mean Value Theorem; recall ∇f(x*) = 0. Because x* + tz ∈ IB(x*, δ) for all t ∈ [0, 1], adding ∇²f(x^ν) z to both sides of the preceding identity yields

    x^{ν+1} − x* = ∇²f(x^ν)⁻¹ ∫₀¹ [∇²f(x^ν) − ∇²f(x* + tz)] z dt.

Since ‖∇²f(x^ν)⁻¹‖ ≤ 1/l and |∫ g(t) dt| ≤ ∫ |g(t)| dt,

    |x^{ν+1} − x*| ≤ |z| ‖∇²f(x^ν)⁻¹‖ ∫₀¹ ‖∇²f(x^ν) − ∇²f(x* + tz)‖ dt ≤ (κ/l)|z|² = |z| · (κ/l)|z|.

Thus, shrinking δ if necessary so that (κ/l)δ < 1, the iterates remain in IB(x*, δ) and |x^{ν+1} − x*| ≤ (κ/l)|x^ν − x*|², the claimed quadratic convergence.
Guide. The Rosenbrock function has boomerang-shaped level sets. The minimum occurs at (1, 1). Starting at x⁰ = (−1.2, 1), the Steepest Descent method may require as many as a thousand steps to reach the neighborhood of the solution. As can be observed, the method of Steepest Descent can not only be quite inefficient but, due to numerical round-off, it might even get stuck at a non-optimal point.
The Newton method, like the steepest descent method, can be viewed as a procedure to find a solution to the system of n equations:

    ∂f/∂x₁(x) = 0,  ∂f/∂x₂(x) = 0,  . . . ,  ∂f/∂xₙ(x) = 0.
Because these functions x ↦ ∂f/∂xⱼ(x) have, in principle, no preassigned properties, one could have described Newton's method as one for solving a system of n non-linear equations in n unknowns, say

    G(x) = ( G₁(x), . . . , Gₙ(x) ) = ( 0, . . . , 0 )

with

    ∇G(x) = [ ∂Gᵢ/∂xⱼ(x) ]_{i,j=1}^n,
the Jacobian of G at x. Generally, the algorithmic procedure, then known as the Newton-Raphson method, doesn't involve a line minimization step, but a line search could be included to avoid being led in unfavorable directions. Assuming that the Jacobian is invertible, a generic version of the method involves the following steps:
Newton-Raphson Method.
Step 0. x⁰ ∈ IR^n, ν := 0.
Step 1. Stop if G(x^ν) = 0; otherwise, d^ν := −∇G(x^ν)⁻¹G(x^ν).
Step 2. x^{ν+1} := x^ν + d^ν, ν := ν + 1, go to Step 1.

After making the appropriate change of notation, one can follow step by step the proof of 1.5 to obtain a (local) quadratic convergence rate. Figure 1.7 illustrates the steps of the Newton-Raphson procedure when G : IR → IR.
[Figure 1.7: Root finding via Newton-Raphson in dimension 1.]
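When G : IR → IR, as in Figure 1.7, the iteration is simply x^{ν+1} = x^ν − G(x^ν)/G′(x^ν). A minimal Python sketch (the equation x³ − 2 = 0 and the starting point are illustrative choices, not taken from the text):

```python
def newton_raphson(G, dG, x0, tol=1e-12, max_iter=100):
    # Newton-Raphson for a single equation G(x) = 0
    x = float(x0)
    for _ in range(max_iter):
        if abs(G(x)) <= tol:
            break
        x -= G(x) / dG(x)    # d = -G(x)/G'(x), then x <- x + d
    return x

# illustrative equation: x^3 - 2 = 0, i.e., computing the cube root of 2
root = newton_raphson(lambda x: x**3 - 2, lambda x: 3*x**2, 2.0)
print(root)    # approximately 1.259921 (= 2^(1/3))
```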
Newton-Raphson for systems of linear/non-linear equations and Newton's method for unconstrained optimization problems illustrate vividly the important connections between these two classes of problems, both in the design of solution procedures as well as in the analysis of their intrinsic properties.
1.4 The Quasi-Newton methods
The numerical implication of quadratic convergence is that once a small enough neighborhood of a (local) minimizer is reached, each iteration will double the number of correct digits in the approximate solutions x^ν. Of course, this doesn't take into consideration round-off errors! But, nonetheless, any method with these characteristics has to be treasured and deserves to be plagiarized. Unfortunately, the conditions under which one can apply Newton's method are quite restrictive:
- ∇²f(x^ν) might not be invertible. This can be repaired to some extent by choosing appropriately a direction d^ν that satisfies ∇²f(x^ν)d^ν = −∇f(x^ν).

- The approximation f̂ of f at x^ν might be poor or, more generally, only valid in a very limited region. Again, this can be repaired to some extent by restricting the step size to a trust region.

- The function f is not twice differentiable, or its Hessian is difficult to compute. This can't be repaired in the framework of the Newton method. It requires a different approach.
Quasi-Newton Method¹⁰.

Step 0. Pick x^0 ∈ IR^n, set ν := 0, pick B^0 (= I, for example).
Step 1. Stop if ∇f(x^ν) = 0; otherwise, choose d^ν such that B^ν d^ν = −∇f(x^ν).
Step 2. λ^ν := argmin_{λ≥0} f(x^ν + λ d^ν).
Step 3. x^{ν+1} := x^ν + λ^ν d^ν, calculate B^{ν+1}, ν := ν + 1, go to Step 1.
Clearly B^ν plays here the role of the Hessian ∇²f(x^ν) in the Newton method, i.e., it tries to capture the adjustment that needs to be made in the steepest descent on the basis of the local curvature of the function f. The actual behavior of the method is determined by the choice one makes of the updating mechanism for B^ν. To guarantee, at least, local convergence, one imposes the following condition:
The Quasi-Newton Condition. The curvature along the descent direction d^ν (from x^ν to x^{ν+1}) should be approximated by

B^{ν+1}(x^{ν+1} − x^ν) := (B^ν + U^ν)(x^{ν+1} − x^ν) = ∇f(x^{ν+1}) − ∇f(x^ν),

or equivalently, with

s^ν = x^{ν+1} − x^ν,  c^ν = ∇f(x^{ν+1}) − ∇f(x^ν):   U^ν s^ν = c^ν − B^ν s^ν.
The updating matrix U^ν = B^{ν+1} − B^ν must be chosen so that it satisfies the preceding identity. This can always be achieved by means of a matrix of rank 1 of the type U^ν = u ⊗ v, where

u ⊗ v = [ u1v1  u1v2  ...  u1vn
          u2v1  u2v2  ...  u2vn
          ....................
          unv1  unv2  ...  unvn ]

is the outer product of the vectors u and v. We must have

[ u ⊗ v ] s^ν = ⟨v, s^ν⟩ u = c^ν − B^ν s^ν.

¹⁰Methods of this type are also known as variable metric methods. One can think of the descent direction generated in Step 1 as the steepest descent but with respect to a different metric on IR^n than the usual Euclidean metric.
This means that u needs to be a multiple of (c^ν − B^ν s^ν). Assuming c^ν ≠ B^ν s^ν, since otherwise B^ν itself satisfies the Quasi-Newton condition, and picking v such that ⟨v, s^ν⟩ ≠ 0, one would set

u = (1/⟨v, s^ν⟩) (c^ν − B^ν s^ν)   and   U^ν = (1/⟨v, s^ν⟩) [ (c^ν − B^ν s^ν) ⊗ v ].

In the preceding expression all quantities are fixed except for v, and the only restriction is that v shouldn't be orthogonal to s^ν. One can choose v so as to restrict B^ν to a class of matrices that have some desirable properties. For example, one may wish to have the matrices B^ν symmetric; the Hessian ∇²f(x) is symmetric when it is defined. Choosing

v = c^ν − B^ν s^ν   yields   B^{ν+1} = B^ν + (1/⟨v, s^ν⟩) [ v ⊗ v ],

which is symmetric if B^ν is symmetric.

One particularly useful property of these updates is that their inverses can be computed recursively. Indeed, if B is invertible and ⟨v, B^{−1}u⟩ ≠ −1, then

( B + [ u ⊗ v ] )^{−1} = B^{−1} − (1/(1 + ⟨v, B^{−1}u⟩)) B^{−1} [ u ⊗ v ] B^{−1}.

To see this, simply observe that

( B + [ u ⊗ v ] ) ( B^{−1} − (1/(1 + ⟨v, B^{−1}u⟩)) B^{−1} [ u ⊗ v ] B^{−1} ) = I

follows from

[ u ⊗ v ] B^{−1} [ u ⊗ v ] B^{−1} = ⟨B^{−1}u, v⟩ [ u ⊗ v ] B^{−1}.

Updating schemes based directly on (B^ν + [ u ⊗ v ]) are not numerically stable, by which one means that small errors in carrying out the numerical operations might result in significant errors in the descent method. Two popular, numerically reliable, updating schemes are
BFGS update¹¹:

B^{ν+1} = B^ν + (1/⟨c^ν, s^ν⟩) [ c^ν ⊗ c^ν ] − (1/⟨B^ν s^ν, s^ν⟩) [ B^ν s^ν ⊗ B^ν s^ν ].

¹¹BFGS = Broyden-Fletcher-Goldfarb-Shanno who, independently, proposed this formula for the update.
DFP update¹²: with D^ν playing the role of (B^ν)^{−1}:

D^{ν+1} = D^ν + (1/⟨c^ν, s^ν⟩) [ s^ν ⊗ s^ν ] − (1/⟨c^ν, D^ν c^ν⟩) [ D^ν c^ν ⊗ D^ν c^ν ].

The Matlab-function fminunc (with option LargeScale set 'off') implements the BFGS Quasi-Newton method. Again, the Rosenbrock function could be used as a test function and the results compared to those obtained in Exercise 1.6.
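To make the loop concrete, here is a bare-bones sketch (in Python rather than Matlab) of the quasi-Newton iteration with the DFP update, D^ν playing the role of (B^ν)^{−1} so that Step 1 becomes d = −D∇f(x). The exact line search below exploits a quadratic test function f(x) = x1² + 4x2², so this is only an illustration, not a general-purpose implementation:

```python
# Quasi-Newton iteration with the DFP inverse update on the quadratic
# test function f(x) = x1^2 + 4*x2^2 (Hessian A = diag(2, 8)); the exact
# line search below uses the quadratic structure, so this is only a sketch.
def dot(a, b): return sum(x*y for x, y in zip(a, b))

A = [[2.0, 0.0], [0.0, 8.0]]                  # Hessian of the test function
grad = lambda x: [2.0*x[0], 8.0*x[1]]         # gradient of f

x = [1.0, 1.0]
D = [[1.0, 0.0], [0.0, 1.0]]                  # D^0 = I, approximates A^{-1}
for _ in range(25):
    g = grad(x)
    if dot(g, g) <= 1e-20: break              # Step 1: stop test
    d = [-sum(D[i][j]*g[j] for j in range(2)) for i in range(2)]
    Ad = [sum(A[i][j]*d[j] for j in range(2)) for i in range(2)]
    lam = -dot(g, d) / dot(d, Ad)             # Step 2: exact minimizer along d
    s = [lam*di for di in d]                  # s = x^{nu+1} - x^nu
    xn = [xi + si for xi, si in zip(x, s)]
    c = [gn - gi for gn, gi in zip(grad(xn), g)]
    Dc = [sum(D[i][j]*c[j] for j in range(2)) for i in range(2)]
    cs, cDc = dot(c, s), dot(c, Dc)
    # DFP update: D+ = D + (s (x) s)/<c,s> - (Dc (x) Dc)/<c,Dc>
    D = [[D[i][j] + s[i]*s[j]/cs - Dc[i]*Dc[j]/cDc for j in range(2)]
         for i in range(2)]
    x = xn
print(x)   # close to the minimizer (0, 0)
```

On a quadratic with exact line searches, the DFP iterates terminate in at most n steps, so here the minimizer is reached after two updates, up to round-off.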
1.5 Integral functionals
Let's go one step further in our analysis of classical optimality conditions and consider integral functionals, as they arise in the Calculus of Variations¹³,

f(x) = ∫₀¹ L(t, x(t), ẋ(t)) dt.

With α, β ∈ IR, the simplest problem of the Calculus of Variations is:

min f(x) so that x ∈ X, x(0) = α, x(1) = β,

where X ⊂ fcns([0, 1], IR) consists of all functions with some specified properties, for example X = ac-fcns([0, 1], IR), the space of absolutely continuous real-valued functions defined on [0, 1]. For simplicity's sake, let's assume that X = C¹([0, 1]; IR) is the space of real-valued, continuously differentiable functions defined on [0, 1]. One has,

Oresme's rule: x ∈ argmin f ⟹ df(x; w) = 0, ∀ w ∈ W ⊂ X,

where the set W of admissible variations is such that

x ∈ X, w ∈ W ⟹ ∀ η ∈ IR: (x + ηw)(0) = α, (x + ηw)(1) = β,

that is,

W = { w ∈ X | w(0) = w(1) = 0 }.

¹²DFP = Davidon-Fletcher-Powell who proposed this updating scheme.
¹³It's customary to denote by ẋ the derivative of x with respect to the time parameter t.
It isn't straightforward to write down Fermat's rule. There isn't a ready-made calculus to find the gradient of functions defined on an infinite dimensional linear space. In the late 17th Century (?), the path going from Oresme's to Fermat's rule for this problem was pioneered by a trio of mathematical superstars: the Bernoulli brothers, Johann and Jacob, and Isaac Newton.
Let's sketch out their approach when

L : [0, 1] × IR × IR → IR

is a really nice function. Here, this means that the partial derivatives

L_x(t, x, v) = ∂L/∂x (t, x, v),   L_ẋ(t, x, v) = ∂L/∂v (t, x, v)

have the continuity and differentiability properties required to validate the operations carried out below. Then,
df(x; w) = lim_{η→0} (1/η) ∫₀¹ [ L(t, (x+ηw)(t), (ẋ+ηẇ)(t)) − L(t, x(t), ẋ(t)) ] dt

         = ∫₀¹ lim_{η→0} (1/η) [ L(t, (x+ηw)(t), (ẋ+ηẇ)(t)) − L(t, x(t), ẋ(t)) ] dt

         = ∫₀¹ [ L_x(t, x(t), ẋ(t)) w(t) + L_ẋ(t, x(t), ẋ(t)) ẇ(t) ] dt.
For a given function x ∈ X and for w ∈ W, integration by parts yields

∫₀¹ w(t) L_x(t, x(t), ẋ(t)) dt = 0 − ∫₀¹ ẇ(t) [ ∫₀ᵗ L_x(τ, x(τ), ẋ(τ)) dτ ] dt,

and thus,

df(x; w) = ∫₀¹ ẇ(t) [ L_ẋ(t, x(t), ẋ(t)) − ∫₀ᵗ L_x(τ, x(τ), ẋ(τ)) dτ ] dt.

Because w ∈ W implies ∫₀¹ ẇ(t) dt = 0, and df(x; w) = 0 must hold for all such functions w: on [0, 1],

t ↦ L_ẋ(t, x(t), ẋ(t)) − ∫₀ᵗ L_x(τ, x(τ), ẋ(τ)) dτ   must be constant.
In other words, x must satisfy the ordinary differential equation

L_x(t, x(t), ẋ(t)) = d/dt [ L_ẋ(t, x(t), ẋ(t)) ]   for t ∈ [0, 1],

known as the Euler equation. In addition, x must satisfy the boundary conditions at t = 0, 1. Fermat's rule then reads,

x ∈ argmin f ⟹ x(0) = α, x(1) = β, and x satisfies the Euler equation.
1.7 Example (the brachistochrone problem). The problem¹⁴ is to find the path along which a particle will fall in the shortest time from A to B.

Detail. Let's pass a vertical plane through the points A and B, with the y-axis drawn vertically downward, A located at the origin (0, 0) and, say, B = (1, 2). So, we are to find a path y: [0, 1] → [0, ∞) with y(0) = 0 and y(1) = 2 that will minimize the elapsed time; the force acting on the particle is gravity.

Figure 1.8: Shortest time path for falling particle

From Newton's Law, one derives the following expression for the function to be minimized:

∫₀¹ √( (1 + ẏ(x)²) / y(x) ) dx.
¹⁴Originally formulated by Galileo; the name is derived from the Greek, brachistos for shortest and chronos for time.
Rather than the Euler equation itself, let's rely on a variant, namely,

d/dx [ L − ẏ L_ẏ ] = L_x;

carrying out the differentiation with respect to x, the preceding identity yields L_x + ẏ L_y + ÿ L_ẏ − ÿ L_ẏ − ẏ (d/dx) L_ẏ = L_x, or still, ẏ ( L_y − (d/dx) L_ẏ ) = 0. Since in our problem L_x = 0, this variant of the Euler equation implies that x ↦ (L − ẏ L_ẏ)(x) should be constant. Hence the optimal path must satisfy the following boundary value problem:

2Θ = y(x) (1 + ẏ(x)²),   y(0) = 0, y(1) = 2,

where Θ is a constant to be chosen so as to satisfy the boundary conditions. The path described by the solution in the (x, y)-plane can be parametrized with respect to time for t ∈ [0, 2π], and one can verify that the cycloids,

x(t) = Θ(t − sin t),   y(t) = Θ(1 − cos t),
satisfy the preceding differential equation; somewhat informally,

ẏ = dy/dx = (dy/dt)(dt/dx) = ẏ(t)/ẋ(t) = (1 − cos t)⁻¹ sin t.

At t = 0, x(0) = y(0) = 0, so there remains only to choose Θ so that the path passes through B at time t = t*, our shortest time. For B at (1, 2), that turns out to be Θ = 2.4056, and the corresponding value for t* is 1.4014; these values were obtained with the help of the Matlab root-finder fzero.
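The two constants can be recovered with any root finder. The sketch below (in Python with a plain bisection, standing in for the Matlab fzero mentioned above) eliminates Θ from x(t*) = 1, y(t*) = 2, which gives 2(t* − sin t*) = 1 − cos t*, and then solves for t*:

```python
import math

# Eliminating Theta from x(t) = Theta*(t - sin t) = 1 and
# y(t) = Theta*(1 - cos t) = 2 gives f(t) = 2*(t - sin t) - (1 - cos t) = 0.
f = lambda t: 2.0*(t - math.sin(t)) - (1.0 - math.cos(t))

lo, hi = 0.5, 3.0                      # f(lo) < 0 < f(hi)
for _ in range(60):                    # plain bisection
    mid = 0.5*(lo + hi)
    if f(mid) < 0:
        lo = mid
    else:
        hi = mid
t_star = 0.5*(lo + hi)
Theta = 1.0/(t_star - math.sin(t_star))
print(t_star, Theta)   # about 1.4014 and 2.4056
```

The computed values agree with those quoted in the text.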
1.8 Exercise. Show that the straight line yields the shortest distance between two points, say a = (0, 0) and b = (1, 2).

Guide. The length of a differentiable arc y: [0, 1] → IR is ∫₀¹ √(1 + ẏ(x)²) dx, as follows from the theorem of Pythagoras and the definition of ẏ. Set up and solve the Euler equation with the boundary conditions y(0) = 0 and y(1) = 2.
1.6 In conclusion . . .
We have seen that the rules of Oresme and Fermat, with the help of Differential Calculus, can be used effectively to identify potential minimizers of a smooth function in a variety of situations. But many interesting optimization problems involve non-differentiable functions, and minimizers have a predilection for being located at the cusps and kinks of such functions! Moreover, the presence of constraints in a minimization problem comes with an intrinsic lack of smoothness. There is a sharp discontinuity between points that are admissible and those that are not.

To deal with this more inclusive class of functions, we need to enrich our calculus. Our task, on the mathematical side, will thus be to set up a Subdifferential¹⁵ Calculus with rules that mirror those of Differential Calculus, and that culminates in versions of Oresme's and Fermat's rules to ferret out the minimizers of non-smooth, and even discontinuous, functions¹⁶.
¹⁵The prefix "sub" has the meaning: requiring less than differentiability.
¹⁶For more comprehensive expositions of Subdifferential Calculus, one should consult [4, 14, 1, 3, 18]; our notation and terminology will be consistent with that of [18].
Chapter 2
FORMULATION
Let's begin with a few typical (constrained) optimization problems that fit under the mathematical programming umbrella. In almost all of these examples, we start with a deterministic version and then switch to a more realistic model that makes a place for the uncertainty about some of the parameters.

When we allow for data uncertainty, not only do we gain credibility for the modeling process, but we are also led to consider a number of issues that are at the core of optimization theory and practice, namely, how to deal with non-linearities, with lack of smoothness, and how to design solution procedures for large scale problems. In addition, due to the addition of randomness (uncertainty), it's also necessary to clarify a number of the basic modeling issues, in particular, how stochastic programs differ from the simpler, but less realistic, deterministic formulations.

For all these reasons, we are going to rely rather extensively, but by no means exclusively, on stochastic programming examples to motivate both the theoretical development and the design of algorithmic procedures.
2.1 A product mix problem
A furniture maker can manufacture and sell four different dressers. Each dresser requires a certain number t_cj of man-hours for carpentry, and a certain number t_fj of man-hours for finishing, j = 1, ..., 4. In each period, there are d_c man-hours available for carpentry, and d_f available for finishing. There is a (unit) profit c_j per dresser of type j that's manufactured. The owner's goal
is to maximize total profit, or equivalently, to minimize cost¹ (= negative profit). Let these cost coefficients be

c = (c1, c2, c3, c4) = −(12, 25, 21, 40),

and

T = [ t_c1 t_c2 t_c3 t_c4 ; t_f1 t_f2 t_f3 t_f4 ] = [ 4 9 7 10 ; 1 1 3 40 ],   d = (d_c, d_f) = (6000, 4000).

The furniture maker must choose (x_j ≥ 0, j = 1, ..., 4) to minimize

Σ^4_{j=1} c_j x_j = −12x1 − 25x2 − 21x3 − 40x4,

subject to the constraints

4x1 + 9x2 + 7x3 + 10x4 ≤ 6000,
x1 + x2 + 3x3 + 40x4 ≤ 4000.
This is a linear program, i.e., an optimization problem in finitely many (real-valued) variables in which a linear function is to be minimized (or maximized) subject to a system of finitely many linear constraints: equations and inequalities. A general formulation of a linear program could be

min Σ^n_{j=1} c_j x_j over all x ∈ X ⊂ IR^n

so that Σ^n_{j=1} a_{i,j} x_j ⪋ b_i for i = 1, ..., m,

where ⪋ stands for either ≤, = or ≥, and the (internal) constraints x ∈ X consist of some simple linear inequalities on the variables x_j such as:

- X is a box, i.e., X = { x | l_j ≤ x_j ≤ u_j, j = 1, ..., n },
- X = IR^n_+ = { x | x_j ≥ 0, j = 1, ..., n }, the non-negative orthant, etc.
¹This conversion to minimization is made in order to have a canonical formulation of optimization problems. Generally, engineers and mathematicians prefer the minimization framework, whereas social scientists and business majors have a preference for the maximization framework.
The objective and the constraints of our product mix problem are linear, and it may be written compactly as:

min ⟨c, x⟩ so that T x ≤ d, x ≥ 0.

As part of the ensuing development, many of the properties of linear programs will be brought to the fore, including optimality conditions, solution procedures and the associated geometry. For now, let's simply posit that such problems can be solved efficiently when they are not too large. The (optimal) solution of our product mix problem is:

x_d = (4000/3, 0, 0, 200/3) with optimal value: $ −18,667.

Here is the Matlab-file used to calculate the solution; linprog is a function in the Matlab Optimization Toolbox.
function [xopt,ObjVal] = prodmix
% data and solution of the product mix example
c = -[12 25 21 40]; d = [6000 4000];
T = [4 9 7 10; 1 1 3 40];
xlb = zeros(4,1); xub = ones(4,1)*10^9;
[xopt,ObjVal] = linprog(c,T,d,[],[],xlb,xub);
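As a quick sanity check (sketched below in Python rather than the Matlab used elsewhere in these lectures), one can verify that the quoted solution saturates both resource constraints and attains the quoted optimal value:

```python
# Check the quoted product-mix solution xd = (4000/3, 0, 0, 200/3):
# both resource constraints should be tight and the objective near -18,667.
c = [-12, -25, -21, -40]
T = [[4, 9, 7, 10], [1, 1, 3, 40]]
d = [6000, 4000]
xd = [4000/3, 0, 0, 200/3]

hours = [sum(t*x for t, x in zip(row, xd)) for row in T]
value = sum(cj*xj for cj, xj in zip(c, xd))
print(hours, value)   # hours near [6000, 4000]; value near -18666.67
```

Both constraints hold with equality, which is consistent with the solution sitting at a vertex of the feasible region.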
Now, let's get a bit more realistic and account for the fact that the number of hours needed to produce each dresser type can't be known with certainty. Then each entry in T becomes a random variable². For simplicity's sake, assume that each entry of T takes on four possible man-hour values with equal probability (1/4) and that these entries are independent of one another.

entry  possible values
t_c1: 3.60 3.90 4.10 4.40
t_c2: 8.25 8.75 9.25 9.75
t_c3: 6.85 6.95 7.05 7.15
t_c4: 9.25 9.75 10.25 10.75
t_f1: 0.85 0.95 1.05 1.15
t_f2: 0.85 0.95 1.05 1.15
t_f3: 2.60 2.90 3.10 3.40
t_f4: 37.0 39.0 41.0 43.0
²Bold face will be used for random variables, with normal print for their possible values.
We have 8 random variables, each taking four possible values, which yields a total of 4⁸ = 65,536 possible T matrices (outcomes), and each one of these has equal probability of occurring! In practice, this could be a discretization that approximates a continuous distribution (e.g. a uniform distribution). Let's denote the probability of a particular outcome T^l by p_l = (0.25)⁸, for l = 1, ..., 65,536.

Because the manufacturer must decide on the production plan before the number of hours required for carpentry or finishing is known with certainty, there is the possibility that they actually exceed the number of hours available. Therefore, the possibility of having to pay for overtime must be factored in. The recourse costs are determined by: q_c per extra carpentry hour and q_f per extra finishing hour, say q = (q_c, q_f) = (5, 10).
This recourse decision will only enter into play after the production plan x has been selected and the time required, T^l, for each task has been observed. Our manufacturer will, at least potentially, make a different decision about overtime when confronted with each one of these 65,536 possible different outcomes for T. Let y^l_c and y^l_f denote the number of overtime hours hired for carpentry and finishing when the matrix T turns out to be T^l. The problem is then to choose (x_j ≥ 0, j = 1, ..., 4) that minimizes

Σ^4_{j=1} c_j x_j + Σ^{65,536}_{l=1} p_l ( q_c y^l_c + q_f y^l_f )

so that

Σ^4_{j=1} t^l_{cj} x_j − y^l_c ≤ d_c,   l = 1, ..., 65,536,
Σ^4_{j=1} t^l_{fj} x_j − y^l_f ≤ d_f,   l = 1, ..., 65,536,
y^l_c ≥ 0, y^l_f ≥ 0,   l = 1, ..., 65,536.

Notice that the objective now being minimized is the sum of the immediate costs (actually, the negative profits) and the expected future costs, since one must consider 65,536 possible outcomes; the constraints involving random quantities are written out explicitly for all 65,536 possible outcomes. In addition to non-negativity for the decision variables x_j and the recourse variables y^l_c, y^l_f, the constraints say that the number of man-hours it takes for the carpentry of all dressers (Σ^4_{j=1} t^l_{cj} x_j) must not exceed the total number of hours made available for carpentry (d_c + y^l_c), i.e., regular hours plus overtime, and the same must hold for finishing.
Because there is the possibility of making a recourse decision y^l = (y^l_c, y^l_f) that will depend on the outcomes of the random elements, this type of problem is called a stochastic program with recourse. This class of problems will be studied in more depth later. For now, it suffices to understand how the decision/information process is evolving:

decision: x → observation: T^l → recourse: y^l

In summary, the manufacturer makes today a decision x of how much of each dresser type to produce, based on the knowledge that he will be able, tomorrow, to observe how many man-hours T^l it actually took to manufacture the dressers, as well as to decide how much overtime labor y^l to hire based on this observation.
The problem is still a linear program, but of much larger size! Notice the block-angular structure of the problem when written in the following way:

min ⟨c, x⟩ + p1⟨q, y¹⟩ + p2⟨q, y²⟩ + ... + p65536⟨q, y^{65536}⟩
so that T¹x − y¹ ≤ d
        T²x − y² ≤ d
        ...
        T^{65536}x − y^{65536} ≤ d
        x ≥ 0, y¹ ≥ 0, y² ≥ 0, ..., y^{65536} ≥ 0.

Later, we shall see how to solve these large scale linear programs by exploiting their structure. For now, it is enough to observe that these are indeed large scale problems.
Oftentimes, there is more than one source of uncertainty in a problem. For example, due to employee absences, the available man-hours for carpentry and finishing may also have to be modeled as random variables, say

entry  possible values
d_c: 5,873 5,967 6,033 6,127  each with probability 1/4
d_f: 3,936 3,984 4,016 4,064  each with probability 1/4
We now need to replace d by d^l = (d^l_c, d^l_f), and we must take into account the 4² = 16 possible d^l vectors, which gives a total of L = 4¹⁰ = 1,048,576 possible (T, d) realizations. With p_l = 1/L, the problem reads:

min ⟨c, x⟩ + p1⟨q, y¹⟩ + p2⟨q, y²⟩ + ... + pL⟨q, y^L⟩
so that T¹x − y¹ ≤ d¹
        T²x − y² ≤ d²
        ...
        T^L x − y^L ≤ d^L
        x ≥ 0, y¹ ≥ 0, y² ≥ 0, ..., y^L ≥ 0.
The relatively small linear program we started out with, in the deterministic setting, has now become almost enormous! Let's refer to this problem as the (equivalent) extensive version of the (given) stochastic program.
The optimal solution is

x* = (257, 0, 665.2, 33.8) with total expected cost: $ −18,051.

Because of its large size, this problem is more difficult to solve than its deterministic counterpart, and any efficient solution procedure must exploit the problem's special structure. But the solution x* is robust, meaning that it has examined all one million plus possibilities, and has taken into account the resulting recourse costs for overtime and the associated probabilities of having to pay these costs.

With x_d = (4000/3, 0, 0, 200/3), the solution of the deterministic version, the expected cost would have been $ −16,942; the expected overtime costs are $ 1,725. Of course, x_d is not an optimal solution of the stochastic program but, more significantly, x_d isn't getting us on the right track! The solution x* suggests that a large number of dressers of type 3 should be manufactured, while the production plan suggested by x_d doesn't even include any dresser of type 3. This is exactly the information a decision maker would want to have, viz., what are the activities that should be included in a (robust) optimal solution.
2.1 Exercise (stochastic resources). Consider the product mix problem when the only uncertainty is about the number of hours that will be available for carpentry and finishing. Overtime will still be paid at the rates of $ 5 an hour for carpentry and $ 10 an hour for finishing. Let

c = −(12, 25, 21, 20),   T = [ 4 9 7 10 ; 1 1 3 40 ],

and let the random variables d_c, d_f (independently) take on the values

entry  possible values
d_c: 4,800 5,500 6,050 6,150  each with probability 1/4
d_f: 3,936 3,984 4,016 4,064  each with probability 1/4

Solve also the deterministic problem: min ⟨c, x⟩ so that T x ≤ d̄, x ≥ 0, with d̄_c = 5,625 and d̄_f = 4,000, the expected values of d_c and d_f. Compare the solution with that of the stochastic program.

Guide. Here, L = 16 is the number of possible outcomes, so in addition to the non-negativity restrictions, the stochastic program will have 32 constraints. One solution is x* = (1,072.6, 0, 251.4, 0) with optimal value $ −15,900. The solution of the deterministic problem suggests manufacturing the same types of dressers but in significantly different quantities. To compare the solutions, for the deterministic solution one needs to evaluate not just its cost (= −profit), but one must also calculate the recourse costs that might result when one actually would implement the deterministic solution.
2.2 Curve fitting II
As in 1.2, we know the values of an unknown function h: [0, 1] → IR at a finite number of (distinct) points. It's also known that h is quite smooth, which in this context we are going to interpret as meaning: h is twice differentiable with h″ bounded, i.e., for all t ∈ [0, 1], |h″(t)| ≤ κ for a given κ > 0. Unless κ is unusually small, our estimate for h is allowed to come from a much richer class of functions than just polynomials of degree n, as in 1.2. It's thus reasonable to expect that one should be able to come up with a better fit.

Every twice differentiable z: [0, 1] → IR can be written as

z(t) = z0 + v0 t + ∫₀ᵗ ( ∫₀^σ a(s) ds ) dσ,
where z0 and v0 play the role of integration constants and a: [0, 1] → IR, the second derivative, is some function, not necessarily continuous. In fact, to render the problem computationally manageable, let's restrict a to be piecewise constant. That's not as serious a limitation as might appear at first glance, since any piecewise continuous function on [0, 1] can be approximated arbitrarily closely by such a piecewise constant function.

Getting down to specifics: Partition (0, 1] in N sub-intervals (t_{k−1}, t_k], of length Δ = 1/N, so that the points at which the function h is known are some of the end points of these intervals, say { t_l, l ∈ L }. For k = 1, ..., N, set

a(t) = x_k (a constant) for t ∈ (t_{k−1}, t_k];

fixing x1, ..., xN, v0 and z0 completely determines the function z. Moreover, by introducing bounds on the choice of the x_k, one can control the rate of change in both ż and in the function z itself.
For t ∈ (t_{k−1}, t_k], one has

ż(t) = v0 + ∫₀ᵗ a(s) ds = v0 + Δ Σ^{k−1}_{j=1} x_j + (t − t_{k−1}) x_k,

and

z(t) = z0 + ∫₀ᵗ ż(s) ds = z0 + Σ^{k−1}_{j=1} ∫_{t_{j−1}}^{t_j} ż(s) ds + ∫_{t_{k−1}}^{t} ż(s) ds

     = z0 + v0 t + Δ Σ^{k−1}_{j=1} (t − t_j + Δ/2) x_j + ½ (t − t_{k−1})² x_k.

In particular, when t = t_k,

z(t_k) = z0 + kΔ v0 + Δ² Σ^k_{j=1} (k − j + ½) x_j.
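As a sanity check on this last formula (a Python sketch, not part of the original text): for a constant second derivative a ≡ x̄, the discrete expression must reproduce z(t) = z0 + v0 t + x̄ t²/2 exactly, since Σ^k_{j=1} (k − j + ½) = k²/2:

```python
# With a(t) = xbar constant, z(t_k) = z0 + k*msh*v0 + msh^2 * xbar * k^2/2,
# which equals the exact double integral z0 + v0*t + xbar*t^2/2 at t = t_k.
N = 10; msh = 1.0/N                      # Delta = 1/N
z0, v0, xbar = 1.0, 2.0, 3.0
xs = [xbar]*N                            # piecewise-constant second derivative
for k in range(1, N+1):
    t = k*msh
    zk = z0 + t*v0 + msh**2 * sum((k - j + 0.5)*xs[j-1] for j in range(1, k+1))
    exact = z0 + v0*t + 0.5*xbar*t*t
    assert abs(zk - exact) < 1e-12       # discrete formula matches the integral
print("discrete formula matches the exact integral")
```

The agreement is exact (up to round-off) because, with constant a, the quadrature implied by the formula integrates the quadratic z without error.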
The curve fitting problem comes down to finding v0, z0 and, for k = 1, ..., N, x_k ∈ [−κ, κ], so that z is as close as possible to h in terms of a given criterion. For example, one may be interested in minimizing the (square of the) ℓ²-norm of the error,

Σ_{l ∈ L} | z(t_l) − h(t_l) |²,
i.e., least squares minimization. With z = ( z_l = z(t_l), l ∈ L ) and h = ( h_l = h(t_l), l ∈ L ), one ends up with the following formulation,

min |z − h|² = Σ_{l ∈ L} |z_l − h_l|²
so that z_l = z0 + lΔ v0 + Δ² Σ^l_{k=1} (l − k + ½) x_k,  l ∈ L,
        −κ ≤ x_k ≤ κ,  k = 1, ..., N.

That's a quadratic program: the constraints are linear and the function to be minimized is quadratic. One can write the equality constraints as

z = A x   where   x = (z0, v0, x1, ..., xN);
the entries of A are the detached coefficients of these equations. Since

|z − h|² = |Ax − h|² = ⟨Ax − h, Ax − h⟩,

the quadratic program can also be expressed as

min ⟨x, AᵀA x⟩ − 2⟨Aᵀh, x⟩ so that lb ≤ x ≤ ub,

where lb and ub are, respectively, lower and upper bounds on the x-variables; for z0 and v0 these bounds could be ∓∞. Because the matrix AᵀA is positive semi-definite³, it turns out that our quadratic program is a convex optimization problem, a property that's difficult to overvalue in an optimization context, cf. 3.16 & Chapter 4.

To illustrate this approach, let's consider again the same data as that used in 1.2. We rely on the Matlab-function quadprog to solve the quadratic program. Figure 2.1 displays the resulting curve when N = 400 (and κ is relatively large); as expected, the fitting is significantly better than what resulted from a best polynomial fit; compare Figures 1.3 and 2.1.
³A matrix C is positive semi-definite if ⟨x, Cx⟩ ≥ 0 for all x. When C is positive semi-definite, the quadratic form x ↦ ⟨x, Cx⟩ is convex, see Example 3.16.

function z = lsCurveFit(xr,N,x,h,kappa)
% xr, N: range [0, xr]; partition in N subintervals
% (x,h): data points
% kappa: -lower and upper bound on 2nd derivatives
msh = xr/N; [m, m0] = size(x(:)); xidx = round(N*x);
N1 = N + 1; N2 = N + 2;
mx = 2+max(abs(h)); ub = [kappa*ones(1,N);100;mx]; lb = -ub;
% generating the coefficients of matrix A
for i = 1:m
  for j = 1:xidx(i)
    A(i,j) = (xidx(i)-j+0.5)*msh^2;
  end %for
  A(i,N1) = xidx(i)*msh; A(i,N2) = 1;
end %for
xx = quadprog(A'*A,-A'*h(:),[],[],[],[],-ub,ub);
% z-curve calculation
for l = 1:N
  zd = 0;
  for k = 1:l
    zd = zd + (l-k+0.5)*xx(k);
  end %for
  z(l) = xx(N2) + xx(N1)*l*msh + zd*msh^2;
end %for
Figure 2.1: Least squares fit of a smooth curve
Instead of minimizing the ℓ²-norm of the error (= least squares), one could, for instance, choose to minimize the ℓ¹-norm (= sum of the absolute errors). The function to be minimized is then Σ_{l ∈ L} |z_l − h_l|. Since

|z_l − h_l| = max { z_l − h_l, h_l − z_l },
one can find the minimum of Σ_{l ∈ L} |z_l − h_l| by minimizing Σ_{l ∈ L} θ_l with θ_l ≥ z_l − h_l, θ_l ≥ h_l − z_l for l ∈ L.

With the ℓ¹-norm criterion, the curve fitting problem takes the form,

min Σ_{l ∈ L} θ_l
so that θ_l ≥ z_l − h_l,  l ∈ L,
        θ_l ≥ −z_l + h_l,  l ∈ L,
        z_l = z0 + lΔ v0 + Δ² Σ^l_{k=1} (l − k + ½) x_k,  l ∈ L,
        −κ ≤ x_k ≤ κ,  k = 1, ..., N.

This is a linear program: the constraints are linear and the function to be minimized is also linear.
2.2 Example (yield curve tracing). The spot rate of a Treasury Note that matures in t months always includes a risk premium as well as a forecast component that represents the market's perception of future interest rates. Such spot rates are quoted for Treasury Notes with specific maturities, t = 3, 6, ... . To evaluate financial instruments that generate cash flows (coupons, final payments) at intermediate dates, one needs to have access to a yield curve that supplies the spot rate for every possible date.
Figure 2.2: Yield curve for Treasury Notes July 1982
Detail. Let's work with the (historical) rates quoted in July 1982:
term  3    6    12   24   36   60   84   120  240  360
r_t   11.9 12.8 13.2 13.8 14.0 14.1 14.1 13.9 13.8 13.6
To trace the yield curve, the simplistic approach is to rely on linear interpolation. However, that's not really satisfactory. Financial markets make continuous adjustments to the changing environment, and this suggests that the yield curve has to be quite smooth. Certainly, there shouldn't be an abrupt change in the slope of the yield curve and, a fortiori, this shouldn't occur at t = 3, 6, ... . So, let's fit a smooth curve to the data. Because the spot rates are nonnegative⁴, one can express the yield curve as s(t) = e^{−z(t)}, in which case we need to search for a smooth z-curve that will fit the pairs {(3, −ln r3), (6, −ln r6), ...}. The following Matlab-file generates the coefficients of the linear program and then relies on linprog to calculate the solution. Figure 2.2 graphs the (historical) yield curve calculated by our program.
function spots = YieldCurve(N,x,r,kappa)
% N: # of months, range [0, N]; (x,r): data points
% kappa: -lower and upper bound on 2nd derivative
[m, m0] = size(x(:)); N1 = N + 1; N2 = N + 2;
ub = [kappa*ones(1,N);0;0;10*ones(1,m)];
lb = [-ub(1:N);-1;-3.25;zeros(1,m)];
% generating the coefficients of linear program
for i = 1:m
  i2 = 2*i; i1 = i2-1;
  b(i2) = log(r(i)); b(i1) = -b(i2);
  A(i1,:) = zeros(1,N2+m);
  for j = 1:x(i)
    A(i1,j) = (x(i)-j+0.5);
  end %for
  A(i1,N1) = x(i); A(i1,N2) = 1; A(i1,N2+i) = -1;
  A(i2,:) = -A(i1,:); A(i2,N2+i) = -1;
end %for
c = [zeros(1,N2) ones(1,m)];
xx = linprog(c,A,b,[],[],lb,ub);
% yield curve calculation
for l = 1:N
  zd = 0;
  for k = 1:l
    zd = zd + (l-k+0.5)*xx(k);
  end %for
  z(l) = xx(N2) + xx(N1)*l + zd;
end %for
spots = exp(-z);

⁴And it's expedient to have an expression for the spot rates that makes calculating forward rates and discount factors particularly easy.
2.3 A network capacity expansion problem
Let's consider a power transmission network, Figure 2.3, with e_i the external flow at node i, i.e., the difference between demand and supply at node i. The internal flow y_j on arc j is limited by the capacity β_j of the transmission line. Total supply exceeds total demand, but the capacity of the transmission lines needs to be expanded from β_j to β_j + x_j, with γ_j an upper bound on x_j, in order to render the problem feasible⁵. The total cost of such an expansion is Σ^n_{j=1} θ_j(x_j).
Figure 2.3: Power transmission network
⁵In the 2001 California energy crisis, some of the blackouts were blamed on the lack of capacity of the transmission lines between South and North California.
The deterministic version of this capacity expansion problem would be:

min Σ^n_{j=1} θ_j(x_j)
so that 0 ≤ x_j ≤ γ_j,  j = 1, ..., n,
        |y_j| ≤ β_j + x_j,  j = 1, ..., n,
        Σ_{j ∈ in(i)} y_j − Σ_{j ∈ out(i)} y_j ≥ e_i,  i = 1, ..., m;

Σ_{j ∈ in(i)} y_j stands for the (internal) flow into node i whereas Σ_{j ∈ out(i)} y_j is the flow from i to the other nodes. Since the constraint |y_j| ≤ β_j + x_j can be split into the two linear constraints y_j ≤ β_j + x_j and y_j ≥ −β_j − x_j, this is again a linear programming problem if the cost functions θ_j are linear. Usually, the functions θ_j are nonlinear, and the problem then belongs to a more general class of optimization problems.
A nonlinear program is an optimization problem in finitely many (real-valued) variables in which a function is to be minimized (or maximized) subject to a system of finitely many constraints: equations and inequalities. A general formulation of a nonlinear program could be

min f0(x) over all x ∈ X ⊂ IR^n
so that f_i(x) ⪋ 0 for i = 1, ..., m,

where ⪋ stands for either ≤, = or ≥, and the set X is usually a simple set such as a box or an orthant but, in principle, could be any subset of IR^n. Depending on the properties of the functions f0, {f_i, i = 1, ..., m} and the set X, various labels are attached to nonlinear programs: quadratic, geometric, convex, positive definite, etc. As we proceed, we shall develop optimality conditions and study stability criteria for nonlinear programs as well as describe a number of algorithmic procedures for solving certain classes of nonlinear programs.
2.3 Example (capacity expansion example). Consider the (simple) capacity expansion problem as defined by Figure 2.4, with no upper bounds on the expansions x_j, and let θ1(x1) = x1², θ2(x2) = 8x2², θ3(x3) = 3x3².

Detail. This is a quadratic program with solution

x* = (0, 0.55, 1.45),  y* = (2.45, 2.55, 6.45)
⁶Of course, deterministic programs can be viewed as stochastic programs whose random elements take on a single value with probability one.
2.5. THE BROADWAY PRODUCER PROBLEM
When P is a continuous distribution function, i.e., there is a density function p: IR₊ → IR₊ such that

P(ζ) = ∫₀^ζ p(s) ds,

the expected costs are

γx + δ ∫ₓ^∞ (ζ − x) p(ζ) dζ.

In this case, a simple calculation shows that the optimal solution x̂ must satisfy

0 = γ − δ [ 1 − P(x̂) ].

More generally, if we define P(ζ⁻) := lim_{η↗ζ} P(η), one must have

P(x̂⁻) ≤ (δ − γ)/δ ≤ P(x̂),

which allows for the possibility of a (discontinuous) jump in P at x̂, as could happen when the random demand ϑ is discretely distributed. Figure 2.6 illustrates these possibilities.
Figure 2.6: Solution: discrete and continuous distributions.
2.6 Exercise (numerical solutions of the producer problem). With γ = 3, δ = 7 and ζ uniformly distributed on [10, 20], the producer problem has a
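The exercise's data invites a quick numerical check; a sketch in Python, assuming γ = 3 is the per-unit commitment cost and δ = 7 the shortage penalty rate, so the optimality condition reads P(x̂) = 1 − γ/δ:

```python
import numpy as np

gamma, delta = 3.0, 7.0   # assumed roles: per-unit cost gamma, penalty rate delta
a, b = 10.0, 20.0         # demand zeta uniformly distributed on [10, 20]

def expected_cost(x):
    # gamma*x + delta*E[(zeta - x)_+]; for zeta ~ U[a, b] and a <= x <= b,
    # E[(zeta - x)_+] = (b - x)**2 / (2*(b - a))
    return gamma * x + delta * (b - x) ** 2 / (2 * (b - a))

xs = np.linspace(a, b, 1_000_001)
x_grid = xs[np.argmin(expected_cost(xs))]      # brute-force minimizer
x_closed = a + (b - a) * (1 - gamma / delta)   # from P(x) = 1 - gamma/delta
print(round(x_grid, 4), round(x_closed, 4))    # both approximately 15.7143
```

The brute-force grid search and the fractile formula agree at x̂ = 110/7 ≈ 15.71.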
of the contract, and the second term evaluating the decision in terms of expected costs to come after the random event is observed. The second term is called the expected recourse cost:

E{q(ζ − x)} where q(y) = 0 when y ≤ 0, q(y) = δy when y ≥ 0.

The recourse cost function is q(ζ − x).
Figure 2.7: The cost function q.
2.8 Exercise (alternative expression for recourse cost). Show that the cost function q (with δ > 0), a function commonly used to define recourse costs, admits the alternative representations:

q(y) = max[0, δy] = min{ δy+ | y+ − y− = y, y+ ≥ 0, y− ≥ 0 }.

With E{·} denoting expectation with respect to the distribution function P, the stochastic programming formulation of the producer's problem is:

min_x γx + E{q(ζ − x)}.

After integration, one obtains the deterministic equivalent of the stochastic program, stated here in terms of a continuous distribution P with density p, but valid in the more general case:

min_x γx + δ ∫x^∞ (ζ − x) p(ζ) dζ.
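The min-representation in exercise 2.8 can be spot-checked by solving the small linear program directly; a sketch assuming scipy is available (δ = 7 is an arbitrary choice):

```python
from scipy.optimize import linprog

delta = 7.0   # any positive penalty rate will do

def q_max(y):
    # the max-representation of the recourse cost
    return max(0.0, delta * y)

def q_lp(y):
    # min delta*y_plus  subject to  y_plus - y_minus = y,  y_plus, y_minus >= 0
    res = linprog(c=[delta, 0.0], A_eq=[[1.0, -1.0]], b_eq=[y],
                  bounds=[(0, None), (0, None)])
    return res.fun

for y in (-3.0, 0.0, 2.5):
    print(q_max(y), round(q_lp(y), 6))   # the two representations agree
```

The linear program splits y into its positive and negative parts and only charges for the positive one, which is exactly max[0, δy].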
46 CHAPTER 2. FORMULATION
This is precisely the problem that was solved earlier.

This formulation of the producer's problem illustrates the important features of a stochastic program: the decision stages of the problem in relation to the arrival of information, and the recourse costs being obtained as the expected value of an optimization problem that will be solved after full information is available. Many issues of stochastic programming may be illustrated by this simple producer's problem. There are many instances when the developments to come can first be explored for this simple example, in which everything is well understood; the intuition gained can then guide the application to more complex decision problems.
Chapter 3
PRELIMINARIES
At the congenital level, one makes a distinction between two classes of optimization problems, namely those that are convex and those that are non-convex1. Fortunately, a major portion of the optimization problems that have to be dealt with in practice are convex; all examples in Chapter 2 fall in this class. In the first two sections of this chapter, we build the basic tools to analyze convex optimization problems and, in particular, what's needed to generalize the Oresme and Fermat rules. The last three sections set up a minimal probabilistic framework that will allow us to deal with (convex) stochastic optimization problems and commence the study of expectation functionals.
3.1 Variational analysis I
The analysis of deterministic and stochastic programming models relies on the tools and framework of Variational Analysis. Our concern at this point is mainly, but not exclusively, with finite-valued convex functions, but the exposition will already touch on the interplay between a function and its epigraph that occupies such a pivotal role in Variational Analysis and, in particular, in the theoretical foundations of optimization. The definitions and accompanying notation are consistent with the extensions and generalizations required in the sequel. General facts about convexity will be covered in this
1 Of course, non-convex problems have large subfamilies that possess particular properties that can be exploited effectively in the design of solution procedures, e.g., combinatorial optimization, optimization problems with integer variables, complementarity problems, equilibrium problems, and so on.
section. The next one will be devoted to a more detailed analysis of convex functions and their (sub)differentiability properties.
A subset C of IRn is convex if for all x0, x1 ∈ C, the line segment [x0, x1] ⊂ C, i.e.,

xλ = (1 − λ)x0 + λx1 ∈ C for all λ ∈ [0, 1].

Note that x0, x1 don't have to be distinct, and thus if C consists of a single point it's convex; the condition is also vacuously satisfied when C = ∅, the empty set. Balls, lines, line segments, cubes, planes are all examples of convex sets. Sets with dents or holes are typical examples of sets that fail to be convex, cf. Figure 3.1.
Figure 3.1: Convex and non-convex sets.
Given λ ∈ [0, 1], one refers to xλ as a convex combination of x0 and x1. More generally, given any collection x1, . . . , xL ∈ IRn, then any point

x = Σ_{l=1}^L λl xl for some λl ≥ 0, l = 1, . . . , L, such that Σ_{l=1}^L λl = 1

is a convex combination of x1, . . . , xL. The set

C = con(x1, . . . , xL) = { x = Σ_{l=1}^L λl xl | Σ_{l=1}^L λl = 1, λl ≥ 0, l = 1, . . . , L }

is the convex hull of x1, . . . , xL; cf. Figure 3.2.
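Deciding whether a given point belongs to such a convex hull is itself a linear feasibility problem in the weights λl; a sketch with an illustrative triangle, assuming scipy is available:

```python
import numpy as np
from scipy.optimize import linprog

pts = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # illustrative triangle

def in_hull(x, pts):
    # feasible iff x = sum_l lambda_l * pts[l] with lambda_l >= 0, sum lambda_l = 1
    L = len(pts)
    res = linprog(c=np.zeros(L),
                  A_eq=np.vstack([pts.T, np.ones(L)]),
                  b_eq=np.append(x, 1.0),
                  bounds=[(0, None)] * L)
    return res.success

# the first point lies inside the triangle, the second outside
print(in_hull(np.array([1.0, 1.0]), pts), in_hull(np.array([3.0, 3.0]), pts))
```

The zero objective makes this a pure feasibility test: any nonnegative weight vector matching the point and summing to one certifies membership.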
Convexity is preserved under the following operations:
Figure 3.2: Convex hull of a finite collection of points.
3.1 Exercise (intersections, products and linear combinations). Given a collection of convex sets Ci, i = 1, . . . , r, one has:

(a) C = ∩_{i=1}^r Ci is convex; actually, the intersection of an arbitrary collection of convex sets is still convex;

(b) C = C1 × C2 × · · · × Cr is convex, where for C1 ⊂ IRn1, C2 ⊂ IRn2,
C1 × C2 := { (x1, x2) ∈ IRn1+n2 | x1 ∈ C1, x2 ∈ C2 };

(c) for λi ∈ IR, C = Σ_{i=1}^r λi Ci := { Σ_{i=1}^r λi xi | xi ∈ Ci } is convex.
Figure 3.3: Operations resulting in convex sets.
3.2 Exercise (affine transformation and projection). Given L : IRn → IRm, an affine mapping, i.e., L(x) = Ax + b where A is an m × n-matrix and b ∈ IRm,
the set L(C) = { z = Ax + b | x ∈ C } is convex whenever C ⊂ IRn is convex. In particular, L(C) is convex when it's the projection of the convex set C on a subspace of IRn.
Figure 3.4: Projection of a convex set.
Guide. The first statement only requires a simple application of the definition of convexity. For the second, simply write the projection as an affine mapping.
Warning: projections preserve convexity, but the projection of a closed convex set is not necessarily closed. A simple example: let C = { (x1, x2) | x2 ≥ 1/x1, x1 > 0 }; then the projection of C on the x1-axis is the open interval (0, ∞). This can't occur if C is also bounded, cf. Proposition 8.9.
A particularly important subclass of convex sets are those that are also cones. A ray is a closed half-line emanating from the origin, i.e., a set of the type { λx | λ ≥ 0 } for some 0 ≠ x ∈ IRn. A set K ⊂ IRn is a cone if 0 ∈ K and λx ∈ K for all x ∈ K and λ > 0. Aside from the zero cone {0}, the cones K in IRn are characterized as the sets expressible as nonempty unions of rays. The following sets are all convex cones: {0}, IR+n, IRn and (closed) half-spaces, i.e., sets of the type { x ∈ IRn | ⟨a, x⟩ ≤ 0 } with a ≠ 0.

3.3 Exercise (convex cones). A nonempty set C is a convex cone if and only if x1, x2 ∈ C and λ1, λ2 ≥ 0 imply λ1 x1 + λ2 x2 ∈ C.
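The criterion of exercise 3.3 can be probed on a concrete cone, say K = { (x1, x2) | x2 ≥ |x1| } (an illustrative choice, not from the text); a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def in_K(p):
    # K = {(x1, x2) : x2 >= |x1|}, a convex cone in the plane
    return p[1] >= abs(p[0]) - 1e-12

# draw pairs of points of K and nonnegative weights; the combination stays in K
for _ in range(1000):
    u, v = rng.uniform(-1.0, 1.0, size=2)
    p = np.array([u, abs(u) + rng.uniform(0.0, 1.0)])
    q = np.array([v, abs(v) + rng.uniform(0.0, 1.0)])
    l1, l2 = rng.uniform(0.0, 5.0, size=2)
    assert in_K(l1 * p + l2 * q)
print("every sampled nonnegative combination stayed in K")
```

The assertion never fires because the triangle inequality gives l1·|u| + l2·|v| ≥ |l1·u + l2·v|, exactly the membership condition for the combined point.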
Figure 3.5: Cones: convex and non-convex.
A function f is convex relative to a convex set C if for all x0, x1 ∈ C:

f((1 − λ)x0 + λx1) ≤ (1 − λ)f(x0) + λf(x1) for all λ ∈ [0, 1].

It is strictly convex relative to C if for all distinct x0, x1 ∈ C:

f((1 − λ)x0 + λx1) < (1 − λ)f(x0) + λf(x1) for all λ ∈ (0, 1).

Figure 3.6: Convex and non-convex functions.

In particular, if C = IRn the preceding inequalities must be satisfied for all x0, x1 in IRn. A function f is concave relative to a convex set C if −f is convex relative to C.
One can bypass the need for constant reference to the set on which the function f is defined if we adopt, as we will, the following framework: rather than real-valued functions, one considers extended real-valued functions with the value ∞ assigned to those points that are outside their effective domain. More specifically, with a function f0 : C → IR with C ⊂ IRn, one associates a function f : IRn → IR defined by

f(x) = f0(x) if x ∈ C, ∞ otherwise.
Figure 3.7: Strictly convex and not strictly convex functions.
The effective domain of a function f : IRn → IR is denoted

dom f = { x ∈ IRn | f(x) < ∞ }.
reals follows the usual rules, including 0 · ∞ = 0, except for ∞ − ∞ = ∞. This is going to be our extended arithmetic convention, but again this convention is consistent with the view that points at which a function takes the value ∞ lie outside its effective domain.
The convexity of a function isn't really affected by such an extension.
3.4 Exercise (convexity for extended real-valued functions). Show that f0 : C → IR is a convex function relative to the convex set C ⊂ IRn if and only if f : IRn → IR is convex, where f = f0 on C and f ≡ ∞ on IRn \ C.

For any convex function f : IRn → IR, dom f is convex. We only need to adjust the definition of strict convexity, namely, a function f : IRn → IR is strictly convex if the restriction of f to dom f is strictly convex relative to dom f.
Linear functions f(x) = ⟨a, x⟩ and affine functions f(x) = ⟨a, x⟩ + α are convex and concave. The exponential function e^x and the absolute value function |x| are examples of real-valued convex functions. The sine function sin x can serve as a typical example of non-convexity.
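These convexity claims are easy to probe numerically by sampling the defining inequality; a sketch (the sampling range and tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def violates_convexity(f, trials=10000, span=10.0):
    # hunt for a triple x0, x1, lam with
    # f((1-lam)*x0 + lam*x1) > (1-lam)*f(x0) + lam*f(x1)
    for _ in range(trials):
        x0, x1 = rng.uniform(-span, span, size=2)
        lam = rng.uniform(0.0, 1.0)
        lhs = f((1 - lam) * x0 + lam * x1)
        rhs = (1 - lam) * f(x0) + lam * f(x1)
        if lhs > rhs + 1e-8 * (1.0 + abs(rhs)):
            return True
    return False

print(violates_convexity(np.exp))   # e^x: no violation found
print(violates_convexity(np.abs))   # |x|: no violation found
print(violates_convexity(np.sin))   # sin x: a violation turns up
```

Sampling can only refute convexity, never prove it, but it is a handy sanity check when modeling.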
The epigraph of a function f : IRn → IR is the set

epi f := { (x, α) ∈ IRn+1 | α ≥ f(x) },

i.e., epi f consists of all points in IRn+1 that lie on or above the graph of f. Linking the geometrical properties of the epigraph with the analytical properties of functions is one of the most useful tools of Variational Analysis.
Figure 3.9: Epigraphs.
The hypograph of a function f : IRn → IR is the set

hypo f := { (x, α) ∈ IRn+1 | α ≤ f(x) },
i.e., hypo f consists of all points in IRn+1 that lie on or below the graph of f.
3.5 Proposition (convexity of epigraphs). A function f : IRn → IR is convex if and only if epi f ⊂ IRn+1 is convex. It's concave if and only if hypo f ⊂ IRn+1 is convex.

Proof. The convexity of epi f means that whenever (x0, α0), (x1, α1) ∈ epi f and λ ∈ (0, 1), the point (xλ, αλ) := (1 − λ)(x0, α0) + λ(x1, α1) belongs to epi f. This is the same as saying that whenever f(x0) ≤ α0 and f(x1) ≤ α1, one has f(xλ) ≤ αλ.

The assertion about concavity follows by symmetry, passing from f to −f.
The epigraph is not the only convex set associated with a convex function.
3.6 Exercise (convexity of level sets and argmin). Let f : IRn → IR be convex. Then, for all α ∈ IR, the level sets

lev≤α f := { x ∈ IRn | f(x) ≤ α }

and the set of minimizers

argmin f := { x ∈ IRn | f(x) ≤ inf f }

are convex. Moreover, if f is strictly convex then argmin f is a single point whenever it's nonempty; for example, if f = e^x, the function f is strictly convex but argmin f = ∅.
Figure 3.10: Level set of a convex function.
Guide. Use the convexity of epi f and IRn × {α}, and apply 3.1(a).
3.7 Exercise (convexity of max-functions). Let { fi, i ∈ I } be a collection of convex functions. Then the function f(x) = sup_{i∈I} fi(x) is convex.
Figure 3.11: Max-function.
Guide. Observe that epi f = ∩_{i∈I} epi fi and appeal to 3.1 and 3.5.

3.8 Example (convex indicator functions). The indicator function ιC : IRn → [0, ∞] of a set C ⊂ IRn is convex if and only if C is convex, where

ιC(x) = 0 if x ∈ C, ∞ otherwise.

Figure 3.12: An indicator function and its epigraph.
Follows from 3.1(b), 3.5, and 3.6 since epi ιC = C × IR+ and C = lev≤0 ιC.
3.9 Proposition (inf-projection of convex functions). Let f : IRn → IR be the inf-projection of the convex function g : IRm × IRn → IR, i.e., for all x ∈ IRn,

f(x) = inf_{u∈IRm} g(u, x).

Then f is a convex function.
Figure 3.13: Inf-projection of a convex function.
Proof. Follows from 3.2 and 3.5 since epi f is the vertical closure of the projection,

(u, x, α) ↦ (x, α),

of epi g ⊂ IRm+n+1 on the subspace IRn+1. By vertical closure one means that (x, ᾱ) is included in epi f whenever (x, α) ∈ epi f for all α > ᾱ; it's immediate that taking vertical closure preserves convexity.
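The proposition can be illustrated with a jointly convex quadratic whose inf-projection has a closed form; a sketch (the particular g is an illustrative choice, not from the text):

```python
import numpy as np

# g(u, x) = u**2 + (x - u)**2 is jointly convex in (u, x); minimizing over u
# (set the u-derivative to zero: u = x/2) gives f(x) = x**2 / 2, again convex.
u_grid = np.linspace(-20.0, 20.0, 400001)

def f(x):
    # numerical inf-projection over the grid of u values
    return np.min(u_grid ** 2 + (x - u_grid) ** 2)

for x in (-3.0, 0.0, 1.5):
    print(round(f(x), 6), x ** 2 / 2)   # the two columns coincide
```

Geometrically, the grid minimum traces out exactly the lower boundary of the projected epigraph described in the proof.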
3.10 Exercise (convexity under linear transformations). Let f : IRm → IR be a convex function. For any m × n-matrix A and a ∈ IRm, the function g : IRn → IR defined by g(x) = f(Ax + a) is a convex function.
Sublinear functions, a subclass of convex functions, play a central role in the subdifferentiation of convex functions. A function f : IRn →
Figure 3.14: Two sublinear functions.
IR is sublinear if f is convex and positively homogeneous, i.e., 0 ∈ dom f and f(λx) = λf(x) for all x ∈ IRn, λ > 0. Typical examples are f(x) = ⟨a, x⟩, f(x) = |x| and f(x) = sup_{i∈I} ⟨ai, x⟩, the supremum of a collection of linear functions.
3.11 Exercise (sublinearity criteria). For f : IRn → IR with f(0) = 0, sublinearity is equivalent to either one of the following conditions:

(a) epi f is a convex cone,

(b) f(λ1 x1 + λ2 x2) ≤ λ1 f(x1) + λ2 f(x2) for all x1, x2 ∈ IRn and λ1, λ2 ≥ 0.
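One of exercise 3.11's criteria can be spot-checked for a concrete instance; a sketch (the Euclidean norm and the sampling scheme are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.linalg.norm   # the Euclidean norm, a standard sublinear function

# spot-check f(l1*x1 + l2*x2) <= l1*f(x1) + l2*f(x2) on random samples
for _ in range(1000):
    x1, x2 = rng.normal(size=(2, 3))
    l1, l2 = rng.uniform(0.0, 5.0, size=2)
    assert f(l1 * x1 + l2 * x2) <= l1 * f(x1) + l2 * f(x2) + 1e-12
print("the inequality held on all samples")
```

For the norm, the inequality is just the triangle inequality combined with positive homogeneity.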
Our primordial interest in convexity, at least in Variational Analysis, comes from the theorem below that relates local and global minimizers in the presence of convexity; refer to 1.2 for the definition of local minimizers.
3.12 Theorem (local and global minimizers). Every local minimizer of a convex function f : IRn → IR is a global minimizer. Moreover, there is only a single minimizer of f when it's strictly convex.

Proof. If x0 and x1 are two points of dom f with f(x0) > f(x1), then x0 cannot furnish a local minimum of f, cf. Definition 1.2: every ball centered at x0 contains points xλ = (1 − λ)x0 + λx1 with λ ∈ (0, 1) that satisfy f(xλ) ≤ (1 − λ)f(x0) + λf(x1) < f(x0). Thus, there can't be any locally optimal solutions outside of argmin f, where global optimality is achieved.

If f is strictly convex, then x0, x1 ∈ dom f can't be distinct points that minimize f. In view of the preceding, one necessarily would have f(x0) =
f(x1). But then, strict convexity would imply that f(xλ) < f(x0) for every point xλ ∈ (x0, x1).
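The practical force of the theorem is that simple descent schemes on convex functions cannot get trapped at non-global local minimizers; a sketch with an illustrative smooth convex function f(x) = (x − 3)²:

```python
# Gradient descent on f(x) = (x - 3)**2 from widely separated starting points:
# every run reaches the same (global) minimizer. The function and step size
# are illustrative choices, not from the text.
def descend(x, step=0.1, iters=200):
    for _ in range(iters):
        x -= step * 2 * (x - 3)   # gradient of (x - 3)**2 is 2*(x - 3)
    return x

minimizers = [round(descend(x0), 6) for x0 in (-50.0, 0.0, 80.0)]
print(minimizers)   # all three runs land at 3.0
```

On a non-convex function the same scheme would generally stall at whichever local minimizer is nearest the starting point.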
3.2 Variational analysis II
This section continues the study of the properties of convex functions, but we are now mostly concerned with their (sub)differentiability properties. The class of functions to which we can apply the classical optimality conditions of Chapter 1 doesn't include many that come up in the mathematical programming context. Restricting the development to models involving only differentiable (convex) functions would leave by the wayside all constrained optimization models, and they include the large majority of the applications. One needs a calculus that applies to functions that are not necessarily differentiable. Eventually, this will enable us to formulate Oresme's and Fermat's rules for convex functions that aren't necessarily differentiable, or even continuous. This Subdifferential Calculus is introduced in this section and will be expanded throughout the ensuing development.

For the sake of exposition, and so that the readers can drill their intuition, 1-dimensional functions are featured prominently in this section. In some instances, for the sake of simplicity, the proof of a statement is only provided for 1-dimensional convex functions3.
3.13 Proposition (continuity of convex functions). A real-valued convex function f defined on IRn is continuous.

Proof. The proof is for n = 1. It will be sufficient to show that f : IR → IR is continuous at 0; continuity at any other point x follows from the continuity at 0 of the function g(z) = f(z + x). By symmetry, it suffices to show that f(0) = lim_ν f(x^ν) for any sequence x^ν → 0 with x^ν ∈ (0, 1]. From the convexity of f, one has for all ν:

f(x^ν) ≤ (1 − x^ν) f(0) + x^ν f(1),
f(0) ≤ (1/(x^ν + 1)) f(x^ν) + (x^ν/(x^ν + 1)) f(−1),
3 A complete proof would have required additional background material that would let us stray too far from the objectives of these lectures; for proofs in full generality, one can consult [3], [18, Chapter 2], for example.
that can also be written