An Optimization Primer


    AN OPTIMIZATION PRIMER

    An Introduction to Linear, Nonlinear,

    Large Scale, Stochastic Programming

    and Variational Analysis

    Roger J-B Wets

    University of California, Davis

    Graphics by Maria E. Wets

    AMS Classification: 90C15, 90xxx, 49J99

    Date: August 26, 2005


    PROLOGUE

The primordial objective of these lectures is to prepare the reader to deal with a wide variety of applications that include optimal allocation of (limited) resources, finding best estimates or best fits, and so on. Optimization problems of this type arise in almost all areas of human industry: engineering, economics, agriculture, logistics, ecology, finance, information and communication technology, and so on. Because the solution may have to respond to an evolutionary system, in time and/or space, or must take into account uncertainty about some of the problem's data, we usually end up having to solve a large scale optimization problem and this, to a large extent, conditions our overall approach. Consequently, the layout of the material doesn't follow the pattern of more traditional introductory optimization textbooks.

The main thrust won't be on the detailed analysis of specific algorithms, but on setting up the tools to deal with these large scale applications. This doesn't mean that, eventually, we won't describe, justify and even establish convergence of some basic algorithmic procedures. To achieve our goals, we proceed, more or less, on three parallel tracks:

    (i) modeling,

(ii) theoretical foundations that will allow us to analyze the properties of solutions as well as hone our modeling skills to help us build stable, easier-to-solve optimization problems, and

(iii) some numerical experimentation that will highlight some of the difficulties inherent in numerical implementation, but mostly to illustrate the use of elementary algorithmic procedures as building blocks of more sophisticated solution schemes.

The lectures are designed to serve as an introduction to the field of optimization for students who have a background roughly equivalent to a bachelor's degree in science, engineering or mathematics. More specifically, it's


expected that the reader has a good foundation in Differential Calculus and Linear Algebra, and is familiar with the abstract notion of function¹. The presentation also includes an introduction to a plain version of Probability that will enable the non-initiated reader to follow the sections dealing with stochastic programming.

A novel feature of this book is that decision making under uncertainty models are an integral part of the exposition. There are two basic reasons for this. Stochastic optimization motivates the study of linear and nonlinear optimization, large scale optimization, non-differentiable functions and variational analysis. But, more significantly, given our concern with intelligent modeling, one is bound to realize that very few important decision problems do not involve some level of uncertainty about some of their parameters. It's thus imperative that, from the outset, the reader be aware of the potential pitfalls of simplifying a model so as to skirt uncertainty. Nonetheless, it's possible to skip the chapters or sections dealing with stochastic programming without compromising the continuity of the presentation, but not without being shortchanged on the insight that comes from including uncertainty in the formulation of optimization models.

There are 16 chapters, each one corresponding to what could be covered in about a week's lectures (three to four hours). Proofs, constructive in nature whenever possible, have been provided so that (i) an instructor doesn't have to go through them in meticulous detail but can limit the discussion to the main idea(s) accompanied by some relevant examples and counterexamples, and (ii) the argumentation can serve as a guide to solving the exercises. The theoretical side comes with almost no compromises, but there are a few rare exceptions that would have required lengthy mathematical detours that are not germane to the subject at hand, and are more appropriately dealt with in other texts or lectures.

    Numerical software

As already mentioned, although we end up describing a significant number of algorithmic procedures, we don't concern ourselves directly with

¹An appendix provides a review of some standard notation and terminology as well as some basic results in analysis that might not be familiar to a heterogeneous student body, a typical situation for such a course.


implementation issues. This is best dealt with in specialized courses and textbooks such as [16, 10, 20]. For example, in the case of linear programming, there is a description of both the simplex and the interior point methods in Chapters 7 and 6, but from the outset it's assumed that packages to solve mathematical programs of various types, including linear programs, are available (CPLEX, IBM Solutions, LOQO, ...). To allow for some experimentation with these solution procedures, it's assumed that the reader has access to Matlab², in particular to the functions found in the Matlab Optimization Toolbox, and will be able to use them to solve the numerical exercises. These Matlab functionalities were used to solve the examples, and in a number of instances, the corresponding m-file has been supplied.

²Matlab is distributed by The MathWorks, Inc.


    Contents

1 PRELUDE
1.1 Mathematical curtain rise
1.2 Curve fitting I
1.3 Steepest Descent and Newton methods
1.4 The Quasi-Newton methods
1.5 Integral functionals
1.6 In conclusion...

2 FORMULATION
2.1 A product mix problem
2.2 Curve fitting II
2.3 A network capacity expansion problem
2.4 Discrete decision variables
2.5 The Broadway producer problem

3 PRELIMINARIES
3.1 Variational analysis I
3.2 Variational analysis II
3.3 Plain probability distributions
3.4 Expectation functionals I
3.5 Analysis of the producer's problem

4 LINEAR CONSTRAINTS
4.1 Linearly constrained programs
4.2 Variational analysis III
4.3 Variational analysis IV
4.4 Lagrange multipliers


4.5 Karush-Kuhn-Tucker conditions I

5 SIMPLE RECOURSE: RHS
5.1 Random right hand sides
5.2 Aircraft allocation to routes I
5.3 Separable simple recourse I
5.4 Aircraft allocation to routes II
5.5 Approximations I

6 LAGRANGIANS
6.1 Saddle functions
6.2 Primal and Dual problems
6.3 A primal-dual interior-point method
6.4 Monitoring functions
6.5 Lake Stoopt I
6.6 Separable simple recourse II
6.7 The Lagrangian finite generation method
6.8 Lake Stoopt II

7 POLYHEDRAL CONVEXITY
7.1 Polyhedral sets
7.2 Full duality: linear programs
7.3 Variational analysis V
7.4 The simplex method

8 OPTIMALITY & DUALITY
8.1 Variational analysis VI
8.2 Separation Theorems
8.3 Variational analysis VIII
8.4 Karush-Kuhn-Tucker conditions II
8.5 Variational analysis VII: Conjugacy
8.6 General duality theory
8.7 Geometric programming
8.8 Semi-definite programming


9 LINEAR RECOURSE
9.1 Fixed recourse and fixed costs
9.2 A Manufacturing model
9.3 Feasibility
9.4 Stochastic linear programs with recourse
9.5 Optimality conditions
9.6 Network capacity expansion II
9.7 Preprocessing
9.8 A summary
9.9 Practical probability II
9.10 Expectation functionals II
9.11 Disintegration Principle
9.12 Stochastic programming: duality

10 DECOMPOSITION
10.1 Lagrangian relaxation
10.2 Sequential linear programming
10.3 The L-shaped method
10.4 Dantzig-Wolfe decomposition
10.5 An optimal control problem
10.6 A targeting problem
10.7 Linear-quadratic control models
10.8 A hydro-power generation problem

11 APPROXIMATION THEORY
11.1 Formulation
11.2 Epi-convergence
11.3 Barrier & Penalty methods, exact?
11.4 Infinite dimensional theory
11.5 Approximation of control problems
11.6 Approximation of stochastic programs
11.7 Approximation of statistical estimation problems
11.8 Augmented Lagrangians
11.9 Variational Analysis ??
11.10 Proximal point algorithm
11.11 Method of multipliers: equalities
11.12 Method of multipliers: inequalities


11.13 Application to engineering design

12 NONLINEAR OPTIMIZATION
12.1 Statistical estimation: An introduction
12.1.1 The discrete case
12.2 Statistical estimation: parametric
12.3 Statistical estimation: non-parametric
12.4 Non-convex optimization
12.5 KKT-optimality conditions
12.6 Sequential quadratic programming
12.7 Trust regions

13 EQUILIBRIUM PROBLEMS
13.1 Convex-type equilibrium problems
13.2 Variational inequalities
13.3 Monotone Operators
13.4 Complementarity problem
13.5 Application in Mechanics
13.6 Pricing an American option
13.7 Market Equilibrium: Walras
13.8 Application to traffic, transportation
13.9 Non-cooperative games: Nash
13.10 Energy? Communications pricing

14 NON-DIFFERENTIABLE OPTIMIZATION
14.1 Bundle method
14.2 Example of the bundle method
14.3 Stochastic quasi-gradient method
14.4 Application: urn problem
14.5 Sampled gradient (Burke, Lewis & Overton)
14.6 Eigenvalue calculations

15 DYNAMIC PROBLEMS
15.1 Optimal control problems
15.2 Hamiltonian, Pontryagin's
15.3 Polak's minmax approach
15.4 Multistage stochastic programs


15.5 Progressive hedging algorithm
15.6 Water reservoirs management
15.7 Linear-quadratic stochastic control
15.8 Pricing a contingent claim

16 TOPICS IN STOCHASTIC PROGRAMMING
16.1 The distribution problem
16.2 Application to robotics
16.3 Chance constraints
16.4 Reliability of networks
16.5 Risk measures
16.6 Modeling decision making under uncertainty

A Notation and terminology
A.1 Existence: limits & minimizers
A.2 Function expansions


    Chapter 1

    PRELUDE

"How simple and clear this is," thought Pierre. "How could I not have known this before." War and Peace, Leo Tolstoy.

Let's begin our journey in the classical landscape: all functions to be minimized are smooth and there are no side constraints! The mathematical foundations were laid down in the middle of the last millennium by two genial mathematical dabblers. As we shall see, the rules they formulated provide the guidelines for building an optimization theory, in the classical framework as well as in the non-classical framework that's going to be our main concern in this book. This chapter also covers the basic algorithmic procedures to find, at least numerically, the minimizer(s) of smooth multivariate functions. Although our interest will be primarily with the minimization of functions, subject or not to constraints, defined on $\mathbb{R}^n$, the last section shows that the rules to identify minimizers are also applicable to functions defined on infinite-dimensional spaces, like a space of arcs connecting two points, etc.

    1.1 Mathematical curtain rise

The modern theory of optimization starts in the middle of the 14th Century with Nicholas Oresme (1323-1382), part-time mathematician and full-time Bishop of Lisieux (France). In his treatise [15], he remarks that near a minimum, the increment of a variable quantity becomes 0. A present day


version would read

Oresme Rule: $x^* \in \operatorname{argmin} f \implies df(x^*; w) = 0,\ \forall\, w \in \mathbb{R}$,

where $\operatorname{argmin} f$ is the set of minimizers of $f$, i.e., the arguments that minimize $f$, and
$$df(x; w) := \lim_{\tau \downarrow 0} \frac{f(x + \tau w) - f(x)}{\tau}$$
is the derivative¹ of the function $f\colon \mathbb{R} \to \mathbb{R}$ at the point $x$ in direction $w$, i.e., the limit of the incremental value of $f$ at $x$ in direction $w$.

Figure 1.1: Derivative function identifying incremental changes at x.

About three centuries later, Pierre de Fermat (1601-1665), another part-time mathematician and full-time ??-lawyer at the Royal Court in Toulouse (France), while working on the long-standing tangent problem, observed that for $x^*$ to be a minimizer of a function $f$, the tangent to the graph of the function $f$ at the point $(x^*, f(x^*))$ must be parallel to the $x$-axis. In the notation of Differential Calculus², one would express this as

Fermat Rule: $x^* \in \operatorname{argmin} f \implies f'(x^*) = 0$,

where
$$f'(x^*) := \lim_{\tau \downarrow 0} \frac{1}{\tau}\big( f(x^* + \tau) - f(x^*) \big)$$

¹called the Gateaux derivative when it's necessary to distinguish it from some alternative definitions of derivative.
²whose development can be viewed as a continuation and a formalization of Fermat's work on the tangent problem


Figure 1.2: Horizontal tangent to the graph of a function at a minimum.

is the slope³ of the tangent at $x^*$. Implicit in the formulation of these optimality criteria is the assumption: $f$ is smooth, i.e., continuously differentiable; in those days, only smooth functions were considered to be of any interest. And for smooth functions, one has
$$\forall\, w \in \mathbb{R}: \quad df(x; w) = f'(x)\, w,$$
as is immediate from the definitions. Consequently, the Fermat rule can be derived from Oresme's rule and vice versa.

To extend Oresme's rule to functions defined on $\mathbb{R}^n$, again assuming smoothness, one has to consider possible moves, or variations, not just to the right or the left, but in every possible direction. And the rule becomes:
$$x^* \in \operatorname{argmin} f \implies df(x^*; w) = 0,\ \forall\, w \in \mathbb{R}^n,$$
whereas Fermat's rule now takes the form
$$x^* \in \operatorname{argmin} f \implies \nabla f(x^*) = 0.$$

Indeed, the slope of the tangent to the graph of a smooth function $f$ at a point $(x, f(x))$ is given by the gradient of $f$ at $x$:
$$\nabla f(x) = \Big( \frac{\partial f}{\partial x_1}(x),\ \frac{\partial f}{\partial x_2}(x),\ \dots,\ \frac{\partial f}{\partial x_n}(x) \Big)$$
with the partial derivatives defined by
$$\frac{\partial f}{\partial x_i}(x) := \lim_{\tau \downarrow 0} \frac{f(x + \tau e^i) - f(x)}{\tau} = df(x; e^i), \quad \text{for } i = 1, \dots, n,$$

³In Differential Calculus, one usually refers to $f'(x)$ as the derivative of $f$ at $x$, but we want to reserve this term for the more malleable function $df(x; \cdot)$.


where $e^i = (0, \dots, 1, \dots, 0)$ is the unit $n$-vector with a 1 in the $i$th position. In the 1-dimensional case, one has $\nabla f(x) = f'(x)$. Because $f$ is smooth, it follows from the preceding definitions that
$$\forall\, w \in \mathbb{R}^n: \quad df(x; w) = \langle \nabla f(x), w \rangle = \sum_{j=1}^{n} \frac{\partial f}{\partial x_j}(x)\, w_j,$$
and so, also for smooth functions defined on $\mathbb{R}^n$, one can derive Fermat's rule from Oresme's rule and vice versa. The fact that one can rely on either one of these rules to check for optimality turns out to be quite convenient.
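To make the identity $df(x; w) = \langle \nabla f(x), w \rangle$ concrete, here is a minimal Matlab sketch, with a made-up smooth function, base point and direction (none of them from the text), comparing the incremental quotient with the inner product of the gradient and the direction.

f = @(x) x(1)^2 + exp(x(1)*x(2));                 % made-up smooth function
g = @(x) [2*x(1) + x(2)*exp(x(1)*x(2)); ...       % its gradient, computed by hand
          x(1)*exp(x(1)*x(2))];
x = [0.5; -0.3]; w = [1; 2];                      % base point and direction
tau = 1e-6;
dfdw = (f(x + tau*w) - f(x))/tau                  % incremental quotient: approximates df(x; w)
ip   = g(x)'*w                                    % <grad f(x), w>; should agree to about 1e-6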

    1.2 Curve fitting I

Given the values of an unknown function $h\colon [0,1] \to \mathbb{R}$ at a finite number of distinct points, $z_1, z_2, \dots, z_L$, one is interested in finding a polynomial $p\colon [0,1] \to \mathbb{R}$ of degree $n$, i.e., $p(x) = a_n x^n + \cdots + a_1 x + a_0$, whose values at $z_1, \dots, z_L$ are as close as possible to those of $h$. There are a number of ways to interpret "as close as possible", but for now, let it have the meaning of least squares, i.e., the sum of the squares of the distances between $h(z_l)$ and $p(z_l)$ is to be minimized. With $a = (a_n, \dots, a_1, a_0)$, the minimization problem is:
$$\min_{a \in \mathbb{R}^{n+1}} \sum_{l=1}^{L} \Big( \sum_{j=0}^{n} a_j z_l^{\,j} - h(z_l) \Big)^2.$$
With
$$Z = \begin{pmatrix} z_1^n & \dots & z_1 & 1\\ z_2^n & \dots & z_2 & 1\\ \vdots & & \vdots & \vdots\\ z_L^n & \dots & z_L & 1 \end{pmatrix} \quad\text{and}\quad y = \big( h(z_1), h(z_2), \dots, h(z_L) \big),$$
the least squares problem can be written as:
$$\min_{a \in \mathbb{R}^{n+1}} \langle Za - y,\ Za - y \rangle,$$


or still $\min_a f(a) = |Za - y|^2$, i.e., the least squares solution will minimize the square of the norm of the error. Applying Fermat's rule, we see that the minimizer(s) $a^*$ must satisfy
$$\nabla f(a^*) = 2 Z^\top (Z a^* - y) = 0,$$
or equivalently, the so-called normal equation,
$$Z^\top\! Z\, a^* = Z^\top y.$$
If we assume, as might be expected, that $n + 1 \le L$, and recalling that the points $z_1, \dots, z_L$ are distinct, the columns of the matrix $Z$ are linearly independent. Hence, $Z^\top\! Z$ is invertible and
$$a^* = (Z^\top\! Z)^{-1} Z^\top y.$$
This is the solution calculated by the Matlab function polyfit. In Figure 1.3, a 5th degree polynomial has been fitted to the given data points; the plot has been obtained with polyval, another Matlab function.

Figure 1.3: Fitting data with 5th degree polynomial
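As a quick illustration of the normal equation, here is a minimal sketch with made-up data (the exact data behind Figure 1.3 is not reproduced here) that forms Z, solves the normal equation directly, and checks the result against polyfit.

z = (0:0.05:1)';                      % made-up sample points z_1,...,z_L
y = sin(2*pi*z) + 0.1*randn(size(z)); % made-up noisy values h(z_l)
n = 5;                                % degree of the fitted polynomial
Z = z.^(n:-1:0);                      % L x (n+1) matrix; implicit expansion (R2016b+)
a = (Z'*Z)\(Z'*y);                    % solve the normal equation Z'Z a = Z'y
p = polyfit(z, y, n);                 % least squares coefficients via polyfit
norm(a - p')                          % should be numerically negligible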


1.1 Exercise (polyfit and polyval functions). Let x = (0, 0.05, ..., 0.95, 1) and y = (0.95, 0.23, 0.61, 0.49, 0.89, 0.76, 0.46, 0.02, 0.82, 0.44, 0.62, 0.79, 0.92, 0.74, 0.18, 0.41, 0.94, 0.92, 0.41, 0.89, 0.06). Use the function polyfit to obtain a polynomial fit of degree n = 1, 2, ..., 11 and n = 21. For each n, calculate the mean square error and plot your results so that you can visually see the fit and the graph of the polynomial.

Guide. For given n, p = polyfit(x,y,n) returns the coefficients of the polynomial of degree n. To graph the resulting polynomial and check the fit, use the command plot(z, polyval(p,z), x, y, 'xm') with z = (-0.2 : 0.001 : 1.2)⁴.

    1.3 Steepest Descent and Newton methods

When we applied Fermat's rule to the polynomial fit problem, the minimizer could be found by solving a system of linear equations, and there are very efficient procedures available to solve $(n \times n)$-linear systems. But, in general, the function to be minimized won't be quadratic and, consequently, Fermat's rule may result in a system consisting of $n$ nonlinear equations. For example, when
$$f(x_1, x_2) = (x_2 - x_1^2)^2 + (1 - x_1)^2,$$
Fermat's rule yields
$$2x_1^3 - 2x_2 x_1 + x_1 = 1, \qquad -x_1^2 + x_2 = 0.$$
And that system is not any easier to solve than minimizing $f$. In fact, procedures for solving nonlinear equations and for minimizing functions on $\mathbb{R}^n$ go hand in hand⁵. In this section, and the next, we outline algorithmic procedures to find a point $\bar x$ that satisfies Fermat's rule. Such points will minimize the function $f$, at least locally, when the function $f$ is locally convex; convexity is dealt with in Chapter 3, and locally convex means that $f$ is convex in a neighborhood of $\bar x$, say on a ball $\mathbb{B}(\bar x, \rho)$ with $\rho > 0$.

⁴In a Matlab figure window, use the Export functionality to obtain printable files, for example, EPS files.
⁵Roughly speaking, a system of $n$ equations, linear or nonlinear, in $n$ variables can be thought of as the gradient of some nonlinear function defined on $\mathbb{R}^n$.


1.2 Definition (local minimizers). For $f\colon \mathbb{R}^n \to \mathbb{R}$, $\bar x$ is a local minimizer if $f(\bar x) \le f(x)$ for all $x \in \mathbb{B}(\bar x, \rho)$ for some $\rho > 0$, i.e.,
$$\bar x \in \operatorname{argmin}_{x \in \mathbb{B}(\bar x, \rho)} f(x).$$
A global minimizer is simply a minimizer of $f$ on $\mathbb{R}^n$.

Figure 1.4: Local and global minimizers

The first step in the design of algorithmic procedures to minimize a function $f$ is to identify directions of descent.

1.3 Lemma (direction of descent). Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be smooth. Whenever $df(x; d) < 0$, there is a $\delta > 0$ such that
$$\forall\, \lambda \in (0, \delta): \quad f(x + \lambda d) < f(x);$$
the same conclusion holds for every $d \in \mathbb{R}^n$ such that $\langle \nabla f(x), d \rangle < 0$.

Proof. Because $f$ is smooth at $x$, $f(x + \lambda d) - f(x) = \lambda\, df(x; d) + o(\lambda)$ and, since $df(x; d) < 0$, there is a $\delta > 0$ such that $f(x + \lambda d) - f(x) < 0$ for all $\lambda \in (0, \delta)$. The assertion involving the gradient simply follows from $df(x; d) = \langle \nabla f(x), d \rangle$ when $f$ is smooth.

Steepest Descent Method.
Step 0. Pick $x^0 \in \mathbb{R}^n$, set $\nu := 0$.
Step 1. Stop if $\nabla f(x^\nu) = 0$; otherwise, $d^\nu := -\nabla f(x^\nu)$.
Step 2. $\lambda^\nu \in \operatorname{argmin}_{\lambda \ge 0} \big[ f(x^\nu + \lambda d^\nu) - f(x^\nu) \big]$.
Step 3. $x^{\nu+1} := x^\nu + \lambda^\nu d^\nu$, $\nu \leftarrow \nu + 1$, go to Step 1.

1.4 Convergence (steepest descent). Suppose $f\colon \mathbb{R}^n \to \mathbb{R}$ is smooth. Then, the Steepest Descent algorithm stops after a finite number of steps at a point $x^\nu$ where $\nabla f(x^\nu) = 0$, or it generates a sequence of points $\{x^\nu,\ \nu \in \mathbb{N}\}$ that either diverges, i.e., $|x^\nu| \to \infty$, or $\nabla f(\bar x) = 0$ for every cluster point $\bar x$ of this sequence.

Proof. The algorithm stops, in Step 1, only when $\nabla f(x^\nu) = 0$, necessarily after a finite number of steps. Otherwise, excluding the case when the iterates diverge, the sequence $\{x^\nu,\ \nu \in \mathbb{N}\}$ will have at least one cluster point, say $\bar x$. By restricting our attention to the convergent subsequence, we may as well proceed as if $x^\nu \to \bar x$, i.e., $\bar x$ is a limit point.

If $\nabla f(\bar x) \ne 0$, then $\bar d = -\nabla f(\bar x)$ is a direction of descent and, by Lemma 1.3, $f(\bar x + \lambda \bar d) < f(\bar x)$ for all $\lambda \in (0, \bar\delta)$ for some $\bar\delta > 0$. Since $f$ is smooth⁶,
$$f(x^{\nu+1}) - f(x^\nu) = -\lambda^\nu |d^\nu|^2 + o(\lambda^\nu),$$
where $d^\nu = -\nabla f(x^\nu)$, and $x^\nu \to \bar x$ implies
$$f(x^{\nu+1}) - f(x^\nu) \to 0, \qquad |d^\nu|^2 \to |\bar d|^2 \ne 0.$$
Hence, from the preceding identity, by letting $\nu \to \infty$, it follows that $\lambda^\nu \to 0$. This means that eventually $\lambda^\nu \in (0, \bar\delta)$, and then, by the definition of the step size in Step 2, one must have
$$f(x^\nu + \lambda^\nu d^\nu) - f(x^\nu) \le f(x^\nu + \lambda d^\nu) - f(x^\nu) = -\lambda |d^\nu|^2 + o(\lambda), \quad \forall\, \lambda \in (0, \bar\delta).$$
Letting $\nu \to \infty$ and choosing $\lambda$ close enough to 0, one obtains $0 \le -\lambda |\bar d|^2 + o(\lambda) < 0$, a contradiction; hence $\nabla f(\bar x) = 0$.


Step 2, as stated, presupposes that a minimizer of $\lambda \mapsto f(x^\nu + \lambda d^\nu)$ on $[0, \infty)$ can be computed exactly. That's an unrealistic premise. The best one can hope for is an approximating minimizer. When implementing this algorithm, the step size $\lambda^\nu$ in Step 2 is commonly calculated as follows: for parameters $\alpha, \beta \in (0, 1)$ selected at the outset in Step 0,
$$\text{A-Step 2.}\quad \lambda^\nu := \max_{k = 0, 1, \dots} \big\{\, \alpha^k \ \big|\ f(x^\nu + \alpha^k d^\nu) - f(x^\nu) \le -\beta |d^\nu|^2 \alpha^k \,\big\}.$$
One then refers to $\lambda^\nu$ as the Armijo step size⁷; $\alpha^k$ is $\alpha$ to the power $k$, so, in particular, $\alpha^0 = 1$.

Figure 1.5: Calculating the Armijo step size
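The following minimal Matlab sketch implements the Steepest Descent method with A-Step 2; the function handles, parameter values, tolerance and iteration cap below are illustrative choices, not prescriptions from the text.

function [x, nu] = steepdesc(f, gradf, x, alpha, beta, tol, maxit)
% Steepest Descent with the Armijo step size of A-Step 2.
% f, gradf: function handles; alpha, beta in (0,1); tol: stopping tolerance.
for nu = 0:maxit
    d = -gradf(x);                               % Step 1: steepest descent direction
    if norm(d) <= tol, return; end               % stop when the gradient (nearly) vanishes
    lam = 1;                                     % k = 0, alpha^0 = 1
    while f(x + lam*d) - f(x) > -beta*norm(d)^2*lam
        lam = alpha*lam;                         % back off to alpha^k, k = 1, 2, ...
    end
    x = x + lam*d;                               % Step 3
end
end

% Example call, for the function f(x1,x2) = (x2 - x1^2)^2 + (1 - x1)^2
% considered at the start of this section:
%   f = @(x) (x(2)-x(1)^2)^2 + (1-x(1))^2;
%   g = @(x) [4*x(1)^3 - 4*x(1)*x(2) + 2*x(1) - 2; 2*(x(2)-x(1)^2)];
%   [xbar, nu] = steepdesc(f, g, [-1.2; 1], 0.5, 1e-4, 1e-6, 10000);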

The convergence proof can easily be adjusted to accommodate the Armijo step size selection.

One may interpret the iterations of the Steepest Descent method as follows: at $x^\nu$, the function is approximated by
$$\hat f(x) = f(x^\nu) + \langle \nabla f(x^\nu), x - x^\nu \rangle + \frac{1}{2\lambda} |x - x^\nu|^2,$$
where $\lambda > 0$ is a parameter to be selected at some point. The next iterate $x^{\nu+1}$ is obtained by minimizing $\hat f$, i.e.,
$$x^{\nu+1} = x^\nu - \lambda \nabla f(x^\nu).$$
If $\hat f$ turns out to be a good approximation of $f$, at least locally, one should end up with a point $x^{\nu+1}$ near a local minimizer of $f$. However, it's obvious that this can only be the case for a very limited class of nonlinear functions $f$. A more trustworthy local approximation of $f$ at $x^\nu$ is provided by
$$\hat f(x) = f(x^\nu) + \langle \nabla f(x^\nu), x - x^\nu \rangle + \frac{1}{2\lambda} \langle x - x^\nu,\ \nabla^2 f(x^\nu)(x - x^\nu) \rangle,$$

⁷more appropriately, the Armijo-Goldstein step size


assuming that $f$ is twice differentiable with Hessian matrix at $x^\nu$,
$$\nabla^2 f(x^\nu) = \Big[ \frac{\partial^2 f}{\partial x_i \partial x_j}(x^\nu) \Big]_{i,j=1}^{n,n},$$
again with $\lambda > 0$, a parameter to be selected appropriately. Assuming, furthermore, that $\nabla^2 f(x^\nu)$ is positive definite⁸, and thus also invertible, the minimum of $\hat f$ is attained at
$$x^{\nu+1} = x^\nu - \lambda\, [\nabla^2 f(x^\nu)]^{-1} \nabla f(x^\nu).$$
The suggested descent direction,
$$d^\nu = -[\nabla^2 f(x^\nu)]^{-1} \nabla f(x^\nu),$$
is known as the Newton direction. It's certainly a direction of descent since $\langle \nabla f(x^\nu),\ [\nabla^2 f(x^\nu)]^{-1} \nabla f(x^\nu) \rangle > 0$; the inverse of a positive definite matrix is also positive definite. This leads us to the following variant of the Steepest Descent method:

Newton Method (with line minimization).
Step 0. Pick $x^0 \in \mathbb{R}^n$, set $\nu := 0$.
Step 1. Stop if $\nabla f(x^\nu) = 0$; otherwise, $d^\nu := -[\nabla^2 f(x^\nu)]^{-1} \nabla f(x^\nu)$.
Step 2. $\lambda^\nu \in \operatorname{argmin}_{\lambda \ge 0} \big[ f(x^\nu + \lambda d^\nu) - f(x^\nu) \big]$.
Step 3. $x^{\nu+1} := x^\nu + \lambda^\nu d^\nu$, $\nu \leftarrow \nu + 1$, go to Step 1.

The convergence proof for the Newton method is essentially the same as that for the Steepest Descent method; the only difference is that a different descent direction is calculated in Step 1. And when implementing the Newton method, one would again replace Step 2 by A-Step 2, i.e., determine $\lambda^\nu$ by calculating the Armijo step size.

What really makes the Newton method very attractive is that it comes with particularly desirable local convergence properties! The proof that follows is for the classical version of Newton's method that doesn't include a line minimization step, i.e., with $\lambda^\nu = 1$.

⁸A matrix $C$ is positive definite if $\langle x, Cx \rangle > 0$ for all $x \ne 0$. When the Hessian at $\bar x$ of a twice continuously differentiable function $f\colon \mathbb{R}^n \to \mathbb{R}$ is positive definite, $f$ is strictly convex on a neighborhood of $\bar x$, cf. 3.15. We kick off our study of convexity in Chapter 3.


Figure 1.6: Comparison of Newton and steepest descent directions

1.5 Convergence (Newton Method: local convergence). Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable and $x \mapsto \nabla^2 f(x)$ locally Lipschitz continuous⁹, i.e., given any $\bar x$, there are $\delta > 0$ and $\kappa \ge 0$ such that
$$\| \nabla^2 f(x) - \nabla^2 f(x') \| \le \kappa |x - x'|, \quad \forall\, x, x' \in \mathbb{B}(\bar x, \delta).$$
Let $x^*$ be a local minimizer that satisfies the following second order sufficiency condition: there are constants $0 < l \le u$ such that, for all $x \in \mathbb{B}(x^*, \delta)$,
$$l |w|^2 \le \langle w, \nabla^2 f(x)\, w \rangle \le u |w|^2, \quad \forall\, w \in \mathbb{R}^n.$$
Then there is a $\bar\delta > 0$ such that if Newton's method is started at a point $x^0 \in \mathbb{B}(x^*, \bar\delta)$, the iterates $\{x^\nu\}_{\nu \in \mathbb{N}}$ will converge quadratically to $x^*$, i.e.,
$$\limsup_{\nu \to \infty} \frac{|x^{\nu+1} - x^*|}{|x^\nu - x^*|^2} < \infty.$$

Proof.


Choose $\bar\delta \in (0, \delta]$ such that $(\kappa / l)\bar\delta < 1$ and suppose $x^\nu \in \mathbb{B}(x^*, \bar\delta)$. With $z^\nu = x^* - x^\nu$,
$$\nabla^2 f(x^\nu)(x^{\nu+1} - x^\nu) = -\nabla f(x^\nu) + \nabla f(x^*) = \int_0^1 \nabla^2 f(x^* - t z^\nu)\, z^\nu\, dt,$$
as follows from the Mean Value Theorem; recall $\nabla f(x^*) = 0$. Because $x^* - t z^\nu \in \mathbb{B}(x^*, \bar\delta)$ for all $t \in [0, 1]$, adding $-\nabla^2 f(x^\nu)\, z^\nu$ to both sides of the preceding identity yields
$$x^{\nu+1} - x^* = [\nabla^2 f(x^\nu)]^{-1} \int_0^1 \big( \nabla^2 f(x^* - t z^\nu) - \nabla^2 f(x^\nu) \big)\, z^\nu\, dt.$$
Since $\| [\nabla^2 f(x^\nu)]^{-1} \| \le 1/l$ and $|\int g(t)\, dt| \le \int |g(t)|\, dt$,
$$|x^{\nu+1} - x^*| \le |z^\nu|\, \big\| [\nabla^2 f(x^\nu)]^{-1} \big\| \int_0^1 \big\| \nabla^2 f(x^* - t z^\nu) - \nabla^2 f(x^\nu) \big\|\, dt \le \frac{\kappa}{l} |z^\nu|^2 = \frac{\kappa |z^\nu|}{l}\, |z^\nu|.$$
Thus, if $(\kappa / l)|z^\nu| < 1$, $x^{\nu+1}$ lies again in $\mathbb{B}(x^*, \bar\delta)$; since $(\kappa / l)\bar\delta < 1$, it follows by induction that all the iterates remain in $\mathbb{B}(x^*, \bar\delta)$ and that $|x^{\nu+1} - x^*| \le (\kappa / l)\, |x^\nu - x^*|^2$, i.e., the convergence is quadratic.


1.6 Exercise (Rosenbrock's function). Apply the Steepest Descent method, with Armijo step size, to the Rosenbrock function
$$f(x_1, x_2) = 100 (x_2 - x_1^2)^2 + (1 - x_1)^2,$$
starting at $x^0 = (-1.2, 1)$.

Guide. The Rosenbrock function has boomerang-shaped level sets. The minimum occurs at (1, 1). Starting at $x^0 = (-1.2, 1)$, the Steepest Descent method may require as many as a thousand steps to reach the neighborhood of the solution. As can be observed, the method of Steepest Descent can not only be quite inefficient but, due to numerical round-off, it might even get stuck at a non-optimal point.

The Newton method, like the steepest descent method, can be viewed as a procedure to find a solution to the system of $n$ equations:
$$\frac{\partial f}{\partial x_1}(x) = 0, \quad \frac{\partial f}{\partial x_2}(x) = 0, \quad \dots, \quad \frac{\partial f}{\partial x_n}(x) = 0.$$

Because these functions $x \mapsto \frac{\partial f}{\partial x_j}(x)$ have, in principle, no preassigned properties, one could have described Newton's method as one of solving a system of $n$ nonlinear equations in $n$ unknowns, say
$$G(x) = \begin{pmatrix} G_1(x)\\ \vdots\\ G_n(x) \end{pmatrix} = \begin{pmatrix} 0\\ \vdots\\ 0 \end{pmatrix}$$
with
$$\nabla G(x) = \Big[ \frac{\partial G_i}{\partial x_j}(x) \Big]_{i,j=1}^{n},$$
the Jacobian of $G$ at $x$. Generally, the algorithmic procedure, then known as the Newton-Raphson method, doesn't involve a line minimization step, but a line search could be included to avoid being led in unfavorable directions.

Assuming that the Jacobian is invertible, a generic version of the method involves the following steps:

Newton-Raphson Method.
Step 0. Pick $x^0 \in \mathbb{R}^n$, set $\nu := 0$.
Step 1. Stop if $G(x^\nu) = 0$; otherwise, $d^\nu := -[\nabla G(x^\nu)]^{-1} G(x^\nu)$.
Step 2. $x^{\nu+1} := x^\nu + d^\nu$, $\nu \leftarrow \nu + 1$, go to Step 1.

After making the appropriate change of notation, one can follow step by step the proof of 1.5 to obtain a (local) quadratic convergence rate. Figure 1.7 illustrates the steps of the Newton-Raphson procedure when $G\colon \mathbb{R} \to \mathbb{R}$.


Figure 1.7: Root finding via Newton-Raphson in dimension 1
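In dimension 1, the iteration of Figure 1.7 takes just a few lines of Matlab; the equation G(x) = x³ - 2 = 0 below is a made-up example, not one from the text.

G  = @(x) x^3 - 2;            % made-up test equation G(x) = 0
dG = @(x) 3*x^2;              % its derivative, the one-dimensional "Jacobian"
x = 1;                        % starting point x^0
for nu = 1:50
    if abs(G(x)) <= 1e-12, break; end
    x = x - G(x)/dG(x);       % x^{nu+1} = x^nu - G(x^nu)/G'(x^nu)
end
x                             % approximately 2^(1/3); near the root the error roughly squares each step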

Newton-Raphson for systems of linear/nonlinear equations and Newton's method for unconstrained optimization problems illustrate vividly the important connections between these two classes of problems, both in the design of solution procedures as well as in the analysis of their intrinsic properties.

    1.4 The Quasi-Newton methods

The numerical implication of quadratic convergence is that once a small enough neighborhood of a (local) minimizer is reached, each iteration will double the number of correct digits in the approximating solutions $x^\nu$. Of course, this doesn't take into consideration round-off errors! But, nonetheless, any method with these characteristics has to be treasured and deserves to be plagiarized. Unfortunately, the conditions under which one can apply Newton's method are quite restrictive:

- $\nabla^2 f(x^\nu)$ might not be invertible. This can be repaired to some extent by choosing appropriately a direction $d^\nu$ that satisfies $\nabla^2 f(x^\nu)\, d^\nu = -\nabla f(x^\nu)$.

- The approximation $\hat f$ for $f$ at $x^\nu$ might be poor or, more generally, only valid in a very limited region. Again, this can be repaired to some extent by restricting the step size to a trust region.

- The function $f$ is not twice differentiable, or its Hessian is difficult to compute. This can't be repaired in the framework of the Newton method; it requires a different approach.


Quasi-Newton Method¹⁰.
Step 0. Pick $x^0 \in \mathbb{R}^n$, set $\nu := 0$, pick $B^0$ ($= I$, for example).
Step 1. Stop if $\nabla f(x^\nu) = 0$; otherwise, choose $d^\nu$ such that $B^\nu d^\nu = -\nabla f(x^\nu)$.
Step 2. $\lambda^\nu \in \operatorname{argmin}_{\lambda \ge 0} \big[ f(x^\nu + \lambda d^\nu) - f(x^\nu) \big]$.
Step 3. $x^{\nu+1} := x^\nu + \lambda^\nu d^\nu$, calculate $B^{\nu+1}$, $\nu \leftarrow \nu + 1$, go to Step 1.

Clearly $B^\nu$ plays here the role of the Hessian $\nabla^2 f(x^\nu)$ in the Newton method, i.e., it tries to capture the adjustment that needs to be made in the steepest descent direction on the basis of the local curvature of the function $f$. The actual behavior of the method is determined by the choice one makes of the updating mechanism for $B^\nu$. To guarantee, at least, local convergence, one imposes the following condition:

The Quasi-Newton Condition. The curvature along the descent direction $d^\nu$ (from $x^\nu$ to $x^{\nu+1}$) should be approximated by
$$B^{\nu+1}(x^{\nu+1} - x^\nu) := (B^\nu + U^\nu)(x^{\nu+1} - x^\nu) = \nabla f(x^{\nu+1}) - \nabla f(x^\nu),$$
or equivalently, with
$$s^\nu = x^{\nu+1} - x^\nu, \quad c^\nu = \nabla f(x^{\nu+1}) - \nabla f(x^\nu): \qquad U^\nu s^\nu = c^\nu - B^\nu s^\nu.$$

The updating matrix $U^\nu = B^{\nu+1} - B^\nu$ must be chosen so that it satisfies the preceding identity. This can always be achieved by means of a matrix of rank 1 of the type $U^\nu = u \otimes v$, where
$$u \otimes v = \begin{pmatrix} u_1 v_1 & u_1 v_2 & \dots & u_1 v_n\\ u_2 v_1 & u_2 v_2 & \dots & u_2 v_n\\ \vdots & \vdots & & \vdots\\ u_n v_1 & u_n v_2 & \dots & u_n v_n \end{pmatrix}$$
is the outer product of the vectors $u$ and $v$. We must have
$$[\, u \otimes v \,]\, s^\nu = \langle v, s^\nu \rangle\, u = c^\nu - B^\nu s^\nu.$$

¹⁰Methods of this type are also known as variable metric methods. One can think of the descent direction generated in Step 1 as the steepest descent but with respect to a different metric on $\mathbb{R}^n$ than the usual Euclidean metric.


This means that $u$ needs to be a multiple of $(c^\nu - B^\nu s^\nu)$. Assuming $c^\nu \ne B^\nu s^\nu$, otherwise $B^\nu$ itself satisfies the Quasi-Newton condition, for any $v$ such that $\langle v, s^\nu \rangle \ne 0$ one would set
$$u = \frac{1}{\langle v, s^\nu \rangle}\, (c^\nu - B^\nu s^\nu) \quad\text{and}\quad U^\nu = \frac{1}{\langle v, s^\nu \rangle}\, \big[\, (c^\nu - B^\nu s^\nu) \otimes v \,\big].$$
In the preceding expression all quantities are fixed except for $v$, and the only restriction is that $v$ shouldn't be orthogonal to $s^\nu$. One can choose $v$ so as to restrict $B^\nu$ to a class of matrices that have some desirable properties. For example, one may wish to have the matrices $B^\nu$ symmetric; the Hessian $\nabla^2 f(x)$ is symmetric when it is defined. Choosing
$$v = c^\nu - B^\nu s^\nu \quad\text{yields}\quad B^{\nu+1} = B^\nu + \frac{1}{\langle v, s^\nu \rangle}\, [\, v \otimes v \,],$$
which is symmetric if $B^\nu$ is symmetric.

One particularly useful property of these updates is that their inverses

can be computed recursively. Indeed, if $B$ is invertible and $\langle v, B^{-1} u \rangle \ne -1$, then
$$\big( B + [\, u \otimes v \,] \big)^{-1} = B^{-1} - \frac{1}{1 + \langle v, B^{-1} u \rangle}\, B^{-1} [\, u \otimes v \,]\, B^{-1}.$$
To see this, simply observe that
$$\big( B + [\, u \otimes v \,] \big) \Big( B^{-1} - \frac{1}{1 + \langle v, B^{-1} u \rangle}\, B^{-1} [\, u \otimes v \,]\, B^{-1} \Big) = I$$
follows from
$$[\, u \otimes v \,]\, B^{-1} [\, u \otimes v \,]\, B^{-1} = \langle v, B^{-1} u \rangle\, [\, u \otimes v \,]\, B^{-1}.$$
Updating schemes based directly on $(B^\nu + [\, u \otimes v \,])$ are not numerically stable, by which one means that small errors in carrying out the numerical operations might result in significant errors in the descent method. Two popular, numerically reliable, updating schemes are

BFGS update¹¹:
$$B^{\nu+1} = B^\nu + \frac{1}{\langle c^\nu, s^\nu \rangle}\, c^\nu \otimes c^\nu - \frac{1}{\langle B^\nu s^\nu, s^\nu \rangle}\, B^\nu s^\nu \otimes B^\nu s^\nu.$$

¹¹BFGS = Broyden-Fletcher-Goldfarb-Shanno who, independently, proposed this formula for the update.
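A quick numeric sanity check, with made-up data (not from the book), that the BFGS update does satisfy the Quasi-Newton condition $B^{\nu+1} s^\nu = c^\nu$:

n = 4;
B = eye(n);                       % current (symmetric) approximation B^nu
s = randn(n,1); c = randn(n,1);   % made-up step and gradient difference; generically <c,s> ~= 0
B1 = B + (c*c')/(c'*s) - (B*s)*(s'*B)/(s'*B*s);   % BFGS update
norm(B1*s - c)                    % should be numerically zero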


DFP update¹²: with $D^\nu$ playing the role of $(B^\nu)^{-1}$:
$$D^{\nu+1} = D^\nu + \frac{1}{\langle c^\nu, s^\nu \rangle}\, s^\nu \otimes s^\nu - \frac{1}{\langle c^\nu, D^\nu c^\nu \rangle}\, D^\nu c^\nu \otimes D^\nu c^\nu.$$
The Matlab function fminunc (with options.LargeScale set to 'off') implements the BFGS Quasi-Newton method. Again, the Rosenbrock function could be used as a test function and the results compared to those obtained in Exercise 1.6.
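As an illustration, here is a minimal sketch using the book-era optimset syntax; the starting point is the one from Exercise 1.6, everything else is an illustrative default.

f = @(x) 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;   % Rosenbrock test function
opts = optimset('LargeScale', 'off');            % select the BFGS Quasi-Newton code
[xstar, fstar] = fminunc(f, [-1.2; 1], opts);
% xstar should be close to (1,1), typically in far fewer iterations
% than the Steepest Descent method needs from the same starting point.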

    1.5 Integral functionals

Let's go one step further in our analysis of classical optimality conditions and consider integral functionals, as they arise in the Calculus of Variations¹³,
$$f(x) = \int_0^1 L\big(t, x(t), \dot x(t)\big)\, dt.$$
With $\alpha, \beta \in \mathbb{R}$, the simplest problem of the Calculus of Variations is:
$$\min\ f(x) \ \text{ so that } x \in X,\ x(0) = \alpha,\ x(1) = \beta,$$
where $X \subset \text{fcns}([0,1], \mathbb{R})$ consists of all functions with some specified properties, for example $X = \text{ac-fcns}([0,1], \mathbb{R})$, the space of absolutely continuous real-valued functions defined on $[0,1]$. For simplicity's sake, let's assume that $X = \mathcal{C}^1([0,1]; \mathbb{R})$ is the space of real-valued, continuously differentiable functions defined on $[0,1]$. One has,

Oresme rule: $x^* \in \operatorname{argmin} f \implies df(x^*; w) = 0,\ \forall\, w \in W \subset X$,

where the set $W$ of admissible variations is such that
$$x \in X,\ w \in W \implies \forall\, \tau \in \mathbb{R}:\ (x + \tau w)(0) = \alpha,\ (x + \tau w)(1) = \beta,$$
that is,
$$W = \big\{\, w \in X \ \big|\ w(0) = w(1) = 0 \,\big\}.$$

¹²DFP = Davidon-Fletcher-Powell, who proposed this updating scheme.
¹³It's customary to denote by $\dot x$ the derivative of $x$ with respect to the time parameter $t$.


It isn't straightforward to write down Fermat's rule. There isn't a ready-made calculus to find the gradient of functions defined on an infinite dimensional linear space. In the late 17th Century (?), the path going from Oresme's to Fermat's rule for this problem was pioneered by a trio of mathematical superstars: the Bernoulli brothers, Johann and Jacob, and Isaac Newton.

Let's sketch out their approach when
$$L\colon [0,1] \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}$$
is a really nice function. Here, this means that the partial derivatives
$$L_x(t, x, v) = \frac{\partial L}{\partial x}(t, x, v), \qquad L_{\dot x}(t, x, v) = \frac{\partial L}{\partial v}(t, x, v)$$
have the continuity and differentiability properties required to validate the operations carried out below. Then,

$$df(x; w) = \lim_{\tau \to 0} \frac{1}{\tau} \int_0^1 \Big[ L\big(t, (x + \tau w)(t), (\dot x + \tau \dot w)(t)\big) - L\big(t, x(t), \dot x(t)\big) \Big]\, dt$$
$$\phantom{df(x; w)} = \int_0^1 \lim_{\tau \to 0} \frac{1}{\tau} \Big[ L\big(t, (x + \tau w)(t), (\dot x + \tau \dot w)(t)\big) - L\big(t, x(t), \dot x(t)\big) \Big]\, dt$$
$$\phantom{df(x; w)} = \int_0^1 \Big[ L_x\big(t, x(t), \dot x(t)\big)\, w(t) + L_{\dot x}\big(t, x(t), \dot x(t)\big)\, \dot w(t) \Big]\, dt.$$

For a given function $x \in X$ and for $w \in W$, integration by parts yields
$$\int_0^1 w(t)\, L_x\big(t, x(t), \dot x(t)\big)\, dt = 0 - \int_0^1 \dot w(t) \int_0^t L_x\big(\sigma, x(\sigma), \dot x(\sigma)\big)\, d\sigma\, dt,$$
and thus,
$$df(x; w) = \int_0^1 \dot w(t) \Big( L_{\dot x}\big(t, x(t), \dot x(t)\big) - \int_0^t L_x\big(\sigma, x(\sigma), \dot x(\sigma)\big)\, d\sigma \Big)\, dt.$$

Because $w \in W$ implies $\int_0^1 \dot w(t)\, dt = 0$ and $df(x^*; w) = 0$ must hold for all such functions $w$: on $[0,1]$,
$$t \mapsto L_{\dot x}\big(t, x^*(t), \dot x^*(t)\big) - \int_0^t L_x\big(\sigma, x^*(\sigma), \dot x^*(\sigma)\big)\, d\sigma \ \text{ must be constant.}$$


In other words, $x^*$ must satisfy the ordinary differential equation
$$L_x\big(t, x(t), \dot x(t)\big) = \frac{d}{dt}\, L_{\dot x}\big(t, x(t), \dot x(t)\big) \quad \text{for } t \in [0,1],$$
known as the Euler equation. In addition, $x^*$ must satisfy the boundary conditions at $t = 0, 1$. Fermat's rule then reads:
$$x^* \in \operatorname{argmin} f \implies \begin{cases} x^*(0) = \alpha,\ x^*(1) = \beta,\\ x^* \text{ satisfies the Euler equation.} \end{cases}$$

1.7 Example (the brachistochrone problem). The problem¹⁴ is to find the path along which a particle will fall in the shortest time from A to B.

Detail. Let's pass a vertical plane through the points A and B, with the $y$-axis drawn vertically downward, A located at the origin (0, 0) and say, B = (1, 2). So, we are to find a path $y\colon [0,1] \to [0, \infty)$ with $y(0) = 0$ and $y(1) = 2$ that will minimize the elapsed time; the force acting on the particle is gravity.

Figure 1.8: Shortest time path for falling particle

From Newton's Law, one derives the following expression for the function to be minimized:
$$\int_0^1 \sqrt{\frac{1 + y'(x)^2}{y(x)}}\, dx.$$

¹⁴Originally formulated by Galileo; the name is derived from the Greek, brachistos for shortest and chronos for time.


Rather than the Euler equation itself, let's rely on a variant, namely,
$$\frac{d}{dx}\big( L - y' L_{y'} \big) = L_x;$$
carrying out the differentiation with respect to $x$, the preceding identity yields $L_x + y' L_y + y'' L_{y'} - y'' L_{y'} - y' \frac{d}{dx} L_{y'} = L_x$, or still $y' \big( L_y - \frac{d}{dx} L_{y'} \big) = 0$. Since in our problem $L_x = 0$, this variant of the Euler equation implies that $x \mapsto (L - y' L_{y'})(x)$ should be constant. Hence the optimal path must satisfy the following boundary value problem:
$$2\lambda = y(x)\big( 1 + y'(x)^2 \big), \qquad y(0) = 0,\ y(1) = 2,$$

where $\lambda$ is a constant to be chosen so as to satisfy the boundary conditions. The path described by the solution in the $(x, y)$-plane can be parametrized with respect to time, for $t \in [0, t^*]$, and one can verify that the cycloids,
$$x(t) = \lambda(t - \sin t), \qquad y(t) = \lambda(1 - \cos t),$$
satisfy the preceding differential equation; somewhat informally,
$$y' = \frac{dy}{dx} = \frac{dy}{dt}\, \frac{dt}{dx} = \frac{\dot y}{\dot x} = \frac{\sin t}{1 - \cos t}.$$
At $t = 0$, $x(0) = y(0) = 0$, so there remains only to choose $\lambda$ so that the path passes through B at time $t = t^*$, our shortest time. For B at (1, 2), that turns out to be $\lambda = 2.4056$ and the corresponding value for $t^*$ is 1.4014; these values were obtained with the help of the Matlab root-finder fzero.
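The fzero computation just mentioned can be reconstructed as follows (a sketch; the function g below comes from eliminating lambda between the two cycloid equations at B = (1,2)):

% At B: lambda*(t - sin t) = 1 and lambda*(1 - cos t) = 2, so
% (t - sin t)/(1 - cos t) = 1/2 determines t*.
g = @(t) (t - sin(t))./(1 - cos(t)) - 1/2;
tstar  = fzero(g, 1.5)           % approximately 1.4014
lambda = 2/(1 - cos(tstar))      % approximately 2.4056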

1.8 Exercise. Show that the straight line yields the shortest distance between two points, say $a = (0, 0)$ and $b = (1, 2)$.

Guide. The length of a differentiable arc $y\colon [0,1] \to \mathbb{R}$ is
$$\int_0^1 \sqrt{1 + y'(x)^2}\, dx,$$
as follows from the theorem of Pythagoras and the definition of $y'$. Set up and solve the Euler equation with the boundary conditions $y(0) = 0$ and $y(1) = 2$.


    1.6 In conclusion . . .

We have seen that the rules of Oresme and Fermat, with the help of Differential Calculus, can be used effectively to identify potential minimizers of a smooth function in a variety of situations. But many interesting optimization problems involve non-differentiable functions, and minimizers have a predilection for being located at the cusps and kinks of such functions! Moreover, the presence of constraints in a minimization problem comes with an intrinsic lack of smoothness: there is a sharp discontinuity between points that are admissible and those that are not.

To deal with this more inclusive class of functions, we need to enrich our calculus. Our task, on the mathematical side, will thus be to set up a Subdifferential¹⁵ Calculus with rules that mirror those of Differential Calculus, and that culminates in versions of Oresme's and Fermat's rules to ferret out the minimizers of non-smooth, and even discontinuous, functions¹⁶.

¹⁵The prefix "sub" has the meaning: requiring less than differentiability.
¹⁶For more comprehensive expositions of Subdifferential Calculus, one should consult [4, 14, 1, 3, 18]; our notation and terminology will be consistent with that of [18].


    Chapter 2

    FORMULATION

Let's begin with a few typical (constrained) optimization problems that fit under the mathematical programming umbrella. In almost all of these examples, we start with a deterministic version and then switch to a more realistic model that makes a place for the uncertainty about some of the parameters.

When we allow for data uncertainty, not only do we gain credibility for the modeling process, but we are also led to consider a number of issues that are at the core of optimization theory and practice, namely, how to deal with non-linearities, with lack of smoothness, and how to design solution procedures for large scale problems. In addition, due to the addition of randomness (uncertainty), it's also necessary to clarify a number of basic modeling issues, in particular, how stochastic programs differ from the simpler, but less realistic, deterministic formulations.

For all these reasons, we are going to rely rather extensively, but by no means exclusively, on stochastic programming examples to motivate both the theoretical development and the design of algorithmic procedures.

    2.1 A product mix problem

A furniture maker can manufacture and sell four different dressers. Each dresser requires a certain number $t_{cj}$ of man-hours for carpentry, and a certain number $t_{fj}$ of man-hours for finishing, $j = 1, \dots, 4$. In each period, there are $d_c$ man-hours available for carpentry, and $d_f$ available for finishing. There is a (unit) profit $c_j$ per dresser of type $j$ that's manufactured. The owner's goal


is to maximize total profit, or equivalently, to minimize cost¹ (= negative profit). Let these cost coefficients be
$$c = (c_1, c_2, c_3, c_4) = (-12, -25, -21, -40),$$
and
$$T = \begin{pmatrix} t_{c1} & t_{c2} & t_{c3} & t_{c4}\\ t_{f1} & t_{f2} & t_{f3} & t_{f4} \end{pmatrix} = \begin{pmatrix} 4 & 9 & 7 & 10\\ 1 & 1 & 3 & 40 \end{pmatrix}, \qquad d = (d_c, d_f) = (6000, 4000).$$

The furniture maker must choose $(x_j \ge 0,\ j = 1, \dots, 4)$ to minimize
$$\sum_{j=1}^{4} c_j x_j = -12 x_1 - 25 x_2 - 21 x_3 - 40 x_4,$$
subject to the constraints
$$4 x_1 + 9 x_2 + 7 x_3 + 10 x_4 \le 6000, \qquad x_1 + x_2 + 3 x_3 + 40 x_4 \le 4000.$$

This is a linear program, i.e., an optimization problem in finitely many (real-valued) variables in which a linear function is to be minimized (or maximized) subject to a system of finitely many linear constraints: equations and inequalities. A general formulation of a linear program could be
$$\min \sum_{j=1}^{n} c_j x_j \ \text{ over all } x \in X \subset \mathbb{R}^n \ \text{ so that } \sum_{j=1}^{n} a_{i,j} x_j \lessgtr b_i \ \text{ for } i = 1, \dots, m,$$
where $\lessgtr$ stands for either $\le$, $=$ or $\ge$, and the (internal) constraints $x \in X$ consist of some simple linear inequalities on the variables $x_j$ such as:

- $X$ is a box, i.e., $X = \{\, x \mid l_j \le x_j \le u_j,\ j = 1, \dots, n \,\}$,

- $X = \mathbb{R}^n_+ = \{\, x \mid x_j \ge 0,\ j = 1, \dots, n \,\}$, the non-negative orthant, etc.

¹This conversion to minimization is made in order to have a canonical formulation of optimization problems. Generally, engineers and mathematicians prefer the minimization framework, whereas social scientists and business majors have a preference for the maximization framework.


The objective and the constraints of our product mix problem are linear, and it may be written compactly as:
$$\min\ \langle c, x \rangle \ \text{ so that } Tx \le d,\ x \ge 0.$$
As part of the ensuing development, many of the properties of linear programs will be brought to the fore, including optimality conditions, solution procedures and the associated geometry. For now, let's simply posit that such problems can be solved efficiently when they are not too large. The (optimal) solution of our product mix problem is:

$x^d = (4000/3,\ 0,\ 0,\ 200/3)$ with optimal value: $ -18,667.

Here is the Matlab m-file used to calculate the solution; linprog is a function in the Matlab Optimization Toolbox.

function [xopt,ObjVal] = prodmix
%
% data and solution of the product mix example
%
c = -[12 25 21 40]; d = [6000 4000];
T = [4 9 7 10; 1 1 3 40];
xlb = zeros(4,1); xub = ones(4,1)*10^9;
[xopt,ObjVal] = linprog(c,T,d,[],[],xlb,xub);

Now, let's get a bit more realistic and account for the fact that the number of hours needed to produce each dresser type can't be known with certainty. Then each entry in $T$ becomes a random variable². For simplicity's sake, assume that each entry of $T$ takes on four possible man-hour values with equal probability (1/4) and that these entries are independent of one another.

entry   possible values
t_c1:   3.60   3.90   4.10   4.40
t_c2:   8.25   8.75   9.25   9.75
t_c3:   6.85   6.95   7.05   7.15
t_c4:   9.25   9.75  10.25  10.75
t_f1:   0.85   0.95   1.05   1.15
t_f2:   0.85   0.95   1.05   1.15
t_f3:   2.60   2.90   3.10   3.40
t_f4:   37.0   39.0   41.0   43.0

²Bold face will be used for random variables, with normal print for their possible values.


We have 8 random variables, each taking four possible values; that yields a total of $4^8 = 65{,}536$ possible $T$ matrices (outcomes), and each one of these has equal probability of occurring! In practice, this could be a discretization that approximates a continuous distribution (e.g., a uniform distribution). Let's denote the probability of a particular outcome $T^l$ by $p_l = (0.25)^8$, for $l = 1, \dots, 65{,}536$.

Because the manufacturer must decide on the production plan before the

number of hours required for carpentry or finishing is known with certainty, there is the possibility that they actually exceed the number of hours available. Therefore, the possibility of having to pay for overtime must be factored in. The recourse costs are determined by $q_c$ per extra carpentry hour and $q_f$ per extra finishing hour, say $q = (q_c, q_f) = (5, 10)$.

This recourse decision will only enter into play after the production plan $x$ has been selected and the time required, $T^l$, for each task has been observed. Our manufacturer will, at least potentially, make a different decision about overtime when confronted with each one of these 65,536 possible different outcomes for $T$. Let $y^l_c$ and $y^l_f$ denote the number of hours of overtime hired for carpentry and finishing when the matrix $T$ turns out to be $T^l$. The problem is then to choose $(x_j \ge 0,\ j = 1, \dots, 4)$ that minimizes

$$\sum_{j=1}^{4} c_j x_j + \sum_{l=1}^{65{,}536} p_l \big( q_c\, y^l_c + q_f\, y^l_f \big)$$
so that
$$\sum_{j=1}^{4} t^l_{cj}\, x_j - y^l_c \le d_c, \qquad \sum_{j=1}^{4} t^l_{fj}\, x_j - y^l_f \le d_f, \qquad y^l_c \ge 0,\ y^l_f \ge 0, \qquad l = 1, \dots, 65{,}536.$$
Notice that the objective now being minimized is the sum of the immediate costs (actually, the negative profits) and the expected future costs, since one must consider 65,536 possible outcomes; the constraints involving random quantities are written out explicitly for all 65,536 possible outcomes. In addition to non-negativity for the decision variables $x_j$ and the recourse


variables $y^l_c$, $y^l_f$, the constraints say that the number of man-hours it takes for the carpentry of all dressers $\big( \sum_{j=1}^{4} t^l_{cj}\, x_j \big)$ must not exceed the total number of hours made available for carpentry $(d_c + y^l_c)$, i.e., regular hours plus overtime, and the same must hold for finishing.

Because there is the possibility of making a recourse decision $y^l = (y^l_c, y^l_f)$ that will depend on the outcomes of the random elements, this type of problem is called a stochastic program with recourse. This class of problems will be studied in more depth later. For now, it suffices to understand how the decision/information process evolves:

decision: $x$ → observation: $T^l$ → recourse: $y^l$

In summary, the manufacturer makes today a decision $x$ of how much of each dresser type to produce, based on the knowledge that tomorrow he will be able to observe how many man-hours $T^l$ it actually took to manufacture the dressers, as well as to decide how much overtime labor $y^l$ to hire based on this observation.

The problem is still a linear program, but of much larger size! Notice the block-angular structure of the problem when written in the following way:
$$\begin{array}{llllll} \min & \langle c, x\rangle & +\ p_1 \langle q, y^1\rangle & +\ p_2 \langle q, y^2\rangle & \cdots & +\ p_{65536} \langle q, y^{65536}\rangle\\ \text{so that} & T^1 x & -\ y^1 & & & \le d\\ & T^2 x & & -\ y^2 & & \le d\\ & \ \vdots & & & \ddots & \ \vdots\\ & T^{65536} x & & & -\ y^{65536} & \le d\\ & x \ge 0, & y^1 \ge 0, & y^2 \ge 0, & \dots, & y^{65536} \ge 0. \end{array}$$
Later, we shall see how to solve these large scale linear programs by exploiting their structure. For now, it is enough to observe that these are, indeed, large scale problems.
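To make the block-angular structure concrete, here is a minimal sketch, not from the book, that assembles the extensive version for a small, made-up set of L = 4 scenarios (rather than the full 65,536) and hands it to linprog; the variable vector is [x; y^1; ...; y^L].

c = -[12 25 21 40]'; q = [5 10]'; d = [6000 4000]';
Tl = {[4 9 7 10; 1 1 3 40], ...                         % made-up scenarios,
      [3.60 8.25 6.85 9.25; 0.85 0.85 2.60 37.0], ...   % with entries drawn from
      [4.40 9.75 7.15 10.75; 1.15 1.15 3.40 43.0], ...  % the table of possible
      [3.90 8.75 6.95 9.75; 0.95 0.95 2.90 39.0]};      % man-hour values
L = numel(Tl); p = ones(L,1)/L;
cc = [c; kron(p, q)];                  % objective: <c,x> + sum_l p_l <q, y^l>
A  = [vertcat(Tl{:}), -eye(2*L)];      % block rows: T^l x - y^l <= d
b  = repmat(d, L, 1);
lb = zeros(4 + 2*L, 1);                % x >= 0 and y^l >= 0
[z, val] = linprog(cc, A, b, [], [], lb, []);
x = z(1:4)                             % the here-and-now production plan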

Oftentimes, there is more than one source of uncertainty in a problem. For example, due to employee absences, the available man-hours for carpentry and finishing may also have to be modeled as random variables, say

entry   possible values
d_c:   5,873   5,967   6,033   6,127, each with probability 1/4
d_f:   3,936   3,984   4,016   4,064, each with probability 1/4


We now need to replace $d$ by $d^l = (d^l_c, d^l_f)$, and we must take into account the $4^2 = 16$ possible $d^l$ vectors; that gives a total of $L = 4^{10} = 1{,}048{,}576$ possible $(T, d)$ realizations. With $p_l = 1/L$, the problem reads:
$$\begin{array}{llllll} \min & \langle c, x\rangle & +\ p_1 \langle q, y^1\rangle & +\ p_2 \langle q, y^2\rangle & \cdots & +\ p_L \langle q, y^L\rangle\\ \text{so that} & T^1 x & -\ y^1 & & & \le d^1\\ & T^2 x & & -\ y^2 & & \le d^2\\ & \ \vdots & & & \ddots & \ \vdots\\ & T^L x & & & -\ y^L & \le d^L\\ & x \ge 0, & y^1 \ge 0, & y^2 \ge 0, & \dots, & y^L \ge 0. \end{array}$$

The relatively small linear program we started out with, in the deterministic setting, has now become almost enormous! Let's refer to this problem as the (equivalent) extensive version of the (given) stochastic program.

The optimal solution is

$x^* = (257,\ 0,\ 665.2,\ 33.8)$ with total expected cost: $ -18,051.

Because of its large size, this problem is more difficult to solve than its deterministic counterpart, and any efficient solution procedure must exploit the problem's special structure. But the solution $x^*$ is robust, meaning that it has examined all one million plus possibilities, and has taken into account the resulting recourse costs for overtime and the associated probabilities of having to pay these costs.

With $x^d = (4000/3, 0, 0, 200/3)$, the solution of the deterministic version, the expected cost would have been $ -16,942; the expected overtime costs are $ 1,725. Of course, $x^d$ is not an optimal solution of the stochastic program but, more significantly, $x^d$ isn't getting us on the right track! The solution $x^*$ suggests that a large number of dressers of type 3 should be manufactured, while the production plan suggested by $x^d$ doesn't even include any dresser of type 3. This is exactly the information a decision maker would want to have, viz., which activities should be included in a (robust) optimal solution.

2.1 Exercise (stochastic resources). Consider the product mix problem when the only uncertainty is about the number of hours that will be available for carpentry and finishing. Overtime will still be paid at the rates of $ 5 an


hour for carpentry and $ 10 an hour for finishing. Let
$$c = (-12, -25, -21, -20), \qquad T = \begin{pmatrix} 4 & 9 & 7 & 10\\ 1 & 1 & 3 & 40 \end{pmatrix},$$
and let the random variables $d_c$, $d_f$ (independently) take on the values

entry   possible values
d_c:   4,800   5,500   6,050   6,150, each with probability 1/4
d_f:   3,936   3,984   4,016   4,064, each with probability 1/4

Solve also the deterministic problem: $\min \langle c, x \rangle$ so that $Tx \le \bar d$, $x \ge 0$, with $\bar d_c = 5{,}625$ and $\bar d_f = 4{,}000$, the expected values of $d_c$ and $d_f$. Compare the solution with that of the stochastic program.

Guide. Here, L = 16 is the number of possible outcomes, so in addition to the non-negativity restrictions, the stochastic program will have 32 constraints. One solution is $x^* = (1{,}072.6,\ 0,\ 251.4,\ 0)$ with optimal value $ -15,900. The solution of the deterministic problem suggests manufacturing the same types of dressers but in significantly different quantities. To compare the solutions, for the deterministic solution one needs to evaluate not just its cost (= -profit) but one must also calculate the recourse costs that might result when one actually implements the deterministic solution.

    2.2 Curve fitting II

As in §1.2, we know the values of an unknown function h: [0, 1] → IR at a finite number of (distinct) points. It's also known that h is quite smooth which, in this context, we are going to interpret as meaning: h is twice differentiable with h″ bounded, i.e., for all t ∈ [0, 1], |h″(t)| ≤ κ for a given κ > 0. Unless κ is unusually small, our estimate for h is allowed to come from a much richer class of functions than just polynomials of degree n, as in §1.2. It's thus reasonable to expect that one should be able to come up with a better fit.

Every twice differentiable z: [0, 1] → IR can be written as

    z(t) = z₀ + v₀ t + ∫₀^t dτ ∫₀^τ a(s) ds,


where z₀ and v₀ play the role of integration constants and a: [0, 1] → IR, the second derivative, is some function, not necessarily continuous. In fact, to render the problem computationally manageable, let's restrict a to be piecewise constant. That's not as serious a limitation as might appear at first glance since any piecewise continuous function on [0, 1] can be approximated arbitrarily closely by such a piecewise constant function.

Getting down to specifics: Partition (0, 1] in N sub-intervals (t_{k−1}, t_k], of length Δ = 1/N, so that the points at which the function h is known are some of the end points of these intervals, say {t_l, l ∈ L}. For k = 1, ..., N, set

    a(t) = x_k (a constant),  t ∈ (t_{k−1}, t_k];

fixing x₁, ..., x_N, v₀ and z₀ completely determines the function z. Moreover, by introducing bounds on the choice of the x_k, one can control the rate of change in both z′ and, indirectly, in the function z itself.

For t ∈ (t_{k−1}, t_k], one has

    z′(t) = v₀ + ∫₀^t a(s) ds = v₀ + Δ Σ_{j=1}^{k−1} x_j + (t − t_{k−1}) x_k,

and

    z(t) = z₀ + ∫₀^t z′(s) ds = z₀ + Σ_{j=1}^{k−1} ∫_{t_{j−1}}^{t_j} z′(s) ds + ∫_{t_{k−1}}^t z′(s) ds
         = z₀ + v₀ t + Δ Σ_{j=1}^{k−1} (t − t_j + Δ/2) x_j + ½ (t − t_{k−1})² x_k.

In particular, when t = t_k,

    z(t_k) = z₀ + kΔ v₀ + Δ² Σ_{j=1}^{k} (k − j + ½) x_j.

The curve fitting problem comes down to finding v₀, z₀ and, for k = 1, ..., N, x_k ∈ [−κ, κ], so that z is as close as possible to h in terms of a given criterion. For example, one may be interested in minimizing the (square of the) ℓ₂-norm of the error,

    Σ_{l∈L} |z(t_l) − h(t_l)|²,


i.e., least squares minimization. With z = (z_l = z(t_l), l ∈ L) and h = (h_l = h(t_l), l ∈ L), one ends up with the following formulation:

    min |z − h|² = Σ_{l∈L} |z_l − h_l|²
    so that  z_l = z₀ + lΔ v₀ + Δ² Σ_{k=1}^{l} (l − k + ½) x_k,  l ∈ L,
             −κ ≤ x_k ≤ κ,  k = 1, ..., N.

That's a quadratic program: the constraints are linear and the function to be minimized is quadratic. One can write the equality constraints as

    z = A x̄  where  x̄ = (z₀, v₀, x₁, ..., x_N);

the entries of A are the detached coefficients of these equations. Since

    |z − h|² = |A x̄ − h|² = ⟨A x̄ − h, A x̄ − h⟩ = ⟨x̄, AᵀA x̄⟩ − 2⟨Aᵀh, x̄⟩ + |h|²,

and the constant |h|² doesn't affect the minimizer, the quadratic program can also be expressed as

    min ⟨x̄, AᵀA x̄⟩ − 2⟨Aᵀh, x̄⟩  so that  lb ≤ x̄ ≤ ub,

where lb and ub are, respectively, lower and upper bounds on the x̄-variables; for z₀ and v₀ these bounds could be ∓∞. Because the matrix AᵀA is positive semi-definite³, it turns out that our quadratic program is a convex optimization problem, a property that's difficult to overvalue in an optimization context, cf. 3.16 & Chapter 4.

To illustrate this approach, let's consider again the same data as that used in §1.2. We rely on the Matlab-function quadprog to solve the quadratic program; Figure 2.1 displays the resulting curve when N = 400 (and κ is relatively large). As expected, the fitting is significantly better than what resulted from a best polynomial fit; compare Figures 1.3 and 2.1.

³ A matrix C is positive semi-definite if ⟨x, Cx⟩ ≥ 0 for all x. When C is positive semi-definite, the quadratic form x ↦ ⟨x, Cx⟩ is convex, see Example 3.16.

function z = lsCurveFit(xr,N,x,h,kappa)
% xr, N: range [0, xr]; partition in N subintervals
% (x,h): data points
% kappa: -lower and upper bound on 2nd derivatives
msh = xr/N; m = length(x(:)); xidx = round(x(:)'/msh);
N1 = N + 1; N2 = N + 2;
% variable ordering in the solver: (x_1,...,x_N, v0, z0)
mx = 2 + max(abs(h)); ub = [kappa*ones(N,1); 100; mx]; lb = -ub;
% generating the coefficients of matrix A
A = zeros(m,N2);
for i = 1:m
   for j = 1:xidx(i)
      A(i,j) = (xidx(i)-j+0.5)*msh^2;
   end %for
   A(i,N1) = xidx(i)*msh; A(i,N2) = 1;
end %for
xx = quadprog(A'*A, -A'*h(:), [],[],[],[], lb, ub);
% z-curve calculation
z = zeros(1,N);
for l = 1:N
   zd = 0;
   for k = 1:l
      zd = zd + (l-k+0.5)*xx(k);
   end %for
   z(l) = xx(N2) + xx(N1)*l*msh + zd*msh^2;
end %for

Figure 2.1: Least squares fit of a smooth curve

Instead of minimizing the ℓ₂-norm of the error (= least squares), one could, for instance, choose to minimize the ℓ₁-norm (= sum of the absolute errors). The function to be minimized is then Σ_{l∈L} |z_l − h_l|. Since

    |z_l − h_l| = max {z_l − h_l, h_l − z_l},


one can find the minimum of Σ_{l∈L} |z_l − h_l| by minimizing Σ_{l∈L} η_l with η_l ≥ z_l − h_l, η_l ≥ h_l − z_l for l ∈ L. With the ℓ₁-norm criterion, the curve fitting problem takes the form

    min Σ_{l∈L} η_l
    so that  η_l ≥ z_l − h_l,  l ∈ L,
             η_l ≥ −z_l + h_l,  l ∈ L,
             z_l = z₀ + lΔ v₀ + Δ² Σ_{k=1}^{l} (l − k + ½) x_k,  l ∈ L,
             −κ ≤ x_k ≤ κ,  k = 1, ..., N.

This is a linear program: the constraints are linear and the function to be minimized is also linear.

2.2 Example (yield curve tracing). The spot rate of a Treasury Note that matures in t months always includes a risk premium as well as a forecast component that represents the market's perception of future interest rates. Such spot rates are quoted for Treasury Notes with specific maturities, t = 3, 6, .... To evaluate financial instruments that generate cash flows (coupons, final payments) at intermediate dates, one needs to have access to a yield curve that supplies the spot rate for every possible date.

Figure 2.2: Yield curve for Treasury Notes July 1982

Detail. Let's work with the (historical) rates quoted in July 1982:

    term   3     6     12    24    36    60    84    120   240   360
    r_t    11.9  12.8  13.2  13.8  14.0  14.1  14.1  13.9  13.8  13.6

To trace the yield curve, the simplistic approach is to rely on linear interpolation. However, that's not really satisfactory. Financial markets make continuous adjustments to the changing environment and this suggests that the yield curve has to be quite smooth. Certainly, there shouldn't be an abrupt change in the slope of the yield curve, and a fortiori, this shouldn't occur at t = 3, 6, .... So, let's fit a smooth curve to the data. Because the spot rates are nonnegative⁴, one can express the yield curve as s(t) = e^{−z(t)}, in which case we need to search for a smooth z-curve that will fit the pairs {(3, −ln r₃), (6, −ln r₆), ...}. The following Matlab-file generates the coefficients of the linear program and then relies on linprog to calculate the solution. Figure 2.2 graphs the (historical) yield curve calculated by our program.

⁴ and it's expedient to have an expression for the spot rates that makes calculating forward rates and discount factors particularly easy

function spots = YieldCurve(N,x,r,kappa)
% N: # of months, range [0, N]
% (x,r): data points
% kappa: -lower and upper bound on 2nd derivative
m = length(x(:)); N1 = N + 1; N2 = N + 2;
% variable ordering: (x_1,...,x_N, v0, z0, eta_1,...,eta_m)
ub = [kappa*ones(N,1); 0; 0; 10*ones(m,1)];
lb = [-ub(1:N); -1; -3.25; zeros(m,1)];
% generating the coefficients of the linear program
A = zeros(2*m, N2+m); b = zeros(2*m, 1);
for i = 1:m
   i2 = 2*i; i1 = i2 - 1;
   b(i2) = log(r(i)); b(i1) = -b(i2);
   for j = 1:x(i)
      A(i1,j) = (x(i)-j+0.5);
   end %for
   A(i1,N1) = x(i); A(i1,N2) = 1; A(i1,N2+i) = -1;
   A(i2,:) = -A(i1,:); A(i2,N2+i) = -1;
end %for
c = [zeros(1,N2) ones(1,m)];
xx = linprog(c,A,b,[],[],lb,ub);
% yield curve calculation
z = zeros(1,N);
for l = 1:N
   zd = 0;
   for k = 1:l
      zd = zd + (l-k+0.5)*xx(k);
   end %for
   z(l) = xx(N2) + xx(N1)*l + zd;
end %for
spots = exp(-z);
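With the July 1982 quotes of the table above, a call along the following lines traces the curve of Figure 2.2; the value chosen for kappa is an assumption of this sketch:

term = [3 6 12 24 36 60 84 120 240 360];
r = [11.9 12.8 13.2 13.8 14.0 14.1 14.1 13.9 13.8 13.6];
spots = YieldCurve(360, term, r, 0.001);  % kappa = 0.001: assumed bound
plot(1:360, spots)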

    2.3 A network capacity expansion problem

Let's consider a power transmission network, Figure 2.3, with e_i the external flow at node i, i.e., the difference between demand and supply at node i. The internal flow y_j on arc j is limited by the capacity κ_j of the transmission line. Total supply exceeds total demand but the capacity of the transmission lines needs to be expanded from κ_j to κ_j + x_j, with β_j an upper bound on x_j, in order to render the problem feasible⁵. The total cost of such an expansion is Σ_{j=1}^n γ_j(x_j).

Figure 2.3: Power transmission network (arc flows y_j with |y_j| ≤ κ_j)

⁵ In the 2001 California energy crisis, some of the blackouts were blamed on the lack of capacity of the transmission lines between South and North California.


The deterministic version of this capacity expansion problem would be:

    min Σ_{j=1}^n γ_j(x_j)
    so that  0 ≤ x_j ≤ β_j,  j = 1, ..., n,
             |y_j| ≤ κ_j + x_j,  j = 1, ..., n,
             Σ_{j ∈ in(i)} y_j − Σ_{j ∈ out(i)} y_j ≥ e_i,  i = 1, ..., m;

Σ_{j ∈ in(i)} y_j stands for the (internal) flow into i whereas Σ_{j ∈ out(i)} y_j is the flow from i to the other nodes. Since the constraint |y_j| ≤ κ_j + x_j can be split into the two linear constraints y_j ≤ κ_j + x_j and −y_j ≤ κ_j + x_j, this is again a linear programming problem if the cost functions γ_j are linear. Usually, the functions γ_j are nonlinear, and the problem then belongs to a more general class of optimization problems.

A nonlinear program is an optimization problem in finitely many (real-valued) variables in which a function is to be minimized (or maximized) subject to a system of finitely many constraints: equations and inequalities. A general formulation of a nonlinear program could be

    min f₀(x) over all x ∈ X ⊂ IRⁿ
    so that f_i(x) ⪋ 0 for i = 1, ..., m,

where ⪋ stands for either ≤, = or ≥ and the set X is usually a simple set such as a box or an orthant but, in principle, could be any subset of IRⁿ. Depending on the properties of the functions f₀, {f_i, i = 1, ..., m} and the set X, various labels are attached to nonlinear programs: quadratic, geometric, convex, positive definite, etc. As we proceed, we shall develop optimality conditions and study stability criteria for nonlinear programs as well as describe a number of algorithmic procedures for solving certain classes of nonlinear programs.

2.3 Example (capacity expansion example). Consider the (simple) capacity expansion problem as defined by Figure 2.4 with no upper bounds on the expansions x_j and let γ₁(x₁) = x₁², γ₂(x₂) = 8x₂², γ₃(x₃) = 3x₃².

Detail. This is a quadratic program with solution

    x* = (0, 0.55, 1.45),  y* = (2.45, 2.55, 6.45)
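The network data behind Figure 2.4 aren't reproduced here, but the following Matlab sketch indicates how such a quadratic capacity expansion problem can be assembled for quadprog; the incidence matrix E (with (Ey)_i = flow into node i minus flow out of node i), the external flows e, capacities kap, bounds beta and cost coefficients g are hypothetical placeholders.

% sketch: capacity expansion with quadratic costs gamma_j(x_j) = g_j*x_j^2
n = numel(kap); m = numel(e);         % n arcs, m nodes; variables (x; y)
H = blkdiag(2*diag(g), zeros(n));     % quadprog minimizes (1/2)w'Hw + f'w
f = zeros(2*n, 1);
A = [-eye(n)  eye(n);                 %   y_j - x_j <= kap_j
     -eye(n) -eye(n);                 %  -y_j - x_j <= kap_j
     zeros(m,n) -E];                  %   E y >= e
b = [kap(:); kap(:); -e(:)];
lb = [zeros(n,1); -inf(n,1)];         % x >= 0, flows y free
ub = [beta(:); inf(n,1)];             % x <= beta (omit for Example 2.3)
w = quadprog(H, f, A, b, [], [], lb, ub);
xstar = w(1:n); ystar = w(n+1:end);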


⁶ Of course, deterministic programs can be viewed as stochastic programs whose random elements take on a single value with probability one.


When P is a continuous distribution function, i.e., there is a density function p: IR₊ → IR₊ such that

    P(ζ) = ∫₀^ζ p(s) ds,

the expected costs are

    γx + δ ∫ₓ^∞ (ζ − x) p(ζ) dζ.

In this case, a simple calculation shows that the optimal solution x̄ must satisfy

    0 = γ − δ [1 − P(x̄)].

More generally, if we define P⁻(ζ) := lim_{η↗ζ} P(η), one must have

    P⁻(x̄) ≤ 1 − γ/δ ≤ P(x̄),

that allows for the possibility of a (discontinuous) jump in P at x̄, as could happen when the random demand is discretely distributed. Figure 2.6 illustrates these possibilities.

Figure 2.6: Solution: discrete and continuous distributions.
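As a quick worked instance, take γ = 3, δ = 7 and the demand uniformly distributed on [10, 20], the data of Exercise 2.6 below: then P(x) = (x − 10)/10 on [10, 20], so the condition 0 = γ − δ[1 − P(x̄)] yields P(x̄) = 1 − γ/δ = 4/7, i.e., x̄ = 10 + 10·(4/7) = 110/7 ≈ 15.7.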

2.6 Exercise (numerical solutions of the producer problem). With γ = 3, δ = 7 and ξ uniformly distributed on [10, 20], the producer problem has a


of the contract, and the second term evaluating the decision in terms of expected costs to come after the random event is observed. The second term is called the expected recourse cost:

    E{q(ξ − x)}  where  q(y) = 0 when y ≤ 0,  = δy when y ≥ 0.

The recourse cost function is q(ξ − x).

Figure 2.7: The cost function q.

2.8 Exercise (alternative expression for recourse cost). Show that the cost function q (with δ > 0), a function commonly used to define recourse costs, admits the alternative representations:

    q(y) = max [ 0, δy ]
         = min { δy⁺ | y⁺ − y⁻ = y, y⁺ ≥ 0, y⁻ ≥ 0 }.
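To see the equivalence: for fixed y, any feasible pair must have y⁺ ≥ max{0, y}, since y⁺ ≥ 0 and y⁺ = y + y⁻ ≥ y; because δ > 0, the choice y⁺ = max{0, y}, y⁻ = y⁺ − y is then optimal, and the minimum value is δ max{0, y} = max[0, δy].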

With E{·} denoting expectation with respect to the distribution function P, the stochastic programming formulation of the producer's problem is:

    min_x  γx + E{q(ξ − x)}.

After integration, one obtains the deterministic equivalent of the stochastic program, stated here in terms of a continuous distribution P with density p, but valid in the more general case:

    min_x  γx + δ ∫ₓ^∞ (ζ − x) p(ζ) dζ.


This is precisely the problem that was solved earlier.

This formulation of the producer's problem illustrates the important features of a stochastic program: the decision stages of the problem in relation to the arrival of information, and the recourse costs being obtained as the expected value of an optimization problem that will be solved after full information is available. Many issues of stochastic programming may be illustrated by this simple producer's problem. There are many instances when the developments to come can first be explored for this simple example, in which everything is well understood; then the intuition gained can guide the application to more complex decision problems.


    Chapter 3

    PRELIMINARIES

At the congenital level, one makes a distinction between two classes of optimization problems, namely those that are convex and those that are non-convex¹. Fortunately, a major portion of the optimization problems that have to be dealt with in practice are convex; all examples in Chapter 2 fall in this class. In the first two sections of this chapter, we build the basic tools to analyze convex optimization problems and, in particular, what's needed to generalize the Oresme and Fermat rules. The last three sections set up a minimal probabilistic framework that will allow us to deal with (convex) stochastic optimization problems and commence the study of expectation functionals.

    3.1 Variational analysis I

The analysis of deterministic and stochastic programming models relies on the tools and framework of Variational Analysis. Our concern at this point is mainly, but not exclusively, with finite-valued convex functions, but the exposition will already touch on the interplay between a function and its epigraph that occupies such a pivotal role in Variational Analysis and, in particular, in the theoretical foundations of optimization. The definitions and accompanying notation are consistent with the extensions and generalizations required in the sequel. General facts about convexity will be covered in this

¹ Of course, non-convex problems have large subfamilies that possess particular properties that can be exploited effectively in the design of solution procedures, e.g., combinatorial optimization, optimization problems with integer variables, complementarity problems, equilibrium problems, and so on.


section. The next one will be devoted to a more detailed analysis of convex functions and their (sub)differentiability properties.

A subset C of IRⁿ is convex if for all x⁰, x¹ ∈ C, the line segment [x⁰, x¹] ⊂ C, i.e.,

    x^λ = (1 − λ)x⁰ + λx¹ ∈ C  for all λ ∈ [0, 1].

Note that x⁰, x¹ don't have to be distinct, and thus if C consists of a single point it's convex; the condition is also vacuously satisfied when C = ∅, the empty set. Balls, lines, line segments, cubes, planes are all examples of convex sets. Sets with dents or holes are typical examples of sets that fail to be convex, cf. Figure 3.1.

    Figure 3.1: Convex and non-convex sets.

Given λ ∈ [0, 1], one refers to x^λ as a convex combination of x⁰ and x¹. More generally, given any collection x¹, ..., x^L ⊂ IRⁿ, then any point

    x = Σ_{l=1}^{L} λ_l x^l  for some λ_l ≥ 0, l = 1, ..., L, such that Σ_{l=1}^{L} λ_l = 1

is a convex combination of x¹, ..., x^L. The set

    C = con(x¹, ..., x^L) = { x = Σ_{l=1}^{L} λ_l x^l | Σ_{l=1}^{L} λ_l = 1, λ_l ≥ 0, l = 1, ..., L }

is the convex hull of x¹, ..., x^L; cf. Figure 3.2.

Figure 3.2: Convex hull of a finite collection of points

Convexity is preserved under the following operations:

3.1 Exercise (intersections, products and linear combinations). Given a collection of convex sets {C_i, i = 1, ..., r}, one has:

(a) C = ∩_{i=1}^{r} C_i is convex; actually, the intersection of an arbitrary collection of convex sets is still convex;

(b) C = C₁ × C₂ × ··· × C_r is convex, where for C₁ ⊂ IR^{n₁}, C₂ ⊂ IR^{n₂},

    C₁ × C₂ := { (x¹, x²) ∈ IR^{n₁+n₂} | x¹ ∈ C₁, x² ∈ C₂ };

(c) for λ_i ∈ IR, C = Σ_{i=1}^{r} λ_i C_i := { Σ_{i=1}^{r} λ_i x^i | x^i ∈ C_i } is convex.

Figure 3.3: Operations resulting in convex sets.

3.2 Exercise (affine transformation and projection). Given L: IRⁿ → IR^m, an affine mapping, i.e., L(x) = Ax + b where A is an m×n-matrix and b ∈ IR^m, the set L(C) = { z = Ax + b | x ∈ C } is convex whenever C ⊂ IRⁿ is convex. In particular, L(C) is convex when it's the projection of the convex set C on a subspace of IRⁿ.

Figure 3.4: Projection of a convex set.

Guide. The first statement only requires a simple application of the definition of convexity. For the second, simply write the projection as an affine mapping.

Warning: projections preserve convexity but the projection of a closed convex set is not necessarily closed. A simple example: let C = { (x₁, x₂) | x₂ ≥ 1/x₁, x₁ > 0 }; then the projection of C on the x₁-axis is the open interval (0, ∞). This can't occur if C is also bounded, cf. Proposition 8.9.

A particularly important subclass of convex sets are those that are also cones. A ray is a closed half-line emanating from the origin, i.e., a set of the type { λx | λ ≥ 0 } for some 0 ≠ x ∈ IRⁿ. A set K ⊂ IRⁿ is a cone if 0 ∈ K and λx ∈ K for all x ∈ K and λ > 0. Aside from the zero cone {0}, the cones K in IRⁿ are characterized as the sets expressible as nonempty unions of rays. The following sets are all convex cones: {0}, IRⁿ₊, IRⁿ and (closed) half-spaces, i.e., sets of the type { x ∈ IRⁿ | ⟨a, x⟩ ≤ 0 } with a ≠ 0.

3.3 Exercise (convex cones). A nonempty set C is a convex cone if and only if x¹, x² ∈ C implies λ₁x¹ + λ₂x² ∈ C for all λ₁, λ₂ ≥ 0.
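To verify the equivalence: if C is a convex cone and x¹, x² ∈ C, then for λ₁ + λ₂ > 0 one has λ₁x¹ + λ₂x² = (λ₁ + λ₂)[θx¹ + (1 − θ)x²] with θ = λ₁/(λ₁ + λ₂), a positive multiple of a convex combination, hence in C; the case λ₁ = λ₂ = 0 yields 0 ∈ C. Conversely, the stated condition gives convexity (take λ₁ = 1 − λ, λ₂ = λ) and the cone property (take λ₂ = 0).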

Figure 3.5: Cones: convex and non-convex.

A function f is convex relative to a convex set C if for all x⁰, x¹ ∈ C:

    f((1 − λ)x⁰ + λx¹) ≤ (1 − λ)f(x⁰) + λf(x¹)  for all λ ∈ [0, 1].

It is strictly convex relative to C if for all distinct x⁰, x¹ ∈ C:

    f((1 − λ)x⁰ + λx¹) < (1 − λ)f(x⁰) + λf(x¹)  for all λ ∈ (0, 1).

In particular, if C = IRⁿ the preceding inequalities must be satisfied for all x⁰, x¹ in IRⁿ. A function f is concave relative to a convex set C if −f is convex relative to C.

Figure 3.6: Convex and non-convex functions.

One can bypass the need for constant reference to the set on which the function f is defined if we adopt, as we will, the following framework: rather than real-valued functions, one considers extended real-valued functions with the value ∞ assigned to those points that are outside their effective domain. More specifically, with a function f₀: C → IR with C ⊂ IRⁿ, one associates a function f: IRⁿ → IR defined by

    f(x) = f₀(x) if x ∈ C,  ∞ otherwise.


Figure 3.7: Strictly convex and not strictly convex functions.

The effective domain of a function f: IRⁿ → IR is denoted

    dom f = { x ∈ IRⁿ | f(x) < ∞ }.

Arithmetic involving the extended reals follows the usual rules, including 0·∞ = 0, except for ∞ − ∞ = ∞. This is going to be our extended arithmetic convention, but again this convention is consistent with the view that points at which a function takes the value ∞ lie outside its effective domain.

The convexity of a function isn't really affected by such an extension.

3.4 Exercise (convexity for extended real-valued functions). Show that f₀: C → IR is a convex function relative to the convex set C ⊂ IRⁿ if and only if f: IRⁿ → IR is convex, where f = f₀ on C and f ≡ ∞ on IRⁿ \ C.

For any convex function f: IRⁿ → IR, dom f is convex. We only need to adjust the definition of strict convexity, namely, a function f: IRⁿ → IR is strictly convex if the restriction of f to dom f is strictly convex relative to dom f.

Linear functions f(x) = ⟨a, x⟩ and affine functions f(x) = ⟨a, x⟩ + α are convex and concave. The exponential function eˣ and the absolute value function |x| are examples of real-valued convex functions. The sine function sin x can serve as a typical example of non-convexity.

The epigraph of a function f: IRⁿ → IR is the set

    epi f := { (x, α) ∈ IRⁿ⁺¹ | α ≥ f(x) },

i.e., epi f consists of all points in IRⁿ⁺¹ that lie on or above the graph of f. Linking the geometrical properties of the epigraph with the analytical properties of functions is one of the most useful tools of Variational Analysis.

Figure 3.9: Epigraphs.

The hypograph of a function f: IRⁿ → IR is the set

    hypo f := { (x, α) ∈ IRⁿ⁺¹ | α ≤ f(x) },


i.e., hypo f consists of all points in IRⁿ⁺¹ that lie on or below the graph of f.

3.5 Proposition (convexity of epigraphs). A function f: IRⁿ → IR is convex if and only if epi f ⊂ IRⁿ⁺¹ is convex. It's concave if and only if hypo f ⊂ IRⁿ⁺¹ is convex.

Proof. The convexity of epi f means that whenever (x⁰, α₀), (x¹, α₁) ∈ epi f and λ ∈ (0, 1), the point (x^λ, α^λ) := (1 − λ)(x⁰, α₀) + λ(x¹, α₁) belongs to epi f. This is the same as saying that whenever f(x⁰) ≤ α₀ and f(x¹) ≤ α₁, one has f(x^λ) ≤ α^λ.

The assertion about concavity follows by symmetry, passing from f to −f.

    The epigraph is not the only convex set associated with a convex function.

3.6 Exercise (convexity of level sets and argmin). Let f: IRⁿ → IR be convex. Then, for all α ∈ IR, the level sets

    lev_α f := { x ∈ IRⁿ | f(x) ≤ α }

and the set of minimizers

    argmin f := { x ∈ IRⁿ | f(x) ≤ inf f }

are convex. Moreover, if f is strictly convex then argmin f is a single point whenever it's nonempty; for example, if f = eˣ, the function f is strictly convex but argmin f = ∅.

Figure 3.10: Level set of a convex function.

Guide. Use the convexity of epi f and IRⁿ × {α}, and apply 3.1(a).


3.7 Exercise (convexity of max-functions). Let {f_i, i ∈ I} be a collection of convex functions. Then the function f(x) = sup_{i∈I} f_i(x) is convex.

Figure 3.11: Max-function.

Guide. Observe that epi f = ∩_{i∈I} epi f_i and appeal to 3.1 and 3.5.

3.8 Example (convex indicator functions). The indicator function ι_C: IRⁿ → [0, ∞] of a set C ⊂ IRⁿ is convex if and only if C is convex, where

    ι_C(x) = 0 if x ∈ C,  ∞ otherwise.

Figure 3.12: An indicator function and its epigraph.


Follows from 3.1(b), 3.5, and 3.6 since epi ι_C = C × IR₊ and C = lev₀ ι_C.

3.9 Proposition (inf-projection of convex functions). Let f: IRⁿ → IR be the inf-projection of the convex function g: IR^m × IRⁿ → IR, i.e., for all x ∈ IRⁿ,

    f(x) = inf_{u∈IR^m} g(u, x).

Then f is a convex function.

Figure 3.13: Inf-projection of a convex function

Proof. Follows from 3.2 and 3.5 since epi f is the vertical closure of the projection,

    (u, x, α) ↦ (x, α),

of epi g ⊂ IR^{m+n+1} on the subspace IRⁿ⁺¹. By vertical closure one means that (x, ᾱ) is included in epi f whenever (x, α) ∈ epi f for all α > ᾱ; it's immediate that taking the vertical closure preserves convexity.

3.10 Exercise (convexity under linear transformations). Let f: IR^m → IR be a convex function. For any m×n-matrix A and a ∈ IR^m, the function g: IRⁿ → IR defined by g(x) = f(Ax + a) is a convex function.

Sublinear functions, a subclass of convex functions, play a central role in the subdifferentiation of convex functions. A function f: IRⁿ → IR is sublinear if f is convex and positively homogeneous, i.e., 0 ∈ dom f, f(λx) = λf(x) for all x ∈ IRⁿ, λ > 0. Typical examples are f(x) = ⟨a, x⟩, f(x) = |x| and f(x) = sup_{i∈I} ⟨a_i, x⟩, the supremum of a collection of linear functions.

Figure 3.14: Two sublinear functions.

3.11 Exercise (sublinearity criteria). For f: IRⁿ → IR with f(0) = 0, sublinearity is equivalent to either one of the following conditions:

(a) epi f is a convex cone;

(b) f is positively homogeneous and, for all x¹, x² ∈ IRⁿ, f(x¹ + x²) ≤ f(x¹) + f(x²).
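As a worked instance of (b), take f(x) = sup_{i∈I} ⟨a_i, x⟩: then f(x¹ + x²) = sup_{i∈I} [⟨a_i, x¹⟩ + ⟨a_i, x²⟩] ≤ sup_{i∈I} ⟨a_i, x¹⟩ + sup_{i∈I} ⟨a_i, x²⟩ = f(x¹) + f(x²), and f(λx) = λf(x) for λ > 0; its epigraph is thus a convex cone.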

Our primordial interest in convexity, at least in Variational Analysis, comes from the theorem below that relates local and global minimizers in the presence of convexity; refer to 1.2 for the definition of local minimizers.

3.12 Theorem (local and global minimizers). Every local minimizer of a convex function f: IRⁿ → IR is a global minimizer. Moreover, there is only a single minimizer of f when it's strictly convex.

Proof. If x⁰ and x¹ are two points with f(x⁰) > f(x¹), then x⁰ cannot furnish a local minimum of f, cf. Definition 1.2: every ball centered at x⁰ contains points x^λ = (1 − λ)x⁰ + λx¹ with λ ∈ (0, 1) that satisfy f(x^λ) ≤ (1 − λ)f(x⁰) + λf(x¹) < f(x⁰). Thus, there can't be any locally optimal solutions outside of argmin f, where global optimality is achieved.

If f is strictly convex, then x⁰, x¹ ∈ dom f can't be distinct points that minimize f. In view of the preceding, one necessarily would have f(x⁰) = f(x¹). But then, strict convexity would imply that f(x^λ) < f(x⁰) for every point x^λ ∈ (x⁰, x¹).

    3.2 Variational analysis II

This section continues the study of the properties of convex functions but we are now mostly concerned with their (sub)differentiability properties. The class of functions to which we can apply the classical optimality conditions of Chapter 1 doesn't include many that come up in the mathematical programming context. Restricting the development to models involving only differentiable (convex) functions would leave by the wayside all constrained optimization models, and they include the large majority of the applications. One needs a calculus that applies to functions that are not necessarily differentiable. Eventually, this will enable us to formulate Oresme's and Fermat's rules for convex functions that aren't necessarily differentiable, or even continuous. This Subdifferential Calculus is introduced in this section and will be expanded throughout the ensuing development.

For the sake of exposition, and so that the readers can drill their intuition, 1-dimensional functions are featured prominently in this section. In some instances, for the sake of simplicity, the proof of a statement is only provided for 1-dimensional convex functions³.

3.13 Proposition (continuity of convex functions). A real-valued convex function f defined on IRⁿ is continuous.

Proof. The proof is for n = 1. It will be sufficient to show that f: IR → IR is continuous at 0; continuity at any other point x follows from the continuity at 0 of the function g(z) = f(z + x). By symmetry, it suffices to show that f(0) = lim_ν f(x^ν) for any sequence x^ν ↘ 0 with x^ν ∈ (0, 1]. From the convexity of f, one has for all ν:

    f(x^ν) ≤ (1 − x^ν) f(0) + x^ν f(1),
    f(0) ≤ (1/(x^ν + 1)) f(x^ν) + (x^ν/(x^ν + 1)) f(−1),

³ A complete proof would have required additional background material that would let us stray too far from the objectives of these lectures; for proofs in full generality, one can consult [3], [18, Chapter 2], for example.


that can also be written