HANSO and HIFOO: Two New MATLAB Codes
Michael L. Overton
Courant Institute of Mathematical Sciences, New York University
Singapore, Jan 2006
Two New Codes
• HANSO: Hybrid Algorithm for Nonsmooth Optimization
– Aimed at finding local minimizers of general nonsmooth, nonconvex optimization problems
– Joint with Jim Burke and Adrian Lewis
• HIFOO: H-Infinity Fixed-Order Optimization
– Aimed at solving specific nonsmooth, nonconvex optimization problems arising in control
– Built on HANSO
– Joint with Didier Henrion
Many optimization objectives are
• Nonconvex
• Nonsmooth
• Continuous
• Differentiable almost everywhere
• With gradients often available at little additional cost beyond that of computing the function value
• Subdifferentially regular
• Sometimes, non-Lipschitz
Numerical Optimization of Nonsmooth, Nonconvex Functions
• Steepest descent: jams
• Bundle methods: better, but these are mainly intended for nonsmooth convex functions
• We developed a simple method for nonsmooth, nonconvex minimization based on Gradient Sampling
• Intended for continuous functions that are differentiable almost everywhere, and for which the gradient can be easily computed when it is defined
• The user need only write a routine to return the function value and gradient, and need not worry about nondifferentiable cases (e.g., ties for a max) when coding the gradient
• Very recently, we found that BFGS is often very effective when implemented correctly, and much less expensive
BFGS
• Standard quasi-Newton method for optimizing smooth functions, using gradient differences to update an inverse Hessian approximation H, defining a local quadratic model of the objective function
• Conventional wisdom: jams on nonsmooth functions
• Amazingly, it works very well as long as it is implemented right (a weak Wolfe line search is essential)
• It builds an excellent, extremely ill-conditioned quadratic model
• Often runs until cond(H) reaches 10^16 before breaking down, taking step lengths of around 1
• Often converges linearly (not superlinearly)
• By contrast, steepest descent and Newton's method usually jam
• Not very reliable, and no convergence theory
[Figure: steepest descent on f(x) = 10|x2 − x1^2| + (1 − x1)^2, iterates plotted on [−2, 2] × [−2, 2]; at iteration 25, f = 3.6e+000]
[Figure: BFGS on f(x) = 10|x2 − x1^2| + (1 − x1)^2, iterates plotted on [−2, 2] × [−2, 2]; at iteration 250, f = 4.5e−009]
[Figure: BFGS on f(x) = 10|x2 − x1^2| + (1 − x1)^2, summary over 250 iterations (log scale): f, ||g|| cos(theta), and 1/cond(H)]
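The behavior in these plots is easy to reproduce. Below is a minimal Python/NumPy sketch of BFGS with a weak Wolfe bisection line search, applied to the same function; the function names and parameter values are illustrative choices of mine, not HANSO's actual implementation:

```python
import numpy as np

def f(x):
    return 10 * abs(x[1] - x[0]**2) + (1 - x[0])**2

def grad(x):
    # Gradient where it is defined; the tie x[1] == x[0]**2 is not special-cased,
    # following the slides' advice not to worry about nondifferentiable points.
    s = np.sign(x[1] - x[0]**2)
    return np.array([-20 * s * x[0] - 2 * (1 - x[0]), 10 * s])

def weak_wolfe(x, d, c1=1e-4, c2=0.9, max_bisect=50):
    # Bracketing/bisection line search enforcing only the WEAK Wolfe conditions:
    # sufficient decrease and (grad f)(x + td)'d >= c2 (grad f)(x)'d.
    lo, hi, t = 0.0, np.inf, 1.0
    f0, g0d = f(x), grad(x) @ d
    for _ in range(max_bisect):
        if f(x + t * d) > f0 + c1 * t * g0d:
            hi = t                          # sufficient decrease failed: shrink
        elif grad(x + t * d) @ d < c2 * g0d:
            lo = t                          # curvature condition failed: grow
        else:
            return t
        t = (lo + hi) / 2 if np.isfinite(hi) else 2 * t
    return t

def bfgs(x, iters=250):
    H = np.eye(len(x))                      # inverse Hessian approximation
    g = grad(x)
    for _ in range(iters):
        d = -H @ g
        t = weak_wolfe(x, d)
        x_new = x + t * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y                          # > 0 is guaranteed by weak Wolfe
        if sy > 1e-16:
            V = np.eye(len(x)) - np.outer(s, y) / sy
            H = V @ H @ V.T + np.outer(s, s) / sy
        x, g = x_new, g_new
    return x

x = bfgs(np.array([-1.0, 2.0]))
print(f(x))   # drops far below the value at which steepest descent jams
```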
Minimizing a Product of Eigenvalues
• An application due to Anstreicher and Lee
• Minimize the log of the product of the K largest eigenvalues of A o X, subject to X positive semidefinite and diag(X) = 1 (A, X symmetric; we usually set K = N/2)
• Would be equivalent to an SDP if the product were replaced by a sum
• The number of variables is n = N(N−1)/2, where N is the dimension of X
• Following Burer, substitute YY' for X, where Y is N by r, so n is reduced to rN; normalize Y so its rows have norm 1: thus both constraints are eliminated
• Typically the objective is not smooth at the minimizer, because the K-th largest eigenvalue is multiple
• Results for N=10 and rank r = 2 through 10…
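After Burer's substitution, the objective is cheap to evaluate. A sketch in Python/NumPy (the function name `log_eig_product` is mine; the gradient, which the optimizer would also need, is omitted):

```python
import numpy as np

def log_eig_product(Y, A, K):
    """Log of the product of the K largest eigenvalues of A o (Y Y'),
    with the rows of Y normalized to unit norm so that diag(X) = 1."""
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    X = Yn @ Yn.T                  # positive semidefinite, diag(X) = 1
    M = A * X                      # Hadamard product A o X
    w = np.linalg.eigvalsh(M)      # eigenvalues in ascending order
    return np.sum(np.log(w[-K:])) # log turns the product into a sum
```

With Y = I (so X = I), the Hadamard product A o X reduces to the diagonal of A, and the value is the sum of the logs of the K largest diagonal entries.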
[Figure: optimal value of the log eigenvalue product f versus rank r = 2, ..., 10, for N = 10; the optimal multiplicities are 2 3 3 5 6 3 3 3 3]
N=10, r=6: eigenvalues of the optimal A o X
3.163533558166873e-001
5.406646569314501e-001
5.406646569314647e-001
5.406646569314678e-001
5.406646569314684e-001
5.406646569314689e-001
5.406646569314706e-001
7.149382062194134e-001
7.746621702661097e-001
1.350303124150612e+000
The U and V Spaces
• The condition number of H, the BFGS approximation to the inverse Hessian, typically reaches 10^16!
• Let's look at the eigenvalues of H: H = QDQ* (* denotes transpose). The search direction is d = −Hg = −QDQ*g (where g = gradient)
• The next plot shows diag(D), sorted into ascending order; the components of |Q*g|, using the same ordering; and the components of |DQ*g|, using the same ordering (the search direction expanded in the eigenvector basis)
• Tiny eigenvalues of H correspond to the "nonsmooth" V-space
• The other eigenvalues of H correspond to the "smooth" U-space
• From matrix theory, dim(V-space) = m(m+1)/2 − 1, where m is the multiplicity of the K-th eigenvalue of A o X
[Figure: Data=63, N=10, r=6, K=5, mult=6, V space dim=20, U space dim=40, iter=415; diag(D) where H=QDQ*, |Q*g|, and |DQ*g|, plotted on a log scale]
[Figure: Data=63, N=63, r=10, K=31, mult=10, V space dim=54, U space dim=576, iter=10000; diag(D) where H=QDQ*, |Q*g|, and |DQ*g|, plotted on a log scale]
More on the U and V Spaces
• H seems to be an excellent approximation to the "limiting inverse Hessian" and evidently gives us bases for the optimal U and V spaces
• Let's look at plots of f along random directions in the U and V spaces, as defined by the eigenvectors of H, passing through the minimizer (for the N=10, r=6 example)
[Figure: f along random directions in the V (red) and U (green) spaces, scale = 1.0e+001]
[Figure: f along random directions in the V (red) and U (green) spaces, scale = 1.0e+000]
[Figure: f along random directions in the V (red) and U (green) spaces, scale = 1.0e−001]
Line Search for BFGS
• Sufficient decrease: for 0 < c1 < 1,
f(x + td) ≤ f(x) + c1 t (grad f)(x)*d
• Weak Wolfe condition on the derivative: for c1 < c2 < 1,
(grad f)(x + td)*d ≥ c2 (grad f)(x)*d
• Not the strong Wolfe condition on the derivative:
|(grad f)(x + td)*d| ≤ −c2 (grad f)(x)*d
• Weak Wolfe is essential when f is nonsmooth
• No reason to use strong Wolfe even if f is smooth
• Weak Wolfe is much simpler to implement
• Why ever use strong Wolfe?
But how to check whether the final x is locally optimal?
• Extreme ill-conditioning of H is a strong clue, but it is expensive to compute and proves nothing
• If x is locally optimal, running a local bundle method from x establishes nonsmooth stationarity: this involves repeated null-step line searches and building a bundle of gradients obtained via the line searches
• When the line search returns a null step, it must also return a gradient at a point lying across the discontinuity
• Then the new d = −arg min { ||d|| : d ∈ conv G }, where G is the set of gradients in the bundle
• Terminate if ||d|| is small
• If ||d|| is 0 and the line search steps were 0, x is Clarke stationary; more realistically, it is "close" to Clarke stationary
• I'll call this Clarke Quay Stationary, since it is a Quay Idea
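The vector d = −arg min { ||d|| : d ∈ conv G } is the solution of a small quadratic program over the simplex. A sketch of that subproblem (the helper name is mine, and a general-purpose SLSQP solve stands in for the specialized QP solvers typically used):

```python
import numpy as np
from scipy.optimize import minimize

def min_norm_in_hull(G):
    """Smallest vector in the convex hull of the columns of G:
    minimize ||G lam||^2 over lam >= 0, sum(lam) = 1, and return G lam."""
    m = G.shape[1]
    res = minimize(
        lambda lam: np.sum((G @ lam) ** 2),
        np.full(m, 1.0 / m),                 # start at the centroid
        jac=lambda lam: 2.0 * G.T @ (G @ lam),
        bounds=[(0.0, None)] * m,
        constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}],
        method="SLSQP",
    )
    return G @ res.x
```

For two gradients pointing in opposite directions, as across a kink of |x|, the hull contains 0 and the returned vector is (nearly) zero, which is the stationarity signal described above.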
First-order local optimality
• Is this implied by Clarke stationarity of f at x?
• No: for example, f(x) = −|x| is Clarke stationary at 0
• Yes, in the sense that f′(x; d) ≥ 0 for all directions d, when f is regular at x (subdifferentially regular, Clarke regular)
• Most of the functions that we have studied are regular everywhere, although this is sometimes hard to prove
• Regularity generalizes smoothness and convexity
Harder Problems
• Eigenvalue product problems are interesting, but the Lipschitz constant L for f is not large
• Harder problems:
– Chebyshev Exponential Approximation
– Optimization of Distance to Instability
– Pseudospectral Abscissa Optimization (arbitrarily large L)
– Spectral Abscissa Optimization (not even Lipschitz)
• In all cases, BFGS turns out to be substantially less reliable than gradient sampling
Gradient Sampling Algorithm
Initialize ε and x. Repeat:
• Get G, a set of gradients of the function f evaluated at x and at points near x (sampling controlled by ε)
• Let d = −arg min { ||d|| : d ∈ conv G }
• Line search: replace x by x + td, with f(x + td) < f(x) (if d is not 0)
until d = 0 (or ||d|| ≤ tol).
Then reduce ε and repeat.
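The loop above can be sketched end to end in a few dozen lines. Everything below is an illustrative toy (the names, the cube sampling scheme, and the backtracking line search are my simplifications; the hull subproblem is solved with SciPy's SLSQP), applied to the f(x) = 10|x2 − x1^2| + (1 − x1)^2 example from the earlier slides:

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    return 10 * abs(x[1] - x[0]**2) + (1 - x[0])**2

def grad(x):
    s = np.sign(x[1] - x[0]**2)
    return np.array([-20 * s * x[0] - 2 * (1 - x[0]), 10 * s])

def min_norm_in_hull(G):
    # Smallest vector in conv{columns of G}, via a simplex-constrained QP.
    m = G.shape[1]
    res = minimize(lambda lam: np.sum((G @ lam) ** 2), np.full(m, 1.0 / m),
                   bounds=[(0.0, None)] * m,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}],
                   method="SLSQP")
    return G @ res.x

def gradient_sampling(x, eps_schedule=(0.1, 0.01, 0.001), iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    for eps in eps_schedule:              # reduce the sampling radius in stages
        for _ in range(iters):
            # Gradients at x and at n+1 (> n) random points within distance eps
            # (sampled from a cube rather than a ball, for simplicity)
            pts = [x] + [x + eps * rng.uniform(-1, 1, n) for _ in range(n + 1)]
            G = np.column_stack([grad(p) for p in pts])
            d = -min_norm_in_hull(G)
            if np.linalg.norm(d) < eps:   # (near-)stationary at this radius
                break
            t = 1.0                       # backtracking sufficient-decrease search
            while f(x + t * d) > f(x) - 1e-4 * t * np.linalg.norm(d)**2 and t > 1e-12:
                t /= 2
            x = x + t * d
    return x

x = gradient_sampling(np.array([-1.0, 2.0]))
```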
Convergence Theory for Gradient Sampling Method (SIOPT, 2005)
• Suppose
– f is locally Lipschitz and coercive
– f is continuously differentiable on an open dense subset of its domain
– the number of gradients sampled near each iterate is greater than the problem dimension
– the line search uses an appropriate sufficient decrease condition
• Then, with probability one and for fixed sampling parameter ε, the algorithm generates a sequence of points with a cluster point x that is ε-Clarke stationary
• If f has a unique Clarke stationary point x, then the set of all cluster points generated by the algorithm converges to x as ε is reduced to zero
• Kiwiel has already improved on this! (For a slightly different, as yet untested, version of the algorithm.)
[Figure: optimal value found from 10 different starting points, pseudospectral abscissa, epsln = .001: BFGS vs. Grad Sampling vs. Grad Sampling started at the BFGS solution]
[Figure: same comparison, pseudospectral abscissa, epsln = .000001]
[Figure: same comparison, spectral abscissa]
HANSO
• The user provides a routine to compute f and its gradient at any given x (do not worry about nonsmooth cases)
• BFGS is then run from many randomly generated or user-provided starting points. Quit if the gradient at the best point found is small.
• Otherwise, a local bundle method is run from the best point found, attempting to verify local stationarity.
• If this is not successful, gradient sampling is initiated.
• Regardless, return a final x along with a set of nearby points, the corresponding gradients, and the d which is the smallest vector in the convex hull of these gradients
• If the nearby points are near enough and d is small enough, f is Clarke Quay Stationary at x
• This is an approximate first-order local optimality condition if f is regular at x
Optimization in Control
• Often, not many variables, as in low-order controller design
• Reasons:
– engineers like simple controllers
– nonsmooth, nonconvex optimization problems in even a few variables can be very difficult
• Sometimes, more variables are added to the problem in an attempt to make it more tractable
• Example: adding a Lyapunov matrix variable P and imposing stability on A by requiring AP + PA* to be negative definite
• When A depends linearly on the original parameters, this results in a bilinear matrix inequality (BMI), which is typically difficult to solve
• Our approach: tackle the original problem
• Looking for locally optimal solutions
H∞ Norm of a Transfer Function
• Transfer function G(s) = C(sI − A)^(−1)B + D
• SS representation:
[ A B ]
[ C D ]
• The H∞ norm is ∞ if A is not stable (stable means all eigenvalues have negative real part), and otherwise sup ||G(s)||2 over all imaginary s
• The standard way to compute it is norm(SS, inf) in the MATLAB Control Toolbox
• LINORM from SLICOT is much faster
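The definition on this slide can be checked numerically: when A is stable, scan frequencies on the imaginary axis and take the largest singular value of G(jω). This sketch is only for intuition (a crude frequency grid of my choosing; norm(SS, inf) and LINORM use far better level-set algorithms):

```python
import numpy as np

def hinf_norm_grid(A, B, C, D, wmax=1e3, npts=2000):
    """Crude H-infinity norm estimate: max largest singular value of
    G(jw) = C (jwI - A)^-1 B + D over a logarithmic frequency grid."""
    A, B, C, D = map(np.atleast_2d, (A, B, C, D))
    if np.max(np.linalg.eigvals(A).real) >= 0:
        return np.inf                       # A unstable: the norm is infinite
    I = np.eye(A.shape[0])
    ws = np.concatenate([[0.0], np.logspace(-3, np.log10(wmax), npts)])
    best = 0.0
    for w in ws:
        G = C @ np.linalg.solve(1j * w * I - A, B) + D
        best = max(best, np.linalg.svd(G, compute_uv=False)[0])
    return best
```

For the scalar system A = −1, B = C = 1, D = 0, we have G(s) = 1/(s + 1); the sup is attained at ω = 0 with value 1.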
H∞ Fixed-Order Controller Design
• Choose matrices a, b, c, d to minimize the H∞ norm of the transfer function
• The dimension of a is k by k (the order, between 0 and n = dim(A))
• The dimension of b is k by p (number of system inputs)
• The dimension of c is m (number of system outputs) by k
• The dimension of d is m by p
• Total number of variables is (k+m)(k+p)
• The case k=0 is static output feedback
• When B1, C1 are empty, all I/O channels are in the performance measure
Closed-loop state-space matrices:
A-block: [ A + B2 d C2   B2 c ; b C2   a ]
B-block: [ B1 + B2 d D21 ; b D21 ]
C-block: [ C1 + D12 d C2   D12 c ]
D-block: D11 + D12 d D21
HIFOO: H∞ Fixed-Order Optimization
• Aims to find a, b, c, d for which the H∞ norm is locally optimal
• Begins by minimizing the spectral abscissa max(real(eig(A-block))) until it finds a, b, c, d for which the A-block is stable (and therefore the H∞ norm is finite)
• Then locally minimizes the H∞ norm
• Calls HANSO to carry out both optimizations
• Alternative objectives:
– Optimize the spectral abscissa instead of quitting when stable
– Optimize the pseudospectral abscissa
– Optimize the distance to instability (complex stability radius)
• The output a, b, c, d can then be input to optimize for a larger order k, which cannot give a worse result
• Accepts various input formats
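The A-block whose spectral abscissa HIFOO first minimizes is assembled directly from the plant and controller matrices shown in the controller design slide. A small sketch (helper names and the demo controller values are mine):

```python
import numpy as np

def closed_loop_A(A, B2, C2, a, b, c, d):
    """Closed-loop 'A-block' [A + B2 d C2, B2 c; b C2, a] for a plant
    (A, B2, C2) under an order-k controller (a, b, c, d)."""
    top = np.hstack([A + B2 @ d @ C2, B2 @ c])
    bot = np.hstack([b @ C2, a])
    return np.vstack([top, bot])

def spectral_abscissa(M):
    # max real part of the eigenvalues; negative means M is stable
    return np.max(np.linalg.eigvals(M).real)

# Demo: double-integrator plant with a hypothetical order-1 controller
A  = np.array([[0.0, 1.0], [0.0, 0.0]])
B2 = np.array([[0.0], [1.0]])
C2 = np.array([[1.0, 0.0]])
Acl = closed_loop_A(A, B2, C2, a=np.array([[-1.0]]), b=np.array([[1.0]]),
                    c=np.array([[-1.0]]), d=np.array([[-1.0]]))
```

With k = 0 (static output feedback), the A-block reduces to A + B2 d C2.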
HIFOO Provides the Gradients
• For the H∞ norm, it combines
– left and right singular vector information at the point on the imaginary axis where the sup is achieved
– the chain rule
• For the spectral abscissa max(real(eig)), it combines
– left and right eigenvector information for the eigenvalue achieving the max real part
– the chain rule
• Do not have to worry about ties
• The function is continuous, but the gradient is not
• However, the function is differentiable almost everywhere (and virtually everywhere that it is evaluated)
• Typically, it is not differentiable at an exact minimizer
Benchmark Examples: AC Suite
• Made extensive runs for the AC suite of 18 aircraft control problems in Leibfritz's COMPleib
• In many cases we found low-order controllers that had a smaller H∞ norm than was supposedly possible according to the standard full-order controller MATLAB routine HINFSYN!
• Sometimes this was even the case when the order was set to 0 (static output feedback)
• After I announced this at the SIAM Control meeting in July, HINFSYN was extensively debugged and a new version has been released
Benchmark Examples: Mass-Spring
• A well-known simple example with n=4
• Optimizing the spectral abscissa:
– Order 1: we obtain 0, known to be the optimal value
– Order 2: we obtain about −0.73; the optimal value was previously conjectured to be about −0.5
Benchmarks: Belgian Chocolate Challenge
• Simply stated stabilization problem due to Blondel
• From 1994 to 2000, it was not known whether it was solvable by a controller of any order
• In 2000, an order-11 controller was discovered
• We found an order-3 controller, and using our analytical results, proved that it is locally optimal
Availability of HANSO and HIFOO
• Version 0.9, documented in a paper submitted to ROCOND 2006, freely available at http://www.cs.nyu.edu/overton/software/hifoo/
• Version 0.91, available online but not fully tested, so no public link yet
• Version 0.92 will have further enhancements, and we hope to announce it widely in a month or two
Some Relevant Publications
• A Robust Gradient Sampling Algorithm for Nonsmooth, Nonconvex Optimization
– SIOPT, 2005 (with Burke, Lewis)
• Approximating Subdifferentials by Random Sampling of Gradients
– MOR, 2002 (with Burke, Lewis)
• Stabilization via Nonsmooth, Nonconvex Optimization
– submitted to IEEE Trans-AC (with Burke, Henrion, Lewis)
• HANSO: A Hybrid Algorithm for Non-Smooth Optimization Based on BFGS, Bundle and Gradient Sampling
– in planning stage
http://www.cs.nyu.edu/faculty/overton/
恭喜发财 Gong xi fa cai! Gung hei fat choy! (Wishing you prosperity: Happy Lunar New Year!)