A kriging method for the solution of nonlinear programs with black-box functions

A Kriging Method for the Solution of NonlinearPrograms with Black-Box Functions

Eddie Davis and Marianthi IerapetritouDept. of Chemical and Biochemical Engineering, Rutgers—The State University of New Jersey, Piscataway, NJ 08854

DOI 10.1002/aic.11228Published online June 25, 2007 in Wiley InterScience (www.interscience.wiley.com).

In this article, a new methodology is developed for the optimization of black-boxsystems lacking a closed-form mathematical description. To properly balance the com-putational cost of building the model against the probability of convergence to globaloptimum, a hybrid methodology is proposed. A kriging approach is first applied to pro-vide information about the global behavior of the system considered, whereas aresponse surface method is considered close to the optimum to refine the set of candi-date local optima and find the global optimum. The kriging predictor is a global modelemploying normally distributed basis functions, so both an expected sampling valueand its variance are obtained for each test point. The presented work extends the capa-bilities of existing response surface techniques to address the refinement of optimalocated in regions described by convex asymmetrical feasible regions containing arbi-trary linear and nonlinear constraints. The performance of the proposed algorithm iscompared to previously developed stand-alone response surface techniques and itseffectiveness is evaluated in terms of the number of function calls required, number oftimes the global optimum is found, and computational time. � 2007 American Institute of

Chemical Engineers AIChE J, 53: 2001–2012, 2007

Keywords: optimization, mathematical modeling, black-box, kriging, response surface

Introduction

Process design problems lacking closed-form model equa-tions are very difficult to solve when noisy input-output dataare the only information available. Sampling information canbe noisy due to unknown/uncontrollable environmental con-ditions affecting process operation or as a result of the levelof detail employed in a computational model. This problemcontains three main challenges. First, what kind of modelshould be used to approximate the black-box function? Localmodels provide only a limited understanding of systembehavior, but global models are more expensive to build.Second, how can the model capture the true behavior, insteadof the noise? If derivative-based techniques are applied tomodels that fit the noise, this can lead to the discovery of ar-

tificial optima due to iterates becoming trapped by the noise.Third, given that more accurate models are obtained usingsymmetrically arranged sampling points, how can reliablemodels be constructed for problems described by convexasymmetrical feasible regions?

A new algorithm is presented in this paper for the solutionof nonlinear programs (NLP) containing black-box functionsand noisy variables that extends the capabilities of currentmethods to handle convex feasible regions defined by arbi-trary linear and nonlinear constraints. In the proposedmethod, kriging is first used to construct a global model inorder to find the set of regions containing potential optima.Response surface techniques are then applied to local modelsin order to refine the set of local solutions. The global modelbuilding expense is offset by an increased probability of find-ing the global optimum in contrast to the application ofpurely local techniques.

A variety of techniques have been applied towards the so-lution of NLP containing black-box functions and noisy vari-

ID: 3b2server Date: 10/7/07 Time: 11:10 Path: J:/Wiley_Sgml/JOHNWILEY/3B2-Final-Pdfs/Electronic/WEB_PDF/Process/C2AIC#070122

Correspondence concerning this article should be addressed to M. Ierapetritou [email protected].

� 2007 American Institute of Chemical Engineers

AIChE Journal August 2007 Vol. 53, No. 8 2001

ables. The first set of algorithms discussed use sample pointsand do not build surrogate models, so model building costsare avoided. Sampling-based global zero-order methods suchas Nelder-Mead (Barton and Ivey, 19961), Divided Rectan-gles (Jones et al., 19932), multilevel coordinate search(Huyer and Neumaier, 19993), and differential evolution(Storn and Price, 19974) are algorithms that belong to thisclass and can be used for the optimization of black box mod-els. However they exhibit the following limitations. First,convergence may be asymptotic, and second, premature ter-mination can occur since no mechanism is in place to over-come the discovery of artificial optima. In order to overcomethese limitations, gradient-based methods can be used whichapply finite-differencing expressions that depend on the sys-tem noise (Brekelmans et al., 20055; Choi and Kelley, 20006;Gilmore and Kelley, 19957). Gradient inaccuracy is mini-mized by using finite-difference intervals large enough to‘‘step over’’ the noise. The interval length adaptively changesover the sequence of iterations since the noise may decay asthe optimum is approached. The main drawback of using thisclass of methods is that even though the optimum can befound, an understanding of system behavior is limited,thereby motivating the application of a different set of tech-niques which perform optimization after a surrogate modelhas first been constructed.

Response surface techniques find an optimum by applyingsteepest descent towards inexpensive local models (Myersand Montgomery, 20028). Box and Wilson (19519) and Hoerl(195910) authored seminal papers in this field proposing theuse of statistical design of experiments, another developingfield at the time, to generate quadratic functionals in order tomore rigorously quantify response behavior. In this approach,a local input-output mapping is represented using a low-orderpolynomial fitted to output data obtained from symmetricallyarranged sampling points. After minimization of the localmodel, a new model is built which is centered on the currentiterate, and is then again minimized. The procedure continuesuntil a prespecified stopping criterion such as minimaldecrease in the objective function has been obtained. Becauseresponse surfaces are usually inexpensive in terms of thecomputational cost required to build the models, samplingexpense is the primary source of the optimization cost. Toreduce the computational complexity due to sampling inresponse surface methods, two algorithms have previouslybeen developed by our group (Davis and Ierapetritou,200611). In the first method, direct search is employed in theearly stages followed by minimization of local models oncethe vicinity of the optimum has been identified. Samplingpoints are generated according to a stencil that is a coarserepresentation of the finer template used to generate the setof points needed to build the response surface at the laterstages. This method is based on the idea that the value of thesearch direction information obtained by response surfaceminimization is not justified by the additional sampling dur-ing the early stages of optimization. The second techniqueborrows ideas from sequential quadratic programming (SQP)by minimizing the local response surface across the entirefeasible region in order to obtain possible global movementtowards the optimum. Global movement towards the opti-mum occurs if sampling at the predicted minimum yields asuperior value to the current best candidate solution of the

set of local collocation points. This method performs wellwhen optima reside near feasible region boundaries and thelocal model generated from a nominal collocation set far-away provides a suitable approximation of the global geome-try. The cost of an additional function call per iteration isjustified when compared against the equivalent set of sequen-tial response surface minimizations that result in the samesearch direction being identified over local subregions. Opti-mization using response surfaces has been applied to manyfields including automotive impact engineering and design(Stander and Craig, 200212), structural optimization (Rouxet al., 199813), and food engineering (Chen et al., 200514).Tang et al. (200515) employs response surface ideas within aGauss-Newton algorithmic framework to solve the kineticinversion problem. Zhang et al. (200616) extended this meth-odology for parameter estimation of coupled reaction-trans-port systems under uncertainty. Although local optimizationtechniques are applied more often to response surfaces,global methods have also been studied (Jones, 200117; Joneset al., 199818).

In contrast to response surfaces which are local models, akriging predictor is a global model employing normally dis-tributed basis functions, so both an expected sampling valueand its variance are obtained for each test point (Cressie,199323; Goovaaerts, 199719; Isaaks and Srivistava, 198920).Kriging was first developed as an inverse distance weightingmethod to describe the spatial distribution of mineral depos-its. The global model obtained served as a tool to determineadditional locations possessing the same grade characteristics,thereby enabling the generation of improved monthly fore-cast estimates for different mine sections. The seminal articleby Krige (195121) was reviewed in context of other classicalstatistical methods used for geostatistics (Matheron, 196322).Kriging has not only been used to generate models based onfield data, but also when input-output information is obtainedvia computer experiments due to the increasing use of simu-lation for the study of complex processes (Sacks et al.,198924; Santner et al., 200325). It is to be noted that in theproposed algorithm, the function of kriging is to determineimproved locations for local search thereby enabling theresponse surface techniques to be applied to ‘‘warm-start’’iterates. Because of the black-box functions present in theproblem formulation, it is impossible to guarantee globaloptimality. The interpolatory and confidence limits of krigingmay be similar to those attained using aBB (Maranas andFloudas, 199526) but aBB assumes that the functionality isknown, which is not the case in the general class of problemsaddressed in this article.

Kriging is an interpolation technique whereby a prediction~z2 at test point xk is made according to a weighted sum ofthe observed function values at nearby sampling points. Thekriging predictor is modeled after a normally distributed ran-dom function, so a prediction variance is also obtained at thetest point. As a result, the sampling value is expected to fallwithin the interval specified by the prediction and corre-sponding variance. By discretizing the feasible region, both aprediction and variance mapping can be obtained. The var-iance mapping describes prediction uncertainty, which willbe high in regions with a low number of sampling points.Additional sampling can then be done within these regions toreduce the uncertainty. The main drawback for kriging is

2002 DOI 10.1002/aic Published on behalf of the AIChE August 2007 Vol. 53, No. 8 AIChE Journal

that model building costs are high compared to the construc-tion of response surfaces. The existing literature usually com-pares kriging and RSM methods but has not combined themtogether for the purpose of global optimization as proposedin this work. In the next section, the details and algorithmicfeatures of the proposed method are presented. The proposedmethod is applied to several numerical examples and con-cluding remarks are then given.

Solution Approach

In the next section, the mathematical formulation of theproblem is given followed by the basic idea of the proposedkriging-RSM algorithm. The details of the kriging methodare presented followed by ideas for refinement of the opti-mum using response surfaces when the problem is describedby arbitrary linear and nonlinear constraints.

Problem Formulation

The problem addressed in this work has the followingform:

minFðx; z1; z2Þs:t: gðx; z2Þ � 0

hðx; z1Þ ¼ 0

z2ðxÞ ¼ CðxÞ þ eðxÞeðxÞ 2 N xjl; r 2

� �Nðxjl; r 2Þ ¼ 1ffiffiffiffiffiffi

2pp

rexp

�ðx� lÞ22r 2

!(1)

In this formulation, x represents continuous variables thatare process inputs. This set of variables is distinct from inputinformation z2, which is obtained from sampling at upstreamunits. Deterministic output variables z1 describe outputswhose modeling equations h(x, z1) are known. Stochastic out-put variables z2 represent the black-box part of the modeland are simulated by deterministic functions G(x) perturbedwith noise e(x). The noise is described by a normally distrib-uted error whose properties can change depending upon thespatial location of x. The noise function is stochastic in thatreplicate experiments performed under the same conditionslead to different values for e(x) and in turn z2(x). Design con-straints such as operating specifications are given by g(x, z2).In the proposed approach, a sequential kriging-RSM algo-rithm is employed for the optimization of systems containingblack-box functions and noisy variables. The basic idea is tofirst use kriging to obtain a global picture of system behaviorby generating predictions at test points, and then to applyRSM to a set of regions containing potential local optima inorder to find the global optimum. Although the kriging pre-dictor is iteratively improved as additional sampling is con-ducted in regions where prediction ability is poor, the CPUcost associated with generating many predictions over thecourse of the modeling can become high, especially if a setof discretized test points obtained using high-resolution gridsare employed to obtain more accurate optima. Since responsesurfaces can be inexpensively generated, a better strategy for

obtaining potential optima is to generate a kriging surfaceover a lower-resolution grid and then use sequential responsesurfaces to refine the best set of candidate vectors. Since it ispossible that local optima lie along the boundaries of the fea-sible region, application of current response surface techni-ques, which are suited for the solution of box-constrainedproblems, would fail to improve upon the current values ofthe possible optima. The new method addresses these issues byapplying (1) adaptive experimental designs to retain feasibility,and (2) projection of the n-dimensional response surface ontothe feasibility space limited by the problem constraints.Although global optimality is not guaranteed, application ofthe proposed kriging-RSM algorithm can result in the discov-ery of both interior and boundary solutions for constrainedNLP described by (1) black box functions, (2) noisy variables,and (3) arbitrary convex feasible regions. If any z2 variablesare inputs to downstream process units, the feasible region isexplicitly defined after sampling at the respective upstreamunits. In this case, the feasible region will be affected by noise.This complication is not present for the examples presented inthe Examples section but will be addressed in a future work.At the present time, the kriging-RSM algorithm has only beenapplied towards systems having deterministic feasible regionsand only a single black-box unit. The next section providesdetails for obtaining the kriging predictor.

Kriging Methodology

The kriging predictor of the stochastic output ~z2ðxkÞ is alinear combination of N weighted function values at nearbysampling points as given below:

~z2ðxkÞ ¼XNCluster

i¼1

f ðxiÞkðxiÞ (2)

where ~z2ðxkÞ represents the predictor at test point xk. Themain problem in building the kriging predictor lies in deter-mining the appropriate set of weights assigned to each set ofNCluster points near xk subject to the constraint that theexpected error between the prediction and true value is zero,and also that the variance of the error distribution for all testpoints is minimized. The mechanism of obtaining the weightsusing kriging follows the same principles as the inverse dis-tance weighting methods in that the weights are generally adecreasing function of distance from the test point xk. Usingkriging, predictions are indirectly obtained by estimating thefunction difference between xi and xk as being similar to cor-responding differences for known sampling data. Functionvalues for sampling points xi, which are close to xk, will bemore influential in determining the prediction z2(xk). Theconverse is true as the distance h between xi and xk increases.Consider a sampling point xi and test point xk separated by adistance h1. The corresponding kriging weight will be higherif there are many sampling-point xi–xj pairs, also separatedby this same distance h1, whose corresponding f(xi)–f(xj) dif-ferences are approximately the same. The weight will behigher because there is a stronger correlation between whatthe expected function value difference should be for this dis-tance h. In other words, given n sampling-point pairs, eachseparated by a distance h1, if the standard deviation of thefunction value differences between these n sampling pairs is

AIChE Journal August 2007 Vol. 53, No. 8 Published on behalf of the AIChE DOI 10.1002/aic 2003

low, the difference between the function values for an nth 11 sampling-point pair, or a sampling-point/test-point pair,also separated by h1, is expected to be close to the averagedifference obtained for the first n pairs. If the number ofsampling pairs n is high, there will be an even higher confi-dence that the estimated function value difference f(xi) – f(xk)should be close to the average difference for the n samplingpairs. The number of n sampling pairs separated by a dis-tance h is usually higher as the value of h decreases, andconsequently there will be more information available forestimating the corresponding function value difference if asampling point is close to the test point. Now consider thecase where the sampling point xi and test point xk are sepa-rated by a distance h2. After obtaining the set of S samplingpairs also separated by h2, if there is a high spread observedin the set of corresponding f(xi) – f(xj) values, there will beless confidence in estimating the expected difference betweenthe function value at sampling point xi and test point xk. Thetrue difference might be closer to the lowest or highest dif-ference instead of the average difference in this case. As aresult, the value of the corresponding kriging weight will belower in this case to emphasize the fact that there is less con-fidence that the estimate of the expected function value dif-ference between xi and xk will be close to the true difference.This idea is quantitatively represented in the form of a fittedcovariance function expressing correlation strength betweenfunction differences for sampling-point pairs a given distanceapart. The first step in obtaining the covariance function is toobtain (N)(N 2 1)/2 squared function differences Fij for Nsampling points xi, as shown in Eq. 3:

Fij ¼ f ðxiÞ � f ðxjÞ� �2

i; j ¼ 1;:::;N; i 6¼ j (3)

The covariance function is fitted from a plot of Fij as afunction of xi – xj distance, also known as lag distance h.Because of the plot complexity as shown in Figure 1a, aresulting fit to one of the established covariance models inthe literature is not usually immediately apparent.

To improve the fit, data averaging is performed by cluster-ing Fij within P subintervals each having length hm, whereNm xi – xj pairs fall within the interval hm – hTol � hm � hm1 hTol. The new scatterpoints known as semivariances arethen fit to an established semivariance model from the litera-ture as shown in Figure 1b, and may be a linear combinationof five major types: spherical, Gaussian, exponential, power,or linear. The corresponding covariance function is thenobtained by reflecting the semivariance function between thesill and the x-axis as shown in Figure 1c. The expression for

obtaining the semivariances is given below:

cðhmÞ ¼ 1

2NðhmÞXNðhmÞi¼1

f ðxiÞ � f ðxjÞ� �

m ¼ 1;:::;P (4)

After obtaining the covariance function, the krigingweights are obtained by substituting in the respective distan-ces. Unlike some inverse distance techniques, the weightsalso will account for clustering effects by reducing theweight values for sampling points that are the same distancefrom xk but close to one another. The weights at test point xkare then obtained according to Eq. 5, where k(xk) is an NCluster

3 1 vector of weights summing to unity. The problem of find-ing the weights subject to the constraint that they sum to unityis recast in its Lagrangian equivalent, and solution of theresulting set of Lagrangian equations then simultaneouslyyields both a set of weights and the Lagrange parameter l(xk).

kðxkÞlðxkÞ� �

¼ Covðdðxi; xjÞÞ 1

1 0

� ��1Covðdðxi; xkÞÞ

1

� �i; j ¼ 1;:::;NCluster; i 6¼ j (5)

Once the weights are known for the set of NCluster nearbysampling points xi, the kriging prediction ~z2ðxkÞ and itsexpected variance ~r2ðxkÞ are then obtained as given by Eqs.5 and 6, respectively:

~r2ðxkÞ ¼ r2max �XNCluster

i¼1

kðxiÞCov dðxi; xkÞ½ � (6)

After obtaining kriging predictions for all test points, botha prediction and variance mapping can be built. Using thevariance mapping, confidence regions around each predictionare then obtained. If there is a tight confidence intervalaround the kriging prediction at xk, it means that there is agreater belief that subsequent sampling values will fall withinthe corresponding interval. The converse is true for the oppo-site case, since there is less confidence that the kriging pre-diction accurately represents the mean sampling value fortest points xk located far away from previously sampledpoints. After obtaining additional sampling data based onpredictions satisfying criteria such as maximum variance,minimum value, or high difference in the estimate for con-secutive iterations, the kriging model is updated. Over thecourse of the modeling, the average value of the kriging pre-dictions made over the set of test points can be obtained aslm,Pred for iteration m, which converges to a final value. The

Figure 1. Data smoothing applied to squared function differences (a) to obtain a semivariogram fit (b) and covari-ance function fit (c).


average of the prediction variance also drops to a fraction ofits corresponding initial value. Once convergence isachieved, a set of regions containing potential local optimacan be identified. After identifying all promising regions, itis more efficient to refine the set of best predicted valuesusing local response surface models instead of kriging pre-dictors over finer local grids. The source of the CPU expenserequired in building and minimizing response surfaces lies inthe solution of a single linear system to obtain the fittedresponse surface coefficients, whereas the construction ofrefined predictors over an NFine 3NFine local grid requires so-lution of N2

Fine linear equations as given in Eq. 5, since aunique set of weights is associated with each new local testpoint.

The kriging algorithm proceeds as follows for obtaining aprediction at xk. First, the feasible region is described basedon g(x, z2) and h(x, z1), where g(x, z2) is assumed to be ex-plicitly defined after having obtained sampling data z2 fromupstream units. A sampling set X is generated whose initialsize ranges between ten and twenty dispersed data locationsfor problem dimensions of up to eleven variables. As long asthe starting size of X is not too small, the number of itera-tions required to achieve convergence in lm,Pred will be rela-tively insensitive to this number. The initial number of sam-pling points in X is small, placing emphasis on further sam-pling as needed during the iterative stages of predictorrefinement. The location xk is specified and NCluster nearest-neighbor sampling points are chosen from X based on theL2-norm relative to xk. NCluster usually varies between fiveand ten regardless of problem dimension although the esti-mate is potentially skewed if sampling information is sparse.Semivariances are then generated using all sampling datawithin X. The best semivariogram model is obtained usingleast squares, and the complementary covariance function isgenerated. The symmetrical matrix hCluster is built describingsampling-pair distances among the set of NCluster points. Sim-ilarly, the vector h0 is constructed from sampling-point dis-tances relative to xk. The covariance matrices C and D arebuilt using information from hCluster and h0, respectively. Thekriging weights k are then obtained from solving the linearsystem of equations as presented in (5) and the prediction

~z2ðxkÞ and its variance ~r2ðxkÞ are determined. A flowchart ofthe kriging algorithm is presented in Figure 2.

Refinement of Local Optima Using RSM

The main idea of this step is to utilize RSM to approxi-mate local input–output relationships using polynomial func-tions fitted to a set of stochastic data, which can then be opti-mised via conventional techniques. Noninterpolating polyno-mials are used to avoid fitting the noise, which also causethe search direction to become less sensitive to the processuncertainty at each sample point. A standard RSM algorithmcalls for the construction of local models built from a set ofcollocation points centered on an iterate using an experimen-tal design. Figure 3 shows examples of two-dimensional fac-torial and central composite designs (CCD). In a factorialdesign, a set or subset of the input variables is selected to bethe factors, and particular levels are chosen for each factor.Experiments are then conducted at all possible factor-levelcombinations. The total number of sampling points is ab,where a represents the number of factor levels (usually twoor three) and b denotes the number of factors. For a systeminvolving three variables, each with two levels, a total of 23

sampling points are needed. Although the same number ofsampling points are used for the factorial and central com-posite design in the 2D case, the number of sampling pointsrequired is lower for the CCD in contrast to the factorialdesign when the system dimensionality increases. The totalnumber of points required for an n-dimensional system usinga CCD is 1 1 2n 1 2n. A factorial design utilizing three var-iables, each with three levels, results in 27 design points,whereas only 15 are required for the central compositedesign (8 corner points, 6 axial points, 1 center point). If thenumber of variables increases to six, the factorial designrequires 729 sampling points, whereas the CCD requires only77. Even though replication at the center point is usuallybuilt into the design, the total number of sample points isstill lower than that for the factorial design. For a more com-plete discussion, the reader is referred to Box et al. (200527),and Myers and Montgomery (20028).

At the start of the algorithm, a nominal point serves as thestarting iterate; after a response surface has been minimized,the optimum of the local model serves as the next iterate.The procedure of building and minimizing local surfacescontinues until a stopping criterion is reached, such as suffi-cient decrease in the objective or when the Euclidean normbetween current and previous iterates falls below a thresholdTolRSM. The parameter TolRSM is set at 3rx, since the error ismodeled as a normally distributed function. If TolRSM

Figure 2. Flowchart of kriging algorithm for obtaining aprediction at test point xk.

Figure 3. Sample experimental designs in 2-D: (a) Fac-torial design, (b) Central composite design.


exceeds this value, premature termination may occur. Con-versely, a value lower than 3r(x) may cause the algorithm tofail as artificial importance will be placed on the noise. Aflowchart of the algorithm is presented in Figure 4.

The construction of a response surface at each iterationcan lead to high sampling expense; furthermore, since localmodels are used, the discovered optimum may be suboptimal,since convexity of the black-box function cannot be known apriori. If the convex feasible region is of arbitrary shape anditerates fall near nonlinear or multiple boundaries, the gener-ation of collocation points using symmetrical design stencilscan lead to infeasible points being found. Since more reliablesearch directions are usually obtained from response surfacesconstructed over symmetrical design stencils, to retain feasi-bility, the experimental design for collocation point genera-tion is adaptively changed depending upon iterate proximityto the boundary. If iterates are near nonlinear boundaries orthe intersection of linear boundaries, the central compositedesign is used; conversely, if the iterate is near a single lin-ear boundary, a factorial design is applied, as shown in Fig-ure 5a. The particular designs used provide a more advanta-geous search of the respective subregions defined by the sur-rounding constraints over their counterparts. In order toensure feasibility, the maximum possible design radius is setas the minimum of the distance between the iterate and allconstraints, shown in Figure 5b. However, as iteratesapproach a boundary, this radius can become small, leadingto the construction of response surfaces over a small regionand small steps taken towards the optimum. When the radiusis reduced below a prespecified criterion, such as 1% of thesmallest box-constrained operating range for N variables, alower-dimensional response surface in terms of (N 2 1) vari-ables is projected onto the respective constraint to ensurethat large steps are still taken toward the optimum, as dis-played in Figure 5c.

To further reduce the sampling expense for higher-dimen-sional problems, line search techniques can also be incorpo-rated into the algorithm whereby sampling is conducted at asequence of points obtained along the path defined by thecurrent and previous iterates. Although application of re-sponse surface techniques can reduce the sampling expense,the optimum found may be sensitive to the nominal pointand the solution found may only be a local minimum. Whenkriging is first used to identify the best regions for localsearch, this problem no longer exists, since a global model isbuilt.

The kriging-RSM algorithm is now described. First, a setof nominal sampling information is obtained and a set of fea-

sible test points is generated, over which to build the krigingglobal model. As mentioned before, it is important that thenumber of test points not be too high as model building costswill increase. Discretization can be used to generate the setof test points for 2- or 3-dimensional problems, but the num-ber may become too high for problems of higher dimension.For the presented examples in which the problem dimension-ality was greater than three, it was found that 1000 randomlygenerated points was sufficient for building the kriging pre-dictor without substantial increase in the model buildingcosts. Next, a test point xk is selected and both its krigingprediction and variance are obtained according to the krigingalgorithm described in Figure 2. After the kriging predictionshave been obtained at all test points, the average value of thekriging predictor is then obtained and compared to the corre-sponding value obtained in the previous iteration. If the dif-ference between these values exceeds a stopping toleranceTolKrig, additional sampling is performed. The parameterTolKrig is set at 0.01 and is a generally noise-independent pa-rameter as the number of test points used to build the krigingmapping exceeds the number of points at which samplingdata are obtained. It should be emphasized that the globaloptimum can be missed when using kriging or any otherglobal modeling technique due to the fact that black-boxfunctions prevent knowledge of convexity. To minimize thepossibility of missing the neighborhood of the global opti-mum, the sampling set at each kriging iteration is comprisedof three subsets each having an equal number of data points.Each subset has a family of points conforming to one of thefollowing characteristics: (1) high variance, (2) high differ-ence between prediction estimates for consecutive iterations,and (3) minimum prediction values. In addition, the samplinglocations within each family subset are separated by a mini-mum L2-normed distance to maximize the amount of globalinformation obtained. After performing additional sampling,the kriging predictor is then refined by generating new pre-dictions again at all test points. Once the difference in theaverage kriging predictor falls below TolKrig for consecutiveiterations, refinement of the current best candidate vectorsyielding the lowest kriging predictions in N local regions isperformed according to the RSM algorithm described in Fig-ure 4. For the presented examples, whenever an iterateapproaches the feasible region boundary, the computationalburden associated with identifying the corresponding con-

Figure 4. Flowchart of RSM algorithm.

Figure 5. Using adaptive experimental designs to retainfeasibility and still obtain good search direc-tions toward an optimum.

(a) Application of CCD or factorial design depending iterateproximity to a linear or nonlinear boundary; (b) Ensuring fea-sibility by setting the maximum radius as the minimum Eu-clidean distance between the iterate and boundaries; (c) Pro-jection of the n-dimensional response surface onto constraintsto avoid taking small steps toward the optimum.


straints is low, since the problem sizes are small. However,modifications to the proposed methodology may be necessaryfor high-dimensional problems containing many constraintsand will be addressed in a future work. The kriging-RSMalgorithm terminates after selecting the global optimum asthe best candidate value from the set of local optimaobtained using RSM. A flowchart of the proposed methodol-ogy is presented in Figure 6.

Examples

In this section, the proposed kriging-RSM algorithm isapplied to five numerical examples. The first four examplesare presented in order of increasing complexity in terms ofproblem dimensionality and presence of linear/nonlinear con-straints. The last example is a kinetics case study intended toprovide an application of the new method to a more realisticexample. For each example, a table of computational resultsis provided that illustrates the performance of four optimiza-tion algorithms. The kriging-RSM algorithm refers to thenew methodology of applying steepest descent to responsesurfaces after building a global model. The remaining algo-rithms are stand-alone response surface methods. The secondalgorithm, S-SD RSM, applies direct search in the earlystages followed by application of steepest descent to responsesurfaces once the neighborhood of a local optimum has beenfound. The third algorithm applies steepest descent to

response surfaces at every iteration. The fourth algorithmrefers to the technique of building a response surface andminimizing it in the context of a QP to obtain global steps tothe optimum. For each algorithm, 100 trials are performedfrom a set of randomly selected feasible starting iterates. Thepercentage of iterates converging to the global optimum ispresented in the second column of the respective table. Usinginformation taken from the subset of starting iterates success-fully finding the global optimum, the average number of iter-ations required, function calls needed, and CPU time requiredare also reported. All computational results are obtainedusing an HP dv8000 CTO Notebook PC containing a 1.8GHz AMD Turion 64 processor.

Example 1: Six-Hump Camel Back Function. The six-hump camel back function is a well-known global optimiza-tion test function that is box-constrained. Introducing bothnoise and black-box complications into this example, the out-put z2 is simulated according to a normally distributed pertur-bation of the deterministic function. The problem is formu-lated as shown below and the deterministic problem is pre-sented in Figure 7:

min z2

s:t: z2 ¼ 4 � 2:1x21 þ x413

� x21 þ x1x2

� � 4 þ 4x22� �

x22 þ Nð0; 0:05Þ� 2 � x1 � 2

� 1 � x2 � 1 (7)

This problem contains multiple local optima in addition totwo global optima and is solved by applying four optimiza-tion algorithms. Table 1 presents the results obtained for thisproblem.

The global minimum is obtained in the fewest number offunction evaluations when using the S-SD RSM program, butthis is to be expected because the set of all optima are fairly

Figure 6. Flowchart of kriging-RSM algorithm.

Figure 7. Six-hump camel function.

[Color figure can be viewed in the online issue, which isavailable at www.interscience.wiley.com.]


evenly spread and have wide basins, meaning that the quad-ratic curvature is not apparent until the iterates are very closeto the optimum. The Standard RSM algorithm performsslightly better than the RSM-SQP algorithm because the twosymmetrical global optima (20.0898, 0.7126) and (0.0898,20.7126) are located near the center of the feasible region(0,0). When applying the RSM-SQP solver, the optimizationof many nominal quadratic approximations over the globalregion leads to identification of local solutions residing nearthe feasible region boundaries as can be seen in Figure 7.The kriging-RSM program requires ;50% additional func-tion evaluations compared to the Standard RSM and RSM-SQP programs, and almost twice as many as the S-SD RSMsolver. However, the additional cost is balanced by the factthat this algorithm leads to global convergence in 81% of thecases. The remaining 19% of the starting iterates successfullyfound the basin containing either global optimum but termi-nated at values outside a 2% radius of the optimal solution.The average CPU time required by the Kriging-RSM algo-rithm is an order of magnitude higher than that of the stand-alone RSM solvers, but is still relatively low.

Example 2: Schwefel Function. In this example, the fouroptimization algorithms are applied to the 2D Schwefel testfunction, a problem also taken from the global optimizationliterature. This box-constrained problem is modified toinclude a linear and a nonlinear constraint. The black-boxfunction z2 depends on both x1 and x2 and is noisy, modeledby perturbing its deterministic value by a normally distrib-uted error whose standard deviation is 1% of the determinis-tic function. The problem is formulated as shown below:

min z2

s:t: z2 ¼X2i¼1

�xisinffiffiffiffiffiffixij j

pþ Nð0; 7Þ

� 2x1 � x2 � 800

0:004x21 � x1 þ x2 � 500

� 500 � x1;x2 � 500 (8)

The deterministic equivalent of this problem is shown inFigure 8. This function contains a number of local optimaand one global optimum at (420.97, 2302.525) with anobjective value of 2719.53. This problem was selected inorder to examine the performance of the stand-alone RSMalgorithms when it is the underlying geometry instead of thenoise that poses the main complication affecting successful

convergence to the global optimum. In Table 2, results arepresented for this example.

For the stand-alone response surface methods, it isobserved that the starting iterate easily finds the global opti-mum if it is within its neighborhood. However, the globaloptimum is located in a corner of the feasible region and issurrounded by a set of local optima. When applying theStandard RSM and RSM-SQP algorithms, the optimizationof the initial response surface can lead to premature termina-tion at the local minima. In contrast, when applying the S-SD RSM algorithm, since response surfaces are not builtuntil the end stages of the algorithm, the global optimum canbe attained using a lower number of function calls while alsoavoiding the local optima. The average number of functioncalls observed for the Standard RSM and RSM-SQP solversis approximately the same because the interior candidate val-ues are generally superior to the objective values obtained bysampling at the extremes of the feasible region. Due to thenumber of local solutions found within the interior of thefeasible region, there is a low probability of convergence tothe global optimum using the local algorithms. In contrast,application of the Kriging-RSM solver requires almost twiceas many function calls as the local methods to find a solu-tion, but global convergence is observed in all cases. Thestrategy of building a global model before conducting localsearch is particularly successful for this problem since con-vergence to a suboptimal solution using the local methods isavoided. The low CPU time required for solution of this

Table 1. Comparison of the Performance of the Kriging-RSM Algorithm Against Stand-Alone RSM Methods in Findingthe Global Optimum for the Six-Hump Camel Back Function

Opt. Algorithm

% Starting IteratesFinding Global

Optimum

No. of Iterations No. of Function Calls CPU Time (s)

Kriging RSM Kriging RSM Total Kriging RSM Total

Kriging-RSM 81 9 3 47 17 64 5.98 0.09 6.07S-SD RSM 35 N/A 10 N/A 34 34 N/A 0.23 0.23Standard RSM 40 N/A 9 N/A 47 47 N/A 0.28 0.28RSM-SQP 30 N/A 7 N/A 41 41 N/A 0.18 0.18

Figure 8. Schwefel function corresponding to the opti-mization problem presented in Eq. 8.



problem using the Kriging-RSM algorithm increases theattractiveness of using this method as an alternative to thestand-alone response surface methods.

Example 3. This example involves five variables withfour linear constraints and is an example modified from Flou-das (1995). The problem is originally presented as anMINLP, and for this example a relaxed NLP is solved byrelaxing the integrality constraints. The black-box variable z2is a function of two continuous variables and is noisy accord-ing to a normally distributed error with standard deviation0.01. The problem is formulated as shown in (9) and a plotof z2 as a deterministic function of the two continuous varia-bles is presented in Figure 9:

min F ¼ y1 þ y2 þ y3 þ 5z2

s:t: z2 ¼ ð8x41 � 8x21 þ 1Þð2x22 � 1Þ þ Nð0; 0:01Þ3x1 � y1 � y2 � 0

5x2 � 2y1 � y2 � 0

� x1 þ 0:1y2 þ 0:25y3 � 0

x1 � y1 � y2 � y3 � 0

0:2 � x1; x2 � 1

0 � yi � 1 i ¼ 1; :::; 3 (9)

A family of global optimal solutions exists for this prob-lem in terms of the binary variables; however, the global op-timum of 20.98688 is not achieved unless the optimal vectorfor the continuous variables is achieved at (0.2, 0.2). Theresults for this example are presented in Table 3.

The increased CPU time reported for this example usingthe Kriging-RSM algorithm is higher than that observed forthe earlier presented 2D examples because the kriging pre-dictor is generated over a 5D grid. Even though it appearsfrom the plot that the nominal iterates in the continuousspace should converge to (0.2,0.2) easily, movement is re-stricted by the feasible region defined by the relaxed binaryvariables, thereby causing many of the nominal iterates toconverge to a suboptimal solution when applying the stand-alone response surface methods. Even though the number offunction evaluations required is higher when using the Krig-ing-RSM algorithm, global convergence is observed in nearlyall cases with only a modest increase in the overall modelbuilding costs.

Example 4. This example is also taken from Floudas(1995) and involves 11 variables with 11 linear constraintsand three nonlinear constraints. Although the problem is alsoformulated as an MINLP, for this example it is solved as arelaxed NLP. The black-box variable z2 is a function of sixcontinuous variables and is noisy according to a normallydistributed error with standard deviation 0.01. The problem

is formulated as shown in (10):

min z2 þ 5y1þ8y2þ6y3þ10y4þ6y5þ140

s:t: z2¼ �10x3� 15x5�15x9þ15x11þ5x13�20x16þ expðx3Þþexpðx5=1:2Þ�60 lnðx11þx13þ1ÞþNð0;0:01Þ� lnðx11þx13þ1Þ� 0

expðx3Þ�10y1� 0

expðx5=1:2Þ�10y2� 0

�x3�x5�2x9þx11þ2x16� 0

�x3�x5�0:75x9þx11þ2x16� 0

x9�x16� 0

2x9�x11�2x16� 0

�0:5x11þx13� 0

0:2x11�x13� 0

1:25x9�10y3� 0

x11þx13�10y4� 0

�2x9þ2x16�10y5� 0

y1þy2¼ 1

y4þy5� 1

a� xi� b i¼ 3;5;9;11;13;16

0� yi� 1 i¼ 1;:::;5

aT ¼ 0;0;0;0;0;0ð Þ bT ¼ 2;2;2;2;2;3ð Þ (10)

Table 2. Comparison of the Performance of the Kriging-RSM Algorithm Against Stand-Alone RSM Methods in Findingthe Global Optimum for the Schwefel Problem

Opt. Algorithm


Optimum




Figure 9. Plot of objective as a function of the continu-ous variables for the optimization problemdescribed by Eq. 9.



The solution of the deterministic problem is(1.903,2,2,1.403,0.701,2,0.571,0.429,0.25,0.21,0) with anobjective value of 20.554. Because of the higher problemdimensionality, the kriging predictor is generated from a setof 1000 feasible points unevenly dispersed throughout thefeasible region rather than from an 11D grid. The results forthis example are presented in Table 4.

For this problem, the number of function calls required toobtain the global optimum is higher by two orders of magni-tude compared to the corresponding values obtained from theearlier examples. The significantly higher number of functioncalls needed when using the response surface methods is dueto the increased number of collocation points required tobuild response surfaces in higher dimensions even though theCCD is employed. The average number of function callsrequired for convergence in the kriging predictor is ;14% ofthe number needed during the refinement stage of the optimi-zation. This result suggests that the kriging predictor suffi-ciently captures only the coarse geometry. In contrast to theprevious examples, when using the local solvers, a highernumber of response surfaces must be obtained before the op-timum is found because the problem dimensionality ishigher. This leads to increased local model building costswhich are on the same order of magnitude as the modelbuilding costs for the kriging predictor. However, therequired number of function evaluations required using thekriging-RSM algorithm is approximately half of the numberrequired using the stand-alone response surface methods,again emphasizing the success observed when using informa-tion obtained from building the kriging global model to guidethe local optimization using RSM.

Example 5: Kinetics Example. In this section, a realisticexample is considered that involves a kinetics case studytaken from Bindal et al. (200628). There are five reactionspecies, two of which are system inputs. The reactions areassumed to occur in an ideal CSTR and the reaction networkconsidered is a modification of the Fuguitt and Hawkinsmechanism (Floudas and Pardalos, 199529) given by: A ?E, A ? D, B ? D, C ? 2D, and 2A ? C, where the rate

constants are kf1 5 3.33384 s21, kf2 5 0.26687 s21, kf3 50.29425 s21, kr3 5 0.14940 s21, kf4 5 0.011932 s21, kr4 51.8957 3 1023 m3/(mol s), kf5 5 9.598 3 1026 m3/(mol s),V 5 0.1 m3, F 5 0.008 m3/s. Only A and C enter the reactorwith concentrations CA

0, CC0 in the ranges of 3 3 103 � C0

A

[mol/m3] � 3 3 104 and 0 � C0C [mol/m3] � 1 3 104. The

dynamic behavior of the system is described by problem (11).

min G ¼ 4 x� 0:6ð Þ2þ4 y� 0:4ð Þ2þsin3 pxð Þ þ 0:4

x ¼ 0:1428CSSC � 0:357CSS

D

y ¼ �0:1428CSSC þ 2:857CSS

D þ 1:0

dCA

dt¼ F

VC0A � CA

� �� kf1CA � kf2CA � kf5C2A

dCB

dt¼ F

V�CBð Þ � kf3CB þ kr3CD

dCC

dt¼ F

VC0C � CC

� �� kf4CC þ 0:5kr4C2D þ 0:5kf5C

2A

dCD

dt¼ F

V�CDð Þ þ kf2CA þ kf3CB � kr3CD þ 2kr4CC � kr4C

2D

dCE

dt¼ F

V�CEð Þ þ kf1CA (11)

where CSSC and CSS

D represent the macroscale steady-state val-ues of CC and CD, respectively. A contour plot of the objec-tive as a function of the input variables C0

A and C0C is shown

in Figure 10.The deterministic solution yields a global optimum of G

5 0.7422 at [C0A, C

0C] 5 [10.117, 8.378] and a local optimum

of G 5 1.2364 at [13.202, 3.163]. To introduce black-boxcomplications, the rate equations are assumed to beunknown, so a microscopic model is used instead, repre-sented using a lattice containing N molecules. The micro-scopic model is generated by first translating concentrationsto molecular variables, evolving the microscopic systemusing the Gillespie algorithm, and then mapping back thefinal measured variables to concentrations. The Gillespiealgorithm evolves reaction networks by assigning an eventprobability to each reaction and choosing one to occur by

Table 3. Comparison of the Performance of the Kriging-RSM Algorithm Against Stand-Alone RSM Methods in Findingthe Global Optimum for the Problem Presented in Eq. 9

Opt. Algorithm% Starting Iterates

Finding Global Optimum




Table 4. Comparison of the Performance of the Kriging-RSM Algorithm Against Stand-Alone RSM Methods in Findingthe Global Optimum for the Problem Presented in Eq. 10

Opt. Algorithm% Starting Iterates

Finding Global Optimum





generating a random number. After updating the molecularvariables, another random number is obtained and the corre-sponding reaction is selected. This procedure continuesoccurring until little change is observed in the molecular var-iables. The noise in the output concentrations thereby arisesas a function of how coarse the microscopic model is.Steady-state solution vectors are obtained from an initialpoint by running the microscale simulations for a long timehorizon, after which the objective function can be evaluated.The variance of the objective is then evaluated over k repli-cate simulations at a nominal point x as:

r2 ¼ VarðeÞ ¼ Var Gðx;NÞji ¼ 1; :::; kf g (12)

For this example, the noise applied to the objective is inthe form of a normally distributed error with a standard devi-ation of r 5 0.011, corresponding to a microscale systemsize of N 5 100,000. Table 5 provides results for the fouroptimization algorithms applied to this example. The CPUtime excludes the time required for each function call in theform of a microscopic model simulation.

It is observed that even though approximately double thenumber of function calls are required for convergence to an op-timum using the kriging-RSM algorithm, the global minimumis achieved in all cases. Since both the local and global opti-mum are located in wide shallow basins near the center of thefeasible region, starting iterates are found to converge to either

optimum quickly when using the stand-alone response surfacemethods as seen by the low model building costs.

For all the presented examples, it is seen that the modelingand optimization costs are relatively low when applying theKriging-RSM algorithm due to the solution of ItKrigNTest 1ItRSM systems of linear equations, where ItKrig and ItRSM rep-resent the number of iterations required for the respectivestages of the methodology. In contrast, only ItRSM systems oflinear equations must be solved when applying any of the S-SD, Standard RSM, or RSM-SQP optimization methods. Thevalue of ItKrigNTest is higher than ItRSM, so the major sourceof the required CPU time is attributed to kriging modelingcosts. For problems of significantly higher dimensions, suchas atomistic modeling applications, the overall problem costswill be dominated by the time required for either real-timelab experiments or molecular simulation.

Conclusions and Future Work

In this work, a new kriging-RSM algorithm is presentedfor the solution of constrained NLPs containing black-boxfunctions and noisy variables. Using the kriging methodol-ogy, a global model is built whereby predictions and varian-ces at discretized test points are obtained using a fitted covar-iance function built from scattered sampling data. After addi-tional sampling is performed, the covariance function isupdated, leading to better predictions and an improved globalmodel. Once convergence in the average value is observed,regions containing possible local optima are identified andRSM is applied to refine the current set of candidate vectors.The global optimum is selected as the best point of the set oflocal optima. The likelihood of finding the global optimumincreases when applying kriging-RSM because the globalmodel obtained using kriging allows the set of regions con-taining potential local optima to be identified. The proposedmethodology extends the capabilities of current kriging andRSM methods to problems containing ten variables whosefeasible region is described by arbitrary linear and nonlinearconstraints instead of simply being box-constrained. It isfound that the proposed approach leads to a substantialincrease in the probability of finding the global optimumcompared to stand-alone RSM at minimal increased samplingcost due to model construction of the kriging predictor, eventhough no theoretical guarantees of global optimality aremade. The cost of building kriging models can become ex-pensive for problems containing many variables or when twoor more black-box functions are present and multiple globalmodels must be built at each iteration. Current work focuseson comparing algorithmic performance when using adaptiveversus fixed grid resolutions and also on building kriging

Figure 10. Kinetics Problem corresponding to the opti-mization problem presented in Eq. 11.


Table 5. Comparison of the Performance of the Kriging-RSM Algorithm Against Stand-Alone RSM Methods in Findingthe Global Optimum for the Kinetics Example Presented in Eq. 11

Opt. Algorithm


Optimum





predictors on a need-to-generate basis for the solution of pro-cess synthesis problems.

Acknowledgments

The authors gratefully acknowledge financial support from theNational Science Foundation under the NSF CTS-0224745 and CTS0625515 grants and also the USEPA-funded Environmental Bioinfor-matics and Computational Toxicology Center under the GAD R 832721-010 grant.

Literature Cited

1. Barton RR, Ivey JS. Nelder-Mead Simplex modifications for simula-tion optimization. Manage Sci. 1996;42:954–973.

2. Jones D, Pertunnen CD, Stuckman BE. Lipschitzian optimizationwithout the Lipschitz constant. J Optim Theory Appl. 1993;79:151–181.

3. Huyer W, Neumaier A. Global optimization by multilevel coordinatesearch. J Global Optim. 1999;14:331–355.

4. Storn R, Price K. Differential Evolution: A simple and efficientadaptive scheme for global optimization over continuous spaces. JGlobal Optim. 1997;11:341–359.

5. Brekelmans R, Driessen L, Hamers H, den Hertog D. Gradient esti-mation schemes for noisy functions. J Optim Theory Appl. 2005;126:529–551.

6. Choi TD, Kelley CT. Superlinear Convergence and implicit filtering.SIAM J Optim. 2000;10:1149–1162.

7. Gilmore P, Kelley CT. An implicit filtering algorithm for optimi-zation of functions with many local minima. SIAM J Optim.1995;5:269–285.

8. Myers R, Montgomery D. Response Surface Methodology. NewYork, NY: Wiley, 2002.

9. Box G, Wilson K. On the experimental attainment of optimum con-ditions. J R Stat Soc B. 1951;13:1–45.

10. Hoerl A. Optimum solution of many variables equations. Chem EngProg. 1959;11:69–78.

11. Davis E, Ierapetritou M. Adaptive optimisation of noisy black-boxfunctions inherent in microscopic models. Comput Chem Eng.2007;31:466–476.

12. Stander N, Craig KJ. On the robustness of a simple domain reduc-tion scheme for simulation-based optimization. Eng. Comp.: Int. J.Comp. Aid. Eng. 2002;19:431–450.

13. Roux W, Stander N, Haftka R. Response surface approximations forstructural optimization. Int J Numer Methods Eng. 1998;42:517–534.

14. Chen M, Chen K, Lin C. Optimization on response surface modelsfor the optimal manufacturing conditions of dairy tofu. J Food Eng.2005;68:471–480.

15. Tang W, Zhang L, Linninger A, Tranter R, Brezinsky K. Solving ki-netic inversion problems via a physically bounded Gauss-Newton(PGN)method. Ind Eng Chem Res. 2005;44:3626–3637.

16. Zhang L, Xue C, Malcolm A, Kulkarni K, Linninger A. Distributedsystem design under uncertainty. Ind Eng Chem Res. 2006;45:8352–8360.

17. Jones D. A taxonomy of global optimization methods based onresponse surfaces. J Global Optim 2001;21:345–383.

18. Jones D, Schonlau M, Welch W. Efficient global optimization of ex-pensive black-box functions. J Global Optim 1998;13:455–492.

19. Goovaaerts P. Geostatistics for Natural Resources Evolution. NewYork, NY: Oxford University Press, 1997.

20. Isaaks E, Srivistava R. Applied Geostatistics. New York, NY:Oxford University Press, 1989.

21. Krige D. A Statistical Approach to Some Mine Valuations and AlliedProblems at the Witwatersrand, Master’s Thesis. Johannesburg: Uni-versity of Witwatersrand, 1951.

22. Matheron G. Principles of geostatistics. Econ Geol. 1963;58:1246–1266.

23. Cressie N. Statistics for Spatial Data. New York, NY: Wiley,1993.

24. Sacks J, Schiller SB, Welch WJ. Designs for computer experiments.Technometrics. 1989;31:41–47.

25. Santner T, Williams B, Notz W. The Design and Analysis of Com-puter Experiments. New York, NY: Springer, 2003.

26. Maranas C, Floudas C. Finding all solutions of nonlinearly con-strained systems of equations. J Global Optim. 1995;7:143–182.

27. Box G, Hunter S, Hunter WG. Statistics for Experimenters: Design,Innovation, and Discovery, 2nd ed. New York, NY: Wiley-Inter-science, 2005.

28. Bindal A, Ierapetritou MG, Balakrishnan S, Makeev A, KevrekidisI, Armaou A. Equation-free, coarse-grained computational optimiza-tion using timesteppers. Chem Eng Sci. 2006;61:779–793.

29. Floudas CA, Pardalos PM. A Collection of Test Problems for Con-strained Global Optimization Algorithms. New York, NY: Springer,1991.

Manuscript received Jan. 23, 2007, and revision received May 7, 2007.


Documents

A kriging method for the solution of nonlinear programs with black-box functions