Stochastic Approximation and Its Applications
Nonconvex Optimization and Its Applications
Volume 64
Managing Editor:
Panos Pardalos
Advisory Board:
J. R. Birge, Northwestern University, U.S.A.
Ding-Zhu Du, University of Minnesota, U.S.A.
C. A. Floudas, Princeton University, U.S.A.
J. Mockus, Lithuanian Academy of Sciences, Lithuania
H. D. Sherali, Virginia Polytechnic Institute and State University, U.S.A.
G. Stavroulakis, Technical University Braunschweig, Germany
The titles published in this series are listed at the end of this volume.
Stochastic Approximation and Its Applications
by
Han-Fu Chen
Institute of Systems Science, Academy of Mathematics and System Science, Chinese Academy of Sciences, Beijing, P.R. China
KLUWER ACADEMIC PUBLISHERS
New York, Boston, Dordrecht, London, Moscow
eBook ISBN: 0-306-48166-9
Print ISBN: 1-4020-0806-6
©2003 Kluwer Academic Publishers
New York, Boston, Dordrecht, London, Moscow
Print ©2002 Kluwer Academic Publishers
All rights reserved
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher.
Created in the United States of America
Visit Kluwer Online at: http://kluweronline.com
and Kluwer's eBookstore at: http://ebooks.kluweronline.com
Contents
Preface
Acknowledgments

1. ROBBINS-MONRO ALGORITHM
   1.1 Finding Zeros of a Function
   1.2 Probabilistic Method
   1.3 ODE Method
   1.4 Truncated RM Algorithm and TS Method
   1.5 Weak Convergence Method
   1.6 Notes and References

2. STOCHASTIC APPROXIMATION ALGORITHMS WITH EXPANDING TRUNCATIONS
   2.1 Motivation
   2.2 General Convergence Theorems by TS Method
   2.3 Convergence Under State-Independent Conditions
   2.4 Necessity of Noise Condition
   2.5 Non-Additive Noise
   2.6 Connection Between Trajectory Convergence and Property of Limit Points
   2.7 Robustness of Stochastic Approximation Algorithms
   2.8 Dynamic Stochastic Approximation
   2.9 Notes and References

3. ASYMPTOTIC PROPERTIES OF STOCHASTIC APPROXIMATION ALGORITHMS
   3.1 Convergence Rate: Nondegenerate Case
   3.2 Convergence Rate: Degenerate Case
   3.3 Asymptotic Normality
   3.4 Asymptotic Efficiency
   3.5 Notes and References

4. OPTIMIZATION BY STOCHASTIC APPROXIMATION
   4.1 Kiefer-Wolfowitz Algorithm with Randomized Differences
   4.2 Asymptotic Properties of KW Algorithm
   4.3 Global Optimization
   4.4 Asymptotic Behavior of Global Optimization Algorithm
   4.5 Application to Model Reduction
   4.6 Notes and References

5. APPLICATION TO SIGNAL PROCESSING
   5.1 Recursive Blind Identification
   5.2 Principal Component Analysis
   5.3 Recursive Blind Identification by PCA
   5.4 Constrained Adaptive Filtering
   5.5 Adaptive Filtering by Sign Algorithms
   5.6 Asynchronous Stochastic Approximation
   5.7 Notes and References

6. APPLICATION TO SYSTEMS AND CONTROL
   6.1 Application to Identification and Adaptive Control
   6.2 Application to Adaptive Stabilization
   6.3 Application to Pole Assignment for Systems with Unknown Coefficients
   6.4 Application to Adaptive Regulation
   6.5 Notes and References

Appendices
   A.1 Probability Space
   A.2 Random Variable and Distribution Function
   A.3 Expectation
   A.4 Convergence Theorems and Inequalities
   A.5 Conditional Expectation
   A.6 Independence
   A.7 Ergodicity
   B.1 Convergence Theorems for Martingale
   B.2 Convergence Theorems for MDS I
   B.3 Borel-Cantelli-Lévy Lemma
   B.4 Convergence Criteria for Adapted Sequences
   B.5 Convergence Theorems for MDS II
   B.6 Weighted Sum of MDS

References

Index
Preface
Estimating unknown parameters based on observation data containing information about the parameters is ubiquitous in diverse areas of both theory and application. For example, in system identification the unknown system coefficients are estimated on the basis of input-output data of the control system; in adaptive control systems the adaptive control gain should be defined based on observation data in such a way that the gain asymptotically tends to the optimal one; in blind channel identification the channel coefficients are estimated using the output data obtained at the receiver; in signal processing the optimal weighting matrix is estimated on the basis of observations; in pattern classification the parameters specifying the partition hyperplane are searched for by learning; and more examples may be added to this list.
All these parameter estimation problems can be transformed into a root-seeking problem for an unknown function. To see this, let y_k denote the observation at time k, i.e., the information available about the unknown parameters at time k. It can be assumed that the parameter under estimation, denoted by x^0, is a root of some unknown function f(·):

    f(x^0) = 0.

This is not a restriction, because, for example, f(x) = x^0 − x may serve as such a function. Let x_k be the estimate for x^0 at time k. Then the available information at time k+1 can formally be written as

    y_{k+1} = f(x_k) + ε_{k+1},

where

    ε_{k+1} = y_{k+1} − f(x_k).

Therefore, by considering y_{k+1} as an observation on f(·) at x_k with observation error ε_{k+1}, the problem has been reduced to seeking the root x^0 of f(·) based on the observations {y_k}.

It is clear that for each problem the specification of f(·) is of crucial importance. The parameter estimation problem can be solved only if
f(·) is appropriately selected so that the observation error meets the requirements figuring in the convergence theorems.
If f(·) and its gradient could be observed without error at any desired values, then numerical methods such as the Newton-Raphson method, among others, could be applied to solving the problem. However, methods of this kind cannot be used here because, in addition to the obvious problem concerning the existence and availability of the gradient, the observations are corrupted by errors which may contain not only a purely random component but also the structural error caused by inadequacy of the selected f(·).
Aiming at solving the stated problem, Robbins and Monro proposed the following recursive algorithm

    x_{k+1} = x_k + a_k y_{k+1}

to approximate the sought-for root x^0, where a_k > 0 is the step size. This algorithm is now called the Robbins-Monro (RM) algorithm. Following this pioneering work on stochastic approximation, there have been a large number of applications to practical problems and much research on theoretical issues.
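The recursion is simple enough to state in a few lines of code. The following sketch is illustrative only and not from the book: the target function f(x) = 2 − x, the Gaussian noise, and the harmonic step sizes a_k = 1/k are choices made here for the demonstration.

```python
import random

def robbins_monro(observe, x0, n_iter):
    """Run the RM recursion x_{k+1} = x_k + a_k * y_{k+1} with a_k = 1/k."""
    x = x0
    for k in range(1, n_iter + 1):
        y = observe(x)           # noisy observation of f at the current estimate
        x = x + (1.0 / k) * y    # gains satisfy sum a_k = inf, sum a_k^2 < inf
    return x

rng = random.Random(0)
# f(x) = 2 - x has the root 2; each observation is corrupted by N(0, 1) noise.
estimate = robbins_monro(lambda x: (2.0 - x) + rng.gauss(0.0, 1.0),
                         x0=0.0, n_iter=20000)
```

Although the noise never shrinks, the decreasing gains average it out and the estimate settles near the root.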
At the beginning, the probabilistic method was the main tool in convergence analysis for stochastic approximation algorithms, and rather restrictive conditions were imposed on both f(·) and {ε_k}. For example, it is required that the growth rate of f(x) be not faster than linear as ||x|| tends to infinity and that {ε_k} be a martingale difference sequence [78]. Though the linear growth rate condition is restrictive, as shown by simulation it can hardly be simply removed without violating convergence of RM algorithms.
To weaken the noise conditions guaranteeing convergence of the algorithm, the ODE (ordinary differential equation) method was introduced in [72, 73] and further developed in [65]. Since the conditions on the noise required by the ODE method may be satisfied by a large class of {ε_k} including both random and structural errors, the ODE method has been widely applied for convergence analysis in different areas. However, in this approach one has to assume a priori that the sequence of estimates {x_k} is bounded. It is hard to say that the boundedness assumption is more desirable than a growth rate restriction on f(·).
The stochastic approximation algorithm with expanding truncations was introduced in [27], and the analysis method was then improved in [14]. In fact, this is an RM algorithm truncated at expanding bounds, and for its convergence the growth rate restriction on f(·) is not required. The convergence analysis method for the proposed algorithm is called the trajectory-subsequence (TS) method, because the analysis
is carried out at trajectories where the noise condition is satisfied and, in contrast to the ODE method, the noise condition need not be verified on the whole sequence {x_k} but only along convergent subsequences {x_{n_k}}. This makes a great difference when dealing with state-dependent noise, because a convergent subsequence {x_{n_k}} is always bounded, while the boundedness of the whole sequence {x_k} is not guaranteed before establishing its convergence. As shown in Chapters 4, 5, and 6, for most parameter estimation problems, after transforming them to a root-seeking problem, the structural errors are unavoidable, and they are state-dependent.
The expanding truncation technique equipped with the TS method appears to be a powerful tool in dealing with various parameter estimation problems: it not only has succeeded in essentially weakening the conditions for convergence of the general stochastic approximation algorithm but also has made it possible for stochastic approximation to be successfully applied in diverse areas. However, there is a lack of a reference that systematically describes the theoretical part of the method and concretely shows how to apply the method to problems coming from different areas. To fill this gap is the purpose of the book.
The book summarizes results on the topic mostly distributed over journal papers and partly contained in unpublished material. The book is written in a systematic way: it starts with a general introduction to stochastic approximation, then describes the basic method used in the book, proves the general convergence theorems, and demonstrates various applications of the general theory.
In Chapter 1 the problem of stochastic approximation is stated, and the basic methods for convergence analysis, such as the probabilistic method, the ODE method, the TS method, and the weak convergence method, are introduced.
Chapter 2 presents the theoretical foundation of the algorithm with expanding truncations: the basic convergence theorems are proved by the TS method; various types of noise are discussed; the necessity of the imposed noise condition is shown; the connection between stability of the equilibrium and convergence of the algorithm is discussed; the robustness of stochastic approximation algorithms is considered when the commonly used conditions deviate from exact satisfaction; and moving root tracking is also investigated. The basic convergence theorems are presented in Section 2.2, and their proof is elementary and purely deterministic.
Chapter 3 describes asymptotic properties of the algorithms: convergence rates for both cases, where the gradient of f(·) is degenerate or not; asymptotic normality of the estimates; and asymptotic efficiency by the averaging method.
Starting from Chapter 4, the general theory developed so far is applied to different fields. Chapter 4 deals with optimization by stochastic approximation methods. Convergence and convergence rates of the Kiefer-Wolfowitz (KW) algorithm with expanding truncations and randomized differences are established. A global optimization method consisting of a combination of the KW algorithm with search methods is defined, and its a.s. convergence as well as its asymptotic behavior are established. Finally, the global optimization method is applied to solving the model reduction problem.
In Chapter 5 the general theory is applied to problems arising from signal processing. Applying the stochastic approximation method to blind channel identification leads to a recursive algorithm that estimates the channel coefficients and continuously improves the estimates while receiving new signals, in contrast to the existing “block” algorithms. Applying the TS method to principal component analysis results in improved conditions for convergence. Stochastic approximation algorithms with expanding truncations combined with the TS method are also applied to adaptive filters with and without constraints. As a result, the conditions required for convergence have been considerably improved in comparison with existing results. Finally, the expanding truncation technique and the TS method are applied to asynchronous stochastic approximation.
In the last chapter, the general theory is applied to problems arising from systems and control. The ideal parameter for operation is identified for stochastic systems by the methods developed in this book. Then the obtained results are applied to the adaptive quadratic control problem. Adaptive regulation for a nonlinear nonparametric system and learning pole assignment are also solved by the stochastic approximation method.
The book is self-contained in the sense that there are only a few points using knowledge for which we refer to other sources, and these points can be ignored when reading the main body of the book. The basic mathematical tools used in the book are calculus and linear algebra, based on which one will have no difficulty in reading the fundamental convergence Theorems 2.2.1 and 2.2.2 and their applications described in the subsequent chapters. To understand the other material, probability concepts, especially the convergence theorems for martingale difference sequences, are needed. The necessary concepts of probability theory are given in Appendix A. Some facts from probability that are used at a few specific points are listed in Appendix A without proof, because omitting the corresponding parts still leaves the rest of the book readable. However, the
proof of convergence theorems for martingales and martingale differencesequences is provided in detail in Appendix B.
The book is written for students, engineers, and researchers working in the areas of systems and control, communication and signal processing, optimization and operations research, and mathematical statistics.
HAN-FU CHEN
Acknowledgments
The support of the National Key Project of China and the National Natural Science Foundation of China is gratefully acknowledged. The author would like to express his gratitude to Dr. Haitao Fang for his helpful suggestions and useful discussions. The author would also like to thank Ms. Jinling Chang for her skilled typing and his wife Shujun Wang for her constant support.
Chapter 1

ROBBINS-MONRO ALGORITHM
Optimization is ubiquitous in various research and application fields. Quite often an optimization problem can be reduced to finding zeros (roots) of an unknown function f(·), which can be observed, but the observation may be corrupted by errors. This is the topic of stochastic approximation (SA). The error source may be observation noise, but it may also come from structural inaccuracy of the observed function. For example, one wants to find zeros of f(x), but one actually observes functions f_k(·) which are different from f(·). Let us denote by y_{k+1} the observation at time k+1 and by ε_{k+1} the observation noise:

    y_{k+1} = f(x_k) + ε_{k+1},    ε_{k+1} = e_{k+1} + f_{k+1}(x_k) − f(x_k).

Here, f_{k+1}(x_k) − f(x_k) is the additional error caused by the structural inaccuracy. It is worth noting that the structural error normally depends on x_k, and it is hard to require it to have a certain probabilistic property such as independence, stationarity, or the martingale property. We call this kind of noise state-dependent noise.
The basic recursive algorithm for finding roots of an unknown functionon the basis of noisy observations is the Robbins-Monro (RM) algorithm,which is characterized by its simplicity in computation. This chapterserves as an introduction to SA, describing various methods for analyzingconvergence of the RM algorithm.
In Section 1.1 the motivation for the RM algorithm is explained, and its limitation is pointed out by an example. In Section 1.2 the classical approach to analyzing convergence of the RM algorithm is presented, which is based on probabilistic assumptions on the observation noise. To relax the restrictions made on the noise, a convergence analysis method connecting convergence of the RM algorithm with stability of an ordinary differential
equation (ODE) was introduced in the nineteen seventies. The ODE method is demonstrated in Section 1.3. In Section 1.4 the convergence analysis is carried out at a sample path by considering convergent subsequences. For this reason we call this method the Trajectory-Subsequence (TS) method; it is the basic tool used in the subsequent chapters.
In this book our main concern is the path-wise convergence of the algorithm. However, there is another approach to convergence analysis, called the weak convergence method, which is briefly introduced in Section 1.5. Notes and references are given in the last section.
This chapter introduces the main methods used in the literature for convergence analysis, restricted to the single-root case. Extensions to more general cases in various respects are given in later chapters.
1.1. Finding Zeros of a Function

Many theoretical and practical problems in diverse areas can be reduced to finding zeros of a function. To see this it suffices to notice that solving many problems finally consists in optimizing some function L(x), i.e., finding its minimum (or maximum). If L(·) is differentiable, then the optimization problem reduces to finding the roots of f(x), where f(·) is the derivative of L(·).

In the case where the function or its derivatives can be observed without errors, there are many numerical methods for solving the problem. For example, by the gradient (Newton-Raphson) method the estimate x_k for the root of f(·) is recursively generated by the following algorithm

    x_{k+1} = x_k − [f′(x_k)]^{−1} f(x_k),    (1.1.1)

where f′(·) denotes the derivative of f(·). This kind of problem belongs to the topics of optimization theory, which considers general cases where L(·) may be nonconvex, nonsmooth, and with constraints.

In contrast to optimization theory, SA is devoted to finding zeros of an unknown function f(·) which can be observed, but the observations are corrupted by errors.

Since f′(·) is not exactly known and even may not exist, (1.1.1)-like algorithms are no longer applicable. Consider the following simple example. Let f(·) be a linear function

    f(x) = c(x − x^0),    c > 0,

with root x^0. If the derivative of f(·) is available, i.e., if we know c, and if f(x) can precisely be observed, then according to (1.1.1)

    x_{k+1} = x_k − c^{−1} f(x_k) = x^0.
This means that the gradient algorithm leads to the zero x^0 of f(·) in one step.
Assume now that the derivative of f(·) is unavailable but f(x) can exactly be observed. Let us replace [f′(x_k)]^{−1} by a_k > 0 in (1.1.1). Then we derive

    x_{k+1} = x_k − a_k f(x_k),    (1.1.2)

or

    x_{k+1} − x^0 = (1 − c a_k)(x_k − x^0).    (1.1.3)

This is a linear difference equation, which can inductively be solved, and the solution of (1.1.3) can be expressed as follows:

    x_{k+1} − x^0 = ∏_{i=1}^{k} (1 − c a_i) (x_1 − x^0).    (1.1.4)

Clearly, if a_k decreasingly tends to zero with Σ_k a_k = ∞, then ∏_{i=1}^{k}(1 − c a_i) → 0, and x_k tends to the root x^0 of f(·) as k → ∞ for any initial value x_1. This is an attractive property: although the gradient of f(·) is unavailable, we can still approach the sought-for root if the inverse of the gradient is replaced by a sequence of positive real numbers decreasingly tending to zero.
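This deterministic computation can be checked numerically. In the sketch below (an illustration with parameters chosen here, not taken from the book) the slope c of the linear function is hidden from the algorithm, which uses the gains a_k = 1/k in place of the inverse derivative.

```python
def decreasing_gain_iteration(c, root, x1, n_iter):
    """Iterate (1.1.2): x_{k+1} = x_k - a_k * f(x_k), f(x) = c*(x - root), a_k = 1/k."""
    x = x1
    for k in range(1, n_iter + 1):
        x = x - (1.0 / k) * c * (x - root)
    return x

# The product prod(1 - c*a_i) tends to zero because sum a_i diverges,
# so the iterate approaches the root from any starting point.
x_final = decreasing_gain_iteration(c=0.5, root=5.0, x1=-10.0, n_iter=200000)
```

With c = 0.5 the error contracts like k^{-1/2}, so convergence is slow but sure; knowing c would of course give the root in one Newton step.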
Let us consider the case where f(x) is observed with errors:

    y_{k+1} = f(x_k) + ε_{k+1},

where y_{k+1} denotes the observation at time k+1, ε_{k+1} the corresponding observation error, and x_k the estimate for the root of f(·) at time k. It is natural to ask how x_k will behave if the exact value of f(x_k) in (1.1.2) is replaced by its error-corrupted observation y_{k+1}, i.e., if x_k is recursively derived according to the following algorithm:

    x_{k+1} = x_k − a_k y_{k+1}.    (1.1.5)

In our example, y_{k+1} = c(x_k − x^0) + ε_{k+1}, and (1.1.5) turns out to be

    x_{k+1} − x^0 = (1 − c a_k)(x_k − x^0) − a_k ε_{k+1}.
Similar to (1.1.3), the solution of this difference equation is

    x_{k+1} − x^0 = ∏_{i=1}^{k}(1 − c a_i)(x_1 − x^0) − Σ_{i=1}^{k} a_i ∏_{j=i+1}^{k}(1 − c a_j) ε_{i+1}.    (1.1.6)

Therefore, x_{k+1} converges to the root x^0 of f(·) if the noise term in (1.1.6) tends to zero as k → ∞. This means that the replacement of the inverse gradient by a sequence of positive numbers decreasingly tending to zero still works even in the case of error-corrupted observations, provided the observation errors can be averaged out. It is worth noting that in lieu of (1.1.5) we have to take the positive sign before y_{k+1}, i.e., to consider

    x_{k+1} = x_k + a_k y_{k+1},    (1.1.7)

if c < 0, or more generally, if f(x) is decreasing as x increases.
This simple example demonstrates the basic features of the algorithm (1.1.5) or (1.1.7): 1) the algorithm may converge to a root of f(·); 2) the limit of the algorithm, if it exists, should not depend on the initial value; 3) the convergence rate is determined by how fast the observation errors are averaged out.

From (1.1.6) it is seen that for linear functions the convergence rate is determined by how fast the weighted noise sum

    Σ_{i=1}^{k} a_i ∏_{j=i+1}^{k}(1 − c a_j) ε_{i+1}

tends to zero. In the case where {ε_k} is a sequence of independent and identically distributed random variables with zero mean and bounded variance, this term is, e.g. for a_k = 1/k, of order

    O( (log log k / k)^{1/2} )  a.s.

by the law of the iterated logarithm. This means that the convergence rate for algorithms (1.1.5) or (1.1.7) with error-corrupted observations should not be faster than O(√(log log k / k)).
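A small simulation makes the averaging-rate claim concrete. The sketch below (all parameters are choices made here) runs the scalar recursion (1.1.7) for the decreasing function f(x) = 1 − x with i.i.d. N(0, 1) errors and a_k = 1/k; in this case x_{n+1} is exactly the sample mean of the observations, so the error shrinks at the O(1/√n) rate, up to the log log factor that a simulation of this size cannot resolve.

```python
import random

def rm_abs_error(n_steps, rng):
    """Absolute error of the RM estimate of the root of f(x) = 1 - x after n_steps."""
    x = 0.0
    for k in range(1, n_steps + 1):
        y = (1.0 - x) + rng.gauss(0.0, 1.0)   # noisy observation
        x += y / k                            # a_k = 1/k, recursion (1.1.7)
    return abs(x - 1.0)

rng = random.Random(123)
# Average the absolute error over 50 replications at two horizons.
err_short = sum(rm_abs_error(100, rng) for _ in range(50)) / 50    # ~ 1/sqrt(100)
err_long = sum(rm_abs_error(10000, rng) for _ in range(50)) / 50   # ~ 1/sqrt(10000)
```

Increasing the horizon by a factor of 100 shrinks the typical error by roughly a factor of 10, as the square-root rate predicts.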
1.2. Probabilistic Method

We have just shown how to find the root of an unknown linear function based on noisy observations. We now formulate the general problem.
Let f(·): R^l → R^l be an unknown function with unknown root x^0: f(x^0) = 0. Assume f(·) can be observed at each point with noise:

    y_{k+1} = f(x_k) + ε_{k+1},    (1.2.1)

where y_{k+1} is the observation at time k+1, ε_{k+1} is the observation noise, and x_k is the estimate for x^0 at time k.

Stochastic approximation algorithms recursively generate x_k to approximate x^0 based on the past observations. In the pioneering work of this area Robbins and Monro proposed the following algorithm

    x_{k+1} = x_k + a_k y_{k+1}    (1.2.2)

to estimate x^0, where the step size a_k > 0 is decreasing and satisfies the following conditions: Σ_{k=1}^∞ a_k = ∞ and Σ_{k=1}^∞ a_k² < ∞. They proved convergence of x_k to x^0 under appropriate conditions on f(·) and the noise.

We now explain the meaning of the conditions required for the step size a_k. The condition a_k → 0, implied by Σ_k a_k² < ∞, aims at reducing the effect of the observation noises. To see this, consider the case where x_k is close to x^0 and f(x_k) is close to zero.

Throughout the book, ||x|| always means the Euclidean norm of a vector x, and ||A|| denotes the square root of the maximum eigenvalue of the matrix A^T A, where A^T means the transpose of the matrix A.

By (1.2.2), x_{k+1} − x_k ≈ a_k ε_{k+1}. Even in the Gaussian noise case, a_k ε_{k+1} may be large if a_k has a positive lower bound. Therefore, in order to have the desired consistency, i.e., x_k → x^0, it is necessary to use decreasing gains a_k such that a_k → 0. On the other hand, consistency can also not be achieved if a_k decreases too fast as k → ∞. To see this, let Σ_{k=1}^∞ a_k < ∞. Then even in the noise-free case, i.e., ε_k ≡ 0, from (1.2.2) we have

    ||x_{k+1} − x_1|| ≤ Σ_{i=1}^∞ a_i sup_x ||f(x)|| < ∞,

if f(·) is a bounded function. Therefore, in this case x_k cannot reach the true root if the initial value x_1 is far from it, and hence x_k will never converge to x^0.

The algorithm (1.2.2) is now called the Robbins-Monro (RM) algorithm.
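Both step-size requirements can be seen in a toy run. In the sketch below (the bounded function f(x) = −tanh(x), the gains, and the starting point are all choices made here), summable gains exhaust their total travel budget before reaching the root, while gains with a divergent sum do reach it.

```python
import math

def rm_noise_free(step, x0, n_iter):
    """Noise-free recursion x_{k+1} = x_k + a_k * f(x_k) with f(x) = -tanh(x), root 0."""
    x = x0
    for k in range(1, n_iter + 1):
        x = x + step(k) * (-math.tanh(x))
    return x

# a_k = 2^{-k}: total movement is at most sum_k a_k * sup|f| = 1,
# so from x0 = 10 the iterate can never come near the root at 0.
stuck = rm_noise_free(lambda k: 2.0 ** (-k), x0=10.0, n_iter=100)
# a_k = 1/k: sum a_k = infinity, so the iterate does reach the root.
reached = rm_noise_free(lambda k: 1.0 / k, x0=10.0, n_iter=100000)
```

The first run stalls above 9 no matter how many iterations are added; the second creeps toward zero at the pace of the harmonic sum.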
The classical approach to convergence analysis of SA algorithms is based on probabilistic analysis of trajectories. We now present a typical convergence theorem obtained by this approach. Related concepts and results from probability theory are given in Appendices A and B.

In fact, we will use the martingale convergence theorem to prove the path-wise convergence of x_k, i.e., to show x_k → x^0 a.s. For this, the following set of conditions will be used.
A1.2.1 The step size a_k satisfies: a_k > 0, Σ_{k=1}^∞ a_k = ∞, and Σ_{k=1}^∞ a_k² < ∞.
A1.2.2 There exists a twice continuously differentiable Lyapunov function V(·) satisfying the following conditions:
i) its second derivative is bounded;
ii) V(x^0) = 0, V(x) > 0 for x ≠ x^0, and V(x) → ∞ as ||x|| → ∞;
iii) for any ε > 0 there is a β(ε) > 0 such that

    sup_{||x − x^0|| ≥ ε} ∇V^T(x) f(x) ≤ −β(ε) < 0,

where ∇V(·) denotes the gradient of V(·).
A1.2.3 The observation noise {ε_k, F_k} is a martingale difference sequence with

    E(ε_{k+1} | F_k) = 0,    (1.2.3)

where {F_k} is a family of nondecreasing σ-algebras such that x_k is F_k-measurable.
A1.2.4 The function f(·) and the conditional second moment of the observation noise have the following upper bound:

    ||f(x)||² + E(||ε_{k+1}||² | F_k) ≤ c₀(1 + V(x)),    (1.2.4)

where c₀ is a positive constant.
Prior to formulating the theorem we need some auxiliary results. Let {z_k, F_k} be an adapted sequence, i.e., z_k is F_k-measurable. Define the first exit time of {z_k} from a Borel set B:

    τ = min{k : z_k ∉ B}.

It is clear that {τ ≤ k} ∈ F_k, i.e., τ is a Markov time.
Lemma 1.2.1 Assume τ is a Markov time and {z_k, F_k} is a nonnegative supermartingale, i.e.,

    z_k ≥ 0,    E(z_{k+1} | F_k) ≤ z_k  a.s.

Then the stopped sequence {z_{min(k, τ)}, F_k} is also a nonnegative supermartingale.
The proof is given in Appendix B, Lemma B-2-1. The following lemma, concerning convergence of an adapted sequence, will be used in the proof of convergence of the RM algorithm, but the lemma is also of interest by itself.
Lemma 1.2.2 Let {u_k, F_k} and {v_k, F_k} be two nonnegative adapted sequences.

i) If E(u_{k+1} | F_k) ≤ u_k + v_k and Σ_{k=1}^∞ E v_k < ∞, then u_k converges a.s. to a finite limit.

ii) If E(u_{k+1} | F_k) ≤ u_k − v_k, then u_k converges a.s. and Σ_{k=1}^∞ v_k < ∞ a.s.

Proof. For proving i), set

    γ_k = E( Σ_{i=k}^∞ v_i | F_k ),    w_k = u_k + γ_k.    (1.2.5)

Then we have

    E(w_{k+1} | F_k) = E(u_{k+1} | F_k) + E(γ_{k+1} | F_k) ≤ u_k + v_k + E( Σ_{i=k+1}^∞ v_i | F_k ) = w_k.

By the convergence theorem for nonnegative supermartingales, w_k converges a.s. as k → ∞. Since E γ_k = Σ_{i=k}^∞ E v_i → 0, by the same theorem γ_k also converges a.s. as k → ∞, and its limit must equal zero. Noticing that both w_k and γ_k converge a.s. as k → ∞, we conclude that u_k = w_k − γ_k is also convergent a.s. as k → ∞. Consequently, from (1.2.5) it follows that u_k converges a.s. to a finite limit.

For proving ii), set

    w_k = u_k + Σ_{i=1}^{k−1} v_i.

Since Σ_{i=1}^{k−1} v_i is F_k-measurable and nondecreasing in k, taking conditional expectation leads to

    E(w_{k+1} | F_k) = E(u_{k+1} | F_k) + Σ_{i=1}^{k} v_i ≤ u_k − v_k + Σ_{i=1}^{k} v_i = w_k.

Again, by the convergence theorem for nonnegative supermartingales, w_k converges a.s. as k → ∞. Since u_k ≥ 0 and the partial sums Σ_{i=1}^{k−1} v_i are nondecreasing, it directly follows that Σ_{k=1}^∞ v_k < ∞ a.s. and u_k converges a.s.
Theorem 1.2.1 Assume Conditions A1.2.1–A1.2.4 hold. Then for any initial value x_1, the estimate x_k given by the RM algorithm (1.2.2) converges to the root x^0 of f(·) a.s. as k → ∞.
Proof. Let V(·) be the Lyapunov function given in A1.2.2. Expanding V(x_{k+1}) to a Taylor series, we obtain

    V(x_{k+1}) = V(x_k) + a_k ∇V^T(x_k) y_{k+1} + (a_k²/2) y_{k+1}^T ∇²V(ξ_k) y_{k+1}
               ≤ V(x_k) + a_k ∇V^T(x_k) y_{k+1} + c a_k² ||y_{k+1}||²,    (1.2.6)

where ∇V and ∇²V denote the gradient and Hessian of V(·), respectively, ξ_k is a vector with components located in-between the corresponding components of x_k and x_{k+1}, and c denotes a constant such that ||∇²V(x)|| ≤ 2c (by A1.2.2 i)).

Noticing that x_k is F_k-measurable and taking the conditional expectation of (1.2.6), by (1.2.3) and (1.2.4) we derive

    E(V(x_{k+1}) | F_k) ≤ V(x_k)(1 + c₁ a_k²) + a_k ∇V^T(x_k) f(x_k) + c₁ a_k²    (1.2.7)

for some constant c₁ > 0. Since Σ_k a_k² < ∞ by A1.2.1, we have

    ∏_{k=1}^{∞} (1 + c₁ a_k²) < ∞.    (1.2.8)

Denoting

    w_k = [ V(x_k) + c₁ Σ_{i=k}^∞ a_i² ] ∏_{i=1}^{k−1} (1 + c₁ a_i²)^{−1},

and noticing ∇V^T(x) f(x) ≤ 0 by A1.2.2 iii), from (1.2.7) and (1.2.8) it follows that

    E(w_{k+1} | F_k) ≤ w_k + a_k ∏_{i=1}^{k} (1 + c₁ a_i²)^{−1} ∇V^T(x_k) f(x_k) ≤ w_k.    (1.2.9)

Therefore, {w_k, F_k} is a nonnegative supermartingale and converges a.s. by the convergence theorem for nonnegative supermartingales. Since Σ_{i=k}^∞ a_i² → 0 and the product in (1.2.8) is finite, V(x_k) also converges a.s.

For any ε > 0, denote

    U_ε = {x : ||x − x^0|| < ε}.

Since ∇V^T(x) f(x) is nonpositive, from (1.2.9) and Lemma 1.2.2 ii) it follows that

    Σ_{k=1}^∞ a_k |∇V^T(x_k) f(x_k)| < ∞  a.s.

If, with positive probability, x_k remained outside U_ε for all k larger than some k₀, then by A1.2.2 iii) we would have

    Σ_{k>k₀} a_k |∇V^T(x_k) f(x_k)| ≥ β(ε) Σ_{k>k₀} a_k = ∞,

a contradiction to A1.2.1. Therefore, with the possible exception of a set of probability zero, the trajectory of {x_k} must enter U_ε infinitely often.

Consequently, there is a subsequence {x_{k_j}} such that ||x_{k_j} − x^0|| < ε. By the arbitrariness of ε, we then conclude that there is a subsequence, denoted still by {x_{k_j}}, such that x_{k_j} → x^0. Hence V(x_{k_j}) → 0. However, we have shown that V(x_k) converges a.s. Therefore, V(x_k) → 0 a.s. By A1.2.2 ii) we then conclude that x_k → x^0 a.s.
Remark 1.2.1 If Condition A1.2.2 iii) changes to

    inf_{||x − x^0|| ≥ ε} ∇V^T(x) f(x) ≥ β(ε) > 0,

then the algorithm (1.2.2) should accordingly change to

    x_{k+1} = x_k − a_k y_{k+1}.
We now explain the conditions required in Theorem 1.2.1. As noted in Section 1.1, the step size should satisfy Σ_k a_k = ∞, but the condition Σ_k a_k² < ∞ may be weakened to a_k → 0.

Condition A1.2.2 requires the existence of a Lyapunov function V(·). Conditions of this kind normally have to be imposed for convergence of the algorithms, but the analytic properties required of V(·) may be weakened. The noise condition A1.2.3 is rather restrictive. As will be shown in the subsequent chapters, ε_k may be composed not only of random noise but also of structural errors, which hardly have nice probabilistic properties such as being a martingale difference sequence, stationarity, bounded variances, etc.

As in many cases one can take ||x − x^0||² to serve as V(x), it then follows from (1.2.4) that the growth rate of ||f(x)|| as ||x|| → ∞ should not be faster than linear. This is a major restriction on applying Theorem 1.2.1. However, if we a priori assume that {x_k} generated by the algorithm (1.2.2) is bounded, then {f(x_k)} is bounded provided f(·) is locally bounded, and then the linear growth is not a restriction for {x_k, k = 1, 2, ...}.
1.3. ODE Method

As mentioned in Section 1.2, the classical probabilistic approach to analyzing SA algorithms requires rather restrictive conditions on the observation noise. In the nineteen seventies a so-called ordinary differential equation (ODE) method was proposed for analyzing convergence of SA algorithms. We explain the idea of the method. The estimates {x_k} generated by the RM algorithm are interpolated to a continuous function with interpolating length equal to the step size used in the algorithm. The tail part of the interpolating function is shown to satisfy an ordinary differential equation ẋ = f(x). The sought-for root x^0 is the equilibrium of the ODE. By stability of this equation, or by assuming existence of a Lyapunov function, it is proved that the interpolating function tends to x^0. From this, it can be deduced that x_k → x^0.

For demonstrating the ODE method we need two facts from analysis,
which are formulated below as propositions.
Proposition 1.3.1 (Arzelà-Ascoli) Let {f_n(·)} be a set of equi-continuous and uniformly bounded functions, where by equi-continuity we mean that for any t and any ε > 0 there exists a δ > 0 such that

    |f_n(t) − f_n(s)| < ε  whenever |t − s| < δ,  uniformly in n.

Then there are a continuous function f(·) and a subsequence of functions {f_{n_k}(·)} which converge to f(·) uniformly in any finite interval, i.e.,

    f_{n_k}(t) → f(t)  as  k → ∞,

uniformly with respect to t belonging to any finite interval.
Proposition 1.3.2 For the following ODE

    ẋ = f(x)    (1.3.1)

with equilibrium x^0, i.e., f(x^0) = 0, if there exists a continuously differentiable function V(·) such that V(x^0) = 0, V(x) > 0 for x ≠ x^0, V(x) → ∞ as ||x|| → ∞, and

    ∇V^T(x) f(x) < 0  for  x ≠ x^0,

then the solution to (1.3.1), starting from any initial value, tends to x^0 as t → ∞, i.e., x^0 is the globally asymptotically stable solution to (1.3.1).
Let us introduce the following conditions.

A1.3.1 a_k > 0, a_k → 0, and Σ_{k=1}^∞ a_k = ∞.

A1.3.2 There exists a twice continuously differentiable Lyapunov function V(·) such that V(x^0) = 0, V(x) > 0 for x ≠ x^0, V(x) → ∞ as ||x|| → ∞, and

    ∇V^T(x) f(x) < 0  whenever  x ≠ x^0.
In order to describe the conditions on the noise, we introduce an integer-valued function m(k, T) for any integer k and any T > 0. For k = 1, 2, ..., define

    m(k, T) = max{ m : Σ_{i=k}^{m} a_i ≤ T }.    (1.3.2)

Noticing that a_k tends to zero, for any fixed T > 0, m(k, T) − k diverges to infinity as k → ∞. In fact, m(k, T) counts the number of iterations starting from time k as long as the sum of step sizes does not exceed T. The integer-valued function m(k, T) will be used throughout the book.
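As a sketch (the helper name and the step-size sequence are choices made here), m(k, T) can be computed by accumulating step sizes until the budget T is exhausted:

```python
def m_of(k, T, a):
    """The book's m(k, T) = max{m : a_k + ... + a_m <= T} (assumes a(k) <= T)."""
    total, i = 0.0, k
    while total + a(i) <= T:
        total += a(i)
        i += 1
    return i - 1

# With a_i = 1/i, the step-size mass on [k, m] is roughly log(m/k), so
# m(k, T) grows like e^T * k, and m(k, T) - k -> infinity as k -> infinity.
window_end = m_of(10, 1.0, lambda i: 1.0 / i)
```

For a_i = 1/i and T = 1 the window starting at k = 10 ends at index 25, and the same budget starting at k = 100 would cover far more indices, illustrating the divergence of m(k, T) − k.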
The following conditions will be used:
A1.3.3 The noise {ε_k} satisfies

    lim_{T→0} limsup_{k→∞} (1/T) || Σ_{i=k}^{m(k,T)} a_i ε_{i+1} || = 0.    (1.3.3)
A1.3.4 f(·) is continuous.
Theorem 1.3.1 Assume that A1.3.1, A1.3.2, and A1.3.4 hold. If for a fixed sample ω A1.3.3 holds and {x_k} generated by the RM algorithm (1.2.2) is bounded, then for this ω, x_k tends to x^0 as k → ∞.
Proof. Set

    t_1 = 0,    t_{k+1} = Σ_{i=1}^{k} a_i,    k ≥ 1.

Define the linear interpolating function x⁰(·):

    x⁰(t) = x_k + (x_{k+1} − x_k)(t − t_k)/a_k,    t ∈ [t_k, t_{k+1}].    (1.3.4)

It is clear that x⁰(·) is continuous and x⁰(t_k) = x_k. Further, define the corresponding linear interpolating function of the noises, given by (1.3.4) with x_k replaced by ε_k. Since we will deal with the tail part of x⁰(·), we define x_k(·) by shifting time in x⁰(·):

    x_k(t) = x⁰(t_k + t),    t ≥ 0.

Thus, we derive a family of continuous functions {x_k(·)}.

Let us also define the piecewise constant interpolating function

    x̄(t) = x_k,    t ∈ [t_k, t_{k+1}).

Then summing up both sides of (1.2.2) yields

    x_{j+1} = x_k + Σ_{i=k}^{j} a_i f(x_i) + Σ_{i=k}^{j} a_i ε_{i+1},    j ≥ k,

and hence

    x_k(t) = x_k(0) + ∫_0^t f(x̄(t_k + s)) ds + g_k(t),    (1.3.9)

where g_k(t) collects the contribution of the noises on [t_k, t_k + t].

By the boundedness assumption on {x_k}, the family {x_k(·)} is uniformly bounded. We now prove that it is equi-continuous. By the definition of the interpolation and the local boundedness of f(·), for 0 ≤ s ≤ t,

    ||x_k(t) − x_k(s)|| ≤ c(t − s) + max_{k ≤ j ≤ m(k,t)} || Σ_{i=k}^{j} a_i ε_{i+1} ||,    (1.3.11)

where c bounds ||f(·)|| on a set containing the trajectory, and by A1.3.3 the last term tends to zero as k → ∞. By the boundedness of {x_k} and (1.3.11) we see that {x_k(·)} is equi-continuous.

By Proposition 1.3.1, we can select from {x_k(·)} a convergent subsequence {x_{k_j}(·)} which tends to a continuous function x(·) uniformly on any finite interval. Letting j → ∞ in (1.3.9), by the continuity of f(·) (A1.3.4), the uniform convergence of x_{k_j}(·) to x(·), and A1.3.3, the noise term converges to zero, and

    x(t) = x(0) + ∫_0^t f(x(s)) ds,    (1.3.14)

i.e., the limit x(·) satisfies the ODE ẋ = f(x). By A1.3.2 and Proposition 1.3.2 we see that x(t) → x^0 as t → ∞.

We now prove that x_k → x^0. Assume the converse: there is a subsequence {x_{n_k}} converging to some x̄ with ||x̄ − x^0|| = ε > 0. The family of shifted functions {x_{n_k}(·)} is uniformly bounded and equi-continuous, so we can select a convergent subsequence, denoted still by {x_{n_k}(·)}; its limit x(·) satisfies the ODE (1.3.14) and starts from x(0) = x̄, by the uniqueness of the solution to (1.3.14). Since x̄ ≠ x^0, V(·) strictly decreases along this solution: there are T > 0 and η > 0 such that

    V(x(T)) ≤ V(x̄) − η.    (1.3.15)

By the uniform convergence we have x_{n_k}(T) → x(T), which implies that V(x_{n_k}(T)) ≤ V(x̄) − η/2 for k large enough. Repeating this argument from the times corresponding to x_{n_k}(T), the values of V(·) along the trajectory would have to decrease by a fixed amount infinitely often, and by (1.3.15) we obtain a contradictory inequality: V would fall below its infimum for k large enough. This completes the proof of x_k → x^0.
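The heart of the ODE method — that in the rescaled time t_k = a_1 + ... + a_{k−1} the tail of the iterates follows ẋ = f(x) — can be checked numerically in the noise-free case. The sketch below uses f(x) = −x (a choice made here, not from the book), starts the recursion deep in the tail where the steps are already small, and compares the endpoint with the exact ODE solution x(t) = x(0)e^{−t}.

```python
import math

def rm_tail(k0, T):
    """Run the noise-free recursion x_{k+1} = x_k + a_k * f(x_k), f(x) = -x,
    a_k = 1/k, starting from index k0 and x = 1, for rescaled time at most T."""
    x, t, k = 1.0, 0.0, k0
    while t + 1.0 / k <= T:
        x += (1.0 / k) * (-x)   # one RM step
        t += 1.0 / k            # advance the interpolated clock by a_k
        k += 1
    return x, t

x_end, t_end = rm_tail(1000, 1.0)
ode_end = math.exp(-t_end)   # solution of dx/dt = -x with x(0) = 1 at time t_end
```

Starting at k0 = 1000 the gains are small, so the broken-line interpolation stays within a fraction of a percent of the ODE solution; started at k0 = 1 the match would be much cruder, which is why the method is a statement about the tail of the trajectory.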
We now compare the conditions used in Theorem 1.3.1 with those in Theorem 1.2.1.

Conditions A1.3.1 and A1.3.2 are slightly weaker than A1.2.1 and A1.2.2, but they are almost the same. The noise condition A1.3.3 is significantly weaker than those used in Theorem 1.2.1, because under the conditions of Theorem 1.2.1 the series Σ_k a_k ε_{k+1} converges a.s., which certainly implies A1.3.3.
As a matter of fact, Condition A1.3.3 may be satisfied by sequences much more general than martingale difference sequences.

Example 1.3.1 Assume ε_k → 0 as k → ∞, where {ε_k} may be any random or deterministic sequence. Then {ε_k} satisfies A1.3.3. This is because

    (1/T) || Σ_{i=k}^{m(k,T)} a_i ε_{i+1} || ≤ (1/T) sup_{i ≥ k} ||ε_{i+1}|| Σ_{i=k}^{m(k,T)} a_i ≤ sup_{i ≥ k} ||ε_{i+1}|| → 0  as  k → ∞.
Example 1.3.2 Let {ε_k} be an MA process, i.e.,

    ε_{k+1} = w_{k+1} + C w_k,

where {w_k} is a martingale difference sequence with sup_k E(||w_{k+1}||² | F_k) < ∞. Then under condition A1.2.1, Σ_k a_k ε_{k+1} converges a.s., and hence

    (1/T) || Σ_{i=k}^{m(k,T)} a_i ε_{i+1} || → 0  as  k → ∞

a.s. Consequently, A1.3.3 is satisfied for almost all sample paths.
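Example 1.3.2 can be illustrated numerically. In the sketch below (the MA coefficient 0.5, the driving N(0, 1) sequence, and the checkpoints are all choices made here), the weighted series Σ a_k ε_{k+1} with a_k = 1/k behaves as a convergent series: its tail beyond a large index is small, which is what A1.3.3 asks of the averaged noise.

```python
import random

rng = random.Random(7)
# MA(1) noise: eps_k = w_k + 0.5 * w_{k-1}, driven by i.i.d. N(0, 1) variables
# (an i.i.d. zero-mean sequence is a martingale difference sequence).
w_prev = 0.0
partial = 0.0
partial_at_1000 = None
for k in range(1, 200001):
    w = rng.gauss(0.0, 1.0)
    eps = w + 0.5 * w_prev
    w_prev = w
    partial += eps / k          # accumulate a_k * eps_k with a_k = 1/k
    if k == 1000:
        partial_at_1000 = partial
tail = abs(partial - partial_at_1000)   # |sum over k = 1001..200000 of a_k * eps_k|
```

The tail of the series is tiny compared with the individual noise values, even though the noise itself never decays: the correlated MA structure does not prevent the weighted sums from settling.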
Condition A1.3.4 requires continuity of which is not required inA1.2.4. At first glance, unlike A1.2.4, Condition A1.3.4 does not imposeany growth rate condition on but Theorem 1.3.1 a priori requiresthe boundedness of which is an implicit requirement for the growthrate of
The ODE method is widely used in convergence analysis for algo-rithms arising from various application areas, because from the noiseit requires no probabilistic property which would be difficult to verify.Concerning the weakness of the ODE method, we have mentioned thatit a priori assumes that is bounded. This condition is difficult tobe verified in general case. The other point should be mentioned thatCondition A1.3.3 is also difficult to be verified in the case wheredepends on the past which often occurs when containsstructural errors of This is because A1.3.3 may be verifiable if isconvergent, but may badly behave depending upon the behavior of
So we are somehow in a cyclic situation: with A1.3.3 we can prove convergence of the estimates; on the other hand, with convergent estimates we can verify A1.3.3. This difficulty will be overcome by using the Trajectory-Subsequence (TS) method, to be introduced in the next section and used in subsequent chapters.
1.4. Truncated RM Algorithm and TS Method

In Section 1.2 we considered the root-seeking problem where the sought-for root may be any point in the whole space. If a region to which the root belongs is known, then we may use a truncated algorithm, and the growth rate restriction on the function can be removed.

Let us assume that such a bound on the root is known. In lieu of (1.2.2) we now consider the following truncated RM algorithm:
where the observation is given by (1.2.1), is a given point,
and
The constant used in (1.4.1) will be specified later on. The algorithm (1.4.1) coincides with the RM algorithm as long as the estimate evolves in the sphere, but if the estimate exits the sphere, then the algorithm is pulled back to the fixed point.
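A minimal numerical sketch of the truncated algorithm (1.4.1) may help; everything concrete below is an assumption made for illustration: f(x) = -x with root 0, truncation bound M = 5, reset point x* = 1, step sizes a_k = 1/k, and Gaussian observation noise.

```python
import numpy as np

# Sketch of the truncated RM algorithm (1.4.1); all concrete choices
# (f, M, x_star, step sizes, noise) are illustrative assumptions.
rng = np.random.default_rng(0)
M, x_star = 5.0, 1.0        # truncation bound and fixed reset point
x = x_star
for k in range(1, 20_000):
    a_k = 1.0 / k
    y = -x + 0.2 * rng.standard_normal()   # noisy observation of f(x) = -x
    x_next = x + a_k * y                    # RM step
    # If the iterate exits the sphere {|x| <= M}, pull it back to x_star.
    x = x_star if abs(x_next) > M else x_next
```

With the root inside the truncation region, the iterate settles near 0; the growth rate restriction on f plays no role here.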
We will use the following set of conditions:
A1.4.1 The step size satisfies the following conditions
A1.4.2 There exists a continuously differentiable Lyapunov function(not necessarily being nonnegative) such that
and for (which is used in
(1.4.1)) there is such that
A1.4.3 For any convergent subsequence of
where is given by (1.3.2);
A1.4.4 is measurable and locally bounded.
We first compare these conditions with A1.3.1–A1.3.4. We note that A1.4.1 is the same as A1.3.1, while A1.4.2 is weaker than A1.2.2.
The difference between A1.3.3 and A1.4.3 is that Condition (1.4.2) is required to hold only along convergent subsequences, while (1.3.3) in A1.3.3 has to be verified along the whole sequence.
It will be seen that in many problems A1.4.3 can be verified, while A1.3.3 is difficult to verify.
Comparing A1.4.4 with A1.3.4 we find that the conditions on the function have now been weakened. The growth rate restriction used in Theorem 1.2.1 and the boundedness assumption imposed in Theorem 1.3.1 have been removed in the following theorem.
Theorem 1.4.1 Assume Conditions A1.4.1, A1.4.2, and A1.4.4 hold and the constant in A1.4.2 is available. Set for (1.4.1). If for some sample path A1.4.3 holds, then given by (1.4.1) converges to for this sample path.
Proof. We say that crosses an interval if and
We first prove that truncations in (1.4.1) may happen at most a finite number of times. Assume the converse: there are infinitely many truncations occurring in (1.4.1). Since
by A1.4.2, there is an interval such that
and there are infinitely many that cross
Since is bounded, we may extract a convergent subsequence from it. Let us denote the extracted convergent subsequence still the same way. It is clear that
Since the limit of is located in the open sphere, there is an such that
for all sufficiently large
Since is bounded, by A1.4.4 and the boundedness of
using (1.4.2) we have
if is small enough and is large enough. This, together with (1.4.5), implies that
Therefore, the norm of
cannot reach the truncation bound. In other words, the algorithm (1.4.1) becomes the untruncated RM algorithm (1.4.7) for
small and large
By the mean value theorem there exists a vector with components located in-between the corresponding components of and such that
Notice that by (1.4.2) the left-hand side of (1.4.6) is of for all sufficiently large since is bounded. From this it follows that i) for small enough and large enough
and hence and ii) the last term in (1.4.8) is of since as From (1.4.7) and (1.4.8) it then follows that
Since the interval does not contain the origin, noticing that we find and that there is such that
for sufficiently small and all large enough Then by A1.4.2 there is such that
for all large and small enough As mentioned above, from (1.4.9) we have
for sufficiently large and small enough where denotes a magnitude tending to zero as
Taking (1.4.4) into account, from (1.4.10) we find that
for large. However, we have shown that
The obtained contradiction shows that the number of truncations in (1.4.1) can only be finite.
We have proved that starting from some large time the algorithm (1.4.1) develops as an RM algorithm
and is bounded. We are now in a position to show that the estimate converges. Assume this were not true. Then we would have
Then there would exist an interval not containing the origin, and would cross it for infinitely many
Again, without loss of generality, assuming by the same argument as that used above, we will arrive at (1.4.9) and (1.4.10) for large and obtain a contradiction. Thus, tends to a finite limit as
It remains to show that
Assume the converse: there is a subsequence
Then there is a such that for all sufficiently large. We still have (1.4.8), (1.4.9), and (1.4.10) for some
Letting tend in (1.4.10), by convergence of we arrive at a contradictory inequality:
This means
In this section we have demonstrated an analysis method different from those used in Sections 1.2 and 1.3. This method is based on analyzing the sample-path behavior, and conclusions on the whole sequence are deduced from the local behavior of the estimates obtained immediately after a convergent subsequence. We call this method the Trajectory-Subsequence (TS) method. The TS method is the main tool to be used in subsequent chapters for analyzing more general cases. It will be seen that the TS method is powerful in dealing with complicated errors, including both random noise and structural inaccuracy of the function.
The obvious weakness of Theorem 1.4.1 is the assumption on the availability of the upper bound for the root. This limitation will be removed later on.
1.5. Weak Convergence Method

Up to now we have worked with decreasing gains, which are necessary for path-wise convergence when observations are corrupted by noise. However, in some applications people prefer using a constant gain:
where in contrast to (1.2.2) a constant stands for, which tends to zero as
Define the piece-wise constant interpolating function as
Then, which is the space of real functions on that are right continuous and have left-hand limits, endowed with the Skorohod topology. Convergence of to a continuous function in the Skorohod topology is equivalent to uniform convergence on any bounded interval.
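The interpolation can be sketched numerically in the noise-free case; the example f(x) = -x below is an assumption for illustration, and then the piecewise-constant interpolation approaches the solution x(t) = x(0)e^{-t} of the limit ODE as the gain shrinks.

```python
import math

# Piecewise-constant interpolation of the constant-gain recursion
# x_{k+1} = x_k + eps * f(x_k), evaluated at time t: x^eps(t) = x_k for
# t in [k*eps, (k+1)*eps).  Noise-free illustrative case f(x) = -x.
def interpolated_value(eps, x0, t):
    x = x0
    for _ in range(int(t / eps)):   # iterate up to the step containing t
        x += eps * (-x)
    return x

x_eps = interpolated_value(1e-4, 1.0, 1.0)   # interpolation at t = 1
x_ode = math.exp(-1.0)                       # ODE solution x(1) = e^{-1}
```

Shrinking the gain makes the gap between the interpolation and the ODE solution as small as desired on any bounded interval, which is the uniform convergence just mentioned.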
Let and be probability measures determined by the stochastic processes and respectively on, with induced by the Skorohod topology.
If for any bounded continuous function defined on
then we say that weakly converges to
If for any there is a compact measurable set in
such that
then is called tight. Further, is called relatively compact if each subsequence of contains a weakly convergent subsequence.
In weak convergence analysis an important role is played by Prohorov's Theorem, which says that on a complete and separable metric space, tightness is equivalent to relative compactness. The weak convergence method establishes the weak limit of as and convergence of to in probability as whereas
Theorem 1.5.1 Assume the following conditions:
A1.5.1 is a.s. bounded;
A1.5.2 is continuous;
A1.5.3 is adapted, is uniformly integrable in the sense that
and
Then is tight in and weakly converges to, which is a solution to
Further, if is asymptotically stable for (1.5.3), then for anyas the distance between and
converges to zero in probability as
Instead of a proof, we only outline its basic idea. First, it is shown that we can extract a subsequence of weakly converging to
For notational simplicity, denote the subsequence still the same way. By the Skorohod representation, we may assume convergence with probability one. For this we need only, if necessary, to change the probability space and take and on this new space such that and have the same distributions as those of and respectively. Then, it is proved that
is a martingale. Since and, as can be shown, is Lipschitz continuous, it follows that
Since is relatively compact and the limit does not depend on the extracted subsequence, the whole family weakly converges to as and satisfies (1.5.3). By asymptotic stability of
Remark 1.5.1 The boundedness assumption on may be removed. For this a smooth function is introduced such that
and the following truncated algorithm
is considered in lieu of (1.5.1). Then is interpolated to a piecewise constant function. It is shown that is tight, and weakly convergent as The limit
satisfies
Finally, by showing lim sup lim sup for some
for each, it is proved that itself is tight and weakly converges to satisfying (1.5.3).
1.6. Notes and References

The stochastic approximation algorithm was first proposed by Robbins and Monro in [82], where the mean square convergence of the algorithm was established under an independence assumption on the observation noise. Later, the noise was extended from independent sequences to martingale difference sequences (e.g., [7, 40, 53]).
The probabilistic approach to convergence analysis is well summarized in [78].
The ODE approach was proposed in [65, 72], and then it was widely used [4, 85]. For a detailed presentation of the ODE method we refer to [65, 68].
The proof of the Arzelà-Ascoli Theorem can be found in ([37], p. 266).
Section 1.4 is an introduction to the method described in detail in the coming chapters. For stability and Lyapunov functions we refer to [69]. The weak convergence method was developed by Kushner [64, 68]. The Skorohod topology and Prohorov's theorem can be found in [6, 41].
For the probability concepts briefly presented in Appendix A, we refer to [30, 32, 70, 76, 84]. But the proof of the convergence theorem for martingale difference sequences, which is frequently used throughout the book, is given in Appendix B.
Chapter 2

STOCHASTIC APPROXIMATION ALGORITHMS WITH EXPANDING TRUNCATIONS

In Chapter 1 the RM algorithm, the basic algorithm used in stochastic approximation (SA), was introduced, and four different methods for analyzing its convergence were presented. However, the conditions imposed for convergence are rather strong.

Comparing the theorems derived by various methods in Chapter 1, we find that the TS method introduced in Section 1.4 requires the weakest condition on the noise. The trouble is that the sought-for root has to be inside the truncation region. This motivates us to consider SA algorithms with expanding truncations, with the purpose that the truncation region will finally cover the sought-for root, whose location is unknown. This is described in Section 2.1.

General convergence theorems for the SA algorithm with expanding truncations are given in Section 2.2. The key point of the proof is to show that the number of truncations is finite. Once this is done, the estimate sequence is bounded and the algorithm becomes the conventional RM algorithm after a finite number of steps. This is realized by using the TS method. It is worth noting that the fundamental convergence theorems given in this section are analyzed by a completely elementary method, which is deterministic and requires only a knowledge of calculus. In Section 2.3 state-independent conditions on the noise are given to guarantee convergence of the algorithm when the noise itself is state-dependent. In Section 2.4 conditions on the noise are discussed; it appears that the noise condition in the general convergence theorems is, in a certain sense, necessary. In Section 2.5 the convergence theorem is given for the case where the observation noise is non-additive.

In the multi-root case, up to Section 2.6 we only establish that the distance between the estimate and the root set tends to zero. But by no means does this imply convergence of the estimate itself. This is briefly discussed in Section 2.4, and is considered in Section 2.6 in connection with properties of the equilibria. Conditions are given to guarantee trajectory convergence, and it is also considered whether the limit of the estimate is a stable or an unstable equilibrium. In Section 2.7 it is shown that a small distortion of the conditions causes only a small estimation error in the limit, while Section 2.8 considers the case where the sought-for root moves during the estimation process. The convergence theorems are derived with the help of the general convergence theorem given in Section 2.2. Notes and references are given in the last section.

2.1. Motivation

In Chapter 1 we presented four types of convergence theorems using different analysis methods for SA algorithms. However, none of these theorems is completely satisfactory in applications. Theorem 1.2.1 is proved by the classical probabilistic method, which requires restrictive conditions on the noise. As mentioned before, the noise may contain a component caused by the structural inaccuracy of the function, and it is hard to assume this kind of noise to be mutually independent or to be a martingale difference sequence, etc. The growth rate restriction imposed on the function not only is severe, but also is unavoidable in a certain sense. To see this, let us consider the following example:

It is clear that conditions A1.2.1, A1.2.2, and A1.2.3 are satisfied. The only condition that is not satisfied is (1.2.4), since the function grows faster than the second order polynomial on the right-hand side of (1.2.4). A simple calculation shows that the sequence given by the RM algorithm rapidly diverges:

From this one might conclude that the growth rate restriction would be necessary. However, if we take the initial value close enough to the root, then the sequence given by the RM algorithm converges. Reducing the initial value is, in a certain sense, equivalent to using step sizes starting not from the first one but from some later index. The difficulty consists in that we do not know from which index we should start the algorithm. This is one of the motivations to use the expanding truncations to be introduced later.
Theorem 1.3.1 proved in Section 1.3 demonstrates the ODE method. By this approach, the condition imposed on the noise has been significantly weakened, and it covers a class of noises much larger than that treated by the probabilistic method. However, it a priori requires the estimates to be bounded. This is the case if the estimate sequence converges, but before establishing its convergence this is an artificial condition, which is not satisfied even for the simple example given above. Further, although the noise condition (1.3.3) is much more general than that used in Theorem 1.2.1, it is still difficult to verify for state-dependent noise. For example,
where is a martingale difference sequence with
If is bounded and
then a.s. and (1.3.3) holds. However, in general, it is difficult to directly verify (1.3.3) because the behavior of the estimates is unknown. This is why we use Condition (1.4.2), which need be verified only along convergent subsequences. With convergent subsequences the noise is easier to deal with.
By considering convergent subsequences, path-wise convergence is proved for a truncated RM algorithm by using the TS method in Theorem 1.4.1. The weakness of algorithms with fixed truncation bounds is that the sought-for root has to be located in the truncation region. But, in general, this cannot be ensured. This is another motivation to consider algorithms with expanding truncations.
The weak convergence method explained in Section 1.5 can avoid the boundedness assumption on the estimates, but it can ensure convergence in distribution only, while in practical computation one always deals with a sample path. Hence, people in applications are mainly interested in path-wise convergence.
The SA algorithm with expanding truncations was introduced in order to remove the growth rate restriction on the function. It has been developed in two directions: weakening the conditions imposed on the noise and improving the analysis method. By the TS method we can show that the SA algorithm with expanding truncations converges under a truly weak condition on the noise, which, in fact, is also necessary for a wide class of noises.
In Chapter 1, the root is a singleton. From now on we will consider the general case. Let J be the root set.
We now define the algorithm. Let be a sequence of positive numbers increasingly diverging to infinity, and let be a fixed point in
Fix an arbitrary initial value and denote by the estimate at time serving as the approximation to J. Define by the following recursion:
where the indicator function equals 1 if the inequality indicated in the brackets is fulfilled, and 0 otherwise.
We explain the algorithm. The count is the number of truncations up to time, and the corresponding bound serves as the truncation bound when the estimate is generated. From (2.1.1) it is seen that if the estimate at time calculated by the RM algorithm remains in the truncation region, then the algorithm evolves as the RM algorithm. If the estimate exits from the sphere, then it is pulled back to the pre-specified point and the truncation bound is enlarged.
Consequently, if it can be shown that the number of truncations is finite, or equivalently, that the sequence generated by (2.1.1) and (2.1.2) is bounded, then the algorithm (2.1.1) and (2.1.2) becomes the one without truncations, i.e., the RM algorithm, after a finite number of steps. This actually is the key step when we prove convergence of (2.1.1) and (2.1.2).
The convergence analysis of (2.1.1) and (2.1.2) will be given in the next section, and the analysis is carried out in a deterministic way at a fixed sample path without involving any interpolating function.
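The recursion (2.1.1)–(2.1.2) can be sketched as follows. All concrete choices are assumptions made for illustration: f(x) = -(x - 10)^3 (a root at 10, and a function violating the growth restriction (1.2.4), so the plain RM algorithm diverges for it), truncation bounds M_s = 5(s + 1) increasing to infinity, reset point x* = 0, step sizes a_k = 1/k, and Gaussian observation noise.

```python
import numpy as np

# Sketch of the SA algorithm with expanding truncations (2.1.1)-(2.1.2);
# f, the bounds M_s, the reset point, and the noise are illustrative.
def expanding_truncation_sa(f, x_star=0.0, n_steps=50_000, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x, sigma = x_star, 0                     # sigma = number of truncations so far
    for k in range(1, n_steps + 1):
        y = f(x) + noise_std * rng.standard_normal()   # noisy observation
        x_next = x + y / k                             # RM step with a_k = 1/k
        if abs(x_next) > 5.0 * (sigma + 1):  # estimate left the current sphere:
            x_next = x_star                  # pull it back to the fixed point
            sigma += 1                       # and enlarge the truncation bound
        x = x_next
    return x, sigma

estimate, n_truncations = expanding_truncation_sa(lambda x: -(x - 10.0) ** 3)
```

Only finitely many truncations occur in this sketch: once the bound exceeds the root and the step sizes are small, the recursion runs as the untruncated RM algorithm and the estimate approaches 10.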
2.2. General Convergence Theorems by TS Method

In this section, by the TS method we establish convergence of the RM algorithm with expanding truncations defined by (2.1.1)–(2.1.3) under general conditions. Let us first list the conditions to be used.
A2.2.2 There is a continuously differentiable function (not necessarily nonnegative) such that
for any, and is nowhere dense, where J is the zero set, i.e.,
and denotes the gradient. Further, the point used in (2.1.1) is such that for some and
To introduce the condition on the noise, let us denote by the underlying probability space. Let be a measurable function defined on the product space. Fixing an means that a sample path is under consideration. Let the noise be given by
Thus, state-dependent noise is considered, and for a fixed state the noise may be random.
A2.2.3 For the sample path under consideration, for any sufficiently large integer
for any such that converges, where is given by (1.3.2) and denotes the sequence given by (2.1.1)–(2.1.3) valued at the sample path
In the sequel, the algorithm (2.1.1)–(2.1.3) is considered for the fixed sample path for which A2.2.3 holds, and the sample-path argument will often be suppressed if no confusion is caused.
A2.2.4 is measurable and locally bounded.
Remark 2.2.1 Comparing A2.2.1–A2.2.4 with A1.4.1–A1.4.4, we find that if the root set J degenerates to a singleton, then the only essential difference is that an indicator function is included in (2.2.2), while (1.4.2) stands without it. It is clear that if the estimate sequence is bounded, then this makes no difference. However, before establishing the boundedness of the estimates, condition (2.2.2) is easier to verify. The key point here
is that, in contrast to Section 1.4, we do not assume availability of the upper bound for the roots of
Remark 2.2.2 It is worth noting that converges. To see this it suffices to take in (2.2.2).
Theorem 2.2.1 Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume A2.2.1–A2.2.4 hold. Then, for the sample path for which A2.2.3 holds.
Proof. The proof is completed in six steps by considering convergent subsequences along the sample path. This is why we call the analysis method used here the TS method.
Step 1. We show that there are constants such that for any there exists such that for any
if is a convergent subsequence of, where M is independent of the particular subsequence.
Since, we need only to prove (2.2.3) for
If the number of truncations in (2.1.1)–(2.1.3) is finite, then there is an N beyond which there is no more truncation. In this case, we may take in (2.2.3).
We now prove (2.2.3) for the case where as
Assume the converse: (2.2.3) is not true. There is such that
Take a sequence of positive real numbers with as
Since (2.2.3) is not true, there are and such that
and for any there are and such that
Without loss of generality we may assume
Then for any, from (2.2.4) and (2.2.6) it follows that
Since there is such that. Then from (2.2.7) it follows that
For any fixed, if is large enough, then and, by (2.2.10),
Since, from (2.2.11) it follows that
and by (2.2.4), (2.2.7), and (2.2.8)
and hence
by A2.2.4, where is a constant. Let where is specified in A2.2.3. Then from A2.2.3
for any
Taking and respectively in (2.2.10), and noticing from (2.2.9), we then have
and hence
From (2.2.8), it follows that
where the second term on the right-hand side of the inequality tends to zero by (2.2.12) and (2.2.13), while the first term tends to zero because
Noticing that by
(2.2.9) and (2.2.13), we then by (2.2.14) have
On the other hand, by (2.2.6) we have
The obtained contradiction proves (2.2.3).
Step 2. We now show that for all large enough
if T is small enough, where is a constant.
If the number of truncations in (2.1.1)–(2.1.3) is finite, then the sequence is bounded and hence is also bounded.
Then for large enough there is no truncation, and by (2.2.2) for
if T is small enough. In (2.2.16), for the last inequality the boundedness of the estimates is invoked, and is a constant.
Thus, it suffices to prove (2.2.15) for the case where
From (2.2.3) it follows that for any
if is large enough. This implies that for
where is a constant. The last inequality of (2.2.18) yields
With in A2.2.3, from (2.2.2) we have
for large enough and small enough T.
Combining (2.2.18), (2.2.19), and (2.2.20) leads to
for all large enough. This together with (2.2.16) verifies (2.2.15).
Step 3. We now show the following assertion: For any interval with and, the sequence cannot cross infinitely many times with
and
Assume the converse: there are infinitely many crossings and is bounded.
By boundedness, without loss of generality, we may assume
By setting in (2.2.15), we have
But by definition so we have
From (2.2.15) we see that if we take sufficiently small, then
for sufficiently large
By (2.2.18) and (2.2.15), for large we then have
where denotes the gradient and as
For, condition (2.2.2) implies that
By (2.2.15) and (2.2.18) it follows that
bounded, where by "crossing" we mean that
Then, by (2.2.23) and (2.2.1), from (2.2.24)–(2.2.26) it follows that there are and such that
for all sufficiently large
Noticing (2.2.22), from (2.2.27) we derive
However, by (2.2.15) we have
which implies that for small enough
This means that, which contradicts (2.2.28).
Step 4. We now show that the number of truncations is bounded.
By A2.2.2, is nowhere dense, and hence a nonempty interval
exists such that and
If, then starting from, the sequence will cross the sphere infinitely many times. Consequently, it will cross infinitely often while remaining bounded. In Step 3 we have shown this is impossible. Therefore, starting from some time the algorithm (2.1.1)–(2.1.3) will have no truncations and the estimate sequence is bounded.
This means that the algorithm defined by (2.1.1)–(2.1.3) becomes the conventional RM algorithm for large times, and a condition stronger than (2.2.2) is satisfied:
for any such that converges.
Step 5. We now show that converges. Let
We have to show
If and one of and does not belong to, then exists such that and. By Step 3 this is impossible. So, both and belong to, and
If we can show that is dense in, then from (2.2.30) it will follow that is dense in, which contradicts the assumption that is nowhere dense. This will prove, i.e., the convergence of
To show the density it suffices to show that
Assume the converse: there is a subsequence
Without loss of generality, we may assume converges; otherwise a convergent subsequence can be extracted, which is possible because the sequence is bounded. However, if we take in (2.2.15), we have
which contradicts (2.2.31). Thus converges.
Step 6. For proving the theorem it suffices to show that all
limit points of belong to J.
Assume the converse. By (2.2.15) we have
for all large if is small enough. By (2.2.1) it follows that
and from (2.2.24)
for small enough. This leads to a contradiction because converges and the left-hand side of (2.2.32) tends to zero. Thus, we conclude
Remark 2.2.3 In (2.1.1)–(2.1.3) spheres with expanding radii are used for the truncations. Obviously, the spheres can be replaced by other expanding sets. At first glance the point in (2.1.1) may be arbitrarily chosen, but actually the restriction is imposed by the existence of such that. The condition is obviously satisfied if, because the availability of is not required.
Remark 2.2.4 In the proof of Theorem 2.2.1 it can be seen that the conclusion remains valid if in A2.2.2 the requirement "J is the zero set" is removed. As a matter of fact, J may be bigger than the zero set; of course, it should at least contain the zero set in order for (2.2.1) to be satisfied. It should also be noted that for this we need not require to be nowhere dense.
Let us modify A2.2.2 as follows.
A2.2.2’ There is a continuously differentiable function such that
for any, and is nowhere dense. Further, the point used in (2.1.1) is such that for some and
A2.2.2” There is a continuously differentiable function such that
for any, and J is closed. Further, the point used in (2.1.1) is such that for some and
Notice that in A2.2.2’ and A2.2.2” the set J is not specified, but it certainly contains the root sets of both and. We may modify Theorem 2.2.1 as follows.
Theorem 2.2.1’ Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume A2.2.1, A2.2.2’, A2.2.3, and A2.2.4 hold. Then
for the sample path for which A2.2.3 holds.
Proof. The proof of Theorem 2.2.1 applies without any change.
Theorem 2.2.1” Let be given by (2.1.1)–(2.1.3) for a given initial value. If A2.2.1, A2.2.2”, A2.2.3, and A2.2.4 hold, then
for the sample path for which A2.2.3 holds.
Proof. We still have Steps 1–3 in the proof of Theorem 2.2.1. Let
If or, or both, do not belong to J, then exists such that, since J is closed. Then would cross infinitely many times. But, by Step 3 of the proof of Theorem 2.2.1, this is impossible. Therefore both and belong to
Theorems 2.2.1 and 2.2.1’ only guarantee that the distance betweenand the set J tends to zero. As a matter of fact, we have more
precise result.
Theorem 2.2.2 Assume the conditions of Theorem 2.2.1 or Theorem 2.2.1’ hold. Then for fixed and for which A2.2.3 holds, a connected subset exists such that
where denotes the closure of and is generated by (2.1.1)–(2.1.3).
Proof. Denote by the set of limit points. Assume the converse, i.e., that the limit set is disconnected. In other words, closed sets and exist such that and
Define
Since a exists such that
where denotes the neighborhood of the set A. Define
It is clear that and
Since by we have
By boundedness of the estimates we may assume that converges. Then, by taking in (2.2.15), we derive
which contradicts (2.2.33) and proves the theorem.
Corollary 2.2.1 If J is not dense in any connected set, then under the conditions of Theorem 2.2.1, the sequence given by (2.1.1)–(2.1.3) converges to a point. This is because in the present case any connected subset consists of a single point.
Example 2.2.1 Reconsider the example given in Section 2.1:
It was shown that the RM algorithm rapidly diverges even in the noise-free case.
We now assume the observations are noise-corrupted:
where the noise is an ARMA process driven by independent identically distributed normal random variables:
where
We use the algorithm (2.1.1)–(2.1.3) with. The
computation shows
which tend to the sought-for root 10.
Example 2.2.2 Let. Then
Clearly, A2.2.1 and A2.2.4 hold. Concerning A2.2.2, we may take to serve as. Since
(2.2.1) is satisfied. The existence of required in A2.2.2 is obvious; for example,
Finally, the set is nowhere dense. So A2.2.2 also holds.
Now assume the noise is such that
Then A2.2.3 is satisfied too.
By Corollary 2.2.1, the sequence given by (2.1.1)–(2.1.3) converges to a point
If for the conventional (untruncated) RM algorithm
it is a priori known that the sequence is bounded, then we have the following theorem.
Theorem 2.2.3 Assume A2.2.1–A2.2.4 hold, but with the requirement in A2.2.2 that "the point used in (2.1.1) is such that for some and" removed. If the sequence produced by (2.2.34) is bounded, then for the sample path for which A2.2.3
holds, where is a connected subset of
Proof. As a matter of fact, by boundedness of the sequence, (2.2.3) and (2.2.15) become obvious. Steps 3, 5, and 6 in the proof of Theorem 2.2.1 remain unchanged, while Step 4 is no longer needed. Then the conclusion follows from Theorems 2.2.1 and 2.2.2.
Remark 2.2.5 All theorems concerning SA algorithms with expanding truncations remain valid for the sequence produced by (2.2.34), if it is known to be bounded.
Theorems 2.2.1 and 2.2.2 concern a time-invariant function, but the results can easily be extended to time-varying functions, i.e., to the case where the measurements are carried out for
where the function depends on time.
Conditions A2.2.2 and A2.2.4 are respectively replaced by the following conditions:
A2.2.2° There is a continuously differentiable function such that
for any, and is nowhere dense,
where and, and denotes the gradient. Further, the point used in (2.1.1) is such that for some and
A2.2.4' The functions are measurable and uniformly locally bounded, i.e., for any constant
Theorem 2.2.4 Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume A2.2.1, A2.2.2°, and A2.2.4' hold. Then
for the sample path for which A2.2.3 holds, where is a connected subset of
Proof. It suffices to replace by everywhere in the proofs of Theorems 2.2.1 and 2.2.2.
Remark 2.2.6 If it is known that the sequence given by an SA algorithm evolves in a subspace S, then it suffices to verify A2.2.2, A2.2.2’, A2.2.2”, and A2.2.2° in the subspace S in order for the corresponding conclusions about convergence to hold. For example, in this case A2.2.2 changes to
A2.2.2(S): There is a continuously differentiable function such that for any, and
is nowhere dense. Further, the point used in (2.1.1) is such that for some and
According to Remark 2.2.4, here J is not specified. Then, with A2.2.2 and replaced by A2.2.2(S) and respectively, Theorem 2.2.1 together with Theorem 2.2.2 asserts that
2.3. Convergence Under State-IndependentConditions
In the last section we have established convergence theorems under general conditions. These theorems take a sample-path-based form: under A2.2.1, A2.2.2, and A2.2.4, convergence takes place at those sample paths for which A2.2.3 holds. Condition A2.2.3 looks rather complicated, but it
is so weak that it is necessary, as will be shown later. However, condition A2.2.3 is state-dependent in the sense that the condition itself depends on the behavior of the estimates. This makes it not always possible to verify the condition beforehand. We now give convergence theorems under conditions with no state involved. For this we have to reformulate Theorems 2.2.1 and 2.2.2.
As defined in Section 2.2, where is a measurable function. In lieu of A2.2.3 we introduce the following condition.
A2.3.1 For any sufficiently large integer there is an with such that for any
for any such that converges.
Theorem 2.3.1 Assume A2.2.1, A2.2.2, A2.2.4, and A2.3.1 hold. Then a.s. for generated by (2.1.1)–(2.1.3) with a given initial value, where is a connected subset contained in the closure of J.
Proof. Let It is clear that
i.e., Then for any
A2.2.3 is fulfilled with possibly depending on and the conclusion of the theorem follows from Theorems 2.2.1 and 2.2.2.
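The expanding-truncation scheme behind (2.1.1)–(2.1.3) can be illustrated numerically. The sketch below is a hypothetical rendering under assumed standard notation: step sizes a_k = 1/k, truncation bounds M_j increasing to infinity, and a fixed reset point x*. The concrete regression function f(x) = -x and the Gaussian noise are illustrative choices, not taken from the text.

```python
import numpy as np

# Hypothetical sketch of an RM iteration with expanding truncations:
# if the candidate step leaves the current bound M_sigma, reset to x*
# and expand the bound; otherwise take an ordinary RM step.
def expanding_truncation_rm(f, x0, n_steps, M=None, x_star=0.0, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    M = M or (lambda j: 2.0 ** j)       # truncation bounds M_j -> infinity
    x, sigma = x0, 0                    # sigma counts truncations so far
    for k in range(1, n_steps + 1):
        a_k = 1.0 / k                   # decreasing step sizes (cf. A2.2.1)
        y = f(x) + rng.normal()         # noisy observation of f at x_k
        cand = x + a_k * y
        if abs(cand) <= M(sigma):
            x = cand                    # ordinary RM step
        else:
            x, sigma = x_star, sigma + 1  # reset and expand the bound
    return x, sigma

x_final, n_trunc = expanding_truncation_rm(lambda x: -x, x0=5.0, n_steps=20000)
```

For f(x) = -x the unique root is 0, and the iterate settles near it; with a well-chosen initial bound the truncations stop after finitely many steps, as the theorems of this chapter assert.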
We now introduce a state-independent condition on noise.
A2.3.2 For any is a martingale difference sequence and for some
where is a family of nondecreasing independent of
We first give an example satisfying A2.3.2. Let be an -dimensional martingale difference sequence with
Stochastic Approximation Algorithms withExpanding Truncations 43
for some and let be a measurable and locally bounded function. Then
satisfies A2.3.2, because
and
by assumption.
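A small simulation can illustrate this class of noises. Everything concrete below is an assumed toy instance, not from the text: g is a bounded (hence locally bounded) measurable function and the w's are i.i.d. zero-mean, so the noise g(x_k)w_{k+1} is a martingale difference sequence, and the RM iteration for f(x) = -x still converges while the weighted noise series stays bounded.

```python
import numpy as np

# Toy instance (assumed, not from the text): state-dependent noise
# eps_{k+1} = g(x_k) * w_{k+1}, with g measurable and bounded and {w_k}
# i.i.d. zero-mean, so {eps_k} is a martingale difference sequence.
rng = np.random.default_rng(1)
g = lambda x: 1.0 + abs(np.sin(x))    # measurable, (locally) bounded
x, series = 3.0, 0.0
for k in range(1, 50001):
    a_k = 1.0 / k
    eps = g(x) * rng.normal()         # noise depends on the current state
    series += a_k * eps               # partial sum of sum_k a_k * eps_{k+1}
    x = x + a_k * (-x + eps)          # RM step for f(x) = -x, root at 0
```

The variable `series` tracks the weighted noise sums whose convergence drives condition A2.3.1 in the proof of Theorem 2.3.2.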
Theorem 2.3.2 Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume A2.2.1, A2.2.2, A2.2.4, and A2.3.2 hold and
for given in A2.3.2. Then a.s., where is a
connected subset contained in
Proof. Since is measurable and is it follows that is adapted. Approximating by simple functions, it is seen that
Hence, is a martingale difference sequence, and
a.s.
By the convergence theorem for martingale difference sequences, the series
converges a.s., which implies that with exists such that for each
converges to zero as uniformly in This means that A2.3.1 holds, and the conclusion of the theorem follows from Theorem 2.3.1.
In applications it may happen that is not directly observed. Instead, the time-varying functions are observed, and the observations may be taken not at but at i.e., at with bias
Theorem 2.3.3 Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume that A2.2.1, A2.2.2, A2.2.4, and A2.3.2 hold and
for p given in A2.3.2. Further, assume is an adapted sequence, is bounded by a constant, and for any sufficiently large integer there exists with such that for any
for any such that converges. Then, a.s.,
where is a connected subset contained in
Proof. By assumption where is a constant. Then
and again by the convergence theorem for martingale difference sequences, the series
converges a.s. Consequently, there exists with such that for any the convergence indicated in (2.3.5) holds and for any integer
tends to zero as uniformly in
Therefore, A2.3.1 is fulfilled and the conclusion of the theorem follows
from Theorem 2.3.1.
Remark 2.3.1 The obvious sufficient condition for (2.3.5) is
which in turn is satisfied if is continuous and
Remark 2.3.2 Theorems 2.3.2 and 2.3.3 with A2.2.2 and A2.2.4 replaced by A2.2.2° and A2.2.4’, respectively, remain valid if is replaced by the time-varying
2.4. Necessity of Noise Condition
Under Conditions A2.2.1–A2.2.4 we have established convergence theorems for recursively given by (2.1.1)–(2.1.3). Condition A2.2.1 is a commonly accepted requirement for decreasing step sizes, while A2.2.2 is a stability condition. This kind of condition is unavoidable for convergence of SA-type algorithms, although it may appear in different forms. Concerning A2.2.4 on it is the weakest possible: neither continuity nor a growth rate restriction on is required. So it is natural to ask: is it possible to further weaken Condition A2.2.3 on the noise? We now answer this question.
Theorem 2.4.1 Assume has only one root, i.e., and is continuous at Further, assume A2.2.1 and A2.2.2 hold. Then given by (2.1.1)–(2.1.3) converges to at those sample paths for
which one of the following conditions holds:
i)
ii) can be decomposed into two parts such that
and
Conversely, if then both i) and ii) are satisfied.
Proof. Sufficiency. It is clear that ii) implies i), which in turn implies A2.2.3. Consequently, sufficiency follows from Theorem 2.2.1.
Necessity. Assume Then is bounded and (2.1.1)–(2.1.3) turns into the RM algorithm after a finite number of steps (for
. Therefore,
where
Since and is continuous, Condition ii) is satisfied. And Condition i), being a consequence of ii), also holds.
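The necessity direction can be checked numerically in a toy setting (all specifics below are assumed, not the book's): for f(x) = -x with unique root 0, step sizes a_k = 1/k and i.i.d. zero-mean noise, the iterate converges and the weighted noise partial sums settle down to a limit, in the spirit of conditions i) and ii).

```python
import numpy as np

# Toy check (assumed setup): RM algorithm for f(x) = -x with root 0,
# step sizes a_k = 1/k, i.i.d. zero-mean observation noise. The iterate
# converges, and the weighted noise sums s_n = sum_{k<=n} a_k*eps_{k+1}
# fluctuate less and less, i.e., they form a convergent (Cauchy) sequence.
rng = np.random.default_rng(2)
x, s, late_sums = 2.0, 0.0, []
for k in range(1, 100001):
    a_k = 1.0 / k
    eps = rng.normal()
    x += a_k * (-x + eps)          # RM step
    s += a_k * eps                 # weighted noise partial sum
    if k >= 90000:
        late_sums.append(s)        # record the late partial sums
spread = max(late_sums) - min(late_sums)   # Cauchy-type fluctuation
```

A small `spread` over the late window is the numerical face of the convergence of the weighted noise series that the theorem derives from convergence of the iterates.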
Remark 2.4.1 In the case where and is continuous at, under conditions A2.2.1, A2.2.2, and A2.2.3 by Theorem 2.2.1 we
arrive at Then by Theorem 2.4.1 we derive (2.4.1), which is stronger than A2.2.3. One may ask why the weaker condition A2.2.3 can imply the stronger condition (2.4.1). Are they equivalent? The answer
is both “yes” and “no”: yes, these conditions are equivalent, but only under the additional conditions A2.2.1, A2.2.2, and continuity of at the unique root of However, by themselves these conditions are not equivalent, because condition A2.2.3 is indeed weaker than (2.4.1).
We now consider the multi-root case. Instead of the singleton we now have a root set J. Accordingly, continuity of at is replaced by the following condition
In order to derive the necessary condition on the noise, we consider the linear interpolating function
where From form a family of functions, where
where is a constant.
For any subsequence define
where appearing on the right-hand side of (2.4.3) denotes the dependence of the limit function on the subsequence, and the limsup of a vector sequence is taken component-wise. In general, may be discontinuous.
However, if then
which is not only continuous but also differentiable.
Thus, (2.4.2) for the multi-root case corresponds to the continuity of
at for the single-root case, while and a certain
analytic property of correspond to
Theorem 2.4.2 Assume (2.4.2), A2.2.1, A2.2.2, and A2.2.4 hold. Then given by (2.1.1)–(2.1.3) is bounded, and the
right derivative for any convergent subsequence
if and only if condition A2.2.3 is satisfied, where is a connected subset of
Proof. Sufficiency. By Theorem 2.2.1 it follows that is bounded and We only need to show
Let be a convergent subsequence. Since is bounded,
the algorithm (2.1.1)–(2.1.3) becomes the one without truncations for large enough Therefore,
Notice that and hence
where as
Then from (2.4.5) we have
In (2.4.5) the last term tends to zero by A2.2.3, because is bounded and hence the indicator in (2.2.2) can be removed for sufficiently large By (2.4.2) the first term on the right-hand side of (2.4.7) also tends to zero as The left-hand side of (2.4.7) is
Consequently,
Necessity. We now assume is bounded, and
for any convergent subsequence and want to show A2.2.3. Let For any from (2.4.5) we have
From (2.4.6) it is seen that
where as
The assumption means that
where and
Noticing the continuity of from (2.4.10) and (2.4.11) it follows that
which combined with yields (2.4.9). Thus, we have
for any such that converges.
By the boundedness of (2.4.12) is equivalent to (2.2.2), and the
proof is completed.
Corollary 2.4.1 Assume (2.4.2), A2.2.1, A2.2.2, and A2.2.4 hold, and assume J is not dense in any connected set. Then given by (2.1.1)–(2.1.3) converges to some point in J if and only if A2.2.3 holds.
This corollary is a direct generalization of Theorem 2.4.1. The sufficiency part follows from Corollary 2.2.1, while the necessity part follows from Theorem 2.4.2 if one notices that convergence of implies
for sufficiently large The first term on the right-hand side of (2.4.8) tends to zero as by (2.4.2) and So, to verify A2.2.3 it suffices to
show that
Stochastic Approximation Algorithms withExpanding Truncations 49
2.5. Non-Additive Noise
In the algorithm (2.1.1)–(2.1.3) the noise in the observation is additive. In this section we continue considering (2.1.1)–(2.1.2), but in lieu of (2.1.3) we now have the non-additive noise
where is the observation noise at time
The problem is: under which conditions does the algorithm defined by (2.1.1), (2.1.2), and (2.5.1) converge to J, the root set of which is the average of with respect to its second argument? To be precise, let be a measurable function and let be a distribution function in The function is defined by
It is clear that the observation given by (2.5.1) can formally be expressed as one with additive noise:
and Theorems 2.2.1 and 2.2.2 can still be applied. The basic problem is how to verify A2.2.3. In other words, under which conditions on and does given by (2.5.3) satisfy A2.2.3?
Before describing the conditions to be used we first introduce some notation. We always take the regular version of the conditional probability. This makes the conditional distributions introduced later well-defined.
Let be the distribution function of and the conditional distribution of given where
Further, let us introduce the following coefficients,
where denotes the Borel σ-algebra in and, for a random variable where runs over all sets
with probability zero.
is known as the mixing coefficient of and it measures the dependence between and It is clear that measures the closeness of the distribution of to
The following conditions will be needed.
A2.5.2 (=A2.2.2);
A2.5.3 is a measurable function and is locally Lipschitz-continuous in the first argument, i.e., for any fixed
where is a constant depending on
A2.5.4 (Noise Condition)
i) is a process with mixing coefficient as uniformly in
ii)
where is defined in (2.5.6);
iii) as
Theorem 2.5.1 Assume A2.5.1–A2.5.4. Then for generated by (2.1.1), (2.1.2), and (2.5.1)
where is a connected subset of
The proof consists in verifying that Condition A2.2.3 is satisfied a.s. by given in (2.5.3). Then the theorem follows from Theorems 2.2.1 and 2.2.2.
We first prove some lemmas.
Lemma 2.5.1 Assume A2.5.1, A2.5.3, and A2.5.4 hold. Then there is an with such that for any and any bounded subsequence of say,
A2.5.1
as
(without loss of generality assume there exists an integer such that for all
if T is small enough, where is given by (2.1.1), (2.1.2), and (2.5.1), and is given by (1.3.2).
Proof. For any set
By setting in (2.5.6), it is clear that
From (2.5.7), it follows that
and
where (and hereafter) L is taken large enough so that
Since is a convergent martingale, there is a a.s.
such that
From (2.5.13) and it is clear that for any integer L the
series of martingale differences
converges a.s.
Denote by the set where the above series converges, and set
It is clear that
Let be fixed and with and
Then for any integer by (2.5.13) we have
where the first term on the right-hand side tends to zero as by (2.5.15).
Assume is sufficiently large such that
i) for if as or
ii) if
We note that in case ii) there will be no truncation in (2.1.1) for
Assume and fix a small enough T such that Let be arbitrarily fixed.
We prove (2.5.9) by induction. It is clear that (2.5.9) is true for Assume (2.5.9) is true for and there
is no truncation for if Noticing we have, by (2.5.16)
if is large enough.
This means that at time there is no truncation in (2.1.1), and
Lemma 2.5.2 Assume A2.5.1, A2.5.3, and A2.5.4 hold. There is an with such that if and if as
is a bounded subsequence of produced by (2.1.1), (2.1.2), and (2.5.1), then
Proof. Write
where
By (2.5.13), for we have
which converges to a finite limit as by the martingale convergence theorem.
Therefore, for any integers L and
converges a.s.
Therefore, there is with such that (2.5.23) holds for any integers L and
Let be fixed, By Lemma 2.5.1, for small
Then
for any by (2.5.23).
We now estimate (II). By Lemma 2.5.1 we have the following:
Noticing (2.5.7) and (2.5.14), we then have
Similarly, by Lemma 2.5.1 and (2.5.7)
Combining (2.5.18), (2.5.24), and (2.5.26) leads to
Therefore, to prove the lemma it suffices to show that the right-hand side of (2.5.27) is zero.
Applying the Jordan–Hahn decomposition to the signed measure,
and noticing that is a process with mixing coefficient we know that there is a Borel set D in such that for any
Borel set A in
and
Then, we have the following,
where
For any given there is a j such that
For any fixed by (2.5.13), (2.5.14), and it follows that
Therefore,
Since may be arbitrarily small, this combined with (2.5.27) proves the lemma.
Proof of Theorem 2.5.1. To prove the theorem it suffices to show that A2.2.3 is satisfied by
a.s. By Lemma 2.5.2, we need only prove that
for where is a bounded subsequence, and as
Assume
Applying the Jordan–Hahn decomposition to the signed measure,
we conclude that
where for the last inequality (2.5.8) and (2.5.12) are invoked. Since as the right-hand side of (2.5.32) tends to zero as for any This proves (2.5.31) and completes the proof
of Theorem 2.5.1.
Remark 2.5.1 From the expression (2.5.3) for the observation it is seen that the observation with non-additive noise can be reduced to the additive but state-dependent noise case, which was considered in Section 2.3. However, Theorem 2.5.1 is not covered by the theorems in Section 2.3, and vice versa.
2.6. Connection Between Trajectory Convergenceand Property of Limit Points
In the multi-root case, what we have established so far is that the distance between given by (2.1.1)–(2.1.3) and a connected subset of converges to zero under various sets of conditions.
As pointed out in Corollary 2.2.1, if J is not dense in any connected set, then converges to a point belonging to However, it is still not clear how behaves when J is dense in some connected set. The following example shows that may still fail to converge, although
Example 2.6.1 Let
and let
Take step sizes as follows:
We apply the RM algorithm (2.2.34) with As we may take
Then all conditions A2.2.1–A2.2.4 are satisfied.
Notice that
and
where k is such that
By (2.6.1), it is clear that in (2.6.2)
and
Therefore, is bounded and by Theorem 2.2.4.
As a matter of fact, changes from one to zero and then from zero to one, and this process repeats forever with decreasing step sizes.
Thus, is dense in [0,1]. This phenomenon indicates that for trajectory convergence of the stability-like condition A2.2.2 is not enough; a stronger form of stability is needed.
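The moral of Example 2.6.1, namely that distance convergence to a root set J does not force the trajectory itself to converge when J contains a continuum of roots, can be mimicked by a deterministic toy sequence (not the book's example): step back and forth across [0,1] with shrinking steps of size 1/k. The distance to J = [0,1] is identically zero, yet the sequence keeps sweeping the interval and never converges, because the step sizes are not summable.

```python
# Deterministic toy (not the book's Example 2.6.1): a sequence that stays
# in J = [0, 1], so d(x_k, J) = 0 for all k, yet oscillates between the
# endpoints forever and hence does not converge (sum of 1/k diverges).
def oscillating_sequence(n_steps):
    xs, x, direction, k = [], 0.0, +1, 1
    for _ in range(n_steps):
        x += direction * 1.0 / k       # step size 1/k, decreasing
        if x >= 1.0:
            x, direction = 1.0, -1     # bounce at the right endpoint
        elif x <= 0.0:
            x, direction = 0.0, +1     # bounce at the left endpoint
        xs.append(x)
        k += 1
    return xs

xs = oscillating_sequence(200000)
```

Even the last hundred thousand terms still sweep a substantial subinterval, so the sequence has a continuum of limit points despite d(x_k, J) = 0.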
Definition 2.6.1 A point i.e., a root of is called dominantly stable for if there exist a and a positive measurable function
which is bounded in the interval and satisfies the following condition
for all in the ball centered at with radius
Remark 2.6.1 Dominant stability implies stability. To see this, it suffices to take as the Lyapunov function. Then
The dominant stability of however, is not necessary for asymptotic stability.
Remark 2.6.2 Equality (2.6.3) holds for any whatever is. Therefore, all interior points of J are dominantly stable for Further, for a boundary point of J to be dominantly stable for it suffices to verify (2.6.3) for with small i.e., for all that are close to and outside J.
Example 2.6.2 Let
In fact, is the gradient of
In this example We now show that all points of J are dominantly stable for For this, by Remark 2.6.2, it suffices to show that all with are dominantly stable for and for this, it in turn suffices to show (2.6.3) for any with and for small enough Denoting by the angle between the vectors
and we have for
It is clear that
for all small enough Therefore, all points in J are dominantly stable for
Theorem 2.6.1 Assume A2.2.1, A2.2.2, and A2.2.4 hold. If for a
given is convergent and a limit point of generated
by (2.1.1)–(2.1.3) is dominantly stable for then for this trajectory
Proof. For any define
where is the one indicated in Definition 2.6.1.
It is clear that is well-defined, because there is a convergent subsequence: and for any greater than some If for any for some then by the arbitrariness of
Therefore, for proving the theorem, it suffices to show that, for any small an exists such that implies if
Since implies A2.2.3, all conditions of Theorem 2.2.1
are satisfied. By the boundedness of we may assume that is large enough so that the truncations no longer occur in (2.1.1)–(2.1.3) for It then follows that
Notice that for any and is bounded, and hence by (2.6.3)
for some because is convergent and
Further,
An argument similar to that used for (2.6.5) leads to
if is large enough.
Then from (2.6.6) we have
From (2.6.4) and (2.6.7) we see that we can inductively obtain
Then, noticing by the definitions of we have
where the elementary inequality
is used with for the first inequality in (2.6.8), and with for the third inequality in (2.6.8). Because is bounded,
and an exists such
that
This means that and completes the proof.
For convergence of SA algorithms we have imposed the stability-like condition A2.2.2 for and the dominant stability condition (2.6.3) for trajectory convergence. It is natural to ask: does a limit point of the trajectory possess a certain stability property? The following example gives a negative answer.
Example 2.6.3 Let
It is straightforward to check that
satisfies A2.2.2. Take where is a sequence of mutually independent random
variables such that a.s. Then with 1 being
a stable attractor for and all of A2.2.1–A2.2.4 are satisfied. Take Then by Theorem 2.2.1 it follows that
a.s. Since must converge to 0 a.s. Zero, however, is unstable for
In this example converges to a limit which is independent of the initial values and unstable, although conditions A2.2.1–A2.2.4 hold. This strange phenomenon happens because
as a function of is singular for some in the sense that it restricts the algorithm to evolve only in a certain set of Therefore,
in order for the limit of to be stable, imposing a certain regularity condition on and some restrictions on the noise is unavoidable.
As in Section 2.3, assume that the observation noise is with being a measurable function defined on Set
Let us introduce the following conditions:
A2.6.1 For a given is a surjection for any
A2.6.2 For any and is continuous in and for any and
where denotes the ball centered at with radius
It is clear that A2.6.2 is equivalent to A2.6.2’:
A2.6.2’ For any and any compact set
Before formulating Theorem 2.6.2 we first give some remarks on Conditions A2.6.1 and A2.6.2.
Remark 2.6.3 If does not depend on then in (2.6.9) can be removed when taking the supremum. In Condition A2.2.3
is a convergent subsequence, and hence is automatically located in a compact set. In the theorems of Sections 2.2, 2.3, 2.4, and 2.5, the initial value is fixed, and hence for fixed is a fixed sequence. In contrast, in Theorem 2.6.2 we will consider the case where the initial value varies arbitrarily, and hence for any fixed may be any point in If in (2.6.9) were not restricted to a compact set (i.e., with removed in (2.6.9)), then the resulting condition would be too strong. Therefore, putting in (2.6.9) makes the condition reasonable.
Remark 2.6.4 If is continuous and if then is a surjection.
By this property, is a surjection for a large class of For example, let be free of and let the growth rate of be no faster than linear as Then with satisfying A2.2.1 we have as for all Hence A2.6.1 holds. In the case where the growth rate of is faster than linear as and for some we also have that as for all and A2.6.1 holds.
In what follows, by stability of a set for we mean stability in the Lyapunov sense, i.e., a nonnegative continuously differentiable function
exists such that and for some where
Theorem 2.6.2 Assume A2.2.1, A2.2.2, and A2.6.2 hold, and that is continuous and for a given A2.6.1 holds. If defined by (2.1.1)–(2.1.3) with any initial value converges to a limit independent of
then belongs to the unique stable set of
Proof. Since by A2.2.2 and by continuity of exists with such that Hence By continuity of J is closed, and hence by A2.2.2,
Since we must have Denote by the connected subset of containing The minimizer set of that contains is closed and is contained in Since is a connected set and by A2.2.2 is nowhere dense, is a constant.
By continuity of all connected root-sets are closed and they are separated. Thus, there exists a such that i.e., contains no root of other than those located in
Set
Then and
Therefore, by definition, is stable for
We have to show that and that is the unique stable root-set.
Let be the connected set of such
that contains By continuity of for an arbitrarily small there exist such that and the distance
between the interval and the set is positive, i.e.,
We first show that, for any and there exist and such that, for any if then
By Theorem 2.2.1, for with sufficiently large there will be no truncation in (2.1.1)–(2.1.3), and
For any let By A2.6.2, sufficiently small
and large enough exist such that for any
If for then (2.6.10) immediately follows by setting Assume for some
Let be the first such one. Then
By (2.6.11), however,
which contradicts (2.6.12). Thus and (2.6.10) is verified.
For a given we now prove the existence of such that for any if where the dependence of
on and on the initial value is emphasized. For simplicity of writing, is written as in the sequel.
Assume the assertion is not true; i.e., for any there exists such that and for some
Suppose and
If there exists an with then with exists because is connected and with
This yields a contradictory inequality:
where the first inequality follows from A2.2.2, while the second inequality holds because is the minimizer of
Consequently, for any and
and a subsequence of exists, also denoted by for notational simplicity, such that By the continuity
of
Hence, by the fact
By (2.6.10) and the fact we can choose sufficiently
small T and large enough N such that
and i.e.,
for any By (2.6.10), exists with the property such that
Because as for sufficiently large N, by (2.6.10) the last term of (2.6.15) is Then
By (2.6.10) and the continuity of the third term on the right-hand side of (2.6.16) is and by A2.6.2 (since
with for all sufficiently large N), the norm of the second term on the right-hand side of (2.6.16) is also as Hence by A2.2.2 and (2.6.13), some exists such that the right-hand side of (2.6.16) is less than for all sufficiently large N if T is small enough. By noticing and mentioned
above, from (2.6.14) it follows that the left-hand side of (2.6.16) tends to a nonnegative limit as The obtained contradiction shows that exists such that for any if
With fixed, for any by A2.6.1 there exists such that By and the arbitrary smallness of it follows that Since by assumption, we have
which means that is stable. If another stable set existed such that then by the same argument would belong to The contradiction shows the uniqueness of the stable set.
2.7. Robustness of Stochastic ApproximationAlgorithms
In this section, for the single-root case, i.e., the case we consider the behavior of SA algorithms when the conditions for convergence of the algorithms to are not exactly satisfied. It will be shown that a “small” violation of the conditions causes no big effect on the behavior of the algorithm.
The following result, known as the Kronecker lemma, will be used several times in the sequel. We state it separately for convenience of reference.
Kronecker Lemma. If the series \sum_{k=1}^{\infty} x_k / b_k converges, where \{b_k\} is a sequence of positive numbers nondecreasingly diverging to infinity and \{x_k\} is a sequence of matrices, then
\frac{1}{b_n} \sum_{k=1}^{n} x_k \to 0 as n \to \infty.
Proof. Set s_0 = 0 and s_n = \sum_{k=1}^{n} x_k / b_k. Since s_n converges to some limit s, for any \epsilon > 0 there is an N such that \|s_k - s\| < \epsilon if k \ge N. Then, by a partial summation, it follows that
\frac{1}{b_n} \sum_{k=1}^{n} x_k = \frac{1}{b_n} \sum_{k=1}^{n} b_k (s_k - s_{k-1}) = s_n - \frac{1}{b_n} \sum_{k=1}^{n-1} (b_{k+1} - b_k) s_k,
and the last term tends to s as n \to \infty, since the weights (b_{k+1} - b_k)/b_n are nonnegative and sum to (b_n - b_1)/b_n \to 1 while s_k \to s. Hence \frac{1}{b_n} \sum_{k=1}^{n} x_k \to s - s = 0, and the lemma is proved.
We still consider the algorithm given by (2.1.1)–(2.1.3), where the iterate denotes the estimate at time but now the sought point may not be the exact root of As a matter of fact, the following set of conditions will be used to
replace A2.2.1–A2.2.4:
A2.7.1 nonincreasingly tends to zero, and
exists such that
A2.7.2 There exists a nonnegative twice continuously differentiable function such that and
A2.7.3 For the sample path the observation noise satisfies the following condition
A2.7.4 is continuous, but is not necessarily the root of
Comparing A2.7.1–A2.7.4 with A2.2.1–A2.2.4, we see that the following conditions required here are not assumed in Section 2.2: the nonincreasing
property of condition (2.7.1), nonnegativity of divergence of to infinity and continuity of but in (2.7.2), in (2.7.3), and
are allowed to be greater than zero.
Concerning we note that from the convergence of
it follows that i) A2.2.3 holds and ii) by the Kronecker lemma
because is nonincreasing. We will demonstrate
how the deviation of the estimate given by (2.1.1)–(2.1.3) from depends on and
For used in (2.1.1) define Since as can be taken sufficiently large such that
Let the initial truncation bound used in (2.1.1) and (2.1.2) be large enough that
Take real numbers such that
Since is continuous, an exists such that
Denote
and
where denotes the matrix consisting of the second partial derivatives of
Since we have for any and hence
and
Set
We will only consider those in (2.7.2) for which where is given in (2.7.7). From (2.7.7) and (2.7.8) it is seen that
Consequently, by (2.7.2), the given by (2.7.12) is positive.
By continuity of and and exist
such that the following inequalities hold:
By A2.7.3, for can be taken sufficiently large such that
Lemma 2.7.1 Assume A2.7.1, A2.7.2, and A2.7.4 hold with given in (2.7.3) being less than or equal to If for given by (2.1.1)–(2.1.3) with (2.7.5) fulfilled, for some where K is given in (2.7.18), then for any
Proof. Because is nondecreasing as T increases, it suffices to prove the lemma for
Assume the converse: there exists an such that
Then for any we have
and hence
which combined with the definition of leads to
On the other hand, from (2.7.20) and (2.7.21) it follows that
From (2.7.9) we have
By a partial summation we have
Applying (2.7.3) to the first two terms on the right-hand side of (2.7.25), and (2.7.1) and (2.7.3) to the last term, we find
From (2.7.24) and (2.7.26) it then follows that
which contradicts (2.7.22). This proves the lemma.
Lemma 2.7.2 Under the conditions of Lemma 2.7.1, for any the following estimate holds:
Proof. Since by Lemma 2.7.1 we have
and hence
Consequently, we have
Lemma 2.7.3 Assume A2.7.1–A2.7.4 hold and satisfies (2.7.7). Then for the sample path for which A2.7.3 holds, a that is independent of
and exists such that
in other words, given by (2.1.1)–(2.1.3) is bounded.
Proof. Let be a sufficiently large integer such that
where K is given by (2.7.18).
Assume the lemma is not true. Then there exist and such
that Let be the maximal integer satisfying the following equality:
Then by definition we have
and by (2.7.28) and (2.7.29),
We first show that under the converse assumption there must be an such that
Otherwise, for any and from (2.7.24) it follows that
This together with (2.7.30) implies
which contradicts the converse assumption. Hence (2.7.31) must hold.
By the definition of (2.7.6), and (2.7.30) we have
Since by (2.7.31), from (2.7.4) and (2.7.6) it follows that
We now show For this it suffices to prove by noticing (2.7.34).
Since, similarly to (2.7.32), we have
and hence
From (2.7.32) and (2.7.36) it is seen that
where for the second inequality (2.7.9) and are used, while for the last inequality (2.7.18) is invoked.
Paying attention to (2.7.10), we have and and by (2.7.16)
Then by (2.7.32) we see and (2.7.34) becomes
Thus, we can define
and have
Taking in Lemmas 2.7.1 and 2.7.2, and paying attention to (2.7.4) and we know By Lemmas 2.7.1 and 2.7.2, from (2.7.28) we see From (2.7.28)–(2.7.30) we have obtained which together with the definition of implies and hence Therefore, is well defined, and by Taylor’s expansion we have
where with components located in-between and
We now show that which, as will be shown, implies
a contradiction.By Lemma 2.7.2 we have
and hence
By (2.7.10) it follows that and by (2.7.11).
Using Lemma 2.7.1, we continue (2.7.41) as follows:
Noticing we see
It is clear that (2.7.35) and (2.7.37) remain valid with replaced
by Hence, similarly to (2.7.37) we have
By (2.7.11) and Taylor’s expansion we have
and consequently,
and
By (2.7.40), Substituting (2.7.44) into (2.7.43) and using (2.7.12) leads to
Estimating by a treatment similar to that used for
(2.7.26) yields
Noticing by Lemma 2.7.2 we find that
and
Hence, and by (2.7.15) from (2.7.45) it follows that
Using (2.7.14), from the above estimate we have
From (2.7.18) it follows that Taking notice of (2.7.13), by (2.7.17) we derive
On the other hand, by Lemma 2.7.2 and (2.7.11), (2.7.17), and (2.7.44) it follows that
where
From (2.7.39), (2.7.40), and (2.7.48) we see that
and hence which contradicts (2.7.47). This means that the converse assumption of the lemma cannot hold.
Corollary 2.7.1 From Lemma 2.7.3 it follows that there exist and which is independent of and arbitrarily varying in
intervals and such that
and for with sufficiently large the algorithm (2.1.1)–(2.1.3) turns into an ordinary RM algorithm:
Set
Take and denote
By A2.7.2, Set
If in (2.7.2), then In the general case may be positive.
Theorem 2.7.1 Assume A2.7.1–A2.7.4 hold and is given by (2.1.1)–(2.1.3) with (2.7.5) fulfilled. Then there exist and a nondecreasing, left-continuous function defined on such that for the sample path for which A2.7.3 holds,
whenever and where and are the ones appearing in (2.7.2) and (2.7.3), respectively. As a matter of fact, can be taken as
the inverse function of
Proof. Given recursively define
We now show that exists such that
Set and assume
From the recursion of we have
Assume is large enough such that by A2.7.3
By a partial summation, from (2.7.57) we find that
where (2.7.58) is invoked.
By (2.7.1) we see
Without loss of generality we may assume Then by (2.7.1) we have
Applying (2.7.60) and (2.7.61) to (2.7.59) leads to
and hence
which implies (2.7.56).
For and by (2.7.53)
Taking this into account, for by (2.7.51)–(2.7.54) and Taylor’s expansion we have
Therefore, in the following Taylor’s expansion
we have and hence and
Denote
For we have
From (2.7.63) and (2.7.64) it then follows that
Similar to (2.7.62), we see that
Consequently, we arrive at
Define
It is clear that is nondecreasing as increases and
Take such that Then we have
Define function
It is clear that is left-continuous, nondecreasing and
From (2.7.66) and (2.7.67) it follows that
which implies, by (2.7.57) and the definition of
Corollary 2.7.2 If in (2.7.2) may not be zero), then and the right-hand side of (2.7.55) will be
Since may be arbitrarily small, the estimation error may be arbitrarily small. If, in addition, in A2.7.3, then
letting and then in both sides of (2.7.55), we derive
In the case where by letting the right-hand side of (2.7.55) converges to
Consequently, as the estimation error depends on how big is If in (2.7.2), then can also be taken
arbitrarily small and the estimation error depends on the magnitude of
2.8. Dynamic Stochastic Approximation
So far we have discussed the root-searching problem for an unknown function, which is unchanged during the process of estimation. We now consider the case where the unknown functions together with their roots change with time. To be precise, let be a sequence of unknown
functions with roots i.e., Let be the estimate for at time based on the observations
Assume the evolution of the roots satisfies the following equation
where are known functions, while is a sequence of dynamic noises.
The observations are given by
where is the observation noise and is allowed to depend on
In what follows the discussion is for a fixed sample, and the analysis is purely deterministic. Let us arbitrarily take as the estimate for and define
From equation (2.8.1), we see that may serve as a rough estimate for In the sequel, we will impose some conditions on and so that
where is an unknown constant. Therefore, should not diverge to infinity. But is unknown, so we will use the expanding truncation technique.
Take a sequence of increasing numbers satisfying
Let be recursively defined by the following algorithm:
where denotes the number of truncations in (2.8.5) that have occurred up to time
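The truncation device itself is easy to express in code. The following sketch assumes the standard expanding-truncation form of Section 2.1 adapted to the present setting: the update is accepted while it stays inside the current truncation ball, and otherwise the estimate is reset to a fixed point and the ball is enlarged. The function name and the scalar setting are illustrative choices, not notation from the book.

```python
def expanding_truncation_step(x_pred, a_k, obs, sigma, M, x_star=0.0):
    """One schematic step of an expanding-truncation update.

    x_pred : predicted root (in the dynamic case, the known dynamics
             applied to the current estimate), a_k : step size,
    obs    : noisy observation, sigma : number of truncations so far,
    M      : increasing radii with M[j] -> infinity,
    x_star : fixed point used after a reset.
    """
    candidate = x_pred + a_k * obs
    if abs(candidate) <= M[sigma]:
        return candidate, sigma        # accepted: no truncation
    return x_star, sigma + 1           # truncated: reset and enlarge the ball
```

Because the radii tend to infinity and (as shown below) the number of truncations is finite, the algorithm eventually behaves like an untruncated RM scheme.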
We list conditions to be used.
A2.8.1 and
A2.8.2 is measurable and for any a constant possibly depending on exists such that
for with
A2.8.3 is known such that
for where
and
A2.8.4 and
A2.8.5 There is a continuously differentiable function such that for and for any
where is a positive constant possibly depending on and A constant exists such that
where is an unknown constant that is an upper bound for
A2.8.6 For any convergent subsequence the observation noisesatisfies
where
Remark 2.8.1 Condition A2.8.2 implies local boundedness, but the upper bound should be uniform with respect to In A2.8.3, measures the difference between the estimation error and the
prediction error In general, is greater than For example, if then A2.8.3 holds with A2.8.4 means that the noise in the root dynamics should be vanishing.
Like A2.2.2, Condition A2.8.5 is about the existence of a Lyapunov function. Imposing such a condition is unavoidable in the convergence analysis of SA algorithms. Inequality (2.8.7) is a mild condition. For example, if as then this condition is automatically satisfied. The noise condition A2.8.6 is similar to A2.2.3.
Before analyzing the convergence properties of the algorithm (2.8.5), (2.8.6), and (2.8.2), we give an example of application of dynamic stochastic approximation.
Example 2.8.1 Assume that a chemical product is produced in batch mode, and the product quality or quantity of the batch depends on the temperature in the batch. When the temperature equals the ideal one, the product is optimized. Let denote the deviation of the temperature from its optimal value for the batch, where denotes the control parameter, which may be, for example, the pressure in the batch, the quantity of catalytic promoter, the raw material proportion, and so on. The deviation reduces to zero if the control equals its optimal value i.e., Because of environmental changes, the optimal parameter may change from batch to batch. Assume
where is known and is the noise.
Let be the estimate for Then may serve as a prediction for Apply as the control parameter for the batch. Assume that the temperature deviation of for the th batch can be observed, but the observation may be corrupted by noise, i.e., where is the observation noise.
Then we can apply algorithm (2.8.5), (2.8.6), and (2.8.2) to estimate Under conditions A2.8.1–A2.8.6, by Theorem 2.8.1 to be proved in
this section, the estimate is consistent, i.e.,
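A toy numerical version of this example (a hedged sketch: the linear root dynamics f, the noise level, and the linear observation model are illustrative choices, and the truncation device is omitted because the iterates stay bounded here) shows the estimated control tracking the drifting optimum:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(r):                    # known root dynamics: slow deterministic drift
    return r + 1e-3

r = 0.0                      # true optimal control for the first batch
u = 5.0                      # initial guess for the controller
for k in range(1, 20001):
    # noisy deviation observed when control u is applied in batch k
    y = -(u - r) + 0.2 * rng.standard_normal()
    u = f(u + y / k)         # correct the estimate, then predict the next root
    r = f(r)                 # the true optimum drifts as well
tracking_error = abs(u - r)  # small after many batches
```

In this linear setting the error obeys a plain averaging recursion, so the estimate follows the moving root even though the root never stops drifting.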
Theorem 2.8.1 Under Conditions A2.8.1–A2.8.6 the estimation error tends to zero as where is given by (2.8.5),
(2.8.6), and (2.8.2).
To prove the theorem we start with lemmas.
Lemma 2.8.1 Under A2.8.3 and A2.8.4, the sequence is bounded for any
Proof. By A2.8.3 and A2.8.4 from (2.8.1) it follows that
Lemma 2.8.2 Assume A2.8.1–A2.8.4 and A2.8.6 hold. Let be a convergent subsequence such that as Then, there are a sufficiently small and a sufficiently large integer such that for
where is implied by
for where is a constant independent of
Proof. In the case as is bounded, and hence is bounded. By Lemma
2.8.1, is bounded. Therefore, is bounded. For large and
The following expression (2.8.11) and estimate (2.8.12) will frequently be used. By (2.8.1) and A2.8.3 we have
and
Substitution of (2.8.12) into (2.8.10) leads to
By boundedness of and A2.8.3,
for some By A2.8.4, while the last term is also
less than by A2.8.6. Without loss of generality, we may assume
Therefore, and the lemma is true for the case
We now consider the case as Let be so large that for
with being a constant, and
where is given by (2.8.8).
as
Without loss of generality we may assume
Define and take T so small that We prove the lemma by induction.
By (2.8.8) and (2.8.12), we have
Therefore, at time there is no truncation. Then by (2.8.11) and (2.8.12) we have
where (2.8.14) and (2.8.15) have been used. Let the conclusions of the lemma hold for
We prove that it also holds for Again by (2.8.12), we have
Hence there is no truncation at time By the inductive assumption, (2.8.11) and (2.8.12), it follows that
where (2.8.13) and (2.8.14) are invoked. Therefore, the conclusions of the lemma are also true for This
completes the proof.
Lemma 2.8.3 Assume A2.8.1–A2.8.6 hold. Then the number of truncations in (2.8.5) is finite and is bounded.
Proof. Using the argument in the proof of Lemma 2.8.2, the boundedness of follows from the boundedness of the number of truncations. Hence, it suffices to show that as
Assume the converse: as This means that the sequence is unbounded. Let be the sequence of truncation times. We prove that is also unbounded if
Assume is bounded. Then is also bounded. From we select a convergent subsequence, denoted by the same for notational simplicity, such that By assumption, truncation happens at the next time The obtained contradiction shows the unboundedness of in the case
Since algorithm (2.8.5) returns back to for infinitelymany times. Let Then
By Lemma 2.8.1, is bounded and by (2.8.8), Because is unbounded, starting from will exit the ball
with radius where is given by (2.8.7). Therefore, there is an interval and for any there is a sequence,
such that for and In other words, the values of at
the sequence cross the interval from the left. It is clear that Select from a convergent subsequence, denoted still by such that as
It is clear that
From now on, assume is large enough and T is small enough so that
Lemma 2.8.2 is applicable and it is valid with replaced bySince converges, by A2.8.5 and (2.8.12) it follows that
as Hence we have
By Lemma 2.8.2, forNoticing for small T we then have
In the following Taylor's expansion, is located between and and by Lemma 2.8.2, By (2.8.9)
and (2.8.11) we have
Notice that by Lemma 2.8.2 and (2.8.13)
for sufficiently large From (2.8.21) and (2.8.23), it follows that
On the other hand, by Lemma 2.8.2
Identifying and in A2.8.5 with and respectively, we can find such that
by A2.8.5.
Let us consider the right-hand side of (2.8.22). Noticingby A2.8.3 and A2.8.4 we have
By A2.8.6,
Noticing that
as and by continuity of we find that tends to zero as and
Since the sum of the first and second terms on the right-hand side of (2.8.22) is as and Combining this with (2.8.26) yields the following conclusion: for
with sufficiently large and for small enough T from (2.8.22) itfollows that
By (2.8.20), letting tend to infinity, from (2.8.30) we derive
By Lemma 2.8.2 we have
However, by definition,and Hence from (2.8.32), we must have
if T is small enough. Therefore, This contradicts (2.8.31). The obtained contradiction shows that
Theorem 2.8.2 Assume A2.8.1–A2.8.6 hold. Then the estimation error tends to zero as
Proof. We first show that converges. Assume the converse:
where because is bounded by Lemma 2.8.3. It is clear that there exists an interval that does not contain zero such that Without loss of generality, assume
From A2.8.6, it follows that there are infinitely manysequences such that and thatfor
Without loss of generality we may assume converges: Since exists such that and by
Lemma 2.8.2, Completely the same argument as that used for (2.8.22)–(2.8.32) leads to a contradiction. Hence is convergent.
We now show that as Assume the converse: there is a subsequence By the same argument we again arrive at (2.8.30). Letting by convergence of we obtain a contradictory inequality This implies that as
The following theorem is similar to Theorem 2.4.1.
Theorem 2.8.3 Assume A2.8.1–A2.8.5 hold and is continuous at uniformly in Then as if and only if A2.8.6
holds. Furthermore, under conditions A2.8.1–A2.8.5, the following threeconditions are equivalent.
1) Condition A2.8.6;
2)
3) can be decomposed into two parts: so that
Proof. Assume as Then is bounded. Wehave shown in the proof of Lemma 2.8.3 that the number of truncationsmust be finite if is bounded. Therefore, starting from some thealgorithm (2.8.5) becomes
and as
From (2.8.11) we have
Set
By A2.8.3 and A2.8.4 and as while tends to zero because
is uniformly continuous at and Consequently,3) holds.
On the other hand, it is clear that 3) implies 2), which in turn implies A2.8.6. By Theorem 2.8.1, under A2.8.1–A2.8.5, Condition A2.8.6 implies as
Thus, the equivalence of 1)–3) has been justified under A2.8.1–A2.8.5.
2.9. Notes and References

The initial version of SA algorithms with expanding truncations and
its associated analysis method were introduced in [27], where the algorithm was called SA with randomly varying truncations. Convergence results for algorithms of this kind can also be found in [14, 28]. Theorems given in Section 2.2 are improved versions of those given in [14, 27, 28]. Theorems in Section 2.3 can be found in [18]. Necessity of the noise condition is proved in [24, 94] for the single-root case, and in [17] for the multi-root case.
Convergence results for SA algorithms with additive noise can be found in [16]. Concerning measure theory, we refer to [31, 76, 84]. Results given in Section 2.6 can be found in [48], and some related problems are discussed in [3]. For the proof of Remark 2.6.4 we refer to Theorem 3.3 in [34]. Example 2.6.1 can be found in [93]. Robustness of SA algorithms is presented in [24]. The dynamic SA was considered in [38, 39, 91], but the results presented in Section 2.8 are given in [25].
Chapter 3
ASYMPTOTIC PROPERTIES OF STOCHASTIC APPROXIMATION ALGORITHMS
In Chapter 2 we were mainly concerned with the path-wise convergence analysis for SA algorithms with expanding truncations. Conditions were given to guarantee where J denotes the
root set of the unknown function, and the estimate for the unknown root given by the algorithm.
In this chapter, for the case where J consists of a singleton weconsider the convergence rate of asymptotic normality ofand asymptotic efficiency of the estimate.
Assume is differentiable at Then as
where
It turns out that the convergence rate heavily depends on whether
or not F is degenerate. Roughly speaking, in the case where the stepsize in (2.1.1) the convergence rate of forsome positive when F is nondegenerate, and for some
when F vanishes.
It will be shown that is asymptotically normal and the covariance matrix of the limit distribution depends on the matrix D if in (2.1.1) the step size is replaced by If F in (3.0.1) is available, then D can be defined to make the limiting covariance matrix minimal, i.e., to make the estimate efficient. However, this is not the case in SA. To overcome the difficulty, one way is to derive the approximate value of F by estimating it, but for this one has to impose rather heavy conditions on Efficiency here is derived by using a sequence of slowly
decreasing step sizes, and the averaged estimate turns out to be asymptotically efficient.
3.1. Convergence Rate: Nondegenerate Case

In this section, we give the rate of convergence of to zero
in the case where F in (3.0.1) is nondegenerate, where is given by (2.1.1)–(2.1.3). It is worth noting that F is the coefficient of the first-order term in the Taylor expansion of
The following conditions are to be used.
A3.1.2 A continuously differentiable function exists such that
for any and for some with
where is used in (2.1.1).
A3.1.3 For the sample path under consideration the observation noise in (2.1.3) can be decomposed into two parts such that
for some
A3.1.4 is measurable and locally bounded, and is differentiable at such that as
The matrix F is stable (this implies nondegeneracy of F); in addition, is also stable, where and are given by (3.1.1) and (3.1.3), respectively.
By stability of a matrix we mean that all its eigenvalues have negative real parts.
and
Remark 3.1.1 We now compare A3.1.1–A3.1.4 with A2.2.1–A2.2.4. Because of the additional requirement (3.1.1), A3.1.1 is stronger than A2.2.1, but it is automatically satisfied if with In this case a in (3.1.1) equals Also, (3.1.1) is satisfied if with
In this case Take sufficiently small such that
Then and Assume
is a martingale difference sequence with
Then by the convergence theorem for martingale difference sequences, Therefore (3.1.3) is satisfied a.s. with
Condition A3.1.4 assumes differentiability of which is not required in A2.2.4.
Lemma 3.1.1 Let and H be -matrices. Assume H is stable and If satisfies A3.1.1 and l-dimensional vectors
satisfy the following conditions
then defined by the following recursion with arbitrary initial value tends to zero:
Proof. Set
We now show that there exist constants and such that
Let S be any negative definite matrix. Consider
at
Since H is stable, the positive definite matrix P is well-defined. Integrating by parts, we have
which implies
This means that if H is stable, then for any negative definite matrix S we can find a positive definite matrix P satisfying equation (3.1.9). This fact is called the Lyapunov theorem, and (3.1.9) is called the Lyapunov equation. Consequently, we can find P > 0 such that
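In standard form the Lyapunov equation (3.1.9) reads H^T P + P H = S with S negative definite and P positive definite. As a small numerical illustration (the helper function is ours; it solves the equation by vectorization rather than by the integral formula used in the proof):

```python
import numpy as np

def lyapunov_P(H, S):
    """Solve H^T P + P H = S for P by vectorization:
    (I (x) H^T + H^T (x) I) vec(P) = vec(S), column-major vec."""
    n = H.shape[0]
    A = np.kron(np.eye(n), H.T) + np.kron(H.T, np.eye(n))
    vecP = np.linalg.solve(A, S.flatten(order="F"))
    return vecP.reshape((n, n), order="F")

H = np.array([[-1.0, 1.0], [0.0, -2.0]])   # stable: eigenvalues -1 and -2
S = -np.eye(2)                             # any negative definite matrix
P = lyapunov_P(H, S)                       # positive definite solution
```

For this H and S = −I one gets P = [[1/2, 1/6], [1/6, 1/3]], whose eigenvalues are positive, illustrating the Lyapunov theorem.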
where I denotes the identity matrix of compatible dimension. Since there exists such that for
Consequently,
Without loss of generality we may assume that is sufficiently large such that for
for some constant where the first inequality is because as and while the second inequality is elementary. Combining (3.1.11) and (3.1.12) leads to
as
and hence
where denotes the minimum eigenvalue of P. Noting that
from (3.1.13) we derive
which verifies (3.1.8). From (3.1.6) it follows that
We have to show that the right-hand side of (3.1.14) tends to zero as
For any fixed because of (3.1.1) and (3.1.8). This implies that as for any initial value
Since as for any exists such that Then by (3.1.8) we have
The first term on the right-hand side of (3.1.15) tends to zero by A3.1.1, while the second term can be estimated as follows:
as
where the first inequality is valid for sufficiently large since as and the second inequality is valid when
Therefore, the right-hand side of (3.1.15) tends to zero as and then
Set
By assumption of the lemma Hence, for any there exists such that By a partial summation, we have
where, except for the last term, the sum of the remaining terms tends to zero as by (3.1.8) and
Let us now estimate
Since for and as by (3.1.8) we have
which tends to zero as and by (3.1.16) and the fact that Thus, the right-hand side of (3.1.17) tends to
zero as and the proof of the lemma is completed.
Theorem 3.1.1 Assume A3.1.1–A3.1.4 hold. Then given by (2.1.1)–(2.1.3), for those sample paths for which (3.1.3) holds, converges to with the following convergence rate:
where is the one given in (3.1.3).
Proof. We first note that by Theorem 2.4.1 and there is no truncation after a finite number of steps. Without loss of generality, we may assume
By (3.1.1), Hence, by the Taylor’s expansion we
have
Write given by (3.1.4) as follows
where
By (3.1.4) and (3.1.19), for sufficiently large k we have
where
By (3.1.1), (3.1.3) we have
Denote
Then (3.1.22) can be rewritten as
Noticing that which is stable by A3.1.4, we see
that all conditions of Lemma 3.1.1 are satisfied. Hence, by the lemma, which proves the theorem.
Remark 3.1.2 Consider the dependence of the convergence rate on the step size Take and let in (3.1.3). In order to
have it suffices to require a.s.,
if is a martingale difference sequence with
So, for (3.1.25) it is sufficient to require
Since the best convergence rate is achieved at the convergence rate is Since
the convergence rate slows down as approaches When (3.1.25) cannot be guaranteed. From this it is seen that the convergence rate depends on how big is.
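This dependence on the step size can be observed numerically. In the sketch below (an illustrative scalar setup: linear function with slope F = −1, step size 1/k, i.i.d. standard normal noise, and all paths started at the root so that the error is pure noise), the root-mean-square error over independent replicates behaves like n^{−1/2}, so multiplying n by 16 should divide the RMSE by about 4:

```python
import numpy as np

rng = np.random.default_rng(0)
R = 400                          # independent replicates
x = np.zeros(R)                  # all paths start at the root x0 = 0
rmse = {}
for k in range(1, 8001):
    # one RM step per path: a_k = 1/k, observation -x + noise
    x += (1.0 / k) * (-x + rng.standard_normal(R))
    if k in (500, 8000):
        rmse[k] = float(np.sqrt(np.mean(x ** 2)))

ratio = rmse[500] / rmse[8000]   # roughly 4 for an n^{-1/2} rate
```

In this particular linear case the error after n steps is exactly the average of the first n noises, so the n^{−1/2} behavior is not merely asymptotic.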
3.2. Convergence Rate: Degenerate Case

In the previous section, for obtaining the convergence rate of
stability and hence nondegeneracy of F is an essential requirement. We now consider what will happen if the linear term vanishes in the Taylor expansion of For this we introduce the following set of conditions:
A3.2.2 A continuously differentiable function exists such that
for any and for some with where is used in (2.1.1);
A3.2.3 For the observation noise on the sample path under consideration the following series converges:
where
A3.2.4 is measurable and locally bounded, and is differentiable at such that as
where F is a stable matrix, and is the one used in A3.2.3.
A3.2.1 and

We first note that in comparison with A3.1.1–A3.1.4, here we do not require (3.1.1), but A3.2.2 is the same as A3.1.2. From (3.2.3) we see that the Taylor expansion of does not contain the linear term. Here F is the coefficient of a term of order higher than the second in the Taylor expansion of The noise condition A3.2.3 is different from A3.1.3 but, as will be shown by the following lemma, it also implies A2.2.3.
Lemma 3.2.1 If (3.2.2) holds, then and hence A2.2.3
is satisfied.
Proof. We need only to show
Setting
by a partial summation we have
Since as and converges as the first two terms on the right-hand side of (3.2.4) tend to zero as and
The last term in (3.2.4) is dominated by
where
By the following elementary calculation we conclude that the right-hand side of (3.2.5) tends to zero as and
as
which tends to zero as and because as
Combining this with (3.2.4) and (3.2.5) shows that
By the Lyapunov equation (3.1.9), there is a positive definite matrixP > 0 such that
Assuming is large enough so that there is no truncation, by (3.2.3) we have
where is the maximum eigenvalue of P given by (3.2.6).
We start with lemmas. Note that by Theorem 2.2.1 or 2.4.1, Therefore, starting from some the algorithm has no truncation.
Define
Denote by and the maximum and minimum eigenvalues of P, respectively, and by K the condition number
Theorem 3.2.1 Assume A3.2.1–A3.2.4 hold and is given by (2.1.1)–(2.1.3). Then for the sample paths where A3.2.3 holds the following convergence rate takes place:
where
Define
Lemma 3.2.2 Assume A3.2.1–A3.2.4 hold. Then is bounded.
Proof. Since exists such that and
where, and hereafter, denotes the smaller of a and b.
By the definition of we have and there exists such that
Assuming is large enough such that for we also have
Define
where and hereafter
and
Since
If then is bounded. Otherwise, let We need only to consider the case since if it is not true, then is clearly bounded.
Let P be given by (3.2.6). We have
where
In what follows we will prove that
By (3.2.10) and (3.2.6) it is clear that
where the last inequality follows from the following consideration:
By (3.2.11) so for (3.2.16) it suffices to show that
By definition of we have and hence
or
Consequently,
and by the agreement
which verifies the last inequality in (3.2.16). We now estimate By (3.2.10), (3.2.11) and the agreement
we have
Noticing that, as agreed, from (3.2.17) we have
and by (3.2.13),
Again, from (3.2.10) and noticing we have
Consequently, by (3.2.12)
Combining (3.2.14), (3.2.16), (3.2.18), and (3.2.20) yields
for
and
Similar to (3.2.14) we treat the right-hand side of the above inequality as follows.
By the same argument as that used above, we can show that
and inductively we derive
Thus, by (3.2.12) and the definition of
or
This contradicts the definition of and hence must be infinite. Consequently, is bounded.
Proof of Theorem 3.2.1. By Lemma 3.2.2 and the fact
we have
where
By setting
from (3.2.9) it follows that
This is nothing but an RM algorithm. Since is bounded by Lemma 3.2.2, no truncation is needed and one may apply Theorem 2.2.1''.
First note that
Hence, A2.2.1 is satisfied.
as So A2.2.3 holds with replaced by
A2.2.4 is clearly satisfied, since is continuous. The key issue is to
find a satisfying A2.2.2''. Take
and define which is closed. Notice
Notice and
as
by
For Then we have
This means that
and the condition A2.2.2'' holds. By Theorem 2.2.1'', This implies
which in turn implies (3.2.7) by (3.2.8).
Imposing some additional conditions on F, we may obtain results more precise than (3.2.7) by using different Lyapunov functions.
Theorem 3.2.2 Assume A3.2.1–A3.2.4 hold and, in addition, that F is normal, i.e., Let be given by (2.1.1)–(2.1.3). Then
for those sample paths for which A3.2.3 holds, converges
to either zero or one of where denotes an eigenvalue of
More precisely,
where is a unit eigenvector of H corresponding to
Proof. Since F is stable, the integral
is well defined. Noticing that we have
and
and
for
This means that H is also stable. Therefore, all eigenvalues are negative. Further, by we find
and hence
We consider (3.2.23) and take
By (3.2.26) we have
Define
Obviously,
for any
Clearly,
where is the dimension of
Thus, J is a discrete set, and is nowhere dense because is
continuous. This together with (3.2.28) shows that A2.2.2’ is satisfied.
and
By Theorem 2.2.1’, and (3.2.25) is verified.
Corollary 3.2.1 Let Then
In this case,
and hence (3.2.7) and (3.2.25) are respectively equivalent to
and
Remark 3.2.1 For the convergence rate given by (3.1.18) for the nondegenerate case is while for the degenerate case is
by (3.2.29), which is much slower than
3.3. Asymptotic Normality

In Theorem 3.1.1 we have shown that given
by (2.1.1)–(2.1.3). As shown in Remark 3.1.2, This is a path-wise result. Assuming the observation noise is
a random sequence, we show that is asymptotically normal,
i.e., the distribution of converges to a normal distribution as This convergence implies that in the convergence rate
cannot be improved to
We first consider the linear regression case, i.e., is a linear function, but may be time-varying.
tion, but may be time-varying.Let us introduce a central limit theorem on double-indexed random
variables. We formulate it as a lemma.
Lemma 3.3.1 Let be an array of l-dimensional random vectors. Denote
as
for if
and
Assume
and
Then
where and hereafter denotes the normal distribution with mean and covariance S.
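Before turning to the linear recursion, the kind of limit the lemma describes can be checked by simulation. For the scalar iteration x_{k+1} = x_k + (1/k)(−x_k + ξ_k) started at the root with i.i.d. N(0, 1) noise, x_{n+1} is exactly the sample mean of ξ_1, …, ξ_n, so √n · x_{n+1} is standard normal. The sketch below (an illustrative setup, not the lemma's general double-indexed array) confirms this:

```python
import numpy as np

rng = np.random.default_rng(0)
R, n = 2000, 500                 # replicates and iterations per replicate
x = np.zeros(R)
for k in range(1, n + 1):
    # after k steps, x equals the running mean of the first k noises
    x += (1.0 / k) * (-x + rng.standard_normal(R))
z = np.sqrt(n) * x               # approximately (here: exactly) N(0, 1)
```

Across the replicates, z has empirical mean near 0 and standard deviation near 1, as the asymptotic normality predicts.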
Let us first consider the linear recursion (3.1.6) and derive its asymptotic normality. We keep the notation introduced by (3.1.7).
We have obtained estimate (3.1.8) for and now derive more properties for it.
Lemma 3.3.2 Assume and
H where H is stable. Then for any
Proof. By (3.1.8) it follows that
We will use the following elementary inequality
which follows from the fact that the function equals zero at x = 0 and its derivative By (3.3.8), we derive
which implies
Assume is sufficiently large such that Then
where for the last inequality (3.3.9) is invoked. Combining (3.3.7) and (3.3.10) gives (3.3.6).
Lemma 3.3.3 Set
Under conditions of Lemma 3.3.2,
uniformly with respect to and uniformly with respect to
Proof. Expanding to the series
with we have
where by definition
By stability of H, there exist constants and p > 0 such that
Putting (3.3.13) into (3.3.12) yields that for any
where for the last inequality is assumed to be sufficiently large such that and (3.1.8) is used too.
as
as
Since and may be arbitrarily small, the conclusions
of the lemma follow from (3.3.14) by Lemma 3.3.2.
Lemma 3.3.4 Assume as and
Let A, B, and Q be matrices and let A and B be stable. Then
Proof. For any T > 0 define
Since for fixed T. Denoting
by we then have Consequently,
serves as an integral sum for or equivalently, for
and hence
Therefore, for (3.3.15) it suffices to show that
Similar to (3.3.10), by stability of A we can show that there is a constant such that
By stability of A and B, constants and exist such that
Consequently, we have
which verifies (3.3.18) and completes the proof of the lemma.
Theorem 3.3.1 Let be given by (3.1.6) with an arbitrarily given initial value. Assume the following conditions hold:
where are constant matrices with is
a martingale difference sequence of dimension satisfying the following conditions:
and
and is stable;
and
as
and
Then is asymptotically normal:
where
Proof. Define by the following recursion
By (3.1.6) it follows that
Using (3.3.19) we have
Consequently,
where
and
by (3.3.20).
Define
By (3.3.30) and stability of A, from (3.1.8) it follows that constants and exist such that
Consequently, from (3.3.29) we have
The first term on the right-hand side of (3.3.34) tends to zero as by (3.3.33), while the second term is estimated as follows. By (3.3.31)
where for the last equality, Lemma 3.3.2 and (3.3.33) are used. This means that r and have the same limit distribution if exists.
Consequently, for the theorem it suffices to show
Similar to (3.3.29) and (3.3.31), by (3.3.28) we have
Noticing
by Lemma 3.3.2 and (3.1.8), we find that the last term of (3.3.36) tends to zero in probability. Therefore, for (3.3.24) it suffices to show
We now show that for (3.3.37) it is sufficient to prove
For any fixed we have
By (3.3.21) we have
where convergence to zero follows from and Lemma 3.3.2.
By (3.3.21) and we see that
It is worth noting that the convergence is uniform with respect to This implies that the second term on the right-hand side of (3.3.39) tends to zero in probability. The first term on the right-hand side of (3.3.39) can be rewritten as
By (3.3.33) for any fixed we estimate the first term of (3.3.40) as follows
while for the second term we have
since and
We now show that the last term of (3.3.40) also converges to zero in probability as
Notice that by (3.3.28), for any fixed and
Therefore, for a fixed there exist constants and such that
as
Then the last term of (3.3.40) is estimated as follows:
For the first term on the right-hand side of (3.3.44) we have
where the last inequality is obtained because is bounded
by some constant by (3.3.30). Since is fixed, in order to prove that the right-hand side of (3.3.45) tends to zero as it suffices to show
By (3.3.33), for any fixed
while for any given we may take sufficiently large such that Therefore,
by Lemma 3.3.2. Incorporating (3.3.47) with (3.3.48) proves (3.3.46). Therefore, the
right-hand side of (3.3.45) tends to zero as This implies that the first term on the right-hand side of (3.3.44) tends to zero in probability.
By (3.3.43), for the last term of (3.3.44) we have
which tends to zero as as can be shown by an argument similar to that used for (3.3.45).
In summary we conclude that the right-hand side of (3.3.44) tends to zero in probability, and hence all terms in (3.3.40) tend to zero in probability. This implies that the right-hand side of (3.3.39) tends to zero in probability as and then Thus, we have shown that for (3.3.37) it suffices to show (3.3.38).
We now intend to apply Lemma 3.3.1, identifying
to in that lemma. We have to check conditions of the lemma. Since is a martingale difference sequence, (3.3.1) is obviously
satisfied.
By (3.3.22) and Lemma 3.3.2,
This verifies (3.3.3). We now verify (3.3.2). We have
where the last term tends to zero by (3.3.22) and Lemma 3.3.2. We show that the first term on the right-hand side of (3.3.49) tends
to (3.3.25). With A and respectively identified with H and in Lemma 3.3.3,
by Lemmas 3.3.2 and 3.3.3 we have
Incorporating this with (3.3.49) leads to
By Lemma 3.3.4 we conclude
Finally, we have to verify (3.3.4).
By (3.3.33) we have
Noticing that uniformly with respect to
since or equivalently,
uniformly with respect to by (3.3.23) we have
Consequently, for any by Lemma 3.3.2
Thus, all conditions of Lemma 3.3.1 hold, and by this lemma we conclude (3.3.38). The proof is completed.
Remark 3.3.1 Under the conditions of Theorem 3.3.1, if integers are such that then it can be
shown that converges in distribution to where is a stationary Gaussian Markov process satisfying
the following stochastic differential equation
where is the standard Wiener process.
Corollary 3.3.1 From (3.1.7) and (3.3.28), similar to (3.3.29)–(3.3.31) we have
and
By (3.3.33), the first term on the right-hand side of (3.3.50) tends to zero as Note that the last term in (3.3.34) has been proved to vanish as and it is just another way of writing
Therefore, from (3.3.50) by Theorem 3.3.1, it follows that for any fixed
We have discussed the asymptotic normality of for the case
where is linear. We now consider the general Let us first introduce conditions to be used.
and
A3.3.2 A continuously differentiable function exists such that
for any and for some with where is used in (2.1.1).
for some
where is a martingale difference sequence satisfying (3.3.21)–(3.3.23).
A3.3.3
A3.3.4 is measurable and locally bounded. As
where with a specified in (3.3.52) is stable and
satisfying which is specified in (3.3.53).
Theorem 3.3.2 Let be given by (2.1.1)–(2.1.3) and let A3.3.1–A3.3.4 hold. Then
where
Proof. Since there exists such that
which implies From (3.3.53) it follows that
This together with the convergence theorem for martingale difference sequences yields
which implies
Since from it follows that Stability of is implied by stability of which is a part of A3.3.4. Then by Theorem 3.1.1
By (3.3.55) and (3.3.58) we have
From Theorem 3.1.1 we also know that there is an integer-valued (possibly depending on sample paths) such that
and there is no truncation in (2.1.1) for Consequently, for we have
Denoting
by (3.3.59) and (3.3.54) we see a.s. Then (3.3.60) is written as
By (3.3.28) it follows that
where
Using introduced by (3.3.32), we find
By an argument similar to that used in Corollary 3.3.1, we have
and as
Then by (3.3.51) from (3.3.63) we conclude (3.3.56).
Corollary 3.3.2 Let D be an matrix and let in (2.1.1)–(2.1.2) be replaced by In other words, instead of (2.1.1) and (2.1.2) if we consider
then this is equivalent to replacing and by and respectively.
In this case the only modification to be made in the conditions of Theorem 3.3.2 is that stability of in A3.3.4 should be replaced by stability of The conclusion of Theorem 3.3.2 remains valid with the only modification that and F in (3.3.57) should be replaced by and DF, respectively.
3.4. Asymptotic Efficiency
In Corollary 3.3.2 we have mentioned that the limiting covariance matrix S(D) for depends on D, if in (2.1.1)–(2.1.3) is replaced by By efficiency we mean that S(D) reaches its minimum with respect to D.
Denote
By Corollary 3.3.2, the limiting covariance matrix for with given by (3.3.64)–(3.3.66) is expressed by
Theorem 3.4.1 Assume is stable. i) If then S(D) reaches its minimum at and where ii) If then as
Proof. i) Integrating by parts, we have
This means that S(D) satisfies the following algebraic Riccati equation
By stability of and DF is nondegenerate. Thus, (3.4.3) is equivalent to
or
or
From (3.4.4) it follows that
and the equality is achieved at
ii) If then
When then
For the commonly used step size i.e., specified in (3.3.52) equals By Theorem 3.4.1 the optimal
and the optimal step size is For
the limiting covariance matrix is Therefore, the optimal limiting covariance matrix for is no matter what
is taken in Let us take Then and the optimal In this
case and is the minimum of the limiting covariance matrix. However, is unknown and is unknown too. Hence, cannot be directly used in the algorithm. To achieve asymptotic efficiency, one way is to estimate F and replace the optimal step size by its estimate This is the so-called adaptive SA. But, to guarantee its convergence and optimality, rather restrictive conditions are needed.
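The adaptive-SA idea just mentioned can be illustrated numerically. The toy below is my own construction, not from the book: a scalar linear regression function with slope F = 2 and root 3, Gaussian observation noise, the slope re-estimated from noisy finite differences and its inverse plugged into the step size.

```python
import random

def adaptive_rm(f_obs, x_init, n_iter=20000, seed=0):
    """Sketch of adaptive SA: the unknown slope F of the regression
    function is re-estimated from noisy finite differences, and the
    step size a_k = 1/(F_hat * k) mimics the optimal choice 1/(F k)."""
    rng = random.Random(seed)
    x = x_init
    F_hat = 1.0                    # running estimate of the slope F
    for k in range(1, n_iter + 1):
        # two extra observations give a noisy slope estimate
        slope = -(f_obs(x + 0.5, rng) - f_obs(x - 0.5, rng)) / 1.0
        F_hat += (slope - F_hat) / k          # running average of slopes
        a = 1.0 / (max(F_hat, 0.5) * k)       # adaptive step, kept bounded
        x = x + a * f_obs(x, rng)             # Robbins-Monro update
    return x, F_hat

# toy regression function with root x0 = 3 and slope F = 2
def f_obs(x, rng):
    return -2.0 * (x - 3.0) + rng.gauss(0.0, 1.0)
```

The clamp max(F_hat, 0.5) is a practical safeguard against a bad early slope estimate; it stands in for the "rather restrictive conditions" the text alludes to.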
Let be estimates for being the root of satisfying
where F is stable and The estimates are obtained on the basis ofobservations
with
If then we call
asymptotically efficient for
To achieve asymptotic efficiency we apply the averaging technique, which is different from adaptive SA.
For satisfying A3.3.1, if in (3.3.52) equals zero, then
is called a slowly decreasing step size. As a typical example of slowly decreasing step sizes, one may take
Let be generated by (2.1.1)–(2.1.3) with slowly decreasing
Define
In what follows we will show that is asymptotically normal and is asymptotically efficient.
We list the conditions to be used.
A3.4.1 nonincreasingly converges to zero,
and for some
A3.4.2 A continuously differentiable function exists such that
for any and for some withwhere is used in (2.1.1).
A3.4.3 The observation noise is such that
with being a constant independent of and
where is specified in (3.4.7).
A3.4.4 is measurable and locally bounded. There exist a stable ma-trix F, and such that
where is a constant.
Remark 3.4.1 It is clear that satisfies A3.4.1. From (3.4.7) it follows that
where denotes the integer part of
Since is nonincreasing, from (3.4.12) we have
which implies
or
Remark 3.4.2 If with being a martingale
difference sequence satisfying (3.3.21)–(3.3.23), then identifying to in Lemma 3.3.1, by this lemma we have
where is given by (3.4.1). Thus, in this case the second condition in(3.4.8) holds.
We now show that the first condition in (3.4.8) holds too.
By the estimate for the weighted sum of martingale difference sequences (see Appendix B) we have
which together with (3.4.13) yields
It is clear that (3.4.9) is implied by (3.3.21). Therefore, in the present case all requirements in A3.4.3 are satisfied.
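The summability requirement behind this remark can be checked numerically. The toy below is my own illustration (i.i.d. Gaussian noise as the martingale difference sequence, weights a_i = i^(-0.7); these specific choices are not from the book): for an exponent greater than 1/2 the squared weights are summable, so the weighted partial sums settle down, in line with the convergence theorem for weighted sums of martingale differences.

```python
import random

def weighted_noise_partial_sums(gamma=0.7, n=200000, seed=1):
    """Partial sums S_k = sum_{i<=k} a_i * eps_i with a_i = i**(-gamma)
    and eps_i i.i.d. N(0,1), a martingale difference sequence.  For
    gamma > 1/2 we have sum a_i**2 < infinity, so S_k converges a.s."""
    rng = random.Random(seed)
    s = 0.0
    checkpoints = []
    for i in range(1, n + 1):
        s += i ** (-gamma) * rng.gauss(0.0, 1.0)
        if i in (n // 4, n // 2, n):
            checkpoints.append(s)
    return checkpoints
```

The three recorded checkpoints differ only slightly, reflecting the Cauchy-like tail of the convergent series.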
Theorem 3.4.2 Assume A3.4.1–A3.4.4 hold. Let be given by (2.1.1)–(2.1.3) and be given by (3.4.5). Then is asymptotically efficient:
Prior to proving the theorem we establish some properties of slowly decreasing step sizes.
Set
By (3.1.8) we have
where and are constants.
Set
Lemma 3.4.1 i) The following estimate takes place
where o(1) denotes a magnitude that tends to zero asii) is uniformly bounded with respect to both and
and
Proof. i) By (3.4.6) we know that
and
which implies (3.4.17) since asii) By (3.4.6) as and hence for any we have
where denotes the integer part of
Using (3.4.15) we have
for any where the first term on the right-hand side tends to zero as by (3.4.20), and the last term tends to zero as Therefore, for (3.4.18) it suffices to show
Noticing that (3.4.13) implies for any we have
and hence
By (3.4.6) where as
Taking this into account, by (3.4.15) and (3.4.17) we have
where as
Thus, by (3.4.23) we have
This implies (3.4.21), and together with (3.4.15) shows that is uniformly bounded with respect to both and
We now express given by (2.1.1)–(2.1.3) in a different form by introducing a sequence of stopping times and a sequence of processes
To be precise, define
where by definition
Recall that is the sequence used in (2.1.1)–(2.1.3). It is clear that
Similarly, define
where
Recursively define
where
As a matter of fact, is the first exit time of from the sphere with radius after time and during the time period
evolves the same as and is recursively defined as an RM process. Therefore, given by (2.1.1)–(2.1.3) can be expressed as
Lemma 3.4.2 Under Conditions A3.4.1–A3.4.4, there exists an integer-valued such that a.s., a.s., and given by (2.1.1)–(2.1.3) has no truncation for i.e.,
and a.s.
Proof. If we can show that A2.2.3 is implied by A3.4.3, then all conditions of Theorem 2.2.1 are fulfilled a.s., and the conclusions of the lemma follow from Theorem 2.2.1.
Since we have
which means that (2.2.2) is satisfied for
We now check (2.2.2) for By a partial summation we have
where (3.4.6) is used and as
By (3.4.8) the first two terms on the right-hand side of (3.4.34) tend to zero as For the same reason and by the fact
the last term of (3.4.34) also tends to zero as This means that satisfies (2.2.2), and the lemma follows.
By Lemma 3.4.2 we have
and by (3.4.14)
For specified in (3.4.11) and a deterministic integer define the stopping time as follows
From (3.4.35) we have
and
Lemma 3.4.3 If A3.4.1–A3.4.4 hold, then
is uniformly bounded with respect to
Proof. By (3.4.11) and (3.4.15) from (3.4.39) we have
where respectively denote the terms on the right-hand side of the inequality in (3.4.40).
By (3.4.19) we see
where as From this we find that is bounded in if is large enough so that
By (3.4.19) we estimate as follows:
where is assumed to be large enough such that
Thus, by (3.4.9)
We now pay attention to (3.3.10) in the proof of Lemma 3.3.2 and find that the right-hand side of (3.4.42) is bounded with respect to
For by (3.4.19) and (3.4.10) we have
where is a constant. Again, by (3.3.10), is bounded in
It remains to estimate By the Schwarz inequality we have
By (3.4.19), for large enough
which, as shown by (3.3.11), is bounded in Then by (3.4.37) we have
where is a constant.
Combining (3.4.40)–(3.4.44) we find that there exists a constant
such that
Setting
and
from (3.4.45) we have
where is a constant.
Denoting
from (3.4.48) we find
where is set equal to 1.
From (3.4.48) and (3.4.50) it then follows that
which combined with (3.4.46) leads to
where for the last equality we have used (3.4.47).
Choosing sufficiently small so that
from (3.4.51) we then have
which is bounded with respect to as shown by (3.3.10).
Lemma 3.4.4 If A3.4.1-A3.4.4 hold, then
Proof. It suffices to prove
Then the lemma follows from (3.4.53) by the Kronecker lemma.
By (3.4.11) and (3.4.37) we have
where the last inequality follows from the Lyapunov inequality.
Applying Lemma 3.4.3, from the above estimate we derive
where is a constant and the convergence of the series follows from (3.4.13).
From (3.4.54) it follows that
which means that
By Lemma 3.4.2, for any given
if is sufficiently large. This together with (3.4.55) shows that
or equivalently,
This verifies (3.4.53) because can be arbitrarily small. The proof of the lemma is completed.
Proof of Theorem 3.4.2.
By Lemma 3.4.2, a.s. and
Consequently,
where as
Noticing we have
and hence
By (3.4.16) and (3.4.57), from here we derive
By Lemma 3.4.1, is bounded. Then with the help of (3.4.58) we have
From (3.4.58) and the boundedness of there exists a constant such that
Then, we have
where the convergence to zero a.s. follows from Lemma 3.4.4.
Putting (3.4.59) and (3.4.61) into (3.4.56) leads to
By (3.4.58) we then have
Notice that
Let us denote by the upper bound for where the existence of is guaranteed by Lemma 3.4.1. Then using (3.4.9) and (3.4.18) we
have
which implies that
and hence
because is bounded.
By (3.4.10) we see that
where the convergence follows from (3.4.13).
From this by the Kronecker lemma it follows that
Therefore, we have
and hence
Combining (3.4.62)–(3.4.64) we arrive at
or
This together with (3.4.8) implies the conclusion of the theorem.
This theorem tells us that if in (2.1.1)–(2.1.3) we apply the slowly decreasing step size, then the averaged estimate leads to the minimal covariance matrix of the limit distribution.
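A minimal numerical sketch of this averaging scheme follows. It is my own toy, not from the book: a scalar linear regression function with slope F = 2 and root 3, Gaussian noise, and the slowly decreasing step a_k = k^(-2/3), one standard choice with exponent in (1/2, 1).

```python
import random

def rm_with_averaging(F=2.0, x_root=3.0, sigma=1.0, n=50000, seed=2):
    """Robbins-Monro iteration with slowly decreasing step a_k = k**(-2/3)
    followed by averaging of the iterates (Polyak-Ruppert scheme)."""
    rng = random.Random(seed)
    x = 0.0
    x_bar = 0.0
    for k in range(1, n + 1):
        y = -F * (x - x_root) + rng.gauss(0.0, sigma)  # noisy observation
        x = x + k ** (-2.0 / 3.0) * y                  # slow-step RM update
        x_bar += (x - x_bar) / k                       # running average
    return x, x_bar
```

In this linear toy the averaged iterate typically ends up markedly closer to the root than the raw iterate, reflecting the efficiency result of Section 3.4.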
3.5. Notes and References
Convergence rates and asymptotic normality can be found in [28, 68, 78] for the nondegenerate case. The rate of convergence for the degenerate case was first considered by Pflug in [74]. The results presented in Section 3.2 are given in [15, 47].
For the proof of the central limit theorem (Lemma 3.3.1) we refer to [6, 56, 78], while for Remark 3.3.1 refer to [78]. The proofs of Theorems 3.3.1 and 3.3.2 can be found in [28].
Asymptotic normality of the stochastic approximation algorithm was first considered in [44].
For asymptotic efficiency the averaging technique was introduced in [80, 83], and further considered in [35, 59, 66, 67, 74, 98]. Theorems given in Section 3.4 can be found in [13]. For adaptive stochastic approximation refer to [92, 95].
Chapter 4
OPTIMIZATION BY STOCHASTIC APPROXIMATION
Up to now we have been concerned with finding roots of an unknown function observed with noise. In applications, however, one often faces the optimization problem, i.e., finding the minimizer or maximizer of an unknown function It is well known that achieves its maximum or minimum values at the root set of its gradient, i.e., at
although it may be only in the local sense.
The gradient is also written as
If the gradient can be observed with or without noise, then the optimization problem is reduced to the SA problem we have discussed in previous chapters. Here, we are considering the optimization problem for the case where the function itself rather than its gradient is observed and the observations are corrupted by noise. This problem was solved by the classical Kiefer-Wolfowitz (KW) algorithm which took finite differences to serve as estimates for the partial derivatives. To be precise, let be the estimate at time for the minimizer (maximizer) of and let
be two observations on at time with noises and respectively, where
are two vectors perturbed from the estimate by and respectively, on the component of The KW algorithm suggests taking
the finite difference
as the observation of the component of the gradient
It is clear that
where the component of equals
The RM algorithm
with defined above is called the KW algorithm.
It is understandable that in the classical theory for convergence of the KW algorithm rather restrictive conditions are imposed not only on but also on and Besides, at each iteration observations are needed to form finite differences, where is the dimension of In some problems may be very large; for example, in the problem of optimizing weights in a neural network corresponds to the number of nodes, which may be large. Therefore, it is of interest not only to weaken conditions required for convergence of the optimizing algorithm but also to reduce the number of observations per iteration.
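The classical scheme can be sketched in a few lines. The example below is my own toy (a quadratic objective with Gaussian observation noise, step sizes a_k = 1/k and difference widths c_k = k^(-1/4); none of these specifics come from the book); it makes visible the 2p observations per iteration that the text refers to.

```python
import random

def kw_step(L_obs, x, a, c, rng):
    """One classical KW iteration: 2*p observations (two per coordinate)
    form finite-difference estimates of the partial derivatives."""
    p = len(x)
    grad = []
    for i in range(p):
        xp = list(x); xp[i] += c
        xm = list(x); xm[i] -= c
        grad.append((L_obs(xp, rng) - L_obs(xm, rng)) / (2.0 * c))
    return [x[i] - a * grad[i] for i in range(p)]

def minimize_kw(L_obs, x0, n=20000, seed=3):
    rng = random.Random(seed)
    x = list(x0)
    for k in range(1, n + 1):
        x = kw_step(L_obs, x, a=1.0 / k, c=k ** -0.25, rng=rng)
    return x

# toy objective with minimizer (1, -2), observed with additive noise
def L_obs(x, rng):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + rng.gauss(0.0, 0.1)
```

With p = 2 coordinates, each iteration already costs four observations; the randomized-difference variant of Section 4.1 reduces this to two regardless of p.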
In Section 4.1 the KW algorithm with expanding truncations using randomized differences is considered. As will be shown, because of replacing finite differences by randomized differences, the number of observations is reduced from to 2 for each iteration, and because of involving expanding truncations in the algorithm and applying the TS method for convergence analysis, the conditions needed for have been weakened significantly and the conditions imposed on the noise have been improved to the weakest possible. The convergence rate and asymptotic normality for the KW algorithm with randomized differences and expanding truncations are given in Section 4.2.
The KW algorithm, like other gradient-based optimization algorithms, may get stuck at a local minimizer (or maximizer). How to approach the global optimizer is one of the important issues in optimization theory. Especially, how to reach the global optimizer pathwise is a difficult and challenging problem. In Section 4.3 the KW algorithm is combined with searching initial values, and it is shown that the resulting algorithm a.s. converges to the global optimizer of the unknown function
Optimization by Stochastic Approximation 153
The obtained results are then applied to some practical problems in Section 4.4.
4.1. Kiefer-Wolfowitz Algorithm with Randomized Differences
There is a fairly long history of random search or approximation ideas in SA. Different random versions of the KW algorithm were introduced: for example, in one version a sequence of random unit vectors that are independent and uniformly distributed on the unit sphere or unit cube was used; and in another version the KW algorithm with random directions was introduced and was called a simultaneous perturbation stochastic approximation algorithm.
Here, we consider the expandingly truncated KW algorithm with randomized differences. Conditions needed for convergence of the proposed algorithm are considerably weaker than existing ones.
Conditions on
Let be a sequence of independent and identically distributed (iid) random variables such that
Furthermore, let be independent of the algebra generated by
is the observation noise to be explained later.
For convenience of writing let us denote
It should be emphasized that is a vector and does not denote a matrix inverse. At each time two observations are taken: either
or
where is the estimate for the sought-for minimizer (maximizer) of denote the observation noises, and is a real
number.The randomized differences are defined as
and
may serve as observations of randomized differences.
To be specific, let us consider observations defined by (4.1.3) and (4.1.4). The convergence analysis, however, can analogously be done for observations (4.1.5) and (4.1.6).
Thus, the observations considered in the sequel are
where
We now define the KW algorithm with expanding truncations and randomized differences. Let be a sequence of positive numbers increasingly diverging to infinity, and let be a fixed point in Given any initial value the algorithm is defined by:
where is given by (4.1.9) and (4.1.10).
It is worth noting that the algorithm (4.1.9)–(4.1.12) differs from (2.1.1)–(2.1.3) only by the observations As a matter of fact, (4.1.11) and (4.1.12) are exactly the same as (2.1.1) and (2.1.2), but (4.1.9) and
(4.1.10) are different from (2.1.3). As before, is the number of truncations that have occurred before time Clearly the random vector is measurable with respect to the minimal σ-algebra containing both and where Thus the random vector is independent of
Let
The observation (4.1.9) can be written in the standard form of an RM algorithm. In fact, we can rewrite as follows:
where
Thus, the KW algorithm (4.1.9)–(4.1.12) turns out to be a standard RM algorithm with expanding truncations (4.1.11)–(4.1.14) considered in Chapter 2. Of course, the observation noise expressed by (4.1.14) is quite complicated: it is composed of the structural error
and the random noise caused by inaccuracy of observations.
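The whole scheme — one random ±1 direction per iteration, two observations, componentwise division by the perturbation (the vector "inverse" mentioned above), and expanding truncations — can be sketched as follows. This is my own toy illustration (quadratic objective, Gaussian noise, a_k = 1/k, c_k = k^(-1/4), truncation bounds 2(s+1)); it is not the book's notation and not a definitive implementation.

```python
import random

def rd_kw_truncated(L_obs, x0, x_star, n=20000, seed=4):
    """KW with randomized differences and expanding truncations: each
    iteration uses ONE random +/-1 direction and only TWO observations;
    if the candidate iterate leaves the current truncation ball it is
    reset to the fixed point x_star and the bound is enlarged."""
    rng = random.Random(seed)
    p = len(x0)
    x = list(x0)
    sigma = 0                          # number of truncations so far
    for k in range(1, n + 1):
        c = k ** -0.25
        delta = [rng.choice((-1.0, 1.0)) for _ in range(p)]
        y_plus = L_obs([x[i] + c * delta[i] for i in range(p)], rng)
        y_minus = L_obs([x[i] - c * delta[i] for i in range(p)], rng)
        diff = (y_plus - y_minus) / (2.0 * c)
        # componentwise division by delta: the "vector inverse" of delta
        cand = [x[i] - (1.0 / k) * diff / delta[i] for i in range(p)]
        if max(abs(v) for v in cand) > 2.0 * (sigma + 1):  # truncation
            sigma += 1
            cand = list(x_star)
        x = cand
    return x

# toy objective with minimizer (1, -2), observed with additive noise
def L_obs(x, rng):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + rng.gauss(0.0, 0.1)
```

Note the observation count: two per iteration regardless of the dimension p, versus 2p for the classical finite-difference KW scheme.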
We now list conditions to be used.
A4.1.2 is locally Lipschitz continuous. There is a unique maximum of at that is the only root for and
for Further, used in (4.1.11) is such that sup L(x) for some c and
Remark 4.1.1 If is twice continuously differentiable, then is locally Lipschitz continuous.
A4.1.1 and
exists such that
and as
Remark 4.1.2 If is the unique minimizer of then in (4.1.11) and (4.1.12) should be replaced by
Theorem 4.1.1 Assume A4.1.1, A4.1.2, and Conditions on hold.Let be given by (4.1.9)-(4.1.12) (or (4.1.11)-(4.1.14)) with anyinitial value. Then
if and only if for each the random noise given by (4.1.10) can be decomposed into the sum of two terms in ways such that
with
and
where is given in Conditions on
Proof. We will apply Theorem 2.2.1 for sufficiency and Theorem 2.4.1for necessity.
Let us first check Conditions A2.2.1–A2.2.4. Condition A2.2.1 is a part of A4.1.1. Condition A2.2.2 is automatically satisfied if we take
noticing that in the present case. Condition A2.2.4 is contained in A4.1.2. So, the key issue is to verify that given by (4.1.14) satisfies the requirements.
Let and be vector functions obtained from with some of its components replaced by zero:
It is clear that
and
For notational convenience, let denote a generic random vector such that
where is specified in (4.1.1), and may vary for different applications.
We express given by (4.1.14) in an appropriate form to be dealt with. We mainly use the local Lipschitz continuity to treat the structural error (4.1.15) in
Rewrite the component of the structural error as follows
and for any express
where on the right-hand side of the equality all terms are cancelled except the first and the last terms, and in each difference of L, the arguments of L differ from each other only by one
We write (4.1.25) in the compact form:
Applying Taylor's expansion to (4.1.26) we derive
where
Similarly, we have
and
where
Define the following vectors:
Finally, putting (4.1.27)–(4.1.35) into (4.1.14) we obtain the following expression for
It is worth noting that each component of and is a martingale
difference sequence, because both and are independent of
For the sufficiency part we have to show that (2.2.2) is satisfied a.s.
Let us show that (2.2.2) is satisfied by all components of and
For components of we have for any
since by (4.1.1), and as
Therefore, for any integer N
for any such that converges.
Thus, all sample paths of components of satisfy (2.2.2). Completely the same situation takes place for the components of
and
By the convergence theorem for martingale difference sequences, we find that for any integer N
This is because is independent of and is bounded by a constant uniformly with respect
to by Lipschitz-continuity of Then the martingale convergence
theorem applies since for some by A4.1.1.
A similar argument can be applied to components of Since for any integer N (4.1.38) holds outside an exceptional set with probability zero, there is an with such that for any
and
for all and N = 1, 2, ….
Therefore, for all and any integer N
where is given by (1.3.2).
From (4.1.17) and (4.1.18) it follows that there exists such that and for each
and hence
Combining (4.1.41) and (4.1.42), we find for each
This means that for the algorithm (4.1.11)–(4.1.14), Condition A2.2.3 is satisfied on Thus by Theorem 2.2.1, on This proves the sufficiency part of the theorem.
Under the assumption a.s. it is clear that both and converge to zero a.s., and (4.1.39) and (4.1.40) become
and
Then the necessity part of the theorem follows from Theorem 2.4.1. We show this. By Theorem 2.4.1, can be decomposed into two parts
and such that and Let us
denote by the component of a vector Define
Then for
and
From (4.1.43) and (4.1.36) it follows that
This together with (4.1.44) and (4.1.45) proves the necessity part of the theorem.
Theorem 2.4.1 gives a necessary and sufficient condition on the observation noise in order that the KW algorithm with expanding truncations and randomized differences converge to the unique maximizer of a function L. We now give some simple sufficient conditions on
Theorem 4.1.2 Assume A4.1.1 and A4.1.2 hold. Further, assume that
is independent of
and satisfies one of the following two conditions:
i) where is a random variable;
ii)
Then
where is given by (4.1.9)–(4.1.12).
Proof. It suffices to prove (4.1.16)–(4.1.18). Assume i) holds. Let be given by
By definition, is independent of and so
and
where is an upper bound for
By the convergence theorem for martingale difference sequences, itfollows that
Thus in (4.1.16) it can be assumed that and and the conclusion of the theorem follows from Theorem 4.1.1.
Assume now ii) holds. By the independence assumption it follows that for is independent of so that
Then, we have
It directly follows that
Again, it suffices to take
We now extend the results to the case of multi-extremes. For this, A4.1.2 is replaced by A4.1.2'.
A4.1.2’ is locally Lipschitz continuous, L(J) is nowhere
dense, where the set where L takes extremes, and used in (4.1.11) is such that for some and
Theorem 4.1.3 Let be given by (4.1.9)–(4.1.12) with a given initial value Assume A4.1.1 and A4.1.2' hold. Then
on an with if satisfies (4.1.16)–(4.1.18), or satisfies conditions given in Theorem 4.1.2, where is a connected
set contained in the closure of .
Proof. Condition A2.2.2 is implied by A4.1.2' with and A2.2.1 and A2.2.4 are implied by A4.1.1 and A4.1.2, respectively, while
A2.2.3 is satisfied as shown in Theorems 4.1.1 and 4.1.2. Then theconclusion of the theorem follows from Theorem 2.2.2.
Remark 4.1.3 In the multi-extreme case, the necessary conditions on for convergence can also be obtained by analogy with Theorem 2.4.2.
Remark 4.1.4 Conditions i) or ii) used in Theorem 4.1.2 are simple indeed. However, in Theorem 4.1.2 is required to be independent of This may not be satisfied if the observation noise
is state-dependent. Taking into account that is the observation noise when observing at and we see that depends on and if the observation noise is state-dependent. In this case, does depend on This violates the assumption about independence made in Theorem 4.1.2.
Consider the case where the observation noise may depend on locations of measurement, i.e., in lieu of (4.1.3) and (4.1.4) consider
Introduce the following condition.
A4.1.3 Both and are measurable functions and are martingale difference sequences for any and
for p specified in A4.1.1 with
where is a family of nondecreasing independent of both and
Theorem 4.1.4 Let be given by (4.1.9)–(4.1.12) with a given initial value Assume A4.1.1, A4.1.2', and A4.1.3 hold. Then
where is a connected subset of
Proof. Introduce the generated by and i.e.,
It is clear that is measurable with respect to and hence are Both
and are Approximating and by simple functions, it is seen that
Therefore, and aremartingale difference sequences, and
where
Hence, is a martingale difference sequence with
Noticing is bounded and as by (4.1.50) and (4.1.51) and the convergence theorem for martingale difference sequences we have, for any integer N > 0
This together with (4.1.37) with replaced by (4.1.39), and (4.1.40) verifies that expressed by (4.1.36) satisfies A2.2.3. Then the conclusion of the theorem follows from Theorem 2.2.2.
Remark 4.1.5 If J consists of a singleton then Theorems 4.1.3 and 4.1.4 ensure a.s. If J is composed of isolated points, then
theorems ensure that converges to some point in J. However, the limit is not guaranteed to be a global minimizer of Depending on the initial value, may converge to a local minimizer. We will return to this issue in Section 4.3.
4.2. Asymptotic Properties of KW Algorithm
We now present results on convergence rate and asymptotic normality of the KW algorithm with randomized differences.
Theorem 4.2.1 Assume hypotheses of Theorem 4.1.2 or Theorem 4.1.4 with and that
for some and as
where is stable and and are specified in (4.2.1) and (4.2.2),respectively.
Then given by (4.1.9)–(4.1.12) satisfies
Proof. First of all, under conditions of Theorems 4.1.2 or 4.1.4,
By Theorem 3.1.1 it suffices to show that given by
(4.1.36) can be represented as
where
From (4.1.28) and (4.1.31) by the local Lipschitz continuity of it follows that
by (4.2.2). Since it follows that
Since and given by (4.1.27) and (4.1.32)are uniformly bounded for for each
where converges. By the convergence theorem for martingaledifference sequences it follows that
where and are given by (4.1.35).
In the proof of Theorem 4.1.2, replacing by and using (4.2.2), the same argument leads to
Then by defining
we have shown (4.2.4) under the hypotheses of Theorem 4.1.2.
Under the hypotheses of Theorem 4.1.4 we have the same conclusions about and as before. We need only to show (4.2.5). But this follows from (4.1.52) with replaced by and the convergence
Remark 4.2.1 Let be given by (4.1.9)–(4.1.12). If and with then conditions (4.2.1) and (4.2.2) are satisfied.
Theorem 4.2.2 Assume A4.1.1 and A4.1.2 hold and that
i) and for some
ii) for some c > 0 and
iii) is stable and for some
iv) given by (4.1.10) is an MA process:
for
and
where are real numbers and is a martingale difference sequence which is independent of and satisfies
Then
where and
Proof. Since it follows that and
By assumption is independent of and hence is independent of Then by (4.2.11) and the convergence theorem for martingale difference sequences we obtain (4.2.5). By Theorem 4.2.1 we have as
and after a finite number of iterations of (4.1.11), say, for there are no more truncations.
Since and is stable, it follows that
Let be given by
By (4.1.11), (4.1.13), (4.1.36), and condition ii) it follows that for
Let be given by
where
Since is stable, by (3.1.8) it follows that there are constants
and such that
Noticing where because by condition iii), we have
where respectively denote the five terms on the right-hand side of the first equality of (4.2.19).
By (4.2.18),
By Lemma 3.3.2, because and
By (4.1.28) and (4.1.3) it follows that and hence by i) and (4.2.18)
where is a constant.
By Lemma 3.3.2 and the right-hand side of (4.2.20) tends to zero a.s. as
To estimate let us consider the following linear recursion
By (4.2.17) it follows that
By (4.2.11), Since and
Then by the convergence theorem for martingale difference sequences it follows that
i.e.,
Similarly,
Applying Lemma 3.1.1, we find that From (4.2.22),
it follows that
Since is an MA process driven by a martingale difference sequence
satisfying (4.2.6),
By an argument similar to that used for (4.2.21) and (4.2.22), from Lemma 3.1.1 it follows that
Therefore, putting all these convergence results into (4.2.19) yields
By (3.3.37),
where is given by (4.2.10). By (4.2.18), from (4.2.23) and (4.2.24)
it follows that which together with the definition
(4.2.14) for proves the theorem.
Example 4.2.1 The following example of and satisfies Conditions i) and iii) of Theorem 4.2.2:
In this example, and
Remark 4.2.2 Results in Sections 4.1 and 4.2 are proved for the case where the two-sided randomized differences are used, where and are given by (4.1.3) and (4.1.4), respectively. But all results presented in Sections 4.1 and 4.2 are also valid for the case where the one-sided randomized differences
are used, where and are given by (4.1.3) and (4.1.6), respectively.
In this case, in (4.1.27), (4.1.28) and in the expression of should be replaced by 1, and (4.1.29)–(4.1.32) disappear. Accordingly, (4.1.36) changes to
Theorems 4.1.1-4.1.4 and 4.2.1 remain unchanged. The conclusion of
Theorem 4.2.2 remains valid too, if in Condition iv)
changes to
4.3. Global Optimization
As pointed out at the beginning of the chapter, the KW algorithm may
lead to a local minimizer of Before the 1980s, the random search or its combination with a local search method was the main stochastic approach to achieve the global minimum when the values of L can be observed exactly without noise. When the structural property of L is used for local search, a rather rapid convergence rate can be derived, but it is hard to escape a local attraction domain. The random search has a chance to fall into any attraction domain, but its convergence rate decreases exponentially as the dimension of the problem increases.
Simulated annealing is an attractive method for global optimization, but it provides only convergence in probability rather than pathwise convergence. Moreover, simulation shows that for functions with a few local minima, simulated annealing is not efficient. This motivates one to combine a KW-type method with random search. However, a simple combination of SA and random search does not work: in order to reach the global minimum one has to reduce the noise effect as time goes on.
A hybrid algorithm composed of a search method and the KW algorithm is presented in the sequel with the main effort devoted to designing easily realizable switching rules and to providing an effective noise-reducing method.
We define a global optimization algorithm, which consists of three parts: search, selection, and optimization. To be specific, let us discuss the global minimization problem. In the search part, we choose an initial value and make the local search by use of the KW algorithm with randomized differences and expanding truncations described in Section 4.1 to approach the bottom of the local attraction domain. At the same time, the average of the observations for L is used to serve as an estimate of the local minimum of L in this attraction domain. In the selection part, the estimates obtained for the local minima of L are compared with each other, and the smallest one among them together with the corresponding minimizer given by the KW algorithm are selected. Then the optimization part takes place, where again the local search is carried out, i.e., the KW algorithm without any truncations is applied to improve the estimate for the minimizer. At the same time, the corresponding minimum of L is reestimated by averaging the noisy observations. After this, the algorithm goes back to the search part again.
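The three-part cycle just described can be summarized in a skeleton. The code below is my own condensed sketch (scalar case, a toy double-well objective, fixed step and clipping constants; the book's actual switching rules, expanding truncations, and noise-reducing denominators are omitted):

```python
import random

def global_opt(L_obs, initial_points, n_local=5000, seed=5):
    """Skeleton of the search/selection/optimization cycle: run a local
    KW search from each candidate initial value, estimate the local
    minimum by averaging noisy observations, select the best candidate,
    then refine it by a further local search."""
    rng = random.Random(seed)

    def local_search(x0, n):
        x = x0
        L_avg = 0.0
        for k in range(1, n + 1):
            c = k ** -0.25
            g = (L_obs(x + c, rng) - L_obs(x - c, rng)) / (2.0 * c)
            g = max(-10.0, min(10.0, g))          # crude clipping keeps
            x = x - 0.1 * g / k                   # the toy stable
            L_avg += (L_obs(x, rng) - L_avg) / k  # averaged observations
        return x, L_avg

    best_x, best_L = None, float("inf")
    for x0 in initial_points:                     # search + selection
        x_loc, L_est = local_search(x0, n_local)
        if L_est < best_L:
            best_x, best_L = x_loc, L_est
    x_final, _ = local_search(best_x, n_local)    # refinement step
    return x_final

# toy double-well L(x) = x**4 - 3*x**2 + x: local minimum near x = 1.14,
# global minimum near x = -1.30
def L_obs(x, rng):
    return x ** 4 - 3.0 * x ** 2 + x + rng.gauss(0.0, 0.05)
```

Starting the search from both basins, the averaged observation estimates let the selection step pick the deeper minimum even though each local search alone would be trapped.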
For the local search, we use observations (4.1.3) and (4.1.4), or (4.1.5) and (4.1.6). To be specific, let us use (4.1.5) and (4.1.6).
In the sequel, by the KW algorithm with expanding truncations we mean the algorithm defined by (4.1.11) and (4.1.12) with
where and are given by (4.1.5) and (4.1.6), respectively. Similarly to (4.1.9) and (4.1.10) we have
where
By the KW algorithm we mean

with defined by (4.3.2). It is worth noting that, unlike (4.1.8), is used in (4.3.1).
Roughly speaking, this is because in the neighborhood of a minimizer of is increasing, and in (4.1.11) should be an observation on
174 STOCHASTIC APPROXIMATION AND ITS APPLICATIONS
In order to define switching rules, we have to introduce integer-valued and increasing functions and such that and
Define
In the sequel, by the search period we mean the part of the algorithm starting from the selection of the initial value up to the next selection of the initial value. At the end of the search period, we are given and being the estimates for the global minimizer and the minimum of L, respectively. Variables such as and etc. in the search period are equipped with the superscript
etc.

The global optimization algorithm is defined by the following five steps.
(GO1) Starting from at the search period, the initial value
is chosen according to a given rule (deterministic or random),
and then is calculated by the KW algorithm with expanding truncations (4.1.11) and (4.1.12) with defined by (4.3.1), for which the step sizes and and used for truncation are defined as follows:
where c > 0 and are fixed constants, and are two sequences of positive real numbers increasingly diverging to infinity.
(GO2) Set the initial estimate for and update the
estimate for by
where is the noise when observing
After steps, is obtained.
(GO3) Let be a given sequence of real numbers such that
and as Set For if
as
e.g.,
then set Otherwise, keep unchanged.
(GO4) Improve to by the KW algorithm with expanding truncations (4.1.11) and (4.1.12) with defined by (4.3.1), for which
where in (4.1.11) and (4.1.12) may be an arbitrary sequence of numbers increasingly diverging to infinity, and
At the same time, update the estimate for by
where is the noise when observing At the end of this step, and are derived.
(GO5) Go back to (GO1) for the search period.
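The five steps can be sketched roughly as follows. The clipping interval (a crude stand-in for the expanding truncations), the step sizes, the tolerance sequence, and the noisy test function are all illustrative assumptions, not the book's exact choices:

```python
import random

def kw_search(L_obs, x, n_steps, i, bound=2.5):
    # (GO1)/(GO4): KW iteration with central finite differences; the factor i
    # in the step-size denominator imitates the noise-reducing divisor.
    for n in range(1, n_steps + 1):
        a_n = 1.0 / (n * i)
        c_n = 0.5 / n ** 0.25
        g = (L_obs(x + c_n) - L_obs(x - c_n)) / (2.0 * c_n)
        x = min(max(x - a_n * g, -bound), bound)  # crude truncation device
    return x

def estimate_min(L_obs, x, m):
    # (GO2): estimate L(x) by averaging m noisy observations.
    return sum(L_obs(x) for _ in range(m)) / m

def global_optimize(L_obs, periods=30, n_steps=200, m=200, seed=1):
    rng = random.Random(seed)
    best_x, best_v = 0.0, float("inf")
    for i in range(1, periods + 1):
        x0 = rng.uniform(-2.0, 2.0)                   # (GO1): fresh initial value
        xi = kw_search(L_obs, x0, n_steps, i)
        vi = estimate_min(L_obs, xi, m)               # (GO2)
        if vi < best_v - 1.0 / i:                     # (GO3): tolerance -> 0
            best_x, best_v = xi, vi                   # resetting
        best_x = kw_search(L_obs, best_x, n_steps, i) # (GO4): refinement
        best_v = estimate_min(L_obs, best_x, m)
    return best_x, best_v                             # (GO5): loop closes

obs_rng = random.Random(2)
L_obs = lambda x: (x * x - 1.0) ** 2 + 0.3 * (x - 1.0) ** 2 + obs_rng.gauss(0.0, 0.1)
x_hat, v_hat = global_optimize(L_obs)
```

On this noisy double-well (global minimizer at 1, local minimum near -1), restarting the local search each period and comparing averaged estimates eventually keeps the iterate in the global well.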
We note that for the search period is added to and (see (4.3.7) and (4.3.8)). The purpose of this is to diminish the effect of the observation noise as increases. Therefore, and both tend to zero, not only as but also as The following example shows that adding an increasing to the denominators of and is necessary.
Example 4.3.1 Let
It is clear that the global minimizer is and are two local minima. Furthermore, and are attraction domains for –1 and +1, respectively.
Since is linear, for local search we apply the ordinary KW algorithm without truncation

Here, no randomized differences are introduced, because this is a one-dimensional problem.
Assume
where
and and are mutually independent and both are sequences of iid random variables with
Let us start from (GO1) and take
(not tending to infinity),
If then, by noticing one of and must belong to Elementary calculation shows that
Paying attention to (4.3.13), we see
and
i.e.,
This means that is located in one of the attraction domains and Furthermore, by (4.3.12) and (4.3.13), the observations carried out in these domains are free of noise. Let us consider the further development of the algorithm once has fallen into the interval or To be specific, let us assume
For we have
or which implies If say, then since
It suffices to consider the case where i.e., because for the case we again have (4.3.14) and

Simple computation shows that starting from the observations are free of noise, and the algorithm becomes
As a result of computation, we have
Then, starting from the algorithm will be iterated according to (4.3.14), and hence
For the case it can similarly be shown that

Therefore, no matter how the initial value is chosen, will never converge to the global minimizer if in (GO1) does not diverge to infinity.
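The role of the growing divisor can also be seen numerically: within search period i the accumulated noise term behaves like a weighted sum of the observation errors, whose size stays bounded away from zero under weights 1/n but dies out under weights 1/(n·i). A sketch (Gaussian noise and the particular sequences are illustrative assumptions):

```python
import random

rng = random.Random(0)
N = 100  # KW steps per search period

def period_noise(i, divisor_grows):
    # Accumulated noise a_1 e_1 + ... + a_N e_N inside search period i.
    s = 0.0
    for n in range(1, N + 1):
        a_n = 1.0 / (n * i) if divisor_grows else 1.0 / n
        s += a_n * rng.gauss(0.0, 1.0)
    return abs(s)

fixed = [period_noise(i, False) for i in range(1, 51)]
growing = [period_noise(i, True) for i in range(1, 51)]
# With a_n = 1/n the per-period noise level never dies out;
# with a_n = 1/(n*i) the late periods are almost noise-free.
```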
Let us introduce the conditions to be used. Since we are seeking global minima of Condition A4.1.2' should be modified.
A4.3.1 is locally Lipschitz continuous,
and L(J) is nowhere dense, where denotes the set of extremes of L.
Note that for seeking minima of the corresponding part in A4.1.2' should be modified as follows: used in (4.1.11) is such that
for some and But this is implied by assuming
A4.3.2
A4.3.3 For any convergent subsequence of
where denotes given by (4.3.3) with replaced by denotes used for the i-th search period, and
A4.3.4 For any convergent subsequence
where is given by (1.3.2).
It is worth emphasizing that each in the sequence is used only once when we form and
We now give sufficient conditions for A4.3.2, A4.3.3, and A4.3.4. For this, we first need to define generated by the estimates and derived up to the current time. Precisely, for running in the search period of Step (GO1) define
and for running in Step (GO4) define
Remark 4.3.1 If both sequences
and are martingale difference sequences with
and if
for some then A4.3.2 holds.
This is because
is a martingale difference sequence with bounded second conditional moment, and hence
which implies (4.3.15). By using the second parts of conditions (4.3.22) and (4.3.23), (4.3.16)
can be verified in a similar way.
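The convergence theorem invoked here — for a martingale difference sequence with bounded second moments, a weighted sum with square-summable weights converges a.s. — can be illustrated numerically with weights 1/k (iid Gaussian noise is an illustrative stand-in):

```python
import random

rng = random.Random(0)

# Partial sums of sum_k e_k / k with iid zero-mean e_k (a martingale
# difference sequence); sum_k 1/k^2 < infinity forces a.s. convergence.
s, checkpoints = 0.0, []
for k in range(1, 200001):
    s += rng.gauss(0.0, 1.0) / k
    if k % 50000 == 0:
        checkpoints.append(s)

spread = max(checkpoints) - min(checkpoints)  # tail fluctuations are tiny
```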
Remark 4.3.2 If and is independent of
and if there exists
such that then by the uncorrelatedness of
with for or
where M is a constant. From this, it follows that
and hence A4.3.3 holds.
Remark 4.3.3 If and is independent of
then by the martingale convergence theorem, A4.3.4 holds.
We now formulate the convergence theorem for the global optimization algorithm (GO1)–(GO5).
Theorem 4.3.1 Assume A4.1.1, A4.3.2, A4.3.3, and A4.3.4 hold. Further, assume that selected in (GO1) is dense in an open set U, for some and
If then
where is derived at Step (GO4) and is the set of global minimizers of
The proof of the theorem is separated into lemmas. We recall that the essence of the proof of the basic convergence Theorem 2.2.1 consists in showing the following property: cannot cross a nonempty interval infinitely often if We need to extend this property to a family of algorithms.
Assume for each fixed the observation is
and the algorithm develops as follows
where
Assume, further, for fixed
Lemma 4.3.1 Assume L(J) is nowhere dense, where Let be a nonempty interval such that If there are
two sequences and such that
and is bounded, then it is impossible to have
where
Proof. Without loss of generality we may assume that converges (otherwise, it suffices to select a subsequence).
Assume the converse, i.e., that (4.3.28) holds. Along the lines of the proof of Theorem 2.2.1 we can show that

for some constant M if is sufficiently large. As a matter of fact, this is an analogue of (2.2.3). From (4.3.29) the following analogue of (2.2.15) takes place:
and the algorithm for has no truncation for if is large enough, where is a constant. Similarly to (2.2.27), we then have
(2.2.27), we then have
and
and
for some small T > 0 and all sufficiently large From this, by (4.3.27) and the convergence of it follows that
By continuity of and (4.3.30) we have
which implies that for small enough T. Then by definition,
which contradicts (4.3.32). The obtained contradiction shows the impossibility of (4.3.28).
Introduce
such that
and
Lemma 4.3.2 Let be given by (GO1). Assume A4.3.1 and A4.3.3 hold and for some Then for any occurs infinitely often with probability 0, i.e.,
Proof. Since L(J) is nowhere dense, for any belonging to infinitely many of there are subsequences such that
and
whereand
By assumption, as must be bounded. Hence, is bounded. Without loss of generality we may assume that is convergent.

Notice that at Step (GO1), is calculated according to (4.1.11) and (4.1.12) with given by (4.3.2) and (4.3.3), i.e.,
which differ from (4.1.11), (4.1.12), (4.3.2), and (4.3.3) by the superscript (i), which means the calculation is carried out in the search period.
By (4.1.27) with the notations (4.1.33) and (4.1.34), equipped with the superscript we have
where
If we can show that and
where
then by Lemma 4.3.1, (4.3.42) contradicts the fact that all the sequences cross the interval which is disjoint from L(J). This then proves (4.3.36).

We now show that for all sufficiently large if T is small enough. Since and are finite, where

We now show that on the if is sufficiently large and T is small enough.

Suppose the converse: for any fixed T > 0, there always exists
no matter how large is taken, such that Since by continuity of there is a constant
q > 0 such that
For any let us estimate By
and the local Lipschitz continuity of it is seen that
is uniformly bounded with respect to and all Then by A4.3.3, it follows that there is a constant such that
From this it follows that there is no truncation for and
Let T be so small that
On the other hand, however, we have and The obtained contradiction shows that for all sufficiently large if T is small enough.

We now prove (4.3.42). Let us order in the following way
From (4.1.34) and by the fact that is an iid sequence and is independent of the sums appearing in (4.1.34), it is easy to see that is a martingale difference sequence.
By the condition for some it is clear that for with being a constant. Then we have
By (4.1.28) and (4.3.8), we have
where is a constant. Noticing that for large and small T, by (4.3.44), (4.3.45), and A4.3.3 we may assume sufficiently large and T small enough such that
This will imply (4.3.42) if we can show
We prove (4.3.47) by induction. We have by the definition of Assume that
and by the convergence theorem for martingale difference sequences
and Then there is no truncation at time since by (4.3.46) (with chosen such that
if T in (4.3.46) is sufficiently small. Then by (4.3.40), we have
and by (4.3.43) and (4.3.46)
for small T. This completes the induction, and (4.3.42) is proved, which, in turn, concludes the lemma.
Lemma 4.3.3 Assume A4.3.1–A4.3.3 hold. Further, assume that for some and If there
exists a subsequence such that then
Proof. For any by Lemma 4.3.2 there exists such that for any if By (GO2), we have
Then by A4.3.2, there exists such that, for any
This implies the conclusion of the lemma by the arbitrariness of
Lemma 4.3.4 Assume A4.3.1–A4.3.3 hold, for
some and If subsequence is such that
then
where denotes the closure of L(J), and and are given by (GO1) and (GO2) for the search period.
Proof. Since by A4.3.1, from (4.3.50) it is seen that contains a bounded infinite subsequence, and hence a
convergent subsequence (for simplicity of notation, assume
such that
Since there exists a such that and hence
Define
It is worth noting that for any T > 0, is well defined for all
sufficiently large because and hence
We now show that
By the same argument as that just used before, without loss of generality, we may assume is convergent (otherwise, a convergent subsequence should be extracted) and thus
We have to show
as
By the same argument as that used for deriving (2.2.27), it follows that there is such that
which implies the correctness of (4.3.53). From (4.3.53) it follows that
because, otherwise, we would have a subsequence with
such that and by (4.3.54)
for large However, by (2.2.15), so for small enough T > 0, (4.3.56) is impossible. This verifies (4.3.55).
We now show
Assume the converse, i.e.,
From (4.3.54) and (4.3.58) it is seen that for all sufficiently large the sequence

contains at least one crossing of the interval with In other words, we are dealing with a sample path on which both (4.3.54) and (4.3.58) are satisfied. Thus, belongs to By Lemma 4.3.2, the set composed of such has probability zero. This verifies (4.3.57).
From (4.3.57) it follows that
for all sufficiently large
Notice that from the following elementary inequalities
by (4.3.5) it follows that
By definition of we write
By (4.3.59) and (4.3.61), noticing we have
because
By (4.3.55) and (4.3.61) we have
Since by (4.3.15), combining (4.3.62)–(4.3.64)
leads to
which completes the proof of the lemma.
Lemma 4.3.5 Let be given by (GO1)–(GO5). Assume that A4.3.1–
A4.3.4 hold, initial values selected in (GO1) are dense in an open set U containing the set of global minima of
for some and Then for any
Proof. Among the first search periods denote by the number of those search periods for which are reset to be i.e.,

Since L(J) is not dense in any interval, there exists an interval such that So, for the lemma it suffices to prove that cannot cross infinitely many times a.s. If then after a finite number of steps, is generated
by (GO4). By Lemma 4.3.1 the assertion of the lemma follows immediately. Therefore, we need only consider the case where
Denote by the search period for which a resetting happens, i.e., It is clear that by
In the case by (GO4) the algorithm generates a family
of consecutive sequences:
Let us denote the sequence by
and the corresponding sequence of the values of by
Let be sufficiently small such that
and which is possible because L(J) is nowhere dense.
Since is dense in U, visits infinitely often. Assume
By Lemma 4.3.2
if is large enough. Define
This means that the first resetting in or after the search period occurs in the search period.
We now show that there is a large enough such that the following requirements are simultaneously satisfied:
where is fixed;
i) implies

ii) does not cross the intervals and

iii)

iv)

v)

We first show ii)–v). Since all three intervals indicated in ii) have an empty intersection with L(J), by Lemma 4.3.1, ii) is true if S is large enough. It is clear that iii) and iv) are correct for fixed and if is large enough, while v) is true because
For i) we first show that there are infinitely many for which
By (4.3.68) and (4.3.71) we have
Consider two cases.

1) There is no resetting in the search period. Then
and by (4.3.72) and (4.3.74) it follows that
By (4.3.70) and the definition of there exists at least one integer among such that
because, otherwise, we would have which contradicts (4.3.74).
By ii) we conclude that
and by (4.3.68) we also have (4.3.76). From (4.3.76), by ii) does not cross for
Consequently,
This together with (4.3.70) implies that
and, in particular,

2) If there is a resetting in the search period, then
By (GO3) we then have
Noticing as we conclude that there are infinitely many for which (4.3.73) holds.
We now show that there is a such that
where lim sup is taken along those for which (4.3.73) holds. Assume the converse: there is a subsequence of such that
Then by Lemma 4.3.4,
which contradicts (4.3.73). This proves (4.3.78), and also i). As a matter of fact, we have proved more than i): precisely, we have shown that there are infinitely many for which (4.3.73) holds, and for (4.3.73) implies the following inequality:
Let us denote by the totality of those for which (4.3.73) holds and What we have just proved is that contains infinitely many if
Consider a sequence By ii) it cannot cross the interval This means that
Then by (4.3.70)
and by (GO3)
since is a search period with resetting.

Thus, we have shown that if then also belongs to Therefore, and
From here and (4.3.67) it follows that
Since may cross the interval only a finite number of times by Lemma 4.3.1. This completes the proof of the lemma.
Proof of Theorem 4.3.1. By Lemma 4.3.5 the limit exists. By the arbitrariness of from (4.3.69) it follows that
By continuity of we conclude that
4.4. Asymptotic Behavior of the Global Optimization Algorithm

In the last section a global optimization algorithm combining the KW algorithm with a search method was proposed, and it was proved that the algorithm converges to the set of global minimizers, i.e.,
However, in the algorithm defined by (GO1)–(GO5), resettings are involved. The convergence by no means excludes resettings asymptotically. In other words, although it may still happen that

where is defined in Lemma 4.3.5, i.e., it may still be possible to have infinitely many resettings.
In what follows we will give conditions under which
In this case, the global optimization algorithm (GO1)–(GO5) asymptotically behaves like a KW algorithm with expanding truncations and randomized differences, because for large is purely generated by (GO4) without resetting.
A4.4.1 is a singleton, is twice continuously differentiable in the ball centered at with radius for some and
of is positive definite.
A4.4.2 and ordered as in (4.3.20), (4.3.21) and Remark 4.3.1 are martingale difference sequences with
A4.4.3 is independent of
for and
and
for
We recall that is the observation noise in the search period.
A4.4.4 is independent of and where
denotes the observation noise when is calculated in (GO4).
Lemma 4.4.1 Assume A4.4.2 holds and, in addition,
Then, there exists an (possibly depending on ) such that for any
and
and
Proof. Notice that by A4.4.2 is a martingale
difference sequence with bounded conditional variance. By the convergence theorem for martingale difference sequences

which implies (4.4.2). Estimate (4.4.3) can be proved in a similar way.
Lemma 4.4.2 Assume A4.4.3 and A4.4.4 hold. If for some then
and
for where and are given in (4.1.34), where the superscript denotes the corresponding values in the i-th search period.
Proof. Let us prove
Note that
is a martingale difference sequence with bounded conditional second moment. So, by the convergence theorem for martingale difference sequences, for (4.4.6) it suffices to show
By assumption of the lemma or and
for large The last inequality yields
and hence
Therefore,
Thus, (4.4.6) is correct. As noted in the proof of Lemma 4.3.2, is a martingale difference sequence. So, (4.4.4) is true.
Similarly, (4.4.5) is also verified by using the convergence theorem for martingale difference sequences.
Lemma 4.4.3 In addition to the conditions of Theorem 4.3.1, suppose that A4.4.1 and A4.4.3 hold, is positive definite, and

for some Then there exists a sufficiently large such that, for if the inequality
holds for some with then the following inequality holds
Proof. By A4.4.1 and Taylor's expansion, we have
i.e.,
where
Therefore, for any there is a such that for any
and
where and denote the minimum and maximum eigenvalues of H, respectively, and o(1) is as given in (4.4.10).
Since is the unique minimizer of and is continuous, there is such that if We always assume that is large enough such that
and
where is used in (GO1). From (4.4.8) it then follows that and there is no truncation at time
Denote
For satisfying (4.4.8) and we have
where is given by (4.3.41).
By (4.4.11) it then follows that
where is given by (4.1.33) with the superscript denoting the search period and
By (4.4.14) it is clear that
Let
For (4.4.9) it suffices to show that Assume the converse: Let
By (4.4.20), for all
and hence,
Thus, (4.4.12)-(4.4.14) are applicable.
By (4.4.17) and the second inequality of (4.4.13), we have for
which combined with (4.4.21) yields
Applying the first inequality of (4.4.13) and then (4.4.20) leads to
Since for there is no truncation for Using (4.4.18) we have
where
We now show that is negative for all sufficiently large Let us consider the terms in By assumption,
from (4.4.19) and (4.4.22) it follows that
We now estimate the second term on the right-hand side of (4.4.25) after multiplying it by
From (4.4.4) and (4.4.16) it follows that
uniformly with respect to and with
Noticing that with being a constant,
and that which implies we find
Then, noticing that is bounded by some constant we have
For the third term on the right-hand side of (4.4.25), multiplying it by we have
where is a constant. Finally, for the last term of (4.4.25) we have the following estimate
Combining (4.4.26)–(4.4.30) we find that
where
and for large
Consequently, from (4.4.25) it follows that
We now show that
by induction. Assume it holds for i.e.,
which has been verified for We have to show it is true for
By (4.4.18) we have
and
where
Comparing (4.4.35) with (4.4.25), we find that in lieu of and we now have and respectively. But for both cases we use the same estimate (4.4.27). Therefore, by exactly the same argument as (4.4.26)–(4.4.30), we can prove that
and for large
Thus, we have proved (4.4.32). By the elementary inequality
for which is derived from
for any matrices A and B of compatible dimensions, we derive
from (4.4.32)
As mentioned before, for and there is no truncation. Then by (4.4.18)
where
Then from (4.4.36) and (4.4.27) it follows that
where
which tends to zero as by (4.4.27) and (4.4.38). Then
where for the last equality (4.4.10) is used. Finally, by (4.4.21), for large from (4.4.39) it follows that
which combined with (4.4.10) yields

This contradicts (4.4.20), the definition of The obtained contradiction shows
Theorem 4.4.1 Assume that A4.3.1, A4.4.1–A4.4.4 hold, and
is positive definite for some Further, assume that
and for some constants
Then the number of resettings is finite, i.e.,
where is the number of resettings among the first search periods (GO1), and is given in (GO3).
Proof. If (4.4.44) were not true, then there would be a set S with positive probability such that, for any there exists a subsequence such that at the search period a resetting occurs, i.e.,
Notice that
by (4.4.41) and and
by (4.4.41) and (4.4.42). Hence, the conditions of Lemma 4.4.1 are satisfied. Without loss of generality, we may assume that (4.4.2)–(4.4.5) and the conclusion of Theorem 4.3.1 hold. From now on, assume that
is fixed. It is clear that, for any constant

if is large enough, since for

Let
Rewrite (4.4.46) as
Define
and
Noticing that there is no resetting between and and that (4.4.47) corresponds to (4.4.8), by the same argument as that used in the proof of Lemma 4.4.3, we find that, for any
Since we have
By (4.4.3), (4.4.42), and (4.4.43) it follows that
where for the last inequality (4.4.41) is used. Thus, by (4.4.40)
By (4.4.33) it follows that
provided is large enough, where for the last inequality (4.4.2) is used.
Since by (4.4.43)
and since
and
we find
where the last inequality follows from (4.4.40). Using (4.4.51) and (4.4.53), from (4.4.52) for sufficiently large we
have
Using the second inequality of (4.4.43) and then observing that
and
by (4.4.40) and (4.4.41) and we find
We now show that there is such that
Assume the converse:
with
Then, we have
for large enough because
Inequality (4.4.57) contradicts (4.4.55). Consequently, (4.4.56) is true. In particular, for we have
By exactly the same argument as that used for (4.4.47)–(4.4.50), by
noticing that there is no resetting from to we conclude
that
By the same treatment as that used for deriving (4.4.54) from (4.4.50), we obtain

Comparing (4.4.58) with (4.4.54), we find that has been changed to and this procedure can be continued if the number of resettings
is infinite. Therefore, for any we have
From (4.4.40) we see
Since we have and hence by
Consequently, by (4.4.41) the right-hand side of (4.4.59) can be estimated as follows:
by (4.4.61) if is large enough.

However, the left-hand side of (4.4.59) is nonnegative. The obtained contradiction shows that must be finite, and (4.4.44) is correct.

By Theorem 4.4.1, our global optimization algorithm coincides with the KW algorithm with randomized differences and expanding truncations for sufficiently large Therefore, the theorems proved in Section 4.2 are applicable to the global optimization algorithm. By Theorems 4.2.1 and 4.2.2 we can derive the convergence rate and asymptotic normality of the algorithm described by (GO1)–(GO5).
4.5. Application to Model Reduction

In this section we apply the global optimization algorithm to system modeling. A real system may be modeled by a high order system which, however, may be too complicated for control design. In control engineering the order reduction of a model is of great importance. In the linear system case, this means that a high order transfer function is to be approximated by a lower order transfer function. For this one may use methods like balanced truncation and Hankel norm approximation. These methods are based on the concept of the balanced realization. We are interested in recursively estimating the optimal coefficients of the
reduced model by using the stochastic optimization algorithm presentedin Section 4.3.
Let the high order transfer function be
and let it be approximated by a lower order transfer function If is of order then is taken to be of order
To be specific, let us take to be a polynomial of order and of order
where the coefficients should not be confused with the step sizes used in Steps (GO1)–(GO5). Write as where and stand for the coefficients of and
It is natural to take
as the performance index of approximation. The parameters and are to be selected to minimize under the constraint that
is stable. For simplicity of notation we denote and write as

Let us describe the set where has the required property. Stability requires that
This implies that
because is the sum of two complex-conjugate roots of
If then which yields If then and hence
(or ).
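For a denominator normalized as z² + a₁z + a₂ (an illustrative convention; the book states D in its own coefficients), the region just derived is the classical Schur stability triangle: both roots lie strictly inside the unit circle iff |a₂| < 1 and |a₁| < 1 + a₂. A direct cross-check of the two characterizations:

```python
import cmath

def stable_triangle(a1, a2):
    # Schur stability of z^2 + a1 z + a2 via the triangle conditions.
    return abs(a2) < 1.0 and abs(a1) < 1.0 + a2

def roots_inside(a1, a2):
    # Direct check: both roots strictly inside the unit circle.
    d = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return abs((-a1 + d) / 2.0) < 1.0 and abs((-a1 - d) / 2.0) < 1.0

# The two characterizations agree on a grid of coefficient pairs.
agree = all(
    stable_triangle(0.31 * i - 1.93, 0.31 * j - 1.93)
    == roots_inside(0.31 * i - 1.93, 0.31 * j - 1.93)
    for i in range(13)
    for j in range(13)
)
```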
Set
Identify the and that appeared in Section 4.3 with and respectively, for the present case.

We now apply the optimization algorithm (GO1)–(GO5) to minimizing under the constraint that the parameter in belongs to D. For this we first concretize Steps (GO1)–(GO5) described in Section 4.3.
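The performance index — the mean-square distance between the two frequency responses — can be approximated by a Riemann sum over [−π, π]. A sketch; the coefficient convention c₀ + c₁z⁻¹ + … and the sample systems are illustrative assumptions:

```python
import cmath
import math

def freq_resp(num, den, w):
    # Evaluate num(z)/den(z) at z = e^{jw}, coefficients c0 + c1 z^{-1} + ...
    z = cmath.exp(1j * w)
    val = lambda c: sum(ck * z ** (-k) for k, ck in enumerate(c))
    return val(num) / val(den)

def perf_index(num, den, num_r, den_r, m=2000):
    # Riemann-sum approximation of (1/2*pi) * integral of |G - G_r|^2 dw.
    total = 0.0
    for k in range(m):
        w = -math.pi + 2.0 * math.pi * (k + 0.5) / m
        total += abs(freq_resp(num, den, w) - freq_resp(num_r, den_r, w)) ** 2
    return total / m

J_same = perf_index([1.0], [1.0, -0.5], [1.0], [1.0, -0.5])  # identical models
J_diff = perf_index([1.0], [1.0, -0.5], [1.0], [1.0, -0.2])  # mismatched pole
```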
Since is convex in for fixed we take the fixed initial value
for any search period and randomly select initial values only for according to a distribution density which is defined as follows:
where with and being the uniform distributions over [–2, 2] and [–1, 1], respectively.
After having been selected in the search period, the algorithm
(4.1.11) and (4.1.12) is calculated with and
As to observations, instead of (4.3.1) we will use information about the gradient because in the present case the gradient of can be expressed explicitly:
In the search period the observation is denoted by and is given by
where is independently selected from according to the uniform
distribution, and stands for the estimate for at time in the
search period. It is clear that is an approximation to the integral
(4.5.8) with Therefore, we have observations in the form
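This observation scheme — evaluating the gradient integrand at a uniformly sampled frequency, so that each sample is an unbiased estimate of the true gradient — can be sketched on a toy problem; the 3-tap "true" system and 2-tap reduced model below are illustrative assumptions, not the examples of this section:

```python
import cmath
import math
import random

rng = random.Random(0)

def G(w):
    # "High order" system to be approximated (illustrative 3-tap FIR).
    z = cmath.exp(-1j * w)
    return 1.0 + 0.8 * z + 0.3 * z * z

def Gr(w, th):
    # Reduced 2-tap FIR model.
    z = cmath.exp(-1j * w)
    return th[0] + th[1] * z

def grad_obs(th):
    # Gradient integrand of (1/2*pi) * integral of |G - Gr|^2 dw at one
    # random frequency: an unbiased observation of the true gradient.
    w = rng.uniform(-math.pi, math.pi)
    z = cmath.exp(-1j * w)
    e = G(w) - Gr(w, th)
    return [(-2.0 * e.conjugate() * d).real for d in (1.0, z)]

th = [0.0, 0.0]
for n in range(1, 20001):
    g = grad_obs(th)
    th = [t - gi / (n + 50.0) for t, gi in zip(th, g)]
# Since the taps are orthogonal over frequency, the best 2-tap fit is (1.0, 0.8).
```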
The expanding truncation method used in (4.1.11) and (4.1.12) requires projecting the estimated value to a fixed point if the estimated value appears outside an expanding region. Let us denote it by In (4.1.11) and (4.1.12) the spheres with expanding radii serve as the expanding regions, which are now modified as follows.
Let us write where Define
where
The expanding truncations in (4.1.11) and (4.5.11) are also modified:
where means the projection of Take Then after steps, will be obtained.

Concerning (GO2)–(GO4), the only change consists in the observations. We replace which is defined by in (GO2)–(GO4), by
where are independently selected from according to the uniform distribution for each Clearly, is an approximation to
Finally, take equal to

In control theory there are several well-known model reduction methods, such as model reduction by balanced truncation and Hankel norm approximation, among others. These methods depend on the balanced realization, which is a state space realization method for a transfer matrix keeping the Gramians for controllability and observability of the realized system balanced. In order to compare with the proposed global optimization (GO) method, we take the commonly used model reduction methods of balanced truncation (BT) and Hankel norm approximation (HNA), which are realized by using Matlab. For this, the discrete-time transfer functions are transformed to continuous-time ones by using d2c provided in Matlab. Then the reduced systems are discretized to compute for comparison.
As we take a 10th order transfer function, respectively, for the following examples:
Example 4.5.1
Example 4.5.2
Example 4.5.3
Using the algorithm described in Section 4.3, for Examples 4.5.1–4.5.3 we obtain the approximate transfer functions of order 4, respectively,
denoted by and with
Using Matlab we also derive the 4th order approximations for Examples 4.5.1–4.5.3 by balanced truncation and Hankel norm approximation, which are as follows:
where the subscripts and H denote the results obtained by balanced truncation and Hankel norm approximation, respectively.
The approximation errors are given in the following table:
From this table we see that the algorithm presented in Section 4.3 gives smaller approximation errors in in comparison with the other methods.
We now compare approximation errors in norm and compare step responses between the approximate models and the true one by figures.
In the figures of step response
the solid lines denote the true high order systems;
the dashed lines (- - -) denote the system reduced by Hankel norm approximation;
the dotted lines denote the system reduced by balanced truncation;
the dotted-dashed lines denote the systems reduced by the stochastic optimization method given in Section 4.3.
In the figures of the approximation error
the solid lines denote the systems reduced by the stochastic optimization method;
the dashed lines (- - -) denote the system reduced by Hankel norm approximation;
the dotted lines denote the system reduced by balanced truncation.
Example 4.5.1
Example 4.5.2
Example 4.5.3
These figures show that the algorithm given in Section 4.3 gives a smaller approximation error in in comparison with the other methods for Example 4.5.1 and an intermediate error in for Examples 4.5.2 and 4.5.3. Concerning step responses, the algorithm given in Section 4.3 provides a better approximation in comparison with the other methods for all three examples.
4.6. Notes and References

The well-known paper [61] by Kiefer and Wolfowitz is the pioneering work using the stochastic approximation method for optimization. The random version of the KW algorithm was introduced in [63], and the random direction version of the KW algorithm was dealt with in [85] by the ODE method. Theorems 2.4.1, 2.4.2 given in Section 4.1 are presented in [21], while Theorem 2.4.4 in [18]. The results on the convergence rate and asymptotic normality of the KW algorithm presented in Section 4.2 can be found in [21].
Global optimization based on noisy observations by discrete-time simulated annealing is considered in [45, 52, 100]. Combination of the KW algorithm with a search method for global optimization is dealt with in [97]. A better combination given in [49] is presented in Sections 4.3 and 4.4.
For model reduction we refer to [51, 102]. The global optimization method presented in Section 4.3 is applied to model reduction in Section 4.5, which is written based on [22].
Chapter 5
APPLICATION TO SIGNAL PROCESSING
The general convergence theorems developed in Chapter 2 can deal with noises containing not only random components but also structural errors. This property allows us to apply SA algorithms to parameter estimation problems arising from various fields. The general approach, roughly speaking, is as follows. First, the parameter estimation problem coming from practice is transformed to a root-seeking problem for a reasonable but unknown function which may not be directly observed. Then, the real observation is artificially written in the standard form

with Normally, it is quite straightforward to arrive at this point. The main difficulty is to verify that the complicated noise satisfies one of the noise conditions required in the convergence theorems. It is common that there is no standard method to complete the verification procedure, because the noises arising in different problems are completely different from each other.
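As a purely illustrative sketch of this root-seeking recipe (the scalar function, the noise model, and the step sizes below are all hypothetical, not taken from the text), a Robbins-Monro iteration driven by observations in the standard form might look like:

```python
import random

random.seed(0)
theta = 2.5        # unknown root of the hypothetical f(x) = theta - x

x = 0.0
for k in range(1, 20001):
    a_k = 1.0 / k                    # step sizes: sum a_k = inf, sum a_k^2 < inf
    eps = random.gauss(0.0, 1.0)     # zero-mean observation noise
    y = (theta - x) + eps            # observation written in the standard form
    x = x + a_k * y                  # Robbins-Monro update

print(x)  # close to theta = 2.5
```

Verifying that the accumulated noise actually satisfies one of the convergence conditions is precisely the problem-specific step the text refers to.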
In Section 5.1, SA algorithms are applied to solve the blind channel identification problem, an active topic in communication. In Section 5.2, the principal component analysis used in pattern classification is dealt with by SA methods. Section 5.3 continues the problem discussed in Section 5.1, but in a more general setting. Namely, unlike Section 5.1, the covariance matrix of the observation noise is no longer assumed to be known. In Section 5.4, adaptive filtering is considered: very simple conditions for convergence of sign-algorithms are given. Section 5.5 discusses the asymptotic behavior of asynchronous SA algorithms, which take the possible communication delays between parallel processors into consideration.
5.1. Recursive Blind Identification

In the area of systems and control, the unknown parameters are estimated on the basis of observed input and output data of the system. This is the subject of system identification. In contrast to this, for communication channels only the channel output is observed and the channel input is unavailable. The topic of blind channel identification is to estimate channel parameters by using the output data only. Blind channel identification has drawn much attention from researchers because of its potential applications in wireless communication. However, most existing estimation methods are “block” algorithms in nature, i.e., parameters are estimated after the entire block of data has been received.
By using the SA method, here a recursive approach is presented: estimates are continuously improved while receiving new signals.
Consider a system consisting of channels with L being the maximum order of the channels. Let be the one-dimensional input signal, and be the channel output at time where N is the number of samples and may not be fixed:
where
are the unknown channel coefficients. Let us denote by

the coefficients of the channel, and by

the coefficients of the whole system, which compose a vector.
The observations may be corrupted by noise
where is a vector. The problem is to estimate on the basis of observations.
Let us introduce polynomials in backward-shift operator
where

Write and in the component forms
respectively, and express the component via
From this it is clear that
Define
where is a
It is clear that is a matrix. Similar to and let us define and and and which have the same structure as and but with replaced by and respectively.
By (5.1.5) we have
From (5.1.8), (5.1.4), and (5.1.10) it is seen that
This means that the channel coefficient satisfies the set of linear equations (5.1.12) with coefficients being the system outputs.

From the input sequence we form the (N – 2L + 1) × (2L + 1) Hankel matrix
It is clear that the maximal rank of is 2L + 1 as
If is of full rank for some then will also be of full rank for any
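Since the display defining the Hankel matrix is not reproduced in this excerpt, the following sketch builds an (N − 2L + 1) × (2L + 1) Hankel matrix from an input record; the exact layout is an assumption for illustration only.

```python
import numpy as np

def hankel_input_matrix(u, L):
    """Form an (N - 2L + 1) x (2L + 1) Hankel matrix from the input
    samples u = (u_0, ..., u_N); the exact layout is an assumption,
    since the text's display is not reproduced in this excerpt."""
    N = len(u) - 1
    rows, cols = N - 2 * L + 1, 2 * L + 1
    return np.array([[u[i + j] for j in range(cols)] for i in range(rows)])

u = np.arange(10.0)              # N = 9 samples of a hypothetical input
H = hankel_input_matrix(u, L=2)
print(H.shape)                   # (6, 5); anti-diagonals of H are constant
```

For a generic (persistently exciting) input, such a matrix attains the maximal rank 2L + 1, which is the content of condition A5.1.2 below.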
Lemma 5.1.1 Assume the following conditions hold:
A5.1.1 have no common root.
A5.1.2 The Hankel matrix composed of the input signal is of full rank (rank = 2L + 1).
Then is the unique (up to a scalar multiple) nonzero vector simultaneously satisfying

Proof. Assume there is another solution to (5.1.14), which is different from

where is

Denote
From (5.1.15) it follows that
By (5.1.7), we then have
which implies
where by we denote the (2L + 1)-dimensional vector composed of the coefficients of the polynomial written in the form of increasing orders of
Since is of full rank, In other words,
For a fixed (5.1.17) is valid for all Therefore, all
roots of should be roots of for all By A5.1.1,
all roots of must be roots of Consequently, there is a
constant such that Substituting this into (5.1.17) leads to
and hence Thus, we conclude that
We first establish a convergence theorem for blind channel identification based on stochastic approximation methods for the case where a noise-free data sequence is observed.
Then, we extend the results to the case where N is not fixed and the observation is noise-corrupted.
Assume is observed. In this case are available, and we have We will repeatedly use the data by setting

Define the estimate for recursively by

with an initial value We need the following condition.
Theorem 5.1.1 Assume A5.1.1–A5.1.3 hold. Let be given by (5.1.19) with any initial value with Then
where is a constant.
Proof. Decompose and respectively into orthogonal vectors:

where

If serves as the initial value for (5.1.19), then by (5.1.14),
Again, by (5.1.14) we have
and we conclude that
and
Therefore, for proving the theorem it suffices to show that as
Denote
and

Then by (5.1.21) we have

Noticing that and is uniformly bounded with respect to for large we have

and

By (5.1.18)
and by Lemma 5.1.1, is its unique (up to a constant multiple) eigenvector corresponding to the zero eigenvalue, and the rank of
is
Denote by the minimal nonzero eigenvalue of
Let be an arbitrary vector orthogonal to Then can be expressed by
where – 1, are the unit eigenvectors of
corresponding to its nonzero eigenvalues. It is clear that
By this, from (5.1.23) and (5.1.24), it follows that for
and
Noticing that
we conclude
and hence
From (5.1.21) it is seen that is nonincreasing for Hence, the convergence implies that

The proof is completed.

Remark 5.1.1 If the initial value is orthogonal to then and (5.1.20) is also true. But this is an uninteresting case, giving no information about
Remark 5.1.2 Algorithm (5.1.19) is an SA algorithm with a linear time-varying regression function The root set J for is time-invariant: As mentioned above, evolves in one of the subspaces depending on the initial value: In the proof of Theorem 5.1.1 we have actually verified that may serve as the Lyapunov function satisfying A2.2.20 for Then applying Remark 2.2.6 also leads to the desired conclusion.
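The recursion (5.1.19) itself is not displayed in this excerpt. The sketch below implements one plausible iteration of this type on a hypothetical two-channel FIR example with L = 1: the cross-relation between the two channel outputs yields regressors phi with phi @ h = 0 at the true coefficients, and a normalized stochastic-gradient step drives the estimate toward the direction of h, up to a scalar multiple as in Theorem 5.1.1. All channel and signal choices here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([1.0, 0.5, 1.0, -0.3])      # hypothetical (h1_0, h1_1, h2_0, h2_1); coprime channels

theta = np.array([1.0, 0.0, 0.0, 0.0])   # unit initial estimate, not orthogonal to h
s_prev = rng.normal()
x1_prev = x2_prev = 0.0
for k in range(1, 100001):
    s = rng.normal()                     # unobserved channel input
    x1 = h[0] * s + h[1] * s_prev        # observed output of channel 1
    x2 = h[2] * s + h[3] * s_prev        # observed output of channel 2
    if k > 1:
        # cross-relation regressor: phi @ h = 0 for the true coefficients
        phi = np.array([x2, x2_prev, -x1, -x1_prev])
        theta = theta - (1.0 / k) * (phi @ theta) * phi
        theta /= np.linalg.norm(theta)   # keep the estimate on the unit sphere
    s_prev, x1_prev, x2_prev = s, x1, x2

cosine = abs(theta @ h) / np.linalg.norm(h)
print(cosine)  # close to 1: theta aligns with h up to a scalar multiple
```

The coprimeness of the two channels (condition A5.1.1) is what makes the direction of h the unique null direction that the iteration can converge to.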
We now assume the input signal is a sequence of infinitely many mutually independent random variables and that the observations do not contain noise, i.e., in (5.1.5).

Lemma 5.1.2 Assume A5.1.1 holds and is a sequence of mutually independent random variables with Then is the unique unit eigenvector corresponding to the zero eigenvalue for the matrices
and the rank of is
Proof. Since is a sequence of mutually independent random variables and it follows that
where
Proceeding along the lines of the proof of Lemma 5.1.1, we arrive at the analogue of (5.1.16):
which implies
From (5.1.28) and (5.1.29) it follows that Then, following the proof of Lemma 5.1.1, we conclude that is the unique unit vector satisfying

This shows that is of rank and is its unique unit eigenvector corresponding to the zero eigenvalue.

Let denote the minimal nonzero eigenvalue of On we need the following condition.
A5.1.4 is a sequence of mutually independent random variables with for some and such that
Condition A5.1.3 is strengthened to the following A5.1.5.
A5.1.5 A5.1.3 holds and where is given in A5.1.4.
It is obvious that if is an iid sequence, then is a positive constant, and (5.1.30) is automatically satisfied.

Theorem 5.1.2 Assume A5.1.1, A5.1.4, and A5.1.5 hold, and is given by (5.1.19) with initial value Then
where
Proof. In the present situation we still have (5.1.21) and (5.1.22). So, it suffices to show
With N replaced by 4L in the definitions of and we again arrive at (5.1.23).
Since
converges a.s. by A5.1.4 and A5.1.5, there is a large such that

Let be an arbitrary vector such that Then by Lemma 5.1.2,
and hence
Therefore, which
tends to zero since This implies
is bounded, and
a.s.,
We now consider the noisy observation (5.1.5). By the definition (5.1.11), similar to (5.1.9) we have

where and have the same structure as given by (5.1.10) with replaced by and respectively.
The following truncated algorithm is used to estimate
with initial value and

Introduce the following conditions.
A5.1.6 and are mutually independent and each of them is a sequence of mutually independent random variables (vectors) such that
and
for some
and where is given in A5.1.4.
Set
Then
Denote by the resetting times, i.e., Then, we have
A5.1.7
and
Let be an orthogonal matrix, where
Denote
Then
Noticing we find that
Lemma 5.1.3 Assume A5.1.6 and A5.1.7 hold. Then for given by (5.1.32),
Proof. Setting
we have
and
By A5.1.6, is a martingale difference sequence with
Noticing and we find
that
by the convergence theorem for martingale difference sequences. Since is independent of
and we also have
which together with (5.1.42) implies (5.1.41).
Lemma 5.1.4 Under the condition A5.1.6, if then there is a constant possibly depending on the sample path, such that
where
Proof. By A5.1.6 there is a constant possibly depending on the sample path, such that
Then the lemma follows from (5.1.36) by noticing
Lemma 5.1.5 Assume A5.1.1 and A5.1.6 hold. Then for any and any the matrix
has rank and serves as its unique unit eigenvector corresponding to the zero eigenvalue.

Proof. Since is a sequence of mutually independent nondegenerate random variables, where
Notice that coincides with given by (5.1.13) if we set N = 4L and in (5.1.13).
Proceeding as in the proof of Lemma 5.1.1, we again arrive at (5.1.16). Then, we have Since

we find that Then by the same argument as that used in the proof of Lemma 5.1.1, we conclude that for any is the unique unit nonzero vector simultaneously satisfying
Since is a matrix, the above assertion
proves that the rank of is and also
proves that is its unique unit eigenvector corresponding to the zeroeigenvalue.
Denote by the minimal nonzero eigenvalue of
We need the following condition.
A5.1.8 There is a such that
It is clear that if is an iid sequence, then is independent of and and A5.1.8 is automatically satisfied.
Lemma 5.1.6 Assume A5.1.1 and A5.1.6–A5.1.8 hold. Then for any
if N is large enough, where with c and given in A5.1.7 and A5.1.8, respectively.

Proof. Let be the orthogonal matrix composed of eigenvectors of By Lemma 5.1.5,
is the only eigenvector corresponding to the zero eigenvalue. Since can be expressed as
Then
By A5.1.4 is bounded with respect to and hence by (5.1.48) and the nonincreasing property of we have

where denotes the integer part of Since we have
which combined with (5.1.44) leads to

for large enough

where and
Theorem 5.1.3 Assume A5.1.1 and A5.1.6–A5.1.8 hold. Then for given by (5.1.32) with initial value
and
where is a random variable expressed by (5.1.60).
Proof. We first prove that the number of truncations is finite, i.e., a.s.
Assume the converse:
By Lemma 5.1.3, for any given
and
as
if is large enough, say,

By the definition of we have
which combined with (5.1.52) implies
and
Define
Since is well-defined
by (5.1.54). Notice that from to there is no truncation. Consequently,
and
With fixed, let us take From (5.1.52) and (5.1.54) it follows that the sequences starting from cross the interval for each This means that crosses the interval for each Here, we say that the sequence crosses an interval with if and there is no truncation in the algorithm (5.1.32) for

Without loss of generality, we may assume converges:
It is clear that and By Lemma 5.1.4, there is no truncation for if T is small enough.

Then, similar to (2.2.24), for large by Lemmas 5.1.3 and 5.1.4 we have
have
where and
By Lemma 5.1.6, for large and small T we have
By Lemma 5.1.4, Noticing that

and by the definition of crossing, we see that for small enough T,
This implies that
Letting in (5.1.57), we find that
which contradicts (5.1.58). The contradiction shows that
Thus, starting from the algorithm (5.1.32) undergoes no truncation.

If did not converge as then
and would cross a nonempty interval
infinitely often. But this leads to a contradiction as shown above. Therefore, converges as
If were not zero, then there would exist a convergent
subsequence Replacing in (5.1.56) by from (5.1.57) it follows that
Since converges, the left-hand side of (5.1.59) tends to zero, which makes (5.1.59) a contradictory inequality. Thus, we have proved
a.s.
Since from (5.1.40) it follows that
By (5.1.38) and the fact that we finally conclude that

a.s.

The difficulty in applying the algorithm (5.1.32) lies in the fact that the second moment of the noise may not be available. Identification of the channel coefficients without using will be discussed in Section 5.3, by using the principal component analysis to be described in the next section.
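For orientation, the expanding-truncation device used by algorithm (5.1.32), whose display is not reproduced in this excerpt, can be sketched on a scalar toy problem; the function, the noise, the bounds, and the resetting point below are all hypothetical.

```python
import random

random.seed(1)
theta = 4.0                       # root of the hypothetical f(x) = theta - x
M = [2.0 * (s + 1) for s in range(100)]   # expanding truncation bounds M_0 < M_1 < ...
x, sigma = 0.0, 0                 # estimate and truncation counter
x_star = 0.0                      # fixed resetting point with |x_star| < M_0

for k in range(1, 50001):
    a_k = 1.0 / k
    y = (theta - x) + random.gauss(0.0, 1.0)   # noisy observation of f at x
    cand = x + a_k * y
    if abs(cand) > M[sigma]:      # truncation: reset and enlarge the bound
        x, sigma = x_star, sigma + 1
    else:
        x = cand

print(sigma, x)  # finitely many truncations; x ends near theta
```

The point of the device is that no a priori bound on the sought root is needed: the bounds grow until the trajectory stays inside one of them, after which only finitely many truncations occur, as in Theorem 5.1.3.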
5.2. Principal Component Analysis

Principal component analysis (PCA) is one of the basic methods used in feature extraction, signal processing, and other areas. Roughly speaking, PCA gives recursive algorithms for finding eigenvectors of a symmetric matrix A based on noisy observations of A.
Let be a sequence of observed symmetric matrices, and The problem is to find eigenvectors of A, in particular, the one corresponding to the maximal eigenvalue.

Define

with initial value being a nonzero unit vector. serves as an estimate for a unit eigenvector of A.

If then is reset to a different vector with norm equal to 1.
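The displays (5.2.1)–(5.2.2) are not reproduced in this excerpt; the following is a standard normalized iteration of the kind described, with the matrix A, the noise level, and the step sizes all chosen hypothetically for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([3.0, 1.0, 0.5])                   # hypothetical symmetric matrix
u = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)   # nonzero unit initial value

for k in range(1, 50001):
    a_k = 1.0 / k
    N = rng.normal(0.0, 0.5, size=(3, 3))
    A_k = A + (N + N.T) / 2.0                  # noisy symmetric observation of A
    v = u + a_k * A_k @ u
    n = np.linalg.norm(v)
    if n < 1e-12:                              # the reset step of the algorithm
        v = np.ones(3)
        n = np.linalg.norm(v)
    u = v / n

print(abs(u[0]))  # close to 1: u aligns with the dominant eigenvector of A
```

Normalizing at every step keeps the estimate on the unit sphere, which is why no truncation device is needed in this part of the algorithm.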
Assume have been defined as estimates for unit eigenvectors of A. Denote which is an where

where denotes the pseudo-inverse of Since for large is a full-rank matrix,
Define
if with
If we redefine an with such that
Define the estimate for the eigenvalue corresponding to the eigenvector whose estimate at time is by the following recursion.

Take a sequence increasingly diverging to infinity and define by the SA algorithm with expanding truncations:
where
We will use the following conditions:
A5.2.1 and
A5.2.2 are symmetric, and
A5.2.3 and
where is given by (1.3.2).
Examples for which (5.2.8) is satisfied are given in Chapters 1 and 2. We now give one more example.
Example 5.2.1 Assume is stationary and ergodic,
If then satisfies (5.2.8). Set

By ergodicity, we have a.s. By a partial summation it follows that
which implies (5.2.8).
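The partial summation invoked here is presumably Abel's summation identity, stated for generic step sizes $a_k$ and zero-mean noises $\varepsilon_k$ with partial sums $S_k$ (the book's own display is not reproduced in this excerpt):

```latex
\sum_{k=1}^{n} a_k \varepsilon_k
  = a_n S_n + \sum_{k=1}^{n-1} \left( a_k - a_{k+1} \right) S_k ,
\qquad S_k := \sum_{j=1}^{k} \varepsilon_j , \quad S_0 := 0 .
```

Since $S_k/k \to 0$ a.s. by ergodicity and the zero-mean assumption, the terms on the right-hand side are controlled by the decay of the step sizes, which is how the noise condition (5.2.8) is verified.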
Let be the unit eigenvector of A corresponding to the eigenvalue where the eigenvalues may not be distinct.
Theorem 5.2.1 Assume A5.2.1 and A5.2.2 hold. Then given by (5.2.1)–(5.2.6) converges at those samples for which A5.2.3 holds, and the limits of coincide with

Let denote the limit of as Then

Proof. Consider those for which A5.2.3 holds. We first prove convergence of Note that may happen only for a finite number of steps, because as and By boundedness of we expand into the power series of
where
Further, we rewrite (5.2.9) as
where
Denote
From (5.2.10) and the boundedness of and it is seen that
as Therefore, in order to show that satisfies A2.2.3 it suffices to show
for any convergent subsequence
By boundedness of and it is clear that
where c is a constant for a fixed sample. For any there is a such that
Consequently, we have
Expressing the first part of as
we find that

This is because (5.2.8) is applied for the first term on the right-hand side of (5.2.16), while for the other two terms we have used (5.2.15), and the boundedness of and

Similar treatment can also be applied to the second part of Thus, we have verified (5.2.13), and A2.2.3 too.
Denote by S the unit sphere in Then defined by (5.2.2) evolves on S.
Define
The root set of on S is
Defining we find for
Thus, Condition A2.2.2(S) introduced in Remark 2.2.6 is satisfied. Since is bounded, no truncation is needed. Then, by Remark 2.2.6 we conclude that converges to one of say

Denote
Inductively, we now assume
We then have
Since and from (5.2.21) and (5.2.5) it
follows that and by (5.2.6)
We now proceed to show that converges to one of the unit eigenvectors contained in

From (5.2.5) we see that the last term in the recursion

tends to zero as So, by (5.2.22) we need to reset with and at most a finite number of times.
Replacing by in (5.2.9)–(5.2.11), we again arrive at (5.2.11) for Precisely,
where
and
By noticing
and using (5.2.22), (5.2.23) can be rewritten as
where as Since tends to an eigenvector of A, from (5.2.11) it follows that
where
Since converges, from (5.2.13) and it follows that
Inductively, assume that
with satisfying (5.2.27), i.e.,
Noticing that for any matrix V, we have
by (5.2.28). Since by (5.2.24), denoting by the term we have

for any convergent subsequence

Denoting
from (5.2.26) we see
By (5.2.8) and (5.2.30), similar to (5.2.18)–(5.2.20), by Remark 2.2.6 converges to a unit eigenvector of From (5.2.5) it is seen that converges since and Then from (5.2.6) it follows that itself converges as

Thus, we have
From (5.2.5) it follows that
which implies that and consequently,
Since the limit of is a unit eigenvector of we have

By (5.2.33) it is clear that can be expressed as a linear combination of eigenvectors Consequently, which combined with (5.2.34) implies that

This means that is an eigenvector of A, and is different from by (5.2.33).

Thus, we have shown (5.2.21) for To complete the induction it remains to show (5.2.28) for

As we have just shown, tends to zero as From (5.2.31) we have
where satisfies (5.2.29) with replaced by by taking notice that (5.2.30) is fulfilled for the whole sequence because which has been shown to be convergent.
Elementary manipulation leads to
This expression combined with (5.2.35) proves (5.2.28) for Thus, we have proved that given by (5.2.1)–(5.2.6) converge to different unit eigenvectors of A, respectively.

To complete the proof of the theorem it remains to show Rewrite the untruncated version of (5.2.7) as follows
We have just proved that Then by (5.2.8) and
noticing the fact that converges and we see that
satisfies A2.2.3. The regression function in (5.2.36) is linear:
Applying Theorem 2.2.1 leads to
Remark 5.2.1 If in (5.2.1) and (5.2.3) is replaced by Theorem 5.2.1 remains valid. In this case given by (5.2.18) should change to and correspondingly changes to As a result, the limit of changes to the opposite sign, from to
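The displays (5.2.3)–(5.2.6) are not reproduced above. One plausible reading, sketched below on a hypothetical 3 × 3 example, is that each subsequent eigenvector estimate is driven by observations deflated by the current earlier estimates; here an orthogonal projection stands in for the pseudo-inverse construction in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([3.0, 1.0, 0.5])          # hypothetical symmetric matrix

def observe():                        # noisy symmetric observation of A
    N = rng.normal(0.0, 0.3, size=(3, 3))
    return A + (N + N.T) / 2.0

u1 = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
u2 = np.array([1.0, -1.0, 1.0]) / np.sqrt(3.0)
for k in range(1, 50001):
    a_k = 1.0 / k
    Ak = observe()
    u1 = u1 + a_k * Ak @ u1           # first eigenvector: normalized iteration
    u1 /= np.linalg.norm(u1)
    P = np.eye(3) - np.outer(u1, u1)  # deflate: work orthogonally to u1
    v2 = P @ (u2 + a_k * Ak @ u2)
    n2 = np.linalg.norm(v2)
    u2 = v2 / n2 if n2 > 1e-12 else np.array([0.0, 1.0, 0.0])

print(abs(u1[0]), abs(u2[1]))  # each close to 1: estimates align with e1 and e2
```

Projecting the second iterate through P at every step keeps it exactly orthogonal to the current first estimate, mirroring the inductive structure of the proof of Theorem 5.2.1.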
5.3. Recursive Blind Identification by PCA

As mentioned in Section 5.1, the algorithm (5.1.32) for identifying the channel coefficients uses the second moment of the observation noise. This causes difficulty in possible applications, because may not be available.

We continue to consider the problem stated in Section 5.1 with the notations introduced there. In particular, (5.1.1)–(5.1.12) and (5.1.31) will be used without explanation.

Instead of (5.1.32) we now consider the following normalized SA algorithm:
Comparing (5.3.1) and (5.3.2) with (5.2.1) and (5.2.2), we find that the channel parameter identification algorithm coincides with the PCA algorithm with By Remark 5.2.1, Theorem 5.2.1 can be applied to (5.3.1) and (5.3.2) if conditions A5.2.1, A5.2.2, and A5.2.3 hold.
The following conditions will be used.
A5.3.1 The input is a sequence, i.e., there exist a constant and a function such that for any
where
A5.3.2 There exists a distribution function over such that
where denotes the Borel in and
A5.3.3 The (2L + 1) × (2L + 1)-matrix is nondegenerate, where

A5.3.4 The signal is independent of and a.s., where is a random variable with

A5.3.5 All components of are mutually independent with and
and is bounded where is a constant.
A5.3.6 have no common root.
A5.3.7 and

For Theorem 5.1.1, is assumed to be a sequence of mutually independent random variables (Condition A5.1.6), while in A5.3.1 the independence is weakened to a property, but the distribution of is additionally required to be convergent. Although there is no requirement on the distribution of in Theorem 5.1.1, we notice that (5.1.30) is satisfied if are identically distributed.
In the sequel, denotes the identity matrix. Define with
and
In what follows denotes the Kronecker product.
Theorem 5.3.1 Assume A5.3.1–A5.3.7 hold. Then
where C is a -matrix and Q is given in A5.3.3, and for given by (5.3.1) and (5.3.2),
where J denotes the set of unit eigenvectors of C.
Proof. By the definition of we have
Since
and by A5.3.2, (5.3.3) immediately follows.

From the definition (5.1.31) for by A5.3.5 it is clear that is a -identity matrix multiplied by with Then by A5.3.4 and A5.3.5
Identifying in Theorem 5.2.1 to we find that Theorem 5.2.1 can be applied to the present algorithm, if we can show (5.2.8), which, in the present case, is expressed as

where is given by (1.3.2), and B is given by (5.3.6). Notice, by the notation introduced in (5.1.33),
Since
and
by the convergence theorem for martingale difference
sequences, for (5.3.7) it suffices to show
Identifying and in Lemma 2.5.2 to and respectively, we find that the conditions required there are satisfied. Then (5.3.8) follows from Lemma 2.5.2, and hence (5.3.7) is fulfilled.

By Theorem 5.2.1, given by (5.3.1) and (5.3.2) converges to a unit eigenvector of B, which clearly is an eigenvector of C.

Lemma 5.3.1 is the unique (up to a scalar multiple) nonzero vector simultaneously satisfying

Proof. Since it is known that satisfies (5.3.9), it suffices to prove the uniqueness.
As in the proof of Lemma 5.1.1, assume is also a solution to (5.3.9). Then, along the lines of the proof of Lemma 5.1.1, we obtain the analogue of (5.1.16), which implies (5.1.29):

where is given by (5.1.28) while by (5.1.16). By A5.3.3 which is nondegenerate. Then we have The rest of the proof for uniqueness coincides with that given in Lemma 5.1.1.

By Lemma 5.3.1, zero is an eigenvalue of C with multiplicity one and the corresponding eigenvector is Theorem 5.3.1 guarantees that the estimate approaches J, but it is not clear if tends to the direction of
Let be all the different eigenvalues of C. J is composed of disconnected sets and where Note that the limit points of are in a connected set, so converges to a for some Let We want to prove that a.s. or This is the conclusion of Theorem 5.3.2, which is essentially based on the following lemma, proved in [9].
Lemma 5.3.2 Let be a family of nondecreasing and be a martingale difference sequence with

Let be an adapted random sequence and be a real sequence such that and Suppose that on the following conditions 1, 2, and 3 hold.

2) can be decomposed into two adapted sequences and such that

3) coincides with an random variable for some
Then
Theorem 5.3.2 Assume A5.3.1–A5.3.7 hold. Then defined by (5.3.1) and (5.3.2) converges to up to a constant multiple:
where equals either
Proof. Assume the contrary: for some Since C is a symmetric matrix, for where and hereafter a possible set with zero probability in is ignored. The proof is completed in four steps.

Step 1. We first explicitly express Expanding defined by (5.3.2) into the power series of we derive
where
Noting and we derive
and
where is defined by (5.1.4), is given by (5.1.10) with replaced by the observation noise, and denotes the estimate for at time

By (5.3.4) and (5.3.5), there exists a.s. such that a.s.
For any integers and define and
Note that for
and by the convergence of from (5.3.12) it follows that where is a constant for all in By
(5.3.7) we then have
as where and hereafter T should not be confused with the superscript T for transpose.

Choose large enough and sufficiently small T such that Let

and It then follows that for
In
for sufficiently large.

Consequently, for with fixed
and hence
Define
From (5.3.15) it follows that
Letting in (5.3.21) and replacing by in the resulting equality, by (5.3.19) we have

Thus, we have expressed in two ways: (5.3.21) shows that is measurable, while (5.3.22) is in the form required in Lemma 5.3.2, where

Step 2. In order to show that the summand in (5.3.22) can be expressed as that required in Lemma 5.3.2, we first show that the series

is convergent on By (5.3.14) and (5.3.7) it suffices to show is convergent on
Define
and
Clearly, is measurable with respect to and Then by the convergence theorem for martingale difference sequences,
By (5.3.16) it follows that
The first term on the right-hand side of the last equality of (5.3.29) can be expressed in the following form:
where the last term equals
Combining (5.3.30) and (5.3.31), we derive that the first term on the right-hand side of the last equality of (5.3.29) is
By A5.3.4, A5.3.5, and A5.3.7 it is clear that Hence replacing by in (5.3.29) results in producing an additional term of magnitude Thus, by (5.3.24)–(5.3.26) we can rewrite (5.3.29) as

where and is By (5.3.28) and A5.3.7 the series (5.3.33) is convergent, and hence given by (5.3.23) is a convergent series.

Step 3. We now define the sequences corresponding to and in Lemma 5.3.2.
Let We have
where
Denote
Then and are adapted sequences, is a martingale difference sequence, and is written in the form of Lemma 5.3.2:

It remains to verify (5.3.10) and (5.3.11).

From (5.3.23) and (5.3.33) it follows that there is a constant such that Then for noticing
and
we have
By A5.3.4 and A5.3.5 it follows that
As in Step 4 it will be shown that
From this it follows that
Then from the following inequality
by (5.3.34) and (5.3.36) it follows that
Therefore all conditions required in Lemma 5.3.2 are met, and we conclude Since it follows that and must converge to a.s.

Step 4. To complete the proof we have to show (5.3.35). If (5.3.35) were not true, then there would exist a subsequence such that

For notational simplicity, let us denote the subsequence still by Since by A5.3.5 for if and for any
but if we then have
which combined with (5.3.37) implies that
and
Noticing that and from (5.3.38)
and (5.3.24) it follows that
On the other hand, we have
and hence,
where denotes the estimate provided by for at time Since for any
we have
Hence (5.3.40) implies that
and
By A5.3.4 the left-hand side of (5.3.41) equals
Since it follows that for any
The left-hand side of (5.3.42) equals
Thus (5.3.42) implies that for any
Noticing from (5.3.25) we have
Then by A5.3.5, (5.3.39) implies that for any
Notice that
and
Then by A5.3.5, from (5.3.45)–(5.3.47) it follows that
and hence for any
and
Notice that (5.3.49) means that
However, the above expression equals
Therefore,
In the sequel, it will be shown that (5.3.43), (5.3.44), (5.3.48), and (5.3.50) imply that which contradicts

This means that the converse assumption (5.3.37) is not true.

For any since are coprime, where is
given in (5.1.6), there exist polynomials such that
Let and be the degrees of and respectively. Set

Introduce the q-dimensional vector and q × q
square matrices W and A as follows:
Note that where and Then (5.3.43), (5.3.44), (5.3.48), and (5.3.50) can be written in the following compact form:

To see this, note that for any fixed and on the left-hand sides of (5.3.48) and (5.3.50) there are 2L different sums when varies from 0 to L – 1 and replace roles with each other. These together with (5.3.43) and (5.3.44) give us 2L + 1 sums, and each of them tends to zero. Explicitly expressing (5.3.52), we find that there are 2L + 1 nonzero rows, and each row corresponds to one of the relationships in (5.3.43), (5.3.44), (5.3.48), and (5.3.50).

Since we have put enough zeros in the definition of after multiplying the left-hand side of (5.3.52) by
has only shifted nonzero elements in
From (5.3.52) it follows that for any and in (5.3.51)

From (5.3.53) it follows that

Note that for any polynomial of degree if the last elements of are zeros. From (5.3.54) it follows that
Denoting
from (5.3.55) we find that
By the definition of the first elements of are zeros, i.e., This means that the last
elements of are zeros, i.e.,
On the other hand,
By (5.3.56), from (5.3.57) and (5.3.58) it is seen that i.e.,
From (5.3.53) it then follows that
i.e., But this is impossible, because
are unit vectors. Consequently, (5.3.37) is impossible, and this completes the proof of Theorem 5.3.2.
5.4. Constrained Adaptive Filtering

We now apply SA methods to adaptive filtering, which is an important topic in signal processing. We consider the constrained problem; the unconstrained problem is only a special case of the constrained one, as will be explained.

Let and be two observed sequences, where and are respectively. Assume is stationary and ergodic with

which, however, is unknown. It is required to design the optimal weighting X, which minimizes

under the constraint

where C and are matrices, respectively. In the case where C = 0, the problem reduces to the unconstrained one.

It is clear that (5.4.3) is solvable with respect to X if and only if and in this case the solution to (5.4.3) is

where Z is any

For notational simplicity, denote

Let L(C) denote the vector space spanned by the columns of matrix C, and let the columns of matrix be an orthonormal basis
of L(C). Then there is a full-rank decomposition Noticing we have Let be an orthogonal matrix. Then
and hence
From this it follows that
and hence a.s. This implies that
Let us express the optimal X minimizing (5.4.2) via By(5.4.8) substituting (5.4.4) into (5.4.2) leads to
On the right-hand side of (5.4.9) only the first term, which is quadratic, depends on Z. Therefore, the optimal should be the solution of
i.e.,
where is any satisfying
Combining (5.4.4) with (5.4.11), we find that
Using the ergodic property of we may replace and by their sample averages to obtain the estimate for And the estimate can be updated by using new observations. However, updating the estimate involves taking the pseudo-inverse of the updated estimate for which may be of high dimension. This will slow down the computation speed. Instead, we now use an SA algorithm to approach
By (5.4.8), we can rewrite (5.4.10) as
or
We now face the standard root-seeking problem for a linear function
As before, let and The
following algorithm is used to estimate given by (5.4.12), which inthe notations used in previous chapters is the root set J for the linearfunction given by (5.4.14):
with initial value such that and
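The expanding-truncation mechanism of the algorithm just defined (cf. (5.4.16)) can be illustrated with a small self-contained sketch. The matrices, noise, and truncation bounds below are hypothetical stand-ins, since the book's second-moment matrices and observation sequence are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical root-seeking problem: h(x) = b - A x with root A^{-1} b,
# observed with additive noise.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
b = np.array([1.0, -0.5])
root = np.linalg.solve(A, b)

x = np.array([5.0, 5.0])        # initial value
x_fixed = np.zeros(2)           # fixed point used after a truncation
sigma = 0                       # number of truncations so far
M = lambda s: 10.0 * (s + 1)    # expanding truncation bounds, M_s -> infinity

for k in range(1, 20001):
    a_k = 1.0 / k               # step sizes: sum a_k = inf, a_k -> 0
    obs = b - A @ x + 0.5 * rng.standard_normal(2)   # noisy observation of h(x)
    x_next = x + a_k * obs
    if np.linalg.norm(x_next) > M(sigma):
        x, sigma = x_fixed.copy(), sigma + 1         # truncate: restart, enlarge bound
    else:
        x = x_next

# Truncations cease after finitely many steps and the iterate approaches the root.
assert np.linalg.norm(x - root) < 0.1
```

Because the truncation bounds expand, the algorithm eventually behaves like an untruncated RM recursion, exactly as Theorem 5.4.1 asserts.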
Theorem 5.4.1 Assume that is stationary and ergodic with second moment given by (5.4.1) and that Then, after a finite number of steps, say (5.4.16) has no more truncations, i.e.,
and
i.e.,
where given by (5.4.12) solves the stated constrained optimizationproblem.
Proof. We first note that (5.4.16) is a matrix recursion. However, if in lieu of we consider with being an arbitrary constant vector, then we have a conventional vector recursion, and by (5.4.9)
may serve as a Lyapunov function for the corresponding regression function obtained from (5.4.14):
Therefore, in order to apply Theorem 2.2.1, we need only verify the noise condition.
Denoting
then from (5.4.16) we have
We now show that for a fixed sample if then
and there is a constant such that
if is sufficiently large and T is small enough, where
We need the following fact, which is an extension of Example 5.2.1. Assume the process is stationary and ergodic with
and is a convergent sequence of random matrices
a.s. Then
Let Then by ergodicity of both and
we have
because the second term on the right-hand side of the equality can beestimated as follows
which tends to zero as and then By a partial summation and by using (5.4.25) we have
which implies (5.4.24) by (5.4.25).
Let us consider the following algorithm starting from without truncation:
Set
and
Then from (5.4.26) it follows that
Denote
and
Since is stationary and ergodic, a.s., and
Then by a partial summation, we have
Notice that a.s. by ergodicity. Then for large
and from (5.4.29) it follows that
where (5.4.24) is used together with the fact that
and is stationary with E From (5.4.27)–(5.4.30), by the convergence of it follows that
for large and small T, where and are constants independent of and
Consequently, in the case i.e., in
(5.4.16), will never reach the truncation bound for if is large enough and T is small enough.
Then coincides with This verifies (5.4.22), while (5.4.23) follows from (5.4.16) because for a fixed and are bounded, and are also bounded by (5.4.31) and the convergence In
the case i.e., for some is bounded, and hence (5.4.22) and (5.4.23) are also satisfied.
We are now in a position to verify the noise condition required in
Theorem 2.2.1 for given by (5.4.20), i.e., we want to show that for any convergent subsequence
By (5.4.24)
so for (5.4.32) it suffices to show
Again, by (5.4.24) and also by (5.4.23)
which implies (5.4.33). By Theorem 2.2.1, there is such that for is defined by (5.4.17) and converges to the root set J for given by (5.4.14). This completes the proof of the theorem.
Remark 5.4.1 For the unconstrained problem and C = 0, the algorithm (5.4.16) becomes
Further, if then and Theorem 5.4.1 asserts
a.s.,
provided is stationary, ergodic, and bounded.
5.5. Adaptive Filtering by Sign Algorithms
We now consider the unconstrained problem mentioned in Section 5.4, but we restrict ourselves to the vector case, i.e., instead of a matrix signal we now consider where is and is one-dimensional. However, instead of the quadratic criterion (5.4.2) we now minimize the cost
where is an vector.
Note that the gradient of is given by
where
The problem is to find which minimizes or to approach the root set J of
As before, let be an increasing sequence of positive real numbers such that as Define the algorithm as follows
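The displayed recursion (5.5.4)–(5.5.5) is not reproduced in this excerpt; the following is a hedged sketch of a sign-algorithm update with expanding truncations, with hypothetical data (i.i.d. regressors and symmetric noise) standing in for the book's stationary ergodic signals:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: d_k = phi_k . w0 + noise, with w0 the unknown weights.
w0 = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
sigma = 0                           # truncation counter
M = lambda s: 5.0 * (s + 1)         # expanding truncation bounds

for k in range(1, 50001):
    a_k = 1.0 / k
    phi = rng.standard_normal(3)                 # regressor (stationary; i.i.d. here)
    d = phi @ w0 + 0.3 * rng.standard_normal()   # observed reference signal
    # Sign algorithm: step along phi times the sign of the prediction error,
    # i.e., a stochastic gradient step for the L1 cost E|d - phi . w|.
    w_next = w + a_k * phi * np.sign(d - phi @ w)
    if np.linalg.norm(w_next) > M(sigma):        # expanding truncation
        w, sigma = np.zeros(3), sigma + 1
    else:
        w = w_next

assert np.linalg.norm(w - w0) < 0.2
```

With symmetric noise the root of the L1-gradient is the true weight vector, which is why the sign algorithm converges to the same limit as the quadratic-criterion filter.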
Theorem 5.5.1 Assume is stationary and ergodic with
Then
where is defined by (5.5.4) and (5.5.5) with an arbitrary initial value. In addition, after a finite number of steps truncations cease to exist in (5.5.4).
Proof. Define
and
Let be a countable set that is dense in let and be two sequences of positive real numbers such that and as and denote
and
where and is an integer.
The summands of (5.5.9)–(5.5.11) are stationary with finite expectations for any any integer any and any and then the ergodic theorem yields that
a.s.,
and
Therefore, there is an such that and for each the convergence in (5.5.12)–(5.5.14) takes place for any any integer any and any
Let us fix an
We first show that for any fixed
if is large enough (say, for ), and in addition,
where c is a constant which may depend on but is independent of In what follows always denote constants that may depend on but are independent of By (5.4.24) we have for any
There are two cases to be considered. If then for large enough, and (5.5.15) holds. If is bounded, then the truncations cease to exist after a finite number of steps. So, (5.5.15) also holds if is sufficiently large. Then (5.5.16) follows immediately from (5.5.15) and (5.5.17).
Let us define
where is given by (5.5.2). Then (5.5.15) can be represented as
Let be a convergent subsequence of and let
be such that We now show that
Let By (5.5.16) or for some integer
We now verify that the terms on the right-hand side of (5.5.20) satisfy (5.5.19).
For the first term on the right-hand side of (5.5.20) we have
where and are deterministic for a fixed and the expectation is taken with respect to and
Since (5.5.6), a.s., applying the dominated convergence theorem yields
Then from (5.5.21) it follows that
Similarly, for the second term on the right-hand side of (5.5.20) we have
since a.s.
For the third term on the right-hand side of (5.5.20), by (5.4.24), (5.5.10), and (5.5.13) we have
since
Finally, for the last term in (5.5.20), by (5.5.14) and (5.4.24) we have
where the last convergence follows from the fact that a.s. as since and
a.s.
Combining (5.5.23)–(5.5.26) yields that
Since the left-hand side of (5.5.27) is free of letting tend to infinity in (5.5.27) leads to (5.5.19). The conclusion of the theorem then follows from Theorem 2.2.1 by noticing that, as in A2.2.2, one may take
5.6. Asynchronous Stochastic Approximation
When dealing with large interconnected systems, it is natural to consider distributed, asynchronous SA algorithms. For example, in a communication network with servers, each server has to allocate audio and video bandwidths in an appropriate proportion in order to minimize the average queueing delay. Denote by the bandwidth ratio for the server, and Assume the average delay time depends on only and is differentiable, Then minimizing is equivalent to finding the root of Assume the time, denoted by spent on transmitting data from the server to the server is not negligible. Then at the server for the iteration we can observe or only at where denotes the total time spent until completion of iterations for the server. This is a typical problem solved by asynchronous SA. A similar problem arises from job scheduling for computers in a computer network.
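Before the formal definitions, the delayed-observation mechanism can be sketched in a minimal two-processor example. Everything below is hypothetical (the regression function, the delay, and the noise are invented for illustration, and truncations are omitted):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical root-seeking problem: h(x) = b - A x with root A^{-1} b.
A = np.array([[1.0, 0.3], [0.3, 1.0]])
b = np.array([0.7, -0.4])
root = np.linalg.solve(A, b)

DELAY = 3                      # fixed communication delay, in iterations
hist = [np.zeros(2)]           # history of the global estimate

for k in range(1, 20001):
    x_cur = hist[-1]
    x_new = x_cur.copy()
    for i in range(2):         # processor i updates component i only
        # Processor i sees its own component up to date, but the other
        # processor's component only with a delay of DELAY iterations.
        view = x_cur.copy()
        view[1 - i] = hist[max(0, len(hist) - 1 - DELAY)][1 - i]
        a_ik = 1.0 / k         # per-processor step size
        obs = (b - A @ view)[i] + 0.2 * rng.standard_normal()
        x_new[i] = x_cur[i] + a_ik * obs
    hist.append(x_new)

assert np.linalg.norm(hist[-1] - root) < 0.1
```

Because the delay is bounded while the step sizes tend to zero, the stale components become asymptotically negligible, which is the intuition behind the convergence analysis that follows.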
We now precisely define the problem and the algorithm.
At time denote by the estimate for the unknown root of Components of are observed by different processors, and the communication delays from the processor to the processor at time are taken into account. The observation of the processor is carried out only at
i.e.,
where is the observation noise.
In contrast to the synchronous case, the update steps now are different for different processors, so it is unreasonable to use the same step size for all processors in an asynchronous environment. At time the step size used in the processor is known and is denoted by
We will still use the expanding truncation technique, but we are unable to simultaneously change the estimates in different processors when an estimate exceeds the truncation bound, because of the communication delay.
Assume all processors start at the same given initial value and for all The observation at
the processor is and is updated to by the rule given below. Because of the communication delay the estimate produced by the processor cannot reach the processor for the initial steps:
By agreement we will take to serve as whenever
At the processor, two sequences and are recursively generated, where is the estimate for the component of at time and is connected with the number of truncations up to and including time at the processor. For the processor at time the newest information about the other processors is In all algorithms discussed until now, all components of are observed at the same point at time and this makes updating to meaningful. In the present case, although we are unable to make all processors observe at the same points at each time, it is still desirable to require all processors to observe at points located as close as possible. Presumably, this would make the estimate updating reasonable. For this, noticing that the estimate changes gradually after a truncation, the ideal would be to keep all equal, but the best we can do is to equalize with the other
Keeping this idea in mind, we now define the algorithm and the observations for the processor,
Let be a fixed point from which the algorithm restarts after a truncation.
i) If there exists with then reset to equal the largest one among and pull back to the fixed point although may not exceed the truncation bound. Precisely, in this case define
and observe
ii) If for any then observe at
i.e.,
In both cases i) and ii), and are updated as follows:
where is the step size at time and may be random, and is a sequence of positive numbers increasingly diverging to infinity.
Let us list the conditions to be used.
A5.6.1 is locally Lipschitz continuous.
A5.6.2 and
there exist two positive constants such that
A5.6.3 There is a twice continuously differentiable function (not necessarily nonnegative) such that
and is nowhere dense, where
and denotes the gradient of
A5.6.4 For any convergent subsequence any and any
where
and
A5.6.5
Note that (5.6.10) holds if is bounded, since Note also that A5.6.3 holds if and
Theorem 5.6.1 Let be given by (5.6.1)–(5.6.6) with initial value Assume A5.6.1–A5.6.5 hold, and there is a constant such that and
where is given in A5.6.3. Then
where
The proof of the theorem is divided into lemmas. From now on we always assume that A5.6.1–A5.6.5 hold.
We first introduce an auxiliary sequence and its associated observation noise It will be shown that differs from only by a finite number of steps. Therefore, for convergence of it suffices to prove convergence of
Let be a sample path generated by the algorithm (5.6.1)–(5.6.6), where is the one after resetting according to (5.6.2). Let where is defined in A5.6.4. Assume By the resetting rule given in i), for any after resetting we have For we
have and by the definition of
In the processor we take and to replace and respectively, and define for those Further, define and for
Then we obtain new sequences associated with By (5.6.1)–(5.6.6), if then there exists a with
and
since and for Because during the period there is no truncation for the sequences are recursively updated as follows:
where Define the delays for as follows:
is available to the processor at time
Lemma 5.6.1 For any any convergent subsequenceand any satisfies the following condition
where
Proof. Since equals either or which is available at time it is seen that
For by the definition of we have which is certainly available to the processor. Therefore,
We rewrite By the definition of and paying attention to (5.6.17) we see
so
as
We now show that (5.6.18) is true for all For there is no truncation for the processor,
and hence by the resetting rule i). If
for some then by (5.6.16) and the definition of it follows that
which implies (5.6.18).If for some then as explained above for the processor
at time the latest information about the estimate produced by theprocessor is In other words,
However, by the definition of which yields
This again implies (5.6.18).
In summary, we have
This means that for there is no truncation at any time equal to and the observation is carried out at
i.e.,
For any any convergent subsequence and any we have
By (5.6.11), Then from A5.6.2 and
A5.6.5 it follows that and hence the second term
on the right-hand side of (5.6.21) tends to zero as Further, from the definition of there is such that Hence the first term on the right-hand side of (5.6.21) is of order o(T) by A5.6.4. Consequently, from A5.6.2, A5.6.4, and A5.6.5 it follows that satisfies (5.6.15).
Lemma 5.6.2 Let be generated by (5.6.12)–(5.6.14). For any convergent subsequence of if is bounded, then there are and such that
where is given in (5.6.14).
Proof. Let where and
where is given in A5.6.2.
By (5.6.15), for the convergent subsequence there exists such that for any and
Choose such that For any let
Then for any
If then if is sufficiently large, i.e., no truncation occurs after and hence for
If then there exists such that for any From (5.6.24) it follows that
Therefore, in both cases
If then for sufficiently large
i.e.,
This contradicts the definition of Therefore,
Lemma 5.6.3 Let be given by (5.6.12)–(5.6.14). For any with the following assertions take place:
i) In the case cannot cross infinitely many times keeping bounded, where are the starting points of the crossings;
ii) In the case cannot converge to keeping bounded.
Proof. i) Since is bounded, there exists a convergent subsequence, which is still denoted by for notational simplicity,
By the boundedness of and (5.6.22), for sufficiently large there is no truncation between and and hence
where By (5.6.20), (5.6.22), and it follows that
By A5.6.2 and A5.6.3 we have
Then by A5.6.1
where is the Lipschitz coefficient of in and By the boundedness of
and the fact that there is no truncation between and it follows that
Without loss of generality, we may assume is a convergent sequence. Then by A5.6.3 and A5.6.5
Therefore,
where
Since is continuous for fixed by A5.6.4 there exists a for such that
Thus, for sufficiently small T and sufficiently large we have
On the other hand, by Lemma 5.6.2
Thus, for sufficiently small T, and
This contradicts (5.6.31), and i) is proved.
ii) If is bounded, then there is a convergent subsequence Then the assertion can be deduced in a similar way as for i).
Lemma 5.6.4 Under the conditions of Theorem 5.6.1
where is given by (5.6.14).
Proof. If then there exists a sequence such that
From (5.6.12)–(5.6.14) we have Choose a small positive constant such that
Let be a connected set containing and included in the set and let be a connected set containing and included in the set
Clearly, and and are bounded.
Since diverges to infinity, there exists such that for Noting that there exists i such that
and we can define and for
Since there is a convergent subsequence in also denoted by Let be a limit point of
By the definition of is bounded. But crosses infinitely many times, and it
is impossible by Lemma 5.6.3. Thus,
Proof of Theorem 5.6.1
and
By Lemma 5.6.4, is bounded. Let
If then by Lemma 5.6.3 we have
If then there are and such that and since is nowhere dense. But by Lemma 5.6.3 this is impossible. Therefore,
We now show If there is a convergent subsequence
and then (5.6.26)–(5.6.30) still hold. Hence,
This is a contradiction to
Consequently, i.e.,
Since and the truncations occur only finitely many times. Therefore, and differ from each other only for a finite number of So,
5.7. Notes and References
For blind identification with “block” algorithms we refer to [71, 96]. Recursive blind channel identification algorithms appear to be new. Section 5.1 is written on the basis of the joint work “H. F. Chen, X. R. Cao, and J. Zhu, Convergence of stochastic approximation based algorithms for blind channel identification”. Principal component analysis is applied in different areas (see, e.g., [36, 79]). The results presented in Section 5.2 are an improved version of those given in [101]. Principal component analysis is applied to solve the blind identification problem in Section 5.3, which is based on the recent work “H. T. Fang and H. F. Chen, Blind channel identification based on noisy observation by stochastic approximation method”. The proof of Lemma 5.3.2 is given in [9].
For adaptive filtering we refer to [57]. The results presented in Section 5.4 are stronger than those given in [11, 28]. The sign algorithms are dealt with in [42], but the conditions used in Section 5.5 are considerably weaker than those in [42]. Section 5.5 is based on the recent work “H. F. Chen and G. Yin, Asymptotic properties of sign algorithms for adaptive filtering”.
Asynchronous stochastic approximation was considered in [9, 88, 89, 99]. Section 5.6 is written on the basis of [50].
Chapter 6
APPLICATION TO SYSTEMS AND CONTROL
Assume a control system depends on a parameter and the system operation reaches its ideal status when the parameter equals some Since is unknown, we have to estimate it during the operation of the system, which, therefore, can work only with the estimate of In other words, the real system does not run under the ideal parameter and the problem is to estimate on-line and to make the system asymptotically operate in the ideal status. It is clear that this kind of system parameter identification can be dealt with by SA methods.
Adaptive control for linear stochastic systems is a typical example of the situation described above. If the system coefficients are known, then the optimal stochastic control may be a feedback control of the system state. The corresponding feedback gain can be viewed as the ideal parameter which depends on the system coefficients. In the setup of adaptive control, the system coefficients are unknown, and hence is unknown. The problem is to estimate and to prove that the resulting adaptive control system, using the estimate as the feedback gain, is asymptotically optimal as tends to infinity.
In Section 6.1 the ideal parameter is identified by SA methods for systems in a general setting, and the results are applied to solving the adaptive quadratic control problem. The adaptive stabilization problem is solved for stochastic systems in Section 6.2, while adaptive exact pole assignment is discussed in Section 6.3. An adaptive regulation problem for nonlinear and nonparametric systems is considered in Section 6.4.
6.1. Application to Identification and Adaptive Control
Consider the following linear stochastic system depending on a parameter
where and are unknown.
The ideal parameter for system (6.1.1) is a root of an unknown function
The system actually operates with equal to some estimate for i.e., the real system is as follows:
For notational simplicity, we suppress the dependence on the state and rewrite (6.1.3) as
The observation at time is
where is a noise process.
From (6.1.5) it is seen that the function is not directly observed,
but it is connected with as follows:
We list conditions that will be used.
where is generated by (6.1.1).
Let be a sequence of positive numbers increasingly diverging to infinity and let be a fixed point. Fixing an initial value we recursively estimate by the SA algorithm with expanding truncations:
A6.1.2 There is a continuously differentiable function such that
for any and is nowhere dense, where J is given by (6.1.2). Further, used in (6.1.8) is such that
inf for some and
A6.1.3 The random sequence in (6.1.1) satisfies a mixing condition characterized by
uniformly in where Further, is such that sup where
A6.1.4 For sufficiently large integer
for any such that converges, where is given by (1.3.2).
Let is stable}, and let be an open, connected subset of
A6.1.5 and f are connected by (6.1.6) and (6.1.1) for each satisfies a local Lipschitz condition on
with for any constants and where is given in A6.1.3.
with
A6.1.1 and
A6.1.6 and in (6.1.1) are globally Lipschitz continuous:
where L is a constant.
A6.1.7 given by (6.1.7) is If converges for some then where may depend on
Theorem 6.1.1 Assume A6.1.1–A6.1.7 hold. Then
where is a connected subset of
Proof. By (6.1.5) we rewrite the observation in the standard form
where
By Theorem 2.2.2 and Condition A6.1.4, the assertion of the theorem will immediately follow if we can show that for almost all condition (2.2.2) is satisfied with replaced by
Let be expressed as a sum of seven terms:
where
where
and and denote the distribution and conditional distribution of given respectively.
To prove the theorem it suffices to show that there exists with such that for each all satisfy
(2.2.2) with respectively identified to
By definition, for any there is such that
where is independent of
Let us first show that satisfy (2.2.2). Solving (6.1.1) yields
By A6.1.3, is bounded. Hence, by (6.1.18), is bounded and by A6.1.5 is also bounded:
where
where is given in A6.1.5. Since we have
We now show that and are continuous in uniformly with respect to
By (6.1.18) and (6.1.20), from (6.1.19) it follows that
By (6.1.18), (6.1.20), and the Lipschitz condition A6.1.5 for it follows that
and
which implies the uniform continuity of This together with (6.1.13) yields that is also uniformly continuous.
Let be a countable dense subset of
Noticing that is and expressing
as a sum of martingale difference sequences
by (6.1.20) and we find that there is with such that for each
for any integer and any From here, by the uniform continuity of it follows that for and for any integer
Note that
This is because by (6.1.18) and (6.1.20) we have the following estimate:
We now estimate by the treatment used in Lemma 2.5.2. By applying the Jordan–Hahn decomposition to the signed measure
Similarly, we can find with such that for and
since is bounded by the martingale convergence theorem. It is worth noting that (6.1.23) holds a.s. for any but without loss of generality (6.1.23) may be assumed to hold for all
with To see this, we first select such that (6.1.23) holds for any This is possible
because is a countable set. Then, we notice that is continuous in uniformly with respect to Thus, we have
withon
such
where is the mixing coefficient given in A6.1.3. Thus, by (6.1.27)–(6.1.29) we have
and
it is seen that there is a Borel set D in the sampling space such that for any A in the sampling space
By A6.1.5, (6.1.18), (6.1.20), and noticing we find
whose expectation is finite as explained for (6.1.20). Therefore, on the right-hand side of (6.1.30) the conditional expectation is bounded with respect to by the martingale convergence theorem, and the last term is also bounded with respect to Thus, by (6.1.10), from (6.1.30) it follows that there is with such that
Assume is a convergent subsequence
Define
Write (6.1.4) as
Let be fixed.
Now choose sufficiently small so that
and hence
Applying the Gronwall inequality to (6.1.33) we obtain the inequality
where and hereafter always denotes a constant for fixed and, without loss of generality, we assume
Define so that
Consequently, we have
Since by A6.1.7, by A6.1.5 and A6.1.6 it follows that for
where By induction we now show that
for all suitably large
For any fixed if is large enough, since
Therefore, (6.1.36) holds for since
Assume (6.1.36) holds for some By noticing from (6.1.34) and (6.1.35) it follows that
By using (6.1.20), (6.1.37), and the inductive assumption, and applying (6.1.19) to it follows that
for where and satisfies the following equation
By A6.1.7 and (6.1.20) we have
and using (6.1.18), (6.1.37), and the inductive assumption we derive
This, combined with (6.1.38), shows that there are real numbers and such that
for From here it follows that
From the inductive assumption it follows that for
for some large enough integer N. Then by (6.1.12)
Setting
we derive
where (6.1.22), (6.1.24), (6.1.25), (6.1.31), (6.1.39), and (6.1.40) are used.
Choose sufficiently small so that (6.1.35) holds, and
Since by A6.1.5 there is such that
for all From (6.1.41) it then follows that
It can be assumed that is sufficiently large so that
Since by (6.1.42) it follows that
and hence there is no truncation at
Thus, we have
or equivalently,
which proves (6.1.36).
Consequently, (6.1.39) is valid for and
hence
times and
and
where is the estimate of
Let be given by (6.1.7) and (6.1.8) with given by (6.1.5).
where and are related by (6.1.44).
However, since the ideal is unknown, the real system satisfies the
equation
where and are symmetric such that and Let (given by A6.1.3). The control
where is the feedback control which is required to minimize
Finally, noticing that A6.1.5 assumes (6.1.6), we conclude that for each
all satisfy (2.2.2) with
respectively replaced by The proof of the theorem is completed.
We now apply the obtained result to an adaptive control problem. Assume that is the ideal parameter for the system, being the unique zero of an unknown function The system in the ideal condition is described by the equation
From (6.1.21) and (6.1.13) it is seen that is continuous in uniformly with respect to Therefore, its limit is a continuous function. Then by (6.1.36) it follows that
should be selected in the family U of admissible controls:
In order to give the adaptive control we need the expression of the optimal control when is known.
Lemma 6.1.1 Suppose that
i) is a martingale difference sequence with
ii) where is controllable and observable, i.e., · · · , and · · · , are of full rank.
Then in the class of nonnegative definite matrices there is a unique satisfying
and
where
and
Proof. The existence of a unique solution to (6.1.50) and the stability of F given by (6.1.51) are well-known facts in control theory. We show the optimality of the control given by (6.1.52).
For notational simplicity, we temporarily suppress the dependence of and on and write them as A, B, and D, respectively.
Noticing
is stable. The optimal control minimizing (6.1.45) is
we then have
Since by the estimate for the weighted sum of a martingale difference sequence, from (6.1.55) it follows that
where is the state in (6.1.47). Thus the closed-loop system becomes
Notice that the last term of (6.1.56) is nonnegative. The conclusions of the lemma follow from (6.1.56).
According to (6.1.52), by the certainty equivalence principle, we form the adaptive control
which has the same structure as (6.1.4). Therefore, under the assumptions A6.1.1–A6.1.7, with replaced by and with J being a singleton, by Theorem 6.1.1 it is concluded that
By continuity and stability of it is seen that there are and possibly depending on such that
This yields the boundedness of and
because
By (6.1.60) it follows that
Therefore, the closed-loop system (6.1.58) asymptotically operates under the ideal parameter and minimizes the performance index (6.1.45).
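The certainty-equivalence idea above can be sketched for a hypothetical scalar system. Everything below is an illustrative stand-in (a recursive least-squares estimator replaces the book's SA estimate, and the system, costs, and dither are invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical scalar system x_{k+1} = a x_k + b u_k + w_k with unknown (a, b)
# and quadratic cost E[q x^2 + r u^2].
a_true, b_true, q, r = 0.8, 1.0, 1.0, 1.0

def lq_gain(a, b):
    """Solve the scalar Riccati equation by fixed-point iteration; return F."""
    if abs(b) < 0.05:                  # guard: need a usable estimate of b
        return 0.0
    p = q
    for _ in range(200):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)

theta = np.array([0.5, 0.8])           # running estimates of (a, b)
P = 100.0 * np.eye(2)                  # RLS covariance
x = 0.0
for k in range(5000):
    # Certainty equivalence: feed back the gain computed from current estimates.
    F = 0.0 if k < 50 else np.clip(lq_gain(*theta), -5.0, 5.0)
    u = -F * x + 0.1 * rng.standard_normal()   # dither keeps (a, b) identifiable
    x_next = a_true * x + b_true * u + 0.1 * rng.standard_normal()
    phi = np.array([x, u])             # regressor: x_{k+1} = theta . phi + w_k
    K = P @ phi / (1.0 + phi @ P @ phi)
    theta = theta + K * (x_next - theta @ phi)
    P = P - np.outer(K, phi @ P)
    x = x_next

# The certainty-equivalence gain approaches the optimal gain.
assert abs(lq_gain(*theta) - lq_gain(a_true, b_true)) < 0.1
```

As the estimates converge, the feedback gain converges to the optimal one, mirroring the asymptotic optimality statement (6.1.60)–(6.1.61) above.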
6.2. Application to Adaptive Stabilization
Consider the single-input single-output system
where and are the system input, output, and noise, respectively, and
where is the backward shift operator, The system coefficient
is unknown. The purpose of adaptive stabilization is to design the control so that
a.s.
The fact that and a can be solved from (6.2.5) for any means that
is nonzero. In other words, the coprimeness of and is equivalent to
In the case where is unknown, the certainty equivalence principle suggests replacing by its estimate to derive the adaptive control law. However, for may be zero, and (6.2.5) may not be solvable with and replaced by their estimates.
Let us estimate by the following algorithm, called the weighted least squares (WLS) estimate, which is convergent for any feedback control
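The WLS recursion itself is not reproduced in this excerpt; the following is a generic weighted least squares sketch on hypothetical data, with log-type weights of the kind often used to guarantee convergence regardless of the input (the book's exact weights may differ):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical regression y_{k+1} = phi_k . theta0 + w_k; estimate theta0.
theta0 = np.array([1.2, -0.7])
theta = np.zeros(2)
P = 10.0 * np.eye(2)
r = np.e                                   # running sum e + sum ||phi||^2

for k in range(5000):
    phi = rng.standard_normal(2)
    y = phi @ theta0 + 0.2 * rng.standard_normal()
    r += phi @ phi
    w = 1.0 / np.log(r) ** 2               # weight a_k = (log r_k)^(-2)
    # Weighted RLS update: P^{-1} is increased by w * phi phi^T.
    K = w * P @ phi / (1.0 + w * phi @ P @ phi)
    theta = theta + K * (y - phi @ theta)
    P = P - np.outer(K, phi @ P)

assert np.linalg.norm(theta - theta0) < 0.1
```

The down-weighting by powers of log r slows adaptation slightly but makes the estimate convergent under very weak excitation conditions, which is what matters for the adaptive stabilization argument.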
If is known and if and are coprime, then for an arbitrary stable polynomial of degree there are unique polynomials
and both of order with such that
Then the feedback control generated by
leads the system (6.2.1) to
Then, by stability of (6.2.4) holds if we assume
Considering the coefficients of and as unknowns, and identifying the coefficients of on both sides of (6.2.5), we derive a system of linear algebraic equations with matrix for the unknowns:
where
Though converges a.s., its limit may not be the true If a bounded sequence can be found such that the modified estimate
and for some
is convergent and
then the control obtained from (6.2.6) with replaced by solves the adaptive stabilization problem, i.e., makes (6.2.4) hold.
Therefore, the central issue in adaptive stabilization is to find a bounded sequence such that given by (6.2.12) is convergent and (6.2.13) is fulfilled. This gives rise to the following definition.
Definition. System (6.2.1) is called adaptively stabilizable by the use of the parameter estimate if there is a bounded sequence such that (6.2.13) holds and given by (6.2.12) is convergent.
It can be shown that if system (6.2.1) is controllable, i.e., and are coprime, then it is adaptively stabilizable by the use of the WLS estimate. It can also be shown that the system is adaptively stabilizable by use of if and only if where and F denote the limits of and respectively, which are generated by (6.2.9)–(6.2.11).
We now use an SA algorithm to recursively produce such that is convergent and the resulting estimate by (6.2.12) satisfies (6.2.13).
is generated by (6.2.9)–(6.2.11), is defined by (6.2.11), and is recursively defined by an SA algorithm given below.
Let us take a few real sequences defined as follows:
where
which can be written as
From algebraic geometry it is known that is a finite set.
However, is not directly observed; the real observation is
The root set of is denoted by where
where
As a matter of fact,
Let and be –dimensional, and let
Let be l-dimensional with only one nonzero element equal to either +1 or −1, Similarly, let be -dimensional with only nonzero elements, each of which equals either +1 or −1,
The total number of such vectors is
Normalize these vectors and denote the resulting vectors by in nondecreasing order of the number of nonzero elements in
Define and for Introduce
Define the recursive algorithm for as follows:
and is a fixed vector.
The algorithm (6.2.23)–(6.2.27) is the RM algorithm with expanding truncations, but it differs from the algorithm given by (2.1.1)–(2.1.3) as follows. The algorithm (2.1.1)–(2.1.3) is truncated on the upper side only, but the present algorithm is truncated not only on the upper side but also on the lower side: is allowed neither to diverge to infinity nor to tend to zero; whenever it reaches the truncation bounds, the estimate
is pulled back to and is enlarged to on the upper side, while on the lower side is pulled back to which will change to the
next whenever is satisfied. If for successive resettings of we have to change to the next one, then we reduce to
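The two-sided truncation idea — keeping the iterate away from both zero and infinity — can be sketched in a hypothetical scalar example (the regression function, bounds, and noise below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical scalar root-seeking: h(x) = 1 - x, root at 1, with the iterate
# allowed neither to diverge to infinity nor to tend to zero.
x, x_fixed = 0.5, 0.5
sigma = 0                      # upper-truncation count: bound M_sigma expands
delta = 0                      # lower-truncation count: bound m_delta shrinks
M = lambda s: 4.0 * (s + 1)
m = lambda d: 0.05 / (d + 1)

for k in range(1, 10001):
    a_k = 1.0 / k
    x_next = x + a_k * (1.0 - x + 0.3 * rng.standard_normal())
    if abs(x_next) > M(sigma):            # upper truncation: restart, enlarge bound
        x, sigma = x_fixed, sigma + 1
    elif abs(x_next) < m(delta):          # lower truncation: restart, shrink bound
        x, delta = x_fixed, delta + 1
    else:
        x = x_next

assert abs(x - 1.0) < 0.1
```

Since the sought limit is nonzero and finite, both kinds of truncation cease after finitely many steps and the recursion eventually runs as an ordinary RM algorithm, which is the content of Lemma 6.2.1 below.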
Lemma 6.2.1 Assume the following conditions hold:
A6.2.2 System (6.2.1) is adaptively stabilizable by use of generated by (6.2.9)–(6.2.11), i.e.,
If then after a finite number of steps the algorithm (6.2.23)–(6.2.27) becomes the RM algorithm
converges and
Proof. The basic steps of the proof are essentially the same as those for proving Theorem 2.2.1, but some modifications should be made because of the truncations at the lower side.
Step 1. Let be a convergent subsequence of
For any define the RM algorithm
with or for some for some
We show that there are M > 0, T > 0 such that when and
when if is large enough, where is given by (1.3.2).
Let > 1 be a constant such that
It is clear that
A6.2.1 and
Since and are convergent, there is such that
Let By (6.2.29) and (6.2.30), we have
for if and for if where
Let (6.2.31) hold for or
It then follows that
where or
Thus, (6.2.31) has been inductively proved for or
Step 2. Let be a convergent subsequence. We show that there
are M > 0 and T > 0 such that
if is large enough.
If defined by (6.2.25) is bounded, then (6.2.32) follows directly. Again take such that and set
Assume Then there is a such that
By the result proved in Step 1, starting from the algorithm for cannot directly hit the sphere with radius without a truncation for So it may first hit some lower bound at time and switch to some from which again by Step 1 cannot directly reach without a truncation. The only possibility is to be truncated again at a lower bound. Therefore, (6.2.32) takes place.
Step 3. Since and are convergent, by (6.2.32) it follows that from any convergent subsequence there are constants and
such that
if is large enough. Consequently, there is such that
By (6.2.32) and the convergence of and it also follows that
Therefore,
Using (6.2.33) and (6.2.34) by the same argument as that given in Step 3 of the proof for Theorem 2.2.1, we arrive at the following conclusion. If starting from the algorithm (6.2.24) is calculated as an RM algorithm and is bounded, then for any with and cannot cross
infinitely often.
Step 4. We now show that is bounded. If is unbounded, then as Therefore, is unbounded and comes back to the fixed point infinitely many times.
Notice that is a finite set and
We see that there is an interval with and 0 such that crosses infinitely often, and during each crossing the algorithm (6.2.24) behaves like an RM algorithm with starting point It is clear that is bounded because as But by Step 3, this is impossible. Thus, we conclude that is bounded, and after a finite number of steps (6.2.24) becomes
as
Step 5. We now show (6.2.28), i.e., after a finite number of steps the algorithm (6.2.35) ceases to truncate at the lower side.
Since and by A6.2.2, it follows that there is at least one nonzero coefficient in the polynomial for some
with Therefore, for some and a small
From (6.2.16) it is seen that for sufficiently small we have
Combining this with the convergence of and leads to
for sufficiently large
From (6.2.26) and (6.2.36) it follows that must be bounded, and
hence is bounded. This means that there is a such that
We now show that is bounded. Since for all sufficiently large it follows that
If were unbounded, then by (6.2.37) the algorithm, starting from would infinitely many times enter the sphere with radius
where is small enough such that
Then would cross infinitely often an interval Since is a finite set, we may assume It is clear that during the crossing the algorithm behaves like an RM algorithm. By Step 4, this is impossible.
Therefore, there is a such that
Noticing (6.2.20), (6.2.34), and that serves as the Lyapunov function for from Theorem 2.2.1 we conclude the remaining assertions of the lemma.
where and are defined by 1)-3) described above.
Proof. The key step is to show that
Assume the converse:
Case i) The assumption implies that
and occurs infinitely many times. However,
this is impossible, since and The contradiction shows
Theorem 6.2.1 Assume conditions A6.2.1 and A6.2.2 hold. Then there is such that and converges and
and use to produce the adaptive control as in 1), and go back to 1) for
3) If and none of a)-c) of 2) is the case, then set and go back to 1) for and at the same time change to
i.e.,
Define
Using we now define in (6.2.12) satisfying (6.2.13) and thus solving the adaptive stabilization problem.
Let
1) If then set Using we produce the adaptive control from (6.2.6) with and defined from (6.2.5) with replaced by and go back to 1) for
2) If then define
a) for the case where
b) defined by (6.2.24) for the case where but
c) for the case where but
and the algorithm defining will run over the following cases: 1) and 2a)-2c). Since and are convergent, the inequality
for all sufficiently large Again, this means that (6.2.41) may take place at most a finite number of times, and we conclude that
Thus, there is such that
If then from (6.2.43) it follows that
Since and for sufficiently large from (6.2.42) it follows that
for all sufficiently large Thus, (6.2.41) may take place at most a finite number of times. The contradiction shows that
we havethen as
Take a convergent subsequence of For notational simplicity denote by itself its convergent subsequence. Thus
By Lemma 6.2.1,
1) If then
Case ii) The assumption implies that there
is a sequence of integers such that and i.e., for all the following indicator equals one
2) If
implies
6.3. Application to Pole Assignment for Systemswith Unknown Coefficients
Consider the linear stochastic system
where is the -dimensional state, is the one-dimensional control, and is the -dimensional system noise.
The task of pole assignment is to define the feedback control
in order that the characteristic polynomial
of the closed-loop system coincides with a given polynomial
The pair is called similar to if there exists a nonsingular matrix such that
where denotes the column of T.
Consequently, the truncation at the lower bound in (6.2.24) should be very rare. The computation will be simplified if there is no lower-bound truncation.
for sufficiently large This means that the algorithm can be at 2b) only finitely many times. For the same reason it cannot be at 2c) infinitely many times. Therefore, the algorithm will settle at 1) if and at 2a) if and in both cases there is a such that and
The convergence of follows from the convergence of and
Remark 6.2.1 For the case the origin is not a stable equilibrium for the equation
So, is nonsingular if and only if is nonsingular.
Assume that is controllable and is already in its controller form (6.3.5). For notational simplicity, we will write rather than
where
which imply
Define
where are coefficients of
The pair is called the controller form associated to the pair
If is controllable, i.e., is of full rank, then is similar to its controller form. To see this, we note that (6.3.4) implies and from it follows that
where is the system noise at time “1” for the system with feedback gain applied.
Having observed we compute its characteristic polynomial det which is a noise-corrupted characteristic polynomial of
Let be the estimate for By observing det we actually learn the difference det which in a certain sense reflects how far det differs from the ideal polynomial
For any let
With feedback control the closed-loop system takes the form
Since is in controller form,
where are elements of the row vector F:
Therefore, if is known, then comparing (6.3.10) with (6.3.3) gives the solution to the pole assignment problem, where
We now solve the pole assignment problem by learning for the case where is unknown.
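In the known-coefficient case the comparison just described reduces, for a system in controller form, to matching coefficients of the characteristic polynomial. The sketch below illustrates this noise-free baseline; the numeric coefficients are invented for the example, and `char_poly` is a plain Faddeev–LeVerrier routine added here for self-containment, not part of the book's development.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def char_poly(A):
    """Faddeev-LeVerrier: coefficients [1, c1, ..., cn] of det(zI - A)."""
    n = len(A)
    M = [[float(i == j) for j in range(n)] for i in range(n)]  # M_1 = I
    coeffs = [1.0]
    for k in range(1, n + 1):
        AM = matmul(A, M)
        c = -sum(AM[i][i] for i in range(n)) / k
        coeffs.append(c)
        M = [[AM[i][j] + (c if i == j else 0.0) for j in range(n)] for i in range(n)]
    return coeffs

# Illustrative open-loop coefficients a and desired coefficients a_star.
a = [1.0, 0.5, 0.25]
a_star = [-0.6, 0.03, 0.01]          # desired poly z^3 - 0.6 z^2 + 0.03 z + 0.01
# Controller (companion) form with B = e1; first row carries the coefficients.
A = [[-a[0], -a[1], -a[2]], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
F = [ai - si for ai, si in zip(a, a_star)]      # feedback gain u = F x
A_cl = [row[:] for row in A]
A_cl[0] = [A[0][j] + F[j] for j in range(3)]    # A + B F: first row becomes -a_star
cl_coeffs = char_poly(A_cl)                     # matches [1] + a_star
```

The closed-loop first row equals the negated desired coefficients, so the closed-loop characteristic polynomial is exactly the assigned one; the learning algorithm of this section recovers the same gain when the coefficients are unknown.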
Let us combine the vector equation (6.3.9) for initial values to form the matrix equation
Let In learning control, can be observed at any fixed
For any the observation of is denoted by
be the row vector composed of coefficients of
By (6.3.10)
composed of coefficients of
and respectively.Take a sequence of positive real numbers
and
Calculate the estimate for by the following RM algorithm with expanding truncations:
with fixed
Theorem 6.3.1 Assume that is controllable and is in the controller form. Further, assume the following conditions A6.3.1 and A6.3.2 hold:
A6.3.1 The components of in (6.3.13) are mutually independent with
A6.3.2
where is the same as that in A6.3.1. Then there is with such that for each as
Similarly, define row vectors
for some
From here it is seen that is a sum of products of elements from with +1 and –1 as the multiplier for each product, where and denote elements of A and respectively. It is important to note that each product in includes at least one of as its factor. Thus, the product is of the form
From (6.3.21) by (6.3.18), (6.3.15), and (6.3.13) it follows that
Therefore, the conclusion of the theorem will follow from Theorem 2.2.1, if we can show that for any integer N
where is the desired feedback gain realizing the exact pole assignment.
Proof. Define
where and are given by (6.3.14) and (6.3.17), respectively. By (6.3.11) and (6.3.16) it follows that
Thus, (6.3.19) and (6.3.20) become
It is clear that the recursive algorithm for has the same structure as (2.1.1)–(2.1.3). For the present case, as the function required in A2.2.2 we may take
where
By A6.3.1 we have
where
By A6.3.2 and the convergence theorem for martingale difference sequences it follows that
for any integer which implies (6.3.24).
6.4. Application to Adaptive Regulation
We now apply the SA method to solve the adaptive regulation problem for a nonlinear nonparametric system.
Consider the following system
where is the system state, is the control, and is an unknown nonlinear function with being the unknown equilibrium for the system (6.4.1).
Assume the state is observed, but the observations are corrupted
by noise:
where is the observation noise, which may depend on
The purpose of adaptive regulation is to define adaptive control based on measurements in order that the system state reach the desired value, which, without loss of generality, may be assumed to be zero.
We need the following conditions.
A6.4.1 and
A6.4.2 The upper bound for is known, i.e., and is a robustly stabilizing control in the sense that for any the state
tends to zero for the following system
A6.4.3 The system (6.4.1) is BIBS stable, i.e., for any bounded input, the system state is also bounded;
A6.4.4 is continuous for bounded i.e., for any
A6.4.5 The system (6.4.1) is strictly input passive, i.e., there are and such that for any input
A6.4.6 For any convergent subsequence
where is defined by (1.3.2).
It is worth noting that A6.4.6 becomes
if is independent of
The adaptive control is given by the following recursive
algorithm:
where b is specified in A6.4.2.
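A control law of the type (6.4.4) can be simulated on a toy plant. The sketch below is an illustration only: the linear plant with unknown constant offset c, the noise level, and the bound b are assumptions of this example, whereas the book's system (6.4.1) is nonlinear and nonparametric.

```python
import random

random.seed(1)
b = 5.0                # known truncation bound as in A6.4.2
c = 1.0                # unknown offset: toy plant x_{k+1} = 0.5 x_k + c + u_k
x, u = 0.0, 0.0
for k in range(20000):
    a_k = 1.0 / (k + 1)                    # decreasing SA gains
    x = 0.5 * x + c + u                    # plant step under the current control
    y = x + 0.05 * random.gauss(0.0, 1.0)  # noisy state observation as in (6.4.2)
    u = max(-b, min(b, u - a_k * y))       # SA update truncated at the bound b
# The state is driven to zero; the control learns the unknown value -c.
```

The control accumulates the noisy state observations with decreasing gains; the decreasing gains average out the observation noise while the truncation keeps the control within the known stabilizing range.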
Theorem 6.4.1 Assume A6.4.1–A6.4.6. Then the system (6.4.1), (6.4.2),and (6.4.4) has the desired properties:
on sample paths where A6.4.6 holds.
Proof. Let be a convergent subsequence of such that
and
We have
for sufficiently large and small enough T, where is a constant to be specified later on. The relationships (6.4.5) and (6.4.6) can be proved along the lines of the proof for Theorem 2.2.1, but here is known to be bounded, and (6.4.5) and (6.4.6) can be proved more straightforwardly. We show this.
Since the system (6.4.1) is BIBS stable, from it follows that there is such that
By A6.4.6 for large and small T > 0,
This implies that
Let be large enough such that
and let T be small enough such that
Then we have
and hence there is no truncation in (6.4.4) for i.e., (6.4.5) holds for Therefore,
indeed.
By induction, the assertions (6.4.5) and (6.4.6) have been proved. We now show that for any convergent subsequence
there is a such that
from (6.4.4) it follows that (6.4.5) holds for Hence,
Thus, (6.4.5) and (6.4.6) hold for Assume they are true for all We now show that they are true for too.
Since
for small enough T > 0. By A6.4.5, we have
Let us restrict in (6.4.8) to Then for small T and large from (6.4.6) and (6.4.8) it follows that
and (6.4.6) is true for
Since and it is seen that
Using a partial summation, by (6.4.9) we have
for all sufficiently large and small enough T > 0. Set
for
This implies that there exist a and a sufficiently large which
may depend on but is independent of such that
Then (6.4.10) implies that
This proves (6.4.7).
Define
From (6.4.7) it follows that
for convergent subsequence
Using A6.4.6 and (6.4.11), by exactly the same argument as that used in the proof (Steps 3–6) of Theorem 2.2.1, we conclude that
Finally, write (6.4.1) as
By A6.4.4 and the boundedness of we have and by A6.4.2 we conclude
Remark 6.4.1 It is easy to see that A6.4.6 is also necessary if A6.4.1–A6.4.5 hold and and This is because for large the observation noise can be expressed as
and hence
6.5. Notes and References
For system identification and adaptive control we refer to [10, 23, 54, 62, 75, 90]. The identification problem stated in Section 6.1 was solved in [72] by the ODE method. In comparison with [72], the conditions used here have been considerably weakened, and the convergence is proved by the TS method rather than the ODE method. Section 6.1 is based on the joint work by H. F. Chen, T. Duncan and B. Pasik-Duncan. The existence and uniqueness of the solution to (6.1.50) can be found, e.g., in [23]. For stochastic quadratic control refer to [2, 10, 12, 33].
Adaptive stabilization for stochastic systems is dealt with in [5, 55, 77]. The convergence of WLS and adaptive stabilization using WLS are given in [55]. The problem is solved by the SA method in [19]. This approach is presented in Section 6.2.
The pole assignment problem for stochastic systems with unknown coefficients is solved by SA with the help of learning in Section 6.3, which is based on [20]. For concepts of linear control systems we refer to
which tends to zero as since and
Remark 6.4.2 In the formulation of Theorem 6.4.1 the condition A6.4.5 can be replaced either by (6.4.7) or by (6.4.11), which are the consequences of A6.4.5. Further, the quadratic can be replaced by a continuously differentiable function such that and In this case, in (6.4.7) should be correspondingly replaced by
Example 6.4.1 Let the nonlinear system be affine:
where the scalar nonlinear function is bounded from above and from below by positive constants:
Note that and hence (6.4.7) holds, if Assume is known: Then A6.4.2, A6.4.3, and A6.4.4 are satisfied. Therefore, if satisfies A6.4.6, then given by (6.4.4) leads to and
In the area of systems and control, SA methods are also successfully applied to discrete event dynamic systems, especially to perturbation-analysis-based parameter optimization.
[1, 46, 60]. The connection between the feedback gain and the coefficients of the desired characteristic polynomial is called Ackermann’s formula, which can be found in [46].
Application of SA to adaptive regulation is based on [26].
For perturbation analysis of discrete event dynamic systems we refer to [58]. Perturbation-analysis-based parameter optimization is dealt with in [29, 86, 87].
Appendix A
In Appendix A we introduce the basic concepts of probability theory. Results are presented without proof. For details we refer to [31, 32, 70, 76, 84].
A.1. Probability Space
The basic space is denoted by The point is called an elementary event or sample. The point set in is denoted by A.
Let be a family of sets in satisfying the following conditions:
1.
2.
3.
Then is called the or The element A of is called a measurable set, or random event, or event.
As a consequence of Properties 2 and 3,
then the complement of A, also belongs to
If
If
if
A set function defined on is called -additive if for any
sequence of disjoint events By definition, one of the values or is not allowed to be taken by
A nonnegative set function is called a measure.
Define
The set functions and are called the upper, lower, and total variation of on respectively.
Jordan-Hahn Decomposition Theorem If is on then there exists a set D such that, for any
and are measures and
Let P be a set function defined on with the following properties:
1.
2.
then
3. if are disjoint. Then, P is called a
probability measure on The triple is called a probability space. P(A) is called the probability of the random event A.
It is assumed that any subset of a measurable set of probability zero is measurable and its probability is zero. After such a completion of measurable sets the resulting probability space is called completed.
If a relationship between random variables holds for any with the possible exception of a set of probability zero, then we say this relationship holds a.s. (almost surely) or with probability one.
A.2. Random Variable and Distribution Function
In R, the real line, the smallest containing all intervals is called the Borel and is denoted by The “smallest” means that if there is a containing all intervals, then there must be in the sense that for any
The Borel can also be defined in Any set in or is called a Borel set.
Any interval can be endowed with a measure equal to its length. This measure can be extended to each i.e., to each Borel set. Any subset of a set with measure zero is also assumed to be a measurable set with measure zero. After such a completion, the measurable set is called Lebesgue measurable, and the measure the Lebesgue measure. In what follows always means the completed Borel
A real function defined on is called measurable, if
If is a real measurable function defined on and then is called a random variable. Therefore, if is a measurable function, then is also a random variable if
Let be a random variable. The distribution function of is defined as
By a random vector we mean that each component of is a random variable.
The distribution function of a random vector is defined as
If is differentiable, then its derivative is called the density of The density of a random vector is defined in a similar way. The density of the l-dimensional normal distribution is defined by
A.3. Expectation
Let be a random variable and let
Define
where
is called the expectation of
For an arbitrary random variable define
The expectation of is defined as
if at least one of and is finite. If then is called integrable.
The expectation of can be expressed as a Lebesgue-Stieltjes integral with respect
to its distribution function
In the density of l-dimensional random vector with normal distribution,
A.4. Convergence Theorems and Inequalities
Let $\{\xi_n\}$ be a sequence of random variables and $\xi$ be a random variable.
If $P(\xi_n \to \xi) = 1$, then we say that $\xi_n$ converges to $\xi$ almost surely and write $\xi_n \to \xi$ a.s.
If $P(|\xi_n - \xi| > \epsilon) \to 0$ for any $\epsilon > 0$, then we say that $\xi_n$ converges to $\xi$ in probability and write $\xi_n \stackrel{P}{\to} \xi$.
If the distribution functions of $\xi_n$ converge to that of $\xi$ at any point where the latter is continuous, then we say $\xi_n$ weakly (or in distribution) converges to $\xi$ and write $\xi_n \stackrel{w}{\to} \xi$.
If $E|\xi_n - \xi|^2 \to 0$, then we say $\xi_n$ converges to $\xi$ in the mean square sense and write l.i.m. $\xi_n = \xi$.
Mean square convergence implies convergence in probability, which in turn implies weak convergence.
Monotone Convergence Theorem If random variables nondecreasingly (nonincreasingly) converge to and then
Dominated Convergence Theorem If and there exists an integrable random variable such that then and
Fatou Lemma If for some random variable with then
If is a measurable function, then
Chebyshev Inequality: $P(|\xi| \ge \epsilon) \le E\xi^2/\epsilon^2$ for any $\epsilon > 0$.
Lyapunov Inequality: $(E|\xi|^r)^{1/r} \le (E|\xi|^s)^{1/s}$ for $0 < r \le s$.
Hölder Inequality: $E|\xi\eta| \le (E|\xi|^p)^{1/p}(E|\eta|^q)^{1/q}$, where $p > 1$ and $1/p + 1/q = 1$.
In the special case where $p = q = 2$ the Hölder inequality is called the Schwarz inequality.
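A quick Monte Carlo sanity check of the Chebyshev bound; the uniform distribution on $[-1,1]$ and the threshold are arbitrary choices made for this illustration.

```python
import random

random.seed(2)
eps = 0.5
n = 200000
samples = [random.uniform(-1.0, 1.0) for _ in range(n)]
# Empirical P(|xi| >= eps); the true value for Uniform[-1,1] is 0.5.
p_emp = sum(1 for s in samples if abs(s) >= eps) / n
# Chebyshev bound E(xi^2)/eps^2; here E(xi^2) = 1/3, so the bound is 4/3.
bound = (sum(s * s for s in samples) / n) / eps ** 2
```

The bound is far from tight in this example (4/3 versus a true probability of 1/2), which is typical: Chebyshev trades sharpness for complete generality.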
A.5. Conditional Expectation
Let be a probability space. is called a of if is a and by which it is meant that any implies
Radon-Nikodym Theorem Let be a of For any random variable with at least one of and being finite, there is a unique measurable random variable, denoted by such that for any
The random variable satisfying the above equality is called the conditional expectation of given
Let be the smallest (see A.2) containing all sets
is called the generated by
The conditional expectation of given is defined as
Let A be an event. Conditional probability of A given is defined by
Properties of the conditional expectation are listed below.
1) for constants and
2)
3) if is and
4) if
5) if
Convergence theorems and inequalities stated in A.4 remain true with the expectation replaced by the conditional expectation For example, the conditional Hölder inequality
for
For a sequence of random variables and a the consistent conditional distribution functions of given
conditional distribution functions of given
Let and Then
can be defined such that i) they are for any and any fixed ii) they are distribution functions for any fixed and iii) for any measurable function
A.6. Independence
Let be a sequence of events. If for any set of indices
then is called mutually independent.
Let be a sequence of If events are mutually independent whenever then the family of is called mutually independent.
Let be a sequence of random variables and let be the generated by If is mutually independent, then the sequence of random variables is called mutually independent.
Law of iterated logarithm Let be a sequence of independent and identically distributed (iid) random variables, Then
Proposition A.6.1 Let be a measurable function defined on If the l-dimensional random vector is independent of the m-dimensional random vector then
where
From this proposition it follows that
if is independent of
A.7. Ergodicity
Let be a sequence of random variables and let be the distribution function of
If for any integer then
is called stationary, or is a stationary process.
Proposition A.7.1 Let be stationary.
provided exists for all in the range of
If exists, then
where is a of and is called invariant
If then the stationary process is called ergodic. Thus, for a stationary and ergodic process we have
If is a sequence of mutually independent and identically distributed (and hence stationary) random variables, then and the sequence is ergodic.
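In the iid case the ergodic statement reduces to the strong law of large numbers: time averages converge to the expectation. A small seeded illustration; the exponential distribution and the sample size are arbitrary choices.

```python
import random

random.seed(3)
n = 100000
mu = 2.0
# iid (hence stationary and ergodic) exponential variables with mean mu.
xs = [random.expovariate(1.0 / mu) for _ in range(n)]
time_avg = sum(xs) / n   # time average, close to the ensemble mean mu
```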
Appendix B
In Appendix B we present the detailed proof of convergence theorems for martingales and martingale difference sequences.
Let be a sequence of random variables, and let be a family of nondecreasing i.e.,
If is for any then we write and call it an adapted process.
An adapted process with is called a martingale if a supermartingale if and a submartingale if
An adapted process is called a martingale difference sequence (MDS) if
A sequence of mutually independent random vectors with is an obvious example of an MDS.
An integer-valued measurable function is called a Markov time with respect to if
If, in addition, then is called a stopping time.
B.1. Convergence Theorems for Martingales
Lemma B.1.1 Let be adapted, a Markov time, and B a Borel set. Let be the first time at which the process hits the set B after time i.e.,
Then is a Markov time.
Proof. The conclusion follows from the following expression:
For defining the number of up-crossings of an interval by a submartingale we first define
The largest for which is called the number of up-crossings of the interval by the process and is denoted by
By Lemma B.1.1
So, is a Markov time.
Assume is a Markov time. Again, by Lemma B.1.1,and
Therefore, all are Markov times.
Theorem B.1.1 (Doob) For submartingales the following inequalities
hold
where
Proof. Note that equals the number of up-crossings of the interval by the submartingale or by
Since for
is a submartingale.
Thus, without loss of generality, it suffices to prove that for a nonnegative submartingale
Define
APPENDIX B 337
Define also Then for even crosses (0, b) from time to Therefore,
and
Further, the set is since is a Markov time, and
Taking expectation of both sides of (B.1.2) yields
where the last inequality holds because is a submartingale and hence the integrand is nonnegative.
Thus (B.1.1) and hence the theorem is proved.
Theorem B.1.2 (Doob) Let be a submartingale with a.s. Then there is a random variable with such that
Proof. Set
Assume the converse:
Then
where and run over all rational numbers.
By the converse assumption there exist rational numbers such that
Let be the number of up-crossings of the interval by
By Theorem B.1.1
By the monotone convergence theorem from (B.1.4) it follows that
However, (B.1.3) implies which contradicts (B.1.5). Hence,
and
where is invoked. Hence,
Corollary B.1.1 If is a nonnegative supermartingale or nonpositive submartingale, then
Because for nonpositive submartingales the corollary follows from the theorem; while for a nonnegative supermartingale is a nonpositive submartingale.
Corollary B.1.2 If is a martingale with thenand
This is because for a martingale and and hence
or converges to a limit which is finite a.s. By the Fatou lemma it follows that
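The convergence asserted by Corollary B.1.1 can be watched numerically: the running product of iid nonnegative factors with mean one is a nonnegative martingale, and each path converges (here, to zero, since the expected log-factor is negative). The two-point factor distribution is an arbitrary choice for this sketch.

```python
import random

random.seed(4)

def martingale_path(n=2000):
    """M_n = product of n iid factors in {0.5, 1.5}, each with probability 1/2.
    E(factor) = 1, so M_n is a nonnegative martingale; E(log factor) < 0,
    so each path converges to 0 almost surely."""
    m = 1.0
    for _ in range(n):
        m *= 0.5 if random.random() < 0.5 else 1.5
    return m

final_values = [martingale_path() for _ in range(5)]
# Each simulated path has essentially reached its a.s. limit (zero here),
# even though E(M_n) = 1 for every n -- the limit need not preserve the mean.
```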
B.2. Convergence Theorems for MDS I
Let be an adapted process, and let G be a Borel set in
Then the first exit time from G defined by
is a Markov time. This is because
Lemma B.2.1. Let be a martingale (supermartingale, submartingale) and a Markov time. Then the process stopped at is again a martingale (supermartingale, submartingale), where
Proof. Note that
is
If is a martingale, then
This shows that is a martingale. For supermartingales and submartingales the proof is similar.
Theorem B.2.1. Let be a one-dimensional MDS. Then as
converges on
Proof. Since is the first exit time
is a Markov time and by Lemma B.2.1 is a martingale, where M is a positive constant.
Noticing that and that
is we find
By Corollary B.1.2 converges as It is clear that on
Therefore, as converges pathwise on Since M is arbitrary, converges on which equals A.
Theorem B.2.2. Let be an MDS and If
then converges on If then
converges on
Proof. It suffices to prove the first assertion, because the second one is reduced to the first one if is replaced by
Define
By Lemma B.2.1 is a martingale. It is clear that
Consequently,
By Theorem B.1.2 converges as
Since on as converges on and
consequently on which equals
B.3. Borel-Cantelli-Lévy Lemma
Theorem B.3.1. (Borel-Cantelli-Lévy Lemma) Let be a sequence of
events, Then if and only if or equivalently,
Proof. Define
Clearly, is a martingale and is an MDS. Since by Theorem B.2.2, converges on
If then from (B.3.2) it follows that which implies that
converges. Then, combining this with by (B.3.2) yields
Conversely, if then from (B.3.2) it follows that
Noticing that is contained in the set where converges by
Theorem B.2.2, from the convergence of by (B.3.2) it follows that
If are mutually independent and then
Proof. Denote by the generated by
If then
and hence which, by (B.3.1), implies (B.3.3).
When are mutually independent, then
B.4. Convergence Criteria for Adapted Sequences
Let be an adapted process.
Theorem B.4.1 Let be a sequence of positive numbers. Then
where
Consequently, implies and
follows from (B.3.1).
Theorem B.3.2 (Borel-Cantelli Lemma) Let be a sequence of events. If
then the probability that the events occur infinitely often is zero, i.e.,
Proof. Set
By Theorem B.3.1
or
This means that A is the set where events may occur only finitely many times.
Therefore, on A the series converges if and only if converges.
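A numerical reading of the Borel-Cantelli lemma, with P(A_n) = 2^{-n} as an arbitrary summable choice: the expected total number of events that occur equals the sum of the probabilities (here 1), and on almost every path only finitely many, early-indexed events occur.

```python
import random

random.seed(5)
n_paths, horizon = 10000, 60
counts = []
for _ in range(n_paths):
    # Event A_n occurs when an independent uniform draw falls below 2^{-n}.
    counts.append(sum(1 for n in range(1, horizon + 1) if random.random() < 2.0 ** -n))
mean_count = sum(counts) / n_paths   # close to sum of 2^{-n} over n >= 1, i.e. 1
```

No simulated path accumulates more than a handful of occurrences, in line with the lemma's conclusion that infinitely many occurrences have probability zero.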
Theorem B.4.2 (Three Series Criterion) Denote by S the set where the following three series converge:
and
where c is a positive constant.
Then converges on S as
Proof. Taking in (B.4.1), we have and
by Theorem B.4.1.
Define
Since converges on S, from (B.4.2) it follows that
Noticing that is an MDS and
we see
By Theorem B.2.1 converges on S, or
Then from (B.4.3) it follows that
or converges.
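The three series criterion can be illustrated on the summands X_n = ±1/n with probability 1/2 each (so the truncation constant c = 1 never binds): the truncated-probability series vanishes, the truncated means are zero, and the variances sum, so the partial sums converge a.s. The seeded sketch below checks that the tail of one simulated path is small; the tolerance 0.1 is a generous bound derived from the summed tail variances, an assumption of this sketch.

```python
import random

def partial_sum(n, seed):
    """Partial sum of X_k = +-1/k (probability 1/2 each) up to n."""
    random.seed(seed)
    s = 0.0
    for k in range(1, n + 1):
        s += (1.0 / k) if random.random() < 0.5 else (-1.0 / k)
    return s

# Re-seeding reproduces the same first 10000 signs, so the difference
# below is the genuine tail of a single path between the two horizons.
s1 = partial_sum(10000, seed=6)
s2 = partial_sum(20000, seed=6)
tail = abs(s2 - s1)   # std of the tail is about 0.007, far below 0.1
```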
B.5. Convergence Theorems for MDS II
Let be an MDS.
Theorem B.5.1 (Y. S. Chow) converges on
Proof. By Theorem B.4.2 it suffices to prove where S is defined in Theorem B.4.2 with replaced by considered in the present theorem.
We now verify that the three series defined in Theorem B.4.2 are convergent on A if is replaced by
For convergence of the first series it suffices to note
For convergence of the second series, taking into account we find
Finally, for convergence of the last series it suffices to note
and
by the conditional Schwarz inequality.
Theorem B.5.2. The conclusion of Theorem B.5.1 is also valid for
Proof. Define
Then we have
on A where A is still defined by (B.5.1) but with
Applying Theorem B.5.1 with to the MDS shows that
converges on A, i.e.,
This is equivalent to
B.6. Weighted Sum of MDS
Theorem B.6.1 Let be an l-dimensional MDS and let be a
matrix adapted process. If
for some then as
where
Proof. Without loss of generality, assume
Notice that convergence of implies convergence of since for sufficiently large
Consequently, from (B.5.2) it follows that
We have the following estimate:
By Theorems B.5.1 and B.5.2 it follows that
where
Notice that is nondecreasing as If is bounded, then the conclusion of the theorem follows from (B.6.1). If then by the Kronecker lemma
(see Section 3.4) the conclusion of the theorem also follows from (B.6.1).
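The Kronecker lemma invoked here (see Section 3.4) can be checked on a deterministic toy sequence: with x_n = (-1)^n and b_n = n, the series of x_n/b_n converges (it is the alternating harmonic series), and the lemma then forces the averages (1/b_n) times the partial sums of x_n to zero. The particular sequences are assumptions of this sketch.

```python
n = 100000
partial = 0.0
series = 0.0
for k in range(1, n + 1):
    x_k = -1.0 if k % 2 else 1.0   # x_k = (-1)^k
    partial += x_k                  # partial sum of x_k, oscillates in {-1, 0}
    series += x_k / k               # sum of x_k / b_k with b_k = k: converges to -ln 2
kronecker_avg = partial / n         # (1/b_n) * sum_{k<=n} x_k, tends to 0
```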
References
[1] B. D. O. Anderson and T. B. Moore, Optimal Control: Linear Quadratic Methods, Prentice-Hall, N. J., 1990.
[2] K. J. Åström, Introduction to Stochastic Control, Academic Press, New York, 1970.
[3] M. Benaim, A dynamical systems approach to stochastic approximation, SIAM J. Control & Optimization, 34:437–472, 1996.
[4] A. Benveniste, M. Metivier and P. Priouret, Adaptive Algorithms and Stochastic Approximation, Springer-Verlag, New York, 1990.
[5] B. Bercu, Weighted estimation and tracking for ARMAX models, SIAM J. Control & Optimization, 33:89–106, 1995.
[6] P. Billingsley, Convergence of Probability Measures, Wiley, New York, 1968.
[7] J. R. Blum, Multidimensional stochastic approximation, Ann. Math. Statist., 25:737–744, 1954.
[8] V. S. Borkar, Asynchronous stochastic approximations, SIAM J. Control & Optimization, 36:840–851, 1998.
[9] O. Brandière and M. Duflo, Les algorithmes stochastiques contournent-ils les pièges? Ann. Inst. Henri Poincaré, 32:395–427, 1996.
[10] P. E. Caines, Linear Stochastic Systems, Wiley, New York, 1988.
[11] H. F. Chen, Recursive algorithms for adaptive beam-formers, Kexue Tongbao (Science Bulletin), 26:490–493, 1981.
[12] H. F. Chen, Recursive Estimation and Control for Stochastic Systems, Wiley, New York, 1985.
[13] H. F. Chen, Asymptotic efficient stochastic approximation, Stochastics and Stochastics Reports, 45:1–16, 1993.
[14] H. F. Chen, Stochastic approximation and its new applications, Proceedings of 1994 Hong Kong International Workshop on New Directions of Control and Manufacturing, 1994, 2–12.
[15] H. F. Chen, Convergence rate of stochastic approximation algorithms in the degenerate case, SIAM J. Control & Optimization, 36:100–114, 1998.
[16] H. F. Chen, Stochastic approximation with non-additive measurement noise, J. of Applied Probability, 35:407–417, 1998.
[17] H. F. Chen, Convergence of SA algorithms in multi-root or multi-extreme cases, Stochastics and Stochastics Reports, 64:255–266, 1998.
[18] H. F. Chen, Stochastic approximation with state-dependent noise, Science in China (Series E), 43:531–541, 2000.
[19] H. F. Chen and X. R. Cao, Controllability is not necessary for adaptive pole placement control, IEEE Trans. Autom. Control, AC-42:1222–1229, 1997.
[20] H. F. Chen and X. R. Cao, Pole assignment for stochastic systems with unknown coefficients, Science in China (Series E), 43:313–323, 2000.
[21] H. F. Chen, T. Duncan, and B. Pasik-Duncan, A Kiefer-Wolfowitz algorithm with randomized differences, IEEE Trans. Autom. Control, AC-44:442–453, 1999.
[22] H. F. Chen and H. T. Fang, Nonconvex stochastic optimization for model reduction, Global Optimization, 2002.
[23] H. F. Chen and L. Guo, Identification and Stochastic Adaptive Control, Birkhäuser, Boston, 1991.
[24] H. F. Chen, L. Guo, and A. J. Gao, Convergence and robustness of the Robbins-Monro algorithm truncated at randomly varying bounds, Stochastic Processes and Their Applications, 27:217–231, 1988.
[25] H. F. Chen and R. Uosaki, Convergence analysis of dynamic stochastic approximation, Systems and Control Letters, 35:309–315, 1998.
[26] H. F. Chen and Q. Wang, Adaptive regulator for discrete-time nonlinear nonparametric systems, IEEE Trans. Autom. Control, AC-46: , 2001.
[27] H. F. Chen and Y. M. Zhu, Stochastic approximation procedures with randomly varying truncations, Scientia Sinica (Series A), 29:914–926, 1986.
[28] H. F. Chen and Y. M. Zhu, Stochastic Approximation (in Chinese), Shanghai Scientific and Technological Publishers, Shanghai, 1996.
[29] E. K. P. Chong and P. J. Ramadge, Optimization of queues using an infinitesimal perturbation analysis-based stochastic algorithm with general update times, SIAM J. Control & Optimization, 31:698–732, 1993.
[30] Y. S. Chow, Local convergence of martingales and the law of large numbers, Ann. Math. Statist., 36:552–558, 1965.
[31] Y. S. Chow and H. Teicher, Probability Theory: Independence, Interchangeability, Martingales, Springer-Verlag, New York, 1978.

[32] K. L. Chung, A Course in Probability Theory (second edition), Academic Press, New York, 1974.

[33] M. H. A. Davis, Linear Estimation and Stochastic Control, Chapman and Hall, New York, 1977.

[34] K. Deimling, Nonlinear Functional Analysis, Springer, Berlin, 1985.

[35] B. Delyon and A. Juditsky, Stochastic optimization with averaging of trajectories, Stochastics and Stochastics Reports, 39:107–118, 1992.

[36] E. F. Deprettere (ed.), SVD and Signal Processing, Elsevier, North-Holland, 1988.

[37] N. Dunford and J. T. Schwartz, Linear Operators, Part 1: General Theory, Wiley-Interscience, New York, 1966.

[38] V. Dupač, A dynamic stochastic approximation method, Ann. Math. Statist., 36:1695–1702, 1965.

[39] V. Dupač, Stochastic approximation in the presence of trend, Czechoslovak Math. J., 16:454–461, 1966.

[40] A. Dvoretzky, On stochastic approximation, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, pp. 39–55, 1956.

[41] S. N. Ethier and T. G. Kurtz, Markov Processes: Characterization and Convergence, Wiley, New York, 1986.

[42] E. Eweda, Convergence of the sign algorithm for adaptive filtering with correlated data, IEEE Trans. Information Theory, IT-37:1450–1457, 1991.

[43] V. Fabian, On asymptotic normality in stochastic approximation, Ann. Math. Statist., 39:1327–1332, 1968.

[44] V. Fabian, On asymptotically efficient recursive estimation, Ann. Statist., 6:854–856, 1978.

[45] V. Fabian, Simulated annealing simulated, Computers Math. Applic., 33:81–94, 1997.

[46] F. W. Fairman, Linear Control Theory, The State Space Approach, Wiley, Chichester, 1998.

[47] H. T. Fang and H. F. Chen, Sharp convergence rates of stochastic approximation for degenerate roots, Science in China (Series E), 41:383–392, 1998.

[48] H. T. Fang and H. F. Chen, Stability and instability of limit points of stochastic approximation algorithms, IEEE Trans. Autom. Control, AC-45:413–420, 2000.

[49] H. T. Fang and H. F. Chen, An a.s. convergent algorithm for global optimization with noise corrupted observations, J. Optimization Theory and Applications, 104:343–376, 2000.
[50] H. T. Fang and H. F. Chen, Asymptotic behavior of asynchronous stochastic approximation, Science in China (Series F), 44:249–258, 2001.

[51] B. A. Francis, A Course in H∞ Control Theory, Lecture Notes in Control and Information Sciences, Vol. 88, Springer, 1987.

[52] S. B. Gelfand and S. K. Mitter, Recursive stochastic algorithms for global optimization in R^d, SIAM J. Control & Optimization, 29:999–1018, 1991.

[53] E. G. Gladyshev, On stochastic approximation (in Russian), Theory Probab. Appl., 10:275–278, 1965.

[54] G. C. Goodwin and K. S. Sin, Adaptive Filtering, Prediction and Control, Prentice-Hall, N.J., 1984.

[55] L. Guo, Self-convergence of weighted least squares with applications to stochastic adaptive control, IEEE Trans. Autom. Control, AC-41:79–89, 1996.

[56] P. Hall and C. C. Heyde, Martingale Limit Theory and Its Application, Academic Press, New York, 1980.

[57] S. Haykin, Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, NJ, 1990.

[58] Y. C. Ho and X. R. Cao, Perturbation Analysis of Discrete Event Dynamical Systems, Kluwer, Boston, 1991.

[59] A. Juditsky, A stochastic estimation algorithm with observation averaging, IEEE Trans. Autom. Control, 38:794–798, 1993.

[60] T. Kailath, Linear Systems, Prentice-Hall, N.J., 1980.

[61] J. Kiefer and J. Wolfowitz, Stochastic estimation of the maximum of a regression function, Ann. Math. Statist., 23:462–466, 1952.

[62] P. V. Kokotovic (ed.), Foundations of Adaptive Control, Springer, Berlin, 1991.

[63] J. Koronacki, Random-seeking methods for the stochastic unconstrained optimization, Int. J. Control, 21:517–527, 1975.

[64] H. J. Kushner, Approximation and Weak Convergence Methods for Random Processes with Applications to Stochastic Systems Theory, MIT Press, Cambridge, MA, 1984.

[65] H. J. Kushner and D. S. Clark, Stochastic Approximation for Constrained and Unconstrained Systems, Springer-Verlag, New York, 1978.

[66] H. J. Kushner and J. Yang, Stochastic approximation with averaging of the iterates: Optimal asymptotic rates of convergence for general processes, SIAM J. Control & Optimization, 31:1045–1062, 1993.

[67] H. J. Kushner and J. Yang, Stochastic approximation with averaging and feedback: Rapidly convergent “on line” algorithms, IEEE Trans. Autom. Control, AC-40:24–34, 1995.
[68] H. J. Kushner and G. Yin, Stochastic Approximation Algorithms and Applications, Springer-Verlag, New York, 1997.

[69] J. P. LaSalle and S. Lefschetz, Stability by Lyapunov’s Direct Method with Applications, Academic Press, New York, 1961.

[70] R. Liptser and A. N. Shiryaev, Statistics of Random Processes, Springer-Verlag, New York, 1977.

[71] R. Liu, Blind signal processing: An introduction, Proceedings 1996 Intl. Symp. Circuits and Systems, Vol. 2, 81–83, 1996.

[72] L. Ljung, Analysis of recursive stochastic algorithms, IEEE Trans. Autom. Control, AC-22:551–575, 1977.

[73] L. Ljung, On positive real transfer functions and the convergence of some recursive schemes, IEEE Trans. Autom. Control, AC-22:539–551, 1977.

[74] L. Ljung, G. Pflug, and H. Walk, Stochastic Approximation and Optimization of Random Systems, Birkhäuser, Basel, 1992.

[75] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification, MIT Press, Cambridge, MA, 1983.

[76] M. Loève, Probability Theory, Springer, New York, 1977–1978.

[77] R. Lozano and X. H. Zhao, Adaptive pole placement without excitation probing signals, IEEE Trans. Autom. Control, AC-39:47–58, 1994.

[78] M. B. Nevelson and R. Z. Khasminskii, Stochastic Approximation and Recursive Estimation, Translations of Mathematical Monographs, Vol. 47, Amer. Math. Soc., Providence, RI, 1976.

[79] E. Oja, Subspace Methods of Pattern Recognition, 1st ed., Research Studies Press Ltd., Letchworth, Hertfordshire, 1983.

[80] B. T. Polyak, New stochastic approximation type procedures (in Russian), Autom. i Telemekh., 7:98–107, 1990.

[81] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control & Optimization, 30:838–855, 1992.

[82] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22:400–407, 1951.

[83] D. Ruppert, Stochastic approximation, in B. K. Ghosh and P. K. Sen (eds.), Handbook of Sequential Analysis, 503–529, Marcel Dekker, New York, 1991.

[84] A. N. Shiryaev, Probability, Springer, New York, 1984.

[85] J. C. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE Trans. Autom. Control, AC-37:331–341, 1992.
[86] Q. Y. Tang and H. F. Chen, Convergence of perturbation analysis based optimization algorithm with fixed number of customers period, Discrete Event Dynamic Systems, 4:359–373, 1994.

[87] Q. Y. Tang, H. F. Chen, and Z. J. Han, Convergence rates of perturbation-analysis-Robbins-Monro-single-run algorithms, IEEE Trans. Autom. Control, AC-42:1442–1447, 1997.

[88] J. N. Tsitsiklis, Asynchronous stochastic approximation and Q-learning, Machine Learning, 16:185–202, 1994.

[89] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, Distributed asynchronous deterministic and stochastic gradient optimization algorithms, IEEE Trans. Autom. Control, 31:803–812, 1986.

[90] Ya. Z. Tsypkin, Adaptation and Learning in Automatic Systems, Academic Press, New York, 1971.

[91] K. Uosaki, Some generalizations of dynamic stochastic approximation processes, Ann. Statist., 2:1042–1048, 1974.

[92] J. Venter, An extension of the Robbins-Monro procedure, Ann. Math. Statist., 38:181–190, 1967.

[93] G. J. Wang and H. F. Chen, Behavior of stochastic approximation algorithm in root set of regression function, Systems Science and Mathematical Sciences, 12:92–96, 1999.

[94] I. J. Wang, E. K. P. Chong and S. R. Kulkarni, Equivalent necessary and sufficient conditions on noise sequences for stochastic approximation algorithms, Adv. Appl. Probab., 28:784–801, 1996.

[95] C. Z. Wei, Multivariate adaptive stochastic approximation, Ann. Statist., 15:1115–1130, 1987.

[96] G. Xu, L. Tong, and T. Kailath, A least squares approach to blind identification, IEEE Trans. Signal Processing, SP-43:2982–2993, 1995.

[97] S. Yakowitz, A globally convergent stochastic approximation, SIAM J. Control & Optimization, 31:30–40, 1993.

[98] G. Yin, On extensions of Polyak’s averaging approach to stochastic approximation, Stochastics and Stochastics Reports, 36:245–264, 1991.

[99] G. Yin and Y. M. Zhu, On w.p.1 convergence of a parallel stochastic approximation algorithm, Probability in the Eng. and Infor. Sciences, 3:55–75, 1989.
[100] R. Zieliński, Global stochastic approximation: A review of results and some open problems, in F. Archetti and M. Cugiani (eds.), Numerical Techniques for Stochastic Systems, 379–386, North-Holland Publ. Co., 1980.
[101] J. H. Zhang and H. F. Chen, Convergence of algorithms used for principal component analysis, Science in China (Series E), 40:597–604, 1997.
[102] K. Zhou, J. C. Doyle, and K. Glover, Robust and Optimal Control, Prentice-Hall, New Jersey, 1996.
Index
Ackermann’s formula, 328
adapted process, 335
adapted sequence, 341
adaptive control, 290, 303, 327
adaptive filter, 288
adaptive filtering, 265, 273
adaptive regulation, 321
adaptive stabilization, 305, 307, 314, 327
adaptive stochastic approximation, 132, 149
adaptively stabilizable, 310
admissible controls, 302
algebraic Riccati equation, 131
ARMA process, 39
Arzelà-Ascoli theorem, 11, 24
asymptotic behavior, 194
asymptotic efficiency, 95, 130, 132, 149
asymptotic normality, 95, 113, 119, 127, 149, 210
asymptotic properties, 95, 166
asymptotically efficient, 135
asynchronous stochastic approximation, 219, 278, 288
averaging technique, 132, 149

balanced realization, 210, 214
balanced truncation, 214, 215
blind channel identification, 219, 220, 223
blind identification, 220
Borel σ-algebra, 330
Borel set, 330
Borel-Cantelli lemma, 341
Borel-Cantelli-Lévy lemma, 340

certainty-equivalence principle, 304, 306
Chebyshev inequality, 332
closure, 38
conditional distribution function, 332
conditional expectation, 332
conditional probability, 332
conditional Schwarz inequality, 343
constant interpolating function, 13
constrained optimization problem, 268
controllable, 307, 317, 319
controller form, 317–319
convergence, 28, 36, 41, 153, 223, 331, 341
convergence analysis, 6, 28, 95, 154
convergence rate, 95, 96, 101–103, 105, 149
convergence theorem for martingale difference sequences, 97, 128, 160, 170, 185, 196, 231, 249, 321, 339, 343
convergence theorem for nonnegative supermartingales, 7–9
convergence theorems for martingales, 335
convergent subsequence, 17, 18, 30, 36, 84, 86, 89, 178, 187, 237, 241, 244, 271, 275, 280, 282, 283, 285, 287, 288, 297, 312, 315, 322, 323
coprimeness, 306
covariance matrix, 130, 132
crossing, 18, 34, 188, 236, 312

degenerate case, 103, 149
density, 330
distribution function, 330
dominant stability, 59, 62
dominated convergence theorem, 331
dynamic stochastic approximation, 82, 93

equi-continuous, 15
ergodic, 265, 268, 270, 273, 274, 334
ergodicity, 333
event, 329
expectation, 330

Fatou lemma, 331
first exit time, 9, 339

general convergence theorems, 28
global minimum, 177
global minimizer, 174, 177, 180
global optimization, 172–174, 218
global optimization algorithm, 180, 194
global optimizer, 152
globally Lipschitz continuous, 292
Gronwall inequality, 298

Hölder inequality, 332
Hankel matrix, 222
Hankel norm approximation, 210, 214, 215
Hessian, 8, 195

identification, 290
integrable, 331
interpolating function, 11
invariant σ-algebra, 334

Jordan-Hahn decomposition, 55, 56, 295, 329

Kiefer-Wolfowitz (KW) algorithm, 151–153, 166, 173, 218
Kronecker lemma, 67, 144, 148, 345
Kronecker product, 248
KW algorithm with expanding truncations, 152, 154, 173–175

law of iterated logarithm, 333
Lebesgue measurable, 330
Lebesgue measure, 330
Lebesgue-Stieltjes integral, 331
linear interpolating function, 12
Lipschitz continuous, 23
Lipschitz-continuity, 160
local search, 172, 173
locally bounded, 17, 29, 96, 103, 133
locally Lipschitz continuous, 50, 155, 163, 177, 280
Lyapunov equation, 105
Lyapunov function, 6, 8, 10, 11, 17, 111, 226, 268, 313
Lyapunov inequality, 144, 332
Lyapunov theorem, 98

MA process, 171
Markov time, 6, 335, 336, 339
martingale, 335, 339, 340
martingale convergence theorem, 6, 180, 297
martingale difference sequence, 6, 16, 42, 97, 128, 134, 159, 164, 168, 179, 185, 195–197, 231, 250, 257, 294, 335
maximizer, 151
measurable, 17, 29, 96, 103, 133
measurable function, 330
measurable set, 329
measure, 329
minimizer, 151
mixing condition, 291
model reduction, 210
monotone convergence theorem, 331
multi-extreme, 163, 164
multi-root, 46, 57
mutually independent, 333, 341

necessity of noise condition, 45
non-additive noise, 49
nondegenerate case, 96, 149
nonnegative adapted sequence, 7
nonnegative supermartingale, 6, 7, 338
nonpositive submartingale, 338
normal distribution, 113, 114, 330
nowhere dense, 29, 35, 37, 41, 177, 181, 182, 280, 291

observation, 5, 17, 132, 321
observation noise, 5, 103, 133, 175, 195, 321
ODE method, 2, 10, 24, 327
one-sided randomized difference, 172
optimal control, 303
optimization, 151
optimization algorithm, 212
ordinary differential equation (ODE), 10

pattern classification, 219
perturbation analysis, 328
pole assignment, 316, 318, 327
principal component analysis, 238, 288
probabilistic method, 4
probability measure, 330
probability of random event, 330
probability space, 329, 330
Prohorov’s theorem, 22, 24

Radon-Nikodym theorem, 332
random noise, 10, 21
random search, 172
random variable, 330
randomized difference, 152–154
recursive blind identification, 246
relatively compact, 22
RM algorithm with expanding truncations, 28, 155, 309, 319
Robbins-Monro (RM) algorithm, 1, 5, 8, 11, 12, 17, 20, 45, 110, 310, 313
robustness, 67, 93

SA algorithm, 67
SA algorithm with expanding truncations, 25, 40, 95, 290
SA with randomly varying truncations, 93
Schwarz inequality, 142, 332
sign algorithms, 273, 288
signal processing, 219, 265
signed measure, 56, 295
Skorohod representation, 23
Skorohod topology, 21, 24
slowly decreasing step sizes, 132
spheres with expanding radii, 36
stability, 131
stable, 96, 97, 102, 131, 133
state-dependent, 42, 164
state-dependent noise, 29, 57
state-independent condition, 41, 42
stationary, 265, 268, 270, 273, 274, 333
step size, 5, 6, 17, 102, 132, 174
stochastic approximation (SA), 1, 223, 226, 246
stochastic approximation algorithm, 5, 307, 308
stochastic approximation method, 321
stochastic differential equation, 126
stochastic optimization, 211
stopping time, 335
strictly input passive, 322
structural error, 10, 157
structural inaccuracy, 21
submartingale, 335–337, 339
subspace, 41, 226
supermartingale, 335, 339
surjection, 63
system identification, 327

three series criterion, 342
time-varying, 44
trajectory-subsequence (TS) method, 2, 16, 21
truncated RM algorithm, 16, 17
TS method, 28, 327

uniformly bounded, 15
uniformly locally bounded, 41
up-crossing, 336, 338

weak convergence method, 21, 24
weighted least squares, 306
weighted sum of MDS, 344
Wiener process, 126
Nonconvex Optimization and Its Applications
22. H. Tuy: Convex Analysis and Global Optimization. 1998 ISBN 0-7923-4818-4

23. D. Cieslik: Steiner Minimal Trees. 1998 ISBN 0-7923-4983-0

24. N.Z. Shor: Nondifferentiable Optimization and Polynomial Problems. 1998 ISBN 0-7923-4997-0

25. R. Reemtsen and J.-J. Rückmann (eds.): Semi-Infinite Programming. 1998 ISBN 0-7923-5054-5

26. B. Ricceri and S. Simons (eds.): Minimax Theory and Applications. 1998 ISBN 0-7923-5064-2

27. J.-P. Crouzeix, J.-E. Martinez-Legaz and M. Volle (eds.): Generalized Convexity, Generalized Monotonicity: Recent Results. 1998 ISBN 0-7923-5088-X

28. J. Outrata, M. Kočvara and J. Zowe: Nonsmooth Approach to Optimization Problems with Equilibrium Constraints. 1998 ISBN 0-7923-5170-3

29. D. Motreanu and P.D. Panagiotopoulos: Minimax Theorems and Qualitative Properties of the Solutions of Hemivariational Inequalities. 1999 ISBN 0-7923-5456-7

30. J.F. Bard: Practical Bilevel Optimization. Algorithms and Applications. 1999 ISBN 0-7923-5458-3

31. H.D. Sherali and W.P. Adams: A Reformulation-Linearization Technique for Solving Discrete and Continuous Nonconvex Problems. 1999 ISBN 0-7923-5487-7

32. F. Forgó, J. Szép and F. Szidarovszky: Introduction to the Theory of Games. Concepts, Methods, Applications. 1999 ISBN 0-7923-5775-2

33. C.A. Floudas and P.M. Pardalos (eds.): Handbook of Test Problems in Local and Global Optimization. 1999 ISBN 0-7923-5801-5

34. T. Stoilov and K. Stoilova: Noniterative Coordination in Multilevel Systems. 1999 ISBN 0-7923-5879-1

35. J. Haslinger, M. Miettinen and P.D. Panagiotopoulos: Finite Element Method for Hemivariational Inequalities. Theory, Methods and Applications. 1999 ISBN 0-7923-5951-8

36. V. Korotkich: A Mathematical Structure of Emergent Computation. 1999 ISBN 0-7923-6010-9

37. C.A. Floudas: Deterministic Global Optimization: Theory, Methods and Applications. 2000 ISBN 0-7923-6014-1

38. F. Giannessi (ed.): Vector Variational Inequalities and Vector Equilibria. Mathematical Theories. 1999 ISBN 0-7923-6026-5

39. D.Y. Gao: Duality Principles in Nonconvex Systems. Theory, Methods and Applications. 2000 ISBN 0-7923-6145-3

40. C.A. Floudas and P.M. Pardalos (eds.): Optimization in Computational Chemistry and Molecular Biology. Local and Global Approaches. 2000 ISBN 0-7923-6155-5

41. G. Isac: Topological Methods in Complementarity Theory. 2000 ISBN 0-7923-6274-8

42. P.M. Pardalos (ed.): Approximation and Complexity in Numerical Optimization: Concrete and Discrete Problems. 2000 ISBN 0-7923-6275-6

43. V. Demyanov and A. Rubinov (eds.): Quasidifferentiability and Related Topics. 2000 ISBN 0-7923-6284-5
44. A. Rubinov: Abstract Convexity and Global Optimization. 2000 ISBN 0-7923-6323-X

45. R.G. Strongin and Y.D. Sergeyev: Global Optimization with Non-Convex Constraints. 2000 ISBN 0-7923-6490-2

46. X.-S. Zhang: Neural Networks in Optimization. 2000 ISBN 0-7923-6515-1

47. H. Jongen, P. Jonker and F. Twilt: Nonlinear Optimization in Finite Dimensions. Morse Theory, Chebyshev Approximation, Transversality, Flows, Parametric Aspects. 2000 ISBN 0-7923-6561-5

48. R. Horst, P.M. Pardalos and N.V. Thoai: Introduction to Global Optimization. 2nd Edition. 2000 ISBN 0-7923-6574-7

49. S.P. Uryasev (ed.): Probabilistic Constrained Optimization. Methodology and Applications. 2000 ISBN 0-7923-6644-1

50. D.Y. Gao, R.W. Ogden and G.E. Stavroulakis (eds.): Nonsmooth/Nonconvex Mechanics. Modeling, Analysis and Numerical Methods. 2001 ISBN 0-7923-6786-3

51. A. Atkinson, B. Bogacka and A. Zhigljavsky (eds.): Optimum Design 2000. 2001 ISBN 0-7923-6798-7

52. M. do Rosário Grossinho and S.A. Tersian: An Introduction to Minimax Theorems and Their Applications to Differential Equations. 2001 ISBN 0-7923-6832-0

53. A. Migdalas, P.M. Pardalos and P. Värbrand (eds.): From Local to Global Optimization. 2001 ISBN 0-7923-6883-5

54. N. Hadjisavvas and P.M. Pardalos (eds.): Advances in Convex Analysis and Global Optimization. Honoring the Memory of C. Caratheodory (1873-1950). 2001 ISBN 0-7923-6942-4

55. R.P. Gilbert, P.D. Panagiotopoulos† and P.M. Pardalos (eds.): From Convexity to Nonconvexity. 2001 ISBN 0-7923-7144-5

56. D.-Z. Du, P.M. Pardalos and W. Wu: Mathematical Theory of Optimization. 2001 ISBN 1-4020-0015-4

57. M.A. Goberna and M.A. López (eds.): Semi-Infinite Programming. Recent Advances. 2001 ISBN 1-4020-0032-4

58. F. Giannessi, A. Maugeri and P.M. Pardalos (eds.): Equilibrium Problems: Nonsmooth Optimization and Variational Inequality Models. 2001 ISBN 1-4020-0161-4

59. G. Dzemyda, V. Šaltenis and A. Žilinskas (eds.): Stochastic and Global Optimization. 2002 ISBN 1-4020-0484-2

60. D. Klatte and B. Kummer: Nonsmooth Equations in Optimization. Regularity, Calculus, Methods and Applications. 2002 ISBN 1-4020-0550-4

61. S. Dempe: Foundations of Bilevel Programming. 2002 ISBN 1-4020-0631-4

62. P.M. Pardalos and H.E. Romeijn (eds.): Handbook of Global Optimization, Volume 2. 2002 ISBN 1-4020-0632-2

63. G. Isac, V.A. Bulavsky and V.V. Kalashnikov: Complementarity, Equilibrium, Efficiency and Economics. 2002 ISBN 1-4020-0688-8
KLUWER ACADEMIC PUBLISHERS – DORDRECHT / BOSTON / LONDON