Stochastic Approximation and Its Applications
Nonconvex Optimization and Its Applications
Volume 64
Managing Editor:
Panos Pardalos
Advisory Board:
J. R. Birge, Northwestern University, U.S.A.
Ding-Zhu Du, University of Minnesota, U.S.A.
C. A. Floudas, Princeton University, U.S.A.
J. Mockus, Lithuanian Academy of Sciences, Lithuania
H. D. Sherali, Virginia Polytechnic Institute and State University, U.S.A.
G. Stavroulakis, Technical University Braunschweig, Germany
The titles published in this series are listed at the end of this volume.
Stochastic Approximation and Its Applications
by
Han-Fu Chen
Institute of Systems Science, Academy of Mathematics and System Science, Chinese Academy of Sciences, Beijing, P.R. China
KLUWER ACADEMIC PUBLISHERS
New York, Boston, Dordrecht, London, Moscow
eBook ISBN: 0-306-48166-9
Print ISBN: 1-4020-0806-6
©2003 Kluwer Academic Publishers
New York, Boston, Dordrecht, London, Moscow
Print ©2002 Kluwer Academic Publishers
All rights reserved
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher.
Created in the United States of America
Visit Kluwer Online at: http://kluweronline.com
and Kluwer's eBookstore at: http://ebooks.kluweronline.com
Contents
Preface
Acknowledgments

1. ROBBINS-MONRO ALGORITHM
   1.1 Finding Zeros of a Function
   1.2 Probabilistic Method
   1.3 ODE Method
   1.4 Truncated RM Algorithm and TS Method
   1.5 Weak Convergence Method
   1.6 Notes and References

2. STOCHASTIC APPROXIMATION ALGORITHMS WITH EXPANDING TRUNCATIONS
   2.1 Motivation
   2.2 General Convergence Theorems by TS Method
   2.3 Convergence Under State-Independent Conditions
   2.4 Necessity of Noise Condition
   2.5 Non-Additive Noise
   2.6 Connection Between Trajectory Convergence and Property of Limit Points
   2.7 Robustness of Stochastic Approximation Algorithms
   2.8 Dynamic Stochastic Approximation
   2.9 Notes and References

3. ASYMPTOTIC PROPERTIES OF STOCHASTIC APPROXIMATION ALGORITHMS
   3.1 Convergence Rate: Nondegenerate Case
   3.2 Convergence Rate: Degenerate Case
   3.3 Asymptotic Normality
   3.4 Asymptotic Efficiency
   3.5 Notes and References

4. OPTIMIZATION BY STOCHASTIC APPROXIMATION
   4.1 Kiefer-Wolfowitz Algorithm with Randomized Differences
   4.2 Asymptotic Properties of KW Algorithm
   4.3 Global Optimization
   4.4 Asymptotic Behavior of Global Optimization Algorithm
   4.5 Application to Model Reduction
   4.6 Notes and References

5. APPLICATION TO SIGNAL PROCESSING
   5.1 Recursive Blind Identification
   5.2 Principal Component Analysis
   5.3 Recursive Blind Identification by PCA
   5.4 Constrained Adaptive Filtering
   5.5 Adaptive Filtering by Sign Algorithms
   5.6 Asynchronous Stochastic Approximation
   5.7 Notes and References

6. APPLICATION TO SYSTEMS AND CONTROL
   6.1 Application to Identification and Adaptive Control
   6.2 Application to Adaptive Stabilization
   6.3 Application to Pole Assignment for Systems with Unknown Coefficients
   6.4 Application to Adaptive Regulation
   6.5 Notes and References

Appendices
   A.1 Probability Space
   A.2 Random Variable and Distribution Function
   A.3 Expectation
   A.4 Convergence Theorems and Inequalities
   A.5 Conditional Expectation
   A.6 Independence
   A.7 Ergodicity
   B.1 Convergence Theorems for Martingale
   B.2 Convergence Theorems for MDS I
   B.3 Borel-Cantelli-Lévy Lemma
   B.4 Convergence Criteria for Adapted Sequences
   B.5 Convergence Theorems for MDS II
   B.6 Weighted Sum of MDS

References

Index
Preface
Estimating unknown parameters based on observation data containing information about the parameters is ubiquitous in diverse areas of both theory and application. For example, in system identification the unknown system coefficients are estimated on the basis of input-output data of the control system; in adaptive control systems the adaptive control gain should be defined based on observation data in such a way that the gain asymptotically tends to the optimal one; in blind channel identification the channel coefficients are estimated using the output data obtained at the receiver; in signal processing the optimal weighting matrix is estimated on the basis of observations; in pattern classification the parameters specifying the partition hyperplane are searched for by learning; and more examples may be added to this list.
All these parameter estimation problems can be transformed into a root-seeking problem for an unknown function. To see this, let y_k denote the observation at time k, i.e., the information available about the unknown parameters at time k. It can be assumed that the parameter under estimation, denoted by x^0, is a root of some unknown function f(·):

    f(x^0) = 0.

This is not a restriction, because, for example, f(x) = x^0 − x may serve as such a function. Let x_k be the estimate for x^0 at time k. Then the available information at time k+1 can formally be written as

    y_{k+1} = f(x_k) + ε_{k+1},

where

    ε_{k+1} = y_{k+1} − f(x_k).

Therefore, by considering y_{k+1} as an observation on f(·) at x_k with observation error ε_{k+1}, the problem has been reduced to seeking the root x^0 of f(·) based on the observations {y_k}.

It is clear that for each problem the specification of f(·) is of crucial importance. The parameter estimation problem can be solved only if
f(·) is appropriately selected so that the observation error meets the requirements figuring in the convergence theorems.
If f(·) and its gradient could be observed without error at any desired values, then numerical methods such as the Newton-Raphson method, among others, could be applied to solving the problem. However, methods of this kind cannot be used here because, in addition to the obvious problem concerning the existence and availability of the gradient, the observations are corrupted by errors which may contain not only a purely random component but also the structural error caused by inadequacy of the selected f(·).
Aiming at solving the stated problem, Robbins and Monro proposed the following recursive algorithm

    x_{k+1} = x_k + a_k y_{k+1}

to approximate the sought-for root x^0, where a_k > 0 is the step size. This algorithm is now called the Robbins-Monro (RM) algorithm. Following this pioneering work on stochastic approximation, there have been a large number of applications to practical problems and much research on theoretical issues.
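The recursion is simple enough to state in a few lines of code. The following sketch is illustrative only and not from the book: the target function f(x) = 2 − x, the Gaussian noise, and the harmonic step sizes a_k = 1/k are choices made here for the demonstration.

```python
import random

def robbins_monro(observe, x0, n_iter):
    """Run the RM recursion x_{k+1} = x_k + a_k * y_{k+1} with a_k = 1/k."""
    x = x0
    for k in range(1, n_iter + 1):
        y = observe(x)           # noisy observation of f at the current estimate
        x = x + (1.0 / k) * y    # gains satisfy sum a_k = inf, sum a_k^2 < inf
    return x

rng = random.Random(0)
# f(x) = 2 - x has the root 2; each observation is corrupted by N(0, 1) noise.
estimate = robbins_monro(lambda x: (2.0 - x) + rng.gauss(0.0, 1.0),
                         x0=0.0, n_iter=20000)
```

Although the noise never shrinks, the decreasing gains average it out and the estimate settles near the root.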
At the beginning, the probabilistic method was the main tool in convergence analysis for stochastic approximation algorithms, and rather restrictive conditions were imposed on both f(·) and {ε_k}. For example, it is required that the growth rate of f(x) be not faster than linear as ||x|| tends to infinity and that {ε_k} be a martingale difference sequence [78]. Though the linear growth rate condition is restrictive, as shown by simulation it can hardly be simply removed without violating convergence of RM algorithms.
To weaken the noise conditions guaranteeing convergence of the algorithm, the ODE (ordinary differential equation) method was introduced in [72, 73] and further developed in [65]. Since the conditions on the noise required by the ODE method may be satisfied by a large class of {ε_k} including both random and structural errors, the ODE method has been widely applied for convergence analysis in different areas. However, in this approach one has to assume a priori that the sequence of estimates {x_k} is bounded. It is hard to say that the boundedness assumption is more desirable than a growth rate restriction on f(·).
The stochastic approximation algorithm with expanding truncations was introduced in [27], and the analysis method was then improved in [14]. In fact, this is an RM algorithm truncated at expanding bounds, and for its convergence the growth rate restriction on f(·) is not required. The convergence analysis method for the proposed algorithm is called the trajectory-subsequence (TS) method, because the analysis
is carried out at trajectories where the noise condition is satisfied and, in contrast to the ODE method, the noise condition need not be verified on the whole sequence {x_k} but only along convergent subsequences {x_{n_k}}. This makes a great difference when dealing with state-dependent noise, because a convergent subsequence {x_{n_k}} is always bounded, while the boundedness of the whole sequence {x_k} is not guaranteed before establishing its convergence. As shown in Chapters 4, 5, and 6, for most parameter estimation problems, after transforming them to a root-seeking problem, the structural errors are unavoidable, and they are state-dependent.
The expanding truncation technique equipped with the TS method appears to be a powerful tool in dealing with various parameter estimation problems: it not only has succeeded in essentially weakening the conditions for convergence of the general stochastic approximation algorithm but also has made it possible for stochastic approximation to be successfully applied in diverse areas. However, there is a lack of a reference that systematically describes the theoretical part of the method and concretely shows how to apply the method to problems coming from different areas. To fill this gap is the purpose of the book.
The book summarizes results on the topic mostly distributed over journal papers and partly contained in unpublished material. The book is written in a systematic way: it starts with a general introduction to stochastic approximation, then describes the basic method used in the book, proves the general convergence theorems, and demonstrates various applications of the general theory.
In Chapter 1 the problem of stochastic approximation is stated, and the basic methods for convergence analysis, such as the probabilistic method, the ODE method, the TS method, and the weak convergence method, are introduced.
Chapter 2 presents the theoretical foundation of the algorithm with expanding truncations: the basic convergence theorems are proved by the TS method; various types of noise are discussed; the necessity of the imposed noise condition is shown; the connection between stability of the equilibrium and convergence of the algorithm is discussed; the robustness of stochastic approximation algorithms is considered when the commonly used conditions deviate from exact satisfaction; and moving root tracking is also investigated. The basic convergence theorems are presented in Section 2.2, and their proof is elementary and purely deterministic.
Chapter 3 describes asymptotic properties of the algorithms: convergence rates for both cases, where the gradient of f(·) is degenerate or not; asymptotic normality of the estimates; and asymptotic efficiency by the averaging method.
Starting from Chapter 4, the general theory developed so far is applied to different fields. Chapter 4 deals with optimization by stochastic approximation methods. Convergence and convergence rates of the Kiefer-Wolfowitz (KW) algorithm with expanding truncations and randomized differences are established. A global optimization method consisting of a combination of the KW algorithm with search methods is defined, and its a.s. convergence as well as its asymptotic behavior are established. Finally, the global optimization method is applied to solving the model reduction problem.
In Chapter 5 the general theory is applied to problems arising from signal processing. Applying the stochastic approximation method to blind channel identification leads to a recursive algorithm that estimates the channel coefficients and continuously improves the estimates while receiving new signals, in contrast to the existing “block” algorithms. Applying the TS method to principal component analysis results in improved conditions for convergence. Stochastic approximation algorithms with expanding truncations combined with the TS method are also applied to adaptive filters with and without constraints. As a result, the conditions required for convergence have been considerably improved in comparison with existing results. Finally, the expanding truncation technique and the TS method are applied to asynchronous stochastic approximation.
In the last chapter, the general theory is applied to problems arising from systems and control. The ideal parameter for operation is identified for stochastic systems by the methods developed in this book. Then the obtained results are applied to the adaptive quadratic control problem. Adaptive regulation for a nonlinear nonparametric system and learning pole assignment are also solved by the stochastic approximation method.
The book is self-contained in the sense that there are only a few points using knowledge for which we refer to other sources, and these points can be ignored when reading the main body of the book. The basic mathematical tools used in the book are calculus and linear algebra, based on which one will have no difficulty in reading the fundamental convergence Theorems 2.2.1 and 2.2.2 and their applications described in the subsequent chapters. To understand the other material, probability concepts, especially the convergence theorems for martingale difference sequences, are needed. The necessary concepts of probability theory are given in Appendix A. Some facts from probability that are used at a few specific points are listed in Appendix A without proof, because omitting the corresponding parts still leaves the rest of the book readable. However, the
proof of convergence theorems for martingales and martingale differencesequences is provided in detail in Appendix B.
The book is written for students, engineers, and researchers working in the areas of systems and control, communication and signal processing, optimization and operations research, and mathematical statistics.
HAN-FU CHEN
Acknowledgments
The support of the National Key Project of China and the National Natural Science Foundation of China is gratefully acknowledged. The author would like to express his gratitude to Dr. Haitao Fang for his helpful suggestions and useful discussions. The author would also like to thank Ms. Jinling Chang for her skilled typing and his wife Shujun Wang for her constant support.
Chapter 1

ROBBINS-MONRO ALGORITHM
Optimization is ubiquitous in various research and application fields. Quite often an optimization problem can be reduced to finding zeros (roots) of an unknown function f(·), which can be observed, but the observation may be corrupted by errors. This is the topic of stochastic approximation (SA). The error source may be observation noise, but it may also come from structural inaccuracy of the observed function. For example, one wants to find zeros of f(x), but one actually observes functions f_k(·) which are different from f(·). Let us denote by y_{k+1} the observation at time k+1 and by ε_{k+1} the observation noise:

    y_{k+1} = f(x_k) + ε_{k+1},    ε_{k+1} = e_{k+1} + f_{k+1}(x_k) − f(x_k).

Here, f_{k+1}(x_k) − f(x_k) is the additional error caused by the structural inaccuracy. It is worth noting that the structural error normally depends on x_k, and it is hard to require it to have a certain probabilistic property such as independence, stationarity, or the martingale property. We call this kind of noise state-dependent noise.
The basic recursive algorithm for finding roots of an unknown functionon the basis of noisy observations is the Robbins-Monro (RM) algorithm,which is characterized by its simplicity in computation. This chapterserves as an introduction to SA, describing various methods for analyzingconvergence of the RM algorithm.
In Section 1.1 the motivation for the RM algorithm is explained, and its limitation is pointed out by an example. In Section 1.2 the classical approach to analyzing convergence of the RM algorithm is presented, which is based on probabilistic assumptions on the observation noise. To relax the restrictions made on the noise, a convergence analysis method connecting convergence of the RM algorithm with stability of an ordinary differential
equation (ODE) was introduced in the nineteen seventies. The ODE method is demonstrated in Section 1.3. In Section 1.4 the convergence analysis is carried out at a sample path by considering convergent subsequences. For this reason we call this method the Trajectory-Subsequence (TS) method; it is the basic tool used in the subsequent chapters.
In this book our main concern is the path-wise convergence of the algorithm. However, there is another approach to convergence analysis, called the weak convergence method, which is briefly introduced in Section 1.5. Notes and references are given in the last section.
This chapter introduces the main methods used in the literature for convergence analysis, restricted to the single-root case. Extensions to more general cases in various respects are given in later chapters.
1.1. Finding Zeros of a Function

Many theoretical and practical problems in diverse areas can be reduced to finding zeros of a function. To see this it suffices to notice that solving many problems finally consists in optimizing some function L(x), i.e., finding its minimum (or maximum). If L(·) is differentiable, then the optimization problem reduces to finding the roots of f(x), where f(·) is the derivative of L(·).

In the case where the function or its derivatives can be observed without errors, there are many numerical methods for solving the problem. For example, by the gradient (Newton-Raphson) method the estimate x_k for the root of f(·) is recursively generated by the following algorithm

    x_{k+1} = x_k − [f′(x_k)]^{−1} f(x_k),    (1.1.1)

where f′(·) denotes the derivative of f(·). This kind of problem belongs to the topics of optimization theory, which considers general cases where L(·) may be nonconvex, nonsmooth, and with constraints.

In contrast to optimization theory, SA is devoted to finding zeros of an unknown function f(·) which can be observed, but the observations are corrupted by errors.

Since f′(·) is not exactly known and even may not exist, (1.1.1)-like algorithms are no longer applicable. Consider the following simple example. Let f(·) be a linear function

    f(x) = c(x − x^0),    c > 0,

with root x^0. If the derivative of f(·) is available, i.e., if we know c, and if f(x) can precisely be observed, then according to (1.1.1)

    x_{k+1} = x_k − c^{−1} f(x_k) = x^0.
This means that the gradient algorithm leads to the zero x^0 of f(·) in one step.
Assume now that the derivative of f(·) is unavailable but f(x) can exactly be observed. Let us replace [f′(x_k)]^{−1} by a_k > 0 in (1.1.1). Then we derive

    x_{k+1} = x_k − a_k f(x_k),    (1.1.2)

or

    x_{k+1} − x^0 = (1 − c a_k)(x_k − x^0).    (1.1.3)

This is a linear difference equation, which can inductively be solved, and the solution of (1.1.3) can be expressed as follows:

    x_{k+1} − x^0 = ∏_{i=1}^{k} (1 − c a_i) (x_1 − x^0).    (1.1.4)

Clearly, if a_k decreasingly tends to zero with Σ_k a_k = ∞, then ∏_{i=1}^{k}(1 − c a_i) → 0, and x_k tends to the root x^0 of f(·) as k → ∞ for any initial value x_1. This is an attractive property: although the gradient of f(·) is unavailable, we can still approach the sought-for root if the inverse of the gradient is replaced by a sequence of positive real numbers decreasingly tending to zero.
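This deterministic computation can be checked numerically. In the sketch below (an illustration with parameters chosen here, not taken from the book) the slope c of the linear function is hidden from the algorithm, which uses the gains a_k = 1/k in place of the inverse derivative.

```python
def decreasing_gain_iteration(c, root, x1, n_iter):
    """Iterate (1.1.2): x_{k+1} = x_k - a_k * f(x_k), f(x) = c*(x - root), a_k = 1/k."""
    x = x1
    for k in range(1, n_iter + 1):
        x = x - (1.0 / k) * c * (x - root)
    return x

# The product prod(1 - c*a_i) tends to zero because sum a_i diverges,
# so the iterate approaches the root from any starting point.
x_final = decreasing_gain_iteration(c=0.5, root=5.0, x1=-10.0, n_iter=200000)
```

With c = 0.5 the error contracts like k^{-1/2}, so convergence is slow but sure; knowing c would of course give the root in one Newton step.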
Let us consider the case where f(x) is observed with errors:

    y_{k+1} = f(x_k) + ε_{k+1},

where y_{k+1} denotes the observation at time k+1, ε_{k+1} the corresponding observation error, and x_k the estimate for the root of f(·) at time k. It is natural to ask how x_k will behave if the exact value of f(x_k) in (1.1.2) is replaced by its error-corrupted observation y_{k+1}, i.e., if x_k is recursively derived according to the following algorithm:

    x_{k+1} = x_k − a_k y_{k+1}.    (1.1.5)

In our example, y_{k+1} = c(x_k − x^0) + ε_{k+1}, and (1.1.5) turns out to be

    x_{k+1} − x^0 = (1 − c a_k)(x_k − x^0) − a_k ε_{k+1}.
Similar to (1.1.3), the solution of this difference equation is

    x_{k+1} − x^0 = ∏_{i=1}^{k}(1 − c a_i)(x_1 − x^0) − Σ_{i=1}^{k} a_i ∏_{j=i+1}^{k}(1 − c a_j) ε_{i+1}.    (1.1.6)

Therefore, x_{k+1} converges to the root x^0 of f(·) if the noise term in (1.1.6) tends to zero as k → ∞. This means that the replacement of the inverse gradient by a sequence of positive numbers decreasingly tending to zero still works even in the case of error-corrupted observations, provided the observation errors can be averaged out. It is worth noting that in lieu of (1.1.5) we have to take the positive sign before y_{k+1}, i.e., to consider

    x_{k+1} = x_k + a_k y_{k+1},    (1.1.7)

if c < 0, or more generally, if f(x) is decreasing as x increases.
This simple example demonstrates the basic features of the algorithm (1.1.5) or (1.1.7): 1) the algorithm may converge to a root of f(·); 2) the limit of the algorithm, if it exists, should not depend on the initial value; 3) the convergence rate is determined by how fast the observation errors are averaged out.

From (1.1.6) it is seen that for linear functions the convergence rate is determined by how fast the weighted noise sum

    Σ_{i=1}^{k} a_i ∏_{j=i+1}^{k}(1 − c a_j) ε_{i+1}

tends to zero. In the case where {ε_k} is a sequence of independent and identically distributed random variables with zero mean and bounded variance, this term is, e.g. for a_k = 1/k, of order

    O( (log log k / k)^{1/2} )  a.s.

by the law of the iterated logarithm. This means that the convergence rate for algorithms (1.1.5) or (1.1.7) with error-corrupted observations should not be faster than O(√(log log k / k)).
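A small simulation makes the averaging-rate claim concrete. The sketch below (all parameters are choices made here) runs the scalar recursion (1.1.7) for the decreasing function f(x) = 1 − x with i.i.d. N(0, 1) errors and a_k = 1/k; in this case x_{n+1} is exactly the sample mean of the observations, so the error shrinks at the O(1/√n) rate, up to the log log factor that a simulation of this size cannot resolve.

```python
import random

def rm_abs_error(n_steps, rng):
    """Absolute error of the RM estimate of the root of f(x) = 1 - x after n_steps."""
    x = 0.0
    for k in range(1, n_steps + 1):
        y = (1.0 - x) + rng.gauss(0.0, 1.0)   # noisy observation
        x += y / k                            # a_k = 1/k, recursion (1.1.7)
    return abs(x - 1.0)

rng = random.Random(123)
# Average the absolute error over 50 replications at two horizons.
err_short = sum(rm_abs_error(100, rng) for _ in range(50)) / 50    # ~ 1/sqrt(100)
err_long = sum(rm_abs_error(10000, rng) for _ in range(50)) / 50   # ~ 1/sqrt(10000)
```

Increasing the horizon by a factor of 100 shrinks the typical error by roughly a factor of 10, as the square-root rate predicts.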
1.2. Probabilistic Method

We have just shown how to find the root of an unknown linear function based on noisy observations. We now formulate the general problem.
Let f(·): R^l → R^l be an unknown function with unknown root x^0: f(x^0) = 0. Assume f(·) can be observed at each point with noise:

    y_{k+1} = f(x_k) + ε_{k+1},    (1.2.1)

where y_{k+1} is the observation at time k+1, ε_{k+1} is the observation noise, and x_k is the estimate for x^0 at time k.

Stochastic approximation algorithms recursively generate x_k to approximate x^0 based on the past observations. In the pioneering work of this area Robbins and Monro proposed the following algorithm

    x_{k+1} = x_k + a_k y_{k+1}    (1.2.2)

to estimate x^0, where the step size a_k > 0 is decreasing and satisfies the following conditions: Σ_{k=1}^∞ a_k = ∞ and Σ_{k=1}^∞ a_k² < ∞. They proved convergence of x_k to x^0 under appropriate conditions on f(·) and the noise.

We now explain the meaning of the conditions required for the step size a_k. The condition a_k → 0, implied by Σ_k a_k² < ∞, aims at reducing the effect of the observation noises. To see this, consider the case where x_k is close to x^0 and f(x_k) is close to zero.

Throughout the book, ||x|| always means the Euclidean norm of a vector x, and ||A|| denotes the square root of the maximum eigenvalue of the matrix A^T A, where A^T means the transpose of the matrix A.

By (1.2.2), x_{k+1} − x_k ≈ a_k ε_{k+1}. Even in the Gaussian noise case, a_k ε_{k+1} may be large if a_k has a positive lower bound. Therefore, in order to have the desired consistency, i.e., x_k → x^0, it is necessary to use decreasing gains a_k such that a_k → 0. On the other hand, consistency can also not be achieved if a_k decreases too fast as k → ∞. To see this, let Σ_{k=1}^∞ a_k < ∞. Then even in the noise-free case, i.e., ε_k ≡ 0, from (1.2.2) we have

    ||x_{k+1} − x_1|| ≤ Σ_{i=1}^∞ a_i sup_x ||f(x)|| < ∞,

if f(·) is a bounded function. Therefore, in this case x_k cannot reach the true root if the initial value x_1 is far from it, and hence x_k will never converge to x^0.

The algorithm (1.2.2) is now called the Robbins-Monro (RM) algorithm.
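Both step-size requirements can be seen in a toy run. In the sketch below (the bounded function f(x) = −tanh(x), the gains, and the starting point are all choices made here), summable gains exhaust their total travel budget before reaching the root, while gains with a divergent sum do reach it.

```python
import math

def rm_noise_free(step, x0, n_iter):
    """Noise-free recursion x_{k+1} = x_k + a_k * f(x_k) with f(x) = -tanh(x), root 0."""
    x = x0
    for k in range(1, n_iter + 1):
        x = x + step(k) * (-math.tanh(x))
    return x

# a_k = 2^{-k}: total movement is at most sum_k a_k * sup|f| = 1,
# so from x0 = 10 the iterate can never come near the root at 0.
stuck = rm_noise_free(lambda k: 2.0 ** (-k), x0=10.0, n_iter=100)
# a_k = 1/k: sum a_k = infinity, so the iterate does reach the root.
reached = rm_noise_free(lambda k: 1.0 / k, x0=10.0, n_iter=100000)
```

The first run stalls above 9 no matter how many iterations are added; the second creeps toward zero at the pace of the harmonic sum.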
The classical approach to convergence analysis of SA algorithms is based on probabilistic analysis of trajectories. We now present a typical convergence theorem obtained by this approach. Related concepts and results from probability theory are given in Appendices A and B.

In fact, we will use the martingale convergence theorem to prove the path-wise convergence of x_k, i.e., to show x_k → x^0 a.s. For this, the following set of conditions will be used.
A1.2.1 The step size a_k satisfies: a_k > 0, Σ_{k=1}^∞ a_k = ∞, and Σ_{k=1}^∞ a_k² < ∞.
A1.2.2 There exists a twice continuously differentiable Lyapunov function V(·) satisfying the following conditions:
i) its second derivative is bounded;
ii) V(x^0) = 0, V(x) > 0 for x ≠ x^0, and V(x) → ∞ as ||x|| → ∞;
iii) for any ε > 0 there is a β(ε) > 0 such that

    sup_{||x − x^0|| ≥ ε} ∇V^T(x) f(x) ≤ −β(ε) < 0,

where ∇V(·) denotes the gradient of V(·).
A1.2.3 The observation noise {ε_k, F_k} is a martingale difference sequence with

    E(ε_{k+1} | F_k) = 0,    (1.2.3)

where {F_k} is a family of nondecreasing σ-algebras such that x_k is F_k-measurable.
A1.2.4 The function f(·) and the conditional second moment of the observation noise have the following upper bound:

    ||f(x)||² + E(||ε_{k+1}||² | F_k) ≤ c₀(1 + V(x)),    (1.2.4)

where c₀ is a positive constant.
Prior to formulating the theorem we need some auxiliary results. Let {z_k, F_k} be an adapted sequence, i.e., z_k is F_k-measurable. Define the first exit time of {z_k} from a Borel set B:

    τ = min{k : z_k ∉ B}.

It is clear that {τ ≤ k} ∈ F_k, i.e., τ is a Markov time.
Lemma 1.2.1 Assume τ is a Markov time and {z_k, F_k} is a nonnegative supermartingale, i.e.,

    z_k ≥ 0,    E(z_{k+1} | F_k) ≤ z_k  a.s.

Then the stopped sequence {z_{min(k, τ)}, F_k} is also a nonnegative supermartingale.
The proof is given in Appendix B, Lemma B-2-1. The following lemma, concerning convergence of an adapted sequence, will be used in the proof of convergence of the RM algorithm, but the lemma is also of interest by itself.
Lemma 1.2.2 Let {u_k, F_k} and {v_k, F_k} be two nonnegative adapted sequences.

i) If E(u_{k+1} | F_k) ≤ u_k + v_k and Σ_{k=1}^∞ E v_k < ∞, then u_k converges a.s. to a finite limit.

ii) If E(u_{k+1} | F_k) ≤ u_k − v_k, then u_k converges a.s. and Σ_{k=1}^∞ v_k < ∞ a.s.

Proof. For proving i), set

    γ_k = E( Σ_{i=k}^∞ v_i | F_k ),    w_k = u_k + γ_k.    (1.2.5)

Then we have

    E(w_{k+1} | F_k) = E(u_{k+1} | F_k) + E(γ_{k+1} | F_k) ≤ u_k + v_k + E( Σ_{i=k+1}^∞ v_i | F_k ) = w_k.

By the convergence theorem for nonnegative supermartingales, w_k converges a.s. as k → ∞. Since E γ_k = Σ_{i=k}^∞ E v_i → 0, by the same theorem γ_k also converges a.s. as k → ∞, and its limit must equal zero. Noticing that both w_k and γ_k converge a.s. as k → ∞, we conclude that u_k = w_k − γ_k is also convergent a.s. as k → ∞. Consequently, from (1.2.5) it follows that u_k converges a.s. to a finite limit.

For proving ii), set

    w_k = u_k + Σ_{i=1}^{k−1} v_i.

Since Σ_{i=1}^{k−1} v_i is F_k-measurable and nondecreasing in k, taking conditional expectation leads to

    E(w_{k+1} | F_k) = E(u_{k+1} | F_k) + Σ_{i=1}^{k} v_i ≤ u_k − v_k + Σ_{i=1}^{k} v_i = w_k.

Again, by the convergence theorem for nonnegative supermartingales, w_k converges a.s. as k → ∞. Since u_k ≥ 0 and the partial sums Σ_{i=1}^{k−1} v_i are nondecreasing, it directly follows that Σ_{k=1}^∞ v_k < ∞ a.s. and u_k converges a.s.
Theorem 1.2.1 Assume Conditions A1.2.1–A1.2.4 hold. Then for any initial value x_1, the estimate x_k given by the RM algorithm (1.2.2) converges to the root x^0 of f(·) a.s. as k → ∞.
Proof. Let V(·) be the Lyapunov function given in A1.2.2. Expanding V(x_{k+1}) to a Taylor series, we obtain

    V(x_{k+1}) = V(x_k) + a_k ∇V^T(x_k) y_{k+1} + (a_k²/2) y_{k+1}^T ∇²V(ξ_k) y_{k+1}
               ≤ V(x_k) + a_k ∇V^T(x_k) y_{k+1} + c a_k² ||y_{k+1}||²,    (1.2.6)

where ∇V and ∇²V denote the gradient and Hessian of V(·), respectively, ξ_k is a vector with components located in-between the corresponding components of x_k and x_{k+1}, and c denotes a constant such that ||∇²V(x)|| ≤ 2c (by A1.2.2 i)).

Noticing that x_k is F_k-measurable and taking the conditional expectation of (1.2.6), by (1.2.3) and (1.2.4) we derive

    E(V(x_{k+1}) | F_k) ≤ V(x_k)(1 + c₁ a_k²) + a_k ∇V^T(x_k) f(x_k) + c₁ a_k²    (1.2.7)

for some constant c₁ > 0. Since Σ_k a_k² < ∞ by A1.2.1, we have

    ∏_{k=1}^{∞} (1 + c₁ a_k²) < ∞.    (1.2.8)

Denoting

    w_k = [ V(x_k) + c₁ Σ_{i=k}^∞ a_i² ] ∏_{i=1}^{k−1} (1 + c₁ a_i²)^{−1},

and noticing ∇V^T(x) f(x) ≤ 0 by A1.2.2 iii), from (1.2.7) and (1.2.8) it follows that

    E(w_{k+1} | F_k) ≤ w_k + a_k ∏_{i=1}^{k} (1 + c₁ a_i²)^{−1} ∇V^T(x_k) f(x_k) ≤ w_k.    (1.2.9)

Therefore, {w_k, F_k} is a nonnegative supermartingale and converges a.s. by the convergence theorem for nonnegative supermartingales. Since Σ_{i=k}^∞ a_i² → 0 and the product in (1.2.8) is finite, V(x_k) also converges a.s.

For any ε > 0, denote

    U_ε = {x : ||x − x^0|| < ε}.

Since ∇V^T(x) f(x) is nonpositive, from (1.2.9) and Lemma 1.2.2 ii) it follows that

    Σ_{k=1}^∞ a_k |∇V^T(x_k) f(x_k)| < ∞  a.s.

If, with positive probability, x_k remained outside U_ε for all k larger than some k₀, then by A1.2.2 iii) we would have

    Σ_{k>k₀} a_k |∇V^T(x_k) f(x_k)| ≥ β(ε) Σ_{k>k₀} a_k = ∞,

a contradiction to A1.2.1. Therefore, with the possible exception of a set of probability zero, the trajectory of {x_k} must enter U_ε infinitely often.

Consequently, there is a subsequence {x_{k_j}} such that ||x_{k_j} − x^0|| < ε. By the arbitrariness of ε, we then conclude that there is a subsequence, denoted still by {x_{k_j}}, such that x_{k_j} → x^0. Hence V(x_{k_j}) → 0. However, we have shown that V(x_k) converges a.s. Therefore, V(x_k) → 0 a.s. By A1.2.2 ii) we then conclude that x_k → x^0 a.s.
Remark 1.2.1 If Condition A1.2.2 iii) changes to

    inf_{||x − x^0|| ≥ ε} ∇V^T(x) f(x) ≥ β(ε) > 0,

then the algorithm (1.2.2) should accordingly change to

    x_{k+1} = x_k − a_k y_{k+1}.
We now explain the conditions required in Theorem 1.2.1. As noted in Section 1.1, the step size should satisfy Σ_k a_k = ∞, but the condition Σ_k a_k² < ∞ may be weakened to a_k → 0.

Condition A1.2.2 requires the existence of a Lyapunov function V(·). Conditions of this kind normally have to be imposed for convergence of the algorithms, but the analytic properties required of V(·) may be weakened. The noise condition A1.2.3 is rather restrictive. As will be shown in the subsequent chapters, ε_k may be composed not only of random noise but also of structural errors, which hardly have nice probabilistic properties such as being a martingale difference sequence, stationarity, bounded variances, etc.

As in many cases one can take ||x − x^0||² to serve as V(x), it then follows from (1.2.4) that the growth rate of ||f(x)|| as ||x|| → ∞ should not be faster than linear. This is a major restriction on applying Theorem 1.2.1. However, if we a priori assume that {x_k} generated by the algorithm (1.2.2) is bounded, then {f(x_k)} is bounded provided f(·) is locally bounded, and then the linear growth is not a restriction for {x_k, k = 1, 2, ...}.
1.3. ODE Method

As mentioned in Section 1.2, the classical probabilistic approach to analyzing SA algorithms requires rather restrictive conditions on the observation noise. In the nineteen seventies a so-called ordinary differential equation (ODE) method was proposed for analyzing convergence of SA algorithms. We explain the idea of the method. The estimates {x_k} generated by the RM algorithm are interpolated to a continuous function with interpolating length equal to the step size used in the algorithm. The tail part of the interpolating function is shown to satisfy an ordinary differential equation ẋ = f(x). The sought-for root x^0 is the equilibrium of the ODE. By stability of this equation, or by assuming existence of a Lyapunov function, it is proved that the interpolating function tends to x^0. From this, it can be deduced that x_k → x^0.

For demonstrating the ODE method we need two facts from analysis,
which are formulated below as propositions.
Proposition 1.3.1 (Arzelà-Ascoli) Let {f_n(·)} be a set of equi-continuous and uniformly bounded functions, where by equi-continuity we mean that for any t and any ε > 0 there exists a δ > 0 such that

    |f_n(t) − f_n(s)| < ε  whenever |t − s| < δ,  uniformly in n.

Then there are a continuous function f(·) and a subsequence of functions {f_{n_k}(·)} which converge to f(·) uniformly in any finite interval, i.e.,

    f_{n_k}(t) → f(t)  as  k → ∞,

uniformly with respect to t belonging to any finite interval.
Proposition 1.3.2 For the following ODE

    ẋ = f(x)    (1.3.1)

with equilibrium x^0, i.e., f(x^0) = 0, if there exists a continuously differentiable function V(·) such that V(x^0) = 0, V(x) > 0 for x ≠ x^0, V(x) → ∞ as ||x|| → ∞, and

    ∇V^T(x) f(x) < 0  for  x ≠ x^0,

then the solution to (1.3.1), starting from any initial value, tends to x^0 as t → ∞, i.e., x^0 is the globally asymptotically stable solution to (1.3.1).
Let us introduce the following conditions.

A1.3.1 a_k > 0, a_k → 0, and Σ_{k=1}^∞ a_k = ∞.

A1.3.2 There exists a twice continuously differentiable Lyapunov function V(·) such that V(x^0) = 0, V(x) > 0 for x ≠ x^0, V(x) → ∞ as ||x|| → ∞, and

    ∇V^T(x) f(x) < 0  whenever  x ≠ x^0.
In order to describe the conditions on the noise, we introduce an integer-valued function m(k, T) for any integer k and any T > 0. For k = 1, 2, ..., define

    m(k, T) = max{ m : Σ_{i=k}^{m} a_i ≤ T }.    (1.3.2)

Noticing that a_k tends to zero, for any fixed T > 0, m(k, T) − k diverges to infinity as k → ∞. In fact, m(k, T) counts the number of iterations starting from time k as long as the sum of step sizes does not exceed T. The integer-valued function m(k, T) will be used throughout the book.
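As a sketch (the helper name and the step-size sequence are choices made here), m(k, T) can be computed by accumulating step sizes until the budget T is exhausted:

```python
def m_of(k, T, a):
    """The book's m(k, T) = max{m : a_k + ... + a_m <= T} (assumes a(k) <= T)."""
    total, i = 0.0, k
    while total + a(i) <= T:
        total += a(i)
        i += 1
    return i - 1

# With a_i = 1/i, the step-size mass on [k, m] is roughly log(m/k), so
# m(k, T) grows like e^T * k, and m(k, T) - k -> infinity as k -> infinity.
window_end = m_of(10, 1.0, lambda i: 1.0 / i)
```

For a_i = 1/i and T = 1 the window starting at k = 10 ends at index 25, and the same budget starting at k = 100 would cover far more indices, illustrating the divergence of m(k, T) − k.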
The following conditions will be used:
A1.3.3 The noise {ε_k} satisfies

    lim_{T→0} limsup_{k→∞} (1/T) || Σ_{i=k}^{m(k,T)} a_i ε_{i+1} || = 0.    (1.3.3)
A1.3.4 f(·) is continuous.
Theorem 1.3.1 Assume that A1.3.1, A1.3.2, and A1.3.4 hold. If for a fixed sample ω A1.3.3 holds and {x_k} generated by the RM algorithm (1.2.2) is bounded, then for this ω, x_k tends to x^0 as k → ∞.
Proof. Set

    t_1 = 0,    t_{k+1} = Σ_{i=1}^{k} a_i,    k ≥ 1.

Define the linear interpolating function x⁰(·):

    x⁰(t) = x_k + (x_{k+1} − x_k)(t − t_k)/a_k,    t ∈ [t_k, t_{k+1}].    (1.3.4)

It is clear that x⁰(·) is continuous and x⁰(t_k) = x_k. Further, define the corresponding linear interpolating function of the noises, given by (1.3.4) with x_k replaced by ε_k. Since we will deal with the tail part of x⁰(·), we define x_k(·) by shifting time in x⁰(·):

    x_k(t) = x⁰(t_k + t),    t ≥ 0.

Thus, we derive a family of continuous functions {x_k(·)}.

Let us also define the piecewise constant interpolating function

    x̄(t) = x_k,    t ∈ [t_k, t_{k+1}).

Then summing up both sides of (1.2.2) yields

    x_{j+1} = x_k + Σ_{i=k}^{j} a_i f(x_i) + Σ_{i=k}^{j} a_i ε_{i+1},    j ≥ k,

and hence

    x_k(t) = x_k(0) + ∫_0^t f(x̄(t_k + s)) ds + g_k(t),    (1.3.9)

where g_k(t) collects the contribution of the noises on [t_k, t_k + t].

By the boundedness assumption on {x_k}, the family {x_k(·)} is uniformly bounded. We now prove that it is equi-continuous. By the definition of the interpolation and the local boundedness of f(·), for 0 ≤ s ≤ t,

    ||x_k(t) − x_k(s)|| ≤ c(t − s) + max_{k ≤ j ≤ m(k,t)} || Σ_{i=k}^{j} a_i ε_{i+1} ||,    (1.3.11)

where c bounds ||f(·)|| on a set containing the trajectory, and by A1.3.3 the last term tends to zero as k → ∞. By the boundedness of {x_k} and (1.3.11) we see that {x_k(·)} is equi-continuous.

By Proposition 1.3.1, we can select from {x_k(·)} a convergent subsequence {x_{k_j}(·)} which tends to a continuous function x(·) uniformly on any finite interval. Letting j → ∞ in (1.3.9), by the continuity of f(·) (A1.3.4), the uniform convergence of x_{k_j}(·) to x(·), and A1.3.3, the noise term converges to zero, and

    x(t) = x(0) + ∫_0^t f(x(s)) ds,    (1.3.14)

i.e., the limit x(·) satisfies the ODE ẋ = f(x). By A1.3.2 and Proposition 1.3.2 we see that x(t) → x^0 as t → ∞.

We now prove that x_k → x^0. Assume the converse: there is a subsequence {x_{n_k}} converging to some x̄ with ||x̄ − x^0|| = ε > 0. The family of shifted functions {x_{n_k}(·)} is uniformly bounded and equi-continuous, so we can select a convergent subsequence, denoted still by {x_{n_k}(·)}; its limit x(·) satisfies the ODE (1.3.14) and starts from x(0) = x̄, by the uniqueness of the solution to (1.3.14). Since x̄ ≠ x^0, V(·) strictly decreases along this solution: there are T > 0 and η > 0 such that

    V(x(T)) ≤ V(x̄) − η.    (1.3.15)

By the uniform convergence we have x_{n_k}(T) → x(T), which implies that V(x_{n_k}(T)) ≤ V(x̄) − η/2 for k large enough. Repeating this argument from the times corresponding to x_{n_k}(T), the values of V(·) along the trajectory would have to decrease by a fixed amount infinitely often, and by (1.3.15) we obtain a contradictory inequality: V would fall below its infimum for k large enough. This completes the proof of x_k → x^0.
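The heart of the ODE method — that in the rescaled time t_k = a_1 + ... + a_{k−1} the tail of the iterates follows ẋ = f(x) — can be checked numerically in the noise-free case. The sketch below uses f(x) = −x (a choice made here, not from the book), starts the recursion deep in the tail where the steps are already small, and compares the endpoint with the exact ODE solution x(t) = x(0)e^{−t}.

```python
import math

def rm_tail(k0, T):
    """Run the noise-free recursion x_{k+1} = x_k + a_k * f(x_k), f(x) = -x,
    a_k = 1/k, starting from index k0 and x = 1, for rescaled time at most T."""
    x, t, k = 1.0, 0.0, k0
    while t + 1.0 / k <= T:
        x += (1.0 / k) * (-x)   # one RM step
        t += 1.0 / k            # advance the interpolated clock by a_k
        k += 1
    return x, t

x_end, t_end = rm_tail(1000, 1.0)
ode_end = math.exp(-t_end)   # solution of dx/dt = -x with x(0) = 1 at time t_end
```

Starting at k0 = 1000 the gains are small, so the broken-line interpolation stays within a fraction of a percent of the ODE solution; started at k0 = 1 the match would be much cruder, which is why the method is a statement about the tail of the trajectory.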
We now compare the conditions used in Theorem 1.3.1 with those in Theorem 1.2.1.

Conditions A1.3.1 and A1.3.2 are slightly weaker than A1.2.1 and A1.2.2, but they are almost the same. The noise condition A1.3.3 is significantly weaker than those used in Theorem 1.2.1, because under the conditions of Theorem 1.2.1 the series Σ_k a_k ε_{k+1} converges a.s., which certainly implies A1.3.3.
As a matter of fact, Condition A1.3.3 may be satisfied by sequences much more general than martingale difference sequences.

Example 1.3.1 Assume ε_k → 0 as k → ∞, where {ε_k} may be any random or deterministic sequence. Then {ε_k} satisfies A1.3.3. This is because

    (1/T) || Σ_{i=k}^{m(k,T)} a_i ε_{i+1} || ≤ (1/T) sup_{i ≥ k} ||ε_{i+1}|| Σ_{i=k}^{m(k,T)} a_i ≤ sup_{i ≥ k} ||ε_{i+1}|| → 0  as  k → ∞.
Example 1.3.2 Let {ε_k} be an MA process, i.e.,

    ε_{k+1} = w_{k+1} + C w_k,

where {w_k} is a martingale difference sequence with sup_k E(||w_{k+1}||² | F_k) < ∞. Then under condition A1.2.1, Σ_k a_k ε_{k+1} converges a.s., and hence

    (1/T) || Σ_{i=k}^{m(k,T)} a_i ε_{i+1} || → 0  as  k → ∞

a.s. Consequently, A1.3.3 is satisfied for almost all sample paths.
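Example 1.3.2 can be illustrated numerically. In the sketch below (the MA coefficient 0.5, the driving N(0, 1) sequence, and the checkpoints are all choices made here), the weighted series Σ a_k ε_{k+1} with a_k = 1/k behaves as a convergent series: its tail beyond a large index is small, which is what A1.3.3 asks of the averaged noise.

```python
import random

rng = random.Random(7)
# MA(1) noise: eps_k = w_k + 0.5 * w_{k-1}, driven by i.i.d. N(0, 1) variables
# (an i.i.d. zero-mean sequence is a martingale difference sequence).
w_prev = 0.0
partial = 0.0
partial_at_1000 = None
for k in range(1, 200001):
    w = rng.gauss(0.0, 1.0)
    eps = w + 0.5 * w_prev
    w_prev = w
    partial += eps / k          # accumulate a_k * eps_k with a_k = 1/k
    if k == 1000:
        partial_at_1000 = partial
tail = abs(partial - partial_at_1000)   # |sum over k = 1001..200000 of a_k * eps_k|
```

The tail of the series is tiny compared with the individual noise values, even though the noise itself never decays: the correlated MA structure does not prevent the weighted sums from settling.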
Condition A1.3.4 requires continuity of which is not required inA1.2.4. At first glance, unlike A1.2.4, Condition A1.3.4 does not imposeany growth rate condition on but Theorem 1.3.1 a priori requiresthe boundedness of which is an implicit requirement for the growthrate of
The ODE method is widely used in convergence analysis for algo-rithms arising from various application areas, because from the noiseit requires no probabilistic property which would be difficult to verify.Concerning the weakness of the ODE method, we have mentioned thatit a priori assumes that is bounded. This condition is difficult tobe verified in general case. The other point should be mentioned thatCondition A1.3.3 is also difficult to be verified in the case wheredepends on the past which often occurs when containsstructural errors of This is because A1.3.3 may be verifiable if isconvergent, but may badly behave depending upon the behavior of
So we are somehow in a cyclic situation: with A1.3.3 we can prove convergence of the estimates; on the other hand, with convergent estimates we can verify A1.3.3. This difficulty will be overcome by using the Trajectory-Subsequence (TS) method, to be introduced in the next section and used in subsequent chapters.
1.4. Truncated RM Algorithm and TS Method

In Section 1.2 we considered the root-seeking problem where the sought-for root may be any point in the whole space. If a region to which the root belongs is known, then we may use a truncated algorithm, and the growth rate restriction on the function can be removed.

Let us assume that such a bound on the root is known. In lieu of (1.2.2) we now consider the following truncated RM algorithm:
where the observation is given by (1.2.1), is a given point,
and
The constant used in (1.4.1) will be specified later on. The algorithm (1.4.1) coincides with the RM algorithm as long as the estimate evolves in the sphere, but if the estimate exits the sphere, then the algorithm is pulled back to the fixed point.
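A minimal numerical sketch of the truncated algorithm (1.4.1) may help; everything concrete below is an assumption made for illustration: f(x) = -x with root 0, truncation bound M = 5, reset point x* = 1, step sizes a_k = 1/k, and Gaussian observation noise.

```python
import numpy as np

# Sketch of the truncated RM algorithm (1.4.1); all concrete choices
# (f, M, x_star, step sizes, noise) are illustrative assumptions.
rng = np.random.default_rng(0)
M, x_star = 5.0, 1.0        # truncation bound and fixed reset point
x = x_star
for k in range(1, 20_000):
    a_k = 1.0 / k
    y = -x + 0.2 * rng.standard_normal()   # noisy observation of f(x) = -x
    x_next = x + a_k * y                    # RM step
    # If the iterate exits the sphere {|x| <= M}, pull it back to x_star.
    x = x_star if abs(x_next) > M else x_next
```

With the root inside the truncation region, the iterate settles near 0; the growth rate restriction on f plays no role here.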
We will use the following set of conditions:
A1.4.1 The step size satisfies the following conditions
A1.4.2 There exists a continuously differentiable Lyapunov function(not necessarily being nonnegative) such that
and for (which is used in
(1.4.1)) there is such that
A1.4.3 For any convergent subsequence of
where is given by (1.3.2);
A1.4.4 is measurable and locally bounded.
We first compare these conditions with A1.3.1–A1.3.4. We note that A1.4.1 is the same as A1.3.1, while A1.4.2 is weaker than A1.2.2.
The difference between A1.3.3 and A1.4.3 is that Condition (1.4.2) is required to hold only along convergent subsequences, while (1.3.3) in A1.3.3 has to be verified along the whole sequence.
It will be seen that in many problems A1.4.3 can be verified, while A1.3.3 is difficult to verify.
Comparing A1.4.4 with A1.3.4 we find that the conditions on the function have now been weakened. The growth rate restriction used in Theorem 1.2.1 and the boundedness assumption imposed in Theorem 1.3.1 have been removed in the following theorem.
Theorem 1.4.1 Assume Conditions A1.4.1, A1.4.2, and A1.4.4 hold and the constant in A1.4.2 is available. Set for (1.4.1). If for some sample path A1.4.3 holds, then given by (1.4.1) converges to for this sample path.
Proof. We say that crosses an interval if and
We first prove that truncations in (1.4.1) may happen at most a finite number of times. Assume the converse: there are infinitely many truncations occurring in (1.4.1). Since
by A1.4.2, there is an interval such that
and there are infinitely many that cross
Since is bounded, we may extract a convergent subsequence from it. Let us denote the extracted convergent subsequence still the same way. It is clear that
Since the limit of is located in the open sphere, there is an such that
for all sufficiently large
Since is bounded, by A1.4.4 and the boundedness of
using (1.4.2) we have
if is small enough and is large enough. This, together with (1.4.5), implies that
Therefore, the norm of
cannot reach the truncation bound. In other words, the algorithm (1.4.1) becomes the untruncated RM algorithm (1.4.7) for
small and large
By the mean value theorem there exists a vector with components located in-between the corresponding components of and such that
Notice that by (1.4.2) the left-hand side of (1.4.6) is of for all sufficiently large since is bounded. From this it follows that i) for small enough and large enough
and hence and ii) the last term in (1.4.8) is of since as From (1.4.7) and (1.4.8) it then follows that
Since the interval does not contain the origin, noticing that we find and that there is such that
for sufficiently small and all large enough Then by A1.4.2 there is such that
for all large and small enough As mentioned above, from (1.4.9) we have
for sufficiently large and small enough where denotes a magnitude tending to zero as
Taking (1.4.4) into account, from (1.4.10) we find that
for large. However, we have shown that
The obtained contradiction shows that the number of truncations in (1.4.1) can only be finite.
We have proved that starting from some large time the algorithm (1.4.1) develops as an RM algorithm
and is bounded. We are now in a position to show that the estimate converges. Assume this were not true. Then we would have
Then there would exist an interval not containing the origin, and would cross it for infinitely many
Again, without loss of generality, assuming by the same argument as that used above, we will arrive at (1.4.9) and (1.4.10) for large and obtain a contradiction. Thus, tends to a finite limit as
It remains to show that
Assume the converse: there is a subsequence
Then there is a such that for all sufficiently large. We still have (1.4.8), (1.4.9), and (1.4.10) for some
Letting tend in (1.4.10), by convergence of we arrive at a contradictory inequality:
This means
In this section we have demonstrated an analysis method different from those used in Sections 1.2 and 1.3. This method is based on analyzing the sample-path behavior, and conclusions on the whole sequence are deduced from the local behavior of the estimates obtained immediately after a convergent subsequence. We call this method the Trajectory-Subsequence (TS) method. The TS method is the main tool to be used in subsequent chapters for analyzing more general cases. It will be seen that the TS method is powerful in dealing with complicated errors, including both random noise and structural inaccuracy of the function.
The obvious weakness of Theorem 1.4.1 is the assumption on the availability of the upper bound for the root. This limitation will be removed later on.
1.5. Weak Convergence Method

Up to now we have worked with decreasing gains, which are necessary for path-wise convergence when observations are corrupted by noise. However, in some applications people prefer using a constant gain:
where in contrast to (1.2.2) a constant stands for, which tends to zero as
Define the piece-wise constant interpolating function as
Then, which is the space of real functions on that are right continuous and have left-hand limits, endowed with the Skorohod topology. Convergence of to a continuous function in the Skorohod topology is equivalent to uniform convergence on any bounded interval.
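The interpolation can be sketched numerically in the noise-free case; the example f(x) = -x below is an assumption for illustration, and then the piecewise-constant interpolation approaches the solution x(t) = x(0)e^{-t} of the limit ODE as the gain shrinks.

```python
import math

# Piecewise-constant interpolation of the constant-gain recursion
# x_{k+1} = x_k + eps * f(x_k), evaluated at time t: x^eps(t) = x_k for
# t in [k*eps, (k+1)*eps).  Noise-free illustrative case f(x) = -x.
def interpolated_value(eps, x0, t):
    x = x0
    for _ in range(int(t / eps)):   # iterate up to the step containing t
        x += eps * (-x)
    return x

x_eps = interpolated_value(1e-4, 1.0, 1.0)   # interpolation at t = 1
x_ode = math.exp(-1.0)                       # ODE solution x(1) = e^{-1}
```

Shrinking the gain makes the gap between the interpolation and the ODE solution as small as desired on any bounded interval, which is the uniform convergence just mentioned.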
Let and be probability measures determined by the stochastic processes and respectively on, with induced by the Skorohod topology.
If for any bounded continuous function defined on
then we say that weakly converges to
If for any there is a compact measurable set in
such that
then is called tight. Further, is called relatively compact if each subsequence of contains a weakly convergent subsequence.
In weak convergence analysis an important role is played by Prohorov's Theorem, which says that on a complete and separable metric space, tightness is equivalent to relative compactness. The weak convergence method establishes the weak limit of as and convergence of to in probability as whereas
Theorem 1.5.1 Assume the following conditions:
A1.5.1 is a.s. bounded;
A1.5.2 is continuous;
A1.5.3 is adapted, is uniformly integrable in the sense that
and
Then is tight in and weakly converges to, which is a solution to
Further, if is asymptotically stable for (1.5.3), then for anyas the distance between and
converges to zero in probability as
Instead of a proof, we only outline its basic idea. First, it is shown that we can extract a subsequence of weakly converging to
For notational simplicity, denote the subsequence still the same way. By the Skorohod representation, we may assume convergence with probability one. For this we need only, if necessary, to change the probability space and take and on this new space such that and have the same distributions as those of and respectively. Then, it is proved that
is a martingale. Since and, as can be shown, is Lipschitz continuous, it follows that
Since is relatively compact and the limit does not depend on the extracted subsequence, the whole family weakly converges to as and satisfies (1.5.3). By asymptotic stability of
Remark 1.5.1 The boundedness assumption on may be removed. For this a smooth function is introduced such that
and the following truncated algorithm
is considered in lieu of (1.5.1). Then is interpolated to a piecewise constant function. It is shown that is tight, and weakly convergent as The limit
satisfies
Finally, by showing lim sup lim sup for some
for each, it is proved that itself is tight and weakly converges to satisfying (1.5.3).
1.6. Notes and References

The stochastic approximation algorithm was first proposed by Robbins and Monro in [82], where the mean square convergence of the algorithm was established under an independence assumption on the observation noise. Later, the noise was extended from independent sequences to martingale difference sequences (e.g., [7, 40, 53]).
The probabilistic approach to convergence analysis is well summarized in [78].
The ODE approach was proposed in [65, 72], and then it was widely used [4, 85]. For a detailed presentation of the ODE method we refer to [65, 68].
The proof of the Arzelà-Ascoli Theorem can be found in ([37], p. 266).
Section 1.4 is an introduction to the method described in detail in the coming chapters. For stability and Lyapunov functions we refer to [69]. The weak convergence method was developed by Kushner [64, 68]. The Skorohod topology and Prohorov's theorem can be found in [6, 41].
For the probability concepts briefly presented in Appendix A, we refer to [30, 32, 70, 76, 84]. But the proof of the convergence theorem for martingale difference sequences, which is frequently used throughout the book, is given in Appendix B.
Chapter 2

STOCHASTIC APPROXIMATION ALGORITHMS WITH EXPANDING TRUNCATIONS

In Chapter 1 the RM algorithm, the basic algorithm used in stochastic approximation (SA), was introduced, and four different methods for analyzing its convergence were presented. However, the conditions imposed for convergence are rather strong.

Comparing the theorems derived by various methods in Chapter 1, we find that the TS method introduced in Section 1.4 requires the weakest condition on the noise. The trouble is that the sought-for root has to be inside the truncation region. This motivates us to consider SA algorithms with expanding truncations, with the purpose that the truncation region will finally cover the sought-for root, whose location is unknown. This is described in Section 2.1.

General convergence theorems for the SA algorithm with expanding truncations are given in Section 2.2. The key point of the proof is to show that the number of truncations is finite. Once this is done, the estimate sequence is bounded and the algorithm becomes the conventional RM algorithm after a finite number of steps. This is realized by using the TS method. It is worth noting that the fundamental convergence theorems given in this section are analyzed by a completely elementary method, which is deterministic and requires only a knowledge of calculus. In Section 2.3 state-independent conditions on the noise are given to guarantee convergence of the algorithm when the noise itself is state-dependent. In Section 2.4 conditions on the noise are discussed; it appears that the noise condition in the general convergence theorems is, in a certain sense, necessary. In Section 2.5 the convergence theorem is given for the case where the observation noise is non-additive.

In the multi-root case, up to Section 2.6 we only establish that the distance between the estimate and the root set tends to zero. But by no means does this imply convergence of the estimate itself. This is briefly discussed in Section 2.4, and is considered in Section 2.6 in connection with properties of the equilibria. Conditions are given to guarantee trajectory convergence, and it is also considered whether the limit of the estimate is a stable or an unstable equilibrium. In Section 2.7 it is shown that a small distortion of the conditions causes only a small estimation error in the limit, while Section 2.8 considers the case where the sought-for root moves during the estimation process. The convergence theorems are derived with the help of the general convergence theorem given in Section 2.2. Notes and references are given in the last section.

2.1. Motivation

In Chapter 1 we presented four types of convergence theorems using different analysis methods for SA algorithms. However, none of these theorems is completely satisfactory in applications. Theorem 1.2.1 is proved by the classical probabilistic method, which requires restrictive conditions on the noise. As mentioned before, the noise may contain a component caused by the structural inaccuracy of the function, and it is hard to assume this kind of noise to be mutually independent or to be a martingale difference sequence, etc. The growth rate restriction imposed on the function not only is severe, but also is unavoidable in a certain sense. To see this, let us consider the following example:

It is clear that conditions A1.2.1, A1.2.2, and A1.2.3 are satisfied. The only condition that is not satisfied is (1.2.4), since the function grows faster than the second order polynomial on the right-hand side of (1.2.4). A simple calculation shows that the sequence given by the RM algorithm rapidly diverges:

From this one might conclude that the growth rate restriction would be necessary. However, if we take the initial value close enough to the root, then the sequence given by the RM algorithm converges. Reducing the initial value is, in a certain sense, equivalent to using step sizes starting not from the first one but from some later index. The difficulty consists in that we do not know from which index we should start the algorithm. This is one of the motivations to use the expanding truncations to be introduced later.
Theorem 1.3.1 proved in Section 1.3 demonstrates the ODE method. By this approach, the condition imposed on the noise has been significantly weakened, and it covers a class of noises much larger than that treated by the probabilistic method. However, it a priori requires the estimates to be bounded. This is the case if the estimate sequence converges, but before establishing its convergence this is an artificial condition, which is not satisfied even for the simple example given above. Further, although the noise condition (1.3.3) is much more general than that used in Theorem 1.2.1, it is still difficult to verify for state-dependent noise. For example,
where is a martingale difference sequence with
If is bounded and
then a.s. and (1.3.3) holds. However, in general, it is difficult to directly verify (1.3.3) because the behavior of the estimates is unknown. This is why we use Condition (1.4.2), which need be verified only along convergent subsequences. With convergent subsequences the noise is easier to deal with.
By considering convergent subsequences, path-wise convergence is proved for a truncated RM algorithm by using the TS method in Theorem 1.4.1. The weakness of algorithms with fixed truncation bounds is that the sought-for root has to be located in the truncation region. But, in general, this cannot be ensured. This is another motivation to consider algorithms with expanding truncations.
The weak convergence method explained in Section 1.5 can avoid the boundedness assumption on the estimates, but it can ensure convergence in distribution only, while in practical computation one always deals with a sample path. Hence, people in applications are mainly interested in path-wise convergence.
The SA algorithm with expanding truncations was introduced in order to remove the growth rate restriction on the function. It has been developed in two directions: weakening the conditions imposed on the noise and improving the analysis method. By the TS method we can show that the SA algorithm with expanding truncations converges under a truly weak condition on the noise, which, in fact, is also necessary for a wide class of noises.
In Chapter 1, the root is a singleton. From now on we will consider the general case. Let J be the root set.
We now define the algorithm. Let be a sequence of positive numbers increasingly diverging to infinity, and let be a fixed point in
Fix an arbitrary initial value and denote by the estimate at time serving as the approximation to J. Define by the following recursion:
where the indicator function equals 1 if the inequality indicated in the brackets is fulfilled, and 0 otherwise.
We explain the algorithm. The count is the number of truncations up to time, and the corresponding bound serves as the truncation bound when the estimate is generated. From (2.1.1) it is seen that if the estimate at time calculated by the RM algorithm remains in the truncation region, then the algorithm evolves as the RM algorithm. If the estimate exits from the sphere, then it is pulled back to the pre-specified point and the truncation bound is enlarged.
Consequently, if it can be shown that the number of truncations is finite, or equivalently, that the sequence generated by (2.1.1) and (2.1.2) is bounded, then the algorithm (2.1.1) and (2.1.2) becomes the one without truncations, i.e., the RM algorithm, after a finite number of steps. This actually is the key step when we prove convergence of (2.1.1) and (2.1.2).
The convergence analysis of (2.1.1) and (2.1.2) will be given in the next section, and the analysis is carried out in a deterministic way at a fixed sample path without involving any interpolating function.
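The recursion (2.1.1)–(2.1.2) can be sketched as follows. All concrete choices are assumptions made for illustration: f(x) = -(x - 10)^3 (a root at 10, and a function violating the growth restriction (1.2.4), so the plain RM algorithm diverges for it), truncation bounds M_s = 5(s + 1) increasing to infinity, reset point x* = 0, step sizes a_k = 1/k, and Gaussian observation noise.

```python
import numpy as np

# Sketch of the SA algorithm with expanding truncations (2.1.1)-(2.1.2);
# f, the bounds M_s, the reset point, and the noise are illustrative.
def expanding_truncation_sa(f, x_star=0.0, n_steps=50_000, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x, sigma = x_star, 0                     # sigma = number of truncations so far
    for k in range(1, n_steps + 1):
        y = f(x) + noise_std * rng.standard_normal()   # noisy observation
        x_next = x + y / k                             # RM step with a_k = 1/k
        if abs(x_next) > 5.0 * (sigma + 1):  # estimate left the current sphere:
            x_next = x_star                  # pull it back to the fixed point
            sigma += 1                       # and enlarge the truncation bound
        x = x_next
    return x, sigma

estimate, n_truncations = expanding_truncation_sa(lambda x: -(x - 10.0) ** 3)
```

Only finitely many truncations occur in this sketch: once the bound exceeds the root and the step sizes are small, the recursion runs as the untruncated RM algorithm and the estimate approaches 10.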
2.2. General Convergence Theorems by TS Method

In this section, by the TS method we establish convergence of the RM algorithm with expanding truncations defined by (2.1.1)–(2.1.3) under general conditions. Let us first list the conditions to be used.
A2.2.2 There is a continuously differentiable function (not necessarily nonnegative) such that
for any, and is nowhere dense, where J is the zero set, i.e.,
and denotes the gradient. Further, the point used in (2.1.1) is such that for some and
To introduce the condition on the noise, let us denote by the underlying probability space. Let be a measurable function defined on the product space. Fixing an means that a sample path is under consideration. Let the noise be given by
Thus, state-dependent noise is considered, and for a fixed state the noise may be random.
A2.2.3 For the sample path under consideration, for any sufficiently large integer
for any such that converges, where is given by (1.3.2) and denotes the sequence given by (2.1.1)–(2.1.3) valued at the sample path
In the sequel, the algorithm (2.1.1)–(2.1.3) is considered for the fixed sample path for which A2.2.3 holds, and the sample-path argument will often be suppressed if no confusion is caused.
A2.2.4 is measurable and locally bounded.
Remark 2.2.1 Comparing A2.2.1–A2.2.4 with A1.4.1–A1.4.4, we find that if the root set J degenerates to a singleton, then the only essential difference is that an indicator function is included in (2.2.2), while (1.4.2) stands without it. It is clear that if the estimate sequence is bounded, then this makes no difference. However, before establishing the boundedness of the estimates, condition (2.2.2) is easier to verify. The key point here
is that, in contrast to Section 1.4, we do not assume availability of the upper bound for the roots of
Remark 2.2.2 It is worth noting that converges. To see this it suffices to take in (2.2.2).
Theorem 2.2.1 Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume A2.2.1–A2.2.4 hold. Then, for the sample path for which A2.2.3 holds.
Proof. The proof is completed in six steps by considering convergent subsequences along the sample path. This is why we call the analysis method used here the TS method.
Step 1. We show that there are constants such that for any there exists such that for any
if is a convergent subsequence of, where M is independent of the particular subsequence.
Since, we need only to prove (2.2.3) for
If the number of truncations in (2.1.1)–(2.1.3) is finite, then there is an N beyond which there is no more truncation. In this case, we may take in (2.2.3).
We now prove (2.2.3) for the case where as
Assume the converse: (2.2.3) is not true. There is such that
Take a sequence of positive real numbers with as
Since (2.2.3) is not true, there are and such that
and for any there are and such that
Without loss of generality we may assume
Then for any, from (2.2.4) and (2.2.6) it follows that
Since there is such that. Then from (2.2.7) it follows that
For any fixed, if is large enough, then and, by (2.2.10),
Since, from (2.2.11) it follows that
and by (2.2.4), (2.2.7), and (2.2.8)
and hence
by A2.2.4, where is a constant. Let where is specified in A2.2.3. Then from A2.2.3
for any
Taking and respectively in (2.2.10), and noticing from (2.2.9), we then have
and hence
From (2.2.8), it follows that
where the second term on the right-hand side of the inequality tends to zero by (2.2.12) and (2.2.13), while the first term tends to zero because
Noticing that by
(2.2.9) and (2.2.13), we then by (2.2.14) have
On the other hand, by (2.2.6) we have
The obtained contradiction proves (2.2.3).
Step 2. We now show that for all large enough
if T is small enough, where is a constant.
If the number of truncations in (2.1.1)–(2.1.3) is finite, then the sequence is bounded and hence is also bounded.
Then for large enough there is no truncation, and by (2.2.2) for
if T is small enough. In (2.2.16), for the last inequality the boundedness of the estimates is invoked, and is a constant.
Thus, it suffices to prove (2.2.15) for the case where
From (2.2.3) it follows that for any
if is large enough. This implies that for
where is a constant. The last inequality of (2.2.18) yields
With in A2.2.3, from (2.2.2) we have
for large enough and small enough T.
Combining (2.2.18), (2.2.19), and (2.2.20) leads to
for all large enough. This together with (2.2.16) verifies (2.2.15).
Step 3. We now show the following assertion: For any interval with and, the sequence cannot cross infinitely many times with
and
Assume the converse: there are infinitely many crossings and is bounded.
By boundedness, without loss of generality, we may assume
By setting in (2.2.15), we have
But by definition so we have
From (2.2.15) we see that if we take sufficiently small, then
for sufficiently large
By (2.2.18) and (2.2.15), for large we then have
where denotes the gradient and as
For, condition (2.2.2) implies that
By (2.2.15) and (2.2.18) it follows that
bounded, where by "crossing" we mean that
Then, by (2.2.23) and (2.2.1), from (2.2.24)–(2.2.26) it follows that there are and such that
for all sufficiently large
Noticing (2.2.22), from (2.2.27) we derive
However, by (2.2.15) we have
which implies that for small enough
This means that, which contradicts (2.2.28).
Step 4. We now show that the number of truncations is bounded.
By A2.2.2, is nowhere dense, and hence a nonempty interval
exists such that and
If, then starting from, the sequence will cross the sphere infinitely many times. Consequently, it will cross infinitely often while remaining bounded. In Step 3 we have shown this is impossible. Therefore, starting from some time the algorithm (2.1.1)–(2.1.3) will have no truncations and the estimate sequence is bounded.
This means that the algorithm defined by (2.1.1)–(2.1.3) becomes the conventional RM algorithm for large times, and a condition stronger than (2.2.2) is satisfied:
for any such that converges.
Step 5. We now show that converges. Let
We have to show
If and one of and does not belong to, then exists such that and. By Step 3 this is impossible. So, both and belong to, and
If we can show that is dense in, then from (2.2.30) it will follow that is dense in, which contradicts the assumption that is nowhere dense. This will prove, i.e., the convergence of
To show the density it suffices to show that
Assume the converse: there is a subsequence
Without loss of generality, we may assume converges; otherwise a convergent subsequence can be extracted, which is possible because the sequence is bounded. However, if we take in (2.2.15), we have
which contradicts (2.2.31). Thus converges.
Step 6. For proving the theorem it suffices to show that all
limit points of belong to J.
Assume the converse. By (2.2.15) we have
for all large if is small enough. By (2.2.1) it follows that
and from (2.2.24)
for small enough. This leads to a contradiction because converges and the left-hand side of (2.2.32) tends to zero. Thus, we conclude
Remark 2.2.3 In (2.1.1)–(2.1.3) spheres with expanding radii are used for the truncations. Obviously, the spheres can be replaced by other expanding sets. At first glance the point in (2.1.1) may be arbitrarily chosen, but actually the restriction is imposed by the existence of such that. The condition is obviously satisfied if, because the availability of is not required.
Remark 2.2.4 In the proof of Theorem 2.2.1 it can be seen that the conclusion remains valid if in A2.2.2 the requirement "J is the zero set" is removed. As a matter of fact, J may be bigger than the zero set; of course, it should at least contain the zero set in order for (2.2.1) to be satisfied. It should also be noted that for this we need not require to be nowhere dense.
Let us modify A2.2.2 as follows.
A2.2.2’ There is a continuously differentiable function such that
for any, and is nowhere dense. Further, the point used in (2.1.1) is such that for some and
A2.2.2” There is a continuously differentiable function such that
for any, and J is closed. Further, the point used in (2.1.1) is such that for some and
Notice that in A2.2.2’ and A2.2.2” the set J is not specified, but it certainly contains the root sets of both and. We may modify Theorem 2.2.1 as follows.
Theorem 2.2.1’ Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume A2.2.1, A2.2.2’, A2.2.3, and A2.2.4 hold. Then
for the sample path for which A2.2.3 holds.
Proof. The proof of Theorem 2.2.1 applies without any change.
Theorem 2.2.1” Let be given by (2.1.1)–(2.1.3) for a given initial value. If A2.2.1, A2.2.2”, A2.2.3, and A2.2.4 hold, then
for the sample path for which A2.2.3 holds.
Proof. We still have Steps 1–3 in the proof of Theorem 2.2.1. Let
If or, or both, do not belong to J, then exists such that, since J is closed. Then would cross infinitely many times. But, by Step 3 of the proof of Theorem 2.2.1, this is impossible. Therefore both and belong to
Theorems 2.2.1 and 2.2.1’ only guarantee that the distance betweenand the set J tends to zero. As a matter of fact, we have more
precise result.
Theorem 2.2.2 Assume the conditions of Theorem 2.2.1 or Theorem 2.2.1’ hold. Then for fixed and for which A2.2.3 holds, a connected subset exists such that
where denotes the closure of and is generated by (2.1.1)–(2.1.3).
Proof. Denote by the set of limit points. Assume the converse, i.e., that the limit set is disconnected. In other words, closed sets and exist such that and
Define
Since a exists such that
where denotes the neighborhood of the set A. Define
It is clear that and
Since by we have
By boundedness of the estimates we may assume that converges. Then, by taking in (2.2.15), we derive
which contradicts (2.2.33) and proves the theorem.
Corollary 2.2.1 If J is not dense in any connected set, then under the conditions of Theorem 2.2.1, the sequence given by (2.1.1)–(2.1.3) converges to a point. This is because in the present case any connected subset consists of a single point.
Example 2.2.1 Reconsider the example given in Section 2.1:
It was shown that the RM algorithm rapidly diverges even in the noise-free case.
We now assume the observations are noise-corrupted:
where the noise is an ARMA process driven by independent identically distributed normal random variables:
where
We use the algorithm (2.1.1)–(2.1.3) with. The
computation shows
which tend to the sought-for root 10.
Example 2.2.2 Let. Then
Clearly, A2.2.1 and A2.2.4 hold. Concerning A2.2.2, we may take to serve as. Since
(2.2.1) is satisfied. The existence of required in A2.2.2 is obvious; for example,
Finally, the set is nowhere dense. So A2.2.2 also holds.
Now assume the noise is such that
Then A2.2.3 is satisfied too.
By Corollary 2.2.1, the sequence given by (2.1.1)–(2.1.3) converges to a point
If for the conventional (untruncated) RM algorithm
it is a priori known that the sequence is bounded, then we have the following theorem.
Theorem 2.2.3 Assume A2.2.1–A2.2.4 hold, but with the requirement in A2.2.2 that "the point used in (2.1.1) is such that for some and" removed. If the sequence produced by (2.2.34) is bounded, then for the sample path for which A2.2.3
holds, where is a connected subset of
Proof. As a matter of fact, by boundedness of the sequence, (2.2.3) and (2.2.15) become obvious. Steps 3, 5, and 6 in the proof of Theorem 2.2.1 remain unchanged, while Step 4 is no longer needed. Then the conclusion follows from Theorems 2.2.1 and 2.2.2.
Remark 2.2.5 All theorems concerning SA algorithms with expanding truncations remain valid for the sequence produced by (2.2.34), if it is known to be bounded.
Theorems 2.2.1 and 2.2.2 concern a time-invariant function, but the results can easily be extended to time-varying functions, i.e., to the case where the measurements are carried out for
where the function depends on time.
Conditions A2.2.2 and A2.2.4 are respectively replaced by the following conditions:
A2.2.2° There is a continuously differentiable function such that
for any, and is nowhere dense,
where and, and denotes the gradient. Further, the point used in (2.1.1) is such that for some and
A2.2.4' The functions are measurable and uniformly locally bounded, i.e., for any constant
Theorem 2.2.4 Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume A2.2.1, A2.2.2°, and A2.2.4' hold. Then
for the sample path for which A2.2.3 holds, where is a connected subset of
Proof. It suffices to replace by everywhere in the proofs of Theorems 2.2.1 and 2.2.2.
Remark 2.2.6 If it is known that the sequence given by an SA algorithm evolves in a subspace S, then it suffices to verify A2.2.2, A2.2.2’, A2.2.2”, and A2.2.2° in the subspace S in order for the corresponding conclusions about convergence to hold. For example, in this case A2.2.2 changes to
A2.2.2(S): There is a continuously differentiable function such that for any, and
is nowhere dense. Further, the point used in (2.1.1) is such that for some and
According to Remark 2.2.4, here J is not specified. Then, with A2.2.2 and replaced by A2.2.2(S) and respectively, Theorem 2.2.1 together with Theorem 2.2.2 asserts that
2.3. Convergence Under State-IndependentConditions
In the last section we have established convergence theorems under general conditions. These theorems take a sample-path-based form: under A2.2.1, A2.2.2, and A2.2.4, convergence takes place at those sample paths for which A2.2.3 holds. Condition A2.2.3 looks rather complicated, but it
is so weak that it is necessary, as will be shown later. However, condition A2.2.3 is state-dependent in the sense that the condition itself depends on the behavior of the estimates. This makes it not always possible to verify the condition beforehand. We now give convergence theorems under conditions with no state involved. For this we have to reformulate Theorems 2.2.1 and 2.2.2.
As defined in Section 2.2, where is a measurable function. In lieu of A2.2.3 we introduce the following condition.
A2.3.1 For any sufficiently large integer there is an with such that for any
for any such that converges.
Theorem 2.3.1 Assume A2.2.1, A2.2.2, A2.2.4, and A2.3.1 hold. Then a.s. for generated by (2.1.1)–(2.1.3) with a given initial value, where is a connected subset contained in the closure of J.
Proof. Let It is clear that
i.e., Then for any
A2.2.3 is fulfilled with possibly depending on and the conclusion of the theorem follows from Theorems 2.2.1 and 2.2.2.
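The expanding-truncation scheme behind (2.1.1)–(2.1.3) can be illustrated numerically. The sketch below is a hypothetical rendering under assumed standard notation: step sizes a_k = 1/k, truncation bounds M_j increasing to infinity, and a fixed reset point x*. The concrete regression function f(x) = -x and the Gaussian noise are illustrative choices, not taken from the text.

```python
import numpy as np

# Hypothetical sketch of an RM iteration with expanding truncations:
# if the candidate step leaves the current bound M_sigma, reset to x*
# and expand the bound; otherwise take an ordinary RM step.
def expanding_truncation_rm(f, x0, n_steps, M=None, x_star=0.0, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    M = M or (lambda j: 2.0 ** j)       # truncation bounds M_j -> infinity
    x, sigma = x0, 0                    # sigma counts truncations so far
    for k in range(1, n_steps + 1):
        a_k = 1.0 / k                   # decreasing step sizes (cf. A2.2.1)
        y = f(x) + rng.normal()         # noisy observation of f at x_k
        cand = x + a_k * y
        if abs(cand) <= M(sigma):
            x = cand                    # ordinary RM step
        else:
            x, sigma = x_star, sigma + 1  # reset and expand the bound
    return x, sigma

x_final, n_trunc = expanding_truncation_rm(lambda x: -x, x0=5.0, n_steps=20000)
```

For f(x) = -x the unique root is 0, and the iterate settles near it; with a well-chosen initial bound the truncations stop after finitely many steps, as the theorems of this chapter assert.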
We now introduce a state-independent condition on noise.
A2.3.2 For any is a martingale difference sequence and for some
where is a family of nondecreasing independent of
We first give an example satisfying A2.3.2. Let be an -dimensional martingale difference sequence with
Stochastic Approximation Algorithms withExpanding Truncations 43
for some and let be a measurable and locally bounded function. Then
satisfies A2.3.2, because
and
by assumption.
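A small simulation can illustrate this class of noises. Everything concrete below is an assumed toy instance, not from the text: g is a bounded (hence locally bounded) measurable function and the w's are i.i.d. zero-mean, so the noise g(x_k)w_{k+1} is a martingale difference sequence, and the RM iteration for f(x) = -x still converges while the weighted noise series stays bounded.

```python
import numpy as np

# Toy instance (assumed, not from the text): state-dependent noise
# eps_{k+1} = g(x_k) * w_{k+1}, with g measurable and bounded and {w_k}
# i.i.d. zero-mean, so {eps_k} is a martingale difference sequence.
rng = np.random.default_rng(1)
g = lambda x: 1.0 + abs(np.sin(x))    # measurable, (locally) bounded
x, series = 3.0, 0.0
for k in range(1, 50001):
    a_k = 1.0 / k
    eps = g(x) * rng.normal()         # noise depends on the current state
    series += a_k * eps               # partial sum of sum_k a_k * eps_{k+1}
    x = x + a_k * (-x + eps)          # RM step for f(x) = -x, root at 0
```

The variable `series` tracks the weighted noise sums whose convergence drives condition A2.3.1 in the proof of Theorem 2.3.2.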
Theorem 2.3.2 Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume A2.2.1, A2.2.2, A2.2.4, and A2.3.2 hold and
for given in A2.3.2. Then a.s., where is a
connected subset contained in
Proof. Since is measurable and is it follows that is adapted. Approximating by simple functions, it is seen that
Hence, is a martingale difference sequence, and
a.s.
By the convergence theorem for martingale difference sequences, the series
converges a.s., which implies that with exists such that for each
converges to zero as uniformly in This means that A2.3.1 holds, and the conclusion of the theorem follows from Theorem 2.3.1.
In applications it may happen that is not directly observed. Instead, the time-varying functions are observed, and the observations may be taken not at but at i.e., at with bias
Theorem 2.3.3 Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume that A2.2.1, A2.2.2, A2.2.4, and A2.3.2 hold and
for p given in A2.3.2. Further, assume is an adapted sequence, is bounded by a constant, and for any sufficiently large integer there exists with such that for any
for any such that converges. Then, a.s.,
where is a connected subset contained in
Proof. By assumption where is a constant. Then
and again by the convergence theorem for martingale difference sequences, the series
converges a.s. Consequently, there exists with such that for any the convergence indicated in (2.3.5) holds and for any integer
tends to zero as uniformly in
Therefore, A2.3.1 is fulfilled and the conclusion of the theorem follows
from Theorem 2.3.1.
Remark 2.3.1 The obvious sufficient condition for (2.3.5) is
which in turn is satisfied if is continuous and
Remark 2.3.2 Theorems 2.3.2 and 2.3.3 with A2.2.2 and A2.2.4 replaced by A2.2.2° and A2.2.4’, respectively, remain valid if is replaced by the time-varying
2.4. Necessity of Noise Condition
Under Conditions A2.2.1–A2.2.4 we have established convergence theorems for recursively given by (2.1.1)–(2.1.3). Condition A2.2.1 is a commonly accepted requirement for decreasing step sizes, while A2.2.2 is a stability condition. This kind of condition is unavoidable for convergence of SA-type algorithms, although it may appear in different forms. Concerning A2.2.4 on it is the weakest possible: neither continuity nor a growth rate restriction on is required. So it is natural to ask: is it possible to further weaken Condition A2.2.3 on the noise? We now answer this question.
Theorem 2.4.1 Assume has only one root, i.e., and is continuous at Further, assume A2.2.1 and A2.2.2 hold. Then given by (2.1.1)–(2.1.3) converges to at those sample paths for
which one of the following conditions holds:
i)
ii) can be decomposed into two parts such that
and
Conversely, if then both i) and ii) are satisfied.
Proof. Sufficiency. It is clear that ii) implies i), which in turn implies A2.2.3. Consequently, sufficiency follows from Theorem 2.2.1.
Necessity. Assume Then is bounded and (2.1.1)–(2.1.3) turns into the RM algorithm after a finite number of steps (for
. Therefore,
where
Since and is continuous, Condition ii) is satisfied. And Condition i), being a consequence of ii), also holds.
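The necessity direction can be checked numerically in a toy setting (all specifics below are assumed, not the book's): for f(x) = -x with unique root 0, step sizes a_k = 1/k and i.i.d. zero-mean noise, the iterate converges and the weighted noise partial sums settle down to a limit, in the spirit of conditions i) and ii).

```python
import numpy as np

# Toy check (assumed setup): RM algorithm for f(x) = -x with root 0,
# step sizes a_k = 1/k, i.i.d. zero-mean observation noise. The iterate
# converges, and the weighted noise sums s_n = sum_{k<=n} a_k*eps_{k+1}
# fluctuate less and less, i.e., they form a convergent (Cauchy) sequence.
rng = np.random.default_rng(2)
x, s, late_sums = 2.0, 0.0, []
for k in range(1, 100001):
    a_k = 1.0 / k
    eps = rng.normal()
    x += a_k * (-x + eps)          # RM step
    s += a_k * eps                 # weighted noise partial sum
    if k >= 90000:
        late_sums.append(s)        # record the late partial sums
spread = max(late_sums) - min(late_sums)   # Cauchy-type fluctuation
```

A small `spread` over the late window is the numerical face of the convergence of the weighted noise series that the theorem derives from convergence of the iterates.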
Remark 2.4.1 In the case where and is continuous at, under conditions A2.2.1, A2.2.2, and A2.2.3 by Theorem 2.2.1 we
arrive at Then by Theorem 2.4.1 we derive (2.4.1), which is stronger than A2.2.3. One may ask why the weaker condition A2.2.3 can imply the stronger condition (2.4.1). Are they equivalent? The answer
is both “yes” and “no”: yes, these conditions are equivalent, but only under the additional conditions A2.2.1, A2.2.2, and continuity of at the unique root of However, by themselves these conditions are not equivalent, because condition A2.2.3 is indeed weaker than (2.4.1).
We now consider the multi-root case. Instead of the singleton we now have a root set J. Accordingly, continuity of at is replaced by the following condition
In order to derive the necessary condition on the noise, we consider the linear interpolating function
where From form a family of functions, where
where is a constant.
For any subsequence define
where appearing on the right-hand side of (2.4.3) denotes the dependence of the limit function on the subsequence, and the limsup of a vector sequence is taken component-wise. In general, may be discontinuous.
However, if then
which is not only continuous but also differentiable.
Thus, (2.4.2) for the multi-root case corresponds to the continuity of
at for the single-root case, while and a certain
analytic property of correspond to
Theorem 2.4.2 Assume (2.4.2), A2.2.1, A2.2.2, and A2.2.4 hold. Then given by (2.1.1)–(2.1.3) is bounded, and the
right derivative for any convergent subsequence
if and only if condition A2.2.3 is satisfied, where is a connected subset of
Proof. Sufficiency. By Theorem 2.2.1 it follows that is bounded and We only need to show
Let be a convergent subsequence. Since is bounded,
the algorithm (2.1.1)–(2.1.3) becomes the one without truncations for large enough Therefore,
Notice that and hence
where as
Then from (2.4.5) we have
In (2.4.5) the last term tends to zero by A2.2.3, because is bounded and hence the indicator in (2.2.2) can be removed for sufficiently large By (2.4.2) the first term on the right-hand side of (2.4.7) also tends to zero as The left-hand side of (2.4.7) is
Consequently,
Necessity. We now assume is bounded, and
for any convergent subsequence and want to show A2.2.3. Let For any from (2.4.5) we have
From (2.4.6) it is seen that
where as
The assumption means that
where and
Noticing the continuity of from (2.4.10) and (2.4.11) it follows that
which combined with yields (2.4.9). Thus, we have
for any such that converges.
By the boundedness of (2.4.12) is equivalent to (2.2.2), and the
proof is completed.
Corollary 2.4.1 Assume (2.4.2), A2.2.1, A2.2.2, and A2.2.4 hold, and assume J is not dense in any connected set. Then given by (2.1.1)–(2.1.3) converges to some point in J if and only if A2.2.3 holds.
This corollary is a direct generalization of Theorem 2.4.1. The sufficiency part follows from Corollary 2.2.1, while the necessity part follows from Theorem 2.4.2 if one notices that convergence of implies
for sufficiently large The first term on the right-hand side of (2.4.8) tends to zero as by (2.4.2) and So, to verify A2.2.3 it suffices to
show that
Stochastic Approximation Algorithms withExpanding Truncations 49
2.5. Non-Additive Noise
In the algorithm (2.1.1)–(2.1.3) the noise in the observation is additive. In this section we continue considering (2.1.1)–(2.1.2), but in lieu of (2.1.3) we now have the non-additive noise
where is the observation noise at time
The problem is: under which conditions does the algorithm defined by (2.1.1), (2.1.2), and (2.5.1) converge to J, the root set of which is the average of with respect to its second argument? To be precise, let be a measurable function and let be a distribution function in The function is defined by
It is clear that the observation given by (2.5.1) can formally be expressed as one with additive noise:
and Theorems 2.2.1 and 2.2.2 can still be applied. The basic problem is how to verify A2.2.3. In other words, under which conditions on and does given by (2.5.3) satisfy A2.2.3?
Before describing the conditions to be used we first introduce some notation. We always take the regular version of the conditional probability. This makes the conditional distributions introduced later well-defined.
Let be the distribution function of and the conditional distribution of given where
Further, let us introduce the following coefficients,
where denotes the Borel σ-algebra in and, for a random variable where runs over all sets
with probability zero.
is known as the mixing coefficient of and it measures the dependence between and It is clear that measures the closeness of the distribution of to
The following conditions will be needed.
A2.5.2 (=A2.2.2);
A2.5.3 is a measurable function and is locally Lipschitz-continuous in the first argument, i.e., for any fixed
where is a constant depending on
A2.5.4 (Noise Condition)
i) is a process with mixing coefficient as uniformly in
ii)
where is defined in (2.5.6);
iii) as
Theorem 2.5.1 Assume A2.5.1–A2.5.4. Then for generated by (2.1.1), (2.1.2), and (2.5.1)
where is a connected subset of
The proof consists in verifying that Condition A2.2.3 is satisfied a.s. by given in (2.5.3). Then the theorem follows from Theorems 2.2.1 and 2.2.2.
We first prove some lemmas.
Lemma 2.5.1 Assume A2.5.1, A2.5.3, and A2.5.4 hold. Then there is an with such that for any and any bounded subsequence of say,
A2.5.1
as
(without loss of generality assume there exists an integer such that for all
if T is small enough, where is given by (2.1.1), (2.1.2), and (2.5.1), and is given by (1.3.2).
Proof. For any set
By setting in (2.5.6), it is clear that
From (2.5.7), it follows that
and
where (and hereafter) L is taken large enough so that
Since is a convergent martingale, there is a a.s.
such that
From (2.5.13) and it is clear that for any integer L the
series of martingale differences
converges a.s.
Denote by the set where the above series converges, and set
It is clear that
Let be fixed and with and
Then for any integer by (2.5.13) we have
where the first term on the right-hand side tends to zero as by (2.5.15).
Assume is sufficiently large such that
i) for if as or
ii) if
We note that in case ii) there will be no truncation in (2.1.1) for
Assume and fix a small enough T such that Let be arbitrarily fixed.
We prove (2.5.9) by induction. It is clear that (2.5.9) is true for Assume (2.5.9) is true for and there
is no truncation for if Noticing we have, by (2.5.16)
if is large enough.
This means that at time there is no truncation in (2.1.1), and
Lemma 2.5.2 Assume A2.5.1, A2.5.3, and A2.5.4 hold. There is an with such that if and if as
is a bounded subsequence of produced by (2.1.1), (2.1.2), and (2.5.1), then
Proof. Write
where
By (2.5.13), for we have
which converges to a finite limit as by the martingale convergence theorem.
Therefore, for any integers L and
converges a.s.
Therefore, there is with such that (2.5.23) holds for any integers L and
Let be fixed, By Lemma 2.5.1, for small
Then
for any by (2.5.23).
We now estimate (II). By Lemma 2.5.1 we have the following:
Noticing (2.5.7) and (2.5.14), we then have
Similarly, by Lemma 2.5.1 and (2.5.7)
Combining (2.5.18), (2.5.24), and (2.5.26) leads to
Therefore, to prove the lemma it suffices to show that the right-hand side of (2.5.27) is zero.
Applying the Jordan–Hahn decomposition to the signed measure,
and noticing that is a process with mixing coefficient we know that there is a Borel set D in such that for any
Borel set A in
and
Then, we have the following,
where
For any given there is a j such that
For any fixed by (2.5.13), (2.5.14), and it follows that
Therefore,
Since may be arbitrarily small, this combined with (2.5.27) proves the lemma.
Proof of Theorem 2.5.1. To prove the theorem it suffices to show that A2.2.3 is satisfied by
a.s. By Lemma 2.5.2, we need only prove that
for where is a bounded subsequence, and as
Assume
Applying the Jordan–Hahn decomposition to the signed measure,
we conclude that
where for the last inequality (2.5.8) and (2.5.12) are invoked. Since as the right-hand side of (2.5.32) tends to zero as for any This proves (2.5.31) and completes the proof
of Theorem 2.5.1.
Remark 2.5.1 From the expression (2.5.3) for the observation it is seen that the observation with non-additive noise can be reduced to the additive but state-dependent noise case, which was considered in Section 2.3. However, Theorem 2.5.1 is not covered by the theorems in Section 2.3, and vice versa.
2.6. Connection Between Trajectory Convergenceand Property of Limit Points
In the multi-root case, what we have established so far is that the distance between given by (2.1.1)–(2.1.3) and a connected subset of converges to zero under various sets of conditions.
As pointed out in Corollary 2.2.1, if J is not dense in any connected set, then converges to a point belonging to However, it is still not clear how behaves when J is dense in some connected set. The following example shows that may still fail to converge, although
Example 2.6.1 Let
and let
Take step sizes as follows:
We apply the RM algorithm (2.2.34) with As we may take
Then all conditions A2.2.1–A2.2.4 are satisfied.
Notice that
and
where k is such that
By (2.6.1), it is clear that in (2.6.2)
and
Therefore, is bounded and by Theorem 2.2.4.
As a matter of fact, changes from one to zero and then from zero to one, and this process repeats forever with decreasing step sizes.
Thus, is dense in [0,1]. This phenomenon indicates that for trajectory convergence of the stability-like condition A2.2.2 is not enough; a stronger form of stability is needed.
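The moral of Example 2.6.1, namely that distance convergence to a root set J does not force the trajectory itself to converge when J contains a continuum of roots, can be mimicked by a deterministic toy sequence (not the book's example): step back and forth across [0,1] with shrinking steps of size 1/k. The distance to J = [0,1] is identically zero, yet the sequence keeps sweeping the interval and never converges, because the step sizes are not summable.

```python
# Deterministic toy (not the book's Example 2.6.1): a sequence that stays
# in J = [0, 1], so d(x_k, J) = 0 for all k, yet oscillates between the
# endpoints forever and hence does not converge (sum of 1/k diverges).
def oscillating_sequence(n_steps):
    xs, x, direction, k = [], 0.0, +1, 1
    for _ in range(n_steps):
        x += direction * 1.0 / k       # step size 1/k, decreasing
        if x >= 1.0:
            x, direction = 1.0, -1     # bounce at the right endpoint
        elif x <= 0.0:
            x, direction = 0.0, +1     # bounce at the left endpoint
        xs.append(x)
        k += 1
    return xs

xs = oscillating_sequence(200000)
```

Even the last hundred thousand terms still sweep a substantial subinterval, so the sequence has a continuum of limit points despite d(x_k, J) = 0.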
Definition 2.6.1 A point i.e., a root of is called dominantly stable for if there exist a and a positive measurable function
which is bounded in the interval and satisfies the following condition
for all in the ball centered at with radius
Remark 2.6.1 Dominant stability implies stability. To see this, it suffices to take as the Lyapunov function. Then
The dominant stability of however, is not necessary for asymptotic stability.
Remark 2.6.2 Equality (2.6.3) holds for any whatever is. Therefore, all interior points of J are dominantly stable for Further, for a boundary point of J to be dominantly stable for it suffices to verify (2.6.3) for with small i.e., for all that are close to and outside J.
Example 2.6.2 Let
In fact, is the gradient of
In this example We now show that all points of J are dominantly stable for For this, by Remark 2.6.2, it suffices to show that all with are dominantly stable for and for this, it in turn suffices to show (2.6.3) for any with and for small enough Denoting by the angle between the vectors
and we have for
It is clear that
for all small enough Therefore, all points in J are dominantly stable for
Theorem 2.6.1 Assume A2.2.1, A2.2.2, and A2.2.4 hold. If for a
given is convergent and a limit point of generated
by (2.1.1)–(2.1.3) is dominantly stable for then for this trajectory
Proof. For any define
where is the one indicated in Definition 2.6.1.
It is clear that is well-defined, because there is a convergent subsequence: and for any greater than some If for any for some then by the arbitrariness of
Therefore, for proving the theorem, it suffices to show that, for any small an exists such that implies if
Since implies A2.2.3, all conditions of Theorem 2.2.1
are satisfied. By the boundedness of we may assume that is large enough so that the truncations no longer occur in (2.1.1)–(2.1.3) for It then follows that
Notice that for any and is bounded, and hence by (2.6.3)
for some because is convergent and
Further,
An argument similar to that used for (2.6.5) leads to
if is large enough.
Then from (2.6.6) we have
From (2.6.4) and (2.6.7) we see that we can inductively obtain
Then, noticing by the definitions of we have
where the elementary inequality
is used with for the first inequality in (2.6.8), and with for the third inequality in (2.6.8). Because is bounded,
and an exists such
that
This means that and completes the proof.
For convergence of SA algorithms we have imposed the stability-like condition A2.2.2 for and the dominant stability condition (2.6.3) for trajectory convergence. It is natural to ask: does a limit point of the trajectory possess a certain stability property? The following example gives a negative answer.
Example 2.6.3 Let
It is straightforward to check that
satisfies A2.2.2. Take where is a sequence of mutually independent random
variables such that a.s. Then with 1 being
a stable attractor for and all of A2.2.1–A2.2.4 are satisfied. Take Then by Theorem 2.2.1 it follows that
a.s. Since must converge to 0 a.s. Zero, however, is unstable for
In this example converges to a limit which is independent of the initial values and unstable, although conditions A2.2.1–A2.2.4 hold. This strange phenomenon happens because
as a function of is singular for some in the sense that it restricts the algorithm to evolve only in a certain set of Therefore,
in order for the limit of to be stable, imposing a certain regularity condition on and some restrictions on the noise is unavoidable.
As in Section 2.3, assume that the observation noise is with being a measurable function defined on Set
Let us introduce the following conditions:
A2.6.1 For a given is a surjection for any
A2.6.2 For any and is continuous in and for any and
where denotes the ball centered at with radius
It is clear that A2.6.2 is equivalent to A2.6.2’:
A2.6.2’ For any and any compact set
Before formulating Theorem 2.6.2 we first give some remarks on Conditions A2.6.1 and A2.6.2.
Remark 2.6.3 If does not depend on then in (2.6.9) can be removed when taking the supremum. In Condition A2.2.3
is a convergent subsequence, and hence is automatically located in a compact set. In the theorems of Sections 2.2, 2.3, 2.4, and 2.5, the initial value is fixed, and hence for fixed is a fixed sequence. In contrast, in Theorem 2.6.2 we will consider the case where the initial value varies arbitrarily, and hence for any fixed may be any point in If in (2.6.9) were not restricted to a compact set (i.e., with removed in (2.6.9)), then the resulting condition would be too strong. Therefore, putting in (2.6.9) makes the condition reasonable.
Remark 2.6.4 If is continuous and if then is a surjection.
By this property, is a surjection for a large class of For example, let be free of and let the growth rate of be no faster than linear as Then with satisfying A2.2.1 we have as for all Hence A2.6.1 holds. In the case where the growth rate of is faster than linear as and for some we also have that as for all and A2.6.1 holds.
In what follows, by stability of a set for we mean stability in the Lyapunov sense, i.e., a nonnegative continuously differentiable function
exists such that and for some where
Theorem 2.6.2 Assume A2.2.1, A2.2.2, and A2.6.2 hold, and that is continuous and for a given A2.6.1 holds. If defined by (2.1.1)–(2.1.3) with any initial value converges to a limit independent of
then belongs to the unique stable set of
Proof. Since by A2.2.2 and by continuity of exists with such that Hence By continuity of J is closed, and hence by A2.2.2,
Since we must have Denote by the connected subset of containing The minimizer set of that contains is closed and is contained in Since is a connected set and by A2.2.2 is nowhere dense, is a constant.
By continuity of all connected root-sets are closed and they are separated. Thus, there exists a such that i.e., contains no root of other than those located in
Set
Then and
Therefore, by definition, is stable for
We have to show that and that is the unique stable root-set.
Let be the connected set of such
that contains By continuity of for an arbitrarily small there exist such that and the distance
between the interval and the set is positive, i.e.,
We first show that, for any and there exist and such that, for any if then
By Theorem 2.2.1, for with sufficiently large there will be no truncation in (2.1.1)–(2.1.3), and
For any let By A2.6.2, sufficiently small
and large enough exist such that for any
If for then (2.6.10) immediately follows by setting Assume for some
Let be the first such one. Then
By (2.6.11), however,
which contradicts (2.6.12). Thus and (2.6.10) is verified.
For a given we now prove the existence of such that for any if where the dependence of
on and on the initial value is emphasized. For simplicity of writing, is written as in the sequel.
Assume the assertion is not true; i.e., for any there exists such that and for some
Suppose and
If there exists an with then with exists because is connected and with
This yields a contradictory inequality:
where the first inequality follows from A2.2.2, while the second inequality holds because is the minimizer of
Consequently, for any and
and a subsequence of exists, also denoted by for notational simplicity, such that By the continuity
of
Hence, by the fact
By (2.6.10) and the fact we can choose sufficiently
small T and large enough N such that
and i.e.,
for any By (2.6.10), exists with the property such that
Because as for sufficiently large N, by (2.6.10) the last term of (2.6.15) is Then
By (2.6.10) and the continuity of the third term on the right-hand side of (2.6.16) is and by A2.6.2 (since
with for all sufficiently large N), the norm of the second term on the right-hand side of (2.6.16) is also as Hence by A2.2.2 and (2.6.13), some exists such that the right-hand side of (2.6.16) is less than for all sufficiently large N if T is small enough. By noticing and mentioned
above, from (2.6.14) it follows that the left-hand side of (2.6.16) tends to a nonnegative limit as The obtained contradiction shows that exists such that for any if
With fixed, for any by A2.6.1 there exists such that By and the arbitrary smallness of it follows that Since by assumption, we have
which means that is stable. If another stable set existed such that then by the same argument would belong to The contradiction shows the uniqueness of the stable set.
2.7. Robustness of Stochastic ApproximationAlgorithms
In this section, for the single-root case, i.e., the case we consider the behavior of SA algorithms when the conditions for convergence of the algorithms to are not exactly satisfied. It will be shown that a “small” violation of the conditions causes no big effect on the behavior of the algorithm.
The following result, known as the Kronecker lemma, will be used several times in the sequel. We state it separately for convenience of reference.
Kronecker Lemma. If the series \sum_{k=1}^{\infty} x_k / b_k converges, where \{b_k\} is a sequence of positive numbers nondecreasingly diverging to infinity and \{x_k\} is a sequence of matrices, then
\frac{1}{b_n} \sum_{k=1}^{n} x_k \to 0 as n \to \infty.
Proof. Set s_0 = 0 and s_n = \sum_{k=1}^{n} x_k / b_k. Since s_n converges to some limit s, for any \epsilon > 0 there is an N such that \|s_k - s\| < \epsilon if k \ge N. Then, by a partial summation, it follows that
\frac{1}{b_n} \sum_{k=1}^{n} x_k = \frac{1}{b_n} \sum_{k=1}^{n} b_k (s_k - s_{k-1}) = s_n - \frac{1}{b_n} \sum_{k=1}^{n-1} (b_{k+1} - b_k) s_k,
and the last term tends to s as n \to \infty, since the weights (b_{k+1} - b_k)/b_n are nonnegative and sum to (b_n - b_1)/b_n \to 1 while s_k \to s. Hence \frac{1}{b_n} \sum_{k=1}^{n} x_k \to s - s = 0, and the lemma is proved.
We still consider the algorithm given by (2.1.1)–(2.1.3), where the iterate denotes the estimate at time but now the sought point may not be the exact root of As a matter of fact, the following set of conditions will be used to
replace A2.2.1–A2.2.4:
A2.7.1 nonincreasingly tends to zero, and
exists such that
A2.7.2 There exists a nonnegative twice continuously differentiable function such that and
A2.7.3 For the sample path the observation noise satisfies the following condition
A2.7.4 is continuous, but is not necessarily the root of
Comparing A2.7.1–A2.7.4 with A2.2.1–A2.2.4, we see that the following conditions required here are not assumed in Section 2.2: the nonincreasing
property of condition (2.7.1), nonnegativity of divergence of to infinity and continuity of but in (2.7.2), in (2.7.3), and
are allowed to be greater than zero.
Concerning we note that from the convergence of
it follows that i) A2.2.3 holds and ii) by the Kronecker lemma
because is nonincreasing. We will demonstrate
how the deviation of the estimate given by (2.1.1)–(2.1.3) from depends on and
For used in (2.1.1) define Since as can be taken sufficiently large such that
Let the initial truncation bound used in (2.1.1) and (2.1.2) be large enough that
Take real numbers such that
Since is continuous, an exists such that
Denote
and
where denotes the matrix consisting of the second partial derivatives of
Since we have for any and hence
and
Set
We will only consider those in (2.7.2) for which where is given in (2.7.7). From (2.7.7) and (2.7.8) it is seen that
Consequently, by (2.7.2), the given by (2.7.12) is positive.
By continuity of and and exist
such that the following inequalities hold:
By A2.7.3, for can be taken sufficiently large such that
Lemma 2.7.1 Assume A2.7.1, A2.7.2, and A2.7.4 hold with given in (2.7.3) being less than or equal to If for given by (2.1.1)–(2.1.3) with (2.7.5) fulfilled, for some where K is given in (2.7.18), then for any
Proof. Because is nondecreasing as T increases, it suffices to prove the lemma for
Assume the converse: there exists an such that
Then for any we have
and hence
which combined with the definition of leads to
On the other hand, from (2.7.20) and (2.7.21) it follows that
From (2.7.9) we have
By a partial summation we have
Applying (2.7.3) to the first two terms on the right-hand side of (2.7.25), and (2.7.1) and (2.7.3) to the last term, we find
From (2.7.24) and (2.7.26) it then follows that
which contradicts (2.7.22). This proves the lemma.
Lemma 2.7.2 Under the conditions of Lemma 2.7.1, for any the following estimate holds:
Proof. Since by Lemma 2.7.1 we have
and hence
Consequently, we have
Lemma 2.7.3 Assume A2.7.1–A2.7.4 hold and satisfies (2.7.7). Then for the sample path for which A2.7.3 holds, a that is independent of
and exists such that
in other words, given by (2.1.1)–(2.1.3) is bounded.
Proof. Let be a sufficiently large integer such that
where K is given by (2.7.18).
Assume the lemma is not true. Then there exist and such
that Let be the maximal integer satisfying the following equality:
Then by definition we have
and by (2.7.28) and (2.7.29),
We first show that under the converse assumption there must be an such that
Otherwise, for any and from (2.7.24) it follows that
This together with (2.7.30) implies
which contradicts the converse assumption. Hence (2.7.31) must hold.
By the definition of (2.7.6), and (2.7.30) we have
Since by (2.7.31), from (2.7.4) and (2.7.6) it follows that
We now show For this it suffices to prove by noticing (2.7.34).
Since, similarly to (2.7.32), we have
and hence
From (2.7.32) and (2.7.36) it is seen that
where for the second inequality (2.7.9) and are used, while for the last inequality (2.7.18) is invoked.
Paying attention to (2.7.10), we have and and by (2.7.16)
Then by (2.7.32) we see and (2.7.34) becomes
Thus, we can define
and have
Taking in Lemmas 2.7.1 and 2.7.2, and paying attention to (2.7.4) and we know By Lemmas 2.7.1 and 2.7.2, from (2.7.28) we see From (2.7.28)–(2.7.30) we have obtained which together with the definition of implies and hence Therefore, is well defined, and by Taylor’s expansion we have
where with components located in-between and
We now show that which, as will be shown, implies
a contradiction.By Lemma 2.7.2 we have
and hence
By (2.7.10) it follows that and by (2.7.11).
Using Lemma 2.7.1, we continue (2.7.41) as follows:
Noticing we see
It is clear that (2.7.35) and (2.7.37) remain valid with replaced
by Hence, similarly to (2.7.37) we have
By (2.7.11) and Taylor’s expansion we have
and consequently,
and
By (2.7.40), Substituting (2.7.44) into (2.7.43) and using (2.7.12) leads to
Estimating by a treatment similar to that used for
(2.7.26) yields
Noticing by Lemma 2.7.2 we find that
and
Hence, and by (2.7.15) from (2.7.45) it follows that
Using (2.7.14), from the above estimate we have
From (2.7.18) it follows that Taking notice of (2.7.13), by (2.7.17) we derive
On the other hand, by Lemma 2.7.2 and (2.7.11), (2.7.17), and (2.7.44) it follows that
where
From (2.7.39), (2.7.40), and (2.7.48) we see that
and hence which contradicts (2.7.47). This means that the converse assumption of the lemma cannot hold.
Corollary 2.7.1 From Lemma 2.7.3 it follows that there exist and which is independent of and arbitrarily varying in
intervals and such that
and for with sufficiently large the algorithm (2.1.1)–(2.1.3) turns into an ordinary RM algorithm:
Set
Take and denote
By A2.7.2, Set
If in (2.7.2), then In the general case may be positive.
Theorem 2.7.1 Assume A2.7.1–A2.7.4 hold and is given by (2.1.1)–(2.1.3) with (2.7.5) fulfilled. Then there exist and a nondecreasing, left-continuous function defined on such that for the sample path for which A2.7.3 holds,
whenever and where and are the ones appearing in (2.7.2) and (2.7.3), respectively. As a matter of fact, can be taken as
the inverse function of
Proof. Given recursively define
We now show that exists such that
Set and assume
From the recursion of we have
Assume is large enough such that by A2.7.3
By a partial summation, from (2.7.57) we find that
where (2.7.58) is invoked.
By (2.7.1) we see
Without loss of generality we may assume Then by (2.7.1) we have
Applying (2.7.60) and (2.7.61) to (2.7.59) leads to
and hence
which implies (2.7.56).
For and by (2.7.53)
Taking this into account, for by (2.7.51)–(2.7.54) and Taylor’s expansion we have
Therefore, in the following Taylor’s expansion
we have and hence and
Denote
For we have
From (2.7.63) and (2.7.64) it then follows that
Similar to (2.7.62), we see that
Consequently, we arrive at
Define
It is clear that is nondecreasing as increases and
Take such that Then we have
Define function
It is clear that is left-continuous, nondecreasing and
From (2.7.66) and (2.7.67) it follows that
which implies, by (2.7.57) and the definition of
Corollary 2.7.2 If in (2.7.2) may not be zero), then and the right-hand side of (2.7.55) will be
Since may be arbitrarily small, the estimation error may be arbitrarily small. If, in addition, in A2.7.3, then
letting and then in both sides of (2.7.55), we derive
In the case where by letting the right-hand side of (2.7.55) converges to
Consequently, as the estimation error depends on how big is If in (2.7.2), then can also be taken
arbitrarily small and the estimation error depends on the magnitude of
2.8. Dynamic Stochastic Approximation
So far we have discussed the root-searching problem for an unknown function, which is unchanged during the process of estimation. We now consider the case where the unknown functions together with their roots change with time. To be precise, let be a sequence of unknown
functions with roots i.e., Let be the estimate for at time based on the observations
Assume the evolution of the roots satisfies the following equation
where are known functions, while is a sequence of dynamic noises.
The observations are given by
where is the observation noise and is allowed to depend on
In what follows the discussion is for a fixed sample, and the analysis is purely deterministic. Let us arbitrarily take as the estimate for and define
From equation (2.8.1), we see that may serve as a rough estimate for In the sequel, we will impose some conditions on and so that
where is an unknown constant. Therefore, should not diverge to infinity. But is unknown, so we will use the expanding truncation technique.
Take a sequence of increasing numbers satisfying
Let be recursively defined by the following algorithm:
where denotes the number of truncations in (2.8.5) that have occurred up to time
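The truncation device itself is easy to express in code. The following sketch assumes the standard expanding-truncation form of Section 2.1 adapted to the present setting: the update is accepted while it stays inside the current truncation ball, and otherwise the estimate is reset to a fixed point and the ball is enlarged. The function name and the scalar setting are illustrative choices, not notation from the book.

```python
def expanding_truncation_step(x_pred, a_k, obs, sigma, M, x_star=0.0):
    """One schematic step of an expanding-truncation update.

    x_pred : predicted root (in the dynamic case, the known dynamics
             applied to the current estimate), a_k : step size,
    obs    : noisy observation, sigma : number of truncations so far,
    M      : increasing radii with M[j] -> infinity,
    x_star : fixed point used after a reset.
    """
    candidate = x_pred + a_k * obs
    if abs(candidate) <= M[sigma]:
        return candidate, sigma        # accepted: no truncation
    return x_star, sigma + 1           # truncated: reset and enlarge the ball
```

Because the radii tend to infinity and (as shown below) the number of truncations is finite, the algorithm eventually behaves like an untruncated RM scheme.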
We list conditions to be used.
A2.8.1 and
A2.8.2 is measurable and for any a constant possibly depending on exists such that
for with
A2.8.3 is known such that
for where
and
A2.8.4 and
A2.8.5 There is a continuously differentiable function such that for and for any
where is a positive constant possibly depending on and A constant exists such that
where is an unknown constant that is an upper bound for
A2.8.6 For any convergent subsequence the observation noisesatisfies
where
Remark 2.8.1 Condition A2.8.2 implies local boundedness, but the upper bound should be uniform with respect to In A2.8.3, measures the difference between the estimation error and the
prediction error In general, is greater than For example, if then A2.8.3 holds with A2.8.4 means that the noise in the root dynamics should be vanishing.
Like A2.2.2, Condition A2.8.5 is about the existence of a Lyapunov function. Imposing such a condition is unavoidable in the convergence analysis of SA algorithms. Inequality (2.8.7) is a mild condition. For example, if as then this condition is automatically satisfied. The noise condition A2.8.6 is similar to A2.2.3.
Before analyzing the convergence properties of the algorithm (2.8.5), (2.8.6), and (2.8.2), we give an example of application of dynamic stochastic approximation.
Example 2.8.1 Assume that a chemical product is produced in batch mode, and the product quality or quantity of the batch depends on the temperature in the batch. When the temperature equals the ideal one, the product is optimized. Let denote the deviation of the temperature from its optimal value for the batch, where denotes the control parameter, which may be, for example, the pressure in the batch, the quantity of catalytic promoter, the raw material proportion, and so on. The deviation reduces to zero if the control equals its optimal value i.e., Because of environmental changes, the optimal parameter may change from batch to batch. Assume
where is known and is the noise.
Let be the estimate for Then may serve as a prediction for Apply as the control parameter for the batch. Assume that the temperature deviation of for the th batch can be observed, but the observation may be corrupted by noise, i.e., where is the observation noise.
Then we can apply algorithm (2.8.5), (2.8.6), and (2.8.2) to estimate Under conditions A2.8.1–A2.8.6, by Theorem 2.8.1 to be proved in
this section, the estimate is consistent, i.e.,
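A toy numerical version of this example (a hedged sketch: the linear root dynamics f, the noise level, and the linear observation model are illustrative choices, and the truncation device is omitted because the iterates stay bounded here) shows the estimated control tracking the drifting optimum:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(r):                    # known root dynamics: slow deterministic drift
    return r + 1e-3

r = 0.0                      # true optimal control for the first batch
u = 5.0                      # initial guess for the controller
for k in range(1, 20001):
    # noisy deviation observed when control u is applied in batch k
    y = -(u - r) + 0.2 * rng.standard_normal()
    u = f(u + y / k)         # correct the estimate, then predict the next root
    r = f(r)                 # the true optimum drifts as well
tracking_error = abs(u - r)  # small after many batches
```

In this linear setting the error obeys a plain averaging recursion, so the estimate follows the moving root even though the root never stops drifting.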
Theorem 2.8.1 Under Conditions A2.8.1–A2.8.6 the estimation error tends to zero as where is given by (2.8.5),
(2.8.6), and (2.8.2).
To prove the theorem we start with lemmas.
Lemma 2.8.1 Under A2.8.3 and A2.8.4, the sequence is bounded for any
Proof. By A2.8.3 and A2.8.4 from (2.8.1) it follows that
Lemma 2.8.2 Assume A2.8.1–A2.8.4 and A2.8.6 hold. Let be a convergent subsequence such that as Then, there are a sufficiently small and a sufficiently large integer such that for
where is implied by
for where is a constant independent of
Proof. In the case as is bounded, and hence is bounded. By Lemma
2.8.1, is bounded. Therefore, is bounded. For large and
The following expression (2.8.11) and estimate (2.8.12) will frequently be used. By (2.8.1) and A2.8.3 we have
and
Substitution of (2.8.12) into (2.8.10) leads to
By boundedness of and A2.8.3,
for some By A2.8.4, while the last term is also
less than by A2.8.6. Without loss of generality, we may assume
Therefore, and the lemma is true for the case
We now consider the case as Let be so large that for
with being a constant, and
where is given by (2.8.8).
as
Without loss of generality we may assume
Define and take T so small that We prove the lemma by induction.
By (2.8.8) and (2.8.12), we have
Therefore, at time there is no truncation. Then by (2.8.11) and (2.8.12) we have
where (2.8.14) and (2.8.15) have been used. Let the conclusions of the lemma hold for
We prove that it also holds for Again by (2.8.12), we have
Hence there is no truncation at time By the inductive assumption, (2.8.11) and (2.8.12), it follows that
where (2.8.13) and (2.8.14) are invoked. Therefore, the conclusions of the lemma are also true for This
completes the proof.
Lemma 2.8.3 Assume A2.8.1–A2.8.6 hold. Then the number of truncations in (2.8.5) is finite and is bounded.
Proof. Using the argument in the proof of Lemma 2.8.2, the boundedness of follows from the boundedness of the number of truncations. Hence, it suffices to show that as
Assume the converse: as This means that the sequence is unbounded. Let be the sequence of truncation times. We prove that is also unbounded if
Assume is bounded. Then is also bounded. From we select a convergent subsequence, denoted by the same for notational simplicity, such that By assumption, truncation happens at the next time The obtained contradiction shows the unboundedness of in the case
Since algorithm (2.8.5) returns back to for infinitelymany times. Let Then
By Lemma 2.8.1, is bounded and by (2.8.8), Because is unbounded, starting from will exit the ball
with radius where is given by (2.8.7). Therefore, there is an interval and for any there is a sequence,
such that for and In other words, the values of at
the sequence cross the interval from the left. It is clear that Select from a convergent subsequence, denoted still by such that as
It is clear that
From now on, assume is large enough and T is small enough so that
Lemma 2.8.2 is applicable and it is valid with replaced bySince converges, by A2.8.5 and (2.8.12) it follows that
as Hence we have
By Lemma 2.8.2, forNoticing for small T we then have
In the following Taylor's expansion, is located between and and by Lemma 2.8.2, By (2.8.9)
and (2.8.11) we have
Notice that by Lemma 2.8.2 and (2.8.13)
for sufficiently large From (2.8.21) and (2.8.23), it follows that
On the other hand, by Lemma 2.8.2
Identifying and in A2.8.5 with and respectively, we can find such that
by A2.8.5.
Let us consider the right-hand side of (2.8.22). Noticingby A2.8.3 and A2.8.4 we have
By A2.8.6,
Noticing that
as and by continuity of we find that tends to zero as and
Since the sum of the first and second terms on the right-hand side of (2.8.22) is as and Combining this with (2.8.26) yields the following conclusion: for
with sufficiently large and for small enough T from (2.8.22) itfollows that
By (2.8.20), letting tend to infinity, from (2.8.30) we derive
By Lemma 2.8.2 we have
However, by definition,and Hence from (2.8.32), we must have
if T is small enough. Therefore, This contradicts (2.8.31). The obtained contradiction shows that
Theorem 2.8.2 Assume A2.8.1–A2.8.6 hold. Then the estimation error tends to zero as
Proof. We first show that converges. Assume the converse:
where because is bounded by Lemma 2.8.3. It is clear that there exists an interval that does not contain zero such that Without loss of generality, assume
From A2.8.6, it follows that there are infinitely manysequences such that and thatfor
Without loss of generality we may assume converges: Since exists such that and by
Lemma 2.8.2, Completely the same argument as that used for (2.8.22)–(2.8.32) leads to a contradiction. Hence is convergent.
We now show that as Assume the converse: there is a subsequence By the same argument we again arrive at (2.8.30). Letting by convergence of we obtain a contradictory inequality This implies that as
The following theorem is similar to Theorem 2.4.1.
Theorem 2.8.3 Assume A2.8.1–A2.8.5 hold and is continuous at uniformly in Then as if and only if A2.8.6
holds. Furthermore, under conditions A2.8.1–A2.8.5, the following threeconditions are equivalent.
1) Condition A2.8.6;
2)
3) can be decomposed into two parts: so that
Proof. Assume as Then is bounded. Wehave shown in the proof of Lemma 2.8.3 that the number of truncationsmust be finite if is bounded. Therefore, starting from some thealgorithm (2.8.5) becomes
and as
From (2.8.11) we have
Set
By A2.8.3 and A2.8.4 and as while tends to zero because
is uniformly continuous at and Consequently,3) holds.
On the other hand, it is clear that 3) implies 2), which in turn implies A2.8.6. By Theorem 2.8.1, under A2.8.1–A2.8.5, Condition A2.8.6 implies as
Thus, the equivalence of 1)–3) has been justified under A2.8.1–A2.8.5.
2.9. Notes and References

The initial version of SA algorithms with expanding truncations and
its associated analysis method were introduced in [27], where the algorithm was called SA with randomly varying truncations. Convergence results for algorithms of this kind can also be found in [14, 28]. Theorems given in Section 2.2 are improved versions of those given in [14, 27, 28]. Theorems in Section 2.3 can be found in [18]. Necessity of the noise condition is proved in [24, 94] for the single-root case, and in [17] for the multi-root case.
Convergence results for SA algorithms with additive noise can be found in [16]. Concerning measure theory, we refer to [31, 76, 84]. Results given in Section 2.6 can be found in [48], and some related problems are discussed in [3]. For the proof of Remark 2.6.4 we refer to Theorem 3.3 in [34]. Example 2.6.1 can be found in [93]. Robustness of SA algorithms is presented in [24]. The dynamic SA was considered in [38, 39, 91], but the results presented in Section 2.8 are given in [25].
Chapter 3
ASYMPTOTIC PROPERTIES OF STOCHASTIC APPROXIMATION ALGORITHMS
In Chapter 2 we were mainly concerned with the path-wise convergence analysis for SA algorithms with expanding truncations. Conditions were given to guarantee where J denotes the
root set of the unknown function, and the estimate for the unknown root given by the algorithm.
In this chapter, for the case where J consists of a singleton weconsider the convergence rate of asymptotic normality ofand asymptotic efficiency of the estimate.
Assume is differentiable at Then as
where
It turns out that the convergence rate heavily depends on whether
or not F is degenerate. Roughly speaking, in the case where the stepsize in (2.1.1) the convergence rate of forsome positive when F is nondegenerate, and for some
when F vanishes.
It will be shown that is asymptotically normal and the covariance matrix of the limit distribution depends on the matrix D if in (2.1.1) the step size is replaced by If F in (3.0.1) is available, then D can be defined to make the limiting covariance matrix minimal, i.e., to make the estimate efficient. However, this is not the case in SA. To overcome the difficulty, one way is to derive the approximate value of F by estimating it, but for this one has to impose rather heavy conditions on Efficiency here is derived by using a sequence of slowly
decreasing step sizes, and the averaged estimate turns out to be asymptotically efficient.
3.1. Convergence Rate: Nondegenerate Case

In this section, we give the rate of convergence of to zero
in the case where F in (3.0.1) is nondegenerate, where is given by (2.1.1)–(2.1.3). It is worth noting that F is the coefficient of the first-order term in the Taylor expansion of
The following conditions are to be used.
A3.1.2 A continuously differentiable function exists such that
for any and for some with
where is used in (2.1.1).
A3.1.3 For the sample path under consideration the observation noise in (2.1.3) can be decomposed into two parts such that
for some
A3.1.4 is measurable and locally bounded, and is differentiable at such that as
The matrix F is stable (this implies nondegeneracy of F); in addition, is also stable, where and are given by (3.1.1) and (3.1.3), respectively.
By stability of a matrix we mean that all its eigenvalues have negative real parts.
and
Remark 3.1.1 We now compare A3.1.1–A3.1.4 with A2.2.1–A2.2.4. Because of the additional requirement (3.1.1), A3.1.1 is stronger than A2.2.1, but it is automatically satisfied if with In this case a in (3.1.1) equals Also, (3.1.1) is satisfied if with
In this case Take sufficiently small such that
Then and Assume
is a martingale difference sequence with
Then by the convergence theorem for martingale difference sequences, Therefore (3.1.3) is satisfied a.s. with
Condition A3.1.4 assumes differentiability of which is not required in A2.2.4.
Lemma 3.1.1 Let and H be -matrices. Assume H is stable and If satisfies A3.1.1 and l-dimensional vectors
satisfy the following conditions
then defined by the following recursion with arbitrary initial value tends to zero:
Proof. Set
We now show that there exist constants and such that
Let S be any negative definite matrix. Consider
at
Since H is stable, the positive definite matrix P is well-defined. Integrating by parts, we have
which implies
This means that if H is stable, then for any negative definite matrix S we can find a positive definite matrix P satisfying equation (3.1.9). This fact is called the Lyapunov theorem, and (3.1.9) is called the Lyapunov equation. Consequently, we can find P > 0 such that
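In standard form the Lyapunov equation (3.1.9) reads H^T P + P H = S with S negative definite and P positive definite. As a small numerical illustration (the helper function is ours; it solves the equation by vectorization rather than by the integral formula used in the proof):

```python
import numpy as np

def lyapunov_P(H, S):
    """Solve H^T P + P H = S for P by vectorization:
    (I (x) H^T + H^T (x) I) vec(P) = vec(S), column-major vec."""
    n = H.shape[0]
    A = np.kron(np.eye(n), H.T) + np.kron(H.T, np.eye(n))
    vecP = np.linalg.solve(A, S.flatten(order="F"))
    return vecP.reshape((n, n), order="F")

H = np.array([[-1.0, 1.0], [0.0, -2.0]])   # stable: eigenvalues -1 and -2
S = -np.eye(2)                             # any negative definite matrix
P = lyapunov_P(H, S)                       # positive definite solution
```

For this H and S = −I one gets P = [[1/2, 1/6], [1/6, 1/3]], whose eigenvalues are positive, illustrating the Lyapunov theorem.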
where I denotes the identity matrix of compatible dimension. Since there exists such that for
Consequently,
Without loss of generality we may assume that is sufficiently large such that for
for some constant where the first inequality is because as and while the second inequality is elementary. Combining (3.1.11) and (3.1.12) leads to
as
and hence
where denotes the minimum eigenvalue of P. Noting that
from (3.1.13) we derive
which verifies (3.1.8). From (3.1.6) it follows that
We have to show that the right-hand side of (3.1.14) tends to zero as
For any fixed because of (3.1.1) and (3.1.8). This implies that as for any initial value
Since as for any exists such that Then by (3.1.8) we have
The first term on the right-hand side of (3.1.15) tends to zero by A3.1.1, while the second term can be estimated as follows:
as
where the first inequality is valid for sufficiently large since as and the second inequality is valid when
Therefore, the right-hand side of (3.1.15) tends to zero as and then
Set
By assumption of the lemma Hence, for any there exists such that By a partial summation, we have
where, except for the last term, the sum of the remaining terms tends to zero as by (3.1.8) and
Let us now estimate
Since for and as by (3.1.8) we have
which tends to zero as and by (3.1.16) and the fact that Thus, the right-hand side of (3.1.17) tends to
zero as and the proof of the lemma is completed.
Theorem 3.1.1 Assume A3.1.1–A3.1.4 hold. Then given by (2.1.1)–(2.1.3), for those sample paths for which (3.1.3) holds, converges to with the following convergence rate:
where is the one given in (3.1.3).
Proof. We first note that by Theorem 2.4.1 and there is no truncation after a finite number of steps. Without loss of generality, we may assume
By (3.1.1), Hence, by the Taylor’s expansion we
have
Write given by (3.1.4) as follows
where
By (3.1.4) and (3.1.19), for sufficiently large k we have
where
By (3.1.1), (3.1.3) we have
Denote
Then (3.1.22) can be rewritten as
Noticing that which is stable by A3.1.4, we see
that all conditions of Lemma 3.1.1 are satisfied. Hence, by the lemma, which proves the theorem.
Remark 3.1.2 Consider the dependence of the convergence rate on the step size Take and let in (3.1.3). In order to
have it suffices to require a.s.,
if is a martingale difference sequence with
So, for (3.1.25) it is sufficient to require
Since the best convergence rate is achieved at the convergence rate is Since
the convergence rate slows down as approaches When (3.1.25) cannot be guaranteed. From this it is seen that the convergence rate depends on how big is.
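This dependence on the step size can be observed numerically. In the sketch below (an illustrative scalar setup: linear function with slope F = −1, step size 1/k, i.i.d. standard normal noise, and all paths started at the root so that the error is pure noise), the root-mean-square error over independent replicates behaves like n^{−1/2}, so multiplying n by 16 should divide the RMSE by about 4:

```python
import numpy as np

rng = np.random.default_rng(0)
R = 400                          # independent replicates
x = np.zeros(R)                  # all paths start at the root x0 = 0
rmse = {}
for k in range(1, 8001):
    # one RM step per path: a_k = 1/k, observation -x + noise
    x += (1.0 / k) * (-x + rng.standard_normal(R))
    if k in (500, 8000):
        rmse[k] = float(np.sqrt(np.mean(x ** 2)))

ratio = rmse[500] / rmse[8000]   # roughly 4 for an n^{-1/2} rate
```

In this particular linear case the error after n steps is exactly the average of the first n noises, so the n^{−1/2} behavior is not merely asymptotic.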
3.2. Convergence Rate: Degenerate Case

In the previous section, for obtaining the convergence rate of
stability and hence nondegeneracy of F is an essential requirement. We now consider what will happen if the linear term vanishes in the Taylor expansion of For this we introduce the following set of conditions:
A3.2.2 A continuously differentiable function exists such that
for any and for some with where is used in (2.1.1);
A3.2.3 For the observation noise on the sample path under consideration the following series converges:
where
A3.2.4 is measurable and locally bounded, and is differentiable at such that as
where F is a stable matrix, and is the one used in A3.2.3.
A3.2.1 and

We first note that in comparison with A3.1.1–A3.1.4, here we do not require (3.1.1), but A3.2.2 is the same as A3.1.2. From (3.2.3) we see that the Taylor expansion of does not contain the linear term. Here F is the coefficient of a term of order higher than the second in the Taylor expansion of The noise condition A3.2.3 is different from A3.1.3 but, as will be shown by the following lemma, it also implies A2.2.3.
Lemma 3.2.1 If (3.2.2) holds, then and hence A2.2.3
is satisfied.
Proof. We need only to show
Setting
by a partial summation we have
Since as and converges as the first two terms on the right-hand side of (3.2.4) tend to zero as and
The last term in (3.2.4) is dominated by
where
By the following elementary calculation we conclude that the right-hand side of (3.2.5) tends to zero as and
as
which tends to zero as and because as
Combining this with (3.2.4) and (3.2.5) shows that
By the Lyapunov equation (3.1.9), there is a positive definite matrixP > 0 such that
Assuming is large enough so that there is no truncation, by (3.2.3) we have
where is the maximum eigenvalue of P given by (3.2.6).
We start with lemmas. Note that by Theorem 2.2.1 or 2.4.1, Therefore, starting from some the algorithm has no truncation.
Define
Denote by and the maximum and minimum eigenvalues of P, respectively, and by K the condition number
Theorem 3.2.1 Assume A3.2.1–A3.2.4 hold and is given by (2.1.1)–(2.1.3). Then for the sample paths where A3.2.3 holds the following convergence rate takes place:
where
Define
Lemma 3.2.2 Assume A3.2.1–A3.2.4 hold. Then is bounded.
Proof. Since exists such that and
where, and hereafter, denotes the smaller of a and b.
By the definition of we have and there exists such that
Assuming is large enough such that for we also have
Define
where and hereafter
and
Since
If then is bounded. Otherwise, let We need only to consider the case since if it is not true, then is clearly bounded.
Let P be given by (3.2.6). We have
where
In what follows we will prove that
By (3.2.10) and (3.2.6) it is clear that
where the last inequality follows from the following consideration:
By (3.2.11) so for (3.2.16) it suffices to show that
By definition of we have and hence
or
Consequently,
and by the agreement
which verifies the last inequality in (3.2.16). We now estimate By (3.2.10), (3.2.11) and the agreement
we have
Noticing that, as agreed, from (3.2.17) we have
and by (3.2.13),
Again, from (3.2.10) and noticing we have
Consequently, by (3.2.12)
Combining (3.2.14), (3.2.16), (3.2.18), and (3.2.20) yields
for
and
Similar to (3.2.14) we treat the right-hand side of the above inequality as follows.
By the same argument as that used above, we can show that
and inductively we derive
Thus, by (3.2.12) and the definition of
or
This contradicts the definition of and hence must be infinite. Consequently, is bounded.
Proof of Theorem 3.2.1. By Lemma 3.2.2 and the fact
we have
where
By setting
from (3.2.9) it follows that
This is nothing but an RM algorithm. Since is bounded by Lemma 3.2.2, no truncation is needed and one may apply Theorem 2.2.1''.
First note that
Hence, A2.2.1 is satisfied.
as So A2.2.3 holds with replaced by
A2.2.4 is clearly satisfied, since is continuous. The key issue is to
find a satisfying A2.2.2''. Take
and define which is closed. Notice
Notice and
as
by
For Then we have
This means that
and the condition A2.2.2'' holds. By Theorem 2.2.1'', This implies
which in turn implies (3.2.7) by (3.2.8).
Imposing some additional conditions on F, we may obtain results more precise than (3.2.7) by using different Lyapunov functions.
Theorem 3.2.2 Assume A3.2.1–A3.2.4 hold and, in addition, that F is normal, i.e., Let be given by (2.1.1)–(2.1.3). Then
for those sample paths for which A3.2.3 holds, converges
to either zero or one of where denotes an eigenvalue of
More precisely,
where is a unit eigenvector of H corresponding to
Proof. Since F is stable, the integral
is well defined. Noticing that we have
and
and
for
This means that H is also stable. Therefore, all eigenvalues are negative. Further, by we find
and hence
We consider (3.2.23) and take
By (3.2.26) we have
Define
Obviously,
for any
Clearly,
where is the dimension of
Thus, J is a discrete set, and is nowhere dense because is
continuous. This together with (3.2.28) shows that A2.2.2’ is satisfied.
and
By Theorem 2.2.1’, and (3.2.25) is verified.
Corollary 3.2.1 Let Then
In this case,
and hence (3.2.7) and (3.2.25) are respectively equivalent to
and
Remark 3.2.1 For the convergence rate given by (3.1.18) for the nondegenerate case is while for the degenerate case is
by (3.2.29), which is much slower than
3.3. Asymptotic Normality

In Theorem 3.1.1 we have shown that given
by (2.1.1)–(2.1.3). As shown in Remark 3.1.2, This is a path-wise result. Assuming the observation noise is
a random sequence, we show that is asymptotically normal,
i.e., the distribution of converges to a normal distribution as This convergence implies that in the convergence rate
cannot be improved to
We first consider the linear regression case, i.e., is a linear function, but may be time-varying.
tion, but may be time-varying.Let us introduce a central limit theorem on double-indexed random
variables. We formulate it as a lemma.
Lemma 3.3.1 Let be an array of l-dimensional random vectors. Denote
as
for if
and
Assume
and
Then
where and hereafter denotes the normal distribution with mean and covariance S.
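Before turning to the linear recursion, the kind of limit the lemma describes can be checked by simulation. For the scalar iteration x_{k+1} = x_k + (1/k)(−x_k + ξ_k) started at the root with i.i.d. N(0, 1) noise, x_{n+1} is exactly the sample mean of ξ_1, …, ξ_n, so √n · x_{n+1} is standard normal. The sketch below (an illustrative setup, not the lemma's general double-indexed array) confirms this:

```python
import numpy as np

rng = np.random.default_rng(0)
R, n = 2000, 500                 # replicates and iterations per replicate
x = np.zeros(R)
for k in range(1, n + 1):
    # after k steps, x equals the running mean of the first k noises
    x += (1.0 / k) * (-x + rng.standard_normal(R))
z = np.sqrt(n) * x               # approximately (here: exactly) N(0, 1)
```

Across the replicates, z has empirical mean near 0 and standard deviation near 1, as the asymptotic normality predicts.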
Let us first consider the linear recursion (3.1.6) and derive its asymptotic normality. We keep the notation introduced by (3.1.7).
We have obtained estimate (3.1.8) for and now derive more properties for it.
Lemma 3.3.2 Assume and
H where H is stable. Then for any
Proof. By (3.1.8) it follows that
We will use the following elementary inequality
which follows from the fact that the function equals zero at x = 0 and its derivative By (3.3.8), we derive
which implies
Assume is sufficiently large such that Then
where for the last inequality (3.3.9) is invoked. Combining (3.3.7) and (3.3.10) gives (3.3.6).
Lemma 3.3.3 Set
Under conditions of Lemma 3.3.2,
uniformly with respect to and uniformly with respect to
Proof. Expanding to the series
with we have
where by definition
By stability of H, there exist constants and p > 0 such that
Putting (3.3.13) into (3.3.12) yields that for any
where for the last inequality is assumed to be sufficiently large such that and (3.1.8) is used too.
as
as
Since and may be arbitrarily small, the conclusions
of the lemma follow from (3.3.14) by Lemma 3.3.2.
Lemma 3.3.4 Assume as and
Let A, B, and Q be matrices and let A and B be stable. Then
Proof. For any T > 0 define
Since for fixed T. Denoting
by we then have Consequently,
serves as an integral sum for or equivalently, for
and hence
Therefore, for (3.3.15) it suffices to show that
Similar to (3.3.10), by stability of A we can show that there is a constant such that
By stability of A and B, constants and exist such that
Consequently, we have
which verifies (3.3.18) and completes the proof of the lemma.
Theorem 3.3.1 Let be given by (3.1.6) with an arbitrarily given initial value. Assume the following conditions hold:
where are constant matrices with is
a martingale difference sequence of dimension satisfying the following conditions:
and
and is stable;
and
as
and
Then is asymptotically normal:
where
Proof. Define by the following recursion
By (3.1.6) it follows that
Using (3.3.19) we have
Consequently,
where
and
by (3.3.20).
Define
By (3.3.30) and stability of A, from (3.1.8) it follows that constants and exist such that
Consequently, from (3.3.29) we have
The first term on the right-hand side of (3.3.34) tends to zero as by (3.3.33), while the second term is estimated as follows. By (3.3.31)
where for the last equality, Lemma 3.3.2 and (3.3.33) are used. This means that r and have the same limit distribution if exists.
Consequently, for the theorem it suffices to show
Similar to (3.3.29) and (3.3.31), by (3.3.28) we have
Noticing
by Lemma 3.3.2 and (3.1.8), we find that the last term of (3.3.36) tends to zero in probability. Therefore, for (3.3.24) it suffices to show
We now show that for (3.3.37) it is sufficient to prove
For any fixed we have
By (3.3.21) we have
where convergence to zero follows from and Lemma 3.3.2.
By (3.3.21) and we see that
It is worth noting that the convergence is uniform with respect to This implies that the second term on the right-hand side of (3.3.39) tends to zero in probability. The first term on the right-hand side of (3.3.39) can be rewritten as
By (3.3.33) for any fixed we estimate the first term of (3.3.40) as follows
while for the second term we have
since and
We now show that the last term of (3.3.40) also converges to zero in probability as
Notice that by (3.3.28), for any fixed and
Therefore, for a fixed there exist constants and such that
as
Then the last term of (3.3.40) is estimated as follows:
For the first term on the right-hand side of (3.3.44) we have
where the last inequality is obtained because is bounded
by some constant by (3.3.30). Since is fixed, in order to prove that the right-hand side of (3.3.45) tends to zero as it suffices to show
By (3.3.33), for any fixed
while for any given we may take sufficiently large such that Therefore,
by Lemma 3.3.2. Incorporating (3.3.47) with (3.3.48) proves (3.3.46). Therefore, the
right-hand side of (3.3.45) tends to zero as This implies that the first term on the right-hand side of (3.3.44) tends to zero in probability.
By (3.3.43), for the last term of (3.3.44) we have
which tends to zero as as can be shown by an argument similar to that used for (3.3.45).
In summary we conclude that the right-hand side of (3.3.44) tends to zero in probability, and hence all terms in (3.3.40) tend to zero in probability. This implies that the right-hand side of (3.3.39) tends to zero in probability as and then Thus, we have shown that for (3.3.37) it suffices to show (3.3.38).
We now intend to apply Lemma 3.3.1, identifying
to in that lemma. We have to check conditions of the lemma. Since is a martingale difference sequence, (3.3.1) is obviously
satisfied.
By (3.3.22) and Lemma 3.3.2,
This verifies (3.3.3). We now verify (3.3.2). We have
where the last term tends to zero by (3.3.22) and Lemma 3.3.2. We show that the first term on the right-hand side of (3.3.49) tends
to (3.3.25). With A and respectively identified with H and in Lemma 3.3.3,
by Lemmas 3.3.2 and 3.3.3 we have
Incorporating this with (3.3.49) leads to
By Lemma 3.3.4 we conclude
Finally, we have to verify (3.3.4).
By (3.3.33) we have
Noticing that uniformly with respect to
since or equivalently,
uniformly with respect to by (3.3.23) we have
Consequently, for any by Lemma 3.3.2
Thus, all conditions of Lemma 3.3.1 hold, and by this lemma we conclude (3.3.38). The proof is completed.
Remark 3.3.1 Under the conditions of Theorem 3.3.1, if integers are such that then it can be
shown that converges in distribution to where is a stationary Gaussian Markov process satisfying
the following stochastic differential equation
where is the standard Wiener process.
Corollary 3.3.1 From (3.1.7) and (3.3.28), similar to (3.3.29)–(3.3.31) we have
and
By (3.3.33), the first term on the right-hand side of (3.3.50) tends to zero as Note that the last term in (3.3.34) has been proved to vanish as and it is just another way of writing
Therefore, from (3.3.50) by Theorem 3.3.1, it follows that for any fixed
We have discussed the asymptotic normality of for the case
where is linear. We now consider the general Let us first introduce conditions to be used.
and
A3.3.2 A continuously differentiable function exists such that
for any and for some with where is used in (2.1.1).
for some
where is a martingale difference sequence satisfying (3.3.21)–(3.3.23).
A3.3.3
A3.3.4 is measurable and locally bounded. As
where with a specified in (3.3.52) is stable and
satisfying which is specified in (3.3.53).
Theorem 3.3.2 Let be given by (2.1.1)–(2.1.3) and let A3.3.1–A3.3.4 hold. Then
where
Proof. Since there exists such that
which implies From (3.3.53) it follows that
This together with the convergence theorem for martingale difference sequences yields
which implies
Since from it follows that Stability of is implied by stability of which is a part of A3.3.4. Then by Theorem 3.1.1
By (3.3.55) and (3.3.58) we have
From Theorem 3.1.1 we also know that there is an integer-valued (possibly depending on sample paths) such that
and there is no truncation in (2.1.1) for Consequently, for we have
Denoting
by (3.3.59) and (3.3.54) we see a.s. Then (3.3.60) is written as
By (3.3.28) it follows that
where
Using introduced by (3.3.32), we find
By an argument similar to that used in Corollary 3.3.1, we have
and as
Then by (3.3.51) from (3.3.63) we conclude (3.3.56).
Corollary 3.3.2 Let D be an matrix and let in (2.1.1)–(2.1.2) be replaced by In other words, instead of (2.1.1) and (2.1.2) if we consider
then this is equivalent to replacing and by and respectively.
In this case the only modification to be made in the conditions of Theorem 3.3.2 is that stability of in A3.3.4 should be replaced by stability of The conclusion of Theorem 3.3.2 remains valid with the only modification that and F in (3.3.57) should be replaced by and DF, respectively.
3.4. Asymptotic Efficiency
In Corollary 3.3.2 we have mentioned that the limiting covariance matrix S(D) for depends on D, if in (2.1.1)–(2.1.3) is replaced by By efficiency we mean that S(D) reaches its minimum with respect to D.
Denote
By Corollary 3.3.2, the limiting covariance matrix for with given by (3.3.64)–(3.3.66) is expressed by
Theorem 3.4.1 Assume is stable. i) If then S(D) reaches its minimum at and where ii) If then as
Proof. i) Integrating by parts, we have
This means that S(D) satisfies the following algebraic Riccati equation
By stability of and DF is nondegenerate. Thus, (3.4.3) is equivalent to
or
or
From (3.4.4) it follows that
and the equality is achieved at
ii) If then
When then
For the commonly used step size i.e., specified in (3.3.52) equals By Theorem 3.4.1 the optimal
and the optimal step size is For
the limiting covariance matrix is Therefore, the optimal limiting covariance matrix for is no matter what
is taken in Let us take Then and the optimal In this
case and is the minimum of the limiting covariance matrix. However, is unknown and is unknown too. Hence, cannot be directly used in the algorithm. To achieve asymptotic efficiency, one way is to estimate F and replace the optimal step size by its estimate This is the so-called adaptive SA. But, to guarantee its convergence and optimality, rather restrictive conditions are needed.
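The adaptive-SA idea just mentioned can be illustrated numerically. The toy below is my own construction, not from the book: a scalar linear regression function with slope F = 2 and root 3, Gaussian observation noise, the slope re-estimated from noisy finite differences and its inverse plugged into the step size.

```python
import random

def adaptive_rm(f_obs, x_init, n_iter=20000, seed=0):
    """Sketch of adaptive SA: the unknown slope F of the regression
    function is re-estimated from noisy finite differences, and the
    step size a_k = 1/(F_hat * k) mimics the optimal choice 1/(F k)."""
    rng = random.Random(seed)
    x = x_init
    F_hat = 1.0                    # running estimate of the slope F
    for k in range(1, n_iter + 1):
        # two extra observations give a noisy slope estimate
        slope = -(f_obs(x + 0.5, rng) - f_obs(x - 0.5, rng)) / 1.0
        F_hat += (slope - F_hat) / k          # running average of slopes
        a = 1.0 / (max(F_hat, 0.5) * k)       # adaptive step, kept bounded
        x = x + a * f_obs(x, rng)             # Robbins-Monro update
    return x, F_hat

# toy regression function with root x0 = 3 and slope F = 2
def f_obs(x, rng):
    return -2.0 * (x - 3.0) + rng.gauss(0.0, 1.0)
```

The clamp max(F_hat, 0.5) is a practical safeguard against a bad early slope estimate; it stands in for the "rather restrictive conditions" the text alludes to.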
Let be estimates for being the root of satisfying
where F is stable and The estimates are obtained on the basis ofobservations
with
If then we call
asymptotically efficient for
To achieve asymptotic efficiency we apply the averaging technique, which is different from adaptive SA.
For satisfying A3.3.1, if in (3.3.52) equals zero, then
is called a slowly decreasing step size. As a typical example of slowly decreasing step sizes, one may take
Let be generated by (2.1.1)–(2.1.3) with slowly decreasing
Define
In what follows we will show that is asymptotically normal and is asymptotically efficient.
We list the conditions to be used.
A3.4.1 nonincreasingly converges to zero,
and for some
A3.4.2 A continuously differentiable function exists such that
for any and for some withwhere is used in (2.1.1).
A3.4.3 The observation noise is such that
with being a constant independent of and
where is specified in (3.4.7).
A3.4.4 is measurable and locally bounded. There exist a stable ma-trix F, and such that
where is a constant.
Remark 3.4.1 It is clear that satisfies A3.4.1. From (3.4.7) it follows that
where denotes the integer part of
Since is nonincreasing, from (3.4.12) we have
which implies
or
Remark 3.4.2 If with being a martingale
difference sequence satisfying (3.3.21)–(3.3.23), then identifying to in Lemma 3.3.1, by this lemma we have
where is given by (3.4.1). Thus, in this case the second condition in(3.4.8) holds.
We now show that the first condition in (3.4.8) holds too.
By the estimate for the weighted sum of martingale difference sequences (see Appendix B) we have
which together with (3.4.13) yields
It is clear that (3.4.9) is implied by (3.3.21). Therefore, in the present case all requirements in A3.4.3 are satisfied.
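The summability requirement behind this remark can be checked numerically. The toy below is my own illustration (i.i.d. Gaussian noise as the martingale difference sequence, weights a_i = i^(-0.7); these specific choices are not from the book): for an exponent greater than 1/2 the squared weights are summable, so the weighted partial sums settle down, in line with the convergence theorem for weighted sums of martingale differences.

```python
import random

def weighted_noise_partial_sums(gamma=0.7, n=200000, seed=1):
    """Partial sums S_k = sum_{i<=k} a_i * eps_i with a_i = i**(-gamma)
    and eps_i i.i.d. N(0,1), a martingale difference sequence.  For
    gamma > 1/2 we have sum a_i**2 < infinity, so S_k converges a.s."""
    rng = random.Random(seed)
    s = 0.0
    checkpoints = []
    for i in range(1, n + 1):
        s += i ** (-gamma) * rng.gauss(0.0, 1.0)
        if i in (n // 4, n // 2, n):
            checkpoints.append(s)
    return checkpoints
```

The three recorded checkpoints differ only slightly, reflecting the Cauchy-like tail of the convergent series.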
Theorem 3.4.2 Assume A3.4.1–A3.4.4 hold. Let be given by (2.1.1)–(2.1.3) and be given by (3.4.5). Then is asymptotically efficient:
Prior to proving the theorem we establish some properties of slowly decreasing step sizes.
Set
By (3.1.8) we have
where and are constants.
Set
Lemma 3.4.1 i) The following estimate takes place
where o(1) denotes a magnitude that tends to zero asii) is uniformly bounded with respect to both and
and
Proof. i) By (3.4.6) we know that
and
which implies (3.4.17) since asii) By (3.4.6) as and hence for any we have
where denotes the integer part of
Using (3.4.15) we have
for any where the first term on the right-hand side tends to zero as by (3.4.20), and the last term tends to zero as Therefore, for (3.4.18) it suffices to show
Noticing that (3.4.13) implies for any we have
and hence
By (3.4.6) where as
Taking this into account, by (3.4.15) and (3.4.17) we have
where as
Thus, by (3.4.23) we have
This implies (3.4.21), and together with (3.4.15) shows that is uniformly bounded with respect to both and
We now express given by (2.1.1)–(2.1.3) in a different form by introducing a sequence of stopping times and a sequence of processes
To be precise, define
where by definition
Recall that is the sequence used in (2.1.1)–(2.1.3). It is clear that
Similarly, define
where
Recursively define
where
As a matter of fact, is the first exit time of from the sphere with radius after time and during the time period
evolves the same as and is recursively defined as an RM process. Therefore, given by (2.1.1)–(2.1.3) can be expressed as
Lemma 3.4.2 Under Conditions A3.4.1–A3.4.4, there exists an integer-valued such that a.s., a.s., and given by (2.1.1)–(2.1.3) has no truncation for i.e.,
and a.s.
Proof. If we can show that A2.2.3 is implied by A3.4.3, then all conditions of Theorem 2.2.1 are fulfilled a.s., and the conclusions of the lemma follow from Theorem 2.2.1.
Since we have
which means that (2.2.2) is satisfied for
We now check (2.2.2) for By a partial summation we have
where (3.4.6) is used and as
By (3.4.8) the first two terms on the right-hand side of (3.4.34) tend to zero as For the same reason and by the fact
the last term of (3.4.34) also tends to zero as This means that satisfies (2.2.2), and the lemma follows.
By Lemma 3.4.2 we have
and by (3.4.14)
For specified in (3.4.11) and a deterministic integer define the stopping time as follows
From (3.4.35) we have
and
Lemma 3.4.3 If A3.4.1–A3.4.4 hold, then
is uniformly bounded with respect to
Proof. By (3.4.11) and (3.4.15) from (3.4.39) we have
where respectively denote the terms on the right-hand side of the inequality in (3.4.40).
By (3.4.19) we see
where as From this we find that is bounded in if is large enough so that
By (3.4.19) we estimate as follows:
where is assumed to be large enough such that
Thus, by (3.4.9)
We now pay attention to (3.3.10) in the proof of Lemma 3.3.2 and find that the right-hand side of (3.4.42) is bounded with respect to
For by (3.4.19) and (3.4.10) we have
where is a constant. Again, by (3.3.10), is bounded in
It remains to estimate By the Schwarz inequality we have
By (3.4.19), for large enough
which, as shown by (3.3.11), is bounded in Then by (3.4.37) we have
where is a constant.
Combining (3.4.40)–(3.4.44) we find that there exists a constant
such that
Setting
and
from (3.4.45) we have
where is a constant.
Denoting
from (3.4.48) we find
where is set equal to 1.
From (3.4.48) and (3.4.50) it then follows that
which combined with (3.4.46) leads to
where for the last equality we have used (3.4.47).
Choosing sufficiently small so that
from (3.4.51) we then have
which is bounded with respect to as shown by (3.3.10).
Lemma 3.4.4 If A3.4.1-A3.4.4 hold, then
Proof. It suffices to prove
Then the lemma follows from (3.4.53) by the Kronecker lemma.
By (3.4.11) and (3.4.37) we have
where the last inequality follows from the Lyapunov inequality.
Applying Lemma 3.4.3, from the above estimate we derive
where is a constant and the convergence of the series follows from (3.4.13).
From (3.4.54) it follows that
which means that
By Lemma 3.4.2, for any given
if is sufficiently large. This together with (3.4.55) shows that
or equivalently,
This verifies (3.4.53) because can be arbitrarily small. The proof of the lemma is completed.
Proof of Theorem 3.4.2.
By Lemma 3.4.2, a.s. and
Consequently,
where as
Noticing we have
and hence
By (3.4.16) and (3.4.57), from here we derive
By Lemma 3.4.1, is bounded. Then with the help of (3.4.58) we have
From (3.4.58) and the boundedness of there exists a constant such that
Then, we have
where the convergence to zero a.s. follows from Lemma 3.4.4.
Putting (3.4.59) and (3.4.61) into (3.4.56) leads to
By (3.4.58) we then have
Notice that
Let us denote by the upper bound for where the existence of is guaranteed by Lemma 3.4.1. Then using (3.4.9) and (3.4.18) we
have
which implies that
and hence
because is bounded.
By (3.4.10) we see that
where the convergence follows from (3.4.13).
From this by the Kronecker lemma it follows that
Therefore, we have
and hence
Combining (3.4.62)–(3.4.64) we arrive at
or
This together with (3.4.8) implies the conclusion of the theorem.
This theorem tells us that if in (2.1.1)–(2.1.3) we apply the slowly decreasing step size, then the averaged estimate leads to the minimal covariance matrix of the limit distribution.
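A minimal numerical sketch of this averaging scheme follows. It is my own toy, not from the book: a scalar linear regression function with slope F = 2 and root 3, Gaussian noise, and the slowly decreasing step a_k = k^(-2/3), one standard choice with exponent in (1/2, 1).

```python
import random

def rm_with_averaging(F=2.0, x_root=3.0, sigma=1.0, n=50000, seed=2):
    """Robbins-Monro iteration with slowly decreasing step a_k = k**(-2/3)
    followed by averaging of the iterates (Polyak-Ruppert scheme)."""
    rng = random.Random(seed)
    x = 0.0
    x_bar = 0.0
    for k in range(1, n + 1):
        y = -F * (x - x_root) + rng.gauss(0.0, sigma)  # noisy observation
        x = x + k ** (-2.0 / 3.0) * y                  # slow-step RM update
        x_bar += (x - x_bar) / k                       # running average
    return x, x_bar
```

In this linear toy the averaged iterate typically ends up markedly closer to the root than the raw iterate, reflecting the efficiency result of Section 3.4.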
3.5. Notes and References
Convergence rates and asymptotic normality can be found in [28, 68, 78] for the nondegenerate case. The rate of convergence for the degenerate case was first considered by Pflug in [74]. The results presented in Section 3.2 are given in [15, 47].
For the proof of the central limit theorem (Lemma 3.3.1) we refer to [6, 56, 78], while for Remark 3.3.1 refer to [78]. The proofs of Theorems 3.3.1 and 3.3.2 can be found in [28].
Asymptotic normality of the stochastic approximation algorithm was first considered in [44].
For asymptotic efficiency the averaging technique was introduced in [80, 83], and further considered in [35, 59, 66, 67, 74, 98]. Theorems given in Section 3.4 can be found in [13]. For adaptive stochastic approximation refer to [92, 95].
Chapter 4
OPTIMIZATION BY STOCHASTIC APPROXIMATION
Up to now we have been concerned with finding roots of an unknown function observed with noise. In applications, however, one often faces the optimization problem, i.e., finding the minimizer or maximizer of an unknown function It is well known that achieves its maximum or minimum values at the root set of its gradient, i.e., at
although it may be only in the local sense.
The gradient is also written as
If the gradient can be observed with or without noise, then the optimization problem is reduced to the SA problem we have discussed in previous chapters. Here, we are considering the optimization problem for the case where the function itself rather than its gradient is observed and the observations are corrupted by noise. This problem was solved by the classical Kiefer-Wolfowitz (KW) algorithm which took finite differences to serve as estimates for the partial derivatives. To be precise, let be the estimate at time for the minimizer (maximizer) of and let
be two observations on at time with noises and respectively, where
are two vectors perturbed from the estimate by and respectively, on the component of The KW algorithm suggests taking
the finite difference
as the observation of the component of the gradient
It is clear that
where the component of equals
The RM algorithm
with defined above is called the KW algorithm.
It is understandable that in the classical theory for convergence of the KW algorithm rather restrictive conditions are imposed not only on but also on and Besides, at each iteration observations are needed to form finite differences, where is the dimension of In some problems may be very large; for example, in the problem of optimizing weights in a neural network corresponds to the number of nodes, which may be large. Therefore, it is of interest not only to weaken conditions required for convergence of the optimizing algorithm but also to reduce the number of observations per iteration.
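The classical scheme can be sketched in a few lines. The example below is my own toy (a quadratic objective with Gaussian observation noise, step sizes a_k = 1/k and difference widths c_k = k^(-1/4); none of these specifics come from the book); it makes visible the 2p observations per iteration that the text refers to.

```python
import random

def kw_step(L_obs, x, a, c, rng):
    """One classical KW iteration: 2*p observations (two per coordinate)
    form finite-difference estimates of the partial derivatives."""
    p = len(x)
    grad = []
    for i in range(p):
        xp = list(x); xp[i] += c
        xm = list(x); xm[i] -= c
        grad.append((L_obs(xp, rng) - L_obs(xm, rng)) / (2.0 * c))
    return [x[i] - a * grad[i] for i in range(p)]

def minimize_kw(L_obs, x0, n=20000, seed=3):
    rng = random.Random(seed)
    x = list(x0)
    for k in range(1, n + 1):
        x = kw_step(L_obs, x, a=1.0 / k, c=k ** -0.25, rng=rng)
    return x

# toy objective with minimizer (1, -2), observed with additive noise
def L_obs(x, rng):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + rng.gauss(0.0, 0.1)
```

With p = 2 coordinates, each iteration already costs four observations; the randomized-difference variant of Section 4.1 reduces this to two regardless of p.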
In Section 4.1 the KW algorithm with expanding truncations using randomized differences is considered. As will be shown, because of replacing finite differences by randomized differences, the number of observations is reduced from to 2 for each iteration, and because of involving expanding truncations in the algorithm and applying the TS method for convergence analysis, the conditions needed for have been weakened significantly and the conditions imposed on the noise have been improved to the weakest possible. The convergence rate and asymptotic normality for the KW algorithm with randomized differences and expanding truncations are given in Section 4.2.
The KW algorithm, like other gradient-based optimization algorithms, may get stuck at a local minimizer (or maximizer). How to approach the global optimizer is one of the important issues in optimization theory. Especially, how to reach the global optimizer pathwise is a difficult and challenging problem. In Section 4.3 the KW algorithm is combined with searching initial values, and it is shown that the resulting algorithm a.s. converges to the global optimizer of the unknown function
Optimization by Stochastic Approximation 153
The obtained results are then applied to some practical problems in Section 4.4.
4.1. Kiefer-Wolfowitz Algorithm with Randomized Differences
There is a fairly long history of random search or approximation ideas in SA. Different random versions of the KW algorithm were introduced: for example, in one version a sequence of random unit vectors that are independent and uniformly distributed on the unit sphere or unit cube was used; and in another version the KW algorithm with random directions was introduced and was called a simultaneous perturbation stochastic approximation algorithm.
Here, we consider the expandingly truncated KW algorithm with randomized differences. Conditions needed for convergence of the proposed algorithm are considerably weaker than existing ones.
Conditions on
Let be a sequence of independent and identically distributed (iid) random variables such that
Furthermore, let be independent of the algebra generated by
is the observation noise to be explained later.
For convenience of writing let us denote
It should be emphasized that is a vector and does not denote a matrix inverse. At each time two observations are taken: either
or
where is the estimate for the sought-for minimizer (maximizer) of denote the observation noises, and is a real
number.The randomized differences are defined as
and
may serve as observations of randomized differences.
To be specific, let us consider observations defined by (4.1.3) and (4.1.4). The convergence analysis, however, can analogously be done for observations (4.1.5) and (4.1.6).
Thus, the observations considered in the sequel are
where
We now define the KW algorithm with expanding truncations and randomized differences. Let be a sequence of positive numbers increasingly diverging to infinity, and let be a fixed point in Given any initial value the algorithm is defined by:
where is given by (4.1.9) and (4.1.10).
It is worth noting that the algorithm (4.1.9)–(4.1.12) differs from (2.1.1)–(2.1.3) only by the observations As a matter of fact, (4.1.11) and (4.1.12) are exactly the same as (2.1.1) and (2.1.2), but (4.1.9) and
(4.1.10) are different from (2.1.3). As before, is the number of truncations that have occurred before time Clearly the random vector is measurable with respect to the minimal σ-algebra containing both and where Thus the random vector is independent of
Let
The observation (4.1.9) can be written in the standard form of an RM algorithm. In fact, we can rewrite as follows:
where
Thus, the KW algorithm (4.1.9)–(4.1.12) turns out to be a standard RM algorithm with expanding truncations (4.1.11)–(4.1.14) considered in Chapter 2. Of course, the observation noise expressed by (4.1.14) is quite complicated: it is composed of the structural error
and the random noise caused by inaccuracy of observations.
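The whole scheme — one random ±1 direction per iteration, two observations, componentwise division by the perturbation (the vector "inverse" mentioned above), and expanding truncations — can be sketched as follows. This is my own toy illustration (quadratic objective, Gaussian noise, a_k = 1/k, c_k = k^(-1/4), truncation bounds 2(s+1)); it is not the book's notation and not a definitive implementation.

```python
import random

def rd_kw_truncated(L_obs, x0, x_star, n=20000, seed=4):
    """KW with randomized differences and expanding truncations: each
    iteration uses ONE random +/-1 direction and only TWO observations;
    if the candidate iterate leaves the current truncation ball it is
    reset to the fixed point x_star and the bound is enlarged."""
    rng = random.Random(seed)
    p = len(x0)
    x = list(x0)
    sigma = 0                          # number of truncations so far
    for k in range(1, n + 1):
        c = k ** -0.25
        delta = [rng.choice((-1.0, 1.0)) for _ in range(p)]
        y_plus = L_obs([x[i] + c * delta[i] for i in range(p)], rng)
        y_minus = L_obs([x[i] - c * delta[i] for i in range(p)], rng)
        diff = (y_plus - y_minus) / (2.0 * c)
        # componentwise division by delta: the "vector inverse" of delta
        cand = [x[i] - (1.0 / k) * diff / delta[i] for i in range(p)]
        if max(abs(v) for v in cand) > 2.0 * (sigma + 1):  # truncation
            sigma += 1
            cand = list(x_star)
        x = cand
    return x

# toy objective with minimizer (1, -2), observed with additive noise
def L_obs(x, rng):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + rng.gauss(0.0, 0.1)
```

Note the observation count: two per iteration regardless of the dimension p, versus 2p for the classical finite-difference KW scheme.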
We now list conditions to be used.
A4.1.2 is locally Lipschitz continuous. There is a unique maximum of at that is the only root for and
for Further, used in (4.1.11) is such that sup L(x) for some c and
Remark 4.1.1 If is twice continuously differentiable, then is locally Lipschitz continuous.
A4.1.1 and
exists such that
and as
Remark 4.1.2 If is the unique minimizer of then in (4.1.11) and (4.1.12) should be replaced by
Theorem 4.1.1 Assume A4.1.1, A4.1.2, and Conditions on hold.Let be given by (4.1.9)-(4.1.12) (or (4.1.11)-(4.1.14)) with anyinitial value. Then
if and only if for each the random noise given by (4.1.10) can be decomposed into the sum of two terms in ways such that
with
and
where is given in Conditions on
Proof. We will apply Theorem 2.2.1 for sufficiency and Theorem 2.4.1for necessity.
Let us first check Conditions A2.2.1–A2.2.4. Condition A2.2.1 is a part of A4.1.1. Condition A2.2.2 is automatically satisfied if we take
noticing that in the present case. Condition A2.2.4 is contained in A4.1.2. So, the key issue is to verify that given by (4.1.14) satisfies the requirements.
Let and be vector functions obtained from with some of its components replaced by zero:
It is clear that
and
For notational convenience, let denote a generic random vector such that
where is specified in (4.1.1), and may vary for different applications.
We express given by (4.1.14) in an appropriate form to be dealt with. We mainly use the local Lipschitz continuity to treat the structural error (4.1.15) in
Rewrite the component of the structural error as follows
and for any express
where on the right-hand side of the equality all terms are cancelled except the first and the last terms, and in each difference of L, the arguments of L differ from each other only by one
We write (4.1.25) in the compact form:
Applying Taylor's expansion to (4.1.26) we derive
where
Similarly, we have
and
where
Define the following vectors:
Finally, putting (4.1.27)–(4.1.35) into (4.1.14) we obtain the following expression for
It is worth noting that each component of and is a martingale
difference sequence, because both and are independent of
For the sufficiency part we have to show that (2.2.2) is satisfied a.s.
Let us show that (2.2.2) is satisfied by all components of and
For components of we have for any
since by (4.1.1), and as
Therefore, for any integer N
for any such that converges.
Thus, all sample paths of components of satisfy (2.2.2). Completely the same situation takes place for the components of
and
By the convergence theorem for martingale difference sequences, we find that for any integer N
This is because is independent of and is bounded by a constant uniformly with respect
to by Lipschitz-continuity of Then the martingale convergence
theorem applies since for some by A4.1.1.
A similar argument can be applied to components of Since for any integer N (4.1.38) holds outside an exceptional set with probability zero, there is an with such that for any
and
for all and N = 1, 2, ….
Therefore, for all and any integer N
where is given by (1.3.2).
From (4.1.17) and (4.1.18) it follows that there exists such that and for each
and hence
Combining (4.1.41) and (4.1.42), we find for each
This means that for the algorithm (4.1.11)–(4.1.14), Condition A2.2.3 is satisfied on Thus by Theorem 2.2.1, on This proves the sufficiency part of the theorem.
Under the assumption a.s. it is clear that both and converge to zero a.s., and (4.1.39) and (4.1.40) become
and
Then the necessity part of the theorem follows from Theorem 2.4.1. We show this. By Theorem 2.4.1, can be decomposed into two parts
and such that and Let us
denote by the component of a vector Define
Then for
and
From (4.1.43) and (4.1.36) it follows that
This together with (4.1.44) and (4.1.45) proves the necessity part of the theorem.
Theorem 2.4.1 gives a necessary and sufficient condition on the observation noise in order that the KW algorithm with expanding truncations and randomized differences converge to the unique maximizer of a function L. We now give some simple sufficient conditions on
Theorem 4.1.2 Assume A4.1.1 and A4.1.2 hold. Further, assume that
is independent of
and satisfies one of the following two conditions:
i) where is a random variable;
ii)
Then
where is given by (4.1.9)–(4.1.12).
Proof. It suffices to prove (4.1.16)–(4.1.18). Assume i) holds. Let be given by
By definition, is independent of and so
and
where is an upper bound for
By the convergence theorem for martingale difference sequences, itfollows that
Thus in (4.1.16) it can be assumed that and and the conclusion of the theorem follows from Theorem 4.1.1.
Assume now ii) holds. By the independence assumption it follows that for is independent of so that
Then, we have
It directly follows that
Again, it suffices to take
We now extend the results to the case of multi-extremes. For this, A4.1.2 is replaced by A4.1.2'.
A4.1.2’ is locally Lipschitz continuous, L(J) is nowhere
dense, where the set where L takes extremes, and used in (4.1.11) is such that for some and
Theorem 4.1.3 Let be given by (4.1.9)–(4.1.12) with a given initial value Assume A4.1.1 and A4.1.2' hold. Then
on an with if satisfies (4.1.16)–(4.1.18), or satisfies conditions given in Theorem 4.1.2, where is a connected
set contained in the closure of .
Proof. Condition A2.2.2 is implied by A4.1.2' with and A2.2.1 and A2.2.4 are implied by A4.1.1 and A4.1.2, respectively, while
A2.2.3 is satisfied as shown in Theorems 4.1.1 and 4.1.2. Then theconclusion of the theorem follows from Theorem 2.2.2.
Remark 4.1.3 In the multi-extreme case, the necessary conditions on for convergence can also be obtained by analogy with Theorem 2.4.2.
Remark 4.1.4 Conditions i) or ii) used in Theorem 4.1.2 are simple indeed. However, in Theorem 4.1.2 is required to be independent of This may not be satisfied if the observation noise
is state-dependent. Taking into account that is the observation noise when observing at and we see that depends on and if the observation noise is state-dependent. In this case, does depend on This violates the assumption about independence made in Theorem 4.1.2.
Consider the case where the observation noise may depend on locations of measurement, i.e., in lieu of (4.1.3) and (4.1.4) consider
Introduce the following condition.
A4.1.3 Both and are measurable functions and are martingale difference sequences for any and
for p specified in A4.1.1 with
where is a family of nondecreasing independent of both and
Theorem 4.1.4 Let be given by (4.1.9)–(4.1.12) with a given initial value Assume A4.1.1, A4.1.2', and A4.1.3 hold. Then
where is a connected subset of
Proof. Introduce the generated by and i.e.,
It is clear that is measurable with respect to and hence are Both
and are Approximating and by simple functions, it is seen that
Therefore, and aremartingale difference sequences, and
where
Hence, is a martingale difference sequence with
Noticing is bounded and as by (4.1.50) and (4.1.51) and the convergence theorem for martingale difference sequences we have, for any integer N > 0
This together with (4.1.37) with replaced by (4.1.39), and (4.1.40) verifies that expressed by (4.1.36) satisfies A2.2.3. Then the conclusion of the theorem follows from Theorem 2.2.2.
Remark 4.1.5 If J consists of a singleton then Theorems 4.1.3 and 4.1.4 ensure a.s. If J is composed of isolated points, then
theorems ensure that converges to some point in J. However, the limit is not guaranteed to be a global minimizer of Depending on the initial value, may converge to a local minimizer. We will return to this issue in Section 4.3.
4.2. Asymptotic Properties of KW Algorithm
We now present results on convergence rate and asymptotic normality of the KW algorithm with randomized differences.
Theorem 4.2.1 Assume hypotheses of Theorem 4.1.2 or Theorem 4.1.4 with and that
for some and as
where is stable and and are specified in (4.2.1) and (4.2.2),respectively.
Then given by (4.1.9)–(4.1.12) satisfies
Proof. First of all, under conditions of Theorems 4.1.2 or 4.1.4,
By Theorem 3.1.1 it suffices to show that given by
(4.1.36) can be represented as
where
From (4.1.28) and (4.1.31) by the local Lipschitz continuity of it follows that
by (4.2.2). Since it follows that
Since and given by (4.1.27) and (4.1.32)are uniformly bounded for for each
where converges. By the convergence theorem for martingaledifference sequences it follows that
where and are given by (4.1.35).
In the proof of Theorem 4.1.2, replacing by and using (4.2.2), the same argument leads to
Then by defining
we have shown (4.2.4) under the hypotheses of Theorem 4.1.2.
Under the hypotheses of Theorem 4.1.4 we have the same conclusions about and as before. We need only to show (4.2.5). But this follows from (4.1.52) with replaced by and the convergence
Remark 4.2.1 Let be given by (4.1.9)–(4.1.12). If and with then conditions (4.2.1) and (4.2.2) are satisfied.
Theorem 4.2.2 Assume A4.1.1 and A4.1.2 hold and that
i) and for some
ii) for some c > 0 and
iii) is stable and for some
iv) given by (4.1.10) is an MA process:
for
and
where are real numbers and is a martingale difference sequence which is independent of and satisfies
Then
where and
Proof. Since it follows that and
By assumption is independent of and hence is independent of Then by (4.2.11) and the convergence theorem for martingale difference sequences we obtain (4.2.5). By Theorem 4.2.1 we have as
and after a finite number of iterations of (4.1.11), say, for there are no more truncations.
Since and is stable, it follows that
Let be given by
By (4.1.11), (4.1.13), (4.1.36), and condition ii) it follows that for
Let be given by
where
Since is stable, by (3.1.8) it follows that there are constants
and such that
Noticing where because by condition iii), we have
where respectively denote the five terms on the right-hand side of the first equality of (4.2.19).
By (4.2.18),
By Lemma 3.3.2, because and
By (4.1.28) and (4.1.3) it follows that and hence by i) and (4.2.18)
where is a constant.
By Lemma 3.3.2 and the right-hand side of (4.2.20) tends to zero a.s. as
To estimate let us consider the following linear recursion
By (4.2.17) it follows that
By (4.2.11), Since and
Then by the convergence theorem for martingale difference sequences it follows that
i.e.,
Similarly,
Applying Lemma 3.1.1, we find that From (4.2.22),
it follows that
Since is an MA process driven by a martingale difference sequence
satisfying (4.2.6),
By an argument similar to that used for (4.2.21) and (4.2.22), from Lemma 3.1.1 it follows that
Therefore, putting all these convergence results into (4.2.19) yields
By (3.3.37),
where is given by (4.2.10). By (4.2.18), from (4.2.23) and (4.2.24)
it follows that which together with the definition
(4.2.14) for proves the theorem.
Example 4.2.1 The following example of and satisfies Conditions i) and iii) of Theorem 4.2.2:
In this example, and
Remark 4.2.2 Results in Sections 4.1 and 4.2 are proved for the case where the two-sided randomized differences are used, where and are given by (4.1.3) and (4.1.4), respectively. But all results presented in Sections 4.1 and 4.2 are also valid for the case where the one-sided randomized differences
are used, where and are given by (4.1.3) and (4.1.6), respectively.
In this case, in (4.1.27), (4.1.28) and in the expression of should be replaced by 1, and (4.1.29)–(4.1.32) disappear. Accordingly, (4.1.36) changes to
Theorems 4.1.1-4.1.4 and 4.2.1 remain unchanged. The conclusion of
Theorem 4.2.2 remains valid too, if in Condition iv)
changes to
4.3. Global Optimization
As pointed out at the beginning of the chapter, the KW algorithm may
lead to a local minimizer of Before the 1980s, the random search or its combination with a local search method was the main stochastic approach to achieve the global minimum when the values of L can be observed exactly without noise. When the structural property of L is used for local search, a rather rapid convergence rate can be derived, but it is hard to escape a local attraction domain. The random search has a chance to fall into any attraction domain, but its convergence rate decreases exponentially as the dimension of the problem increases.
Simulated annealing is an attractive method for global optimization, but it provides only convergence in probability rather than pathwise convergence. Moreover, simulation shows that for functions with a few local minima, simulated annealing is not efficient. This motivates one to combine a KW-type method with random search. However, a simple combination of SA and random search does not work: in order to reach the global minimum one has to reduce the noise effect as time goes on.
A hybrid algorithm composed of a search method and the KW algorithm is presented in the sequel with the main effort devoted to designing easily realizable switching rules and to providing an effective noise-reducing method.
We define a global optimization algorithm, which consists of three parts: search, selection, and optimization. To be specific, let us discuss the global minimization problem. In the search part, we choose an initial value and make the local search by use of the KW algorithm with randomized differences and expanding truncations described in Section 4.1 to approach the bottom of the local attraction domain. At the same time, the average of the observations for L is used to serve as an estimate of the local minimum of L in this attraction domain. In the selection part, the estimates obtained for the local minima of L are compared with each other, and the smallest one among them together with the corresponding minimizer given by the KW algorithm are selected. Then the optimization part takes place, where again the local search is carried out, i.e., the KW algorithm without any truncations is applied to improve the estimate for the minimizer. At the same time, the corresponding minimum of L is reestimated by averaging the noisy observations. After this, the algorithm goes back to the search part again.
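The three-part cycle just described can be summarized in a skeleton. The code below is my own condensed sketch (scalar case, a toy double-well objective, fixed step and clipping constants; the book's actual switching rules, expanding truncations, and noise-reducing denominators are omitted):

```python
import random

def global_opt(L_obs, initial_points, n_local=5000, seed=5):
    """Skeleton of the search/selection/optimization cycle: run a local
    KW search from each candidate initial value, estimate the local
    minimum by averaging noisy observations, select the best candidate,
    then refine it by a further local search."""
    rng = random.Random(seed)

    def local_search(x0, n):
        x = x0
        L_avg = 0.0
        for k in range(1, n + 1):
            c = k ** -0.25
            g = (L_obs(x + c, rng) - L_obs(x - c, rng)) / (2.0 * c)
            g = max(-10.0, min(10.0, g))          # crude clipping keeps
            x = x - 0.1 * g / k                   # the toy stable
            L_avg += (L_obs(x, rng) - L_avg) / k  # averaged observations
        return x, L_avg

    best_x, best_L = None, float("inf")
    for x0 in initial_points:                     # search + selection
        x_loc, L_est = local_search(x0, n_local)
        if L_est < best_L:
            best_x, best_L = x_loc, L_est
    x_final, _ = local_search(best_x, n_local)    # refinement step
    return x_final

# toy double-well L(x) = x**4 - 3*x**2 + x: local minimum near x = 1.14,
# global minimum near x = -1.30
def L_obs(x, rng):
    return x ** 4 - 3.0 * x ** 2 + x + rng.gauss(0.0, 0.05)
```

Starting the search from both basins, the averaged observation estimates let the selection step pick the deeper minimum even though each local search alone would be trapped.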
For the local search, we use observations (4.1.3) and (4.1.4), or (4.1.5) and (4.1.6). To be specific, let us use (4.1.5) and (4.1.6).
In the sequel, by the KW algorithm with expanding truncations we mean the algorithm defined by (4.1.11) and (4.1.12) with
where and are given by (4.1.5) and (4.1.6), respectively. Similarly to (4.1.9) and (4.1.10) we have
where
By the KW algorithm we mean

with defined by (4.3.2). It is worth noting that, unlike (4.1.8), is used in (4.3.1).
Roughly speaking, this is because in the neighborhood of a minimizer of is increasing, and in (4.1.11) should be an observation on
174 STOCHASTIC APPROXIMATION AND ITS APPLICATIONS
In order to define switching rules, we have to introduce integer-valued and increasing functions and such that and
Define
In the sequel, by the search period we mean the part of the algorithm starting from the selection of the initial value up to the next selection of the initial value. At the end of the search period, we are given and being the estimates for the global minimizer and the minimum of L, respectively. Variables such as and etc. in the search period are equipped with the superscript
etc.

The global optimization algorithm is defined by the following five steps.
(GO1) Starting from at the search period, the initial value
is chosen according to a given rule (deterministic or random),
and then is calculated by the KW algorithm with expanding truncations (4.1.11) and (4.1.12) with defined by (4.3.1), for which the step sizes and and used for truncation are defined as follows:
where c > 0 and are fixed constants, and are two sequences of positive real numbers increasingly diverging to infinity.
(GO2) Set the initial estimate for and update the
estimate for by
where is the noise when observing
After steps, is obtained.
(GO3) Let be a given sequence of real numbers such that
and as Set For if
as
e.g.,
then set Otherwise, keep unchanged.
(GO4) Improve to by the KW algorithm with expanding truncations (4.1.11) and (4.1.12) with defined by (4.3.1), for which
where in (4.1.11) and (4.1.12) may be an arbitrary sequence of numbers increasingly diverging to infinity, and
At the same time, update the estimate for by
where is the noise when observing At the end of this step, and are derived.
(GO5) Go back to (GO1) for the search period.
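The five steps can be sketched roughly as follows. The clipping interval (a crude stand-in for the expanding truncations), the step sizes, the tolerance sequence, and the noisy test function are all illustrative assumptions, not the book's exact choices:

```python
import random

def kw_search(L_obs, x, n_steps, i, bound=2.5):
    # (GO1)/(GO4): KW iteration with central finite differences; the factor i
    # in the step-size denominator imitates the noise-reducing divisor.
    for n in range(1, n_steps + 1):
        a_n = 1.0 / (n * i)
        c_n = 0.5 / n ** 0.25
        g = (L_obs(x + c_n) - L_obs(x - c_n)) / (2.0 * c_n)
        x = min(max(x - a_n * g, -bound), bound)  # crude truncation device
    return x

def estimate_min(L_obs, x, m):
    # (GO2): estimate L(x) by averaging m noisy observations.
    return sum(L_obs(x) for _ in range(m)) / m

def global_optimize(L_obs, periods=30, n_steps=200, m=200, seed=1):
    rng = random.Random(seed)
    best_x, best_v = 0.0, float("inf")
    for i in range(1, periods + 1):
        x0 = rng.uniform(-2.0, 2.0)                   # (GO1): fresh initial value
        xi = kw_search(L_obs, x0, n_steps, i)
        vi = estimate_min(L_obs, xi, m)               # (GO2)
        if vi < best_v - 1.0 / i:                     # (GO3): tolerance -> 0
            best_x, best_v = xi, vi                   # resetting
        best_x = kw_search(L_obs, best_x, n_steps, i) # (GO4): refinement
        best_v = estimate_min(L_obs, best_x, m)
    return best_x, best_v                             # (GO5): loop closes

obs_rng = random.Random(2)
L_obs = lambda x: (x * x - 1.0) ** 2 + 0.3 * (x - 1.0) ** 2 + obs_rng.gauss(0.0, 0.1)
x_hat, v_hat = global_optimize(L_obs)
```

On this noisy double-well (global minimizer at 1, local minimum near -1), restarting the local search each period and comparing averaged estimates eventually keeps the iterate in the global well.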
We note that for the search period is added to and (see (4.3.7) and (4.3.8)). The purpose of this is to diminish the effect of the observation noise as increases. Therefore, and both tend to zero, not only as but also as The following example shows that adding an increasing to the denominators of and is necessary.
Example 4.3.1 Let
It is clear that the global minimizer is and are two local minima. Furthermore, and are attraction domains for –1 and +1, respectively.
Since is linear, for local search we apply the ordinary KW algorithm without truncation

Here, no randomized differences are introduced, because this is a one-dimensional problem.
Assume
where
and and are mutually independent and both are sequences of iid random variables with
Let us start from (GO1) and take
(not tending to infinity),
If then, by noticing one of and must belong to Elementary calculation shows that
Paying attention to (4.3.13), we see
and
i.e.,
This means that is located in one of the attraction domains and Furthermore, by (4.3.12) and (4.3.13), the observations carried out in these domains are free of noise. Let us consider the further development of the algorithm once has fallen into the interval or To be specific, let us assume
For we have
or which implies If say, then since
It suffices to consider the case where i.e., because for the case we again have (4.3.14) and

Simple computation shows that starting from the observations are free of noise, and the algorithm becomes
As a result of computation, we have
Then, starting from the algorithm will be iterated according to (4.3.14), and hence
For the case it can similarly be shown that

Therefore, no matter how the initial value is chosen, will never converge to the global minimizer if in (GO1) does not diverge to infinity.
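The role of the growing divisor can also be seen numerically: within search period i the accumulated noise term behaves like a weighted sum of the observation errors, whose size stays bounded away from zero under weights 1/n but dies out under weights 1/(n·i). A sketch (Gaussian noise and the particular sequences are illustrative assumptions):

```python
import random

rng = random.Random(0)
N = 100  # KW steps per search period

def period_noise(i, divisor_grows):
    # Accumulated noise a_1 e_1 + ... + a_N e_N inside search period i.
    s = 0.0
    for n in range(1, N + 1):
        a_n = 1.0 / (n * i) if divisor_grows else 1.0 / n
        s += a_n * rng.gauss(0.0, 1.0)
    return abs(s)

fixed = [period_noise(i, False) for i in range(1, 51)]
growing = [period_noise(i, True) for i in range(1, 51)]
# With a_n = 1/n the per-period noise level never dies out;
# with a_n = 1/(n*i) the late periods are almost noise-free.
```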
Let us introduce the conditions to be used. Since we are seeking global minima of Condition A4.1.2' should be modified.
A4.3.1 is locally Lipschitz continuous,
and L(J) is nowhere dense, where denotes the set of extremes of L.
Note that for seeking minima of the corresponding part in A4.1.2' should be modified as follows: used in (4.1.11) is such that
for some and But this is implied by assuming
A4.3.2
A4.3.3 For any convergent subsequence of
where denotes given by (4.3.3) with replaced by denotes used for the i-th search period, and
A4.3.4 For any convergent subsequence
where is given by (1.3.2).
It is worth emphasizing that each in the sequence is used only once when we form and
We now give sufficient conditions for A4.3.2, A4.3.3, and A4.3.4. For this, we first need to define generated by the estimates and derived up to the current time. Precisely, for running in the search period of Step (GO1) define
and for running in Step (GO4) define
Remark 4.3.1 If both sequences
and are martingale difference sequences with
and if
for some then A4.3.2 holds.
This is because
is a martingale difference sequence with bounded second conditional moment, and hence
which implies (4.3.15). By using the second parts of conditions (4.3.22) and (4.3.23), (4.3.16)
can be verified in a similar way.
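The convergence theorem invoked here — for a martingale difference sequence with bounded second moments, a weighted sum with square-summable weights converges a.s. — can be illustrated numerically with weights 1/k (iid Gaussian noise is an illustrative stand-in):

```python
import random

rng = random.Random(0)

# Partial sums of sum_k e_k / k with iid zero-mean e_k (a martingale
# difference sequence); sum_k 1/k^2 < infinity forces a.s. convergence.
s, checkpoints = 0.0, []
for k in range(1, 200001):
    s += rng.gauss(0.0, 1.0) / k
    if k % 50000 == 0:
        checkpoints.append(s)

spread = max(checkpoints) - min(checkpoints)  # tail fluctuations are tiny
```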
Remark 4.3.2 If and is independent of
and if there exists
such that then by the uncorrelatedness of
with for or
where M is a constant. From this, it follows that
and hence A4.3.3 holds.
Remark 4.3.3 If and is independent of
then by the martingale convergence theorem, A4.3.4 holds.
We now formulate the convergence theorem for the global optimization algorithm (GO1)–(GO5).
Theorem 4.3.1 Assume A4.1.1, A4.3.2, A4.3.3, and A4.3.4 hold. Further, assume that selected in (GO1) is dense in an open set U, for some and
If then
where is derived at Step (GO4) and is the set of global minimizers of
The proof of the theorem is separated into lemmas. We recall that the essence of the proof of the basic convergence Theorem 2.2.1 consists in showing the following property: cannot cross a nonempty interval infinitely often if We need to extend this property to a family of algorithms.
Assume for each fixed the observation is
and the algorithm develops as follows
where
Assume, further, for fixed
Lemma 4.3.1 Assume L(J) is nowhere dense, where Let be a nonempty interval such that If there are
two sequences and such that
and is bounded, then it is impossible to have
where
Proof. Without loss of generality we may assume that converges (otherwise, it suffices to select a subsequence).
Assume the converse, i.e., that (4.3.28) holds. Along the lines of the proof of Theorem 2.2.1 we can show that

for some constant M if is sufficiently large. As a matter of fact, this is an analogue of (2.2.3). From (4.3.29) the following analogue of (2.2.15) takes place:
and the algorithm for has no truncation for if is large enough, where is a constant. Similarly to (2.2.27), we then have
(2.2.27), we then have
and
and
for some small T > 0 and all sufficiently large From this, by (4.3.27) and the convergence of it follows that
By continuity of and (4.3.30) we have
which implies that for small enough T. Then by definition,
which contradicts (4.3.32). The obtained contradiction shows the impossibility of (4.3.28).
Introduce
such that
and
Lemma 4.3.2 Let be given by (GO1). Assume A4.3.1 and A4.3.3 hold and for some Then for any occurs infinitely often with probability 0, i.e.,
Proof. Since L(J) is nowhere dense, for any belonging to infinitely many of there are subsequences such that
and
whereand
By assumption, as must be bounded. Hence, is bounded. Without loss of generality we may assume that is convergent.

Notice that at Step (GO1), is calculated according to (4.1.11) and (4.1.12) with given by (4.3.2) and (4.3.3), i.e.,
which differ from (4.1.11), (4.1.12), (4.3.2), and (4.3.3) by the superscript (i), which means the calculation is carried out in the search period.
By (4.1.27) with the notations (4.1.33) and (4.1.34), equipped with the superscript we have
where
If we can show that and
where
then by Lemma 4.3.1, (4.3.42) contradicts the fact that all the sequences cross the interval which is disjoint from L(J). This then proves (4.3.36).

We now show that for all sufficiently large if T is small enough. Since and are finite, where

We now show that on the if is sufficiently large and T is small enough.

Suppose the converse: for any fixed T > 0, there always exists
no matter how large is taken, such that Since by continuity of there is a constant
q > 0 such that
For any let us estimate By
and the local Lipschitz continuity of it is seen that
is uniformly bounded with respect to and all Then by A4.3.3, it follows that there is a constant such that
From this it follows that there is no truncation for and
Let T be so small that
On the other hand, however, we have and The obtained contradiction shows that for all sufficiently large if T is small enough.

We now prove (4.3.42). Let us order in the following way
From (4.1.34) and by the fact that is an iid sequence and is independent of the sums appearing in (4.1.34), it is easy to see that is a martingale difference sequence.
By the condition for some it is clear that for with being a constant. Then we have
By (4.1.28) and (4.3.8), we have
where is a constant. Noticing that for large and small T, by (4.3.44), (4.3.45), and A4.3.3 we may assume sufficiently large and T small enough such that
This will imply (4.3.42) if we can show
We prove (4.3.47) by induction. We have by the definition of Assume that
and by the convergence theorem for martingale difference sequences
and Then there is no truncation at time since by (4.3.46) (with chosen such that
if T in (4.3.46) is sufficiently small. Then by (4.3.40), we have
and by (4.3.43) and (4.3.46)
for small T. This completes the induction, and (4.3.42) is proved, which, in turn, concludes the lemma.
Lemma 4.3.3 Assume A4.3.1–A4.3.3 hold. Further, assume that for some and If there
exists a subsequence such that then
Proof. For any by Lemma 4.3.2 there exists such that for any if By (GO2), we have
Then by A4.3.2, there exists such that, for any
This implies the conclusion of the lemma by the arbitrariness of
Lemma 4.3.4 Assume A4.3.1–A4.3.3 hold, for
some and If subsequence is such that
then
where denotes the closure of L(J), and and are given by (GO1) and (GO2) for the search period.
Proof. Since by A4.3.1, from (4.3.50) it is seen that contains a bounded infinite subsequence, and hence a
convergent subsequence (for simplicity of notation, assume
such that
Since there exists a such that and hence
Define
It is worth noting that for any T > 0, is well defined for all
sufficiently large because and hence
We now show that
By the same argument as that just used before, without loss of generality, we may assume is convergent (otherwise, a convergent subsequence should be extracted) and thus
We have to show
as
By the same argument as that used for deriving (2.2.27), it follows that there is such that
which implies the correctness of (4.3.53). From (4.3.53) it follows that
because, otherwise, we would have a subsequence with
such that and by (4.3.54)
for large However, by (2.2.15), so for small enough T > 0, (4.3.56) is impossible. This verifies (4.3.55).
We now show
Assume the converse, i.e.,
From (4.3.54) and (4.3.58) it is seen that for all sufficiently large the sequence

contains at least one crossing of the interval with In other words, we are dealing with a sample path on which both (4.3.54) and (4.3.58) are satisfied. Thus, belongs to By Lemma 4.3.2, the set composed of such has probability zero. This verifies (4.3.57).
From (4.3.57) it follows that
for all sufficiently large
Notice that from the following elementary inequalities
by (4.3.5) it follows that
By definition of we write
By (4.3.59) and (4.3.61), noticing we have
because
By (4.3.55) and (4.3.61) we have
Since by (4.3.15), combining (4.3.62)–(4.3.64)
leads to
which completes the proof of the lemma.
Lemma 4.3.5 Let be given by (GO1)–(GO5). Assume that A4.3.1–
A4.3.4 hold, initial values selected in (GO1) are dense in an open set U containing the set of global minima of
for some and Then for any
Proof. Among the first search periods denote by the number of those search periods for which are reset to be i.e.,

Since L(J) is not dense in any interval, there exists an interval such that So, for the lemma it suffices to prove that cannot cross infinitely many times a.s. If then after a finite number of steps, is generated
by (GO4). By Lemma 4.3.1 the assertion of the lemma follows immediately. Therefore, we need only consider the case where
Denote by the search period for which a resetting happens, i.e., It is clear that by
In the case by (GO4) the algorithm generates a family
of consecutive sequences:
Let us denote the sequence by
and the corresponding sequence of the values of by
Let be sufficiently small such that
and which is possible because L(J) is nowhere dense.
Since is dense in U, visits infinitely often. Assume
By Lemma 4.3.2
if is large enough. Define
This means that the first resetting in or after the search period occurs in the search period.
We now show that there is a large enough such that the following requirements are simultaneously satisfied:
where is fixed;
i) implies

ii) does not cross the intervals and

iii)

iv)

v)

We first show ii)–v). Since all three intervals indicated in ii) have an empty intersection with L(J), by Lemma 4.3.1, ii) is true if S is large enough. It is clear that iii) and iv) are correct for fixed and if is large enough, while v) is true because
For i) we first show that there are infinitely many for which
By (4.3.68) and (4.3.71) we have
Consider two cases.

1) There is no resetting in the search period. Then
and by (4.3.72) and (4.3.74) it follows that
By (4.3.70) and the definition of there exists at least one integer among such that
because, otherwise, we would have which contradicts (4.3.74).
By ii) we conclude that
and by (4.3.68) we also have (4.3.76). From (4.3.76), by ii) does not cross for
Consequently,
This together with (4.3.70) implies that
and, in particular,

2) If there is a resetting in the search period, then
By (GO3) we then have
Noticing as we conclude that there are infinitely many for which (4.3.73) holds.
We now show that there is a such that
where lim sup is taken along those for which (4.3.73) holds. Assume the converse: there is a subsequence of such that
Then by Lemma 4.3.4,
which contradicts (4.3.73). This proves (4.3.78), and also i). As a matter of fact, we have proved more than i): precisely, we have shown that there are infinitely many for which (4.3.73) holds, and for (4.3.73) implies the following inequality:
Let us denote by the totality of those for which (4.3.73) holds and What we have just proved is that contains infinitely many if
Consider a sequence By ii) it cannot cross the interval This means that
Then by (4.3.70)
and by (GO3)
since is a search period with resetting.

Thus, we have shown that if then also belongs to Therefore, and
From here and (4.3.67) it follows that
Since may cross the interval only a finite number of times by Lemma 4.3.1. This completes the proof of the lemma.
Proof of Theorem 4.3.1. By Lemma 4.3.5 the limit exists. By the arbitrariness of from (4.3.69) it follows that
By continuity of we conclude that
4.4. Asymptotic Behavior of the Global Optimization Algorithm

In the last section a global optimization algorithm combining the KW algorithm with a search method was proposed, and it was proved that the algorithm converges to the set of global minimizers, i.e.,
However, in the algorithm defined by (GO1)–(GO5), resettings are involved. The convergence by no means excludes resettings asymptotically. In other words, although it may still happen that

where is defined in Lemma 4.3.5, i.e., it may still be possible to have infinitely many resettings.
In what follows we will give conditions under which
In this case, the global optimization algorithm (GO1)–(GO5) asymptotically behaves like a KW algorithm with expanding truncations and randomized differences, because for large is purely generated by (GO4) without resetting.
A4.4.1 is a singleton, is twice continuously differentiable in the ball centered at with radius for some and
of is positive definite.
A4.4.2 and ordered as in (4.3.20), (4.3.21) and Remark 4.3.1 are martingale difference sequences with
A4.4.3 is independent of
for and
and
for
We recall that is the observation noise in the search period.
A4.4.4 is independent of and where
denotes the observation noise when is calculated in (GO4).
Lemma 4.4.1 Assume A4.4.2 holds and, in addition,
Then, there exists an (possibly depending on ) such that for any
and
and
Proof. Notice that by A4.4.2 is a martingale
difference sequence with bounded conditional variance. By the convergence theorem for martingale difference sequences

which implies (4.4.2). Estimate (4.4.3) can be proved in a similar way.
Lemma 4.4.2 Assume A4.4.3 and A4.4.4 hold. If for some then
and
for where and are given in (4.1.34), where the superscript denotes the corresponding values in the i-th search period.
Proof. Let us prove
Note that
is a martingale difference sequence with bounded conditional second moment. So, by the convergence theorem for martingale difference sequences, for (4.4.6) it suffices to show
By assumption of the lemma or and
for large The last inequality yields
and hence
Therefore,
Thus, (4.4.6) is correct. As noted in the proof of Lemma 4.3.2, is a martingale difference sequence. So, (4.4.4) is true.
Similarly, (4.4.5) is also verified by using the convergence theorem for martingale difference sequences.
Lemma 4.4.3 In addition to the conditions of Theorem 4.3.1, suppose that A4.4.1 and A4.4.3 hold, is positive definite, and

for some Then there exists a sufficiently large such that, for if the inequality
holds for some with then the following inequality holds
Proof. By A4.4.1 and Taylor's expansion, we have
i.e.,
where
Therefore, for any there is a such that for any
and
where and denote the minimum and maximum eigenvalues of H, respectively, and o(1) is as given in (4.4.10).
Since is the unique minimizer of and is continuous, there is such that if We always assume that is large enough such that
and
where is used in (GO1). From (4.4.8) it then follows that and there is no truncation at time
Denote
For satisfying (4.4.8) and we have
where is given by (4.3.41).
By (4.4.11) it then follows that
where is given by (4.1.33) with the superscript denoting the search period and
By (4.4.14) it is clear that
Let
For (4.4.9) it suffices to show that Assume the converse: Let
By (4.4.20), for all
and hence,
Thus, (4.4.12)-(4.4.14) are applicable.
By (4.4.17) and the second inequality of (4.4.13), we have for
which combined with (4.4.21) yields
Applying the first inequality of (4.4.13) and then (4.4.20) leads to
Since for there is no truncation for Using (4.4.18) we have
where
We now show that is negative for all sufficiently large Let us consider the terms in By assumption,
from (4.4.19) and (4.4.22) it follows that
We now estimate the second term on the right-hand side of (4.4.25) after multiplying it by
From (4.4.4) and (4.4.16) it follows that
uniformly with respect to and with
Noticing that with being a constant,
and that which implies we find
Then, noticing that is bounded by some constant we have
For the third term on the right-hand side of (4.4.25), multiplying it by we have
where is a constant. Finally, for the last term of (4.4.25) we have the following estimate
Combining (4.4.26)–(4.4.30) we find that
where
and for large
Consequently, from (4.4.25) it follows that
We now show that
by induction. Assume it holds for i.e.,
which has been verified for We have to show it is true for
By (4.4.18) we have
and
where
Comparing (4.4.35) with (4.4.25), we find that in lieu of and we now have and respectively. But for both cases we use the same estimate (4.4.27). Therefore, by exactly the same argument as (4.4.26)–(4.4.30), we can prove that
and for large
Thus, we have proved (4.4.32). By the elementary inequality
for which is derived from
for any matrices A and B of compatible dimensions, we derive
from (4.4.32)
As mentioned before, for and there is no truncation. Then by (4.4.18)
where
Then from (4.4.36) and (4.4.27) it follows that
where
which tends to zero as by (4.4.27) and (4.4.38). Then
where for the last equality (4.4.10) is used. Finally, by (4.4.21), for large from (4.4.39) it follows that
which combined with (4.4.10) yields

This contradicts (4.4.20), the definition of The obtained contradiction shows
Theorem 4.4.1 Assume that A4.3.1, A4.4.1–A4.4.4 hold, and
is positive definite for some Further, assume that
and for some constants
Then the number of resettings is finite, i.e.,
where is the number of resettings among the first search periods (GO1), and is given in (GO3).
Proof. If (4.4.44) were not true, then there would be a set S with positive probability such that, for any there exists a subsequence such that at the search period a resetting occurs, i.e.,
Notice that
by (4.4.41) and and
by (4.4.41) and (4.4.42). Hence, the conditions of Lemma 4.4.1 are satisfied. Without loss of generality, we may assume that (4.4.2)–(4.4.5) and the conclusion of Theorem 4.3.1 hold. From now on, assume that
is fixed. It is clear that, for any constant

if is large enough, since for

Let
Rewrite (4.4.46) as
Define
and
Noticing that there is no resetting between and and that (4.4.47) corresponds to (4.4.8), by the same argument as that used in the proof of Lemma 4.4.3, we find that, for any
Since we have
By (4.4.3), (4.4.42), and (4.4.43) it follows that
where for the last inequality (4.4.41) is used. Thus, by (4.4.40)
By (4.4.33) it follows that
provided is large enough, where for the last inequality (4.4.2) is used.
Since by (4.4.43)
and since
and
we find
where the last inequality follows from (4.4.40). Using (4.4.51) and (4.4.53), from (4.4.52) for sufficiently large we
have
Using the second inequality of (4.4.43) and then observing that
and
by (4.4.40) and (4.4.41) and we find
We now show that there is such that
Assume the converse:
with
Then, we have
for large enough because
Inequality (4.4.57) contradicts (4.4.55). Consequently, (4.4.56) is true. In particular, for we have
By exactly the same argument as that used for (4.4.47)–(4.4.50), by
noticing that there is no resetting from to we conclude
that
By the same treatment as that used for deriving (4.4.54) from (4.4.50), we obtain

Comparing (4.4.58) with (4.4.54), we find that has been changed to and this procedure can be continued if the number of resettings
is infinite. Therefore, for any we have
From (4.4.40) we see
Since we have and hence by
Consequently, by (4.4.41) the right-hand side of (4.4.59) can be estimated as follows:
by (4.4.61) if is large enough.

However, the left-hand side of (4.4.59) is nonnegative. The obtained contradiction shows that must be finite, and (4.4.44) is correct.

By Theorem 4.4.1, our global optimization algorithm coincides with the KW algorithm with randomized differences and expanding truncations for sufficiently large Therefore, the theorems proved in Section 4.2 are applicable to the global optimization algorithm. By Theorems 4.2.1 and 4.2.2 we can derive the convergence rate and asymptotic normality of the algorithm described by (GO1)–(GO5).
4.5. Application to Model Reduction

In this section we apply the global optimization algorithm to system modeling. A real system may be modeled by a high order system which, however, may be too complicated for control design. In control engineering the order reduction of a model is of great importance. In the linear system case, this means that a high order transfer function is to be approximated by a lower order transfer function. For this one may use methods like balanced truncation and Hankel norm approximation. These methods are based on the concept of the balanced realization. We are interested in recursively estimating the optimal coefficients of the
reduced model by using the stochastic optimization algorithm presentedin Section 4.3.
Let the high order transfer function be
and let it be approximated by a lower order transfer function If is of order then is taken to be of order
To be specific, let us take to be a polynomial of order and of order
where the coefficients should not be confused with the step sizes used in Steps (GO1)–(GO5). Write as where and stand for the coefficients of and
It is natural to take
as the performance index of approximation. The parameters and are to be selected to minimize under the constraint that
is stable. For simplicity of notation we denote and write as

Let us describe the set where has the required property. Stability requires that
This implies that
because is the sum of two complex-conjugate roots of
If then which yields If then and hence
(or ).
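For a denominator normalized as z² + a₁z + a₂ (an illustrative convention; the book states D in its own coefficients), the region just derived is the classical Schur stability triangle: both roots lie strictly inside the unit circle iff |a₂| < 1 and |a₁| < 1 + a₂. A direct cross-check of the two characterizations:

```python
import cmath

def stable_triangle(a1, a2):
    # Schur stability of z^2 + a1 z + a2 via the triangle conditions.
    return abs(a2) < 1.0 and abs(a1) < 1.0 + a2

def roots_inside(a1, a2):
    # Direct check: both roots strictly inside the unit circle.
    d = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return abs((-a1 + d) / 2.0) < 1.0 and abs((-a1 - d) / 2.0) < 1.0

# The two characterizations agree on a grid of coefficient pairs.
agree = all(
    stable_triangle(0.31 * i - 1.93, 0.31 * j - 1.93)
    == roots_inside(0.31 * i - 1.93, 0.31 * j - 1.93)
    for i in range(13)
    for j in range(13)
)
```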
Set
Identify the and that appeared in Section 4.3 with and respectively, for the present case.

We now apply the optimization algorithm (GO1)–(GO5) to minimizing under the constraint that the parameter in belongs to D. For this we first concretize Steps (GO1)–(GO5) described in Section 4.3.
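The performance index — the mean-square distance between the two frequency responses — can be approximated by a Riemann sum over [−π, π]. A sketch; the coefficient convention c₀ + c₁z⁻¹ + … and the sample systems are illustrative assumptions:

```python
import cmath
import math

def freq_resp(num, den, w):
    # Evaluate num(z)/den(z) at z = e^{jw}, coefficients c0 + c1 z^{-1} + ...
    z = cmath.exp(1j * w)
    val = lambda c: sum(ck * z ** (-k) for k, ck in enumerate(c))
    return val(num) / val(den)

def perf_index(num, den, num_r, den_r, m=2000):
    # Riemann-sum approximation of (1/2*pi) * integral of |G - G_r|^2 dw.
    total = 0.0
    for k in range(m):
        w = -math.pi + 2.0 * math.pi * (k + 0.5) / m
        total += abs(freq_resp(num, den, w) - freq_resp(num_r, den_r, w)) ** 2
    return total / m

J_same = perf_index([1.0], [1.0, -0.5], [1.0], [1.0, -0.5])  # identical models
J_diff = perf_index([1.0], [1.0, -0.5], [1.0], [1.0, -0.2])  # mismatched pole
```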
Since is convex in for fixed we take the fixed initial value
for any search period and randomly select initial values only for according to a distribution density which is defined as follows:
where with and being the uniform distributions over [–2, 2] and [–1, 1], respectively.
After having been selected in the search period, the algorithm
(4.1.11) and (4.1.12) is calculated with and
As to observations, instead of (4.3.1) we will use information about the gradient because in the present case the gradient of can be expressed explicitly:
In the search period the observation is denoted by and is given by
where is independently selected from according to the uniform
distribution, and stands for the estimate for at time in the
search period. It is clear that is an approximation to the integral
(4.5.8) with Therefore, we have observations in the form
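This observation scheme — evaluating the gradient integrand at a uniformly sampled frequency, so that each sample is an unbiased estimate of the true gradient — can be sketched on a toy problem; the 3-tap "true" system and 2-tap reduced model below are illustrative assumptions, not the examples of this section:

```python
import cmath
import math
import random

rng = random.Random(0)

def G(w):
    # "High order" system to be approximated (illustrative 3-tap FIR).
    z = cmath.exp(-1j * w)
    return 1.0 + 0.8 * z + 0.3 * z * z

def Gr(w, th):
    # Reduced 2-tap FIR model.
    z = cmath.exp(-1j * w)
    return th[0] + th[1] * z

def grad_obs(th):
    # Gradient integrand of (1/2*pi) * integral of |G - Gr|^2 dw at one
    # random frequency: an unbiased observation of the true gradient.
    w = rng.uniform(-math.pi, math.pi)
    z = cmath.exp(-1j * w)
    e = G(w) - Gr(w, th)
    return [(-2.0 * e.conjugate() * d).real for d in (1.0, z)]

th = [0.0, 0.0]
for n in range(1, 20001):
    g = grad_obs(th)
    th = [t - gi / (n + 50.0) for t, gi in zip(th, g)]
# Since the taps are orthogonal over frequency, the best 2-tap fit is (1.0, 0.8).
```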
The expanding truncation method used in (4.1.11) and (4.1.12) requires projecting the estimated value to a fixed point if the estimated value appears outside an expanding region. Let us denote it by In (4.1.11) and (4.1.12) the spheres with expanding radii serve as the expanding regions, which are now modified as follows.
Let us write where Define
where
The expanding truncations in (4.1.11) and (4.5.11) are also modified:
where means the projection of Take Then after steps, will be obtained.

Concerning (GO2)–(GO4), the only change consists in the observations. We replace which is defined by in (GO2)–(GO4), by
where are independently selected from according to the uniform distribution for each Clearly, is an approximation to
Finally, take equal to

In control theory there are several well-known model reduction methods, such as model reduction by balanced truncation and Hankel norm approximation, among others. These methods depend on the balanced realization, which is a state space realization method for a transfer matrix keeping the Gramians for controllability and observability of the realized system balanced. In order to compare with the proposed global optimization (GO) method, we take the commonly used model reduction methods of balanced truncation (BT) and Hankel norm approximation (HNA), which are realized by using Matlab. For this, the discrete-time transfer functions are transformed to continuous-time ones by using d2c provided in Matlab. Then the reduced systems are discretized to compute for comparison.
As we take a 10th order transfer function, respectively, for the following examples:
Example 4.5.1
Example 4.5.2
Example 4.5.3
Using the algorithm described in Section 4.3, for Examples 4.5.1–4.5.3 we obtain the approximate transfer functions of order 4, respectively,
denoted by and with
Using Matlab we also derive the 4th order approximations for Examples 4.5.1–4.5.3 by balanced truncation and Hankel norm approximation, which are as follows:
where the subscripts and H denote the results obtained by balanced truncation and Hankel norm approximation, respectively.
The approximation errors are given in the following table:
From this table we see that the algorithm presented in Section 4.3 gives smaller approximation errors in in comparison with the other methods.
We now compare approximation errors in norm and compare step responses between the approximate models and the true one by figures.
In the figures of step response
the solid lines denote the true high order systems;
the dashed lines (- - -) denote the system reduced by Hankel norm approximation;
the dotted lines denote the system reduced by balanced truncation;
the dotted-dashed lines denote the systems reduced by the stochastic optimization method given in Section 4.3.
In the figures of the approximation error
the solid lines denote the systems reduced by the stochastic optimization method;
the dashed lines (- - -) denote the system reduced by Hankel norm approximation;
the dotted lines denote the system reduced by balanced truncation.
Example 4.5.1
Example 4.5.2
Example 4.5.3
These figures show that the algorithm given in Section 4.3 gives a smaller approximation error in in comparison with the other methods for Example 4.5.1 and an intermediate error in for Examples 4.5.2 and 4.5.3. Concerning step responses, the algorithm given in Section 4.3 provides a better approximation in comparison with the other methods for all three examples.
4.6. Notes and References

The well-known paper [61] by Kiefer and Wolfowitz is the pioneering work using the stochastic approximation method for optimization. The random version of the KW algorithm was introduced in [63], and the random direction version of the KW algorithm was dealt with in [85] by the ODE method. Theorems 2.4.1, 2.4.2 given in Section 4.1 are presented in [21], while Theorem 2.4.4 in [18]. The results on the convergence rate and asymptotic normality of the KW algorithm presented in Section 4.2 can be found in [21].
Global optimization based on noisy observations by discrete-time simulated annealing is considered in [45, 52, 100]. Combination of the KW algorithm with a search method for global optimization is dealt with in [97]. A better combination given in [49] is presented in Sections 4.3 and 4.4.
For model reduction we refer to [51, 102]. The global optimization method presented in Section 4.3 is applied to model reduction in Section 4.5, which is written based on [22].
Chapter 5
APPLICATION TO SIGNAL PROCESSING
The general convergence theorems developed in Chapter 2 can deal with noises containing not only random components but also structural errors. This property allows us to apply SA algorithms to parameter estimation problems arising from various fields. The general approach, roughly speaking, is as follows. First, the parameter estimation problem coming from practice is transformed to a root-seeking problem for a reasonable but unknown function which may not be directly observed. Then, the real observation is artificially written in the standard form

with Normally, it is quite straightforward to arrive at this point. The main difficulty is to verify that the complicated noise satisfies one of the noise conditions required in the convergence theorems. It is common that there is no standard method to complete the verification procedure, because the noises arising in different problems are completely different from each other.
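As a purely illustrative sketch of this root-seeking recipe (the scalar function, the noise model, and the step sizes below are all hypothetical, not taken from the text), a Robbins-Monro iteration driven by observations in the standard form might look like:

```python
import random

random.seed(0)
theta = 2.5        # unknown root of the hypothetical f(x) = theta - x

x = 0.0
for k in range(1, 20001):
    a_k = 1.0 / k                    # step sizes: sum a_k = inf, sum a_k^2 < inf
    eps = random.gauss(0.0, 1.0)     # zero-mean observation noise
    y = (theta - x) + eps            # observation written in the standard form
    x = x + a_k * y                  # Robbins-Monro update

print(x)  # close to theta = 2.5
```

Verifying that the accumulated noise actually satisfies one of the convergence conditions is precisely the problem-specific step the text refers to.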
In Section 5.1, SA algorithms are applied to solve the blind channel identification problem, an active topic in communication. In Section 5.2, the principal component analysis used in pattern classification is dealt with by SA methods. Section 5.3 continues the problem discussed in Section 5.1, but in a more general setting. Namely, unlike Section 5.1, the covariance matrix of the observation noise is no longer assumed to be known. In Section 5.4, adaptive filtering is considered: very simple conditions for convergence of sign-algorithms are given. Section 5.5 discusses the asymptotic behavior of asynchronous SA algorithms, which take the possible communication delays between parallel processors into consideration.
5.1. Recursive Blind Identification

In the area of systems and control, the unknown parameters are estimated on the basis of observed input and output data of the system. This is the subject of system identification. In contrast to this, for communication channels only the channel output is observed and the channel input is unavailable. The topic of blind channel identification is to estimate channel parameters by using the output data only. Blind channel identification has drawn much attention from researchers because of its potential applications in wireless communication. However, most existing estimation methods are “block” algorithms in nature, i.e., parameters are estimated after the entire block of data has been received.
By using the SA method, here a recursive approach is presented: estimates are continuously improved while receiving new signals.
Consider a system consisting of channels with L being the maximum order of the channels. Let be the one-dimensional input signal, and be the channel output at time where N is the number of samples and may not be fixed:
where
are the unknown channel coefficients. Let us denote by

the coefficients of the channel, and by

the coefficients of the whole system, which compose a vector.
The observations may be corrupted by noise
where is a vector. The problem is to estimate on the basis of observations.
Let us introduce polynomials in backward-shift operator
where

Write and in the component forms
respectively, and express the component via
From this it is clear that
Define
where is a
It is clear that is a matrix. Similar to and let us define and and and which have the same structure as and but with replaced by and respectively.
By (5.1.5) we have
From (5.1.8), (5.1.4), and (5.1.10) it is seen that
This means that the channel coefficient satisfies the set of linear equations (5.1.12) with coefficients being the system outputs.

From the input sequence we form the (N – 2L + 1) × (2L + 1) Hankel matrix
It is clear that the maximal rank of is 2L + 1 as
If is of full rank for some then will also be of full rank for any
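Since the display defining the Hankel matrix is not reproduced in this excerpt, the following sketch builds an (N − 2L + 1) × (2L + 1) Hankel matrix from an input record; the exact layout is an assumption for illustration only.

```python
import numpy as np

def hankel_input_matrix(u, L):
    """Form an (N - 2L + 1) x (2L + 1) Hankel matrix from the input
    samples u = (u_0, ..., u_N); the exact layout is an assumption,
    since the text's display is not reproduced in this excerpt."""
    N = len(u) - 1
    rows, cols = N - 2 * L + 1, 2 * L + 1
    return np.array([[u[i + j] for j in range(cols)] for i in range(rows)])

u = np.arange(10.0)              # N = 9 samples of a hypothetical input
H = hankel_input_matrix(u, L=2)
print(H.shape)                   # (6, 5); anti-diagonals of H are constant
```

For a generic (persistently exciting) input, such a matrix attains the maximal rank 2L + 1, which is the content of condition A5.1.2 below.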
Lemma 5.1.1 Assume the following conditions hold:
A5.1.1 have no common root.
A5.1.2 The Hankel matrix composed of the input signal is of full rank (rank = 2L + 1).
Then is the unique (up to a scalar multiple) nonzero vector simultaneously satisfying

Proof. Assume there is another solution to (5.1.14), which is different from

where is

Denote
From (5.1.15) it follows that
By (5.1.7), we then have
which implies
where by we denote the (2L + 1)-dimensional vector composed of the coefficients of the polynomial written in the form of increasing orders of
Since is of full rank, In other words,
For a fixed (5.1.17) is valid for all Therefore, all
roots of should be roots of for all By A5.1.1,
all roots of must be roots of Consequently, there is a
constant such that Substituting this into (5.1.17) leads to
and hence Thus, we conclude that
We first establish a convergence theorem for blind channel identification based on stochastic approximation methods for the case where a noise-free data sequence is observed.
Then, we extend the results to the case where N is not fixed and the observation is noise-corrupted.
Assume is observed. In this case are available, and we have We will repeatedly use the data by setting

Define the estimate for recursively by

with an initial value We need the following condition.
Theorem 5.1.1 Assume A5.1.1–A5.1.3 hold. Let be given by (5.1.19) with any initial value with Then
where is a constant.
Proof. Decompose and respectively into orthogonal vectors:

where

If serves as the initial value for (5.1.19), then by (5.1.14),
Again, by (5.1.14) we have
and we conclude that
and
Therefore, for proving the theorem it suffices to show that as
Denote
and

Then by (5.1.21) we have

Noticing that and is uniformly bounded with respect to for large we have

and

By (5.1.18)
and by Lemma 5.1.1, is its unique (up to a constant multiple) eigenvector corresponding to the zero eigenvalue, and the rank of
is
Denote by the minimal nonzero eigenvalue of
Let be an arbitrary vector orthogonal to Then can be expressed by
where – 1, are the unit eigenvectors of
corresponding to its nonzero eigenvalues. It is clear that
By this, from (5.1.23) and (5.1.24), it follows that for
and
Noticing that
we conclude
and hence
From (5.1.21) it is seen that is nonincreasing for Hence, the convergence implies that

The proof is completed.

Remark 5.1.1 If the initial value is orthogonal to then and (5.1.20) is also true. But this is an uninteresting case, giving no information about
Remark 5.1.2 Algorithm (5.1.19) is an SA algorithm with a linear time-varying regression function The root set J for is time-invariant: As mentioned above, evolves in one of the subspaces depending on the initial value: In the proof of Theorem 5.1.1 we have actually verified that may serve as the Lyapunov function satisfying A2.2.20 for Then applying Remark 2.2.6 also leads to the desired conclusion.
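The recursion (5.1.19) itself is not displayed in this excerpt. The sketch below implements one plausible iteration of this type on a hypothetical two-channel FIR example with L = 1: the cross-relation between the two channel outputs yields regressors phi with phi @ h = 0 at the true coefficients, and a normalized stochastic-gradient step drives the estimate toward the direction of h, up to a scalar multiple as in Theorem 5.1.1. All channel and signal choices here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([1.0, 0.5, 1.0, -0.3])      # hypothetical (h1_0, h1_1, h2_0, h2_1); coprime channels

theta = np.array([1.0, 0.0, 0.0, 0.0])   # unit initial estimate, not orthogonal to h
s_prev = rng.normal()
x1_prev = x2_prev = 0.0
for k in range(1, 100001):
    s = rng.normal()                     # unobserved channel input
    x1 = h[0] * s + h[1] * s_prev        # observed output of channel 1
    x2 = h[2] * s + h[3] * s_prev        # observed output of channel 2
    if k > 1:
        # cross-relation regressor: phi @ h = 0 for the true coefficients
        phi = np.array([x2, x2_prev, -x1, -x1_prev])
        theta = theta - (1.0 / k) * (phi @ theta) * phi
        theta /= np.linalg.norm(theta)   # keep the estimate on the unit sphere
    s_prev, x1_prev, x2_prev = s, x1, x2

cosine = abs(theta @ h) / np.linalg.norm(h)
print(cosine)  # close to 1: theta aligns with h up to a scalar multiple
```

The coprimeness of the two channels (condition A5.1.1) is what makes the direction of h the unique null direction that the iteration can converge to.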
We now assume the input signal is a sequence of infinitely many mutually independent random variables and that the observations do not contain noise, i.e., in (5.1.5).

Lemma 5.1.2 Assume A5.1.1 holds and is a sequence of mutually independent random variables with Then is the unique unit eigenvector corresponding to the zero eigenvalue for the matrices
and the rank of is
Proof. Since is a sequence of mutually independent random variables and it follows that
where
Proceeding along the lines of the proof of Lemma 5.1.1, we arrive at the analogue of (5.1.16):
which implies
From (5.1.28) and (5.1.29) it follows that Then, following the proof of Lemma 5.1.1, we conclude that is the unique unit vector satisfying

This shows that is of rank and is its unique unit eigenvector corresponding to the zero eigenvalue.

Let denote the minimal nonzero eigenvalue of On we need the following condition.
A5.1.4 is a sequence of mutually independent random variables with for some and such that
Condition A5.1.3 is strengthened to the following A5.1.5.
A5.1.5 A5.1.3 holds and where is given in A5.1.4.
It is obvious that if is an iid sequence, then is a positive constant, and (5.1.30) is automatically satisfied.

Theorem 5.1.2 Assume A5.1.1, A5.1.4, and A5.1.5 hold, and is given by (5.1.19) with initial value Then
where
Proof. In the present situation we still have (5.1.21) and (5.1.22). So, it suffices to show
With N replaced by 4L in the definitions of and we again arrive at (5.1.23).
Since
converges a.s. by A5.1.4 and A5.1.5, there is a large such that

Let be an arbitrary vector such that Then by Lemma 5.1.2,
and hence
Therefore, which
tends to zero since This implies
is bounded, and
a.s.,
We now consider the noisy observation (5.1.5). By the definition (5.1.11), similar to (5.1.9) we have

where and have the same structure as given by (5.1.10) with replaced by and respectively.
The following truncated algorithm is used to estimate
with initial value and

Introduce the following conditions.
A5.1.6 and are mutually independent and each of them is a sequence of mutually independent random variables (vectors) such that
and
for some
and where is given in A5.1.4.
Set
Then
Denote by the resetting times, i.e., Then, we have
A5.1.7
and
Let be an orthogonal matrix, where
Denote
Then
Noticing we find that
Lemma 5.1.3 Assume A5.1.6 and A5.1.7 hold. Then for given by (5.1.32),
Proof. Setting
we have
and
By A5.1.6, is a martingale difference sequence with
Noticing and we find
that
by the convergence theorem for martingale difference sequences. Since is independent of
and we also have
which together with (5.1.42) implies (5.1.41).
Lemma 5.1.4 Under the condition A5.1.6, if then there is a constant possibly depending on the sample path, such that
where
Proof. By A5.1.6 there is a constant possibly depending on the sample path, such that
Then the lemma follows from (5.1.36) by noticing
Lemma 5.1.5 Assume A5.1.1 and A5.1.6 hold. Then for any and any the matrix
has rank and serves as its unique unit eigenvector corresponding to the zero eigenvalue.

Proof. Since is a sequence of mutually independent nondegenerate random variables, where
Notice that coincides with given by (5.1.13) if we set N = 4L and in (5.1.13).
Proceeding as in the proof of Lemma 5.1.1, we again arrive at (5.1.16). Then, we have Since

we find that Then by the same argument as that used in the proof of Lemma 5.1.1, we conclude that for any is the unique unit nonzero vector simultaneously satisfying
Since is a matrix, the above assertion
proves that the rank of is and also
proves that is its unique unit eigenvector corresponding to the zeroeigenvalue.
Denote by the minimal nonzero eigenvalue of
We need the following condition.
A5.1.8 There is a such that
It is clear that if is an iid sequence, then is independent of and and A5.1.8 is automatically satisfied.
Lemma 5.1.6 Assume A5.1.1 and A5.1.6–A5.1.8 hold. Then for any
if N is large enough, where with c and given in A5.1.7 and A5.1.8, respectively.

Proof. Let be the orthogonal matrix composed of eigenvectors of By Lemma 5.1.5,
is the only eigenvector corresponding to the zero eigenvalue. Since can be expressed as
Then
By A5.1.4 is bounded with respect to and hence by (5.1.48) and the nonincreasing property of we have

where denotes the integer part of Since we have
which combined with (5.1.44) leads to

for large enough

where and
Theorem 5.1.3 Assume A5.1.1 and A5.1.6–A5.1.8 hold. Then for given by (5.1.32) with initial value
and
where is a random variable expressed by (5.1.60).
Proof. We first prove that the number of truncations is finite, i.e., a.s.
Assume the converse:
By Lemma 5.1.3, for any given
and
as
if is large enough, say,

By the definition of we have
which combined with (5.1.52) implies
and
Define
Since is well-defined
by (5.1.54). Notice that from to there is no truncation. Consequently,
and
With fixed, let us take From (5.1.52) and (5.1.54) it follows that the sequences starting from cross the interval for each This means that crosses the interval for each Here, we say that the sequence crosses an interval with if and there is no truncation in the algorithm (5.1.32) for

Without loss of generality, we may assume converges:
It is clear that and By Lemma 5.1.4, there is no truncation for if T is small enough.

Then, similar to (2.2.24), for large by Lemmas 5.1.3 and 5.1.4 we have
have
where and
By Lemma 5.1.6, for large and small T we have
By Lemma 5.1.4, Noticing that

and by the definition of crossing, we see that for small enough T,
This implies that
Letting in (5.1.57), we find that
which contradicts (5.1.58). The contradiction shows that
Thus, starting from the algorithm (5.1.32) undergoes no truncation.

If did not converge as then
and would cross a nonempty interval
infinitely often. But this leads to a contradiction as shown above. Therefore, converges as
If were not zero, then there would exist a convergent
subsequence Replacing in (5.1.56) by from (5.1.57) it follows that
Since converges, the left-hand side of (5.1.59) tends to zero, which makes (5.1.59) a contradictory inequality. Thus, we have proved
a.s.
Since from (5.1.40) it follows that
By (5.1.38) and the fact that we finally conclude that

a.s.

The difficulty in applying the algorithm (5.1.32) lies in the fact that the second moment of the noise may not be available. Identification of the channel coefficients without using will be discussed in Section 5.3, by using the principal component analysis to be described in the next section.
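For orientation, the expanding-truncation device used by algorithm (5.1.32), whose display is not reproduced in this excerpt, can be sketched on a scalar toy problem; the function, the noise, the bounds, and the resetting point below are all hypothetical.

```python
import random

random.seed(1)
theta = 4.0                       # root of the hypothetical f(x) = theta - x
M = [2.0 * (s + 1) for s in range(100)]   # expanding truncation bounds M_0 < M_1 < ...
x, sigma = 0.0, 0                 # estimate and truncation counter
x_star = 0.0                      # fixed resetting point with |x_star| < M_0

for k in range(1, 50001):
    a_k = 1.0 / k
    y = (theta - x) + random.gauss(0.0, 1.0)   # noisy observation of f at x
    cand = x + a_k * y
    if abs(cand) > M[sigma]:      # truncation: reset and enlarge the bound
        x, sigma = x_star, sigma + 1
    else:
        x = cand

print(sigma, x)  # finitely many truncations; x ends near theta
```

The point of the device is that no a priori bound on the sought root is needed: the bounds grow until the trajectory stays inside one of them, after which only finitely many truncations occur, as in Theorem 5.1.3.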
5.2. Principal Component Analysis

Principal component analysis (PCA) is one of the basic methods used in feature extraction, signal processing, and other areas. Roughly speaking, PCA gives recursive algorithms for finding eigenvectors of a symmetric matrix A based on noisy observations of A.
Let be a sequence of observed symmetric matrices, and The problem is to find eigenvectors of A, in particular, the one corresponding to the maximal eigenvalue.

Define

with initial value being a nonzero unit vector. serves as an estimate for a unit eigenvector of A.

If then is reset to a different vector with norm equal to 1.
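The displays (5.2.1)–(5.2.2) are not reproduced in this excerpt; the following is a standard normalized iteration of the kind described, with the matrix A, the noise level, and the step sizes all chosen hypothetically for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([3.0, 1.0, 0.5])                   # hypothetical symmetric matrix
u = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)   # nonzero unit initial value

for k in range(1, 50001):
    a_k = 1.0 / k
    N = rng.normal(0.0, 0.5, size=(3, 3))
    A_k = A + (N + N.T) / 2.0                  # noisy symmetric observation of A
    v = u + a_k * A_k @ u
    n = np.linalg.norm(v)
    if n < 1e-12:                              # the reset step of the algorithm
        v = np.ones(3)
        n = np.linalg.norm(v)
    u = v / n

print(abs(u[0]))  # close to 1: u aligns with the dominant eigenvector of A
```

Normalizing at every step keeps the estimate on the unit sphere, which is why no truncation device is needed in this part of the algorithm.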
Assume have been defined as estimates for unit eigenvectors of A. Denote which is an where

where denotes the pseudo-inverse of Since for large is a full-rank matrix,
Define
if with
If we redefine an with such that
Define the estimate for the eigenvalue corresponding to the eigenvector whose estimate at time is by the following recursion.

Take a sequence increasingly diverging to infinity and define by the SA algorithm with expanding truncations:
where
We will use the following conditions:
A5.2.1 and
A5.2.2 are symmetric, and
A5.2.3 and
where is given by (1.3.2).
Examples for which (5.2.8) is satisfied are given in Chapters 1 and 2. We now give one more example.
Example 5.2.1 Assume is stationary and ergodic,
If then satisfies (5.2.8). Set

By ergodicity, we have a.s. By a partial summation it follows that
which implies (5.2.8).
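The partial summation invoked here is presumably Abel's summation identity, stated for generic step sizes $a_k$ and zero-mean noises $\varepsilon_k$ with partial sums $S_k$ (the book's own display is not reproduced in this excerpt):

```latex
\sum_{k=1}^{n} a_k \varepsilon_k
  = a_n S_n + \sum_{k=1}^{n-1} \left( a_k - a_{k+1} \right) S_k ,
\qquad S_k := \sum_{j=1}^{k} \varepsilon_j , \quad S_0 := 0 .
```

Since $S_k/k \to 0$ a.s. by ergodicity and the zero-mean assumption, the terms on the right-hand side are controlled by the decay of the step sizes, which is how the noise condition (5.2.8) is verified.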
Let be the unit eigenvector of A corresponding to the eigenvalue where the eigenvalues may not be distinct.
Theorem 5.2.1 Assume A5.2.1 and A5.2.2 hold. Then given by (5.2.1)–(5.2.6) converges at those samples for which A5.2.3 holds, and the limits of coincide with

Let denote the limit of as Then

Proof. Consider those for which A5.2.3 holds. We first prove convergence of Note that may happen only for a finite number of steps, because as and By boundedness of we expand into the power series of
where
Further, we rewrite (5.2.9) as
where
Denote
From (5.2.10) and the boundedness of and it is seen that
as Therefore, in order to show that satisfies A2.2.3 it suffices to show
for any convergent subsequence
By boundedness of and it is clear that
where c is a constant for a fixed sample. For any there is a such that
Consequently, we have
Expressing the first part of as
we find that

This is because (5.2.8) is applied for the first term on the right-hand side of (5.2.16), while for the other two terms we have used (5.2.15), and the boundedness of and

Similar treatment can also be applied to the second part of Thus, we have verified (5.2.13), and A2.2.3 too.
Denote by S the unit sphere in Then defined by (5.2.2) evolves on S.
Define
The root set of on S is
Defining we find for
Thus, Condition A2.2.2(S) introduced in Remark 2.2.6 is satisfied. Since is bounded, no truncation is needed. Then, by Remark 2.2.6 we conclude that converges to one of say

Denote
Inductively, we now assume
We then have
Since and from (5.2.21) and (5.2.5) it
follows that and by (5.2.6)
We now proceed to show that converges to one of the unit eigenvectors contained in

From (5.2.5) we see that the last term in the recursion

tends to zero as So, by (5.2.22) we need to reset with and at most a finite number of times.
Replacing by in (5.2.9)–(5.2.11), we again arrive at (5.2.11) for Precisely,
where
and
By noticing
and using (5.2.22), (5.2.23) can be rewritten as
where as Since tends to an eigenvector of A, from (5.2.11) it follows that
where
Since converges, from (5.2.13) and it follows that
Inductively, assume that
with satisfying (5.2.27), i.e.,
Noticing that for any matrix V, we have
by (5.2.28). Since by (5.2.24), denoting by the term we have

for any convergent subsequence

Denoting
from (5.2.26) we see
By (5.2.8) and (5.2.30), similar to (5.2.18)–(5.2.20), by Remark 2.2.6 converges to a unit eigenvector of From (5.2.5) it is seen that converges since and Then from (5.2.6) it follows that itself converges as

Thus, we have
From (5.2.5) it follows that
which implies that and consequently,
Since the limit of is a unit eigenvector of we have

By (5.2.33) it is clear that can be expressed as a linear combination of eigenvectors Consequently, which combined with (5.2.34) implies that

This means that is an eigenvector of A, and is different from by (5.2.33).

Thus, we have shown (5.2.21) for To complete the induction it remains to show (5.2.28) for

As we have just shown, tends to zero as From (5.2.31) we have
where satisfies (5.2.29) with replaced by by taking notice that (5.2.30) is fulfilled for the whole sequence because which has been shown to be convergent.
Elementary manipulation leads to
This expression combined with (5.2.35) proves (5.2.28) for Thus, we have proved that given by (5.2.1)–(5.2.6) converge to different unit eigenvectors of A, respectively.

To complete the proof of the theorem it remains to show Rewrite the untruncated version of (5.2.7) as follows
We have just proved that Then by (5.2.8) and
noticing the fact that converges and we see that
satisfies A2.2.3. The regression function in (5.2.36) is linear:
Applying Theorem 2.2.1 leads to
Remark 5.2.1 If in (5.2.1) and (5.2.3) is replaced by Theorem 5.2.1 remains valid. In this case given by (5.2.18) should change to and correspondingly changes to As a result, the limit of changes to the opposite sign, from to
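The displays (5.2.3)–(5.2.6) are not reproduced above. One plausible reading, sketched below on a hypothetical 3 × 3 example, is that each subsequent eigenvector estimate is driven by observations deflated by the current earlier estimates; here an orthogonal projection stands in for the pseudo-inverse construction in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([3.0, 1.0, 0.5])          # hypothetical symmetric matrix

def observe():                        # noisy symmetric observation of A
    N = rng.normal(0.0, 0.3, size=(3, 3))
    return A + (N + N.T) / 2.0

u1 = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
u2 = np.array([1.0, -1.0, 1.0]) / np.sqrt(3.0)
for k in range(1, 50001):
    a_k = 1.0 / k
    Ak = observe()
    u1 = u1 + a_k * Ak @ u1           # first eigenvector: normalized iteration
    u1 /= np.linalg.norm(u1)
    P = np.eye(3) - np.outer(u1, u1)  # deflate: work orthogonally to u1
    v2 = P @ (u2 + a_k * Ak @ u2)
    n2 = np.linalg.norm(v2)
    u2 = v2 / n2 if n2 > 1e-12 else np.array([0.0, 1.0, 0.0])

print(abs(u1[0]), abs(u2[1]))  # each close to 1: estimates align with e1 and e2
```

Projecting the second iterate through P at every step keeps it exactly orthogonal to the current first estimate, mirroring the inductive structure of the proof of Theorem 5.2.1.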
5.3. Recursive Blind Identification by PCA

As mentioned in Section 5.1, the algorithm (5.1.32) for identifying the channel coefficients uses the second moment of the observation noise. This causes difficulty in possible applications, because may not be available.

We continue to consider the problem stated in Section 5.1 with the notations introduced there. In particular, (5.1.1)–(5.1.12) and (5.1.31) will be used without explanation.

Instead of (5.1.32) we now consider the following normalized SA algorithm:
Comparing (5.3.1) and (5.3.2) with (5.2.1) and (5.2.2), we find that the channel parameter identification algorithm coincides with the PCA algorithm with By Remark 5.2.1, Theorem 5.2.1 can be applied to (5.3.1) and (5.3.2) if conditions A5.2.1, A5.2.2, and A5.2.3 hold.
The following conditions will be used.
A5.3.1 The input is a sequence, i.e., there exist a constant and a function such that for any
where
A5.3.2 There exists a distribution function over such that
where denotes the Borel in and
A5.3.3 The (2L + 1) × (2L + 1)-matrix is nondegenerate, where

A5.3.4 The signal is independent of and a.s., where is a random variable with

A5.3.5 All components of are mutually independent with and
and is bounded where is a constant.
A5.3.6 have no common root.
A5.3.7 and

For Theorem 5.1.1, is assumed to be a sequence of mutually independent random variables (Condition A5.1.6), while in A5.3.1 the independence is weakened to a property, but the distribution of is additionally required to be convergent. Although there is no requirement on the distribution of in Theorem 5.1.1, we notice that (5.1.30) is satisfied if are identically distributed.
In the sequel, denotes the identity matrix. Define with
and
In what follows denotes the Kronecker product.
Theorem 5.3.1 Assume A5.3.1–A5.3.7 hold. Then
where C is a -matrix and Q is given in A5.3.3, and for given by (5.3.1) and (5.3.2),
where J denotes the set of unit eigenvectors of C.
Proof. By the definition of we have
Since
and by A5.3.2, (5.3.3) immediately follows.

From the definition (5.1.31) for by A5.3.5 it is clear that is a -identity matrix multiplied by with Then by A5.3.4 and A5.3.5
Identifying in Theorem 5.2.1 to we find that Theorem 5.2.1 can be applied to the present algorithm, if we can show (5.2.8), which, in the present case, is expressed as

where is given by (1.3.2), and B is given by (5.3.6). Notice, by the notation introduced in (5.1.33),
Since
and
by the convergence theorem for martingale difference
sequences, for (5.3.7) it suffices to show
Identifying and in Lemma 2.5.2 to and respectively, we find that the conditions required there are satisfied. Then (5.3.8) follows from Lemma 2.5.2, and hence (5.3.7) is fulfilled.

By Theorem 5.2.1, given by (5.3.1) and (5.3.2) converges to a unit eigenvector of B, which clearly is an eigenvector of C.

Lemma 5.3.1 is the unique (up to a scalar multiple) nonzero vector simultaneously satisfying

Proof. Since it is known that satisfies (5.3.9), it suffices to prove the uniqueness.
As in the proof of Lemma 5.1.1, assume is also a solution to (5.3.9). Then, along the lines of the proof of Lemma 5.1.1, we obtain the analogue of (5.1.16), which implies (5.1.29):

where is given by (5.1.28) while by (5.1.16). By A5.3.3 which is nondegenerate. Then we have The rest of the proof for uniqueness coincides with that given in Lemma 5.1.1.

By Lemma 5.3.1, zero is an eigenvalue of C with multiplicity one and the corresponding eigenvector is Theorem 5.3.1 guarantees that the estimate approaches J, but it is not clear if tends to the direction of
Let be all the different eigenvalues of C. J is composed of disconnected sets and where Note that the limit points of are in a connected set, so converges to a for some Let We want to prove that a.s. or This is the conclusion of Theorem 5.3.2, which is essentially based on the following lemma, proved in [9].
Lemma 5.3.2 Let be a family of nondecreasing and be a martingale difference sequence with

Let be an adapted random sequence and be a real sequence such that and Suppose that on the following conditions 1, 2, and 3 hold.

2) can be decomposed into two adapted sequences and such that

3) coincides with an random variable for some
Then
Theorem 5.3.2 Assume A5.3.1–A5.3.7 hold. Then defined by (5.3.1) and (5.3.2) converges to up to a constant multiple:
where equals either
Proof. Assume the contrary: for some Since C is a symmetric matrix, for where and hereafter a possible set with zero probability in is ignored. The proof is completed in four steps.

Step 1. We first explicitly express Expanding defined by (5.3.2) into the power series of we derive
where
Noting and we derive
and
where is defined by (5.1.4), is given by (5.1.10) with replaced by the observation noise, and denotes the estimate for at time

By (5.3.4) and (5.3.5), there exists a.s. such that a.s.
For any integers and define and
Note that for
and by the convergence of from (5.3.12) it follows that where is a constant for all in By
(5.3.7) we then have
as where and hereafter T should not be confused with the superscript T for transpose.

Choose large enough and sufficiently small T such that Let

and It then follows that for
In
for sufficiently large.

Consequently, for with fixed
and hence
Define
From (5.3.15) it follows that
Letting in (5.3.21) and replacing by in the resulting equality, by (5.3.19) we have

Thus, we have expressed in two ways: (5.3.21) shows that is measurable, while (5.3.22) is in the form required in Lemma 5.3.2, where

Step 2. In order to show that the summand in (5.3.22) can be expressed as that required in Lemma 5.3.2, we first show that the series

is convergent on By (5.3.14) and (5.3.7) it suffices to show is convergent on
Define
and
Clearly, is measurable with respect to and Then by the convergence theorem for martingale difference sequences,
By (5.3.16) it follows that
The first term on the right-hand side of the last equality of (5.3.29) can be expressed in the following form:
where the last term equals
Combining (5.3.30) and (5.3.31), we derive that the first term on the right-hand side of the last equality of (5.3.29) is
By A5.3.4, A5.3.5, and A5.3.7 it is clear that Hence replacing by in (5.3.29) results in producing an additional term of magnitude Thus, by (5.3.24)–(5.3.26) we can rewrite (5.3.29) as

where and is By (5.3.28) and A5.3.7 the series (5.3.33) is convergent, and hence given by (5.3.23) is a convergent series.

Step 3. We now define the sequences corresponding to and in Lemma 5.3.2.
Let We have
where
Denote
Then and are adapted sequences, is a martingale difference sequence, and is written in the form of Lemma 5.3.2:

It remains to verify (5.3.10) and (5.3.11).

From (5.3.23) and (5.3.33) it follows that there is a constant such that Then for noticing
and
we have
By A5.3.4 and A5.3.5 it follows that
As in Step 4 it will be shown that
From this it follows that
Then from the following inequality
by (5.3.34) and (5.3.36) it follows that
Therefore all conditions required in Lemma 5.3.2 are met, and we conclude Since it follows that and must converge to a.s.

Step 4. To complete the proof we have to show (5.3.35). If (5.3.35) were not true, then there would exist a subsequence such that

For notational simplicity, let us denote the subsequence still by Since by A5.3.5 for if and for any
but if we then have
which combined with (5.3.37) implies that
and
Noticing that and from (5.3.38)
and (5.3.24) it follows that
On the other hand, we have
and hence,
where denotes the estimate provided by for at time Since for any
we have
Hence (5.3.40) implies that
and
By A5.3.4 the left-hand side of (5.3.41) equals
Since it follows that for any
The left-hand side of (5.3.42) equals
Thus (5.3.42) implies that for any
Noticing from (5.3.25) we have
Then by A5.3.5, (5.3.39) implies that for any
Notice that
and
Then by A5.3.5, from (5.3.45)–(5.3.47) it follows that
and hence for any
and
Notice that (5.3.49) means that
However, the above expression equals
Therefore,
In the sequel, it will be shown that (5.3.43), (5.3.44), (5.3.48), and (5.3.50) imply that which contradicts

This means that the converse assumption (5.3.37) is not true.

For any since are coprime, where is
given in (5.1.6), there exist polynomials such that
Let and be the degrees of and respectively. Set

Introduce the q-dimensional vector and q × q
square matrices W and A as follows:
Note that where and Then (5.3.43), (5.3.44), (5.3.48), and (5.3.50) can be written in the following compact form:

To see this, note that for any fixed and on the left-hand sides of (5.3.48) and (5.3.50) there are 2L different sums when varies from 0 to L – 1 and replace roles with each other. These together with (5.3.43) and (5.3.44) give us 2L + 1 sums, and each of them tends to zero. Explicitly expressing (5.3.52), we find that there are 2L + 1 nonzero rows, and each row corresponds to one of the relationships in (5.3.43), (5.3.44), (5.3.48), and (5.3.50).

Since we have put enough zeros in the definition of after multiplying the left-hand side of (5.3.52) by
has only shifted nonzero elements in
From (5.3.52) it follows that for any and in (5.3.51)

From (5.3.53) it follows that

Note that for any polynomial of degree if the last elements of are zeros. From (5.3.54) it follows that
Denoting
from (5.3.55) we find that
By the definition of the first elements of are zeros, i.e., This means that the last
elements of are zeros, i.e.,
On the other hand,
By (5.3.56), from (5.3.57) and (5.3.58) it is seen that i.e.,
From (5.3.53) it then follows that
i.e., But this is impossible, because
are unit vectors. Consequently, (5.3.37) is impossible, and this completes the proof of Theorem 5.3.2.
5.4. Constrained Adaptive Filtering

We now apply SA methods to adaptive filtering, which is an important topic in signal processing. We consider the constrained problem; the unconstrained problem is only a special case of the constrained one, as will be explained.

Let and be two observed sequences, where and are respectively. Assume is stationary and ergodic with

which, however, is unknown. It is required to design the optimal weighting X, which minimizes

under the constraint

where C and are matrices, respectively. In the case where C = 0, the problem reduces to the unconstrained one.

It is clear that (5.4.3) is solvable with respect to X if and only if and in this case the solution to (5.4.3) is

where Z is any

For notational simplicity, denote

Let L(C) denote the vector space spanned by the columns of matrix C, and let the columns of matrix be an orthonormal basis
of L(C). Then there is a full-rank decomposition Noticing we have Let be an orthogonal matrix. Then
and hence
From this it follows that
and hence a.s. This implies that
Let us express the optimal X minimizing (5.4.2) via By(5.4.8) substituting (5.4.4) into (5.4.2) leads to
On the right-hand side of (5.4.9) only the first term, which is quadratic, depends on Z. Therefore, the optimal should be the solution of
i.e.,
where is any satisfying
Combining (5.4.4) with (5.4.11), we find that
Using the ergodic property of we may replace and by their sample averages to obtain the estimate for And the estimate can be updated by using new observations. However, updating the estimate involves taking the pseudo-inverse of the updated estimate for which may be of high dimension. This will slow down the computation speed. Instead, we now use an SA algorithm to approach
By (5.4.8), we can rewrite (5.4.10) as
or
We now face the standard root-seeking problem for a linear function
As before, let and The
following algorithm is used to estimate given by (5.4.12), which inthe notations used in previous chapters is the root set J for the linearfunction given by (5.4.14):
with initial value such that and
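The expanding-truncation mechanism of the algorithm just defined (cf. (5.4.16)) can be illustrated with a small self-contained sketch. The matrices, noise, and truncation bounds below are hypothetical stand-ins, since the book's second-moment matrices and observation sequence are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical root-seeking problem: h(x) = b - A x with root A^{-1} b,
# observed with additive noise.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
b = np.array([1.0, -0.5])
root = np.linalg.solve(A, b)

x = np.array([5.0, 5.0])        # initial value
x_fixed = np.zeros(2)           # fixed point used after a truncation
sigma = 0                       # number of truncations so far
M = lambda s: 10.0 * (s + 1)    # expanding truncation bounds, M_s -> infinity

for k in range(1, 20001):
    a_k = 1.0 / k               # step sizes: sum a_k = inf, a_k -> 0
    obs = b - A @ x + 0.5 * rng.standard_normal(2)   # noisy observation of h(x)
    x_next = x + a_k * obs
    if np.linalg.norm(x_next) > M(sigma):
        x, sigma = x_fixed.copy(), sigma + 1         # truncate: restart, enlarge bound
    else:
        x = x_next

# Truncations cease after finitely many steps and the iterate approaches the root.
assert np.linalg.norm(x - root) < 0.1
```

Because the truncation bounds expand, the algorithm eventually behaves like an untruncated RM recursion, exactly as Theorem 5.4.1 asserts.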
Theorem 5.4.1 Assume that is stationary and ergodic with second moment given by (5.4.1) and that Then, after a finite number of steps, say (5.4.16) has no more truncations, i.e.,
and
i.e.,
where given by (5.4.12) solves the stated constrained optimizationproblem.
Proof. We first note that (5.4.16) is a matrix recursion. However, if in lieu of we consider with being an arbitrary constant vector, then we have a conventional vector recursion, and by (5.4.9)
may serve as a Lyapunov function for the corresponding regression function obtained from (5.4.14):
Therefore, in order to apply Theorem 2.2.1, we need only verify the noise condition.
Denoting
then from (5.4.16) we have
We now show that for a fixed sample if then
and there is a constant such that
if is sufficiently large and T is small enough, where
We need the following fact, which is an extension of Example 5.2.1. Assume the process is stationary and ergodic with
and is a convergent sequence of random matrices
a.s. Then
Let Then by ergodicity of both and
we have
because the second term on the right-hand side of the equality can beestimated as follows
which tends to zero as and then By a partial summation and by using (5.4.25) we have
which implies (5.4.24) by (5.4.25).
Let us consider the following algorithm starting from without truncation:
Set
and
Then from (5.4.26) it follows that
Denote
and
Since is stationary and ergodic, a.s., and
Then by a partial summation, we have
Notice that a.s. by ergodicity. Then for large
and from (5.4.29) it follows that
where (5.4.24) is used together with the fact that
and is stationary with E From (5.4.27)–(5.4.30), by the convergence of it follows that
for large and small T, where and are constants independent of and
Consequently, in the case i.e., in
(5.4.16), will never reach the truncation bound for if is large enough and T is small enough.
Then coincides with This verifies (5.4.22), while (5.4.23) follows from (5.4.16) because for a fixed and are bounded, and are also bounded by (5.4.31) and the convergence In
the case i.e., for some is bounded, and hence (5.4.22) and (5.4.23) are also satisfied.
We are now in a position to verify the noise condition required in
Theorem 2.2.1 for given by (5.4.20), i.e., we want to show that for any convergent subsequence
By (5.4.24)
so for (5.4.32) it suffices to show
Again, by (5.4.24) and also by (5.4.23)
which implies (5.4.33). By Theorem 2.2.1, there is such that for is defined by (5.4.17) and converges to the root set J for given by (5.4.14). This completes the proof of the theorem.
Remark 5.4.1 For the unconstrained problem and C = 0, the algorithm (5.4.16) becomes
Further, if then and Theorem 5.4.1 asserts
a.s.,
provided is stationary, ergodic, and bounded.
5.5. Adaptive Filtering by Sign Algorithms
We now consider the unconstrained problem mentioned in Section 5.4, but we restrict ourselves to the vector case, i.e., instead of a matrix signal we now consider where is and is one-dimensional. However, instead of the quadratic criterion (5.4.2) we now minimize the cost
where is an vector.
Note that the gradient of is given by
where
The problem is to find which minimizes or to approach the root set J of
As before, let be an increasing sequence of positive real numbers such that as Define the algorithm as follows
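The displayed recursion (5.5.4)–(5.5.5) is not reproduced in this excerpt; the following is a hedged sketch of a sign-algorithm update with expanding truncations, with hypothetical data (i.i.d. regressors and symmetric noise) standing in for the book's stationary ergodic signals:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: d_k = phi_k . w0 + noise, with w0 the unknown weights.
w0 = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
sigma = 0                           # truncation counter
M = lambda s: 5.0 * (s + 1)         # expanding truncation bounds

for k in range(1, 50001):
    a_k = 1.0 / k
    phi = rng.standard_normal(3)                 # regressor (stationary; i.i.d. here)
    d = phi @ w0 + 0.3 * rng.standard_normal()   # observed reference signal
    # Sign algorithm: step along phi times the sign of the prediction error,
    # i.e., a stochastic gradient step for the L1 cost E|d - phi . w|.
    w_next = w + a_k * phi * np.sign(d - phi @ w)
    if np.linalg.norm(w_next) > M(sigma):        # expanding truncation
        w, sigma = np.zeros(3), sigma + 1
    else:
        w = w_next

assert np.linalg.norm(w - w0) < 0.2
```

With symmetric noise the root of the L1-gradient is the true weight vector, which is why the sign algorithm converges to the same limit as the quadratic-criterion filter.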
Theorem 5.5.1 Assume is stationary and ergodic with
Then
where is defined by (5.5.4) and (5.5.5) with an arbitrary initial value. In addition, after a finite number of steps truncations cease to exist in (5.5.4).
Proof. Define
and
Let be a countable set that is dense in let and be two sequences of positive real numbers such that and as and denote
and
where and is an integer.
The summands of (5.5.9)–(5.5.11) are stationary with finite expectations for any any integer any and any and then the ergodic theorem yields that
a.s.,
and
Therefore, there is an such that and for each the convergence in (5.5.12)–(5.5.14) takes place for any any integer any and any
Let us fix an
We first show that for any fixed
if is large enough (say, for ), and in addition,
where c is a constant which may depend on but is independent of In what follows always denote constants that may depend on but are independent of By (5.4.24) we have for any
There are two cases to be considered. If then for large enough, and (5.5.15) holds. If is bounded, then the truncations cease to exist after a finite number of steps. So, (5.5.15) also holds if is sufficiently large. Then (5.5.16) follows immediately from (5.5.15) and (5.5.17).
Let us define
where is given by (5.5.2). Then (5.5.15) can be represented as
Let be a convergent subsequence of and let
be such that We now show that
Let By (5.5.16) or for some integer
We now verify that the terms on the right-hand side of (5.5.20) satisfy (5.5.19).
For the first term on the right-hand side of (5.5.20) we have
where and are deterministic for a fixed and the expectation is taken with respect to and
Since (5.5.6), a.s., applying the dominated convergence theorem yields
Then from (5.5.21) it follows that
Similarly, for the second term on the right-hand side of (5.5.20) we have
since a.s.
For the third term on the right-hand side of (5.5.20), by (5.4.24), (5.5.10), and (5.5.13) we have
since
Finally, for the last term in (5.5.20), by (5.5.14) and (5.4.24) we have
where the last convergence follows from the fact that a.s. as since and
a.s.
Combining (5.5.23)–(5.5.26) yields that
Since the left-hand side of (5.5.27) is free of letting tend to infinity in (5.5.27) leads to (5.5.19). The conclusion of the theorem then follows from Theorem 2.2.1 by noticing that, as in A2.2.2, one may take
5.6. Asynchronous Stochastic Approximation
When dealing with large interconnected systems, it is natural to consider distributed, asynchronous SA algorithms. For example, in a communication network with servers, each server has to allocate audio and video bandwidths in an appropriate proportion in order to minimize the average queueing delay. Denote by the bandwidth ratio for the server, and Assume the average delay time depends on only and is differentiable, Then minimizing is equivalent to finding the root of Assume the time, denoted by spent on transmitting data from the server to the server is not negligible. Then at the server for the iteration we can observe or only at where denotes the total time spent until completion of iterations for the server. This is a typical problem solved by asynchronous SA. A similar problem arises from job scheduling for computers in a computer network.
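Before the formal definitions, the delayed-observation mechanism can be sketched in a minimal two-processor example. Everything below is hypothetical (the regression function, the delay, and the noise are invented for illustration, and truncations are omitted):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical root-seeking problem: h(x) = b - A x with root A^{-1} b.
A = np.array([[1.0, 0.3], [0.3, 1.0]])
b = np.array([0.7, -0.4])
root = np.linalg.solve(A, b)

DELAY = 3                      # fixed communication delay, in iterations
hist = [np.zeros(2)]           # history of the global estimate

for k in range(1, 20001):
    x_cur = hist[-1]
    x_new = x_cur.copy()
    for i in range(2):         # processor i updates component i only
        # Processor i sees its own component up to date, but the other
        # processor's component only with a delay of DELAY iterations.
        view = x_cur.copy()
        view[1 - i] = hist[max(0, len(hist) - 1 - DELAY)][1 - i]
        a_ik = 1.0 / k         # per-processor step size
        obs = (b - A @ view)[i] + 0.2 * rng.standard_normal()
        x_new[i] = x_cur[i] + a_ik * obs
    hist.append(x_new)

assert np.linalg.norm(hist[-1] - root) < 0.1
```

Because the delay is bounded while the step sizes tend to zero, the stale components become asymptotically negligible, which is the intuition behind the convergence analysis that follows.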
We now precisely define the problem and the algorithm.
At time denote by the estimate for the unknown root of Components of are observed by different processors, and the communication delays from the processor to the processor at time are taken into account. The observation of the processor is carried out only at
i.e.,
where is the observation noise.
In contrast to the synchronous case, the update steps now are different for different processors, so it is unreasonable to use the same step size for all processors in an asynchronous environment. At time the step size used in the processor is known and is denoted by
We will still use the expanding truncation technique, but we are unable to simultaneously change the estimates in different processors when an estimate exceeds the truncation bound, because of the communication delay.
Assume all processors start at the same given initial value and for all The observation at
the processor is and is updated to by the rule given below. Because of the communication delay the estimate produced by the processor cannot reach the processor for the initial steps:
By agreement we will take to serve as whenever
At the processor, two sequences and are recursively generated, where is the estimate for the component of at time and is connected with the number of truncations up to and including time at the processor. For the processor at time the newest information about the other processors is In all algorithms discussed until now, all components of are observed at the same point at time and this makes updating to meaningful. In the present case, although we are unable to make all processors observe at the same points at each time, it is still desirable to require all processors to observe at points located as close as possible. Presumably, this would make the estimate updating reasonable. For this, noticing that the estimate changes gradually after a truncation, the ideal would be to keep all equal, but the best we can do is to equalize with the other
Keeping this idea in mind, we now define the algorithm and the observations for the processor,
Let be a fixed point from which the algorithm restarts after a truncation.
i) If there exists with then reset to equal the largest one among and pull back to the fixed point although may not exceed the truncation bound. Precisely, in this case define
and observe
ii) If for any then observe at
i.e.,
In both cases i) and ii), and are updated as follows:
where is the step size at time and may be random, and is a sequence of positive numbers increasingly diverging to infinity.
Let us list the conditions to be used.
A5.6.1 is locally Lipschitz continuous.
A5.6.2 and
there exist two positive constants such that
A5.6.3 There is a twice continuously differentiable function (not necessarily nonnegative) such that
and is nowhere dense, where
and denotes the gradient of
A5.6.4 For any convergent subsequence any and any
where
and
A5.6.5
Note that (5.6.10) holds if is bounded, since Note also that A5.6.3 holds if and
Theorem 5.6.1 Let be given by (5.6.1)–(5.6.6) with initial value Assume A5.6.1–A5.6.5 hold, and there is a constant such that and
where is given in A5.6.3. Then
where
The proof of the theorem is divided into lemmas. From now on we always assume that A5.6.1–A5.6.5 hold.
We first introduce an auxiliary sequence and its associated observation noise It will be shown that differs from only by a finite number of steps. Therefore, for convergence of it suffices to prove convergence of
Let be a sample path generated by the algorithm (5.6.1)–(5.6.6), where is the one after resetting according to (5.6.2). Let where is defined in A5.6.4. Assume By the resetting rule given in i), for any after resetting we have For we
have and by the definition of
In the processor we take and to replace and respectively, and define for those Further, define and for
Then we obtain new sequences associated with By (5.6.1)–(5.6.6), if then there exists a with
and
since and for Because during the period there is no truncation for the sequences are recursively updated as follows:
where Define the delays for as follows:
is available to the processor at time
Lemma 5.6.1 For any any convergent subsequenceand any satisfies the following condition
where
Proof. Since equals either or which is available at time it is seen that
For by the definition of we have which is certainly available to the processor. Therefore,
We rewrite By the definition of and paying attention to (5.6.17) we see
so
as
We now show that (5.6.18) is true for all For there is no truncation for the processor,
and hence by the resetting rule i). If
for some then by (5.6.16) and the definition of it follows that
which implies (5.6.18).If for some then as explained above for the processor
at time the latest information about the estimate produced by theprocessor is In other words,
However, by the definition of which yields
This again implies (5.6.18).
In summary, we have
This means that for there is no truncation at any time equal to and the observation is carried out at
i.e.,
For any any convergent subsequence and any we have
By (5.6.11), Then from A5.6.2 and
A5.6.5 it follows that and hence the second term
on the right-hand side of (5.6.21) tends to zero as Further, from the definition of there is such that Hence the first term on the right-hand side of (5.6.21) is of order o(T) by A5.6.4. Consequently, from A5.6.2, A5.6.4, and A5.6.5 it follows that satisfies (5.6.15).
Lemma 5.6.2 Let be generated by (5.6.12)–(5.6.14). For any convergent subsequence of if is bounded, then there are and such that
where is given in (5.6.14).
Proof. Let where and
where is given in A5.6.2.
By (5.6.15), for the convergent subsequence there exists such that for any and
Choose such that For any let
Then for any
If then if is sufficiently large, i.e., no truncation occurs after and hence for
If then there exists such that for any From (5.6.24) it follows that
Therefore, in both cases
If then for sufficiently large
i.e.,
This contradicts the definition of Therefore,
Lemma 5.6.3 Let be given by (5.6.12)–(5.6.14). For any with the following assertions take place:
i) In the case cannot cross infinitely many times keeping bounded, where are the starting points of the crossings;
ii) In the case cannot converge to keeping bounded.
Proof. i) Since is bounded, there exists a convergent subsequence, which is still denoted by for notational simplicity,
By the boundedness of and (5.6.22), for sufficiently large there is no truncation between and and hence
where By (5.6.20), (5.6.22), and it follows that
By A5.6.2 and A5.6.3 we have
Then by A5.6.1
where is the Lipschitz coefficient of in and By the boundedness of
and the fact that there is no truncation between and it follows that
Without loss of generality, we may assume is a convergent sequence. Then by A5.6.3 and A5.6.5
Therefore,
where
Since is continuous for fixed by A5.6.4 there exists a for such that
Thus, for sufficiently small T and sufficiently large we have
On the other hand, by Lemma 5.6.2
Thus, for sufficiently small T, and
This contradicts (5.6.31), and i) is proved.
ii) If is bounded, then there is a convergent subsequence Then the assertion can be deduced in a similar way as for i).
Lemma 5.6.4 Under the conditions of Theorem 5.6.1
where is given by (5.6.14).
Proof. If then there exists a sequence such that
From (5.6.12)–(5.6.14) we have Choose a small positive constant such that
Let be a connected set containing and included in the set and let be a connected set containing and included in the set
Clearly, and and are bounded.
Since diverges to infinity, there exists such that for Noting that there exists i such that
and we can define and for
Since there is a convergent subsequence in also denoted by Let be a limit point of
By the definition of is bounded. But crosses infinitely many times, and it
is impossible by Lemma 5.6.3. Thus,
Proof of Theorem 5.6.1
and
By Lemma 5.6.4, is bounded. Let
If then by Lemma 5.6.3 we have
If then there are and such that and since is nowhere dense. But by Lemma 5.6.3 this is impossible. Therefore,
We now show If there is a convergent subsequence
and then (5.6.26)–(5.6.30) still hold. Hence,
This is a contradiction to
Consequently, i.e.,
Since and the truncations occur only finitely many times. Therefore, and differ from each other only for a finite number of So,
5.7. Notes and References
For blind identification with “block” algorithms we refer to [71, 96]. Recursive blind channel identification algorithms appear to be new. Section 5.1 is written on the basis of the joint work “H. F. Chen, X. R. Cao, and J. Zhu, Convergence of stochastic approximation based algorithms for blind channel identification”. Principal component analysis is applied in different areas (see, e.g., [36, 79]). The results presented in Section 5.2 are an improved version of those given in [101]. Principal component analysis is applied to solve the blind identification problem in Section 5.3, which is based on the recent work “H. T. Fang and H. F. Chen, Blind channel identification based on noisy observation by stochastic approximation method”. The proof of Lemma 5.3.2 is given in [9].
For adaptive filtering we refer to [57]. The results presented in Section 5.4 are stronger than those given in [11, 28]. The sign algorithms are dealt with in [42], but the conditions used in Section 5.5 are considerably weaker than those in [42]. Section 5.5 is based on the recent work “H. F. Chen and G. Yin, Asymptotic properties of sign algorithms for adaptive filtering”.
Asynchronous stochastic approximation was considered in [9, 88, 89, 99]. Section 5.6 is written on the basis of [50].
Chapter 6
APPLICATION TO SYSTEMS AND CONTROL
Assume a control system depends on a parameter and the system operation reaches its ideal status when the parameter equals some Since is unknown, we have to estimate it during the operation of the system, which, therefore, can work only with the estimate of In other words, the real system does not run under the ideal parameter and the problem is to estimate on-line and to make the system asymptotically operate in the ideal status. It is clear that this kind of system parameter identification can be dealt with by SA methods.
Adaptive control for linear stochastic systems is a typical example of the situation described above. If the system coefficients are known, then the optimal stochastic control may be a feedback control of the system state. The corresponding feedback gain can be viewed as the ideal parameter which depends on the system coefficients. In the setup of adaptive control, the system coefficients are unknown, and hence is unknown. The problem is to estimate and to prove that the resulting adaptive control system, using the estimate as the feedback gain, is asymptotically optimal as tends to infinity.
In Section 6.1 the ideal parameter is identified by SA methods for systems in a general setting, and the results are applied to solving the adaptive quadratic control problem. The adaptive stabilization problem is solved for stochastic systems in Section 6.2, while adaptive exact pole assignment is discussed in Section 6.3. An adaptive regulation problem for nonlinear and nonparametric systems is considered in Section 6.4.
6.1. Application to Identification and Adaptive Control
Consider the following linear stochastic system depending on a parameter
where and are unknown.
The ideal parameter for system (6.1.1) is a root of an unknown function
The system actually operates with equal to some estimate for i.e., the real system is as follows:
For notational simplicity, we suppress the dependence on the state and rewrite (6.1.3) as
The observation at time is
where is a noise process.
From (6.1.5) it is seen that the function is not directly observed,
but it is connected with as follows:
We list conditions that will be used.
where is generated by (6.1.1).
Let be a sequence of positive numbers increasingly diverging to infinity and let be a fixed point. Fixing an initial value we recursively estimate by the SA algorithm with expanding truncations:
A6.1.2 There is a continuously differentiable function such that
for any and is nowhere dense, where J is given by (6.1.2). Further, used in (6.1.8) is such that
inf for some and
A6.1.3 The random sequence in (6.1.1) satisfies a mixing condition characterized by
uniformly in where Further, is such that sup where
A6.1.4 For sufficiently large integer
for any such that converges, where is given by (1.3.2).
Let is stable}, and let be an open, connected subset of
A6.1.5 and f are connected by (6.1.6) and (6.1.1) for each satisfies a local Lipschitz condition on
with for any constants and where is given in A6.1.3.
with
A6.1.1 and
A6.1.6 and in (6.1.1) are globally Lipschitz continuous:
where L is a constant.
A6.1.7 given by (6.1.7) is If converges for some then where may depend on
Theorem 6.1.1 Assume A6.1.1–A6.1.7 hold. Then
where is a connected subset of
Proof. By (6.1.5) we rewrite the observation in the standard form
where
By Theorem 2.2.2 and Condition A6.1.4, the assertion of the theorem will immediately follow if we can show that for almost all condition (2.2.2) is satisfied with replaced by
Let be expressed as a sum of seven terms:
where
where
and and denote the distribution and conditional distribution of given respectively.
To prove the theorem it suffices to show that there exists with such that for each all satisfy
(2.2.2) with respectively identified to
By definition, for any there is such that
where is independent of
Let us first show that satisfy (2.2.2). Solving (6.1.1) yields
By A6.1.3, is bounded. Hence, by (6.1.18), is bounded and by A6.1.5 is also bounded:
where
where is given in A6.1.5. Since we have
We now show that and are continuous in uniformly with respect to
By (6.1.18) and (6.1.20), from (6.1.19) it follows that
By (6.1.18), (6.1.20), and the Lipschitz condition A6.1.5 for it follows that
and
which implies the uniform continuity of This together with (6.1.13) yields that is also uniformly continuous.
Let be a countable dense subset of
Noticing that is and expressing
as a sum of martingale difference sequences
by (6.1.20) and we find that there is with such that for each
for any integer and any From here, by the uniform continuity of it follows that for and for any integer
Note that
This is because by (6.1.18) and (6.1.20) we have the following estimate:
We now estimate by the treatment used in Lemma 2.5.2. By applying the Jordan–Hahn decomposition to the signed measure
Similarly, we can find with such that for and
since is bounded by the martingale convergence theorem. It is worth noting that (6.1.23) holds a.s. for any but without loss of generality (6.1.23) may be assumed to hold for all
with To see this, we first select such that (6.1.23) holds for any This is possible
because is a countable set. Then, we notice that is continuous in uniformly with respect to Thus, we have
withon
such
where is the mixing coefficient given in A6.1.3. Thus, by (6.1.27)–(6.1.29) we have
and
it is seen that there is a Borel set D in the sampling space such that for any A in the sampling space
By A6.1.5, (6.1.18), (6.1.20), and noticing we find
whose expectation is finite as explained for (6.1.20). Therefore, on the right-hand side of (6.1.30) the conditional expectation is bounded with respect to by the martingale convergence theorem, and the last term is also bounded with respect to Thus, by (6.1.10), from (6.1.30) it follows that there is with such that
Assume is a convergent subsequence
Define
Write (6.1.4) as
Let be fixed.
Now choose sufficiently small so that
and hence
Applying the Gronwall inequality to (6.1.33) we obtain the inequality
where and hereafter always denotes a constant for fixed and, without loss of generality, we assume
Define so that
Consequently, we have
Since by A6.1.7, by A6.1.5 and A6.1.6 it follows that for
where By induction we now show that
for all suitably large
For any fixed if is large enough, since
Therefore, (6.1.36) holds for since
Assume (6.1.36) holds for some By noticing from (6.1.34) and (6.1.35) it follows that
By using (6.1.20), (6.1.37), and the inductive assumption, and applying (6.1.19) to it follows that
for where and satisfies the following equation
By A6.1.7 and (6.1.20) we have
and using (6.1.18), (6.1.37), and the inductive assumption we derive
This, combined with (6.1.38), shows that there are real numbers and such that
for From here it follows that
From the inductive assumption it follows that for
for some large enough integer N. Then by (6.1.12)
Setting
we derive
where (6.1.22), (6.1.24), (6.1.25), (6.1.31), (6.1.39), and (6.1.40) are used.
Choose sufficiently small so that (6.1.35) holds, and
Since by A6.1.5 there is such that
for all From (6.1.41) it then follows that
It can be assumed that is sufficiently large so that
Since by (6.1.42) it follows that
and hence there is no truncation at
Thus, we have
or equivalently,
which proves (6.1.36).
Consequently, (6.1.39) is valid for and
hence
times and
and
where is the estimate of
Let be given by (6.1.7) and (6.1.8) with given by (6.1.5).
where and are related by (6.1.44).
However, since the ideal is unknown, the real system satisfies the
equation
where and are symmetric such that and Let (given by A6.1.3). The control
where is the feedback control which is required to minimize
Finally, noticing that A6.1.5 assumes (6.1.6), we conclude that for each
all satisfy (2.2.2) with
respectively replaced by The proof of the theorem is completed.
We now apply the obtained result to an adaptive control problem. Assume that is the ideal parameter for the system, being the unique zero of an unknown function The system in the ideal condition is described by the equation
From (6.1.21) and (6.1.13) it is seen that is continuous in uniformly with respect to Therefore, its limit is a continuous function. Then by (6.1.36) it follows that
should be selected in the family U of admissible controls:
In order to give the adaptive control we need the expression of the optimal control when is known.
Lemma 6.1.1 Suppose that
i) is a martingale difference sequence with
ii) where is controllable and observable, i.e., · · · , and · · · , are of full rank.
Then in the class of nonnegative definite matrices there is a unique satisfying
and
where
and
Proof. The existence of a unique solution to (6.1.50) and the stability of F given by (6.1.51) are well-known facts in control theory. We show the optimality of the control given by (6.1.52).
For notational simplicity, we temporarily suppress the dependence of and on and write them as A, B, and D, respectively.
Noticing
is stable. The optimal control minimizing (6.1.45) is
we then have
Since by the estimate for the weighted sum of a martingale difference sequence, from (6.1.55) it follows that
where is the state in (6.1.47). Thus the closed-loop system becomes
Notice that the last term of (6.1.56) is nonnegative. The conclusions of the lemma follow from (6.1.56).
According to (6.1.52), by the certainty equivalence principle, we form the adaptive control
which has the same structure as (6.1.4). Therefore, under the assumptions A6.1.1–A6.1.7, with replaced by and with J being a singleton, by Theorem 6.1.1 it is concluded that
By continuity and stability of it is seen that there are and possibly depending on such that
This yields the boundedness of and
because
By (6.1.60) it follows that
Therefore, the closed-loop system (6.1.58) asymptotically operates under the ideal parameter and minimizes the performance index (6.1.45).
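The certainty-equivalence idea above can be sketched for a hypothetical scalar system. Everything below is an illustrative stand-in (a recursive least-squares estimator replaces the book's SA estimate, and the system, costs, and dither are invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical scalar system x_{k+1} = a x_k + b u_k + w_k with unknown (a, b)
# and quadratic cost E[q x^2 + r u^2].
a_true, b_true, q, r = 0.8, 1.0, 1.0, 1.0

def lq_gain(a, b):
    """Solve the scalar Riccati equation by fixed-point iteration; return F."""
    if abs(b) < 0.05:                  # guard: need a usable estimate of b
        return 0.0
    p = q
    for _ in range(200):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)

theta = np.array([0.5, 0.8])           # running estimates of (a, b)
P = 100.0 * np.eye(2)                  # RLS covariance
x = 0.0
for k in range(5000):
    # Certainty equivalence: feed back the gain computed from current estimates.
    F = 0.0 if k < 50 else np.clip(lq_gain(*theta), -5.0, 5.0)
    u = -F * x + 0.1 * rng.standard_normal()   # dither keeps (a, b) identifiable
    x_next = a_true * x + b_true * u + 0.1 * rng.standard_normal()
    phi = np.array([x, u])             # regressor: x_{k+1} = theta . phi + w_k
    K = P @ phi / (1.0 + phi @ P @ phi)
    theta = theta + K * (x_next - theta @ phi)
    P = P - np.outer(K, phi @ P)
    x = x_next

# The certainty-equivalence gain approaches the optimal gain.
assert abs(lq_gain(*theta) - lq_gain(a_true, b_true)) < 0.1
```

As the estimates converge, the feedback gain converges to the optimal one, mirroring the asymptotic optimality statement (6.1.60)–(6.1.61) above.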
6.2. Application to Adaptive Stabilization
Consider the single-input single-output system
where and are the system input, output, and noise, respectively, and
where is the backward shift operator, The system coefficient
is unknown. The purpose of adaptive stabilization is to design the control so that
a.s.
The fact that and a can be solved from (6.2.5) for any means that
is nonzero. In other words, the coprimeness of and is equivalent to
In the case where is unknown, the certainty equivalence principle suggests replacing by its estimate to derive the adaptive control law. However, for may be zero, and (6.2.5) may not be solvable with and replaced by their estimates.
Let us estimate by the following algorithm, called the weighted least squares (WLS) estimate, which is convergent for any feedback control
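The WLS recursion itself is not reproduced in this excerpt; the following is a generic weighted least squares sketch on hypothetical data, with log-type weights of the kind often used to guarantee convergence regardless of the input (the book's exact weights may differ):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical regression y_{k+1} = phi_k . theta0 + w_k; estimate theta0.
theta0 = np.array([1.2, -0.7])
theta = np.zeros(2)
P = 10.0 * np.eye(2)
r = np.e                                   # running sum e + sum ||phi||^2

for k in range(5000):
    phi = rng.standard_normal(2)
    y = phi @ theta0 + 0.2 * rng.standard_normal()
    r += phi @ phi
    w = 1.0 / np.log(r) ** 2               # weight a_k = (log r_k)^(-2)
    # Weighted RLS update: P^{-1} is increased by w * phi phi^T.
    K = w * P @ phi / (1.0 + w * phi @ P @ phi)
    theta = theta + K * (y - phi @ theta)
    P = P - np.outer(K, phi @ P)

assert np.linalg.norm(theta - theta0) < 0.1
```

The down-weighting by powers of log r slows adaptation slightly but makes the estimate convergent under very weak excitation conditions, which is what matters for the adaptive stabilization argument.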
If is known and if and are coprime, then for an arbitrary stable polynomial of degree there are unique polynomials
and both of order with such that
Then the feedback control generated by
leads the system (6.2.1) to
Then, by stability of (6.2.4) holds if we assume
Considering the coefficients of and as unknowns, and identifying the coefficients of on both sides of (6.2.5), we derive a system of linear algebraic equations with matrix for the unknowns:
where
Though converges a.s., its limit may not be the true If a bounded sequence can be found such that the modified estimate
and for some
is convergent and
then the control obtained from (6.2.6) with replaced by solves the adaptive stabilization problem, i.e., makes (6.2.4) hold.
Therefore, the central issue in adaptive stabilization is to find a bounded sequence such that given by (6.2.12) is convergent and (6.2.13) is fulfilled. This gives rise to the following definition.
Definition. System (6.2.1) is called adaptively stabilizable by the use of the parameter estimate if there is a bounded sequence such that (6.2.13) holds and given by (6.2.12) is convergent.
It can be shown that if system (6.2.1) is controllable, i.e., and are coprime, then it is adaptively stabilizable by the use of the WLS estimate. It can also be shown that the system is adaptively stabilizable by use of if and only if where and F denote the limits of and respectively, which are generated by (6.2.9)–(6.2.11).
We now use an SA algorithm to recursively produce such that is convergent and the resulting estimate by (6.2.12) satisfies (6.2.13).
is generated by (6.2.9)–(6.2.11), is defined by (6.2.11), and is recursively defined by an SA algorithm given below.
Let us take a few real sequences defined as follows:
where
which can be written as
From algebraic geometry it is known that is a finite set.
However, is not directly observed; the real observation is
The root set of is denoted by where
where
As a matter of fact,
Let and be –dimensional, and let
Let be l-dimensional with only one nonzero element equal to either +1 or −1, Similarly, let be -dimensional with only nonzero elements, each of which equals either +1 or −1,
The total number of such vectors is
Normalize these vectors and denote the resulting vectors by in nondecreasing order of the number of nonzero elements in
Define and for Introduce
Define the recursive algorithm for as follows:
and is a fixed vector.
The algorithm (6.2.23)–(6.2.27) is the RM algorithm with expanding truncations, but it differs from the algorithm given by (2.1.1)–(2.1.3) as follows. The algorithm (2.1.1)–(2.1.3) is truncated on the upper side only, but the present algorithm is truncated not only on the upper side but also on the lower side: is allowed neither to diverge to infinity nor to tend to zero; whenever it reaches the truncation bounds, the estimate
is pulled back to and is enlarged to on the upper side, while on the lower side is pulled back to which will change to the
next whenever is satisfied. If for successive resettings of we have to change to the next one, then we reduce to
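The two-sided truncation idea — keeping the iterate away from both zero and infinity — can be sketched in a hypothetical scalar example (the regression function, bounds, and noise below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical scalar root-seeking: h(x) = 1 - x, root at 1, with the iterate
# allowed neither to diverge to infinity nor to tend to zero.
x, x_fixed = 0.5, 0.5
sigma = 0                      # upper-truncation count: bound M_sigma expands
delta = 0                      # lower-truncation count: bound m_delta shrinks
M = lambda s: 4.0 * (s + 1)
m = lambda d: 0.05 / (d + 1)

for k in range(1, 10001):
    a_k = 1.0 / k
    x_next = x + a_k * (1.0 - x + 0.3 * rng.standard_normal())
    if abs(x_next) > M(sigma):            # upper truncation: restart, enlarge bound
        x, sigma = x_fixed, sigma + 1
    elif abs(x_next) < m(delta):          # lower truncation: restart, shrink bound
        x, delta = x_fixed, delta + 1
    else:
        x = x_next

assert abs(x - 1.0) < 0.1
```

Since the sought limit is nonzero and finite, both kinds of truncation cease after finitely many steps and the recursion eventually runs as an ordinary RM algorithm, which is the content of Lemma 6.2.1 below.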
Lemma 6.2.1 Assume the following conditions hold:
A6.2.2 System (6.2.1) is adaptively stabilizable by use of generated by (6.2.9)–(6.2.11), i.e.,
If then after a finite number of steps the algorithm (6.2.23)–(6.2.27) becomes the RM algorithm
converges and
Proof. The basic steps of the proof are essentially the same as those for proving Theorem 2.2.1, but some modifications should be made because of the truncations at the lower side.
Step 1. Let be a convergent subsequence of
For any define the RM algorithm
with or for some for some
We show that there are M > 0, T > 0 such that when and
when if is large enough, where is given by (1.3.2).
Let > 1 be a constant such that
It is clear that
A6.2.1 and
Since and are convergent, there is such that
Let By (6.2.29) and (6.2.30), we have
for if and for if where
Let (6.2.31) hold for or
It then follows that
where or
Thus, (6.2.31) has been inductively proved for or
Step 2. Let be a convergent subsequence. We show that there
are M > 0 and T > 0 such that
if is large enough.
If defined by (6.2.25) is bounded, then (6.2.32) follows directly. Again take such that and set
Assume Then there is a such that
By the result proved in Step 1, starting from the algorithm for cannot directly hit the sphere with radius without a truncation for So it may first hit some lower bound at time and switch to some from which again by Step 1 cannot directly reach without a truncation. The only possibility is to be truncated again at a lower bound. Therefore, (6.2.32) takes place.
Step 3. Since and are convergent, by (6.2.32) it follows that from any convergent subsequence there are constants and
such that
if is large enough. Consequently, there is such that
By (6.2.32) and the convergence of and it also follows that
Therefore,
Using (6.2.33) and (6.2.34) by the same argument as that given in Step 3 of the proof for Theorem 2.2.1, we arrive at the following conclusion. If starting from the algorithm (6.2.24) is calculated as an RM algorithm and is bounded, then for any with and cannot cross
infinitely often.
Step 4. We now show that is bounded. If is unbounded, then as Therefore, is unbounded and comes back to the fixed point infinitely many times.
Notice that is a finite set and
We see that there is an interval with and 0 such that crosses infinitely often, and during each crossing the algorithm (6.2.24) behaves like an RM algorithm with starting point It is clear that is bounded because as But by Step 3, this is impossible. Thus, we conclude that is bounded, and after a finite number of steps (6.2.24) becomes
as
Step 5. We now show (6.2.28), i.e., after a finite number of steps the algorithm (6.2.35) ceases to truncate at the lower side.
Since and by A6.2.2, it follows that there is at least one nonzero coefficient in the polynomial for some
with Therefore, for some and a small
From (6.2.16) it is seen that for sufficiently small we have
Combining this with the convergence of and leads to
for sufficiently large
From (6.2.26) and (6.2.36) it follows that must be bounded, and
hence is bounded. This means that there is a such that
We now show that is bounded. Since for all sufficiently large it follows that
If were unbounded, then by (6.2.37) the algorithm, starting from would infinitely many times enter the sphere with radius
where is small enough such that
Then would cross infinitely often an interval Since is a finite set, we may assume It is clear that during the crossing the algorithm behaves like an RM algorithm. By Step 4, this is impossible.
Therefore, there is a such that
Noticing (6.2.20), (6.2.34), and that serves as the Lyapunov function for from Theorem 2.2.1 we conclude the remaining assertions of the lemma.
where and are defined by 1)-3) described above.
Proof. The key step is to show that
Assume the converse:
Case i) The assumption implies that
and occurs infinitely many times. However,
this is impossible, since and The contradiction shows
Theorem 6.2.1 Assume conditions A6.2.1 and A6.2.2 hold. Then there is such that and converges and
and use to produce the adaptive control as in 1), and go back to 1) for
3) If and none of a)-c) of 2) is the case, then set and go back to 1) for and at the same time change to
i.e.,
Define
Using we now define in (6.2.12) satisfying (6.2.13) and thus solving the adaptive stabilization problem.
Let
1) If then set Using we produce the adaptive control from (6.2.6) with and defined from (6.2.5) with replaced by and go back to 1) for
2) If then define
a) for the case where
b) defined by (6.2.24) for the case where but
c) for the case where but
and the algorithm defining will run over the following cases: 1) and 2a)-2c). Since and are convergent, the inequality
for all sufficiently large Again, this means that (6.2.41) may take place at most a finite number of times, and we conclude that
Thus, there is such that
If then from (6.2.43) it follows that
Since and for sufficiently large from (6.2.42) it follows that
for all sufficiently large Thus, (6.2.41) may take place at most a finite number of times. The contradiction shows that
we havethen as
Take a convergent subsequence of For notational simplicity denote by itself its convergent subsequence. Thus
By Lemma 6.2.1,
1) If then
Case ii) The assumption implies that there
is a sequence of integers such that and i.e., for all the following indicator equals one
2) If
implies
6.3. Application to Pole Assignment for Systemswith Unknown Coefficients
Consider the linear stochastic system
where is the -dimensional state, is the one-dimensional control, and is the -dimensional system noise.
The task of pole assignment is to define the feedback control
in order that the characteristic polynomial
of the closed-loop system coincides with a given polynomial
The pair is called similar to if there exists a nonsingular matrix such that
where denotes the column of T.
Consequently, the truncation at the lower bound in (6.2.24) should be very rare. The computation will be simplified if there is no lower-bound truncation.
for sufficiently large This means that the algorithm can be at 2b) only finitely many times. For the same reason it cannot be at 2c) infinitely many times. Therefore, the algorithm will settle at 1) if and at 2a) if and in both cases there is a such that and
The convergence of follows from the convergence of and
Remark 6.2.1 For the case the origin is not a stable equilibrium for the equation
So, is nonsingular if and only if is nonsingular.
Assume that is controllable and is already in its controller form (6.3.5). For notational simplicity, we will write rather than
where
which imply
Define
where are coefficients of
The pair is called the controller form associated to the pair
If is controllable, i.e., is of full rank, then is similar to its controller form. To see this, we note that (6.3.4) implies and from it follows that
where is the system noise at time “1” for the system with feedback gain applied.
Having observed we compute its characteristic polynomial det which is a noise-corrupted characteristic polynomial of
Let be the estimate for By observing det we actually learn the difference det which in a certain sense reflects how far det differs from the ideal polynomial
For any let
With feedback control the closed-loop system takes the form
Since is in controller form,
where are elements of the row vector F:
Therefore, if is known, then comparing (6.3.10) with (6.3.3) gives the solution to the pole assignment problem, where
We now solve the pole assignment problem by learning for the case where is unknown.
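In the known-coefficient case the comparison just described reduces, for a system in controller form, to matching coefficients of the characteristic polynomial. The sketch below illustrates this noise-free baseline; the numeric coefficients are invented for the example, and `char_poly` is a plain Faddeev–LeVerrier routine added here for self-containment, not part of the book's development.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def char_poly(A):
    """Faddeev-LeVerrier: coefficients [1, c1, ..., cn] of det(zI - A)."""
    n = len(A)
    M = [[float(i == j) for j in range(n)] for i in range(n)]  # M_1 = I
    coeffs = [1.0]
    for k in range(1, n + 1):
        AM = matmul(A, M)
        c = -sum(AM[i][i] for i in range(n)) / k
        coeffs.append(c)
        M = [[AM[i][j] + (c if i == j else 0.0) for j in range(n)] for i in range(n)]
    return coeffs

# Illustrative open-loop coefficients a and desired coefficients a_star.
a = [1.0, 0.5, 0.25]
a_star = [-0.6, 0.03, 0.01]          # desired poly z^3 - 0.6 z^2 + 0.03 z + 0.01
# Controller (companion) form with B = e1; first row carries the coefficients.
A = [[-a[0], -a[1], -a[2]], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
F = [ai - si for ai, si in zip(a, a_star)]      # feedback gain u = F x
A_cl = [row[:] for row in A]
A_cl[0] = [A[0][j] + F[j] for j in range(3)]    # A + B F: first row becomes -a_star
cl_coeffs = char_poly(A_cl)                     # matches [1] + a_star
```

The closed-loop first row equals the negated desired coefficients, so the closed-loop characteristic polynomial is exactly the assigned one; the learning algorithm of this section recovers the same gain when the coefficients are unknown.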
Let us combine the vector equation (6.3.9) for initial values to form the matrix equation
Let In learning control, can be observed at any fixed
For any the observation of is denoted by
be the row vector composed of coefficients of
By (6.3.10)
composed of coefficients of
and respectively.Take a sequence of positive real numbers
and
Calculate the estimate for by the following RM algorithm with expanding truncations:
with fixed
Theorem 6.3.1 Assume that is controllable and is in the controller form. Further, assume the following conditions A6.3.1 and A6.3.2 hold:
A6.3.1 The components of in (6.3.13) are mutually independent with
A6.3.2
where is the same as that in A6.3.1. Then there is with such that for each as
Similarly, define row vectors
for some
From here it is seen that is a sum of products of elements from with +1 and –1 as the multiplier for each product, where and denote elements of A and respectively. It is important to note that each product in includes at least one of as its factor. Thus, the product is of the form
From (6.3.21) by (6.3.18), (6.3.15), and (6.3.13) it follows that
Therefore, the conclusion of the theorem will follow from Theorem 2.2.1, if we can show that for any integer N
where is the desired feedback gain realizing the exact pole assignment.
Proof. Define
where and are given by (6.3.14) and (6.3.17), respectively. By (6.3.11) and (6.3.16) it follows that
Thus, (6.3.19) and (6.3.20) become
It is clear that the recursive algorithm for has the same structure as (2.1.1)–(2.1.3). For the present case, as the function required in A2.2.2 we may take
where
By A6.3.1 we have
where
By A6.3.2 and the convergence theorem for martingale difference sequences it follows that
for any integer which implies (6.3.24).
6.4. Application to Adaptive Regulation
We now apply the SA method to solve the adaptive regulation problem for a nonlinear nonparametric system.
Consider the following system
where is the system state, is the control, and is an unknown nonlinear function with being the unknown equilibrium for the system (6.4.1).
Assume the state is observed, but the observations are corrupted
by noise:
where is the observation noise, which may depend on
The purpose of adaptive regulation is to define adaptive control based on measurements in order that the system state reach the desired value, which, without loss of generality, may be assumed to be zero.
We need the following conditions.
A6.4.1 and
A6.4.2 The upper bound for is known, i.e., and is a robustly stabilizing control in the sense that for any the state
tends to zero for the following system
A6.4.3 The system (6.4.1) is BIBS stable, i.e., for any bounded input, the system state is also bounded;
A6.4.4 is continuous for bounded i.e., for any
A6.4.5 The system (6.4.1) is strictly input passive, i.e., there are and such that for any input
A6.4.6 For any convergent subsequence
where is defined by (1.3.2).
It is worth noting that A6.4.6 becomes
if is independent of
The adaptive control is given by the following recursive
algorithm:
where b is specified in A6.4.2.
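A control law of the type (6.4.4) can be simulated on a toy plant. The sketch below is an illustration only: the linear plant with unknown constant offset c, the noise level, and the bound b are assumptions of this example, whereas the book's system (6.4.1) is nonlinear and nonparametric.

```python
import random

random.seed(1)
b = 5.0                # known truncation bound as in A6.4.2
c = 1.0                # unknown offset: toy plant x_{k+1} = 0.5 x_k + c + u_k
x, u = 0.0, 0.0
for k in range(20000):
    a_k = 1.0 / (k + 1)                    # decreasing SA gains
    x = 0.5 * x + c + u                    # plant step under the current control
    y = x + 0.05 * random.gauss(0.0, 1.0)  # noisy state observation as in (6.4.2)
    u = max(-b, min(b, u - a_k * y))       # SA update truncated at the bound b
# The state is driven to zero; the control learns the unknown value -c.
```

The control accumulates the noisy state observations with decreasing gains; the decreasing gains average out the observation noise while the truncation keeps the control within the known stabilizing range.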
Theorem 6.4.1 Assume A6.4.1–A6.4.6. Then the system (6.4.1), (6.4.2),and (6.4.4) has the desired properties:
on sample paths where A6.4.6 holds.
Proof. Let be a convergent subsequence of such that
and
We have
for sufficiently large and small enough T, where is a constant to be specified later on. The relationships (6.4.5) and (6.4.6) can be proved along the lines of the proof for Theorem 2.2.1, but here is known to be bounded, and (6.4.5) and (6.4.6) can be proved more straightforwardly. We show this.
Since the system (6.4.1) is BIBS stable, from it follows that there is such that
By A6.4.6 for large and small T > 0,
This implies that
Let be large enough such that
and let T be small enough such that
Then we have
and hence there is no truncation in (6.4.4) for i.e., (6.4.5) holds for Therefore,
indeed.
By induction, the assertions (6.4.5) and (6.4.6) have been proved. We now show that for any convergent subsequence
there is a such that
from (6.4.4) it follows that (6.4.5) holds for Hence,
Thus, (6.4.5) and (6.4.6) hold for Assume they are true for all We now show that they are true for too.
Since
for small enough T > 0. By A6.4.5, we have
Let us restrict in (6.4.8) to Then for small T and large from (6.4.6) and (6.4.8) it follows that
and (6.4.6) is true for
Since and it is seen that
Using a partial summation, by (6.4.9) we have
for all sufficiently large and small enough T > 0. Set
for
This implies that there exist a and a sufficiently large which
may depend on but is independent of such that
Then (6.4.10) implies that
This proves (6.4.7).
Define
From (6.4.7) it follows that
for convergent subsequence
Using A6.4.6 and (6.4.11), by exactly the same argument as that used in the proof (Steps 3–6) of Theorem 2.2.1, we conclude that
Finally, write (6.4.1) as
By A6.4.4 and the boundedness of we have and by A6.4.2 we conclude
Remark 6.4.1 It is easy to see that A6.4.6 is also necessary if A6.4.1–A6.4.5 hold and and This is because for large the observation noise can be expressed as
and hence
6.5. Notes and References
For system identification and adaptive control we refer to [10, 23, 54, 62, 75, 90]. The identification problem stated in Section 6.1 was solved in [72] by the ODE method. In comparison with [72], the conditions used here have been considerably weakened, and the convergence is proved by the TS method rather than the ODE method. Section 6.1 is based on the joint work by H. F. Chen, T. Duncan and B. Pasik-Duncan. The existence and uniqueness of the solution to (6.1.50) can be found, e.g., in [23]. For stochastic quadratic control refer to [2, 10, 12, 33].
Adaptive stabilization for stochastic systems is dealt with in [5, 55, 77]. The convergence of WLS and adaptive stabilization using WLS are given in [55]. The problem is solved by the SA method in [19]. This approach is presented in Section 6.2.
The pole assignment problem for stochastic systems with unknown coefficients is solved by SA with the help of learning in Section 6.3, which is based on [20]. For concepts of linear control systems we refer to
which tends to zero as since and
Remark 6.4.2 In the formulation of Theorem 6.4.1 the condition A6.4.5 can be replaced either by (6.4.7) or by (6.4.11), which are the consequences of A6.4.5. Further, the quadratic can be replaced by a continuously differentiable function such that and In this case, in (6.4.7) should be correspondingly replaced by
Example 6.4.1 Let the nonlinear system be affine:
where the scalar nonlinear function is bounded from above and from below by positive constants:
Note that and hence (6.4.7) holds, if Assume is known: Then A6.4.2, A6.4.3, and A6.4.4 are satisfied. Therefore, if satisfies A6.4.6, then given by (6.4.4) leads to and
In the area of systems and control, SA methods are also successfully applied to discrete event dynamic systems, especially to perturbation-analysis-based parameter optimization.
[1, 46, 60]. The connection between the feedback gain and the coefficients of the desired characteristic polynomial is called Ackermann’s formula, which can be found in [46].
Application of SA to adaptive regulation is based on [26].
For perturbation analysis of discrete event dynamic systems we refer to [58]. Perturbation-analysis-based parameter optimization is dealt with in [29, 86, 87].
Appendix A
In Appendix A we introduce the basic concepts of probability theory. Results are presented without proof. For details we refer to [31, 32, 70, 76, 84].
A.1. Probability Space
The basic space is denoted by The point is called an elementary event or sample. The point set in is denoted by A.
Let be a family of sets in satisfying the following conditions:
1.
2.
3.
Then is called the or The element A of is called a measurable set, or random event, or event.
As a consequence of Properties 2 and 3,
then the complement of A, also belongs to
If
If
if
A set function defined on is called -additive if for any
sequence of disjoint events By definition, one of the values or is not allowed to be taken by
A nonnegative set function is called a measure.
Define
The set functions and are called the upper, lower, and total variation of on respectively.
Jordan-Hahn Decomposition Theorem If is on then there exists a set D such that, for any
and are measures and
Let P be a set function defined on with the following properties:
1.
2.
then
3. if are disjoint. Then, P is called a
probability measure on The triple is called a probability space. P(A) is called the probability of the random event A.
It is assumed that any subset of a measurable set of probability zero is measurable and its probability is zero. After such a completion of measurable sets the resulting probability space is called completed.
If a relationship between random variables holds for any with the possible exception of a set of probability zero, then we say this relationship holds a.s. (almost surely) or with probability one.
A.2. Random Variable and Distribution Function
In R, the real line, the smallest containing all intervals is called the Borel and is denoted by The “smallest” means that if there is a containing all intervals, then there must be in the sense that for any
The Borel can also be defined in Any set in or is called a Borel set.
Any interval can be endowed with a measure equal to its length. This measure can be extended to each i.e., to each Borel set. Any subset of a set with measure zero is also assumed to be a measurable set with measure zero. After such a completion, the measurable set is called Lebesgue measurable, and the measure the Lebesgue measure. In what follows always means the completed Borel
A real function defined on is called measurable, if
If is a real measurable function defined on and then is called a random variable. Therefore, if is a measurable function, then is also a random variable if
Let be a random variable. The distribution function of is defined as
By a random vector we mean that each component of is a random variable.
The distribution function of a random vector is defined as
If is differentiable, then its derivative is called the density of The density of a random vector is defined in a similar way. The density of the l-dimensional normal distribution is defined by
A.3. Expectation
Let be a random variable and let
Define
where
is called the expectation of
For an arbitrary random variable define
The expectation of is defined as
if at least one of and is finite. If then is called integrable.
The expectation of can be expressed as a Lebesgue-Stieltjes integral with respect
to its distribution function
In the density of l-dimensional random vector with normal distribution,
A.4. Convergence Theorems and Inequalities
Let $\{\xi_n\}$ be a sequence of random variables and $\xi$ be a random variable.
If $P(\xi_n \to \xi) = 1$, then we say that $\xi_n$ converges to $\xi$ almost surely and write $\xi_n \to \xi$ a.s.
If $P(|\xi_n - \xi| > \epsilon) \to 0$ for any $\epsilon > 0$, then we say that $\xi_n$ converges to $\xi$ in probability and write $\xi_n \stackrel{P}{\to} \xi$.
If the distribution functions of $\xi_n$ converge to that of $\xi$ at any point where the latter is continuous, then we say $\xi_n$ weakly (or in distribution) converges to $\xi$ and write $\xi_n \stackrel{w}{\to} \xi$.
If $E|\xi_n - \xi|^2 \to 0$, then we say $\xi_n$ converges to $\xi$ in the mean square sense and write l.i.m. $\xi_n = \xi$.
Mean square convergence implies convergence in probability, which in turn implies weak convergence.
Monotone Convergence Theorem If random variables nondecreasingly (nonincreasingly) converge to and then
Dominated Convergence Theorem If and there exists an integrable random variable such that then and
Fatou Lemma If for some random variable with then
If is a measurable function, then
Chebyshev Inequality: $P(|\xi| \ge \epsilon) \le E\xi^2/\epsilon^2$ for any $\epsilon > 0$.
Lyapunov Inequality: $(E|\xi|^r)^{1/r} \le (E|\xi|^s)^{1/s}$ for $0 < r \le s$.
Hölder Inequality: $E|\xi\eta| \le (E|\xi|^p)^{1/p}(E|\eta|^q)^{1/q}$, where $p > 1$ and $1/p + 1/q = 1$.
In the special case where $p = q = 2$ the Hölder inequality is called the Schwarz inequality.
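A quick Monte Carlo sanity check of the Chebyshev bound; the uniform distribution on $[-1,1]$ and the threshold are arbitrary choices made for this illustration.

```python
import random

random.seed(2)
eps = 0.5
n = 200000
samples = [random.uniform(-1.0, 1.0) for _ in range(n)]
# Empirical P(|xi| >= eps); the true value for Uniform[-1,1] is 0.5.
p_emp = sum(1 for s in samples if abs(s) >= eps) / n
# Chebyshev bound E(xi^2)/eps^2; here E(xi^2) = 1/3, so the bound is 4/3.
bound = (sum(s * s for s in samples) / n) / eps ** 2
```

The bound is far from tight in this example (4/3 versus a true probability of 1/2), which is typical: Chebyshev trades sharpness for complete generality.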
A.5. Conditional Expectation
Let be a probability space. is called a of if is a and by which it is meant that any implies
Radon-Nikodym Theorem Let be a of For any random variable with at least one of and being finite, there is a unique measurable random variable, denoted by such that for any
The random variable satisfying the above equality is called the conditional expectation of given
Let be the smallest (see A.2) containing all sets
is called the generated by
The conditional expectation of given is defined as
Let A be an event. Conditional probability of A given is defined by
Properties of the conditional expectation are listed below.
1) for constants and
2)
3) if is and
4) if
5) if
Convergence theorems and inequalities stated in A.4 remain true with the expectation replaced by the conditional expectation For example, the conditional Hölder inequality
for
For a sequence of random variables and a the consistent conditional distribution functions of given
conditional distribution functions of given
Let and Then
can be defined such that i) they are for any and any fixed ii) they are distribution functions for any fixed and iii) for any measurable function
A.6. Independence
Let be a sequence of events. If for any set of indices
then is called mutually independent.
Let be a sequence of If events are mutually independent whenever then the family of is called mutually independent.
Let be a sequence of random variables and let be the generated by If is mutually independent, then the sequence of random variables is called mutually independent.
Law of iterated logarithm Let be a sequence of independent and identically distributed (iid) random variables, Then
Proposition A.6.1 Let be a measurable function defined on If the l-dimensional random vector is independent of the m-dimensional random vector then
where
From this proposition it follows that
if is independent of
A.7. Ergodicity
Let be a sequence of random variables and let be the distribution function of
If for any integer then
is called stationary, or is a stationary process.
Proposition A.7.1 Let be stationary.
provided exists for all in the range of
If exists, then
where is a of and is called invariant
If then the stationary process is called ergodic. Thus, for a stationary and ergodic process we have
If is a sequence of mutually independent and identically distributed (and hence stationary) random variables, then and the sequence is ergodic.
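In the iid case the ergodic statement reduces to the strong law of large numbers: time averages converge to the expectation. A small seeded illustration; the exponential distribution and the sample size are arbitrary choices.

```python
import random

random.seed(3)
n = 100000
mu = 2.0
# iid (hence stationary and ergodic) exponential variables with mean mu.
xs = [random.expovariate(1.0 / mu) for _ in range(n)]
time_avg = sum(xs) / n   # time average, close to the ensemble mean mu
```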
Appendix B
In Appendix B we present the detailed proof of convergence theorems for martingales and martingale difference sequences.
Let be a sequence of random variables, and let be a family of nondecreasing i.e.,
If is for any then we write and call it an adapted process.
An adapted process with is called a martingale if a supermartingale if and a submartingale if
An adapted process is called a martingale difference sequence (MDS) if
A sequence of mutually independent random vectors with is an obvious example of an MDS.
An integer-valued measurable function is called a Markov time with respect to if
If, in addition, then is called a stopping time.
B.1. Convergence Theorems for Martingales
Lemma B.1.1 Let be adapted, a Markov time, and B a Borel set. Let be the first time at which the process hits the set B after time i.e.,
Then is a Markov time.
Proof. The conclusion follows from the following expression:
For defining the number of up-crossings of an interval by a submartingale we first define
The largest for which is called the number of up-crossings of the interval by the process and is denoted by
By Lemma B.1.1
So, is a Markov time.
Assume is a Markov time. Again, by Lemma B.1.1,and
Therefore, all are Markov times.
Theorem B.1.1 (Doob) For submartingales the following inequalities
hold
where
Proof. Note that equals the number of up-crossings of the interval by the submartingale or by
Since for
is a submartingale.
Thus, without loss of generality, it suffices to prove that for a nonnegative submartingale
Define
APPENDIX B 337
Define also Then for even crosses (0, b) from time to Therefore,
and
Further, the set is since is a Markov time, and
Taking expectation of both sides of (B.1.2) yields
where the last inequality holds because is a submartingale and hence the integrand is nonnegative.
Thus (B.1.1) and hence the theorem is proved.
Theorem B.1.2 (Doob) Let be a submartingale with a.s. Then there is a random variable with such that
Proof. Set
Assume the converse:
Then
where and run over all rational numbers.
By the converse assumption there exist rational numbers such that
Let be the number of up-crossings of the interval by
By Theorem B.1.1
By the monotone convergence theorem from (B.1.4) it follows that
However, (B.1.3) implies which contradicts (B.1.5). Hence,
and
where is invoked. Hence,
Corollary B.1.1 If is a nonnegative supermartingale or nonpositive submartingale, then
Because for nonpositive submartingales the corollary follows from the theorem; while for a nonnegative supermartingale is a nonpositive submartingale.
Corollary B.1.2 If is a martingale with thenand
This is because for a martingale and and hence
or converges to a limit which is finite a.s. By the Fatou lemma it follows that
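The convergence asserted by Corollary B.1.1 can be watched numerically: the running product of iid nonnegative factors with mean one is a nonnegative martingale, and each path converges (here, to zero, since the expected log-factor is negative). The two-point factor distribution is an arbitrary choice for this sketch.

```python
import random

random.seed(4)

def martingale_path(n=2000):
    """M_n = product of n iid factors in {0.5, 1.5}, each with probability 1/2.
    E(factor) = 1, so M_n is a nonnegative martingale; E(log factor) < 0,
    so each path converges to 0 almost surely."""
    m = 1.0
    for _ in range(n):
        m *= 0.5 if random.random() < 0.5 else 1.5
    return m

final_values = [martingale_path() for _ in range(5)]
# Each simulated path has essentially reached its a.s. limit (zero here),
# even though E(M_n) = 1 for every n -- the limit need not preserve the mean.
```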
B.2. Convergence Theorems for MDS I
Let be an adapted process, and let G be a Borel set in
Then the first exit time from G defined by
is a Markov time. This is because
Lemma B.2.1. Let be a martingale (supermartingale, submartingale) and a Markov time. Then the process stopped at is again a martingale (supermartingale, submartingale), where
Proof. Note that
is
If is a martingale, then
This shows that is a martingale. For supermartingales and submartingales the proof is similar.
Theorem B.2.1. Let be a one-dimensional MDS. Then as
converges on
Proof. Since is the first exit time
is a Markov time and by Lemma B.2.1 is a martingale, where M is a positive constant.
Noticing that and that
is we find
By Corollary B.1.2 converges as It is clear that on
Therefore, as converges pathwise on Since M is arbitrary, converges on which equals A.
Theorem B.2.2. Let be an MDS and If
then converges on If then
converges on
Proof. It suffices to prove the first assertion, because the second one is reduced to the first one if is replaced by
Define
By Lemma B.2.1 is a martingale. It is clear that
Consequently,
By Theorem B.1.2 converges as
Since on as converges on and
consequently on which equals
B.3. Borel-Cantelli-Lévy Lemma
Theorem B.3.1. (Borel-Cantelli-Lévy Lemma) Let be a sequence of
events, Then if and only if or equivalently,
Proof. Define
Clearly, is a martingale and is an MDS. Since by Theorem B.2.2, converges on
If then from (B.3.2) it follows that which implies that
converges. Then, combining this with by (B.3.2) yields
Conversely, if then from (B.3.2) it follows that
Noticing that is contained in the set where converges by
Theorem B.2.2, from the convergence of by (B.3.2) it follows that
If are mutually independent and then
Proof. Denote by the generated by
If then
and hence which, by (B.3.1), implies (B.3.3).
When are mutually independent, then
B.4. Convergence Criteria for Adapted Sequences
Let be an adapted process.
Theorem B.4.1 Let be a sequence of positive numbers. Then
where
Consequently, implies and
follows from (B.3.1).
Theorem B.3.2 (Borel-Cantelli Lemma) Let be a sequence of events. If
then the probability that the events occur infinitely often is zero, i.e.,
Proof. Set
By Theorem B.3.1
or
This means that A is the set where events may occur only finitely many times.
Therefore, on A the series converges if and only if converges.
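A numerical reading of the Borel-Cantelli lemma, with P(A_n) = 2^{-n} as an arbitrary summable choice: the expected total number of events that occur equals the sum of the probabilities (here 1), and on almost every path only finitely many, early-indexed events occur.

```python
import random

random.seed(5)
n_paths, horizon = 10000, 60
counts = []
for _ in range(n_paths):
    # Event A_n occurs when an independent uniform draw falls below 2^{-n}.
    counts.append(sum(1 for n in range(1, horizon + 1) if random.random() < 2.0 ** -n))
mean_count = sum(counts) / n_paths   # close to sum of 2^{-n} over n >= 1, i.e. 1
```

No simulated path accumulates more than a handful of occurrences, in line with the lemma's conclusion that infinitely many occurrences have probability zero.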
Theorem B.4.2 (Three Series Criterion) Denote by S the set where the following three series converge:
and
where c is a positive constant.
Then converges on S as
Proof. Taking in (B.4.1), we have and
by Theorem B.4.1.
Define
Since converges on S, from (B.4.2) it follows that
Noticing that is an MDS and
we see
By Theorem B.2.1 converges on S, or
Then from (B.4.3) it follows that
or converges.
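The three series criterion can be illustrated on the summands X_n = ±1/n with probability 1/2 each (so the truncation constant c = 1 never binds): the truncated-probability series vanishes, the truncated means are zero, and the variances sum, so the partial sums converge a.s. The seeded sketch below checks that the tail of one simulated path is small; the tolerance 0.1 is a generous bound derived from the summed tail variances, an assumption of this sketch.

```python
import random

def partial_sum(n, seed):
    """Partial sum of X_k = +-1/k (probability 1/2 each) up to n."""
    random.seed(seed)
    s = 0.0
    for k in range(1, n + 1):
        s += (1.0 / k) if random.random() < 0.5 else (-1.0 / k)
    return s

# Re-seeding reproduces the same first 10000 signs, so the difference
# below is the genuine tail of a single path between the two horizons.
s1 = partial_sum(10000, seed=6)
s2 = partial_sum(20000, seed=6)
tail = abs(s2 - s1)   # std of the tail is about 0.007, far below 0.1
```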
B.5. Convergence Theorems for MDS II
Let be an MDS.
Theorem B.5.1 (Y. S. Chow) converges on
Proof. By Theorem B.4.2 it suffices to prove where S is defined in Theorem B.4.2 with replaced by considered in the present theorem.
We now verify that the three series defined in Theorem B.4.2 are convergent on A if is replaced by
For convergence of the first series it suffices to note
For convergence of the second series, taking into account we find
Finally, for convergence of the last series it suffices to note
and
by the conditional Schwarz inequality.
Theorem B.5.2. The conclusion of Theorem B.5.1 is also valid for
Proof. Define
Then we have
on A where A is still defined by (B.5.1) but with
Applying Theorem B.5.1 with to the MDS shows that
converges on A, i.e.,
This is equivalent to
B.6. Weighted Sum of MDS
Theorem B.6.1 Let be an l-dimensional MDS and let be a
matrix adapted process. If
for some then as
where
Proof. Without loss of generality, assume
Notice that convergence of implies convergence of since for sufficiently large
Consequently, from (B.5.2) it follows that
We have the following estimate:
By Theorems B.5.1 and B.5.2 it follows that
where
Notice that is nondecreasing as If is bounded, then the conclusion of the theorem follows from (B.6.1). If then by the Kronecker lemma
(see Section 3.4) the conclusion of the theorem also follows from (B.6.1).
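The Kronecker lemma invoked here (see Section 3.4) can be checked on a deterministic toy sequence: with x_n = (-1)^n and b_n = n, the series of x_n/b_n converges (it is the alternating harmonic series), and the lemma then forces the averages (1/b_n) times the partial sums of x_n to zero. The particular sequences are assumptions of this sketch.

```python
n = 100000
partial = 0.0
series = 0.0
for k in range(1, n + 1):
    x_k = -1.0 if k % 2 else 1.0   # x_k = (-1)^k
    partial += x_k                  # partial sum of x_k, oscillates in {-1, 0}
    series += x_k / k               # sum of x_k / b_k with b_k = k: converges to -ln 2
kronecker_avg = partial / n         # (1/b_n) * sum_{k<=n} x_k, tends to 0
```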
References
[1] B. D. O. Anderson and T. B. Moore, Optimal Control: Linear Quadratic Methods, Prentice-Hall, N. J., 1990.
[2] K. J. Åström, Introduction to Stochastic Control, Academic Press, New York, 1970.
[3] M. Benaim, A dynamical systems approach to stochastic approximation, SIAM J. Control & Optimization, 34:437–472, 1996.
[4] A. Benveniste, M. Metivier and P. Priouret, Adaptive Algorithms and Stochastic Approximation, Springer-Verlag, New York, 1990.
[5] B. Bercu, Weighted estimation and tracking for ARMAX models, SIAM J. Control & Optimization, 33:89–106, 1995.
[6] P. Billingsley, Convergence of Probability Measures, Wiley, New York, 1968.
[7] J. R. Blum, Multidimensional stochastic approximation, Ann. Math. Statist., 25:737–744, 1954.
[8] V. S. Borkar, Asynchronous stochastic approximations, SIAM J. Control & Optimization, 36:840–851, 1998.
[9] O. Brandière and M. Duflo, Les algorithmes stochastiques contournent-ils les pièges? Ann. Inst. Henri Poincaré, 32:395–427, 1996.
[10] P. E. Caines, Linear Stochastic Systems, Wiley, New York, 1988.
[11] H. F. Chen, Recursive algorithms for adaptive beam-formers, Kexue Tongbao (Science Bulletin), 26:490–493, 1981.
[12] H. F. Chen, Recursive Estimation and Control for Stochastic Systems, Wiley, New York, 1985.
[13] H. F. Chen, Asymptotic efficient stochastic approximation, Stochastics and Stochastics Reports, 45:1–16, 1993.
[14] H. F. Chen, Stochastic approximation and its new applications, Proceedings of 1994 Hong Kong International Workshop on New Directions of Control and Manufacturing, 1994, 2–12.
[15] H. F. Chen, Convergence rate of stochastic approximation algorithms in the degenerate case, SIAM J. Control & Optimization, 36:100–114, 1998.
[16] H. F. Chen, Stochastic approximation with non-additive measurement noise, J. of Applied Probability, 35:407–417, 1998.
[17] H. F. Chen, Convergence of SA algorithms in multi-root or multi-extreme cases, Stochastics and Stochastics Reports, 64:255–266, 1998.
[18] H. F. Chen, Stochastic approximation with state-dependent noise, Science in China (Series E), 43:531–541, 2000.
[19] H. F. Chen and X. R. Cao, Controllability is not necessary for adaptive pole placement control, IEEE Trans. Autom. Control, AC-42:1222–1229, 1997.
[20] H. F. Chen and X. R. Cao, Pole assignment for stochastic systems with unknown coefficients, Science in China (Series E), 43:313–323, 2000.
[21] H. F. Chen, T. Duncan, and B. Pasik-Duncan, A Kiefer-Wolfowitz algorithm with randomized differences, IEEE Trans. Autom. Control, AC-44:442–453, 1999.
[22] H. F. Chen and H. T. Fang, Nonconvex stochastic optimization for model reduction, Global Optimization, 2002.
[23] H. F. Chen and L. Guo, Identification and Stochastic Adaptive Control, Birkhäuser, Boston, 1991.
[24] H. F. Chen, L. Guo, and A. J. Gao, Convergence and robustness of the Robbins-Monro algorithm truncated at randomly varying bounds, Stochastic Processes and Their Applications, 27:217–231, 1988.
[25] H. F. Chen and R. Uosaki, Convergence analysis of dynamic stochastic approximation, Systems and Control Letters, 35:309–315, 1998.
[26] H. F. Chen and Q. Wang, Adaptive regulator for discrete-time nonlinear nonparametric systems, IEEE Trans. Autom. Control, AC-46: , 2001.
[27] H. F. Chen and Y. M. Zhu, Stochastic approximation procedures with randomly varying truncations, Scientia Sinica (Series A), 29:914–926, 1986.
[28] H. F. Chen and Y. M. Zhu, Stochastic Approximation (in Chinese), Shanghai Scientific and Technological Publishers, Shanghai, 1996.
[29] E. K. P. Chong and P. J. Ramadge, Optimization of queues using an infinitesimal perturbation analysis-based stochastic algorithm with general update times, SIAM J. Control & Optimization, 31:698–732, 1993.
[30] Y. S. Chow, Local convergence of martingales and the law of large numbers, Ann. Math. Statist., 36:552–558, 1965.
[31] Y. S. Chow and H. Teicher, Probability Theory: Independence, Interchangeability, Martingales, Springer-Verlag, New York, 1978.

[32] K. L. Chung, A Course in Probability Theory (second edition), Academic Press, New York, 1974.

[33] M. H. A. Davis, Linear Estimation and Stochastic Control, Chapman and Hall, New York, 1977.

[34] K. Deimling, Nonlinear Functional Analysis, Springer, Berlin, 1985.

[35] B. Delyon and A. Juditsky, Stochastic optimization with averaging of trajectories, Stochastics and Stochastics Reports, 39:107–118, 1992.

[36] E. F. Deprettere (ed.), SVD and Signal Processing, Elsevier, North-Holland, 1988.

[37] N. Dunford and J. T. Schwartz, Linear Operators, Part 1: General Theory, Wiley-Interscience, New York, 1966.

[38] V. Dupač, A dynamic stochastic approximation method, Ann. Math. Statist., 36:1695–1702, 1965.

[39] V. Dupač, Stochastic approximation in the presence of trend, Czechoslovak Math. J., 16:454–461, 1966.

[40] A. Dvoretzky, On stochastic approximation, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, pp. 39–55, 1956.

[41] S. N. Ethier and T. G. Kurtz, Markov Processes: Characterization and Convergence, Wiley, New York, 1986.

[42] E. Eweda, Convergence of the sign algorithm for adaptive filtering with correlated data, IEEE Trans. Information Theory, IT-37:1450–1457, 1991.

[43] V. Fabian, On asymptotic normality in stochastic approximation, Ann. Math. Statist., 39:1327–1332, 1968.

[44] V. Fabian, On asymptotically efficient recursive estimation, Ann. Statist., 6:854–856, 1978.

[45] V. Fabian, Simulated annealing simulated, Computers Math. Applic., 33:81–94, 1997.

[46] F. W. Fairman, Linear Control Theory, The State Space Approach, Wiley, Chichester, 1998.

[47] H. T. Fang and H. F. Chen, Sharp convergence rates of stochastic approximation for degenerate roots, Science in China (Series E), 41:383–392, 1998.

[48] H. T. Fang and H. F. Chen, Stability and instability of limit points of stochastic approximation algorithms, IEEE Trans. Autom. Control, AC-45:413–420, 2000.

[49] H. T. Fang and H. F. Chen, An a.s. convergent algorithm for global optimization with noise corrupted observations, J. Optimization Theory and Applications, 104:343–376, 2000.
[50] H. T. Fang and H. F. Chen, Asymptotic behavior of asynchronous stochastic approximation, Science in China (Series F), 44:249–258, 2001.

[51] B. A. Francis, A Course in H∞ Control Theory, Lecture Notes in Control and Information Sciences, Vol. 88, Springer, 1987.

[52] S. B. Gelfand and S. K. Mitter, Recursive stochastic algorithms for global optimization in R^d, SIAM J. Control & Optimization, 29:999–1018, 1991.

[53] E. G. Gladyshev, On stochastic approximation (in Russian), Theory Probab. Appl., 10:275–278, 1965.

[54] G. C. Goodwin and K. S. Sin, Adaptive Filtering, Prediction and Control, Prentice-Hall, N.J., 1984.

[55] L. Guo, Self-convergence of weighted least squares with applications to stochastic adaptive control, IEEE Trans. Autom. Control, AC-41:79–89, 1996.

[56] P. Hall and C. C. Heyde, Martingale Limit Theory and Its Application, Academic Press, New York, 1980.

[57] S. Haykin, Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, NJ, 1990.

[58] Y. C. Ho and X. R. Cao, Perturbation Analysis of Discrete Event Dynamical Systems, Kluwer, Boston, 1991.

[59] A. Juditsky, A stochastic estimation algorithm with observation averaging, IEEE Trans. Autom. Control, 38:794–798, 1993.

[60] T. Kailath, Linear Systems, Prentice-Hall, N.J., 1980.

[61] J. Kiefer and J. Wolfowitz, Stochastic estimation of the maximum of a regression function, Ann. Math. Statist., 23:462–466, 1952.

[62] P. V. Kokotovic (ed.), Foundations of Adaptive Control, Springer, Berlin, 1991.

[63] J. Koronacki, Random-seeking methods for the stochastic unconstrained optimization, Int. J. Control, 21:517–527, 1975.

[64] H. J. Kushner, Approximation and Weak Convergence Methods for Random Processes with Applications to Stochastic Systems Theory, MIT Press, Cambridge, MA, 1984.

[65] H. J. Kushner and D. S. Clark, Stochastic Approximation for Constrained and Unconstrained Systems, Springer-Verlag, New York, 1978.

[66] H. J. Kushner and J. Yang, Stochastic approximation with averaging of the iterates: Optimal asymptotic rates of convergence for general processes, SIAM J. Control & Optimization, 31:1045–1062, 1993.

[67] H. J. Kushner and J. Yang, Stochastic approximation with averaging and feedback: Rapidly convergent “on line” algorithms, IEEE Trans. Autom. Control, AC-40:24–34, 1995.
[68] H. J. Kushner and G. Yin, Stochastic Approximation Algorithms and Applications, Springer-Verlag, New York, 1997.

[69] J. P. LaSalle and S. Lefschetz, Stability by Lyapunov’s Direct Method with Applications, Academic Press, New York, 1961.

[70] R. Liptser and A. N. Shiryaev, Statistics of Random Processes, Springer-Verlag, New York, 1977.

[71] R. Liu, Blind signal processing: An introduction, Proceedings 1996 Intl. Symp. Circuits and Systems, Vol. 2, 81–83, 1996.

[72] L. Ljung, Analysis of recursive stochastic algorithms, IEEE Trans. Autom. Control, AC-22:551–575, 1977.

[73] L. Ljung, On positive real transfer functions and the convergence of some recursive schemes, IEEE Trans. Autom. Control, AC-22:539–551, 1977.

[74] L. Ljung, G. Pflug, and H. Walk, Stochastic Approximation and Optimization of Random Systems, Birkhäuser, Basel, 1992.

[75] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification, MIT Press, Cambridge, MA, 1983.

[76] M. Loève, Probability Theory, Springer, New York, 1977–1978.

[77] R. Lozano and X. H. Zhao, Adaptive pole placement without excitation probing signals, IEEE Trans. Autom. Control, AC-39:47–58, 1994.

[78] M. B. Nevelson and R. Z. Khasminskii, Stochastic Approximation and Recursive Estimation, Translations of Mathematical Monographs, Vol. 47, Amer. Math. Soc., Providence, RI, 1976.

[79] E. Oja, Subspace Methods of Pattern Recognition, 1st ed., Research Studies Press Ltd., Letchworth, Hertfordshire, 1983.

[80] B. T. Polyak, New stochastic approximation type procedures (in Russian), Autom. i Telemekh., 7:98–107, 1990.

[81] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control & Optimization, 30:838–855, 1992.

[82] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22:400–407, 1951.

[83] D. Ruppert, Stochastic approximation, in B. K. Ghosh and P. K. Sen (eds.), Handbook of Sequential Analysis, 503–529, Marcel Dekker, New York, 1991.

[84] A. N. Shiryaev, Probability, Springer, New York, 1984.

[85] J. C. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE Trans. Autom. Control, AC-37:331–341, 1992.
[86] Q. Y. Tang and H. F. Chen, Convergence of perturbation analysis based optimization algorithm with fixed number of customers period, Discrete Event Dynamic Systems, 4:359–373, 1994.

[87] Q. Y. Tang, H. F. Chen, and Z. J. Han, Convergence rates of perturbation-analysis-Robbins-Monro-single-run algorithms, IEEE Trans. Autom. Control, AC-42:1442–1447, 1997.

[88] J. N. Tsitsiklis, Asynchronous stochastic approximation and Q-learning, Machine Learning, 16:185–202, 1994.

[89] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, Distributed asynchronous deterministic and stochastic gradient optimization algorithms, IEEE Trans. Autom. Control, 31:803–812, 1986.

[90] Ya. Z. Tsypkin, Adaptation and Learning in Automatic Systems, Academic Press, New York, 1971.

[91] K. Uosaki, Some generalizations of dynamic stochastic approximation processes, Ann. Statist., 2:1042–1048, 1974.

[92] J. Venter, An extension of the Robbins-Monro procedure, Ann. Math. Statist., 38:181–190, 1967.

[93] G. J. Wang and H. F. Chen, Behavior of stochastic approximation algorithm in root set of regression function, Systems Science and Mathematical Sciences, 12:92–96, 1999.

[94] I. J. Wang, E. K. P. Chong and S. R. Kulkarni, Equivalent necessary and sufficient conditions on noise sequences for stochastic approximation algorithms, Adv. Appl. Probab., 28:784–801, 1996.

[95] C. Z. Wei, Multivariate adaptive stochastic approximation, Ann. Statist., 15:1115–1130, 1987.

[96] G. Xu, L. Tong, and T. Kailath, A least squares approach to blind identification, IEEE Trans. Signal Processing, SP-43:2982–2993, 1995.

[97] S. Yakowitz, A globally convergent stochastic approximation, SIAM J. Control & Optimization, 31:30–40, 1993.

[98] G. Yin, On extensions of Polyak’s averaging approach to stochastic approximation, Stochastics and Stochastics Reports, 36:245–264, 1991.

[99] G. Yin and Y. M. Zhu, On w.p.1 convergence of a parallel stochastic approximation algorithm, Probability in the Eng. and Infor. Sciences, 3:55–75, 1989.
[100] R. Zieliński, Global stochastic approximation: A review of results and some open problems, in F. Archetti and M. Cugiani (eds.), Numerical Techniques for Stochastic Systems, 379–386, North-Holland Publ. Co., 1980.
[101] J. H. Zhang and H. F. Chen, Convergence of algorithms used for principal component analysis, Science in China (Series E), 40:597–604, 1997.
[102] K. Zhou, J. C. Doyle, and K. Glover, Robust and Optimal Control, Prentice-Hall, New Jersey, 1996.
Index
Ackermann’s formula, 328
adapted process, 335
adapted sequence, 341
adaptive control, 290, 303, 327
adaptive filter, 288
adaptive filtering, 265, 273
adaptive regulation, 321
adaptive stabilization, 305, 307, 314, 327
adaptive stochastic approximation, 132, 149
adaptively stabilizable, 310
admissible controls, 302
algebraic Riccati equation, 131
ARMA process, 39
Arzelà-Ascoli theorem, 11, 24
asymptotic behavior, 194
asymptotic efficiency, 95, 130, 132, 149
asymptotic normality, 95, 113, 119, 127, 149, 210
asymptotic properties, 95, 166
asymptotically efficient, 135
asynchronous stochastic approximation, 219, 278, 288
averaging technique, 132, 149

balanced realization, 210, 214
balanced truncation, 214, 215
blind channel identification, 219, 220, 223
blind identification, 220
Borel σ-algebra, 330
Borel set, 330
Borel-Cantelli lemma, 341
Borel-Cantelli-Lévy lemma, 340

certainty-equivalence principle, 304, 306
Chebyshev inequality, 332
closure, 38
conditional distribution function, 332
conditional expectation, 332
conditional probability, 332
conditional Schwarz inequality, 343
constant interpolating function, 13
constrained optimization problem, 268
controllable, 307, 317, 319
controller form, 317–319
convergence, 28, 36, 41, 153, 223, 331, 341
convergence analysis, 6, 28, 95, 154
convergence rate, 95, 96, 101–103, 105, 149
convergence theorem for martingale difference sequences, 97, 128, 160, 170, 185, 196, 231, 249, 321, 339, 343
convergence theorem for nonnegative supermartingales, 7–9
convergence theorems for martingales, 335
convergent subsequence, 17, 18, 30, 36, 84, 86, 89, 178, 187, 237, 241, 244, 271, 275, 280, 282, 283, 285, 287, 288, 297, 312, 315, 322, 323
coprimeness, 306
covariance matrix, 130, 132
crossing, 18, 34, 188, 236, 312

degenerate case, 103, 149
density, 330
distribution function, 330
dominant stability, 59, 62
dominated convergence theorem, 331
dynamic stochastic approximation, 82, 93

equi-continuous, 15
ergodic, 265, 268, 270, 273, 274, 334
ergodicity, 333
event, 329
expectation, 330

Fatou lemma, 331
first exit time, 9, 339

general convergence theorems, 28
global minimum, 177
global minimizer, 174, 177, 180
global optimization, 172–174, 218
global optimization algorithm, 180, 194
global optimizer, 152
globally Lipschitz continuous, 292
Gronwall inequality, 298

Hölder inequality, 332
Hankel matrix, 222
Hankel norm approximation, 210, 214, 215
Hessian, 8, 195

identification, 290
integrable, 331
interpolating function, 11
invariant σ-algebra, 334

Jordan-Hahn decomposition, 55, 56, 295, 329

Kiefer-Wolfowitz (KW) algorithm, 151–153, 166, 173, 218
Kronecker lemma, 67, 144, 148, 345
Kronecker product, 248
KW algorithm with expanding truncations, 152, 154, 173–175

law of iterated logarithm, 333
Lebesgue measurable, 330
Lebesgue measure, 330
Lebesgue-Stieltjes integral, 331
linear interpolating function, 12
Lipschitz continuous, 23
Lipschitz-continuity, 160
local search, 172, 173
locally bounded, 17, 29, 96, 103, 133
locally Lipschitz continuous, 50, 155, 163, 177, 280
Lyapunov equation, 105
Lyapunov function, 6, 8, 10, 11, 17, 111, 226, 268, 313
Lyapunov inequality, 144, 332
Lyapunov theorem, 98

MA process, 171
Markov time, 6, 335, 336, 339
martingale, 335, 339, 340
martingale convergence theorem, 6, 180, 297
martingale difference sequence, 6, 16, 42, 97, 128, 134, 159, 164, 168, 179, 185, 195–197, 231, 250, 257, 294, 335
maximizer, 151
measurable, 17, 29, 96, 103, 133
measurable function, 330
measurable set, 329
measure, 329
minimizer, 151
mixing condition, 291
model reduction, 210
monotone convergence theorem, 331
multi-extreme, 163, 164
multi-root, 46, 57
mutually independent, 333, 341

necessity of noise condition, 45
non-additive noise, 49
nondegenerate case, 96, 149
nonnegative adapted sequence, 7
nonnegative supermartingale, 6, 7, 338
nonpositive submartingale, 338
normal distribution, 113, 114, 330
nowhere dense, 29, 35, 37, 41, 177, 181, 182, 280, 291

observation, 5, 17, 132, 321
observation noise, 5, 103, 133, 175, 195, 321
ODE method, 2, 10, 24, 327
one-sided randomized difference, 172
optimal control, 303
optimization, 151
optimization algorithm, 212
ordinary differential equation (ODE), 10

pattern classification, 219
perturbation analysis, 328
pole assignment, 316, 318, 327
principal component analysis, 238, 288
probabilistic method, 4
probability measure, 330
probability of random event, 330
probability space, 329, 330
Prohorov’s theorem, 22, 24

Radon-Nikodym theorem, 332
random noise, 10, 21
random search, 172
random variable, 330
randomized difference, 152–154
recursive blind identification, 246
relatively compact, 22
RM algorithm with expanding truncations, 28, 155, 309, 319
Robbins-Monro (RM) algorithm, 1, 5, 8, 11, 12, 17, 20, 45, 110, 310, 313
robustness, 67, 93

SA algorithm, 67
SA algorithm with expanding truncations, 25, 40, 95, 290
SA with randomly varying truncations, 93
Schwarz inequality, 142, 332
sign algorithms, 273, 288
signal processing, 219, 265
signed measure, 56, 295
Skorohod representation, 23
Skorohod topology, 21, 24
slowly decreasing step sizes, 132
spheres with expanding radii, 36
stability, 131
stable, 96, 97, 102, 131, 133
state-dependent, 42, 164
state-dependent noise, 29, 57
state-independent condition, 41, 42
stationary, 265, 268, 270, 273, 274, 333
step size, 5, 6, 17, 102, 132, 174
stochastic approximation (SA), 1, 223, 226, 246
stochastic approximation algorithm, 5, 307, 308
stochastic approximation method, 321
stochastic differential equation, 126
stochastic optimization, 211
stopping time, 335
strictly input passive, 322
structural error, 10, 157
structural inaccuracy, 21
submartingale, 335–337, 339
subspace, 41, 226
supermartingale, 335, 339
surjection, 63
system identification, 327

three series criterion, 342
time-varying, 44
trajectory-subsequence (TS) method, 2, 16, 21
truncated RM algorithm, 16, 17
TS method, 28, 327

uniformly bounded, 15
uniformly locally bounded, 41
up-crossing, 336, 338

weak convergence method, 21, 24
weighted least squares, 306
weighted sum of MDS, 344
Wiener process, 126
Nonconvex Optimization and Its Applications
22. H. Tuy: Convex Analysis and Global Optimization. 1998 ISBN 0-7923-4818-4

23. D. Cieslik: Steiner Minimal Trees. 1998 ISBN 0-7923-4983-0

24. N.Z. Shor: Nondifferentiable Optimization and Polynomial Problems. 1998 ISBN 0-7923-4997-0

25. R. Reemtsen and J.-J. Rückmann (eds.): Semi-Infinite Programming. 1998 ISBN 0-7923-5054-5

26. B. Ricceri and S. Simons (eds.): Minimax Theory and Applications. 1998 ISBN 0-7923-5064-2

27. J.-P. Crouzeix, J.-E. Martinez-Legaz and M. Volle (eds.): Generalized Convexity, Generalized Monotonicity: Recent Results. 1998 ISBN 0-7923-5088-X

28. J. Outrata, M. Kočvara and J. Zowe: Nonsmooth Approach to Optimization Problems with Equilibrium Constraints. 1998 ISBN 0-7923-5170-3

29. D. Motreanu and P.D. Panagiotopoulos: Minimax Theorems and Qualitative Properties of the Solutions of Hemivariational Inequalities. 1999 ISBN 0-7923-5456-7

30. J.F. Bard: Practical Bilevel Optimization. Algorithms and Applications. 1999 ISBN 0-7923-5458-3

31. H.D. Sherali and W.P. Adams: A Reformulation-Linearization Technique for Solving Discrete and Continuous Nonconvex Problems. 1999 ISBN 0-7923-5487-7

32. F. Forgó, J. Szép and F. Szidarovszky: Introduction to the Theory of Games. Concepts, Methods, Applications. 1999 ISBN 0-7923-5775-2

33. C.A. Floudas and P.M. Pardalos (eds.): Handbook of Test Problems in Local and Global Optimization. 1999 ISBN 0-7923-5801-5

34. T. Stoilov and K. Stoilova: Noniterative Coordination in Multilevel Systems. 1999 ISBN 0-7923-5879-1

35. J. Haslinger, M. Miettinen and P.D. Panagiotopoulos: Finite Element Method for Hemivariational Inequalities. Theory, Methods and Applications. 1999 ISBN 0-7923-5951-8

36. V. Korotkich: A Mathematical Structure of Emergent Computation. 1999 ISBN 0-7923-6010-9

37. C.A. Floudas: Deterministic Global Optimization: Theory, Methods and Applications. 2000 ISBN 0-7923-6014-1

38. F. Giannessi (ed.): Vector Variational Inequalities and Vector Equilibria. Mathematical Theories. 1999 ISBN 0-7923-6026-5

39. D.Y. Gao: Duality Principles in Nonconvex Systems. Theory, Methods and Applications. 2000 ISBN 0-7923-6145-3

40. C.A. Floudas and P.M. Pardalos (eds.): Optimization in Computational Chemistry and Molecular Biology. Local and Global Approaches. 2000 ISBN 0-7923-6155-5

41. G. Isac: Topological Methods in Complementarity Theory. 2000 ISBN 0-7923-6274-8

42. P.M. Pardalos (ed.): Approximation and Complexity in Numerical Optimization: Concrete and Discrete Problems. 2000 ISBN 0-7923-6275-6

43. V. Demyanov and A. Rubinov (eds.): Quasidifferentiability and Related Topics. 2000 ISBN 0-7923-6284-5
44. A. Rubinov: Abstract Convexity and Global Optimization. 2000 ISBN 0-7923-6323-X

45. R.G. Strongin and Y.D. Sergeyev: Global Optimization with Non-Convex Constraints. 2000 ISBN 0-7923-6490-2

46. X.-S. Zhang: Neural Networks in Optimization. 2000 ISBN 0-7923-6515-1

47. H. Jongen, P. Jonker and F. Twilt: Nonlinear Optimization in Finite Dimensions. Morse Theory, Chebyshev Approximation, Transversality, Flows, Parametric Aspects. 2000 ISBN 0-7923-6561-5

48. R. Horst, P.M. Pardalos and N.V. Thoai: Introduction to Global Optimization. 2nd Edition. 2000 ISBN 0-7923-6574-7

49. S.P. Uryasev (ed.): Probabilistic Constrained Optimization. Methodology and Applications. 2000 ISBN 0-7923-6644-1

50. D.Y. Gao, R.W. Ogden and G.E. Stavroulakis (eds.): Nonsmooth/Nonconvex Mechanics. Modeling, Analysis and Numerical Methods. 2001 ISBN 0-7923-6786-3

51. A. Atkinson, B. Bogacka and A. Zhigljavsky (eds.): Optimum Design 2000. 2001 ISBN 0-7923-6798-7

52. M. do Rosário Grossinho and S.A. Tersian: An Introduction to Minimax Theorems and Their Applications to Differential Equations. 2001 ISBN 0-7923-6832-0

53. A. Migdalas, P.M. Pardalos and P. Värbrand (eds.): From Local to Global Optimization. 2001 ISBN 0-7923-6883-5

54. N. Hadjisavvas and P.M. Pardalos (eds.): Advances in Convex Analysis and Global Optimization. Honoring the Memory of C. Caratheodory (1873-1950). 2001 ISBN 0-7923-6942-4

55. R.P. Gilbert, P.D. Panagiotopoulos† and P.M. Pardalos (eds.): From Convexity to Nonconvexity. 2001 ISBN 0-7923-7144-5

56. D.-Z. Du, P.M. Pardalos and W. Wu: Mathematical Theory of Optimization. 2001 ISBN 1-4020-0015-4

57. M.A. Goberna and M.A. López (eds.): Semi-Infinite Programming. Recent Advances. 2001 ISBN 1-4020-0032-4

58. F. Giannessi, A. Maugeri and P.M. Pardalos (eds.): Equilibrium Problems: Nonsmooth Optimization and Variational Inequality Models. 2001 ISBN 1-4020-0161-4

59. G. Dzemyda, V. Šaltenis and A. Žilinskas (eds.): Stochastic and Global Optimization. 2002 ISBN 1-4020-0484-2

60. D. Klatte and B. Kummer: Nonsmooth Equations in Optimization. Regularity, Calculus, Methods and Applications. 2002 ISBN 1-4020-0550-4

61. S. Dempe: Foundations of Bilevel Programming. 2002 ISBN 1-4020-0631-4

62. P.M. Pardalos and H.E. Romeijn (eds.): Handbook of Global Optimization, Volume 2. 2002 ISBN 1-4020-0632-2

63. G. Isac, V.A. Bulavsky and V.V. Kalashnikov: Complementarity, Equilibrium, Efficiency and Economics. 2002 ISBN 1-4020-0688-8
KLUWER ACADEMIC PUBLISHERS – DORDRECHT / BOSTON / LONDON