IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 50, NO. 2, FEBRUARY 2004 401

On the Almost Sure Rate of Convergence of Linear Stochastic Approximation Algorithms

Vladislav B. Tadić, Member, IEEE

Abstract—The almost sure rate of convergence of linear stochastic approximation algorithms is analyzed in this correspondence. As the main result, it is demonstrated that their almost sure rate of convergence is equivalent to the almost sure rate of convergence of the averages of their input data sequences. As opposed to most of the existing results on the rate of convergence of stochastic approximation, which cover only algorithms with the noise decomposable as the sum of a martingale difference, vanishing and telescoping sequence, the main results of this correspondence hold under assumptions not requiring the input data sequences to admit any particular decomposition. Although no decomposition of the input data sequences is required, the results on the almost sure rate of convergence of linear stochastic approximation algorithms obtained in this correspondence are as tight as the rate of convergence in the law of iterated logarithm. Moreover, the main result of this correspondence yields the law of iterated logarithm for linear stochastic approximation if the law of iterated logarithm holds for the input data sequences. The obtained general results are illustrated with two (nontrivial) examples where the input data sequences are strongly mixing strictly stationary random processes or functions of a uniformly ergodic Markov chain. These results are also applied to the analysis of least mean square (LMS) algorithms.

Index Terms—Almost sure rate of convergence, law of iterated logarithm, least mean square (LMS) algorithms, linear stochastic approximation algorithms, strongly mixing stationary random processes, uniformly ergodic Markov chains.

I. INTRODUCTION

Most of the existing results on the rate of convergence of stochastic approximation algorithms could be considered as the (functional) central limit theorem and the (functional) law of iterated logarithm for these algorithms (for more details see [1], [13], [14], [16], [19], [20] and references cited therein). Although these results (also known as weak and strong diffusion approximations) provide deep insight into the asymptotic behavior of stochastic approximation algorithms and the tightest possible rates of their convergence, most of them are based on the results of martingale limit theory (i.e., on the (functional) central limit theorem and the (functional) law of iterated logarithm for martingales; for details on martingale limit theory see, e.g., [10]). Therefore, these results are restricted to algorithms with noise which can be decomposed as the sum of a martingale difference, a vanishing and a telescoping sequence. In order to analyze the almost sure rate of convergence of a broader class of stochastic approximation algorithms, an entirely different approach has recently been applied in [3]. This approach is of a deterministic nature and (relatively) close to the analysis of the almost sure convergence of stochastic approximation algorithms carried out using the methodology based on associated ordinary differential equations (for details on this methodology, see [12], [13]). The assumptions adopted in [3] do not require the noise to admit any

Manuscript received August 2, 2001; revised April 4, 2003. The results in this correspondence were obtained while the author was with the Department of Electrical and Electronic Engineering, University of Melbourne, Melbourne, Australia. The material in this correspondence was presented in part at the 39th Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, October 2001.

The author is with the Department of Automatic Control and Systems Engineering, University of Sheffield, Sheffield S1 3JD, U.K. (e-mail: v.tadic@sheffield.ac.uk).

Communicated by G. Lugosi, Associate Editor for Nonparametric Estimation, Classification, and Neural Networks.

Digital Object Identifier 10.1109/TIT.2003.821971

particular decomposition and cover a considerably broader class of stochastic approximation algorithms than the results based on martingale limit theory. However, the rates of convergence obtained in [3] are more conservative (i.e., less tight) than those obtained by using the results of martingale limit theory. In the context of linear stochastic approximation algorithms, the results of [3] have been significantly generalized and extended in [4]. However, the results presented in [4] are still more conservative than those based on martingale limit theory.

Although linear stochastic approximation algorithms could be considered as one of the simplest subclasses of general stochastic approximation algorithms, they have found a wide range of applications in adaptive signal processing, machine learning, pattern recognition, and econometrics (for details see, e.g., [1], [2], [5], [14], [21] and references cited therein). As a result, their asymptotic properties (almost sure, mean-square and weak convergence and rate of convergence) have been the focus of a large number of papers (see [4], [6]–[9], [11], [22]).

The almost sure rate of convergence of linear stochastic approximation algorithms is analyzed in this correspondence. As the main result, it is demonstrated that their almost sure rate of convergence is equivalent to the almost sure rate of convergence of the averages of their input data sequences. As opposed to most of the existing results on the rate of convergence of stochastic approximation, the main results of this correspondence hold under assumptions not requiring the input data sequences to admit any particular decomposition. Although no decomposition of the input data sequences is required, the results on the almost sure rate of convergence of linear stochastic approximation algorithms obtained in this correspondence are as tight as the rate of convergence in the law of iterated logarithm. Moreover, the main result of this correspondence yields the law of iterated logarithm for linear stochastic approximation if the law of iterated logarithm holds for the input data sequences. The obtained general results are illustrated with two (nontrivial) examples where the input data sequences are strongly mixing strictly stationary random processes or functions of a uniformly ergodic Markov chain. These results are also applied to the analysis of least mean square (LMS) algorithms.

The correspondence is organized as follows. Linear stochastic approximation algorithms are formally defined in Section II. The main results and the assumptions under which these results hold are also presented in Section II. The proofs of the main results are given in Sections III and IV. In Section V, cases where the algorithm input data sequences are strongly mixing strictly stationary random processes or functions of a uniformly ergodic Markov chain are considered. The analysis of LMS algorithms is carried out in the same section.

II. MAIN RESULTS

Let $R$ be the set of reals, while $R_+ = (0,\infty)$, $R_0^+ = [0,\infty)$. Linear stochastic approximation algorithms are defined by the following difference equation:

$$\theta_{n+1} = \theta_n + \gamma_{n+1}(A_{n+1}\theta_n + b_{n+1}), \quad n \ge 0. \qquad (1)$$

$\{\gamma_n\}_{n\ge1}$ is a sequence of positive reals. $\theta_0$ is an $R^d$-valued random variable defined on the probability space $(\Omega,\mathcal{F},P)$, while $\{A_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ are $R^{d\times d}$-valued and $R^d$-valued random processes defined on the same probability space. The algorithm (1) solves the equation $A\theta + b = 0$ in situations where only the noisy observations $\{A_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ of $A \in R^{d\times d}$ and $b \in R^d$ (respectively) are available.
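As a concrete illustration, the recursion (1) can be simulated directly. The sketch below is not taken from the correspondence: the stable matrix $A$, the vector $b$, the step size $\gamma_n = n^{-1}$, and the i.i.d. Gaussian perturbations standing in for the noisy observations $A_n$, $b_n$ are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2
A = np.array([[-1.0, 0.2],
              [0.0, -0.5]])          # Hurwitz matrix (eigenvalues -1, -0.5)
b = np.array([1.0, 1.0])
theta_star = -np.linalg.solve(A, b)  # the root of A*theta + b = 0

theta = np.zeros(d)
N = 20000
for n in range(N):
    gamma = 1.0 / (n + 1)                          # gamma_{n+1} = 1/(n+1)
    A_n = A + 0.1 * rng.standard_normal((d, d))    # noisy observation of A
    b_n = b + 0.1 * rng.standard_normal(d)         # noisy observation of b
    theta = theta + gamma * (A_n @ theta + b_n)    # recursion (1)

print(theta, theta_star)
```

Under these choices the iterates settle near $\theta = -A^{-1}b$; the rate at which they do so is exactly what the results below quantify.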

Let $\{\beta_n\}_{n\ge1}$ be a nondecreasing sequence of positive reals. Moreover, let $\|\cdot\|$ denote the Euclidean vector norm and the matrix norm

0018-9448/04$20.00 © 2004 IEEE


induced by the Euclidean vector norm (i.e., $\|A\| = \sup_{\|\theta\|=1}\|A\theta\|$, $A \in R^{d\times d}$), while

$$\phi(n,t) = \sup\Big\{ j \ge n : \sum_{i=n}^{j-1}\gamma_{i+1} \le t \Big\}, \quad t \in R_+, \ n \ge 0.$$
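For intuition, $\phi(n,t)$ is simply the last index $j \ge n$ such that the step sizes accumulated over $[n,j)$ do not exceed $t$. A direct illustration-only implementation (the function name and the choice $\gamma_i = 1/i$ are hypothetical, not from the correspondence):

```python
def phi(n, t, gamma):
    """Return sup{ j >= n : sum_{i=n}^{j-1} gamma(i+1) <= t } for positive steps gamma."""
    total, j = 0.0, n
    while total + gamma(j + 1) <= t:
        total += gamma(j + 1)
        j += 1
    return j

# With gamma_i = 1/i the partial sums diverge, so phi(n, t) is always finite.
print(phi(1, 1.0, lambda i: 1.0 / i))
```

Consistently with relation (4) below, the returned index satisfies $\sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1} \le t < \sum_{i=n}^{\phi(n,t)}\gamma_{i+1}$.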

The almost sure rate of convergence of the algorithm (1) is analyzed under the following set of assumptions:

A1: $\lim_{n\to\infty}\gamma_n = 0$, $\sum_{n=1}^{\infty}\gamma_n = \infty$.

A2: $\lim_{n\to\infty}\beta_n = \infty$, $a = \lim_{n\to\infty}\gamma_{n+1}^{-1}(\beta_n^{-1}\beta_{n+1} - 1) < \infty$.

A3: There exists a nonnegative random variable $K$ such that

$$\limsup_{n\to\infty}\sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1}\|A_{i+1}\| \le Kt \quad \text{w.p. 1}, \quad \forall t \in R_+. \qquad (2)$$

A4: $A + aI$ is stable (i.e., the real parts of all its eigenvalues are strictly negative) and

$$\lim_{n\to\infty}\sup_{n\le j<\phi(n,t)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}(A_{i+1} - A)\Big\| = 0 \quad \text{w.p. 1}, \quad \forall t \in R_+. \qquad (3)$$

The almost sure rate of convergence of the algorithm (1) is also analyzed under the following set of assumptions:

B1: $0 < \gamma = \lim_{n\to\infty} n\gamma_n < \infty$, $0 \le \gamma' = \limsup_{n\to\infty} n|\gamma_n\gamma_{n+1}^{-1} - 1| < \infty$.

B2: $\lim_{n\to\infty}\beta_n = \infty$, $\beta = \lim_{n\to\infty} n(\beta_n^{-1}\beta_{n+1} - 1) < \infty$.

B3: There exists a nonnegative random variable $L$ such that

$$\lim_{n\to\infty} n^{-1}\sum_{i=1}^{n}\|A_i\| = L \quad \text{w.p. 1}.$$

B4: $A + \gamma^{-1}\beta I$ is stable (i.e., the real parts of all its eigenvalues are strictly negative) and

$$\lim_{n\to\infty} n^{-1}\sum_{i=1}^{n}A_i = A \quad \text{w.p. 1}.$$

Assumptions A1 and B1 correspond to the algorithm step size $\{\gamma_n\}_{n\ge1}$. They are satisfied if $\gamma_n = \gamma n^{-1}$, $n \ge 1$, where $\gamma \in R_+$ is a constant. Both A1 and B1 imply that $\phi(n,t)$ is well defined and finite for all $t \in R_+$, $n \ge 0$, as well as that

$$\sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1} \le t < \sum_{i=n}^{\phi(n,t)}\gamma_{i+1}, \quad \forall t \in R_+, \ n \ge 0 \qquad (4)$$

$$\lim_{n\to\infty}\sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1} = t, \quad \forall t \in R_+. \qquad (5)$$

Assumptions A2 and B2 are related to the asymptotic properties of the sequence $\{\beta_n\}_{n\ge1}$. If B1 holds, both of them are satisfied if $\beta_1 = \beta_2 = 1$ and $\beta_n = n^{1/2}(\log\log n)^{-1/2}$, $n \ge 3$, or if $\beta_n = n^{\delta}$, $n \ge 1$, where $\delta \in (0,1/2)$ is a constant. Basically, the sequence $\{\beta_n\}_{n\ge1}$ characterizes the almost sure rate of convergence of the averages of $\{A_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$, as well as the almost sure rate of convergence of the algorithm (1) (see Theorems 1, 2 and Corollaries 1–4 later; also note that the right-hand sides of (6), (7) and the left-hand sides of (13), (14) are a sort of averaging). If B1 holds and $\{A_n\}_{n\ge1}$, $\{b_n\}_{n\ge1}$ satisfy the law of iterated logarithm, then $\{\beta_n\}_{n\ge1}$ should be selected as $\beta_1 = \beta_2 = 1$ and $\beta_n = n^{1/2}(\log\log n)^{-1/2}$, $n \ge 3$. Otherwise, a natural choice would be $\beta_n = n^{\delta}$, where $\delta \in (0,1/2)$ is a constant.
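The constants produced by these two standard choices of $\{\beta_n\}$ are easy to check numerically: for $\beta_n = n^{1/2}(\log\log n)^{-1/2}$ one expects $\beta = \lim_n n(\beta_n^{-1}\beta_{n+1}-1) = 1/2$, and for $\beta_n = n^{\delta}$ one expects $\beta = \delta$. A quick numeric sanity check (the concrete value $\delta = 0.3$ and the evaluation point are illustrative choices only):

```python
import math

def rate(beta, n):
    """n * (beta(n+1)/beta(n) - 1): the quantity whose limit is the constant in B2."""
    return n * (beta(n + 1) / beta(n) - 1.0)

lil = lambda n: math.sqrt(n / math.log(math.log(n)))  # beta_n = n^{1/2} (log log n)^{-1/2}
pw = lambda n: n ** 0.3                               # beta_n = n^delta with delta = 0.3

n = 10**6
print(rate(lil, n), rate(pw, n))
```

At $n = 10^6$ the first value is already close to $1/2$ (the $O(1/(\log n \log\log n))$ correction is still visible) and the second is close to $0.3$.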

Assumptions A3, A4 and B3, B4 correspond to the almost sure asymptotic properties of $\{A_n\}_{n\ge1}$ and are standard for the analysis of almost sure convergence and almost sure rate of convergence of linear stochastic approximation algorithms (for details see [4], [9], [11], [14, Sec. I.2]). These assumptions are firmly connected with the almost sure stability of the algorithm (1): without them, it is practically impossible to demonstrate that the sequence of iterates $\{\theta_n\}_{n\ge0}$ is almost surely bounded. Moreover, A3, A4 and B3, B4 themselves could be considered as stability conditions. Assumptions A3 and B3 require $\{\|A_n\|\}_{n\ge1}$ to be almost surely stable in average (note that the left-hand side of (2) could be considered as a sort of averaging). On the other hand, A4 and B4 demand that $\{A_n\}_{n\ge1}$ almost surely converges in average to a negative definite matrix (note that the left-hand side of (3) could also be considered as a sort of averaging) and that the limit itself has a certain degree of stability (i.e., the real parts of its eigenvalues should be strictly less than a certain negative value depending on the asymptotic properties of $\{\gamma_n\}_{n\ge1}$ and $\{\beta_n\}_{n\ge1}$).

Let $\theta \in R^d$ be a deterministic variable. The main results on the almost sure rate of convergence of the algorithm (1) are contained in the following theorems.

Theorem 1: Let assumptions A1–A4 hold. Moreover, let $\tilde{A} = A + aI$, while $\tilde{P}$ is the (unique) positive definite solution of the Lyapunov equation $\tilde{A}^T\tilde{P} + \tilde{P}\tilde{A} = -I$. Furthermore, let $\tilde{\lambda}_{\min}$ and $\tilde{\lambda}_{\max}$ be the minimal and maximal eigenvalue of $\tilde{P}$ (respectively), while $\tilde{K} = K + a + 1$, $K' = 27\tilde{K}\tilde{\lambda}_{\min}^{-5/2}\tilde{\lambda}_{\max}^{9/2}$ and $K'' = (2+\tilde{K})^{-1}$. Then

$$\limsup_{n\to\infty}\beta_n\|\theta_n - \theta\| \le K'\limsup_{n\to\infty}\sup_{n\le j<\phi(n,1)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}\beta_{i+1}(A_{i+1}\theta + b_{i+1})\Big\| \quad \text{w.p. 1} \qquad (6)$$

$$\limsup_{n\to\infty}\beta_n\|\theta_n - \theta\| \ge K''\limsup_{n\to\infty}\sup_{n\le j<\phi(n,1)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}\beta_{i+1}(A_{i+1}\theta + b_{i+1})\Big\| \quad \text{w.p. 1}. \qquad (7)$$
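The constants in Theorem 1 are computable once $\tilde{A}$ is known. The following sketch (the matrix below is an arbitrary stable example and the value of $\tilde{K}$ is a placeholder, neither taken from the correspondence) solves the Lyapunov equation $\tilde{A}^T\tilde{P} + \tilde{P}\tilde{A} = -I$ by vectorization and extracts $\tilde{\lambda}_{\min}$, $\tilde{\lambda}_{\max}$ and $K'$:

```python
import numpy as np

A_t = np.array([[-1.0, 0.5],
                [0.0, -2.0]])   # hypothetical stable A_tilde = A + aI
d = A_t.shape[0]

# Vectorize A_t^T P + P A_t = -I: the Lyapunov operator acts on vec(P) as
# (I kron A_t^T + A_t^T kron I), so P is recovered from one linear solve.
M = np.kron(np.eye(d), A_t.T) + np.kron(A_t.T, np.eye(d))
P = np.linalg.solve(M, -np.eye(d).flatten()).reshape(d, d)

lam = np.linalg.eigvalsh((P + P.T) / 2.0)   # P is symmetric positive definite
lam_min, lam_max = lam[0], lam[-1]

K_tilde = 2.0                                # placeholder for K + a + 1
K_prime = 27 * K_tilde * lam_min ** (-2.5) * lam_max ** 4.5
print(P, lam_min, lam_max, K_prime)
```

The unique solvability used here holds whenever no two eigenvalues of $\tilde{A}$ sum to zero, which is guaranteed when $\tilde{A}$ is stable.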

Theorem 2: Let B1–B4 hold. Moreover, let $\tilde{A} = A + \gamma^{-1}\beta I$, while $\tilde{P}$ is the (unique) positive definite solution of the Lyapunov equation $\tilde{A}^T\tilde{P} + \tilde{P}\tilde{A} = -I$. Furthermore, let $\tilde{\lambda}_{\min}$ and $\tilde{\lambda}_{\max}$ be the minimal and maximal eigenvalue of $\tilde{P}$ (respectively), while $\tilde{L} = L + \gamma^{-1}\beta + 1$, $c = 2(1 + \gamma + \gamma' + \beta)$, $L' = 27c\tilde{L}\tilde{\lambda}_{\min}^{-5/2}\tilde{\lambda}_{\max}^{9/2}$, and $L'' = c^{-4}\gamma^{2}(1-\beta)(2+\tilde{L})^{-1}$. Then

$$\limsup_{n\to\infty}\beta_n\|\theta_n - \theta\| \le L'\limsup_{n\to\infty}\Big\|n^{-1}\beta_n\sum_{i=1}^{n}(A_i\theta + b_i)\Big\| \quad \text{w.p. 1}. \qquad (8)$$

Moreover, if $\beta < 1$, then

$$\limsup_{n\to\infty}\beta_n\|\theta_n - \theta\| \ge L''\limsup_{n\to\infty}\Big\|n^{-1}\beta_n\sum_{i=1}^{n}(A_i\theta + b_i)\Big\| \quad \text{w.p. 1}. \qquad (9)$$

As an immediate consequence of Theorems 1 and 2, the following corollaries are obtained.

Corollary 1: Let A1–A4 hold. Then, $\|\theta_n - \theta\| = O(\beta_n^{-1})$ w.p. 1 if and only if

$$\limsup_{n\to\infty}\sup_{n\le j<\phi(n,t)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}\beta_{i+1}(A_{i+1}\theta + b_{i+1})\Big\| < \infty \quad \text{w.p. 1}, \quad \forall t \in R_+.$$

Moreover, $\|\theta_n - \theta\| = o(\beta_n^{-1})$ w.p. 1 if and only if

$$\lim_{n\to\infty}\sup_{n\le j<\phi(n,t)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}\beta_{i+1}(A_{i+1}\theta + b_{i+1})\Big\| = 0 \quad \text{w.p. 1}, \quad \forall t \in R_+. \qquad (10)$$


Corollary 2: Let B1–B4 hold. Then, $\|\theta_n - \theta\| = O(\beta_n^{-1})$ w.p. 1 if and only if

$$\limsup_{n\to\infty}\Big\|n^{-1}\beta_n\sum_{i=1}^{n}(A_i\theta + b_i)\Big\| < \infty \quad \text{w.p. 1}. \qquad (11)$$

Moreover, $\|\theta_n - \theta\| = o(\beta_n^{-1})$ w.p. 1 if and only if

$$\lim_{n\to\infty}\Big\|n^{-1}\beta_n\sum_{i=1}^{n}(A_i\theta + b_i)\Big\| = 0 \quad \text{w.p. 1}. \qquad (12)$$
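A minimal deterministic sketch consistent with Corollary 2 (all numbers below are illustrative choices, not taken from the correspondence): take $d = 1$, $A_n \equiv A = -1$, $b_n = 1 + 0.5(-1)^n$, so $b = 1$ and $\theta = -A^{-1}b = 1$. The averages $n^{-1}\sum_{i=1}^{n}(A_i\theta + b_i)$ then vanish like $O(n^{-1})$, and with $\gamma_n = n^{-1}$ the iterate error is $O(n^{-1})$ as well, matching the equivalence the corollary asserts.

```python
theta, acc = 0.0, 0.0
N = 10_000
for n in range(N):                        # computes theta_1 .. theta_N
    gamma = 1.0 / (n + 1)                 # gamma_{n+1} = 1/(n+1)
    b_n = 1.0 + 0.5 * (-1.0) ** (n + 1)   # oscillating observations of b = 1
    theta += gamma * (-theta + b_n)       # recursion (1) with A_n = A = -1
    acc += -1.0 + b_n                     # A_i * theta + b_i with theta = 1

avg = acc / N                             # the average in (11)-(12), of order 1/n
print(theta, avg)
```

Because the oscillation cancels in pairs, for even $N$ both the average and the iterate error are exactly zero up to floating-point accumulation, the fastest behavior compatible with the $O(n^{-1})$ rate above.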

To the best of the author's knowledge, the strongest results on the almost sure rate of convergence of linear stochastic approximation algorithms are contained in [4]. In that paper, under assumptions which are equivalent to A1–A4, it has been shown as a main result that $\|\theta_n - \theta\| = o(\beta_n^{-1})$ w.p. 1 if and only if (10) holds. In comparison with the results of [4], the results of Theorem 1 and Corollary 1 are not only less conservative, but also provide an insight into the dependence of $\limsup_{n\to\infty}\beta_n\|\theta_n - \theta\|$ on

$$\limsup_{n\to\infty}\sup_{n\le j<\phi(n,1)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}\beta_{i+1}(A_{i+1}\theta + b_{i+1})\Big\|.$$

Moreover, the results of Theorem 2 and Corollary 2 provide an insight into how the almost sure rate of convergence of the algorithm (1) depends on the almost sure rate of convergence in the law of large numbers for $\{A_n\theta + b_n\}_{n\ge1}$. On the other hand, no result of this kind is presented in [4]. Furthermore, Theorems 1, 2 and Corollaries 1, 2 provide probably the tightest results on the almost sure convergence of linear stochastic approximation algorithms. The rationale comes from the fact that the law of iterated logarithm itself provides the tightest possible results on the almost sure rate of convergence in the law of large numbers (which is $O(n^{-1/2}(\log\log n)^{1/2})$ and, therefore, not covered by the results of [4]) and from the fact that the almost sure rate of convergence of the algorithm (1) is equivalent to the almost sure rate of convergence in the law of large numbers for $\{A_n\theta + b_n\}_{n\ge1}$ (due to Theorem 2; also note that the results of Theorem 2 yield the law of iterated logarithm for linear stochastic approximation if $\{A_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ satisfy this law).

Theorems 1, 2 and Corollaries 1, 2 are also more general than the results on the rate of convergence of stochastic approximation which are based on martingale limit theory. These results are restricted to algorithms with noise which can be decomposed as the sum of a martingale difference, a vanishing and a telescoping sequence (for details see [1], [13], [16], [19], [20]). On the other hand, Theorems 1, 2 and Corollaries 1, 2 do not require any particular decomposition of $\{A_n\}_{n\ge1}$, $\{b_n\}_{n\ge1}$, or $\{A_n\theta + b_n\}_{n\ge1}$, and cover a considerably broader class of input data sequences (note that $\{A_n\theta + b_n\}_{n\ge1}$ is the noise in the algorithm (1)).

Using the results of Theorems 1 and 2, the following corollaries are obtained as well.

Corollary 3: Let A1–A3 hold. Suppose that $A + aI$ is stable and

$$\limsup_{n\to\infty}\sup_{n\le j<\phi(n,t)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}\beta_{i+1}(A_{i+1} - A)\Big\| < \infty \quad \text{w.p. 1}, \quad \forall t \in R_+ \qquad (13)$$

$$\limsup_{n\to\infty}\sup_{n\le j<\phi(n,t)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}\beta_{i+1}(b_{i+1} - b)\Big\| < \infty \quad \text{w.p. 1}, \quad \forall t \in R_+. \qquad (14)$$

Then, $\|\theta_n - \theta\| = O(\beta_n^{-1})$ w.p. 1, where $\theta = -A^{-1}b$.

Corollary 4: Let B1–B3 hold. Suppose that $A + \gamma^{-1}\beta I$ is stable and

$$\limsup_{n\to\infty}\Big\|n^{-1}\beta_n\sum_{i=1}^{n}(A_i - A)\Big\| < \infty \quad \text{w.p. 1} \qquad (15)$$

$$\limsup_{n\to\infty}\Big\|n^{-1}\beta_n\sum_{i=1}^{n}(b_i - b)\Big\| < \infty \quad \text{w.p. 1}. \qquad (16)$$

Then, $\|\theta_n - \theta\| = O(\beta_n^{-1})$ w.p. 1, where $\theta = -A^{-1}b$.

III. PROOF OF THEOREM 1 AND COROLLARY 3

Let $\tilde{\theta}_n = \beta_n(\theta_n - \theta)$, $\xi_n = A_n\theta + b_n$, and $\tilde{\xi}_n = \beta_n\xi_n$, $n \ge 1$, while

$$\tilde{A}_{n+1} = \beta_n^{-1}\beta_{n+1}A_{n+1} + \gamma_{n+1}^{-1}(\beta_n^{-1}\beta_{n+1} - 1)I, \quad n \ge 0.$$

Moreover, let $\tilde{U}_{n,n} = \tilde{U}'_{n,n} = I$, $n \ge 0$, while

$$\tilde{U}_{n,j} = (I + \gamma_j\tilde{A}_j)\cdots(I + \gamma_{n+1}\tilde{A}_{n+1}), \quad 0 \le n < j$$

$$\tilde{U}'_{n,j} = (I + \gamma_j\tilde{A})\cdots(I + \gamma_{n+1}\tilde{A}), \quad 0 \le n < j$$

$$\tilde{s}_{n,j} = \sum_{i=n}^{j-1}\gamma_{i+1}\tilde{\xi}_{i+1}, \quad \tilde{v}_{n,j} = \sum_{i=n}^{j-1}\tilde{U}_{i+1,j}\gamma_{i+1}\tilde{\xi}_{i+1}, \quad 0 \le n \le j$$

$$\tilde{s} = \limsup_{n\to\infty}\sup_{n\le j<\phi(n,1)}\|\tilde{s}_{n,j}\|.$$

Then, it is straightforward to verify that

$$\tilde{\theta}_{n+1} = \tilde{\theta}_n + \gamma_{n+1}(\tilde{A}_{n+1}\tilde{\theta}_n + \tilde{\xi}_{n+1}), \quad n \ge 0 \qquad (17)$$

$$\tilde{\theta}_j = \tilde{U}_{n,j}\tilde{\theta}_n + \tilde{v}_{n,j}, \quad 0 \le n \le j \qquad (18)$$

$$\tilde{v}_{n,j} = \tilde{s}_{n,j} + \sum_{i=n+1}^{j-1}\tilde{U}_{i+1,j}\gamma_{i+1}\tilde{A}_{i+1}\tilde{s}_{n,i}, \quad 0 \le n \le j. \qquad (19)$$

Lemma 1: Let A1–A4 hold. Then

$$\lim_{n\to\infty}\sup_{n\le j<\phi(n,t)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}(\tilde{A}_{i+1} - \tilde{A})\Big\| = 0 \quad \text{w.p. 1} \qquad (20)$$

$$\limsup_{n\to\infty}\sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1}\|\tilde{A}_{i+1}\| \le \tilde{K}t \quad \text{w.p. 1} \qquad (21)$$

for all $t \in R_+$.

Proof: Let $t \in R_+$, while $\Delta_n = \sup_{i\ge n}\beta_i^{-1}\beta_{i+1}$, $n \ge 1$, and

$$\Delta'_n = \sup_{i\ge n}\gamma_{i+1}^{-1}(\beta_i^{-1}\beta_{i+1} - 1), \quad n \ge 1$$

$$\eta_n = \sup_{i\ge n}\max\{\beta_i^{-1}\beta_{i+1} - 1,\ |\gamma_{i+1}^{-1}(\beta_i^{-1}\beta_{i+1} - 1) - a|\}, \quad n \ge 1.$$

Due to assumption A2, $\lim_{n\to\infty}\Delta_n = 1$, $\lim_{n\to\infty}\Delta'_n = a$, and $\lim_{n\to\infty}\eta_n = 0$. On the other hand, using (4), it can easily be deduced that

$$\sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1}\|\tilde{A}_{i+1}\| \le \sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1}\beta_i^{-1}\beta_{i+1}\|A_{i+1}\| + \sum_{i=n}^{\phi(n,t)-1}(\beta_i^{-1}\beta_{i+1} - 1) \le \Delta_n\sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1}\|A_{i+1}\| + \Delta'_n t, \quad n \ge 1$$

$$\Big\|\sum_{i=n}^{j}\gamma_{i+1}(\tilde{A}_{i+1} - \tilde{A})\Big\| \le \Big\|\sum_{i=n}^{j}\gamma_{i+1}(A_{i+1} - A)\Big\| + \sum_{i=n}^{j}\gamma_{i+1}(\beta_i^{-1}\beta_{i+1} - 1)\|A_{i+1}\| + \sum_{i=n}^{j}\gamma_{i+1}|\gamma_{i+1}^{-1}(\beta_i^{-1}\beta_{i+1} - 1) - a| \le \sup_{n\le j<\phi(n,t)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}(A_{i+1} - A)\Big\| + \eta_n\Big(t + \sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1}\|A_{i+1}\|\Big), \quad 1 \le n \le j < \phi(n,t).$$


Then, (20) and (21) follow directly from A1–A4.

Lemma 2: Let A1 hold. Then

$$\limsup_{n\to\infty}\sup_{n\le j<\phi(n,t)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}\tilde{\xi}_{i+1}\Big\| \le (t+1)\limsup_{n\to\infty}\sup_{n\le j<\phi(n,1)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}\tilde{\xi}_{i+1}\Big\| \qquad (22)$$

for all $t \in R_+$.

Proof: Let $t \in R_+$ and $\varepsilon \in (0,1)$, while $l_{n,0} = n$, $n \ge 0$, and $l_{n,k+1} = \phi(l_{n,k},1)$, $k \ge 0$. Moreover, let

$$k_n = \inf\{k \ge 0 : l_{n,k} \ge \phi(n,t)\}, \quad n \ge 0.$$

Due to assumption A1, there exists $n_0 \ge 0$ (depending on $\varepsilon$) such that $\gamma_n \le \varepsilon$, $n \ge n_0$. Then, (4) yields

$$(k_n - 1)(1 - \varepsilon) \le \sum_{k=1}^{k_n-1}\Big(\sum_{l=l_{n,k-1}}^{l_{n,k}}\gamma_{l+1} - \gamma_{l_{n,k}+1}\Big) = \sum_{k=1}^{k_n-1}\sum_{l=l_{n,k-1}}^{l_{n,k}-1}\gamma_{l+1} \le \sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1} \le t, \quad n \ge n_0.$$

Consequently, $k_n \le 1 + (1-\varepsilon)^{-1}t$, $n \ge n_0$. Therefore,

$$\Big\|\sum_{i=n}^{j}\gamma_{i+1}\tilde{\xi}_{i+1}\Big\| = \Big\|\sum_{k=1}^{k_n}\sum_{l=l_{n,k-1}}^{(l_{n,k}-1)\wedge j}\gamma_{l+1}\tilde{\xi}_{l+1}\Big\| \le \sum_{k=1}^{k_n}\Big\|\sum_{l=l_{n,k-1}}^{(l_{n,k}-1)\wedge j}\gamma_{l+1}\tilde{\xi}_{l+1}\Big\| \le (1 + (1-\varepsilon)^{-1}t)\sup_{n\le k\le l<\phi(k,1)}\Big\|\sum_{i=k}^{l}\gamma_{i+1}\tilde{\xi}_{i+1}\Big\|, \quad n_0 \le n \le j < \phi(n,t).$$

Then, it can easily be deduced that

$$\limsup_{n\to\infty}\sup_{n\le j<\phi(n,t)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}\tilde{\xi}_{i+1}\Big\| \le (1 + (1-\varepsilon)^{-1}t)\limsup_{n\to\infty}\sup_{n\le j<\phi(n,1)}\Big\|\sum_{i=n}^{j}\gamma_{i+1}\tilde{\xi}_{i+1}\Big\|$$

wherefrom (22) follows by the limit process $\varepsilon \to 0^+$.

Lemma 3: Let A1 and A4 hold. Then

$$\limsup_{n\to\infty}\sup_{n\le j}\|\tilde{U}'_{n,j}\| \le \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2} \qquad (23)$$

$$\limsup_{n\to\infty}\|\tilde{U}'_{n,\phi(n,t)}\| \le \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2}\exp(-2^{-1}\tilde{\lambda}_{\max}^{-1}t) \qquad (24)$$

for all $t \in R_+$.

Proof: Let $t \in R_+$ and $\varepsilon \in (0,1)$. It is straightforward to verify that

$$(\tilde{U}'_{n,j+1}\vartheta)^T\tilde{P}(\tilde{U}'_{n,j+1}\vartheta) = (\tilde{U}'_{n,j}\vartheta)^T\tilde{P}(\tilde{U}'_{n,j}\vartheta) - \gamma_{j+1}\|\tilde{U}'_{n,j}\vartheta\|^2 + \gamma_{j+1}^2(\tilde{U}'_{n,j}\vartheta)^T\tilde{A}^T\tilde{P}\tilde{A}(\tilde{U}'_{n,j}\vartheta), \quad 0 \le n \le j \qquad (25)$$

for all $\vartheta \in R^d$. On the other hand, assumption A1 implies that there exists $n_0 \ge 0$ (depending on $\varepsilon$) such that $\gamma_n\|\tilde{A}^T\tilde{P}\tilde{A}\| \le \varepsilon$, $n \ge n_0$, while assumption A4 yields

$$\tilde{\lambda}_{\min}\|\tilde{U}'_{n,j}\vartheta\|^2 \le (\tilde{U}'_{n,j}\vartheta)^T\tilde{P}(\tilde{U}'_{n,j}\vartheta) \le \tilde{\lambda}_{\max}\|\tilde{U}'_{n,j}\vartheta\|^2, \quad 0 \le n \le j$$

for all $\vartheta \in R^d$. Then, it can easily be deduced from (25) that

$$(\tilde{U}'_{n,j+1}\vartheta)^T\tilde{P}(\tilde{U}'_{n,j+1}\vartheta) \le (1 - (1-\varepsilon)\tilde{\lambda}_{\max}^{-1}\gamma_{j+1})(\tilde{U}'_{n,j}\vartheta)^T\tilde{P}(\tilde{U}'_{n,j}\vartheta), \quad n_0 \le n \le j$$

for all $\vartheta \in R^d$. Consequently,

$$\|\tilde{U}'_{n,j}\vartheta\|^2 \le \tilde{\lambda}_{\min}^{-1}(\tilde{U}'_{n,j}\vartheta)^T\tilde{P}(\tilde{U}'_{n,j}\vartheta) \le \tilde{\lambda}_{\min}^{-1}(\tilde{U}'_{n,n}\vartheta)^T\tilde{P}(\tilde{U}'_{n,n}\vartheta)\prod_{i=n+1}^{j}(1 - (1-\varepsilon)\tilde{\lambda}_{\max}^{-1}\gamma_i) \le \tilde{\lambda}_{\min}^{-1}\tilde{\lambda}_{\max}\|\vartheta\|^2\exp\Big(-(1-\varepsilon)\tilde{\lambda}_{\max}^{-1}\sum_{i=n}^{j-1}\gamma_{i+1}\Big), \quad n_0 \le n < j$$

for all $\vartheta \in R^d$. Therefore,

$$\|\tilde{U}'_{n,j}\| \le \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2}\exp\Big(-2^{-1}(1-\varepsilon)\tilde{\lambda}_{\max}^{-1}\sum_{i=n}^{j-1}\gamma_{i+1}\Big), \quad n_0 \le n < j \qquad (26)$$

wherefrom (23) directly follows. Due to (26),

$$\|\tilde{U}'_{n,\phi(n,t)}\| \le \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2}\exp\Big(-2^{-1}(1-\varepsilon)\tilde{\lambda}_{\max}^{-1}\sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1}\Big), \quad n \ge n_0.$$

Then, (5) implies that

$$\limsup_{n\to\infty}\|\tilde{U}'_{n,\phi(n,t)}\| \le \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2}\exp(-2^{-1}(1-\varepsilon)\tilde{\lambda}_{\max}^{-1}t)$$

wherefrom (24) follows by the limit process $\varepsilon \to 0^+$.

Lemma 4: Let A1–A4 hold. Then

$$\limsup_{n\to\infty}\sup_{n\le j\le\phi(n,t)}\|\tilde{U}_{n,j}\| \le \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2} \quad \text{w.p. 1} \qquad (27)$$

$$\limsup_{n\to\infty}\|\tilde{U}_{n,\phi(n,t)}\| \le \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2}\exp(-2^{-1}\tilde{\lambda}_{\max}^{-1}t) \quad \text{w.p. 1} \qquad (28)$$

$$\limsup_{n\to\infty}\sup_{n\le j\le\phi(n,t)}\|\tilde{v}_{n,j}\| \le \tilde{K}\tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2}(1+t)^2\tilde{s} \quad \text{w.p. 1}. \qquad (29)$$

Proof: Let $t \in R_+$, while

$$\tilde{S}_{n,j} = \sum_{i=n}^{j-1}\gamma_{i+1}(\tilde{A}_{i+1} - \tilde{A}), \quad 0 \le n \le j.$$

It is straightforward to verify that

$$\tilde{U}_{n,j} = \tilde{U}'_{n,j} + \tilde{S}_{n,j}\tilde{U}_{n,j-1} + \sum_{i=n+1}^{j-1}\tilde{U}'_{i+1,j}(\gamma_{i+1}\tilde{A}\tilde{S}_{n,i} - \tilde{S}_{n,i}\gamma_i\tilde{A}_i)\tilde{U}_{n,i-1}, \quad 0 \le n < j. \qquad (30)$$

Then, it can easily be deduced that

$$\|\tilde{U}_{n,j}\| \le \prod_{i=n+1}^{j}(1 + \gamma_i\|\tilde{A}_i\|) \le \exp\Big(\sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1}\|\tilde{A}_{i+1}\|\Big), \quad 0 \le n < j \le \phi(n,t). \qquad (31)$$

Due to (31),

$$\limsup_{n\to\infty}\sup_{n\le j\le\phi(n,t)}\|\tilde{U}_{n,j}\| \le \exp(\tilde{K}t) \quad \text{w.p. 1} \qquad (32)$$

while (4), (5), (19), and (30) yield

$$\|\tilde{v}_{n,j}\| \le \|\tilde{s}_{n,j}\| + \sum_{i=n+1}^{j-1}\gamma_{i+1}\|\tilde{A}_{i+1}\|\|\tilde{U}_{i+1,j}\|\|\tilde{s}_{n,i}\| \le \Big(1 + \sup_{n\le k\le j\le\phi(k,t)}\|\tilde{U}_{k,j}\|\sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1}\|\tilde{A}_{i+1}\|\Big)\sup_{n\le j<\phi(n,t)}\|\tilde{s}_{n,j}\|, \quad 0 \le n \le j < \phi(n,t) \qquad (33)$$

$$\|\tilde{U}_{n,j} - \tilde{U}'_{n,j}\| \le \|\tilde{S}_{n,j}\|\|\tilde{U}_{n,j-1}\| + \sum_{i=n+1}^{j-1}(\gamma_{i+1}\|\tilde{A}\| + \gamma_i\|\tilde{A}_i\|)\|\tilde{S}_{n,i}\|\|\tilde{U}_{n,i-1}\| \le \Big(1 + t\|\tilde{A}\| + \sum_{i=n}^{\phi(n,t)-1}\gamma_{i+1}\|\tilde{A}_{i+1}\|\Big)\sup_{n\le j\le\phi(n,t)}\|\tilde{S}_{n,j}\|\sup_{n\le j\le\phi(n,t)}\|\tilde{U}_{n,j}\|, \quad 0 \le n \le j \le \phi(n,t). \qquad (34)$$

Using Lemma 1 and (32), (34), it can easily be deduced that

$$\lim_{n\to\infty}\sup_{n\le j\le\phi(n,t)}\|\tilde{U}_{n,j} - \tilde{U}'_{n,j}\| = 0 \quad \text{w.p. 1}.$$

Then, (27) and (28) are a direct consequence of Lemma 3, while (29) follows directly from Lemma 1 and (27), (28), (33).

Proof of Theorem 1: Let $t = 2\tilde{\lambda}_{\max}\log(1 + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})$. Due to Lemmas 1 and 4, there exists $N_0 \in \mathcal{F}$ such that $P(N_0) = 0$ and the following relations hold on $N_0^c$:

$$\limsup_{n\to\infty}\sum_{i=n}^{\phi(n,1)-1}\gamma_{i+1}\|\tilde{A}_{i+1}\| \le \tilde{K} \qquad (35)$$

$$\limsup_{n\to\infty}\sup_{n\le j\le\phi(n,t)}\|\tilde{U}_{n,j}\| \le \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2} \qquad (36)$$

$$\limsup_{n\to\infty}\|\tilde{U}_{n,\phi(n,t)}\| \le \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2}(1 + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})^{-1} \qquad (37)$$

$$\limsup_{n\to\infty}\sup_{n\le j<\phi(n,t)}\|\tilde{v}_{n,j}\| \le 9\tilde{K}\tilde{\lambda}_{\min}^{-3/2}\tilde{\lambda}_{\max}^{7/2}\tilde{s}. \qquad (38)$$

Let $\omega$ be an arbitrary sample from $N_0^c$ (due to the notational simplicity, $\omega$ does not appear in the relations and expressions which follow in the proof). Obviously, it is sufficient to show that

$$K''\tilde{s} \le \limsup_{n\to\infty}\|\tilde{\theta}_n\| \le K'\tilde{s}.$$

Due to (17) and (35),

$$\|\tilde{s}_{n,j}\| = \Big\|\tilde{\theta}_j - \tilde{\theta}_n - \sum_{i=n}^{j-1}\gamma_{i+1}\tilde{A}_{i+1}\tilde{\theta}_i\Big\| \le \|\tilde{\theta}_n\| + \|\tilde{\theta}_j\| + \sum_{i=n}^{j-1}\gamma_{i+1}\|\tilde{A}_{i+1}\|\|\tilde{\theta}_i\| \le \Big(2 + \sum_{i=n}^{\phi(n,1)-1}\gamma_{i+1}\|\tilde{A}_{i+1}\|\Big)\sup_{n\le i}\|\tilde{\theta}_i\|, \quad 0 \le n \le j < \phi(n,1).$$

Then, using Lemma 4, it can easily be deduced that

$$\tilde{s} \le (2 + \tilde{K})\limsup_{n\to\infty}\|\tilde{\theta}_n\|$$

wherefrom $K''\tilde{s} \le \limsup_{n\to\infty}\|\tilde{\theta}_n\|$ directly follows.

Let $\varepsilon \in (0,1)$. Due to (36)–(38), there exists $n_0$ (depending on $\omega$, $\varepsilon$) such that

$$\|\tilde{U}_{n,j}\| \le \varepsilon + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2}, \quad n_0 \le n \le j \le \phi(n,t)$$

$$\|\tilde{U}_{n,\phi(n,t)}\| \le (\varepsilon + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})(1 + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})^{-1}, \quad n \ge n_0$$

$$\|\tilde{v}_{n,j}\| \le 9\tilde{K}\tilde{\lambda}_{\min}^{-3/2}\tilde{\lambda}_{\max}^{7/2}(\tilde{s} + \varepsilon), \quad n_0 \le n \le j < \phi(n,t).$$

Let $n_{k+1} = \phi(n_k,t)$, $k \ge 0$. Then, (18) implies

$$\|\tilde{\theta}_{n_{k+1}}\| \le \|\tilde{U}_{n_k,n_{k+1}}\|\|\tilde{\theta}_{n_k}\| + \|\tilde{v}_{n_k,n_{k+1}}\| \le (\varepsilon + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})(1 + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})^{-1}\|\tilde{\theta}_{n_k}\| + 9\tilde{K}\tilde{\lambda}_{\min}^{-3/2}\tilde{\lambda}_{\max}^{7/2}(\tilde{s} + \varepsilon), \quad k \ge 0 \qquad (39)$$

$$\|\tilde{\theta}_j\| \le \|\tilde{U}_{n_k,j}\|\|\tilde{\theta}_{n_k}\| + \|\tilde{v}_{n_k,j}\| \le (\varepsilon + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})\|\tilde{\theta}_{n_k}\| + 9\tilde{K}\tilde{\lambda}_{\min}^{-3/2}\tilde{\lambda}_{\max}^{7/2}(\tilde{s} + \varepsilon), \quad n_k \le j < n_{k+1}, \ k \ge 0. \qquad (40)$$

Owing to (39),

$$\|\tilde{\theta}_{n_k}\| \le (\varepsilon + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})^k(1 + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})^{-k}\|\tilde{\theta}_{n_0}\| + 9\tilde{K}\tilde{\lambda}_{\min}^{-3/2}\tilde{\lambda}_{\max}^{7/2}(\tilde{s} + \varepsilon)\sum_{i=0}^{k-1}(\varepsilon + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})^i(1 + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})^{-i}, \quad k \ge 0.$$

Therefore,

$$\limsup_{k\to\infty}\|\tilde{\theta}_{n_k}\| \le 9\tilde{K}\tilde{\lambda}_{\min}^{-3/2}\tilde{\lambda}_{\max}^{7/2}(1 + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})(1-\varepsilon)^{-1}(\tilde{s} + \varepsilon).$$

Then, (40) yields

$$\limsup_{n\to\infty}\|\tilde{\theta}_n\| \le (\varepsilon + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})\limsup_{k\to\infty}\|\tilde{\theta}_{n_k}\| + 9\tilde{K}\tilde{\lambda}_{\min}^{-3/2}\tilde{\lambda}_{\max}^{7/2}(\tilde{s} + \varepsilon) \le 9\tilde{K}\tilde{\lambda}_{\min}^{-3/2}\tilde{\lambda}_{\max}^{7/2}(\tilde{s} + \varepsilon)\big(1 + (1-\varepsilon)^{-1}(\varepsilon + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})(1 + \tilde{\lambda}_{\min}^{-1/2}\tilde{\lambda}_{\max}^{1/2})\big)$$

wherefrom $\limsup_{n\to\infty}\|\tilde{\theta}_n\| \le K'\tilde{s}$ follows by the limit process $\varepsilon \to 0^+$.

Proof of Corollary 3: Let $t \in R_+$, while $\alpha_n = \sup_{i\ge n}\beta_i^{-1}$, $n \ge 1$, and

$$\alpha'_n = \sup_{i\ge n}\gamma_{i+1}^{-1}(\beta_i^{-1}\beta_{i+1} - 1), \quad n \ge 1$$

$$U_{n,j} = \sum_{i=n}^{j-1}\gamma_{i+1}\beta_{i+1}(A_{i+1} - A), \quad 0 \le n \le j.$$

Then, $\lim_{n\to\infty}\alpha_n = 0$, $\lim_{n\to\infty}\alpha'_n = a$ (due to assumption A2) and

$$\sum_{i=n}^{j}\gamma_{i+1}(A_{i+1} - A) = \beta_{j+1}^{-1}U_{n,j+1} + \sum_{i=n}^{j}\beta_{i+1}^{-1}(\beta_i^{-1}\beta_{i+1} - 1)U_{n,i}, \quad 1 \le n \le j. \qquad (41)$$

Owing to (4) and (41),

$$\Big\|\sum_{i=n}^{j}\gamma_{i+1}(A_{i+1} - A)\Big\| \le \alpha_n\|U_{n,j+1}\| + \alpha_n\alpha'_n\sum_{i=n}^{j}\gamma_{i+1}\|U_{n,i}\| \le \alpha_n(1 + \alpha'_n t)\sup_{n\le j\le\phi(n,t)}\|U_{n,j}\|, \quad 1 \le n \le j < \phi(n,t).$$

Consequently, assumption A4 holds. On the other hand, since $A\theta + b = 0$,

$$\|\tilde{s}_{n,j}\| \le \Big\|\sum_{i=n}^{j-1}\gamma_{i+1}\beta_{i+1}(A_{i+1} - A)\Big\|\|\theta\| + \Big\|\sum_{i=n}^{j-1}\gamma_{i+1}\beta_{i+1}(b_{i+1} - b)\Big\|, \quad 0 \le n \le j.$$

Therefore, $\tilde{s} < \infty$ w.p. 1. Then, the assertion of Corollary 3 follows directly from Theorem 1.

IV. PROOFS OF THEOREM 2 AND COROLLARY 4

Let $v_n=n^{-1}\sum_{i=1}^{n}\xi_i$, $n\ge1$. Then, it is straightforward to verify that

$$v_{n+1}=v_n-(n+1)^{-1}v_n+(n+1)^{-1}\xi_{n+1},\qquad n\ge0 \tag{42}$$

$$\tilde s_{n,j}=j\gamma_j\beta_jv_j-n\gamma_n\beta_nv_n+\sum_{i=n}^{j-1}i\gamma_{i+1}\beta_i\bigl(\gamma_i\gamma_{i+1}^{-1}-1\bigr)v_i-\sum_{i=n}^{j-1}i\gamma_{i+1}\beta_i\bigl(\beta_i^{-1}\beta_{i+1}-1\bigr)v_i,\qquad 0\le n\le j. \tag{43}$$


Lemma 5: Let assumptions B1 and B2 hold. Suppose that $\zeta<1$. Then

$$2^{-1}c\gamma^{4}(1-\zeta)\varlimsup_{n\to\infty}\beta_n\|v_n\|\le\tilde s\quad\text{w.p. 1.} \tag{44}$$

Proof: Let

$$\phi_n=\sup_{n\le i}\max\{\gamma_i,|i\gamma_i-\gamma|\},\qquad \phi'_n=\sup_{n\le i}i^{-1}\gamma_i^{-1},\qquad n\ge1$$

$$\Delta_n=\sup_{n\le i}i^{-1}(i+1)^{-1}\gamma_{i+1}^{-2},\qquad n\ge1$$

$$\Delta'_n=\sup_{n\le i}i^{-1}\gamma_i^{-1}\gamma_{i+1}^{-1}\bigl|\gamma_i\gamma_{i+1}^{-1}-1\bigr|,\qquad n\ge1.$$

Then, assumption B1 implies that $\lim_{n\to\infty}\phi_n=0$, $\lim_{n\to\infty}\phi'_n=\gamma^{-1}$, $\lim_{n\to\infty}\Delta_n=\gamma^{-2}$, and $\lim_{n\to\infty}\Delta'_n=\gamma^{-2}\gamma_0$. Let $l_{n,0}=n$, $n\ge0$, and $l_{n,k+1}=\tau(l_{n,k},1)$, $k\ge0$, while

$$m_n=\sup\Bigl\{j\ge n:\sum_{i=n}^{j-1}(i+1)^{-1}\le1\Bigr\},\qquad n\ge0$$

$$k_n=\inf\{k\ge1:l_{n,k}\ge m_n\},\qquad n\ge0.$$

Then, it can easily be deduced that

$$1-\phi_n\le\sum_{i=l_{n,k-1}}^{l_{n,k}-1}\gamma_{i+1}\le1,\qquad k\ge1,\ n\ge0 \tag{45}$$

$$\sum_{k=1}^{k_n-1}\sum_{i=l_{n,k-1}}^{l_{n,k}-1}(i+1)^{-1}\le\sum_{i=n}^{m_n-1}(i+1)^{-1}\le1,\qquad n\ge0$$

$$\sum_{i=n}^{j}(i+1)^{-1}\tilde\xi_{i+1}=(j+1)^{-1}\gamma_{j+1}^{-1}\tilde s_{n,j+1}+\sum_{i=n}^{j}i^{-1}(i+1)^{-1}\gamma_{i+1}^{-1}\tilde s_{n,i}-\sum_{i=n}^{j}i^{-1}\gamma_i^{-1}\bigl(\gamma_i\gamma_{i+1}^{-1}-1\bigr)\tilde s_{n,i},\qquad 1\le n\le j$$

$$\sum_{i=n}^{j}(i+1)^{-1}\tilde\xi_{i+1}=\sum_{k=1}^{k_n}\sum_{i=l_{n,k-1}}^{(l_{n,k}-1)\wedge j}(i+1)^{-1}\tilde\xi_{i+1},\qquad 0\le n\le j<m_n$$

(use (4) to get (45)). Consequently

$$k_n(1-\phi_n)\le\sum_{k=1}^{k_n}\sum_{i=l_{n,k-1}}^{l_{n,k}-1}\gamma_{i+1}\le1+\gamma\sum_{k=1}^{k_n-1}\sum_{i=l_{n,k-1}}^{l_{n,k}-1}(i+1)^{-1}+\sum_{k=1}^{k_n-1}\sum_{i=l_{n,k-1}}^{l_{n,k}-1}(i+1)^{-1}\bigl((i+1)\gamma_{i+1}-\gamma\bigr)\le1+(\gamma+\phi_n)\sum_{k=1}^{k_n-1}\sum_{i=l_{n,k-1}}^{l_{n,k}-1}(i+1)^{-1}\le1+\gamma+\phi_n,\qquad n\ge0$$

$$\Bigl\|\sum_{i=k}^{j}(i+1)^{-1}\tilde\xi_{i+1}\Bigr\|\le\phi'_n\|\tilde s_{k,j}\|+(\Delta_n+\Delta'_n)\sum_{i=k}^{\tau(k,1)-1}\gamma_{i+1}\|\tilde s_{k,i}\|\le(\phi'_n+\Delta_n+\Delta'_n)\sup_{\substack{k\le j<\tau(k,1)\\ n\le k}}\|\tilde s_{k,j}\|,\qquad 1\le n\le k\le j<\tau(k,1)$$

$$\Bigl\|\sum_{i=n}^{j}(i+1)^{-1}\tilde\xi_{i+1}\Bigr\|\le k_n\sup_{\substack{k\le j<\tau(k,1)\\ n\le k}}\Bigl\|\sum_{i=k}^{j}(i+1)^{-1}\tilde\xi_{i+1}\Bigr\|,\qquad 0\le n\le j<m_n \tag{46}$$

(use (4) to get (46)). Therefore, $\varlimsup_{n\to\infty}k_n\le1+\gamma$ and

$$\Bigl\|\sum_{i=n}^{j}(i+1)^{-1}\tilde\xi_{i+1}\Bigr\|\le k_n(\phi'_n+\Delta_n+\Delta'_n)\sup_{\substack{k\le j<\tau(k,1)\\ n\le k}}\|\tilde s_{k,j}\|,\qquad 1\le n\le j<m_n$$

wherefrom

$$\varlimsup_{n\to\infty}\sup_{n\le j<m_n}\Bigl\|\sum_{i=n}^{j}(i+1)^{-1}\tilde\xi_{i+1}\Bigr\|\le\gamma^{-2}(1+\gamma+\gamma_0+\zeta)^2\tilde s \tag{47}$$

directly follows. On the other hand, applying Theorem 1 to the recursion (42), it can easily be deduced that

$$\varlimsup_{n\to\infty}\beta_n\|v_n\|\le(2+\zeta)\bigl(2+(1-\zeta)^{-1}\bigr)\varlimsup_{n\to\infty}\sup_{n\le j<m_n}\Bigl\|\sum_{i=n}^{j}(i+1)^{-1}\tilde\xi_{i+1}\Bigr\|\quad\text{w.p. 1.}$$

Then, (44) follows directly from (47).

Proof of Theorem 2: Let $t\in\mathbb R_+$, while $\Phi_n=\sup_{n\le i}i\gamma_i$, $n\ge1$, and

$$\phi'_n=\sup_{n\le i}i\bigl|\gamma_i\gamma_{i+1}^{-1}-1\bigr|,\qquad \Gamma_n=\sup_{n\le i}i\bigl(\beta_i^{-1}\beta_{i+1}-1\bigr),\qquad n\ge1.$$

Then, assumptions B1 and B2 imply that $\lim_{n\to\infty}\Phi_n=\gamma$, $\lim_{n\to\infty}\phi'_n=\gamma_0$, and $\lim_{n\to\infty}\Gamma_n=\zeta$. Let

$$U_n=n^{-1}\sum_{i=1}^{n}(A_i-A),\qquad u_n=n^{-1}\sum_{i=1}^{n}\bigl(\|A_i\|-L\bigr),\qquad n\ge1.$$

It is straightforward to verify that

$$\sum_{i=n}^{j}\gamma_{i+1}(A_{i+1}-A)=(j+1)\gamma_{j+1}U_{j+1}-n\gamma_nU_n+\sum_{i=n}^{j}i\gamma_{i+1}\bigl(\gamma_i\gamma_{i+1}^{-1}-1\bigr)U_i,\qquad 1\le n\le j \tag{48}$$

$$\sum_{i=n}^{j}\gamma_{i+1}\|A_{i+1}\|=(j+1)\gamma_{j+1}u_{j+1}-n\gamma_nu_n+L\sum_{i=n}^{j}\gamma_{i+1}+\sum_{i=n}^{j}i\gamma_{i+1}\bigl(\gamma_i\gamma_{i+1}^{-1}-1\bigr)u_i,\qquad 1\le n\le j. \tag{49}$$

Due to (4), (43), (48), and (49)

$$\Bigl\|\sum_{i=n}^{j}\gamma_{i+1}(A_{i+1}-A)\Bigr\|\le\Phi_n\bigl(\|U_n\|+\|U_{j+1}\|\bigr)+\phi'_n\sum_{i=n}^{\tau(n,t)-1}\gamma_{i+1}\|U_i\|\le(2\Phi_n+\phi'_nt)\sup_{n\le i}\|U_i\|,\qquad 1\le n\le j<\tau(n,t)$$

$$\sum_{i=n}^{\tau(n,t)-1}\gamma_{i+1}\|A_{i+1}\|\le\Phi_n\bigl(|u_n|+|u_{\tau(n,t)}|\bigr)+L\sum_{i=n}^{\tau(n,t)-1}\gamma_{i+1}+\phi'_n\sum_{i=n}^{\tau(n,t)-1}\gamma_{i+1}|u_i|\le Lt+(2\Phi_n+\phi'_nt)\sup_{n\le i}|u_i|,\qquad n\ge1$$


$$\|\tilde s_{n,j}\|\le\Phi_n\bigl(\beta_n\|v_n\|+\beta_j\|v_j\|\bigr)+(\phi'_n+\Gamma_n)\sum_{i=n}^{\tau(n,1)-1}\gamma_{i+1}\beta_i\|v_i\|\le(2\Phi_n+\phi'_n+\Gamma_n)\sup_{n\le i}\beta_i\|v_i\|,\qquad 1\le n\le j<\tau(n,1).$$

Then, assumptions B3 and B4 imply that (3) holds, as well as that

$$\varlimsup_{n\to\infty}\sum_{i=n}^{\tau(n,t)-1}\gamma_{i+1}\|A_{i+1}\|\le Lt\quad\text{w.p. 1},\qquad\forall t\in\mathbb R_+ \tag{50}$$

$$\tilde s\le(2\gamma+\gamma_0+\zeta)\varlimsup_{n\to\infty}\beta_n\|v_n\|. \tag{51}$$

On the other hand, using Theorem 1 and (50), it can easily be deduced that

$$(2+\tilde L)^{-1}\tilde s\le\varlimsup_{n\to\infty}\|\tilde\vartheta_n\|\le 27\tilde L\tilde\lambda_{\min}^{-5/2}\tilde\lambda_{\max}^{9/2}\tilde s\quad\text{w.p. 1.}$$

Then, (8) and (9) follow directly from Lemma 5 and (51).

Proof of Corollary 4: Since

$$\Bigl\|\sum_{i=1}^{n}\xi_i\Bigr\|\le\Bigl\|\sum_{i=1}^{n}(A_i-A)\Bigr\|\,\|\theta\|+\Bigl\|\sum_{i=1}^{n}(b_i-b)\Bigr\|,\qquad n\ge0$$

it can easily be deduced that

$$\varlimsup_{n\to\infty}n^{-1}\beta_n\Bigl\|\sum_{i=1}^{n}\xi_i\Bigr\|<\infty\quad\text{w.p. 1.}$$

Then, the assertion of Corollary 4 follows directly from Theorem 2.

V. EXAMPLES

The results presented in this section correspond to the analysis of LMS algorithms (Theorem 5) and the analysis of cases where $\{A_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ are strongly mixing strictly stationary random processes (Theorem 3) or functions of a uniformly ergodic Markov chain (Theorem 4). The case where $\{A_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ are functions of a Markov chain is important for the area of reinforcement learning (for details, see [2] and references cited therein), while the situation where the algorithm input data sequences satisfy strongly mixing conditions is typical for the area of signal processing (see, e.g., [21]).
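Before the formal statements, the object of study can be made concrete with a short simulation of a linear stochastic approximation recursion of the form $\theta_{n+1}=\theta_n+\gamma_{n+1}(A_{n+1}\theta_n+b_{n+1})$. All concrete choices in the sketch below (the dimension, the stable mean matrix, the noise level, and the steps $\gamma_n=1/n$) are illustrative assumptions, not taken from the theorems:

```python
import numpy as np

# Minimal sketch of a linear stochastic approximation recursion
# theta_{n+1} = theta_n + gamma_{n+1} (A_{n+1} theta_n + b_{n+1}).
# A_bar, b_bar, the 0.1 noise level, and gamma_n = 1/n are hypothetical.
rng = np.random.default_rng(0)
d = 3
A_bar = -np.eye(d)                            # stable mean matrix (eigenvalues -1)
b_bar = rng.normal(size=d)
theta_limit = -np.linalg.solve(A_bar, b_bar)  # theta = -A^{-1} b

theta = np.zeros(d)
for n in range(1, 200001):
    A_n = A_bar + 0.1 * rng.normal(size=(d, d))  # noisy input data A_{n+1}
    b_n = b_bar + 0.1 * rng.normal(size=d)       # noisy input data b_{n+1}
    theta = theta + (1.0 / n) * (A_n @ theta + b_n)

print(np.linalg.norm(theta - theta_limit))    # small: iterate tracks theta
```

The deviation of the iterate from $\theta$ shrinks at the same rate as the averages of the input data, which is exactly the equivalence established by the main results.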

Theorem 3: Let assumption B1 hold, while $\{A_n\}_{n\ge0}$ and $\{b_n\}_{n\ge0}$ are $\mathbb R^{d\times d}$-valued and $\mathbb R^d$-valued jointly strictly stationary random processes satisfying $E\|A_0\|<\infty$ and $E\|b_0\|<\infty$. Let $\{\alpha_n\}_{n\ge0}$ be a sequence of positive reals satisfying

$$E\bigl|P(\Lambda\mid\mathcal F_n)-P(\Lambda)\bigr|\le\alpha_{j-n},\qquad\forall\Lambda\in\mathcal F^j,\ 0\le n\le j \tag{52}$$

where $\mathcal F_n=\sigma\{A_i,b_i:0\le i\le n\}$ and $\mathcal F^n=\sigma\{A_i,b_i:i\ge n\}$, $n\ge0$. Suppose that $A+2^{-1}\gamma^{-1}I$ is stable, where $A=E(A_0)$. Moreover, suppose that $\int_0^1\alpha^{-1}(t)Q^2(t)\,dt<\infty$, where

$$\alpha^{-1}(t)=\inf\{n\ge0:\alpha_n\le t\},\qquad t\in(0,1)$$

$$Q(t)=\inf\{s\in\mathbb R_+:P(\|A_0\|+\|b_0\|>s)\le t\},\qquad t\in(0,1).$$

Let $b=E(b_0)$ and $\theta=-A^{-1}b$, while $\xi_n=A_n\theta+b_n$, $n\ge0$, and

$$\sigma^2=\operatorname{tr}\Bigl(E\bigl(\xi_0\xi_0^T\bigr)+2\sum_{n=1}^{\infty}E\bigl(\xi_0\xi_n^T\bigr)\Bigr).$$

Then

$$\|\theta_n-\theta\|=o\bigl(n^{-1/2}(\log\log n)^{1/2}\bigr)\quad\text{w.p. 1, if }\sigma=0$$

and

$$\|\theta_n-\theta\|=O\bigl(n^{-1/2}(\log\log n)^{1/2}\bigr)\quad\text{w.p. 1, if }\sigma>0.$$

Remark: Due to (52), $\{A_n\}_{n\ge0}$ and $\{b_n\}_{n\ge0}$ are strongly mixing random processes (for the definition and details on strongly mixing conditions, see, e.g., [10], [17], [21] and references cited therein). Moreover, using [17, Theorem 1], it can easily be deduced that $\sigma$ is well defined, finite, and nonnegative.

Proof: Let $\beta_1=\beta_2=1$ and $\beta_n=n^{1/2}(\log\log n)^{-1/2}$, $n\ge3$. Due to Lemma 6 (given in the Appendix)

$$\lim_{n\to\infty}n\bigl(\beta_n^{-1}\beta_{n+1}-1\bigr)=2^{-1}$$

while the law of large numbers for strictly stationary random processes (see, e.g., [18, Theorem V.3.3]) implies

$$\lim_{n\to\infty}n^{-1}\sum_{i=1}^{n}\|A_i\|=E\|A_0\|\quad\text{w.p. 1.}$$

On the other hand, using [17, Theorem 2] and the law of iterated logarithm for a sequence of independent and identically distributed (i.i.d.) random variables (see, e.g., [18, Theorem IV.4.1]), it can easily be deduced that

$$\lim_{n\to\infty}n^{-1}\beta_n\Bigl\|\sum_{i=1}^{n}(A_i\theta+b_i)\Bigr\|=0\quad\text{w.p. 1}$$

if $\sigma=0$, and

$$\varlimsup_{n\to\infty}n^{-1}\beta_n\Bigl\|\sum_{i=1}^{n}(A_i\theta+b_i)\Bigr\|\le\sigma\quad\text{w.p. 1}$$

if $\sigma>0$. Then, the assertion of this theorem follows directly from Theorem 2.

Theorem 4: Let assumption B1 hold, while $\{x_n\}_{n\ge1}$ is an $\mathbb R^{d'}$-valued homogeneous Markov chain having a unique invariant probability measure $\pi(\cdot)$. Let $A_n=A(x_n)$ and $b_n=b(x_n)$, $n\ge1$, where $A:\mathbb R^{d'}\to\mathbb R^{d\times d}$ and $b:\mathbb R^{d'}\to\mathbb R^d$ are Borel-measurable functions. Suppose that there exists a Borel-measurable function $f:\mathbb R^{d'}\to[1,\infty)$ such that $\int f(x)\pi(dx)<\infty$ and

$$\int f(x')P^n(x,dx')<\infty,\qquad\forall x\in\mathbb R^{d'},\ n\ge0$$

$$\max\bigl\{\|A(x)\|^2,\|b(x)\|^2\bigr\}\le f(x),\qquad\forall x\in\mathbb R^{d'}$$

where $P^n(\cdot,\cdot)$ is the $n$th-step transition probability of $\{x_n\}_{n\ge1}$. Moreover, suppose that $A+2^{-1}\gamma^{-1}I$ is stable, where $A=\int A(x)\pi(dx)$. Furthermore, suppose that there exist constants $K\in\mathbb R_+$ and $\rho\in(0,1)$ such that

$$\Bigl|\int\varphi(x')P^n(x,dx')-\int\varphi(x')\pi(dx')\Bigr|\le K\rho^nf(x),\qquad\forall x\in\mathbb R^{d'},\ n\ge0 \tag{53}$$

for any Borel-measurable function $\varphi:\mathbb R^{d'}\to\mathbb R$ satisfying $|\varphi(x)|\le f(x)$, $\forall x\in\mathbb R^{d'}$. Let $b=\int b(x)\pi(dx)$ and $\theta=-A^{-1}b$, while $\xi(x)=A(x)\theta+b(x)$, $x\in\mathbb R^{d'}$, and

$$\sigma^2=\operatorname{tr}\Bigl(\int\xi(x)\xi^T(x)\pi(dx)+2\sum_{n=1}^{\infty}\iint\xi(x)\xi^T(x')P^n(x,dx')\pi(dx)\Bigr).$$

Then, $\|\theta_n-\theta\|=o(n^{-1/2})$ w.p. 1 if $\sigma=0$, and $\|\theta_n-\theta\|=O\bigl(n^{-1/2}(\log\log n)^{1/2}\bigr)$ w.p. 1 if $\sigma>0$.

Remark: Due to (53), $\{x_n\}_{n\ge1}$ is an $f$-uniformly ergodic Markov chain (for the definition and details on this type of ergodicity, see [15, Sec. 16]). Moreover, using [15, Theorem 17.0.1], it can easily be deduced that $\sigma$ is well defined, finite, and nonnegative.


Proof: Let $\beta_n=n^{1/2}$, $n\ge1$, if $\sigma=0$, while $\beta_1=\beta_2=1$ and $\beta_n=n^{1/2}(\log\log n)^{-1/2}$, $n\ge3$, if $\sigma>0$. Due to Lemma 6 (given in the Appendix), $\lim_{n\to\infty}n(\beta_n^{-1}\beta_{n+1}-1)=2^{-1}$, while the law of large numbers for positive Harris Markov chains (see, e.g., [15, Theorem 17.0.1]) implies

$$\lim_{n\to\infty}n^{-1}\sum_{i=1}^{n}\|A(x_i)\|=\int\|A(x)\|\pi(dx)\quad\text{w.p. 1}$$

(note that a chain is positive Harris if it is $f$-uniformly ergodic). On the other hand, using [15, Theorem 17.0.1], it can easily be deduced that

$$\lim_{n\to\infty}n^{-1}\beta_n\Bigl\|\sum_{i=1}^{n}\bigl(A(x_i)\theta+b(x_i)\bigr)\Bigr\|=0\quad\text{w.p. 1}$$

if $\sigma=0$, and

$$\varlimsup_{n\to\infty}n^{-1}\beta_n\Bigl\|\sum_{i=1}^{n}\bigl(A(x_i)\theta+b(x_i)\bigr)\Bigr\|\le\sigma\quad\text{w.p. 1}$$

if $\sigma>0$. Then, the assertion of this theorem follows directly from Theorem 2.
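The Markov-chain setting of Theorem 4 can be illustrated numerically. The sketch below is a hypothetical instance (not from the correspondence): a two-state uniformly ergodic chain with scalar $A(x)$, $b(x)$ and steps $\gamma_n=1/n$, chosen so that $\bar A=\pi A=-7/6$ makes $\bar A+2^{-1}\gamma^{-1}I$ stable:

```python
import random

# Hypothetical two-state illustration of Theorem 4's setting (d = 1).
random.seed(1)
P = [[0.9, 0.1], [0.2, 0.8]]   # transition matrix; stationary pi = (2/3, 1/3)
A_vals = [-1.5, -0.5]          # A_bar = (2/3)(-1.5) + (1/3)(-0.5) = -7/6
b_vals = [2.0, 1.0]            # b_bar = 5/3, so theta = -b_bar / A_bar = 10/7
theta_limit = 10.0 / 7.0

x, theta = 0, 0.0
for n in range(1, 300001):
    x = 0 if random.random() < P[x][0] else 1     # one Markov transition
    theta += (1.0 / n) * (A_vals[x] * theta + b_vals[x])

print(abs(theta - theta_limit))   # small: iterate tracks theta despite Markov noise
```

Here the input data are functions of the chain, not martingale differences, which is precisely the kind of dependence the main results accommodate.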

Theorem 5: Let B1 and B2 hold, while $\{x_n\}_{n\ge1}$ and $\{y_n\}_{n\ge1}$ are $\mathbb R^d$-valued and $\mathbb R$-valued random processes. Let $A_n=-x_nx_n^T$ and $b_n=x_ny_n$, $n\ge1$. Suppose that there exist a positive definite matrix $A\in\mathbb R^{d\times d}$ and a vector $b\in\mathbb R^d$ such that $A-2^{-1}\gamma^{-1}I$ is positive definite and

$$\varlimsup_{n\to\infty}n^{-1}\beta_n\Bigl\|\sum_{i=1}^{n}\bigl(x_ix_i^T-A\bigr)\Bigr\|<\infty\quad\text{w.p. 1}$$

$$\varlimsup_{n\to\infty}n^{-1}\beta_n\Bigl\|\sum_{i=1}^{n}(x_iy_i-b)\Bigr\|<\infty\quad\text{w.p. 1.} \tag{54}$$

Then, $\|\theta_n-\theta\|=O(\beta_n^{-1})$ w.p. 1, where $\theta=A^{-1}b$.

Proof: Due to (54)

$$\lim_{n\to\infty}n^{-1}\sum_{i=1}^{n}\bigl(\eta^Tx_i\bigr)^2=\lim_{n\to\infty}n^{-1}\sum_{i=1}^{n}\eta^Tx_ix_i^T\eta=\eta^TA\eta\quad\text{w.p. 1},\qquad\forall\eta\in\mathbb R^d$$

wherefrom it can easily be deduced that

$$\lim_{n\to\infty}n^{-1}\sum_{i=1}^{n}\|x_ix_i^T\|=\lim_{n\to\infty}n^{-1}\sum_{i=1}^{n}\|x_i\|^2=\operatorname{tr}(A)\quad\text{w.p. 1.}$$

Then, the assertion of this theorem follows directly from Corollary 4.
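With the identification $A_n=-x_nx_n^T$, $b_n=x_ny_n$ of Theorem 5, the recursion becomes the familiar LMS update $\theta_{n+1}=\theta_n+\gamma_{n+1}(y_{n+1}-x_{n+1}^T\theta_n)x_{n+1}$. The sketch below assumes i.i.d. standard Gaussian regressors, additive observation noise, and $\gamma_n=1/n$ (so $A=E(x_nx_n^T)=I$ and $A-2^{-1}\gamma^{-1}I$ is positive definite); these modeling choices are illustrative only:

```python
import numpy as np

# LMS sketch under hypothetical i.i.d. Gaussian regressors and gamma_n = 1/n.
rng = np.random.default_rng(2)
d = 4
theta_true = rng.normal(size=d)      # regression vector to be estimated
theta = np.zeros(d)
for n in range(1, 200001):
    x = rng.normal(size=d)                     # regressor x_n
    y = x @ theta_true + 0.1 * rng.normal()    # noisy observation y_n
    # theta_{n+1} = theta_n + gamma_{n+1} (A_{n+1} theta_n + b_{n+1})
    #            = theta_n + gamma_{n+1} (y - x^T theta_n) x
    theta += (1.0 / n) * (y - x @ theta) * x

print(np.linalg.norm(theta - theta_true))   # small final estimation error
```
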

VI. CONCLUSION

The almost sure rate of convergence of linear stochastic approximation algorithms has been analyzed in this correspondence. As the main result, it has been demonstrated that their almost sure rate of convergence is equivalent to the almost sure rate of convergence of the averages of their input data sequences. As opposed to most of the existing results on the rate of convergence of stochastic approximation, which cover only algorithms with the noise decomposable as the sum of a martingale difference, a vanishing, and a telescoping sequence, the main results of this correspondence hold under assumptions not requiring the input data sequences to admit any particular decomposition. Although no decomposition of the input data sequences is required, the results on the almost sure rate of convergence of linear stochastic approximation algorithms obtained in this correspondence are as tight as the rate of convergence in the law of iterated logarithm. Moreover, the main result of this correspondence yields the law of iterated logarithm for linear stochastic approximation if the law of iterated logarithm holds for the input data sequences. Since the law of iterated logarithm provides the tightest possible results on the almost sure rate of convergence in the law of large numbers, and as the law of large numbers itself could be considered a special case of the almost sure convergence of stochastic approximation, the results presented in this correspondence seem to be the least conservative almost sure rate of convergence results for linear stochastic approximation algorithms. The results of this correspondence have been illustrated with two (nontrivial) examples where the input data sequences are strongly mixing strictly stationary random processes or functions of a uniformly ergodic Markov chain. These results have also been applied to the analysis of LMS algorithms.

APPENDIX

Lemma 6: Let $\beta_1=\beta_2=1$ and $\beta_n=n^{1/2}(\log\log n)^{-1/2}$, $n\ge3$. Then

$$\lim_{n\to\infty}n\bigl(\beta_n^{-1}\beta_{n+1}-1\bigr)=2^{-1}.$$

Proof: Since

$$(\log\log n)^{-1}\log\log(n+1)=1+(\log\log n)^{-1}\log\bigl(1+(\log n)^{-1}\log(1+n^{-1})\bigr),\qquad n\ge3$$

it can easily be deduced that

$$(\log\log n)^{-1}\log\log(n+1)=1+o(n^{-1})$$

and $n^{-1/2}(n+1)^{1/2}=1+2^{-1}n^{-1}+o(n^{-1})$, wherefrom the assertion of this lemma directly follows.
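Lemma 6 can also be checked numerically; the quantity $n(\beta_n^{-1}\beta_{n+1}-1)$ approaches $1/2$ only slowly, since the correction term decays at a $1/(\log n\log\log n)$ rate:

```python
import math

# Numerical check of Lemma 6: n (beta_n^{-1} beta_{n+1} - 1) -> 1/2
# for beta_n = n^{1/2} (log log n)^{-1/2}.
def beta(n: int) -> float:
    return math.sqrt(n / math.log(math.log(n)))

for n in (10**3, 10**5, 10**7):
    print(n, n * (beta(n + 1) / beta(n) - 1.0))   # values creep up toward 0.5
```
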

REFERENCES

[1] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximation. New York: Springer-Verlag, 1990.

[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.

[3] H.-F. Chen, "Recent developments in stochastic approximation," in Preprints 13th IFAC World Congr., vol. C, 1996, pp. 375–380.

[4] E. K. P. Chong, I.-J. Wang, and S. R. Kulkarni, "Noise conditions for prespecified convergence rates of stochastic approximation algorithms," IEEE Trans. Inform. Theory, vol. 45, pp. 810–814, Mar. 1999.

[5] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Berlin, Germany: Springer-Verlag, 1996.

[6] E. Eweda and O. Macchi, "Convergence of an adaptive linear estimation algorithm," IEEE Trans. Automat. Contr., vol. AC-29, pp. 119–127, Feb. 1984.

[7] D. C. Farden, "Stochastic approximation algorithms with correlated noise," IEEE Trans. Inform. Theory, vol. IT-27, pp. 105–113, Jan. 1981.

[8] L. Györfi, "Stochastic approximation from ergodic sample for linear regression," Z. Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 54, pp. 47–55, 1980.

[9] L. Györfi, "Adaptive linear procedures under general conditions," IEEE Trans. Inform. Theory, vol. IT-30, pp. 262–267, Mar. 1984.

[10] P. Hall and C. C. Heyde, Martingale Limit Theory and Its Application. New York: Academic, 1980.

[11] M. A. Kouritzin, "On the convergence of linear stochastic approximation procedures," IEEE Trans. Inform. Theory, vol. 42, pp. 1305–1309, July 1996.

[12] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. Berlin, Germany: Springer-Verlag, 1978.

[13] H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications. Berlin, Germany: Springer-Verlag, 1997.

[14] L. Ljung, G. Pflug, and H. Walk, Stochastic Approximation and Optimization of Random Systems. Basel, Switzerland: Birkhäuser-Verlag, 1992.

[15] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability. New York: Springer-Verlag, 1993.


[16] M. B. Nevel'son and R. Z. Has'minskii, Stochastic Approximation and Recursive Estimation. Providence, RI: Amer. Math. Soc., 1976.

[17] E. Rio, "The functional law of the iterated logarithm for stationary strongly mixing sequences," Ann. Probab., vol. 23, pp. 1188–1203, 1995.

[18] A. N. Shiryayev, Probability. Berlin, Germany: Springer-Verlag, 1984.

[19] V. Solo, "Stochastic approximation and the final value theorem," Stochastic Processes Their Applic., vol. 13, pp. 139–156, 1982.

[20] V. Solo, "Stochastic approximation with dependent noise," Stochastic Processes Their Applic., vol. 13, pp. 157–170, 1982.

[21] V. Solo and X. Kong, Adaptive Signal Processing Algorithms: Stability and Performance. Englewood Cliffs, NJ: Prentice-Hall, 1995.

[22] H. Walk and L. Zsidó, "Convergence of Robbins-Monro method for linear problems in Banach space," J. Math. Anal. Applic., vol. 139, pp. 152–177, 1989.