
Statistics & Probability Letters 9 (1990) 279-287

North-Holland March 1990

ISOTONIC ESTIMATION IN STOCHASTIC APPROXIMATION

D.L. HANSON *

Department of Mathematical Sciences, State University of New York at Binghamton, Binghamton, NY 13901, USA

Hari MUKERJEE

Mathematics and Statistics Department, Wichita State University, Wichita, KS 67208, USA

Received January 1989

Revised April 1989

Abstract: Suppose m(·) is a regression function which has a unique zero θ. The Robbins-Monro process X_{n+1} = X_n - c_n Y_n is a standard stochastic approximation method used to estimate θ. In the literature X_{n+1} is used both as the estimator of θ after the nth step and as the design/control setting for the process at the (n+1)st step. Apparently the justification for this choice is that X_n → θ a.s. and has various asymptotic optimality properties.

Following Frees and Ruppert (1986) we distinguish between design/control settings X_n and estimates θ_n of θ. When m(·) is known to be nondecreasing it is possible to incorporate this prior information into the estimation procedure. We define a new estimator θ_n using isotonic regression, discuss its strong consistency, and discuss some of its optimality properties; we show that an obvious conjecture about the strong consistency of this estimator is false.

One purpose of this note is to generate interest in the possible use of isotonic regression in stochastic approximation.

AMS 1970 Subject Classifications: Primary 62L20, 62H12; Secondary 60F15.

Keywords: Stochastic approximation, isotonic regression, Robbins-Monro procedure, strong consistency.

1. Introduction

Suppose that for each x ∈ ℝ we have a distribution function F_x with finite unknown mean m(x), and suppose the equation m(x) = 0 has the unique solution θ. Our objective is to find (estimate) θ. We are unable to compute m(x), but for each fixed x we are allowed to sample from F_x.

More specifically, let (Ω, Σ, P) be a probability space and let ω denote a point in Ω. All random variables in this paper will be defined on (Ω, Σ, P). Let X_1 be an arbitrary random variable, possibly degenerate, and let {(X_n, Y_n, θ_n)} be a sequence of ordered triples of random variables and {(f_n, g_n)} a sequence of ordered pairs of real-valued functions such that for every positive integer n:

(i)(a) The conditional distribution of Y_n given X_1,...,X_n, Y_1,...,Y_{n-1} depends only on X_n and is F_{X_n};

(i)(b) g_n: ℝ^{2n} → ℝ¹, g_n is Borel measurable, and θ_n = g_n(X_1,...,X_n, Y_1,...,Y_n);

(i)(c) f_n: ℝ^{2n} → ℝ¹, f_n is Borel measurable, and X_{n+1} = f_n(X_1,...,X_n, Y_1,...,Y_n).

* The research of this author was supported by National Science Foundation Grant No. DMS 8402584 and Grant No. DMS 8602565.

0167-7152/90/$3.50 © 1990, Elsevier Science Publishers B.V. (North-Holland)


I.e., after we have observed Y_1 'at' X_1, Y_2 'at' X_2, ..., and Y_n 'at' X_n, we estimate θ using θ_n and decide to take our next observation 'at' X_{n+1}.

The following are among the possible scenarios of interest:

(ii) {Y_n} is the output of a process, possibly a manufacturing process. At 'time n' the process is run at (or on) the 'control setting' X_n, which would be θ if the process were to be run optimally. X_n determines the quality of the process at time n. In this case we either claim that θ_n is not of interest or we set θ_n = X_{n+1}.

(iii) A sequence of scientific experiments with outputs Y_n is to be run indefinitely for the purpose of estimating θ. The nth experiment is run at the control setting X_n. It is not clear when, if ever, the sequence is to be terminated, but θ_n is the estimate of θ that would be given after the nth experiment whether the process were terminated or not. The sole function of the X_k's is to produce good values of the θ_n's.

(iv) The process in (ii) is to be run, but at some time N + 1 (fixed or random) the 'control setting' is to be fixed for eternity. I.e., after the Nth experiment is run we compute θ_N and then set X_k = θ_N for all k ≥ N + 1.

Different things are important in the three scenarios. In (ii) it is important that the X_n's be close to θ, and the θ_n's are irrelevant. In (iii) it is important that the θ_n's be close to θ, and the X_n's are of import only in that they help determine this closeness. In (iv), as in (ii), the goal is for the X_n's to be close to θ, but there is a discontinuity in the purpose of the X_n's. For n ≤ N the setting X_n is of interest since it is the setting from which Y_n is obtained; as in scenario (ii) it is also of interest because it affects the data that will be available to help determine future settings (X_k's with k ≥ n + 1). After the process is run at time N we determine X_{N+1} and we only care that it be as close as possible to θ since we are setting X_k = X_{N+1} for all k ≥ N + 1.

In this paper we are concerned only with the θ_n's and thus are concerned only with scenarios (iii) and (iv).

Now suppose that, in addition to m(θ) = 0, we have (x - θ)m(x) > 0 for all x ≠ θ. The Robbins-Monro (1951) procedure for sequentially estimating θ is given by

X_{n+1} = X_n - c_n Y_n    (1.1)

where {c_n} is a sequence of non-negative constants. This procedure and its variations have been studied extensively; almost all of the results are asymptotic. The convergence almost everywhere, in mean square, and in probability of X_n to θ has been studied, as has the asymptotic distribution of X_n (see, e.g., Robbins and Monro, 1951; Chung, 1954; Blum, 1954; Sacks, 1958; Venter, 1967; Fabian, 1968; Anbar, 1973; Major and Révész, 1973; Goodsell and Hanson, 1976; Kersting, 1977; and Ruppert, 1982). Throughout the study of the Robbins-Monro process it seems to have been assumed that the nth estimate of θ was X_{n+1} (i.e., that θ_n = X_{n+1}). As far as the authors know, the question of any sort of optimality of θ_n = X_{n+1} (for fixed n) has not been addressed, probably because X_{n+1} is not, in general, optimal.

Frees and Ruppert (1986) give a fairly thorough discussion of the difference between estimation and control, give a fairly complete inventory of the known results (all asymptotic), and present a hybrid procedure using the Robbins-Monro process for design/control (i.e., to get the X_n's) and linear regression to obtain the root estimates (the θ_n's).
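For readers who want to experiment, the recursion (1.1) is easy to simulate. The sketch below is ours, not part of the original development: the regression function m(x) = x - 2 (so θ = 2), the noise level, and the classical gain choice c_n = c/n are all illustrative assumptions.

```python
import numpy as np

def robbins_monro(m, x1=0.0, n_steps=5000, c=1.0, sigma=1.0, rng=None):
    """Iterate X_{n+1} = X_n - c_n Y_n with c_n = c/n, where
    Y_n = m(X_n) + noise is the observation taken 'at' X_n.
    Returns the design settings X_1..X_N and observations Y_1..Y_N."""
    rng = rng or np.random.default_rng(0)
    xs, ys = np.empty(n_steps), np.empty(n_steps)
    xs[0] = x1
    for n in range(n_steps):
        ys[n] = m(xs[n]) + sigma * rng.standard_normal()
        if n + 1 < n_steps:
            xs[n + 1] = xs[n] - (c / (n + 1)) * ys[n]
    return xs, ys

# Toy regression function with unique zero theta = 2.
xs, ys = robbins_monro(lambda x: x - 2.0)
print(xs[-1])  # the usual estimate X_{n+1}; close to 2 for large n
```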

In some cases one knows, or may be willing to assume, that m(x) is non-decreasing. In such cases we give a variation of the procedure suggested by Frees and Ruppert; we use isotonic regression instead of linear regression to obtain the θ_n's.

In Section 2 we present our estimators θ_n and discuss their optimality properties. In Section 3 we prove that {θ_n} is strongly consistent if {(X_n, Y_n)} is a Robbins-Monro process having properties usually associated with such a process.

It seems reasonable that the sequence {θ_n}, obtained via isotonic regression (or linear regression), will be strongly consistent under some fairly mild assumptions on {(X_n, Y_n)}. There seems to be no a priori reason to assume that using an RM-process for data collection is optimal when θ_n (different from X_n) is being used for estimation. In Section 4 we discuss the question of the consistency of {θ_n} when {(X_n, Y_n)} is not an RM-process; we show that a natural conjecture is false. Section 5 contains some concluding remarks.

2. Isotonic regression and its proposed use in stochastic approximation

Let (x_1, y_1),...,(x_n, y_n) be ordered pairs of real numbers. Define #(S) to be the number of elements in the set S, and for s ≤ t define

A_n(s, t) = Σ_{{i: s ≤ x_i ≤ t}} y_i / #{i: s ≤ x_i ≤ t}.    (2.1)

Define

m_n(x_j) = max_{s ≤ x_j} min_{x_j ≤ t} A_n(s, t), j = 1,...,n.    (2.2)

It is well known that the function m_n(·) minimizes Σ_{i=1}^n [m_n(x_i) - y_i]² over the collection of non-decreasing functions defined on the set of distinct x_i's. (See Barlow et al., 1972; or the argument at the beginning of Section 2 of Hanson et al., 1973.)

Extend m_n so that: (i) it is continuous, (ii) it is linear between adjacent x_i's, and (iii) m_n′(x) = 1 if x < min{x_i} or if max{x_i} < x. This choice is somewhat arbitrary but seems to be at least as 'natural' as any other. In what follows we restrict our attention to this particular extension of m_n, but our results (or minor modifications of them) hold for other extensions also. For additional properties of the isotonic estimators m_n(·) see Barlow et al. (1972).

Because of the way in which m_n(·) was defined, {x: m_n(x) = 0} is a non-empty, closed, and bounded interval. Let θ_n be its midpoint. Usually {x: m_n(x) = 0} will consist of a single point, and θ_n will be that point.

We now replace (x_1, y_1),...,(x_n, y_n) with (X_1, Y_1),...,(X_n, Y_n) from Section 1, so that A_n(s, t), m_n(·), and θ_n become random quantities. {θ_n} is our sequence of proposed estimators of θ.

Example 2.1. Suppose that σ² > 0 is fixed and that the conditional distribution of Y_n given X_n = x is N(m(x), σ²), so that Y_n = m(X_n) + Z_n where {Z_n} is an i.i.d. N(0, σ²) sequence. Then the joint density of Y_1,...,Y_n given X_1 = x_1 is

f(y_1,...,y_n | x_1) = f(y_n | y_{n-1},...,y_1, x_1) ⋯ f(y_1 | x_1)
 = Π_{k=1}^n (2πσ²)^{-1/2} e^{-(y_k - m(x_k))²/2σ²}.    (2.3)

The unique non-decreasing function m̂(·) on {x_1,...,x_n} which minimizes Σ_{k=1}^n (y_k - m(x_k))² and hence maximizes (2.3) is the restricted maximum likelihood estimator of m(·) (see Brunk, 1955) and is given by (2.2). Thus in this example our m_n and θ_n, as random quantities, are maximum likelihood estimators (not unique) of m and θ.

In all cases, m_n is obtained from a least squares fitting procedure and θ_n is a zero of m_n.
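A quick numerical illustration of Example 2.1, using the sketch from above (this block is ours and reuses isotonic_fit/theta_hat rather than being self-contained; the particular m, design, and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
m = lambda x: np.tanh(3.0 * (x - 0.5))       # non-decreasing, zero at theta = 0.5
x = rng.uniform(-1.0, 2.0, size=200)          # some design settings
y = m(x) + 0.3 * rng.standard_normal(200)     # normal responses as in Example 2.1

print(theta_hat(x, y))                        # restricted-MLE root estimate, near 0.5
```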

3. Consistency of {θ_n} in the Robbins-Monro case

We state the following theorem in the form given in order to emphasize the properties, used in our proof, of certain Robbins-Monro processes. Conditions (3.1)-(3.5) hold under assumptions which are often made when {X_n} is a Robbins-Monro process. They hold, for example, under the assumptions used by Sacks (1958) for his Theorem 1.

Theorem 3.1. Suppose

X_{n+1} = X_n - c_n Y_n for n = 1, 2,...,    (3.1)

c_n ≥ c_{n+1} > 0 for n = 1, 2,...,    (3.2)

X_n → θ a.s.,    (3.3)

P{X_n > θ i.o.} = P{X_n < θ i.o.} = 1,    (3.4)

P{ lim sup_n Σ_{k=1}^n Y_k I_{(θ,∞)}(X_k) = +∞ } = P{ lim inf_n Σ_{k=1}^n Y_k I_{(-∞,θ)}(X_k) = -∞ } = 1.    (3.5)

Let m_n(·) and θ_n be as defined in Section 2. Then θ_n → θ a.s.
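Before turning to the proof, here is how the theorem's setting can be exercised numerically: drive the design with the Robbins-Monro recursion and read off the isotonic root estimates along the way. This sketch is purely illustrative and reuses robbins_monro and theta_hat from the earlier blocks; the particular m, gains, and noise are our assumptions, chosen so that conditions like (3.1)-(3.5) are plausible.

```python
# Robbins-Monro design, isotonic estimation (reuses earlier sketches).
m = lambda x: x - 2.0                     # unique zero theta = 2
xs, ys = robbins_monro(m, x1=0.0, n_steps=2000, c=1.0, sigma=1.0)
for n in (50, 200, 1000, 2000):
    print(n, theta_hat(xs[:n], ys[:n]))   # theta_n; should settle near 2
```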

Lemma 3.1. Suppose X_k ≥ d for k = n,...,ν and X_{ν+1} < d. Then Σ_{k=n}^ν Y_k > 0.

Proof. We will prove the lemma by using backward induction to argue that

Σ_{k=i}^ν Y_k > 0 for i = n,...,ν.

X_{ν+1} < d ≤ X_ν, so c_ν Y_ν = X_ν - X_{ν+1} > 0 and Σ_{k=i}^ν Y_k > 0 if i = ν.

Now suppose n ≤ i < ν and that

Σ_{k=j}^ν Y_k > 0 for j = i + 1,...,ν.

X_{ν+1} < d ≤ X_i, so that

Σ_{k=i}^ν c_k Y_k = -(X_{ν+1} - X_i) > 0.

Define

β_j = c_j - c_{j+1} for j = i,...,ν - 1, and β_ν = c_ν,

so that Σ_{j=k}^ν β_j = c_k for k = i,...,ν. Note that β_k ≥ 0 for all k. We have

Σ_{k=i}^ν c_k Y_k = Σ_{j=i}^ν β_j [ Σ_{k=i}^ν Y_k - Σ_{k=j+1}^ν Y_k ] ≤ Σ_{j=i}^ν β_j Σ_{k=i}^ν Y_k = c_i Σ_{k=i}^ν Y_k,

since each subtracted tail Σ_{k=j+1}^ν Y_k is positive for j < ν and zero for j = ν. Thus

Σ_{k=i}^ν Y_k > 0. □

Proof of Theorem 3.1. For notational convenience we assume θ = 0 in this proof. We are done if we can prove that for each constant c > 0 we have

P{θ_n > c i.o.} = 0 and P{θ_n < -c i.o.} = 0.

We will prove only that P{θ_n > c i.o.} = 0. The proof that P{θ_n < -c i.o.} = 0 is essentially the same.

Let

A = { ω: X_n(ω) → 0, X_n(ω) > 0 i.o., and lim sup_n Σ_{k=1}^n Y_k(ω) I_{(0,∞)}(X_k(ω)) = +∞ }.

Note that P(A) = 1. It suffices to show that for each fixed ω in A we have θ_n(ω) > c only finitely often. For the rest of the proof we fix ω ∈ A but suppress the dependence of all random variables on ω.

Let b be such that 0 < b < c and b = X_{i_0} for some i_0. Let I = {k: X_k > b}. Note that, since X_n → 0, I is a finite set. Define

M = Σ_{k∈I} |Y_k| and k_0 = max{k: k ∈ I}.

Choose n* so that

Σ_{k=1}^{n*} Y_k I_{(0,∞)}(X_k) > M and n* > max{k_0, i_0}.

Let

d = min{X_k: X_k > 0, 1 ≤ k ≤ n*}.

Note that d ≤ b. Choose n_0 so that n > n_0 implies X_n < d. Let

J = {k: X_k ≥ d} and J_1 = J ∩ {1,...,n*}.

Write J = J_1 ∪ ⋯ ∪ J_{v_0}, where J_1,...,J_{v_0} are pairwise disjoint and, for v > 1, J_v is a set of consecutive integers, say J_v = {n,...,m}, such that X_k ≥ d for k = n,...,m and X_{m+1} < d. From the lemma, Σ_{k∈J_v} Y_k > 0 if v > 1, so

Σ_{k∈J} Y_k ≥ Σ_{k∈J_1} Y_k = Σ_{k=1}^{n*} Y_k I_{(0,∞)}(X_k) > M.

If n > n_0 and t ≥ b, then

Σ_{{k: d ≤ X_k ≤ t, 1 ≤ k ≤ n}} Y_k = Σ_{{k∈J: X_k ≤ t}} Y_k ≥ Σ_{k∈J} Y_k - Σ_{k∈I} |Y_k| = Σ_{k∈J} Y_k - M > 0.

Thus for all n > n_0 we have min_{b≤t} A_n(d, t) > 0, so that, since b = X_{i_0},

m_n(b) = max_{s≤b} min_{b≤t} A_n(s, t) ≥ min_{b≤t} A_n(d, t) > 0.

Thus θ_n(ω) < b < c for all n > n_0. □


From an applications point of view, Theorem 3.1 is interesting when m(·) is nondecreasing because of the optimality properties of m_n(·) in this case. However, 'm(·) nondecreasing' is not an assumption that was required to prove the theorem.

4. More on consistency of {θ_n}

There seems to be no a priori reason for the data collection process {(X_n, Y_n)} to be an RM-process. It seems natural to conjecture that if X_n → θ a.s. then, under 'reasonable additional assumptions', θ_n → θ a.s. also. The following examples show that when X_n → θ there is strong evidence to support the use of {θ_n} as a consistent sequence of estimators of θ, but that the 'reasonable additional assumptions' mentioned above aren't immediately obvious; in particular, '{X_n} consistent for θ' does not imply '{θ_n} consistent for θ'.

Example 4.1. Suppose {X_n} is a sequence of distinct real numbers (constant random variables) such that X_n → θ, X_n > θ i.o., and X_n < θ i.o. Suppose {Z_n} is an i.i.d. sequence such that EZ_1 = 0 and 0 < EZ_1² < ∞; and suppose Y_k = Z_k + m(X_k) where (x - θ)m(x) ≥ 0 for all x. If m_n and θ_n are as defined in Section 2, then θ_n → θ a.s.

Remark. Note that in Example 4.1 we do not require that m(·) be non-decreasing, nor do we require the weaker assumption, (x - θ)m(x) > 0 for x ≠ θ, often used when proving consistency of the Robbins-Monro process.

Example 4.1 (continued). The argument is similar to that of Theorem 3.1; for notational convenience we take θ = 0. Let {u_i} be the sequence of positive X_n's listed in decreasing order, let {j_i} be such that u_i = X_{j_i}, and let v_i = Y_{j_i}. Note that if

A = { ω: lim sup_n Σ_{i=1}^n v_i(ω) = +∞ }

then P(A) = 1. Fix ω ∈ A. Then set k_0 = 0 and obtain k_1 < k_2 < k_3 < ⋯ such that

(a) Σ_{j=k_{i-1}+1}^{k} v_j(ω) ≤ 0 for k_{i-1} < k < k_i,

(b) Σ_{j=k_{i-1}+1}^{k_i} v_j(ω) > 0,

for i = 1, 2,.... Note that Σ_{j=k}^{k_i} v_j(ω) > 0 if i ≥ 1 and 1 ≤ k ≤ k_i. It follows that if n ≥ max{j_1,...,j_{k_i}} then X_n and all the remaining X's are less than u_{k_i}, so that

m_n(u_{k_i}) ≥ min_{u_{k_i} ≤ t} A_n(u_{k_i}, t) = min_{1≤k≤k_i} Σ_{j=k}^{k_i} v_j(ω)/(k_i - k + 1) > 0.

Thus θ_n < u_{k_i} if n ≥ max{j_1,...,j_{k_i}}, so that lim sup_n θ_n(ω) ≤ 0 if ω ∈ A. A similar argument shows that P{lim inf_n θ_n ≥ 0} = 1.
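A numerical check of Example 4.1 along these lines (ours; it reuses theta_hat from Section 2, and the deterministic design X_n = (-1)^n/n and the cubic m are illustrative choices satisfying the example's hypotheses with θ = 0):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
k = np.arange(1, n + 1)
x = (-1.0) ** k / k                    # distinct, -> 0, above and below 0 i.o.
y = x ** 3 + rng.standard_normal(n)    # m(x) = x^3 satisfies x * m(x) >= 0

print(theta_hat(x, y))                 # should be near theta = 0 for large n
```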

One might put Theorem 3.1 and Example 4.1 together and come up with a version of the following.


Conjecture (false). Suppose

(i) m(·) is continuous and strictly increasing with m(0) = 0;

(ii) {Z_k} is an i.i.d. sequence of random variables with EZ_1 = 0 and E(Z_1²) = 1;

(iii) {X_n} is a sequence of random variables such that (3.3) and (3.4) hold (with θ = 0);

(iv) Y_k = Z_k + m(X_k) for all k.

Let m_n and θ_n be as defined in Section 2. Then θ_n → 0 a.s.

Counterexample 4.2. For notational convenience we again set θ = 0. We assume m(·) satisfies (i) and {Z_i} satisfies (ii). We will construct {X_i} so as to satisfy (3.3) and (3.4). We define {Y_i} via (iv).

Let A = {ω: lim inf_n Σ_{j=1}^n Z_j = -∞} and note that P(A) = 1. Let δ_k = 3^{-k} and ε_k = sup_{|x|≤δ_k} |m(x)|. Fix ω ∈ A. (For notational convenience we will not explicitly mention ω in the following, but it remains fixed.)

Set k_0 = n_0 = 0. There is a smallest n > n_0, call it n_1, such that inf_{k>k_0} Σ_{j=1}^k (Z_j + ε_n) < 0, and a smallest k > k_0, call it k_1, such that

Σ_{j=k_0+1}^{k} { Z_j + m(δ_{n_1} - δ_{n_1}/(j+1)) } < 0.    (4.1)

Let

X_j = δ_{n_1} - δ_{n_1}/(j+1) for j = k_0+1,...,k_1.    (4.2)

There is a smallest n > n_1, call it n_2, such that inf_{k>k_1} Σ_{j=k_1+1}^k (Z_j + ε_n) < 0, and a smallest k > k_1, call it k_2, such that

Σ_{j=k_1+1}^{k} { Z_j + m(-δ_{n_2} - δ_{n_2}/j) } < 0.    (4.3)

Let

X_j = -δ_{n_2} - δ_{n_2}/j for j = k_1+1,...,k_2.    (4.4)

Suppose n_i, k_i and X_j have been defined for i = 1,...,α and j = 1,...,k_α. There is a smallest n > n_α, call it n_{α+1}, such that

inf_{k>k_α} Σ_{j=k_α+1}^k (Z_j + ε_n) < 0,    (4.5)

and a smallest k > k_α, call it k_{α+1}, such that

Σ_{j=k_α+1}^{k} { Z_j + m(δ_{n_{α+1}} - δ_{n_{α+1}}/j) } < 0 if α is even,
Σ_{j=k_α+1}^{k} { Z_j + m(-δ_{n_{α+1}} - δ_{n_{α+1}}/j) } < 0 if α is odd.    (4.6)

If α is even, define

X_j = δ_{n_{α+1}} - δ_{n_{α+1}}/j for j = k_α+1,...,k_{α+1}    (4.7)

and, if α is odd, define

X_j = -δ_{n_{α+1}} - δ_{n_{α+1}}/j for j = k_α+1,...,k_{α+1}.    (4.8)

Iterate. The X_j's are defined so as to satisfy the order relations:

X_{k_1+1} < X_{k_1+2} < ⋯ < X_{k_2} < X_{k_3+1} < X_{k_3+2} < ⋯ < X_{k_4} < X_{k_5+1} < ⋯ < 0    (4.9)

< ⋯ < X_{k_4+1} < ⋯ < X_{k_5} < X_{k_2+1} < X_{k_2+2} < ⋯ < X_{k_3} < X_{k_0+1} < X_{k_0+2} < ⋯ < X_{k_1}.    (4.10)

Now Y_j = Z_j + m(X_j) for all j. The definition of the k_α's, (4.1), (4.3), (4.5) and (4.6) give

Σ_{j=k}^{k_{α+1}} Y_j < 0 for all α and all k ∈ {k_α+1,...,k_{α+1}}.    (4.11)

In the ordering (4.9)-(4.10) each block of indices {k_α+1,...,k_{α+1}} appears as a consecutive run, so by (4.11) every set of the form {j ≤ k_α: s ≤ X_j ≤ X_{k_1}} has a negative Y-sum. Thus for each k_α,

m_{k_α}(X_{k_1}) = max_{s ≤ X_{k_1}} min_{X_{k_1} ≤ t} A_{k_α}(s, t) ≤ max_{s ≤ X_{k_1}} A_{k_α}(s, X_{k_1}) < 0,

so that θ_{k_α} > X_{k_1} for all α. Thus

P{ lim sup_n θ_n(ω) > 0 } ≥ P(A) = 1.

Note that {X_n} satisfies (3.3) and (3.4), so this is a counterexample to the conjecture.
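The engine of the counterexample is (4.11): taken in increasing order of the design values, every upper tail of the data has a negative Y-sum, so the isotonic fit is negative even at the largest design point and the root estimate sits above every X observed so far. The toy deterministic block below (ours; the numbers are fabricated purely to exhibit the mechanism, and it reuses theta_hat from Section 2) shows the effect without any randomness:

```python
import numpy as np

# Three blocks of design points mimicking the order relations (4.9)-(4.10):
# an early high positive block, a negative block, a later low positive block.
x = np.concatenate([np.linspace(0.45, 0.90, 6),     # block 1 (early, high)
                    np.linspace(-0.35, -0.30, 6),   # block 2 (negative)
                    np.linspace(0.05, 0.10, 6)])    # block 3 (late, low)

# Within each block, responses whose every suffix sum is negative, as in (4.11).
pattern = np.array([0.3, -0.4, 0.3, -0.4, 0.3, -0.4])
y = np.tile(pattern, 3)

# Every tail of y in increasing-x order then sums to < 0, so the isotonic fit
# is < 0 at max(x) and the root estimate lands above every design point.
print(theta_hat(x, y), ">", x.max())
```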

The following is another example of something that does not work.

Example 4.3. The optimality arguments previously presented for our root estimators θ_n might lead one to try using the sequence {θ_n} for control as well as for estimation: to set X_{n+1} = θ_n for all n, so that at each step the next observation is taken at the previous root estimate. However, if X_i < θ_n < X_j, where X_i and X_j are the nearest X_k's to θ_n (on the left and right, respectively) from the set {X_1,...,X_n}, and if X_{n+1} = θ_n so that Y_{n+1} is observed at θ_n, then standard arguments used in isotonic regression give:

(a) if Y_{n+1} < 0 then m_{n+1}(θ_n) < 0 and m_{n+1}(X_j) = m_n(X_j) > 0, so θ_{n+1} ∈ (θ_n, X_j);

(b) if Y_{n+1} > 0 then m_{n+1}(θ_n) > 0 and m_{n+1}(X_i) = m_n(X_i) < 0, so θ_{n+1} ∈ (X_i, θ_n).

We have ignored equality (as opposed to inequality) above, but it is clear that the θ_n's lie in a sequence of nested intervals and that in any realistic case, with probability one, θ eventually lies outside all the rest of these intervals.
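A short simulation of this feedback scheme (ours; it reuses theta_hat from Section 2, and m, the noise level, and the two starting settings are arbitrary) shows the estimates freezing inside the nested intervals rather than exploring:

```python
import numpy as np

rng = np.random.default_rng(4)
m = lambda t: t                          # true root theta = 0
x = [-1.0, 1.0]                          # two arbitrary initial settings
y = [m(v) + rng.standard_normal() for v in x]

for _ in range(200):                     # X_{n+1} = theta_n: observe at the estimate
    t = theta_hat(np.array(x), np.array(y))
    x.append(t)
    y.append(m(t) + rng.standard_normal())

print(x[-5:])                            # settings stall; theta need not be inside
```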

5. Summary and concluding remarks

For the processes under consideration, Frees and Ruppert (1986) presented a case for distinguishing between a sequence X_n of design/control settings and a sequence θ_n of estimates of the optimum control setting θ.

We have defined specific sequences m_n(·) and θ_n of estimators of m(·) and θ respectively. When m(·) is non-decreasing these are seen to be least squares estimators of m(·) (and, in a sense, of θ). In a particular case they are maximum likelihood estimators.


Though we expect to obtain benefits from our estimators only when n is small or moderate in size, we address the question of consistency. When {(X_n, Y_n)} is a 'reasonable' Robbins-Monro process, we get θ_n → θ a.s. However, even though we expect θ_n to be strongly consistent in a wide variety of cases for which X_n is strongly consistent (for θ), we show by example and counterexample that the question of consistency is, in fact, a very complicated one.

We hope this paper will (i) stimulate interest in, and the investigation of, the use of isotonic regression in sequential estimation; and (ii) provide support for the suggestion of Frees and Ruppert (1986) that we distinguish between the 'design/control' problem and the 'estimation' problem.

References

Anbar, D. (1973), On optimal estimation methods using stochastic approximation procedures, Ann. Statist. 1, 1175-1184.

Barlow, R.E., D.J. Bartholomew, J.M. Bremner and H.D. Brunk (1972), Statistical Inference Under Order Restrictions (Wiley, New York).

Blum, J.R. (1954), Approximation methods which converge with probability one, Ann. Math. Statist. 25, 382-386.

Chung, K.L. (1954), On a stochastic approximation method, Ann. Math. Statist. 25, 463-483.

Fabian, V. (1968), On asymptotic normality in stochastic approximation, Ann. Math. Statist. 39, 1327-1332.

Frees, E.W. and D. Ruppert (1987), Estimation following a Robbins-Monro designed experiment, Tech. Rept. No. 811, Dept. of Statistics, Univ. of Wisconsin (Madison, WI).

Goodsell, C.A. and D.L. Hanson (1976), Almost sure convergence for the Robbins-Monro process, Ann. Probab. 4, 890-901.

Hanson, D.L., G. Pledger and F.T. Wright (1973), On consistency in monotonic regression, Ann. Statist. 1, 401-421.

Kersting, G. (1977), Almost sure approximation of the Robbins-Monro process by sums of independent variables, Ann. Probab. 5, 954-965.

Lee, C.C. (1981), The quadratic loss of isotonic regression under normality, Ann. Statist. 9, 686-688.

Major, P. and P. Révész (1973), A limit theorem for the Robbins-Monro approximation, Z. Wahrsch. Verw. Gebiete 27, 79-86.

Robbins, H. and S. Monro (1951), A stochastic approximation method, Ann. Math. Statist. 22, 400-407.

Ruppert, D. (1982), Almost sure approximations to the Robbins-Monro and Kiefer-Wolfowitz processes with dependent noise, Ann. Probab. 10, 178-187.

Ruppert, D. (1988), Efficient estimators from a slowly convergent Robbins-Monro process, Tech. Rept. No. 781, School of Operations Research and Industrial Engineering, Cornell Univ. (Ithaca, NY).

Sacks, J. (1958), Asymptotic distribution of stochastic approximation procedures, Ann. Math. Statist. 29, 373-405.

Venter, J.H. (1967), An extension of the Robbins-Monro procedure, Ann. Math. Statist. 38, 181-190.

Wu, C.F.J. (1985), Efficient sequential designs with binary data, J. Amer. Statist. Assoc. 80, 974-984.
