KULLBACK·LEIBLER CONSTRAINED ESTIMATION OF PROBABILITYMEASURES · Po, uniformly over a class of sets C. The estimators considered here are based on the minimization of the Kullback-Leibierdivergence

KULLBACK·LEIBLER CONSTRAINED ESTIMATION

OF PROBABILITY MEASURES

by

Anne Sheehy

TECHNICAL REPORT No. 137

July 1988

Department of Statistics, GN-22

University of Washington

Seattle. Washington 98195 USA

-

Kulfbeck-Lelbler Constrained Estimation of Probability Measures

by

Anne Sheehy

Stanford University

ABSTRACTWe consider two estimation problems in this paper. In the first, we observe

Xl,." ,Xn i.i.d. Po, where it is assumed known that EPoT = a. In the second, we

observe Xl,." ,Xn E IRa and J::: 2::'1 T(Yi) E IRb, where Xl, ... ,Xn, Yl, ... ,Y",

are i.i.d. Po, T is some Borel measurable function and n/m ----+ A as n/\ m --+ (X). In

both situations we consider the problem of estimating the probability measure

Po, uniformly over a class of sets C. The estimators considered here are based

on the minimization of the Kullback-Leibier divergence from certain collections of

probability measures to the empirical measure of the Xi'S. We show that these

estimators are consistent and asymptotically efficient.

•

Kullback-Leibler Constrained Estimation

of Probability Measures

by

Anne SheehyStanford University

July, 1988

AMS 1980 Subject Classifications: 62G05, 62G20.

"';' Key Words and Phrases: constrained estimation, Kullbeck-Leibler divergence, weakcom"ergence, asymptotic efficiency.

Research supported by: National Science Foundation Grant DM5-S400893 and DM8708083

,.

1

§1 Introduction.

HabermaD,(1984) considered the problem of estimating a Euclidean-valued

I functional of a probability measure Po, where one has observed data X1, ... ,X..

i.i.d. Po and one knows that Po satisfies a finite number of linear constraints. He

proposed an estimator for this problem based on the minimization of the Kullback

Leibler divergence from the constrained family of probability measures to the

empirical measure of the observations. He also derived a number of consistency

and weak convergence results for this estimator. In this paper we extend this

problem and those results in a number of directions as will be described below.

We consider two kinds of estimation problems here. The first can be described

as follows:-

We assume Po is a probability measure on the Borel a-field (IRd, B4 ) . Having

observed data Xl, ... ,Xn E IR4 i.i.d. po. we consider the problem of estimating

certain linear functionals of Po, given that we know Po belongs to the collection of

all probability measures On(IR4, 8 4 ) satisfying a fini te number of linear constraints.

More precisely, let 'P denote the collection of all probability measures on the Borel

a-field (IR4 • 8 4 ) and let T : IR4 -+ IR' be some Borel measurable function. We

assume that Po is constrained to be in

(1) PT,. = {P E P : EpT = a}

for some fixed a E IR'. We will refer to this problem as the "constrained estimation

problem". In this paper we will focus on the problem of estimating one particular

functional of Po, defined in the following way:-

Let C be a collection of sets in B4 • We consider the problem of estimating

functionals of Po of the kind

(2) VC(Po) E L~(C) where vc(Po)(C) = Po(C) for any C E C.

In estimating vc(Po) we are essentially estimating the probability measure Po

uniformly over C. A closely related problem to the one described above is that

2

where Po is not in fact in 'PT,4 but we want to estimate that Q in "PT,f1, which is

"closest" in some sense to Po. See the example below.

In the second problem, we again observe dataX1, ... ,X" E IRd i.i.d. Po. In

addition we observe the IR"-vaJued random variable Tm == ~ L:'l T(l'i) where

Y1 1 ••. ,Ym E IRd are i.i.d. Po, independent of the Xi'S, but are Dot observed and

T(·) is some measurable function from IRd into nt'. We also assume that

n- _ .A > 0 as n ---+ 00.m -

We will refer to this problem as the "supplementary sample moments problem".

Again, we are interested in est~a~ing functions of Po of the kind IIC(PO) E Loo(C),

exactly as defined above.

The theory given here is easily adapted to the case where we are interested

in estimating Euclidean valued functions of Po of the kind vg(Po) = EPo9 where

9 : lR" -10 IR" is some measurable function. For the constrained estimation prob

lem, the analogues of the theorems given in section 3 for the problem of the esti

mation of uc(Po) can in fact be found in Haberman(1984), while those of section 4

can be found in SheehY(1987).

The following is an example of the first kind of estimation problem.

Example: Known Marginals. Suppose ~i is a probability mass function for

i = 1, ..• , k ::; d. Suppose ~i and POi-the i t h marginal distribution of the X's

have support concentrated on {ail ... aib;}, i = 1, ... , k. Let

T(x) = (l(au j(xIl - Ro,(oll),"" l{a..,J(x.) - Ro,(o16,), ... ,

l{a"j(x,) - Ro.(o.,), ... ,l{a..,j(x,) - Ro.("..,)) E JR'

where b = L:~=1 bil and x = (Xl, ... , Xd)T.

Now let

a = (O, ... O)T E JR'.

3

Constraining Po to be in 'PT,fa 1S equivalent to saying we know the fint k

marginal distributions of Po-

Mosteller (1968) gives some interesting examples where 'PT.4 is as defined in

the example above and Po rl 'PT,fI. but one wants to estimate that Qo which is in

'PT,a. and which is "closest" in some sense to Po- One such example is the situa.tion

where we want to compare transition probabilities in social mobility tables between

societies. We may first want to adjust the marginal distributions of row and column

variables so that the variations in the relative sizes of social classes from society

to society do not affect our results. Mosteller suggests adjusting the marginal

probabilities so that they are uniformly distributed on each table. The method he

proposes, Iterative Proportional Fitting, gives the same estimates as the ones we

get from using the Kullback-Leibler constrained estimator which we will describe

below.

Jagers et al. (1985) describe a situation in the sample survey framework where

(XI, Yd, ...• (Xn l Yn) E IR? i.i.d. Po are observed and we know the marginal

distribution of the X's, call it POr' Also we know that POr has finite support.

They consider the problem of estimating functions of the kind vg(Po) = E Po9,

where 9 is some Borel-measurable, real valued function as defined above. In fact

they focus on functions 9 which are functions only of the Y variable. The estimator

they examine is called a "post-stratified" estimator and is defined by

- ~ po.(x)Ep,g(y) = LJ v(x) g(Yo)lIX;=,],

•where v(x) is the number of (X, Y) pairs where Xi = x, and the summation is

taken over all the x values in the support of Poz: with v(x) > O. It can be shown

that the Kullback-Leibler constrained estimator, described below, gives the same

estimate as the post-stratified estimator in this situation.

We now describe the estimators which we will consider here. The idea is the

following: In the fully non-parametric set-up, having observed X J 1 ••• ,Xn i.i.d,

Po, when no knowledge of Po is assumed, it is known that vc(IP n), is in a sense,

4

an optimal estimator for vc(Po} (under appropriate regularity conditions on C and

Po), where

(3)

We call IPn the empirical mea.5tlre of the observations Xl, ... ,Xn'

In the constrained estimation problem we know that Po is in a certain re

stricted class of probability measures and of course feel that we should take this

information into account when estimating functions of Po. \Ve propose doing this

by using vc(Q n(a» as an estimate of vc(Po), where QnCa) E PT,o and QnCa) is

the "closest" as measured by the Kullback-Leibler divergence, probability to lPn l

in PT,IJ.'

We use a similar idea in the supplementary sample moments problem. Define

(4)

and

(5)

1 rn

1P~(.) = m L l(Y,EOli=l

n J mJ 2TN =::; N TdlPn + N TdIPfn.l where N = n + m.

With these definitions, we use lIc(QN(TN» as an estimator for vc(Po) where,

q N(TN) E 'PT,TN is that probability measure which is the "closest" I as measured

by the Kullback-Leibler divergence, to lPn'

We now give a definition of the Kullback-Leibler divergence and introduce

some notation.

Definition 1. De:5.ne the Kullback-Leibler (K-L) divergence between two proba

bility measures Q and P on (IRd,8") to be,

K(Q,P) = {flog ~dQI ifQ ¢: P+00 otherwise.

Notice that K(·,·) is not a metric, in fact it is not even symmetric in its

arguments. However I its value is always non-negative and

K(Q,P) = 0 if and only ifdQ-=1dP

a.e. Q,

If there exists a probability measure Q minimizing K(·,Po) (respectively,

K("IP.)) over PT,. denote that Q by Q(a) (respectively, Q.(a)),

In the constrained estimation problem, the idea of using vc(Q: ,,(a» as an

estimate of vc(Po) has been used extensively in the literature but usually in the

context where Po represents the probabilities in a contingency table and the con

straints on Po are that certain of the marginals are known (see Haberman (1984) for

some appropriate references). Haberman (1984) considers the more general prob

lem which we consider here but focuses on estimating functions of Po of the form

%Jg(Po) = EPo9, where 9 is some Euclidean-valued, Borel-measurable function. In

his paper Haberman does not consider the supplementary sample moments prob

lem nor does he prove efficiency of the estimator for the constrained estimation

problem.

From now on, when we talk about the estimator vc(Qn(a) it should be un

derstood that we are working with the constrained estimation problem. and when

we write Vc(Q,,(TN) we are working with the supplementary sample moments

problem.

In section 2, we list conditions under which these estimators exist and show

that (under certain regularity conditions) they are consistent. Csisza.r(1975) and

Haberman(1984) have done most of the work here.

In section 3, we show that these estimators, when properly normalized, con

verge weakly to limiting Gaussian processes. We also make a comparison between

these estimators and the estimator vc(Pn). It is shown that in each of the two

problems the asymptotic variance of the proposed estimator, at any C E C, is less

than that of the corresponding quantity for the estimator vc(IP n)(C). In section 4

6

'we show that in the constrained estimation problem, vcCQ n(a)) is an asymptot

ically efficient estimator of vc(Po), while in the supplementary sample moments

I problem ve(qn(TN)) is an asymptotically eflicient estimator of ve(P,). Ireland

and Kullback (1968) have shown that in the situation where Po represents the

probabilities in a contingency table, and the constraints are that certain marginal

probabilities are known, then ve(q n(a)) is Best Asymptotically Normal. That we

can show vc(q nCa» is efficient in a more general context is therefore not too sur

prising.

The proof of the results stated in sections 2, 3 and 4 are all contained in

section 5. In section 6 certain technical definitions and theorems which are needed

in the proofs are stated and proved.

§2. Existence and Consistency of the Estimators.

§2.1 Introduction. We first summarize results, proved by Csiszar(1975) about the

existence of Q(a) (as defined in section I).

Define

S(P"oo) = (Q -e; P,: Q E 'P,K(Q,P,) < co].

We need to know under what conditions there exists a probability measure Q(a) E

S(Po, oo) minimizing K(·,Po) over 'PT,IJ"

The following result is due to Csiszar (1975):

(GI) Let

8(P,) = {B E JR.': E p• exp(II'T) < oo}

BT,.,P. = (a E JR.': 'PT,. n S(P"oo) # 0}.

If SCPo) is open, there exists a unique Q(a) in 'PT,4 minimizing K(-,Po) over

'PT,IJ for each inner point a E BT,IJ,Po and its Po-density is of the form

(I) d~~:\,,) = cexp(B(a)'T(x)) for some B(a) E 8(P,).

This is Theorem 3.3 of Csiszar (1975).

7

We note at this point that if a is an element of the relative interior of the

convex hull of the support of POT- 1 (i.e. a E re1iotcosuppPoT-1 ) , then a is an

I inner point of BT,.,P. (see Propositions A.3.2 and A.3.3 of Sheehy (1987». One

consequence of this is that if e(Po) is open and a E relintcosupp POT-I, then

there exists Q(a) minimizing K(·, Po) over 'PT,. and Q(a) has Po·density of the

form given in (1) above.

Therefore, showing that the probability measure Q .(a) (or Q .(T,,) ) ex

ists, with probability one for n sufficiently large, amounts to showing that a .E

re1iotcosuppIP",T-1 (respectively TN E re1intcosuppIP~T-l) with probability one

for n sufficiently large.

In the constrained estimation problem, results are stated both for the case

where EPaT = a and the case where EPoT 'I a, but a E relintcosuppPoT-1 and

there exists Q(a) minimizing K(·, Po) over 'PT,. with Po-density as given in (1).

If EPoT 'I G, then Q(a) is the "clceestv-es measured by the Kullback-Leibler

divergence-probability measure to Po having EPoT = G. It is of interest to study

the behavior of this estimator when Epo T t= a for two reasons. First, because

it is important to know how the estimator behaves when the constraint does not

hold. And second, because sometimes, as in Example 1 above, we may know that

EpoT t= a, but may still be interested in estimating Q(a).

Notice that when Varp,T(= Ep,(T - Ep,T)(T - Ep,T)') is singular, there

exist non-zero vectors u such that u'T = c a.s. POl c a constant. When this is the

case we see that the vector 8(a) defined in (1) will not be unique. Indeed if

(2) H = {u : «r = c.. e.s. Po, for some constant ca}

and 9" = 9(a) +u, u E H then

dQ•• (x) = exp(9:T(x))/Jexp(9: T(x»dPo(x)dPo

= exp(9(a)'T(x»/Jexp(9(a)'T(x»dPo(x) a.s. Po·

8

1t is dear then that B(a) is unique only up to its projection on H.L. From now on

we will define 9(a) to be that projection.

§2.2 Sta.tements of Results. Propositions 1, I' and 2 below are GIivenko-Cantelli

type results for the estimators which we consider here. We need to restrict the

collection of sets C to some pointwise separable, Vapnik-Cervonenkis (Ve) class of

sets (see Definitions 8 and 9). Classes of sets which are pointwise separable, VC

classes include the classes of all closed balls, all ellipsoidal regions, and all convex

hulls of at most k points (k fixed) in ]Rd. Classes of sets which are not VC include

the class of all convex sets in R d, d :::: ~ and the class of all lower layers in lRd

,

d > 2. The pointwise separability condition is needed to ensure the measurability

of the quantities in which we are interested.

Proposition 1. Suppose a E reli.ntcosuppPoT-1 and SCPo) is open, so that there

exists Q(a) minimizing K(·,Po) over Pr,« with Po density as given in (1) above.

SupposeC is a pointwise separable, VC class of sets in s«. Then with probability

one, for n sufficiently large, there exists CQn(a) minimizing K(·JIPn) over 'Pr,a

where

(i)

(ii)

and

(iii)

dQ (a)diP. (r) = c. exp(9.(a)'T(r», 9.(a) E HJ..

9.(a) ~•.e, 9(a) as n ~ co

Ilvc(Q.) - vc(Q(a))llL_(C) ~ •.e. 0 as n ~ co.

Proposition I'. Suppose EPoT = a and 0 E int SCPo). Then. with probability one,

for n sufficiently large, there exists CQ n(a) minimizing K( ·,IPn) over 'Pr,o where

(i) d~~a) (r) = c. exp(9.(a)'T(r», 9.(a) E HJ..

-----------------~------- -

'(ii)

t and

(iii)

On(a) --+a..t. 0 as n --+ 00.

9

Proposition 2. Suppose 0 E int 8(Po). Suppose C is a pointwise separable, VC

class of sets in [511. Then with probability one, for n sufficiently large, there exists

Qn(TN) minimizing K(., lPn ) over 'PT,Tx with

(i)

(ii)

and

as n --+ 00,

(iii)

In Lemmas 1 and 2 and Theorem 1 we show that, under appropriate regularity

conditions, if {an} is a sequence in IR:\ converging (a.s.) to some constant a,

then, with probability one, for n sufficiently large, there exists Q n(an) minimizing

K h IPn) over 'PT,o... and that

From this result we can deduce the preceeding propositions.

Lemma 1. If a E relintcosuppPOT- 1 , {an} C relintcosuppPoT-1 and an --+a..!I. a

th en with probability one, for n sufficiently large, there exists a probability measure

Qn(an) minimizing K( " IPn) over 'PT,4.. an d

(3)

10

~here

S.(a.) e Hi.

Lemma 2. Suppose tbe hypotbeses of Lemma. 1 hold. Suppose Q(a) minimizes

K(·, Po) over PT,. and

(4)

with

Then

d~~:) (.) = cexp(S(a)'T(.»)

Sea) e Hi n int 8(Po).

8,.,(an.) -+1.-9. 8(n) as n -+ 00

with S.(a.) as defined in (3) above.

Theorem 1. Suppose tbe bypotheses of Lemma 2 hold. Suppose C is a pointwise

separable, VC class of sets in s«. Then

IIvc(q.(a.) - vc(Q(a))IIL~(c) ~a.s. ° as n ~ 00.

§3. Weak Convergence of the Estimators.

§3.1 Statements of Main Results. In this section we examine the asymptotic be

havior ofthe estimators vc(q.(a» and vc(q .(TN).

We will need the following notation. For any function f E L2(PO) write

(1) "JE..(f) =JId( y'n(IP. - Po).

IT F is a collection of functions in L2(PO) then we can regard ::«.,., as an element

of Loo(F) and we call X{n the empirical process on F. If j, 9 E :F then

("JE..(f),"JE..(g»T '* N(O,r(f,g»)

11

where

su.» = (VUP.! Covp,(J,g)).Covp,(J,g) Vup,g

In Theorem 2 we show that the estimator vc(q,,(a)) has influence function,

¢e(-, Po)(-) where ¢e(C, Po)(-) E L,(Po) for any C E C, and ¢e(-, pole,) E L~(C).

In Theorem 3 we prove an analogous result for the estimator vc(q ,,(TN)).

Notation. If E is a b x b matrix, define E- to be the Moore-Penrose generalized

inverse of E.

Theorem 2. Let C be a pointwise separable, VC class of sets in s«. Suppose 8(Po)

is open and a E relintcosuppPT-l. Suppose in addition that

(i) JIITII' exp(20(a)'T)dPo < 00 and Jexp(20(a)'T)dPo < 00

where Ora) was defined in (2.1) and Ora) E s» n int 6(Po). Tben

Z1: n(C) =y'n(vc(qn(a)) - ve(Q(a)))(C)

J - dQ(a)= [(Ie - Q(a)(C)) - CovQ,,)(le, T)( VarQ(,)T) (T - a)] dPo

d}l{n + opel)

where tbe opel) term is uniformly small over C.

Theorem 2/, Let C be a pointwise separable, VC class of sets in 13 d . Suppose

o E int6(Po) and Ep.T = a. Tben

Z1:n(C) =y'n(ve(qn(a)) - ve(Po))(C)

= J[(Ie - poeC)) - (Covp,(le, T)(Varp,T)-(T - a)ld}l{n + opel)

where tbe opel) term is unifonnly small over C.

For any function f E L,(Po) wri te

)X~(J) =Jfd(vm(JP~ - Po)).

12

Theorem 3. Let C be a pointwise separable, V C class of sets in Btl. Suppose

o E inte(p~) IUJd ::.. --+ ..\ 81] n - 00. Then

1lN(C) = vn(lIC«~,,(TN)) -lIc(Po»)(C )

= J ..;>. (CovPo( l c ,T)( VarpDT)-(T - a)~:n1+,\

J 1 -+ [(IC - Po(C» - 1 +,\ (CovPll(lc,T)(VarpoT ) (T - o.)~n

+ op(1)

where the op(l) term is uniformly small over C.

Corollaries 1 and 2 below are weak convergence results for the processes 1l n

and 1lN defined in Theorems 2 and 3 above. We use the definition of weak

convergence used in Pollard(1984). Before their statements we need the following

definitions.

Define for any A, B E Btl, and any probability measure Q E 'P for which

EqlITlr.il < 00

and

Definition 2. For any collection cd hsuctious F E L 2(PO) define BF,PQ to be the

smallest a-iield on Loo(Jl wmcb

(i) makes all fullte-dimensional projections measurable and

(ii) conieias all dosed ba11s with centers in C(F, Po), the collection of all uni

formly contiDuous functions in Loo(.F) (w.r.t. the L2(PO) norm).

Corollary 1. Suppose tbe bypotheses of Theorem 2 hold. Then

13

'as random elements of (Loo(C), Bc,pg ) , wbete 'llol(') is a mean-zero, tight, Gaus

sian random element of Loo(C) witb sample paths in C(C, Po), and covariance

I function r Q(ca)("')'

Corollary I'. Suppose tbe hypotbeses of Tbeorem 2' hold. Tben

as random elements of (L(XI(C), Bc,Po), wbere 'll02(') is a mean-zero, tigbt, Gaus

sian random element of L'OO(C) witb sample paths in C(C, Po) and covariance

function rpoh .).

CorollarY 2. Suppose the bypotheses of Theorem 3 bold. Then

as random elements ofLOQ(C), Bc,Po), where 7l 0 3 ( . ) is a mean-zero, tight, Gaussian

random element of Loo(C) with sample paths in C(C, Po), and covariance function

r$o("')'

Remark. At this point it is instructive to think. about the behavior of the al

ternative estimator lIc(IPn) in each of the estimation problems, By well known

results (see e.g. Pollard (1984» if C is a pointwise-seperable, VC class of sets and

Po satisfies the assumptions of Corollary 1 or 2, then

(4)

and

(5)

Ilvc(lPn ) - VdPO)IlL",,(C) -a..II. 0 as n - 00.

-

where.1Y is a mean-zero, tight, Gaussian process in C(C, Po) having covariance

function, t :C xC _1Rl, given by

r(A,B) = Po(A n B) ~ Po(A)Po(B) for A, B E C.

14

Notice that fOT any A E C,

,(6)

In fact

f(A,A) = Var(~(A)) ~ Var(Zlo,(A)) = rp,(A,A).

f(A,A) - rp,(A,A) = Covp,(lA, T)(Varp,T)-Covp,(lA, T).

U VarpoT is non-singular this difference will be strictly positive whenever

Covp.(l A, T) # o. What (4), (5) and (6) are saying is that when Po satisfies the

assumptions of Corollary 1', vc(UJ II) is, like vc( Q tiCa)), a consistent estimator of

vc(Po). However the variance o£the limiting distribution of y'n(vc(lP.) - vc(Po))(A),

for any set A E C, is greater than the variance of the limiting distribution of

y'n(vc«~.(a))- vc(Po))(A). If Po If. "PT,., then vc(Q .(a)) and vc(lP.) estimate

different measures, and comparison of the two estimators is more difficult.

Also notice that

(7) f(A, A) - r},,(A, A) = -2,COV(lA, Tl'(Varp,T)-Covp.(lA, T).1+~

From (7) we see that by taking the information about the supplementary moments

into account and using vc(Q N(TN)) as an estimator of vc(Po) rather than VC(n'n)

we reduce the asymptotic variance of the estimator of vc(Po)(A) =Po(A) by the

quantity on the right hand side of (7).

In section 4 we will show that asymptotically, in a certain sense , vc(Q,.,(T.v))

and vc(Q RCa)) are the "best" estimators for these two situations.

As one would expect

lim r~ (A, B) = rp,(A, B).>........0 0

That is to say that if ~ -+ 0, so that we have "infinitely" more information from

the Yj's than we do from the Xi'S, asymptotically, it is as if we knew EPoT. U

however ~ -+ 00, then ). = 00 and

• >... -lim r p (A, B) = Covp.(lA, 1B) = f(A., A)>.. ......00 0

15

so that, asymptotically, it is as if we had no extra information from the supple

mentary sample moments.

§4. A Convolution Theorem for Estimators of lie(Po).

§4.1 Introduction. In this section we use results of Van der Vaart (1987) and

Bickel, Klaassen, Ritov and Wellner (1989) to derive a. convolution theorem for

"regular estimators" of lIc(Fo) in the constrained estimation problem and in the

supplementary sample moments experiment. We use the Hellinger metric to define

distances between probability measures. For any 9 and h in L2(PO) let

(g, hlp. = j ghdPo and I/gll;'. = j g'dPo.

If z E lRd, Ilz11d denotes the usual Euclidean norm.

We consider the experiment where we have observed data X n, a random

element of the measure space (An, On) where X" = F(Zl, ... , ZN) for some

Euclidean-valued, Borel measurable function F and ZI,'" I ZN E IRd are Li.d.

Po and N = N(n) --+ 00. On the basis of X n we want to estimate some fun

tiona! of Po where we know Po E Po where Po is some collection of probability

measures on (IRd, Bd ) . We call Po the "model". In the constrained estimation

problem X n = (X1, ... X n) where the Xi's E IRd are i.i.d. Po, An = (IRd)n ,

Un = (Bd)n, N :;; n and Po =. PT ,4. In the supplementary sample moments

problem Xn := (X1",.Xnl J;nL:::IT(~)), whereX1,.,.,Xn,Y1, ... ,Ym E IRd

are i.r.d. Po, A. =(JR")" x JR', fl. =(5")' x 5', N;: n+m and 'Po ='P.

Po =P , the collection of all probability measures on (IRd, ad).

We begin with a heuristic explanation of how the results in this section are

derived. The notion of a tangent space to the model Po at Po is central to this

discussion.

Definition 3.. A subspace T(Po) of L 2(PO) is called a ta.ngent 3p~ce at Po E Po

if for aU h E T(Po) there exists (P,} C 'Po with

(1) jlC'«dP,)'/' - (dPO)'/') - ih(dPo)'/'I' ~ 0 as t ~ O.

16

Formula (1) is interpreted as follows. Let Pc and Po have densities Pu and Pt

'respectively with respect to an arbitrary 17-finite measure JJc domina.ting P, + Po.

Then we have

Basically, T(Po} being a tangent space means that it is a subspace in L2(PO) made

up of a collection of "directions" in which we can approach Po in the model Po.

We will use the notation Tm;(Po) to denote a maximal tangent space in the model

Po at Po. The tangent space Tm(Po) is maximal if its closure, Tm;(Po) (in the

L 2(PO) norm) contains any other tangent space T(Po).

Suppose v(Po) E mI. We say that v(·) is differentiable with respect to Tm(Po)

at Po with derivative u(Po) E Tm(Po), if given any h E Tm(Po), there exists a

sequence {PrJ C Po with

and

(2)

Notice tha.t u(Po) is defined uniquely only up as far as elements orthogonal to

T m (Po). We can think of (v(Po),h)Po as being the rate of change of the parameter

11 at Po in the direction h. If 5 is some closed subspace of L 2 (Po) let II(v(Po)J5)

denote the projection of the quantity v(Po) onto S. If we observe Xl,' .. ,Xn i.i.d,

Po E Po and iJn is a "regular estimator" of the parameter II at Po (this concept

will be defined later) where 11(·) is differentiable with respect to T m (Po) at Po (for

some Tm(Po», and -/n(vn - v(Po» has limit law .cPo(Z), it can be shown that

17

where

Zo - N(O, lIII(v(Po) I T~(Po»lI')I

and

W is independent of Zoo

Roughly then, the greater the rate of change of the parameter 1I in the neighbor

hood of Po, the larger the variance of the random variable Zo, as one would expect.

The proof of this result goes along the following lines: Suppose Trn(Po} is

itself a tangent space. Suppose also that for every h E Trn(Po), we can find a

sequence {P..} satisfying (1) and (2). We regard 'P•• = {P.. } as a one-dimensional

submodel of 'PT, .. > Using the Hajek-Le Cam Convolution Theorem (Hajek, 1970),

we can then show that if vn is a "regular estimator with respect to T m(Po)" of

v(Po) in the model 'P.. , with limit law J:p,(Z), then

J:p,(Z) = J:p,(Z" + W)

where

Z" - N(O, (v(Po), h)j" (h, h),,;)

and W is independent of ZOA. Let S" denote the subspace spanned by the function

h. Since the variance of Z" is just lIII(v(Po)!S.)II' it is easy to see that this

quantity is maximized over T~(Po)when h = ho =II(v(Po)IT",(Po». In this sense

'Ptho is the least-favorable sub-model in 'PT,a for the problem of the estimation of

lI(Po), If lin is a "regular estimator with respect to Tm(Po)" of the parameter 1I

at Pa in the model 'Po with limit law .cPD(Z). it must also be a "regular estimator

with respect to S1&o" of the parameter 1I at Po in the least- favorable model 'Pth o ·

By the Hajek-Le Cam Convolution Theorem we must have

J:p,(Z) = J:p,(Z", + W)

where

ZOh, - N(O, lIII(v(Po)IT~(Po»II')

=

18

and W is independent of ZOllo' We can in fact reach the same conclusion without

assuming that Tm(Po) is itself a tangent space.

In this section, we consider estimation of parameters in LQCI(C). The ideas are

essentially the same as those for the real line but the technicalities are more com

plicated. The reader is referred to Van der Vaart (1988) for a thorough exposition

of these ideas.

While this theory is exactly what is needed for the constrained estimation

problem where we observe Xl,." 1 X n i.i.d Po E 'PT,,,,, the supplementary sample

moments problem. does not fit into this framework since there the data is not

i.i.d.. However the basic idea is the same. \Ve begin by showing local asymptotic

normality of the experiment in certain one-dimensional sub-models. Having done

this it is an easy matter to get a convolution theorem for regular estimators in any

one of these sub-models. From this we can then derive a convolution theorem for

regular estimators in the whole model.

§4.2 Definitions. As a consequence of the fact that there is no "natural" choice

for either the space where vc takes its values, or, having chosen that space, for a. a

field defined on that space, there is also no "natural" definition of what constitutes

an estimator sequence for vc. We present one possible definition for an estimator

sequence here. In this definition we fix on a particular choice of range space for

vc and a particular choice of c-field for that space. Not surprisingly, we make this

choice to coincide with the choice made in Theorems 2 and 3 of section 3. \Ve

will then state a convolution theorem for sequences of "estimators" of vc having

certain (to be defined) regularity properties.

Note. We assume throughout the remainder of section 4 that C is a pointwise

seperable VC class of sets in 5' and Ep,IITIII < 00.

Definition 4. Suppose we observe X" E (An, n l'1.). A sequence {vn(X")} ={vn} C Loo(C) is said to be an (L oo(C),Bc,po)-e.st im ato7' .sequence of vc(Po) if V n

is (A.,n.) I (L",,(C),5c ,p, ) measurable.

19

Definition S. An (Loo(C),Bc,PrJ-estimatorsequenl'e, {!In}, ofvc(Po) in the model

'Po is regular if for some maximal tangent space, T m (Po) I to tbe model 'Po at Po,

I corresponding to ea.ch h E Tm(Po} tbere exists a sequence {Pn} C 'PT,G so that

(i) j[.;Ti(dPn)l/'2 - (dPo)1/2) - ~h(dPo)1/'2]2 -t 0

(ii) n.jj'l(vc(Pn ) - vc(Po»- {I{.), hHILoo(c)-O

and

(iii)

f'&

as random elements of(Loo(C), Bt:,Po ) where II is a tigbt random element of Loo(C)

witb sample patbs in C(C, Po).

In what follows, it will sometimes be necessary to spedfy the tangent space

T m(PO) in Definition 5. We will do this by saying that {vn} is regular with respect

to the tangent space T m(Po).

§4.3 Convolution Theorems for Estima.tors in the Constrained Eatimation Problem >We are now ready to state convolutions theorems for regular estimator sequences

of vc(Po} in the model 'fIT,G'

Theorem 4. Suppose we observe Xl, ... IX,. i.i.d. Po E PT,Oo' Let {lin} be an

(Loo(C), Bc,po)-estimator sequence of vc(Po) in tbe model PT.... Suppose {Vn}

is regular with. limit law .c(7l). Tben 7l d 7l0'2 + W wbere 7lo2 is as de6ned in

tbe sta.tement of Corollary I' and \.V is a C(C,Po)-vaJued tight random element

of Loo(C), independent of71 02 •

The above theorems lead to the following definition for an efficient eatim.a.tor of

vc(Po) in the model 'PT.Oo'

Definition 6. Suppose we observe Xl, ... ,X,. Ll.d. Po E PT,,,. Let {V,.} be an

(Loo(C), Bc,po)-estimator sequence ofvc(Po) in tbe model 'PT,G' Suppose {vn } is

regular witb limit law ..c(ll). Hd

7l = 7lo2

20

where 710 2 is as deiined in Coronary 11, we then say that {lin} is an (Loo(C), Bc,Po)efficient e.stimator of lIc(Po).

'Corollary 3. Let C be a pointwise separable,. V C class of sets in Bd. Suppose

o E int e(Po). Then v(q.(a)) is an (L~(C),Bc,po)-e1ficientestimator of vc(Po)

.in the model 'PT,G'

Lemma 3 below is needed for the proof of Theorem 4.

Lemma 3. Define

(3) TO(Po) = (h E L,(Po) : [h[ < K, some K > 0, Ep,h = 0, Ep,hT = O}.

Then TO(Po) is a maximal tangent space at Po E PT,•.

Note that (ii) of Definition 5 implies that for any C E C, vc(Po)(C) is differen

tiable with respect to some maximal tangent space, Tm(Po), and vc(Po)(C)= Ie.

Since Ta(Po) is maximal we must have

Tm(Po) = (h E L,(po): Ep,h = O,Ep,hT = O}

for any maximal tangent space TmCPo). A simple calculation shows that for any

A,B € C, any a == (a1,a2) E IR.2

a'rp,(A,B)a = [[II(a,vc(Po)(A.) + a,vc(po)(B)ITm(Po))II'

with rp,(',') as defined in (3.2) and vc(Po)(-) as defined above. With this obser

vation Theorem 4 follows from Theorem 5.3.1 of Bickel et al. (1989).

§4.4 Convolution Theorems for Estimators in the SupplementarY Sample Momente-.

Experiment;. The results stated here are clumsier than those stated in §4.3 for the

constrained estimation problem. Specifically, in order to get convolution theorems,

we must make some hypotheses about the distribution of the random variable

Jm. 2::'1 T(}i) == .;rnTm . We need to do this in order to get a Local Asymptotic

Normality (LAN) result in one-dimensional sub-models. In Theorem 5 below we

21

hypothesize such an LAN result. In Lemma. 4 we specify certain conditions on the

pdf of POT- 1 under which such an LAN result is true. Corollary 4 is a. restate

'ment of Theorem 5 with these conditions inserted in place of the LAN hypothesis.

These conditions are, DO doubt, much stronger than what are needed to get such

results, see Le Cam and Yang(1988) for weaker (though harder to interpret) hy

potheses.

As before, let 'P be the collection of all probabili ty measures on (lRII, B II).

Theorem 5. Suppose we observe Xl, ... ,Xn E )Rd, Jm 2::: 1 T(Yi) E JRb wbere

Xl, ... , X,,, Yi, ... ,Ym are U.d. Po E 'P, ~ ....... >. as n 1\ m ....... 00, T is some Borel

measurable function. Let T m.(Po) be some maximal tangent space to the model

'P at Po and {vn } be some (LOll (C), Be ,Po )-estimator seq uence of vc(Po) with limit

law .c(71) concentrated 0.0 C(C,Po). Suppose that for every h E Tm(Po) there

exists a sequence {P.1&} c 'P for which

(i) (1) and (2) hold

(i i) {vn } is a regular (L 00 ( C),Be, Po )-estimator sequenee in the model {Pt 1& }, and

(iii) if n: denotes tbe pdf of the random variable (XI, ... , X n I );:: 2:::1 T(Yj ))

wbereX1.···,Xn ,y l , ...• ym, are Li.d. P. 1& and f~m. denotes tb.epdfofthe~.

same random variable where Xl, ... ,Xn, Y1 , ... , Ym are U.d. Po,

where Rn. - 0 in Pa-probability.

Then

d?L = ?L03 + \V

(4)

22

where llo3 is a mean-zero, tight, Gaussian process with sample paths in C(C,Po)

and covariance function r~,(-,·) as defined in (3.3) and W is a tight C(C,Po)

I valued random element of(L~(C),Bc.p.), independent of'll".

Before we state Corollary 4 we need to introduce some notation. Let

T;"(Po) = (h E L,(Po) : Ih[ < K, some K > 0 and Ep,h = O}.

It is an easy matter to show that T~(Po) is a maximal tangent space for the model

"P (see, fur example. the proof of Lemma 3). For any h E T;"(Po) define

dP.."P. = (p•• E"P: dP

o= 1 + Oh, for all 0 such that IOhl < l}.

Corollary 4. Suppose we observe Xl, ... I X .. E IRd, Jm L:~1 T(Yi) E IRb where

X11 ... ,X", 1'1, ... ,Ym are U.d. Po E 'P, ; -+ A as nAm -+ 00 and T is

some Borel measurable function. Suppose that POT- 1 has a bounded density

with respect to Lebesgue measure and EpollTlll < 00. Suppose that {un} is an

(L~(C),Bc.p,)-regular estimator sequence ofvc(Po) in the model "Ph with limit

law.c('ll) concentrated on C(C,Po). for each h E T;"(Po). Then

d'll='llOJ+W

where 710 3 is a mean-aero, tight, Gaussia.a process with sample paths in C(C, Po)

and covariance function r~o('l') as deBDed .in (3.3) and W is a tight random

element of (L~(C), Bc,p,), independent of'll".

Definition 7. Suppose we observe XI, ... , X n E IRd, J;;:. E;:l T(~) E IRb where

Xl,· .. ,Xn , YI , ... ,Yrn. aTe i.i.d. Po E 'P, ::a. -+ A as nA m -+ 00, T Js some Borel

measurable function. Let T m(PO) be some maximal tangent space to the model P

at Po. If {v.. } is an (LCX)(C), BC,PQ

)-estimator sequence for which corresponding to

each h E Tm(Po), there exists a sequence {Po} C"P for wb.ich (i), (ii) and (iii) of

De.finition 5 hold and also (iii) of Theorem 5 holds with () = 1 and P t = p'l. If7-

23

.{Vn } bas limit law £(ll) concentrated on C(C, Po) we say that {vol is an efficient

estimator ifd

II = 1l03 .

CorollarY 5. Suppose we observe XII'" I x, E IRd, );;:. L:::1T(Yj) E JR." wbere

XI, .. . ,X'l ' YI , · · · , Ym. are i.i.d. Po E P, ::. -. >. as n 1\ m -. 00 and T is

some Borel measurable function. Suppose tbat Po T-l bas a bounded density witb

respect to Lebesgue measure and 0 E int e(Po). Then vc(C~ N(TN» is an efficient

estimator of vc (Po).

To prove Corollary 5 it is enough to show that the conditions of Definition 7 hold

with Trn(Po) = T~(Po) and {P'l } := {P4- h} where P 1 h was defined in (4).~n' ~1

With this choice of T m(Po) and {Pn}, showing that (i) and (ii) of Defini tion 5

hold is easy. To show that (iii) of Definition 5 holds, see example 1 of Sheehy

and Wellner(1988). Lemma 4 implies that (iii) of Theorem 5 holds and this will

complete the proof.

Corollary 4 follows from Theorem 5 and Lemma 4 in which a Local Asymp

totic Nonnality (LAN) result for the supplementary sample moments experiment

is stated. Lemma 5--in which we get a convolution theorem for sequences of esti

mators of real-valued parameters in one-dimensional sub-models of P-is needed

in the proof of Theorem 5.

Lemma 4. Suppose PoT-1 bas a bounded density with respect to Lebesgue mea

sure and EPa IITlIf < oo, For a fixed h E T:"(Po) and some 8 E lRl Jet P; = P, h~,

where P, h was defined in (4). Let Inm.. denote the pdf of the random variable~.

(Xl!"" X n , Jm. L:::l T(Yj» wbere XI, ... ,Xn , Yl> , Ym. are i.i.d. Pta, and f~m.

denote the pdfof the same random variable where Xl, , X n, Y1 , ••• , Ym are i.i.d.

Po. Tben,

24

wbere R n ~ 0 in Po-probability,

Lemma. 5. Let {Vn} be a. Borel-tnessursble estimetOT sequence of JJ(Po) E JR.J with

limit law £(Z). Suppose h E T(Po), where T(Po} is some tangent space to the

model 'Po at Po. Suppose also tbat for tbis h there exists a sequence {Pn} C 'PT,o.

so that

(i)

(ii)

(iii)

and

(iv)

bold.

Then,

where

(iii) of Theorem 5 holds with (} =l,P thE P«~'

and Wh is independent of Zoh·

25

§5 Proofs.

§5.1 Proofs of Results in Section 2.

Proof of Lemma 1. The proof given here is an adaptation of the proof of theorem 1

of Haberman(1984). Write K =cosuppPoT-l and lKn == cosuppIPnT-l. The

result given in (Cl) above implies that to prove existence of Qn(an) with IPn

density as given in (2.3), it is enough to show

(a) an E relintlKn , with probability one, for all n sufficiently large

and

(b)

where Pon denotes the n-fold product measure.

We first show (a) holds. Since a E reliatK, for some I!! > 0, BCa,l!!) n K C

relintK, where B(a,f), denotes the ball centered at a of radius I!! in IRd• This in

turn is true if and only if, for some points t}l 1 < j ::5 c , for any ao E BCa, I!!) n K ,

ao = L:j~l Cljtjl where L:i=l Qj = 1, Qj > 0, 1 < j < c, and for every open

neighborhood U of a tj, 1 $ j $ c, poT-leU) > o. For some open Uj , 1 < j s c,

one thus has POT-ICUj ) > 0, where Uj = Uj n K with ao in the relative interior

of the convex hull of {3j : j = 1, ... ! c} for any {Sj : j = 1, ... , c}, with 5j E ii;1 < j ::5 c and every Go E B(a, E) n K. Now use the fact that for any j = 1, ... I c,

the probability that no T(X.-) E Uj , i = 1, ... , n, infinitely often, is zero and the

fact that an -a.t. a to conclude that (a) is true.

We now show that (b) holds. Define

A. = {x : u'T(x) = u' EPoT}.

By definition of H, ifu E H then Po(A.,) = 1. Let {Un}n~l be a dense subset of H

and define A = n~=l Au". If x E A, then u'T(x) = u' EPoT for all u E H. Clearly,

26

·Po(.4) = 1. To show that when (0) is true, we can assume with probability one

8n(a,, ) E HJ., it is then enough to show that

(e)n

p;(nrX, E AD ~ 1.i=l

However (c) is clearly true and we are done. •

Proof of Lemma 2. This lemma. can be proved using an argument similar to that

used in the proof of Theorem 3 of Haberman(1984). I

Proof of Theorem 1. For ease of notation write Q" = ~n(an), Q = Q(a). 8n =9n(an) and 9" 9(a). For any C E C,

(a) (Qn-Q)(C)

= Lleexp(9~T)dJPn _ f1eexp(9'T)dPof exp(li'nT)dJP n f exp(9'T)dPo

1 [J( fIe exp(9'T)dPO)(, )]= f exp(li'nT)dJPn Ie - f exp(9'T)dJPn

exp(9nT) - exp(9'T) dIPn

1 [J( f eXP(9'nT)dIPn) , ]+ f exp(9~T)dIPn Ie - Q(C) f exp(Ii'T)dJPn exp(9 T)d(lP n - Po) .

Lemma 7 (in section 6) shows that if

(b)

then

(c)

:F, = { l e eS'T , e9'T : 9 E B,(O), C E C}

:F& is a VC graph class of functions.

-

(See Definition 10 in section 6.) Clearly, if C is pointwise separable, then :FfJ is

also. We use (e) and Theorems II.24 and 11.25 of Pollard(1984) and the fact that

under the conditions of the theorem 8n _ (j a,s. as n -t co to deduce that

sup IJleexp(9~T)dJPn-Jleexp(9'T)dPol=0(1) a,s.CURd

27

cind

sup IJ Ie exp(9'T)d1P" - JIe exp(ttT)dPol = 0(1)CuR",

We can then rewrite (a) in the following way,

a.s,

(q .. - Q)(C)

= JeXp(O'T)ldPo+ 0(1) J(Ie - Q(C) + o(l»(exp(O~T) - exp(9'T»dIPn

+ JeXP(9'T~dPo + 0(1) J(Ie - Q(C) + 0(1» exp(O'T)d(lP n - Po)

=0(1) a.s. uniformly in C E C. •


We begin by finding the asymptotic distribution of 8..(a) and 8..(TN)

Proposition 3. Suppose a E relintcosuppPoT-l and e(Po) is open. Suppose also

tbat

EPa lIT1I2 exp(20(a)'T) < 00.

Then

where bn E H.

Proposition 3'. Suppose EPoT = a and 0 E int0CPo). Tben

(i)

where b" E H.

Proposition 4. Suppose 0 E int8(Po), ::.. ----.. A 2: 0 and mAn ----.. 00. Then

y'ii8R(TN) = (Varp, T)- {J IV;>. J(T - a )tf%.'... - 1~ >. J(T - a )tf%.R}

+ opel) - bn

28

. where b", E H.

, Peoof of Proposition 3. Foe simplicity of notation we write

Bn(a) =B.. B(a) =B, qn(a) =q.. Q(a) =Q.

By definition 01 fJ", and 8, on a set of probability one, for n sufficiently large

(a)0= y'n (JTexp(B~T)dlPn_ JTexp(B'T)dPo)

J exp(B~T)dlPn J exp(O'T)dPo

1 [J r=, '( JTexp(fi'T)dPo) ]= J exp(B~T)dlPn V n(exp(OnT) - exp(B T») T - J exp(O'T)dPn dlPn

1 [J ( J exp(O~T)dlPn) , ]+ J exp(O~T)dIP. T - a J exp(O'T)dPo exp(O T)dX!:•.Use a Taylor series expansion to write

(b)

y'n(exp(O~T)-exp«(j'T») = y'n(On-B)'T exp(O'T)+~y'n(On-B)'TT'(O.-B) exp(O~T) :

where 118' - 011 < liOn - BII· Use the fact that 0 E interpol and Bn ~•.•. B as

n --+ 00 to get

(e)

and

Jexp(O~T)dIP. = Jexp(B'T)dPo + opel)

(d) Jexp(O'T)dlP n =Jexp(B'T)dPo+ opel).

Putting (a), (b), (e) and (d) together we get

(e)

o = (J exp(~'T)dPo + OP(l)) [j T(T - a + opel))' exp(B'T)dPo+ op(l)] y'n(On - 0)

+ (J exP(~'T)dPo+Opel)) y'n(Bn - O)op(l)

+ [jeT - a):~ dX!:n] + opel)

= [VarQT+ op(l)]y'n(On - 0) + y'n(Bn - O)op(l)

+ [j (T - a):~ dX!:n] + opel).

29

\Ve see from (e) that it must be true that

I (1) lIvn(8" - 9)11 = Opel).

and so

J dQVarQr' ,fii(9,. - 0) :::; - (T - a) dP

od;«,. + opel)

or

(g) ';;'(1),. -I) = - J(VarqT)-(T - a) :~ rJX{n + opel) - b"

where the bn term is the projection of the first two terms on the right hand side of

(g) onto H. We include it to ensure that the quantity y'1i(On - 00 ) E H.L. •

Proof of Proposition 3'. The proof is very similar to that of Propositon 1 and so

will be omitted. •

Proof of Proposition 4. The proof is very similar to that of Proposition 1. We

give an outline here. For simplicity of notation we write

By definition of 9n , on a set of probability one, for n sufficiently large

(a)

(J TeXP(e~T)dfP n J )..;Ti(T.v - a) = vn Jexp(9~T)dJPn - TdPo

= [VarpoT + op(l)]yTi8n + op(1)y'Ti'8n +J(T - a)cC«n + op(l).

The left hand side of (a) can be written aa

(b)

vnCTN - 0) = n JCT - a)dJX~ + m 0JCT - a)tCK:nn+m n+mY;;;'

). J ,;xJ :z )= -- (T - a)dXn +--, (T - a)dhX.m. + op(l .1+>' 1+",

=Opel).

30

From (a) and (b) we see that

and so

So we can write

(e)

.,fii8. : <Varp.T)- {IV:>' /<T - a)<iJK;' - 1~ >. f<T - a)<iJK. } + oP(l) - b.

where bn is the projection onto H of the first two terms on the left hand side of

(c). We include this term to ensure ..;riB" E s-. •Eroof of Theorem 2. For simplicity of notation we write Qn (a) =Q n I Q(a) =QI

8"(a) =Bn and B(a) == 8. Let C E C. Then on a. set of probability one, for n

sufficiently large,

(a).;;i(q n(G) - Q(e))

1 [J ( fIe exp(9'T)dPo) I]= Jexp(9~T)dIPn Ie - f exp(9'T)dIP .. yti(exp(dn T) - exp(8 T))dIP"

1 [J ( Je"(PC9~TdlPn) I ]+ Jexp(9~T)dlP" Ie - QCC) Jexp(8'T)dJPn

exp(8 T)cDKn .

Arguing as in the proof of Proposition 1 yields

(b)

where the opel) terms are uniformly small in C. We now show that the third term

on the right hand side of (b) is Opel). Define

dQ:F;: {(Ie - Qce)) dP

o: C E C}.

-

31

. Theorem 10 of Pollard (1982) shows that :F is a sparse class of functions (see

Definition II in §6). Corollary 13 of Pollard (1982) implies then that :F Is totally

bounded in L,(Po). It is easy to see that C being pointwise seperable implies :F is

pointwise seperable. In the proof of Theorem VII.21 of Pollard (1984) it is shown

that for any f > 0, we can find a finite collection of open balls in Lo::,(.1'"), each

with finite radius, centered at points in C(.1",Po), such that the probability that

}Xn is in the union of those balls, call it B .. , is greater than 1 - E. Since:F is

totally bounded, each element of C(.r, Po) must be bounded. So for some M> 0

we must have that c.p E B~ implies lIc.pllL..,cF) < M, From this we may conclude

that

j dQopel) (Ie - Q(G» dP

odJIl:. = opel).

Thus, from Proposition 3 and (b), using the fact that b~T = b'nEPoT, a constant,

with probability one, we get

y'n(Q .(G)-Q(G)) = j[(le-Q(G»-(Covq(le, T)'(VarqTnT-a)] :~ dJIl:.+op(l)

where the opel) term is uniformly small over C. •

Proof of Theorem 2/. The proof is exactly analogous to that of Theorem 1 and so

is omitted. •

Proof of Theorem 3. For simplicity of notation we write Q..(TN) =Q fl' f),,(TN) =

9". As in the proof of Theorem 2 we can show that on a set of probability one,

for n sufficiently large, for any C Eel

y'n(Q .(G) - Po(G» = Covp,(le, Tl' y'n9. + j(le - Q(G»dJIl:. + opel)

where the opel) term is uniformly small in C. Using the fact tha.t b~T = b~a, a

-

32

constant a.s. Po, we get from Proposition 4,

where the op( 1) term is unifonnly small in C. I

Proof of Corollary 1. Write Q(a) =Q. Define

_ dQF = {[(Ie - Q(a)(C» - Covq(1c, T)(VarqT) (T - a)]-d : C E C}

Po

anddQIe C) = 1e - Q(C» - CovQ(1ct T)(VarqT)-(T - a)dP'

Theorem 10 of Pollard (1982) implies F is a sparse collection of functions

under the hypotheses of the Corollary. It is easy then to see tha.t :F,C and I(·, C)

sa.tisfy the conditions of Lemma 6 (in §6) and so we can conclude that

as random elements of (Loo(C), Bc,Po), where ~(.) is as described in the statement.

•The proofs of Corollaries 1/ and 2 are very similar to that of Corollary 1 and

so are not included here.


Proof of Lemma 3. Obviously T·(Po) is a subspace of L2(PO) . ¥le first show that

T-(Po} contains every tangent space. That is, if T(Po) is a tangent space then

T(Po) C T·(Po) = {h E L 2(PO) : f hdPo = 0, JhTdPlj = OJ.

33

Suppose h E T(Po). Then for some {P,} C Pr,1J. we must have

We show that

J hdPo = 0 and J ThdPo = O.

WritedPt dPo

[« = - and ft = -d~t d~t

where ~t is some measure dominating Pt. and Po, and

A = {x : ft (~) > O}.

Then

1- J hdPol = IJ[t-1(fu - ft.) - hft]d~tl

'" IJC'I"lA,d!',1 + If I{I' (c. (~,;, -I: I') - hI:I') lAd!',)1

s IJC' I"lA' d!'ll + IJI; I' [t-' U';, -I:I') - ~hI;I'] lAd!" I

+ IJ ft1/2(t-l(ft1/

2 - f;/2) - ~hftl/2)lAd~tl

s IJ r 1fttlAed~tl + IJ ft1/

2[t- 1(ft.1/

2- f t

1/2)

- ~hftl/211Ad~tl

+ IJ 4h f i /2( f t

1/2

- ftl/2)d~t.1

+ IJ f;/2(t-l(fe1/

2 - ft1 /2) - ~hf:/2)IAd~t.l

= 0(1)

by (1) and repeated application of the Cauchy-Schwarz Inequality.

We DOW show fThdPa = 0 must hold. Again choose h E T(Po) and a

sequence {Pt.} C Pr,lJ. as defined in (a). We show that we can find a sequence

{Pt} c PT,a (or which (a) holds (with {Pt} in place of {Pd) and for which

34

f IT!'JdPt s 2 I lTl 2dPo for all t 2: O. Then, using an argument similar to that

used above to show JhdPQ = 0, it is easy to show that f ThdPo = 0 and we will,be done.

Define, for K > 0

AK ::: [z : lh(x)l < K},

and

Notice,

Define {Pt} by

dPt = (1 + th1/t)dPo, t > O.

Then, for t > 0

A straightforward argument can be used to show

To show that T-(Po) is itself a tangent space, we need to show that if h E

T-(PlI ) , there exists {Pt } C 'PT,a so tha.t

j[C1« dPt? /2 - (dPO)1/2) - ~h(dPo)1/'J}2 -+ 0 as t --+ O.

It is easy to show that the sequence {Pt } defined by ~;~ = 1 + th will do and we

are done. •

Proof of Corollarv 3. We need to show regularity of the estimator vc(q ,,(a». The

proof of this can be found in example 1 of Sheehy and Wellner(1988). •

Notation: For any {C1, ..• , C.c} C Loo (C), any '.p E Loo(e), define

35

Proof of Theorem 5. Under the hypotheses of the theorem, Lemma 5 implies that

for each {Cb ... C1J c C, Ie > 1 and any a E IRI:,

where, if r~(Cl"'" Cl') is the Ie x Ie matrix whose ijth element is given by

then

and W h(C1 , ... 1 Cl') is independent of 7lo(GIl'" ,Cl'). To see this, in the state

ment of Lemma 5, write

Define

I:v(Po) == L QiVC(PO )(Ci)

i=1

andl'

Ji(Po) =L a,le;.i=1

Then b« E T m (Po) aad so there exists a sequence {h n tI} C T rn (Po) with

l1hna - h rJ llpo ---+ 0 as n - 00. It can be shown that

(b)

a.nd

(c)

where 7l0J (CI •..• 1 CI:) and \V (CI , ••• ,Cl') are as defined in the statement of the

theorem. From (a), (b) and (c) we can thus conclude that

7l(Cb ... I Cl') .!! '7Z0J (C1 , ... ICl') + W(Cl,"'j Cl')

36

With 7103 and W as defined. in the statement. Since:ll and llo3 exist as C(C,Po)

valued tight random elements of Loo(C), it remains only to identify \V(el l • •• I Gli:)

I as (W(C,), ... ,W(C.» where W is a C(C, Pol-valued tight random element of

Loo (C). Lemma. 8 in section 6 makes such an identification. •

Proof of Lemma 4. We will only give an outline of the proof. To prove the lemma

is true, we break the expression in (i) into two parts, one involving the Ioglikelihood

of the X;'s and the other involving the loglikelihood of );;; 2::, T(Y;). The

first part is straightforward to deal with. For the second part we must show

that the loglikelihood can be approximated by the log of the pdf of a multivariate

normal distribution with mean Ep..T and covariance matrix Varp"T evaluated at

Jm: 2::'1 T(Yi). To show that this can be done we use the fact that under the

hypotheses of the Theorem the pdf of Jm 2::'1T('Yi). under Pn can be uniformly

approximated by the pdf of the multivariate normal distribution with mean Ep..T

and covariance matrix VarpnT. (This can be proved. using an argument similar

to that used to prove Theorem XV.5.2, Feller, Vol.Il). Since ;;k 2::1 T(}i) is

oPo(1) the result follows. •

Proof of Lemma 5. The proof is similar to that of the Hajek-Le Cam Convolution

Theorem, see Theorem 3.1 (Roussas(1972) ). •

§6 Technicalities.

Definition 8. Let F be a collection of functions in Lz(Po). ).Ve say that F is

pointwise separable if there exists a countable subclass F" of F such that {or any

f E :F tbere exists a sequence Un} C r witb fn(~) --. f(~) for all ~ E IRd.

Definition 9. [Vapnik and Cervonenkis(1971») Call a class C of subsets of IRd

a VC class of sets of degree v if each set S of v points in IRd lias {ewer than 211

distinct subsets of the form ens picked out by members of C.

Definition 10. Call a class F of real-valued functions on JRd a VC graph class if

C" = {C/: C/ = {(~,t): 0:5 t:5 f(~)},J E:F}

37

is a VC class of sets in IRd+I.

Definition 11. (Pollard(1982» Let :F be a collection of functions witb envelopet

function F (i.e. III < F, for every f E :F). For each :finite subset 5 of!Rd and

each positive 6, take N FCC, 5, F) to be the smallest value of m for v,'hich there are

functions 'PI, ... ,~m in :F such tbat

mtn L(f(x) - ~;(x)J2 s 62 L F(r)2zES ~es

(or every I in :F. Denne

H(5) = HF(6,:F) = suplogNF(o,S,:F)s

the supremum running over all finite subsets afIRo. If

00

L 2-i H F(2- j ,.1") < 00

j=I

we say tbat .1" is sparse.

Lemma. 6. Let C be a separable (in L2(PO) ) collection of sets in IRd, Let J :

IRdX C - IR1 be a l.Jlliformly continuous function from C into L 2(PO) ) i.e. given

ally f > 0 tbere exists a 6> 0 such tbat lllel -lc~ IIL2(Po) < 0 implies Ilfh C1 )

f(·, C2)IIL~(Po) < e. Deline

F = {fC C) : C E C}.

Suppose F is a sparse, pointwise separable collection of functions wjtb envelope

F E L 2(Po). Define'll.,. E Loo(C) by

where:i«.n was defined in (3.1). Tben)

38

as random elements of (Loo(C),Bc,PD) where '7lo is a tight. Gaussia.c, random

element of L~(C) whose sample path, all belong to C(C, Po).

Proof. In Theorem VII.21 of Pollard(1984) he shows that under the above hy

potheses:'«n ~)Xo as random elements of (Loo(F),BF,Po) where;«1) is a tight,

Gaussian, random element of LOl::lF) whose sample pa.ths all belong: to C(F, Po).

The proof proceeds in the following way:-

He shows that

(i) ::Kft is a random element of (Loo(F),BF,Po) and

(ii) Given any ~ > 0, there exists a compact set, K, of completely regular points

such that for every open set, G, containing K, for which [)Xn E G] is measur

able, lim in! PrpXn E G] > 1 - f. Specifically, K is constructed as an inter

section of sets D t • D1 , ... where D" is a finite union of closed balls of radius

t centered at points ofC(F,Po).

(Note that II' E Loo (.1") is a completely regular point, with respect to a a-field

B, if all open balls centered at <p are contained in B.)

It is not difficult to see that we can adapt the proof of this theorem to show

7l n ~ 'Zlo as random elements of (Loo(C), Bc,Po)· In fact, we need to show that

(i) and (ii) are true with 'Zln in place of ::&{n, Bc,pg in place of BT,Pg and C(C, Po)

in place of C(F, Po).

Showing (ii) is true is easy. Just notice that Ie, C) being uniformly contin

uous over C implies C(F, Po) c C(C, Po). We now show that 'Zln is a random

element of (L~(C),Be,p,).

Because each

1 n

:an(C) =~. 0 fCC) = - 2)f(X"C) - Ep,f(X" C)]n i=l

is a real random variable, the finite dimensional projections create no difficulty. It

remains then to show that for any I.p E C(C, Po )

I (b) sup )71" (C) - If'(C) I = sup I~" 0 Je-, C) - 'P( C) Ic c

39

-

is measurable. However, the fact that :F is pointwise-separable with envelope

function F E L2(PO) and C is separable (in L2(PO) ) implies

where C· is a countable collection of sets in C. Measurability of the quantity in

(b) is thus clear and we are done. I

The following lemma is used in the proof of Theorem 1.

Lemma 7. Suppose C is a. VC class of sets in IRd. Tben

is aVe graph class.

Proof. We need to show tha.t :F is a VC graph class, i.e. we need to show

7 = {(x, t) : 0 ~ t ~ e9' T (: ) l A(x), A E C, (J E Bli((JO)}

is a VC class of sets. A typical set in 7 is of the form

{(x, t) : t > Ollogt ~ (J'T(x), x EA.} U {(x,O) : x E AC}

Let

71 = {A C x {OJ : A E C}

and let

40

Theorem 9,2.6 and (9.1.8) of Dudle)'(1982) implies 1'1 is a. VC class of sets. Define

9 : IRd x R+ J-+ IRb

X lRl by g(x, t) = (T(x), log t).

ThenC.4.,9 = {(x,t): t > 0, logt < O'T(x)} nA0IR+1

= g-1(~9) n A.. 0 IR+1

where

/:19 = {(r,s) E IRd+1; (r,s)' (0,-1) > O}.

Theorem 9.2.2 of Dudley (1982) implies tha.t

is a VC class of sets, since {~9 : °E B6(OO)} is one. Theorem 9.2.6 of Dudley

(1982) implies that

;4 = {A0IR+1: A E C}

is a VC class of sets. We then conclude, again from Theorem 9.2.6 of Dudley(1982),

that

is a VC class of sets and the lemma is proved. •

The proof of the following Lemma is contained in the proof of Theorem 5.3.1,

Bickel et al. (1989). We include it here for completeness.

Lemma 8. Let 7l and 7lo be tigbt random elements of Loo(C) witb sample paths

in C(C , Po), wbere C is a. totally bounded subset in L 2(Po). Suppose for every

{CI, ... ICt } C C, k > 1,

(i)

-

where W(CIt ... 1 C.t) is independent of llo(C J , ..... C.t). Then

d7l == 710 + \V

- , ,

41

where W is a tight, random element of Loo(C) with sample paths in C(C, Po) and

\V is independent of '!Lo.I

Proof, Since 1l and 1lo exist as C(C,Po)-valued random elements of Loo(C), it

reameins only to identify W(C!> ... ,C.) as (W(C,), ... , W(C.)) where W is a

C(C, Po )-valued random elements of Loo(C).

Let k ~ L Since C is totally bounded there exists a finite .}Jc net for C, say

{C"

... , Cm} with m = m(k). Thus for each C E C there is a C· E {C"

... ,Cm}

such that lie - C-IIL J ( Po) < '}A;' Define .6i : C -+ IR1 by

(a) v;(C) = (1 - kliC - C;IIL,(Poj)+, i = I, ... , m

andv,(C)

ti;(G) = "m -(C)LtJ=l vJ

i = l, .. "m.

Each ~ .. is 8. uniformly continous function that vanishes off the ball of radius tabout Ci. For every C E C there is at least one C.. so that Vi(C) > 1/2. Thus the

~i are non-negative uniformly cont.inous functions that sum to 1 everywhere in C.

Now define interpolation functions 1lrn. and ?Lom of 7l and tlo by

(b)

(e)

m

7lm(G):; I: A,(G)71(G,),;=1

m

7lm(G) = I:A;(G)71(G,),.=1

G E C

G E C

and use the random vector W (C1 , • . • , em.) =(WI, ... I Wm.y to define a process

W m by

m

(d) W meG) = I: ti;(G)Wi •

.=1

Denote the laws of these processes on (C(C,Po),Bc,po) by Pm == .c(71m), POrn. =.c(71om) and Rm =crw m).

-

, • 4

42

It follows from (i) and linearity of the interpolated processes that

Now since PWI :::;. l' =£(7l) and POm :::;. Po = .c(71o), given any ~ > 0 there exis t

compact sets K I =KI(e:) and K 2 == K2(~) such that lim inf'.; Pm(Kt} ~ 1- e: and

Iim inf's, POm(K2 ) ?: 1 - e, respectively. Hence, for m 2: some m(e:)

1 - e :5 Pm(KI ) =JRm(K I - x)dPom(x)

:5 r RWI(K1 - ~)dPom(~) + eJK2

or

Thus Rm(Kt - x) ~ 1- 2e far some sequence {x Wl } C K 2 • Hence the compact set

has Rm (K 1 - K 2 ) > 1-2e: for m ~ m(e) and it follows that liminfm Rm (K I - K 2 ) >

1-2e. Hence k; is tight and there exists a weakly convergent subsequence {RWI ,},

Rm • :::;. il. on (C(e, Po), Bc,Po)' Let LeW) = R and the conclusion of the Lemma

follows. •

Acknowledgments. Part of this research was carried out at the University of

Washington in Seattle as part of the requirements for a Ph.D. degree in statistics.

I would like to thank Professor Jon Wellner for his help and encouragement during

the course of this research. Thanks are also due to Aad van der Waart with whom

I had some discussions about the results in section 4.

43

References

I Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1989). Efficientand adaptive" inference in semiparametric models. Forthcoming monograph, JohnsHopkins University Press, Baltimore.

Csiszar, 1. (1975). I-divergence geometry of probability distributions and minimization problems. Annals of Probability 3 146-15·8.

Dudley, R. M. (1984). A course on empirical processes. Ecole d'lhe de Probabilities de St. Flour, Lecture Notes in 1-latb., Springer-Verlag, New York 1097(1984),2-142.

Feller, W. (1971) An Introduction to Probability Theory and its Applications.YoU!, 2nd ed., Wiley, New York.

Haberman, S. J. (1984). Adjustment by minimum discriminant information. Ann.Stat. 12 971-988.

Hajek, J. (1972). A characterization of limiting distributions of regular estimators.Z. Wahr. Verw. Gebeite 14 323-330.

Ireland, C. T. and Kullback, S (1968) Contingency tables with given marginals.Biometrika 55 179-188.

Jagers, P., Oden, A., and Trulsson, L (1985). Post-stratification and ratio estimation: Usages of auxilliary information in survey sampling and opinion polls, Inter.Statist. Rev. 53 221-238.

Le Cam, 1. and Yang, G.L (1988) On the preservation of local asymptotic normality under information loss Ann. Stat. 16.Mosteller, F (1968). Association and estimation in contingency tables, J. Amer.Statist. Assoc. 63 1-28.

Parthasarathy, K. R (1967). Probability measures on metric spaces. AcademicPress, New York.

Pollard, D (1982) A central limit theorem for empirical processes, J. Austral.Math. Soc. A33 235-248.

Pollard, D (1984) Convergence of Stochastic Processes. Springer-Verlag, NewYork.

Roussas, G.G. (1972) Contiguity of Probability Measures; Some Applications inStatistics. Cambridge University Press, Cambridge.

Sheehy, A (1987) Constrained estimation and clustering based on Minimum K ullbackLeible:r information methods. Unpublished Ph.D dissertation, Department ofStatistics, University of Washington, Seattle, Washington.

44

'Sheehy, A. and Wellner, J. A (1987) Uniformity in P of some limit theoremsfor empirical measures and processes, Tech. Report No. 87, Dept. of Statist.,University of Washington, Seattle.

van der Vaart, A.W (1987) Statistical Estimation in Large Parameter Spaces CWItracts, Centrum voor Wiskunde en Informatica, Amsterdam.

Vapnik, V. N. and Chervonenkis, A. Ya (1971). On uniform convergence of thefrequencies of events to their probabilities, Theor. Prob. Appl. 16 264-280.

Documents

KULLBACK·LEIBLER CONSTRAINED ESTIMATION OF PROBABILITYMEASURES · Po, uniformly over a class of sets C. The estimators considered here are based on the minimization of the Kullback-Leibierdivergence