Upload
others
View
1
Download
1
Embed Size (px)
Citation preview
KULLBACK·LEIBLER CONSTRAINED ESTIMATION
OF PROBABILITY MEASURES
by
Anne Sheehy
TECHNICAL REPORT No. 137
July 1988
Department of Statistics, GN-22
University of Washington
Seattle. Washington 98195 USA
-
Kulfbeck-Lelbler Constrained Estimation of Probability Measures
by
Anne Sheehy
Stanford University
ABSTRACTWe consider two estimation problems in this paper. In the first, we observe
Xl,." ,Xn i.i.d. Po, where it is assumed known that EPoT = a. In the second, we
observe Xl,." ,Xn E IRa and J::: 2::'1 T(Yi) E IRb, where Xl, ... ,Xn, Yl, ... ,Y",
are i.i.d. Po, T is some Borel measurable function and n/m ----+ A as n/\ m --+ (X). In
both situations we consider the problem of estimating the probability measure
Po, uniformly over a class of sets C. The estimators considered here are based
on the minimization of the Kullback-Leibier divergence from certain collections of
probability measures to the empirical measure of the Xi'S. We show that these
estimators are consistent and asymptotically efficient.
•
Kullback-Leibler Constrained Estimation
of Probability Measures
by
Anne SheehyStanford University
July, 1988
AMS 1980 Subject Classifications: 62G05, 62G20.
"';' Key Words and Phrases: constrained estimation, Kullbeck-Leibler divergence, weakcom"ergence, asymptotic efficiency.
Research supported by: National Science Foundation Grant DM5-S400893 and DM8708083
,.
1
§1 Introduction.
HabermaD,(1984) considered the problem of estimating a Euclidean-valued
I functional of a probability measure Po, where one has observed data X1, ... ,X..
i.i.d. Po and one knows that Po satisfies a finite number of linear constraints. He
proposed an estimator for this problem based on the minimization of the Kullback
Leibler divergence from the constrained family of probability measures to the
empirical measure of the observations. He also derived a number of consistency
and weak convergence results for this estimator. In this paper we extend this
problem and those results in a number of directions as will be described below.
We consider two kinds of estimation problems here. The first can be described
as follows:-
We assume Po is a probability measure on the Borel a-field (IRd, B4 ) . Having
observed data Xl, ... ,Xn E IR4 i.i.d. po. we consider the problem of estimating
certain linear functionals of Po, given that we know Po belongs to the collection of
all probability measures On(IR4, 8 4 ) satisfying a fini te number of linear constraints.
More precisely, let 'P denote the collection of all probability measures on the Borel
a-field (IR4 • 8 4 ) and let T : IR4 -+ IR' be some Borel measurable function. We
assume that Po is constrained to be in
(1) PT,. = {P E P : EpT = a}
for some fixed a E IR'. We will refer to this problem as the "constrained estimation
problem". In this paper we will focus on the problem of estimating one particular
functional of Po, defined in the following way:-
Let C be a collection of sets in B4 • We consider the problem of estimating
functionals of Po of the kind
(2) VC(Po) E L~(C) where vc(Po)(C) = Po(C) for any C E C.
In estimating vc(Po) we are essentially estimating the probability measure Po
uniformly over C. A closely related problem to the one described above is that
2
where Po is not in fact in 'PT,4 but we want to estimate that Q in "PT,f1, which is
"closest" in some sense to Po. See the example below.
In the second problem, we again observe dataX1, ... ,X" E IRd i.i.d. Po. In
addition we observe the IR"-vaJued random variable Tm == ~ L:'l T(l'i) where
Y1 1 ••. ,Ym E IRd are i.i.d. Po, independent of the Xi'S, but are Dot observed and
T(·) is some measurable function from IRd into nt'. We also assume that
n- _ .A > 0 as n ---+ 00.m -
We will refer to this problem as the "supplementary sample moments problem".
Again, we are interested in est~a~ing functions of Po of the kind IIC(PO) E Loo(C),
exactly as defined above.
The theory given here is easily adapted to the case where we are interested
in estimating Euclidean valued functions of Po of the kind vg(Po) = EPo9 where
9 : lR" -10 IR" is some measurable function. For the constrained estimation prob
lem, the analogues of the theorems given in section 3 for the problem of the esti
mation of uc(Po) can in fact be found in Haberman(1984), while those of section 4
can be found in SheehY(1987).
The following is an example of the first kind of estimation problem.
Example: Known Marginals. Suppose ~i is a probability mass function for
i = 1, ..• , k ::; d. Suppose ~i and POi-the i t h marginal distribution of the X's
have support concentrated on {ail ... aib;}, i = 1, ... , k. Let
T(x) = (l(au j(xIl - Ro,(oll),"" l{a..,J(x.) - Ro,(o16,), ... ,
l{a"j(x,) - Ro.(o.,), ... ,l{a..,j(x,) - Ro.("..,)) E JR'
where b = L:~=1 bil and x = (Xl, ... , Xd)T.
Now let
a = (O, ... O)T E JR'.
3
Constraining Po to be in 'PT,fa 1S equivalent to saying we know the fint k
marginal distributions of Po-
Mosteller (1968) gives some interesting examples where 'PT.4 is as defined in
the example above and Po rl 'PT,fI. but one wants to estimate that Qo which is in
'PT,a. and which is "closest" in some sense to Po- One such example is the situa.tion
where we want to compare transition probabilities in social mobility tables between
societies. We may first want to adjust the marginal distributions of row and column
variables so that the variations in the relative sizes of social classes from society
to society do not affect our results. Mosteller suggests adjusting the marginal
probabilities so that they are uniformly distributed on each table. The method he
proposes, Iterative Proportional Fitting, gives the same estimates as the ones we
get from using the Kullback-Leibler constrained estimator which we will describe
below.
Jagers et al. (1985) describe a situation in the sample survey framework where
(XI, Yd, ...• (Xn l Yn) E IR? i.i.d. Po are observed and we know the marginal
distribution of the X's, call it POr' Also we know that POr has finite support.
They consider the problem of estimating functions of the kind vg(Po) = E Po9,
where 9 is some Borel-measurable, real valued function as defined above. In fact
they focus on functions 9 which are functions only of the Y variable. The estimator
they examine is called a "post-stratified" estimator and is defined by
- ~ po.(x)Ep,g(y) = LJ v(x) g(Yo)lIX;=,],
•where v(x) is the number of (X, Y) pairs where Xi = x, and the summation is
taken over all the x values in the support of Poz: with v(x) > O. It can be shown
that the Kullback-Leibler constrained estimator, described below, gives the same
estimate as the post-stratified estimator in this situation.
We now describe the estimators which we will consider here. The idea is the
following: In the fully non-parametric set-up, having observed X J 1 ••• ,Xn i.i.d,
Po, when no knowledge of Po is assumed, it is known that vc(IP n), is in a sense,
4
an optimal estimator for vc(Po} (under appropriate regularity conditions on C and
Po), where
(3)
We call IPn the empirical mea.5tlre of the observations Xl, ... ,Xn'
In the constrained estimation problem we know that Po is in a certain re
stricted class of probability measures and of course feel that we should take this
information into account when estimating functions of Po. \Ve propose doing this
by using vc(Q n(a» as an estimate of vc(Po), where QnCa) E PT,o and QnCa) is
the "closest" as measured by the Kullback-Leibler divergence, probability to lPn l
in PT,IJ.'
We use a similar idea in the supplementary sample moments problem. Define
(4)
and
(5)
1 rn
1P~(.) = m L l(Y,EOli=l
n J mJ 2TN =::; N TdlPn + N TdIPfn.l where N = n + m.
With these definitions, we use lIc(QN(TN» as an estimator for vc(Po) where,
q N(TN) E 'PT,TN is that probability measure which is the "closest" I as measured
by the Kullback-Leibler divergence, to lPn'
We now give a definition of the Kullback-Leibler divergence and introduce
some notation.
Definition 1. De:5.ne the Kullback-Leibler (K-L) divergence between two proba
bility measures Q and P on (IRd,8") to be,
K(Q,P) = {flog ~dQI ifQ ¢: P+00 otherwise.
Notice that K(·,·) is not a metric, in fact it is not even symmetric in its
arguments. However I its value is always non-negative and
K(Q,P) = 0 if and only ifdQ-=1dP
a.e. Q,
If there exists a probability measure Q minimizing K(·,Po) (respectively,
K("IP.)) over PT,. denote that Q by Q(a) (respectively, Q.(a)),
In the constrained estimation problem, the idea of using vc(Q: ,,(a» as an
estimate of vc(Po) has been used extensively in the literature but usually in the
context where Po represents the probabilities in a contingency table and the con
straints on Po are that certain of the marginals are known (see Haberman (1984) for
some appropriate references). Haberman (1984) considers the more general prob
lem which we consider here but focuses on estimating functions of Po of the form
%Jg(Po) = EPo9, where 9 is some Euclidean-valued, Borel-measurable function. In
his paper Haberman does not consider the supplementary sample moments prob
lem nor does he prove efficiency of the estimator for the constrained estimation
problem.
From now on, when we talk about the estimator vc(Qn(a) it should be un
derstood that we are working with the constrained estimation problem. and when
we write Vc(Q,,(TN) we are working with the supplementary sample moments
problem.
In section 2, we list conditions under which these estimators exist and show
that (under certain regularity conditions) they are consistent. Csisza.r(1975) and
Haberman(1984) have done most of the work here.
In section 3, we show that these estimators, when properly normalized, con
verge weakly to limiting Gaussian processes. We also make a comparison between
these estimators and the estimator vc(Pn). It is shown that in each of the two
problems the asymptotic variance of the proposed estimator, at any C E C, is less
than that of the corresponding quantity for the estimator vc(IP n)(C). In section 4
6
'we show that in the constrained estimation problem, vcCQ n(a)) is an asymptot
ically efficient estimator of vc(Po), while in the supplementary sample moments
I problem ve(qn(TN)) is an asymptotically eflicient estimator of ve(P,). Ireland
and Kullback (1968) have shown that in the situation where Po represents the
probabilities in a contingency table, and the constraints are that certain marginal
probabilities are known, then ve(q n(a)) is Best Asymptotically Normal. That we
can show vc(q nCa» is efficient in a more general context is therefore not too sur
prising.
The proof of the results stated in sections 2, 3 and 4 are all contained in
section 5. In section 6 certain technical definitions and theorems which are needed
in the proofs are stated and proved.
§2. Existence and Consistency of the Estimators.
§2.1 Introduction. We first summarize results, proved by Csiszar(1975) about the
existence of Q(a) (as defined in section I).
Define
S(P"oo) = (Q -e; P,: Q E 'P,K(Q,P,) < co].
We need to know under what conditions there exists a probability measure Q(a) E
S(Po, oo) minimizing K(·,Po) over 'PT,IJ"
The following result is due to Csiszar (1975):
(GI) Let
8(P,) = {B E JR.': E p• exp(II'T) < oo}
BT,.,P. = (a E JR.': 'PT,. n S(P"oo) # 0}.
If SCPo) is open, there exists a unique Q(a) in 'PT,4 minimizing K(-,Po) over
'PT,IJ for each inner point a E BT,IJ,Po and its Po-density is of the form
(I) d~~:\,,) = cexp(B(a)'T(x)) for some B(a) E 8(P,).
This is Theorem 3.3 of Csiszar (1975).
7
We note at this point that if a is an element of the relative interior of the
convex hull of the support of POT- 1 (i.e. a E re1iotcosuppPoT-1 ) , then a is an
I inner point of BT,.,P. (see Propositions A.3.2 and A.3.3 of Sheehy (1987». One
consequence of this is that if e(Po) is open and a E relintcosupp POT-I, then
there exists Q(a) minimizing K(·, Po) over 'PT,. and Q(a) has Po·density of the
form given in (1) above.
Therefore, showing that the probability measure Q .(a) (or Q .(T,,) ) ex
ists, with probability one for n sufficiently large, amounts to showing that a .E
re1iotcosuppIP",T-1 (respectively TN E re1intcosuppIP~T-l) with probability one
for n sufficiently large.
In the constrained estimation problem, results are stated both for the case
where EPaT = a and the case where EPoT 'I a, but a E relintcosuppPoT-1 and
there exists Q(a) minimizing K(·, Po) over 'PT,. with Po-density as given in (1).
If EPoT 'I G, then Q(a) is the "clceestv-es measured by the Kullback-Leibler
divergence-probability measure to Po having EPoT = G. It is of interest to study
the behavior of this estimator when Epo T t= a for two reasons. First, because
it is important to know how the estimator behaves when the constraint does not
hold. And second, because sometimes, as in Example 1 above, we may know that
EpoT t= a, but may still be interested in estimating Q(a).
Notice that when Varp,T(= Ep,(T - Ep,T)(T - Ep,T)') is singular, there
exist non-zero vectors u such that u'T = c a.s. POl c a constant. When this is the
case we see that the vector 8(a) defined in (1) will not be unique. Indeed if
(2) H = {u : «r = c.. e.s. Po, for some constant ca}
and 9" = 9(a) +u, u E H then
dQ•• (x) = exp(9:T(x))/Jexp(9: T(x»dPo(x)dPo
= exp(9(a)'T(x»/Jexp(9(a)'T(x»dPo(x) a.s. Po·
8
1t is dear then that B(a) is unique only up to its projection on H.L. From now on
we will define 9(a) to be that projection.
§2.2 Sta.tements of Results. Propositions 1, I' and 2 below are GIivenko-Cantelli
type results for the estimators which we consider here. We need to restrict the
collection of sets C to some pointwise separable, Vapnik-Cervonenkis (Ve) class of
sets (see Definitions 8 and 9). Classes of sets which are pointwise separable, VC
classes include the classes of all closed balls, all ellipsoidal regions, and all convex
hulls of at most k points (k fixed) in ]Rd. Classes of sets which are not VC include
the class of all convex sets in R d, d :::: ~ and the class of all lower layers in lRd
,
d > 2. The pointwise separability condition is needed to ensure the measurability
of the quantities in which we are interested.
Proposition 1. Suppose a E reli.ntcosuppPoT-1 and SCPo) is open, so that there
exists Q(a) minimizing K(·,Po) over Pr,« with Po density as given in (1) above.
SupposeC is a pointwise separable, VC class of sets in s«. Then with probability
one, for n sufficiently large, there exists CQn(a) minimizing K(·JIPn) over 'Pr,a
where
(i)
(ii)
and
(iii)
dQ (a)diP. (r) = c. exp(9.(a)'T(r», 9.(a) E HJ..
9.(a) ~•.e, 9(a) as n ~ co
Ilvc(Q.) - vc(Q(a))llL_(C) ~ •.e. 0 as n ~ co.
Proposition I'. Suppose EPoT = a and 0 E int SCPo). Then. with probability one,
for n sufficiently large, there exists CQ n(a) minimizing K( ·,IPn) over 'Pr,o where
(i) d~~a) (r) = c. exp(9.(a)'T(r», 9.(a) E HJ..
-----------------~------- -
'(ii)
t and
(iii)
On(a) --+a..t. 0 as n --+ 00.
9
Proposition 2. Suppose 0 E int 8(Po). Suppose C is a pointwise separable, VC
class of sets in [511. Then with probability one, for n sufficiently large, there exists
Qn(TN) minimizing K(., lPn ) over 'PT,Tx with
(i)
(ii)
and
as n --+ 00,
(iii)
In Lemmas 1 and 2 and Theorem 1 we show that, under appropriate regularity
conditions, if {an} is a sequence in IR:\ converging (a.s.) to some constant a,
then, with probability one, for n sufficiently large, there exists Q n(an) minimizing
K h IPn) over 'PT,o... and that
From this result we can deduce the preceeding propositions.
Lemma 1. If a E relintcosuppPOT- 1 , {an} C relintcosuppPoT-1 and an --+a..!I. a
th en with probability one, for n sufficiently large, there exists a probability measure
Qn(an) minimizing K( " IPn) over 'PT,4.. an d
(3)
10
~here
S.(a.) e Hi.
Lemma 2. Suppose tbe hypotbeses of Lemma. 1 hold. Suppose Q(a) minimizes
K(·, Po) over PT,. and
(4)
with
Then
d~~:) (.) = cexp(S(a)'T(.»)
Sea) e Hi n int 8(Po).
8,.,(an.) -+1.-9. 8(n) as n -+ 00
with S.(a.) as defined in (3) above.
Theorem 1. Suppose tbe bypotheses of Lemma 2 hold. Suppose C is a pointwise
separable, VC class of sets in s«. Then
IIvc(q.(a.) - vc(Q(a))IIL~(c) ~a.s. ° as n ~ 00.
§3. Weak Convergence of the Estimators.
§3.1 Statements of Main Results. In this section we examine the asymptotic be
havior ofthe estimators vc(q.(a» and vc(q .(TN).
We will need the following notation. For any function f E L2(PO) write
(1) "JE..(f) =JId( y'n(IP. - Po).
IT F is a collection of functions in L2(PO) then we can regard ::«.,., as an element
of Loo(F) and we call X{n the empirical process on F. If j, 9 E :F then
("JE..(f),"JE..(g»T '* N(O,r(f,g»)
11
where
su.» = (VUP.! Covp,(J,g)).Covp,(J,g) Vup,g
In Theorem 2 we show that the estimator vc(q,,(a)) has influence function,
¢e(-, Po)(-) where ¢e(C, Po)(-) E L,(Po) for any C E C, and ¢e(-, pole,) E L~(C).
In Theorem 3 we prove an analogous result for the estimator vc(q ,,(TN)).
Notation. If E is a b x b matrix, define E- to be the Moore-Penrose generalized
inverse of E.
Theorem 2. Let C be a pointwise separable, VC class of sets in s«. Suppose 8(Po)
is open and a E relintcosuppPT-l. Suppose in addition that
(i) JIITII' exp(20(a)'T)dPo < 00 and Jexp(20(a)'T)dPo < 00
where Ora) was defined in (2.1) and Ora) E s» n int 6(Po). Tben
Z1: n(C) =y'n(vc(qn(a)) - ve(Q(a)))(C)
J - dQ(a)= [(Ie - Q(a)(C)) - CovQ,,)(le, T)( VarQ(,)T) (T - a)] dPo
d}l{n + opel)
where tbe opel) term is uniformly small over C.
Theorem 2/, Let C be a pointwise separable, VC class of sets in 13 d . Suppose
o E int6(Po) and Ep.T = a. Tben
Z1:n(C) =y'n(ve(qn(a)) - ve(Po))(C)
= J[(Ie - poeC)) - (Covp,(le, T)(Varp,T)-(T - a)ld}l{n + opel)
where tbe opel) term is unifonnly small over C.
For any function f E L,(Po) wri te
)X~(J) =Jfd(vm(JP~ - Po)).
12
Theorem 3. Let C be a pointwise separable, V C class of sets in Btl. Suppose
o E inte(p~) IUJd ::.. --+ ..\ 81] n - 00. Then
1lN(C) = vn(lIC«~,,(TN)) -lIc(Po»)(C )
= J ..;>. (CovPo( l c ,T)( VarpDT)-(T - a)~:n1+,\
J 1 -+ [(IC - Po(C» - 1 +,\ (CovPll(lc,T)(VarpoT ) (T - o.)~n
+ op(1)
where the op(l) term is uniformly small over C.
Corollaries 1 and 2 below are weak convergence results for the processes 1l n
and 1lN defined in Theorems 2 and 3 above. We use the definition of weak
convergence used in Pollard(1984). Before their statements we need the following
definitions.
Define for any A, B E Btl, and any probability measure Q E 'P for which
EqlITlr.il < 00
and
Definition 2. For any collection cd hsuctious F E L 2(PO) define BF,PQ to be the
smallest a-iield on Loo(Jl wmcb
(i) makes all fullte-dimensional projections measurable and
(ii) conieias all dosed ba11s with centers in C(F, Po), the collection of all uni
formly contiDuous functions in Loo(.F) (w.r.t. the L2(PO) norm).
Corollary 1. Suppose tbe bypotheses of Theorem 2 hold. Then
13
'as random elements of (Loo(C), Bc,pg ) , wbete 'llol(') is a mean-zero, tight, Gaus
sian random element of Loo(C) witb sample paths in C(C, Po), and covariance
I function r Q(ca)("')'
Corollary I'. Suppose tbe hypotbeses of Tbeorem 2' hold. Tben
as random elements of (L(XI(C), Bc,Po), wbere 'll02(') is a mean-zero, tigbt, Gaus
sian random element of L'OO(C) witb sample paths in C(C, Po) and covariance
function rpoh .).
CorollarY 2. Suppose the bypotheses of Theorem 3 bold. Then
as random elements ofLOQ(C), Bc,Po), where 7l 0 3 ( . ) is a mean-zero, tight, Gaussian
random element of Loo(C) with sample paths in C(C, Po), and covariance function
r$o("')'
Remark. At this point it is instructive to think. about the behavior of the al
ternative estimator lIc(IPn) in each of the estimation problems, By well known
results (see e.g. Pollard (1984» if C is a pointwise-seperable, VC class of sets and
Po satisfies the assumptions of Corollary 1 or 2, then
(4)
and
(5)
Ilvc(lPn ) - VdPO)IlL",,(C) -a..II. 0 as n - 00.
-
where.1Y is a mean-zero, tight, Gaussian process in C(C, Po) having covariance
function, t :C xC _1Rl, given by
r(A,B) = Po(A n B) ~ Po(A)Po(B) for A, B E C.
14
Notice that fOT any A E C,
,(6)
In fact
f(A,A) = Var(~(A)) ~ Var(Zlo,(A)) = rp,(A,A).
f(A,A) - rp,(A,A) = Covp,(lA, T)(Varp,T)-Covp,(lA, T).
U VarpoT is non-singular this difference will be strictly positive whenever
Covp.(l A, T) # o. What (4), (5) and (6) are saying is that when Po satisfies the
assumptions of Corollary 1', vc(UJ II) is, like vc( Q tiCa)), a consistent estimator of
vc(Po). However the variance o£the limiting distribution of y'n(vc(lP.) - vc(Po))(A),
for any set A E C, is greater than the variance of the limiting distribution of
y'n(vc«~.(a))- vc(Po))(A). If Po If. "PT,., then vc(Q .(a)) and vc(lP.) estimate
different measures, and comparison of the two estimators is more difficult.
Also notice that
(7) f(A, A) - r},,(A, A) = -2,COV(lA, Tl'(Varp,T)-Covp.(lA, T).1+~
From (7) we see that by taking the information about the supplementary moments
into account and using vc(Q N(TN)) as an estimator of vc(Po) rather than VC(n'n)
we reduce the asymptotic variance of the estimator of vc(Po)(A) =Po(A) by the
quantity on the right hand side of (7).
In section 4 we will show that asymptotically, in a certain sense , vc(Q,.,(T.v))
and vc(Q RCa)) are the "best" estimators for these two situations.
As one would expect
lim r~ (A, B) = rp,(A, B).>........0 0
That is to say that if ~ -+ 0, so that we have "infinitely" more information from
the Yj's than we do from the Xi'S, asymptotically, it is as if we knew EPoT. U
however ~ -+ 00, then ). = 00 and
• >... -lim r p (A, B) = Covp.(lA, 1B) = f(A., A)>.. ......00 0
15
so that, asymptotically, it is as if we had no extra information from the supple
mentary sample moments.
§4. A Convolution Theorem for Estimators of lie(Po).
§4.1 Introduction. In this section we use results of Van der Vaart (1987) and
Bickel, Klaassen, Ritov and Wellner (1989) to derive a. convolution theorem for
"regular estimators" of lIc(Fo) in the constrained estimation problem and in the
supplementary sample moments experiment. We use the Hellinger metric to define
distances between probability measures. For any 9 and h in L2(PO) let
(g, hlp. = j ghdPo and I/gll;'. = j g'dPo.
If z E lRd, Ilz11d denotes the usual Euclidean norm.
We consider the experiment where we have observed data X n, a random
element of the measure space (An, On) where X" = F(Zl, ... , ZN) for some
Euclidean-valued, Borel measurable function F and ZI,'" I ZN E IRd are Li.d.
Po and N = N(n) --+ 00. On the basis of X n we want to estimate some fun
tiona! of Po where we know Po E Po where Po is some collection of probability
measures on (IRd, Bd ) . We call Po the "model". In the constrained estimation
problem X n = (X1, ... X n) where the Xi's E IRd are i.i.d. Po, An = (IRd)n ,
Un = (Bd)n, N :;; n and Po =. PT ,4. In the supplementary sample moments
problem Xn := (X1",.Xnl J;nL:::IT(~)), whereX1,.,.,Xn,Y1, ... ,Ym E IRd
are i.r.d. Po, A. =(JR")" x JR', fl. =(5")' x 5', N;: n+m and 'Po ='P.
Po =P , the collection of all probability measures on (IRd, ad).
We begin with a heuristic explanation of how the results in this section are
derived. The notion of a tangent space to the model Po at Po is central to this
discussion.
Definition 3.. A subspace T(Po) of L 2(PO) is called a ta.ngent 3p~ce at Po E Po
if for aU h E T(Po) there exists (P,} C 'Po with
(1) jlC'«dP,)'/' - (dPO)'/') - ih(dPo)'/'I' ~ 0 as t ~ O.
16
Formula (1) is interpreted as follows. Let Pc and Po have densities Pu and Pt
'respectively with respect to an arbitrary 17-finite measure JJc domina.ting P, + Po.
Then we have
Basically, T(Po} being a tangent space means that it is a subspace in L2(PO) made
up of a collection of "directions" in which we can approach Po in the model Po.
We will use the notation Tm;(Po) to denote a maximal tangent space in the model
Po at Po. The tangent space Tm(Po) is maximal if its closure, Tm;(Po) (in the
L 2(PO) norm) contains any other tangent space T(Po).
Suppose v(Po) E mI. We say that v(·) is differentiable with respect to Tm(Po)
at Po with derivative u(Po) E Tm(Po), if given any h E Tm(Po), there exists a
sequence {PrJ C Po with
and
(2)
Notice tha.t u(Po) is defined uniquely only up as far as elements orthogonal to
T m (Po). We can think of (v(Po),h)Po as being the rate of change of the parameter
11 at Po in the direction h. If 5 is some closed subspace of L 2 (Po) let II(v(Po)J5)
denote the projection of the quantity v(Po) onto S. If we observe Xl,' .. ,Xn i.i.d,
Po E Po and iJn is a "regular estimator" of the parameter II at Po (this concept
will be defined later) where 11(·) is differentiable with respect to T m (Po) at Po (for
some Tm(Po», and -/n(vn - v(Po» has limit law .cPo(Z), it can be shown that
17
where
Zo - N(O, lIII(v(Po) I T~(Po»lI')I
and
W is independent of Zoo
Roughly then, the greater the rate of change of the parameter 1I in the neighbor
hood of Po, the larger the variance of the random variable Zo, as one would expect.
The proof of this result goes along the following lines: Suppose Trn(Po} is
itself a tangent space. Suppose also that for every h E Trn(Po), we can find a
sequence {P..} satisfying (1) and (2). We regard 'P•• = {P.. } as a one-dimensional
submodel of 'PT, .. > Using the Hajek-Le Cam Convolution Theorem (Hajek, 1970),
we can then show that if vn is a "regular estimator with respect to T m(Po)" of
v(Po) in the model 'P.. , with limit law J:p,(Z), then
J:p,(Z) = J:p,(Z" + W)
where
Z" - N(O, (v(Po), h)j" (h, h),,;)
and W is independent of ZOA. Let S" denote the subspace spanned by the function
h. Since the variance of Z" is just lIII(v(Po)!S.)II' it is easy to see that this
quantity is maximized over T~(Po)when h = ho =II(v(Po)IT",(Po». In this sense
'Ptho is the least-favorable sub-model in 'PT,a for the problem of the estimation of
lI(Po), If lin is a "regular estimator with respect to Tm(Po)" of the parameter 1I
at Pa in the model 'Po with limit law .cPD(Z). it must also be a "regular estimator
with respect to S1&o" of the parameter 1I at Po in the least- favorable model 'Pth o ·
By the Hajek-Le Cam Convolution Theorem we must have
J:p,(Z) = J:p,(Z", + W)
where
ZOh, - N(O, lIII(v(Po)IT~(Po»II')
=
18
and W is independent of ZOllo' We can in fact reach the same conclusion without
assuming that Tm(Po) is itself a tangent space.
In this section, we consider estimation of parameters in LQCI(C). The ideas are
essentially the same as those for the real line but the technicalities are more com
plicated. The reader is referred to Van der Vaart (1988) for a thorough exposition
of these ideas.
While this theory is exactly what is needed for the constrained estimation
problem where we observe Xl,." 1 X n i.i.d Po E 'PT,,,,, the supplementary sample
moments problem. does not fit into this framework since there the data is not
i.i.d.. However the basic idea is the same. \Ve begin by showing local asymptotic
normality of the experiment in certain one-dimensional sub-models. Having done
this it is an easy matter to get a convolution theorem for regular estimators in any
one of these sub-models. From this we can then derive a convolution theorem for
regular estimators in the whole model.
§4.2 Definitions. As a consequence of the fact that there is no "natural" choice
for either the space where vc takes its values, or, having chosen that space, for a. a
field defined on that space, there is also no "natural" definition of what constitutes
an estimator sequence for vc. We present one possible definition for an estimator
sequence here. In this definition we fix on a particular choice of range space for
vc and a particular choice of c-field for that space. Not surprisingly, we make this
choice to coincide with the choice made in Theorems 2 and 3 of section 3. \Ve
will then state a convolution theorem for sequences of "estimators" of vc having
certain (to be defined) regularity properties.
Note. We assume throughout the remainder of section 4 that C is a pointwise
seperable VC class of sets in 5' and Ep,IITIII < 00.
Definition 4. Suppose we observe X" E (An, n l'1.). A sequence {vn(X")} ={vn} C Loo(C) is said to be an (L oo(C),Bc,po)-e.st im ato7' .sequence of vc(Po) if V n
is (A.,n.) I (L",,(C),5c ,p, ) measurable.
19
Definition S. An (Loo(C),Bc,PrJ-estimatorsequenl'e, {!In}, ofvc(Po) in the model
'Po is regular if for some maximal tangent space, T m (Po) I to tbe model 'Po at Po,
I corresponding to ea.ch h E Tm(Po} tbere exists a sequence {Pn} C 'PT,G so that
(i) j[.;Ti(dPn)l/'2 - (dPo)1/2) - ~h(dPo)1/'2]2 -t 0
(ii) n.jj'l(vc(Pn ) - vc(Po»- {I{.), hHILoo(c)-O
and
(iii)
f'&
as random elements of(Loo(C), Bt:,Po ) where II is a tigbt random element of Loo(C)
witb sample patbs in C(C, Po).
In what follows, it will sometimes be necessary to spedfy the tangent space
T m(PO) in Definition 5. We will do this by saying that {vn} is regular with respect
to the tangent space T m(Po).
§4.3 Convolution Theorems for Estima.tors in the Constrained Eatimation Problem >We are now ready to state convolutions theorems for regular estimator sequences
of vc(Po} in the model 'fIT,G'
Theorem 4. Suppose we observe Xl, ... IX,. i.i.d. Po E PT,Oo' Let {lin} be an
(Loo(C), Bc,po)-estimator sequence of vc(Po) in tbe model PT.... Suppose {Vn}
is regular with. limit law .c(7l). Tben 7l d 7l0'2 + W wbere 7lo2 is as de6ned in
tbe sta.tement of Corollary I' and \.V is a C(C,Po)-vaJued tight random element
of Loo(C), independent of71 02 •
The above theorems lead to the following definition for an efficient eatim.a.tor of
vc(Po) in the model 'PT.Oo'
Definition 6. Suppose we observe Xl, ... ,X,. Ll.d. Po E PT,,,. Let {V,.} be an
(Loo(C), Bc,po)-estimator sequence ofvc(Po) in tbe model 'PT,G' Suppose {vn } is
regular witb limit law ..c(ll). Hd
7l = 7lo2
20
where 710 2 is as deiined in Coronary 11, we then say that {lin} is an (Loo(C), Bc,Po)efficient e.stimator of lIc(Po).
'Corollary 3. Let C be a pointwise separable,. V C class of sets in Bd. Suppose
o E int e(Po). Then v(q.(a)) is an (L~(C),Bc,po)-e1ficientestimator of vc(Po)
.in the model 'PT,G'
Lemma 3 below is needed for the proof of Theorem 4.
Lemma 3. Define
(3) TO(Po) = (h E L,(Po) : [h[ < K, some K > 0, Ep,h = 0, Ep,hT = O}.
Then TO(Po) is a maximal tangent space at Po E PT,•.
Note that (ii) of Definition 5 implies that for any C E C, vc(Po)(C) is differen
tiable with respect to some maximal tangent space, Tm(Po), and vc(Po)(C)= Ie.
Since Ta(Po) is maximal we must have
Tm(Po) = (h E L,(po): Ep,h = O,Ep,hT = O}
for any maximal tangent space TmCPo). A simple calculation shows that for any
A,B € C, any a == (a1,a2) E IR.2
a'rp,(A,B)a = [[II(a,vc(Po)(A.) + a,vc(po)(B)ITm(Po))II'
with rp,(',') as defined in (3.2) and vc(Po)(-) as defined above. With this obser
vation Theorem 4 follows from Theorem 5.3.1 of Bickel et al. (1989).
§4.4 Convolution Theorems for Estimators in the SupplementarY Sample Momente-.
Experiment;. The results stated here are clumsier than those stated in §4.3 for the
constrained estimation problem. Specifically, in order to get convolution theorems,
we must make some hypotheses about the distribution of the random variable
Jm. 2::'1 T(}i) == .;rnTm . We need to do this in order to get a Local Asymptotic
Normality (LAN) result in one-dimensional sub-models. In Theorem 5 below we
21
hypothesize such an LAN result. In Lemma. 4 we specify certain conditions on the
pdf of POT- 1 under which such an LAN result is true. Corollary 4 is a. restate
'ment of Theorem 5 with these conditions inserted in place of the LAN hypothesis.
These conditions are, DO doubt, much stronger than what are needed to get such
results, see Le Cam and Yang(1988) for weaker (though harder to interpret) hy
potheses.
As before, let 'P be the collection of all probabili ty measures on (lRII, B II).
Theorem 5. Suppose we observe Xl, ... ,Xn E )Rd, Jm 2::: 1 T(Yi) E JRb wbere
Xl, ... , X,,, Yi, ... ,Ym are U.d. Po E 'P, ~ ....... >. as n 1\ m ....... 00, T is some Borel
measurable function. Let T m.(Po) be some maximal tangent space to the model
'P at Po and {vn } be some (LOll (C), Be ,Po )-estimator seq uence of vc(Po) with limit
law .c(71) concentrated 0.0 C(C,Po). Suppose that for every h E Tm(Po) there
exists a sequence {P.1&} c 'P for which
(i) (1) and (2) hold
(i i) {vn } is a regular (L 00 ( C),Be, Po )-estimator sequenee in the model {Pt 1& }, and
(iii) if n: denotes tbe pdf of the random variable (XI, ... , X n I );:: 2:::1 T(Yj ))
wbereX1.···,Xn ,y l , ...• ym, are Li.d. P. 1& and f~m. denotes tb.epdfofthe~.
same random variable where Xl, ... ,Xn, Y1 , ... , Ym are U.d. Po,
where Rn. - 0 in Pa-probability.
Then
d?L = ?L03 + \V
(4)
22
where llo3 is a mean-zero, tight, Gaussian process with sample paths in C(C,Po)
and covariance function r~,(-,·) as defined in (3.3) and W is a tight C(C,Po)
I valued random element of(L~(C),Bc.p.), independent of'll".
Before we state Corollary 4 we need to introduce some notation. Let
T;"(Po) = (h E L,(Po) : Ih[ < K, some K > 0 and Ep,h = O}.
It is an easy matter to show that T~(Po) is a maximal tangent space for the model
"P (see, fur example. the proof of Lemma 3). For any h E T;"(Po) define
dP.."P. = (p•• E"P: dP
o= 1 + Oh, for all 0 such that IOhl < l}.
Corollary 4. Suppose we observe Xl, ... I X .. E IRd, Jm L:~1 T(Yi) E IRb where
X11 ... ,X", 1'1, ... ,Ym are U.d. Po E 'P, ; -+ A as nAm -+ 00 and T is
some Borel measurable function. Suppose that POT- 1 has a bounded density
with respect to Lebesgue measure and EpollTlll < 00. Suppose that {un} is an
(L~(C),Bc.p,)-regular estimator sequence ofvc(Po) in the model "Ph with limit
law.c('ll) concentrated on C(C,Po). for each h E T;"(Po). Then
d'll='llOJ+W
where 710 3 is a mean-aero, tight, Gaussia.a process with sample paths in C(C, Po)
and covariance function r~o('l') as deBDed .in (3.3) and W is a tight random
element of (L~(C), Bc,p,), independent of'll".
Definition 7. Suppose we observe XI, ... , X n E IRd, J;;:. E;:l T(~) E IRb where
Xl,· .. ,Xn , YI , ... ,Yrn. aTe i.i.d. Po E 'P, ::a. -+ A as nA m -+ 00, T Js some Borel
measurable function. Let T m(PO) be some maximal tangent space to the model P
at Po. If {v.. } is an (LCX)(C), BC,PQ
)-estimator sequence for which corresponding to
each h E Tm(Po), there exists a sequence {Po} C"P for wb.ich (i), (ii) and (iii) of
De.finition 5 hold and also (iii) of Theorem 5 holds with () = 1 and P t = p'l. If7-
23
.{Vn } bas limit law £(ll) concentrated on C(C, Po) we say that {vol is an efficient
estimator ifd
II = 1l03 .
CorollarY 5. Suppose we observe XII'" I x, E IRd, );;:. L:::1T(Yj) E JR." wbere
XI, .. . ,X'l ' YI , · · · , Ym. are i.i.d. Po E P, ::. -. >. as n 1\ m -. 00 and T is
some Borel measurable function. Suppose tbat Po T-l bas a bounded density witb
respect to Lebesgue measure and 0 E int e(Po). Then vc(C~ N(TN» is an efficient
estimator of vc (Po).
To prove Corollary 5 it is enough to show that the conditions of Definition 7 hold
with Trn(Po) = T~(Po) and {P'l } := {P4- h} where P 1 h was defined in (4).~n' ~1
With this choice of T m(Po) and {Pn}, showing that (i) and (ii) of Defini tion 5
hold is easy. To show that (iii) of Definition 5 holds, see example 1 of Sheehy
and Wellner(1988). Lemma 4 implies that (iii) of Theorem 5 holds and this will
complete the proof.
Corollary 4 follows from Theorem 5 and Lemma 4 in which a Local Asymp
totic Nonnality (LAN) result for the supplementary sample moments experiment
is stated. Lemma 5--in which we get a convolution theorem for sequences of esti
mators of real-valued parameters in one-dimensional sub-models of P-is needed
in the proof of Theorem 5.
Lemma 4. Suppose PoT-1 bas a bounded density with respect to Lebesgue mea
sure and EPa IITlIf < oo, For a fixed h E T:"(Po) and some 8 E lRl Jet P; = P, h~,
where P, h was defined in (4). Let Inm.. denote the pdf of the random variable~.
(Xl!"" X n , Jm. L:::l T(Yj» wbere XI, ... ,Xn , Yl> , Ym. are i.i.d. Pta, and f~m.
denote the pdfof the same random variable where Xl, , X n, Y1 , ••• , Ym are i.i.d.
Po. Tben,
24
wbere R n ~ 0 in Po-probability,
Lemma. 5. Let {Vn} be a. Borel-tnessursble estimetOT sequence of JJ(Po) E JR.J with
limit law £(Z). Suppose h E T(Po), where T(Po} is some tangent space to the
model 'Po at Po. Suppose also tbat for tbis h there exists a sequence {Pn} C 'PT,o.
so that
(i)
(ii)
(iii)
and
(iv)
bold.
Then,
where
(iii) of Theorem 5 holds with (} =l,P thE P«~'
and Wh is independent of Zoh·
25
§5 Proofs.
§5.1 Proofs of Results in Section 2.
Proof of Lemma 1. The proof given here is an adaptation of the proof of theorem 1
of Haberman(1984). Write K =cosuppPoT-l and lKn == cosuppIPnT-l. The
result given in (Cl) above implies that to prove existence of Qn(an) with IPn
density as given in (2.3), it is enough to show
(a) an E relintlKn , with probability one, for all n sufficiently large
and
(b)
where Pon denotes the n-fold product measure.
We first show (a) holds. Since a E reliatK, for some I!! > 0, BCa,l!!) n K C
relintK, where B(a,f), denotes the ball centered at a of radius I!! in IRd• This in
turn is true if and only if, for some points t}l 1 < j ::5 c , for any ao E BCa, I!!) n K ,
ao = L:j~l Cljtjl where L:i=l Qj = 1, Qj > 0, 1 < j < c, and for every open
neighborhood U of a tj, 1 $ j $ c, poT-leU) > o. For some open Uj , 1 < j s c,
one thus has POT-ICUj ) > 0, where Uj = Uj n K with ao in the relative interior
of the convex hull of {3j : j = 1, ... ! c} for any {Sj : j = 1, ... , c}, with 5j E ii;1 < j ::5 c and every Go E B(a, E) n K. Now use the fact that for any j = 1, ... I c,
the probability that no T(X.-) E Uj , i = 1, ... , n, infinitely often, is zero and the
fact that an -a.t. a to conclude that (a) is true.
We now show that (b) holds. Define
A. = {x : u'T(x) = u' EPoT}.
By definition of H, ifu E H then Po(A.,) = 1. Let {Un}n~l be a dense subset of H
and define A = n~=l Au". If x E A, then u'T(x) = u' EPoT for all u E H. Clearly,
26
·Po(.4) = 1. To show that when (0) is true, we can assume with probability one
8n(a,, ) E HJ., it is then enough to show that
(e)n
p;(nrX, E AD ~ 1.i=l
However (c) is clearly true and we are done. •
Proof of Lemma 2. This lemma. can be proved using an argument similar to that
used in the proof of Theorem 3 of Haberman(1984). I
Proof of Theorem 1. For ease of notation write Q" = ~n(an), Q = Q(a). 8n =9n(an) and 9" 9(a). For any C E C,
(a) (Qn-Q)(C)
= Lleexp(9~T)dJPn _ f1eexp(9'T)dPof exp(li'nT)dJP n f exp(9'T)dPo
1 [J( fIe exp(9'T)dPO)(, )]= f exp(li'nT)dJPn Ie - f exp(9'T)dJPn
exp(9nT) - exp(9'T) dIPn
1 [J( f eXP(9'nT)dIPn) , ]+ f exp(9~T)dIPn Ie - Q(C) f exp(Ii'T)dJPn exp(9 T)d(lP n - Po) .
Lemma 7 (in section 6) shows that if
(b)
then
(c)
:F, = { l e eS'T , e9'T : 9 E B,(O), C E C}
:F& is a VC graph class of functions.
-
(See Definition 10 in section 6.) Clearly, if C is pointwise separable, then :FfJ is
also. We use (e) and Theorems II.24 and 11.25 of Pollard(1984) and the fact that
under the conditions of the theorem 8n _ (j a,s. as n -t co to deduce that
sup IJleexp(9~T)dJPn-Jleexp(9'T)dPol=0(1) a,s.CURd
27
cind
sup IJ Ie exp(9'T)d1P" - JIe exp(ttT)dPol = 0(1)CuR",
We can then rewrite (a) in the following way,
a.s,
(q .. - Q)(C)
= JeXp(O'T)ldPo+ 0(1) J(Ie - Q(C) + o(l»(exp(O~T) - exp(9'T»dIPn
+ JeXP(9'T~dPo + 0(1) J(Ie - Q(C) + 0(1» exp(O'T)d(lP n - Po)
=0(1) a.s. uniformly in C E C. •
§5.2 Proofs of Results in Section 3.
We begin by finding the asymptotic distribution of 8..(a) and 8..(TN)
Proposition 3. Suppose a E relintcosuppPoT-l and e(Po) is open. Suppose also
tbat
EPa lIT1I2 exp(20(a)'T) < 00.
Then
where bn E H.
Proposition 3'. Suppose EPoT = a and 0 E int0CPo). Tben
(i)
where b" E H.
Proposition 4. Suppose 0 E int8(Po), ::.. ----.. A 2: 0 and mAn ----.. 00. Then
y'ii8R(TN) = (Varp, T)- {J IV;>. J(T - a )tf%.'... - 1~ >. J(T - a )tf%.R}
+ opel) - bn
28
. where b", E H.
, Peoof of Proposition 3. Foe simplicity of notation we write
Bn(a) =B.. B(a) =B, qn(a) =q.. Q(a) =Q.
By definition 01 fJ", and 8, on a set of probability one, for n sufficiently large
(a)0= y'n (JTexp(B~T)dlPn_ JTexp(B'T)dPo)
J exp(B~T)dlPn J exp(O'T)dPo
1 [J r=, '( JTexp(fi'T)dPo) ]= J exp(B~T)dlPn V n(exp(OnT) - exp(B T») T - J exp(O'T)dPn dlPn
1 [J ( J exp(O~T)dlPn) , ]+ J exp(O~T)dIP. T - a J exp(O'T)dPo exp(O T)dX!:•.Use a Taylor series expansion to write
(b)
y'n(exp(O~T)-exp«(j'T») = y'n(On-B)'T exp(O'T)+~y'n(On-B)'TT'(O.-B) exp(O~T) :
where 118' - 011 < liOn - BII· Use the fact that 0 E interpol and Bn ~•.•. B as
n --+ 00 to get
(e)
and
Jexp(O~T)dIP. = Jexp(B'T)dPo + opel)
(d) Jexp(O'T)dlP n =Jexp(B'T)dPo+ opel).
Putting (a), (b), (e) and (d) together we get
(e)
o = (J exp(~'T)dPo + OP(l)) [j T(T - a + opel))' exp(B'T)dPo+ op(l)] y'n(On - 0)
+ (J exP(~'T)dPo+Opel)) y'n(Bn - O)op(l)
+ [jeT - a):~ dX!:n] + opel)
= [VarQT+ op(l)]y'n(On - 0) + y'n(Bn - O)op(l)
+ [j (T - a):~ dX!:n] + opel).
29
\Ve see from (e) that it must be true that
I (1) lIvn(8" - 9)11 = Opel).
and so
J dQVarQr' ,fii(9,. - 0) :::; - (T - a) dP
od;«,. + opel)
or
(g) ';;'(1),. -I) = - J(VarqT)-(T - a) :~ rJX{n + opel) - b"
where the bn term is the projection of the first two terms on the right hand side of
(g) onto H. We include it to ensure that the quantity y'1i(On - 00 ) E H.L. •
Proof of Proposition 3'. The proof is very similar to that of Propositon 1 and so
will be omitted. •
Proof of Proposition 4. The proof is very similar to that of Proposition 1. We
give an outline here. For simplicity of notation we write
By definition of 9n , on a set of probability one, for n sufficiently large
(a)
(J TeXP(e~T)dfP n J )..;Ti(T.v - a) = vn Jexp(9~T)dJPn - TdPo
= [VarpoT + op(l)]yTi8n + op(1)y'Ti'8n +J(T - a)cC«n + op(l).
The left hand side of (a) can be written aa
(b)
vnCTN - 0) = n JCT - a)dJX~ + m 0JCT - a)tCK:nn+m n+mY;;;'
). J ,;xJ :z )= -- (T - a)dXn +--, (T - a)dhX.m. + op(l .1+>' 1+",
=Opel).
30
From (a) and (b) we see that
and so
So we can write
(e)
.,fii8. : <Varp.T)- {IV:>' /<T - a)<iJK;' - 1~ >. f<T - a)<iJK. } + oP(l) - b.
where bn is the projection onto H of the first two terms on the left hand side of
(c). We include this term to ensure ..;riB" E s-. •Eroof of Theorem 2. For simplicity of notation we write Qn (a) =Q n I Q(a) =QI
8"(a) =Bn and B(a) == 8. Let C E C. Then on a. set of probability one, for n
sufficiently large,
(a).;;i(q n(G) - Q(e))
1 [J ( fIe exp(9'T)dPo) I]= Jexp(9~T)dIPn Ie - f exp(9'T)dIP .. yti(exp(dn T) - exp(8 T))dIP"
1 [J ( Je"(PC9~TdlPn) I ]+ Jexp(9~T)dlP" Ie - QCC) Jexp(8'T)dJPn
exp(8 T)cDKn .
Arguing as in the proof of Proposition 1 yields
(b)
where the opel) terms are uniformly small in C. We now show that the third term
on the right hand side of (b) is Opel). Define
dQ:F;: {(Ie - Qce)) dP
o: C E C}.
-
31
. Theorem 10 of Pollard (1982) shows that :F is a sparse class of functions (see
Definition II in §6). Corollary 13 of Pollard (1982) implies then that :F Is totally
bounded in L,(Po). It is easy to see that C being pointwise seperable implies :F is
pointwise seperable. In the proof of Theorem VII.21 of Pollard (1984) it is shown
that for any f > 0, we can find a finite collection of open balls in Lo::,(.1'"), each
with finite radius, centered at points in C(.1",Po), such that the probability that
}Xn is in the union of those balls, call it B .. , is greater than 1 - E. Since:F is
totally bounded, each element of C(.r, Po) must be bounded. So for some M> 0
we must have that c.p E B~ implies lIc.pllL..,cF) < M, From this we may conclude
that
j dQopel) (Ie - Q(G» dP
odJIl:. = opel).
Thus, from Proposition 3 and (b), using the fact that b~T = b'nEPoT, a constant,
with probability one, we get
y'n(Q .(G)-Q(G)) = j[(le-Q(G»-(Covq(le, T)'(VarqTnT-a)] :~ dJIl:.+op(l)
where the opel) term is uniformly small over C. •
Proof of Theorem 2/. The proof is exactly analogous to that of Theorem 1 and so
is omitted. •
Proof of Theorem 3. For simplicity of notation we write Q..(TN) =Q fl' f),,(TN) =
9". As in the proof of Theorem 2 we can show that on a set of probability one,
for n sufficiently large, for any C Eel
y'n(Q .(G) - Po(G» = Covp,(le, Tl' y'n9. + j(le - Q(G»dJIl:. + opel)
where the opel) term is uniformly small in C. Using the fact tha.t b~T = b~a, a
-
32
constant a.s. Po, we get from Proposition 4,
where the op( 1) term is unifonnly small in C. I
Proof of Corollary 1. Write Q(a) =Q. Define
_ dQF = {[(Ie - Q(a)(C» - Covq(1c, T)(VarqT) (T - a)]-d : C E C}
Po
anddQIe C) = 1e - Q(C» - CovQ(1ct T)(VarqT)-(T - a)dP'
Theorem 10 of Pollard (1982) implies F is a sparse collection of functions
under the hypotheses of the Corollary. It is easy then to see tha.t :F,C and I(·, C)
sa.tisfy the conditions of Lemma 6 (in §6) and so we can conclude that
as random elements of (Loo(C), Bc,Po), where ~(.) is as described in the statement.
•The proofs of Corollaries 1/ and 2 are very similar to that of Corollary 1 and
so are not included here.
§5.3 Proofs of Results in Section 4.
Proof of Lemma 3. Obviously T·(Po) is a subspace of L2(PO) . ¥le first show that
T-(Po} contains every tangent space. That is, if T(Po) is a tangent space then
T(Po) C T·(Po) = {h E L 2(PO) : f hdPo = 0, JhTdPlj = OJ.
33
Suppose h E T(Po). Then for some {P,} C Pr,1J. we must have
We show that
J hdPo = 0 and J ThdPo = O.
WritedPt dPo
[« = - and ft = -d~t d~t
where ~t is some measure dominating Pt. and Po, and
A = {x : ft (~) > O}.
Then
1- J hdPol = IJ[t-1(fu - ft.) - hft]d~tl
'" IJC'I"lA,d!',1 + If I{I' (c. (~,;, -I: I') - hI:I') lAd!',)1
s IJC' I"lA' d!'ll + IJI; I' [t-' U';, -I:I') - ~hI;I'] lAd!" I
+ IJ ft1/2(t-l(ft1/
2 - f;/2) - ~hftl/2)lAd~tl
s IJ r 1fttlAed~tl + IJ ft1/
2[t- 1(ft.1/
2- f t
1/2)
- ~hftl/211Ad~tl
+ IJ 4h f i /2( f t
1/2
- ftl/2)d~t.1
+ IJ f;/2(t-l(fe1/
2 - ft1 /2) - ~hf:/2)IAd~t.l
= 0(1)
by (1) and repeated application of the Cauchy-Schwarz Inequality.
We DOW show fThdPa = 0 must hold. Again choose h E T(Po) and a
sequence {Pt.} C Pr,lJ. as defined in (a). We show that we can find a sequence
{Pt} c PT,a (or which (a) holds (with {Pt} in place of {Pd) and for which
34
f IT!'JdPt s 2 I lTl 2dPo for all t 2: O. Then, using an argument similar to that
used above to show JhdPQ = 0, it is easy to show that f ThdPo = 0 and we will,be done.
Define, for K > 0
AK ::: [z : lh(x)l < K},
and
Notice,
Define {Pt} by
dPt = (1 + th1/t)dPo, t > O.
Then, for t > 0
A straightforward argument can be used to show
To show that T-(Po) is itself a tangent space, we need to show that if h E
T-(PlI ) , there exists {Pt } C 'PT,a so tha.t
j[C1« dPt? /2 - (dPO)1/2) - ~h(dPo)1/'J}2 -+ 0 as t --+ O.
It is easy to show that the sequence {Pt } defined by ~;~ = 1 + th will do and we
are done. •
Proof of Corollarv 3. We need to show regularity of the estimator vc(q ,,(a». The
proof of this can be found in example 1 of Sheehy and Wellner(1988). •
Notation: For any {C1, ..• , C.c} C Loo (C), any '.p E Loo(e), define
35
Proof of Theorem 5. Under the hypotheses of the theorem, Lemma 5 implies that
for each {Cb ... C1J c C, Ie > 1 and any a E IRI:,
where, if r~(Cl"'" Cl') is the Ie x Ie matrix whose ijth element is given by
then
and W h(C1 , ... 1 Cl') is independent of 7lo(GIl'" ,Cl'). To see this, in the state
ment of Lemma 5, write
Define
I:v(Po) == L QiVC(PO )(Ci)
i=1
andl'
Ji(Po) =L a,le;.i=1
Then b« E T m (Po) aad so there exists a sequence {h n tI} C T rn (Po) with
l1hna - h rJ llpo ---+ 0 as n - 00. It can be shown that
(b)
a.nd
(c)
where 7l0J (CI •..• 1 CI:) and \V (CI , ••• ,Cl') are as defined in the statement of the
theorem. From (a), (b) and (c) we can thus conclude that
7l(Cb ... I Cl') .!! '7Z0J (C1 , ... ICl') + W(Cl,"'j Cl')
36
With 7103 and W as defined. in the statement. Since:ll and llo3 exist as C(C,Po)
valued tight random elements of Loo(C), it remains only to identify \V(el l • •• I Gli:)
I as (W(C,), ... ,W(C.» where W is a C(C, Pol-valued tight random element of
Loo (C). Lemma. 8 in section 6 makes such an identification. •
Proof of Lemma 4. We will only give an outline of the proof. To prove the lemma
is true, we break the expression in (i) into two parts, one involving the Ioglikelihood
of the X;'s and the other involving the loglikelihood of );;; 2::, T(Y;). The
first part is straightforward to deal with. For the second part we must show
that the loglikelihood can be approximated by the log of the pdf of a multivariate
normal distribution with mean Ep..T and covariance matrix Varp"T evaluated at
Jm: 2::'1 T(Yi). To show that this can be done we use the fact that under the
hypotheses of the Theorem the pdf of Jm 2::'1T('Yi). under Pn can be uniformly
approximated by the pdf of the multivariate normal distribution with mean Ep..T
and covariance matrix VarpnT. (This can be proved. using an argument similar
to that used to prove Theorem XV.5.2, Feller, Vol.Il). Since ;;k 2::1 T(}i) is
oPo(1) the result follows. •
Proof of Lemma 5. The proof is similar to that of the Hajek-Le Cam Convolution
Theorem, see Theorem 3.1 (Roussas(1972) ). •
§6 Technicalities.
Definition 8. Let F be a collection of functions in Lz(Po). ).Ve say that F is
pointwise separable if there exists a countable subclass F" of F such that {or any
f E :F tbere exists a sequence Un} C r witb fn(~) --. f(~) for all ~ E IRd.
Definition 9. [Vapnik and Cervonenkis(1971») Call a class C of subsets of IRd
a VC class of sets of degree v if each set S of v points in IRd lias {ewer than 211
distinct subsets of the form ens picked out by members of C.
Definition 10. Call a class F of real-valued functions on JRd a VC graph class if
C" = {C/: C/ = {(~,t): 0:5 t:5 f(~)},J E:F}
37
is a VC class of sets in IRd+I.
Definition 11. (Pollard(1982» Let :F be a collection of functions witb envelopet
function F (i.e. III < F, for every f E :F). For each :finite subset 5 of!Rd and
each positive 6, take N FCC, 5, F) to be the smallest value of m for v,'hich there are
functions 'PI, ... ,~m in :F such tbat
mtn L(f(x) - ~;(x)J2 s 62 L F(r)2zES ~es
(or every I in :F. Denne
H(5) = HF(6,:F) = suplogNF(o,S,:F)s
the supremum running over all finite subsets afIRo. If
00
L 2-i H F(2- j ,.1") < 00
j=I
we say tbat .1" is sparse.
Lemma. 6. Let C be a separable (in L2(PO) ) collection of sets in IRd, Let J :
IRdX C - IR1 be a l.Jlliformly continuous function from C into L 2(PO) ) i.e. given
ally f > 0 tbere exists a 6> 0 such tbat lllel -lc~ IIL2(Po) < 0 implies Ilfh C1 )
f(·, C2)IIL~(Po) < e. Deline
F = {fC C) : C E C}.
Suppose F is a sparse, pointwise separable collection of functions wjtb envelope
F E L 2(Po). Define'll.,. E Loo(C) by
where:i«.n was defined in (3.1). Tben)
38
as random elements of (Loo(C),Bc,PD) where '7lo is a tight. Gaussia.c, random
element of L~(C) whose sample path, all belong to C(C, Po).
Proof. In Theorem VII.21 of Pollard(1984) he shows that under the above hy
potheses:'«n ~)Xo as random elements of (Loo(F),BF,Po) where;«1) is a tight,
Gaussian, random element of LOl::lF) whose sample pa.ths all belong: to C(F, Po).
The proof proceeds in the following way:-
He shows that
(i) ::Kft is a random element of (Loo(F),BF,Po) and
(ii) Given any ~ > 0, there exists a compact set, K, of completely regular points
such that for every open set, G, containing K, for which [)Xn E G] is measur
able, lim in! PrpXn E G] > 1 - f. Specifically, K is constructed as an inter
section of sets D t • D1 , ... where D" is a finite union of closed balls of radius
t centered at points ofC(F,Po).
(Note that II' E Loo (.1") is a completely regular point, with respect to a a-field
B, if all open balls centered at <p are contained in B.)
It is not difficult to see that we can adapt the proof of this theorem to show
7l n ~ 'Zlo as random elements of (Loo(C), Bc,Po)· In fact, we need to show that
(i) and (ii) are true with 'Zln in place of ::&{n, Bc,pg in place of BT,Pg and C(C, Po)
in place of C(F, Po).
Showing (ii) is true is easy. Just notice that Ie, C) being uniformly contin
uous over C implies C(F, Po) c C(C, Po). We now show that 'Zln is a random
element of (L~(C),Be,p,).
Because each
1 n
:an(C) =~. 0 fCC) = - 2)f(X"C) - Ep,f(X" C)]n i=l
is a real random variable, the finite dimensional projections create no difficulty. It
remains then to show that for any I.p E C(C, Po )
I (b) sup )71" (C) - If'(C) I = sup I~" 0 Je-, C) - 'P( C) Ic c
39
-
is measurable. However, the fact that :F is pointwise-separable with envelope
function F E L2(PO) and C is separable (in L2(PO) ) implies
where C· is a countable collection of sets in C. Measurability of the quantity in
(b) is thus clear and we are done. I
The following lemma is used in the proof of Theorem 1.
Lemma 7. Suppose C is a. VC class of sets in IRd. Tben
is aVe graph class.
Proof. We need to show tha.t :F is a VC graph class, i.e. we need to show
7 = {(x, t) : 0 ~ t ~ e9' T (: ) l A(x), A E C, (J E Bli((JO)}
is a VC class of sets. A typical set in 7 is of the form
{(x, t) : t > Ollogt ~ (J'T(x), x EA.} U {(x,O) : x E AC}
Let
71 = {A C x {OJ : A E C}
and let
40
Theorem 9,2.6 and (9.1.8) of Dudle)'(1982) implies 1'1 is a. VC class of sets. Define
9 : IRd x R+ J-+ IRb
X lRl by g(x, t) = (T(x), log t).
ThenC.4.,9 = {(x,t): t > 0, logt < O'T(x)} nA0IR+1
= g-1(~9) n A.. 0 IR+1
where
/:19 = {(r,s) E IRd+1; (r,s)' (0,-1) > O}.
Theorem 9.2.2 of Dudley (1982) implies tha.t
is a VC class of sets, since {~9 : °E B6(OO)} is one. Theorem 9.2.6 of Dudley
(1982) implies that
;4 = {A0IR+1: A E C}
is a VC class of sets. We then conclude, again from Theorem 9.2.6 of Dudley(1982),
that
is a VC class of sets and the lemma is proved. •
The proof of the following Lemma is contained in the proof of Theorem 5.3.1,
Bickel et al. (1989). We include it here for completeness.
Lemma 8. Let 7l and 7lo be tigbt random elements of Loo(C) witb sample paths
in C(C , Po), wbere C is a. totally bounded subset in L 2(Po). Suppose for every
{CI, ... ICt } C C, k > 1,
(i)
-
where W(CIt ... 1 C.t) is independent of llo(C J , ..... C.t). Then
d7l == 710 + \V
- , ,
41
where W is a tight, random element of Loo(C) with sample paths in C(C, Po) and
\V is independent of '!Lo.I
Proof, Since 1l and 1lo exist as C(C,Po)-valued random elements of Loo(C), it
reameins only to identify W(C!> ... ,C.) as (W(C,), ... , W(C.)) where W is a
C(C, Po )-valued random elements of Loo(C).
Let k ~ L Since C is totally bounded there exists a finite .}Jc net for C, say
{C"
... , Cm} with m = m(k). Thus for each C E C there is a C· E {C"
... ,Cm}
such that lie - C-IIL J ( Po) < '}A;' Define .6i : C -+ IR1 by
(a) v;(C) = (1 - kliC - C;IIL,(Poj)+, i = I, ... , m
andv,(C)
ti;(G) = "m -(C)LtJ=l vJ
i = l, .. "m.
Each ~ .. is 8. uniformly continous function that vanishes off the ball of radius tabout Ci. For every C E C there is at least one C.. so that Vi(C) > 1/2. Thus the
~i are non-negative uniformly cont.inous functions that sum to 1 everywhere in C.
Now define interpolation functions 1lrn. and ?Lom of 7l and tlo by
(b)
(e)
m
7lm(G):; I: A,(G)71(G,),;=1
m
7lm(G) = I:A;(G)71(G,),.=1
G E C
G E C
and use the random vector W (C1 , • . • , em.) =(WI, ... I Wm.y to define a process
W m by
m
(d) W meG) = I: ti;(G)Wi •
.=1
Denote the laws of these processes on (C(C,Po),Bc,po) by Pm == .c(71m), POrn. =.c(71om) and Rm =crw m).
-
, • 4
42
It follows from (i) and linearity of the interpolated processes that
Now since PWI :::;. l' =£(7l) and POm :::;. Po = .c(71o), given any ~ > 0 there exis t
compact sets K I =KI(e:) and K 2 == K2(~) such that lim inf'.; Pm(Kt} ~ 1- e: and
Iim inf's, POm(K2 ) ?: 1 - e, respectively. Hence, for m 2: some m(e:)
1 - e :5 Pm(KI ) =JRm(K I - x)dPom(x)
:5 r RWI(K1 - ~)dPom(~) + eJK2
or
Thus Rm(Kt - x) ~ 1- 2e far some sequence {x Wl } C K 2 • Hence the compact set
has Rm (K 1 - K 2 ) > 1-2e: for m ~ m(e) and it follows that liminfm Rm (K I - K 2 ) >
1-2e. Hence k; is tight and there exists a weakly convergent subsequence {RWI ,},
Rm • :::;. il. on (C(e, Po), Bc,Po)' Let LeW) = R and the conclusion of the Lemma
follows. •
Acknowledgments. Part of this research was carried out at the University of
Washington in Seattle as part of the requirements for a Ph.D. degree in statistics.
I would like to thank Professor Jon Wellner for his help and encouragement during
the course of this research. Thanks are also due to Aad van der Waart with whom
I had some discussions about the results in section 4.
43
References
I Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1989). Efficientand adaptive" inference in semiparametric models. Forthcoming monograph, JohnsHopkins University Press, Baltimore.
Csiszar, 1. (1975). I-divergence geometry of probability distributions and minimization problems. Annals of Probability 3 146-15·8.
Dudley, R. M. (1984). A course on empirical processes. Ecole d'lhe de Probabilities de St. Flour, Lecture Notes in 1-latb., Springer-Verlag, New York 1097(1984),2-142.
Feller, W. (1971) An Introduction to Probability Theory and its Applications.YoU!, 2nd ed., Wiley, New York.
Haberman, S. J. (1984). Adjustment by minimum discriminant information. Ann.Stat. 12 971-988.
Hajek, J. (1972). A characterization of limiting distributions of regular estimators.Z. Wahr. Verw. Gebeite 14 323-330.
Ireland, C. T. and Kullback, S (1968) Contingency tables with given marginals.Biometrika 55 179-188.
Jagers, P., Oden, A., and Trulsson, L (1985). Post-stratification and ratio estimation: Usages of auxilliary information in survey sampling and opinion polls, Inter.Statist. Rev. 53 221-238.
Le Cam, 1. and Yang, G.L (1988) On the preservation of local asymptotic normality under information loss Ann. Stat. 16.Mosteller, F (1968). Association and estimation in contingency tables, J. Amer.Statist. Assoc. 63 1-28.
Parthasarathy, K. R (1967). Probability measures on metric spaces. AcademicPress, New York.
Pollard, D (1982) A central limit theorem for empirical processes, J. Austral.Math. Soc. A33 235-248.
Pollard, D (1984) Convergence of Stochastic Processes. Springer-Verlag, NewYork.
Roussas, G.G. (1972) Contiguity of Probability Measures; Some Applications inStatistics. Cambridge University Press, Cambridge.
Sheehy, A (1987) Constrained estimation and clustering based on Minimum K ullbackLeible:r information methods. Unpublished Ph.D dissertation, Department ofStatistics, University of Washington, Seattle, Washington.
44
'Sheehy, A. and Wellner, J. A (1987) Uniformity in P of some limit theoremsfor empirical measures and processes, Tech. Report No. 87, Dept. of Statist.,University of Washington, Seattle.
van der Vaart, A.W (1987) Statistical Estimation in Large Parameter Spaces CWItracts, Centrum voor Wiskunde en Informatica, Amsterdam.
Vapnik, V. N. and Chervonenkis, A. Ya (1971). On uniform convergence of thefrequencies of events to their probabilities, Theor. Prob. Appl. 16 264-280.