

ADVANCED PROBABILITY AND STATISTICAL INFERENCE I

Lecture Notes of BIOS 760

[Figure: Distribution of the normalized sum of n i.i.d. Uniform random variables — histograms for n = 1, 10, 100, and 1000.]

PREFACE

These course notes have been revised based on my past teaching experience at the Department of Biostatistics of the University of North Carolina in Fall 2004 and Fall 2005. The content includes distribution theory, probability and measure theory, large sample theory, the theory of point estimation, and efficiency theory. The last chapter focuses on the maximum likelihood approach. Knowledge of fundamental real analysis and statistical inference will be helpful for reading these notes.

Most parts of the notes are compiled, with moderate changes, from two valuable textbooks: Theory of Point Estimation (second edition, Lehmann and Casella, 1998) and A Course in Large Sample Theory (Ferguson, 2002). Some notes are also borrowed from a similar course taught at the University of Washington, Seattle, by Professor Jon Wellner. The revision has incorporated valuable comments from my colleagues and from students sitting in my previous classes. However, there are inevitably numerous errors remaining in the notes, and I take full responsibility for them.

Donglin Zeng
August 2006

CHAPTER 1 A REVIEW OF DISTRIBUTION THEORY

This chapter reviews some basic concepts of discrete and continuous random variables. Distribution results on algebra and transformations of random variables (vectors) are given. Part of the chapter pays special attention to the properties of the Gaussian distributions. The final part of this chapter introduces some commonly used distribution families.

1.1 Basic Concepts

Random variables are often classified into discrete random variables and continuous random variables. As the names suggest, discrete random variables are variables taking discrete values, with an associated probability mass function, while continuous random variables are variables taking non-discrete values (usually in R), with an associated probability density function. A probability mass function consists of countably many non-negative values whose total sum is one, and a probability density function is a non-negative function on the real line whose integral over the whole line is one.

However, the above definitions are not rigorous. What is the precise definition of a random variable? Why should we distinguish between mass functions and density functions? Can a random variable be both discrete and continuous? The answers to these questions will become clear in the next chapter on probability measure theory. However, you may take a glimpse below:

(a) Random variables are essentially measurable functions from a probability measure space to the real line. In particular, discrete random variables map into a discrete set and continuous random variables map into the whole real line.

(b) Probability (a probability measure) is a function assigning non-negative values to the sets of a σ-field, and it satisfies the property of countable additivity.

(c) The probability mass function of a discrete random variable is the Radon-Nikodym derivative of the measure induced by the random variable with respect to a counting measure. The probability density function of a continuous random variable is the Radon-Nikodym derivative of the induced measure with respect to the Lebesgue measure.

For this chapter, we do not need to worry about these abstract definitions.

Some quantities that describe the distribution of a random variable include the cumulative distribution function, mean, variance, quantiles, mode, moments, centralized moments, kurtosis and skewness. For instance, suppose X is a discrete random variable taking values x_1, x_2, ... with probabilities m_1, m_2, .... The cumulative distribution function of X is defined as F_X(x) = ∑_{x_i ≤ x} m_i. The kth moment of X is given as E[X^k] = ∑_i m_i x_i^k and the kth centralized moment of X is given as E[(X − µ)^k], where µ is the expectation of X. If X is a continuous random variable with probability density function f_X(x), then the cumulative distribution function is F_X(x) = ∫_{−∞}^{x} f_X(t) dt and the kth moment of X is given as E[X^k] = ∫_{−∞}^{∞} x^k f_X(x) dx whenever the integral is finite.

The skewness of X is given by E[(X − µ)^3]/Var(X)^{3/2} and the kurtosis of X is given by E[(X − µ)^4]/Var(X)^2. These last two quantities describe the shape of the density function: negative values of the skewness indicate a distribution that is skewed left and positive values indicate a distribution that is skewed right. By skewed left, we mean that the left tail is heavier than the right tail; similarly, skewed right means that the right tail is heavier than the left tail. Large kurtosis indicates a "peaked" distribution and small kurtosis indicates a "flat" distribution. Note that we have already used E[g(X)] to denote the expectation of g(X). Sometimes we use ∫ g(x) dF_X(x) to represent it, no matter whether X is continuous or discrete. This notation will become clear after we introduce the probability measure.

Next we review an important definition in distribution theory, namely the characteristic function of X. By definition, the characteristic function of X is φ_X(t) = E[exp{itX}] = ∫ exp{itx} dF_X(x), where i is the imaginary unit, the square root of −1. Equivalently, φ_X(t) is equal to ∫ exp{itx} f_X(x) dx for continuous X and to ∑_j m_j exp{itx_j} for discrete X. The characteristic function is important since it uniquely determines the distribution function of X, a fact implied by the following theorem.

Theorem 1.1 (Uniqueness Theorem) If a random variable X with distribution function F_X has characteristic function φ_X(t) and if a and b are continuity points of F_X, then

F_X(b) − F_X(a) = lim_{T→∞} \frac{1}{2π} ∫_{−T}^{T} \frac{e^{−ita} − e^{−itb}}{it} φ_X(t) dt.

Moreover, if F_X has a density function f_X (for a continuous random variable X), then

f_X(x) = \frac{1}{2π} ∫_{−∞}^{∞} e^{−itx} φ_X(t) dt.

†

We defer the proof to Chapter 3. Similar to the characteristic function, we can define the moment generating function of X as M_X(t) = E[exp{tX}]. However, we note that M_X(t) may not exist for some t, while φ_X(t) always exists.

Another important and distinctive feature in distribution theory is the independence of two random variables. For two random variables X and Y, we say X and Y are independent if P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y); i.e., the joint distribution function of (X, Y) is the product of the two marginal distribution functions. If (X, Y) has a joint density, then an equivalent definition is that the joint density of (X, Y) is the product of the two marginal densities. Independence yields many useful properties; one important property is that E[g(X)h(Y)] = E[g(X)]E[h(Y)] for any sensible functions g and h. In the more general case when X and Y may not be independent, we can calculate the conditional density of X given Y, denoted by f_{X|Y}(x|y), as the ratio between the joint density of (X, Y) and the marginal density of Y. Thus, the conditional expectation of X given Y = y is equal to E[X|Y = y] = ∫ x f_{X|Y}(x|y) dx. Clearly, when X and Y are independent, f_{X|Y}(x|y) = f_X(x) and E[X|Y = y] = E[X]. For conditional expectations, two formulae are useful:

E[X] = E[E[X|Y]] and Var(X) = E[Var(X|Y)] + Var(E[X|Y]).
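Both identities are easy to check by simulation. Below is a minimal Python sketch, assuming the hypothetical hierarchical model Y ∼ Uniform(0, 1) and X | Y ∼ N(Y, 1), chosen only for illustration:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical hierarchical model: Y ~ Uniform(0, 1), X | Y ~ N(Y, 1)
y = rng.uniform(0.0, 1.0, size=n)
x = rng.normal(loc=y, scale=1.0)

# E[X] versus E[E[X|Y]]: here E[X|Y] = Y, so the inner expectation is simply y
print(x.mean(), y.mean())              # both approximately 1/2

# Var(X) versus E[Var(X|Y)] + Var(E[X|Y]): here Var(X|Y) = 1 and E[X|Y] = Y
print(x.var(), 1.0 + y.var())          # both approximately 1 + 1/12

The two printed pairs agree up to Monte Carlo error, illustrating the law of total expectation and the variance decomposition.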

So far, we have reviewed some basic concepts for a single random variable. All the above definitions can be generalized to a multivariate random vector X = (X_1, ..., X_k)′ with a joint probability mass function or a joint density function. For example, we can define the mean vector of X as E[X] = (E[X_1], ..., E[X_k])′ and define the covariance matrix of X as E[XX′] − E[X]E[X]′. The cumulative distribution function of X is the k-variate function F_X(x_1, ..., x_k) = P(X_1 ≤ x_1, ..., X_k ≤ x_k) and the characteristic function of X is the k-variate function defined as

φ_X(t_1, ..., t_k) = E[e^{i(t_1 X_1 + ... + t_k X_k)}] = ∫_{R^k} e^{i(t_1 x_1 + ... + t_k x_k)} dF_X(x_1, ..., x_k).

As in Theorem 1.1, an inversion formula holds: let A = {(x_1, ..., x_k) : a_1 < x_1 ≤ b_1, ..., a_k < x_k ≤ b_k} be a rectangle in R^k and assume P(X ∈ ∂A) = 0, where ∂A is the boundary of A. Then

P(X ∈ A) = lim_{T→∞} \frac{1}{(2π)^k} ∫_{−T}^{T} ··· ∫_{−T}^{T} ∏_{j=1}^{k} \frac{e^{−it_j a_j} − e^{−it_j b_j}}{it_j} φ_X(t_1, ..., t_k) dt_1 ··· dt_k.

Finally, we can define the conditional density, the conditional expectation, and the independence of two random vectors similarly to the univariate case.

1.2 Examples of Special Distributions

We list some commonly-used distributions in the following examples.

Example 1.1 (Bernoulli Distribution and Binomial Distribution) A random variable X is said to be Bernoulli(p) if P(X = 1) = p = 1 − P(X = 0). If X_1, ..., X_n are independent, identically distributed (i.i.d.) Bernoulli(p), then S_n = X_1 + ... + X_n has a binomial distribution, denoted by S_n ∼ Binomial(n, p), with

P(S_n = k) = \binom{n}{k} p^k (1 − p)^{n−k}, k = 0, 1, ..., n.

The mean of S_n is equal to np and the variance of S_n is equal to np(1 − p). The characteristic function of S_n is given by

E[e^{itS_n}] = (1 − p + pe^{it})^n.

Clearly, if S_1 ∼ Binomial(n_1, p) and S_2 ∼ Binomial(n_2, p) and S_1, S_2 are independent, then S_1 + S_2 ∼ Binomial(n_1 + n_2, p).

Example 1.2 (Geometric Distribution and Negative Binomial Distribution) Let X_1, X_2, ... be i.i.d. Bernoulli(p). Define W_1 = min{n : X_1 + ... + X_n = 1}. Then it is easy to see that

P(W_1 = k) = (1 − p)^{k−1} p, k = 1, 2, ...

We say W_1 has a geometric distribution: W_1 ∼ Geometric(p). More generally, define W_m = min{n : X_1 + ... + X_n = m} to be the first time that m successes are obtained. Then

P(W_m = k) = \binom{k − 1}{m − 1} p^m (1 − p)^{k−m}, k = m, m + 1, ...

W_m is said to have a negative binomial distribution: W_m ∼ Negative Binomial(m, p). The mean of W_m is equal to m/p and the variance of W_m is m/p^2 − m/p. If Z_1 ∼ Negative Binomial(m_1, p) and Z_2 ∼ Negative Binomial(m_2, p) and Z_1, Z_2 are independent, then

Z_1 + Z_2 ∼ Negative Binomial(m_1 + m_2, p).

Example 1.3 (Hypergeometric Distribution) A hypergeometric distribution can be obtained using the following urn model: suppose that an urn contains N balls, with M bearing the number 1 and N − M bearing the number 0. We randomly draw a ball and denote its number by X_1. Clearly, X_1 ∼ Bernoulli(p), where p = M/N. Now replace the ball in the urn and randomly draw a second ball with number X_2, and so forth. Let S_n = X_1 + ... + X_n be the sum of all the numbers in n draws. Clearly, S_n ∼ Binomial(n, p). However, if each time we draw a ball we do not replace it, then X_1, ..., X_n are dependent random variables. It is known that S_n then has a hypergeometric distribution:

P(S_n = k) = \frac{\binom{M}{k}\binom{N−M}{n−k}}{\binom{N}{n}}, k = 0, 1, ..., n.

We write S_n ∼ Hypergeometric(N, M, n).

Example 1.4 (Poisson Distribution) A random variable X is said to have a Poisson distribution with rate λ, denoted X ∼ Poisson(λ), if

P(X = k) = \frac{λ^k e^{−λ}}{k!}, k = 0, 1, 2, ...

It is known that E[X] = Var(X) = λ and the characteristic function of X is equal to exp{−λ(1 − e^{it})}. Thus, if X_1 ∼ Poisson(λ_1) and X_2 ∼ Poisson(λ_2) are independent, then X_1 + X_2 ∼ Poisson(λ_1 + λ_2). It is also straightforward to check that, conditional on X_1 + X_2 = n, X_1 is Binomial(n, λ_1/(λ_1 + λ_2)). In fact, a Poisson distribution can be considered as the limit of a sum of a sequence of Bernoulli trials, each with small success probability: suppose that X_{n1}, ..., X_{nn} are i.i.d. Bernoulli(p_n) and np_n → λ. Then S_n = X_{n1} + ... + X_{nn} has a Binomial(n, p_n) distribution. We note that for fixed k, as n becomes large,

P(S_n = k) = \frac{n!}{k!(n−k)!} p_n^k (1 − p_n)^{n−k} → \frac{λ^k}{k!} e^{−λ}.
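A quick numerical illustration of this Poisson limit (a minimal sketch; the values n = 1000 and λ = 3 are arbitrary choices):

import math

lam, n = 3.0, 1000          # illustrative values; p_n is chosen so that n * p_n = lam
p = lam / n

for k in range(6):
    binom_pmf = math.comb(n, k) * p**k * (1 - p)**(n - k)
    poisson_pmf = lam**k * math.exp(-lam) / math.factorial(k)
    print(k, round(binom_pmf, 6), round(poisson_pmf, 6))

The two columns of probabilities agree to several decimal places, as the limit predicts.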

Example 1.5 (Multinomial Distribution) Suppose that B_1, ..., B_k is a partition of R. Let Y_1, ..., Y_n be i.i.d. random variables. Let X_i = (X_{i1}, ..., X_{ik}) ≡ (I_{B_1}(Y_i), ..., I_{B_k}(Y_i)) for i = 1, ..., n and set N = (N_1, ..., N_k) = ∑_{i=1}^{n} X_i. That is, N_l, 1 ≤ l ≤ k, counts the number of times that Y_1, ..., Y_n fall into B_l. It is easy to calculate

P(N_1 = n_1, ..., N_k = n_k) = \binom{n}{n_1, ..., n_k} p_1^{n_1} ··· p_k^{n_k}, n_1 + ... + n_k = n,

where p_1 = P(Y_1 ∈ B_1), ..., p_k = P(Y_1 ∈ B_k). Such a distribution is called the multinomial distribution, denoted Multinomial(n, (p_1, ..., p_k)). We note that each N_l has a binomial distribution with mean np_l. Moreover, the covariance matrix of (N_1, ..., N_k) is given by

n \begin{pmatrix} p_1(1 − p_1) & \cdots & −p_1 p_k \\ \vdots & \ddots & \vdots \\ −p_1 p_k & \cdots & p_k(1 − p_k) \end{pmatrix}.

Example 1.6 (Uniform Distribution) A random variable X has a uniform distribution on an interval [a, b] if X's density function is given by I_{[a,b]}(x)/(b − a), denoted by X ∼ Uniform(a, b). Moreover, E[X] = (a + b)/2 and Var(X) = (b − a)^2/12.

Example 1.7 (Normal Distribution) The normal distribution is the most commonly used distribution; a random variable X with distribution N(µ, σ^2) has the probability density function

\frac{1}{\sqrt{2πσ^2}} \exp\{−\frac{(x − µ)^2}{2σ^2}\}.

Moreover, E[X] = µ and Var(X) = σ^2. The characteristic function of X is given by exp{itµ − σ^2 t^2/2}. We will discuss this distribution in detail later.

Example 1.8 (Gamma Distribution) A Gamma distribution has probability density

\frac{1}{β^θ Γ(θ)} x^{θ−1} \exp\{−x/β\}, x > 0,

denoted by Γ(θ, β). It has mean θβ and variance θβ^2. In particular, when θ = 1, the distribution is called the exponential distribution, Exp(β). When θ = n/2 and β = 2, the distribution is called the chi-square distribution with n degrees of freedom, denoted by χ^2_n.

Example 1.9 (Cauchy Distribution) The density of a random variable X ∼ Cauchy(a, b) has the form

\frac{1}{bπ\{1 + (x − a)^2/b^2\}}.

Note that E[|X|] = ∞, so the mean of X does not exist. This distribution is often used as a counterexample in distribution theory.

Many other distributions can be constructed using elementary algebra, such as sums, products, and quotients of the above special distributions. We will discuss them in the next section.

1.3 Algebra and Transformation of Random Variables (Vectors)

In many applications, one wishes to calculate the distribution of some algebraic expression of independent random variables. For example, suppose that X and Y are two independent random variables. We wish to find the distributions of X + Y, XY and X/Y (we assume Y > 0 for the last two cases).

The calculation of these algebraic distributions is often done using conditional expectations. To see how this works, denote by F_Z(·) the cumulative distribution function of any random variable Z. Then for X + Y,

F_{X+Y}(z) = E[I(X + Y ≤ z)] = E_Y[E_X[I(X ≤ z − Y)|Y]] = E_Y[F_X(z − Y)] = ∫ F_X(z − y) dF_Y(y);

symmetrically,

F_{X+Y}(z) = ∫ F_Y(z − x) dF_X(x).

The above formula is called the convolution formula, sometimes denoted by F_X ∗ F_Y(z). If both X and Y have density functions f_X and f_Y respectively, then the density function of X + Y is equal to

f_X ∗ f_Y(z) ≡ ∫ f_X(z − y) f_Y(y) dy = ∫ f_Y(z − x) f_X(x) dx.
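The density convolution formula can also be evaluated numerically on a grid. A minimal sketch for X, Y ∼ Uniform(0, 1), whose sum has the triangular density on [0, 2] (the grid resolution is an arbitrary choice):

import numpy as np

# Approximate f_{X+Y}(z) = integral of f_X(z - y) f_Y(y) dy by a Riemann sum,
# for X, Y ~ Uniform(0, 1); the exact answer is the triangular density on [0, 2].
grid = np.linspace(0.0, 2.0, 2001)
dy = grid[1] - grid[0]

def f_unif(x):
    return ((x >= 0.0) & (x <= 1.0)).astype(float)

for z in (0.25, 0.5, 1.0, 1.5):
    approx = np.sum(f_unif(z - grid) * f_unif(grid)) * dy
    exact = z if z <= 1.0 else 2.0 - z
    print(z, round(approx, 4), exact)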

Similarly, we can obtain the formulae for XY and X/Y as follows:

F_{XY}(z) = E[E[I(XY ≤ z)|Y]] = ∫ F_X(z/y) dF_Y(y),   f_{XY}(z) = ∫ f_X(z/y) (1/y) f_Y(y) dy,

F_{X/Y}(z) = E[E[I(X/Y ≤ z)|Y]] = ∫ F_X(yz) dF_Y(y),   f_{X/Y}(z) = ∫ f_X(yz) y f_Y(y) dy.

These formulae can be used to construct some familiar distributions from simple random variables. We assume X and Y are independent in the following examples.

Example 1.10 (i) If X ∼ N(µ_1, σ_1^2) and Y ∼ N(µ_2, σ_2^2), then X + Y ∼ N(µ_1 + µ_2, σ_1^2 + σ_2^2).
(ii) X ∼ Cauchy(0, σ_1) and Y ∼ Cauchy(0, σ_2) implies X + Y ∼ Cauchy(0, σ_1 + σ_2).
(iii) X ∼ Gamma(r_1, θ) and Y ∼ Gamma(r_2, θ) implies X + Y ∼ Gamma(r_1 + r_2, θ).
(iv) X ∼ Poisson(λ_1) and Y ∼ Poisson(λ_2) implies X + Y ∼ Poisson(λ_1 + λ_2).
(v) X ∼ Negative Binomial(m_1, p) and Y ∼ Negative Binomial(m_2, p) implies X + Y ∼ Negative Binomial(m_1 + m_2, p).

The results in Example 1.10 can be verified using the convolution formula. However, these results can also be obtained using characteristic functions, as stated in the following theorem.

Theorem 1.2 Let φ_X(t) denote the characteristic function of X. Suppose X and Y are independent. Then φ_{X+Y}(t) = φ_X(t) φ_Y(t). †

The proof is direct. We can use Theorem 1.2 to find the distribution of X + Y. For example, in (i) of Example 1.10, we know φ_X(t) = exp{iµ_1 t − σ_1^2 t^2/2} and φ_Y(t) = exp{iµ_2 t − σ_2^2 t^2/2}. Thus,

φ_{X+Y}(t) = exp{i(µ_1 + µ_2)t − (σ_1^2 + σ_2^2) t^2/2},

and the latter is the characteristic function of a normal distribution with mean (µ_1 + µ_2) and variance (σ_1^2 + σ_2^2).

Example 1.11 Let X ∼ N(0, 1), Y ∼ χ^2_m and Z ∼ χ^2_n be independent. Then

X/\sqrt{Y/m} ∼ Student's t(m),
(Y/m)/(Z/n) ∼ Snedecor's F_{m,n},
Y/(Y + Z) ∼ Beta(m/2, n/2),

where

f_{t(m)}(x) = \frac{Γ((m + 1)/2)}{\sqrt{πm}\, Γ(m/2)} \frac{1}{(1 + x^2/m)^{(m+1)/2}} I_{(−∞,∞)}(x),

f_{F_{m,n}}(x) = \frac{Γ((m + n)/2)}{Γ(m/2)Γ(n/2)} \frac{(m/n)^{m/2} x^{m/2−1}}{(1 + mx/n)^{(m+n)/2}} I_{(0,∞)}(x),

f_{Beta(a,b)}(x) = \frac{Γ(a + b)}{Γ(a)Γ(b)} x^{a−1} (1 − x)^{b−1} I(0 < x < 1).

Example 1.12 If Y_1, ..., Y_{n+1} are i.i.d. Exp(θ), then

Z_i = \frac{Y_1 + ... + Y_i}{Y_1 + ... + Y_{n+1}} ∼ Beta(i, n − i + 1).

In particular, (Z_1, ..., Z_n) has the same joint distribution as the order statistics (ξ_{n:1}, ..., ξ_{n:n}) of n Uniform(0,1) random variables.
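The order-statistics claim in Example 1.12 is easy to check by simulation. A minimal sketch using numpy (θ = 1 is chosen arbitrarily, since the ratios Z_i do not depend on θ):

import numpy as np

rng = np.random.default_rng(1)
n, reps = 5, 200_000

# Normalized partial sums of n+1 i.i.d. exponentials
y = rng.exponential(scale=1.0, size=(reps, n + 1))
z = np.cumsum(y, axis=1)[:, :n] / y.sum(axis=1, keepdims=True)

# Order statistics of n i.i.d. Uniform(0,1) variables
u = np.sort(rng.uniform(size=(reps, n)), axis=1)

# Compare, e.g., the coordinate-wise means: both should be close to i/(n+1)
print(z.mean(axis=0))
print(u.mean(axis=0))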

Both the results in Examples 1.11 and 1.12 can be derived using the formulae at the beginning of this section. We now turn to transformations of random variables (vectors). In particular, the following theorem holds.

Theorem 1.3 Suppose that X is a k-dimensional random vector with density function f_X(x_1, ..., x_k). Let g be a one-to-one and continuously differentiable map from R^k to R^k. Then Y = g(X) is a random vector with density function

f_Y(y_1, ..., y_k) = f_X(g^{−1}(y_1, ..., y_k)) |J_{g^{−1}}(y_1, ..., y_k)|,

where g^{−1} is the inverse of g and J_{g^{−1}} is the Jacobian determinant of g^{−1}. †

The proof is simply based on the change-of-variables formula for integrals. One application of this result is given in the following example.

Example 1.13 Let X and Y be two independent standard normal random variables. Consider the polar coordinates of (X, Y), i.e., X = R cos Θ and Y = R sin Θ. Then Theorem 1.3 gives that R^2 and Θ are independent and, moreover, R^2 ∼ Exp(2) and Θ ∼ Uniform(0, 2π). As an application, if one can simulate variables from a uniform distribution (for Θ) and an exponential distribution (for R^2), then using X = R cos Θ and Y = R sin Θ produces variables from a standard normal distribution. This is essentially how normally distributed numbers are generated in many statistical packages.
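A minimal sketch of this construction (often known as the Box-Muller transform), assuming only that Uniform(0,1) draws are available; the exponential variable R^2 ∼ Exp(2) is generated by inversion as −2 log U:

import numpy as np

rng = np.random.default_rng(2)
n = 100_000

u1 = rng.uniform(size=n)          # used to build R^2 ~ Exp(2) via inversion
u2 = rng.uniform(size=n)          # used to build Theta ~ Uniform(0, 2*pi)

r = np.sqrt(-2.0 * np.log(u1))    # R = sqrt(R^2), with R^2 = -2 log(U1)
theta = 2.0 * np.pi * u2

x = r * np.cos(theta)             # X and Y are independent N(0, 1)
y = r * np.sin(theta)

print(x.mean(), x.std(), y.mean(), y.std())   # approximately 0, 1, 0, 1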


1.4 Multivariate Normal Distribution

One particular distribution we will encounter in large-sample theory is the multivariate normal distribution. A random vector Y = (Y_1, ..., Y_n)′ is said to have a multivariate normal distribution with mean vector µ = (µ_1, ..., µ_n)′ and non-degenerate covariance matrix Σ_{n×n}, denoted N(µ, Σ) or N_n(µ, Σ) to emphasize Y's dimension, if Y has joint density

f_Y(y_1, ..., y_n) = \frac{1}{(2π)^{n/2}|Σ|^{1/2}} \exp\{−\frac{1}{2}(y − µ)′Σ^{−1}(y − µ)\}.

We can derive the characteristic function of Y by direct calculation (completing the square in the exponent):

φ_Y(t) = E[e^{it′Y}]
= \frac{1}{(2π)^{n/2}|Σ|^{1/2}} ∫ \exp\{it′y − \frac{1}{2}(y − µ)′Σ^{−1}(y − µ)\} dy
= \frac{1}{(2π)^{n/2}|Σ|^{1/2}} ∫ \exp\{−\frac{1}{2} y′Σ^{−1}y + (it + Σ^{−1}µ)′y − \frac{1}{2} µ′Σ^{−1}µ\} dy
= \frac{\exp\{−µ′Σ^{−1}µ/2\}}{(2π)^{n/2}|Σ|^{1/2}} ∫ \exp\{−\frac{1}{2}(y − Σit − µ)′Σ^{−1}(y − Σit − µ) + \frac{1}{2}(Σit + µ)′Σ^{−1}(Σit + µ)\} dy
= \exp\{it′µ − \frac{1}{2} t′Σt\}.

In particular, if Y has the standard multivariate normal distribution with mean zero and covariance I_{n×n}, then φ_Y(t) = exp{−t′t/2}.

The following theorem describes the properties of a multivariate normal distribution.

Theorem 1.4 If Y = A_{n×k} X_{k×1}, where X ∼ N_k(0, I) (the standard multivariate normal distribution), then Y's characteristic function is given by

φ_Y(t) = exp{−t′Σt/2}, t = (t_1, ..., t_n)′ ∈ R^n,

where Σ = AA′, and rank(Σ) = rank(A). Conversely, if φ_Y(t) = exp{−t′Σt/2} with Σ_{n×n} ≥ 0 of rank k, then Y can be written as Y = A_{n×k} X_{k×1} with rank(A) = k and X ∼ N_k(0, I).

†

Proof First,

φ_Y(t) = E[exp{it′(AX)}] = E[exp{i(A′t)′X}] = exp{−(A′t)′(A′t)/2} = exp{−t′AA′t/2}.

Thus Σ = AA′ and rank(Σ) = rank(A). Conversely, if φ_Y(t) = exp{−t′Σt/2}, then from matrix theory there exists an orthogonal matrix O such that Σ = O′DO, where D is a diagonal matrix whose first k diagonal elements are positive and whose remaining (n − k) elements are zero. Denote these positive diagonal elements by d_1, ..., d_k. Define Z = OY. Then the characteristic function of Z is given by

φ_Z(t) = E[exp{it′(OY)}] = E[exp{i(O′t)′Y}] = exp{−(O′t)′Σ(O′t)/2} = exp{−d_1 t_1^2/2 − ... − d_k t_k^2/2}.

This implies that Z_1, ..., Z_k are independent N(0, d_1), ..., N(0, d_k) and Z_{k+1} = ... = Z_n = 0. Let X_i = Z_i/\sqrt{d_i} for i = 1, ..., k and write O′ = (B_{n×k}, C_{n×(n−k)}). Then

Y = O′Z = B_{n×k} (Z_1, ..., Z_k)′ = B_{n×k} diag(\sqrt{d_1}, ..., \sqrt{d_k}) (X_1, ..., X_k)′ ≡ AX.

Clearly, rank(A) = k. †
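The representation Y = AX in Theorem 1.4 is also how multivariate normal vectors are commonly simulated in practice: take A to be any matrix with AA′ = Σ (for instance a Cholesky factor) and apply it to i.i.d. standard normals. A minimal sketch, with an arbitrary illustrative mean and covariance:

import numpy as np

rng = np.random.default_rng(3)

mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # illustrative positive definite covariance

A = np.linalg.cholesky(sigma)             # A A' = Sigma, so Y = mu + A X ~ N(mu, Sigma)
x = rng.standard_normal(size=(2, 100_000))
y = mu[:, None] + A @ x

print(y.mean(axis=1))                     # approximately mu
print(np.cov(y))                          # approximately Sigma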

Theorem 1.5 Suppose that Y = (Y_1, ..., Y_k, Y_{k+1}, ..., Y_n)′ has a multivariate normal distribution with mean µ = (µ^{(1)′}, µ^{(2)′})′ and a non-degenerate covariance matrix

Σ = \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix}.

Then
(i) (Y_1, ..., Y_k)′ ∼ N_k(µ^{(1)}, Σ_{11}).
(ii) (Y_1, ..., Y_k)′ and (Y_{k+1}, ..., Y_n)′ are independent if and only if Σ_{12} = Σ_{21} = 0.
(iii) For any matrix A_{m×n}, AY has a multivariate normal distribution with mean Aµ and covariance AΣA′.
(iv) The conditional distribution of Y^{(1)} = (Y_1, ..., Y_k)′ given Y^{(2)} = (Y_{k+1}, ..., Y_n)′ is the multivariate normal distribution

Y^{(1)} | Y^{(2)} ∼ N_k(µ^{(1)} + Σ_{12}Σ_{22}^{−1}(Y^{(2)} − µ^{(2)}), Σ_{11} − Σ_{12}Σ_{22}^{−1}Σ_{21}).

†

Proof (i) From Theorem 1.4, write Y − µ = AX with AA′ = Σ and X ∼ N_n(0, I). Then (Y_1, ..., Y_k)′ − µ^{(1)} = D(Y − µ) = (DA)X, where D = (I_{k×k} 0_{k×(n−k)}), so its characteristic function is exp{−t′(DA)(DA)′t/2} = exp{−t′DΣD′t/2}. Thus the characteristic function is equal to

exp{−(t_1, ..., t_k)Σ_{11}(t_1, ..., t_k)′/2},

which is the same as the characteristic function of N_k(0, Σ_{11}).
(ii) The characteristic function of Y can be written as

exp[ it^{(1)′}µ^{(1)} + it^{(2)′}µ^{(2)} − \frac{1}{2}\{ t^{(1)′}Σ_{11}t^{(1)} + 2 t^{(1)′}Σ_{12}t^{(2)} + t^{(2)′}Σ_{22}t^{(2)} \} ].

If Σ_{12} = 0, the characteristic function factorizes as the product of separate functions of t^{(1)} and t^{(2)}. Thus Y^{(1)} and Y^{(2)} are independent. The converse is obviously true.
(iii) The result follows from Theorem 1.4.
(iv) Consider Z^{(1)} = Y^{(1)} − µ^{(1)} − Σ_{12}Σ_{22}^{−1}(Y^{(2)} − µ^{(2)}). From (iii), Z^{(1)} has a multivariate normal distribution with mean zero and covariance

Cov(Z^{(1)}, Z^{(1)}) = Cov(Y^{(1)}, Y^{(1)}) − 2Σ_{12}Σ_{22}^{−1}Cov(Y^{(2)}, Y^{(1)}) + Σ_{12}Σ_{22}^{−1}Cov(Y^{(2)}, Y^{(2)})Σ_{22}^{−1}Σ_{21} = Σ_{11} − Σ_{12}Σ_{22}^{−1}Σ_{21}.

On the other hand,

Cov(Z^{(1)}, Y^{(2)}) = Cov(Y^{(1)}, Y^{(2)}) − Σ_{12}Σ_{22}^{−1}Cov(Y^{(2)}, Y^{(2)}) = 0.

From (ii), Z^{(1)} is independent of Y^{(2)}. Then the conditional distribution of Z^{(1)} given Y^{(2)} is the same as the unconditional distribution of Z^{(1)}; i.e.,

Z^{(1)} | Y^{(2)} ∼ N(0, Σ_{11} − Σ_{12}Σ_{22}^{−1}Σ_{21}).

The result follows. †
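Part (iv) can be checked numerically by conditioning on Y^{(2)} falling in a small window around a fixed value and comparing the empirical conditional mean and variance of Y^{(1)} with the formulas. A minimal sketch with n = 2 and k = 1 (the covariance matrix and conditioning value are arbitrary choices):

import numpy as np

rng = np.random.default_rng(4)

mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])            # illustrative covariance; Sigma_12 = 0.6, Sigma_22 = 2.0

A = np.linalg.cholesky(sigma)
y = mu[:, None] + A @ rng.standard_normal(size=(2, 2_000_000))

y2_star = 2.0                             # condition on Y2 being close to this value
near = np.abs(y[1] - y2_star) < 0.02
emp_mean = y[0, near].mean()
emp_var = y[0, near].var()

# Theorem 1.5(iv): mean = mu1 + S12/S22*(y2* - mu2), variance = S11 - S12^2/S22
theo_mean = mu[0] + 0.6 / 2.0 * (y2_star - mu[1])
theo_var = 1.0 - 0.6**2 / 2.0
print(emp_mean, theo_mean, emp_var, theo_var)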

With normal random variables, we can use the algebra of random variables to construct a number of useful distributions. The first one is the chi-square distribution. Suppose X ∼ N_n(0, I); then ‖X‖^2 = ∑_{i=1}^{n} X_i^2 ∼ χ^2_n, the chi-square distribution with n degrees of freedom. One can use the convolution formula to show that the density function of χ^2_n is equal to the density of Gamma(n/2, 2), denoted by g(y; n/2, 1/2).

Corollary 1.1 If Y ∼ N_n(0, Σ) with Σ > 0, then Y′Σ^{−1}Y ∼ χ^2_n. †

Proof Since Σ > 0, there exists a positive definite matrix A such that AA′ = Σ. Then X = A^{−1}Y ∼ N_n(0, I). Thus

Y′Σ^{−1}Y = X′X ∼ χ^2_n.

†
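A quick simulation check of Corollary 1.1 (a sketch; the 3×3 matrix Σ below is an arbitrary positive definite choice). The quadratic form should behave like χ^2_3, with mean 3 and variance 6:

import numpy as np

rng = np.random.default_rng(5)

sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])       # arbitrary positive definite covariance

A = np.linalg.cholesky(sigma)
y = A @ rng.standard_normal(size=(3, 500_000))               # Y ~ N_3(0, Sigma)

q = np.einsum('in,ij,jn->n', y, np.linalg.inv(sigma), y)     # quadratic forms Y' Sigma^{-1} Y
print(q.mean(), q.var())                  # approximately 3 and 6 (chi-square with 3 df)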

Suppose X ∼ N(µ, 1). Define Y = X^2 and δ = µ^2. Then Y has density

f_Y(y) = ∑_{k=0}^{∞} p_k(δ/2) g(y; (2k + 1)/2, 1/2),

where p_k(δ/2) = exp(−δ/2)(δ/2)^k/k!. Another way to obtain this is: Y | K = k ∼ χ^2_{2k+1}, where K ∼ Poisson(δ/2). We say Y has the noncentral chi-square distribution with 1 degree of freedom and noncentrality parameter δ, and write Y ∼ χ^2_1(δ). More generally, if X = (X_1, ..., X_n)′ ∼ N_n(µ, I) and Y = X′X, then Y has density f_Y(y) = ∑_{k=0}^{∞} p_k(δ/2) g(y; (2k + n)/2, 1/2), where δ = µ′µ. We write Y ∼ χ^2_n(δ) and say Y has the noncentral chi-square distribution with n degrees of freedom and noncentrality parameter δ. It is then easy to show that if X ∼ N(µ, Σ), then Y = X′Σ^{−1}X ∼ χ^2_n(δ) with δ = µ′Σ^{−1}µ.

If X ∼ N(0, 1), Y ∼ χ^2_n and they are independent, then X/\sqrt{Y/n} is said to have the t-distribution with n degrees of freedom. If Y_1 ∼ χ^2_m, Y_2 ∼ χ^2_n and Y_1 and Y_2 are independent, then (Y_1/m)/(Y_2/n) is said to have the F-distribution with degrees of freedom m and n. These distributions have already been introduced in Example 1.11.


1.5 Families of Distributions

In Examples 1.1-1.12, we have listed a number of different distributions. Interestingly, many of them can be unified into families with a general distributional form. One advantage of this unification is that, in order to study the properties of each distribution within a family, we can examine the family as a whole.

The first family of distributions is called the location-scale family. Suppose that X has density function f_X(x). Then the location-scale family based on X consists of all the distributions generated by aX + b, where a is a positive constant (the scale parameter) and b is a constant called the location parameter. We notice that distributions such as N(µ, σ^2), Uniform(a, b) and Cauchy(µ, σ) belong to location-scale families. For a location-scale family, we can easily see that aX + b has density f_X((y − b)/a)/a, mean aE[X] + b and variance a^2 Var(X).

The second important family, which we will discuss in more detail, is called the exponential family. In fact, many examples of univariate or multivariate distributions, including the binomial and Poisson distributions for discrete variables and the normal, gamma and beta distributions for continuous variables, belong to some exponential family. Specifically, a family of distributions, {P_θ}, is said to form an s-parameter exponential family if the distributions P_θ have densities (with respect to some common dominating measure µ) of the form

p_θ(x) = exp{ ∑_{k=1}^{s} η_k(θ) T_k(x) − B(θ) } h(x).

Here the η_k and B are real-valued functions of θ and the T_k are real-valued functions of x. When η_k(θ) = θ_k, the above form is called the canonical form of the exponential family. Clearly, the form stipulates that

exp{B(θ)} = ∫ exp{ ∑_{k=1}^{s} η_k(θ) T_k(x) } h(x) dµ(x) < ∞.

Example 1.14 Let X_1, ..., X_n be i.i.d. N(µ, σ^2). Then the joint density of (X_1, ..., X_n) is given by

exp{ \frac{µ}{σ^2} ∑_{i=1}^{n} x_i − \frac{1}{2σ^2} ∑_{i=1}^{n} x_i^2 − \frac{n}{2σ^2} µ^2 } \frac{1}{(\sqrt{2π}σ)^n}.

Then η_1(θ) = µ/σ^2, η_2(θ) = −1/(2σ^2), T_1(x_1, ..., x_n) = ∑_{i=1}^{n} x_i, and T_2(x_1, ..., x_n) = ∑_{i=1}^{n} x_i^2.

Example 1.15 Let X have the binomial distribution Binomial(n, p). The probability of X = x can be written as

exp{ x \log\frac{p}{1 − p} + n \log(1 − p) } \binom{n}{x}.

Clearly, η(θ) = log(p/(1 − p)) and T(x) = x.

Example 1.16 Let X have a Poisson distribution with rate λ. Then

P(X = x) = exp{ x \log λ − λ } \frac{1}{x!}.

Thus, η(θ) = log λ and T(x) = x.


Since the exponential family covers a number of familiar distributions, one can study the exponential family as a whole to obtain general results applicable to all the members of the family. One such result is the moment generating function of (T_1, ..., T_s), defined as

M_T(t_1, ..., t_s) = E[exp{t_1 T_1 + ... + t_s T_s}].

Note that the coefficients in the Taylor expansion of M_T correspond to the moments of (T_1, ..., T_s).

Theorem 1.6 Suppose the densities of an exponential family can be written in the canonical form

exp{ ∑_{k=1}^{s} η_k T_k(x) − A(η) } h(x),

where η = (η_1, ..., η_s)′. Then for t = (t_1, ..., t_s)′,

M_T(t) = exp{A(η + t) − A(η)}. †

Proof It follows from

M_T(t) = E[exp{t_1 T_1 + ... + t_s T_s}] = ∫ exp{ ∑_{k=1}^{s} (η_k + t_k) T_k(x) − A(η) } h(x) dµ(x)

and

exp{A(η)} = ∫ exp{ ∑_{k=1}^{s} η_k T_k(x) } h(x) dµ(x),

so that the first integral equals exp{A(η + t) − A(η)}. †
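As a concrete check of Theorem 1.6, consider the Poisson family in canonical form: η = log λ, T(x) = x and A(η) = λ = e^η, so the theorem gives M_X(t) = exp{e^{η+t} − e^η} = exp{λ(e^t − 1)}, the familiar Poisson moment generating function. A minimal numerical sketch (λ = 2.5 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(6)
lam = 2.5                                  # arbitrary illustrative rate
x = rng.poisson(lam, size=1_000_000)

for t in (0.1, 0.3, 0.5):
    empirical = np.exp(t * x).mean()                 # Monte Carlo estimate of E[exp(tX)]
    eta = np.log(lam)
    theorem = np.exp(np.exp(eta + t) - np.exp(eta))  # exp{A(eta + t) - A(eta)} with A(eta) = e^eta
    print(t, round(empirical, 4), round(theorem, 4))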

Therefore, for an exponential family in canonical form, we can apply Theorem 1.6 to calculate moments of certain statistics. Another generating function, called the cumulant generating function, is defined as

K_T(t_1, ..., t_s) = log M_T(t_1, ..., t_s) = A(η + t) − A(η).

Its coefficients in the Taylor expansion are called the cumulants of (T_1, ..., T_s).

Example 1.17 For the normal distribution of Example 1.14 with n = 1 and σ^2 fixed, η = µ/σ^2 and

A(η) = \frac{µ^2}{2σ^2} = η^2 σ^2/2.

Thus, the moment generating function of T = X is equal to

M_T(t) = exp{ \frac{σ^2}{2}((η + t)^2 − η^2) } = exp{µt + t^2 σ^2/2}.

From the Taylor expansion, we obtain that when the mean of X is zero (µ = 0), the moments of X are given by

E[X^{2r+1}] = 0, E[X^{2r}] = 1 · 3 · · · (2r − 1) σ^{2r}, r = 1, 2, ...


Example 1.18 Let X have a gamma distribution with density

\frac{1}{Γ(a) b^a} x^{a−1} e^{−x/b}, x > 0.

For fixed a, it has the canonical form

exp{ −x/b + (a − 1) \log x − \log(Γ(a) b^a) } I(x > 0).

Correspondingly, η = −1/b, T = X, and A(η) = log(Γ(a) b^a) = a log(−1/η) + log Γ(a). Then the moment generating function of T = X is given by

M_X(t) = exp{ a \log\frac{η}{η + t} } = (1 − bt)^{−a}.

From the Taylor expansion around zero, we obtain

E[X] = ab, E[X^2] = ab^2 + (ab)^2, ...

As a further note, the exponential family has an important role in classical statistical inference since it possesses many nice statistical properties. We will revisit it in Chapter 4.

READING MATERIALS: You should read Lehmann and Casella, Sections 1.4 and 1.5.

PROBLEMS

1. Verify the densities of t(m) and F_{m,n} in Example 1.11.

2. Verify the two results in Example 1.12.

3. Suppose X ∼ N(µ, 1). Show that Y = X^2 has density

f_Y(y) = ∑_{k=0}^{∞} p_k(µ^2/2) g(y; (2k + 1)/2, 1/2),

where p_k(µ^2/2) = exp(−µ^2/2)(µ^2/2)^k/k! and g(y; n/2, 1/2) is the density of Gamma(n/2, 2).

4. Suppose X = (X_1, ..., X_n)′ ∼ N(µ, I) and let Y = X′X. Show that Y has density

f_Y(y) = ∑_{k=0}^{∞} p_k(µ′µ/2) g(y; (2k + n)/2, 1/2).

5. Let X ∼ Gamma(α_1, β) and Y ∼ Gamma(α_2, β) be independent random variables. Derive the distribution of X/(X + Y).


6. Show that for any random variables X, Y and Z,

Cov(X, Y ) = E[Cov(X, Y |Z)] + Cov(E[X|Z], E[Y |Z]),

where Cov(X,Y |Z) is the conditional covariance of X and Y given Z.

7. Let X and Y be i.i.d. Uniform(0,1) random variables. Define U = X − Y and V = max(X, Y) = X ∨ Y.

(a) What is the range of (U, V)?

(b) Find the joint density function f_{U,V}(u, v) of the pair (U, V). Are U and V independent?

8. Suppose that for θ ∈ R,

f_θ(u, v) = {1 + θ(1 − 2u)(1 − 2v)} I(0 ≤ u ≤ 1, 0 ≤ v ≤ 1).

(a) For what values of θ is f_θ a density function on [0, 1]^2?

(b) For the set of θ's identified in (a), find the corresponding distribution function F_θ and show that it has Uniform(0,1) marginal distributions.

(c) If (U, V) ∼ f_θ, compute the correlation ρ(U, V) ≡ ρ as a function of θ.

9. Suppose that F is the joint distribution function of random variables X and Y with X ∼ Uniform(0, 1) marginally and Y ∼ Uniform(0, 1) marginally. Thus, F(x, y) satisfies

F(x, 1) = x, 0 ≤ x ≤ 1, and F(1, y) = y, 0 ≤ y ≤ 1.

(a) Show that F(x, y) ≤ x ∧ y for all 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Here x ∧ y = min(x, y), and we denote this bound by F_U(x, y).

(b) Show that F(x, y) ≥ (x + y − 1)^+ for all 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Here (x + y − 1)^+ = max(x + y − 1, 0), and we denote this bound by F_L(x, y).

(c) Show that F_U is the distribution function of (X, X) and F_L is the distribution function of (X, 1 − X).

10. (a) If W ∼ χ^2_2 = Gamma(1, 2), find the density of W, the distribution function of W and the inverse distribution function explicitly.

(b) Suppose that (X, Y) ∼ N(0, I_{2×2}). In the two-dimensional plane, let R be the distance of (X, Y) from (0, 0) and Θ be the angle between the line from (0, 0) to (X, Y) and the right half of the x-axis, so that X = R cos Θ and Y = R sin Θ. Show that R and Θ are independent random variables with R^2 ∼ χ^2_2 and Θ ∼ Uniform(0, 2π).

(c) Use the above two results to show how to use two independent Uniform(0,1) random variables U and V to generate two standard normal random variables. Hint: use the result that if X has a continuous distribution function F, then F(X) has a uniform distribution on [0, 1].


11. Suppose that X ∼ F on [0, ∞), Y ∼ G on [0, ∞), and X and Y are independent random variables. Let Z = min{X, Y} = X ∧ Y and Δ = I(X ≤ Y).

(a) Find the joint distribution of (Z, Δ).

(b) If X ∼ Exponential(λ) and Y ∼ Exponential(µ), show that Z and Δ are independent.

12. Let X_1, ..., X_n be i.i.d. N(0, σ^2) and let (w_1, ..., w_n) be a constant vector such that w_1, ..., w_n > 0 and w_1 + ... + w_n = 1. Define X_{nw} = √w_1 X_1 + ... + √w_n X_n. Show that

(a) Y_n = X_{nw}/σ ∼ N(0, 1).

(b) (n − 1)S_n^2/σ^2 = (∑_{i=1}^{n} X_i^2 − X_{nw}^2)/σ^2 ∼ χ^2_{n−1}.

(c) Y_n and S_n^2 are independent, so T_n = Y_n/\sqrt{S_n^2} ∼ t_{n−1}/σ.

(d) When w_1 = ... = w_n = 1/n, show that Y_n is the standardized sample mean and S_n^2 is the sample variance.

Hint: Consider an orthogonal matrix Σ whose first row is (√w_1, ..., √w_n). Let

(Z_1, ..., Z_n)′ = Σ (X_1, ..., X_n)′.

Then Y_n = Z_1/σ and (n − 1)S_n^2/σ^2 = (Z_2^2 + ... + Z_n^2)/σ^2.

13. Let X_{n×1} ∼ N(0, I_{n×n}). Suppose that A is a symmetric matrix with rank r. Show that X′AX ∼ χ^2_r if and only if A is a projection matrix (that is, A^2 = A). Hint: use the following result from linear algebra: for any symmetric matrix A, there exists an orthogonal matrix O such that A = O′ diag(d_1, ..., d_n) O; A is a projection matrix if and only if d_1, ..., d_n take values 0 or 1.

14. Let W_m ∼ Negative Binomial(m, p). Consider p as a parameter.

(a) Write the distribution as an exponential family.

(b) Use the result for the exponential family to derive the moment generating function of W_m, denoted by M(t).

(c) Calculate the first and the second cumulants of W_m. By definition, in the expansion of the cumulant generating function,

log M(t) = ∑_{k=0}^{∞} \frac{µ_k}{k!} t^k,

µ_k is the kth cumulant of W_m. Note that these two cumulants are exactly the mean and the variance of W_m.

15. For the density C exp{−|x|^{1/2}}, −∞ < x < ∞, where C is the normalizing constant, show that moments of all orders exist but the moment generating function exists only at t = 0.


16. Lehmann and Casella, page 64, problem 4.2.

17. Lehmann and Casella, page 66, problem 5.6.

18. Lehmann and Casella, page 66, problem 5.7.

19. Lehmann and Casella, page 66, problem 5.8.

20. Lehmann and Casella, page 66, problem 5.9.

21. Lehmann and Casella, page 66, problem 5.10.

22. Lehmann and Casella, page 67, problem 5.12.

23. Lehmann and Casella, page 67, problem 5.14.

CHAPTER 2 MEASURE, INTEGRATION AND PROBABILITY

This chapter is an introduction to (probability) measure theory, a foundation for the entire probabilistic and statistical framework. We first give the definition of a measure space. Then we introduce measurable functions on a measure space and the integration and convergence of measurable functions. Further generalizations, including the product of two measures and the Radon-Nikodym derivative of two measures, are introduced. As a special case, we describe how the concepts and properties of a measure space are used in parallel in a probability measure space.

2.1 A Review of Set Theory and Topology in Real Space

We review some basic concepts in set theory. A set is a collection of elements, which can be a collection of real numbers, a group of abstract objects, etc. In most cases, we consider these elements as coming from one largest set, called the whole space. By custom, the whole space is denoted by Ω, so any set is simply a subset of Ω. We can exhaust all possible subsets of Ω; the collection of all these subsets is denoted by 2^Ω and called the power set of Ω. We also include the empty set, which has no elements at all and is denoted by ∅, in this power set.

For any two subsets A and B of the whole space Ω, A is said to be a subset of B if B contains all the elements of A, denoted A ⊆ B. For an arbitrary collection of sets {A_α : α is some index}, where the index α can range over a finite, countable or uncountable set, we define the intersection of these sets as the set containing all the elements common to every A_α; the intersection is denoted by ∩_α A_α. The A_α's are disjoint if any two of them have empty intersection. We can also define the union of these sets as the set containing all the elements belonging to at least one of these sets, denoted by ∪_α A_α. Finally, we introduce the complement of a set A, denoted by A^c, which is the set containing all the elements not in A. From the definitions of set intersection, union and complement, the following relationships are clear: for any B and A_α,

B ∩ (∪_α A_α) = ∪_α (B ∩ A_α),   B ∪ (∩_α A_α) = ∩_α (B ∪ A_α),

(∪_α A_α)^c = ∩_α A_α^c,   (∩_α A_α)^c = ∪_α A_α^c.   (de Morgan laws)

Sometimes we use (A − B) to denote the subset of A excluding any elements in B; thus (A − B) = A ∩ B^c. Using this notation, we can always partition the union of any countable sets A_1, A_2, ... into a union of countably many disjoint sets:

A_1 ∪ A_2 ∪ A_3 ∪ ... = A_1 ∪ (A_2 − A_1) ∪ (A_3 − (A_1 ∪ A_2)) ∪ ...

For a sequence of sets A_1, A_2, A_3, ..., we now define the limit sets of the sequence. The upper limit set of the sequence is the set containing the elements belonging to infinitely many of the sets in the sequence; the lower limit set of the sequence is the set containing the elements belonging to all but finitely many of the sets in the sequence. The former is denoted by \overline{\lim}_n A_n or lim sup_n A_n and the latter is written as \underline{\lim}_n A_n or lim inf_n A_n. We can show

lim sup_n A_n = ∩_{n=1}^{∞} (∪_{m=n}^{∞} A_m),   lim inf_n A_n = ∪_{n=1}^{∞} (∩_{m=n}^{∞} A_m).

When both limit sets agree, we say that the sequence has a limit set. From calculus, we know that any sequence of real numbers x_1, x_2, ... has an upper limit, lim sup_n x_n, and a lower limit, lim inf_n x_n, where the former is the supremum of the limits of all convergent subsequences and the latter is the infimum. One should be careful that these upper and lower limits of numbers are different from the upper and lower limit sets defined above.

The second part of this section reviews some basic topology of the real line. Because the distance between any two points is well defined on the real line, we can define a topology on it. A set A of the real line is called an open set if for any point x ∈ A, there exists an open interval (x − ε, x + ε) contained in A. Clearly, any open interval (a, b), where a could be −∞ and b could be ∞, is an open set. Moreover, for any collection of open sets A_α, where α is an index, it is easy to show that ∪_α A_α is open. A closed set is defined as the complement of an open set. It can also be shown that A is closed if and only if for any sequence {x_n} in A such that x_n → x, x must belong to A. By the de Morgan law, we also see that the intersection of any number of closed sets is still closed. Only ∅ and the whole real line are both open and closed; there are many sets that are neither open nor closed, for example, the set of all rational numbers. If a closed set A is bounded, A is also called a compact set. These basic topological concepts will be used later. Note that the concepts of open and closed sets can be easily generalized to any finite-dimensional real space.

2.2 Measure Space

2.2.1 Introduction

Before we introduce a formal definition of measure space, let us examine the following examples.

Example 2.1 Suppose that a whole space Ω contains a countable number of distinct points x_1, x_2, .... For any subset A of Ω, we define a set function µ_#(A) as the number of points in A. Therefore, if A has n distinct points, µ_#(A) = n; if A has infinitely many points, then µ_#(A) = ∞. We can easily show that (a) µ_#(∅) = 0; (b) if A_1, A_2, ... are disjoint subsets of Ω, then µ_#(∪_n A_n) = ∑_n µ_#(A_n). We will see later that µ_# is a measure, called the counting measure on Ω.

Example 2.2 Suppose that the whole space is Ω = R, the real line. We wish to measure the sizes of as many subsets of R as possible. Equivalently, we wish to define a set function λ which assigns non-negative values to sets of R. Since λ measures the size of a set, it is clear that λ should satisfy (a) λ(∅) = 0; (b) for any disjoint sets A_1, A_2, ... whose sizes are measurable, λ(∪_n A_n) = ∑_n λ(A_n). Then the question is how to define such a λ. Intuitively, for any interval (a, b], such a value can be given as the length of the interval, i.e., (b − a). We can further define the λ-value of any set in B_0, which consists of ∅ together with all finite unions of disjoint intervals of the form ∪_{i=1}^{n}(a_i, b_i], or ∪_{i=1}^{n}(a_i, b_i] ∪ (a_{n+1}, ∞), or (−∞, b_{n+1}] ∪ ∪_{i=1}^{n}(a_i, b_i], with a_i, b_i ∈ R, as the total length of the intervals. But can we go beyond this? The real line has far more sets which are not intervals, for example, the set of rational numbers. In other words, is it possible to extend the definition of λ to more sets beyond intervals while preserving the values for intervals? The answer is yes and will be given shortly. Moreover, such an extension is unique. This set function λ is called the Lebesgue measure on the real line.

Example 2.3 This example asks the same question as in Example 2.2, but now in k-dimensional real space. Still, we define a set function which assigns to any hypercube its volume and wish to extend its definition to more sets beyond hypercubes. Such a set function is called the Lebesgue measure in R^k, denoted by λ_k.

From the above examples, we can see that three pivotal components are necessary in defining a measure space:

(i) the whole space, Ω, for example, {x_1, x_2, ...} in Example 2.1 and R and R^k in the last two examples;

(ii) a collection of subsets whose sizes are measurable, for example, all the subsets in Example 2.1, and the yet-unknown collection of subsets including all the intervals in Example 2.2;

(iii) a set function which assigns non-negative values (sizes) to each set of (ii) and satisfies properties (a) and (b) in the above examples.

As notation, we use (Ω, A, µ) to denote these three components; i.e., Ω denotes the whole space, A denotes the collection of all the measurable sets, and µ denotes the set function which assigns non-negative values to all the sets in A.

2.2.2 Definition of a measure space

Obviously, Ω should be a fixed non-void set. The main difficulty is the characterization of A. However, let us understand intuitively what kinds of sets should be in A: as a reminder, A contains the sets whose sizes are measurable. Now suppose that a set A in A is measurable; then we would expect that its complement is also measurable, with size, intuitively, the size of the whole space minus the size of A. Additionally, if A_1, A_2, ... are in A and so are measurable, then we should be able to measure the total size of A_1, A_2, ..., i.e., the union of these sets. Hence, as expected, A should include the complement of any set in A and the union of any countable number of sets in A. It turns out that A must be a σ-field, whose definition is given below.

Definition 2.1 (fields, σ-fields) A non-void class A of subsets of Ω is called:
(i) a field or algebra if A, B ∈ A implies that A ∪ B ∈ A and A^c ∈ A; equivalently, A is closed under complements and finite unions;
(ii) a σ-field or σ-algebra if A is a field and A_1, A_2, ... ∈ A implies ∪_{i=1}^{∞} A_i ∈ A; equivalently, A is closed under complements and countable unions. †


In fact, a σ-field is not only closed under complements and countable unions but also closed under countable intersections, as shown in the following proposition.

Proposition 2.1 (i) For a field A, ∅, Ω ∈ A, and if A_1, ..., A_n ∈ A, then ∩_{i=1}^{n} A_i ∈ A.
(ii) For a σ-field A, if A_1, A_2, ... ∈ A, then ∩_{i=1}^{∞} A_i ∈ A. †

Proof (i) For any A ∈ A, Ω = A ∪ A^c ∈ A. Thus ∅ = Ω^c ∈ A. If A_1, ..., A_n ∈ A, then ∩_{i=1}^{n} A_i = (∪_{i=1}^{n} A_i^c)^c ∈ A.
(ii) This can be shown using the definition of a (σ-)field and the de Morgan law. †

We now give a few examples of σ-fields and fields.

Example 2.4 The class A = {∅, Ω} is the smallest σ-field and 2^Ω = {A : A ⊂ Ω} is the largest σ-field. Note that in Example 2.1, we chose A = 2^Ω since each set in A is measurable.

Example 2.5 Recall B_0 in Example 2.2. It can be checked that B_0 is a field but not a σ-field, since (a, b) = ∪_{n=1}^{∞} (a, b − 1/n] does not belong to B_0.

After defining a σ-field A on Ω, we can introduce the definition of a measure. As indicated before, a measure can be understood as a set function which assigns a non-negative value to each set in A. However, the values assigned to the sets of A are not arbitrary; they should be compatible in the following sense.

Definition 2.2 (measure, probability measure) (i) A measure µ is a function from a σ-field A to [0, ∞] satisfying: µ(∅) = 0; and µ(∪_{n=1}^{∞} A_n) = ∑_{n=1}^{∞} µ(A_n) for any countably many (or finitely many) disjoint sets A_1, A_2, ... ∈ A. The latter property is called countable additivity.
(ii) Additionally, if µ(Ω) = 1, µ is a probability measure and we usually use P instead of µ to denote a probability measure. †

The following proposition gives some properties of a measure.

Proposition 2.2 (i) If {A_n} ⊂ A and A_n ⊂ A_{n+1} for all n, then µ(∪_{n=1}^{∞} A_n) = lim_{n→∞} µ(A_n).
(ii) If {A_n} ⊂ A, µ(A_1) < ∞ and A_n ⊃ A_{n+1} for all n, then µ(∩_{n=1}^{∞} A_n) = lim_{n→∞} µ(A_n).
(iii) For any {A_n} ⊂ A, µ(∪_n A_n) ≤ ∑_n µ(A_n) (countable sub-additivity). †

Proof (i) It follows from

µ(∪_{n=1}^{∞} A_n) = µ(A_1 ∪ (A_2 − A_1) ∪ ...) = µ(A_1) + µ(A_2 − A_1) + ...
= lim_n {µ(A_1) + µ(A_2 − A_1) + ... + µ(A_n − A_{n−1})} = lim_n µ(A_n).

(ii) First,

µ(∩_{n=1}^{∞} A_n) = µ(A_1) − µ(A_1 − ∩_{n=1}^{∞} A_n) = µ(A_1) − µ(∪_{n=1}^{∞} (A_1 ∩ A_n^c)).

Then, since A_1 ∩ A_n^c is increasing in n, from (i) the second term is equal to lim_n µ(A_1 ∩ A_n^c) = µ(A_1) − lim_n µ(A_n). Thus (ii) holds.

(iii) From (i), we have

µ(∪_n A_n) = lim_n µ(A_1 ∪ ... ∪ A_n) = lim_n ∑_{i=1}^{n} µ(A_i − ∪_{j<i} A_j) ≤ lim_n ∑_{i=1}^{n} µ(A_i) = ∑_n µ(A_n).

The result holds. †

If a class of sets {A_n} is increasing or decreasing, we can treat ∪_n A_n or ∩_n A_n as its limit set. Then Proposition 2.2 says that such a limit can be taken outside the measure for increasing sets, and it can be taken outside the measure for decreasing sets if the measure of some A_n is finite. For an arbitrary sequence of sets {A_n}, in fact, similarly to Proposition 2.2, we can show

µ(lim inf_n A_n) = lim_n µ(∩_{k=n}^{∞} A_k) ≤ lim inf_n µ(A_n).

The triplet (Ω, A, µ) is called a measure space. Any set in A is called a measurable set. In particular, if µ = P is a probability measure, (Ω, A, P) is called a probability measure space, abbreviated as a probability space; an element of Ω is called a probability sample and a set in A is called a probability event. As an additional note, a measure µ is called σ-finite if there exists a countable collection of sets {F_n} ⊂ A such that Ω = ∪_n F_n and µ(F_n) < ∞ for each F_n.

Example 2.6 (i) A measure µ on (Ω, A) is discrete if there are finitely or countably many points ω_i ∈ Ω and masses m_i ∈ [0, ∞) such that

µ(A) = ∑_{ω_i ∈ A} m_i, A ∈ A.

Examples include the probability measures of discrete distributions.
(ii) In Example 2.1, we defined a counting measure µ_# on a countable space. This definition can be generalized to any space. In particular, the counting measure on the space R is not σ-finite.

2.2.3 Construction of a measure space

Even though (Ω, A, µ) is now well defined, a practical question is how to construct such a measure space. In the specific Example 2.2, one asks whether we can find a σ-field including all the intervals of B_0 and, on this σ-field, whether we can define a measure λ such that λ assigns to any interval its length. More generally, suppose that we have a class of sets C and a set function µ satisfying property (i) of Definition 2.2. Can we find a σ-field which contains all the sets of C, and moreover, can we obtain a measure defined for every set of this σ-field such that the measure agrees with µ on C? The answer is positive for the first question, and is positive for the second question when C is a field. Indeed, such a σ-field can be taken as the smallest σ-field containing all the sets of C, called the σ-field generated by C, and such a measure can be obtained using the measure extension result given below.

First, we show that the σ-field generated by C exists and is unique.

Proposition 2.3 (i) Arbitrary intersections of fields (σ-fields) are fields (σ-fields).
(ii) For any class C of subsets of Ω, there exists a minimal σ-field containing C, which we denote by σ(C). †

Proof (i) can be shown using the definitions of a (σ-)field. For (ii), we define

σ(C) = ∩_{C ⊂ A, A is a σ-field} A,

i.e., the intersection of all the σ-fields containing C. From (i), this class is also a σ-field. Obviously, it is the minimal one among all the σ-fields containing C. †

The following result then shows that an extension of µ from a field C to σ(C) is possible and, under σ-finiteness, unique.

Theorem 2.1 (Caratheodory Extension Theorem) A measure µ on a field C can be extended to a measure on the minimal σ-field σ(C). If µ is σ-finite on C, then the extension is unique and also σ-finite. †

Proof The proof is skipped. Essentially, we define an extension of µ using the following outer measure: for any set A,

µ^*(A) = inf{ ∑_{i=1}^{∞} µ(A_i) : A_i ∈ C, A ⊂ ∪_{i=1}^{∞} A_i }.

This also gives a way of calculating the measure of any set in σ(C). †

Using the above results, we can construct many measure spaces. In Example 2.2, we first generate the σ-field containing all the sets of B_0, i.e., σ(B_0). This σ-field is called the Borel σ-field, denoted by B, and any set in B is called a Borel set. Then we can extend λ to B, and the resulting measure is called the Lebesgue measure. The triplet (R, B, λ) is called the Borel measure space. Similarly, in Example 2.3, we can obtain the Borel measure space in R^k, denoted by (R^k, B_k, λ_k).

We can also obtain many different measures on the Borel σ-field. To do that, let F be a fixed generalized distribution function: F is non-decreasing and right-continuous. Then, starting from any interval (a, b], we define a set function λ_F((a, b]) = F(b) − F(a); thus λ_F can be easily defined for any set of B_0. Using the σ-field generation and measure extension, we obtain a measure λ_F on B. Such a measure is called the Lebesgue-Stieltjes measure generated by F. Note that the Lebesgue measure is a special case with F(x) = x. In particular, if F is a distribution function, i.e., F(∞) = 1 and F(−∞) = 0, this measure is a probability measure on R.

In a measure space (Ω, A, µ), it is intuitive to require that any subset of a set with measure zero should also be given measure zero. However, such subsets may not be included in A. Therefore, a final stage of constructing a measure space is to perform a completion by including such nuisance sets in the σ-field. A general definition of the completion of a measure is given as follows: for a measure space (Ω, A, µ), a completion is another measure space (Ω, \bar{A}, \bar{µ}), where

\bar{A} = {A ∪ N : A ∈ A, N ⊂ B for some B ∈ A such that µ(B) = 0}

and \bar{µ}(A ∪ N) = µ(A). In particular, the completion of the Borel measure space is called the Lebesgue measure space and the completed Borel σ-field is called the σ-field of Lebesgue sets. From now on, we always assume that a measure space is complete.


2.3 Measurable Function and Integration

2.3.1 Measurable function

In measure theory, functions defined on a measure space are even more interesting and important than the measure space itself. In particular, only so-called measurable functions are useful.

Definition 2.3 (measurable function) Let X : Ω → R be a function defined on Ω. X is measurable if for every x ∈ R, the set {ω ∈ Ω : X(ω) ≤ x} is measurable, i.e., belongs to A. In particular, if the measure space is a probability measure space, X is called a random variable. †

Hence, for a measurable function, we can evaluate the size of sets such as X^{−1}((−∞, x]). In fact, the following proposition shows that for any Borel set B ∈ B, X^{−1}(B) is a measurable set in A.

Proposition 2.4 If X is measurable, then for any B ∈ B, X^{−1}(B) = {ω : X(ω) ∈ B} is measurable. †

Proof We define the class

B^* = {B : B ⊂ R, X^{−1}(B) is measurable in A}.

Clearly, (−∞, x] ∈ B^*. Furthermore, if B ∈ B^*, then X^{−1}(B) ∈ A; thus X^{−1}(B^c) = Ω − X^{−1}(B) ∈ A, so B^c ∈ B^*. Moreover, if B_1, B_2, ... ∈ B^*, then X^{−1}(B_1), X^{−1}(B_2), ... ∈ A; thus X^{−1}(B_1 ∪ B_2 ∪ ...) = X^{−1}(B_1) ∪ X^{−1}(B_2) ∪ ... ∈ A, so B_1 ∪ B_2 ∪ ... ∈ B^*. We conclude that B^* is a σ-field. However, the Borel σ-field B is the minimal σ-field containing all intervals of the type (−∞, x]. So B ⊂ B^*. Then for any Borel set B, X^{−1}(B) is measurable in A. †

One special example of a measurable function is a simple function, defined as ∑_{i=1}^{n} x_i I_{A_i}(ω), where A_i, i = 1, ..., n, are disjoint measurable sets in A. Here, I_A(ω) is the indicator function of A, such that I_A(ω) = 1 if ω ∈ A and 0 otherwise. Note that the sum and the maximum of a finite number of simple functions are still simple functions. More examples of measurable functions can be constructed from elementary algebra.

Proposition 2.5 Suppose that X_1, X_2, ... are measurable. Then so are X_1 + X_2, X_1 X_2, X_1^2, and sup_n X_n, inf_n X_n, lim sup_n X_n and lim inf_n X_n. †

Proof All can be verified using the following relationships:

{X_1 + X_2 ≤ x} = Ω − {X_1 + X_2 > x} = Ω − ∪_{r ∈ Q} ({X_1 > r} ∩ {X_2 > x − r}),

where Q is the set of all rational numbers. {X_1^2 ≤ x} is empty if x < 0 and is equal to {X_1 ≤ √x} − {X_1 < −√x} otherwise. X_1 X_2 = {(X_1 + X_2)^2 − X_1^2 − X_2^2}/2, so it is measurable. The remaining claims can be seen from the following:

{sup_n X_n ≤ x} = ∩_n {X_n ≤ x},

{inf_n X_n ≤ x} = {sup_n (−X_n) ≥ −x},

{lim sup_n X_n ≤ x} = ∩_{r ∈ Q, r > 0} ∪_{n=1}^{∞} ∩_{k ≥ n} {X_k < x + r},

lim inf_n X_n = − lim sup_n (−X_n).

†

One important and fundamental fact about measurable functions is given in the following proposition.

Proposition 2.6 For any measurable function X ≥ 0, there exists an increasing sequence of simple functions X_n such that X_n(ω) increases to X(ω) as n goes to infinity. †

Proof Define

X_n(ω) = ∑_{k=0}^{n2^n − 1} \frac{k}{2^n} I\{ \frac{k}{2^n} ≤ X(ω) < \frac{k+1}{2^n} \} + n I\{X(ω) ≥ n\}.

That is, we simply partition the range of X and assign the smallest value within each partition. Clearly, X_n is increasing in n. Moreover, if X(ω) < n, then |X_n(ω) − X(ω)| < 1/2^n. Thus, X_n(ω) converges to X(ω). †
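The dyadic construction in this proof is easy to implement. A minimal sketch (the test function X(ω) = ω^2 on a grid of points is an arbitrary illustrative choice):

import numpy as np

def simple_approx(x_values, n):
    """Dyadic simple-function approximation from the proof of Proposition 2.6:
    values below n are rounded down to the nearest k/2^n; values >= n are set to n."""
    x = np.asarray(x_values, dtype=float)
    return np.where(x >= n, float(n), np.floor(x * 2**n) / 2**n)

omega = np.linspace(0.0, 3.0, 7)      # a few sample points of omega
x = omega**2                          # an illustrative non-negative measurable function

for n in (1, 2, 4, 8):
    # the approximation error (against X truncated at n) shrinks like 1/2^n
    print(n, np.max(np.abs(simple_approx(x, n) - np.minimum(x, n))))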

This fact can be used to verify the measurability of many functions; for example, if g is a continuous function from R to R, then g(X) is also measurable.

2.3.2 Integration of measurable function

Now we are ready to define the integration of a measurable function.

Definition 2.4 (i) For any simple function X(ω) = ∑_{i=1}^{n} x_i I_{A_i}(ω), we define ∑_{i=1}^{n} x_i µ(A_i) as the integral of X with respect to the measure µ, denoted ∫ X dµ.
(ii) For any X ≥ 0, we define ∫ X dµ as

∫ X dµ = sup_{Y is a simple function, 0 ≤ Y ≤ X} ∫ Y dµ.

(iii) For general X, let X^+ = max(X, 0) and X^− = max(−X, 0). Then X = X^+ − X^−. If one of ∫ X^+ dµ, ∫ X^− dµ is finite, we define ∫ X dµ = ∫ X^+ dµ − ∫ X^− dµ. †

In particular, we say X is integrable if ∫ |X| dµ = ∫ X^+ dµ + ∫ X^− dµ is finite. Note that definition (ii) is consistent with (i) when X itself is a simple function. When the measure space is a probability measure space and X is a random variable, ∫ X dµ is also called the expectation of X, denoted by E[X].


Proposition 2.7 (i) For two measurable functions X1 ≥ 0 and X2 ≥ 0, if X1 ≤ X2, then ∫ X1 dµ ≤ ∫ X2 dµ.
(ii) For X ≥ 0 and any sequence of simple functions Yn increasing to X, ∫ Yn dµ → ∫ X dµ. †

Proof (i) For any simple function 0 ≤ Y ≤ X1, we have Y ≤ X2; thus ∫ Y dµ ≤ ∫ X2 dµ by the definition of ∫ X2 dµ. Taking the supremum over all simple functions less than X1, we obtain ∫ X1 dµ ≤ ∫ X2 dµ.
(ii) From (i), ∫ Yn dµ is increasing and bounded by ∫ X dµ. It suffices to show that for any simple function Z = ∑_{i=1}^m x_i I_{A_i}(ω), where A_i, 1 ≤ i ≤ m, are disjoint measurable sets and x_i > 0, such that 0 ≤ Z ≤ X, it holds that

lim_n ∫ Yn dµ ≥ ∑_{i=1}^m x_i µ(A_i).

We consider two cases. First, suppose ∫ Z dµ = ∑_{i=1}^m x_i µ(A_i) is finite; then both x_i and µ(A_i) are finite. Fix ε > 0 and let A_{in} = A_i ∩ {ω : Yn(ω) > x_i - ε}. Since Yn increases to X, which is larger than or equal to x_i on A_i, A_{in} increases to A_i. Thus µ(A_{in}) increases to µ(A_i) by Proposition 2.2. It follows that when n is large,

∫ Yn dµ ≥ ∑_{i=1}^m (x_i - ε) µ(A_i).

We conclude lim_n ∫ Yn dµ ≥ ∫ Z dµ - ε ∑_{i=1}^m µ(A_i), and thus lim_n ∫ Yn dµ ≥ ∫ Z dµ by letting ε approach 0. Second, suppose ∫ Z dµ = ∞; then there exists some i among 1, ..., m, say i = 1, such that µ(A1) = ∞ or x1 = ∞. Choose any 0 < x < x1 and 0 < y < µ(A1). The set A_{1n} = A1 ∩ {ω : Yn(ω) > x} increases to A1, so when n is large enough, µ(A_{1n}) > y. We thus obtain lim_n ∫ Yn dµ ≥ xy. By letting x → x1 and y → µ(A1), we conclude lim_n ∫ Yn dµ = ∞.

Therefore, in either case, lim_n ∫ Yn dµ ≥ ∫ Z dµ. †

Proposition 2.7 implies that, to calculate the integral of a non-negative measurable function X, we can choose any increasing sequence of simple functions Yn, and the limit of ∫ Yn dµ is the same as ∫ X dµ. In particular, such a sequence can be chosen as in the construction of Proposition 2.6; then

∫ X dµ = lim_n { ∑_{k=1}^{n2^n - 1} (k/2^n) µ(k/2^n ≤ X < (k+1)/2^n) + n µ(X ≥ n) }.
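As a quick numerical check of the displayed formula (a sketch, not part of the notes), one can approximate ∫ X dλ for X(x) = x^2 on [0, 1] with λ the Lebesgue measure; here µ(k/2^n ≤ X < (k+1)/2^n) is just the length of the corresponding set of x values.

```python
import numpy as np

def integral_via_simple_functions(n):
    """Approximate the Lebesgue integral of X(x) = x^2 on [0, 1] using the
    increasing simple functions of Proposition 2.6: sum_k (k/2^n) * mu(A_k)."""
    total = 0.0
    for k in range(n * 2**n):
        lo, hi = k / 2**n, (k + 1) / 2**n
        # {x in [0,1] : lo <= x^2 < hi} has length min(sqrt(hi),1) - min(sqrt(lo),1).
        length = min(np.sqrt(hi), 1.0) - min(np.sqrt(lo), 1.0)
        total += lo * max(length, 0.0)
    return total  # the n*mu(X >= n) term vanishes here since X <= 1

for n in (2, 4, 8):
    print(n, integral_via_simple_functions(n))  # increases toward 1/3
```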

Proposition 2.8 (Elementary Properties) Suppose ∫ X dµ, ∫ Y dµ and ∫ X dµ + ∫ Y dµ exist. Then
(i) ∫ (X + Y) dµ = ∫ X dµ + ∫ Y dµ and ∫ cX dµ = c ∫ X dµ;
(ii) X ≥ 0 implies ∫ X dµ ≥ 0; X ≥ Y implies ∫ X dµ ≥ ∫ Y dµ; and X = Y a.e., that is, µ(ω : X(ω) ≠ Y(ω)) = 0, implies that ∫ X dµ = ∫ Y dµ;
(iii) |X| ≤ Y with Y integrable implies that X is integrable; X and Y integrable implies that X + Y is integrable. †


Proposition 2.8 can be proved using the definition. Finally, we give a few facts about computing integrals, without proof.

(a) Suppose µ# is the counting measure on Ω = {x1, x2, ...}. Then for any measurable function g, ∫ g dµ# = ∑_i g(x_i).

(b) For any continuous function g(x), which is also measurable in the Lebesgue measure space (R, B, λ), ∫ g dλ is equal to the usual Riemann integral ∫ g(x) dx, whenever g is integrable.

(c) In a Lebesgue-Stieltjes measure space (Ω, B, λF), where F is differentiable except at discontinuity points x1, x2, ..., the integral of a continuous function g(x) is given by

∫ g dλF = ∑_i g(x_i) {F(x_i) - F(x_i-)} + ∫ g(x) f(x) dx,

where f(x) is the derivative of F(x).

2.3.3 Convergence of measurable functions

In this section, we provide some important theorems on how to take limits inside the integral.

Theorem 2.2 (Monotone Convergence Theorem) If Xn ≥ 0 and Xn increases to X, then ∫ Xn dµ → ∫ X dµ. †

Proof Choose non-negative simple functions X_{km} increasing to X_k as m → ∞. Define Yn = max_{k≤n} X_{kn}. Then Yn is an increasing sequence of simple functions and satisfies

X_{kn} ≤ Yn ≤ Xn, so ∫ X_{kn} dµ ≤ ∫ Yn dµ ≤ ∫ Xn dµ.

By letting n → ∞, we obtain

X_k ≤ lim_n Yn ≤ X, ∫ X_k dµ ≤ ∫ lim_n Yn dµ = lim_n ∫ Yn dµ ≤ lim_n ∫ Xn dµ,

where the equality holds by Proposition 2.7 (ii), since Yn is an increasing sequence of simple functions. By letting k → ∞, we obtain

X ≤ lim_n Yn ≤ X, lim_k ∫ X_k dµ ≤ ∫ lim_n Yn dµ ≤ lim_n ∫ Xn dµ.

The result holds. †

Example 2.7 This example shows that the non-negativity condition in the above theorem is necessary: let Xn(x) = -I(x > n)/n, a measurable function in the Lebesgue measure space. Clearly, Xn increases to zero but ∫ Xn dλ = -∞.


Theorem 2.3 (Fatou's Lemma) If Xn ≥ 0, then

∫ lim inf_n Xn dµ ≤ lim inf_n ∫ Xn dµ. †

Proof Note that

lim inf_n Xn = sup_{n≥1} inf_{m≥n} Xm.

Thus, the sequence inf_{m≥n} Xm increases to lim inf_n Xn. By the Monotone Convergence Theorem,

∫ lim inf_n Xn dµ = lim_n ∫ inf_{m≥n} Xm dµ, and ∫ inf_{m≥n} Xm dµ ≤ ∫ Xn dµ.

Taking the lim inf on both sides of the last inequality gives the theorem. †

The next theorem requires two more definitions.

Definition 2.5 A sequence Xn converges almost everywhere (a.e.) to X, denoted Xn →a.e. X, if Xn(ω) → X(ω) for all ω ∈ Ω - N, where µ(N) = 0. If µ is a probability measure, we write a.e. as a.s. (almost surely). A sequence Xn converges in measure to a measurable function X, denoted Xn →µ X, if µ(|Xn - X| ≥ ε) → 0 for all ε > 0. If µ is a probability measure, we say Xn converges in probability to X. †

The following proposition further characterizes convergence almost everywhere.

Proposition 2.9 Let Xn, X be finite measurable functions. Then Xn →a.e. X if and only if for any ε > 0,

µ(∩_{n=1}^∞ ∪_{m≥n} {|Xm - X| ≥ ε}) = 0.

If µ(Ω) < ∞, then Xn →a.e. X if and only if for any ε > 0,

µ(∪_{m≥n} {|Xm - X| ≥ ε}) → 0. †

Proof Note that

{ω : Xn(ω) → X(ω)}^c = ∪_{k=1}^∞ ∩_{n=1}^∞ ∪_{m≥n} {ω : |Xm(ω) - X(ω)| ≥ 1/k}.

Thus, if Xn →a.e. X, the measure of the left-hand side is zero. However, the right-hand side contains ∩_{n=1}^∞ ∪_{m≥n} {|Xm - X| ≥ ε} for any ε > 0. The direction ⇒ is proved. For the other direction, we choose ε = 1/k for each k; then by countable sub-additivity,

µ(∪_{k=1}^∞ ∩_{n=1}^∞ ∪_{m≥n} {ω : |Xm(ω) - X(ω)| ≥ 1/k}) ≤ ∑_k µ(∩_{n=1}^∞ ∪_{m≥n} {ω : |Xm(ω) - X(ω)| ≥ 1/k}) = 0.

Thus, Xn →a.e. X. When µ(Ω) < ∞, the second equivalence follows from Proposition 2.2. †

The following proposition describes the relationship between convergence almost everywhere and convergence in measure.

Proposition 2.10 Let Xn be finite a.e.
(i) If Xn →µ X, then there exists a subsequence X_{n_k} →a.e. X.
(ii) If µ(Ω) < ∞ and Xn →a.e. X, then Xn →µ X. †

Proof (i) For each k, there exists some n_k such that

µ(|X_{n_k} - X| ≥ 2^{-k}) < 2^{-k},

and we may choose n_k to be increasing. Then for any ε > 0 and k large enough that 2^{-k} ≤ ε,

µ(∪_{m≥k} {|X_{n_m} - X| ≥ ε}) ≤ ∑_{m≥k} µ(|X_{n_m} - X| ≥ 2^{-m}) ≤ ∑_{m≥k} 2^{-m} → 0.

Thus, from the previous proposition, X_{n_k} →a.e. X.
(ii) follows directly from the second part of Proposition 2.9. †

Example 2.8 Let X_{2^n+k} = I(x ∈ [k/2^n, (k+1)/2^n)), 0 ≤ k < 2^n, be measurable functions in the Lebesgue measure space. Then it is easy to see that Xn →λ 0, but Xn does not converge to zero almost everywhere. However, there exists a subsequence converging to zero almost everywhere.

Example 2.9 In Example 2.7, n^2 Xn = -n I(x > n) →a.e. 0, but λ(|n^2 Xn| > ε) = ∞ for every n > ε. This example shows that the condition µ(Ω) < ∞ in (ii) of Proposition 2.10 is necessary.
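A small numerical illustration of Example 2.8 (a sketch, not part of the notes): the "typewriter" indicators shrink in Lebesgue measure, so they converge in measure to zero, yet at every fixed point the sequence keeps returning to 1.

```python
import numpy as np

def typewriter(n, x):
    """X_n(x) = I(x in [k/2^j, (k+1)/2^j)) where n = 2^j + k, 0 <= k < 2^j."""
    j = int(np.floor(np.log2(n)))
    k = n - 2**j
    return float(k / 2**j <= x < (k + 1) / 2**j)

x0 = 0.3
values = [typewriter(n, x0) for n in range(1, 200)]
print(max(values[100:]))          # still 1: no pointwise (a.e.) convergence at x0
print(1 / 2**int(np.log2(150)))   # measure of {X_150 = 1} is 2^{-7}: convergence in measure
```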

We now state the third important theorem.

Theorem 2.4 (Dominated Convergence Theorem) If |Xn| ≤ Y a.e. with Y integrable, and if Xn →µ X (or Xn →a.e. X), then ∫ |Xn - X| dµ → 0 and lim_n ∫ Xn dµ = ∫ X dµ. †

Proof First, assume Xn →a.e. X. Define Zn = 2Y - |Xn - X|. Clearly, Zn ≥ 0 and Zn → 2Y. By Fatou's lemma, we have

∫ 2Y dµ ≤ lim inf_n ∫ (2Y - |Xn - X|) dµ.

That is, lim sup_n ∫ |Xn - X| dµ ≤ 0 and the result holds. Now suppose Xn →µ X and the result does not hold along some subsequence of Xn. By Proposition 2.10, there exists a further subsequence converging to X almost everywhere, and by the first part the result holds for this further subsequence. We obtain a contradiction. †

The existence of the dominating function Y is necessary, as seen in the counterexample of Example 2.7. Finally, the following result describes the interchange of integration with limits or derivatives.

Theorem 2.5 (Interchange of Integral and Limit or Derivatives) Suppose that X(ω, t) is measurable for each t ∈ (a, b).
(i) If X(ω, t) is a.e. continuous in t at t0 and |X(ω, t)| ≤ Y(ω) a.e. for |t - t0| < δ with Y integrable, then

lim_{t→t0} ∫ X(ω, t) dµ = ∫ X(ω, t0) dµ.

(ii) Suppose ∂X(ω, t)/∂t exists for a.e. ω and all t ∈ (a, b), and |∂X(ω, t)/∂t| ≤ Y(ω) a.e. for all t ∈ (a, b) with Y integrable. Then

∂/∂t ∫ X(ω, t) dµ = ∫ ∂X(ω, t)/∂t dµ. †

Proof (i) follows from the Dominated Convergence Theorem and the subsequence argument. (ii) can be seen from the following:

∂/∂t ∫ X(ω, t) dµ = lim_{h→0} ∫ {X(ω, t + h) - X(ω, t)}/h dµ.

By the mean value theorem, |{X(ω, t + h) - X(ω, t)}/h| ≤ Y(ω) a.e., so from the conditions and (i), the limit can be taken inside the integral. †
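As a quick sanity check of Theorem 2.5 (ii) (a sketch, not from the notes), one can compare a finite-difference derivative of t ↦ ∫_0^1 e^{-tx} dx with the integral of the t-derivative; the t-derivative of the integrand is bounded by Y(x) = x ≤ 1 on [0, 1], so the theorem applies.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100001)

def F(t):
    # F(t) = integral_0^1 exp(-t*x) dx, computed by the trapezoidal rule
    return np.trapz(np.exp(-t * x), x)

t0, h = 0.7, 1e-5
finite_diff = (F(t0 + h) - F(t0 - h)) / (2 * h)
swapped = np.trapz(-x * np.exp(-t0 * x), x)  # integral of d/dt exp(-t*x) at t0
print(finite_diff, swapped)                  # the two numbers agree to several decimals
```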

2.4 Fubini Integration and Radon-Nikodym Derivative

2.4.1 Product of measures and the Fubini-Tonelli theorem

Suppose that (Ω1, A1, µ1) and (Ω2, A2, µ2) are two measure spaces. We consider the product set Ω1 × Ω2 = {(ω1, ω2) : ω1 ∈ Ω1, ω2 ∈ Ω2}. Correspondingly, we define a class

{A1 × A2 : A1 ∈ A1, A2 ∈ A2}.

A set of the form A1 × A2 is called a measurable rectangle. However, the above class is not a σ-field. We thus construct the σ-field generated by this class and denote it

A1 × A2 = σ({A1 × A2 : A1 ∈ A1, A2 ∈ A2}).

To define a measure on this σ-field, denoted µ1 × µ2, we first define it on any rectangle by

(µ1 × µ2)(A1 × A2) = µ1(A1) µ2(A2).

Then µ1 × µ2 is extended to all sets in A1 × A2 by the Caratheodory extension theorem.

One simple example is the Lebesgue measure in the multi-dimensional real space R^k. Let (R, B, λ) be the Lebesgue measure space on the real line. Then we can use the above procedure to define λ × ... × λ as a measure on R^k = R × ... × R. Clearly, for each cube in R^k, this measure gives the same value as the volume of the cube. In fact, this measure agrees with λ_k defined in Example 2.3.

With the product measure, we can discuss integration with respect to this measure. Let X(ω1, ω2) be a measurable function on the measure space (Ω1 × Ω2, A1 × A2, µ1 × µ2). The integral of X is denoted ∫_{Ω1×Ω2} X(ω1, ω2) d(µ1 × µ2). When the measurable space is a real space, this integral is simply a bivariate integral such as ∫_{R^2} f(x, y) dx dy. As in calculus, we are often concerned with whether we can integrate over x first and then over y, or over y first and then over x. The following theorem gives conditions under which the order of integration can be changed.

Theorem 2.6 (Fubini-Tonelli Theorem) Suppose that X : Ω1 × Ω2 → R is A1 × A2 measurable and X ≥ 0. Then

∫_{Ω1} X(ω1, ω2) dµ1 is A2 measurable, ∫_{Ω2} X(ω1, ω2) dµ2 is A1 measurable,

and

∫_{Ω1×Ω2} X(ω1, ω2) d(µ1 × µ2) = ∫_{Ω1} [ ∫_{Ω2} X(ω1, ω2) dµ2 ] dµ1 = ∫_{Ω2} [ ∫_{Ω1} X(ω1, ω2) dµ1 ] dµ2. †

As a corollary, suppose X is not necessarily non-negative; we can write X = X^+ - X^-. Then the above results hold for X^+ and X^-. Thus, if ∫_{Ω1×Ω2} |X(ω1, ω2)| d(µ1 × µ2) is finite, then the above results hold for X.

Proof Suppose that we have shown the theorem holds for any indicator function I_B(ω1, ω2), where B ∈ A1 × A2. We construct a sequence of simple functions Xn increasing to X. Clearly, ∫_{Ω1} Xn(ω1, ω2) dµ1 is measurable and

∫_{Ω1×Ω2} Xn(ω1, ω2) d(µ1 × µ2) = ∫_{Ω2} [ ∫_{Ω1} Xn(ω1, ω2) dµ1 ] dµ2.

By the monotone convergence theorem, ∫_{Ω1} Xn(ω1, ω2) dµ1 increases to ∫_{Ω1} X(ω1, ω2) dµ1 almost everywhere. Applying the monotone convergence theorem again to both sides of the above equality, we obtain

∫_{Ω1×Ω2} X(ω1, ω2) d(µ1 × µ2) = ∫_{Ω2} [ ∫_{Ω1} X(ω1, ω2) dµ1 ] dµ2.

Similarly,

∫_{Ω1×Ω2} X(ω1, ω2) d(µ1 × µ2) = ∫_{Ω1} [ ∫_{Ω2} X(ω1, ω2) dµ2 ] dµ1.

It remains to show that I_B(ω1, ω2) satisfies the theorem's conclusions for every B ∈ A1 × A2.

To this end, we define what is called a monotone class: M is a monotone class if for any increasing sequence of sets B1 ⊆ B2 ⊆ B3 ⊆ ... in the class, ∪_i B_i belongs to M. We then let M0 be the minimal monotone class in A1 × A2 containing all the rectangles. The existence of such a minimal class can be proved using the same construction as in Proposition 2.3, noting that A1 × A2 itself is a monotone class. We show that M0 = A1 × A2.

(a) M0 is a field: for A, B ∈ M0, it suffices to show that A ∩ B, A ∩ B^c, A^c ∩ B ∈ M0. We consider

M_A = {B ∈ M0 : A ∩ B, A ∩ B^c, A^c ∩ B ∈ M0}.

It is straightforward to see that if A is a rectangle, then B ∈ M_A for any rectangle B, and that M_A is a monotone class. Thus, M_A = M0 when A is a rectangle. For general A ∈ M0, the previous result implies that all the rectangles are in M_A; clearly, M_A is a monotone class, so M_A = M0 for any A ∈ M0. That is, for A, B ∈ M0, A ∩ B, A ∩ B^c, A^c ∩ B ∈ M0.

(b) M0 is a σ-field: for any B1, B2, ... ∈ M0, we can write ∪_i B_i as the union of the increasing sets B1, B1 ∪ B2, .... Since each set in this sequence is in M0 (by (a)) and M0 is a monotone class, ∪_i B_i ∈ M0. Thus, M0 is a σ-field, so it must be equal to A1 × A2.

Now we come back to show that for any B ∈ A1 × A2, I_B satisfies the equality in Theorem 2.6. To do this, we define the class

{B : B ∈ A1 × A2 and I_B satisfies the equality in Theorem 2.6}.

Clearly, this class contains all the rectangles. Second, the class is a monotone class: suppose B1, B2, ... is an increasing sequence of sets in the class; we apply the monotone convergence theorem to

∫_{Ω1×Ω2} I_{B_i} d(µ1 × µ2) = ∫_{Ω2} [ ∫_{Ω1} I_{B_i} dµ1 ] dµ2 = ∫_{Ω1} [ ∫_{Ω2} I_{B_i} dµ2 ] dµ1

and note that I_{B_i} → I_{∪_i B_i}. We conclude that ∪_i B_i is also in the defined class. Therefore, from the previous result about the relationship between monotone classes and σ-fields, the defined class must be the same as A1 × A2. †

Example 2.10 Let (Ω, 2^Ω, µ#) be the counting measure space with Ω = {1, 2, 3, ...} and let (R, B, λ) be the Lebesgue measure space. Define a bivariate function on the product of these two measure spaces by f(x, y) = I(0 ≤ x ≤ y) exp{-y}. To evaluate the integral of f(x, y), we use the Fubini-Tonelli theorem and obtain

∫_{Ω×R} f(x, y) d(µ# × λ) = ∫_Ω [ ∫_R f(x, y) dλ(y) ] dµ#(x) = ∫_Ω exp{-x} dµ#(x) = ∑_{n=1}^∞ exp{-n} = 1/(e - 1).
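A quick numerical confirmation of Example 2.10 (a sketch, not from the notes), carrying out the two orders of integration separately:

```python
import numpy as np

# Order 1: integrate over y first (which gives exp(-x)), then sum over x = 1, 2, ...
order1 = sum(np.exp(-x) for x in range(1, 200))

# Order 2: for each y, the inner integral over x (counting measure) is exp(-y) * #{n >= 1 : n <= y};
# then integrate floor(y) * exp(-y) over y on a fine grid.
y = np.linspace(0.0, 60.0, 600001)
order2 = np.trapz(np.floor(y) * np.exp(-y), y)

print(order1, order2, 1 / (np.e - 1))  # all three agree closely
```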

2.4.2 Absolute continuity and Radon-Nikodym derivative

Let (Ω, A, µ) be a measure space and let X be a non-negative measurable function on Ω. We define a set function ν by

ν(A) = ∫_A X dµ = ∫ I_A X dµ

for each A ∈ A. It is easy to see that ν is also a measure on (Ω, A). X can be regarded as the derivative of the measure ν with respect to µ (one can think of an example in real space). However, one question goes in the opposite direction: if both µ and ν are measures on (Ω, A), can we find a measurable function X such that the above equation holds? To answer this, we need to introduce the definition of absolute continuity.

Definition 2.6 If for any A ∈ A, µ(A) = 0 implies that ν(A) = 0, then ν is said to be absolutely continuous with respect to µ, and we write ν ≺≺ µ. Sometimes it is also said that ν is dominated by µ. †

An equivalent condition is given in the following proposition.

Proposition 2.11 Suppose ν(Ω) < ∞. Then ν ≺≺ µ if and only if for any ε > 0, there exists a δ > 0 such that ν(A) < ε whenever µ(A) < δ. †

Proof The direction "⇐" is clear. To prove "⇒", we argue by contradiction. Suppose there exist ε > 0 and sets An such that ν(An) > ε and µ(An) < n^{-2}. Since ∑_n µ(An) < ∞, we have

µ(lim sup_n An) ≤ ∑_{m≥n} µ(Am) → 0.

Thus µ(lim sup_n An) = 0. However, ν(lim sup_n An) = lim_n ν(∪_{m≥n} Am) ≥ lim sup_n ν(An) ≥ ε. This is a contradiction. †

The following Radon-Nikodym theorem says that if ν is dominated by µ, then a measurable function X satisfying the equation exists. Such an X is called the Radon-Nikodym derivative of ν with respect to µ, denoted dν/dµ.

Theorem 2.7 (Radon-Nikodym theorem) Let (Ω, A, µ) be a σ-finite measure space, and let ν be a measure on (Ω, A) with ν ≺≺ µ. Then there exists a measurable function X ≥ 0 such that ν(A) = ∫_A X dµ for all A ∈ A. X is unique in the sense that if another measurable function Y also satisfies the equation, then X = Y a.e. †

Before proving Theorem 2.7, we need the following Hahn decomposition theorem for additive set functions with real values: φ(A) is defined on a measurable space (Ω, A) such that for countable disjoint sets A1, A2, ...,

φ(∪_n An) = ∑_n φ(An).

The main difference from the usual definition of a measure is that φ(A) can be negative but must be finite.

Proposition 2.12 (Hahn Decomposition) For any additive set function φ, there exist disjoint sets A^+ and A^- such that A^+ ∪ A^- = Ω, φ(E) ≥ 0 for any E ⊂ A^+ and φ(E) ≤ 0 for any E ⊂ A^-. A^+ is called a positive set and A^- a negative set of φ. †

Proof Let α = sup{φ(A) : A ∈ A}. Suppose there exists a set A^+ such that φ(A^+) = α < ∞ (such a set is constructed below). Let A^- = Ω - A^+. If E ⊂ A^+ and φ(E) < 0, then φ(A^+ - E) = α - φ(E) > α, an impossibility. Thus, φ(E) ≥ 0. Similarly, for any E ⊂ A^-, φ(E) ≤ 0.

It remains to construct such an A^+. Choose An such that φ(An) → α. Let A = ∪_n An. For each n, consider all possible intersections of A1, ..., An and their complements relative to A, denoted Bn = {B_{ni} : 1 ≤ i ≤ 2^n}; the collection Bn is a partition of A. Let Cn be the union of those B_{ni} in Bn with φ(B_{ni}) > 0. Then φ(An) ≤ φ(Cn). Moreover, for any m < n, φ(Cm ∪ ... ∪ Cn) ≥ φ(Cm ∪ ... ∪ C_{n-1}). Let A^+ = ∩_{m=1}^∞ ∪_{n≥m} Cn. Then α = lim_m φ(Am) ≤ lim_m φ(∪_{n≥m} Cn) = φ(A^+), so φ(A^+) = α. †

We now prove Theorem 2.7.

Proof We first show that the result holds if µ(Ω) < ∞. Let Ξ be the class of non-negative functions g such that ∫_E g dµ ≤ ν(E) for all E ∈ A. Clearly, 0 ∈ Ξ. If g and g' are in Ξ, then

∫_E max(g, g') dµ = ∫_{E∩{g≥g'}} g dµ + ∫_{E∩{g<g'}} g' dµ ≤ ν(E ∩ {g ≥ g'}) + ν(E ∩ {g < g'}) = ν(E).

Thus, max(g, g') ∈ Ξ. Moreover, if gn ∈ Ξ and gn increases to g, then by the monotone convergence theorem, g ∈ Ξ.

Let α = sup_{g∈Ξ} ∫ g dµ; then α ≤ ν(Ω). Choose gn ∈ Ξ such that ∫ gn dµ > α - n^{-1}. Define fn = max(g1, ..., gn) ∈ Ξ; then fn increases to some f ∈ Ξ and ∫ f dµ = α.

Define a measure νs by νs(E) = ν(E) - ∫_E f dµ ≥ 0. We will show that there exist sets Sµ and Sν such that µ(Ω - Sµ) = 0, νs(Ω - Sν) = 0, and Sµ ∩ Sν = ∅. If this is true, then since ν ≺≺ µ, νs(Ω - Sµ) ≤ ν(Ω - Sµ) = 0. Thus,

νs(E) ≤ νs(E ∩ (Ω - Sµ)) + νs(E ∩ (Ω - Sν)) = 0,

which gives ν(E) = ∫_E f dµ.

We prove the existence of such Sµ and Sν by contradiction. Let A_n^+ ∪ A_n^- be a Hahn decomposition for the set function νs - n^{-1}µ, and let M = ∪_n A_n^+, so M^c = ∩_n A_n^-. Since M^c ⊂ A_n^-, we have νs(M^c) - n^{-1}µ(M^c) ≤ 0, so νs(M^c) ≤ n^{-1}µ(M^c) → 0; that is, νs(M^c) = 0. If µ(M) = 0, we could take Sµ = M^c and Sν = M, so µ(M) must be positive. Therefore, there exists some A = A_n^+ such that µ(A) > 0 and νs(E) ≥ n^{-1}µ(E) for any E ⊂ A. For such an A and ε = 1/n,

∫_E (f + εI_A) dµ = ∫_E f dµ + εµ(E ∩ A)
≤ ∫_E f dµ + νs(E ∩ A)
= ∫_{E∩A} f dµ + νs(E ∩ A) + ∫_{E-A} f dµ
≤ ν(E ∩ A) + ∫_{E-A} f dµ ≤ ν(E ∩ A) + ν(E - A) = ν(E).

In other words, f + εI_A is in Ξ. However, ∫ (f + εI_A) dµ = α + εµ(A) > α. We obtain a contradiction.

We have proved the theorem for µ(Ω) < ∞. If µ is σ-finite, there exists a countable decomposition of Ω into sets Bn with µ(Bn) < ∞. For the measures µn(A) = µ(A ∩ Bn) and νn(A) = ν(A ∩ Bn), we have νn ≺≺ µn, so we can find non-negative fn such that

ν(A ∩ Bn) = ∫_{A∩Bn} fn dµ.

Then ν(A) = ∑_n ν(A ∩ Bn) = ∫_A ∑_n fn I_{Bn} dµ.

The function f satisfying the result must be unique almost everywhere: if f1 and f2 both satisfy ∫_A f1 dµ = ∫_A f2 dµ for all A ∈ A, then choosing A = {f1 - f2 > 0} and A = {f1 - f2 < 0}, we obtain ∫ |f1 - f2| dµ = 0, so f1 = f2 almost everywhere. †

Using the Radon-Nikodym derivative, we can transform integration with respect to the measure ν into integration with respect to the measure µ.

Proposition 2.13 Suppose ν and µ are σ-finite measures defined on a measure space (Ω, A) with ν ≺≺ µ, and suppose Z is a measurable function such that ∫ Z dν is well defined. Then for any A ∈ A,

∫_A Z dν = ∫_A Z (dν/dµ) dµ. †

Proof (i) If Z = I_B where B ∈ A, then

∫_A Z dν = ν(A ∩ B) = ∫_{A∩B} (dν/dµ) dµ = ∫_A I_B (dν/dµ) dµ,

and the result holds.
(ii) If Z ≥ 0, we can find a sequence of simple functions Zn increasing to Z. Clearly, for each Zn,

∫_A Zn dν = ∫_A Zn (dν/dµ) dµ.

Taking limits on both sides and applying the monotone convergence theorem, we obtain the result.
(iii) For general Z, we write Z = Z^+ - Z^-; since ∫ Z dν is well defined, at least one of ∫ Z^+ dν, ∫ Z^- dν is finite. Thus,

∫ Z dν = ∫ Z^+ dν - ∫ Z^- dν = ∫ Z^+ (dν/dµ) dµ - ∫ Z^- (dν/dµ) dµ = ∫ Z (dν/dµ) dµ. †

2.4.3 X-induced measure

Let X be a measurable function defined on (Ω, A, µ). For any B ∈ B, since X^{-1}(B) ∈ A, we can define a set function on the Borel sets by

µX(B) = µ(X^{-1}(B)).

Such a µX is called the measure induced by X. Hence, we obtain a measure space (R, B, µX) on the Borel σ-field.

Suppose that (R, B, ν) is another measure space (often ν is the counting measure or the Lebesgue measure) and µX is dominated by ν with derivative f. Then f is called the density of X with respect to the dominating measure ν. Furthermore, we obtain that for any measurable function g from R to R,

∫_Ω g(X(ω)) dµ(ω) = ∫_R g(x) dµX(x) = ∫_R g(x) f(x) dν(x).

That is, the integral of g(X) on the original measure space Ω can be transformed into the integral of g(x) on R with respect to the induced measure µX, and can be further transformed into the integral of g(x)f(x) with respect to the dominating measure ν.

When (Ω, A, µ) = (Ω, A, P) is a probability measure space, this has a special meaning: X is now a random variable and the above equation becomes

E[g(X)] = ∫_R g(x) f(x) dν(x).

We immediately recognize that f(x) is the density function of X with respect to the dominating measure ν. In particular, if ν is the counting measure, f(x) is the probability mass function; if ν is the Lebesgue measure, f(x) is the probability density function in the usual sense. This fact has an important implication: any expectation of a random variable X can be computed via its probability mass function or density function, without reference to whatever probability measure space X is defined on. This is the reason why, in most statistical frameworks, we seldom mention the underlying measure space and only give either the probability mass function or the probability density function.
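As a small illustration of the last display (a sketch, not from the notes): for X ~ Exponential(1), E[g(X)] computed from the density f(x) = e^{-x} on [0, ∞) agrees with a Monte Carlo average of g(X), with no reference to the underlying probability space.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return np.cos(x)

# Monte Carlo: average g(X) over draws of X ~ Exponential(1).
x_draws = rng.exponential(scale=1.0, size=10**6)
mc = g(x_draws).mean()

# Density route: integral of g(x) f(x) dx with f(x) = exp(-x) on [0, infinity).
grid = np.linspace(0.0, 40.0, 400001)
density_route = np.trapz(g(grid) * np.exp(-grid), grid)

print(mc, density_route)  # both approximate E[cos(X)] = 1/2
```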

2.5 Probability Measure

2.5.1 Parallel definitions

As already discussed, a probability measure space (Ω, A, P) satisfies P(Ω) = 1, and a random variable (or random vector, in multi-dimensional real space) X is a measurable function on this space. The integral of X is the expectation. The density or mass function of X is the Radon-Nikodym derivative of the X-induced measure with respect to the Lebesgue measure or the counting measure on the real line. By using the mass function or the density function, statisticians unconsciously ignore the underlying probability measure space (Ω, A, P). However, it is important to keep in mind that whenever a density or mass function is referred to, we assume that the above procedure has been worked out for some probability space.

Recall that F(x) = P(X ≤ x) is the cumulative distribution function of X. Clearly, F(x) is a nondecreasing function with F(-∞) = 0 and F(∞) = 1. Moreover, F(x) is right-continuous, meaning that F(xn) → F(x) whenever xn decreases to x. Interestingly, one can show that µF, the Lebesgue-Stieltjes measure generated by F, is exactly the same as the measure induced by X, i.e., PX.

Since a probability measure space is a special case of a general measure space, all the properties for general measure spaces, including the monotone convergence theorem, Fatou's lemma, the dominated convergence theorem, and the Fubini-Tonelli theorem, apply.

2.5.2 Conditional expectation and independence

Nevertheless, there are some features specific to probability measures which distinguish probability theory from general measure theory. Two important such features are conditional probability and independence. We describe them in the following text.

In a probability measure space (Ω, A, P), the conditional probability of an event A given another event B is defined as P(A|B) = P(A ∩ B)/P(B), and P(A|B^c) = P(A ∩ B^c)/P(B^c). This means: if B occurs, then the probability that A occurs is P(A|B); if B does not occur, then the probability that A occurs is P(A|B^c). Thus, this conditional probability can be thought of as a measurable function on the σ-field {∅, B, B^c, Ω}, namely

P(A|B) I_B(ω) + P(A|B^c) I_{B^c}(ω).

This simple example in fact characterizes the general definition of conditional probability. Let ℵ be a sub-σ-field of A. For any A ∈ A, the conditional probability of A given ℵ is a measurable function on (Ω, ℵ), denoted P(A|ℵ), satisfying
(i) P(A|ℵ) is measurable in ℵ and integrable;
(ii) for any G ∈ ℵ,

∫_G P(A|ℵ) dP = P(A ∩ G).

Theorem 2.8 (Existence and Uniqueness of Conditional Probability Function) The measurable function P(A|ℵ) exists and is unique in the sense that any two functions satisfying (i) and (ii) are equal almost surely. †

Proof On the probability space (Ω, ℵ, P), define a set function ν on ℵ by ν(G) = P(A ∩ G) for any G ∈ ℵ. It is easy to show that ν is a measure and that P(G) = 0 implies ν(G) = 0. Thus ν ≺≺ P. By the Radon-Nikodym theorem, there exists a ℵ-measurable function X such that

ν(G) = ∫_G X dP.

Thus X satisfies properties (i) and (ii). Suppose X and Y are both measurable in ℵ and ∫_G X dP = ∫_G Y dP for any G ∈ ℵ; that is, ∫_G (X - Y) dP = 0. Choosing G = {X - Y ≥ 0} and G = {X - Y < 0}, we obtain ∫ |X - Y| dP = 0, so X = Y a.s. †

Some properties of the conditional probability P(·|ℵ) are the following.

Theorem 2.9 P(∅|ℵ) = 0, P(Ω|ℵ) = 1 a.e., and

0 ≤ P(A|ℵ) ≤ 1

for each A ∈ A. If A1, A2, ... is a finite or countable sequence of disjoint sets in A, then

P(∪_n An|ℵ) = ∑_n P(An|ℵ). †


These properties can be verified directly from the definition. Now we define the conditional expectation of an integrable random variable X given ℵ, denoted E[X|ℵ], by
(i) E[X|ℵ] is measurable in ℵ and integrable;
(ii) for any G ∈ ℵ,

∫_G E[X|ℵ] dP = ∫_G X dP, equivalently, E[E[X|ℵ] I_G] = E[X I_G].

The existence and uniqueness of E[X|ℵ] can be shown as in Theorem 2.8. The following properties are fundamental.

Theorem 2.10 Suppose X, Y, Xn are integrable.
(i) If X = a a.s., then E[X|ℵ] = a.
(ii) For constants a and b, E[aX + bY|ℵ] = aE[X|ℵ] + bE[Y|ℵ].
(iii) If X ≤ Y a.s., then E[X|ℵ] ≤ E[Y|ℵ].
(iv) |E[X|ℵ]| ≤ E[|X| | ℵ].
(v) If lim_n Xn = X a.s., |Xn| ≤ Y and Y is integrable, then lim_n E[Xn|ℵ] = E[X|ℵ].
(vi) If X is measurable in ℵ, then E[XY|ℵ] = XE[Y|ℵ].
(vii) For two sub-σ-fields ℵ1 and ℵ2 such that ℵ1 ⊂ ℵ2, E[E[X|ℵ2]|ℵ1] = E[X|ℵ1].
(viii) P(A|ℵ) = E[I_A|ℵ]. †

Proof (i)-(iv) can be shown directly from the definition. To prove (v), consider Zn = sup_{m≥n} |Xm - X|. Then Zn decreases to 0. From (iii), we have

|E[Xn|ℵ] - E[X|ℵ]| ≤ E[Zn|ℵ].

On the other hand, E[Zn|ℵ] decreases to a limit Z ≥ 0. The result holds if we can show Z = 0 a.s. Note E[Zn|ℵ] ≤ E[2Y|ℵ]; by the dominated convergence theorem,

E[Z] = ∫ E[Z|ℵ] dP ≤ ∫ E[Zn|ℵ] dP → 0.

Thus Z = 0 a.s.

To see that (vi) holds, we first show it for a simple function X = ∑_i x_i I_{B_i}, where the B_i are disjoint sets in ℵ. For any G ∈ ℵ,

∫_G E[XY|ℵ] dP = ∫_G XY dP = ∑_i x_i ∫_{G∩B_i} Y dP = ∑_i x_i ∫_{G∩B_i} E[Y|ℵ] dP = ∫_G XE[Y|ℵ] dP.

Hence, E[XY|ℵ] = XE[Y|ℵ]. For general X, using the previous construction, we can find a sequence of simple functions Xn converging to X with |Xn| ≤ |X|. Then we have

∫_G Xn Y dP = ∫_G Xn E[Y|ℵ] dP.

Note that |Xn E[Y|ℵ]| = |E[Xn Y|ℵ]| ≤ E[|XY| | ℵ]. Taking limits on both sides and applying the dominated convergence theorem, we obtain

∫_G XY dP = ∫_G XE[Y|ℵ] dP.

Then E[XY|ℵ] = XE[Y|ℵ].

For (vii), for any G ∈ ℵ1 ⊂ ℵ2, it is clear that

∫_G E[X|ℵ2] dP = ∫_G X dP = ∫_G E[X|ℵ1] dP.

(viii) is clear from the definition of the conditional probability. †

How can we relate the above conditional probability and conditional expectation given a sub-σ-field to the conditional distribution or density of X given Y? In R^2, suppose (X, Y) has joint density function f(x, y); then it is known that the conditional density of X given Y = y equals f(x, y)/∫_x f(x, y) dx, and the conditional expectation of X given Y = y equals ∫_x x f(x, y) dx / ∫_x f(x, y) dx. To recover these formulae using the current definition, we define ℵ = σ(Y), the σ-field generated by the class {{Y ≤ y} : y ∈ R}. Then we can define the conditional probability P(X ∈ B|ℵ) for any B in (R, B). Since P(X ∈ B|ℵ) is measurable in σ(Y), P(X ∈ B|ℵ) = g(B, Y), where g(B, ·) is a measurable function. For any {Y ≤ y0} ∈ ℵ,

∫_{Y≤y0} P(X ∈ B|ℵ) dP = ∫ I(y ≤ y0) g(B, y) fY(y) dy = P(X ∈ B, Y ≤ y0) = ∫ I(y ≤ y0) [ ∫_B f(x, y) dx ] dy.

Differentiating with respect to y0, we have g(B, y) fY(y) = ∫_B f(x, y) dx. Thus,

P(X ∈ B|ℵ) = ∫_B f(x|y) dx evaluated at y = Y, where f(x|y) = f(x, y)/fY(y).

That is, the conditional density of X given Y = y is in fact the density function of the conditional probability P(X ∈ ·|ℵ) with respect to the Lebesgue measure.

On the other hand, E[X|ℵ] = g(Y) for some measurable function g(·). Note that

∫ I(Y ≤ y0) E[X|ℵ] dP = ∫ I(y ≤ y0) g(y) fY(y) dy = E[X I(Y ≤ y0)] = ∫ I(y ≤ y0) [ ∫ x f(x, y) dx ] dy.

We obtain g(y) = ∫ x f(x, y) dx / ∫ f(x, y) dx. Then E[X|ℵ] is the same as the conditional expectation of X given Y = y, evaluated at y = Y.

Finally, we give the definition of independence. Two measurable sets (events) A1 and A2 in A are independent if P(A1 ∩ A2) = P(A1)P(A2). Two random variables X and Y are said to be independent if for any Borel sets B1 and B2, P(X ∈ B1, Y ∈ B2) = P(X ∈ B1)P(Y ∈ B2). In terms of conditional expectation, independence of X and Y implies that for any measurable function g with g(X) integrable, E[g(X)|Y] = E[g(X)].


READING MATERIALS: You should read Lehmann and Casella, Sections 1.2 and 1.3. You may read Lehmann, Testing Statistical Hypotheses, Chapter 2.

PROBLEMS

1. Let O be the class of all open sets in R. Show that the Borel σ-field B is also the σ-field generated by O, i.e., B = σ(O).

2. Suppose (Ω, A, µ) is a measure space. For any set C ∈ A, define A ∩ C as {A ∩ C : A ∈ A}. Show that (Ω ∩ C, A ∩ C, µ) is a measure space (it is called the measure space restricted to C).

3. Suppose (Ω, A, µ) is a measure space. We define a new class

AĢ„ = {A ∪ N : A ∈ A and N is contained in a set B ∈ A with µ(B) = 0}.

Furthermore, we define a set function µĢ„ on AĢ„: for any A ∪ N ∈ AĢ„, µĢ„(A ∪ N) = µ(A). Show that (Ω, AĢ„, µĢ„) is a measure space (it is called the completion of (Ω, A, µ)).

4. Suppose (R, B, P) is a probability measure space. Let F(x) = P((-∞, x]). Show

(a) F(x) is an increasing and right-continuous function with F(-∞) = 0 and F(∞) = 1. F is called a distribution function.

(b) if µF denotes the Lebesgue-Stieltjes measure generated from F, then P(B) = µF(B) for any B ∈ B. Hint: use the uniqueness of measure extension in the Caratheodory extension theorem.

Remark: In other words, any probability measure on the Borel σ-field can be considered as a Lebesgue-Stieltjes measure generated from some distribution function. Obviously, a Lebesgue-Stieltjes measure generated from some distribution function is a probability measure. This gives a one-to-one correspondence between probability measures and distribution functions.

5. Let (R, B, µF) be a measure space, where B is the Borel σ-field and µF is the Lebesgue-Stieltjes measure generated from F(x) = (1 - e^{-x}) I(x ≥ 0).

(a) Show that for any interval (a, b], µF((a, b]) = ∫_{(a,b]} e^{-x} I(x ≥ 0) dµ(x), where µ is the Lebesgue measure in R.

(b) Use the uniqueness of measure extension in the Caratheodory extension theorem to show µF(B) = ∫_B e^{-x} I(x ≥ 0) dµ(x) for any B ∈ B.

(c) Show that for any measurable function X in (R, B) with X ≥ 0, ∫ X(x) dµF(x) = ∫ X(x) e^{-x} I(x ≥ 0) dµ(x). Hint: use a sequence of simple functions to approximate X.

(d) Using the above result and the fact that for any Riemann integrable function its Riemann integral is the same as its Lebesgue integral, calculate the integral ∫ (1 + e^{-x})^{-1} dµF(x).

6. If X ≥ 0 is a measurable function on a measure space (Ω, A, µ) and ∫ X dµ = 0, show that µ(ω : X(ω) > 0) = 0.

7. Suppose X is a measurable function with ∫ |X| dµ < ∞. Show that for each ε > 0, there exists a δ > 0 such that ∫_A |X| dµ < ε whenever µ(A) < δ.

8. Let µ be the Borel measure in R and let ν be the discrete measure on the space Ω = {1, 2, 3, ...} with ν({n}) = 2^{-n} for n = 1, 2, 3, .... Define a function f(x, y) : R × Ω ↦ R as f(x, y) = I(y - 1 ≤ x < y) x. Show that f(x, y) is a measurable function with respect to the product measure space (R × Ω, σ(B × 2^Ω), µ × ν) and calculate ∫_{R×Ω} f(x, y) d(µ × ν)(x, y).

9. F and G are two continuous generalized distribution functions. Use the Fubini-Tonelli theorem to show that for any a ≤ b,

F(b)G(b) - F(a)G(a) = ∫_{[a,b]} F dG + ∫_{[a,b]} G dF (integration by parts).

Hint: consider the equality

∫_{[a,b]×[a,b]} d(µF × µG) = ∫_{[a,b]×[a,b]} I(x ≥ y) d(µF × µG) + ∫_{[a,b]×[a,b]} I(x < y) d(µF × µG),

where µF and µG are the measures generated by F and G, respectively.

10. Let µ be the Borel measure in R. List all rational numbers in R as r1, r2, .... Define ν as another measure such that for any B ∈ B, ν(B) = µ(B ∩ [0, 1]) + ∑_{r_i ∈ B} 2^{-i}. Show that neither ν ≺≺ µ nor µ ≺≺ ν is true; however, ν ≺≺ µ + ν. Calculate the Radon-Nikodym derivative dν/d(µ + ν).

11. X is a random variable in a probability measure space (Ω, A, P). Let PX be the probability measure induced by X. Show that for any measurable function g : R → R such that g(X) is integrable,

∫_Ω g(X(ω)) dP(ω) = ∫_R g(x) dPX(x).

Hint: first prove it for a simple function g.

12. X1, ..., Xn are i.i.d. Uniform(0,1). Let X_{(n)} be max{X1, ..., Xn}. Calculate the conditional expectation E[X1|σ(X_{(n)})], or equivalently, E[X1|X_{(n)}].

13. X and Y are two random variables with density functions f(x) and g(y) in R. Define A = {x : f(x) > 0} and B = {y : g(y) > 0}. Show that PX, the measure induced by X, is dominated by PY, the measure induced by Y, if and only if λ(A ∩ B^c) = 0 (that is, A is almost contained in B). Here, λ is the Lebesgue measure in R. Use this result to show that the measure induced by a Uniform(0, 1) random variable is dominated by the measure induced by a N(0, 1) random variable, but the opposite is not true.

14. Continue Question 9, Chapter 1. The distribution functions FU and FL are called the Frechet bounds. Show that FL and FU are singular with respect to the Lebesgue measure λ2 on [0, 1]^2; i.e., show that the corresponding probability measures PL and PU satisfy

P((X, Y) ∈ A) = 1, λ2(A) = 0

and

P((X, Y) ∈ A^c) = 0, λ2(A^c) = 1

for some set A (which will be different for PL and PU). This implies that FL and FU do not have densities with respect to the Lebesgue measure on [0, 1]^2.

15. Lehmann and Casella, page 63, problem 2.6

16. Lehmann and Casella, page 64, problem 2.11

17. Lehmann and Casella, page 64, problem 3.1

18. Lehmann and Casella, page 64, problem 3.3

19. Lehmann and Casella, page 64, problem 3.7

CHAPTER 3 LARGE SAMPLE THEORY

In many probabilistic and statistical problems, we are faced with a sequence of random variables (vectors), say Xn, and wish to understand the limit properties of Xn. As one example, let Xn be the number of heads appearing in n independent coin tosses. Interesting questions are: what is the limit of the proportion of observed heads, Xn/n, when n is large? How accurate is Xn/n as an estimate of the probability of observing a head in a single flip? The theory studying the limit properties of a sequence of random variables (vectors) Xn is called large sample theory. In this chapter, we always assume the existence of a probability measure space (Ω, A, P) and suppose X, Xn, n ≥ 1, are random variables (vectors) defined on this probability space.

3.1 Modes of Convergence in Real Space

3.1.1 Definitions

Definition 3.1 Xn is said to converge almost surely to X, denoted Xn →a.s. X, if there exists a set A ⊂ Ω such that P(A^c) = 0 and for each ω ∈ A, Xn(ω) → X(ω). †

Remark 3.1 Note that

{ω : Xn(ω) → X(ω)}^c = ∪_{ε>0} ∩_n {ω : sup_{m≥n} |Xm(ω) - X(ω)| > ε}.

Then the above definition is equivalent to: for every ε > 0,

P(sup_{m≥n} |Xm - X| > ε) → 0 as n → ∞.

Such an equivalence is also implied in Proposition 2.9.

Definition 3.2 Xn is said to converge in probability to X, denoted Xn →p X, if for every ε > 0,

P(|Xn - X| > ε) → 0. †

Definition 3.3 Xn is said to converge in rth mean to X, denoted Xn →r X, if

E[|Xn - X|^r] → 0 as n → ∞ for Xn, X ∈ L_r(P),

where X ∈ L_r(P) means E[|X|^r] = ∫ |X|^r dP < ∞. †

Definition 3.4 Xn is said to converge in distribution to X, denoted Xn →d X or Fn →d F (or L(Xn) → L(X), with L referring to the "law" or "distribution"), if the distribution functions Fn and F of Xn and X satisfy

Fn(x) → F(x) as n → ∞ for each continuity point x of F. †

Definition 3.5 A sequence of random variables Xn is uniformly integrable if

lim_{λ→∞} lim sup_{n→∞} E[|Xn| I(|Xn| ≥ λ)] = 0. †

3.1.2 Relationship among the modes

The following theorem describes the relationship among all the convergence modes.

Theorem 3.1 (A) If Xn →a.s. X, then Xn →p X.
(B) If Xn →p X, then X_{n_k} →a.s. X for some subsequence {X_{n_k}}.
(C) If Xn →r X, then Xn →p X.
(D) If Xn →p X and {|Xn|^r} is uniformly integrable, then Xn →r X.
(E) If Xn →p X and lim sup_n E|Xn|^r ≤ E|X|^r, then Xn →r X.
(F) If Xn →r X, then Xn →r' X for any 0 < r' ≤ r.
(G) If Xn →p X, then Xn →d X.
(H) Xn →p X if and only if for every subsequence {X_{n_k}} there exists a further subsequence {X_{n_{k,l}}} such that X_{n_{k,l}} →a.s. X.
(I) If Xn →d c for a constant c, then Xn →p c. †

Remark 3.2 The results of Theorem 3.1 appear complicated; however, they are summarized in Figure 1.

[Figure 1: Relationship among Modes of Convergence]


Proof (A) For any ε > 0,

P(|Xn - X| > ε) ≤ P(sup_{m≥n} |Xm - X| > ε) → 0.

(B) Since for any ε > 0, P(|Xn - X| > ε) → 0, we choose ε = 2^{-m}; then there exists X_{n_m} such that

P(|X_{n_m} - X| > 2^{-m}) < 2^{-m},

and we can choose n_m to be increasing. For the sequence {X_{n_m}}, note that for any ε > 0 and m large enough that 2^{-m} ≤ ε,

P(sup_{k≥m} |X_{n_k} - X| > ε) ≤ ∑_{k≥m} P(|X_{n_k} - X| > 2^{-k}) ≤ ∑_{k≥m} 2^{-k} → 0.

Thus, X_{n_m} →a.s. X.

(C) We use the Markov inequality: for any positive and increasing function g(·) and random variable Y,

P(|Y| > ε) ≤ E[g(|Y|)]/g(ε).

In particular, choosing Y = |Xn - X| and g(y) = |y|^r gives

P(|Xn - X| > ε) ≤ E[|Xn - X|^r]/ε^r → 0.

(D) It is sufficient to show that for any subsequence of {Xn}, there exists a further subsequence {X_{n_k}} such that E[|X_{n_k} - X|^r] → 0. For any subsequence of {Xn}, from (B), there exists a further subsequence {X_{n_k}} such that X_{n_k} →a.s. X. We show the result holds for {X_{n_k}}. For any ε > 0, there exists λ such that

lim sup_{n_k} E[|X_{n_k}|^r I(|X_{n_k}|^r ≥ λ)] < ε.

In particular, we choose λ (depending only on ε) such that P(|X|^r = λ) = 0. Then |X_{n_k}|^r I(|X_{n_k}|^r ≥ λ) →a.s. |X|^r I(|X|^r ≥ λ). By Fatou's Lemma,

E[|X|^r I(|X|^r ≥ λ)] = ∫ lim_{n_k} |X_{n_k}|^r I(|X_{n_k}|^r ≥ λ) dP ≤ lim inf_{n_k} E[|X_{n_k}|^r I(|X_{n_k}|^r ≥ λ)] < ε.

Therefore,

E[|X_{n_k} - X|^r] ≤ E[|X_{n_k} - X|^r I(|X_{n_k}|^r < 2λ, |X|^r < 2λ)] + E[|X_{n_k} - X|^r I(|X_{n_k}|^r ≥ 2λ or |X|^r ≥ 2λ)]
≤ E[|X_{n_k} - X|^r I(|X_{n_k}|^r < 2λ, |X|^r < 2λ)] + 2^r E[(|X_{n_k}|^r + |X|^r) I(|X_{n_k}|^r ≥ 2λ or |X|^r ≥ 2λ)],

where the last inequality follows from (x + y)^r ≤ 2^r (max(x, y))^r ≤ 2^r (x^r + y^r) for x ≥ 0, y ≥ 0. The first term converges to zero by the dominated convergence theorem.


Furthermore, when n_k is large, I(|X_{n_k}|^r ≥ 2λ) ≤ I(|X|^r ≥ λ) and I(|X|^r ≥ 2λ) ≤ I(|X_{n_k}|^r ≥ λ) almost surely. Then the second term is bounded by

2 · 2^r { E[|X_{n_k}|^r I(|X_{n_k}|^r ≥ λ)] + E[|X|^r I(|X|^r ≥ λ)] },

whose limit superior is at most 2^{r+1}(ε + ε) = 2^{r+2}ε. Thus,

lim sup_{n_k} E[|X_{n_k} - X|^r] ≤ 2^{r+2}ε.

Letting ε tend to zero gives the result.

(E) It is sufficient to show that for any subsequence of {Xn}, there exists a further subsequence {X_{n_k}} such that E[|X_{n_k} - X|^r] → 0. For any subsequence of {Xn}, from (B), there exists a further subsequence {X_{n_k}} such that X_{n_k} →a.s. X. Define

Y_{n_k} = 2^r (|X_{n_k}|^r + |X|^r) - |X_{n_k} - X|^r ≥ 0.

We apply Fatou's Lemma to Y_{n_k} and obtain

∫ lim inf_{n_k} Y_{n_k} dP ≤ lim inf_{n_k} ∫ Y_{n_k} dP,

which is equivalent to

2^{r+1} E[|X|^r] ≤ lim inf_{n_k} { 2^r E[|X_{n_k}|^r] + 2^r E[|X|^r] - E[|X_{n_k} - X|^r] }.

Thus,

lim sup_{n_k} E[|X_{n_k} - X|^r] ≤ 2^r { lim inf_{n_k} E[|X_{n_k}|^r] - E[|X|^r] } ≤ 0.

The result holds.

(F) We use the Holder inequality:

∫ |f(x) g(x)| dµ ≤ { ∫ |f(x)|^p dµ(x) }^{1/p} { ∫ |g(x)|^q dµ(x) }^{1/q}, 1/p + 1/q = 1.

Choosing µ = P, f = |Xn - X|^{r'}, g ≔ 1, p = r/r' and q = r/(r - r') in the Holder inequality, we obtain

E[|Xn - X|^{r'}] ≤ E[|Xn - X|^r]^{r'/r} → 0.

(G) Suppose Xn →p X. If x is a continuity point of the distribution of X, i.e., P(X = x) = 0, then for any ε > 0 and δ > 0,

P(|I(Xn ≤ x) - I(X ≤ x)| > ε)
= P(|I(Xn ≤ x) - I(X ≤ x)| > ε, |X - x| > δ) + P(|I(Xn ≤ x) - I(X ≤ x)| > ε, |X - x| ≤ δ)
≤ P(Xn ≤ x, X > x + δ) + P(Xn > x, X < x - δ) + P(|X - x| ≤ δ)
≤ P(|Xn - X| > δ) + P(|X - x| ≤ δ).

The first term converges to zero as n → ∞ since Xn →p X. The second term can be made arbitrarily small by choosing δ small, since lim_{δ→0} P(|X - x| ≤ δ) = P(X = x) = 0. Thus, we have shown that I(Xn ≤ x) →p I(X ≤ x). From the dominated convergence theorem,

Fn(x) = E[I(Xn ≤ x)] → E[I(X ≤ x)] = FX(x).


Thus, Xn →d X.

(H) One direction follows from (B). To prove the other direction, we argue by contradiction. Suppose there exists ε > 0 such that P(|Xn - X| > ε) does not converge to zero. Then we can find a subsequence {X_{n'}} such that P(|X_{n'} - X| > ε) > δ for some δ > 0. However, by the condition, we can choose a further subsequence {X_{n''}} such that X_{n''} →a.s. X, and then X_{n''} →p X from (A). This is a contradiction.

(I) Let X ≔ c. The claim is clear from the following:

P(|Xn - c| > ε) ≤ 1 - Fn(c + ε) + Fn(c - ε) → 1 - FX(c + ε) + FX(c - ε) = 0. †

Remark 3.3 Denote E[|X|^r] by µ_r. Then, arguing as in the proof of (F) in Theorem 3.1, we obtain µ_r^{s-t} µ_t^{r-s} ≥ µ_s^{r-t} for r ≥ s ≥ t ≥ 0. Thus, log µ_r is convex in r for r ≥ 0. Furthermore, the proof of (F) shows that µ_r^{1/r} is increasing in r.

Remark 3.4 For r ≥ 1, we denote E[|X|^r]^{1/r} by ‖X‖_r (or ‖X‖_{L_r(P)}). Clearly, ‖X‖_r ≥ 0 and the equality holds if and only if X = 0 a.s. For any constant λ, ‖λX‖_r = |λ| ‖X‖_r. Furthermore, we note that

E[|X + Y|^r] ≤ E[(|X| + |Y|) |X + Y|^{r-1}] ≤ E[|X|^r]^{1/r} E[|X + Y|^r]^{1-1/r} + E[|Y|^r]^{1/r} E[|X + Y|^r]^{1-1/r}.

Then we obtain the triangle inequality (the Minkowski inequality)

‖X + Y‖_r ≤ ‖X‖_r + ‖Y‖_r.

Therefore, ‖·‖_r is in fact a norm on the linear space {X : ‖X‖_r < ∞}. This normed space is denoted L_r(P).

The following examples illustrate the results of Theorem 3.1.

Example 3.1 Suppose that Xn is degenerate at the point 1/n; i.e., P(Xn = 1/n) = 1. Then Xn converges in distribution to zero. Indeed, Xn converges almost surely to zero.

Example 3.2 X1, X2, ... are i.i.d. with the standard normal distribution. Then Xn →d X1, but Xn does not converge in probability to X1.

Example 3.3 Let Z be a random variable with a uniform distribution on [0, 1]. Let Xn = I(m2^{-k} ≤ Z < (m+1)2^{-k}) when n = 2^k + m, where 0 ≤ m < 2^k. Then Xn converges in probability to zero but not almost surely. This example was already given in the second chapter.

Example 3.4 Let Z be Uniform(0, 1) and let Xn = 2^n I(0 ≤ Z < 1/n). Then E[|Xn|^r] → ∞ for every r > 0, but Xn converges to zero almost surely.
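A small simulation of Example 3.4 (a sketch, not from the notes): along any fixed draw of Z the sequence Xn = 2^n I(0 ≤ Z < 1/n) is eventually zero, yet its mean 2^n/n blows up.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.uniform(size=5)

# For each fixed Z, X_n = 2^n * I(Z < 1/n) equals 0 for every n > 1/Z:
print(np.ceil(1 / Z))                      # beyond this index the path is identically zero
print([2.0**n / n for n in (10, 20, 30)])  # but E[X_n] = 2^n / n diverges
```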

The next theorem gives necessary and sufficient conditions under which convergence in probability implies convergence in moments.

Theorem 3.2 (Vitali's theorem) Suppose that Xn ∈ L_r(P), i.e., ‖Xn‖_r < ∞, where 0 < r < ∞, and Xn →p X. Then the following are equivalent:
(A) {|Xn|^r} is uniformly integrable.
(B) Xn →r X.
(C) E[|Xn|^r] → E[|X|^r]. †

Proof (A) ⇒ (B) has been shown in proving (D) of Theorem 3.1. To prove (B) ⇒ (C), first, from Fatou's lemma, we have

lim inf_n E[|Xn|^r] ≥ E[|X|^r].

Second, we apply Fatou's lemma to 2^r (|Xn - X|^r + |X|^r) - |Xn|^r ≥ 0 and obtain

E[2^r |X|^r - |X|^r] ≤ 2^r lim inf_n E[|Xn - X|^r] + 2^r E[|X|^r] - lim sup_n E[|Xn|^r].

Thus,

lim sup_n E[|Xn|^r] ≤ E[|X|^r] + 2^r lim inf_n E[|Xn - X|^r],

and the lim inf on the right-hand side is zero under (B). We conclude that E[|Xn|^r] → E[|X|^r].

To prove (C) ⇒ (A), note that for any λ such that P(|X|^r = λ) = 0, by the dominated convergence theorem,

lim sup_n E[|Xn|^r I(|Xn|^r ≥ λ)] = lim sup_n { E[|Xn|^r] - E[|Xn|^r I(|Xn|^r < λ)] } = E[|X|^r I(|X|^r ≥ λ)].

Thus,

lim_{λ→∞} lim sup_n E[|Xn|^r I(|Xn|^r ≥ λ)] = lim_{λ→∞} E[|X|^r I(|X|^r ≥ λ)] = 0. †

From Theorem 3.2, we see that uniform integrability plays an important role in ensuring convergence in moments. One sufficient condition for the uniform integrability of {|Xn|^r} is the Liapunov condition: if there exists a positive constant ε0 such that lim sup_n E[|Xn|^{r+ε0}] < ∞, then {|Xn|^r} is uniformly integrable. This is because

E[|Xn|^r I(|Xn|^r ≥ λ)] ≤ E[|Xn|^{r+ε0}] / λ^{ε0/r}.
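A quick numerical illustration of the Liapunov criterion (a sketch, not from the notes): for Xn ≔ N(0, 1), the second moments are bounded, so the tail expectations E[|Xn| I(|Xn| ≥ λ)] are uniformly small for large λ; the bound from the display with r = 1 and ε0 = 1 is also shown.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(10**6)  # draws from the common N(0,1) law of every X_n

for lam in (1.0, 2.0, 5.0, 10.0):
    tail = np.abs(x) * (np.abs(x) >= lam)
    print(lam, tail.mean(), (x**2).mean() / lam)  # tail expectation vs. Liapunov bound E[X^2]/lambda
```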

3.1.3 Useful integral inequalities

We list some useful inequalities below, some of which have already been used. The first is the Holder inequality:

∫ |f(x) g(x)| dµ ≤ { ∫ |f(x)|^p dµ(x) }^{1/p} { ∫ |g(x)|^q dµ(x) }^{1/q}, 1/p + 1/q = 1.

We briefly describe how the Holder inequality is derived. First, the following inequality (Young's inequality) holds:

|ab| ≤ |a|^p/p + |b|^q/q, a, b > 0,

where the equality holds if and only if |a|^p = |b|^q. This inequality is clear from its geometric meaning. In this inequality, we choose a = f(x)/{∫ |f(x)|^p dµ(x)}^{1/p} and b = g(x)/{∫ |g(x)|^q dµ(x)}^{1/q} and integrate over x on both sides. This gives the Holder inequality, and the equality holds if and only if |f(x)|^p is proportional to |g(x)|^q almost everywhere. When p = q = 2, the inequality becomes

∫ |f(x) g(x)| dµ(x) ≤ { ∫ f(x)^2 dµ(x) }^{1/2} { ∫ g(x)^2 dµ(x) }^{1/2},

which is the Cauchy-Schwartz inequality. One implication is that for non-trivial X and Y, (E[|XY|])^2 ≤ E[|X|^2] E[|Y|^2], and the equality holds if and only if |X| = c0|Y| almost surely for some constant c0.

A second important inequality is the Markov inequality, which was used in proving (C) of Theorem 3.1:

P(|X| ≥ ε) ≤ E[g(|X|)]/g(ε),

where g ≥ 0 is an increasing function on [0, ∞). We can choose different g to obtain many similar inequalities. The proof of the Markov inequality is immediate from the following:

P(|X| > ε) = E[I(|X| > ε)] ≤ E[ {g(|X|)/g(ε)} I(|X| > ε) ] ≤ E[g(|X|)]/g(ε).

If we choose g(x) = x^2 and replace X by X - E[X] in the Markov inequality, we obtain

P(|X - E[X]| ≥ ε) ≤ Var(X)/ε^2.

This is the Chebyshev inequality, which gives an upper bound on the tail probability of X in terms of its variance.
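A quick empirical check of the Chebyshev inequality (a sketch, not from the notes), comparing the simulated tail probability of an exponential random variable with the bound Var(X)/ε^2:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=10**6)   # E[X] = 1, Var(X) = 1

for eps in (1.0, 2.0, 3.0):
    tail = np.mean(np.abs(x - 1.0) >= eps)   # estimate of P(|X - E[X]| >= eps)
    print(eps, tail, 1.0 / eps**2)           # estimate vs. Chebyshev bound Var(X)/eps^2
```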

In summary, we have introduced different modes of convergence for random variables and obtained the relationships among these modes. The same definitions and relationships generalize to random vectors. One additional remark: since convergence almost surely and convergence in probability are special cases of convergence almost everywhere and convergence in measure as given in the second chapter, all the theorems in Section 2.3.3, including the monotone convergence theorem, Fatou's lemma and the dominated convergence theorem, apply. Convergence in distribution is the only mode specific to probability measures; in fact, this mode will be the main interest of the subsequent sections.

3.2 Convergence in Distribution

Among all the convergence modes of Xn, convergence in distribution is the weakest. However, this mode of convergence plays an important role in statistical inference, especially when the large sample behavior of random variables is of interest. We focus on this particular convergence in this section.

3.2.1 Portmanteau theorem

The following theorem gives equivalent conditions for the convergence in distribution of a sequence of random variables Xn.


Theorem 3.3 (Portmanteau Theorem) The following conditions are equivalent.
(a) Xn converges in distribution to X.
(b) For any bounded continuous function g(·), E[g(Xn)] → E[g(X)].
(c) For any open set G in R, lim inf_n P(Xn ∈ G) ≥ P(X ∈ G).
(d) For any closed set F in R, lim sup_n P(Xn ∈ F) ≤ P(X ∈ F).
(e) For any Borel set O in R with P(X ∈ ∂O) = 0, where ∂O is the boundary of O, P(Xn ∈ O) → P(X ∈ O). †

Proof (a) ⇒ (b). Without loss of generality, assume |g(x)| ≤ 1. Choose [-M, M] such that P(|X| = M) = 0. Since g is continuous on [-M, M], g is uniformly continuous on [-M, M]. Thus for any ε > 0, we can partition [-M, M] into finitely many intervals I1 ∪ ... ∪ Im such that within each interval Ik, max_{Ik} g(x) - min_{Ik} g(x) ≤ ε, and such that X has no mass at any endpoint of the Ik (this is feasible since X has at most countably many point masses). Therefore, choosing any point xk ∈ Ik, k = 1, ..., m,

|E[g(Xn)] - E[g(X)]|
≤ E[|g(Xn)| I(|Xn| > M)] + E[|g(X)| I(|X| > M)]
+ |E[g(Xn) I(|Xn| ≤ M)] - ∑_{k=1}^m g(xk) P(Xn ∈ Ik)|
+ |∑_{k=1}^m g(xk) P(Xn ∈ Ik) - ∑_{k=1}^m g(xk) P(X ∈ Ik)|
+ |E[g(X) I(|X| ≤ M)] - ∑_{k=1}^m g(xk) P(X ∈ Ik)|
≤ P(|Xn| > M) + P(|X| > M) + 2ε + ∑_{k=1}^m |P(Xn ∈ Ik) - P(X ∈ Ik)|.

Thus, lim sup_n |E[g(Xn)] - E[g(X)]| ≤ 2P(|X| > M) + 2ε. Let M → ∞ and ε → 0; we obtain (b).

(b) ⇒ (c). For any open set G, we define the function

g(x) = 1 - ε/{ε + d(x, G^c)},

where d(x, G^c) is the distance between x and G^c, defined as inf_{y∈G^c} |x - y|. Since for any y ∈ G^c,

d(x1, G^c) - |x2 - y| ≤ |x1 - y| - |x2 - y| ≤ |x1 - x2|,

we have d(x1, G^c) - d(x2, G^c) ≤ |x1 - x2|. Then

|g(x1) - g(x2)| ≤ ε^{-1} |d(x1, G^c) - d(x2, G^c)| ≤ ε^{-1} |x1 - x2|,

so g(x) is continuous and bounded. From (b), E[g(Xn)] → E[g(X)]. Note g(x) = 0 if x ∉ G and |g(x)| ≤ 1. Thus,

lim inf_n P(Xn ∈ G) ≥ lim inf_n E[g(Xn)] = E[g(X)].

Letting ε → 0, E[g(X)] increases to E[I(X ∈ G)] = P(X ∈ G), since d(x, G^c) > 0 for x in the open set G; this gives (c).

(c) ⇒ (d). This is clear by taking the complement of F.

(d) ⇒ (e). For any O with P(X ∈ ∂O) = 0, we have

lim sup_n P(Xn ∈ O) ≤ lim sup_n P(Xn ∈ OĢ„) ≤ P(X ∈ OĢ„) = P(X ∈ O),

and

lim inf_n P(Xn ∈ O) ≥ lim inf_n P(Xn ∈ O^o) ≥ P(X ∈ O^o) = P(X ∈ O).

Here, OĢ„ and O^o are the closure and interior of O, respectively.

(e) ⇒ (a). This is clear by choosing O = (-∞, x] with P(X ∈ ∂O) = P(X = x) = 0. †

The conditions in Theorem 3.3 are necessary, as seen in the following examples.

Example 3.5 Let g(x) = x, a continuous but unbounded function. Let Xn be a random variable taking value n with probability 1/n and value 0 with probability 1 - 1/n. Then Xn →d 0. However, E[g(Xn)] = 1, which does not converge to E[g(0)] = 0. This shows that the boundedness of g in condition (b) is necessary.

Example 3.6 The boundary condition in (e) is also necessary: let Xn be degenerate at 1/n and consider O = {x : x > 0}. Then P(Xn ∈ O) = 1 for every n, but Xn →d 0 and the limit X ≔ 0 satisfies P(X ∈ O) = 0.

3.2.2 Continuity theorem

Another way of verifying convergence in distribution of Xn is via the convergence of the characteristic functions of Xn, as given in the following theorem. This result is very useful in many applications.

Theorem 3.4 (Continuity Theorem) Let φn and φ denote the characteristic functions of Xn and X, respectively. Then Xn →d X is equivalent to φn(t) → φ(t) for each t. †

Proof The ⇒ direction follows from (b) of Theorem 3.3, since cos(tx) and sin(tx) are bounded continuous functions:

φn(t) = E[e^{itXn}] → E[e^{itX}] = φ(t).

We thus need to prove the ⇐ direction. The proof consists of the following steps.

Step 1. We show that for any ε > 0, there exists an M such that sup_n P(|Xn| > M) < ε. This property is called asymptotic tightness of {Xn}. To see this, note that

(1/δ) ∫_{-δ}^{δ} (1 - φn(t)) dt = E[ (1/δ) ∫_{-δ}^{δ} (1 - e^{itXn}) dt ]
= E[ 2 (1 - sin(δXn)/(δXn)) ]
≥ E[ 2 (1 - 1/|δXn|) I(|Xn| > 2/δ) ]
≥ P(|Xn| > 2/δ).


However, the left-hand side of the inequality converges to

(1/δ) ∫_{-δ}^{δ} (1 - φ(t)) dt.

Since φ(t) is continuous at t = 0 with φ(0) = 1, this limit can be made smaller than ε by choosing δ small enough. Let M = 2/δ. We obtain that when n > N0, P(|Xn| > M) < ε. Enlarging M if necessary, we also have P(|Xk| > M) < ε for k = 1, ..., N0. Thus,

sup_n P(|Xn| > M) < ε.

Step 2. We show that for any subsequence of {Xn}, there exists a further subsequence {X_{n_k}} such that the distribution functions F_{n_k} of X_{n_k} converge to some distribution function. First, we need Helly's theorem.

Helly's Selection Theorem For every sequence {Fn} of distribution functions, there exists a subsequence {F_{n_k}} and a nondecreasing, right-continuous function F such that F_{n_k}(x) → F(x) at continuity points x of F. †

We defer the proof of Helly's Selection Theorem to the end of this proof. From this theorem, for any subsequence of {Xn} we can find a further subsequence {X_{n_k}} such that F_{n_k}(x) → G(x) for some nondecreasing and right-continuous function G, at all continuity points x of G. However, Helly's Selection Theorem does not imply that G is a distribution function, since G(-∞) and G(∞) may not be 0 or 1. But from the tightness of {X_{n_k}}, for any ε we can choose M such that F_{n_k}(-M) + (1 - F_{n_k}(M)) ≤ P(|X_{n_k}| ≥ M) < ε, and we can always choose such an M so that -M and M are continuity points of G. Thus, G(-M) + (1 - G(M)) ≤ ε. Letting M → ∞ and since 0 ≤ G(-M) ≤ G(M) ≤ 1, we conclude that G must be a distribution function.

Step 3. We conclude that the subsequence {X_{n_k}} in Step 2 converges in distribution to X. Since F_{n_k} converges weakly to G(x), G(x) is a distribution function, and φ_{n_k}(t) converges to φ(t), φ(t) must be the characteristic function corresponding to the distribution G(x). By the uniqueness of characteristic functions in Theorem 1.1 (whose proof is given below), G(x) is exactly the distribution of X. Therefore, X_{n_k} →d X. Since every subsequence of {Xn} has a further subsequence converging in distribution to X, we obtain Xn →d X, and the theorem is proved.

It remains to prove Helly's Selection Theorem. Let r1, r2, ... be an enumeration of the rational numbers. For r1, we choose a subsequence of {Fn}, denoted F11, F12, ..., such that F11(r1), F12(r1), ... converges. Then for r2, we choose a further subsequence of the above sequence, denoted F21, F22, ..., such that F21(r2), F22(r2), ... converges. We continue this for all the rational numbers and obtain an array of functions:

F11 F12 ...
F21 F22 ...
... ... ...

We finally select the diagonal functions F11, F22, ...; this subsequence converges at all rational numbers. Denote the limits by G(r1), G(r2), .... Define G(x) = inf_{r_k > x} G(r_k). Clearly, G is nondecreasing. If xk decreases to x, then for any ε > 0 we can find rs such that rs > x and G(x) > G(rs) - ε. Then, when k is large (so that xk < rs), G(xk) - ε ≤ G(rs) - ε < G(x) ≤ G(xk).

LARGE SAMPLE THEORY 52

That is, limk G(xk) = G(x). Thus, G is right-continuous. If x is a continuity point of G, forany Īµ, we can find two sequence of rational number rk and rkā€² such that rk decreases to xand rkā€² increases to x. Then after taking limits for the inequality Fll(rkā€²) ā‰¤ Fll(x) ā‰¤ Fll(rk), wehave

G(rkā€²) ā‰¤ lim infl

Fll(x) ā‰¤ lim supl

Fll(x) ā‰¤ G(rk).

Let k → ∞; we obtain lim_l F_ll(x) = G(x). This proves Helly's Selection Theorem.

It remains to prove Theorem 1.1 (the inversion and uniqueness result), whose proof was deferred to here. After substituting φ(t) into the integral, we obtain

    (1/2π) ∫_{−T}^{T} [(e^{−ita} − e^{−itb})/(it)] φ(t) dt
        = (1/2π) ∫_{−T}^{T} ∫_{−∞}^{∞} [(e^{−ita} − e^{−itb})/(it)] e^{itx} dF(x) dt
        = (1/2π) ∫_{−∞}^{∞} ∫_{−T}^{T} [(e^{it(x−a)} − e^{it(x−b)})/(it)] dt dF(x).

The interchange of the integrals follows from Fubini's theorem. The last expression is equal to

    ∫_{−∞}^{∞} [ (sgn(x−a)/π) ∫_0^{T|x−a|} (sin t)/t dt − (sgn(x−b)/π) ∫_0^{T|x−b|} (sin t)/t dt ] dF(x).

The integrand is bounded, uniformly in T, by (2/π) sup_{y>0} ∫_0^{y} (sin t)/t dt < ∞, and as T → ∞ it converges to 0 if x < a or x > b; to 1/2 if x = a or x = b; and to 1 if x ∈ (a, b). Therefore, by the dominated convergence theorem, the integral converges to

    F(b−) − F(a) + (1/2){F(b) − F(b−)} + (1/2){F(a) − F(a−)}.

Since F is continuous at a and b, the limit is the same as F(b) − F(a). Furthermore, suppose that F has a density function f. Then

    F(x) − F(0) = (1/2π) ∫_{−∞}^{∞} [(1 − e^{−itx})/(it)] φ(t) dt.

Since |∂/∂x {(1 − e^{−itx})/(it)} φ(t)| ≤ |φ(t)|, interchanging derivative and integral (justified when φ is integrable) gives

    f(x) = (1/2π) ∫_{−∞}^{∞} e^{−itx} φ(t) dt.   †

The above theorem indicates that to prove the weak convergence of a sequence of random variables, it is sufficient to check the convergence of their characteristic functions. For example, if X_1, ..., X_n are i.i.d Bernoulli(p), then the characteristic function of X̄_n = (X_1 + ... + X_n)/n is (1 − p + p e^{it/n})^n, which converges to φ(t) = e^{itp}, the characteristic function of the degenerate random variable X ≡ p. Thus X̄_n converges in distribution to p, and then from Theorem 3.1, X̄_n converges in probability to p.
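To make the Bernoulli example concrete, here is a minimal numerical sketch in Python (numpy only); the value p = 0.3, the grid of t values, and the sample sizes are illustrative choices, not from the notes. It evaluates the characteristic function of X̄_n and compares it with the degenerate limit e^{itp}.

    import numpy as np

    p = 0.3                              # illustrative success probability
    t = np.linspace(-5.0, 5.0, 11)       # grid of arguments for the characteristic function

    for n in [1, 10, 100, 1000]:
        # characteristic function of X_bar_n for i.i.d. Bernoulli(p):
        # E[exp(i t X_bar_n)] = (1 - p + p*exp(i t / n))**n
        phi_n = (1 - p + p * np.exp(1j * t / n)) ** n
        phi_limit = np.exp(1j * t * p)   # characteristic function of the degenerate limit X = p
        print(n, np.max(np.abs(phi_n - phi_limit)))

The maximal discrepancy over the grid shrinks as n grows, which is exactly the pointwise convergence the continuity theorem requires.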

Theorem 3.4 also has a multivariate version when X_n and X are k-dimensional random vectors: X_n →d X if and only if E[exp{it′X_n}] → E[exp{it′X}] for every k-dimensional constant vector t. Since the latter is equivalent to the weak convergence of t′X_n to t′X, we conclude that the weak convergence of X_n to X is equivalent to the weak convergence of t′X_n to t′X for every t. That is, to study the weak convergence of random vectors, we can reduce the problem to the weak convergence of one-dimensional linear combinations of the random vectors. This is the well-known Cramer-Wold device:

Theorem 3.5 (The Cramer-Wold device) Random vectors X_n in R^k satisfy X_n →d X if and only if t′X_n →d t′X in R for all t ∈ R^k. †

3.2.3 Properties of convergence in distribution

Some additional results on convergence in distribution are given in the following theorems.

Theorem 3.6 (Continuous mapping theorem) Suppose X_n →a.s. X, or X_n →p X, or X_n →d X. Then for any continuous function g(·), g(X_n) converges to g(X) almost surely, or in probability, or in distribution, respectively. †

Proof If X_n →a.s. X, then clearly g(X_n) →a.s. g(X). If X_n →p X, then for any subsequence there exists a further subsequence X_{n_k} →a.s. X. Thus g(X_{n_k}) →a.s. g(X), and g(X_n) →p g(X) follows from (H) in Theorem 3.1. To prove that g(X_n) →d g(X) when X_n →d X, we apply (b) of Theorem 3.3. †

Remark 3.5 Theorem 3.6 concludes that g(X_n) →d g(X) if X_n →d X and g is continuous. In fact, this result still holds if P(X ∈ C(g)) = 1, where C(g) denotes the set of continuity points of g. That is, if the discontinuity points of g carry zero probability under the distribution of X, the continuous mapping theorem still holds.

Theorem 3.7 (Slutsky theorem) Suppose X_n →d X, Y_n →p y and Z_n →p z for some constants y and z. Then Z_nX_n + Y_n →d zX + y. †

Proof We first show that X_n + Y_n →d X + y. For any ε > 0,

    P(X_n + Y_n ≤ x) ≤ P(X_n + Y_n ≤ x, |Y_n − y| ≤ ε) + P(|Y_n − y| > ε)
                     ≤ P(X_n ≤ x − y + ε) + P(|Y_n − y| > ε).

Thus,

    limsup_n F_{X_n+Y_n}(x) ≤ limsup_n F_{X_n}(x − y + ε) ≤ F_X(x − y + ε).

On the other hand,

    P(X_n + Y_n > x) ≤ P(X_n + Y_n > x, |Y_n − y| ≤ ε) + P(|Y_n − y| > ε)
                     ≤ P(X_n > x − y − ε) + P(|Y_n − y| > ε).


Thus,

    limsup_n (1 − F_{X_n+Y_n}(x)) ≤ limsup_n P(X_n > x − y − ε) ≤ limsup_n P(X_n ≥ x − y − 2ε)
                                  ≤ 1 − F_X(x − y − 2ε).

We obtain

    F_X(x − y − 2ε) ≤ liminf_n F_{X_n+Y_n}(x) ≤ limsup_n F_{X_n+Y_n}(x) ≤ F_X(x − y + ε).

Let ε → 0; it holds that

    F_{X+y}(x−) ≤ liminf_n F_{X_n+Y_n}(x) ≤ limsup_n F_{X_n+Y_n}(x) ≤ F_{X+y}(x).

Thus, X_n + Y_n →d X + y.

On the other hand, for any η > 0 and any ε > 0,

    P(|(Z_n − z)X_n| > η) ≤ P(|Z_n − z| > ε) + P(|Z_n − z| ≤ ε, |X_n| > η/ε).

Thus,

    limsup_n P(|(Z_n − z)X_n| > η) ≤ limsup_n P(|X_n| ≥ η/(2ε)) ≤ P(|X| ≥ η/(2ε)).

Since ε is arbitrary and the right-hand side tends to 0 as ε → 0, we conclude that (Z_n − z)X_n →p 0. Clearly zX_n →d zX. Hence, Z_nX_n = (Z_n − z)X_n + zX_n →d zX from the first half of the proof, and again using the first half, Z_nX_n + Y_n →d zX + y. †

Remark 3.6 In the proof of Theorem 3.7, if we replace X_n + Y_n by aX_n + bY_n, we can show that aX_n + bY_n →d aX + by by considering the cases where a, b, or both are non-zero. Then from Theorem 3.5, (X_n, Y_n) →d (X, y) in R². By the continuous mapping theorem, we obtain X_n + Y_n →d X + y and X_nY_n →d Xy. This immediately gives Theorem 3.7.

Both Theorems 3.6 and 3.7 are useful in deriving the convergence of transformed random variables, as shown in the following examples.

Example 3.7 Suppose X_n →d N(0, 1). Then by the continuous mapping theorem, X_n² →d χ²_1.

Example 3.8 This example shows that g can be discontinuous in Theorem 3.6. Let X_n →d X with X ∼ N(0, 1) and g(x) = 1/x. Although g(x) is discontinuous at the origin, we can still show that 1/X_n →d 1/X, the reciprocal of a standard normal random variable. This is because P(X = 0) = 0. However, in Example 3.6, where g(x) = I(x > 0), it was shown that Theorem 3.6 may fail if P(X ∈ C(g)) < 1.

Example 3.9 The condition Y_n →p y, where y is a constant, is necessary in Theorem 3.7. For example, let X_n = X ∼ Uniform(0, 1) and Y_n = −X, so that Y_n →d −X̃, where X̃ is a random variable independent of X with the same distribution as X. However, X_n + Y_n = 0 does not converge in distribution to the random variable X − X̃, which is not degenerate at zero.


Example 3.10 Let X_1, X_2, ... be a random sample from a normal distribution with mean µ and variance σ² > 0. From the central limit theorem and the law of large numbers, which will be given later, we have

    √n(X̄_n − µ) →d N(0, σ²),   s_n² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄_n)² →a.s. σ².

Thus, from Theorem 3.7,

    √n(X̄_n − µ)/s_n →d (1/σ) N(0, σ²) ≅ N(0, 1).

From distribution theory, we know the left-hand side has a t-distribution with (n − 1) degrees of freedom. This result says that in large samples, t_{n−1} can be approximated by a standard normal distribution.
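A minimal simulation sketch of Example 3.10 in Python (numpy; the values µ = 2, σ = 3, n = 50, and the number of replications are illustrative assumptions) compares tail frequencies of the t-statistic with standard normal cutoffs.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, reps = 2.0, 3.0, 50, 20000     # illustrative values

    x = rng.normal(mu, sigma, size=(reps, n))
    xbar = x.mean(axis=1)
    s = x.std(axis=1, ddof=1)                    # sample standard deviation s_n
    tstat = np.sqrt(n) * (xbar - mu) / s         # exactly t_{n-1} distributed

    # Exceedance frequencies should be close to the N(0,1) values 0.10, 0.05, 0.01
    for c in (1.645, 1.96, 2.576):
        print(c, (np.abs(tstat) > c).mean())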

3.2.4 Representation of convergence in distribution

As we have already seen, working with convergence in distribution may not be easy compared with almost sure convergence. However, if we can represent convergence in distribution by almost sure convergence, many arguments can be simplified. The following famous theorem shows that such a representation exists.

Theorem 3.8 (Skorohod's Representation Theorem) Let X_n and X be random variables in a probability space (Ω, A, P) with X_n →d X. Then there exist another probability space (Ω̃, Ã, P̃) and random variables X̃_n and X̃ defined on it such that X̃_n and X_n have the same distribution, X̃ and X have the same distribution, and moreover X̃_n →a.s. X̃. †

Before proving Theorem 3.8, we define the quantile function corresponding to a distribution function F(x), denoted F^{−1}(p), for p ∈ [0, 1]:

    F^{−1}(p) = inf{x : F(x) ≥ p}.

Some properties of the quantile function are given in the following proposition.

Proposition 3.1 (a) F^{−1} is left-continuous.
(b) If X has continuous distribution function F, then F(X) ∼ Uniform(0, 1).
(c) Let ξ ∼ Uniform(0, 1) and let X = F^{−1}(ξ). Then for all x, {X ≤ x} = {ξ ≤ F(x)}. Thus, X has distribution function F. †

Proof (a) Clearly, F^{−1} is nondecreasing. Suppose p_n increases to p; then F^{−1}(p_n) increases to some y ≤ F^{−1}(p). Since F(y) ≥ p_n for all n, F(y) ≥ p, and therefore F^{−1}(p) ≤ y by the definition of F^{−1}(p). Thus y = F^{−1}(p), and F^{−1} is left-continuous.
(b) {X ≤ x} ⊂ {F(X) ≤ F(x)}, so F(x) ≤ P(F(X) ≤ F(x)). On the other hand, {F(X) ≤ F(x) − ε} ⊂ {X ≤ x}, so P(F(X) ≤ F(x) − ε) ≤ F(x); letting ε → 0 gives P(F(X) < F(x)) ≤ F(x). If F is continuous, then every u ∈ (0, 1) equals F(x) for some x, and combining the two inequalities yields P(F(X) ≤ u) = u, so F(X) ∼ Uniform(0, 1).
(c) P(X ≤ x) = P(F^{−1}(ξ) ≤ x) = P(ξ ≤ F(x)) = F(x). †
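As a quick illustration of Proposition 3.1(c), the following Python sketch (numpy only; the Exponential(λ) example with λ = 2 and the sample size are illustrative assumptions) draws ξ ∼ Uniform(0, 1) and applies the quantile function F^{−1}(p) = −log(1 − p)/λ.

    import numpy as np

    rng = np.random.default_rng(1)
    lam, n = 2.0, 100000                  # illustrative rate and sample size

    xi = rng.uniform(size=n)              # xi ~ Uniform(0,1)
    x = -np.log(1.0 - xi) / lam           # X = F^{-1}(xi) for F Exponential(lam)

    # The sample should behave like Exponential(lam): mean 1/lam, P(X > 1) = exp(-lam)
    print(x.mean(), 1.0 / lam)
    print((x > 1.0).mean(), np.exp(-lam))

This inverse-CDF construction is exactly the device used in the proof of the Skorohod representation below.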

Proof Using the quantile function, we can now prove Theorem 3.8. Let (Ω̃, Ã, P̃) be ([0, 1], B ∩ [0, 1], λ), where λ is Lebesgue measure. Define X̃_n = F_n^{−1}(ξ) and X̃ = F^{−1}(ξ), where ξ is the uniform random variable on (Ω̃, Ã, P̃) given by the identity map. From (c) of the previous proposition, X̃_n has distribution function F_n, the same as X_n, and X̃ has distribution function F. It remains to show X̃_n →a.s. X̃.

For any t ∈ (0, 1) such that there is at most one value x with F(x) = t (it is easy to see that such t are the continuity points of F^{−1}), we have that for any z < x, F(z) < t. Thus, when n is large, F_n(z) < t, so F_n^{−1}(t) ≥ z. We obtain liminf_n F_n^{−1}(t) ≥ z. Since z is an arbitrary number less than x, liminf_n F_n^{−1}(t) ≥ x = F^{−1}(t). On the other hand, from F(x + ε) > t we obtain that when n is large enough, F_n(x + ε) > t, so F_n^{−1}(t) ≤ x + ε. Thus limsup_n F_n^{−1}(t) ≤ x + ε, and since ε is arbitrary, limsup_n F_n^{−1}(t) ≤ x.

We conclude that F_n^{−1}(t) → F^{−1}(t) for every t which is a continuity point of F^{−1}, hence for almost every t ∈ (0, 1). That is, X̃_n →a.s. X̃. †

This theorem is useful in many arguments. For example, if X_n →d X and one wishes to show that some function g(X_n) converges in distribution to g(X), then by the representation theorem we obtain X̃_n and X̃ with X̃_n →a.s. X̃. Thus, if we can show g(X̃_n) →a.s. g(X̃), which is often easy, then g(X̃_n) →d g(X̃); since g(X̃_n) has the same distribution as g(X_n), and likewise g(X̃) and g(X), we conclude g(X_n) →d g(X). Using this technique, readers should easily be able to prove the continuous mapping theorem. Also see the diagram in Figure 2.

Figure 2: Representation of Convergence in Distribution

Our final remark of this section is that all the results above, such as the continuous mapping theorem, the Slutsky theorem and the representation theorem, hold in parallel for the convergence of random vectors. The proofs for random vectors are based on the Cramer-Wold device.

3.3 Summation of Independent Random Variables

Summations of independent random variables appear throughout statistical inference. In particular, many statistics can be expressed as sums of i.i.d random variables. Thus, this section gives some classical large sample results for this type of statistic, including the weak and strong laws of large numbers, the central limit theorem, and the Delta method.

3.3.1 Preliminary lemmas

Proposition 3.2 (Borel-Cantelli Lemma) For any events A_n,

    ∑_{n=1}^{∞} P(A_n) < ∞

implies P(A_n, i.o.) = P(A_n occurs infinitely often) = 0; equivalently, P(∩_{n=1}^{∞} ∪_{m≥n} A_m) = 0. †

Proof
    P(A_n, i.o.) ≤ P(∪_{m≥n} A_m) ≤ ∑_{m≥n} P(A_m) → 0, as n → ∞.   †

As a consequence of the proposition, if for a sequence of random variables Z_n and for any ε > 0, ∑_n P(|Z_n| > ε) < ∞, then with probability one |Z_n| > ε occurs only finitely many times. That is, Z_n →a.s. 0.

Proposition 3.3 (Second Borel-Cantelli Lemma) For a sequence of independent events A_1, A_2, ..., ∑_{n=1}^{∞} P(A_n) = ∞ implies P(A_n, i.o.) = 1. †

Proof Consider the complement of {A_n, i.o.}. Note that

    P(∪_{n=1}^{∞} ∩_{m≥n} A_m^c) = lim_n P(∩_{m≥n} A_m^c) = lim_n ∏_{m≥n} (1 − P(A_m)) ≤ limsup_n exp{−∑_{m≥n} P(A_m)} = 0.   †

Proposition 3.4 Let X, X_1, ..., X_n be i.i.d with finite mean. Define Y_n = X_n I(|X_n| ≤ n). Then ∑_{n=1}^{∞} P(X_n ≠ Y_n) < ∞. †

Proof Since E[|X|] < ∞,

    ∑_{n=1}^{∞} P(X_n ≠ Y_n) ≤ ∑_{n=1}^{∞} P(|X| ≥ n) = ∑_{n=1}^{∞} ∑_{m≥n} P(m ≤ |X| < m + 1)
        = ∑_{m=1}^{∞} m P(m ≤ |X| < m + 1) ≤ E[|X|] < ∞.

From the Borel-Cantelli Lemma, P(X_n ≠ Y_n, i.o.) = 0. That is, for almost every ω ∈ Ω, X_n(ω) = Y_n(ω) when n is large enough. †


3.3.2 Law of large numbers

We now prove the weak and strong laws of large numbers.

Theorem 3.9 (Weak Law of Large Numbers) If X, X_1, ..., X_n are i.i.d with mean µ (so E[|X|] < ∞ and µ = E[X]), then X̄_n →p µ. †

Proof Define Y_n = X_n I(−n ≤ X_n ≤ n) and let µ̃_n = ∑_{k=1}^n E[Y_k]/n. Then by Chebyshev's inequality,

    P(|Ȳ_n − µ̃_n| ≥ ε) ≤ Var(Ȳ_n)/ε² ≤ ∑_{k=1}^n Var(X_k I(|X_k| ≤ k)) / (n²ε²).

Since

    Var(X_k I(|X_k| ≤ k)) ≤ E[X_k² I(|X_k| ≤ k)]
        = E[X_k² I(|X_k| ≤ k, |X_k| ≥ √k ε²)] + E[X_k² I(|X_k| ≤ k, |X_k| < √k ε²)]
        ≤ k E[|X_k| I(|X_k| ≥ √k ε²)] + k ε⁴,

we obtain

    P(|Ȳ_n − µ̃_n| ≥ ε) ≤ ∑_{k=1}^n E[|X| I(|X| ≥ √k ε²)] / (nε²) + ε² n(n + 1)/(2n²).

Thus limsup_n P(|Ȳ_n − µ̃_n| ≥ ε) ≤ ε². We conclude that Ȳ_n − µ̃_n →p 0. On the other hand, µ̃_n → µ, so Ȳ_n →p µ. Finally, from Proposition 3.4, X_k = Y_k for all large k almost surely, so X̄_n − Ȳ_n →a.s. 0 and therefore X̄_n →p µ. †

Theorem 3.10 (Strong Law of Large Numbers) If X_1, ..., X_n are i.i.d with mean µ, then X̄_n →a.s. µ. †

Proof Without loss of generality, we assume X_n ≥ 0, since the general case follows by writing X_n = X_n⁺ − X_n⁻.

Similar to Theorem 3.9, by Proposition 3.4 it is sufficient to show Ȳ_n →a.s. µ, where Y_n = X_n I(X_n ≤ n). Note that E[Y_n] = E[X_1 I(X_1 ≤ n)] → µ, so

    ∑_{k=1}^n E[Y_k]/n → µ.

Thus, if we denote S_n = ∑_{k=1}^n (Y_k − E[Y_k]) and can show S_n/n →a.s. 0 along a suitable subsequence, the result will follow. Note that

    Var(S_n) = ∑_{k=1}^n Var(Y_k) ≤ ∑_{k=1}^n E[Y_k²] ≤ n E[X_1² I(X_1 ≤ n)].

Then by Chebyshev's inequality,

    P(|S_n/n| > ε) ≤ Var(S_n)/(n²ε²) ≤ E[X_1² I(X_1 ≤ n)]/(nε²).


For any α > 1, let u_n = [α^n]. Then

    ∑_{n=1}^{∞} P(|S_{u_n}/u_n| > ε) ≤ ∑_{n=1}^{∞} (1/(u_n ε²)) E[X_1² I(X_1 ≤ u_n)] ≤ (1/ε²) E[ X_1² ∑_{u_n ≥ X_1} 1/u_n ].

Since for any x > 0, ∑_{u_n ≥ x} u_n^{−1} ≤ 2 ∑_{n ≥ log x/ log α} α^{−n} ≤ K x^{−1} for some constant K, we have

    ∑_{n=1}^{∞} P(|S_{u_n}/u_n| > ε) ≤ (K/ε²) E[X_1] < ∞.

From the Borel-Cantelli Lemma in Proposition 3.2, S_{u_n}/u_n →a.s. 0; equivalently, writing T_n = ∑_{k=1}^n Y_k, T_{u_n}/u_n →a.s. µ.

For any k, we can find u_n < k ≤ u_{n+1}. Since Y_1, Y_2, ... ≥ 0,

    (T_{u_n}/u_n)(u_n/u_{n+1}) ≤ T_k/k ≤ (T_{u_{n+1}}/u_{n+1})(u_{n+1}/u_n).

Taking limits in the above and using u_{n+1}/u_n → α, we have

    µ/α ≤ liminf_k T_k/k ≤ limsup_k T_k/k ≤ µα.

Since α is an arbitrary number larger than 1, let α → 1 to obtain lim_k T_k/k = µ, i.e., Ȳ_k →a.s. µ. The proof is complete. †
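The following small Python simulation (numpy; the Exponential(1) distribution, the seed, and the sample sizes are illustrative assumptions) shows the sample mean settling down to µ along one sample path, as the strong law asserts.

    import numpy as np

    rng = np.random.default_rng(2)
    mu = 1.0                                   # mean of Exponential(1)
    x = rng.exponential(mu, size=100000)       # one long i.i.d. sequence

    running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
    for n in [10, 100, 1000, 10000, 100000]:
        print(n, running_mean[n - 1])          # running means approach mu = 1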

3.3.3 Central limit theorem

We now consider the central limit theorem. All the proofs here are based on the convergence of the corresponding characteristic functions. The following lemma describes the local approximation of a characteristic function.

Proposition 3.5 Suppose E[|X|^m] < ∞ for some integer m ≥ 0. Then

    | φ_X(t) − ∑_{k=0}^{m} (it)^k E[X^k]/k! | / |t|^m → 0,  as t → 0.   †

Proof We use the following expansion of e^{itx}:

    | e^{itx} − ∑_{k=0}^{m} (itx)^k/k! | ≤ min( |tx|^{m+1}/(m+1)!, 2|tx|^m/m! ).

Thus,

    | φ_X(t) − ∑_{k=0}^{m} (it)^k E[X^k]/k! | / |t|^m ≤ E[ min( |t||X|^{m+1}/(m+1)!, 2|X|^m/m! ) ],

and the right-hand side, being dominated by the integrable bound 2|X|^m/m!, converges to 0 by the dominated convergence theorem


as t → 0. †

Theorem 3.11 (Central Limit Theorem) If X_1, ..., X_n are i.i.d with mean µ and variance σ², then √n(X̄_n − µ) →d N(0, σ²). †

Proof Denote Y_n = √n(X̄_n − µ). We consider the characteristic function of Y_n:

    φ_{Y_n}(t) = [ φ_{X_1−µ}(t/√n) ]^n.

Using Proposition 3.5, φ_{X_1−µ}(t/√n) = 1 − σ²t²/(2n) + o(1/n). Thus,

    φ_{Y_n}(t) → exp{−σ²t²/2}.

The result holds. †
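A minimal Python sketch of the central limit theorem (numpy; the Uniform(0,1) summands echo the figure on the cover page of these notes, and the sample sizes and replication count are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(3)
    reps = 5000                                 # number of simulated normalized sums

    for n in [1, 10, 100, 1000]:
        u = rng.uniform(size=(reps, n))         # i.i.d. Uniform(0,1): mean 1/2, variance 1/12
        z = np.sqrt(n) * (u.mean(axis=1) - 0.5) / np.sqrt(1.0 / 12.0)
        # If the CLT holds, z is approximately N(0,1); check P(|Z| > 1.96) ~ 0.05
        print(n, (np.abs(z) > 1.96).mean())

For n = 1 the exceedance frequency is far from 0.05 (the summand is uniform, not normal), while it approaches 0.05 as n grows.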

Theorem 3.12 (Multivariate Central Limit Theorem) If X_1, ..., X_n are i.i.d random vectors in R^k with mean µ and covariance Σ = E[(X − µ)(X − µ)′], then √n(X̄_n − µ) →d N(0, Σ). †

Proof Similar to Theorem 3.11, but this time we consider the multivariate characteristic function E[exp{i√n t′(X̄_n − µ)}]. Note that the result of Proposition 3.5 holds in this multivariate case. †

Theorem 3.13 (Liapunov Central Limit Theorem) Let X_{n1}, ..., X_{nn} be independent random variables with µ_{ni} = E[X_{ni}] and σ²_{ni} = Var(X_{ni}). Let µ_n = ∑_{i=1}^n µ_{ni} and σ_n² = ∑_{i=1}^n σ²_{ni}. If

    ∑_{i=1}^n E[|X_{ni} − µ_{ni}|³] / σ_n³ → 0,

then ∑_{i=1}^n (X_{ni} − µ_{ni})/σ_n →d N(0, 1). †

We skip the proof of Theorem 3.13 and instead prove the following Theorem 3.14, of which Theorem 3.13 is a special case.

Theorem 3.14 (Lindeberg-Feller Central Limit Theorem) Let X_{n1}, ..., X_{nn} be independent random variables with µ_{ni} = E[X_{ni}] and σ²_{ni} = Var(X_{ni}). Let σ_n² = ∑_{i=1}^n σ²_{ni}. Then both ∑_{i=1}^n (X_{ni} − µ_{ni})/σ_n →d N(0, 1) and max{σ²_{ni}/σ_n² : 1 ≤ i ≤ n} → 0 hold if and only if the Lindeberg condition

    (1/σ_n²) ∑_{i=1}^n E[|X_{ni} − µ_{ni}|² I(|X_{ni} − µ_{ni}| ≥ εσ_n)] → 0,  for all ε > 0,

holds. †


Proof "⇐": We first show that max{σ²_{nk}/σ_n² : 1 ≤ k ≤ n} → 0. For each k,

    σ²_{nk}/σ_n² = E[|(X_{nk} − µ_{nk})/σ_n|²]
        = (1/σ_n²) { E[I(|X_{nk} − µ_{nk}| ≥ εσ_n)(X_{nk} − µ_{nk})²] + E[I(|X_{nk} − µ_{nk}| < εσ_n)(X_{nk} − µ_{nk})²] }
        ≤ (1/σ_n²) E[I(|X_{nk} − µ_{nk}| ≥ εσ_n)(X_{nk} − µ_{nk})²] + ε².

Thus,

    max_k σ²_{nk}/σ_n² ≤ (1/σ_n²) ∑_{k=1}^n E[|X_{nk} − µ_{nk}|² I(|X_{nk} − µ_{nk}| ≥ εσ_n)] + ε².

From the Lindeberg condition, we immediately obtain max_k σ²_{nk}/σ_n² → 0.

To prove the central limit theorem, let φ_{nk}(t) be the characteristic function of (X_{nk} − µ_{nk})/σ_n. We note

    | φ_{nk}(t) − (1 − (σ²_{nk}/σ_n²)(t²/2)) |
        ≤ E[ | e^{it(X_{nk}−µ_{nk})/σ_n} − ∑_{j=0}^{2} (it)^j ((X_{nk}−µ_{nk})/σ_n)^j / j! | ]
        ≤ E[ I(|X_{nk}−µ_{nk}| ≥ εσ_n) | e^{it(X_{nk}−µ_{nk})/σ_n} − ∑_{j=0}^{2} (it)^j ((X_{nk}−µ_{nk})/σ_n)^j / j! | ]
          + E[ I(|X_{nk}−µ_{nk}| < εσ_n) | e^{it(X_{nk}−µ_{nk})/σ_n} − ∑_{j=0}^{2} (it)^j ((X_{nk}−µ_{nk})/σ_n)^j / j! | ].

From the expansion used in proving Proposition 3.5, |e^{itx} − (1 + itx − t²x²/2)| ≤ t²x², which we apply to the first term on the right-hand side; from the Taylor expansion, |e^{itx} − (1 + itx − t²x²/2)| ≤ |t|³|x|³/6, which we apply to the second term. We obtain

    | φ_{nk}(t) − (1 − (σ²_{nk}/σ_n²)(t²/2)) |
        ≤ E[ I(|X_{nk}−µ_{nk}| ≥ εσ_n) t² ((X_{nk}−µ_{nk})/σ_n)² ] + E[ I(|X_{nk}−µ_{nk}| < εσ_n) |t|³ |X_{nk}−µ_{nk}|³/(6σ_n³) ]
        ≤ (t²/σ_n²) E[(X_{nk}−µ_{nk})² I(|X_{nk}−µ_{nk}| ≥ εσ_n)] + (ε|t|³/6)(σ²_{nk}/σ_n²).

Therefore,

    ∑_{k=1}^n | φ_{nk}(t) − (1 − (t²/2)(σ²_{nk}/σ_n²)) |
        ≤ (t²/σ_n²) ∑_{k=1}^n E[I(|X_{nk}−µ_{nk}| ≥ εσ_n)(X_{nk}−µ_{nk})²] + ε|t|³/6.


The right-hand side goes to zero by first letting n → ∞ and then ε → 0.

Since for any complex numbers Z_1, ..., Z_m, W_1, ..., W_m with modulus at most 1,

    | Z_1 ··· Z_m − W_1 ··· W_m | ≤ ∑_{k=1}^m |Z_k − W_k|,

we have

    | ∏_{k=1}^n φ_{nk}(t) − ∏_{k=1}^n (1 − (t²/2)(σ²_{nk}/σ_n²)) | ≤ ∑_{k=1}^n | φ_{nk}(t) − (1 − (t²/2)(σ²_{nk}/σ_n²)) | → 0.

On the other hand, from |e^z − 1 − z| ≤ |z|² e^{|z|},

    | ∏_{k=1}^n e^{−t²σ²_{nk}/(2σ_n²)} − ∏_{k=1}^n (1 − (t²/2)(σ²_{nk}/σ_n²)) |
        ≤ ∑_{k=1}^n | e^{−t²σ²_{nk}/(2σ_n²)} − 1 + t²σ²_{nk}/(2σ_n²) |
        ≤ ∑_{k=1}^n e^{t²σ²_{nk}/(2σ_n²)} t⁴σ⁴_{nk}/(4σ_n⁴) ≤ (max_k σ²_{nk}/σ_n²) e^{t²/2} t⁴/4 → 0.

We thus have

    | ∏_{k=1}^n φ_{nk}(t) − ∏_{k=1}^n e^{−t²σ²_{nk}/(2σ_n²)} | → 0.

The result follows by noticing that

    ∏_{k=1}^n e^{−t²σ²_{nk}/(2σ_n²)} = e^{−t²/2}.

"⇒": First, we note that from 1 − cos x ≤ x²/2,

    (t²/(2σ_n²)) ∑_{k=1}^n E[|X_{nk} − µ_{nk}|² I(|X_{nk} − µ_{nk}| > εσ_n)]
        = t²/2 − ∑_{k=1}^n ∫_{|y| ≤ εσ_n} (t²y²/(2σ_n²)) dF_{nk}(y)
        ≤ t²/2 − ∑_{k=1}^n ∫_{|y| ≤ εσ_n} [1 − cos(ty/σ_n)] dF_{nk}(y),

where F_{nk} is the distribution of X_{nk} − µ_{nk}. On the other hand, since max_k σ_{nk}/σ_n → 0, max_k |φ_{nk}(t) − 1| → 0 uniformly on any finite interval of t. Then

    | ∑_{k=1}^n log φ_{nk}(t) − ∑_{k=1}^n (φ_{nk}(t) − 1) | ≤ ∑_{k=1}^n |φ_{nk}(t) − 1|²
        ≤ max_k |φ_{nk}(t) − 1| ∑_{k=1}^n |φ_{nk}(t) − 1| ≤ max_k |φ_{nk}(t) − 1| ∑_{k=1}^n t²σ²_{nk}/σ_n².

Thus,

    ∑_{k=1}^n log φ_{nk}(t) = ∑_{k=1}^n (φ_{nk}(t) − 1) + o(1).


Since ∑_{k=1}^n log φ_{nk}(t) → −t²/2 uniformly on any finite interval of t, we obtain

    ∑_{k=1}^n (1 − φ_{nk}(t)) = t²/2 + o(1)

uniformly on finite intervals of t. Taking real parts,

    ∑_{k=1}^n ∫ (1 − cos(ty/σ_n)) dF_{nk}(y) = t²/2 + o(1).

Therefore, for any ε > 0 and any t,

    (t²/(2σ_n²)) ∑_{k=1}^n E[|X_{nk} − µ_{nk}|² I(|X_{nk} − µ_{nk}| > εσ_n)]
        ≤ ∑_{k=1}^n ∫_{|y| > εσ_n} [1 − cos(ty/σ_n)] dF_{nk}(y) + o(1)
        ≤ 2 ∑_{k=1}^n P(|X_{nk} − µ_{nk}| > εσ_n) + o(1)
        ≤ (2/ε²) ∑_{k=1}^n E[|X_{nk} − µ_{nk}|²]/σ_n² + o(1) = 2/ε² + o(1).

Dividing by t²/2 and letting n → ∞ shows that the limsup of the left-hand side of the Lindeberg condition is at most 4/(t²ε²) for every t; letting t → ∞ yields the Lindeberg condition. †

Remark 3.7 To see how Theorem 3.14 implies the result of Theorem 3.13, note that

    (1/σ_n²) ∑_{k=1}^n E[|X_{nk} − µ_{nk}|² I(|X_{nk} − µ_{nk}| > εσ_n)] ≤ (1/(ε σ_n³)) ∑_{k=1}^n E[|X_{nk} − µ_{nk}|³],

so the Liapunov condition implies the Lindeberg condition.

We give some examples to show the application of the central limit theorems in statistics.

Example 3.11 This example comes from simple linear regression. Suppose X_j = α + βz_j + ε_j for j = 1, 2, ..., where the z_j are known numbers, not all equal, and the ε_j are i.i.d with mean zero and variance σ². The least squares estimate of β is

    β̂_n = ∑_{j=1}^n X_j(z_j − z̄_n) / ∑_{j=1}^n (z_j − z̄_n)² = β + ∑_{j=1}^n ε_j(z_j − z̄_n) / ∑_{j=1}^n (z_j − z̄_n)².

Assume that

    max_{j≤n} (z_j − z̄_n)² / ∑_{j=1}^n (z_j − z̄_n)² → 0.

Then the Lindeberg condition can be verified, and we conclude that

    √( ∑_{j=1}^n (z_j − z̄_n)² ) (β̂_n − β) →d N(0, σ²).
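A minimal simulation sketch of Example 3.11 in Python (numpy; the design z_j = log(j + 1), the parameter values, and the replication count are illustrative assumptions, not from the notes):

    import numpy as np

    rng = np.random.default_rng(4)
    n, alpha, beta, sigma, reps = 200, 1.0, 2.0, 1.0, 5000   # illustrative values
    z = np.log(np.arange(1, n + 1) + 1.0)                    # a fixed, non-constant design
    zc = z - z.mean()
    scale = np.sqrt(np.sum(zc ** 2))

    stats = np.empty(reps)
    for r in range(reps):
        x = alpha + beta * z + rng.normal(0.0, sigma, size=n)
        beta_hat = np.sum(x * zc) / np.sum(zc ** 2)          # least squares slope
        stats[r] = scale * (beta_hat - beta)                 # should be approximately N(0, sigma^2)

    print(stats.std(), sigma)                                # sample sd close to sigma
    print((np.abs(stats / sigma) > 1.96).mean())             # close to 0.05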


Example 3.12 This example comes from the randomization test for paired comparisons. In a paired study comparing treatment versus control, 2n subjects are grouped into n pairs. For each pair, it is decided at random which subject receives the treatment. Let (X_j, Y_j) denote the values of the jth pair, with X_j being the result under treatment. The usual paired t-test is based on the normality of Z_j = X_j − Y_j, which may be invalid in practice. The randomization test (sometimes called a permutation test) avoids this normality assumption and relies solely on the randomization: when treatment and control have no difference, conditional on |Z_j| = z_j, the sign of Z_j is equally likely to be + or −, i.e., Z_j = |Z_j| sgn(Z_j) takes the values ±z_j with probability 1/2 each, independently across pairs. Therefore, conditional on z_1, z_2, ..., the randomization t-test, based on the t-statistic √(n − 1) Z̄_n/s_z, where s_z² = (1/n) ∑_{j=1}^n (Z_j − Z̄_n)², has a discrete distribution on 2^n equally likely values. We can simulate this distribution easily by Monte Carlo, and a large observed statistic provides strong evidence that the treatment increases the response. When n is large, exact computation over all 2^n sign patterns can be prohibitive, so an approximation is preferable. The Lindeberg-Feller central limit theorem can be applied if we assume

    max_{j≤n} z_j² / ∑_{j=1}^n z_j² → 0.

It can then be shown that this statistic has an asymptotic standard normal distribution N(0, 1). The details can be found in Ferguson, page 29.
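Here is a minimal Monte Carlo sketch of the randomization t-test in Python (numpy; the vector of paired differences z and the number of resampled sign patterns are made-up illustrations, not data from the notes):

    import numpy as np

    rng = np.random.default_rng(5)
    z = np.array([0.3, 1.2, -0.4, 0.8, 0.9, 1.5, -0.2, 0.7, 1.1, 0.5])  # hypothetical paired differences
    n = z.size

    def t_stat(v):
        # randomization t-statistic sqrt(n-1) * mean / s_z, with s_z^2 = mean((v - mean)^2)
        return np.sqrt(n - 1) * v.mean() / v.std()

    observed = t_stat(z)
    # Under the null, each of the 2^n sign patterns applied to |z| is equally likely;
    # sample sign patterns at random instead of enumerating all of them
    flips = rng.choice([-1.0, 1.0], size=(20000, n))
    null_stats = np.array([t_stat(s * np.abs(z)) for s in flips])
    p_value = (null_stats >= observed).mean()
    print(observed, p_value)

For large n the null distribution sampled here is close to N(0, 1), which is the normal approximation justified by the Lindeberg-Feller theorem.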

Example 3.13 In Ferguson, page 30, an example of applying the central limit theorem is given for the signed-rank test for paired comparisons. Interested readers can find more details there.

3.3.4 Delta method

In many situations, a statistic is not simply a sum of independent random variables but a transformation of such a sum. In this case, the Delta method can be used to obtain a result similar to the central limit theorem.

Theorem 3.15 (Delta method) Let X and X_n be random vectors in R^k. If there exist constants a_n and a vector µ such that a_n(X_n − µ) →d X and a_n → ∞, then for any function g : R^k → R^l that is differentiable at µ, with derivative ∇g(µ),

    a_n(g(X_n) − g(µ)) →d ∇g(µ)X.   †

Proof By the Skorohod representation, we can construct X̃_n and X̃ such that X̃_n ∼d X_n and X̃ ∼d X (∼d means equality in distribution) and a_n(X̃_n − µ) →a.s. X̃. Then a_n(g(X̃_n) − g(µ)) →a.s. ∇g(µ)X̃, and we obtain the result. †

As a corollary of Theorem 3.15, if √n(X_n − µ) →d N(0, σ²), then for any function g(·) differentiable at µ, √n(g(X_n) − g(µ)) →d N(0, g′(µ)²σ²).
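A quick numerical check of the corollary in Python (numpy; the transformation g(x) = x² and the values of µ, σ, n are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(6)
    mu, sigma, n, reps = 2.0, 1.5, 200, 10000   # illustrative values
    g = lambda x: x ** 2                        # transformation with g'(mu) = 2*mu

    x = rng.normal(mu, sigma, size=(reps, n))
    xbar = x.mean(axis=1)
    stat = np.sqrt(n) * (g(xbar) - g(mu))       # delta method: approx N(0, (2*mu)^2 * sigma^2)

    print(stat.std(), abs(2 * mu) * sigma)      # the two values should be close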


Example 3.14 Let X_1, X_2, ... be i.i.d with finite fourth moment. An estimate of the variance is s_n² = (1/n) ∑_{i=1}^n (X_i − X̄_n)². We can use the Delta method to derive the asymptotic distribution of s_n². Without loss of generality assume E[X_1] = 0 (otherwise replace X_i by X_i − E[X_1], which changes neither s_n² nor Var(X_1)), and denote m_k = E[X_1^k] for k ≤ 4. Note that s_n² = (1/n) ∑_{i=1}^n X_i² − (∑_{i=1}^n X_i/n)², and by the multivariate central limit theorem,

    √n [ ( X̄_n, (1/n) ∑_{i=1}^n X_i² )′ − (0, m_2)′ ] →d N( 0,  [ m_2   m_3
                                                                  m_3   m_4 − m_2² ] ).

Applying the Delta method with g(x, y) = y − x², whose gradient at (0, m_2) is (0, 1), we obtain

    √n(s_n² − Var(X_1)) →d N(0, m_4 − m_2²).

Example 3.15 Let (X_1, Y_1), (X_2, Y_2), ... be i.i.d bivariate samples with finite fourth moments. One estimate of the correlation between X and Y is

    ρ̂_n = s_{xy} / √(s_x² s_y²),

where s_{xy} = (1/n) ∑_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n), s_x² = (1/n) ∑_{i=1}^n (X_i − X̄_n)², and s_y² = (1/n) ∑_{i=1}^n (Y_i − Ȳ_n)². To derive the large sample distribution of ρ̂_n, we can first obtain the joint large sample distribution of (s_{xy}, s_x², s_y²) using the Delta method as in Example 3.14, and then apply the Delta method again with g(x, y, z) = x/√(yz). We skip the details.

Example 3.16 This example concerns Pearson's chi-square statistic. Suppose each subject falls into one of K categories with probabilities p_1, ..., p_K, where p_1 + ... + p_K = 1. From n i.i.d subjects we observe counts n_1, ..., n_K in these categories, with n = n_1 + ... + n_K. Pearson's statistic is defined as

    χ² = n ∑_{k=1}^K (n_k/n − p_k)²/p_k,

which can be read as ∑ (observed count − expected count)²/expected count. To obtain the asymptotic distribution of χ², note that √n(n_1/n − p_1, ..., n_K/n − p_K) has an asymptotic multivariate normal distribution; the continuous mapping theorem with g(x_1, ..., x_K) = ∑_{k=1}^K x_k²/p_k then gives the limit distribution of χ².
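A small Python sketch of Example 3.16 (numpy; the category probabilities, sample size, and replication count are illustrative assumptions); the code checks that Pearson's statistic behaves like a chi-square with K − 1 degrees of freedom:

    import numpy as np

    rng = np.random.default_rng(7)
    p = np.array([0.2, 0.3, 0.5])               # illustrative category probabilities (K = 3)
    n, reps = 500, 20000

    counts = rng.multinomial(n, p, size=reps)   # observed counts n_1, ..., n_K
    chi2 = (n * ((counts / n - p) ** 2 / p)).sum(axis=1)

    # Under the null, chi2 is approximately chi-square with K-1 = 2 degrees of freedom;
    # the 95th percentile of chi^2_2 is about 5.99 and its mean is 2
    print((chi2 > 5.99).mean())                 # should be close to 0.05
    print(chi2.mean())                          # should be close to 2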

3.4 Summation of Non-independent Random Variables

In statistical inference, one also encounters sums of non-independent random variables. General large sample theory for arbitrary dependent summands does not exist, but for sums with special structure we have results similar to the central limit theorem. These special cases include U-statistics, rank statistics, and martingales.


3.4.1 U-statistics

We suppose X_1, ..., X_n are i.i.d random variables.

Definition 3.6 A U-statistic associated with a kernel h(x_1, ..., x_r) is defined as

    U_n = (1/(r! C(n, r))) ∑_β h(X_{β_1}, ..., X_{β_r}),

where the sum is taken over all ordered r-tuples β = (β_1, ..., β_r) of distinct integers chosen from {1, ..., n}, and C(n, r) = n!/(r!(n − r)!). †

One simple example is h(x, y) = xy, for which U_n = (n(n − 1))^{−1} ∑_{i≠j} X_iX_j. Many examples of U-statistics arise from rank-based statistical inference. If X_{(1)}, ..., X_{(n)} denote the order statistics of X_1, ..., X_n, one can see that

    U_n = E[h(X_1, ..., X_r) | X_{(1)}, ..., X_{(n)}].

Clearly, U_n is a sum of non-independent random variables.

If we define h̃(x_1, ..., x_r) = (r!)^{−1} ∑_{(x̃_1,...,x̃_r) a permutation of (x_1,...,x_r)} h(x̃_1, ..., x̃_r), then h̃ is permutation-symmetric and, moreover,

    U_n = (1/C(n, r)) ∑_{β_1 < ... < β_r} h̃(X_{β_1}, ..., X_{β_r}).

In the last expression, h̃ is called the (symmetrized) kernel of the U-statistic U_n.

The following theorem says that the limiting distribution of U_n is the same as the limiting distribution of a sum of i.i.d random variables; thus the central limit theorem can be applied to U_n.

Theorem 3.16 Let µ = E[h(X_1, ..., X_r)]. If E[h(X_1, ..., X_r)²] < ∞, then

    √n(U_n − µ) − √n ∑_{i=1}^n E[U_n − µ | X_i] →p 0.

Consequently, √n(U_n − µ) is asymptotically normal with mean zero and variance r²σ², where, with X_1, ..., X_r, X̃_1, ..., X̃_r i.i.d variables,

    σ² = Cov( h(X_1, X_2, ..., X_r), h(X_1, X̃_2, ..., X̃_r) ).   †

To prove Theorem 3.16, we need the following lemmas. Let S be a linear space of random variables with finite second moments that contains the constants; i.e., 1 ∈ S and for any X, Y ∈ S and constants a and b, aX + bY ∈ S. For a random variable T, a random variable Ŝ ∈ S is called the projection of T onto S if it minimizes E[(T − S)²] over S ∈ S.

Proposition 3.6 Let S be a linear space of random variables with finite second moments. Then Ŝ is the projection of T onto S if and only if Ŝ ∈ S and for any S ∈ S, E[(T − Ŝ)S] = 0. Any two projections of T onto S are almost surely equal. If the linear space S contains the constant variables, then E[T] = E[Ŝ] and Cov(T − Ŝ, S) = 0 for every S ∈ S. †

Proof For any S and Ŝ in S,

    E[(T − S)²] = E[(T − Ŝ)²] + 2E[(T − Ŝ)(Ŝ − S)] + E[(Ŝ − S)²].

Thus, if Ŝ satisfies E[(T − Ŝ)S′] = 0 for every S′ ∈ S, then the cross term vanishes (take S′ = Ŝ − S) and E[(T − S)²] ≥ E[(T − Ŝ)²], so Ŝ is the projection of T onto S. On the other hand, if Ŝ is the projection, then for any S ∈ S and constant α, E[(T − Ŝ − αS)²] is minimized at α = 0; differentiating at α = 0 gives E[(T − Ŝ)S] = 0.

If T has two projections S_1 and S_2, then the above argument gives E[(S_1 − S_2)²] = 0, so S_1 = S_2 a.s. If the linear space S contains the constants, choose S = 1; then 0 = E[(T − Ŝ)S] = E[T] − E[Ŝ]. Clearly, Cov(T − Ŝ, S) = E[(T − Ŝ)S] − E[T − Ŝ]E[S] = 0. †

Proposition 3.7 Let S_n be linear spaces of random variables with finite second moments that contain the constants. Let T_n be random variables with projections Ŝ_n onto S_n. If Var(T_n)/Var(Ŝ_n) → 1, then

    Z_n ≡ (T_n − E[T_n])/√Var(T_n) − (Ŝ_n − E[Ŝ_n])/√Var(Ŝ_n) →p 0.   †

Proof E[Z_n] = 0. Note that

    Var(Z_n) = 2 − 2 Cov(T_n, Ŝ_n)/√(Var(T_n)Var(Ŝ_n)).

Since Ŝ_n is the projection of T_n, Cov(T_n, Ŝ_n) = Cov(T_n − Ŝ_n, Ŝ_n) + Var(Ŝ_n) = Var(Ŝ_n). We have

    Var(Z_n) = 2( 1 − √(Var(Ŝ_n)/Var(T_n)) ) → 0.

By Chebyshev's inequality, we conclude that Z_n →p 0. †

The above lemma implies that if Ŝ_n is a sum of i.i.d random variables such that (Ŝ_n − E[Ŝ_n])/√Var(Ŝ_n) →d N(0, 1), then (T_n − E[T_n])/√Var(T_n) converges to the same limit. The limiting distribution of U-statistics is derived using this lemma.

We now prove Theorem 3.16.

Proof Without loss of generality, we may assume that h is permutation-symmetric, since replacing h by its symmetrized version h̃ does not change U_n. Let X̃_1, ..., X̃_r be random variables with the same distribution as X_1 and independent of X_1, ..., X_n. Denote Û_n = ∑_{i=1}^n E[U_n − µ | X_i]. We show that Û_n is the projection of U_n − µ onto the linear space S_n = {g_1(X_1) + ... + g_n(X_n) : E[g_k(X_k)²] < ∞, k = 1, ..., n}, which contains the constant variables. Clearly, Û_n ∈ S_n. For any g_k(X_k) ∈ S_n,

    E[(U_n − µ − Û_n) g_k(X_k)] = E[ E[U_n − µ − Û_n | X_k] g_k(X_k) ] = 0,

because E[U_n − µ − Û_n | X_k] = E[U_n − µ | X_k] − E[U_n − µ | X_k] − ∑_{i≠k} E[U_n − µ] = 0 by independence.


In fact, we can compute Û_n explicitly:

    Û_n = ∑_{i=1}^n [ C(n−1, r−1)/C(n, r) ] E[h(X̃_1, ..., X̃_{r−1}, X_i) − µ | X_i]
        = (r/n) ∑_{i=1}^n E[h(X̃_1, ..., X̃_{r−1}, X_i) − µ | X_i].

Thus,

    Var(Û_n) = (r²/n²) ∑_{i=1}^n E[ ( E[h(X̃_1, ..., X̃_{r−1}, X_i) − µ | X_i] )² ]
             = (r²/n) Var( E[h(X̃_1, ..., X̃_{r−1}, X_1) | X_1] )
             = (r²/n) Cov( h(X_1, X_2, ..., X_r), h(X_1, X̃_2, ..., X̃_r) ) = r²σ²/n,

where the last equality uses the identity

    Cov(X, Y) = Cov(E[X|Z], E[Y|Z]) + E[Cov(X, Y|Z)]

conditionally on X_1. Furthermore,

    Var(U_n) = C(n, r)^{−2} ∑_β ∑_{β′} Cov( h(X_{β_1}, ..., X_{β_r}), h(X_{β′_1}, ..., X_{β′_r}) )
             = C(n, r)^{−2} ∑_{k=1}^r ∑_{β, β′ sharing exactly k components} Cov( h(X_1, ..., X_k, X_{k+1}, ..., X_r), h(X_1, ..., X_k, X̃_{k+1}, ..., X̃_r) ).

Since the number of pairs (β, β′) sharing exactly k components is C(n, r) C(r, k) C(n−r, r−k), we obtain

    Var(U_n) = ∑_{k=1}^r [ C(r, k) C(n−r, r−k)/C(n, r) ] Cov( h(X_1, ..., X_k, X_{k+1}, ..., X_r), h(X_1, ..., X_k, X̃_{k+1}, ..., X̃_r) ).

The dominating term is the k = 1 term, which is of order 1/n, while the terms with k ≥ 2 are of order 1/n² or smaller. That is,

    Var(U_n) = (r²/n) Cov( h(X_1, X_2, ..., X_r), h(X_1, X̃_2, ..., X̃_r) ) + O(1/n²).

We conclude that Var(U_n)/Var(Û_n) → 1. From Proposition 3.7, it holds that

    (U_n − µ)/√Var(U_n) − Û_n/√Var(Û_n) →p 0.

Since nVar(Û_n) = r²σ² and Var(U_n)/Var(Û_n) → 1, this is equivalent to the first claim of the theorem, and the asymptotic normality follows by applying the central limit theorem to √n Û_n, a normalized sum of i.i.d terms. Theorem 3.16 thus holds. †


Example 3.17 For a bivariate i.i.d sample (X_1, Y_1), (X_2, Y_2), ..., one statistic measuring agreement is Kendall's τ-statistic,

    τ̂ = [4/(n(n − 1))] ∑∑_{i<j} I{ (Y_j − Y_i)(X_j − X_i) > 0 } − 1.

It can be seen that τ̂ + 1 is a U-statistic of order 2 with kernel

    2 I{ (y_2 − y_1)(x_2 − x_1) > 0 }.

Hence, by the central limit theorem for U-statistics, √n( τ̂ + 1 − 2P((Y_2 − Y_1)(X_2 − X_1) > 0) ) has an asymptotic normal distribution with mean zero. The asymptotic variance can be computed as in Theorem 3.16.

3.4.2 Rank statistics

For a sequence of i.i.d random variables X_1, ..., X_n, we can order them from smallest to largest and denote them by X_{(1)} ≤ X_{(2)} ≤ ... ≤ X_{(n)}; these are the order statistics of the sample. The rank statistics, denoted R_1, ..., R_n, are the ranks of the X_i among X_1, ..., X_n. Thus, if all the X's are distinct, X_i = X_{(R_i)}. When there are ties, R_i is defined as the average of all indices j such that X_i = X_{(j)} (sometimes called the midrank). To avoid ties, we only consider the case where the X's have continuous densities.

By definition, a rank statistic is any function of the ranks. A linear rank statistic is a rank statistic of the special form ∑_{i=1}^n a(i, R_i) for a given matrix (a(i, j))_{n×n}. If a(i, j) = c_i a_j, then the statistic takes the form ∑_{i=1}^n c_i a_{R_i} and is called a simple linear rank statistic; this will be our concern in this section. Here, the c's and a's are called the coefficients and the scores.

Example 3.18 For two independent samples X_1, ..., X_n and Y_1, ..., Y_m, the Wilcoxon statistic is defined as the sum of the ranks of the second sample in the pooled data X_1, ..., X_n, Y_1, ..., Y_m, i.e.,

    W_n = ∑_{i=n+1}^{n+m} R_i.

This is a simple linear rank statistic in which the c's are 0 for the first sample and 1 for the second sample, and the score vector is a = (1, ..., n + m), i.e., a_j = j. There are other choices of scores; for instance, the van der Waerden statistic uses normal scores, ∑_{i=n+1}^{n+m} Φ^{−1}(R_i/(n + m + 1)).
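A short Python sketch computing the Wilcoxon statistic of Example 3.18 and standardizing it (numpy; the sample sizes and the shifted normal alternative are illustrative assumptions); the moments used in the standardization follow from part 6 of Proposition 3.8 below.

    import numpy as np

    rng = np.random.default_rng(9)
    n, m = 30, 25
    x = rng.normal(0.0, 1.0, size=n)            # first sample
    y = rng.normal(0.5, 1.0, size=m)            # second sample, shifted for illustration

    pooled = np.concatenate([x, y])
    ranks = np.empty(n + m)
    ranks[np.argsort(pooled)] = np.arange(1, n + m + 1)   # ranks 1..n+m (no ties a.s.)

    w = ranks[n:].sum()                         # Wilcoxon statistic: sum of ranks of second sample
    # Under the null of identical distributions, E[W] = m(n+m+1)/2 and Var(W) = n*m*(n+m+1)/12
    ew = m * (n + m + 1) / 2.0
    vw = n * m * (n + m + 1) / 12.0
    print((w - ew) / np.sqrt(vw))               # approximately N(0,1) for large samples under the null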

For order statistics and rank statistics, there are some useful properties.

Proposition 3.8 Let X_1, ..., X_n be a random sample from a continuous distribution function F with density f. Then

1. the vectors (X_{(1)}, ..., X_{(n)}) and (R_1, ..., R_n) are independent;

2. the vector (X_{(1)}, ..., X_{(n)}) has density n! ∏_{i=1}^n f(x_i) on the set {x_1 < ... < x_n};

3. the variable X_{(i)} has density [n!/((i−1)!(n−i)!)] F(x)^{i−1}(1 − F(x))^{n−i} f(x); for F the uniform distribution on [0, 1], X_{(i)} has mean i/(n + 1) and variance i(n − i + 1)/[(n + 1)²(n + 2)];


4. the vector (R_1, ..., R_n) is uniformly distributed on the set of all n! permutations of (1, 2, ..., n);

5. for any statistic T and any permutation r = (r_1, ..., r_n) of (1, 2, ..., n),

    E[T(X_1, ..., X_n) | (R_1, ..., R_n) = r] = E[T(X_{(r_1)}, ..., X_{(r_n)})];

6. for any simple linear rank statistic T = ∑_{i=1}^n c_i a_{R_i},

    E[T] = n c̄_n ā_n,   Var(T) = [1/(n − 1)] ∑_{i=1}^n (c_i − c̄_n)² ∑_{i=1}^n (a_i − ā_n)².   †

The proof of Proposition 3.8 is elementary, so we skip it. For simple linear rank statistics, a central limit theorem also exists:

Theorem 3.17 Let T_n = ∑_{i=1}^n c_i a_{R_i} and suppose that

    max_{i≤n} |a_i − ā_n| / √(∑_{i=1}^n (a_i − ā_n)²) → 0,   max_{i≤n} |c_i − c̄_n| / √(∑_{i=1}^n (c_i − c̄_n)²) → 0.

Then (T_n − E[T_n])/√Var(T_n) →d N(0, 1) if and only if for every ε > 0,

    ∑_{(i,j)} I{ √n |a_i − ā_n||c_j − c̄_n| / √( ∑_{i=1}^n (a_i − ā_n)² ∑_{i=1}^n (c_i − c̄_n)² ) > ε }
              × |a_i − ā_n|²|c_j − c̄_n|² / ( ∑_{i=1}^n (a_i − ā_n)² ∑_{i=1}^n (c_i − c̄_n)² ) → 0.

One immediately recognizes that the last condition is similar to the Lindeberg condition. The proof can be found in Ferguson, Chapter 12.

Besides rank statistics, there are other statistics based on ranks. For example, a simple linear signed rank statistic has the form

    ∑_{i=1}^n a_{R_i^+} sign(X_i),

where R_1^+, ..., R_n^+, the absolute ranks, are the ranks of |X_1|, ..., |X_n|. In a bivariate sample (X_1, Y_1), ..., (X_n, Y_n), one can define a statistic of the form

    ∑_{i=1}^n a_{R_i} b_{S_i}

for two constant vectors (a_1, ..., a_n) and (b_1, ..., b_n), where (R_1, ..., R_n) and (S_1, ..., S_n) are the respective ranks of (X_1, ..., X_n) and (Y_1, ..., Y_n). Such a statistic is useful for testing independence of X and Y. Another class of statistics is based on permutation tests, as exemplified in Example 3.12. For all these statistics, suitable conditions ensure that a central limit theorem holds.


3.4.3 Martingales

In this section, we consider the central limit theorem for another type of sum of non-independent random variables, namely martingales.

Definition 3.7 Let {Y_n} be a sequence of random variables and {F_n} a sequence of σ-fields such that F_1 ⊂ F_2 ⊂ .... Suppose E[|Y_n|] < ∞. The pairs (Y_n, F_n) are called a martingale if

    E[Y_n | F_{n−1}] = Y_{n−1}, a.s.;

a submartingale if

    E[Y_n | F_{n−1}] ≥ Y_{n−1}, a.s.;

and a supermartingale if

    E[Y_n | F_{n−1}] ≤ Y_{n−1}, a.s.   †

The definition implicitly requires that Y_1, ..., Y_n be measurable with respect to F_n; we then say Y_n is adapted to F_n. One simple example of a martingale is Y_n = X_1 + ... + X_n, where X_1, X_2, ... are i.i.d with mean zero and F_n is the σ-field generated by X_1, ..., X_n. This is because

    E[Y_n | F_{n−1}] = E[X_1 + ... + X_n | X_1, ..., X_{n−1}] = Y_{n−1}.

For Y_n = X_1² + ... + X_n², one can verify that (Y_n, F_n) is a submartingale. In fact, from one martingale one can construct many submartingales, as shown in the following lemma.

Proposition 3.9 Let (Y_n, F_n) be a martingale. For any measurable and convex function φ with E[|φ(Y_n)|] < ∞, (φ(Y_n), F_n) is a submartingale. †

Proof Clearly, φ(Y_n) is adapted to F_n. It is sufficient to show

    E[φ(Y_n) | F_{n−1}] ≥ φ(Y_{n−1}).

This follows from the well-known Jensen inequality: for any convex function φ,

    E[φ(Y_n) | F_{n−1}] ≥ φ(E[Y_n | F_{n−1}]) = φ(Y_{n−1}).   †

In particular, the (unconditional) Jensen inequality is given in the following lemma.

Proposition 3.10 For any random variable X and any convex measurable function φ with E[|X|] < ∞ and E[|φ(X)|] < ∞,

    E[φ(X)] ≥ φ(E[X]).   †


Proof We first claim that for any x_0 there exists a constant k_0 such that for all x,

    φ(x) ≥ φ(x_0) + k_0(x − x_0).

The line φ(x_0) + k_0(x − x_0) is called a supporting line for φ at x_0. By convexity, for any x′ < y′ < x_0 < y < x,

    [φ(x_0) − φ(x′)]/(x_0 − x′) ≤ [φ(y) − φ(x_0)]/(y − x_0) ≤ [φ(x) − φ(x_0)]/(x − x_0).

Thus, [φ(x) − φ(x_0)]/(x − x_0) is bounded below and decreasing as x decreases to x_0. Let the limit be k_0^+; then for x > x_0,

    [φ(x) − φ(x_0)]/(x − x_0) ≥ k_0^+,  i.e.,  φ(x) ≥ k_0^+(x − x_0) + φ(x_0).

Similarly,

    [φ(x′) − φ(x_0)]/(x′ − x_0) ≤ [φ(y′) − φ(x_0)]/(y′ − x_0) ≤ [φ(x) − φ(x_0)]/(x − x_0),

so [φ(x′) − φ(x_0)]/(x′ − x_0) is increasing and bounded above as x′ increases to x_0. Let the limit be k_0^−; then for x′ < x_0,

    φ(x′) ≥ k_0^−(x′ − x_0) + φ(x_0).

Clearly, k_0^+ ≥ k_0^−. Combining the two inequalities, the claim holds with k_0 = (k_0^+ + k_0^−)/2. Now choose x_0 = E[X]; then

    φ(X) ≥ φ(E[X]) + k_0(X − E[X]).

Jensen's inequality follows by taking expectations on both sides. †

If (Y_n, F_n) is a submartingale, we can write Y_n = (Y_n − E[Y_n | F_{n−1}]) + E[Y_n | F_{n−1}]. Note that the differences Y_n − E[Y_n | F_{n−1}] are martingale differences (their partial sums form a martingale) and that E[Y_n | F_{n−1}] is measurable with respect to F_{n−1}. Thus any submartingale can be written as the sum of a martingale and a term that is predictable with respect to F_{n−1}. We now state the limit theorems for martingales.

Theorem 3.18 (Martingale Convergence Theorem) Let (X_n, F_n) be a submartingale. If K = sup_n E[|X_n|] < ∞, then X_n →a.s. X, where X is a random variable satisfying E[|X|] ≤ K. †

The proof uses a maximal inequality for submartingales and an upcrossing inequality.

Proof We first prove the following maximal inequality: for α > 0,

    P( max_{i≤n} X_i ≥ α ) ≤ (1/α) E[|X_n|].


To see this, note that

    P( max_{i≤n} X_i ≥ α ) = ∑_{i=1}^n P(X_1 < α, ..., X_{i−1} < α, X_i ≥ α)
        ≤ ∑_{i=1}^n E[ I(X_1 < α, ..., X_{i−1} < α, X_i ≥ α) X_i / α ]
        = (1/α) ∑_{i=1}^n E[ I(X_1 < α, ..., X_{i−1} < α, X_i ≥ α) X_i ].

Since E[X_n | X_1, ..., X_{n−1}] ≥ X_{n−1}, E[X_n | X_1, ..., X_{n−2}] ≥ E[X_{n−1} | X_1, ..., X_{n−2}], and so on, we obtain E[X_n | X_1, ..., X_i] ≥ E[X_{i+1} | X_1, ..., X_i] ≥ X_i for i = 1, ..., n − 1. Thus,

    P( max_{i≤n} X_i ≥ α ) ≤ (1/α) ∑_{i=1}^n E[ I(X_1 < α, ..., X_{i−1} < α, X_i ≥ α) E[X_n | X_1, ..., X_i] ]
        ≤ (1/α) E[ X_n ∑_{i=1}^n I(X_1 < α, ..., X_{i−1} < α, X_i ≥ α) ] ≤ (1/α) E[X_n^+] ≤ (1/α) E[|X_n|].

For any interval [α, β] with α < β, define a sequence of indices τ_1, τ_2, ... as follows: τ_1 is the smallest j such that 1 ≤ j ≤ n and X_j ≤ α, and is n if there is no such j; τ_{2k} is the smallest j such that τ_{2k−1} < j ≤ n and X_j ≥ β, and is n if there is no such j; τ_{2k+1} is the smallest j such that τ_{2k} < j ≤ n and X_j ≤ α, and is n if there is no such j. A random variable U, called the number of upcrossings of [α, β] by X_1, ..., X_n, is the largest i such that X_{τ_{2i−1}} ≤ α < β ≤ X_{τ_{2i}}. We will show that

    E[U] ≤ (E[|X_n|] + |α|)/(β − α).

Let Y_k = max{0, X_k − α} and θ = β − α. It is easy to see that Y_1, ..., Y_n is a submartingale. The τ_k are unchanged if in their definitions the condition X_j ≤ α is replaced by Y_j = 0 and X_j ≥ β by Y_j ≥ θ, so U is also the number of upcrossings of [0, θ] by Y_1, ..., Y_n. We also obtain

    E[Y_{τ_{2k+1}} − Y_{τ_{2k}}] = ∑_{1 ≤ k_1 < k_2 ≤ n} E[(Y_{k_2} − Y_{k_1}) I(τ_{2k+1} = k_2, τ_{2k} = k_1)]
        = ∑_{k_1=1}^{n−1} ∑_{k′=2}^{n} E[ I(τ_{2k} = k_1, k_1 < k′ ≤ τ_{2k+1}) (Y_{k′} − Y_{k′−1}) ]
        = ∑_{k_1=1}^{n−1} ∑_{k′=2}^{n} E[ I(τ_{2k} = k_1, k_1 < k′)(1 − I(τ_{2k+1} < k′))(Y_{k′} − Y_{k′−1}) ].

By the definitions, if {τ_{2k−1} = i} is measurable with respect to F_i for i = 1, ..., n, where F_i is the σ-field generated by Y_1, ..., Y_i, then

    {τ_{2k} = j} = ∪_{i=1}^{j−1} {τ_{2k−1} = i, Y_{i+1} < θ, ..., Y_{j−1} < θ, Y_j ≥ θ}


belongs to the σ-field F_j, and {τ_{2k} = n} = {τ_{2k} ≤ n − 1}^c lies in F_n. Similarly, if {τ_{2k} = i} ∈ F_i for every i = 1, ..., n, then {τ_{2k+1} = i} ∈ F_i for every i = 1, ..., n. Thus, by induction, {τ_k = i} ∈ F_i for every i = 1, ..., n. Then,

    E[ I(τ_{2k} = k_1, k_1 < k′)(1 − I(τ_{2k+1} < k′))(Y_{k′} − Y_{k′−1}) ]
        = E[ I(τ_{2k} = k_1, k_1 < k′)(1 − I(τ_{2k+1} < k′)) (E[Y_{k′} | F_{k′−1}] − Y_{k′−1}) ] ≥ 0.

We conclude that E[Y_{τ_{2k+1}} − Y_{τ_{2k}}] ≥ 0.

Since the τ_k are nondecreasing and τ_n = n,

    Y_n = Y_{τ_n} ≥ Y_{τ_n} − Y_{τ_1} = ∑_{k=2}^n (Y_{τ_k} − Y_{τ_{k−1}})
        = ∑_{2 ≤ k ≤ n, k even} (Y_{τ_k} − Y_{τ_{k−1}}) + ∑_{2 ≤ k ≤ n, k odd} (Y_{τ_k} − Y_{τ_{k−1}}).

When k is even and corresponds to a completed upcrossing, Y_{τ_k} − Y_{τ_{k−1}} ≥ θ, and there are U such k; the remaining even terms are non-negative, and the expectation of the odd sum is non-negative by the previous paragraph. We obtain

    E[Y_n] ≥ θ E[U],

and thus

    E[U] ≤ E[Y_n]/θ ≤ (E[|X_n|] + |α|)/(β − α).

With these inequalities, we can prove the martingale convergence theorem. Let U_n be the number of upcrossings of [α, β] by X_1, ..., X_n. Then

    E[U_n] ≤ (K + |α|)/(β − α).

Let X* = limsup_n X_n and X_* = liminf_n X_n. If X_* < α < β < X*, then U_n must go to infinity. Since lim_n U_n has finite expectation and is hence finite with probability 1, P(X_* < α < β < X*) = 0. Now

    {X_* < X*} = ∪_{α<β, α,β rational} {X_* < α < β < X*},

so P(X_* = X*) = 1. That is, X_n converges to the common value X. By Fatou's lemma, E[|X|] ≤ liminf_n E[|X_n|] ≤ K, so X is integrable and finite with probability 1. †

As a corollary of the martingale convergence theorem, we obtain

Corollary 3.1 If {F_n} is an increasing sequence of σ-fields and F_∞ denotes the σ-field generated by ∪_{n=1}^∞ F_n, then for any random variable Z with E[|Z|] < ∞,

    E[Z | F_n] →a.s. E[Z | F_∞].   †

Proof Denote Y_n = E[Z | F_n]. Clearly, (Y_n, F_n) is a martingale. Moreover, E[|Y_n|] ≤ E[|Z|]. By the martingale convergence theorem, Y_n converges almost surely to some random variable Y, which is measurable with respect to F_∞. We next show that {Y_n} is uniformly integrable. Since |Y_n| ≤ E[|Z| | F_n], we may assume Z is non-negative. For any ε > 0, there exists a δ such that E[Z I_A] < ε whenever P(A) < δ (since the measure A ↦ E[Z I_A] is absolutely continuous with respect to P). For large α, consider the set A = {E[Z | F_n] ≥ α}. Since

    P(A) = E[ I(E[Z | F_n] ≥ α) ] ≤ (1/α) E[Z],

we can choose α large enough (independent of n) such that P(A) < δ. Thus E[Z I(E[Z|F_n] ≥ α)] < ε for every n, and since E[ E[Z|F_n] I(E[Z|F_n] ≥ α) ] = E[ Z I(E[Z|F_n] ≥ α) ], we conclude that {E[Z | F_n]} is uniformly integrable. With uniform integrability, for any A ∈ F_k, lim_n ∫_A Y_n dP = ∫_A Y dP. Note that ∫_A Y_n dP = ∫_A Z dP for n > k. Thus ∫_A Y dP = ∫_A Z dP = ∫_A E[Z | F_∞] dP. This holds for any A ∈ ∪_{n=1}^∞ F_n, hence for any A ∈ F_∞. Since Y is measurable with respect to F_∞, Y = E[Z | F_∞], a.s. †

Finally, a theorem analogous to the Lindeberg-Feller central limit theorem exists for martingales.

Theorem 3.19 (Martingale Central Limit Theorem) For each n, let (Y_{n1}, F_{n1}), (Y_{n2}, F_{n2}), ... be a martingale. Define X_{nk} = Y_{nk} − Y_{n,k−1} with Y_{n0} = 0, so that Y_{nk} = X_{n1} + ... + X_{nk}. Suppose that

    ∑_k E[X_{nk}² | F_{n,k−1}] →p σ²,

where σ is a positive constant, and that

    ∑_k E[X_{nk}² I(|X_{nk}| ≥ ε) | F_{n,k−1}] →p 0

for each ε > 0. Then

    ∑_k X_{nk} →d N(0, σ²).   †

The proof is based on an approximation of the characteristic function; we skip the details here.

3.5 Some Notation

In a probability space (Ω, A, P), let X_n be random variables (or random vectors). We introduce the following notation: X_n = o_p(1) denotes that X_n converges in probability to zero; X_n = O_p(1) denotes that X_n is bounded in probability, i.e.,

    lim_{M→∞} limsup_n P(|X_n| ≥ M) = 0.

It is easy to see that X_n = O_p(1) is equivalent to saying that {X_n} is uniformly tight. Furthermore, for a sequence of random variables r_n, X_n = o_p(r_n) means that |X_n|/r_n →p 0 and X_n = O_p(r_n) means that |X_n|/r_n is bounded in probability.

There are many rules of calculus with the o and O symbols. For instance, some commonly used formulae are (R_n is a deterministic sequence)

    o_p(1) + o_p(1) = o_p(1),   O_p(1) + O_p(1) = O_p(1),   O_p(1)o_p(1) = o_p(1),
    (1 + o_p(1))^{−1} = 1 + o_p(1),   o_p(R_n) = R_n o_p(1),   O_p(R_n) = R_n O_p(1),
    o_p(O_p(1)) = o_p(1).

Furthermore, if a real function R(·) satisfies R(h) = o(|h|^p) as h → 0, then R(X_n) = o_p(|X_n|^p); if R(h) = O(|h|^p) as h → 0, then R(X_n) = O_p(|X_n|^p). Readers should be able to prove these results without difficulty.
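As a small numerical illustration of this notation in Python (numpy; the Exponential(1) distribution, sample sizes, and replication count are illustrative assumptions): for i.i.d. data with mean µ, X̄_n − µ = o_p(1) while √n(X̄_n − µ) = O_p(1).

    import numpy as np

    rng = np.random.default_rng(10)
    reps = 1000

    for n in [100, 1000, 10000]:
        x = rng.exponential(1.0, size=(reps, n))          # i.i.d. with mean mu = 1
        dev = x.mean(axis=1) - 1.0                        # X_bar_n - mu = o_p(1)
        scaled = np.sqrt(n) * dev                         # sqrt(n)(X_bar_n - mu) = O_p(1)
        print(n, np.quantile(np.abs(dev), 0.95), np.quantile(np.abs(scaled), 0.95))
    # The 95% quantile of |dev| shrinks with n, while that of |scaled| stabilizes near 1.96.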

READING MATERIALS: Lehmann and Casella, Section 1.8; Ferguson, Part 1, Part 2, and Part 3, 12-15.

PROBLEMS

1. (a) If X_1, X_2, ... are i.i.d N(0, 1), then X_{(n)}/√(2 log n) →p 1, where X_{(n)} is the maximum of X_1, ..., X_n. Hint: use the following inequality: for any δ > 0,

        (δ/√(2π)) y e^{−(1+δ)y²/2} ≤ ∫_y^∞ (1/√(2π)) e^{−x²/2} dx ≤ e^{−(1−δ)y²/2}/√δ.

   (b) If X_1, X_2, ... are i.i.d Uniform(0, 1), derive the limit distribution of n(1 − X_{(n)}).

2. Suppose that U ∼ Uniform(0, 1), α > 0, and

        X_n = (n^α / log(n + 1)) I_{[0, n^{−α}]}(U).

   (a) Show that X_n →a.s. 0 and E[X_n] → 0.

   (b) Can you find a random variable Y with |X_n| ≤ Y for all n and E[Y] < ∞?

   (c) For what values of α does the uniform integrability condition

        limsup_{n→∞} E[|X_n| I(|X_n| ≥ M)] → 0, as M → ∞,

   hold?

3. (a) Show by example that distribution functions having densities can converge in distribution even if the densities do not converge. Hint: consider f_n(x) = 1 + cos 2πnx on [0, 1].

   (b) Show by example that distributions with densities can converge in distribution to a limit that has no density.

   (c) Show by example that discrete distributions can converge in distribution to a limit that has a density.

4. Stirling's formula. Let S_n = X_1 + ... + X_n, where X_1, ..., X_n are independent and each has the Poisson distribution with parameter 1. Calculate or prove successively:

   (a) the expectation of ((S_n − n)/√n)^−, the negative part of (S_n − n)/√n;

   (b) ((S_n − n)/√n)^− →d Z^−, where Z has a standard normal distribution;

   (c) E[((S_n − n)/√n)^−] → E[Z^−];

   (d) use the above results to derive Stirling's formula:

        n! ∼ √(2π) n^{n+1/2} e^{−n}.

5. This problem gives an alternative way of proving the Slutsky theorem. Let X_n →d X and Y_n →p y for some constant y. Assume X_n and Y_n are both measurable functions on the same probability measure space (Ω, A, P). Then (X_n, Y_n)′ can be considered a bivariate random vector in R².

   (a) Show (X_n, Y_n)′ →d (X, y)′. Hint: show that the characteristic function of (X_n, Y_n)′ converges, using the dominated convergence theorem.

   (b) Use the continuous mapping theorem to prove the Slutsky theorem. Hint: first show Z_nX_n →d zX using the function g(x, z) = xz; then show Z_nX_n + Y_n →d zX + y using the function g(x, y) = x + y.

6. Suppose that X_n is a sequence of random variables in a probability measure space. Show that, if E[g(X_n)] → E[g(X)] for all continuous g with bounded support (that is, g(x) is zero when x is outside a bounded interval), then X_n →d X. Hint: verify (c) of the Portmanteau Theorem. Follow the proof of (c) by considering g(x) = 1 − ε/[ε + d(x, G^c ∪ (−M, M)^c)] for any M.

7. Suppose that X_1, ..., X_n are i.i.d with distribution function G(x). Let M_n = max{X_1, ..., X_n}.

   (a) If G(x) = (1 − exp{−αx}) I(x > 0), what is the limit distribution of M_n − α^{−1} log n?

   (b) If

        G(x) = 0 for x ≤ 1,  G(x) = 1 − x^{−α} for x ≥ 1,

   where α > 0, what is the limit distribution of n^{−1/α} M_n?

   (c) If

        G(x) = 0 for x ≤ 0,  G(x) = 1 − (1 − x)^α for 0 ≤ x ≤ 1,  G(x) = 1 for x ≥ 1,

   where α > 0, what is the limit distribution of n^{1/α}(M_n − 1)?

8. (a) Suppose that X_1, X_2, ... are i.i.d in R² with distribution giving probability θ_1 to (1, 0), θ_2 to (0, 1), θ_3 to (0, 0) and θ_4 to (−1, −1), where θ_j ≥ 0 for j = 1, 2, 3, 4 and θ_1 + ... + θ_4 = 1. Find the limiting distribution of √n(X̄_n − E[X_1]) and describe the resulting approximation to the distribution of X̄_n.

   (b) Suppose that X_1, ..., X_n is a sample from the Poisson distribution with parameter λ > 0: P(X_1 = k) = exp{−λ}λ^k/k!, k = 0, 1, .... Let Z_n = [∑_{i=1}^n I(X_i = 1)]/n. What is the joint asymptotic distribution of √n((X̄_n, Z_n)′ − (λ, λe^{−λ})′)? Let p_1(λ) = P(X_1 = 1). What is the asymptotic distribution of p̂_1 = p_1(X̄_n)? What is the joint asymptotic distribution of (Z_n, p̂_1) (after centering and rescaling)?

   (c) If X_n possesses a t-distribution with n degrees of freedom, then X_n →d N(0, 1) as n → ∞. Show this.

9. Suppose that X_n converges in distribution to X. Let φ_n(t) and φ(t) be the characteristic functions of X_n and X respectively. We know that φ_n(t) → φ(t) for each t. The following procedure shows that if sup_n E[|X_n|] < C_0 for some constant C_0, the pointwise convergence of the characteristic functions can be strengthened to convergence uniformly on any bounded interval,

        sup_{|t| < M} |φ_n(t) − φ(t)| → 0

   for any constant M. Verify each of the following steps.

   (a) Show that E[|X_n|] = ∫_0^∞ P(|X_n| ≥ t) dt and E[|X|] = ∫_0^∞ P(|X| ≥ t) dt. Hint: write P(|X_n| ≥ t) = E[I(|X_n| ≥ t)] and apply the Fubini-Tonelli theorem.

   (b) Show that P(|X_n| ≥ t) → P(|X| ≥ t) almost everywhere (with respect to Lebesgue measure). Then apply Fatou's lemma to show that E[|X|] ≤ C_0.

   (c) Show that both φ_n(t) and φ(t) satisfy: for any t_1, t_2,

        |φ_n(t_1) − φ_n(t_2)| ≤ C_0|t_1 − t_2|,   |φ(t_1) − φ(t_2)| ≤ C_0|t_1 − t_2|.

   That is, φ_n and φ are uniformly continuous.

   (d) Show that sup_{t∈[−M,M]} |φ_n(t) − φ(t)| → 0. Hint: first partition [−M, M] into equally spaced points −M = t_0 < t_1 < ... < t_m = M; then for t in one of these intervals, say [t_k, t_{k+1}], use the inequality

        |φ_n(t) − φ(t)| ≤ |φ_n(t) − φ_n(t_k)| + |φ_n(t_k) − φ(t_k)| + |φ(t_k) − φ(t)|.

10. Suppose that X_1, ..., X_n are i.i.d from the uniform distribution on [0, 1]. Derive the asymptotic distribution of Gini's mean difference, defined as C(n, 2)^{−1} ∑∑_{i<j} |X_i − X_j|.

11. Suppose that (X_1, Y_1), ..., (X_n, Y_n) are i.i.d from a bivariate distribution with bounded fourth moments. Derive the limit distribution of U = C(n, 2)^{−1} ∑∑_{i<j} (Y_j − Y_i)(X_j − X_i). Write the expression in terms of the moments of (X_1, Y_1).

12. Let Y_1, Y_2, ... be independent random variables with mean 0 and variance σ². Let X_n = (∑_{k=1}^n Y_k)² − nσ² and show that X_n is a martingale.

13. Suppose that X_1, ..., X_n are independent N(0, 1) random variables, and let Y_i = X_i² for i = 1, ..., n. Thus ∑_{i=1}^n Y_i ∼ χ²_n.

   (a) Show that √n(Ȳ_n − 1) →d N(0, σ²) and find σ².

   (b) Show that for each r > 0, √n(Ȳ_n^r − 1) →d N(0, V(r)²) and find V(r)² as a function of r.

   (c) Show that

        √n( Ȳ_n^{1/3} − (1 − 2/(9n)) ) / √(2/9) →d N(0, 1).

   Does this agree with your result in (b)?

   (d) Make normal probability plots to compare the approximations in (a) and (c) (the transformation in (c) is called the "Wilson-Hilferty" transformation of a χ² random variable).

14. Suppose that X_1, X_2, ... are i.i.d positive random variables, and define X̄_n = ∑_{i=1}^n X_i/n, H_n = 1/[n^{−1} ∑_{i=1}^n (1/X_i)], and G_n = (∏_{i=1}^n X_i)^{1/n} to be the arithmetic, harmonic and geometric means respectively. We know that X̄_n →a.s. E[X_1] = µ if and only if E[|X_i|] is finite.

   (a) Use the strong law of large numbers together with appropriate additional hypotheses to show that H_n →a.s. 1/E[1/X_1] ≡ h and G_n →a.s. exp{E[log X_1]} ≡ g.

   (b) Find the joint limiting distribution of √n(X̄_n − µ, H_n − h, G_n − g). You will need to impose additional moment conditions to prove this; specify these additional assumptions carefully.

   (c) Suppose that X_i ∼ Gamma(r, λ) with r > 0. For what values of r are the hypotheses you imposed in (b) satisfied? Compute the covariance of the limiting distribution in (b) as explicitly as you can in this case.

   (d) Show that √n(G_n/X̄_n − g/µ) →d N(0, V²). Compute V explicitly when X_i ∼ Gamma(r, λ) with r satisfying the conditions you found in (c).

15. Suppose that (N_11, N_12, N_21, N_22) has a multinomial distribution with parameters (n, p), where p = (p_11, p_12, p_21, p_22) and ∑_{i=1}^2 ∑_{j=1}^2 p_ij = 1. Thus the N's can be treated as the counts in a 2×2 table. The log-odds ratio is defined by

        ψ = log( p_12 p_21 / (p_11 p_22) ).

   (a) Suggest an estimator of ψ, say ψ̂_n.

   (b) Show that the estimator you proposed in (a) is asymptotically normal and compute its asymptotic variance. Hint: the vector of N's is the sum of n independent Multinomial(1, p) random vectors Y_i, i = 1, ..., n.

16. Suppose that X_i ∼ Bernoulli(p_i), i = 1, ..., n, are independent. Show that if

        ∑_{i=1}^n p_i(1 − p_i) → ∞,

   then

        √n(X̄_n − p̄_n) / √( n^{−1} ∑_{i=1}^n p_i(1 − p_i) ) →d N(0, 1).

   Give one example of {p_i} for which the above convergence in distribution holds and another example for which it fails.

17. Suppose that X_1, ..., X_n are independent with common mean µ but with variances σ_1², ..., σ_n² respectively.

   (a) Show that X̄_n →p µ if ∑_{i=1}^n σ_i² = o(n²).

   (b) Now suppose that X_i = µ + σ_i ε_i, where ε_1, ..., ε_n are i.i.d with distribution function F, E[ε_1] = 0 and Var(ε_1) = 1. Show that if

        max_{i≤n} σ_i² / ∑_{i=1}^n σ_i² → 0,

   then, with σ̄_n² = n^{−1} ∑_{i=1}^n σ_i²,

        √n(X̄_n − µ)/σ̄_n →d N(0, 1).

   Hence show that if, furthermore, σ̄_n² → σ_0², then √n(X̄_n − µ) →d N(0, σ_0²).

   (c) If σ_i² = A i^r for some constant A, show that max_{i≤n} σ_i²/∑_{i=1}^n σ_i² → 0 but σ̄_n² has no limit. In this case, n^{(1−r)/2}(X̄_n − µ) = O_p(1).

18. Suppose that X_1, ..., X_n are independent with common mean µ but with variances σ_1², ..., σ_n² respectively, as in the previous question. Consider the estimator of µ: T_n = ∑_{i=1}^n ω_{ni} X_i, where ω = (ω_{n1}, ..., ω_{nn}) is a vector of weights with ∑_{i=1}^n ω_{ni} = 1.

   (a) Show that all the estimators T_n have mean µ and that the choice of weights minimizing Var(T_n) is

        ω_{ni}^{opt} = (1/σ_i²) / ∑_{j=1}^n (1/σ_j²),  i = 1, ..., n.

   (b) Compute Var(T_n) when ω = ω^{opt} and show that T_n →p µ if ∑_{i=1}^n (1/σ_i²) → ∞.

   (c) Suppose X_i = µ + σ_i ε_i, where ε_1, ..., ε_n are i.i.d with distribution function F, E[ε_1] = 0 and Var(ε_1) = 1. Show that, with ω chosen as ω^{opt},

        √( ∑_{i=1}^n (1/σ_i²) ) (T_n − µ) →d N(0, 1)

   if max_{i≤n} (1/σ_i²) / ∑_{j=1}^n (1/σ_j²) → 0.

   (d) Compute Var(T_n)/Var(X̄_n) when ω = ω^{opt} in the case σ_i² = A i^r for r = 0.25, 0.5, 0.75 and n = 5, 10, 20, 50, 100, ∞.

19. Ferguson, page 6 and page 7, problems 1-7

20. Ferguson, page 11 and page 12, problems 1-8

21. Ferguson, page 18, problems 1-5


22. Ferguson, page 23, page 24 and page 25, problems 1-8

23. Ferguson, page 34 and page 35, problems 1-10

24. Ferguson, page 42 and page 43, problems 1-6

25. Ferguson, page 49 and page 50, problems 1-6

26. Ferguson, page 54 and page 55, problems 1-4

27. Ferguson, page 60, problems 1-4

28. Ferguson, page 65 and page 66, problems 1-3

29. Read Ferguson, pages 87-92 and do problems 3-6

30. Ferguson, page 100, problems 1-2

31. Lehmann and Casella, page 75, problems 8.2, 8.3

32. Lehmann and Casella, page 76, problems 8.8, 8.10, 8.11, 8.12, 8.14, 8.15, 8.16, 8.17, 8.18

33. Lehmann and Casella, page 77, problems 8.19, 8.20, 8.21, 8.22, 8.23, 8.24, 8.25, 8.26


CHAPTER 4 POINT ESTIMATION AND EFFICIENCY

The objective of science is to draw general conclusions from observed empirical data or phenomena. What differs across scientific areas are the tools implemented and the approaches used to reach those conclusions. Nevertheless, they follow a similar procedure:
(A) a class of mathematical models is proposed to describe the scientific phenomena or processes;
(B) an estimated model is derived using the empirical data;
(C) the obtained model is validated using new observations; if it fails, one goes back to (A).
Usually, in (A), the class of mathematical models is proposed based on either past experience or physical laws. (B) is the step where the various scientific tools come into play, using mathematical methods to determine the model. (C) is the step of model validation. Undoubtedly, each step is important.

In statistical science, (A) corresponds to proposing a class of distribution functions, denoted by P, to describe the probabilistic mechanism generating the data. (B) consists of the statistical methods used to decide which distribution in the class of (A) fits the data best. (C) concerns how one validates or tests the goodness of the distribution obtained in (B). The main goal of this course is (B), the statistical inference step.

A good estimation approach should estimate the model parameters with reasonable accuracy. Such accuracy is characterized by unbiasedness in finite samples or by consistency in large samples. Furthermore, accounting for the randomness in data generation, we also want the estimator to be somewhat robust to the intrinsic random mechanism; this robustness is characterized by the variance of the estimator. Thus an ideal estimator would be unbiased and have the smallest variance in every finite sample. Unfortunately, although such estimators exist for some models, for most models they do not. One compromise is to seek an estimator that is unbiased and has the smallest variance in large samples, i.e., an estimator that is asymptotically unbiased and efficient. Fortunately, such an estimator exists for most models.

In this chapter, we review some commonly used estimation approaches, with particular attention to those that produce unbiased, smallest-variance estimators when such estimators exist. The smallest variance in finite samples is characterized by the Cramer-Rao bound (the finite-sample efficiency bound). This bound also turns out to be the efficiency bound in large samples, where we show that the asymptotic variance of any regular estimator in a regular model cannot be smaller than this bound.

4.1 Introductory Examples

A model P is a collection of probability distributions for the data we observe. Parameters of interest are simply functionals on P, denoted by ν(P) for P ∈ P.

Example 4.1 Suppose X is a non-negative random variable.
Case A. Suppose that X ∼ Exponential(θ), θ > 0; thus pθ(x) = θe^{−θx}I(x ≥ 0). P consists of distribution functions indexed by a finite-dimensional parameter θ, so P is a parametric model. ν(pθ) = θ is the parameter of interest.
Case B. Suppose P consists of the distribution functions with density p_{λ,G}(x) = ∫₀^∞ λ exp{−λx} dG(λ), where λ ∈ R and G is any distribution function. Then P consists of distribution functions indexed by both a real parameter λ and a functional parameter G; P is a semiparametric model. ν(p_{λ,G}) = λ, or G, or both, can be the parameters of interest.
Case C. P consists of all distribution functions on [0, ∞); P is a nonparametric model. ν(P) = ∫ x dP(x), the mean of the distribution, can be the parameter of interest.

Example 4.2 Suppose that X = (Y, Z) is a random vector on R⁺ × R^d.
Case A. Suppose X ∼ Pθ with Y|Z = z ∼ Exponential(λe^{θ′z}), y ≥ 0. This is a parametric model with parameter space Θ = R⁺ × R^d.
Case B. Suppose X ∼ P_{θ,λ} with Y|Z = z having density λ(y)e^{θ′z} exp{−Λ(y)e^{θ′z}}, where Λ(y) = ∫₀^y λ(s)ds. This is a semiparametric model, the Cox proportional hazards model of survival analysis, with parameter space (θ, λ) ∈ R^d × {λ(·) : λ(y) ≥ 0, ∫₀^∞ λ(y)dy = ∞}.
Case C. Suppose X ∼ P on R⁺ × R^d where P is completely arbitrary. This is a nonparametric model.

Example 4.3 Suppose X = (Y, Z) is a random vector in R × R^d.
Case A. Suppose that X = (Y, Z) ∼ Pθ with Y = θ′Z + ε, where θ ∈ R^d and ε ∼ N(0, σ²). This is a parametric model with parameter space (θ, σ) ∈ R^d × R⁺.
Case B. Suppose X = (Y, Z) ∼ Pθ with Y = θ′Z + ε, where θ ∈ R^d and ε ∼ G with density g, independent of Z. This is a semiparametric model with parameters (θ, g).
Case C. Suppose X = (Y, Z) ∼ P, where P is an arbitrary probability distribution on R × R^d. This is a nonparametric model.

For given data there are many reasonable models that can be used. A good model is usually preferred if it is compatible with the underlying mechanism of data generation, makes as few model assumptions as possible, can be presented simply, and permits feasible inference. In other words, a good model should make sense, be flexible and parsimonious, and be easy to work with for inference.

4.2 Methods of Point Estimation: A Review

A number of estimation methods have been proposed for statistical models. However, a method may work well for some models but not for others. In the following sections we list a few of these methods, along with examples.

4.2.1 Least square estimation

Least squares estimation is the most classical estimation method. It estimates the parameters by minimizing the sum of squared distances between the observed quantities and their expected values.

Example 4.4 Suppose n i.i.d. observations (Yi, Zi), i = 1, ..., n, are generated from the distribution in Example 4.3. To estimate θ, one method is to minimize the least squares criterion

Σ_{i=1}^n (Yi − θ′Zi)².

This gives the least squares estimate of θ as

θ̂ = (Σ_{i=1}^n Zi Zi′)⁻¹ (Σ_{i=1}^n Zi Yi).

It can be shown that E[θ̂] = θ. Note that this estimator does not use the distribution of ε, so it applies to all three cases.
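The following is a minimal numerical sketch of the closed-form least squares estimate above, using numpy on simulated data; the data-generating values are illustrative assumptions, not part of the notes.

# Least squares estimate of Example 4.4 on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])
Z = rng.normal(size=(n, d))
Y = Z @ theta_true + rng.normal(size=n)        # mean-zero errors; normality is not needed

# theta_hat = (sum_i Z_i Z_i')^{-1} (sum_i Z_i Y_i)
theta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
print(theta_hat)                               # close to theta_true; unbiased for theta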

4.2.2 Uniformly minimal variance and unbiased estimation

Sometimes one seeks an estimator that is unbiased for the parameters of interest and, furthermore, has the least variation. If such an estimator exists, we call it the uniformly minimal variance and unbiased estimator (UMVUE). (An estimator T is unbiased for the parameter θ if E[T] = θ.) It should be noted that such an estimator may not exist.

The UMVUE often exists for distributions in the exponential family, whose probability density functions have the form

pθ(x) = h(x)c(θ) exp{η1(θ)T1(x) + ... + ηs(θ)Ts(x)},

where θ ∈ R^d and T(x) = (T1(x), ..., Ts(x)) is an s-dimensional statistic. The following definition and propositions describe how one can find a UMVUE for θ starting from an unbiased estimator.

Definition 4.1 T(X) is called a sufficient statistic for X ∼ pθ with respect to θ if the conditional distribution of X given T(X) does not depend on θ. T(X) is a complete statistic with respect to θ if, for any measurable function g, Eθ[g(T(X))] = 0 for all θ implies g = 0, where Eθ denotes the expectation under the density with parameter θ. †

It is easy to check that T(X) is sufficient if and only if pθ(x) can be factorized as gθ(T(x))h(x). Thus, in the exponential family, T(X) = (T1(X), ..., Ts(X)) is sufficient. Additionally, if the exponential family is of full rank (i.e., {(η1(θ), ..., ηs(θ)) : θ ∈ Θ} contains a cube in s-dimensional space), then T(X) is also complete; for the proof, see Theorem 6.22 in Lehmann and Casella (1998).

Proposition 4.1 Suppose θ̂(X) is an unbiased estimator of θ, i.e., E[θ̂(X)] = θ. If T(X) is a sufficient statistic for X, then E[θ̂(X)|T(X)] is unbiased and, moreover,

Var(E[θ̂(X)|T(X)]) ≤ Var(θ̂(X)),

with equality if and only if θ̂(X) = E[θ̂(X)|T(X)] with probability 1. †

Proof E[θ̂(X)|T] does not depend on θ by sufficiency and is clearly unbiased; moreover, by Jensen's inequality,

Var(E[θ̂(X)|T]) = E[(E[θ̂(X)|T])²] − E[θ̂(X)]² ≤ E[θ̂(X)²] − θ² = Var(θ̂(X)).

Equality holds if and only if E[θ̂(X)|T] = θ̂(X) with probability 1. †

Proposition 4.2 If T(X) is complete and sufficient and θ̂(X) is unbiased, then E[θ̂(X)|T(X)] is the unique UMVUE of θ. †

Proof For any unbiased estimator of θ, denoted by T̃(X), Proposition 4.1 shows that E[T̃(X)|T(X)] is unbiased and

Var(E[T̃(X)|T(X)]) ≤ Var(T̃(X)).

Since E[E[T̃(X)|T(X)] − E[θ̂(X)|T(X)]] = 0 and both E[T̃(X)|T(X)] and E[θ̂(X)|T(X)] do not depend on θ, the completeness of T(X) gives

E[T̃(X)|T(X)] = E[θ̂(X)|T(X)].

That is, Var(E[θ̂(X)|T(X)]) ≤ Var(T̃(X)). Thus E[θ̂(X)|T(X)] is the UMVUE. The same argument also shows that such a UMVUE is unique. †

Proposition 4.2 suggests two ways to derive the UMVUE in the presence of a complete sufficient statistic T(X): one is to find an unbiased estimator of θ and then calculate its conditional expectation given T(X); the other is to find directly a function g(T(X)) with E[g(T(X))] = θ. The following example illustrates both methods.

Example 4.5 Suppose X1, ..., Xn are i.i.d. from the uniform distribution U(0, θ) and we wish to obtain a UMVUE of θ/2. From the joint density of X1, ..., Xn, given by

θ⁻ⁿ I(X(n) < θ) I(X(1) > 0),

one can easily show that X(n) is complete and sufficient for θ. Note E[X1] = θ/2. Thus a UMVUE of θ/2 is given by

E[X1|X(n)] = ((n + 1)/n) · X(n)/2.

The other way is to find directly a function g with E[g(X(n))] = θ/2. Noting that

E[g(X(n))] = (1/θⁿ) ∫₀^θ g(x) n x^{n−1} dx = θ/2,

we have

∫₀^θ g(x) x^{n−1} dx = θ^{n+1}/(2n).

Differentiating both sides with respect to θ gives g(θ)θ^{n−1} = (n + 1)θⁿ/(2n), i.e., g(x) = (n + 1)x/(2n). Hence we again obtain that the UMVUE of θ/2 equals (n + 1)X(n)/(2n).

Many more examples of UMVUEs can be found in Chapter 2 of Lehmann and Casella (1998).
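A small simulation sketch comparing the UMVUE of Example 4.5 with the naive unbiased estimator X1; the values of θ, n and the number of replications are illustrative assumptions.

# Both estimators are unbiased for theta/2, but their variances differ dramatically.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 20, 20000
X = rng.uniform(0, theta, size=(reps, n))
umvue = (n + 1) / n * X.max(axis=1) / 2     # (n+1) X_(n) / (2n)
naive = X[:, 0]                             # X_1 is unbiased for theta/2 since E[X_1] = theta/2

print(umvue.mean(), naive.mean())           # both approximately theta/2 = 1.0
print(umvue.var(), naive.var())             # the UMVUE has far smaller variance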

4.2.3 Robust estimation

In some regression problems, one may be concerned about outliers. For example, in simple linear regression an extreme outlier may greatly affect the fitted line. The robust estimation approach is to propose an estimator that is little influenced by extreme observations. Often, for n i.i.d. observations X1, ..., Xn, a robust estimator minimizes an objective function of the form Σ_{i=1}^n φ(Xi; θ).

Example 4.6 In linear regression, a model for (Y, X) is given by

Y = θ′X + ε,

where ε has mean zero. One robust estimator minimizes

Σ_{i=1}^n |Yi − θ′Xi|;

the resulting estimator is called the least absolute deviation estimator. A more general objective function is

Σ_{i=1}^n φ(Yi − θ′Xi),

where φ(x) = |x|^k for |x| ≤ C and φ(x) = C^k for |x| > C.
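A sketch of the least absolute deviation estimator of Example 4.6, minimizing Σ|Yi − θ′Xi| numerically with scipy; the simulated heavy-tailed data and the use of a general-purpose optimizer are assumptions (in practice LAD is often solved by linear programming).

# Compare LAD with ordinary least squares on data containing large errors.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, d = 300, 2
theta_true = np.array([1.0, -1.0])
X = rng.normal(size=(n, d))
Y = X @ theta_true + rng.standard_t(df=2, size=n)    # heavy-tailed errors produce outliers

lad_obj = lambda theta: np.abs(Y - X @ theta).sum()
theta_lad = minimize(lad_obj, x0=np.zeros(d), method="Nelder-Mead").x
theta_ls = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_lad, theta_ls)                           # LAD is typically less affected by the outliers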

4.2.4 Estimating functions

In recent statistical inference, more and more estimators are based on estimating functions; their use is especially widespread in semiparametric models. An estimating function for θ is a measurable function f(X; θ) with E[f(X; θ)] = 0, or approximately zero. An estimator of θ based on n i.i.d. observations can then be constructed by solving the estimating equation

Σ_{i=1}^n f(Xi; θ) = 0.

Estimating functions are useful especially when the model contains other parameters but only θ is the parameter of interest.

Example 4.7 We consider the linear regression example again. For any function W(X), E[XW(X)(Y − θ′X)] = 0. Thus an estimating equation for θ can be constructed as

Σ_{i=1}^n Xi W(Xi)(Yi − θ′Xi) = 0.

Example 4.8 Still in the regression example, suppose now that the median of ε is zero. It is easy to see that E[XW(X)sgn(Y − θ′X)] = 0. Then an estimating equation for θ can be constructed as

Σ_{i=1}^n Xi W(Xi) sgn(Yi − θ′Xi) = 0.
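A minimal sketch of solving the estimating equation of Example 4.7 with a root finder; the particular weight function W and the simulated data are illustrative assumptions.

# Solve sum_i X_i W(X_i)(Y_i - theta'X_i) = 0 for theta.
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(3)
n, d = 500, 2
theta_true = np.array([0.5, 2.0])
X = rng.normal(size=(n, d))
Y = X @ theta_true + rng.laplace(size=n)        # mean-zero errors, distribution left unspecified

W = 1.0 / (1.0 + (X ** 2).sum(axis=1))          # any weight function of X yields a valid equation

def estimating_equation(theta):
    resid = Y - X @ theta
    return X.T @ (W * resid)                    # sum_i X_i W(X_i)(Y_i - theta'X_i)

theta_hat = root(estimating_equation, x0=np.zeros(d)).x
print(theta_hat)                                # close to theta_true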


4.2.5 Maximum likelihood estimation

The most commonly used method, at least for parametric models, is maximum likelihood estimation: if n i.i.d. observations X1, ..., Xn are generated from a distribution with density pθ(x), it is reasonable to believe that the best value of θ is the one maximizing the observed likelihood function, defined as

Ln(θ) = ∏_{i=1}^n pθ(Xi).

The resulting estimator θ̂ is called the maximum likelihood estimator of θ. Maximum likelihood estimators possess many nice properties, which we investigate in detail in the next chapter. Recent developments have also seen maximum likelihood estimation implemented in semiparametric and nonparametric models.

Example 4.9 Suppose X1, ..., Xn are i.i.d. observations from Exponential(θ), with density θe^{−θx} for x ≥ 0. Then the likelihood function of θ equals

Ln(θ) = θⁿ exp{−θ(X1 + ... + Xn)},

and the maximum likelihood estimator of θ is θ̂ = 1/X̄n.
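A quick numerical sketch of Example 4.9, checking the closed-form MLE 1/X̄n against a direct maximization of the log-likelihood; the values of θ and n are illustrative assumptions.

# MLE for the exponential rate: closed form versus numerical optimization.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
theta_true, n = 1.5, 1000
X = rng.exponential(scale=1 / theta_true, size=n)       # density theta * exp(-theta x)

neg_loglik = lambda th: -(n * np.log(th) - th * X.sum())
theta_numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded").x
theta_closed = 1 / X.mean()
print(theta_closed, theta_numeric)                      # both approximately theta_true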

Example 4.10 The setting is Case B of Example 4.2. Suppose (Y1, Z1), ..., (Yn, Zn) are i.i.d. with density λ(y)e^{θ′z} exp{−Λ(y)e^{θ′z}} g(z), where g(z) is the known density of Z. Then the likelihood function of the parameters (θ, λ) is

Ln(θ, λ) = ∏_{i=1}^n λ(Yi) e^{θ′Zi} exp{−Λ(Yi)e^{θ′Zi}} g(Zi).

It turns out that maximum likelihood estimators of (θ, λ) do not exist: the likelihood is unbounded over all hazard functions λ. One way out is to let Λ be a step function with jumps at Y1, ..., Yn and to replace λ(Yi) by the jump size at Yi, denoted pi. The likelihood function then becomes

Ln(θ, p1, ..., pn) = ∏_{i=1}^n pi e^{θ′Zi} exp{−(Σ_{Yj≤Yi} pj) e^{θ′Zi}} g(Zi).

The maximum likelihood estimators of (θ, p1, ..., pn) are given as follows: θ̂ solves the equation

Σ_{i=1}^n [ Zi − (Σ_{Yj≥Yi} Zj e^{θ′Zj}) / (Σ_{Yj≥Yi} e^{θ′Zj}) ] = 0,

and

p̂i = 1 / Σ_{Yj≥Yi} e^{θ̂′Zj}.
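When there is no censoring, the score equation above is the Cox partial-likelihood score. The following is a minimal numerical sketch for a scalar binary covariate with unit baseline hazard and no ties or censoring; the simulation settings are illustrative assumptions, and packaged Cox-model software should be used for real data.

# Solve the profile score equation and recover the jump sizes p_i.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(5)
n, theta_true = 400, 0.8
Z = rng.binomial(1, 0.5, size=n).astype(float)
Y = rng.exponential(scale=np.exp(-theta_true * Z))      # hazard exp(theta * z), baseline hazard 1

def score(theta):
    total = 0.0
    for i in range(n):
        risk = Y >= Y[i]                                # risk set {j : Y_j >= Y_i}
        w = np.exp(theta * Z[risk])
        total += Z[i] - np.sum(w * Z[risk]) / np.sum(w)
    return total

theta_hat = brentq(score, -5.0, 5.0)                    # the score is decreasing in theta
p_hat = np.array([1.0 / np.exp(theta_hat * Z[Y >= Y[i]]).sum() for i in range(n)])
print(theta_hat)                                        # close to theta_true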


4.2.6 Bayesian estimation

In this approach, the parameter θ in the model densities pθ(x) is treated as a random variable with a prior distribution π(θ). The estimator of θ is defined as a value, depending on the data, that minimizes the expected loss (or the maximal loss), where the loss function is denoted l(θ, θ̂(X)). Common loss functions include the quadratic loss (θ − θ̂(X))², the absolute loss |θ − θ̂(X)|, etc. It often turns out that θ̂(X) can be determined from the posterior distribution P(θ|X) = P(X|θ)P(θ)/P(X).

Example 4.11 Suppose X ∼ N(θ, 1), where θ has the improper prior distribution that is uniform on (−∞, ∞). The estimator θ̂(X) minimizing the quadratic loss E[(θ − θ̂(X))²] is the posterior mean E[θ|X] = X.
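A small sketch related to Example 4.11: under a proper N(0, τ²) prior for θ and X ∼ N(θ, 1), the standard normal-normal conjugate formula gives posterior mean Xτ²/(1 + τ²), which tends to X as τ → ∞, recovering the improper-uniform-prior answer. The observed value and the grid of τ² below are illustrative assumptions.

x = 1.7                                    # observed X
for tau2 in [1.0, 10.0, 100.0, 1e6]:
    post_mean = x * tau2 / (1.0 + tau2)    # posterior mean under the N(0, tau^2) prior
    print(tau2, post_mean)                 # approaches E[theta | X] = X = 1.7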

4.2.7 Concluding remarks

We have reviewed a few methods that appear in many statistical problems, but we have not exhausted all estimation approaches. Other methods include conditional likelihood estimation, profile likelihood estimation, partial likelihood estimation, empirical Bayes estimation, minimax estimation, rank estimation, L-estimation, and so on.

With so many estimators available, a natural question is which one is the best choice. The first criterion is that the estimator should be unbiased, or at least consistent for the true parameter; this property is sometimes called first-order efficiency. To obtain precise estimation, we also want the estimator to have as small a variance as possible. This leads to the issue of second-order efficiency, which we discuss in the next section.

4.3 Cramer-Rao Bounds for Parametric Models

4.3.1 Information bound in one-dimensional model

First, we assume the model is a one-dimensional parametric model P = {Pθ : θ ∈ Θ} with Θ ⊂ R. We assume:
A. X ∼ Pθ on (Ω, A) with θ ∈ Θ.
B. pθ = dPθ/dµ exists, where µ is a σ-finite dominating measure.
C. T(X) ≡ T estimates q(θ) and has Eθ[|T(X)|] < ∞; set b(θ) = Eθ[T] − q(θ).
D. q̇(θ) ≡ q′(θ) exists.

Theorem 4.1 (Information bound, or Cramer-Rao inequality) Suppose:
(C1) Θ is an open subset of the real line.
(C2) There exists a set B with µ(B) = 0 such that for x ∈ Bᶜ, ∂pθ(x)/∂θ exists for all θ. Moreover, A = {x : pθ(x) = 0} does not depend on θ.
(C3) I(θ) = Eθ[l̇θ(X)²] > 0, where l̇θ(x) = ∂ log pθ(x)/∂θ. Here I(θ) is called the Fisher information for θ and l̇θ is called the score function for θ.
(C4) ∫ pθ(x)dµ(x) and ∫ T(x)pθ(x)dµ(x) can both be differentiated with respect to θ under the integral sign.
(C5) ∫ pθ(x)dµ(x) can be differentiated twice under the integral sign.

If (C1)-(C4) hold, then

Varθ(T(X)) ≥ {q̇(θ) + ḃ(θ)}² / I(θ),

and the lower bound equals q̇(θ)²/I(θ) if T is unbiased. Equality holds for all θ if and only if, for some function A(θ),

l̇θ(x) = A(θ){T(x) − Eθ[T(X)]}, a.e. µ.

If, in addition, (C5) holds, then

I(θ) = −Eθ[∂² log pθ(X)/∂θ²] = −Eθ[l̈θ(X)].

†

Proof Note that

q(θ) + b(θ) = ∫ T(x)pθ(x)dµ(x) = ∫_{Aᶜ∩Bᶜ} T(x)pθ(x)dµ(x).

Thus, from (C2) and (C4),

q̇(θ) + ḃ(θ) = ∫_{Aᶜ∩Bᶜ} T(x)l̇θ(x)pθ(x)dµ(x) = Eθ[T(X)l̇θ(X)].

On the other hand, since ∫_{Aᶜ∩Bᶜ} pθ(x)dµ(x) = 1,

0 = ∫_{Aᶜ∩Bᶜ} l̇θ(x)pθ(x)dµ(x) = Eθ[l̇θ(X)].

Then q̇(θ) + ḃ(θ) = Cov(T(X), l̇θ(X)), and the Cauchy-Schwarz inequality gives

{q̇(θ) + ḃ(θ)}² ≤ Var(T(X)) Var(l̇θ(X)) = Var(T(X)) I(θ),

which is the stated bound. Equality holds if and only if

l̇θ(X) = A(θ){T(X) − Eθ[T(X)]}, a.s.

Finally, if (C5) holds, we differentiate 0 = ∫ l̇θ(x)pθ(x)dµ(x) once more and obtain

0 = ∫ l̈θ(x)pθ(x)dµ(x) + ∫ l̇θ(x)²pθ(x)dµ(x).

Thus I(θ) = −Eθ[l̈θ(X)]. †

Theorem 4.1 implies that the variance of any unbiased estimator of q(θ) has the lower bound q̇(θ)²/I(θ), which is intrinsic to the parametric model. In particular, if q(θ) = θ, the lower bound for the variance of an unbiased estimator of θ is the inverse of the Fisher information. The following examples calculate this bound for some parametric models.

Example 4.12 Suppose X1, ..., Xn are i.i.d. Poisson(θ). The log-density of (X1, ..., Xn) is

log pθ(X1, ..., Xn) = −nθ + nX̄n log θ − Σ_{i=1}^n log(Xi!).

Thus,

l̇θ(X1, ..., Xn) = (n/θ)(X̄n − θ).

It is direct to check that all the regularity conditions of Theorem 4.1 are satisfied. Then In(θ) = (n²/θ²)Var(X̄n) = n/θ, so the Cramer-Rao bound for θ is θ/n. On the other hand, X̄n is an unbiased estimator of θ; moreover, since X̄n is a complete sufficient statistic for θ, X̄n is the UMVUE of θ. Note Var(X̄n) = θ/n, so X̄n attains the lower bound. However, although Tn = X̄n² − n⁻¹X̄n is unbiased for θ² and is the UMVUE of θ², we find Var(Tn) = 4θ³/n + 2θ²/n², which exceeds the Cramer-Rao lower bound 4θ³/n for θ². In other words, some UMVUEs attain the lower bound but others do not.
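A Monte Carlo sketch of Example 4.12: the variance of X̄n matches the bound θ/n, while the variance of the UMVUE of θ² exceeds its bound 4θ³/n. The values of θ, n and the number of replications are illustrative assumptions.

# Simulate Poisson samples and compare empirical variances with the Cramer-Rao bounds.
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 2.0, 50, 200000
X = rng.poisson(theta, size=(reps, n))
xbar = X.mean(axis=1)
Tn = xbar ** 2 - xbar / n                     # UMVUE of theta^2

print(xbar.var(), theta / n)                  # approximately equal: bound attained
print(Tn.var(), 4 * theta**3 / n)             # Tn's variance exceeds the bound by about 2*theta^2/n^2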

Example 4.13 Suppose X1, ..., Xn are i.i.d. with density pθ(x) = g(x − θ), where g is a known density. This family is the one-dimensional location model. Assume g′ exists and the regularity conditions of Theorem 4.1 are satisfied. Then

In(θ) = n Eθ[ {g′(X − θ)/g(X − θ)}² ] = n ∫ g′(x)²/g(x) dx.

Note the information does not depend on θ.
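A numerical sketch of the location-information integral of Example 4.13 for the standard logistic density, whose known per-observation information is 1/3; the quadrature limits are an assumption adequate for this light-tailed density.

# Compute integral of g'(x)^2 / g(x) for the logistic density.
import numpy as np
from scipy.integrate import quad

g = lambda x: np.exp(-x) / (1 + np.exp(-x)) ** 2               # logistic density
gprime = lambda x: g(x) * (np.exp(-x) - 1) / (np.exp(-x) + 1)  # its derivative

info, _ = quad(lambda x: gprime(x) ** 2 / g(x), -40, 40)
print(info)                                                    # approximately 0.3333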

Example 4.14 Suppose X1, ..., Xn are i.i.d. with density pθ(x) = g(x/θ)/θ, where g is a known density function. This is the one-dimensional scale model with common shape g. A direct calculation gives

In(θ) = (n/θ²) ∫ (1 + y g′(y)/g(y))² g(y) dy.

4.3.2 Information bound in multi-dimensional model

We can extend Theorem 4.1 to the case in which the model is a k-dimensional parametric family P = {Pθ : θ ∈ Θ ⊂ R^k}. Similarly to Assumptions A-D, we assume that Pθ has density pθ with respect to a σ-finite dominating measure µ; that T(X) is an estimator of q(θ) with Eθ[|T(X)|] < ∞ and bias b(θ) = Eθ[T(X)] − q(θ); and that q̇(θ) = ∇θ q(θ) exists.

Theorem 4.2 (Information inequality) Suppose that:
(M1) Θ is an open subset of R^k.
(M2) There exists a set B with µ(B) = 0 such that for x ∈ Bᶜ, ∂pθ(x)/∂θi exists for all θ and i = 1, ..., k. The set A = {x : pθ(x) = 0} does not depend on θ.
(M3) The k × k matrix I(θ) = (Iij(θ)) = Eθ[l̇θ(X)l̇θ(X)′] is positive definite, where

l̇θi(x) = ∂ log pθ(x)/∂θi.

Here I(θ) is called the Fisher information matrix for θ and l̇θ is called the score for θ.
(M4) ∫ pθ(x)dµ(x) and ∫ T(x)pθ(x)dµ(x) can both be differentiated with respect to θ under the integral sign.
(M5) ∫ pθ(x)dµ(x) can be differentiated twice with respect to θ under the integral sign.

If (M1)-(M4) hold, then

Varθ(T(X)) ≥ (q̇(θ) + ḃ(θ))′ I(θ)⁻¹ (q̇(θ) + ḃ(θ)),

and this lower bound equals q̇(θ)′I(θ)⁻¹q̇(θ) if T(X) is unbiased. If, in addition, (M5) holds, then

I(θ) = −Eθ[l̈θ(X)] = −( Eθ[∂² log pθ(X)/∂θi∂θj] ).

†

Proof Under (M1)-(M4), we have

q̇(θ) + ḃ(θ) = ∫ T(x)l̇θ(x)pθ(x)dµ(x) = Eθ[T(X)l̇θ(X)].

On the other hand, from ∫ pθ(x)dµ(x) = 1, 0 = Eθ[l̇θ(X)]. Thus,

|(q̇(θ) + ḃ(θ))′ I(θ)⁻¹ (q̇(θ) + ḃ(θ))|
= |Eθ[T(X)(q̇(θ) + ḃ(θ))′I(θ)⁻¹l̇θ(X)]|
= |Covθ(T(X), (q̇(θ) + ḃ(θ))′I(θ)⁻¹l̇θ(X))|
≤ √( Varθ(T(X)) · (q̇(θ) + ḃ(θ))′I(θ)⁻¹(q̇(θ) + ḃ(θ)) ),

and we obtain the information inequality. If, in addition, (M5) holds, we differentiate ∫ l̇θ(x)pθ(x)dµ(x) = 0 once more and obtain

I(θ) = −Eθ[l̈θ(X)] = −( Eθ[∂² log pθ(X)/∂θi∂θj] ).

†

Example 4.15 The Weibull family P is the parametric model with densities

pθ(x) = (β/α)(x/α)^{β−1} exp{−(x/α)^β} I(x ≥ 0)

with respect to Lebesgue measure, where θ = (α, β) ∈ (0, ∞) × (0, ∞). We can easily calculate that

l̇α(x) = (β/α){(x/α)^β − 1},
l̇β(x) = 1/β − (1/β) log{(x/α)^β} {(x/α)^β − 1}.

Thus the Fisher information matrix is

I(θ) = ( β²/α²          −(1 − γ)/α
         −(1 − γ)/α     {π²/6 + (1 − γ)²}/β² ),

where γ is Euler's constant (γ ≈ 0.5772). The computation of I(θ) is simplified by noting that Y ≡ (X/α)^β ∼ Exponential(1).

4.3.3 Efficient influence function and efficient score function

From the above proof we also note that the lower bound is attained by an unbiased estimator T(X) if and only if T(X) − q(θ) = q̇(θ)′I(θ)⁻¹l̇θ(X); the function q̇(θ)′I(θ)⁻¹l̇θ(X) is called the efficient influence function for estimating q(θ), and its variance, q̇(θ)′I(θ)⁻¹q̇(θ), is called the information bound for q(θ). If we regard q(θ) as a functional on the distributions in P and write ν(Pθ) = q(θ), then in the literature the efficient influence function and the information bound for q(θ) are written as l̃(X, Pθ|ν, P) and I⁻¹(Pθ|ν, P), emphasizing that both are defined for a fixed model P, for a parameter of interest ν(Pθ) = q(θ), and at a fixed distribution Pθ.

Proposition 4.3 The information bound I⁻¹(P|ν, P) and the efficient influence function l̃(·, P|ν, P) are invariant under smooth changes of parameterization. †

Proof Suppose γ ↦ θ(γ) is a one-to-one continuously differentiable mapping of an open subset Γ of R^k onto Θ with nonsingular differential θ̇(γ). The model can be represented as {Pθ(γ) : γ ∈ Γ}. The score for γ is θ̇(γ)′l̇θ(X), so the information matrix for γ equals

I(γ) = θ̇(γ)′ I(θ(γ)) θ̇(γ).

The efficient influence function for ν = q(θ(γ)) computed in the γ-parameterization is

(θ̇(γ)′q̇(θ(γ)))′ I(γ)⁻¹ θ̇(γ)′ l̇θ = q̇(θ(γ))′ I(θ(γ))⁻¹ l̇θ,

which is the same as the efficient influence function computed in the θ-parameterization; consequently the information bounds also agree. †

The proposition implies that the information bound and the efficient influence function for a given ν in a family of distributions do not depend on the parameterization used for the model. With a natural and simple parameterization, however, the calculation of the information bound and the efficient influence function can be carried out directly from the definitions. In particular, consider the parameterization θ′ = (ν′, η′) with ν ∈ N ⊂ R^m and η ∈ H ⊂ R^{k−m}. Here ν can be regarded as the map taking Pθ to the component ν of θ; it is the parameter of interest, while η is a nuisance parameter. We want to assess the cost of not knowing η by comparing the information bounds and the efficient influence functions for ν in the model P (η an unknown parameter) and in Pη (η known and fixed).

In the model P, we decompose

l̇θ = (l̇1, l̇2)′,  l̃θ = (l̃1, l̃2)′,

where l̇1 is the score for ν, l̇2 is the score for η, l̃1 is the efficient influence function for ν and l̃2 is the efficient influence function for η. Correspondingly, we decompose the information matrix I(θ) as

I(θ) = ( I11  I12 ; I21  I22 ),

where I11 = Eθ[l̇1l̇1′], I12 = Eθ[l̇1l̇2′], I21 = Eθ[l̇2l̇1′] and I22 = Eθ[l̇2l̇2′]. Thus,

I(θ)⁻¹ = ( I11·2⁻¹   −I11·2⁻¹I12I22⁻¹ ; −I22·1⁻¹I21I11⁻¹   I22·1⁻¹ ) ≡ ( I^11  I^12 ; I^21  I^22 ),

where

I11·2 = I11 − I12I22⁻¹I21,  I22·1 = I22 − I21I11⁻¹I12.

The information bound for estimating ν is

I⁻¹(Pθ|ν, P) = q̇(θ)′I(θ)⁻¹q̇(θ), where q(θ) = ν and q̇(θ) = (I_{m×m}, 0_{m×(k−m)})′,

so the information bound for ν is given by

I⁻¹(Pθ|ν, P) = I11·2⁻¹ = (I11 − I12I22⁻¹I21)⁻¹.

The efficient influence function for ν is given by

l̃1 = q̇(θ)′I(θ)⁻¹l̇θ = I11·2⁻¹ l*1,

where l*1 = l̇1 − I12I22⁻¹l̇2. It is easy to check that

I11·2 = E[l*1(l*1)′].

Thus l*1 is called the efficient score function for ν in P.

Now consider the model Pη with η known and fixed. Clearly the information bound for ν is just I11⁻¹ and the efficient influence function for ν is I11⁻¹l̇1.

Since I11 ≥ I11·2 = I11 − I12I22⁻¹I21, we conclude that knowing η increases the Fisher information for ν and decreases the information bound for ν. Moreover, knowledge of η does not increase the information about ν if and only if I12 = 0; in that case l̃1 = I11⁻¹l̇1 and l*1 = l̇1.

Example 4.16 Suppose

P = { Pθ : pθ(x) = φ((x − ν)/η)/η, ν ∈ R, η > 0 }.

Note that

l̇ν(x) = (x − ν)/η²,  l̇η(x) = (1/η){(x − ν)²/η² − 1}.

Then the information matrix is

I(θ) = ( η⁻²  0 ; 0  2η⁻² ).

Since I12 = 0, we can estimate ν equally well whether or not we know the variance.

Example 4.17 If we reparameterize the above model as

Pθ = N(ν, η² − ν²), η² > ν²,

a direct calculation shows that I12(θ) = −2νη/(η² − ν²)² ≠ 0. Thus lack of knowledge of η in this parameterization does change the information bound for estimating ν.

We now give a useful geometric way of calculating the efficient score function and the efficient influence function for ν. For any θ, the space L2(Pθ) = {g(X) : Eθ[g(X)²] < ∞} is a Hilbert space with inner product

⟨g1, g2⟩ = E[g1(X)g2(X)].

On this Hilbert space we can define projections. For any closed linear subspace S ⊂ L2(Pθ) and any g ∈ L2(Pθ), the projection of g on S is the element Π(g|S) ∈ S such that g − Π(g|S) is orthogonal to every g* in S, in the sense that

E[{g(X) − Π(g|S)(X)} g*(X)] = 0 for all g* ∈ S.

The orthocomplement of S is the linear space of all g ∈ L2(Pθ) orthogonal to every g* ∈ S. These concepts agree with the usual definitions in Euclidean space. The following theorem describes the calculation of the efficient score function and the efficient influence function.

Theorem 4.3 A. The efficient score function l*1(·, Pθ|ν, P) is the projection of the score function l̇1 on the orthocomplement of [l̇2] in L2(Pθ), where [l̇2] is the linear span of the components of l̇2.
B. The efficient influence function l̃(·, Pθ|ν, Pη) is the projection of the efficient influence function l̃1 on [l̇1] in L2(Pθ). †

Proof A. Suppose the projection of l̇1 on [l̇2] equals Σl̇2 for some matrix Σ. Since E[(l̇1 − Σl̇2)l̇2′] = 0, we obtain Σ = I12I22⁻¹; hence the projection of l̇1 on the orthocomplement of [l̇2] equals l̇1 − I12I22⁻¹l̇2, which is exactly l*1.
B. A little algebra gives

l̃1 = I11·2⁻¹(l̇1 − I12I22⁻¹l̇2) = (I11⁻¹ + I11⁻¹I12I22·1⁻¹I21I11⁻¹)(l̇1 − I12I22⁻¹l̇2) = I11⁻¹l̇1 − I11⁻¹I12l̃2.

Since, by part A, l̃2 is orthogonal to l̇1, the projection of l̃1 on [l̇1] equals I11⁻¹l̇1, which is the efficient influence function l̃(·, Pθ|ν, Pη). †

The following table summarizes the relationships among these quantities.

Term                  Notation          P (η unknown)                                          Pη (η known)
efficient score       l*1(·, P|ν, ·)    l*1 = l̇1 − I12I22⁻¹l̇2                                  l̇1
information           I(P|ν, ·)         E[l*1(l*1)′] = I11 − I12I22⁻¹I21                        I11
efficient influence   l̃1(·, P|ν, ·)     l̃1 = I^11 l̇1 + I^12 l̇2 = I11·2⁻¹ l*1                    I11⁻¹ l̇1
                                        = I11⁻¹ l̇1 − I11⁻¹ I12 l̃2
information bound     I⁻¹(P|ν, ·)       I^11 = I11·2⁻¹ = I11⁻¹ + I11⁻¹I12I22·1⁻¹I21I11⁻¹        I11⁻¹

4.4 Asymptotic Efficiency Bound

4.4.1 Regularity conditions and asymptotic efficiency theorems

The Cramer-Rao bound can be considered as the lower bound for any unbiased estimator in finite samples. One may ask whether such a bound still holds in large samples. To be specific, suppose X1, ..., Xn are i.i.d. Pθ (θ ∈ R) and an estimator Tn of θ satisfies

√n(Tn − θ) →d N(0, V(θ)²).

The question is whether V(θ)² ≥ 1/I(θ). Unfortunately, this need not be true, as the following counterexample due to Hodges shows.

Example 4.18 Let X1, ..., Xn be i.i.d. N(θ, 1), so that I(θ) = 1. Let |a| < 1 and define

Tn = X̄n if |X̄n| > n^{−1/4},  Tn = aX̄n if |X̄n| ≤ n^{−1/4}.

Then

√n(Tn − θ) = √n(X̄n − θ)I(|X̄n| > n^{−1/4}) + √n(aX̄n − θ)I(|X̄n| ≤ n^{−1/4})
=d Z I(|Z + √nθ| > n^{1/4}) + {aZ + √n(a − 1)θ} I(|Z + √nθ| ≤ n^{1/4})
→a.s. Z I(θ ≠ 0) + aZ I(θ = 0).

Thus the asymptotic variance of √n(Tn − θ) equals 1 for θ ≠ 0 and a² for θ = 0. The latter is smaller than the Cramer-Rao bound. In other words, Tn is a superefficient estimator.
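A simulation sketch of Example 4.18: at θ = 0 the normalized risk of the Hodges estimator is about a² < 1 = 1/I(θ), while at θ ≠ 0 it behaves like X̄n. The values of a, n, θ and the use of the exact distribution X̄n ∼ N(θ, 1/n) are illustrative assumptions.

# Hodges superefficient estimator: normalized mean squared error at theta = 0 and theta = 0.5.
import numpy as np

rng = np.random.default_rng(8)
a, n, reps = 0.2, 10000, 50000
for theta in [0.0, 0.5]:
    xbar = rng.normal(theta, 1 / np.sqrt(n), size=reps)        # exact law of the sample mean
    Tn = np.where(np.abs(xbar) > n ** (-0.25), xbar, a * xbar)
    print(theta, n * ((Tn - theta) ** 2).mean())               # ~ a^2 = 0.04 at theta = 0, ~ 1 otherwise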

To rule out Hodges-type superefficient estimators, we need to impose some conditions on Tn in addition to the weak convergence of √n(Tn − θ). One such condition is local regularity, in the following sense.

Definition 4.2 Tn is a locally regular estimator of θ at θ = θ0 if, for every sequence {θn} ⊂ Θ with √n(θn − θ0) → t ∈ R^k, under Pθn,

(local regularity)  √n(Tn − θn) →d Z, as n → ∞,

where the distribution of Z depends on θ0 but not on t. Thus the limit distribution of √n(Tn − θn) does not depend on the direction of approach t of θn to θ0. Tn is locally Gaussian regular if Z has a normal distribution. †

In the above definition, √n(Tn − θn) →d Z under Pθn means that for any bounded and continuous function g, Eθn[g(√n(Tn − θn))] → E[g(Z)]. One can think of a locally regular estimator as one whose limit distribution is locally stable: if the data are generated under a model not far from the given model, the limit distribution of the centered estimator remains the same.

Furthermore, local regularity, combined with the following two additional conditions, yields the result that the Cramer-Rao bound is also an asymptotic lower bound:

(C1) (Hellinger differentiability) A parametric model P = {Pθ : θ ∈ R^k} dominated by a σ-finite measure µ is called Hellinger differentiable if

‖ √p_{θ+h} − √pθ − (1/2)h′l̇θ√pθ ‖_{L2(µ)} = o(|h|),

where pθ = dPθ/dµ.

(C2) (Local asymptotic normality (LAN)) In a model P = {Pθ : θ ∈ R^k} dominated by a σ-finite measure µ, let pθ = dPθ/dµ, l(x; θ) = log pθ(x), and let

ln(θ) = Σ_{i=1}^n l(Xi; θ)

be the log-likelihood function of X1, ..., Xn. The local asymptotic normality condition at θ0 is that, under Pθ0,

ln(θ0 + n^{−1/2}t) − ln(θ0) →d N(−(1/2)t′I(θ0)t, t′I(θ0)t).

Both conditions (C1) and (C2) are smoothness conditions imposed on the parametric model. In other words, we do not allow models whose parameterization is irregular; such models are seldom encountered in practice.

The following theorem gives the main result.

Theorem 4.4 (Hajek's convolution theorem) Assume conditions (C1)-(C2) hold with I(θ0) nonsingular. For any locally regular estimator Tn of θ, the limit distribution Z of √n(Tn − θ0) under Pθ0 satisfies

Z =d Z0 + ∆0,

where Z0 ∼ N(0, I(θ0)⁻¹) is independent of ∆0. †

As a corollary, if V(θ0)² is the asymptotic variance of √n(Tn − θ0), then V(θ0)² ≥ I(θ0)⁻¹. Thus the Cramer-Rao bound is a lower bound for the asymptotic variance of any locally regular estimator. Furthermore, we obtain the following corollaries from Theorem 4.4.

Corollary 4.1 Suppose that Tn is a locally regular estimator of θ at θ0 and that U: R^k → R⁺ is a bowl-shaped loss function, i.e., U(x) = U(−x) and {x : U(x) ≤ c} is convex for every c ≥ 0. Then

lim inf_n Eθ0[U(√n(Tn − θ0))] ≥ E[U(Z0)],

where Z0 ∼ N(0, I(θ0)⁻¹). †

Corollary 4.2 (Hajek-Le Cam asymptotic minimax theorem) Suppose that (C2) holds, that Tn is any estimator of θ, and that U is bowl-shaped. Then

lim_{δ→0} lim inf_n sup_{θ: √n|θ−θ0|≤δ} Eθ[U(√n(Tn − θ))] ≥ E[U(Z0)],

where Z0 ∼ N(0, I(θ0)⁻¹). †

In summary, the two corollaries state that the asymptotic loss of any regular estimator is at least the loss attained by the distribution Z0. From this point of view, Z0 is the most efficient limit distribution. The proofs of the two corollaries are beyond the scope of these notes and are skipped.

4.4.2 Le Cam's lemmas

Before proving Theorem 4.4, we introduce the notion of contiguity and Le Cam's lemmas. Consider a sequence of measure spaces (Ωn, An, µn) and, on each measure space, two probability measures Pn and Qn with Pn ≪ µn and Qn ≪ µn. Let pn = dPn/dµn and qn = dQn/dµn be the corresponding densities of Pn and Qn. We define the likelihood ratios

Ln = qn/pn if pn > 0;  Ln = 1 if qn = pn = 0;  Ln = n if qn > 0 = pn.

Definition 4.3 (Contiguity) The sequence {Qn} is contiguous to {Pn} if, for every sequence Bn ∈ An with Pn(Bn) → 0, it follows that Qn(Bn) → 0. †

Thus contiguity of {Qn} to {Pn} means that Qn is "asymptotically absolutely continuous" with respect to Pn; we write Qn ◁ Pn. Two sequences are mutually contiguous if Qn ◁ Pn and Pn ◁ Qn, and we write Pn ◁▷ Qn.

Definition 4.4 (Asymptotic orthogonality) The sequence {Qn} is asymptotically orthogonal to {Pn} if there exists a sequence Bn ∈ An such that Qn(Bn) → 1 and Pn(Bn) → 0. †

Proposition 4.4 (Le Cam's first lemma) Suppose that under Pn, Ln →d L with E[L] = 1. Then Qn ◁ Pn. Conversely, if Qn ◁ Pn and, under Pn, Ln →d L, then E[L] = 1. †

Proof We first prove the first half of the lemma. Let Bn ∈ An with Pn(Bn) → 0. Then I_{Ωn−Bn} converges to 1 in probability under Pn. Since {Ln} is asymptotically tight, (Ln, I_{Ωn−Bn}) is asymptotically tight under Pn. Thus, by Prohorov's theorem, for every subsequence of {n} there exists a further subsequence along which (Ln, I_{Ωn−Bn}) →d (L, 1). By the portmanteau lemma, since (v, t) ↦ vt is continuous and nonnegative,

lim inf_n Qn(Ωn − Bn) ≥ lim inf_n ∫ I_{Ωn−Bn} (dQn/dPn) dPn ≥ E[L] = 1.

We obtain Qn(Bn) → 0. Thus Qn ◁ Pn.

We then prove the second half of the lemma. The probability measures Rn = (Pn + Qn)/2 dominate both Pn and Qn. Note that dPn/dQn, Ln and Wn = dPn/dRn are tight with respect to Qn, Pn and Rn, respectively. By Prohorov's theorem, for any subsequence there exists a further subsequence such that

dPn/dQn →d U under Qn,  Ln = dQn/dPn →d L under Pn,  Wn = dPn/dRn →d W under Rn,

for certain random variables U, L and W. Since E_{Rn}[Wn] = 1 and 0 ≤ Wn ≤ 2, we obtain E[W] = 1. For a given bounded, continuous function f, define g(ω) = f(ω/(2 − ω))(2 − ω) for 0 ≤ ω < 2 and g(2) = 0. Then g is continuous. Thus,

E_{Qn}[f(dPn/dQn)] = E_{Rn}[f(dPn/dQn)(dQn/dRn)] = E_{Rn}[g(Wn)] → E[f(W/(2 − W))(2 − W)].

Since E_{Qn}[f(dPn/dQn)] → E[f(U)], we have

E[f(U)] = E[f(W/(2 − W))(2 − W)].

Choose fm in the above expression such that fm ≤ 1 and fm decreases to I_{{0}}. By the dominated convergence theorem,

P(U = 0) = E[I_{{0}}(W/(2 − W))(2 − W)] = 2P(W = 0).

However, since

Pn({dPn/dQn ≤ εn} ∩ {qn > 0}) ≤ ∫_{{dPn/dQn ≤ εn}} (dPn/dQn) dQn ≤ εn → 0

and Qn ◁ Pn,

P(U = 0) = lim_n P(U ≤ εn) ≤ lim inf_n Qn(dPn/dQn ≤ εn) = lim inf_n Qn({dPn/dQn ≤ εn} ∩ {qn > 0}) = 0.

That is, P(W = 0) = 0. Similarly to the above derivation, we obtain

E[f(L)] = E[f((2 − W)/W) W].

Choose fm in this expression such that fm(x) increases to x. By the monotone convergence theorem,

E[L] = E[(2 − W)I(W > 0)] = 2P(W > 0) − 1 = 1.

†

As a corollary, we have

Corollary 4.3 If log Ln →d N(−σ²/2, σ²) under Pn, then Qn ◁ Pn. †

Proof Under Pn, Ln →d exp{−σ²/2 + σZ}, where the limit has mean 1. The result thus follows from Proposition 4.4. †

Proposition 4.5 (Le Cam's third lemma) Let Pn and Qn be sequences of probability measures on measurable spaces (Ωn, An), and let Xn: Ωn → R^k be a sequence of random vectors. Suppose that Qn ◁ Pn and that, under Pn,

(Xn, Ln) →d (X, L).

Then G(B) = E[I_B(X)L] defines a probability measure and, under Qn, Xn →d G. †

Proof Because L ≥ 0, for countably many disjoint sets B1, B2, ..., the monotone convergence theorem gives

G(∪i Bi) = E[ lim_n (I_{B1} + ... + I_{Bn})(X) L ] = lim_n Σ_{i=1}^n E[I_{Bi}(X)L] = Σ_{i=1}^∞ G(Bi).

From Proposition 4.4, E[L] = 1, so G(R^k) = 1 and G is a probability measure. Moreover, for any measurable simple function f it is easy to see that ∫ f dG = E[f(X)L]; hence this equality holds for any measurable function f. In particular, for continuous and nonnegative f, the map (x, v) ↦ f(x)v is continuous and nonnegative. Thus,

lim inf E_{Qn}[f(Xn)] ≥ lim inf ∫ f(Xn) (dQn/dPn) dPn ≥ E[f(X)L].

Thus, under Qn, Xn →d G. †

Remark 4.1 In fact, the name Le Cam's third lemma is often reserved for the following result. If, under Pn,

(Xn, log Ln) →d N_{k+1}( (µ, −σ²/2), ( Σ  τ ; τ′  σ² ) ),

then under Qn, Xn →d N_k(µ + τ, Σ). This result follows from Proposition 4.5 by noticing that the characteristic function of the limit distribution G equals E[e^{it′X} e^Y], where (X, Y) has the joint normal distribution displayed above. This characteristic function equals exp{it′(µ + τ) − t′Σt/2}, which is the characteristic function of N_k(µ + τ, Σ).


4.4.3 Proof of the convolution theorem

Equipped with Le Cam's lemmas, we now prove the convolution result in Theorem 4.4.

Proof of Theorem 4.4 We divide the proof into the following steps.

Step I. We first prove that the Hellinger differentiability condition (C1) implies that Pθ0[l̇θ0] = 0, that the Fisher information I(θ0) = Eθ0[l̇θ0 l̇θ0′] exists, and, moreover, that for every convergent sequence hn → h, as n → ∞,

log ∏_{i=1}^n [ p_{θ0+hn/√n}(Xi) / p_{θ0}(Xi) ] = (1/√n) Σ_{i=1}^n h′l̇θ0(Xi) − (1/2)h′I(θ0)h + rn,

where rn →p 0. To see this, abbreviate p_{θ0+h/√n}, pθ0 and h′l̇θ0 as pn, p and g. Since √n(√pn − √p) converges in L2(µ) to g√p/2, √pn converges to √p in L2(µ). Then

E[g] = ∫ (1/2)g√p · 2√p dµ = lim_{n→∞} ∫ √n(√pn − √p)(√pn + √p) dµ = 0.

Thus Eθ0[l̇θ0] = 0. Let Wni = 2(√(pn(Xi)/p(Xi)) − 1). We have

Var( Σ_{i=1}^n Wni − (1/√n) Σ_{i=1}^n g(Xi) ) ≤ E[(√n Wn1 − g(X1))²] → 0,

E[ Σ_{i=1}^n Wni ] = 2n( ∫ √pn √p dµ − 1 ) = −n ∫ [√pn − √p]² dµ → −(1/4)E[g²].

Here E[g²] = h′I(θ0)h. By Chebyshev's inequality, we obtain

Σ_{i=1}^n Wni = (1/√n) Σ_{i=1}^n g(Xi) − (1/4)E[g²] + an,

where an →p 0. Next, by a Taylor expansion,

log ∏_{i=1}^n (pn/p)(Xi) = 2 Σ_{i=1}^n log(1 + Wni/2) = Σ_{i=1}^n Wni − (1/4) Σ_{i=1}^n W²ni + (1/2) Σ_{i=1}^n W²ni R(Wni),

where R(x) → 0 as x → 0. Since E[(√n Wni − g(Xi))²] → 0, nW²ni = g(Xi)² + Ani with E[|Ani|] → 0. Then Σ_{i=1}^n W²ni →p E[g²]. Moreover,

nP(|Wni| > ε√2) ≤ nP(g(Xi)² > nε²) + nP(|Ani| > nε²) ≤ ε⁻²E[g²I(g² > nε²)] + ε⁻²E[|Ani|] → 0.

The left-hand side is an upper bound for P(max_{1≤i≤n}|Wni| > ε√2). Thus max_{1≤i≤n}|Wni| converges to zero in probability, and so does max_{1≤i≤n}|R(Wni)|. Therefore,

log ∏_{i=1}^n (pn/p)(Xi) = Σ_{i=1}^n Wni − (1/4)E[g²] + bn,

where bn →p 0. Combining all of these results, we obtain

log ∏_{i=1}^n [ p_{θ0+hn/√n}(Xi) / p_{θ0}(Xi) ] = (1/√n) Σ_{i=1}^n h′l̇θ0(Xi) − (1/2)h′I(θ0)h + rn,

where rn → 0 in probability under Pθ0.

Step II. Let Qn be the probability measure with density ∏_{i=1}^n p_{θ0+h/√n}(xi) and Pn the probability measure with density ∏_{i=1}^n p_{θ0}(xi). Define

Sn = √n(Tn − θ0),  ∆n = (1/√n) Σ_{i=1}^n l̇θ0(Xi).

By assumption, Sn converges weakly to some distribution, and so does ∆n under Pn; thus (Sn, ∆n) is tight under Pn. By Prohorov's theorem, for any subsequence there exists a further subsequence along which (Sn, ∆n) →d (S, ∆) under Pn. From Step I we immediately obtain that, under Pn,

(Sn, log dQn/dPn) →d (S, h′∆ − (1/2)h′I(θ0)h).

Since, under Pn, log(dQn/dPn) converges weakly to N(−h′I(θ0)h/2, h′I(θ0)h), Corollary 4.3 gives Qn ◁ Pn. Then, from Le Cam's third lemma, under Qn, Sn = √n(Tn − θ0) converges in distribution to a distribution Gh. By local regularity, Gh is the distribution of Z + h.

Step III. We show that Z = Z0 + ∆0, where Z0 ∼ N(0, I(θ0)⁻¹) is independent of ∆0. From Step II we have

E_{θ0+h/√n}[exp{it′Sn}] → exp{it′h} E[exp{it′Z}].

On the other hand,

E_{θ0+h/√n}[exp{it′Sn}] = Eθ0[exp{it′Sn + log dQn/dPn}] + o(1) → Eθ0[exp{it′Z + h′∆ − (1/2)h′I(θ0)h}].

We therefore have

Eθ0[exp{it′Z + h′∆ − (1/2)h′I(θ0)h}] = exp{it′h} Eθ0[exp{it′Z}],

and this identity also holds for complex t and h. Taking h = −iI(θ0)⁻¹(t − s), we obtain

Eθ0[exp{it′(Z − I(θ0)⁻¹∆) + is′I(θ0)⁻¹∆}] = Eθ0[exp{it′Z + (1/2)t′I(θ0)⁻¹t}] exp{−(1/2)s′I(θ0)⁻¹s}.

This implies that ∆0 ≡ Z − I(θ0)⁻¹∆ is independent of Z0 ≡ I(θ0)⁻¹∆ and that Z0 has characteristic function exp{−(1/2)s′I(θ0)⁻¹s}, meaning Z0 ∼ N(0, I(θ0)⁻¹). Then Z = Z0 + ∆0. †

The convolution theorem indicates that if Tn is locally regular and the model P is Hellinger differentiable and LAN, then the Cramer-Rao bound is also the asymptotic lower bound. We have shown the result for estimating θ. In fact, the same argument applies to estimating q(θ), where q is differentiable at θ0. The local regularity condition is then that, under P_{θ0+h/√n},

√n(Tn − q(θ0 + h/√n)) →d Z,

where Z does not depend on h. The conclusion of Theorem 4.4 then becomes Z = Z0 + ∆0, where Z0 ∼ N(0, q̇(θ0)′I(θ0)⁻¹q̇(θ0)) is independent of ∆0.


4.4.4 Sufficient conditions for Hellinger differentiability and local regularity

Checking the local regularity and Hellinger differentiability conditions directly may not be easy in practice. The following propositions give some sufficient conditions for Hellinger differentiability and local regularity.

Proposition 4.6 For every θ in an open subset of R^k, let pθ be a µ-probability density. Assume that the map θ ↦ sθ(x) = √(pθ(x)) is continuously differentiable for every x. If the elements of the matrix I(θ) = E[(ṗθ/pθ)(ṗθ/pθ)′] are well defined and continuous in θ, then the map θ ↦ √pθ is Hellinger differentiable with l̇θ given by ṗθ/pθ. †

Proof The map θ ↦ pθ = s²θ is differentiable, with ṗθ = 2sθṡθ; hence ṡθ = 0 whenever pθ = 0, and we can write ṡθ = (ṗθ/pθ)√pθ/2. On the other hand,

∫ [ (s_{θ+t ht} − sθ)/t ]² dµ = ∫ [ ∫₀¹ ht′ ṡ_{θ+ut ht} du ]² dµ ≤ ∫ ∫₀¹ (ht′ ṡ_{θ+ut ht})² du dµ = (1/4) ∫₀¹ ht′ I(θ + ut ht) ht du.

As ht → h, the right-hand side converges to ∫ (h′ṡθ)² dµ by the continuity of I(θ). Since

(s_{θ+t ht} − sθ)/t − h′ṡθ

converges to zero pointwise, the same proof as for Theorem 3.1(E) of Chapter 3 gives

∫ [ (s_{θ+t ht} − sθ)/t − h′ṡθ ]² dµ → 0.

†

Proposition 4.7 If Tn is an estimator sequence for q(θ) such that

√n(Tn − q(θ)) − (1/√n) Σ_{i=1}^n q̇θ I(θ)⁻¹ l̇θ(Xi) →p 0,

where q is differentiable at θ, then Tn is an efficient and regular estimator of q(θ). †

Proof Let ∆n,θ = n^{−1/2} Σ_{i=1}^n l̇θ(Xi). Then ∆n,θ converges in distribution to a vector ∆θ ∼ N(0, I(θ)). From Step I of the proof of Theorem 4.4, log dQn/dPn is asymptotically equivalent to h′∆n,θ − h′I(θ)h/2. Thus Slutsky's theorem gives that, under Pθ,

( √n(Tn − q(θ)), log dQn/dPn ) →d ( q̇θ I(θ)⁻¹ ∆θ, h′∆θ − h′I(θ)h/2 )
∼ N( (0, −h′I(θ)h/2), ( q̇θI(θ)⁻¹q̇θ′   q̇θh ; h′q̇θ′   h′I(θ)h ) ).

Then, from Le Cam's third lemma, under P_{θ+h/√n}, √n(Tn − q(θ)) converges in distribution to a normal distribution with mean q̇θh and covariance matrix q̇θI(θ)⁻¹q̇θ′. Hence, under P_{θ+h/√n}, √n(Tn − q(θ + h/√n)) converges in distribution to N(0, q̇θI(θ)⁻¹q̇θ′). We conclude that Tn is regular; since its asymptotic variance equals the information bound, it is also efficient. †

Definition 4.5 If a sequence of estimators Tn has the expansion

√n(Tn − q(θ)) = n^{−1/2} Σ_{i=1}^n Γ(Xi) + rn,

where rn converges to zero in probability, then Tn is called an asymptotically linear estimator of q(θ) with influence function Γ. Note that Γ depends on θ. †

For asymptotically linear estimators, the following result holds.

Proposition 4.8 Suppose Tn is an asymptotically linear estimator of ν = q(θ) with influence function Γ. Then:
A. Tn is Gaussian regular at θ0 if and only if q(θ) is differentiable at θ0 with derivative q̇θ and, with l̃ν = l̃(·, Pθ0|q(θ), P) the efficient influence function for q(θ), Eθ0[(Γ − l̃ν)l̇] = 0 for every score l̇ of P.
B. Suppose q(θ) is differentiable and Tn is regular. Then Γ ∈ [l̇] if and only if Γ = l̃ν. †

Proof A. By the asymptotic linearity of Tn, it follows that

( √n(Tn − q(θ0)), ln(θ0 + tn/√n) − ln(θ0) ) →d N( (0, −t′I(θ0)t/2), ( Eθ0[ΓΓ′]   Eθ0[Γl̇′]t ; t′Eθ0[l̇Γ′]   t′I(θ0)t ) ).

From Le Cam's third lemma, we obtain that under P_{θ0+tn/√n},

√n(Tn − q(θ0)) →d N(Eθ0[Γl̇′]t, Eθ0[ΓΓ′]).

If Tn is regular, then under P_{θ0+tn/√n},

√n(Tn − q(θ0 + tn/√n)) →d N(0, Eθ0[ΓΓ′]).

Comparing with the above convergence, we obtain

√n(q(θ0 + tn/√n) − q(θ0)) → Eθ0[Γl̇′]t.

This implies that q is differentiable with q̇θ = Eθ0[Γl̇′]. Since Eθ0[l̃ν l̇′] = q̇θ, the direction "⇒" holds, i.e., Eθ0[(Γ − l̃ν)l̇′] = 0.

To prove the other direction, note that Eθ0[(Γ − l̃ν)l̇′] = 0 implies Eθ0[Γl̇′] = q̇θ. Since q(θ) is differentiable and, under P_{θ0+tn/√n},

√n(Tn − q(θ0)) →d N(Eθ0[Γl̇′]t, E[ΓΓ′])

by Le Cam's third lemma, we obtain that under P_{θ0+tn/√n},

√n(Tn − q(θ0 + tn/√n)) →d N(0, E[ΓΓ′]).

Thus Tn is Gaussian regular.

B. If Tn is regular, then from A we obtain that Γ − l̃ν is orthogonal to every score in P. Since l̃ν ∈ [l̇], Γ ∈ [l̇] implies that Γ − l̃ν lies in [l̇] and is orthogonal to [l̇], hence Γ = l̃ν. The converse is obvious. †

Remark 4.2 We have discussed the efficiency bound for Euclidean parameters. In fact, these results can be generalized (though nontrivially) to the situation where θ contains an infinite-dimensional component, as in semiparametric models. This generalization includes the semiparametric efficiency bound, efficient score function, efficient influence function, locally regular estimators, Hellinger differentiability, LAN and the Hajek convolution theorem.

READING MATERIALS: You should read Lehmann and Casella, Sections 1.6, 2.1, 2.2, 2.3, 2.5, 2.6, 6.1, 6.2, and Ferguson, Chapters 19 and 20.

PROBLEMS

1. Let X1, ..., Xn be i.i.d. according to Poisson(λ). Find the UMVU estimator of λ^k for any positive integer k.

2. Let Xi, i = 1, ..., n, be independently distributed as N(α + βti, σ²), where α, β and σ² are unknown and the t's are known constants that are not all equal. Find the least squares estimators of α and β and show that they are also the UMVU estimators of α and β.

3. If X has the Poisson(θ) distribution, show that 1/θ does not have an unbiased estimator.

4. Suppose that we want to model the survival of twins with a common genetic defect, with one of the two twins receiving some treatment. Let X represent the survival time of the untreated twin and Y the survival time of the treated twin. One (overly simple) preliminary model is to assume that X and Y are independent with Exponential(η) and Exponential(θη) distributions, respectively:

f_{θ,η}(x, y) = ηe^{−ηx} θηe^{−θηy} I(x > 0, y > 0).

(a) One crude approach to estimation in this problem is to reduce the data to W = X/Y. Find the distribution of W and compute the Cramer-Rao lower bound for unbiased estimators of θ based on W.

(b) Find the information bound for estimating θ based on observation of (X, Y) pairs when η is known and when η is unknown.

(c) Compare the bounds you computed in (a) and (b) and discuss the pros and cons of reducing to estimation based on W.


5. This is a continuation of the preceding problem. A more realistic model assumes that the common parameter η for the two twins varies across sets of twins. There are several ways of modeling this. One approach supposes that each observed pair of twins (Xi, Yi) has its own fixed parameter ηi, i = 1, ..., n. In this model we observe (Xi, Yi) with density f_{θ,ηi} for i = 1, ..., n; i.e.,

f_{θ,ηi}(xi, yi) = ηi e^{−ηi xi} θηi e^{−θηi yi} I(xi > 0, yi > 0).

This is sometimes called a functional model (or a model with incidental nuisance parameters). Another approach is to assume that η ≡ Z has a distribution and that the observations come from the resulting mixture distribution. Assuming (for simplicity) that Z = η ∼ Gamma(a, 1/b) (a and b known) with density

g_{a,b}(η) = b^a η^{a−1} exp{−bη} I(η > 0) / Γ(a),

it follows that the (marginal) distribution of (X, Y) is

p_{θ,a,b}(x, y) = ∫₀^∞ f_{θ,z}(x, y) g_{a,b}(z) dz.

This is sometimes called a "structural model" (or mixture model).

(a) Find the information bound for θ in the functional model based on (Xi, Yi), i = 1, ..., n.

(b) Find the information bound for θ in the structural model based on (Xi, Yi), i = 1, ..., n.

(c) Compare the information bounds computed in (a) and (b). When is the information for θ in the functional model larger than the information for θ in the structural model?

6. Suppose that X ∼ Gamma(α, 1/β); i.e., X has density pθ given by

pθ(x) = β^α x^{α−1} exp{−βx} I(x > 0) / Γ(α),  θ = (α, β) ∈ (0, ∞) × (0, ∞).

Consider estimation of q(θ) = Eθ[X].

(a) Compute the Fisher information matrix I(θ).

(b) Derive the efficient score function, the efficient influence function and the information bound for α.

(c) Compute q(θ) and find the efficient influence function for estimation of q(θ). Compare the efficient influence function you find in (c) with the influence function of the natural estimator X̄n.

7. Compute the score for location, −(f′/f)(x), and the Fisher information when:

(a) f(x) = φ(x) = (2π)^{−1/2} exp{−x²/2} (normal or Gaussian);

(b) f(x) = e^{−x}/(1 + e^{−x})² (logistic);

(c) f(x) = e^{−|x|}/2 (double exponential);

(d) f(x) = t_k, the t-density with k degrees of freedom;

(e) f(x) = exp{−x} exp{−exp(−x)} (Gumbel or extreme value).

8. Suppose that P = {Pθ : θ ∈ Θ}, Θ ⊂ R^k, is a parametric model satisfying the hypotheses of the multiparameter Cramer-Rao inequality. Partition θ as θ = (ν, η), where ν ∈ R^m, η ∈ R^{k−m} and 1 ≤ m < k. Let l̇ = l̇θ = (l̇1, l̇2) be the corresponding partition of the scores and, with l̃ = I⁻¹(θ)l̇ the efficient influence function for θ, let l̃ = (l̃1, l̃2) be the corresponding partition of l̃. In both cases l̇1, l̃1 are m-vectors of functions and l̇2, l̃2 are (k − m)-vectors. Partition I(θ) and I⁻¹(θ) correspondingly as

I(θ) = ( I11  I12 ; I21  I22 ),

where I11 is m × m, I12 is m × (k − m), I21 is (k − m) × m and I22 is (k − m) × (k − m); also write

I⁻¹(θ) = [I^{ij}]_{i,j=1,2}.

Verify that:

(a) I^{11} = I11·2⁻¹ where I11·2 = I11 − I12I22⁻¹I21; I^{22} = I22·1⁻¹ where I22·1 = I22 − I21I11⁻¹I12; I^{12} = −I11·2⁻¹I12I22⁻¹; and I^{21} = −I22·1⁻¹I21I11⁻¹.

(b) l̃1 = I^{11}l̇1 + I^{12}l̇2 = I11·2⁻¹(l̇1 − I12I22⁻¹l̇2), and l̃2 = I^{21}l̇1 + I^{22}l̇2 = I22·1⁻¹(l̇2 − I21I11⁻¹l̇1).

9. Let Tn be the Hodges superefficient estimator of θ.

(a) Show that Tn is not a regular estimator of θ at θ = 0, but that it is regular at every θ ≠ 0. If θn = t/√n, find the limiting distribution of √n(Tn − θn) under Pθn.

(b) For θn = t/√n show that

Rn(θn) = n Eθn[(Tn − θn)²] → a² + t²(1 − a)².

This is larger than 1 if t² > (1 + a)/(1 − a); hence superefficiency also entails worse risk in a local neighborhood of the points where the asymptotic variance is smaller.

10. Suppose that (Y|Z) ∼ Weibull(λ⁻¹ exp{−γZ}, β) and Z ∼ Gη on R with density gη with respect to some dominating measure µ. Thus the conditional cumulative hazard function Λ(t|z) is given by

Λ_{γ,λ,β}(t|z) = (λe^{γz}t)^β = λ^β e^{βγz} t^β,

and hence

λ_{γ,λ,β}(t|z) = λ^β e^{βγz} β t^{β−1}.

(Recall that λ(t) = f(t)/(1 − F(t)) and Λ(t) = −log(1 − F(t)) if F is continuous.) Thus it makes sense to reparameterize by defining θ1 = βγ (the parameter of interest, since it reflects the effect of the covariate Z), θ2 = λ^β and θ3 = β. This yields

λθ(t|z) = θ2 θ3 exp{θ1 z} t^{θ3−1}.

You may assume that a(z) = (∂/∂z) log gη(z) exists and E[a(Z)²] < ∞. Thus Z is a "covariate" or "predictor variable", θ1 is a "regression parameter" which affects the intensity of Y given Z, and θ = (θ1, θ2, θ3, θ4), where θ4 = η.

(a) Derive the joint density pθ(y, z) of (Y, Z) for the reparameterized model.

(b) Find the information matrix for θ. What does the structure of this matrix say about the effect of η = θ4 being known or unknown on the estimation of θ1, θ2, θ3?

(c) Find the information and the information bound for θ1 if the parameters θ2 and θ3 are known.

(d) What is the information for θ1 if just θ3 is known to be equal to 1?

(e) Find the efficient score function and the efficient influence function for estimation of θ1 when θ3 is known.

(f) Find the information I11·(2,3) and the information bound for θ1 if the parameters θ2 and θ3 are unknown.

(g) Find the efficient score function and the efficient influence function for estimation of θ1 when θ2 and θ3 are unknown.

(h) Specialize the calculations in (d)-(g) to the case Z ∼ Bernoulli(θ4) and compare the information bounds.

11. Lehmann and Casella, page 72, problems 6.33, 6.34, 6.35

12. Lehmann and Casella, pages 129-137, problems 1.1-3.30

13. Lehmann and Casella, pages 138-143, problems 5.1-6.12

14. Lehmann and Casella, pages 496-501, problems 1.1-2.14

15. Ferguson, pages 131-132, problems 2-5

16. Ferguson, page 139, problems 1-4


CHAPTER 5 EFFICIENT ESTIMATION: MAXIMUM LIKELIHOOD APPROACH

In the previous chapter, we discussed the asymptotic lower bound (efficiency bound) for all regular estimators. A natural question is then which estimator can achieve this bound; equivalently, which estimator is asymptotically efficient. In this chapter, we focus on the most commonly used estimator, the maximum likelihood estimator. We will show that under some regularity conditions, the maximum likelihood estimator is asymptotically efficient.

Suppose $X_1,\ldots,X_n$ are i.i.d. from $P_{\theta_0}$ in the model $\mathcal{P} = \{P_\theta: \theta\in\Theta\}$. We assume

(A0) $\theta \neq \theta^*$ implies $P_\theta \neq P_{\theta^*}$ (identifiability).
(A1) $P_\theta$ has a density function $p_\theta$ with respect to a dominating $\sigma$-finite measure $\mu$.
(A2) The set $\{x: p_\theta(x) > 0\}$ does not depend on $\theta$.

Furthermore, we denote
\[
L_n(\theta) = \prod_{i=1}^n p_\theta(X_i), \qquad l_n(\theta) = \sum_{i=1}^n \log p_\theta(X_i).
\]
$L_n(\theta)$ and $l_n(\theta)$ are called the likelihood function and the log-likelihood function of $\theta$, respectively. An estimator $\hat\theta_n$ of $\theta_0$ is a maximum likelihood estimator (MLE) of $\theta_0$ if it maximizes the likelihood function $L_n(\theta)$, or equivalently $l_n(\theta)$.

Some caution is needed in this maximization: first, the maximum likelihood estimator may not exist; second, even if it exists, it may not be unique; third, the definition of the maximum likelihood estimator depends on the parameterization of $p_\theta$, so different parameterizations may lead to different estimators.

5.1 Ad Hoc Arguments of MLE Efficiency

In the following, we explain intuitively why the maximum likelihood estimator is an efficient estimator, leaving rigorous conditions and arguments to the subsequent sections. First, to see the consistency of the maximum likelihood estimator, we introduce the Kullback-Leibler information.

Definition 5.1 Let $P$ be a probability measure and let $Q$ be another measure on $(\Omega,\mathcal{A})$, with densities $p$ and $q$ with respect to a $\sigma$-finite measure $\mu$ ($\mu = P + Q$ always works), $P(\Omega) = 1$ and $Q(\Omega) \le 1$. Then the Kullback-Leibler information $K(P,Q)$ is
\[
K(P,Q) = E_P\left[\log\frac{p(X)}{q(X)}\right]. \qquad\dagger
\]

Immediately, we obtain the following result.

Proposition 5.1 $K(P,Q)$ is well-defined and $K(P,Q) \ge 0$; $K(P,Q) = 0$ if and only if $P = Q$. †
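As a quick illustration of Definition 5.1 (our own worked example, not part of the original notes), take $P = N(\mu_1,\sigma^2)$ and $Q = N(\mu_2,\sigma^2)$:

```latex
K(P,Q) = E_P\!\left[\log\frac{p(X)}{q(X)}\right]
       = E_P\!\left[\frac{(X-\mu_2)^2 - (X-\mu_1)^2}{2\sigma^2}\right]
       = \frac{(\mu_1-\mu_2)^2}{2\sigma^2},
```

which is nonnegative and equals zero exactly when $\mu_1 = \mu_2$, consistent with Proposition 5.1.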


Proof By Jensen's inequality,
\[
K(P,Q) = E_P\left[-\log\frac{q(X)}{p(X)}\right] \ge -\log E_P\left[\frac{q(X)}{p(X)}\right] = -\log Q(\Omega) \ge 0.
\]
Equality holds if and only if $q(x)/p(x)$ is constant almost surely with respect to $P$, say $q = Mp$, and $Q(\Omega) = 1$. Thus $M = 1$ and $P = Q$. †

Since $\hat\theta_n$ maximizes $l_n(\theta)$,
\[
\frac{1}{n}\sum_{i=1}^n \log p_{\hat\theta_n}(X_i) \ge \frac{1}{n}\sum_{i=1}^n \log p_{\theta_0}(X_i).
\]
Suppose $\hat\theta_n \to \theta^*$. Then we would expect both sides to converge, yielding
\[
E_{\theta_0}[\log p_{\theta^*}(X)] \ge E_{\theta_0}[\log p_{\theta_0}(X)],
\]
which implies $K(P_{\theta_0}, P_{\theta^*}) \le 0$. From Proposition 5.1, $P_{\theta_0} = P_{\theta^*}$, and from (A0), $\theta^* = \theta_0$ (this is where the identifiability condition is used). That is, $\hat\theta_n$ converges to $\theta_0$. Note that three conditions are essential in this argument: (i) $\hat\theta_n \to \theta^*$ (compactness of $\{\hat\theta_n\}$); (ii) the convergence of $n^{-1}l_n(\hat\theta_n)$ (locally uniform convergence); (iii) $P_{\theta_0} = P_{\theta^*}$ implies $\theta_0 = \theta^*$ (identifiability).

Next, we give an ad hoc discussion of the efficiency of the maximum likelihood estimator. Suppose $\hat\theta_n \to \theta_0$. If $\hat\theta_n$ is in the interior of $\Theta$, then $\hat\theta_n$ solves the likelihood (or score) equation
\[
\dot{l}_n(\hat\theta_n) = \sum_{i=1}^n \dot{l}_{\hat\theta_n}(X_i) = 0.
\]
Suppose $l_\theta(X) = \log p_\theta(X)$ is twice differentiable with respect to $\theta$. Applying a Taylor expansion to $\dot{l}_{\hat\theta_n}(X_i)$ at $\theta_0$, we obtain
\[
-\sum_{i=1}^n \dot{l}_{\theta_0}(X_i) = \sum_{i=1}^n \ddot{l}_{\theta^*}(X_i)(\hat\theta_n - \theta_0),
\]
where $\theta^*$ is between $\theta_0$ and $\hat\theta_n$. This gives
\[
\sqrt{n}(\hat\theta_n - \theta_0) = -\left\{\frac{1}{n}\sum_{i=1}^n \ddot{l}_{\theta^*}(X_i)\right\}^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n \dot{l}_{\theta_0}(X_i).
\]
By the law of large numbers, $n^{-1}\sum_i \ddot{l}_{\theta^*}(X_i) \approx -I(\theta_0)$, so $\sqrt{n}(\hat\theta_n - \theta_0)$ is asymptotically equivalent to
\[
\frac{1}{\sqrt{n}}\sum_{i=1}^n I(\theta_0)^{-1}\dot{l}_{\theta_0}(X_i).
\]
Then $\hat\theta_n$ is an asymptotically linear estimator of $\theta_0$ with influence function $I(\theta_0)^{-1}\dot{l}_{\theta_0} = \tilde{l}(\cdot, P_{\theta_0}|\theta,\mathcal{P})$. This shows that $\hat\theta_n$ is an efficient estimator of $\theta_0$ and that the asymptotic variance of $\sqrt{n}(\hat\theta_n - \theta_0)$ attains the efficiency bound defined in the previous chapter. Again, the above arguments require a few conditions to go through. As mentioned before, in the following sections we will rigorously prove the consistency and the asymptotic efficiency of the maximum likelihood estimator. Moreover, we will discuss the computation of maximum likelihood estimators and some alternative efficient estimation approaches.


5.2 Consistency of Maximum Likelihood Estimator

We provide some sufficient conditions for the consistency of the maximum likelihood estimator.

Theorem 5.1 Suppose that
(a) $\Theta$ is compact;
(b) $\log p_\theta(x)$ is continuous in $\theta$ for all $x$;
(c) there exists a function $F(x)$ such that $E_{\theta_0}[F(X)] < \infty$ and $|\log p_\theta(x)| \le F(x)$ for all $x$ and $\theta$.
Then $\hat\theta_n \to_{a.s.} \theta_0$. †

Proof For any sample $\omega \in \Omega$, the sequence $\{\hat\theta_n\}$ lies in the compact set $\Theta$. Thus, by choosing a subsequence, we may assume $\hat\theta_n \to \theta^*$. Suppose we can show that
\[
\frac{1}{n}\sum_{i=1}^n l_{\hat\theta_n}(X_i) \to E_{\theta_0}[l_{\theta^*}(X)].
\]
Then since
\[
\frac{1}{n}\sum_{i=1}^n l_{\hat\theta_n}(X_i) \ge \frac{1}{n}\sum_{i=1}^n l_{\theta_0}(X_i),
\]
we have $E_{\theta_0}[l_{\theta^*}(X)] \ge E_{\theta_0}[l_{\theta_0}(X)]$. Thus Proposition 5.1 together with identifiability gives $\theta^* = \theta_0$. That is, every convergent subsequence of $\hat\theta_n$ converges to $\theta_0$, so $\hat\theta_n \to_{a.s.} \theta_0$.

It remains to show
\[
P_n[l_{\hat\theta_n}(X)] \equiv \frac{1}{n}\sum_{i=1}^n l_{\hat\theta_n}(X_i) \to E_{\theta_0}[l_{\theta^*}(X)].
\]
Since $E_{\theta_0}[l_{\hat\theta_n}(X)] \to E_{\theta_0}[l_{\theta^*}(X)]$ by the dominated convergence theorem, it suffices to show
\[
\left|P_n[l_{\hat\theta_n}(X)] - E_{\theta_0}[l_{\hat\theta_n}(X)]\right| \to 0.
\]
We can in fact prove the following uniform convergence result:
\[
\sup_{\theta\in\Theta}\left|P_n[l_\theta(X)] - E_{\theta_0}[l_\theta(X)]\right| \to 0.
\]
To see this, we define
\[
\psi(x,\theta,\rho) = \sup_{|\theta'-\theta|<\rho}\left(l_{\theta'}(x) - E_{\theta_0}[l_{\theta'}(X)]\right).
\]
Since $l_\theta$ is continuous in $\theta$, $\psi(x,\theta,\rho)$ is measurable and, by the dominated convergence theorem, $E_{\theta_0}[\psi(X,\theta,\rho)]$ decreases to $E_{\theta_0}\big[l_\theta(X) - E_{\theta_0}[l_\theta(X)]\big] = 0$ as $\rho \downarrow 0$. Thus, for any $\epsilon > 0$ and any $\theta\in\Theta$, there exists a $\rho_\theta$ such that
\[
E_{\theta_0}[\psi(X,\theta,\rho_\theta)] < \epsilon.
\]
The union of the balls $\{\theta': |\theta'-\theta| < \rho_\theta\}$ covers $\Theta$. By the compactness of $\Theta$, there exist finitely many points $\theta_1,\ldots,\theta_m$ such that
\[
\Theta \subset \bigcup_{i=1}^m \{\theta': |\theta'-\theta_i| < \rho_{\theta_i}\}.
\]
Therefore,
\[
\sup_{\theta\in\Theta}\left\{P_n[l_\theta(X)] - E_{\theta_0}[l_\theta(X)]\right\} \le \max_{1\le i\le m} P_n[\psi(X,\theta_i,\rho_{\theta_i})].
\]
Since $P_n[\psi(X,\theta_i,\rho_{\theta_i})] \to_{a.s.} E_{\theta_0}[\psi(X,\theta_i,\rho_{\theta_i})]$, we obtain
\[
\limsup_n \sup_{\theta\in\Theta}\left\{P_n[l_\theta(X)] - E_{\theta_0}[l_\theta(X)]\right\} \le \max_{1\le i\le m} E_{\theta_0}[\psi(X,\theta_i,\rho_{\theta_i})] \le \epsilon.
\]
Since $\epsilon$ is arbitrary, $\limsup_n \sup_{\theta\in\Theta}\{P_n[l_\theta(X)] - E_{\theta_0}[l_\theta(X)]\} \le 0$. Applying the same argument to $-l_\theta(X)$ gives $\limsup_n \sup_{\theta\in\Theta}\{-P_n[l_\theta(X)] + E_{\theta_0}[l_\theta(X)]\} \le 0$. Thus,
\[
\lim_n \sup_{\theta\in\Theta}\left|P_n[l_\theta(X)] - E_{\theta_0}[l_\theta(X)]\right| = 0. \qquad\dagger
\]

As a note, condition (c) in Theorem 5.1 cannot be dropped. Ferguson (2002), page 116, gives an interesting counterexample showing that when (c) fails, the maximum likelihood estimator may converge to a fixed constant regardless of the true parameter value.

Another type of consistency result is the classical Wald consistency theorem.

Theorem 5.2 (Wald's Consistency) Suppose $\Theta$ is compact and $\theta \mapsto l_\theta(x) = \log p_\theta(x)$ is upper semicontinuous for all $x$, in the sense that
\[
\limsup_{\theta'\to\theta} l_{\theta'}(x) \le l_\theta(x).
\]
Suppose also that for every sufficiently small ball $U \subset \Theta$,
\[
E_{\theta_0}\left[\sup_{\theta'\in U} l_{\theta'}(X)\right] < \infty.
\]
Then $\hat\theta_n \to_p \theta_0$. †

Proof Since $E_{\theta_0}[l_{\theta_0}(X)] > E_{\theta_0}[l_{\theta'}(X)]$ for any $\theta' \neq \theta_0$ (Proposition 5.1 plus identifiability), there exists a ball $U_{\theta'}$ containing $\theta'$ such that
\[
E_{\theta_0}[l_{\theta_0}(X)] > E_{\theta_0}\left[\sup_{\theta^*\in U_{\theta'}} l_{\theta^*}(X)\right].
\]
Otherwise, for the shrinking balls $U_m = \{\theta^*: |\theta^*-\theta'| < 1/m\}$ we would have $E_{\theta_0}[l_{\theta_0}(X)] \le E_{\theta_0}[\sup_{\theta^*\in U_m} l_{\theta^*}(X)]$ for every $m$. Since $\sup_{\theta^*\in U_m} l_{\theta^*}(x)$ decreases in $m$ and is bounded above by the integrable envelope assumed over a sufficiently small ball, the monotone convergence theorem and the upper semicontinuity give
\[
\limsup_m E_{\theta_0}\left[\sup_{\theta^*\in U_m} l_{\theta^*}(X)\right] \le E_{\theta_0}\left[\limsup_m \sup_{\theta^*\in U_m} l_{\theta^*}(X)\right] \le E_{\theta_0}[l_{\theta'}(X)].
\]
We would then obtain $E_{\theta_0}[l_{\theta_0}(X)] \le E_{\theta_0}[l_{\theta'}(X)]$, a contradiction.

For any $\epsilon > 0$, the balls $U_{\theta'}$ cover the compact set $K_\epsilon = \{\theta'\in\Theta: |\theta'-\theta_0|\ge\epsilon\}$, so there exists a finite subcover $U_1,\ldots,U_m$. Then
\[
P(|\hat\theta_n - \theta_0| \ge \epsilon) \le P\left(\sup_{\theta'\in K_\epsilon} P_n[l_{\theta'}(X)] \ge P_n[l_{\theta_0}(X)]\right)
\le P\left(\max_{1\le i\le m} P_n\left[\sup_{\theta'\in U_i} l_{\theta'}(X)\right] \ge P_n[l_{\theta_0}(X)]\right)
\]
\[
\le \sum_{i=1}^m P\left(P_n\left[\sup_{\theta'\in U_i} l_{\theta'}(X)\right] \ge P_n[l_{\theta_0}(X)]\right).
\]
Since
\[
P_n\left[\sup_{\theta'\in U_i} l_{\theta'}(X)\right] \to_{a.s.} E_{\theta_0}\left[\sup_{\theta'\in U_i} l_{\theta'}(X)\right] < E_{\theta_0}[l_{\theta_0}(X)]
\]
and $P_n[l_{\theta_0}(X)] \to_{a.s.} E_{\theta_0}[l_{\theta_0}(X)]$, the right-hand side converges to zero. Thus $\hat\theta_n \to_p \theta_0$. †

5.3 Asymptotic Efficiency of Maximum Likelihood Estimator

The following theorem gives regularity conditions under which the maximum likelihood estimator attains the asymptotic efficiency bound.

Theorem 5.3 Suppose that the model $\mathcal{P} = \{P_\theta: \theta\in\Theta\}$ is Hellinger differentiable at an inner point $\theta_0$ of $\Theta \subset \mathbb{R}^k$. Furthermore, suppose that there exists a measurable function $F(X)$ with $E_{\theta_0}[F(X)^2] < \infty$ such that for every $\theta_1$ and $\theta_2$ in a neighborhood of $\theta_0$,
\[
|\log p_{\theta_1}(x) - \log p_{\theta_2}(x)| \le F(x)|\theta_1 - \theta_2|.
\]
If the Fisher information matrix $I(\theta_0)$ is nonsingular and $\hat\theta_n$ is consistent, then
\[
\sqrt{n}(\hat\theta_n - \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n I(\theta_0)^{-1}\dot{l}_{\theta_0}(X_i) + o_p(1).
\]
In particular, $\sqrt{n}(\hat\theta_n - \theta_0)$ is asymptotically normal with mean zero and covariance matrix $I(\theta_0)^{-1}$. †

Proof For any $h_n \to h$, by the Hellinger differentiability,
\[
W_n \equiv 2\left(\sqrt{\frac{p_{\theta_0+h_n/\sqrt{n}}}{p_{\theta_0}}} - 1\right)
\]
satisfies $\sqrt{n}\,W_n \to h'\dot{l}_{\theta_0}$ in $L_2(P_{\theta_0})$. We obtain
\[
\sqrt{n}\left(\log p_{\theta_0+h_n/\sqrt{n}} - \log p_{\theta_0}\right) = 2\sqrt{n}\log(1 + W_n/2) \to_p h'\dot{l}_{\theta_0}.
\]
Using the Lipschitz continuity of $\log p_\theta$ and the dominated convergence theorem, we can show that
\[
E_{\theta_0}\left[\sqrt{n}(P_n - P)\left[\sqrt{n}(\log p_{\theta_0+h_n/\sqrt{n}} - \log p_{\theta_0}) - h'\dot{l}_{\theta_0}\right]\right] \to 0
\]
and
\[
\mathrm{Var}_{\theta_0}\left[\sqrt{n}(P_n - P)\left[\sqrt{n}(\log p_{\theta_0+h_n/\sqrt{n}} - \log p_{\theta_0}) - h'\dot{l}_{\theta_0}\right]\right] \to 0.
\]
Thus,
\[
\sqrt{n}(P_n - P)\left[\sqrt{n}(\log p_{\theta_0+h_n/\sqrt{n}} - \log p_{\theta_0}) - h'\dot{l}_{\theta_0}\right] \to_p 0,
\]
where $\sqrt{n}(P_n - P)[g(X)]$ is defined as $n^{-1/2}\sum_{i=1}^n\{g(X_i) - E_{\theta_0}[g(X)]\}$.

From Step I in the proof of Theorem 4.4, we know
\[
\log\prod_{i=1}^n \frac{p_{\theta_0+h_n/\sqrt{n}}(X_i)}{p_{\theta_0}(X_i)} = \frac{1}{\sqrt{n}}\sum_{i=1}^n h'\dot{l}_{\theta_0}(X_i) - \frac{1}{2}h'I(\theta_0)h + o_p(1).
\]
We obtain
\[
n E_{\theta_0}\left[\log p_{\theta_0+h_n/\sqrt{n}} - \log p_{\theta_0}\right] \to -\frac{1}{2}h'I(\theta_0)h.
\]
Hence the map $\theta \mapsto E_{\theta_0}[\log p_\theta]$ is twice differentiable at $\theta_0$ with second derivative matrix $-I(\theta_0)$. Furthermore, we obtain
\[
n P_n\left[\log p_{\theta_0+h_n/\sqrt{n}} - \log p_{\theta_0}\right] = -\frac{1}{2}h_n'I(\theta_0)h_n + h_n'\sqrt{n}(P_n - P)[\dot{l}_{\theta_0}] + o_p(1).
\]
We apply this expansion with $h_n = \sqrt{n}(\hat\theta_n - \theta_0)$ and with $h_n = I(\theta_0)^{-1}\sqrt{n}(P_n - P)[\dot{l}_{\theta_0}]$, respectively. This gives
\[
n P_n[\log p_{\hat\theta_n} - \log p_{\theta_0}] = -\frac{n}{2}(\hat\theta_n - \theta_0)'I(\theta_0)(\hat\theta_n - \theta_0) + \sqrt{n}(\hat\theta_n - \theta_0)'\sqrt{n}(P_n - P)[\dot{l}_{\theta_0}] + o_p(1),
\]
\[
n P_n\left[\log p_{\theta_0 + I(\theta_0)^{-1}\sqrt{n}(P_n - P)[\dot{l}_{\theta_0}]/\sqrt{n}} - \log p_{\theta_0}\right] = \frac{1}{2}\sqrt{n}(P_n - P)[\dot{l}_{\theta_0}]'\,I(\theta_0)^{-1}\sqrt{n}(P_n - P)[\dot{l}_{\theta_0}] + o_p(1).
\]
Since $\hat\theta_n$ maximizes the likelihood, the left-hand side of the first equation is no smaller than the left-hand side of the second. After simple algebra, we obtain
\[
-\frac{1}{2}\left\{\sqrt{n}(\hat\theta_n - \theta_0) - I(\theta_0)^{-1}\sqrt{n}(P_n - P)[\dot{l}_{\theta_0}]\right\}'I(\theta_0)\left\{\sqrt{n}(\hat\theta_n - \theta_0) - I(\theta_0)^{-1}\sqrt{n}(P_n - P)[\dot{l}_{\theta_0}]\right\} + o_p(1) \ge 0.
\]
Thus,
\[
\sqrt{n}(\hat\theta_n - \theta_0) = I(\theta_0)^{-1}\sqrt{n}(P_n - P)[\dot{l}_{\theta_0}] + o_p(1). \qquad\dagger
\]

A classical set of conditions for the asymptotic normality of $\sqrt{n}(\hat\theta_n - \theta_0)$ is given in the following theorem.

Theorem 5.4 For $\theta$ in an open subset of Euclidean space, let $\theta \mapsto l_\theta(x) = \log p_\theta(x)$ be three times continuously differentiable for every $x$. Suppose $E_{\theta_0}[\dot{l}_{\theta_0}\dot{l}_{\theta_0}'] < \infty$ and that $E_{\theta_0}[\ddot{l}_{\theta_0}]$ exists and is nonsingular. Assume that the third-order partial derivatives of $l_\theta(x)$ are dominated by a fixed integrable function $F(x)$ for every $\theta$ in a neighborhood of $\theta_0$. Suppose $\hat\theta_n \to_p \theta_0$. Then
\[
\sqrt{n}(\hat\theta_n - \theta_0) = -\left(E_{\theta_0}[\ddot{l}_{\theta_0}]\right)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n \dot{l}_{\theta_0}(X_i) + o_p(1). \qquad\dagger
\]

Proof $\hat\theta_n$ solves the score equation
\[
0 = \sum_{i=1}^n \dot{l}_{\hat\theta_n}(X_i).
\]
A Taylor expansion around $\theta_0$ gives
\[
0 = \sum_{i=1}^n \dot{l}_{\theta_0}(X_i) + \sum_{i=1}^n \ddot{l}_{\theta_0}(X_i)(\hat\theta_n - \theta_0) + \frac{1}{2}(\hat\theta_n - \theta_0)'\left\{\sum_{i=1}^n l^{(3)}_{\tilde\theta_n}(X_i)\right\}(\hat\theta_n - \theta_0),
\]
where $\tilde\theta_n$ is between $\hat\theta_n$ and $\theta_0$. Thus,
\[
\left|\left\{\frac{1}{n}\sum_{i=1}^n \ddot{l}_{\theta_0}(X_i)\right\}(\hat\theta_n - \theta_0) + \frac{1}{n}\sum_{i=1}^n \dot{l}_{\theta_0}(X_i)\right| \le \frac{1}{n}\sum_{i=1}^n |F(X_i)|\,|\hat\theta_n - \theta_0|^2.
\]
Since $n^{-1}\sum_i \ddot{l}_{\theta_0}(X_i) \to_p E_{\theta_0}[\ddot{l}_{\theta_0}]$, which is nonsingular, $n^{-1}\sum_i \dot{l}_{\theta_0}(X_i) = O_p(n^{-1/2})$ and $\hat\theta_n - \theta_0 = o_p(1)$, we obtain $\hat\theta_n - \theta_0 = O_p(n^{-1/2})$. Then it holds that
\[
\left\{\frac{1}{n}\sum_{i=1}^n \ddot{l}_{\theta_0}(X_i) + o_p(1)\right\}\sqrt{n}(\hat\theta_n - \theta_0) = -\frac{1}{\sqrt{n}}\sum_{i=1}^n \dot{l}_{\theta_0}(X_i).
\]
The result follows. †

5.4 Computation of Maximum Likelihood Estimate

A variety of methods can be used to compute the maximum likelihood estimate. Since the maximum likelihood estimate $\hat\theta_n$ solves the likelihood equation
\[
\sum_{i=1}^n \dot{l}_\theta(X_i) = 0,
\]
one numerical method is the Newton-Raphson iteration: at the $k$th iteration,
\[
\theta^{(k+1)} = \theta^{(k)} - \left\{\frac{1}{n}\sum_{i=1}^n \ddot{l}_{\theta^{(k)}}(X_i)\right\}^{-1}\frac{1}{n}\sum_{i=1}^n \dot{l}_{\theta^{(k)}}(X_i).
\]
Sometimes, calculating $\ddot{l}_\theta$ may be complicated. Note that
\[
-\frac{1}{n}\sum_{i=1}^n \ddot{l}_{\theta^{(k)}}(X_i) \approx I(\theta^{(k)}).
\]
The Fisher scoring algorithm therefore uses the iteration
\[
\theta^{(k+1)} = \theta^{(k)} + I(\theta^{(k)})^{-1}\frac{1}{n}\sum_{i=1}^n \dot{l}_{\theta^{(k)}}(X_i).
\]
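The following is a minimal sketch of these two iterations (our own illustration; the Cauchy location model and all function names are assumptions, not taken from the notes). For this model the observed and expected information differ, so the two algorithms take genuinely different steps.

```python
import numpy as np

# Newton-Raphson and Fisher scoring for a one-parameter toy model:
# Cauchy location, for which
#   l_theta(x)  = -log(pi) - log(1 + (x - theta)^2),
#   dl/dtheta   = 2(x - theta) / (1 + (x - theta)^2),
#   d2l/dtheta2 = (2(x - theta)^2 - 2) / (1 + (x - theta)^2)^2,
# and the Fisher information per observation is I(theta) = 1/2.

def score_sum(theta, x):
    u = x - theta
    return np.sum(2 * u / (1 + u**2))

def hessian_sum(theta, x):
    u = x - theta
    return np.sum((2 * u**2 - 2) / (1 + u**2) ** 2)

def newton_raphson(x, theta, steps=50):
    for _ in range(steps):
        theta = theta - score_sum(theta, x) / hessian_sum(theta, x)
    return theta

def fisher_scoring(x, theta, steps=50):
    n_info = 0.5 * len(x)                    # n * I(theta), constant here
    for _ in range(steps):
        theta = theta + score_sum(theta, x) / n_info
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_cauchy(5000) + 3.0      # true location = 3
    start = np.median(x)                     # consistent starting value
    print(newton_raphson(x, start), fisher_scoring(x, start))
```

Starting from a consistent value (here the median) matters for the Newton-Raphson iteration, whose global behavior can be erratic in heavy-tailed models.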

An alternative way to find the maximum likelihood estimate is by an optimization search algorithm. The objective function is $L_n(\theta)$, so a simple method is a grid search, evaluating $L_n(\theta)$ at a number of $\theta$'s in the parameter space. Clearly, such a method is only feasible for a very low-dimensional $\theta$. More efficient methods include quasi-Newton and gradient-ascent searches, in which at each $\theta$ we search along a direction determined by $\dot{L}_n(\theta)$. Recent developments also include many Bayesian computational methods such as Markov chain Monte Carlo (MCMC) and simulated annealing.

In this section, we particularly focus on the calculation of the maximum likelihood estimate when part of the data is missing or only mismeasured data are observed. In such calculations, a useful algorithm is the expectation-maximization (EM) algorithm. We describe this algorithm in detail and explain why the EM algorithm may give the maximum likelihood estimate. A few examples are given for illustration.

5.4.1 EM framework

Suppose $Y$ denotes the vector of statistics from $n$ subjects. In many practical problems, $Y$ cannot be fully observed due to missingness; instead, only partial data, or a function of $Y$, is observed. For simplicity, suppose $Y = (Y_{mis}, Y_{obs})$, where $Y_{obs}$ is the observed part of $Y$ and $Y_{mis}$ is the unobserved part. Furthermore, we introduce $R$ as a vector of 0/1 indicators of which components are missing/not missing. Then the observed data consist of $(Y_{obs}, R)$.

Assume $Y$ has a density function $f(Y;\theta)$ with $\theta\in\Theta$. Then the density function for the observed data $(Y_{obs}, R)$ is
\[
\int_{Y_{mis}} f(Y;\theta) P(R|Y)\, dY_{mis},
\]
where $P(R|Y)$ denotes the conditional probability of $R$ given $Y$. One additional assumption is that $P(R|Y) = P(R|Y_{obs})$ and that $P(R|Y)$ does not depend on $\theta$; i.e., the missingness probability depends only on the observed data and is noninformative about $\theta$. Such an assumption is called missing at random (MAR) and is often made in missing data problems. Under MAR, the density function for the observed data equals
\[
\left\{\int_{Y_{mis}} f(Y;\theta)\, dY_{mis}\right\} P(R|Y_{obs}).
\]
Hence, if we wish to calculate the maximum likelihood estimator of $\theta$, we can ignore the factor $P(R|Y_{obs})$ and simply maximize $\int_{Y_{mis}} f(Y;\theta)\, dY_{mis}$. Note that the latter is exactly the marginal density of $Y_{obs}$, denoted by $f(Y_{obs};\theta)$.

The EM algorithm proceeds as follows: we start from an initial value $\theta^{(1)}$ and iterate. The $k$th iteration consists of an E-step and an M-step:

E-step. We evaluate the conditional expectation
\[
E\left[\log f(Y;\theta)\,\big|\,Y_{obs},\theta^{(k)}\right],
\]
where $E[\cdot\,|Y_{obs},\theta^{(k)}]$ is the conditional expectation given the observed data and the current value of $\theta$. That is,
\[
E\left[\log f(Y;\theta)\,\big|\,Y_{obs},\theta^{(k)}\right] = \frac{\int_{Y_{mis}} \{\log f(Y;\theta)\}\, f(Y;\theta^{(k)})\, dY_{mis}}{\int_{Y_{mis}} f(Y;\theta^{(k)})\, dY_{mis}}.
\]
Such an expectation can often be evaluated by simple numerical calculation, as will be seen in the later examples.

M-step. We obtain $\theta^{(k+1)}$ by maximizing
\[
E\left[\log f(Y;\theta)\,\big|\,Y_{obs},\theta^{(k)}\right]
\]
over $\theta$.

We then iterate until convergence of $\theta$, i.e., until the difference between $\theta^{(k+1)}$ and $\theta^{(k)}$ is less than a given criterion.

The reason why the EM algorithm may produce the maximum likelihood estimator is the following result.

Theorem 5.5 At each iteration of the EM algorithm, $\log f(Y_{obs};\theta^{(k+1)}) \ge \log f(Y_{obs};\theta^{(k)})$, and the equality holds if and only if $\theta^{(k+1)} = \theta^{(k)}$. †

Proof From the EM algorithm, we see that
\[
E\left[\log f(Y;\theta^{(k+1)})\,\big|\,Y_{obs},\theta^{(k)}\right] \ge E\left[\log f(Y;\theta^{(k)})\,\big|\,Y_{obs},\theta^{(k)}\right].
\]
Since
\[
\log f(Y;\theta) = \log f(Y_{obs};\theta) + \log f(Y_{mis}|Y_{obs};\theta),
\]
we obtain
\[
E\left[\log f(Y_{mis}|Y_{obs};\theta^{(k+1)})\,\big|\,Y_{obs},\theta^{(k)}\right] + \log f(Y_{obs};\theta^{(k+1)})
\ge E\left[\log f(Y_{mis}|Y_{obs};\theta^{(k)})\,\big|\,Y_{obs},\theta^{(k)}\right] + \log f(Y_{obs};\theta^{(k)}).
\]
On the other hand, since
\[
E\left[\log f(Y_{mis}|Y_{obs};\theta^{(k+1)})\,\big|\,Y_{obs},\theta^{(k)}\right] \le E\left[\log f(Y_{mis}|Y_{obs};\theta^{(k)})\,\big|\,Y_{obs},\theta^{(k)}\right]
\]
by the non-negativity of the Kullback-Leibler information, we conclude that $\log f(Y_{obs};\theta^{(k+1)}) \ge \log f(Y_{obs};\theta^{(k)})$. The equality holds if and only if
\[
\log f(Y_{mis}|Y_{obs};\theta^{(k+1)}) = \log f(Y_{mis}|Y_{obs};\theta^{(k)}),
\]
equivalently $\log f(Y;\theta^{(k+1)}) = \log f(Y;\theta^{(k)})$, and thus $\theta^{(k+1)} = \theta^{(k)}$. †

From Theorem 5.5, we conclude that each iteration of the EM algorithm increases the observed-data likelihood function. Thus, it is expected that $\theta^{(k)}$ will eventually converge to the maximum likelihood estimate. If the initial value of the EM algorithm is chosen close to the maximum likelihood estimate (though we never know this) and the objective function is concave in a neighborhood of the maximum likelihood estimate, then the maximization in the M-step can be replaced by a Newton-Raphson iteration. Correspondingly, an alternative form of the EM algorithm is given by:

E-step. We evaluate the conditional expectations
\[
E\left[\frac{\partial}{\partial\theta}\log f(Y;\theta)\,\Big|\,Y_{obs},\theta^{(k)}\right]
\quad\text{and}\quad
E\left[\frac{\partial^2}{\partial\theta^2}\log f(Y;\theta)\,\Big|\,Y_{obs},\theta^{(k)}\right].
\]

M-step. We obtain $\theta^{(k+1)}$ by solving
\[
0 = E\left[\frac{\partial}{\partial\theta}\log f(Y;\theta)\,\Big|\,Y_{obs},\theta^{(k)}\right]
\]
using a one-step Newton-Raphson iteration:
\[
\theta^{(k+1)} = \theta^{(k)} - \left\{E\left[\frac{\partial^2}{\partial\theta^2}\log f(Y;\theta)\,\Big|\,Y_{obs},\theta^{(k)}\right]\right\}^{-1}
E\left[\frac{\partial}{\partial\theta}\log f(Y;\theta)\,\Big|\,Y_{obs},\theta^{(k)}\right]\Bigg|_{\theta=\theta^{(k)}}.
\]

We note that in this second form of the EM algorithm, only a one-step Newton-Raphson iteration is used in the M-step, since it still ensures that the iteration increases the likelihood function.

5.4.2 Examples of using EM algorithm

Example 5.1 Suppose a random vector $Y$ has a multinomial distribution with $n = 197$ and
\[
p = \left(\frac{1}{2}+\frac{\theta}{4},\; \frac{1-\theta}{4},\; \frac{1-\theta}{4},\; \frac{\theta}{4}\right).
\]
Then the probability of $Y = (y_1, y_2, y_3, y_4)$ is given by
\[
\frac{n!}{y_1!y_2!y_3!y_4!}\left(\frac{1}{2}+\frac{\theta}{4}\right)^{y_1}\left(\frac{1-\theta}{4}\right)^{y_2}\left(\frac{1-\theta}{4}\right)^{y_3}\left(\frac{\theta}{4}\right)^{y_4}.
\]

If we use the Newton-Raphson iteration to calculate the maximum likelihood estimator of $\theta$, then after calculating the first and second derivatives of the log-likelihood function, we iterate using
\[
\theta^{(k+1)} = \theta^{(k)} + \left\{\frac{Y_1/16}{(1/2+\theta^{(k)}/4)^2} + \frac{Y_2+Y_3}{(1-\theta^{(k)})^2} + \frac{Y_4}{(\theta^{(k)})^2}\right\}^{-1}
\left\{\frac{Y_1/4}{1/2+\theta^{(k)}/4} - \frac{Y_2+Y_3}{1-\theta^{(k)}} + \frac{Y_4}{\theta^{(k)}}\right\}.
\]
Suppose we observe $Y = (125, 18, 20, 34)$. If we start with $\theta^{(1)} = 0.5$, then after convergence we obtain $\hat\theta_n = 0.6268215$.

We can also use the EM algorithm to calculate the maximum likelihood estimator. Suppose the complete data $X$ have a multinomial distribution with $n$ and
\[
p = \left(\frac{1}{2},\; \frac{\theta}{4},\; \frac{1-\theta}{4},\; \frac{1-\theta}{4},\; \frac{\theta}{4}\right).
\]
Then $Y$ can be treated as incomplete data from $X$ via $Y = (X_1+X_2, X_3, X_4, X_5)$. The score equation for the complete data $X$ is simply
\[
0 = \frac{X_2+X_5}{\theta} - \frac{X_3+X_4}{1-\theta}.
\]
Thus the M-step of the EM algorithm needs to solve the equation
\[
0 = E\left[\frac{X_2+X_5}{\theta} - \frac{X_3+X_4}{1-\theta}\,\Big|\,Y,\theta^{(k)}\right],
\]
while the E-step evaluates the above expectation. By a simple calculation,

\[
E[X|Y,\theta^{(k)}] = \left(Y_1\frac{1/2}{1/2+\theta^{(k)}/4},\; Y_1\frac{\theta^{(k)}/4}{1/2+\theta^{(k)}/4},\; Y_2,\; Y_3,\; Y_4\right).
\]
Then we obtain
\[
\theta^{(k+1)} = \frac{E[X_2+X_5|Y,\theta^{(k)}]}{E[X_2+X_5+X_3+X_4|Y,\theta^{(k)}]}
= \frac{Y_1\dfrac{\theta^{(k)}/4}{1/2+\theta^{(k)}/4} + Y_4}{Y_1\dfrac{\theta^{(k)}/4}{1/2+\theta^{(k)}/4} + Y_2 + Y_3 + Y_4}.
\]

We start from $\theta^{(1)} = 0.5$. The following table gives the results of the iterations:

  k    $\theta^{(k)}$    $\hat\theta_n - \theta^{(k)}$    $(\theta^{(k+1)}-\hat\theta_n)/(\theta^{(k)}-\hat\theta_n)$
  1    .500000000        .126821498                       .1465
  2    .608247423        .018574075                       .1346
  3    .624321051        .002500447                       .1330
  4    .626488879        .000332619                       .1328
  5    .626777323        .000044176                       .1328
  6    .626815632        .000005866                       .1328
  7    .626820719        .000000779
  8    .626821395        .000000104
  9    .626821484        .000000014

From the table, we see that the EM algorithm converges and the result agrees with that obtained from the Newton-Raphson iteration. We also note that the convergence is linear, since $(\theta^{(k+1)}-\hat\theta_n)/(\theta^{(k)}-\hat\theta_n)$ approaches a constant; by comparison, the Newton-Raphson iteration converges quadratically, in the sense that $(\theta^{(k+1)}-\hat\theta_n)/(\theta^{(k)}-\hat\theta_n)^2$ approaches a constant. Thus, the Newton-Raphson iteration converges much faster than the EM algorithm; however, as we have seen, each EM iteration is much simpler to compute than a Newton-Raphson iteration, and this is the advantage of the EM algorithm.
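The two iterations for this example are easy to code. The sketch below (our own, with hypothetical function names) runs them side by side from the same starting value; both approach the value 0.6268215 reported above.

```python
import numpy as np

# EM and Newton-Raphson for Example 5.1:
# Y ~ Multinomial(197; 1/2 + t/4, (1-t)/4, (1-t)/4, t/4).

Y = np.array([125.0, 18.0, 20.0, 34.0])

def em_step(t):
    # E-step: expected count of the second complete-data cell given Y
    x2 = Y[0] * (t / 4) / (0.5 + t / 4)
    # M-step: closed-form maximizer of the expected complete-data log-likelihood
    return (x2 + Y[3]) / (x2 + Y[1] + Y[2] + Y[3])

def newton_step(t):
    score = Y[0] * 0.25 / (0.5 + t / 4) - (Y[1] + Y[2]) / (1 - t) + Y[3] / t
    info = (Y[0] / 16 / (0.5 + t / 4) ** 2
            + (Y[1] + Y[2]) / (1 - t) ** 2 + Y[3] / t ** 2)
    return t + score / info

t_em = t_nr = 0.5
for k in range(30):
    t_em, t_nr = em_step(t_em), newton_step(t_nr)
print(t_em, t_nr)   # both approximately 0.6268215
```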

Example 5.2 We consider an exponential mixture model. Suppose $Y \sim P_\theta$, where $P_\theta$ has density
\[
p_\theta(y) = \left\{p\lambda e^{-\lambda y} + (1-p)\mu e^{-\mu y}\right\} I(y > 0)
\]
and $\theta = (p,\lambda,\mu) \in (0,1)\times(0,\infty)\times(0,\infty)$. Consider estimation of $\theta$ based on $Y_1,\ldots,Y_n$ i.i.d. from $p_\theta(y)$. Solving the likelihood equation with the Newton-Raphson algorithm is computationally involved, so we take an approach based on the EM algorithm. We introduce the complete data $X = (Y,\Delta) \sim p_\theta(x)$, where
\[
p_\theta(x) = p_\theta(y,\delta) = \left(p\lambda e^{-\lambda y}\right)^\delta\left((1-p)\mu e^{-\mu y}\right)^{1-\delta}.
\]

This is natural from the following mechanism: $\Delta$ is a Bernoulli variable with $P(\Delta = 1) = p$, and we generate $Y$ from $\text{Exp}(\lambda)$ if $\Delta = 1$ and from $\text{Exp}(\mu)$ if $\Delta = 0$. Thus, $\Delta$ is missing. The score equations for $\theta$ based on $X$ are

\[
0 = \dot{l}_p(X_1,\ldots,X_n) = \sum_{i=1}^n\left\{\frac{\Delta_i}{p} - \frac{1-\Delta_i}{1-p}\right\},
\]
\[
0 = \dot{l}_\lambda(X_1,\ldots,X_n) = \sum_{i=1}^n \Delta_i\left(\frac{1}{\lambda} - Y_i\right),
\]
\[
0 = \dot{l}_\mu(X_1,\ldots,X_n) = \sum_{i=1}^n (1-\Delta_i)\left(\frac{1}{\mu} - Y_i\right).
\]

Thus, the M-step of the EM algorithm solves the equations
\[
0 = \sum_{i=1}^n E\left[\frac{\Delta_i}{p} - \frac{1-\Delta_i}{1-p}\,\Big|\,Y_1,\ldots,Y_n, p^{(k)},\lambda^{(k)},\mu^{(k)}\right]
= \sum_{i=1}^n E\left[\frac{\Delta_i}{p} - \frac{1-\Delta_i}{1-p}\,\Big|\,Y_i, p^{(k)},\lambda^{(k)},\mu^{(k)}\right],
\]
\[
0 = \sum_{i=1}^n E\left[\Delta_i\left(\frac{1}{\lambda}-Y_i\right)\Big|\,Y_1,\ldots,Y_n, p^{(k)},\lambda^{(k)},\mu^{(k)}\right]
= \sum_{i=1}^n E\left[\Delta_i\left(\frac{1}{\lambda}-Y_i\right)\Big|\,Y_i, p^{(k)},\lambda^{(k)},\mu^{(k)}\right],
\]
\[
0 = \sum_{i=1}^n E\left[(1-\Delta_i)\left(\frac{1}{\mu}-Y_i\right)\Big|\,Y_1,\ldots,Y_n, p^{(k)},\lambda^{(k)},\mu^{(k)}\right]
= \sum_{i=1}^n E\left[(1-\Delta_i)\left(\frac{1}{\mu}-Y_i\right)\Big|\,Y_i, p^{(k)},\lambda^{(k)},\mu^{(k)}\right].
\]
This immediately gives
\[
p^{(k+1)} = \frac{1}{n}\sum_{i=1}^n E[\Delta_i|Y_i, p^{(k)},\lambda^{(k)},\mu^{(k)}],
\]
\[
\lambda^{(k+1)} = \frac{\sum_{i=1}^n E[\Delta_i|Y_i, p^{(k)},\lambda^{(k)},\mu^{(k)}]}{\sum_{i=1}^n Y_i E[\Delta_i|Y_i, p^{(k)},\lambda^{(k)},\mu^{(k)}]},
\qquad
\mu^{(k+1)} = \frac{\sum_{i=1}^n E[1-\Delta_i|Y_i, p^{(k)},\lambda^{(k)},\mu^{(k)}]}{\sum_{i=1}^n Y_i E[1-\Delta_i|Y_i, p^{(k)},\lambda^{(k)},\mu^{(k)}]}.
\]

The conditional expectation is
\[
E[\Delta|Y,\theta] = \frac{p\lambda e^{-\lambda Y}}{p\lambda e^{-\lambda Y} + (1-p)\mu e^{-\mu Y}}.
\]
As seen above, the EM algorithm greatly facilitates the computation.
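A compact sketch of this EM iteration on simulated data (our own illustration; the simulation settings are arbitrary):

```python
import numpy as np

# EM for the two-component exponential mixture p*Exp(lambda) + (1-p)*Exp(mu),
# using the closed-form updates derived above.

rng = np.random.default_rng(2)
n = 2000
delta_true = rng.binomial(1, 0.3, n)
y = np.where(delta_true == 1,
             rng.exponential(1 / 2.0, n),    # Exp(lambda), lambda = 2
             rng.exponential(1 / 0.5, n))    # Exp(mu),     mu = 0.5

p, lam, mu = 0.5, 1.0, 0.1                   # starting values
for _ in range(500):
    # E-step: w_i = E[Delta_i | Y_i, current parameters]
    f1 = p * lam * np.exp(-lam * y)
    f0 = (1 - p) * mu * np.exp(-mu * y)
    w = f1 / (f1 + f0)
    # M-step: closed-form updates
    p = w.mean()
    lam = w.sum() / (w * y).sum()
    mu = (1 - w).sum() / ((1 - w) * y).sum()
print(p, lam, mu)   # should be near (0.3, 2, 0.5), up to label switching
```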


5.4.3 Information calculation in EM algorithm

We now consider the information for $\theta$ in the missing data setting. Denote by $\dot{l}_c$ the score for $\theta$ based on the full data, by $\dot{l}_{mis|obs}$ the score for $\theta$ based on the conditional distribution of $Y_{mis}$ given $Y_{obs}$, and by $\dot{l}_{obs}$ the score for $\theta$ based on the distribution of $Y_{obs}$. Then clearly $\dot{l}_c = \dot{l}_{mis|obs} + \dot{l}_{obs}$. Using the formula
\[
\mathrm{Var}(U) = \mathrm{Var}(E[U|V]) + E[\mathrm{Var}(U|V)],
\]
we obtain
\[
\mathrm{Var}(\dot{l}_c) = \mathrm{Var}(E[\dot{l}_c|Y_{obs}]) + E[\mathrm{Var}(\dot{l}_c|Y_{obs})].
\]
Since
\[
E[\dot{l}_c|Y_{obs}] = \dot{l}_{obs} + E[\dot{l}_{mis|obs}|Y_{obs}] = \dot{l}_{obs}
\quad\text{and}\quad
\mathrm{Var}(\dot{l}_c|Y_{obs}) = \mathrm{Var}(\dot{l}_{mis|obs}|Y_{obs}),
\]
we obtain
\[
\mathrm{Var}(\dot{l}_c) = \mathrm{Var}(\dot{l}_{obs}) + E[\mathrm{Var}(\dot{l}_{mis|obs}|Y_{obs})].
\]
Note that $\mathrm{Var}(\dot{l}_c)$ is the information for $\theta$ based on the complete data $Y$, denoted $I_c(\theta)$; $\mathrm{Var}(\dot{l}_{obs})$ is the information for $\theta$ based on the observed data $Y_{obs}$, denoted $I_{obs}(\theta)$; and $\mathrm{Var}(\dot{l}_{mis|obs}|Y_{obs})$ is the conditional information for $\theta$ based on $Y_{mis}$ given $Y_{obs}$, denoted $I_{mis|obs}(\theta; Y_{obs})$. We obtain the following Louis formula:
\[
I_c(\theta) = I_{obs}(\theta) + E[I_{mis|obs}(\theta; Y_{obs})].
\]
Thus, the complete information is the sum of the observed information and the missing information. One can further show that when the EM algorithm converges, its linear convergence rate, $(\theta^{(k+1)}-\hat\theta_n)/(\theta^{(k)}-\hat\theta_n)$, is approximately $1 - I_{obs}(\hat\theta_n)/I_c(\hat\theta_n)$.
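As a numerical check (our own, not part of the notes), this rate can be computed for Example 5.1. Here the complete-data information is evaluated with the unobserved cell count replaced by its conditional expectation given the observed data, which is the version of the complete-data information relevant for the EM rate; the result should match the ratio of roughly 0.133 seen in the iteration table.

```python
import numpy as np

# EM convergence rate 1 - I_obs/I_c for Example 5.1 at the MLE.

Y = np.array([125.0, 18.0, 20.0, 34.0])
theta = 0.6268215

# observed information: minus the second derivative of the observed-data
# log-likelihood
i_obs = (Y[0] / 16 / (0.5 + theta / 4) ** 2
         + (Y[1] + Y[2]) / (1 - theta) ** 2 + Y[3] / theta ** 2)

# complete-data information with X2 replaced by E[X2 | Y, theta]
x2 = Y[0] * (theta / 4) / (0.5 + theta / 4)
i_com = (x2 + Y[3]) / theta ** 2 + (Y[1] + Y[2]) / (1 - theta) ** 2

print(1 - i_obs / i_com)   # approximately 0.13
```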

The EM algorithm can be applied not only to missing data but also to data with measurement error. Recently, it has also been extended to estimation with missing data in many semiparametric models.

5.5 Nonparametric Maximum Likelihood Estimation

In the previous sections, we studied maximum likelihood estimation for parametric models. Maximum likelihood estimation can also be applied to many semiparametric or nonparametric models, and this approach has received more and more attention in recent years. We illustrate through some examples how such an estimation approach is used in semiparametric and nonparametric models. Since obtaining the consistency and asymptotic properties of these maximum likelihood estimators requires both advanced probability theory in metric spaces and semiparametric efficiency theory, we do not get into the details of those theories here.

Example 5.3 Let $X_1,\ldots,X_n$ be i.i.d. random variables with common distribution $F$, where $F$ is an unknown distribution function, and suppose we are interested in estimating $F$. This is a nonparametric model. We consider maximizing the likelihood function to estimate $F$. The likelihood function for $F$ is
\[
L_n(F) = \prod_{i=1}^n f(X_i),
\]
where $f$ is the density function of $F$ with respect to some dominating measure. However, the maximum of $L_n(F)$ does not exist, since one can always choose a continuous density $f$ such that $f(X_1) \to \infty$. To avoid this problem, we instead maximize the alternative function
\[
\tilde{L}_n(F) = \prod_{i=1}^n F\{X_i\},
\]
where $F\{X_i\}$ denotes the jump size $F(X_i) - F(X_i-)$. Clearly $\tilde{L}_n(F) \le 1$, and if $\hat{F}_n$ maximizes $\tilde{L}_n(F)$, then $\hat{F}_n$ must be a distribution function with point masses only at $X_1,\ldots,X_n$. We denote $q_i = F\{X_i\}$, with $q_i = q_j$ whenever $X_i = X_j$. Then maximizing $\tilde{L}_n(F)$ is equivalent to maximizing
\[
\prod_{i=1}^n q_i \quad\text{subject to}\quad \sum_{\text{distinct } q_i} q_i = 1.
\]
Maximization with a Lagrange multiplier gives
\[
q_i = \frac{1}{n}\sum_{j=1}^n I(X_j = X_i).
\]
Then
\[
\hat{F}(x) = \frac{1}{n}\sum_{i=1}^n I(X_i \le x) = F_n(x).
\]
In other words, the maximum likelihood estimator of $F$ is the empirical distribution function $F_n$. It can be shown that $F_n$ converges to $F$ uniformly in $x$ almost surely, and that $\sqrt{n}(F_n - F)$ converges in distribution to a Brownian bridge process. $F_n$ is called the nonparametric maximum likelihood estimator (NPMLE) of $F$.
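A minimal sketch of this NPMLE (the empirical distribution function), for concreteness:

```python
import numpy as np

# NPMLE of F in Example 5.3: mass 1/n at each observation.

def empirical_cdf(x):
    xs = np.sort(np.asarray(x, dtype=float))
    def F_hat(t):
        # proportion of observations <= t
        return np.searchsorted(xs, t, side="right") / xs.size
    return F_hat

F_hat = empirical_cdf([2.3, 0.7, 1.1, 1.1, 5.0])
print(F_hat(1.1), F_hat(4.0))   # 0.6, 0.8
```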

Example 5.4 Suppose $X_1,\ldots,X_n$ are i.i.d. $F$ and $Y_1,\ldots,Y_n$ are i.i.d. $G$. We observe the i.i.d. pairs $(Z_1,\Delta_1),\ldots,(Z_n,\Delta_n)$, where $Z_i = \min(X_i,Y_i)$ and $\Delta_i = I(X_i \le Y_i)$. We regard $X_i$ as a survival time and $Y_i$ as a censoring time. It is easy to see that the joint likelihood for $(Z_i,\Delta_i)$, $i = 1,\ldots,n$, equals
\[
L_n(F,G) = \prod_{i=1}^n \left\{f(Z_i)(1-G(Z_i))\right\}^{\Delta_i}\left\{(1-F(Z_i))g(Z_i)\right\}^{1-\Delta_i}.
\]
As before, $L_n(F,G)$ does not attain its maximum, so we consider the alternative function
\[
\tilde{L}_n(F,G) = \prod_{i=1}^n \left\{F\{Z_i\}(1-G(Z_i))\right\}^{\Delta_i}\left\{(1-F(Z_i))G\{Z_i\}\right\}^{1-\Delta_i}.
\]
Then $\tilde{L}_n(F,G) \le 1$, and maximizing $\tilde{L}_n(F,G)$ is equivalent to maximizing
\[
\prod_{i=1}^n \left\{p_i(1-Q_i)\right\}^{\Delta_i}\left\{q_i(1-P_i)\right\}^{1-\Delta_i}
\]
subject to the constraints $\sum_i p_i = \sum_j q_j = 1$, where $p_i = F\{Z_i\}$, $q_i = G\{Z_i\}$, $P_i = \sum_{Z_j\le Z_i} p_j$ and $Q_i = \sum_{Z_j\le Z_i} q_j$. However, this maximization may not be easy. Instead, we take a different approach based on a new parameterization. Define the hazard functions
\[
\lambda_X(t) = \frac{f(t)}{1-F(t-)}, \qquad \lambda_Y(t) = \frac{g(t)}{1-G(t-)}
\]
and the cumulative hazard functions
\[
\Lambda_X(t) = \int_0^t \lambda_X(s)\,ds, \qquad \Lambda_Y(t) = \int_0^t \lambda_Y(s)\,ds.
\]
The distributions $F$ and $G$ can be recovered from $\Lambda_X$ and $\Lambda_Y$ through the product-limit formula:
\[
1 - F(t) = \prod_{s\le t}(1 - d\Lambda_X(s)) \equiv \lim_{\max_i|t_i-t_{i-1}|\to 0}\;\prod_{0=t_0<t_1<\cdots<t_m=t}\left\{1 - (\Lambda_X(t_i)-\Lambda_X(t_{i-1}))\right\},
\]
and similarly for $1 - G(t)$ with $\Lambda_Y$. Under the new parameterization, the likelihood function for $(Z_i,\Delta_i)$, $i = 1,\ldots,n$, is given by
\[
\prod_{i=1}^n \left[\lambda_X(Z_i)^{\Delta_i}\exp\{-\Lambda_X(Z_i)\}\,\lambda_Y(Z_i)^{1-\Delta_i}\exp\{-\Lambda_Y(Z_i)\}\right].
\]
Again, we maximize a modified function
\[
\prod_{i=1}^n \left[\Lambda_X\{Z_i\}^{\Delta_i}\exp\{-\Lambda_X(Z_i)\}\,\Lambda_Y\{Z_i\}^{1-\Delta_i}\exp\{-\Lambda_Y(Z_i)\}\right],
\]
where $\Lambda_X\{Z_i\}$ and $\Lambda_Y\{Z_i\}$ are the jump sizes of $\Lambda_X$ and $\Lambda_Y$ at $Z_i$. The maximization then amounts to maximizing
\[
\prod_{i=1}^n \left[a_i^{\Delta_i}\exp\{-A_i\}\,b_i^{1-\Delta_i}\exp\{-B_i\}\right],
\]
where $A_i = \sum_{Z_j\le Z_i} a_j$ and $B_i = \sum_{Z_j\le Z_i} b_j$. A simple calculation gives
\[
a_i = \frac{\Delta_i}{R_i}, \qquad b_i = \frac{1-\Delta_i}{R_i}, \qquad R_i = \sum_{j=1}^n I(Z_j \ge Z_i).
\]
Thus, the NPMLEs for $\Lambda_X$ and $\Lambda_Y$ are
\[
\hat\Lambda_X(t) = \sum_{Z_i\le t}\frac{\Delta_i}{R_i}, \qquad \hat\Lambda_Y(t) = \sum_{Z_i\le t}\frac{1-\Delta_i}{R_i}.
\]
As a result of the product-limit formula, the NPMLEs for $F$ and $G$ are
\[
\hat{F}_n(t) = 1 - \prod_{Z_i\le t}\left(1 - \frac{\Delta_i}{R_i}\right), \qquad \hat{G}_n(t) = 1 - \prod_{Z_i\le t}\left(1 - \frac{1-\Delta_i}{R_i}\right).
\]
Both $1-\hat{F}_n$ and $1-\hat{G}_n$ are called Kaplan-Meier estimates of the survival functions of the survival time and the censoring time, respectively. Results based on counting process theory show that $\hat{F}_n$ and $\hat{G}_n$ are uniformly consistent and that both $\sqrt{n}(\hat{F}_n - F)$ and $\sqrt{n}(\hat{G}_n - G)$ are asymptotically Gaussian.
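The closed forms above translate directly into code; a minimal sketch (our own, with hypothetical data):

```python
import numpy as np

# Kaplan-Meier NPMLE of Example 5.4: hazard jump Delta_i / R_i at each
# observed Z_i, with R_i = #{j : Z_j >= Z_i}, then the product-limit formula.

def kaplan_meier(z, delta):
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    order = np.argsort(z)
    z, delta = z[order], delta[order]
    risk = np.array([(z >= zi).sum() for zi in z])   # R_i
    surv = np.cumprod(1 - delta / risk)              # 1 - F_hat at sorted Z_i
    return z, surv

z = [3.1, 1.2, 2.4, 4.0, 2.9]
delta = [1, 1, 0, 1, 1]                              # Delta_i = 0: censored
times, surv = kaplan_meier(z, delta)
print(dict(zip(times, np.round(surv, 3))))
```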

Example 5.5 Suppose $T$ is a survival time and $Z$ is a covariate. Assume that the conditional distribution of $T$ given $Z$ has conditional hazard function
\[
\lambda(t|Z) = \lambda(t)e^{\theta'Z}.
\]
Then the likelihood function from $n$ i.i.d. observations $(T_i,Z_i)$, $i = 1,\ldots,n$, is given by
\[
L_n(\theta,\Lambda) = \prod_{i=1}^n \left[\lambda(T_i)e^{\theta'Z_i}\exp\left\{-\Lambda(T_i)e^{\theta'Z_i}\right\}f(Z_i)\right].
\]
Note that $f(Z_i)$ is not informative about $\theta$ or $\lambda$, so it can be discarded from the likelihood. Again, we replace $\lambda(T_i)$ by the jump size $\Lambda\{T_i\}$ and obtain the modified function
\[
\tilde{L}_n(\theta,\Lambda) = \prod_{i=1}^n \left[\Lambda\{T_i\}e^{\theta'Z_i}\exp\left\{-\Lambda(T_i)e^{\theta'Z_i}\right\}\right].
\]
Letting $p_i = \Lambda\{T_i\}$, we maximize
\[
\prod_{i=1}^n p_i e^{\theta'Z_i}\exp\left\{-\Big(\sum_{T_j\le T_i}p_j\Big)e^{\theta'Z_i}\right\},
\]
or its logarithm
\[
\sum_{i=1}^n \left\{\theta'Z_i + \log p_i - e^{\theta'Z_i}\sum_{T_j\le T_i}p_j\right\}.
\]
Differentiating with respect to $p_i$, we obtain
\[
p_i = \frac{1}{\sum_{T_j\ge T_i}e^{\theta'Z_j}}.
\]
After substituting this back into $\log\tilde{L}_n(\theta,\Lambda)$, we find that $\hat\theta_n$ maximizes
\[
\log\prod_{i=1}^n \frac{e^{\theta'Z_i}}{\sum_{T_j\ge T_i}e^{\theta'Z_j}}.
\]
The function inside the logarithm is called Cox's partial likelihood for $\theta$. The consistency and asymptotic efficiency of $\hat\theta_n$ have been well studied since Cox (1972) proposed this estimator, with help from martingale process theory.
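A sketch of maximizing the partial likelihood by a Newton iteration for a single covariate (our own illustration on simulated data, with no ties and no censoring, matching the setting of the example):

```python
import numpy as np

# Newton iteration for Cox's partial likelihood with one covariate.

def partial_score_info(theta, t, z):
    order = np.argsort(t)
    t, z = t[order], z[order]
    e = np.exp(theta * z)
    # risk-set sums S0_i = sum_{T_j >= T_i} e_j (and S1, S2) via reverse cumsum
    s0 = np.cumsum(e[::-1])[::-1]
    s1 = np.cumsum((z * e)[::-1])[::-1]
    s2 = np.cumsum((z**2 * e)[::-1])[::-1]
    score = np.sum(z - s1 / s0)
    info = np.sum(s2 / s0 - (s1 / s0) ** 2)   # minus the second derivative
    return score, info

rng = np.random.default_rng(3)
n = 3000
z = rng.binomial(1, 0.5, n).astype(float)
t = rng.exponential(1.0, n) * np.exp(-1.0 * z)   # hazard e^{theta*z}, theta=1

theta = 0.0
for _ in range(20):
    score, info = partial_score_info(theta, t, z)
    theta += score / info
print(theta)   # close to the true value 1
```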


Example 5.6 We consider $X_1,\ldots,X_n$ i.i.d. $F$ and $Y_1,\ldots,Y_n$ i.i.d. $G$. We only observe $(Y_i,\Delta_i)$, where $\Delta_i = I(X_i \le Y_i)$, for $i = 1,\ldots,n$. Such data are one type of interval-censored data (also called current status data). The likelihood for the observations is
\[
\prod_{i=1}^n F(Y_i)^{\Delta_i}(1-F(Y_i))^{1-\Delta_i}g(Y_i).
\]
To derive the NPMLE for $F$ and $G$, we instead maximize
\[
\prod_{i=1}^n P_i^{\Delta_i}(1-P_i)^{1-\Delta_i}q_i,
\]
subject to the constraints $\sum_i q_i = 1$ and $0 \le P_i \le 1$ nondecreasing in $Y_i$. Clearly, $q_i = 1/n$ (assuming the $Y_i$ are all distinct). The constrained maximization over the $P_i$ turns out to be solved by the following steps:
(i) Plot the points $(i, \sum_{Y_j\le Y_i}\Delta_j)$, $i = 1,\ldots,n$, with the $Y_i$ arranged in increasing order. This is called the cumulative sum diagram.
(ii) Form $H^*(t)$, the greatest convex minorant of the cumulative sum diagram.
(iii) Let $\hat{P}_i$ be the left derivative of $H^*$ at $i$.
Then $(\hat{P}_1,\ldots,\hat{P}_n)$ maximizes the objective function. Groeneboom and Wellner (1992) show that if $f(t), g(t) > 0$, then
\[
n^{1/3}(\hat{F}_n(t) - F(t)) \to_d \left(\frac{F(t)(1-F(t))f(t)}{2g(t)}\right)^{1/3}(2Z),
\]
where $Z$ is the location of the maximum of the process $\{B(t) - t^2: t\in\mathbb{R}\}$ and $B(t)$ is standard two-sided Brownian motion starting from 0.
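The greatest-convex-minorant step can be computed with the pool-adjacent-violators algorithm; a minimal sketch (our own, on simulated current status data):

```python
import numpy as np

# NPMLE for current status data: isotonic regression of the indicators
# sorted by observation time, i.e., the left derivative of the greatest
# convex minorant of the cumulative sum diagram.

def current_status_npmle(y, delta):
    order = np.argsort(y)
    d = np.asarray(delta, dtype=float)[order]
    # pool-adjacent-violators: each block keeps (sum of deltas, count)
    sums, counts = list(d), [1] * len(d)
    i = 0
    while i < len(sums) - 1:
        if sums[i] / counts[i] > sums[i + 1] / counts[i + 1]:
            sums[i] += sums.pop(i + 1)       # pool the violating pair
            counts[i] += counts.pop(i + 1)
            i = max(i - 1, 0)                # re-check the previous block
        else:
            i += 1
    f_hat = np.repeat([s / c for s, c in zip(sums, counts)], counts)
    return np.asarray(y, dtype=float)[order], f_hat

rng = np.random.default_rng(4)
n = 200
x = rng.exponential(1.0, n)          # unobserved event times, F = Exp(1)
y = rng.exponential(1.0, n)          # observation times
times, f_hat = current_status_npmle(y, (x <= y).astype(int))
print(f_hat[:10])                    # nondecreasing estimate of F at sorted Y
```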

In summary, the NPMLE is a generalization of maximum likelihood estimation to semiparametric and nonparametric models. We have seen that in this generalization we often replace the functional parameter by an empirical function with jumps only at the observed data points and maximize a modified likelihood function. However, both the computation and the asymptotic properties of the NPMLE can be difficult and vary across specific problems.

5.6 Alternative Efficient Estimation

Although maximum likelihood estimation is the most popular way of obtaining an asymptotically efficient estimator, there are alternative ways of deriving efficient estimators. Among them, one-step efficient estimation is the simplest.

In one-step efficient estimation, we assume that a strongly consistent estimator of $\theta$, denoted $\tilde\theta_n$, is available and that $|\tilde\theta_n - \theta_0| = O_p(n^{-1/2})$. The one-step procedure is essentially a single Newton-Raphson iteration for solving the likelihood score equation; that is, we define
\[
\hat\theta_n = \tilde\theta_n - \left\{\ddot{l}_n(\tilde\theta_n)\right\}^{-1}\dot{l}_n(\tilde\theta_n),
\]
where $\dot{l}_n(\theta)$ is the score function of the observed log-likelihood and $\ddot{l}_n(\theta)$ is its derivative. The next theorem shows that $\hat\theta_n$ is an asymptotically efficient estimator.


Theorem 5.6 Let $l_\theta(X)$ be the log-likelihood function of $\theta$. Assume that there exists a neighborhood of $\theta_0$ in which $|l^{(3)}_\theta(X)| \le F(X)$ with $E[F(X)] < \infty$. Then
\[
\sqrt{n}(\hat\theta_n - \theta_0) \to_d N(0, I(\theta_0)^{-1}),
\]
where $I(\theta_0)$ is the Fisher information. †

Proof Since $\tilde\theta_n \to_{a.s.} \theta_0$, we perform a Taylor expansion on the right-hand side of the one-step equation and obtain
\[
\hat\theta_n = \tilde\theta_n - \left\{\ddot{l}_n(\tilde\theta_n)\right\}^{-1}\left\{\dot{l}_n(\theta_0) + \ddot{l}_n(\theta^*)(\tilde\theta_n - \theta_0)\right\},
\]
where $\theta^*$ is between $\tilde\theta_n$ and $\theta_0$. Therefore,
\[
\hat\theta_n - \theta_0 = \left[I - \left\{\ddot{l}_n(\tilde\theta_n)\right\}^{-1}\ddot{l}_n(\theta^*)\right](\tilde\theta_n - \theta_0) - \left\{\ddot{l}_n(\tilde\theta_n)\right\}^{-1}\dot{l}_n(\theta_0).
\]
On the other hand, by the condition $|l^{(3)}_\theta(X)| \le F(X)$ with $E[F(X)] < \infty$, we know
\[
\frac{1}{n}\ddot{l}_n(\theta^*) \to_{a.s.} E[\ddot{l}_{\theta_0}(X)], \qquad \frac{1}{n}\ddot{l}_n(\tilde\theta_n) \to_{a.s.} E[\ddot{l}_{\theta_0}(X)].
\]
Thus,
\[
\hat\theta_n - \theta_0 = o_p(|\tilde\theta_n - \theta_0|) - \left\{E[\ddot{l}_{\theta_0}(X)] + o_p(1)\right\}^{-1}\frac{1}{n}\dot{l}_n(\theta_0),
\]
so
\[
\sqrt{n}(\hat\theta_n - \theta_0) = o_p(1) - \left\{E[\ddot{l}_{\theta_0}(X)] + o_p(1)\right\}^{-1}\frac{1}{\sqrt{n}}\dot{l}_n(\theta_0) \to_d N(0, I(\theta_0)^{-1}).
\]
We have proved that $\hat\theta_n$ is asymptotically efficient. †

Remark 5.1 Conditions different from those of Theorem 5.6 can also be used to ensure the asymptotic efficiency of $\hat\theta_n$; here we have presented a simple set. Additionally, since $n^{-1}\ddot{l}_n(\tilde\theta_n)$ approximates $-I(\theta_0)$ and the latter can be estimated by $-I(\tilde\theta_n)$, we sometimes use a slightly different one-step update:
\[
\hat\theta_n = \tilde\theta_n + \left\{nI(\tilde\theta_n)\right\}^{-1}\dot{l}_n(\tilde\theta_n).
\]
One can recognize that this is in fact one iteration of the Fisher scoring algorithm. Another efficient estimator arises from Bayesian estimation, where it can be shown that under regularity conditions on the prior distribution, the posterior mode is asymptotically equivalent to the maximum likelihood estimator. We will not pursue this method here.
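A minimal sketch of both one-step updates for a Cauchy location model (an illustrative choice, not from the notes), starting from the sample median as the preliminary root-n consistent estimator:

```python
import numpy as np

# One-step efficient estimation for the Cauchy location model, where
# the Fisher information per observation is I(theta) = 1/2.

def score_and_hessian(theta, x):
    u = x - theta
    score = np.sum(2 * u / (1 + u**2))                   # l'_n(theta)
    hess = np.sum((2 * u**2 - 2) / (1 + u**2) ** 2)      # l''_n(theta)
    return score, hess

rng = np.random.default_rng(5)
x = rng.standard_cauchy(4000) + 1.5                      # true location 1.5

theta_tilde = np.median(x)                               # preliminary estimator
score, hess = score_and_hessian(theta_tilde, x)
theta_newton = theta_tilde - score / hess                # Newton-type one-step
theta_scoring = theta_tilde + score / (0.5 * len(x))     # Fisher scoring form
print(theta_tilde, theta_newton, theta_scoring)
```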

In summary, efficient estimation is one of the most important goals of statistical inference. The maximum likelihood approach provides a natural and simple way of deriving an efficient estimator. However, when the maximum likelihood approach is not feasible, for example when the maximum likelihood estimator does not exist or the computation is difficult, other estimation approaches such as one-step estimation or Bayesian estimation may be considered. So far, we have focused only on parametric models. When the model is specified semiparametrically or nonparametrically, the maximum likelihood estimator or the Bayes estimator usually does not exist because of the presence of infinite-dimensional parameters. In this case, some approximate likelihood approaches have been developed, one of which is the nonparametric maximum likelihood approach (sometimes called the empirical likelihood approach) given in Section 5.5. Other approaches include the partial likelihood approach, the sieve likelihood approach, the penalized likelihood approach, etc. These topics need another full text to describe and are deferred to a future course.

READING MATERIALS: You should read Ferguson, Sections 16-20, and Lehmann and Casella, Sections 6.2-6.7.

PROBLEMS

We need the following definitions to answer the given problems.

Definition 5.2 Let $T_n$ and $\tilde{T}_n$ be two sequences of estimators of $\theta$. Suppose
\[
\sqrt{n}(T_n - \theta) \to_d N(0,\sigma^2), \qquad \sqrt{n}(\tilde{T}_n - \theta) \to_d N(0,\tilde\sigma^2).
\]
The asymptotic relative efficiency (ARE) of $\tilde{T}_n$ with respect to $T_n$ is defined as $r = \sigma^2/\tilde\sigma^2$. Intuitively, $r$ can be understood as follows: to achieve the same accuracy in estimating $\theta$, using the estimator $\tilde{T}_n$ requires approximately $1/r$ times as many observations as using the estimator $T_n$. Thus, if $r > 1$, $\tilde{T}_n$ is more efficient than $T_n$, and vice versa.

Definition 5.3 If $\delta_0$ and $\delta_1$ are statistics, then the random interval $(\delta_0,\delta_1)$ is called a $(1-\alpha)$-confidence interval for $g(\theta)$ if
\[
P_\theta\left(g(\theta) \in (\delta_0,\delta_1)\right) \ge 1-\alpha.
\]
Intuitively, the above inequality says that however the data are generated, there is at least probability $(1-\alpha)$ that the interval contains the true value $g(\theta)$. Similarly, a random set $S$ constructed from the data is called a $(1-\alpha)$-confidence region for $g(\theta)$ if
\[
P_\theta\left(g(\theta) \in S\right) \ge 1-\alpha.
\]
If $(\delta_0,\delta_1)$ and $S$ change with the sample size $n$ and the above inequalities hold in the limit, then $(\delta_0,\delta_1)$ and $S$ are called approximate $(1-\alpha)$-confidence intervals and confidence regions, respectively.

1. Suppose that $(X_1,Y_1),\ldots,(X_n,Y_n)$ are i.i.d. with bivariate normal distribution $N_2(\mu,\Sigma)$, where $\mu = (\mu_1,\mu_2)' \in \mathbb{R}^2$ and
\[
\Sigma = \begin{pmatrix}\sigma^2 & \rho\sigma\tau \\ \rho\sigma\tau & \tau^2\end{pmatrix},
\]
with $\sigma^2 > 0$, $\tau^2 > 0$ and $\rho \in (-1,1)$.

(a) If we assume that $\mu_1 = \mu_2 = \theta$ and $\Sigma$ is known, what is the maximum likelihood estimator of $\theta$?

(b) If we assume that $\mu$ is known and $\sigma^2 = \tau^2 = \theta$, what is the maximum likelihood estimator of $(\theta,\rho)$?

(c) What is the asymptotic distribution of the estimator you found in (b)?

2. Let $X_1,\ldots,X_n$ be i.i.d. with common density
\[
f_\theta(x) = \frac{\theta}{(1+x)^{\theta+1}}I(x > 0), \qquad \theta > 0.
\]

(a) Find the maximum likelihood estimator of $\theta$, denoted $\hat\theta_n$, and give the limiting distribution of $\sqrt{n}(\hat\theta_n - \theta)$.

(b) Find a function $g$ such that, regardless of the value of $\theta$, $\sqrt{n}(g(\hat\theta_n) - g(\theta)) \to_d N(0,1)$.

(c) Construct an approximate $(1-\alpha)$-confidence interval based on (b).

3. Suppose $X$ has a standard exponential distribution with density $f(x) = e^{-x}I(x > 0)$, and given $X = x$, $Y$ has a Poisson distribution with mean $\lambda x$.

(a) Determine the marginal mass function of $Y$. Find $E[Y]$ and $\mathrm{Var}(Y)$ without using the mass function of $Y$.

(b) Give a lower bound for the variance of an unbiased estimator of $\lambda$ based on $X$ and $Y$.

(c) Suppose $(X_1,Y_1),\ldots,(X_n,Y_n)$ are i.i.d., with each pair having the same joint distribution as $(X,Y)$. Let $\hat\lambda_n$ be the maximum likelihood estimator based on these data, and let $\tilde\lambda_n$ be the maximum likelihood estimator based on $Y_1,\ldots,Y_n$ only. Determine the asymptotic relative efficiency of $\tilde\lambda_n$ with respect to $\hat\lambda_n$.

4. Suppose that $X_1,\ldots,X_n$ are i.i.d. with density function $p_\theta(x)$, $\theta \in \Theta \subset \mathbb{R}^k$. Denote $l_\theta(x) = \log p_\theta(x)$. Assume $l_\theta(x)$ is three times differentiable with respect to $\theta$ and its third derivatives are bounded by $M(x)$, where $\sup_\theta E_\theta[M(X)] < \infty$. Let $\hat\theta_n$ be the maximum likelihood estimator of $\theta$ and assume $\sqrt{n}(\hat\theta_n - \theta) \to_d N(0, I_\theta^{-1})$, where $I_\theta$ denotes the Fisher information at $\theta$ and is assumed to be nonsingular.

(a) To estimate the asymptotic variance of $\sqrt{n}(\hat\theta_n - \theta)$, one proposes the estimator $\hat{I}_n^{-1}$, where
\[
\hat{I}_n = -\frac{1}{n}\sum_{i=1}^n \ddot{l}_{\hat\theta_n}(X_i).
\]
Prove that $\hat{I}_n^{-1}$ is a consistent estimator of $I_\theta^{-1}$.

(b) Show that
\[
\sqrt{n}\,\hat{I}_n^{1/2}(\hat\theta_n - \theta) \to_d N(0, I_{k\times k}),
\]
where $\hat{I}_n^{1/2}$ is the square-root matrix of $\hat{I}_n$ and $I_{k\times k}$ is the $k\times k$ identity matrix. From this approximation, construct an approximate $(1-\alpha)$-confidence region for $\theta$.

(c) Let $l_n(\theta) = \sum_{i=1}^n l_\theta(X_i)$. Perform a Taylor expansion of $-2(l_n(\theta) - l_n(\hat\theta_n))$ (the likelihood ratio statistic) at $\hat\theta_n$ and show that
\[
-2(l_n(\theta) - l_n(\hat\theta_n)) \to_d \chi^2_k.
\]
From this result, construct an approximate $(1-\alpha)$-confidence region for $\theta$.

5. Human beings can be classified into one of four blood groups (phenotypes): O, A, B, AB. The inheritance of blood groups is controlled by three genes, O, A and B, of which O is recessive to A and B. If $r$, $p$, $q$ are the gene probabilities of O, A, B in the population ($r+p+q = 1$), the probabilities of the six possible combinations (genotypes) under random mating (where two individuals drawn at random from the population contribute one gene each) are shown in the following table:

  Phenotype   Genotype   Probability
  O           OO         $r^2$
  A           AA         $p^2$
  A           AO         $2rp$
  B           BB         $q^2$
  B           BO         $2rq$
  AB          AB         $2pq$

We observe among $N$ individuals the phenotype frequencies $N_O$, $N_A$, $N_B$, $N_{AB}$ and wish to estimate the gene probabilities from such data. A simple approach is to regard the observations as incomplete data, the complete data set being the genotype frequencies $N_{OO}$, $N_{AA}$, $N_{AO}$, $N_{BB}$, $N_{BO}$, $N_{AB}$.

(a) Derive the EM algorithm for estimation of $(p, q, r)$.

(b) Suppose that we observe $N_O = 176$, $N_A = 182$, $N_B = 60$, $N_{AB} = 17$. Use the EM algorithm to calculate the maximum likelihood estimator of $(p, q, r)$, with starting value $p = q = r = 1/3$, stopping the iteration once the maximal difference between the new estimates and the previous ones is less than $10^{-4}$.

6. Suppose that $X$ has a density function $f(x)$ and that given $X = x$, $Y \sim N(\beta x, \sigma^2)$. Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be i.i.d. observations with the same distribution as $(X,Y)$. In many applications, however, not all of the $X$'s are observable; we assume that $X_{m+1},\ldots,X_n$ are missing for some $1 < m < n$ and that the missingness satisfies the MAR assumption. Then the observed likelihood function is
\[
\prod_{i=1}^m\left[f(X_i)\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(Y_i-\beta X_i)^2}{2\sigma^2}\right\}\right]\times
\prod_{i=m+1}^n\int_x\left[f(x)\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(Y_i-\beta x)^2}{2\sigma^2}\right\}\right]dx.
\]
Suppose that the observed values of the $X$'s are distinct. We want to calculate the NPMLE for $\beta$ and $\sigma^2$. To do so, we "assume" that $X$ has point mass $p_i > 0$ only at the observed values $X_i = x_i$, $i = 1,\ldots,m$.

(a) Rewrite the likelihood function using $\beta$, $\sigma^2$ and $p_1,\ldots,p_m$.

(b) Write out the score equations for all the parameters.

(c) A simple approach to calculating the NPMLE is to use the EM algorithm, where $X_{m+1},\ldots,X_n$ are treated as missing data. Derive the EM algorithm. Hint: $X_i$, $i = m+1,\ldots,n$, can only take the values $x_1,\ldots,x_m$ with probabilities $p_1,\ldots,p_m$.

7. Ferguson, pages 117-118, problems 1-3

8. Ferguson, pages 124-125, problems 1-7

9. Ferguson, page 131, problem 1

10. Ferguson, page 139, problems 1-4

11. Lehmann and Casella, pages 501-514, problems 3.1-7.34