
APM 421 Probability Theory: Miscellaneous Topics

Jay Taylor, Fall 2013


Outline

1 Transformations

2 Moment Generating Functions

3 The Moment Problem

4 The Normal Distribution

5 The Central Limit Theorem


Transformations

Transformations of Random Variables

Our focus in this section is on transformations of random variables, which play an important role throughout probability and statistics. We begin by describing how the distribution of a random variable changes under an arbitrary transformation.

Theorem. Suppose that X is a random variable with distribution P_X on U and let f : U → V be a function. Then Y = f(X) is a random variable with distribution P_Y = P_X ◦ f⁻¹ on V.

Proof: If B ⊂ V is a subset of V, then
\[
P(Y \in B) = P(f(X) \in B) = P(X \in f^{-1}(B)) = P_X(f^{-1}(B)).
\]


The following result shows how the probability mass function of a discrete random variable changes under transformation.

Theorem. Suppose that X is a discrete random variable with probability mass function p_X on a set U = {x₁, x₂, ...}. Then, given any function f : U → V, the variable Y = f(X) is also discrete with probability mass function
\[
p_Y(y) = \sum_{x \in f^{-1}(y)} p_X(x), \qquad y \in V.
\]

Proof: We have
\[
p_Y(y) = P(Y = y) = P(f(X) = y) = P(X \in f^{-1}(y)) = \sum_{x \in f^{-1}(y)} P(X = x) = \sum_{x \in f^{-1}(y)} p_X(x).
\]


Example: Suppose that X is uniformly distributed on the set U = {−5, −4, ..., 4, 5} and let Y = |X| be the absolute value of X. Since U contains 11 elements, it follows that the probability mass function of X is
\[
p_X(k) = \frac{1}{11}, \qquad -5 \le k \le 5.
\]
Since the transformation f(k) = |k| is two-to-one when |k| > 0 and one-to-one when k = 0, it follows that P(Y = 0) = P(X = 0) = 1/11, while
\[
P(Y = k) = P(X = k) + P(X = -k) = \frac{2}{11}
\]
for k = 1, ..., 5. Consequently, the probability mass function of Y is
\[
p_Y(k) = \begin{cases} 1/11 & \text{if } k = 0 \\ 2/11 & \text{if } 1 \le k \le 5. \end{cases}
\]
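
The theorem above is also an algorithm: to get the pmf of Y = f(X), sum p_X over each preimage. A minimal Python sketch of this computation, using the example just given (names here are illustrative, not from the slides):

```python
from collections import defaultdict

p_X = {k: 1/11 for k in range(-5, 6)}   # uniform pmf on {-5, ..., 5}

def pmf_of_transform(p_X, f):
    """Return the pmf of Y = f(X): p_Y(y) = sum of p_X(x) over x with f(x) = y."""
    p_Y = defaultdict(float)
    for x, p in p_X.items():
        p_Y[f(x)] += p
    return dict(p_Y)

p_Y = pmf_of_transform(p_X, abs)
print(p_Y)   # expect p_Y(0) = 1/11 and p_Y(k) = 2/11 for k = 1, ..., 5
```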


The cumulative distribution function plays an important role when we study transformations of continuous random variables.

Example: Let X be a non-negative random variable with CDF F_X and let Y = Xⁿ, where n ≥ 1. Then, for any x ≥ 0,
\[
F_Y(x) = P(Y \le x) = P(X^n \le x) = P(X \le x^{1/n}) = F_X(x^{1/n}),
\]
while for any x < 0, we have F_Y(x) = 0. Similarly, if X has a density p_X = F′_X(x), then Y also has a density given by
\[
p_Y(x) = \frac{d}{dx} F_Y(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{n}\, x^{1/n - 1}\, p_X(x^{1/n}) & \text{if } x \ge 0. \end{cases}
\]


Example: Let X be a continuous real-valued random variable with density p_X(x) = F′_X(x) and let Y = X². Then, for x ≥ 0,
\[
F_Y(x) = P\{Y \le x\} = P\{X^2 \le x\} = P\{-\sqrt{x} \le X \le \sqrt{x}\} = F_X(\sqrt{x}) - F_X(-\sqrt{x}),
\]
while for any x < 0, we have F_Y(x) = 0. In this case, the density of Y is given by
\[
p_Y(x) = \frac{d}{dx} F_Y(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{2\sqrt{x}}\left(p_X(\sqrt{x}) + p_X(-\sqrt{x})\right) & \text{if } x \ge 0. \end{cases}
\]


Next we have a pair of theorems which show how monotonic transformations act on continuous random variables.

Theorem. Suppose that X is a continuous real-valued random variable with cumulative distribution function F_X and density p_X, and let Y = g(X), where g : R → R is a strictly increasing differentiable function. Then Y is a continuous random variable with cumulative distribution function
\[
F_Y(y) = \begin{cases} 0 & \text{if } y \le \inf(g(\mathbb{R})) \\ F_X(g^{-1}(y)) & \text{if } y \in g(\mathbb{R}) \\ 1 & \text{if } y \ge \sup(g(\mathbb{R})) \end{cases}
\]
and density
\[
p_Y(y) = \begin{cases} p_X(g^{-1}(y))\, \frac{d g^{-1}(y)}{dy} & \text{if } y \in g(\mathbb{R}) \\ 0 & \text{if } y \notin g(\mathbb{R}). \end{cases}
\]


Proof: By the definition of the c.d.f., we have
\[
F_Y(y) = P(Y \le y) = P(g(X) \le y).
\]
There are then three cases to be considered according to the value of y:

Case 1: If y ∈ g(R), then because g is order-preserving,
\[
P(g(X) \le y) = P(X \le g^{-1}(y)) = F_X(g^{-1}(y)).
\]
Case 2: If y ≤ inf(g(R)), then
\[
P(g(X) \le y) = P(\emptyset) = 0.
\]
Case 3: If y ≥ sup(g(R)), then
\[
P(g(X) \le y) = P(X \in \mathbb{R}) = 1.
\]
We can then calculate the density of Y by differentiating F_Y(y).


Theorem. Let X and Y = g(X) be as in the previous theorem, but now suppose that g : R → R is strictly decreasing and differentiable. Then Y is a continuous random variable with cumulative distribution function
\[
F_Y(y) = \begin{cases} 0 & \text{if } y \le \inf(g(\mathbb{R})) \\ 1 - F_X(g^{-1}(y)) & \text{if } y \in g(\mathbb{R}) \\ 1 & \text{if } y \ge \sup(g(\mathbb{R})) \end{cases}
\]
and density
\[
p_Y(y) = \begin{cases} -p_X(g^{-1}(y))\, \frac{d g^{-1}(y)}{dy} & \text{if } y \in g(\mathbb{R}) \\ 0 & \text{if } y \notin g(\mathbb{R}). \end{cases}
\]

Remark: Because g is strictly decreasing, so is g⁻¹, and therefore the derivative dg⁻¹(y)/dy is negative for all values of y in the range of g. This shows that the expression for the density of Y is non-negative at all such y.


Proof: The proof is similar to that of the preceding theorem, but with a twist. Because g is strictly decreasing, g is order-reversing, i.e., if x < y, then g(x) > g(y). Consequently, if y ∈ g(R), then
\[
\begin{aligned}
F_Y(y) = P(Y \le y) &= P(g(X) \le y) \\
&= P(X \ge g^{-1}(y)) \\
&= 1 - P(X < g^{-1}(y)) \\
&= 1 - P(X \le g^{-1}(y)) \\
&= 1 - F_X(g^{-1}(y)),
\end{aligned}
\]
where P(X < g⁻¹(y)) = P(X ≤ g⁻¹(y)) because X is continuous and so P(X = g⁻¹(y)) = 0.


Exercise: Suppose that X is exponentially distributed with parameter λ and let Y = 1/X. Find the cumulative distribution function and density of Y.
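
If you want to check a proposed answer numerically, here is a hedged sketch (assuming numpy is available): applying the decreasing-transformation theorem to g(x) = 1/x on (0, ∞) gives F_Y(y) = e^{−λ/y} for y > 0, and the code compares that formula with the empirical CDF of simulated values of Y = 1/X.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
X = rng.exponential(scale=1/lam, size=100_000)  # X ~ Exponential(lambda)
Y = 1/X

for y in [0.1, 0.5, 1.0, 5.0]:
    empirical = np.mean(Y <= y)         # empirical CDF of Y at y
    analytic = np.exp(-lam/y)           # candidate F_Y(y) = exp(-lambda/y)
    print(f"y={y}: empirical {empirical:.4f}, analytic {analytic:.4f}")
```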


Theorem. Suppose that U is a uniform random variable on (0, 1) and let X be a real-valued random variable with a strictly increasing continuous CDF F_X(·). Then the random variable Y = F_X⁻¹(U) has the same distribution as X, while the random variable Z = F_X(X) is uniformly distributed on (0, 1).

Proof: First consider the distribution of Y:
\[
F_Y(x) = P\{Y \le x\} = P\{F_X^{-1}(U) \le x\} = P\{U \le F_X(x)\} = F_X(x),
\]
since F_X(x) ∈ [0, 1] for all x and P{U ≤ y} = y whenever y ∈ [0, 1]. This shows that Y and X have the same distribution.


Similarly, the CDF of Z = F_X(X) is
\[
F_Z(x) = P\{Z \le x\} = P\{F_X(X) \le x\} = P\{X \le F_X^{-1}(x)\} = F_X(F_X^{-1}(x)) = x,
\]
for any x ∈ [0, 1], which shows that Z is a uniform random variable on [0, 1].

Remark: This result is at the heart of one of the most basic algorithms for generating random numbers with a specified distribution. Suppose that we wish to generate a sequence of i.i.d. random numbers with the same distribution as some random variable X, but that we only have access to a stream of independent standard uniform random variables, U₁, U₂, .... If X has a continuous and strictly increasing c.d.f., F_X, then we can generate a sequence of independent random variables X₁, X₂, ..., each having the same distribution as X, by setting Xᵢ = F_X⁻¹(Uᵢ).
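
A minimal sketch of this inverse-transform algorithm (assuming numpy is available; the helper name is illustrative). The exponential case, with F_X⁻¹(u) = −ln(1 − u)/λ as derived in the example that follows, serves as the usage example.

```python
import numpy as np

def inverse_transform_sample(F_inv, n, rng):
    """Generate n i.i.d. draws with CDF F by applying F^{-1} to uniforms."""
    U = rng.uniform(0.0, 1.0, size=n)
    return F_inv(U)

rng = np.random.default_rng(1)
lam = 3.0
samples = inverse_transform_sample(lambda u: -np.log(1 - u)/lam, 100_000, rng)
print(samples.mean())   # should be close to 1/lambda ≈ 0.333
```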


Example: Recall that the CDF of the exponential distribution with parameter λ is
\[
F_X(x) = 1 - e^{-\lambda x}.
\]
A simple calculation shows that
\[
F_X^{-1}(x) = -\frac{1}{\lambda} \ln(1 - x),
\]
and so it follows that if U is uniform on [0, 1], then
\[
Y = -\frac{1}{\lambda} \ln(1 - U)
\]
is exponentially distributed with parameter λ. In fact, because the distribution of 1 − U is also uniform on [0, 1], the random variable
\[
Y' = -\frac{1}{\lambda} \ln(U)
\]
is also exponentially distributed with parameter λ.


Multivariate Transformations

It is also useful to consider functions of random vectors. However, before we can describe the statistics of such transformations, we need to recall some ideas from vector calculus. We begin with a definition.

Definition. Suppose that f = (f₁, ..., f_m) : Rⁿ → R^m is a differentiable function. Then the Jacobian matrix of f at x ∈ Rⁿ is the m by n matrix
\[
J_f(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(x) & \cdots & \frac{\partial f_1}{\partial x_n}(x) \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1}(x) & \cdots & \frac{\partial f_m}{\partial x_n}(x) \end{bmatrix}.
\]

Interpretation: The Jacobian matrix of a differentiable function f : Rⁿ → R^m at x is the best linear approximation to f at x in the sense that
\[
\lim_{h \to 0} \frac{\| f(x + h) - f(x) - J_f(x)h \|}{\| h \|} = 0.
\]


The Jacobian plays an important role in multivariate integration. The following theorem describes how to make a change of variables in an integral over an n-dimensional region. Recall that an n by n matrix A is said to be non-singular if its determinant, det(A), is not equal to 0. Furthermore, a matrix is invertible if and only if it is non-singular.

Theorem. Let U and V be open subsets of Rⁿ and suppose that φ : U → V is a differentiable mapping with differentiable inverse φ⁻¹ : V → U and that the Jacobian matrices of both φ and φ⁻¹ are non-singular at every point in their respective domains. Then, if f : U → R is a bounded, continuous function,
\[
\int_U f(u)\,du = \int_V f(\varphi^{-1}(v))\,\left|\det\!\left(J_\varphi(\varphi^{-1}(v))\right)\right|^{-1} dv = \int_V f(\varphi^{-1}(v))\,\left|\det\!\left(J_{\varphi^{-1}}(v)\right)\right| dv.
\]

Remark: The two Jacobians are related by the identity J_φ(u)⁻¹ = J_{φ⁻¹}(φ(u)).


Theorem. Suppose that X₁, ..., X_n are jointly continuous with joint density p_X(x₁, ..., x_n) on an open set U ⊂ Rⁿ and let φ : U → V ⊂ Rⁿ be a differentiable mapping with differentiable inverse φ⁻¹ : V → U. Assume that the Jacobian matrices of both φ and φ⁻¹ are non-singular. Then the random variables Y₁ = φ₁(X), ..., Y_n = φ_n(X) are jointly continuous with joint density
\[
p_{\mathbf{Y}}(y) = p_{\mathbf{X}}(\varphi^{-1}(y))\,\left|\det\!\left(J_\varphi(\varphi^{-1}(y))\right)\right|^{-1} = p_{\mathbf{X}}(\varphi^{-1}(y))\,\left|\det\!\left(J_{\varphi^{-1}}(y)\right)\right|
\]
for y ∈ V.

Proof (sketch): If B ⊂ V is an open set and A = φ⁻¹(B) ⊂ U, then by the change-of-variables formula
\[
P(\mathbf{X} \in A) = \int_A p_{\mathbf{X}}(u)\,du = \int_B p_{\mathbf{X}}(\varphi^{-1}(v))\,\left|\det\!\left(J_\varphi(\varphi^{-1}(v))\right)\right|^{-1} dv.
\]


However, we also know that
\[
P(\mathbf{X} \in A) = P(\varphi(\mathbf{X}) \in \varphi(A)) = P(\mathbf{Y} \in B) = \int_B p_{\mathbf{Y}}(y)\,dy.
\]
Since both expressions are equal to P(X ∈ A), it follows that
\[
\int_B p_{\mathbf{Y}}(y)\,dy = \int_B p_{\mathbf{X}}(\varphi^{-1}(v))\,\left|\det\!\left(J_\varphi(\varphi^{-1}(v))\right)\right|^{-1} dv,
\]
and since this identity holds for every open set B ⊂ V, we can conclude that
\[
p_{\mathbf{Y}}(y) = p_{\mathbf{X}}(\varphi^{-1}(y))\,\left|\det\!\left(J_\varphi(\varphi^{-1}(y))\right)\right|^{-1},
\]
as claimed. The second identity follows from the relationship between the two Jacobians.


Example: Suppose that X and Y are jointly continuous with joint density function p(x, y) on R², and let R and Θ be the polar coordinates of the random point (X, Y). In this case, the transformation is φ = (r, θ) : R² → [0, ∞) × [0, 2π), with components
\[
r(x, y) = \sqrt{x^2 + y^2}, \qquad \theta(x, y) = \tan^{-1}(y/x),
\]
while the inverse transformation φ⁻¹ = (x, y) has components
\[
x(r, \theta) = r\cos(\theta), \qquad y(r, \theta) = r\sin(\theta).
\]
Furthermore, the Jacobian of the inverse map φ⁻¹ is
\[
J_{\varphi^{-1}}(r, \theta) = \begin{bmatrix} \cos(\theta) & -r\sin(\theta) \\ \sin(\theta) & r\cos(\theta) \end{bmatrix},
\]
which has determinant
\[
\det\!\left(J_{\varphi^{-1}}(r, \theta)\right) = r\cos^2(\theta) + r\sin^2(\theta) = r.
\]


It follows that the joint density function of (R, Θ) is
\[
p_{R,\Theta}(r, \theta) = p(r\cos(\theta), r\sin(\theta))\, r.
\]
For example, if (X, Y) is uniformly distributed on the unit disk D, then (R, Θ) takes values in the set
\[
D' = \{(r, \theta) : 0 \le r \le 1,\ 0 \le \theta < 2\pi\}
\]
with joint density
\[
p(r, \theta) = 2r\,\mathbf{1}_{[0,1]}(r) \cdot \frac{1}{2\pi}\,\mathbf{1}_{[0,2\pi)}(\theta).
\]
In particular, this shows that R and Θ are independent and that Θ is uniformly distributed on [0, 2π), while R ∼ Beta(2, 1).
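
A short simulation sketch of these conclusions (assuming numpy is available): draw points uniformly on the disk by rejection from the square, convert to polar coordinates, and check a few moments and probabilities implied by R ∼ Beta(2, 1) and Θ ∼ Uniform[0, 2π).

```python
import numpy as np

rng = np.random.default_rng(2)
# rejection sampling: uniform points in the square, kept if inside the disk
pts = rng.uniform(-1, 1, size=(200_000, 2))
pts = pts[(pts**2).sum(axis=1) <= 1]

R = np.sqrt((pts**2).sum(axis=1))
Theta = np.arctan2(pts[:, 1], pts[:, 0]) % (2*np.pi)

print(np.mean(R))                    # E[R] = 2/3 for Beta(2, 1)
print(np.mean(R <= 0.5))             # P(R <= 1/2) = (1/2)^2 = 0.25
print(np.mean(Theta <= np.pi))       # should be about 0.5
print(np.corrcoef(R, Theta)[0, 1])   # near 0, consistent with independence
```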


Moment Generating Functions

Recall that the probability generating function of a non-negative integer valued random variable X is the function
\[
\varphi_X(s) = E\!\left[s^X\right] = \sum_{n=0}^{\infty} P\{X = n\}\, s^n.
\]

In this section, we will introduce a different kind of generating function which is defined for a larger class of random variables.

Definition. If X is a real-valued random variable, then the moment generating function of X is the function M_X : R → [0, ∞] defined by the formula
\[
M_X(t) = E\!\left[e^{tX}\right].
\]


Examples:

If X ∼ Binomial(n, p), then the moment generating function of X is
\[
M_X(t) = E\!\left[e^{tX}\right] = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} e^{tk} = \left(p e^t + (1 - p)\right)^n.
\]
If X ∼ Poisson(λ), then the moment generating function of X is
\[
M_X(t) = E\!\left[e^{tX}\right] = \sum_{n=0}^{\infty} \frac{e^{-\lambda} \lambda^n}{n!} e^{tn} = \exp\!\left(-\lambda(1 - e^t)\right).
\]
If X ∼ Uniform(a, b), then the moment generating function of X is
\[
M_X(t) = E\!\left[e^{tX}\right] = \int_a^b \frac{e^{tx}\,dx}{b - a} = \frac{e^{tb} - e^{ta}}{t(b - a)}.
\]


The next theorem explains why M_X(t) is called the moment generating function of X. Recall that the n-th moment of a real-valued random variable X is the quantity
\[
m_n \equiv E\left[X^n\right],
\]
provided that the expectation exists. Odd moments need not exist (see example below), but even moments will always exist, although they may be infinite.

Theorem. Suppose that X is a real-valued random variable with moment generating function M_X(t). If M_X is finite on an open neighborhood (−ε, ε) of 0, then all of the moments of X are finite and
\[
M_X(t) = E\!\left[e^{tX}\right] = E\!\left[\sum_{n=0}^{\infty} \frac{X^n}{n!} t^n\right] = \sum_{n=0}^{\infty} E\left[X^n\right] \frac{t^n}{n!}
\]
for t ∈ (−ε, ε). In particular, m_n = M_X^{(n)}(0), the n-th derivative of M_X at 0.


Example: If X ∼ Exponential(λ), then the moment generating function of X is
\[
M_X(t) = E\!\left[e^{tX}\right] = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x}\,dx = \begin{cases} \frac{\lambda}{\lambda - t} & \text{for } t < \lambda \\ \infty & \text{otherwise.} \end{cases}
\]
Although the moment generating function is infinite on part of the real line, it is finite on an open neighborhood of 0, namely on (−∞, λ), and so the theorem tells us that all of the moments of the exponential distribution exist and are finite, and that
\[
M_X(t) = \sum_{n=0}^{\infty} \frac{m_n}{n!} t^n = \frac{\lambda}{\lambda - t} = \sum_{n=0}^{\infty} \lambda^{-n} t^n.
\]
However, since Taylor series expansions are unique, it follows that
\[
m_n = E[X^n] = \frac{n!}{\lambda^n}.
\]


Under certain conditions, the moment generating function determines the distribution of a random variable uniquely.

Theorem. Suppose that X and Y are real-valued random variables with the same moment generating function, i.e.,
\[
M_X(t) = E\!\left[e^{tX}\right] = E\!\left[e^{tY}\right] = M_Y(t)
\]
for every t ∈ R. If these moment generating functions are finite on some open neighborhood of 0, then X and Y have the same distribution.

Remark: The result is not true if we do not require the moment generating function to be finite in some neighborhood of 0.


Moment generating functions are particularly useful when analyzing sums of independent random variables.

Theorem. Suppose that X₁, ..., X_n are independent RVs with moment generating functions M_{X₁}(t), ..., M_{X_n}(t). Then the moment generating function of the sum X = X₁ + ··· + X_n is
\[
M_X(t) = \prod_{i=1}^{n} M_{X_i}(t),
\]
i.e., the moment generating function of a sum of independent RVs is just the product of the moment generating functions of the individual RVs.

Proof: Recalling that the expected value of a product of independent variables is equal to the product of the expectations of these variables, we have
\[
M_X(t) = E\!\left[e^{t(X_1 + \cdots + X_n)}\right] = \prod_{i=1}^{n} E\!\left[e^{tX_i}\right] = \prod_{i=1}^{n} M_{X_i}(t).
\]


Example: Recall that a random variable X is said to have the Gamma distribution with shape parameter α and scale parameter λ if X takes values in [0, ∞) with density
\[
p_X(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}.
\]
Furthermore, since the density must integrate to 1 over the range of the variable, we know that
\[
\int_0^{\infty} x^{\alpha - 1} e^{-\lambda x}\,dx = \frac{\Gamma(\alpha)}{\lambda^{\alpha}}.
\]
In particular, if t < λ, then
\[
\int_0^{\infty} x^{\alpha - 1} e^{-\lambda x} e^{tx}\,dx = \frac{\Gamma(\alpha)}{(\lambda - t)^{\alpha}}.
\]


This shows that the moment generating function of X is
\[
M_X(t) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha - 1} e^{-\lambda x} e^{tx}\,dx = \begin{cases} \left(\frac{\lambda}{\lambda - t}\right)^{\alpha} & \text{for } t < \lambda \\ \infty & \text{otherwise.} \end{cases}
\]
Now, suppose that X₁, ..., X_n are independent exponentially-distributed random variables, each with scale parameter λ. We showed previously that the moment generating function of each such variable is
\[
M_{X_i}(t) = \begin{cases} \frac{\lambda}{\lambda - t} & \text{for } t < \lambda \\ \infty & \text{otherwise.} \end{cases}
\]


Thus, if S = X₁ + ··· + X_n, then according to the theorem the moment generating function of S is
\[
M_S(t) = \prod_{i=1}^{n} M_{X_i}(t) = \left(\frac{\lambda}{\lambda - t}\right)^{n},
\]
provided that t < λ. However, since this is also the moment generating function of a Gamma-distributed random variable with shape parameter α = n and scale parameter λ, it follows that S is itself Gamma-distributed with these parameters.
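
A quick simulation sketch of this conclusion (numpy and scipy assumed available): compare sums of n independent Exponential(λ) draws against the Gamma(n, λ) distribution via a Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, lam = 5, 2.0
S = rng.exponential(scale=1/lam, size=(100_000, n)).sum(axis=1)

# KS comparison with Gamma(shape=n, rate=lam), i.e. scale 1/lam in scipy's convention
print(stats.kstest(S, stats.gamma(a=n, scale=1/lam).cdf))   # large p-value expected
```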


Moment generating functions can also be defined for random vectors.

Definition. Suppose that X = (X₁, ..., X_n) is a random vector with values in Rⁿ. Then the joint moment generating function of X is the function M_X : Rⁿ → [0, ∞] defined by the formula
\[
M_{\mathbf{X}}(t_1, \dots, t_n) = E\!\left[e^{\mathbf{t} \cdot \mathbf{X}}\right] = E\!\left[\prod_{i=1}^{n} e^{t_i X_i}\right].
\]
As in the univariate case, the joint moment generating function determines the joint distribution of X uniquely as long as it is finite in some open neighborhood of 0 ∈ Rⁿ. Furthermore, in this case, all of the joint moments of X are finite and
\[
M_{\mathbf{X}}(\mathbf{t}) = \sum_{k_1 \ge 0, \dots, k_n \ge 0} E\!\left[\prod_{i=1}^{n} X_i^{k_i}\right] \prod_{i=1}^{n} \frac{t_i^{k_i}}{k_i!}.
\]


Joint moment generating functions can sometimes be used to show that a collection of random variables are independent.

Theorem. Suppose that X = (X₁, ..., X_n) is a random vector with values in Rⁿ and that the joint moment generating function M_X(t) is finite in some open neighborhood of 0. Then X₁, ..., X_n are independent if and only if the joint moment generating function is equal to the product of the marginal moment generating functions:
\[
M_{\mathbf{X}}(\mathbf{t}) = \prod_{i=1}^{n} M_{X_i}(t_i)
\]
for all t = (t₁, ..., t_n) in this neighborhood.


Moment generating functions suffer from one major flaw, which is that they may be infinite at every point except the origin. In this case, the m.g.f. carries only very limited information about the distribution of the variable.

Example: Recall that a real-valued random variable X is said to have the Cauchy distribution if the density of X is
\[
p_X(x) = \frac{1}{\pi(1 + x^2)}.
\]
We previously showed that the expectation of a Cauchy random variable does not exist. The moment generating function does exist but is infinite everywhere except at t = 0:
\[
M_X(t) = \begin{cases} 1 & \text{for } t = 0 \\ \infty & \text{otherwise.} \end{cases}
\]


Indeed, if t > 0, then
\[
M_X(t) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{e^{tx}}{1 + x^2}\,dx \ge \frac{1}{\pi} \int_0^{\infty} \frac{e^{tx}}{1 + x^2}\,dx = \infty
\]
since
\[
\lim_{x \to \infty} \frac{e^{tx}}{1 + x^2} = \infty.
\]
Similarly, if t < 0, then
\[
M_X(t) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{e^{tx}}{1 + x^2}\,dx \ge \frac{1}{\pi} \int_{-\infty}^{0} \frac{e^{tx}}{1 + x^2}\,dx = \infty
\]
since
\[
\lim_{x \to -\infty} \frac{e^{tx}}{1 + x^2} = \infty.
\]


Furthermore, there are many other distributions that have exactly the same moment generating function as the Cauchy distribution. For example, for every real number γ > 1, there is a distribution on R with density
\[
p(x) = \frac{1}{C} \cdot \frac{1}{1 + |x|^{\gamma}},
\]
where C < ∞ is a normalizing constant, and calculations similar to those on the preceding slide show that the corresponding moment generating function is
\[
M(t) = \begin{cases} 1 & \text{for } t = 0 \\ \infty & \text{otherwise.} \end{cases}
\]


The Moment Problem

A natural question to ask is whether we can identify a distribution on the real line if all we know are its moments. The answer is clearly negative if any of the moments are infinite. However, as the next example demonstrates, the answer can still be negative even if all of the moments are finite.

Example: Suppose that X is a random variable that takes values in the set [0, ∞) with density
\[
f(x) = \frac{1}{C} e^{-\alpha x^{\lambda}}, \qquad x > 0,
\]
where α > 0, λ ∈ (0, 1/2), and C < ∞ is a normalizing constant. We first observe that all of the moments of X are finite since for any n ≥ 0
\[
m_n = E[X^n] = \frac{1}{C} \int_0^{\infty} x^n e^{-\alpha x^{\lambda}}\,dx < \infty.
\]


Next, let β = α tan(λπ) and, for each δ ∈ (−1, 1), let X(δ) be a random variable that takes values in [0, ∞) with density
\[
g_{\delta}(x) = f(x)\left[1 + \delta \sin(\beta x^{\lambda})\right], \qquad x > 0.
\]
I claim that, whatever the value of δ in this range, X(δ) has the same (finite) moments as X. To verify this claim, it suffices to show that
\[
\int_0^{\infty} x^n e^{-\alpha x^{\lambda}} \sin(\beta x^{\lambda})\,dx = 0
\]
for every n ≥ 0. We begin by observing that if p > 0 and q ∈ C is a complex number with positive real part, Re q > 0, then
\[
\int_0^{\infty} t^{p-1} e^{-qt}\,dt = \frac{\Gamma(p)}{q^p}.
\]
(The condition on the real part of q is needed so that the integral will converge.)


Consequently, if we let p = (n + 1)/λ, q = α + iβ, t = x^λ and dt = λx^{λ−1} dx, then this last identity implies that
\[
\begin{aligned}
\frac{\Gamma\!\left(\frac{n+1}{\lambda}\right)}{(\alpha + i\beta)^{(n+1)/\lambda}}
&= \int_0^{\infty} (x^{\lambda})^{[(n+1)/\lambda] - 1} e^{-(\alpha + i\beta)x^{\lambda}} \lambda x^{\lambda - 1}\,dx \\
&= \lambda \int_0^{\infty} x^n e^{-(\alpha + i\beta)x^{\lambda}}\,dx \\
&= \lambda \int_0^{\infty} x^n e^{-\alpha x^{\lambda}} \cos(\beta x^{\lambda})\,dx - i\lambda \int_0^{\infty} x^n e^{-\alpha x^{\lambda}} \sin(\beta x^{\lambda})\,dx.
\end{aligned}
\]
Here we have used Euler's formula,
\[
e^{i\theta} = \cos(\theta) + i\sin(\theta),
\]
with θ = βx^λ.


We now turn our attention to the denominator of the expression on the left-hand side. Recalling that β = α tan(λπ), this is
\[
\begin{aligned}
(\alpha + i\beta)^{(n+1)/\lambda} &= \alpha^{(n+1)/\lambda}\,(1 + i\tan(\lambda\pi))^{(n+1)/\lambda} \\
&= \alpha^{(n+1)/\lambda}\,\cos(\lambda\pi)^{-(n+1)/\lambda}\,(\cos(\lambda\pi) + i\sin(\lambda\pi))^{(n+1)/\lambda} \\
&= \alpha^{(n+1)/\lambda}\,\cos(\lambda\pi)^{-(n+1)/\lambda}\,e^{i\lambda\pi(n+1)/\lambda} \\
&= \alpha^{(n+1)/\lambda}\,\cos(\lambda\pi)^{-(n+1)/\lambda}\,e^{i\pi(n+1)} \\
&= \alpha^{(n+1)/\lambda}\,\cos(\lambda\pi)^{-(n+1)/\lambda}\,\cos((n+1)\pi) \in \mathbb{R},
\end{aligned}
\]
since sin((n + 1)π) = 0 for any integer n. This shows that the expression on the left-hand side is a real number. Since this is equal to the difference of a real-valued integral and an imaginary-valued integral, the latter must in fact be equal to 0,
\[
\int_0^{\infty} x^n e^{-\alpha x^{\lambda}} \sin(\beta x^{\lambda})\,dx = 0,
\]
and this holds for every integer n ≥ 0, which establishes that the variables X(δ), δ ∈ (−1, 1), all have the same finite moments.


The Normal Distribution

The normal distribution is one of the most important distributions in probability theory and statistics. We begin with the definition and then derive some of its properties.

Definition. A random variable X is said to have the normal distribution with mean µ and variance σ² > 0, written X ∼ N(µ, σ²), if X has density function
\[
p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \mu)^2/2\sigma^2}.
\]

If µ = 0 and σ = 1, then X is said to be a standard normal random variable.

Remark: The normal distribution is also called the Gaussian distribution, after C. F. Gauss (1777-1855), who proposed that measurement errors will often follow this distribution.


To verify that p_X(x) is a probability density function, we need to show that its integral over the real line is equal to 1. Suppose that µ = 0 and σ² = 1. Letting
\[
I = \int_{-\infty}^{\infty} e^{-y^2/2}\,dy,
\]
we have
\[
\begin{aligned}
I^2 &= \int_{-\infty}^{\infty} e^{-x^2/2}\,dx \int_{-\infty}^{\infty} e^{-y^2/2}\,dy \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2}\,dx\,dy \\
&= \int_0^{\infty}\int_0^{2\pi} r e^{-r^2/2}\,d\theta\,dr \qquad (dx\,dy = r\,dr\,d\theta) \\
&= 2\pi \int_0^{\infty} r e^{-r^2/2}\,dr = 2\pi.
\end{aligned}
\]


This shows that I = √(2π) and therefore
\[
\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy = 1.
\]
The general case then follows by making the substitution y = (x − µ)/σ and dy = dx/σ:
\[
\frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} e^{-(x - \mu)^2/2\sigma^2}\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy = 1.
\]


Exercise: Suppose that X ∼ N(µ, σ²) and let Y = aX + b, where a ≠ 0 and b are constants. Show that Y ∼ N(aµ + b, a²σ²).

Solution: If a > 0, then the cumulative distribution function of Y is
\[
P(Y \le x) = P(aX + b \le x) = P\!\left(X \le \frac{x - b}{a}\right) = F_X\!\left(\frac{x - b}{a}\right).
\]
Differentiating, we find that the density function of Y is
\[
p_Y(x) = \frac{1}{a}\, p_X\!\left(\frac{x - b}{a}\right) = \frac{1}{\sqrt{2\pi a^2\sigma^2}}\, e^{-(x - b - a\mu)^2/2a^2\sigma^2},
\]
which shows that Y is normally distributed with mean b + aµ and variance a²σ². The case a < 0 is handled similarly, starting from P(aX + b ≤ x) = P(X ≥ (x − b)/a).


The following special cases are particularly important:

If Z ∼ N(0, 1) is a standard normal random variable and X = µ + σZ, then X ∼ N(µ, σ²). This shows that every normal distribution can be constructed from the standard normal distribution via a linear transformation.

Conversely, if X ∼ N(µ, σ²), then Z = (X − µ)/σ ∼ N(0, 1) is a standard normal random variable. This process (subtracting the mean and dividing by the standard deviation) is called standardization.

Remark: These relationships allow us to deduce the properties of the general normal distribution from those of the standard normal distribution.


Proposition. If X ∼ N(µ, σ²), then the expected value of X is µ, while the variance of X is σ².

Proof: Let Z = (X − µ)/σ, so that Z ∼ N(0, 1). Then
\[
E[Z] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x e^{-x^2/2}\,dx = 0,
\]
while
\[
\mathrm{Var}(Z) = E\!\left[Z^2\right] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^2 e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \left( \left. -x e^{-x^2/2} \right|_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-x^2/2}\,dx \right) = 1.
\]
Since X = µ + σZ, the proposition follows from the identities
\[
E[X] = \mu + \sigma E[Z] = \mu, \qquad \mathrm{Var}(X) = \sigma^2\, \mathrm{Var}(Z) = \sigma^2.
\]


The cumulative distribution function of the standard normal distribution is usually denoted by the Greek letter Φ:
\[
\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt.
\]

Although this integral cannot be expressed in terms of elementary functions, it can be evaluated numerically using software such as Matlab or R. Alternatively, extensive tabulations of Φ can be found online and in most statistics text books.

The cumulative distribution function of a non-standard normal random variable X ∼ N(µ, σ²) can be evaluated in terms of Φ through standardization:
\[
F_X(x) = P(X \le x) = P\!\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = \Phi\!\left(\frac{x - \mu}{\sigma}\right).
\]


Because the tails of the normal density decay very rapidly, a normally distributed variable is unlikely to differ from its mean by more than a few standard deviations.

[Figure: normal densities p(x) with σ² = 0.25, 1, 4, 25.]

z     P(Z ≥ z)
0.5   0.30854
1     0.15866
2     0.02275
3     0.00135
4     0.00003
5     2.87E−07


The moment generating function of the normal distribution can be evaluated as follows. First, if Z ∼ N(0, 1), then
\[
\begin{aligned}
M_Z(t) = E\!\left[e^{tZ}\right] &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tz} e^{-z^2/2}\,dz \\
&= e^{t^2/2}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(z^2 - 2tz + t^2)}\,dz \\
&= e^{t^2/2}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(z - t)^2}\,dz \\
&= e^{t^2/2}.
\end{aligned}
\]
The general case is then dealt with by standardization: if X = µ + σZ ∼ N(µ, σ²), then
\[
M_X(t) = E\!\left[e^{tX}\right] = E\!\left[e^{t(\mu + \sigma Z)}\right] = e^{\mu t}\, E\!\left[e^{t\sigma Z}\right] = e^{\mu t + \sigma^2 t^2/2}.
\]


The next theorem asserts that a sum of independent normal random variables is still normally distributed.

Theorem. Suppose that X₁, ..., X_n are independent normal random variables and that Xᵢ has mean µᵢ and variance σᵢ². Then X = X₁ + ··· + X_n is normally distributed with mean µ = µ₁ + ··· + µ_n and variance σ² = σ₁² + ··· + σ_n².

Proof: Since X₁, ..., X_n are independent, the moment generating function of X is equal to the product of the moment generating functions of the Xᵢ's:
\[
M_X(t) = \prod_{i=1}^{n} M_{X_i}(t) = \prod_{i=1}^{n} e^{\mu_i t + \sigma_i^2 t^2/2} = \exp\!\left( t \sum_{i=1}^{n} \mu_i + \frac{1}{2} t^2 \sum_{i=1}^{n} \sigma_i^2 \right) = e^{\mu t + \sigma^2 t^2/2}.
\]
This shows that X has the same moment generating function as N(µ, σ²), which allows us to conclude that X itself has this distribution.


The Central Limit Theorem

Many kinds of data are found to be approximately normally distributed:

• measurement errors;

• the velocities of the molecules in an ideal gas;

• physical dimensions of individuals in a population, e.g., adult human heights and weights;

• the logarithm of the latency periods for chicken pox, hepatitis B, and polio infections;

• house prices within a given area.

As with the Poisson distribution, the fact that the normal distribution can be used to model so many kinds of data invites a mathematical explanation. This is provided, at least in part, by the central limit theorem, which asserts that the sum of a large number of i.i.d. random variables is approximately normal.


To explain the content of the central limit theorem, we need to decide what it means to say that a sequence of random variables (X_n : n ≥ 0) converges to a limit X. One type of convergence, called weak convergence, is defined below.

Definition.

1 Suppose that F and F_n, n ≥ 1, are cumulative distribution functions on R. Then F_n is said to converge weakly to F if lim_{n→∞} F_n(x) = F(x) at every point x where F is continuous.

2 A sequence of random variables X_n is said to converge in distribution to a random variable X if the cumulative distribution functions F_n(x) = P(X_n ≤ x) converge weakly to F(x) = P(X ≤ x).

Remark: In fact, there are many different senses in which a sequence of random variables can converge to a limit. This mode of convergence is called weak convergence because it is implied by most of the other criteria in common use.


Example: To see why we only require pointwise convergence at continuity points, let X_n and X be the degenerate random variables defined by setting
\[
P(X_n = 1/n) = 1 \quad \text{and} \quad P(X = 0) = 1.
\]
Clearly, X_n should converge to X under any reasonable definition of convergence. In fact,
\[
F_n(x) = \begin{cases} 0 & \text{if } x < 1/n \\ 1 & \text{otherwise} \end{cases} \qquad \text{and} \qquad F(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{otherwise,} \end{cases}
\]
which shows that F_n(x) converges to F(x) as n → ∞ at every x ≠ 0. Since F(x) is continuous everywhere except at x = 0, it follows that the sequence X_n converges in distribution to X even though
\[
1 = F(0) \ne 0 = \lim_{n \to \infty} F_n(0).
\]


The next proposition shows how we can use moment generating functions to deduce that a sequence of random variables converges in distribution. The assumption that the moment generating function of the limit X is finite is essential.

Theorem. Let X₁, X₂, ... be a sequence of random variables with moment generating functions M_{X_n}(t), and let X be a random variable with moment generating function M_X. Then the sequence X_n converges in distribution to X if
\[
\lim_{n \to \infty} M_{X_n}(t) = M_X(t) < \infty
\]
for all t ∈ R.

Remark: This result is useful because it can be used to prove weak convergence even when we cannot explicitly calculate the cumulative distribution functions of the random variables involved.


We now come to the main result.

Theorem (The Central Limit Theorem). Suppose that X₁, X₂, ... is a sequence of i.i.d. variables with finite mean µ and variance σ², and let S_n = X₁ + ··· + X_n. Then the sequence of random variables
\[
Z_n \equiv \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{\sqrt{n}}{\sigma}\left(\frac{1}{n} S_n - \mu\right)
\]
converges in distribution to the standard normal N(0, 1), i.e., for every real number z,
\[
\lim_{n \to \infty} P\{Z_n \le z\} = \Phi(z) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-x^2/2}\,dx.
\]
Proof: We will prove the CLT under the assumption that the moment generating function of the random variables Xᵢ, denoted M(t), is finite on the entire real line. Then it suffices to show that the moment generating functions of the variables Z_n converge pointwise to the moment generating function of a standard normal random variable.


Let us first assume that µ = 0 and σ² = 1. Notice that the moment generating function of the scaled random variable Xᵢ/√n is
\[
E\!\left[\exp\!\left\{\frac{tX_i}{\sqrt{n}}\right\}\right] = M\!\left(\frac{t}{\sqrt{n}}\right)
\]
and that the moment generating function of the sum Z_n = ∑ᵢ₌₁ⁿ Xᵢ/√n is equal to
\[
M_{Z_n}(t) = \left[ M\!\left(\frac{t}{\sqrt{n}}\right) \right]^n.
\]
Let L(t) = log M(t). Since M(0) = 1, M′(0) = µ = 0 and M″(0) = σ² = 1, we have
\[
L(0) = \log(M(0)) = 0, \qquad L'(0) = \frac{M'(0)}{M(0)} = 0, \qquad L''(0) = \frac{M(0)M''(0) - M'(0)^2}{M(0)^2} = 1.
\]


It suffices to show that
\[
\lim_{n \to \infty} \left[ M\!\left(\frac{t}{\sqrt{n}}\right) \right]^n = e^{t^2/2},
\]
which is equivalent to
\[
\lim_{n \to \infty} n L\!\left(\frac{t}{\sqrt{n}}\right) = \frac{t^2}{2}.
\]
However, this last identity can be verified using L'Hôpital's rule:
\[
\lim_{n \to \infty} \frac{L(t/\sqrt{n})}{n^{-1}} = \lim_{x \to 0} \frac{L(tx)}{x^2} = \lim_{x \to 0} \frac{tL'(tx)}{2x} = \lim_{x \to 0} \frac{t^2 L''(tx)}{2} = \frac{t^2}{2}.
\]
The general case can then be handled by applying this result to the standardized variables Xᵢ* = (Xᵢ − µ)/σ, which have mean 0 and variance 1.
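
A simulation sketch of the theorem (numpy assumed available; the choice of uniform summands and the sample sizes are illustrative): form standardized sums Z_n of i.i.d. uniforms and compare their empirical CDF with Φ at a few points.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 50, 200_000
X = rng.uniform(0, 1, size=(reps, n))          # summands: mean 1/2, variance 1/12
Z = (X.sum(axis=1) - n/2) / np.sqrt(n/12)      # Z_n = (S_n - n*mu)/(sigma*sqrt(n))

for z in [-2, -1, 0, 1, 2]:
    print(z, np.mean(Z <= z))   # should be close to Phi(z)
```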


The following result, known as the de Moivre-Laplace Theorem, was the first version of the CLT to be discovered. It asserts that the binomial distribution with parameters n and p is approximately normal when n is large.

Corollary. If S_n is a binomial random variable with parameters n and p ∈ (0, 1), then
\[
\lim_{n \to \infty} P\!\left\{ a \le \frac{S_n - np}{\sqrt{np(1 - p)}} \le b \right\} = \Phi(b) - \Phi(a).
\]
Proof: The result follows from the CLT once we note that S_n has the same distribution as the sum of n independent Bernoulli(p) random variables,
\[
S_n \overset{d}{=} X_1 + \cdots + X_n,
\]
and that E[S_n] = np and Var(S_n) = np(1 − p).


Example: Suppose that a fair coin is tossed 100 times. What is the probability that the number of heads obtained is between 45 and 55 (inclusive)?

Solution: If X denotes the number of heads obtained in 100 tosses, then X is a binomial random variable with parameters (100, 1/2). By the de Moivre-Laplace theorem, we know that
\[
P\{45 \le X \le 55\} = P\!\left\{ -1 \le \frac{X - 50}{5} \le 1 \right\} \approx \Phi(1) - \Phi(-1) = \Phi(1) - (1 - \Phi(1)) = 2\Phi(1) - 1 = 0.683.
\]
Notice that we have also made use of the identity
\[
\Phi(-x) = 1 - \Phi(x),
\]
which follows from the fact that if Z ∼ N(0, 1), then
\[
P\{Z \le -x\} = P\{Z \ge x\} = 1 - P\{Z \le x\}.
\]
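
A hedged check of this calculation in Python (scipy assumed available): compare the exact binomial probability with the plain normal approximation above and with the continuity-corrected one, which uses the endpoints 44.5 and 55.5.

```python
from scipy.stats import binom, norm

exact = binom.cdf(55, 100, 0.5) - binom.cdf(44, 100, 0.5)   # P(45 <= X <= 55)
plain = norm.cdf(1) - norm.cdf(-1)                          # 2*Phi(1) - 1 ≈ 0.683
corrected = norm.cdf(1.1) - norm.cdf(-1.1)                  # (55.5 - 50)/5 = 1.1

print(exact, plain, corrected)   # exact ≈ 0.729; the corrected value is closer
```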


The convergence of the binomial distribution to the normal distribution with increasing n is illustrated in the figures below, which compare the normal distribution with mean np and variance np(1 − p) with the binomial distribution for n = 10 (left) and n = 100 (right) when p = 0.2.

[Figures: binomial(n, 0.2) probabilities vs. the matching normal density, for n = 10 and n = 100.]

Remarkably, the match between the binomial and normal densities is pretty good even when n is only 10.


Another example of the scope of the CLT is provided by the distribution of adult human heights, which is approximately normal. The figure shows a histogram for the heights of a sample of 5000 adults (source: SOCR), as well as the best fitting normal distribution.

[Figure: histogram of adult heights in inches, with the best-fitting normal density.]

Normality of quantitative traits can be explained by Fisher's infinitesimal model:

• The trait depends on a large number L of variable loci.

• The two alleles at each locus have a small effect X_{l,m} and X_{l,p} on the trait.

• The loci act additively.

Then an individual's height may be expressed as below:
\[
H = \sum_{l=1}^{L} (X_{l,m} + X_{l,p}) + \varepsilon,
\]
where ε is the random environmental effect on height.


The Central Limit Theorem even shows up in number theory. Here is a typical example.

Theorem (Erdős-Kac). For each n ≥ 1, let Y_n be uniformly distributed on the set {1, ..., n}, and define φ(n) to be the number of distinct prime divisors of the integer n, e.g., φ(2) = 1, φ(3) = 1, φ(4) = 1, φ(5) = 1, φ(6) = 2, etc. Then, for every real number x,
\[
\lim_{n \to \infty} P\!\left( \frac{\varphi(Y_n) - \log\log(n)}{\sqrt{\log\log(n)}} \le x \right) = \Phi(x).
\]
In other words, if an integer is chosen at random between 1 and n, then for large n the number of distinct prime divisors of that integer is approximately normally distributed with mean and variance both equal to log log(n).

Intuition: The result can be proven by writing φ(Y_n) as a sum of indicator variables and then showing that these are approximately independent and identically distributed:
\[
\varphi(Y_n) = \sum_{p \le Y_n} \mathbf{1}_{p \mid Y_n}.
\]


One question that we might ask is why the limit is normal, i.e., what's so normal about the normal distribution? One answer to this question comes from information theory.

Definition. The differential entropy of a continuous probability distribution P with density p(x) on the real line is the quantity
\[
H(P) = -\int_{-\infty}^{\infty} p(x) \ln(p(x))\,dx.
\]
The entropy of a distribution can be thought of as a measure of the information content of the distribution, i.e., if X has distribution P then the entropy H(P) is a measure of how much information we gain when we observe the outcome X of an experiment. Equivalently, H(P) also quantifies the amount of uncertainty that we have before we observe this outcome: the greater the uncertainty in advance of the observation, the more we learn by observing the outcome.
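
A small numerical sketch of this definition (numpy and scipy assumed available): for the standard normal, the differential entropy has the known closed form ½ ln(2πe) ≈ 1.4189, which the integral should reproduce.

```python
import numpy as np
from scipy.integrate import quad

p = lambda x: np.exp(-x**2/2) / np.sqrt(2*np.pi)        # standard normal density
# integrate -p ln p; tails beyond |x| = 20 contribute negligibly
H, _ = quad(lambda x: -p(x) * np.log(p(x)), -20, 20)

print(H, 0.5*np.log(2*np.pi*np.e))   # both should be ≈ 1.4189
```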


Part of the connection with the Central Limit Theorem is provided by the following proposition.

Lemma. Of all the continuous distributions on the real line with mean 0 and variance 1, the standard normal distribution is the unique distribution with maximum entropy, i.e.,
\[
H(\mathcal{N}(0, 1)) = \sup\{ H(P) : P \in \mathcal{P}_{0,1}(\mathbb{R}) \},
\]
where P₀,₁(R) is the collection of all probability measures on the real line with mean 0 and variance 1.

The reason this is relevant is that the operation of averaging a set of random variables is one that generates entropy, i.e., one can show that the entropy of a standardized sum of a collection of i.i.d. random variables is at least as great as the entropy of any one of those variables. This makes sense since in the process of averaging a set of random variables, we are losing information and thus increasing entropy.
