
APM 421 Probability Theory: Miscellaneous Topics

Jay Taylor, Fall 2013


Outline

1 Transformations

2 Moment Generating Functions

3 The Moment Problem

4 The Normal Distribution

5 The Central Limit Theorem


Transformations

Transformations of Random Variables

Our focus in this section is on transformations of random variables, which play an important role throughout probability and statistics. We begin by describing how the distribution of a random variable changes under an arbitrary transformation.

Theorem. Suppose that X is a random variable with distribution P_X on U and let f : U → V be a function. Then Y = f(X) is a random variable with distribution P_Y = P_X ◦ f⁻¹ on V.

Proof: If B ⊂ V is a subset of V, then
\[
P(Y \in B) = P(f(X) \in B) = P(X \in f^{-1}(B)) = P_X(f^{-1}(B)).
\]


The following result shows how the probability mass function of a discrete random variable changes under transformation.

Theorem. Suppose that X is a discrete random variable with probability mass function p_X on a set U = {x₁, x₂, ...}. Then, given any function f : U → V, the variable Y = f(X) is also discrete with probability mass function
\[
p_Y(y) = \sum_{x \in f^{-1}(y)} p_X(x), \qquad y \in V.
\]

Proof: We have
\[
p_Y(y) = P(Y = y) = P(f(X) = y) = P(X \in f^{-1}(y)) = \sum_{x \in f^{-1}(y)} P(X = x) = \sum_{x \in f^{-1}(y)} p_X(x).
\]


Example: Suppose that X is uniformly distributed on the set U = {−5, −4, ..., 4, 5} and let Y = |X| be the absolute value of X. Since U contains 11 elements, it follows that the probability mass function of X is
\[
p_X(k) = \frac{1}{11}, \qquad -5 \le k \le 5.
\]
Since the transformation f(k) = |k| is two-to-one when |k| > 0 and one-to-one when k = 0, it follows that P(Y = 0) = P(X = 0) = 1/11, while
\[
P(Y = k) = P(X = k) + P(X = -k) = \frac{2}{11}
\]
for k = 1, ..., 5. Consequently, the probability mass function of Y is
\[
p_Y(k) = \begin{cases} 1/11 & \text{if } k = 0 \\ 2/11 & \text{if } 1 \le k \le 5. \end{cases}
\]
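
The theorem above is also an algorithm: to get the pmf of Y = f(X), sum p_X over each preimage. A minimal Python sketch of this computation, using the example just given (names here are illustrative, not from the slides):

```python
from collections import defaultdict

p_X = {k: 1/11 for k in range(-5, 6)}   # uniform pmf on {-5, ..., 5}

def pmf_of_transform(p_X, f):
    """Return the pmf of Y = f(X): p_Y(y) = sum of p_X(x) over x with f(x) = y."""
    p_Y = defaultdict(float)
    for x, p in p_X.items():
        p_Y[f(x)] += p
    return dict(p_Y)

p_Y = pmf_of_transform(p_X, abs)
print(p_Y)   # expect p_Y(0) = 1/11 and p_Y(k) = 2/11 for k = 1, ..., 5
```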


The cumulative distribution function plays an important role when we study transformations of continuous random variables.

Example: Let X be a non-negative random variable with CDF F_X and let Y = Xⁿ, where n ≥ 1. Then, for any x ≥ 0,
\[
F_Y(x) = P(Y \le x) = P(X^n \le x) = P(X \le x^{1/n}) = F_X(x^{1/n}),
\]
while for any x < 0, we have F_Y(x) = 0. Similarly, if X has a density p_X = F′_X(x), then Y also has a density given by
\[
p_Y(x) = \frac{d}{dx} F_Y(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{n}\, x^{1/n - 1}\, p_X(x^{1/n}) & \text{if } x \ge 0. \end{cases}
\]


Example: Let X be a continuous real-valued random variable with density p_X(x) = F′_X(x) and let Y = X². Then, for x ≥ 0,
\[
F_Y(x) = P\{Y \le x\} = P\{X^2 \le x\} = P\{-\sqrt{x} \le X \le \sqrt{x}\} = F_X(\sqrt{x}) - F_X(-\sqrt{x}),
\]
while for any x < 0, we have F_Y(x) = 0. In this case, the density of Y is given by
\[
p_Y(x) = \frac{d}{dx} F_Y(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{2\sqrt{x}}\left(p_X(\sqrt{x}) + p_X(-\sqrt{x})\right) & \text{if } x \ge 0. \end{cases}
\]


Next we have a pair of theorems which show how monotonic transformations act on continuous random variables.

Theorem. Suppose that X is a continuous real-valued random variable with cumulative distribution function F_X and density p_X, and let Y = g(X), where g : R → R is a strictly increasing differentiable function. Then Y is a continuous random variable with cumulative distribution function
\[
F_Y(y) = \begin{cases} 0 & \text{if } y \le \inf(g(\mathbb{R})) \\ F_X(g^{-1}(y)) & \text{if } y \in g(\mathbb{R}) \\ 1 & \text{if } y \ge \sup(g(\mathbb{R})) \end{cases}
\]
and density
\[
p_Y(y) = \begin{cases} p_X(g^{-1}(y))\, \frac{d g^{-1}(y)}{dy} & \text{if } y \in g(\mathbb{R}) \\ 0 & \text{if } y \notin g(\mathbb{R}). \end{cases}
\]


Proof: By the definition of the c.d.f., we have
\[
F_Y(y) = P(Y \le y) = P(g(X) \le y).
\]
There are then three cases to be considered according to the value of y:

Case 1: If y ∈ g(R), then because g is order-preserving,
\[
P(g(X) \le y) = P(X \le g^{-1}(y)) = F_X(g^{-1}(y)).
\]
Case 2: If y ≤ inf(g(R)), then
\[
P(g(X) \le y) = P(\emptyset) = 0.
\]
Case 3: If y ≥ sup(g(R)), then
\[
P(g(X) \le y) = P(X \in \mathbb{R}) = 1.
\]
We can then calculate the density of Y by differentiating F_Y(y).


Theorem. Let X and Y = g(X) be as in the previous theorem, but now suppose that g : R → R is strictly decreasing and differentiable. Then Y is a continuous random variable with cumulative distribution function
\[
F_Y(y) = \begin{cases} 0 & \text{if } y \le \inf(g(\mathbb{R})) \\ 1 - F_X(g^{-1}(y)) & \text{if } y \in g(\mathbb{R}) \\ 1 & \text{if } y \ge \sup(g(\mathbb{R})) \end{cases}
\]
and density
\[
p_Y(y) = \begin{cases} -p_X(g^{-1}(y))\, \frac{d g^{-1}(y)}{dy} & \text{if } y \in g(\mathbb{R}) \\ 0 & \text{if } y \notin g(\mathbb{R}). \end{cases}
\]

Remark: Because g is strictly decreasing, so is g⁻¹, and therefore the derivative dg⁻¹(y)/dy is negative for all values of y in the range of g. This shows that the expression for the density of Y is non-negative at all such y.


Proof: The proof is similar to that of the preceding theorem, but with a twist. Because g is strictly decreasing, g is order-reversing, i.e., if x < y, then g(x) > g(y). Consequently, if y ∈ g(R), then
\[
\begin{aligned}
F_Y(y) = P(Y \le y) &= P(g(X) \le y) \\
&= P(X \ge g^{-1}(y)) \\
&= 1 - P(X < g^{-1}(y)) \\
&= 1 - P(X \le g^{-1}(y)) \\
&= 1 - F_X(g^{-1}(y)),
\end{aligned}
\]
where P(X < g⁻¹(y)) = P(X ≤ g⁻¹(y)) because X is continuous and so P(X = g⁻¹(y)) = 0.


Exercise: Suppose that X is exponentially distributed with parameter λ and let Y = 1/X. Find the cumulative distribution function and density of Y.
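
If you want to check a proposed answer numerically, here is a hedged sketch (assuming numpy is available): applying the decreasing-transformation theorem to g(x) = 1/x on (0, ∞) gives F_Y(y) = e^{−λ/y} for y > 0, and the code compares that formula with the empirical CDF of simulated values of Y = 1/X.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
X = rng.exponential(scale=1/lam, size=100_000)  # X ~ Exponential(lambda)
Y = 1/X

for y in [0.1, 0.5, 1.0, 5.0]:
    empirical = np.mean(Y <= y)         # empirical CDF of Y at y
    analytic = np.exp(-lam/y)           # candidate F_Y(y) = exp(-lambda/y)
    print(f"y={y}: empirical {empirical:.4f}, analytic {analytic:.4f}")
```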


Theorem. Suppose that U is a uniform random variable on (0, 1) and let X be a real-valued random variable with a strictly increasing continuous CDF F_X(·). Then the random variable Y = F_X⁻¹(U) has the same distribution as X, while the random variable Z = F_X(X) is uniformly distributed on (0, 1).

Proof: First consider the distribution of Y:
\[
F_Y(x) = P\{Y \le x\} = P\{F_X^{-1}(U) \le x\} = P\{U \le F_X(x)\} = F_X(x),
\]
since F_X(x) ∈ [0, 1] for all x and P{U ≤ y} = y whenever y ∈ [0, 1]. This shows that Y and X have the same distribution.


Similarly, the CDF of Z = F_X(X) is
\[
F_Z(x) = P\{Z \le x\} = P\{F_X(X) \le x\} = P\{X \le F_X^{-1}(x)\} = F_X(F_X^{-1}(x)) = x,
\]
for any x ∈ [0, 1], which shows that Z is a uniform random variable on [0, 1].

Remark: This result is at the heart of one of the most basic algorithms for generating random numbers with a specified distribution. Suppose that we wish to generate a sequence of i.i.d. random numbers with the same distribution as some random variable X, but that we only have access to a stream of independent standard uniform random variables, U₁, U₂, .... If X has a continuous and strictly increasing c.d.f., F_X, then we can generate a sequence of independent random variables X₁, X₂, ..., each having the same distribution as X, by setting Xᵢ = F_X⁻¹(Uᵢ).
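
A minimal sketch of this inverse-transform algorithm (assuming numpy is available; the helper name is illustrative). The exponential case, with F_X⁻¹(u) = −ln(1 − u)/λ as derived in the example that follows, serves as the usage example.

```python
import numpy as np

def inverse_transform_sample(F_inv, n, rng):
    """Generate n i.i.d. draws with CDF F by applying F^{-1} to uniforms."""
    U = rng.uniform(0.0, 1.0, size=n)
    return F_inv(U)

rng = np.random.default_rng(1)
lam = 3.0
samples = inverse_transform_sample(lambda u: -np.log(1 - u)/lam, 100_000, rng)
print(samples.mean())   # should be close to 1/lambda ≈ 0.333
```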


Example: Recall that the CDF of the exponential distribution with parameter λ is
\[
F_X(x) = 1 - e^{-\lambda x}.
\]
A simple calculation shows that
\[
F_X^{-1}(x) = -\frac{1}{\lambda} \ln(1 - x),
\]
and so it follows that if U is uniform on [0, 1], then
\[
Y = -\frac{1}{\lambda} \ln(1 - U)
\]
is exponentially distributed with parameter λ. In fact, because the distribution of 1 − U is also uniform on [0, 1], the random variable
\[
Y' = -\frac{1}{\lambda} \ln(U)
\]
is also exponentially distributed with parameter λ.


Multivariate Transformations

It is also useful to consider functions of random vectors. However, before we can describe the statistics of such transformations, we need to recall some ideas from vector calculus. We begin with a definition.

Definition. Suppose that f = (f₁, ..., f_m) : Rⁿ → R^m is a differentiable function. Then the Jacobian matrix of f at x ∈ Rⁿ is the m by n matrix
\[
J_f(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(x) & \cdots & \frac{\partial f_1}{\partial x_n}(x) \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1}(x) & \cdots & \frac{\partial f_m}{\partial x_n}(x) \end{bmatrix}.
\]

Interpretation: The Jacobian matrix of a differentiable function f : Rⁿ → R^m at x is the best linear approximation to f at x in the sense that
\[
\lim_{h \to 0} \frac{\| f(x + h) - f(x) - J_f(x)h \|}{\| h \|} = 0.
\]


The Jacobian plays an important role in multivariate integration. The following theorem describes how to make a change of variables in an integral over an n-dimensional region. Recall that an n by n matrix A is said to be non-singular if its determinant, det(A), is not equal to 0. Furthermore, a matrix is invertible if and only if it is non-singular.

Theorem. Let U and V be open subsets of Rⁿ and suppose that φ : U → V is a differentiable mapping with differentiable inverse φ⁻¹ : V → U and that the Jacobian matrices of both φ and φ⁻¹ are non-singular at every point in their respective domains. Then, if f : U → R is a bounded, continuous function,
\[
\int_U f(u)\,du = \int_V f(\varphi^{-1}(v))\,\left|\det\!\left(J_\varphi(\varphi^{-1}(v))\right)\right|^{-1} dv = \int_V f(\varphi^{-1}(v))\,\left|\det\!\left(J_{\varphi^{-1}}(v)\right)\right| dv.
\]

Remark: The two Jacobians are related by the identity J_φ(u)⁻¹ = J_{φ⁻¹}(φ(u)).


Theorem. Suppose that X₁, ..., X_n are jointly continuous with joint density p_X(x₁, ..., x_n) on an open set U ⊂ Rⁿ and let φ : U → V ⊂ Rⁿ be a differentiable mapping with differentiable inverse φ⁻¹ : V → U. Assume that the Jacobian matrices of both φ and φ⁻¹ are non-singular. Then the random variables Y₁ = φ₁(X), ..., Y_n = φ_n(X) are jointly continuous with joint density
\[
p_{\mathbf{Y}}(y) = p_{\mathbf{X}}(\varphi^{-1}(y))\,\left|\det\!\left(J_\varphi(\varphi^{-1}(y))\right)\right|^{-1} = p_{\mathbf{X}}(\varphi^{-1}(y))\,\left|\det\!\left(J_{\varphi^{-1}}(y)\right)\right|
\]
for y ∈ V.

Proof (sketch): If B ⊂ V is an open set and A = φ⁻¹(B) ⊂ U, then by the change-of-variables formula
\[
P(\mathbf{X} \in A) = \int_A p_{\mathbf{X}}(u)\,du = \int_B p_{\mathbf{X}}(\varphi^{-1}(v))\,\left|\det\!\left(J_\varphi(\varphi^{-1}(v))\right)\right|^{-1} dv.
\]


However, we also know that
\[
P(\mathbf{X} \in A) = P(\varphi(\mathbf{X}) \in \varphi(A)) = P(\mathbf{Y} \in B) = \int_B p_{\mathbf{Y}}(y)\,dy.
\]
Since both expressions are equal to P(X ∈ A), it follows that
\[
\int_B p_{\mathbf{Y}}(y)\,dy = \int_B p_{\mathbf{X}}(\varphi^{-1}(v))\,\left|\det\!\left(J_\varphi(\varphi^{-1}(v))\right)\right|^{-1} dv,
\]
and since this identity holds for every open set B ⊂ V, we can conclude that
\[
p_{\mathbf{Y}}(y) = p_{\mathbf{X}}(\varphi^{-1}(y))\,\left|\det\!\left(J_\varphi(\varphi^{-1}(y))\right)\right|^{-1},
\]
as claimed. The second identity follows from the relationship between the two Jacobians.


Example: Suppose that X and Y are jointly continuous with joint density function p(x, y) on R², and let R and Θ be the polar coordinates of the random point (X, Y). In this case, the transformation is φ = (r, θ) : R² → [0, ∞) × [0, 2π), with components
\[
r(x, y) = \sqrt{x^2 + y^2}, \qquad \theta(x, y) = \tan^{-1}(y/x),
\]
while the inverse transformation φ⁻¹ = (x, y) has components
\[
x(r, \theta) = r\cos(\theta), \qquad y(r, \theta) = r\sin(\theta).
\]
Furthermore, the Jacobian of the inverse map φ⁻¹ is
\[
J_{\varphi^{-1}}(r, \theta) = \begin{bmatrix} \cos(\theta) & -r\sin(\theta) \\ \sin(\theta) & r\cos(\theta) \end{bmatrix},
\]
which has determinant
\[
\det\!\left(J_{\varphi^{-1}}(r, \theta)\right) = r\cos^2(\theta) + r\sin^2(\theta) = r.
\]


It follows that the joint density function of (R, Θ) is
\[
p_{R,\Theta}(r, \theta) = p(r\cos(\theta), r\sin(\theta))\, r.
\]
For example, if (X, Y) is uniformly distributed on the unit disk D, then (R, Θ) takes values in the set
\[
D' = \{(r, \theta) : 0 \le r \le 1,\ 0 \le \theta < 2\pi\}
\]
with joint density
\[
p(r, \theta) = 2r\,\mathbf{1}_{[0,1]}(r) \cdot \frac{1}{2\pi}\,\mathbf{1}_{[0,2\pi)}(\theta).
\]
In particular, this shows that R and Θ are independent and that Θ is uniformly distributed on [0, 2π), while R ∼ Beta(2, 1).
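
A short simulation sketch of these conclusions (assuming numpy is available): draw points uniformly on the disk by rejection from the square, convert to polar coordinates, and check a few moments and probabilities implied by R ∼ Beta(2, 1) and Θ ∼ Uniform[0, 2π).

```python
import numpy as np

rng = np.random.default_rng(2)
# rejection sampling: uniform points in the square, kept if inside the disk
pts = rng.uniform(-1, 1, size=(200_000, 2))
pts = pts[(pts**2).sum(axis=1) <= 1]

R = np.sqrt((pts**2).sum(axis=1))
Theta = np.arctan2(pts[:, 1], pts[:, 0]) % (2*np.pi)

print(np.mean(R))                    # E[R] = 2/3 for Beta(2, 1)
print(np.mean(R <= 0.5))             # P(R <= 1/2) = (1/2)^2 = 0.25
print(np.mean(Theta <= np.pi))       # should be about 0.5
print(np.corrcoef(R, Theta)[0, 1])   # near 0, consistent with independence
```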


Moment Generating Functions

Recall that the probability generating function of a non-negative integer valued random variable X is the function
\[
\varphi_X(s) = E\!\left[s^X\right] = \sum_{n=0}^{\infty} P\{X = n\}\, s^n.
\]

In this section, we will introduce a different kind of generating function which is defined for a larger class of random variables.

Definition. If X is a real-valued random variable, then the moment generating function of X is the function M_X : R → [0, ∞] defined by the formula
\[
M_X(t) = E\!\left[e^{tX}\right].
\]


Examples:

If X ∼ Binomial(n, p), then the moment generating function of X is
\[
M_X(t) = E\!\left[e^{tX}\right] = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} e^{tk} = \left(p e^t + (1 - p)\right)^n.
\]
If X ∼ Poisson(λ), then the moment generating function of X is
\[
M_X(t) = E\!\left[e^{tX}\right] = \sum_{n=0}^{\infty} \frac{e^{-\lambda} \lambda^n}{n!} e^{tn} = \exp\!\left(-\lambda(1 - e^t)\right).
\]
If X ∼ Uniform(a, b), then the moment generating function of X is
\[
M_X(t) = E\!\left[e^{tX}\right] = \int_a^b \frac{e^{tx}\,dx}{b - a} = \frac{e^{tb} - e^{ta}}{t(b - a)}.
\]


The next theorem explains why M_X(t) is called the moment generating function of X. Recall that the n-th moment of a real-valued random variable X is the quantity
\[
m_n \equiv E\left[X^n\right],
\]
provided that the expectation exists. Odd moments need not exist (see example below), but even moments will always exist, although they may be infinite.

Theorem. Suppose that X is a real-valued random variable with moment generating function M_X(t). If M_X is finite on an open neighborhood (−ε, ε) of 0, then all of the moments of X are finite and
\[
M_X(t) = E\!\left[e^{tX}\right] = E\!\left[\sum_{n=0}^{\infty} \frac{X^n}{n!} t^n\right] = \sum_{n=0}^{\infty} E\left[X^n\right] \frac{t^n}{n!}
\]
for t ∈ (−ε, ε). In particular, m_n = M_X^{(n)}(0), the n-th derivative of M_X at 0.


Example: If X ∼ Exponential(λ), then the moment generating function of X is
\[
M_X(t) = E\!\left[e^{tX}\right] = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x}\,dx = \begin{cases} \frac{\lambda}{\lambda - t} & \text{for } t < \lambda \\ \infty & \text{otherwise.} \end{cases}
\]
Although the moment generating function is infinite on part of the real line, it is finite on an open neighborhood of 0, namely on (−∞, λ), and so the theorem tells us that all of the moments of the exponential distribution exist and are finite, and that
\[
M_X(t) = \sum_{n=0}^{\infty} \frac{m_n}{n!} t^n = \frac{\lambda}{\lambda - t} = \sum_{n=0}^{\infty} \lambda^{-n} t^n.
\]
However, since Taylor series expansions are unique, it follows that
\[
m_n = E[X^n] = \frac{n!}{\lambda^n}.
\]


Under certain conditions, the moment generating function determines the distribution of a random variable uniquely.

Theorem. Suppose that X and Y are real-valued random variables with the same moment generating function, i.e.,
\[
M_X(t) = E\!\left[e^{tX}\right] = E\!\left[e^{tY}\right] = M_Y(t)
\]
for every t ∈ R. If these moment generating functions are finite on some open neighborhood of 0, then X and Y have the same distribution.

Remark: The result is not true if we do not require the moment generating function to be finite in some neighborhood of 0.


Moment generating functions are particularly useful when analyzing sums of independent random variables.

Theorem. Suppose that X₁, ..., X_n are independent RVs with moment generating functions M_{X₁}(t), ..., M_{X_n}(t). Then the moment generating function of the sum X = X₁ + ··· + X_n is
\[
M_X(t) = \prod_{i=1}^{n} M_{X_i}(t),
\]
i.e., the moment generating function of a sum of independent RVs is just the product of the moment generating functions of the individual RVs.

Proof: Recalling that the expected value of a product of independent variables is equal to the product of the expectations of these variables, we have
\[
M_X(t) = E\!\left[e^{t(X_1 + \cdots + X_n)}\right] = \prod_{i=1}^{n} E\!\left[e^{tX_i}\right] = \prod_{i=1}^{n} M_{X_i}(t).
\]


Example: Recall that a random variable X is said to have the Gamma distribution with shape parameter α and scale parameter λ if X takes values in [0, ∞) with density
\[
p_X(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}.
\]
Furthermore, since the density must integrate to 1 over the range of the variable, we know that
\[
\int_0^{\infty} x^{\alpha - 1} e^{-\lambda x}\,dx = \frac{\Gamma(\alpha)}{\lambda^{\alpha}}.
\]
In particular, if t < λ, then
\[
\int_0^{\infty} x^{\alpha - 1} e^{-\lambda x} e^{tx}\,dx = \frac{\Gamma(\alpha)}{(\lambda - t)^{\alpha}}.
\]


This shows that the moment generating function of X is
\[
M_X(t) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha - 1} e^{-\lambda x} e^{tx}\,dx = \begin{cases} \left(\frac{\lambda}{\lambda - t}\right)^{\alpha} & \text{for } t < \lambda \\ \infty & \text{otherwise.} \end{cases}
\]
Now, suppose that X₁, ..., X_n are independent exponentially-distributed random variables, each with scale parameter λ. We showed previously that the moment generating function of each such variable is
\[
M_{X_i}(t) = \begin{cases} \frac{\lambda}{\lambda - t} & \text{for } t < \lambda \\ \infty & \text{otherwise.} \end{cases}
\]


Thus, if S = X₁ + ··· + X_n, then according to the theorem the moment generating function of S is
\[
M_S(t) = \prod_{i=1}^{n} M_{X_i}(t) = \left(\frac{\lambda}{\lambda - t}\right)^{n},
\]
provided that t < λ. However, since this is also the moment generating function of a Gamma-distributed random variable with shape parameter α = n and scale parameter λ, it follows that S is itself Gamma-distributed with these parameters.
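
A quick simulation sketch of this conclusion (numpy and scipy assumed available): compare sums of n independent Exponential(λ) draws against the Gamma(n, λ) distribution via a Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, lam = 5, 2.0
S = rng.exponential(scale=1/lam, size=(100_000, n)).sum(axis=1)

# KS comparison with Gamma(shape=n, rate=lam), i.e. scale 1/lam in scipy's convention
print(stats.kstest(S, stats.gamma(a=n, scale=1/lam).cdf))   # large p-value expected
```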


Moment generating functions can also be defined for random vectors.

Definition. Suppose that X = (X₁, ..., X_n) is a random vector with values in Rⁿ. Then the joint moment generating function of X is the function M_X : Rⁿ → [0, ∞] defined by the formula
\[
M_{\mathbf{X}}(t_1, \dots, t_n) = E\!\left[e^{\mathbf{t} \cdot \mathbf{X}}\right] = E\!\left[\prod_{i=1}^{n} e^{t_i X_i}\right].
\]
As in the univariate case, the joint moment generating function determines the joint distribution of X uniquely as long as it is finite in some open neighborhood of 0 ∈ Rⁿ. Furthermore, in this case, all of the joint moments of X are finite and
\[
M_{\mathbf{X}}(\mathbf{t}) = \sum_{k_1 \ge 0, \dots, k_n \ge 0} E\!\left[\prod_{i=1}^{n} X_i^{k_i}\right] \prod_{i=1}^{n} \frac{t_i^{k_i}}{k_i!}.
\]


Joint moment generating functions can sometimes be used to show that a collection of random variables are independent.

Theorem. Suppose that X = (X₁, ..., X_n) is a random vector with values in Rⁿ and that the joint moment generating function M_X(t) is finite in some open neighborhood of 0. Then X₁, ..., X_n are independent if and only if the joint moment generating function is equal to the product of the marginal moment generating functions:
\[
M_{\mathbf{X}}(\mathbf{t}) = \prod_{i=1}^{n} M_{X_i}(t_i)
\]
for all t = (t₁, ..., t_n) in this neighborhood.


Moment generating functions suffer from one major flaw, which is that they may be infinite at every point except the origin. In this case, the m.g.f. carries only very limited information about the distribution of the variable.

Example: Recall that a real-valued random variable X is said to have the Cauchy distribution if the density of X is
\[
p_X(x) = \frac{1}{\pi(1 + x^2)}.
\]
We previously showed that the expectation of a Cauchy random variable does not exist. The moment generating function does exist but is infinite everywhere except at t = 0:
\[
M_X(t) = \begin{cases} 1 & \text{for } t = 0 \\ \infty & \text{otherwise.} \end{cases}
\]


Indeed, if t > 0, then
\[
M_X(t) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{e^{tx}}{1 + x^2}\,dx \ge \frac{1}{\pi} \int_0^{\infty} \frac{e^{tx}}{1 + x^2}\,dx = \infty
\]
since
\[
\lim_{x \to \infty} \frac{e^{tx}}{1 + x^2} = \infty.
\]
Similarly, if t < 0, then
\[
M_X(t) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{e^{tx}}{1 + x^2}\,dx \ge \frac{1}{\pi} \int_{-\infty}^{0} \frac{e^{tx}}{1 + x^2}\,dx = \infty
\]
since
\[
\lim_{x \to -\infty} \frac{e^{tx}}{1 + x^2} = \infty.
\]


Furthermore, there are many other distributions that have exactly the same moment generating function as the Cauchy distribution. For example, for every real number γ > 1, there is a distribution on R with density
\[
p(x) = \frac{1}{C} \cdot \frac{1}{1 + |x|^{\gamma}},
\]
where C < ∞ is a normalizing constant, and calculations similar to those on the preceding slide show that the corresponding moment generating function is
\[
M(t) = \begin{cases} 1 & \text{for } t = 0 \\ \infty & \text{otherwise.} \end{cases}
\]


The Moment Problem

A natural question to ask is whether we can identify a distribution on the real line if all we know are its moments. The answer is clearly negative if any of the moments are infinite. However, as the next example demonstrates, the answer can still be negative even if all of the moments are finite.

Example: Suppose that X is a random variable that takes values in the set [0, ∞) with density
\[
f(x) = \frac{1}{C} e^{-\alpha x^{\lambda}}, \qquad x > 0,
\]
where α > 0, λ ∈ (0, 1/2), and C < ∞ is a normalizing constant. We first observe that all of the moments of X are finite since for any n ≥ 0
\[
m_n = E[X^n] = \frac{1}{C} \int_0^{\infty} x^n e^{-\alpha x^{\lambda}}\,dx < \infty.
\]


Next, let β = α tan(λπ) and, for each δ ∈ (−1, 1), let X(δ) be a random variable that takes values in [0, ∞) with density
\[
g_{\delta}(x) = f(x)\left[1 + \delta \sin(\beta x^{\lambda})\right], \qquad x > 0.
\]
I claim that, whatever the value of δ in this range, X(δ) has the same (finite) moments as X. To verify this claim, it suffices to show that
\[
\int_0^{\infty} x^n e^{-\alpha x^{\lambda}} \sin(\beta x^{\lambda})\,dx = 0
\]
for every n ≥ 0. We begin by observing that if p > 0 and q ∈ C is a complex number with positive real part, Re q > 0, then
\[
\int_0^{\infty} t^{p-1} e^{-qt}\,dt = \frac{\Gamma(p)}{q^p}.
\]
(The condition on the real part of q is needed so that the integral will converge.)


Consequently, if we let p = (n + 1)/λ, q = α + iβ, t = x^λ and dt = λx^{λ−1} dx, then this last identity implies that
\[
\begin{aligned}
\frac{\Gamma\!\left(\frac{n+1}{\lambda}\right)}{(\alpha + i\beta)^{(n+1)/\lambda}}
&= \int_0^{\infty} (x^{\lambda})^{[(n+1)/\lambda] - 1} e^{-(\alpha + i\beta)x^{\lambda}} \lambda x^{\lambda - 1}\,dx \\
&= \lambda \int_0^{\infty} x^n e^{-(\alpha + i\beta)x^{\lambda}}\,dx \\
&= \lambda \int_0^{\infty} x^n e^{-\alpha x^{\lambda}} \cos(\beta x^{\lambda})\,dx - i\lambda \int_0^{\infty} x^n e^{-\alpha x^{\lambda}} \sin(\beta x^{\lambda})\,dx.
\end{aligned}
\]
Here we have used Euler's formula,
\[
e^{i\theta} = \cos(\theta) + i\sin(\theta),
\]
with θ = βx^λ.


We now turn our attention to the denominator of the expression on the left-hand side. Recalling that β = α tan(λπ), this is
\[
\begin{aligned}
(\alpha + i\beta)^{(n+1)/\lambda} &= \alpha^{(n+1)/\lambda}\,(1 + i\tan(\lambda\pi))^{(n+1)/\lambda} \\
&= \alpha^{(n+1)/\lambda}\,\cos(\lambda\pi)^{-(n+1)/\lambda}\,(\cos(\lambda\pi) + i\sin(\lambda\pi))^{(n+1)/\lambda} \\
&= \alpha^{(n+1)/\lambda}\,\cos(\lambda\pi)^{-(n+1)/\lambda}\,e^{i\lambda\pi(n+1)/\lambda} \\
&= \alpha^{(n+1)/\lambda}\,\cos(\lambda\pi)^{-(n+1)/\lambda}\,e^{i\pi(n+1)} \\
&= \alpha^{(n+1)/\lambda}\,\cos(\lambda\pi)^{-(n+1)/\lambda}\,\cos((n+1)\pi) \in \mathbb{R},
\end{aligned}
\]
since sin((n + 1)π) = 0 for any integer n. This shows that the expression on the left-hand side is a real number. Since this is equal to the difference of a real-valued integral and an imaginary-valued integral, the latter must in fact be equal to 0,
\[
\int_0^{\infty} x^n e^{-\alpha x^{\lambda}} \sin(\beta x^{\lambda})\,dx = 0,
\]
and this holds for every integer n ≥ 0, which establishes that the variables X(δ), δ ∈ (−1, 1), all have the same finite moments.


The Normal Distribution

The normal distribution is one of the most important distributions in probability theory and statistics. We begin with the definition and then derive some of its properties.

Definition. A random variable X is said to have the normal distribution with mean µ and variance σ² > 0, written X ∼ N(µ, σ²), if X has density function
\[
p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \mu)^2/2\sigma^2}.
\]

If µ = 0 and σ = 1, then X is said to be a standard normal random variable.

Remark: The normal distribution is also called the Gaussian distribution, after C. F. Gauss (1777-1855), who proposed that measurement errors will often follow this distribution.


To verify that p_X(x) is a probability density function, we need to show that its integral over the real line is equal to 1. Suppose that µ = 0 and σ² = 1. Letting
\[
I = \int_{-\infty}^{\infty} e^{-y^2/2}\,dy,
\]
we have
\[
\begin{aligned}
I^2 &= \int_{-\infty}^{\infty} e^{-x^2/2}\,dx \int_{-\infty}^{\infty} e^{-y^2/2}\,dy \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2}\,dx\,dy \\
&= \int_0^{\infty}\int_0^{2\pi} r e^{-r^2/2}\,d\theta\,dr \qquad (dx\,dy = r\,dr\,d\theta) \\
&= 2\pi \int_0^{\infty} r e^{-r^2/2}\,dr = 2\pi.
\end{aligned}
\]


This shows that I = √(2π) and therefore
\[
\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy = 1.
\]
The general case then follows by making the substitution y = (x − µ)/σ and dy = dx/σ:
\[
\frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} e^{-(x - \mu)^2/2\sigma^2}\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy = 1.
\]


Exercise: Suppose that X ∼ N(µ, σ²) and let Y = aX + b, where a ≠ 0 and b are constants. Show that Y ∼ N(aµ + b, a²σ²).

Solution: If a > 0, then the cumulative distribution function of Y is
\[
P(Y \le x) = P(aX + b \le x) = P\!\left(X \le \frac{x - b}{a}\right) = F_X\!\left(\frac{x - b}{a}\right).
\]
Differentiating, we find that the density function of Y is
\[
p_Y(x) = \frac{1}{a}\, p_X\!\left(\frac{x - b}{a}\right) = \frac{1}{\sqrt{2\pi a^2\sigma^2}}\, e^{-(x - b - a\mu)^2/2a^2\sigma^2},
\]
which shows that Y is normally distributed with mean b + aµ and variance a²σ². The case a < 0 is handled similarly, starting from P(aX + b ≤ x) = P(X ≥ (x − b)/a).


The following special cases are particularly important:

If Z ∼ N(0, 1) is a standard normal random variable and X = µ + σZ, then X ∼ N(µ, σ²). This shows that every normal distribution can be constructed from the standard normal distribution via a linear transformation.

Conversely, if X ∼ N(µ, σ²), then Z = (X − µ)/σ ∼ N(0, 1) is a standard normal random variable. This process (subtracting the mean and dividing by the standard deviation) is called standardization.

Remark: These relationships allow us to deduce the properties of the general normal distribution from those of the standard normal distribution.


Proposition. If X ∼ N(µ, σ²), then the expected value of X is µ, while the variance of X is σ².

Proof: Let Z = (X − µ)/σ, so that Z ∼ N(0, 1). Then
\[
E[Z] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x e^{-x^2/2}\,dx = 0,
\]
while
\[
\mathrm{Var}(Z) = E\!\left[Z^2\right] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^2 e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \left( \left. -x e^{-x^2/2} \right|_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-x^2/2}\,dx \right) = 1.
\]
Since X = µ + σZ, the proposition follows from the identities
\[
E[X] = \mu + \sigma E[Z] = \mu, \qquad \mathrm{Var}(X) = \sigma^2\, \mathrm{Var}(Z) = \sigma^2.
\]


The cumulative distribution function of the standard normal distribution is usually denoted by the Greek letter Φ:
\[
\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt.
\]

Although this integral cannot be expressed in terms of elementary functions, it can be evaluated numerically using software such as Matlab or R. Alternatively, extensive tabulations of Φ can be found online and in most statistics text books.

The cumulative distribution function of a non-standard normal random variable X ∼ N(µ, σ²) can be evaluated in terms of Φ through standardization:
\[
F_X(x) = P(X \le x) = P\!\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = \Phi\!\left(\frac{x - \mu}{\sigma}\right).
\]


Because the tails of the normal density decay very rapidly, a normally distributed variable is unlikely to differ from its mean by more than a few standard deviations.

[Figure: normal densities p(x) with σ² = 0.25, 1, 4, 25.]

z     P(Z ≥ z)
0.5   0.30854
1     0.15866
2     0.02275
3     0.00135
4     0.00003
5     2.87E−07


The moment generating function of the normal distribution can be evaluated as follows. First, if Z ∼ N(0, 1), then
\[
\begin{aligned}
M_Z(t) = E\!\left[e^{tZ}\right] &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tz} e^{-z^2/2}\,dz \\
&= e^{t^2/2}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(z^2 - 2tz + t^2)}\,dz \\
&= e^{t^2/2}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(z - t)^2}\,dz \\
&= e^{t^2/2}.
\end{aligned}
\]
The general case is then dealt with by standardization: if X = µ + σZ ∼ N(µ, σ²), then
\[
M_X(t) = E\!\left[e^{tX}\right] = E\!\left[e^{t(\mu + \sigma Z)}\right] = e^{\mu t}\, E\!\left[e^{t\sigma Z}\right] = e^{\mu t + \sigma^2 t^2/2}.
\]


The next theorem asserts that a sum of independent normal random variables is still normally distributed.

Theorem. Suppose that X₁, ..., X_n are independent normal random variables and that Xᵢ has mean µᵢ and variance σᵢ². Then X = X₁ + ··· + X_n is normally distributed with mean µ = µ₁ + ··· + µ_n and variance σ² = σ₁² + ··· + σ_n².

Proof: Since X₁, ..., X_n are independent, the moment generating function of X is equal to the product of the moment generating functions of the Xᵢ's:
\[
M_X(t) = \prod_{i=1}^{n} M_{X_i}(t) = \prod_{i=1}^{n} e^{\mu_i t + \sigma_i^2 t^2/2} = \exp\!\left( t \sum_{i=1}^{n} \mu_i + \frac{1}{2} t^2 \sum_{i=1}^{n} \sigma_i^2 \right) = e^{\mu t + \sigma^2 t^2/2}.
\]
This shows that X has the same moment generating function as N(µ, σ²), which allows us to conclude that X itself has this distribution.


The Central Limit Theorem

Many kinds of data are found to be approximately normally distributed:

• measurement errors;

• the velocities of the molecules in an ideal gas;

• physical dimensions of individuals in a population, e.g., adult human heights and weights;

• the logarithm of the latency periods for chicken pox, hepatitis B, and polio infections;

• house prices within a given area.

As with the Poisson distribution, the fact that the normal distribution can be used to model so many kinds of data invites a mathematical explanation. This is provided, at least in part, by the central limit theorem, which asserts that the sum of a large number of i.i.d. random variables is approximately normal.


To explain the content of the central limit theorem, we need to decide what it means to say that a sequence of random variables (X_n : n ≥ 0) converges to a limit X. One type of convergence, called weak convergence, is defined below.

Definition.

1 Suppose that F and F_n, n ≥ 1, are cumulative distribution functions on R. Then F_n is said to converge weakly to F if lim_{n→∞} F_n(x) = F(x) at every point x where F is continuous.

2 A sequence of random variables X_n is said to converge in distribution to a random variable X if the cumulative distribution functions F_n(x) = P(X_n ≤ x) converge weakly to F(x) = P(X ≤ x).

Remark: In fact, there are many different senses in which a sequence of random variables can converge to a limit. This mode of convergence is called weak convergence because it is implied by most of the other criteria in common use.


Example: To see why we only require pointwise convergence at continuity points, let X_n and X be the degenerate random variables defined by setting
\[
P(X_n = 1/n) = 1 \quad \text{and} \quad P(X = 0) = 1.
\]
Clearly, X_n should converge to X under any reasonable definition of convergence. In fact,
\[
F_n(x) = \begin{cases} 0 & \text{if } x < 1/n \\ 1 & \text{otherwise} \end{cases} \qquad \text{and} \qquad F(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{otherwise,} \end{cases}
\]
which shows that F_n(x) converges to F(x) as n → ∞ at every x ≠ 0. Since F(x) is continuous everywhere except at x = 0, it follows that the sequence X_n converges in distribution to X even though
\[
1 = F(0) \ne 0 = \lim_{n \to \infty} F_n(0).
\]


The next proposition shows how we can use moment generating functions to deduce that a sequence of random variables converges in distribution. The assumption that the moment generating function of the limit X is finite is essential.

Theorem. Let X₁, X₂, ... be a sequence of random variables with moment generating functions M_{X_n}(t), and let X be a random variable with moment generating function M_X. Then the sequence X_n converges in distribution to X if
\[
\lim_{n \to \infty} M_{X_n}(t) = M_X(t) < \infty
\]
for all t ∈ R.

Remark: This result is useful because it can be used to prove weak convergence even when we cannot explicitly calculate the cumulative distribution functions of the random variables involved.


We now come to the main result.

Theorem (The Central Limit Theorem). Suppose that X₁, X₂, ... is a sequence of i.i.d. variables with finite mean µ and variance σ², and let S_n = X₁ + ··· + X_n. Then the sequence of random variables
\[
Z_n \equiv \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{\sqrt{n}}{\sigma}\left(\frac{1}{n} S_n - \mu\right)
\]
converges in distribution to the standard normal N(0, 1), i.e., for every real number z,
\[
\lim_{n \to \infty} P\{Z_n \le z\} = \Phi(z) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-x^2/2}\,dx.
\]
Proof: We will prove the CLT under the assumption that the moment generating function of the random variables Xᵢ, denoted M(t), is finite on the entire real line. Then it suffices to show that the moment generating functions of the variables Z_n converge pointwise to the moment generating function of a standard normal random variable.


Let us first assume that µ = 0 and σ² = 1. Notice that the moment generating function of the scaled random variable Xᵢ/√n is
\[
E\!\left[\exp\!\left\{\frac{tX_i}{\sqrt{n}}\right\}\right] = M\!\left(\frac{t}{\sqrt{n}}\right)
\]
and that the moment generating function of the sum Z_n = ∑ᵢ₌₁ⁿ Xᵢ/√n is equal to
\[
M_{Z_n}(t) = \left[ M\!\left(\frac{t}{\sqrt{n}}\right) \right]^n.
\]
Let L(t) = log M(t). Since M(0) = 1, M′(0) = µ = 0 and M″(0) = σ² = 1, we have
\[
L(0) = \log(M(0)) = 0, \qquad L'(0) = \frac{M'(0)}{M(0)} = 0, \qquad L''(0) = \frac{M(0)M''(0) - M'(0)^2}{M(0)^2} = 1.
\]


It suffices to show that
\[
\lim_{n \to \infty} \left[ M\!\left(\frac{t}{\sqrt{n}}\right) \right]^n = e^{t^2/2},
\]
which is equivalent to
\[
\lim_{n \to \infty} n L\!\left(\frac{t}{\sqrt{n}}\right) = \frac{t^2}{2}.
\]
However, this last identity can be verified using L'Hôpital's rule:
\[
\lim_{n \to \infty} \frac{L(t/\sqrt{n})}{n^{-1}} = \lim_{x \to 0} \frac{L(tx)}{x^2} = \lim_{x \to 0} \frac{tL'(tx)}{2x} = \lim_{x \to 0} \frac{t^2 L''(tx)}{2} = \frac{t^2}{2}.
\]
The general case can then be handled by applying this result to the standardized variables Xᵢ* = (Xᵢ − µ)/σ, which have mean 0 and variance 1.
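
A simulation sketch of the theorem (numpy assumed available; the choice of uniform summands and the sample sizes are illustrative): form standardized sums Z_n of i.i.d. uniforms and compare their empirical CDF with Φ at a few points.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 50, 200_000
X = rng.uniform(0, 1, size=(reps, n))          # summands: mean 1/2, variance 1/12
Z = (X.sum(axis=1) - n/2) / np.sqrt(n/12)      # Z_n = (S_n - n*mu)/(sigma*sqrt(n))

for z in [-2, -1, 0, 1, 2]:
    print(z, np.mean(Z <= z))   # should be close to Phi(z)
```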


The following result, known as the de Moivre-Laplace Theorem, was the first version of the CLT to be discovered. It asserts that the binomial distribution with parameters n and p is approximately normal when n is large.

Corollary. If S_n is a binomial random variable with parameters n and p ∈ (0, 1), then
\[
\lim_{n \to \infty} P\!\left\{ a \le \frac{S_n - np}{\sqrt{np(1 - p)}} \le b \right\} = \Phi(b) - \Phi(a).
\]
Proof: The result follows from the CLT once we note that S_n has the same distribution as the sum of n independent Bernoulli(p) random variables,
\[
S_n \overset{d}{=} X_1 + \cdots + X_n,
\]
and that E[S_n] = np and Var(S_n) = np(1 − p).


Example: Suppose that a fair coin is tossed 100 times. What is the probability that the number of heads obtained is between 45 and 55 (inclusive)?

Solution: If X denotes the number of heads obtained in 100 tosses, then X is a binomial random variable with parameters (100, 1/2). By the de Moivre-Laplace theorem, we know that
\[
P\{45 \le X \le 55\} = P\!\left\{ -1 \le \frac{X - 50}{5} \le 1 \right\} \approx \Phi(1) - \Phi(-1) = \Phi(1) - (1 - \Phi(1)) = 2\Phi(1) - 1 = 0.683.
\]
Notice that we have also made use of the identity
\[
\Phi(-x) = 1 - \Phi(x),
\]
which follows from the fact that if Z ∼ N(0, 1), then
\[
P\{Z \le -x\} = P\{Z \ge x\} = 1 - P\{Z \le x\}.
\]
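
A hedged check of this calculation in Python (scipy assumed available): compare the exact binomial probability with the plain normal approximation above and with the continuity-corrected one, which uses the endpoints 44.5 and 55.5.

```python
from scipy.stats import binom, norm

exact = binom.cdf(55, 100, 0.5) - binom.cdf(44, 100, 0.5)   # P(45 <= X <= 55)
plain = norm.cdf(1) - norm.cdf(-1)                          # 2*Phi(1) - 1 ≈ 0.683
corrected = norm.cdf(1.1) - norm.cdf(-1.1)                  # (55.5 - 50)/5 = 1.1

print(exact, plain, corrected)   # exact ≈ 0.729; the corrected value is closer
```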


The convergence of the binomial distribution to the normal distribution with increasing n is illustrated in the figures below, which compare the normal distribution with mean np and variance np(1 − p) with the binomial distribution for n = 10 (left) and n = 100 (right) when p = 0.2.

[Figures: binomial(n, 0.2) probabilities vs. the matching normal density, for n = 10 and n = 100.]

Remarkably, the match between the binomial and normal densities is pretty good even when n is only 10.


Another example of the scope of the CLT is provided by the distribution of adult human heights, which is approximately normal. The figure shows a histogram for the heights of a sample of 5000 adults (source: SOCR), as well as the best fitting normal distribution.

[Figure: histogram of adult heights in inches, with the best-fitting normal density.]

Normality of quantitative traits can be explained by Fisher's infinitesimal model:

• The trait depends on a large number L of variable loci.

• The two alleles at each locus have a small effect X_{l,m} and X_{l,p} on the trait.

• The loci act additively.

Then an individual's height may be expressed as below:
\[
H = \sum_{l=1}^{L} (X_{l,m} + X_{l,p}) + \varepsilon,
\]
where ε is the random environmental effect on height.


The Central Limit Theorem even shows up in number theory. Here is a typical example.

Theorem (Erdős-Kac). For each n ≥ 1, let Y_n be uniformly distributed on the set {1, ..., n}, and define φ(n) to be the number of distinct prime divisors of the integer n, e.g., φ(2) = 1, φ(3) = 1, φ(4) = 1, φ(5) = 1, φ(6) = 2, etc. Then, for every real number x,
\[
\lim_{n \to \infty} P\!\left( \frac{\varphi(Y_n) - \log\log(n)}{\sqrt{\log\log(n)}} \le x \right) = \Phi(x).
\]
In other words, if an integer is chosen at random between 1 and n, then for large n the number of distinct prime divisors of that integer is approximately normally distributed with mean and variance both equal to log log(n).

Intuition: The result can be proven by writing φ(Y_n) as a sum of indicator variables and then showing that these are approximately independent and identically distributed:
\[
\varphi(Y_n) = \sum_{p \le Y_n} \mathbf{1}_{p \mid Y_n}.
\]


One question that we might ask is why the limit is normal, i.e., what's so normal about the normal distribution? One answer to this question comes from information theory.

Definition. The differential entropy of a continuous probability distribution P with density p(x) on the real line is the quantity
\[
H(P) = -\int_{-\infty}^{\infty} p(x) \ln(p(x))\,dx.
\]
The entropy of a distribution can be thought of as a measure of the information content of the distribution, i.e., if X has distribution P then the entropy H(P) is a measure of how much information we gain when we observe the outcome X of an experiment. Equivalently, H(P) also quantifies the amount of uncertainty that we have before we observe this outcome: the greater the uncertainty in advance of the observation, the more we learn by observing the outcome.
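
A small numerical sketch of this definition (numpy and scipy assumed available): for the standard normal, the differential entropy has the known closed form ½ ln(2πe) ≈ 1.4189, which the integral should reproduce.

```python
import numpy as np
from scipy.integrate import quad

p = lambda x: np.exp(-x**2/2) / np.sqrt(2*np.pi)        # standard normal density
# integrate -p ln p; tails beyond |x| = 20 contribute negligibly
H, _ = quad(lambda x: -p(x) * np.log(p(x)), -20, 20)

print(H, 0.5*np.log(2*np.pi*np.e))   # both should be ≈ 1.4189
```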


Part of the connection with the Central Limit Theorem is provided by the following proposition.

Lemma. Of all the continuous distributions on the real line with mean 0 and variance 1, the standard normal distribution is the unique distribution with maximum entropy, i.e.,
\[
H(\mathcal{N}(0, 1)) = \sup\{ H(P) : P \in \mathcal{P}_{0,1}(\mathbb{R}) \},
\]
where P₀,₁(R) is the collection of all probability measures on the real line with mean 0 and variance 1.

The reason this is relevant is that the operation of averaging a set of random variables is one that generates entropy, i.e., one can show that the entropy of a standardized sum of a collection of i.i.d. random variables is at least as great as the entropy of any one of those variables. This makes sense since in the process of averaging a set of random variables, we are losing information and thus increasing entropy.
