
Stat 110 Final Review, Fall 2011

Prof. Joe Blitzstein

1 General Information

The final will be on Thursday 12/15, from 2 PM to 5 PM. No books, notes, computers, cell phones, or calculators are allowed, except that you may bring four pages of standard-sized paper (8.5” x 11”) with anything you want written (or typed) on both sides. There will be approximately 8 problems, equally weighted. The material covered will be cumulative since probability is cumulative.

To study, I recommend solving lots and lots of practice problems! It’s a good idea to work through as many of the problems on this handout as possible without looking at solutions (and then discussing with others and looking at solutions to check your answers and for any problems where you were really stuck), and to take at least two of the practice finals under timed conditions using only four pages of notes. Carefully going through class notes, homeworks, and handouts (especially this handout and the midterm review handout) is also important, as long as it is done actively (intermixing reading, thinking, solving problems, and asking questions).

2 Topics

• Combinatorics: multiplication rule, tree diagrams, binomial coefficients, permutations and combinations, sampling with/without replacement when order does/doesn’t matter, inclusion-exclusion, story proofs.

• Basic Probability: sample spaces, events, axioms of probability, equally likely outcomes, inclusion-exclusion, unions, intersections, and complements.

• Conditional Probability: definition and meaning, writing P(A1 ∩ A2 ∩ · · · ∩ An) as a product, Bayes’ Rule, Law of Total Probability, thinking conditionally, prior vs. posterior probability, independence vs. conditional independence.

• Random Variables: definition and interpretations, stories, discrete vs. continuous, distributions, CDFs, PMFs, PDFs, MGFs, functions of a r.v., indicator r.v.s, memorylessness of the Exponential, universality of the Uniform, Poisson approximation, Poisson processes, Beta as conjugate prior for the Binomial, sums (convolutions), location and scale.


• Expected Value: linearity, fundamental bridge, variance, standard deviation, covariance, correlation, using expectation to prove existence, LOTUS.

• Conditional Expectation: definition and meaning, taking out what’s known, conditional variance, Adam’s Law (iterated expectation), Eve’s Law.

• Important Discrete Distributions: Bernoulli, Binomial, Geometric, Negative Binomial, Hypergeometric, Poisson.

• Important Continuous Distributions: Uniform, Normal, Exponential, Gamma, Beta, Chi-Square, Student-t.

• Jointly Distributed Random Variables: joint, conditional, and marginal distributions, independence, Multinomial, Multivariate Normal, change of variables, order statistics.

• Convergence: Law of Large Numbers, Central Limit Theorem.

• Inequalities: Cauchy-Schwarz, Markov, Chebyshev, Jensen.

• Markov chains: Markov property, transition matrix, irreducibility, stationary distributions, reversibility.

• Strategies: conditioning, symmetry, linearity, indicator r.v.s, stories, checking whether answers make sense (e.g., looking at simple and extreme cases and avoiding category errors).

• Some Important Examples: birthday problem, matching problem (de Montmort), Monty Hall, gambler’s ruin, prosecutor’s fallacy, testing for a disease, capture-recapture (elk problem), coupon (toy) collector, St. Petersburg paradox, Simpson’s paradox, two envelope paradox, waiting time for HH vs. waiting time for HT, store with a random number of customers, bank–post office example, Bayes’ billiards, random walk on a network, chicken and egg.


3 Important Distributions

3.1 Table of Distributions

The table below will be provided on the final (included as the last page). This is meant to help avoid having to memorize formulas for the distributions (or having to take up a lot of space on your pages of notes). Here 0 < p < 1 and q = 1 − p. The parameters for Gamma and Beta are positive real numbers; n, r, and w are positive integers, as is b for the Hypergeometric.

Name        | Param.  | PMF or PDF                                                           | Mean         | Variance
Bernoulli   | p       | P(X = 1) = p, P(X = 0) = q                                           | p            | pq
Binomial    | n, p    | (n choose k) p^k q^(n−k), for k ∈ {0, 1, . . . , n}                  | np           | npq
Geometric   | p       | q^k p, for k ∈ {0, 1, 2, . . . }                                     | q/p          | q/p²
NegBinom    | r, p    | (r+n−1 choose r−1) p^r q^n, for n ∈ {0, 1, 2, . . . }                | rq/p         | rq/p²
Hypergeom   | w, b, n | (w choose k)(b choose n−k)/(w+b choose n), for k ∈ {0, 1, . . . , n} | μ = nw/(w+b) | ((w+b−n)/(w+b−1)) n (μ/n)(1 − μ/n)
Poisson     | λ       | e^(−λ) λ^k/k!, for k ∈ {0, 1, 2, . . . }                             | λ            | λ
Uniform     | a < b   | 1/(b−a), for x ∈ (a, b)                                              | (a+b)/2      | (b−a)²/12
Normal      | μ, σ²   | (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))                                       | μ            | σ²
Exponential | λ       | λe^(−λx), for x > 0                                                  | 1/λ          | 1/λ²
Gamma       | a, λ    | Γ(a)⁻¹ (λx)^a e^(−λx) x⁻¹, for x > 0                                 | a/λ          | a/λ²
Beta        | a, b    | (Γ(a+b)/(Γ(a)Γ(b))) x^(a−1) (1−x)^(b−1), for 0 < x < 1               | μ = a/(a+b)  | μ(1−μ)/(a+b+1)
Chi-Square  | n       | (1/(2^(n/2) Γ(n/2))) x^(n/2−1) e^(−x/2), for x > 0                   | n            | 2n
Student-t   | n       | (Γ((n+1)/2)/(√(nπ) Γ(n/2))) (1 + x²/n)^(−(n+1)/2)                    | 0 if n > 1   | n/(n−2) if n > 2


3.2 Connections Between Distributions

The table above summarizes the PMFs/PDFs of the important distributions, and their means and variances, but it does not say where each distribution comes from (stories), or how the distributions interrelate. Some of these connections between distributions are listed below.

Also note that some of the important distributions are special cases of others. Bernoulli is a special case of Binomial; Geometric is a special case of Negative Binomial; Unif(0,1) is a special case of Beta; and Exponential and χ² are both special cases of Gamma.

1. Binomial: If X1, . . . , Xn are i.i.d. Bern(p), then X1 + · · · + Xn ∼ Bin(n, p).

2. Neg. Binom.: If G1, . . . , Gr are i.i.d. Geom(p), then G1 + · · · + Gr ∼ NBin(r, p).

3. Location and Scale: If Z ∼ N(0, 1), then μ + σZ ∼ N(μ, σ²).

If U ∼ Unif(0, 1) and a < b, then a + (b − a)U ∼ Unif(a, b).

If X ∼ Expo(1), then λ⁻¹X ∼ Expo(λ).

If Y ∼ Gamma(a, λ), then λY ∼ Gamma(a, 1).

4. Symmetry: If X ∼ Bin(n, 1/2), then n − X ∼ Bin(n, 1/2).

If U ∼ Unif(0, 1), then 1 − U ∼ Unif(0, 1).

If Z ∼ N(0, 1), then −Z ∼ N(0, 1).

5. Universality of Uniform: Let F be the CDF of a continuous r.v., such that F⁻¹ exists. If U ∼ Unif(0, 1), then F⁻¹(U) has CDF F. Conversely, if X ∼ F, then F(X) ∼ Unif(0, 1). (This is spot-checked by simulation in the sketch after this list.)

6. Uniform and Beta: Unif(0, 1) is the same distribution as Beta(1, 1). The jth order statistic of n i.i.d. Unif(0, 1) r.v.s is Beta(j, n − j + 1).

7. Beta and Binomial: Beta is the conjugate prior to Binomial, in the sense that if X|p ∼ Bin(n, p) and the prior is p ∼ Beta(a, b), then the posterior is p|X ∼ Beta(a + X, b + n − X).

8. Gamma: If X1, . . . , Xn are i.i.d. Expo(λ), then X1 + · · · + Xn ∼ Gamma(n, λ).

9. Gamma and Poisson: In a Poisson process of rate λ, the number of arrivals in a time interval of length t is Pois(λt), while the time of the nth arrival is Gamma(n, λ).


10. Gamma and Beta: If X ∼ Gamma(a, λ) and Y ∼ Gamma(b, λ) are independent, then X/(X + Y) ∼ Beta(a, b) is independent of X + Y ∼ Gamma(a + b, λ).

11. Chi-Square: χ²_n is the same distribution as Gamma(n/2, 1/2).

12. Student-t: If Z ∼ N(0, 1) and Y ∼ χ²_n are independent, then Z/√(Y/n) has the Student-t distribution with n degrees of freedom. For n = 1, this becomes the Cauchy distribution, which we can also think of as the distribution of Z1/Z2 with Z1, Z2 i.i.d. N(0, 1).
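These connections are easy to spot-check numerically. Below is a minimal simulation sketch (assuming NumPy is available; this is just for checking, not something the exam requires) of item 5 for the Expo(1) CDF F(x) = 1 − e^(−x), whose inverse is F⁻¹(u) = −log(1 − u).

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.uniform(size=10**6)

    # Direction 1: F^{-1}(U) should be Expo(1), which has mean 1 and variance 1.
    x = -np.log(1 - u)
    print(x.mean(), x.var())          # both approximately 1

    # Direction 2: F(X) should be Unif(0,1), with mean 1/2 and variance 1/12.
    x2 = rng.exponential(size=10**6)
    f_of_x = 1 - np.exp(-x2)
    print(f_of_x.mean(), f_of_x.var())  # about 0.5 and 0.0833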

4 Sums of Independent Random Variables

Let X1, X2, . . . , Xn be independent random variables. The table below shows the distribution of their sum, X1 + X2 + · · · + Xn, for various important cases depending on the distribution of Xi. The central limit theorem says that a sum of a large number of i.i.d. r.v.s will be approximately Normal, while these are exact distributions.

Xi                     | Σ_{i=1}^n Xi
Bernoulli(p)           | Binomial(n, p)
Binomial(mi, p)        | Binomial(Σ_{i=1}^n mi, p)
Geometric(p)           | NBin(n, p)
NBin(ri, p)            | NBin(Σ_{i=1}^n ri, p)
Poisson(λi)            | Poisson(Σ_{i=1}^n λi)
Unif(0, 1)             | Triangle(0, 1, 2) (for n = 2)
N(μi, σi²)             | N(Σ_{i=1}^n μi, Σ_{i=1}^n σi²)
Exponential(λ)         | Gamma(n, λ)
Gamma(αi, λ)           | Gamma(Σ_{i=1}^n αi, λ)
Zi², for Zi ∼ N(0, 1)  | χ²_n
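As with the connections above, each row can be checked by simulation. Here is a minimal sketch (assuming NumPy) comparing the sum of n = 5 i.i.d. Expo(2) r.v.s with the Gamma(5, 2) moments from the table of distributions:

    import numpy as np

    rng = np.random.default_rng(0)
    n, lam = 5, 2.0

    # Sum of n i.i.d. Expo(lam) draws, repeated many times.
    sums = rng.exponential(scale=1/lam, size=(10**6, n)).sum(axis=1)

    # Gamma(n, lam) has mean n/lam and variance n/lam^2.
    print(sums.mean(), n / lam)       # both about 2.5
    print(sums.var(), n / lam**2)     # both about 1.25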


5 Review of Some Useful Results

5.1 De Morgan’s Laws

(A1 ∪ A2 ∪ · · · ∪ An)^c = A1^c ∩ A2^c ∩ · · · ∩ An^c,

(A1 ∩ A2 ∩ · · · ∩ An)^c = A1^c ∪ A2^c ∪ · · · ∪ An^c.

5.2 Complements

P(A^c) = 1 − P(A).

5.3 Unions

P(A ∪ B) = P(A) + P(B) − P(A ∩ B);

P(A1 ∪ A2 ∪ · · · ∪ An) = Σ_{i=1}^n P(Ai), if the Ai are disjoint;

P(A1 ∪ A2 ∪ · · · ∪ An) ≤ Σ_{i=1}^n P(Ai);

P(A1 ∪ A2 ∪ · · · ∪ An) = Σ_{k=1}^n (−1)^(k+1) Σ_{i1<i2<···<ik} P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik)   (Inclusion-Exclusion).
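Inclusion-exclusion is what solves the matching problem (de Montmort) from the list of important examples: with Ai the event that card i lands in position i, it gives P(A1 ∪ · · · ∪ An) = Σ_{k=1}^n (−1)^(k+1)/k!. A minimal sketch (assuming NumPy) comparing this with a simulation:

    import numpy as np
    from math import factorial

    rng = np.random.default_rng(0)
    n = 10

    # Inclusion-exclusion answer for P(at least one match).
    p_exact = sum((-1) ** (k + 1) / factorial(k) for k in range(1, n + 1))

    trials = 10**5
    hits = sum(np.any(rng.permutation(n) == np.arange(n)) for _ in range(trials))
    print(p_exact, hits / trials)     # both about 1 - 1/e = 0.6321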

5.4 Intersections

P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B),

P(A1 ∩ A2 ∩ · · · ∩ An) = P(A1)P(A2|A1)P(A3|A1, A2) · · · P(An|A1, . . . , An−1).

5.5 Law of Total Probability

If E1, E2, . . . , En are a partition of the sample space S (i.e., they are disjoint and their union is all of S) and P(Ei) ≠ 0 for all i, then

P(A) = Σ_{i=1}^n P(A|Ei)P(Ei).


An analogous formula holds for conditioning on a continuous r.v. X with PDF f(x):

P(A) = ∫_{−∞}^{∞} P(A|X = x) f(x) dx.

Similarly, to go from a joint PDF f(x, y) for (X, Y) to the marginal PDF of Y, integrate over all values of x:

fY(y) = ∫_{−∞}^{∞} f(x, y) dx.

5.6 Bayes’ Rule

P(A|B) = P(B|A)P(A) / P(B).

Often the denominator P(B) is then expanded by the Law of Total Probability. For continuous r.v.s X and Y, Bayes’ Rule becomes

fY|X(y|x) = fX|Y(x|y) fY(y) / fX(x).
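For instance, in the disease-testing example from the list of important examples (with made-up numbers: 1% prevalence, 95% sensitivity, 5% false positive rate), Bayes’ Rule with the Law of Total Probability in the denominator gives a surprisingly small posterior. A minimal sketch:

    # P(D | +) = P(+ | D) P(D) / P(+), with the denominator expanded as
    # P(+|D)P(D) + P(+|D^c)P(D^c) by the Law of Total Probability.
    p_d = 0.01             # prior (hypothetical prevalence)
    p_pos_d = 0.95         # hypothetical P(positive | diseased)
    p_pos_not_d = 0.05     # hypothetical P(positive | not diseased)

    p_pos = p_pos_d * p_d + p_pos_not_d * (1 - p_d)
    print(p_pos_d * p_d / p_pos)   # about 0.16, far below 0.95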

5.7 Expected Value, Variance, and Covariance

Expected value is linear: for any random variables X and Y and constant c,

E(X + Y ) = E(X) + E(Y ),

E(cX) = cE(X).

Variance can be computed in two ways:

Var(X) = E(X − EX)² = E(X²) − (EX)².

Constants come out from variance as the constant squared:

Var(cX) = c²Var(X).

For the variance of the sum, there is a covariance term:

Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ),

where

Cov(X, Y) = E((X − EX)(Y − EY)) = E(XY) − (EX)(EY).


So if X and Y are uncorrelated, then the variance of the sum is the sum of the variances. Recall that independent implies uncorrelated but not vice versa. Covariance is symmetric:

Cov(Y,X) = Cov(X, Y ),

and covariances of sums can be expanded as

Cov(X + Y, Z +W ) = Cov(X,Z) + Cov(X,W ) + Cov(Y, Z) + Cov(Y,W ).

Note that for c a constant,

Cov(X, c) = 0,

Cov(cX, Y ) = cCov(X, Y ).

The correlation of X and Y, which is between −1 and 1, is

Corr(X, Y) = Cov(X, Y) / (SD(X)SD(Y)).

This is also the covariance of the standardized versions of X and Y .
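A minimal sketch (assuming NumPy) checking the Var(X + Y) identity and that the correlation lands in [−1, 1], for deliberately dependent X and Y:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=10**6)
    y = x + rng.normal(size=10**6)        # dependent on x by construction

    cov = np.cov(x, y)[0, 1]
    print(np.var(x + y), np.var(x) + np.var(y) + 2 * cov)  # agree up to noise
    print(cov / (x.std() * y.std()))      # correlation, about 0.707, in [-1, 1]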

5.8 Law of the Unconscious Statistician (LOTUS)

Let X be a discrete random variable and h be a real-valued function. Then Y = h(X) is a random variable. To compute EY using the definition of expected value, we would need to first find the PMF of Y and use EY = Σ_y y P(Y = y). The Law of the Unconscious Statistician says we can use the PMF of X directly:

Eh(X) = Σ_x h(x) P(X = x).

Similarly, for X a continuous r.v. with PDF fX(x), we can find the expected value of Y = h(X) by integrating h(x) times the PDF of X, without first finding fY(y):

Eh(X) = ∫_{−∞}^{∞} h(x) fX(x) dx.
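A minimal sketch (assuming NumPy and SciPy) of LOTUS for h(x) = cos(x) and X ∼ Expo(1), where ∫_0^∞ cos(x) e^(−x) dx works out to exactly 1/2:

    import numpy as np
    from scipy.integrate import quad

    rng = np.random.default_rng(0)

    # Monte Carlo estimate of E(cos(X)) for X ~ Expo(1)...
    x = rng.exponential(size=10**6)
    print(np.cos(x).mean())

    # ...versus LOTUS: integrate cos(x) against the PDF of X directly,
    # with no need to find the distribution of cos(X).
    val, _ = quad(lambda t: np.cos(t) * np.exp(-t), 0, np.inf)
    print(val)                        # exactly 1/2, matching the simulation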

5.9 Indicator Random Variables

Let A and B be events. Indicator r.v.s bridge between probability and expectation: P(A) = E(I_A), where I_A is the indicator r.v. for A. It is often useful to think of a “counting” r.v. as a sum of indicator r.v.s. Indicator r.v.s have many pleasant properties. For example, (I_A)^k = I_A for any positive number k, so it’s easy to handle moments of indicator r.v.s. Also note that

I_{A∩B} = I_A I_B,

I_{A∪B} = I_A + I_B − I_A I_B.
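The “counting r.v. as a sum of indicators” idea is exactly how one can attack, e.g., problem 1(a) on the 2007 final below. A minimal sketch (assuming NumPy) of the fundamental bridge plus linearity for the expected number of birthdays represented among 100 people:

    import numpy as np

    rng = np.random.default_rng(0)
    people, days = 100, 365

    # Write the count as a sum of 365 indicators I_j = 1 if day j is represented.
    # By the fundamental bridge and linearity, the mean is 365 * P(I_1 = 1).
    exact = days * (1 - (1 - 1 / days) ** people)

    sims = rng.integers(days, size=(10**4, people))
    print(exact, np.mean([len(np.unique(row)) for row in sims]))  # both about 87.5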

5.10 Symmetry

There are many beautiful and useful forms of symmetry in statistics. For example:

1. If X and Y are i.i.d., then P(X < Y) = P(Y < X). More generally, if X1, . . . , Xn are i.i.d., then P(X1 < X2 < · · · < Xn) = P(Xn < Xn−1 < · · · < X1), and likewise all n! orderings are equally likely (in the continuous case it follows that P(X1 < X2 < · · · < Xn) = 1/n!, while in the discrete case we also have to consider ties).

2. If we shuffle a deck of cards and deal the first two cards, then the probability is 1/52 that the second card is the Ace of Spades, since by symmetry it’s equally likely to be any card; it’s not necessary to do a law of total probability calculation conditioning on the first card.

3. Consider the Hypergeometric, thought of as the distribution of the number of white balls, where we draw n balls from a jar with w white balls and b black balls (without replacement). By symmetry and linearity, we can immediately get that the expected value is nw/(w + b), even though the trials are not independent, as the jth ball is equally likely to be any of the balls, and linearity still holds with dependent r.v.s.

4. By symmetry we can see immediately that if T is Cauchy, then 1/T is also Cauchy (since if we flip the ratio of two i.i.d. N(0, 1) r.v.s, we still have the ratio of two i.i.d. N(0, 1) r.v.s!).

5. E(X1|X1 + X2) = E(X2|X1 + X2) by symmetry if X1 and X2 are i.i.d. So by linearity, E(X1|X1 + X2) + E(X2|X1 + X2) = E(X1 + X2|X1 + X2) = X1 + X2, which gives E(X1|X1 + X2) = (X1 + X2)/2.


5.11 Change of Variables

Let Y = g(X), where g is a differentiable function from R^n to itself whose inverse exists, and X = (X1, . . . , Xn) is a continuous random vector with PDF f_X. The PDF of Y can be found using the Jacobian of the transformation g:

f_Y(y) = f_X(x) |∂x/∂y|,

where x = g⁻¹(y) and |∂x/∂y| is the absolute value of the Jacobian determinant of g⁻¹ (here |∂x/∂y| can either be found directly or by using the reciprocal of |∂y/∂x|).

In the case n = 1, this says that if Y = g(X) where g is differentiable with g′(x) > 0 everywhere, then

f_Y(y) = f_X(x) dx/dy,

which is easily remembered if written in the form

f_Y(y) dy = f_X(x) dx.

Remember when using this that f_Y(y) is a function of y (found by solving for x in terms of y), and the bounds for y should be specified. For example, if y = e^x and x ranges over R, then y ranges over (0, ∞).
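A minimal sketch (assuming NumPy) of the n = 1 formula for exactly this example, Y = e^X with X ∼ N(0, 1): here x = log y and dx/dy = 1/y, so f_Y(y) = f_X(log y)/y on (0, ∞).

    import numpy as np

    rng = np.random.default_rng(0)
    y = np.exp(rng.normal(size=10**6))    # Y = e^X with X ~ N(0,1)

    hist, edges = np.histogram(y, bins=100, range=(0.1, 5), density=True)
    centers = (edges[:-1] + edges[1:]) / 2

    # f_Y(y) = f_X(log y) / y = exp(-(log y)^2 / 2) / (y * sqrt(2 pi)).
    formula = np.exp(-np.log(centers) ** 2 / 2) / (centers * np.sqrt(2 * np.pi))
    print(np.max(np.abs(hist - formula)))  # small: histogram tracks the formula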

5.12 Order Statistics

Let X1, . . . , Xn be i.i.d. continuous r.v.s with PDF f and CDF F. The order statistics are obtained by sorting the Xi’s, with X(1) ≤ X(2) ≤ · · · ≤ X(n). The marginal PDF of the jth order statistic is

f_{X(j)}(x) = n (n−1 choose j−1) f(x) F(x)^(j−1) (1 − F(x))^(n−j).
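A minimal sketch (assuming NumPy) checking this formula, and connection 6 above, for the 2nd smallest of n = 5 i.i.d. Unif(0, 1) r.v.s (here f = 1 and F(x) = x, and the formula reduces to the Beta(2, 4) PDF):

    import numpy as np
    from math import comb

    rng = np.random.default_rng(0)
    n, j, x = 5, 2, 0.3

    # Formula: n * C(n-1, j-1) * f(x) F(x)^(j-1) (1-F(x))^(n-j), with f = 1, F(x) = x.
    pdf = n * comb(n - 1, j - 1) * x ** (j - 1) * (1 - x) ** (n - j)

    # Empirical density of the jth order statistic near x.
    order_stats = np.sort(rng.uniform(size=(10**6, n)), axis=1)[:, j - 1]
    eps = 0.01
    print(pdf, np.mean(np.abs(order_stats - x) < eps) / (2 * eps))  # both about 2.06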

5.13 Moment Generating Functions

The moment generating function of X is the function

M_X(t) = E(e^(tX)),

if this exists for all t in some open interval containing 0. For X1, . . . , Xn independent, the MGF of the sum Sn = X1 + · · · + Xn is

M_{Sn}(t) = M_{X1}(t) · · · M_{Xn}(t),


which is often much easier to deal with than a convolution. The name “moment generating function” comes from the fact that the derivatives of M_X at 0 give the moments of X:

M′_X(0) = E(X), M″_X(0) = E(X²), M‴_X(0) = E(X³), . . . .

Sometimes these can be computed “all at once” without explicitly taking derivatives, by finding the Taylor series for M_X(t). For example, the MGF of X ∼ Expo(1) is 1/(1 − t) for t < 1, which is the geometric series Σ_{n=0}^∞ t^n = Σ_{n=0}^∞ n! t^n/n! for |t| < 1. So the nth moment of X is n! (the coefficient of t^n/n!).

5.14 Conditional Expectation

The conditional expected value E(Y|X = x) is a number (for each x) which is the average value of Y, given the information that X = x. The definition is analogous to the definition of EY: just replace the PMF or PDF by the conditional PMF or conditional PDF.

It is often very convenient to just directly condition on X to obtain E(Y|X), which is a random variable (it is a function of X). This intuitively says to average Y, treating X as if it were a known constant: E(Y|X = x) is a function of x, and E(Y|X) is obtained from E(Y|X = x) by “changing x to X”. For example, if E(Y|X = x) = x³, then E(Y|X) = X³.

Important properties of conditional expectation:

E(Y1 + Y2|X) = E(Y1|X) + E(Y2|X) (Linearity);

E(Y |X) = E(Y ) if X and Y are independent;

E(h(X)Y |X) = h(X)E(Y |X) (Taking out what’s known);

E(Y ) = E(E(Y |X)) (Iterated Expectation/Adam’s Law);

Var(Y ) = E(Var(Y |X)) + Var(E(Y |X)) (Eve’s Law).

The latter two identities are often useful for finding the mean and variance of Y: first condition on some choice of X where the conditional distribution of Y given X is easier to work with than the unconditional distribution of Y, and then account for the randomness of X.
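A minimal sketch (assuming NumPy, with made-up parameter values) of this strategy on the “store with a random number of customers” example from the list of important examples: N ∼ Pois(10) customers, each independently spending an amount with mean 5 and variance 4.

    import numpy as np

    rng = np.random.default_rng(0)
    lam, mu, sig = 10.0, 5.0, 2.0   # hypothetical parameters

    n = rng.poisson(lam, size=10**5)
    x = np.array([rng.normal(mu, sig, size=k).sum() for k in n])  # total spending

    # Adam's Law: E(X) = E(E(X|N)) = E(N * mu) = lam * mu.
    print(x.mean(), lam * mu)                       # both about 50
    # Eve's Law: Var(X) = E(Var(X|N)) + Var(E(X|N)) = lam*sig^2 + mu^2*lam.
    print(x.var(), lam * sig**2 + mu**2 * lam)      # both about 290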


5.15 Convergence

Let X1, X2, . . . be i.i.d. random variables with mean μ and variance σ². The sample mean is defined as

X̄n = (1/n) Σ_{i=1}^n Xi.

The Strong Law of Large Numbers says that with probability 1, the sample mean converges to the true mean:

X̄n → μ with probability 1.

The Weak Law of Large Numbers (which follows from Chebyshev’s Inequality) says that X̄n will be very close to μ with very high probability: for any ε > 0,

P(|X̄n − μ| > ε) → 0 as n → ∞.

The Central Limit Theorem says that the sum of a large number of i.i.d. random variables is approximately Normal in distribution. More precisely, standardize the sum X1 + · · · + Xn (by subtracting its mean and dividing by its standard deviation); then the standardized sum approaches N(0, 1) in distribution (i.e., the CDF of the standardized sum converges to Φ). So

((X1 + · · · + Xn) − nμ) / (σ√n) → N(0, 1) in distribution.

In terms of the sample mean,

(√n/σ)(X̄n − μ) → N(0, 1) in distribution.
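A minimal sketch (assuming NumPy) of both laws for Xi ∼ Expo(1) (so μ = σ = 1), with n = 1000:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = sigma = 1.0
    n, reps = 1000, 10**4

    xbar = rng.exponential(size=(reps, n)).mean(axis=1)

    # Weak LLN: P(|Xbar_n - mu| > 0.1) is already tiny at n = 1000.
    print(np.mean(np.abs(xbar - mu) > 0.1))

    # CLT: the standardized sample mean looks N(0,1).
    z = np.sqrt(n) / sigma * (xbar - mu)
    print(z.mean(), z.var(), np.mean(z < 1.96))   # about 0, 1, and Phi(1.96) = 0.975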

5.16 Inequalities

When probabilities and expected values are hard to compute exactly, it is useful to have inequalities. One simple but handy inequality is Markov’s Inequality:

P(X > a) ≤ E|X|/a,

for any a > 0. Let X have mean μ and variance σ². Using Markov’s Inequality with (X − μ)² in place of X gives Chebyshev’s Inequality:

P(|X − μ| > a) ≤ σ²/a².


For convex functions g (convexity of g is equivalent to g″(x) ≥ 0 for all x, assuming this exists), there is Jensen’s Inequality (the reverse inequality holds for concave g):

E(g(X)) ≥ g(E(X)) for g convex.

The Cauchy-Schwarz inequality bounds the expected product of X and Y:

|E(XY)| ≤ √(E(X²)E(Y²)).

If X and Y have mean 0 and variance 1, this reduces to saying that the correlation is between −1 and 1. It follows that correlation is always between −1 and 1.
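A minimal sketch (assuming NumPy) checking Markov’s and Chebyshev’s bounds for X ∼ Expo(1), which has μ = σ² = 1:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(size=10**6)   # mean 1, variance 1

    for a in (2.0, 3.0, 5.0):
        print(np.mean(x > a), "<=", 1 / a)                  # Markov: E|X|/a = 1/a
        print(np.mean(np.abs(x - 1) > a), "<=", 1 / a**2)   # Chebyshev: sigma^2/a^2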

5.17 Markov Chains

Consider a Markov chain X0, X1, . . . with transition matrix Q = (qij), and let v be a row vector listing the initial probabilities of being in each state. Then vQ^n is the row vector listing the probabilities of being in each state after n steps, i.e., the jth component is P(Xn = j).

A vector s of probabilities (adding to 1) is stationary for the chain if sQ = s; by the above, if a chain starts out with a stationary distribution then the distribution stays the same forever. Any irreducible Markov chain has a unique stationary distribution s, and (if the chain is also aperiodic) it converges to it: P(Xn = i) → si as n → ∞.

If s is a vector of probabilities (adding to 1) that satisfies the reversibility condition si qij = sj qji for all states i, j, then it automatically follows that s is a stationary distribution for the chain; not all chains have this condition hold, but for those that do it is often easier to show that s is stationary using the reversibility condition than by showing sQ = s.
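A minimal sketch (assuming NumPy, with a made-up 3-state transition matrix) of the facts above: vQ^n converges to a vector s with sQ = s.

    import numpy as np

    # A made-up irreducible (and aperiodic) transition matrix; rows sum to 1.
    Q = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.3, 0.3, 0.4]])

    v = np.array([1.0, 0.0, 0.0])             # start in state 1 for sure
    s = v @ np.linalg.matrix_power(Q, 50)     # distribution after 50 steps
    print(s)         # the (approximate) stationary distribution
    print(s @ Q)     # equals s again: sQ = s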


6 Common Mistakes in Probability

6.1 Category errors

A category error is a mistake that not only happens to be wrong, but also it is wrong in every possible universe. If someone answers the question “How many students are in Stat 110?” with “10, since it’s one ten,” that is wrong (and a very bad approximation to the truth); but there is no logical reason the enrollment couldn’t be 10, aside from the logical necessity of learning probability for reasoning about uncertainty in the world. But answering the question with “−42” or “π” or “pink elephants” would be a category error. To help avoid being categorically wrong, always think about what type an answer should have. Should it be an integer? A nonnegative integer? A number between 0 and 1? A random variable? A distribution?

• Probabilities must be between 0 and 1.

Example: When asked for an approximation to P(X > 5) for a certain r.v. X with mean 7, writing “P(X > 5) ≈ E(X)/5.” This makes two mistakes: Markov’s inequality gives P(X > 5) ≤ E(X)/5, but this is an upper bound, not an approximation; and here E(X)/5 = 1.4, which is silly as an approximation to a probability since 1.4 > 1.

• Variances must be nonnegative.

Example: For X and Y independent r.v.s, writing that “Var(X − Y) = Var(X) − Var(Y)”, which can immediately be seen to be wrong from the fact that it becomes negative if Var(Y) > Var(X) (and 0 if X and Y are i.i.d.). The correct formula is Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y), which is Var(X) + Var(Y) if X and Y are uncorrelated.

• Correlations must be between −1 and 1.

Example: It is common to confuse covariance and correlation; they are related by Corr(X, Y) = Cov(X, Y)/(SD(X)SD(Y)), which is between −1 and 1.

• The range of possible values must make sense.

Example: Two people each have 100 friends, and we are interested in the distribution of X = (number of mutual friends). Then writing “X ∼ N(μ, σ²)” doesn’t make sense since X is an integer (sometimes we use the Normal as an approximation to, say, Binomials, but exact answers should be given unless an approximation is specifically asked for); “X ∼ Pois(λ)” or “X ∼ Bin(500, 1/2)” don’t make sense since X has possible values 0, 1, . . . , 100.


• Units should make sense.

Example: A common careless mistake is to divide by the variance rather than the standard deviation when standardizing. Thinking of X as having units makes it clear whether to divide by variance or standard deviation, e.g., if X is measured in light years, then E(X) and SD(X) are also measured in light years (whereas Var(X) is measured in squared light years), so the standardized r.v. (X − E(X))/SD(X) is unitless (as desired).

Thinking about units also helps explain the change of variables formula,

fX(x)dx = fY(y)dy.

If, for example, X is measured in nanoseconds and Y = X³, then the units of fX(x) are inverse nanoseconds and the units of fY(y) are inverse cubed nanoseconds, and we need the dx and dy to make both sides be unitless (remember that we can think of fX(x)dx as the probability that X is in a tiny interval of length dx, centered at x).

• A number can’t equal a random variable (unless the r.v. is actually a constant). Quantities such as E(X), P(X > 1), FX(1), Cov(X, Y) are numbers. We often use the notation “X = x”, but this is shorthand for an event (it is the set of all possible outcomes of the experiment where X takes the value x).

Example: A store has N ∼ Pois(λ) customers on a certain day, each of whom spends an average of μ dollars. Let X be the total amount spent by the customers. Then “E(X) = Nμ” doesn’t make sense, since E(X) is a number, while the righthand side is an r.v.

Example: Writing something like “Cov(X, Y) = 3 if Z = 0 and Cov(X, Y) = 1 if Z = 1” doesn’t make sense, as Cov(X, Y) is just one number. Similarly, students sometimes write “E(Y) = 3 when X = 1” when they mean E(Y|X = 1) = 3. This is both conceptually wrong, since E(Y) is a number (the overall average of Y), and careless notation that could lead, e.g., to getting “E(X) = 1 if X = 1, and E(X) = 0 if X = 0” rather than EX = p for X ∼ Bern(p).

• Don’t replace a r.v. by its mean, or confuse E(g(X)) with g(EX).

Example: On the bidding for an unknown asset problem (#6 on the final from 2008), a common mistake is to replace the random asset value V by its mean, which completely ignores the variability of V.


Example: If X − 1 ∼ Geom(1/2), then 2^(E(X)) = 4, but E(2^X) is infinite (as in the St. Petersburg Paradox), so confusing the two is infinitely wrong. In general, if g is convex then Jensen’s inequality says that E(g(X)) ≥ g(EX).

• An event is not a random variable.

Example: If A is an event and X is an r.v., it does not make sense to write “E(A)” or “P(X)”. There is of course a deep connection between events and r.v.s, in that for any event A there is a corresponding indicator r.v. I_A, and given an r.v. X and a number x, we have events X = x and X ≤ x.

• Dummy variables in an integral can’t make their way out of the integral.

Example: In LOTUS for a r.v. X with PDF f, the letter x in E(g(X)) = ∫_{−∞}^{∞} g(x)f(x)dx is a dummy variable; we could just as well write ∫_{−∞}^{∞} g(t)f(t)dt or ∫_{−∞}^{∞} g(u)f(u)du or even ∫_{−∞}^{∞} g(2)f(2)d2, but the x (or whatever this dummy variable is called) can’t migrate out of the integral.

• A random variable is not the same thing as its distribution! See Section 6.4.

• The conditional expectation E(Y|X) must be a function of X (possibly a constant function, but it must be computable just in terms of X). See Section 6.5.

6.2 Notational paralysis

Another common mistake is a reluctance to introduce notation. This can be both a symptom and a cause of not seeing the structure of a problem. Be sure to define your notation clearly, carefully distinguishing between constants, random variables, and events.

• Give objects names if you want to work with them.

Example: Suppose that we are interested in a LogNormal r.v. X (so log(X) ∼ N(μ, σ²) for some μ, σ²). Then log(X) is clearly an important object, so we should give it a name, say Y = log(X). Then X = e^Y and, for example, we can easily obtain the moments of X using the MGF of Y.

Example: Suppose that we want to show that

E(cos⁴(X² + 1)) ≥ (E(cos²(X² + 1)))².

The essential pattern is that there is a r.v. on the right and its square on the left; so let Y = cos²(X² + 1), which turns the desired inequality into the statement E(Y²) ≥ (EY)², which we know is true because variance is nonnegative.


• Introduce clear notation for events and r.v.s of interest.

Example: In the Calvin and Hobbes problem (from HW 3 and the final from 2010), clearly the event “Calvin wins the match” is important (so give it a name) and the r.v. “how many of the first two games Calvin wins” is important (so give it a name). Make sure that events you define really are events (they are subsets of the sample space, and it must make sense to talk about whether the event occurs) and that r.v.s you define really are r.v.s (they are functions mapping the sample space to the real line, and it must make sense to talk about their distributions and talk about them as a numerical summary of some aspect of the random experiment).

• Think about location and scale when applicable.

Example: If Yj ∼ Expo(λ), it may be very convenient to work with Xj = λYj, which is Expo(1). In studying X ∼ N(μ, σ²), it may be very convenient to write X = μ + σZ where Z ∼ N(0, 1) is the standardized version of X.

6.3 Common sense and checking answers

Whenever possible (i.e., when not under severe time pressure), look for simple ways to check your answers, or at least to check that they are plausible. This can be done in various ways, such as using the following methods.

1. Miracle checks. Does your answer seem intuitively plausible? Is there a category error? Did asymmetry appear out of nowhere when there should be symmetry?

2. Checking simple and extreme cases. What is the answer to a simpler version of the problem? What happens if n = 1 or n = 2, or as n → ∞, if the problem involves showing something for all n?

3. Looking for alternative approaches and connections with other problems. Is there another natural way to think about the problem? Does the problem relate to other problems we’ve seen?

• Probability is full of counterintuitive results, but not impossible results!

Example: Suppose that we have P(snow Saturday) = P(snow Sunday) = 1/2. Then we can’t say “P(snow over the weekend) = 1”; clearly there is some chance of no snow, and of course the mistake is to ignore the need for disjointness.


Example: In finding E(e^X) for X ∼ Pois(λ), obtaining an answer that can be negative, or an answer that isn’t an increasing function of λ (intuitively, it is clear that larger λ should give larger average values of e^X).

• Check simple and extreme cases whenever possible.

Example: Suppose we want to derive the mean and variance of a Hypergeometric, which is the distribution of the number of white balls if we draw n balls without replacement from a bag containing w white balls and b black balls. Suppose that using indicator r.v.s, we (correctly) obtain that the mean is μ = nw/(w + b) and the variance is ((w + b − n)/(w + b − 1)) · n · (μ/n)(1 − μ/n).

Let’s check that this makes sense for the simple case n = 1: then the mean and variance reduce to those of a Bern(w/(w + b)), which makes sense since with only 1 draw, it doesn’t matter whether sampling is with replacement.

Now let’s consider an extreme case where the total number of balls (w + b) is extremely large compared with n. Then it shouldn’t matter much whether the sampling is with or without replacement, so the mean and variance should be very close to those of a Bin(n, w/(b + w)), and indeed this is the case. If we had an answer that did not make sense in simple and extreme cases, we could then look harder for a mistake or explanation.

Example: Let X1, X2, . . . , X1000 be i.i.d. with a continuous distribution, and consider the question of whether the event X1 < X2 is independent of the event X1 < X3. Many students guess intuitively that they are independent. But now consider the more extreme question of whether P(X1 < X2 | X1 < X3, X1 < X4, . . . , X1 < X1000) is P(X1 < X2). Here most students guess intuitively (and correctly) that

P(X1 < X2 | X1 < X3, X1 < X4, . . . , X1 < X1000) > P(X1 < X2),

since the evidence that X1 is less than all of X3, . . . , X1000 suggests that X1 is very small. Yet this more extreme case is the same in principle, just different in degree. Similarly, the Monty Hall problem is easier to understand with 1000 doors than with 3 doors. To show algebraically that X1 < X2 is not independent of X1 < X3, note that P(X1 < X2) = 1/2, while

P(X1 < X2 | X1 < X3) = P(X1 < X2, X1 < X3)/P(X1 < X3) = (1/3)/(1/2) = 2/3,

where the numerator is 1/3 since the smallest of X1, X2, X3 is equally likely to be any of them.
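A minimal sketch (assuming NumPy) confirming the 2/3 by simulation:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(10**6, 3))   # any i.i.d. continuous distribution works

    a = x[:, 0] < x[:, 1]             # event X1 < X2
    b = x[:, 0] < x[:, 2]             # event X1 < X3
    print(a.mean())                   # about 1/2
    print(a[b].mean())                # about 2/3: conditioning on b changes it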


• Check that PMFs are nonnegative and sum to 1, and PDFs are nonnegative and integrate to 1 (or that it is at least plausible), when it is not too messy.

Example: Writing that the PDF of X is “f(x) = (1/5)e^(−5x) for all x > 0 (and 0 otherwise)” is immediately seen to be wrong by integrating (the constant in front should be 5, which can also be seen by recognizing this as an Expo(5) PDF). Writing that the PDF is “f(x) = (1 + e^(−x))/(1 + x) for all x > 0 (and 0 otherwise)” doesn’t make sense since even though the integral is hard to do directly, clearly (1 + e^(−x))/(1 + x) > 1/(1 + x), and ∫_0^∞ 1/(1 + x) dx is infinite.

Example: Consider the following problem: “You are invited to attend 6 weddings next year, independently with all months of the year equally likely. What is the probability that no two weddings are in the same month?” A common mistake is to treat the weddings as indistinguishable. But no matter how generic and clichéd weddings can be sometimes, there must be some way to distinguish two weddings!

It often helps to make up concrete names, e.g., saying “ok, we need to look at the possible schedulings of the weddings of Daenerys and Drogo, of Cersei and Robert, . . . ”. There are 12⁶ equally likely possibilities and, for example, it is much more likely to have 1 wedding per month in January through June than to have all 6 weddings in January (whereas treating weddings as indistinguishable would suggest having these be equal).

6.4 Random variables vs. distributions

A random variable is not the same thing as its distribution! We call this confusion sympathetic magic, and the consequences of this confusion are often disastrous. Every random variable has a distribution (which can always be expressed using a CDF, which can be expressed by a PMF in the discrete case, and which can be expressed by a PDF in the continuous case).

Every distribution can be used as a blueprint for generating r.v.s (for example, one way to do this is using Universality of the Uniform). But that doesn’t mean that doing something to a r.v. corresponds to doing it to the distribution of the r.v. Confusing a distribution with a r.v. with that distribution is like confusing a map of a city with the city itself, or a blueprint of a house with the house itself. The word is not the thing, the map is not the territory.

• A function of a r.v. is a r.v.


Example: Let X be discrete with possible values 0, 1, 2, . . . and PMF pj = P(X = j), and let Y = X + 3. Then Y is discrete with possible values 3, 4, 5, . . . , and its PMF is given by P(Y = k) = P(X = k − 3) = p_{k−3} for k ∈ {3, 4, 5, . . . }. In the continuous case, if Y = g(X) with g differentiable and strictly increasing, then we can use the change of variables formula to find the PDF of Y from the PDF of X. If we only need E(Y) and not the distribution of Y, we can use LOTUS. A common mistake is not seeing why these transformations of X are themselves r.v.s and how to handle them.

Example: For X, Y i.i.d., writing that “E max(X, Y) = EX since max(X, Y) is either X or Y, both of which have mean EX”; this misunderstands how and why max(X, Y) is a r.v. Of course, we should have E max(X, Y) ≥ EX since max(X, Y) ≥ X.

• Avoid sympathetic magic.

Example: Is it possible to have two r.v.s X, Y which have the same distribution but are never equal, i.e., the event X = Y never occurs?

Example: In finding the PDF of XY, writing something like “fX(x)fY(y).” This is a category error since if we let W = XY, we want a function fW(w), not a function of two variables x and y. The mistake is in thinking that the PDF of the product is the product of the PDFs, which comes from not understanding well what a distribution really is.

Example: For r.v.s X and Y with PDFs fX and fY respectively, the event {X < Y} is very different conceptually from the inequality fX < fY. In fact, it is impossible that fX(t) < fY(t) for all t, since both sides integrate to 1.

• A CDF F(x) = P(X ≤ x) is a way to specify the distribution of X, and is a function defined for all real values of x. Here X is the r.v., and x is any number; we could just as well have written F(t) = P(X ≤ t).

Example: Why must a CDF F(x) be defined for all x and increasing everywhere, and why is it not true that a CDF integrates to 1?

6.5 Conditioning

It is easy to make mistakes with conditional probability, so it is important to think carefully about what to condition on and how to carry that out. Conditioning is the soul of statistics.


• Condition on all the evidence!

Example: In the Monty Hall problem, if Monty opens door 2 then we can’t just use “P(door 1 has car | door 2 doesn’t have car) = 1/2,” since this does not condition on all the evidence: we know not just that door 2 does not have the car, but also that Monty opened door 2. How the information was collected is itself information. Why is this additional information relevant? To see this, contrast the problem as stated with the variant where Monty randomly chooses to open one of the 2 doors not picked by the contestant (so there is a chance of revealing the car and spoiling the game): different information is obtained in the two scenarios. This is another example where looking at an extreme case helps (consider the analogue of the Monty Hall problem with a billion doors).

Example: In the murder problem, a common mistake (often made by defense attorneys, intentionally or otherwise) is to focus attention on P(murder|abuse), which is irrelevant since we know the woman has been murdered, and we are interested in the probability of guilt given all the evidence (including the fact that the murder occurred).

• Don’t destroy information.

Example: Let X ∼ Bern(1/2) and Y = 1 + W with W ∼ Bern(1/2) independent of X. Then writing “E(X²|X = Y) = E(Y²) = 2.5” is wrong (in fact, E(X²|X = Y) = 1 since if X = Y, then X = 1), where the mistake is destroying the information X = Y, thinking we’re done with that information once we have plugged in Y for X. A similar mistake is easy to make in the two envelope paradox.

Example: On the bidding for an unknown asset problem (#6 on the final from 2008), a very common mistake is to forget to condition on the bid being accepted. In fact, we should have E(V | bid accepted) < E(V), since if the bid is accepted, it restricts how much the asset could be worth (intuitively, this is similar to “buyer’s remorse”: it is common (though not necessarily rational) for someone to regret making an offer if the offer is accepted immediately, thinking that is a sign that a lower offer would have sufficed).

• Independence shouldn’t be assumed without justification, and it is important to be careful not to assume it implicitly.

Example: For X1, . . . , Xn i.i.d., we have Var(X1 + · · · + Xn) = nVar(X1), but this is not equal to Var(X1 + · · · + X1) = Var(nX1) = n²Var(X1). For example, if X and Y are i.i.d. N(μ, σ²), then X + Y ∼ N(2μ, 2σ²), while X + X = 2X ∼ N(2μ, 4σ²).

Example: Is it always true that if X ∼ Pois(λ) and Y ∼ Pois(λ), then X + Y ∼ Pois(2λ)? What is an example of a sum of Bern(p)’s which is not Binomial?

Example: In the two envelope paradox, it is not true that the amount of money in the first envelope is independent of the indicator of which envelope has more money.

• Independence is completely different from disjointness!

Example: Sometimes students try to visualize independent events A and B with two non-overlapping ovals in a Venn diagram. Such events in fact can’t be independent (unless one has probability 0), since learning that A happened gives a great deal of information about B: it implies that B did not occur.

• Independence is a symmetric property: if A is independent of B, then B is independent of A. There’s no such thing as unrequited independence.

Example: If it is non-obvious whether A provides information about B but obvious that B provides information about A, then A and B can’t be independent.

• The marginal distributions can be extracted from the joint distribution, but knowing the marginal distributions does not determine the joint distribution.

Example: Calculations that are purely based on the marginal CDFs FX and FY of dependent r.v.s X and Y may not shed much light on events such as {X < Y} which involve X and Y jointly.

• Keep the distinction between prior and posterior probabilities clear.

Example: Suppose that we observe evidence E. Then writing “P(E) = 1 since we know for sure that E happened” is careless; we have P(E|E) = 1, but P(E) is the prior probability (the probability before E was observed).

• Don’t confuse P(A|B) with P(B|A).

Example: This mistake is also known as the prosecutor’s fallacy since it is often made in legal cases (but not always by the prosecutor!). For example, the prosecutor may argue that the probability of guilt given the evidence is very high by attempting to show that the probability of the evidence given innocence is very low, but in and of itself this is insufficient since it does not use the prior probability of guilt. Bayes’ rule thus becomes Bayes’ ruler, measuring the weight of the evidence by relating P(A|B) to P(B|A) and showing us how to update our beliefs based on evidence.

• Don’t confuse P (A|B) with P (A,B).

Example: The law of total probability is often wrongly written without the weights as “P(A) = P(A|B) + P(A|B^c)” rather than P(A) = P(A, B) + P(A, B^c) = P(A|B)P(B) + P(A|B^c)P(B^c).

• The expression Y|X does not denote a r.v.; it is notation indicating that in working with Y, we should use the conditional distribution of Y given X (i.e., treat X as a known constant). The expression E(Y|X) is a r.v., and is a function of X (we have summed or integrated over the possible values of Y).

Example: Writing “E(Y|X) = Y” is wrong, except if Y is a function of X, e.g., E(X³|X) = X³; by definition, E(Y|X) must be g(X) for some function g, so any answer for E(Y|X) that is not of this form is a category error.


7 Stat 110 Final from 2006

1. The number of fish in a certain lake is a Pois(λ) random variable. Worried that there might be no fish at all, a statistician adds one fish to the lake. Let Y be the resulting number of fish (so Y is 1 plus a Pois(λ) random variable).

(a) Find E(Y²) (simplify).

(b) Find E(1/Y) (in terms of λ; do not simplify yet).

(c) Find a simplified expression for E(1/Y ). Hint: k!(k + 1) = (k + 1)!.


2. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where “?” means that no relation holds in general). It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

In (c) through (f), X and Y are i.i.d. (independent identically distributed) positive random variables. Assume that the various expected values exist.

(a) (probability that a roll of 2 fair dice totals 9) ____ (probability that a roll of 2 fair dice totals 10)

(b) (probability that 65% of 20 children born are girls) ____ (probability that 65% of 2000 children born are girls)

(c) E(√X) ____ √(E(X))

(d) E(sin X) ____ sin(EX)

(e) P(X + Y > 4) ____ P(X > 2)P(Y > 2)

(f) E((X + Y)²) ____ 2E(X²) + 2(EX)²


3. A fair die is rolled twice, with outcomes X for the 1st roll and Y for the 2nd roll.

(a) Compute the covariance of X + Y and X − Y (simplify).

(b) Are X + Y and X − Y independent? Justify your answer clearly.

(c) Find the moment generating function M_{X+Y}(t) of X + Y (your answer should be a function of t and can contain unsimplified finite sums).


4. A post office has 2 clerks. Alice enters the post office while 2 other customers, Bob and Claire, are being served by the 2 clerks. She is next in line. Assume that the time a clerk spends serving a customer has the Expo(λ) distribution.

(a) What is the probability that Alice is the last of the 3 customers to be done being served? (Simplify.) Justify your answer. Hint: no integrals are needed.

(b) Let X and Y be independent Expo(λ) r.v.s. Find the CDF of min(X, Y).

(c) What is the expected total time that Alice needs to spend at the post office?


5. Bob enters a casino with X0 = 1 dollar and repeatedly plays the following game: with probability 1/3, the amount of money he has increases by a factor of 3; with probability 2/3, the amount of money he has decreases by a factor of 3. Let Xn be the amount of money he has after playing this game n times. Thus, X_{n+1} is 3Xn with probability 1/3 and is 3⁻¹Xn with probability 2/3.

(a) Compute E(X1), E(X2) and, in general, E(Xn). (Simplify.)

(b) What happens to E(Xn) as n → ∞? Let Yn be the number of times out of the first n games that Bob triples his money. What happens to Yn/n as n → ∞?

(c) Does Xn converge to some number c as n → ∞ (with probability 1) and if so, what is c? Explain.


6. Let X and Y be independent standard Normal r.v.s and let R² = X² + Y² (where R > 0 is the distance from (X, Y) to the origin).

(a) The distribution of R² is an example of three of the “important distributions” listed on the last page. State which three of these distributions R² is an instance of, specifying the parameter values.

(b) Find the PDF of R. (Simplify.) Hint: start with the PDF fW(w) of W = R².

(c) Find P(X > 2Y + 3) in terms of the standard Normal CDF Φ. (Simplify.)

(d) Compute Cov(R², X). Are R² and X independent?


7. Let U1, U2, . . . , U60 be i.i.d. Unif(0,1) and X = U1 + U2 + · · ·+ U60.

(a) Which important distribution is the distribution of X very close to? Specify what the parameters are, and state which theorem justifies your choice.

(b) Give a simple but accurate approximation for P (X > 17). Justify briefly.

(c) Find the moment generating function (MGF) of X.


8. Let X1, X2, . . . , Xn be i.i.d. random variables with E(X1) = 3, and consider the sum Sn = X1 + X2 + · · · + Xn.

(a) What is E(X1X2X3|X1)? (Simplify. Your answer should be a function of X1.)

(b) What is E(X1|Sn) + E(X2|Sn) + · · ·+ E(Xn|Sn)? (Simplify.)

(c) What is E(X1|Sn)? (Simplify.) Hint: use (b) and symmetry.


9. An urn contains red, green, and blue balls. Balls are chosen randomly with replacement (each time, the color is noted and then the ball is put back). Let r, g, b be the probabilities of drawing a red, green, blue ball respectively (r + g + b = 1).

(a) Find the expected number of balls chosen before obtaining the first red ball, not including the red ball itself. (Simplify.)

(b) Find the expected number of different colors of balls obtained before getting the first red ball. (Simplify.)

(c) Find the probability that at least 2 of n balls drawn are red, given that at least 1 is red. (Simplify; avoid sums of large numbers of terms, and Σ or · · · notation.)


10. Let X0, X1, X2, . . . be an irreducible Markov chain with state space {1, 2, . . . , M}, M ≥ 3, transition matrix Q = (qij), and stationary distribution s = (s1, . . . , sM). The initial state X0 is given the stationary distribution, i.e., P(X0 = i) = si.

(a) On average, how many of X0, X1, . . . , X9 equal 3? (In terms of s; simplify.)

(b) Let Yn = (Xn − 1)(Xn − 2). For M = 3, find an example of Q (the transition matrix for the original chain X0, X1, . . . ) where Y0, Y1, . . . is Markov, and another example of Q where Y0, Y1, . . . is not Markov. Mark which is which and briefly explain. In your examples, make qii > 0 for at least one i and make sure it is possible to get from any state to any other state eventually.

(c) If each column of Q sums to 1, what is s? Verify using the definition of stationary.


8 Stat 110 Final from 2007

1. Consider the birthdays of 100 people. Assume people’s birthdays are independent, and the 365 days of the year (exclude the possibility of February 29) are equally likely.

(a) Find the expected number of birthdays represented among the 100 people, i.e., the expected number of days that at least 1 of the people has as his or her birthday (your answer can involve unsimplified fractions but should not involve messy sums).

(b) Find the covariance between how many of the people were born on January 1 and how many were born on January 2.


2. Let X and Y be positive random variables, not necessarily independent. Assume that the various expected values below exist. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where “?” means that no relation holds in general). It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

(a) (E(XY))² ____ E(X²)E(Y²)

(b) P(|X + Y| > 2) ____ (1/10)E((X + Y)⁴)

(c) E(ln(X + 3)) ____ ln(E(X + 3))

(d) E(X²e^X) ____ E(X²)E(e^X)

(e) P(X + Y = 2) ____ P(X = 1)P(Y = 1)

(f) P(X + Y = 2) ____ P({X ≥ 1} ∪ {Y ≥ 1})


3. Let X and Y be independent Pois(λ) random variables. Recall that the moment generating function (MGF) of X is M(t) = e^(λ(e^t − 1)).

(a) Find the MGF of X + 2Y (simplify).

(b) Is X + 2Y also Poisson? Show that it is, or that it isn’t (whichever is true).

(c) Let g(t) = ln M(t) be the log of the MGF of X. Expanding g(t) as a Taylor series

g(t) = Σ_{j=1}^∞ (cj/j!) t^j

(the sum starts at j = 1 because g(0) = 0), the coefficient cj is called the jth cumulant of X. Find cj in terms of λ, for all j ≥ 1 (simplify).


4. Consider the following conversation from an episode of The Simpsons:

Lisa: Dad, I think he’s an ivory dealer! His boots are ivory, his hat is ivory, and I’m pretty sure that check is ivory.

Homer: Lisa, a guy who’s got lots of ivory is less likely to hurt Stampy than a guy whose ivory supplies are low.

Here Homer and Lisa are debating the question of whether or not the man (named Blackheart) is likely to hurt Stampy the Elephant if they sell Stampy to him. They clearly disagree about how to use their observations about Blackheart to learn about the probability (conditional on the evidence) that Blackheart will hurt Stampy.

(a) Define clear notation for the various events of interest here.

(b) Express Lisa’s and Homer’s arguments (Lisa’s is partly implicit) as conditional probability statements in terms of your notation from (a).

(c) Assume it is true that someone who has a lot of a commodity will have less desire to acquire more of the commodity. Explain what is wrong with Homer’s reasoning that the evidence about Blackheart makes it less likely that he will harm Stampy.


5. Empirically, it is known that 49% of children born in the U.S. are girls (and 51% are boys). Let N be the number of children who will be born in the U.S. in March 2009, and assume that N is a Pois(λ) random variable, where λ is known. Assume that births are independent (e.g., don’t worry about identical twins).

Let X be the number of girls who will be born in the U.S. in March 2009, and let Y be the number of boys who will be born then (note the importance of choosing good notation: boys have a Y chromosome).

(a) Find the joint distribution of X and Y. (Give the joint PMF.)

(b) Find E(N|X) and E(N^2|X).


6. Let X1, X2, X3 be independent with Xi ∼ Expo(λi) (so with possibly different rates). A useful fact (which you may use) is that P(X1 < X2) = λ1/(λ1 + λ2).

(a) Find E(X1 + X2 + X3 | X1 > 1, X2 > 2, X3 > 3) in terms of λ1, λ2, λ3.

(b) Find P(X1 = min(X1, X2, X3)), the probability that the first of the three Exponentials is the smallest. Hint: re-state this in terms of X1 and min(X2, X3).

(c) For the case λ1 = λ2 = λ3 = 1, find the PDF of max(X1, X2, X3). Is this one of the “important distributions”?


7. Let X1, X2, . . . be i.i.d. random variables with CDF F(x). For every number x, let Rn(x) count how many of X1, . . . , Xn are less than or equal to x.

(a) Find the mean and variance of Rn(x) (in terms of n and F(x)).

(b) Assume (for this part only) that X1, . . . , X4 are known constants. Sketch an example showing what the graph of the function R4(x)/4 might look like. Is the function R4(x)/4 necessarily a CDF? Explain briefly.

(c) Show that Rn(x)/n → F(x) as n → ∞ (with probability 1).


8. (a) Let T be a Student-t r.v. with 1 degree of freedom, and let W = 1/T. Find the PDF of W (simplify). Is this one of the “important distributions”?

Hint: no calculus is needed for this (though it can be used to check your answer).

(b) Let Wn ∼ χ^2_n (the Chi-Square distribution with n degrees of freedom), for each n ≥ 1. Do there exist an and bn such that an(Wn − bn) → N(0, 1) in distribution as n → ∞? If so, find them; if not, explain why not.

(c) Let Z ∼ N(0, 1) and Y = |Z|. Find the PDF of Y, and approximate P(Y < 2).


9. Consider a knight randomly moving around on a 4 by 4 chessboard:

[Figure: a 4 × 4 chessboard with columns labeled A–D and rows labeled 1–4; the knight is shown at B3.]

The 16 squares are labeled in a grid, e.g., the knight is currently at the square B3, and the upper left square is A4. Each move of the knight is an L-shape: two squares horizontally followed by one square vertically, or vice versa. For example, from B3 the knight can move to A1, C1, D2, or D4; from A4 it can move to B2 or C3. Note that from a white square, the knight always moves to a gray square and vice versa.

At each step, the knight moves randomly, each possibility equally likely. Consider the stationary distribution of this Markov chain, where the states are the 16 squares.

(a) Which squares have the highest stationary probability? Explain very briefly.

(b) Compute the stationary distribution (simplify). Hint: random walk on a graph.


9 Stat 110 Final from 2008

1. Joe’s iPod has 500 different songs, consisting of 50 albums of 10 songs each. He listens to 11 random songs on his iPod, with all songs equally likely and chosen independently (so repetitions may occur).

(a) What is the PMF of how many of the 11 songs are from his favorite album?

(b) What is the probability that there are 2 (or more) songs from the same album among the 11 songs he listens to? (Do not simplify.)

(c) A pair of songs is a “match” if they are from the same album. If, say, the 1st, 3rd, and 7th songs are all from the same album, this counts as 3 matches. Among the 11 songs he listens to, how many matches are there on average? (Simplify.)


2. Let X and Y be positive random variables, not necessarily independent. Assume that the various expressions below exist. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where “?” means that no relation holds in general). It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

(a) P(X + Y > 2) ____ (EX + EY)/2

(b) P(X + Y > 3) ____ P(X > 3)

(c) E(cos(X)) ____ cos(EX)

(d) E(X^{1/3}) ____ (EX)^{1/3}

(e) E(X^Y) ____ (EX)^{EY}

(f) E(E(X|Y) + E(Y|X)) ____ EX + EY


3. (a) A woman is pregnant with twin boys. Twins may be either identical or fraternal (non-identical). In general, 1/3 of twins born are identical. Obviously, identical twins must be of the same sex; fraternal twins may or may not be. Assume that identical twins are equally likely to be both boys or both girls, while for fraternal twins all possibilities are equally likely. Given the above information, what is the probability that the woman’s twins are identical?

(b) A certain genetic characteristic is of interest. For a random person, this has a numerical value given by a N(0, σ^2) r.v. Let X1 and X2 be the values of the genetic characteristic for the twin boys from (a). If they are identical, then X1 = X2; if they are fraternal, then X1 and X2 have correlation ρ. Find Cov(X1, X2) in terms of ρ, σ^2.


4. (a) Consider i.i.d. Pois(λ) r.v.s X1, X2, . . . . The MGF of Xj is M(t) = e^{λ(e^t − 1)}. Find the MGF Mn(t) of the sample mean X̄n = (1/n) Σ_{j=1}^{n} Xj. (Hint: it may help to do the n = 2 case first, which itself is worth a lot of partial credit, and then generalize.)

(b) Find the limit of Mn(t) as n → ∞. (You can do this with almost no calculation using a relevant theorem; or you can use (a) and that e^x ≈ 1 + x if x is very small.)


5. A post office has 2 clerks. Alice enters the post office while 2 other customers, Bob and Claire, are being served by the 2 clerks. She is next in line. Assume that the time a clerk spends serving a customer has the Expo(λ) distribution.

(a) What is the probability that Alice is the last of the 3 customers to be done being served? Justify your answer. Hint: no integrals are needed.

(b) Let X and Y be independent Expo(λ) r.v.s. Find the CDF of min(X, Y).

(c) What is the expected total time that Alice needs to spend at the post office?


6. You are given an amazing opportunity to bid on a mystery box containing a mystery prize! The value of the prize is completely unknown, except that it is worth at least nothing, and at most a million dollars. So the true value V of the prize is considered to be Uniform on [0, 1] (measured in millions of dollars).

You can choose to bid any amount b (in millions of dollars). You have the chance to get the prize for considerably less than it is worth, but you could also lose money if you bid too much. Specifically, if b < (2/3)V, then the bid is rejected and nothing is gained or lost. If b ≥ (2/3)V, then the bid is accepted and your net payoff is V − b (since you pay b to get a prize worth V). What is your optimal bid b (to maximize the expected payoff)?


7. (a) Let Y = e^X, with X ∼ Expo(3). Find the mean and variance of Y (simplify).

(b) For Y1, . . . , Yn i.i.d. with the same distribution as Y from (a), what is the approximate distribution of the sample mean Ȳn = (1/n) Σ_{j=1}^{n} Yj when n is large? (Simplify, and specify all parameters.)


8.

[Figure: seven states labeled 1 through 7 arranged in a circle.]

(a) Consider a Markov chain on the state space {1, 2, . . . , 7} with the states arranged in a “circle” as shown above, and transitions given by moving one step clockwise or counterclockwise with equal probabilities. For example, from state 6, the chain moves to state 7 or state 5 with probability 1/2 each; from state 7, the chain moves to state 1 or state 6 with probability 1/2 each. The chain starts at state 1.

Find the stationary distribution of this chain.

(b) Consider a new chain obtained by “unfolding the circle.” Now the states are arranged as shown below. From state 1 the chain always goes to state 2, and from state 7 the chain always goes to state 6. Find the new stationary distribution.

1 2 3 4 5 6 7


10 Stat 110 Final from 2009

1. A group of n people play “Secret Santa” as follows: each puts his or her name on a slip of paper in a hat, picks a name randomly from the hat (without replacement), and then buys a gift for that person. Unfortunately, they overlook the possibility of drawing one’s own name, so some may have to buy gifts for themselves (on the bright side, some may like self-selected gifts better). Assume n ≥ 2.

(a) Find the expected number of people who pick their own names (simplify).

(b) Find the expected number of pairs of people, A and B, such that A picks B’s name and B picks A’s name (where A ≠ B and order doesn’t matter; simplify).

(c) Let X be the number of people who pick their own names. Which of the “important distributions” are conceivable as the distribution of X, just based on the possible values X takes (you do not need to list parameter values for this part)?

(d) What is the approximate distribution of X if n is large (specify the parameter value or values)? What does P(X = 0) converge to as n → ∞?


2. Let X and Y be positive random variables, not necessarily independent. Assume that the various expected values below exist. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where “?” means that no relation holds in general). It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

(a) E(X^3) ____ √(E(X^2) E(X^4))

(b) P(|X + Y| > 2) ____ (1/16) E((X + Y)^4)

(c) E(√(X + 3)) ____ √(E(X + 3))

(d) E(sin^2(X)) + E(cos^2(X)) ____ 1

(e) E(Y | X + 3) ____ E(Y | X)

(f) E(E(Y^2|X)) ____ (EY)^2


3. Let Z ∼ N(0, 1). Find the 4th moment E(Z^4) in the following two different ways:

(a) using what you know about how certain powers of Z are related to other distributions, along with information from the table of distributions.

(b) using the MGF M(t) = e^{t^2/2}, by writing down its Taylor series and using how the coefficients relate to moments of Z, not by tediously taking derivatives of M(t).

Hint: you can get this series immediately from the Taylor series for e^x.


4. A chicken lays n eggs. Each egg independently does or doesn’t hatch, with probability p of hatching. For each egg that hatches, the chick does or doesn’t survive (independently of the other eggs), with probability s of survival. Let N ∼ Bin(n, p) be the number of eggs which hatch, X be the number of chicks which survive, and Y be the number of chicks which hatch but don’t survive (so X + Y = N).

(a) Find the distribution of X, preferably with a clear explanation in words rather than with a computation. If X has one of the “important distributions,” say which (including its parameters).

(b) Find the joint PMF of X and Y (simplify).

(c) Are X and Y independent? Give a clear explanation in words (of course it makes sense to see if your answer is consistent with your answer to (b), but you can get full credit on this part even without doing (b); conversely, it’s not enough to just say “by (b), . . . ” without further explanation).


5. Suppose we wish to approximate the following integral (denoted by b):

b = ∫_{−∞}^{∞} (−1)^{⌊x⌋} e^{−x^2/2} dx,

where ⌊x⌋ is the greatest integer less than or equal to x (e.g., ⌊3.14⌋ = 3).

(a) Write down a function g(x) such that E(g(X)) = b for X ∼ N(0, 1) (your function should not be in terms of b, and should handle normalizing constants carefully).

(b) Write down a function h(u) such that E(h(U)) = b for U ∼ Unif(0, 1) (your function should not be in terms of b, and can be in terms of the function g from (a) and the standard Normal CDF Φ).

(c) Let X1, X2, . . . , Xn be i.i.d. N(0, 1) with n large, and let g be as in (a). What is the approximate distribution of (1/n)(g(X1) + · · · + g(Xn))? Simplify the parameters fully (in terms of b and n), and mention which theorems you are using.


6. Let X1 be the number of emails received by a certain person today and let X2 be the number of emails received by that person tomorrow, with X1 and X2 i.i.d.

(a) Find E(X1 | X1 + X2) (simplify).

(b) For the case Xj ∼ Pois(λ), find the conditional distribution of X1 given X1 + X2, i.e., P(X1 = k | X1 + X2 = n) (simplify). Is this one of the “important distributions”?


7. Let X1, X2, X3 be independent with Xi ∼ Expo(λi) (so with possibly different rates). A useful fact (which you may use) is that P(X1 < X2) = λ1/(λ1 + λ2).

(a) Find E(X1 + X2 + X3 | X1 > 1, X2 > 2, X3 > 3) in terms of λ1, λ2, λ3.

(b) Find P(X1 = min(X1, X2, X3)), the probability that the first of the three Exponentials is the smallest. Hint: re-state this in terms of X1 and min(X2, X3).

(c) For the case λ1 = λ2 = λ3 = 1, find the PDF of max(X1, X2, X3). Is this one of the “important distributions”?


8. Let Xn be the price of a certain stock at the start of the nth day, and assume that X0, X1, X2, . . . follows a Markov chain with transition matrix Q (assume for simplicity that the stock price can never go below 0 or above a certain upper bound, and that it is always rounded to the nearest dollar).

(a) A lazy investor only looks at the stock once a year, observing the values on days 0, 365, 2·365, 3·365, . . . . So the investor observes Y0, Y1, . . . , where Yn is the price after n years (which is 365n days; you can ignore leap years). Is Y0, Y1, . . . also a Markov chain? Explain why or why not; if so, what is its transition matrix?

(b) The stock price is always an integer between $0 and $28. From each day to the next, the stock goes up or down by $1 or $2, all with equal probabilities (except for days when the stock is at or near a boundary, i.e., at $0, $1, $27, or $28).

If the stock is at $0, it goes up to $1 or $2 on the next day (after receiving government bailout money). If the stock is at $28, it goes down to $27 or $26 the next day. If the stock is at $1, it either goes up to $2 or $3, or down to $0 (with equal probabilities); similarly, if the stock is at $27 it either goes up to $28, or down to $26 or $25. Find the stationary distribution of the chain (simplify).


11 Stat 110 Final from 2010

1. Calvin and Hobbes play a match consisting of a series of games, where Calvin has probability p of winning each game (independently). They play with a “win by two” rule: the first player to win two games more than his opponent wins the match.

(a) What is the probability that Calvin wins the match (in terms of p)?

Hint: condition on the results of the first k games (for some choice of k).

(b) Find the expected number of games played.

Hint: consider the first two games as a pair, then the next two as a pair, etc.


2. A DNA sequence can be represented as a sequence of letters, where the “alphabet” has 4 letters: A, C, T, G. Suppose such a sequence is generated randomly, where the letters are independent and the probabilities of A, C, T, G are p1, p2, p3, p4 respectively.

(a) In a DNA sequence of length 115, what is the expected number of occurrences of the expression “CATCAT” (in terms of the pj)? (Note that, for example, the expression “CATCATCAT” counts as 2 occurrences.)

(b) What is the probability that the first A appears earlier than the first C appears, as letters are generated one by one (in terms of the pj)?

(c) For this part, assume that the pj are unknown. Suppose we treat p2 as a Unif(0, 1) r.v. before observing any data, and that then the first 3 letters observed are “CAT”. Given this information, what is the probability that the next letter is C?


3. Let X and Y be i.i.d. positive random variables. Assume that the various expressions below exist. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where “?” means that no relation holds in general). It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

(a) E(e^{X+Y}) ____ e^{2E(X)}

(b) E(X^2 e^X) ____ √(E(X^4) E(e^{2X}))

(c) E(X | 3X) ____ E(X | 2X)

(d) E(X^7 Y) ____ E(X^7 E(Y|X))

(e) E(X/Y + Y/X) ____ 2

(f) P(|X − Y| > 2) ____ Var(X)/2


4. Let X be a discrete r.v. whose distinct possible values are x_0, x_1, . . . , and let p_k = P(X = x_k). The entropy of X is defined to be H(X) = −Σ_{k=0}^{∞} p_k log_2(p_k).

(a) Find H(X) for X ∼ Geom(p).

Hint: use properties of logs, and interpret part of the sum as an expected value.

(b) Find H(X^3) for X ∼ Geom(p), in terms of H(X).

(c) Let X and Y be i.i.d. discrete r.v.s. Show that P(X = Y) ≥ 2^{−H(X)}.

Hint: Consider E(log_2(W)), where W is a r.v. taking value p_k with probability p_k.


5. Let Z1, . . . , Zn ∼ N(0, 1) be i.i.d.

(a) As a function of Z1, create an Expo(1) r.v. X (your answer can also involve the standard Normal CDF Φ).

(b) Let Y = e^{−R}, where R = √(Z1^2 + · · · + Zn^2). Write down (but do not evaluate) an integral for E(Y).

(c) Let X1 = 3Z1 − 2Z2 and X2 = 4Z1 + 6Z2. Determine whether X1 and X2 are independent (being sure to mention which results you’re using).


6. Let X1, X2, . . . be i.i.d. positive r.v.s. with mean µ, and let Wn = X1/(X1 + · · · + Xn).

(a) Find E(Wn).

Hint: consider X1/(X1 + · · · + Xn) + X2/(X1 + · · · + Xn) + · · · + Xn/(X1 + · · · + Xn).

(b) What random variable does nWn converge to as n → ∞?

(c) For the case that Xj ∼ Expo(λ), find the distribution of Wn, preferably without using calculus. (If it is one of the “important distributions” state its name and specify the parameters; otherwise, give the PDF.)


7. A task is randomly assigned to one of two people (with probability 1/2 for each person). If assigned to the first person, the task takes an Expo(λ1) length of time to complete (measured in hours), while if assigned to the second person it takes an Expo(λ2) length of time to complete (independent of how long the first person would have taken). Let T be the time taken to complete the task.

(a) Find the mean and variance of T.

(b) Suppose instead that the task is assigned to both people, and let X be the time taken to complete it (by whoever completes it first, with the two people working independently). It is observed that after 24 hours, the task has not yet been completed. Conditional on this information, what is the expected value of X?


[Figure: a Markov chain drawn as states 1 through 5 with arrows between neighboring states, each arrow labeled with its transition probability.]

8. Find the stationary distribution of the Markov chain shown above, without using matrices. The number above each arrow is the corresponding transition probability.


Stat 110 Final Review Solutions, Fall 2011

Prof. Joe Blitzstein (Department of Statistics, Harvard University)

1 Solutions to Stat 110 Final from 2006

1. The number of fish in a certain lake is a Pois(λ) random variable. Worried that there might be no fish at all, a statistician adds one fish to the lake. Let Y be the resulting number of fish (so Y is 1 plus a Pois(λ) random variable).

(a) Find E(Y^2) (simplify).

We have Y = X + 1 with X ∼ Pois(λ), so Y^2 = X^2 + 2X + 1. So

E(Y^2) = E(X^2 + 2X + 1) = E(X^2) + 2E(X) + 1 = (λ + λ^2) + 2λ + 1 = λ^2 + 3λ + 1,

since E(X^2) = Var(X) + (EX)^2 = λ + λ^2.

(b) Find E(1/Y) (in terms of λ; do not simplify yet).

By LOTUS,

E(1/Y) = E(1/(X + 1)) = Σ_{k=0}^{∞} (1/(k + 1)) e^{−λ} λ^k/k!.

(c) Find a simplified expression for E(1/Y). Hint: k!(k + 1) = (k + 1)!.

Σ_{k=0}^{∞} (1/(k + 1)) e^{−λ} λ^k/k! = e^{−λ} Σ_{k=0}^{∞} λ^k/(k + 1)! = (e^{−λ}/λ) Σ_{k=0}^{∞} λ^{k+1}/(k + 1)! = (e^{−λ}/λ)(e^λ − 1) = (1/λ)(1 − e^{−λ}).
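As a quick numerical check, here is a short Python simulation sketch (λ = 2 and the seed are arbitrary choices, not from the exam):

import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                          # arbitrary choice for the check
Y = rng.poisson(lam, 10**6) + 1    # Y = X + 1 with X ~ Pois(lam)
print(np.mean(1 / Y))              # simulated E(1/Y)
print((1 - np.exp(-lam)) / lam)    # closed form, about 0.432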


2. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where “?” means that no relation holds in general). It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

In (c) through (f), X and Y are i.i.d. (independent identically distributed) positive random variables. Assume that the various expected values exist.

(a) (probability that a roll of 2 fair dice totals 9) ≥ (probability that a roll of 2 fair dice totals 10)

The probability on the left is 4/36 and that on the right is 3/36, as there is only one way for both dice to show 5’s.

(b) (probability that 65% of 20 children born are girls) ≥ (probability that 65% of 2000 children born are girls)

With a large number of births, by the LLN it becomes likely that the fraction that are girls is close to 1/2.

(c) E(√X) ≤ √(E(X))

By Jensen’s inequality (or since Var(√X) ≥ 0).

(d) E(sin X) ? sin(EX)

The inequality can go in either direction. For example, let X be 0 or π with equal probabilities. Then E(sin X) = 0 < sin(EX) = 1. But if we let X be π/2 or 5π/2 with equal probabilities, then E(sin X) = 1 > sin(EX) = −1.

(e) P(X + Y > 4) ≥ P(X > 2)P(Y > 2)

The right-hand side is P(X > 2, Y > 2) by independence. The ≥ then holds since the event {X > 2, Y > 2} is a subset of the event {X + Y > 4}.

(f) E((X + Y)^2) = 2E(X^2) + 2(EX)^2

The left-hand side is

E(X^2) + E(Y^2) + 2E(XY) = E(X^2) + E(Y^2) + 2E(X)E(Y) = 2E(X^2) + 2(EX)^2,

since X and Y are i.i.d.


3. A fair die is rolled twice, with outcomes X for the 1st roll and Y for the 2nd roll.

(a) Compute the covariance of X + Y and X − Y (simplify).

Cov(X + Y, X − Y) = Cov(X, X) − Cov(X, Y) + Cov(Y, X) − Cov(Y, Y) = 0.

(b) Are X + Y and X − Y independent? Justify your answer clearly.

They are not independent: information about X + Y may give information about X − Y. For example, if we know that X + Y = 12, then X = Y = 6, so X − Y = 0.

(c) Find the moment generating function M_{X+Y}(t) of X + Y (your answer should be a function of t and can contain unsimplified finite sums).

Since X and Y are i.i.d., LOTUS gives

M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX}) E(e^{tY}) = ((1/6) Σ_{k=1}^{6} e^{kt})^2.


4. A post office has 2 clerks. Alice enters the post office while 2 other customers, Bob and Claire, are being served by the 2 clerks. She is next in line. Assume that the time a clerk spends serving a customer has the Expo(λ) distribution.

(a) What is the probability that Alice is the last of the 3 customers to be done being served? (Simplify.) Justify your answer. Hint: no integrals are needed.

Alice begins to be served when either Bob or Claire leaves. By the memoryless property, the additional time needed to serve whichever of Bob or Claire is still there is Expo(λ). The time it takes to serve Alice is also Expo(λ), so by symmetry the probability is 1/2 that Alice is the last to be done being served.

(b) Let X and Y be independent Expo(λ) r.v.s. Find the CDF of min(X, Y).

Use the order statistics results, or compute it directly:

P(min(X, Y) > z) = P(X > z, Y > z) = P(X > z) P(Y > z) = e^{−2λz},

so min(X, Y) has the Expo(2λ) distribution, with CDF F(z) = 1 − e^{−2λz}.

(c) What is the expected total time that Alice needs to spend at the post office?

The expected time spent waiting in line is 1/(2λ) by (b). The expected time spent being served is 1/λ. So the expected total time is

1/(2λ) + 1/λ = 3/(2λ).
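A short simulation sketch of the whole problem (λ = 1 is an arbitrary choice) reproduces both the 3/(2λ) answer and the 1/2 from (a):

import numpy as np

rng = np.random.default_rng(0)
lam, n = 1.0, 10**6
bob = rng.exponential(1/lam, n)
claire = rng.exponential(1/lam, n)
alice = rng.exponential(1/lam, n)              # Alice's own service time
wait = np.minimum(bob, claire)                 # Expo(2*lam), as in (b)
print(np.mean(wait + alice))                   # about 3/(2*lam) = 1.5
print(np.mean(wait + alice > np.maximum(bob, claire)))  # about 1/2, as in (a)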


5. Bob enters a casino with X0 = 1 dollar and repeatedly plays the following game: with probability 1/3, the amount of money he has increases by a factor of 3; with probability 2/3, the amount of money he has decreases by a factor of 3. Let Xn be the amount of money he has after playing this game n times. Thus, X_{n+1} is 3Xn with probability 1/3 and is 3^{−1}Xn with probability 2/3.

(a) Compute E(X1), E(X2) and, in general, E(Xn). (Simplify.)

E(X1) = (1/3)·3 + (2/3)·(1/3) = 11/9.

E(X_{n+1}) can be found by conditioning on Xn:

E(X_{n+1} | Xn) = (1/3)·3Xn + (2/3)·3^{−1}Xn = (11/9)Xn,

so

E(X_{n+1}) = E(E(X_{n+1} | Xn)) = (11/9)E(Xn).

Then E(X2) = (11/9)^2 = 121/81, and in general, E(Xn) = (11/9)^n.

(b) What happens to E(Xn) as n → ∞? Let Yn be the number of times out of the first n games that Bob triples his money. What happens to Yn/n as n → ∞?

By the above, E(Xn) → ∞ as n → ∞. By the LLN, Yn/n → 1/3 a.s. as n → ∞.

(c) Does Xn converge to some number c as n → ∞ and if so, what is c? Explain.

In the long run, Bob will win about 1/3 of the time and lose about 2/3 of the time. Note that a win and a loss cancel each other out (due to multiplying and dividing by 3), so Xn will get very close to 0. In terms of Yn from (b), with probability 1,

Xn = 3^{Yn} 3^{−(n−Yn)} = 3^{n(2Yn/n − 1)} → 0,

because Yn/n approaches 1/3. So Xn converges to 0 (with probability 1).
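A simulation sketch makes the contrast vivid (n = 200 and the number of paths are arbitrary choices): the mean explodes while the typical path dies.

import numpy as np

rng = np.random.default_rng(0)
n, paths = 200, 10**5
steps = np.where(rng.random((paths, n)) < 1/3, 3.0, 1/3)  # x3 w.p. 1/3, /3 w.p. 2/3
X = steps.prod(axis=1)                                     # X_n for each path
print((11/9)**n)           # E(X_n) is astronomically large...
print(np.median(X))        # ...yet the typical path is essentially 0
print(np.mean(X < 1e-6))   # nearly all simulated paths are already tiny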


6. Let X and Y be independent standard Normal r.v.s and let R^2 = X^2 + Y^2 (where R > 0 is the distance from (X, Y) to the origin).

(a) The distribution of R^2 is an example of three of the “important distributions” listed on the last page. State which three of these distributions R^2 is an instance of, specifying the parameter values. (For example, if it were Geometric with p = 1/3, the distribution would be Geom(1/3) and also NBin(1, 1/3).)

It is χ^2_2, Expo(1/2), and Gamma(1, 1/2).

(b) Find the PDF of R. (Simplify.) Hint: start with the PDF f_W(w) of W = R^2.

R = √W with f_W(w) = (1/2)e^{−w/2} gives

f_R(r) = f_W(w)|dw/dr| = (1/2)e^{−r^2/2}·2r = re^{−r^2/2}, for r > 0.

(This is known as the Rayleigh distribution.)

(c) Find P(X > 2Y + 3) in terms of the standard Normal CDF Φ. (Simplify.)

P(X > 2Y + 3) = P(X − 2Y > 3) = 1 − Φ(3/√5),

since X − 2Y ∼ N(0, 5).

(d) Compute Cov(R^2, X). Are R^2 and X independent?

They are not independent since knowing X gives information about R^2, e.g., X^2 being large implies that R^2 is large. But R^2 and X are uncorrelated:

Cov(R^2, X) = Cov(X^2 + Y^2, X) = Cov(X^2, X) + Cov(Y^2, X) = E(X^3) − (EX^2)(EX) + 0 = 0.
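A simulation sketch checking (c) and (d); the Φ value is computed from math.erf:

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n = 10**6
X, Y = rng.standard_normal(n), rng.standard_normal(n)
R2 = X**2 + Y**2
Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))       # standard Normal CDF
print(np.mean(X > 2*Y + 3), 1 - Phi(3 / sqrt(5)))  # both about 0.090
print(np.cov(R2, X)[0, 1])                          # near 0, though R^2 and X are dependent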


7. Let U1, U2, . . . , U60 be i.i.d. Unif(0, 1) and X = U1 + U2 + · · · + U60.

(a) Which important distribution is the distribution of X very close to? Specify what the parameters are, and state which theorem justifies your choice.

By the Central Limit Theorem, the distribution is approximately N(30, 5) since E(X) = 30, Var(X) = 60/12 = 5.

(b) Give a simple but accurate approximation for P(X > 17). Justify briefly.

P(X > 17) = 1 − P(X ≤ 17) = 1 − P((X − 30)/√5 ≤ −13/√5) ≈ 1 − Φ(−13/√5) = Φ(13/√5).

Since 13/√5 > 5, and we already have Φ(3) ≈ 0.9985 by the 68-95-99.7% rule, the value is extremely close to 1.

(c) Find the moment generating function (MGF) of X.

The MGF of U1 is E(e^{tU1}) = ∫_0^1 e^{tu} du = (e^t − 1)/t for t ≠ 0, and the MGF of U1 is 1 for t = 0. Thus, the MGF of X is 1 for t = 0, and for t ≠ 0 it is

E(e^{tX}) = E(e^{t(U1 + · · · + U60)}) = (E(e^{tU1}))^{60} = (e^t − 1)^{60}/t^{60}.


8. Let X1, X2, . . . , Xn be i.i.d. random variables with E(X1) = 3, and consider the sum Sn = X1 + X2 + · · · + Xn.

(a) What is E(X1X2X3|X1)? (Simplify. Your answer should be a function of X1.)

E(X1X2X3|X1) = X1E(X2X3|X1) = X1E(X2)E(X3) = 9X1.

(b) What is E(X1|Sn) + E(X2|Sn) + · · ·+ E(Xn|Sn)? (Simplify.)

By linearity, it is E(Sn|Sn), which is Sn.

(c) What is E(X1|Sn)? (Simplify.) Hint: use (b) and symmetry.

By symmetry, E(Xj|Sn) = E(X1|Sn) for all j. Then by (b),

nE(X1|Sn) = Sn,

so

E(X1|Sn) = Sn/n.
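A simulation sketch of (c): conditioning on Sn ≈ 15 (with n = 5 and Expo draws of mean 3, both arbitrary choices) gives an average X1 of about Sn/n.

import numpy as np

rng = np.random.default_rng(0)
n = 5
X = rng.exponential(3.0, size=(10**6, n))   # i.i.d. with mean E(X1) = 3
S = X.sum(axis=1)
mask = (S > 14.5) & (S < 15.5)              # condition on Sn ~ 15
print(X[mask, 0].mean(), 15 / n)            # both about 3.0, i.e., Sn/n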


9. An urn contains red, green, and blue balls. Balls are chosen randomly with replacement (each time, the color is noted and then the ball is put back). Let r, g, b be the probabilities of drawing a red, green, blue ball respectively (r + g + b = 1).

(a) Find the expected number of balls chosen before obtaining the first red ball, not including the red ball itself. (Simplify.)

The distribution is Geom(r), so the expected value is (1 − r)/r.

(b) Find the expected number of different colors of balls obtained before getting the first red ball. (Simplify.)

Use indicator random variables: let I1 be 1 if green is obtained before red, and 0 otherwise, and define I2 similarly for blue. Then

E(I1) = P(green before red) = g/(g + r),

since “green before red” means that the first non-blue ball is green. Similarly, E(I2) = b/(b + r), so the expected number of colors obtained before getting red is

E(I1 + I2) = g/(g + r) + b/(b + r).

(c) Find the probability that at least 2 of n balls drawn are red, given that at least 1 is red. (Simplify; avoid sums of large numbers of terms, and Σ or · · · notation.)

P(at least 2 | at least 1) = P(at least 2)/P(at least 1) = (1 − (1 − r)^n − nr(1 − r)^{n−1})/(1 − (1 − r)^n).
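A simulation sketch for (b), with r, g, b chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(0)
r, g, b = 0.5, 0.3, 0.2
trials, colors = 10**5, 0
for _ in range(trials):
    seen = set()
    while True:
        u = rng.random()
        if u < r:
            break                              # first red ends the experiment
        seen.add('g' if u < r + g else 'b')    # record the non-red color drawn
    colors += len(seen)
print(colors / trials, g/(g + r) + b/(b + r))  # both about 0.661 here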


10. Let X0, X1, X2, . . . be an irreducible Markov chain with state space {1, 2, . . . , M}, M ≥ 3, transition matrix Q = (qij), and stationary distribution s = (s1, . . . , sM). The initial state X0 is given the stationary distribution, i.e., P(X0 = i) = si.

(a) On average, how many of X0, X1, . . . , X9 equal 3? (In terms of s; simplify.)

Since X0 has the stationary distribution, all of X0, X1, . . . have the stationary distribution. By indicator random variables, the expected value is 10s3.

(b) Let Yn = (Xn − 1)(Xn − 2). For M = 3, find an example of Q (the transition matrix for the original chain X0, X1, . . . ) where Y0, Y1, . . . is Markov, and another example of Q where Y0, Y1, . . . is not Markov. Mark which is which and briefly explain. In your examples, make qii > 0 for at least one i and make sure it is possible to get from any state to any other state eventually.

Note that Yn is 0 if Xn is 1 or 2, and Yn is 2 otherwise. So the Yn process can be viewed as merging states 1 and 2 of the Xn-chain into one state. Knowing the history of Yn’s means knowing when the Xn-chain is in State 3, without being able to distinguish State 1 from State 2.

If q13 = q23, then Yn is Markov since given Yn, even knowing the past X0, . . . , Xn does not affect the transition probabilities. But if q13 ≠ q23, then the past history of the Yn’s can give useful information about Xn, affecting the transition probabilities. So one example (not the only possible example!) is

Q1 =

( 1/3  1/3  1/3 )
( 1/3  1/3  1/3 )
( 1/3  1/3  1/3 )

(Markov)

Q2 =

( 1/2  1/2   0  )
( 1/3  1/3  1/3 )
(  1    0    0  )

(not Markov)

(c) If each column of Q sums to 1, what is s? Verify using the definition of stationary.

The stationary distribution is uniform over all states:

s = (1/M, 1/M, . . . , 1/M).

This is because

(1/M, 1/M, . . . , 1/M) Q = (1/M)(1, 1, . . . , 1) Q = (1/M, 1/M, . . . , 1/M),

where the matrix multiplication was done by noting that multiplying a row vector of 1’s times Q gives the column sums of Q.
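A small numeric sketch of (c); the doubly stochastic matrix below is an arbitrary example:

import numpy as np

# A 3x3 transition matrix whose columns (as well as rows) sum to 1.
Q = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
s = np.full(3, 1/3)
print(s @ Q)   # (1/3, 1/3, 1/3): sQ = s, so the uniform s is stationary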


2 Solutions to Stat 110 Final from 2007

1. Consider the birthdays of 100 people. Assume people’s birthdays are independent, and the 365 days of the year (exclude the possibility of February 29) are equally likely.

(a) Find the expected number of birthdays represented among the 100 people, i.e., the expected number of days that at least 1 of the people has as his or her birthday (your answer can involve unsimplified fractions but should not involve messy sums).

Define indicator r.v.s Ij where Ij = 1 if the jth day of the year appears on the list of all the birthdays. Then E(Ij) = P(Ij = 1) = 1 − (364/365)^{100}, so

E(Σ_{j=1}^{365} Ij) = 365(1 − (364/365)^{100}).

(b) Find the covariance between how many of the people were born on January 1 and how many were born on January 2.

Let Xj be the number of people born on January j. Then

Cov(X1, X2) = −100/365^2.

To see this, we can use the result about covariances in the Multinomial, or we can solve the problem directly as follows (or with various other methods). Let Aj be the indicator for the jth person having been born on January 1, and define Bj similarly for January 2. Then

E(X1X2) = E((Σ_i Ai)(Σ_j Bj)) = E(Σ_{i,j} AiBj) = 100·99·(1/365)^2,

since AiBi = 0, while Ai and Bj are independent for i ≠ j. So

Cov(X1, X2) = 100·99·(1/365)^2 − (100/365)^2 = −100/365^2.
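A simulation sketch checking both parts (sample sizes are arbitrary; the covariance target is tiny, so its estimate is noisy):

import numpy as np

rng = np.random.default_rng(0)
# (a): expected number of distinct birthdays among 100 people
days = rng.integers(0, 365, size=(10**4, 100))
print(np.mean([len(np.unique(row)) for row in days]),
      365 * (1 - (364/365)**100))                 # both about 87.6
# (b): counts for Jan 1, Jan 2, and "other" are Multinomial(100, ...)
counts = rng.multinomial(100, [1/365, 1/365, 363/365], size=10**6)
print(np.cov(counts[:, 0], counts[:, 1])[0, 1],
      -100/365**2)                                # both about -7.5e-4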


2. Let X and Y be positive random variables, not necessarily independent. Assume that the various expected values below exist. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where “?” means that no relation holds in general). It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

(a) (E(XY))^2 ≤ E(X^2)E(Y^2) (by Cauchy-Schwarz)

(b) P(|X + Y| > 2) ≤ (1/10)E((X + Y)^4) (by Markov’s Inequality)

(c) E(ln(X + 3)) ≤ ln(E(X + 3)) (by Jensen)

(d) E(X^2 e^X) ≥ E(X^2)E(e^X) (since X^2 and e^X are positively correlated)

(e) P(X + Y = 2) ? P(X = 1)P(Y = 1) (What if X, Y are independent? What if X ∼ Bern(1/2) and Y = 1 − X?)

(f) P(X + Y = 2) ≤ P({X ≥ 1} ∪ {Y ≥ 1}) (left event is a subset of right event)


3. Let X and Y be independent Pois(λ) random variables. Recall that the moment generating function (MGF) of X is M(t) = e^{λ(e^t − 1)}.

(a) Find the MGF of X + 2Y (simplify).

E(e^{t(X+2Y)}) = E(e^{tX}) E(e^{2tY}) = e^{λ(e^t − 1)} e^{λ(e^{2t} − 1)} = e^{λ(e^t + e^{2t} − 2)}.

(b) Is X + 2Y also Poisson? Show that it is, or that it isn’t (whichever is true).

No, it is not Poisson. This can be seen by noting that the MGF from (a) is not of the form of a Poisson MGF, or by noting that E(X + 2Y) = 3λ and Var(X + 2Y) = 5λ are not equal, whereas any Poisson random variable has mean equal to its variance.

(c) Let g(t) = ln M(t) be the log of the MGF of X. Expanding g(t) as a Taylor series

g(t) = Σ_{j=1}^{∞} (c_j/j!) t^j

(the sum starts at j = 1 because g(0) = 0), the coefficient c_j is called the jth cumulant of X. Find c_j in terms of λ, for all j ≥ 1 (simplify).

Using the Taylor series for e^t,

g(t) = λ(e^t − 1) = Σ_{j=1}^{∞} λ t^j/j!,

so c_j = λ for all j ≥ 1.


4. Consider the following conversation from an episode of The Simpsons:

Lisa: Dad, I think he’s an ivory dealer! His boots are ivory, his hat is ivory, and I’m pretty sure that check is ivory.

Homer: Lisa, a guy who’s got lots of ivory is less likely to hurt Stampy than a guy whose ivory supplies are low.

Here Homer and Lisa are debating the question of whether or not the man (named Blackheart) is likely to hurt Stampy the Elephant if they sell Stampy to him. They clearly disagree about how to use their observations about Blackheart to learn about the probability (conditional on the evidence) that Blackheart will hurt Stampy.

(a) Define clear notation for the various events of interest here.

Let H be the event that the man will hurt Stampy, let L be the event that a man has lots of ivory, and let D be the event that the man is an ivory dealer.

(b) Express Lisa’s and Homer’s arguments (Lisa’s is partly implicit) as conditional probability statements in terms of your notation from (a).

Lisa observes that L is true. She suggests (reasonably) that this evidence makes D more likely, i.e., P(D|L) > P(D). Implicitly, she suggests that this makes it likely that the man will hurt Stampy, i.e., P(H|L) > P(H|L^c). Homer argues that P(H|L) < P(H|L^c).

(c) Assume it is true that someone who has a lot of a commodity will have less desire to acquire more of the commodity. Explain what is wrong with Homer’s reasoning that the evidence about Blackheart makes it less likely that he will harm Stampy.

Homer does not realize that observing that Blackheart has so much ivory makes it much more likely that Blackheart is an ivory dealer, which in turn makes it more likely that the man will hurt Stampy. (This is an example of Simpson’s Paradox.) It may be true that, controlling for whether or not Blackheart is a dealer, having high ivory supplies makes it less likely that he will harm Stampy: P(H|L, D) < P(H|L^c, D) and P(H|L, D^c) < P(H|L^c, D^c). However, this does not imply that P(H|L) < P(H|L^c).


5. Empirically, it is known that 49% of children born in the U.S. are girls (and 51% are boys). Let N be the number of children who will be born in the U.S. in March 2009, and assume that N is a Pois(λ) random variable, where λ is known. Assume that births are independent (e.g., don’t worry about identical twins).

Let X be the number of girls who will be born in the U.S. in March 2009, and let Y be the number of boys who will be born then (note the importance of choosing good notation: boys have a Y chromosome).

(a) Find the joint distribution of X and Y. (Give the joint PMF.)

Note that the problem is equivalent to the chicken and egg problem (the structure is identical). So X and Y are independent with X ∼ Pois(0.49λ), Y ∼ Pois(0.51λ). The joint PMF is

P(X = i, Y = j) = (e^{−0.49λ}(0.49λ)^i/i!)(e^{−0.51λ}(0.51λ)^j/j!).

(b) Find E(N|X) and E(N^2|X).

Since X and Y are independent,

E(N|X) = E(X + Y|X) = X + E(Y|X) = X + E(Y) = X + 0.51λ,

E(N^2|X) = E(X^2 + 2XY + Y^2|X) = X^2 + 2X·E(Y) + E(Y^2) = (X + 0.51λ)^2 + 0.51λ.
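A simulation sketch of the Poisson-splitting structure (λ = 10 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
lam = 10.0
N = rng.poisson(lam, 10**6)
X = rng.binomial(N, 0.49)          # girls: thinning N with p = 0.49
Y = N - X                          # boys
print(X.mean(), X.var())           # both near 0.49*lam = 4.9, as for a Poisson
print(np.corrcoef(X, Y)[0, 1])     # near 0, consistent with independence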


6. Let X1, X2, X3 be independent with Xi ∼ Expo(λi) (independent Exponentials with possibly different rates). A useful fact (which you may use) is that P(X1 < X2) = λ1/(λ1 + λ2).

(a) Find E(X1 + X2 + X3 | X1 > 1, X2 > 2, X3 > 3) in terms of λ1, λ2, λ3.

By linearity, independence, and the memoryless property, we get

E(X1|X1 > 1) + E(X2|X2 > 2) + E(X3|X3 > 3) = 1/λ1 + 1/λ2 + 1/λ3 + 6.

(b) Find P(X1 = min(X1, X2, X3)), the probability that the first of the three Exponentials is the smallest. Hint: re-state this in terms of X1 and min(X2, X3).

The desired probability is P(X1 ≤ min(X2, X3)). Noting that min(X2, X3) ∼ Expo(λ2 + λ3) is independent of X1, we have

P(X1 ≤ min(X2, X3)) = λ1/(λ1 + λ2 + λ3).

(c) For the case λ1 = λ2 = λ3 = 1, find the PDF of max(X1, X2, X3). Is this one of the “important distributions”?

Let M = max(X1, X2, X3). Using the order statistics results from class or by directly computing the CDF and taking the derivative, for x > 0 we have

f_M(x) = 3(1 − e^{−x})^2 e^{−x}.

This is not one of the “important distributions”. (The form is reminiscent of a Beta, but a Beta takes values between 0 and 1, while M can take any positive real value; in fact, B ∼ Beta(1, 3) if we make the transformation B = e^{−M}.)
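A simulation sketch comparing the CDF of M with (1 − e^{−x})^3, whose derivative is the PDF above:

import numpy as np

rng = np.random.default_rng(0)
M = rng.exponential(1.0, size=(10**6, 3)).max(axis=1)
for x in [0.5, 1.0, 2.0]:
    print(x, np.mean(M <= x), (1 - np.exp(-x))**3)  # empirical vs exact CDF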


7. Let X1, X2, . . . be i.i.d. random variables with CDF F(x). For every number x, let Rn(x) count how many of X1, . . . , Xn are less than or equal to x.

(a) Find the mean and variance of Rn(x) (in terms of n and F(x)).

Let Ij(x) be 1 if Xj ≤ x and 0 otherwise. Then

Rn(x) = Σ_{j=1}^{n} Ij(x) ∼ Bin(n, F(x)),

so E(Rn(x)) = nF(x) and Var(Rn(x)) = nF(x)(1 − F(x)).

(b) Assume (for this part only) that X1, . . . , X4 are known constants. Sketch an example showing what the graph of the function R4(x)/4 might look like. Is the function R4(x)/4 necessarily a CDF? Explain briefly.

For X1, . . . , X4 distinct, the graph of R4(x)/4 starts at 0 and then has 4 jumps, each of size 0.25 (it jumps every time one of the Xi’s is reached).

The function R4(x)/4 is the CDF of a discrete random variable with possible values X1, X2, X3, X4.

(c) Show that Rn(x)/n → F(x) as n → ∞ (with probability 1).

As in (a), Rn(x) is the sum of n i.i.d. Bern(p) r.v.s, where p = F(x). So by the Law of Large Numbers, Rn(x)/n → F(x) as n → ∞ (with probability 1).
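A simulation sketch of this convergence, using a standard Normal (an arbitrary choice) and the point x = 1:

import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 10_000, 1_000_000]:
    X = rng.standard_normal(n)
    print(n, np.mean(X <= 1.0))   # R_n(1)/n, approaching Phi(1) ~ 0.8413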


8. (a) Let T be a Student-t r.v. with 1 degree of freedom, and let W = 1/T. Find the PDF of W (simplify). Is this one of the “important distributions”?

Hint: no calculus is needed for this (though it can be used to check your answer).

Recall that a Student-t with 1 degree of freedom (also known as a Cauchy) can be represented as a ratio X/Y with X and Y i.i.d. N(0, 1). But then the reciprocal Y/X is of the same form! So W is also Student-t with 1 degree of freedom, and has PDF f_W(w) = 1/(π(1 + w^2)).

(b) Let Wn ∼ χ^2_n (the Chi-squared distribution with n degrees of freedom), for each n ≥ 1. Do there exist an and bn such that an(Wn − bn) → N(0, 1) in distribution as n → ∞? If so, find them; if not, explain why not.

Write Wn = Σ_{i=1}^{n} Zi^2 with the Zi i.i.d. N(0, 1). By the CLT, the claim is true with

bn = E(Wn) = n and an = 1/√(Var(Wn)) = 1/√(2n).

(c) Let Z ∼ N(0, 1) and Y = |Z|. Find the PDF of Y, and approximate P(Y < 2).

For y ≥ 0, the CDF of Y is

P(Y ≤ y) = P(|Z| ≤ y) = P(−y ≤ Z ≤ y) = Φ(y) − Φ(−y),

so the PDF of Y is

f_Y(y) = (1/√(2π)) e^{−y^2/2} + (1/√(2π)) e^{−y^2/2} = 2·(1/√(2π)) e^{−y^2/2}.

By the 68-95-99.7% Rule, P(Y < 2) ≈ 0.95.
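A simulation sketch for (c); the √(2/π) line is the standard half-Normal mean, used here only as an extra check of the PDF:

import numpy as np

rng = np.random.default_rng(0)
Y = np.abs(rng.standard_normal(10**6))
print(np.mean(Y < 2))                # about 0.9545, consistent with ~0.95
print(Y.mean(), np.sqrt(2/np.pi))    # E|Z| = sqrt(2/pi) ~ 0.798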


9. Consider a knight randomly moving around on a 4 by 4 chessboard:

[Figure: a 4 × 4 chessboard with columns labeled A–D and rows labeled 1–4; the knight is shown at B3.]

The 16 squares are labeled in a grid, e.g., the knight is currently at the square B3, and the upper left square is A4. Each move of the knight is an L-shape: two squares horizontally followed by one square vertically, or vice versa. For example, from B3 the knight can move to A1, C1, D2, or D4; from A4 it can move to B2 or C3. Note that from a white square, the knight always moves to a gray square and vice versa.

At each step, the knight moves randomly, each possibility equally likely. Consider the stationary distribution of this Markov chain, where the states are the 16 squares.

(a) Which squares have the highest stationary probability? Explain very briefly.

The four center squares (B2, B3, C2, C3) have the highest stationary probability since they are the most highly connected squares: for each of these squares, the number of possible moves to/from the square is maximized.

(b) Compute the stationary distribution (simplify). Hint: random walk on a graph.

Use symmetry to note that there are only three “types” of square: there are 4 center squares, 4 corner squares (such as A4), and 8 edge squares (such as B4; exclude corner squares from being considered edge squares). Recall from the Markov chain handout that the stationary probability of a state for random walk on an undirected network is proportional to its degree.

A center square here has degree 4, a corner square has degree 2, and an edge square has degree 3. So these have probabilities 4a, 2a, 3a respectively, for some a. To find a, count the number of squares of each type to get 4a(4) + 2a(4) + 3a(8) = 1, giving a = 1/48. Thus, each center square has stationary probability 4/48 = 1/12; each corner square has stationary probability 2/48 = 1/24; and each edge square has stationary probability 3/48 = 1/16.
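A numeric sketch that rebuilds the knight-move graph and confirms the degree counts (and hence the stationary probabilities 2/48, 3/48, 4/48):

moves = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]
deg = {}
for r in range(4):
    for c in range(4):
        # a move is legal if it stays on the 4x4 board
        deg[(r, c)] = sum(0 <= r + dr < 4 and 0 <= c + dc < 4 for dr, dc in moves)
total = sum(deg.values())                   # 48 = 4*4 + 4*2 + 8*3
print(sorted(set(deg.values())), total)     # degrees [2, 3, 4], total 48
print({d: d / total for d in sorted(set(deg.values()))})  # 1/24, 1/16, 1/12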


3 Solutions to Stat 110 Final from 2008

1. Joe’s iPod has 500 different songs, consisting of 50 albums of 10 songs each. He listens to 11 random songs on his iPod, with all songs equally likely and chosen independently (so repetitions may occur).

(a) What is the PMF of how many of the 11 songs are from his favorite album?

The distribution is Bin(n, p) with n = 11, p = 1/50 (thinking of getting a song from the favorite album as a “success”). So the PMF is

C(11, k) (1/50)^k (49/50)^{11−k}, for 0 ≤ k ≤ 11.

(b) What is the probability that there are 2 (or more) songs from the same album among the 11 songs he listens to? (Do not simplify.)

This is a form of the birthday problem.

P(at least 1 match) = 1 − P(no matches) = 1 − (50·49 · · · 40)/50^{11} = 1 − 49!/(39!·50^{10}).

(c) A pair of songs is a “match” if they are from the same album. If, say, the 1st, 3rd, and 7th songs are all from the same album, this counts as 3 matches. Among the 11 songs he listens to, how many matches are there on average? (Simplify.)

Defining an indicator r.v. Ijk for the event that the jth and kth songs match, we have E(Ijk) = P(Ijk = 1) = 1/50, so the expected number of matches is

C(11, 2)·(1/50) = (11·10)/(2·50) = 110/100 = 1.1.
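A simulation sketch for (c) (10^5 trials is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
trials = 10**5
albums = rng.integers(0, 50, size=(trials, 11))       # album index of each song
counts = np.apply_along_axis(np.bincount, 1, albums, minlength=50)
pairs = (counts * (counts - 1) // 2).sum(axis=1)      # matches per trial
print(pairs.mean())                                    # close to C(11,2)/50 = 1.1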


2. Let X and Y be positive random variables, not necessarily independent. Assume that the various expressions below exist. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where “?” means that no relation holds in general). It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

(a) P(X + Y > 2) ≤ (EX + EY)/2 (by Markov and linearity)

(b) P(X + Y > 3) ≥ P(X > 3) (since X > 3 implies X + Y > 3, since Y > 0)

(c) E(cos(X)) ? cos(EX) (e.g., let W ∼ Bern(1/2) and X = aW + b for various a, b)

(d) E(X^{1/3}) ≤ (EX)^{1/3} (by Jensen)

(e) E(X^Y) ? (EX)^{EY} (take X constant or Y constant as examples)

(f) E(E(X|Y) + E(Y|X)) = EX + EY (by linearity and Adam’s Law)


3. (a) A woman is pregnant with twin boys. Twins may be either identical or fraternal (non-identical). In general, 1/3 of twins born are identical. Obviously, identical twins must be of the same sex; fraternal twins may or may not be. Assume that identical twins are equally likely to be both boys or both girls, while for fraternal twins all possibilities are equally likely. Given the above information, what is the probability that the woman’s twins are identical?

By Bayes’ Rule,

P(identical|BB) = P(BB|identical)P(identical)/P(BB) = ((1/2)·(1/3)) / ((1/2)·(1/3) + (1/4)·(2/3)) = 1/2.

(b) A certain genetic characteristic is of interest. For a random person, this has a numerical value given by a N(0, σ^2) r.v. Let X1 and X2 be the values of the genetic characteristic for the twin boys from (a). If they are identical, then X1 = X2; if they are fraternal, then X1 and X2 have correlation ρ. Find Cov(X1, X2) in terms of ρ, σ^2.

Since the means are 0, Cov(X1, X2) = E(X1X2) − (EX1)(EX2) = E(X1X2). We find this by conditioning on whether the twins are identical or fraternal:

E(X1X2) = E(X1X2|identical)·(1/2) + E(X1X2|fraternal)·(1/2) = E(X1^2)·(1/2) + ρσ^2·(1/2) = (σ^2/2)(1 + ρ).


4. (a) Consider i.i.d. Pois(λ) r.v.s X1, X2, . . . . The MGF of Xj is M(t) = e^{λ(e^t − 1)}. Find the MGF Mn(t) of the sample mean X̄n = (1/n) Σ_{j=1}^{n} Xj. (Hint: it may help to do the n = 2 case first, which itself is worth a lot of partial credit, and then generalize.)

The MGF is

E(e^{(t/n)(X1 + · · · + Xn)}) = (E(e^{(t/n)X1}))^n = e^{nλ(e^{t/n} − 1)},

since the Xj are i.i.d. and E(e^{(t/n)X1}) is the MGF of X1 evaluated at t/n.

(b) Find the limit of Mn(t) as n → ∞. (You can do this with almost no calculation using a relevant theorem; or you can use (a) and that e^x ≈ 1 + x if x is very small.)

By the Law of Large Numbers, X̄n → λ with probability 1. The MGF of the constant λ (viewed as a r.v. that always equals λ) is e^{tλ}. Thus, Mn(t) → e^{tλ} as n → ∞.


5. A post office has 2 clerks. Alice enters the post office while 2 other customers, Bob and Claire, are being served by the 2 clerks. She is next in line. Assume that the time a clerk spends serving a customer has the Expo(λ) distribution.

(a) What is the probability that Alice is the last of the 3 customers to be done being served? Justify your answer. Hint: no integrals are needed.

Alice begins to be served when either Bob or Claire leaves. By the memoryless property, the additional time needed to serve whichever of Bob or Claire is still there is Expo(λ). The time it takes to serve Alice is also Expo(λ), so by symmetry the probability is 1/2 that Alice is the last to be done being served.

(b) Let X and Y be independent Expo(λ) r.v.s. Find the CDF of min(X, Y).

Use the order statistics results, or compute it directly:

P(min(X, Y) > z) = P(X > z, Y > z) = P(X > z) P(Y > z) = e^{−2λz},

so min(X, Y) has the Expo(2λ) distribution, with CDF F(z) = 1 − e^{−2λz}.

(c) What is the expected total time that Alice needs to spend at the post office?

The expected time spent waiting in line is 1/(2λ) by (b). The expected time spent being served is 1/λ. So the expected total time is

1/(2λ) + 1/λ = 3/(2λ).


6. You are given an amazing opportunity to bid on a mystery box containing amystery prize! The value of the prize is completely unknown, except that it is worthat least nothing, and at most a million dollars. So the true value V of the prize isconsidered to be Uniform on [0,1] (measured in millions of dollars).

You can choose to bid any amount b (in millions of dollars). You have the chanceto get the prize for considerably less than it is worth, but you could also lose moneyif you bid too much. Specifically, if b < 2

3V , then the bid is rejected and nothingis gained or lost. If b � 2

3V , then the bid is accepted and your net payo↵ is V � b(since you pay b to get a prize worth V ). What is your optimal bid b (to maximizethe expected payo↵)?

We choose a bid b ≥ 0, which cannot be defined in terms of the unknown V. The expected payoff can be found by conditioning on whether the bid is accepted. The term where the bid is rejected is 0, so the expected payoff is

E(V − b | b ≥ (2/3)V) P(b ≥ (2/3)V) = (E(V | V ≤ (3/2)b) − b) P(V ≤ (3/2)b).

For b ≥ 2/3, the bid is definitely accepted but we lose money on average, so assume b < 2/3. Then

(E(V | V ≤ (3/2)b) − b) P(V ≤ (3/2)b) = ((3/4)b − b)·(3/2)b = −(3/8)b²,

since given that V ≤ (3/2)b, the conditional distribution of V is Uniform on [0, (3/2)b].

The above expression is negative except at b = 0, so the optimal bid is 0: one should not play this game! What’s the moral of this story? First, investing in an asset without any information about its value is a bad idea. Second, condition on all the information. It is crucial in the above calculation to use E(V | V ≤ (3/2)b) rather than E(V) = 1/2; knowing that the bid was accepted gives information about how much the mystery prize is worth!
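A short simulation confirms the expected payoff curve (Python/NumPy sketch; the grid of bids is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    V = rng.uniform(0, 1, 10**6)
    for b in [0.0, 0.1, 0.3, 0.5, 0.6]:
        payoff = np.where(b >= 2 * V / 3, V - b, 0.0)  # accepted iff b >= (2/3)V
        print(b, payoff.mean(), -3 * b**2 / 8)          # agrees with -(3/8)b^2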


7. (a) Let Y = e^X, with X ~ Expo(3). Find the mean and variance of Y (simplify).

By LOTUS,

E(Y) = ∫_0^∞ e^x (3e^{−3x}) dx = 3/2,

E(Y²) = ∫_0^∞ e^{2x} (3e^{−3x}) dx = 3.

So E(Y) = 3/2, Var(Y) = 3 − 9/4 = 3/4.

(b) For Y1, . . . , Yn i.i.d. with the same distribution as Y from (a), what is the approximate distribution of the sample mean Ȳn = (1/n) Σ_{j=1}^n Yj when n is large? (Simplify, and specify all parameters.)

By the CLT, Ȳn is approximately N(3/2, 3/(4n)) for large n.
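A simulation check of both parts (Python/NumPy sketch; n = 1000 is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 1000, 2000
    Y = np.exp(rng.exponential(1/3, (reps, n)))  # Y = e^X with X ~ Expo(3)
    print(Y.mean(), Y.var())                     # ~ 3/2 and ~ 3/4
    Ybar = Y.mean(axis=1)                        # sample means
    print(Ybar.mean(), Ybar.var(), 3/(4*n))      # mean ~ 3/2, variance ~ 3/(4n)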


8.

[Diagram: states 1 through 7 arranged in a circle.]

(a) Consider a Markov chain on the state space {1, 2, . . . , 7} with the states arranged in a “circle” as shown above, and transitions given by moving one step clockwise or counterclockwise with equal probabilities. For example, from state 6, the chain moves to state 7 or state 5 with probability 1/2 each; from state 7, the chain moves to state 1 or state 6 with probability 1/2 each. The chain starts at state 1.

Find the stationary distribution of this chain.

The symmetry of the chain suggests that the stationary distribution should be uniform over all the states. To verify this, note that the reversibility condition is satisfied. So the stationary distribution is (1/7, 1/7, . . . , 1/7).

(b) Consider a new chain obtained by “unfolding the circle.” Now the states are arranged as shown below. From state 1 the chain always goes to state 2, and from state 7 the chain always goes to state 6. Find the new stationary distribution.

1 2 3 4 5 6 7

By the results from class for random walk on an undirected network, the stationary probabilities are proportional to the degrees. So we just need to normalize (1, 2, 2, 2, 2, 2, 1), obtaining (1/12, 1/6, 1/6, 1/6, 1/6, 1/6, 1/12).
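Both answers are easy to verify numerically by checking sQ = s (a Python/NumPy sketch):

    import numpy as np

    # (a) Circle chain: one step clockwise or counterclockwise w.p. 1/2 each.
    Q = np.zeros((7, 7))
    for i in range(7):
        Q[i, (i + 1) % 7] = Q[i, (i - 1) % 7] = 0.5
    s = np.full(7, 1/7)
    print(np.allclose(s @ Q, s))  # True: uniform is stationary

    # (b) Unfolded chain: ends move inward w.p. 1, interior states move +/-1 w.p. 1/2.
    P = np.zeros((7, 7))
    P[0, 1] = P[6, 5] = 1.0
    for i in range(1, 6):
        P[i, i - 1] = P[i, i + 1] = 0.5
    s2 = np.array([1, 2, 2, 2, 2, 2, 1]) / 12  # degrees, normalized
    print(np.allclose(s2 @ P, s2))  # True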


4 Solutions to Stat 110 Final from 2009

1. A group of n people play “Secret Santa” as follows: each puts his or her name on a slip of paper in a hat, picks a name randomly from the hat (without replacement), and then buys a gift for that person. Unfortunately, they overlook the possibility of drawing one’s own name, so some may have to buy gifts for themselves (on the bright side, some may like self-selected gifts better). Assume n ≥ 2.

(a) Find the expected number of people who pick their own names (simplify).

Let Ij be the indicator r.v. for the jth person picking his or her own name. Then E(Ij) = P(Ij = 1) = 1/n. By linearity, the expected number is n · E(Ij) = 1.

(b) Find the expected number of pairs of people, A and B, such that A picks B’s name and B picks A’s name (where A ≠ B and order doesn’t matter; simplify).

Let Iij be the indicator r.v. for the ith and jth persons having such a “swap” (for i < j). Then E(Iij) = P(i picks j)P(j picks i | i picks j) = 1/(n(n − 1)).

Alternatively, we can get this by counting: there are n! permutations for who picks whom, of which (n − 2)! have i pick j and j pick i, giving (n − 2)!/n! = 1/(n(n − 1)). So by linearity, the expected number is C(n, 2) · 1/(n(n − 1)) = 1/2.

(c) Let X be the number of people who pick their own names. Which of the “important distributions” are conceivable as the distribution of X, just based on the possible values X takes (you do not need to list parameter values for this part)?

Since X is an integer between 0 and n, the only conceivable “important distributions” are Binomial and Hypergeometric. Going further (which was not required), note that X actually can’t equal n − 1, since if n − 1 people pick their own names then the remaining person must too. So the possible values are the integers from 0 to n except for n − 1, which rules out all of the “important distributions”.

(d) What is the approximate distribution of X if n is large (specify the parameter value or values)? What does P(X = 0) converge to as n → ∞?

By the Poisson Paradigm, X is approximately Pois(1) for large n. As n → ∞, P(X = 0) → 1/e, which is the probability of a Pois(1) r.v. being 0.
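A simulation ties (a), (b), and (d) together (Python/NumPy sketch; n = 10 is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 10, 10**5
    fixed = swaps = none = 0
    for _ in range(trials):
        perm = rng.permutation(n)
        k = np.sum(perm == np.arange(n))                 # self-pickers
        fixed += k
        none += (k == 0)
        swaps += (np.sum(perm[perm] == np.arange(n)) - k) / 2  # swapped pairs
    print(fixed / trials)             # ~ 1
    print(swaps / trials)             # ~ 1/2
    print(none / trials, np.exp(-1))  # ~ 1/e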


2. Let X and Y be positive random variables, not necessarily independent. Assume that the various expected values below exist. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where “?” means that no relation holds in general). It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

(a) E(X³) ≤ √(E(X²)E(X⁴)) (by Cauchy-Schwarz)

(b) P(|X + Y| > 2) ≤ (1/16)E((X + Y)⁴) (by Markov, taking 4th powers first)

(c) E(√(X + 3)) ≤ √(E(X + 3)) (by Jensen with a concave function)

(d) E(sin²(X)) + E(cos²(X)) = 1 (by linearity)

(e) E(Y | X + 3) = E(Y | X) (knowing X + 3 is equivalent to knowing X)

(f) E(E(Y² | X)) ≥ (EY)² (by Adam’s Law and Jensen)


3. Let Z ~ N(0, 1). Find the 4th moment E(Z⁴) in the following two different ways:

(a) using what you know about how certain powers of Z are related to other distributions, along with information from the table of distributions.

Let W = Z², which we know is χ²_1. By the table, E(W) = 1, Var(W) = 2. So

E(Z⁴) = E(W²) = Var(W) + (EW)² = 2 + 1 = 3.

(b) using the MGF M(t) = e^{t²/2}, by writing down its Taylor series and using how the coefficients relate to moments of Z, not by tediously taking derivatives of M(t). Hint: you can get this series immediately from the Taylor series for e^x.

Plugging t²/2 in for x in e^x = Σ_{n=0}^∞ x^n/n!, we have

e^{t²/2} = Σ_{n=0}^∞ t^{2n}/(n! · 2^n).

The t⁴ term is

(1/(2! · 2²)) t⁴ = (1/8) t⁴ = (3/4!) t⁴,

so the 4th moment is 3, which agrees with (a).
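A two-line numerical check of both approaches (Python sketch, assuming NumPy):

    import numpy as np
    from math import factorial

    rng = np.random.default_rng(0)
    print((rng.standard_normal(10**6) ** 4).mean())     # ~ 3
    print(1 / (factorial(2) * 2**2), 3 / factorial(4))  # both 1/8: same t^4 coefficient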


4. A chicken lays n eggs. Each egg independently does or doesn’t hatch, with probability p of hatching. For each egg that hatches, the chick does or doesn’t survive (independently of the other eggs), with probability s of survival. Let N ~ Bin(n, p) be the number of eggs which hatch, X be the number of chicks which survive, and Y be the number of chicks which hatch but don’t survive (so X + Y = N).

(a) Find the distribution of X, preferably with a clear explanation in words rather than with a computation. If X has one of the “important distributions,” say which (including its parameters).

We will give a story proof that X ~ Bin(n, ps). Consider any one of the n eggs. With probability p, it hatches. Given that it hatches, with probability s the chick survives. So the probability is ps of the egg hatching a chick which survives. Thus, X ~ Bin(n, ps).

(b) Find the joint PMF of X and Y (simplify).

As in the chicken-egg problem from class, condition on N and note that only the N = i + j term is nonzero: for any nonnegative integers i, j with i + j ≤ n,

P(X = i, Y = j) = P(X = i, Y = j | N = i + j)P(N = i + j)
               = P(X = i | N = i + j)P(N = i + j)
               = C(i + j, i) s^i (1 − s)^j · C(n, i + j) p^{i+j} (1 − p)^{n−i−j}
               = n!/(i! j! (n − i − j)!) (ps)^i (p(1 − s))^j (1 − p)^{n−i−j}.

(c) Are X and Y independent? Give a clear explanation in words (of course it makes sense to see if your answer is consistent with your answer to (b), but you can get full credit on this part even without doing (b); conversely, it’s not enough to just say “by (b), . . . ” without further explanation).

They are not independent, unlike in the chicken-egg problem from class (where N was Poisson). To see this, consider extreme cases: if X = n, then clearly Y = 0. This shows that X can yield information about Y.
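A simulation check of (a) and (c) (Python/NumPy sketch; the values of n, p, s are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, s, trials = 10, 0.7, 0.5, 10**5
    N = rng.binomial(n, p, trials)   # eggs that hatch
    X = rng.binomial(N, s)           # chicks that survive
    Y = N - X                        # hatch but don't survive
    print(X.mean(), n * p * s)       # X ~ Bin(n, ps), so the mean is n*p*s
    print(np.corrcoef(X, Y)[0, 1])   # clearly nonzero, so X and Y are dependent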


5. Suppose we wish to approximate the following integral (denoted by b):

b = ∫_{−∞}^∞ (−1)^{⌊x⌋} e^{−x²/2} dx,

where ⌊x⌋ is the greatest integer less than or equal to x (e.g., ⌊3.14⌋ = 3).

(a) Write down a function g(x) such that E(g(X)) = b for X ~ N(0, 1) (your function should not be in terms of b, and should handle normalizing constants carefully).

There are many possible solutions. By LOTUS, we can take g(x) = √(2π)(−1)^{⌊x⌋}. We can also just calculate b: by symmetry, b = 0 since ⌊−x⌋ = −⌊x⌋ − 1 except when x is an integer, so the integral from −∞ to 0 cancels that from 0 to ∞, so we can simply take g(x) = 0.

(b) Write down a function h(u) such that E(h(U)) = b for U ~ Unif(0, 1) (your function should not be in terms of b, and can be in terms of the function g from (a) and the standard Normal CDF Φ).

By Universality of the Uniform, Φ⁻¹(U) ~ N(0, 1), so define X = Φ⁻¹(U). Then E(g(Φ⁻¹(U))) = b, so we can take h(u) = g(Φ⁻¹(u)).

(c) Let X1, X2, . . . , Xn be i.i.d. N(0, 1) with n large, and let g be as in (a). What is the approximate distribution of (1/n)(g(X1) + · · · + g(Xn))? Simplify the parameters fully (in terms of b and n), and mention which theorems you are using.

For the choice of g obtained from LOTUS, we have E(g(X)) = b and Var(g(X)) = 2π − b² (since g(x)² = 2π), so by the CLT, the approximate distribution is N(b, (2π − b²)/n).

For the choice g(x) = 0, the distribution is degenerate, giving probability 1 to the value 0.
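The Monte Carlo estimator from (a) is easy to try out (Python/NumPy sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal(10**6)
    sign = np.where(np.floor(X) % 2 == 0, 1.0, -1.0)  # (-1)^floor(x)
    g = np.sqrt(2 * np.pi) * sign
    print(g.mean())            # ~ b = 0
    print(g.var(), 2 * np.pi)  # Var(g(X)) = 2*pi - b^2 = 2*pi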


6. Let X1 be the number of emails received by a certain person today and let X2 be the number of emails received by that person tomorrow, with X1 and X2 i.i.d.

(a) Find E(X1 | X1 + X2) (simplify).

By symmetry, E(X1 | X1 + X2) = E(X2 | X1 + X2). By linearity,

E(X1 | X1 + X2) + E(X2 | X1 + X2) = E(X1 + X2 | X1 + X2) = X1 + X2.

So E(X1 | X1 + X2) = (X1 + X2)/2.

(b) For the case Xj ~ Pois(λ), find the conditional distribution of X1 given X1 + X2, i.e., P(X1 = k | X1 + X2 = n) (simplify). Is this one of the “important distributions”?

By Bayes’ Rule and the fact that X1 + X2 ~ Pois(2λ),

P(X1 = k | X1 + X2 = n) = P(X1 + X2 = n | X1 = k)P(X1 = k)/P(X1 + X2 = n)
                        = P(X2 = n − k)P(X1 = k)/P(X1 + X2 = n)
                        = (e^{−λ} λ^{n−k}/(n − k)!) · (e^{−λ} λ^k/k!) · e^{2λ}(2λ)^{−n} n!
                        = C(n, k)(1/2)^n.

Thus, the conditional distribution is Bin(n, 1/2). Note that the λ disappeared! This is not a coincidence; there is an important statistical reason for this, but that is a story for another day and another course.
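To see the λ really drop out, one can condition empirically (Python/NumPy sketch; λ = 3 and n = 6 are illustrative):

    from math import comb
    import numpy as np

    rng = np.random.default_rng(0)
    lam, n, trials = 3.0, 6, 10**6
    X1, X2 = rng.poisson(lam, (2, trials))
    sel = X1[X1 + X2 == n]                              # condition on X1 + X2 = n
    for k in range(n + 1):
        print(k, (sel == k).mean(), comb(n, k) / 2**n)  # matches Bin(n, 1/2)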


7. Let X1, X2, X3 be independent with Xi ~ Expo(λi) (so with possibly different rates). A useful fact (which you may use) is that P(X1 < X2) = λ1/(λ1 + λ2).

(a) Find E(X1 + X2 + X3 | X1 > 1, X2 > 2, X3 > 3) in terms of λ1, λ2, λ3.

By linearity, independence, and the memoryless property, we get

E(X1 | X1 > 1) + E(X2 | X2 > 2) + E(X3 | X3 > 3) = λ1⁻¹ + λ2⁻¹ + λ3⁻¹ + 6.

(b) Find P(X1 = min(X1, X2, X3)), the probability that the first of the three Exponentials is the smallest. Hint: re-state this in terms of X1 and min(X2, X3).

The desired probability is P(X1 ≤ min(X2, X3)). Noting that min(X2, X3) ~ Expo(λ2 + λ3) is independent of X1, we have

P(X1 ≤ min(X2, X3)) = λ1/(λ1 + λ2 + λ3).

(c) For the case λ1 = λ2 = λ3 = 1, find the PDF of max(X1, X2, X3). Is this one of the “important distributions”?

Let M = max(X1, X2, X3). Using the order statistics results from class or by directly computing the CDF and taking the derivative, for x > 0 we have

f_M(x) = 3(1 − e^{−x})² e^{−x}.

This is not one of the “important distributions”. (The form is reminiscent of a Beta, but a Beta takes values between 0 and 1, while M can take any positive real value; in fact, B ~ Beta(1, 3) if we make the transformation B = e^{−M}.)
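The Beta connection in the parenthetical is easy to check by simulation (Python/NumPy sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.exponential(1.0, (10**6, 3)).max(axis=1)
    B = np.exp(-M)
    print(B.mean(), 1/4)   # Beta(1, 3) has mean 1/4
    print(B.var(), 3/80)   # and variance 3/80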


8. Let Xn be the price of a certain stock at the start of the nth day, and assume that X0, X1, X2, . . . follows a Markov chain with transition matrix Q (assume for simplicity that the stock price can never go below 0 or above a certain upper bound, and that it is always rounded to the nearest dollar).

(a) A lazy investor only looks at the stock once a year, observing the values on days 0, 365, 2 · 365, 3 · 365, . . . . So the investor observes Y0, Y1, . . . , where Yn is the price after n years (which is 365n days; you can ignore leap years). Is Y0, Y1, . . . also a Markov chain? Explain why or why not; if so, what is its transition matrix?

Yes, it is a Markov chain: given the whole past history Y0, Y1, . . . , Yn, only the most recent information Yn matters for predicting Yn+1, because X0, X1, . . . is Markov. The transition matrix of Y0, Y1, . . . is Q^365, since the kth power of Q gives the k-step transition probabilities.

(b) The stock price is always an integer between $0 and $28. From each day to the next, the stock goes up or down by $1 or $2, all with equal probabilities (except for days when the stock is at or near a boundary, i.e., at $0, $1, $27, or $28). If the stock is at $0, it goes up to $1 or $2 on the next day (after receiving government bailout money). If the stock is at $28, it goes down to $27 or $26 the next day. If the stock is at $1, it either goes up to $2 or $3, or down to $0 (with equal probabilities); similarly, if the stock is at $27 it either goes up to $28, or down to $26 or $25. Find the stationary distribution of the chain (simplify).

This is an example of random walk on an undirected network, so we know the stationary probability of each node is proportional to its degree. The degrees are (2, 3, 4, 4, . . . , 4, 4, 3, 2), where there are 29 − 4 = 25 4’s. The sum of these degrees is 110 (coincidentally?). Thus, the stationary distribution is

(2/110, 3/110, 4/110, 4/110, . . . , 4/110, 4/110, 3/110, 2/110),

with 25 4/110’s.


5 Solutions to Stat 110 Final from 2010

1. Calvin and Hobbes play a match consisting of a series of games, where Calvin has probability p of winning each game (independently). They play with a “win by two” rule: the first player to win two games more than his opponent wins the match.

(a) What is the probability that Calvin wins the match (in terms of p)?

Hint: condition on the results of the first k games (for some choice of k).

Let C be the event that Calvin wins the match, X ~ Bin(2, p) be how many of the first 2 games he wins, and q = 1 − p. Then

P(C) = P(C | X = 0)q² + P(C | X = 1)(2pq) + P(C | X = 2)p² = 2pqP(C) + p²,

so P(C) = p²/(1 − 2pq). This can also be written as p²/(p² + q²) since p + q = 1. (Also, the problem can be thought of as gambler’s ruin where each player starts out with $2.)

Miracle check: Note that this should (and does) reduce to 1 for p = 1, 0 for p = 0, and 1/2 for p = 1/2. Also, it makes sense that the probability of Hobbes winning, which is 1 − P(C) = q²/(p² + q²), can also be obtained by swapping p and q.

(b) Find the expected number of games played.

Hint: consider the first two games as a pair, then the next two as a pair, etc.

Think of the first 2 games, the 3rd and 4th, the 5th and 6th, etc. as “mini-matches.” The match ends right after the first mini-match which isn’t a tie. The probability of a mini-match not being a tie is p² + q², so the number of mini-matches needed is 1 plus a Geom(p² + q²) r.v. Thus, the expected number of games is 2/(p² + q²).

Miracle check: For p = 0 or p = 1, this reduces to 2. The expected number of games is maximized when p = 1/2, which makes sense intuitively. Also, it makes sense that the result is symmetric in p and q.
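Both formulas can be spot-checked by simulating matches (Python/NumPy sketch; p = 0.6 is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    p, trials = 0.6, 10**5
    wins = games = 0
    for _ in range(trials):
        diff = 0
        while abs(diff) < 2:                       # play until someone leads by 2
            diff += 1 if rng.random() < p else -1
            games += 1
        wins += (diff == 2)
    q = 1 - p
    print(wins / trials, p**2 / (p**2 + q**2))     # match-win probability
    print(games / trials, 2 / (p**2 + q**2))       # expected number of games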


2. A DNA sequence can be represented as a sequence of letters, where the “alphabet” has 4 letters: A, C, T, G. Suppose such a sequence is generated randomly, where the letters are independent and the probabilities of A, C, T, G are p1, p2, p3, p4 respectively.

(a) In a DNA sequence of length 115, what is the expected number of occurrences of the expression “CATCAT” (in terms of the pj)? (Note that, for example, the expression “CATCATCAT” counts as 2 occurrences.)

Let Ij be the indicator r.v. of “CATCAT” appearing starting at position j, for 1 ≤ j ≤ 110. Then E(Ij) = (p1p2p3)², so the expected number is 110(p1p2p3)².

Miracle check: Stat 115 is the bioinformatics course here and Stat 110 is this course, so 109(p1p2p3)² would have been a much less aesthetically pleasing result (this kind of “off by one” error is extremely common in programming, but is not hard to avoid by doing a quick check). The number of occurrences is between 0 and 110, so the expected value should also be between 0 and 110.

(b) What is the probability that the first A appears earlier than the first C appears, as letters are generated one by one (in terms of the pj)?

Consider the first letter which is an A or a C (call it X; alternatively, condition on the first letter of the sequence). This gives

P(A before C) = P(X is A | X is A or C) = P(X is A)/P(X is A or C) = p1/(p1 + p2).

Miracle check: The answer should be 1/2 for p1 = p2, should go to 0 as p1 → 0, should be increasing in p1 and decreasing in p2, and finding P(C before A) as 1 − P(A before C) should agree with finding it by swapping p1, p2.

(c) For this part, assume that the pj are unknown. Suppose we treat p2 as a Unif(0, 1) r.v. before observing any data, and that then the first 3 letters observed are “CAT”. Given this information, what is the probability that the next letter is C?

Let X be the number of C’s in the data (so X = 1 is observed here). The prior is p2 ~ Beta(1, 1), so the posterior is p2 | X = 1 ~ Beta(2, 3) (by the connection between Beta and Binomial, or by Bayes’ Rule). Given p2, the indicator of the next letter being C is Bern(p2). So given X (but not given p2), the probability of the next letter being C is E(p2 | X) = 2/5.

Miracle check: It makes sense that the answer should be strictly in between 1/2 (the mean of the prior distribution) and 1/3 (the observed frequency of C’s in the data).
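The posterior mean can also be checked by brute force: draw p2 from the prior, generate 3 letters, and keep only the runs matching the pattern C, not-C, not-C (a Python/NumPy sketch of this rejection idea; the likelihood in p2 is p2(1 − p2)² either way, so the A/T distinction doesn’t matter here):

    import numpy as np

    rng = np.random.default_rng(0)
    trials = 10**6
    p2 = rng.uniform(0, 1, trials)               # prior: p2 ~ Unif(0, 1)
    isC = rng.random((trials, 3)) < p2[:, None]  # is each of the 3 letters a C?
    keep = isC[:, 0] & ~isC[:, 1] & ~isC[:, 2]   # observed pattern: C, not C, not C
    print(p2[keep].mean(), 2/5)                  # posterior mean ~ 2/5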


3. Let X and Y be i.i.d. positive random variables. Assume that the various expressions below exist. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where “?” means that no relation holds in general). It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

(a) E(e^{X+Y}) ≥ e^{2E(X)} (write E(e^{X+Y}) = E(e^X e^Y) = E(e^X)E(e^Y) = E(e^X)E(e^X) using the fact that X, Y are i.i.d., and then apply Jensen)

(b) E(X²e^X) ≤ √(E(X⁴)E(e^{2X})) (by Cauchy-Schwarz)

(c) E(X | 3X) = E(X | 2X) (knowing 2X is equivalent to knowing 3X)

(d) E(X⁷Y) = E(X⁷E(Y | X)) (by Adam’s law and taking out what’s known)

(e) E(X/Y + Y/X) ≥ 2 (since E(X/Y) = E(X)E(1/Y) ≥ EX/EY = 1, and similarly E(Y/X) ≥ 1)

(f) P(|X − Y| > 2) ≤ Var(X)/2 (by Chebyshev, applied to the r.v. W = X − Y, which has variance 2Var(X): P(|W − E(W)| > 2) ≤ Var(W)/4 = Var(X)/2)


4. Let X be a discrete r.v. whose distinct possible values are x0, x1, . . ., and let pk = P(X = xk). The entropy of X is defined to be H(X) = −Σ_{k=0}^∞ pk log₂(pk).

(a) Find H(X) for X ~ Geom(p).

Hint: use properties of logs, and interpret part of the sum as an expected value.

H(X) = −Σ_{k=0}^∞ (pq^k) log₂(pq^k)
     = −log₂(p) Σ_{k=0}^∞ pq^k − log₂(q) Σ_{k=0}^∞ kpq^k
     = −log₂(p) − (q/p) log₂(q),

with q = 1 − p, since the first series is the sum of a Geom(p) PMF and the second series is the expected value of a Geom(p) r.v.

Miracle check: entropy must be positive (unless X is a constant), since it is the average “surprise” (where the “surprise” of observing X = xk is −log₂(pk) = log₂(1/pk)).

(b) Find H(X³) for X ~ Geom(p), in terms of H(X).

Let pk = pq^k. Since X³ takes values 0³, 1³, 2³, . . . with probabilities p0, p1, . . . respectively, we have H(X³) = H(X).

Miracle check: The definition of entropy depends on the probabilities pk of the values xk, not on the values xk themselves, so taking a one-to-one function of X should not change the entropy.

(c) Let X and Y be i.i.d. discrete r.v.s. Show that P(X = Y) ≥ 2^{−H(X)}.

Hint: Consider E(log₂(W)), where W is a r.v. taking value pk with probability pk.

Let W be as in the hint. By Jensen, E(log₂(W)) ≤ log₂(EW). But

E(log₂(W)) = Σ_k pk log₂(pk) = −H(X),

EW = Σ_k pk² = P(X = Y),

so −H(X) ≤ log₂ P(X = Y). Thus, P(X = Y) ≥ 2^{−H(X)}.
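For the Geometric case, (a) and (c) can be checked numerically (Python/NumPy sketch; p = 0.3 is illustrative):

    import numpy as np

    p = 0.3
    q = 1 - p
    H = -np.log2(p) - (q / p) * np.log2(q)         # entropy from (a)
    PXY = sum((p * q**k)**2 for k in range(1000))  # P(X = Y) = sum of pk^2
    print(PXY, 2.0**(-H), PXY >= 2.0**(-H))        # the bound from (c) holds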


5. Let Z1, . . . , Zn ~ N(0, 1) be i.i.d.

(a) As a function of Z1, create an Expo(1) r.v. X (your answer can also involve the standard Normal CDF Φ).

Use Z1 to get a Uniform and then the Uniform to get X: we have Φ(Z1) ~ Unif(0, 1), and we can then take X = −ln(1 − Φ(Z1)). By symmetry, we can also use −ln(Φ(Z1)).

Miracle check: 0 < Φ(Z1) < 1, so −ln(Φ(Z1)) is well-defined and positive.

(b) Let Y = e^{−R}, where R = √(Z1² + · · · + Zn²). Write down (but do not evaluate) an integral for E(Y).

Let W = Z1² + · · · + Zn² ~ χ²_n, so Y = e^{−√W}. We will use LOTUS to write E(Y) using the PDF of W (there are other possible ways to use LOTUS here, but this is simplest since we get a single integral and we know the χ²_n PDF). This gives

E(Y) = ∫_0^∞ e^{−√w} · (1/(2^{n/2} Γ(n/2))) w^{n/2 − 1} e^{−w/2} dw.

(c) Let X1 = 3Z1 − 2Z2 and X2 = 4Z1 + 6Z2. Determine whether X1 and X2 are independent (being sure to mention which results you’re using).

They are uncorrelated:

Cov(X1, X2) = 12Var(Z1) + 10Cov(Z1, Z2) − 12Var(Z2) = 0.

Also, (X1, X2) is Multivariate Normal since any linear combination of X1, X2 can be written as a linear combination of Z1, Z2 (and thus is Normal since the sum of two independent Normals is Normal). So X1 and X2 are independent.
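A quick empirical check of the zero covariance (Python/NumPy sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    Z1, Z2 = rng.standard_normal((2, 10**6))
    X1, X2 = 3*Z1 - 2*Z2, 4*Z1 + 6*Z2
    print(np.cov(X1, X2)[0, 1])  # ~ 0, consistent with independence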


6. Let X1, X2, . . . be i.i.d. positive r.v.s. with mean µ, and let Wn = X1/(X1 + · · · + Xn).

(a) Find E(Wn).

Hint: consider X1/(X1 + · · · + Xn) + X2/(X1 + · · · + Xn) + · · · + Xn/(X1 + · · · + Xn).

The expression in the hint equals 1, and by linearity and symmetry its expected value is nE(Wn). So E(Wn) = 1/n.

Miracle check: in the case that the Xj are actually constants, X1/(X1 + · · · + Xn) reduces to 1/n. Also in the case Xj ~ Expo(λ), part (c) shows that the answer should reduce to the mean of a Beta(1, n − 1) (which is 1/n).

(b) What random variable does nWn converge to as n → ∞?

By the Law of Large Numbers, with probability 1 we have

nWn = X1/((X1 + · · · + Xn)/n) → X1/µ as n → ∞.

Miracle check: the answer should be a random variable since it’s asked what r.v. nWn converges to. It should not depend on n since we let n → ∞.

(c) For the case that Xj ~ Expo(λ), find the distribution of Wn, preferably without using calculus. (If it is one of the “important distributions” state its name and specify the parameters; otherwise, give the PDF.)

Recall that X1 ~ Gamma(1) and X2 + · · · + Xn ~ Gamma(n − 1). By the connection between Beta and Gamma (i.e., the bank-post office story), Wn ~ Beta(1, n − 1).

Miracle check: the distribution clearly always takes values between 0 and 1, and the mean should agree with the answer from (a).
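A simulation comparison with the Beta(1, n − 1) moments (Python/NumPy sketch; n = 5 and λ = 2 are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, lam, trials = 5, 2.0, 10**5
    X = rng.exponential(1/lam, (trials, n))
    W = X[:, 0] / X.sum(axis=1)
    print(W.mean(), 1/n)                        # Beta(1, n-1) mean is 1/n
    print(W.var(), (n - 1) / (n**2 * (n + 1)))  # and its variance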


7. A task is randomly assigned to one of two people (with probability 1/2 for each person). If assigned to the first person, the task takes an Expo(λ1) length of time to complete (measured in hours), while if assigned to the second person it takes an Expo(λ2) length of time to complete (independent of how long the first person would have taken). Let T be the time taken to complete the task.

(a) Find the mean and variance of T .

Write T = IX1 + (1 − I)X2, with I ~ Bern(1/2), X1 ~ Expo(λ1), X2 ~ Expo(λ2) independent. Then

E(T) = (1/2)(λ1⁻¹ + λ2⁻¹),

Var(T) = E(Var(T | I)) + Var(E(T | I))
       = E(I²/λ1² + (1 − I)²/λ2²) + Var(I/λ1 + (1 − I)/λ2)
       = E(I/λ1² + (1 − I)/λ2²) + Var(I(1/λ1 − 1/λ2))
       = (1/2)(1/λ1² + 1/λ2²) + (1/4)(1/λ1 − 1/λ2)².

Miracle check: for λ1 = λ2, the two people have the same distribution so randomly assigning the task to one of the two should be equivalent to just assigning it to the first person (so the mean and variance should agree with those of an Expo(λ1) r.v.). It makes sense that the mean is the average of the two means, as we can condition on whether I = 1 (though the variance is greater than the average of the two variances, by Eve’s Law). Also, the results should be (and are) the same if we swap λ1 and λ2.
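A simulation check of the mixture mean and variance (Python/NumPy sketch; λ1 = 1 and λ2 = 3 are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    lam1, lam2, trials = 1.0, 3.0, 10**6
    I = rng.random(trials) < 0.5
    T = np.where(I, rng.exponential(1/lam1, trials), rng.exponential(1/lam2, trials))
    print(T.mean(), 0.5 * (1/lam1 + 1/lam2))
    print(T.var(), 0.5 * (1/lam1**2 + 1/lam2**2) + 0.25 * (1/lam1 - 1/lam2)**2)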

(b) Suppose instead that the task is assigned to both people, and let X be the time taken to complete it (by whoever completes it first, with the two people working independently). It is observed that after 24 hours, the task has not yet been completed. Conditional on this information, what is the expected value of X?

Here X = min(X1, X2) with X1 ~ Expo(λ1), X2 ~ Expo(λ2) independent. Then X ~ Expo(λ1 + λ2) (since P(X > x) = P(X1 > x)P(X2 > x) = e^{−(λ1+λ2)x}, or by results on order statistics). By the memoryless property,

E(X | X > 24) = 24 + 1/(λ1 + λ2).

Miracle check: the answer should be greater than 24 and should be very close to 24 if λ1 or λ2 is very large. Considering a Poisson process also helps make this intuitive.
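The memoryless answer is easy to confirm by conditioning empirically (Python/NumPy sketch; the rates are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    lam1, lam2, trials = 0.05, 0.03, 10**6
    X = np.minimum(rng.exponential(1/lam1, trials), rng.exponential(1/lam2, trials))
    print(X[X > 24].mean(), 24 + 1/(lam1 + lam2))  # both ~ 36.5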


[Diagram: a birth-death chain on states 1, 2, 3, 4, 5, with transition probabilities q12 = 1, q21 = 1/2, q23 = 1/2, q32 = 1/4, q34 = 1/3, q43 = 1/6, q45 = 1/4, q54 = 1/8, and self-loop probabilities 5/12, 7/12, 7/8 at states 3, 4, 5.]

8. Find the stationary distribution of the Markov chain shown above, without using matrices. The number above each arrow is the corresponding transition probability.

We will show that this chain is reversible by solving for s (which will work out nicely since this is a birth-death chain). Let qij be the transition probability from i to j, and solve for s in terms of s1. Noting that qij = 2qji for j = i + 1 (when 1 ≤ i ≤ 4), we have that

s1q12 = s2q21 gives s2 = 2s1.

s2q23 = s3q32 gives s3 = 2s2 = 4s1.

s3q34 = s4q43 gives s4 = 2s3 = 8s1.

s4q45 = s5q54 gives s5 = 2s4 = 16s1.

The other reversibility equations are automatically satisfied since here qij = 0 unless |i − j| ≤ 1. Normalizing, the stationary distribution is

(1/31, 2/31, 4/31, 8/31, 16/31).

Miracle check: this chain “likes” going from left to right more than from right to left, so the stationary probabilities should be increasing from left to right. We also know that sj = Σ_i si qij (since if the chain is in the stationary distribution at time n, then it is also in the stationary distribution at time n + 1), so we can check, for example, that s1 = Σ_i si qi1 = (1/2)s2.
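One more check: with the transition probabilities read off the diagram (including the self-loops at states 3, 4, 5), sQ = s holds exactly (Python/NumPy sketch; the matrix below is a transcription of the diagram):

    import numpy as np

    # Transition matrix for states 1..5, as read off the diagram.
    Q = np.array([
        [0,   1,    0,     0,    0  ],
        [1/2, 0,    1/2,   0,    0  ],
        [0,   1/4,  5/12,  1/3,  0  ],
        [0,   0,    1/6,   7/12, 1/4],
        [0,   0,    0,     1/8,  7/8],
    ])
    s = np.array([1, 2, 4, 8, 16]) / 31
    print(np.allclose(s @ Q, s))  # True: s is stationary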
