
MATH2901 Revision Seminar Part I: Probability Theory

Jeffrey Yang

UNSW Society of Statistics/UNSW Mathematics Society

August 2020

Jeffrey Yang (UNSW Society of Statistics/UNSW Mathematics Society)MATH2901 Revision Seminar August 2020 1 / 116

Note

The theoretical content was mostly sourced from Libo’s notes.

Most of the examples were taken from the slides Rui Tong (2018 StatSocTeam) made for that year’s revision seminar.

All examples are presented at the end so that it isn’t as obvious which techniques/methods should be used.

It is recommended that you refer to the official lecture notes when quoting definitions/results.


Table of Contents

1 Basic Probability Theory

2 Review of Random Variables

3 Computations for Common Distributions

4 Inequalities

5 Limit of Random Variables

6 Tips and Assorted Examples


Basic Probability Theory

1 Axioms and Basic Results

2 Conditional Probability and Independence

3 Basic Laws

4 Bayes Formula


Introduction to Probability Theory

Probability theory is about modelling and analysing random experiments. We can do this mathematically by specifying an appropriate probability space consisting of three components:

1 a sample space, Ω, which is the set of all possible outcomes.

2 an event space, F, which is the set of all events "of interest". Here, we can understand F as being a set of subsets of Ω which satisfies certain conditions.

3 a probability function, P : F → [0, 1], which assigns each event in the event space a probability.

Of course, for the model to be meaningful, P should satisfy certain axioms.


Axioms

Definition (Probability Space)

A probability space is the triple (Ω,F ,P). Here, P satisfies the axioms

1 P(A) ≥ 0 for all A ∈ F

2 P(Ω) = 1

3 $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$ for mutually exclusive events A1, A2, ... ∈ F

Exercise for the reader: Prove the axioms.


Elementary Results

From the axioms, we are able to derive the following fundamental results:

1 If A1, A2, ..., Ak are mutually exclusive,

$$P\left(\bigcup_{i=1}^{k} A_i\right) = \sum_{i=1}^{k} P(A_i).$$

2 P(∅) = 0.

3 For any A ⊆ Ω, 0 ≤ P(A) ≤ 1 and P(Aᶜ) = 1 − P(A), where Aᶜ denotes the complement of A.

4 If B ⊂ A, then P(B) ≤ P(A). Hence, if the occurrence of B implies the occurrence of A, then P(B) ≤ P(A).

These results can be used without proof.


Conditional Probability

Definition (Conditional Probability)

The conditional probability that an event A occurs, given that an event B has occurred, is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Here, we require P(B) ≠ 0.

Given that B has occurred, the total probability for the possible results of an experiment equals P(B). Visually, we see that the only outcomes in A that are now possible are those in A ∩ B.


Independence

Definition (Independent Events)

Events A and B are independent if P(A ∩ B) = P(A)P(B).

Recall that for any two events A and B with P(B) > 0, we have P(A ∩ B) = P(A|B)P(B). Thus, A and B are independent if and only if P(A|B) = P(A) (or equivalently, P(B|A) = P(B), provided P(A) > 0). Intuitively, this means that knowing event A has occurred does not give us any information on the probability of event B occurring (and vice versa).


Independence

Definition (Pairwise Independent Events)

For a countable sequence of events Ai, the events are pairwise independent if

P(Ai ∩ Aj) = P(Ai)P(Aj)

for all i ≠ j.

Definition ((Mutually) Independent Events)

For a countable sequence of events Ai, the events are (mutually) independent if for any finite collection Ai1, Ai2, ..., Ain, we have

P(Ai1 ∩ ... ∩ Ain) = P(Ai1)...P(Ain).

Independence implies pairwise independence but not vice versa. Can you think of an example where pairwise independence does not imply independence?
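One standard answer to the question above can be checked by brute force. The sketch below is illustrative only and is not taken from the lecture notes: it assumes two fair coin tosses and the events A (first toss is heads), B (second toss is heads) and C (both tosses agree); the helper name `prob` is made up for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin tosses, each outcome has probability 1/4.
omega = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "H"}   # first toss is heads
B = {w for w in omega if w[1] == "H"}   # second toss is heads
C = {w for w in omega if w[0] == w[1]}  # both tosses agree

# Every pair satisfies P(E ∩ F) = P(E)P(F) ...
pairwise = all(prob(E & F) == prob(E) * prob(F)
               for E, F in [(A, B), (A, C), (B, C)])

# ... but the triple does not: P(A ∩ B ∩ C) = 1/4 while P(A)P(B)P(C) = 1/8.
mutual = prob(A & B & C) == prob(A) * prob(B) * prob(C)

print(pairwise, mutual)
```

Here the three events are pairwise independent, yet knowing any two of them determines the third, so they are not mutually independent.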


The Multiplicative Law

Definition (The Multiplicative Law)

For events A1,A2, we have

P(A1 ∩ A2) = P(A2|A1)P(A1).

For events A1,A2,A3, we have

P(A1 ∩ A2 ∩ A3) = P(A3 ∩ A2 ∩ A1)

= P(A3|A2 ∩ A1)P(A2 ∩ A1)

= P(A3|A1 ∩ A2)P(A2|A1)P(A1).

We trust that the reader is able to generalise this to cases involving a greater number of events.

The Multiplicative Law is highly useful when dealing with a sequence of dependent trials.


The Additive Law

Definition (The Additive Law)

For events A and B, we have

P(A ∪ B) = P(A) + P(B)− P(A ∩ B).

You might have noticed that the Additive Law resembles the inclusion-exclusion principle. The following quote from Wikipedia sheds some light on this:

"As finite probabilities are computed as counts relative to the cardinality of the probability space, the formulas for the principle of inclusion–exclusion remain valid when the cardinalities of the sets are replaced by finite probabilities."

https://en.wikipedia.org/wiki/Inclusion-exclusion_principle


The Law of Total Probability

Definition (The Law of Total Probability)

Suppose A1, A2, ..., Ak are mutually exclusive (Ai ∩ Aj = ∅ for all i ≠ j) and exhaustive ($\bigcup_{i=1}^{k} A_i = \Omega$, the sample space); that is, A1, ..., Ak form a partition of Ω. Then, for any event B, we have

$$P(B) = \sum_{i=1}^{k} P(B \mid A_i) P(A_i).$$

The Law of Total Probability relates marginal probabilities to conditional probabilities and is often used in calculations involving Bayes’ Formula.


Bayes’ Formula/Bayes’ Theorem/Bayes’ Law

Definition (Bayes’ Theorem)

For a partition A1, A2, ..., Ak and an event B,

$$P(A_j \mid B) = \frac{P(B \mid A_j) P(A_j)}{\sum_{i=1}^{k} P(B \mid A_i) P(A_i)} = \frac{P(B \mid A_j) P(A_j)}{P(B)}.$$

In essence, Bayes’ Theorem allows us to reverse the order of conditioning provided that we know the marginal probabilities of event A and event B. When dealing with problems involving Bayes’ Theorem, it is recommended that one draws a tree diagram.

Here, we shall adopt the frequentist interpretation of probability, where probability measures a proportion of outcomes, as opposed to the Bayesian interpretation, where probability measures a degree of belief.






Review of Random Variables

1 Discrete Random Variables

2 Continuous Random Variables

3 Cumulative Distribution Function

4 Expectation and Moments


Discrete Random Variables

Definition (Discrete Random Variable)

The random variable X is discrete if there are countably many values x for which P(X = x) > 0.

The probability structure of X is typically described by its probability (mass) function.

Definition (Probability (Mass) Function)

The probability function of the discrete random variable X is given by

fX(x) = P(X = x).

The following two properties are important and apply to all discrete random variables:

1 fX(x) ≥ 0 for all x ∈ R

2 $\sum_{\text{all } x} f_X(x) = 1$


Continuous Random Variables

A continuous random variable has a continuum of possible values. Continuous random variables do not have a probability (mass) function, but have the analogous (probability) density function.

Definition ((Probability) Density Function)

The density function of a continuous random variable is a real-valued function fX on R with the property

$$\int_A f_X(x)\,dx = P(X \in A)$$

for any (measurable) set A ⊆ R.

The following two properties are important and apply to all continuous random variables:

1 fX(x) ≥ 0 for all x ∈ R.

2 $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.


Cumulative Distribution Function (cdf)

Definition (Cumulative Distribution Function)

The cumulative distribution function (cdf) of a random variable X is defined by

FX(x) = P(X ≤ x).

Here, X can be either continuous or discrete.

In the case where X is a continuous random variable, we have the following important results:

1 $F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$.

2 $f_X(x) = F_X'(x)$.

3 $P(a \le X \le b) = \int_a^b f_X(x)\,dx$ = area under fX between a and b.

Thus, if we know one of FX or fX, we are able to derive the other. Moreover, once we know these, we are able to derive any probability/property of X.


Important Remarks on Continuous Random Variables

Suppose X is a continuous random variable. Then we have

P(X = a) = 0 for any a ∈ R.

Hence, it is only meaningful to talk about the probability of X lying in some subinterval(s) of R. Consequently, we have

P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b).

Thus, we typically do not have to worry about whether an interval contains its boundary points or not.

This is NOT the case for discrete random variables.


Expectation

Definition (Expected Value of a Discrete Random Variable)

The expected value or mean of a discrete random variable X is

$$E X = E[X] \overset{\text{def}}{=} \sum_{\text{all } x} x \cdot P(X = x) = \sum_{\text{all } x} x \cdot f_X(x),$$

where fX is the probability function of X.

Definition (Expected Value of a Continuous Random Variable)

The expected value or mean of a continuous random variable X is

$$E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx,$$

where fX is the density function of X.

In either case, E[X] can be interpreted as the long run average of X.

Expectation of Transformed Random Variables

Often, we are interested in a transformation of a random variable. In particular, we often examine the rth moment of X about some constant a, E[(X − a)^r]. The following results provide a method of calculating the expectation of a transformation of a random variable.

Result (Transformation of a Discrete Random Variable)

Let X be a discrete random variable and let g be a function of X. Then

$$E g(X) = E[g(X)] = \sum_{\text{all } x} g(x) \cdot f_X(x).$$

Result (Transformation of a Continuous Random Variable)

Let X be a continuous random variable and let g be a function of X. Then

$$E g(X) = E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx.$$


Properties of Expectation

Result (Linearity of Expectation)

Let X ,Y be random variables and a, b be constants. Then

E[aX + bY ] = aE[X ] + bE[Y ].

Result (Expected Value of a Constant)

For a constant c ,E[c] = c .

In general, for dependent random variables X, Y,

E[XY] ≠ E[X]E[Y].

Also, if g is a transformation of X, then typically,

E[g(X)] ≠ g(E[X]).


Variance and Standard Deviation

Definition (Variance)

Let µ = E[X ]. Then the variance of X , denoted Var[X ], is defined as

Var[X] = E[(X − µ)²].

Observe that this is the second moment of X about µ.

Definition (Standard Deviation)

The standard deviation of X is the square root of its variance:

$$\sigma = \sqrt{\operatorname{Var}[X]}.$$

Both variance and standard deviation are measures of the spread of a random variable. Standard deviations are in the same unit as X and thus can be more readily interpreted. However, variances are easier to work with theoretically.


Properties of Variance

Result (Alternative Formula for Variance)

Let X be a random variable and let µ = E[X ]. Then

Var[X] = E[X²] − µ².

The variance will often be calculated using this formula.

Result (Nonlinearity of Variance)

Let X be a random variable and let a be a constant. Then

Var[X + a] = Var[X]

Var[aX] = a² Var[X].

Pay attention to the nonlinearity of variance. A common mistake is treating variance as being linear.
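The two variance rules can be sanity-checked with exact arithmetic. The following is a minimal sketch, assuming a fair six-sided die as the random variable; the helper names `expect` and `variance` are illustrative and not from the notes.

```python
from fractions import Fraction

# Probability function of a fair six-sided die (an illustrative choice).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(g, pmf):
    """E[g(X)] = sum over x of g(x) * f_X(x) for a discrete X."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(g, pmf):
    """Var[g(X)] via the alternative formula E[g(X)^2] - (E[g(X)])^2."""
    return expect(lambda x: g(x) ** 2, pmf) - expect(g, pmf) ** 2

var_x = variance(lambda x: x, pmf)            # Var[X]
var_shift = variance(lambda x: x + 10, pmf)   # Var[X + 10]
var_scale = variance(lambda x: 3 * x, pmf)    # Var[3X]

print(var_x, var_shift, var_scale)
```

Shifting by a constant leaves the variance unchanged, while scaling by 3 multiplies it by 9 rather than 3.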


Moment Generating Functions

Definition (Moment Generating Function)

The moment generating function (mgf) of a random variable X is

$$m_X(u) = E[e^{uX}].$$

We say that the moment generating function of X exists if mX(u) is finite in some interval containing zero.

The following result regarding the rth moment of X, E[X^r], shows why the moment generating function is called as such.

Result (rth Moment of a random variable)

Let X be a random variable. Then in general, we have

$$E[X^r] = m_X^{(r)}(0)$$

for r = 0, 1, 2, ....
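As a worked illustration of the moment result, consider the exponential density with parameter β that is defined later in these notes. The mgf below is computed directly from the definition; treat the derivation as a sketch for checking your own working rather than as an official solution.

```latex
\begin{align*}
m_X(u) &= E[e^{uX}] = \int_0^{\infty} e^{ux} \cdot \frac{1}{\beta} e^{-x/\beta}\,dx
        = \frac{1}{1 - \beta u}, \qquad u < \frac{1}{\beta}, \\
m_X'(u) &= \frac{\beta}{(1 - \beta u)^2}
        \quad\Rightarrow\quad E[X] = m_X'(0) = \beta, \\
m_X''(u) &= \frac{2\beta^2}{(1 - \beta u)^3}
        \quad\Rightarrow\quad E[X^2] = m_X''(0) = 2\beta^2, \\
\operatorname{Var}[X] &= E[X^2] - (E[X])^2 = 2\beta^2 - \beta^2 = \beta^2.
\end{align*}
```

The mgf is finite on an interval containing zero, so it exists in the sense defined above, and the resulting mean and variance agree with the exponential distribution results.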


Properties of Moment Generating Functions

Result (Uniqueness)

Let X and Y be two random variables all of whose moments exist. If

mX(u) = mY(u)

for all u in a neighbourhood of 0 (i.e., for all |u| < ε for some ε > 0), then

FX(x) = FY(x) for all x ∈ R.

That is, the moment generating function of a random variable uniquely determines its distribution.


Properties of Moment Generating Functions

Result (Convergence)

Let {Xn : n = 1, 2, ...} be a sequence of random variables, each with moment generating function mXn(u). Furthermore, suppose that

$$\lim_{n \to \infty} m_{X_n}(u) = m_X(u) \quad \text{for all } u \text{ in a neighbourhood of } 0$$

and mX(u) is a moment generating function of a random variable X. Then

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x) \quad \text{for all } x \text{ at which } F_X \text{ is continuous.}$$

That is, convergence of moment generating functions implies convergence of cumulative distribution functions.


Location and Scale Families of Densities

Result (Location Family of Densities)

A location family of densities based on the random variable U is the family of densities fX(x) where X = U + c for all possible c. Here, fX(x) is given by:

fX(x) = fU(x − c).

Result (Scale Family of Densities)

A scale family of densities based on the random variable U is the family of densities fX(x) where X = cU for all possible c > 0. Here, fX(x) is given by:

$$f_X(x) = c^{-1} f_U\left(\frac{x}{c}\right).$$

In essence, the density function of a continuous random variable may belong to a family of density functions that all have a similar form. This relates to the concept of parameters in statistics.

Proving the above results is not difficult and is a good exercise.






Computations for Common Distributions

1 Common distributions summary

2 Computing the MGF of random variables

3 Computing the moments directly

4 Computing the distribution of simple transformed random variables

5 Bivariate density computations

6 Computing the sum of random variables using the convolution formula, bivariate transformation and MGFs

7 Identifying distributions from the MGF


Bernoulli Distribution

Definition (Bernoulli Distribution)

For a Bernoulli trial, define the random variable

$$X = \begin{cases} 1 & \text{if the trial results in success} \\ 0 & \text{otherwise.} \end{cases}$$

Then X is said to have a Bernoulli distribution.

Result (Probability Function of X)

If X is a Bernoulli random variable defined according to a Bernoulli trial with success probability 0 < p < 1 then the probability function of X is

$$f_X(x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0. \end{cases}$$

An equivalent way of writing this is $f_X(x) = p^x (1 - p)^{1 - x}$, x = 0, 1.


Binomial Distribution

Definition (Binomial Random Variable)

Consider a sequence of n independent Bernoulli trials, each with success probability p. If

X = total number of successes

then X is a Binomial random variable with parameters n and p. A common shorthand is

X ∼ Bin(n, p).

Here, "∼" has the meaning "is distributed as" or "has distribution".

Results

If X ∼ Bin(n, p) then

1 $f_X(x) = \binom{n}{x} p^x (1 - p)^{n - x}$, x = 0, ..., n,

2 E[X ] = np,

3 Var(X ) = np(1− p).
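The Binomial results can be verified numerically from the probability function alone. This is a small sketch with illustrative parameter values (n = 10, p = 3/10); the function name `binomial_pmf` is an assumption of this example.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Return {x: P(X = x)} for X ~ Bin(n, p)."""
    return {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

n, p = 10, Fraction(3, 10)   # illustrative parameter values
pmf = binomial_pmf(n, p)

total = sum(pmf.values())                              # should equal 1
mean = sum(x * f for x, f in pmf.items())              # should equal np
var = sum(x**2 * f for x, f in pmf.items()) - mean**2  # should equal np(1 - p)

print(total, mean, var)
```

Using `Fraction` keeps the arithmetic exact, so the comparison with np and np(1 − p) involves no rounding error.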


Geometric Distribution

Definition (Geometric Distribution)

If X = number of trials until first success,

then X is said to have a geometric distribution with parameter p (the probability of success on each trial).

Results

If X ∼ Geom(p), then

1 $f_X(x; p) = p(1 - p)^{x - 1}$, x = 1, 2, ...

2 $E[X] = \frac{1}{p}$,

3 $\operatorname{Var}(X) = \frac{1 - p}{p^2}$.


Poisson Distribution

Definition (Poisson Distribution)

The random variable X has a Poisson distribution with parameter λ > 0 if its probability function is

$$f_X(x; \lambda) = P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$$

A common abbreviation is X ∼ Poisson(λ).

The Poisson distribution is a model of the occurrence of point events in a continuum. The number of points occurring in a time interval of length t is a random variable with a Poisson(λt) distribution, where λ is the rate of events per unit time.

Results

If X ∼ Poisson(λ)

1 E(X) = λ,

2 Var(X) = λ.

Exponential Distribution

The exponential distribution is useful for describing the probability structure of positive random variables. The exponential distribution is closely related to the Poisson distribution; the time until the next event has an exponential distribution with parameter β = 1/λ.

Definition

A random variable X is said to have an exponential distribution with parameter β > 0 if X has density function:

$$f_X(x; \beta) = \frac{1}{\beta} e^{-x/\beta}, \quad x > 0.$$

Results

1 E(X) = β,

2 Var(X) = β²,

3 Memoryless property: P(X > s + t|X > s) = P(X > t).
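The memoryless property follows in a few lines from the density. The derivation below is a sketch for s, t ≥ 0, using only the exponential density given above and the definition of conditional probability.

```latex
\begin{align*}
P(X > x) &= \int_x^{\infty} \frac{1}{\beta} e^{-u/\beta}\,du = e^{-x/\beta}, \qquad x \ge 0, \\
P(X > s + t \mid X > s)
  &= \frac{P(X > s + t,\; X > s)}{P(X > s)}
   = \frac{P(X > s + t)}{P(X > s)} \\
  &= \frac{e^{-(s + t)/\beta}}{e^{-s/\beta}}
   = e^{-t/\beta}
   = P(X > t).
\end{align*}
```

The second equality uses the fact that the event {X > s + t} is contained in {X > s}.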


Uniform Distribution

Definition (Uniform Distribution)

A continuous random variable X that can take values in the interval (a, b) with equal likelihood is said to have a uniform distribution on (a, b). A common shorthand is

X ∼ Unif(a, b).

Results

If X ∼ Unif(a, b), then

1 $f_X(x; a, b) = \frac{1}{b - a}$, a < x < b,

2 $E(X) = \frac{a + b}{2}$,

3 $\operatorname{Var}(X) = \frac{(b - a)^2}{12}$.


Gamma Function

The Gamma function extends the factorial function to the positive real numbers.

Definition (Gamma Function)

The Gamma function at x > 0 is given by

$$\Gamma(x) = \int_0^{\infty} t^{x - 1} e^{-t}\,dt.$$

Results

1 Γ(x) = (x − 1)Γ(x − 1) for x > 1,

2 Γ(n) = (n − 1)!, n = 1, 2, 3, ...

3 $\Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}$,

4 $\int_0^{\infty} x^m e^{-x}\,dx = m!$ for m = 0, 1, 2, ....


Beta Function

Definition (Beta Function)

The Beta function at x, y > 0 is given by

$$B(x, y) = \int_0^1 t^{x - 1} (1 - t)^{y - 1}\,dt.$$

Result

For all x, y > 0,

$$B(x, y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)}.$$


Phi Function

Definition (Φ)

For all x ∈ R,

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt.$$

We cannot simplify the above expression for Φ(x) as there is no closed-form anti-derivative.

Φ gives the cumulative distribution function of the standard normal distribution.

Results

1 $\lim_{x \to -\infty} \Phi(x) = 0$,

2 $\lim_{x \to \infty} \Phi(x) = 1$,

3 Φ(0) = 1/2,

4 Φ is monotonically increasing over R.


Normal Distribution

Definition (Normal Distribution)

The random variable X is said to have a normal distribution with parameters µ and σ² (where −∞ < µ < ∞ and σ² > 0) if X has density function

$$f_X(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty.$$

A common shorthand is X ∼ N(µ, σ²).

Results

If X ∼ N(µ, σ²),

1 E(X) = µ,

2 Var(X) = σ².


Computing Normal Distribution Probabilities

Result

If Z ∼ N(0, 1) then

P(Z ≤ x) = FZ (x) = Φ(x).

In other words, the Φ function is the cumulative distribution function of the N(0, 1) random variable.

Result

If X ∼ N(µ, σ²) then

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1).$$
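In practice Φ is read from tables or evaluated numerically. The sketch below uses the standard identity Φ(x) = (1 + erf(x/√2))/2 together with the standardisation result above; the function names and the N(100, 15²) example are illustrative assumptions, not part of the notes.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cdf, using Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), obtained by standardising."""
    return phi((x - mu) / sigma)

def normal_interval(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2)."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

# Illustrative values: X ~ N(100, 15^2), probability of being within one sd.
print(normal_interval(85, 115, mu=100, sigma=15))
```

Note that `normal_cdf` takes the standard deviation σ, not the variance σ², which is a common source of errors when standardising.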


Gamma Distribution

Definition (Gamma Distribution)

A random variable X is said to have a Gamma distribution with parameters α and β (where α, β > 0) if X has density function:

$$f_X(x; \alpha, \beta) = \frac{e^{-x/\beta} x^{\alpha - 1}}{\Gamma(\alpha) \beta^{\alpha}}, \quad x > 0.$$

A common shorthand is:

X ∼ Gamma(α, β).

Results

If X ∼ Gamma(α, β) then

1 E(X) = αβ,

2 Var(X) = αβ².

Moreover, Y has an Exponential distribution with parameter β iff Y ∼ Gamma(1, β).

Beta Distribution

Definition (Beta Distribution)

A random variable X is said to have a Beta distribution with parameters α, β > 0 if its density function is

$$f_X(x; \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}, \quad 0 < x < 1.$$

The Beta distribution generalises the Unif(0, 1) distribution, which can be thought of as a Beta distribution with α = β = 1.

Results

If X has a Beta distribution with parameters α and β, then

1 $E(X) = \frac{\alpha}{\alpha + \beta}$,

2 $\operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta + 1)(\alpha + \beta)^2}$.


Joint Probability Function and Density Function

Definition (Joint Probability Function)

If X and Y are discrete random variables, then the joint probability function of X and Y is

fX,Y(x, y) = P(X = x, Y = y),

the probability that X = x and Y = y.

Definition (Joint Density Function)

The joint density function of continuous random variables X and Y is a bivariate function fX,Y with the property

$$\iint_A f_{X,Y}(x, y)\,dx\,dy = P((X, Y) \in A)$$

for any (measurable) subset A of R².


Joint Cumulative Distribution Function

Definition (Joint Cumulative Distribution Function)

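The joint cdf of X and Y is

$$F_{X,Y}(x, y) = P(X \le x, Y \le y) = \begin{cases} \sum_{u \le x} \sum_{v \le y} P(X = u, Y = v) & (X, Y \text{ discrete}) \\ \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u, v)\,du\,dv & (X, Y \text{ continuous}). \end{cases}$$

Result

1 If X and Y are discrete random variables, then $\sum_{\text{all } x} \sum_{\text{all } y} f_{X,Y}(x, y) = 1$.

2 If X and Y are continuous random variables, then $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$.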


Expectation of Joint Functions

Result (Expectation of Joint Functions)

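If g is any function of X and Y,

$$E[g(X, Y)] = \begin{cases} \sum_{\text{all } x} \sum_{\text{all } y} g(x, y) P(X = x, Y = y) & \text{(discrete)} \\ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{X,Y}(x, y)\,dx\,dy & \text{(continuous).} \end{cases}$$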


Marginal Probability Function

Result (Marginal Probability Function)

If X and Y are discrete, then fX(x) and fY(y) can be calculated from fX,Y(x, y) as follows:

$$f_X(x) = \sum_{\text{all } y} f_{X,Y}(x, y)$$

$$f_Y(y) = \sum_{\text{all } x} f_{X,Y}(x, y).$$

fX(x) is sometimes referred to as the marginal probability function of X.


Marginal Density Function

Result (Marginal Density Function)

If X and Y are continuous, then fX(x) and fY(y) can be calculated from fX,Y(x, y) as follows:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$$

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx.$$

fX(x) is sometimes referred to as the marginal density function of X.


Conditional Probability Function

Definition (Conditional Probability Function)

If X and Y are discrete, the conditional probability function of X given Y = y is

$$f_{X|Y}(x \mid y) = P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$

Similarly,

$$f_{Y|X}(y \mid x) = P(Y = y \mid X = x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$

Result

$$P(Y \in A \mid X = x) = \sum_{y \in A} f_{Y|X}(y \mid x).$$


Conditional Density Function

Definition (Conditional Density Function)

If X and Y are continuous, the conditional density function of X given Y = y is

$$f_{X|Y}(x \mid Y = y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$

Similarly,

$$f_{Y|X}(y \mid X = x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$

Shorthand: fY|X(y|x).

Result

$$P(a \le Y \le b \mid X = x) = \int_a^b f_{Y|X}(y \mid x)\,dy.$$


Conditional Expected Value and Variance

Result (Conditional Expected Value)

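The conditional expected value of X given Y = y is

$$E(X \mid Y = y) = \begin{cases} \sum_{\text{all } x} x\,P(X = x \mid Y = y) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} x\,f_{X|Y}(x \mid y)\,dx & \text{if } X \text{ is continuous.} \end{cases}$$

Result (Conditional Variance)

The conditional variance of X given Y = y is

$$\operatorname{Var}(X \mid Y = y) = E(X^2 \mid Y = y) - [E(X \mid Y = y)]^2$$

where

$$E(X^2 \mid Y = y) = \begin{cases} \sum_{\text{all } x} x^2\,P(X = x \mid Y = y) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} x^2\,f_{X|Y}(x \mid y)\,dx & \text{if } X \text{ is continuous.} \end{cases}$$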


Independent Random Variables

Definition (Independent)

Random variables X and Y are independent if and only if for all x , y

fX ,Y (x , y) = fX (x)fY (y).

Result

Random variables X and Y are independent if and only if for all x , y

fY|X(y|x) = fY(y).

Result

If X and Y are independent

FX ,Y (x , y) = FX (x) · FY (y).


Covariance

Definition (Covariance)

The covariance of X and Y is

Cov(X, Y) = E[(X − µX)(Y − µY)].

Cov(X, Y) measures how much X and Y vary about their means and also how much they vary together linearly.

Results

1 Cov(X, X) = Var(X)

2 Cov(X, Y) = E(XY) − µXµY

3 If X and Y are independent, then Cov(X, Y) = 0.


Correlation

Definition (Correlation)

The correlation between X and Y is

$$\operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X) \cdot \operatorname{Var}(Y)}}.$$

Corr(X, Y) measures the strength of the linear association between X and Y. Independent random variables are uncorrelated, but uncorrelated variables are not necessarily independent (consider the case where X is symmetric about 0, so that E(X) = E(X³) = 0, and Y = X²).

Results

1 |Corr(X, Y)| ≤ 1

2 |Corr(X, Y)| = 1 if and only if P(Y = a + bX) = 1 for some constants a, b with b ≠ 0.
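The "uncorrelated but dependent" case can be made concrete with a three-point distribution. This sketch assumes X is uniform on {−1, 0, 1} and Y = X²; that specific choice of X is an illustration, not something fixed by the notes.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} (symmetric about 0) and Y = X^2.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expect(g):
    """E[g(X, Y)] computed from the joint probability function."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

# Cov(X, Y) = E(XY) - E(X)E(Y)
cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Marginal probabilities at the point (0, 0), for the independence check.
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
p_joint00 = joint.get((0, 0), Fraction(0))

print(cov, p_joint00, p_x0 * p_y0)
```

The covariance is zero, yet P(X = 0, Y = 0) differs from P(X = 0)P(Y = 0), so X and Y are not independent.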


Transformations

Result

For discrete X and Y = h(X),

$$f_Y(y) = P(Y = y) = P[h(X) = y] = \sum_{x : h(x) = y} P(X = x).$$

Result

For continuous X and Y = h(X), if h is monotonic over the set {x : fX(x) > 0} then

$$f_Y(y) = f_X(x) \left|\frac{dx}{dy}\right| = f_X(h^{-1}(y)) \left|\frac{dx}{dy}\right|$$

for y such that fX(h⁻¹(y)) > 0, where x = h⁻¹(y).


Linear Transformation

Result

For a continuous random variable X, if Y = aX + b is a linear transformation of X with a ≠ 0, then

$$f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y - b}{a}\right)$$

for all y such that $f_X\left(\frac{y - b}{a}\right) > 0$.


Leading into Bivariate Transformations...

If X and Y have joint density fX,Y(x, y) and U is a function of X and Y, we can find the density of U by calculating FU(u) = P(U ≤ u) and differentiating.


Bivariate Transformations

Result

If U and V are functions of continuous random variables X and Y , then

fU,V (u, v) = fX ,Y (x , y) · |J|

where

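$$J = \begin{vmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{vmatrix}$$

is a determinant called the Jacobian of the transformation.

The full specification of fU,V(u, v) requires that the range of (u, v) values corresponding to those (x, y) for which fX,Y(x, y) > 0 is determined.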


Bivariate Transformations

To find fU(u) by bivariate transformation:

1 Define some bivariate transformation (U, V).

2 Find fU,V(u, v).

3 We want the marginal distribution of U. So now find $f_U(u) = \int_{-\infty}^{\infty} f_{U,V}(u, v)\,dv$.

Using a bivariate transformation to find the distribution of U is often more convenient than deriving it via the cumulative distribution function. Using the cdf requires double integration, which we can avoid when we use a bivariate transformation.


Sum of Independent Random Variables - Probability Function/Density Function Approach

Result (Discrete Convolution Formula)

Suppose that X and Y are independent random variables taking only non-negative integer values, and let Z = X + Y. Then

$$f_Z(z) = \sum_{y=0}^{z} f_X(z - y) f_Y(y), \quad z = 0, 1, \ldots$$

Result (Continuous Convolution Formula)

Suppose X and Y are independent continuous random variables with X ∼ fX(x) and Y ∼ fY(y). Then Z = X + Y has density

$$f_Z(z) = \int_{\text{all possible } y} f_X(z - y) f_Y(y)\,dy.$$
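The discrete convolution formula is easy to check numerically. The sketch below applies it to two independent Poisson variables and compares the result with the Poisson(λ1 + λ2) probability function; that the sum of independent Poissons is again Poisson is a standard fact not stated in these notes, and the rates and function names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

def convolve_at(z, f_x, f_y):
    """f_Z(z) = sum_{y=0}^{z} f_X(z - y) f_Y(y) for non-negative integer X, Y."""
    return sum(f_x(z - y) * f_y(y) for y in range(z + 1))

lam1, lam2 = 2.0, 3.5   # illustrative rates

for z in range(6):
    via_convolution = convolve_at(z,
                                  lambda k: poisson_pmf(k, lam1),
                                  lambda k: poisson_pmf(k, lam2))
    direct = poisson_pmf(z, lam1 + lam2)
    print(z, round(via_convolution, 6), round(direct, 6))
```

The two columns agree up to floating-point error, which is what the convolution formula predicts for this pair of distributions.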


Sum of Gammas

Result

If X1, X2, ..., Xn are independent with Xi ∼ Gamma(αi, β), then

$$\sum_{i=1}^{n} X_i \sim \operatorname{Gamma}\left(\sum_{i=1}^{n} \alpha_i, \beta\right).$$


Sum of Independent Random Variables - Moment Generating Function Approach

Result

Suppose that X and Y are independent random variables with moment generating functions mX and mY. Then

mX+Y (u) = mX (u)mY (u).

This generalises to n independent random variables.

Result

If $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are independent then

$$X + Y \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2).$$

This also generalises.
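The sum-of-normals result can be derived with the mgf approach. The sketch below assumes the standard normal mgf, $m_X(u) = \exp(\mu u + \sigma^2 u^2 / 2)$ for $X \sim N(\mu, \sigma^2)$, which is a well-known formula that these notes do not state explicitly.

```latex
\begin{align*}
m_{X+Y}(u) &= m_X(u)\, m_Y(u) \\
  &= \exp\!\left(\mu_X u + \tfrac{1}{2}\sigma_X^2 u^2\right)
     \exp\!\left(\mu_Y u + \tfrac{1}{2}\sigma_Y^2 u^2\right) \\
  &= \exp\!\left((\mu_X + \mu_Y) u + \tfrac{1}{2}(\sigma_X^2 + \sigma_Y^2) u^2\right).
\end{align*}
```

The last line is the mgf of a $N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$ random variable, so the uniqueness property of moment generating functions identifies the distribution of X + Y.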






Inequalities

1 Jensen’s Inequality

2 Markov’s inequality

3 Chebyshev’s inequality


Jensen’s Inequality

Result (Jensen’s Inequality)

If h is a convex function (i.e., concave up) and X is a random variable, then

E[h(X )] ≥ h(E[X ]).

There are several formulations of Jensen’s inequality and the above formulation is the one most relevant for probability theory. Students studying MATH2701 next term will be delighted to re-encounter Jensen’s inequality albeit in a different formulation.


Markov’s Inequality

Result (Markov’s Inequality)

Let X be a nonnegative random variable and a > 0. Then

$$P(X \ge a) \le \frac{E[X]}{a}.$$

That is, the probability that X is at least a is at most the expectation of X divided by a.


Chebychev’s Inequality

Result (Chebychev’s Inequality)

If X is any random variable with E[X] = µ, Var(X) = σ² then, for any k > 0,

$$P(|X - \mu| > k\sigma) \le \frac{1}{k^2}.$$

That is, the probability that X is more than k standard deviations from its mean is at most 1/k².

The significance of Chebychev’s Inequality is that we can make specific probabilistic statements about a random variable given only its mean and standard deviation; observe that we have not made any assumptions on the distribution of X.

Interestingly, Chebychev’s Inequality can be easily derived as a corollary of Markov’s Inequality (Hint: consider the random variable (X − E[X])²).
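Following the hint, the derivation can be sketched as below. It applies Markov's Inequality to the nonnegative random variable $(X - \mu)^2$ with threshold $a = k^2\sigma^2$, assuming $k > 0$ and $0 < \sigma^2 < \infty$.

```latex
\begin{align*}
P(|X - \mu| > k\sigma)
  &\le P\big((X - \mu)^2 \ge k^2\sigma^2\big) \\
  &\le \frac{E[(X - \mu)^2]}{k^2\sigma^2}
     && \text{(Markov's Inequality)} \\
  &= \frac{\sigma^2}{k^2\sigma^2}
   = \frac{1}{k^2}.
\end{align*}
```

The first step holds because the event $\{|X - \mu| > k\sigma\}$ is contained in the event $\{(X - \mu)^2 \ge k^2\sigma^2\}$.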






Limit of Random Variables

1 Almost Sure Convergence

2 Convergence in Probability

3 Convergence in Distribution

4 Central Limit Theorem

5 Law of Large Numbers

6 The Statement and Application of the Delta Method


Almost Sure Convergence

Definition (Almost Sure Convergence)

The sequence of numerical random variables X1, X2, ... is said to converge almost surely to a numerical random variable X, denoted $X_n \xrightarrow{a.s.} X$, if

$$P\left(\left\{\omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\right\}\right) = 1.$$

Within the context of probability theory, an event is said to happen almost surely if it happens with probability 1. Note that it is possible for the set of exceptions to be non-empty as long as it has probability 0.


Almost Sure Convergence

The previous definition of almost sure convergence can be difficult to work with so we will make use of an alternative definition.

Result (Alternative Definition of Almost Sure Convergence)

$X_n \xrightarrow{a.s.} X$ if and only if for every ε > 0,

$$\lim_{n \to \infty} P\left(\sup_{k \ge n} |X_k - X| > \varepsilon\right) = 0.$$

Almost Sure Convergence is the mode of convergence used in the Strong Law of Large Numbers.


Convergence in Probability

The main idea behind Convergence in Probability is that the probability of an “unusual” event becomes smaller as the sequence progresses.

Definition (Convergence in Probability)

The sequence of random variables X1, X2, ... converges in probability to a random variable X if, for all ε > 0,

$$\lim_{n \to \infty} P(|X_n - X| > \varepsilon) = 0.$$

This is usually written as

$$X_n \xrightarrow{P} X.$$

For context, an estimator is called consistent if it converges in probability to the quantity being estimated. Furthermore, convergence in probability is the type of convergence established by the Weak Law of Large Numbers.


Relationship between Almost Sure Convergence and Convergence in Probability

Result (Almost Sure Convergence implies Convergence in Probability)

$$X_n \xrightarrow{a.s.} X \implies X_n \xrightarrow{P} X$$

and

$$X_n \xrightarrow{a.s.} 0 \iff \sup_{k \ge n} |X_k| \xrightarrow{P} 0.$$

Almost sure convergence is stronger than convergence in probability because almost sure convergence depends on a joint distribution, whereas convergence in probability depends only on a marginal distribution.

Wise words of wisdom: Almost sure convergence means no noodle leaves the strip (for large enough n), convergence in probability means the proportion of noodles leaving the strip goes to 0 (as n → ∞).


Convergence in Distribution

Convergence in Distribution is concerned with whether the distributions of the Xn converge to the distribution of some random variable X. In other words, we increasingly expect to see the next outcome in a sequence of random experiments become better modelled by a given probability distribution.

Definition (Convergence in Distribution)

Let X1, X2, ... be a sequence of random variables. We say that Xn converges in distribution to X if

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$

for all x where FX is continuous. A common shorthand is $X_n \xrightarrow{d} X$.

We say that FX is the limiting distribution of Xn.

Convergence in distribution often arises in applications of the central limit theorem.


More on Convergence in Distribution

Convergence in distribution allows us to make approximate probability statements about Xn, for large n, if we can derive the limiting distribution FX(x).

Result (Establishing Convergence in Distribution using Moments)

Let {Xn}, n ∈ N, be a sequence of random variables, each with mgf MXn(t). Furthermore, suppose that

$$\lim_{n \to \infty} M_{X_n}(t) = M_X(t).$$

If MX(t) is a moment generating function then there is a unique FX (which gives a random variable X) whose moments are determined by MX(t) and for all points of continuity of FX(x) we have

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x).$$


Relationship between Convergence in Probability and Convergence in Distribution

Result (Convergence in Probability implies Convergence in Distribution)

$$X_n \xrightarrow{P} X \implies X_n \xrightarrow{d} X.$$

Convergence in probability is concerned with the convergence of the actual values (the xi’s) whereas convergence in distribution is concerned with the convergence of the distributions (the FXi(x)’s).


Weak Law of Large Numbers

Result (Weak Law of Large Numbers)

Suppose X1, X2, ... are independent, each with mean µ and variance 0 < σ² < ∞. If

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i, \quad \text{then } \bar{X}_n \xrightarrow{P} \mu.$$

The Weak Law of Large Numbers describes how the sample average converges to the distributional average as the sample size increases.
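A quick simulation shows the sample average settling near the distributional mean. The sketch below assumes i.i.d. fair die rolls (mean 3.5) and a fixed seed for reproducibility; it illustrates the statement and is in no way a proof.

```python
import random

def running_means(n, seed=0):
    """Simulate n fair die rolls and return the sample mean after each roll."""
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, n + 1):
        total += rng.randint(1, 6)
        means.append(total / i)
    return means

means = running_means(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, means[n - 1])
```

The printed sample means typically fluctuate noticeably for small n and sit close to 3.5 for large n, in line with the Weak Law.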


Slutsky’s Theorem

Result (Slutsky’s Theorem)

Let X1, X2, ... be a sequence of random variables such that $X_n \xrightarrow{d} X$ for some random variable X and let Y1, Y2, ... be another sequence of random variables such that $Y_n \xrightarrow{P} c$ for some constant c. Then

1 $X_n + Y_n \xrightarrow{d} X + c$

2 $X_n Y_n \xrightarrow{d} cX$.

Slutsky’s Theorem extends some properties of algebraic operations on convergent sequences of real numbers to sequences of random variables and is useful for establishing convergence in distribution results.


Strong Law of Large Numbers

The Weak Law corresponds to convergence in probability while the Strong Law corresponds to almost sure convergence.

Result (Strong Law of Large Numbers)

Let X1, X2, ... be independent with common mean E[X] = µ and variance Var(X) = σ² < ∞. Then

$$\bar{X}_n \xrightarrow{a.s.} \mu.$$


Central Limit Theorem

Result (Central Limit Theorem)

Suppose X1, X2, ... are independent and identically distributed random variables with common mean µ = E(Xi) and common variance σ² = Var(Xi) < ∞. For each n ≥ 1 let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Then

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} Z$$

where Z ∼ N(0, 1). It is common to write

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1).$$

Note that $E(\bar{X}_n) = \mu$ and $\operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n}$, so the Central Limit Theorem states that the limiting distribution of any standardised average of independent and identically distributed random variables with finite variance is the standard Normal distribution.


Alternative Forms of the Central Limit Theorem

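Sometimes, probabilities involving related quantities such as the sum $\sum_{i=1}^{n} X_i$ are required. Since $\sum_{i=1}^{n} X_i = n\bar{X}_n$, the Central Limit Theorem also applies to the sum of a sequence of random variables.

Results

1 $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$

2 $\dfrac{\sum_i X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1)$

3 $\dfrac{\sum_i X_i - n\mu}{\sqrt{n}} \xrightarrow{d} N(0, \sigma^2)$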


Applications of the Central Limit Theorem

Central Limit Theorem for Binomial Distribution

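Suppose $X \sim \text{Bin}(n, p)$. Then, as $n \to \infty$,

$$\frac{X - np}{\sqrt{np(1-p)}} \xrightarrow{d} N(0, 1)$$

Normal Approximation to the Poisson Distribution

Suppose $X \sim \text{Poisson}(\lambda)$. Then

$$\lim_{\lambda \to \infty} P\left(\frac{X - \lambda}{\sqrt{\lambda}} \leq x\right) = P(Z \leq x)$$

where $Z \sim N(0, 1)$.

Continuity correction: add $\frac{1}{2}$ to the numerator, i.e. use $x + \frac{1}{2}$ in place of $x$ when approximating $P(X \leq x)$ for an integer $x$.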
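To see what the continuity correction buys, the following sketch compares the exact binomial CDF with the Normal approximation, with and without the extra 1/2. The values n = 50, p = 0.3, k = 12 are illustrative assumptions, not taken from the slides, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Bin(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def normal_approx_cdf(k, n, p, correction=True):
    """CLT approximation to P(X <= k); optionally adds 1/2 to the numerator."""
    shift = 0.5 if correction else 0.0
    z = (k + shift - n * p) / math.sqrt(n * p * (1 - p))
    return NormalDist().cdf(z)

# Illustrative values (an assumption, not from the slides).
n, p, k = 50, 0.3, 12
exact = binom_cdf(k, n, p)
plain = normal_approx_cdf(k, n, p, correction=False)
corrected = normal_approx_cdf(k, n, p, correction=True)
print(f"exact={exact:.4f} plain={plain:.4f} corrected={corrected:.4f}")
```

For these values the corrected approximation should land noticeably closer to the exact probability than the uncorrected one.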


The Delta Method

The Delta Method

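Let $Y_1, Y_2, \ldots$ be a sequence of random variables such that

$$\frac{\sqrt{n}(Y_n - \theta)}{\sigma} \xrightarrow{d} N(0, 1).$$

Suppose the function $g$ is differentiable in a neighbourhood of $\theta$ and $g'(\theta) \neq 0$. Then

$$\sqrt{n}\,(g(Y_n) - g(\theta)) \xrightarrow{d} N(0, \sigma^2[g'(\theta)]^2).$$

Alternatively,

$$\frac{g(Y_n) - g(\theta)}{\sigma g'(\theta)/\sqrt{n}} \xrightarrow{d} N(0, 1).$$

There are several ways of stating the Delta method.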


General Tips

1 When finding a pdf/cdf, always specify the domain over which it is defined.

2 Make sure that the relevant conditions are satisfied before applying some result/theorem (e.g., are the random variables i.i.d.? Does a sequence of random variables have the right mode of convergence?)

3 Does the pdf in question belong to a known family? If so, you can typically make use of some known results.

4 $\xrightarrow{a.s.} \implies \xrightarrow{P} \implies \xrightarrow{d}$

5 Know how to do calculations involving bivariate distributions.

6 Know how to apply (bivariate) transformations.

7 Know how to sum independent random variables using the convolution formula approach and the moment generating function approach.

8 Know the Laws of Large Numbers, the CLT and the Delta method.


Example 1

Example (MATH1251) (contd.)

99% of the people with the disease receive a positive test. 98% of those without receive a negative test. If 2% of the population have the disease, determine the probability of someone having the disease given they received a positive test.

Let $D$ be the event that a person has the disease and $T$ the event that they test positive. We require

$$P(D \mid T) = \frac{P(T \mid D)P(D)}{P(T)}.$$

$$
\begin{aligned}
P(T) &= P(T \mid D)P(D) + P(T \mid D^c)P(D^c) \\
&= P(T \mid D)P(D) + \left(1 - P(T^c \mid D^c)\right)P(D^c) \\
&= 0.99 \times 0.02 + (1 - 0.98) \times 0.98 = 0.0394
\end{aligned}
$$

$$\therefore\ P(D \mid T) = \frac{0.99 \times 0.02}{0.0394} \approx 0.5025$$

A lot of people get stuck with Bayes' law, especially when used with other results. Use a tree diagram!
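The two branches of the tree diagram translate directly into a few lines of code. This is a minimal sketch of the calculation above; the function name `posterior` and its parameter names are hypothetical labels chosen for illustration.

```python
def posterior(prior, sensitivity, specificity):
    """P(D | T) from the law of total probability and Bayes' formula."""
    p_t_given_d = sensitivity            # P(T | D)
    p_t_given_not_d = 1 - specificity    # P(T | D^c) = 1 - P(T^c | D^c)
    # Total probability of a positive test: sum over both branches of the tree.
    p_t = p_t_given_d * prior + p_t_given_not_d * (1 - prior)
    return p_t_given_d * prior / p_t

p = posterior(prior=0.02, sensitivity=0.99, specificity=0.98)
print(round(p, 4))
```

Even with a fairly accurate test, the low prior of 2% keeps the posterior close to one half.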


Example 2

Example

Given the distribution of $X$ below, compute its expectation and standard deviation.

x           0     3     9     27
P(X = x)    0.3   0.1   0.5   0.1

$$E[X] = \sum_{\text{all } x} x\,P(X = x) = 0 \times 0.3 + 3 \times 0.1 + 9 \times 0.5 + 27 \times 0.1 = 7.5$$

$$E[X^2] = 0^2 \times 0.3 + 3^2 \times 0.1 + 9^2 \times 0.5 + 27^2 \times 0.1 = 114.3$$

$$\sigma_X = \sqrt{E[X^2] - (E[X])^2} = \sqrt{114.3 - 7.5^2} = \sqrt{58.05} \approx 7.619$$
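The same three steps (first moment, second moment, square root of the difference) can be sketched in code. The dictionary `pmf` simply encodes the table from the example.

```python
import math

# Probability mass function from the table: value -> probability.
pmf = {0: 0.3, 3: 0.1, 9: 0.5, 27: 0.1}

mean = sum(x * px for x, px in pmf.items())              # E[X]
second_moment = sum(x**2 * px for x, px in pmf.items())  # E[X^2]
sd = math.sqrt(second_moment - mean**2)                  # sqrt(Var(X))

print(mean, second_moment, round(sd, 3))
```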


Example 2

Example (2901 oriented)

Let $X \sim \text{Geom}(p)$. Prove that $E[X] = \frac{1}{p}$.

Recall: $P(X = x) = p(1-p)^{x-1}$ for $x = 1, 2, \ldots$

$$
\begin{aligned}
E[X] &= \sum_{\text{all } x} x\,P(X = x) = \sum_{x=1}^{\infty} x p (1-p)^{x-1} \\
&= \sum_{y=0}^{\infty} (y+1) p (1-p)^{y} \qquad (y = x - 1) \\
&= (1-p) \sum_{y=0}^{\infty} (y+1) p (1-p)^{y-1} \\
&= (1-p) \sum_{y=0}^{\infty} y p (1-p)^{y-1} + (1-p) \sum_{y=0}^{\infty} p (1-p)^{y-1} \\
&= (1-p) \sum_{y=1}^{\infty} y p (1-p)^{y-1} + (1-p) \left( \sum_{y=1}^{\infty} p (1-p)^{y-1} + p(1-p)^{-1} \right) \qquad (\text{evaluating at } y = 0) \\
&= (1-p) E[X] + (1-p)\left(1 + p(1-p)^{-1}\right)
\end{aligned}
$$

In the first sum the $y = 0$ term is zero, and $\sum_{y=1}^{\infty} p(1-p)^{y-1} = 1$ because it is the total probability.

$$\therefore\ pE[X] = (1-p) + p = 1 \implies E[X] = \frac{1}{p}$$

Example 3

In general, this can be done with the aid of Taylor series or the binomial theorem. But preferably just do this:

Method (Deriving Expected Value from definition) (2901)

Keep rearranging the expression until you make the entire density, or $E[X]$, appear again.

Discrete case - Use a change of summation index at some point

Continuous case - Use integration by parts (or occasionally integration by substitution)
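A quick numerical sanity check is a useful companion to this kind of derivation. The sketch below, assuming the Geom(p) probability function $p(1-p)^{x-1}$ on $x = 1, 2, \ldots$, truncates the series for $E[X]$ and compares it with $1/p$. The function name and the truncation length are illustrative choices.

```python
def geometric_mean_partial(p, terms=10_000):
    """Truncated series sum_{x=1}^{terms} x * p * (1 - p)^(x - 1)."""
    return sum(x * p * (1 - p) ** (x - 1) for x in range(1, terms + 1))

for p in (0.1, 0.25, 0.5):
    # The truncated sum should agree with 1/p up to floating-point error.
    print(p, geometric_mean_partial(p), 1 / p)
```

Agreement here does not replace the proof, but a mismatch would flag an algebra slip immediately.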


Example 4

Example

Let $f_X(x) = \frac{2}{\theta^2}x$ for $0 < x < \theta$. Compute the MGF and (2901) assert its existence.

Integrate by parts, then slowly tidy everything up. For $u \neq 0$:

$$
\begin{aligned}
m_X(u) = E[e^{uX}] &= \frac{2}{\theta^2} \int_0^{\theta} x e^{ux}\,dx \\
&= \frac{2}{\theta^2} \left( \left.\frac{x e^{ux}}{u}\right|_0^{\theta} - \int_0^{\theta} \frac{e^{ux}}{u}\,dx \right) \\
&= \frac{2\theta e^{u\theta}}{u\theta^2} - \frac{2}{\theta^2} \left( \left.\frac{e^{ux}}{u^2}\right|_0^{\theta} \right) \\
&= \frac{2(u\theta e^{u\theta} - e^{u\theta} + 1)}{u^2\theta^2}
\end{aligned}
$$

GeoGebra simulation: https://ggbm.at/Bwn9yAMd

Idea: Can check that the limit as $u \to 0$ is finite. The finiteness of the limit implies the required result.

$$
\lim_{u \to 0} \frac{2(u\theta e^{u\theta} - e^{u\theta} + 1)}{u^2\theta^2}
\overset{\text{L'H}}{=} \lim_{u \to 0} \frac{2(\theta e^{u\theta} + u\theta^2 e^{u\theta} - \theta e^{u\theta})}{2u\theta^2}
= \lim_{u \to 0} e^{u\theta} = 1
$$
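The closed form can be cross-checked against a direct numerical evaluation of $E[e^{uX}]$. This is a rough sketch using a midpoint rule; the value $\theta = 2$, the test points for $u$ and the step count are illustrative assumptions.

```python
import math

def mgf_closed_form(u, theta):
    """m_X(u) = 2(u*theta*e^{u*theta} - e^{u*theta} + 1) / (u*theta)^2, with m_X(0) = 1."""
    if u == 0:
        return 1.0
    a = u * theta
    return 2 * (a * math.exp(a) - math.exp(a) + 1) / a**2

def mgf_numeric(u, theta, steps=20_000):
    """Midpoint-rule approximation of the integral of e^{ux} * 2x/theta^2 over (0, theta)."""
    h = theta / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += math.exp(u * x) * 2 * x / theta**2
    return total * h

theta = 2.0  # illustrative value
for u in (-1.0, 0.5, 1.5):
    print(u, mgf_closed_form(u, theta), mgf_numeric(u, theta))

# Near u = 0 the closed form stays close to 1, matching the limit computed above.
print(mgf_closed_form(1e-4, theta))
```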


Example 5

Example

Use the MGF of X ∼ Bin(n, p) to prove that E[X ] = np.

$$
\begin{aligned}
E[X] &= \lim_{u \to 0} \frac{d}{du}(1 - p + pe^u)^n \\
&= \lim_{u \to 0} n(1 - p + pe^u)^{n-1} \cdot pe^u \\
&= n(1 - p + p)^{n-1} \cdot p \\
&= np
\end{aligned}
$$


Example 6

Example

A busy switchboard receives 150 calls an hour on average. Assume that calls are independent from each other and can be modelled with a Poisson distribution. Find the probability of

1 Exactly 3 calls in a given minute

2 At least 10 calls in a given 5 minute period.

Naive: $X \sim \text{Poisson}(150)$. That is the number of calls per hour, so the rate has to be rescaled to the time window in each question.

In Q1, take $X \sim \text{Poisson}(150/60) = \text{Poisson}(2.5)$. Then,

$$P(X = 3) = e^{-2.5}\,\frac{2.5^3}{3!} \approx 0.2138$$

In Q2, take $Y \sim \text{Poisson}(2.5 \times 5) = \text{Poisson}(12.5)$. Then,

$$
\begin{aligned}
P(Y \geq 10) &= 1 - P(Y \leq 9) \\
&= 1 - e^{-12.5}\left(\frac{12.5^0}{0!} + \cdots + \frac{12.5^9}{9!}\right)
\end{aligned}
$$

= 1 - ppois(9, lambda = 12.5, lower.tail = TRUE)

≈ 0.7985689


Example 7

Example (2901 course pack)

If, on average, 5 servers go offline during the day, what is the chance that no servers will go offline in the next hour?

The number of servers going offline in a day is $X \sim \text{Poisson}(5)$.

So the time taken for the next server to go offline is $T \sim \text{Exp}(0.2)$, measured in days (mean 0.2 days, i.e. density $5e^{-5t}$ for $t > 0$).

$\therefore$ We require $P\left(T > \frac{1}{24}\right)$.

$$P\left(T > \frac{1}{24}\right) = \int_{1/24}^{\infty} 5e^{-5t}\,dt = e^{-5/24}$$


Example 8

Formula (Transforming a Discrete r.v.)

$$P\big(h(X) = y\big) = \sum_{x : h(x) = y} P(X = x)$$

Um, ye wat?


Example 8

Example

A random variable has the following distribution:

x           -1     0      1      2
P(X = x)    0.38   0.21   0.14   0.27

Determine the distribution of $Y = X^3$ and $Z = X^2$.

If $X$ can take the values $-1, 0, 1, 2$, then $Y = X^3$ takes the values $-1, 0, 1, 8$.

$$P(Y = -1) = P(X^3 = -1) = P(X = -1) = 0.38$$

Similarly, $P(Y = 0) = 0.21$, $P(Y = 1) = 0.14$, $P(Y = 8) = 0.27$.

On the other hand, $X^2$ can only take the values $0, 1, 4$.

$$P(Z = 0) = P(X^2 = 0) = P(X = 0) = 0.21$$

$$P(Z = 1) = P(X^2 = 1) = P(X = \pm 1) = 0.38 + 0.14 = 0.52$$

...and $P(Z = 4)$ is still equal to 0.27.
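The transformation formula amounts to grouping probabilities by the value of $h(x)$. A minimal sketch, with `transform_pmf` as a hypothetical helper name:

```python
from collections import defaultdict

def transform_pmf(pmf, h):
    """Distribution of h(X): add up P(X = x) over all x that share the same h(x)."""
    out = defaultdict(float)
    for x, px in pmf.items():
        out[h(x)] += px
    return dict(out)

pmf_x = {-1: 0.38, 0: 0.21, 1: 0.14, 2: 0.27}

pmf_y = transform_pmf(pmf_x, lambda x: x**3)  # one-to-one, probabilities carry over
pmf_z = transform_pmf(pmf_x, lambda x: x**2)  # x = -1 and x = 1 collapse onto z = 1

print(pmf_y)
print(pmf_z)
```

When $h$ is one-to-one on the support nothing is merged; when it is not, the masses of colliding $x$ values are added.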


Example 8

Just to think about... (2901 oriented)

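If $X \sim \text{Poisson}(\lambda)$, what must be the distribution of $Y = X^2$?

$$
P(Y = y) =
\begin{cases}
\dfrac{e^{-\lambda}\lambda^{\sqrt{y}}}{(\sqrt{y})!} & \text{if } y = 0, 1, 4, 9, \ldots \\
0 & \text{otherwise}
\end{cases}
$$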


Example 9

Method 1 (Continuous random variable transform theorem)

Consider the transform $y = h(x)$. If $h$ is monotonic wherever $f_X(x)$ is non-zero, then the density of $Y = h(X)$ is

$$f_Y(y) = f_X\big(h^{-1}(y)\big)\left|\frac{dx}{dy}\right|$$

Example

Let $X \sim \text{Exp}(\lambda)$. What is the density of $Y = X^2$?

$f_X(x) = \frac{1}{\lambda}e^{-x/\lambda}$ for all $x > 0$.

$h(x) = x^2$ is invertible for all $x > 0$, with $h^{-1}(y) = \sqrt{y}$.

$x = \sqrt{y}$, so $\frac{dx}{dy} = \frac{1}{2\sqrt{y}}$.

$$\therefore\ f_Y(y) = f_X(\sqrt{y})\left|\frac{1}{2\sqrt{y}}\right| = \frac{1}{\lambda}e^{-\sqrt{y}/\lambda}\left|\frac{1}{2\sqrt{y}}\right| = \frac{1}{2\lambda\sqrt{y}}e^{-\sqrt{y}/\lambda}$$

Since $x > 0$ and $y = x^2$, $y > 0$ as well.
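One way to check a transformed density is to differentiate the CDF numerically. Under the parametrisation used in this example, $P(Y \leq y) = P(X \leq \sqrt{y}) = 1 - e^{-\sqrt{y}/\lambda}$, so a finite difference of that CDF should match $f_Y$. The value $\lambda = 1.5$ and the test points are illustrative assumptions.

```python
import math

def density_y(y, lam):
    """f_Y(y) = exp(-sqrt(y)/lam) / (2 * lam * sqrt(y)) for y > 0."""
    return math.exp(-math.sqrt(y) / lam) / (2 * lam * math.sqrt(y))

def cdf_y(y, lam):
    """P(Y <= y) = P(X <= sqrt(y)) = 1 - exp(-sqrt(y)/lam) for y > 0."""
    return 1 - math.exp(-math.sqrt(y) / lam)

def central_difference(func, y, h=1e-6):
    """Numerical derivative of func at y."""
    return (func(y + h) - func(y - h)) / (2 * h)

lam = 1.5  # illustrative value
for y in (0.5, 2.0, 9.0):
    numeric = central_difference(lambda t: cdf_y(t, lam), y)
    print(y, density_y(y, lam), numeric)
```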


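Example 10

Example

Let $X \sim \text{Unif}(-10, 10)$. What is the density of $Y = X^2$?

Here $h(x) = x^2$ is not monotonic on $(-10, 10)$, so both branches $x = \pm\sqrt{y}$ contribute: $f_Y(y) = \big(f_X(\sqrt{y}) + f_X(-\sqrt{y})\big)\frac{1}{2\sqrt{y}}$ with $f_X = \frac{1}{20}$.

$$f_Y(y) = \frac{1}{20\sqrt{y}}$$

Since $-10 < x < 10$ and $y = x^2$, we must have $0 < y < 100$.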


Example 11

Example

The joint probability distribution of $X$ and $Y$ is

          y = 0    y = 1    y = 2
x = 0     1/16     1/8      1/8
x = 1     1/8      1/16     0
x = 2     3/16     1/4      1/16

Determine $P(X = 0, Y = 1)$, $P(X \geq 1, Y < 1)$ and $P(X - Y = 1)$.

$$P(X = 0, Y = 1) = \frac{1}{8}$$

$$P(X \geq 1, Y < 1) = P(X = 1, Y = 0) + P(X = 2, Y = 0) = \frac{1}{8} + \frac{3}{16} = \frac{5}{16}$$

$$P(X - Y = 1) = P(X = 2, Y = 1) + P(X = 1, Y = 0) = \frac{1}{4} + \frac{1}{8} = \frac{3}{8}$$
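Each of these probabilities is a sum of table cells selected by a condition on $(x, y)$. The sketch below encodes the table with exact fractions; `prob` is a hypothetical helper that sums the cells where a predicate holds.

```python
from fractions import Fraction as F

# Joint probability table: (x, y) -> P(X = x, Y = y).
joint = {
    (0, 0): F(1, 16), (0, 1): F(1, 8),  (0, 2): F(1, 8),
    (1, 0): F(1, 8),  (1, 1): F(1, 16), (1, 2): F(0),
    (2, 0): F(3, 16), (2, 1): F(1, 4),  (2, 2): F(1, 16),
}

def prob(event):
    """Sum the joint probabilities over all cells (x, y) where event(x, y) is true."""
    return sum(p for (x, y), p in joint.items() if event(x, y))

print(prob(lambda x, y: x == 0 and y == 1))
print(prob(lambda x, y: x >= 1 and y < 1))
print(prob(lambda x, y: x - y == 1))
```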


Example 12

Joint continuous distributions

Unless you know how to use indicator functions really well (2901), sketch the region!

Example

$$f_{X,Y}(x, y) = \frac{1}{x^2y^2}, \qquad x \geq 1,\ y \geq 1$$

is the joint density of the continuous r.v.s $X$ and $Y$. Find $P(X < 2, Y \geq 4)$ and $P(X \leq Y^2)$.

$$
\begin{aligned}
P(X < 2, Y \geq 4) &= \int_1^2 \int_4^{\infty} \frac{1}{x^2y^2}\,dy\,dx \\
&= \int_1^2 \frac{1}{4x^2}\,dx \\
&= \frac{1}{8}
\end{aligned}
$$

For the second probability the region is $1 \leq x \leq y^2$, $y \geq 1$:

$$
\begin{aligned}
P(X \leq Y^2) &= \int_1^{\infty} \int_1^{y^2} \frac{1}{x^2y^2}\,dx\,dy \\
&= \int_1^{\infty} \left(\frac{1}{y^2} - \frac{1}{y^4}\right)dy \\
&= \frac{2}{3}
\end{aligned}
$$
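Region probabilities like these can also be estimated by simulation, which is a handy check on the limits of integration. The sketch below assumes that $X$ and $Y$ are independent with marginal CDF $1 - 1/x$ on $x \geq 1$ (this follows from the product form of the density) and samples them by inverting that CDF. The seed and sample size are arbitrary choices.

```python
import random

def sample_marginal():
    """Inverse-CDF sample from the density 1/x^2 on x >= 1 (CDF 1 - 1/x)."""
    u = 1 - random.random()  # uniform on (0, 1], avoids division by zero
    return 1 / u

random.seed(2901)
n = 200_000
hits_a = 0  # counts X < 2 and Y >= 4
hits_b = 0  # counts X <= Y^2
for _ in range(n):
    x, y = sample_marginal(), sample_marginal()
    hits_a += (x < 2 and y >= 4)
    hits_b += (x <= y * y)

est_a, est_b = hits_a / n, hits_b / n
print(est_a, est_b)  # should be close to 1/8 and 2/3
```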


Example 13

Example

Find $E[Y^2 \ln X]$ for the following distribution

          y = 1    y = 2
x = 1     1/10     1/5
x = 2     3/10     2/5

$$
\begin{aligned}
E[Y^2 \ln X] &= 1^2 \ln 1\,P(X = 1, Y = 1) + 2^2 \ln 1\,P(X = 1, Y = 2) \\
&\quad + 1^2 \ln 2\,P(X = 2, Y = 1) + 2^2 \ln 2\,P(X = 2, Y = 2) \\
&= \left(\frac{3}{10} + 4 \times \frac{2}{5}\right)\ln 2 = \frac{19 \ln 2}{10}
\end{aligned}
$$


Example 14

Problem

Examine the existence of $E[XY]$ for the earlier example:

$$f_{X,Y}(x, y) = \frac{1}{x^2y^2} \quad \text{for } x, y \geq 1.$$


Example 15

Definition (Cumulative Distribution Function)

The CDF $F_{X,Y}(x, y)$ is the function given by

$$F_{X,Y}(x, y) = P(X \leq x, Y \leq y)$$

Finding a CDF (Continuous case)

$$F_{X,Y}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u, v)\,dv\,du$$

Example

For the earlier example, $F_{X,Y}(x, y) = 0$ if $x < 1$ or $y < 1$. Else:

$$F_{X,Y}(x, y) = \int_1^{x} \int_1^{y} \frac{1}{u^2v^2}\,dv\,du = \left(1 - \frac{1}{x}\right)\left(1 - \frac{1}{y}\right)$$

Example 16

Recall that events $A$ and $B$ are independent when $P(A \cap B) = P(A)P(B)$.

Definition (Independence of random variables)

Two random variables are independent when, for all $x$ and $y$:

$P(X = x, Y = y) = P(X = x)P(Y = y)$ (discrete case)

$f_{X,Y}(x, y) = f_X(x)f_Y(y)$ (continuous case)

Example

Test if $X$ and $Y$ are independent, for

$$f_{X,Y}(x, y) = \frac{1}{x^2y^2}, \qquad x, y \geq 1.$$

$$f_X(x) = \int_1^{\infty} \frac{1}{x^2y^2}\,dy = \frac{1}{x^2}, \qquad x \geq 1$$

Similarly $f_Y(y) = \frac{1}{y^2}$, $y \geq 1$.

Therefore, since $f_{X,Y}(x, y) = f_X(x)f_Y(y)$, $X$ and $Y$ are independent.


Example 17

Example

Determine $P(X = x \mid Y = 2)$, i.e. $f_{X|Y}(x \mid 2)$, for

          y = 1    y = 2
x = 1     1/10     1/5
x = 2     3/10     2/5

$$P(Y = 2) = P(X = 1, Y = 2) + P(X = 2, Y = 2) = \frac{1}{5} + \frac{2}{5} = \frac{3}{5}.$$

$$P(X = 1 \mid Y = 2) = \frac{P(X = 1, Y = 2)}{P(Y = 2)} = \frac{1}{3}$$

$$P(X = 2 \mid Y = 2) = \frac{P(X = 2, Y = 2)}{P(Y = 2)} = \frac{2}{3}$$


Example 18

Lemma (Independence of random variables)

Two random variables are independent if and only if

$$f_{Y|X}(y \mid x) = f_Y(y)$$

or

$$f_{X|Y}(x \mid y) = f_X(x)$$

Investigation

For the earlier example with $f_{X,Y}(x, y) = x^{-2}y^{-2}$ for $x \geq 1$, $y \geq 1$, prove the independence of $X$ and $Y$ using this lemma instead.


Example 19

Definition (Conditional Expectation)

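$$
E[X \mid Y = y] =
\begin{cases}
\displaystyle\sum_{\text{all } x} x\,P(X = x \mid Y = y) & \text{discrete case} \\[2ex]
\displaystyle\int_{-\infty}^{\infty} x\,f_{X|Y}(x \mid y)\,dx & \text{continuous case}
\end{cases}
$$

Definition (Conditional Variance)

$$\text{Var}(X \mid Y = y) = E[X^2 \mid Y = y] - \big(E[X \mid Y = y]\big)^2$$

(And similarly for $Y$. Basically, just add the condition to the original formula.)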


Example 19

Example

Find $E[X \mid Y = 2]$ and $\text{Var}(X \mid Y = 2)$ for

          y = 1    y = 2
x = 1     1/10     1/5
x = 2     3/10     2/5

$$E[X \mid Y = 2] = 1 \cdot P(X = 1 \mid Y = 2) + 2 \cdot P(X = 2 \mid Y = 2) = 1 \times \frac{1}{3} + 2 \times \frac{2}{3} = \frac{5}{3}.$$

$$E[X^2 \mid Y = 2] = 1^2 \cdot P(X = 1 \mid Y = 2) + 2^2 \cdot P(X = 2 \mid Y = 2) = 1^2 \times \frac{1}{3} + 2^2 \times \frac{2}{3} = 3.$$

$$\text{Var}(X \mid Y = 2) = 3 - \left(\frac{5}{3}\right)^2 = \frac{2}{9}$$
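Conditioning on $Y = 2$ means restricting to one column of the table and renormalising. The sketch below does that with exact fractions and then applies the two definitions; `conditional_pmf_x` is a hypothetical helper name.

```python
from fractions import Fraction as F

# Joint probability table: (x, y) -> P(X = x, Y = y).
joint = {(1, 1): F(1, 10), (1, 2): F(1, 5),
         (2, 1): F(3, 10), (2, 2): F(2, 5)}

def conditional_pmf_x(joint, y):
    """P(X = x | Y = y): keep the column Y = y and divide by P(Y = y)."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

cond = conditional_pmf_x(joint, 2)
cond_mean = sum(x * p for x, p in cond.items())                 # E[X | Y = 2]
cond_second = sum(x**2 * p for x, p in cond.items())            # E[X^2 | Y = 2]
cond_var = cond_second - cond_mean**2                           # Var(X | Y = 2)

print(cond, cond_mean, cond_second, cond_var)
```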


Example 20

Example

Let $f_{X,Y}(x, y) = xy$ for $x \in [0, 1]$, $y \in [0, 2]$. Determine their covariance in the old fashioned way.

Step 1: Determine the marginal densities

$$f_X(x) = \int_0^2 xy\,dy = 2x \quad (0 \leq x \leq 1)$$

$$f_Y(y) = \int_0^1 xy\,dx = \frac{y}{2} \quad (0 \leq y \leq 2)$$

Step 2: Find the marginal expectations $E[X]$ and $E[Y]$

$$E[X] = \int_0^1 2x^2\,dx = \frac{2}{3}$$

$$E[Y] = \int_0^2 \frac{y^2}{2}\,dy = \frac{4}{3}$$

Step 3: Find $E[XY]$

$$E[XY] = \int_0^1 \int_0^2 xy \cdot xy\,dy\,dx = \cdots = \frac{8}{9}$$

Step 4: Plug in:

$$\text{Cov}(X, Y) = E[XY] - E[X]E[Y] = \frac{8}{9} - \frac{2}{3} \times \frac{4}{3} = 0.$$

That was a horrible idea.

Can prove that $X$ and $Y$ are independent

Can use the Fubini-Tonelli theorem to just check that $E[XY]$ equals $E[X]E[Y]$
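As a numerical cross-check of the four steps, the expectations can be approximated with a midpoint rule over the rectangle $[0, 1] \times [0, 2]$. This is only a rough sketch; the grid size and the helper name `double_midpoint` are arbitrary choices.

```python
def double_midpoint(g, ax, bx, ay, by, n=200):
    """Midpoint-rule approximation of the double integral of g over [ax, bx] x [ay, by]."""
    hx, hy = (bx - ax) / n, (by - ay) / n
    total = 0.0
    for i in range(n):
        x = ax + (i + 0.5) * hx
        for j in range(n):
            y = ay + (j + 0.5) * hy
            total += g(x, y)
    return total * hx * hy

def density(x, y):
    """Joint density f(x, y) = xy on [0, 1] x [0, 2]."""
    return x * y

e_x = double_midpoint(lambda x, y: x * density(x, y), 0, 1, 0, 2)       # E[X]
e_y = double_midpoint(lambda x, y: y * density(x, y), 0, 1, 0, 2)       # E[Y]
e_xy = double_midpoint(lambda x, y: x * y * density(x, y), 0, 1, 0, 2)  # E[XY]
cov = e_xy - e_x * e_y

print(e_x, e_y, e_xy, cov)
```

The covariance comes out at essentially zero, as expected for a density that factorises into a function of $x$ times a function of $y$.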


Example 21

Example (2901)

Let $Z \sim N(0, 1)$ and $W$ satisfy $P(W = 1) = P(W = -1) = \frac{1}{2}$. Suppose that $W$ and $Z$ are independent and define $X := WZ$.

Show that $\text{Cov}(X, Z) = 0$.

Noting that $E[Z] = 0$,

$$\text{Cov}(X, Z) = E[XZ] - E[X]E[Z] = E[XZ]$$

Subbing in $X = WZ$ and using independence gives

$$\text{Cov}(X, Z) = E[WZ^2] = E[W]E[Z^2]$$

Observe that

$$E[W] = 1 \cdot P(W = 1) - 1 \cdot P(W = -1) = 0.$$

Hence $\text{Cov}(X, Z) = E[W]E[Z^2] = 0$.


Example 22

Theorem (Bivariate Transform Formula)

Suppose $X$ and $Y$ have joint density function $f_{X,Y}$ and let $U$ and $V$ be transforms on these random variables. Then the joint density of $U, V$ is

$$f_{U,V}(u, v) = f_{X,Y}(x, y)\,|\det(J)|$$

where $J$ is the Jacobian matrix

$$
J = \begin{pmatrix}
\dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v} \\[1.5ex]
\dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v}
\end{pmatrix}
$$

Remember: $x$ above $y$ and $u$ left of $v$

Example (Course pack)

Let $X$ and $Y$ be i.i.d. Exp(4) r.v.s. Find the joint density of $U$ and $V$ if $U = \frac{1}{2}(X - Y)$ and $V = Y$.

We have $y = v$ and

$$u = \frac{1}{2}(x - v) \implies x = 2u + v.$$

$$\therefore\ J = \begin{pmatrix} 2 & 1 \\ 0 & 1 \end{pmatrix} \quad \text{and} \quad \det(J) = 2.$$

$$f_{X,Y}(x, y) = \frac{1}{16}e^{-(x+y)/4}$$

Since $y = v$ and $x = 2u + v$, we get $x + y = 2u + 2v$. Therefore

$$f_{U,V}(u, v) = \frac{1}{8}e^{-(u+v)/2}.$$

We know that $y > 0$. Since $v = y$, it immediately follows that $v > 0$. However, $x > 0$ and $x = 2u + v$. Therefore:

$$2u + v > 0 \implies u > -\frac{v}{2}$$


Example 23

Example

Let $X$ and $Y$ be i.i.d. Geom($p$). Use convolutions to find the probability function of $Z := X + Y$.

The probability functions are $P(X = x) = p(1-p)^{x-1}$ for $x = 1, 2, 3, \ldots$, and $P(Y = y) = p(1-p)^{y-1}$ for $y = 1, 2, 3, \ldots$. Therefore:

$$P(X = z - y) = p(1-p)^{z-y-1}$$

for $z - y = 1, 2, 3, \ldots$, i.e.

$$y - z = \ldots, -3, -2, -1 \iff y = \ldots, z - 3, z - 2, z - 1$$

Hence $P(X = z - y)P(Y = y) = p(1-p)^{z-y-1}\,p(1-p)^{y-1} = p^2(1-p)^{z-2}$, when

$$y = 1, 2, 3, \ldots$$

and $y = \ldots, z - 3, z - 2, z - 1$.

Therefore, $y = 1, 2, \ldots, z - 2, z - 1$.

$$
\begin{aligned}
\therefore\ P(Z = z) &= \sum_{y=1}^{z-1} p^2(1-p)^{z-2} \\
&= (z-1)p^2(1-p)^{z-2} \qquad (\text{the summand does not depend on } y!)
\end{aligned}
$$

Since $x = 1, 2, \ldots$ and $y = 1, 2, \ldots$, i.e. $x$ and $y$ are natural numbers greater than or equal to 1, $z = x + y = 2, 3, 4, \ldots$


Example 24

Example

Let $X$ and $Y$ be i.i.d. Exp(1). Prove that $Z := X + Y$ follows a Gamma(2, 1) distribution using a convolution.

The densities are $f_X(x) = e^{-x}$ for $x > 0$, and $f_Y(y) = e^{-y}$ for $y > 0$. Therefore:

$$f_X(z - y) = e^{-z+y}, \quad \text{for } z - y > 0, \text{ i.e. } y < z$$

Hence $f_X(z - y)f_Y(y) = e^{-z}$ when $y < z$ and $y > 0$, i.e.

$$f_X(z - y)f_Y(y) = e^{-z} \quad \text{for } 0 < y < z$$

$$
\begin{aligned}
\therefore\ f_Z(z) &= \int_0^z e^{-z}\,dy \\
&= e^{-z}z \\
&= \frac{e^{-z/1}z^{2-1}}{\Gamma(2)\,1^2}
\end{aligned}
$$

Since $x > 0$ and $y > 0$, $z = x + y > 0$. Thus $Z$ has the density of a Gamma(2, 1) random variable.


Example 25

Theorem (MGF of a sum)

If $X$ and $Y$ are independent random variables, then

$$m_{X+Y}(u) = m_X(u)\,m_Y(u)$$

Example

Let $X$ and $Y$ be i.i.d. Exp(1). Prove that $Z := X + Y$ follows a Gamma(2, 1) distribution from quoting MGFs.

$m_X(u) = \frac{1}{1-u}$ and $m_Y(u) = \frac{1}{1-u}$ for $u < 1$. So clearly

$$m_Z(u) = m_X(u)\,m_Y(u) = \left(\frac{1}{1-u}\right)^2,$$

which is the MGF of a Gamma(2, 1) distribution. Hence $Z$ follows this distribution as well.


Example 26

Example

Let $U_1, \ldots, U_n$ be a sequence of i.i.d. Unif(0, 1) random variables. Define $Y_n = n\min\{U_1, \ldots, U_n\}$. Prove that $Y_n \xrightarrow{d} Y$, where $Y \sim \text{Exp}(1)$.

Fix $y > 0$ and take $n > y$. Then

$$
\begin{aligned}
F_{Y_n}(y) = P(Y_n \leq y) &= P\big(n\min\{U_1, \ldots, U_n\} \leq y\big) \\
&= P\left(\min\{U_1, \ldots, U_n\} \leq \frac{y}{n}\right) \\
&= 1 - P\left(\min\{U_1, \ldots, U_n\} > \frac{y}{n}\right) \\
&= 1 - P\left(U_1 > \frac{y}{n}, \ldots, U_n > \frac{y}{n}\right)
\end{aligned}
$$

In general, if $\min\{x_1, \ldots, x_n\} \leq x$, it does not follow that every $x_i \leq x$.

But it is true that if $\min\{x_1, \ldots, x_n\} > x$, then every $x_i > x$. This is why we pass to the complement.

$$
\begin{aligned}
F_{Y_n}(y) &= 1 - P\left(U_1 > \frac{y}{n}, \ldots, U_n > \frac{y}{n}\right) \\
&= 1 - P\left(U_1 > \frac{y}{n}\right) \cdots P\left(U_n > \frac{y}{n}\right) \qquad (\text{independence}) \\
&= 1 - \left[P\left(U_1 > \frac{y}{n}\right)\right]^n \qquad (\text{id. distributed}) \\
&= 1 - \left[\int_{y/n}^{1} 1\,dt\right]^n = 1 - \left(1 - \frac{y}{n}\right)^n
\end{aligned}
$$

$$\therefore\ \lim_{n \to \infty} F_{Y_n}(y) = 1 - e^{-y} = F_Y(y)$$

Hence $Y_n \xrightarrow{d} Y$.
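The convergence of the CDFs can be watched numerically. The sketch below evaluates $1 - (1 - y/n)^n$ for growing $n$ at an arbitrary point $y = 1.3$ (an illustrative choice) and compares it with the Exp(1) CDF $1 - e^{-y}$.

```python
import math

def cdf_scaled_min(y, n):
    """CDF of n * min(U_1, ..., U_n) for i.i.d. Unif(0, 1) variables."""
    if y <= 0:
        return 0.0
    if y >= n:
        return 1.0
    return 1 - (1 - y / n) ** n

y = 1.3  # illustrative evaluation point
limit = 1 - math.exp(-y)
for n in (2, 10, 100, 1_000, 10_000):
    # The gap to the limiting CDF shrinks as n grows.
    print(n, cdf_scaled_min(y, n), abs(cdf_scaled_min(y, n) - limit))
```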


Example 27

Example (Libo's notes)

Australians have average weight about 68 kg and variance about 16 kg². Suppose 40 random Australians are chosen. What is the (approximate) probability that the average weight of these Australians is over 80 kg?

Let $X_1, \ldots, X_{40}$ be the weights of the Australians. Then $n = 40$, $\mu = 68$ and $\sigma = 4$, so by the CLT:

$$\frac{\bar{X}_{40} - 68}{4/\sqrt{40}} \xrightarrow{d} Z$$

where $Z \sim N(0, 1)$.

$$
\begin{aligned}
\therefore\ P(\bar{X}_{40} > 80) &= P\left(\frac{\bar{X}_{40} - 68}{4/\sqrt{40}} > \frac{80 - 68}{4/\sqrt{40}}\right) \\
&\approx P\left(Z > \frac{80 - 68}{4/\sqrt{40}}\right) \\
&= P(Z > 3\sqrt{40})
\end{aligned}
$$

= 1-pnorm(3*sqrt(40))

or pnorm(3*sqrt(40), lower.tail=FALSE)
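The two R calls are mathematically the same, but numerically they behave differently this far into the tail. The sketch below illustrates the point in Python, using the complementary error function as a stand-in for an upper-tail routine; that substitution is an assumption of this sketch, not something taken from the slides.

```python
import math
from statistics import NormalDist

z = (80 - 68) / (4 / math.sqrt(40))   # standardised value, equal to 3 * sqrt(40)

# "1 - CDF" loses everything: the CDF rounds to exactly 1.0 in double precision.
naive_tail = 1 - NormalDist().cdf(z)

# Computing the upper tail directly keeps the tiny probability.
# P(Z > z) = 0.5 * erfc(z / sqrt(2)) for a standard Normal Z.
direct_tail = 0.5 * math.erfc(z / math.sqrt(2))

print(z, naive_tail, direct_tail)
```

Either way the practical answer is "essentially zero"; the direct tail just shows how small it is.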


Example 28

Lemma (Normal Approximation to Binomial)

Let $X \sim \text{Bin}(n, p)$, which is a sum of $n$ independent Ber($p$) r.v.s. Then

$$\frac{X - np}{\sqrt{np(1-p)}} \xrightarrow{d} N(0, 1)$$

Example

An unfortunate soul decided to sit his exam despite having a migraine and the flu. Fortunately, it was not a university exam, and the paper involved only 200 multiple choice questions with 5 options. Therefore, he randomly guesses every answer. What is the (approximate) probability he fails?

Let $X$ be how many he gets correct. Then $X \sim \text{Bin}\left(200, \frac{1}{5}\right)$.

We may approximate $X$ with $Y \sim N(40, 32)$. Then,

$$
\begin{aligned}
P(X < 100) &\approx P(Y < 100) \\
&= P\left(\frac{Y - 40}{\sqrt{32}} < \frac{100 - 40}{\sqrt{32}}\right) \\
&= P\left(Z < \frac{60}{\sqrt{32}}\right) \\
&= P(Z < 10.6066)
\end{aligned}
$$

Oh my...


Example 29


Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables with mean 2 and variance 7. Obtain a large sample approximation for the distribution of $(\bar{X}_n)^3$.

The CLT gives

$$\sqrt{n}(\bar{X}_n - 2) \xrightarrow{d} N(0, 7)$$

Applying the Delta Method with $g(x) = x^3$ leads to $g'(x) = 3x^2$ and then

$$\sqrt{n}\left[(\bar{X}_n)^3 - 2^3\right] \xrightarrow{d} N(0, 7 \cdot (3 \cdot 2^2)^2).$$

Simplifying, we have

$$\sqrt{n}\left[(\bar{X}_n)^3 - 8\right] \xrightarrow{d} N(0, 1008).$$

Thus, for large $n$, the approximate distribution of $(\bar{X}_n)^3$ is $N\left(8, \frac{1008}{n}\right)$.
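A small simulation gives a feel for how good this approximation is. The sketch below assumes, purely for illustration, that the $X_i$ are Normal with mean 2 and variance 7 (the example only fixes the mean and variance), and uses an arbitrary seed, sample size and number of replications.

```python
import math
import random
import statistics

random.seed(29)
mu, var = 2.0, 7.0
n, reps = 400, 2_000   # sample size and number of simulated samples (arbitrary)

cubes = []
for _ in range(reps):
    # Sample mean of n draws; Normal draws are an assumption of this sketch.
    xbar = sum(random.gauss(mu, math.sqrt(var)) for _ in range(n)) / n
    cubes.append(xbar**3)

sim_mean = statistics.fmean(cubes)
sim_var = statistics.variance(cubes)
delta_var = 1008 / n   # variance predicted by the Delta Method

print(sim_mean, sim_var, delta_var)
```

The simulated mean and variance should sit near 8 and 1008/n, with a small gap that comes from the higher-order terms the Delta Method ignores.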


Example 30

Example (More 2801 focused)

The Riemann zeta function is defined for complex $s$ with real part greater than 1 by the absolutely convergent infinite series

$$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}.$$

Prove that the real part of every non-trivial zero of the Riemann zeta function is $\frac{1}{2}$.

