
MSc Maths and Statistics 2008, Department of Economics, UCL
Jidong Zhou

Chapter 6: Random Variables and Distributions

We will now study the main tools used to characterize experiments with uncertainty:

random variables and their distributions.

1 Single Random Variables and Distributions

1.1 Basic definitions

• Random variables: a random variable is a function that maps the sample space Ω of an

experiment into R. In other words, a random variable X is a function that assigns a

real number X(ω) to each possible experimental outcome ω ∈ Ω.

— for example, for an experiment in which a coin is tossed 10 times, the sample space consists of 2^10 sequences of 10 heads and tails. The number of heads obtained on the 10 tosses can be regarded as a random variable, and let us denote it by X. Clearly, X maps each possible sequence into the set {0, 1, · · · , 10}.

• Distributions: suppose A is a subset of R and we wish to measure the probability that X ∈ A. This is given by:

Pr(X ∈ A) = Pr({ω ∈ Ω : X(ω) ∈ A}).

Note that {ω ∈ Ω : X(ω) ∈ A} is an event and so the right-hand side is well defined. The distribution of a random variable X is the collection of all probabilities Pr(X ∈ A) for all subsets A of the real numbers.

— consider the above example again. If A = {1, · · · , 10}, then Pr(X ∈ A) is just the probability that the experiment outcome is a sequence with at least one head, and so

Pr(X ∈ A) = 1 − (1/2)^10.

• Distribution functions: the distribution function (or df ) F of a random variable X is a

function defined for each real number x as follows:

F(x) = Pr(X ≤ x) = Pr({ω ∈ Ω : X(ω) ≤ x}).

It just measures the probability of the event consisting of those outcomes satisfying

X(ω) ≤ x. Sometimes we also call it the cumulative distribution function (or cdf ).

— it is easy to show that F must satisfy the following properties:


∗ if x1 < x2, then F (x1) ≤ F (x2);

∗ limx→−∞ F (x) = 0 and limx→∞ F (x) = 1;

∗ F is continuous from the right, i.e., F (x) = F (x+).

— if the df of a random variable X is known, then we can derive the probability of X

belonging to any interval:

∗ Pr(X > x) = 1− F (x);

∗ Pr(x1 < X ≤ x2) = F (x2)− F (x1);

∗ Pr(X < x) = F (x−);

∗ Pr(X = x) = F(x) − F(x−).
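These identities are easy to check numerically. Below is a minimal sketch (an illustration, not part of the notes) using the standard normal df from scipy.stats; the formulas hold for any df F.

```python
# Checking the interval formulas above with a concrete df.
# Assumption: scipy is available; the standard normal is just an example.
from scipy.stats import norm

F = norm.cdf  # F(x) = Pr(X <= x) for a standard normal X

x1, x2 = -1.0, 1.0
print(1 - F(x2))      # Pr(X > x2) = 1 - F(x2)
print(F(x2) - F(x1))  # Pr(x1 < X <= x2) = F(x2) - F(x1)
# For this continuous X, F(x-) = F(x), so Pr(X = x) = F(x) - F(x-) = 0.
```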

Now we discuss two classes of random variables:

1.2 Discrete random variables

A random variable X is discrete if there are at most countably many possible values for X.

For example, the above random variable counting the number of heads is a discrete one.

• For a discrete random variable having values among {x_i}_{i∈N}, its distribution function can be calculated as

F(x) = Σ_{i: x_i ≤ x} Pr(X = x_i).

Clearly, this df must be a step function and hence discontinuous.

• A discrete random variable can also be characterized by its probability function (or pf )

defined as

f(x) = Pr(X = x)

for x ∈ R. If x is not one of the possible values of X, clearly f(x) = 0. This pf f(x)

just measures the likelihood of each particular outcome x.

— the relationship between the df and pf for a discrete random variable is

f(x) = F(x) − F(x−) or F(x) = Σ_{i: x_i ≤ x} f(x_i).

• An important discrete random variable: the binomial distribution with parameters n and p is represented by the pf

f(x) = C_n^x p^x (1 − p)^{n−x} if x = 0, 1, · · · , n, and f(x) = 0 otherwise.


For example, consider n products produced by a firm. Suppose the probability of each

product being defective is p and these n products are independently produced. Then

f(x) just measures the probability that x of them are defective. From this pf, it is easy

to construct the df.
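As a hedged illustration (the helper names below are our own, not from the notes), the binomial pf and its df can be computed directly from the formula:

```python
# A sketch of the binomial pf f(x) = C(n,x) p^x (1-p)^(n-x) and its df.
from math import comb

def binom_pf(x, n, p):
    """f(x) for x = 0, 1, ..., n; zero otherwise."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_df(x, n, p):
    """F(x) = Pr(X <= x), a step function built by summing the pf."""
    return sum(binom_pf(k, n, p) for k in range(int(x) + 1))

n, p = 10, 0.5
print(1 - binom_pf(0, n, p))  # Pr(at least one head) = 1 - (1/2)^10
print(binom_df(n, n, p))      # the pf sums to 1
```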

1.3 Continuous random variables

A random variable X is continuous if it can take any value on some (bounded or unbounded)

interval. For example, a person's weight next year, tomorrow's temperature, and next month's house price can be regarded as continuous random variables.

• Probability density functions: given the distribution function F of a continuous random variable, we define its probability density function (or pdf ) as a nonnegative function f satisfying

F(x) = ∫_{−∞}^{x} f(t) dt

for all x ∈ R.

— if F is differentiable, then f(x) = F'(x).

— given the probability density function f, we can calculate

Pr(a < X ≤ b) = ∫_{a}^{b} f(x) dx.

— for a continuous random variable with a continuous distribution function, Pr(X = x) = 0 for any x ∈ R.

• An example: the uniform distribution on [a, b]:

f(x) = 1/(b − a) if x ∈ [a, b], and f(x) = 0 otherwise,

and

F(x) = 0 if x < a; F(x) = (x − a)/(b − a) if x ∈ [a, b]; F(x) = 1 if x > b.


1.4 Functions of a random variable

Given the distribution of X, we want to know the distribution of Y = h(X), where h(·) is a function.

• X is a discrete random variable: if g(y) is the probability function of Y, then

g(y) = Pr(Y = y) = Pr[h(X) = y] = Σ_{x: h(x)=y} f(x).

• X is a continuous random variable: if G(y) is the distribution function of Y, then

G(y) = Pr(Y ≤ y) = Pr(h(X) ≤ y) = ∫_{x: h(x)≤y} f(x) dx.

If G(y) is a differentiable function, then the pdf of Y is

g(y) = G'(y).

— example: X is uniformly distributed on [−1, 1] and Y = X². Then for 0 ≤ y ≤ 1,

G(y) = Pr(X² ≤ y) = ∫_{−√y}^{√y} f(x) dx = √y.

For y > 1, G(y) = 1; and for y < 0, G(y) = 0. The pdf of Y on (0, 1] is

g(y) = 1/(2√y).
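A quick Monte Carlo check of this example (a sketch, assuming numpy is available): the df of Y = X² should match G(y) = √y.

```python
# Simulate X uniform on [-1, 1], set Y = X^2, and compare the empirical
# df of Y with G(y) = sqrt(y) at a few points.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
y = x**2

for q in (0.04, 0.25, 0.81):
    print(np.mean(y <= q), np.sqrt(q))  # the two values should be close
```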

— if h is a strictly monotonic function, then the pdf of Y can be directly calculated as

g(y) = f(h⁻¹(y)) |dh⁻¹(y)/dy| = f(h⁻¹(y)) / |h'(h⁻¹(y))|,

where h⁻¹ is the inverse function of h. The second equality follows from the derivative rule for the inverse function. We prove this result when h is strictly increasing (the proof is similar if h is strictly decreasing):

G(y) = Pr[X ≤ h⁻¹(y)] = ∫_{−∞}^{h⁻¹(y)} f(x) dx.


So

g(y) = G'(y) = f(h⁻¹(y)) dh⁻¹(y)/dy.

The second equality uses Leibniz's rule, which we introduced in Chapter 2.

Exercise 1 (i) Suppose that the pdf of a random variable X is

f(x) = x/2 if 0 < x < 2, and f(x) = 0 otherwise.

Determine the df and pdf of the new random variable Y = X(2 − X).

(ii) Suppose the pdf of a random variable X is

f(x) = e^{−x} if x > 0, and f(x) = 0 otherwise.

Determine the pdf of Y = √X.

(iii) Suppose X has a continuous distribution function F, and let Y = F(X). Show that Y has a uniform distribution on [0, 1]. (This transformation from X to Y is called the probability integral transformation.)

1.5 Moments

The distribution of a random variable contains all of the probabilistic information about it.

However, it is usually cumbersome to present the entire distribution. Instead, some summaries

of the distribution can be useful for giving people a rough idea of what the distribution looks like. The most commonly used summaries are the moments of the random variable.

• Expectation

— for a discrete random variable X with pf f having positive values on {x_i}, its expectation is

E(X) = Σ_i x_i f(x_i).

When X has infinitely many values, this series may not converge.¹ We say E(X) exists if and only if

Σ_i |x_i| f(x_i) < ∞.

This condition guarantees that Σ_i x_i f(x_i) converges.

¹For example, if f(n) = 1/(kn²) for n = 1, 2, · · · , where k = Σ_{n=1}^{∞} 1/n² (which converges, as we have confirmed in Chapter 1), then Σ_{n=1}^{∞} n f(n) does not converge.


— for a continuous random variable X with pdf f, its expectation is

E(X) = ∫_{−∞}^{∞} x f(x) dx.

Similarly, this integral may not be well defined.² We say E(X) exists if and only if

∫_{−∞}^{∞} |x| f(x) dx < ∞.

— the expectation of X is also called the expected value of X or the mean of X. It can

be regarded as the center of gravity of the distribution of X, but not necessarily

the central position of the distribution.

— examples:

(i) the expectation of the uniform distribution on [a, b] is

∫_a^b x/(b − a) dx = (a + b)/2.

(ii) the expectation of the binomial distribution is

Σ_{x=0}^{n} x C_n^x p^x (1 − p)^{n−x} = np.

— some properties of the expectation (we assume all expectations exist):

∗ for scalars a and b,

E(a+ bX) = a+ bE(X).

∗ for two random variables X1 and X2, we have

E(X1 +X2) = E(X1) +E(X2).

∗ if h(X) = h1(X) + h2(X) is a function of X, then E(h(X)) = E(h1(X)) +

E(h2(X)).

∗ E[h(X)] = ∫_{−∞}^{∞} h(x) f(x) dx, but E[h(X)] is in general not equal to h[E(X)] except when h is a linear function.²

²For example, for the Cauchy distribution, which has pdf

f(x) = 1/(π(1 + x²)),

one can verify that ∫_{−∞}^{∞} x f(x) dx does not exist.


• Variance

— the variance of a distribution is given by:

Var(X) = E[(X − E(X))²] = E(X²) − [E(X)]².

It is also often denoted by σ². (The variance of a distribution may also fail to exist.) The square root of the variance is called the standard deviation and is often denoted by σ = √Var(X).

— it measures the spread or dispersion of the distribution around its mean.

— examples:

(i) the variance of the uniform distribution on [a, b] is

∫_a^b x²/(b − a) dx − ((a + b)/2)² = (a² + ab + b²)/3 − ((a + b)/2)² = (b − a)²/12.

(ii) the variance of the binomial distribution is np(1 − p).

— some properties of the variance:

∗ Var(c) = 0 where c is a constant;

∗ Var(aX + b) = a²Var(X) where a and b are scalars.

• Two useful inequalities:

— Markov Inequality: suppose X is a random variable with Pr(X ≥ 0) = 1. Thenfor any real number t > 0,

Pr(X ≥ t) ≤ E(X)

t.

∗ this result can help bound the probability distribution of a random variable when only its mean is known.

— Chebyshev Inequality: suppose X is a random variable and Var(X) exists. Then for any real number t > 0,

Pr(|X − E(X)| ≥ t) ≤ Var(X)/t².

∗ this follows from the Markov inequality by noting that |X − E(X)| ≥ t ⇐⇒ [X − E(X)]² ≥ t².


∗ this result just says that realizations of a random variable become less likely the farther they are from the mean.

∗ see more applications of these two results in Section 4.8 of DeGroot and Schervish (2002).
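A small simulation sketch of both inequalities (our own illustration; the exponential distribution with mean 1, for which E(X) = Var(X) = 1, is chosen arbitrarily):

```python
# Compare exact tail frequencies with the Markov and Chebyshev bounds.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)  # E(X) = 1, Var(X) = 1

for t in (2.0, 3.0, 5.0):
    print(t, np.mean(x >= t), 1.0 / t)                   # Markov: <= E(X)/t
    print(t, np.mean(np.abs(x - 1.0) >= t), 1.0 / t**2)  # Chebyshev: <= Var(X)/t^2
```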

• Higher order moments

— the kth moment of X is E(X^k)

∗ the mean of X is just the first-order moment

∗ again, the kth moment may fail to exist for some distributions. We say the kth moment exists if and only if E(|X^k|) < ∞

∗ if E(|X^k|) < ∞ for some positive integer k, then E(|X^j|) < ∞ for any positive integer j < k

— the kth central moment of X is E[(X − E(X))^k]

∗ the variance of X is just the second-order central moment

• Moment generating functions

The moment generating function (or mgf ) of a random variable X is

ψ(t) = E(e^{tX}).

If ψ(t) exists for all values of t in an open interval around t = 0, then we have

ψ^{(n)}(0) = E(X^n).

That is, the nth-order derivative of the mgf of X evaluated at t = 0 is just the nth moment of X. Thus, the mean is ψ'(0) and the variance is ψ''(0) − [ψ'(0)]². In many cases, using the mgf to compute moments is more convenient than using the definition directly.

— example: the pdf of X is

f(x) = e^{−x} if x > 0, and f(x) = 0 otherwise.

Compute the mean and variance of X.

ψ(t) = ∫_0^∞ e^{tx} e^{−x} dx = ∫_0^∞ e^{(t−1)x} dx = 1/(1 − t)


for t < 1. So ψ(t) exists for t in an open interval around t = 0. Since

ψ'(t) = 1/(1 − t)² and ψ''(t) = 2/(1 − t)³,

it is easy to show that E(X) = 1 and Var(X) = 1.
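The differentiation can also be done symbolically; a sketch assuming sympy is available:

```python
# Recover the mean and variance of the example by differentiating the mgf.
import sympy as sp

t = sp.symbols('t')
psi = 1 / (1 - t)                       # the mgf derived above

mean = sp.diff(psi, t).subs(t, 0)       # psi'(0) = E(X)
second = sp.diff(psi, t, 2).subs(t, 0)  # psi''(0) = E(X^2)
print(mean, second - mean**2)           # prints 1 1
```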

— an important result: if the mgfs of two random variables X and Y are identical for all values of t in an open interval around t = 0, then the probability distributions of X and Y must be identical.

• Quantile and median

The p-quantile of a distribution is the value x that divides the distribution into two parts, one with probability p and the other with probability 1 − p. More precisely, if a random variable's distribution function is F, then its p-quantile is the smallest x such that F(x) ≥ p. In particular, the 0.5-quantile is called the median. That is, the median of a distribution divides it into two parts, each with equal probability.

— examples:

(a) if Pr(X = 1) = 0.1, Pr(X = 2) = 0.4, Pr(X = 3) = 0.3, and Pr(X = 4) = 0.2,

then the median is 2, and the 0.8-quantile is 3.

(b) if a continuous random variable has the pdf

f(x) = 1/2 for 0 ≤ x ≤ 1; f(x) = 1 for 2.5 ≤ x ≤ 3; f(x) = 0 otherwise,

then the median is 1, and the 0.4-quantile is 0.8.

— in some cases, the median can reflect the “average” value of a random variable X

better than the mean. For example, if Pr(X = 10) = 0.99 and Pr(X = 10000) =

0.01, then the mean of X is 109.9, which is much higher than 10, but its median is 10, which is close to the value of X most of the time.

— the median minimizes the mean absolute error E(|X − d|) over d, while the mean minimizes the mean square error E[(X − d)²].

— given the pdf f(x), the value of x at which f(x) is maximized is called the mode of the distribution.

Exercise 2 (i) Let X be a random variable that can take only the values 0, 1, 2, · · · . Show

E(X) = Σ_{n=0}^{∞} n Pr(X = n) = Σ_{n=1}^{∞} Pr(X ≥ n).


(ii) Prove that the variance of the binomial distribution is np(1 − p).

(iii) Let X have the discrete uniform distribution on the integers 1, · · · , n. Compute the variance of X. (You may wish to use the formula Σ_{k=1}^{n} k² = n(n + 1)(2n + 1)/6.)

2 Bivariate Distributions

In many cases, we need more than one random variable to describe an experiment. This part

considers the bivariate case. Let (X, Y) be a pair of random variables. We first study their

joint distribution.

2.1 Joint distributions

• The discrete case: if both X and Y are discrete random variables, the joint probability function is

f(x, y) = Pr(X = x, Y = y).

If (x, y) is not a possible value of (X, Y), then f(x, y) = 0. The pf is always non-negative and satisfies

Σ_{x,y} f(x, y) = 1.

The joint distribution function is now

F(x, y) = Σ_{x_i ≤ x, y_j ≤ y} f(x_i, y_j).

• The continuous case: if X and Y are continuous random variables, the joint distribution function is

F(x, y) = Pr(X ≤ x, Y ≤ y)

for any (x, y) ∈ R². It is nondecreasing in each argument and satisfies

lim_{x→−∞, y→−∞} F(x, y) = 0 and lim_{x→∞, y→∞} F(x, y) = 1.

The joint probability density function is a nonnegative function f defined on R² such that

F(a, b) = ∫_{−∞}^{b} ∫_{−∞}^{a} f(x, y) dx dy

for any (a, b) ∈ R². If F(x, y) is twice differentiable, then the pdf is

f(x, y) = ∂²F(x, y)/∂x∂y.


And we can calculate

Pr(a < X ≤ b, c < Y ≤ d) = ∫_c^d ∫_a^b f(x, y) dx dy.

• Example: suppose the joint pdf of X and Y is

f(x, y) = cx²y for x² ≤ y ≤ 1, and f(x, y) = 0 otherwise.

Determine the value of c and then calculate Pr(X ≥ Y).

First of all, f must satisfy

∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1,

which implies

c ∫_{−1}^{1} ∫_{x²}^{1} x²y dy dx = (c/2) ∫_{−1}^{1} x²(1 − x⁴) dx = 1.

It is easy to solve that

c = 21/4.

The probability is then

Pr(X ≥ Y) = (21/4) ∫_{0}^{1} ∫_{x²}^{x} x²y dy dx = 3/20.
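Both numbers can be cross-checked by numerical integration; a sketch assuming scipy is available (dblquad integrates func(y, x), with y running over the inner limits):

```python
# Verify c = 21/4 and Pr(X >= Y) = 3/20 for the joint pdf above.
from scipy.integrate import dblquad

# Integral of x^2*y over the support x^2 <= y <= 1, -1 <= x <= 1.
total, _ = dblquad(lambda y, x: x**2 * y, -1, 1, lambda x: x**2, lambda x: 1)
print(21 / 4 * total)  # ~1.0, so c = 21/4 normalizes the pdf

# For X >= Y, the region x^2 <= y <= x is nonempty only for 0 <= x <= 1.
prob, _ = dblquad(lambda y, x: 21 / 4 * x**2 * y, 0, 1,
                  lambda x: x**2, lambda x: x)
print(prob)  # ~0.15 = 3/20
```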

• The expectation of a function of two random variables:

E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) f(x, y) dx dy.

Exercise 3 Suppose that the joint pdf of X and Y is

f(x, y) = c(x² + y) for 0 ≤ y ≤ 1 − x², and f(x, y) = 0 otherwise.

Determine the value of c and then calculate Pr(Y ≤ X + 1).

2.2 Marginal distributions

Given the joint distribution function of X and Y , we want to know the distribution function

of each random variable. This is called the marginal distribution. In general, given F (x, y),

the marginal distribution function of X is

F1(x) = Pr(X ≤ x, Y ≤ ∞),

and that of Y is

F2(y) = Pr(X ≤ ∞, Y ≤ y).


• The discrete case: the marginal probability function of X is

f1(x) = Σ_y f(x, y),

and that of Y is

f2(y) = Σ_x f(x, y).

Then, the marginal distribution function of X is

F1(x) = Σ_{x_i ≤ x} f1(x_i),

and that of Y is

F2(y) = Σ_{y_j ≤ y} f2(y_j).

• The continuous case: the marginal distribution function of X is

F1(x) = Pr(X ≤ x, Y ≤ ∞) = ∫_{−∞}^{∞} ∫_{−∞}^{x} f(x, y) dx dy,

and the marginal probability density function of X is

f1(x) = ∫_{−∞}^{∞} f(x, y) dy.

Similarly,

F2(y) = Pr(X ≤ ∞, Y ≤ y) = ∫_{−∞}^{∞} ∫_{−∞}^{y} f(x, y) dy dx

and

f2(y) = ∫_{−∞}^{∞} f(x, y) dx.

• Example: suppose the joint pdf of X and Y is

f(x, y) = 1 for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise.

Then the marginal pdf of X is

f1(x) = ∫_0^1 f(x, y) dy = 1

for x ∈ [0, 1], and the marginal df of X is

F1(x) = x

on [0, 1].


• Although the marginal distribution of X and Y can be derived from their joint distribu-

tion, it is usually impossible to reconstruct their joint distribution from their marginal

distributions without additional information. (See the exceptional case below where two

random variables are independent.)

• The moments of each random variable can be calculated by using its marginal distribu-

tion. Since it is very straightforward, we will not present the details.

2.3 Conditional distributions

We have encountered the concept of conditional probability before. We now apply it to

distribution functions.

Suppose we know the joint distribution of a pair of random variables (X, Y). In general, we can derive the revised probability of Y ∈ B conditional on having learned that X ∈ A as follows:

Pr(Y ∈ B|X ∈ A) = Pr(Y ∈ B, X ∈ A) / Pr(X ∈ A)

if Pr(X ∈ A) > 0. Both the numerator and the denominator can be computed from the joint distribution of X and Y. From now on, we focus on conditional distribution functions (i.e., B has the form {Y ≤ y} and A is a singleton set).

• The discrete case: given the joint probability function f(x, y), the probability function of Y conditional on X = x is

f2(y|x) ≡ Pr(Y = y|X = x) = Pr(Y = y, X = x) / Pr(X = x) = f(x, y) / f1(x).

It measures the revised probability of Y = y conditional on X = x. Then the distribution function of Y conditional on X = x is

F2(y|x) = Σ_{y_j ≤ y} f(x, y_j) / f1(x).

The conditional distribution function of X can be similarly derived.

The conditional distribution function of X can be similarly derived.

• The continuous case: since Pr(X = x) = 0 for a continuous random variable, we derive the conditional distribution of Y in the following way:

Pr(Y ≤ y|x < X ≤ x + ∆) = [F(x + ∆, y) − F(x, y)] / [F1(x + ∆) − F1(x)].


Then we divide both the numerator and the denominator by ∆ and let ∆ tend to zero. This limit operation yields the conditional distribution function of Y:

F2(y|x) ≡ Pr(Y ≤ y|X = x) = [∂F(x, y)/∂x] / [dF1(x)/dx] = [∂F(x, y)/∂x] / f1(x).

Then the conditional probability density function of Y is

f2(y|x) = ∂F2(y|x)/∂y = f(x, y) / f1(x)

whenever F2(y|x) is differentiable with respect to y.

• In either case, we have

f(x, y) = f2(y|x) f1(x) = f1(x|y) f2(y).

That is, if we know the marginal pdf and the conditional pdf, then we can reconstruct the joint pdf. Furthermore, we also have

f1(x|y) = f2(y|x) f1(x) / f2(y) = f2(y|x) f1(x) / ∫_x f2(y|x) f1(x) dx.

(In the discrete case, the integral in the denominator should be replaced by a sum.) This is Bayes' Theorem for random variables.

Exercise 4 Suppose the joint pdf of X and Y is

f(x, y) = (3/16)(4 − 2x − y) for x > 0, y > 0 and 2x + y < 4, and f(x, y) = 0 otherwise.

Determine the conditional pdf of Y for every given value of X, and compute Pr(Y ≥ 2|X = 0.5).

2.4 Conditional moments

Our exposition is for continuous random variables, but all results also hold for discrete ones.

Consider X and Y with the joint pdf f(x, y).

• Conditional expectation:


— the conditional expectation of Y given X = x is

E(Y|x) = ∫_{−∞}^{∞} y f2(y|x) dy,

where f2(y|x) is the conditional pdf of Y. When x changes, this conditional expectation will also change.

— the conditional expectation of Y given X, denoted by E(Y|X), is a function of X and so a random variable. If h(x) ≡ E(Y|x), then E(Y|X) = h(X), and its distribution can be derived from X's marginal distribution according to this functional relationship.

— then if all related expectations exist, we have

E[E(Y|X)] = ∫_{−∞}^{∞} E(Y|x) f1(x) dx = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f2(y|x) f1(x) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f(x, y) dx dy = E(Y),

where the second step uses the definition of E(Y|x). This result is called the law of iterated expectations.

— similarly, E [E(r(X,Y )|X)] = E(r(X,Y )) for any function r.
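A simulation sketch of the law of iterated expectations, under a toy model of our own choosing (X standard normal and Y | X = x normal with mean x):

```python
# E[E(Y|X)] = E(Y): here E(Y|X) = X, so E(X) and E(Y) should agree.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=1_000_000)
y = rng.normal(x, 1.0)         # draw Y from its conditional distribution

print(np.mean(x), np.mean(y))  # both close to 0, up to sampling error
```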

• Conditional variance:

— the conditional variance of Y given X = x is

Var(Y|x) = E{[Y − E(Y|x)]² | x} = E(Y²|x) − [E(Y|x)]².

Exercise 5 (i) Prove (a) if E(Y|X) = 0, then E(Y) = 0; and (b) if E(Y|X) = 0, then E(XY) = 0.

(ii) Suppose the distribution of X is symmetric with respect to the point x = 0 and all moments of X exist. Suppose E(Y|X) = aX + b for constants a and b. Show that X^{2m} and Y are uncorrelated for m = 1, 2, · · · .

(iii) Show that

Var(Y) = E[Var(Y|X)] + Var[E(Y|X)].


2.5 Independent random variables

• Two random variables X and Y are independent iff, for any two subsets A and B of R, we have

Pr(X ∈ A and Y ∈ B) = Pr(X ∈ A) Pr(Y ∈ B).

• This statement is equivalent to: two random variables X and Y are independent iff

F(x, y) = F1(x)F2(y)

or

f(x, y) = f1(x)f2(y)

or

f1(x|y) = f1(x)

if f2(y) > 0.

— the last statement just says that knowing the realized value of Y does not change our probability judgment of X (and vice versa).

— the last statement also indicates that if X and Y are independent random variables, then the set of all (x, y) pairs where f(x, y) > 0 should be rectangular.

• Example: suppose the joint pdf of X and Y is

f(x, y) = 2e^{−(x+2y)} for x ≥ 0 and y ≥ 0, and f(x, y) = 0 otherwise.

Are X and Y independent of each other? It is easy to calculate that f1(x) = e^{−x} for x ≥ 0 and f1(x) = 0 for x < 0; and f2(y) = 2e^{−2y} for y ≥ 0 and f2(y) = 0 for y < 0. Thus, f(x, y) = f1(x)f2(y), and so X and Y are indeed independent.³

• Properties: if X and Y are two independent random variables, then

— E(XY) = E(X)E(Y)

— Var(aX + bY) = a²Var(X) + b²Var(Y)

— E(X|Y) = E(X) and E(Y|X) = E(Y), where

E(X|Y = y) = ∫_{−∞}^{+∞} x f1(x|y) dx,

E(Y|X = x) = ∫_{−∞}^{+∞} y f2(y|x) dy.

³In effect, two continuous random variables are independent iff f(x, y) = g1(x)g2(y) for all x and y, where the g_i are nonnegative functions. That is, the joint pdf can be factorized into the product of a nonnegative function of x and a nonnegative function of y.


— h(X) and g(Y) are also independent for any two functions h and g, and so E[h(X)g(Y)] = E[h(X)]E[g(Y)]

— if ψ_x and ψ_y are the mgfs of X and Y, respectively, then the mgf of Z = X + Y is ψ_z = ψ_x ψ_y
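The first two properties are easy to see in a simulation; a sketch reusing the independent pair from the example above (X exponential with f1(x) = e^{-x}, Y exponential with f2(y) = 2e^{-2y}):

```python
# Monte Carlo check of E(XY) = E(X)E(Y) and
# Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) for independent X, Y.
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=1_000_000)  # pdf e^{-x}
y = rng.exponential(scale=0.5, size=1_000_000)  # pdf 2e^{-2y}

print(np.mean(x * y), np.mean(x) * np.mean(y))  # ~0.5 on both sides
a, b = 2.0, 3.0
print(np.var(a * x + b * y), a**2 * np.var(x) + b**2 * np.var(y))
```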

Exercise 6 (i) Suppose the joint pdf of X and Y is

f(x, y) = kx²y² for x² + y² ≤ 1, and f(x, y) = 0 otherwise.

Show that X and Y are not independent.

(ii) Suppose X1 and X2 are two independent variables and their mgfs are ψ1(t) and ψ2(t), respectively. Let Y = X1 + X2 and let its mgf be ψ(t). Show that, if all mgfs exist, then

ψ(t) = ψ1(t)ψ2(t).

(iii) Let (X, Y, Z) be independent random variables such that:

E(X) = −1 and Var(X) = 2,
E(Y) = 0 and Var(Y) = 3,
E(Z) = 1 and Var(Z) = 4.

Let

T = 2X + Y − 3Z + 4,
U = (X + Z)(Y + Z).

Find E(T), Var(T), E(T²) and E(U).

2.6 Covariance and correlations

These two concepts are used to measure how much two random variables X and Y depend on

each other. Let E(X) and E(Y ) be the expectations of X and Y, respectively. (Notice that

they are calculated by using X and Y ’s marginal distributions.)

• The covariance of X and Y:

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y).

— if Var(X) < ∞ and Var(Y) < ∞, then Cov(X, Y) will exist and be finite.


— the sign of the covariance indicates the direction of covariation of X and Y . But

its magnitude is also influenced by the overall magnitudes of X and Y .

• The correlation of X and Y:

ρ(X, Y) = Cov(X, Y) / √(Var(X)Var(Y))

whenever both variances are nonzero.

— ρ is between −1 and 1.⁴

— X and Y are said to be positively correlated if ρ(X, Y) > 0; they are negatively correlated if ρ < 0; and they are uncorrelated if ρ = 0.

• Properties:

— if X and Y are independent, then Cov(X, Y) = 0 and ρ(X, Y) = 0. But the converse of this statement is not true.⁵

— if Y = aX + b for some constants a and b, then ρ(X, Y) = 1 if a > 0 and ρ(X, Y) = −1 if a < 0. The converse is also true.

— the correlation only measures the linear relationship between X and Y. A large |ρ| means that X and Y are close to being linearly related and hence are closely related. But when |ρ| is small, X and Y could still be closely related according to some nonlinear relationship. (See the example in footnote 5.)

— if both Var(X) and Var(Y) are finite, then

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y). (1)

Furthermore,

Var(aX + bY) = a²Var(X) + b²Var(Y) + 2abCov(X, Y),

and

Var(Σ_{i=1}^{n} X_i) = Σ_{i=1}^{n} Var(X_i) + Σ_{i≠j} Cov(X_i, X_j).
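Formula (1) is easy to verify on simulated data; a sketch with an arbitrarily chosen correlated pair:

```python
# Numerical check of Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1_000_000)
y = 0.5 * x + rng.normal(size=1_000_000)  # correlated with x by construction

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
print(lhs, rhs)  # agree up to sampling error
```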

⁴This result is based on the Schwarz inequality [E(XY)]² ≤ E(X²)E(Y²) for any two random variables. If the right-hand side is finite, then the equality will hold iff there are constants a and b such that aX + bY = 0 with probability 1.

⁵That is, even if two random variables are uncorrelated, they can be dependent. For example, X is a discrete random variable with Pr(X = 1) = Pr(X = 0) = Pr(X = −1) = 1/3, and Y = X². They are clearly dependent, but one can check that Cov(X, Y) = 0.


Exercise 7 (i) Suppose that the pair (X, Y) is uniformly distributed on the interior of the circle of radius 1. Compute Cov(X, Y).

(ii) Suppose X has the uniform distribution on the interval [−2, 2] and Y = X⁶. Show that X and Y are uncorrelated.

(iii) Prove the result (1), and

Cov(aX + bY + c, Z) = aCov(X, Z) + bCov(Y, Z),

where a, b, c are constants and all covariances exist.

(iv) Suppose that X and Y have the same variance, and the variances of X + Y and X − Y also exist. Show that X + Y and X − Y are uncorrelated.

2.7 Multivariate distributions

All of the above concepts and results can be readily extended to the case with more than two

random variables.

• Let X = [X1 X2 · · · Xn]^T be an n × 1 column vector of random variables.

• The joint distribution function is:

F(x) = Pr(X ≤ x) = Pr(X1 ≤ x1, X2 ≤ x2, · · · , Xn ≤ xn).

• The joint pdf in the continuous case is:

f(x) = ∂ⁿF(x) / (∂x1 ∂x2 · · · ∂xn).

• The marginal pdf of the first k random variables is:

f_{1,···,k}(x1, · · · , xk) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f(x1, · · · , xn) dx_{k+1} · · · dx_n,

where the integral is (n − k)-fold.

• Without loss of generality, the joint pdf of the last n − k random variables conditional on the first k < n random variables' realized values is

f(x1, · · · , xn) / f_{1,···,k}(x1, · · · , xk).

• The n random variables are independent iff

F(x1, · · · , xn) = F1(x1) · · · Fn(xn)

or

f(x1, · · · , xn) = f1(x1) · · · fn(xn).


• Some of the most important moments are the following:

— expectation:

E(X) = (E(X1) · · · E(Xn))^T,

where each of the expectations inside the vector is computed using the corresponding marginal distribution. For example:

E(X1) = ∫_{−∞}^{+∞} x1 f1(x1) dx1.

— covariance matrix:

Σ ≡ Var(X) = E[(X − E(X))(X − E(X))^T]

⎛ σ11 · · · σ1n ⎞
⎜  ⋮   ⋱   ⋮  ⎟ = Σ
⎝ σn1 · · · σnn ⎠

= E(XX^T) − E(X)E(X)^T.

— for a constant column vector a,

Var(a^T X) = E[(a^T X − a^T E(X))²] = E[{a^T (X − E(X))}²] = E[a^T (X − E(X))(X − E(X))^T a] = a^T Σ a.

This is a quadratic form. Since the variance is always non-negative by definition, it yields a^T Σ a ≥ 0 for any nonzero a. That is, the covariance matrix Σ is positive semidefinite.

— for a constant matrix A, we have

Var(AX) = AΣA^T.

2.8 Functions of multiple random variables

We focus on the case with continuous random variables.

• Suppose the joint pdf of n random variables X1, · · · , Xn is f(x1, · · · , xn), and a new random variable is constructed as Y = h(X1, · · · , Xn). What, then, is the pdf of Y? We can compute the df of Y first:

G(y) = Pr(Y ≤ y) = ∫ · · · ∫_{A(y)} f(x1, · · · , xn) dx1 · · · dxn


where A(y) = {(x1, · · · , xn) ∈ Rⁿ : h(x1, · · · , xn) ≤ y}. If G(y) is differentiable, we can derive g(y) = G'(y).

• Example: suppose n independent random variables X1, · · · , Xn share the same distribution F, which is differentiable and has density function f. Let Ymax = max{X1, · · · , Xn} and Ymin = min{X1, · · · , Xn}. Determine the pdfs of Ymax and Ymin.

Gmax(y) = Pr(Ymax ≤ y) = Pr(X1 ≤ y, · · · , Xn ≤ y) = F(y)ⁿ,

and so gmax(y) = nF(y)^{n−1} f(y).

Gmin(y) = Pr(Ymin ≤ y) = 1 − Pr(Ymin > y) = 1 − [1 − F(y)]ⁿ,

and so gmin(y) = n[1 − F(y)]^{n−1} f(y).
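A Monte Carlo sketch of this example for uniform [0, 1] draws, where F(y) = y on [0, 1]:

```python
# Compare the empirical dfs of Ymax and Ymin with F(y)^n and 1 - (1-F(y))^n.
import numpy as np

rng = np.random.default_rng(5)
n = 5
samples = rng.uniform(size=(1_000_000, n))
ymax = samples.max(axis=1)
ymin = samples.min(axis=1)

y = 0.5
print(np.mean(ymax <= y), y**n)            # G_max(y) = F(y)^n
print(np.mean(ymin <= y), 1 - (1 - y)**n)  # G_min(y) = 1 - (1 - F(y))^n
```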

Exercise 8 (i) Revisit the above example on Ymax and Ymin. Derive the joint pdf of (Ymax, Ymin).

(ii) Suppose X1 and X2 are two independent random variables, each distributed uniformly over [0, 1]. Find the pdf of Y = X1 + X2.

• Now consider the case with n new random variables:

Y1 = h1(X1, · · · , Xn) (2)
⋮
Yn = hn(X1, · · · , Xn).

We want to derive the joint pdf of Y1, · · · , Yn.

To do that, we need assumptions about the functions h_i. If S is the subset of Rⁿ such that Pr((X1, · · · , Xn) ∈ S) = 1 and T is the subset of Rⁿ such that Pr((Y1, · · · , Yn) ∈ T) = 1, we assume that the transformation from S to T defined by the h_i is a one-to-one correspondence. That is, given a point (y1, · · · , yn) in T, we have a unique preimage (x1, · · · , xn) in S. With this assumption, we can solve (2) in terms of

X1 = s1(Y1, · · · , Yn) (3)
⋮
Xn = sn(Y1, · · · , Yn).


Construct the determinant

J = det ⎡ ∂s1/∂y1 · · · ∂s1/∂yn ⎤
        ⎢    ⋮      ⋱      ⋮   ⎥
        ⎣ ∂sn/∂y1 · · · ∂sn/∂yn ⎦

for every point (y1, · · · , yn) ∈ T. We call it the Jacobian of the transformation in (3). Then the joint pdf of the n new random variables is

g(y1, · · · , yn) = f(s1, · · · , sn)|J| for (y1, · · · , yn) ∈ T, and g(y1, · · · , yn) = 0 otherwise,

where |J| is the absolute value of the Jacobian.⁶

• Example: suppose the joint pdf of X1 and X2 is

f(x1, x2) = 4x1x2 for 0 < x1, x2 < 1, and f(x1, x2) = 0 otherwise.

Let Y1 = X1/X2 and Y2 = X1X2. Find the joint pdf of Y1 and Y2.

It is easy to see that y1 > 0 and y2 ∈ (0, 1). For each pair of such y1 and y2, we can derive

x1 = √(y1y2),  x2 = √(y2/y1).

Then the Jacobian is

J = det ⎡ (1/2)√(y2/y1)       (1/2)√(y1/y2) ⎤
        ⎣ −(1/(2y1))√(y2/y1)  1/(2√(y1y2))  ⎦ = 1/(2y1).

Therefore,

g(y1, y2) = 2y2/y1 for y1 > 0 and 0 < y2 < 1, and g(y1, y2) = 0 otherwise.
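A Monte Carlo cross-check of this example. Since f(x1, x2) = (2x1)(2x2) factorizes, X1 and X2 are independent, each with df x² on (0, 1), so each can be sampled as √U with U uniform (the probability integral transformation from Exercise 1). The analytic value Pr(Y2 ≤ 1/2) = 1/4 + (ln 2)/2 below is our own side computation, not from the notes.

```python
# Sample (X1, X2), map to (Y1, Y2) = (X1/X2, X1*X2), and sanity-check Y2.
import numpy as np

rng = np.random.default_rng(6)
x1 = np.sqrt(rng.uniform(size=1_000_000))  # pdf 2x on (0, 1)
x2 = np.sqrt(rng.uniform(size=1_000_000))
y2 = x1 * x2

print(np.mean(y2), (2 / 3) ** 2)                   # E(Y2) = E(X1)E(X2) = 4/9
print(np.mean(y2 <= 0.5), 0.25 + 0.5 * np.log(2))  # Pr(Y2 <= 1/2) ≈ 0.5966
```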

• A technique for the sum (or difference) of two random variables:

Suppose we want to know the pdf of Y = X1 +X2 or X1 −X2. Sometimes it is quite

hard to calculate G(y) = Pr(X1 +X2 ≤ y) or Pr(X1 −X2 ≤ y). In that case, we can

introduce another new random variable Z = X2. Then we first derive the joint pdf of

(Y,Z) and then find the marginal distribution of Y .

⁶In particular, if Y = AX, where X and Y are vectors of random variables and A is an n × n nonsingular matrix, then

g(y) = f(A⁻¹y) / |det A|.


Exercise 9 Suppose that X1 and X2 are independent and share the same pdf

f(x) = e^{−x} for x > 0, and f(x) = 0 otherwise.

Find the pdf of Y = X1 − X2.
