
Statistics

Joe Erickson


Contents

1 Introduction to Statistics
  1.1 Basic Definitions
  1.2 Data and Measurement
  1.3 Experimental Design

2 Describing Data with Tables & Graphs
  2.1 Frequency Distributions and Their Tables
  2.2 Graphs of Frequency Distributions
  2.3 Stem-and-Leaf and Dot Plots
  2.4 Pie and Pareto Charts
  2.5 Graphing Paired Data Sets

3 Numerical Data Descriptors
  3.1 Measures of Center
  3.2 Measures of Dispersion
  3.3 Measures of Position

4 Probability Theory
  4.1 Fundamental Counting Principle
  4.2 Definition of Probability
  4.3 Conditional Probability
  4.4 The Multiplicative and Additive Laws
  4.5 Combinations and Permutations
  4.6 Combinatorial Probability

5 Discrete Probability Distributions
  5.1 Discrete Random Variables
  5.2 Expected Value: Discrete Case
  5.3 The Binomial Distribution
  5.4 The Geometric Distribution
  5.5 The Hypergeometric Distribution
  5.6 The Poisson Distribution

6 Continuous Probability Distributions
  6.1 Continuous Random Variables
  6.2 Expected Value: Continuous Case
  6.3 The Uniform Distribution
  6.4 The Normal Distribution
  6.5 The Gamma Distribution
  6.6 The Beta Distribution


1 Introduction to Statistics

1.1 – Basic Definitions

The notion of a variable is taken to be understood by the reader who has had at least one course in basic algebra. The values assumed by a variable X, without regard to order but counting repetitions of the same value, are called data (singular: datum). Any collection of data is a data set, and if a data set consists of the values x1, x2, . . . , xn (counting repetitions of the same value), then the data set may be denoted by

[[x1, x2, . . . , xn]].

The range of a variable X is the set of all possible distinct values that X can assume. A very important point: a given data set associated with a variable X may feature some values in the range of X multiple times, while other values do not appear at all.

A “data set” is not really a “set” in the usual mathematical sense, but rather a multiset. The difference between sets and multisets is simple: sets recognize neither repetition nor order among the objects they contain, whereas multisets do recognize repetitions of objects but are still blind to order. For example, {1, 2}, {1, 1, 1, 2, 2}, and {2, 1} are all considered to be the same set: namely, the set containing precisely the elements 1 and 2, and we write

{1, 2} = {1, 1, 1, 2, 2} = {2, 1}.

In contrast, [[1, 2]] and [[1, 1, 1, 2, 2]] are very different data sets: [[1, 2]] indicates a variable X assumed the value 1 once, and the value 2 once (in no particular order); [[1, 1, 1, 2, 2]] indicates that X assumed the value 1 three times, and the value 2 twice (again in no particular order). We write

[[1, 2]] ≠ [[1, 1, 1, 2, 2]].

On the other hand, [[1, 1, 1, 2, 2]] and [[2, 1, 2, 1, 1]] are considered the same data set, since order still does not matter to a multiset! We write

[[1, 1, 1, 2, 2]] = [[2, 1, 2, 1, 1]].


Example 1.1. Let X represent the score that turns up when a six-sided die is rolled. Then the range of X equals the set {1, 2, 3, 4, 5, 6}. Now, by rolling the die at least once, X can be used to generate a data set. So for instance we may roll the die sixteen times and generate the following data set:

D = [[2, 2, 1, 5, 6, 1, 4, 2, 6, 1, 6, 4, 2, 6, 1, 2]].

Note that the data set D counts repetitions of the same value. In this case we see that not all of the values in the range of X appear in D (3 is missing), while other values in the range of X appear multiple times in D.

Again, data sets are not sensitive to order. If we rearrange the data set D in ascending order,

[[1, 1, 1, 1, 2, 2, 2, 2, 2, 4, 4, 5, 6, 6, 6, 6]],

it is considered to be the same data set (i.e. equal to D).
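Since a data set is a multiset, a convenient computer representation stores each distinct value together with its frequency. The following short Python sketch (an illustration, not part of the original notes) uses collections.Counter to show that the data set D of Example 1.1 is insensitive to order but sensitive to repetition.

    from collections import Counter

    # The data set D from Example 1.1, and the same values listed in a different order.
    D = [2, 2, 1, 5, 6, 1, 4, 2, 6, 1, 6, 4, 2, 6, 1, 2]
    D_rearranged = sorted(D)

    # A Counter is a multiset: it records repetitions but ignores order.
    print(Counter(D) == Counter(D_rearranged))   # True: the same data set
    print(D == D_rearranged)                     # False: lists, unlike multisets, are ordered
    print(Counter(D))                            # frequencies, e.g. the value 2 occurs 5 times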

Definition 1.2. Statistics is the study of the collection, organization, presentation, analysis, and interpretation of data.

There are two types of data sets associated with any variable, and the differences between these two types strike to the heart of what statistics is all about.

Definition 1.3. Let X be a variable. A population for X is a data set associated with X that contains precisely all of the data that is of interest. A sample for X is any data set associated with X that is contained within a population for X.

Example 1.4. Let X be a variable whose range equals the set

{Stein, Trump, Johnson, Clinton, Other}.

Thus the values of X can be taken to represent the response given when someone is asked who they would like to see elected U.S. President in 2016. What we wish to do is predict who will win the election before the election happens. This naturally requires the gathering of data. To gather the best data possible, we need to rigorously define the population for X. Clearly the population for X cannot be the survey responses of everyone in the world, for not everyone in the world is eligible to vote in U.S. elections. The population for X cannot even be the survey responses of all U.S. citizens of voting age, for not all such individuals bother to vote. Arguably the population for X should be taken to be the survey responses of all eligible U.S. voters who actually intend to vote on election day.

Of course, there is a problem: it is not practical to take a survey of all the over 125,000,000 eligible U.S. voters who intend to vote on election day. Any polling firm, therefore, must instead gather a sample for X, usually comprised of a few hundred or a few thousand survey responses. The challenge is to get a sample that accurately reflects the population as a whole. It is a challenge that will likely keep statisticians comfortably employed until the end of history. It bears repeating: the differences between a population and a sample strike to the heart of what statistics is all about.


In Example 1.4 we could say, somewhat less formally, that the “population” is the set of all eligible U.S. voters who intend to vote on election day, and any “sample” is some subset of the “population.” In this way the population for X in the example becomes a collection of human individuals rather than a data set comprised of values in the range of X. There is nothing wrong with taking this approach, since there is a direct correspondence between each human individual in the “population” and a value in the range of X.

There are two branches of statistics. Descriptive statistics is the branch concerned with the organization and presentation of data, while inferential statistics is concerned with using sample data to draw conclusions about a population. What is mentioned in the definition of statistics which is not mentioned in the definitions of descriptive and inferential statistics? The business of collecting data. Collecting good data is tremendously important, but is neither descriptive nor inferential.

Any numerical description of a population characteristic is called a parameter, while any numerical description of a sample characteristic is a statistic. Common sorts of parameters and statistics are means, medians, modes, standard deviations, percentages of a total, and the like.

1.2 – Data and Measurement

1.3 – Experimental Design


2 Describing Data with Tables & Graphs

2.1 – Frequency Distributions and Their Tables

Example 2.1. The average daily numbers of sunspots observed from Earth on the sun’s disk for the years 1976 to 2015 are as follows:

 18.4   39.3  131.0  220.1  218.9  198.9  162.4   91.0
 60.5   20.6   14.8   33.9  123.0  211.1  191.8  203.3
133.0   76.1   44.9   25.1   11.6   28.9   88.3  136.3
173.9  170.4  163.6   99.3   65.3   45.8   24.7   12.6
  4.2    4.8   24.9   80.8   84.5   94.0  113.3   69.8

1. We shall make a frequency distribution for this data set using nine classes.

2. The minimum data entry is 4.2, and the maximum data entry is 220.1. The range of the data set is thus 220.1 − 4.2 = 215.9, and so we determine the class width to be

Class width = Range / Number of classes = 215.9 / 9 ≈ 23.99,

which we round up to 24.

3. The lower limit of the first class we shall set to be 4.2, the minimum data entry. We obtain the lower limits of the other classes by repeatedly adding the class width of 24 to the lower limit of the first class:

4.2 28.2 52.2 76.2 100.2 124.2 148.2 172.2 196.2

Since our data consists of decimals to the tenths place, the upper limit of the first class must be 0.1 less than the lower limit of the second class, which is to say 28.1. We obtain the upper limits of the other classes by repeatedly adding the class width of 24 to the upper limit of the first class:

28.1 52.1 76.1 100.1 124.1 148.1 172.1 196.1 220.1

4. Tally the number of data entries that fall into each class. A column of tally marks can be included in the frequency distribution table, though it is not required. The leftmost column of the frequency distribution table must list each of the nine classes, and the rightmost column must list the frequency of each class.

Class          Tally          Frequency
4.2–28.1       ||||| |||||        10
28.2–52.1      |||||               5
52.2–76.1      ||||                4
76.2–100.1     ||||| |             6
100.2–124.1    ||                  2
124.2–148.1    |||                 3
148.2–172.1    |||                 3
172.2–196.1    ||                  2
196.2–220.1    |||||               5

Example 2.2. We extend the frequency distribution constructed in Example 2.1 to include class midpoints, relative class frequencies fr, and cumulative frequencies fc. We also do away with the tally column, which holds the same information as the existing column for frequency f. Relative frequencies are given to the thousandths place in this example, since it results in no rounding. In general, if relative frequencies are given to k decimal places, then all relative frequencies should be written to k decimal places by inserting zeros as needed. Thus, in this case where relative frequencies are given to 3 decimal places, we write 0.25 as 0.250, and 0.1 as 0.100.

Class          Midpoint     f     fr      fc
4.2–28.1        16.15      10    0.250    10
28.2–52.1       40.15       5    0.125    15
52.2–76.1       64.15       4    0.100    19
76.2–100.1      88.15       6    0.150    25
100.2–124.1    112.15       2    0.050    27
124.2–148.1    136.15       3    0.075    30
148.2–172.1    160.15       3    0.075    33
172.2–196.1    184.15       2    0.050    35
196.2–220.1    208.15       5    0.125    40
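The table above is mechanical to produce, so it is a natural candidate for automation. The Python sketch below is an illustration (not part of the original notes) that regenerates the classes, midpoints, frequencies f, relative frequencies fr, and cumulative frequencies fc for the sunspot data, assuming the class width of 24 and first lower limit 4.2 chosen in Example 2.1.

    data = [18.4, 39.3, 131.0, 220.1, 218.9, 198.9, 162.4, 91.0,
            60.5, 20.6, 14.8, 33.9, 123.0, 211.1, 191.8, 203.3,
            133.0, 76.1, 44.9, 25.1, 11.6, 28.9, 88.3, 136.3,
            173.9, 170.4, 163.6, 99.3, 65.3, 45.8, 24.7, 12.6,
            4.2, 4.8, 24.9, 80.8, 84.5, 94.0, 113.3, 69.8]

    num_classes, width = 9, 24
    lowers = [round(min(data) + k * width, 1) for k in range(num_classes)]
    uppers = [round(lo + width - 0.1, 1) for lo in lowers]

    cumulative = 0
    for lo, hi in zip(lowers, uppers):
        f = sum(lo <= x <= hi for x in data)    # class frequency
        midpoint = round((lo + hi) / 2, 2)      # class midpoint
        fr = f / len(data)                      # relative frequency
        cumulative += f                         # cumulative frequency
        print(f"{lo}-{hi}\t{midpoint}\t{f}\t{fr:.3f}\t{cumulative}")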


2.2 – Graphs of Frequency Distributions

A frequency histogram is a representation of a frequency distribution as a vertical bar graph. Each bar represents a class of data, and the height of the bar indicates the frequency f of the class.

Example 2.3. We now make a frequency histogram of the data in Example 2.1. The numbers in the frequency column of the frequency distribution table are measured by the vertical scale, and the data entries are measured by the horizontal scale. In particular, as described above, the horizontal scale can be rendered in two ways: using class boundaries or class midpoints.

First we shall render the horizontal scale with class boundaries. Looking at the class column of the frequency distribution table in Example 2.1, where the class limits are given, we see that the gap between the first class 4.2–28.1 and the second class 28.2–52.1 is 28.2 − 28.1 = 0.1. Half of this value is 0.05, and so we subtract 0.05 from every lower class limit to obtain lower class boundaries, and we add 0.05 to every upper class limit to obtain upper class boundaries.

Class Limits    Class Boundaries    Class Midpoint    Frequency
4.2–28.1        4.15–28.15              16.15             10
28.2–52.1       28.15–52.15             40.15              5
52.2–76.1       52.15–76.15             64.15              4
76.2–100.1      76.15–100.15            88.15              6
100.2–124.1     100.15–124.15          112.15              2
124.2–148.1     124.15–148.15          136.15              3
148.2–172.1     148.15–172.15          160.15              3
172.2–196.1     172.15–196.15          184.15              2
196.2–220.1     196.15–220.15          208.15              5

In this way we obtain the frequency histogram shown in Figure 2, with the zig-zag near the left end of the horizontal axis used to denote a break in the scale: the actual distance between 0 and 4.15 on the horizontal scale should be smaller than depicted.

Next we render the horizontal scale with class midpoints. We found the class midpoints already in Example 2.2, and they are included for reference in the table above. The resultant frequency histogram, shown in Figure 1, otherwise looks the same as before.

A relative frequency histogram is nothing more than a frequency histogram in which the vertical scale measures relative frequency fr rather than frequency f.

Example 2.4. A relative frequency histogram of the data in Example 2.1 is shown in Figure 3. The horizontal scale still measures the data entries and can feature either class boundaries or class midpoints.


Figure 1. Frequency histogram with horizontal axis featuring class midpoints. [Horizontal axis: average daily number of sunspots, marked at the class midpoints 16.15 through 208.15; vertical axis: frequency (number of years), 0 to 10.]

Figure 2. Frequency histogram with horizontal axis featuring class boundaries. [Horizontal axis: average daily number of sunspots, marked at the class boundaries 4.15 through 220.15; vertical axis: frequency (number of years), 0 to 10.]


Figure 3. Relative frequency histogram. [Horizontal axis: average daily number of sunspots in a year, marked at the class boundaries 4.15 through 220.15; vertical axis: relative frequency (fraction of years), 0.05 to 0.25.]
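Plots like Figures 1, 2, and 3 can be reproduced with any plotting library. A minimal matplotlib sketch is given below as an illustration only (the labels are taken from the captions above); it draws the frequency histogram of Figure 1 from the class midpoints and frequencies.

    import matplotlib.pyplot as plt

    midpoints = [16.15, 40.15, 64.15, 88.15, 112.15, 136.15, 160.15, 184.15, 208.15]
    frequencies = [10, 5, 4, 6, 2, 3, 3, 2, 5]

    # One bar per class; each bar is 24 units wide so that adjacent bars touch.
    plt.bar(midpoints, frequencies, width=24, edgecolor="black")
    plt.xticks(midpoints)
    plt.xlabel("Average daily number of sunspots")
    plt.ylabel("Frequency (number of years)")
    plt.title("Frequency histogram (class midpoints)")
    plt.show()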


2.3 – Stem-and-Leaf and Dot Plots

Example 2.5. The following data represent the daily high temperatures (in degrees Fahrenheit) during April 2003 for Willow Grove:

48 73 61 47 42 49 37 36
39 47 47 66 61 65 79 84
56 46 62 65 66 65 53 62
68 58 69 77 75 72

A stem-and-leaf plot for this data set using six stems is as follows:

Stem | Leaves                       Key: 3|6 = 36
  3  | 6 7 9
  4  | 2 6 7 7 7 8 9
  5  | 3 6 8
  6  | 1 1 2 2 5 5 5 6 6 8 9
  7  | 2 3 5 7 9
  8  | 4
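A stem-and-leaf plot is also easy to generate programmatically: each temperature is split into a stem (its tens digit) and a leaf (its ones digit). The Python sketch below is illustrative only and rebuilds the plot of Example 2.5.

    temps = [48, 73, 61, 47, 42, 49, 37, 36, 39, 47, 47, 66, 61, 65, 79,
             84, 56, 46, 62, 65, 66, 65, 53, 62, 68, 58, 69, 77, 75, 72]

    stems = {}
    for t in sorted(temps):
        stems.setdefault(t // 10, []).append(t % 10)   # stem = tens digit, leaf = ones digit

    print("Key: 3|6 = 36")
    for stem in sorted(stems):
        print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))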

2.4 – Pie and Pareto Charts

2.5 – Graphing Paired Data Sets


3 Numerical Data Descriptors

3.1 – Measures of Center

A “measure of center” for a data set D (otherwise known as an average of the data in D) gives a value that is, in some sense, the “center” of D. The most common measures of center are the arithmetic mean, the median, and the mode. We will also consider measures of center known as the weighted arithmetic mean and the midrange.

Definition 3.1. The arithmetic mean of a population of size N is

µ = (x1 + x2 + · · · + xN)/N = (1/N) Σ_{k=1}^{N} xk = Σx/N,

where x1, x2, . . . , xN are the values in the population.
    The arithmetic mean of a sample of size n is

x̄ = (x1 + x2 + · · · + xn)/n = (1/n) Σ_{k=1}^{n} xk = Σx/n,

where x1, x2, . . . , xn are the values in the sample.

Henceforth we shall refer to the arithmetic mean as simply the mean. There exist other kinds of means such as the geometric mean, harmonic mean, and so on.

Of course, the two formulas in Definition 3.1 convey the same idea, with the only differences being in terminology and notation. If a data set is a population, then the mean is denoted by the symbol µ (the Greek letter mu, pronounced “mew”); and if a data set is a sample (i.e. a portion of a population), then the mean is denoted by x̄ (“x-bar”). In Definition 3.1 the defining expressions for µ and x̄ are each written in three ways, in increasing order of compactness. The symbol Σx in the rightmost (most compact) expressions always denotes the sum of all the values in a data set, no matter whether it is a population or a sample.

Before defining the median of a data set, we first introduce some new notation. Given a data set D = [[x1, . . . , xn]], we often need to rearrange the data in ascending order. Symbolically we may do this by relabeling the values in D using the symbols x↑1, . . . , x↑n. Here x↑1 denotes the smallest of the values in D, while x↑2 denotes the second smallest value, x↑3 the third smallest value, and so on until we reach the largest value x↑n. Thus we have

x↑1 ≤ x↑2 ≤ x↑3 ≤ · · · ≤ x↑n.

Since data sets are not sensitive to order, [[x↑1, . . . , x↑n]] is considered to be the same data set as

D = [[x1, . . . , xn]].

Example 3.2. If D = [[x1, x2, x3, x4, x5, x6]] is a data set with x1 = 6, x2 = −1, x3 = 8, x4 = 0, x5 = 6, x6 = 3, then:

x↑1 = −1, x↑2 = 0, x↑3 = 3, x↑4 = 6, x↑5 = 6, x↑6 = 8.

Definition 3.3. The median of the data set [[x1, . . . , xn]] is defined as follows:

Median = x↑(n/2 + 1/2)                 if n is odd,
Median = (x↑(n/2) + x↑(n/2 + 1))/2     if n is even.

Example 3.4. Find the median of the data set [[6, 4, 9, 7,−3, 5, 4, 2, 4, 9, 9, 6,−7, 3, 3]].

Solution. The size of the data set is n = 15, which of course is an odd number. Calculating n/2 + 1/2 = 15/2 + 1/2 = 8, by Definition 3.3 the median of the data set equals x↑8. Writing out the data set in ascending order gives

[[−7,−3, 2, 3, 3, 4, 4, 4, 5, 6, 6, 7, 9, 9, 9]],

which makes clear that

x↑1 = −7, x↑2 = −3, x↑3 = 2, x↑4 = x↑5 = 3, x↑6 = x↑7 = x↑8 = 4,

x↑9 = 5, x↑10 = x↑11 = 6, x↑12 = 7, x↑13 = x↑14 = x↑15 = 9,

and hence the median is 4.

Definition 3.5. Let {t1, . . . , tm} denote the set of distinct values contained in a data set D. For each 1 ≤ i ≤ m let fi denote the frequency with which the value ti occurs in D, and let f equal the largest of the frequencies f1, . . . , fm. The modes of D are the elements of {t1, . . . , tm} having frequency equal to f.

Example 3.6. Find the modes of the data set [[6, 4, 9, 7, −3, 5, 4, 2, 4, 9, 9, 6, −7, 3, 3]].

Solution. The set of distinct values in the data set is {−7, −3, 2, 3, 4, 5, 6, 7, 9}. The values −7, −3, 2, 5, and 7 each appear in the data set precisely once, and so each have a frequency of 1. The values 3 and 6 each appear twice, so each have frequency 2. Finally, 4 and 9 each have frequency 3. The largest frequency is thus f = 3, and since the values 4 and 9 each have frequency equal to f = 3, both 4 and 9 are modes of the data set.


In general, a data set having one mode is called unimodal, and a data set with more than one mode is multimodal. In particular a data set with two modes is bimodal, which is the case in Example 3.6.
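Python's statistics module (version 3.8 or later for multimode) computes these measures of center directly, which is handy for checking hand calculations. The sketch below is illustrative only; it verifies Examples 3.4 and 3.6.

    import statistics

    D = [6, 4, 9, 7, -3, 5, 4, 2, 4, 9, 9, 6, -7, 3, 3]

    print(statistics.mean(D))        # the arithmetic mean of the data set
    print(statistics.median(D))      # 4, agreeing with Example 3.4
    print(statistics.multimode(D))   # [4, 9], the two modes found in Example 3.6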

Definition 3.7. For each x in a data set let w ∈ R be an associated weight. Then the weighted mean of the data set is

x̄ = Σ(wx)/Σw.

Example 3.8. A grade point average (GPA) is a weighted mean of “grade points” received for each credit of coursework completed. An academic course credit given a grade of A, B, C, D, or F is awarded 4, 3, 2, 1, or 0 grade points, respectively (at least at institutions that append no + or − to any letter grade). The weight associated with each of the five possible grade point values is simply the number of course credits that are awarded that value.

As an example, suppose Wolfgang Jagermann got an A in 1 five-credit course, a C in 2 three-credit courses, a B in 1 three-credit course, and a D in 1 two-credit course. Then his GPA is determined as follows, where we let “cr” stand for “credits” and “gp” stand for “grade points”:

GPA = [(5 cr)(4 gp) + (6 cr)(2 gp) + (3 cr)(3 gp) + (2 cr)(1 gp)] / [5 cr + 6 cr + 3 cr + 2 cr] = 2.6875 gp.

Thus Herr Jagermann received a mean of approximately 2.69 grade points per credit.
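The weighted mean of Definition 3.7 translates directly into code: multiply each value by its weight, sum, and divide by the total weight. The following sketch (illustrative only) checks the GPA computed above.

    # (grade points, credits) for each course in Example 3.8:
    # an A in a 5-credit course, C's in two 3-credit courses,
    # a B in a 3-credit course, and a D in a 2-credit course.
    courses = [(4, 5), (2, 3), (2, 3), (3, 3), (1, 2)]

    gpa = sum(gp * cr for gp, cr in courses) / sum(cr for _, cr in courses)
    print(gpa)   # 2.6875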

3.2 – Measures of Dispersion


3.3 – Measures of Position

Example 3.9. The following data represent the daily high temperatures (in degrees Fahrenheit) during April 2003 for Willow Grove:

48 73 61 47 42 49 37 36
39 47 47 66 61 65 79 84
56 46 62 65 66 65 53 62
68 58 69 77 75 72

(a) Find the quartiles Q1, Q2, and Q3.
(b) Construct a box-and-whisker plot.

Solution.
(a) The median is (61 + 62)/2 = 61.5, and so Q2 = 61.5. Now we divide the data set into two halves: its smallest 15 values and its largest 15 values. The smallest 15 values in ascending order are

36, 37, 39, 42, 46, 47, 47, 47, 48, 49, 53, 56, 58, 61, 61.

The median of the smallest 15 values is Q1 = 47. The largest 15 values in ascending order are

62, 62, 65, 65, 65, 66, 66, 68, 69, 72, 73, 75, 77, 79, 84.

The median of the largest 15 values is Q3 = 68. Thus:

Q1 = 47, Q2 = 61.5, Q3 = 68.

(b) The box-and-whisker plot:

[Box-and-whisker plot over a scale from 36 to 84: whiskers extend from the minimum 36 to the maximum 84, the box runs from Q1 = 47 to Q3 = 68, and the median 61.5 is marked inside the box.]
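The quartile calculation in part (a) is the median-of-halves method: sort the data, take the overall median for Q2, and then take the medians of the lower and upper halves for Q1 and Q3. A Python sketch of that procedure (illustrative only; other software may use slightly different quartile conventions):

    import statistics

    temps = sorted([48, 73, 61, 47, 42, 49, 37, 36, 39, 47, 47, 66, 61, 65, 79,
                    84, 56, 46, 62, 65, 66, 65, 53, 62, 68, 58, 69, 77, 75, 72])

    n = len(temps)
    lower_half = temps[: n // 2]           # smallest 15 values
    upper_half = temps[(n + 1) // 2 :]     # largest 15 values

    Q1 = statistics.median(lower_half)     # 47
    Q2 = statistics.median(temps)          # 61.5
    Q3 = statistics.median(upper_half)     # 68
    print(min(temps), Q1, Q2, Q3, max(temps))   # five-number summary for the box plot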


4 Probability Theory

4.1 – Fundamental Counting Principle

Definition 4.1. An experiment E is a predetermined procedure by which an observation is made. The procedure consists of one or more performances of a single unvarying action called a trial.

When an experiment E is done, we speak of “performing E.” The individual trials constituting an experiment E are experiments in their own right, so that E may be viewed as a succession of simpler experiments. The precise number of trials constituting an experiment is not necessarily known in advance, since the experiment’s predetermined procedure may call for repeating an action until some desired result is obtained.

Example 4.2. A single roll of a six-sided die is an experiment, and it consists of just one trial: namely, a single roll of a six-sided die. There is no meaningful way to decompose the procedure of rolling one die one time into a sequence of simpler actions.

Ten successive rolls of a die is also an experiment, only this time the experiment can be said to consist of ten trials: each roll is one trial.

What is observed to occur when an experiment is conducted we tentatively refer to as a “result.” There are usually many ways to describe the result of an experiment. If an experiment consists of a single roll of a six-sided die and the result is a 5, we could say the outcome was “five,” or “an odd number,” or “a number greater than 2,” among other possibilities. Only the first characterization of the experimental result, that of getting a “five,” makes clear which side of the die was actually observed facing upward when the experiment was finished. To say the result was “an odd number” leaves open the possibility of a 1, 3, or 5 having actually turned up, with no additional information at hand to narrow the choices down.

For the next definition recall that a set A is a subset of a set B, written A ⊆ B, if x ∈ A implies that x ∈ B. Put another way, A is a subset of B if every element of A is also an element of B. This means, in particular, that any set is a subset of itself: A ⊆ A.

Definition 4.3. An outcome of an experiment is a description of an experimental result that loses no information gained by the original observation. The sample space Ω of an experiment is the set of all the experiment’s possible outcomes, and an event is any subset of the sample space.

In the literature the elements of an experiment’s sample space are often referred to as sample points rather than outcomes. A simple event is an event that contains precisely one outcome of an experiment.

Let Ω = {ω1, ω2, . . . , ωn} be the sample space of an experiment. Among the n elements of Ω, choose any m ≥ 1 of them: ωk1, ωk2, . . . , ωkm. (Here k1, k2, . . . , km are numbers drawn from the set {1, 2, . . . , n}, where of course 1 ≤ m ≤ n.) Let E be the set containing these elements:

E = {ωk1, ωk2, . . . , ωkm}.

Then E is a subset of Ω, we write E ⊆ Ω, and we call E an event in accord with Definition 4.3. Specifically, E is the event in which one of the outcomes ωk1, ωk2, . . . , ωkm is observed to occur upon carrying out the experiment a single time.

Note that since Ω is a subset of itself (i.e. Ω ⊆ Ω), the set Ω is itself an event: namely, the event that one of the possible outcomes of the experiment occurs upon carrying out the experiment a single time. This is always certain to happen, and so Ω is called the certain event.

The empty set ∅, which is the set that contains no elements, is considered to be a subset of any set, so it is certainly a subset of any experiment’s sample space Ω and therefore constitutes an event. It is specifically the event that none of the outcomes contained in Ω are observed to occur upon carrying out the experiment a single time. Of course, this implies that an outcome occurred that is not among the outcomes contained in Ω, and since by definition Ω contains all possible outcomes, we have arrived at a contradiction. For this reason ∅ is called the impossible event.

Two events E1, E2 ⊆ Ω are mutually exclusive if E1 ∩ E2 = ∅. This means that the simultaneous occurrence of events E1 and E2 is impossible.

Example 4.4. Suppose an experiment consists of a single toss of a coin, so as to see which side of the coin faces up when it comes to rest. There are two possible outcomes: heads or tails. If we let H stand for “heads” and T stand for “tails,” then we can see that the sample space of the experiment is Ω = {H, T}. There are only four possible events in this scenario, corresponding to the four possible subsets of Ω:

∅, {H}, {T}, {H, T}.

The set ∅ corresponds to the event of a coin toss resulting in neither heads nor tails, which cannot occur. The set {H, T}, which equals Ω itself, is the event of a coin toss resulting in either heads or tails, which is sure to occur. The set {H} is the event of observing heads but not tails, while {T} is the event of observing tails but not heads.

Example 4.5. Suppose an experiment consists of rolling a single eight-sided die having sides numbered from 1 to 8. The sample space of the experiment is Ω = {1, 2, 3, 4, 5, 6, 7, 8}. The number of possible events that can occur equals the number of subsets that Ω has, which happens to be 256 (too many to easily list). The event {2, 4, 6, 8} could be described as the event of rolling an even number, while the event {2, 3, 5, 7} could be described as the event of rolling a prime number. How might we describe the event {6, 7, 8} or {1, 4}?
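The count of 256 events comes from the fact that a set with 8 elements has 2^8 subsets. The sketch below (illustrative only) enumerates every subset of Ω with itertools and also builds the two events just described.

    from itertools import combinations

    omega = [1, 2, 3, 4, 5, 6, 7, 8]

    # Every subset of the sample space is an event.
    events = [set(c) for r in range(len(omega) + 1) for c in combinations(omega, r)]
    print(len(events))                                # 256 = 2**8

    print({w for w in omega if w % 2 == 0})           # {2, 4, 6, 8}: rolling an even number
    print({w for w in omega if w in (2, 3, 5, 7)})    # {2, 3, 5, 7}: rolling a prime number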


Another set theoretic notion is needed for what’s coming. The Cartesian product of two nonempty sets A and B is the set of ordered pairs

A × B = {(a, b) : a ∈ A and b ∈ B}.

Thus if A = {$, %, #} and B = {α, β}, then

A × B = {($, α), ($, β), (%, α), (%, β), (#, α), (#, β)}

and

B × A = {(α, $), (β, $), (α, %), (β, %), (α, #), (β, #)}.

Note that A × B ≠ B × A since, for instance, ($, α) ≠ (α, $).
    More generally, given nonempty sets A1, A2, . . . , An, we have

A1 × A2 × · · · × An = {(a1, a2, . . . , an) : ak ∈ Ak for each 1 ≤ k ≤ n}.

Definition 4.6. Suppose experiments E1 and E2 have sample spaces Ω1 and Ω2, respectively. The experiment consisting of performing E1 and E2 in sequence, denoted by E1E2, is the experiment having sample space Ω1 × Ω2.
    More generally, if Ek has sample space Ωk for each 1 ≤ k ≤ n, then the experiment E1 · · · En consisting of performing E1, . . . , En in sequence is the experiment having sample space Ω1 × · · · × Ωn.

Example 4.7. Let E1 be the experiment consisting of a single toss of a coin, as in Example 4.4, and let E2 be the experiment consisting of a single roll of an eight-sided die, as in Example 4.5. The experiment E consisting of performing E1 and E2 in sequence is the experiment consisting of first tossing the coin and then rolling the die. The sample space of E1 is Ω1 = {H, T}, while the sample space of E2 is Ω2 = {1, 2, 3, 4, 5, 6, 7, 8}. Thus, by definition, the sample space of E is

Ω1 × Ω2 = {(H, 1), (H, 2), (H, 3), (H, 4), (H, 5), (H, 6), (H, 7), (H, 8),
           (T, 1), (T, 2), (T, 3), (T, 4), (T, 5), (T, 6), (T, 7), (T, 8)}.

Each ordered pair in Ω1 × Ω2 is an outcome of the experiment E. For instance, (T, 7) is the outcome of tossing tails and then rolling a 7.

Theorem 4.8 (Fundamental Counting Principle). If experiment Ek has nk outcomes for each 1 ≤ k ≤ m, then the experiment E1E2 · · · Em has n1n2 · · · nm possible outcomes.

The veracity of the theorem is fairly self-evident; however, the rigorous-minded can craft a proof using the Principle of Induction.

Example 4.9. If E1 is the experiment consisting of a single toss of a coin, and E2 is the experiment consisting of a single roll of an eight-sided die, then E1 has n1 = 2 outcomes (see Example 4.4), and E2 has n2 = 8 outcomes (see Example 4.5). By the Fundamental Counting Principle it follows that the experiment E consisting of performing E1 and E2 in sequence has n1n2 = 16 outcomes. These outcomes are listed in Example 4.7.


Example 4.10. An e-mail account password is known to consist of six characters with the following parameters: the first three characters are letters with possible repetition, the fourth and fifth characters are numerals without repetition, and the last character is a symbol from the set {#, $, %, !, ?}. The letters are case-sensitive, so that for instance x is distinguished from X. How many possible passwords satisfy the parameters?

Define the following four experiments:

E1: Choose an uppercase or lowercase letter.
E2: Choose a numeral.
E3: Choose a numeral not chosen in E2.
E4: Choose a symbol from the set {#, $, %, !, ?}.

A possible password is formed by performing the experiment E = E1E1E1E2E3E4. Note that E1 is performed in sequence three consecutive times, and it is important that E2 be performed before E3!

The number of outcomes possible for E1 is n1 = 52: choose any one of 26 lowercase or 26 uppercase letters. The number of outcomes possible for E2 is n2 = 10: choose one of the numerals from 0 to 9. For E3 we must choose a numeral not already chosen in the course of performing E2, and so there are n3 = 9 outcomes. Finally, E4 has n4 = 5 possible outcomes.

By the Fundamental Counting Principle we conclude that the number of possible outcomes for E is

n1n1n1n2n3n4 = 52 · 52 · 52 · 10 · 9 · 5 = 63,273,600.

Therefore there are 63,273,600 possible passwords.
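The arithmetic in Example 4.10 is just a product of the per-position counts, so it is easy to check. The short computation below (illustrative only) multiplies the number of outcomes of each constituent experiment.

    import math

    n1 = 52   # E1: one of 26 lowercase or 26 uppercase letters (repetition allowed)
    n2 = 10   # E2: any numeral 0-9
    n3 = 9    # E3: any numeral except the one chosen in E2
    n4 = 5    # E4: one of the symbols #, $, %, !, ?

    print(n1 ** 3 * n2 * n3 * n4)                  # 63273600
    print(math.prod([52, 52, 52, 10, 9, 5]))       # the same product written out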


4.2 – Definition of Probability

A set is finite if it has a finite number of elements. Thus A = {1, 2, 3} is finite (3 elements), whereas the set of positive integers N = {1, 2, 3, . . .} has an infinite number of elements and hence is not finite.

Definition 4.11. A one-to-one correspondence between the elements of one set A and another set B is a function f : A → B with the following properties:

1. Injectivity: Whenever a1, a2 ∈ A are such that a1 ≠ a2, then f(a1) ≠ f(a2).
2. Surjectivity: For every b ∈ B there exists some a ∈ A such that f(a) = b.

Remark. In the definition above, the equality f(a) = b can be interpreted as a “pairing” of a ∈ A with b ∈ B, and this pairing may be expressed quite naturally as the ordered pair (a, b). So, a one-to-one correspondence between sets A and B may be viewed as a relation that pairs each element of A with a unique element of B, such that no element of B is left unpaired with an element of A.

A function f : A → B possessing the injectivity (resp. surjectivity) property in Definition 4.11 is injective (resp. surjective). Thus a function is a one-to-one correspondence if and only if it is both injective and surjective.

Example 4.12. Let A = {1, 2, 3} and B = {u, v, w}. If we define the function f : A → B by

f(1) = v, f(2) = w, f(3) = u,

then f is a one-to-one correspondence between A and B. This correspondence may be alternately expressed as the set of ordered pairs {(1, v), (2, w), (3, u)}.

Now define a new function g : A → B by

g(1) = v, g(2) = v, g(3) = w.

The function g is not a one-to-one correspondence between A and B, since for u ∈ B there is no a ∈ A such that g(a) = u, thereby violating the surjectivity property in Definition 4.11.

Example 4.13. Let A = {1, 2, 3, 4} and B = {u, v, w}. The function h : A → B given by

h(1) = v, h(2) = w, h(3) = w, h(4) = u,

is not a one-to-one correspondence between A and B. We have 2, 3 ∈ A with 2 ≠ 3, and yet h(2) = h(3) = w, which violates the injectivity property in Definition 4.11.

Definition 4.14. A set A is countably infinite if there exists a one-to-one correspondence between the elements of A and the elements of N = {1, 2, 3, . . .}. A set that is finite or countably infinite is discrete. A set that is not discrete is uncountable. An uncountable set that equals an interval of real numbers is continuous.

Implicit in Definition 4.14 is that any interval of real numbers such as (0, 1), [−1, 1], or (2, ∞) is an uncountable set. This is true, but no proof of the fact shall be given here.


Example 4.15. Show that the set of positive even integers A = {2, 4, 6, . . .} is countably infinite.

Solution. Define a function f : A → N by

f(a) = a/2

for each a ∈ A. Note that if a ∈ A, then a is a positive integer divisible by 2, so that a/2 is a positive integer and hence an element of N as required.
    Suppose a1, a2 ∈ A are such that a1 ≠ a2. Then a1/2 ≠ a2/2, and hence f(a1) ≠ f(a2). This shows that f is an injective function.
    Next, let n ∈ N be arbitrary. Then 2n is an even integer, which is to say 2n ∈ A and hence 2n is in the domain of f. Now, f(2n) = 2n/2 = n, and since n ∈ N is arbitrary, we see that for any n ∈ N there exists some a ∈ A such that f(a) = n. This shows that f is a surjective function.
    Since f : A → N is both injective and surjective, it is a one-to-one correspondence. Therefore A is countably infinite.

A discrete sample space is a sample space that is a discrete set, and a continuous sample space is a sample space that is a continuous set.

Example 4.16. Consider the experiment E1 consisting of tossing a coin repeatedly until heads is obtained. The first toss could result in heads (outcome: H), or the first toss tails and the second toss heads (outcome: TH), or the first two tosses tails and the third toss heads (outcome: TTH), and so on. The sample space for E1 is

Ω1 = {H, TH, TTH, TTTH, . . .}.

This is a countably infinite set, for we can construct a one-to-one correspondence with N by pairing 1 with H, 2 with TH, 3 with TTH, and in general n ∈ N with the outcome ω ∈ Ω1 consisting of n tosses of the coin (n − 1 tails and 1 heads). Therefore Ω1 is a discrete sample space.
    Now consider the experiment E2 consisting of recording the exact time it takes, in seconds, until heads is obtained when repeatedly tossing a coin. Of course any measuring device such as a stopwatch will give an approximate time (the nearest millisecond perhaps), but in fact the actual time it takes to get heads the first time could be any positive real number. The sample space for E2 is thus Ω2 = (0, ∞), and since any interval of real numbers is a continuous set, Ω2 is a continuous sample space.

The collection of all subsets of a set A is denoted by P(A) and called the power set of A. In particular if Ω is the sample space of an experiment, then its power set P(Ω) is the collection of all possible events that could occur when the experiment is performed. Clearly E ∈ P(Ω) if and only if E ⊆ Ω.

A sequence of events is a collection {E1, E2, E3, . . .} ⊆ P(Ω) that is ordered in accordance with the subscripts: event E1 before E2, E2 before E3, and so on. Such a sequence we denote by (E_n)_{n=1}^∞. Similarly, (E_n)_{n=1}^N for N ∈ N denotes a finite sequence of events, with E1 before E2, E2 before E3, and so on until EN, which is the last event. A sequence (E_n)_{n=1}^∞ or finite sequence (E_n)_{n=1}^N is mutually exclusive if Ei ∩ Ej = ∅ whenever i ≠ j.


In what follows it will be convenient to introduce some new notation. Given any sequence of sets (A_n)_{n=1}^∞ and N ∈ N, we define

⋃_{n=1}^{N} An = A1 ∪ A2 ∪ · · · ∪ AN = {x : x ∈ An for some 1 ≤ n ≤ N},

and

⋃_{n=1}^{∞} An = {x : x ∈ An for some n ≥ 1}.

If the sets being united are mutually disjoint, then ⊔ will be used instead of ∪. Also, given sets A and B, we define the set difference of A and B to be the set

A − B = {x : x ∈ A and x ∉ B}.

In particular, given a sample space Ω and E ⊆ Ω, we define the event

E^c = Ω − E

to be the complement of the event E (relative to Ω).

Definition 4.17. Let S be a set. A σ-algebra on S is a collection Σ ⊆ P(S) with the following properties.

1. S ∈ Σ.
2. If A ∈ Σ, then S − A ∈ Σ.
3. If An ∈ Σ for n ∈ N, then ⋃_{n=1}^{∞} An ∈ Σ.

Since S − S = ∅, the first two properties above immediately imply that ∅ is a member of any σ-algebra. Also, for any set S it can be shown that P(S) itself is a σ-algebra on S. Another σ-algebra on S (in fact the smallest one possible) is the collection {∅, S}.

Example 4.18. Let S = {1, 2, 3}. What is the smallest σ-algebra on S that contains {1}? Let Σ be this σ-algebra. Then {1} ∈ Σ by requirement, and ∅, S ∈ Σ by the first two properties in Definition 4.17. Moreover, {1} ∈ Σ implies that S − {1} = {2, 3} is in Σ also. It is straightforward to check that the collection

{∅, {1}, {2, 3}, S}

is a σ-algebra on S that contains {1}, and since none of the members ∅, {2, 3}, or S can be removed from the collection without violating one of the three properties in Definition 4.17, it follows that Σ = {∅, {1}, {2, 3}, S} is the smallest σ-algebra on S that contains {1}.

Given any σ-algebra Σ on a sample space Ω, we say a sequence of events (E_n)_{n=1}^∞ is a sequence in Σ if En ∈ Σ for all n ∈ N.

Definition 4.19. Let Ω be a sample space, and let Σ ⊆ P(Ω) be a σ-algebra on Ω. A function P : Σ → R is a probability measure on Σ if it satisfies the following axioms:

A1. P(E) ≥ 0 for every E ∈ Σ.
A2. P(Ω) = 1.
A3. If (E_n)_{n=1}^∞ is a sequence of mutually exclusive events in Σ, then

    P(⊔_{n=1}^{∞} En) = Σ_{n=1}^{∞} P(En).

If E ∈ Σ, then E is a measurable event, and the real number P(E) is the probability of the event E.
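For a finite sample space the axioms are easy to realize concretely: take Σ = P(Ω), give each outcome a nonnegative weight with the weights summing to 1, and let P(E) be the sum of the weights of the outcomes in E (this is exactly the situation treated in Theorem 4.21 below). The Python sketch that follows is an illustration under those assumptions, not part of the original notes.

    from fractions import Fraction

    # A fair six-sided die: each outcome gets weight 1/6, so the weights sum to 1 (Axiom A2).
    weight = {w: Fraction(1, 6) for w in range(1, 7)}

    def P(event):
        """Probability of an event, i.e. a subset of the sample space."""
        return sum(weight[w] for w in event)     # each term is nonnegative (Axiom A1)

    omega = set(weight)
    E1, E2 = {1, 3, 5}, {2, 4}                   # mutually exclusive events
    print(P(omega))                              # 1
    print(P(E1 | E2) == P(E1) + P(E2))           # True: additivity over disjoint events (Axiom A3)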

It may be wondered why the notion of a σ-algebra is involved in the definition of a probability measure. If Σ in the definition is not equal to P(Ω), then that means there are sets E ⊆ Ω for which P(E) is undefined. That is, there exist events in the sample space of an experiment that have no probability! How can this be?

In fact, if Ω in Definition 4.19 is specifically a discrete sample space, then usually Σ is taken to be P(Ω) as expected. Difficulties arise, however, in cases when Ω is an uncountable sample space, including the common scenario in which Ω is an interval of real numbers. It can be shown, for instance, that there is no probability measure defined on all of P([0, 1]) that assigns to every subinterval of [0, 1] a probability equal to its length. Thus, if we wish to develop a theory of probability that has any utility in a world full of continuous sample spaces, we must relinquish the hope of assigning probability values to every conceivable event. The area of mathematics most concerned with developing such a theory of probability is known as measure theory, and it is beyond the scope of this book.

Proposition 4.20. Let Ω be a sample space and P a probability measure on a σ-algebra Σ on Ω. Then the following hold.

1. P(∅) = 0.
2. If (E_n)_{n=1}^N is a finite sequence of mutually exclusive events in Σ, then

       P(⊔_{n=1}^{N} En) = Σ_{n=1}^{N} P(En).

3. If E1, E2 ∈ Σ are such that E1 ⊆ E2, then P(E1) ≤ P(E2).
4. P(E) ≤ 1 for every E ∈ Σ.
5. If E1, E2 ∈ Σ, then

       P(E1 − E2) = P(E1) − P(E1 ∩ E2).     (4.1)

   Thus if E2 ⊆ E1 also holds, then

       P(E1 − E2) = P(E1) − P(E2).     (4.2)

6. If E ∈ Σ, then P(E^c) = 1 − P(E).

Proof.
Proof of (1). Define the sequence (E_n)_{n=1}^∞ with E1 = Ω and En = ∅ for n ≥ 2. This sequence is mutually exclusive with Ω = ⊔_{n=1}^{∞} En. Now by Axiom A3,

P(Ω) = P(⊔_{n=1}^{∞} En) = Σ_{n=1}^{∞} P(En) = P(Ω) + Σ_{n=2}^{∞} P(∅),

and thus

Σ_{n=2}^{∞} P(∅) = 0.

Since P(∅) ≥ 0 by Axiom A1, it follows that P(∅) = 0.

Proof of (2). Let (E_n)_{n=1}^N be a finite sequence of mutually exclusive events in Σ. Extend this sequence to an infinite sequence (E_n)_{n=1}^∞ of mutually exclusive events by defining En = ∅ for n ≥ N + 1. Then

⊔_{n=1}^{∞} En = ⊔_{n=1}^{N} En,

and P(En) = P(∅) = 0 for n ≥ N + 1 by part (1). By Axiom A3,

P(⊔_{n=1}^{N} En) = P(⊔_{n=1}^{∞} En) = Σ_{n=1}^{∞} P(En) = Σ_{n=1}^{N} P(En) + Σ_{n=N+1}^{∞} P(En) = Σ_{n=1}^{N} P(En).

Proof of (3). Suppose E1, E2 ∈ Σ are such that E1 ⊆ E2. Then E2 = E1 ∪ (E2 − E1), and since E1 ∩ (E2 − E1) = ∅, part (2) implies that

P(E2) = P(E1 ∪ (E2 − E1)) = P(E1) + P(E2 − E1).

Now, P(E2 − E1) ≥ 0 by Axiom A1, so that

P(E1) = P(E2) − P(E2 − E1) ≤ P(E2).

Proof of (4). Suppose E ∈ Σ. Then P(E) ≤ P(Ω) = 1 by part (3) and Axiom A2.

Proof of (5). Suppose E1, E2 ∈ Σ. Then

P(E1) = P(E1 ∩ (E2 ∪ E2^c)) = P((E1 ∩ E2) ∪ (E1 ∩ E2^c)).

Since E1 ∩ E2 and E1 ∩ E2^c are mutually exclusive events, part (2) implies that

P(E1) = P(E1 ∩ E2) + P(E1 ∩ E2^c).

It is straightforward to check that E1 ∩ E2^c = E1 − E2, and so (4.1) follows. We obtain (4.2) from (4.1) by noting that E1 ∩ E2 = E2 if E2 ⊆ E1.

Proof of (6). We have E and E^c = Ω − E as mutually exclusive events in Σ such that E ∪ E^c = Ω, and so Axiom A2 and part (2) imply that

1 = P(Ω) = P(E ∪ E^c) = P(E) + P(E^c).

The conclusion that P(E^c) = 1 − P(E) readily follows.


Theorem 4.21. Let Ω be a discrete sample space, and suppose P is a probability measure on P(Ω). If E ⊆ Ω with E ≠ ∅, then

P(E) = Σ_{ω∈E} P({ω}).

Proof. Suppose E ⊆ Ω, with E not equalling the empty set. Then E is discrete since Ω is discrete, and so either E = {ω1, . . . , ωN} for some N ∈ N, or E = {ω1, ω2, . . .}. Suppose E = {ω1, . . . , ωN}. For each 1 ≤ n ≤ N define En = {ωn}, so E = E1 ∪ · · · ∪ EN. Since ωi ≠ ωj whenever i ≠ j, we find that Ei ∩ Ej = ∅ whenever i ≠ j, and hence (E_n)_{n=1}^N is a finite sequence of mutually exclusive events in Ω. Now, by Proposition 4.20(2),

P(E) = P(⊔_{n=1}^{N} En) = Σ_{n=1}^{N} P(En) = Σ_{n=1}^{N} P({ωn}) = Σ_{ω∈E} P({ω}),

as desired. The proof in the case when E = {ω1, ω2, . . .} is nearly identical, only N is replaced with ∞, and Axiom A3 is used.

Given a finite set A, let n(A) denote the number of elements in A. For A = {1, 2, 3} we have n(A) = 3, and for the empty set ∅ we have n(∅) = 0.

Theorem 4.22. Let Ω be a sample space such that n(Ω) = n ≥ 1, and suppose P is a probability measure on P(Ω). If P({ω}) = 1/n for each ω ∈ Ω, then

P(E) = n(E)/n(Ω)     (4.3)

for any E ⊆ Ω.

Proof. Suppose P({ω}) = 1/n for each ω ∈ Ω, and let E ⊆ Ω. The sample space Ω is discrete since it is finite. Assuming E ≠ ∅, Theorem 4.21 implies that

P(E) = Σ_{ω∈E} P({ω}) = Σ_{ω∈E} 1/n = n(E)/n = n(E)/n(Ω),

thereby affirming (4.3). In the case when E = ∅, Proposition 4.20(1) implies that P(E) = 0. But we also have n(E) = 0, and so (4.3) obtains once more.

According to Theorem 4.22, if an experiment has a finite sample space Ω, and each outcome in Ω is equally likely to occur, then the probability of an event E ⊆ Ω is given by (4.3). Such a probability P(E) is sometimes referred to as the classical probability of E.

Example 4.23. Let E be the experiment consisting of a single roll of a single eight-sided die having sides numbered from 1 to 8. Find the probability of rolling a perfect square.

Solution. The sample space of the experiment is Ω = {1, 2, 3, 4, 5, 6, 7, 8}, which is finite. We assume the die to be balanced, so that no one number is more likely to be rolled than another.


Let E be the event of rolling a perfect square. There are only two perfect squares on the eight-sided die: 1 (since 1 = 1²) and 4 (since 4 = 2²), and thus E = {1, 4}. By Theorem 4.22, the probability of E is

P(E) = n(E)/n(Ω) = 2/8 = 1/4.

This is to say there is “one chance in four” that the event E will occur.

Example 4.24. Let E be the experiment of drawing two cards in succession from a standard deck of playing cards, without replacement. Find the probability of getting a king on the first draw, and a card with an even number on the second draw.

Solution. Let A be the event of getting a king on the 1st draw and any card on the 2nd draw, and let B be the event of drawing any card on the 1st draw and an even-numbered card on the 2nd draw. The elements of these sets may be presented as ordered pairs (X, Y), where X denotes the first card drawn and Y the second card drawn. Hence if K denotes a king and E an even-numbered card, then A has elements of the form (K, Y), and B has elements of the form (X, E). It follows that the elements of the set A ∩ B are pairs of the form (K, E), which is to say A ∩ B consists precisely of those outcomes in the sample space Ω of the experiment E wherein a king is drawn first and an even-numbered card second. Thus we seek the probability of the event A ∩ B.

To find P(A ∩ B) we need n(A ∩ B) and n(Ω). How many elements are there in the set A ∩ B? This is simply to ask how many ordered pairs of the form (K, E) there are. There are 4 choices for K: it could be a king of hearts, diamonds, clubs, or spades; meanwhile there are 20 choices for E: it could be a 2, 4, 6, 8, or 10 in any one of the four suits. Therefore by the Fundamental Counting Principle there are (4)(20) = 80 possible pairs of the form (K, E), and so n(A ∩ B) = 80.

Now for n(Ω), the total number of possible outcomes of the experiment E. Elements of the set Ω are pairs of the form (X, Y), with X being any one of the 52 cards in the deck before the first card is drawn, and Y being any one of the remaining 51 cards available after the first card is drawn. Thus n(Ω) = (52)(51) = 2652.

Finally, by Theorem 4.22 we find that

P(A ∩ B) = n(A ∩ B)/n(Ω) = 80/2652 = 20/663 ≈ 0.0302,

our answer.
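The counts n(Ω) = 2652 and n(A ∩ B) = 80 can be confirmed by brute-force enumeration of all ordered pairs of distinct cards. The sketch below is illustrative only; the encoding of the deck is an assumption made for the example.

    from fractions import Fraction
    from itertools import permutations

    ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
    suits = ["hearts", "diamonds", "clubs", "spades"]
    deck = [(rank, suit) for rank in ranks for suit in suits]

    even = {"2", "4", "6", "8", "10"}
    pairs = list(permutations(deck, 2))           # ordered draws without replacement
    favorable = sum(1 for first, second in pairs
                    if first[0] == "K" and second[0] in even)

    print(len(pairs))                             # 2652
    print(favorable)                              # 80
    print(Fraction(favorable, len(pairs)))        # 20/663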

Some further properties of a probability measure P that will be useful in Chapter 6 follow presently. First we introduce some new notation: given a sequence of sets (A_n)_{n=1}^∞, we write An ↑ A if A = ⋃_{n=1}^{∞} An and An ⊆ An+1 for all n, and we write An ↓ A if A = ⋂_{n=1}^{∞} An and An ⊇ An+1 for all n.

Proposition 4.25. Let Ω be a sample space, P a probability measure on a σ-algebra Σ on Ω, and (E_n)_{n=1}^∞ a sequence in Σ. If En ↑ E or En ↓ E, then

lim_{n→∞} P(En) = P(E).


Proof. Suppose En ↑ E, so En ⊆ En+1 for each n ∈ N, and E = ⋃_{n=1}^{∞} En. If we define E0 = ∅, then (En − En−1)_{n=1}^∞ is a mutually exclusive sequence in Σ such that

E = ⋃_{n=1}^{∞} En = ⊔_{n=1}^{∞} (En − En−1).

By Axiom A3 followed by Proposition 4.20(5),

P(E) = P(⊔_{n=1}^{∞} (En − En−1)) = Σ_{n=1}^{∞} P(En − En−1) = Σ_{n=1}^{∞} [P(En) − P(En−1)]
     = lim_{n→∞} Σ_{k=1}^{n} [P(Ek) − P(Ek−1)] = lim_{n→∞} [P(En) − P(E0)] = lim_{n→∞} P(En)

as desired, where P(E0) = P(∅) = 0 by Proposition 4.20(1).
    Now suppose En ↓ E, so En ⊇ En+1 for each n ∈ N, and E = ⋂_{n=1}^{∞} En. Then En^c ⊆ En+1^c and E^c = ⋃_{n=1}^{∞} En^c, which is to say En^c ↑ E^c, and so by our previous finding we have

lim_{n→∞} P(En^c) = P(E^c).

By Proposition 4.20(6) it follows that

lim_{n→∞} [1 − P(En)] = 1 − P(E),

which immediately yields

lim_{n→∞} P(En) = P(E)

as desired.


4.3 – Conditional Probability

The idea that the occurrence of one event may affect the chances of another event occurring is a natural one which motivates the following definition.

Definition 4.26. Let P be a probability measure on a σ-algebra Σ. If A, B ∈ Σ such that P(B) > 0, then the conditional probability of A given B is

P(A | B) = P(A ∩ B)/P(B).     (4.4)

The symbol P(A | B) may be read as “the probability of event A given the occurrence of event B,” and while this seems to connote the existence of an external timeline in which B occurs before A, it is not necessarily so. The value P(A ∩ B) is sometimes called the joint probability of A and B, which is the probability of events A and B both occurring (though not necessarily simultaneously in time).

Definition 4.27. Two events A and B are independent if P(A ∩ B) = P(A)P(B).

Proposition 4.28. Let A and B be events such that P(B) > 0. Then A and B are independent if and only if P(A | B) = P(A).

Proof. Suppose A and B are independent. Then by Definition 4.26,

P(A | B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A).

Now suppose that P(A|B) = P(A). Then by Definition 4.26,

P(A ∩B) = P(A|B)P(B) = P(A)P(B),

finishing the proof.

If it is given that P(A | B) = P(A), then by Definition 4.26 it must necessarily be the case that P(B) > 0, and so A and B must be independent by Proposition 4.28. The same conclusion follows if P(B | A) = P(B).

Example 4.29. Let E be the experiment of Example 4.24: two cards are drawn in succession from a standard deck of playing cards, without replacement. Use Definition 4.26 to find the probability of getting an even-numbered card on the second draw, given that a king was taken on the first draw.

Solution. As in Example 4.24, let A be the event of getting a king on the 1st draw and any card on the 2nd draw, and let B be the event of drawing any card on the 1st draw and an even-numbered card on the 2nd draw. We must determine P(A), which in turn requires finding n(A). The set A consists precisely of pairs of the form (K, Y), where K is any king taken on the 1st draw (4 choices), and Y is any one of the cards remaining in the deck after the 1st draw (51 choices). The Fundamental Counting Principle thus implies that n(A) = (4)(51) = 204, whereas we found that n(Ω) = 2652 in Example 4.24. By Theorem 4.22,

P(A) = n(A)/n(Ω) = 204/2652 = 1/13.

In Example 4.24 we obtained P(A ∩ B) = 20/663, and so by Definition 4.26 we have

P(B | A) = P(A ∩ B)/P(A) = (20/663)/(1/13) = 20/51 ≈ 0.392.

Remark. There is, to be sure, a quicker way to figure out the probability of getting an even-numbered card on the second draw, given that a king was taken on the first draw. There are 51 cards remaining in the deck after a king is taken out, and 20 of those 51 cards are even-numbered. This immediately implies that the answer is 20/51, and so we see that the curious formula for determining a conditional probability given by Definition 4.26 agrees with our intuition.


4.4 – The Multiplicative and Additive Laws

Theorem 4.30 (Additive Law). Let P be a probability measure on a σ-algebra Σ. If A, B ∈ Σ, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).     (4.5)

Proof. By Proposition 4.20(5) we have

P((A ∪ B) − (A ∩ B)) = P(A ∪ B) − P(A ∩ B).     (4.6)

Also, since

(A ∪ B) − (A ∩ B) = (A − B) ∪ (B − A)

for mutually exclusive measurable events A − B and B − A, Proposition 4.20 parts (2) and (5) give

P((A ∪ B) − (A ∩ B)) = P((A − B) ∪ (B − A)) = P(A − B) + P(B − A)
                     = [P(A) − P(A ∩ B)] + [P(B) − P(A ∩ B)]
                     = P(A) + P(B) − 2P(A ∩ B).

This result combined with (4.6) gives (4.5), finishing the proof.

The Additive Law as stated above involves just two events A and B; however, the law can be extended to involve three events or more. The three-event version we now state, as it evidences a pattern that in fact holds in the general case.

Proposition 4.31. Let P be a probability measure on a σ-algebra Σ. If A, B, C ∈ Σ, then

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

Proof. For A, B, C ∈ Σ, let D = B ∪ C. By the Additive Law,

P(D) = P(B ∪ C) = P(B) + P(C) − P(B ∩ C),

and also

P(A ∩ D) = P(A ∩ (B ∪ C)) = P((A ∩ B) ∪ (A ∩ C)) = P(A ∩ B) + P(A ∩ C) − P((A ∩ B) ∩ (A ∩ C)).

Now,

P(A ∪ B ∪ C) = P(A ∪ D) = P(A) + P(D) − P(A ∩ D)
             = P(A) + [P(B) + P(C) − P(B ∩ C)] − [P(A ∩ B) + P(A ∩ C) − P((A ∩ B) ∩ (A ∩ C))],


which, since (A ∩ B) ∩ (A ∩ C) = A ∩ B ∩ C, gives the desired result after regrouping.
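The two- and three-event Additive Laws are easy to spot-check numerically for a classical (equally likely) probability measure; the sketch below, which is not part of the original notes, uses arbitrary small events on Ω = {0, …, 9}.

```python
# Spot-check of Theorem 4.30 and Proposition 4.31 for the classical measure
# on a small finite sample space (arbitrary illustrative events).
from fractions import Fraction

omega = set(range(10))
P = lambda E: Fraction(len(E), len(omega))

A, B, C = {0, 1, 2, 3}, {2, 3, 4, 5}, {3, 5, 6, 7}

assert P(A | B) == P(A) + P(B) - P(A & B)
assert P(A | B | C) == (P(A) + P(B) + P(C)
                        - P(A & B) - P(A & C) - P(B & C)
                        + P(A & B & C))
print("Additive Laws verified:", P(A | B), P(A | B | C))
```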

Example 4.32. Let E be the experiment of drawing two cards in succession from a standard deck of playing cards, without replacement. Find the probability of getting a king on the first draw, and a card in the suit of clubs on the second draw.

Solution. Let A be the event of getting a king on the 1st draw and any card on the 2nd draw, and let B be the event of drawing any card on the 1st draw and a club on the 2nd draw. Then the event of a king on the 1st draw and a club on the 2nd draw is A ∩ B. If a king is taken on the 1st draw, the probability of a club on the 2nd draw will depend on whether it was K♣ that was taken, or a king in some other suit. If we let A1 be the event of getting K♣ on the 1st draw and any card on the 2nd draw, and A2 the event of getting K♥, K♦, or K♠ on the 1st draw and any card on the 2nd draw, then A = A1 ∪ A2 with A1 ∩ A2 = ∅. It follows that

(A1 ∩ B) ∩ (A2 ∩ B) = (A1 ∩ A2) ∩ B = ∅ ∩ B = ∅,

so

P((A1 ∩ B) ∩ (A2 ∩ B)) = P(∅) = 0,

and then by the Additive and Multiplicative Laws,

P(A ∩ B) = P((A1 ∪ A2) ∩ B) = P((A1 ∩ B) ∪ (A2 ∩ B))
= P(A1 ∩ B) + P(A2 ∩ B) − P((A1 ∩ B) ∩ (A2 ∩ B))
= P(A1 ∩ B) + P(A2 ∩ B)
= P(A1)P(B | A1) + P(A2)P(B | A2).

The final probabilities may be found in the same way the probabilities in Example 4.29 were found; however, as observed in the remark following that example, the conditional probabilities may be found in a quicker fashion (at least in this relatively simple setting). First, we have P(A1) = 1/52 and P(A2) = 3/52, since there is but one K♣ in a deck of 52 cards, and 3 kings in other suits. As for P(B | A1), if K♣ is drawn first, then there are 12 cards in the suit of clubs remaining among the 51 cards left in the deck, and so P(B | A1) = 12/51. If K♥, K♦, or K♠ is drawn first, then 13 clubs remain in the deck of 51, and so P(B | A2) = 13/51. Therefore,

P(A ∩ B) = P(A1)P(B | A1) + P(A2)P(B | A2) = (1/52)(12/51) + (3/52)(13/51) = 1/52,

our answer.
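A brute-force check of this answer (not part of the original notes) can reuse the enumeration idea from the earlier sketch; the deck encoding is an illustrative assumption.

```python
# Check of Example 4.32: probability of a king on the 1st draw and a club on the 2nd.
from fractions import Fraction
from itertools import permutations

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["C", "S", "H", "D"]
deck = [(r, s) for r in ranks for s in suits]

outcomes = list(permutations(deck, 2))
favorable = [w for w in outcomes if w[0][0] == "K" and w[1][1] == "C"]

print(Fraction(len(favorable), len(outcomes)))   # 1/52
```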


4.5 – Combinations and Permutations

Definition 4.33. For any n ∈ N, the factorial of n is

n! = ∏_{k=1}^{n} k = (1)(2)(3) · · · (n − 1)(n).

Also we define 0! = 1.

Usually n! is read as “n factorial.” So we have 1! = 1, 2! = (1)(2) = 2, 3! = (1)(2)(3) = 6, and so on. Factorials get very large very quickly:

20! = 2,432,902,008,176,640,000.

For n ∈ N and integer 1 ≤ k ≤ n, let nPk denote the number of k-sized permutations that may be formed from a population of n distinct objects, and let nCk denote the number of k-sized combinations that may be formed from a population of n distinct objects. We define

nP0 = 1 and nC0 = 1. (4.7)

Lemma 4.34. For any n ∈ N and integer 1 ≤ k ≤ n, nPk = k!nCk.

Proof. Each of the nCk k-sized combinations formed from n distinct objects has, by the Fundamental Counting Principle, a total of k! possible permutations. Therefore the total number of k-sized permutations that may be formed from the n objects is k! times nCk, as claimed.

Theorem 4.35. For any n ∈ N and integer 1 ≤ k ≤ n,

nPk = n!/(n − k)!   and   nCk = n!/(k!(n − k)!).

Proof. To form a k-sized permutation from n distinct objects, we choose any one of the n objects to be 1st in the arrangement, then one of the remaining n − 1 objects to be 2nd, then one of the remaining n − 2 objects to be 3rd, and so on until finally choosing one of the remaining n − (k − 1) to be kth (and hence last) in the arrangement. By the Fundamental Counting Principle the total number of arrangements (i.e. permutations) possible is

n(n − 1)(n − 2) · · · [n − (k − 1)],

and therefore

nPk = n(n − 1)(n − 2) · · · [n − (k − 1)] = n(n − 1)(n − 2) · · · [n − (k − 1)] · (n − k)!/(n − k)! = n!/(n − k)!,

as claimed.


For the second formula in the theorem, by Lemma 4.34 we have

nCk = nPk/k! = (1/k!) · n!/(n − k)! = n!/(k!(n − k)!),

which finishes the proof.

Since

n!/(n − 0)! = n!/n! = 1   and   n!/(0!(n − 0)!) = n!/((1)(n!)) = 1

for any n ∈ N, we see in light of the definitions in (4.7) that the formulas in Theorem 4.35 hold for integers 0 ≤ k ≤ n.

Another symbol for nCk that is prevalent in the literature is the binomial coefficient \binom{n}{k}. Thus we have

\binom{n}{k} = n!/(k!(n − k)!).

Both nCk and \binom{n}{k} are often read as “n choose k.”
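The factorial formulas of Theorem 4.35 are easy to evaluate by machine; the short sketch below (not part of the original notes, and assuming Python 3.8+ for math.perm and math.comb) checks them against the standard-library counting functions.

```python
# nPk and nCk via the factorial formulas of Theorem 4.35, checked against
# Python's built-in counting functions (math.perm / math.comb need Python 3.8+).
from math import factorial, perm, comb

def nPk(n, k):
    return factorial(n) // factorial(n - k)

def nCk(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

for n, k in [(5, 0), (5, 2), (10, 3), (52, 5)]:
    assert nPk(n, k) == perm(n, k)
    assert nCk(n, k) == comb(n, k)
    print(f"{n}P{k} = {nPk(n, k):,}   {n}C{k} = {nCk(n, k):,}")
```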


4.6 – Combinatorial Probability

In a standard deck of 52 playing cards recall that there are 13 ranks: A (ace), 2, 3, 4, 5, 6, 7, 8, 9, 10, J (jack), Q (queen), and K (king). Depending on the rules in effect during a game, aces are either considered “high” (ranked higher than kings) or “low” (ranked lower than twos). Also there are four suits in a standard deck: ♣ (club), ♠ (spade), ♥ (heart), and ♦ (diamond). For convenience we list all the cards here:

A♣ 2♣ 3♣ 4♣ 5♣ 6♣ 7♣ 8♣ 9♣ 10♣ J♣ Q♣ K♣
A♠ 2♠ 3♠ 4♠ 5♠ 6♠ 7♠ 8♠ 9♠ 10♠ J♠ Q♠ K♠
A♥ 2♥ 3♥ 4♥ 5♥ 6♥ 7♥ 8♥ 9♥ 10♥ J♥ Q♥ K♥
A♦ 2♦ 3♦ 4♦ 5♦ 6♦ 7♦ 8♦ 9♦ 10♦ J♦ Q♦ K♦

Example 4.36. In five-card poker a three-of-a-kind is a hand consisting of three cards of the same rank together with two other cards in two other ranks. An example would be getting three nines, a four, and a queen: 9♣ 9♥ 9♠ Q♥ 4♥. Find the probability of getting a three-of-a-kind in a game of five-card poker, assuming the five cards are dealt without replacement.

Solution. First we count how many distinct three-of-a-kind hands are possible in five-card poker. We may consider the formation of a three-of-a-kind as follows:

• Choose 1 rank among the 13 available to be the rank of the three-of-a-kind: 13C1.
• Choose 3 suits among the 4 available to be the suits in the three-of-a-kind: 4C3.
• Choose 2 ranks among the 12 remaining to be the ranks of the other two cards in the hand: 12C2.
• For each of the other two cards in the hand: choose 1 suit among the 4 available for the card’s rank: (4C1)².

By the Fundamental Counting Principle the total number of three-of-a-kind hands possible is the product of the numbers of choices:

Number of three-of-a-kind hands = \binom{13}{1}\binom{4}{3}\binom{12}{2}\binom{4}{1}^2 = 54,912.

The total number of five-card hands is \binom{52}{5}, and so the probability of being dealt a three-of-a-kind in five-card poker is

\binom{13}{1}\binom{4}{3}\binom{12}{2}\binom{4}{1}^2 / \binom{52}{5} = 54,912/2,598,960 ≈ 0.02113.
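The count and probability can be reproduced directly from the combination formulas; the brief check below is not part of the original notes (math.comb assumes Python 3.8+).

```python
# Example 4.36: count and probability of a three-of-a-kind hand.
from math import comb

hands = comb(52, 5)                                        # 2,598,960 five-card hands
three_kind = comb(13, 1) * comb(4, 3) * comb(12, 2) * comb(4, 1) ** 2
print(three_kind, three_kind / hands)                      # 54912  ≈ 0.02113
```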

Example 4.37. In five-card poker a full house is a hand consisting of a three-of-a-kind and a pair. An example would be getting three jacks and two sevens: J♦ J♣ J♥ 7♣ 7♠. Find the probability of getting a full house in a game of five-card poker, assuming the five cards are dealt without replacement.

Solution. First we count how many distinct full house hands are possible in five-card poker. We may consider the formation of a full house as follows:


• Choose 1 rank among the 13 available to be the rank of the three-of-a-kind: 13C1.
• Choose 3 suits among the 4 available to be the suits in the three-of-a-kind: 4C3.
• Choose 1 rank among the 12 remaining to be the rank of the pair: 12C1.
• Choose 2 suits among the 4 available to be the suits in the pair: 4C2.

By the Fundamental Counting Principle the total number of full houses possible is the product of the numbers of choices:

Number of full houses = \binom{13}{1}\binom{4}{3}\binom{12}{1}\binom{4}{2} = 3744.

The total number of five-card hands is \binom{52}{5}, and so the probability of being dealt a full house in five-card poker is

\binom{13}{1}\binom{4}{3}\binom{12}{1}\binom{4}{2} / \binom{52}{5} = 3744/2,598,960 ≈ 0.00144.

Example 4.38. In five-card poker a straight flush is a hand having cards of sequential rank (called a straight) that are all in the same suit (called a flush). An example of a straight flush is Q♦ J♦ 10♦ 9♦ 8♦. Under “high rules,” an ace can rank either high (above a king) or low (below a two), but cannot rank both high and low in the same hand. Thus, under high rules, both A♥ K♥ Q♥ J♥ 10♥ and 5♣ 4♣ 3♣ 2♣ A♣ qualify as straight flushes. Assuming high rules are in effect, find the probability of getting a straight flush in a game of five-card poker, assuming the five cards are dealt without replacement.

Solution. The “high card” in a straight is the card in the straight having the highest rank. Note that under high rules, the rank of the high card in a straight can be anything from 5 to K, plus also A. That’s 10 possibilities. We tally the number of distinct straight flush hands that are possible in five-card poker under high rules by determining the number of choices available at each stage in constructing such a hand.

• Choose the rank of the high card in the straight flush. There are 10 choices: 10C1.
• Choose the suit of the straight flush. There are 4 choices: 4C1.

By the Fundamental Counting Principle the total number of straight flushes possible is the product of the numbers of choices:

Number of straight flushes = \binom{10}{1}\binom{4}{1} = 40.

The total number of five-card hands is \binom{52}{5}, and so the probability of being dealt a straight flush in five-card poker is

\binom{10}{1}\binom{4}{1} / \binom{52}{5} = 40/2,598,960 ≈ 1.539 × 10⁻⁵.
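The full house and straight flush counts of Examples 4.37 and 4.38 can be checked the same way; the sketch below is not part of the original notes.

```python
# Examples 4.37 and 4.38: full house and straight flush probabilities.
from math import comb

hands = comb(52, 5)
full_house = comb(13, 1) * comb(4, 3) * comb(12, 1) * comb(4, 2)
straight_flush = comb(10, 1) * comb(4, 1)       # 10 possible high cards, 4 suits
print(full_house, full_house / hands)           # 3744  ≈ 0.00144
print(straight_flush, straight_flush / hands)   # 40    ≈ 1.539e-05
```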


5 Discrete Probability Distributions

5.1 – Discrete Random Variables

For the following definition, recall that the symbol f : A → B denotes a function f that maps each element of a set A to precisely one element of a set B. The set A is the domain of f, the set B is the codomain of f, and the range of f is the set Ran(f) ⊆ B given by

Ran(f) = {b ∈ B : f(a) = b for some a ∈ A},

or equivalently we may write Ran(f) = {f(a) : a ∈ A}.

Definition 5.1. Let Ω be a sample space. A random variable on Ω is a function X : Ω → R, and we say X is a discrete random variable if Ran X is a discrete set.

If Ω happens to be a discrete sample space, then any random variable on Ω is necessarily a discrete random variable.

If Ω is a sample space and X : Ω → R is a discrete random variable, then for each x ∈ Ran X we define the expression X = x to denote the event {ω ∈ Ω : X(ω) = x}. Thus if P is a probability measure on a σ-algebra on Ω that contains the event X = x, then

P(X = x) = P({ω ∈ Ω : X(ω) = x}).

Note that we do not here require that Ω be a discrete sample space: only the range of X need be discrete.

Remark. In a casual setting the expression X = x may be thought of as meaning that the random variable X equals the real number x. However it must be stressed that X is a function, and functions cannot “equal” a numerical quantity. In strictest terms a function is a special kind of set of ordered pairs.

Definition 5.2. Let Ω be a sample space, X a discrete random variable on Ω, and P a probability measure on a σ-algebra on Ω that contains the event X = x for each x ∈ Ran X. The probability mass function for X is the function fX : Ran X → R given by

fX(x) = P(X = x)

for each x ∈ Ran X.

Henceforth it is to be understood that whenever the probability mass function of a random variable X is considered, there in fact exists a probability measure on a σ-algebra containing all events of the form X = x.

Example 5.3. A constant random variable is a random variable X : Ω → R for which there exists some fixed c ∈ R such that X(ω) = c for all ω ∈ Ω, in which case we may write X ≡ c. (The symbol X ≡ c may be read as “X is constantly equal to c.”) The event X = x is thus ∅ if x ≠ c, and Ω if x = c. Since P(∅) = 0 and P(Ω) = 1, we find that the probability mass function of X, if X ≡ c, is given by

fX(x) = 1 for x = c,   and   fX(x) = 0 for x ≠ c.

Usually if X ≡ c we simply denote X by the symbol c, where of course Ran c = {c}.

Theorem 5.4. Let Ω be a sample space and X a discrete random variable on Ω. Then the probability mass function fX has the following properties:

1. fX(x) ≥ 0 for each x ∈ Ran X.
2. ∑_{x ∈ Ran X} fX(x) = 1.

Proof. Axiom A1 in Definition 4.19 makes clear that fX(x) ≥ 0 for each x ∈ Ran X.

Assume Ran X is countably infinite, so Ran X = {x1, x2, x3, . . .}. For each xn ∈ Ran X define

En = {ω ∈ Ω : X(ω) = xn}.

For any ω ∈ Ω we have X(ω) = xi for some i ∈ N, so that ω ∈ Ei, and hence ω ∈ ⋃_{n=1}^{∞} En. This shows that

Ω = ⋃_{n=1}^{∞} En.

Also, since xi ≠ xj whenever i ≠ j, it follows that Ei ∩ Ej = ∅ whenever i ≠ j, and so (En)_{n=1}^{∞} is a mutually exclusive sequence of events. Now, notationally X = xn denotes the event En, and so

∑_{x ∈ Ran X} fX(x) = ∑_{n=1}^{∞} fX(xn) = ∑_{n=1}^{∞} P(En) = P(⊔_{n=1}^{∞} En) = P(Ω) = 1

by Axioms A3 and A2.

The argument is much the same if Ran X is a finite set, only Proposition 4.20(2) must be used instead of Axiom A3 in the end.

Given any function f : A → B, the graph of f is the set of ordered pairs

{(x, f(x)) : x ∈ A}.


If A happens to be a finite set, then the graph of f may conveniently be given by a table with a column for x and a column for f(x).

Definition 5.5. If X is a discrete random variable with probability mass function fX, then the graph of fX is the probability distribution of X.

A discrete probability distribution is the probability distribution of any discrete random variable.

Especially in the case when the sample space Ω is a finite set, it is customary to depict the probability distribution of a discrete random variable X : Ω → R as a relative frequency table or histogram.

Notation. Throughout these notes the symbol X ∼ D will often be used to indicate that a random variable X has probability distribution D.


5.2 – Expected Value: Discrete Case

For the next definition, recall that a series ∑ a_n is absolutely convergent if the series ∑ |a_n| is convergent.

Definition 5.6. Let Ω be a sample space, P a probability measure on a σ-algebra on Ω, and X : Ω → R a discrete random variable. The expected value of X, E(X), is defined to be

E(X) = ∑_{x ∈ Ran X} x P(X = x),

provided the series is absolutely convergent. The variance of X, Var(X), is defined to be

Var(X) = E((X − E(X))²),

again provided the series is absolutely convergent.

It is common practice to let µ = E(X), so that

Var(X) = E((X − µ)²) = ∑_{x ∈ Ran((X−µ)²)} x P((X − µ)² = x).

Both E(X) and Var(X) will always exist as real numbers in the case when Ω is a finite sample space, since then Ran X is necessarily a finite set, and thus both series in Definition 5.6 become finite sums. Finite sums are trivially absolutely convergent.
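For a finite probability mass function the sums in Definition 5.6 are easily evaluated by machine; the sketch below, not taken from the notes, uses a fair six-sided die as an illustrative pmf.

```python
# E(X) and Var(X) for a finite pmf, computed directly from Definition 5.6.
# Illustrative pmf: a fair six-sided die.
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}

E = sum(x * p for x, p in pmf.items())                # expected value
Var = sum((x - E) ** 2 * p for x, p in pmf.items())   # E((X - E(X))^2)
print(E, Var)                                         # 7/2, 35/12
```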

Let X be a random variable on a sample space Ω, and let g be any real-valued function defined on Ran X. Then we define g(X) to be the random variable on Ω given by

[g(X)](ω) = g(X(ω))

for each ω ∈ Ω. Thus g(X) is in fact the function g ∘ X : Ω → R, the composition of g with the function X. Frequently we will write [g(X)](ω) simply as g(X)(ω). Finally, for any constant c we let cX denote the random variable given by (cX)(ω) = cX(ω), and we let c by itself denote the random variable that maps ω ↦ c for all ω ∈ Ω.

Example 5.7. Fix c ∈ R, and let g : R → R be given by g(x) = cx. Then for any random variable X : Ω → R we have

g(X)(ω) = g(X(ω)) = cX(ω) = (cX)(ω),

and so g(X) = cX as we would expect. Similarly, if g(x) = ax + b for constants a, b ∈ R, then g(X) = aX + b; and if g(x) = xⁿ for n ∈ N, then g(X) = Xⁿ.

Lemma 5.8. If X is a random variable on Ω and g : Ran X → R, then

{ω ∈ Ω : g(X(ω)) = y} = ⊔_{x ∈ g⁻¹(y)} {ω ∈ Ω : X(ω) = x}

for each y ∈ Ran g(X).


Proof. Clearly

{ω ∈ Ω : X(ω) = x1} ∩ {ω ∈ Ω : X(ω) = x2} = ∅

whenever x1 ≠ x2. Let y ∈ Ran g(X). Now,

ω ∈ {ω ∈ Ω : g(X(ω)) = y} ⇔ ω ∈ Ω and g(X(ω)) = y
⇔ ω ∈ Ω and X(ω) ∈ g⁻¹(y)
⇔ ω ∈ Ω and X(ω) = x for some x ∈ g⁻¹(y)
⇔ ω ∈ {ω ∈ Ω : X(ω) = x for some x ∈ g⁻¹(y)}
⇔ ω ∈ ⊔_{x ∈ g⁻¹(y)} {ω ∈ Ω : X(ω) = x},

which finishes the proof.

Proposition 5.9. If X is a discrete random variable on Ω and g : Ran X → R, then

P(g(X) = y) = ∑_{x ∈ g⁻¹(y)} P(X = x)

for each y ∈ Ran g(X).

Proof. Let y ∈ Ran g(X). Applying Lemma 5.8,

P(g(X) = y) = P({ω ∈ Ω : g(X(ω)) = y}) = P(⊔_{x ∈ g⁻¹(y)} {ω ∈ Ω : X(ω) = x})
= P(⊔_{x ∈ g⁻¹(y)} (X = x)) = ∑_{x ∈ g⁻¹(y)} P(X = x),

where the last equality follows from the observation that X is discrete, so that ⊔_{x ∈ g⁻¹(y)} (X = x) is an at-most countable union of mutually exclusive events, and therefore either Axiom A3 or Proposition 4.20(2) applies.

Theorem 5.10. If X is a discrete random variable on Ω, and g : Ran X → R is such that E(g(X)) is defined, then

E(g(X)) = ∑_{x ∈ Ran X} g(x) P(X = x).

Proof. By Definition 5.6 and Proposition 5.9,

E(g(X)) = ∑_{y ∈ Ran g(X)} y P(g(X) = y) = ∑_{y ∈ Ran g(X)} ( y ∑_{x ∈ g⁻¹(y)} P(X = x) )
= ∑_{y ∈ Ran g(X)} ( ∑_{x ∈ g⁻¹(y)} y P(X = x) )
= ∑_{y ∈ Ran g(X)} ∑_{x ∈ Ran X, g(x)=y} g(x) P(X = x)   (5.1)
= ∑_{x ∈ Ran X} g(x) P(X = x).   (5.2)

For the last equality, we observe that for y1, y2 ∈ Ran g(X) such that y1 ≠ y2, we have

{x ∈ Ran X : g(x) = y1} ∩ {x ∈ Ran X : g(x) = y2} = ∅,

and so each x ∈ Ran X appears in at most one term g(x)P(X = x) in the sum (5.1). On the other hand, for each x ∈ Ran X we have y = g(x) ∈ Ran g(X), and so x will appear in at least one term g(x)P(X = x) in (5.1). Thus for each x ∈ Ran X there is precisely one term g(x)P(X = x) in (5.1), which shows that (5.1) must equal (5.2).

Theorem 5.11. Let X be a discrete random variable on Ω, let g1, . . . , gn : Ran X → R, and let c be a constant. Provided the involved series are absolutely convergent, the following hold.

1. E(c) = c.
2. E(cX) = cE(X).
3. E(∑_{k=1}^{n} gk(X)) = ∑_{k=1}^{n} E(gk(X)).

Proof.
Proof of (1). Referring to Example 5.3 and Definition 5.6,

E(c) = ∑_{x ∈ Ran c} x P(c = x) = ∑_{x ∈ {c}} x P(c = x) = c P(c = c) = c.

Proof of (2). As Example 5.7 illustrates, g(X) = cX when g(x) = cx. By Theorem 5.10,

E(cX) = ∑_{x ∈ Ran X} cx P(X = x) = c ∑_{x ∈ Ran X} x P(X = x) = cE(X).

Proof of (3). Simply apply Theorem 5.10 with g(X) = ∑_{k=1}^{n} gk(X):

E(∑_{k=1}^{n} gk(X)) = ∑_{x ∈ Ran X} (∑_{k=1}^{n} gk(x)) P(X = x) = ∑_{k=1}^{n} (∑_{x ∈ Ran X} gk(x) P(X = x)) = ∑_{k=1}^{n} E(gk(X)).

Proposition 5.12. Let X be a discrete random variable with E(X) defined, and let c be a constant. Provided the involved series are absolutely convergent, the following hold.

1. Var(X) = E(X²) − E²(X).
2. Var(c + X) = Var(X).
3. Var(cX) = c²Var(X).

Proof.
Proof of (1). Let µ = E(X). By Definition 5.6, Var(X) = E((X − µ)²). Now, by Theorem 5.11,

Var(X) = E((X − µ)²) = E(X² − 2µX + µ²) = E(X²) − E(2µX) + E(µ²)
= E(X²) − 2µE(X) + µ² = E(X²) − 2µ² + µ² = E(X²) − µ²
= E(X²) − E²(X).

Proof of (2). By part (1) and Theorem 5.11,

Var(c + X) = E((c + X)²) − E²(c + X) = E(c² + 2cX + X²) − (E(c) + E(X))²
= E(c²) + 2cE(X) + E(X²) − E²(c) − 2E(c)E(X) − E²(X)
= c² + 2cE(X) + E(X²) − c² − 2cE(X) − E²(X)
= E(X²) − E²(X) = Var(X).

Proof of (3). By part (1) and Theorem 5.11,

Var(cX) = E((cX)²) − E²(cX) = c²E(X²) − (cE(X))² = c²E(X²) − c²E²(X) = c²Var(X).


5.3 – The Binomial Distribution

Definition 5.13. A binomial experiment with parameters n ∈ N and p ∈ (0, 1), denoted by Bin(n, p), is an experiment with the following properties.

1. The experiment is predetermined to consist of n trials.
2. Each trial has two possible outcomes: “success” S and “failure” F.
3. The probability of “success” p is the same for each trial.

A trial that has precisely two possible outcomes is called a Bernoulli trial. The outcomes may be labeled “success” and “failure,” or “yes” and “no,” or even (especially in computer science) 1 and 0. A Bernoulli trial with probability of success p is, by itself, the binomial experiment Bin(1, p).

In general, if an experiment consists of n trials, and the probabilities of the possible outcomes of each trial are the same for all n trials, then the trials are said to be independent. Thus a binomial experiment can be defined to be an experiment consisting of a fixed number of independent two-outcome trials.

Each possible outcome of a binomial experiment Bin(n, p) is a particular arrangement of n successes and failures. Thus, if Ω is the sample space of Bin(n, p), then each outcome ω ∈ Ω may be identified with a string (i.e. a finite sequence) of n letters, each letter being either an S or an F. (By the Fundamental Counting Principle there are 2ⁿ such strings, so n(Ω) = 2ⁿ, and we see that any binomial experiment has a finite sample space.) Suppose k is the number of times the letter S appears in the string ω. Then 0 ≤ k ≤ n, and setting X(ω) = k defines a random variable X : Ω → {0, 1, . . . , n} that counts the total number of successes in each outcome ω of the experiment Bin(n, p).

Example 5.14. The flipping of a single coin three times is a binomial experiment with n = 3 trials. The outcome of a trial (i.e. one coin flip) that we designate as “success” we can choose to be the result of getting heads. The probability of success is of course 1/2 for each trial, and we denote the experiment by Bin(3, 1/2).

Example 5.15. The rolling of a single six-sided die five times may be turned into a binomial experiment if we define “success” to be obtaining a 1 and “failure” to be obtaining anything else. Thus, even though any one of six different numbers may result from each roll of the die, there are only two outcomes of interest: obtaining a 1 or not obtaining a 1. The probability of success is 1/6 for each trial, and we denote the experiment by Bin(5, 1/6).

Theorem 5.16. Let Ω be the sample space of Bin(n, p). Define random variable X : Ω → {0, . . . , n} by X(ω) = k, where k is the total number of successes in ω. Then

P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k}

for each k ∈ {0, . . . , n}.


Proof. Fix 0 ≤ k ≤ n. Let ω ∈ Ω be any outcome of Bin(n, p) for which the total number of successes is k. The corresponding set {ω} is then an event in which success is observed to occur precisely k times, in a particular sequence, upon performing the experiment Bin(n, p) a single time. In fact, the occurrence of the event {ω} implies k successes and n − k failures in a particular sequence. Each success has probability p, and so each failure has probability 1 − p. Thus the probability of {ω} occurring is

P({ω}) = p^k (1 − p)^{n−k}

by the Multiplication Rule.

Now we turn to the question of how many outcomes in Ω are comprised of k successes and n − k failures. Any outcome ω has n distinct positions for the placement of letters S or F, and so the problem is simply to determine how many combinations of k out of the n positions can be assigned the letter S (with the other positions being assigned an F). If we let r denote the number of combinations, then since each combination corresponds to an outcome, we may denote the outcomes by ω1, . . . , ωr. The number r is, of course, otherwise denoted by nCk or \binom{n}{k}, and is given explicitly by Theorem 4.35.

Finally, each of the events {ω1}, . . . , {ωr} is mutually exclusive from all the others (which is to say one particular arrangement of k successes and n − k failures precludes all others), and each has a probability of p^k (1 − p)^{n−k}. Now, since X(ω) = k if and only if ω = ωj for some 1 ≤ j ≤ r, we find that

{ω ∈ Ω : X(ω) = k} = ⊔_{j=1}^{r} {ωj},

and therefore

P(X = k) = P({ω ∈ Ω : X(ω) = k}) = P(⊔_{j=1}^{r} {ωj}) = ∑_{j=1}^{r} P({ωj})
= ∑_{j=1}^{r} p^k (1 − p)^{n−k} = r p^k (1 − p)^{n−k} = \binom{n}{k} p^k (1 − p)^{n−k}

by the Addition Rule.

Definition 5.17. Let n ∈ N and p ∈ (0, 1). A random variable X with range {0, . . . , n} has a binomial distribution with parameters n and p, denoted by Bin(n, p), if its probability mass function fX is given by

fX(k) = \binom{n}{k} p^k (1 − p)^{n−k}   (5.3)

for each k ∈ {0, . . . , n}.

Technically speaking, Definition 5.17 presupposes that the domain of X is some sample space Ω, and there exists some σ-algebra Σ ⊆ P(Ω) and probability measure P : Σ → [0, 1] for which

P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k}

holds for each k ∈ {0, . . . , n}. Theorem 5.16 furnishes one explicit situation in which this is the case. In the theorem, the sample space Ω is finite, Σ = P(Ω), and P is the classical probability measure of Theorem 4.22. This is enough to ensure that the function fX in Definition 5.17 is a viable probability mass function.

There is a one-to-one correspondence between binomial experiments with parameters n and p and binomial distributions with parameters n and p, and therefore we denote each by the same symbol Bin(n, p). The symbol X ∼ Bin(n, p) shall be used to indicate that a discrete random variable X has the distribution Bin(n, p).

Remark. In practice any reference to a “binomial distribution” is understood to be a reference to a random variable X having probability mass function given by (5.3). Analogous conventions apply to all the various probability distributions we may encounter in these notes.

Example 5.18. Gomer Trundlebum spent the weekend partying in the company of magical jellybeans from the Flowr of Lyfe cannabis dispensary in Eugene, Oregon. When the last strains of otherworldly sitar music faded from the maelstrom of his mind on Monday morning whilst vanquishing the munchies at Ye Olde Pancake House on 11th Avenue, he suddenly remembered that he had a quiz that day in his quantum electrodynamics class. The quiz consists of eight multiple-choice questions, each question having five choices: one right, four wrong. Not knowing a Hermitian conjugate operator from a hole in the ground, Gomer must randomly guess the answer to every question.

(a) What is the probability Gomer will get precisely three questions right?
(b) What is the probability Gomer will pass the quiz, if passing requires getting a score of at least 70%?

Solution.
(a) This is a binomial experiment with n = 8 independent trials, and if we define “success” to be “guessing the right answer,” then the probability of success is p = 1/5 for each trial. Thus the event of getting precisely three questions correct is the event of observing a total of k = 3 successes. Letting the random variable X be the “success counter” as usual, the probability we seek is

P(X = 3) = 8C3 (1/5)³ (1 − 1/5)^{8−3} = (8!/((3!)(5!))) (1/5)³ (4/5)⁵ ≈ 0.1468.

(b) Only getting 6, 7, or 8 questions right results in a score of at least 70%, and so we must find the probability that X ≥ 6. The event X ≥ 6 is the union of the events X = 6, X = 7, and X = 8, and since these are mutually exclusive events, the Addition Rule implies that

P(X ≥ 6) = P(X = 6 or X = 7 or X = 8) = P(X = 6) + P(X = 7) + P(X = 8).

We calculate

P(X = 6) = 8C6 (1/5)⁶ (4/5)² ≈ 0.001147,

P(X = 7) = 8C7 (1/5)⁷ (4/5)¹ ≈ 8.192 × 10⁻⁵,

and

P(X = 8) = 8C8 (1/5)⁸ (4/5)⁰ = (1/5)⁸ ≈ 2.560 × 10⁻⁶.


Therefore the probability of passing the quiz is

P(X ≥ 6) ≈ 0.001147 + 8.192 × 10⁻⁵ + 2.560 × 10⁻⁶ ≈ 0.00123.

This is only about 1 chance in 813 for poor Gomer. Party on!
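A brief computational check of both parts (not part of the original notes) follows; it evaluates the same binomial pmf directly.

```python
# Example 5.18 revisited: P(X = 3) and P(X >= 6) for X ~ Bin(8, 1/5).
from math import comb

def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 8, 1/5
print(binom_pmf(n, p, 3))                           # part (a): ≈ 0.1468
print(sum(binom_pmf(n, p, k) for k in (6, 7, 8)))   # part (b): ≈ 0.00123
```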


5.4 – The Geometric Distribution

Definition 5.19. A geometric experiment with parameter p ∈ (0, 1), denoted by Geo(p), is an experiment with the following properties.

1. The experiment consists of at least one trial.
2. Each trial has two possible outcomes: “success” and “failure.”
3. The probability of “success” p is the same for each trial.
4. The experiment stops when the first “success” is observed.

The third property could be otherwise stated as: “The trials are independent.” There is no predetermined or even maximum number of trials that a geometric experiment may have. The first success may occur on the 1st trial, or the 2nd, or the 3rd, and so on. Thus the sample space Ω of a geometric experiment is countably infinite, with outcomes S (success on the 1st trial), FS (failure on the 1st trial, success on the 2nd), FFS (failure on the 1st and 2nd trials, success on the 3rd), and so on:

Ω = {S, FS, FFS, FFFS, FFFFS, FFFFFS, FFFFFFS, . . .}.

Except for S itself, each outcome ω ∈ Ω is a string of consecutive F letters followed by a single S letter. A natural random variable X : Ω → N to consider is given by

X(ω) = length of the string ω.   (5.4)

That is, X(ω) is the number of the trial in which the first success is observed. For example,

X(S) = 1, X(FS) = 2, X(FFS) = 3,

and so on. The function X is seen to be injective; that is, for any outcomes ω1, ω2 ∈ Ω such that ω1 ≠ ω2, it follows that X(ω1) ≠ X(ω2).

Theorem 5.20. Let Ω be the sample space of Geo(p). Define random variable X : Ω → R by X(ω) = k, where k is the length of the string ω. For each k ∈ N,

P(X = k) = p(1 − p)^{k−1}.

Proof. Fix k ∈ N. Let ω ∈ Ω be any outcome of Geo(p) for which X(ω) = k. Then {ω} is an event in which k − 1 consecutive failures are observed, followed by a single success. Since the probability of success is P(S) = p, the probability of failure is P(F) = 1 − p, and the trials in the experiment are independent, the Multiplication Rule implies that

P({ω}) = P(S) ∏_{j=1}^{k−1} P(F) = p ∏_{j=1}^{k−1} (1 − p) = p(1 − p)^{k−1}.

As noted above, the function X is injective, and so there is in fact no other outcome in Ω that X will map to k. Hence the event that X = k is precisely the event {ω}, and therefore

P(X = k) = P({ω}) = p(1 − p)^{k−1}

as claimed.


Definition 5.21. Let p ∈ (0, 1). A random variable X with range N has a geometric distribution with parameter p, denoted by Geo(p), if its probability mass function fX is given by

fX(k) = p(1 − p)^{k−1}

for each k ∈ N.

It is understood in the definition that the domain of X is some sample space Ω, and there is some σ-algebra Σ ⊆ P(Ω) and probability measure P : Σ → [0, 1] for which P(X = k) = p(1 − p)^{k−1} holds for each k ∈ N. Theorem 5.20 gives an explicit instance of this for countably infinite Ω, Σ = P(Ω), and P the classical probability measure of Theorem 4.22. This ensures that fX in Definition 5.21 is a viable probability mass function.

The symbol X ∼ Geo(p) shall be used to indicate that a discrete random variable X has the distribution Geo(p).

Recall the geometric series from calculus. For any a ∈ R and x ∈ (−1, 1),

∑_{k=0}^{∞} a x^k = a/(1 − x).   (5.5)

Differentiating both sides of the equation with respect to x once, and then a second time, yields

∑_{k=1}^{∞} a k x^{k−1} = a/(1 − x)²   and   ∑_{k=2}^{∞} a k(k − 1) x^{k−2} = 2a/(1 − x)³.   (5.6)

These last two equations we use in the course of proving the following.

Theorem 5.22. If X ∼ Geo(p), then

E(X) = 1/p   and   Var(X) = (1 − p)/p².

Proof. Suppose X ∼ Geo(p). By Definition 5.21 we have 1 − p ∈ (0, 1) and Ran X = N, and so

E(X) = ∑_{k ∈ Ran X} k P(X = k) = ∑_{k=1}^{∞} k fX(k) = ∑_{k=1}^{∞} p k (1 − p)^{k−1} = p/[1 − (1 − p)]² = p/p² = 1/p.

Next, through reindexing and routine manipulations of the equations in (5.5) and (5.6), we obtain

∑_{k=1}^{∞} a k² x^{k−1} = (a + ax)/(1 − x)³.   (5.7)

Now, with Proposition 5.12 and equation (5.7), we have

Var(X) = E(X²) − E²(X) = ∑_{k ∈ Ran X} k² P(X = k) − 1/p² = ∑_{k=1}^{∞} k² fX(k) − 1/p²
= ∑_{k=1}^{∞} p k² (1 − p)^{k−1} − 1/p² = [p + p(1 − p)]/[1 − (1 − p)]³ − 1/p² = (1 − p)/p²,

finishing the proof.


Example 5.23. An oil prospector plans to drill holes in a region until a productive well is found or funds are exhausted. Geologists estimate the probability any single drilling will be successful to be 0.18.

(a) What is the probability that the fourth hole drilled is the first to yield a productive well?
(b) If the prospector can afford to drill no more than seven holes, what is the probability that the search for a productive well will fail?

Solution.
(a) This is a geometric experiment in which each trial consists of the drilling of a hole, “success” is defined to be discovery of a productive well, and the probability of success is p = 0.18 for each trial. Letting the random variable X denote the number of the trial in which the first success is observed, the probability of success on the 4th trial (i.e. upon the drilling of the 4th hole) is

P(X = 4) = 0.18(1 − 0.18)^{4−1} = 0.18(0.82)³ ≈ 0.0992.

(b) We start by finding the probability that the first success is observed upon performing one of the first seven trials. With X defined as before, this means calculating the probability that X ≤ 7. The event X ≤ 7 is the union of the seven events X = 1, . . . , X = 7, and since these are mutually exclusive events the Addition Rule implies that

P(X ≤ 7) = ∑_{k=1}^{7} P(X = k) = ∑_{k=1}^{7} 0.18(1 − 0.18)^{k−1} = 0.18 ∑_{k=1}^{7} (0.82)^{k−1}
= 0.18(1 + 0.82 + 0.82² + 0.82³ + 0.82⁴ + 0.82⁵ + 0.82⁶) ≈ 0.751.

The probability of a failed search is the probability that the first success is not observed in the first seven trials, which is P(X > 7). Since

P(X ≤ 7) + P(X > 7) = 1,

we find that

P(X > 7) = 1 − P(X ≤ 7) = 1 − 0.7507 ≈ 0.249.

Thus there is roughly a 25% chance of failing to find a productive well before funds run out.
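The same numbers, together with a rough numerical check of the mean 1/p from Theorem 5.22, can be produced by the short sketch below (not part of the original notes).

```python
# Example 5.23 revisited: X ~ Geo(0.18), the trial of the first productive well.
p = 0.18

def geo_pmf(k):
    return p * (1 - p) ** (k - 1)

print(geo_pmf(4))                                  # part (a): ≈ 0.0992
print(1 - sum(geo_pmf(k) for k in range(1, 8)))    # part (b): P(X > 7) ≈ 0.249

# Rough check of Theorem 5.22: E(X) = 1/p, approximated by a long partial sum.
print(sum(k * geo_pmf(k) for k in range(1, 2000)), 1 / p)
```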


5.5 – The Hypergeometric Distribution

Definition 5.24. A hypergeometric experiment with parameters N ∈ N and K, n ∈ {0, 1, . . . , N}, denoted by Hyp(N, K, n), is an experiment with the following properties.

1. At the outset there is a population of size N.
2. K members of the population are named a “success,” and the rest named a “failure.”
3. There are n trials.
4. Each trial consists of drawing 1 individual, without replacement, from the population.


5.6 – The Poisson Distribution

Definition 5.25. A Poisson experiment with parameter λ > 0, denoted by Pois(λ), is an experiment with the following properties.

1. The experiment consists of observing the number of times an event E occurs in a given bounded interval I.
2. E occurs a mean of λ times in any interval having length equal to the length of I.
3. The probability that E occurs at least once in any two bounded intervals of equal length is the same.

The sample space Ω of a Poisson experiment is countably infinite. We may write

Ω = {ω0, ω1, ω2, . . .},   (5.8)

where ωk is the outcome that event E occurs k times in the interval I for each integer k ≥ 0.

A consequence of the second property in Definition 5.25 is that if I1 and I2 are two disjoint intervals of the same length, then the probability that E occurs at least once in I2 is not affected by whether or not E occurs at least once in I1. Thus if E1 is the event that E occurs at least once in I1 and E2 is the event that E occurs at least once in I2, then E1 and E2 are independent events.

Suppose it is known that the mean number of times an event E occurs in any interval of length ℓ is λ. Let I be a given interval of length ℓ. What is the probability that E will occur precisely k times in I?

To start, we partition the interval I into n subintervals I1, . . . , In, each subinterval of length ℓ/n. We make two assumptions:

• The value of n is much larger than both k and λ.
• The k occurrences of E are distributed over k distinct subintervals.

Thus E occurs precisely once in k subintervals, and does not occur at all in the other n − k subintervals. The subintervals must be mutually disjoint for this to work. For example, if I = [0, t] for t > 0, then we may let

I1 = [0, t/n), I2 = [t/n, 2t/n), . . . , In−1 = [(n − 2)t/n, (n − 1)t/n), In = [(n − 1)t/n, t],

all of which have length t/n.

Our assumptions above enable us to construct a binomial experiment consisting of n trials, the jth trial being an observation of whether or not E occurs in the subinterval Ij. We call the occurrence of E in a subinterval Ij a “success,” and the nonoccurrence of E in Ij a “failure.” We wish to determine the probability of observing k successes among the n trials. The second property in Definition 5.25 ensures that the trials are independent (i.e. the occurrence of E in one subinterval does not influence the chances of E occurring in another subinterval), which is to say the probability of “success” is the same for each trial. Since E occurs a mean of λ times over the n trials, the probability E will occur in any one trial is λ/n if we assume n to be much larger than λ. This is our probability of “success.” By Theorem 5.16 the probability of k successes is

(n!/(k!(n − k)!)) (λ/n)^k (1 − λ/n)^{n−k}   (5.9)


for the binomial experiment we have contrived, which is based on our earlier assumptions.

The value of the expression (5.9) can only be an approximation of the probability that E occurs k times in the interval I. This is because our initial assumptions do not allow complete freedom of choice as to where in the interval I the k occurrences of E may be placed for our fixed (but very large) choice for the value of n. Thus, if anything, the probability given by (5.9) should underestimate the true probability that E occurs k times in I. The only way to increase the freedom of choice in how the k occurrences of E may be distributed over I is to increase the value of n. Therefore we should expect that as n → ∞ the value of (5.9) will converge to the true probability that E occurs k times in I. Since

lim_{n→∞} n!/(n^k (n − k)!) = lim_{n→∞} n(n − 1) · · · (n − k + 1)/n^k = lim_{n→∞} (n/n · (n − 1)/n · · · (n − k + 1)/n) = 1,

lim_{n→∞} (1 − λ/n)^{−k} = [lim_{n→∞} (1 − λ/n)]^{−k} = 1^{−k} = 1,

and, with the help of L’Hôpital’s Rule,

lim_{n→∞} (1 − λ/n)^n = exp[lim_{n→∞} n ln(1 − λ/n)] = exp[lim_{n→∞} λ/(λ/n − 1)] = e^{−λ},

we find that

lim_{n→∞} (n!/(k!(n − k)!)) (λ/n)^k (1 − λ/n)^{n−k}
= (λ^k/k!) · lim_{n→∞} n!/(n^k (n − k)!) · lim_{n→∞} (1 − λ/n)^{−k} · lim_{n→∞} (1 − λ/n)^n = λ^k e^{−λ}/k!.   (5.10)

So, recalling the sample space Ω of a Poisson experiment as characterized by (5.8), from (5.10) it seems reasonable to conclude that

P({ωk}) = λ^k e^{−λ}/k!.   (5.11)

But does this truly constitute a probability measure? The following theorem affirms that it can be made to be one by conferring to it the necessary additivity property.

Theorem 5.26. Let E be an experiment with sample space Ω = {ω0, ω1, ω2, . . .}, and let λ > 0. Define P : P(Ω) → R by (5.11) for each ωk ∈ Ω, and for any mutually exclusive sequence ({ω_{n_k}})_k let

P(⊔_k {ω_{n_k}}) = ∑_k P({ω_{n_k}}).   (5.12)

Then P is a probability measure on P(Ω).

Proof. We must check that the axioms in Definition 4.19 are satisfied. If A ⊆ Ω, then A = ⊔_k {ω_{n_k}} for some mutually exclusive sequence ({ω_{n_k}})_k, and so

P(A) = P(⊔_k {ω_{n_k}}) = ∑_k P({ω_{n_k}}) ≥ 0

since

P({ω_{n_k}}) = λ^{n_k} e^{−λ}/n_k! ≥ 0

for each k. Hence Axiom A1 is satisfied.

To show the equation in Axiom A3 holds for any mutually exclusive sequence of events (Ak)_{k=1}^{∞}, we observe that each event Ak, as well as the event ⋃_{k=1}^{∞} Ak, can be written in the form ⊔_k {ω_{n_k}} for some mutually exclusive sequence ({ω_{n_k}})_k. Repeated use of (5.12) will finish the argument.

Finally, we have

P(Ω) = P(⊔_{k=0}^{∞} {ωk}) = ∑_{k=0}^{∞} P({ωk}) = ∑_{k=0}^{∞} λ^k e^{−λ}/k! = e^{−λ} ∑_{k=0}^{∞} λ^k/k! = e^{−λ} e^{λ} = 1,

and so Axiom A2 holds. Therefore P : P(Ω) → R is a probability measure.

It should be noted that Theorem 5.26 does not require the experiment E to be a Poisson experiment specifically, but with the theorem we may now confidently define the Poisson distribution.

Definition 5.27. Let λ > 0. A random variable X with range {0, 1, 2, . . .} has a Poisson distribution with parameter λ if its probability mass function fX is given by

fX(k) = λ^k e^{−λ}/k!

for each k ≥ 0.

The foregoing developments, including Theorem 5.26, show constructively that there does indeed exist a random variable X having a Poisson distribution. That fX as given in Definition 5.27 is a viable probability mass function is therefore not in doubt.

We summarize our findings thus far. If an experiment E has the properties of a Poisson experiment with parameter λ that are listed in Definition 5.25, then the random variable X that counts the number of times the event E is observed to occur in the fixed interval I will have a Poisson distribution with parameter λ; that is,

P(X = k) = λ^k e^{−λ}/k!

for each k ≥ 0.

Example 5.28. During the 20th century the number of hurricanes to strike the U.S. mainland had a Poisson distribution with a mean of about 0.6 per calendar year.

(a) What is the probability that exactly one hurricane strikes the U.S. mainland in a given calendar year?
(b) What is the probability that more than two hurricanes strike the U.S. mainland in a given calendar year?

Solution.
(a) If the random variable X counts the number of hurricanes in a given calendar year, then X has a Poisson distribution with parameter 0.6, and so

P(X = k) = 0.6^k e^{−0.6}/k!


for k ≥ 0. Thus we have

P(X = 1) = 0.6 e^{−0.6}/1! ≈ 0.329.

(b) We have

P(X > 2) = 1 − P(X ≤ 2) = 1 − [P(X = 0) + P(X = 1) + P(X = 2)]
= 1 − 0.6⁰e^{−0.6}/0! − 0.6¹e^{−0.6}/1! − 0.6²e^{−0.6}/2!
= 1 − 0.5488 − 0.3293 − 0.0988 = 0.0231.
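Both answers are quick to reproduce from the Poisson pmf; the sketch below is not part of the original notes.

```python
# Example 5.28 revisited: X ~ Pois(0.6), hurricanes per calendar year.
from math import exp, factorial

lam = 0.6

def pois_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

print(pois_pmf(1))                              # part (a): ≈ 0.329
print(1 - sum(pois_pmf(k) for k in range(3)))   # part (b): ≈ 0.0231
```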


6 Continuous Probability Distributions

6.1 – Continuous Random Variables

Throughout this chapter, whenever given a random variable X : Ω → R, we assume the existence of a probability measure P on a σ-algebra Σ ⊆ P(Ω) that contains all events of the form

{ω ∈ Ω : X(ω) ∈ I},

where I is any interval in R. To be clear, I may have any of the forms

(−∞, ∞), (−∞, x2), (−∞, x2], [x1, ∞), (x1, ∞), (x1, x2), [x1, x2), [x1, x2]

for −∞ < x1 < x2 < ∞. This means that if Ran X = R, as will often be the case, then P : Σ → [0, 1] is such that the probabilities

P(x1 ≤ X ≤ x2), P(X ≤ x2), P(X ≥ x1)

are defined for any x1, x2 ∈ R with x1 < x2, as are

P(x1 < X < x2), P(X < x2), P(X > x1)

for any x1 ∈ R ∪ {−∞} and x2 ∈ R ∪ {∞}, and so on. Recall that, for instance, x1 < X < x2 denotes the event

{ω ∈ Ω : x1 < X(ω) < x2} = {ω ∈ Ω : X(ω) ∈ (x1, x2)}.

Definition 6.1. Let X be a random variable on Ω. The cumulative distribution function (c.d.f.) of X is the function FX : R → [0, 1] given by

FX(x) = P(X ≤ x)

for each x ∈ R.

Example 6.2. Find the cumulative distribution function FX for the random variable X having binomial distribution with parameters n = 3 and p = 1/3.


Solution. By definition we have

P(X = k) = \binom{3}{k} (1/3)^k (2/3)^{3−k}

for k ∈ {0, 1, 2, 3}, and so

P(X = 0) = 8/27,  P(X = 1) = 4/9,  P(X = 2) = 2/9,  P(X = 3) = 1/27.

Moreover, P(X = x) = 0 for any x ∉ {0, 1, 2, 3}.

Let x ∈ R. If x ∈ (−∞, 0), then P(X ≤ x) = 0 since no values less than 0 are in the range of X. If x ∈ [0, 1), then

P(X ≤ x) = P(X < 0) + P(X = 0) + P(0 < X ≤ x) = 0 + 8/27 + 0 = 8/27.

If x ∈ [1, 2), then

P(X ≤ x) = P(X < 0) + P(X = 0) + P(0 < X < 1) + P(X = 1) + P(1 < X ≤ x)
= 0 + 8/27 + 0 + 4/9 + 0 = 20/27.

Carrying on in this fashion, we obtain a piecewise-defined function:

FX(x) =
  0,      x < 0
  8/27,   0 ≤ x < 1
  20/27,  1 ≤ x < 2
  26/27,  2 ≤ x < 3
  1,      x ≥ 3.

See Figure 4 below.

Figure 4. (Graph of the step-function c.d.f. FX.)


The c.d.f. of a discrete random variable will always be a step function such as that found in Example 6.2.
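A c.d.f. like the one in Example 6.2 can be tabulated by accumulating the pmf; the short sketch below (not part of the original notes) reproduces the jump values 8/27, 20/27, 26/27, and 1.

```python
# The c.d.f. of X ~ Bin(3, 1/3) from Example 6.2, evaluated as a step function.
from fractions import Fraction
from math import comb, floor

def pmf(k):
    return comb(3, k) * Fraction(1, 3)**k * Fraction(2, 3)**(3 - k)

def cdf(x):
    # sum of pmf(k) over all integers k <= x in the range {0, 1, 2, 3}
    if x < 0:
        return Fraction(0)
    return sum(pmf(k) for k in range(0, min(3, floor(x)) + 1))

for x in [-0.5, 0, 0.7, 1, 1.5, 2, 2.9, 3, 4]:
    print(x, cdf(x))
```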

It will be convenient to define a c.d.f. FX at −∞ and ∞ as follows:

FX(−∞) = lim_{x→−∞} FX(x)   and   FX(∞) = lim_{x→∞} FX(x),

provided the limits exist. The next theorem assures us that both limits always exist.

Theorem 6.3. A cumulative distribution function FX : R → [0, 1] for a random variable X has the following properties.

1. FX is monotone increasing.
2. FX(−∞) = 0 and FX(∞) = 1.
3. FX is right-continuous on R.

Proof.
Proof of (1). Let x1, x2 ∈ R with x1 < x2. Then by Axiom A3 and Proposition 4.20(2),

FX(x1) = P(X ≤ x1) ≤ P(X ≤ x1) + P(x1 < X ≤ x2) = P((X ≤ x1) ∪ (x1 < X ≤ x2)) = P(X ≤ x2) = FX(x2),

and so FX(x1) ≤ FX(x2). Therefore FX is monotone increasing.

Proof of (2). We prove that FX(∞) = 1, and leave the similar argument that FX(−∞) = 0 to the reader.

Let ε > 0. Suppose for every α > 0 there exists xα > α such that P(X ≤ xα) ≤ 1 − ε. Then in particular for each n ∈ N there exists xn > n such that FX(xn) = P(X ≤ xn) ≤ 1 − ε, and since FX is monotone increasing by part (1), it follows that FX(n) = P(X ≤ n) ≤ 1 − ε. Thus (FX(n))_{n=1}^{∞} is a monotone increasing sequence of real numbers with upper bound 1 − ε, and this fact together with the Monotone Convergence Theorem of elementary analysis leads to the conclusion that lim_{n→∞} FX(n) exists in R with lim_{n→∞} FX(n) ≤ 1 − ε. Now, by Proposition 4.20(2) and Axiom A3,

lim_{n→∞} FX(n) = lim_{n→∞} [P(X ≤ 0) + ∑_{k=1}^{n} P(k − 1 < X ≤ k)]
= P(X ≤ 0) + ∑_{k=1}^{∞} P(k − 1 < X ≤ k)
= P(X ≤ 0) + P(⊔_{k=1}^{∞} (k − 1 < X ≤ k))
= P(X ≤ 0) + P(0 < X < ∞) = P(X < ∞),

and so P(X < ∞) ≤ 1 − ε. On the other hand X : Ω → R for some sample space Ω, and so

P(X < ∞) = P({ω : X(ω) < ∞}) = P(Ω) = 1

by Axiom A2. Our two results taken together give 1 ≤ 1 − ε, or ε ≤ 0, which contradicts our initial assumption that ε > 0. Hence there must exist some α > 0 such that P(X ≤ x) > 1 − ε for all x > α, or equivalently |FX(x) − 1| < ε for all x > α. Therefore FX(∞) = 1.

Proof of (3). Suppose FX is not right-continuous at x ∈ R. Then there exists some ε > 0 such that, for each n ∈ N, there is some xn ∈ [x, x + 1/n) for which FX(xn) − FX(x) ≥ ε. Hence

P(x < X ≤ x + 1/n) = FX(x + 1/n) − FX(x) ≥ FX(xn) − FX(x) ≥ ε   (6.1)

for each n, since x + 1/n > xn and FX is monotone increasing. Defining

En = (x < X ≤ x + 1/n),

it is routine to check that En ↓ ∅, and so

lim_{n→∞} P(En) = P(∅) = 0   (6.2)

by Propositions 4.25 and 4.20(1). On the other hand (6.1) shows that P(En) ≥ ε for all n, which contradicts (6.2). Therefore FX must be right-continuous at x, and since x ∈ R is arbitrary, we conclude that FX is right-continuous on R.

Definition 6.4. Let X : Ω → R be a random variable with cumulative distribution function FX, and let F′X denote the derivative of FX. Then X is continuous if the following hold:

1. FX is continuous on R.
2. F′X exists and is continuous on R − S, where S is such that S ∩ I is a finite set for any bounded interval I.

Remark. Our definition of a continuous random variable is taken by many authors to be the definition of an “absolutely continuous” random variable, with a “continuous” random variable being a random variable X for which Ran X is an uncountably infinite set. In very elementary statistics books a random variable may be defined to be “continuous” if its range is a continuous set (i.e. an interval of real numbers).

Proposition 6.5. If X is a continuous random variable, then P(X = x) = 0 for all x ∈ R.

Proof. Suppose X is a continuous random variable. For fixed x ∈ R set P(X = x) = r. The continuity of FX on R implies the left-continuity of FX at x, and so

lim_{ξ→x⁻} P(X ≤ ξ) = lim_{ξ→x⁻} FX(ξ) = FX(x) = P(X ≤ x) = P(X < x) + r.

Now, (X ≤ ξ) ⊆ (X < x) for all ξ < x, so that

P(X ≤ ξ) − P(X < x) ≤ 0

for all ξ < x by Proposition 4.20(3), and hence

r = lim_{ξ→x⁻} [P(X ≤ ξ) − P(X < x)] ≤ 0.

On the other hand r ≥ 0 by the definition of a probability measure. Therefore r = 0, and since x ∈ R is arbitrary, we conclude that P(X = x) = 0 for all x ∈ R.


With Propositions 4.20(2) and 6.5 we find that

P(X ≤ x) = P(X < x) + P(X = x) = P(X < x),

and similarly P(X ≥ x) = P(X > x).

Definition 6.6. Let FX be the cumulative distribution function for a continuous random variable X. The probability density function (p.d.f.) of X is the function fX given by

fX(x) = F′X(x)

for all x ∈ Dom(F′X).

By Definition 6.4(2), the set Dom(F′X) includes either all of R, or all of R except for some set of isolated values S. A strict reading of Definition 6.6 implies that Dom(fX) = Dom(F′X), and so for a < b the integral ∫_{a}^{b} fX will technically be a Type II improper Riemann integral if [a, b] ∩ S ≠ ∅. However, at each point x ∈ [a, b] ∩ S it can be shown that fX has either a removable or jump discontinuity, and so fX(x) can be assigned any value in R (such as 0) without affecting the value of ∫_{a}^{b} fX. Assuming this is done, we may then treat ∫_{a}^{b} fX as a proper integral for a, b ∈ R, or as purely a Type I improper integral if a = −∞ or b = ∞. This will henceforth be our practice.

Theorem 6.7. The probability density function fX of a continuous random variable X has the following properties.

1. fX(x) ≥ 0 for all x ∈ Dom(fX).
2. For all x ∈ R, FX(x) = ∫_{−∞}^{x} fX.
3. ∫_{−∞}^{∞} fX = 1.
4. For any −∞ < a < b < ∞,

P(a ≤ X ≤ b) = ∫_{a}^{b} fX,   P(X ≤ b) = ∫_{−∞}^{b} fX,   P(X ≥ a) = ∫_{a}^{∞} fX.   (6.3)

Proof.
Proof of (1). Let x ∈ Dom(fX), so F′X(x) is defined. Since FX is monotone increasing on R by Theorem 6.3(1), it follows that fX(x) = F′X(x) ≥ 0.

Proof of (2). Fix x ∈ R. By the Fundamental Theorem of Calculus and Theorem 6.3(2),

∫_{−∞}^{x} fX = lim_{ξ→−∞} ∫_{ξ}^{x} F′X(t) dt = lim_{ξ→−∞} [FX(t)]_{ξ}^{x}
= lim_{ξ→−∞} [FX(x) − FX(ξ)] = FX(x) − lim_{ξ→−∞} FX(ξ) = FX(x) − FX(−∞) = FX(x).

Proof of (3). By part (2) and Theorem 6.3(2),

∫_{−∞}^{∞} fX = lim_{x→∞} ∫_{−∞}^{x} fX = lim_{x→∞} FX(x) = FX(∞) = 1.


Proof of (4). By Proposition 4.20(5), part (2) above, and a property of integrals,

P(a ≤ X ≤ b) = P((X ≤ b) − (X < a)) = P(X ≤ b) − P(X < a)
= P(X ≤ b) − P(X ≤ a) = FX(b) − FX(a) = ∫_{−∞}^{b} fX − ∫_{−∞}^{a} fX = ∫_{a}^{b} fX.

The second equation in (6.3) is precisely part (2). As for the third equation, we have

P(X ≥ a) = 1 − P(X < a) = 1 − P(X ≤ a) = ∫_{−∞}^{∞} fX − ∫_{−∞}^{a} fX = ∫_{a}^{∞} fX

by Proposition 4.20(5) and part (3).

Definition 6.8. If X is a continuous random variable with probability density function fX, then the graph of fX is the probability distribution of X. A continuous probability distribution is the probability distribution of a continuous random variable.

Throughout much of the remainder of this chapter we will be looking at several continuous probability distributions that are commonly used in applications.

Example 6.9. A random variable X has probability density function fX given by

fX(x) =
  0.2,        x ∈ (−1, 0]
  0.2 + cx,   x ∈ (0, 1]
  0,          elsewhere.

(a) Find the value of c.
(b) Find the cumulative distribution function FX, then find FX(−1), FX(0), FX(1).
(c) Find P(0 ≤ X ≤ 0.5).
(d) Find P(X > 0.5 | X > 0.1).

Solution.
(a) For fX to be a p.d.f. we must have ∫_{−∞}^{∞} fX = 1 by Theorem 6.7(3). Since

∫_{−∞}^{∞} fX = ∫_{−1}^{0} fX + ∫_{0}^{1} fX = ∫_{−1}^{0} 0.2 dx + ∫_{0}^{1} (0.2 + cx) dx = 0.4 + 0.5c,

it follows that 0.4 + 0.5c = 1, and hence c = 1.2.

(b) For x ∈ (−∞, −1] we have FX(x) = 0, since fX = 0 on (−∞, −1]. For x ∈ (−1, 0],

FX(x) = ∫_{−∞}^{x} fX = ∫_{−1}^{x} 0.2 dξ = 0.2[ξ]_{−1}^{x} = 0.2(x + 1).

For x ∈ (0, 1], recalling that c = 1.2,

FX(x) = ∫_{−1}^{x} fX = ∫_{−1}^{0} 0.2 dξ + ∫_{0}^{x} (0.2 + 1.2ξ) dξ = 0.2 + 0.2x + 0.6x².


Figure 5. (Graph of the c.d.f. FX of Example 6.9.)

For x ∈ (1, ∞), noting that fX(x) = 0 for x > 1,

FX(x) = ∫_{−1}^{0} 0.2 dξ + ∫_{0}^{1} (0.2 + 1.2ξ) dξ = 1.

Therefore

FX(x) =
  0,                    x ∈ (−∞, −1]
  0.2 + 0.2x,           x ∈ (−1, 0]
  0.2 + 0.2x + 0.6x²,   x ∈ (0, 1]
  1,                    x ∈ (1, ∞),

which we use to obtain FX(−1) = 0, FX(0) = 0.2, and FX(1) = 1. The graph of FX is shown in Figure 5.

(c) We have

P(0 ≤ X ≤ 0.5) = P(X ≤ 0.5) − P(X < 0) = P(X ≤ 0.5) − P(X ≤ 0)
= FX(0.5) − FX(0) = [0.2 + 0.2(0.5) + 0.6(0.5)²] − 0.2 = 0.25.

(d) We have

P(X > 0.5 | X > 0.1) = P((X > 0.5) ∩ (X > 0.1))/P(X > 0.1) = P(X > 0.5)/P(X > 0.1)
= [1 − P(X ≤ 0.5)]/[1 − P(X ≤ 0.1)] = [1 − FX(0.5)]/[1 − FX(0.1)] = (1 − 0.45)/(1 − 0.226) = 275/387 ≈ 0.711.


6.2 – Expected Value: Continuous Case

For the next definition, recall that an integral ∫_{−∞}^{∞} f is absolutely convergent if the integral ∫_{−∞}^{∞} |f| is convergent.

Definition 6.10. Let Ω be a sample space, X : Ω → R a continuous random variable, and fX the probability density function of X. The expected value of X is

E(X) = ∫_{−∞}^{∞} x fX(x) dx,

provided the integral is absolutely convergent. The variance of X is

Var(X) = E((X − E(X))²),

again provided the integral is absolutely convergent.

Lemma 6.11. Suppose X is a continuous random variable with probability density function fX such that fX(x) = 0 for all x ∈ (−∞, 0) − S, where S is a set such that S ∩ I is finite for any bounded interval I ⊆ (−∞, 0). Then

E(X) = ∫_{0}^{∞} [1 − FX(x)] dx.

Proof. We have

∫_{−∞}^{0} x fX(x) dx = 0

since x fX(x) = 0 for all x ∈ (−∞, 0) − S. Thus, by Theorem 6.7 and the theorem by Fubini that allows for the interchanging of the order of an iterated integral,

∫_{0}^{∞} [1 − FX(x)] dx = ∫_{0}^{∞} P(X > x) dx = ∫_{0}^{∞} P(X ≥ x) dx = ∫_{0}^{∞} ∫_{x}^{∞} fX(y) dy dx
= ∫_{0}^{∞} ∫_{0}^{y} fX(y) dx dy = ∫_{0}^{∞} y fX(y) dy = ∫_{−∞}^{∞} x fX(x) dx = E(X).

Theorem 6.12. Let X be a continuous random variable on Ω, and let g : Ran X → R be such that g(X) is also a continuous random variable. If E(g(X)) is defined, then

E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx.

Proof. We prove the theorem in the illustrative case when g is nonnegative and increasing onRanX. By Theorem 6.7 and Proposition 6.5,ˆ 0

−∞fg(X)(x)dx = P(g(X) ≤ 0) = P(g(X) < 0) + P(g(X) = 0) = P(g(X) < 0) = 0,

the last equality owing to (g(X) < 0) = ∅. Since fg(X) is nonnegative and continuous whereF ′g(X) exists, it follows from Definition 6.4(2) that fg(X)(x) = 0 for all x ∈ (−∞, 0)− S, where


S is a set such that S ∩ I is finite for any bounded interval I ⊆ (−∞, 0). By Lemma 6.11 and Fubini's Theorem we obtain

    E(g(X)) = ∫_{0}^{∞} [1 − F_{g(X)}(x)] dx = ∫_{0}^{∞} P(g(X) > y) dy
            = ∫_{0}^{∞} ( ∫_{x : g(x) > y} f_X(x) dx ) dy = ∫_{0}^{∞} ∫_{g^{−1}(y)}^{∞} f_X(x) dx dy
            = ∫_{−∞}^{∞} ∫_{0}^{g(x)} f_X(x) dy dx = ∫_{−∞}^{∞} g(x) f_X(x) dx,

as desired. The proof of the more general case that drops the condition that g is increasing is much the same, save that the integral

    ∫_{x : g(x) > y} f_X(x) dx

may present complications.

Remark. Many respectable textbooks¹ simply define E(g(X)) to equal the integral given in Theorem 6.12, so long as it is absolutely convergent. We could have done the same here, which is one reason why we do not trouble ourselves with a general proof of the theorem.

Theorem 6.13. Let X be a continuous random variable on Ω, let g_1, . . . , g_n : Ran X → R be such that each g_k(X) is a continuous random variable, and let c be a constant. Provided the involved integrals are absolutely convergent, the following hold.

1. E(c) = c.
2. E(cX) = cE(X).
3. E( Σ_{k=1}^{n} g_k(X) ) = Σ_{k=1}^{n} E(g_k(X)).

Proof.
Proof of (1). The constant random variable c is discrete, not continuous, and so E(c) = c follows directly from Theorem 5.11(1).

Proof of (2). Applying Theorem 6.12 with g(x) = cx, we obtain

    E(cX) = ∫_{−∞}^{∞} cx f_X(x) dx = c ∫_{−∞}^{∞} x f_X(x) dx = cE(X).

Proof of (3). Letting

    g(x) = Σ_{k=1}^{n} g_k(x),

once again Theorem 6.12 and the linearity of the integral complete the argument.

¹ Such as Statistical Inference by Casella & Berger.


It is worth noting that in the case of a constant (and hence discrete) random variable c, the equation in Theorem 6.12 gives the correct value for E(c):

    ∫_{−∞}^{∞} c f_X(x) dx = c ∫_{−∞}^{∞} f_X(x) dx = c,

where the last equality follows from Theorem 6.7(3).

Proposition 6.14. Let X be a continuous random variable with E(X) defined. Provided the involved integrals are absolutely convergent, the following hold.

1. Var(X) = E(X²) − E²(X).
2. Var(c + X) = Var(X).
3. Var(cX) = c²Var(X).

Proof.
Proof of (1). Let µ = E(X). By Definition 6.10, Var(X) = E((X − µ)²). Now, by Theorem 6.13,

    Var(X) = E((X − µ)²) = E(X² − 2µX + µ²) = E(X²) − E(2µX) + E(µ²)
           = E(X²) − 2µE(X) + µ² = E(X²) − 2µ² + µ² = E(X²) − µ²
           = E(X²) − E²(X).

Proof of (2). By part (1) and Theorem 6.13,

    Var(c + X) = E((c + X)²) − E²(c + X) = E(c² + 2cX + X²) − (E(c) + E(X))²
               = E(c²) + 2cE(X) + E(X²) − E²(c) − 2E(c)E(X) − E²(X)
               = c² + 2cE(X) + E(X²) − c² − 2cE(X) − E²(X)
               = E(X²) − E²(X) = Var(X).

Proof of (3). By part (1) and Theorem 6.13,

    Var(cX) = E((cX)²) − E²(cX) = c²E(X²) − (cE(X))² = c²E(X²) − c²E²(X) = c²Var(X).

Example 6.15. We find here the expected value and variance of the random variable X in Example 6.9, which we found has p.d.f. given by

    f_X(x) =  0.2,           x ∈ (−1, 0]
              0.2 + 1.2x,    x ∈ (0, 1]
              0,             x ∉ (−1, 1].

We have

    E(X) = ∫_{−∞}^{∞} x f_X(x) dx = ∫_{−1}^{0} 0.2x dx + ∫_{0}^{1} x(0.2 + 1.2x) dx = 0.4,

and by Proposition 6.14(1) and Theorem 6.12,

    Var(X) = E(X²) − E²(X) = ∫_{−∞}^{∞} x² f_X(x) dx − (0.4)²
           = ∫_{−1}^{0} 0.2x² dx + ∫_{0}^{1} x²(0.2 + 1.2x) dx − 0.16 = 41/150 ≈ 0.273.

The standard deviation of X would thus be √Var(X) ≈ 0.523.
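For readers who wish to confirm these values numerically, the following Python sketch (illustrative only; scipy is assumed available) evaluates the same integrals by quadrature.

    # Numerical confirmation of Example 6.15.
    from scipy.integrate import quad

    def f(x):
        if -1 < x <= 0:
            return 0.2
        if 0 < x <= 1:
            return 0.2 + 1.2 * x
        return 0.0

    EX  = quad(lambda x: x * f(x), -1, 1)[0]        # 0.4
    EX2 = quad(lambda x: x**2 * f(x), -1, 1)[0]     # 13/30
    var = EX2 - EX**2                               # 41/150 ≈ 0.2733
    print(EX, var, var ** 0.5)                      # 0.4, 0.273..., 0.523...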


6.3 – The Uniform Distribution

Definition 6.16. Let −∞ < θ_1 < θ_2 < ∞. A continuous random variable X has a uniform distribution with parameters θ_1 and θ_2, denoted by U(θ_1, θ_2), if its probability density function f_X is given by

    f_X(x) =  1/(θ_2 − θ_1),   x ∈ [θ_1, θ_2]
              0,               x ∉ [θ_1, θ_2].

The symbol X ∼ U(θ_1, θ_2) will be used to indicate that a continuous random variable X has the distribution U(θ_1, θ_2).

Theorem 6.17. If X ∼ U(θ_1, θ_2), then

    E(X) = (θ_1 + θ_2)/2   and   Var(X) = (θ_2 − θ_1)²/12.

Moreover, the cumulative distribution function of X is

    F_X(x) =  0,                         x ∈ (−∞, θ_1)
              (x − θ_1)/(θ_2 − θ_1),     x ∈ [θ_1, θ_2]
              1,                         x ∈ (θ_2, ∞).

Proof. By definition,

    E(X) = ∫_{−∞}^{∞} x f_X(x) dx = ∫_{θ_1}^{θ_2} x/(θ_2 − θ_1) dx = (1/(θ_2 − θ_1)) [x²/2]_{θ_1}^{θ_2}
         = (θ_2² − θ_1²)/(2(θ_2 − θ_1)) = (θ_1 + θ_2)/2.

By Proposition 6.14 and Theorem 6.12,

    Var(X) = E(X²) − E²(X) = ∫_{θ_1}^{θ_2} x²/(θ_2 − θ_1) dx − ((θ_1 + θ_2)/2)²
           = (θ_2³ − θ_1³)/(3(θ_2 − θ_1)) − (θ_1² + 2θ_1θ_2 + θ_2²)/4 = (θ_2² − 2θ_1θ_2 + θ_1²)/12 = (θ_2 − θ_1)²/12.

By Theorem 6.7(2), for x < θ_1,

    F_X(x) = ∫_{−∞}^{x} f_X = ∫_{−∞}^{x} 0 dξ = 0.

For θ_1 ≤ x ≤ θ_2,

    F_X(x) = ∫_{θ_1}^{x} 1/(θ_2 − θ_1) dξ = (1/(θ_2 − θ_1)) [ξ]_{θ_1}^{x} = (x − θ_1)/(θ_2 − θ_1).

For x > θ_2,

    F_X(x) = ∫_{θ_1}^{θ_2} 1/(θ_2 − θ_1) dξ = (1/(θ_2 − θ_1)) [ξ]_{θ_1}^{θ_2} = (θ_2 − θ_1)/(θ_2 − θ_1) = 1.

This completes the proof.


Figure 6. The p.d.f. of U(θ_1, θ_2); the shaded area equals P(X ≤ x_0).

From Theorem 6.17 we see in particular that

    P(X ≤ x_0) = (x_0 − θ_1)/(θ_2 − θ_1)

for any x_0 ∈ [θ_1, θ_2], which is the area under the curve y = f_X(x) to the left of the line x = x_0, as shown in Figure 6.
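As a short check of Theorem 6.17 (a sketch, not part of the text), scipy.stats.uniform can be used; note that scipy parametrizes U(θ_1, θ_2) by loc = θ_1 and scale = θ_2 − θ_1, an assumption of this note about the library's convention.

    # Checking the mean, variance, and c.d.f. of a uniform distribution.
    from scipy.stats import uniform

    theta1, theta2 = 2.0, 5.0
    X = uniform(loc=theta1, scale=theta2 - theta1)
    print(X.mean())         # (theta1 + theta2)/2 = 3.5
    print(X.var())          # (theta2 - theta1)^2 / 12 = 0.75
    print(X.cdf(4.0))       # (4 - theta1)/(theta2 - theta1) = 2/3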


6.4 – The Normal Distribution

Definition 6.18. Let µ ∈ (−∞, ∞) and σ ∈ (0, ∞). A random variable X has a normal distribution with parameters µ and σ, denoted by N(µ, σ), if its probability density function f_X is given by

    f_X(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}     (6.4)

for all x ∈ R.

The symbol X ∼ N(µ, σ) shall be used to indicate that a continuous random variable X has the distribution N(µ, σ). The distribution N(0, 1) is of special import and is called the standard normal distribution.

Lemma 6.19.

    ∫_{−∞}^{∞} e^{−x²} dx = √π,    ∫_{−∞}^{∞} x e^{−x²} dx = 0,    ∫_{−∞}^{∞} x² e^{−x²} dx = √π/2.

Proof. The set [0, ∞) × [0, ∞) in R² is given in polar coordinates as

    {(r, θ) : 0 ≤ θ ≤ π/2, 0 ≤ r < ∞},

and so, making the substitution u = −r², we obtain

    ( ∫_{0}^{∞} e^{−x²} dx )² = ( ∫_{0}^{∞} e^{−x²} dx )( ∫_{0}^{∞} e^{−y²} dy ) = ∫_{0}^{∞} ( e^{−y²} ∫_{0}^{∞} e^{−x²} dx ) dy
        = ∫_{0}^{∞} ∫_{0}^{∞} e^{−x²−y²} dx dy = ∫_{0}^{π/2} ∫_{0}^{∞} r e^{−r²} dr dθ
        = ∫_{0}^{π/2} ( lim_{R→∞} ∫_{0}^{R} r e^{−r²} dr ) dθ = ∫_{0}^{π/2} ( lim_{R→∞} ∫_{0}^{−R²} −(1/2) e^{u} du ) dθ
        = −(1/2) ∫_{0}^{π/2} lim_{R→∞} (e^{−R²} − 1) dθ = −(1/2) ∫_{0}^{π/2} (−1) dθ = π/4.

Now taking the square root gives

    ∫_{0}^{∞} e^{−x²} dx = √π/2,     (6.5)

and since

    ∫_{−∞}^{0} e^{−x²} dx = ∫_{0}^{∞} e^{−x²} dx

owing to x ↦ e^{−x²} being an even function, it follows that

    ∫_{−∞}^{∞} e^{−x²} dx = ∫_{−∞}^{0} e^{−x²} dx + ∫_{0}^{∞} e^{−x²} dx = √π/2 + √π/2 = √π,

which verifies the first equation in the lemma.


Turning to the second equation in the lemma, we have

    ∫_{0}^{∞} x e^{−x²} dx = lim_{t→∞} ∫_{0}^{t} x e^{−x²} dx = lim_{t→∞} ∫_{0}^{t} ( −(1/2) e^{−x²} )′ dx
        = lim_{t→∞} [ −(1/2) e^{−x²} ]_{0}^{t} = −(1/2) lim_{t→∞} (e^{−t²} − 1) = 1/2,

and also

    ∫_{−∞}^{0} x e^{−x²} dx = −1/2

since x ↦ x e^{−x²} is an odd function. Therefore

    ∫_{−∞}^{∞} x e^{−x²} dx = ∫_{0}^{∞} x e^{−x²} dx + ∫_{−∞}^{0} x e^{−x²} dx = 1/2 − 1/2 = 0.

Finally, applying integration by parts and (6.5), we obtain

    ∫_{0}^{∞} x² e^{−x²} dx = lim_{t→∞} ( [ −(x/2) e^{−x²} ]_{0}^{t} − ∫_{0}^{t} −(1/2) e^{−x²} dx )
        = lim_{t→∞} ( −(t/2) e^{−t²} + (1/2) ∫_{0}^{t} e^{−x²} dx ) = (1/2) ∫_{0}^{∞} e^{−x²} dx = √π/4,

and so the third equation in the lemma follows since x ↦ x² e^{−x²} is an even function.

It might be wondered why the second equality in Lemma 6.19 does not obtain immediately from the observation that x ↦ x e^{−x²} is an odd function. The answer is that, by definition,

    ∫_{−∞}^{∞} f = ∫_{−∞}^{0} f + ∫_{0}^{∞} f

if and only if both ∫_{−∞}^{0} f and ∫_{0}^{∞} f converge. Thus it was necessary in the proof of the lemma to verify that ∫_{0}^{∞} x e^{−x²} dx converges.
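The three integrals in Lemma 6.19 are easy to confirm numerically; the following Python sketch (illustrative only, assuming scipy and numpy are available) does so by quadrature.

    # Numerical verification of Lemma 6.19.
    import numpy as np
    from scipy.integrate import quad

    g0 = quad(lambda x: np.exp(-x**2), -np.inf, np.inf)[0]
    g1 = quad(lambda x: x * np.exp(-x**2), -np.inf, np.inf)[0]
    g2 = quad(lambda x: x**2 * np.exp(-x**2), -np.inf, np.inf)[0]
    print(g0, np.sqrt(np.pi))         # both ~1.7725
    print(g1)                         # ~0
    print(g2, np.sqrt(np.pi) / 2)     # both ~0.8862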

Theorem 6.20. If X ∼ N(µ, σ), then

    E(X) = µ   and   Var(X) = σ².

Proof. By Definition 6.18,

    E(X) = ∫_{−∞}^{∞} x f_X(x) dx = (1/(σ√(2π))) ∫_{−∞}^{∞} x e^{−(x−µ)²/(2σ²)} dx.

Making the substitution

    u = (x − µ)/(σ√2),

and noting that u → ±∞ as x → ±∞, it follows that

    E(X) = (1/(σ√(2π))) ∫_{−∞}^{∞} (σ√2 u + µ) e^{−u²} · σ√2 du = (1/√π) ∫_{−∞}^{∞} (µ + σ√2 u) e^{−u²} du
         = (µ/√π) ∫_{−∞}^{∞} e^{−u²} du + (σ√2/√π) ∫_{−∞}^{∞} u e^{−u²} du = (µ/√π) · √π + (σ√2/√π) · 0 = µ,

by the first two equations in Lemma 6.19. To show that Var(X) = σ², we employ the third equation in Lemma 6.19 to obtain E(X²) = µ² + σ², and then apply Proposition 6.14(1).

In light of Theorem 6.20, the parameters µ and σ that characterize a normal distribution are called the mean and standard deviation of the distribution, respectively.

Theorem 6.21. If X ∼ N(µ, σ) and

    Z = (X − µ)/σ,

then Z ∼ N(0, 1).

Proof. Suppose X ∼ N(µ, σ). For any z ∈ R the c.d.f. for Z yields

    F_Z(z) = P(Z ≤ z) = P((X − µ)/σ ≤ z) = P(X ≤ σz + µ) = F_X(σz + µ),

and thus the p.d.f. for Z is given by

    f_Z(z) = F_Z′(z) = d/dz [F_X(σz + µ)] = σ F_X′(σz + µ) = σ f_X(σz + µ)
           = σ · (1/(σ√(2π))) e^{−[(σz+µ)−µ]²/(2σ²)} = (1/√(2π)) e^{−z²/2}.

This is precisely the p.d.f. of the normal distribution with mean 0 and standard deviation 1, and therefore Z ∼ N(0, 1).

We do not need to employ calculus to see that the p.d.f. given by (6.4) has a global maximum at x = µ, since that is the only value of x for which the power of e is not negative. At x = µ the global maximum value is

    f_X(µ) = 1/(σ√(2π)).

There are no other local or global extrema for f_X. To see this, we find the first derivative:

    f_X′(x) = −((x − µ)/(σ³√(2π))) e^{−(x−µ)²/(2σ²)}.

Thus f_X′(x) > 0 for all x < µ, and f_X′(x) < 0 for all x > µ. This shows that f_X is strictly increasing on (−∞, µ), and strictly decreasing on (µ, ∞).

Another matter of interest is the location of any inflection points in the graph of f_X. The second derivative of f_X is

    f_X″(x) = [ (x − µ)²/(σ⁵√(2π)) − 1/(σ³√(2π)) ] e^{−(x−µ)²/(2σ²)},

and so f_X″(x) = 0 if and only if

    (x − µ)²/(σ⁵√(2π)) − 1/(σ³√(2π)) = 0,

which is satisfied if and only if

    (x − µ)² − σ² = 0.

The solutions are x = µ ± σ. From this we find that f_X″ < 0 on the interval (µ − σ, µ + σ), and f_X″ > 0 on the intervals (−∞, µ − σ) and (µ + σ, ∞). Thus by the Concavity Test the graph of f_X is concave down on (µ − σ, µ + σ) and concave up on (−∞, µ − σ) ∪ (µ + σ, ∞), which implies that there are inflection points at x = µ ± σ. See Figure 7, which shows the classic "bell curve" of the normal distribution.

Figure 7. The p.d.f. of N(µ, σ), with inflection points at x = µ ± σ.
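These derivative computations can also be checked symbolically; the following sketch (not part of the text, assuming sympy is available) confirms that the second derivative of the normal p.d.f. vanishes exactly at x = µ ± σ.

    # Symbolic check of the inflection points of the normal p.d.f.
    import sympy as sp

    x = sp.symbols('x', real=True)
    mu = sp.symbols('mu', real=True)
    sigma = sp.symbols('sigma', positive=True)

    f = sp.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))
    roots = sp.solve(sp.diff(f, x, 2), x)     # zeros of the second derivative
    print(roots)                              # [mu - sigma, mu + sigma]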

Example 6.22. The alkalinity level of water specimens taken from the Han River in Seoul is normally distributed with a mean of 50 mg/L and a standard deviation of 3.2 mg/L. Find the probability that a water specimen taken from the river has an alkalinity level

(a) exceeding 45 mg/L.
(b) below 55 mg/L.
(c) between 51 and 52 mg/L.

Solution.
(a) Let X be the random variable that gives the alkalinity level of a water specimen from the river, so that X ∼ N(50, 3.2). We must find P(X > 45). By Theorem 6.21 the random variable Z given by

    Z = (X − 50)/3.2     (6.6)

has the standard normal distribution N(0, 1), and so a standard normal distribution table can be used to find that

    P(X > 45) = P((X − 50)/3.2 > (45 − 50)/3.2) = P(Z > −1.5625) ≈ P(Z > −1.56)
              = 1 − P(Z ≤ −1.56) = 1 − 0.0594 = 0.9406.

(b) Again letting Z be given by (6.6), we find from a standard normal distribution table that

    P(X ≤ 55) = P(Z ≤ (55 − 50)/3.2) = P(Z ≤ 1.5625) ≈ P(Z ≤ 1.56) = 0.9406,


the same answer as in part (a). This should not be surprising, considering the symmetry of the normal distribution about the line x = µ in general.

(c) Using a standard normal distribution table, and cleaving to the practice of taking the mean of two probabilities whose associated z values are equidistant from a value in the table, we obtain

    P(51 < X < 52) = P((51 − 50)/3.2 < (X − 50)/3.2 < (52 − 50)/3.2) = P(0.3125 < Z < 0.6250)
                   = P(Z ≤ 0.6250) − P(Z < 0.3125)
                   ≈ [P(Z ≤ 0.62) + P(Z ≤ 0.63)]/2 − P(Z ≤ 0.31)
                   = (0.7324 + 0.7357)/2 − 0.6217 ≈ 0.1124.

Note that we round the final probability value to four decimal places in accord with the table and significant figure conventions.
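For comparison, the same probabilities can be computed without a printed table; the Python sketch below (illustrative only, assuming scipy is available) uses scipy.stats.norm, and the small differences from the table-based answers come from table rounding.

    # Recomputing Example 6.22 with scipy.stats.norm.
    from scipy.stats import norm

    X = norm(loc=50, scale=3.2)
    print(1 - X.cdf(45))             # (a) ≈ 0.9409
    print(X.cdf(55))                 # (b) ≈ 0.9409
    print(X.cdf(52) - X.cdf(51))     # (c) ≈ 0.1113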


6.5 – The Gamma Distribution

The gamma function Γ : (0, ∞) → R is given by

    Γ(t) = ∫_{0}^{∞} x^{t−1} e^{−x} dx

for all t > 0. That the integral is real-valued for each t > 0 can be shown using the Comparison Test from calculus, and direct calculation will show that Γ(1) = 1.

Proposition 6.23. For any t ∈ (1, ∞),

    Γ(t) = (t − 1)Γ(t − 1),

and thus Γ(n + 1) = n! for all n ∈ N.

Proof. Fix t > 1. Applying integration by parts,

    Γ(t) = ∫_{0}^{∞} x^{t−1} e^{−x} dx = [ −x^{t−1} e^{−x} ]_{0}^{∞} + ∫_{0}^{∞} (t − 1) x^{t−2} e^{−x} dx
         = (t − 1) ∫_{0}^{∞} x^{t−2} e^{−x} dx = (t − 1)Γ(t − 1).

Now Γ(n + 1) = n! for n ∈ N follows from Γ(1) = 1 and induction.
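A quick numerical spot-check of the recursion and the factorial identity (illustrative only; scipy.special.gamma is assumed available) is shown below.

    # Checking Gamma(t) = (t-1)*Gamma(t-1) and Gamma(n+1) = n!.
    from math import factorial
    from scipy.special import gamma

    print(gamma(4.7), 3.7 * gamma(3.7))                          # both ~15.431
    print([(gamma(n + 1), factorial(n)) for n in range(1, 6)])   # pairs agree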

Definition 6.24. Let α, β ∈ (0, ∞). A random variable X has a gamma distribution with parameters α and β, denoted by Gam(α, β), if its probability density function f_X is given by

    f_X(x) =  x^{α−1} e^{−x/β} / (β^α Γ(α)),   x ∈ [0, ∞)
              0,                               x ∈ (−∞, 0).     (6.7)

For any α, β > 0, making the substitution u = x/β, we have

    ∫_{−∞}^{∞} f_X(x) dx = (1/(β^α Γ(α))) ∫_{0}^{∞} x^{α−1} e^{−x/β} dx = (1/(β^α Γ(α))) ∫_{0}^{∞} β(βu)^{α−1} e^{−u} du
                         = (1/(β^α Γ(α))) · β^α ∫_{0}^{∞} u^{α−1} e^{−u} du = (1/(β^α Γ(α))) · β^α Γ(α) = 1,

and so we see that Definition 6.24 does in fact define a valid probability distribution. Also we see that

    ∫_{0}^{∞} x^{α−1} e^{−x/β} dx = β^α Γ(α)     (6.8)

for any α, β > 0.

Theorem 6.25. If X ∼ Gam(α, β), then

    E(X) = αβ   and   Var(X) = αβ².


Proof. By (6.8) followed by Proposition 6.23,

    E(X) = ∫_{−∞}^{∞} x f_X(x) dx = (1/(β^α Γ(α))) ∫_{0}^{∞} x^α e^{−x/β} dx = (1/(β^α Γ(α))) · β^{α+1} Γ(α + 1)
         = β Γ(α + 1)/Γ(α) = β · αΓ(α)/Γ(α) = αβ.

Next, by Proposition 6.14(1) and Theorem 6.12, applying integration by parts twice,

    Var(X) = E(X²) − E²(X) = ∫_{−∞}^{∞} x² f_X(x) dx − (αβ)²
           = (1/(β^α Γ(α))) ∫_{0}^{∞} x^{α+1} e^{−x/β} dx − (αβ)²
           = (1/(β^α Γ(α))) ( [ −β x^{α+1} e^{−x/β} ]_{0}^{∞} + (α + 1)β ∫_{0}^{∞} x^α e^{−x/β} dx ) − (αβ)²
           = ((α + 1)/(β^{α−1} Γ(α))) ∫_{0}^{∞} x^α e^{−x/β} dx − (αβ)²
           = ((α + 1)αβ/(β^{α−1} Γ(α))) ∫_{0}^{∞} x^{α−1} e^{−x/β} dx − α²β²
           = ((α + 1)α/(β^{α−2} Γ(α))) · β^α Γ(α) − α²β² = αβ²,

where we recall (6.8) for the last integral.
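The conclusion of Theorem 6.25 is easy to confirm with scipy.stats.gamma (a sketch, not part of the text); note the assumption that scipy's shape parameter a corresponds to α and its scale to β.

    # Verifying E(X) = alpha*beta and Var(X) = alpha*beta^2.
    from scipy.stats import gamma

    alpha, beta = 3.0, 2.5
    X = gamma(a=alpha, scale=beta)
    print(X.mean(), alpha * beta)          # 7.5, 7.5
    print(X.var(), alpha * beta**2)        # 18.75, 18.75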

Definition 6.26. Let ν ∈ (0, ∞). A random variable X has a chi-square distribution with ν degrees of freedom, denoted by χ²(ν), if X ∼ Gam(ν/2, 2).

In general we say X is a chi-square random variable if X ∼ χ²(ν) for some ν > 0. Comparing Definitions 6.24 and 6.26, it is clear that if X ∼ χ²(ν) then the probability density function of X is given by

    f_X(x) =  x^{ν/2−1} e^{−x/2} / (2^{ν/2} Γ(ν/2)),   x ∈ [0, ∞)
              0,                                        x ∈ (−∞, 0).

Since X ∼ χ²(ν) if and only if X ∼ Gam(α, β) with α = ν/2 and β = 2, the following is immediate from Theorem 6.25.

Proposition 6.27. If X ∼ χ²(ν), then

    E(X) = ν   and   Var(X) = 2ν.
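The identification of χ²(ν) with Gam(ν/2, 2), and the moments of Proposition 6.27, can be checked as follows (an illustrative sketch assuming scipy and numpy are available).

    # Comparing the chi-square density with the Gam(nu/2, 2) density.
    import numpy as np
    from scipy.stats import chi2, gamma

    nu = 5
    xs = np.linspace(0.1, 20, 5)
    print(np.allclose(chi2(df=nu).pdf(xs), gamma(a=nu/2, scale=2).pdf(xs)))  # True
    print(chi2(df=nu).mean(), chi2(df=nu).var())                             # 5.0, 10.0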

Definition 6.28. Let β ∈ (0, ∞). A random variable X has an exponential distribution with parameter β, denoted by Exp(β), if X ∼ Gam(1, β).


We say X is an exponential random variable if X ∼ Exp(β) for some β > 0, in which case the p.d.f. of X is

    f_X(x) =  (1/β) e^{−x/β},   x ∈ [0, ∞)
              0,                x ∈ (−∞, 0).

Theorem 6.25 implies the following.

Proposition 6.29. If X ∼ Exp(β), then

    E(X) = β   and   Var(X) = β².

Example 6.30. A carpet factory uses propylene (C₃H₆) to produce synthetic fibers. The number of kiloliters of propylene used in a day, X, is found to have an exponential distribution with parameter β = 4.

(a) Find the probability that the factory will use more than 4 kiloliters in a given day.
(b) How much propylene should the factory have in stock so that the probability of running out of the chemical before the day is done is only 0.05?

Solution.
(a) The p.d.f. of X is

    f_X(x) =  (1/4) e^{−x/4},   x ∈ [0, ∞)
              0,                x ∈ (−∞, 0),

and so

    P(X > 4) = 1 − P(X ≤ 4) = 1 − ∫_{−∞}^{4} f_X = 1 − ∫_{0}^{4} (1/4) e^{−x/4} dx = 1 + (e^{−1} − 1) = e^{−1}.

(b) First,

    P(X > c) = 1 − P(X ≤ c) = 1 − ∫_{−∞}^{c} f_X = 1 − ∫_{0}^{c} (1/4) e^{−x/4} dx = 1 + [e^{−x/4}]_{0}^{c} = e^{−c/4}.

Now,

    P(X > c) = 0.05  ⟹  e^{−c/4} = 0.05  ⟹  c = 4 ln 20,

and so the factory should stock about 4 ln 20 ≈ 12.0 kiloliters of propylene.
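Both answers can be reproduced with scipy.stats.expon (an illustrative sketch, not part of the text); the assumption here is scipy's convention that scale corresponds to β.

    # Recomputing Example 6.30 with an exponential distribution of scale 4.
    import numpy as np
    from scipy.stats import expon

    X = expon(scale=4)
    print(1 - X.cdf(4), np.exp(-1))      # (a) both ~0.3679
    print(X.ppf(0.95), 4 * np.log(20))   # (b) both ~11.98 kiloliters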

Example 6.31. Suppose X ∼ Gam(α, β).

(a) Show that

    E(X^a) = β^a Γ(α + a)/Γ(α)

for any a > −α.
(b) Find expressions for E(X^{1/2}), E(X^{−1/2}), and E(X^{−n}) for n ∈ N, as well as the values of α for which each holds.


Solution.
(a) Fix a > −α, so that α + a > 0. By Theorem 6.12 and equation (6.8),

    E(X^a) = ∫_{−∞}^{∞} x^a f_X(x) dx = ∫_{0}^{∞} x^a · x^{α−1} e^{−x/β} / (β^α Γ(α)) dx
           = (1/(β^α Γ(α))) ∫_{0}^{∞} x^{(α+a)−1} e^{−x/β} dx
           = (1/(β^α Γ(α))) · β^{α+a} Γ(α + a) = β^a Γ(α + a)/Γ(α).

(b) If α > 0 we have 1/2 > −α, and hence

    E(X^{1/2}) = β^{1/2} Γ(α + 1/2)/Γ(α)

for all α > 0. If α > 1/2 we have −1/2 > −α, and hence

    E(X^{−1/2}) = β^{−1/2} Γ(α − 1/2)/Γ(α)

for all α > 1/2. Finally, for each n ∈ N we have

    E(X^{−n}) = β^{−n} Γ(α − n)/Γ(α)

for all α > n.
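A numerical spot-check of the formula in part (a) (illustrative only; the parameter values below are arbitrary, and scipy/numpy are assumed available) compares the moment computed by quadrature with β^a Γ(α + a)/Γ(α).

    # Spot-check of E(X^a) for X ~ Gam(alpha, beta).
    import numpy as np
    from scipy.integrate import quad
    from scipy.special import gamma as G

    alpha, beta, a = 2.5, 1.5, 0.5
    pdf = lambda x: x**(alpha - 1) * np.exp(-x / beta) / (beta**alpha * G(alpha))
    moment = quad(lambda x: x**a * pdf(x), 0, np.inf)[0]
    print(moment, beta**a * G(alpha + a) / G(alpha))    # the two values agree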


6.6 – The Beta Distribution

Definition 6.32. Let α, β ∈ (0, ∞). A random variable X has a beta distribution with parameters α and β, denoted by Beta(α, β), if its probability density function f_X is given by

    f_X(x) =  x^{α−1} (1 − x)^{β−1} / B(α, β),   x ∈ [0, 1]
              0,                                  x ∈ R \ [0, 1],     (6.9)

where B : (0, ∞) × (0, ∞) → R defined by

    B(s, t) = ∫_{0}^{1} x^{s−1} (1 − x)^{t−1} dx     (6.10)

is the beta function.

That the integral in (6.10) is a convergent improper integral if 0 < s < 1 or 0 < t < 1 is a routine matter to verify. There is a natural relationship between the beta and gamma functions, which will be needed in a later proof.

Proposition 6.33. For all s, t > 0,

    B(s, t) = Γ(s)Γ(t)/Γ(s + t).

Proof. Fix s, t > 0, and let r, R > 0 be such that r < R. Setting D = [r, R] × [r, R], Fubini's Theorem gives

    I = ∬_D x^{s−1} y^{t−1} e^{−x−y} dA = ∫_{r}^{R} ∫_{r}^{R} x^{s−1} y^{t−1} e^{−x−y} dx dy
      = ( ∫_{r}^{R} x^{s−1} e^{−x} dx ) ( ∫_{r}^{R} y^{t−1} e^{−y} dy ).     (6.11)

Putting E = [r/(r+R), R/(r+R)] × [2r, 2R], define the transformation T : E → D by

    T(u, v) = (uv, (1 − u)v).

Then D = T(E), T is injective on the interior of E, and the functions x(u, v) = uv and y(u, v) = (1 − u)v have continuous first-order partial derivatives on E. Calculating the Jacobian of T,

    J_T(u, v) = x_u(u, v) y_v(u, v) − x_v(u, v) y_u(u, v) = (v)(1 − u) − (u)(−v) = v,

and letting

    f(x, y) = x^{s−1} y^{t−1} e^{−x−y},

the change of variables formula for double integrals gives

    I = ∬_D f(x, y) dA = ∬_E f(T(u, v)) |J_T(u, v)| dA
      = ∫_{2r}^{2R} ∫_{r/(r+R)}^{R/(r+R)} (uv)^{s−1} [(1 − u)v]^{t−1} e^{−uv−(1−u)v} v du dv
      = ( ∫_{2r}^{2R} v^{s+t−1} e^{−v} dv ) ( ∫_{r/(r+R)}^{R/(r+R)} u^{s−1} (1 − u)^{t−1} du ).     (6.12)

Equating (6.11) with (6.12) and letting r → 0⁺ thus yields

    ( ∫_{0}^{R} x^{s−1} e^{−x} dx ) ( ∫_{0}^{R} x^{t−1} e^{−x} dx ) = ( ∫_{0}^{2R} x^{s+t−1} e^{−x} dx ) ( ∫_{0}^{1} x^{s−1} (1 − x)^{t−1} dx ).

Now letting R → ∞ gives

    ( ∫_{0}^{∞} x^{s−1} e^{−x} dx ) ( ∫_{0}^{∞} x^{t−1} e^{−x} dx ) = ( ∫_{0}^{∞} x^{s+t−1} e^{−x} dx ) ( ∫_{0}^{1} x^{s−1} (1 − x)^{t−1} dx ),

or equivalently Γ(s)Γ(t) = Γ(s + t)B(s, t), which leads to the desired result.
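Proposition 6.33 is also easy to confirm numerically; the sketch below (illustrative only, with arbitrary parameter values s, t > 1 chosen so the integrand has no endpoint singularity, and scipy assumed available) compares the defining integral of B(s, t) with Γ(s)Γ(t)/Γ(s + t).

    # Numerical confirmation of B(s, t) = Gamma(s)*Gamma(t)/Gamma(s + t).
    from scipy.integrate import quad
    from scipy.special import gamma as G

    s, t = 2.3, 1.7
    B_integral = quad(lambda x: x**(s - 1) * (1 - x)**(t - 1), 0, 1)[0]
    print(B_integral, G(s) * G(t) / G(s + t))    # the two values agree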