
Page 1: STAT:5100 (22S:193) Statistical Inference I - Week 4homepage.stat.uiowa.edu/~luke/classes/193/notes-week4.pdf · 2015. 12. 17. · Monday, September 14, 2015A Final Definition of

STAT:5100 (22S:193) Statistical Inference I - Week 4

Luke Tierney

University of Iowa

Fall 2015

Luke Tierney (U Iowa) STAT:5100 (22S:193) Statistical Inference I Fall 2015 1


Monday, September 14, 2015

Recap

• Reliability block diagrams

• Conditional probability

• Law of total probability

• Bayes' theorem

• Conditioning examples

• Gambler’s ruin problem


Monday, September 14, 2015 Gambler’s Ruin

Gambler’s Ruin

Example

• The gambler’s ruin problem is another famous problem in probability.

• It is related to sequential stopping rules in clinical trials.

• Players A and B have a total fortune of M dollars between them.

• Player A starts with n dollars and B with M − n.

• Each turn a coin is tossed independently:
  • The probability of heads is p.
  • If a head occurs then B pays one dollar to A.
  • If a tail occurs then A pays one dollar to B.

• The game ends when one player has all M dollars.

• What is the probability that A ends up with all the fortune, or B is ruined?

• Some simulations:

http://www.stat.uiowa.edu/~luke/classes/193/ruin.R.
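The link above points to R simulations (ruin.R, not reproduced here). As a rough illustration, a self-contained Python sketch of the same experiment (function names are my own, not from the course code) could be:

```python
import random

def gamblers_ruin(n, M, p, rng=random.Random(42)):
    """Play one game; return True if A ends up with all M dollars."""
    fortune = n
    while 0 < fortune < M:
        # Heads with probability p: B pays A one dollar; tails: A pays B.
        fortune += 1 if rng.random() < p else -1
    return fortune == M

# Estimate P(B is ruined) when A starts with 5 of 10 dollars, fair coin.
trials = 10000
wins = sum(gamblers_ruin(5, 10, 0.5) for _ in range(trials))
print(wins / trials)  # should be near n/M = 0.5
```

The estimate can be compared with the closed-form answer derived on the following slides.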


Monday, September 14, 2015 Gambler’s Ruin

Example (gambler's ruin, continued)

• Let pn be the probability that B is ruined when A starts with n dollars.

• Then

p0 = 0

pM = 1.

• For 0 < n < M

pn = P(B is ruined | first toss is H) p + P(B is ruined | first toss is T)(1 − p)

• Now

P(B is ruined | first toss is H) = pn+1
P(B is ruined | first toss is T) = pn−1

• So pn satisfies the second-order difference equation

  pn = p pn+1 + (1 − p) pn−1.


Monday, September 14, 2015 Gambler’s Ruin

Example (gambler's ruin, continued)

• Subtracting p pn from both sides and rearranging yields

  pn+1 − pn = ((1 − p)/p)(pn − pn−1).

• Since p0 = 0 this implies that

  pn+1 − pn = ((1 − p)/p)^n p1.

• Therefore for 0 < n ≤ M

  pn = p1 ∑_{k=1}^{n} ((1 − p)/p)^{k−1}.


Monday, September 14, 2015 Gambler’s Ruin

Example (gambler's ruin, continued)

• The sum is

  ∑_{k=1}^{n} ((1 − p)/p)^{k−1} =
    (1 − ((1 − p)/p)^n) / (1 − (1 − p)/p)   if p ≠ 1/2
    n                                       if p = 1/2.

• Using the fact that pM = 1 this gives

  pn =
    (1 − ((1 − p)/p)^n) / (1 − ((1 − p)/p)^M)   if p ≠ 1/2
    n/M                                         if p = 1/2.

• This holds for 0 ≤ n ≤ M.
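The closed form translates into a small function. This is my own sketch, not course code; it checks the boundary conditions p0 = 0 and pM = 1 directly:

```python
def ruin_prob(n, M, p):
    """P(B is ruined) when A starts with n of M dollars and wins each toss w.p. p."""
    if p == 0.5:
        return n / M
    r = (1 - p) / p
    return (1 - r**n) / (1 - r**M)

print(ruin_prob(0, 10, 0.6))   # 0.0
print(ruin_prob(10, 10, 0.6))  # 1.0
print(ruin_prob(5, 10, 0.5))   # 0.5
```

The function also satisfies the difference equation pn = p pn+1 + (1 − p) pn−1 from the earlier slide, which is a useful sanity check.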


Monday, September 14, 2015 Continuous Sample Spaces

Continuous Sample Spaces

• Some experiments require sample spaces that are continuous ranges:
  • Measuring the melting point of a substance.
  • Measuring the angle at which a particle leaves a point source.
  • Recording the time it takes to service a customer at a bank.

• Some experiments may require several continuous ranges:
  • Recording height and weight of a randomly selected person.
  • Recording the service times for each of the first 10 customers.

• There can be a continuous range of continuous ranges:
  • Recording the trajectory of a migrating bird.

• Continuous ranges are uncountable and require some different approaches.


Monday, September 14, 2015 Continuous Sample Spaces

Example

• We would like to model the direction in which a particle leaves a point source.

• This has a number of applications:
  • Photon emission in optics.
  • Gamma particles in nuclear medicine and medical imaging.

• We would like to capture the idea that the particle is equally likely to leave in any direction.

• The angle will be in the range [0, 2π).

• Equally likely to leave in any direction would mean
  • The probability of the angle falling in the ranges [0, π) and [π, 2π) should be 1/2 each.
  • The probability of the angle falling in the range [π, 3π/2) should be 1/4.
  • The probability of the angle falling in an interval [a, b) should be proportional to the length b − a.

• This means the probability would be (b − a)/(2π).


Monday, September 14, 2015 Continuous Sample Spaces

Example (continued)

• What is the probability the particle leaves in a particular direction a; what is P({a})?

• Let An = [a, a + 1/n).
• Then A1 ⊃ A2 ⊃ · · · and ⋂_{n=1}^{∞} An = {a}.

• So by the continuity property of probability

P({a}) = lim_{n→∞} P(An) = lim_{n→∞} 1/(2πn) = 0.

• A consequence is that

P([a, b)) = P((a, b)) = P((a, b]) = P([a, b])

• We can compute the probability density near a direction a as

f(a) = lim_{h↓0} P([a, a + h))/h = lim_{h↓0} P([a − h, a])/h = lim_{h↓0} (1/h)(h/(2π)) = 1/(2π).
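Because this model is exactly uniform, the ratio P([a, a + h))/h equals 1/(2π) for every h, not just in the limit. A quick numerical sketch (plain Python, my own illustration) makes the limit concrete:

```python
import math

def prob_interval(a, b):
    # Uniform-direction model: P([a, b)) = (b - a) / (2*pi) for 0 <= a <= b < 2*pi.
    return (b - a) / (2 * math.pi)

a = 1.0
for h in [1.0, 0.1, 0.01, 0.001]:
    # For a general density this ratio would only converge as h shrinks;
    # here it is constant because the distribution is exactly uniform.
    print(h, prob_interval(a, a + h) / h)  # ≈ 0.15915 = 1/(2π) each time
```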


Monday, September 14, 2015 Continuous Sample Spaces

• Does a probability with the properties of the previous example exist?

• Equivalent question: is it possible to compute a meaningful size, or length, for any subset of the real line?

• Unfortunately: no.

• There exist some very strange subsets of the real line.

• Reasonable subsets, even many very complicated ones, do allow their size to be computed.

• To move forward we need to be able to restrict our probabilities to a collection B of reasonable subsets.

• For our axioms to make sense we need our collection of reasonable subsets B to satisfy some closure conditions:

  (i) ∅ ∈ B.
  (ii) If A ∈ B then A^c ∈ B.
  (iii) If A1, A2, · · · ∈ B then ⋃_{i=1}^{∞} Ai ∈ B.

• A collection B of subsets of a sample space S that satisfies these conditions is called a sigma-algebra of subsets.
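For a finite sample space the closure conditions can be checked by brute force. The helper below is an illustrative sketch, not from the slides; for a finite collection, closure under countable unions reduces to closure under pairwise unions:

```python
def is_sigma_algebra(S, B):
    """Check the sigma-algebra closure conditions for a finite collection B
    of subsets of a finite sample space S."""
    B = {frozenset(A) for A in B}
    if frozenset() not in B:                      # (i) contains the empty set
        return False
    if any(frozenset(S) - A not in B for A in B): # (ii) closed under complements
        return False
    return all(A | C in B for A in B for C in B)  # (iii) closed under unions

S = {'HH', 'HT', 'TH', 'TT'}
B1 = [set(), S, {'HH', 'HT'}, {'TH', 'TT'}]
print(is_sigma_algebra(S, B1))                   # True
print(is_sigma_algebra(S, [set(), S, {'HH'}]))   # False: {'HH'} has no complement
```

The collection B1 reappears in the measurability example later in the week.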


Monday, September 14, 2015 Continuous Sample Spaces

Examples

• The smallest sigma-algebra on a sample space S is

B = {∅, S}.

• The largest sigma-algebra on a sample space S is

B = {all subsets of S} = 2^S.

• If S is countable and {s} ∈ B for all s ∈ S, then B = {all subsets of S}.

• If S is not countable, then we often need to use a B that does not contain all subsets of S.

• Even for countable state spaces it is sometimes useful to use a B that does not contain all subsets of S.


Monday, September 14, 2015 Continuous Sample Spaces

Example

Let S = R = (−∞, ∞) and let B be the smallest sigma-algebra containing all open intervals (a, b), a, b ∈ R.

• This is well-defined.

• B is called the Borel sigma-algebra on S .

• The sets in B are called Borel sets.

• Non-Borel sets do exist but are quite strange.

• B is also the smallest sigma-algebra containing all intervals of the form (−∞, a] for all a ∈ R.

• Other characterizations are possible.

• Analogous definitions apply when S is an interval.

• Analogous definitions apply when S is R^2 or R^n.


Monday, September 14, 2015 A Final Definition of Probability

A Final Definition of Probability

Definition

A probability, or probability function, is a real-valued function defined on a sigma-algebra B of subsets of a sample space S such that

(i) P(A) ≥ 0 for all A ∈ B
(ii) P(S) = 1
(iii) If A1, A2, . . . ∈ B are pairwise disjoint, then

  P(⋃_{i=1}^{∞} Ai) = ∑_{i=1}^{∞} P(Ai).

• This is the definition introduced by Kolmogorov in 1933 that is now the standard.

• A triple (S, B, P) is called a probability space.

• If S is a subset of the real line then a probability is sometimes called a probability distribution.


Monday, September 14, 2015 A Final Definition of Probability

With this definition we can use the probability we developed to model the angle of a particle leaving a point source:

Theorem
For any bounded interval [L, U] with L < U there exists a unique probability defined on the Borel subsets of the interval such that

  P((a, b)) = (b − a)/(U − L)

for L ≤ a < b ≤ U. This is called the uniform distribution on [L, U].


Monday, September 14, 2015 Random Variables

Random Variables

• Random variables are numerical quantities with values we are uncertain about.

• In our framework we can think of each outcome in our sample space as producing a particular value for the random variable.

• We can think of a random variable as a real-valued function defined on our sample space.

• Sometimes we are interested in only one random variable, sometimes in many.


Monday, September 14, 2015 Random Variables

Examples

• Outcomes are all possible subsets of 3 out of 100 items.
  • We are mainly interested in the number of defectives in the sample.

• Outcomes are all possible past and future prices of a stock.
  • We are mainly interested in the stock today and one month from now.

• A student is selected at random. Outcomes are all possible students.
  • We are interested in the height and weight of the student selected.


Wednesday, September 16, 2015

Recap

• Conditioning examples
  • reliability block diagram
  • matching problem
  • gambler's ruin problem

• Continuous sample spaces
  • sigma algebras
  • Borel sets

• A final definition of probability

• Random variables


Wednesday, September 16, 2015

• A random variable represents a numerical measurement that can be computed for each outcome.

• Informally: A random variable is a real-valued function defined on a sample space.

• Random variables are usually denoted by upper case letters from the end of the alphabet:

  X(s) : S → R

• Lower case letters usually denote generic realized values: x = X(s).

• Two dice are rolled; the sample space is S = {(1, 1), (1, 2), . . . , (6, 6)}.
  • X is the sum of the values on the two dice.
  • The experiment is performed and results in the outcome s = (3, 2).
  • The realized value of X is x = X(s) = X((3, 2)) = 5.

• Often the argument is dropped:

  P({s ∈ S : X(s) ≤ 5}) = P({X ≤ 5}) = P(X ≤ 5)
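The step from outcomes to P(X ≤ 5) can be made concrete for the two-dice example. A small Python sketch (my own, not from the slides):

```python
# Sample space for two dice; X(s) is the sum of the faces.
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]
X = lambda s: s[0] + s[1]

# P(X <= 5) = P({s in S : X(s) <= 5}), with all 36 outcomes equally likely.
event = [s for s in S if X(s) <= 5]
print(len(event) / len(S))  # 10/36 ≈ 0.278
```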


Wednesday, September 16, 2015 Random Variables

Random Variables

• When our probability is defined on all subsets of the sample space then any real-valued function on the sample space is a random variable.

• If our probability is defined only for some subsets of the sample space, the subsets in a sigma-algebra B, then to be able to compute

P(X ≤ 5) = P({s ∈ S : X (s) ≤ 5})

we need to have {s ∈ S : X (s) ≤ 5} ∈ B.

• This property is called measurability with respect to B.

• A random variable is a B-measurable real-valued function on the sample space S.


Wednesday, September 16, 2015 Random Variables

Definition

Let S be a sample space and B a sigma-algebra of events on S. A real-valued function X = X(s) : S → R is a random variable if

{s : X(s) ≤ t} ∈ B

for all t ∈ R.

An equivalent condition is that

X^{−1}(A) = {s : X(s) ∈ A} ∈ B

for any Borel set A.


Wednesday, September 16, 2015 Random Variables

Examples

• Toss a coin n times, X = number of heads.

• Roll two dice,

  X = sum,
  Y = difference.

• Sample 1000 people,

  X = number who favor some issue,
  Y = number who oppose the issue,
  Z = 1000 − X − Y.


Wednesday, September 16, 2015 A Measurability Example

Example

• A coin is tossed twice; the sample space is S = {HH, HT, TH, TT}.

• Let Xi be the number of heads on toss i and let Y = X1 + X2 be the total number of heads.

• Let B be all subsets of S and let

  B1 = {∅, S, {HH, HT}, {TH, TT}}.

• Both B and B1 are sigma-algebras.

• X1, X2, and Y are all measurable with respect to B.

• X1 is measurable with respect to B1.

• X2 and Y are not measurable with respect to B1:

  {X2 ≤ 0} = {HT, TT} ∉ B1
  {Y ≤ 0} = {TT} ∉ B1
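These failures can be verified by brute force on the finite sample space. The checker below is my own sketch (helper names are not from the slides); for a finite S it suffices to test the level sets {s : X(s) ≤ t} at the attained values of X:

```python
def is_measurable(X, S, B):
    """Check that every level set {s : X(s) <= t} lies in the collection B.
    For finite S it is enough to test thresholds at attained values of X."""
    B = {frozenset(A) for A in B}
    return all(frozenset(s for s in S if X(s) <= X(s0)) in B for s0 in S)

S = ['HH', 'HT', 'TH', 'TT']
X1 = lambda s: 1 if s[0] == 'H' else 0   # heads on toss 1
X2 = lambda s: 1 if s[1] == 'H' else 0   # heads on toss 2
Y = lambda s: X1(s) + X2(s)              # total number of heads

B1 = [set(), set(S), {'HH', 'HT'}, {'TH', 'TT'}]
print(is_measurable(X1, S, B1))  # True
print(is_measurable(X2, S, B1))  # False: {X2 <= 0} = {HT, TT} is not in B1
print(is_measurable(Y, S, B1))   # False: {Y <= 0} = {TT} is not in B1
```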


Wednesday, September 16, 2015 A Measurability Example

Example (continued)

• B1 is called the sigma-algebra generated by X1.

• B1 is the smallest sigma-algebra with respect to which X1 is measurable.

• B1 consists of those subsets of S for which one can determine whether the outcome that has occurred is in the subset or not by knowing only the value of X1.

• Conversely, if we know for each event A ∈ B1 whether A has occurred or not, then we know the value of X1.


Wednesday, September 16, 2015 Distributions

Distributions

• Let X be a random variable on a probability space (S, B, P).

• For any Borel set A, let

PX (A) = P({X ∈ A}) = P({s ∈ S : X (s) ∈ A}).

• PX is a probability on R, the Borel sigma-algebra on R.

• A random variable induces the probability PX on the real line R.

• This induced probability is called the distribution of X .

• The triple (R, R, PX) is the probability space induced by X on the real line.

• Similarly, a pair of random variables X and Y induces a probability on R^2.

• This is called the joint probability distribution of X and Y .


Friday, September 18, 2015

Recap

• Random variables

• Measurability

• Distributions, induced probabilities

• Cumulative Distribution Function (CDF)


Friday, September 18, 2015 Cumulative Distribution Functions

Cumulative Distribution Functions

• Writing down probabilities for all Borel sets would be a daunting task.

• We need some simpler tools for specifying distributions.

• One tool that is available for any random variable is the cumulative distribution function:

Definition

The cumulative distribution function (CDF) of a random variable X is

FX (x) = P(X ≤ x)

for all x ∈ R.

• A few authors define the CDF as FX (x) = P(X < x).

• The survival function SX(x) = P(X > x) is sometimes used for lifetime distributions.

• The survival function satisfies SX (x) = 1− FX (x).


Friday, September 18, 2015 Cumulative Distribution Functions

Example

For X = number of heads in two coin flips:

FX(x) =
  0     x < 0
  1/4   0 ≤ x < 1
  3/4   1 ≤ x < 2
  1     x ≥ 2

[Plot: the step-function CDF FX(x), jumping to 1/4 at x = 0, to 3/4 at x = 1, and to 1 at x = 2.]

The plot was created with

s <- stepfun(c(0, 1, 2), c(0, .25, .75, 1))

plot(s, verticals = FALSE, pch = 19, main="", ylab=expression(F[X](x)))


Friday, September 18, 2015 Cumulative Distribution Functions

Example

• Let X be the angle in which a particle leaves a point source.

• Suppose X has a uniform distribution on [0, 2π).

• This means for 0 ≤ a ≤ b ≤ 2π

  P(a < X < b) = (b − a)/(2π).

• Therefore

  FX(x) =
    0        x < 0
    x/(2π)   0 ≤ x < 2π
    1        x ≥ 2π.
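This piecewise CDF translates directly into code. A sketch (function name is my own):

```python
import math

def cdf_uniform_angle(x):
    """CDF of the uniform distribution on [0, 2*pi)."""
    if x < 0:
        return 0.0
    if x < 2 * math.pi:
        return x / (2 * math.pi)
    return 1.0

print(cdf_uniform_angle(-1))        # 0.0
print(cdf_uniform_angle(math.pi))   # 0.5
print(cdf_uniform_angle(7))         # 1.0
# On the continuous part, P(a < X < b) = F(b) - F(a):
print(cdf_uniform_angle(math.pi) - cdf_uniform_angle(math.pi / 2))  # 0.25
```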


Friday, September 18, 2015 Cumulative Distribution Functions

• Some properties of the CDF:

(a) lim_{x→−∞} F(x) = 0
(b) lim_{x→∞} F(x) = 1
(c) F is nondecreasing
(d) F is right-continuous

• Properties (a), (b), and (d) follow from the continuity of probability.

• If a function F has these four properties, then it is the CDF of some random variable.

• If F has a jump at x , then P(X = x) is the size of the jump.


Friday, September 18, 2015 Computing Some Probabilities Using the CDF

Computing Some Probabilities

P(a < X ≤ b) = F (b)− F (a) for any a < b.

Proof.

The event {X ≤ b} can be written as the disjoint union

{X ≤ b} = {X ≤ a} ∪ {a < X ≤ b}

By finite additivity,

F(b) = P(X ≤ b) = P(X ≤ a) + P(a < X ≤ b) = F(a) + P(a < X ≤ b)


Friday, September 18, 2015 Computing Some Probabilities Using the CDF

Computing Some Probabilities

P(X < a) = F(a−) = lim_{x↑a} F(x) for any a ∈ R.

Proof.

Let An = {a − 1/n < X < a}. Then for each n ≥ 1

  P(X < a) = F(a − 1/n) + P(An).

Furthermore,

  ⋂_{n=1}^{∞} An = ∅.

So by the continuity property lim_{n→∞} P(An) = 0, and

  P(X < a) = F(a−) + lim_{n→∞} P(An) = F(a−).


Friday, September 18, 2015 Computing Some Probabilities Using the CDF

Computing Some Probabilities

P(X = a) = F (a)− F (a−) = size of the jump of F at a.

Proof.

The event {X = a} can be written as

{X = a} = {X ≤ a}\{X < a}.

So

  P(X = a) = P(X ≤ a) − P(X < a) = F(a) − F(a−).

P(a ≤ X ≤ b) = F(b) − F(a−).

Proof.

Follows similarly from observing that

{a ≤ X ≤ b} = {X ≤ b}\{X < a}.
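Using the two-coin CDF from the earlier example (jumps of 1/4, 1/2, and 1/4 at x = 0, 1, 2), the jump formula P(X = a) = F(a) − F(a−) can be checked numerically. A sketch (my own, not course code):

```python
def F(x):
    # CDF of X = number of heads in two fair coin tosses.
    if x < 0: return 0.0
    if x < 1: return 0.25
    if x < 2: return 0.75
    return 1.0

def F_left(x, eps=1e-9):
    # F(x-) approximated by evaluating just below x; exact here since
    # F is a step function with jumps only at the integers 0, 1, 2.
    return F(x - eps)

for a in [0, 1, 2]:
    print(a, F(a) - F_left(a))  # jumps 0.25, 0.5, 0.25 = P(X = a)
```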


Friday, September 18, 2015 Computing Some Probabilities Using the CDF

Joint CDF of Two Random Variables

• The joint CDF of two random variables X and Y is defined as

  FXY(x, y) = P({X ≤ x} ∩ {Y ≤ y}).

• The probability of any rectangle with sides parallel to the axes can be calculated from the joint CDF.

• Other probabilities are harder to obtain from the joint CDF.


Friday, September 18, 2015 Comparing Distributions

Comparing Distributions

Definition

Two random variables X and Y are identically distributed if their distributions PX and PY are identical, i.e. if

PX (A) = P(X ∈ A) = P(Y ∈ A) = PY (A)

for all Borel sets A.

Theorem
Two random variables X and Y are identically distributed if and only if their CDFs are identical.


Friday, September 18, 2015 Classifying Distributions

Classifying Distributions

• A random variable is discrete if the set of its possible values is finite or countable.

• Equivalently, a random variable is discrete if its CDF is a step function.

• A random variable is continuous if its CDF is continuous and there is a nonnegative function f such that

  F(x) = ∫_{−∞}^{x} f(u) du

for all x ∈ R.
  • F will be differentiable "almost everywhere" with F′ = f.
  • f is called a probability density function.

• Continuous distributions are sometimes called absolutely continuous.


Friday, September 18, 2015 Classifying Distributions

• Combinations of discrete and continuous features are possible.

• The CDF of the waiting time at a bank might look like this:

  [Plot omitted: a CDF with a jump at 0 and a continuous increase thereafter.]

• The CDF of the amount of rainfall on a day would also look like this.

• There also exist weird random variables whose CDF is continuous but singular: the derivative is zero almost everywhere (the Cantor function is an example), so the CDF has no density.


Friday, September 18, 2015 Characterizing Discrete and Continuous Distributions

Characterizing Discrete and Continuous Distributions

• The CDF can be used to characterize any probability distribution on the real line.

• Simpler characterizations are available if the RVs are discrete or continuous.

• For discrete random variables we can use the probability mass function (PMF):

Definition

The probability mass function (PMF) of a discrete random variable is given by

fX (x) = P(X = x)

for all x ∈ R.

• If 𝒳 denotes the discrete set of possible values of X, then fX(x) = 0 for all x ∉ 𝒳.


Friday, September 18, 2015 Characterizing Discrete and Continuous Distributions

• Using the PMF we can compute any probabilities we want:

  P(a ≤ X ≤ b) = ∑_{a≤x≤b} fX(x)

  P(X ∈ A) = ∑_{x∈A} fX(x)

• In particular,

  FX(x) = ∑_{y≤x} fX(y)

• So the PMF completely determines the distribution of a RV.
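For the sum of two dice, the PMF and the CDF derived from it can be computed directly. A Python sketch using exact fractions (my own, not from the slides):

```python
from collections import Counter
from fractions import Fraction

# PMF of X = sum of two dice, built from the 36 equally likely outcomes.
counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

# FX(x) = sum over y <= x of fX(y).
F = lambda x: sum(p for y, p in pmf.items() if y <= x)
print(F(5))               # 5/18 (= 10/36, matching the earlier event count)
print(sum(pmf.values()))  # 1
```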


Friday, September 18, 2015 Characterizing Discrete and Continuous Distributions

• The PMF is defined for continuous random variables but isn’t useful.

• Instead, we use the probability density function (PDF):

Definition

If X is a random variable and fX(x) is a nonnegative real-valued function on R such that

  FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(u) du

for all x ∈ R then X is continuous and fX(x) is a probability density function (PDF) for X (or for FX).


Friday, September 18, 2015 Characterizing Discrete and Continuous Distributions

Notes

• If fX(x) is continuous at x, then fX(x) = F′X(x), i.e.

  fX(x) = lim_{ε↓0} (FX(x + ε) − FX(x))/ε = lim_{ε↓0} (1/ε) P(x ≤ X ≤ x + ε)

• So fX (x) is the “density of probability” at x .

• fX (x) > 1 is possible.

• P(X = x) = 0 for all x if X is continuous.

• For any Borel subset A of R,

  P(X ∈ A) = ∫_A fX(x) dx


Friday, September 18, 2015 Characterizing Discrete and Continuous Distributions

Theorem
A function fX(x) is a PMF (or PDF) of some random variable if and only if

(i) fX(x) ≥ 0 for all x ∈ R.
(ii) ∑_x fX(x) = 1 (PMF), or ∫_{−∞}^{∞} fX(x) dx = 1 (PDF).


Friday, September 18, 2015 Examples

Example

• A random variable with density

  fX(x) = λe^{−λx}   for x ≥ 0
          0          otherwise

for some λ > 0 is called an exponential random variable.

• This is a useful simple model for life times.

• We have fX(x) ≥ 0 for all x and

  ∫_{−∞}^{∞} fX(x) dx = 0 + ∫_{0}^{∞} λe^{−λx} dx = 0 + 1 = 1

  So fX is a density for any λ > 0.

• The CDF is

  FX(x) = ∫_{−∞}^{x} fX(y) dy =
    0             for x ≤ 0
    1 − e^{−λx}   for x > 0
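Both calculations can be checked numerically. The sketch below (my own, not course code) compares a midpoint Riemann sum of the density against the closed-form CDF:

```python
import math

def pdf(x, lam):
    # Exponential density: lam * exp(-lam * x) for x >= 0, else 0.
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def cdf(x, lam):
    # Closed-form CDF: 1 - exp(-lam * x) for x > 0, else 0.
    return 1 - math.exp(-lam * x) if x > 0 else 0.0

# Midpoint Riemann sum of the density from 0 to x should match the CDF.
lam, x, n = 2.0, 1.5, 100000
h = x / n
approx = sum(pdf((k + 0.5) * h, lam) for k in range(n)) * h
print(approx, cdf(x, lam))  # both ≈ 0.9502
```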
