
Chapter 2


Review of Probability

The probability framework for statistical inference

a) Random variables and probability distributions

b) One rv: expected value, mean, variance, standard deviation


c) Two rvs: joint vs. marginal vs. conditional distributions; independence, covariance, sums of rvs

d) Key distributions: Normal, Chi-squared, Student t, F

e) Random sampling & sample average’s distribution

f) Large sample approximations


Random variables & probability distrib’ns

• Random variables (rvs): commute time, #computer crashes

• Rvs can be continuous (time) or discrete (#crashes)

• Outcomes: Mutually exclusive values that a rv can take

• Eg: no crash, crash once, crash twice, …; numerically: 0,1,2,…

• Sample space: set of all outcomes, e.g. {0,1,2,…}


• Event: By definition a collection of outcomes.

• E.g. “crash no more than once” = {0,1} = {no crash, crash once}

• Probability of an outcome/event: Proportion of time it occurs in the long run (after many independent, identical experiments)


Probability distrib’n – discrete rv

• Probability distribution of a rv: the list of the rv’s outcomes together with the probability of each

• The probabilities in the list add up to 1

• Example: M = #computer crashes while you write paper

• Assume: If four crashes occur, write paper by hand (M<5)


• Event = {0,1} has probability Pr(M=0) + Pr(M=1) = .9

• Cumulative distribution function cdf: Prob. rv is at most given value, e.g. cdf(1) = .9


outcome     0     1     2     3     4
Pr dist    .8    .1   .06   .03   .01
Cum dist   .8    .9   .96   .99     1
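Not in the original slides, but here is a minimal Python sketch of this table, building the cumulative distribution from the probability distribution (the names pmf and cdf are just illustrative):

```python
# Crash-count example: probability distribution and its cdf
from itertools import accumulate

outcomes = [0, 1, 2, 3, 4]
pmf = [0.8, 0.1, 0.06, 0.03, 0.01]   # "Pr dist" row of the table

assert abs(sum(pmf) - 1.0) < 1e-12   # probabilities add up to 1

cdf = list(accumulate(pmf))          # "Cum dist" row: .8, .9, .96, .99, 1
print(cdf[1])                        # cdf(1) = Pr(M <= 1) = ~0.9
```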

Bernoulli distribution – discrete rv

• If there are only TWO outcomes, rv called Bernoulli

• E.g.: Let G be gender of next person you meet

• Outcomes are “male”, “female”

• If probability of one outcome is p, the other’s must be 1 – p (since probabilities add up to 1)



Probability distrib’n – continuous rv

• Cumulative probability distribution cdf(x): Probability rv is at most a given value, x

• See figure 2.2, p. 19

• Probability density function pdf(x): Intuitively, it is the probability of outcome x … except with a continuous rv, usually this is zero for every x.


• More accurately: the pdf is the function with the property that, for x < y, cdf(y) – cdf(x) = area under the pdf between x and y

• E.g. Pr(commute is between 15 and 20 minutes) = cdf(20) – cdf(15) = .78 – .20 = .58


Expected values, Mean, Variance

• Expected value E(Y) of a random variable Y: the long run average value of the rv (after many independent, repeated occurrences)

• Its value is denoted µY … Also called expectation of Y or mean of Y


• Computed as average of outcomes, each weighted by its probability

• E.g.: E(M) = 0·.8 + 1·.1 + 2·.06 + 3·.03 + 4·.01 = .35 … the mean number of crashes is .35
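As a quick check (a sketch, not from the slides), the same computation in Python:

```python
# E(M): average of outcomes, each weighted by its probability
outcomes = [0, 1, 2, 3, 4]
pmf = [0.8, 0.1, 0.06, 0.03, 0.01]

mean_M = sum(y * p for y, p in zip(outcomes, pmf))
print(mean_M)   # ~0.35 crashes on average
```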


Expected values of Bernoulli rv

• Say Bernoulli G has probability distribution Pr(G=1)=p, Pr(G=0)=1-p

• Then E(G) = 1xp + 0x(1-p) = p


• That is, its mean is the probability of outcome 1 (whatever it signifies)


Formulas for expected value

• Discrete rv: If Y can take k outcome values y1, …, yk with probabilities p1, …, pk respectively, then:

• E(Y) = y1·p1 + … + yk·pk = ∑i yi·pi


• Continuous rv: If Y has a pdf, with values ranging from L to H, then:

• E(Y) = ∫[L,H] y·pdf(y) dy … (just fyi)

• Note: If Y, Z are rvs, then E(Y+Z) = E(Y) +E(Z)

• Note: If c is a constant, then E(cY) = cE(Y)


Standard deviation and variance

• These measure the spread of a rv

• Variance of Y: var(Y) := E[(Y – µY)²], the expected squared deviation from its mean. Also denoted σ²Y

• Formula: var(Y) = ∑i (yi – µY)²·pi


• Expanding the square: var(Y) = E(Y²) – µ²Y

• Note: If c is a constant, var(cY) = c²var(Y) and var(c+Y) = var(Y)

• Since it involves squares, the variance is not in units comparable to Y

• Standard deviation of Y: σY := square root of var(Y), which is in Y’s units

• Var(M) = .6475, stdev(M) ≈ .8
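A short sketch (not in the original) verifying Var(M) both ways, by the definition and by the expanded-square formula:

```python
# Var(M) by definition and via E(M^2) - mu^2; stdev is its square root
import math

outcomes = [0, 1, 2, 3, 4]
pmf = [0.8, 0.1, 0.06, 0.03, 0.01]

mu = sum(y * p for y, p in zip(outcomes, pmf))                # E(M) = .35
var = sum((y - mu) ** 2 * p for y, p in zip(outcomes, pmf))   # E[(M - mu)^2]
var2 = sum(y ** 2 * p for y, p in zip(outcomes, pmf)) - mu ** 2
print(var, var2, math.sqrt(var))   # ~0.6475  ~0.6475  ~0.80
```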


Variance of Bernoulli rv

• Say Bernoulli G has probability distribution Pr(G=1)=p, Pr(G=0)=1-p

• Recall E(G) = p


• So var(G) = (0 – p)²(1 – p) + (1 – p)²·p = p(1 – p)


Mean & Variance of linear function

• Say X is a rv, and Y a linear function of it: Y = a + bX

• Then Y is a rv in its own right

• Its mean E(Y) = E(a + bX) = E(a) + E(bX)


= a + bE(X) … in short, µY = a+bµX

• Recall if c is a constant: var(cX) = c²var(X) and var(c+X) = var(X), so …

• Var(Y) = var(a+bX) = var(bX) = b²var(X)

• σY = |b|σX upon taking square roots on both sides
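A simulation sketch of these two rules (the values of a, b and the distribution of X below are illustrative, not from the slides):

```python
# Check mu_Y = a + b*mu_X and var(Y) = b^2 var(X) for Y = a + bX
import random

random.seed(0)
a, b = 2.0, -3.0
xs = [random.gauss(1.0, 2.0) for _ in range(100_000)]   # X ~ mean 1, sd 2
ys = [a + b * x for x in xs]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / len(v)

print(mean(ys), a + b * mean(xs))   # both ~ -1  (= 2 - 3*1)
print(var(ys), b ** 2 * var(xs))    # both ~ 36  (= (-3)^2 * 2^2)
```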


Measures of symmetry and tails

Skewness(Y) = E[(Y – µY)³]/σ³Y = measure of asymmetry of a distribution

• skewness = 0: distribution is symmetric

• skewness > (<) 0: distribution has fatter right (left) tail

Kurtosis(Y) = E[(Y – µY)⁴]/σ⁴Y = measure of mass in tails = measure of probability of large values

• kurtosis = 3: normal distribution

• kurtosis > 3: heavy tails (“leptokurtic”)

• For c > 0: Skew.(cY) = Skew.(Y) and Kurt.(cY) = Kurt.(Y), i.e. both are “scale-invariant”
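Not in the original slides, but a minimal sketch computing both measures for a discrete rv straight from these formulas, applied to the crash-count example (helper name moments is illustrative):

```python
# Skewness and kurtosis of a discrete rv from its pmf
def moments(outcomes, pmf):
    mu = sum(y * p for y, p in zip(outcomes, pmf))
    sd = sum((y - mu) ** 2 * p for y, p in zip(outcomes, pmf)) ** 0.5
    skew = sum((y - mu) ** 3 * p for y, p in zip(outcomes, pmf)) / sd ** 3
    kurt = sum((y - mu) ** 4 * p for y, p in zip(outcomes, pmf)) / sd ** 4
    return skew, kurt

# Crash-count example: skewness comes out positive (fat right tail)
print(moments([0, 1, 2, 3, 4], [0.8, 0.1, 0.06, 0.03, 0.01]))
```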


Two random variables: joint distributions and covariance

• The joint distribution of two random variables (say X and Y) gives the probability/pdf that (X,Y) takes each particular pair of values (x,y), jointly.

• Say X = 0 (raining), 1 (not) & Y = 0 (long commute), 1 (not)

• Four outcomes for (X,Y): (0,0), (0,1), (1,0), (1,1)


Y↓ \ X→      Rain X=0   Dry X=1   total
Long Y=0        .15        .07      .22
Short Y=1       .15        .63      .78
total           .30        .70        1

Two rvs: marginal dist’bn

• The marginal probability distribution of rv Y is its probability distribution, as X is free to take any value

• That is, Pr(Y=y) := ∑i Pr(X=xi, Y=y)


• E.g. Pr(long commute) = .22 & Pr(rain) = .3

• Useful to compute expectations, variances, etc of Y:

• E(Y) = ∑i yi· pi = 0·Pr(Y=0) + 1·Pr(Y=1) = Pr(Y=1) = .78
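A sketch (not from the slides) recovering the marginals of the rain/commute table; the encoding x: 0 = rain, 1 = dry and y: 0 = long, 1 = short follows the table above:

```python
# Marginals: sum the joint distribution over the other variable
joint = {(0, 0): 0.15, (1, 0): 0.07, (0, 1): 0.15, (1, 1): 0.63}

pr_X = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
pr_Y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

print(pr_X[0])   # Pr(rain) = 0.3
print(pr_Y[0])   # Pr(long commute) = 0.22
print(sum(y * p for y, p in pr_Y.items()))   # E(Y) = Pr(Y=1) = 0.78
```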


Two rvs: conditional dist’bn

• The distribution of rv Y conditional on rv X taking a specific value is called the conditional distribution of Y given X .

• Written Pr(Y=y|X=x)


• E.g. Pr(Y=0|X=0) = .5 (equally likely)

• Bayes’ formula: Pr(Y=y|X=x) = Pr(Y=y,X=x)/Pr(X=x)

• Indeed, Pr(Y=0,X=0)/Pr(X=0) = .15/.30 = .5

Note, the denominator uses the marginal dist’bn of X


Two rvs: conditional expectation

• The conditional expectation/mean of Y given X , E(Y|X=x), is the mean of Pr(Y|X=x)

• Discrete: E(Y|X=x):= ∑i Pr(Y=yi|X=x)·yi

• E(Y|X=1) = Pr(Y=0|X=1)·0+Pr(Y=1|X=1)·1 = .63/.7 = .9


• E(Y|X=0) = Pr(Y=0|X=0)·0+Pr(Y=1|X=0)·1 = .5

• Law of iterated expectations: The mean of Y is the weighted average of E(Y|X=xi), with weights given by the probability distribution of X = x1, …, xk.

• That is, E(Y) = ∑i E(Y|X=xi)·Pr(X=xi)

• Compactly, E(Y) = E[E(Y|X)]. E.g. E(Y) = .9·.7 + .5·.3 = .78
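A sketch verifying the law of iterated expectations on the same table (same joint dict as in the marginal sketch):

```python
# E(Y) = sum over x of E(Y|X=x) * Pr(X=x)
joint = {(0, 0): 0.15, (1, 0): 0.07, (0, 1): 0.15, (1, 1): 0.63}
pr_X = {0: 0.30, 1: 0.70}

def cond_mean_Y(x):
    # E(Y|X=x): weight each y by Pr(Y=y, X=x) / Pr(X=x)
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / pr_X[x]

print(cond_mean_Y(1))   # 0.9
print(cond_mean_Y(0))   # 0.5
print(sum(cond_mean_Y(x) * px for x, px in pr_X.items()))   # 0.78 = E(Y)
```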


Two rvs: conditional variance

• The conditional variance of Y given X, var(Y|X=x), is the variance of the conditional distribution Pr(Y|X=x): E{[Y – E(Y|X=x)]² | X=x}

• E(Y|X=x) above is a constant & the underlying probability is Pr(Y|X=x)

• Discrete: var(Y|X=x) := ∑i Pr(Y=yi|X=x)·[yi – E(Y|X=x)]²


• var(Y|X=1) = Pr(Y=0|X=1)·[0 – E(Y|X=1)]² + Pr(Y=1|X=1)·[1 – E(Y|X=1)]² = .1·[0 – .9]² + .9·[1 – .9]² = .081 + .009 = .09
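And the corresponding sketch for the conditional variance just computed:

```python
# var(Y|X=1): variance of the conditional pmf of Y given X=1
joint = {(0, 0): 0.15, (1, 0): 0.07, (0, 1): 0.15, (1, 1): 0.63}
pr_X1 = 0.70

cond = {y: joint[(1, y)] / pr_X1 for y in (0, 1)}   # {0: .1, 1: .9}
mu = sum(y * p for y, p in cond.items())            # E(Y|X=1) = .9
var = sum((y - mu) ** 2 * p for y, p in cond.items())
print(var)   # ~0.09
```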


Two rvs: independence

• Informally, rvs X, Y are independent if knowing the value of one yields no information about the value of the other.

• Precisely, they are independent if the conditional distribution of Y given X equals the marginal distribution of Y.


• That is, if Pr(Y=y|X=x) = Pr(Y=y) for all possible x

• Using Bayes’ formula: Pr(Y=y,X=x) = Pr(Y=y)·Pr(X=x)


Aside: Standardizing a rv

• A common transformation of a rv is standardizing it:

• X into X̃ := (X – µX)/σX

• Deviations from mean, divided by standard deviation

• E(X̃) = 0 and var(X̃) = 1.


• Thus standardized rvs always have mean 0 and st.dev 1

• Exercise: If c > 0 is a constant, then standardizing cX gives the same X̃

• Thus we say this transformation is scale-invariant. If X is measuring time, whether in seconds, minutes or hours, the transformation gives the same result.
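A minimal sketch of that scale-invariance claim: the same commute times in minutes and in hours standardize to identical values (the sample numbers are illustrative):

```python
# Standardize: subtract the mean, divide by the standard deviation
def standardize(xs):
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / sd for x in xs]

minutes = [30.0, 45.0, 60.0, 90.0]
hours = [x / 60 for x in minutes]
print(standardize(minutes))
print(standardize(hours))    # identical: the positive constant c drops out
```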


Two rvs: covariance

• A measure of how much two rvs X, Y vary together is this:

• Covariance between X and Y: cov(X,Y) := E[(X – µX)(Y – µY)]

• It is also denoted σXY

• Expanding, cov(X,Y) = E(XY) – µXµY


• Note, (X-µx) & (Y-µY) are deviations from their means.

• Suppose when X tends to exceed its mean, so does Y tend to exceed its mean. Then the product is positive, and so is the covariance. Likewise, a negative covariance suggests that when X overperforms, Y underperforms, relative to means.

• Discrete: cov(X,Y) = ∑i ∑j (xi – µX)(yj – µY)·Pr(X=xi, Y=yj)

• Exercise: If X, Y are independent, cov = 0 (converse false)

• Exercise: cov(aX,Y)=acov(X,Y). Also, cov(a+X,Y)=cov(X,Y)
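A sketch computing cov(X,Y) for the rain/commute table via the expanded formula cov = E(XY) – µXµY:

```python
# Covariance from the joint distribution (x: 0 rain/1 dry, y: 0 long/1 short)
joint = {(0, 0): 0.15, (1, 0): 0.07, (0, 1): 0.15, (1, 1): 0.63}

mu_X = sum(x * p for (x, _), p in joint.items())      # 0.70
mu_Y = sum(y * p for (_, y), p in joint.items())      # 0.78
E_XY = sum(x * y * p for (x, y), p in joint.items())  # 0.63
print(E_XY - mu_X * mu_Y)   # ~0.084 > 0: dry days and short commutes go together
```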


Two rvs: correlation

• Covariance E[(X – µX)(Y – µY)] involves variables on potentially different scales (e.g. X in minutes, Y in hours), so the units of the product are hard to interpret.

• However, recall that X̃ = (X – µX)/σX and Ỹ = (Y – µY)/σY are scale-invariant, so E(X̃·Ỹ) makes more sense:


• corr(X,Y) := E(X̃·Ỹ) = … (bottom of last slide) … = cov(X,Y)/σXσY

• Rvs are said to be uncorrelated if cov(X,Y) = 0. Then corr=0.

• Exercise: If E(Y|X) is independent of X (equal to µY), then X,Y are uncorrelated.

• Fact: Corr is always between -1 and +1
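Continuing the covariance sketch above, the correlation for the same table (using the Bernoulli standard deviations √(p(1–p))):

```python
# corr(X,Y) = cov(X,Y) / (sd_X * sd_Y), always in [-1, 1]
cov_XY = 0.084                 # from the covariance sketch above
sd_X = (0.70 * 0.30) ** 0.5    # Bernoulli sd: sqrt(p(1 - p))
sd_Y = (0.78 * 0.22) ** 0.5
print(cov_XY / (sd_X * sd_Y))  # ~0.44
```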


Correlation measures linear association


Mean and variance of sums of rvs

Some basic consequences of the definitions of E and var:

• E(X+Y) = E(X) + E(Y)

• E(a + bX) = a + bE(X)

• Var(a+bX) = b2var(X)


• Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y)

… so Var(X+Y) = Var(X) + Var(Y) if and only if X, Y are uncorrelated

• Var(X) = E(X²) – µ²X

• Cov(a + bX,Y) = bCov(X,Y)

• Cov(X,Y) = E(XY) – µXµY

• Cov(X,X) = Var(X)
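A simulation sketch (the distributions below are chosen only for illustration) checking the variance-of-a-sum identity:

```python
# var(X+Y) = var(X) + var(Y) + 2 cov(X,Y), checked on correlated draws
import random

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(200_000)]
ys = [0.5 * x + random.gauss(0, 1) for x in xs]   # correlated with X

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

s = [x + y for x, y in zip(xs, ys)]
print(var(s))                               # ~3.25
print(var(xs) + var(ys) + 2 * cov(xs, ys))  # same, up to simulation noise
```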


Key distributions: Normal

• The normal distribution with mean µ and variance σ2>0, denoted N(µ,σ2), is defined by the pdf

f_Y(y) = [1/√(2πσ²)]·exp[–(y – µ)²/(2σ²)]


• The factor preceding exp ensures probabilities sum to 1, ∫f(y)dy=1

• One can show that E(Y) = µ, var(Y) = σ², skew. = 0, kurt. = 3

• Standard normal dist’n: Z ~ N(0,1), i.e. zero mean & unit variance.

• Its cdf is denoted Φ, so Pr(Z ≤ c) = Φ(c)

• Table of values of Φ on pp. 749-750.

• The table is relevant also for any normal N(µ,σ²): standardize it…


Key distributions: Normal

• Say Y is normal, set Z: = (Y-µ)/σ, so Y = µ+ σZ.

• One can show that Z is N(0,1), so that Φ is relevant.

• For example, to look up Pr(Y≤D), note

Pr(Y ≤ D) = Pr(Y – µ ≤ D – µ) = Pr((Y – µ)/σ ≤ (D – µ)/σ) = Pr(Z ≤ (D – µ)/σ) = Φ((D – µ)/σ)


• Which again you can look up on p.750, given D,µ,σ

• Since this is a cdf, Pr(Z>K) = 1-Pr(Z≤K) = 1-Φ(K)

• Also, to look up Pr(A<Z≤B), note

Pr(A < Z ≤ B) = 1 – Pr(Z ≤ A) – Pr(Z > B) = 1 – Φ(A) – [1 – Φ(B)] = Φ(B) – Φ(A)
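In place of the textbook table, these lookups can be sketched with the standard library, since Φ(z) = [1 + erf(z/√2)]/2 (the numbers µ, σ, D below are illustrative):

```python
# Standard normal cdf via math.erf, then the three lookups above
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma, D = 10.0, 2.0, 13.0
print(Phi((D - mu) / sigma))    # Pr(Y <= 13) for Y ~ N(10, 4): ~0.933
print(1 - Phi(1.0))             # Pr(Z > 1): ~0.159
print(Phi(1.0) - Phi(-1.0))     # Pr(-1 < Z <= 1): ~0.683
```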

Key distributions: normal

• An important feature is that normal dist’bns are closed under sums and scalings. That is, if X,Y are normals, and a,b are constants, then aX+bY also is normal.

• The mean of aX+bY we already know, from our work on expectations: it is aµX + bµY

• Its variance we also know, from before: var(aX + bY) = a²var(X) + b²var(Y) + 2ab·cov(X,Y)

• Fact: If two normals are uncorrelated, they are independent

• Recall that the converse holds for any rvs: if independent, then uncorrelated


Key distributions: normal

• Fact: If a set of rvs has a multivariate normal disb’n, then the marginal dist’bn of each is normal

• Fact: If X,Y have a bivariate normal dist’bn, then E(Y|X=x) is linear in x, i.e. E(Y|X=x) = a + bx for all x, and some constants a, b.



Key distributions: Chi-squared

• The chi-squared distribution with m degrees of freedom is the dist’bn of a sum of m squared independent standard normal rvs. Denoted χ²m

• So if X, Y are standard normals, then X² + Y² is a chi-squared with df=2.


• Table on p.752 gives some values, given the percentile.

• We see the 95th percentile for a χ²2 is 5.99

• The chi-squared will appear when we do tests. If we wish to test that a certain error term is statistically insignificant, and know that it has such a dist’bn, then the table will help us.
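A sketch checking the 5.99 figure by simulation, using only the standard library (with scipy installed, chi2.ppf(0.95, df=2) would give it directly):

```python
# 95th percentile of chi-squared(2): sum of two squared standard normals
import random

random.seed(2)
draws = sorted(random.gauss(0, 1) ** 2 + random.gauss(0, 1) ** 2
               for _ in range(100_000))
print(draws[int(0.95 * len(draws))])   # ~5.99, matching the table
```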


Key distributions: Student t

• The Student t distribution with m degrees of freedom is the dist’bn of the ratio Z/(χ²m/m)^(1/2): a standard normal over the square root of a chi-squared with df=m divided by m, where these are independent.

• It has the same shape as a normal, except for fatter tails, which become thinner as m grows.

• A table with percentiles for the t dist’bn is on p. 751


Key distributions: F

• The F distribution with m, n degrees of freedom, Fm,n, is the dist’bn of the ratio (W/m)/(V/n), where W, V are independent chi-squared rvs with df = m, n respectively.


• A related dist’bn is Fm,∞ = W/m, where W is a χ²m

• When n is large, Fm,∞ is a good approximation to Fm,n.

• Tables on pp.753-6 give values of these F’s at various percentiles and given various df’s.
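Assuming scipy is available, the table lookups for t and F can be sketched as follows:

```python
# Percentiles of Student t and F via scipy.stats
from scipy.stats import t, f

print(t.ppf(0.975, df=10))         # 97.5th percentile of t(10): ~2.23
print(f.ppf(0.95, dfn=3, dfd=30))  # 95th percentile of F(3,30): ~2.92
# As the second df grows, F(m,n) approaches F(m,infinity) = chi2(m)/m:
print(f.ppf(0.95, dfn=3, dfd=10_000_000))   # ~2.61 (= 7.81/3)
```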

