Stat 101: Lecture 18 Summer 2006

Designed Experiments

- Double-blind, randomized, controlled study versus observational study.
- Causation and association.
- Confounding factors may exist.
- Weighted average and the chi-square test.
- Summary statistics: mean, median, sd, IQR.
- Plots: histogram, boxplot, scatterplot.

Mathematical model for regression

- Each point $(X_i, Y_i)$ in the scatterplot satisfies:

$$Y_i = a + bX_i + \varepsilon_i$$

- $\varepsilon_i \sim N(0, \text{sd} = \sigma)$, where $\sigma$ is usually unknown. The $\varepsilon$'s have nothing to do with one another (they are independent); e.g., a big $\varepsilon_i$ does not imply a big $\varepsilon_j$.

- We know the $X_i$'s exactly. This implies that all error occurs in the vertical direction.

Estimating the regression line

$e_i = Y_i - (a + bX_i)$ is called the residual. It measures the vertical distance from a point to the regression line. One estimates $a$ and $b$ by minimizing

$$f(a, b) = \sum_{i=1}^{n} \left(Y_i - (a + bX_i)\right)^2$$

Taking the derivatives of $f(a, b)$ with respect to $a$ and $b$ and setting them to 0, we get

$$a = \bar{Y} - b\bar{X}; \qquad b = \frac{\frac{1}{n}\sum_{i=1}^{n} X_i Y_i - \bar{X}\,\bar{Y}}{\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2}$$

$f(a, b)$ is also referred to as the Sum of Squared Errors (SSE).
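As a minimal sketch of the least-squares formulas for $a$ and $b$, the following computes the fitted line and the SSE for a small data set. The data values and the function names `fit_line` and `sse` are made up for illustration; they are not from the lecture.

```python
def fit_line(xs, ys):
    """Return (a, b) that minimize sum((y - (a + b*x))**2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n  # (1/n) * sum of X_i * Y_i
    mean_xx = sum(x * x for x in xs) / n              # (1/n) * sum of X_i^2
    b = (mean_xy - x_bar * y_bar) / (mean_xx - x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b


def sse(a, b, xs, ys):
    """Sum of squared errors f(a, b) for the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to illustrate the formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
best = sse(a, b, xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {best:.4f}")
```

Nudging either coefficient away from the fitted values should only increase the SSE, which is a quick sanity check that the formulas locate the minimum.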

- Definition of probability: frequentist view versus Bayesian view.
- Kolmogorov's Axioms.
- Conditional probability:

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$$

- Independence:

$$P(A \mid B) = P(A)$$

- The Addition Rule:

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$$

- The total probability rule:

$$P(A) = P(A \mid B)P(B) + P(A \mid \text{not } B)P(\text{not } B)$$

- Bayes' rule: Let $A_1, \dots, A_n$ be mutually exclusive and suppose that $P(A_1 \text{ or } A_2 \dots \text{ or } A_n) = 1$. Then,

$$P(A_1 \mid B) = \frac{P(B \mid A_1) \times P(A_1)}{\sum_{i=1}^{n} P(B \mid A_i)P(A_i)}$$

- The binomial formula:

$$P(\text{exactly } r \text{ successes}) = \binom{n}{r} p^r (1 - p)^{n - r}$$

- The Poisson formula:

$$P(\text{exactly } k \text{ events}) = \frac{\lambda^k}{k!}\exp(-\lambda)$$
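The probability rules above can be checked numerically. The sketch below applies Bayes' rule to a two-hypothesis example and evaluates the binomial and Poisson formulas. The function names and the screening-test numbers (1% prevalence, 95% detection rate, 5% false-positive rate) are hypothetical and chosen only for illustration.

```python
from math import comb, exp, factorial


def bayes_posterior(priors, likelihoods):
    """Posterior P(A_i | B) for mutually exclusive A_i whose priors sum to 1."""
    # Denominator of Bayes' rule: sum of P(B | A_i) * P(A_i).
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]


def binomial_pmf(r, n, p):
    """P(exactly r successes in n independent trials with success chance p)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson distribution with rate lam."""
    return lam ** k / factorial(k) * exp(-lam)


# Hypothetical screening example: A_1 = "has condition", A_2 = "does not".
priors = [0.01, 0.99]        # P(A_1), P(A_2)
likelihoods = [0.95, 0.05]   # P(positive | A_1), P(positive | A_2)
posterior = bayes_posterior(priors, likelihoods)

print(f"P(condition | positive) = {posterior[0]:.3f}")
print(f"P(exactly 2 heads in 4 fair tosses) = {binomial_pmf(2, 4, 0.5):.3f}")
print(f"P(exactly 0 events | lambda = 2) = {poisson_pmf(0, 2.0):.3f}")
```

Even with a fairly accurate test, the posterior probability stays well below one half here because the prior is small; this is the weighting that the denominator of Bayes' rule makes explicit.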

The Normal distribution and the Central Limit Theorem

- The normal distribution and use of the normal table:

$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$$

- Box model: EV, $\sigma$.
- The Central Limit Theorem for averages:

$$\frac{\bar{X} - EV}{\sigma/\sqrt{n}} \sim N(0, 1)$$

- The Central Limit Theorem for sums:

$$\frac{n\bar{X} - nEV}{\sqrt{n}\,\sigma} \sim N(0, 1)$$

- The Central Limit Theorem for proportions:

$$\frac{\hat{p} - p}{\sqrt{p(1 - p)/n}} \sim N(0, 1)$$
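A small simulation can illustrate the Central Limit Theorem for averages. The sketch below treats a fair die as the box (tickets 1 to 6, so EV = 3.5 and $\sigma = \sqrt{35/12}$), standardizes many sample averages, and compares the fraction falling within $\pm 1.96$ with the normal-curve area. The sample size, number of repetitions, and seed are arbitrary choices, not values from the lecture.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)             # fixed seed so the run is reproducible

n = 50                     # draws per sample
reps = 5000                # number of simulated samples
EV = 3.5                   # expected value of one draw from the box 1..6
sigma = sqrt(35 / 12)      # SD of one draw from the box 1..6

z_values = []
for _ in range(reps):
    draws = [random.randint(1, 6) for _ in range(n)]
    x_bar = sum(draws) / n
    # Standardize the sample average as in the CLT for averages.
    z_values.append((x_bar - EV) / (sigma / sqrt(n)))

frac_within = sum(abs(z) <= 1.96 for z in z_values) / reps
normal_area = NormalDist().cdf(1.96) - NormalDist().cdf(-1.96)

print(f"simulated fraction within +/-1.96: {frac_within:.3f}")
print(f"normal-curve area within +/-1.96:  {normal_area:.3f}")
```

The two numbers should be close but not identical, since the normal curve is only an approximation for a finite number of draws.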

Confidence Intervals

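- The formulas (pe is the point estimate, se the standard error, cv a critical value from the table, and $C$ the confidence level):
  - $(L, U)$: $L = pe - se \times cv_{(1-C)/2}$; $U = pe + se \times cv_{(1-C)/2}$
  - $(-\infty, L)$: $L = pe + se \times cv_{C}$.
  - $(U, +\infty)$: $U = pe + se \times cv_{(1-C)}$.
- Confidence intervals for:
  - Average: $pe = \bar{X}$, $se = \sigma/\sqrt{n}$.
  - Sum: $pe = n\bar{X}$, $se = \sqrt{n}\,\sigma$.
  - Proportion: $pe = \bar{X}$ (the sample proportion), $se = \sqrt{p(1 - p)/n}$.
- Interpretation: what is random, and what is constant.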
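The two-sided interval can be sketched in a few lines. The example below assumes that the critical value is the standard normal value with upper-tail area $(1 - C)/2$ (about 1.96 for $C = 0.95$); that convention, the function name `two_sided_ci`, and all the sample numbers are assumptions made for illustration. For the proportion, $p$ in the standard error is replaced by the sample proportion, which is the usual practice when $p$ is unknown.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_ci(pe, se, confidence=0.95):
    """Return (L, U) = pe -/+ se * cv using a standard normal critical value."""
    # Assumed convention: cv has upper-tail area (1 - confidence) / 2.
    cv = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return pe - se * cv, pe + se * cv


# Average: hypothetical sample of n = 100 with mean 52 and known sigma = 10.
n, x_bar, sigma = 100, 52.0, 10.0
L_avg, U_avg = two_sided_ci(x_bar, sigma / sqrt(n))

# Proportion: hypothetical sample of n = 400 with sample proportion 0.6.
n_p, p_hat = 400, 0.6
L_prop, U_prop = two_sided_ci(p_hat, sqrt(p_hat * (1 - p_hat) / n_p))

print(f"95% CI for the average:    ({L_avg:.2f}, {U_avg:.2f})")
print(f"95% CI for the proportion: ({L_prop:.3f}, {U_prop:.3f})")
```

The endpoints are the random quantities here: a new sample would give a different interval, while the parameter being estimated stays fixed.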

Significance Tests

A significance test requires:

- a null and alternative hypothesis.
- a test statistic.
- a significance probability (P-value).

I: Possible hypotheses

1. $H_0: \theta = \theta_0$; $H_a: \theta \neq \theta_0$.
2. $H_0: \theta \le \theta_0$; $H_a: \theta > \theta_0$.
3. $H_0: \theta \ge \theta_0$; $H_a: \theta < \theta_0$.

Here $\theta$ represents a generic parameter. It could be a population mean, a population proportion, the difference of two population means, or many other things.

II: Possible test statistics

a. For a test about a population mean, we take $\theta$ to be the population mean $\mu$. If you know the population SD, or if $n > 26$ and you use the sample SD as an estimate of the population SD, then you get the significance probability from a z-table and the test statistic is:

$$ts = \frac{\bar{X} - \mu_0}{SD/\sqrt{n}}$$

b. For the previous case, if you have a sample of size $n \le 26$ and use the sample SD to estimate the population SD, then the significance probability comes from a $t_{n-1}$ table and the test statistic is:

$$ts = \frac{\bar{X} - \mu_0}{SD/\sqrt{n - 1}}$$

c. For a test about a proportion, $\theta = p$. The significance probability comes from a z-table, and the test statistic is:

$$ts = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$

d. For a test of the difference of two means, $\theta = \mu_1 - \mu_2$. Assuming that the sample sizes from each population satisfy $n_1 > 26$ and $n_2 > 26$, the significance probability comes from a z-table and the test statistic is:

$$ts = \frac{\bar{X}_1 - \bar{X}_2 - \theta_0}{\sqrt{SD_1^2/n_1 + SD_2^2/n_2}}$$
e. For a test of the difference of two proportions, take $\theta = p_1 - p_2$. Use a z-table for the significance probability and the test statistic:

$$ts = \frac{\hat{p}_1 - \hat{p}_2 - \theta_0}{\sqrt{\hat{p}_1(1 - \hat{p}_1)/n_1 + \hat{p}_2(1 - \hat{p}_2)/n_2}}$$

f. For $n \le 26$, with $\theta = \mu_1 - \mu_2$ and $n$ paired differences $X_i - Y_i$, use $t_{n-1}$ for the significance probability. The test statistic is:

$$ts = \frac{\bar{X} - \bar{Y} - \theta_0}{SD_d/\sqrt{n - 1}}$$

Here $SD_d$ is the sample standard deviation of the $n$ differences.
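To make two of these test statistics concrete, the sketch below computes the paired-difference statistic (case f) and the one-proportion statistic (case c). It assumes the SD is computed with divisor $n$, which is what makes the $\sqrt{n - 1}$ in the denominator equivalent to the more common $s/\sqrt{n}$ form; that reading, the function names, and the data are assumptions for illustration only.

```python
from math import sqrt


def sd_divisor_n(values):
    """SD with divisor n (assumed convention behind the sqrt(n - 1) formulas)."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def paired_ts(xs, ys, theta0=0.0):
    """Case f: (mean difference - theta0) / (SD_d / sqrt(n - 1))."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    d_bar = sum(d) / n      # equals X-bar minus Y-bar
    return (d_bar - theta0) / (sd_divisor_n(d) / sqrt(n - 1))


def proportion_ts(p_hat, p0, n):
    """Case c: (p_hat - p0) / sqrt(p0 * (1 - p0) / n)."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


# Hypothetical paired measurements on 5 subjects.
before = [12, 15, 11, 14, 13]
after = [10, 14, 12, 11, 12]
ts_paired = paired_ts(before, after)          # compare to t with n - 1 = 4 df

# Hypothetical sample: 220 successes out of 400, testing p0 = 0.5.
ts_prop = proportion_ts(220 / 400, 0.5, 400)  # compare to the z-table

print(f"paired ts = {ts_paired:.3f}, proportion ts = {ts_prop:.3f}")
```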

III: The significance probability

- The significance probability of the test statistic depends on the hypothesis chosen in Part I. For that choice, let $W$ be a random variable with a z or $t_{n-1}$ distribution, as indicated in Part II. Then,

  1. The significance probability is $P(W \le -|ts|) + P(W \ge |ts|)$.
  2. The significance probability is $P(W \ge ts)$.
  3. The significance probability is $P(W \le ts)$.

- The significance probability is "the chance of observing data that supports the alternative hypothesis as or more strongly than the data you have seen, when the null hypothesis is correct."
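For the z-table cases, the three significance probabilities can be sketched with the standard normal distribution from Python's standard library. The function name `z_p_value` and the labels `"two-sided"`, `"greater"`, and `"less"` are hypothetical; they stand for hypotheses 1, 2, and 3 of Part I. The sketch takes the lower-tail probability at $ts$ for hypothesis 3, which is the reading that matches the quoted definition of the significance probability.

```python
from statistics import NormalDist


def z_p_value(ts, alternative):
    """Significance probability for a z test statistic.

    alternative: "two-sided" (hypothesis 1), "greater" (2), or "less" (3).
    """
    W = NormalDist()  # standard normal, mean 0 and SD 1
    if alternative == "two-sided":
        return W.cdf(-abs(ts)) + (1 - W.cdf(abs(ts)))
    if alternative == "greater":
        return 1 - W.cdf(ts)   # upper tail: P(W >= ts)
    if alternative == "less":
        return W.cdf(ts)       # lower tail: P(W <= ts)
    raise ValueError(f"unknown alternative: {alternative!r}")


p_two = z_p_value(1.96, "two-sided")
p_greater = z_p_value(1.645, "greater")
p_less = z_p_value(-1.645, "less")

print(f"two-sided: {p_two:.3f}, greater: {p_greater:.3f}, less: {p_less:.3f}")
```

For the $t_{n-1}$ cases the same three tail areas apply, but they would be read from a t-table, which the standard library does not provide.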

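- Goodness-of-Fit Tests:

$$H_0: \text{The model holds}; \qquad H_a: \text{The model fails}$$

$$ts = \sum \frac{(O_i - E_i)^2}{E_i}$$

$$k = \#\text{categories} - 1$$

- Contingency tables and tests of independence:

$$H_0: \text{the two criteria are independent}$$

$$H_a: \text{some dependence exists}$$

$$ts = \sum_{\text{all cells}} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

$$E_{ij} = \frac{i\text{th row sum} \times j\text{th column sum}}{\text{total}}$$

$$k = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$$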
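The contingency-table calculation can be sketched as follows: expected counts come from the row and column sums, and the statistic adds up $(O - E)^2 / E$ over all cells. The table of counts and the function names are invented for illustration. The sketch stops at the statistic and $k$, since looking up the chi-square significance probability needs a table or a library outside the standard library.

```python
def expected_counts(table):
    """E_ij = (ith row sum * jth column sum) / total."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


def chi_square_independence(table):
    """Return (ts, k): the chi-square statistic and its degrees of freedom."""
    expected = expected_counts(table)
    ts = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    k = (len(table) - 1) * (len(table[0]) - 1)
    return ts, k


# Hypothetical 2 x 2 table of observed counts.
observed = [
    [30, 20],
    [20, 30],
]
ts, k = chi_square_independence(observed)
print(f"ts = {ts:.2f} on k = {k} degree(s) of freedom")
```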

Multiple Regression

- In multiple regression, there is more than one explanatory variable. The model is

$$Y_i = a + b_1X_{1i} + b_2X_{2i} + \dots + b_pX_{pi} + \varepsilon_i$$

  Again, the $\varepsilon_i$ are independent normal r.v.s with mean 0.

- The null and alternative hypotheses are:

$$H_0: b_1 \ge 0; \qquad H_a: b_1 < 0$$

- The test statistic is

$$ts = \frac{\hat{b}_1 - 0}{se}$$

- This is compared to a t-distribution with $n - p - 1$ degrees of freedom, where $p$ is the number of explanatory variables in our regression model.

The Bootstrap

The pivot confidence interval assumes that the behavior of $\hat{\theta} - \theta$ is approximately the same as the behavior of $\hat{\theta}^* - \hat{\theta}$.

- Suppose we use a computer to draw 1000 bootstrap samples of size $n$. For each such sample, it calculates a new estimate of the parameter of interest.

- Rank these estimates from smallest to largest. We denote these ordered bootstrap estimates by

$$\hat{\theta}^*_{(1)}, \dots, \hat{\theta}^*_{(1000)}$$

  where the number in parentheses shows the order in terms of size. Thus $\hat{\theta}^*_{(1)}$ is the smallest estimate (e.g., of the sd) found in one of the 1000 bootstrap samples, and $\hat{\theta}^*_{(1000)}$ is the largest.

- The 95% confidence interval is given by

$$L = 2\hat{\theta} - \hat{\theta}^*_{(0.975)}; \qquad U = 2\hat{\theta} - \hat{\theta}^*_{(0.025)}$$

  where $\hat{\theta}^*_{(0.975)}$ and $\hat{\theta}^*_{(0.025)}$ are the 97.5th and 2.5th percentiles of the ordered bootstrap estimates.
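The bootstrap steps can be sketched as below, using the sd as the parameter of interest. The data, the seed, the function name `pivot_bootstrap_ci`, and the choice of the 25th and 975th ordered values as the 2.5th and 97.5th percentiles are all assumptions made for illustration.

```python
import random
from statistics import pstdev


def pivot_bootstrap_ci(data, estimator, n_boot=1000, seed=0):
    """Return (L, U, theta_hat) for a 95% pivot bootstrap interval."""
    rng = random.Random(seed)      # seeded so the result is reproducible
    n = len(data)
    theta_hat = estimator(data)
    # Draw n_boot samples of size n with replacement and re-estimate each time.
    boot = sorted(estimator(rng.choices(data, k=n)) for _ in range(n_boot))
    lo = boot[n_boot * 25 // 1000 - 1]    # 25th ordered value when n_boot = 1000
    hi = boot[n_boot * 975 // 1000 - 1]   # 975th ordered value when n_boot = 1000
    return 2 * theta_hat - hi, 2 * theta_hat - lo, theta_hat


# Hypothetical sample of 20 measurements.
data = [4.1, 5.3, 6.0, 4.8, 5.5, 7.2, 3.9, 5.1, 6.4, 5.8,
        4.4, 6.9, 5.0, 5.6, 4.7, 6.1, 5.2, 4.9, 6.6, 5.4]

L, U, theta_hat = pivot_bootstrap_ci(data, pstdev)
print(f"sd estimate = {theta_hat:.3f}, 95% pivot interval = ({L:.3f}, {U:.3f})")
```

Note that the upper bootstrap percentile sets the lower limit and vice versa; this reflection around $\hat{\theta}$ is what distinguishes the pivot interval from simply reading off the two percentiles.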

Bayesian Statistics

- Recall Bayes' Theorem:

$$P(A_1 \mid B) = \frac{P(B \mid A_1) \times P(A_1)}{\sum_{i=1}^{k} P(B \mid A_i) \times P(A_i)}$$

  where $A_1, \dots, A_k$ are mutually exclusive and

$$P(A_1 \text{ or } A_2 \text{ or } \dots \text{ or } A_k) = 1$$

- Specify the prior distribution. Calculate the likelihood and the posterior.

- Posterior predictive probability: use the posterior probabilities as weights.

The Prior, Likelihood, and Posterior

Model   Prior      P(data | model)   Product   Posterior
p       P(model)   P(k = 0 | p)                P(model | data)

0.1     1/9        0.656             0.0729    0.427
0.2     1/9        0.410             0.0455    0.267
0.3     1/9        0.24              0.0266    0.156
0.4     1/9        0.130             0.0144    0.084
0.5     1/9        0.065             0.007     0.041
0.6     1/9        0.026             0.0029    0.017
0.7     1/9        0.008             0.0009    0.005
0.8     1/9        0.002             0.0002    0.001
0.9     1/9        0.000             0.0000    0.000

Total   1                            0.1704    1
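The table can be reproduced, up to rounding, with a short calculation. The likelihood column appears consistent with $(1 - p)^4$, that is, zero successes in four independent trials; the slide does not state the number of trials, so `n_trials = 4` below is an inference rather than something taken from the lecture. The last step sketches a posterior predictive probability by weighting each model's success chance by its posterior probability.

```python
# Nine candidate models p = 0.1, ..., 0.9 with a uniform prior of 1/9 each.
models = [i / 10 for i in range(1, 10)]
prior = [1 / 9] * len(models)

n_trials = 4  # ASSUMPTION: inferred from the likelihood column, not stated

# Likelihood of observing k = 0 successes under each model.
likelihood = [(1 - p) ** n_trials for p in models]

# Product = prior * likelihood; its sum is the denominator of Bayes' rule.
product = [pr * lik for pr, lik in zip(prior, likelihood)]
evidence = sum(product)
posterior = [x / evidence for x in product]

# Posterior predictive chance of a success on one further trial.
pred_success = sum(p * w for p, w in zip(models, posterior))

for p, lik, prod, post in zip(models, likelihood, product, posterior):
    print(f"{p:.1f}  {lik:.3f}  {prod:.4f}  {post:.3f}")
print(f"sum of products = {evidence:.4f}")
print(f"posterior predictive P(success) = {pred_success:.3f}")
```

Small differences from the printed table in the third decimal place are to be expected, since the slide's entries appear to be rounded or truncated.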