Lecture 5: Confidence Intervals - faculty.nps.edu/rdfricke/Business_Stats/lecture5.pdf

Business Statistics

Lecture 5:

Confidence Intervals

Goals for this Lecture

• Confidence intervals

• The “t” distribution

Welcome to Interval Estimation!

Moments
Mean              815.0340    Sample mean
Std Dev             0.8923    Sample SD
Std Error Mean      0.0892
Upper 95% Mean    815.2111
Lower 95% Mean    814.8569
N                 100.0000    Sample size
Sum Weights       100.0000

THIS CLASS

LAST CLASS

Hmmm...

• In the motor shaft case, we built a model using $\bar{X}$ and $s$

• We treated them like population parameters

• What if they were way off?

• How would we know?

• E.g., what if $\bar{X} = 820$ for the 100 shafts and we decided the process was not capable?

• How sure are we that $\mu$ would give the same result?

Inference: Making Educated Guesses

• We want to use a sample to make guesses about a larger population

• Samples are variable (we’d get different values if we took a different sample), so our guesses are uncertain

• We want to guess in such a way that:

• There is a chance we guessed right

• We know what that chance is

General Strategy for Guessing

• Pick a statistic similar to the parameter you want to guess

• Figure out what the sampling distribution of your statistic looks like

• Use the sampling distribution to assess the quality of your guess

• We’ll start by guessing averages, because they’re easy

Assumption

• Data are collected as a simple random sample:

• Every unit in population equally likely to be chosen

• Don’t just look at the 800 most highly paid CEOs

• Choosing one unit does not change the relative chances for another unit to be chosen

• Other sampling schemes require different techniques

Last Class: Distribution of Averages

[Figure: two histograms on a ShaftDiam axis (812 to 818). "Individual": histogram of all future shafts, SD = s. "Mean of 5": histogram of all possible means of five future shafts, SE = s/sqrt(5).]

Guessing $\mu$, the Population Mean

• Best guess for $\mu$ is $\bar{X}$

• If you always guess $\bar{X}$ for $\mu$, you will never be right!

• You can guess with a confidence interval and be right some of the time

• Narrow intervals: higher chance of being wrong

• Wide intervals: less useful

Main Idea

[Figure: ShaftDiam axis (812 to 818) showing the (unobserved) distribution of the population ("Individual"), the (unobserved) distribution of the sample mean ("Mean of 5"), the unobserved population mean, the sample mean, and the 95% confidence interval for the population mean.]

• Because of the CLT, we know that $\bar{X}$ is within 2 SE’s of $\mu$ 95% of the time

• Alternatively, $\mu$ is within 2 SE’s of $\bar{X}$ 95% of the time

How to Guess

• Choose a probability of being wrong: $\alpha$

• Find $z$ on the normal table so that $\alpha/2$ of the probability is above $z$ and $\alpha/2$ is less than $-z$

• Example: if $\alpha = 5\%$ then $z = 1.96$

• Calculate $\bar{X}$ and $s/\sqrt{n}$

• Your interval is $\bar{X} \pm z\, s/\sqrt{n}$

Example: Shaft Process

• Sample mean: $\bar{x} = 815.034$

• Sample SD: $s = 0.8923$

• Number of shafts: $n = 100$

• SE: $s/\sqrt{n} = 0.8923/\sqrt{100} = 0.08923$

• 95% confidence interval for $\mu$: $\bar{x} \pm 1.96\, s/\sqrt{n}$

• With numbers: $815.034 \pm 1.96(0.08923) = [814.859, 815.209]$
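The shaft calculation can be checked with a few lines of Python. This is a minimal sketch using only the standard library. The function name `z_interval` and its parameters are illustrative assumptions, and the sample mean is taken as 815.034 (the JMP value), which is what reproduces the endpoints quoted on the slide.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, s, n, alpha=0.05):
    """Large-sample interval for the mean: xbar +/- z * s / sqrt(n)."""
    # z leaves alpha/2 of the probability in each tail (about 1.96 for alpha = 0.05)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = s / sqrt(n)  # standard error of the sample mean
    return xbar - z * se, xbar + z * se


# Shaft process numbers from the slide
lower, upper = z_interval(xbar=815.034, s=0.8923, n=100)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

Changing `alpha` to 0.01 or 0.10 gives the 99% or 90% interval from the same three summary numbers.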

JMP: Shaft Diameters

Mean              815.0340    Best guess at $\mu$
Std Dev             0.8923    Best guess at $\sigma$
Std Error Mean      0.0892
Upper 95% Mean    815.2111    95% confidence interval for $\mu$ (upper limit)
Lower 95% Mean    814.8569    95% confidence interval for $\mu$ (lower limit)
N                 100.0000
Sum Weights       100.0000

• There’s a 95% chance that this interval contains $\mu$

• There is a 5% chance it does not

So, What is a Confidence Interval?

• It’s an interval around our sample statistic that shows how variable the sample statistic is

• Narrow: real (population) value unlikely to be far away

• Wide: little information about the population value

• Two CIs for our example:

[Figure: confidence interval #1 and confidence interval #2 drawn around the observed sample mean on an axis from 814 to 816.]

Again, What is a Confidence Interval?

• A confidence interval is a random interval

• Random because it is a function of a random variable ($\bar{X}$)

• Confidence level is the long-run percentage of intervals that will “cover” the population parameter

• It is not the probability that the interval contains the true parameter!

A Simulation

[Figure: 95% confidence intervals for mean = 50, sd = 10, n = 5, one interval per sample for 50 samples (vertical axis 10 to 100). Intervals not including population mean: 2]

Another Simulation

[Figure: 95% confidence intervals for mean = 50, sd = 10, one interval per sample for 200 samples (vertical axis 20 to 80). Intervals not including population mean: 10]
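A simulation of this kind is easy to reproduce. The sketch below is not the code behind the slides. It assumes normal data with mean 50 and SD 10, samples of size 5, and an SD treated as known, so each interval is the sample mean plus or minus z times sigma/sqrt(n). The function name `coverage`, the seed, and the number of repetitions are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def coverage(mu=50.0, sigma=10.0, n=5, reps=10_000, alpha=0.05, seed=1):
    """Fraction of simulated intervals that cover the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / sqrt(n)  # sigma is treated as known here
    hits = 0
    for _ in range(reps):
        # Draw one sample of size n and compute its mean
        xbar = mean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps


print(f"Share of intervals covering the mean: {coverage():.3f}")
```

The printed share should land close to 0.95, which is the long-run reading of "95% confident": about 1 interval in 20 misses the population mean.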

Deriving a Confidence Interval (1)

• Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal population with unknown mean $\mu$ and known standard deviation $\sigma$

• Create a CI for $\mu$ based on the sampling distribution of the mean:

$$\bar{X} \sim N\left(\mu, \sigma^2/n\right)$$

• To start, we know that (via standardizing):

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$


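Deriving a Confidence Interval (2)

• Now for $Z \sim N(0,1)$ we know

$$\Pr(-1.96 \le Z \le 1.96) = 0.95$$

• That is, there is a 95% probability that the random variable $Z$ lies in this fixed interval

• Thus

$$\Pr\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$

• And, after some algebra

$$\Pr\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

• Now we say we are 95% confident that this random interval covers the fixed (unknown) $\mu$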

Deriving a Confidence Interval (3)

• So, if $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$ are observed values of a random sample from a $N(\mu, \sigma^2)$:

$$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \text{ is a 95\% confidence interval for } \mu$$

• We can be 95% confident that the interval covers the population mean

• Interpretation: In the long run, 19 times out of 20 the interval will cover the true mean and 1 time out of 20 it will not

But Something’s Fishy...

• Why make all the fuss about the mean being random if we treat the SD as known?

• Since $\sigma$ is a population quantity, we have to estimate it with $s$

• Should that make the intervals wider or narrower?

• When $\sigma$ is unknown (almost always), use the “t” distribution rather than the normal

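The t Distribution

[Figure: density curves of the normal distribution and of t distributions with 3, 10, and 100 degrees of freedom (T3, T10, T100). Horizontal axis: Z = number of SE’s from the mean, from -4 to 4. Vertical axis: 0.00 to 0.40.]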

Degrees of Freedom (df)

• The more degrees of freedom we have, the better we can estimate $\sigma$

• The better we estimate $\sigma$, the closer we are to $\sigma$ being known

• Thus, the more df we have, the closer t values are to z values

• Calculating degrees of freedom:

• Each observation adds one degree of freedom

• One degree of freedom is used up when we calculate $\bar{X}$

• There are n-1 degrees of freedom left
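The claim that t values approach z values as df grows can be checked numerically. The sketch below is a standard-library-only illustration, not a replacement for the t table. The helper names `t_pdf`, `t_cdf`, and `t_crit` are assumptions made for this example. The critical value is found by integrating the t density with Simpson's rule and then bisecting on the cumulative probability.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3  # half the probability lies below zero


def t_crit(df, alpha=0.05):
    """Value t with alpha/2 of the probability above it, found by bisection."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:  # widen the bracket until it contains t
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for df in (3, 10, 100):
    print(f"df = {df:>3}: t = {t_crit(df):.3f}")
print(f"normal  : z = {NormalDist().inv_cdf(0.975):.3f}")
```

The printed t values shrink toward the z value as df increases, so intervals built with few observations are noticeably wider than the normal-based ones.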

How to (Really) Guess

• Choose a probability of being wrong: $\alpha$

• Calculate $\bar{X}$ and $s/\sqrt{n}$

• For df = n-1, find t from Table A3-5 (page 496) for $p = 1 - \alpha$

• Then, the confidence interval is $\bar{X} \pm t\, s/\sqrt{n}$

Table A3-5

• Example

• For $\alpha = 0.05$ and df = 100, we have t = 1.984

• Notation

$$t_{df,\,\alpha/2} = t_{n-1,\,\alpha/2}, \qquad t_{100,\,0.025} = 1.984$$

Shaft Diameters Example Redux

Mean              815.0340    Best guess at $\mu$
Std Dev             0.8923    Best guess at $\sigma$
Std Error Mean      0.0892
Upper 95% Mean    815.2111    95% confidence interval for $\mu$ (upper limit)
Lower 95% Mean    814.8569    95% confidence interval for $\mu$ (lower limit)
N                 100.0000
Sum Weights       100.0000

$$\bar{x} \pm 1.984\, s/\sqrt{n} = 815.034 \pm 1.984 \times 0.8923/\sqrt{100} = [814.857, 815.211]$$
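The redux calculation, and its comparison with the earlier z-based interval, can be sketched as follows. The function name `mean_interval` is an illustrative assumption. The critical values 1.984 and 1.96 are the table values quoted in the lecture rather than values computed here.

```python
from math import sqrt


def mean_interval(xbar, s, n, critical_value):
    """Return (lower, upper) for xbar +/- critical_value * s / sqrt(n)."""
    margin = critical_value * s / sqrt(n)
    return xbar - margin, xbar + margin


xbar, s, n = 815.034, 0.8923, 100

# t value read from the table (df = 100 row) versus z from the normal table
t_lower, t_upper = mean_interval(xbar, s, n, critical_value=1.984)
z_lower, z_upper = mean_interval(xbar, s, n, critical_value=1.96)

print(f"t interval: [{t_lower:.3f}, {t_upper:.3f}]")
print(f"z interval: [{z_lower:.3f}, {z_upper:.3f}]")
```

With 100 shafts the two intervals differ only in the third decimal place, which is why the JMP limits are so close to the earlier normal-based answer.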

How Confidence Intervals Behave

• Width of CI’s: $w = 2\, t_{n-1,\,\alpha/2}\, \dfrac{s}{\sqrt{n}}$

• Margin of error: $E = t_{n-1,\,\alpha/2}\, \dfrac{s}{\sqrt{n}}$

• Bigger SD → bigger SE → wider intervals

• Bigger sample size:

• Smaller SE → narrower intervals

• Smaller t values → narrower intervals

• Higher confidence → bigger t values → wider intervals
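These behaviors can be seen by tabulating the margin of error. The sketch below makes one simplifying assumption that is not in the slides: it uses the normal critical value z in place of $t_{n-1,\,\alpha/2}$, which slightly understates the margin for small n. The function name `margin_of_error` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(s, n, confidence=0.95):
    """Approximate margin of error z * s / sqrt(n); z stands in for t here."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * s / sqrt(n)


s = 0.8923  # shaft sample SD from the lecture

# Bigger sample size -> smaller SE -> narrower intervals
for n in (25, 100, 400):
    print(f"n = {n:>3}: E = {margin_of_error(s, n):.4f}")

# Higher confidence -> bigger critical value -> wider intervals
for confidence in (0.90, 0.95, 0.99):
    print(f"confidence = {confidence:.2f}: E = {margin_of_error(s, 100, confidence):.4f}")
```

Because n enters through its square root, quadrupling the sample size only halves the margin of error.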

t vs. z

• Use t when you don’t know $\sigma$

• The t distribution assumes the data are normally distributed

• Options if data are not normally distributed:

• Transform the data (logarithms)

• If transformations don’t work and sample size is big (n > 30), ignore the problem

• If transformations don’t work and sample size is small, read the book about nonparametric tests

Example (CompPur.jmp)

• Manufacturer of consumer electronics:

• How many households will purchase a computer in the next year?

• Use survey to collect responses from 100 households

• To justify sales projections, management needs the proportion to be at least 25%

• Should management revise sales projections?

Example, continued

• Survey results:

Frequencies (2 Levels)
Level    Count    Prob
No          86    0.86000
Yes         14    0.14000
Total      100    1.00000

• Another sample would likely have given a different result

• What we want to know is, based on this result, where could the true proportion lie?

Example, continued

• When data are 0’s and 1’s, they are REALLY not normal

[Figure: histogram of the 0/1 survey responses, horizontal axis from 0 to 1.1.]

Mean                0.1400
Std Dev             0.3487
Std Error Mean      0.0349
Upper 95% Mean      0.2092    95% CI for true proportion (upper limit)
Lower 95% Mean      0.0708    95% CI for true proportion (lower limit)
N                 100.0000
Sum Weights       100.0000

• Rule of thumb: at least 30 observations, 5 successes, and 5 failures lets CLT kick in

• No difference between means and proportions!
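The JMP numbers for the survey can be reproduced by coding each "Yes" as 1 and each "No" as 0 and treating the proportion as a sample mean. The sketch below is an illustration under that assumption. It uses the sample SD with the n-1 divisor and the t value 1.984 quoted earlier from the table, and the variable names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# 14 "Yes" responses coded as 1 and 86 "No" responses coded as 0
responses = [1] * 14 + [0] * 86

n = len(responses)
p_hat = mean(responses)   # sample proportion is the mean of the 0/1 data
s = stdev(responses)      # sample SD (n-1 divisor)
se = s / sqrt(n)          # standard error of the mean

t_value = 1.984           # table value used in the lecture for a 95% interval
lower = p_hat - t_value * se
upper = p_hat + t_value * se

print(f"p-hat = {p_hat:.4f}, SD = {s:.4f}, SE = {se:.4f}")
print(f"95% CI for the true proportion: [{lower:.4f}, {upper:.4f}]")
print("25% is inside the interval:", lower <= 0.25 <= upper)
```

The 25% that management needs lies above the upper limit of the interval, so the survey does not support the current sales projections.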

Other Confidence Intervals

• There are lots of other confidence intervals – we’ve concentrated on CIs for the mean

• See your textbook for

• CI for the variance (i.e., $\sigma^2$)

• CI for the difference of two means

• Not enough time to learn about these

• Just skim those sections in the book

• And know that CIs exist for other parameters

What We Have Learned So Far…

• Descriptive Statistics

• Probability

• And, Or, Not

• Normal distribution

• Central limit theorem

• Computing SE($\bar{X}$) from SD(X)

• Inference

• Confidence intervals for population means and proportions
