GIST141B, Fall 1999 (Revised 8/11/99)

Modeling Populations

Introduction to the Normal Distribution


A Look Ahead: What you will be able to do after this lecture

Understand two reasons why experiments provide an imperfect window on reality

Understand and describe:
• Population
• Population parameter
• Sample

Given a statement involving a statistical inference, identify the population to which it applies

Distinguish between a statistical inference and a statement that merely summarizes the results of a study.


Experiments: The Scientist’s Imperfect Window on Reality

[Figure: an experiment shown as a window onto the way things really are]


Two Reasons That the Window is Imperfect

1. The data exhibit variation

2. The experimental data are an incomplete sample of the real system

A goal of science: To use our data to make generalizations about the way things really are.


The Effects of Variation and Sampling


Definitions

Population: The collection of all entities or quantities about which you wish to make a generalization

Population parameter: The particular characteristic of the population that you wish to study; the quantity whose value you wish to estimate

Sample: A subset of the population which you select and then measure; the source of the data that you will use to estimate the population parameter
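The three definitions can be illustrated with a small simulation. This is only a sketch: the population below is invented (10,000 hypothetical incomes drawn from N(30000, 8000)), and the names `population`, `sample`, `population_mean` and `sample_mean` are illustrative, not part of the lecture.

```python
import random
import statistics

random.seed(141)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 incomes drawn from N(30000, 8000).
# These numbers are made up for illustration only.
population = [random.gauss(30_000, 8_000) for _ in range(10_000)]

# Population parameter: the mean of every entity in the population.
population_mean = statistics.mean(population)

# Sample: a randomly selected subset of 100 entities that we actually measure.
sample = random.sample(population, 100)

# Sample estimate of the population parameter.
sample_mean = statistics.mean(sample)

print(f"population mean (what really is): {population_mean:,.0f}")
print(f"sample mean (what we see):        {sample_mean:,.0f}")
```

The two printed values should be close but not identical, which is the point of the "imperfect window": the sample estimates the parameter without reproducing it exactly.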


The Population: What “Really is”


The Sample: What We See


Example #1

A researcher is trying to determine the average income of JMU students one year after graduation. He randomly selects 100 students from each graduating class, then contacts them and obtains their annual income.

Population: All JMU graduates one year past graduation

Population parameter: Average annual income

Sample: 100 graduates from each class


Example #2

A chemical engineer is evaluating a new chemical process for making nylon polymer. He wishes to determine the average tensile strength of nylon yarn that is produced by the new process. He manufactures 500 samples of nylon yarn and measures the tensile strength of each sample.

Population: All nylon yarn that will be made with the new process

Population parameter: Average tensile strength

Sample: 500 samples of nylon yarn


Example #3

An environmental scientist is trying to determine the rate at which stratospheric ozone is depleting over the South Pole. Over a five-year period, she measures ozone levels by means of instrumentation on a weather balloon that is released daily from a research station near the South Pole.

Population: The stratosphere over the South Pole

Population parameter: The rate at which ozone is decreasing

Sample: Daily samples over a station near the South Pole (approximately 1825 samples)


Two Different Types of Populations

A collection of distinct entities
• People, bacteria, etc.

A theoretical or existing system
• The stratosphere
• A new chemical process


Definition: A statistical inference is a generalization about the population that is made from a sample of data.

Example 1
Sample data: A particular pair of shoes was purchased from a shoe store at a price much lower than the same pair would have cost elsewhere.
Generalization: Any shoes purchased from this store will be cheaper than they would be elsewhere. (Population: All the shoes purchased from this store)

Example 2
Sample data: Professor Snigglehopper gave easy tests in English 102.
1st Generalization: Professor Snigglehopper will give easy tests in English 425. (Population: All of Professor Snigglehopper’s tests)
2nd Generalization: All English professors give easy tests. (Population: All tests given by all English profs)


Exercise

Galileo rolled balls down inclined ramps and measured the distance covered in a fixed time. Suppose he did 25 replicate runs on a ramp with an incline of 5°. Based on these runs, Galileo found the average distance traveled in one second was 0.41 meters. For each statement below, determine whether or not the statement is a statistical inference.

• “On average, a ball will roll 0.41 meters down a 5° incline in 1 second.”
• “The balls traveled an average of 0.41 meters down the incline in 1 second in my experiments.”
• “A ball will generally roll about 0.4 meters down a 5° incline in 1 second.”
• “A ball will generally roll somewhere between 0.1 and 7 meters down a 5° incline in 1 second.”
• “The balls did not all travel exactly the same distance in 1 sec.”


Wrap-up: What you should be able to do

Understand two reasons why experiments provide an imperfect window on reality

Understand and describe:
• Population
• Population parameter
• Sample

Given a statement involving a statistical inference, identify the population to which it applies

Distinguish between a statistical inference and a statement that merely summarizes the results of a study.


The Normal Distribution Model

Basic Theory



Explain the importance of the normal distribution for modeling populations

State the properties of the normal distribution
• Bell-shaped, symmetric
• Areas under the curve represent proportions of the population
• Bell centered over μ
• Width of bell determined by σ
• N(μ, σ) notation


The Normal Distribution Model

A mathematical model for describing the likelihood of getting particular measured values from the population whenever you take a sample

• Accounts for the variability in the population
• Uses probability to model the randomness in the population
• Provides a basis for making statistical inferences from sample data
    • Predictions about the overall makeup of the population
    • Inferences about the population mean
    • Comparisons of means of two or more populations


Example: Verbal SAT scores for the Class of 2000.

[Figure: histogram "Verbal SAT Scores: JMU Class of 2000" (x-axis: score bins from 275 to 299 through 775 to 799; y-axis: # students, 0 to 700), showing data from the first 14 students]




[Figure: histogram of verbal SAT scores (x-axis: score bins from 275 to 299 through 775 to 799; y-axis: # students, 0 to 700), showing data from all 3165 students]




[Figure: histogram of verbal SAT scores for all 3165 students (x-axis: score bins from 275 to 299 through 775 to 799; y-axis: # students, 0 to 700) with a bell-shaped curve overlaid]

This is the shape of the normal distribution. It models the shape of this histogram.


The Normal Distribution: Models the shape of many different populations

Dependent variable (Y) = probability density
• Larger values indicate values that are more common
• Values near zero indicate values that are uncommon

Independent variable (X) = possible value in the population (e.g. Verbal SAT score)

Parameters: μ (population mean) and σ (population standard deviation)

Model:

$$Y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}$$

where σ = population standard deviation and μ = population mean

μ and σ are based on measuring every entity in the population (i.e. not based on sample data)
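The model equation can be evaluated directly. The sketch below codes the density formula by hand and compares it with Python's standard-library `statistics.NormalDist`; the function name `normal_density` is illustrative, and the values μ = 584 and σ = 67 are borrowed from the verbal SAT example used elsewhere in this lecture.

```python
import math
from statistics import NormalDist


def normal_density(x, mu, sigma):
    """Evaluate the N(mu, sigma) probability density Y at the value x."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


mu, sigma = 584, 67  # verbal SAT example values
for score in (450, 584, 718):
    by_formula = normal_density(score, mu, sigma)
    by_library = NormalDist(mu, sigma).pdf(score)
    print(score, f"{by_formula:.6f}", f"{by_library:.6f}")
```

The density is largest at the mean and takes the same value at equal distances above and below it, which is the bell shape described on the slide.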


The Normal Distribution

[Figure: the normal distribution curve (x-axis: X = possible values in the population; y-axis: Y = probability density), annotated with the model equation $Y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}$]


Important Facts About the Normal Distribution



[Figure: normal curve (y-axis: probability density) with the whole area under the curve shaded; Area = 1.0]

Total area under the normal curve is equal to 1.0 (100% of the population)





[Figure: normal curve (y-axis: probability density) with the area between two values A and B shaded]

Area = proportion of the population with values between A and B.




[Figure: normal curve (y-axis: probability density) with the center labeled "most likely range of values" and the tails labeled "least likely range of values"]

“Bell” is centered over the population mean (μ)

Values near the mean are much more common than values far from the mean




[Figure: normal curve (y-axis: probability density), symmetric about the mean]

Values below the mean are just as likely as values above the mean



Normal Distribution


[Figure: normal curve (y-axis: probability density) with the central 68% of the area shaded, from μ - 1σ to μ + 1σ]

68.3% of the population is within ONE standard deviation of the mean


Normal Distribution


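[Figure: normal curve (y-axis: probability density) with the central 95% of the area shaded, from μ - 1.96σ to μ + 1.96σ]

About 95% of the population is within TWO standard deviations of the mean (more precisely, 95% lies within 1.96 standard deviations; 95.4% lies within exactly two)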


Normal Distribution


[Figure: normal curve (y-axis: probability density) with the central 99.7% of the area shaded, from μ - 3σ to μ + 3σ]

99.7% of the population is within THREE standard deviations of the mean
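These percentages can be checked numerically. A minimal sketch, assuming Python's standard-library `statistics.NormalDist` for the area calculation; the helper name `area_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution N(0, 1): distances are already in standard deviations.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 1.96, 2, 3):
    print(f"within {k} standard deviations: {area_within(k):.4f}")
```

Because the areas depend only on the distance from the mean measured in standard deviations, the same proportions hold for any N(μ, σ) population.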


Exercise

JMU’s class of 2000 has verbal SAT scores that follow a normal distribution with a mean of 584 and a standard deviation of 67. Give a range of SAT scores within which 95% of the scores from the class of 2000 will fall.

A: 95% will have scores between 517 & 651

B: 95% will have scores between 450 & 718

C: 95% will have scores between 555 & 613

D: I haven’t got a clue




Important Facts About the Normal Distribution

[Figure: two normal curves (y-axis: probability density), a wide one with standard deviation σ1 and a narrow one with standard deviation σ2, where σ1 > σ2]

The population standard deviation (σ) determines the width of the bell

The larger the standard deviation, the wider the bell.


Notation for the Normal Distribution: N(μ, σ)

[Figure: normal curve labeled "Normal Distribution: N(μ, σ)" (x-axis: possible values in the population; y-axis: probability density)]

μ determines the location of the bell

σ determines the width of the bell

N(μ, σ) refers to a normal distribution model with a mean of μ and a standard deviation of σ:

$$Y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}$$


Example Terminology: The population of Verbal SAT scores for the JMU Class of 2000 follows a N(584,67) distribution

[Figure: histogram of sample data from the class of 2000 (x-axis: score bins from 275 to 299 through 775 to 799; y-axis: # students, 0 to 700) with the N(584,67) curve overlaid]

$$Y = \frac{1}{67\sqrt{2\pi}}\, e^{-\frac{(\mathrm{SAT}-584)^2}{2(67)^2}}$$


Example Terminology: Verbal SAT scores for the JMU Class of 2000 follow a N(584,67) distribution

NOTATION: “Verbal SAT scores ~ N(584,67)” means “The population of verbal SAT scores follows a N(584,67) distribution.”


Match μ and σ to the correct distribution

[Figure: four normal curves (x-axis: possible values in the population, from -2 to 6; y-axis: probability density)]

N(3,0.5)

N(2,1.0)

N(4.5,0.5)

N(3,1.0)



Explain the importance of the normal distribution for modeling populations

State the properties of the normal distribution
• Bell-shaped, symmetric
• Areas under the curve represent proportions of the population
• Bell centered over μ
• Width of bell determined by σ
• N(μ, σ) notation


Applying the Normal Distribution Model

Using the Normal Distribution to Make Statistical Inferences About the Population

Understand Z-scores
• Definition/formula
• Interpretation

Use the table of the Standard Normal (z) Distribution and Z-scores to find areas under the normal curve

Determine what proportion of a normally distributed population falls in a given range

Determine a range of values within which a specified proportion of a normally distributed population will fall


Using the Normal Distribution Model



[Figure: N(μ, σ) curve (y-axis: probability density) with the area between A and B shaded]

By calculating areas under the N(μ, σ) curve, we can predict how often certain measurement values will occur

Area = probability that future observed values will fall between A and B


Example: Average Monthly Ozone Readings at Syowa ~ N(300,40) distribution.

[Figure: N(300,40) distribution model for ozone at Syowa (x-axis: possible ozone readings, monthly average; y-axis: probability density), with two shaded areas]

Area = probability that a randomly chosen month will have an average reading above 335 Dobson units

Area = probability that a randomly chosen month will have an average reading between 230 and 300 Dobson units
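The two shaded areas can be computed from the N(300,40) model. This sketch assumes Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Model for monthly average ozone readings at Syowa, in Dobson units.
ozone = NormalDist(mu=300, sigma=40)

# Area to the right of 335: probability of a monthly average above 335.
p_above_335 = 1.0 - ozone.cdf(335)

# Area between 230 and 300: probability of a monthly average in that range.
p_230_to_300 = ozone.cdf(300) - ozone.cdf(230)

print(f"P(reading > 335)       = {p_above_335:.4f}")
print(f"P(230 < reading < 300) = {p_230_to_300:.4f}")
```

`cdf(x)` returns the area to the left of x, so an upper-tail area is one minus the cdf and an interval is the difference of two cdf values.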


Using Z-scores to Calculate Probabilities from a Normal Distribution

Definition: The Z-score for a value (X) from a normal distribution is equal to that value’s distance from the mean, in standard deviations, i.e.

$$Z = \frac{X - \mu}{\sigma}$$

The Z-score converts the scale of the data from a N(μ, σ) distribution to a N(0,1) distribution (the standard normal distribution)

The larger the magnitude of the Z-score, the further X is from the population mean

Uses: To find areas under the N(μ, σ) curve


Exercise: Syowa Ozone ~ N(300,40). Calculate the Z-scores for these ozone values

Ozone at Syowa    Z-score    Interpretation
260               -1.00      A monthly average of 260 is 1 standard deviation below the mean
390                2.25      A monthly average of 390 is 2.25 standard deviations above the mean
388                2.20      A monthly average of 388 is 2.20 standard deviations above the mean
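The Z-score calculation for the Syowa values can be sketched in a few lines. The function name `z_score` is illustrative.

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean mu, measured in standard deviations sigma."""
    return (x - mu) / sigma


# Syowa ozone ~ N(300, 40)
for reading in (260, 390, 388):
    print(reading, f"{z_score(reading, 300, 40):.2f}")
```

A negative Z-score means the value lies below the mean; a positive one means it lies above.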


Areas under the Standard Normal Curve (see page 714 in the Triola text)

Standard Normal (z) Distribution, Table A-2 (excerpt)

Z      0.00     0.01     0.02     0.03     0.04     0.05     0.06
0.0    0.0000   0.0040   0.0080   0.0120   0.0160   0.0199   0.0239
0.1    0.0398   0.0438   0.0478   0.0517   0.0557   0.0596   0.0636
0.2    0.0793   0.0832   0.0871   0.0910   0.0948   0.0987   0.1026
0.3    0.1179   0.1217   0.1255   0.1293   0.1331   0.1368   0.1406
0.4    0.1554   0.1591   0.1628   0.1664   0.1700   0.1736   0.1772

Table entries are the area under the curve between the mean (Z = 0) and Z.

[Figure: standard normal curve (x-axis from -3 to 3) with the area between 0 and 0.25 shaded; Area = 0.0987]

9.87% of the population lies between the mean and 0.25 standard deviations above the mean (row 0.2, column 0.05)

[Figure: standard normal curve (x-axis from -3 to 3) with the area between 0 and 0.40 shaded; Area = 0.1554]

15.54% of the population is between the mean and 0.40 standard deviations above the mean (row 0.4, column 0.00)

0.1554 - 0.0987 = 0.0567. Hence, 5.67% of the population is between 0.25 and 0.40 standard deviations above the mean
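The table lookups above can be reproduced in software. This is a sketch assuming Python's standard-library `statistics.NormalDist`; the helper `table_area` is an illustrative name that mimics a Table A-2 entry (the area between the mean and Z) by subtracting 0.5 from the area to the left of Z.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # N(0, 1)


def table_area(z):
    """Area under the standard normal curve between the mean (0) and z, for z >= 0."""
    return standard_normal.cdf(z) - 0.5


area_to_025 = table_area(0.25)
area_to_040 = table_area(0.40)
area_between = area_to_040 - area_to_025

print(f"mean to 0.25: {area_to_025:.4f}")
print(f"mean to 0.40: {area_to_040:.4f}")
print(f"0.25 to 0.40: {area_between:.4f}")
```

Subtracting the smaller area from the larger one is the same step as the hand calculation 0.1554 - 0.0987.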


Exercise: Use the table of “Areas Under the Standard Normal Curve” to calculate the proportion of the population falling in the ranges specified.

What fraction of the population will be at most 2 standard deviations above the mean? A: 2.28% B: 97.72% C: 57.93% D: 47.72%

What percent of the population will be at least 2.27 standard deviations above the mean? A: 1.16% B: 98.84% C: 98.82% D: Don’t know

How much of the population will have values somewhere between 1.03 standard deviations below the mean and 1.75 standard deviations above the mean? A: 0.1515% B: 95.99% C: 0.8084% D: 80.84%


Applying Z-scores and Areas Under the Standard Normal Curve

Suppose that it is known that the verbal SAT scores of all H.S. seniors in the U.S. follow a N(500,100) distribution. What fraction of students receive a score of 650 or less?

Hint: Draw a picture of the N(500,100) distribution and shade in the area under the curve that you are interested in. Then convert to Z-scores and find the area.

[Figure: N(500,100) curve (x-axis from 200 to 800) with the area to the left of 650 shaded; the Z-score for SAT = 650 is 1.50]

From Table A-2 in Triola:

Z      Area
1.5    0.4332
1.6    0.4452
1.7    0.4554
1.8    0.4641
1.9    0.4713

With Z = 1.50, the area between the mean and 650 is 0.4332. The area below the mean is 0.5.

Total area = 0.5 + 0.4332 = 0.9332

Hence, 93.3% of H.S. seniors receive a score of 650 or less.
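The same answer can be reached both ways in code: by converting to a Z-score and adding 0.5 to the table-style area, or by asking the N(500,100) model directly. A minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)

# Route 1: convert to a Z-score, then use the standard normal curve.
z = (650 - 500) / 100                                    # 1.50
area_mean_to_z = NormalDist().cdf(z) - 0.5               # table-style area
fraction_via_z = 0.5 + area_mean_to_z                    # add the half below the mean

# Route 2: area to the left of 650 under N(500, 100) directly.
fraction_direct = sat.cdf(650)

print(f"Z = {z:.2f}")
print(f"fraction at or below 650: {fraction_via_z:.4f} / {fraction_direct:.4f}")
```

Both routes give the same fraction because the Z-score only rescales the horizontal axis; it does not change the areas.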


Exercise

What fraction of H.S. seniors get a score of 325 or less on the verbal SAT? A: 95.99% B: 4.01% C: 95.91% D: Don’t know

What percentage of students score above 730 on the verbal SAT? A: 0.99% B: 98.9% C: 48.9% D: 1.07%


Exercise

One of the Big Three automakers sells a four-door sedan with an advertised highway fuel efficiency of 33 mpg. The fact is that the population of autos of this type will average 33 mpg, with some variation around that average. Suppose that the actual efficiency achieved by this population of autos is normally distributed with a mean of 33 mpg and a standard deviation of 6.4 mpg. What proportion of autos of this design will get mileage levels exceeding 40 mpg?

A: 1.09% B: 0.8621% C: 13.79% D: 86.21%


The automaker is considering giving a 95% range of mpg ratings for this automobile. Under this thinking, the sales sticker will display a range of values (centered over the average rating of 33 mpg) within which 95% of autos of this type will fall. What range of values should the automaker use?

[Figure: normal curve with the central area = 0.95 shaded, shown on two scales: Z-score scale (-1.96, 0, 1.96) and mpg scale (20.5, 33, 45.5)]

33 ± 1.96 × 6.4 = 33 ± 12.5, i.e. from 20.5 to 45.5 mpg

The automaker can advertise that 95% of these cars will have fuel efficiency between 20.5 and 45.5 mpg.
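Going from a proportion back to a range of values uses the inverse of the area calculation. The sketch below assumes Python's standard-library `statistics.NormalDist` and its `inv_cdf` method; the variable names are illustrative.

```python
from statistics import NormalDist

mpg = NormalDist(mu=33, sigma=6.4)

# A central 95% range leaves 2.5% of the area in each tail.
z_upper = NormalDist().inv_cdf(0.975)      # approximately 1.96
lower = mpg.mean - z_upper * mpg.stdev
upper = mpg.mean + z_upper * mpg.stdev

print(f"Z cutoffs: ±{z_upper:.2f}")
print(f"95% range: {lower:.1f} to {upper:.1f} mpg")
```

The cutoff 1.96 is the Z-score that leaves an area of 0.025 to its right, which is why it appears on the Z-score scale of the slide.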



Understand Z-scores
• Definition/formula
• Interpretation

Use the table of Areas Under the Standard Normal Curve
• Using Z-scores to find areas under the standard normal curve

Determine what proportion of a normally distributed population falls in a given range

Determine a range of values within which a specified proportion of a normally distributed population will fall


Statistical Inferences About the Population Mean

Using the Normal Probability Model to Make Statistical Inferences



State the requirements for a statistical inference to be valid and apply them to evaluate the validity of a given inference.

Calculate and interpret a (1 - α)% confidence interval for the mean (by hand and by using JMP IN®)
• When the exact value of the population standard deviation (σ) is known
• When the exact value of σ is not known

Use the (1 - α)% confidence interval to make inferences about the population mean

Distinguish between valid and invalid interpretations of the (1 - α)% confidence interval.


Many Studies are Aimed at Making Inferences about the Population Mean

Example #1

Objective of the study: To estimate the average annual income of JMU graduates one year after graduation.

Population: All JMU graduates 1 year after graduation

Population parameter: Mean annual income (in dollars)

Example #2

Objective of the study: To determine the average tensile strength of nylon yarn made from a new manufacturing process.

Population: All nylon yarn made from the new process

Population parameter: Mean tensile strength (in g/cm²)

Example #3

Objective of the study: To determine the rate (in grams/mile) at which the 1998 Ford Taurus engine emits hydrocarbons (under normal driving conditions).

Population: All Ford Taurus automobiles

Population parameter: Mean emissions rate of HC’s (in grams/mile)



Making Statistical Inferences About the Population Mean

Goal

To make a valid statistical inference about the value of μ based on the value of the sample estimate X̄


Two Requirements for a Statistical Inference to be Valid

1. The data come from an unbiased sample of the population
• Samples were randomly selected
• Every subject in the population had an equal chance of being selected

2. The inference accurately states the degree of certainty in the conclusion


Example of a Credible Inference About the Population Mean

In order to determine the distance traveled in one second by a ball rolling down a 5° incline, suppose Galileo made 20 replicate runs and recorded the distance traveled (in meters) in one second for each run. The 20 replicates yielded an average distance of 0.41 meters. In addition, suppose that we know that the recorded distances traveled in one second follow a normal distribution with a standard deviation of 0.06 meters. Based on these data we can state with 95% confidence that the true average distance that a ball will travel in one second on this incline is between 0.38 and 0.44 meters.

Population = All possible distances the ball could travel in 1 second

Population parameter = The mean distance that a ball will travel in 1 second

Sample data = Distances from 20 replicate runs

The first three sentences summarize what happened in the experiment (i.e. 20 reps; avg distance of 0.41 meters) and what we know about this population (i.e. the population ~ N(μ, 0.06)). They DO NOT state a statistical inference.

The last sentence uses the experimental data to generalize the results to the entire population. This is a statistical inference. Notice the statement of the degree of certainty in its truth.


Definition: The Confidence Interval for the Mean

A (1 - α)% confidence interval for the mean is a range of values running from a lower bound to an upper bound within which we can be (1 - α)% confident that the true population mean falls.

Example: If we want to have 95% confidence, we use α = 0.05

Names for α
• Type I error probability
• α-level
• Significance level


Formula for Calculating a (1 - α)% Confidence Interval for the Mean When the Exact Value of σ is Known

Lower Bound (LB):

$$LB = \bar{X} - Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

Upper Bound (UB):

$$UB = \bar{X} + Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

Where n = # of samples and where $Z_{\alpha/2}$ is chosen so that the area under the standard normal curve to the right of $Z_{\alpha/2}$ is α/2


A Return to the Earlier Galileo Example: Calculating a 95% Confidence Interval for the Mean

n = 20 samples (20 replicate runs)

X̄ = 0.41 meters

σ = 0.06 meters

α = 0.05 (because (1 - α) = 0.95)

$$Z_{\alpha/2} = Z_{0.025} = 1.96$$

$$LB = \bar{X} - Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 0.41 - 1.96\,\frac{0.06}{\sqrt{20}} = 0.38$$

$$UB = \bar{X} + Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 0.41 + 1.96\,\frac{0.06}{\sqrt{20}} = 0.44$$

We are 95% certain that the true average distance traveled down the 5° incline in one second is between 0.38 and 0.44 meters.
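The Galileo calculation can be written as a small reusable function. This is a sketch under the slide's assumption that σ is known exactly; the function name `z_confidence_interval` and its parameters are illustrative, and the Z value comes from Python's standard-library `statistics.NormalDist` rather than Table A-2.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (LB, UB) for the population mean when sigma is known exactly."""
    alpha = 1.0 - confidence
    # Z value with an area of alpha/2 to its right under the standard normal curve.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Galileo example: n = 20 runs, sample mean 0.41 m, sigma = 0.06 m.
lb, ub = z_confidence_interval(sample_mean=0.41, sigma=0.06, n=20)
print(f"95% confidence interval: {lb:.2f} to {ub:.2f} meters")
```

Passing a different `confidence` value changes only α and the Z value; the rest of the formula is unchanged.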


Valid Interpretations of the Confidence Interval for the Mean

You must state the range and the confidence level: “We are (1 - α)% confident that the mean is somewhere between LB and UB.”

Valid: “We are 95% confident that the mean distance traveled in 1 second is between 0.38 and 0.44 meters.”

Not valid: “The mean distance traveled in 1 second is between 0.38 and 0.44 meters.”

We CAN CLAIM that the population mean is different from any of the values OUTSIDE the confidence interval, provided we state the confidence level of our claim [(1 - α)% confidence].

Valid: “We can claim with 95% confidence that the mean distance traveled in 1 second is different than 0.48 meters.”

Not valid: “We can claim that the mean distance traveled is different than 0.48 meters.” (need to state the confidence level)

You CANNOT CLAIM with any measurable confidence that the population mean is equal to any specific value INSIDE the confidence interval.

Valid: “The mean distance traveled in one second could be any value in the range from 0.38 to 0.44 meters.”

Not valid: “The mean distance traveled in 1 second will be 0.41 meters.”


Exercise: Calculate a 90% confidence interval for the mean in the Galileo experiment

What is the value of α? A: 90% B: 0.90 C: 5% D: 0.10

What is the value of Z_{α/2}? A: 1.645 B: 1.28 C: 0.90 D: 0.95

What is the value of the lower bound (LB)? A: 0.02 meters B: 0.02 sec C: 0.39 sec D: 0.39 meters

What is the value of the upper bound (UB)? A: 0.43 sec B: 0.02 meters C: 0.43 meters D: 0.41 meters


Confidence Intervals for the Mean When the Population Standard Deviation is Estimated from the Data

The value of σ is unknown
• Introduces more uncertainty in the results
• Cannot use Z-scores

Use “t-scores”
• Larger than corresponding Z-scores
• Lead to wider confidence limits

See Table A-3 “t-Distribution” on page 715 of Triola.


Confidence Intervals for the Mean When the Population Standard Deviation is Estimated from the Data

s = the standard deviation (calculated from the data)
n = number of samples in the data

$$LB = \bar{X} - t_{df,\alpha/2}\,\frac{s}{\sqrt{n}}$$

$$UB = \bar{X} + t_{df,\alpha/2}\,\frac{s}{\sqrt{n}}$$

$t_{df,\alpha/2}$ is from Table A-3, page 715 of Triola

df = “degrees of freedom”
• A measure of how much data you had
• df = n - 1
• If df = 30 or more, then $t_{df,\alpha/2}$ is very close to $Z_{\alpha/2}$


Reading the table of t-values, page 715 of Triola

t-Distribution, Table A-3 (excerpt)

Degrees of    0.005     0.01      0.025     0.05      0.10      0.25      (one tail)
freedom       0.01      0.02      0.05      0.10      0.20      0.50      (two tails)
1             63.657    31.821    12.706    6.314     3.078     1.000
2             9.925     6.965     4.303     2.920     1.886     0.816
3             5.841     4.541     3.182     2.353     1.638     0.765
4             4.604     3.747     2.776     2.132     1.533     0.741
5             4.032     3.365     2.571     2.015     1.476     0.727
6             3.707     3.143     2.447     1.943     1.440     0.718

The value of α refers to the “two tails” value

For an 80% confidence interval, α = 0.20. Use t-values in the 0.20 (two tails) column to calculate 80% confidence intervals

For a 95% confidence interval, α = 0.05. Use t-values in the 0.05 (two tails) column to calculate 95% confidence intervals

df = 6: Use the t-values in this row to calculate confidence intervals whenever you have only 7 data points.


Procedure for Calculating the Confidence Interval for the Mean

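1. Collect the data (n = sample size; representative samples)

2. Calculate the sample mean X̄

3. Determine the value of α

4. Is the value of σ known?

• Yes: Get the value of $Z_{\alpha/2}$ from Table A-2 in Triola, then calculate

$$LB = \bar{X} - Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \qquad UB = \bar{X} + Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

• No: Calculate the sample standard deviation (s), get the value of $t_{n-1,\alpha/2}$ from Table A-3 in Triola, then calculate

$$LB = \bar{X} - t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}} \qquad UB = \bar{X} + t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}}$$

5. STOP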


Example: Calculating a confidence interval when the population standard deviation is estimated from the data

A soft drink bottling plant makes 2-liter bottles of soft drink. Because of slight variations in the soft drink composition and variations in the bottling machine, the actual content of the filled bottles varies from bottle to bottle. Seven filled bottles were sampled and their contents measured. The data for these samples are given below (quantities reported are liters of soft drink found in the bottle). Find a 95% confidence interval for the average content (in liters) for bottles made at the plant.

1.83, 2.02, 1.76, 1.90, 1.95, 2.10, 1.88


Bottling Plant Example

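1 - α = 0.95, so α = 0.05

n = 7 (df = 6)

$t_{df,\alpha/2} = t_{6,\,0.025} = 2.447$

X̄ = 1.92 liters

s = 0.11

$$LB = \bar{X} - t_{df,\alpha/2}\,\frac{s}{\sqrt{n}} = 1.92 - 2.447\,\frac{0.11}{\sqrt{7}} = 1.82 \text{ (liters/bottle)}$$

$$UB = \bar{X} + t_{df,\alpha/2}\,\frac{s}{\sqrt{n}} = 1.92 + 2.447\,\frac{0.11}{\sqrt{7}} = 2.02 \text{ (liters/bottle)}$$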


Calculating a 95% Confidence Interval in JMP IN

Step 1: Enter the data into JMP IN

Step 2: Select Analyze: Distribution of Y

We are 95% confident that the average volume of soft drink for all 2-liter bottles produced at the plant is between 1.81 and 2.03 liters
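The bottling plant interval can also be reproduced from the raw data. This is a sketch, not the JMP IN procedure: it uses Python's standard-library `statistics` module, and the t-value 2.447 is copied from Table A-3 (df = 6, two tails 0.05) because the standard library has no t-distribution. The variable names are illustrative.

```python
import math
import statistics

# Contents (in liters) of the seven sampled bottles.
volumes = [1.83, 2.02, 1.76, 1.90, 1.95, 2.10, 1.88]

n = len(volumes)
x_bar = statistics.mean(volumes)
s = statistics.stdev(volumes)   # sample standard deviation (divides by n - 1)

t_value = 2.447                 # from Table A-3: df = n - 1 = 6, alpha = 0.05 (two tails)
margin = t_value * s / math.sqrt(n)
lb, ub = x_bar - margin, x_bar + margin

print(f"mean = {x_bar:.2f}, s = {s:.3f}")
print(f"95% confidence interval: {lb:.2f} to {ub:.2f} liters")
```

Carrying the unrounded value of s (about 0.115) through the calculation gives 1.81 to 2.03 liters, matching the software result; the hand calculation's 1.82 to 2.02 comes from rounding s to 0.11 first.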



State the requirements for a statistical inference to be valid and apply them to evaluate the validity of a given inference.

Calculate and interpret a (1 - α)% confidence interval for the mean (by hand and by using JMP IN®)
• When the exact value of the population standard deviation (σ) is known
• When the exact value of σ is not known

Use the (1 - α)% confidence interval to make inferences about the population mean

Distinguish between valid and invalid interpretations of the (1 - α)% confidence interval.


Statistical Inferences About the Comparison of Two Population Means

Confidence Interval for the Difference of Two Population Means

[Figure: ozone readings (Dobson units, axis from 200.0 to 600.0) at RESOLUTE and SYOWA]



Calculate and interpret a (1 - α)% confidence interval for the difference of two population means
• By hand
• By using JMP IN®

Use the (1 - α)% confidence interval for the difference of two population means to determine if the two means are different.

Distinguish between valid and invalid statements that interpret the meaning of a (1 - α)% confidence interval for the difference of two population means.


The Goal of Many Studies is to Compare the Means from Two Populations

Example #1

A group of engineers is conducting a study to compare the energy efficiency of two different fuels. The team will burn several replicate samples of each fuel type and measure the energy generated (in calories) for each replicate.

Population #1: The energy efficiencies of all possible samples of fuel type #1

Population #2: The energy efficiencies of all possible samples of fuel type #2

Population parameters: The mean energy efficiency of fuel type #1, compared to the mean energy efficiency of fuel type #2. (μ for fuel type #1 vs μ for fuel type #2)


Example #2

A sociologist compared the value placed on human life by individuals who watch more than 20 hours of TV each week with those who watch less than 5 hours each week. She randomly selected 50 individuals from each category and gave each a test to evaluate that individual’s value for human life.

Population #1: Test scores from all people watching over 20 hours of TV each week.

Population #2: Test scores from all people watching under 5 hours of TV each week.

Population Parameters: The mean test score from the “over 20 hours crowd,” compared to the mean score from the “under 5 hours crowd” (μ_{over 20 hrs} vs. μ_{under 5 hrs})


Example #3

A medical doctor is comparing two different surgical techniques for repairing a torn anterior cruciate ligament (ACL) in the knee. Fifteen randomly selected patients with torn ACLs are treated with the old technique, and fifteen other randomly selected patients are treated with the new technique. The time required for each person to recover 90% of motion in the injured knee is recorded.

Population #1: All possible recovery times under the old method.

Population #2: All possible recovery times under the new method.

Population Parameters: Mean recovery time with the old technique, compared to the mean recovery time with the new technique (μ_{old technique} vs. μ_{new technique})


Making Statistical Inferences About the Comparison of Two Population Means

Assumptions and Requirements

• Each population follows a normal distribution (or something reasonably close to normal)

• The populations have the same standard deviation

• We have a representative sample from each population
  • Samples were randomly selected
  • Every entity in the population had an equal chance of being selected


Exercise

We wish to compare the average height of JMU students who were born in the months of January through June to the average height of JMU students who were born in the months of July through December.

Population #1: All JMU students born in Jan - Jun

Population #2: All JMU students born in Jul - Dec

Population Parameters: Average height in these two populations

Data: Heights of 10 randomly selected students
• 5 born in Jan - Jun
• 5 born in Jul - Dec


Data Sheet for Recording the Height Data

             Born in Jan - Jun    Born in Jul - Dec
(heights)
Mean
Std. Dev.


True or False?

1 The average height of students in our sample who were born in January - June is different from the average height of students in our sample who were born in July - December.

2 The average height of all JMU students born in January - June is different from the average height of all JMU students born in July - December.


True or False?



The 1st statement is true. It is only a summary of our data. It does not generalize the results and apply them to the underlying population parameters.


True or False?



We cannot tell if the 2nd statement is reasonable. It is a statistical inference because it generalizes what we saw in our data and applies it to the underlying population parameters.


Recall: Two Requirements for a Statistical Inference to be Valid

1 The data come from an unbiased sample of each population

2 The inference accurately states the degree of certainty in the conclusion


Definition: Confidence Interval for the Difference of Two Population Means

Same interpretation of α as before. Names for α:
• Type I error probability
• α-level
• Significance level

A (1 - α)% confidence interval for the difference of two population means is a range of values, running from a lower bound to an upper bound, within which we can be (1 - α)% confident that the true difference falls.


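Formula for Calculating a (1 - α)% Confidence Interval for the Difference of Two Population Means

Lower Bound (LB)

$$LB = (\bar{X}_1 - \bar{X}_2) - t_{n_1+n_2-2,\,\alpha/2}\; s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Upper Bound (UB)

$$UB = (\bar{X}_1 - \bar{X}_2) + t_{n_1+n_2-2,\,\alpha/2}\; s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where the pooled standard deviation is

$$s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}$$

• $s_1$ and $s_2$ are the standard deviations of the two groups of sample data

• $n_1$ = sample size from the first population

• $n_2$ = sample size from the second population

• $t_{n_1+n_2-2,\,\alpha/2}$ is from Table A-3 in Triola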


Example

An automobile manufacturing company was trying to determine which type of tire to install on its new models. Six sets of tire brand “A” and six sets of tire brand “B” were installed on 12 new automobiles, and the number of miles of use before 60% of the tread was worn off was measured. The results are given in the table below. Calculate a 95% confidence interval for the difference in average mileage between the two brands of tires.

Tire Mileage Results (miles)

Brand A    Brand B
38610      31027
34840      29827
35793      28812
32833      30064
39477      28524
33752      29500


Calculations for the 95% Confidence Interval

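$\bar{X}_1$ = average mileage from brand A = 35884 miles, $s_1 = 2656.6$ miles, $n_1 = 6$

$\bar{X}_2$ = average mileage from brand B = 29626 miles, $s_2 = 904.24$ miles, $n_2 = 6$

$$s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{5\,(2656.6)^2 + 5\,(904.24)^2}{6 + 6 - 2}} = 1984.3 \text{ miles}$$

95% confidence: $\alpha = 0.05$, $\alpha/2 = 0.025$

$t_{n_1+n_2-2,\,\alpha/2} = t_{10,\,0.025} = 2.228$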


Calculations for the 95% Confidence Interval

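$$\text{Lower Bound} = (\bar{X}_1 - \bar{X}_2) - t_{n_1+n_2-2,\,\alpha/2}\; s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = (35884 - 29626) - 2.228\,(1984.3)\sqrt{\frac{1}{6} + \frac{1}{6}} = 6258 - 2552.5 = 3706 \text{ miles}$$

$$\text{Upper Bound} = (\bar{X}_1 - \bar{X}_2) + t_{n_1+n_2-2,\,\alpha/2}\; s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 6258 + 2552.5 = 8810 \text{ miles}$$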
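The same interval can be rebuilt directly from the twelve mileage readings in the Tire Mileage Results table. The following is a rough sketch in Python, assuming only the standard library; `pooled_ci` is a hypothetical helper name, and the critical value 2.228 is taken from the slide rather than computed.

```python
import math
import statistics

# Mileage readings from the Tire Mileage Results table.
brand_a = [38610, 34840, 35793, 32833, 39477, 33752]
brand_b = [31027, 29827, 28812, 30064, 28524, 29500]

def pooled_ci(x1, x2, t_crit):
    """Pooled-variance CI for mean(x1) - mean(x2).

    Returns (difference, standard error, LB, UB).
    t_crit is the table value t(n1 + n2 - 2, alpha/2).
    """
    n1, n2 = len(x1), len(x2)
    s1, s2 = statistics.stdev(x1), statistics.stdev(x2)  # sample SDs (n - 1 divisor)
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    diff = statistics.mean(x1) - statistics.mean(x2)
    std_err = sp * math.sqrt(1 / n1 + 1 / n2)
    return diff, std_err, diff - t_crit * std_err, diff + t_crit * std_err

diff, std_err, lb, ub = pooled_ci(brand_a, brand_b, t_crit=2.228)
print(f"difference = {diff:.1f}, LB = {lb:.0f}, UB = {ub:.0f}")
```

Because the hand calculation rounds the intermediate values, the bounds from the script may differ from the slide's figures by about a mile.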


95% Confidence Interval for Tire Tread Example

We are 95% confident that Brand A will average between 3706 and 8810 more miles than Brand B before using up 60% of its tread.


Using JMP IN to Calculate the 95% confidence interval for the difference of two population means

1 Create two columns in JMP IN®

One column containing all the measurements

A second column indicating which population each measurement comes from

Make sure the second column is set to “N” (for “nominal”)


2 Select Analyze: Fit Y by X
• Place the measured value on the Y-axis
• Place the population ID on the X-axis

[Figure: JMP IN plot “Mileage By Brand”, with mileage (27500 to 40000) on the Y-axis and Brand (Brand A, Brand B) on the X-axis]


3 Select Means: Anova/t-test under the Analysis button

t-Test (assuming equal variances)

             Estimate    Std Error    Lower 95%    Upper 95%
Difference   6258.47     1145.65      3705.77      8811.16

t-Test       5.463
DF           10
Prob>|t|     0.0003

The Lower 95% and Upper 95% columns give the lower and upper bounds for the 95% confidence interval.


Valid Interpretations of the Confidence Interval for the Difference between Two Population Means

You must state the range and the confidence level.
“We are (1 - α)% confident that the difference μ1 - μ2 is somewhere between LB and UB.”

Valid: “We are 95% confident that the mean mileage difference (brand A - brand B) is between 3706 and 8810 miles.”

Not valid: “The mean difference in mileage between brand A and brand B is 6258 miles.”


Valid Interpretations of the Confidence Interval for the Difference between Two Population Means

Valid: “Since the confidence interval for the difference in average mileage does not include zero, we can claim with 95% confidence that the average mileage for brand A is different from the average mileage for brand B.”

We CAN CLAIM with (1 - α)% confidence that the means are different if the confidence interval does not include the value “0.”



Calculate and interpret a (1 - α)% confidence interval for the difference of two population means
• By hand
• By using JMP IN®

How to use the (1 - α)% confidence interval for the difference of two population means to determine if the two means are different.

How to distinguish between valid and invalid statements that interpret the meaning of a (1 - α)% confidence interval for the difference of two population means.


Introduction to Statistical Hypothesis Testing

The means are equal?

The means are not equal?



Define the two types of statistical hypotheses and how each is used
• Null hypothesis
• Alternative hypothesis

Given a description of a hypothesis testing problem, determine null and alternative hypotheses both in words and in equation form

Use confidence intervals to do the following hypothesis tests
• Test whether the population mean is different from a specified value
• Test whether two populations have the same mean value

Describe the two types of errors that can occur in hypothesis testing

Describe in words what each type of error would consist of in a given real-life problem.


Definition: Hypothesis

In statistical testing, a hypothesis is a claim or statement about a property of a population

Example hypotheses
• Medical researchers claim that the average body temperature of healthy adults is actually not equal to 98.6°F (i.e., μ ≠ 98.6).
• The average GPA of JMU students who study 8 hours or more per week is higher than the average GPA for JMU students who do not (i.e., μ_{at least 8 hrs/week} > μ_{less than 8 hrs/week}).

Many hypotheses of interest assert something about the value of the mean of one or more populations.


Difference between a hypothesis and a statistical inference

Both assert something about a population parameter

A hypothesis is an assertion that will be tested by the data (it’s made “before looking at the data”)

A statistical inference can look just like a hypothesis, except that it is a conclusion that is made based on a look at the data

The process of making a statistical inference can be thought of as using the data to “test” which hypothesis is more reasonable


Two Types of Hypothesis

The alternative hypothesis (H1)
• States that the population parameter is different from some specified value
• Also referred to as the research hypothesis
• Usually is the claim that we really wish to evaluate (and often that we hope is true)

The null hypothesis (H0)
• States the opposite claim to the alternative hypothesis
• Usually states that the population parameter is NOT different from a specified value (or that the parameters are NOT different from each other)


Two Cases Considered in this lecture

Case 1: Testing hypotheses about the value of a single population mean
• Referred to as the single population case

Case 2: Testing hypotheses about how the means of two different populations compare
• Referred to as the two population case


The single population case: Testing if the population mean is equal to a specified value

Goal: To determine if the population mean differs from a specified value

Data: A random sample of n measurements from a single population

Examples
• Determine if the average starting salary of JMU graduates differs from the nationwide mean of $30,000/year.
• Determine if the average shelf life of a new battery exceeds 12 months.


Single Population Case

Null Hypothesis
• Asserts that the population mean IS NOT DIFFERENT from the specified value
• Notation: H0: μ = μ0, where μ is the population mean and μ0 is the specified value we are comparing against.

Alternative Hypothesis
• Asserts that the population mean IS DIFFERENT from the specified value
• Notation: H1: μ ≠ μ0


Single Population Case

Method for testing the two hypotheses
Construct a (1 - α)% confidence interval for the mean. If μ0 is outside the interval, we can conclude with (1 - α)% confidence that H1 is true. Otherwise, we say that we “cannot reject” H0.


Example: Starting Salary of JMU Graduates

A study was conducted to determine if the average starting salary of JMU graduates is different from the national average of $30,000 for college graduates. Twenty-five JMU graduates were randomly selected and their starting salaries recorded. The average starting salary in the sample was $33,796. A 99% confidence interval for the average runs from $29,297 to $38,296.

Null Hypothesis
In words, H0: The average starting salary of JMU graduates is equal to $30,000
In equation form, H0: μ_{JMU graduates} = $30,000

Alternative Hypothesis
In words, H1: The average starting salary of JMU graduates is not equal to $30,000
In equation form, H1: μ_{JMU graduates} ≠ $30,000
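The single-population method can be written as a one-line decision rule. The sketch below is in Python and is only illustrative; `ci_test` is a hypothetical helper, not something from JMP IN. It applies the rule to the 99% interval quoted in the starting salary example.

```python
def ci_test(lb, ub, mu0):
    """Apply the confidence-interval rule for H0: mu = mu0.

    Reject H0 only when the specified value mu0 lies outside [lb, ub].
    """
    if mu0 < lb or mu0 > ub:
        return "reject H0"
    return "cannot reject H0"

# 99% confidence interval and benchmark from the starting salary example.
decision = ci_test(lb=29297, ub=38296, mu0=30000)
print(decision)
```

Since $30,000 lies inside the interval from $29,297 to $38,296, the rule returns “cannot reject H0” at the 99% confidence level, even though the sample average of $33,796 is well above $30,000.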


Example

An electrical firm conducted a study to determine if the average usable lifetime of a new light bulb design differs from the 800-hour average associated with the old design. Thirty-six of the new design bulbs were randomly selected and used until failure. The average lifetime of the 36 bulbs was 902 hours and the standard deviation was 30 hours. Can we conclude with 95% confidence that the new bulbs differ from the standard?

Null Hypothesis
In words, H0: The average lifetime is not different from 800 hours
Using equations, H0: μ_{new light bulbs} = 800

Alternative Hypothesis
In words, H1: The average lifetime differs from 800 hours
Using equations, H1: μ_{new light bulbs} ≠ 800


Light Bulb Exercise (continued)

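95% confidence interval

$$LB = \bar{X} - t_{df,\,\alpha/2}\,\frac{s}{\sqrt{n}} = 902 - t_{35,\,0.025}\,\frac{30}{\sqrt{36}} \approx 902 - 1.96\,(5) = 902 - 9.8 \approx 892 \text{ hours}$$

$$UB = \bar{X} + t_{df,\,\alpha/2}\,\frac{s}{\sqrt{n}} \approx 902 + 1.96\,(5) = 902 + 9.8 \approx 912 \text{ hours}$$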


Light Bulb Exercise (continued)

Since the benchmark value of 800 hours is not included in the 95% confidence interval, we can reject the null hypothesis and conclude with 95% confidence that the average life of the new bulbs is different from 800 hours.


Possible Outcomes of Hypothesis Testing

Null Hypotheses?

Alpha and Beta Errors?


Possible Outcomes: The Risks in Hypothesis Testing

What is concluded (rows): Reject null hypothesis, or do not reject null hypothesis
• This is the result of the experiment, data analysis and hypothesis test.

What is actually true (columns): Null hypothesis, or alternative hypothesis
• This is seldom known with certainty.


Possible Outcomes: The Risks in Hypothesis Testing

                                 What is Actually True
What is concluded                Null hypothesis                  Alternative hypothesis
Reject null hypothesis           Type I error (α-value gives      Correct decision
                                 probability of such an error)
Do not reject null hypothesis    Correct decision                 Type II error (probability
                                                                  not given here)
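The statement that α gives the probability of a Type I error can be illustrated by simulation. The sketch below is in Python and rests on assumptions that are not in the lecture: the population is exactly normal, and the values `mu0=2.0` and `sigma=0.1` are invented for illustration. Each simulated study draws n = 7 measurements from a population where H0 is really true, builds a 95% interval with the table value 2.447, and counts how often H0 is wrongly rejected.

```python
import math
import random
import statistics

def type_i_error_rate(mu0, sigma, n, t_crit, trials, seed=0):
    """Fraction of simulated studies that reject H0 even though H0 is true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        # Hypothetical normal population whose true mean equals mu0.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        xbar = statistics.mean(sample)
        margin = t_crit * statistics.stdev(sample) / math.sqrt(n)
        if mu0 < xbar - margin or mu0 > xbar + margin:
            rejections += 1  # interval missed mu0: a Type I error
    return rejections / trials

rate = type_i_error_rate(mu0=2.0, sigma=0.1, n=7, t_crit=2.447, trials=10000)
print(f"simulated Type I error rate: {rate:.3f}")
```

With a 95% interval, the simulated rate should land close to α = 0.05. The Type II error rate is not simulated here because it depends on how far the true mean is from the specified value.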


Exercise

State in your own words what a Type I and a Type II error would consist of with the light bulb example.

Type I error: A Type I error would consist of concluding that the average lifetime of the new bulb is different from 800 hours when in fact it is not different.

Type II error: A Type II error would consist of concluding that the average lifetime of the new bulb is NOT different from 800 hours when in fact it really is different.


Two Population Case: Testing if the means of two populations differ from one another

Goal: To determine if the means from two populations differ from each other.

Data: A random sample of n1 measurements from one population and n2 measurements from the other population

Examples
• Determine if the average weight loss under one diet/exercise plan differs from the average weight loss under another plan.
• Determine if the average nicotine content from one brand of cigarettes is different from the average nicotine content from another brand.


Two Population Case

Null Hypothesis
• Asserts that the two population means ARE NOT DIFFERENT
• Notation: H0: μ1 = μ2, where μ1 is the mean of the first population and μ2 is the mean of the second population.

Alternative Hypothesis
• Asserts that the two population means ARE DIFFERENT
• Notation: H1: μ1 ≠ μ2


Two Population Case

Method of testing the two hypotheses
Construct a (1 - α)% confidence interval for the difference of the two means. If the interval does not include the value zero, then we can conclude with (1 - α)% confidence that H1 is true. Otherwise, we say that we “cannot reject” H0.


Example: Two Population Case

A study was made to compare the effects of two different weight lifting programs on overall strength improvement. Ten randomly selected individuals were assigned to use program A, and eight were assigned to use program B. After twelve weeks, the gain in strength was measured on each individual (expressed as the change in the maximum number of pounds that the individual could bench press). The group using program A showed an average increase of 55 pounds, with a standard deviation of 12 pounds. The group using program B showed an average increase of 40 pounds, with a standard deviation of 14 pounds. Perform a statistical test to determine if the average strength gain differs between the two programs. Use a confidence level of 99%.


Weight training example (continued)

Null Hypothesis
In words, H0: The average increase in bench press pounds is the same for both programs
Using equations, H0: μ_{program A} = μ_{program B}

Alternative Hypothesis
In words, H1: The average increase in bench press pounds differs between the two programs
Using equations, H1: μ_{program A} ≠ μ_{program B}

$$s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{9\,(12)^2 + 7\,(14)^2}{10 + 8 - 2}} = 12.9 \text{ pounds}$$


Weight Training Exercise (continued)

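99% confidence interval for the difference

$$\text{Lower Bound} = (\bar{X}_1 - \bar{X}_2) - t_{n_1+n_2-2,\,\alpha/2}\; s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = (55 - 40) - 2.921\,(12.9)\sqrt{\frac{1}{10} + \frac{1}{8}} = 15 - 18 = -3 \text{ pounds}$$

$$\text{Upper Bound} = (\bar{X}_1 - \bar{X}_2) + t_{n_1+n_2-2,\,\alpha/2}\; s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 15 + 18 = 33 \text{ pounds}$$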
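When only summary statistics are reported, as in the weight training example, the same calculation can be sketched from the means, standard deviations and sample sizes. The Python below is an illustration only; `pooled_ci_from_summary` is a hypothetical name, and the critical value t(16, 0.005) = 2.921 is supplied by hand from a t table.

```python
import math

def pooled_ci_from_summary(xbar1, s1, n1, xbar2, s2, n2, t_crit):
    """Return (sp, LB, UB) for mu1 - mu2 using the pooled standard deviation."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    margin = t_crit * sp * math.sqrt(1 / n1 + 1 / n2)
    diff = xbar1 - xbar2
    return sp, diff - margin, diff + margin

# Program A: mean 55, SD 12, n = 10. Program B: mean 40, SD 14, n = 8.
sp, lb, ub = pooled_ci_from_summary(55, 12, 10, 40, 14, 8, t_crit=2.921)

# The two population rule: H0 is rejected only if zero is outside the interval.
includes_zero = lb <= 0 <= ub
print(f"sp = {sp:.1f}, CI = ({lb:.0f}, {ub:.0f}), includes zero: {includes_zero}")
```

The interval straddles zero, so under the two population rule H0 cannot be rejected at the 99% level.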


Weight Training Exercise (continued)

We are 99% confident that, over a 12-week period, the two programs will lead to average strength gains that differ by somewhere between -3 and 33 pounds.

Since the 99% confidence interval includes the value “0,” we cannot conclude with 99% confidence that the two programs result in different average gains in strength.



Define the two types of statistical hypotheses and how each is used
• Null hypothesis
• Alternative hypothesis

Given a description of a hypothesis testing problem, determine null and alternative hypotheses both in words and in equation form

Use confidence intervals to do the following hypothesis tests
• Test whether the population mean is different from a specified value
• Test whether two populations have the same mean value

Describe the two types of errors that can occur in hypothesis testing

Describe in words what each type of error would consist of in a given real-life problem.
