Statistics. Be able to state the null and alternative hypotheses for testing the difference between two population proportions. Know how to examine

COMPARING TWO PROPORTIONS

Statistics

WHAT YOU WILL LEARN

Be able to state the null and alternative hypotheses for testing the difference between two population proportions.

Know how to examine your data for violations of conditions that would make inference about the difference between the two population proportions unwise or invalid.

Understand that the formula for the standard error of the difference between two independent sample proportions is based on the principle that when finding the sum or difference of two independent random variables, their variances add.

TERMS

Variances of independent random variables add— The variance of a sum or difference of independent random variables is the sum of the variances of those variables.

TERMS

Sampling distribution— The sampling distribution of $\hat{p}_1 - \hat{p}_2$ is, under appropriate assumptions, modeled by a Normal model with mean $\mu = p_1 - p_2$ and standard deviation

$$SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$

TERMS

Two-proportion z-interval— A two-proportion z-interval gives a confidence interval for the true difference in proportions, $p_1 - p_2$, in two independent groups.

The confidence interval is

$$(\hat{p}_1 - \hat{p}_2) \pm z^* \times SE(\hat{p}_1 - \hat{p}_2),$$

where $z^*$ is a critical value from the standard Normal model corresponding to a specified confidence level.

TERMS

Pooling— When we have data from different sources that we believe are homogeneous, we can get a better estimate of the common proportion and its standard deviation. We can combine, or pool, the data into a single group for the purpose of estimating the common proportion. The resulting pooled standard error is based on more data and is thus more reliable (if the null hypothesis is true and the groups are truly homogeneous).

TERMS

Two-proportion z-test— Test the null hypothesis $H_0: p_1 - p_2 = 0$ by referring the statistic

$$z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{pooled}(\hat{p}_1 - \hat{p}_2)}$$

to a standard Normal model.

EXAMPLE

Who do you think is more intelligent, men or women?

A Gallup poll asked 520 women and 506 men.

28% of the men thought men were more intelligent.

14% of the women thought men were more intelligent.

Questions comparing two percentages are much more common than questions about isolated percentages.

Example– Treatment is better than placebo control.

Example– This year’s results are better than last year’s.

COMPARING TWO PROPORTIONS

We know the difference between the two proportions of the random sample is 14%, but what is the true difference?

We would like to find the true difference and the margin of error.

For this we need to determine the standard deviation of the sampling distribution model for the difference in the proportions.


Remember– The variance of the sum or difference of two independent random variables is the sum of their variances. (Chapter 16).

Why will this work?


How does this work? Consider grabbing a box of cereal.

It claims there are 16 ounces in the box.

We know that this is not exact because there is some variance from box to box.

When you pour 2 ounces of cereal in a bowl, there will be further variance from bowl to bowl.

How much cereal is left in the box?


According to our rule, the variance of the amount of cereal left in the box is the sum of the two variances.

We need the standard deviation, not the variance, so we take the square root of the variance.


Here are the formulas.

$$Var(X - Y) = Var(X) + Var(Y)$$

$$SD(X - Y) = \sqrt{SD^2(X) + SD^2(Y)} = \sqrt{Var(X) + Var(Y)}$$

This formula applies only when X and Y are independent.
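A quick simulation can make the rule concrete. The sketch below assumes Normal variation with made-up spreads (SD 0.2 oz for boxes, SD 0.1 oz for poured bowls; neither figure comes from the slides) and checks that the variance of the cereal left in the box is close to the sum of the two variances.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical spreads, not from the slides:
# boxes vary with SD 0.2 oz, poured bowls vary with SD 0.1 oz.
BOX_MEAN, BOX_SD = 16.0, 0.2
BOWL_MEAN, BOWL_SD = 2.0, 0.1

n = 100_000
boxes = [random.gauss(BOX_MEAN, BOX_SD) for _ in range(n)]
bowls = [random.gauss(BOWL_MEAN, BOWL_SD) for _ in range(n)]
left = [x - y for x, y in zip(boxes, bowls)]  # cereal left in each box

var_left = statistics.variance(left)   # simulated Var(X - Y)
var_sum = BOX_SD**2 + BOWL_SD**2       # Var(X) + Var(Y)
sd_left = var_left ** 0.5              # simulated SD(X - Y)

print(f"simulated Var(X - Y): {var_left:.4f}")
print(f"Var(X) + Var(Y):      {var_sum:.4f}")
print(f"simulated SD(X - Y):  {sd_left:.4f}")
```

Even though the bowl is subtracted from the box, the spread of what is left is larger than the spread of either amount alone, which is why the variances add rather than subtract.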


The samples can have different sizes and different proportion values.

We use subscripts to keep the different values straight.

In comparing males and females, we could use the subscripts of M and F or 1 and 2.


The standard deviations of the sample proportions are:

$$SD(\hat{p}_1) = \sqrt{\frac{p_1 q_1}{n_1}}$$

$$SD(\hat{p}_2) = \sqrt{\frac{p_2 q_2}{n_2}}$$


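The variance of the difference in the proportions is:

$$Var(\hat{p}_1 - \hat{p}_2) = \left(\sqrt{\frac{p_1 q_1}{n_1}}\right)^2 + \left(\sqrt{\frac{p_2 q_2}{n_2}}\right)^2 = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}$$

The standard deviation is:

$$SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$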


Since we usually don’t know the true values of p1 and p2, we use the sample proportions from the data we are given.

We use them to estimate the variances and find the standard error.

$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$
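As a minimal sketch, the standard error formula can be applied to the Gallup percentages from the opening example (28% of 506 men, 14% of 520 women). The function name `se_diff` is an illustrative choice, not something from the slides.

```python
from math import sqrt

def se_diff(p1_hat: float, n1: int, p2_hat: float, n2: int) -> float:
    """Standard error of p1_hat - p2_hat for two independent samples."""
    q1_hat = 1 - p1_hat
    q2_hat = 1 - p2_hat
    return sqrt(p1_hat * q1_hat / n1 + p2_hat * q2_hat / n2)

# Gallup example: 28% of 506 men and 14% of 520 women
se = se_diff(0.28, 506, 0.14, 520)
print(f"SE = {se:.4f}")
```

For these values the standard error comes out to about 0.025, or roughly 2.5 percentage points.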

INDEPENDENCE ASSUMPTIONS

Within each group the data should be based on results for independent individuals.

Randomization Condition– The data in each group should be drawn independently and at random from a homogeneous population or generated by a randomized comparative experiment.

The 10% Condition— If the data are sampled without replacement, the sample should not exceed 10% of the population.

INDEPENDENCE ASSUMPTIONS

Since we are comparing two groups, we need to add the Independent Groups Assumption.

This is the most important assumption.

Independent Groups Assumption— The two groups we are comparing must also be independent of each other. Usually, the independence of the groups from each other is evident in the way the data were collected.

SAMPLE SIZE CONDITION

Each of the groups must be big enough.

Success/Failure Condition— Both groups are big enough that at least 10 successes and at least 10 failures have been observed in each.

SAMPLING DISTRIBUTION

The sampling distribution model for a difference between two independent proportions.

Provided that the sampled values are independent, the samples are independent, and the sample sizes are large enough, the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is modeled by a Normal model with mean $\mu = p_1 - p_2$ and standard deviation

$$SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$

SAMPLING DISTRIBUTION

If we have the sampling distribution model and the standard deviation, we have what we need to find the margin of error for the differences in proportions.

SAMPLING DISTRIBUTION

Two-proportion z-interval— When the conditions are met, we are ready to find the confidence interval for the difference of two proportions, $p_1 - p_2$. The confidence interval is

$$(\hat{p}_1 - \hat{p}_2) \pm z^* \times SE(\hat{p}_1 - \hat{p}_2)$$

where we find the standard error of the difference,

$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

from the observed proportions.

The critical value $z^*$ depends on the particular confidence level, C, that you specify.
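The interval can be sketched in a few lines of Python. This is an illustration under the conditions above, not a reference implementation: the function name `two_prop_z_interval` is made up here, and the critical value is taken from the standard library's `statistics.NormalDist`. The data are the Gallup percentages from the opening example.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z_interval(p1_hat, n1, p2_hat, n2, confidence=0.95):
    """Confidence interval for p1 - p2 from two independent samples."""
    # z* leaves (1 - C)/2 in each tail of the standard Normal model
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    margin = z_star * se
    return diff - margin, diff + margin

# Gallup example: men (28% of 506) minus women (14% of 520)
low, high = two_prop_z_interval(0.28, 506, 0.14, 520)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

With these inputs the interval runs from about 9 to about 19 percentage points, so the observed 14% difference carries a margin of error of roughly 5 points.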

POOLING

Consider this example—

The National Sleep Foundation asked a random sample of 1010 U.S. adults questions about their sleep habits. The study ensured that there was an equal number of men and women.

The question about snoring had 995 respondents; 37% of adults reported that they snored at least a few nights a week during the past year.

26% of the 184 people under 30 snored, compared with 39% of the 811 in the older group.

Can the difference really be 13%, or is it due to the natural fluctuations in the sample that was chosen?

POOLING

This type of question uses a hypothesis test.

What would be the null hypothesis? $H_0: p_1 - p_2 = 0$ or $H_0: p_1 = p_2$

What would be the alternative hypothesis? $H_A: p_1 \neq p_2$

POOLING

The hypothesis is about a new parameter– the difference in proportions.

We need to find the standard error for that.

But we can actually do better than this standard error:

$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

POOLING

The proportions and the standard deviations are linked.

There are two proportions in the standard error formula, but look at the null hypothesis.

It claims the proportions are equal.

To test the hypothesis, we assume that the null hypothesis is true.

This means that there is a single value for $\hat{p}$ in the SE formula.

POOLING

How can we do this?

If the null hypothesis is true, then among all adults the two groups have the same proportion.

We saw 48 + 318 = 366 snorers out of a total of 184 + 811 = 995 adults who responded to the question.

The overall proportion of snorers was 366/995 = 0.3678.

POOLING

Pooling– Combining the counts to get an overall proportion.

Whenever we have data from different sources or different groups but we believe that they really came from the same underlying population, we can pool them to get better estimates.

$$\hat{p}_{pooled} = \frac{Success_1 + Success_2}{n_1 + n_2}$$

POOLING

When we have only proportions and not the counts, as in the snoring example, we have to reconstruct the number of successes by multiplying the sample sizes by the proportions.

If these calculations don’t come out to whole numbers, round first.

There must have been a whole number of successes to begin with. (This is the only time you round in the middle of a calculation.)

$$Success_1 = n_1 \hat{p}_1 \quad \text{and} \quad Success_2 = n_2 \hat{p}_2$$

POOLING

We can then put the pooled value into the formula, substituting it for both sample proportions in the standard error formula.

$$SE_{pooled}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_{pooled}\,\hat{q}_{pooled}}{n_1} + \frac{\hat{p}_{pooled}\,\hat{q}_{pooled}}{n_2}}$$

POOLING

Snoring--

$$SE_{pooled}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{0.3678 \times (1 - 0.3678)}{184} + \frac{0.3678 \times (1 - 0.3678)}{811}} = 0.039375$$
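The slides stop at the pooled standard error. As a sketch of how the rest of the test could go, the code below applies the two-proportion z-test from the TERMS section to the snoring counts (48 of 184 and 318 of 811). The function name `two_prop_z_test` and the two-sided P-value step are illustrative additions, chosen to match the alternative hypothesis stated earlier.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z_test(successes1, n1, successes2, n2):
    """Two-sided two-proportion z-test of H0: p1 - p2 = 0 using the pooled SE."""
    p1_hat = successes1 / n1
    p2_hat = successes2 / n2
    # Pool the counts: under H0 both groups share one proportion
    p_pooled = (successes1 + successes2) / (n1 + n2)
    q_pooled = 1 - p_pooled
    se_pooled = sqrt(p_pooled * q_pooled / n1 + p_pooled * q_pooled / n2)
    z = (p1_hat - p2_hat) / se_pooled
    p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided, for HA: p1 != p2
    return p_pooled, se_pooled, z, p_value

# Snoring counts from the slides: 48 of 184 under 30, 318 of 811 in the older group
p_pooled, se_pooled, z, p_value = two_prop_z_test(48, 184, 318, 811)
print(f"pooled proportion = {p_pooled:.4f}")
print(f"pooled SE         = {se_pooled:.4f}")
print(f"z                 = {z:.2f}")
print(f"two-sided P-value = {p_value:.4f}")
```

The z-statistic is a little over 3 standard errors below zero, so a difference this large would be very unlikely if the two age groups really snored at the same rate.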

EXAMPLE-- #1 PAGE 507

A presidential candidate fears he has a problem with women voters. His campaign staff plans to run a poll to assess the situation. They’ll randomly sample 300 men and 300 women, asking if they have a favorable impression of the candidate. Obviously, the staff can’t know this, but suppose the candidate has a positive image with 59% of males but with only 53% of females.


What kind of sampling design is his staff planning to use?

This is a stratified random sample, stratified by gender.


What difference would you expect the poll to show?

We would expect the difference in proportions in the sample to be the same as the difference in proportions in the population, with the percentage of the respondents with a favorable impression of the candidate 6% higher among males.


Of course, sampling error means the poll won’t reflect the difference perfectly. What’s the standard error for the difference in the proportions?

The standard deviation of the difference in proportions is:

$$\sigma(\hat{p}_M - \hat{p}_F) = \sqrt{\frac{p_M q_M}{n_M} + \frac{p_F q_F}{n_F}} = \sqrt{\frac{(0.59)(0.41)}{300} + \frac{(0.53)(0.47)}{300}} \approx 0.04 = 4\%$$

EXAMPLE-- #1 PAGE 507

Sketch a sampling model for the size difference in proportions of men and women with favorable impressions of this candidate that might appear in a poll like this.

[Figure: Normal model for the difference in proportion with favorable impression (Male – Female), centered at 6%, with axis marks at -6%, -2%, 2%, 6%, 10%, 14%, and 18% and the 68%, 95%, and 99.7% regions marked.]


Could the campaign be misled by the poll, concluding that there really is no gender gap? Explain.

The campaign could certainly be misled by the poll. According to the model, a poll showing little difference could occur relatively frequently. That result is only 1.5 standard deviations below the expected difference in proportions.
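To put a rough number on "relatively frequently", the sketch below uses the Normal model from this example to estimate how often a poll of this size would show no gap at all, or a gap in the other direction. The variable names are illustrative, and the calculation keeps the unrounded standard deviation rather than the 4% used in the answer.

```python
from math import sqrt
from statistics import NormalDist

# Values supposed in the example: 59% of men, 53% of women, 300 sampled from each
p_m, p_f, n_m, n_f = 0.59, 0.53, 300, 300

mean_diff = p_m - p_f
sd_diff = sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)

# Normal sampling model for the difference in sample proportions
model = NormalDist(mu=mean_diff, sigma=sd_diff)
z_at_zero = (0 - mean_diff) / sd_diff
prob_no_gap = model.cdf(0)  # P(sample difference <= 0)

print(f"SD of the difference    = {sd_diff:.4f}")
print(f"z for a difference of 0 = {z_at_zero:.2f}")
print(f"P(difference <= 0)      = {prob_no_gap:.3f}")
```

With the unrounded standard deviation, a difference of zero sits just under 1.5 standard deviations below the expected 6%, and the model gives roughly a 7% chance that the poll shows no advantage among men.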


In October 2000 the U.S. Department of Commerce reported the results of a large-scale survey on high school graduation. Researchers contacted more than 25,000 Americans aged 24 years to see if they had finished high school; 84.9% of the 12,460 males and 88.1% of the 12,678 females indicated that they had high school diplomas.


Are the assumptions and conditions necessary for inference satisfied? Explain.

Randomization condition— Assume that the samples are representative of all recent graduates.

10% condition— Although large, the samples are less than 10% of all graduates.

Independent samples condition— The sample of men and the sample of women were drawn independently of each other.

Success/Failure condition— The samples are very large, certainly large enough for the methods of inference to be used.


Create a 95% confidence interval for the difference in graduation rates between males and females.

$$(\hat{p}_F - \hat{p}_M) \pm z^* \sqrt{\frac{\hat{p}_F \hat{q}_F}{n_F} + \frac{\hat{p}_M \hat{q}_M}{n_M}}$$

$$= (0.881 - 0.849) \pm 1.960 \sqrt{\frac{(0.881)(0.119)}{12{,}678} + \frac{(0.849)(0.151)}{12{,}460}} = (0.024, 0.040)$$
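The intermediate arithmetic behind this interval can be written out step by step. The rounded standard error and margin of error below are computed here from the sample values in the problem (with 12,678 females, as the problem states); they do not appear in the original slides.

```latex
\begin{aligned}
SE(\hat{p}_F - \hat{p}_M) &= \sqrt{\frac{(0.881)(0.119)}{12{,}678} + \frac{(0.849)(0.151)}{12{,}460}} \approx 0.0043 \\
ME &= z^* \times SE \approx 1.960 \times 0.0043 \approx 0.0084 \\
(\hat{p}_F - \hat{p}_M) \pm ME &= 0.032 \pm 0.0084 \approx (0.024,\ 0.040)
\end{aligned}
```

Because both samples are so large, the margin of error is under one percentage point.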


Interpret your confidence interval.

We are 95% confident that the proportion of 24-year-old American women who have graduated from high school is between 2.4% and 4.0% higher than the proportion of American men the same age who have graduated from high school.


Does this provide strong evidence that girls are more likely than boys to complete high school? Explain.

Since the interval for the difference in proportions of high school graduates does not contain 0, there is strong evidence that women are more likely than men to complete high school.

EXAMPLE– #6 PAGE 508

The painful wrist condition called carpal tunnel syndrome can be treated with surgery or less invasive wrist splints. In September 2002, Time magazine reported on a study of 176 patients. Among the half that had surgery, 80% showed improvement after three months, but only 54% of those who used the wrist splints improved.


What’s the standard error of the difference in the two proportions?

$$SE(\hat{p}_{surg} - \hat{p}_{splint}) = \sqrt{\frac{\hat{p}_{surg}\,\hat{q}_{surg}}{n_{surg}} + \frac{\hat{p}_{splint}\,\hat{q}_{splint}}{n_{splint}}}$$

$$= \sqrt{\frac{(0.80)(0.20)}{88} + \frac{(0.54)(0.46)}{88}} = 0.068$$

EXAMPLE– #6 PAGE 508

Construct a 95% confidence interval for this difference.

Randomization condition– It’s not clear whether or not this study was an experiment. If so, assume that the subjects were randomly allocated to treatment groups. If not, assume that the subjects are representative of all carpal tunnel sufferers.

10% condition— 88 subjects in each group are less than 10% of all carpal tunnel sufferers.

Independent samples condition— The improvement rates of the two groups are not related.

Success/Failure condition--

$$n\hat{p}_{surg} = (88)(0.80) \approx 70; \quad n\hat{q}_{surg} = (88)(0.20) \approx 18$$

$$n\hat{p}_{splint} = (88)(0.54) \approx 48; \quad n\hat{q}_{splint} = (88)(0.46) \approx 40$$

All are greater than 10, so the samples are large enough.



Since the conditions have been satisfied, we will find a two-proportion z-interval.

$$(\hat{p}_{surg} - \hat{p}_{splint}) \pm z^* \sqrt{\frac{\hat{p}_{surg}\,\hat{q}_{surg}}{n_{surg}} + \frac{\hat{p}_{splint}\,\hat{q}_{splint}}{n_{splint}}}$$

$$= (0.80 - 0.54) \pm 1.960 \sqrt{\frac{(0.80)(0.20)}{88} + \frac{(0.54)(0.46)}{88}} = (0.126, 0.394)$$
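The whole carpal tunnel calculation, from the Success/Failure check to the interval, can be reproduced with a short script. This is a sketch using only the figures reported in the example; the helper name `success_failure_counts` is hypothetical, and the counts are rebuilt by rounding because only percentages were reported.

```python
from math import sqrt
from statistics import NormalDist

def success_failure_counts(p_hat, n):
    """Rounded counts of successes and failures implied by a reported proportion."""
    successes = round(n * p_hat)
    return successes, n - successes

n_surg = n_splint = 88           # half of the 176 patients in each group
p_surg, p_splint = 0.80, 0.54    # reported improvement rates

counts = {
    "surgery": success_failure_counts(p_surg, n_surg),
    "splint": success_failure_counts(p_splint, n_splint),
}
# Success/Failure condition: at least 10 successes and 10 failures per group
conditions_ok = all(min(pair) >= 10 for pair in counts.values())

z_star = NormalDist().inv_cdf(0.975)  # critical value for 95% confidence
se = sqrt(p_surg * (1 - p_surg) / n_surg + p_splint * (1 - p_splint) / n_splint)
diff = p_surg - p_splint
low, high = diff - z_star * se, diff + z_star * se

print(counts, conditions_ok)
print(f"SE = {se:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

The script checks the sample-size condition before computing the interval, mirroring the order of the steps in the worked answer.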

EXAMPLE– #6 PAGE 508

State an appropriate conclusion.

We are 95% confident that the proportion of patients who show improvement in carpal tunnel syndrome with surgery is between 12.6% and 39.4% higher than the proportion who show improvement with wrist splints.
