20
Chapter 11: Inference for Distributions of Categorical Data 11.1: Chi-Square Tests for Goodness of Fit Learning Objectives -State appropriate hypotheses and compute expected counts for a chi-square test for goodness of fit. -Calculate the chi-square statistic, degrees of freedom, and P-value for a chi- square test for goodness of fit. -Perform a chi-square test for goodness of fit. -Conduct a follow-up analysis when the results of a chi-square test are statistically significant. Activity: Goodness of Fit Test for M&M Distribution Mars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution: Hackettstown, NJ : Brown: 12.5%, Red: 12.5%, Yellow: 12.5%, Green: 12.5%, Orange: 25%, Blue: 25% Cleveland, OH : Brown: 12.4%, Red: 13.1%, Yellow: 13.5%, Green: 19.8%, Orange: 20.5%, Blue: 20.7% To find out which factory your package of M&M’s came from, check the serial code: a code that includes HKP was produced in Hackettstown and a code that includes CLV was produced in Cleveland. Hypotheses: H 0 : The company’s stated color distribution for M&M’S ® Milk Chocolate Candies is correct. H a : The company’s stated color distribution for M&M’S ® Milk Chocolate Candies is not correct. Individual Data Brown Red Yellow Green Orange Blue Class Data Color Observe d Count Expecte d Count Observed- Expected (Observed- Expected) 2 (Observed- Expected) 2 Expected Brown Red Yellow

Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

  • Upload
    others

  • View
    28

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

Chapter 11: Inference for Distributions of Categorical Data

11.1: Chi-Square Tests for Goodness of FitLearning Objectives-State appropriate hypotheses and compute expected counts for a chi-square test for goodness of fit.-Calculate the chi-square statistic, degrees of freedom, and P-value for a chi-square test for goodness of fit. -Perform a chi-square test for goodness of fit.-Conduct a follow-up analysis when the results of a chi-square test are statistically significant.

Activity: Goodness of Fit Test for M&M DistributionMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:

Hackettstown, NJ: Brown: 12.5%, Red: 12.5%, Yellow: 12.5%, Green: 12.5%, Orange: 25%, Blue: 25%

Cleveland, OH: Brown: 12.4%, Red: 13.1%, Yellow: 13.5%, Green: 19.8%, Orange: 20.5%, Blue: 20.7%

To find out which factory your package of M&M’s came from, check the serial code: a code that includes HKP was produced in Hackettstown and a code that includes CLV was produced in Cleveland.

Hypotheses: H0: The company’s stated color distribution for M&M’S ® Milk Chocolate Candies is correct.

Ha: The company’s stated color distribution for M&M’S ® Milk Chocolate Candies is not correct.

Individual DataBrown Red Yellow Green Orange Blue

Class DataColor Observed

CountExpected

CountObserved-Expected

(Observed-Expected)2

(Observed-Expected)2

Expected

Brown

Red

Yellow

Green

Orange

Blue

Chi-Square Statistic:

χ2=∑ (Observed - Expected )2

Expected =

Page 2: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

Chi-Square DistributionThe chi-square distributions are a family of distributions that take only positive values and are skewed to the right. A particular chi-square distribution is specified by giving its degrees of freedom.The chi-square goodness-of-fit test uses the chi-square distributionwith degrees of freedom = the number of categories - 1.

Using the Chi-Square Distribution to Find a P-ValueThe p-value is the area under the curve to the right of the test statistic.

Example: Find the p-value for the test statistic found in the M&M activity.

What does the p-value tell us about the test?

2

Page 3: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

HypothesesH0: The stated distribution of the categorical variable in the population of interest is correct.Ha: The stated distribution of the categorical variable in the population of interest is not correct.

OR

H0: p1=¿ , p2=¿,…, pk=¿Ha: At least one of these proportions in incorrect.

Conditions

Random: The data come a well-designed random sample or from a randomized experiment.

Large Counts: All expected counts are greater than 5.

Chi-Square Test for Goodness of FitSuppose the conditions are met. To determine whether a categorical variable has a specified distribution in the population of interest, expressed as the proportion of individuals falling into each possible category, perform a chi-square test for goodness of fit.

Start by finding the expected count for each category assuming that H0 is true. Then calculate the chi-square statistic:

Calculator: STAT→TESTS

D: χ2

GOF-Test Observed: (List of observed counts) Expected: (List of expected counts) df: (# categories – 1)

Example: Does the warm, sunny weather in Arizona affect a driver’s choice of car color? Cass thinks that Arizona drivers might opt for a lighter color with the hope that it will reflect some of the heat from the sun. To see if the distribution of car colors in Oro Valley, near Tucson, is different from the distribution of car colors across North America, she selected a random sample of 300 cars in Oro Valley. The table shows the distribution of car color for Cass’s sample in Oro Valley and the distribution of car color in North America, according to www.ppg.com.

Do these data provide convincing evidence that the distribution of car color in Oro Valley differs from the North American distribution?

3

Page 4: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

4

Page 5: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

Example: In his book Outliers, Malcolm Gladwell suggests that a hockey player’s birth month has a big influence on his chance to make it to the highest levels of the game. Specifically, since January 1 is the cut-off date for youth leagues in Canada (where many National Hockey League (NHL) players come from), players born in January will be competing against players up to 12 months younger. The older players tend to be bigger, stronger, and more coordinated and hence get more playing time, more coaching, and have a better chance of being successful. To see if birth date is related to success (judged by whether a player makes it into the NHL), a random sample of 80 National Hockey League players from a recent season was selected and their birthdays were recorded. The one-way table below summarizes the data on birthdays for these 80 players:

Birthday Jan-Mar Apr-Jun Jul-Sept Oct-DecNumber of Players 32 20 16 12

Do these data provide convincing evidence that the birthdays of all NHL players are evenly distributed among the four quarters of the year?

Homework: pg. 693-696 #1, 3, 5, 7, 9, 17, 19-22

5

Page 6: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

11.2: Side-by-side bar graphs, Segmented Bar Graphs, Mosaic PlotsLearning Objective-I can display data from a two-way table using side-by-side bar graphs, segmented bar graphs, and mosaic plots.-Compare conditional distributions for data in a two-way table.

What will be the EK mascot?

1. Create two bar graphs below to display the results. Use three different colors for the bars.

2. Complete the third graph by taking each bar from the teacher sample and stacking them. Use the colors to mark each section. Do the same for the student sample.

Teachers Students

When the high school was built in 1969, the school needed to pick a mascot. The principal decided to have the students and teachers vote between three choices: rams, falcons, or prairie dogs. He took a random sample of students and a random sample of teachers. The results of the surveys are given in the table.

6

Rams Falcons Prairie DogsTeachers 80% 10% 10%Students 30% 60% 10%

Page 7: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

3. According to your displays, which mascot appears to have the most support? Explain.

4. Upon hearing the results of the surveys, the students argued that the decision was incorrect because 100 teachers had been surveyed and 500 students had been surveyed. Use this information to fill in the table below with the number of responses.

5. How many times more students were sampled than teachers? _____. How can you update the third graph in #1 to take into account the sample size? Adjust your graph.

6. What should they make the EK mascot? Explain.

Side-by-side bar graphs, Segmented Bar Graphs, Mosaic PlotsSide-by-Side Bar Graph:

Segmented Bar Graph:

Mosaic Plot:

7

Rams Falcons Prairie DogsTeachersStudentsTotal

Page 8: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

Example: The following table gives the result of a random sample of upper level students at Rocky Vista University (the Fighting Prairie Dogs!), along with a mosaic plot.

Grade LevelEmployment Status Junior Senior

Currently working 14 30Not working but have had a job 22 40Never had a job 15 10

1. Calculate the proportion of juniors that are currently working, not working but have had a job, and never had a job.

2. Calculate the proportion of seniors that are currently working, not working but have had a job, and never had a job.

3. Write a few sentences summarizing what the display reveals about the association between grade level and job experience for the students in the sample.

8

Page 9: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

Example: Market researchers suspect that background music may affect the mood and buying behavior of customers. One study in a Mediterranean restaurant compared three randomly assigned treatments: no music, French accordion music, and Italian string music. Under each condition, the researchers recorded the numbers of customers who ordered French, Italian, and other entrees.

1. Calculate the conditional distribution (in proportions) of the entree ordered for each treatment.

Entree No Music French Music Italian Music

French

Italian

Other

2. Make side-by-side bar graphs to display the conditional distributions in part 1.

3. Write a few sentences comparing the distributions of entrees ordered under the three music treatments.

pg. 724 #27 (Make each type of graph)

9

Page 10: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

11.2: Inference for Two-Way TablesLearning Objectives-State appropriate hypotheses and compute expected counts for a chi-square test based on data in a two-way table.-Calculate the chi-square statistic, degrees of freedom, and P-value for a chi-square test based on data in a two-way table.-Perform a chi-square test for homogeneity.-Perform a chi-square test for independence. -Choose the appropriate chi-square test.

Chi-Square Test for Homogeneity: Used to determine whether the distribution of a categorical variables is the same for each of several populations or treatments.

Multiple ComparisonsThe problem of how to do many comparisons at once with an overall measure of confidence in all our conclusions is common in statistics. This is the problem of multiple comparisons. Statistical methods for dealing with multiple comparisons usually have two parts:

1. An overall test to see if there is good evidence of any differences among the parameters that we want to compare.

2. A detailed follow-up analysis to decide which of the parameters differ and to estimate how large the differences are.

The overall test uses the familiar chi-square statistic and distributions.

Hypotheses for χ2 Test for HomogeneityH0: There is no difference in the distribution of a categorical variable for several populations or treatments.

Ha: There is a difference in the distribution of a categorical variable for several populations or treatments.

Finding Expected CountsWhen H0 is true, the expected count in any cell of a two-way table is

Conditions for χ2 Test for HomogeneityRandom: The data come a well-designed random sample or from a randomized experiment.

Large Counts: All expected counts are greater than 5

10

Page 11: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

Performing a χ2 Test for HomogeneityStart by finding the expected count for each category assuming that H0 is true. Then calculate the chi-square statistic

where the sum is over all cells (not including totals) in the two-way table. If H0 is true, the χ2 statistic has approximately a chi-square distribution with degrees of freedom = (number of rows − 1)(number of columns − 1).The P-value is the area to the right of χ2 under the corresponding chi-square density curve.

Calculator: C: χ2-Test Observed: (Matrix that holds observed counts) Expected: (Matrix that holds expected counts) Calculate: ENTER

Example: Random digit dialing telephone surveys used to exclude cell phone numbers. If the opinions of people who have only cell phones differ from those of people who have landline service, the poll results may not represent the entire adult population. The Pew Research Center interviewed separate random samples of cell-only and landline telephone users who were less than 30 years old. Here’s what the Pew survey found about how these people describe their political party affiliation. Perform a test to determine if there is a difference in the distribution of party affiliation in the cell-only and landline populations.

11

Page 12: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

12

Page 13: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

Homework: pg. 724-726 # 29, 31, 37, 39Chi-Square Test for Independence: Used when a single random sample of individuals is chosen from a single population and then classified based on two categorical variables and the goal is to analyze the relationship between the variables. (Is there an association or not?)

Chi-Square Test for IndependenceH0: There is no association between two categorical variables in the population of interest.Ha: There is an association between two categorical variables in the population of interest.

Conditions: • Random: The data come a well-designed random sample or from a randomized experiment.• Large Counts: All expected counts are greater than 5

Start by finding the expected count for each category assuming that H0 is true. Then calculate the chi-square statistic

where the sum is over all cells (not including totals) in the two-way table. If H0 is true, the χ2 statistic has approximately a chi-square distribution with degrees of freedom = (number of rows − 1)(number of columns − 1). The P-value is the area to the right of χ2 under the corresponding chi-square density curve.

Example: Are men and women equally likely to suffer lingering fear from watching scary movies as children? Researchers asked a random sample of 117 college students to write narrative accounts of their exposure to scary movies before the age of 13. More than one-fourth of the students said that some of the fright symptoms are still present when they are awake. The following table breaks down these results by gender.

(a) Explain why a chi-square test for independence and not a chi-square test for homogeneity should be used in this setting.

13

Page 14: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

(b) Perform a test to determine if there is an association between gender and ongoing fright symptoms in the population of college students. Use ∝=.01 .

14

Page 15: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

(c) Which cell contributes most to the chi-square statistic? In what way does this cell differ from what the null hypothesis suggests?

(d) Interpret the P-value in context.

15

Page 16: Mrs. Venneman · Web viewMars, Incorporated makes milk chocolate candies. M&M’s are produced at two different factories. Each factory has its own claimed distribution:To find out

Homework: pg. 726-729 #41-45 odd, 51-56

Chapter 11 Learning Objectives SectionRelated Example

on Page(s)

RelevantChapter Review

Exercise(s)

Can I do this?

State appropriate hypotheses and compute expected counts for a chi-square test for goodness of fit.

11.1 681 R11.1

Calculate the chi-square statistic, degrees of freedom, and P-value for a chi-square test for goodness of fit.

11.1 683, 685 R11.1

Perform a chi-square test for goodness of fit. 11.1 688 R11.1

Conduct a follow-up analysis when the results of a chi-square test are statistically significant.

11.1, 11.2

Discussion on 690–691, 716 R11.4

Compare conditional distributions for data in a two-way table. 11.2 697, 711 R11.3, R11.5

State appropriate hypotheses and compute expected counts for a chi-square test based on data in a two-way table.

11.2 701, 713 R11.2, R11.3, R11.4, R11.5

Calculate the chi-square statistic, degrees of freedom, and P-value for a chi-square test based on data in a two-way table.

11.2 704 R11.3, R11.5

Perform a chi-square test for homogeneity. 11.2 708 R11.3

Perform a chi-square test for independence. 11.2 715 R11.5

Choose the appropriate chi-square test. 11.2 718 R11.4

Plan of Action:

16