31
CHI – SQUARE TESTS A mini project report Submitted to the SRM University in partial fulfillment of the requirements for the award of the degree of MASTER OF BUSINESS ADMINISTRATION BY Under the Supervision and Guidance of MS. RAMMA R Assistant Professor DEPARTMENT OF MANAGEMENT STUDIES SRM UNIVERSITY KATTANKULATHUR CAMPUS CHENNAI – 603 203 1 ST SEMESTER NOVEMBER, 2009 1

Mini Project Statistics]

Embed Size (px)

Citation preview

Page 1: Mini Project Statistics]

CHI – SQUARE TESTS

A mini project report Submitted to the SRM University in partial fulfillment of the

requirements for the award of the degree of

MASTER OF BUSINESS ADMINISTRATION

BY

Under the Supervision and Guidance of

MS. RAMMA RAssistant Professor

DEPARTMENT OF MANAGEMENT STUDIESSRM UNIVERSITY

KATTANKULATHUR CAMPUSCHENNAI – 603 203

1ST SEMESTERNOVEMBER, 2009

CONTENTS

1

Page 2: Mini Project Statistics]

CHAPTER NO PARTICULARS PAGE NO

ABSTRACT 1

1 CHI – SQUARE TEST

INTRODUCTION 2

PRELIMINARIES 2

TYPES OF CHI – SQUARE TESTS 3

USING CHI - SQUARE 4

INTERPRETATION 6

2CHI – SQUARE TEST

GOODNESS OF FIT 7

3CHI – SQUARE TESTTEST OF INDEPENDENCE OF ATTRIBUTES 13

CONCLUSION 18

APPENDIX 19

2

Page 3: Mini Project Statistics]

ABSTRACT

Chi-square (X2) test is a nonparametric statistical analyzing method often used in

experimental work where the data consist in frequencies or `counts' – for example the

number of boys and girls in a class having their tonsils out – as distinct from quantitative

data obtained from measurement of continuous variables such as temperature, height,

and so on. The most common use of the test is to assess the probability of association or

independence of facts. This paper summarizes the chi-squared test. Beginning from the

basics it discusses with descriptive examples on where and how to apply this test. The

purpose of the paper is to present a quick overview on chi-square test, so that one who

doesn't have much knowledge on statistics may use it as a beginner's guide.

1. CHI – SQUARE TEST

Introduction

3

Page 4: Mini Project Statistics]

Suppose we have a sample of boys and girls from the 5th, 8th, and 12th grade of

school. We may want to know whether there is an association between the gender of

students and the grade levels. Or, we may want to know whether the men and women in a

liberal arts college differ in their section of majors. These are the type of questions that

the chi-squared (χ2) test is designed to answer. It was first introduced by British

statistician Karl Pearson in 1900.

Chi is a letter of the Greek alphabet; the symbol is χ and it's pronounced like

KYE, the sound in "kite." The chi square test uses the statistic chi squared, written χ 2.

The "test" that uses this statistic helps an investigator determine whether an observed set

of results matches an expected outcome. In some types of research (genetics provides

many examples) there may be a theoretical basis for expecting a particular result- not a

guess, but a predicted outcome based on a sound theoretical foundation.

Before going into details of chi-squared test first we present some statistical

preliminaries necessary to clearly understand it.

Preliminaries

Population: A population is an individual or group that represents all the

members of a certain group or category of interest. Populations do not have to include

people. Suppose we want to know the average age of the cats in the city. The population

in this study is made up of cats, not people. A population does not need to be large to

count as a population. We may want to know the average height of the 3 kids in a family.

In this study, the population is comprised of only 3 kids.

Sample: A sample is a subset of a given population. Samples are not necessarily

good representations of the populations from which they are selected. But choosing

representative samples is important for most experiments.

Parameter: A parameter is a value generated from, or applied to a population.

Variable: A variable is pretty much anything that can be codified and can have

value from a set (domain) of more than one value. Variables may be quantitative or

qualitative. A quantitative variable is one that is scored in such a way that its values

indicate some sort of amount. For example, height is a quantitative variable. On the

contrary, a qualitative variable indicates some kind of category. A commonly used

4

Page 5: Mini Project Statistics]

qualitative variable in social science research is the dichotomous variable, which has two

different categories. For instance, gender has two categories: male and female. Chi-

square test is applicable to when we have qualitative variables classified into categories.

Nominally scaled variable: A nominally scaled variable is one in which the

labels that are used to identify the different levels of the variable have no weight, or

numeric value. For example the sample may be divided into two groups labeled “0” and

“1”. In this case the value “1” does not indicate a higher score than the value of “0”.

Rather the, “0” and “1” are simply names, or labels have been assigned to each group.

Contingency table: When the members of a sample are doubly classified (i.e.,

classified in two separate ways), the results may be arranged in rectangular tables. Such a

table is called contingency table.

TYPES OF CHI – SQUARE TESTS

Two Types of χ2

χ2 Test for Goodness of Fit

o Involves a single categorical variable only

χ2 Test for Independence

o Involves 2(+) categorical variables

1. χ2 Test for Goodness of Fit

Generally just involves one categorical variable

Null hypothesis specifies the proportion of the population in each category

Determines how well sample data conform to proportions set forth by the

null hypothesis

Test statistic (χ2) examines whether the proportions in the sample reliably

differ from the null hypothesis

2. χ2 Test for Independence

Examines the relationship between two (or more) categorical variables to determine if they are independent.

Two variables are said to be independent if there is no relationship between them.

5

Page 6: Mini Project Statistics]

Two variables are said to be dependent if there is a relationship between them.

Similar to the correlation coefficient, except that instead of both variables being continuous, both variables are categorical.

o Can’t correlate Academic Major with Favorite Barnyard Animal, but you can do a chi-square.

Null hypothesis: no relationship. Alternative hypothesis: some relationship.

Using Chi Square

1) Knowing which statistic to use to test the relationship between each variable

depends on the type of data you have (and sometimes the type of question you

want to answer). Chi Square can be used when you have two Discrete Variables

2) Single Discrete Variable

Goodness of Fit χ2: Allows us to test whether the group frequencies differ

from chance patterns (base rate frequencies: the frequency instances

naturally occur in the environment). (df = k -1)

Calculating Expected Frequencies

For Equal Chances Assumed: Ei = n/k k = number of groups

For Unequal Chances Assumed: Ei = n (pi) pi = the probability of

occurrence for outcome i.

p = % / 100. 20% = .20

ΣO = n, ΣE = n, ΣO = ΣE

Statistical Hypotheses:

Ho: Of = Ef

Ha: Of =\= Ef

Example Research Question: Does number of people who say they like

cheesy poofs (Yes = 1) vs. those who do not like cheesy poofs (No = 0),

differ significantly from the number expected by chance alone?

The Decision to Reject Ho:

If χ2 obtained is > χ2 critical @ p< .05 for df = k-1;

6

Page 7: Mini Project Statistics]

Then reject Ho (Fail to reject Ha)

If χ2 obtained is < χ2 critical @ p < .05 for df = k-1;

Then Fail to Reject Ho (Reject Ha)

3) Discrete × Discrete

Test of independence: Allows us to test whether the cross tabulation

pattern of two nominal variables differs from the patterns expected by

chance.

(df = (R-1)(C-1)) where R = # of rows & C = # of columns.

Statistical Hypotheses: Ho: Of = Ef

Ha: Of =\= Ef

Example Research Question: Does the number of people who think they

are Eric Cartman (yes = 1, no = 0), relative to whether or not they eat

cheesy poofs (yes = 1, no = 0), significantly differ from the frequencies

that are expected by chance alone.

4) Limitations on X2:

a) Responses must be independent and mutually exclusive and exhaustive.

Each case from the sample should fit into one and only one cell of the

cross tab matrix. (This also means that repeated measure designs can not

be tested using chi square).

b) Low expected Frequencies limit the validity of X2.

If df = 1 (e.g., 2x2 matrix), then no expected frequency can be less than 5.

Also, If df = 2, all expected frequencies should exceed 2.

7

Page 8: Mini Project Statistics]

If df = 3 or greater, then all expected frequencies except one should be 5

or greater and the one cell needs to have an expected frequency of 1 or

greater.

The interpretation

In every χ2-test the calculated χ2 value will either be (i) less than or equal to the

critical χ2 value OR (ii) greater that the critical χ2 value.

i. If calculated χ2 ≤ critical χ2, then we conclude that there is no

statistically significant difference between the two distributions. That is, the

observed results are not significantly different from the expected results, and the

numerical difference between observed and expected can be attributed to

chance.

ii. If calculated χ2 > critical χ2, then we conclude that there is a

statistically significant difference between the two distributions. That is, the

observed results are significantly different from the expected results, and the

numerical difference between observed and expected can not be attributed to

chance. That means that the difference found is due to some other factor. This

test won't identify that other factor, only that there is some factor other than

chance responsible for the difference between the two distributions.

8

Page 9: Mini Project Statistics]

2. CHI-SQUARE TEST – GOODNESS OF FIT

The chi-square (I) test is used to determine whether there is a significant

difference between the expected frequencies and the observed frequencies in one or more

categories. Do the numbers of individuals or objects that fall in each category differ

significantly from the number you would expect? Is this difference between the expected

and observed due to sampling error, or is it a real difference?

Chi-Square Test Requirements

1. Quantitative data.

2. One or more categories.

3. Independent observations.

4. Adequate sample size (at least 10).

5. Simple random sample.

6. Data in frequency form.

7. All observations must be used.

Expected Frequencies

When you find the value for chi square, you determine whether the observed

frequencies differ significantly from the expected frequencies. You find the expected

frequencies for chi square in three ways:

1. You hypothesize that all the frequencies are equal in each category. For

example, you might expect that half of the entering freshmen class of 200

at Tech College will be identified as women and half as men. You figure

the expected frequency by dividing the number in the sample by the

number of categories. In this exam pie, here there are 200 entering

freshmen and two categories, male and female, you divide your sample of

200 by 2, the number of categories, to get 100 (expected frequencies) in

each category.

2. You determine the expected frequencies on the basis of some prior

knowledge. Let's use the Tech College example again, but this time

9

Page 10: Mini Project Statistics]

pretend we have prior knowledge of the frequencies of men and women in

each category from last year's entering class, when 60% of the freshmen

were men and 40% were women. This year you might expect that 60% of

the total would be men and 40% would be women. You find the expected

frequencies by multiplying the sample size by each of the hypothesized

population proportions. If the freshmen total were 200, you would expect

120 to be men (60% x 200) and 80 to be women (40% x 200).

Now let's take a situation, find the expected frequencies, and use the chi-square

test to solve the problem.

CASE:

Thai, the manager of a car dealership, did not want to stock cars that were bought

less frequently because of their unpopular color. The five colors that he ordered were red,

yellow, green, blue, and white. According to Thai, the expected frequencies or number of

customers choosing each color should follow the percentages of last year. She felt 20%

would choose yellow, 30% would choose red, 10% would choose green, 10% would

choose blue, and 30% would choose white. She now took a random sample of 150

customers and asked them their color preferences. The results of this poll are shown in

Table 1 under the column labeled observed frequencies."

10

Page 11: Mini Project Statistics]

SOLUTION:

The expected frequencies in Table 2 are figured from last year's percentages.

Table 2: Expected Frequencies

Based on the percentages for last year, we would expect 20% to choose yellow.

Figure the expected frequencies for yellow by taking 20% of the 150 customers, getting

an expected frequency of 30 people for this category. For the color red we would expect

30% out of 150 or 45 people to fall in this category. Using this method, Thai figured out

the expected frequencies 30, 45, 15, 15, and 45. Obviously, there are discrepancies

between the colors preferred by customers in the poll taken by Thai and the colors

preferred by the customers who bought their cars last year. Most striking is the difference

in the green and white colors. If Thai were to follow the results of her poll, she would

stock twice as many green cars than if she were to follow the customer color preference

for green based on last year's sales. In the case of white cars, she would stock half as

many this year. What to do??? Thai needs to know whether or not the discrepancies

between last year's choices (expected frequencies) and this year's preferences on the basis

of her poll (observed frequencies) demonstrate a real change in customer color

preferences. It could be that the differences are simply a result of the random sample she

chanced to select. If so, then the population of customers really has not changed from last

year as far as color preferences go.

The NULL HYPOTHESIS states that there is no significant difference

between the expected and observed frequencies.

The ALTERNATIVE HYPOTHESIS states they are different.

11

Page 12: Mini Project Statistics]

The level of significance (the point at which you can say with 95% confidence

that the difference is NOT due to chance alone) is set at .05 (the standard for most science

experiments.)

The chi-square formula used on these data is

Where,

O is the Observed Frequency in each category

E is the Expected Frequency in the corresponding category

df is the "degree of freedom" (n-1)

χ2 is Chi Square

PROCEDURE

We are now ready to use our formula for χ2 and find out if there is a significant

difference between the observed and expected frequencies for the customers in choosing

cars. We will set up a worksheet; then you will follow the directions to form the columns

and solve the formula.

1. Directions for Setting Up Worksheet for Chi Square

2. After calculating the Chi Square value, find the "Degrees of Freedom." (DO NOT

SQUARE THE NUMBER YOU GET, NOR FIND THE SQUARE ROOT – THE

NUMBER YOU GET FROM COMPLETING THE CALCULATIONS AS

ABOVE IS CHI SQUARE.)

12

Page 13: Mini Project Statistics]

Degrees of freedom (df) refers to the number of values that are free to vary after

restriction has been placed on the data. For instance, if you have four numbers with the

restriction that their sum has to be 50, and then three of these numbers can be anything,

they are free to vary, but the fourth number definitely is restricted. For example, the first

three numbers could be 15, 20, and 5, adding up to 40; then the fourth number has to be

10 in order that they sum to 50. The degrees of freedom for these values are then three.

The degrees of freedom here is defined as N - 1, the number in the group minus one

restriction (4 - 1).

Table 3: Degrees of freedom table

3. Find the table value for Chi Square. Begin by finding the df found in step 2 along

the left hand side of the table. Run your fingers across the proper row until you

reach the predetermined level of significance (.05) at the column heading on the

Table of chi square critical valuesprobability (P)

degrees of freedom* 0.05 0.01

1 3.84 6.632 5.99 9.213 7.81 11.344 9.49 13.285 11.07 15.096 12.59 16.817 14.07 18.488 15.51 20.099 16.92 21.6710 18.31 23.2111 19.68 24.7312 21.03 26.2213 22.36 27.6914 23.68 29.1415 25.00 30.5816 26.30 32.0017 27.59 33.4118 28.87 34.8119 30.14 36.1920 31.41 37.57

13

Page 14: Mini Project Statistics]

top of the table. The table value for Chi Square in the correct box of 4 df and

P=.05 level of significance is 9.49.

4. If the calculated chi-square value for the set of data you are analyzing (26.95) is

equal to or greater than the table value (9.49), reject the null hypothesis. There IS

a significant difference between the data sets that cannot be due to chance

alone. If the number you calculate is LESS than the number you find on the table,

than you can probably say that any differences are due to chance alone.

In this situation, the rejection of the null hypothesis means that the differences

between the expected frequencies (based upon last year's car sales) and the observed

frequencies (based upon this year's poll taken by Thai) are not due to chance. That is,

they are not due to chance variation in the sample Thai took; there is a real difference

between them. Therefore, in deciding what color autos to stock, it would be to Thai's

advantage to pay careful attention to the results of her poll.

THE STEPS IN USING THE CHI-SQUARE TEST MAY BE SUMMARIZED AS

FOLLOWS:

CHI-SQUARE TEST SUMMARY

1. Write the observed frequencies in column O

2. Figure the expected frequencies and write them in column E.

3. Use the formula to find the chi-square value:

4. Find the df. (N-1)

5. Find the table value (consult the Chi Square Table.)

6. If your chi-square value is equal to or greater than the table value, reject

the null hypothesis: differences in your data are not due to chance alone

14

Page 15: Mini Project Statistics]

3. CHI-SQUARE TEST OF INDEPENDENCE OF ATTRIBUTES

CASE:

In 1956, the number of people on record who died of tuberculosis in England and

Wales was 5375. Of these 3804 were males and 1571 were females: 3534 males and 1319

females died of tuberculosis of respiratory system, while the remainder died of other

forms of tuberculosis. Their data can be arranged in contingency table as shown in Table

1. This 2*2 (the members of the sample having been dichotomized in two different ways)

contingency table is an example of the simplest form.

Chi-squared Test

The entries in the cells in a contingency table may be frequencies, as in Table 1,

or frequencies may be transformed into proportions or percentages. However, it is

important to note that in whatever form (frequencies, proportions, etc.) they are presented

are not continuous measurements. Chi-squared test can be applied to only discrete data

for the purpose of the test; of course, continuous data can be often put into discrete form

by the use of intervals on a continuous scale. For instance, age is a continuous variable,

but if people are classified into different age-groups, then the intervals of time

corresponding to these groups can be treated as if they were discrete units.

Chi-square (χ2) test is a nonparametric statistical test to determine if the two or

more classifications of the samples are independent or not. For explanation, let us

consider the data presented in Table 1 where the people are classified according to two

attributes: gender and type of tuberculosis. We may want to determine whether death

15

Page 16: Mini Project Statistics]

caused by tuberculosis of respiratory system or other type of tuberculosis is dependent on

gender. To get the answer, we may apply chi-square test.

In order to carry out the chi-square test, it will be helpful to rearrange Table 1

leaving the marginal frequencies as they are but replacing the frequencies in the cells of

the body of the table by the letters E1 to E4. After this arrangement we get Table 2.

Looking at the column of the marginal totals on the right of the table we see that

the proportion of deaths, males and females combined, due to tuberculosis of the

respiratory system, is

Now, if the two classifications are independent, that is if the form of tuberculosis

from which people die is independent of gender, we would expect that the proportion of

males that died from tuberculosis of respiratory system would be equal to that of females

died from the same cause, and consequently would equal the proportion of the total

group, 0.903.

The expected values E1 and E2 then must be chosen so that the following holds.

Using equation 2 we find

Once the value of E1 is known, the values of E2; E3 and E4 can be deduced since the following facts are true.

16

Page 17: Mini Project Statistics]

Calculating values of E1; E2; E3, and E4 and putting these expected values

replacing the corresponding letters in Table 2, we get Table 3. These are the values that

one would expect to find in the cells in the body of Table 1 were the two methods of

classification independent. Though the number of people may not be fractional, fractional

values may appear in the table of expected frequencies. Especially when the sample size

is small, we retain the fractional values to increase the accuracy of subsequent

calculations.

Now, if we refer again to the data, we see that the observed frequencies in Table 1

differ considerably from the expected frequencies in Table 3. The question in concern is

whether this difference is such as could have arisen from random sampling error alone, or

whether it indicates a real difference between genders: males and females.

Now, let us define our null hypothesis as follows.

Null hypothesis: The number of men and women died in 1956 due to

tuberculosis of respiratory system and other types of tuberculosis is independent of their

sex.

Chi-square test, if properly applied may give us the answer by rejecting the null

hypothesis or failing to reject it. The test is based on the chi-square (χ2) distribution. To

17

Page 18: Mini Project Statistics]

compare the observed and expected frequencies, we produce chi-square (χ2) value using

the formula stated in equation 8.

In this equation 8, Oi stands for observed frequencies, Ei stands for expected

frequencies, and i runs from 1; 2; : : : ; n, where n is the number of cells in the

contingency table.

To perform the calculations it is useful to arrange the observed and expected

frequencies as shown in Table 4.

To asses the significance of the calculated value of χ2, we refer to the standard

chi-square table presented in Appendix. This table contains the critical χ2 values on

different degrees of freedom and levels of probability. Referring back to Table 3, we

recall that once the value of any one of the Ei (i = 1; : : : ; 4) had been determined, all

other Ei could be deduced. In other words, when the marginal totals of a 2 * 2

contingency table is given, only one cell in the body of the table can be filled arbitrarily.

This fact is expressed by saying that a 2 * 2 contingency table has only one degree of

freedom. The degree of freedom (df) of a contingency table with r rows and c columns is

computed using the following formula given in equation 9.

To assess the significance of our chi-square value χ2 = 101:35, we enter the chi-

square table of the Appendix, with df = 1, that is we look into the first row, which

corresponds to one degree of freedom. The largest value in that row is 10.828 under the

probability (P) level 0.001. A value of chi-square equal to or greater than 10.828 would

be expected to occur by chance only once in a thousand times if the null hypothesis is

true. Since our chi-square value 101.35 is much greater than 10.828, it would be expected

to occur even less frequently. Hence our chi-square test rejects the null hypothesis. So,

we conclude that the proportion of males died from tuberculosis of respiratory system,

namely 3534/3804 = 0:929, is significantly different from the proportion of females,

namely 1319/1571 = 0:840, that died from the same cause.

18

Page 19: Mini Project Statistics]

Before accomplishing the test we define a level of confidence, which is the

probability level (P) we are going to accept. Once the χ2 value is computed and the

number of degrees of freedom is determined, we go to the chi-square table and look into

the row corresponding to the given degree of freedom. Then if we find our χ2 value to be

less than (to the left side of our level of confidence) that of the value corresponding to our

level of confidence, we conclude that our null hypothesis is probably true. On the

contrary, if our χ2 value lies over the level of confidence or to its right indicating less

probability of occurring the difference by chance, we know that our chi-square test rejects

the null hypothesis. Therefore, we conclude that the classifications on population are

dependent on each other.

Limitations of Chi-square Test

As mentioned before, chi-square test cannot be applied on continuous data. It can

only be applied to qualitative data classified into categories, or labeled using nominally

scaled variables [3, 4]. In the standard chi-square table presented in the Appendix the chi-

square values computed using the formula in equation 8 assuming that the expected

values (E) are large. Most statisticians warn against using the test when any of the

expected values are less than 5. This warning implies that the use of chi-square test is

restricted to large samples.

19

Page 20: Mini Project Statistics]

CONCLUSION

Chi-square test tells us whether the classifications on a given population are

dependent on each other or not. However, it is important to stress that the establishment

of statistical association by means of chi-square does not necessarily imply any causal

relationship between the attributes being compared, but it does indicate that the reason for

the association is worth investigating. For example, if further investigation carried out on

the case of people's death from tuberculosis, we might have found out that the reason

why the number of men died from tuberculosis of the respiratory system is higher than

the women is the fact that there are more smokers in men than in women.

20

Page 21: Mini Project Statistics]

APPENDIX

21