Upload
manassamantaray28
View
60
Download
0
Embed Size (px)
Citation preview
CHI – SQUARE TESTS
A mini project report Submitted to the SRM University in partial fulfillment of the
requirements for the award of the degree of
MASTER OF BUSINESS ADMINISTRATION
BY
Under the Supervision and Guidance of
MS. RAMMA RAssistant Professor
DEPARTMENT OF MANAGEMENT STUDIESSRM UNIVERSITY
KATTANKULATHUR CAMPUSCHENNAI – 603 203
1ST SEMESTERNOVEMBER, 2009
CONTENTS
1
CHAPTER NO PARTICULARS PAGE NO
ABSTRACT 1
1 CHI – SQUARE TEST
INTRODUCTION 2
PRELIMINARIES 2
TYPES OF CHI – SQUARE TESTS 3
USING CHI - SQUARE 4
INTERPRETATION 6
2CHI – SQUARE TEST
GOODNESS OF FIT 7
3CHI – SQUARE TESTTEST OF INDEPENDENCE OF ATTRIBUTES 13
CONCLUSION 18
APPENDIX 19
2
ABSTRACT
Chi-square (X2) test is a nonparametric statistical analyzing method often used in
experimental work where the data consist in frequencies or `counts' – for example the
number of boys and girls in a class having their tonsils out – as distinct from quantitative
data obtained from measurement of continuous variables such as temperature, height,
and so on. The most common use of the test is to assess the probability of association or
independence of facts. This paper summarizes the chi-squared test. Beginning from the
basics it discusses with descriptive examples on where and how to apply this test. The
purpose of the paper is to present a quick overview on chi-square test, so that one who
doesn't have much knowledge on statistics may use it as a beginner's guide.
1. CHI – SQUARE TEST
Introduction
3
Suppose we have a sample of boys and girls from the 5th, 8th, and 12th grade of
school. We may want to know whether there is an association between the gender of
students and the grade levels. Or, we may want to know whether the men and women in a
liberal arts college differ in their section of majors. These are the type of questions that
the chi-squared (χ2) test is designed to answer. It was first introduced by British
statistician Karl Pearson in 1900.
Chi is a letter of the Greek alphabet; the symbol is χ and it's pronounced like
KYE, the sound in "kite." The chi square test uses the statistic chi squared, written χ 2.
The "test" that uses this statistic helps an investigator determine whether an observed set
of results matches an expected outcome. In some types of research (genetics provides
many examples) there may be a theoretical basis for expecting a particular result- not a
guess, but a predicted outcome based on a sound theoretical foundation.
Before going into details of chi-squared test first we present some statistical
preliminaries necessary to clearly understand it.
Preliminaries
Population: A population is an individual or group that represents all the
members of a certain group or category of interest. Populations do not have to include
people. Suppose we want to know the average age of the cats in the city. The population
in this study is made up of cats, not people. A population does not need to be large to
count as a population. We may want to know the average height of the 3 kids in a family.
In this study, the population is comprised of only 3 kids.
Sample: A sample is a subset of a given population. Samples are not necessarily
good representations of the populations from which they are selected. But choosing
representative samples is important for most experiments.
Parameter: A parameter is a value generated from, or applied to a population.
Variable: A variable is pretty much anything that can be codified and can have
value from a set (domain) of more than one value. Variables may be quantitative or
qualitative. A quantitative variable is one that is scored in such a way that its values
indicate some sort of amount. For example, height is a quantitative variable. On the
contrary, a qualitative variable indicates some kind of category. A commonly used
4
qualitative variable in social science research is the dichotomous variable, which has two
different categories. For instance, gender has two categories: male and female. Chi-
square test is applicable to when we have qualitative variables classified into categories.
Nominally scaled variable: A nominally scaled variable is one in which the
labels that are used to identify the different levels of the variable have no weight, or
numeric value. For example the sample may be divided into two groups labeled “0” and
“1”. In this case the value “1” does not indicate a higher score than the value of “0”.
Rather the, “0” and “1” are simply names, or labels have been assigned to each group.
Contingency table: When the members of a sample are doubly classified (i.e.,
classified in two separate ways), the results may be arranged in rectangular tables. Such a
table is called contingency table.
TYPES OF CHI – SQUARE TESTS
Two Types of χ2
χ2 Test for Goodness of Fit
o Involves a single categorical variable only
χ2 Test for Independence
o Involves 2(+) categorical variables
1. χ2 Test for Goodness of Fit
Generally just involves one categorical variable
Null hypothesis specifies the proportion of the population in each category
Determines how well sample data conform to proportions set forth by the
null hypothesis
Test statistic (χ2) examines whether the proportions in the sample reliably
differ from the null hypothesis
2. χ2 Test for Independence
Examines the relationship between two (or more) categorical variables to determine if they are independent.
Two variables are said to be independent if there is no relationship between them.
5
Two variables are said to be dependent if there is a relationship between them.
Similar to the correlation coefficient, except that instead of both variables being continuous, both variables are categorical.
o Can’t correlate Academic Major with Favorite Barnyard Animal, but you can do a chi-square.
Null hypothesis: no relationship. Alternative hypothesis: some relationship.
Using Chi Square
1) Knowing which statistic to use to test the relationship between each variable
depends on the type of data you have (and sometimes the type of question you
want to answer). Chi Square can be used when you have two Discrete Variables
2) Single Discrete Variable
Goodness of Fit χ2: Allows us to test whether the group frequencies differ
from chance patterns (base rate frequencies: the frequency instances
naturally occur in the environment). (df = k -1)
Calculating Expected Frequencies
For Equal Chances Assumed: Ei = n/k k = number of groups
For Unequal Chances Assumed: Ei = n (pi) pi = the probability of
occurrence for outcome i.
p = % / 100. 20% = .20
ΣO = n, ΣE = n, ΣO = ΣE
Statistical Hypotheses:
Ho: Of = Ef
Ha: Of =\= Ef
Example Research Question: Does number of people who say they like
cheesy poofs (Yes = 1) vs. those who do not like cheesy poofs (No = 0),
differ significantly from the number expected by chance alone?
The Decision to Reject Ho:
If χ2 obtained is > χ2 critical @ p< .05 for df = k-1;
6
Then reject Ho (Fail to reject Ha)
If χ2 obtained is < χ2 critical @ p < .05 for df = k-1;
Then Fail to Reject Ho (Reject Ha)
3) Discrete × Discrete
Test of independence: Allows us to test whether the cross tabulation
pattern of two nominal variables differs from the patterns expected by
chance.
(df = (R-1)(C-1)) where R = # of rows & C = # of columns.
Statistical Hypotheses: Ho: Of = Ef
Ha: Of =\= Ef
Example Research Question: Does the number of people who think they
are Eric Cartman (yes = 1, no = 0), relative to whether or not they eat
cheesy poofs (yes = 1, no = 0), significantly differ from the frequencies
that are expected by chance alone.
4) Limitations on X2:
a) Responses must be independent and mutually exclusive and exhaustive.
Each case from the sample should fit into one and only one cell of the
cross tab matrix. (This also means that repeated measure designs can not
be tested using chi square).
b) Low expected Frequencies limit the validity of X2.
If df = 1 (e.g., 2x2 matrix), then no expected frequency can be less than 5.
Also, If df = 2, all expected frequencies should exceed 2.
7
If df = 3 or greater, then all expected frequencies except one should be 5
or greater and the one cell needs to have an expected frequency of 1 or
greater.
The interpretation
In every χ2-test the calculated χ2 value will either be (i) less than or equal to the
critical χ2 value OR (ii) greater that the critical χ2 value.
i. If calculated χ2 ≤ critical χ2, then we conclude that there is no
statistically significant difference between the two distributions. That is, the
observed results are not significantly different from the expected results, and the
numerical difference between observed and expected can be attributed to
chance.
ii. If calculated χ2 > critical χ2, then we conclude that there is a
statistically significant difference between the two distributions. That is, the
observed results are significantly different from the expected results, and the
numerical difference between observed and expected can not be attributed to
chance. That means that the difference found is due to some other factor. This
test won't identify that other factor, only that there is some factor other than
chance responsible for the difference between the two distributions.
8
2. CHI-SQUARE TEST – GOODNESS OF FIT
The chi-square (I) test is used to determine whether there is a significant
difference between the expected frequencies and the observed frequencies in one or more
categories. Do the numbers of individuals or objects that fall in each category differ
significantly from the number you would expect? Is this difference between the expected
and observed due to sampling error, or is it a real difference?
Chi-Square Test Requirements
1. Quantitative data.
2. One or more categories.
3. Independent observations.
4. Adequate sample size (at least 10).
5. Simple random sample.
6. Data in frequency form.
7. All observations must be used.
Expected Frequencies
When you find the value for chi square, you determine whether the observed
frequencies differ significantly from the expected frequencies. You find the expected
frequencies for chi square in three ways:
1. You hypothesize that all the frequencies are equal in each category. For
example, you might expect that half of the entering freshmen class of 200
at Tech College will be identified as women and half as men. You figure
the expected frequency by dividing the number in the sample by the
number of categories. In this exam pie, here there are 200 entering
freshmen and two categories, male and female, you divide your sample of
200 by 2, the number of categories, to get 100 (expected frequencies) in
each category.
2. You determine the expected frequencies on the basis of some prior
knowledge. Let's use the Tech College example again, but this time
9
pretend we have prior knowledge of the frequencies of men and women in
each category from last year's entering class, when 60% of the freshmen
were men and 40% were women. This year you might expect that 60% of
the total would be men and 40% would be women. You find the expected
frequencies by multiplying the sample size by each of the hypothesized
population proportions. If the freshmen total were 200, you would expect
120 to be men (60% x 200) and 80 to be women (40% x 200).
Now let's take a situation, find the expected frequencies, and use the chi-square
test to solve the problem.
CASE:
Thai, the manager of a car dealership, did not want to stock cars that were bought
less frequently because of their unpopular color. The five colors that he ordered were red,
yellow, green, blue, and white. According to Thai, the expected frequencies or number of
customers choosing each color should follow the percentages of last year. She felt 20%
would choose yellow, 30% would choose red, 10% would choose green, 10% would
choose blue, and 30% would choose white. She now took a random sample of 150
customers and asked them their color preferences. The results of this poll are shown in
Table 1 under the column labeled observed frequencies."
10
SOLUTION:
The expected frequencies in Table 2 are figured from last year's percentages.
Table 2: Expected Frequencies
Based on the percentages for last year, we would expect 20% to choose yellow.
Figure the expected frequencies for yellow by taking 20% of the 150 customers, getting
an expected frequency of 30 people for this category. For the color red we would expect
30% out of 150 or 45 people to fall in this category. Using this method, Thai figured out
the expected frequencies 30, 45, 15, 15, and 45. Obviously, there are discrepancies
between the colors preferred by customers in the poll taken by Thai and the colors
preferred by the customers who bought their cars last year. Most striking is the difference
in the green and white colors. If Thai were to follow the results of her poll, she would
stock twice as many green cars than if she were to follow the customer color preference
for green based on last year's sales. In the case of white cars, she would stock half as
many this year. What to do??? Thai needs to know whether or not the discrepancies
between last year's choices (expected frequencies) and this year's preferences on the basis
of her poll (observed frequencies) demonstrate a real change in customer color
preferences. It could be that the differences are simply a result of the random sample she
chanced to select. If so, then the population of customers really has not changed from last
year as far as color preferences go.
The NULL HYPOTHESIS states that there is no significant difference
between the expected and observed frequencies.
The ALTERNATIVE HYPOTHESIS states they are different.
11
The level of significance (the point at which you can say with 95% confidence
that the difference is NOT due to chance alone) is set at .05 (the standard for most science
experiments.)
The chi-square formula used on these data is
Where,
O is the Observed Frequency in each category
E is the Expected Frequency in the corresponding category
df is the "degree of freedom" (n-1)
χ2 is Chi Square
PROCEDURE
We are now ready to use our formula for χ2 and find out if there is a significant
difference between the observed and expected frequencies for the customers in choosing
cars. We will set up a worksheet; then you will follow the directions to form the columns
and solve the formula.
1. Directions for Setting Up Worksheet for Chi Square
2. After calculating the Chi Square value, find the "Degrees of Freedom." (DO NOT
SQUARE THE NUMBER YOU GET, NOR FIND THE SQUARE ROOT – THE
NUMBER YOU GET FROM COMPLETING THE CALCULATIONS AS
ABOVE IS CHI SQUARE.)
12
Degrees of freedom (df) refers to the number of values that are free to vary after
restriction has been placed on the data. For instance, if you have four numbers with the
restriction that their sum has to be 50, and then three of these numbers can be anything,
they are free to vary, but the fourth number definitely is restricted. For example, the first
three numbers could be 15, 20, and 5, adding up to 40; then the fourth number has to be
10 in order that they sum to 50. The degrees of freedom for these values are then three.
The degrees of freedom here is defined as N - 1, the number in the group minus one
restriction (4 - 1).
Table 3: Degrees of freedom table
3. Find the table value for Chi Square. Begin by finding the df found in step 2 along
the left hand side of the table. Run your fingers across the proper row until you
reach the predetermined level of significance (.05) at the column heading on the
Table of chi square critical valuesprobability (P)
degrees of freedom* 0.05 0.01
1 3.84 6.632 5.99 9.213 7.81 11.344 9.49 13.285 11.07 15.096 12.59 16.817 14.07 18.488 15.51 20.099 16.92 21.6710 18.31 23.2111 19.68 24.7312 21.03 26.2213 22.36 27.6914 23.68 29.1415 25.00 30.5816 26.30 32.0017 27.59 33.4118 28.87 34.8119 30.14 36.1920 31.41 37.57
13
top of the table. The table value for Chi Square in the correct box of 4 df and
P=.05 level of significance is 9.49.
4. If the calculated chi-square value for the set of data you are analyzing (26.95) is
equal to or greater than the table value (9.49), reject the null hypothesis. There IS
a significant difference between the data sets that cannot be due to chance
alone. If the number you calculate is LESS than the number you find on the table,
than you can probably say that any differences are due to chance alone.
In this situation, the rejection of the null hypothesis means that the differences
between the expected frequencies (based upon last year's car sales) and the observed
frequencies (based upon this year's poll taken by Thai) are not due to chance. That is,
they are not due to chance variation in the sample Thai took; there is a real difference
between them. Therefore, in deciding what color autos to stock, it would be to Thai's
advantage to pay careful attention to the results of her poll.
THE STEPS IN USING THE CHI-SQUARE TEST MAY BE SUMMARIZED AS
FOLLOWS:
CHI-SQUARE TEST SUMMARY
1. Write the observed frequencies in column O
2. Figure the expected frequencies and write them in column E.
3. Use the formula to find the chi-square value:
4. Find the df. (N-1)
5. Find the table value (consult the Chi Square Table.)
6. If your chi-square value is equal to or greater than the table value, reject
the null hypothesis: differences in your data are not due to chance alone
14
3. CHI-SQUARE TEST OF INDEPENDENCE OF ATTRIBUTES
CASE:
In 1956, the number of people on record who died of tuberculosis in England and
Wales was 5375. Of these 3804 were males and 1571 were females: 3534 males and 1319
females died of tuberculosis of respiratory system, while the remainder died of other
forms of tuberculosis. Their data can be arranged in contingency table as shown in Table
1. This 2*2 (the members of the sample having been dichotomized in two different ways)
contingency table is an example of the simplest form.
Chi-squared Test
The entries in the cells in a contingency table may be frequencies, as in Table 1,
or frequencies may be transformed into proportions or percentages. However, it is
important to note that in whatever form (frequencies, proportions, etc.) they are presented
are not continuous measurements. Chi-squared test can be applied to only discrete data
for the purpose of the test; of course, continuous data can be often put into discrete form
by the use of intervals on a continuous scale. For instance, age is a continuous variable,
but if people are classified into different age-groups, then the intervals of time
corresponding to these groups can be treated as if they were discrete units.
Chi-square (χ2) test is a nonparametric statistical test to determine if the two or
more classifications of the samples are independent or not. For explanation, let us
consider the data presented in Table 1 where the people are classified according to two
attributes: gender and type of tuberculosis. We may want to determine whether death
15
caused by tuberculosis of respiratory system or other type of tuberculosis is dependent on
gender. To get the answer, we may apply chi-square test.
In order to carry out the chi-square test, it will be helpful to rearrange Table 1
leaving the marginal frequencies as they are but replacing the frequencies in the cells of
the body of the table by the letters E1 to E4. After this arrangement we get Table 2.
Looking at the column of the marginal totals on the right of the table we see that
the proportion of deaths, males and females combined, due to tuberculosis of the
respiratory system, is
Now, if the two classifications are independent, that is if the form of tuberculosis
from which people die is independent of gender, we would expect that the proportion of
males that died from tuberculosis of respiratory system would be equal to that of females
died from the same cause, and consequently would equal the proportion of the total
group, 0.903.
The expected values E1 and E2 then must be chosen so that the following holds.
Using equation 2 we find
Once the value of E1 is known, the values of E2; E3 and E4 can be deduced since the following facts are true.
16
Calculating values of E1; E2; E3, and E4 and putting these expected values
replacing the corresponding letters in Table 2, we get Table 3. These are the values that
one would expect to find in the cells in the body of Table 1 were the two methods of
classification independent. Though the number of people may not be fractional, fractional
values may appear in the table of expected frequencies. Especially when the sample size
is small, we retain the fractional values to increase the accuracy of subsequent
calculations.
Now, if we refer again to the data, we see that the observed frequencies in Table 1
differ considerably from the expected frequencies in Table 3. The question in concern is
whether this difference is such as could have arisen from random sampling error alone, or
whether it indicates a real difference between genders: males and females.
Now, let us define our null hypothesis as follows.
Null hypothesis: The number of men and women died in 1956 due to
tuberculosis of respiratory system and other types of tuberculosis is independent of their
sex.
Chi-square test, if properly applied may give us the answer by rejecting the null
hypothesis or failing to reject it. The test is based on the chi-square (χ2) distribution. To
17
compare the observed and expected frequencies, we produce chi-square (χ2) value using
the formula stated in equation 8.
In this equation 8, Oi stands for observed frequencies, Ei stands for expected
frequencies, and i runs from 1; 2; : : : ; n, where n is the number of cells in the
contingency table.
To perform the calculations it is useful to arrange the observed and expected
frequencies as shown in Table 4.
To asses the significance of the calculated value of χ2, we refer to the standard
chi-square table presented in Appendix. This table contains the critical χ2 values on
different degrees of freedom and levels of probability. Referring back to Table 3, we
recall that once the value of any one of the Ei (i = 1; : : : ; 4) had been determined, all
other Ei could be deduced. In other words, when the marginal totals of a 2 * 2
contingency table is given, only one cell in the body of the table can be filled arbitrarily.
This fact is expressed by saying that a 2 * 2 contingency table has only one degree of
freedom. The degree of freedom (df) of a contingency table with r rows and c columns is
computed using the following formula given in equation 9.
To assess the significance of our chi-square value χ2 = 101:35, we enter the chi-
square table of the Appendix, with df = 1, that is we look into the first row, which
corresponds to one degree of freedom. The largest value in that row is 10.828 under the
probability (P) level 0.001. A value of chi-square equal to or greater than 10.828 would
be expected to occur by chance only once in a thousand times if the null hypothesis is
true. Since our chi-square value 101.35 is much greater than 10.828, it would be expected
to occur even less frequently. Hence our chi-square test rejects the null hypothesis. So,
we conclude that the proportion of males died from tuberculosis of respiratory system,
namely 3534/3804 = 0:929, is significantly different from the proportion of females,
namely 1319/1571 = 0:840, that died from the same cause.
18
Before accomplishing the test we define a level of confidence, which is the
probability level (P) we are going to accept. Once the χ2 value is computed and the
number of degrees of freedom is determined, we go to the chi-square table and look into
the row corresponding to the given degree of freedom. Then if we find our χ2 value to be
less than (to the left side of our level of confidence) that of the value corresponding to our
level of confidence, we conclude that our null hypothesis is probably true. On the
contrary, if our χ2 value lies over the level of confidence or to its right indicating less
probability of occurring the difference by chance, we know that our chi-square test rejects
the null hypothesis. Therefore, we conclude that the classifications on population are
dependent on each other.
Limitations of Chi-square Test
As mentioned before, chi-square test cannot be applied on continuous data. It can
only be applied to qualitative data classified into categories, or labeled using nominally
scaled variables [3, 4]. In the standard chi-square table presented in the Appendix the chi-
square values computed using the formula in equation 8 assuming that the expected
values (E) are large. Most statisticians warn against using the test when any of the
expected values are less than 5. This warning implies that the use of chi-square test is
restricted to large samples.
19
CONCLUSION
Chi-square test tells us whether the classifications on a given population are
dependent on each other or not. However, it is important to stress that the establishment
of statistical association by means of chi-square does not necessarily imply any causal
relationship between the attributes being compared, but it does indicate that the reason for
the association is worth investigating. For example, if further investigation carried out on
the case of people's death from tuberculosis, we might have found out that the reason
why the number of men died from tuberculosis of the respiratory system is higher than
the women is the fact that there are more smokers in men than in women.
20
APPENDIX
21