
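4 normal probability plots at once

# 2 x 2 grid, one plot per group
par(mfrow=c(2,2))
for(i in 1:4) {
  # column 1 holds the data, column 2 holds the group number (1 to 4)
  qqnorm(dataframe[,1][dataframe[,2]==i], ylab="Data quantiles")
  title(paste("yourchoice", i, sep=""))
}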

These plots can be produced by going to “File”, then “New”, then “Script File”. Paste the commands into the script file window and press F10; the four plots are produced automatically.

4 histograms all at once

Same as above, but instead of qqnorm, use hist, and you only need one column rather than dataframe[,1] and dataframe[,2]. Also, don’t forget to change your label.
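One possible reading of the note above, as a minimal sketch: it assumes a hypothetical data frame `mydata` with four numeric columns (simulated here purely for illustration), and draws one histogram per column. The names `mydata` and the axis label are assumptions, not part of the lab.

```r
# Hypothetical data frame with four numeric columns (simulated, illustration only)
set.seed(1)
mydata <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50), d = rnorm(50))

par(mfrow = c(2, 2))                       # 2 x 2 grid of plots
for (i in 1:4) {
  hist(mydata[, i],                        # one column per histogram
       main = paste("yourchoice", i, sep = ""),
       xlab = "Data values")               # label changed from the qqnorm version
}
```

If the data are instead stored as one data column plus a group column, the subsetting from the qqnorm loop can be kept and only the plotting function swapped.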

Lab: Chi-Squared Test (X²) Lack of Fit

November 10, 2000

History

Invented in 1900 by the English statistician Karl Pearson

Oldest inference procedure still used in its original form

The X² Test

Used when you have data values for two categorical variables

The table of counts is also called a two-way table

For example: men/women and NSOE track; regenerated seaweed (yes/no) and access level (limpet only, limpet and fish, etc.)

Example: Why do Men and Women Participate in Sports?

Desire to win or do better than others: called social comparison

Desire to improve one’s skills or to do one’s best: called mastery

Data collected from 67 male and 67 female undergraduate students at a large university

Survey given asking about students’ sports goals.

Students were all categorized as either high or low with regard to both of the questions:
– high or low social comparison
– high or low mastery

Duda, Joan L., Leisure Sciences, 10 (1988), pp. 95-106

Groups

This leads to four groups:
– High social comparison, high mastery (HSC-HM)
– High social comparison, low mastery (HSC-LM)
– Low social comparison, high mastery (LSC-HM)
– Low social comparison, low mastery (LSC-LM)

We want to compare this for men and women.

Observed Counts for Sports Goals

Goal     Female  Male
HSC-HM       14    31
HSC-LM        7    18
LSC-HM       21     5
LSC-LM       25    13
Total        67    67

1. Add Totals

Goal     Female  Male  Total
HSC-HM       14    31     45
HSC-LM        7    18     25
LSC-HM       21     5     26
LSC-LM       25    13     38
Total        67    67    134

Column: in this case, the population the observation comes from (female or male)

Row: the categorical response variable (sports goal)

Grand total: 134




A Cell

A table with r rows and c columns contains r x c cells

X² is really an analysis of 5 things in this table:
– Frequency (actual count)
– Percent of overall total
– Percent of row
– Percent of column
– Expected count




Frequency: Just the cell count




Overall Percent: Cell count divided by grand total

14/134=0.104. That is, 10.4% of all those studied were HSC-HM and female.




Row Percent: Cell count divided by row total

14/45=0.311. That is, of all those students reporting HSC-HM, 31% were female.




Column Percent: Cell count divided by column total

14/67=0.209. That is, of all female student participants, 21% were HSC-HM.
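The three kinds of percent can be computed for every cell at once. The following is a minimal sketch in R using the observed counts from the table; the object names (`counts`, `overall_pct`, `row_pct`, `col_pct`) are chosen here for illustration.

```r
# Observed counts from the sports goals table (rows = goal, columns = sex)
counts <- matrix(c(14, 31,
                    7, 18,
                   21,  5,
                   25, 13),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Goal = c("HSC-HM", "HSC-LM", "LSC-HM", "LSC-LM"),
                                 Sex  = c("Female", "Male")))

overall_pct <- prop.table(counts)              # each cell / grand total
row_pct     <- prop.table(counts, margin = 1)  # each cell / its row total
col_pct     <- prop.table(counts, margin = 2)  # each cell / its column total

# The female HSC-HM cell under each definition
round(overall_pct["HSC-HM", "Female"], 3)      # 14/134
round(row_pct["HSC-HM", "Female"], 3)          # 14/45
round(col_pct["HSC-HM", "Female"], 3)          # 14/67
```

`addmargins(counts)` would add the row, column, and grand totals to the table of counts.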

Expected Count

Coming later to a slide near you...

These percents are useful in graphical analysis.

Overall, row, and column percent can be calculated for each cell

Then questions of interest can be asked

We are interested in the effect of sex on sports goals. In this case, we would examine the column percents

Column percents for sports goals

Goal     Female  Male
HSC-HM       21    46
HSC-LM       10    27
LSC-HM       31     7
LSC-LM       37    19
Total       100   100

[Figure: bar chart of the column percents for Female and Male; vertical axis from 0 to 50; one bar per goal group: HSC-HM, HSC-LM, LSC-HM, LSC-LM]
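A chart of this kind can be drawn from the column percents. This is a sketch only, not necessarily how the original figure was made; the titles, axis label, and object names are assumptions.

```r
# Observed counts from the sports goals table (rows = goal, columns = sex)
counts <- matrix(c(14, 31, 7, 18, 21, 5, 25, 13),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Goal = c("HSC-HM", "HSC-LM", "LSC-HM", "LSC-LM"),
                                 Sex  = c("Female", "Male")))

# Column percents: each cell as a percent of its column (sex) total
col_pct <- 100 * prop.table(counts, margin = 2)

# Side-by-side bars: one group of four goal bars for each sex
mids <- barplot(col_pct, beside = TRUE, legend.text = TRUE,
                ylim = c(0, 50), ylab = "Percent of column",
                main = "Column percents for sports goals")
```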

Surprise, surprise: we want to ask whether these apparently obvious differences are significant.

Can these differences be attributed to chance?

Calculate the chi-square statistic and compare it to a chi-square distribution

Determine the p-value

A low p-value means we reject our null hypothesis (sound familiar?)

The hypotheses: Null

No association exists between our row and our column variables
– No association exists between sex and sports goals
– The distributions of sports goals in the male and female populations are the same.

The hypotheses: Alternative

An association exists between the row and column variables
– No particular direction (not one- or two-sided)
– The distributions of sports goals in the male and female populations are not all the same.
– Includes many kinds of possible associations
– For example: “Men rate social comparison higher as a goal than do women”

OK: Now back to the Expected Count

If the null hypothesis were true, what would the count in each cell be?

For women in the HSC-HM cell, it would work like this:
– 33.6% of all respondents are HSC-HM
– We have 67 women
– So, if no sex difference exists (our null), we would expect that 33.6% of our 67 women would be HSC-HM --> 22.5 women.



Goal     Female  Male  Total
HSC-HM       14    31     45
HSC-LM        7    18     25
LSC-HM       21     5     26
LSC-LM       25    13     38
Total        67    67    134

Expected Count

1. 45/134=33.6% of all respondents are HSC-HM.

2. 33.6% of 67 women is 22.5.
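The same two steps apply to every cell: expected count = (row total x column total) / grand total. A minimal sketch in R, with object names chosen here for illustration:

```r
# Observed counts from the sports goals table (rows = goal, columns = sex)
counts <- matrix(c(14, 31, 7, 18, 21, 5, 25, 13),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Goal = c("HSC-HM", "HSC-LM", "LSC-HM", "LSC-LM"),
                                 Sex  = c("Female", "Male")))

# Expected count in each cell under the null hypothesis:
# (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

expected["HSC-HM", "Female"]   # (45 * 67) / 134
expected                       # full table of expected counts
```

Because there are equal numbers of women and men here, the expected counts are the same in both columns.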

Finally: The Chi-Squared Statistic Itself

Compare the entire set of observed counts with the set of expected counts.
– Take the difference in each cell between observed and expected
– Square each difference
– Normalize these (divide by the expected count)
– Sum over all cells.

The Formula:

$$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

Large values of X² provide evidence against the null hypothesis

A chi-square distribution is used to obtain the p-value

Degrees of freedom are (r-1)(c-1)
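The statistic and its degrees of freedom can be computed by hand from the observed and expected counts. A minimal sketch in R for the sports goals data; the object names are chosen here for illustration.

```r
# Observed counts from the sports goals table (rows = goal, columns = sex)
counts <- matrix(c(14, 31, 7, 18, 21, 5, 25, 13),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Goal = c("HSC-HM", "HSC-LM", "LSC-HM", "LSC-LM"),
                                 Sex  = c("Female", "Male")))

# Expected counts under the null: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Difference, square, divide by expected, sum over all cells
x2 <- sum((counts - expected)^2 / expected)

# Degrees of freedom: (r - 1)(c - 1)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

x2
df
```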

In this case...

Chi-squared = 24.898 on 3 df.

The p-value is less than 0.0005.

The chance of obtaining a chi-squared value greater than or equal to this due to chance alone is very small

Clear evidence against the null hypothesis

Strong evidence that female and male students have different distributions of sports goals.
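In R the whole test can be run with the built-in `chisq.test` function. This is a sketch of one way to reproduce the numbers above; it is not necessarily the procedure used in the lab.

```r
# Observed counts from the sports goals table (rows = goal, columns = sex)
counts <- matrix(c(14, 31, 7, 18, 21, 5, 25, 13),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Goal = c("HSC-HM", "HSC-LM", "LSC-HM", "LSC-LM"),
                                 Sex  = c("Female", "Male")))

test <- chisq.test(counts)   # Pearson's chi-squared test on the two-way table

test$statistic               # the X-squared value
test$parameter               # degrees of freedom
test$p.value                 # upper-tail area of the chi-square distribution

# The same p-value directly from the chi-square distribution
pchisq(unname(test$statistic), df = unname(test$parameter), lower.tail = FALSE)
```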

Is that all you can say?

No, you can and should combine the test with a description that shows the relationship.
– Percents in our earlier table and our graph
– Summary comments: the percent of males in each of the HSC goal classes is more than twice the percent of females.
– The HSC-HM group contains 46% of the males, but only 21% of the females
– The HSC-LM group contains 27% of the males and only 10% of the females
– We conclude that males are more likely to be motivated by social comparison goals and females are more likely to be motivated by mastery goals.

Important to remember:

The chi-square distribution is only an approximation to the distribution of the X² statistic; the approximation becomes more accurate as the cell counts increase.

For 2 x 2 tables, the expected count in each of the 4 cells must be five or higher.

For tables larger than 2 x 2, the average of the expected counts must be 5 or higher, and the smallest expected count must be 1 or more.
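These rules of thumb can be checked from the expected counts before trusting the p-value. A minimal sketch in R for the sports goals table; the variable names (`is_2x2`, `rule_ok`) are assumptions made for illustration.

```r
# Observed counts from the sports goals table (rows = goal, columns = sex)
counts <- matrix(c(14, 31, 7, 18, 21, 5, 25, 13),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Goal = c("HSC-HM", "HSC-LM", "LSC-HM", "LSC-LM"),
                                 Sex  = c("Female", "Male")))

# chisq.test also returns the table of expected counts
expected <- chisq.test(counts)$expected

is_2x2 <- all(dim(counts) == c(2, 2))

rule_ok <- if (is_2x2) {
  all(expected >= 5)                          # 2 x 2: every expected count 5 or higher
} else {
  mean(expected) >= 5 && min(expected) >= 1   # larger tables: average >= 5, smallest >= 1
}

rule_ok
```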

Important to remember:

This is sometimes called the chi-squared test for homogeneity or the chi-squared test of independence.

Although this is one of the most widely used of statistical tools, it is also one of the least informative.
– The only thing you produce is a p-value, and there is no associated parameter to describe the degree of dependence
– The alternative hypothesis is very general (that rows and columns are not independent)
