Chi Square & Correlation

Nonparametric Test: Chi-Square (χ²)

Used when too many t-test assumptions are violated:
- The sample is too small to reflect the population.
- The data are not continuous and thus not appropriate for parametric tests based on normal distributions.

χ² is another way of showing that some pattern in the data was not created randomly by chance.

χ² can be one- or two-dimensional.

χ² deals with the question of whether what we observed is different from what is expected.

Calculating χ²

What would a contingency table look like if no relationship existed between gender and voting for Bush (i.e., statistical independence)?

                   Male   Female   Total
Voted for Bush      25      25       50
Voted for Kerry     25      25       50
Total               50      50      100

NOTE: INDEPENDENT VARIABLES GO ON COLUMNS AND DEPENDENT VARIABLES ON ROWS

Calculating χ²

What would a contingency table look like if a perfect relationship existed between gender and voting for Bush?

                   Male   Female
Voted for Bush      50       0
Voted for Kerry      0      50

Calculating the expected value

\[ \hat{f}_{ij} = \frac{(f_i)(f_j)}{N} \]

\hat{f}_{ij} = the expected frequency of the cell in the ith row and jth column
f_i = the total in the ith row marginal
f_j = the total in the jth column marginal
N = the grand total, or sample size for the entire table

Expected Voted for Bush = (50 x 50) / 100 = 25
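The expected-value formula can be sketched in a few lines of Python. This is only an illustration: the function name `expected_frequencies` and the list-of-rows layout are assumptions made for the example, not something the slides prescribe.

```python
def expected_frequencies(observed):
    """Return the expected counts under statistical independence.

    observed is a list of rows. Each expected cell is
    (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Perfect-relationship table: rows = Bush, Kerry; columns = Male, Female.
observed = [[50, 0], [0, 50]]
expected = expected_frequencies(observed)
print(expected)  # [[25.0, 25.0], [25.0, 25.0]]
```

Every cell gets an expected count of 25, matching the hand calculation of (50 x 50) / 100.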


Again, the basic question is whether what you observe in some given data was created by chance or through some systematic process.

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

O = observed frequency
E = expected frequency


The null hypothesis we are testing here is that the proportions of occurrences in each category are equal to each other (H0: B = K). Our research hypothesis is that they are not equal (Ha: B ≠ K).

Given the sample size, how many cases could we expect in each category (n / number of categories)? Comparing the obtained value with the critical value gives us a coefficient and a probability of seeing results like these if only chance were at work.

Let's do a χ²:

(50 - 25)² / 25 = 25
(0 - 25)² / 25 = 25
(0 - 25)² / 25 = 25
(50 - 25)² / 25 = 25

χ² = 100

                   Male   Female
Voted for Bush      50       0
Voted for Kerry      0      50

What would χ² be when there is statistical independence?
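Both the hand calculation for the perfect-relationship table and the question about independence can be checked with a short sketch. The function name `chi_square` is an assumption for illustration, and the code assumes no row or column total is zero (otherwise an expected count of 0 would cause a division by zero).

```python
def chi_square(observed):
    """Pearson chi-square: sum of (O - E)^2 / E over all cells."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count
            total += (o - e) ** 2 / e
    return total


perfect = chi_square([[50, 0], [0, 50]])        # perfect relationship
independent = chi_square([[25, 25], [25, 25]])  # statistical independence
print(perfect, independent)  # 100.0 0.0
```

When observed and expected counts coincide in every cell, each term is zero, so the statistic is 0 under perfect independence.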

Let’s corroborate with SPSS

Chi-Square Tests

                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             .000(b)   1     1.000
Continuity Correction(a)       .000      1     1.000
Likelihood Ratio               .000      1     1.000
Fisher's Exact Test                                          1.000        .579
Linear-by-Linear Association   .000      1     1.000
N of Valid Cases               100

a. Computed only for a 2x2 table.
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 25.00.

Chi-Square Tests

                               Value        df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                  (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             100.000(b)   1     .000
Continuity Correction(a)       96.040       1     .000
Likelihood Ratio               138.629      1     .000
Fisher's Exact Test                                             .000         .000
Linear-by-Linear Association   99.000       1     .000
N of Valid Cases               100

a. Computed only for a 2x2 table.
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 25.00.

Testing for significance

How do we know if the relationship is statistically significant?

- We need to know the degrees of freedom: df = (R - 1)(C - 1) = (2 - 1)(2 - 1) = 1
- We go to the χ² distribution to look for the critical value (CV = 3.84 at the .05 level)
- We compare the obtained χ² with the critical value

                   Male   Female
Voted for Bush      20      30
Voted for Kerry     30      20

χ² = 4

Because 4 > 3.84, we conclude that the relationship between gender and voting is statistically significant.
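As a rough cross-check of the table lookup, the p-value for a χ² with one degree of freedom can be computed with the Python standard library alone. The sketch below relies on the fact that a χ² variable with df = 1 is the square of a standard normal variable. The function name is hypothetical, and the shortcut applies only to df = 1.

```python
import math


def chi_square_p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For df = 1, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2))


CRITICAL_VALUE = 3.84  # .05 level, df = 1 (the value used in the slides)

chi2 = 4.0  # obtained for the 20 / 30 / 30 / 20 table
p = chi_square_p_value_df1(chi2)
print(chi2 > CRITICAL_VALUE)  # True: reject the null hypothesis
print(round(p, 4))            # p-value, a little under .05
```

Comparing the statistic with the critical value and comparing the p-value with .05 are two views of the same decision.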

When is χ² appropriate to use?

χ² is perhaps the most widely used statistical technique for analyzing nominal and ordinal data:
- Nominal by nominal (gender and voting preference)
- Nominal by ordinal (gender and opinion of W)

χ² can also be used with larger tables

Opinion of Bush

               MALE         FEMALE       Total
Favorable      40 (19.4)     5 (15.8)      45
Indifferent    10 (.88)     20 (.72)       30
Unfavorable    15 (8.6)     55 (6.9)       70
Total          65           80            145

Values in parentheses are each cell's contribution to χ², (O - E)² / E.

χ² = 52.3. Do we reject the null hypothesis?
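The 3 x 2 table can be worked through the same way. The sketch below recomputes each cell's contribution and the total. With unrounded contributions the total comes out at about 52.4, slightly above the 52.3 obtained from the rounded cell values. The .05 critical value of 5.99 for df = 2 and the closed-form p-value are standard results brought in for illustration, not taken from the slides.

```python
import math

observed = [
    [40, 5],    # Favorable:   male, female
    [10, 20],   # Indifferent: male, female
    [15, 55],   # Unfavorable: male, female
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Contribution of each cell: (O - E)^2 / E
contributions = [
    [(o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
    for row, r in zip(observed, row_totals)
]
chi2 = sum(sum(row) for row in contributions)

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (3 - 1)(2 - 1) = 2
CRITICAL_VALUE_DF2 = 5.99                          # .05 level, df = 2
p_value = math.exp(-chi2 / 2)                      # exact upper tail, valid for df = 2 only

print(round(chi2, 1), df, chi2 > CRITICAL_VALUE_DF2)
```

The statistic is far beyond the critical value, so the null hypothesis of independence between gender and opinion is rejected.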

Correlation (does not mean causation)

- We want to know how two variables are related to each other
  - Does eating doughnuts affect weight?
  - Does spending more hours studying increase test scores?
- Correlation means how much two variables overlap with each other

Types of Correlations

X (cause)                Y (effect)        Correlation    Values
Increases                Increases         Positive       0 to 1
Decreases                Decreases         Positive       0 to 1
Increases                Decreases         Negative       -1 to 0
Decreases                Increases         Negative       -1 to 0
Increases or decreases   Does not change   Independent    0

Conceptualizing Correlation

Measuring Development

[Figure: GDP and POP WEIGHT, weak correlation; GDP and EDUCATION, strong correlation]

Correlation will be associated with what type of validity?

Correlation Coefficient

\[ r_{xy} = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} \]

Home Value & Square Footage

         Log value   Log sqft   value²     sqft²      Val x sqft
         5.13        4.02       26.3169    16.1604    20.6226
         5.2         4.54       27.04      20.6116    23.608
         4.53        3.53       20.5209    12.4609    15.9909
         4.79        3.8        22.9441    14.44      18.202
         4.78        3.86       22.8484    14.8996    18.4508
         4.72        4.17       22.2784    17.3889    19.6824
Total    29.15       23.92      141.95     95.96      116.56

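Correlation Coefficient

\[ r_{xy} = \frac{(6 \times 116.56) - (29.15)(23.92)}{\sqrt{\left[(141.95 \times 6) - (29.15)^2\right]\left[(95.96 \times 6) - (23.92)^2\right]}} = \frac{2.09}{2.66} \approx .78 \]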
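The computational formula can be sketched directly in Python using the six log value and log square footage pairs from the table. The function name `pearson_r` is an assumption for illustration. Because the code works from the unrounded sums, the result is about .778, in line with the SPSS output.

```python
import math

value = [5.13, 5.2, 4.53, 4.79, 4.78, 4.72]  # log home value (X)
sqft = [4.02, 4.54, 3.53, 3.8, 3.86, 4.17]   # log square footage (Y)


def pearson_r(x, y):
    """Pearson correlation via the computational (raw-sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(value, sqft)
print(round(r, 3))
```

The correlation is symmetric, so swapping the two lists gives the same coefficient.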

Correlations

                               VALUE    SQFT
VALUE   Pearson Correlation    1        .778
        Sig. (2-tailed)        .        .068
        N                      6        6
SQFT    Pearson Correlation    .778     1
        Sig. (2-tailed)        .068     .
        N                      6        6

Rules of Thumb

Size of correlation coefficient   General interpretation
.8 - 1.0                          Very strong
.6 - .8                           Strong
.4 - .6                           Moderate
.2 - .4                           Weak
.0 - .2                           Very weak or no relationship
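The rules of thumb translate into a small lookup. This is a sketch only: the function name is hypothetical, and because the bands share their endpoints, the code assumes that a value sitting exactly on a boundary (such as .8) belongs to the stronger band. The sign is ignored, since strength depends on the absolute size of r.

```python
def interpret_correlation(r):
    """Label the strength of r using the rule-of-thumb bands.

    Assumption: boundary values (.2, .4, .6, .8) go to the stronger band.
    """
    size = abs(r)
    if size > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if size >= 0.8:
        return "Very strong"
    if size >= 0.6:
        return "Strong"
    if size >= 0.4:
        return "Moderate"
    if size >= 0.2:
        return "Weak"
    return "Very weak or no relationship"


print(interpret_correlation(0.778))  # Strong
```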

Multiple Correlation Coefficients

Correlations

                               VALUE     SQFT      BTH       BDR
VALUE   Pearson Correlation    1         .784**    .775**    .708**
        Sig. (2-tailed)        .         .000      .000      .000
        N                      46        46        46        46
SQFT    Pearson Correlation    .784**    1         .669**    .654**
        Sig. (2-tailed)        .000      .         .000      .000
        N                      46        46        46        46
BTH     Pearson Correlation    .775**    .669**    1         .895**
        Sig. (2-tailed)        .000      .000      .         .000
        N                      46        46        46        46
BDR     Pearson Correlation    .708**    .654**    .895**    1
        Sig. (2-tailed)        .000      .000      .000      .
        N                      46        46        46        46

**. Correlation is significant at the 0.01 level (2-tailed).

Limitations of correlation coefficients

- They tell us how strongly two variables are related
- However, r coefficients are limited because they cannot tell us anything about:
  1. Causation between X and Y
  2. Marginal impact of X on Y
  3. What percentage of the variation of Y is explained by X
  4. Forecasting
- Because of the above, Ordinary Least Squares (OLS) is most useful
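To make the contrast concrete, here is a minimal sketch of a simple OLS fit on the six home value and square footage pairs used earlier in these slides. It is an illustration under stated assumptions, not the procedure SPSS runs: the function name `ols_fit` and the forecast input of 4.0 are hypothetical. The slope estimates the marginal impact of X on Y, R² gives the share of variation explained, and the fitted line supports forecasting.

```python
value = [5.13, 5.2, 4.53, 4.79, 4.78, 4.72]  # log home value (Y)
sqft = [4.02, 4.54, 3.53, 3.8, 3.86, 4.17]   # log square footage (X)


def ols_fit(x, y):
    """Least-squares line y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = ols_fit(sqft, value)

# R squared: 1 - (residual sum of squares / total sum of squares)
mean_value = sum(value) / len(value)
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(sqft, value))
ss_tot = sum((y - mean_value) ** 2 for y in value)
r_squared = 1 - ss_res / ss_tot

forecast = intercept + slope * 4.0  # predicted log value at log sqft = 4.0 (hypothetical input)
print(round(slope, 3), round(r_squared, 3), round(forecast, 2))
```

With a single predictor, R² equals the square of the correlation coefficient, which ties the regression output back to r.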

Do you have the BLUES?

- B for Best (minimum error)
- L for Linear (the form of the relationship)
- U for Unbiased (does the parameter truly reflect the effect?)
- E for Estimator

Home value and sq. feet

[Figure: scatterplot of VALUE (y-axis, 4.5 to 5.3) against SQFT (x-axis, 3.4 to 4.6) with a fitted line]

Does the above line meet the BLUE criteria?
