MULTIPLE REGRESSION ANALYSIS

ENGR. DIVINO AMOR P. RIVERA, STATISTICAL COORDINATION OFFICER I, NSO LA UNION
CONTENTS
• Table for the types of Multiple Regression Analysis
• Ordinary Least Square (OLS)
• Multiple Linear Regression with Dummy Variables
• Multicollinearity
• Goodness of Fit of Multiple Linear Regression (MLR)
• Residual Analysis
• Multiple Linear Regression Method / Procedure
• Cross Validation
• Binomial or Binary Logistic Regression
• Multinomial Logistic Regression
• Measure of Fit
NSO LA UNION
Types of Multiple Regression Analysis

• Continuous dependent variable, all continuous independent variables: Ordinary Least Squares (OLS), also known as Multiple Linear Regression
• Continuous dependent variable, continuous independents with other variables categorical: Multiple Linear Regression with Dummy Variables
• Dependent variable with 2 categories, independents continuous with other variables categorical, or purely categorical: Binary or Binomial Logistic Regression
• Dependent variable with more than 2 categories, independents continuous with other variables categorical, or purely categorical: Multinomial Logistic Regression
Ordinary Least Squares (OLS), also known as Multiple Linear Regression: Assumptions and Conditions

1. Linearity Assumption. If the model is true, then y is linearly related to each of the x's.

Straight enough condition: scatterplots of y against each of the predictors are reasonably straight. It is a good idea to plot the residuals against the predicted values and check for patterns, especially for bends or other nonlinearities.
2. Independence Assumption. The errors in the true underlying regression model must be mutually independent. There is no way to be sure that the independence assumption is true, so check it where you can.

Randomness condition: the data should arise from a random sample or a randomized experiment. Check the regression residuals for evidence of patterns, trends, or clumping.

3. Equal Variance Assumption. The variability of y should be about the same for all values of every x. Check using a scatterplot.
4. Normality Assumption. We assume that the errors around the regression model at any specified values of the x-variables follow a normal model.

Nearly Normal condition: look at a histogram or Normal probability plot of the residuals; the Normal probability plot should be fairly straight. The Normality Assumption becomes less important as the sample size grows.
Ordinary Least Squares (OLS), also known as Multiple Linear Regression: Steps in applying the OLS Procedure

1. Variables. Name the variables; classify the variables with respect to level of measurement and kind of variable.
2. Plan. Check the conditions.
3. Check multicollinearity. The approach of Lewis-Beck (1980) is to regress each independent variable on all other independent variables, so that the relationship of each independent variable with all of the other independent variables is considered, or to apply correlation procedures to all independent variables with respect to each other. If r is nearly 1.0, or, using the rule of thumb, r is more than 0.7, multicollinearity exists.
Using the appropriate Correlation Analysis Procedure

Level of measurement of the two variables and the matching correlation procedure:

• Nominal (Discrete) with Nominal (Discrete): Phi Coefficient / Cramer's V or Contingency coefficient
• Nominal (Discrete) with Ordinal (Categorical): Phi Coefficient / Cramer's V or Contingency coefficient
• Nominal (Discrete) with Interval/Ratio (Continuous): Correlation ratio (Eta Square)
• Ordinal (Categorical) with Ordinal (Categorical): Phi Coefficient / Cramer's V or Contingency coefficient / Gamma / Spearman Rank-Order Correlation
• Ordinal (Categorical) with Interval/Ratio (Continuous): Correlation ratio (Eta Square)
• Interval/Ratio (Continuous) with Interval/Ratio (Continuous): Pearson Product-Moment Correlation
4. Choose your method.
5. Interpretation.
Multiple Linear Regression with Dummy Variables
Because categorical predictor variables cannot be entered directly into a regression model and be meaningfully interpreted, some other method of dealing with information of this type must be developed.
In general, a categorical variable with k levels will be transformed into k-1 variables each with two levels. For example, if a categorical variable had six levels, then five dichotomous variables could be constructed that would contain the same information as the single categorical variable. Dichotomous variables have the advantage that they can be directly entered into the regression model. The process of creating dichotomous variables from categorical variables is called dummy coding.
The researcher then follows the steps of the OLS procedure.
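Dummy coding can be sketched in a few lines of Python (an added illustration, not part of the original slides; the variable and level names are hypothetical):

```python
def dummy_code(values, levels):
    # Turn a categorical variable with k levels into k-1 dichotomous
    # (0/1) variables; levels[0] serves as the reference category.
    rest = levels[1:]
    return [[1 if v == level else 0 for level in rest] for v in values]

# Hypothetical 3-level categorical predictor -> 2 dummy variables:
region = ["north", "south", "east", "south"]
codes = dummy_code(region, ["north", "south", "east"])
# "north" (reference) -> [0, 0], "south" -> [1, 0], "east" -> [0, 1]
```

The reference level gets all zeros, so its effect is absorbed into the intercept; each dummy coefficient is then interpreted relative to that reference category.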
Multicollinearity

In some cases, multiple regression results may seem paradoxical. Even though the overall P value is very low, all of the individual P values are high. This means that the model fits the data well, even though none of the X variables has a statistically significant impact on predicting Y. How is this possible? When two X variables are highly correlated, both convey essentially the same information. In this case, neither may contribute significantly to the model after the other one is included. But together they contribute a lot. If you removed both variables from the model, the fit would be much worse. So the overall model fits the data well, but neither X variable makes a significant contribution when it is added to your model last. When this happens, the X variables are collinear and the results show multicollinearity.
To assess multicollinearity, InStat tells you how well each independent (X) variable is predicted from the other X variables. The results are shown both as an individual R square value (distinct from the overall R square of the model) and a Variance Inflation Factor (VIF). When those R square and VIF values are high for any of the X variables, your fit is affected by multicollinearity. Multicollinearity can also be detected by the Condition Index.

Belsley, Kuh and Welsch (1980) construct the Condition Indices as the square roots of the ratio of the largest eigenvalue to each individual eigenvalue, √(λmax / λj). The Condition Number of the X matrix is defined as the largest Condition Index. When this number is large, the data are said to be ill conditioned. A Condition Index of 30 to 300 indicates moderate to strong collinearity. A collinearity problem occurs when a component associated with a high condition index contributes strongly to the variance of two or more variables.
Goodness of Fit in Multiple Linear Regression

In assessing the goodness of fit of a regression equation, a slightly different statistic, called R²-adjusted or R²adj, is calculated:

R²adj = 1 − (1 − R²)(N − 1) / (N − n − 1)

where N is the number of observations in the data set (usually the number of people or households) and n is the number of independent variables or regressors. This allows for the extra regressors. R²adj will always be lower than R² if there is more than one regressor.
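The adjustment can be checked numerically. The Python sketch below is an added illustration, using the R² = .845 and df2 = 57 figures that appear in the SPSS worked example later in these slides (so N = 59 and n = 1):

```python
def adjusted_r_square(r2, N, n):
    # R^2 adjusted for n regressors and N observations:
    # 1 - (1 - R^2)(N - 1) / (N - n - 1)
    return 1 - (1 - r2) * (N - 1) / (N - n - 1)

# Model 1 of the worked example: R^2 = .845 with N = 59 cases and n = 1:
r2_adj = adjusted_r_square(0.845, 59, 1)   # about .842, matching the output
```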
The VIF associated with any X-variable is found by regressing it on all other X-variables. The resulting R² is used to calculate that variable's VIF. The VIF for any Xi represents the variable's influence on multicollinearity. VIF is computed as:

VIF(Xi) = 1 / (1 − Ri²)
In general, multicollinearity is not considered a significant problem unless the VIF of a single Xi measures at least 10, or the sum of the VIF’s for all Xi is at least 10.
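As an added numerical sketch (not part of the original slides), the VIF formula and the 10-or-more rule of thumb can be written as:

```python
def vif(r2_i):
    # VIF(X_i) = 1 / (1 - R_i^2), where R_i^2 comes from regressing X_i
    # on all of the other X-variables.
    return 1.0 / (1.0 - r2_i)

# R_i^2 = 0.68 gives VIF = 3.125 (the value reported for "diam" in the
# SPSS output later in these slides); R_i^2 = 0.9 reaches the VIF >= 10
# threshold for a multicollinearity problem.
problematic = vif(0.9) >= 10
```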
Residual Analysis (Durbin-Watson Test Statistic)

The residuals are defined as the differences ei = Yi − Ŷi, where Y is the observed value of the dependent variable and Ŷ is the corresponding fitted value obtained from the fitted regression model. In the regression analysis, the residuals are assumed to have mean zero and a constant variance σ², and to follow a normal distribution. If the fitted model is correct, the residuals should exhibit tendencies that tend to confirm the assumptions of the model made, or at least should not exhibit a denial of the assumptions (Neter et al., 1983).

Test statistic (DW):

d = Σ (ei − ei−1)² / Σ ei²
The Durbin-Watson coefficient, d, tests for autocorrelation. The value of d ranges from 0 to 4. Values close to 0 indicate extreme positive autocorrelation; close to 4 indicates extreme negative autocorrelation; and close to 2 indicates no serial autocorrelation. As a rule of thumb, d should be between 1.5 and 2.5 to indicate independence of observations. Positive autocorrelation means standard errors of the b coefficients are too small. Negative autocorrelation means standard errors are too large.
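A minimal Python sketch of the statistic (added for illustration):

```python
def durbin_watson(residuals):
    # d = sum over i >= 2 of (e_i - e_{i-1})^2, divided by sum of e_i^2
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den

# Perfectly alternating residuals show strong negative autocorrelation,
# pushing d toward 4:
d = durbin_watson([1, -1, 1, -1, 1, -1])
```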
Multiple Linear Regression Method or Procedure
1. Forward Selection Multiple Regression Method
The computer will choose the variable that gives the largest regression sum of squares when performing a simple linear regression with y, or equivalently, that gives the largest value of R². It then chooses the variable that, when inserted in the model, gives the largest increase in R². This continues until the most recent variable inserted fails to induce a significant increase in the explained regression. Such an increase can be determined at each step by using the appropriate F-test or t-test.
2. Backward Elimination Multiple Regression Method
Fit a model containing all of the variables and find the variable that gives the smallest value of the regression sum of squares adjusted for the others; if its contribution is not significant, it is removed. Then fit a regression equation using the remaining variables and again find the variable with the smallest value of the regression sum of squares adjusted for the other variables remaining; once again, if it is not significant, the variable is removed from the model. At each step the variance (s²) used in the F-test is the error mean square for the regression. This process is repeated until at some step the variable with the smallest adjusted regression sum of squares results in a significant F-value for some predetermined significance level (for example, P-value to enter less than 0.05 and P-value to remove greater than 0.10).
3. Stepwise Multiple Regression Method

It is accomplished with a slight but important modification of the forward selection procedure. The modification involves further testing at each stage to ensure the continued effectiveness of variables that had been inserted into the model at an earlier stage. This represents an improvement over forward selection, since it is quite possible that a variable entering the regression equation at an early stage might have been rendered unimportant or redundant because of relationships that exist between it and other variables entering at later stages. Therefore, at a stage in which a new variable has been entered into the regression equation through a significant increase in R² as determined by the F-test, all the variables already in the model are subjected to F-tests (or, equivalently, t-tests) in light of the new variable, and are deleted if they do not display a significant F-value. The procedure is continued until a stage is reached in which no additional variables can be inserted or deleted (Walpole and Myers, 1989).
Cross-Validation

The multiple correlation coefficient of a regression equation can be used to estimate the accuracy of a prediction equation. However, this method usually overestimates the actual accuracy of the prediction. Cross-validation is used to test the accuracy of a prediction equation without having to wait for new data. The first step in this process is to divide the sample randomly into groups of equal size. Cross-validation then involves applying the prediction equation derived from one group to the other group and determining how many times it correctly discriminated between those who would and those who would not.
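The split-sample idea can be sketched as follows (an added Python illustration; the mean-based `fit` and squared-error `score` functions are hypothetical stand-ins for a fitted prediction equation):

```python
import random

def split_half_cross_validate(data, fit, score, seed=0):
    # Randomly split the sample into two equal groups, fit the prediction
    # equation on the first group, and score it on the held-out group.
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, held_out = shuffled[:half], shuffled[half:]
    return score(fit(train), held_out)

# Hypothetical use: "fit" returns the training mean of y; "score" is the
# mean squared prediction error on the held-out half.
data = [(x, 2.0 * x) for x in range(10)]
mse = split_half_cross_validate(
    data,
    fit=lambda rows: sum(y for _, y in rows) / len(rows),
    score=lambda m, rows: sum((y - m) ** 2 for _, y in rows) / len(rows),
)
```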
Binomial Logistic Regression (Binary Logistic Regression)
Binomial logistic regression is suited to the case where the dependent variable is a dichotomy and the independents are of any type (scale or categorical). This statistical tool is employed to predict a dependent variable on the basis of independents and to determine the percent of variance in the dependent variable explained by the independents; to rank the relative importance of independents; to assess interaction effects; and to understand the impact of covariate control variables (Miller and Volker, 1985).
Logistic regression applies maximum likelihood estimation after transforming the dependent into a logit variable (the natural log of the odds of the dependent occurring or not). In this way, logistic regression estimates the probability of a certain event occurring. (Hosmer D. & Lemeshow S., 2000)
After fitting the logistic regression model, questions about the suitability of the model, the variables to be retained, and goodness of fit are all considered (Pampel, 2000).
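The logit transformation described above can be sketched in a few lines of Python (an added illustration, not part of the original slides):

```python
import math

def logit(p):
    # Natural log of the odds of the event occurring: ln(p / (1 - p)).
    return math.log(p / (1 - p))

def inverse_logit(z):
    # Probability implied by a linear predictor z = b0 + b1*x1 + ...
    return 1 / (1 + math.exp(-z))

# The two transformations undo each other:
p = inverse_logit(logit(0.8))   # 0.8 again
```

Maximum likelihood estimation searches for the coefficients b0, b1, … that make the observed outcomes most probable under this model.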
Logistic regression does not face strict assumptions of multivariate normality and equal variance-covariance matrices across groups, conditions that are not met in all situations. Furthermore, logistic regression accommodates all types of independent variables (metric and non-metric).

Although assumptions regarding the distributions of predictors are not required for logistic regression, multivariate normality and linearity among the predictors may enhance power, because a linear combination of predictors is used to form the exponent.

Several ways of testing the goodness of fit were run to further prove that the logistic equation adequately fits the data.
Multinomial Logistic Regression

It is similar to binary logistic regression; the only difference is that the dependent variable has more than 2 categories. An example of this is the grading system of Don Mariano Marcos Memorial State University. Multinomial logit models treat the response counts at each combination of covariate levels as multinomial, with counts at different combinations independent.

The benefit of using the multinomial logit model is that it models the odds of each category relative to a baseline category as a function of covariates, and it can test the equality of coefficients even if confounders are different, unlike the case of pair-wise logistic models, where testing the equality of coefficients requires assumptions about confounder effects. The Multinomial Logit Model is arguably the most widely used statistical model for polytomous (multi-category) response variables (Powers and Xie, 2000: Chapter 7; Fox, 1997: Chapter 15).
Measures of Fit

There are six (6) scalar measures of model fit: (1) Deviance, (2) Akaike Information Criterion, (3) Bayesian Information Criterion, (4) McFadden's R², (5) Cox and Snell Pseudo R², and (6) Nagelkerke Pseudo R². There is no convincing evidence that selection of a model that maximizes the value of a given measure necessarily results in a model that is optimal in any sense other than the model having a larger (or smaller) value of that measure (Long & Freese, 2001). However, it is still helpful to see any differences in their level of goodness of fit, and hence provide some guidelines in choosing an appropriate model.
Multinomial Logistic Regression
1. Deviance

As a first measure of model fit, the researcher uses the Residual Deviance (D) for the model, which is defined as follows:

D(Mk) = 2 Σi Σj I(yi = j) log[ I(yi = j) / p̂ij ]

where p̂ij is the predicted probability that case i falls in category j and I(yi = j) indicates the observed value, for i = 1, …, N.
2. Akaike Information Criterion

As a second measure of fit, the Akaike (1973) Information Criterion is defined as follows:

AIC = [ −2 ln L(Mk) + 2P ] / N

where L(Mk) is the maximum likelihood of the model and P is the number of parameters in the model. A model having a smaller AIC is considered the better-fitting model.
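A small added Python sketch of this comparison (the division by N follows the definition on this slide; the more common convention omits it, and the log-likelihood values below are hypothetical):

```python
def aic(log_likelihood, p, n):
    # AIC = (-2 ln L(M_k) + 2P) / N, as defined on this slide.
    return (-2.0 * log_likelihood + 2.0 * p) / n

# Two hypothetical models with the same likelihood on N = 100 cases:
# the one with fewer parameters has the smaller (better) AIC.
simpler = aic(log_likelihood=-50.0, p=3, n=100)
complex_ = aic(log_likelihood=-50.0, p=5, n=100)
```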
3. Bayesian Information Criterion

As a third measure of fit, the Bayesian Information Criterion (BIC) (Raftery, 1995) serves as a simple and accurate large-sample approximation, especially if there are at least about 40 observations (Raftery, 1995). Here, BIC is defined as follows:

BICk = D(Mk) − dfk ln N

where D(Mk) is the deviance of model Mk and dfk is the degrees of freedom associated with the deviance. The more negative BICk is, the better the fit. Raftery (1995) also provides guidelines for the strength of evidence favoring M2 against M1 based on the difference |BIC1 − BIC2|.
Raftery's guidelines for the strength of evidence based on |BIC1 − BIC2|:

|BIC1 − BIC2|   Evidence
0 - 2           Weak
2 - 6           Positive
6 - 10          Strong
> 10            Very Strong

4. McFadden's R²

McFadden's adjusted R² is also known as the "likelihood-ratio index". It compares the full model, with all parameters, against the intercept-only model, and is defined as:

R²McF = 1 − [ ln L(MFull) − K* ] / ln L(Mintercept)

where K* is the number of parameters.
5. Cox and Snell Pseudo R²

R²CS = 1 − { exp[l(0)] / exp[l(β̂)] }^(2/W)

where l(β̂) is the log-likelihood of the current model and l(0) is the log-likelihood of the initial model; that is, l(0) = W log(0.5) if the constant is not included in the model, and

l(0) = W [ ω̂0 log ω̂0 + (1 − ω̂0) log(1 − ω̂0) ]

if the constant is included in the model, where ω̂0 = Σi wi yi / W, W = Σi wi, and wi is the weight for the ith case.
6. Nagelkerke Pseudo R²

R²N = R²CS / max(R²CS)

where max(R²CS) = 1 − { exp[l(0)] }^(2/W).
How to run Regression on Statistical Software

Statistical Package for Social Sciences (SPSS). From the menu choose:

Analyze > Regression > Linear…
Statistical Package for Social Sciences: Linear Regression dialog box

Method: Forward, Backward, Stepwise
Statistics: Estimates, Confidence Intervals, Model Fit, R-square Change
Statistical Package for Social Sciences: Linear Regression Plots dialog box

Plots: SRESID against ZPRED
Check: Histogram, Normal probability plot, Produce all partial plots
Statistical Package for Social Sciences: Linear Regression Statistics dialog box
Statistical Package for Social Sciences: Example of Multiple Linear Regression

From the menu choose:

Analyze > Regression > Linear…
Variables Entered/Removed (a)

Model   Variables Entered   Variables Removed   Method
1       price               .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100)
2       tray                .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100)
3       diam                .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100)

a. Dependent Variable: time
Statistical Package for Social Sciences

Model Summary (d)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change   Durbin-Watson
1       .919a   .845       .842                7.55968                      .845              310.019    1     57    .000
2       .927b   .859       .854                7.26835                      .014              5.661      1     56    .021
3       .932c   .869       .862                7.07328                      .010              4.131      1     55    .047            1.955

a. Predictors: (Constant), price
b. Predictors: (Constant), price, tray
c. Predictors: (Constant), price, tray, diam
d. Dependent Variable: time
Statistical Package for Social Sciences

Coefficients (a)

Model   Term         B        Std. Error   Beta    t        Sig.   Zero-order   Partial   Part    Tolerance   VIF
1       (Constant)   6.542    1.932                3.386    .001
        price        .339     .019         .919    17.607   .000   .919         .919      .919    1.000       1.000
2       (Constant)   6.387    1.859                3.436    .001
        price        .329     .019         .891    17.290   .000   .919         .918      .868    .948        1.055
        tray         6.163    2.590        .123    2.379    .021   .326         .303      .119    .948        1.055
3       (Constant)   11.361   3.043                3.733    .000
        price        .377     .030         1.021   12.549   .000   .919         .861      .613    .360        2.778
        tray         8.169    2.707        .163    3.018    .004   .326         .377      .147    .822        1.216
        diam         -.866    .426         -.175   -2.033   .047   .700         -.264     -.099   .320        3.125

a. Dependent Variable: time
Statistical Package for Social Sciences

Collinearity Diagnostics (a)

                                           Variance Proportions
Model   Dimension   Eigenvalue   Condition Index   (Constant)   price   tray   diam
1       1           1.861        1.000             .07          .07
        2           .139         3.653             .93          .93
2       1           2.182        1.000             .05          .05     .08
        2           .680         1.791             .05          .03     .91
        3           .138         3.982             .90          .93     .02
3       1           3.132        1.000             .01          .01     .03    .00
        2           .707         2.105             .01          .00     .84    .00
        3           .138         4.770             .32          .35     .01    .00
        4           .023         11.577            .66          .63     .12    1.00

a. Dependent Variable: time
How to run Regression on Statistical Software

Statistical Package for Social Sciences (SPSS). From the menu choose:

Analyze > Regression > Binary Logistic…
Statistical Package for Social Sciences: Binary Logistic Regression dialog box

Method: Forward: Conditional, Forward: LR, Forward: Wald; Backward: Conditional, Backward: LR, Backward: Wald
Categorical Covariates
Statistical Package for Social Sciences: Binary Logistic Regression Statistics dialog box

Statistics & Plots:
• Classification plot
• Hosmer-Lemeshow goodness of fit
• CI for Exp(B)
Residuals:
• Standardized
• Deviance
ENGR. DIVINO AMOR P. RIVERA, OIC - PROVINCIAL STATISTICS OFFICER, NSO LA UNION
Example of Binary Regression (bankloan_cs.sav)
Statistical Package for Social Sciences
Data View
Example of Binary Regression (bankloan.sav)
Statistical Package for Social Sciences
Variable View
Statistical Package for Social Sciences
Output

Goodness-of-fit statistics help you to determine whether the model adequately describes the data. The Hosmer-Lemeshow statistic indicates a poor fit if the significance value is less than 0.05; here it does not, so the model adequately fits the data.
Statistical Package for Social Sciences: Forward stepwise method

Start with a model that does not include any of the predictors.

At each step, the predictor with the largest score statistic whose significance value is less than a specified value (the usual default is 0.05) is added to the model.

The variables left out of the analysis at the last step all have significance values larger than 0.05.
Statistical Package for Social Sciences

The variables chosen by the forward stepwise method should all have significant changes in -2 log-likelihood.

The change in -2 log-likelihood is generally more reliable than the Wald statistic.

Larger pseudo R-square statistics indicate that more of the variation is explained by the model, from a minimum of 0 to a maximum of 1.
Statistical Package for Social Sciences

The ratio of the coefficient to its standard error, squared, equals the Wald statistic. If the significance level of the Wald statistic is small (less than 0.05), then the parameter is useful to the model.

The predictors and coefficient values shown in the last step are used by the procedure to make predictions.

The meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While B is convenient for testing the usefulness of predictors, Exp(B) is easier to interpret.
Statistical Package for Social Sciences
For example, Exp (B) for employ is equal to 0.781, which means that the odds of default for a person who has been employed at their current job for 2 years are 0.781 times the odds of default for a person who has been employed at their current job for 1 year, all other things being equal.
Statistical Package for Social Sciences
P(default) = Odds(default) / (1 + Odds(default)) = 0.781 / (1 + 0.781) = 0.439

The odds of default for a person with 1 more year on the job are 1 × 0.781 = 0.781, so the corresponding probability of default reduces to 0.439.
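The conversion above can be written as a one-line helper (an added Python illustration):

```python
def probability_from_odds(odds):
    # P = Odds / (1 + Odds)
    return odds / (1 + odds)

# Exp(B) = 0.781, so one extra year on the job gives odds 1 * 0.781:
p = probability_from_odds(0.781)   # about 0.439, as in the slide
```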
Multinomial Logistic Regression (Statistical Package for Social Sciences)

From the menu choose:

Analyze > Regression > Multinomial Logistic…
Statistical Package for Social Sciences Multinomial Logistic Regression Statistics dialog box
Multinomial Logistic Regression (Statistical Package for Social Sciences)

Statistics:
Model: Pseudo R-square, Information Criteria
Parameters: Estimates, Likelihood ratio tests, Confidence Interval
Also: Goodness of Fit, Asymptotic Correlations
Reliability

Kuder-Richardson Formulas 20 & 21 and Cronbach Alpha

• The KR-20 value gives an overall indication of the test reliability.
• It is important to note that the KR-20 value will be affected by the number of responses in an exam.
• The formula to calculate an exam's KR-20 can be expressed as

KR20 = [k / (k − 1)] [1 − (Σ σr²) / σe²]

where:
σe² = the variance of the respondents' total scores for the exam
Σ σr² = the sum of the variances of the respondents' scores for each response (for dichotomously scored items this is Σ pq)
k = the number of responses

• The KR-21 value gives an overall indication of the test reliability under the assumption that the difficulties of the test items are equal.
• The formula to calculate an exam's KR-21 can be expressed as

KR21 = [k / (k − 1)] [1 − M(k − M) / (k σ²)]

where:
M = the assessment mean
k = the number of items in the assessment
σ² = the variance
Example :
Kr20 – Dissertation of Rodolfo Vergara
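A Python sketch of the KR-20 computation for dichotomously scored items (an added illustration using the Σpq form; the toy response data are hypothetical):

```python
def kr20(item_scores):
    # item_scores: one row per respondent, each a list of 0/1 item scores.
    k = len(item_scores[0])                 # number of items (responses)
    n = len(item_scores)
    totals = [sum(person) for person in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n   # sigma_e^2
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in item_scores) / n   # item difficulty
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Perfectly consistent respondents give the maximum reliability of 1.0:
scores = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
reliability = kr20(scores)   # 1.0
```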
Cronbach Alpha

Another one of the most commonly used reliability coefficients is Cronbach's alpha (α).

It is based on the internal consistency of items in the tests. It is flexible and can be used with test formats that have more than one correct answer.

Cronbach's alpha can be used for both binary-type and large-scale data. On the other hand, KR can be applied to dichotomously-scored data only. For example, if your test questions are multiple choice or true/false, the responses must be binary in nature (either right or wrong). But if your test is composed of essay-type questions and each question is worth 10 points, then the scale ranges from 0 to 10.
Chi-square
In order to choose the proper statistics to examine data, we first have to figure out at what level each variable is measured.
Levels of Measurement

• Nominal Level
The word nominal means "in name". Examples of nominal-level variables are the following: sex (with the categories of male and female), ethnicity (categories could include African American, Latino, and white), political party identification (Democrat, Republican, Independent, etc.), and religion (Catholic, Protestant, Jewish, Hindu, Buddhist, etc.).
• Ordinal Level
The word ordinal means "in order". Examples of ordinal-level variables would be the variable "fear of crime", with categories such as very afraid, somewhat afraid, and not afraid. These categories have names (as does a nominal-level variable), but they also have something more: the categories have an inherent order from more to less fear. Another example is "social class", with categories such as lower class, working class, middle class, and upper class.

• Interval Level
If the categories are numbers with equal intervals between them, the variables are called interval-level variables.
• Ratio Level
Variables measured at the ratio level have all the characteristics of nominal-, ordinal-, and interval-level measures (categories that have names, order, and equal intervals), and the categories include a true zero point. Even though the sample you are examining may not include any cases in the category zero, zero is possible at least in theory. An example is "income in dollars."
The Chi-square (χ2) test is used to determine the strength of association between two nominal variables. It uses data that are in the form of frequencies. The formula for χ2 is
χ² = Σ (O − E)² / E

χ² = Chi-square value
O = frequencies actually observed
E = expected frequencies
The Chi-square (χ²) test of goodness of fit is used to determine whether a significant difference exists between the observed frequency distribution and the expected frequency distribution.
Example: Do doctors differ in their responses as to whether or not they read newspapers daily?
Doctor's Response   Observed (O)   Expected (E)
Yes                 80             50
No                  20             50
Totals              100            100
The expected (E) frequencies of 50 for each response option (Yes or No) in the table above were determined based on the belief that yes and no have equal probabilities of being responded. These frequencies are sometimes called theoretical frequencies.
χ² = (80 − 50)²/50 + (20 − 50)²/50

χ² = 36
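The computation above is easy to reproduce (an added Python sketch):

```python
def chi_square(observed, expected):
    # Goodness-of-fit statistic: sum over cells of (O - E)^2 / E.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The doctors example: O = (80, 20), E = (50, 50):
x2 = chi_square([80, 20], [50, 50])   # 18.0 + 18.0 = 36.0
```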
The Chi-square (χ²) test of independence is used when two nominal variables are involved in the study. It seeks to find out whether two variables, A and B, are independent of each other, or whether an association exists between A and B.
Example : Is there an association between the manager’s annual salary rate and his educational attainment?
Managers' Annual Salary Rate by Educational Attainment (O = observed, E = expected)

                                       Low        Minimum    High
Managers' Annual Salary Rate           O    E     O    E     O    E     Total
High (PhP 120,000 and above)           5    21    47   34    48   44    100
Moderate (PhP 72,000 to PhP 119,999)   15   18    23   29    46   37    84
Low (below PhP 72,000)                 30   11    10   17    10   23    50
Total                                  50         80         104        234
Step 1. Formulate the null hypothesis (H0) statement and its corresponding alternative hypothesis (H1). State the alpha α level and determine the degrees of freedom (df ), based on the number of frequency cells or observations.
H0: Salary rate is not associated with educational attainment.
H1: Salary rate is associated with educational attainment.
α = 0.05
df = (c − 1)(r − 1) = (3 − 1)(3 − 1) = 4
Note: c = column and r = row
Step 2. Determine the expected frequency of each cell by multiplying the pertinent marginal row total in the table with the pertinent marginal column total and divide the product by the grand total of frequencies.
Step 3. Compute the value of χ² using the formula:

χ² = Σ (O − E)² / E = 63.536

Computed value: χ² = 63.536
Critical value in the Chi-square table: χ² = 9.488 (df = 4 at α = 0.05)
Step 4. Test the null hypothesis and state the findings. Since the computed χ² value of 63.536 is greater than the critical value χ² = 9.488 at the 0.05 level of significance, we reject H0 and accept H1.
Step 5. Draw the conclusion. Salary rate is associated with educational attainment.
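Steps 2 and 3 can be sketched in Python (an added illustration; with unrounded expected frequencies the statistic comes out near 65.8 rather than the slide's 63.536, which appears to use expected counts rounded to whole numbers):

```python
def chi_square_independence(table):
    # Expected cell counts from the marginal totals (Step 2), then the
    # sum of (O - E)^2 / E over all cells (Step 3).
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand
            x2 += (o - e) ** 2 / e
    return x2

# Observed salary-rate by educational-attainment counts from the slides:
observed = [[5, 47, 48], [15, 23, 46], [30, 10, 10]]
x2 = chi_square_independence(observed)   # well above the 9.488 critical value
```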
Caution in the Use of a Chi-square Test. In testing independence using x2 with a two by two contingency table with df=1, the formula of x2 is
χ² = Σ (|O − E| − 0.5)² / E

(Yates's Correction Formula)