valentine-nelson-day
Slide 1
Standard Binary Logistic Regression
Slide 2
Logistic regression
Logistic regression is used to analyze relationships between a dichotomous dependent variable and metric or non-metric independent variables. (SPSS now supports Multinomial Logistic Regression that can be used with more than two groups, but our focus now is on binary logistic regression for two groups.)
Logistic regression combines the independent variables to estimate the probability that a particular event will occur, i.e. a subject will be a member of one of the groups defined by the dichotomous dependent variable. In SPSS, the model is always constructed to predict the group with higher numeric code. If responses are coded 1 for Yes and 2 for No, SPSS will predict membership in the No category. If responses are coded 1 for No and 2 for Yes, SPSS will predict membership in the Yes category. We will refer to the predicted event for a particular analysis as the modeled event.
Predicting the “No” event creates some awkward wording in our problems. Our only option for changing this is to recode the variable.
If the probability for group membership in the modeled category is above some cut point (the default is 0.50), the subject is predicted to be a member of the modeled group. If the probability is below the cut point, the subject is predicted to be a member of the other group.
For any given case, logistic regression computes the probability that a case with a particular set of values for the independent variables is a member of the modeled category.
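The probability and cut point described above can be sketched in a few lines of Python. This is a minimal illustration, not SPSS's implementation: the function names, the intercept, the coefficients, and the case values are all hypothetical.

```python
import math

def modeled_probability(intercept, coefficients, values):
    """Probability of the modeled event, from the logistic function."""
    logit = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-logit))

def predicted_group(probability, cut_point=0.50):
    """Predict the modeled group when the probability is above the cut point."""
    return "modeled" if probability > cut_point else "other"

# Hypothetical coefficients and case values, for illustration only.
p = modeled_probability(intercept=-1.0, coefficients=[0.5, -0.25], values=[4.0, 2.0])
print(round(p, 3), predicted_group(p))
```

Recoding the dependent variable changes which group is "modeled", but not the arithmetic.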
Slide 3
Level of measurement requirements
Logistic regression analysis requires that the dependent variable be dichotomous.
Logistic regression analysis requires that the independent variables be metric or non-metric. The logistic regression procedure will dummy-code non-metric variables for us. For logistic regression, we will use indicator dummy-coding, rather than deviation dummy-coding since I think it makes more sense to compare the odds for two groups rather than compare the odds for one group to the average odds for all groups.
If an independent variable is ordinal, we can either treat it as non-metric and dummy-code it or we can treat it as interval, in which case we will attach the usual caution.
Dichotomous independent variables do not have to be dummy-coded, but in our problems we will have SPSS dummy-code them because then we do not need to worry about the original codes for the variable as we can always interpret
Slide 4
Dummy-coding in SPSS - 1
When we want SPSS to dummy-code a variable, we enter the specifications in the Define Categorical Variables dialog box. Here we are dummy-coding sex, using the defaults of indicator coding with the last category as the reference category.
In the table of coefficients, the dummy-coded variable is referred to by its original name plus the value for the Parameter coding in the Categorical Variables Codings table.
SPSS shows you its coding scheme in the table of Categorical Variables Codings in the output. Since we chose the last category as reference, FEMALE is coded 0.
Slide 5
Dummy-coding in SPSS - 2
Variables in the Equation
                       B        S.E.     Wald      df    Sig.
Step 1(a)   sex(1)     -1.590   .361     19.427    1     .000
            Constant   -.235    .229     1.047     1     .306
a. Variable(s) entered on step 1: sex.
Here we are dummy-coding sex, using indicator coding with the First category as the reference category. Note you must click on the Change button after selecting the First option button.
In the table of coefficients, the dummy-coded variable is still referred to by its original name plus the value for the Parameter coding in the Categorical Variables Codings table, but in this case it stands for females.
SPSS shows you its coding scheme in the table of Categorical Variables Codings in the output. Since we chose the FIRST category as reference, MALE is coded 0.
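As a rough sketch of what indicator coding does, the hypothetical helper below assigns 0/1 parameter codes with either the last or the first category as the reference. The function name and the three-category example are assumptions for illustration; only the MALE/FEMALE ordering comes from the slides.

```python
def indicator_code(categories, reference="last"):
    """Map each category to a tuple of 0/1 codes; the reference gets all zeros."""
    ref = categories[-1] if reference == "last" else categories[0]
    non_reference = [c for c in categories if c != ref]
    return {c: tuple(1 if c == other else 0 for other in non_reference)
            for c in categories}

# Categories listed in order of their numeric codes (MALE first, FEMALE last).
sex = ["MALE", "FEMALE"]
print(indicator_code(sex, reference="last"))    # FEMALE is coded 0
print(indicator_code(sex, reference="first"))   # MALE is coded 0

# A hypothetical three-category variable yields two dummy variables.
print(indicator_code(["LOW", "MEDIUM", "HIGH"], reference="last"))
```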
Slide 6
Assumptions
Logistic regression does not make any assumptions of normality, linearity, and homogeneity of variance for the independent variables.
When the variables satisfy the assumptions of normality, linearity, and homogeneity of variance, discriminant analysis has historically been cited as the more effective statistical procedure for evaluating relationships with a non-metric dependent variable. However, logistic regression is being used more and more frequently because it can be interpreted similarly to other general linear model problems.
When the variables do not satisfy the assumptions of normality, linearity, and homogeneity of variance, logistic regression is the statistic of choice since it does not make these assumptions.
Multicollinearity is a problem for logistic regression with the same consequences as multiple regression, i.e. we are likely to misinterpret the contribution of independent variables when they are collinear. SPSS does not compute tolerance values for logistic regression, so we will detect it through the examination of standard errors. We will not interpret models when evidence of multicollinearity is found.
Evidence of multicollinearity is detected as a numerical problem in the attempted solution.
Slide 7
Numerical problems
The maximum likelihood method used to calculate logistic regression is an iterative fitting process that attempts to cycle through repetitions to find an answer.
Sometimes, the method will break down and not be able to converge or find an answer.
Sometimes the method will produce wildly improbable results, reporting that a one-unit change in an independent variable increases the odds of the modeled event by hundreds of thousands or millions. These implausible results can be produced by multicollinearity, categories of predictors having no cases or zero cells, and complete separation whereby the two groups are perfectly separated by the scores on one or more independent variables.
The clue that we have numerical problems and should not interpret the results is a standard error larger than 2.0 for one or more independent variables (this does not apply to the constant).
Slide 8
Sample size requirements
The minimum number of cases per independent variable is 10, using a guideline provided by Hosmer and Lemeshow, authors of Applied Logistic Regression, one of the main resources for Logistic Regression.
If we do not meet the sample size requirement, it is suggested that this be mentioned as a limitation to our analysis. If the relationships between predictors and the dependent variable are strong, we may still attain statistical significance with smaller samples.
Slide 9
Methods for including variables
SPSS supports three methods for including variables in the regression equation:
• the standard or simultaneous method, in which all independents are included at the same time
• the hierarchical method, in which control variables are entered in the analysis before the predictors whose effects we are primarily concerned with
• the stepwise method (forward conditional or forward LR in SPSS), in which variables are selected in the order in which they maximize the statistically significant contribution to the model
For all methods, the contribution to the model is measured by the model chi-square, a statistical measure of the fit between the dependent and independent variables, like R².
Slide 10
Computational method
Multiple regression uses the least-squares method to find the coefficients for the independent variables in the regression equation, i.e. it computes coefficients that minimize the sum of squared residuals across all cases.
Logistic regression uses maximum-likelihood estimation to compute the coefficients for the logistic regression equation. This method attempts to find coefficients that match the breakdown of cases on the dependent variable.
The overall measure of how well the model fits is given by the -2 log likelihood value, which is similar to the residual or error sum of squares value for multiple regression. A model that fits the data well will have a small -2 log likelihood value. A perfect model would have a -2 log likelihood value of zero.
Maximum-likelihood estimation is an iterative procedure that successively works to get closer and closer to the correct answer. When SPSS reports the "iterations," it is telling us how many cycles it took to get the answer.
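To make the idea of iterations concrete, here is a minimal sketch of Newton-Raphson fitting for the simplest possible case, a model containing only the constant. It is not SPSS's algorithm; the function names are hypothetical, and the counts (82 of 136 cases in the modeled group) are used purely for illustration.

```python
import math

def fit_intercept_only(n_event, n_total, tolerance=1e-8, max_iterations=25):
    """Newton-Raphson estimate of the constant in a model with no predictors."""
    b0 = 0.0
    for iteration in range(1, max_iterations + 1):
        p = 1.0 / (1.0 + math.exp(-b0))
        gradient = n_event - n_total * p         # first derivative of the log likelihood
        information = n_total * p * (1.0 - p)    # negative second derivative
        step = gradient / information
        b0 += step
        if abs(step) < tolerance:                # converged: the estimate stopped moving
            break
    return b0, iteration

def minus_two_log_likelihood(b0, n_event, n_total):
    """-2 log likelihood of the constant-only model at a given constant."""
    p = 1.0 / (1.0 + math.exp(-b0))
    log_likelihood = n_event * math.log(p) + (n_total - n_event) * math.log(1.0 - p)
    return -2.0 * log_likelihood

b0, iterations = fit_intercept_only(n_event=82, n_total=136)
print(round(b0, 4), iterations, round(minus_two_log_likelihood(b0, 82, 136), 3))
```

Each pass through the loop is one "iteration"; the fitted constant ends up at the log of the odds of the modeled event, and the -2 log likelihood at that value is lower than at the starting value of zero.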
Slide 11
Overall test of relationship
Errors in a logistic regression model are measured in terms of “-2 log likelihood” values, which are analogous to the “total sum of squares”. When an independent variable has a relationship to the dependent variable, the measure of error decreases. Since the log likelihood is a negative number, “-2 log likelihood” (abbreviated as -2LL) is a positive number, and an improvement in the relationship is indicated by a smaller value, e.g. if -2LL were 200, a -2LL of 100 would represent an improvement.
The overall test of relationship among the independent variables and groups defined by the dependent is based on the reduction in the -2 log likelihood value from a model which does not contain any independent variables to the model that contains the independent variables.
This difference in likelihood follows a chi-square distribution, and is referred to as the model chi-square.
The significance test for the model chi-square is our statistical evidence of the presence of a relationship between the dependent variable and the combination of the independent variables.
In a hierarchical logistic regression, the significance test for the addition of the predictor variables is based on the block chi-square in the omnibus tests of model coefficients.
Slide 12
Overall test of relationship in SPSS output
Though the iteration history is not usually an output of interest, it does show us how the model chi-square value is derived.
The original -2 Log Likelihood value is 213.891.
At the end of this step, the -2 Log Likelihood value is 192.726.
213.891 – 192.726 = 21.165, the value for Model Chi-square in the table of Omnibus Tests of Model Coefficients.
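The subtraction above is all there is to the model chi-square. A short sketch using the two values from the iteration history (the variable names are our own):

```python
initial_minus_2ll = 213.891   # -2LL for the model with no independent variables
final_minus_2ll = 192.726     # -2LL at the end of step 1

# The reduction in -2LL is the model chi-square.
model_chi_square = round(initial_minus_2ll - final_minus_2ll, 3)
print(model_chi_square)
```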
Slide 13
Relationship of Individual Independent Variables and Dependent Variable
There is a test of significance for the relationship between an individual independent variable and the dependent variable: a significance test of the Wald statistic.
The individual coefficients represent change in the odds of being a member of the modeled category. Individual coefficients are expressed in log units and are not directly interpretable. However, if the b coefficient is used as the power to which the base of the natural logarithm (2.71828) is raised, the result represents the change in the odds of the modeled event associated with a one-unit change in the independent variable.
If a coefficient is positive, its transformed log value will be greater than one, meaning that the modeled event is more likely to occur. If a coefficient is negative, its transformed log value will be less than one, and the odds of the event occurring decrease. A coefficient of zero (0) has a transformed log value of 1.0, meaning that this coefficient does not change the odds of the event one way or the other.
The interpretive statement for individual relationships, provided they are statistically significant, incorporates the odds ratio or Exp(B) in SPSS output.
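The slides do not spell out how the Wald statistic is computed. Assuming the usual definition, the squared ratio of the coefficient to its standard error, the rounded values printed in the Variables in the Equation table for the sex example come close to the Wald values SPSS reports (small differences are expected because SPSS works from unrounded numbers).

```python
def wald_statistic(b, standard_error):
    """Wald chi-square under the assumed definition (B / S.E.) squared."""
    return (b / standard_error) ** 2

# Rounded B and S.E. values as printed in the sex example.
print(round(wald_statistic(-1.590, 0.361), 3))   # SPSS reports 19.427
print(round(wald_statistic(-0.235, 0.229), 3))   # SPSS reports 1.047
```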
Slide 14
Interpreting individual relationships - 1
Exp(B) can be interpreted as a percentage change by subtracting 1.0 from the Exp(B) value.
In this example, Exp(B) – 1.0 = .204 – 1.0 = -.796
We can state this finding as females (sex(1) value in this example) were 79.6% less likely to …
Note: in this example, sex was coded so that males was the reference category.
Slide 15
Interpreting individual relationships - 2
Exp(B) can be interpreted as a multiplier when percentage change is confusing.
We can state this finding as males (sex(1) value in this example) were 4.9 or approximately 5 times more likely to …
In this example, Exp(B) – 1.0 = 4.902 – 1.0 = 3.902, or 390.2% more likely.
Note: in this example, sex was coded so that females was the reference category.
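Both interpretations of Exp(B) follow from the same calculation. The sketch below uses B = -1.590 for the example with males as the reference category; the value +1.590 for the example with females as the reference category is an assumption (the same coefficient with the sign flipped), so the result differs slightly from the 4.902 SPSS reports from unrounded values.

```python
import math

def odds_ratio(b):
    """Exp(B): the multiplier applied to the odds for a one-unit change."""
    return math.exp(b)

def percent_change_in_odds(b):
    """Exp(B) - 1.0, expressed as a percentage."""
    return (math.exp(b) - 1.0) * 100.0

# Males as the reference category: sex(1) stands for females.
print(round(odds_ratio(-1.590), 3), round(percent_change_in_odds(-1.590), 1))

# Females as the reference category: sex(1) stands for males (assumed B = +1.590).
print(round(odds_ratio(1.590), 1), round(percent_change_in_odds(1.590), 1))
```

A coefficient of zero gives an Exp(B) of 1.0 and a 0% change, which is why a non-significant coefficient leaves the odds unchanged.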
Slide 16
Strength of logistic regression relationship
While logistic regression does compute correlation measures to estimate the strength of the relationship (pseudo R square measures, such as Nagelkerke's R²), these correlation measures do not really tell us much about the accuracy or errors associated with the model.
A more useful measure to assess the utility of a logistic regression model is classification accuracy, which compares predicted group membership based on the logistic model to the actual, known group membership, which is the value for the dependent variable.
Slide 17
Evaluating usefulness for logistic models
The benchmark that we will use to characterize a logistic regression model as useful is a 25% improvement over the rate of accuracy achievable by chance alone.
Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy.
The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by summing the squared proportion of cases in each group.
Slide 18
Comparing accuracy rates
To characterize our model as useful, we compare the overall percentage accuracy rate produced by SPSS at the last step in which variables are entered to 25% more than the proportional by chance accuracy. (Note: SPSS does not compute a cross-validated accuracy rate for logistic regression.)
Classification Table (a)
                                                   Predicted
                                                   EXPECT U.S. IN WORLD WAR IN 10 YEARS
Observed                                           YES     NO      Percentage Correct
Step 1   EXPECT U.S. IN WORLD WAR     YES          20      34      37.0
         IN 10 YEARS                  NO           10      72      87.8
         Overall Percentage                                        67.6
a. The cut value is .500.
SPSS reports the overall accuracy rate in the Classification Table. The overall accuracy rate computed by SPSS was 67.6% in this example.
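The percentages in the Classification Table can be reproduced from the four cell counts. A minimal sketch, with a hypothetical function name and a nested dictionary standing in for the table:

```python
# Rows are observed groups; inner keys are predicted groups.
classification = {
    "YES": {"YES": 20, "NO": 34},
    "NO": {"YES": 10, "NO": 72},
}

def percent_correct(table):
    """Return per-group and overall percentage of correctly classified cases."""
    per_group = {group: 100.0 * row[group] / sum(row.values())
                 for group, row in table.items()}
    total = sum(sum(row.values()) for row in table.values())
    correct = sum(row[group] for group, row in table.items())
    return per_group, 100.0 * correct / total

per_group, overall = percent_correct(classification)
print({g: round(v, 1) for g, v in per_group.items()}, round(overall, 1))
```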
Slide 19
Computing by chance accuracy
The number of cases in each group is found in the Classification Table at Step 0 (before any independent variables are included). The proportion of cases in the largest group is equal to the overall percentage (60.3%).
Classification Table (a,b)
                                                   Predicted
                                                   EXPECT U.S. IN WORLD WAR IN 10 YEARS
Observed                                           YES     NO      Percentage Correct
Step 0   EXPECT U.S. IN WORLD WAR     YES          0       54      .0
         IN 10 YEARS                  NO           0       82      100.0
         Overall Percentage                                        60.3
a. Constant is included in the model.
b. The cut value is .500.
The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0, and then squaring and summing the proportion of cases in each group (0.397² + 0.603² = 0.521).
The proportional by chance accuracy criterion is 65.2% (1.25 x 52.1% = 65.2%).
Since the accuracy rate in this example, 67.6%, is greater than the 65.2% by chance accuracy criterion, this model would be characterized as useful.
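The by chance accuracy comparison can be sketched as follows. The function names are hypothetical, and the calculation works from the group counts at Step 0 (54 and 82) rather than from rounded proportions, so the last decimal of the criterion may differ slightly from a hand calculation.

```python
def proportional_by_chance_accuracy(group_counts):
    """Sum of the squared proportion of cases in each group."""
    total = sum(group_counts)
    return sum((n / total) ** 2 for n in group_counts)

def is_useful(model_accuracy, group_counts, improvement=1.25):
    """True when accuracy exceeds 25% more than the by chance accuracy rate."""
    return model_accuracy > improvement * proportional_by_chance_accuracy(group_counts)

chance = proportional_by_chance_accuracy([54, 82])
print(round(chance, 3), is_useful(0.676, [54, 82]))
```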
Slide 20
Outliers
Logistic regression models the relationship between a set of independent variables and the probability that a case is a member of one of the categories of the dependent variable. (In SPSS, the modeled category is the one with the higher numeric code.) If the probability is greater than 0.50, the case is classified in the modeled category. If the probability is less than 0.50, the case is classified in the other category.
The actual probability of the modeled event for any case is either 1.0 or 0.0, i.e. a case is in the modeled category or it is not.
The residual is the difference between the actual probability and the predicted probability for a case. If the predicted probability for a case that actually belonged to the modeled category was 0.80, the residual would be 1.00 – 0.80 = 0.20.
The residual can be standardized by dividing it by an estimate of its standard deviation. Since the dependent variable is dichotomous or binary, the standard deviation for proportions is used.
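A minimal sketch of this standardization, assuming the standard deviation for a proportion is taken as the square root of p(1 - p) for the case's predicted probability p. The function name and the second example case are hypothetical.

```python
import math

def standardized_residual(actual, predicted_probability):
    """Residual divided by the standard deviation for a proportion."""
    residual = actual - predicted_probability
    spread = math.sqrt(predicted_probability * (1.0 - predicted_probability))
    return residual / spread

# The case from the text: in the modeled category, predicted probability 0.80.
print(round(standardized_residual(1, 0.80), 2))

# A hypothetical case outside the modeled category with a high predicted probability.
print(round(standardized_residual(0, 0.95), 2))
```

The second case is badly mispredicted, so its standardized residual is large and negative, the kind of value the outlier check looks for.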
Slide 21
Strategy for Outliers
Our strategy for evaluating the impact of outliers on our logistic regression model will parallel what we have done for multiple regression and discriminant analysis:
• First, we run a baseline model including all cases.
• Second, we run a model excluding outliers whose studentized residual is greater than 2.58 or less than -2.58 (z-score for p = .01).
• If the model excluding outliers has a classification accuracy rate that is 2% or more higher than the accuracy rate of the baseline model, we will interpret the revised model. If the revised model without outliers is less than 2% more accurate, we will interpret the baseline model.
Slide 22
The Problem in Blackboard
The problem statement tells us:
• the variables included in the analysis
• whether each variable should be treated as metric or non-metric
• the type of dummy coding and reference category for non-metric variables
• the alpha for both the statistical relationships and for diagnostic tests
Slide 23
The Statement about Level of Measurement
The first statement in the problem asks about level of measurement. Standard binary logistic regression requires that the dependent variable be dichotomous, the metric independent variables be interval level, and the non-metric independent variables be dummy-coded if they are not dichotomous. SPSS Binary Logistic Regression calls non-metric variables “categorical.”
SPSS Binary Logistic Regression will dummy-code categorical variables for us, provided it is useful to use either the first or last category as the reference category.
Slide 24
Marking the Statement about Level of Measurement
Mark the check box as a correct statement because:
• The dependent variable "computer use" [compuse] is dichotomous level, satisfying the requirement for the dependent variable.
• The independent variables "highest year of school completed" [educ] and "socioeconomic index" [sei] are interval level, satisfying the requirement for independent variables.
• The independent variable "sex" [sex] is dichotomous level, satisfying the requirement for independent variables.
• The independent variable "condition of health" [health] is ordinal level which the problem instructs us to dummy-code as a non-metric variable.
Slide 25
The Statement about Outliers
While we do not need to be concerned about normality, linearity, and homogeneity of variance, we need to determine whether or not outliers were substantially reducing the classification accuracy of the model.
To test for outliers, we run the binary logistic regression in SPSS and check for outliers. Next, we exclude the outliers and run the logistic regression a second time. We then compare the accuracy rates of the models with and without the outliers. If the model without outliers is 2% or more accurate than the model with outliers, we interpret the model excluding outliers.
Slide 26
Running the standard binary logistic regression
Select the Regression | Binary Logistic… command from the Analyze menu.
Slide 27
Selecting the dependent variable
First, highlight the dependent variable compuse in the list of variables.
Second, click on the right arrow button to move the dependent variable to the Dependent text box.
Slide 28
Selecting the independent variables
Move the independent variables stated in the problem to the Covariates list box.
Slide 29
Declare the categorical variables - 1
To tell SPSS that two of the variables are non-metric and need to be dummy-coded, click on the Categorical button.
Slide 30
Declare the categorical variables - 2
Move the variables sex and health to the Categorical Covariates list box.
SPSS assigns its default method for dummy-coding, Indicator coding, to each variable, placing the name of the coding scheme in parentheses after each variable name.
Slide 31
Declare the categorical variables - 3
We could change the dummy-coding to a different scheme by choosing another method from the drop-down menu, and clicking on the Change button.
However, we will use indicator dummy-coding for our logistic regression problems, so that we are comparing the difference in odds between two specific groups, rather than comparing one group to the average odds for all other groups. I think “average odds” complicates the interpretation.
Slide 32
Declare the categorical variables - 4
We will also accept the default of using the last valid category as the reference category for each variable (we do not use higher numbered missing values as a reference category).
Note that sex is a dichotomous variable, and does not require dummy-coding. I prefer to dummy-code it anyhow so that my interpretation is consistently based on the difference between categories coded 0 and 1. I do not need to alter my interpretation if two different numbers were used for the original coding.
Click on the Continue button to close the dialog box.
Slide 33
Specifying the method for including variables
SPSS provides us with two methods for including variables: to enter all of the independent variables at one time, and a stepwise method for selecting variables using a statistical test to determine the order in which variables are included.
SPSS also supports the specification of "Blocks" of variables for testing hierarchical models.
Since the problem calls for a standard binary logistic regression, we accept the default Enter method for including variables.
Slide 34
Adding outliers to the data set - 1
Click on the Save… button to request the statistics that we want to save.
SPSS will calculate the values for standardized residuals and save them to the data set so that we can check for outliers and remove the outliers easily if we need to run a model excluding outliers.
Slide 35
Adding outliers to the data set - 2
First, mark the checkbox for Standardized residuals in the Residuals panel.
Second, click on the Continue button to complete the specifications.
Slide 36
Requesting the output
While optional statistical output is available, we do not need to request any optional statistics.
Click on the OK button to request the output.
Slide 37
Detecting the presence of outliers - 1
SPSS created a new variable, ZRE_1, which contains the standardized residual. If SPSS finds that the data set already contains a ZRE_1 variable, it will create ZRE_2.
I find it easier to delete the ZRE_1 variable after each analysis rather than have multiple ZRE_ variables in the data set, requiring that I remember which one goes with which analysis.
Slide 38
Detecting the presence of outliers - 2
Click the right mouse button on the column header and select Sort Ascending from the pop-up menu.
To detect outliers, we will sort the ZRE_1 column twice:
• first, in ascending order to identify outliers with a standardized residual of -2.58 or less.
• second, in descending order to identify outliers with a standardized residual of +2.58 or greater.
Slide 39
Detecting the presence of outliers - 3
After scrolling down past the cases with missing data (. in the ZRE_1 column), we see that we have five outliers that have standardized residuals of -2.58 or less.
Slide 40
Detecting the presence of outliers - 4
To check for outliers with large positive standardized residuals, click the right mouse button on the column header and select Sort Descending from the pop-up menu.
Slide 41
Detecting the presence of outliers - 5
Since we found outliers, we will run the model excluding them and compare accuracy rates to determine which one we will interpret.
Had there been no outliers, we would move on to the issue of sample size.
After scrolling up to the top of the data set, we see that there are no outliers that have standardized residuals of +2.58 or more.
Slide 42
Running the model excluding outliers - 1
We will use a Select Cases command to exclude the outliers from the analysis.
Slide 43
Running the model excluding outliers - 2
First, in the Select Cases dialog box, mark the option button If condition is satisfied.
Second, click on the If button to specify the condition.
Slide 44
Running the model excluding outliers - 3
The formula specifies that we should include cases if the absolute value of the standardized residual (ZRE_1) is less than 2.58.
The abs() or absolute value function tells SPSS to ignore the sign of the value.
After typing in the formula, click on the Continue button to close the dialog box.
To eliminate the outliers, we request the cases that are not outliers be selected into the analysis.
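Outside of SPSS, the same selection rule can be sketched as a simple filter. The case records below are hypothetical; a value of None stands in for a missing ZRE_1, which the filter drops along with the outliers.

```python
def select_non_outliers(cases, cutoff=2.58):
    """Keep cases whose standardized residual is present and within the cutoff."""
    return [case for case in cases
            if case["ZRE_1"] is not None and abs(case["ZRE_1"]) < cutoff]

# Hypothetical cases for illustration only.
cases = [
    {"id": 1, "ZRE_1": 0.42},
    {"id": 2, "ZRE_1": -3.59},   # outlier: large negative residual
    {"id": 3, "ZRE_1": None},    # missing data: no residual was computed
    {"id": 4, "ZRE_1": 2.10},
]

selected = select_non_outliers(cases)
print([case["id"] for case in selected])
```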
Slide 45
Running the model excluding outliers - 4
SPSS displays the condition we entered on the Select Cases dialog box.
Click on the OK button to close the dialog box.
Slide 46
Running the model excluding outliers - 5
SPSS indicates which cases are excluded by drawing a slash across the case number.
Scrolling down in the data, we see that the outliers and cases with missing values are excluded.
Slide 47
Running the model excluding outliers - 6
To run the logistic regression excluding outliers, select Logistic Regression from the Dialog Recall menu.
Slide 48
Running the model excluding outliers - 7
Click on the Save button to open the dialog box.
The only change we will make is to clear the check box for saving standardized residuals.
Slide 49
Running the model excluding outliers - 8
First, clear the check box for Standardized residuals.
Second, click on the Continue button to close the dialog box.
Slide 50
Running the model excluding outliers - 9
Finally, click on the OK button to request the output.
Slide 51
Accuracy rate of the baseline model including all cases
Navigate to the Classification Table for the logistic regression with all cases. To distinguish the two models, I often refer to the first one as the baseline model.
The accuracy rate for the model with all cases is 75.1%.
Slide 52
Accuracy rate of the revised model excluding outliers
Navigate to the Classification Table for the logistic regression excluding outliers. To distinguish the two models, I often refer to the second one as the revised model.
The accuracy rate for the model excluding outliers is 78.0%.
Slide 53
Marking the statement for excluding outliers
In the initial logistic regression model, 5 cases had a standardized residual of +2.58 or greater or -2.58 or lower:
- Case 20000032 had a standardized residual of -3.59
- Case 20000178 had a standardized residual of -5.83
- Case 20001092 had a standardized residual of -2.90
- Case 20001544 had a standardized residual of -4.16
- Case 20002344 had a standardized residual of -3.78
Since the classification accuracy of the model that excluded outliers (78.0%) was greater by 2% or more than the classification accuracy for the model that included all cases (75.1%), we mark the check box for the statement.
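The decision rule for choosing between the two models reduces to a single comparison. A minimal sketch with a hypothetical function name, using the two accuracy rates from this problem:

```python
def model_to_interpret(baseline_accuracy, revised_accuracy, threshold=2.0):
    """Pick the revised model only if it is at least `threshold` points more accurate."""
    improvement = revised_accuracy - baseline_accuracy
    return "revised" if improvement >= threshold else "baseline"

# Accuracy rates (in percent) for the models with and without outliers.
print(model_to_interpret(baseline_accuracy=75.1, revised_accuracy=78.0))
```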
All of the remaining statements will be evaluated based on the output for the model that excludes outliers.
Slide 54
The statement about multicollinearity and other numerical problems
Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, cells with a zero count for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation' whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.
Slide 55
Checking for multicollinearity
The standard errors for the variables included in the analysis were: the standard error for "highest year of school completed" [educ] was .11, the standard error for survey respondents who said that their health was poor was 1.44, the standard error for survey respondents who said that their health was fair was .62, the standard error for survey respondents who said that their health was good was .53, the standard error for "socioeconomic index" [sei] was .02 and the standard error for survey respondents who were male was .45.
Slide 56
Marking the statement about multicollinearity and other numerical problems
Since none of the independent variables in this analysis had a standard error larger than 2.0, we mark the check box to indicate there was no evidence of multicollinearity.
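The standard error check can be written as a short scan over the coefficients. The dictionary labels below are our own shorthand for the variables listed above; the constant is deliberately left out because the 2.0 rule does not apply to it.

```python
standard_errors = {
    "educ": 0.11,
    "health: poor": 1.44,
    "health: fair": 0.62,
    "health: good": 0.53,
    "sei": 0.02,
    "sex: male": 0.45,
}

def numerical_problems(errors, limit=2.0):
    """Return the variables whose standard error is larger than the limit."""
    return [name for name, se in errors.items() if se > limit]

flagged = numerical_problems(standard_errors)
print(flagged if flagged else "no evidence of numerical problems")
```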
Slide 57
The statement about sample size
Hosmer and Lemeshow, who wrote the widely used text on logistic regression, suggest that the sample size should be 10 cases for every independent variable.
Slide 58
The output for sample size
The 164 cases available for the analysis satisfied the sample size of 60 (10 cases per independent variable) recommended by Hosmer and Lemeshow for logistic regression.
We find the number of cases included in the analysis in the Case Processing Summary.
Slide 59
Marking the statement for sample size
Since we satisfy the sample size requirement, we mark the check box.
Slide 60
The overall relationship between the dependent and independent variables
The existence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the model chi-square for the model that includes all of the independent variables.
Slide 61
The output for the overall relationship
In this analysis, the test of the full model versus a model with intercept only was statistically significant, χ²(6, N = 164) = 88.44, p < .001. The null hypothesis that there is no difference between the model with only a constant and the model with independent variables was rejected.
The existence of a relationship between the independent variables and the dependent variable was supported.
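For readers who want to check the reported probability by hand, the sketch below computes the upper-tail probability of a chi-square value. It relies on a closed form that holds only when the degrees of freedom are even, which happens to be the case here (6); the function name is hypothetical, and a statistics library would normally be used instead.

```python
import math

def chi_square_p_value_even_df(chi_square, df):
    """Upper-tail chi-square probability; this closed form needs an even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = chi_square / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i      # builds (chi_square / 2) ** i / i!
        total += term
    return math.exp(-half) * total

p_value = chi_square_p_value_even_df(88.44, 6)
print(p_value < 0.001)
```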
Slide 62
Marking the statement for overall relationship
Since the overall relationship was statistically significant, we mark the check box.
Slide 63
The statement about the relationship between education and computer use
Having satisfied the criteria for an overall relationship, we examine the findings for individual relationships with the dependent variable. If the overall relationship were not significant, we would not interpret the individual relationships.
The first statement concerns the relationship between education and computer use.
Slide 64
Output for the relationship between education and computer use
The probability of the Wald statistic for the independent variable "highest year of school completed" [educ] (χ²(1, N = 164) = 11.49, p < .001) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for "highest year of school completed" [educ] was equal to zero was rejected. The value of Exp(B) for the variable "highest year of school completed" [educ] was 1.43 which implies an increase in the odds of 43.2% (1.43 - 1.0 = .43). The statement that 'For each unit increase in "highest year of school completed", survey respondents were 43.2% more likely to use a computer' is correct.
Slide 65
Marking the statement for the relationship between education and computer use
Since the relationship was statistically significant, and the odds ratio was correctly interpreted as an increase of 43.2%, we mark the check box for the statement.
Slide 66
The statement for the relationship between poor health and computer use
The next statement concerns the relationship between the dummy-coded variable for poor health and computer use.
Slide 67
Output for the relationship between poor health and computer use
The probability of the Wald statistic for the independent variable survey respondents who said that their health was poor (χ²(1, N = 164) = 8.20, p = .004) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said that their health was poor was equal to zero was rejected. The value of Exp(B) for the variable survey respondents who said that their health was poor was .016 which implies a decrease in the odds of 98.4% (.016 - 1.000 = -.984). The statement that 'Survey respondents who said that their health was poor were 98.4% less likely to use a computer compared to those who said that their health was excellent' is correct.
Slide 68
Marking the statement for the relationship between poor health and computer use
Since the relationship was statistically significant, and the odds ratio was correctly interpreted as a decrease of 98.4% compared to the reference group in excellent health, we mark the check box for the statement.
Slide 69
The statement for the relationship between fair health and computer use
The next statement concerns the relationship between the dummy-coded variable for fair health and computer use.
Slide 70
Output for the relationship between fair health and computer use
The probability of the Wald statistic for the independent variable survey respondents who said that their health was fair (χ²(1, N = 164) = 6.60, p = .010) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said that their health was fair was equal to zero was rejected. The value of Exp(B) for the variable survey respondents who said that their health was fair was .204 which implies a decrease in the odds of 79.6% (.204 - 1.000 = -.796). The statement that 'Survey respondents who said that their health was fair were 79.6% less likely to use a computer compared to those who said that their health was excellent' is correct.
Slide 71
Marking the statement for the relationship between fair health and computer use
Since the relationship was statistically significant, and the odds ratio was correctly interpreted as a decrease of 79.6% compared to the reference group in excellent health, we mark the check box for the statement.
Slide 72
The statement for the relationship between good health and computer use
The next statement concerns the relationship between the dummy-coded variable for good health and computer use.
Slide 73
Output for the relationship between good health and computer use
The probability of the Wald statistic for the independent variable survey respondents who said that their health was good (χ²(1, N = 164) = 1.53, p = .216) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said that their health was good was equal to zero was not rejected. Reporting good health does not have a statistically significant impact on the odds that survey respondents use a computer. The analysis does not support the relationship that 'Survey respondents who said that their health was good were 48.4% less likely to use a computer compared to those who said that their health was excellent'.
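The probabilities reported for the Wald statistics can be checked by hand. The sketch below assumes each Wald statistic is a chi-square with 1 degree of freedom, as in the output above, and uses the identity that the upper-tail probability for 1 degree of freedom equals erfc(sqrt(x / 2)). The function names are assumptions introduced for illustration.

```python
import math


def wald_p_value(chi_square: float) -> float:
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom.

    Only valid for df = 1, where P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2.0))


def is_significant(p: float, alpha: float = 0.05) -> bool:
    """Decision rule used in these slides: reject the null when p <= alpha."""
    return p <= alpha


# Wald chi-square values reported for the health dummy-coded variables
for label, chi_square in [("poor", 8.20), ("fair", 6.60), ("good", 1.53)]:
    p = wald_p_value(chi_square)
    print(f"{label} health: p = {p:.3f}, reject null = {is_significant(p)}")
```

Because the chi-square values in the output are rounded to two decimals, a probability recomputed this way can differ from the SPSS value in the third decimal place.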
Slide 74
Marking the statement for the relationship between good health and computer use
Since the relationship was not statistically significant, the check box is not marked.
Slide 75
The statement for the relationship between socioeconomic index and computer use
The next statements concern the relationship between socioeconomic index and computer use. We are offered two alternative interpretations of the direction of the relationship. If the relationship is not statistically significant, neither will be correct.
Slide 76
Output for the relationship between socioeconomic index and computer use
The probability of the Wald statistic for the independent variable "socioeconomic index" [sei] (χ²(1, N = 164) = 16.93, p < .001) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for "socioeconomic index" [sei] was equal to zero was rejected. The value of Exp(B) for the variable "socioeconomic index" [sei] was 1.070 which implies an increase in the odds of 7.0% (1.070 - 1.000 = .070). The statement that 'For each unit increase in "socioeconomic index", survey respondents were 7.0% more likely to use a computer' is correct.
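Because Exp(B) for a metric variable is a per-unit multiplier of the odds, a change of several units compounds rather than adds. The sketch below illustrates this under the assumption that the model is linear in the log odds, as logistic regression specifies. The 10-unit change is a hypothetical example, not part of the problem, and the function name is introduced here for illustration.

```python
def odds_multiplier(exp_b: float, units: float = 1.0) -> float:
    """Multiplier applied to the odds for a change of `units` in the predictor."""
    return exp_b ** units


# One unit of socioeconomic index: the 7.0% increase reported above
one_unit = odds_multiplier(1.070, 1)
print(f"1 unit:   {(one_unit - 1.0) * 100:.1f}% increase in the odds")

# Hypothetical 10-unit increase: 1.070 raised to the 10th power, not 10 x 7.0%
ten_units = odds_multiplier(1.070, 10)
print(f"10 units: {(ten_units - 1.0) * 100:.1f}% increase in the odds")
```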
Slide 77
Marking the relationship between socioeconomic index and computer use
Since the relationship was statistically significant and the odds ratio indicated an increase of 7.0%, the first statement is marked and the second is left blank.
Slide 78
The statement for the relationship between sex and computer use
The next statement concerns the relationship between sex and computer use.
Slide 79
Output for the relationship between sex and computer use
The probability of the Wald statistic for the independent variable survey respondents who were male (χ²(1, N = 164) = 2.10, p = .148) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who were male was equal to zero was not rejected. Being male does not have a statistically significant impact on the odds that survey respondents use a computer. The analysis does not support the relationship that 'Survey respondents who were male were 47.8% less likely to use a computer compared to those who were female'.
Slide 80
Marking the statement for the relationship between sex and computer use
Since the relationship was not statistically significant, the check box is not marked.
Slide 81
Statement about the usefulness of the model based on classification accuracy
The final statement concerns the usefulness of the logistic regression model. The independent variables could be characterized as useful predictors distinguishing survey respondents who use a computer from survey respondents who do not use a computer if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate.
Slide 82
Computing proportional by-chance accuracy rate
The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0, and then squaring and summing the proportion of cases in each group (.396² + .604² = .521).
The proportion in the largest group is 60.4% or .604. The proportion in the other group is 1.0 – 0.604 = .396.
At Block 0 with no independent variables in the model, all of the cases are predicted to be members of the modal group, 1=Yes in this example.
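The proportional by chance accuracy rate is the sum of the squared group proportions. A minimal sketch of that calculation, using the two proportions from the classification table above, might look like this (the function name is an assumption introduced here):

```python
def proportional_by_chance_accuracy(proportions):
    """Sum of squared group proportions.

    `proportions` holds the share of cases in each group of the dependent
    variable and is expected to sum to 1.
    """
    total = sum(proportions)
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"proportions should sum to 1, got {total}")
    return sum(p ** 2 for p in proportions)


# Group proportions from the Step 0 classification table
rate = proportional_by_chance_accuracy([0.396, 0.604])
print(f"proportional by chance accuracy rate = {rate:.3f}")
```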
Slide 83
Output for the usefulness of the model based on classification accuracy
To be characterized as a useful model, the accuracy rate should be at least 25% higher than the by chance accuracy rate.
The by chance accuracy criterion is computed by multiplying the by chance accuracy rate of .521 by 1.25, or 1.25 x .521 = .652 (65.2%) when the unrounded rate is used.
The classification accuracy rate computed by SPSS was 78.0%, which was greater than or equal to the proportional by chance accuracy criterion of 65.2% (1.25 x 52.1% = 65.2%).
The criterion for classification accuracy is satisfied.
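The comparison against the by chance accuracy criterion can be expressed as a short check. This is a sketch of the rule as stated in these slides; the function names and the `margin` parameter are assumptions introduced for illustration.

```python
def accuracy_criterion(chance_rate: float, margin: float = 1.25) -> float:
    """By chance accuracy criterion: the by chance rate increased by 25%."""
    return margin * chance_rate


def is_useful_model(accuracy: float, chance_rate: float, margin: float = 1.25) -> bool:
    """True when the classification accuracy meets or exceeds the criterion."""
    return accuracy >= accuracy_criterion(chance_rate, margin)


# By chance rate from the unrounded sum of squared group proportions
chance_rate = 0.396 ** 2 + 0.604 ** 2
criterion = accuracy_criterion(chance_rate)

print(f"criterion = {criterion:.3f}")
print(f"useful model: {is_useful_model(0.780, chance_rate)}")
```

Carrying the unrounded by chance rate into the multiplication is what produces 65.2%; multiplying the rounded value .521 by 1.25 gives 65.1% instead, which does not change the conclusion here.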
Slide 84
Marking the statement for usefulness of the model
Since the criterion for classification accuracy was satisfied, the check box is marked.
Slide 85
Standard Binary Logistic Regression: Level of Measurement
Level of measurement ok?
No: Do not mark check box for level of measurement. Mark: Inappropriate application of the statistic. Stop.
Yes: Mark check box for level of measurement.
Ordinal level variable treated as metric?
Yes: Consider limitation in discussion of findings.
No: No limitation to consider.
Slide 86
Standard Binary Logistic Regression: Exclude Outliers
Run Baseline Binary Logistic Regression, Including All Cases, Requesting Standardized Residuals
Run Revised Binary Logistic Regression, Excluding Outliers (Standardized Residuals >= 2.58)
Accuracy rate for revised model >= accuracy rate for baseline model + 2%?
Yes: Interpret revised model. Mark check box for excluding outliers.
No: Interpret baseline model. Do not mark check box for excluding outliers.
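The outlier decision in this flowchart can be sketched as two small functions. The sketch assumes the 2.58 cutoff applies to the absolute value of the standardized residual and that "+ 2%" means two percentage points. The function names and the accuracy values in the usage lines are hypothetical, not results from this problem.

```python
def flag_outliers(standardized_residuals, cutoff=2.58):
    """Flag cases whose standardized residual is at least `cutoff` in absolute value."""
    return [abs(r) >= cutoff for r in standardized_residuals]


def choose_model(baseline_accuracy, revised_accuracy, min_gain=0.02):
    """Return which model to interpret under the flowchart's rule.

    The revised model (outliers excluded) is interpreted only when its
    accuracy rate beats the baseline rate by at least `min_gain`.
    """
    if revised_accuracy >= baseline_accuracy + min_gain:
        return "revised"
    return "baseline"


# Hypothetical residuals and accuracy rates, for illustration only
print(flag_outliers([0.41, -3.02, 1.97]))
print(choose_model(baseline_accuracy=0.78, revised_accuracy=0.79))
```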
Slide 87
Standard Binary Logistic Regression: Multicollinearity and Sample Size
Multicollinearity/Numerical Problems (S.E. > 2.0)?
Yes: Do not mark check box for no multicollinearity. Stop.
No: Mark check box for no multicollinearity.
Adequate Sample Size (Number of IVs x 10)?
Yes: Mark check box for sample size.
No: Do not mark check box for sample size. Consider limitation in discussion of findings.
Slide 88
Standard Binary Logistic Regression: Overall Relationship
Probability of Model Chi-square ≤ α?
Yes: Mark check box for overall relationship.
No: Do not mark check box for overall relationship. Stop.
Slide 89
Standard Binary Logistic Regression: Individual Relationships
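Individual relationship (Wald Sig ≤ α)?
No: Do not mark check box for individual relationship.
Yes: Correct interpretation of direction and strength of relationship?
Yes: Mark check box for individual relationship.
No: Do not mark check box for individual relationship.
Additional individual relationships to interpret?
Yes: Repeat for the next individual relationship.
No: No further individual relationships to evaluate.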
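The rule for marking an individual relationship combines two conditions: the Wald test is significant, and the statement interprets the direction and strength of the relationship correctly. A minimal sketch, with a hypothetical function name:

```python
def mark_individual_relationship(wald_sig, interpretation_correct, alpha=0.05):
    """Return True when the check box for an individual relationship is marked.

    Both conditions from the flowchart must hold: Wald Sig <= alpha, and the
    statement's direction and strength match the value of Exp(B).
    """
    return wald_sig <= alpha and interpretation_correct


# Poor health: significant (p = .004) and correctly interpreted
print(mark_individual_relationship(0.004, True))

# Good health: not significant (p = .216), so the statement is not marked
print(mark_individual_relationship(0.216, True))
```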
Slide 90
Standard Binary Logistic Regression: Classification Accuracy
Classification accuracy > 1.25 x by chance accuracy rate?
Yes: Mark check box for classification accuracy.
No: Do not mark check box for classification accuracy.